Exploration of Machine Learning-Generated Spectral Libraries for Data
Independent Acquisition in Complex Ocean Metaproteomic Analyses
Abstract
Ocean metaproteomics provides valuable insights into the structure and
function of marine microbial communities. Yet ocean samples are
challenging to analyze because their extensive biological diversity
produces a very large number of peptides spanning a large dynamic range. This study
characterized the capabilities of data independent acquisition (DIA)
mode for ocean metaproteomic samples. Spectral libraries were
constructed from discovered peptides and proteins using machine learning
algorithms, to avoid incorporating false positives into the libraries.
When compared with 1-dimensional and 2-dimensional data dependent
acquisition (DDA) analyses, DIA outperformed DDA both with and without
gas phase fractionation. We found that larger spectral libraries of
discovered proteins performed better, regardless of the geographic
distance between the library-generation sampling sites and the sites
where the test samples were collected. Moreover, the spectral library
containing all unique proteins present in the Ocean Protein Portal
outperformed smaller libraries generated from individual sampling
campaigns. However, a spectral library constructed from all open reading
frames in a metagenome proved too large to be workable: it yielded few
peptide identifications because a low false discovery rate is difficult
to maintain with such a large database. Given
sufficient sequencing depth and validation studies, spectral libraries
generated from previously discovered proteins can serve as a community
resource, saving resequencing efforts. The spectral libraries generated
in this study are available at the Ocean Protein Portal for this
purpose.
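
The effect of database size on identifications at a fixed false discovery rate can be illustrated with a minimal sketch. It assumes a generic target-decoy estimate (FDR approximated as decoy matches divided by target matches above a score cutoff), which is a common approach but is not stated in this abstract to be the procedure used in the study. The function name, scores, and counts below are hypothetical toy values, not data from this work.

```python
from typing import List, Tuple


def count_accepted_targets(
    psms: List[Tuple[float, bool]], fdr_threshold: float = 0.01
) -> int:
    """Count target matches accepted at a given FDR (target-decoy estimate).

    psms: (score, is_decoy) pairs; a higher score means a better match.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    accepted = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all matches scoring at least this well.
        if targets > 0 and decoys / targets <= fdr_threshold:
            accepted = targets
    return accepted


# Toy data (hypothetical): the same 250 target matches in both cases.
# A larger search space is assumed to produce more high-scoring decoy
# (random) matches, which pushes the acceptance cutoff upward.
small_db_psms = (
    [(10.0, False)] * 100 + [(9.0, True)] * 1
    + [(8.0, False)] * 100 + [(4.0, False)] * 50
)
large_db_psms = (
    [(10.0, False)] * 100 + [(9.0, True)] * 4
    + [(8.0, False)] * 100 + [(4.0, False)] * 50
)

print(count_accepted_targets(small_db_psms))  # 250
print(count_accepted_targets(large_db_psms))  # 100
```

In this toy case, three additional high-scoring decoy matches are enough to cut the accepted targets from 250 to 100 at a 1% threshold, which is the kind of loss the abstract attributes to a very large database.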