2.3 | Mass Spectrometry
Mass spectrometry-based proteomics can detect the translation products
of sORFs directly in biological samples using either bottom-up (from
peptide fragments) or top-down (intact precursor) modalities. However,
specialized sample preparation and computational methods must be applied
for high-sensitivity detection of small, unannotated microproteins. For
example, a standard bottom-up proteomics experiment begins with
isolation of the proteome, during which small molecules and proteolytic
fragments are typically removed by SDS-PAGE or filter-aided sample
preparation. Furthermore, most peptide and protein identification from
proteomics data is accomplished via spectral matching against the
annotated proteome database. For these reasons, sORF-encoded
polypeptides are both depleted from proteomic samples and absent from
search databases, and therefore cannot be detected with standard
proteomic workflows and searches.
Multiple recent reviews and protocols describing microprotein
identification via proteomics are available, so we provide a brief
overview highlighting only the key concerns here. Microprotein discovery
methods are built on the same technologies used for standard shotgun
proteomics, with several modifications (Figure 3). Because
sORF-encoded microproteins are small, most are identified by only a
single proteotypic or fingerprint tryptic fragment in a typical
proteomics experiment. A major factor complicating detection of
microproteins is coelution and/or cofragmentation of the one or two
detectable tryptic peptides derived from a given microprotein with
abundant tryptic and/or proteolytic fragments of larger proteins.
Resulting ion suppression and/or complex spectra preclude detection
and/or identification of the microprotein fragment, regardless of its
abundance; this consideration is less severe for larger, canonical
proteins, which generate many tryptic peptides, so that detection of any
individual fragment is not required. Therefore, the first critical step
of any sORF proteomic experiment is to achieve proteome extraction in
the absence of proteolysis of canonical proteins (e.g., via boiling in
acidic solution or application of protease inhibitors) to minimize
sample complexity, followed by or concomitant with enrichment of the
small proteome and exclusion of large proteins. Small protein enrichment
can be achieved via multiple chemical and biophysical methods, such as
solid phase extraction, peptide gels, GELFrEE resolution, and organic
solvent or surfactant extraction. In head-to-head comparisons, these
methods have typically detected comparable numbers, but non-overlapping
sets, of microproteins. The size-selection approach for microprotein
proteomics can therefore be chosen according to the experimental goals:
for the
deepest coverage, multiple methods should be employed on replicate
samples and the results combined; for a rapid, robust and economical
approach, organic solvent extraction may prove attractive.
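
The single-peptide problem described above can be made concrete with a
minimal in silico sketch; it is not a validated digestion model. It
assumes the common simplification that trypsin cleaves C-terminal to K
or R except before P, and that only peptides of 7 to 30 residues are
readily identified. Both sequences, the length window, and the function
names are hypothetical and chosen only for illustration.

```python
def tryptic_digest(seq):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):  # C-terminal peptide that does not end in K/R
        peptides.append(seq[start:])
    return peptides


def detectable(peptides, min_len=7, max_len=30):
    """Keep peptides inside an assumed MS-friendly length window."""
    return [p for p in peptides if min_len <= len(p) <= max_len]


# Hypothetical sequences, for illustration only.
microprotein = "MSLLGKPTAVEELLNQWRGSGK"
canonical_pieces = [
    "MAEDLVNGTSK", "LLPEQAVNTR", "GDSFYEEALK", "TVSPLDNAGWR", "QEIHFTAPVK",
]
canonical = "".join(canonical_pieces)

micro_hits = detectable(tryptic_digest(microprotein))
canonical_hits = detectable(tryptic_digest(canonical))
print(len(micro_hits), len(canonical_hits))  # 1 5
```

In this toy case the microprotein yields a single peptide in the
assumed window, so losing that one peptide to ion suppression loses the
identification, whereas the larger protein tolerates the loss of any
individual fragment.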
Subsequent to small proteome isolation, most microprotein studies to
date have employed bottom-up proteomic analysis, in which microproteins
are enzymatically digested into peptide fragments (typically with
trypsin, though multienzyme digests have been shown to improve small
proteome coverage), followed by liquid chromatography-tandem mass
spectrometry, often with multi-dimensional separation. This experiment
provides thousands of raw peptide fragmentation spectra corresponding
both to known canonical small proteins and microproteins, which must
then be identified and distinguished. This is typically accomplished via
peptide-spectral matching against expanded databases comprising the
canonical proteome as well as candidate sORF sequences. For eukaryotes,
databases can be derived from three-frame transcriptome translations,
ribosome profiling-derived translatomes, or publicly available
noncanonical ORF databases such as OpenProt and sORFs.org; six-frame
genomic translation can be employed for prokaryotes. Peptide-spectral
matching against any of these databases affords identifications of both
canonical small proteins and unannotated microproteins. Discriminating
the false-positive identifications that arise from searching expanded
databases is critical. One important
consideration is use of a contaminants database to prevent aberrant
matching of artefactual peptides (e.g., fragments of trypsin or keratin
in dust) to sORF sequences. Another method commonly applied for this
purpose is application of a stringent false-discovery rate of less than
or equal to 1%, estimated by querying hits to a decoy database
constructed from reversed amino acid sequences of the search database
entries. However, the expansion of the decoy database also decreases
sensitivity for true positive matches, as documented in work from
Fournier and colleagues. An alternative approach is to employ
permissive false discovery rates, followed by either manual inspection
of fragmentation spectra or a secondary algorithm like PepQuery to
exclude false positive spectra better explained by peptides arising from
canonical, mutant or post-translationally modified proteins. After
exclusion of peptides matching (or near-matching) annotated proteins,
the resulting list of identifications represents candidate unannotated
microproteins, which can be computationally mapped to the sORFs that
encode them and experimentally validated.
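
The target-decoy estimate mentioned above can be sketched as follows.
This is a simplified illustration, not the procedure of any particular
search engine: it assumes each peptide-spectrum match (PSM) is reduced
to a score and a target/decoy flag, and it estimates the false-discovery
rate as the ratio of decoy to target hits above a score cutoff. The
function names, the `rev_` prefix, and the toy scores are hypothetical.

```python
def reversed_decoys(targets):
    """Build decoy entries by reversing each target amino acid sequence."""
    return {"rev_" + name: seq[::-1] for name, seq in targets.items()}


def score_cutoff_at_fdr(psms, max_fdr=0.01):
    """Return the lowest target score at which decoys/targets <= max_fdr.

    psms is a list of (score, is_decoy) pairs; higher scores are better.
    Returns None if no cutoff satisfies the requested rate.
    """
    n_target = n_decoy = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
            if n_decoy / n_target <= max_fdr:
                cutoff = score
    return cutoff


# Toy data (assumed): 200 target PSMs and 3 decoy PSMs.
psms = [(s, False) for s in range(201, 401)]
psms += [(250.5, True), (230.5, True), (210.5, True)]

cutoff = score_cutoff_at_fdr(psms, max_fdr=0.01)
accepted = [s for s, is_decoy in psms if not is_decoy and s >= cutoff]
print(cutoff, len(accepted))  # 231 170
```

Because every added database entry also adds a decoy, a larger search
space places more decoy hits among the high-scoring matches and pushes
the cutoff upward, which is the loss of sensitivity noted above.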
Mass spectrometry typically detects one to two orders of magnitude fewer
microproteins in a given experiment than ribosome profiling. This may be
due to the above-mentioned challenge of detecting single
microprotein-derived fingerprint peptides; the relative insensitivity of
mass spectrometry to some classes of microproteins, including
membrane-localized, positively charged, and low-abundance species; the
instability of some sORF translation products; reduced sensitivity for
true-positive detections as a result of expanded decoy databases applied
for stringent false discovery rate estimation; or all of these factors.
Nonetheless, mass spectrometry offers several advantages. First,
enrichment strategies, such as membrane fractionation and chemical
labeling, can enable identification of microproteins that are refractory
to shotgun analysis of whole-cell tryptic digests, thus beginning to
address one of the major limitations of microprotein proteomics while at
the same time affording functional information about microproteins
(e.g., chemical reactivity, subcellular localization) that is
inaccessible to sequencing methods. Second, without specialized analysis
pipelines, ribosome profiling with elongation inhibitors cannot
confidently detect sORFs that overlap canonical protein-coding
sequences in alternative reading frames, because ORF calling relies on
three-nucleotide periodicity. In contrast, mass
spectrometry can readily detect and identify microproteins derived from
overlapping ORFs, which can represent as much as 30% of microproteins
identified in a proteomic experiment. Given the complementary nature of
genomics, ribosome profiling and mass spectrometry, it is likely that
the combination of these methods offers the greatest power for
large-scale, high-confidence microprotein identification.
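
As an illustration of how an overlapping sORF enters a search database,
the three-frame transcript translation mentioned earlier can be sketched
as below. This is a minimal example under stated assumptions: the
transcript is a hypothetical toy sequence written in the DNA alphabet,
only ATG start codons and the standard genetic code are considered, and
only the forward strand is read (six-frame translation would also
process the reverse complement). The function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def orfs_in_three_frames(seq, min_aa=2):
    """Return (frame, start, end, protein) for ATG-to-stop ORFs.

    start and end are 0-based nucleotide coordinates; end is exclusive
    and includes the stop codon. The first ATG after a stop is used.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif CODON_TABLE[codon] == "*":
                protein = "".join(
                    CODON_TABLE[seq[j:j + 3]] for j in range(start, i, 3)
                )
                if len(protein) >= min_aa:
                    found.append((frame, start, i + 3, protein))
                start = None
    return found


# Hypothetical transcript: a main ORF in frame 0 and an overlapping
# sORF in frame 1.
transcript = "ATGAATGGCTCTGACCTAAGGTAA"
orfs = orfs_in_three_frames(transcript)
for orf in orfs:
    print(orf)
# (0, 0, 24, 'MNGSDLR')
# (1, 4, 19, 'MALT')
```

The second ORF lies entirely within the first but in a different
reading frame, so its peptides share no sequence with the main product
and can be matched independently once both entries are in the database.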