A compositional data model to predict the isotope distribution for
average peptides using a compositional spline model.
Abstract
We propose an updated approach for approximating the isotope
distribution of average peptides given their monoisotopic mass. Our
methodology involves in-silico cleavage of all reviewed human proteins
in the UniProt database using trypsin, generating a theoretical
peptide dataset. The isotope distributions are computed using BRAIN. We
apply a compositional data modelling strategy that utilizes an additive
log-ratio transformation for the isotope probabilities followed by a
penalized spline regression. Furthermore, because the number of
Sulphur atoms affects the shape of the isotope distribution, we
develop separate models for peptides containing zero to five Sulphur
atoms. Additionally, we propose three methods to estimate the number of
Sulphur atoms based on an observed isotope distribution. The performance
of the spline models and the Sulphur prediction approaches is evaluated
using the mean squared error and a modified Pearson’s χ² goodness-of-fit
measure on an experimental UPS2 data set. Our analysis reveals that the
variability in spectral accuracy contributes more to the errors than the
approximation of the theoretical isotope distribution by our proposed
average peptide model. Moreover, we find that the accuracy of predicting
the number of Sulphur atoms based on the observed isotope distribution
is limited by measurement accuracy.
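
As a minimal sketch of the additive log-ratio (ALR) transformation mentioned above, the following Python snippet maps a vector of isotope probabilities to unconstrained coordinates and back. The choice of the last isotope peak as the reference part, the function names, and the probability values are assumptions made for illustration only; they are not taken from the model described here.

```python
import math

def alr(probs):
    """Additive log-ratio transform: log of each part relative to the last part.

    Assumption: the last component is used as the reference part.
    """
    ref = probs[-1]
    return [math.log(p / ref) for p in probs[:-1]]

def alr_inverse(coords):
    """Map ALR coordinates back to probabilities that sum to one."""
    exps = [math.exp(c) for c in coords] + [1.0]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical isotope probabilities (illustrative values only).
probs = [0.55, 0.30, 0.11, 0.04]
coords = alr(probs)              # three unconstrained real-valued coordinates
recovered = alr_inverse(coords)  # back on the simplex, sums to one
```

Because the ALR coordinates are unconstrained real numbers, a regression on monoisotopic mass (such as a penalized spline) can be fitted to each coordinate, and the inverse transform ensures that the predicted probabilities are positive and sum to one.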