Abstract
How can we use our current wealth of terrestrial data, encompassing
biogenic and abiogenic systems, to determine the distinguishing
properties of life? SCOBI (Statistical Classification of Biosignature
Information) uses machine learning techniques to algorithmically
identify combinations of measurements that are “indicative of life”. A
set of ~1000 observations, comprising elemental abundance, isotopic
fractionation, VNIR reflectance, and (in progress) Raman spectra,
has been assembled from existing literature and
databases. The observations cover systems classified as “indicative
alive” (e.g., cells, vegetation), “indicative non-alive” (e.g.,
fossils, teeth), “mixed indicative” (e.g., soil, pond water), or
“non-indicative” (e.g., rocks, meteorites). VNIR data were preprocessed
by linear interpolation over the 400-2100 nm range and then smoothed
with a Savitzky-Golay filter. To limit the amount of
Earth-biochemistry-specific (non-agnostic) information included, the
first five spectral features extracted were number of peaks, number of
troughs, mean reflectance, mean peak width, and broadest peak width. To
help further emphasize agnostic biosignatures, Earth-specific features
such as chlorophylls have been manually flagged so that feature
importance with and without them can be compared. Classifiers including
k-nearest neighbors (KNN), Gaussian Naïve Bayes (GNB), logistic
regression (LR), random forest (RF), and support vector machine (SVM)
were implemented, as was a combination voting classifier. Performance
metrics included false positive rate, false negative rate, and area
under the ROC curve (AUC), evaluated over repeated random 50-50
train/test splits (Monte Carlo simulations). Key takeaways
from this stage, prior to the inclusion of Raman spectra, are (1) the
overall performance of 0.933 AUC was most heavily influenced by the
elemental abundance data; and (2) VNIR reflectance gave the weakest
classification performance, at 0.52 AUC (58% of objects correctly
classified). The next steps are to complete integration of Raman
spectral data and to improve preprocessing and feature extraction for
both types of spectral data, for example through automated baseline
removal, whole-spectrum matching, and dimensionality reduction.
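
The VNIR preprocessing and the five shape-based features described above can be sketched roughly as follows. This is an illustrative sketch, not the SCOBI implementation: the function name `vnir_features`, the 1 nm grid spacing, the Savitzky-Golay window and polynomial order, the prominence threshold, and the half-prominence definition of peak width are all assumptions that the abstract does not specify.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths, savgol_filter


def vnir_features(wavelength_nm, reflectance, grid_step_nm=1.0,
                  window=11, polyorder=3, prominence=0.01):
    """Return five shape-based features for one VNIR reflectance spectrum.

    All default parameter values are assumed for illustration only.
    """
    wavelength_nm = np.asarray(wavelength_nm, dtype=float)
    reflectance = np.asarray(reflectance, dtype=float)

    # np.interp needs increasing x values, so sort by wavelength first.
    order = np.argsort(wavelength_nm)
    n_points = int(round((2100.0 - 400.0) / grid_step_nm)) + 1
    grid = np.linspace(400.0, 2100.0, n_points)
    step = grid[1] - grid[0]

    # Linear interpolation onto a common 400-2100 nm grid.
    resampled = np.interp(grid, wavelength_nm[order], reflectance[order])

    # Savitzky-Golay smoothing of the resampled spectrum.
    smooth = savgol_filter(resampled, window_length=window, polyorder=polyorder)

    # Peaks are local maxima; troughs are maxima of the negated spectrum.
    peaks, _ = find_peaks(smooth, prominence=prominence)
    troughs, _ = find_peaks(-smooth, prominence=prominence)

    # Peak widths at half prominence, converted from samples to nanometres.
    if peaks.size:
        widths_nm = peak_widths(smooth, peaks, rel_height=0.5)[0] * step
    else:
        widths_nm = np.array([0.0])

    return {
        "n_peaks": int(peaks.size),
        "n_troughs": int(troughs.size),
        "mean_reflectance": float(smooth.mean()),
        "mean_peak_width_nm": float(widths_nm.mean()),
        "broadest_peak_width_nm": float(widths_nm.max()),
    }
```

None of these features refers to the position of a specific absorption band, which is in keeping with the stated aim of limiting Earth-biochemistry-specific information.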
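
The classifier comparison and the Monte Carlo evaluation with 50-50 splits could look something like the sketch below, written with scikit-learn. Several simplifications are assumed here and are not taken from the abstract: the four categories are reduced to a binary label (1 for "indicative", 0 for "non-indicative"), the voting classifier uses soft voting, the splits are stratified, and all hyperparameters are library defaults. The names `build_voting_classifier` and `monte_carlo_evaluate` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_voting_classifier(seed=0):
    """Combine KNN, GNB, LR, RF and SVM by soft voting (an assumed choice)."""
    members = [
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("gnb", GaussianNB()),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=seed)),
        ("svm", make_pipeline(StandardScaler(),
                              SVC(probability=True, random_state=seed))),
    ]
    return VotingClassifier(estimators=members, voting="soft")


def monte_carlo_evaluate(X, y, n_splits=20, seed=0):
    """Average AUC, false positive rate and false negative rate over
    repeated random 50-50 train/test splits of a binary-labelled data set."""
    X, y = np.asarray(X), np.asarray(y)
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.5,
                                      random_state=seed)
    aucs, fprs, fnrs = [], [], []
    for train_idx, test_idx in splitter.split(X, y):
        model = build_voting_classifier(seed).fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        predicted = model.predict(X[test_idx])
        tn, fp, fn, tp = confusion_matrix(y[test_idx], predicted,
                                          labels=[0, 1]).ravel()
        aucs.append(roc_auc_score(y[test_idx], scores))
        fprs.append(fp / (fp + tn))
        fnrs.append(fn / (fn + tp))
    return {
        "auc": float(np.mean(aucs)),
        "fpr": float(np.mean(fprs)),
        "fnr": float(np.mean(fnrs)),
    }
```

Running the same routine on feature subsets (for example, elemental abundance only or VNIR features only) is one way to compare how much each data type contributes to the overall AUC.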