Jesse Murray - ESS Open Archive

How can we use our current wealth of terrestrial data, encompassing biogenic and abiogenic systems, to determine the distinguishing properties of life? SCOBI (Statistical Classification of Biosignature Information) uses machine learning techniques to algorithmically identify combinations of measurements that are “indicative of life”. A set of ~1000 observations, comprising elemental abundance, isotopic fractionation, VNIR reflectance, and (in progress) Raman spectra, has been assembled from existing literature and databases. The observations cover systems classified as “indicative alive” (e.g., cells, vegetation), “indicative non-alive” (e.g., fossils, teeth), “mixed indicative” (e.g., soil, pond water), or “non-indicative” (e.g., rocks, meteorites). VNIR data were preprocessed by linear interpolation over 400-2100 nm and smoothed with a Savitzky-Golay filter. To limit the amount of Earth-biochemistry-specific (non-agnostic) information included, the first five spectral features extracted were number of peaks, number of troughs, mean reflectance, mean peak width, and broadest peak width. To further emphasize agnostic biosignatures, Earth-specific features such as chlorophylls have been manually flagged so that feature importance with and without them can be compared. Classifiers including k-nearest neighbors (KNN), Gaussian Naïve Bayes (GNB), logistic regression (LR), random forest (RF), and support vector machine (SVM) were implemented, as was a combination voting classifier. Performance metrics included false positive rate, false negative rate, and AUC, computed over repeated 50-50 train/test splits (Monte Carlo simulations). Key takeaways from this stage, prior to the inclusion of Raman spectra, are (1) the overall AUC of 0.933 was most heavily influenced by the elemental abundance data; and (2) VNIR reflectance had the lowest classification performance, with 0.52 AUC (58% of objects correctly classified).
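The classifier comparison described above can be illustrated with a minimal scikit-learn sketch. It is not the SCOBI code: the binary labels, the synthetic data, the hyperparameters, the soft-voting scheme, the stratified splitting, and the names `build_classifiers` and `monte_carlo_evaluate` are all assumptions made for illustration. Only the list of classifiers, the 50-50 splits, and the three metrics come from the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_classifiers(seed=0):
    """Return the five base classifiers plus a voting ensemble (hyperparameters assumed)."""
    base = {
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        "GNB": GaussianNB(),
        "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "RF": RandomForestClassifier(n_estimators=50, random_state=seed),
        "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=seed)),
    }
    models = dict(base)
    # Soft voting averages predicted probabilities; the abstract does not say
    # which voting rule was used, so this choice is an assumption.
    models["Voting"] = VotingClassifier(estimators=list(base.items()), voting="soft")
    return models


def monte_carlo_evaluate(X, y, n_repeats=10, seed=0):
    """Average AUC, false positive rate, and false negative rate over random 50-50 splits."""
    rng = np.random.RandomState(seed)
    scores = {}
    for _ in range(n_repeats):
        split_seed = int(rng.randint(0, 2**31 - 1))
        # Stratification keeps both classes in each half (an assumption).
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, stratify=y, random_state=split_seed
        )
        for name, model in build_classifiers(split_seed).items():
            model.fit(X_tr, y_tr)
            prob = model.predict_proba(X_te)[:, 1]
            tn, fp, fn, tp = confusion_matrix(
                y_te, model.predict(X_te), labels=[0, 1]
            ).ravel()
            scores.setdefault(name, []).append(
                (roc_auc_score(y_te, prob), fp / (fp + tn), fn / (fn + tp))
            )
    return {
        name: dict(zip(("auc", "fpr", "fnr"), np.mean(vals, axis=0)))
        for name, vals in scores.items()
    }


if __name__ == "__main__":
    # Synthetic stand-in for the real feature table (label 1 = "indicative of life").
    X_demo, y_demo = make_classification(
        n_samples=200, n_features=8, n_informative=4, random_state=0
    )
    for name, metrics in monte_carlo_evaluate(X_demo, y_demo, n_repeats=3).items():
        print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

Averaging over many random splits, rather than relying on a single split, reduces the dependence of the reported rates on which observations happen to land in the test half. The four-class labelling in the abstract would need either a multiclass AUC or a reduction to binary labels, which is not shown here.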
The next steps are to complete integration of the Raman spectral data and to improve preprocessing and feature extraction for both types of spectral data, for example through automated baseline removal, whole-spectrum matching, and dimensionality reduction.
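As a rough illustration of the spectral steps mentioned in this abstract, the sketch below resamples a VNIR spectrum onto a 400-2100 nm grid, smooths it with a Savitzky-Golay filter, and computes the five summary features. It also shows PCA as one possible form of dimensionality reduction. The 1 nm grid spacing, filter window, polynomial order, prominence threshold, half-prominence width definition, the choice of PCA, and all function names are assumptions, not details taken from SCOBI.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths, savgol_filter
from sklearn.decomposition import PCA

# Common wavelength grid; the 1 nm spacing is an assumption.
GRID_NM = np.arange(400.0, 2101.0, 1.0)


def preprocess(wavelength_nm, reflectance, window=31, polyorder=3):
    """Linearly interpolate onto GRID_NM, then smooth with a Savitzky-Golay filter.

    wavelength_nm must be sorted in increasing order (required by np.interp).
    """
    resampled = np.interp(GRID_NM, wavelength_nm, reflectance)
    return savgol_filter(resampled, window_length=window, polyorder=polyorder)


def summary_features(spectrum, prominence=0.01):
    """Compute the five summary features from a preprocessed spectrum."""
    peaks, _ = find_peaks(spectrum, prominence=prominence)
    troughs, _ = find_peaks(-spectrum, prominence=prominence)
    # peak_widths returns widths in samples, measured here at half prominence.
    if len(peaks):
        widths = peak_widths(spectrum, peaks, rel_height=0.5)[0]
    else:
        widths = np.array([])
    step = GRID_NM[1] - GRID_NM[0]  # nm per sample
    return {
        "n_peaks": len(peaks),
        "n_troughs": len(troughs),
        "mean_reflectance": float(np.mean(spectrum)),
        "mean_peak_width_nm": float(widths.mean() * step) if widths.size else 0.0,
        "broadest_peak_width_nm": float(widths.max() * step) if widths.size else 0.0,
    }


def reduce_spectra(spectra, n_components=3):
    """Project a stack of preprocessed spectra onto their leading principal components."""
    return PCA(n_components=n_components).fit_transform(np.asarray(spectra))


if __name__ == "__main__":
    # Synthetic spectrum with two Gaussian reflectance features on a flat baseline.
    w = np.linspace(350.0, 2500.0, 500)
    r = 0.3 + 0.2 * np.exp(-((w - 800.0) / 60.0) ** 2) \
            + 0.15 * np.exp(-((w - 1600.0) / 100.0) ** 2)
    print(summary_features(preprocess(w, r)))
```

The summary features discard the positions of absorption bands, which is what keeps them relatively agnostic. A whole-spectrum reduction such as PCA retains more information but may reintroduce Earth-specific structure, so the two approaches involve a trade-off rather than one strictly replacing the other.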