
Classifying Agnostic Biosignatures using Raman, VNIR, and Elemental Data
  • Aarya Mishra,
  • Tao Sheng,
  • Aivaras Vilutis,
  • Paxton Tomko,
  • Michael Furlong,
  • Jesse Murray,
  • Sunanda Sharma,
  • Diana Gentry
Aarya Mishra
University of San Francisco

Corresponding Author:[email protected]

Tao Sheng
University of Pittsburgh
Aivaras Vilutis
NASA Ames Research Center
Paxton Tomko
Purdue University
Michael Furlong
NASA Ames Research Center
Jesse Murray
NASA Ames Research Center
Sunanda Sharma
Massachusetts Institute of Technology
Diana Gentry
NASA Ames Research Center

Abstract

How can we use our current wealth of terrestrial data, encompassing biogenic and abiogenic systems, to determine the distinguishing properties of life? SCOBI (Statistical Classification of Biosignature Information) uses machine learning techniques to algorithmically identify combinations of measurements that are “indicative of life”. A set of ~1000 observations, comprising elemental abundance, isotopic fractionation, visible and near-infrared (VNIR) reflectance, and (in progress) Raman spectra, has been assembled from existing literature and databases. The observations cover systems classified as “indicative alive” (e.g., cells, vegetation), “indicative non-alive” (e.g., fossils, teeth), “mixed indicative” (e.g., soil, pond water), or “non-indicative” (e.g., rocks, meteorites). VNIR data were preprocessed by linear interpolation over the 400-2100 nm range and smoothed with a Savitzky-Golay filter. To limit the amount of Earth-biochemistry-specific (non-agnostic) information included, the first five spectral features extracted were number of peaks, number of troughs, mean reflectance, mean peak width, and broadest peak width. To further emphasize agnostic biosignatures, Earth-specific features such as chlorophylls have been manually flagged so that feature importance with and without them can be compared. Classifiers including k-nearest neighbors (KNN), Gaussian Naïve Bayes (GNB), logistic regression (LR), random forest (RF), and support vector machine (SVM) were implemented, as was a combination voting classifier. Performance metrics included false positive rate, false negative rate, and AUC, evaluated over repeated 50-50 test/train splits (Monte Carlo simulations). Key takeaways from this stage, prior to the inclusion of Raman spectra, are (1) the overall performance of 0.933 AUC was most heavily influenced by the elemental abundance data; and (2) VNIR reflectance had the lowest classification performance, with 0.52 AUC (58% of objects correctly classified).
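As a rough illustration of the VNIR preprocessing and the five shape-based features listed above, the following is a minimal sketch and not the SCOBI implementation. The 1 nm resampling grid, the Savitzky-Golay window and polynomial order, the peak prominence threshold, the half-prominence width definition, and the function name `vnir_shape_features` are all assumptions made for the example; the abstract does not specify them.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths, savgol_filter

# Assumed common grid: 400-2100 nm in 1 nm steps (step size is not given in the abstract).
GRID_NM = np.arange(400.0, 2101.0, 1.0)


def vnir_shape_features(wavelength_nm, reflectance,
                        window_length=11, polyorder=3, prominence=0.01):
    """Hypothetical helper: compute five shape-based features from one spectrum."""
    wavelength_nm = np.asarray(wavelength_nm, dtype=float)
    reflectance = np.asarray(reflectance, dtype=float)

    # np.interp needs increasing x values; values outside the measured
    # range are held at the nearest endpoint.
    order = np.argsort(wavelength_nm)
    resampled = np.interp(GRID_NM, wavelength_nm[order], reflectance[order])

    # Savitzky-Golay smoothing (window and order are assumed values).
    smooth = savgol_filter(resampled, window_length, polyorder)

    # Peaks are local maxima; troughs are local maxima of the negated signal.
    peaks, _ = find_peaks(smooth, prominence=prominence)
    troughs, _ = find_peaks(-smooth, prominence=prominence)

    step_nm = GRID_NM[1] - GRID_NM[0]
    if peaks.size > 0:
        # Widths measured at half of each peak's prominence, converted to nm.
        widths_nm = peak_widths(smooth, peaks, rel_height=0.5)[0] * step_nm
        mean_width = float(widths_nm.mean())
        broadest_width = float(widths_nm.max())
    else:
        mean_width = broadest_width = 0.0

    return {
        "n_peaks": int(peaks.size),
        "n_troughs": int(troughs.size),
        "mean_reflectance": float(smooth.mean()),
        "mean_peak_width_nm": mean_width,
        "broadest_peak_width_nm": broadest_width,
    }
```

Because these features describe only the overall shape of the spectrum, they carry no band positions tied to specific terrestrial pigments, which is the sense in which they are intended to be agnostic.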
The next steps are to complete integration of the Raman spectral data and to improve preprocessing and feature extraction for both types of spectral data, for example through automated baseline removal, whole-spectrum matching, and dimensionality reduction.
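The classifier comparison described in the abstract (KNN, GNB, LR, RF, SVM, and a voting ensemble, scored over repeated 50-50 splits) could be set up along the following lines with scikit-learn. This is a sketch under stated assumptions, not the authors' code: it treats the task as binary (1 = indicative of life, 0 = otherwise) so that false positive and false negative rates are well defined, uses soft voting, uses default hyperparameters with standard scaling for the distance-based models, and stratifies the random splits. The helper names `build_classifiers` and `monte_carlo_evaluate` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_classifiers(seed=0):
    """Hypothetical helper: the five base models plus a soft-voting ensemble."""
    base = {
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        "GNB": GaussianNB(),
        "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        # probability=True is required for predict_proba and soft voting.
        "SVM": make_pipeline(StandardScaler(),
                             SVC(probability=True, random_state=seed)),
    }
    models = dict(base)
    # Assumption: soft voting; the abstract does not say hard or soft.
    models["Voting"] = VotingClassifier(estimators=list(base.items()), voting="soft")
    return models


def monte_carlo_evaluate(model, X, y, n_splits=20, seed=0):
    """Mean AUC, false positive rate, and false negative rate over random 50-50 splits."""
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.5,
                                      random_state=seed)
    aucs, fprs, fnrs = [], [], []
    for train_idx, test_idx in splitter.split(X, y):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        prob = fitted.predict_proba(X[test_idx])[:, 1]
        pred = fitted.predict(X[test_idx])
        tn, fp, fn, tp = confusion_matrix(y[test_idx], pred, labels=[0, 1]).ravel()
        aucs.append(roc_auc_score(y[test_idx], prob))
        fprs.append(fp / (fp + tn))
        fnrs.append(fn / (fn + tp))
    return {"auc": float(np.mean(aucs)),
            "fpr": float(np.mean(fprs)),
            "fnr": float(np.mean(fnrs))}


if __name__ == "__main__":
    # Placeholder synthetic data only; this is NOT the SCOBI data set.
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                               class_sep=2.0, random_state=0)
    for name, model in build_classifiers().items():
        print(name, monte_carlo_evaluate(model, X, y, n_splits=5))
```

Running the same evaluation on feature subsets (for example, elemental abundance only versus VNIR features only) would be one way to compare how much each data type contributes, in the spirit of the takeaways reported above.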