Interpretable Machine Learning Biosignature Detection from Ocean Worlds Analogue CO2 Isotopologue Data
  • Lily A Clough,
  • Victoria Da Poian,
  • Jonathan D Major,
  • Lauren M Seyler,
  • Brett A McKinney,
  • Bethany P Theiling
Lily A Clough
Tandy School of Computer Science, The University of Tulsa, Aurora Engineering, Planetary Environments Laboratory, NASA Goddard Space Flight Center
Victoria Da Poian
Planetary Environments Laboratory, NASA Goddard Space Flight Center, Earth and Planetary Science, Johns Hopkins University, Microtel LLC
Jonathan D Major
School of Geosciences, University of South Florida
Lauren M Seyler
School of Natural Sciences and Mathematics, Stockton University
Brett A McKinney
Tandy School of Computer Science, The University of Tulsa
Bethany P Theiling
Planetary Environments Laboratory, NASA Goddard Space Flight Center

Abstract

Future missions to icy ocean worlds (OW) such as Europa and Enceladus will evaluate the habitability of these worlds and search them for potential biosignatures. These missions will benefit substantially from autonomous science methods that process high volumes of collected data and prioritize detections of signals of interest for the first available downlink. Mass spectrometers (MS) are prime candidates for implementing science autonomy because of their large data volumes and complex spectral data products, and they are powerful tools for proposed biosignature detection. Light stable isotopes are considered strong candidate biosignatures because biological activity promotes large fractionations. However, biogenic isotope fractionations may be obscured by variations in biological productivity and by complex abiotic geological processes. Machine learning (ML) may disentangle biotic from abiotic MS data to accurately identify biosignatures; however, ML model predictions can be opaque to humans, which compromises trust in scientifically significant detections. We develop and test a biosignature detection ML model using a feature selection method based on nearest-neighbors projected distance regression (NPDR) that identifies important predictors through statistical interactions and provides mathematical and geochemical context for biosignatures. Our Random Forest (RF) biosignature model is trained on benchmark CO2 isotopologue measurements of laboratory-generated analogue OW seawaters. The model predicts the presence of microbes with 87.3% mean accuracy regardless of the salt compositions of the analogue seawaters. We further interpret the biosignature models using network visualization of statistical interactions between features to understand the statistical mechanisms of biosignature prediction, and we evaluate false prediction probabilities using single-sample importance.
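To make the idea of nearest-neighbor feature scoring concrete, the following is a minimal sketch and not the authors' NPDR implementation. It substitutes a simpler Relief-style score: for each feature, the mean per-feature (projected) distance between neighboring samples of different classes minus the same quantity for neighbors of the same class. NPDR proper instead regresses the pairwise class difference on each projected distance. The data are synthetic toy values, and the names `neighbor_pairs` and `projected_distance_scores` are hypothetical, introduced only for illustration.

```python
import random


def neighbor_pairs(X, k):
    """Return (i, j) pairs where j is one of i's k nearest neighbors (Manhattan distance)."""
    pairs = []
    for i, xi in enumerate(X):
        dists = sorted(
            (sum(abs(a - b) for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X)
            if j != i
        )
        pairs.extend((i, j) for _, j in dists[:k])
    return pairs


def projected_distance_scores(X, y, k=5):
    """Relief-style score per feature: mean distance over 'miss' pairs minus 'hit' pairs."""
    pairs = neighbor_pairs(X, k)
    scores = []
    for f in range(len(X[0])):
        # Projected distance: absolute difference along a single feature.
        hit = [abs(X[i][f] - X[j][f]) for i, j in pairs if y[i] == y[j]]
        miss = [abs(X[i][f] - X[j][f]) for i, j in pairs if y[i] != y[j]]
        mean_hit = sum(hit) / len(hit) if hit else 0.0
        mean_miss = sum(miss) / len(miss) if miss else 0.0
        scores.append(mean_miss - mean_hit)
    return scores


# Synthetic toy data (not from the paper): feature 0 tracks the class label,
# features 1 and 2 are pure noise.
rng = random.Random(0)
y = [i % 2 for i in range(60)]
X = [[label + rng.gauss(0, 0.3), rng.gauss(0, 1.0), rng.gauss(0, 1.0)] for label in y]

scores = projected_distance_scores(X, y, k=5)
print(scores)
```

In this toy setup the informative feature should receive the largest score, since neighbors from different classes differ most along it. A ranking of this kind could then be used to choose inputs for a classifier such as a Random Forest.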
12 Aug 2024: Submitted to ESS Open Archive
12 Aug 2024: Published in ESS Open Archive