Interpretable Machine Learning Biosignature Detection from Ocean Worlds Analogue CO2 Isotopologue Data
Abstract
Future missions to icy ocean worlds (OW) such as Europa and Enceladus will evaluate the habitability of these worlds and search for potential biosignatures. These missions will benefit substantially from autonomous science methods that process high volumes of collected data and prioritize detections of signals of interest for the first available downlink. Mass spectrometers (MS) are prime candidates for implementing science autonomy because of their large data volumes and complex spectral data products, and they are powerful tools for proposed biosignature detection. Light stable isotopes are considered strong candidate biosignatures because of the large fractionations promoted by biological activity. However, biogenic isotope fractionations may be obscured by variations in biological productivity and by complex abiotic geological processes. Machine learning (ML) may disentangle biotic from abiotic signals in MS data to accurately identify biosignatures; however, ML model predictions can be opaque to humans, which compromises trust in scientifically significant detections. We develop and test a biosignature detection ML model using a feature selection method based on nearest-neighbors projected distance regression (NPDR), which identifies important predictors through statistical interactions and provides mathematical and geochemical context for biosignatures. Our Random Forest (RF) biosignature model is trained on benchmark CO2 isotopologue measurements of laboratory-generated analogue OW seawaters. The model predicts the presence of microbes with 87.3% mean accuracy regardless of the salt composition of the analogue seawaters. We further interpret the biosignature models using network visualization of statistical interactions between features to understand the statistical mechanisms of biosignature prediction, and we evaluate false prediction probabilities using single-sample importance.