Machine Learning for Outlier Detection in Algal and Cyanobacterial
Fluorescence Signals
Abstract
Many drinking water utilities drawing from waters susceptible to harmful
algal blooms (HABs) are implementing monitoring tools that can alert
them to the onset of potential blooms. Some have invested in
fluorescence-based online monitoring probes to measure chlorophyll a and
phycocyanin, two pigments found in cyanobacteria, but it is not clear
how best to use the resulting data. Previous studies have
focused on correlating phycocyanin fluorescence and cyanobacteria cell
counts. However, not all utilities collect cell count data, making this
method impossible to apply in some cases. Instead, this paper proposes a
novel machine learning approach for determining when a utility needs to
respond to an HAB: identifying outliers in chlorophyll a and phycocyanin
fluorescence data, without the need for corresponding cell counts or
biovolume. Four existing algorithms are evaluated on data collected at
four buoys in Lake Erie from 2014 to 2019: k-means clustering,
One-Class Support Vector Machine (SVM), elliptic envelope, and Isolation
Forest (iForest). When trained and tested on data collected at different
buoys, the iForest algorithm performed best in terms of training
computation time and true positive rate, and second best in terms of
false positive rate. In a more realistic application where the
algorithms are
trained on historical phycocyanin data collected at the same location as
the testing data, all of the algorithms except k-means accurately
identified anomalies in phycocyanin data that coincided with real
cyanobacteria bloom events. Therefore, One-Class SVM, elliptic envelope,
and iForest are promising algorithms for detecting potential HABs using
fluorescence data.
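
To make the outlier-detection idea concrete, the following is a minimal
sketch of an Isolation Forest applied to a single phycocyanin channel. It
is not the implementation evaluated in this paper. The training values are
synthetic, the units and magnitudes (a baseline near 1.0 and a bloom-like
reading of 8.0) are hypothetical, and the settings of 100 trees and a
subsample of 256 points are assumed here as commonly used defaults. All
function names are illustrative.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(values, depth, max_depth, rng):
    """Recursively split 1-D values at random cut points until isolated."""
    if depth >= max_depth or len(values) <= 1:
        return ("leaf", len(values))
    lo, hi = min(values), max(values)
    if lo == hi:  # identical readings cannot be split further
        return ("leaf", len(values))
    split = rng.uniform(lo, hi)
    left = [v for v in values if v < split]
    right = [v for v in values if v >= split]
    return ("node", split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, x, depth=0):
    """Depth at which x lands in a leaf, adjusted for unsplit leaf size."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, split, left, right = tree
    return path_length(left if x < split else right, x, depth + 1)


def fit_forest(train, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees; assumes at least two training points."""
    rng = random.Random(seed)
    size = min(sample_size, len(train))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(train, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size


def anomaly_score(forest, x):
    """Score in (0, 1); values closer to 1 indicate easier isolation (outliers)."""
    trees, size = forest
    mean_path = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(size))


# Synthetic "historical" phycocyanin readings (hypothetical units and values).
data_rng = random.Random(42)
historical = [max(0.0, data_rng.gauss(1.0, 0.2)) for _ in range(500)]

forest = fit_forest(historical)
baseline_score = anomaly_score(forest, 1.0)  # typical non-bloom reading
bloom_score = anomaly_score(forest, 8.0)     # reading far above the baseline
print(f"baseline: {baseline_score:.2f}, bloom-like: {bloom_score:.2f}")
```

In this sketch a reading far outside the historical range is isolated
after fewer random splits than a typical reading, so it receives a higher
anomaly score. A utility applying the idea would still need to choose a
score threshold, which is not prescribed here.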