Automatic Quality Control of Crowdsourced Rainfall Data with Multiple
Noises: A Machine Learning Approach
Abstract
In geophysics, crowdsourcing is an emerging non-traditional
environmental monitoring approach that encourages contributions of data
from individual citizens. Because of their reliance on undertrained
citizens and imprecise low-cost sensors, crowdsourced data applications
suffer from different types of noise that can degrade the overall
monitoring accuracy. In this study, we propose a machine learning
approach for automatic Crowdsourced data Quality Control (CSQC) by
detecting and removing noisy data points in spatially and temporally
discrete crowdsourced observations. We design a set of features from the
original and interpolated rainfall data, and apply them to train and
test the CSQC models based on both supervised and unsupervised machine
learning algorithms. We also test how the CSQC models perform under
various scenarios without further retraining (hereafter referred to as
transferability). The results based on synthetic but
realistic data show that the CSQC model can significantly reduce the
overall rainfall estimation error. Under the stationarity assumption, CSQC
models based on both supervised and unsupervised algorithms perform
well in both noisy data identification and overall rainfall
estimation error reduction; however, if the model is transferred to
other cities with a different rainfall structure or noise composition
(without retraining), the supervised Multi-Layer Perceptron (MLP)
performs best.
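
As a rough illustration of the idea of comparing original and interpolated rainfall values, the sketch below builds one such feature and applies a simple unsupervised flagging rule. It is not the CSQC model described in this study. The choice of inverse-distance weighting for interpolation, the robust (median-based) score, the threshold of 3.5, and the function names `idw_leave_one_out` and `flag_noisy` are all assumptions made for this example.

```python
import math
from statistics import median


def idw_leave_one_out(points, power=2.0):
    """Interpolate rainfall at each point from all the other points.

    points: list of (x, y, rain) tuples. Inverse-distance weighting is an
    assumed interpolation scheme, chosen here only for simplicity.
    """
    estimates = []
    for i, (xi, yi, _) in enumerate(points):
        num = den = 0.0
        for j, (xj, yj, rj) in enumerate(points):
            if i == j:
                continue  # leave the target observation out
            dist = math.hypot(xi - xj, yi - yj)
            weight = 1.0 / max(dist, 1e-9) ** power
            num += weight * rj
            den += weight
        estimates.append(num / den)
    return estimates


def flag_noisy(points, threshold=3.5):
    """Flag observations whose residual (observed minus interpolated)
    is an outlier under a median/MAD-based robust score.

    This rule is a hypothetical stand-in for an unsupervised detector.
    """
    estimates = idw_leave_one_out(points)
    residuals = [rain - est for (_, _, rain), est in zip(points, estimates)]
    med = median(residuals)
    mad = median([abs(r - med) for r in residuals]) or 1e-9
    scores = [0.6745 * abs(r - med) / mad for r in residuals]
    return [score > threshold for score in scores]


# Four plausible reports around 5 mm and one implausible 50 mm report.
observations = [
    (0.0, 0.0, 5.0),
    (1.0, 0.0, 5.2),
    (0.0, 1.0, 4.9),
    (1.0, 1.0, 5.1),
    (0.5, 0.5, 50.0),
]
print(flag_noisy(observations))
```

In a supervised setting, features of this kind (the observation, its interpolated counterpart, and their difference) would instead be passed with noise labels to a classifier such as an MLP.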