Abstract
Health and environmental hazards related to high pollutant
concentrations have become a serious concern for public policy and
public health. The objective of this research is to improve grid-wise
estimates of PM2.5, a criteria pollutant, by using machine learning
(ML) to reduce the systematic bias that arises when PM2.5 is estimated
empirically from the aerosol speciation provided by the Modern-Era
Retrospective analysis for Research and Applications, version 2
(MERRA-2). We present a unique application of ML for estimating hourly
PM2.5 concentrations at MERRA-2 grid points. The model was trained using various
meteorological parameters and aerosol species simulated by MERRA-2 and
ground measurements from Environmental Protection Agency (EPA) Air
Quality System (AQS) monitors. The ML approach significantly
improved performance and reduced mean bias in the 0-10 µg m⁻³
range. We also trained a Random Forest ML model for
each EPA region using one year of collocated datasets. The resulting ML
models for each EPA region were validated, and the aggregate data set has
a Pearson correlation of 0.88 (RMSE = 4.8 µg m⁻³) and
0.82 (RMSE = 5.8 µg m⁻³) for training and testing,
respectively. With temporal averaging, the correlation increased and the
RMSE decreased, to 0.89 (4.0 µg m⁻³), 0.95 (1.6 µg m⁻³), and 0.94
(1.1 µg m⁻³) for daily, monthly, and yearly averages, respectively.
Results from an initial implementation of the ML model at the global
scale are encouraging, but more research and development are required to
overcome challenges associated with data gaps in many parts of the
world.
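
As an illustration of the evaluation metrics quoted above, the following
minimal sketch computes the Pearson correlation and RMSE between estimated
and observed PM2.5, first at hourly resolution and then after averaging to
daily means. It is not the authors' evaluation code: the function names
(pearson_r, rmse, average_by_key, evaluate_averaged) and the four sample
values are hypothetical and chosen only to show why averaging out hourly
errors tends to lower the RMSE.

```python
import math
from collections import defaultdict


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(estimated, observed):
    """Root-mean-square error, in the same units as the inputs (µg m⁻³)."""
    squared = [(e - o) ** 2 for e, o in zip(estimated, observed)]
    return math.sqrt(sum(squared) / len(squared))


def average_by_key(keys, values):
    """Average values that share a key (for example, a calendar day)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def evaluate_averaged(keys, estimated, observed):
    """Correlation and RMSE after averaging both series over each key."""
    est_mean = average_by_key(keys, estimated)
    obs_mean = average_by_key(keys, observed)
    ordered = sorted(est_mean)
    est = [est_mean[k] for k in ordered]
    obs = [obs_mean[k] for k in ordered]
    return pearson_r(est, obs), rmse(est, obs)


# Hypothetical hourly samples (µg m⁻³), two per day; NOT data from the study.
days = ["day1", "day1", "day2", "day2"]
observed = [10.0, 10.0, 20.0, 20.0]
estimated = [11.0, 9.0, 21.0, 19.0]

hourly_rmse = rmse(estimated, observed)
daily_r, daily_rmse = evaluate_averaged(days, estimated, observed)
print(f"hourly RMSE = {hourly_rmse:.2f}, daily RMSE = {daily_rmse:.2f}")
```

In this toy case the hourly errors cancel within each day, so the daily RMSE
is smaller than the hourly one; real data would show only partial
cancellation.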