Parmap: Scalable Analytics Engine for Climate Model Evaluation on Cloud
and High-Performance Computing Platforms
Abstract
The need to better understand climate change has driven model
simulations to greater fidelity with improved spatiotemporal resolution
(e.g., < 10 km at sub-hourly cadence). For example, the 7 km
GEOS-5 Nature Run (G5NR) with 30-minute outputs from 2005-2007 at the NASA
Center for Climate Simulation (NCCS) is ~4 PB and is not
easily portable. The rise of these high-fidelity climate models
coincides with the emergence of cloud computing as a viable platform for
scientific analytics. NASA has adopted a cloud computing strategy using
public providers like Amazon Web Services (AWS). However, it is not
cost- or time-effective to move the High-Performance Computing
(HPC)-based model computations and data to the cloud. Thus, there is a
need for scalable model evaluation compatible with both the cloud and
HPC platforms like NCCS. To fill this need, we have extended the
analytics component of the Apache Science Data Analytics Platform (SDAP)
with a streamlined version that specifically targets high-resolution
science data products and climate model outputs on a regular coordinate
grid. Gridded inputs (as opposed to other data structures like point
clouds or swath-based measurements supported by SDAP) enable offsets to
particular grid cells to be directly computed, allow for processing on
the original NetCDF or HDF granules, do not require a second tiled copy
of the data, and accommodate a simpler technology stack since no
geospatial database is required for lookups or tile storage. Our core
module, Parmap, abstracts the map-reduce model so that users can select
from a variety of map computational modes, including Spark, Dask,
serverless AWS Lambda, PySparkling, and Python multiprocessing. Example
analytics include area-averaged time series and time-averaged,
correlation, and climatological maps. Benchmarks compare favorably with
the full SDAP implementation.
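
The two ideas described above, direct computation of grid-cell offsets on a regular grid and a map step whose execution mode is selectable, can be illustrated with a minimal Python sketch. This is not the actual Parmap API: the function names (parmap, cell_index, area_mean, area_mean_time_series), the mode strings, and the cosine-latitude weighting are all assumptions made for illustration, and only the sequential, threaded, and multiprocess modes from the standard library are shown (Spark, Dask, and AWS Lambda are omitted). Granules are represented as plain (time, 2-D list) pairs rather than NetCDF or HDF files.

```python
import math
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import partial


def parmap(map_fn, items, mode="seq", workers=4):
    """Apply map_fn to every item using the chosen mode (hypothetical interface)."""
    if mode == "seq":
        return [map_fn(x) for x in items]
    if mode == "thread":
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(map_fn, items))
    if mode == "multicore":
        # map_fn and items must be picklable (top-level functions, plain data),
        # and on some platforms the call must sit under `if __name__ == "__main__":`.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(map_fn, items))
    raise ValueError(f"unknown mode: {mode!r}")


def cell_index(coord, origin, step):
    """Index of the regular-grid cell containing coord; no database lookup needed."""
    return int(math.floor((coord - origin) / step))


def area_mean(grid, lat0, lon0, step, box):
    """Cosine-latitude weighted mean over box = (south, north, west, east)."""
    south, north, west, east = box
    i0, i1 = cell_index(south, lat0, step), cell_index(north, lat0, step)
    j0, j1 = cell_index(west, lon0, step), cell_index(east, lon0, step)
    total = 0.0
    weight_sum = 0.0
    for i in range(i0, i1):
        # Weight each row by the cosine of its cell-center latitude.
        w = math.cos(math.radians(lat0 + (i + 0.5) * step))
        for j in range(j0, j1):
            total += w * grid[i][j]
            weight_sum += w
    return total / weight_sum


def _granule_mean(granule, lat0, lon0, step, box):
    """Map function: one (time, area mean) pair per granule."""
    t, grid = granule
    return (t, area_mean(grid, lat0, lon0, step, box))


def area_mean_time_series(granules, lat0, lon0, step, box, mode="seq"):
    """Map over granules, then collect the results into a time-ordered dict."""
    map_fn = partial(_granule_mean, lat0=lat0, lon0=lon0, step=step, box=box)
    pairs = parmap(map_fn, granules, mode=mode)
    # Reduce step: here it only gathers the per-granule means and orders them by time.
    return dict(sorted(pairs))


# Toy input: two time steps on a 3 x 4 grid with 10-degree cells starting at (0, 0).
toy_granules = [
    (1, [[2.0] * 4 for _ in range(3)]),
    (0, [[1.0] * 4 for _ in range(3)]),
]
series = area_mean_time_series(toy_granules, 0.0, 0.0, 10.0, (0.0, 20.0, 0.0, 30.0), mode="thread")
print(series)
```

In this sketch, switching the mode argument changes only how the map step is executed; the map function and the collection step stay the same, which is the kind of separation the abstract attributes to Parmap.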