Abstract
Faced with unprecedented growth in earth data volume and demand, NASA
has developed the Earth Data Analytic Services (EDAS) framework, a
high-performance big data analytics and machine learning framework. This
framework enables scientists to execute data processing workflows
combining common analysis and forecast operations close to the massive
data stores at NASA. The data is accessed in standard formats (NetCDF, HDF,
etc.) in a POSIX file system and processed using vetted tools of
earth data science, such as ESMF, CDAT, NCO, Keras, and TensorFlow. EDAS
utilizes high-performance parallel data access, a custom distributed
array framework, and a streaming parallel in-memory workflow for
efficiently processing huge datasets within limited memory spaces with
interactive response times. EDAS services are accessed via a Web
Processing Service (WPS) API being developed in collaboration with the
Earth System Grid Federation (ESGF) Compute Working Team to support
server-side analytics for ESGF. The API can be accessed using
direct web service calls, a Python script, a Unix-like shell client, or
a JavaScript-based web application. New analytic operations can be
developed in Python, Java, or Scala (with support for other languages
planned). Client packages in Python, Java/Scala, or JavaScript contain
everything needed to build and submit EDAS requests. The EDAS
architecture brings together the tools, data storage, and
high-performance computing required for timely analysis of large-scale
datasets where the data resides, ultimately producing societal
benefits. It is currently deployed at NASA in support of the
Collaborative REAnalysis Technical Environment (CREATE) project, which
centralizes numerous global reanalysis datasets onto a single advanced
data analytics platform. This service enables decision makers to compare
multiple reanalysis datasets and investigate trends, variability, and
anomalies in earth system dynamics around the globe. EDAS services
include configurable high-performance neural network learning modules
designed to operate on the products of EDAS workflows. As a science
technology driver, we have explored the capabilities of these services
for long-range forecasting of the interannual variation of important
regional-scale seasonal cycles. Neural networks were trained to forecast
All-India Summer Monsoon Rainfall (AISMR) one year in advance, using as
input the top 8 to 64 principal components of the global surface
temperature and 200 hPa geopotential height fields from NASA’s MERRA2
and NOAA’s Twentieth Century Reanalyses. The promising results from
these investigations illustrate the power of easily accessible machine
learning services coupled to huge repositories of earth science data.
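
As an illustration of the input preparation described above, the following
is a minimal sketch (not the EDAS implementation) of extracting the leading
principal components of a gridded field with NumPy. The function name
`top_principal_components`, the synthetic random field, and the array sizes
are assumptions made for this example; a real analysis would typically also
apply area weighting before the decomposition.

```python
import numpy as np


def top_principal_components(field, k):
    """Return the top-k principal component time series of a (time, space) field.

    Hypothetical helper for illustration only; it is not part of EDAS.
    """
    # Remove the time mean at every grid point so the PCs describe anomalies.
    anomalies = field - field.mean(axis=0, keepdims=True)
    # Economy-size SVD: the columns of u, scaled by the singular values,
    # are the principal component time series.
    u, s, _ = np.linalg.svd(anomalies, full_matrices=False)
    pcs = u[:, :k] * s[:k]
    # Fraction of the total variance captured by each retained component.
    explained = (s[:k] ** 2) / np.sum(s ** 2)
    return pcs, explained


# Synthetic stand-in for a reanalysis field: 120 time steps x 500 grid points.
rng = np.random.default_rng(0)
field = rng.standard_normal((120, 500))

pcs, explained = top_principal_components(field, k=8)
print(pcs.shape)
print(explained.sum())
```

The resulting `pcs` array (one row per time step, one column per component)
is the kind of compact predictor matrix that could then be passed to a
neural network in place of the full gridded field.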