Seamless Transition of Data Analyses and Analytics from a Local
Workstation to Scalable, Massively Distributed Processing on the Cloud
Using the Open Source PODPAC Library
Abstract
Newer satellite platforms, such as NISAR, are poised to produce huge
amounts of data that require large computational resources. Currently,
researchers typically download datasets for analysis on local computer
resources. This paradigm is no longer practical given the volumes of
data from new sensing platforms. While cloud computing services offer a
potential solution for accessing and managing large computational
resources, there remains a significant barrier to entry. Leveraging cloud
services requires users to: navigate new terminology without appropriate
documentation; optimize settings for services to reduce costs; and
maintain software dependencies, upgrades, and allocated hardware
resources. A more accessible approach for migrating earth scientists to
the cloud is needed. To address this problem, we are developing the open
source Python library PODPAC (Pipeline for Observational Data Processing
Analysis and Collaboration) to help meet NASA’s needs for handling
rapidly growing volumes and varieties of observational data. PODPAC
enables earth scientists to seamlessly transition from processing on
a local workstation (their current paradigm) to distributed remote
processing on the cloud. It does this by leveraging a text-based JSON
format automatically generated for any plug-and-play algorithm developed
using PODPAC (e.g., in a Jupyter Notebook). This text format describes
data provenance, and is used in RESTful web requests to preconfigured
PODPAC cloud deployments, allowing scalable, massively distributed
processing. We demonstrate the seamless transition to the cloud by
developing a simplified soil moisture downscaling algorithm in Python
using PODPAC. The algorithm uses NASA Soil Moisture Active Passive
(SMAP) sensor data retrieved from the National Snow and Ice Data
Center using OPeNDAP, and fine-scale topographic data retrieved via Open
Geospatial Consortium (OGC) Web Coverage Service (WCS) calls. We then
use a serverless AWS Lambda function to run the same algorithm using the
automatically generated text format. Our generic preconfigured
environment can handle a wide variety of processing pipelines, and scale
up to 1024 parallel processes. This approach enables incremental
adoption of cloud services by researchers, significantly lowering the
barrier to entry.
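
The idea of a text-based pipeline definition that runs unchanged on a
workstation or on remote workers can be illustrated with a minimal
sketch. It uses only the Python standard library and does not use
PODPAC or reproduce its actual JSON schema: the node names, the "op"
and "inputs" keys, the OPS registry, and the toy "downscaled" product
are all hypothetical, and a thread pool stands in for serverless
workers such as AWS Lambda functions.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical operation registry (illustrative only, not PODPAC's API).
# Each operation receives its parameters, its evaluated inputs, and a coordinate.
OPS = {
    "source": lambda params, inputs, x: params["scale"] * x,
    "multiply": lambda params, inputs, x: inputs[0] * inputs[1],
}


def evaluate(definition, output, x):
    """Recursively evaluate node `output` of a pipeline definition at coordinate x."""
    node = definition[output]
    inputs = [evaluate(definition, name, x) for name in node.get("inputs", [])]
    return OPS[node["op"]](node.get("params", {}), inputs, x)


# A toy pipeline: a coarse field modulated by a fine-scale field.
pipeline = {
    "coarse": {"op": "source", "params": {"scale": 0.5}},
    "terrain": {"op": "source", "params": {"scale": 2.0}},
    "downscaled": {"op": "multiply", "inputs": ["coarse", "terrain"]},
}

# The text form records every step, so it doubles as a provenance record
# and as the payload that would be sent to a remote worker.
text = json.dumps(pipeline, sort_keys=True)


def run_remote(payload, xs):
    """Stand-in for a remote worker: rebuild the pipeline from text, then evaluate."""
    definition = json.loads(payload)
    return [evaluate(definition, "downscaled", x) for x in xs]


coords = list(range(8))

# Local evaluation on the "workstation".
local = [evaluate(pipeline, "downscaled", x) for x in coords]

# "Distributed" evaluation: split the coordinates into tiles and send the
# same text definition with each tile to a pool of workers.
tiles = [coords[i:i + 2] for i in range(0, len(coords), 2)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_remote, [text] * len(tiles), tiles))
distributed = [value for tile in results for value in tile]

assert distributed == local
```

Because each worker needs only the text definition and its own tile of
coordinates, the same definition can in principle be fanned out to as
many workers as there are tiles.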