Seamless Transition of Data Analyses and Analytics from a Local Workstation to Scalable, Massively Distributed Processing on the Cloud Using the Open Source PODPAC Library

Jerry Bieszczad; Mattheus Ueckermann; Dara Entekhabi; David Callender; David Sullivan

doi:10.1002/essoar.10500684.1

Abstract

Newer satellite platforms, such as NISAR, are poised to produce huge amounts of data that require large computational resources. Currently, researchers typically download datasets for analysis on local computer resources. This paradigm is no longer practical given the volumes of data from new sensing platforms. While cloud computing services offer a potential solution for accessing and managing large computational resources, there remains a significant barrier to entry. Leveraging cloud services requires users to: navigate new terminology without appropriate documentation; optimize settings for services to reduce costs; and maintain software dependencies, upgrades, and allocated hardware resources. A more accessible approach for migrating earth scientists to the cloud is needed. To address this problem, we are developing the open source Python library PODPAC (Pipeline for Observational Data Processing Analysis and Collaboration), with the goal of helping to address NASA’s rapidly growing observational data volume and variety needs. PODPAC enables earth scientists to seamlessly transition from processing on a local workstation (their current paradigm) to distributed remote processing on the cloud. It does this by leveraging a text-based JSON format that is automatically generated for any plug-and-play algorithm developed using PODPAC (e.g., in a Jupyter Notebook). This text format describes data provenance and is used in RESTful web requests to preconfigured PODPAC cloud deployments, allowing scalable, massively distributed processing. We demonstrate the seamless transition to the cloud by developing a simplified soil moisture downscaling algorithm in Python using PODPAC. The algorithm uses NASA Soil Moisture Active Passive (SMAP) sensor data retrieved from the National Snow and Ice Data Center using OPeNDAP, and fine-scale topographic data retrieved via Open Geospatial Consortium (OGC) Web Coverage Service (WCS) calls. We then use a serverless AWS Lambda function to run the same algorithm using the automatically generated text format. Our generic preconfigured environment can handle a wide variety of processing pipelines, and scale up to 1024 parallel processes. This approach enables incremental adoption of cloud services by researchers, significantly lowering the barrier to entry.
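
The idea of one text definition serving both local and remote execution can be illustrated with a minimal sketch. This is not the PODPAC API or its actual JSON schema: the `OPERATIONS` registry, the node names, the numeric values, and the `remote_handler` function are all hypothetical, chosen only to show how a JSON description of a pipeline might be evaluated unchanged on a workstation and inside a request handler standing in for a cloud function.

```python
import json

# Hypothetical registry of processing steps. These names are illustrative
# assumptions and do not correspond to PODPAC node types.
OPERATIONS = {
    "constant": lambda attrs, inputs: attrs["value"],
    "scale": lambda attrs, inputs: attrs["factor"] * inputs[0],
    "add": lambda attrs, inputs: inputs[0] + inputs[1],
}


def evaluate(definition_text, output):
    """Evaluate the named output node of a JSON pipeline definition."""
    definition = json.loads(definition_text)
    cache = {}

    def run(name):
        # Resolve each node once, evaluating its inputs first.
        if name not in cache:
            spec = definition[name]
            inputs = [run(dep) for dep in spec.get("inputs", [])]
            cache[name] = OPERATIONS[spec["op"]](spec.get("attrs", {}), inputs)
        return cache[name]

    return run(output)


def remote_handler(request_body):
    """Stand-in for a cloud function: it receives only text and returns text."""
    request = json.loads(request_body)
    value = evaluate(request["pipeline"], request["output"])
    return json.dumps({"result": value})


# A toy "downscaling" pipeline with made-up values: a coarse estimate plus a
# scaled fine-scale anomaly.
pipeline = {
    "coarse_soil_moisture": {"op": "constant", "attrs": {"value": 0.20}},
    "topographic_anomaly": {"op": "constant", "attrs": {"value": 0.5}},
    "correction": {
        "op": "scale",
        "attrs": {"factor": 0.1},
        "inputs": ["topographic_anomaly"],
    },
    "downscaled": {"op": "add", "inputs": ["coarse_soil_moisture", "correction"]},
}

# The text form records every input and step, so it doubles as provenance.
definition_text = json.dumps(pipeline, sort_keys=True)

# Local evaluation on the workstation.
local_result = evaluate(definition_text, "downscaled")

# "Remote" evaluation: the same text travels as the body of a web request.
request_body = json.dumps({"pipeline": definition_text, "output": "downscaled"})
remote_result = json.loads(remote_handler(request_body))["result"]
```

In this sketch the handler needs no knowledge of the notebook that produced the pipeline; everything it requires is in the request text. That property is what would allow a single preconfigured deployment to serve many different pipelines.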