Intake / Pangeo Catalog: Making It Easier To Consume Earth’s Climate and Weather Data

Anderson Banihirwe, Charles Blackmon-Luca, Ryan Abernathey, Joseph Hamman

Computer simulations of the Earth's climate and weather generate huge amounts of data. These data are often persisted on HPC systems or in the cloud across multiple data assets in a variety of formats (netCDF, Zarr, etc.). Finding, investigating, and loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available and the attributes describing each data set before loading a specific data set and analyzing it.
In this notebook, we demonstrate the integration of data discovery tools such as intake and intake-esm (an intake plugin) with data stored in cloud-optimized formats (Zarr). We highlight (1) how these tools provide transparent access to local and remote catalogs and data, and (2) the API for exploring arbitrary metadata associated with data and loading data sets into data array containers.
We also showcase the Pangeo Catalog, an open-source project to enumerate and organize cloud-optimized climate data stored across a variety of providers, and a place where several intake-esm collections are now publicly available. We use one of these public collections as an example to show how an end user would explore and interact with the data, and conclude with a short overview of the catalog's online presence.

Introduction
Computer simulations of the Earth's climate and weather generate huge amounts of data. These data are often persisted on high-performance computing (HPC) systems or in the cloud across multiple data assets in a variety of formats (netCDF, Zarr, etc.). Finding, investigating, and loading these data assets into compute-ready data containers costs time and effort. The user should know what data are available and their associated metadata, preferably before loading a specific data asset and analyzing it.
In this notebook, we demonstrate intake-esm, a Python package and an intake plugin that aims to facilitate:
• the discovery of Earth's climate and weather datasets.
• the ingestion of these datasets into xarray dataset containers.
The common starting point for finding and investigating large datasets is a data catalog. A data catalog is a collection of metadata, combined with search tools, that helps data analysts and other users find the data they need. To take full advantage of intake-esm, a user must point it to an Earth System Model (ESM) data catalog. This is a JSON-formatted file that conforms to the ESM collection specification.

ESM Collection Specification
The ESM collection specification provides a machine-readable format for describing a wide range of climate and weather datasets, with a goal of making it easier to index and discover climate and weather data assets. An asset is any netCDF/HDF file or Zarr store that contains relevant data.

An ESM data catalog serves as an inventory of available data, and provides information to explore the existing data assets. Additionally, an ESM catalog can contain information on how to aggregate compatible groups of data assets into singular xarray datasets.
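To make the structure concrete, the following is a minimal sketch of such a catalog description, written as a Python dict. It is an illustration, not the full specification: the id, description, file names, and column names are hypothetical, and only a handful of the spec's fields are shown.

```python
import json

# Illustrative ESM data catalog description (hypothetical values;
# only a subset of the ESM collection specification's fields).
esm_catalog = {
    "esmcat_version": "0.1.0",
    "id": "example-cmip6",
    "description": "Example catalog of CMIP6 Zarr stores",
    # The inventory of assets lives in a separate tabular (CSV) file;
    # each row describes one netCDF/HDF file or Zarr store.
    "catalog_file": "example-cmip6.csv",
    # Columns of the tabular file that describe each asset.
    "attributes": [
        {"column_name": "source_id"},
        {"column_name": "experiment_id"},
        {"column_name": "variable_id"},
    ],
    # How compatible groups of assets may be aggregated into
    # singular xarray datasets.
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": ["source_id", "experiment_id"],
    },
}

# The description must round-trip as plain JSON.
assert json.loads(json.dumps(esm_catalog)) == esm_catalog
```

The division of labor is the key point: the JSON file carries the dataset-level metadata and aggregation rules, while the referenced tabular file carries the per-asset inventory.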

Use Case: CMIP6 hosted on Google Cloud
The Coupled Model Intercomparison Project (CMIP) is an international collaborative effort to improve the knowledge about climate change and its impacts on the Earth System and on our society. CMIP began in 1995, and today we are in its sixth phase (CMIP6). The CMIP6 data archive consists of data models created across approximately 30 working groups and 1,000 researchers investigating the urgent environmental problem of climate change, and will provide a wealth of information for the next Assessment Report (AR6) of the Intergovernmental Panel on Climate Change (IPCC).
Last year, Pangeo partnered with Google Cloud to bring CMIP6 climate data to Google Cloud's Public Datasets program. You can read more about this process here. For the remainder of this section, we will demonstrate intake-esm's features using the ESM data catalog for the CMIP6 data stored on Google Cloud Storage. This catalog resides in a dedicated CMIP6 bucket.

Loading an ESM data catalog
To load an ESM data catalog with intake-esm, the user must provide a valid ESM data catalog as input (the catalog URL below is truncated in the source):

```python
import warnings

warnings.filterwarnings("ignore")

import intake

col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/...")
```

Note: the amount of detail provided in the catalog is determined by the data provider who builds the catalog.

Searching for datasets
After exploring the CMIP6 controlled vocabulary, it's straightforward to get the data assets you want using intake-esm's search() method. In the example below, we are going to search for the following:
• variables: tas, which stands for near-surface air temperature.
• experiments: ['historical', 'ssp245', 'ssp585'], where historical refers to all forcing of the recent past.
• grid_label: gr, which stands for regridded data reported on the data provider's preferred target grid.
For more details on the CMIP6 vocabulary, please check this website.
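Under the hood, an intake-esm catalog keeps its asset inventory as a pandas DataFrame, and search() keeps the rows whose columns match the requested facets. The filtering idea can be sketched offline on a toy inventory (the column names follow CMIP6 conventions, but the rows and the search helper here are invented for illustration):

```python
import pandas as pd

# Toy asset inventory mimicking an intake-esm catalog's DataFrame
# (rows are made up for illustration).
df = pd.DataFrame(
    {
        "source_id": ["CESM2", "CESM2", "GFDL-CM4", "GFDL-CM4"],
        "experiment_id": ["historical", "ssp585", "historical", "piControl"],
        "variable_id": ["tas", "tas", "tas", "pr"],
        "grid_label": ["gr", "gr", "gr", "gn"],
        "zstore": [f"gs://bucket/store-{i}" for i in range(4)],
    }
)

def search(df, **facets):
    """Keep rows matching every requested facet (a value or list of values)."""
    mask = pd.Series(True, index=df.index)
    for column, values in facets.items():
        if not isinstance(values, (list, tuple)):
            values = [values]
        mask &= df[column].isin(values)
    return df[mask]

subset = search(
    df,
    variable_id="tas",
    experiment_id=["historical", "ssp245", "ssp585"],
    grid_label="gr",
)
print(len(subset))  # 3 assets match all three facets
```

Each additional facet narrows the inventory further, which is why search() composes naturally with the controlled vocabulary above.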

Once the search is narrowed down, the matching assets can be loaded and merged into xarray datasets; here dsets denotes the resulting dictionary of datasets:

```python
# list all merged datasets
[key for key in dsets.keys()]
```

The keys in the returned dictionary of datasets are constructed by joining the values of the attributes used to group compatible assets together.
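That key construction can be sketched in plain Python: join one group's identifying attribute values with a "." separator. This is a simplified illustration of the behavior, with invented attribute names and values:

```python
# Simplified illustration of dataset-key construction: each key is the
# "."-joined values of the attributes used to group compatible assets.
groupby_attrs = ["activity_id", "institution_id", "source_id", "experiment_id"]

# One group's attribute values (invented for illustration).
group = {
    "activity_id": "CMIP",
    "institution_id": "NCAR",
    "source_id": "CESM2",
    "experiment_id": "historical",
}

key = ".".join(group[attr] for attr in groupby_attrs)
print(key)  # CMIP.NCAR.CESM2.historical
```

Because the key encodes the grouping attributes in a fixed order, a user can read off which model, experiment, and so on a merged dataset came from.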

Pangeo Catalog
Pangeo Catalog is an open-source project to enumerate and organize cloud-optimized climate data stored across a variety of providers. In addition to offering various useful climate datasets in a consolidated location, the project also serves as a means of accessing public ESM data catalogs.

Accessing catalogs using Python
At the core of the project is a GitHub repository containing several static intake catalogs in the form of YAML files. Thanks to plugins like intake-esm and intake-xarray, these catalogs can contain links to ESM data catalogs or data assets that can be loaded into xarray datasets, along with the arguments required to load them.
By editing these files using Git-based version control, anyone is free to contribute a dataset supported by the available intake plugins. Users can then browse these catalogs by providing their associated URL as input into intake's open_catalog(); their tree-like structure allows a user to explore their entirety by simply opening the root catalog and recursively walking through it (the catalog URL below is truncated in the source):

```python
cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/...")
```

The catalogs can also be explored using intake's own search() method:

```python
cat_subset = cat.search('cmip6')
list(cat_subset)
```

```
['climate.cmip6_gcs', 'climate.GFDL_CM2_6', 'climate.tracmip', 'climate']
```

Once we have found a dataset or collection we want to explore, we can do so without the need of any user-inputted argument:

```python
cat.climate.tracmip()
```
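The recursive walk described above can be sketched with plain nested dictionaries standing in for intake catalog nodes. This is a stand-in illustration only (the names and the walk helper are invented; real intake catalogs expose a similar iteration interface rather than raw dicts):

```python
# Toy stand-in for a tree of intake catalogs: dict values are either
# sub-catalogs (dicts) or leaf dataset descriptions (strings).
root = {
    "climate": {
        "cmip6_gcs": "ESM collection",
        "tracmip": "zarr dataset",
    },
    "ocean": {
        "sea_surface_height": "zarr dataset",
    },
}

def walk(catalog, prefix=""):
    """Yield dotted paths to every leaf entry, depth-first."""
    for name, entry in catalog.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(entry, dict):
            yield from walk(entry, path)
        else:
            yield path

print(sorted(walk(root)))
# ['climate.cmip6_gcs', 'climate.tracmip', 'ocean.sea_surface_height']
```

The dotted paths mirror the names returned by the catalog search above (e.g., 'climate.tracmip'), which is what makes opening the root catalog sufficient to discover everything beneath it.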

Accessing catalogs using catalog.pangeo.io
For those who don't want to initialize a Python environment to explore the catalogs, catalog.pangeo.io offers a means of viewing them from a standalone web application. The website directly mirrors the catalogs in the GitHub repository, with previews of each dataset or collection loaded on the fly. From here, users can view the JSON input associated with an ESM collection and sort/subset its contents.

Conclusion
With intake-esm, much of the toil associated with discovering, loading, and consolidating data assets can be eliminated. In addition to making computations on huge datasets more accessible to the scientific community, the package also promotes reproducibility by providing simple methodology to create consistent datasets. Coupled with Pangeo Catalog (which in itself is powered by intake), intake-esm gives climate scientists the means to create and distribute large data collections with instructions on how to use them essentially written into their ESM specifications.
There is still much work to be done with respect to intake-esm and Pangeo Catalog; in particular, goals include:
• Merging ESM collection specifications into the SpatioTemporal Asset Catalog (STAC) specification to offer a more universal specification standard
• Development of tools to verify and describe catalogued data on a regular basis
• Restructuring of catalogs to allow subsetting by cloud provider region
Please reach out if you are interested in participating in any way.