Abstract
The twin pressures to achieve mind-share and to harness available
computing power drive the evolution of geoscientific data analysis
tools. Such tools have enabled a remarkable progression in the atomic or
fundamental unit of data they can easily analyze. In the mid-1980s we
analyzed one or a few naked arrays at a time, and now researchers
routinely intercompare climatological ensembles each comprising
thousands of files of heterogeneous variables richly dressed in
metadata. Two complementary semantic trends have empowered this
analytical revolution: more intuitive and concise analysis commands that
can exploit more standardized and brokered self-describing data stores.
This talk highlights how tool developers can leverage these trends to
successfully imagine and build the analysis tools of tomorrow by
understanding the needs of domain researchers and the power of
domain-specific languages today. It also highlights recent
improvements in compression speed and interoperability that
geoscientists can exploit to reduce our carbon footprint. The
observations and simulations that advance the Earth system sciences generate exabytes of
archived data per year. Storage accounts for about 40% of datacenter
power consumption, with its attendant consequences for greenhouse gas
emissions and environmental sustainability. Precision-preserving lossy
compression can further reduce the size of losslessly compressed data by
10-25% without compromising its scientific content. Modern lossless
codecs (e.g., Zstandard or Zlib-ng) accelerate compression and
decompression, relative to the traditional Zlib, by factors of 2-5 with
no penalty in compression ratio. These proven modern compression
technologies can help geoscientific datacenters become significantly
greener.
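
As a minimal illustration of these two claims, the sketch below quantizes a float32 array by rounding away low-order mantissa bits (precision-preserving lossy compression) and then compares zlib against Zstandard on the raw and quantized bytes. The bitround helper, the keepbits value, the synthetic field, and the codec settings are assumptions made for this sketch, not material from the talk itself.

    import time
    import zlib
    import numpy as np
    import zstandard as zstd  # third-party package: pip install zstandard


    def bitround(data, keepbits):
        """Round float32 mantissas to keepbits explicit bits (round half to even).

        Keeping roughly 7 of the 23 mantissa bits preserves about 3 significant
        decimal digits; the zeroed trailing bits then compress much better.
        Sketch only: NaN/Inf handling is omitted.
        """
        bits = np.ascontiguousarray(data, dtype=np.float32).view(np.uint32)
        dropbits = 23 - keepbits                      # float32 has 23 mantissa bits
        mask = np.uint32((0xFFFFFFFF >> dropbits) << dropbits)
        half = np.uint32(1 << (dropbits - 1))
        lsb = (bits >> np.uint32(dropbits)) & np.uint32(1)
        return ((bits + half - np.uint32(1) + lsb) & mask).view(np.float32)


    # Synthetic stand-in for a geophysical field: smooth signal plus weak noise.
    rng = np.random.default_rng(0)
    field = (np.sin(np.linspace(0.0, 50.0, 2_000_000))
             + 0.01 * rng.standard_normal(2_000_000)).astype(np.float32)

    raw = field.tobytes()
    quantized = bitround(field, keepbits=7).tobytes()

    zstd_c = zstd.ZstdCompressor(level=3)
    codecs = [("zlib", lambda b: zlib.compress(b, 6)),
              ("zstd", zstd_c.compress)]

    # Report compression ratio and wall-clock time for each codec/payload pair.
    for name, compress in codecs:
        for label, payload in (("raw", raw), ("quantized", quantized)):
            t0 = time.perf_counter()
            out = compress(payload)
            dt = time.perf_counter() - t0
            print(f"{label:9s} + {name}: ratio {len(payload)/len(out):5.2f}, {dt:.2f} s")

In practice geoscientists would apply such quantization and codecs through netCDF/HDF5 filters or command-line utilities rather than by hand; the sketch only makes the size and speed trade-offs concrete.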