Abstract
Over the past decade, the success of large-scale projects such as
the USArray component of EarthScope has given rise to a massive increase in
the data volume available to the seismology community. We assert that
the software infrastructure of the field has not kept up with parallel
developments in ‘big data’ sciences. As a step towards making
extreme-scale research accessible to more of the seismology community, we are
developing a new framework for seismic data processing and management, which we
call the Massive Parallel Analysis System for Seismologists (MsPASS). MsPASS
leverages several existing technologies: (1) Spark as the scalable
parallel processing framework, (2) MongoDB as the flexible database
system, and (3) Docker and Singularity as the containerized virtual
environments. The core of the system builds on a rewrite of the SEISPP
package to implement wrappers around the widely accepted ObsPy toolkit.
The wrappers automate many database operations and automatically record
the processing history, which provides a mechanism for
reproducibility. Together, these components provide the
flexibility to adapt to a wide range of data processing workflows. The
use of containers enables deployment on a wide range of computing
platforms without requiring intervention by system administrators. We
evaluate the effectiveness of the system with a deconvolution processing
workflow applied to USArray data. Through extensive documentation and
examples, we aim to make this system a sustainable, open-source
framework for the community.
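
The idea of wrappers that automatically record processing history can be illustrated with a minimal Python sketch. This is not the MsPASS API: the Trace class, the with_history decorator, and the scale and demean functions are hypothetical names introduced only for illustration, and the history is kept in memory rather than in a database.

```python
import functools
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical stand-in for a seismic data object (not an MsPASS class)."""
    samples: list
    history: list = field(default_factory=list)


def with_history(func):
    """Wrap a processing function so every call is logged on its output."""
    @functools.wraps(func)
    def wrapper(trace, **params):
        result = func(trace, **params)
        # Carry forward earlier records, then append a record for this step.
        result.history = trace.history + [
            {"algorithm": func.__name__, "params": params}
        ]
        return result
    return wrapper


@with_history
def scale(trace, factor=1.0):
    """Multiply every sample by a constant factor."""
    return Trace([s * factor for s in trace.samples])


@with_history
def demean(trace):
    """Subtract the mean from every sample."""
    mean = sum(trace.samples) / len(trace.samples)
    return Trace([s - mean for s in trace.samples])


# Chain two steps; the output carries the full record of how it was made.
processed = demean(scale(Trace([1.0, 2.0, 3.0]), factor=2.0))
for step in processed.history:
    print(step["algorithm"], step["params"])
```

Because each record stores the algorithm name and its parameters in order, the sequence could in principle be replayed on the original input to reproduce the result, which is the sense of reproducibility intended above.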