Thematic harmonization of environmental data: Facilitating
interoperability of data within and among repositories in support of
data reuse and scientific synthesis
- Margaret O'Brien
- Colin Smith
- Corinna Gries
Abstract
Data repositories and research networks worldwide are publishing a
diverse array of long-term and experimental data for meaningful reuse,
repurposing, and integration. However, in synthesis research the largest
time investment is still in discovering, cleaning, and combining primary
datasets until all are completely understood and converted to a usable
format. To accelerate this process, we have developed an approach to
define flexible domain-specific data models and convert primary data to
these models using a lightweight and distributed workflow framework.
The approach is based on extensive experience in synthesis research
workflows, takes into account the distributed nature of original data
curation, satisfies the requirement for regular additions to the
original data, and is not determined by a single synthesis research
question. Furthermore, all data describing the sampling context are
preserved and the harmonization may be performed by data scientists who
are not specialists in each specific research domain. Our harmonization
process has three phases. First, a Design Phase captures essential
attributes and considers existing standardization efforts and
external vocabularies that disambiguate meaning. Second, an
Implementation Phase publishes the data model and best practice guides
for reference, followed by conversion of relevant repository contents by
data managers, and creation of software for data discovery and
exploration. Third, a Maintenance Phase implements programmatic
workflows that use event notification services to run automatically
when parent data are revised. In this presentation we demonstrate the
harmonization process for ecological community survey data and highlight
the unique challenges and lessons learned. Additionally, we demonstrate
the maintenance workflow and data exploration and aggregation tools that
plug into this data model.
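
As a rough illustration of what converting a primary dataset to a shared data model can involve, the sketch below reshapes a wide community survey table (one column per taxon) into long-format observation records while keeping the sampling context on every record. The column names, the `harmonize` function, and the record layout are hypothetical and are not the data model described in this presentation.

```python
import csv
import io

# Hypothetical primary dataset: one row per sampling event, one column per taxon.
PRIMARY_CSV = """site,date,Quercus_alba,Acer_rubrum
A,2019-06-01,4,0
B,2019-06-01,2,7
"""

# Columns that describe the sampling context rather than a taxon (assumed names).
CONTEXT_COLUMNS = ("site", "date")


def harmonize(text, context_columns=CONTEXT_COLUMNS):
    """Reshape a wide survey table into long-format observation records."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        context = {name: row[name] for name in context_columns}
        for column, value in row.items():
            if column in context_columns:
                continue
            # One record per (sampling event, taxon); context is kept on every record.
            records.append({**context, "taxon": column, "value": int(value)})
    return records


observations = harmonize(PRIMARY_CSV)
for record in observations:
    print(record)
```

Because every record carries the same fields regardless of which taxa the source table listed, datasets reshaped this way could be concatenated without per-dataset column handling.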