Going beyond the spreadsheet - developing Best Practices in ‘long-tail’
environmental data curation and publishing
Abstract
The research data repository of the Environmental Data Initiative (EDI)
is a signatory of the FAIR Data Principles. Building on over 30 years of
data curation research and experience in the NSF-funded US Long-Term
Ecological Research program (LTER), it provides mature functionalities,
well-established workflows, and support for ‘long-tail’ environmental
data publication. High-quality scientific metadata are enforced through
automatic checks against community-developed rules and the Ecological
Metadata Language (EML) standard. Although the EDI repository is far
along the continuum of making its data FAIR, representatives from EDI
and the LTER Information Management community have recently been
developing best practices for the edge cases in environmental data
publishing. Here we discuss, and seek feedback on, how best to handle the
publication of these ‘long-tail’ data when they are accompanied by
extensive additional data such as genomics data, physical specimens, or
flux tower data. While these latter data are better handled in other
discipline-specific repositories such as NCBI, iDigBio, and AmeriFlux,
they are frequently associated with other data collected at the same
time and location, or even from the same samples. This is particularly
relevant across the LTER Network, where sites represent integrative
research projects. Questions we address (and on which we seek community
input) include: How should documents and images be archived when they are
themselves data, e.g., field notebooks or time-lapse photographs of plant
phenology? How should data from unmanned vehicles (e.g., drones and
underwater gliders), acoustic data, or model outputs, which may be several
terabytes in size, be handled? How should processing scripts or modeling code be
associated with data? Overall, these best practices address issues of
Findability and Accessibility of data as well as greater transparency of
the research process.