Going beyond the spreadsheet - developing Best Practices in ‘long-tail’ environmental data curation and publishing

Corinna Gries; Renée Brown; Mary Gastil-Buhl; Sarah Elmendorf; Hap Garritt; Mary Martin; Greg Maurer; An Nguyen; John Porter; Timothy Whiteaker

doi:10.1002/essoar.10505435.1

Abstract

The research data repository of the Environmental Data Initiative (EDI) is a signatory of the FAIR Data Principles. Building on over 30 years of data curation research and experience in the NSF-funded US Long-Term Ecological Research (LTER) program, it provides mature functionality, well-established workflows, and support for ‘long-tail’ environmental data publication. High-quality scientific metadata are enforced through automatic checks against community-developed rules and the Ecological Metadata Language (EML) standard. Although the EDI repository is far along the continuum of making its data FAIR, representatives from EDI and the LTER Information Management community have recently been developing best practices for the edge cases in environmental data publishing. Here we discuss, and seek feedback on, how best to handle the publication of these ‘long-tail’ data when they are accompanied by extensive additional data such as genomics data, physical specimens, or flux tower data. While these latter data are better handled in discipline-specific repositories such as NCBI, iDigBio, and AmeriFlux, they are frequently associated with other data collected at the same time and location, or even from the same samples. This is particularly relevant across the LTER Network, where sites represent integrative research projects. Questions we address (and seek community input on) include: How should documents and images be archived when they are themselves data, e.g., field notebooks or time-lapse photographs of plant phenology? How should data from unmanned vehicles (e.g., drones and underwater gliders), acoustic data, or model outputs, which may be several terabytes in size, be handled? How should processing scripts or modeling code be associated with data? Overall, these best practices address the Findability and Accessibility of data and promote greater transparency of the research process.