Integrating interdisciplinary data: The EMERGE Database and its broader lessons for data management best practices
Abstract
In environmental research, cross-disciplinary analyses enable the discovery of novel insights that may not otherwise be evident. Doing these analyses efficiently requires integration of heterogeneous data into a common data structure; however, this type of data integration represents a major challenge, especially for large, multi-institutional projects. Not only should the sharing of individual datasets follow FAIR principles (Findable, Accessible, Interoperable, Reusable), but the ideal data management system should also include a central multidisciplinary data organization framework.
The EMERGE Database (EMERGE-DB; https://emerge-db.asc.ohio-state.edu/) is the central data hub of the EMERGE Biology Integration Institute (NSF award # 2022070), which investigates the changing dynamics of a thawing permafrost ecosystem in Stordalen Mire, northern Sweden. The EMERGE-DB accomplishes the essential tasks of data management (i.e., data storage and sharing), while also offering more advanced functionality to facilitate interdisciplinary collaboration. Data and standardized metadata (including both sample and file metadata) are integrated within a Neo4j graph database, which allows combined datasets from different source files to be obtained via efficient custom queries. A front-end web portal provides access to these data for both the public and EMERGE project members (who can access non-public data via login), with separate pages providing “views” of the database tailored to common use cases. Although data are still deposited to external community repositories (e.g., Zenodo, NCBI databases) to ensure cost-effective long-term accessibility, these depositions are tracked within the EMERGE-DB’s standardized metadata system, with all internally and externally stored datasets displayed within a centralized page on the web portal. Although this data integration and sharing framework is customized for the EMERGE project’s needs, many of its guiding principles (such as the centralized web access point for all datasets, and general file formatting standards to streamline the detailed integration of sample metadata) are broadly applicable as “best practices” that other projects can apply in their own data management systems.