Dave Vieglais et al.

Material samples are vital across multiple scientific disciplines, and samples collected for one project often prove valuable for additional studies. The Internet of Samples (iSamples) project aims to integrate large, diverse, cross-disciplinary sample repositories and enable access and discovery of material samples as FAIR data (Findable, Accessible, Interoperable, and Reusable). Here we report our recent progress in controlled vocabulary development and mapping. In addition to a core metadata schema that integrates SESAR, GEOME, Open Context, and Smithsonian natural history collections, three small but important controlled vocabularies (CVs) describing specimen type, material type, and sampled feature were created. The new CVs provide consistent semantics for high-level integration of the existing vocabularies used in the source collections. Two methods were used to map source record properties to terms in the new CVs. Keyword-based heuristic rules were manually created where existing terminologies were similar to the new CVs, as in records from SESAR, GEOME, and Open Context and in some aspects of Smithsonian Darwin Core records. For example, specimen type = "liquid>aqueous" in SESAR records maps to specimen type = "liquid or gas sample" and material type = "liquid water". A machine learning approach was applied to Smithsonian Darwin Core records to infer sampled feature terms from text in the habitat, locality, higher geography, and higher classification fields. fastText word vectors pre-trained on a 600-billion-token general-domain corpus provided the model with a baseline "understanding" of English words. Training sets of 200 and 995 records yielded 87% and 94% precision and 85% and 92% recall, respectively, performance sufficient for production use. Applying these approaches, more than 3×10⁶ records from the four large collections have been mapped successfully to a common core data model, facilitating cross-domain discovery and retrieval of the sample records.
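To make the machine-learning mapping step concrete, the following is a minimal sketch of fastText supervised classification over concatenated Darwin Core text fields. The file names, label spellings, record text, and hyperparameters are illustrative assumptions, not the iSamples production configuration; the pre-trained vectors referenced (fastText's Common Crawl vectors, roughly 600 billion tokens) stand in for the general-domain corpus mentioned above.

```python
import fasttext  # pip install fasttext

# Assumed training file in fastText supervised format, one record per line:
# the target sampled-feature CV term as a "__label__" prefix, followed by the
# concatenated habitat, locality, higherGeography, and higherClassification text, e.g.
#   __label__marine_water_body coral reef flat | Oahu | Pacific Ocean | Animalia Cnidaria ...
model = fasttext.train_supervised(
    input="sampled_feature_train.txt",       # hypothetical file name
    pretrainedVectors="crawl-300d-2M.vec",   # general-domain vectors (Common Crawl, ~600B tokens)
    dim=300,                                 # must match the pre-trained vector dimension
    epoch=25,
    lr=0.5,
)

# Evaluate on a held-out file in the same format; returns (n, precision@1, recall@1).
n, precision, recall = model.test("sampled_feature_valid.txt")
print(f"n={n}  precision={precision:.2f}  recall={recall:.2f}")

# Infer a sampled-feature term for a new record's concatenated field text.
labels, probs = model.predict(
    "rocky intertidal zone | Monterey Bay | North America | Mollusca Gastropoda"
)
print(labels[0], float(probs[0]))
```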

Ilya Zaslavsky et al.

The EarthCube Data Discovery Studio (DDStudio) integrates several technical components into an end-to-end data discovery and exploration system. Beyond supporting dataset search across multiple data sources, it lets geoscientists explore the data using Jupyter notebooks; organize the discovered datasets into thematic collections that can be shared with other users; edit metadata records and contribute metadata describing additional datasets; and examine provenance and validate automated metadata enhancements. DDStudio provides access to 1.67 million metadata records from 40+ geoscience repositories, which are automatically enhanced and exposed via standard interfaces in both ISO 19115 and schema.org markup; the latter can be used by commercial search engines (Google, Bing) to index DDStudio content. For geoscience end users, DDStudio provides a custom Geoportal-based user interface that enables spatio-temporal, faceted, and full-text search and provides access to the additional functions listed above. Key project accomplishments over the last year include:
- User interface improvements, based on design advice from a Science Gateways Community Institute (SGCI) usability team, which conducted user interviews, performed usability testing, and analyzed a dozen other search portals to identify the most useful features. This work resulted in a streamlined user interface, particularly in the presentation of search results and the management of thematic collections.
- The earlier effort to publish DDStudio content using schema.org markup resulted in a significant increase in usage. With over 900K records indexed by Google, nearly half of the roughly 1,000 unique users per month now reach DDStudio via referrals from Google.
- The added ability to harvest and process JSON-LD metadata (a minimal harvesting sketch follows below) makes it possible to integrate EarthCube GeoCodes content into DDStudio and to work with this content through DDStudio's user interface.
- New application domains include joint work with the library community and interoperation with DataMed, a similar system that indexes 2.3 million biomedical datasets.
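As a minimal sketch of the JSON-LD harvesting step mentioned above: schema.org Dataset markup is typically embedded in a landing page inside <script type="application/ld+json"> blocks, which can be pulled out with the Python standard library alone. The URL is a placeholder, and this is not DDStudio's actual harvester; it only illustrates the kind of extraction involved.

```python
import json
import urllib.request
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed JSON from <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            try:
                self.blocks.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # skip malformed markup rather than failing the harvest


# Placeholder landing-page URL; a real harvester would iterate over sitemap entries.
with urllib.request.urlopen("https://example.org/dataset/landing-page") as resp:
    parser = JsonLdExtractor()
    parser.feed(resp.read().decode("utf-8", errors="replace"))

# Keep only schema.org Dataset descriptions (ignoring @type given as a list, for brevity).
datasets = [b for b in parser.blocks if isinstance(b, dict) and b.get("@type") == "Dataset"]
print(f"found {len(datasets)} Dataset description(s)")
```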

Rebecca Koskela et al.

The EarthCube Technology & Architecture Committee formed a Resource Registry Working Group (WG) to develop a framework for a registry of EarthCube (EC) resources, enabling users to discover scientific and technical resources (software, tools, vocabularies, etc.) that are relevant to their research. The registry will promote EC investments, reduce time to science, help enable interdisciplinary research, more clearly define what EC is, and provide a vehicle for tool and software producers to notify the community about new products, increase visibility, and gain recognition. A primary requirement is to enable systematic description of EarthCube computational resources in terms of their functionality and interfaces, so that users can identify components that can work together in integrated workflows. This requires understanding the specifics of how a software component communicates: both the messaging protocol and the syntax and semantics of the information formats for getting data into and out of a component. The registry would work in conjunction with the schema.org dataset descriptions being developed by the community to streamline the linkage of data and software components for research workflows. The WG created definitions for a set of resources to include in a first iteration of the registry, along with a set of properties that should be specified for all resources and properties specific to particular resource types. The suggested resource types are: Software, Interface/API, Interchange format, Dataset, Repository, Service, Platform, Vocabulary/Ontology/Information model, Specification, Catalog/registry, and Use Case. Registration of Dataset and Use Case resources is out of scope for the WG project and will be handled separately. Elaboration of this registry is in the EarthCube workplan, with the goals of maximizing reuse of existing vocabularies and technology and maintaining compatibility with related registry activities.
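As an illustration only: a registry entry combining the common properties with Software-specific interface details might be serialized as schema.org-flavored JSON-LD along the lines of the sketch below. The resource, the property choices, and especially the "x-interface" key are assumptions for illustration, not the WG's specification.

```python
import json

# Hypothetical registry entry for a Software resource. Common descriptive
# properties use schema.org terms; the interface/interchange detail the WG
# calls for is shown under an assumed "x-interface" key, since the WG's
# actual property set is not defined here.
entry = {
    "@context": "https://schema.org/",
    "@type": "SoftwareApplication",
    "name": "Example gridding tool",
    "description": "Regrids point observations onto a regular latitude/longitude grid.",
    "softwareVersion": "1.2.0",
    "license": "https://www.apache.org/licenses/LICENSE-2.0",
    "provider": {"@type": "Organization", "name": "Example University"},
    "x-interface": {
        "protocol": "HTTP/REST",
        "inputFormats": ["text/csv"],
        "outputFormats": ["application/x-netcdf"],
    },
}

print(json.dumps(entry, indent=2))
```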