Shelley Stall

and 9 more

Research data are a vital component of the scientific record, yet discovering and assessing data for possible reuse in future research is challenging. The Belmont Forum has recently awarded funds to three international teams as part of a four-year Collaborative Research Action (CRA) on Science-driven e-Infrastructure Innovation (SEI) for the Enhancement of Transnational, Interdisciplinary and Transdisciplinary Data Use. The awards aim to improve data management practices in ways that will increase data reuse. One of these awardees, PARSEC, comprises two interwoven strands: one focused on improving data practices for reuse and credit, and one on synthesis science. The data specialists work alongside synthesis science researchers as they determine the influence of natural protected areas on socioeconomic outcomes for local communities. They collaborate with the researchers to better understand their motivations and work practices, and to aid them in the data-related steps that need to be taken during the research lifecycle. This helps ensure that their data and code are FAIR-compliant (Findable, Accessible, Interoperable, Reusable), which increases the likelihood that the data will be reused and the analyses reproduced. The PARSEC team is working with the Research Data Alliance (RDA), Earth Science Information Partners (ESIP), DataCite and ORCID to build awareness of the elements required for data creators to receive credit and automated attribution for their data contributions, and of the tools that will make it easier to observe usage. Credit for data is an important incentive for researchers to make their data reusable. When data are FAIR and cited, their related publications have higher visibility. We shall discuss various ways in which we are working across the science-data interface in our multi-country and multi-disciplinary working environment to improve data (and code) reuse through better management and crediting. Make your Data FAIR, Cite your Data, Get Credit, Increase Reuse and reap the rewards!
With the mass adoption of data analysis in scientific fields such as climatology, medicine, astronomy and astrophysics, the scientific community increasingly recognizes the need for an appropriate analytics infrastructure. Appropriate tools and applications are required to process the large volume of data collected and generated by researchers. One of the biggest challenges is that these tools need to be brought together and adapted for use in specific domains. Bioclimatic data analysis is a field that still has much to improve in this respect. Few efforts have been made to provide methodologies and tools that facilitate the understanding of the complex phenomena through which environmental variables influence biodiversity on the planet. The purpose of this work is therefore to propose a big data analytics architecture, in the form of an ecosystem that systematizes and simplifies the scientist's task of dealing with the complexity of bioclimatic data analysis. It provides tools for storage, management, analysis using machine learning and data mining algorithms, and visualization. The methodological approach was a thorough bibliographical study to identify the most widely used tools and the suitability of each one to the purpose of the work. In addition, the literature provided indications of software ecosystem implementation methodologies that served as a guide in the architecture design. Within the architecture, we attempted to gather a set of bioclimatic data based on a subset of data obtained from the Atmospheric Radiation Measurement (ARM) data repository for climatic data, and from the Brazilian Biodiversity Portal for biodiversity data.
As a result, we were able to gather a series of tools: Cassandra for data access, Spark for distributed processing, Jupyter Notebook as the programming interface, system modules for data format conversion, machine learning algorithm libraries, and software for data visualization. This research discusses the importance of a domain-specific design of a data analysis architecture for bioclimatic data. We conclude that this type of ecosystem is imperative to facilitate the research process and increase the quality of the results.

Lucas Bauer

and 5 more

The Amazon rainforest has a great influence on the global energy balance and carbon fluxes, and is responsible for the net removal of approximately 4 million tons of carbon per year via photosynthetic activity. Climate change and deforestation affect the carbon budget in Amazonia, transforming CO2 sink areas into sources. Given the complexity of the factors that govern carbon exchange in the Amazon and its influence on biological processes, Data Science strategies can promote a better understanding of the main environmental factors under different scenarios, and can also support public policies to mitigate the effects of global warming. This study aims to identify the environmental factors that determine the temporal variability of carbon exchanges between the biosphere and the atmosphere in the Tapajós National Forest, in the Amazon, by applying Data Science strategies to an integrated set of environmental data comprising energy and carbon fluxes and remote sensing data. The specific objective is to assess the influence of a selected set of environmental variables on the variability of carbon exchanges, using an artificial neural network classification model to identify the variables with the greatest impact on source, sink and neutrality scenarios in Tapajós National Forest. Data Science strategies were applied to an integrated dataset of ground-based carbon flux measurements and remote sensing data, covering the period between 2002 and 2006. An artificial neural network (ANN) classification model was developed to identify the environmental variables with the greatest impact on carbon source, sink and neutrality conditions. The average global score of the ANN model was 65%. It was possible to identify the predictor variables with the greatest impact on the carbon sink condition: radiation at the top of the atmosphere, sensible and latent energy fluxes, and leaf area index.
Thus, an ANN model combined with an ensemble of Data Science strategies can improve the understanding of the variability of CO2 fluxes and be a powerful tool for generating new knowledge.