
4

Data Curation

Robert Hull, Rensselaer Polytechnic Institute, introduced three speakers who were invited to address data curation: James Goddin, ANSYS Granta; Josh Peek, Space Telescope Science Institute; and Martin Green, National Institute of Standards and Technology (NIST). Susan Sinnott, Pennsylvania State University, moderated a short Q&A following the presentations, and then introduced a panel discussion.

ACCELERATED METALLURGY: DATA MANAGEMENT AND MACHINE LEARNING

Goddin spoke about Europe’s Accelerated Metallurgy Project, a large-scale research project that ran from 2011 to 2016 to speed up the identification of new, useful alloys, materials whose production and use make up a significant portion of Europe’s gross domestic product (GDP). He also shared an overview of the project’s approach to developing a Virtual Alloy Library and described how its insights have informed Granta’s other work.

The Accelerated Metallurgy Project

Only 61 of the 87 known metals are commonly available commercially, and identifying a promising new alloy takes years of trial and error, plus more years to develop it into a commercial product. There is a huge number of potentially useful alloys (roughly 32,000 potential ternary alloy systems alone), yet because the exploration process is so daunting, 90 percent of them have never been explored. The aim of the Accelerated Metallurgy Project was to change that.


The project covered six technology areas: (1) lightweight alloys for aerospace and automotive; (2) high-temperature alloys for rockets, turbines, and nuclear fusion; (3) high-temperature superconductors for electrical applications; (4) thermoelectric alloys for heat scavenging; (5) magnetic alloys for motors and refrigeration; and (6) phase change alloys for electronics. It was supported by a €21 million (~$23 million U.S.) investment and involved 32 industry and academic partners who developed both computational and experimental techniques.

The project successfully generated a huge volume of results. Standardization to compare and contrast experimental and simulation data was essential, Goddin said; because different methods yielded different results, both types of data were used in combination. Computational screening was used to identify candidate alloys; candidate materials were rapidly manufactured using techniques from additive manufacturing and then rapidly characterized using high-throughput techniques and selective lower throughput testing.

Machine learning (ML) was essential to the project’s iterative approach, where the network was trained to be able to extrapolate, identify, test, and capture data, and then retrained. Each iteration expanded the knowledge base within the project, enabling generation of promising alloy compositions. This accumulated knowledge was then pooled into the new Virtual Alloy Library, created and hosted by Granta.
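For illustration, the iterative train-predict-test-retrain cycle described above can be sketched in a few lines of code. This is a generic sketch, not the project’s actual pipeline; a random-forest model stands in for whatever network the project used, and the 10-dimensional descriptors and the run_experiment placeholder are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-ins: descriptor vectors for a candidate pool and a placeholder
# "experiment" that returns a measured property value for a chosen composition.
candidate_pool = rng.random((5000, 10))
def run_experiment(x):                          # placeholder for synthesis + characterization
    return float(-np.sum((x - 0.5) ** 2))       # toy property with an optimum near x = 0.5

# Seed knowledge base: a handful of already-characterized alloys.
X_known = rng.random((20, 10))
y_known = np.array([run_experiment(x) for x in X_known])

for iteration in range(5):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_known, y_known)                 # (re)train on everything measured so far

    preds = model.predict(candidate_pool)       # extrapolate over the unexplored pool
    top = np.argsort(preds)[-10:]               # identify the most promising candidates

    new_X = candidate_pool[top]
    new_y = np.array([run_experiment(x) for x in new_X])   # "test" them

    X_known = np.vstack([X_known, new_X])       # capture results, expanding the knowledge base
    y_known = np.concatenate([y_known, new_y])
    candidate_pool = np.delete(candidate_pool, top, axis=0)

    print(f"iteration {iteration}: best measured property so far = {y_known.max():.3f}")
```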

The Virtual Alloy Library

The Virtual Alloy Library houses the collected computational and experimental data on a scale never seen before, Goddin said. Unlike other projects, each partner had access to the full scope of results to aid commercialization, although the results were kept confidential and shared only among partners.

Data in the library underwent strong, detailed standardization to ensure that the data were comparable from one laboratory to another, maximizing their value. Multiple automatic linkages, from site location to alloy composition to simulations to test results, enabled data to be accessed and shared easily. In addition, all compositions were screened for potential risk factors, including screening for restricted substances, critical materials, environmental impacts, and cost.

The library’s customizable viewer provides access to materials pedigree and material records, including very complex data such as composition, price, physical and mechanical properties, product risks, and relationships between the data. Viewing all of these data at one time enhances the evaluation and selection process for promising alloys.

Data capture for the library was an intense process that started with 1,800 reference alloys from Granta’s Materials Universe database, as well as nearly 15,000 simulated alloys and more than 2,000 physical specimens. Granta created its Remote Import tool and used application programming interface (API)-generated machine integration to automate and standardize results capture. Automation was so successful that the approach has become a large part of Granta’s workflow.
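The presentation did not include integration code, but the idea of API-driven results capture can be sketched generically. Everything below is hypothetical: the instrument CSV columns, the record schema, and the submission endpoint stand in for whatever the Remote Import tool and the Granta MI API actually provide.

```python
import csv
import requests  # hypothetical REST submission; the real Granta tooling differs

# Hypothetical mapping from raw instrument column names to standardized attribute names.
COLUMN_MAP = {"UTS_MPa": "ultimate_tensile_strength_MPa",
              "YS_MPa": "yield_strength_MPa",
              "Elong_pct": "elongation_percent"}

def capture_results(csv_path, specimen_id, endpoint="https://example.org/api/records"):
    """Read one instrument output file and submit a standardized record."""
    with open(csv_path, newline="") as f:
        row = next(csv.DictReader(f))           # one test result per file in this sketch
    record = {
        "specimen_id": specimen_id,
        "source_file": csv_path,                # keep the link back to the raw data
        "properties": {std: float(row[raw]) for raw, std in COLUMN_MAP.items() if raw in row},
    }
    response = requests.post(endpoint, json=record, timeout=30)
    response.raise_for_status()
    return record
```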

Vision 2040 and ICME Implementation

The Accelerated Metallurgy Project ended in 2016, but Granta has learned some valuable lessons that Goddin suggested can inform the National Aeronautics and Space Administration’s (NASA’s) Vision 2040. Three of the cross-cutting streams that NASA identified are data-centric: data management, data analytics and visualization, and information sharing and reusability. These are core to Granta’s mission and were a large part of its role in the Accelerated Metallurgy Project.

In addition, Granta has an overall vision for supporting integrated computational materials engineering (ICME) through its workflows, Web services tools, simulation tools, data management, and enhancements to Granta MI, which is already one of the broadest and deepest materials data repositories in existence. The organization also has initiatives to more fully incorporate ML and artificial intelligence (AI), which can be especially helpful now that reliable materials information is more readily available for model training.

Q&A

June Lau, NIST, asked how often data standards were changed, and if that required updating data models. Goddin replied that Granta develops standards for particular domains, such as metals or composites, which incorporate established best practice from across multiple leading manufacturers in each domain, although a certain degree of customization always remains, owing to individual data legacies. Better standardization of test outputs would require linking test standards and data models, something that can get overlooked when focusing on test procedure.

Another participant asked about techniques for capturing measurement metadata. Goddin replied that Granta tries to capture as much relevant metadata as possible: it is not always clear which data will be important once the data start being used; unneeded data can always be removed or ignored; and, by maintaining links to the original data files, new data can be readily added later. Standardization includes predetermining the minimum data needed, such as machine output, original materials, and specimen photographs. The ideal process is to think backward, from the end of the project to the beginning, to understand what data are needed at every stage in order to achieve the best results from the final data output. From this, a detailed set of data requirements can be constructed and standardized to ensure consistency.
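Working backward to a minimum set of required fields can be expressed as a simple validation step, sketched below with hypothetical field names chosen for illustration.

```python
# Hypothetical minimum data requirements for a mechanical test record,
# derived by working backward from the analyses the final data must support.
REQUIRED_FIELDS = {
    "machine_output_file": str,    # raw instrument export, kept for traceability
    "material_batch_id": str,      # link back to the original material pedigree
    "specimen_photo": str,         # path or URL to the specimen photograph
    "test_standard": str,          # which test standard the procedure followed
    "operator": str,
    "test_date": str,
}

def check_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record meets the minimum."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing required field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"field {name} should be {expected_type.__name__}")
    return problems
```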


TOWARD A GRAND UNIFIED PRACTICE OF ASTRONOMICAL METADATA

Peek detailed several data sharing repositories developed and used by the Space Telescope Science Institute and shared lessons from the astronomy community that he suggested may help the materials community as its use of data analytics intensifies.

Data Sharing Through Repositories

When it comes to data sharing, Peek noted that astronomers have two important advantages compared to the materials community: a common grid along whose lines the universe can be mapped, and the fact that most astronomy data is commercially valueless, thus minimizing both the penalties to sharing it and the incentives to hoard it. Peek pointed to several repositories of astronomical data that are both large and easy to access and use, making them highly valuable to astronomers. The Mikulski Archive for Space Telescopes contains data collected at different times, with different equipment, for different purposes, that can be layered together to reveal and describe thousands of structures in the universe. Similarly, NASA’s Extragalactic Database houses and links observational galaxy data to facilitate combining and sharing data to yield new insights.

Major telescopes, such as Chandra and Hubble, also have major data repositories associated with them. While their use is ostensibly targeted toward specific scientific questions, the specialized data they collect for individual projects eventually become publicly available. In fact, the majority of papers resulting from these telescopes’ data represent such “secondary” use of the data. Open data sharing and access promotes diversity within the community and enhances the value gained from these billion-dollar telescopes. In addition, the Sloan Digital Sky Survey, which has surveyed huge swaths of the sky, is explicitly designated for open, archival data and has so far yielded powerful maps and unexpected insights.

Mechanisms for Standardization and Exchange

Peek explained how the Virtual Observatory (VO), the International Virtual Observatory Alliance (IVOA), and the Common Archive Observation Model (CAOM) have facilitated data sharing across the astronomy community.

The VO is an ecosystem of astronomy standards, tools, and groups whose goal is to enable researchers to discover and share data. For example, it can be used to request all data collected from all existing sources for a particular part of the sky (known as a cone search), and that data will be standardized, enabling access and analysis via multiple different tools.
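A cone search of this kind follows the IVOA Simple Cone Search protocol: an HTTP GET carrying the cone center (RA, DEC) and search radius (SR) in decimal degrees, returning a standardized VOTable. A minimal sketch follows; the service URL is a placeholder rather than a specific archive endpoint.

```python
import requests

# Placeholder URL; any VO-compliant Simple Cone Search service accepts these parameters.
SERVICE_URL = "https://example.org/scs/search"

params = {
    "RA": 210.8025,   # right ascension of the cone center, decimal degrees
    "DEC": 54.3489,   # declination, decimal degrees
    "SR": 0.1,        # search radius, degrees
}

response = requests.get(SERVICE_URL, params=params, timeout=60)
response.raise_for_status()
votable_xml = response.text   # a VOTable document, parseable with astropy.io.votable
print(votable_xml[:200])
```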

The IVOA defines data standards for the VO. Because it comprises a wide array of national-level organizations, the standards-defining process can be rather complex, making its biggest challenges cultural, not technological. For example, IVOA members have different viewpoints on how the VO should be governed, and the VO has much more support in Europe than in the United States.

The success of the IVOA depends on whether major projects or individual champions emphasize its importance and understand VO compliance, Peek said. Community uptake of standards is also crucial; some existing standards, however useful, have never been used. Last, the IVOA can be slow and formal. That is important in standards work, but software moves fast, and the disconnect can make it difficult to create unified support for standards.

The CAOM has been more successful, Peek said. It was designed to work across many applications and implementations; focuses at the center level, rather than the user level; is designed specifically for astronomy; and has a relatively generic structure, making it useful for multiple data types. Its common software set standardizes metadata from any telescopic observation, enabling simultaneous, multilevel elastic data searching, access, and sharing from different telescopes and different time periods. Raw data, metadata, and data specific to the artifact under study are layered together for analysis and interpretation.
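The layering Peek described can be illustrated schematically. The sketch below is a simplified, illustrative nesting of artifact-level files under observation-level metadata, not a faithful rendering of the CAOM schema; the field names are trimmed to a handful of examples.

```python
from dataclasses import dataclass, field
from typing import List

# Simplified, illustrative layering in the spirit of CAOM: a telescope observation
# holds one or more data products (planes), each pointing at concrete files (artifacts).

@dataclass
class Artifact:
    uri: str                      # where the file lives (raw frame, calibrated image, ...)
    content_type: str             # e.g., "application/fits"

@dataclass
class Plane:
    product_id: str               # e.g., "raw" or "calibrated"
    data_release: str             # when the product becomes public
    artifacts: List[Artifact] = field(default_factory=list)

@dataclass
class Observation:
    collection: str               # e.g., "HST", "CHANDRA"
    observation_id: str
    instrument: str
    planes: List[Plane] = field(default_factory=list)

obs = Observation(
    collection="HST",
    observation_id="obs-0001",
    instrument="WFC3",
    planes=[Plane("calibrated", "2020-01-01",
                  [Artifact("https://example.org/obs-0001_cal.fits", "application/fits")])],
)
```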

Lessons for the Materials Community

The astronomy community’s use of data can provide the materials community with several lessons. First, developing standards requires an iterative process of building, developing, deploying, using, and revising; the process can fall apart if use or development stops. CAOM was able to follow this process successfully because people were willing to deploy and revise, which is vital for keeping standards relevant and useful. While building consensus is a slow process, successful and usable standards ultimately accelerate and improve the science.

Data sharing creates powerful returns on data that may initially seem useless. To facilitate data sharing, Peek suggested it would be useful to find common conceptual grids through which materials data can be exchanged. In addition, he said it is important to use data models to enhance data discovery and physical understanding, to recognize the roles that culture and personalities play in the process of defining common modes, and to create and implement useful standards through an iterative deploy-and-revise process.

Q&A

Apurva Mehta, Stanford University, asked why astronomers were motivated to share their data if doing so did not help them personally. Peek replied that there are several reasons: Researchers are not motivated by financial incentives, because there are no commercial opportunities in astronomy; requested data are processed at the repositories and then sent to scientists, so the data go into the repository before the researcher even sees them; and currently there really is only one Hubble Telescope, so there is no competition or alternative if a researcher does not like NASA’s terms of use.

James Warren, NIST, commented that there are different viewpoints on standards, and Peek agreed that there is a disconnect between those who value getting work done thoroughly and those who value getting work done fast. Ann Racuya-Robbins, World Knowledge Bank, asked how individual biases might influence projects. Peek replied that biases are hard to escape, but it is important to reward those who, for the greater good, spend extra effort to make data public and accessible. Bias can also depend on funding sources. In general, the community should judge projects by how open, available, and standards-compliant they are, not just by whether they work.

NEXT-GENERATION MATERIALS GENOME INITIATIVE, DRIVEN BY DATA AND AI

Green described the Materials Genome Initiative (MGI), which combines computational tools, experimental tools, and digital data to find new materials that will be crucial to realizing new technologies. He argued that using AI within the MGI framework will create a new materials science paradigm, dramatically decreasing the costs and time needed for new materials discovery.

Green focused on three specific areas that are fueling MGI: advancements in materials data, AI-driven materials science, and the High-Throughput Experimental Materials Collaboratory (HTE-MC).

Data Challenges in Materials Science

In materials science, as in all scientific domains, there is organized, discoverable, accessible, and interoperable big data; “long-tail” literature data that are discoverable and accessible but may not be interoperable; and “dark” data—data that are unpublished, unfindable, inaccessible, and not interoperable.1 Because data are so crucial to AI, Green said that finding dark data and making it usable is a critical challenge for enabling AI-driven discovery.

Another challenge is that the data are widely distributed across multiple repositories. They are discoverable via individual registries, and some of them may be interoperable, but they are still not centralized. To address this, Green stressed that all data are valuable and should follow the Findable, Accessible, Interoperable, Reusable (FAIR) Guiding Principles.2 Achieving this in turn necessitates standards for the data themselves, for metadata, and for data exchange.

___________________

1 A.R. Ferguson, J.L. Nielson, M.H. Cragin, A.E. Bandrowski, and M.E. Martone, 2014, Big data from small data: Data-sharing in the “long tail” of neuroscience, Nature Neuroscience 17(11):1442-1447, http://doi.org/10.1038/nn.3838.

Compared to astronomy, Green said that materials science has many more variables that must be taken into account, different communities who have to agree on data standards, and more types of data that have to be standardized to create the much-needed paradigm shift that will enhance materials science data sharing.

The Value of AI

AI-driven materials science presents the opportunity to create a fully robotic system for discovery and validation of new materials, Green said. In the traditional materials science approach, humans accumulate years of knowledge and then use “dumb” instruments to perform experiments and develop models. With AI, humans can work in partnership with intelligent systems (run by algorithms into which have been embedded physical laws and appropriate accumulated knowledge) to perform autonomously driven experiments and simulations much faster and at a much lower cost than humans could achieve alone.

Engaging AI will require a paradigm shift toward a workflow that moves from measurements to manifolds (the surfaces that represent the physical reality of the data) to models. With AI, every measurement can be informed by knowledge gained from previous measurements, creating better manifolds for better models. AI can discover the needed manifolds in two different ways: by collecting multitudes of data, or by making intelligent guesses based on scientific constraints and fewer data points.

Materials exploration is a complex, multidimensional space with multiple data points, requiring more experiments than humans can perform. Only with AI can we achieve maximum knowledge with minimum experimentation, Green said, noting that feeding the algorithm the necessary data about materials physics and structures ensures that outputs will be restricted to physically realizable conclusions. Another key strength of this paradigm is that AI can find innovations even when analyzing small, complex collections of data by finding patterns and outliers.

The HTE-MC

High-throughput experimentation already enables rapid, parallel experiments that generate high-quality and consistent data sets. Green described how the HTE-MC facilitates AI-driven materials science by increasing access to high-throughput experimental tools via a materials “collaboratory.”3

___________________

2 M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, et al., 2016, The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3:160018, http://doi.org/10.1038/sdata.2016.18.

As Green and colleagues envision it, the collaboratory would be a federated network of virtual high-throughput experimental synthesis and characterization tools supported by state-of-the-art data and AI platforms, enabling data to be rapidly generated and shared. The platform would facilitate the three main components of high-throughput experimentation: synthesis of a combinatorial data library, measurement of chosen variables, and data analytics, with all the data and metadata captured and shared via a common database. The platform would work for any material, but the tools will be material-specific. As an opening step, NIST has created an experimental resource registry with data, high-throughput libraries, instruments, software, and utilities.

Q&A

Tresa Pollock, University of California, Santa Barbara, pointed out that combinatorial experiments are powerful but very slow. Green agreed, and suggested that they could be sped up if, instead of making libraries, a synthesis tool could be used to perform analysis in real time. Green also pointed to the dramatic acceleration that occurred with genomic sequencing, suggesting that a similar acceleration could be achievable in materials discovery, although creating the tools to make that possible will require significant investment.

PANEL DISCUSSION ON DATA CURATION

Susan Sinnott, Pennsylvania State University, introduced the speakers for the workshop’s second panel discussion: B.S. Manjunath, University of California, Santa Barbara; Ichiro Takeuchi, University of Maryland; and Cormac Toher, Duke University. Following their remarks, Hull moderated an open discussion (Q&A).

B.S. Manjunath, University of California, Santa Barbara

Manjunath’s team created BisQue, a Web-based, open source platform for data management solutions. The platform was created to better manage large collections of unstructured data, such as images and multidimensional data items, as well as to facilitate scalable, reproducible computation, data annotation, and sharing of data and methods.

___________________

3 M.L. Green, C.L. Choi, J.R. Hattrick-Simpers, A.M. Joshi, I. Takeuchi, S.C. Baron, E. Campo, et al., 2017, Fulfilling the promise of the materials genome initiative with high-throughput experimental methodologies, Applied Physics Reviews 4:011105, https://doi.org/10.1063/1.4977487.


BisQue’s solutions include management, analysis, and sharing of images and metadata; a flexible, scalable query system; a module system for scalable analysis integration; support for multiple formats; and the Analysis Marketplace, where researchers can share and discover new modules. In addition, it can be used for any data type; can manage multimodal data and five-dimensional (5D) graphical data sets; and offers customizable analysis, searching, and organizing tools. BisQue is primarily designed for microscopy images, such as those commonly used in neuroscience, marine science, and medicine, but it can also enhance materials models that require data mining, management, analysis, and visualization.

Materials scientists can also take advantage of the fact that the DREAM.3D software infrastructure for visualizing material microstructures has been integrated into BisQue. In addition, neural networks within data management platforms, especially those that include image synthesis and generation tools, hold promise for enhancing materials science research and discovery, although Manjunath cautioned that neural networks can still be fooled by malicious actors.

Ichiro Takeuchi, University of Maryland

Takeuchi spoke about the evolution of data challenges facing high-throughput and combinatorial materials science. One of the early challenges, he said, was in making hundreds of samples quickly and correctly. The challenge then shifted to the struggles of working with the large amount of data from these samples, generated by advanced instrumentation and measurement tools. Analysis tools—clustering, data mining, visualization, and ML—have helped handle the complexity and diversity of data. As an extension of this, Bayesian optimizations and active learning are now being used to control the sequence of experiments and lower the number of experiments required, reducing overall costs and time.
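The active learning loop Takeuchi described can be sketched with a Gaussian process surrogate and an expected-improvement acquisition function. This is a generic illustration with a one-dimensional toy objective standing in for a real measurement, not a description of any specific group’s implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def measure(x):                                   # toy stand-in for an experiment
    return float(np.sin(3 * x) + 0.5 * x)

grid = np.linspace(0, 3, 300).reshape(-1, 1)      # candidate experimental conditions
X = np.array([[0.3], [2.7]])                      # two initial measurements
y = np.array([measure(x[0]) for x in X])

for step in range(8):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
    gp.fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)

    # Expected improvement over the best measurement so far (maximization).
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    x_next = grid[np.argmax(ei)]                  # next experiment chosen by the acquisition
    X = np.vstack([X, [x_next]])
    y = np.append(y, measure(x_next[0]))

print(f"best condition found: x = {X[np.argmax(y)][0]:.2f}, value = {y.max():.3f}")
```

Each new measurement updates the surrogate, so the loop spends experiments where they are expected to be most informative, which is the mechanism by which the number of required experiments is reduced.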

It is also now possible to perform experiments guided by high-throughput computations. However, sometimes the appropriate computations or theoretical predictions do not exist—for example, there are not many predictions of new superconductors. In such a situation, one would like to turn to experimental databases. An AI model whose predictions adhere to physical rules can be built from even a small, disparate collection of experimental data. The problem is that for most functional materials, well-curated experimental databases do not exist. Such data gaps present a challenge for applying ML to materials discovery, Takeuchi said.


Cormac Toher, Duke University

Toher discussed the AFLOW4 framework for automated computational materials design. Its database has almost 3 million entries and more than 500 million computed properties. It also offers advanced search capabilities, a library of crystal prototypes, online ML capabilities, and a convex-hull generator for thermodynamic analysis.

The database is organized into separate, stacked layers: from project (such as all ternary alloy systems), to set (such as one specific alloy system), to calculation (for one specific decorated alloy prototype), and ultimately the materials data and characteristics. Its API supports search queries and enables complex searching that can be integrated into a workflow. In the future, AFLOW will be integrated into Open Databases Integration for Materials Design (OPTiMaDe). The team also plans to build a closed feedback loop for computation, ML, and experiments.
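Programmatic queries against a database API of this kind look roughly like the sketch below. The endpoint and keyword syntax are assumptions patterned on AFLOW’s AFLUX “summons” search language and should be checked against the current API documentation before use.

```python
import requests

# Assumed AFLUX-style endpoint and "summons" syntax; verify both against the
# current AFLOW API documentation before relying on them.
BASE = "http://aflow.org/API/aflux/?"
summons = "species(Ti,O),nspecies(3),Egap(0.5*),$paging(1)"   # Ti-O ternaries with a band gap

response = requests.get(BASE + summons, timeout=60)
response.raise_for_status()
entries = response.json()          # assumed to be a list of matching entries with properties

for entry in entries[:5]:
    print(entry.get("compound"), entry.get("Egap"))
```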

OPEN DISCUSSION

Following the opening remarks, panelists and participants discussed several challenges to applying data analytics and ML in materials science, ideas for improvements to current data sharing practices, and ways to ensure data accuracy.

Challenges to Applying Data Analytics and ML

Hull kicked off the discussion by asking each panelist to name the single biggest challenge to successfully applying data analytics and ML to new materials discovery. Takeuchi answered that more experimental databases, which are just as important as computational resources, are needed. The data exist, but are buried in the literature, he noted, suggesting that ML could be used to populate a database via text analysis and machine reading.
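Real machine-reading pipelines rely on trained entity and relation extraction models; the toy sketch below uses only regular expressions to convey the basic idea of pulling composition-property pairs out of text. The example sentences and patterns are invented for illustration.

```python
import re

# Invented example text; real pipelines operate on full-text articles.
text = ("The alloy Fe70Ni15Cr15 exhibited a yield strength of 850 MPa. "
        "The alloy Al80Cu20 showed a yield strength of 310 MPa.")

# Very rough patterns: an element-number composition string, and "yield strength of <value> MPa".
composition_pat = re.compile(r"\b(?:[A-Z][a-z]?\d+(?:\.\d+)?){2,}\b")
property_pat = re.compile(r"yield strength of\s+(\d+(?:\.\d+)?)\s*MPa")

records = []
for sentence in re.split(r"(?<=[.;])\s+", text):
    comp = composition_pat.search(sentence)
    prop = property_pat.search(sentence)
    if comp and prop:
        records.append({"composition": comp.group(0),
                        "yield_strength_MPa": float(prop.group(1))})

print(records)
# [{'composition': 'Fe70Ni15Cr15', 'yield_strength_MPa': 850.0},
#  {'composition': 'Al80Cu20', 'yield_strength_MPa': 310.0}]
```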

Manjunath agreed, noting that easy data access is key because the amount of data available limits the ability for algorithms to learn. He suggested that an interdisciplinary team of computer and materials scientists should collaborate to create high-quality training databases. Toher added that access to experimental data would also help verify computational results. However, the fact that much of the literature is in PDF form is an impediment to mining the data it contains, a participant noted.

The participant suggested that if more databases and outcomes were published, there would also be more negative data on which to train ML models. Gareth Conduit, Intellegens, asked how access to negative data could be improved. Takeuchi replied that negative information is important, and it is worth doing extra work to include negative results in what is published and therefore findable. Missing data can limit the models and introduce bias. Hull and Conduit added that behavioral information could be used to fill in gaps.

___________________

4 See the Center for Autonomous Materials Design, Materials Science, Duke University, AFLOW website at http://aflow.org/.

Lau asked panelists to comment on how ML systems can be influenced by bias, a problem that has become apparent in other fields when ML has been applied to analyze photos of human faces. Toher replied that bias is another reason that negative results should be published, because training models need negative data in order to reduce bias. Takeuchi reiterated that bias is always present, though Manjunath noted that approximations, where interpolations are made based on the available data, can sometimes help.

Another challenge is the potential for AI and ML to be misused by malicious actors. In reply to a question by Sinnott, Manjunath acknowledged that AI and ML can be used for adversarial attacks or creating deep fakes, making verification and authentication important issues to address. Toher added that in computation, data provenance and parameters to enable data verification are essential. Takeuchi added that we as a society risk oversimplifying problems or overhyping AI when we focus too much on algorithms instead of useful results for real questions.

Another challenge, pointed out by Brian Storey, Olin College and Toyota Research Institute, is that, unlike genomic data, materials data sets are isolated for different applications, such as batteries, semiconductors, or glasses, which adds to the complexity. Takeuchi agreed that this is an inherent difficulty in materials science, and added that there are financial incentives, such as patents and profits, to avoid sharing data.

Improving Data Sharing

Participants discussed several ideas for incentivizing and improving data sharing practices. In response to a question from Pollock, Manjunath replied that a platform or system to facilitate reproducibility could have the biggest impact on increasing data sharing. Takeuchi urged the materials community to be more open to data sharing, noting that there is a culture change under way that is trending in the right direction, albeit slowly. Toher added that an open, nonproprietary, long-lasting data format can have a big impact on accessibility, which is required for sharing.

Pollock also wondered if it might be possible to influence instrument manufacturers to facilitate structure and imaging data collection. Takeuchi agreed that better data collection methods in these realms would help. Some experiments do record structural data, but those details are rarely published, and data refinement is time and labor intensive.

Haydn Wadley, University of Virginia, along with several other participants, discussed the idea of incentivizing data sharing. One participant noted that although some funding agencies demand it, and it is the right thing to do, scientists still need incentives to share their data. Takeuchi replied that historically, materials data has been a business commodity, where companies own data, and sharing was discouraged. Scientists can also be reluctant to share their data until they feel they have finished using that data. Lau countered that scientists may be incentivized if data sharing were better integrated into the publication process, ensuring that authors receive credit and a permanent digital object identifier (DOI) for their data contributions.

A participant suggested that sharing models instead of data could alleviate some of these problems. Toher replied that this practice is common in the computational community, and any technical hurdles could be overcome—for example, by making the model available for direct queries, as opposed to a separate download. Manjunath added that model sharing could also be coupled with monetary incentives to make it more appealing.

Ensuring Data Accuracy

Hull pointed out that review and assessment to ensure accuracy is a vital part of data curation. An error in one measurement or property could introduce systemic errors throughout an entire database. Takeuchi suggested reporting errors where they are found and perhaps creating a new system for comparing and calibrating databases. Toher added that understanding the methodologies for computational and experimental data is imperative, but where the onus should be—on the data owner or data user—is a difficult question.

Green noted that it is difficult to validate data without metadata, that curating data is prohibitively expensive, and that users should determine for themselves whether the data are sound. One participant added that comparing first-principles calculations can help ensure data reliability, and another stated that the metadata and data documentation are especially important for validation and interoperability of materials data. Takeuchi added that uncertainty should be calculated for all experimental data.

Florencia Paredes, Citrine Informatics, noted that Citrine’s academic, government, and industry partners frequently want to use public data, but those data points rarely have the right metadata or parameters to be used effectively. Takeuchi agreed that metadata, and indeed all experimental conditions, should be as comprehensive and detailed as possible. Toher added that on the computational side, it is important to record methodology in detail as well.

Bill Mahoney, ASM International, asked if peer review could help ensure data accuracy. Manjunath answered that reviewers should be able to validate and verify data in addition to the conclusions, which would also help data sharing. Takeuchi added that peer review and monetizing data could both incentivize good data practices. Toher noted that problems could also be reported in a less formal review and feedback loop.
