Much of the workshop discussion was driven by an overarching assumption: The materials science community would benefit from appropriate access to data and metadata for materials development, processing, application development, and application life cycles. Currently, that access does not appear to be sufficiently widespread, and many participants described the constraints and identified potential improvements that would enable broader access to materials and manufacturing data and metadata.
Data availability was a much discussed topic at the workshop. Several participants spoke of the difficulties associated with the fact that data are not always archived properly for long-term storage and that experimental data, including information about the procedures used and how the data were acquired, are not readily available in the public domain. Among the obstacles to data sharing discussed at the workshop were these:
- Intellectual property constraints. Several participants said that data ownership is an obstacle to data sharing. Thom Mason, of Oak Ridge National Laboratory, said that because individual researchers in the materials research community invest substantial time and effort in making and characterizing a sample, there will be challenges associated with moving to open-source data. Other participants noted that access to proprietary and other data owned by private industry can also be constrained because a company wishes to retain its own intellectual property. This then becomes a cultural consideration in the community (see the final theme in this chapter: Culture).
In discussion sessions, some participants also identified several challenges that arise when data are generated under a DOD contract. DOD contracts include many data requirements, and DOD needs to be aware that these requirements can sometimes be perceived as cost-prohibitive. One participant pointed out that material suppliers, original equipment manufacturers, and the government have a joint responsibility to share data. In some cases, a participant noted, suppliers provide the material but no corresponding metadata. Agreements with suppliers can take a year or two to develop, which can slow the rate of innovation. Julie Christodoulou, of the Office of Naval Research (ONR), stated that the Lightweight and Modern Metals Manufacturing Innovation Institute (LM3II) is currently sorting through issues related to levels of engagement and intellectual property. She believes that each project will likely have its own unique intellectual property arrangements. Other participants indicated that the Air Force Office of Scientific Research (AFOSR), ONR, the Defense Advanced Research Projects Agency (DARPA), and other defense agencies are exploring ways to share with the broader community data generated under a DOD contract.
- Data heterogeneity. Several participants pointed out that data are produced by many different people and organizations. They noted that when data are produced via heterogeneous distributed systems, there is no easy or centralized access.
Several possible remedies for the unavailability of data were explored at the workshop:
- Data-mining utilities. A participant noted the possibility of using data-mining utilities—something akin to a “materials Google”—to search existing data. This could improve access and increase ease of use.
- Storing materials instead of data. On several occasions, participants discussed the option of storing physical samples of a material rather than data about an experiment. During a discussion session, Denise Swink, a private consultant, asked whether reestablishing critical material repositories in DOD should be considered. She said there is a reluctance to do so but noted the large financial difference between stockpiling and archiving; a few of the workshop participants were probably more interested in archiving. One participant mentioned the Air Force’s digital twin program, in which the digital representation of a material retains information about the material’s properties; it might also be valuable to save an actual sample to examine alongside the digital twin. However, some workshop participants pointed out that critical questions would need to be addressed, such as how much material to retain, access criteria, and other policy issues. Others suggested that any sort of repository could be prohibitively expensive; to be viable, they suggested, priorities would have to be set for which materials to retain.
- Government mandate to store data. Several participants suggested that a data storage mandate might be useful, proposing that any government-funded data be put into a standard format and stored in some long-term external repository. A participant reported that this step is under consideration by the National Science Foundation (NSF) and the Department of Energy (DOE). A DOD participant said that DOD had explored the idea of such a mandate but found it to be time-consuming and expensive and, in the end, not cost-effective. He said that a mandate would have to be for more than a data format; it would need to answer questions such as who owns and maintains the data and where the information should reside. Someone else noted that NSF has begun developing a data repository that is not, however, very user friendly.
- Development of private data repositories and formats. Workshop participants also discussed data repositories and how to ensure that data remain accessible in the future. Several participants spoke of the need to define structured databases. Dan Crichton, of NASA’s Jet Propulsion Laboratory (JPL), said that data archives typically require at least a 50-year expected window of operability. For data to remain usable for that long, he pointed out, they would need to be captured in a stable format, even if that format is not contemporary, and should be software independent, relying upon static conceptual models. He noted in particular the importance of separating the data from the technology used to acquire it; otherwise, as the technology changes, the data could quickly be rendered obsolete. Michael Stebbins, of the White House Office of Science and Technology Policy (OSTP), discussed at length the White House open data initiative, which leverages existing and emerging information channels (such as journal publications) to create the data repositories of the future. Another participant suggested that DOD assess the cost of storage versus reproduction of the data to determine whether some sort of data repository (whether mandated or voluntary) makes sense. Yet another participant pointed out that it is fairly common for universities to have permanent storage facilities available, citing the Deep Blue program at the University of Michigan as an example. However, other participants pointed out that these programs are expensive for the host and do not always include metadata.
- Automatic data capture. Several participants briefly mentioned the use of automatic methods to capture raw and unstructured data from journal articles or to capture and upload raw data from instruments during an experiment. Dave Shepherd, of the Department of Homeland Security (DHS), discussed an automatic data-capture project in biosecurity known as algorithms for analysis, which aims to identify emerging technologies that could be used against the nation. It uses natural-language-processing software to find descriptors in the scientific literature, patents, and other scientific documentation. The data must be up-to-date and continually harvested. Such a project might be useful for the materials community as well.
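Such a descriptor-harvesting approach can be illustrated with a small sketch. The descriptor vocabulary and matching logic below are purely hypothetical, standing in for the curated ontologies and natural-language-processing models a real system would use:

```python
import re
from collections import Counter

# Hypothetical descriptor vocabulary; a production system would draw on
# a curated materials ontology rather than this toy list.
DESCRIPTORS = ["yield strength", "grain size", "heat treatment", "fatigue life"]

def extract_descriptors(text: str) -> Counter:
    """Count occurrences of known materials descriptors in a document."""
    counts = Counter()
    lowered = text.lower()
    for term in DESCRIPTORS:
        # Word boundaries keep terms from matching inside longer words.
        hits = re.findall(r"\b" + re.escape(term) + r"\b", lowered)
        if hits:
            counts[term] = len(hits)
    return counts

abstract = ("Heat treatment at 500 C increased yield strength by 12%; "
            "grain size was unchanged, and yield strength remained stable.")
print(extract_descriptors(abstract))
```

Harvesting such counts continuously across the literature, as the DHS project does, would let emerging terms be tracked over time.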
Several speakers noted the recent overall rise in scientific data output. For instance, Mr. Crichton showed that the amount of data produced at JPL has been increasing at a significant and highly nonlinear rate: In 2000, a planetary mission data set contained 4 TB of data. Now, such a project consists of over 500 TB of data. During their presentations, Jed Pitera, of IBM, Mr. Shepherd, and Chuck Ward, of the U.S. Air Force, all noted huge increases in data rates in their communities as well.
However, participants repeatedly questioned whether materials science truly has a “big data” problem or whether it simply has a “data” problem. Several participants noted that the growth in data in materials science, while substantial, is not as dramatic as in other scientific disciplines. Ms. Swink argued that important data issues in materials science, such as proprietary data access and the lack of homogeneity, are data problems but not “big data” problems. Other participants indicated that, regardless of the regime, the amount of data produced in materials science is outpacing the algorithms and processing needed to analyze it. They pointed out that the traditional model of data analysis, which relies on the individual researcher, is not likely to remain viable as data sizes continue to increase. At the local level, there may be more data than the researcher has time to analyze. This could indicate a need for more centralized analysis and computing tools. Several participants suggested looking at successful big data analysis techniques used in other domains, such as protein databases and work flow models.
Dr. Pitera suggested that the materials science community may be well served by using data reduction or extraction techniques so as to exit the big data regime—in other words, to make the “haystack” smaller and the search for the needle easier. (The idiom “to find a needle in a haystack” was taken up by participants and repeated many times over the course of the workshop.) However, other participants wondered how to perform data selection in the most judicious and domain-specific manner. One person mentioned the need to determine the necessary approach
and its associated trade-offs: that is, whether one must interface with a large mass of data or whether one needs algorithms to select the salient features of the data one needs. In a discussion session, Jesus de la Garza, of Virginia Tech, pointed out trade-offs between collecting all the data possible (even if we don’t know what to do with them) and collecting only the information we know we want. He observed, however, that most research decisions require an assessment of trade-offs, and this is no different.
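As one concrete illustration of shrinking the haystack, the sketch below drops measurement columns whose variance falls below a threshold, one simple data-reduction heuristic among many; the data and the threshold are illustrative assumptions, not a method proposed at the workshop:

```python
from statistics import pvariance

def drop_low_variance_columns(records, threshold=1e-6):
    """Remove columns that are (nearly) constant across all records,
    shrinking the data set before more expensive analysis."""
    if not records:
        return records
    keep = [j for j in range(len(records[0]))
            if pvariance([row[j] for row in records]) > threshold]
    return [[row[j] for j in keep] for row in records]

# Toy measurements: the middle column is constant and carries no information.
data = [[1.0, 5.0, 0.2],
        [2.0, 5.0, 0.4],
        [3.0, 5.0, 0.9]]
print(drop_low_variance_columns(data))  # [[1.0, 0.2], [2.0, 0.4], [3.0, 0.9]]
```

The trade-off Dr. de la Garza described appears here in miniature: a low threshold keeps nearly everything, while a high one risks discarding features that later prove salient.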
Many of the participants referred to the “four V’s” of big data (volume, velocity, variety, and veracity)1 several times during the course of the workshop. Although no one ranked them in importance with respect to materials science, much discussion time was devoted to the last two—variety and veracity. Individual participants noted that there are many sources of uncertainty in materials data, with no consistent methods to verify data quality. This problem underscores the benefit of openly sharing data, so that they can be verified independently.
Dr. Pitera pointed out that data quality can be limited by instrument quality as well as by interpretation quality, and it can be influenced by various data artifacts. Mr. Crichton argued that peer review of data is necessary to assess usability; the international community should agree to standard models and to a consistent peer review process. He suggested that research supported by taxpayer dollars should make its data available and citable: Citations would give the reader confidence that the data set is of publication quality. Several participants discussed challenges associated with the verification of proprietary data. Someone noted that journal publishers are a good point of leverage in the academic community; however, in disciplines such as pharmaceuticals, publication is not a high priority. The data in those instances are used to build business and are considered proprietary. It becomes difficult, if not impossible, for the outside community to gauge the quality of those data.
Several workshop participants complained that, in the materials science community, it is difficult to compare the results of simulations to physical measurements. To help resolve this, Jesse Margiotta, of DARPA, noted that Integrated Computational Materials for Engineering (ICME) has a vigorous verification and validation component used to provide confidence limits. Another participant suggested using a materials work flow platform (such as Kepler2) to capture data and enable reproducible results.
1 For a full description of the four V’s, see the section “IBM and Big Data” in Session 1.
Another challenge lamented at the workshop was the lack of standards and terminologies for data and metadata. Several participants noted the absence of a formal ontology for materials science and the need for a practical set of identifiers and descriptors. It was noted during a discussion that the materials community may suffer from a dearth of conversation about ontology.
One participant complained that the process for developing standard terminology is very difficult and slow, pointing to the ASTM International committee that once developed standards in this area. In general, however, companies do not like to pay their employees to do this type of activity, and the committee folded because there was no funding for it from the community. The same participant noted that companies may not be interested in attaching themselves to a certain standard format because they are concerned that they would be forced to share information they prefer to keep proprietary in their own formats. During his presentation, Mr. Crichton suggested that the international community should agree to standard models and to a consistent peer review process. He pointed out that agreement on how to represent data can be very difficult because different scientists will have different emphases within the same data set. Mr. Shepherd also pointed out problems with the many formats, resolutions, and source locations of large data sets.
To move forward, one participant suggested looking to the NSF program EarthCube3 as a model for working across different communities to develop ontologies and names. Dr. Margiotta also reported that DARPA, along with the Army Research Laboratory (ARL) and other program partners, is developing methods to standardize data fields and metadata fields for materials and materials processing.
The concept of metadata availability had several meanings at the workshop. Mainly it referred to access to knowledge and information about a particular experiment—the models used, the starting conditions, and other “meta” information about the data—in other words, information that is not generally available in a journal article, the absence of which limits one’s ability to replicate the experiment. Many participants repeatedly stressed the need to capture and report metadata.
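A minimal sketch of what a machine-readable record of such “meta” information might look like is shown below; the field names are hypothetical, not a schema discussed at the workshop, and JSON serves here simply as a stable, software-independent text format of the kind Mr. Crichton advocated:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ExperimentMetadata:
    """Minimal experiment metadata record; all field names are illustrative."""
    sample_id: str
    instrument: str
    model_used: str
    starting_conditions: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Plain JSON text keeps the record readable long after the
        # software that acquired the data is gone.
        return json.dumps(asdict(self), sort_keys=True)

record = ExperimentMetadata(
    sample_id="A-0042",
    instrument="neutron diffractometer",
    model_used="phase-field v1",
    starting_conditions={"temperature_K": 300, "pressure_kPa": 101.3},
)
print(record.to_json())
```

Attaching a record like this to every archived data set would address the replication gap described above: a later researcher could recover the conditions under which the data were produced without consulting the original team.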
In other cases, metadata availability referred to broader access to the models themselves; there were several separate discussions about the need to have a standard modeling toolkit available to researchers. A participant suggested that, for the small-scale researcher, data production is outpacing computing capabilities. In the case of neutron beams, for example, the bottleneck is in processing data, not in gaining access to beam lines. Continued progress is needed in developing new data analysis programs. Several participants and speakers discussed the idea of moving data computation to the source of the data. (This is also related to the workshop discussions of data size, summarized above.) Mr. Crichton said that this paradigm is likely to become increasingly important in the next 5 years, as data sets become increasingly massive. More analysis tools and capabilities will be needed at the site of the data repositories. Mr. Crichton predicted that merely distributing data would soon become obsolete; instead, providing research services and analytics will be more advantageous, as users will need services rather than just the data.
Mr. Shepherd, of DHS, described a project in predictive biology that uses knowledge-based (KBase) data to combine data for microbes, microbial communities, and plants into a single integrated data model. Users can upload their own data and build predictive models, making it more of a community effort in big data. Its plug-in architecture will allow other laboratories to use their own algorithms to analyze the data. The project mixes big data, high-performance computing, and cloud architecture. This type of community modeling effort may be a useful example for the materials science community.
Several participants stressed the importance of developing and distributing models for predictive purposes, particularly to determine inspection and maintenance intervals and to predict life as a function of use. Dr. de la Garza emphasized the importance of prediction rather than reaction; he suggested that big data analytics for prediction would be a powerful tool in the materials community. Mr. Shepherd and Dr. Pitera both provided examples of predictive modeling in different areas (biosecurity and equipment maintenance, respectively).
Several participants noted that the materials community lacks a data-sharing culture. This is likely the result of a variety of factors, they believed. The most prominent factor is the reluctance, particularly on the part of private industry, to share proprietary data or other data that may have value. One participant noted that withholding materials data is practiced in order to gain a competitive edge, and companies want to protect their intellectual property. Another factor discussed was the difficulty associated with working under sometimes complex DOD contract rules and regulations.
Several participants discussed the need for incentives for sharing data. They proposed specific incentive models for sharing and structuring materials data in
lieu of purchasing data ownership rights. Such incentives included the use of data citations, the development of quid pro quo relationships, and the creation of precompetitive partnerships. Some participants suggested using existing information channels to increase the opportunity for data sharing, such as linkages from journals to data repositories, consistent with the goals of the White House initiative on openness and access to scientific data.