- Volume. Volume refers to the physical amount of data. Dr. Pitera stated that volume-related problems in big data are as much about scaling and the cost of handling that much information as they are about structure.
- Velocity. Velocity is both the rate at which data are collected and the time between receiving input and requiring an output. Dr. Pitera gave several examples of different time frames: In some instances, the maximum duration is several minutes (such as the time scale of a “coffee break,” the time it takes for a researcher to get a coffee and return to work); in other instances, the maximum duration is a few seconds (such as the time a call center operator might need to return an answer to a client).
- Variety. Data come in a wide variety of formats.
- Veracity. There are many sources of uncertainty associated with data quality. Data can be limited by the quality of both the instrument and the interpretation, and they can be influenced by various artifacts.
Dr. Pitera stated that managing the four V’s requires expertise in data (acquisition, integration, cleansing, storage, protection, and management), domain (models and hypotheses), informatics (algorithms, simulations, and rules), mathematics (analysis), systems (scale and velocity), and visualizations (sharing results). He said that his group’s efforts focus primarily on the first two areas of expertise (data and domain): How do we get data, ingest them, and structure the domain of interest to capture the needed information? He posited that there are opportunities for big data in all types of study. The discovery phase focuses on conducting computational experiments, mining the literature (the published literature, as well as unpublished laboratory documentation), and finding new materials or repurposing existing ones. Dr. Pitera postulated that people tend to focus on the discovery phase. However, the material must be made usable, and integration remains a challenge. Life-cycle issues, including continuous reengineering and monitoring, are all topics to be explored.
Dr. Pitera described three representative projects in materials and big data:
- Harvard’s Clean Energy Project. This is a distributed computing project that focuses on using big data for materials discovery. It seeks to identify new photovoltaic materials with higher efficiency rates. This involves large-scale calculations, along with data mining and analytics.2
- Pharmaceuticals. Pharmaceutical companies wish to mine patent data and medical literature to identify new relationships between drugs and disease.
- Mining equipment. The goal of this project is to understand predictive equipment maintenance for heavy mining machinery.
2 See http://cleanenergy.molecularspace.org/about-cep/. Accessed June 2, 2014.