strings to be stored, indexed, and retrieved, but also as potential sources of discovery and knowledge, requiring sophisticated analysis techniques that go far beyond classical indexing and keyword counting, aiming to find relational and semantic interpretations of the phenomena underlying the data.

A number of challenges in both data management and data analysis require new approaches to support the big data era. These challenges span generation of the data, preparation for analysis, and policy-related challenges in its sharing and use, including the following:

  • Dealing with highly distributed data sources,
  • Tracking data provenance, from data generation through data preparation,
  • Validating data,
  • Coping with sampling biases and heterogeneity,
  • Working with different data formats and structures,
  • Developing algorithms that exploit parallel and distributed architectures,
  • Ensuring data integrity,
  • Ensuring data security,
  • Enabling data discovery and integration,
  • Enabling data sharing,
  • Developing methods for visualizing massive data,
  • Developing scalable and incremental algorithms, and
  • Coping with the need for real-time analysis and decision-making.

To the extent that massive data can be exploited effectively, the hope is that science will extend its reach, and technology will become more adaptive, personalized, and robust. It is appealing to imagine, for example, a health-care system in which increasingly detailed data are maintained for each individual—including genomic, cellular, and environmental data—and in which such data can be combined with data from other individuals and with results from fundamental biological and medical research so that optimized treatments can be designed for each individual. One can also envision numerous business opportunities that combine knowledge of preferences and needs at the level of single individuals with fine-grained descriptions of goods, skills, and services to create new markets.

It is natural to be optimistic about the prospects. Several decades of research and development in databases and search engines have yielded a wealth of relevant experience in the design of scalable data-centric technology. In particular, these fields have fueled the advent of cloud computing and other parallel and distributed platforms that seem well suited to massive data analysis. Moreover, innovations in the fields of machine learning, data mining, statistics, and the theory of algorithms have yielded

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Terms of Use and Privacy Statement