The National Academies Press

1 Introduction
Pages 11-21

From page 11...

... arises from the new accelerators designed to test aspects of the Standard Model of particle physics. Second, many areas of science and engineering have become increasingly exploratory, with large data sets being gathered outside the context of any particular theory in the hope that new phenomena will emerge.

From page 12...

... One can also envision numerous microeconomic consequences of massive data analysis where preferences and needs at the level of single individuals are combined with fine-grained descriptions of goods, skills, and services to create new markets. In general, what is particularly notable about the recent rise in the prevalence of "big data" is not merely the size of modern data sets, but rather that their fine-grained nature permits inferences and decisions at the level of single individuals.

From page 13...

... But the goals for massive data go beyond the storage, indexing, and querying that have been the province of classical database systems (and classical search engines), instead focusing on the ambitious goal of inference.

From page 14...

..., but the inferences and decisions we wish to make may refer to a different sampling criterion. This issue seems likely to be particularly severe in many massive data sets, which often consist of many subcollections of data, each collected according to a particular choice of sampling criterion and with little control over the overall composition.
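
The mismatch between how data were sampled and the population an inference refers to can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the report: the group labels, the values, and the target shares are invented, and the correction shown (post-stratification, which averages within each group and then combines the group means using known target shares) is only one simple way to handle a known mismatch.

```python
from collections import defaultdict


def pooled_mean(records):
    """Naive mean over all records, ignoring how each one was sampled."""
    return sum(value for _, value in records) / len(records)


def reweighted_mean(records, target_shares):
    """Post-stratified mean: average within each group, then combine
    the group means using the target population's group shares."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        totals[group] += value
        counts[group] += 1
    # Assumes every group in target_shares appears at least once in records.
    return sum(share * totals[g] / counts[g] for g, share in target_shares.items())


# Hypothetical pooled data set: group "a" is heavily oversampled.
records = [("a", 1.0)] * 90 + [("b", 3.0)] * 10
# Hypothetical target population: the two groups are equally common.
target_shares = {"a": 0.5, "b": 0.5}

naive = pooled_mean(records)
adjusted = reweighted_mean(records, target_shares)
print(naive, adjusted)
```

The naive pooled mean follows the composition of the collected data, while the adjusted mean follows the composition of the target population. In practice the target shares are rarely known exactly, which is part of why the issue is hard.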

From page 15...

... This effort goes well beyond the province of a single discipline, and one of the main conclusions of this report is the need for a thoroughgoing interdisciplinarity in approaching problems of massive data. The major roles that computer scientists and statisticians have to play have already been alluded to above, and the committee emphasizes that the computer scientists involved in building big data systems must develop a deeper awareness of inferential issues, while statisticians must concern themselves with scalability, algorithmic issues, and real-time decision-making.

From page 16...

... There will necessarily have to be a trade-off, one based on an assessment of the relative value of privacy when compared with the possible gains from data analysis. For society to agree on the terms of this trade-off, it will be necessary to understand exactly what the possible gains from data analysis are.

From page 17...

... Finally, human-oriented data often involve natural language and other representations, with a rich underlying semantics, and the inferential problems of interest often involve reasoning about underlying causes and human intentions. Second, distributed computing systems have become a reality, with major implications for the collection and processing of massive data.

From page 18...

... Many instances of streaming data also require real-time or near-real-time processing; examples include the online auctions run for ad placement in search engines and early alert systems for disease outbreaks. This requirement creates new algorithmic challenges where answer quality needs to be traded off against answer timeliness.
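
One way to picture the trade-off between answer quality and answer timeliness is an estimator that can be stopped early and still report how uncertain its answer is. The sketch below is an illustration under stated assumptions, not a method from the report: the function name `anytime_mean` is hypothetical, and the `budget` parameter counts items processed as a stand-in for a wall-clock deadline.

```python
import math
from itertools import islice


def anytime_mean(stream, budget):
    """Estimate the mean of a stream using at most `budget` items.

    Returns (estimate, standard_error, items_used). A smaller budget
    gives a faster answer with a larger standard error.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's update)
    for x in islice(stream, budget):
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        # Too few items to estimate the spread.
        return mean, float("inf"), n
    stderr = math.sqrt(m2 / (n - 1) / n)
    return mean, stderr, n


# Hypothetical stream of readings that cycles through 0..9.
values = [float(i % 10) for i in range(1000)]

quick = anytime_mean(values, 50)    # early answer, less certain
full = anytime_mean(values, 1000)   # later answer, more certain
print(quick, full)
```

The standard error here assumes independent, identically distributed items, which real streams often violate; the point is only that the caller can see what was given up by answering early.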

From page 19...

... TABLE 1.1 Scientific and Engineering Fields Impacted by Massive Data

Area Affected in 1995     Area Affected in 2012     Noteworthy Use Cases
Physical sciences         Physical sciences         Astronomy, particle physics
Climatology               Climatology
Signal processing         Signal processing
Medicine                  Medicine                  Imaging, medical records
Artificial intelligence   Artificial intelligence   Natural language processing, computer vision
Marketing                 Marketing                 Internet advertising, corporate loyalty programs
N/A                       Political science         Agent-based modeling of regime change
N/A                       Forensics                 Fraud detection, drug/human/CBRNe trafficking
N/A                       Cultural studies          Human terrain assessment, land use, cultural geography
N/A                       Sociology                 Comparative sociology, social networks, demography, belief and information diffusion
N/A                       Biology                   Genomics, proteomics, ecology
N/A                       Neuroscience              fMRI, multi-electrode recordings
N/A                       Psychology                Social psychology

NOTE: CBRNe, chemical, biological, radiological, nuclear, enhanced improvised explosive devices; fMRI, functional magnetic resonance imaging; N/A, not applicable.

ORGANIZATION OF THIS REPORT

The statement of task for the study that led to this report reads as follows:

The study will carry out the following tasks:
• Assess the current state of data analysis for mining of massive sets and streams of data,
• Identify gaps in current practice and theory, and
• Propose a research agenda to fill those gaps.

From page 20...

... Chapter 3 pursues the systems perspective further, discussing recent developments in parallel and distributed systems, databases, and streaming architectures. Chapter 4 addresses issues surrounding the temporal nature of data, serving to highlight the fact that many massive data sets arise as temporal streams and that many interesting inferential questions revolve around the detection of temporal trends, changes, and patterns.

From page 21...

... Chapter 10 attempts to bring several of the strands of the report together into a proposal for a taxonomy of some of the major algorithmic problems arising in massive data analysis. The committee hopes that the ideas in this section will serve to organize the research landscape and also provide a point of departure for the design of "middleware" that links high-level inferential goals to the algorithms and hardware needed to achieve those goals.
