
6 Resources, Trade-offs, and Limitations
Pages 82-92



From page 82...
... At a high level, they can be partitioned into the following categories:
• Computational resources, such as space, time, the number of processing units, and the amount of communication between them;
• Statistical or information-theoretic resources, such as the number of data samples and their type (e.g., whether a sample is random or carefully selected by the algorithm, whether the data are "labeled" or "unlabeled," and so on); often, one might also like to minimize the amount and type of information revealed about the data set in order to perform certain computations, so as to limit the loss of privacy; and
• Physical resources, such as the amount of energy used during the computation.
From page 83...
... The two goals are naturally intertwined; in particular, understanding the limitations often suggests approaches for sidestepping them. In this section the committee surveys an illustrative list of concepts and notions developed in theoretical computer science, with the emphasis on material relevant to computing over massive data sets.

Tractability and Intractability

One of the central notions in understanding computation is that of polynomial time.
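To make the distinction concrete, the following minimal Python sketch (an illustrative addition, with made-up inputs) contrasts a polynomial-time routine, sorting, with an exponential-time one, brute-force subset-sum: the former scales to very large inputs, while the latter must examine all 2^n subsets.

from itertools import combinations

def polynomial_work(values):
    # Sorting takes roughly n*log(n) comparisons: polynomial in n.
    return sorted(values)

def exponential_work(values, target):
    # Brute-force subset-sum: examines all 2**n subsets, exponential in n.
    for r in range(len(values) + 1):
        for subset in combinations(values, r):
            if sum(subset) == target:
                return subset
    return None

if __name__ == "__main__":
    for n in (10, 20):
        values = list(range(1, n + 1))
        polynomial_work(values)              # fast even for very large n
        exponential_work(values, target=-1)  # forced to scan all 2**n subsets
        print(f"n={n}: brute force examined {2**n} subsets")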
From page 84...
... Although it is not known whether an NP-complete problem can be solved in polynomial time -- this question, called "P versus NP," is one of the central open problems in computer science -- it is widely conjectured that such algorithms do not exist. The notion of NP-completeness thus provides a very useful tool for guiding algorithm design. Specifically, showing that a problem is NP-complete means that instead of trying to find a complete solution, one likely needs to modify the question.
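A standard way to "modify the question" is to ask for a provably good approximation rather than an exact optimum. The minimal Python sketch below (an illustrative addition; the example graph is made up) applies the classical greedy 2-approximation to minimum vertex cover, a problem whose decision version is NP-complete: the cover it returns is at most twice the optimal size.

def vertex_cover_2approx(edges):
    # Repeatedly pick an edge with both endpoints uncovered and add both
    # endpoints to the cover; the result is at most twice an optimal cover.
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover

if __name__ == "__main__":
    # A small example graph given as a list of edges; an optimal cover is {1, 4}.
    edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
    print(vertex_cover_2approx(edges))  # prints {1, 2, 3, 4}, twice the optimum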
From page 85...
... For example, consider the following set disjointness problem, where two parties want to determine whether two data sets of equal size, each held locally by one of the parties, contain any common items. It has been shown that in order to accomplish this task, the parties must exchange a number of bits that is linear in the size of the data set (Razborov, 1992; Kalyanasundaram and Schnitger, 1992).
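For intuition, the trivial protocol sketched below (an illustrative addition; the sets and the encoding are assumptions) matches that linear lower bound up to the number of bits per item: one party simply transmits its entire set, and the cited results say that, in the worst case, no protocol can be asymptotically more communication-efficient.

import json

def party_a_message(set_a):
    # Party A encodes its entire set as a message; the message length grows
    # linearly with the size of the data set.
    return json.dumps(sorted(set_a)).encode("utf-8")

def party_b_decide(message, set_b):
    # Party B decodes A's message and checks locally for a common item.
    set_a = set(json.loads(message.decode("utf-8")))
    return set_a.isdisjoint(set_b)

if __name__ == "__main__":
    a = {1, 5, 9, 12}
    b = {2, 3, 7, 12}
    msg = party_a_message(a)
    print("bits exchanged:", 8 * len(msg))
    print("disjoint?", party_b_decide(msg, b))  # False: both sets contain 12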
From page 86...
...

Parallel Algorithms

A fundamental and widely studied question is to understand for which problems one can obtain a speedup using parallelism. Many models of parallel computation have been studied.
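As a small illustration (not from the report) of a problem that parallelizes well, the sketch below splits a summation across worker processes; such data-parallel aggregations obtain near-linear speedups, whereas problems with long chains of dependent steps generally do not.

from concurrent.futures import ProcessPoolExecutor

def chunk_sum(chunk):
    # Work done independently by one worker: sum its chunk of the data.
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Split the data into n_workers chunks, sum the chunks in parallel,
    # then combine the partial sums.
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(chunk_sum, chunks))

if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_sum(data))  # same answer as sum(data), computed in parallel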
From page 87...
...

Challenges for Computer Science

Computational Hardness of Massive Data Set Problems

Given the usefulness of computational hardness in guiding the development of general polynomial-time algorithms, it would be helpful to be able to apply such tools to algorithms designed for massive data as well. However, polynomial time is typically not a sufficient condition for tractability when the input to a problem is very large.
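When even a single full pass over the data is costly, one often turns to streaming and sampling algorithms whose memory use is far smaller than the input. The sketch below (an illustrative addition, not taken from the report) is reservoir sampling, which maintains a uniform random sample of a stream in one pass with memory proportional only to the sample size.

import random

def reservoir_sample(stream, k):
    # One pass over the stream; keeps k items chosen uniformly at random
    # using O(k) memory, no matter how long the stream is.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)   # pick a position in [0, i]
            if j < k:
                reservoir[j] = item    # replace with probability k/(i+1)
    return reservoir

if __name__ == "__main__":
    stream = range(10_000_000)             # too large to want to store twice
    print(reservoir_sample(stream, k=5))   # five uniformly random items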
From page 88...
... More study is needed for each of these classes of algorithms.

The Role of Constants

To this point, the discussion of running times involved asymptotic analysis -- that is, the running times were specified up to a leading constant.
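Leading constants can dominate at massive scale. The small calculation below (an illustrative addition with hypothetical operation counts) compares a linear-time algorithm with a large constant against an n*log(n)-time algorithm with a small constant; the asymptotically better algorithm only becomes cheaper once the input exceeds roughly 2^50 items.

import math

def ops_a(n):
    # Hypothetical linear-time algorithm with a large leading constant.
    return 100 * n

def ops_b(n):
    # Hypothetical n*log(n)-time algorithm with a small leading constant.
    return 2 * n * math.log2(n)

if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12, 10**15, 10**16):
        winner = "A" if ops_a(n) < ops_b(n) else "B"
        print(f"n={n:>17,}: A={ops_a(n):.2e} ops, B={ops_b(n):.2e} ops -> {winner} is cheaper")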
From page 89...
...

Challenges for Other Disciplines

Perhaps the biggest challenge to understanding the fundamentals of computing over massive data lies in understanding the trade-offs between resources traditionally studied by computer science and those typically ...

3. For example, see the website for the DIMACS Workshop on Parallelism: A 2020 Vision, March 14-16, 2011, Piscataway, N.J., available at http://dimacs.rutgers.edu/Workshops/Parallel/slides/slides.html.
From page 90...
...

Statistics

Traditionally, computer science views the input data as a burden: the larger it is, the more work needs to be done to process it. If one views the input as a "blob" of arbitrary data, this conclusion appears inevitable.
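By contrast, if the records are modeled statistically, for instance as independent draws from some distribution, a small random subsample often suffices, so more data need not mean proportionally more work. The sketch below (an illustrative addition; the data and sample size are made up) estimates a mean from a modest subsample, with error shrinking roughly like one over the square root of the sample size.

import random
import statistics

def subsample_mean(data, sample_size, seed=0):
    # Estimate the mean of the full data set from a random subsample;
    # no pass over the whole data set is required.
    rng = random.Random(seed)
    return statistics.fmean(rng.sample(data, sample_size))

if __name__ == "__main__":
    full = [random.gauss(mu=5.0, sigma=2.0) for _ in range(1_000_000)]
    print("full mean      :", statistics.fmean(full))
    print("10k-sample mean:", subsample_mean(full, 10_000))  # already very close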
From page 91...
... 2003. "Tutorial on Machine Learning Theory." Presented at the 44th Annual IEEE Symposium on Foundations of Computer Science, October 11-14, Cambridge, Mass.
From page 92...
... Foundations and Trends in Theoretical Computer Science 2(4)

