

Data Mining and Visualization
Pages 30-40

The Chapter Skim interface presents the single chunk of text algorithmically identified as most significant on each page of the chapter.


From page 30...
... press release, July 11, 2000. The amount of data stored on electronic media is growing exponentially fast. Today's data warehouses dwarf the biggest databases built a decade ago (Kimball and Merz, 2000).
From page 31...
... The next section introduces data mining tasks and models, followed by a quick tour of some theoretical results, a review of recent advances, and finally challenges and a summary.
From page 32...
... The model then can be evaluated for its accuracy in making predictions on the unseen test set. Descriptive data mining, which yields human insight, is harder to evaluate yet necessary in many domains, because the users may not trust predictions coming out of a black box or because, legally, one must explain the predictions.
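A minimal sketch of the train-on-one-set, evaluate-on-an-unseen-test-set procedure described above, using scikit-learn; the dataset, classifier, and split ratio are illustrative assumptions, not taken from the chapter:

    # Sketch: evaluating a predictive model on an unseen test set.
    # Dataset and classifier choices are assumptions for illustration.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)
    # Hold out a test set the model never sees during training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier().fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
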
From page 33...
... Similar results exist for the consistency of decision tree algorithms (Gordon and Olshen, 1984). While asymptotic consistency theorems are comforting because they guarantee that, with enough data, the learning algorithms will converge to the target concept one is trying to learn, our world is not so ideal.
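Stated informally in symbols (the notation here is assumed, not taken from the chapter), consistency says that the error of the model learned from n examples approaches the best achievable error as n grows:

    % Asymptotic consistency, informal statement. \hat{f}_n is the model
    % learned from n training examples, R(.) is expected prediction error,
    % and R^* is the best achievable (Bayes) error. Notation is assumed.
    \[
      R(\hat{f}_n) \longrightarrow R^{*} \quad \text{as } n \to \infty .
    \]
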
From page 34...
... The expected error of any learning algorithm for a given target concept and training set size can be decomposed into two terms: the bias and the variance (Geman et al., 1992). The importance of the decomposition is that it is valid for finite training set sizes, not just asymptotically, and that the terms can be measured experimentally.
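For squared error this decomposition takes the standard textbook form below, consistent with Geman et al. (1992); the notation is assumed, since the excerpt does not display the formula, and for noisy targets an irreducible noise term is added on the right:

    % Bias-variance decomposition of the expected squared error at a
    % point x, averaged over training sets S of a fixed size.
    \[
      \mathbb{E}_{S}\!\left[\big(f(x) - \hat{f}_S(x)\big)^2\right]
      = \underbrace{\big(f(x) - \mathbb{E}_{S}[\hat{f}_S(x)]\big)^2}_{\text{bias}^2}
      \;+\;
      \underbrace{\mathbb{E}_{S}\!\left[\big(\hat{f}_S(x) - \mathbb{E}_{S}[\hat{f}_S(x)]\big)^2\right]}_{\text{variance}}
    \]
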
From page 35...
... If the data are expected to contain some noise, the model on the top will probably make a better prediction at x = 9.5 than the squiggly model on the bottom, which (over)fits the data perfectly.
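A small numerical sketch of this contrast; the data-generating process and polynomial degrees below are assumptions that only loosely mirror the figure's setup:

    # Sketch: a low-degree fit vs. a high-degree fit on noisy data.
    # All specifics (data generation, degrees) are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(10, dtype=float)                    # x = 0..9
    y = 2.0 * x + 1.0 + rng.normal(0, 1.0, x.size)    # linear trend + noise

    smooth = np.polyfit(x, y, deg=1)      # simple model
    squiggly = np.polyfit(x, y, deg=9)    # interpolates the noise exactly

    for name, coeffs in [("degree 1", smooth), ("degree 9", squiggly)]:
        print(name, "prediction at x=9.5:", np.polyval(coeffs, 9.5))
    # The degree-9 polynomial passes through every training point but
    # typically extrapolates wildly at x = 9.5; the linear fit stays
    # close to the underlying trend.
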
From page 36...
... Two learning techniques developed in the last few years have had a significant impact: Bagging and Boosting. Both methods learn multiple models and vote them in order to make a prediction, and both have been shown to be very successful in improving prediction accuracy on real data (Bauer and Kohavi, 1999; Quinlan, 1996).
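A minimal sketch of the learn-multiple-models-and-vote idea behind Bagging, using scikit-learn; the dataset, base learner, and ensemble size are assumptions, and this is not the chapter's own code:

    # Sketch: bagging trains many models on bootstrap resamples and
    # votes their predictions. Dataset and parameters are assumptions.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    single = DecisionTreeClassifier(random_state=0)
    bagged = BaggingClassifier(single, n_estimators=50, random_state=0)

    print("single tree:", cross_val_score(single, X, y, cv=5).mean())
    print("bagged vote:", cross_val_score(bagged, X, y, cv=5).mean())

Boosting follows the same vote-the-models pattern but reweights the training data toward previously misclassified examples; scikit-learn's AdaBoostClassifier could be swapped in above.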
From page 37...
... Discoveries may reveal correlations that are not causal. For example, human reading ability correlates with shoe size, but wearing larger shoes will not improve one's reading ability.
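One way to see how such a non-causal correlation arises is to simulate a common cause; the hypothetical "age" variable and all numbers below are invented for illustration:

    # Sketch: a confounder (age) induces correlation between two variables
    # that do not cause each other. All numbers are invented assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    age = rng.uniform(5, 12, 1000)                    # children's ages
    shoe_size = 0.8 * age + rng.normal(0, 0.5, 1000)  # grows with age
    reading = 10.0 * age + rng.normal(0, 5.0, 1000)   # improves with age

    print("corr(shoe_size, reading):",
          np.corrcoef(shoe_size, reading)[0, 1])
    # The correlation is strong, yet intervening on shoe size would leave
    # reading ability unchanged: age is the common cause.
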
From page 38...
... This short review of the basic goals of data mining, some theory, and recent advances should provide those interested with enough information to see the value of data mining and use it to find nuggets; after all, almost everyone has access to the main ingredient needed: data.

ACKNOWLEDGMENTS

I thank Carla Brodley, Tom Dietterich, Pedro Domingos, Rajesh Parekh, Ross Quinlan, and Zijian Zheng for their comments and suggestions.
From page 39...
... 1997. Seven Methods for Transforming Corporate Data into Business Intelligence.
From page 40...
... 1995. The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.