3

Knowledge Discovery: Data Mining and Search

The Nature of Data Mining and Search

The World-Wide Web (the Web) is a remarkable source of information, and this session, moderated by Dianne O'Leary of the University of Maryland, focused on gathering information from it, validating that information, and using the information to draw conclusions and make predictions.

The term “data mining” is quite old. Jon Kleinberg of Cornell University noted that, until about 1992, it occurred mostly in the statistics literature and mostly in a pejorative context. Today, however, it refers to retrieving useful information from multiple sources, often on the Web, and making useful inferences from it.

There are a remarkable number of approaches to designing search engines for the Web. The search engine Google, for example, uses a link analysis technique related to the principle of hubs and authorities. If, in examining the directed graph formed by links between Web pages, we can identify a tightly linked bipartite subgraph, then pages with high in-degree are denoted as authorities, since they are cited by many other pages, while those with high out-degree are hubs, or reference sites for specialized subject areas. The identification can be done automatically through the eigenvectors of the Laplacian of the graph. In contrast, search engines such as Alta Vista characterize pages by a series of keywords that are weighted by factors such as frequency and size of type, and then matched to a user's query, perhaps using techniques such as latent semantic indexing and low-rank approximations to matrices. The Alexa search engine uses the recent history of users' Web browsing to tailor query responses. And engines like Yahoo rely on humans to search the Web and create a directed acyclic graph taxonomy for the pages.
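The hubs-and-authorities idea can be made concrete with a short iterative computation in the style of Kleinberg's HITS algorithm; the iteration converges to the principal eigenvectors of A^T A and A A^T, where A is the adjacency matrix of the link graph. The tiny link graph below is an invented toy example, not data from the workshop:

```python
# Toy hubs-and-authorities (HITS-style) iteration on a small link graph.
# links[i] lists the pages that page i points to (invented example).
links = {0: [2, 3], 1: [2, 3], 2: [3], 3: []}
n = len(links)

hub = [1.0] * n
auth = [1.0] * n

for _ in range(50):
    # Authority score: sum of hub scores of the pages linking to you.
    new_auth = [sum(hub[i] for i in links if j in links[i]) for j in range(n)]
    # Hub score: sum of authority scores of the pages you link to.
    new_hub = [sum(new_auth[j] for j in links[i]) for i in range(n)]
    # Normalize each vector so the iteration converges.
    na = sum(new_auth) or 1.0
    nh = sum(new_hub) or 1.0
    auth = [a / na for a in new_auth]
    hub = [h / nh for h in new_hub]

best_authority = max(range(n), key=lambda j: auth[j])
best_hub = max(range(n), key=lambda i: hub[i])
print(best_authority, best_hub)
```

Page 3, which is cited by every other page, emerges as the strongest authority, while pages 0 and 1, which point at both well-cited pages, emerge as the hubs.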

An overarching challenge is to develop a new level of capability in automated search and mining. For example, instead of merely providing the capability to retrieve Web pages that concern the safety records of automobiles, it would be useful to be able to satisfy requests such as, “Compile a report with all available information on the safety of the 1998 Ford Taurus.” This task requires the ability to retrieve information stored in tables or sentences on various Web pages, judge its relevance and reliability, and present the results in an easily understood document.

Clearly, mathematics, especially graph theory, statistics, and linear algebra, has already played a critical role in data mining and searching, and this session explored the current state and future opportunities for interactions.

Problems in Data Mining

Chid Apte of the IBM T.J. Watson Research Center discussed extracting “actionable insights” from data. He reminded workshop participants that data are not ordinarily presented with ease of mining in mind. For instance, some time series data may be raw data, whereas others have been “back-fitted,” revised in the light of supplementary data. Back-fitting makes the use of statistical methods particularly difficult because of the hidden correlations (Problem 25).

Problem 25. How can we verify the integrity of data gathered from multiple sources? In particular, we need to distinguish signal from noise.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




Apte predicted that breakthroughs would come through interdisciplinary approaches, drawing on expertise in statistics, machine learning, and data management. In addition, results from computational learning theory, human-computer interaction, understanding of high-dimensional systems, visualization, and optimization involving variational problems will be essential (Problem 26 and Problem 27).

Problem 26. Are there automated methods to select the best data transformations, dropping to an appropriate subspace? In other words, how can we identify important features and search for alternate models within the space in order to find a near-optimal one?

Problem 27. Can we extract from data robust models that are simultaneously comprehensible to end users?

Enhancing the Power of Automated Knowledge Discovery

Raúl Valdés-Pérez of Carnegie Mellon University focused on the problems of knowledge discovery: gathering and analyzing data in order to determine underlying rules and defining properties. He demonstrated some current technology and presented several challenges. The current capabilities of automated discovery are quite limited: we cannot easily extract information from multiple sources with multiple formats and reason about it (Problem 28 and Problem 29). In addition, the presentation of results to users is quite primitive, given, for example, as a decision tree or a cluster diagram.

Problem 28. Enlarge the range of automated discovery: for instance, enable machines to form conjectures or make predictions.

Problem 29. Make data mining so automatic that even non-experts in mining can ask questions and get back useful information that is easy to interpret.

Web Social Structure and Search

Jon Kleinberg of Cornell University focused on the information content and social structure of the Web and on the problems of Web searching.
He noted that searching is made difficult by two opposing characteristics. On the one hand, for specific queries, only a few sites (if any) might contain the answer to a user's query, and there are challenges in natural language processing in inferring that the answer is contained within a given Web page. On the other hand, general queries bring back far too many responses, overwhelming the person who formulated the query. Thus effective filtering of responses is essential. A capability for giving priority to Web pages that are authorities, in the sense discussed above, would be useful.

Problem 30. Find good visualization models for the Web that present information in a usable way. Thus far, the most successful tool has been the graphical browser.

Kleinberg is fascinated by the problem of discovering cyber communities through the hub-and-authority structure (Problem 31).

Problem 31. The 200 million Web pages have been reduced to 500,000 overlapping communities (which do not constitute a cover). What is the structure of this aggregation, and what properties should we search for? Multiresolution approaches (as proposed in Problem 6) might be useful here, as might the study of self-similarity plus perturbation.

Problem 32. Construct a simple random process that yields a graph with the properties of the Web: obeying the power law, with many dense bipartite subgraphs, and with a small diameter, since the average distance between any two Web pages is 15 to 20 links.

“Clickstream data,” the path that a user takes in surfing the Web, defines a dynamic graph. If a large Internet service provider such as America Online made such data available to researchers, it would reveal a great deal about communities, enable people to restructure Web sites for easier access, and also have great value to advertisers. Such a release of data is problematic, however, because of privacy concerns and the economic value of the data.

Problem 33. Classify Web pages by topic. This problem is analogous to image segmentation, so one possible approach might be to use computer vision methods in this context.

In problems such as Problem 33, computational learning theory is a useful tool in trying to deduce function from observations. This effort has led to interesting interactions between computer science and the mathematical sciences. Valiant's model of PAC learning is connected to Vapnik's ideas about classification and pattern recognition; the computational algorithms minimize a cost function plus a term to restrict the search space, just as in regularization methods for ill-posed problems.
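One well-known candidate for the kind of random process asked for in Problem 32 is preferential attachment, in which each new page links to existing pages with probability proportional to their current degree. This reproduces a power-law degree distribution and small diameter, though not, by itself, the many dense bipartite subgraphs. The sketch below is illustrative only; the parameters are arbitrary:

```python
import random

def preferential_attachment(n, m, seed=0):
    """Grow a graph on n nodes; each new node attaches m edges to
    existing nodes chosen with probability proportional to degree."""
    rng = random.Random(seed)
    # Start from a triangle so early nodes have nonzero degree.
    edges = [(0, 1), (1, 2), (0, 2)]
    # 'targets' repeats each node once per unit of degree, so uniform
    # sampling from this list is degree-proportional sampling.
    targets = [0, 0, 1, 1, 2, 2]
    for new in range(3, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets.extend([new, t])
    return edges

edges = preferential_attachment(1000, 2)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
# In a power-law graph a few hubs have far more than the average degree.
print(max(degree.values()), sum(degree.values()) / len(degree))
```

The heavy-tailed degree distribution shows up immediately: the maximum degree is many times the average, exactly the "rich get richer" behavior observed in the Web's link graph.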
A technique called boosting adds classifiers to take care of difficult examples in a stable way and is analogous to column generation in mathematical programming.

Additional Problems in Data Mining and Search

A critical issue is to distinguish between trusted authorities and unreliable ones. The Weighted Majority Algorithm is one approach that can be applied, finding trusted authorities with at most O(k + log n) mistakes, where n is the number of experts and k is the number of mistakes of the most reliable one.

Susan Gauch of the University of Kansas posed a fundamental problem in designing search engines (Problem 34), which prompted participants to identify a related fundamental problem for images (Problem 35).
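A minimal sketch of the Weighted Majority Algorithm makes the O(k + log n) mistake bound plausible: the algorithm predicts with the weighted vote of the experts and halves the weight of every expert that errs, so unreliable experts are rapidly discounted. The expert predictions below are invented toy data:

```python
def weighted_majority(expert_preds, truth, beta=0.5):
    """expert_preds[t][i] is expert i's 0/1 prediction at step t;
    truth[t] is the correct label. Returns (mistakes, final weights)."""
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, truth):
        vote1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        vote0 = sum(wi for wi, p in zip(w, preds) if p == 0)
        guess = 1 if vote1 >= vote0 else 0
        if guess != y:
            mistakes += 1
        # Multiply the weight of every wrong expert by beta (here 1/2).
        w = [wi * (beta if p != y else 1.0) for wi, p in zip(w, preds)]
    return mistakes, w

# Toy run: expert 0 is always right; experts 1 and 2 are unreliable.
preds = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
truth = [1, 0, 1, 0, 1]
mistakes, w = weighted_majority(preds, truth)
print(mistakes, w)
```

After a few rounds the reliable expert retains its full weight while the others are exponentially discounted, which is the mechanism behind the logarithmic dependence on n in the mistake bound.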

Problem 34. How can a search engine solicit feedback from the user about how well the retrieved documents satisfy the query, and how can the engine make subsequent use of such information? In addition, how could retrieval be tailored to an individual user, making intelligent guesses, for instance, about whether a request for “Michael Jordan” refers to the computer scientist, the jazz musician, or the basketball player?

Problem 35. The issue of image retrieval is not well understood, unless there is accompanying text or an index. How can we retrieve images related to a given one by color profile (e.g., “mostly red”), category (e.g., landscape), or subject (e.g., Isaac Newton)?

Summary of Data Mining and Search Session

Laszlo Lovasz of Microsoft Research noted that two interesting graph decompositions have emerged in recent decades. Robertson-Seymour graph minor theory for low-density graphs considers graphs G that do not contain K_m as a minor. Such graphs are a small perturbation of two-dimensional graphs glued together. The Szemerédi lemma notes that graphs in which the number of edges is proportional to n^2, where n is the number of vertices, can be decomposed into pieces that look random and have various lower dimensions. This suggests Problem 36.

Problem 36. Is there a randomized way of simulating the Internet as a graph?

These connections highlight the importance of basic linear algebra constructions, such as the graph Laplacian, in understanding Internet structure and search. Lovasz noted that “discrete-continuous” is not the right dichotomy, since all of these graph concepts have analogs in the continuous setting, and the problems are closely related to finding good embeddings of finite metric spaces into each other with little distortion. It is essential that algorithms work on huge problems, and thus they must run in time close to linear.
Lovasz quoted Ronald Coifman of Yale University as saying that the most economical representation of data brings us halfway to the discovery of the underlying structure, and that this must be a basic principle in data mining and search.
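The graph Laplacian mentioned above can be illustrated concretely. For an undirected graph with adjacency matrix A and diagonal degree matrix D, the Laplacian is L = D - A; its quadratic form x^T L x equals the sum of (x_u - x_v)^2 over edges, which is why eigenvectors with small eigenvalues vary slowly across edges and expose cluster structure. The graph below (two triangles joined by one edge) is an invented toy example:

```python
# Toy illustration of the graph Laplacian L = D - A.
# Invented example: two triangles {0,1,2} and {3,4,5} joined by edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
n = 6

A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1
deg = [sum(row) for row in A]
L = [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]

# Every row of L sums to zero, so the all-ones vector is an eigenvector
# with eigenvalue 0; the multiplicity of 0 counts connected components.
assert all(sum(row) == 0 for row in L)

# The quadratic form x^T L x equals the sum of (x_u - x_v)^2 over edges:
# for the indicator of the two triangles, it counts only the joining edge.
x = [1, 1, 1, -1, -1, -1]
quad = sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))
cut = sum((x[u] - x[v]) ** 2 for u, v in edges)
print(quad, cut)
```

Here both quantities pick out the single edge crossing the two clusters, the small-cut structure that spectral methods for community detection exploit.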