understand language—no mean feat—including indirect speech acts. If a student were to tell a hypothetical librarian assistant that she's looking for "a book by Tukey," she means that she doesn't know which book she wants, she'd like to see some titles, and then she'd like the assistant to send her one.
Thomas, the front-end to Congress's information system, relies on sophisticated information retrieval algorithms in responding to user requests [ref]. Natural language processing systems exploit online databases of newswire stories to learn rules that resolve word-sense ambiguity [ref]. Machine learning techniques can autonomously generate classifications of stellar objects in large scale sky surveys . Planning systems can help organize access to enormous quantities of satellite image data . Researchers in information retrieval, machine learning, knowledge discovery in databases, and natural language understanding are finding that the information glut is actually helpful, providing an inexhaustible supply of training data, test cases, and opportunities to develop new techniques. Their results are case studies in how massive datasets in different domains can be handled.
Our own research interest lies in exploratory data analysis (EDA). We have developed an assistant for intelligent data exploration, called AIDE, to help users with the task. AIDE is a knowledge-based planning system that incrementally explores a dataset, guided by user directives and its own evaluation of indications in the data . The system is mixed-initiative: the user can let AIDE run unattended, searching for its own conception of interesting structure, or can use AIDE as an interactive statistics package, or can combine the two processes, letting AIDE perform some of the tasks of exploration, but under human guidance. While AIDE is not a system for exploration of massive datasets, we believe that the issues addressed by AIDE are relevant to any system that would explore data semi-autonomously.
EDA is a search problem, not unlike medical diagnosis or chess. At every juncture, you can take one of several actions—transform a variable, order a CAT scan, sacrifice a bishop. After each action, your state of knowledge changes. Because actions are generally not free, you look into the future to evaluate potential outcomes, so the action you eventually select is by some local criterion "best." If local criteria guaranteed global success, chess wouldn't be a challenge. Nor would diagnosis or EDA. But even if local criteria are heuristically good, the search space in the vicinity of a decision can be enormous. The average branching factor of a search is the average number of actions feasible at each decision. Chess has an average branching factor of 40, or thereabouts, and because games last roughly 50 moves, the search space for chess contains 4050 positions. EDA has a much larger branching factor. Any statistics package permits dozens of operations on dozens or hundreds of subsets of variables. An analyst might run any of hundreds of operations at each juncture, from simple transformations, to descriptive statistics, to model specification and analyses of residuals, and so on. The search space for EDA, even for a small dataset, is effectively infinite.
AI tames intractable search spaces two ways, by reformulating a problem to have a smaller search space, and by using expert knowledge to focus attention on small pieces of the space. Our approach to data exploration relies on both of these techniques. In order to explain further, we need to take a short detour for some background information about AI planning.