Massive Data Sets and Artificial Intelligence Planning
Robert St. Amant and Paul R. Cohen
University of Massachusetts
Generating a description of a massive dataset involves searching through an enormous space of possibilities. Artificial Intelligence (AI) may help to alleviate the problem. AI researchers are familiar with large search problems and have developed a variety of techniques to handle them. One area in particular, AI planning, offers some useful guidance about how the exploration of massive datasets might be approached. We describe a planning system we have implemented for exploring small datasets, and discuss its potential application to massive datasets.
1 The Information Glut
It has been said that the Sunday New York Times contains more unique assertions than an average person encountered during a lifetime in the Renaissance. After a week of email and telephone calls, reading papers, listening to the news, skimming direct-mail catalogs and channel-hopping on TV, we are given little respite by the Times. Researchers in Artificial Intelligence (AI) have developed methods to both manage and exploit this information glut, so that we might all enjoy a more contemplative life. We provide a quick tour of AI applications for processing large amounts of data, and then focus on what AI can contribute directly to data analysis.
A variety of AI applications have been developed to take advantage of the the availability of huge quantities of information in electronic form. Examples include personal assistants that filter news stories, schedule meetings, and "crawl the web" in search of information at remote Internet sites. Many of these applications are easy to build and require little intelligence; the effort of a few hours is enough, for example, to write a filtering program to find calls for papers in the welter of Internet traffic. More intelligence is required if an assistant is to take account of human goals and preferences. One's personal email assistant should learn that broadcast messages about cars with lights on in the parking lot are irrelevant unless the car might be one's own. Similarly, an information retrieval assistant should
understand language—no mean feat—including indirect speech acts. If a student were to tell a hypothetical librarian assistant that she's looking for "a book by Tukey," she means that she doesn't know which book she wants, she'd like to see some titles, and then she'd like the assistant to send her one.
Thomas, the front-end to Congress's information system, relies on sophisticated information retrieval algorithms in responding to user requests [ref]. Natural language processing systems exploit online databases of newswire stories to learn rules that resolve word-sense ambiguity [ref]. Machine learning techniques can autonomously generate classifications of stellar objects in large scale sky surveys . Planning systems can help organize access to enormous quantities of satellite image data . Researchers in information retrieval, machine learning, knowledge discovery in databases, and natural language understanding are finding that the information glut is actually helpful, providing an inexhaustible supply of training data, test cases, and opportunities to develop new techniques. Their results are case studies in how massive datasets in different domains can be handled.
Our own research interest lies in exploratory data analysis (EDA). We have developed an assistant for intelligent data exploration, called AIDE, to help users with the task. AIDE is a knowledge-based planning system that incrementally explores a dataset, guided by user directives and its own evaluation of indications in the data . The system is mixed-initiative: the user can let AIDE run unattended, searching for its own conception of interesting structure, or can use AIDE as an interactive statistics package, or can combine the two processes, letting AIDE perform some of the tasks of exploration, but under human guidance. While AIDE is not a system for exploration of massive datasets, we believe that the issues addressed by AIDE are relevant to any system that would explore data semi-autonomously.
2 An planning perspective on data analysis
EDA is a search problem, not unlike medical diagnosis or chess. At every juncture, you can take one of several actions—transform a variable, order a CAT scan, sacrifice a bishop. After each action, your state of knowledge changes. Because actions are generally not free, you look into the future to evaluate potential outcomes, so the action you eventually select is by some local criterion "best." If local criteria guaranteed global success, chess wouldn't be a challenge. Nor would diagnosis or EDA. But even if local criteria are heuristically good, the search space in the vicinity of a decision can be enormous. The average branching factor of a search is the average number of actions feasible at each decision. Chess has an average branching factor of 40, or thereabouts, and because games last roughly 50 moves, the search space for chess contains 4050 positions. EDA has a much larger branching factor. Any statistics package permits dozens of operations on dozens or hundreds of subsets of variables. An analyst might run any of hundreds of operations at each juncture, from simple transformations, to descriptive statistics, to model specification and analyses of residuals, and so on. The search space for EDA, even for a small dataset, is effectively infinite.
AI tames intractable search spaces two ways, by reformulating a problem to have a smaller search space, and by using expert knowledge to focus attention on small pieces of the space. Our approach to data exploration relies on both of these techniques. In order to explain further, we need to take a short detour for some background information about AI planning.
Planning is a form of search, in which the task is to construct a sequence of steps that lead from a set of initial conditions to some desired conclusion. Planners formulate the search problem in terms of states, goals, and sequences of actions to achieve goals. States are partial descriptions of the world, in the internal representation of the system. In a data exploration system, the state includes the data under consideration, models of the behavior of the data, observations of unusual characteristics, and so forth. Goals specify desired states. A goal might be, for example, to find a set of predictors for some specific variable. Actions take the system from one state to another. Examples of EDA actions include generating a linear fit for a relationship, power transforming a variable, and other such operations. Actions are defined by the preconditions or requirements for their application and the effects of their application.
Planners rely on task decomposition to solve complex problems, using knowledge about states, actions, and goals to structure the search at different levels of abstraction . The simplest planners work by backward-chaining. Given a goal state, a planner begins by examining those actions that achieve the goal state. By treating the preconditions of these actions as goals to be satisfied in turn, and taking note of potential interactions between actions, the planner recursively generates a sequence of appropriate actions.
Traditional AI planners construct plans from primitive actions, stating from scratch for each new problem. Partial hierarchical planning takes a different approach. If a planning system repeatedly encounters similar problems, or subproblems, it becomes wasteful to generate similar solutions from scratch each time. Instead, a partial hierarchical planner can rely on a library of partial plans that apply to common cases. Rather than choosing an appropriate action at some point during the planning process, the planner can choose a partial plan in which many or all of the actions have already been decided on. If the library has sufficient coverage, planning largely reduces to a matter of choosing which plan to apply at what time.
Problem reformulation and expert knowledge apply in the AIDE planning framework as follows. Problem reformulation corresponds to capitalizing on knowledge of abstraction. It's clear that statisticians do not think of EDA as sequences of independent operations selected from menus of statistics packages. The fact that menus have names (e.g., Transform Data) suggests that elementary operations can be grouped into equivalence classes. Data analysis plans can be formulated in terms of these classes of operations, not the individual operations. Reformulating EDA in terms of abstract operations (such as transform, decompose, fit, etc.) reduces the branching factor of the search because there are fewer abstract actions to consider at each juncture. A combinatorially intractable search is replaced by two simpler searches: one to generate an abstract plan, the other for operations to instantiate the plan. For example, a useful abstract plan involves partitioning one or more variables, performing an operation on each partition, then presenting the results in an organized way. Histograms do this for one variable, contingency tables for two or more variables; tables of means in analysis of variance do it, too.
AIDE takes advantage of expert knowledge by storing abstract plans, instead of trying to generate them de novo. We have developed a simple but powerful language in which users can express plans for EDA. Plans are triggered by indications (which AIDE discovers automatically). For example, if a variable has "gaps" in its range, abstract plans are triggered to to search for another variable to explain the gaps and produce a contingency table, to
examine the separate partitions more closely, and so on. The number of options at any juncture is still large because AIDE generally finds many indications and can apply many plans with different parameterizations, but it is better structured and thus more manageable.
3 A Planning System For Analysis of Small Datasets
To make the discussion above concrete, we present AIDE's planning representation. While the data handling aspects of the language are novel and powerful, we make no claim that they will scale to massive datasets. The plan language, on the other hand, implements strategic aspects of the exploration, which we believe are comparable for datasets of all sizes.
AIDE's processing is built around a planning representation. In a functional language, procedures call one another directly. A statistical-summary procedure, for example, might call a mean procedure during its processing. In a planning language, control moves between plans indirectly, by the establishment and satisfaction of goals. Rather than specifying that a mean should be computed, a plan might establish a goal (compute-location ?x). This goal could be satisfied by a mean, or median, or trimmed-mean procedure, depending on the characteristics of ?x. The advantage gained is flexibility. Instead of relying on functions that must be heavily parameterized to produce appropriate behavior, the planning language distinguishes between two separate control issues: what should be done, and how it should be done.
An example of a plan specification is given below. A plan has a name, a goal that the plan can potentially satisfy, constraints on its bindings, and a body. The body of a plan is a control schema of subgoal specifications, subgoals which must be satisfied for the plan to complete successfully. An action specification is similar to a plan specification, except that its body contains arbitrary code, rather than a control schema.
:goal (generate-description :histogram-type ?batch ?histogram)
(:SUBGOAL (decompose (function unique-values) ?batch ?bins))
(:MAP/TRANSFORM (?bin ?bins ?histogram)
(:SUBGOAL (reduce (function count) ?bin)))))
:goal (generate-description :histogram-type ?batch ? histogram)
(:SUBGOAL (decompose(function unique-values) ?batch ?bins))
(:MAP/TRANSFORM (?bin ?bins ?histogram)
(:SUBGOAL (reduce (function count) ?bin)))))
This plan implements a simple procedure for generating a histogram of a discrete-valued variable. In words: divide the range of the variable into its unique values; break the variable into subsets, one subset per value; count the number of elements in each subset. The resulting counts are the bar heights of the histogram. In other words, we decompose the variable, which generates a new relation with a single attribute that contains the subsets. We then apply a transformation, with an embedded reduction, which maps each subset relation to a single value, the "count" statistic of the subset.
Let's now consider a contingency table generated for a relationship between categorical variables <x,y>. The procedure is as follows. We divide the relationship into subsets, one
corresponding to each unique combination of x and y values. Each subset is structurally identical to the original relationship. We now record the number of observations in each subset. The resulting values are the cell counts for the contingency table. A contingency table can thus be viewed as a two-dimensional analog of a histogram. Not surprisingly, the two procedures are identical in form:
:goal (generate-description :contingency-table ?xy-relationship ?table)
(:SUBGOAL (decompose (function unique-values) ?xy-relation ?cells))
(:MAP/TRANSFORM (?cell ?cells ?table)
(:SUBGOAL (reduce (function count) ?cell)))))
In fact, we might use a single plan for both procedures, relying on subordinate plans to specialize appropriately on the type of the input data structure. These combinations are simple, hardly different from the way they would appear in a functional representation. The planning representation offers more flexibility than shown here. In addition to sequencing and functional composition of primitives, the body of a plan can specify that subgoals should be satisfied or actions executed in parallel, iteratively, conditionally, or in other ways. Further, we can control how we evaluate whether a plan has completed successfully or not; we might wish to iterate over all possible ways to satisfy a subgoal, or simply accept the first plan that succeeds.
A histogram/contingency table plan is a simple example of how procedures may generalize. A more complex example is shown below. This plan was initially designed to iteratively fit a resistant line to a relationship . The resistant line generated by some methods is initially only an approximation, in the sense that executing the same procedure on the residuals may give a line with non-zero slope. When this happens, the fit is reapplied to the residuals and the line parameters updated appropriately. When the magnitude of the incremental changes falls below some heuristic threshold, the iteration stops. This plan captures the processing involved in iterating over the residuals. (The subgoals have been elided for the sake of presentation.)
:goal (describe :iterative-fit ?operation ?relationship ?description)
(:WHILE (not (null ?continue-iteration-p))
This kind of heuristic improvement is also part of other procedures, lowess being the most familiar. The same plan can be used for both procedures. The fit subgoal is satisfied by different subordinate plans, their selection depending on context. This example is again
typical of the way complex procedures may be generated dynamically through plan combination. Most plans do not specify behavior down to the level of primitive operations, but let the selection of appropriate subordinate plans depend on the context of intermediate results generated.
Finally, we have an example of a simple incremental modeling plan, instantiated in the exploration of a dataset. It generates an initial model of the dataset of an appropriate type. It then establishes the goal of elaborating the model. With the plans in the AIDE library, elaboration will involve adding relationships, one at a time, to the model. One of the plans that matches the elaborate-model subgoal recursively establishes an identical subgoal, with ?model bound to the incrementally extended model.
(define-plan explore-by-incremental-modeling ()
:goal (explore-by :modeling ?model-type ?description ?structure ?model)
:constraint ((?structure ((:dataset-type dataset))))
(:WHEN (null ?model)
(generate-initial-model ?description ?structure
(elaborate-model ?model-type ?activity
?description ?structure ?model))))
AIDE's plan library currently contains over 100 plans. Fewer than a dozen plans are usually applicable at any point during exploration, and each plan constrains the set of subordinate plans applicable at later points. Many of the plans are simple, on the order of univariate transformations and residual generation. Others are more complex descriptive and model-building procedures. Some, like the last plan presented, implement control strategies applicable in a variety of different situations. With the plan representation we have found little difficulty in implementing most familiar exploratory procedures. In addition to descriptive plans for resistant lines, various box plot procedures, and smoothing procedures, we have implemented forward selection algorithms for cluster analysis and regression analysis , a causal modeling algorithm , and a set of weaker opportunistic modeling procedures.
3.2 Controlling Plan Execution
Unless we can ensure that only a single procedure is applicable at any time, and that we know the exact subset of the data we should concentrate on—an improbable scenario—we must address a set of questions concerning control:
- Which data structure (variable, relationship, model, etc.) should be considered?
- Which plan (or plan type, in the case of higher level decisions) is appropriate?
- Given a set of comparable but different results, produced by similar procedures (e.g., a least-squares or a resistant line; four clusters or five), which result is best?
In AIDE, rules are used to make these decisions. The relevance of a rule is determined by structure and context constraints. By structure constraints, we mean that the feature and indication values specified in a rule must be present in the structure under consideration. Rules can thus constrain plans to act, for example, only on variables with indications of clustering, or on relationships between continuous variables. Context constraints apply to the history of plan and subgoal activations that have led up to the current decision point. Using context constraints a rule can, for example, prevent clusters detected in the residuals of a linear fit from being explored if the clustering plan has not yet been applied at the level of the original relationship. These constraints can prune and order related decisions. These rules control the ordering and selection of data to explore, the plans to explore them, and the results they produce.
AIDE can thus run without assistance from the user. Its rules let it evaluate candidate data structures to explore; its library provides the plans that perform the exploration. Not surprisingly, however, AIDE's lack of contextual knowledge will take the exploration in directions an informed human analyst would not need or want to go. If a user is interested in gender issues, for example, partitions of data by sex will form a significant part of the exploration. AIDE cannot know that one user is interested in gender, another studies time series of climate change, and so on. For this reason, AIDE is an analyst's assistant, not an autonomous package. It hunts for indications and suggests plans to explore them, but the analyst guides the whole process.
The process is mixed-initiative in the following sense. AIDE explores a dataset by elaborating and executing a hierarchy of plans. Decision points in the hierarchy are made available for the user's inspection. The user guides exploration by changing the ordering of plans, by selecting new plans, or by shifting to different areas in the data. This guidance includes selecting appropriate dataset variables and subsets to be explored as well as applying any available statistical manipulations to the selected data. The arrangement gives the user most of the flexibility of the usual statistical interface, while still remaining within the planning framework.
Each decision point gives the user the opportunity to view the choices—and the data—currently under consideration. The user can step through the process, letting the planner execute only until another decision is reached. The user can let planner execution continue until some descriptive result has been generated, at which point another decision can be made using the new information. At any time the user can override AIDE's decisions about the operation to execute or data to examine. AIDE keeps the user constantly informed with descriptions of the data being explored, justifications for candidate operations at a decision point, ordering constraints on choices, and descriptions of the plans under consideration.
AIDE represents a compromise between two eventually unworkable approaches to data analysis. On one hand we have statistics packages, which are driven by users, and treat actions as independent. The search space for EDA is intractable when elementary actions are independent selections from statistics package menus. On the other hand, we have autonomous ''black boxes'' from machine learning and statistics, algorithms like stepwise multiple regres-
sion, CART, and C4. These require nothing but data from the user, but nor does the user have much opportunity to control how they work, and the results are never explained. Your choice is to relinquish all control, or control every detail, of analyses. AIDE offers an alternative, involving the user in the strategic aspect of data analysis, but reducing the tactical burden.
To manage the process AIDE relies on the techniques of AI planning. We have encountered some strong similarities between planning problems and the types of problems found in EDA. These similarities yield some some potentially useful implications for the processing of massive datasets, implications that we have only begun to study.
Peter Huber observes that data analysis differs from other computer-supported tasks, such as programming and text preparation. In the latter tasks the final version is everything; there is no need to know the path by which it was reached. In data analysis the correctness of the end product cannot be checked without inspecting the path leading to it . Huber argues that existing paradigms are thus inappropriate for data analysis tasks. If a massive dataset is to be analyzed without the direct supervision of a human user, then a representation of the process carried out is a necessary component of the result. This is one of the key distinctions between planning and other forms of search—the aim is to generate a sequence (or more complex combination) of operations, not simply the end result.
One of the best-known observations to arise in the planning literature is that problem decomposition is most effective when the subproblems that are generated are largely independent of one another. A recent AI textbook contains an example in which a search problem, if solved directly, requires a search of 1030 states. In contrast, a hierarchical decomposition brings the search space down to 600 states . By imposing structure on the search space we can sometimes transform an intractable problem into a relatively simple one. Using an analogy developed during the Massive Datasets Workshop, we observe that the effectiveness of exploration of a massive dataset will depend on our knowledge of its geography: we need to know either where to look for patterns or what kinds of patterns to look for. In the absence of this knowledge we have no way to carve up the search space into smaller, more manageable blocks. A hierarchical approach to exploring a massive dataset can be effective to the extent that the decomposition follows the geography of the dataset.
A planner can derive great benefit from compiled knowledge in the form of a plan library. Much of the knowledge we bring to bear in solving problems takes the form of procedures or sequences of actions for achieving particular goals . Planning problems in some complex environments are too difficult to solve without the user of sophisticated plan libraries . Similarly, the exploration of massive datasets poses a problem that may only be solved with both knowledge of individual data analysis operations and knowledge about how they can be combined.
Our long-term objectives for this research include fully automated model-building and discovery mechanisms driven by an opportunistic control strategy. We expect to develop the automated strategies from our experience with the system as a manual analysis decision aid, letting human analysts provide much of the initial reasoning control strategy. Although we are developing this system for our own use in modeling complex program behavior, we believe it to be potentially useful for any scientific modeling problem, particularly those in which the process of data collection is itself automated and the resulting volume of data overwhelming for human analysts (e.g., astronomical surveys or remote sensing of terrestrial
In sum, AI researchers are responding to the challenges and opportunities afforded by massive amounts of electronic data. Capitalizing on the opportunities involves statistical reasoning, although not all AI researchers recognize it as such. Statistics provides a foundation for analysis and design of AI programs. Reciprocally, some AI programs facilitate the work of applied statisticians, not only in EDA, but also in model induction and testing.
This research is supported by ARPA/Rome Laboratory under contract #F30602-93-0100, and by the Dept. of the Army, Army Research Office under contract #DAAH04-95-1-0466. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes not withstanding any copyright notation hereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements either expressed or implied, of the Advanced Research Projects Agency, Rome Laboratory, or the U.S. Government.
 Paul R. Cohen, Michael L. Greenberg, David M. Hart, and Adele E. Howe. Trial by fire: Understanding the design requirements for agents in complex environments. AI Magazine, 10(3):32-48, Fall 1989.
 John D. Emerson and Michal A. Stoto. Transforming data. In David C. Hoaglin, Frederick Mosteller, and John W. Tukey, editors, Understanding robust and exploratory data analysis. Wiley, 1983.
 Usama Fayyad, Nicholas Weir, and S. Djorgovski. Skicat: A machine learning system for automated cataloging of large scale sky surveys . In Proceedings of the Tenth International Conference on Machine Learning , pages 112-119. Morgan Kaufmann, 1993.
 Michael P. Georgeff and Amy L. Lansky. Procedural knowledge. Proceedings of the IEEE Special Issue on Knowledge Representation , 74(10):1383-1398, 1986.
 Peter J. Huber. Data analysis implications for command language design. In K. Hopper and I. A. Newman, editors, Foundation for Human-Computer Communication. Elsevier Science Publishers, 1986.
 Amy L. Lansky and Andrew G. Philpot. AI-based planning for data analysis tasks. IEEE Expert, Winter 1993.
 Stuart Russell and Peter Norvig. Artificial Intelligence: A Modem Approach. Prentice Hall, 1995.
 Robert St. Amant and Paul R. Cohen. Toward the integration of exploration and modeling in a planning framework. In Proceedings of the AAAI-94 Workshop in Knowledge Discovery in Databases, 1994.
 Robert St. Amant and Paul R. Cohen. A case study in planning for exploratory data analysis. In Advances in Intelligent Data Analysis , pages 1-5, 1995.
 Robert St. Amant and Paul R. Cohen. Control representation in an EDA assistant. In Douglas Fisher and Hans Lenz, editors, Learning from Data: AI and Statistics V. Springer, 1995. To appear.