Statistics and Massive Data Sets: One View from the Social Sciences
Albert F. Anderson
Public Data Queries, Inc.
Generating a description of a massive dataset involves searching through an enormous space of possibilities. Artificial Intelligence (AI) may help to alleviate the problem. AI researchers are familiar with large search problems and have developed a variety of techniques to handle them. One area in particular, AI planning, offers some useful guidance about how the exploration of massive datasets might be approached. We describe a planning system we have implemented for exploring small datasets, and discuss its potential application to massive datasets.
The computing resources available within a university have become determining forces in defining the academic and research horizons for faculty, students, and researchers. Over the past half century, mechanical calculators have been replaced by batch-oriented mainframe processors which evolved into interactive hosts serving first hardcopy terminals, then intelligent display terminals, and finally PCs. Today, high performance workstations integrated within local and external networks have brought unprecedented computing power to the desktop along with interactive graphics, mass storage capabilities, and fingertip access to a world of computing and data resources. Each transition has opened new opportunities for those who work with demographic, social, economic, behavioral, and health data to manage larger quantities of data and to apply more sophisticated techniques to the analysis of those data.
Social scientists, however, have not realized the potential of this revolution to the extent that it has been realized by their colleagues in the physical, natural, engineering, and biomedical sciences. Distributed computing environments encompassing local and external networks could provide high speed access to a wide range of mass data using low-cost, high speed, on-line mass-storage coupled to high performance processors feeding graphics workstations. Social scientists, scholars, planners, students, and even the public at large could access, manage, analyze, and visualize larger data sets more easily using a broader range of conventional and graphics tools than heretofore possible. But most of those who use these data today work with the data in ways remarkably similar to methods used by their predecessors three decades ago. Large data sets comprising records for thousands and even millions of individuals or other units of analysis have been, and continue to be, costly to use in terms of the dollars, time, computing facilities, and technical expertise required to handle them. These barriers can now be removed.
This paper looks briefly at the nature of the problem, the opportunities offered by computing and information system technology, and at one effort that has been made to realize the potential of these opportunities to revolutionize the manner in which massive census and survey data sets are handled. As this one example illustrates, realization of these opportunities has the potential for more than just a change of degree in terms of numbers of users, ease of use, and speed of response. Users of demographic, social, economic, behavioral, health, and environmental data can experience a qualitative change in how they work, interacting with data and tools in ways never before possible.
2 The Problem
Demographers, social scientists, and others who work with census and survey data are often faced with the necessity of working with data sets of such magnitude and complexity that the human and technological capabilities required to make effective use of the data are stretched to their limits, and often beyond. Even today, researchers may find it necessary to coordinate three or more layers of support personnel to assist them with their efforts to retrieve information from data sets ranging to gigabytes (GB) in size. Yet these data are among the most valuable resources available for gaining insight into the social processes that are changing our world. These challenges are compounded today by the recognition that many of our pressing local, national, and international problems require multi-disciplinary approaches if the problems are to be understood and resolved. The success of these multi-disciplinary endeavors will depend in part upon how readily and how effectively researchers and analysts from the social sciences, environment, public policy, and public health can bring data from their disciplines, along with geographic and topological data, to bear on these problems.
Consequently, public data such as the Public Use Microdata Samples (PUMS), Current Population Surveys (CPS), American Housing Surveys (AHS), Census Summary Tape Files (STF), and National Center for Health Statistics mortality files are of greater potential value to a broader range of researchers, scholars, students, and planners than ever before. Yet these are data that, because of the cost and difficulty in working with them, have historically been underutilized relative to their potential to lend insight into social, economic, political, historical, health, and educational issues. These are data that are relevant at levels ranging from personal to global concerns.
Unlocking information requires more than just access to data. Getting answers to even simple questions poses significant challenges to users of multi-gigabyte data sets. The challenges become much greater when more complex questions are asked: questions that require the construction of indices within or across records, matching and merging information across data sets, and displaying data graphically or geographically. For example, to gain insight into college enrollment rates, one might wish to create an index for each child in the PUMS representing the total years of siblings' college attendance that would overlap with the given child's college years, assuming that each child were to attend college for four years starting at the same age. Such an index could reflect the economic pressure placed on a family by having multiple children in college at the same time. Availability of such an index could immediately suggest a host of questions and possible avenues of inquiry to be pursued: for example, establishing from other data sources the validity of the index as a predictor of college attendance, or examining how the index varies over household and family characteristics such as the race, education, occupation, and age cohort of the head and spouse or the family structure within the PUMS data. One might also wish to generate thematic maps of the distribution of relationships within local, regional, or national contexts. To investigate such questions today would require access to costly computing and data resources as well as significant technical expertise. The task would challenge an accomplished scholar working in a well endowed research center.
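The sibling index described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the author's implementation: the function name `sibling_overlap_index` is hypothetical, the input is a plain list of children's ages in one household rather than actual PUMS records, and ages are taken in whole years.

```python
COLLEGE_YEARS = 4  # from the text: each child attends college for four years


def sibling_overlap_index(ages):
    """For each child, total sibling college years overlapping the child's own.

    `ages` lists the current ages (whole years) of the children in one
    household. Because every child is assumed to start college at the same
    age, the starting age cancels out and only the age gaps matter.
    """
    index = []
    for i, age in enumerate(ages):
        overlap_total = 0
        for j, sibling_age in enumerate(ages):
            if i == j:
                continue
            # Two four-year windows offset by the age gap share
            # max(0, 4 - gap) years.
            gap = abs(age - sibling_age)
            overlap_total += max(0, COLLEGE_YEARS - gap)
        index.append(overlap_total)
    return index


# Hypothetical household with children aged 17, 15, and 10.
print(sibling_overlap_index([17, 15, 10]))  # [2, 2, 0]
```

In this hypothetical household the two older children each face two overlapping years, while the youngest faces none. Applying the same computation to the PUMS would additionally require identifying the children within each household record.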
Challenges exist on the technology side, also. The optimal application of mass storage, high performance processors, and high speed networks to the task of providing faster, cheaper, and easier access to mass data requires that strategies for using parallel and multiple processing be developed, data compression techniques be evaluated, overall latencies within the system be minimized, advantage be taken of the Internet for sharing resources, etc. In the case of latencies, for example, ten second startup and communication latencies are of little consequence to a five hour task, but a severe bottleneck for a one second task.
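The latency point can be made concrete with a small calculation. The sketch below uses the figures from the text (ten seconds of latency, a five hour task, a one second task); the function name `overhead_share` is an assumption introduced for illustration.

```python
def overhead_share(latency_s, task_s):
    """Fraction of total elapsed time consumed by fixed latency."""
    return latency_s / (latency_s + task_s)


five_hours_s = 5 * 60 * 60  # 18,000 seconds

# Ten seconds of latency on a five hour task versus a one second task.
print(f"five hour task:  {overhead_share(10, five_hours_s):.2%}")
print(f"one second task: {overhead_share(10, 1):.2%}")
```

For the five hour task the latency is well under a tenth of one percent of elapsed time; for the one second task it is roughly ten elevenths of it.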
3 The Opportunity
Creative application of currently available computing and information technology can remove the obstacles to the use of massive demographic, social, economic, environmental, and health data sets while also providing innovative methods for working with the data. Tasks related to accessing, extracting, transforming, analyzing, evaluating, displaying, and communicating information can be done in seconds and minutes rather than the hours, days, and even weeks that users have faced in the past. Networks can allow resources too costly to be justified for a small number of users within a local context to be shared regionally, nationally, or even internationally. The models for sharing access to specialized instruments that have worked well in the physical, natural, and medical sciences can be applied equally well to the social sciences.
Dedicated parallel and multiple processing systems have the potential to essentially eliminate the I/O and processing bottlenecks typically associated with handling files containing millions of data records. Serving systems based on closely coupled high performance processors can, in a fraction of a second, reduce massive data to tabulations, variance-covariance matrices, or other summary formats which can then be sent to desktop clients capable of merging, analyzing, and displaying data from multiple sources. Bootstrap and jackknife procedures can be built into the systems to provide estimates of statistical parameters. Distributing the task between remote servers and desktop processors can minimize the quantity of information that must be moved over networks.
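As an illustration of the kind of resampling procedure mentioned above, the following is a minimal sketch of the standard delete-one jackknife estimate of a standard error. The text does not say how such procedures would be built into a serving system, so the function names and the sample values here are assumptions.

```python
import math


def jackknife_se(values, statistic):
    """Delete-one jackknife standard error of `statistic` over a list."""
    n = len(values)
    # Recompute the statistic with each observation left out in turn.
    leave_one_out = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)
    return math.sqrt(variance)


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical sample values, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(jackknife_se(sample, mean))
```

For the sample mean, the jackknife estimate coincides with the familiar s / sqrt(n), which gives a convenient check on the code. A production system would presumably apply the procedure to summary results on the server rather than to raw lists on the client.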
The rapidly falling cost of high performance systems relative to their performance is significantly reducing the hardware costs associated with creating dedicated facilities optimized for the handling of massive census and survey data.
4 One Path to an Answer
A collaborative effort involving researchers at the Population Studies Center (PSC) at the University of Michigan and the Consortium for International Earth Science Information Network (CIESIN) at Saginaw, Michigan, led to a demonstration in 1993 of the prototype data access system providing interactive access via the Internet to the 1980 and 1990 [...] and person records per file. The prototype, named Explore, has subsequently stimulated the development of the Ulysses system at CIESIN and a commercial system, PDQ-Explore, by Public Data Queries, Inc. Users of the CIESIN facilities, who currently number more than 1,000 researchers, scholars, analysts, planners, news reporters, and students around the world, can readily generate tables from these data sets in seconds. The prototype ran on a loosely clustered parallel system of eight HP 735 workstations. The use of more efficient algorithms in the new designs is allowing better performance to be achieved using fewer processors. Other data sets are being added to the system. The prototype system has also been demonstrated on larger parallel processing systems, IBM SP1/SP2s, to provide interactive access to the 1990 5% [...] represent a realization of the promise of high performance information and computing technology to minimize, if not eliminate, the cost in terms of dollars, time, and technical expertise required to work with the PUMS and similar large, complex data sets.
The Explore prototype was designed by Albert F. Anderson and Paul H. Anderson to perform relatively simple operations on data, but to do so very quickly, very easily, and, through the collaboration with CIESIN, at very low cost for users. Multi-dimensional tabulations may be readily generated, as well as summary statistics on one item within the cross-categories of others. Some statistical packages provide modes of operation that are interactive or that allow the user to chain processes in ways that, in effect, can give interactive access to data, but not to data sets the size of the PUMS and not with the speed and ease possible with Explore. Thirty years ago, interacting with data meant passing boxes of cards through a card sorter again, and again, and again. More recently, users often interacted with hundreds, even thousands, of pages of printed tabular output, having produced as much information in one run as possible to minimize overall computing time and costs. The Explore approach to managing and analyzing data sets allows users to truly interact with massive data. Threads of interest can be pursued iteratively. Users can afford to make mistakes. Access to the prototype and to Ulysses on the CIESIN facilities has clearly succeeded in letting users access the 1990 PUMS data within an environment that reduces their costs in using such data to such an extent that they are free to work and interact with the data in ways never before possible.
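The two operations named above, a multi-dimensional tabulation and a summary statistic on one item within the cross-categories of others, can be sketched together. This is a hedged illustration only: the record layout and the field names (`sex`, `tenure`, `income`) are hypothetical, and nothing here reflects the internal design of Explore.

```python
from collections import defaultdict


def mean_within_cells(records, by, item):
    """Cell counts and mean of `item` within the cross-categories of `by`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        cell = tuple(record[key] for key in by)
        sums[cell] += record[item]
        counts[cell] += 1
    return {cell: (counts[cell], sums[cell] / counts[cell]) for cell in counts}


# Hypothetical person records, for illustration only.
people = [
    {"sex": "F", "tenure": "own", "income": 30000},
    {"sex": "F", "tenure": "own", "income": 50000},
    {"sex": "M", "tenure": "rent", "income": 20000},
]

for cell, (count, average) in sorted(mean_within_cells(people, ("sex", "tenure"), "income").items()):
    print(cell, count, average)
```

Each key of the result is one cell of the cross-tabulation; the value pairs the cell count with the mean of the chosen item in that cell.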
The current development effort at Public Data Queries, Inc., is funded by small business research and development grants from the National Institutes of Health (NIH), specifically the National Institute of Child Health and Human Development (NICHD). The PDQ-Explore system combines high speed data compression/uncompression techniques with efficient use of single level store file I/O to make optimum use of the available disk, memory, and hardware architecture on the host hardware. More simply, the active data are stored in RAM while they are in use, and key portions of the program code are designed to be held in the on-chip instruction caches of the processors throughout the computing intensive portions of system execution. As a consequence, execution speeds can in effect be increased more than one thousand fold over conventional approaches to data management and analysis. Because the task is by nature highly parallel, the system scales well to larger numbers of higher performance processors. Public Data Queries, Inc., is currently investigating the applicability of symmetric multiprocessing (SMP) technology to the task with the expectation that more complex recoding, transformation, matching/merging, and analytic procedures can be accommodated while improving performance beyond present levels.
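The reason tabulation is "by nature highly parallel" is that counts computed on separate partitions of the records can simply be added together. The sketch below illustrates that decomposition; it is an assumption-laden toy, not the PDQ-Explore design. The field names are hypothetical, and Python threads stand in for separate processors (in CPython they show the structure of the computation rather than a real speedup).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tabulate(records, keys):
    """Count records in each cross-category of `keys`."""
    return Counter(tuple(record[key] for key in keys) for record in records)


def parallel_tabulate(records, keys, workers=4):
    """Tabulate disjoint partitions independently, then merge the counts."""
    # Round-robin split so every record lands in exactly one partition.
    partitions = [records[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda part: tabulate(part, keys), partitions):
            total.update(partial)  # Counter.update adds counts cell by cell
    return total


# Hypothetical records, for illustration only.
sample_records = [
    {"sex": "F", "tenure": "own"},
    {"sex": "M", "tenure": "rent"},
    {"sex": "F", "tenure": "own"},
    {"sex": "F", "tenure": "rent"},
    {"sex": "M", "tenure": "rent"},
]

print(parallel_tabulate(sample_records, ("sex", "tenure")))
```

Because the merge step only adds small tables of counts, the amount of data exchanged between workers is independent of the number of records scanned, which is what allows the approach to scale with the number of processors.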
Current implementations on HP, IBM, and Intel Pentium systems allow records to be processed at rates on the order of 300,000-500,000 per second per processor for data held in RAM. Processing times are expected to be reduced by at least a factor of two, and probably more, through more efficient coding of the server routines. System latencies and other overhead are expected to reduce to milliseconds. Expectations are that within one year, the system could be running on servers capable of delivering tabulations in a fraction of a second from the 5% data, more than 18 million person and housing records.
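The quoted rates translate into scan times as follows. This back-of-the-envelope sketch assumes a full pass over exactly 18 million records (the text says "more than"), perfectly linear scaling across processors, and no latency; the function name `scan_seconds` and the eight-processor case are illustrative assumptions.

```python
def scan_seconds(records, rate_per_processor, processors=1):
    """Seconds for one full pass, assuming perfectly linear scaling."""
    return records / (rate_per_processor * processors)


RECORDS = 18_000_000  # lower bound taken from the text

for rate in (300_000, 500_000):
    one = scan_seconds(RECORDS, rate)
    eight = scan_seconds(RECORDS, rate, processors=8)
    print(f"{rate:>7} records/s: {one:.1f} s on 1 processor, {eight:.1f} s on 8")
```

At the stated rates a single processor needs 36 to 60 seconds for a full pass, and eight processors need 4.5 to 7.5 seconds under these idealized assumptions, which indicates how much further improvement the sub-second target implies.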
For information on the CIESIN Ulysses system, contact: firstname.lastname@example.org