Massive Data Sets Workshop: The Morning After
Peter J. Huber
Some issues crucial for the analysis of massive data sets are identified: computational complexity, data management questions, heterogeneity of data, customized systems, and some suggestions are offered on how to confront the challenges inherent in those issues.
This paper collects some of my observations at, reactions to, and conclusions from, the workshop on Massive Data Sets in Washington-D.C., July 7-8, 1995.
At the workshop, we had not gotten as far as I had hoped. We had discussed long wish-lists, but had not winnowed them down to a list of challenges. While some position papers had discussed specific bottlenecks, or had recounted actual experiences with methods that worked, and things one would have liked to do but couldn't, those examples had not been elaborated upon and inserted into a coherent framework. In particular, the discussions in the Small Groups barely had scratched the implications of the fact that massive sets differ from smaller ones not only by size. Maybe an additional day, providing more time for thinking and for informal contacts and discussions, would have been beneficial.
I shall try to continue the discussion of some of the points we left unfinished and connect some of the open ends.
2 Disclosure: Personal experiences.
Clearly, the personal viewpoints of the workshop participants were heavily influenced by the data sets they had worked with. We somehow resembled the proverbial group of blind men confronted with an elephant. This makes it mandatory to disclose the data that have shaped one's views. In my case these were: children's growth data, census data, air traffic radar data, environmental data, hospital data, marketing research data, road quality data, agricultural and meteorological data, with sizes ranging from 3 Mbytes to 2 Gbytes. Most data sets were observational, a few were opportunistic; there were no imaging data. The census data were an outlier in several respects. I shall later cite specific examples for illustrative purposes. Perhaps the most important thing I have learned from these
experiences was: even though the data sources and the analysis goals at first blush seemed disparate, the analyses almost invariably converged or are expected to converge toward a sometimes rudimentary, sometimes elaborate, customized data analysis system adapted to a particular data set. The reason of course is that in the case of large data sets many people will have to work for an extended period of time with the same or similar data.
3 What is massive? A classification of size.
A thing is massive, if it is too heavy to be moved easily. We may call a data set massive, if its mere size causes aggravation. Of course, any such a characterization is subjective and depends on the task, one's skills, and on the available computing resources.
In my position paper (Huber 1994b), I had proposed a crude objective classification of data by size, from tiny (102 bytes), small (104), medium (106), large (108), huge (1010) to monster (1012). The step size 100 is large enough to turn quantitative differences into qualitative ones: specific tasks begin to hurt at well defined steps of the ladder. Whether monster sets should be regarded as legitimate objects of data analysis is debatable (at first, I had deliberately omitted the "monster" category, then Ed Wegman added it under the name "ridiculous"). Ralph Kahn's description of the Earth Observing System however furnishes a good argument in favor of planning for data analysis (rather than mere data processing) of monster sets.
Data analysis goes beyond data processing and ranges from data analysis in the strict sense (non-automated, requiring human judgment based on information contained in the data, and therefore done in interactive mode, if feasible) to mere data processing (automated, not requiring such judgment). The boundary line is blurred, parts of the judgmental analysis may later be turned into unsupervised preparation of the data for analysis, that is, into data processing. For example, most of the tasks described by Bill Eddy in connection with fNMR imaging must be classified as data processing.
- With regard to visualization, one runs into problems just above medium sets.
- With regard to data analysis, a definite frontier of aggravation is located just above large sets, where interactive work breaks down, and where there are too many subsets to step through for exhaustive visualization.
- With regard to mere data processing, the frontiers of aggravation are less well defined, processing times in batch mode are much more elastic than in interactive mode. Some simple standard data base management tasks with computational complexity O(n) or O(n log(n)) remain feasible beyond terabyte monster sets, while others (e.g. clustering) blow up already near large sets.
4 Obstacles to scaling.
By now, we have very considerable experience with data analysis of small and medium sets. Present-day PCs are excellently matched to the requirements of interactive analysis of medium sets-if one tries to go beyond, one hits several bottlenecks all at once. The problem is to scale our tools up to larger sets. Some hard obstacles to scaling are caused by human limitations, by computational complexity or by technological limits. Others are financial (e.g. memory costs) or lack of software (for massive parallelism).
4.1 Human limitations: visualization.
Direct visualization of a whole data set through scatterplots and scatterplot matrices is feasible without significant information loss up to about medium size. For larger sets one must either step through subsets or show summaries (e.g. density estimates). This limitation has more to do with the resolution of the human visual system than with display hardware. Ed Wegman's position paper elaborates on this.
This limitation holds across the board, also for imaging data: an efficiently coded high-resolution picture comprises at most a few megabytes.
4.2 Human-machine interactions.
Necessary prerequisites for interactivity are: the task is such that a sequence of reasonably straightforward decisions have to be made in relatively quick succession, each of them based on the results of the preceding step. All three prerequisites can be violated for large sets: the decisions may be not straightforward because of data complexity, the response may be too slow (the human side of the feedback loop is broken if response time exceeds the order of human think time, with the latter depending on the task under consideration), and it may be difficult to provide a rational basis for the next decision if one cannot visualize the preceding results.
In interactive work, the timing requirements are stringent: For high-interaction graphics the response time must be a fraction of a second, for most other tasks of interactive data analysis it can be a few seconds, but it may exceed 10-20 seconds only very rarely.
4.3 Storage requirements.
According to experiences with medium to large sets on PCs and workstations, disk size nowadays rarely is a problem, but memory size frequently is a bottleneck preventing full use of fast processors.
Backup storage (disk) must be large enough to hold the raw data plus several derived sets. For comfortable work it ought to provide space for the equivalent of about 10 copies of the raw data set.
On a single-processor machine with fiat high-speed memory, the latter must be large enough to hold 4 copies of the largest array one intends to work with (otherwise one runs into severe swapping problems with two-argument array operations such as matrix multiplications); for comfortable work, it ought to be at least twice as large. Thus, in order to run the processor at full speed, one sometimes may need almost as much free memory as free disk space.
4.4 Computational complexity.
Processor speed does not scale well, since computational complexity tends to increase faster than linearly with data size.
The position papers of Huber (1994b) and Wegman contain extensive discussions of computational complexity issues. so it suffices to sketch the salient points.
At present, batch tasks involving a total of 1012 floating point operations (flops) are easy to handle on PCs, and 1015 flops are just about feasible on supercomputers (this corresponds to 1 Gflop per second, sustained for two weeks). If performance continues to double every two years as in the past, we may reach 1018 operations in 20 years or so. This argument ignores all questions of data flow, whose just-in-time management may become a much worse bottleneck, considering that light moves a mere 0.3 mm in 10-12 seconds.
Even so, it follows that we can safely forget about so-called computer-intensive methods. For large sets and beyond, tasks with a computational complexity of O(n2) are not feasible as batch jobs on any machine now. The practical limit is around O(n3/2) operations-up to huge sets now, up to monster sets in 20 years.
A simple order-of-magnitude calculation shows that computer-intensive operations on the whole of a huge sets are infeasible even when the data has a coarse-grained structure and the intensive operations are restricted to one grain at a time, unless those grains are small. For the sake of the argument let us regard a total of n3/2 operations as feasible in the huge to monster range. If there are n1-α grains with a size m = nα each, and if a calculation with m3 operations is to be performed on each grain, we reach the feasible total of n3/2 operations with α = 0.25. For n = 1010 and n = 1012, this corresponds to grain sizes of merely 316 and 1000, respectively.
If the data is arranged in the form of a matrix with r rows and c columns, n = rc, with , then the same kind of argument shows that tasks with complexity O(nc), such as multiple regression, singular value decomposition and multiplication with a c-by-c matrix on the right, all are feasible, while clustering algorithms, with complexity O(nr), are out.
5 On the Structure of Large Data Sets.
Larger data sets tend to be structurally different from smaller ones. They are not just more of the same, they are larger because they have to be larger. In particular, they are, as a rule, much more heterogeneous.
5.1 Types of data.
Data can be experimental (from a designed experiment), observational (with little or no control of the process generating the data), or opportunistic (the data have been collected for an unrelated purpose). Massive data sets rarely belong to the first category, since by a clever design the data flow often can be reduced already before it is recorded (of course there are exceptions, e.g., computational fluid dynamics). But they often belong to the third category for plain reasons of economy.
Sometimes, data sets are massive because their collection is mandated by law (e.g. census and certain health data), or because they are collected anyway for other purposes (e.g. financial data). Often, however, they have to be massive because smaller sets will not do, and the predominant reason why they will not do is that the data in question are intrinsically heterogeneous. In particular, there may be many observers and many observed objects, both being located in space and time (e.g. aircraft traffic radar data).
I wonder whether the onslaught of massive data sets will finally force us to acknowledge and heed some studiously ignored. but long-standing admonitions going back to Deming (1940), and reiterated by Tukey (1962), to wit: The statistical protession as a whole is paying much too little attention to the need for dealing with heterogeneous data and with data that arise from conditions not in statistical control (randomness).
5.2 How do data sets grow?
If we think in terms of a hierarchical data organization, data sets may grow by acquiring
- more hierarchical layers, or
- more branches, or
- bigger leaves,
or all of the above. For example, some data sets are extremely large because each single leaf is an image comprising several megabytes.
It must be stressed that actual data sets very often either do not possess a tree structure, or else several conflicting ones. Instead of ''leaf'', the more neutral terms "case" or "grain" might therefore be more appropriate.
5.3 On data organization.
Statistical data bases often have a tree structure imposed on them through sampling or data collection (e.g. census districts-housing blocks-households-persons). But there may be several simultaneous conflicting tree structures (e.g. households and employers). Different priority orderings of categorical variables define different tree structures. For very large sets, a clean tree structure is rather the exception than the rule. In particular, those sets often are composed during the analysis from several, originally unrelated sources (for example health data and environmental data, collected independently for different purposes), that are linked as an afterthought through common external (here: geographical) references. In
our work with some opportunistic data sets, we found that this kind of associative joining of originally unrelated data sets was one of the most important operations. Moreover, the larger the data set is, the more important are subsetting operations, and also these cut across hierarchies or establish new ones.
No single logical structure fits all purposes. In our experience, the flat format traditional in statistics usually turned out to be most expedient: the data are organized as a family of loosely linked matrices, each row corresponding to a "case", with a fixed number of columns, each column or "variable" being of a homogeneous type.
Sometimes, an even simpler linear organization is preferable: a very long unstructured sequence of items, each item consisting of a single observation together with the circumstances under which it was made (who observed whom, which variable, when, where, and the like). From that basis, the interesting parts are extracted as required and restructured into matrix form.
How such a logical organization should be implemented physically is of course an entirely different question. The problem with massive data is to distribute not only the data, but also ancillary materials and retrieval tools over a hierarchy of storage devices, so that the ad hoc retrieval and reorganization tasks to be encountered in the course of a data analysis can be performed efficiently. For example, when should one work with pointers, when with copies?
5.4 Derived data sets.
It has been said that data analysis is a progress through sequences of derived data sets. We can distinguish between at least four levels of derived data sets:
- raw data set: rarely accessed, never modified,
- base data set: frequently accessed, rarely modified,
- low level derived sets: semi-permanent,
- high level derived sets: transient.
The base set is a cleaned and reorganized version of the raw set, streamlined for fast access and easy handling. The base set and low level derived sets ordinarily must be maintained on some mass storage device for reasons of space. Their preparation may involve sophisticated and complex, time-consuming data processing. The highest level derived sets almost by definition must fit into high speed memory for reasons of computational efficiency. The actual sizes and details of organization clearly will be governed by the available hardware and software. To fix the idea: on presently available super-workstations (100 Mflops, 1 Gbyte memory) with almost-present-day software one may just be able to handle a huge raw data set (10 Gbytes). In this case, high level derived sets might comprise about 100 Mbytes of data each for non-graphical tasks, but at most a few Mbytes for tasks involving highly interactive visualization. With massively parallel hardware some of these figures can be pushed higher, but adequate software does not yet exist. Here we have a definite software challenge: to design and implement a pilot system for general purpose data analysis of massive data sets on massively parallel hardware. More comments on this are contained in Section 10.
Derived sets can be formed in various ways. In our experience, low level derived sets mostly are created by application specific preprocessing, or by subsetting (more about this in Sections 8 and 9). Summaries are problematic with large sets-one ought not to group or summarize across heterogeneity-and splitting into homogeneous parts may be an overly expensive clustering problem. Thus, one will be restricted in practice to splitting based on external a priori information or, if data based, to CART-like single-variable methods. Afterwards, summaries of the homogeneous parts then may be recombined into new derived sets.
6 Data base management and related issues.
With larger data sets, data base management operations become ever more important.
6.1 Data base management and data analysis systems.
In view of the importance and central role of data base operations, it has been suggested that future data analysis (DA) systems should be built around a data base management (DBM) kernel. But paradoxically, all the usual DBM systems do a very poor job with large statistical data bases. For an explanation why this is so, see French (1995), who confronts the design goals of the ordinary DBM systems with those of decision support systems (DSS). Data analysis needs all facilities of a DSS, but more flexibility, in particular read-write symmetry to assist with the creation and manipulation of derived sets. As a consequence, the designer of a DA system must perforce also design and implement his or her own DBM subsystem.
6.2 Problems and challenges in the data base area.
Data base type operations get both harder and more important with larger sets, and they are used more frequently. With small and medium sets, where everything fits into high speed memory with room to spare and where all tasks are easily handled by a single processor, one does not even realize when one is performing a data base operation on the side. Larger sets may have to be spread over three or four hierarchical storage levels, each level possibly being split into several branches. Parallel processors and distributed memory create additional complications.
With large sets, processing time problems have to do more with storage access than with processor speed. To counteract that, one will have to produce small, possibly distributed, derived sets that selectively contain the required information and can be accessed quickly, rather than to work with pointers to the original, larger sets, even if this increases the total required storage space and creates tricky problems with keeping data integrity (e.g. with carrying back and expanding to a superset some changes one has made in a subset).
To get a good grip on those problems, we must identify, categorize and rank the tasks we actually perform now with moderately sized sets. We then must identify specific tasks that become harder, or more important, or both, with massive data sets, or with distributed processors and memory. In any case, one will need general, efficient subset operations that can operate on potentially very large base sets sitting on relatively slow storage devices.
7 The stages of an data analysis.
Most data analysis is done by non-statisticians, and there is much commonality hidden behind a diversity of languages. Rather than to try to squeeze the analysis into a too narrow view of what statistics is all about, statisticians ought to take advantage of the situation, get involved interdisciplinarily, learn from the experience, expand their own mind, and thereby their field, and act as catalysts for the dissemination of insights and methodologies. Moreover, the larger the data sets are, the more important the general science aspects of the analysis seem to become relative to the "statistical" aspects.
I believe that some of the discussions at the workshop have become derailed precisely because they were too much concerned with categories defined in terms of classical statistical concepts. In retrospect, it seems to me that it might have been more profitable to structure the discussions according to stages common to most data analyses and to watch out for problems that become more pronounced with more massive data.
At the risk of belaboring the obvious, I am providing a kind of commented check-list on steps to be watched.
7.1 Planning the data collection.
Very often, the data is already there, and one cannot influence its collection and its documentation any more.
The planning of a large scale data collection runs into problems known from Big Science projects: many different people are involved over several years in a kind of relay race. By the time the data are ready to be analyzed, the original designers of the experiment have left or are no longer interested, and the original goals may have been modified beyond recognition.
The obvious conclusion is that big scale data collection must be planned with am open mind for unforeseen modes of use. Beware of obsolescence. The documentation must be complete and self-sufficient -10 years later, technical specifications of the measuring equipment may be lost, and names of geographical locations and the scope of ZIP code numbers may have changed. Even the equipment to read the original media may be gone.
If one is planning to collect massive data, one should never forget to reserve a certain percentage of the total budget for data analysis and for data presentation.
7.2 Actual collection.
It is not possible to plan and specify correctly all details ahead. In particular, minor but crucial changes in the coding of the data often remain undocumented and must afterwards be reconstructed through painstaking detective work. Whoever is responsible for collecting the data must also be held responsible for documenting changes to the code book and keeping it up-to-date.
Everybody seems to be aware of the need for quality control, in particular with regard to instrument drift and the need for continuous calibration. There is much less awareness that also the quality of hardware, software and firmware of the recording system must be closely watched. I have personally encountered at least two unrelated instances where leading bits were lost due to integer overflow, in one case because the subject matter scientist had underestimated the range of a variable, in the other case because a programmer had overlooked that short integers do not suffice to count the seconds in a day. I also remember a case of unusable data summaries calculated on-line by the recording apparatus (we noticed the programming error only because the maximum occasionally fell below the average).
7.3 Data access.
As data analysts we need tools to read raw data in arbitrary and possibly weird binary formats. An example was given by Allen McIntosh in connection with Telephone Network Data. Not only the actual reading must be efficient, but also the ad hoc programming of data input must be straightforward and easy; we have repeatedly run into similar problems and have found that very often we were hunting a moving target of changing data formats.
In addition, we must be able to write data in similarly weird formats, in order that we can force heterogeneous sets into a homogeneous form.
7.4 Initial data checking.
Usually, the problem is viewed as one of legality and plausibility controls. What is outside of the plausible range is turned into missing values by the checking routine. This is a well-tested, successful recipe for overlooking obvious, unexpected features, such as the ozone hole.
The real problem of data checking has to do with finding systematic errors in the data collection. and this is much harder! For example. how does one find accidental omissions or duplications of entire batches of data? The "linear" data organization mentioned in Section 5.3 facilitates such checks.
7.5 Data analysis proper.
A person analyzing data alternates in no particular sequence between the following five types of activities:
With massive data sets, both the inspection and the comparison parts run into problems with visualization. Interpretation is thinking in models. Models are the domain of subject matter specialists, not of statisticians; not all models are stochastic! Therefore, modelling is one of the areas least amenable to a unified treatment and thus poses some special challenges with regard to its integration into general purpose data analysis software through export and import of derived sets.
7.6 The final product: presentation of arguments and conclusions.
With massive data, also the results of an analysis are likely to be massive. Jim Hodges takes the final product of an analysis to be an argument. I like this idea, but regard it a gross oversimplification: in the case of massive data we are dealing not with a single argument, but with a massive plural of arguments. For example with marketing data, a few hundred persons may be interested in specific arguments about their own part of the world, and once they become interested also in comparisons ("How is my product X doing in comparison to product Y of my competitor?"), complexity grows out of hand. However, there is a distinction between potential and actual: from a near infinity of potential arguments, only a much smaller, but unpredictable, selection will ever be actually used.
With massive data, the number of potential arguments is too large for the traditional pre-canned presentation in the form of a report. One rather must prepare a true decision support system, that is a customized, special-purpose data analysis system sitting on top of a suitable derived data set that is able to produce and present those arguments that the end user will need as a basis for his or her conclusions and decisions. If such a system does a significantly better job than, say, a 1 000-page report, everybody will be happy; this is a modest goal.
8 Examples and some thoughts on strategy.
By now, we have ample experience with the analysis of medium size data sets (data in the low megabyte range), and we begin to feel reasonably comfortable with large sets (108 bytes, or 100 megabytes), even though direct visualization of larger than medium sets in their entirety is an unsolved (and possibly unsolvable) problem. Let us postulate for the sake of the argument-somewhat optimistically-that we know how to deal with large sets.
Assume you are confronted with a huge data set (1010 bytes, or 10 gigabytes).
If a meaningful analysis is possible with a 1% random subsample, the problem is solved-we are back to large sets. Except for validation and confirmation, we might not even need the other 99%.
Assume therefore that random samples do not work for the problem under consideration. They may not work for one of several possible reasons: either because the data are very inhomogeneous, or because they are highly structured, or because one is looking for rare events, or any combination of the above. Density estimates or summary statistics then will not work either.
Example: Air traffic radar data. A typical situation is: some 6 radar stations observe several hundred aircraft, producing a 64-byte record per radar per aircraft per antenna turn, approximately 50 megabytes per hour. If one is to investigate a near collision, one extracts a subset, defined a by window in space and time surrounding the critical event. If one is to investigate reliability and accuracy of radars under real-life air traffic conditions, one must differentiate between gross errors and random measurement errors. Outlier detection and interpretation is highly non-trivial to begin with. Essentially, one must first connect thousands of dots to individual flight paths (technically, this amounts to tricky prediction and identification problems). The remaining dots are outliers, which then must be sorted out and identified according to their likely causes (a swarm of birds, a misrecorded range measurement, etc. etc.). In order to assess the measurement accuracy, one must compare individual measurements of single radars to flight paths determined from all radars, interpolated for that particular moment of time. Summary statistics do not enter at all, except at the very end, when the results are summarized for presentation.
I believe this example is typical: the analysis of large sets either begins with task and subject matter specific, complex preprocessing, or by extracting systematic subsets on the basis of a priori considerations, or a combination of the two. Summaries enter only later. Often, the subsets will be defined by windows in space and time. Even more often, the selection has two stages: locate remarkable features by searching for exceptional values of certain variables, then extract all data in the immediate neighborhoods of such features. For a non-trivial example of preprocessing, compare Ralph Kahn's description of the Earth Observing System and the construction of several layers of derived data sets. For one beginning with subsetting, see Eric Lander's description of how a geneticist will find the genes responsible for a particular disease: in a first step, the location in the human genome (which is a huge data set, 3 × 109 base pairs) is narrowed down by a factor 1 000 by a technique called genetic mapping (Lander 1995).
After the preparatory steps, one may want to look up additional information in other data bases, possibly from informal external sources:
Example: Environmental data. We found (through EDA of a large environmental data set) that very high radon levels were tightly localized and occurred in houses sitting on the locations of old mine shafts.
In this example, indiscriminate grouping would have hidden the problem and would have made it impossible to investigate causes and necessary remedies. The issue here is one of "data mining", not one of looking, like a traditional statistician, "for a central tendency, a measure of variability, measures of pairwise association between a number of variables". Random samples would have been useless, too: either one would have missed the exceptional values altogether, or one would have thrown them out as outliers.
Data analysis is detective work. The metaphor is trite but accurate. A careful distinction between tasks requiring the acumen of a first rate sleuth and tasks involving mere routine work is required. After perusing some of the literature on data mining, I have begun to wonder: too much emphasis is put on futile attempts to automate non-routine tasks, and not enough effort is spent on facilitating routine work.
In particular, everybody would like to identify noteworthy, but otherwise unspecified features by machine. From my experience with projection pursuit on small to medium sets I think this is a hopeless search for the holy Grail (computational complexity grows too fast with dimensionality). Pattern discovery is intrinsically harder than pattern recognition. A less ambitious, still hard task is the approximate match-up problem: find all structures in data set A that are approximately similar to some structure in data set B, where A and B are large. It is not at all clear whether even such problems can be solved within the desired O(n3/2)-complexity.
9 Volume reduction.
Volume reduction through data compression (with information loss) sometimes is advocated as a kind of panacea against data glut: "keep only what is exceptional, and summarize the rest". At least in the case of observational, as against experimental data, I think this is a daydream, possibly running counter to several of the reasons why the data are being collected in the first place! It is only a slight exaggeration to claim that observational data deserve to be saved either completely or else not at all. For example, a survey of the sky must be preserved completely if one later wants to check the early stages of supernovae.
But prior to specific analyses, targeted volume reduction usually can be performed on a data matrix either by reducing the number of rows (cases) or the number of columns (variables), or both. This might be done based on a priori interest. Data-driven reduction might be done for example by aggregation (i.e. by combining similar rows) or by summarizing (e.g. by forming means and variances over a homogeneous set of rows). Reducing the number of variables is usually called "dimension reduction" and can be done for example by variable selection (pick one of several highly correlated columns), or by forming (linear or non-linear) combinations of variables.
But it is difficult to put the general notion of dimension reduction on a sound theoretical basis; exploratory projection pursuit comes closest, but as its computational complexity increases exponentially with dimension, it is not well suited to massive data sets.
The next best general approach is dimension reduction through principal component or correspondence analysis (i.e. the truncated singular value decomposition): remember that the k leading singular values yield the best approximation (in the square norm sense) to the data matrix by a matrix of rank k. According to Sue Dumais this was surprisingly successful in the context of information retrieval even with quite large data matrices.
However, the most powerful type of dimension reduction is through fitting local models:
Example: Children's growth data. Several hundred children were observed periodically from birth to adulthood. Among other things. for each child, 36 measurements of body length were available. It was possible to reduce dimension from 36 to 6, by fitting a 6-parameter growth curve to each child. The functional form of that curve had several components and had been found by estimating some 30 global parameters from the total available population of children. Most of the 6 parameters specific to a particular child had an intuitive interpretation (age at puberty, duration and intensity of pubertal growth spurt, etc.); the residual error of the fit was only slightly larger than the intrinsic variability of the measurements.
Local model fitting is computationally expensive, but typically, it seems to stay within the critical O(n3/2)-limit.
10 Supercomputers and Software Challenges.
On presently available super-workstations (100 Mflops, 1 Gbyte memory) one can certainly handle large sets with almost-present-day software, and one may just barely be able to handle huge sets (10 Gbytes). To push those figures higher, one would have to invest in massively parallel supercomputers and novel software. Is such an investment worthwhile? What are its chances to succeed? I do not have final answers, but would like to offer some food for thought.
10.1 When do we need a Concorde?
I believe the perceived (or claimed) need for supercomputers exceeds the real need.
The problem is that of the Concorde: flying the Concorde is a status symbol, but if too much time is spent on the ground, the fast flight is not worth the money. Response times in fractions of a second are neither needed nor appreciated, if it takes several minutes, and possibly hours, to think up a question and to digest the answer.
We must learn to identify and differentiate between situations where supercomputers are necessary or at least truly advantageous, and situations where they are not.
We have encountered several examples with raw data sets in the 10-900 MByte range, where it had been claimed that a large mainframe or supercomputer plus several months of customized programming were needed in order to accomplish certain data analytic tasks. In all those cases we found that a well endowed PC, plus a a few weeks of customized programming in a high-level interpreted data analysis language (ISP) would be adequate, with comparable or even shorter execution times, at total costs that were orders of magnitude lower.
10.2 General Purpose Data Analysis and Supercomputers.
Can general purpose data analysis take advantage of supercomputers? Dongarra's famous benchmark comparison (I have the version of April 13, 1995 in front of me) highlights the crux of the problem: without special hand-tuning efforts to improve and distribute the data flow, the fastest multi-processor supercomputers beat the fastest single-processor superworkstations merely by a factor 4. which is not worth the money. Tuning may yield another factor 20 or so. We need to discuss strategies for recovering that factor 20. Ad hoc code tweaking on a case by case basis is so labor intensive and error-prone that ordinarily it will be out of the question. That is, we have a very serious software challenge.
Our experiences with ISP suggest how a speed-up could be done in a general purpose system. ISP is a small, general purpose, array oriented, interpretive data analysis language somewhat similar to S. It contains a core of 100-200 building blocks (commands or functions). Explicit, interpreted looping is slow, but it is rarely needed. Efficiency relative to compiled code increases with data size because of array orientedness. I believe that many of the building blocks can be beefed up for parallelism, but there may be snags. For example, reorganization and redistribution of data between subtasks might be more expensive than anticipated.
The proof of the pudding is in the eating, i.e. we need to conduct the experiment suggested by Huber (1994b, at the end of section 8) and build a working pilot system. To prove the point we should aim for a small, relatively unspecific, but universal system. For a basis, I would choose ISP over S precisely because it has these three properties. But we should take the opportunity to build a new, better system from scratch, rather than trying to port an existing system. From past experiences I estimate that it will take three years before the pilot system for such an experiment attains sufficient maturity for beta testing. We better begin soon.
The next section summarizes the issues and experiences I consider important for such an undertaking.
10.3 Languages, Programming Environments and Data-Based Prototyping.
Here is an account of some things we have learned from our ISP experience (see also Huber (1994a)). The traditional programming languages (Fortran, C, Pascal, ...) are too low-level for the purposes of data analysis. We learned that already back in the 1970's: there is much re-programming and re-use under just slightly different circumstances, and for that, these languages are too clumsy and too error-prone (cf. also the comments by Allen McIntosh). Subroutine libraries like LINPACK help, but are not enough. We need a high-level array-oriented language on top, with a simple syntax and safe semantics, whose units or building blocks must be very carefully selected for universal use. The language must be user-extensible through combination of those units into new building blocks with the same syntax. The core building blocks must be highly optimized.
In the 1980's we became aware of the need for programming environments in the style of Smalltalk and LISP machines. In data analysis, you never hit it right the first, or the second, or even the third time around, and it must be possible to play interactively with modifications, but without having to start everything from scratch. Rather than building a system on top of Smalltalk or LISP, we decided to augment our data analysis language ISP so that it acquired its own programming environment.
Around 1990, we realized that we had to go even further into what we call data-based prototyping: build a customized data analysis system while actually doing production work with the data. The basic problem is that the user (whether it is a customer or we ourselves) never is able to specify the requirements in advance. Our solution is to mock up a rough-and-ready working prototype, and let the user work with it on his or her actual data. Without involving the actual user early and actively in the use and re-design of the system, in particular in issues of presentation of results (what to show and how), it is extremely hard to arrive at a satisfactory solution. Some comments made by Schmitz and Schoenherr in the context of marketing databases nicely illustrate the difficulty. A high-level language and a good programming environment are indispensable prerequisites for data-based prototyping.
None of the existing languages and systems is entirely satisfactory. After seeing Carr's list of preferences in Section 4 of his position paper, it seems to me that our own software (ISP) comes closer to an ideal, universal system than I ever would have suspected. It does a job superior to Matlab in the area of visualization; we use it instead of GIS because the latter systems have difficulties with discontinuous values that are attached to arbitrary points rather than to grid points or political districts; after becoming dissatisfied with all available general data base software, we began to improvise our own approaches in ISP; we prefer it to SAS and S for data analysis, especially with large sets. Carr's comment on the ''diminished influence of the statistical community upon my work'' is reflected by the fact that in ISP we never felt compelled to go beyond a frugal minimum of statistical functions.
11 Summary of conclusions.
- With the analysis of massive data sets, one has to expect extensive, application-and task-specific preprocessing. We need tools for efficient ad hoc programming.
- It is necessary to provide a high-level data analysis language, a programming environment and facilities for data-based prototyping.
- Subset manipulation and other data base operations, in particular the linking of originally unrelated data sets, are very important. We need a data base management system with characteristics rather different from those of a traditional DBMS.
- The need for summaries arises not at the beginning, but toward the end of the analysis.
- Individual massive data sets require customized data analysis systems tailored specifically toward them, first for the analysis, and then for the presentation of results.
- Pay attention to heterogeneity in the data.
- Pay attention to computational complexity; keep it below O(n3/2), or forget about the algorithm.
- The main software challenge: we should build a pilot data analysis system working according to the above principles on massively parallel machines.
Deming, W. E. (1940). Discussion of Professor Hotelling's Paper. Ann. Math. Statist. 11 470-471.
French, C. D. (1995). "One Size Fits All" Database Architectures Do Not Work For DSS. SIGMOD RECORD, Vol. 24, June 1995. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data. ACM Press.
Huber, P. J. (1994a). Languages for Statistics and Data Analysis. In: Computational Statistics, P. Dirschedl and R. Ostermann (Eds.), Physica-Verlag, Heidelberg.
Huber, P. J. (1994b). Huge Data Sets. In: Proceedings of the 1994 COMPSTAT Meeting, R. Dutter and W. Grossmann (Eds.), Physica-Verlag, Heidelberg.
Lander, E. S. (1995). Mapping heredity: Using probabilistic models and algorithms to map genes and genomes, Notices of the AMS, July 1995, 747-753. Adapted from: Calculating the Secrets of Life. National Academy Press, Washington, D.C. 1995.
Tukey, J. W. (1962). The Future of Data Analysis. Ann. Math. Statist. 33 1-67.