11 Summary of conclusions.
- With the analysis of massive data sets, one has to expect extensive, application- and task-specific preprocessing. We need tools for efficient ad hoc programming.
- It is necessary to provide a high-level data analysis language, a programming environment and facilities for data-based prototyping.
- Subset manipulation and other data base operations, in particular the linking of originally unrelated data sets, are very important. We need a data base management system with characteristics rather different from those of a traditional DBMS.
- The need for summaries arises not at the beginning, but toward the end of the analysis.
- Individual massive data sets require customized data analysis systems tailored specifically toward them, first for the analysis, and then for the presentation of results.
- Pay attention to heterogeneity in the data.
- Pay attention to computational complexity; keep it below O(n^{3/2}), or forget about the algorithm.
- The main software challenge: we should build a pilot data analysis system working according to the above principles on massively parallel machines.
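The complexity rule above can be made concrete with a rough back-of-the-envelope calculation. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the machine speed of 10^9 simple operations per second is an assumed round figure chosen only for illustration. It compares how the operation count grows with the data set size n for a few common complexity classes.

```python
import math


def operation_counts(n: int) -> dict[str, float]:
    """Rough operation counts at data set size n for common complexity classes."""
    return {
        "n": float(n),
        "n log2 n": n * math.log2(n),
        "n^(3/2)": float(n) ** 1.5,
        "n^2": float(n) ** 2,
    }


def runtime_seconds(ops: float, ops_per_second: float = 1e9) -> float:
    """Convert an operation count to seconds.

    ops_per_second is an ASSUMED machine speed for illustration only;
    it is not a figure taken from the text.
    """
    return ops / ops_per_second


if __name__ == "__main__":
    seconds_per_day = 86_400.0
    for exponent in (6, 8, 10, 12):
        n = 10 ** exponent
        print(f"n = 10^{exponent}")
        for label, ops in operation_counts(n).items():
            days = runtime_seconds(ops) / seconds_per_day
            print(f"  {label:>9}: {ops:.2e} operations, about {days:.2e} days")
```

Under the assumed machine speed, an n^(3/2) algorithm at n = 10^10 needs about 10^15 operations, on the order of ten days of computing, whereas an n^2 algorithm needs about 10^20 operations, which is far out of reach. This is one way to read the suggested threshold: beyond roughly n^(3/2), running time grows faster than any plausible increase in hardware speed can absorb.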