sufficient to embody the essential features of the data set—is pervasive. Various methods with their roots in linear algebra and statistics are used and continually being improved, and increasingly deep results from real analysis and probabilistic methods—such as random projections and diffusion geometry—are being brought to bear.
Statisticians contribute a long history of experience in dealing with the intricacies of real-world data—how to detect when something is going wrong with the data-gathering process, how to distinguish between outliers that are important and outliers that come from measurement error, how to design the data-gathering process so as to maximize the value of the data collected, how to cleanse the data of inevitable errors and gaps. As data sets grow into the terabyte and petabyte range, existing statistical tools may no longer suffice, and continuing innovation is necessary. In the realm of massive data, long-standing paradigms can break—for example, false positives can become the norm rather than the exception—and more research endeavors need strong statistical expertise.
For example, in a large portion of data-intensive problems, observations are abundant and the challenge is not so much how to avoid being deceived by a small sample size as to be able to detect relevant patterns. As noted in the New York Times: “In field after field, computing and the Web are creating new realms of data to explore sensor signals, surveillance tapes, social network chatter, public records and more.”11 This may call for researchers in machine or statistical learning to develop algorithms that predict an outcome based on empirical data, such as sensor data or streams from the Internet. In that approach, one uses a sample of the data to discover relationships between a quantity of interest and explanatory variables. Strong mathematical scientists who work in this area combine best practices in data modeling, uncertainty management, and statistics, with insight about the application area and the computing implementation. These prediction problems arise everywhere: in finance and medicine, of course, but they are also crucial to the modern economy so much so that businesses like Netflix, Google, and Facebook rely on progress in this area. A recent trend is that statistics graduate students who in the past often ended up in pharmaceutical companies, where they would design clinical trials, are increasingly also being recruited by companies focused on Internet commerce.
Finding what one is looking for in a vast sea of data depends on search algorithms. This is an expanding subject, because these algorithms need to search a database where the data may include words, numbers, images and video, sounds, answers to questionnaires, and other types of data, all linked
11 Steve Lohr, 2009, For today’s graduate, just one word: Statistics. New York Times, August 5.