together in various ways. It is also necessary for these methods to become increasingly “intelligent” as the scale of the data increases, because it is insufficient to simply identify matches and propose an ordered list of hits. New techniques of machine learning continue to be developed to address this need. Another new consideration is that data often come in the form of a network; performing mathematical and statistical analyses on networks requires new methods.
Statistical decision theory is the branch of statistics specifically devoted to using data to enable optimal decisions. What it adds to classical statistics beyond inference of probabilities is that it integrates into the decision information about costs and the value of various outcomes. It is critical to many of the projects envisioned in OSTP’s Big Data initiative. To pick two examples from many,12 the Center for Medicare and Medicaid Services (CMS) has the program Using Administrative Claims Data (Medicare) to Improve Decision-Making while the Food and Drug Administration is establishing the Virtual Laboratory Environment, which will apply “advanced analytical and statistical tools and capabilities, crowd-sourcing of analytics to predict and promote public health.” Better decision making has been singled out as one of the most promising ways to curtail rising medical costs while optimizing patient outcomes, and statistics is at the heart of this issue.
Ideas from statistics, theoretical computer science, and mathematics have provided a growing arsenal of methods for machine learning and statistical learning theory: principal component analysis, nearest neighbor techniques, support vector machines, Bayesian and sensor networks, regularized learning, reinforcement learning, sparse estimation, neural networks, kernel methods, tree-based methods, the bootstrap, boosting, association rules, hidden Markov models, and independent component analysis—and the list keeps growing. This is a field where new ideas are introduced in rapid-fire succession, where the effectiveness of new methods often is markedly greater than existing ones, and where new classes of problems appear frequently.
Large data sets require a high level of computational sophistication because operations that are easy at a small scale—such as moving data between machines or in and out of storage, visualizing the data, or displaying results—can all require substantial algorithmic ingenuity. As a data set becomes increasingly massive, it may be infeasible to gather it in one place and analyze it as a whole. Thus, there may be a need for algorithms that operate in a distributed fashion, analyzing subsets of the data and aggregating those results to understand the complete set. One aspect of this is the
12 The information and quotation here are drawn from OSTP, “Fact Sheet: Big Data Across the Federal Government, March 29, 2012.” Available at http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_final.pdf.