challenge of data assimilation, in which we wish to use new data to update model parameters without reanalyzing the entire data set. This is essential when new waves of data continue to arrive, or when subsets are analyzed in isolation from one another, and one aims to improve the model and inferences adaptively—for example, with streaming algorithms. The mathematical sciences contribute in important ways to the development of new algorithms and methods of analysis, as do other fields.
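As a minimal sketch of the streaming idea—assuming, purely for illustration, that the "model parameters" are just running summary statistics—Welford's online algorithm updates a mean and variance one observation at a time, in constant memory, without ever revisiting earlier data:

```python
# Illustrative sketch of a streaming update (Welford's online algorithm).
# Each new observation updates the estimates in O(1) time and memory;
# the full data set is never reanalyzed.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Incorporate one new observation."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

The same incremental pattern—maintain a small sufficient summary, fold each arriving datum into it—underlies far more sophisticated assimilation schemes.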
Simplifying the data so as to find their underlying structure is usually essential in large data sets. The general goal of dimensionality reduction—taking data with a large number of measurements and finding which combinations of the measurements are sufficient to embody the essential features of the data set—is pervasive. Various methods with their roots in linear algebra, statistics, and, increasingly, deep results from real analysis and probabilistic methods—such as random projections and diffusion geometry—are used in different circumstances, and improvements are still needed. Such issues are central to NSF’s Core Techniques and Technologies for Advancing Big Data Science and Engineering program and to data as diverse as those from climate, genomics, and threat reduction.

Related to search, and also to dimensionality reduction, is the issue of anomaly detection—determining which changes in a large system are abnormal or dangerous, often characterized as the needle-in-a-haystack problem. The Defense Advanced Research Projects Agency (DARPA) has its Anomaly Detection at Multiple Scales program on anomaly detection and characterization in massive data sets, with a particular focus on insider-threat detection, in which anomalous actions by an individual are detected against a background of routine network activity. A wide range of statistical and machine-learning techniques can be brought to bear on this problem, some growing out of methods originally developed for quality control, others pioneered by mathematicians for detecting credit card fraud.
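The quality-control lineage of these techniques can be illustrated with the simplest possible control-chart rule: estimate the background's mean and spread, then flag any observation lying many standard deviations away. The data and the three-sigma threshold below are illustrative only; real insider-threat or fraud detectors use far richer models of "routine" behavior.

```python
# A minimal anomaly-detection sketch in the spirit of quality-control
# charts: flag observations more than k standard deviations from the
# mean of background activity. Data and threshold are illustrative.
import statistics

background = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11]  # routine activity counts
mu = statistics.mean(background)
sigma = statistics.stdev(background)

def is_anomalous(x, k=3.0):
    """Flag x if it lies more than k standard deviations from the mean."""
    return abs(x - mu) > k * sigma
```

The needle-in-a-haystack difficulty enters when the background itself is high-dimensional, heterogeneous, and drifting—precisely where the more sophisticated statistical and machine-learning methods mentioned above are needed.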
Two types of data that are extraordinarily important yet exceptionally subtle to analyze are words and images. The fields of text mining and natural language processing deal with finding and extracting information and knowledge from a variety of textual sources, and with creating probabilistic models of how language and grammatical structures are generated. Image processing, machine vision, and image analysis attempt to restore noisy image data to a form that can be interpreted by the human eye, or to bypass the human eye altogether and represent within a computer what is going on in an image, without human intervention.
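The probabilistic view of language mentioned above can be made concrete with one of its simplest instances: a maximum-likelihood bigram model, which estimates the probability of the next word by counting on a corpus. The toy corpus here is illustrative; practical systems use vastly larger corpora and richer models, but the count-and-normalize idea is the same.

```python
# A minimal probabilistic language model: maximum-likelihood bigram
# estimates from a toy corpus. P(next | word) is the fraction of
# occurrences of `word` that are followed by `next`.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, nxt):
    """Estimate P(nxt | word) under the bigram model."""
    return bigrams[(word, nxt)] / unigrams[word] if unigrams[word] else 0.0
```

For instance, since "the" is followed by "cat" in two of its three occurrences here, the model assigns `p_next("the", "cat")` a value of 2/3.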
Related to image analysis is the problem of finding an appropriate language for describing shape. Multiple approaches, from level sets to “eigenshapes,” are brought to bear, with differential geometry playing a central role. As part of this problem, methods are needed to describe small deformations of shapes, usually using some aspect of the geometry of the space of