• Basic data structures. This category includes structures such as hash tables, inverted indices, tables/relations, etc. These are data structures that one finds in standard textbooks on algorithms and databases (Cormen et al., 2009; Garcia-Molina et al., 2008).
  • More abstract, but basic, mathematical structures. This category includes structures such as sets, vectors, matrices, graphs, and metric spaces. Many of these mathematical structures have sparse variants, e.g., vectors and graphs with few non-trivial components, matrices of low rank, and so on. Each mathematical structure can be represented using many data structures, with different implementations supporting different operations and optimizing different metrics.
  • Derived mathematical structures. This category includes more sophisticated structures such as clusters, linear projections, data samples, and such. Although these representations are often basic mathematical structures, they do not directly represent the original data, but are instead derived from the data and often can be viewed as representations of components in a model of the data.

The committee also differentiates between two different “design goals” that one has to keep in mind in choosing appropriate data representations:

  • First, on the upstream or data-acquisition or data-generation side, one would like a structure that is “sufficiently close” to the data. Such a representation supports the process of generating, storing, or accessing the data while providing a model for the noise or uncertainty properties of the data. The goal of such representations is to support all of these operations without significantly altering the data.
  • Second, on the downstream or data analysis side, one would like a structure that has both a flexible description and is tractable algorithmically. That is, it should be expressive enough so that it can describe a range of types of data, but it should not be so expressive that it can describe anything (in which case computations and inferences of interest would likely be intractable).


Although a picture may be worth a thousand words, a good representation of data is priceless: a single data representation, or sometimes multiple ones, allows one to carry out a large number of data processing and analysis tasks in a manner that is both algorithmically efficient and statistically meaningful. This section presents an overview of the various goals of data

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Terms of Use and Privacy Statement