"3 Improving Current Capabilities for Data Integration in Science." Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop. Washington, DC: The National Academies Press, 2010.
been layered over various back ends, notably MonetDB. RAM performs query normalization, simplification, and optimization within its array model before translating into queries on the underlying relational engine. That layer can perform further optimization in the relational model before executing the queries.
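The mapping from an array model onto a relational back end can be sketched as follows. This is an illustrative example only, not RAM's actual implementation: it assumes a one-dimensional array stored as (index, value) tuples, so that an array-level expression such as B[i] = 2 * A[i] translates into a single SQL query the relational engine can then optimize.

```python
import sqlite3

# Store array A relationally as (idx, val) tuples (hypothetical layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (idx INTEGER PRIMARY KEY, val REAL)")
conn.executemany("INSERT INTO A VALUES (?, ?)", [(0, 1.0), (1, 2.5), (2, 4.0)])

def translate_scale(array_table, factor):
    """Translate the array expression B[i] = factor * A[i] into SQL.

    The array layer emits one relational query; the underlying engine
    is then free to apply its own relational optimizations.
    """
    return f"SELECT idx, val * {factor} AS val FROM {array_table} ORDER BY idx"

rows = conn.execute(translate_scale("A", 2)).fetchall()
print(rows)  # [(0, 2.0), (1, 5.0), (2, 8.0)]
```

The key design point is that the array layer never iterates over elements itself; it rewrites whole-array expressions into set-oriented queries, leaving element-at-a-time work to the relational engine.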
Secondary-Storage Extensions to Data-Analysis Environments. The second approach to layering uses a DBMS to provide relatively seamless access to secondary storage from a data-analysis environment. The type system of the environment thus effectively becomes the data model, usually providing vectors, matrices, and higher-dimensional arrays. There is no special query language in this approach—disk-resident data are manipulated with the same functions used for in-memory data. It is up to the underlying interface to the DBMS to determine when functions can be performed in the database and when data need to be retrieved for main-memory manipulation. Ohkawa (1993) used this approach with the New S statistical package and an object-oriented DBMS. The RIOT prototype (Zhang et al., 2009) supports the R data-analysis environment using a relational DBMS. To create optimization opportunities in the underlying DBMS, both systems use lazy evaluation techniques. An operation on a secondary-storage object merely creates an expression that represents the application of the operation. Repeated deferral allows accumulating operations into one or more expression trees. Such trees are evaluated only when their result is to be output to the user, at which point they may be optimized before processing.
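The lazy-evaluation technique described above can be sketched in a few lines. This is a hedged illustration, not RIOT's or Ohkawa's actual code: operations on an object build an expression tree rather than computing immediately, and the tree is forced (at which point a real system could first optimize it) only when the result must be shown to the user.

```python
class Lazy:
    """A deferred expression node; all names here are illustrative."""

    def __init__(self, op, args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Lazy("add", [self, other])   # defer: build a node, don't compute

    def __mul__(self, other):
        return Lazy("mul", [self, other])   # repeated deferral grows the tree

    def evaluate(self):
        """Forced only at output time; an optimizer could rewrite the tree first."""
        if self.op == "leaf":
            return self.args[0]
        vals = [a.evaluate() for a in self.args]
        return {"add": vals[0] + vals[1], "mul": vals[0] * vals[1]}[self.op]

def leaf(value):
    """Wrap a (conceptually disk-resident) value as an expression leaf."""
    return Lazy("leaf", [value])

expr = (leaf(3) + leaf(4)) * leaf(2)   # builds a tree; nothing is computed yet
print(expr.evaluate())                 # output forces evaluation: prints 14
```

Because nothing runs until `evaluate`, the accumulated tree gives the system a whole-expression view, which is precisely what creates the optimization opportunities the paragraph describes.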
According to Dr. Maier, the SciDB project (Cudré-Mauroux et al., 2009) has recently begun development of an open-source database with fully native support for an array model, including an array-aware storage manager. In addition to a data model and algebra for multi-dimensional arrays, SciDB will support history and versioning of arrays, provenance, uncertainty annotations, and parallel execution of queries. If successful, it should provide a suitable platform for integrating extremely large scientific datasets.