"3 Improving Current Capabilities for Data Integration in Science." Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop. Washington, DC: The National Academies Press, 2010.
been layered over various back ends, notably MonetDB. RAM performs query normalization, simplification, and optimization within its array model before translating into queries on the underlying relational engine. That layer can perform further optimization in the relational model before executing the queries.
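The mapping from an array model onto a relational back end can be sketched as follows. This is an illustrative example only, not RAM's actual implementation: it assumes a one-dimensional array stored as (index, value) tuples, so that an array-level expression such as B[i] = 2 * A[i] translates into a single SQL query the relational engine can then optimize.

```python
import sqlite3

# Store array A relationally as (idx, val) tuples (hypothetical layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (idx INTEGER PRIMARY KEY, val REAL)")
conn.executemany("INSERT INTO A VALUES (?, ?)", [(0, 1.0), (1, 2.5), (2, 4.0)])

def translate_scale(array_table, factor):
    """Translate the array expression B[i] = factor * A[i] into SQL.

    The array layer emits one relational query; the underlying engine
    is then free to apply its own relational optimizations.
    """
    return f"SELECT idx, val * {factor} AS val FROM {array_table} ORDER BY idx"

rows = conn.execute(translate_scale("A", 2)).fetchall()
print(rows)  # [(0, 2.0), (1, 5.0), (2, 8.0)]
```

The key design point is that the array layer never iterates over elements itself; it rewrites whole-array expressions into set-oriented queries, leaving element-at-a-time work to the relational engine.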
Secondary-Storage Extensions to Data-Analysis Environments. The second approach to layering uses a DBMS to provide relatively seamless access to secondary storage from a data-analysis environment. The type system of the environment thus effectively becomes the data model, usually providing vectors, matrices, and higher-dimensional arrays. There is no special query language in this approach—disk-resident data are manipulated with the same functions used for in-memory data. It is up to the underlying interface to the DBMS to determine when functions can be performed in the database and when data need to be retrieved for main-memory manipulation. Ohkawa (1993) used this approach with the New S statistical package and an object-oriented DBMS. The RIOT prototype (Zhang et al., 2009) supports the R data-analysis environment using a relational DBMS. To create optimization opportunities in the underlying DBMS, both systems use lazy evaluation techniques. An operation on a secondary-storage object merely creates an expression that represents the application of the operation. Repeated deferral allows accumulating operations into one or more expression trees. Such trees are evaluated only when their result is to be output to the user, at which point they may be optimized before processing.
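The lazy-evaluation technique described above can be sketched in a few lines. This is a hedged illustration, not RIOT's or Ohkawa's actual code: operations on an object build an expression tree rather than computing immediately, and the tree is forced (at which point a real system could first optimize it) only when the result must be shown to the user.

```python
class Lazy:
    """A deferred expression node; all names here are illustrative."""

    def __init__(self, op, args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Lazy("add", [self, other])   # defer: build a node, don't compute

    def __mul__(self, other):
        return Lazy("mul", [self, other])   # repeated deferral grows the tree

    def evaluate(self):
        """Forced only at output time; an optimizer could rewrite the tree first."""
        if self.op == "leaf":
            return self.args[0]
        vals = [a.evaluate() for a in self.args]
        return {"add": vals[0] + vals[1], "mul": vals[0] * vals[1]}[self.op]

def leaf(value):
    """Wrap a (conceptually disk-resident) value as an expression leaf."""
    return Lazy("leaf", [value])

expr = (leaf(3) + leaf(4)) * leaf(2)   # builds a tree; nothing is computed yet
print(expr.evaluate())                 # output forces evaluation: prints 14
```

Because nothing runs until `evaluate`, the accumulated tree gives the system a whole-expression view, which is precisely what creates the optimization opportunities the paragraph describes.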
According to Dr. Maier, the SciDB project (Cudré-Mauroux et al., 2009) has recently begun development of an open-source database with fully native support for an array model, including an array-aware storage manager. In addition to a data model and algebra for multi-dimensional arrays, SciDB will support history and versioning of arrays, provenance, uncertainty annotations, and parallel execution of queries. If successful, it should provide a suitable platform for integrating extremely large scientific datasets.