4
Success in Data Integration

As a domain becomes more mature, more scientists begin to develop interest in it and progress starts to depend on the sharing of data. In the beginning such sharing is quite difficult, so a domain must develop ways to facilitate sharing as it matures. This includes the setting of standards, which may slow progress in individual groups to achieve a greater good for all. While the discussions recounted above evinced skepticism about any global schema, there are places where standards have been quite successful, some of which are described below. The most successful standards tend to occur bottom-up. In other words, individual scientists recognize the need and work to build consensus standards. Other standards are imposed top-down by some sort of dominant force in an enterprise. Top-down appears to work only rarely, and bottom-up approaches have a much better chance of success, according to several workshop participants. However, standards are also facilitated if there is a dominant player in a domain, as pointed out by Dr. Stonebraker. In enterprise data, for example, Walmart has so much influence that it can specify standards and force all of its suppliers to conform if they wish to sell goods to Walmart. Google also has this sort of influence in the Web search space. In domains where there is a dominant player, standards are much easier to achieve.

The successes of the Sloan Digital Sky Survey and Genbank in sharing astronomy data and genomic data are well known in the scientific community. The National Spatial Data Infrastructure (NSDI), mentioned by Dr. Clarke, has been emulated worldwide as the global spatial data infrastructure and is another example of success. The NSDI was prompted by an Executive Order issued by President Clinton in 1994, which also called for “development of a National Geospatial Data Clearinghouse, spatial data standards, a National Digital Geospatial Data Framework and partnerships for data acquisition.”1 The NSDI enables sharing of geographical information, elimination of redundancies, and other significant benefits. Some other success stories, perhaps less well known, are presented here.

FREEBASE

Freebase is a large, collaboratively edited database of crosslinked data developed by Metaweb Technologies. Freebase has incorporated the contents of several large, openly accessible data sources, such as Wikipedia and Musicbrainz, allowing users to add data and build structure by adding metadata tags that categorize or connect items.2 To date, most of the information in Freebase relates to people and places, though it can accommodate a wide range of data types, including research data.

Freebase is intended to be an important component of the Semantic Web, allowing automation of many Web search functions and communication between electronic devices (New York Times, 2007). However, Freebase has quality issues, omissions, errors, and redundant information; most of its information is not truly integrated. While Freebase is a success in some respects (community contributions have led to large volumes of information and it is possible to get useful answers to some queries), it cannot guarantee accurate and complete answers. Overall, Freebase demonstrates a novel mechanism for data aggregation, but it has not yet solved many of the challenges of information integration.

MELBOURNE HEALTH

Melbourne Health, a healthcare provider in Melbourne, Australia, envisions building a generic informatics model for beneficial collaboration across organizations and expansion to other research areas (Bihammar and Chong, 2007). Melbourne Health’s original goal was to link the databases from seven hospitals and two research institutes for multiple disease research. The challenges in this work come from the large amount of data, the paucity of data standards, poor interoperability between databases, and the need to ensure compliance with ethical, privacy, and regulatory norms.

Medical documents and research data come from files, Excel spreadsheets, and databases. The hospitals and clinics may use different systems. The HL7 Clinical Document Architecture (CDA), an XML-based markup standard intended to specify the encoding, structure, and semantics of clinical documents for exchange, is used. According to the IDC case study (Bihammar and Chong, 2007), Melbourne Health has linked research databases in 16 organizations, allowing them to collaborate.

SCIENCE COMMONS AND NEUROCOMMONS

Science Commons (http://sciencecommons.org), launched in 2005, is an offshoot of Creative Commons, a not-for-profit organization that develops and disseminates free legal and technical tools to facilitate the finding and sharing of creative content (Garlick, 2005). It also focuses on lowering barriers that researchers face to sharing data, publications, and materials. The goals are to expand sharing, interoperability, and reuse of data, but these goals are hampered by legal and cultural barriers. Although research data are not subject to copyright protection, the arrangement of data and the structure of databases may be protected (for a discussion of the legal context for sharing and accessing research data, see NRC, 2009). Specific rights to reuse or integrate data may be unclear, and integrating data collected under different jurisdictions may be problematic. Researchers in some fields might take proprietary approaches to data or might lack the motivation to make their data available proactively. Science Commons has developed several programs and tools to lower these barriers. The Protocol for Implementing Open Access Data allows researchers to mark their data for machine-readable discovery in the public domain so that their databases can be legally integrated with others, including those collected in other jurisdictions.3

The NeuroCommons project, under the auspices of Science Commons, is developing an open-source knowledge management platform for biological research. The goal is to make all knowledge sources (including articles, knowledge bases, research data, and physical materials) interoperable and uniformly accessible by computational agents. NeuroCommons is a prototype framework for creating information artifacts that can provide lessons for future communities, particularly in reaching community consensus around technical standards and curation processes. The NeuroCommons framework utilizes URIs and RDF, making it part of the Semantic Web.4

1 Quoted from http://www.fgdc.gov/nsdi/policyandplanning/executive_order. Accessed May 5, 2010.
2 Available at http://freebase.com. Accessed October 23, 2009.
3 Information drawn from http://www.sciencecommons.org. Accessed October 23, 2009.
4 Information drawn from http://neurocommons.org. Accessed October 23, 2009.

To apply this idea to scientific information artifacts, one creates a set of conventions for syntactic and semantic compatibility among components and a standard packaging mechanism to make selecting and installing components easy. One starts with the primary sources (databases, knowledge bases, and the like), applies a script to do the normalization, and comes up with a packaged component. The resulting “binary” may or may not be collected with others to make a distribution. Someone creating a local installation optimized for local query obtains needed components from one or more distributions and installs those into their own environment.

Some two dozen components have been created and collected in the NeuroCommons framework. The components are independent and the architecture is open, so that anyone may pick and choose the ones they like without having to take all of them. One may create new components and either add them to the distribution (subject to quality control), create a new distribution, or just use them privately. Currently the NeuroCommons distribution is accomplished either through a set of RDF files or a database dump.

Bio2RDF

Bio2RDF (http://bio2rdf.org) is an open-source project that aims to facilitate biomedical knowledge discovery using Semantic Web technologies. Bio2RDF is an important contributor to the Linked Data Web, offering the integration of over 30 major biological databases with content ranging from biological sequences (such as are stored in UniProt, Genbank, RefSeq, Entrez Gene), structures (from the Protein Data Bank), pathways and interactions (cPATHs), and diseases (OMIM), to community-developed biomedical ontologies (OBO). This project builds on W3C standards for sharing information over existing Web architecture and representing biomedical knowledge using standardized logic-based languages. Powered by open-source tools, Bio2RDF enables scientists to not only explore manually curated and computed aggregated knowledge about biological entities but to also link their data and enable all scientists to ask fairly sophisticated questions across distributed, but integrated, biomedical resources. Bio2RDF-linked data are available today as N3 files, indexed Virtuoso databases, and SPARQL endpoints across three mirrors located in Canada and Australia.

With interest growing in the Bio2RDF data and services beyond the initial developers, the group is fielding requests to add more than 50 additional data sources in the areas of yeast and human biology, toxicogenomics, and drug discovery.
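
The workflow described above for NeuroCommons (normalize each primary source with a script, package the result as a component, combine components into a distribution) and the cross-resource questions that Bio2RDF is meant to support can be illustrated with a small sketch. The Python below is a toy under stated assumptions, not the NeuroCommons or Bio2RDF implementation: the `http://example.org/` namespace, the record fields, the gene, protein, and disease identifiers, and every function name are hypothetical, and an in-memory set of triples with a simple pattern matcher stands in for RDF files and a SPARQL endpoint.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical namespace and vocabulary, used only for illustration.
BASE = "http://example.org/"
SYMBOL = BASE + "vocab:symbol"
ENCODES = BASE + "vocab:encodes"
NAME = BASE + "vocab:name"
INVOLVES = BASE + "vocab:involves"


def uri(namespace: str, identifier: str) -> str:
    """Normalize an identifier into one shared URI form (trimmed, upper case)."""
    return f"{BASE}{namespace}:{identifier.strip().upper()}"


def normalize_genes(rows: Iterable[Dict[str, str]]) -> Set[Triple]:
    """Normalization script for a made-up gene table; returns one component."""
    component: Set[Triple] = set()
    for row in rows:
        gene = uri("gene", row["id"])
        component.add((gene, SYMBOL, row["symbol"]))
        component.add((gene, ENCODES, uri("protein", row["protein"])))
    return component


def normalize_diseases(rows: Iterable[Dict[str, str]]) -> Set[Triple]:
    """Normalization script for a made-up disease table; returns one component."""
    component: Set[Triple] = set()
    for row in rows:
        disease = uri("disease", row["id"])
        component.add((disease, NAME, row["name"]))
        component.add((disease, INVOLVES, uri("protein", row["protein"])))
    return component


def build_distribution(*components: Set[Triple]) -> Set[Triple]:
    """Collect independent components into a single distribution."""
    merged: Set[Triple] = set()
    for component in components:
        merged |= component
    return merged


def match(store: Set[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def gene_symbols_for_disease(store: Set[Triple], disease_name: str) -> List[str]:
    """Join across both sources: disease -> protein -> gene -> symbol."""
    symbols: Set[str] = set()
    for disease, _, _ in match(store, p=NAME, o=disease_name):
        for _, _, protein in match(store, s=disease, p=INVOLVES):
            for gene, _, _ in match(store, p=ENCODES, o=protein):
                for _, _, symbol in match(store, s=gene, p=SYMBOL):
                    symbols.add(symbol)
    return sorted(symbols)


# Made-up records from two "primary sources" with inconsistent identifiers.
genes = [{"id": " g1 ", "symbol": "ABC1", "protein": "p1"},
         {"id": "g2", "symbol": "XYZ2", "protein": "p2"}]
diseases = [{"id": "d1", "name": "example disease", "protein": "P1 "}]

distribution = build_distribution(normalize_genes(genes),
                                  normalize_diseases(diseases))
result = gene_symbols_for_disease(distribution, "example disease")
```

The join succeeds only because both normalization scripts map `p1` and `P1 ` to the same protein URI, which is the kind of syntactic convention the packaging approach depends on. Someone building a local installation could pass only the components they need to `build_distribution`.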