1
Importance and Use of Scientific and Technical Databases

Modern technology has propelled us into the information age, making it possible to generate and record vast quantities of new data.1 Advances in computing and communications technologies and the development of digital networks have revolutionized the manner in which data are stored, communicated, and manipulated. Databases, and uses to which they can be put, have become increasingly valuable commodities.

The now-common practice of downloading material from online databases has made it easy for researchers and other users to acquire data, which frequently have been produced with considerable investments of time, money, and other resources. Government agencies and most government contractors or grantees in the United States (though not in many other countries) usually make their data, produced at taxpayer expense, available at no cost or for the cost of reproduction and dissemination. For-profit and not-for-profit database producers (other than most government contractors and grantees) typically charge for access to and use of their data through subscriptions, licensing agreements, and individual sales.

Currently many for-profit and not-for-profit database producers are concerned about the possibility that significant portions of their databases will be copied or used in substantial part by others to create "new" derivative databases. If an identical or substantially similar database is then either redisseminated broadly or sold and used in direct competition with the original rights holder's database, the rights holder's revenues will be undermined, or in extreme cases,

1 Box 1.1 provides definitions of data and of several other key terms used in this report.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases

Box 1.1 Definitions of Key Terms Used in This Report

Data are facts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors. A data element is the smallest unit of information to which reference is made. This report is concerned primarily with digital data, although a large portion of raw data is recorded as analog data, which also can be digitized. For purposes of this report the terms data and facts are treated interchangeably, as is the case in legal contexts.

Data in a database may be characterized as predominantly word oriented (e.g., as in a text, bibliography, directory, dictionary), numeric (e.g., properties, statistics, experimental values), image (e.g., fixed or moving video, such as a film of microbes under magnification or time-lapse photography of a flower opening), or sound (e.g., a sound recording of a tornado or a fire). Word oriented, numeric, image, and sound databases are processed by different types of software (text or word processing, data processing, image processing, and sound processing).

Data can also be referred to as raw, processed, or verified. Raw data consist of original observations, such as those collected by satellite and beamed back to Earth, or initial experimental results, such as laboratory test data. After they are collected, raw data can be processed or refined in many different ways. Processing usually makes data more usable, ordered, or simplified, thus increasing their intelligibility. Verified data are data whose quality and accuracy have been assured. For experimental results, verification signifies that the data have been shown to be reproducible in a test or experiment that repeats the original. For observational data, verification means that the data have been compared with other data whose quality is known or that the instrument with which they were obtained has been properly calibrated and tested.

Digital data may be processed or stored on various types of media, including magnetic (RAM, hard drive, diskettes, tapes) and optical (CD-ROM, DVD) media. Data can be made accessible either through portable media or, increasingly, online.

A database is a collection of related data and information (generally numeric, word oriented, sound, and/or image) organized to permit search and retrieval or processing and reorganizing.

A data set is a collection of similar and related data records or data points.

Many databases are a resource from which specific data points, facts, or textual information are extracted for use in building a derivative database or data product. A derivative database, also called a value-added or transformative database, is built from one or more preexisting database(s) and frequently includes extractions from multiple databases, as well as original data.

A database producer acquires data in raw, reduced, or otherwise processed form (either directly, through experimentation or observation, or indirectly, from one or more organizations or preexisting databases) for inclusion in a database that the database producer is generating. Such database creators (sometimes known as database publishers or originators but for the purpose of this report referred to as database producers) traditionally are the rights holders of the intellectual property rights in the databases.

In general, database production covers all aspects of preparation, processing, and maintenance; development of software for search, retrieval, and manipulation; and documentation of the software and database features and functions prior to distribution of the database by a vendor. Among the wide variety of functions encompassed by database production, in addition to data acquisition, are data reduction (where needed), formatting, enhancing, expanding, merging with other data or data records, categorizing, classifying, indexing, abstracting, tagging, flagging, coding, sorting/rearranging, putting into tabular form, creating visual representations, updating, and putting into searchable and retrievable form for use and manipulation by users.

A database vendor (variously known as a distributor, online host (mostly in Europe and the United Kingdom), disseminator, or provider) sells, leases, or licenses digitized versions of a database on optical disks (e.g., CD-ROM, DVD), floppy disks, tapes, or downloadable complete databases. Many databases, particularly textual ones, are also based on or provided as hard-copy paper publications. A database producer organization may also serve as a database vendor if it both produces a database and provides online access directly to users or sells, leases, or licenses the database.

For the sake of simplicity, the term database dissemination or distribution as used in this report includes the concept of making databases available online.

The modifier scientific and technical designates the subject matter of the database content in the general areas covered in this report.

the rights holder will be put out of business. Besides being unfair to the rights holder, this actual or potential loss of revenue may create a disincentive to produce and then maintain databases, thus reducing the number of databases available to others. However, preventing database uses by others, or making access and subsequent use more expensive or difficult, may discourage socially useful applications of databases. The question is how to protect rights in databases while ensuring that factual data remain accessible for public-interest and other uses.

This report explores issues in the conundrum posed by the need to properly balance the rights of original database producers or rights holders and the rights of all the downstream users and competitors, with the principal focus on the balance of rights between the database rights holders and public-interest users such as researchers, educators, and librarians. In particular, the Committee for a Study on Promoting Access to Scientific and Technical Data for the Public Interest focuses on scientific and technical (S&T) data (with examples drawn primarily from the physical and biological sciences) as an essential consideration in reasoned attempts to balance competing interests in databases.

To broaden the perspective of and enhance cooperation among the various competing interests, and to help ensure an efficient and effective outcome for all, the committee examines the following basic elements in the larger issue at hand:

• Salient characteristics and the importance of S&T databases produced and used in research;
• Impacts of computer technology on the production, distribution, and use of S&T databases;
• Motivations of the various sectors involved in S&T research and the dissemination and use of research results;
• Economic issues and incentives that influence the production, distribution, and use of S&T databases, and how these activities are interrelated;
• Mechanisms currently in place for protecting these economic incentives; and
• New legislation currently under consideration that would affect the production, dissemination, and use of S&T databases in a variety of ways.

To ensure the most successful outcome in the current debate over rights in databases, any new action must take account of and balance the legitimate interests of the various stakeholders, and must reflect awareness of how the broad public interest can best be served.

SCIENTIFIC AND TECHNICAL DATA AND THE CREATION OF NEW KNOWLEDGE

Factual data are both an essential resource for and a valuable output from scientific research. It is through the formation, communication, and use of facts and ideas that scientists conduct research. Throughout the history of science, new findings and ideas have been recorded and used as the basis for further scientific advances and for educating students.

Now, as a result of the near-complete digitization of data collection, manipulation, and dissemination over the past 30 years, almost every aspect of the natural world, human activity, and indeed every life form can be observed and captured in an electronic database.2 There is barely a sector of the economy that is not significantly engaged in the creation and exploitation of digital databases, and there are many (such as insurance, banking, or direct marketing) that are completely database dependent. Certainly scientific and engineering research is no exception in its growing reliance on the creation and exploitation of electronic databases. The genetic sequence of each living organism is a natural database, transforming biological research and applications over the past decade into a data-dependent enterprise and giving rise to the rapidly growing field of bioinformatics. Myriad data collection platforms, recording and storing information about our physical universe at an ever-increasing rate, are now integral to the study and understanding of the natural environment, from small ecological subsystems to planet-scale geophysical processes and beyond. Similarly, the engineering disciplines continually create databases about our constructed environment and new technical processes, which are endlessly updated and refined to fuel our technological progress and innovation system.

2 See Paul F. Uhlir (1995), "From Spacecraft to Statecraft: The Role of Earth Observation Satellites in the Development and Verification of International Environmental Protection Agreements," GIS Law, Vol. 2, p. 1.

Basic scientific research drives most of the world's progress in the natural and social sciences. Basic, or fundamental, research may be defined as research that leads to new understanding of how nature works and how its many facets are interconnected.3 Society uses the fruits of such research to expand the world's base of knowledge and applies that knowledge in myriad ways to create wealth and to enhance the public welfare. New scientific understanding and its applications are yielding benefits such as the following:

• Improved diagnosis, pharmaceuticals, and treatments in medicine;
• Better and higher-yield food production in agriculture;
• New and improved materials for fabrication of manufactured objects, building materials, packaging, and special applications such as microelectronics;
• Faster, cheaper, and safer transportation and communication;
• Better means for energy production;
• Improved ability to forecast environmental conditions and to manage natural resources; and
• More powerful ways to explore all aspects of our universe, ranging from the finest subnuclear scale to the boundaries of the universe, and encompassing living organisms in all their variety.4

3 See John A. Armstrong (1993), "Is Basic Research a Luxury Our Society Can No Longer Afford?" Karl Taylor Compton Lecture, Massachusetts Institute of Technology, October 13.

4 National Research Council (1997), Bits of Power: Issues in Global Access to Scientific Data, National Academy Press, Washington, D.C., p. 18.

SCIENTIFIC AND TECHNICAL DATABASES AS A RESOURCE: THE CURRENT CONTEXT

The committee's January 1999 Workshop on Promoting Access to Scientific and Technical Data for the Public Interest: An Assessment of Policy Options,5 included presentations on and discussions of data activities in twelve selected organizations representing three broad sectors (government, not-for-profit, and commercial). The sample activities illustrated some of the depth and range of uses for S&T databases today (Table 1.1 provides a summary) and indicated also the complexity of the often overlapping relationships and interests of database users and producers.

5 See online, National Research Council (1999), Proceedings of the Workshop on Promoting Access to Scientific and Technical Data for the Public Interest: An Assessment of Policy Options, National Academy Press, Washington, D.C., <http://www.nap.edu>.

The discussion below outlines basic aspects of current data activities, including collection and production of S&T data and databases, dissemination, and use, and it describes the roles that the three sectors play in the overall process. In contrasting past and current practices, it indicates how ongoing technological advances have contributed to increased capabilities for obtaining and using S&T data. This description, which provides essential background for the remainder of this report, draws on examples from the four general discipline areas focused on in the workshop: geographic and environmental, genomic, chemical and chemical engineering, and meteorological research and applications.

Collection of Original Data and Production of New Databases

Sources of Primary Data and Uses

The process of scientific inquiry typically has begun with the formulation of a working hypothesis, based usually on limited observation and data, followed by experimentation designed to test the hypothesis. The experimentation results in the accumulation of new data used to confirm or refute the original hypothesis. Understanding of the natural and physical world has been advanced by researchers building on a growing base of knowledge that is continually being refined, tested, and augmented in the long-established approach to scientific inquiry known as the scientific method.

With the advent of digital technologies has come a dramatic increase in the pace and volume of data acquisition. Ongoing rapid advances in electronic technologies for computing and communications, experimentation, and observation ranging from high-frequency direct sampling to multispectral remote sensing have enabled dramatic increases in the quantities of data generated about the natural world at scales from the microcosm to the macrocosm. For instance, the volume of data on weather and climate stored in the National Climatic Data Center has increased 750-fold in the past two decades (Box 1.2). A pharmaceutical company that 5 years ago could characterize 100,000 compounds per year can now handle a million compounds in a week. Although some of these data represent actual measurements, large quantities of data also are being generated through numerical simulations performed on supercomputers. Collection of new data is becoming increasingly automated as recording devices and instrumentation become more sophisticated and rapid. Moreover, many older paper-based data sets, such as historical U.S. Weather

TABLE 1.1 Examples of Different Types of S&T Database Activities Discussed in the January 1999 Workshop

Geographic and Environmental

U.S. Geological Survey (USGS) (Government)
  Information and Tools Provided: Geographic data: maps and map products. Data from other programs: biologic, geologic, hydrologic.
  Data Sources: USGS, other federal agencies, state and local governments, not-for-profit researchers, partnerships with private sector.
  Users: USGS, other government agencies, commercial database providers and value adders, researchers, and the general public.
  Dissemination Modes: Maps: hard copy (paper, plastic, film) and digital form; distributed by agency directly and through partnerships with private sector, not-for-profit sector.

Long-Term Ecological Research (LTER) Network Office (Not-for-Profit)
  Information and Tools Provided: Site description database, integrated climate database, remotely sensed ecological data.
  Data Sources: Ecological researchers at distributed sites belonging to the LTER network.
  Users: Researchers.
  Dissemination Modes: Internet, some tape and CD-ROM for portability.

GeoSystems Global Corp. (Commercial)
  Information and Tools Provided: Digital maps, MapQuest Web site, mapping services.
  Data Sources: From the public domain: government-produced maps (federal, state, local), digital geographic data, remotely sensed imagery. Other sources: commercial and other countries' maps, digital data, remotely sensed imagery, other published sources.
  Users: Commercial clients: large companies and consumers.
  Dissemination Modes: Maps: hard copy and digital form. Software products distributed via retail channels (CD-ROM) and directly to corporate customers. Mapping services distributed via Internet.

Genomic

National Center for Biotechnology Information (Government)
  Information and Tools Provided: GenBank: DNA and protein sequence data; other genomic mapping databases; 3D protein structure database; bibliographic databases; software tools.
  Data Sources: Direct contributions from scientists; access to other databases from government, not-for-profit, other country sources.
  Users: Research scientists in academic, government, commercial organizations.
  Dissemination Modes: Internet access via Web servers and File Transfer Protocol (FTP).

Center for Bioinformatics, University of Pennsylvania (Not-for-Profit)
  Information and Tools Provided: Specialized biological databases; software tools for integration of distributed heterogeneous databases.
  Data Sources: Proprietary and public-domain experimental data from academic researchers; manual processing and encoding of data from published literature; online molecular and cellular biology and genomic databases.
  Users: Research scientists in academic, government, commercial organizations (U.S. and abroad).
  Dissemination Modes: Internet access. Source code distributed directly.

Molecular Applications Group (Commercial)
  Information and Tools Provided: Software for storing, mining, and visualizing genomic data; databases derived from public and private data and proprietary software.
  Data Sources: >150 online database sites, public and proprietary.
  Users: Research scientists in academic, government, commercial organizations (U.S. and abroad).
  Dissemination Modes: Some software products downloaded from the Web; others require on-site expert installation.

Chemical and Chemical Engineering

National Institute of Standards and Technology (NIST) Physical and Chemical Properties Division (Government)
  Information and Tools Provided: Specialized chemistry and chemical engineering databases (extensively evaluated and documented).
  Data Sources: Experimental results from published literature; experiments done specifically for data acquisition; published data evaluations; supplementary data deposits.
  Users: Researchers in academic, government, commercial organizations (some databases used primarily by industrial users).
  Dissemination Modes: Variety of forms: hard copy publication, CD-ROM or floppy disk, Internet access; NIST distributes directly or via agreements with secondary distributors.

Chemical Abstracts Service, American Chemical Society (Not-for-Profit)
  Information and Tools Provided: Chemical Abstracts: bibliographic database. Registry: registry of chemical substances. Software access tools.
  Data Sources: Journals, patents, books, proceedings, dissertations.
  Users: Researchers in academic, government, commercial organizations; patent examiners; students.
  Dissemination Modes: Electronic access, hard copy, CD-ROM.

Institute for Scientific Information (Commercial)
  Information and Tools Provided: Bibliographic databases: citation indexes, tables of contents. Information services. Linkages to publishers' full text databases.
  Data Sources: Journals, books, proceedings (print and electronic format).
  Users: Academic, government lab, and corporate libraries; researchers in academic, government, commercial organizations.
  Dissemination Modes: Diskette, CD-ROM, FTP files, Internet access, hard copy.

Meteorological

National Climatic Data Center (Government)
  Information and Tools Provided: Climatological summaries from National Weather Service stations; historical long-term climatic databases.
  Data Sources: National Weather Service, World Meteorological Organization, NASA, bilateral agreements with other countries.
  Users: Individuals, commercial clients, government agencies, engineering uses.
  Dissemination Modes: Hard copy, microfiche, magnetic tape, disks, CD-ROM, FTP, Internet.

Unidata Program, University Corporation for Atmospheric Research (Not-for-Profit)
  Information and Tools Provided: Quasi-real-time atmospheric and related data. Case study data sets. Software tools.
  Data Sources: Public: National Weather Service, National Environmental Data Service. Private: network of lightning sensors, sensors in commercial aircraft.
  Users: Academic departments.
  Dissemination Modes: Internet.

TASC (Commercial)
  Information and Tools Provided: Real-time weather information.
  Data Sources: Public: National Weather Service (downlink directly from U.S. and international weather satellites), other observational sources.
  Users: News media (broadcast and cable TV), aviation, energy and power, agribusiness.
  Dissemination Modes: Public and private data communication networks: satellite broadcasting services and Internet.

NOTE: Although the subject matter of this study included all S&T databases, the committee was able to choose only representative examples for discussion and analysis in the report. For instance, specific examples from the social sciences or the space sciences, among other disciplines, were not included.

Box 1.2 Example of Large-Scale Data Collection Activity by the Federal Government

Statistics from just one discipline in the natural sciences, atmospheric physics, illustrate the explosive growth in the size of some digital scientific and technical databases. The National Climatic Data Center (NCDC) is responsible for storing national, as well as some global, weather and climatic information. Once, most of these data came from human observations of the current state of the weather using simple and straightforward instrumentation, including such commonplace devices as thermometers, barometers, wind vanes, and rain gauges. The comparatively recent deployment of satellites, sophisticated Doppler radars, lightning detection networks, automatic surface-observing platforms, and heavily instrumented buoys in the marine environment, all linked together through broadband, high-speed communication systems, has increased the types and volumes of data collected.

The NCDC's storage requirements for these data have increased concomitantly by many orders of magnitude. In the period between 1980, when some of the high-resolution data were just beginning to be recorded, and 1994, when much of the Doppler radar and lightning data had yet to be generated, the volume of data stored at the NCDC increased from approximately 1 terabyte to 230 terabytes. By 1999, the NCDC's data holdings had grown to 750 terabytes and are projected to expand to more than 20 petabytes by 2014. These data are archived indefinitely and made available to the public.

SOURCE: Information provided by Gerald Barton, National Oceanic and Atmospheric Administration, Washington, D.C.

Bureau observational records or U.S. census data, are being digitized and organized into electronically accessible databases.
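The growth figures quoted in Box 1.2 can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original report: the function and variable names are illustrative, the holdings are the approximate values stated in the box, and the conversion of 1 petabyte to 1,000 terabytes is an assumption (the report does not say which convention it uses).

```python
# Approximate NCDC holdings in terabytes, as quoted in Box 1.2.
# Assumption: 1 petabyte = 1,000 terabytes (decimal convention).
# The 2014 value is the report's projection ("more than 20 petabytes").
TB_PER_PB = 1_000
HOLDINGS_TB = {1980: 1, 1994: 230, 1999: 750, 2014: 20 * TB_PER_PB}


def fold_increase(start_year, end_year):
    """Multiplicative growth in holdings between two listed years."""
    return HOLDINGS_TB[end_year] / HOLDINGS_TB[start_year]


def annual_growth_rate(start_year, end_year):
    """Compound annual growth rate implied by the two endpoints."""
    years = end_year - start_year
    return fold_increase(start_year, end_year) ** (1 / years) - 1


if __name__ == "__main__":
    listed = sorted(HOLDINGS_TB)
    for start, end in zip(listed, listed[1:]):
        print(
            f"{start}-{end}: x{fold_increase(start, end):.1f} overall, "
            f"{annual_growth_rate(start, end):.1%} per year"
        )
```

Under these assumptions the 1980 to 1999 fold increase is 750, which matches the 750-fold growth over two decades cited in the main text.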
This shift from a data-poor to a data-rich research and education environment is occurring through the activities of a host of government agencies, universities, and other research establishments, both public and private, nationally and internationally, in diverse research disciplines. In many cases data are being collected not to answer specific scientific questions, but rather to describe various physical and biological phenomena in ever-increasing detail. This broad-based acquisition of data, coupled with data mining and knowledge discovery6 and the broad review and analysis of information stored in large databases, is anticipated to reveal trends or patterns or to lead

6 Data mining and knowledge discovery are related, frequently confused terms, as are data, information, and knowledge. In the context of electronic databases, the data stored therein remain as data until they are extracted (mined) and recompiled (put in a context), at which point they become information. After information is developed into a collection of related inferences, the data, now

to 11,339), the number of database producers increased by a factor of 18 (from 200 to 3,686), and the number of vendors grew by a factor of 23 (from 105 to 2,459). In 1975 the 301 identified databases contained about 52 million records, whereas in 1998 the 11,339 tallied databases held nearly 12.05 billion records, a 231-fold increase in the number of records.

Although in today's digitized information world databases are produced on all continents, the percentage of all types of databases produced in the United States continues to represent the lion's share of the global output. In 1998, of the 11,339 databases that were identified, 63% were produced in the United States. In 1975, of the 301 publicly available, computer-readable databases worldwide, 59% were U.S. databases. From 1985 to 1993, the ratio of U.S. to non-U.S. databases remained at about 2:1. From 1994 on, production of non-U.S. databases has accelerated somewhat, so that in 1998 the ratio of the number of U.S. to non-U.S. databases was about 3:2. The average size of U.S. databases in terms of the number of records they contained was larger than that of the non-U.S. databases. As noted above, however, most U.S. government and academic databases are not represented in these figures.

In the source quoted here, database statistics were compiled in eight major subject categories: business, health/life/medical sciences, humanities, law, multi-disciplinary, news/general, science/technology/engineering, and social sciences.
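The growth factors quoted above follow directly from the 1975 and 1998 counts. The sketch below is illustrative only and is not part of the original report: the dictionary and function names are hypothetical, and the record counts use the rounded values given in the text ("about 52 million" and "nearly 12.05 billion").

```python
# Counts quoted in the text for 1975 and 1998 (record counts are rounded).
COUNTS_1975_1998 = {
    "databases": (301, 11_339),
    "producers": (200, 3_686),
    "vendors": (105, 2_459),
    "records": (52_000_000, 12_050_000_000),
}


def growth_factor(start, end):
    """Return end / start, the multiplicative increase between two counts."""
    return end / start


if __name__ == "__main__":
    for name, (start, end) in COUNTS_1975_1998.items():
        print(f"{name}: x{growth_factor(start, end):.1f}")
```

Truncating the computed values to whole numbers gives the factors of 18, 23, and 231 stated in the text for producers, vendors, and records.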
If the health/life/medical sciences category is combined with science/technology/engineering, that general scientific and technical category had the largest number of databases (28%) in 1998, followed by business (26%), news/general (15%), and law (11%), with the remaining three categories accounting for the other 20%.19 The Uniqueness of Many S&T Databases A key characteristic of original S&T databases is that many of them are the only one of their particular kind, available only from a single-source, which has significant economic and legal implications, as discussed in subsequent chapters of this report. For example, many S&T databases describe physical phenomena or transitory events that have been rendered unique by the passage of time. Measurements of a snowstorm obtained with a single radar observation, or a statistical compilation of some key socioeconomic characteristics such as income levels collected by a state agency, cannot be recaptured after the original event. The vast majority of observational data sets of the natural world, as well as all unique historical records, can never again be recreated independently and are thus available only as originally obtained, frequently from a single-source. Other S&T databases are de facto unique because the cost of obtaining the data was 19   Williams (1998), "State of Databases Today," note 18, p. xxvi.

OCR for page 14
A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases extremely high. This is the case with very large facilities for physical experiments or space-based observatories. Even when data similar but not identical to original research results or observations are available for use in non-technical applications, scientists and engineers will likely not find an inexact replica of a database a suitable substitute if it does not meet certain specifications for a particular experiment or analysis. For example, two infrared sensors with similar spatial and spectral characteristics on different satellites collecting observations of Earth may provide relatively interchangeable data products for the non-expert consumer, but for a researcher, the absence of one spectral band can make all the difference in whether a certain type of research can be performed. Thus a database generally deemed adequate as a substitute in the mass consumer market very likely will not be usable for many research or education purposes. Dissemination of Scientific and Technical Data and the Issue of Access S&T data traditionally were disseminated in paper form in journal articles, textbooks, reference books, and abstracting and indexing publications. As data have become available in electronic form, they have been distributed via magnetic tape and, more recently, optical media such as CD-ROM or DVD. The growing use of the Internet has revolutionized dissemination by allowing most databases to be made available globally in electronic form. Digitization and the potential for instant, low-cost global communication have opened tremendous new opportunities for the dissemination and utilization of S&T databases and other forms of information, but also have led to a blurring of the traditional roles and relationships of database producers, vendors, and users of those databases in the government, not-for-profit, and commercial-sectors. 
In fact, virtually anyone who obtains access to a digital database can instantly become a worldwide disseminator, whether legally or illegally.20

20   Of course, this same development is occurring with other forms of online information and proprietary publications outside the S&T database context, such as with copyrighted digital music and videos. For an extensive discussion of the impact of the Internet on various types of information and related intellectual property rights management, see Computer Science and Telecommunications Board, National Research Council (2000), The Digital Dilemma: Intellectual Property in the Information Age, National Academy Press, Washington, D.C., in press.

Two of the most important mechanisms for the dissemination of public and publicly funded databases have been government data centers and public libraries. Government, or government-funded, data centers have been created in recent decades for dissemination of data obtained in certain programs or research disciplines. Examples of such data centers include the National Center for Biotechnology Information and the National Climatic Data Center (Table 1.1), but many others have been established for almost every field of research.21 Public libraries, whether part of the federal depository library program, university research libraries, or other public libraries or foundations specializing in various S&T or other academic subjects, not only preserve and publicly disseminate government data, but provide general public access to many proprietary S&T databases as well. With ever-increasing costs, however, the libraries' ability to provide this public "safety net" for all published products is diminishing.22

Historically, most federal government S&T data and government-funded research data in the United States have been fully and openly available to the public.23 This has meant that such data are available free or at low cost for academic and commercial research (and indeed any other use) without restrictions and can be incorporated into derivative databases, which can, themselves, be redistributed and incorporated into additional databases. In some instances in which the government contracts for the dissemination of data, however, the rights assigned to the database vendor may place restrictions on the ability of the research and education communities to fully utilize the data. Increasingly, both government and not-for-profit organizations are exploring means to recover database production and distribution costs, or to generate revenue streams in order to support their expensive data activities, thereby making them function in a manner similar to commercial organizations. The ability to access existing data and to extract and recombine selected portions of them for research or for incorporation into new databases for further distribution and use has become a key part of the scientific process by which new insights are gained and knowledge is advanced.
When data must be accessed or distributed on an international basis, various intergovernmental agreements are depended on to facilitate such exchanges in the public sector. In contrast, to achieve a suitable return on their investment, private-sector vendors of proprietary databases typically seek to control unauthorized access to and use of their databases. It is at the intersection of public and private interests in data where the greatest challenges emerge. As an example, Box 1.3 sketches some of the issues and approaches currently being tried.

21   See National Research Council (1995), Preserving Scientific Data, note 17, and the accompanying Study on the Long-term Retention of Selected Scientific and Technical Records of the Federal Government: Working Papers, National Academy Press, Washington, D.C., and National Research Council (1997), Bits of Power, note 4, for a description of many of these government S&T data centers.

22   As the prices of many serial journal subscriptions substantially outpace the rate of inflation, for example, research libraries increasingly need to rely on interlibrary loans to obtain access for their students and professors. See Association of Research Libraries (1999), ARL Statistics: 1997-98, Martha Kyrillidou et al., eds., Association of Research Libraries, Washington, D.C.

23   As defined in National Research Council (1997), Bits of Power, note 4, p. 15, "full and open" availability of data means that "data and information derived from publicly funded research are made available with as few restrictions as possible, on a nondiscriminatory basis, for no more than the cost of reproduction and distribution."

Box 1.3 Database Production in Competitive Research and the Question of Access

Genomic sequence databases exemplify the tension over rights in data and their uses associated with the development of original databases that have both important fundamental research uses and great potential for applied commercial products. Advances in molecular biology and automated DNA sequencing technology have made possible the rapid sequencing of genomes from a variety of life forms, including human beings. These databases are being produced simultaneously by researchers at government, not-for-profit, and commercial laboratories. Although the government and not-for-profit genomic database producers may be slower than the commercial sector in compiling gene sequence data on the same organisms, they are striving to create analogous databases in order to provide the results on an open basis as a public good for broad research and other uses.

Government and not-for-profit sequence data are collected and integrated into major sequence databases in a cooperative international effort that includes the National Center for Biotechnology Information in the United States,1 the European Molecular Biology Laboratory in the United Kingdom on behalf of the European Union,2 and the DNA Database of Japan.3 These centers not only collect and share the data on a daily basis, but also provide some quality control, documentation, and organization of the data before making the information freely available to the scientific and technical community, typically over the Internet. The Human Genome Project aims to provide full sequence data for the human genome and to serve as the future reference standard. Because of the high intrinsic commercial value of human genomic information for the identification of disease markers and therapeutic agents, commercial entities simultaneously seek to be first in generating primary genomic data, which they can license to pharmaceutical or biotechnology companies, or patent, if possible, to gain market advantage.

While the human genome provides the basic blueprint for human life, it is small differences in individual genes that are likely to provide insight into important questions such as variations in disease susceptibility in different populations (for example, why certain groups of people are predisposed to high blood pressure, diabetes, or Alzheimer's disease). These differences can be studied by comparing the gene sequences of different populations, such as those individuals susceptible to a disease compared to those individuals who are not. Hence, over time, gene sequence databases of a wide variety of discrete populations will be developed, supported by a mix of public and private funding.

Recently, for instance, the Icelandic government formed a controversial partnership with a private U.S. firm to develop a database that will contain genetic information on the entire Icelandic population. Icelanders belong to a highly homogeneous gene pool, which will simplify the detection of disease-related genes. The government gave the firm, by statute, an exclusive license to create and operate that database.4 Another recently begun effort involves a consortium of ten U.S. and foreign pharmaceutical companies, together with government and not-for-profit organizations, formed to generate a map of human single-nucleotide polymorphisms,5 which can be thought of as a low-resolution indicator highlighting areas of variability in the genetic code associated with genetic differences between individuals. Although the research is funded in large part by the commercial sector, the results will be made publicly available. In addition to the cost-sharing benefits of this consortium, a major reason for its establishment is the fear that an individual company, or group of companies, could generate scientifically valuable databases and information on a proprietary basis, preventing broad access and capturing a high proportion of the associated intellectual property rights.6

1   See the National Center for Biotechnology Information Web site at <www.ncbi.gov>.
2   See the European Molecular Biology Laboratory Web site at <www.embl.uk>.
3   See the DNA Database of Japan Web site at <www.ddbj.nig.ac.jp>.
4   See J. Gulcher and K. Stefansson (1999), "An Icelandic Saga on a Centralized Healthcare Database and Democratic Decision Making," Nature Biotechnology, Vol. 17, July, p. 620, and Martin Enserink (1998), "Physicians Wary of Scheme to Pool Icelanders' Genetic Data," Science, Vol. 281, August 14, pp. 890-891.
5   See Eliot Marshall (1999), "Drug Firms to Create Public Database of Genetic Mutations," Science, Vol. 284, April 16, pp. 406-407.
6   See Marshall (1999), note 5.

Use of Scientific and Technical Databases

Prior to its public dissemination, the use of a database is limited to those involved in collecting the data or producing the database, and therefore does not provide the opportunity to contribute broadly to the advancement of scientific knowledge, technical progress, economic growth, or other applications beyond those of the immediate group. It is only upon the distribution of a database that its far-reaching research, educational, and other socioeconomic values are realized. One or more researchers applying varying hypotheses, manipulating the data in different ways, or combining elements from disparate databases may produce a diversity of data and information products. The contribution of any of these products to scientific and technical knowledge might well assume a value far greater than the costs of database production and dissemination. The results of a thorough database analysis may reveal a value of the data not apparent in even a detailed examination of the individual elements of the database itself. With the widespread availability of information on the Internet have come abundant opportunities to search for scientific and technical gold in this ore of factual elements. The possibilities for discovery of new insights about the natural world, with both commercial and public-interest value, are extraordinary.

In considering how databases are used, it is important to distinguish between end use and derivative use. End use (accessing a database to verify some fact or perform some job-related or personal task, such as obtaining an example for a work memo) is most typical of public consumer uses. End use does not involve the physical integration of one or more portions of the database into another database in order to create a new information product. A derivative (value-adding or transformative) use (see Box 1.1) builds on a preexisting database and includes at least one, and frequently many more, extractions from one or more databases to create a new database, which can be used for the same, a similar, or an entirely different purpose than the original component database(s).

Integration of Distributed Data to Broaden Access and Potential for Discovery

In seeking new knowledge, researchers may gather data from widely disparate sources. A significant advantage arising from the abundance of digitized data now accessible through both private and public networks is the potential for linking data in multiple (even thousands of) databases. The ability to link sites on the World Wide Web is one type of integration that could result in more data being available overall to users. Another is the merging of databases of the same or complementary content.
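The difference between end use and derivative use, and the merging of complementary databases, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the two source "databases," their field names (`station_id`, `temp_c`, `elevation_m`), the values, and the helpers `end_use_lookup` and `build_derivative` are hypothetical and are not drawn from any database discussed in this report.

```python
# Hypothetical source databases; every name and value here is invented.
observation_db = [
    {"station_id": "S1", "temp_c": -3.5},
    {"station_id": "S2", "temp_c": 1.0},
    {"station_id": "S3", "temp_c": 4.2},
]
site_db = [
    {"station_id": "S1", "elevation_m": 1620},
    {"station_id": "S3", "elevation_m": 15},
]


def end_use_lookup(database, key, value):
    """End use: read one record to check a fact; no new database is created."""
    for record in database:
        if record[key] == value:
            return record
    return None


def build_derivative(primary, secondary, key):
    """Derivative use: extract from both sources and merge records sharing `key`."""
    index = {record[key]: record for record in secondary}
    derived = []
    for record in primary:
        match = index.get(record[key])
        if match is not None:
            # Build a new record so the source databases stay unmodified.
            derived.append({**record, **match})
    return derived


derivative_db = build_derivative(observation_db, site_db, "station_id")
```

In this sketch `derivative_db` contains only the stations present in both sources. A real integration would also have to reconcile identifiers, units, and usage rights, none of which is modeled here.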
It is now possible to maintain a site with continuously verified links to related information sites for use by subscribers or members of a specific group; an example is the Engineering Village of Engineering Information, Inc.24 Yet another type of integration occurs in the connection of distributed databases such that different parts of a single large database may reside on different computers in geographically dispersed locations throughout the country or the world. With a common structure, data can be located in a physically distributed network and accessed as if they were in one database in one computer in one location. The cost can thus be distributed and the value of each contributory database increased. Still other databases are automatically created from other databases. For example, data are routinely mined and collected by "knowbots" and "web crawlers" (software employing artificial intelligence and rule-based selection techniques) on the Internet throughout the world and retrieved for processing and further use. One such data mining activity in the area of biotechnology was described and discussed at the committee's January 1999 workshop (see the Molecular Applications Group's activities summarized in Table 1.1).

24   See, for example, the Engineering Village of Engineering Information Web site at <www.ei.org/aivillage/village.serve-page?p=4011>.

With a capability to integrate information in multiple databases comes the potential for exploiting relationships identified in the information and developing new knowledge. In many scientific fields, the initial investment by the database rights holder may not produce the greatest value until it is integrated with the investments of others. For example, while protein sequence data are valuable in their own right, their value is greatly enhanced if associated x-ray crystallographic data are also concurrently available. It is possible to use the combined data to understand the way in which protein chains are folded and, in the case of an enzyme, the way in which various nonsequential residues, or even residues on separate protein chains, combine to form an active site.

Derivative Databases and New Data-Driven Research and Capabilities

The ethos in research is that science builds on science. The creation of derivative databases not only enables incremental advances in the knowledge base, but also can contribute to major new findings, particularly when existing data are combined with new or entirely different data. The importance for research and related educational activities of producing new derivative databases cannot be overemphasized.25

The vast increase in the creation of digital databases in recent decades, together with the ability to make them broadly and instantaneously available, has resulted in entire new fields of data-driven research.
For example, the study of biological systems has been transformed radically in the past 20 years from an experimental research endeavor conducted in laboratories to one that relies heavily on computing and on access to and further refinement of globally linked databases.26 Indeed, one of the fastest growing disciplines is bioinformatics, a computer-based approach to biological research. New technologies, such as DNA microarrays and high-throughput sequencing machines, are producing a deluge of data. A challenge to biology in the coming decades will be to convert these data into knowledge.27

25   As noted by Vinton Cerf, senior vice president at MCI WorldCom, Inc. ("ACM Awards Keynote," Association for Computing Machinery, New York City, May 15, 1999):

Scientific databases are proving to be non-linear accelerators of research in specific fields such as biology, astronomy, meteorology, space physics, chemistry, economics, epidemiology, environmental studies and a wealth of other fields. The non-linearity comes about because as each research adds more material to the database, the information is placed in juxtaposition with all other items in the system, exhibiting the same kind of non-linear impact that placing computers in a common network has had, in accordance with Metcalfe's Law (which says that the value of the network grows as the square number of devices in the net). Cerf's Law says that shared databases grow in value in accordance with the number of combinations of data items in the database. When the hundreds of thousands of databases on the Internet and other networks are accessible remotely and can be reached in parallel, and when the partial results can be combined and searched anew, the value of these data can grow dramatically.

The availability of global remote-sensing satellite observations, coupled with other airborne and in situ observational capabilities, has given rise to a new field of environmental research, Earth system science, which integrates the study of the physical and biological processes of our planet at various scales. The large meteorological databases obtained from government satellites, ground-based radar, and other data collection systems pose a challenge similar to that mentioned above for biology, but also already have yielded a remarkable range of commercial and non-commercial value. Dissemination of the atmospheric observations in real time or near-real time for "nowcasts" and daily weather forecasts has very high commercial value, which is captured by third-party distributors. Use of these atmospheric observations to develop numerical models that predict the weather accurately, hours or days in advance, adds value in terms of safety and economic benefits to society that are not readily quantifiable. While the economic value of these data can be gauged by the profits of private-sector distributors, how does one measure the value of the lives and property saved by timely and accurate hurricane forecasts and tornado warnings? Once the immediate and most lucrative commercial value is exploited, the resulting data continue to have significant commercial and public-interest uses indefinitely.
For instance, these data enable basic research on severe weather and long-term climate trends and provide various retrospective applications for industry. The original databases are archived and made available by the National Climatic Data Center (see Box 1.2). Derivative databases and data products are distributed under various arrangements by both commercial and not-for-profit entities like the Unidata Program of the University Corporation for Atmospheric Research (see Table 1.1 and the online workshop Proceedings).

26   See generally, Working Group on Biomedical Computing (1999), "The Biomedical Information Science Initiative," National Institutes of Health, June 3, available online at <www.nih.gov/welcome/director/060399.htm>.

27   See Sylvia Spengler and Manfred D. Zorn (1999), "Handling Data Sets in Biology," Lawrence Berkeley National Laboratory Colloquium, Washington, D.C.

Geographic information systems that integrate myriad sources of data provide an opportunity for new insights about the natural and constructed environment, greatly enhancing our knowledge of where we live and how we affect our physical environment. Important applications include environmental management, urban planning, route planning and navigation, emergency preparedness and response, land-use regulation, and enhancement of agricultural productivity, among many others.28

Finally, databases used by researchers and educators also frequently are produced and disseminated primarily for other purposes. For example, a physical scientist studying the complex relationships among geology, hydrology, and biology as they relate to the preservation of species diversity likely would draw on numerous digital and hard copy databases originally gathered for other purposes. A social scientist studying the characteristics and patterns of urban crime or the spread of communicable diseases likely would do the same. For many scientists, the ability to supplement existing databases with further data collection in a seamless web of old and new data is basic to meeting the needs of their specific investigations.

Text Databases and Online Publication

Another type of S&T database not yet discussed, but that is used extensively by the research community, consists primarily of text with data summarized or added as examples. These databases may consist of primary literature (as in the case of full text databases of journal articles) or secondary literature (as in the case of bibliographic reference databases). Traditionally, this text has been available in print form, with publishers providing peer review, professional editing, indexing and formatting, and other services, including marketing and distribution. Increasingly this information is being provided as text databases with the publishers also providing the systems that allow access to these databases. These value-adding or information repackaging functions are performed by both not-for-profit and for-profit organizations.
For example, the not-for-profit American Association for the Advancement of Science, a scientific society, produces a database containing the full text of articles from Science magazine, including enhancements to the content that do not appear in the print version.29 Similarly, the for-profit publisher Elsevier Science produces Science Direct, a database containing the full text of its journal articles. Bibliographic reference databases are also produced by government, not-for-profit, and for-profit organizations, such as the National Library of Medicine, Chemical Abstracts Service, and the Institute for Scientific Information, respectively (see Table 1.1 and the online workshop Proceedings30). Where full text databases include associated data collections, physical and legal possession of the data collections may be retained by the originator or may pass to the publisher. As S&T data and results are increasingly digitized and made available online, publishers are seeking access to and inclusion of the underlying data collections on which published articles are based. The intent is not only to provide greater validity and support for published research articles, but also to make their online publications more interesting and useful to the S&T customer base. The ability to link to the underlying databases instantaneously and at different levels of detail adds an entirely new and exciting dimension to scientific publishing and to the potential for new research, but also raises the question of who will have the rights to exploit those data.

28   National Academy of Public Administration (1998), Geographic Information for the 21st Century: Building a Strategy for the Nation, National Academy of Public Administration, Washington, D.C.

29   See Science online at <http://www.sciencemag.org/>.

30   See the committee's online Proceedings, note 5, at Chapter 3, "Characteristics of Scientific and Technical Databases."

THE CHALLENGE OF EFFECTIVELY BALANCING PRIVATE RIGHTS AND THE PUBLIC INTEREST IN SCIENTIFIC AND TECHNICAL DATABASES

The general advancement of knowledge independent of its eventual societal benefits is a goal of basic research. Nevertheless, an endless array of examples demonstrates how the creation of new knowledge, building on the existing base of understanding and information developed by researchers, has enabled broad and important socioeconomic benefits for the nation as a whole. Our society appreciates that knowledge itself is intrinsically valuable and important, and our success in the world market for advanced technology products and services attests to the direct economic benefits of the resulting applications. It is for these reasons that government funds basic research and related data activities as a public good.31,32 Yet it is precisely these activities that are at risk of being hindered, if not in some instances stopped, by proposed major changes to the legal protections of factual databases.
31   As Lester Thurow points out: "A successful knowledge-based economy requires large public investments in education, infrastructure, and research and development." He goes on to say that private returns are apt to be more certain if one is looking for an extension of existing knowledge rather than for a major breakthrough; thus private firms tend to concentrate their money on the development end of the R&D process. Time lags are shorter, and in the business world speed is everything. Because of this proclivity in the private sector, government should focus its spending on the long-tailed projects for advancing basic knowledge. This is where the private firms won't invest, but it is precisely where the breakthroughs that generate business opportunities are made. (Lester C. Thurow (1999), "Building Wealth: The New Rules for Individuals, Companies, and Nations," Atlantic Monthly, June, p. 64.)

32   For a discussion of public goods in the context of basic scientific research and related data activities, see National Research Council (1997), Bits of Power, note 4, pp. 111-114. This issue is discussed in greater detail in Chapters 3 and 4.

Legislative efforts are currently under way in the United States, the European Union, and the World Intellectual Property Organization to greatly enhance the legal protection of proprietary databases. These new legal approaches threaten to compromise traditional and customary access to and use of S&T data for public-interest endeavors, including not-for-profit research, education, and general library uses. At the same time, there are legitimate concerns by the rights holders in databases regarding unauthorized and uncompensated uses of their data products, including at times the wholesale commercial misappropriation of proprietary databases. Because of the complex web of interdependent relationships among public-sector and private-sector database producers, disseminators, and users, any action to increase the rights of persons in one category likely will compromise the rights of the persons in the other categories, with far-reaching and potentially negative consequences. Of course, it is in the common interest of both database rights holders and users, and of society in general, to achieve a workable balance among the respective interests so that all legitimate rights remain reasonably protected.