1
Research Data in the Digital Age

In a 1965 article in Electronics Magazine, Gordon Moore, the cofounder of Intel, observed that the number of components on an integrated circuit per unit of cost was doubling on a regular basis, with a period he later set at 2 years.1 What came to be known as Moore’s law has become a defining property of the digital age.2 For more than half a century, the power of computing available at a given cost has risen exponentially, which has increased computer power by many orders of magnitude. Today, the most powerful computers can perform more than a million billion operations per second. Storage devices can handle petabytes of information.3 Data can be transferred at rates of 10 gigabits (or 10 billion bits) per second (see Box 1-1 for a description of units of size for data). Sensors such as the charge-coupled devices used in modern cameras and telescopes can acquire data from billions of pixels simultaneously. Furthermore, in key areas of computing, Moore’s law continues to hold.4 Many measures of computing power continue to double every 1 to 2 years. As a result, the quantity of data being created and stored by businesses, government, scientific institutions, and individuals is growing rapidly. Figure 1-1 shows one consulting firm’s projection of how information and available storage will grow in the coming years.

1

Gordon E. Moore. 1965. “Cramming more components onto integrated circuits.” Electronics 38(19):114–117.

2

Michael S. Turner. 2007. “Scientific discovery in the Information Age.” Presentation at the De Lange Conference on Emerging Libraries: How Knowledge Will Be Accessed, Discovered, and Disseminated in the Age of Digital Information, March 6, Houston, TX. Available online at http://delange.rice.edu/VI/EL/Turner-DeLange-2007.pdf?action=details&event=921.

3

A petabyte represents a million billion characters, the equivalent of the text in one billion books.

4

Not all measures of computing power are increasing exponentially. For example, the transfer rate of data within computers from memory devices to the central processing unit is growing slowly and at a linear rate. Physical limitations on the power of single processors have constrained the continued general application of Moore’s law. However, new algorithms for processors and storage units linked in parallel may lead to resumed exponential increases in computing power in the future.

ENSURING THE INTEGRITY, ACCESSIBILITY, AND STEWARDSHIP OF DATA

BOX 1-1 Units of Size for Data

Bit: The fundamental unit of digital information, equivalent to a 1 or a 0, or to an electronic switch being on or off. Bit is short for binary digit.
Byte: The information stored in eight bits. A byte can be used to store one character of English text.
Kilobyte: The information stored in approximately 1,000 bytes, which is the equivalent of about 15 lines of text.
Megabyte: The information stored in approximately 1,000 kilobytes. A large novel contains about a megabyte of information, and a standard compact disc holds about 680 megabytes of digital information.
Gigabyte: The information stored in approximately 1,000 megabytes. A typical hard drive (as of 2008) holds about 500 gigabytes of information.
Terabyte: The information stored in approximately 1,000 gigabytes. The printed information stored in the Library of Congress equals approximately 10 terabytes.
Petabyte: The information stored in approximately 1,000 terabytes. All U.S. academic research libraries combined contain about 2 petabytes of information.
Exabyte: The information stored in approximately 1,000 petabytes. According to one estimate,a human beings have spoken about 5 exabytes of words over the course of our species’ history.

a See http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/.

This exponential increase in computing power has had profound consequences for many aspects of modern society, including scientific, engineering, and medical research.5 Using digital technologies, researchers can measure, describe, and model phenomena much more comprehensively and in far greater detail than was possible in the past.
They can detect and analyze the products of high-energy particle collisions to probe the underlying structure of matter. They can extract information about the functioning of nerve cells and construct models of neural processing. They can combine simultaneous measurements of atmospheric and oceanic conditions to predict the effects of pollutants on climates. They can extract patterns of health from extensive databases of genetic and medical records. Examples of the impact of digital technologies on research fields appear as sidebars throughout this report, and the number of such examples could be multiplied many times.

5

Alexander Szalay and Jim Gray. 2006. “Science in an exponential world.” Nature 440:413–414.

FIGURE 1-1 Projected global information creation and available storage. [Chart not reproduced: information created and available storage, in exabytes (0 to 1,800), for 2005 through 2011, with an inset labeled “Available Storage, 2007” and “264EB” that divides storage among disk (56%), tape, optical, and other media.]
NOTE: One exabyte equals one billion gigabytes.
SOURCE: IDC White Paper sponsored by EMC, The Diverse and Exploding Digital Universe, March 2008. Available at: http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf.

The advances in digital technologies have caused a massive increase in the quantity of data generated by research projects. The proposed Large Synoptic Survey Telescope is expected to gather 30 terabytes of data per night and more than 60 petabytes over its lifetime (see Box 1-2). Particle physics experiments conducted with the Large Hadron Collider at CERN (Figure 1-2) will generate 15 petabytes of data annually. Even relatively small-scale projects can generate immense quantities of data that can be valuable in multiple research fields. These quantities of data are much too large to examine by hand. Instead, computers must conduct the initial analysis of data before the processed and condensed results are reviewed by researchers.
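The figures quoted in this chapter can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes each unit is exactly 1,000 times the previous one (Box 1-1 gives the factors as approximate), it takes a 40-year span and a 2-year doubling period as example inputs for Moore’s law, and the function names `to_bytes`, `humanize`, and `growth_factor` are hypothetical helpers, not part of any cited work.

```python
# Illustrative arithmetic only; names and the exact factor of 1,000 are assumptions.
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes",
         "terabytes", "petabytes", "exabytes"]


def to_bytes(value, unit):
    """Convert a quantity to bytes, treating each unit as exactly 1,000 of the previous one."""
    return value * 1000 ** UNITS.index(unit)


def humanize(n_bytes):
    """Express a byte count in the first unit for which the number drops below 1,000."""
    value = float(n_bytes)
    for unit in UNITS[:-1]:
        if value < 1000:
            return f"{value:g} {unit}"
        value /= 1000
    return f"{value:g} {UNITS[-1]}"


def growth_factor(years, doubling_period_years):
    """Factor by which a quantity grows if it doubles once every doubling period."""
    return 2 ** (years / doubling_period_years)


# Doubling every 2 years for 40 years is 20 doublings: about six orders of magnitude.
moore_factor = growth_factor(40, 2)

# Nights needed to accumulate 60 petabytes at 30 terabytes per night.
lsst_nights = to_bytes(60, "petabytes") / to_bytes(30, "terabytes")

# Average daily volume implied by 15 petabytes per year.
lhc_per_day = humanize(to_bytes(15, "petabytes") / 365)

print(f"Growth over 40 years at a 2-year doubling period: {moore_factor:,.0f}x")
print(f"LSST nights to reach 60 petabytes: {lsst_nights:,.0f}")
print(f"LHC average per day: {lhc_per_day}")
```

Under these assumptions, 60 petabytes corresponds to 2,000 nights of observing at 30 terabytes per night, and 15 petabytes per year averages out to roughly 41 terabytes per day.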

BOX 1-2 Digital Data in Astronomy

As astronomical observatories have become more powerful, they also have become more data-intensive.a Table 1-1 shows the trend in recent decades. The Sloan Digital Sky Survey (SDSS), for example, has delivered an unprecedented flood of data since it began operation in 2000. The SDSS uses a dedicated 2.5-meter telescope on Apache Point, New Mexico, equipped with two special-purpose instruments. The telescope’s camera can image 1.5 square degrees of sky at a time, about eight times the area of the full moon. A pair of spectrographs can measure spectra of, and hence distances to, more than 600 galaxies and quasars in a single observation. A custom-designed set of software pipelines keeps pace with the enormous data flow from the telescope.

TABLE 1-1 Data Trends in Astronomy Research

Cosmic Microwave Background (CMB) Surveys: Collect information used to understand the origin and evolution of the universe

Year  Survey                                                 Data items (pixels)
1990  Cosmic Background Explorer (COBE)                      1,000
2000  Boomerang (balloon-borne millimeter-wave telescope)    10,000
2002  Cosmic Background Imager (CBI)                         50,000
2003  Wilkinson Microwave Anisotropy Probe (WMAP)            1,000,000
2009  Planck                                                 10,000,000

Galaxy Surveys: Collect two dimensional optical images of galaxies and quasars

Year  Survey                                                 Objects
1970  Lick Observatory                                       1,000,000
1990  Automatic Plate Measuring Facility (APM)               2,000,000
2005  Sloan Digital Sky Survey (SDSS)                        200,000,000
2009  Visible and Infrared Telescope for Astronomy (VISTA)   1,000,000,000
2015  Large Synoptic Survey Telescope (LSST)                 20,000,000,000

Galaxy Redshift Surveys: Collect three dimensional optical catalogs of galaxies and quasars

Year  Survey                                                 Objects
1986  Center for Astrophysics (CfA)                          3,500
1996  Las Campanas Redshift Survey (LCRS)                    23,000
2003  2dF Galaxy Redshift Survey                             250,000
2005  Sloan Digital Sky Survey (SDSS)                        750,000
2007  SDSS color-redshift survey                             20,000,000
2015  LSST color-redshift survey                             4,000,000,000

NOTE: There are 100 billion galaxies in the observable universe, meaning that LSST will record about 20 percent.
Source: Presentation to the committee by Alex Szalay, Johns Hopkins University, December 2007, updated in 2008 with comments by Tony Tyson and Michael Turner.

In its first 5 years of operation, the Sloan telescope searched more than 8,000 square degrees of the northern sky (about a fifth of the entire sky) in five wavelength bands. It recorded some 217 million objects, mostly galaxies, stars, and asteroids, and measured spectra for around 675,000 of these.b

With funding from multiple sources and countries, the SDSS has followed a policy of freely releasing data annually, with separate Web sites for research users and the general public. A recent release, Data Release 7 (DR7), in November 2008, included some 16 terabytes of images and spectra. Its current phase, SDSS-II, is among the largest astronomical collaborations ever undertaken, involving more than 300 astronomers, astrophysicists, and engineers at 25 institutions around the world.

The SDSS has helped to revolutionize the interactions between a telescope, its data, and its user communities. Because the SDSS data archive is available to any astronomer, roughly half of the 2,100 refereed papers based on SDSS data have come from authors outside the project itself, and that proportion is rising. In fact, for the past 2 years, the SDSS has produced the most high-impact papers of any astronomical observatory.c

At the same time, the project has extended the “reach” of those wishing to participate in frontier astronomy research or to simply enjoy the ability to “be there” as amateur aficionados. The public is offered both the raw data of SDSS and, at a “SkyServer” Web site, a range of search tools to help them use the data. Teachers are encouraged to adapt the projects for use in the classroom. SDSS data also are available through the National Virtual Observatory (http://www.us-vo.org), a collaborative effort involving universities, supercomputer centers, observatories, and data repositories.d

Even bigger projects are under development. For example, the Large Synoptic Survey Telescope (http://www.lsst.org) that is currently being developed will generate as much data each night as a complete SDSS. As the “Living LSST Document, Version 1.0, of May 15, 2008” put it:

    LSST has been conceived as a public facility: The database it will produce, and the associated object catalogs that are generated from that database, will be made available to the world’s astronomical research community and to the public at large with no proprietary period. The software which created the LSST database will be open source. LSST will be a significant milestone in the globalization of the information revolution. LSST will put terabytes of data each night into the hands of anyone who wants to explore it, and in some sense will become an Internet telescope: the ultimate network peripheral device to explore the universe, and a shared resource for all humanity.

a Alexander Szalay and Jim Gray. 2001. “The world-wide telescope.” Science 293:2037–2040.
b Robert C. Kennicutt, Jr. 2007. “Sloan at five.” Nature 450:488–489.
c J. P. Madrid and F. D. Macchetto. 2006. “High-impact astronomical observatories.” Bulletin of the American Astronomical Society 38:1286–1287.
d Alexander Szalay, Johns Hopkins University, presentation to the committee, December 10, 2008.

FIGURE 1-2 LHC at CERN.
SOURCE: © CERN. See http://cdsweb.cern.ch/record/42370.

However, the most consequential changes being fostered by digital technologies involve issues that range beyond the quantities of data generated. Today, researchers can access a rapidly expanding range of digital information from around the world almost instantaneously. They can use this information to analyze their results, as when biologists compare DNA sequences they have generated to sequences stored in worldwide databases. They can incorporate information from others with their own data to make discoveries that would otherwise have been impossible, as when epidemiologists combine census and economic data to analyze the prevalence of disease. They can analyze data produced by others to answer questions that could not have been anticipated by the data’s creators, as when astronomers use digital sky surveys to investigate newly recognized phenomena in distant galaxies. For some areas of science, engineering, and medical research in the digital age, carrying out laboratory experiments to corroborate or disprove hypotheses has given way to a process of hypothesis testing based on computational analysis and modeling.

The creation of inexpensive, complex sensors is contributing to the data explosion by enabling new research approaches in a variety of fields, particularly in the earth sciences. Projects such as the National Science Foundation’s Network for Earthquake Engineering Simulation and National Ecological Observatory Network, as well as the National Aeronautics and Space Administration’s Earth Observing System, depend heavily on sensor networks.

6

National Research Council. 2001. Issues for Science and Engineering Researchers in the Digital Age. Washington, DC: The National Academies Press.

Digital technologies also are making possible a new kind of science that depends on simulations combined with experimentation and observation.7 Cosmologists can combine simulations of galactic dynamics with astronomical observations of distant galaxies to analyze the early evolution of the universe. Records of calls made with cell phones can be compared to mathematical models of social networks. Researchers can model the functions of cells, simulate the effects of modifying those functions, and then re-create these modifications in real cells to alter biological function and refine the original models. Large-scale simulations of natural phenomena can be as valuable as data drawn from observations of the natural world.

The advances in research enabled by high-performance computing and high-performance communications are contributing to a steady growth of collaborations and interdisciplinary projects. Digital communication technologies enable researchers to communicate and exchange data with colleagues around the world, creating electronic collaborations that can catalyze progress. By making it possible to address more complex and integrative questions, these technologies also catalyze interdisciplinary collaboration. As one indicator of this trend, consider the growth in the number of authors on research papers over time. Over the course of 40 years, according to a computerized analysis of millions of published science and engineering papers, the average number of authors per paper in the sciences nearly doubled, from 1.9 to 3.5.8 In the environmental sciences, the fraction of papers with multiple authors rose from 25 percent to 82 percent; in economics, it rose from 9 percent to 52 percent.

Collaborations have also become more international.
In 2003, 20 percent of all research publications had authors from more than one country, compared with 8 percent in 1988.9 Citations to literature produced outside the author’s home country rose from 42 percent of all citations in 1992 to 48 percent in 2003.

However, the most far-reaching effects of digital technologies are not evident in traditional measures of research collaboration. Researchers, and especially young researchers, are developing new ways to interact with each other and with the subjects they study.10 They exchange information in virtual communities, write and read blogs on research developments, and are pioneering new methods to conduct research and share their results. In the long run, these developments are likely to have a more profound effect on research than increases in the pace or scale of traditional practices. These developments can be difficult to foresee. For example, research in many fields is moving toward much more open and collaborative models that are both served and driven by technology, and this trend is likely to result in research environments very different from those that have prevailed in the past. Although our committee has not tried to predict the long-term outcomes of this process, ongoing changes can be expected to continue to transform how research is done and how researchers interact with each other.

The rapid spread of digital technologies also is transforming the relationship between researchers and the broader public that supports and expects to benefit from research. When research results that underlie important public policies are available electronically, they can be examined and questioned by any member of the public. Individuals interested in specific issues, whether the regulation of an environmental toxin or the development of therapies for a human disease, can monitor, comment on, and even shape ongoing research.

Similarly, digital technologies have profound implications for scientific, engineering, and medical education.11 Students can have access to research information from instruments in distant locations.12 Computer owners around the world can contribute to the solution of particular research problems by allowing their computers to become parts of distributed computational networks.13 Data from cutting-edge research are being made available on the Internet for use not only by the research community but by educators or anyone else interested in the subject.14 Members of the public are participating in research projects as varied as analyses of genetic variation and galactic structure.15 Although fascinating, the full consequences of changing technologies for scientific, engineering, and medical education or for direct public participation in research lie outside the scope of this report.

7

The 2020 Science Group. 2006. Towards 2020 Science. Redmond, WA: Microsoft Corporation. Available at http://research.microsoft.com/en-us/um/cambridge/projects/towards2020science/downloads/T2020S_ReportA4.pdf.

8

Stefan Wuchty, Benjamin F. Jones, and Brian Uzzi. 2007. “The increasing dominance of teams in production of knowledge.” Science 316:1036–1039.

9

National Science Board. 2006. Science and Engineering Indicators 2006. Arlington, VA: National Science Foundation.

10

Carolyn Y. Johnson. 2008. “Out in the open: Some scientists sharing results.” The Boston Globe, August 21, p. A1.

11

National Research Council. 2002. Preparing for the Revolution: Information Technology and the Future of the Research University. Washington, DC: The National Academies Press.

12

An example is the Education and Outreach Project of the National Virtual Observatory (http://www.virtualobservatory.org).

13

An example is the SETI@home project (http://setiathome.berkeley.edu), which uses computer time provided by volunteers to analyze astronomical data for signs of intelligence.

14

Ryan Scranton, Andrew Connolly, Simon Krughoff, Jeremy Brewer, Alberto Conti, Carol Christian, Craig Sosin, Greg Coombe, and Paul Heckbert. 2007. “Sky in Google Earth: The next frontier in astronomical data discovery and visualization.” Available at http://arxiv.org/PS_cache/arxiv/pdf/0709/0709.0752v2.pdf.

15

For the analysis of genetic variation, see https://www3.nationalgeographic.com/genographic. For the analysis of galactic structure, see http://www.galaxyzoo.org.

CHALLENGES POSED BY RESEARCH DATA IN A DIGITAL AGE

Rapid advances in computing and communication technologies have changed the professional responsibilities, interpersonal interactions, and daily practices of researchers. Many of these changes have strengthened the research enterprise, both by enabling researchers to ask new questions of nature and by providing new means of achieving research objectives. At the same time, some changes have raised important issues involving researchers, research institutions, sponsors, and journals.16 These issues are the focus of this report on the integrity, accessibility, and stewardship of research data.

As discussed in Chapter 2, although advances in digital technologies allow phenomena and objects to be described more comprehensively and accurately, they also can complicate the process of verifying the accuracy and validity of the data (see Box 1-3 for an example). Digital technologies require the translation of phenomena and objects into digital representations, which can introduce inaccuracies into the data. Digital data often undergo several layers of complex processing as they move from an instrument or sensor to the point of being reviewed by a researcher. If this processing is not properly done or is misunderstood, the results can be misleading. In some cases, researchers may intentionally or unintentionally distort data in a misguided attempt to emphasize particular features and downplay others. In the worst cases, researchers can falsify or fabricate data, thereby violating both the ethical and methodological standards of research integrity. Many of these considerations apply as well to data that are not generated or stored digitally, but digital technologies both expand and intensify the challenge of maintaining the integrity of data.

Chapter 3 describes the challenges that researchers face in maintaining the traditional openness of research in a digital age.
Electronic technologies provide researchers with many new ways of communicating data to others, but providing other researchers with access to large databases can be difficult and expensive. With smaller, heterogeneous databases, where quality control and documentation tend to be less formal, sociological and technological factors can restrict data sharing. Also, an increasing range of restrictions are being placed on research data as this information becomes more valuable for commercial uses, which can limit the distribution and utilization of data within and beyond the research community.

Even as more research data are being created, their value for future uses is increasing. Chapter 4 describes the need to preserve many research data for long-term use, even in situations where those uses cannot be currently envisioned. Digital storage technologies, application environments, and operating systems change every few years, which means that digital bits must continually be transferred from one storage platform and software environment to another if they are not to be lost. Digital data also need to be annotated in sufficient detail that future researchers, sometimes in fields well removed from those of the data’s original creators, can both use the data and understand their limitations. Maintaining data collections for long-term use thus requires continued investment and planning, which can compete with expenditures for ongoing research.

16

National Research Council. 2001. Issues for Science and Engineering Researchers in the Digital Age. Washington, DC: National Academy Press.

BOX 1-3 Digital Data in the Neurosciences

The neurosciences illustrate both the potential value of well-organized and accessible data and the variety of issues raised by the increased importance of data handling and data sharing.

It is not surprising that the neurosciences are rich in the use of and need for data, given the complexity of the nervous system. The brain has roughly a hundred billion neurons and more than 1,000 subdivisions, each with different structures and circuitry. In the past, neurological research has depended heavily on autopsy for clues about function and structure. Now it relies heavily on in vivo imaging methods and computational models, both of which depend on computing power and mathematical techniques.

This new universe of neuroscience data is too vast and complex for manual analysis. Large-scale detailed maps of the brain can require some 25 gigabytes of memory per image. Also, neuroscientists must work across multiple scales of resolution because they do not yet know which levels are critical for many neurological processes. They must integrate such diverse datasets as cellular neuroimaging, gene expression data, genotype data, neuronal morphology, and clinical data.

Making neuroscience data widely available holds tremendous potential for helping science and society. This includes:

• Facilitating replication and validation of experimental results,
• Promoting collective analyses of large numbers of experiments by different groups,
• Improving communication within and between groups, and
• Promoting collaboration.

Several very effective databases have been developed in the neurosciences. They include:

• The Cell Centered Database, started in 2002, makes two-dimensional and three-dimensional static and dynamic microscopic data available to the research community. It also links data obtained at cellular and subcellular scales to molecular and higher order structure. It is built on the Biomedical Informatics Research Network and Telescience grid infrastructure for distributed collaboration.
• SumsDB is a repository of brain-mapping data, including surfaces and volumes, with both structural and functional data. It includes more than 500 studies on monkeys, rodents, apes, humans, and others, totaling about 10 percent of the published literature. It also includes a data mining tool called WebCaret so that SumsDB can be searched online without downloading. Its designers have made attempts to provide metadata and show the source of data, including links to online publications.

Many questions have arisen in developing these and other databases. Which digital data and data stored on film need to be stored? Do calibrations (i.e., the characterization of an instrument’s response to known stimulus) need to be stored, and if so which ones? Should proprietary tools be stored so that users can see how the primary data were processed? For now, there is reason to err on the side of depositing too much data, because no one knows what subsequent researchers will need. However, it is likely that just a small percentage of databases will find widespread use, which complicates, rather than simplifies, the task of storage.

Complex databases always include errors. Obvious errors, such as coordinates that lie outside the brain, can be found more easily when data are shared. However, policing data before they are added to a database can be so time-intensive that it can discourage database building. Fortunately, new technologies for assuring the quality of data based on advances in such areas as pattern recognition and learning theory, combined with rapid advances in data processing and storage, are providing new and automated methods for testing the quality of data.

Another problem is that most data assigned to databases in the neurosciences are not adequately annotated, and even those with annotation tend to use nonstandard terminology, making them “islands” of diverse resources. Such databases may not be useful for comparative studies or other purposes.

Issues of who has rights to use data also are far from resolved. A researcher may work for 5 years to assemble data on a transgenic mouse and be reluctant to give the data away. To make data open and accessible, incentives may need to be developed to encourage scientists to share their data.

Another issue is whether journals may be responsible for receiving and storing all primary or supplementary data. Most publications lack a suitable place to enter and store supplementary data, and who should pay for this service remains unresolved.

These issues, most of which we discuss later in this report, are being extensively explored in the research and policy-making communities. Many questions do not yet have clear answers that extend across all research disciplines.

SOURCE: This box draws on presentations to the committee by David Van Essen, Washington University in St. Louis, and Maryann Martone, University of California, San Diego on December 10, 2007.

DESCRIPTIONS OF TERMS USED IN THE REPORT

In describing issues as broad as those covered in this report, it is essential to have a clear understanding of the basic terms.

Research Data

Despite the importance of research data, there exists no standard or widely accepted definition of exactly what research data are. For the purposes of this report, we have treated data as information used in scientific, engineering, and medical research as inputs to generate research conclusions (see Box 1-4 for definitions from other reports).

This usage encompasses a wide variety of information. It includes textual information, numeric information, instrumental readouts, equations, statistics, images (whether fixed or moving), diagrams, and audio recordings. It includes raw data, processed data, published data, and archived data. It includes the data generated by experiments, by models and simulations, and by observations of natural phenomena at specific times and locations. It includes data gathered specifically for research as well as information gathered for other purposes that is then used in research. It includes data stored on a wide variety of media, including magnetic and optical media.17

Though our concerns in this report lie largely with the application of digital technologies in research, our examination of the issues is not limited to digital data.
Nor does this report address just those areas traditionally considered “science.” It applies to all efforts to derive new knowledge about the physical, biological, or social worlds and thus encompasses research in engineering and in all of the physical, biological, behavioral, and social sciences. The conclusions in the report generally apply to quantitative data. However, many of our conclusions also apply to qualitative data, though we have not focused on the issues unique to qualitative data. Also, this report does not address research in the humanities, which lies outside the committee’s charge and expertise.

17

As a point of comparison, the Office of Management and Budget defines research data as “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This ‘recorded’ material excludes physical objects (e.g., laboratory samples).” See OMB Circular A-110 at http://www.whitehouse.gov/omb/circulars/a110/a110.html.

BOX 1-4
Definitions of “Research Data” from Other Reports

“Data are facts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors.”a

“A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing. Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen.”b

“Any information that can be stored in digital form, including text, numbers, images, video or movies, audio, software, algorithms, equations, animations, models, simulations, etc. Such data may be generated by various means including observation, computation, or experiment.”c

a National Research Council. 1999. A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases. Washington, DC: National Academy Press, p. 15.
b Consultative Committee for Space Data Systems. 2002. Reference Model for an Open Archival Information System (OAIS). Washington, DC: National Aeronautics and Space Administration, p. 1-9. Available at http://public.ccsds.org/publications/archive/650x0b1.pdf.
c National Science Board. 2005. Long-Lived Data Collections: Enabling Research and Education in the 21st Century. Arlington, VA: National Science Foundation, p. 13.

The term “data” in this report excludes physical objects (including living organisms) and other materials used in research, such as biological reagents or the devices, instruments, or computers that generate experimental or observational data. In many cases, these physical objects can be described in written, numeric, or visual forms, and these descriptions constitute data. However, because materials are tangible whereas data are generally intangible, different issues surround their use, storage, and dissemination.
Some of the observations and conclusions in this report apply to materials as well as to data, and on occasion we make this extension of our conclusions explicit. However, the treatment of materials in research introduces issues that are beyond the subject matter of this report.18

Finally, our definition excludes information that can be important in research but is not used to generate research conclusions, including interpretive statements, or matters of personal judgment, such as peer reviews, plans for future research, communications with colleagues, or personnel assessments. Of course, the line between research data and subjective judgments is sometimes difficult to draw, since subjective judgments can influence the structure ascribed to data. Nevertheless, a distinction exists, and we do not mean to imply that all of the information associated with research necessarily constitutes research data.

18

Issues related to sharing research materials in the life sciences have been addressed by a previous National Research Council report. See National Research Council. 2003. Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences. Washington, DC: The National Academies Press.

Metadata

As used in this report, the term “metadata” refers to descriptions of the content, context, and structure of information objects, including research data, at any level of aggregation (for example, a single data item, many items, or an entire database). According to the National Science Foundation report Cyberinfrastructure Vision for the 21st Century, metadata “summarize data content, context, structure, interrelationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.”19 Metadata make it easier for data users to find and utilize data, particularly if they are machine-readable.

Metadata are extremely diverse, ranging from written descriptions of instruments and software to the largely tacit knowledge on which the success of an investigation often depends. They are a critical part of the context needed to assess the integrity of data and use data accurately. Metadata are themselves data, since they consist of descriptive, factual information about data. Thus, conclusions about data in this report generally apply to metadata as well, although special considerations sometimes apply to metadata.
Until fairly recently, the term “metadata” was used primarily by the library community and by individual research communities.20 As digital data have become more important in a variety of disciplines and fields, the scope and value of metadata have grown, leading to the development of metadata standards. Metadata standards represent an agreed set of terminologies, definitions, and values to be provided for data in a given field or community.21

19

NSF Cyberinfrastructure Council. 2005. NSF’s Cyberinfrastructure Vision for 21st Century Discovery. Arlington, VA: National Science Foundation.

20

Tony Gill, Anne J. Gilliland, Maureen Whalen, and Mary S. Woodley. 2008. Introduction to Metadata, Version 3.0. Los Angeles, CA: J. Paul Getty Trust. Available at www.getty.edu/research/conducting_research/standards/intrometadata/index.html.

21

U.S. Geological Survey, Coastal and Marine Biology InfoBank. USGS CMG “Formal Metadata” Definition. See walrus.wr.usgs.gov/infobank/programs/html/definition/fmeta.html. Accessed December 8, 2008.
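
As an illustration only, a machine-readable metadata record might be sketched as follows. The class name, field names, and example values are assumptions chosen to mirror the wording of the National Science Foundation definition quoted above (content, context, structure, interrelationships, provenance); they do not correspond to any particular metadata standard.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class MetadataRecord:
    """Hypothetical record; field names follow the NSF wording, not a real standard."""
    content: str                      # what the data describe
    context: str                      # how, where, and why they were gathered
    structure: dict                   # e.g., column names mapped to units
    interrelationships: list = field(default_factory=list)  # links to related datasets
    provenance: list = field(default_factory=list)          # history and origins


def to_machine_readable(record: MetadataRecord) -> str:
    """Serialize the record as JSON so that software, not only people, can read it."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Invented example values, for illustration only.
example = MetadataRecord(
    content="Hourly air temperature readings",
    context="Single field station, one observing season",
    structure={"timestamp": "ISO 8601", "temperature": "degrees Celsius"},
    provenance=["recorded by sensor", "outliers flagged by hand"],
)
encoded = to_machine_readable(example)
```

A record of this kind captures only the written, explicit part of metadata. The tacit knowledge described above cannot be reduced to a handful of fields.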

Raw and Processed Data

Raw data directly from an instrument or data that have not been documented or processed usually are of little value to anyone except the individuals who generate or collect them. In many fields, capturing data that are “whole” or “perfect” may be difficult or impossible. Instruments may only partially and imperfectly record phenomena. Researchers may not even see the raw data on which their conclusions are based. In some cases, raw data may exist in a computer buffer for only a fraction of a second before they undergo processing. In other cases, raw data may be so voluminous that they cannot be examined in anything other than a processed or condensed form. However, raw data may need to be retained to validate research findings and, in some research fields, to support patent applications, investigate instances of research misconduct, or justify public policies.

Data used to draw conclusions, derive findings, and build models may undergo many changes as they are processed, distributed, and archived. They are analyzed, aggregated, and reformulated by researchers. Data often are organized into structures for long-term storage and access that require the expertise of professionals trained in the management and handling of large databases.

As soon as raw data are processed, the algorithms, computer programs, and other techniques used in that processing become crucial to their understanding. Many data cannot be properly interpreted or used without understanding the processing they have undergone, and it is generally impossible to judge the integrity of processed data without access to the metadata documenting how they were processed. In some cases, this processing may be so machine-dependent that the metadata must include either a thorough representation or a copy of the devices used to do the processing.
Consequently, to judge the accuracy and validity of data, researchers, policy makers, and other users of data may need a thorough understanding of the tools and procedures used to analyze those data. In many cases, a high level of expertise is needed to use metadata in order to place data in context.

Given the relatively broad definitions of data and metadata that we have adopted in this report, a great many issues are obviously associated with the generation, use, dissemination, and preservation of research data in the digital age. In this report, however, we focus on three specific issues, which we describe using the terms integrity, accessibility, and stewardship.

Integrity

Integrity describes an uncompromising adherence to ethical values, strict honesty, and absolute avoidance of deception. Integrity also describes the state of being whole and complete, of being totally unimpaired. Thus, the word “integrity” has both an ethical meaning and a structural or methodological meaning. In this report we use the word “integrity” in both senses.

According to one definition, “being assured of data’s integrity means having confidence that the data are complete, verified, and remain unaltered.”22 This is possible only if researchers adhere to professional and ethical standards of their fields. In some research fields, these standards are written, but in many areas they exist as tacit knowledge that is passed from senior researchers to beginning researchers over the course of a research apprenticeship. These professional standards, in turn, describe the methods, procedures, and tools that researchers are expected to employ to minimize error and bias in their work. Consequently, integrity in research has both an individual and a communal meaning. Researchers maintain the integrity of research data by adhering to the professional standards of their fields.

Researchers are expected to describe their methods and tools to others in sufficient detail that the data can be checked and the results verified. Completely and accurately describing the conditions under which data are collected, characterizing the equipment used and its response, and recording anything that was done to the data thereafter are critical to ensuring data integrity. Thus, for experimental data, integrity implies that the data can be reproduced in a test or experiment that repeats the conditions of the original test or experiment. For observational data, data of high “quality” (a term that we sometimes will use as a synonym for data integrity) have been validated through comparison with data whose quality is known or by being generated with an instrument that has been adequately calibrated or tested.

Accessibility

In this report, accessibility refers to the availability of research data to researchers other than those who generated the data.
Accessibility is a critical element of integrity, because data must be available to others in order for the validity of those data to be verified. However, in some cases an investigator may not be able to make data available to the public. For example, in private companies, data may need to be restricted for commercial reasons. In such cases, data are frequently made available within the company to evaluate their integrity. In this report, the term “accessibility” generally implies public access as well as availability to other researchers upon request.

Accessibility does not necessarily imply free access, because providing access to data entails financial costs that must be met. Also, access does not necessarily imply that researchers must provide inquirers with the training and expertise they would need to understand or use data. However, data should be accompanied by sufficient metadata for colleagues to assess the integrity of those data.

22

University of Minnesota Research Data Management Online Workshop (www.research.umn.edu/datamgtq1/MDI_020.html).
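
One narrow, structural part of assessing integrity is confirming that shared data “remain unaltered.” As a hedged sketch, a data provider might publish a checksum alongside a dataset so that recipients can check their copy. The example below uses the SHA-256 function in Python’s standard hashlib module; the function names and the sample bytes are invented for illustration, and nothing in this report prescribes this particular technique.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, published_digest: str) -> bool:
    """True if the received bytes match the digest published with the dataset."""
    return sha256_of(data) == published_digest


# Invented sample data, for illustration only.
original = b"sample_id,value\n1,0.52\n2,0.47\n"
published = sha256_of(original)                 # digest stored in the metadata
tampered = original.replace(b"0.47", b"0.74")   # a copy with one value changed
```

A matching digest shows only that the bytes are the same as those the provider published. It says nothing about whether the data were complete, verified, or honestly collected in the first place.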

Stewardship

In the broadest possible sense, the term “utility” in the name of our committee refers to all of the various applications of research data. Both integrity and accessibility are critical elements of utility, because research data must have integrity and be broadly accessible to be effectively utilized. However, our focus in this report is on a specific aspect of utility that we refer to as data stewardship: the long-term preservation of data so as to ensure their continued value, sometimes for unanticipated uses. Stewardship goes beyond simply making data accessible. It implies preserving data and metadata so that they can be used by researchers in the same field and in fields other than that of the data’s creators. It implies the active curation and preservation of data over extended periods, which generally requires moving data from one storage platform to another. The term “stewardship” embodies a conception of research in which data are both an end product of research and a vital component of the research infrastructure.

THE VARIETIES OF RESEARCH DATA

As the examples presented throughout this report illustrate, research data are so varied that they can be described in their entirety only in the most general terms. Different research fields have very different approaches to the treatment of research data. Even at the level of individual research groups, expectations and demands can vary greatly from one investigator to another. This tremendous variety within the research community complicates the task of arriving at conclusions that apply across all fields of research. Research fields are also characterized by diversity in the origins of data and by the size and other characteristics of data collections.

Diversity Across Disciplines

There is great diversity in the ways data are gathered and analyzed both among and within disciplines.
The sidebars in this and other chapters describe some of the diversity among disciplines, but individual disciplines also harbor great diversity in the ways data are gathered and analyzed. Data in physics, for example, range from small datasets generated by a “tabletop” experiment to the terabytes of data generated by an accelerator-based experiment. Databases in the social sciences may be freely available to all researchers in some fields and tightly restricted in other fields. Some fields within a discipline may have traditions of storing data for extended periods while others discard data relatively quickly. (In this report, “field” refers to an area of research smaller than a discipline. In many cases, a field can be roughly associated with the community of researchers who follow and publish articles in a relatively small collection of related journals, which analysts of science have referred to as “invisible colleges.”23)

Furthermore, some of the most interesting and productive areas of research today involve researchers from multiple disciplines working together on complex, integrative problems.24 In some cases, these areas of multidisciplinary research become so well defined that they evolve into research fields of their own, as in astrobiology. In other cases, researchers may come together to work on a multidisciplinary project and then disband once the project is over. In interdisciplinary research, different traditions of data treatment meet and sometimes clash, and new ways to gather, analyze, and store data may need to be developed to address novel challenges.

Diversity in Origins of Data

The practices for analyzing, disseminating, and storing research data vary greatly from field to field.25 For example, in some fields, observational data can be re-created by other researchers, but in other fields observations are impossible or impractical to make a second time. In these cases, observational data may need to be carefully archived for future use, including uses that cannot currently be foreseen.

Data generated through computer simulations are increasingly important in a variety of fields.26 Data generated entirely by computation can in principle be regenerated, assuming that enough is known about the hardware, software, and inputs used in the computation. However, each of these three components of a computation may be so complex or indeterminate that the computational data have some of the characteristics of observational data. Furthermore, many simulations involve random inputs, so that successive simulations will not be exactly the same.
In some cases, sharing and preserving the models and software tools used to create a simulation will be more important for verifying and building upon research than sharing and preserving the data generated. In other cases, the data themselves have value and can represent such a large investment of resources that they may need to be preserved for subsequent use in the same way that unique observational data are preserved.

23

Daryl E. Chubin. 1983. Sociology of Sciences: An Annotated Bibliography on Invisible Colleges, 1972–1981. New York: Garland.

24

National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 2005. Facilitating Interdisciplinary Research. Washington, DC: The National Academies Press.

25

National Research Council. 1995. Preserving Scientific Data on the Physical Universe: A New Strategy for Archiving Our Nation’s Scientific Information. Washington, DC: National Academy Press.

26

Ghaleb Abdulla, Terence Critchlow, and William Arrighi. 2004. “Simulation data as data streams.” SIGMOD Record 33(1):89–94.
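
The point about random inputs can be made concrete with a small sketch. The toy random walk below is an invented stand-in for a real simulation, and the function name and seed values are assumptions. It suggests why recording the random seed as part of the inputs can allow computational data to be regenerated, at least when the same software is available.

```python
import random


def simulate(n_steps: int, seed=None) -> list:
    """Toy random walk standing in for a simulation with random inputs."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    position = 0.0
    path = []
    for _ in range(n_steps):
        position += rng.gauss(0.0, 1.0)  # one random input per step
        path.append(position)
    return path


# Two runs that share a recorded seed, and one run with a different seed.
run_a = simulate(100, seed=2009)
run_b = simulate(100, seed=2009)
run_c = simulate(100, seed=2010)
```

Here run_a and run_b are identical because they share a seed, while run_c differs. Recording the seed does not remove the dependence on hardware and software noted above: a different random number library or version may produce a different sequence from the same seed.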

Data from experiments may be reproducible if a robust description of the experiment is available. In practice, however, it may not be possible to re-create the exact conditions of the experiment. An experimental apparatus also may be so costly to build or use that experiments can be conducted only once or over a limited time period. If so, long-term preservation of the data generated by the experiment may be essential for optimizing the experiment’s value.

Diversity in Types of Data Collections

In this report, we use the term “database” to refer to a collection of data that is organized to permit search, retrieval, processing, and reorganization of stored information. Databases include datasets, which are collections of similar or related data. We use the term “data collection” interchangeably with “database.”

In its report Long-Lived Data Collections: Enabling Research and Education in the 21st Century, the National Science Board divided data collections into three broad categories (Box 1-5).27 “Research collections” are the products of one or more focused research projects and typically serve just the research group that generated the data. “Resource collections” serve a single science or engineering community and are generally intermediate in size and budget. “Reference collections” serve large segments of the research and education communities and are often supported by large budgets.

These categories may seem to correspond to small-scale research, intermediate-sized research projects, and large-scale research, but the National Science Board’s report shows that such an association can be misleading. Using digital technologies, relatively small-scale projects can generate immense quantities of data that become the basis for research in many related fields. Large-scale reference data collections may be the product of many small projects linked through digital networks.
Or large projects may produce focused data collections that serve a narrow research purpose and never become publicly available. Thus, distinguishing research data by the size of the group that generated those data is problematic, in part because of new capabilities created by digital technologies.

27

National Science Board. 2005. Long-Lived Data Collections: Enabling Research and Education in the 21st Century. Arlington, VA: National Science Foundation.

BOX 1-5
Three Types of Data Collections

The National Science Board (NSB) has organized data collections into the three categories described below. In addition, the NSB defined “collection” to refer “not only to stored data but also to the infrastructure, organizations, and individuals necessary to preserve access to the data.”

Research data collections are the products of one or more focused research projects and typically contain data that are subject to limited processing or curation. They may or may not conform to community standards, such as standards for file formats, metadata structure, and content access policies. Quite often, applicable standards may be nonexistent or rudimentary because the data types are novel and the size of the user community [is] small. Research collections may vary greatly in size but are intended to serve a specific group, often limited to immediate participants. There may be no intention to preserve the collection beyond the end of a project. One reason for this is funding. These collections are supported by relatively small budgets, often through research grants funding a specific project. (Example: The Fluxes Over Snow Surfaces Project, http://www.atd.ucar.edu/rtf/projects/FLOSS.)

Resource or community data collections serve a single science or engineering community. These digital collections often establish community-level standards either by selecting from among preexisting standards or by bringing the community together to develop new standards where they are absent or inadequate. The budgets for resource or community data collections are intermediate in size and generally are provided through direct funding from agencies. Because of changes in agency priorities, it is often difficult to anticipate how long a resource or community data collection will be maintained. (Example: The Arabidopsis Information Resource, http://www.arabidopsis.org.)

Reference data collections are intended to serve large segments of the research and education community. Characteristic features of this category of digital collections are a broad scope and a diverse set of user communities including scientists, students, and educators from a wide variety of disciplinary, institutional, and geographical settings. In these circumstances, conformance to robust, well-established, and comprehensive standards is essential, and the selection of standards by reference collections often has the effect of creating a universal standard. Budgets supporting reference collections are often large, reflecting the scope of the collection and breadth of impact. Typically, the budgets come from multiple sources and are in the form of direct, long-term support, and the expectation is that these collections will be maintained indefinitely. (Example: Protein Data Bank, http://www.pdb.org.)

SOURCE: National Science Board (2005), Long-Lived Data Collections: Enabling Research and Education in the 21st Century, Arlington, VA, National Science Foundation.

STRUCTURE OF THE REPORT

The remainder of this report is organized into three thematic chapters and a final summary chapter. Chapter 2 considers the integrity of data throughout their life cycle, from their collection to their disposal or preservation. Maintaining the integrity of research data is a fundamental obligation of researchers; achieving this objective in the digital age can be either easier or more difficult than in earlier times.

Chapter 3 considers the issues of accessing and sharing research data. The research enterprise is built on the precept that researchers will make the data on which publicly disseminated conclusions are based available to their colleagues so that others can verify and build on those data. Accessibility is vital for ensuring the integrity of research data and facilitating their future use.

Chapter 4 discusses the stewardship of research data, that is, their long-term preservation in databases for various future research uses and other applications. Preserving data collections can be expensive and difficult, so much so that it can compete with the conduct of research. Yet the loss of many kinds of research data also can incur substantial costs.

The final chapter reorganizes recommendations that have appeared earlier in the volume according to different actors within the research community rather than thematically. It also discusses how action can be motivated when responsibility for research integrity, accessibility, and stewardship is shared across the components of the research community. Each part of the research enterprise has much to gain or lose, depending on how research data are managed, and each has a role to play in ensuring the integrity, accessibility, and stewardship of research data.
