National Academies Press: OpenBook

Open Science by Design: Realizing a Vision for 21st Century Research (2018)

Chapter: 2 Broadening Access to the Results of Scientific Research

Suggested Citation: "2 Broadening Access to the Results of Scientific Research." National Academies of Sciences, Engineering, and Medicine. 2018. Open Science by Design: Realizing a Vision for 21st Century Research. Washington, DC: The National Academies Press. doi: 10.17226/25116.


2 Broadening Access to the Results of Scientific Research

SUMMARY POINTS

• The concept of open science, as it has emerged over the past several decades, is tightly linked with traditional scientific values and norms. At the same time, the digital revolution makes possible a restructuring of research practices and institutions built around the openness of publications, data, code, and other research products.
• Open science is motivated by a number of actual and anticipated benefits. They include the availability of the results of publicly funded research to the public, as well as more reliable and efficient research. Openness also enables researchers to address entirely new questions and work across national and disciplinary boundaries. Open science supports expanded access to the research process itself through citizen science activities.
• Despite the advantages and motivations for open science, significant barriers and limitations remain. These barriers and limitations include aspects of research culture and incentives that work against open science, insufficient infrastructure, resource constraints, disciplinary differences, policy and legal constraints, and lack of awareness.

ORIGINS AND SIGNIFICANCE OF OPEN SCIENCE

The concept of open science, sometimes also referred to as “open scholarship,” is an ambitious goal that aims to ensure the availability and usability of (1) scholarly publications, (2) the data that result from scholarly research, and (3) the methodology, including code or algorithms, that was used to generate those data. The first of these is often known as open access. Since the term open access is sometimes used in other contexts, this report will use the term open publication instead. Ensuring the availability and usability of data resulting from research is known as open data. Ensuring the availability and usability of methods, in the case of computational work, is known as open code, and it is related to the concept of open source software.

Open science typically refers to the entire process of conducting science and harkens back to the original precepts underpinning the conduct and goals of the scientific enterprise (Storer, 1966; Borgman, 2010; Neylon, 2017). Openness has been seen as a “norm” of science: “The substantive findings of science are a product of social collaboration and are assigned to the community…. The institutional conception of science as part of the public domain is linked with the imperative for the communication of findings” (Merton, 1942). In addition, openness facilitates realization of the scientific norm that results are critically examined before they are accepted (Merton, 1942). The digital revolution of the past several decades has vastly increased the possibilities of openness and lowered the costs:

Shifting from ink on paper to digital text suddenly allows us to make perfect copies of our work. Shifting from isolated computers to a globe-spanning network of connected computers suddenly allows us to share perfect copies of our work with a worldwide audience at essentially no cost. About thirty years ago this kind of free global sharing became something new under the sun. Before that, it would have sounded like a quixotic dream (Suber, 2012).

More recently, the InterAcademy Council and the National Academies of Sciences, Engineering, and Medicine have reaffirmed openness as a core value of science (IAC-IAP, 2012; NASEM, 2017b). The European FOSTER (Facilitate Open Science Training for European Research) group has argued that open science is a concept that applies to the “whole research cycle, fostering sharing and collaboration as early as possible thus entailing a systemic change to the way science and research is done” (FOSTER, 2018; Figure 2-1).
The contemporary focus on openness in science has evolved in the context of the public Internet and the communication opportunities it has afforded, as well as the broadening of the scientific enterprise to include many new institutions worldwide. Distinct, but interrelated, motivations also include: the taxpayer’s right to the results of publicly funded research; the ability of any member of society to scrutinize, evaluate, challenge, and reproduce scientific claims; and the opportunity for anyone, including private citizens, to build directly on the scientific investigations of others. The motivations, benefits, and challenges of open science will be explored in more detail below. These factors all influence how open science is perceived, defined, implemented, and promoted (Royal Society, 2012; Fecher and Friesike, 2014; Pomerantz and Peek, 2016; Tennant et al., 2016).

Open publication is the most developed aspect of open science and has become more widely implemented over the past decade. Open publication refers to free and unrestricted access to publications, with the only restriction on use being that proper attribution and credit need to be given to the original creator of the work (as originally advocated by the Budapest Open Access Initiative, 2002; see

FIGURE 2-1 The FOSTER Taxonomy of Open Science. SOURCE: FOSTER (Facilitate Open Science Training for European Research) project. Online. Available at https://figshare.com/articles/Open_Science_Taxonomy/1508606. Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).

Box 2-1).¹ Further, publications are to be deposited in “an appropriate standard electronic format” in at least one archive maintained by a reputable institution “that seeks to enable open access, unrestricted distribution, interoperability, and long-term archiving” (Open Access Max-Planck-Gesellschaft, 2003).

In the years since the first open access or open publication definition was put forward, open journals have emerged and traditional journals have, in some cases, revised their relevant policies. In an attempt to delineate the variation in interpretation of openness by journal publishers, the Public Library of Science (PLOS), Scholarly Publishing and Academic Resources Coalition (SPARC), and Open Access Scholarly Publishers Association (OASPA) have published the guide HowOpenIsIt? (Table 2-1). The guide assesses the spectrum of policies and approaches from fully open to closed along multiple dimensions. It suggests that fully open publication means that all articles in the journal are freely available to readers immediately upon publication. Immediate availability of articles at no cost to the reader beyond that required to access the Internet is known as gold open access. Other aspects of fully open publication in the realm of articles include generous reuse rights; the author holding copyright with no restrictions; the author being able to post any version to any repository or website with no delay; journals making copies of all articles automatically and immediately available in a trusted repository; and the full text of articles and supporting data being accessible via an application program interface (API) (SPARC et al., 2014).

Less open approaches to publication include green open access, in which authors are able to self-archive a version of the article in an open access repository when access to the final published version requires a subscription to the journal.

BOX 2-1
The Budapest Open Access Initiative

By “open access” to [peer-reviewed research literature], we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

SOURCE: Budapest Open Access Initiative, 2002.

¹ The Bethesda Statement on Open Access Publishing is a related statement with a focus on the biomedical research community (Bethesda Statement, 2003).

TABLE 2-1 HowOpenIsIt? SOURCE: SPARC, PLOS, and OASPA. 2014. Online. Available at https://www.plos.org/files/HowOpenIsIt_English.pdf. Licensed under CC BY.

Note that copyright holder consent is a key requirement for making a publication openly available. Licenses designed to allow authors to retain copyright to their work have been developed by the Creative Commons organization, which allows authors to choose one of several licenses consistent with copyright law (Carroll, 2011, 2015). The retention of copyright by authors for the purpose of making publications openly available has been one of the most contentious issues surrounding open publication, since it goes against journal publishing practices that require authors to assign the copyright to their work to the journals through copyright transfer agreements as a condition for publication.

Beyond open publication, much recent activity has been dedicated to the concept of open data, such as the availability of the data that support the research results reported in an article. Increasingly, the openness of data is seen as being critical to the progress of science, stimulating innovation, enhancing reproducibility, and enabling new research questions. Combining datasets for new insights and mining data through sophisticated machine learning algorithms are made possible by the open availability of datasets (Hrynaszkiewicz and Cockerill, 2012; Tennant, 2016). The Open Data Handbook (2018) offers this definition of open data: “Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and share alike.” This implies that the data are available “in a convenient and modifiable form” such that there are no unnecessary technological obstacles to exercising licensed rights (Open Data Handbook, 2018).
The Panton Principles for Open Data in Science, among other points, emphasize that when publishing data, authors need to “make an explicit and robust statement” about their wishes regarding how their data can be used (Murray-Rust et al., 2010; Molloy, 2011). With a focus on data accessibility, stewardship, and reuse by humans as well as machines, the FAIR Guiding Principles were developed by an international group including individuals representing academia, industry, funding agencies, and publishers (Wilkinson et al., 2016; see Box 2-2).

It is important to note that FAIR data and open data are distinct but complementary concepts. FAIR data are not necessarily open, and open data are not necessarily FAIR. Data that are open and FAIR will maximize the impact of open science.

Finally, the concept of open code is fundamentally linked to open source software and the Open Source Initiative that was founded in 1998 (Open Source Initiative, 2018). Open source licenses allow users the right to modify software code and freely redistribute it. The licenses are motivated by a desire to share and improve code by participating in an engaged community of users and software developers. The recent focus on open code differs in that it has not been concerned solely with the collaborative nature of software development, but ties in with the broader goals of open science. With computation becoming an increasingly integral part of scientific research in many domains, the availability of data and computational methods for many research studies is critical to the evaluation, reproducibility, and extension of those studies. A workshop held at the American

Association for the Advancement of Science in early 2016 led to a set of recommendations to address this problem (Stodden et al., 2016). In order to allow for reproducibility, the group recommended that “data, code, and workflows should be available and cited” (Stodden et al., 2016). The Transparency and Openness Promotion (TOP) Guidelines promulgated in 2015 are a set of recommended standards for adoption by journals to promote open practices, which encompass open data, research materials, and code (Nosek et al., 2015). The Guidelines are further described in Chapter 4.

BOX 2-2
The FAIR Guiding Principles for Scientific Data Management and Stewardship

To Be Findable:
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource

To Be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1. the protocol is open, free, and universally implementable
A1.2. the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available

To Be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data

To Be Reusable:
R1. (meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards

SOURCE: Wilkinson et al., 2016.
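
As a rough illustration of how a few of the FAIR principles might be checked in practice, the following Python sketch audits a metadata record against F1, F2, F4, A1, R1.1, and R1.2. The `DatasetRecord` fields, the `OPEN_PROTOCOLS` set, and the `fair_gaps` function are hypothetical names introduced here for illustration only; they do not come from Wilkinson et al. (2016) or from any standard tool, and a real assessment would likely need much richer criteria.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumption: protocols treated here as open, free, and universally implementable (A1.1).
OPEN_PROTOCOLS = {"http", "https", "ftp"}


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    identifier: Optional[str] = None      # F1: persistent identifier, e.g. a DOI
    metadata: Dict[str, str] = field(default_factory=dict)  # F2: descriptive attributes
    indexed_in: List[str] = field(default_factory=list)     # F4: searchable resources
    access_protocol: Optional[str] = None  # A1: retrieval protocol
    license: Optional[str] = None          # R1.1: data usage license
    provenance: Optional[str] = None       # R1.2: how the data were produced


def fair_gaps(record: DatasetRecord) -> List[str]:
    """Return the labels of the checked principles that the record does not meet."""
    checks = [
        ("F1", bool(record.identifier)),
        ("F2", bool(record.metadata)),
        ("F4", bool(record.indexed_in)),
        ("A1", record.access_protocol in OPEN_PROTOCOLS),
        ("R1.1", bool(record.license)),
        ("R1.2", bool(record.provenance)),
    ]
    return [label for label, ok in checks if not ok]


# Example with made-up values: a well-described record under a restrictive license.
restricted = DatasetRecord(
    identifier="doi:10.0000/example",
    metadata={"title": "Example survey data", "creator": "Example Lab"},
    indexed_in=["example-repository"],
    access_protocol="https",
    license="restricted-use agreement",
    provenance="Collected 2017; cleaned with an in-house script",
)
print(fair_gaps(restricted))       # gaps for the restricted-license record
print(fair_gaps(DatasetRecord()))  # gaps for an empty record
```

In this sketch the restricted record passes every check even though its license is not an open one, which is consistent with the distinction drawn in the text that FAIR data are not necessarily open.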

MOTIVATIONS FOR OPEN SCIENCE

A vision of open science is unfolding in research communities across a wide range of scientific domains, driven by the expanding use of digital, easily shareable products of scientific research. These products range from publications to software used to produce results; from raw and/or processed data associated with research to digitized representations of physical artifacts. The rationale for opening the methods and outcomes of research is strong, multifold, and increasingly accepted by scientific, engineering, and biomedical investigators.

Published science has traditionally operated as a form of open or partially open commons or common-pool resource, subject to legal frameworks such as intellectual property rights and with a few exceptions such as those for proprietary research and research related to national security (Hess and Ostrom, 2003). Intellectual property issues are covered in Chapter 5. Researchers publish their work if they want to get credit and recognition, which sustains and advances their careers. Advances in information technology are greatly expanding the possibilities for using this resource. To the extent that science becomes more open and accessible, there should be more rapid and efficient progress in generating reliable knowledge. The more science is used, the more valuable it is. Individual researchers benefit as their own contributions become more widely known and recognized.

At the same time, there is a need to develop rules and norms to manage and cooperate in the use of this shared resource. What rules are needed to align the self-interests of the variety of stakeholders so that they contribute to the larger vision and realize the advantages of open science? Are specific efforts needed to ensure that the open science enterprise remains sustainable: that efforts to feed and replenish the commons run ahead of efforts to exploit it? What does sustainability mean in different national and disciplinary contexts? The economic analysis of open source software provides some insight on how communities can come together to create and sustain shared resources (Lerner and Tirole, 2000).

This section describes the motivations for open science as well as the benefits: both those that are being realized today and those that can currently be anticipated. Chapter 3 includes more detailed descriptions of approaches to open science that are being taken in several different disciplines and their benefits. These benefits include enhancing the ability of the general public to access knowledge generated through publicly supported research, strengthening the reliability and efficiency of research, enabling researchers to address new questions, including those that cross disciplinary boundaries, and allowing a broader group of scientists to participate in the research enterprise on a global basis. The following section describes various barriers and limitations to wide implementation of open science.

Certainly, given the fact that the research enterprise as a whole is some distance from fully realizing open science, and since many of the benefits have yet to be realized, they are difficult to quantify. To that extent, this discussion is forward-looking. Many important transformations and innovations in the history of

science, and in history more broadly, have been opposed at first because of difficulty in quantifying or even imagining the benefits. For example, much of the biomedical research community was strongly opposed to the Human Genome Project when it was first proposed, believing that it diverted resources from more valuable investigator-driven work (Palca, 1992). The project and its impact look much different in hindsight. Today’s advances in biomedical research, and many other fields such as archaeology, would not be imaginable without genomic mapping and analysis.

While there are undeniably significant costs associated with implementing policies and practices that support open science, realizing the benefits discussed in this section translates into a higher return on the investment of financial and human resources in research activity. Likewise, downstream societal benefits of research such as improved medical treatments and economically valuable technological advances can be realized more quickly and efficiently.

Ensuring the Reliability of Knowledge and Facilitating the Reproducibility of Results

Ensuring the reliability of knowledge and reported results constitutes the heart of science and the scientific method. Experimental research progresses by testing and refining hypotheses and building understanding based on the accumulated evidence. Throughout the history of science, there are examples of widely accepted hypotheses being superseded or overturned due to failures to reproduce or replicate findings. Recent concerns about reproducibility and replicability in science emerged first in fields such as biomedical research and social psychology, but have become a broader issue in science (Economist, 2013).

In recent years, a number of efforts to reproduce or replicate published results have been undertaken. Several efforts in biomedical research found rates of reproducibility of fifty percent or lower (Begley and Ellis, 2012; Prinz et al., 2012). In 2015, the Open Science Collaboration (OSC) attempted to replicate 100 psychological studies published in leading journals (Nosek, 2015). Although 97 percent of the original studies had statistically significant results, OSC researchers were only able to replicate 39 percent of the findings. Camerer et al. (2016) replicated 18 laboratory experiments in economics and confirmed over 60 percent of the published findings. However, Chang and Li (2015) could only replicate half of the results in published economics journals using author-provided code and data because many journal data archives did not have the code and data.

John Ioannidis has highlighted issues such as underpowered studies, flexibility in study design and analysis, and publishing bias that favors articles reporting positive results as causes of irreproducibility (Ioannidis, 2005). Other causes include the use of underperforming computational tools in data analysis and cross contamination or misidentification of cell lines in biological research (Offord, 2018; Huang et al., 2017). Outright fabrication or falsification of data is also a cause of lack of reproducibility. Although there is not enough information available to estimate the percentage of published work that is fabricated or falsified,

there has been a steady stream of high-profile cases from countries around the world, and several examples of researchers in fields such as anesthesiology who have built entire careers on fraudulent work spanning 100 or more articles (NASEM, 2017b). While some level of irreproducibility is normal in research, the inability to replicate a very high percentage of scientific findings undermines the credibility of science (Wykstra, 2017).

How does open science relate to concerns about reproducibility? Certainly, open science in the form of open publication, open data, and open code supports the ability of researchers to confirm and reproduce findings. Ensuring openness and access facilitates better quality research through prevention of mistakes and more rapid and efficient discovery and correction of mistakes that do occur. Once it becomes common practice for significant and relevant portions of digital representations of scientific results to be open and shared, one can anticipate more care and attention will be paid to the process of preparing and producing the results (including their documentation) so that others can follow the process in more depth than was possible previously. Expectations and requirements for openness also allow for a more rapid discovery of fabrication and falsification of data, serving as deterrents to misconduct (NASEM, 2017b). In short, open science strengthens the self-correcting mechanisms inherent in the research enterprise.

Greater transparency is a major focus of those working to increase reproducibility and replicability in science (e.g., Munafò et al., 2017). The Reproducibility Initiative, launched in 2012 by Science Exchange, PLOS, Figshare, and Mendeley, identifies and rewards high-quality reproducible research through validation of critical research findings (Science Exchange, 2018). Recent concerns over reproducibility have served to reinforce and catalyze progress toward open science in the form of new policies and practices adopted by research funders, research institutions, and publishers, as will be explored in more detail below.

Yet open science is not the only factor or solution to addressing the reproducibility issue, and open science will not automatically solve whatever problems there are. It should also be noted that some have questioned whether reproducibility is a significant issue for science (Fanelli, 2018). As this report was being completed in 2018, the National Academies of Sciences, Engineering, and Medicine was undertaking a study on reproducibility and replicability of research that “will draw conclusions and make recommendations for improving rigor and transparency in scientific and engineering research and will identify and highlight compelling examples of good practices” (NASEM, 2018b).

Faster, More Creative, and More Efficient Knowledge Creation

In addition to improving the reliability and reproducibility of research, open science can aid the advance of knowledge in several other ways. First, open science can accelerate progress by making research more efficient. When scientific results are made openly available in digital form, they enable faster, deeper, and broader dissemination of the results to other researchers. Wider sharing and collaboration allow research communities to quickly access results and underlying

information, which, in turn, stimulates more, and more rapid, scientific discovery. New networking tools hold out the possibility of marshalling large collaborations of researchers who will be able to tackle problems more quickly and effectively than is feasible today (Nielsen, 2011). When data resulting from clinical research on humans and on animals is reused, it maximizes the value of the contributions made by those research subjects to the advance of knowledge. It is important to note that sharing and reuse of data vary widely between disciplines. As will be explored in more detail in Chapter 3, significant data resources have been created in genomics and astronomy that demonstrate the value and logic of data sharing and reuse. In other domains, particularly those where the culture of sharing and reuse has not taken hold, benefits are not being realized (Wallis et al., 2013).

Second, open science enables researchers to ask and address entirely new sorts of questions. Semantically linked, machine-readable data can be analyzed by computers in order to reveal relationships within and between systems that would be impossible to discover otherwise (Science International, 2015). The potential for data from different disciplines being linked in this way and queried to understand complex phenomena and systems is particularly exciting. Increasingly, addressing complex problems of interest in science and society requires a multitude of methods and scientific results from different communities. This interdisciplinary work will be greatly aided by open, searchable, digital results that are made more available across communities. Without such interdisciplinary exchanges, modern problem-solving is hindered by leaving knowledge to be in effect locked inside a particular community, even when most members of a given scientific society have free access to journals and digital artifacts in a particular field. Furthermore, as search engines are able to go beyond keywords to follow scientific arguments from one paper and even community to another, interdisciplinary science has the potential to be highly accelerated.

While the above discussion implies that many benefits of this sort of work will be reaped in the future, as open science practices become more widespread, some examples can be seen today. What is needed to address complex problems is the ability to find and integrate results not only within communities, but also across communities, without paywalls or subscription barriers. Utilizing advanced machine learning tools in analyzing datasets or literature, for example, will facilitate new insights and discoveries. Further, digital platforms for extending and repurposing scientific results and connecting them across multiple communities, as well as sophisticated search engines that can follow scientific arguments from one result to another, will need to be developed and made available. Making data available under FAIR principles is critical to facilitating this acceleration in knowledge creation. When data, software, algorithms, and other digital artifacts of the scientific process are made available and interoperable, they can more easily be reprocessed, modified, extended, or used for other purposes. For example, fields such as ecology and epidemiology combine disparate data from multiple sources to analyze phenomena such as oil spills and the spread of disease (Pasquetto et al., 2017).

What evidence is there that open science will deliver these benefits? Economists have studied the knowledge production process at a broad level and largely concluded that open science promotes knowledge discovery and better science. For example, Mukherjee and Stern (2009) developed an overlapping generations model that elucidates the tradeoff between secrecy and disclosure. Secrecy yields private returns whereas the private and social returns to disclosure and the benefits of open science depend on the use of scientific discovery by subsequent generations. The model shows that open science is associated with a higher level of social welfare. Another study examined the relationship between the innovative performance of biotechnology firms and their activity in academic publishing, and found that open science strategies had a positive impact on innovation (Jong and Slavova, 2014).

Economists have also studied the returns to open science in the context of publications and patents. Publications promote open science whereas intellectual property rights assigned by patents exchange public disclosure of an invention for the right of the inventor to exclusively exploit the invention for a limited time. (Chapter 5 further explores intellectual property issues related to open science.) Researchers have examined whether there is a trade-off between patenting inventions and publishing results, and found that these research activities are complements instead of substitutes (Stephan et al., 2007; Fabrizio and DiMinin, 2008; Azoulay et al., 2009). However, Murray and Stern (2007) and Fehder et al. (2014) identified publication-patent pairs and examined the impact of patenting on subsequent research. Publications appear before the patent is granted, and citations to the publication could potentially change once intellectual property rights were assigned. They found that papers were less likely to be cited after the patent was assigned, suggesting that patenting may close off inquiry and reduce knowledge creation in areas related to the patented invention. Aghion et al. (2010, 2016) studied the impact of NIH agreements that increased academics’ access to patented, genetically engineered mice. They found that increased openness, measured by access to mice, prompted entry by new researchers and increased the diversity of research topics. They concluded that intellectual property rights decrease research interest and diversity. Williams (2013) examined the effect of Celera’s patents on human genes on subsequent research and innovation. She found that patenting reduced research and innovation related to the patented genes by 20 to 30 percent. The topic of how proprietary concerns may act as a barrier to openness is discussed below.

Researchers have also examined the impact of online access and open publication of scholarship on the number of citations. Online access to articles via subscription reduces search costs and likely increases citations, but the citation impact may be conflated with the quality of the journal. Evans and Reimer (2009) found that open publication increased citations to multidisciplinary journals by 20 percent. However, McCabe and Snyder (2015) showed that this estimated increase resulted from a specification error and disappeared when time effects were included in the model. They concluded that the citation benefit of open publication
in the previous literature was attributable to omitted variable bias from not controlling for journal quality. McCabe and Snyder (2015) found that JSTOR (an article repository) increased citations to economics and business journals by about 10 percent, but Elsevier’s Science Direct appeared to provide no citation boost. Both JSTOR and Science Direct provide online access but are subscription-based, not open. McCabe and Snyder (2014) found that open publication increased citations to science journals by about 8 percent. Eysenbach (2006) demonstrates that open articles have higher citations in PNAS than subscription access articles. Gaule and Maystre (2011) revisited this question and found no significant citation effect. Davis et al. (2008) and Davis (2010, 2011) conducted an experiment where submissions to 11 American Physiological Society journals were randomly assigned to open publication or subscription access. They found that open articles were more likely to be downloaded but received the same number of citations as subscription access articles one and three years after publication. McCabe (2013) concluded that the citation impact of open publication may have been overestimated by open access supporters. On the other hand, Wagner (2014) summarized a large, annotated bibliography on the topic with the conclusion that open access articles have a persistent citation advantage that varies by discipline.

How can we reconcile the findings of Aghion et al. (2010) and Williams (2013), which show that intellectual property rights were associated with less diversity in science, with the conclusions of Davis et al. (2008) and McCabe and Snyder (2015), which found limited impact of online and open publication on citations? First, genetically engineered mice and genetic tests patented by Celera are high-impact scientific discoveries. Limiting access to these discoveries closed down some productive avenues of inquiry. However, not all published articles are of the same quality. McCabe and Snyder (2013, 2014) found that open publication increased citations to the highest quality articles and decreased citations to the least-cited articles.

Expanding Access to Knowledge and to the Research Enterprise

Open science also expands access to knowledge and to the research process itself. One important justification for expanded access is the public support for a large portion of the research activity that leads to reported results. The federal government invested $121 billion in research and development (R&D) spending in fiscal year 2015. About $34 billion of the total is allocated to university R&D, resulting in datasets, publications, and other outputs (Rosenbloom et al., 2015; Edwards, 2017; NSB, 2018). Federal spending on intramural research totaled about $36 billion in 2015 (NSB, 2018). Over the past several decades, the belief that knowledge whose creation has been supported by the public should be accessible to the public has gained considerable ground. For example, disease advocacy organizations and consumer groups played an important role in support of NIH’s policy of requiring that publications based on NIH-funded work be made available to the public following an embargo period (Albert, 2006). As will be explored
in more detail below, support for open science is growing among researchers, although attitudes are ambiguous (Odell et al., 2017).

In 1997, the National Research Council recommended that:

Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research. The public-good interests in the full and open access to and use of scientific data need to be balanced against legitimate concerns for the protection of national security, individual privacy, and intellectual property (NRC, 1997).

The proposition that research data created through public funding should be publicly accessible as a default position has been advocated as an international standard. According to Science International, “if this social revolution in science is to be achieved, it is not only a matter of making data that underpin a scientific claim intelligently open, but also of having a default position of openness for publicly funded data in general” (Science International, 2015).

The strongest early practical rationale for this position came from biomedical research; the idea was that the public should be able to see and utilize the latest research relevant to promoting health and curing disease. This rationale spurred policy makers to support the development of the National Library of Medicine’s PubMed interface to MEDLINE, NLM’s database of citations to the literature, in the 1990s and to PubMed Central, NLM’s full text article repository, in the 2000s (Varmus, 2009). Knowledge of biomedical research has helped communities facing health crises, such as AIDS activists, to better pursue their goals (NASEM, 2016). Health literacy and broader science literacy can help individuals, communities, and entire societies to benefit from research in areas such as popular epidemiology and participatory environmental monitoring (NASEM, 2016).

Open science may also contribute to a democratization of knowledge and a better informed citizenry (Arza and Fressoli, 2017). The proposition that scientific knowledge is a global public good raises an international dimension to this particular benefit of open science (NRC, 1997; Science International, 2015). Expanded international use of publicly funded research may deliver positive benefits without disadvantaging the researchers who originally performed it or the national government that supported it. Developing country researchers are often enthusiastic users of open science resources (Swan, 2012). An estimated 80 percent of active journals in Latin America are open access (Science International, 2015). There are several open data initiatives in Africa, including the African Open Science Platform, which aims to “promote the development and coordination of data policies, data training and data infrastructure” across the continent (CODATA, 2016).

It may also be the case that the impacts of data-enabled science and technology on individuals and societies are so profound and potentially disruptive that deeper engagement with society is necessary both in solving existing problems
and legitimating emerging technologies (NASEM, 2017a). One-way communication of science to society is not enough. In many domains, science needs actively to engage with other societal actors as knowledge partners in jointly framing questions and jointly seeking solutions. The unprecedented ubiquity and diversity in modes of modern digital communication lend themselves to this task.

An additional reason for supporting broader access to scientific knowledge and the research process is that this access may speed scientific progress. The involvement of the broader public in the research enterprise, which is also called citizen science, has become more prominent in recent years, largely due to the progress of digital technologies and open science practices (Smith et al., 2017). For example, Zooniverse is a citizen science web portal that hosts projects in which volunteers assist professional researchers (zooniverse.org). There are many examples of citizen contributions to research in areas such as data gathering and environmental monitoring (Arza and Fressoli, 2017).

Although the benefits of open science are increasingly being realized and recognized, there are significant barriers to a research enterprise and environment where access to research products is routinely expected. These barriers as well as approaches to overcoming them will be discussed in the next section.

BARRIERS TO OPEN SCIENCE

Some barriers to open access to research products may be addressed through the development of new tools and institutions. While some barriers can only be lowered through thoughtful changes in the policies and practices of research enterprise stakeholders, others are interrelated in complex ways. Some barriers are more relevant to one component of open science than to others (i.e., open publications, open data, or open code). This section will provide an overview of the major barriers, including information on how difficult change is likely to be.

Economic Barriers

Some of the most challenging barriers to open science are the incentives of market participants and the structure of the market for scholarly communication, particularly in the area of open publication. The scientific article, which is peer reviewed and compiled with other articles within a journal, is the traditional approach to disseminating new research. Scientific journals emerged during the 17th century (Fyfe et al., 2015). Traditionally, journals have been distributed to institutions (e.g., university libraries) and individuals via subscription. Since World War II, there has been a global expansion of research activity, leading to rapid growth in the number of articles published.

Publishers perform many important functions as a key component of the research enterprise. These functions include organizing the peer review process; developing and implementing policies in areas such as responsible conduct of research; addressing authorship problems; performing an array of technical tasks such as format migrations; and managing relations with authors, vendors, and the
media (Anderson, 2016). Journal publishers also maintain the information technology infrastructure that supports and controls access to content as well as the development of new infrastructure and platforms.

Publishers of scientific journals have included a range of for-profit and nonprofit entities, many of the latter being scientific societies. Robert Maxwell’s UK-based Pergamon Press worked to make journal publishing a profitable business starting in the 1950s by launching new journals and recruiting top scientists to edit and contribute to them (Buranyi, 2017). Pergamon and other commercial publishers also took on the task of publishing the journals owned by some scientific societies. Profits increased with the number of journals, as libraries would simply add new journals requested by faculty to their subscription lists. From the 1970s on, scientists began to pay more attention to the prestige and visibility of the journals in which they published. The advent of the journal impact factor, described in more detail below, contributed to this focus on prestige. Publishing in a “high-impact” journal came to be seen as essential to career progress in many fields (Buranyi, 2017). Annual subscription prices rose as well.

The 1990s brought a wave of consolidation among scientific publishers, as Netherlands-based Elsevier acquired Pergamon, leaving it in control of over 1,000 journals (Buranyi, 2017). Further increases in subscription prices and the advent of “big deal” agreements between publishers and libraries followed in the late 1990s. Under these agreements, publishers agree to provide online access to a bundle of their journals, including all back issues, priced at a discount to the sum of the individual journal subscriptions (Bergstrom et al., 2014). Despite paying lower per journal prices, total outlays by libraries increased to the point where this has been called the “serials crisis” (Panitch and Michalak, 2005). In 2015, Larivière et al. found that the five most prolific publishers, including Reed-Elsevier, Taylor & Francis, Wiley-Blackwell, Springer, and Sage, control over one-half of the scientific journal market, and that the profit margins of these companies have been in the range of 25 to 40 percent in recent years (Larivière et al., 2015). According to one economist who studies the industry, this situation “demonstrates a lack of competitive pressure in this industry, leading to so high profit levels of the leading publishers that they have not yet felt a strong need to change the way they operate” (Björk, 2017a).

Unlike some other intellectual property-based businesses such as recorded music, the incumbent firms in commercial scientific publishing have been able to navigate technological and other changes while maintaining a profitable business model based largely on subscription revenue. In contrast to music or other parts of commercial publishing, where firms pay creators for content, authors of research articles are not paid by the publishers. Research is supported by public and private funders and by the performing institutions.

Nonprofit publishers also occupy an important place in the scholarly communications ecosystem. The most prominent of these are scientific society publishers, although university presses and other nonprofit organizations, such as the Public Library of Science (PLOS, described in more detail in Chapter 3), also participate. Publishing has long been a core activity of many societies. The size
and relative importance of society publishers varies considerably by discipline and according to the specific society in question. For example, the American Chemical Society publishes 50 peer-reviewed journals and is one of the top five publishers of articles in chemistry (ACS, 2018; Larivière et al., 2015). By contrast, in the social and behavioral sciences, society publishers play a smaller role in overall scholarly communication than in disciplines such as physics and chemistry (Larivière et al., 2015).

Society publishers undertake publishing activities as part of their overall mission of providing service to their members and disciplines. They have traditionally used a business model centered on subscription income. For some societies, publishing operations generate a surplus that they use to subsidize other activities, such as education programs or meetings (Collins et al., 2013). Available information indicates that there is a considerable variation among disciplines and individual societies regarding the size of the surplus (if any) generated by publishing and the extent of the society’s dependence on that income. For example, in 2011 subscriptions and manuscript charges accounted for 53 percent of the revenues of the Ecological Society of America and journal publication accounted for 43 percent of expenses, with society revenue and expenses each totaling over $6 million (Collins et al., 2013).

Over the past several decades, as technological change has transformed scientific publishing and for-profit publishers have increased their overall share, society publishers have faced the challenge of investing in digital production and distribution systems and responding to changes in markets and author preferences. For example, in the life sciences, where the number of journals offered by for-profit publishers has increased rapidly, some society journals have faced increased competition for manuscripts. Whereas 20 years ago an author whose manuscript was rejected by, say, Nature might then submit it to a society journal, today the author is more likely to submit to Nature Microbiology or another disciplinary journal offered by a for-profit publisher (Schloss et al., 2017). Some societies have entered into partnerships with for-profit publishers, in which the company performs most non-editorial functions and includes the society’s journals in its own subscription bundles, paying the society a fee in return. The American Geophysical Union’s partnership with Wiley-Blackwell is a good example (AGU, 2012).

Competition from self-publication and open science has not seriously affected the market share of commercial and nonprofit publishers of high-prestige journals. Exploring the incentives of stakeholders gives some insight into why this may be the case:

• Researchers: Researchers have the incentive to maximize the visibility of each scientific discovery. These incentives are reinforced by the academic promotion and tenure processes at universities and by funders. Promotion and tenure requirements incentivize researchers to maximize the prestige of the journal in which their papers are published. Funders also require proposals to include publications, and journal impact factors are used as
proxies for the quality of science (Ginther et al., 2018). Researchers both consume and produce scholarship. Researchers prefer to read and cite high-quality work (McCabe, 2013). Researchers have no market power when it comes to publishing their research, and they prefer to publish work in a widely read journal. Researchers provide free labor to journals in addition to production of research articles in the form of editing and peer review (Bergstrom, 2001). Researchers also do not typically bear the costs of subscribing to journals if they are affiliated with an institution. Finally, researchers may bear the cost of open publication through article processing charges, while publishing an article in a traditional subscription journal is generally without cost to the researcher. Of course, researchers who are working at institutions that cannot afford subscription fees and cannot themselves afford to pay the article processing charges levied by open publication journals do not enjoy legal access to the system. To reduce the knowledge gap across the globe, Research4Life, a public-private partnership of international organizations, universities, and 175 international publishers, provides developing countries with affordable access to research and scholarly information (Research4Life, 2018).
• Universities: Universities seek to maximize the visibility and productivity of their faculty. Because university administrators and tenure review committees may not be subject matter experts, they rely on signals of quality for their research faculty. These include the number of publications, the prestige of the journals where faculty publish, and their success in research funding. All of these outcomes are linked to scholarly publication. Universities also purchase journals for their students and faculty at fees increasing faster than the rate of inflation, especially from commercial publishers (Bergstrom et al., 2014).
• Research funders: Federal research funders are held accountable by Congress. The peer review process is designed to allocate funding to the “best” science. Past accomplishments in terms of the prestige of publishing venues are used to forecast whether the current research proposal is of sufficient quality to be funded. Thus, research funders also use journal publications as proxies for quality (Ginther et al., 2018).
• Scientific societies and other nonprofit publishers: Scientific societies promote the scholarship of their disciplines for their members. They typically publish journals, and journal revenues may in turn support the activities of the association (Willinsky, 2004). Other nonprofit publishers such as university presses also seek to maximize the readership of their journals and cover their costs via subscription fees. Publishers pursuing open access business models are discussed in more detail in Chapter 3.
• Commercial publishers: Typically, publishers bundle journal subscriptions as a way of cross-subsidizing lesser journals by including high-profile journals in the bundle.

Given these incentive structures, it becomes easier to understand the market structure of scholarly publication. Economists have studied the scholarly communication market structure in order to understand why for-profit publishers continue to have market-pricing power in the face of competition from self-publication and open access journals. Furthermore, while there are significant “first copy” costs, the marginal cost of providing online access to journal content is essentially zero. This situation persists because many of the incentives of researchers, universities, and funders create a powerful motivation to leave the current system in place: when the contribution of an idea is difficult to measure, institutions use signals of quality (e.g., citations, prestige of the journal) to infer quality (Bergstrom, 2001).

Varian (1994) argued that marginal cost pricing is not profit-maximizing for information goods such as scholarly publications. Thus, publishers have an incentive to engage in first-degree price discrimination, where they sell the same bundle of journals at different prices to different consumers. Bergstrom et al. (2014) examined the prices paid by public university libraries for “big deal” journal bundles from commercial and nonprofit publishers. They found significant price discrimination by commercial publishers according to the research-intensiveness of the university, and a lesser amount of price discrimination by nonprofit publishers.

The “big deal” pricing strategies of journal publishers have played a major role in shaping the market for research journals. First, publishers recognized that demand for the journals was inelastic and priced subscriptions to maximize rents. Second, the shift from a physical journal to online access meant that libraries effectively “rented” access to the current journal as well as the older volumes of the journal. “Big deal” bundle pricing may have also made it difficult for new journals to enter the market given that university library budgets were being squeezed (McCabe, 2013). McCabe (2013) argued that the cost pressures on libraries associated with “big deal” pricing led to the open access business model. This business model shifts the costs from subscribers (university libraries) onto the researchers. The Public Library of Science (PLOS, the largest and most highly cited open access journal publisher) charges publication fees ranging from $1,595 for PLOS ONE to $3,000 for PLOS Biology (PLOS, 2018). McCabe, Snyder and Fagin (2013) argue that the current pricing structure of open access journals may dissuade publication. The higher publication fees distort the market, leading to fewer submissions and potentially reducing the volume of publications. Further, Poynder (2018) argues that national open access “big deals” of the type that publishers conclude with higher education bodies in some European countries allow publishers to protect their market positions. These agreements combine subscription fees with discounts on the APCs paid to the journals by researchers at institutions covered by the agreement. One important aspect of these and other large subscription agreements is that they generally include non-disclosure agreements, so that purchasing organizations are not able to discern the prices that others are paying.

In response to competition from open access journals, some subscription-based publishers are offering a hybrid open access model, where authors can pay
42 Open Science by Design: Realizing a Vision for 21st Century Research a publication fee and the article is freely available. Mueller-Langer and Watt (2014) examined the impact of hybrid open access (HOA) pilot agreements be- tween commercial publishers and the University of California system, the Uni- versities of Hong Kong and Goettingen, all universities in the Netherlands, and the Max Planck Institutes. They found that HOA has no significant impact on citations after controlling for institution quality and citations to preprint versions of the article. Society publishers are also responding to these trends. As discussed above, the size and importance of publishing activities varies by discipline and society. Societies have adopted new policies and expressed varying perspectives on trends in scholarly communication and open publication in particular. Some societies with large publishing operations have adapted their approaches to the movement toward open publication. For example, ACS offers a range of HOA (hybrid open access) options for authors, with the APC to be charged varying according to the license desired, the length of the embargo period to be followed, whether ACS is responsible for depositing the final published article in a designated repository or whether the author is responsible for depositing the accepted manuscript, and so forth (ACS, 2018). ACS has also launched its own open access journal and a pre- print service. Society publishers have expressed a range of perspectives in their public statements and policy positions as well. They are generally supportive of open publication in principle, but are skeptical about the imposition of funder mandates that require gold open access at the time of publication, or green open access with embargo periods of less than one year (Collins et al., 2013). 
The American Physical Society “supports the principles of Open Access to the maximum extent possible that allows the Society to maintain peer-reviewed high-quality journals, secure archiving, and the Society's long-term financial stability, to the benefit of the scientific enterprise” (APS, 2009).

It is important to remember that scholarly communication involves real costs, and that the current state of the subscription journals market is the result of choices made by publishers, institutions, researchers, and funders over many years. Some experts argue that moving away from traditional publishers operating on a subscription model would entail forgoing the benefits of significant investments in digital infrastructure that publishers are making, and would constitute a short-sighted “race to the bottom” (Anderson, 2018). As noted above, journal revenues play an important role in supporting the programs and activities of scientific societies that advance individual disciplines and science as a whole. Some pathways to open publication, such as mandates that specify immediate gold open access or eliminate embargo periods for green open access, would be problematic for many societies and their ability to sustain their professional infrastructure.

Yet the issue is complex. Some might question why research library budgets that have been under considerable pressure should be expected to generate surplus funds to support the professional activities of societies. Others are more skeptical about the ultimate value provided by commercial publishers in particular, given their large profit margins (discussed above), arguing that they benefit from publishing research that is funded by other sources, and that writing, reviewing, and some portion of editing tasks are performed by volunteers (Conley and Wooders, 2009). Publishing journals as a profit-maximizing business is certainly as legitimate as it is for other distributors of digital content based on intellectual property protections. The research enterprise and its stakeholders are responsible for the future of scholarly communication. Chapters 5 and 6 will cover the issues and choices facing the research enterprise in moving forward.

Academic Culture and Misaligned Incentives

One important set of barriers to open science springs from the fact that many of the benefits redound to research communities and the broader research enterprise itself, yet researchers are recognized and rewarded largely based on their individual production and accomplishments. The culture of open science is seen as being about advancing the public interest: when research products are broadly available and discoverable, they benefit more people and drive more innovation than when they are not. Research also has some characteristics of a public good in economic terms, in that use by one individual does not reduce availability to others. However, researchers can be excluded from using publications and other research products.

Getting Scooped

Barriers related to culture and incentives operate at several levels. At one level, researchers might be concerned about being “scooped” by other researchers if data are shared openly and reused by others before the researchers who generated them are able to fully exploit them in multiple publications (EC, 2018b).
In some fields and disciplines, particularly those where acquiring data involves considerable effort or expense, such as collecting specimens from remote areas, or undertaking epidemiological studies that require a number of complicated steps, delays in sharing data underlying the first publications may be an accepted practice (Pearce and Smith, 2011). Whether or not the risk of being scooped is overstated, some adjustments in rewards and expectations may be necessary to address this concern in the fields where it exists in order to facilitate more rapid and complete data sharing. For example, institutions and disciplines might work to ensure that the first person to share research outputs receives appropriate credit, and that researchers who generate valuable and widely reused datasets receive proper attribution. Ultimately, the solution to ensuring that data are shared quickly and lessening the perceived need for delays motivated by career interests is ensuring that those who create valuable data are recognized and rewarded, but restructuring reward systems is not straightforward or easy. The rationale that sharing data quickly will deliver public health benefits and perhaps even save lives may not win out over the desire to hold data closely in order to ensure that one’s postdocs and graduate students are able to author publishable work based on these data. Note also that the same rules should apply to all as efforts are made to appropriately reward data creation and sharing. If some researchers practice open science and others do not, the ones who do not may enjoy competitive advantage. When funders and other stakeholders require openness of publications and data as a consequence of receiving funding, a more level playing field can be created.

Exposure of Errors

Another concern that might make researchers reluctant to share data and methods is that such sharing would expose their errors to the community. New research workflows in which reporting results and sharing research products takes place within a process where community review helps to uncover error will improve the reliability of results, as described above. Preregistration of studies can help to uncover mistakes in analytical approaches before data are collected. Journals such as PeerJ and Open Science, the latter published by the Royal Society, have instituted open peer review, another mechanism aimed at improving the quality of research (McKiernan et al., 2016). It may take time for research communities to transition to open practices that enable wider review and scrutiny of research. Psychology is a current encouraging example. Concerns about reproducibility led many inside and outside the field to critically examine practices and standards, and new open practices such as preregistration and replication studies are being tried and refined (Winerman, 2017).

At the same time, some experts have raised concerns in recent years about the nature of scientific disputes in the context of changing standards related to transparency or reproducibility. The rise of blogs, social media, and venues for post-publication comment and review has greatly expanded opportunities to correct, criticize, raise questions, and make accusations against researchers, often anonymously (NASEM, 2017b).
Disciplines where standards and practices are being reexamined, such as psychology, have seen intense disputes over the validity of widely heralded results as well as over the tone and personal nature of the critiques. While some prominent leaders in the discipline have identified the harsh nature of criticism itself as a significant issue, others argue that raising concerns over tone diverts attention and focus away from the substance of critiques (Singal, 2016). It is important for errors or misconduct to be identified and corrected; it is also important that small errors or legitimate differences in analytical choices not be cast as malfeasance. In order to maximize the value of greater openness and transparency, disciplines and the research enterprise itself may need to devote some attention to developing new norms around the pursuit of accuracy and related issues (Gelman, 2018).

Career Considerations

In addition to concerns arising from relatively short-term potential impacts of sharing specific research products, longer-term career considerations may also explain reluctance on the part of some researchers to adopt open practices.

Achieving the vision of open science requires scientists to make results publicly accessible and to engage in sharing data with the community as an expected practice. Researchers are motivated by the possibility of gaining career advancement, support, and recognition for their work in addition to curiosity and the desire to advance their fields (EC, 2017b). Career prospects in science are increasingly challenging, especially for early-career researchers, because of the scarcity of permanent academic positions and the difficulty of getting funded (Stephan, 2012a). Individual researchers may not perceive that taking the steps necessary to make their own work accessible will be in their best interests. Data sharing requires a focus on data preparation and infrastructure for stewardship, preservation, and broad use. In the absence of clear requirements to do so, scientists who take the time to make sure that software is robust, data are sufficiently described, and data stewardship and preservation meet good practice and community standards may not be rewarded by higher education institutions (e.g., through promotion and tenure or infrastructure support) or recognized within their disciplines. Preparing data and code for deposit involves considerable time costs. Researchers may suffer if they prioritize their open science work that benefits the community at the expense of publishing more journal articles.

Some aspects of current research evaluation practices may contribute to concerns about how openness and open practices affect the career prospects of researchers. The most salient issue is the importance of bibliometric indicators such as the Journal Impact Factor (JIF) in evaluating research and researchers (San Francisco Declaration on Research Assessment, DORA, 2013; Casadevall and Fang, 2015).
Developed in the 1960s by the Institute for Scientific Information (and now a product of Clarivate Analytics), JIF measures the yearly average number of citations to recent articles in a particular journal (Cross, 2009). The ability to digitally index articles, which allows JIF and other indicators to be automatically tracked and calculated, has enabled the development and wide use of JIF and other bibliometric indicators.

The use of bibliometric indicators in research evaluation affects researcher rewards and incentives both directly (in hiring or promotion) and indirectly (as a factor in funding or publication decisions). It is widely perceived around the world that the JIF of the journals that researchers have published in plays an outsized role in hiring and promotion decisions in research institutions (Abbott et al., 2010; Casadevall and Fang, 2015). JIF was not developed as a tool for evaluating research or researchers, and there are numerous reasons why using it in this way is inappropriate. These reasons include: (1) citation distributions within journals are highly skewed, meaning that JIF may not accurately track the citation profile of individual articles; (2) there are wide differences between fields in typical citation patterns, so researchers in fields where influential articles may take several years to be heavily cited are disadvantaged; (3) JIF and other indicators can be gamed by journal editors, research institutions, and individual researchers; and (4) JIF is not transparent, as the data and methodologies underlying it are proprietary (DORA, 2013; Wilsdon et al., 2017).
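Reason (1) can be illustrated with a small numerical sketch. The citation counts below are hypothetical, chosen only to show how a single heavily cited article can pull a journal-level average far above what a typical article in the same journal receives; they are not drawn from any real journal, and the averaging here is a simplification of how JIF is actually compiled.

```python
from statistics import mean, median

# Hypothetical citation counts for ten recent articles in one journal.
# One article is heavily cited; most receive few citations.
citations = [0, 0, 1, 1, 2, 2, 3, 4, 7, 80]

# A JIF-style figure is an average over the journal's recent articles.
journal_average = mean(citations)

# The median describes what a typical article in the journal receives.
typical_article = median(citations)

# Share of articles cited less often than the journal-level average.
share_below_average = sum(c < journal_average for c in citations) / len(citations)

print(f"journal average: {journal_average}")
print(f"median article:  {typical_article}")
print(f"share of articles below the average: {share_below_average:.0%}")
```

In this toy example nine of the ten articles fall below the journal average, which is why a journal-level indicator can be a poor proxy for the citation record of any individual article or author.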

Some experts argue that the misuse of JIF and other bibliometric indicators may even cause broader harm to researchers and to the research enterprise itself. The contention is that apparent imbalances within some parts of the science and engineering workforce and low rates of success in research funding proposals to U.S. federal agencies have helped to create an environment of hypercompetition that discourages risk taking, shortchanges quality control, and dissuades researchers from sharing (Alberts et al., 2014; Fang and Casadevall, 2015; NASEM, 2017b; Stephan, 2012b). Such hypercompetition may directly discourage open practices such as sharing data and other research products if researchers are primarily concerned with maintaining an advantage. Vale and Hyman (2016) argue that the heightened competition between scientists in high-profile journals has strained the peer-review system; however, “the need for a system of validation has only become more pronounced as the volume of scientific work has increased” (p. 4). Researchers in a hypercompetitive environment might also prioritize publishing their work in journals with the highest possible JIFs, regardless of whether publication in such journals is consistent with making research products available under open principles. No researcher’s career has been harmed by publishing in high-impact journals.

Countervailing Factors and Efforts to Address Barriers Related to Culture and Incentives

All of the barriers to open science discussed above related to culture and incentives are likely higher and more challenging for early career researchers than they are for their senior colleagues (Eveleth, 2014; The Guardian, 2018). Although some of these barriers may take considerable time and effort to address, there are some encouraging signs of positive change.
First, the potential negative effects of open practices on careers, including anxieties about being “scooped,” may be shrinking over time as advantages become more apparent. As discussed above, open publication may confer an advantage in terms of citations (Hitchcock, 2018; Wang et al., 2015). This merits continued study. There is also evidence that media coverage and social media discussion of openly published research is greater than that for traditionally published work (Wang et al., 2015). Further, there are indications that JIFs of indexed open access journals may be increasing compared with those of traditional, subscription journals (McKiernan et al., 2017). Moreover, more subscription journals are allowing authors to deposit preprints or postprints that are openly available (sometimes in response to funder mandates) or offering an open publication option for purchase by the author. The benefits and downsides of these options are discussed in more detail below.

In addition to encouraging progress toward open practices within the context of conventional reward and incentive systems, the participants in the research enterprise can also take steps to change cultures and incentive systems in ways that explicitly encourage and reward open practices. For example, a number of prizes and funding programs launched in recent years have recognized and supported open science (McKiernan et al., 2017). Funder, institutional, and publisher policies mandating open practices also contribute to changing culture and incentives.

New efforts to publicly track the extent to which researchers follow open practices are also being developed. One well-known example is the initiative led by the Center for Open Science (COS) and several journals to assign badges to accompany published articles where authors have shared data or materials, or preregistered their studies (COS, 2018a). While this initiative has yielded encouraging results, further work is necessary to separate the impact of badges from other editorial changes supportive of open practices introduced at the same time, and to confirm other results of introducing badges (Kidwell et al., 2016; Bastian, 2017). At a broader level, funder and journal openness mandates may generate data that can be utilized by community compilation and reporting efforts aimed at improving transparency. For example, FDAAA Trials Tracker is a website launched in 2018 that gathers information on compliance with U.S. Food and Drug Administration requirements that all clinical trial results be reported and makes the information available in an accessible format (FDAAA Trials Tracker, 2018). Box 2-3 describes additional requirements related to open access to clinical studies.

Another approach is to modify researcher evaluation criteria and tools in order to avoid discouraging open practices or even to explicitly reward them. Preventing the misuse of JIF and other bibliometric indicators in the evaluation of research and researchers is one possible approach. The 2013 San Francisco Declaration on Research Assessment is one prominent effort that has gained many signatories among institutions, funders, and journals (DORA, 2013). The 2015 Leiden Manifesto for Research Metrics is a parallel effort (Hicks et al., 2015).
Both of these statements emphasize the importance of expert judgement in the evaluation process.

Efforts are also ongoing to take advantage of the capabilities of information technologies and the explosion of online interactions to develop new measures of research impact that would address some of the negative aspects of the JIF and enable a broader consideration of the value of articles and other research products. Taken together, these new measures have been labeled alternative metrics or altmetrics. For example, efforts are underway to develop substantially new citation-based indicators based on transparent metric calculations that are open to scientifically based oversight (Hutchins et al., 2016). Others are developing metrics that go beyond citation-based indicators, incorporating information on downloads, mentions on social media, and other online reader behavior (NISO, 2014; Howard, 2013). Developing new indicators to evaluate research and researchers and facilitating their use will require a better understanding of technical and institutional prerequisites to their use, such as standards for digital author identifiers, and how these might be put in place. Indeed, the open science movement itself can provide the impetus to the improvement and wide use of high-quality metrics, and these metrics can play an important role in recognizing and rewarding open practices (Wilsdon et al., 2017).

BOX 2-3 Clinical Research

Access to information about clinical studies is important to researchers, health care professionals, and patients. For many years, patients seeking information about clinical studies were dependent on their clinicians to know about and recommend relevant studies. While their clinicians might have been aware of the clinical trials being conducted at their own institutions, there was no easy way to find out whether there was a suitable study elsewhere, even at a neighboring institution. Patient advocacy groups and others argued that information about clinical trials should be readily available to members of the public and that such availability should be required by law.

At the same time, because clinical trials are the cornerstone of evidence-based practice, many investigators had called for better reporting of clinical trials research (Meinert, 1988; Haynes, 1998). Meta-analyses and systematic reviews depend on the most comprehensive information possible for making recommendations about changes in medical practice. One author (Chalmers, 1990) went so far as to say that it is “scientific misconduct” not to report the results of one’s research.

In late 1997, a section of the Food and Drug Administration (FDA) Modernization Act required the creation of a database of information about clinical trials (FDA, 1997). The law directed the Secretary of Health and Human Services through the National Institutes of Health (NIH) to establish, maintain, and operate a “registry of clinical trials (whether federally or privately funded) of experimental treatments for serious or life-threatening diseases and conditions.” The law required that for each clinical trial listed in the registry there be at least a description of the purpose of the experimental treatment, the eligibility criteria for participation in the trial, the location of the trial, and, most importantly for patients, a point of contact for enrollment.

Beginning in early 1998, a working group comprising members from the NIH and the FDA began planning the implementation of the registry, and the National Library of Medicine, which had extensive experience in developing biomedical databases, took on the task of developing what became known as ClinicalTrials.gov. Standard data elements, standard methods for labeling and transmitting the data, use of standard vocabularies, and use of standard web technologies all played a role in the design of the system. ClinicalTrials.gov was launched in February of 2000 (McCray, 2000; McCray and Ide, 2000). In addition to interactive searching, the data can be freely downloaded and reused according to specified terms and conditions. As of 2017, there are several hundred thousand trials from around the world registered in ClinicalTrials.gov and an increasing number of these include detailed results data.

The legislative requirements for making clinical trials data available were critical both for the original development of ClinicalTrials.gov as well as for its continued significant expansion and growth. The initial 1997 law was amended a decade later to require submission of not just a description of the protocol design and eligibility criteria, but also the results of completed trials

(FDAAA 801, 2007). The final rule for implementation of this amendment was issued in 2016 and includes guidance for assessing compliance. Perhaps equally important for the extraordinary growth of the database was a joint statement by the editors of prominent medical journals in 2004 (ICMJE, 2004) that advised authors of clinical trials reports that a condition for publication would be deposit in a public registry at the inception of the trial.

References

Chalmers, I. 1990. Underreporting research is scientific misconduct. The Journal of the American Medical Association 263(10):1405-1408.
FDA (Food and Drug Administration). 1997. Public Law 105-115, Nov. 21, 1997. Food and Drug Administration Modernization Act of 1997.
FDAAA (Food and Drug Administration Amendments Act) 801. 2007. Public Law 110-85, Sept. 27, 2007. Food and Drug Administration Amendments Act of 2007.
Federal Register. 2016. Clinical Trials Registration and Results Information Submission. 42 CFR Part 11. Docket Number NIH-2011-003. RIN 0925-AA55. National Institutes of Health, Department of Health and Human Services.
Haynes, B., and A. Haines. 1998. Barriers and bridges to evidence based clinical practice. The BMJ 317:273-276.
ICMJE (International Committee of Medical Journal Editors). 2004. Clinical Trial Registration: A Statement from the International Committee of Medical Journal Editors. Online. Available at http://www.icmje.org/news-and-editorials/clin_trial_sep2004.pdf. Accessed March 30, 2018.
IOM (Institute of Medicine). 2015. Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk. Washington, DC: The National Academies Press.
McCray, A. T. 2000. Better access to information about clinical trials. Annals of Internal Medicine 133(8):609-614.
McCray, A. T., and N. C. Ide. 2000. Design and implementation of a national clinical trials registry. Journal of the American Medical Informatics Association 7(3):313-323.
Meinert, C. L. 1988. Toward prospective registration of clinical trials. Controlled Clinical Trials 9:1-5.

Finally, broader efforts are underway to rethink research evaluation practices and develop new approaches that place less emphasis on JIF and other bibliometric indicators and more emphasis on other contributions of researchers, including adherence to open practices. For example, the Peer Reviewers’ Openness Initiative proposes that peer reviewers commit to withholding comprehensive review of submissions where data or materials are not openly available (Morey et al., 2016). A 2017 European Commission (EC) report describes a new approach to evaluating researchers and their career contributions where open practices are central (EC, 2017b). Some experts advocate a fundamental rethinking of approaches to peer review characterized by openness, with scholarly communications organized around network or library concepts rather than fixed journal articles (Kriegeskorte et al., 2012; Kennison and Norberg, 2015).

Privacy and Security Concerns

Privacy Concerns

As described above, open science is critical for addressing the reproducibility challenge in scientific research while facilitating future research that validates or builds on previous results. An unintended and potentially harmful consequence of publicly sharing research data, however, is the possible effect on privacy. Researchers have long recognized the privacy implications of publicly sharing research data, especially when such data involve human subjects, such as patients in a clinical trial. The tension between privacy protection and scientific openness is longstanding. For example, many studies in the area of public health pertain to health care records and medical history, which makes it extremely difficult, if not impossible, to maintain patient privacy while openly sharing all the information necessary to reproduce or replicate a published study (O’Neill et al., 2016).

Traditionally, researchers rely on anonymization, or “de-identification,” methods to strike a balance between open data and human subject privacy. The idea is that once all personally identifiable information has been removed from a published dataset, an individual would no longer be associated with any record in the dataset. Participants in research studies expect that the data collected about them will be handled with care and that, unless they have given explicit consent to have their personal information shared, their data will be safeguarded.
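The de-identification idea described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a compliant procedure: the field names, the three-item identifier list, and the sample records are all hypothetical. The sketch drops the identifier fields and then reports what share of the released records is still unique on the fields that remain, which is one simple way to check the assumption that removing identifiers breaks the link to individuals.

```python
from collections import Counter

# Hypothetical list of direct-identifier fields (an assumption for this sketch).
DIRECT_IDENTIFIERS = {"name", "phone", "medical_record_number"}

def suppress_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records with direct-identifier fields dropped."""
    return [{k: v for k, v in r.items() if k not in identifiers} for r in records]

def unique_fraction(records):
    """Share of records whose combination of field values occurs exactly once."""
    if not records:
        return 0.0
    keys = [tuple(sorted(r.items())) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)

# Hypothetical records; "zip" and "birth_year" stand in for retained attributes.
patients = [
    {"name": "A", "phone": "555-0100", "medical_record_number": 1, "zip": "02139", "birth_year": 1950},
    {"name": "B", "phone": "555-0101", "medical_record_number": 2, "zip": "02139", "birth_year": 1950},
    {"name": "C", "phone": "555-0102", "medical_record_number": 3, "zip": "02140", "birth_year": 1972},
    {"name": "D", "phone": "555-0103", "medical_record_number": 4, "zip": "02141", "birth_year": 1985},
]

released = suppress_identifiers(patients)
share_unique = unique_fraction(released)
print(f"{share_unique:.0%} of released records are unique on the retained fields")
```

In this toy dataset half of the released records remain unique after suppression, so anyone who knows a person's values for the retained fields from another source could still single out that person's record.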
The federal government has provided specific guidance through its HIPAA legislation, which provides standards for the electronic exchange, privacy, and security of health information.[2] The intent of the legislation is to safeguard personally identifiable information, known as PII. HIPAA’s “safe harbor” defines 18 specific attributes (e.g., name, phone number, medical record number) as “protected health information” in need of suppression (CDC, 2003).

[2] The Health Insurance Portability and Accountability Act of 1996 (HIPAA), Public Law 104-191. See https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations.

In recent years, however, it has become clear that even anonymized data can reveal private information about the human subjects. The key challenge here is that even attributes that are not labeled as personally identifiable may still contain sensitive information that can be linked to an individual, and that by linking those data to other publicly available resources, individuals can be re-identified (Sweeney, 1997, 2002, 2003, 2009; Malin and Sweeney, 2001). In a case study of a state-released dataset containing 2.8 million hospital records, investigators showed that even after removing from the dataset all information except the procedures received by a patient, the percentage of patients with a unique set of procedures is still 42.8 percent; in other words, as the investigators state, “an adversary would have about a 42.8 percent chance of linking the anesthesia record to the hospital database, thereby discovering the patient’s sensitive information” (O’Neill et al., 2016).

In August 2006, after AOL Research released 20 million search queries issued by its users (with no user identifier or personal information attached), a reporter from The New York Times was still able to locate an individual from the anonymized search records by cross-referencing the contents of the queries with phonebook listings (Barbaro and Zeller, 2006). Similarly, researchers were able to re-identify individuals in an anonymized version of Netflix’s movie preference database for a contest that challenged researchers to try to improve its recommendation engine. By comparing rental dates and ratings in the Netflix database with reviews posted on the Internet Movie Database, the researchers were able to discover individuals’ entire rental histories, potentially revealing sensitive information about them (Narayanan and Shmatikov, 2008). As a result of this re-identification, a class-action lawsuit was filed against Netflix, and, as part of the settlement, Netflix cancelled a second planned contest.

After making numerous attempts to develop better mechanisms for disassociating individuals from a published dataset (Sweeney, 2002; Machanavajjhala et al., 2006), researchers in the field of data privacy realized a fundamental issue with many of the then-existing techniques: these techniques rely on assumptions of “adversarial background knowledge,” i.e., the external sources of information an adversary has access to beyond the dataset being released. Examples of background knowledge include phonebook listings in the AOL example and hospital databases in the health care example. One can see that such background knowledge is plentiful and hard to enumerate in practice, leading to privacy violations even after anonymizing the data.

Recent advances in data privacy aim to address this issue by developing techniques that are agnostic to adversarial background knowledge. A notable example is the concept of differential privacy (Dwork, 2008), which provides a uniform privacy guarantee no matter what background knowledge an adversary possesses. A wide variety of techniques has been developed to achieve differential privacy, mostly by inserting random noise into the data being released or into the query answers being generated from the dataset. In spite of these advances, there are still significant challenges facing the wide adoption of differential privacy in the research community. A notable one is how to validate previous research results or establish new findings from data that have already been perturbed with random noise. While one might be tempted to simply rerun the original research workflow over the perturbed data, research has shown that doing so may lead to statistically invalid results that require complex, task-specific procedures to correct (Gaboardi et al., 2016; Rogers et al., 2016). As such, the proper balance between open data and privacy protection of human subjects is still a major ongoing challenge. Several repositories have been developed as emerging solutions to these issues, including Genotypes and Phenotypes (dbGAP) for genotype-phenotype relationships (Mailman et al., 2007; dbGap, 2018), the Yale University Open Data Access (YODA) project for clinical trials (The YODA Project, 2018), and the forthcoming Vivli platform for clinical research (Vivli, 2018). However, these repositories are expensive to set up and manage, and should be part of the infrastructure that is developed to support open science.

National Security Concerns

Openness of research results has been a source of tension in security research and practices for years. For example, the “export” of cryptographic technology was severely restricted in the United States until 1992, after such export control was already challenged by individual-level openness efforts such as PGP,[3] which was released in 1991. A key argument in discussions of the effect of openness on national security is that providing open access to data and methodology might have the unanticipated outcome of aiding malicious individuals and organizations. Specifically,

• Adversaries might use openly available data or methods to make the design and implementation of their attacks easier. For example, an adversary might directly adopt an open-source machine learning algorithm to bypass CAPTCHA challenges commonly used by web security applications.
• Adversaries might also leverage the openly shared knowledge of a mission-critical system to find bugs or vulnerabilities to defeat the system itself. For example, by examining publicly available data on the Supervisory Control and Data Acquisition system used by a power station, an adversary might be able to design more effective attacks on the power network.

Both of these points reflect long-standing debates in security-related research.
For the first concern, one can draw an analogy to the debate over whether researchers should be allowed to publish the computer security vulnerabilities they identify in, say, an encryption algorithm, or whether such flaws should be kept behind closed doors to prevent adversaries from taking advantage of them (Cavusoglu and Raghunathan, 2007). As in the computer security case, while there might be perceived costs from the adversarial usage of open data and methods, what outweighs such costs is the effect open data and methods have on informing and incentivizing defenders to strengthen their defenses (Pond, 2000), easing the design and implementation of defensive systems, and eventually ensuring progress in the research fields critical to national security. In other words, openness benefits both attacker and defender, and, arguably, the defender more than the attacker.

³ PGP (Pretty Good Privacy) is freely available software for the encryption of electronic mail and other data (Zimmerman, 1995).

The second concern reflects the debate between security-through-obscurity and security-by-design. The former tries to maintain security by hiding knowledge of the system design from attackers, with the premise that, without knowing how a system is designed, an adversary would not be able to effectively attack it. Security-by-design, on the other hand, recognizes that hiding system design from attackers rarely works in the long run, as an attacker can accumulate knowledge of the system design over time by using the system, observing its behavior, and other methods. Thus, security-by-design assumes the system design to be public knowledge, and aims to make the design inherently secure even when an adversary knows how it works. The progress of computer security research in the last few decades has repeatedly shown that security-by-design is the only viable long-term approach (Cavoukian and Chanliau, 2013).

Insufficient Infrastructure

Infrastructure provides the engine that supports the vision of open science. If articles, data, code, and other research products constitute the content that is to be available under FAIR principles, open science infrastructure consists of the tools and metadata through which research products are created, shared, and assessed, including “data about the research process itself, such as reference lists and funding information” (Peters, 2017).

As noted earlier, the foundation that enables open science is the spectacular improvement in the capacity and performance of information technologies that has occurred in accordance with Moore’s law and related formulations (Moore, 1965). For example, computing power has increased exponentially as chip densities have grown from a thousand transistors in 1970, to a million transistors in 1990, to a billion transistors in 2010.
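The transistor counts cited above imply a doubling time of roughly two years. The following is a minimal sketch of that arithmetic, assuming smooth exponential growth between the cited data points; the function name `doubling_time_years` is illustrative and does not come from the report.

```python
import math


def doubling_time_years(count_start, count_end, years_elapsed):
    """Years per doubling implied by exponential growth between two points."""
    # Number of doublings needed to go from count_start to count_end.
    doublings = math.log2(count_end / count_start)
    return years_elapsed / doublings


# Figures cited in the text: 1e3 (1970), 1e6 (1990), 1e9 (2010).
first_interval = doubling_time_years(1e3, 1e6, 20)
second_interval = doubling_time_years(1e6, 1e9, 20)
print(f"1970-1990: one doubling every {first_interval:.2f} years")
print(f"1990-2010: one doubling every {second_interval:.2f} years")
```

Both intervals cover the same thousandfold increase over 20 years, so they yield the same implied doubling time.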
At the same time, network bandwidths have increased from thousands of bits per second in the 1980s to millions of bits per second in the 1990s to billions of bits per second in the 2000s. The capacity of storage devices (electromechanical disks and electronic flash memory) has grown from millions of bytes to billions of bytes to trillions of bytes. For example, a terabyte-capacity disk now costs less than 100 dollars. Because of this great increase in capacity, we can now store more data than we can effectively and efficiently process. Ongoing data science research will contribute to the advance of open science as well as data processing techniques.

As discussed above, FAIR data is a requirement of open science. An example of FAIR data for human use is provided by public webpages. Search engines have made many such pages findable, and they are usually either immediately accessible or accessible via a paywall. Since these pages are designed for human readers, they are made (more or less) interoperable by the readers’ knowledge of the language and the subject matter. Pages are often reusable by cut-and-paste document editing tools. Open science data should also be FAIR for software agents. This requires both that a wider array of data be available and that knowledge about the data be “machine readable.” That is, machine-readable metadata should be available for software agents to support automated interoperability and reusability.

Other attributes of data that are important for open science include trustworthiness and citability. Techniques for assessing and rating trustworthiness are essential to enable proper reuse of data (and to avoid harmful reuse). And citability is an important step towards rewarding scientists for publishing important data. The definition and use of DOIs (digital object identifiers) is a related example of a useful technique for uniquely identifying journal articles.

The Semantic Web is a vision for how data and knowledge might be stored online in a machine-accessible form. The Semantic Web offers a set of standardized computer languages for representing data and knowledge (“recommendations” of the World Wide Web Consortium), and one of these languages, the Resource Description Framework (RDF), is well suited for representing the metadata needed to make online datasets FAIR. In fact, the Center for Expanded Data Annotation and Retrieval (CEDAR), a standards-based metadata authoring system developed under the NIH Big Data to Knowledge Program, uses RDF for precisely this purpose (CEDAR, 2018).

A second architecture, the Digital Object Architecture (DO), has also been under development for several decades. The DO addresses the interoperability of heterogeneous data in a manner similar to how the Internet addressed the interoperability of heterogeneous networks: that is, a new layer of abstraction is introduced. In the case of the Internet, TCP/IP (Transmission Control Protocol/Internet Protocol) defined a virtual network that interconnected physical networks at computers. In the case of the DO, a digital object is a virtual data object that references a data object at a lower level of abstraction.
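The layer of indirection described above can be illustrated with a small sketch. This is not the Digital Object Architecture's actual interface: the class names, the identifier `example/0001`, and the store names are all hypothetical, chosen only to show how a stable identifier can keep working when the underlying data move.

```python
from dataclasses import dataclass


@dataclass
class DigitalObject:
    """A virtual data object: metadata plus a pointer to lower-level data."""
    identifier: str  # hypothetical persistent identifier
    metadata: dict   # machine-readable description of the data
    location: str    # "<store>/<key>" of the underlying bytes (hypothetical)


class Resolver:
    """Toy registry mapping identifiers to digital objects and their data."""

    def __init__(self):
        self._records = {}
        self._stores = {}

    def add_store(self, name, contents):
        self._stores[name] = contents

    def register(self, obj):
        self._records[obj.identifier] = obj

    def fetch(self, identifier):
        # Clients only ever use the identifier; the location is looked up here.
        obj = self._records[identifier]
        store_name, key = obj.location.split("/", 1)
        return obj.metadata, self._stores[store_name][key]


resolver = Resolver()
resolver.add_store("archive-a", {"run42.csv": b"t,v\n0,1.5\n"})
resolver.add_store("archive-b", {"moved/run42.csv": b"t,v\n0,1.5\n"})

obj = DigitalObject("example/0001", {"format": "text/csv"}, "archive-a/run42.csv")
resolver.register(obj)
before = resolver.fetch("example/0001")

# The data are migrated to another store; only the record's pointer changes.
obj.location = "archive-b/moved/run42.csv"
after = resolver.fetch("example/0001")
print(before == after)
```

The design point is that consumers depend on the identifier and metadata, not on where the bytes are physically stored.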
Just as an Internet message has a header that contains the necessary metadata to transmit the message, a DO digital object has a “landing page” containing the necessary metadata to understand and manipulate the digital object. As an additional point of similarity, just as each computer has an Internet address (its IP number), so each digital object has a “handle,” which is used to reference its digital object.

There is a need for infrastructures that semantically link research objects to each other, such as persistent identifiers for research objects (PIDs), and for standard ways of collecting PIDs, expressing them as metadata, and semantically linking them. Some groups are developing such services and integrations, including ORCID, the Data Citation Implementation Pilot (DCIP) project, and FREYA under the European Commission’s Horizon 2020. These efforts are described in Chapter 4.

The distributed location of data repositories is an issue that is mitigated as network performance continues to improve. That is, distributed data is increasingly understood to be the norm for data processing activities. And in some cases, the dataset is too large to move efficiently and “processing is brought to the data” rather than data being brought to a processing center. In fact, the location of the data has increasingly been pushed into the background by the emergence of cloud computing. Cloud computing is sometimes called the industrialization of IT, much as the electric power grid was the industrialization of local generation of electricity. This industrialization of the underlying infrastructure for open science, among other things, turns IT capital costs into operating costs and could thereby accelerate the emergence of open science.

The question of “who pays” remains important, especially for science, which has seldom been overfunded. If open science infrastructure remains an unfunded mandate, the movement towards an open science enterprise will be significantly slowed. Proposals on both sides of the Atlantic have been made to address this problem. In the United States, NIH has considered calling for the establishment of a “data commons,” which would be financially supported by grant funds earmarked for data infrastructure. In Europe, consideration is being given to a similar earmarking of some research funds.

Appealing again to the Internet experience, the NSFnet was supported from the start with a mixture of NSF funds, university funds, and private sector investment funds. When the NSFnet was retired in 1995, universities began shouldering most of their networking costs themselves. And after the Federal Next Generation Internet program awarded research universities grants to connect to (what became) Internet2, virtually all network costs were borne by the universities themselves. In Europe, however, university networking costs continue to be partially supported by government.

Making data, code, and other research outputs available under FAIR principles involves a number of specific short-term and long-term costs that need to be covered. As discussed in the next chapter, many “big science” projects in astronomy, high-energy physics, and genomics are funded and undertaken with the starting assumption that the resulting data are a central output. The hardware, software, and other resources needed to enable long-term access to data are included in the budget and built as part of the project itself.
Likewise, in the case of smaller projects, the costs of cleaning and formatting data, ensuring that adequate documentation and metadata are attached, and other short-term costs may be supported by the grant.

As for long-term costs, some disciplinary communities have built institutions and repositories that are responsible for keeping smaller community datasets, such as the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. However, the cultures of some disciplines might lack a shared understanding that data should be curated and made available on a long-term basis. Berman and Cerf (2013) used the example of sensor data that are made available for several years after the research is concluded, paid for by grant funds, but where there is no funding to support longer-term access. What if research a decade later would benefit from access to and reuse of this sensor data?

Erway and Rinehart (2016) reviewed various possible funding strategies for long-term data management, noting that funders are increasingly advocating that institutions accept responsibility for data management as a library preservation function. They found that institutions are mainly supporting data management services through their library budgets, but that some are exploring more diversified sources of funding. In taking on a larger role, institutions might need to be more involved in working with researchers to decide how and when data may be released, ensure data quality, and meet requirements for protection of private information.

Performing these functions would help support rigor and protect the institution’s reputation, but would also require additional resources and capabilities.

Much important work remains, with key tasks and decisions facing all the participants in the research enterprise. Ensuring that resources for management and long-term stewardship of data and other research products are available (including highly trained data scientists, tools, and data standards) will require significant long-term effort on the part of stakeholders working across disciplines, sectors, and national boundaries. As will be discussed below, researchers in several fields have made significant progress and have created numerous examples and models that hold the potential for wider deployment.

Disciplinary Differences in the Nature of Research and Data

Differences in the nature of research and the types of data collected may create special barriers or limitations in sharing data, reusing data, or ensuring the long-term availability of data. The privacy and national security barriers discussed above are examples. Other challenges arise from the size or complexity of data generated by “big science” projects, such as those in some areas of physics. An important and emerging type of data is the very large dataset that captures extremely rare, time-sensitive events. Subtleties in these data and their generation may not be readily captured without detailed knowledge of how the data were collected. Safeguards may be needed to prevent misuse or misrepresentation of certain types of data. The challenges of making such data available for sharing and reuse, and providing for long-term curation, are considerable.

For example, seismology illustrates issues related to the reproducibility and replicability of research results, discussed above.
While it is impossible to replicate a given unique natural phenomenon such as a seismic event, it is possible to reproduce an analysis of the data collected on an event (i.e., analyze the same data using the same software). The seismology community around the world maintains a network of regional data archives that facilitate study and understanding of earthquakes and other seismic phenomena. For example, the Southern California Earthquake Data Center, founded in 1991, is operated by the Seismological Laboratory at the California Institute of Technology and serves as an archive of seismological data for southern California. It links to other seismological data archives around the world.

Long-term curation and stewardship of data is another general challenge that affects disciplines differently. For some “big science” fields, funds to support data sharing and archiving are included in the overall project budget, but stewardship may be difficult or impossible to sustain once the project or experiment ends. Even in fields where the size or complexity of data does not present particular challenges, communities may not have well-developed standards for deciding which datasets are of long-term value and how or where they should be curated. Chapter 3 discusses several specific examples of challenges related to data stewardship.

The Laser Interferometer Gravitational Wave Observatory (LIGO) is an example of a project that is generating very large, complex datasets and that illustrates the challenges of imagining a route to complete open data that would allow an outsider to carry out credible analysis of the data streams from three sites (LIGO, 2018). Caltech and the Massachusetts Institute of Technology operate LIGO with support from the National Science Foundation. LIGO achieved the first direct observation of gravitational waves in 2015. The LIGO detectors collect very large amounts of data on astronomical events that occur erratically or evolve slowly, such as the collision of black holes many light years away from Earth. A great deal of knowledge about the detectors themselves, the analytical software, and other aspects of the experiment is required to use the data effectively. In working to overcome these challenges of size and complexity, LIGO supports data sharing and reuse through the LIGO Open Science Center (LOSC, 2018). In addition to providing access to LIGO data packages on specific events, the LOSC site includes video tutorials and extensive data usage notes.

Open Science and Proprietary Research

This report focuses on transitioning to open science mainly in the context of published research. In most cases, the principles, practices, and expectations for openness in published research should not vary according to whether the funder is a federal agency, private foundation, or profit-making company, or whether the performer is a university, government laboratory, or corporate researcher. When a company performs research that produces an invention for which intellectual property protection should be secured and where results are publishable, it can choose to file a patent application before the relevant research article is published.
If open science requirements such as data sharing would expose information about the research that the company does not wish to publicize, it can choose not to publish an article and protect the invention through patenting or trade secrecy. This principle is seen in clinical research, where requirements for preregistration and data sharing are being codified and enforced regardless of funding source or performer (FDA, 2007; Taichman et al., 2017).

Open science does have implications for proprietary research in some areas where the need to publish and stay on the cutting edge overlaps with interest in developing products. Some research methods and technologies fall into this category. For example, many advances in biomedical research techniques involving zinc-finger proteins and zinc-finger nuclease have been patented, leading to a complex intellectual property landscape that affects how research and product development progresses in academic and corporate settings (Chandrasekharan et al., 2009). Open science data and materials options have been developed to work around some barriers caused by proprietary data and materials (Chandrasekharan et al., 2009).

It will be important to see how the relationship between proprietary research and open science evolves in the future. It is possible that companies will find that participating in an open science ecosystem is beneficial and advances innovation.

It is also possible that companies will find it more difficult to manage intellectual property risks in an open science world, which might constitute a disincentive to performing research.

Research Underlying Regulations

The above discussion illustrates that transitioning to open science will involve addressing a number of complex issues involving how the research enterprise operates and how it relates to the broader society. This process will necessarily require time, development of new approaches, and a certain amount of trial and error. This report develops a vision for moving forward and identifies priority tasks. However, several important issues that lie largely outside the scope of the study will remain.

One example is the implementation of open science practices in research relevant to policymaking and regulation in areas such as environmental health. An Environmental Protection Agency (EPA) proposal for new requirements for openness that would cover research underlying some regulations spurred spirited debate at the time this study was being completed (EPA, 2018). Opponents of the proposal argued that it would unduly restrict the scientific basis for regulations, while proponents argued that the change would improve transparency and that the concerns were overblown (985 Scientists, 2018; Hahn, 2018).

Although the issues raised here are outside the scope of this study, the example does illustrate that implementing requirements for open science in certain policy contexts will raise difficult questions, and may become politicized. Issues of data access and quality have been subject to political debates in other areas, such as climate change, and will continue to be (NASEM, 2009). There will be cases where data and code cannot be made completely open, but where the results should not be simply rejected out of hand.
Ensuring that efforts to expand openness and transparency are consistent with other priorities will be a key challenge in realizing the benefits of open science.

Openness and sharing of information are fundamental to the progress of science and to the effective functioning of the research enterprise. The advent of scientific journals in the 17th century helped power the Scientific Revolution by allowing researchers to communicate across time and space, using the technologies of that era to generate reliable knowledge more quickly and efficiently. Harnessing today’s stunning, ongoing advances in information technologies, the global research enterprise and its stakeholders are moving toward a new open science ecosystem. Open science aims to ensure the free availability and usability of scholarly publications, the data that result from scholarly research, and the methodologies, including code or algorithms, that were used to generate those data.

Open Science by Design is aimed at overcoming barriers and moving toward open science as the default approach across the research enterprise. This report explores specific examples of open science and discusses a range of challenges, focusing on stakeholder perspectives. It is meant to provide guidance to the research enterprise and its stakeholders as they build strategies for achieving open science and take the next steps.
