National Academies Press: OpenBook

Open Science by Design: Realizing a Vision for 21st Century Research (2018)

Chapter: 3 The State of Open Science

Suggested Citation:"3 The State of Open Science." National Academies of Sciences, Engineering, and Medicine. 2018. Open Science by Design: Realizing a Vision for 21st Century Research. Washington, DC: The National Academies Press. doi: 10.17226/25116.


3 The State of Open Science

SUMMARY POINTS

• Despite the barriers discussed in Chapter 2, open science has made steady progress over the past several decades. More and more research products are available on an open basis. Still, this progress has been uneven, and the research enterprise remains some distance from achieving complete open science.
• Several significant trends have expanded the possibilities for publishing articles on an open basis. These trends include the emergence of open publishing venues, author self-archiving through institutional repositories and preprint servers, and open publication mandates adopted by funders and institutions. However, a large percentage of the world's scientific literature is still only available via subscription. Achieving universal or near-universal open publication in a way that serves the research enterprise and its stakeholders remains a challenging, pressing task.
• In the area of data, code, and other research products, there has also been significant progress toward developing practices and infrastructure that would support openness under FAIR principles. There are wide disparities by discipline, with some coming close to the expectations of open data and others quite far away. Different disciplines face different challenges in fostering open data related to cost and infrastructure. For example, some disciplines lack well-developed metadata standards, researchers may not have the incentives or resources to prepare data according to FAIR principles, and repositories that support FAIR data might not be available.

GENERAL STATE OF OPEN SCIENCE

In the 15 years since the Budapest Open Access Initiative (BOAI) issued its declaration, there have been numerous efforts to promote and realize open science. A growing number of public and private research sponsors around the world, including the National Institutes of Health, the National Science Foundation, the Bill & Melinda Gates Foundation, the European Commission (EC), and the Wellcome Trust, are mandating open publication, open data, or both, on the part of grantees, with some variety in the specifics of their policies. The University of Southampton maintains a repository of open science policies adopted by funders and research organizations (Figure 3-1; ROARMap, 2018). Supportive tools and infrastructure have been developed, including discovery platforms (e.g., ScienceOpen and 1Science) and browser-based extensions (e.g., Open Access Button, Canary Haz, and Unpaywall) (Piwowar et al., 2018). Academic social networks, such as ResearchGate and Academia.edu, provide an increasingly popular but controversial solution to author self-archiving (Van Noorden, 2014). At the same time, some articles are shared in copyright-violating pirate sites, such as Sci-Hub and LibGen, provoking debate over the efficiency and ethics of traditional models of scientific publishing (Björk, 2017b; Piwowar et al., 2018). The open science movement has catalyzed new investment, prompted controversy, and had a significant impact on the global research enterprise and its stakeholders. While underscoring the impact of existing policies and progress made, Figure 3-1 also reveals the speed of change and puts in perspective the need for additional efforts.

Several entities have monitored and analyzed the progress and status of open science. Most of these efforts focus on open publication. For example, Science-Metrix, a Canadian science data analytics company, found that as of 2013 over half the articles published during the period 2007–2012 were available for free download (Science-Metrix, 2014). Using oaDOI technology, an open online service that determines open publication status for 67 million articles, it is estimated that at least 28 percent of the literature is open (green or gold, 19 million articles in total) and that this proportion is growing, driven particularly by growth in gold and hybrid open access adoption (Piwowar et al., 2018). Piwowar et al.
(2018) also suggested that the most common mechanism for open publication is not gold, green, or hybrid open access, but rather an under-discussed category of articles made free-to-read on the publisher website, without an explicit open license (Piwowar et al., 2018). In December 2017, Web of Science, a large bibliographic database, began to release more detailed data on the availability of publications than were available previously, categorizing open articles as “gold,” “green accepted,” or “green published” (Bosman and Kramer, 2018; Library Research News, 2018). Most recently, Science-Metrix (2018) analyzed three bibliographic databases (1Science database, Scopus, and Web of Science) to measure the availability of open publications, finding that at least two-thirds of the articles published between 2011 and 2014 and having at least one U.S. author could be downloaded for free as of August 2016 (Science-Metrix, 2018).

FIGURE 3-1 Open science policies adopted by research funders and research organizations around the world. SOURCE: ROARMap, University of Southampton.

Using newly available open publication status data from oaDOI in Web of Science, Bosman and Kramer (2018) explored year-on-year open access levels across research fields, countries, institutions, languages, funders, and topics by relating the resulting patterns to disciplinary, national, and institutional contexts. They found that openness varies significantly by discipline, with the highest levels (over 50 percent) in some life sciences/biomedicine and physical sciences/technology fields and lower levels (under 20 percent) in social sciences and arts/humanities (Bosman and Kramer, 2018). Within the broad category of social sciences, psychology registers the highest levels of open publication, possibly because its publication culture is more similar to life sciences/biomedicine than to the other social and behavioral sciences. Similarly, Piwowar et al. (2018) found that over half of the papers are freely available in biomedical research and mathematics, while fewer than one-fifth of the publications in the disciplines of chemistry and engineering and technology are freely available (see Figure 3-2). The figure demonstrates that green open access is popular in physics and mathematics, while hybrid articles are common in mathematics and biomedical research. Authors in biomedical research, mathematics, health, and clinical medicine often publish in gold journals. Regarding specialties within disciplines, over 80 percent of publications in astronomy and astrophysics, fertility, and tropical medicine were open. On the other hand, more than 90 percent of publications are hidden behind a paywall in pharmacy, inorganic and nuclear chemistry, and chemical engineering (Piwowar et al., 2018).

Different fields of science have different cultures; common issues include the availability of infrastructure, policies and standards, and culture. Astronomy has had a culture of sharing, for example, in part because of limited access to the equipment to conduct observations and experiments (NASEM, 2018c). There is a need to raise awareness within different disciplines about the value of open science. Examples of disciplinary approaches are described in the boxes throughout this chapter, including biological sciences such as genomic research and precision medicine; astronomy and astrophysics; earth sciences; and economics.

Regarding funders, the proportion of open publications that are based on research supported by NIH and the Wellcome Trust is high and increasing, which is understandable given their mandates requiring deposit in PubMed Central or Europe PubMed Central (PMC) within 12 and 6 months after publication, respectively, for all funded research (Bosman and Kramer, 2018; Open Access Oxford, 2018).

The United Kingdom and Austria, through Universities UK and the Austrian Science Fund, respectively, have conducted quantitative studies to monitor the transition to open publication. Universities UK (2017), the representative organization for the United Kingdom’s universities, recently found that the proportion of journals published globally with immediate open access increased from under 50 percent in 2012 to over 60 percent in 2016, while the proportion of subscription-only journals has fallen (Universities UK, 2017). The global proportion of articles accessible immediately on publication rose from 18 percent in 2014 to 25 percent in 2016, and the global proportion of articles accessible after 12 months increased from 25 percent to 32 percent (Universities UK, 2017). The Austrian Science Fund, Austria’s main public funder of basic research, actively monitors compliance with its open publication mandate (ASF, 2018). The 2017 assessment found that 92 percent of all peer-reviewed publications listed in final reports of ASF-funded projects were openly available (Kunzman and Reckling, 2017).

FIGURE 3-2 Percentage of different access types of a random sample of WoS articles and reviews with a DOI published between 2009 and 2015 per NSF discipline (excluding arts and humanities). SOURCE: Piwowar, H., J. Priem, V. Larivière, J. P. Alperin, L. Matthias, B. Norlander, A. Farley, J. West, and S. Haustein. 2018. The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles. PeerJ 6:e4375. DOI 10.7717/peerj.4375. Courtesy of Attribution 4.0 International (CC BY 4.0).

Status and trends related to open data and open code are more difficult to track than those related to open publication. In October 2017, Figshare, an open access repository that is part of the Holtzbrinck Publishing Group, released its second State of Open Data report (Figshare, 2017). The report includes perspectives from leaders in the open data field and results of a survey of researchers. The survey found that 82 percent of nearly 2,300 respondents are aware of open datasets and that 74 percent of respondents are curating their data for sharing (Figshare, 2017). A global online survey of 1,200 researchers, conducted by Leiden University and Elsevier in 2017, found that less than 15 percent of researchers share data in a data repository and that most researchers (more than 80 percent) share data only with direct collaborators (Berghmans et al., 2017). In 2017, the International Development Research Centre launched the State of Open Data project, which includes a plan to “critically review the current state of the open data movement” and produce a core reference publication during 2018 (State of Open Data, 2018).

CURRENT APPROACHES TO OPEN SCIENCE

This section explores various approaches to open science, focusing on open publication and open data. Part of the committee’s task was to provide illustrations from several scientific disciplines within the biological sciences, social sciences, physical sciences, and earth sciences. The section includes examples drawn from biomedical sciences, economics, astronomy and astrophysics, and earth sciences, along with other examples from outside of those disciplines. A comprehensive assessment of open science within individual disciplines or across disciplines is beyond the scope of the study. Nonetheless, this overview and the illustrative examples provide insight on how policies, practices, and resources that support open science can be developed and implemented.

Open Publications

Open Access Journals

Open access journals are freely available to readers online “without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself” (Suber, 2015). In contrast to traditional subscription models of scientific publishing, open access publishers typically charge an article processing charge (APC), which is paid by the author or the author’s home institution. Open access facilitates free and unrestricted access to articles for everyone immediately after publication (gold open access). As described in Chapter 2, less open approaches to publication include green open access, in which authors are able to self-archive a version of the article in an open access repository when access to the final published version requires a subscription to the journal. Open publication may also be provided following an embargo period. A list of open access journals in all fields and languages is available in the Directory of Open Access Journals (DOAJ), a community-based online directory launched in 2003 in Sweden with 300 open access journals (DOAJ, 2018). As of March 2018, this number had increased to over 11,100 open access journals, with nearly 2,982,000 articles in 124 countries (DOAJ, 2018).¹ Although the majority of open access journals do not require APCs, these journals account for a minority of the open access articles published worldwide, and only 18 percent of the open access articles published in the United States (Crawford, 2018). A wide range of APCs is charged by open access journals. For example, F1000 Research charges $150 to $1,000 depending on word count (F1000 Research, 2018). F1000 Research gives discounts or waivers to its referees, advisory board members, and authors from institutions in some developing countries (F1000 Research, 2018).

A successful case of open access publishing is the Public Library of Science (PLOS), a nonprofit scientific organization founded in 2001. PLOS launched its first journal, PLOS Biology, in 2003 (see Box 3-1). PLOS publishes several peer-reviewed journals, providing free and unrestricted access to research and an open approach to scientific assessment (PLOS, 2017a). PLOS ONE, a multidisciplinary peer-reviewed journal launched in 2006, was the largest journal in the world in terms of articles published until 2017, when it was surpassed by Scientific Reports (Davis, 2017).

¹ DOAJ does not include “hybrid” journals that contain open access and subscription access articles.

BOX 3-1 Public Library of Science (PLOS)

The Public Library of Science (PLOS) is a nonprofit publisher with a mission to accelerate progress in science and medicine by leading a transformation in research communication (Heber, 2017). In 2001, PLOS founders Harold Varmus, Patrick Brown, and Michael Eisen circulated an open letter urging scientific and medical publishers to make published research available through free online public archives, such as the U.S. National Library of Medicine’s PubMed Central. Nearly 34,000 scientists from 180 nations signed the letter (PLOS, 2017). In 2001, PLOS became a nonprofit entity and officially became a publisher in 2003, making published scientific and medical articles immediately and freely available online across the globe without restriction. PLOS rapidly became a key component of the open science movement.

In 2003, PLOS launched its first open access journal, PLOS Biology. Since then, the organization has introduced six additional peer-reviewed journals, including PLOS Medicine in 2004; the community journals PLOS Computational Biology, PLOS Genetics, and PLOS Pathogens in 2005; PLOS ONE, the first multidisciplinary open access journal, in 2006; and the fourth community journal, PLOS Neglected Tropical Diseases, in 2007. PLOS became financially self-sufficient in 2010 based on the Article Processing Charge model (PLOS, 2017). PLOS also introduced new communications tools, including The PLOS Blogs Network, PLOS Collections, and PLOS Currents, while publishing over 165,000 articles from authors in 190 countries (PLOS, 2017). PLOS currently partners with protocols.io in the development of practical tools for PLOS authors to address reproducibility and to gain recognition and credit for their work (Heber, 2017; PLOS Blogs, 2017). PLOS has also been actively engaging early career researchers with social media and live blogging at scientific conferences.

References
Heber, J. 2017. Advocating Open Science at PLOS. Presentation to the National Academies of Sciences, Engineering, and Medicine’s Committee on Toward an Open Science Enterprise, Public Symposium. September 18, 2017.
PLOS. 2017. Who We Are. Online. Available at https://www.plos.org/who-we-are. Accessed December 1, 2017.
PLOS Blogs. 2017. Protocols.io Tools for PLOS Authors: Reproducibility and Recognition. Online. Available at http://blogs.plos.org/plos/2017/04/protocols-io-tools-for-reproducibility. Accessed December 4, 2017.

Several entities provide guidelines for assessing the quality of open access journals. DOAJ, in collaboration with the Committee on Publication Ethics (COPE), Open Access Scholarly Publishers Association (OASPA), and World Association of Medical Editors (WAME), identifies principles of transparency and best practice for scholarly publications according to several criteria, such as peer review process, governing body, copyright, ownership and management, conflicts of interest, revenue sources, etc. (DOAJ, 2018). Publishers or journals that do not meet these criteria will not be included in their publisher’s list. Additionally, the Open Access Directory (OAD) provides guidelines, best practices, and recommendations for open access journals (OAD, 2017), while COPE offers resources in the current debates related to promoting integrity in research and scholarly publication (COPE, 2017). OASPA has strict criteria for becoming a member of its organization. The Think, Check, and Submit website provides a checklist for selecting trusted journals (Think, Check, and Submit, 2017).

Some journals exhibit questionable marketing schemes via spam e-mails, perform only cursory peer-review procedures, lack transparency in publishing operations, and imitate legitimate journals (Beall, 2016; Pisanski, 2017). Researchers who are eager to publish or scientists who lack sufficient time to investigate a publisher may submit their papers without verifying a journal’s reputability. Beall recommends that scholars read the available reviews and descriptions, and then decide whether they want to submit articles, serve as editors, or serve on editorial boards.

Open Access Repositories

An open access repository is “a set of services that provides open access to research or educational content created at an institution or by a specific research community. Repositories may be comprehensive or may focus on publications or data. They may be institutionally-based or subject-based collections” (COAR, 2015a, p. 3). Lynch (2003) defined the institutional repository as “a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members” (Lynch, 2003, p. 2).
While institutional repositories were developed as a new strategy for universities to accelerate changes in scholarly communication, disciplinary repositories have been established since the early 2000s, often focused on preprints and rapid dissemination of research results. To improve the visibility and impact of research, the majority of open access policies and laws require or request authors to deposit their articles into an open access repository, which has become a key infrastructure component to support these policies. Networked open access repositories enable funders and institutions to track funded research output across repositories, deliver data usage, host collections of academic journals, and link related content across the network (COAR, 2015a). The Confederation of Open Access Repositories (COAR) has developed a roadmap that identifies key trends and priorities for further investments in interoperability (COAR, 2015b). PubMed Central, managed by the National Library of Medicine, is one of the largest and best-known public access repositories of publications in the biomedical sciences (see Box 3-2).

BOX 3-2 PubMed Central

“As we all know, scientists want their work to be found, read, and cited” (Varmus, 2008).

PubMed Central (PMC), founded in 2000, is a free digital archive of full-text biomedical and life sciences journal articles housed at the U.S. National Institutes of Health’s National Library of Medicine (NLM) (NLM, 2018b). The motivation for PMC is to maximize the public investment in NIH-supported research. Articles are submitted to PMC by publishers or directly by authors. PubMed Central is distinct from PubMed, NLM’s database of some 27 million citations to the biomedical literature (NLM, 2018a).

In response to a Congressional mandate in 2008 (the Consolidated Appropriations Act of 2008, P.L. 110-161), NIH implemented its Public Access Policy (NIH Public Access Policy, 2016). Since April of that year, authors of NIH-funded research have been required to deposit, or have deposited for them, their final accepted peer-reviewed manuscripts in PMC, with an allowable embargo period of up to 12 months (NIH Public Access Policy, 2016; Varmus, 2008). Francis Collins, responding to a request from Congress in 2011, noted that the public access policy is a “prudent and beneficial” policy for several reasons: It applies 21st century information technology to the NIH investment in the promotion of science and health; it allows NIH to make strategic reasons about its portfolio; and it ensures more rapid progress in science and medical treatments (NIH, 2011).

PMC provides free access to the articles in its database, but the majority of the articles, with the exception of those that are already in the public domain, are protected by copyright law. This means that users of the database are subject to the fair use principles of copyright law and cannot, for example, download the entire database for text mining or other purposes.
PMC identifies those articles that are open access and provides a service for downloading them, including a filter for the subset of articles that have a CC-BY or CC-0 license. As of January 2018, there are 4.6 million full-text articles from several thousand journals archived in PMC, and some 39 percent (1.8 million) of these are fully open access (NLM, 2018b).

References
NIH (National Institutes of Health). 2011. Francis Collins letter to the Honorable Joseph R. Pitts. Online. Available at https://publicaccess.nih.gov/Collins_reply_to_Pitts121611.pdf. Accessed March 29, 2018.
NLM (National Library of Medicine). 2018a. PubMed. Online. Available at https://www.ncbi.nlm.nih.gov/pubmed. Accessed March 29, 2018.
NLM. 2018b. PubMed Central. https://www.ncbi.nlm.nih.gov/pmc. Accessed March 29, 2018.
NIH Public Access Policy. 2016. NIH Public Access Policy Details. Online. Available at https://publicaccess.nih.gov/policy.htm. Accessed March 29, 2018.
Varmus, H. 2008. Progress toward public access to science. PLOS Biology, Apr 8;6(4):e101.

University Open Access Policies

Open access policies have been increasingly adopted in academia. Since 2008, faculties of over 70 universities, schools, and departments have established open access policies to make their publications and research more accessible to policy makers, educators, scholars, and the public (Columbia University, 2017). In 2008, the Harvard Faculty of Arts and Sciences voted unanimously to grant the university a nonexclusive, irrevocable right to disseminate their scholarly articles for non-commercial purposes (Harvard Library, 2017). By June 2014, the remaining eight Harvard schools, including the law school and medical school, adopted similar open-access policies. Scholarly articles provided by Harvard faculty and researchers are stored, preserved, and made available in the Digital Access to Scholarship at Harvard (DASH), a free open access repository available to anyone with internet access. Similarly, Massachusetts Institute of Technology (MIT) faculty voted unanimously in 2009 to make their scholarly articles available free online through DSpace, the open source software created by Hewlett-Packard and the MIT Libraries. Faculty authors may opt out on a paper-by-paper basis (MIT Libraries, 2009). The faculty of the University of California (UC) adopted an open-access policy in 2013. The policy was amended in 2015 to include all researchers employed by the UC. The UC open access policies require that UC faculty and other employees provide a copy of their scholarly articles for inclusion in the eScholarship.org repository, or provide a link to an open version of their articles elsewhere.

A number of guidelines are available to facilitate open access to faculty research and improve scholarly communication.
For example, A SPARC Guide for Campus Action includes suggestions related to understanding rights as an author and making informed choices about publication venues (SPARC, 2012). Recommendation 4.2 of the 10-year anniversary statement of the Budapest Open Access Initiative (2012) states that supporters of open access “should develop guidelines to universities and funding agencies considering OA [open access] policies, including recommended policy terms, best practices, and answers to frequently asked questions” (BOAI, 2012). As part of the BOAI recommendation, the Harvard Open Access Project (HOAP) released a comprehensive guide, Good Practices for University Open Access Policies, in 2012 and 2015, based on policies adopted at Harvard University, Stanford University, MIT, and the University of Kansas (Shieber and Suber, eds., 2015). The guide has been endorsed by 15 organizations and projects in the U.S., Europe, and Australia. Similarly, open tools and resources for data management have been promoted in the research library world in the “23 Things: Libraries for Research Data” overview (23 Things, 2018) by the Libraries for Research Data Interest Group of the Research Data Alliance. The overview has been widely disseminated and translated from English into 10 languages.

According to the guide, there are at least six types of university open access policies. Among those types, Shieber and Suber recommend a policy that “provides for automatic default rights retention in scholarly articles and a commitment

to provide copies of articles for open distribution” (Shieber and Suber, eds., 2015, p. 6). To be consistent with copyright law, the guide recommends a policy that “grants the institution certain nonexclusive rights to future research articles published by faculty. This sort of policy typically offers a waiver option or opt-out for authors. It also requires deposit in the repository” (Shieber and Suber, eds., 2015, p. 7). However, compliance involving deposits in a repository requires time, which necessitates education, assistance, and incentives. The guide suggests “when the institution reviews faculty publications for promotion, tenure, awards, funding, or raises, it should limit its review of research articles to those on deposit in the institutional repository” (Shieber and Suber, eds., 2015, p. 22). Indiana University-Purdue University Indianapolis (IUPUI) has become one of the first institutions to include open access as a value in its promotion and tenure guidelines, through librarian-facilitated efforts (Odell et al., 2016).

While an effective open access policy can build support for open access, institutions considering adopting their own open access policies are able to refer to the current Harvard model policy (see Box 3-3), which incorporates the latest recommended practices described in their 2015 guide (Shieber and Suber, eds., 2015). To date, over 60 organizations worldwide have adopted a version of the Harvard policy for the development and promotion of open access (Harvard Library, 2017). Internationally, the Registry of Open Access Repository Mandates and Policies (ROARMAP) lists over 200 open access mandates and policies adopted by universities, research institutes, and research funders across the globe (ROARMAP, 2017).
In addition to the policy guidelines published by the United Nations Educational, Scientific and Cultural Organization (UNESCO) (Swan, 2012) and Mediterranean Open Access Network (MedOANet, 2013), the European University Association (EUA) provides a practical guide for universities in the context of current European open access policies (EUA, 2015).

Preprints

A preprint is defined as “a complete written description of a body of scientific work that has yet to be published in a journal” (Bourne et al., 2017). Preprints can be the complete and original manuscripts of scientific documents, including research articles, reviews, editorials, commentaries, and large datasets, that are not yet certified by peer review. Preprint servers can also host other objects such as posters presented at scientific meetings. The purpose of preprint distribution is “to share the results of recent research freely and openly before they are certified by peer review, in a manner that permits immediate discovery and discussion of the results and feedback to authors from the research community at large” (Inglis, 2017).

Providing preprint services is not without costs. For large services such as arXiv and bioRxiv, extensive hardware and software infrastructure is required. Although articles are not peer reviewed, they are screened and categorized, which requires staffing. Costs are typically covered by the host institutions and by foundation grants.

BOX 3-3 A Model Open Access Policy

The Faculty of <university name> is committed to disseminating the fruits of its research and scholarship as widely as possible. In keeping with that commitment, the Faculty adopts the following policy:

Each Faculty member grants to <university name> permission to make available his or her scholarly articles and to exercise the copyright in those articles. More specifically, each Faculty member grants to <university name> a nonexclusive, irrevocable, worldwide license to exercise any and all rights under copyright relating to each of his or her scholarly articles, in any medium, provided that the articles are not sold for a profit, and to authorize others to do the same.

The policy applies to all scholarly articles authored or co-authored while the person is a member of the Faculty except for any articles completed before the adoption of this policy and any articles for which the Faculty member entered into an incompatible licensing or assignment agreement before the adoption of this policy. The Provost or Provost’s designate will waive application of the license for a particular article or delay access for a specified period of time upon express direction by a Faculty member.

Each Faculty member will provide an electronic copy of the author’s final version of each article no later than the date of its publication at no charge to the appropriate representative of the Provost’s Office in an appropriate format (such as PDF) specified by the Provost’s Office. The Provost’s Office may make the article available to the public in an open-access repository.

The Office of the Provost will be responsible for interpreting this policy, resolving disputes concerning its interpretation and application, and recommending changes to the Faculty from time to time. The policy will be reviewed after three years and a report presented to the Faculty.

SOURCE: S. M. Shieber, 2015.

FIGURE 3-3 Biology preprints over time. SOURCE: http://asapbio.org/preprint-info/biology-preprints-over-time. Courtesy of Attribution 4.0 International (CC BY 4.0).

Preprints are gaining momentum among the scientific community. Since 1991, researchers in disciplines such as physics (and later mathematics, computer science, and quantitative biology) have been able to access preprints through arXiv, a repository of electronic preprints of scientific papers. arXiv is operated by the Cornell University Library and currently contains over 1.3 million preprints (Cornell University Library, 2017). In 2013, bioRxiv was launched as a repository of life science preprints covering all of the life sciences, clinical trials, epidemiology, as well as science communication and education (see Figure 3-3). Operated by the Cold Spring Harbor Laboratory, bioRxiv is modeled conceptually on arXiv but uses different technology, and offers somewhat different features and functions (Inglis, 2017). Economics has a long history of utilizing preprints, which are called working papers in that discipline (see Box 3-4).

Preprint services are being launched in a growing number of disciplines, as indicated in Table 3-1. For example, the American Chemical Society (ACS) and its global partners launched ChemRxiv, a preprint server for chemistry-related information. The Center for Open Science (COS) has launched PsyArXiv (psychology), AgriXiv (agriculture), SocArXiv (social sciences), engrXiv (engineering), and LawArXiv (law), with the most recent additions including NutriXiv (nutritional sciences) and SportRxiv (sport) (COS, 2017; Luther, 2017). In 2017, the American Geophysical Union and Atypon announced the development of Earth and Space Science Open Archive (ESSOAr). This preprint server will join the existing EarthArXiv as preprint servers for the earth and space science community (Voosen, 2017).

There are other services that provide preprint functions.
For example, the Social Science Research Network (SSRN) was created in 1994 as a tool for rapid dissemination of scholarly research in the social sciences and humanities. The SSRN, bought by Elsevier in 2016, facilitates the free posting and sharing of research material, including preprints, conference papers, and non-peer-reviewed papers in social science research (Gordon, 2016). F1000Research is “an open research publishing platform for life scientists that offers immediate publication and transparent peer review” (F1000Research, 2018). An article submitted to F1000Research also requires data and code deposition, either in an F1000 approved repository or in an institutional repository.

Bourne et al. (2017) described a number of advantages of preprint submission from the standpoint of both individual researchers and the broad community. Preprints are free to post and to read, which provides accelerated transmission of scientific results. Researchers can evaluate new findings and their reliability without the delay introduced by journal peer review. Some funders are now providing incentives to those who submit preprints (Inglis, 2017). However, there are challenges associated with managing preprints, including anxieties about “scooping” (other researchers using the preprint to publish work in advance of those submitting a preprint) and reluctance to use open licenses (Inglis, 2017; INLEXIO, 2017). There is a need for more education and discussion regarding the choice of licenses and ways to prevent unattributed use of the results. NIH is working with

an international group of research funders to examine the feasibility of establishing a central service of preprints to encourage sharing of preprints in the life sciences (NIH, 2017b).

BOX 3-4 Working Papers in Economics

The National Bureau of Economic Research (NBER) issued its first working paper (preprint) in 1973 as a way of disseminating research more quickly than waiting for lengthy editorial review at the Bureau. The papers were originally mailed to libraries, research institutes, journalists, and other interested parties on a subscription basis; over time, print distribution has given way to electronic dissemination. As of October 2017, approximately 24,000 working papers have been issued by the NBER. These working papers reside behind a pay wall for 18 months and then are provided freely to the international research community (green open access). For residents of developing countries, journalists, and government employees, there is no pay wall, even for new papers.

NBER research associates are leading economics researchers and not necessarily representative of the entire economics profession. Today, there are nearly 1,500 NBER-affiliated researchers. Only NBER research associates and conference participants are allowed to release NBER working papers. Nevertheless, as the thought leader in the profession, NBER created a culture of openness for the economics profession that has had a lasting impact.

Economists outside of the NBER recognized the need to disseminate research prior to publication. In 1993, the Economics Working Paper archive was opened at Washington University in St. Louis. In 1997, Research Papers in Economics (RePEc) was created to facilitate the sharing of economic research (http://repec.org).
RePEc is a “decentralized bibliographic database of working papers, journal articles, books, books chapters and software components, all maintained by volunteers.” According to RePEc, 1,900 archives from 93 countries have contributed 2.3 million research pieces to the archive. Although economics journals have pay walls, the free availability of working papers means that almost all economics research is open. Any economist can register and maintain an author profile at RePEc and as of 2017, more than 50,000 authors have registered worldwide.

Economics journals have supported replication and hosted data archives for almost 30 years. As a condition of acceptance, the Journal of Human Resources required authors to preserve data for 3 years after publication in order to promote replication starting in 1989. The Journal of Applied Econometrics has data archives for most papers starting in 1995 (http://qed.econ.queensu.ca/jae). The American Economic Review and other American Economic Association journals required data archiving starting in 2004. The American Economic Review has hired a data editor to ensure the proper archival of datasets and software programs, and to consider exceptions to the data archival policies for restricted use datasets. In 2007, the American Economic Association launched four field journals in part to reduce the influence of for-profit journals in the profession.

TABLE 3-1 Preprint Servers

Name | Fields | Start Year | Owned/Operated by | Submissions in 2016

Selected preprint services
arXiv | Physics, mathematics, computing, quantitative biology, quantitative finance, statistics | 1991 | Cornell University Library | 113,308
bioRxiv | Life sciences | 2013 | Cold Spring Harbor Laboratory | 4,712
PeerJ Preprints | General | 2013 | PeerJ | ~1,000
Preprints (MDPI) | General | 2016 | Multidisciplinary Digital Publishing Institute (MDPI) | ~1,000
SocArXiv | Social sciences | 2016 | Open Science Framework (OSF) | 633
PsyArXiv | Psychology | 2016 | OSF | 191
engrXiv | Engineering | 2016 | OSF | 35
ChemRxiv | Chemistry | 2017 | ACS | N/A
AgriXiv | Agriculture | 2017 | OSF | N/A
EarthArXiv | Earth Sciences | 2017 | OSF | N/A
LawArXiv | Law | 2017 | OSF | N/A
NutriXiv | Nutritional Sciences | 2017 | OSF | N/A
SportRxiv | Sport science | 2017 | OSF | N/A

Services with preprint functions
Social Science Research Network (SSRN) | Social sciences | 1994 | Elsevier | 66,310
Figshare | General | 2012 | Figshare | Unknown
Zenodo | General | 2013 | OpenAire/CERN | 318
F1000Research | General | 2013 | F1000Research | 215
Authorea | General | 2013 | Authorea | Unknown

SOURCE: https://www.inlexio.com/rising-tide-preprint-servers; https://researchpreprints.com/2017/03/09/a-list-of-preprint-servers.

European Commission Open Research Publishing Platform

The European Commission (EC) has proposed to fund the EC Open Research Publishing Platform for Horizon 2020 beneficiaries to comply with the Horizon 2020 open access mandate and to increase open access peer reviewed publications in Horizon 2020 (EC, 2017c). The platform will provide an easy, fast, and reliable open access publishing venue free to Horizon 2020 grantees on a voluntary basis, including preprints support, open access, open peer review, and innovative research indicators most appropriate for individual disciplines and/or national context. Building on the best practices of other funders, such as the Bill & Melinda Gates Foundation and the Wellcome Trust, the commission hopes that the platform will contribute to a more diversified and competitive open access publishing market. One contractor or a consortium led by one contractor will be selected to run the platform with a 4-year initial contract. The contractor will be required to commit to a minimum number of preprints and articles to be published during the initial 4-year period and to develop a plan for sustainability of the service beyond the 4 years. While some experts such as Jacobs (2018) interpret this movement as “a sign of increasing frustration on the part of research funders and institutions at the pace and cost of the change to open access,” the success of the platform will depend on the quality of the scientific publication service provided (EC, 2017c). Current international approaches to open science are described further in the final section of this chapter.

Pay It Forward Initiative

The University of California (UC), Davis and the California Digital Library (CDL) conducted a study in 2015 and 2016 to examine the economic implications of large North American research institutions converting to an entirely article processing charge (APC) business model.
With support from the Andrew W. Mellon Foundation, the study was conducted in partnership with Harvard University, Ohio State University, the University of British Columbia, and University of California Libraries, along with the Association of Learned and Professional Society Publishers (ALPSP) and the private sector, including Thomson Reuters (Web of Science) and Elsevier (Scopus). These large North American research institutions would assume the large part of the financial burden in an APC-driven open access model, the predominant open access business model of gold open access publishers (Anderson, 2017). The study involved a number of qualitative analyses based on academic author surveys and publisher surveys, as well as quantitative analyses based on data for a 5-year period (2009-2013), including library subscription expenditures, university publishing output, and potential APCs (UC Libraries, 2016; Anderson, 2017).

The final report, Pay It Forward: Investigating a Sustainable Model of Open Access Article Processing Charges for Large North American Research Institutions (2016), has the following three major findings:

1. The total cost to publish in a fully APC-funded journal will exceed current library journal budgets for the most research-intensive North American research institutions;
2. This cost difference could be covered by grant funds, already a major source of funding for publishing fees; but
3. Ultimately, author-controlled discretionary funds, such as research grants and personal research accounts that incentivize authors to act as informed consumers of publishing services, are necessary to introduce both real competition and pricing pressures into the journal publishing system (UC Libraries, 2016).

To establish these findings, the study examined the level of APCs each institution could afford, based on its current subscription spending. The study discovered that the average APC for partner institution publications in full open access journals is $1,892 (Figure 3-4). While research-intensive institutions would be unable to convert to the APC model if they had to rely solely on their existing subscription budgets, the study found that those institutions could afford a transition to APC, if grant funds were applied to the cost (Anderson, 2017). This is not an entirely novel idea, as many authors are already using grant funds for APCs. A key strategy could be a multi-payer model involving library subsidies, together with grants, startup packages, and discretionary research funds (Anderson, 2017). For example, the Wellcome Trust notes that its APC payments, which cover both full open access and high-cost hybrid journals, consume less than 1 percent of its overall research budget (UC Libraries, 2016; Anderson, 2017).
According to the report, incorporating grant and discretionary funds into the financial flow for a full APC business model may be a viable direction for both research-intensive institutions and their funders.

The report emphasizes that it is essential to introduce competition for authors to ensure that APCs remain affordable in the future. This can be accomplished by giving authors some financial responsibility in deciding where to publish, using funds that they control directly. Additionally, the report acknowledges that the information available on current APCs is almost entirely derived from STEM fields, which historically have higher subscription costs than social science and humanities disciplines (Crotty, 2016; UC Libraries, 2016). Because the report provides APC estimations based on available data, it likely overestimates costs for non-STEM fields, and additional analysis may be needed for other disciplines. There is also a need to monitor global developments on an ongoing basis to assess opportunities for collaboration with European countries toward more immediate, large-scale transition to an open science enterprise.
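The affordability comparison at the heart of the study can be illustrated with a short calculation. The sketch below is not the study's actual model; it assumes a simplified institution whose only inputs are annual subscription spending, annual article output, and the share of APC costs paid from grants. The function names and the institution's figures are hypothetical; only the $1,892 average APC comes from the report.

```python
# Illustrative sketch only: a simplified affordability check, not the
# Pay It Forward study's methodology.

AVERAGE_APC_USD = 1892  # average APC in full open access journals, as reported


def break_even_apc(subscription_spend, articles_per_year):
    """Return the APC at which paying per article costs exactly as much
    as the institution's current subscription spending."""
    if articles_per_year <= 0:
        raise ValueError("articles_per_year must be positive")
    return subscription_spend / articles_per_year


def funding_gap(subscription_spend, articles_per_year, average_apc, grant_share=0.0):
    """Return the shortfall (positive) or surplus (negative) when the library
    redirects its subscription budget to APCs and grants pay `grant_share`
    of the total APC bill."""
    if not 0.0 <= grant_share <= 1.0:
        raise ValueError("grant_share must be between 0 and 1")
    total_apc_cost = articles_per_year * average_apc
    paid_by_grants = total_apc_cost * grant_share
    return total_apc_cost - paid_by_grants - subscription_spend


# Hypothetical research-intensive institution (figures are assumptions).
spend = 5_000_000   # annual journal subscription spending, USD
articles = 4_000    # articles published per year

print(f"Break-even APC: ${break_even_apc(spend, articles):,.0f}")
print(f"Gap, library budget only: ${funding_gap(spend, articles, AVERAGE_APC_USD):,.0f}")
print(f"Gap, 40% paid by grants: ${funding_gap(spend, articles, AVERAGE_APC_USD, 0.4):,.0f}")
```

Under these assumed figures the break-even APC falls below the reported average, so the library budget alone leaves a shortfall, while applying a share of grant funds closes it. This mirrors the pattern the report describes for research-intensive institutions, but the numbers themselves are illustrative.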

FIGURE 3-4 APCs are affordable for large research-intensive institutions if grant funds are applied. The figure marks $1,892 as the average APC for partner institution publications in full open access journals. SOURCE: Presentation by Ivy Anderson, California Digital Library, Committee on Toward an Open Science Enterprise public symposium, September 18, 2017.

A recent report from the Max Planck Digital Library (MPDL) has claimed that a large-scale open access transformation is possible without financial risk (MPDL, 2015). Yet this is a contentious issue. Some argue against efforts to promote publishing models based on gold open access enabled by APCs, and instead advocate for a combination of green open access mandates and community efforts to create and sustain new institutions for publishing and expert review (Shulenberger, 2016).

As a recent development, the UC Libraries released a new report, Pathways to Open Access, in February 2018 that identifies the current state of open access approaches, a set of strategies to achieve those approaches, and possible next steps to assist UC campus libraries and the California Digital Library to pursue a large-scale transition to open access (UC Libraries, 2018). An accompanying chart summarizes the approaches and strategies identified in the report, including green open access, gold open access-APC based, gold open access-non APC based, and universal strategies.

Private Foundation Initiatives

Open access publishing has increasingly become part of the business process among the philanthropic community. For example, the Bill & Melinda Gates Foundation has one of the most stringent open-access policies. After a 2-year transition period for policy compliance, the foundation’s Open Access Policy has been fully operational as of January 1, 2017, with no exceptions to the policy (Bill & Melinda Gates Foundation, 2017; Hansen, 2017; Adams, 2018).
Under its policy, the foundation requires grantees to make their research papers and data available immediately upon publication without any embargo period and allow for their unrestricted use under the Creative Commons Attribution Generic License

(CC BY 4.0) or an equivalent license (Bill & Melinda Gates Foundation, 2017; Hansen, 2017; Adams, 2018). The foundation will pay reasonable fees in order to publish on its open access terms. Launched in July 2016, the web-based service Chronos tracks the impact of research while simplifying research publishing. As a new initiative, Gates Open Research was launched in late 2017, with a model used by the Wellcome Trust in the United Kingdom (Wellcome Open Research), to provide their grantees with an open research platform for open peer review and rapid author-led publication (Butler, 2017; Open Research Central, 2017; Van Noorden, 2017; Bill & Melinda Gates Foundation, 2018). As one of the most influential global health philanthropic organizations, the foundation emphasizes that “the free, immediate, and unrestricted access to research will accelerate innovation, helping to reduce global inequity and empower the world’s poorest people to transform their own lives” (Bill & Melinda Gates Foundation, 2017). Because of a rapidly changing landscape in scholarly communications, the Wellcome Trust will conduct its first review of its open access policy and a result will be announced by the end of 2018 (Wellcome Trust, 2018).

While a growing number of funding organizations are committing to open sharing of research, the funder community is building effective partnerships in an effort to meet current and future open science challenges. One major effort is the creation of the Open Research Funders Group (ORFG)² in December 2016, following a forum of open access stakeholders convened by The Robert Wood Johnson Foundation and the Scholarly Publishing and Academic Resources Coalition (SPARC) in late 2015. The ORFG develops actionable principles and policies that encourage innovation, increase access to research articles and data, and promote reproducibility (ORFG, 2018).
While many organizations have expressed an interest in developing their own open policies, a significant challenge is the lack of clarity about an effective policy. In an attempt to describe the variation in interpretation of openness by funding organizations, ORFG has published a guide, HowOpenIsIt? Guide to Research Funder Policies (2017), building on the success of HowOpenIsIt? Guide for Evaluating the Openness of Journals described in Chapter 2 (see Table 2-2). During recent infectious disease outbreaks in 2016, the publishing community largely agreed, at the prompting of WHO and funders such as the Bill & Melinda Gates Foundation and the Wellcome Trust, to adopt open science practices, including early publication of data and preprints and open access publication (PLOS, 2016). Such agreements applied in times of international public health emergencies underscore the benefits of an open science approach.

² As of January 2018, ORFG members include the Alfred P. Sloan Foundation, American Heart Association, A Charitable Fund of Peter Baldwin and Lisbet Rausing (ARCADIA), the Bill & Melinda Gates Foundation, Eric & Wendy Schmidt Fund for Strategic Innovation, James S. McDonnell Foundation, John Templeton Foundation, Laura and John Arnold Foundation, Leona M. and Harry B. Helmsley Charitable Trust, Open Society Foundation, Robert Wood Johnson Foundation, and Wellcome Trust. Additional information can be found at http://www.orfg.org/members.

Publisher and Society Initiatives

Publishers and professional societies are exploring options for expanding open access to accelerate scientific discovery. The American Geophysical Union (AGU), which consists of 60,000 members from 137 countries, is the largest society publisher in the discipline of Earth and space science, with 20 peer-reviewed scholarly journals and over 6,000 published papers in 2016 (Stall, 2017). The AGU produces four open access journals, including Journal of Advances in Modeling Earth Systems, Earth’s Future, Earth and Space Science, and GeoHealth, with content currently representing nearly 100,000 articles (AGU, 2017a). Articles published in those journals become freely available online immediately upon publication, and authors can select one of several Creative Commons (CC) licenses. AGU allows a draft or the author’s version of the accepted manuscript to be posted to any nonprofit preprint server to encourage community engagement. Through its publishing partner Wiley, AGU offers discounts or waivers on fees for researchers in developing countries to increase access to research. Additionally, AGU is part of the innovative Research4Life program, which provides over 5,000 institutions in low- and middle-income countries free or low-cost access (AGU, 2017a; Research4Life, 2018). In addition to these gold open access options, AGU also makes all publications open after a 2-year embargo period. (See Chapter 2 and above for more explanation on gold and green access.)

Open Data

Most research data in repositories today is not available under FAIR principles. Realizing this availability will entail significant costs and complexities. The wide variety of types and sizes of research datasets means that developing effective tools and practices will require significant and sustained community input.
Long-term curation of data and research software will require standards for the types of data that should be stored and how long they should be stored. This section considers several examples and potential lessons.

Big Science Data

Open data is largely the norm in fields such as high-energy physics and astronomy: because funding for these projects is significant, data distribution is well thought out and closely monitored by the respective federal agencies. Good examples include the Large Hadron Collider and some of the large-scale astrophysical archives (Hubble Legacy Archive, Sloan Digital Sky Survey, etc.). Open data practices typically started in areas where the data were far removed from any financial impacts. More recently, other areas, like genomics (Human Genome Project, 1000 Genomes, etc.) and materials science (Materials Genome Initiative), are also heading towards data sharing in large open archives. Such a transition for a given field typically requires a decade of focused effort by the community, and

a substantial federal investment. Boxes 3-5 and 3-6 illustrate examples of open practices in the fields of astronomy and astrophysics as well as genomics research, respectively. With the size and complexity of datasets continually increasing, yesterday’s “big data” appears less big today, today’s “big data” will appear small in five or ten years, and so forth.

BOX 3-5 Astronomy and Astrophysics

The Sloan Digital Sky Survey (SDSS) has been one of the largest, most detailed, and most often cited surveys in the history of astronomy. The SDSS has provided world-leading datasets for a wide range of astrophysical research, including the study of extragalactic astrophysics, cosmology, the Milky Way, and stars (ARC, 2012). All SDSS data are released to the public under open science principles.

The SDSS project has revolutionized the interactions between a telescope, its data, and its user communities (NAS-NAE-IOM, 2009). There was a desire to develop large-scale (petascale) computing and storage to enable greater access and better usability of information by the astronomy and physics community. The Astrophysical Research Consortium (ARC) was formed in 1984, and a pioneering 2.5-meter telescope was built at Apache Point Observatory (APO) in New Mexico that maps the sky to examine the structure of the universe (NAS-NAE-IOM, 2009). To accelerate discoveries in astronomy, the SDSS was initiated to “digitally map about half of the Northern sky in five spectral bands from ultraviolet to the near infrared” (Szalay, 2000). However, the data challenge in this field was the integration of disparate types of data about astronomical objects (stars, galaxies, quasars), including images, spectroscopy data, and astrometric data, along with the large volumes of data (2 to 4 TB per year) (NRC, 2008).

After nearly a decade of design and construction, the SDSS entered routine operations in 2000.
With funding from multiple sources and countries, the SDSS has been releasing data annually at the American Astronomical Society Meeting. The data obtained from the project are available at SkyServer, an SDSS-managed public database designed and built at Johns Hopkins University, for both astronomers and science education (Gatlin, 2013). Anyone with a web browser can navigate through the sky using the SkyServer website. Teachers are encouraged to adapt the projects for use in their classroom.

Since 2000, SDSS has progressed through the following phases with multiple surveys:

• SDSS I (2000–2005), including deep multicolor imaging over 8,000 square degrees and measured spectra of more than 700,000 celestial objects.
• SDSS II (2005–2008), including the Sloan Supernova Survey. SDSS II completed the original survey goals of imaging half the northern sky and mapping the 3-dimensional clustering of one million galaxies and 100,000 quasars.

• SDSS III (2008–2014), including the Apache Point Observatory Galactic Evolution Experiment (APOGEE) and Baryon Oscillation Spectroscopic Survey (BOSS) using the largest-ever three-dimensional map of distant galaxies.
• SDSS IV (2014–2020), including the extended BOSS (eBOSS), APOGEE-2, and Mapping Nearby Galaxies at APO (MaNGA) (SDSS, 2017).

While SDSS recorded a total of 25 TB of data during the first (2000–2005) and second (2005–2008) surveys combined, the amount of data to be saved at the end of the third survey (2008–2014) is 100 TB because of the multiple reprocessed versions of the data (Singh and Kumar, 2016). The SDSS is distinctive within the astronomical community for its participatory, bottom-up scientific research planning process, currently involving over 50 contributing institutional members in the collaboration. For the first time in the collaboration’s history, the current fourth phase of SDSS (SDSS-IV) partners with a sister telescope located in the Southern hemisphere in Chile to observe regions of the sky that are not visible from the Northern hemisphere (Alfred P. Sloan Foundation, 2017). In keeping with previous SDSS policy, the SDSS-IV provides regularly scheduled public data releases, and the current version is Data Release 14. The websites for SDSS I, II, and III are still available but no longer updated.

All SDSS data are available through public archives and used extensively by the community for research and teaching. For example, more than 7,000 refereed papers have been published, with well over 350,000 citations (Szalay, 2014). Citizen-science projects, such as Galaxy Zoo, invite the general public to help classify millions of galaxies in the SDSS data via the Internet (Lintott et al., 2008; Khullar, 2017), and have led to the discovery of a unique celestial object by a Dutch school teacher.
Next-generation large astronomical surveys, such as the Large Synoptic Survey Telescope (LSST) and Panoramic Survey Telescope and Rapid Response System (Pan-STARRS), have also used the SDSS experience to develop their own data management infrastructure and services (Szalay, 2014). The SDSS has contributed to the globalization of scientific innovation through open science. The SDSS is managed by the Astrophysical Research Consortium for the participating institutions of the SDSS collaboration. Funding for the current SDSS IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the participating institutions (SDSS, 2017).

References

Alfred P. Sloan Foundation. 2017. Sloan Digital Sky Survey. Online. Available at https://sloan.org/programs/science/sloan-digital-sky-survey. Accessed November 13, 2017.
ARC (Astrophysical Research Consortium). 2012. Principles of Operation for SDSS-IV. Online. Available at http://www.sdss.org/wp-content/uploads/2014/11/principles.sdss4_.v4.pdf. Accessed November 15, 2017.

Gatlin, L. 2013. Johns Hopkins astronomer awarded $9.5M to create ‘virtual telescope.’ Johns Hopkins University. Online. Available at https://hub.jhu.edu/2013/11/01/szalay-grant-skyserver. Accessed November 15, 2017.
Khullar, G. 2017. The Sloan Digital Sky Survey: A Legacy. Online. Available at https://astrobites.org/2017/02/03/the-sloan-digital-sky-survey-a-legacy. Accessed November 13, 2017.
Lintott, C. J., K. Schawinski, A. Slosar, K. Land, S. Bamford, D. Thomas, M. J. Raddick, R. C. Nichol, A. Szalay, D. Andreescu, P. Murray, and J. Vandenberg. 2008. Galaxy Zoo: Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society 389:1179-1189.
NAS-NAE-IOM (National Academy of Sciences, National Academy of Engineering, and Institute of Medicine). 2009. Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. Washington, DC: The National Academies Press.
NRC (National Research Council). 2008. Integrated Computational Materials Engineering: A Transformational Discipline for Improved Competitiveness and National Security. Washington, DC: The National Academies Press.
NSF (National Science Foundation). 2014. SciServer: Big Data infrastructure for science. Online. Available at https://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=133526. Accessed November 16, 2017.
SDSS (Sloan Digital Sky Survey). 2017. The Sloan Digital Sky Survey: Mapping the Universe. Online. Available at http://www.sdss.org. Accessed November 13, 2017.
Singh, M. K., and G. D. Kumar. 2016. Effective Big Data Management and Opportunities for Implementation. Hershey, PA: IGI Global.
Szalay, A. S. 2017. From SkyServer to SciServer. The Annals of the American Academy of Political and Social Science 675(1):202-220.
Szalay, A. S., P. Kunszt, A. Thakar, J. Gray, and D. Slutz. 2000. The Sloan Digital Sky Survey and its Archive.
Online. Available at https://arxiv.org/abs/astro-ph/9912382v1. Accessed November 16, 2017.

A major consideration is what happens to data from a major research facility, which often takes hundreds of millions of dollars and decades of effort, once the facility is shut down (e.g., BaBar at SLAC³). The legacy value of the investments made remains in the data, which need to be preserved and curated for at least several additional decades. This preservation phase of the data lifecycle requires skills different from those needed for capturing and analyzing data from an active instrument. Several major facilities are getting closer and closer to this point.

³ BaBar is a large-scale particle physics experiment conducted at the SLAC National Accelerator Laboratory and designed to study fundamental questions about the universe, including the nature of antimatter, the properties and interactions of the particles known as quarks and leptons, and searches for new physics. For more information, see http://www-public.slac.stanford.edu/babar.

Maintaining and reinventing the data curation for each project in isolation will be very inefficient, and the task requires economies of scale. The expertise for curation will require active involvement by librarians and archivists, augmenting the legacy and corporate knowledge of the individual projects.

The Long Tail of Science

The long tail of science is increasingly gaining attention in the open science community. While big data tend to comprise homogeneous, standardized, and regulated data, long-tail datasets can be relatively small and heterogeneous individually but very large in number (Heidorn, 2008; Borgman, 2015; e-IRG, 2016; see Table 3-2). Data heterogeneity includes differences in the size, structure, format, and complexity of research data.

Long-tail data exist across all disciplines, mostly only on individual computers or personal websites with minimal or no attached metadata or documentation, resulting in issues such as irreproducibility of research, duplicate research, and, potentially, innovation loss (e-IRG, 2016). For example, environmental science research involves enormously complex datasets, including physical, chemical, and biological data that reside in small files (e.g., spreadsheets and tables) collected in laboratories (Szalay, 2014). Other challenges associated with long-tail data include data quality due to varying technology across disciplines, difficulty of discoverability in diverse repositories, and lack of incentives for researchers to deposit their data. Mostly, the demands for metadata are simply too cumbersome for working scientists, who feel that the relatively small amounts of data to be published do not justify the effort needed to add the extra information required by the publishing process. Part of the reason for the balkanization of long-tail data is its isolation and geographic segregation.
Most such data sit on tens of thousands of personal computers or personal websites. If all data could be stored on the same “science cloud,” where it would take a mouse click to upload and link new information, a complex network of interrelated datasets could rapidly be built. It is quite likely that the relationships between datasets would resemble the network graphs of co-authorship. The technology to do automatic discovery of a wider context from data tables on the web is already here (Cafarella et al., 2008). A substantial amount of data currently resides in “Supplementary Information” accompanying journal articles, in front of or behind paywalls, but mostly in formats that do not lend themselves to text- or data-mining. Several publishers are currently moving towards ensuring that at least one copy of article-related datasets is available in open repositories (e.g., Dryad, Figshare), as well as in the journal record (COPDESS, 2015; Byrne, 2017).
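As a rough illustration of how part of the metadata burden described above might be reduced, the following Python sketch derives a minimal descriptive record from a small data table. It is only a sketch under stated assumptions: the function name `sketch_metadata`, the record fields, and the sample values are hypothetical and are not drawn from any repository's actual schema or from the tools cited in this chapter.

```python
import csv
import io


def sketch_metadata(csv_text, title, creator):
    """Build a minimal descriptive record for a small tabular dataset.

    The field names are illustrative assumptions, not a published
    metadata standard.
    """
    # Parse the table and drop completely empty lines.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header = rows[0] if rows else []
    data_rows = rows[1:]
    return {
        "title": title,               # supplied by the researcher
        "creator": creator,           # supplied by the researcher
        "columns": header,            # extracted from the first row
        "row_count": len(data_rows),  # extracted from the table body
    }


# Hypothetical example: a small environmental spreadsheet exported as CSV.
sample_csv = "site,date,ph\nA,2017-05-01,7.1\nB,2017-05-02,6.8\n"
record = sketch_metadata(sample_csv, "Stream pH readings", "Example Lab")
print(record)
```

In this sketch the researcher supplies only a title and a creator, while the column names and row count are read from the file itself. Discipline-specific metadata would still need to be added by hand or by more capable tools.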

BOX 3-6 Genomic Data

The Human Genome Project was a large-scale project to determine the sequence of the human genome. The project successfully created a human reference genome, together with the complete sequences of five model organisms (The Human Genome Project Completion). The work was coordinated by the National Institutes of Health and the U.S. Department of Energy and involved a large interdisciplinary team, with participating laboratories in the U.S. and abroad (Collins et al., 1998; Lander et al., 2001; Hood, 2013). The goals of the project were first set forth in 1988 by a committee of the U.S. National Academy of Sciences (NRC, 1988). Among the goals articulated by the Academy report and in subsequent publications by the leaders of the effort was a significant focus on open data sharing: “Considerable data will be generated from the mapping and sequencing project. Unless this information is effectively collected, stored, analyzed, and provided in an accessible form to the general research community worldwide, it will be of little value” (NRC, 1988, p. 7), and “Collection, analysis, annotation, and storage of the ever increasing amounts of mapping, sequencing, and expression data in publicly accessible, user-friendly databases is critical to the project’s success” (Collins et al., 1998, p. 688). The National Library of Medicine’s National Center for Biotechnology Information (NCBI) was founded in 1988, and since then it has built and maintained numerous publicly available genomic databases for use by scientists and the interested public (NCBI). The Human Genome Project has fostered not only an interdisciplinary culture, involving collaborations among computer scientists, engineers, mathematicians, and biologists, but also a culture in which data and computational code are openly and freely shared (Lander et al., 2001; Hood, 2013; Cook-Deegan, 2017).
The Personal Genome Project (PGP) was founded in 2005 and is dedicated “to creating public resources that everyone can access” and to a “highly participatory approach to research-participant communication and interaction” (Church, 2005; Harvard Personal Genome Project, 2014; Ball et al., 2014). The project enrolls volunteers who are interested in publicly sharing their genomic, health, and trait data for the benefit of scientific progress. Acknowledging that it is not possible to guarantee privacy, confidentiality, and anonymity of genetic data when the explicit goal is to share those data, the project has developed a novel “open consent” framework. Because the PGP aims to have all of its participants both engaged and informed, potential participants are given a study guide that provides a primer on genomic science and discusses the risks of participating, after which they must pass an exam testing their understanding of the material (Angrist, 2009). As of January 2018, the project has enrolled more than 5,000 participants.

The NIH updated its genomic data sharing policy in late 2014. The policy details the agency’s expectations for the sharing of both human and non-human genomic data generated by studies supported by the NIH (NIH, Genomic Data Sharing). Data generated from human studies must be submitted to the NIH generally within 3 months after generation, and the NIH may allow another 6-month embargo period before public release. In addition, the policy requires that investigators obtain participants’ consent to share their data broadly for future research purposes.

References

Angrist, M. E. 2009. Wide open: The personal genome project, citizen science and veracity in informed consent. Personalized Medicine 6(6):691-699.
Ball, M. P., J. R. Bobe, M. F. Chou, T. Clegg, P. W. Estep, J. E. Lunshof, W. Vandewege, A. W. Zaranek, and G. M. Church. 2014. Harvard Personal Genome Project: Lessons from participatory public research. Genome Medicine 6(2):10-16.
Church, G. M. 2005. The Personal Genome Project. Molecular Systems Biology 1(1):0030.
Collins, F. S., A. Patrinos, E. Jordan, A. Chakravarti, R. Gesteland, and L. Walters. 1998. New goals for the U.S. Human Genome Project: 1998-2003. Science 282(5389):682-689.
Contreras, J. L. 2015. NIH’s genomic data sharing policy: timing and tradeoffs. Trends in Genetics 31(2):55-57.
Cook-Deegan, R., R. A. Ankeny, and K. Maxson Jones. 2017. Sharing Data to Build a Medical Information Commons: From Bermuda to the Global Alliance. Annual Review of Genomics and Human Genetics 18:389-415.
Hood, L., and L. Rowen. 2013. The Human Genome Project: Big science transforms biology and medicine. Genome Medicine 5(9):79.
Lander et al. and the International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409(6822):860-921.
NCBI (National Center for Biotechnology Information). Online.
Available at https://www.ncbi.nlm.nih.gov. Accessed March 30, 2018.
NIH (National Institutes of Health). 2010. The Human Genome Project Completion. Online. Available at https://www.genome.gov/11006943. Accessed March 30, 2018.
NIH. NIH Genomic Data Sharing. Online. Available at https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing. Accessed March 30, 2018.
NRC (National Research Council). 1988. Report of the Committee on Mapping and Sequencing the Human Genome. Washington, DC: The National Academies Press.
The Harvard Personal Genome Project. Online. Available at https://pgp.med.harvard.edu/about. Accessed March 30, 2018.

TABLE 3-2 Big Data vs. Long-Tail Data

     Big Data                     Long-Tail Data
1    Homogeneous                  Heterogeneous
2    Large                        Small
3    Common standards             Unique standards or no standards
4    Regulated                    Not regulated
5    Central curation             Individual curation
6    Disciplinary repositories    Institutional, general or no repository

SOURCE: e-IRG, 2016.

Discovering, transforming, and reusing data collected by others has become a major part of science today, yet the process is still painful. The Research Data Alliance (RDA) and the National Data Service (NDS) are leading the way towards establishing a universal, easy-to-use data publishing and management framework, but this is an area that will require consistent long-term attention before it can be said that the problem has been solved (see Box 3-7). Clearly, scientists can learn from best practices in industry, but those techniques need to be carefully tailored to the specific needs of science (assessing data quality, refereeing process, relation to publications, easy attribution, tracking provenance).

A number of initiatives address challenges involved in managing long-tail data. For example, the RDA’s Long Tail of Research Data Interest Group, launched in 2013 with over 90 members from around the world, has developed a set of good practices for managing research data archived in the university context (RDA, 2017a). The European Library Federation (LIBER) released 10 recommendations for libraries to get started with research data management (LIBER, 2012); the Confederation of Open Access Repositories (COAR) issued the repository interoperability roadmap (COAR, 2014); and the Open Access Infrastructure for Research in Europe (OpenAIRE) links literature to data.
Additional work is needed to establish a relevant, operational ecosystem for the long tail of science during the implementation of international, national, and local e-infrastructures, possibly using automated techniques to extract the metadata needed for discovery and indexing (Cafarella et al., 2008). While reuse of data remains highly dependent on discipline- and data-specific metadata, which have long been recognized as critical for reuse (Brazma et al., 2001), support for researchers willing to invest time and effort in establishing such standards is also critical.

Scientific Collections and Sample Preservation

While much of this report focuses on digital research products, a significant percentage of research effort continues to involve collection, analysis, and use of physical specimens and materials. Metadata about specimen collections may or may not be available in digital form online.

BOX 3-7 The National Data Service

The National Data Service (NDS) is “an emerging vision for how researchers and scientists across all disciplines can find, reuse, and publish data” (NDS, 2017a). While many scientific communities are increasingly developing discipline-specific data services, the U.S. and international communities lack a unified open framework for storing, sharing, and publishing data that can be used across disciplinary boundaries (NDS, 2017b). Building on existing infrastructure for data archiving and sharing within specific communities, NDS aims to provide open, shareable tools that will support cross-disciplinary research and new discoveries to help transform education, society, and economic development. NDS focuses on innovations that bring domain-specific data management components into cross-disciplinary use, as well as projects that seamlessly integrate disparate services.

To advance this vision, the NDS Consortium has been established as a coalition of stakeholders, and its inaugural workshop was held in June 2014 in Boulder, CO. The Consortium links together National Science Foundation DataNet projects (e.g., DataONE, SEAD), Data Infrastructure Building Blocks (DIBBs) projects (e.g., NCSA Brown Dog, Whole Tale), the National Science Foundation Big Data Innovation Hubs, and other major community initiatives (e.g., EarthCube, ICPSR, MagIC); Major Research Equipment and Facilities Construction (MREFC) projects; National Institute of Standards and Technology’s (NIST) Material Measurement Laboratory; universities, libraries, civic organizations and municipalities (e.g., City of Chicago, ThinkChicago), and national organizations and the services that connect them (XSEDE, Globus, ESIP); publishers (e.g., Elsevier); and international efforts (e.g., RDA, GO FAIR, GODAN) (NDS, 2017b).
Towards a world where it is easier to search, publish, link, and reuse data of all disciplines, the NDS Consortium is advancing discovery by enabling open sharing of data, increasing collaboration within and across fields, providing large-scale data service interoperability, and facilitating an incubator of data technologies, projects, and pilots (McHenry, 2017). Additionally, the consortium launched the NDS Labs Workbench (Willis, 2017), a scalable platform for research data access, education, and training to promote data tools (NDS, 2017a). The consortium will drive impact toward an open framework that will revolutionize data sharing through effective partnerships between the U.S. and international research organizations and publishers.

References

McHenry, K. 2017. Enabling Open Science Without Impeding Open Science. Presentation to the National Academies of Sciences, Engineering, and Medicine’s Committee on Toward an Open Science Enterprise, Public Symposium. September 18, 2017.
NDS (National Data Service). 2017a. Online. Available at http://www.nationaldataservice.org. Accessed December 13, 2017.

NDS. 2017b. A vision for accelerating discovery through data sharing. Online. Available at http://www.nationaldataservice.org/NDS-Summary.pdf. Accessed December 14, 2017.
Willis, C., M. Lambert, K. McHenry, and C. Kirkpatrick. 2017. Container-Based Analysis Environments for Low-Barrier Access to Research Data. Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact 58. doi:10.1145/3093338.3104164.

Historically, scientists (especially natural scientists) have kept their collections either in museums or in central locations in their university departments, but also as personal collections in their own laboratories for their use and that of their research groups. These samples had collection data with varying levels of specificity associated with them; however, neither these data nor the physical samples were easily accessible by others. The preservation of scientific collections and data acquired with public and/or private funding, and their wide accessibility now and in the future as a public good, is supported and encouraged by professional scientific societies (e.g., AGU, 2016; GSA, 2018). McNutt et al. (2016) stated that “access to data, samples, methods, and reagents used to conduct research and analysis, as well as to the code used to analyze and process data and samples, is a fundamental requirement for transparency and reproducibility” (McNutt et al., 2016, p. 1024).

The Role of the U.S. Government

The U.S. government has supported the creation of scientific collections and their long-term management and use as far back as the early 19th century (Sztein, 2016). Federal spending comprises a high percentage of the total amount of money spent on research.
In the last two decades, there has been a drive to make scientific samples that were obtained or generated with support provided by taxpayer dollars more readily available to different actors in the scientific community. Two important reports on this topic have been published by the National Research Council (2002) and the Interagency Working Group on Scientific Collections (IWGSC) (2009, known as the “Green Report”). Reasons for preserving physical collections include: (1) preserved collections allow the replication of the original experiments; (2) samples are sometimes used as standards; (3) samples may be irreplaceable or too expensive to recollect; (4) samples can be sources of ideas and can be used for education and training; (5) samples can be used for future analysis or experimental use; (6) scientific collections can be used for purposes unforeseen when the collection was created; and (7) reprocessing of old samples with new technology allows for the generation of new knowledge.

The IWGSC was created in 2006 by the White House National Science and Technology Council to focus attention and planning for federal/federally funded collections management (IWGSC, 2016). It is managed by the White House Office of Science and Technology Policy (OSTP) and co-chaired by the U.S. Department of Agriculture and the Smithsonian Institution. Fifteen federal agencies have scientific collections and/or granting programs. The variety of physical collections is considerable. Some collections include rocks, minerals, meteorites, cellular and tissue samples, fossils, soils, and water, rock, soil, and ice cores. Others include type specimens of plants, microbes, and animals. Scientific collections can also include living organisms, such as type culture microorganism collections, seed banks and plant germplasm repositories, and other biological resource centers (IWGSC, 2009). An IWGSC survey to identify the scope and range of federally held scientific collections, conducted a decade ago (IWGSC, 2009), revealed that, of the 291 responses received, cellular/tissue scientific collections represented 22 percent (held in 10 of the 14 responding agencies); geological collections comprised 21 percent (held in eight agencies); paleontological collections represented 14 percent (held in four agencies); and vertebrate and botanical collections represented 12 percent and 11 percent, respectively (each held by seven agencies).
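As a rough worked illustration of the survey shares above, the short Python sketch below converts the reported percentages into approximate counts. It assumes that each percentage applies to the 291 responses received; the report gives only percentages, so the resulting counts are approximations introduced here for illustration and are not figures from the IWGSC survey.

```python
# Total survey responses reported by the IWGSC (2009).
TOTAL_RESPONSES = 291

# Reported shares, in percent, by collection type.
SHARES_PERCENT = {
    "cellular/tissue": 22,
    "geological": 21,
    "paleontological": 14,
    "vertebrate": 12,
    "botanical": 11,
}

# Approximate counts, assuming each share is a percentage of all responses.
approx_counts = {
    name: round(TOTAL_RESPONSES * pct / 100)
    for name, pct in SHARES_PERCENT.items()
}

# Share of responses not covered by the five listed categories.
unlisted_percent = 100 - sum(SHARES_PERCENT.values())

for name, count in approx_counts.items():
    print(f"{name}: about {count} of {TOTAL_RESPONSES} responses")
print(f"other collection types: {unlisted_percent} percent")
```

Under that assumption, the five listed categories account for 80 percent of responses, and the remainder falls under collection types not itemized in the text.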
The Green Report contained several recommendations, including the need for the development of budgeting information for collections and assessing and projecting costs; the identification and dissemination of policies and best practices on organization, management, physical and online access, and long-term preservation; and issues related to data and metadata accessibility, especially the need to document physical objects and make collection information available online, and develop an online clearinghouse for information on contents and access to federal scientific collections.

The OSTP issued a Scientific Collections memo in March 2014 (OSTP, 2014; see Appendix D), in which object-based scientific collections are defined as “sets of physical objects, living or inanimate, and their supporting records and documentation, which are used in science and resource management and serve as long-term research assets that are preserved, catalogued, and managed by or supported by Federal agencies for research, resource management, education, and other uses” (OSTP, 2014, pp. 2-3). The memo asks each agency to develop plans to manage their physical scientific collections “to improve management of and access to scientific collections,” and to function as “an essential base for developing scientific evidence and … resource for scientific research, education, and resource management.”

The end goal of this effort is the “systematic improvement of the development, management, accessibility, and preservation of scientific collections owned and/or funded by Federal agencies.” This initiative is only for long-term institutional, archival collections, not for short-term project collections. The agencies were to include, among other requirements, consideration of legislative and regulatory requirements, clarification on who has the responsibility to carry out policies, projection of the costs of developing, preserving, and managing scientific

The State of Open Science 89 collections, agency requirements and standards for long-term preservation, maintenance, accessibility for public use, strategies to provide online information about physical collection contents and access to objects and digital files, unless limited by law or to protect national interests, definition of the process to de-ac- cess, transfer, dispose of collections, assignation of resources within each agency to implement policy, consistency with the 2013 Open, Machine-Readable Data OSTP memo (White House, 2013), and a request to agencies to work together and coordinate through the IWGSC (GSA, 2018). The registry of U.S. Federal Scientific Collections is a curated source of information about object-based science collections owned or managed by U.S. federal departments and agencies (USFSC, 2018). The registry is a collaboration among the IWGSC, Scientific Collections International (SciColl), and the Smith- sonian Institution, which manages the registry. At the time of this writing, 485 institutions are involved in this initiative, which includes 148 institutional and project collections. The main goals of this registry are to improve access to infor- mation about U.S. Federal scientific collections and the institutions that maintain them; and to improve interoperability among databases by providing an authority file of unique codes and machine-readable identifiers for institutions and their collections (OSTP, 2014). The IWGSC compiled a list of the status of scientific collection policies by federal agencies (IWGSC, 2018). Of the 15 federal agencies, eight have scientific collections policies: the National Aeronautics and Space Administration, the Smith- sonian Institution, the U.S. Department of Agriculture, the U.S. Department of De- fense, the U.S. Department of Health and Human Services, the U.S. Food and Drug Administration, the National Institutes of Health, and the U.S. Environmental Pro- tection Agency. The U.S. 
Department of Interior has Interior-wide Museum collec- tion policies, and agencies within the department, such as the U.S. Geological Sur- vey (USGS), are developing their own scientific collection policies. For example, USGS is developing its policies based on comprehensive doc- uments such as the USGS Geologic Collections Management System (USGS, 2018), a process to help determine the best fate for a given collection. The man- agement of these collections and data is done through the National Geological and Geophysical Data Preservation Program (USGS, 2018). USGS provides some funds for intramural collection management and grants to State Geological Sur- veys and other Department of Interior agencies. The National Science Founda- tion's data sharing policy states “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing” (NSF, 2018a). A good physical scientific collection is properly documented, well pre- served, and curated. The metadata attached should include field number, geo- graphic location, collector, date collected, sample type, reason for collection, pro- ject name, other important data, and include analyses and derivative samples. Research specimens can be added to permanent scientific collections following

90 Open Science by Design: Realizing a Vision for 21st Century Research different pathways: from intramural federal sources, from one federal agency to another, from non-federal researchers, from private collectors, and from interna- tional collaborations and exchange (IWGSC, 2009). In addition, in order to organize the samples in any given collection, sample identification needs to be standardized. One such approach is to assign a Univer- sally Unique Identifier (UUID) to each sample and its associated metadata. In the geosciences, the System for Earth Sample Registration (SESAR) (SESAR, 2018), hosted at the Lamont-Doherty Earth Observatory of Columbia University, and supported by NSF as part of the Interdisciplinary Earth Data Alliance, operates a registry that distributes the International Geological Sample Number (IGSN). The IGSN consists of an alphanumeric code assigned to specimens and related sam- pling features to ensure both unique identification and unambiguous referencing of data generated by the study of the samples with UUIDs (USFSC, 2018). SESAR catalogs and preserves sample metadata profiles, and provides access to the sample catalog via the Global Sample Search. Individual researchers can ob- tain their own accounts, which allows them to register their samples. Using UUIDs such as the IGSN is a concrete step towards making samples FAIR (AGU, 2017b). Multidisciplinary meetings (EOS, 2017) are bringing together researchers from disciplines with different approaches to sampling and informatics specialists to discuss relationships between data and samples, issues of data representation, and the challenges of creating and maintaining links between the physical samples and the data derived from them at different collection scales. The Integrated Dig- itized Biocollections website is an initiative aimed at making “data and images for millions of biological specimens” available online (iDigBio, 2018). 
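The idea of attaching a unique identifier to each sample and its metadata can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the SampleRecord and SampleRegistry classes, the derive helper, and the field names are chosen only to mirror the metadata elements listed above. The identifier is a random UUID from Python's standard uuid module, used as a stand-in for a registry-issued identifier such as an IGSN, whose actual format and assignment process are not reproduced here.

```python
import uuid
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical metadata record for one physical sample."""
    field_number: str
    latitude: float
    longitude: float
    collector: str
    date_collected: date
    sample_type: str
    reason_for_collection: str
    project_name: str
    # Identifier of the sample this one was derived from, if any.
    parent_id: Optional[str] = None
    # Random version-4 UUID; an assumption standing in for a registry-issued ID.
    sample_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class SampleRegistry:
    """Hypothetical in-memory registry keyed by sample identifier."""

    def __init__(self) -> None:
        self._records: Dict[str, SampleRecord] = {}

    def register(self, record: SampleRecord) -> str:
        # Refuse duplicates so that every identifier stays unambiguous.
        if record.sample_id in self._records:
            raise ValueError(f"identifier already registered: {record.sample_id}")
        # A derivative sample must point at a parent that is already registered.
        if record.parent_id is not None and record.parent_id not in self._records:
            raise KeyError(f"unknown parent sample: {record.parent_id}")
        self._records[record.sample_id] = record
        return record.sample_id

    def lookup(self, sample_id: str) -> SampleRecord:
        return self._records[sample_id]

    def derivatives(self, sample_id: str) -> List[SampleRecord]:
        # All registered samples derived directly from the given sample.
        return [r for r in self._records.values() if r.parent_id == sample_id]


def derive(parent: SampleRecord, sample_type: str, reason: str) -> SampleRecord:
    """Create a derivative sample that inherits its parent's collection metadata."""
    return SampleRecord(
        field_number=parent.field_number,
        latitude=parent.latitude,
        longitude=parent.longitude,
        collector=parent.collector,
        date_collected=parent.date_collected,
        sample_type=sample_type,
        reason_for_collection=reason,
        project_name=parent.project_name,
        parent_id=parent.sample_id,
    )


if __name__ == "__main__":
    registry = SampleRegistry()
    core = SampleRecord(
        "FN-001", 45.5, -122.6, "A. Collector", date(2018, 3, 30),
        "sediment core", "illustrative example", "Demo Project",
    )
    registry.register(core)
    registry.register(derive(core, "subsample", "follow-up analysis"))
    print(core.sample_id, len(registry.derivatives(core.sample_id)))
```

Keeping the parent identifier on each derivative record is one simple way to preserve the link between a physical sample and the material and data derived from it; a production registry would also need persistent storage and an agreed metadata schema.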
Box 3-8 describes examples of open data in the discipline of the earth sciences.

The Role of Universities

In addition to government and museum repositories, universities have played an important role in the curation and archiving of scientific collections. They maintain scientific collections that are funded from both governmental and nongovernmental sources. While many are members of the Natural Science Collections Alliance (http://nscalliance.org), several large repositories are not. A few examples of such university repositories are the International Ocean Discovery Program, the Oregon State University Marine and Geology repository, the Scripps Institution of Oceanography Collections, and the University of California, Berkeley Museum of Paleontology.

Many university repositories have maintained funding through difficult times, but an alarming number are facing budget cuts that have led to closure and loss of valuable scientific collections. The fate of collections held by individual scientists working in university settings can be particularly complex. As the current generation of senior scientists retires, their collections become the responsibility of institutions that must decide what to keep and what to discard and also find and manage space for such collections. It is not uncommon for a scientist’s university department to dispose of the collections once he or she retires, with the loss of potentially valuable samples and the information associated with them. This can be the case despite the fact that these samples may be unique and irreplaceable (AGU, 2017b). Even in cases where the scientist is proactive and tries to place these collections in museums or other institutions before retirement, success is not guaranteed. One of the main reasons institutions give for rejecting these collections is the high cost associated with their proper curation and storage. Universities or other institutions holding collections sometimes decide, usually because of a lack of space, funds, or curatorial staff, or because of a change in scientific direction, to divest themselves of those collections. (For a recent case regarding a collection of Antarctic marine sediment cores, see Witze, 2016.) While the Antarctic collection has found a new home (Oregon State University, 2017), many other high-value research collections remain at risk.

BOX 3-8 Earth Sciences

Perhaps the best developed model for open data in the earth sciences is in support of the scientific ocean drilling effort whose current incarnation is the International Ocean Discovery Program (IODP; http://iodp.org/about-iodp/history). Scientific coring of the seafloor started in the 1940s and has evolved into an international collaboration with several platforms that allow drilling and recovery of sub-seafloor materials, enabling scientists to investigate samples of sediment, rock, fluids, and biota. IODP is the current implementation of this decades-long endeavor. IODP coordinates the international efforts, maintains core repositories for the physical samples, and supports an open (after an embargo period) database for most of the data generated on the ship (http://web.iodp.tamu.edu/OVERVIEW/) and an open publication portal that archives the initial publications related to the research (http://publications.iodp.org). Data generated after the expedition and in shore-based research, and publications in journals outside of IODP, however, are not part of the IODP structure.

Another example of open data in the earth sciences can be found in the seismological community. The study of earthquakes (seismology) began centuries ago and now relies on a global network of sensitive instruments that record ground motion, many of which report data in real time. Seismology has many applications, including preparing for and mitigating seismic hazards and distinguishing between explosions and earthquakes, among many others. Interpretation of seismic (or nuclear) events relies on records from around the globe, so seismologists began early on to share data.

The Incorporated Research Institutions for Seismology (IRIS) plays a leading role in archiving and providing access to observed and derived data for the global earth science community, in particular, ground motion, atmospheric, infrasonic, hydrological, and hydroacoustic data (https://www.iris.edu/hq). Earthquake data from around the world are accessible via an “earthquake browser” (http://ds.iris.edu) that displays earthquake locations in near real time and allows searching and downloading of data in several formats. There are links to open source code for analyzing the data. IRIS also has an array of materials useful for educators from K-12 through graduate programs, with many open access publications and online videos.

The twin fields of paleomagnetism and rock magnetism involve magnetic measurements on geological and archaeological materials. These endeavors contribute key evidence to a number of challenging research problems in the earth sciences, including (1) understanding of past climate changes and their relation to the Earth’s magnetic field; (2) the evolution of structure in the Earth’s core, its boundary, and associated influences on the geomagnetic field; (3) the geodynamics of the Earth’s mantle, where magnetic data are crucial in determining the fixity of mantle plumes like Hawaii and the possibility of true polar wander; (4) biogeomagnetism; and (5) magnetism at high pressures and in extraterrestrial bodies including other planets. The Magnetics Information Consortium (MagIC; http://earthref.org/MagIC) provides a data archive that allows the discovery and reuse of such data for the broader earth sciences community.

MagIC began in 2002 as an NSF-funded project to develop a comprehensive database for archiving paleomagnetic and rock magnetic data, ranging from laboratory measurements to a variety of derived data and metadata, such as the positions of the spin axis of the Earth from the point of view of the wandering continents, the variations of the strength and direction of the field through time, and changes in environmentally controlled rock magnetic mineralogy. Closely linked to the MagIC project is open source software for the conversion of laboratory data to a common data format that allows interpretation of the data in a consistent and reproducible manner. Once published, the data and interpretations can be uploaded into the MagIC database. All software involved with the MagIC project is freely available in GitHub repositories. MagIC also maintains an open access textbook on rock magnetism and paleomagnetism and links the data to the original publications (only a portion of which are currently openly available).

While specimen images and other analytical information can be placed online and used by researchers around the world, this does not mean that the actual specimens can be discarded (Nature, 2017). Technologies not yet developed might yield important discoveries when applied to scientific specimens in the future, and analyses performed with those new techniques can supplement original analyses to test novel questions (McNutt, 2016). One such case is the reconstruction of the 1918 influenza virus through RNA sequencing of highly degraded virus fragments recovered from tissue samples from victims of that pandemic, which became possible only after the development of PCR techniques in the 1980s. The reconstruction of the 1918 influenza virus allowed the development of novel insights into its biology and pathogenesis and provided important information about prevention and control of future pandemics (Taubenberger et al., 2012). Box 3-9 describes recent examples of scientific collections in the field of biological sciences.

All researchers in any type of setting need to consider their physical collection and data management plans at the earliest stages of their research. The preservation of physical samples presents challenges similar to those presented by digital datasets: accessibility, decisions on what to save and what to discard, how to manage what is being saved, and issues of discoverability and reuse (Sztein, 2016). Funding considerations frequently determine the preservation of collections, their associated metadata, and the databases that permit the discoverability and reuse of those collections. Funding stability would greatly assist in the preservation of those valuable resources for future generations.

BOX 3-9 Precision Medicine

In 2011, a National Research Council consensus study published a bold new vision for research in health and medicine (NRC, 2011). The significant advances in molecular biology, together with the promise afforded by electronic health records, made it an opportune time to consider new ways of defining diseases while gaining a deeper understanding of disease mechanisms, pathogenesis, and treatments. In early 2015, as part of his State of the Union address, President Obama announced the Precision Medicine Initiative (PMI) and the funding that would accompany it (The White House, 2015a).

The National Institutes of Health (NIH) announced its plan for implementing the initiative later that year (Collins and Varmus, 2015; NIH, 2015). The program, named the All of Us Research Program, involves recruiting at least 1 million individuals and collecting biological, health, behavioral, and environmental data about them. Participants in the program must be willing to share their health data, provide a biospecimen, and be recontacted for future research.

The PMI data are envisioned as a public resource that will be accessible not only to researchers, but also to interested members of the public, e.g., “citizen scientists.” The specifics of data sharing and access are under development and are expected to adhere to a set of privacy and trust principles (The White House, 2015b). These principles include complying with legal and other regulatory requirements, adequately informing participants about how their data will be used, developing multiple tiers of data access based on data type and use, and establishing measures for protecting PMI data from unauthorized use. Notably, and to “enrich the public data resource,” the principles require that users of the data publish or publicly post the outcome of their research, including negative outcomes.

The Million Veteran Program, an observational cohort study and “mega-biobank” effort, is a Department of Veterans Affairs (VA) research effort (Gaziano et al., 2016). Veterans are asked to provide a blood sample, respond to a number of questionnaires, and allow access to their electronic health records housed at the VA. Currently, access to the data is limited to VA-affiliated researchers, but future plans include broadening that access and potential collaboration with the All of Us program.

References

Collins, F. S., and H. Varmus. 2015. A new initiative on precision medicine. The New England Journal of Medicine 372(9):793-795.

Gaziano, J. M., J. Concato, M. Brophy, J. Fiore, S. Pyarajan, J. Breeling, S. Whitbourne, J. Deen, C. Shannon, D. Humphries, P. Guarino, M. Aslan, D. Anderson, R. LaFleur, T. Hammond, K. Schaa, J. Moser, G. Huang, S. Muralidhar, R. Przygodzki, and T. J. O’Leary. 2016. Million Veteran Program: A mega-biobank to study genetic influences on health and disease. Journal of Clinical Epidemiology 70:214-223.

NIH (National Institutes of Health). 2015. The Precision Medicine Initiative cohort program – Building a research foundation for 21st century medicine. Online. Available at https://acd.od.nih.gov/documents/reports/DRAFT-PMI-WG-Report-9-11-2015-508.pdf. Accessed March 30, 2018.

NRC (National Research Council). 2011. Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. Washington, DC: National Academies Press.

The White House. 2015a. Remarks by the President on Precision Medicine. Online. Available at https://obamawhitehouse.archives.gov/the-press-office/2015/01/30/remarks-president-precision-medicine. Accessed March 30, 2018.

The White House. 2015b. Precision Medicine Initiative: Privacy and Trust Principles. Online. Available at https://obamawhitehouse.archives.gov/sites/default/files/microsites/finalpmiprivacyandtrustprinciples.pdf. Accessed March 30, 2018.

Open Repositories

A number of organizations provide repositories for archiving datasets.
For example, the Registry of Research Data Repositories (Re3Data), formerly DataBib, provides the largest and most comprehensive registry of data repositories, listing over 1,500 repositories across a wide range of disciplines from around the world. A publication, Metadata Schema for the Description of Research Data Repositories (Version 3.0), released in 2015, describes the re3data.org properties (Rücknagel et al., 2015).

PLOS has identified a set of trusted repositories that are recognized within their communities (see Table 3-3). For example, the Inter-university Consortium for Political and Social Research (ICPSR) is a large archive of digital social science data (MIT Libraries, 2018). For biomedical and environmental science repositories and field standards, PLOS suggests that researchers use FAIRsharing (FAIRsharing, 2017) and Re3Data, which provide criteria to identify appropriate data repositories, including licensing, certificates and standards, policy, and other criteria. Additionally, Scientific Data (http://www.nature.com/sdata/policies/repositories) provides a list of repositories that have been evaluated to ensure that they meet its requirements for data access, preservation, and stability. Box 3-10 illustrates open data practices for economics research.

BOX 3-10 Economics

Unlike other social and behavioral sciences such as sociology and psychology, where researchers generate their own data, economics has typically relied upon government-collected data and statistics. As a result, every researcher has had access to these data collections.

Economics organizations have worked to make research data more accessible. Since the 1970s the National Bureau of Economic Research (NBER) has maintained a public use data archive (http://www.nber.org/data) that started with lending out 9-track tapes of federal data collections such as the Current Population Survey to NBER researchers. When Internet access became available in the 1990s, NBER added data to its website. Data were shared and made available as a way of treating economics as a science where reproducibility is part of the process. These data are widely used by social science researchers. For example, the NBER working paper associated with the NBER patent database has over 3,000 citations in Google Scholar.

The Federal Reserve Bank of St. Louis started the Federal Reserve Economic Data site (FRED) in the 1990s as a way of compiling economic time series data in one location. FRED started as a dial-in electronic bulletin board that moved onto the web in 1995 (FRED, 2018). FRED currently hosts 504,000 U.S. and international time series from 87 sources (https://fred.stlouisfed.org/) and features online tools, an API, and tools for smart phones. More recently, the Center for the Advancement of Data and Research in Economics (CADRE) at the Federal Reserve Bank of Kansas City began working to document data inputs and methods for various fields in economics, with an emphasis on widely used microeconomic datasets such as the Current Population Survey and the Survey of Income and Program Participation (Federal Reserve Bank of Kansas City, 2018).

As behavioral and experimental economics grew as a field, the American Economic Association developed the Randomized Controlled Trial (RCT) Registry in 2013. By 2017, it had registered over 1,000 RCTs in over 100 countries (AEA, 2017). Investigators can voluntarily register their RCTs and related projects. The economics profession has also responded to issues associated with conflict of interest among researchers. The movie Inside Job showed that some economists had been paid to generate research that supported the sponsor’s point of view, and that the disclosure of sponsor relationships was sometimes lacking. In 2012, the NBER adopted a conflict-of-interest policy, and the American Economic Association followed suit shortly afterward, under which researchers are required to disclose financial conflicts of interest associated with sponsored research and publications (NBER, 2012; AEA, 2018). The culture of openness has resulted in the economics profession being a relatively FAIR discipline, which may have extended the intellectual reach of economics research (Angrist et al., 2017).

References

AEA (American Economic Association). 2017. A milestone in research transparency: the AEA’s RCT Registry now contains 1,000+ studies from over 100+ countries! Online. Available at https://www.aeaweb.org/news/rct-registry-over-1000. Accessed March 30, 2018.

AEA. 2018. Disclosure Policy. Online. Available at https://www.aeaweb.org/journals/policies/disclosure-policy. Accessed March 30, 2018.

Angrist, J., P. Azoulay, G. Ellison, R. Hill, and S. F. Lu. 2017. Inside Job or Deep Impact? Using Extramural Citations to Assess Economic Scholarship. The National Bureau of Economic Research Working Paper 23698.

FRED (Federal Reserve Bank of St. Louis). 2018. What is FRED? Online. Available at https://fredhelp.stlouisfed.org/fred/about/about-fred/what-is-fred. Accessed March 30, 2018.

Federal Reserve Bank of Kansas City. 2018. Data Services. Online. Available at https://www.kansascityfed.org/research/cadre/dataservices. Accessed March 30, 2018.

NBER (National Bureau of Economic Research). 2012. Research Financial Conflict of Interest Policy. Online. Available at http://admin.nber.org/COI/NBER_ResearchFCOI_Policy.pdf.

A growing number of universities are starting to build research data repositories to help researchers manage data, preserve data for the long term, and allow permanent access to datasets in a reliable environment. MIT offers DSpace, a repository established to capture, distribute, and preserve the digital products of MIT faculty and researchers. The Harvard Dataverse Network (DVN), supported by the Harvard-MIT Data Center and Institute for Quantitative Social Science (IQSS), is a repository infrastructure that includes a large collection of research data in the social sciences (Harvard Dataverse, 2018; MIT Library, 2018). The University of Minnesota Libraries also list popular data repositories categorized by subject, including agricultural sciences; archaeology; astronomy; biological and life sciences; chemistry; computer science and source code; earth, environmental, and geosciences; GIS and geography; health and medical sciences; physics; and social sciences (University of Minnesota Libraries, 2018). Data availability facilitates reproducibility of research; allows validation, replication, reanalysis, new analysis, reinterpretation, or inclusion in meta-analyses; and makes citation of data and research articles easier by ensuring recognition for authors (PLOS One, 2018).

TABLE 3-3 Open Data Repositories

| Discipline | Repository | Link |
|---|---|---|
| Cross-disciplinary | Dryad Digital Repository | http://datadryad.org |
| Cross-disciplinary | Figshare | http://figshare.com |
| Cross-disciplinary | Harvard Dataverse Network | http://thedata.harvard.edu/dvn |
| Cross-disciplinary | Open Science Framework | http://osf.io |
| Cross-disciplinary | Zenodo | http://zenodo.org |
| Biochemistry | caNanoLab | http://cananolab.nci.nih.gov/caNanoLab |
| Biochemistry | Kinetic Models of Biological Systems (KiMoSys) | http://www.kimosys.org |
| Biochemistry | Mass spectrometry Interactive Virtual Environment (MassIVE) | http://massive.ucsd.edu |
| Biochemistry | PubChem | http://pubchem.ncbi.nlm.nih.gov |
| Biochemistry | Standards for Reporting Enzymology Data (STRENDA DB) | https://www.beilstein-strenda-db.org/strenda/index.xhtml |
| Biomedical Sciences | The Cancer Imaging Archive (TCIA) | http://www.cancerimagingarchive.net |
| Biomedical Sciences | Influenza Research Database | http://www.fludb.org |
| Biomedical Sciences | National Addiction & HIV Data Archive Program (NAHDAP) | http://www.icpsr.umich.edu/icpsrweb/NAHDAP/index.jsp |
| Biomedical Sciences | National Database for Autism Research (NDAR) | http://ndar.nih.gov |
| Biomedical Sciences | PhysioNet | http://physionet.org |
| Biomedical Sciences | SICAS Medical Image Repository | https://www.smir.ch |
| Marine Sciences | SEA scieNtific Open data Edition (SEANOE) | http://www.seanoe.org |
| Model Organisms | The Arabidopsis Information Resource (TAIR) | http://www.arabidopsis.org |
| Model Organisms | Eukaryotic Pathogen Database Resources (EuPathDB) | http://eupathdb.org/eupathdb |
| Model Organisms | FlyBase | http://flybase.org |
| Model Organisms | Mouse Genome Informatics (MGI) | http://www.informatics.jax.org |
| Model Organisms | Rat Genome Database (RGD) | http://rgd.mcw.edu |
| Model Organisms | SmedGD | http://smedgd.neuro.utah.edu |
| Model Organisms | VectorBase | http://www.vectorbase.org/index.php |
| Model Organisms | WormBase | http://www.wormbase.org/#01-23-6 |
| Model Organisms | Xenbase | http://www.xenbase.org/common |
| Model Organisms | Zebrafish Model Organism Database (ZFIN) | http://zfin.org |
| Neuroscience | Functional Connectomes Project International Neuroimaging Data-Sharing Initiative (FCP/INDI) | http://fcon_1000.projects.nitrc.org |
| Neuroscience | NeuroMorpho.org | http://neuromorpho.org/neuroMorpho/index.jsp |
| Neuroscience | OpenfMRI | http://openfmri.org |
| Omics | ArrayExpress | http://www.ebi.ac.uk/arrayexpress |
| Omics | Biological General Repository for Interaction Datasets (BioGRID) | http://thebiogrid.org |
| Omics | Database of Interacting Proteins (DIP) | http://dip.doe-mbi.ucla.edu/dip/Main.cgi |
| Omics | dbGAP | http://www.ncbi.nlm.nih.gov/gap |
| Omics | The European Genome-phenome Archive (EGA) | http://www.ebi.ac.uk/ega |
| Omics | Gene Expression Omnibus (GEO) | http://www.ncbi.nlm.nih.gov/geo |
| Omics | GenomeRNAi | http://www.genomernai.org |
| Omics | GPM DB | http://gpmdb.thegpm.org/index.html |
| Omics | IntAct Molecular Interaction Database | http://www.ebi.ac.uk/intact |
| Omics | MetaboLights | http://www.ebi.ac.uk/metabolights |
| Omics | NURSA | https://www.nursa.org/nursa/index.jsf |
| Omics | PeptideAtlas | http://www.peptideatlas.org |
| Omics | ProteomeXchange | http://www.proteomexchange.org |
| Omics | Proteomics Identifications (PRIDE) | http://www.ebi.ac.uk/pride/archive |
| Physical Sciences | Australian Antarctic Data Centre (AADC) | http://www1.data.antarctica.gov.au |
| Physical Sciences | Cold and Arid Regions Science Data Center (CARD) | http://card.westgis.ac.cn |
| Physical Sciences | Environmental Data Initiative Repository | https://portal.edirepository.org/nis/home.jsp |
| Physical Sciences | National Climatic Data Center (NCDC) | http://www.ncdc.noaa.gov |
| Physical Sciences | National Environmental Research Council Data Centres (NERC) | http://www.nerc.ac.uk/research/sites/data |
| Physical Sciences | Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) | http://daac.ornl.gov |
| Physical Sciences | PANGAEA | http://www.pangaea.de |
| Physical Sciences | Reaction Database Standard Search Interface | http://durpdg.dur.ac.uk/HEPDATA/REAC |
| Physical Sciences | SIMBAD Astronomical Database | http://simbad.u-strasbg.fr/simbad |
| Physical Sciences | UK Solar System Data Centre | http://www.ukssdc.ac.uk |
| Physical Sciences | World Data Center for Climate at DKRZ (WDCC) | http://www.wdc-climate.de |
| Sequencing | Database of Genomic Variants Archive (DGVa) | http://www.ebi.ac.uk/dgva |
| Sequencing | dbSNP | http://www.ncbi.nlm.nih.gov/snp |
| Sequencing | dbVar | http://www.ncbi.nlm.nih.gov/dbvar |
| Sequencing | DNA DataBank of Japan (DDBJ) | http://www.ddbj.nig.ac.jp |
| Sequencing | EBI Metagenomics | http://www.ebi.ac.uk/metagenomics |
| Sequencing | EMBL Nucleotide Sequence Database (ENA) | http://www.ebi.ac.uk/ena |
| Sequencing | European Variation Archive (EVA) | http://www.ebi.ac.uk/eva/?Home |
| Sequencing | GenBank | http://www.ncbi.nlm.nih.gov/genbank |
| Sequencing | miRBase | http://www.mirbase.org |
| Sequencing | NCBI Sequence Read Archive (SRA) | http://www.ncbi.nlm.nih.gov/sra |
| Sequencing | NCBI Trace Archive | http://www.ncbi.nlm.nih.gov/Traces/home |
| Sequencing | Uniprot | http://www.ebi.ac.uk/uniprot |
| Social Sciences | Data Archiving and Networking Services (DANS) | https://easy.dans.knaw.nl/ui/home |
| Social Sciences | Inter-university Consortium for Political and Social Research (ICPSR) | https://www.icpsr.umich.edu/icpsrweb/landing.jsp |
| Social Sciences | Qualitative Data Repository | https://qdr.syr.edu |
| Structural Databases | Biological Magnetic Resonance Data Bank (BMRB) | http://www.bmrb.wisc.edu |
| Structural Databases | Cambridge Crystallographic Data Centre (CCDC) | https://www.ccdc.cam.ac.uk |
| Structural Databases | Coherent X-ray Imaging Data Bank (CXIDB) | http://www.cxidb.org |
| Structural Databases | Crystallography Open Database (COD) | http://www.crystallography.net |
| Structural Databases | Electron Microscopy Data Bank (EMDB) | http://www.emdatabank.org |
| Structural Databases | FlowRepository | https://flowrepository.org |
| Structural Databases | Protein Circular Dichroism Data Bank (PCDDB) | http://pcddb.cryst.bbk.ac.uk |
| Structural Databases | Worldwide Protein Data Bank (wwPDB) | http://wwpdb.org |
| Taxonomic & Species Diversity | Global Biodiversity Information Facility (GBIF) | http://www.gbif.org |
| Taxonomic & Species Diversity | Integrated Taxonomic Information System (ITIS) | http://www.itis.gov |
| Taxonomic & Species Diversity | Knowledge Network for Biocomplexity (KNB) | https://knb.ecoinformatics.org |
| Taxonomic & Species Diversity | NCBI Taxonomy | http://www.ncbi.nlm.nih.gov/taxonomy |
| Unstructured and/or Large Data | BioStudies | https://www.ebi.ac.uk/biostudies |
| Unstructured and/or Large Data | CSIRO Data Access Portal | https://data.csiro.au |
| Unstructured and/or Large Data | GigaDB | http://gigadb.org |
| Unstructured and/or Large Data | SimTK | https://simtk.org |
| Unstructured and/or Large Data | Swedish National Data Service | https://snd.gu.se/en |

SOURCE: http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

Sharing and Preserving Research Software

Sharing and preserving research software code has become an increasingly important issue in recent years. Journals have been introducing policies and new capabilities, including new editorial staff and technical tools, to ensure that the analytical code associated with an article meets certain quality standards and is made available (Baker, 2016). Concerns about reproducibility have provided a major impetus for this trend. Reanalyzing or verifying data requires use of the original code. Regarding long-term preservation of code, the relevant practices, barriers, and considerations are largely the same as those related to data. The challenges of ensuring that data are properly cited are covered in Chapter 4; citation practices for software are even less developed. Some institutional repositories have developed guidelines and best practices in software preservation that they are using in their communities (Rios, 2016).

Reproducing a study or using older data requires the code and/or software utilized during the experiment or the research undertaken. Oftentimes, researchers do not believe that they have the right to preserve software due to licensing terms and conditions. Aufderheide et al. (2018) addressed the issue of software preservation and found that “individuals and institutions need clear guidance on the legality of archiving legacy software to ensure continued access to digital files of all kinds and to illuminate the history of technology” (Association of Research Libraries, 2018). Code and reproducibility are discussed further in Chapter 4.

Alongside the importance of code in the vision of open science, there is a need to address non-computational methodologies. These methods include preregistration of studies, most common in clinical research and psychology, which could be expanded to other areas of science. Kimmelman et al. (2014) stressed the importance of separating exploratory from confirmatory research, and in this context, registration of confirmatory experiments in preclinical research has been suggested (Kimmelman et al., 2014; Mogil and Macleod, 2017). Additionally, publication of laboratory protocols via electronic research notebooks or open access repositories for science methods such as protocols.io, which allows forking and amendments to existing protocols, can help accelerate methodological development toward an open science enterprise (PLOS, 2017b; Goodman, 2018).

International Approaches

Open science approaches are being broadly assessed and adopted throughout the world. The British Royal Society has prepared an extensive report on issues related to open science (The Royal Society, 2012). In 2015, the European Council and the Group of Seven (G7) adopted open science and the reusability of research data as a priority. The FAIR principles were adopted by Science Europe and endorsed by the G20 at the 2016 Hangzhou summit (Mons et al., 2017). At its September 2017 meeting in Turin, Italy, the G7 committed to giving incentives for open science activities and to providing global research infrastructures on the basis of FAIR data (G7, 2017). While the FAIR principles are increasingly recognized by governments, the private sector, and the scientific community globally, infrastructure needs have been addressed most intensively in Europe, Australia, and Africa. This section describes key community-driven initiatives toward an open science enterprise at a global level.

Research Data Alliance

The Research Data Alliance (RDA) is a global community-driven organization, launched in 2013 with support from the EC, U.S. National Science Foundation (NSF), U.S. National Institute of Standards and Technology (NIST), and Australia’s Department of Industry, Innovation and Science, to accelerate data sharing and data-driven innovation. As of October 2017, the RDA comprised more than 6,000 individuals from over 130 countries, including researchers, policy makers, and open science enablers and promoters (RDA, 2017b). Through its Working and Interest Groups, RDA creates infrastructure (tools, models, preliminary standards, code, curriculum, policy, etc.) that is developed and deployed to address specific challenges in data sharing and data-driven research. For example, the RDA Data Publishing Services Working Group developed a model for “an open, universal literature-data cross-linking service that improves visibility, discoverability, reuse, and reproducibility by bringing existing article/data links together, normalizes them using a common schema, and exposes the full set as an open service” (RDA, 2017b). Other RDA outputs include models for machine-readable data type registries, approaches to data citation for data collections that change over time, curriculum for data science instruction, a common metadata vocabulary for agricultural data, and other infrastructure needed to enable data-driven research.

The RDA meets twice a year at Plenaries around the world to accommodate its global community. Its meetings are working meetings where many of its Interest and Working Groups get together to advance the conceptualization, development, deployment, and adoption of its infrastructure outputs, and meet with a broad spectrum of stakeholders and communities. Both the U.S. and European regions of the RDA support the engagement of early career professionals with RDA Working and Interest Groups. RDA plenaries, programs, and operations are supported through its regions by funders from around the world including the National Science Foundation, the National Institute of Standards and Technology, the EC, the Australian Government Department of Education and Training, the United Kingdom’s nonprofit company JISC (formerly the Joint Information Systems Committee), the Japan Science and Technology Agency, Research Data Canada, University of Montreal, the Alfred P. Sloan Foundation, the John D. and Catherine T. MacArthur Foundation, and others.

International Council for Science

The Committee on Data for Science and Technology (CODATA) was created by the International Council for Science (ICSU) in 1966 with the mission “to improve the quality, reliability, management, accessibility and use of data of importance to all fields of science and technology” (CODATA, 2016). As the ICSU Committee on Data, CODATA promotes international collaboration to improve the availability, usability, and interoperability of research data. CODATA’s 2015 Strategic Plan and 2016 Prospectus of Strategy and Achievement identify its three priority areas (CODATA, 2017):

1. Promoting principles, policies and practices for open data and open science;
2. Advancing the frontiers of data science; and
3. Building capacity for open science by improving data skills and the functions of national science systems needed to support open data.

CODATA achieves these objectives through its standing committees, strategic initiatives, and Task Groups and Working Groups. CODATA supports the Data Science Journal and collaborates on major data conferences, such as SciDataCon and IDW. A landmark publication of Science International (composed of ICSU, the InterAcademy Partnership, The World Academy of Sciences, and the International Social Science Council) entitled Open Data in a Big Data World highlights critical issues related to open data and open science while laying out a framework for how the vision of Open Data in a Big Data World can be achieved (Science International, 2015). Additionally, CODATA supports educational opportunities for early career researchers, including the International Training Workshop in Open Data for Better Science, through a grant from the Chinese Academy of Sciences.

Another ICSU interdisciplinary body, the World Data System (ICSU-WDS), was created in 2008, building on the more than 50-year legacy of the World Data Centers and the Federation of Astronomical and Geophysical Data Analysis Services. ICSU-WDS promotes universal and equitable access to scientific data and data services, products, and information across a range of disciplines including the natural and social sciences and humanities (ICSU-WDS, 2017).

European Activities

Open Science is one of three priority areas for research, science, and innovation policy in Europe (EC, 2017d). To support the transition to more effective open science, the EC launched the European Open Science Cloud (EOSC) in 2016, with a vision for “a federated, globally accessible environment where researchers, innovators, companies and citizens can publish, find and reuse each other’s data and tools for research, innovation and educational purposes” (EC, 2016). The EOSC High Level Expert Group, including 10 members from European countries, Japan, and Australia, released its first report, Realising the European Open Science Cloud, which provides specific recommendations to the Commission regarding actions needed to implement the EOSC (EC, 2016). In June 2017, the EOSC summit was held in Brussels to further discuss how to make EOSC a reality by 2020.

In 2016 the EC also established the Open Science Policy Platform (OSPP), a high-level expert advisory group that will support the development and implementation of open science policy in Europe. The group is tasked with addressing various dimensions of open science, including the establishment of a reward system, the measurement of quality and impact, the change of business models for publishing, FAIR open data, EOSC activities, research integrity, and open education (EC, 2017e). At the request of the Commission, RAND Europe and other entities, such as Deloitte, Digital Science, Altmetric, and Figshare, developed a monitor that tracks open science trends in Europe while identifying the main drivers, incentives, and constraints on its evolution.

Additionally, a group of European Union (EU) member states is preparing the Global Open (GO) FAIR initiative that focuses on involving all networked initiatives, research disciplines, and interested EU member states to make research data FAIR. The Netherlands has initiated and co-leads the early development of the GO FAIR initiative. The three pillars of GO FAIR are (1) GO CHANGE, which aims to promote cultural change to make the FAIR principles a working standard; (2) GO TRAIN, which deals with locating, creating, maintaining, and sustaining the required data expertise in Europe through training and education; and (3) GO BUILD, regarding the need for interoperable and federated data infrastructures (Dutch Techcentre for Life Sciences, 2016). GO FAIR encourages close cooperation with activities in other regions, such as the NIH Commons (Bonazzi and Bourne, 2017), to build an Internet of FAIR data and services.

The European Southern Observatory (ESO) has recently endorsed the EOSC Declaration and expressed its support for the EOSC initiative on open access to scientific data. The ESO emphasizes that astronomy has been leading well-managed, curated open access to data in scientific research (ESO, 2017).

Other Global Initiatives

Significant efforts are also underway in Australia and Africa to promote the transition towards an open science system. The Australian National Data Service (ANDS), established in 2008, is a joint collaboration among Monash University, the Australian National University, and the Commonwealth Scientific and Industrial Research Organisation (CSIRO) that addresses the challenges of managing research data in the country. Another effort is Australia’s Academic and Research Network (AARNet), a high-speed, low-latency network infrastructure for research and education across a diverse range of disciplines in the sciences and humanities (AARNet, 2017). In Africa, the East African Community has recently adopted the Dakar Declaration on Open Science in Africa, following the Sci-GaIA workshops (Barbera et al., 2015) related to the promotion of open science across Africa (CODESRIA, 2017). In South Africa, the African Data Intensive Research Cloud “aims to establish resources to support data intensive radio astronomy research among collaborating partners in South Africa and African Square Kilometer Array telescope partner countries” (Simmonds et al., 2016, p. 1). Additionally, CODATA is supporting the African Open Science Platform to improve the impact of open data across the research community.


