National Academies Press: OpenBook

Biological Collections: Ensuring Critical Research and Education for the 21st Century (2020)

Chapter: 5 Generating, Integrating, and Accessing Digital Data

Suggested Citation:"5 Generating, Integrating, and Accessing Digital Data." National Academies of Sciences, Engineering, and Medicine. 2020. Biological Collections: Ensuring Critical Research and Education for the 21st Century. Washington, DC: The National Academies Press. doi: 10.17226/25592.

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

5 Generating, Integrating, and Accessing Digital Data

Throughout most of their history, biological collections and the physical specimens they contained were explicitly linked to the physical locations where they were housed. These biological collections consisted of specimens and their accompanying data in written records, and to access the collections users had to travel to the collection or receive specimens through the mail. That is changing now, however, as increasing numbers of biological collections have been digitized. This digitization¹ of specimen data, combined with the cyberinfrastructure² that underlies how digital data are stored, managed, and used, has fundamentally transformed the biological collections community (Ball-Damerow et al., 2019; Hedrick et al., 2020) and the work of researchers who rely on biological collections, as digitization makes possible the remote examination of biological collections and greatly enhances their discoverability and usefulness.

A key component of digitization has been the development of collection databases that provide digital specimen data to aggregated data repositories, producing a global biodiversity infrastructure. Online data repositories democratize access to digital specimen data, making possible new avenues of scientific inquiry, promoting the multiplication and expansion of research collaborations and community networks, and providing a greater range of educational and training opportunities (Lacey et al., 2017; Monfils et al., 2017). A robust cyberinfrastructure can also facilitate evaluation and the development of metrics for assessing the diversity of biological collections and their impact on research and education (Meehan et al., 2018) (see Chapters 2 and 3).
Biological collections have driven increasingly integrative and collaborative science, with the potential to address a wide variety of problems from disease, such as coronavirus disease 2019 (COVID-19) (Cook et al., 2020), to species responses to climate change (Meineke et al., 2018), which in turn has intensified the need for greater access to high-quality digital data. Over the past decade, a wide range of advances in the process of generating digital data of all kinds and building the cyberinfrastructure for biological collections has emerged. However, the robust cyberinfrastructure that the biological collections community requires has yet to be fully realized. This chapter focuses on the challenges of and strategies for advancing the accessibility and integration of digital biological collections for research and education.

CURRENT STATE OF DIGITIZATION, DATA, AND CYBERINFRASTRUCTURE

Digitization: An Evolving Process

Biological collections encompass a diverse array of specimen data that span biological, physiological, temporal, and spatial features of the specimens. Digitization is the process of converting these analog or printed specimen data from specimen labels, field notes, card catalogs, ledgers, genetic sequences, images, audio, and video recordings, and more into digital representations. Digitization helps preserve the long-term integrity of specimens by allowing researchers to inspect metadata and digital images without having to access and physically handle the specimens while opening new avenues of data-driven research (e.g., ecological niche modeling).

¹ The conversion of textual, image, or sound-based specimen information to digital formats.
² Cyberinfrastructure, a term first used by the National Science Foundation, encompasses the computing systems, repositories, advanced instruments, software, high-performance networks, and people that enable/support data acquisition, storage, management, integration, mining, analysis, visualization, and distribution (adapted from Stewart et al., 2019). See https://scholarworks.iu.edu/dspace/handle/2022/12967.

The biological collections community has spent decades digitizing specimen data to increase their visibility and accessibility to researchers, educators, and the general public. In fact, the digitization of specimens and associated materials and the uploading of these digital data into online platforms has long been a requirement for funding programs such as the National Science Foundation (NSF) Living Stock Collections for Biological Research program and its successor, the Collections in Support of Biological Research (CSBR) program, among others.

In 2010 the Network Integrated Biocollections Alliance³ (NIBA) outlined a vision and strategic plan to “document the nation’s biodiversity resources and create a dynamic electronic resource that will serve the country’s needs in answering critical questions.” At that point, it was estimated that only approximately 10 percent of all specimens in natural history collections worldwide had been digitized (Page et al., 2015). NSF responded to elements of the NIBA plan by establishing the Advancing Digitization of Biodiversity Collections (ADBC)⁴ program, which funds digitization efforts that coalesce around scientific questions or themes through extensive collaborative networks, called thematic collections networks (TCNs), overseen by the national coordinating center for these efforts, Integrated Digitized Biocollections (iDigBio).⁵ iDigBio now hosts more than 121 million digital specimen records, the majority of which were largely unavailable to users 10 years ago. Based on iDigBio’s digitized holdings compared with estimates of specimens held in U.S. collections, it is now estimated that about 30 percent of all natural history specimens in the United States have been digitized.
However, there is still a long way to go until all collections have been digitized, particularly given the challenges posed by certain types of collections and the need for a workforce with both curatorial and data management skills. Nevertheless, thanks to recent efforts, research using natural history collections data, as measured by citation in publications, has increased dramatically over the past decade, reflecting the increasing number of digitized collections (e.g., Ball-Damerow et al., 2019; Heberling et al., 2019) (see Figure 5-1).

³ See https://digbiocol.files.wordpress.com/2010/08/niba_brochure.pdf.
⁴ See https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503559.
⁵ See https://www.idigbio.org.

FIGURE 5-1 Publications using digitized natural history data provided and/or served by the NSF-supported Advancing Digitization of Biodiversity Collections (ADBC) Program, 2010–2019. A. Cumulative number of publications that reference the national digitization effort versus those that use data served by iDigBio and related portals. B. Cumulative number of publications authored by ADBC-supported investigators versus those authored by the larger community.

The development of digitization workflows (e.g., Haston et al., 2012; Karim et al., 2016; Nelson et al., 2012, 2015; Tulig et al., 2012) over the past decade, coupled with an emerging community of practice among collections professionals, provides a roadmap for accelerating the pace of digitization in the coming decade if sufficient funding can be made available. These digitization workflows provide institutions that house biological collections with guiding principles that can be adapted to their varied needs, collection sizes, and capabilities. Additionally, workshops organized and sponsored by iDigBio⁶ and others have made digitization more widely adopted, better understood, and more efficient across the natural history collections community.

Living and natural history collections follow the same general digitization workflow (see Figure 5-2), with all collections providing data on the source of the specimen, date of sampling, the collector, and other attributes of provenance. However, the workflows will differ between collections due to the unique digitization priorities of each collection and the varying needs of their respective research and end-user communities. Rapid technological advances in digitization and cyberinfrastructure have allowed a large amount of historical data to be converted into digital representations over the last 20 years.
The current digitization process for existing specimens typically involves hand-entering primary data from a specimen label, field notes, card catalog, or ledger into a database, which can be time-consuming. As described later in this chapter, numerous attempts have been made to speed up this process while preserving the quality of the digital data produced. The pace of digitizing newly acquired specimens, on the other hand, is much more rapid. Specimen data are increasingly “born” digital, that is, directly produced in digital format (e.g., GPS locations, digital spreadsheets, nucleic acid sequencing, three-dimensional images, computed tomography), which drastically reduces the amount of time required to create specimen records and integrate and share them online.

⁶ See https://www.idigbio.org/content/workflow-modules-and-task-lists.

FIGURE 5-2 Generalized digitization workflow. The above shows the common pathway for digitizing specimens in living and natural history collections. The workflow begins with curation and preparation of specimens. Thereafter, digitization starts with primary and additional associated specimen data, shown above occurring in two different pathways: (a) the creation of the primary digital specimen record and digitization of the associated data occur concurrently; and (b) the creation of the primary digital specimen record happens first (e.g., scanning of herbarium sheet to digitize record from label), followed by digitization of the associated data, sometimes at a much later date. Specimen data are then associated, typically in a relational database, and stored, ideally, on a server (in-house or cloud-based). Once specimens and their associated data have been digitized, the digital datasets can then be used in-house (e.g., tracking loans and users) and, increasingly, more globally by inclusion in external aggregation sites (e.g., the Global Biodiversity Information Facility, the Global Catalogue of Microorganisms), where they can be discovered and used by a wide range of user groups (e.g., ecologists, policymakers, educators).

Toward Accessible and Integrated Data

Digital data from biological collections can be organized into one or more datasets that are collectively stored in local databases. At the local level, digitized collections can be easier to manage than non-digitized collections and may improve the ability of the collections managers to provide access, respond to requests, physically manage space, and allocate budget resources. Digital collection databases can be published and then accessed through online thematic, taxonomic, or geographic data portals, aggregators, and catalogs. The biological collections community often uses the terms portal and aggregator interchangeably.
In this report a portal is defined as the online platform that allows users to perform advanced searches on the published collections found therein. This could be a local portal to an individual collection or a portal of aggregated collections. An aggregator is the cyberinfrastructure that gathers and compiles data from published collections and makes them searchable through portals. A catalog is similar to a portal but is a term mostly used by the living collection community. Catalogs enable users to search, request, or buy specimens and materials, facilitate the collection of fees, and provide information on shipping permits, compliance with regulations, and user registration unique to living collections.

A major global portal for natural history collections is hosted by the Global Biodiversity Information Facility (GBIF), while iDigBio hosts a portal for collections primarily based in the United States, and the Atlas of Living Australia (ALA) and the Distributed System of Scientific Collections (DiSSCo) provide portals to Australian and European collections, respectively. There are also project-based portals (e.g., TCN portals, such as SERNEC) and taxonomic portals (e.g., Vertnet, Fishnet, EntoWeb, iDigPaleo).

The data that are available via major portals are based on common standards (e.g., Darwin Core Standards⁷ or the Access to Biological Collections Data⁸ schema). These standards help data providers share specimen data using a common terminology of fields, controlled vocabularies, and data classes that describe the taxonomic identity, collecting event, locality, collectors, geological context, and specimen attributes as well as various kinds of media (Wieczorek et al., 2012). The use of these common standards facilitates the computerized aggregation of data from multiple types of collections and the integration of specimen data with other sources of information. Users are able to search data, download results for further analysis, and integrate the downloaded data with other resources, such as environmental data (e.g., temperature, precipitation, etc.). New standards to allow the incorporation of additional properties, called extensions, are continually being developed by the global community.

Living stock collections serve as specimen repositories and data providers for members of the research community, who interface with these collections through online databases, catalogs, and aggregators.
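To make the role of a common standard concrete, the following minimal Python sketch shows how two collections with different local field names could be mapped onto shared Darwin Core terms and then searched together. The term names (`catalogNumber`, `scientificName`, `recordedBy`, `eventDate`) are Darwin Core terms; the local column names, the sample records, and the helper functions `to_darwin_core` and `search` are hypothetical and only illustrate the idea, not any aggregator's actual implementation.

```python
# Hypothetical local column names for two collections, each mapped to
# Darwin Core terms. Only the right-hand term names come from the standard.
HERBARIUM_MAP = {
    "cat_no": "catalogNumber",
    "species": "scientificName",
    "collector": "recordedBy",
    "date_collected": "eventDate",
}
FISH_MAP = {
    "id": "catalogNumber",
    "taxon": "scientificName",
    "collected_by": "recordedBy",
    "when": "eventDate",
}


def to_darwin_core(record, mapping):
    """Rename a local record's fields to Darwin Core terms.

    Fields without a mapping (for example, an internal shelf location)
    are dropped rather than published.
    """
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def search(records, term, value):
    """Return every record whose Darwin Core `term` equals `value`."""
    return [r for r in records if r.get(term) == value]


# Invented sample data for illustration only.
herbarium_rows = [
    {"cat_no": "H-0001", "species": "Quercus alba",
     "collector": "A. Botanist", "date_collected": "1987-05-14", "shelf": "12B"},
]
fish_rows = [
    {"id": "F-42", "taxon": "Salvelinus fontinalis",
     "collected_by": "B. Ichthyologist", "when": "2003-08-02"},
]

# Once both collections use the same terms, their records can be pooled
# and queried with a single search.
aggregated = (
    [to_darwin_core(r, HERBARIUM_MAP) for r in herbarium_rows]
    + [to_darwin_core(r, FISH_MAP) for r in fish_rows]
)
hits = search(aggregated, "scientificName", "Quercus alba")
print(hits)
```

The design point is that each data provider writes one mapping to the shared vocabulary, after which any portal can search the pooled records without knowing the providers' local schemas.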
One such centralized aggregator, the Global Catalogue of Microorganisms (GCM),⁹ hosted by the World Federation for Culture Collections (WFCC) and managed by the Chinese Academy of Sciences (CAS), facilitates the access and sharing of microbial living stock collections along with their associated data. Online platforms provide information such as available strains, genes and alleles, and genome sequences with functional annotation for the acquisition of research material. Standardized abbreviations for genes, alleles, and depositors and coordinated genome sequencing and annotation projects help make these data useful to the user community (Jarret and McCluskey, 2019). In the GCM, users can locate desired strains along with the associated metadata (e.g., date of isolation, geographic origin, growth conditions, medium). Users can add strains to a shopping cart if they wish to acquire them for research, and by putting a strain in the cart a user is linked directly to the source collection, from which the specimen(s) can be requested. Many databases of individual microbial collections are interoperable due to the efforts of projects such as CABRI¹⁰ or the now-defunct StrainInfo (Verslyppe et al., 2014), which helped build common datasets based on specific data standards and formats.

Cyberinfrastructure in Support of Biological Collections

Biological collections may offer solutions to various major societal challenges relating to biology and the environment, from the emergence of new pathogens or the need for new antibiotics to the response of species to climate change, but this is possible only if the data can be accessed, aggregated, and analyzed effectively (Cook et al., 2020; Fontaine et al., 2012; Rocha et al., 2014). Following FAIR data principles (i.e., data that are findable, accessible, interoperable, and reusable [Wilkinson et al., 2016]) and the TRUST principles for digital repositories (i.e., repositories that promote the principles of transparency, responsibility, user focus, sustainability, and technology [Lin et al., 2020]) will require a robust cyberinfrastructure. As the digitization of biological collections continues to create large and diverse datasets, an effective cyberinfrastructure will need to incorporate mechanisms to improve access to an ecosystem of digital repositories and enable the integration of diverse types of data.

Recognizing the need for a more robust cyberinfrastructure, the Earth sciences community established EarthCube in 2011 with NSF funding from both the Directorate for Geosciences and the Office of Advanced Cyberinfrastructure of the Computer and Information Science and Engineering Directorate at NSF.¹¹ Collaborative projects with the biological collections community (such as ePANDDA¹² and ELC¹³) as well as products resulting from EarthCube have been recommended for adoption by the biological collections community (e.g., Hobern et al., 2018). A similarly broad, community-level endeavor has not yet taken place between the biological sciences and computer science communities, but the timing is right, given the past decade of focused digitization.

⁷ See https://dwc.tdwg.org.
⁸ See https://www.tdwg.org/standards/abcd.
⁹ See http://gcm.wfcc.info.
¹⁰ See http://www.cabri.org.

For any local digitization effort to be successful, individual collection-holding institutions need a basic desktop computer and access to server infrastructure in order to house collection management system (CMS) databases, image repositories, and the necessary software for data publishing. Collections also require a workforce skilled in data management as well as collections curation and taxonomy. Both the natural history and living collections communities are using a large number of unique CMS databases that range from simple spreadsheets to more sophisticated systems that allow database management and data manipulation, such as feature-rich SQL or Oracle-based systems (Arctos, CollectionSpace, Specify, BRAHMS, Axiell EMu, BioloMICS, GRIN, etc.) with extensive data models, collection management, and publishing capabilities. Data publishing increases the discovery of specimens for traditional research uses, for research that makes use of the digital data themselves (e.g., predictive modeling, recording of traits through optical character recognition of textual notes, or by machine learning from images), for formal and informal education, and for other novel downstream uses.
While many institutions do not have the resources in house to install and maintain the necessary cyberinfrastructure to run a collections database and make their data available online, hosting services provided by web-based collection management packages and community-based solutions provide the cyberinfrastructure and technical expertise necessary to facilitate the digitization and publishing of these collections.

CHALLENGES

Realizing the promise of the digitization revolution will require overcoming a number of challenges. On one hand, there is an extensive community-wide backlog of specimens and associated materials that need to be digitized, creating gaps in our knowledge about the world’s biodiversity and missed collaboration opportunities between researchers. On the other hand, the multiplication of shared databases that vary in data quality and format and the proliferation of data aggregators and repositories can lead to an unnecessary duplication of effort, data disintegration, and limited data usability. Mass digitization is exposing digital data to an ever-increasing diversity of users for a myriad of uses, resulting in an increasingly complex digital landscape. Addressing these challenges will require the development, support, and maintenance of robust and coordinated cyberinfrastructure that provides for the ever-increasing needs of the world’s biological collections.

Dark Data

While the majority of data generated today are immediately digitally captured, historical collections typically have a backlog of data that have yet to be digitized. The digital revolution and the increase in the accessibility of digitized specimen data have been so profound that undigitized collections are now referred to as “dark data,” referring to the fact that they are essentially unavailable for modern scientific study without physical access to the specimens within institutions (Heidorn, 2008).
The absence of these specimens from the global and national collections digital infrastructure represents lost opportunities for research and education as well as limits to returns on the investments made by the funding agencies that supported the acquisition of the specimens, even if the research projects that generated the undigitized collections were otherwise successful.

¹¹ See https://www.earthcube.org/info/about.
¹² See https://www.earthcube.org/group/epandda.
¹³ See https://www.earthcube.org/group/earth-life-consortium-elc.

Discipline-Specific Limitations and Biases

Although digitization efforts to date have been transformational for both biological collections and research communities, most U.S. specimens, especially those from taxonomically diverse groups, remain undigitized and unavailable for inclusion in cutting-edge research. The process of digitization can be particularly challenging for some disciplines where specimen labels are obscured or scarce, where taxonomic diversity is high and poorly known, where the type of preservation precludes automated capture of information (wet specimens in alcohol, for instance), or where the availability of historical paper records (card catalogs, ledgers, field notes, etc.) is limited. For example, for natural history collections, it is estimated that well over 50 percent of vertebrate collections (Krishtalka et al., 2016) and 20 percent of herbarium specimens (per personal communication, Barbara Thiers, Director of the William and Lynda Steere Herbarium at the New York Botanical Garden, 2020) are digitized and available online, while only 4 percent of entomology collections have been digitized (Cobb et al., 2019), and most invertebrate biodiversity remains unknown or ignored (Di Marco et al., 2017).

Taxonomic bias,¹⁴ which plagues biodiversity research, also leads to a disproportionate amount of dark data for certain collections and to resulting discrepancies in knowledge from organism to organism across a wide range of biological fields (Adam et al., 2017; Clark et al., 2002). Multiple logistical and technical factors contribute to this bias, such as those mentioned above, but regulatory bottlenecks and restrictions play a role as well. Large-scale digitization efforts reveal the extent of century-long sampling and taxonomic limitations and biases and provide insights on how to account for such issues to inform future collecting (Daru et al., 2017; Troudet et al., 2017) and digitization efforts.
For some biological collections, certain data fields need to be redacted or restricted and kept dark to protect sensitive information or specimens. This might include the exact geographic location of an endangered orchid or a fossil site on federal land, information and access to particularly virulent strains of biothreat pathogens, and personal identifiers in the case of organisms or samples originating from human specimens.

Project-Based Collections

A potentially large body of dark data lies in project-based collections, that is, groups of specimens or samples collected with a particular purpose (e.g., for a specific research program or project or a survey of a group of organisms in a particular region) but never transferred to a permanent physical repository (e.g., museum collection or biological research center). While these valuable collections could make important contributions to science and society, the key problem is that they typically reside in an investigator’s lab, freezer, or office, making them difficult to identify and locate (for more, see Chapter 4). Typically, these collections are not accessioned, digitized, and made accessible to the wider scientific community through national data portals or catalogs.

The barriers preventing accessioning into repositories and the subsequent digitization can be diverse. While some projects produce scientific publications that describe their findings and the materials accumulated, researchers may be unwilling to share, or reluctant to relinquish control of, the specimens in their project-based research and thus be hesitant to contribute them to a publicly available repository or data portal. Even when researchers are willing to contribute their specimens and data, sometimes collections simply do not have the capacity or the resources to entertain such requests because of limited space and inadequate funds for accessioning and digitizing the specimens.
Some project-based collections may not be suitable for incorporation into a permanent collection or digitization because of the recipient institution’s acquisition policies and guidelines (e.g., strategic growth, accessioning limitations, permits) or an inability to assess the value of a project-based collection and its benefit to the institution.

14 The fact that some taxa are more investigated than others.

Private collections are also difficult to find. While outside the purview of this report, these private collections may hold essential data for documenting biodiversity, which may eventually be accessioned in public collections. Although the number and holdings of private collections in the United States are unknown, a recent survey in Europe found that private collections there may make up as many as 33 million specimens (Willemse et al., 2019). There are obvious issues concerning data quality and the willingness of these private collection holders to digitize and publish the data associated with the collections, but this information from Europe suggests that U.S. private collections may be a particularly valuable source of biodiversity data currently invisible to the research and education communities.

An Inefficient Data Pipeline

Currently, each online portal or aggregator collects a copy of a collection’s data published on a local database and ingests, normalizes, aggregates, and re-publishes this copy online. However, the current data publishing landscape lacks a streamlined and standardized pathway for carrying out these steps. For instance, if a collection shares its data with multiple aggregators, each aggregator may serve slightly different versions of the same record because they each have different publication schedules and different displayed fields for the specimen data. This publishing process and subsequent data verification steps (taxonomic and geographic verification, data cleanup, annotation, etc.) result in a massive duplication of effort by the aggregators as they each reconcile the specimen digital data, while also creating confusion on the part of data users presented with multiple, yet slightly different, copies of the same data.
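The reconciliation burden described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that every copy of a record carries the same identifier and a last-modified date (named here after the Darwin Core terms occurrenceID and modified), and it keeps only the most recently modified copy. The aggregator names and record values are hypothetical, and real aggregator exports may not share identifiers this cleanly, which is part of the problem the report describes.

```python
from datetime import date


def merge_aggregator_copies(records):
    """Keep one copy per occurrenceID: the most recently modified one.

    Assumes each record is a dict with an "occurrenceID" and a
    "modified" date; both field names are illustrative assumptions.
    """
    latest = {}
    for rec in records:
        key = rec["occurrenceID"]
        current = latest.get(key)
        if current is None or rec["modified"] > current["modified"]:
            latest[key] = rec
    return list(latest.values())


# Hypothetical copies of the same specimens served by two aggregators.
records = [
    {"occurrenceID": "urn:example:1", "source": "aggregator_a",
     "modified": date(2019, 3, 1), "sex": "F"},
    {"occurrenceID": "urn:example:1", "source": "aggregator_b",
     "modified": date(2019, 6, 1), "sex": "female"},
    {"occurrenceID": "urn:example:2", "source": "aggregator_a",
     "modified": date(2018, 1, 15), "sex": "male"},
]

merged = merge_aggregator_copies(records)
```

Choosing the newest copy is only one possible policy; a user might instead prefer the copy from the original data provider, which is a judgment the sketch does not attempt to make.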
Thus, while large amounts of data are appearing in portals, effective access to these data requires informatics expertise to remove duplicates prior to research use. As a consequence, some researchers and educators who may lack sufficient data management skills will rely solely on a single portal rather than exploring other portals for additional data, a practice that likely limits the number and possibly the diversity of the specimens obtained from a search. Furthermore, the current data publishing model has no effective mechanism for returning user annotations of data to the original data providers for incorporation into the data stream, so this effort by data users is completely lost to the collections community. Leading aggregators such as GBIF, iDigBio, GCM, the Atlas of Living Australia, and others recognize the problems of duplicate records and version control (Hobern et al., 2019) as well as the inadequate methods for annotation, but so far they have been unable to develop either a short-term fix or long-term solutions.

Variability in Data Quality and Format

As the quantity of digital data dramatically increases, the presence of incomplete data, data of questionable quality, and a lack of standardization limit both the roles that biological collections data can play in research and education and their usefulness. Issues such as incomplete data records and inaccurate or poorly transcribed data are ubiquitous and lead to limitations on the use of specimen digital data. For instance, an investigator searching on higher-level taxonomy, such as plant family, would miss records in which only lower-level taxonomy (e.g., genus or species) has been recorded. Studies attempting to quantify the timing of animal migration or plant flowering would be severely hampered by a lack of specific temporal information.
Some disciplines (e.g., botany) have used skeletal records 15 as an initial step in digitizing specimen records in order to save time (Nelson et al., 2015; Rabeler et al., 2015), but while this method opens up a large number of records for discovery, some of the information in these records has yet to be completely digitized, meaning that certain fields of information are not readily available for research. Similarly, although some disciplines have made great strides in community georeferencing 16 endeavors, such as the NSF-funded MaNIS, ORNIS, HerpNet, and Fishnet collaborative projects (Chapman and Wieczorek, 2006), many specimen records are not yet georeferenced and are thus unavailable for spatial analyses such as ecological niche modeling and species distribution analyses (Bloom et al., 2017; Seltmann et al., 2018). Other locality records may never be able to be georeferenced because of historical limitations in the precision of their locality information. Data transcription errors and a suite of taxonomic naming issues (Nekola et al., 2019) create a variety of other issues. For example, the rate of errors in geospatial designation or taxonomic classification, either through synonymy or misidentification, has been estimated to range anywhere from 5 percent to 60 percent (e.g., Goodwin et al., 2015; Nekola et al., 2019). Without adequate taxonomic resolution, taxonomic incongruencies can result in incomplete species distribution and trait information. In addition, a lack of adherence to standardized terminology and controlled vocabularies, as well as limitations of or incorrect mappings to Darwin Core fields, have led to various problems in data analysis. For example, attempts to compile information on all “females” of a species are hampered by the numerous variants of this term in the sex field, such as F, Female, and female (e.g., 2,800 distinct values appear in the sex field in VertNet; see Guralnick et al., 2016). Approaching the issue at the source by standardizing and controlling vocabulary in local collections databases and providing common names for organisms would increase usability, but a consensus on taxonomy, terminology, and common names among scientists, which will be needed in order to enable such functions, is still elusive in some disciplines.

15 A basic set of data per specimen (Nelson et al., 2015).
16 The process of converting a text-based description into a geospatial coordinate.

Limitations Affecting Data Usability

Once published to a portal, digital datasets require collections professionals to curate and maintain their quality, just as physical specimens require specialized care.
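The sex-field problem noted in the preceding section on data quality lends itself to a small illustration. The Python sketch below maps free-text variants onto a controlled vocabulary and returns None for anything it does not recognize, so that unmapped values can be reviewed by a person rather than guessed at. The vocabulary shown is a hypothetical fragment, not a community standard, and the function name is an assumption made for this example.

```python
# Hypothetical fragment of a controlled vocabulary for the sex field.
# A real mapping would be agreed on by the relevant community.
SEX_VOCABULARY = {
    "f": "female",
    "fem": "female",
    "female": "female",
    "m": "male",
    "male": "male",
}


def normalize_sex(raw):
    """Map a free-text sex value to a controlled term, or None if unknown."""
    if raw is None:
        return None
    # Ignore case, surrounding whitespace, and a trailing period ("F.").
    key = raw.strip().lower().rstrip(".")
    return SEX_VOCABULARY.get(key)


values = ["F", "Female", "female", " F. ", "juv?"]
normalized = [normalize_sex(v) for v in values]
```

Applying such a mapping in the local collection database, rather than separately at each aggregator, is consistent with the suggestion above to approach the issue at the source.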
Inadequate maintenance of these datasets can severely impair the use, value, and impact of biological collections data in research and education. Both local and community-level mechanisms could improve the quality of their data. One challenge is the lack of expertise by collections professionals in evaluating data quality across broad taxonomic distances and types of data, although standardized vocabularies could provide the necessary tools to assess data completeness, quality, and consistency and to increase the fitness-for-use of biodiversity data (Ball-Damerow et al., 2019). Data transcription errors also require correction by individual collections or potentially through community efforts (see Nekola et al., 2019, for a summary). However, digital datasets are often not maintained and updated for a variety of reasons, ranging from insufficient resources and staff turnover to disputes related to intellectual property rights and to a simple lack of understanding that digital datasets are not static, one-off products. Another factor affecting data usability is the fact that data portals have been developed for different uses and different communities and their interfaces are not always user-friendly for either the public or the research community. Their design has often been an afterthought because the interfaces for most portals are designed with a single purpose in mind and anticipate only one type of user—the research or collections specialist, and not the general public or student users (Hendy and MacFadden, 2014). Thus, although millions of specimen records are available online, the level of technical expertise necessary for accessing them may be too high for some users. Portals that were designed to serve a wide array of data (e.g., GCM and GBIF) also suffer from limited search capability. 
Fields that are unique to particular collection types (e.g., mutant allele for genetic stock centers or geologic data for paleontological specimens) are not searchable, making those data more difficult to discover. Currently, many data portals are available only in a single or a few languages, providing yet another barrier to accessibility and contribution.

Inadequate Methods for Data Integration and Attribution

Realizing the vision of successfully integrating and tracking data from various sources carries many challenges, the most significant of which are issues of scale and interoperability. Data integration relies on the unambiguous identification of individual data elements, packets of data, and people through the use of globally unique identifiers (GUIDs), digital object identifiers (DOIs), and open researcher and contributor IDs (ORCIDs) (Page et al., 2008) as well as the implementation of standardized application programming interfaces (APIs) and exchange formats (Konig et al., 2019). Despite several attempts (e.g., Güntsch et al., 2017; Guralnick et al., 2014; Nelson et al., 2018), the biological collections community has been unable to agree on a single form of identifier to describe data elements, though many candidates have been proposed (GUIDs, life science identifiers, uniform resource identifiers, DOIs, Darwin Core triplets 17). Although most collections now use some form of identifiers as listed above, there is no centralized system of registration to ensure the uniqueness, and therefore traceability, of these identifiers, and attempts to link data informatically have been only marginally successful (e.g., Guralnick et al., 2014). Because living stock and natural history collections databases were established in parallel using different types of identifiers, integrating them has proved to be quite complex, and these difficulties may preclude opportunities to integrate the data from these resources. The challenge is exacerbated by the differing types of published data not being comparable, by differing expertise, and by the different user communities being served. As a result, tracking the use of biological collections data in research and education still remains largely a manual and time-consuming endeavor. Issues of tracking multiple identifiers and integrating specimen data across databases and portals are exacerbated by the fact that identifiers do not reliably persist through to the products of research created from the use of these specimens (Arbeláez-Cortés et al., 2017; Rouhan et al., 2017). In fact, even the way that specimens are cited in published work is inconsistent, if they are cited at all.
This results in a lack of recognition and attribution of the contribution of biological collections to research and education. Despite all of the challenges described above, electronic citation and tracking of digital specimen records, each with a unique identifier, can provide attribution to local collections and can enable assessment of short- and long-term impact, both locally and nationally. Although digitization of biological collections has provided access to massive numbers of specimen records, the assessment of the impact of this resource has barely begun (Hobern et al., 2019; Lendemer et al., 2019). Few biological collections have the resources or community-based guidance to take the next step in determining the contributions of their collections to the published scientific body of knowledge. For example, due to incomplete or non-unique metadata in GenBank, even the apparently simple task of automatically connecting genetic data from GenBank to voucher specimens in iDigBio cannot be accurately accomplished, although this connection may be established manually for a given collection, as demonstrated more than a decade ago (Strasser, 2008). While technology may offer some solutions, the development of such citation and attribution systems is in the early stages of implementation—see occCite 18 and GBIF citation metrics and guidelines 19 as promising examples—and it will require substantial investment if these are to be implemented on large scales. The problem is compounded by a lack of coordination among the members of the biological collection community and by a lack of appropriate resources to develop and implement an assessment of collective impact. 
Investing in the development of bioinformatics tools and cyberinfrastructure to capture data used in publications and other forms of output could be transformational in making it possible to accumulate national usage statistics and to carry out rigorous evaluations of the impact of both physical and digital resources.

Limited Mechanisms to Support a Cyberinfrastructure That Promotes Collaboration

The diversity of biological collections poses many challenges to the effective development and implementation of a cohesive, adaptable, and sustainable cyberinfrastructure that serves the entire collections community. For example, inherent differences between living and natural history collections such as differing needs and goals, compounded by external factors such as different funding opportunities and requirements, have thwarted collaborative efforts to integrate digital data from these collections.

17 A concatenation of values for institution code, collection code, and catalog number for a specimen.
18 See https://hannahlowens.github.io/occCite. occCite is an online tool that enables biological collections to track how their data are being used.
19 See https://www.gbif.org/citation-guidelines.
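Footnote 17 defines a Darwin Core triplet as a concatenation of institution code, collection code, and catalog number. A minimal Python sketch of that construction follows; the colon separator, the function name, and the sample codes are assumptions made for illustration. The sketch also shows why, without a central registry, such identifiers cannot be assumed to be unique: two databases can independently produce the same triplet.

```python
from collections import Counter


def darwin_core_triplet(institution_code, collection_code, catalog_number,
                        sep=":"):
    """Concatenate the three values into one identifier string.

    The separator is an assumption; the footnote only says the values
    are concatenated.
    """
    parts = [str(p).strip() for p in
             (institution_code, collection_code, catalog_number)]
    if any(not p for p in parts):
        raise ValueError("all three components are required")
    return sep.join(parts)


# Hypothetical records from two independently maintained databases.
database_a = [("XYZ", "Herps", 1001), ("XYZ", "Herps", 1002)]
database_b = [("XYZ", "Herps", 1001), ("ABC", "Fish", 77)]

triplets = [darwin_core_triplet(*rec) for rec in database_a + database_b]

# Triplets that occur more than once cannot identify a single specimen.
collisions = sorted(t for t, n in Counter(triplets).items() if n > 1)
```

Whether a collision reflects the same specimen recorded twice or two different specimens cannot be determined from the identifier alone, which is the traceability gap discussed above.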

Many natural history institutions with the necessary funding for personnel and technology have been digitizing their collections for four decades (Nelson et al., 2018), but NSF’s 10-year, $100-million ADBC program, launched in 2011, has led to even greater strides in digitization and provided access to an ever-increasing quantity of data from natural history collections. In contrast, at present living stock collections are ineligible for funding through the ADBC program, and, for now, no comparable programs specifically fund the digitization of living biodiversity collections. The immense amount of digital information being produced by current digitization efforts and the data integration challenges outlined above threaten to outstrip the necessary cyberinfrastructure support (storage devices, backup systems, routine maintenance, and technological upgrades). The financial outlay required for these necessary components and additional workforce needs (see Chapter 6) is sometimes not adequately factored into the cost estimates of digitization, so that the infrastructure components and workforce needs are left unfunded (see Chapter 7), making it necessary to put retroactive measures in place to address the issue. Without sufficient investment in these cyberinfrastructure components and support by individual collections, funders, and the community as a whole, the amount of digital data stored, shared, and integrated will continue to be limited for certain collections. However, it is precisely a broadly based, flexible, and robust cyberinfrastructure that could integrate complementary data from living and natural history collections (e.g., microbiome studies, food safety, biotechnology applications) or other groups of collections.
THE WAY FORWARD

Digitization is increasing the relevance of collections in diverse ways and allowing collections around the world to network their way toward the “global museum” that will seamlessly integrate worldwide collections (Bakker et al., 2020). To date, the digitization of biological collections has proved extremely valuable and successful. The result has been new partnerships for innovative scientific inquiry and learning. Digitization has significantly increased the accessibility and usability of biological collections data for traditional research, for new research of global societal importance, for education (e.g., Cook et al., 2014; Powers et al., 2014), and for an ever-increasing and ever-more diverse collection of additional applications (for review, see Ball-Damerow et al., 2019; James et al., 2018; Krishtalka et al., 2016; Nelson and Ellis, 2018). However, if such successes are to continue and multiply, a great deal of work remains to be done. A large percentage of the nation’s biological collections have not yet been digitized. Data cleaning exercises, standardization, and the provision of annotation mechanisms will significantly increase the usefulness of both the collections that have already been digitized and those that will be digitized in coming years. Finally, digitized biological collections will be most valuable as components of a highly integrated cyberinfrastructure that provides easy access to the collections, integration among different collections and with data beyond collections (such as environmental data, genetic data, biodiversity analyses, etc.), and a way to enable effective collaboration among the many researchers who work with those collections and among potential users of the data. These steps will make it feasible to fulfill the extraordinary promise of digitized biological collections.
Innovative Approaches to Reducing Dark Data

Given the foundational role that digitization plays in the development of an accessible, useable, and networked scientific infrastructure, it is important that biological collections continue to digitize and to provide data that are of high quality, in a standardized format, fit for use, and broadly accessible. Digitization workflows are currently in place in many communities and institutions, and systematic digitization is set to become more efficient than in the past thanks to ongoing training support by iDigBio and others. The quantity of digital data available for end use is determined not just by the pace at which historical data can be digitized, but also by the efficiency of adding new field-collected materials or project-based collections to permanent repositories and online portals. In order to not contribute to the backlog of undigitized material, the large amount of data associated with these new specimens needs to be “born digital.” Streamlining their integration into collection databases and online data aggregators will require a collaboration among field collectors, collections professionals, and the informatics community. By building on recent achievements of the collections community, future efforts to digitize most U.S. collections seem feasible, given sufficient time and funding.

Massive digitization efforts to capture and place online not only the metadata associated with biological specimens but also high-resolution images of the specimens themselves, along with videos and vocalizations, have unleashed entirely new areas of study. Thanks to new imaging techniques and technologies, the use of rare or fragile natural history collections is less invasive, and it is possible to carry out detailed examinations of specimen attributes without extensive handling of the specimens themselves (see Box 5-1). Sensitive computed tomography (CT) methods of scanning whole organisms and individual skeletal elements capture anatomical features in unprecedented detail and permit precise three-dimensional replication of specimen morphology. Other technological advances have made the digitization of some collection types (e.g., trays of insects with multiple labels, fluid-preserved specimens, microscopic organisms) less time consuming and more efficient. Batch processing or automation and the use of optical character recognition (OCR) have shown some success in optimizing the capture of text from specimen label images. The secondary augmentation of records through georeferencing 20 can be facilitated through the use of online software such as GEOLocate. 21 Specially designed robotic systems that select and image individual specimens or scan whole drawers of specimens and their data are now a reality.
The use of convolutional neural networks, a form of machine learning that has been used for species identification (e.g., Carranza-Rojas et al., 2017) and the capture of trait information from specimen images and text such as whether a specimen is in flower or fruit (e.g., Lorieul et al., 2019), is another area of innovation that could advance digitization (see Box 5-2) and that is ripe for collaboration with computer scientists. The natural history collections community has begun to use outside assistance in the digitization process in an effort to reduce the amount of dark data. The impact and contribution of citizen scientists and volunteers to the digitization effort have steadily increased through efforts such as Notes from Nature, 22 the Smithsonian Transcription Center,23 and CitSciScribe, 24 among others. The annual WeDigBio 25 (Worldwide Engagement for Digitizing Biocollections) global transcription event has also galvanized these digitization efforts by engaging a large and diverse set of individuals from varying backgrounds in the digitization process (Ellwood et al., 2018). Although these citizen science efforts were originally designed to assist with the transcription of specimen label data, field notes, and other text (Hill et al., 2012), citizen scientists are extending their contributions to other forms of data capture, such as scoring herbarium specimens for phenological phase. Despite lingering skepticism about the quality of data produced by citizen scientists, it has been found that, when given appropriate instructions, citizen scientists produce data that are on par with specialists (Brenskelle et al., 2020; Catlin-Groves, 2012), and the power of engaging citizen scientists is shown by the fact that the 4-day WeDigBio event in 2018 resulted in more than 50,000 record transcriptions (Ellwood et al., 2018). 
However, despite the addition of these efforts to existing collections digitization efforts, most of the nation’s collections remain to be digitized.

20 Assigning a latitude and longitude to a collection locality (e.g., GEOLocate, Google Earth, etc.).
21 See http://www.geo-locate.org.
22 See https://www.notesfromnature.org.
23 See https://transcription.si.edu.
24 See https://citsciscribe.org.
25 See https://wedigbio.org.

BOX 5-1 Eggs Benedictine: Crackless Analysis of Eggshell Composition

As organisms grow, they can incorporate numerous signatures from the environment around them into their bodies, including environmental contaminants. Scientists have long used material in biological collections to study changes in these contaminants and their biological effects over time, such as the thinning of eggshells in birds of prey as DDT levels in the environment increased. Usually, though, the techniques used to study contaminants in biological specimens result in the destruction of the specimen itself. Eggs are a good example: If you want to find out what birds were exposed to in the 19th or 20th century, and you have eggs collected and preserved from that era (egg collecting, or oology, was a huge Victorian craze), you can crush the eggshells and submit them to chemical analysis. But then you don’t have an egg anymore.

[Image courtesy of Monica Tischler, Benedictine University]

Monica Tischler, professor of biology at Benedictine University, solved the problem of destroying egg specimens in order to study them by using eggs from the university’s Jurica-Suchy Nature Museum and Argonne National Laboratory’s Advanced Photon Source (APS).a The APS produces some of the most powerful X-rays available, powerful enough to “see” chemical composition in the eggs without destroying them. “We have eggs dating back 150 years,” Tischler said. “Before binoculars were invented and made bird-watching popular, many people collected bird eggs. Then when migratory bird acts were instituted in the late 19th century and made the practice of collecting eggs unfashionable and illegal, many collections were donated to museums” like the one at Benedictine University. “When birds lay eggs, they excrete contaminants into the egg, and the contaminants in the eggshell reflect blood concentrates of those contaminants,” Tischler said.
“These specimens represent a window into the past. The problem is that up until this research, all the techniques used to identify the contaminant in an eggshell were destructive. You take the eggshell, crush it, dissolve it in acid, and examine it. It would be unfathomable to destroy these rare eggs for research.” Researchers identified naturally occurring elements such as calcium, iron, and zinc within eggs, but also elements such as manganese, arsenic, bromine, and lead, which can be considered contaminants. “It’s a new technique to gain a window into the past to compare watersheds and compare contaminants over time,” Tischler explains. But you have to have the eggs on hand, in this case, thousands of egg specimens amassed by the late Benedictine professors Frs. Hilary and Edmund Jurica, O.S.B., over a period of decades in the early 20th century and later donated to the museum.

a See https://www.labmanager.com/news/professor-s-egg-research-hatches-new-discoveries-on-environmental-change-10908.

It is important to note that the physical specimen is the nexus for the digitized data associated with it and that it should not be neglected or discarded. Often, the specimens remain the primary source of verifiable biodiversity data, and the curation of the underlying specimens required for such analyses remains paramount, especially if researchers want to later examine the physical specimens after analyzing data from digitized information such as images or genetic sequences. For example, downstream analyses can include the extraction of DNA for the confirmation of species identifications based on analyses of digitized specimens or a simple inspection of specimens for verification of an occurrence that might appear anomalous in terms of locality or habitat.
As such, digitization is not a substitute for physical specimens, but rather a necessary complementary activity that exponentially increases the usefulness of and provides wider access to the collections of these physical specimens. In fact, evidence is accumulating that use of physical specimens through loans and visits to collections has actually increased with the recent online accessibility of digital records (Vollmar et al., 2010). For living stock collections, continued digitization allows researchers around the globe to locate and acquire an ever-growing number of existing and newly developed model organisms, with the digital data being more of a finding tool and the physical specimen still remaining vitally important. In some cases, such as in the case of destructive sampling or loss of a specimen, the electronic information stored in a database becomes the only record available; this is the case especially for a growing number of microorganism specimens (see Box 5-3), and thus digitization is essential for future studies that aim to understand their biology and evolution.

BOX 5-2 Leveraging Machine Learning to Augment Digital Data Potential

The increasing availability of digitized collections data (textual, geographic, and images) is enabling the application of novel technologies for innovative research. One such application is machine learning, “the science of getting computers to act without being explicitly programmed.” Application of machine learning approaches to digital images of herbarium specimens, which are two-dimensional and generally standard in format, is opening doors to new areas of botanical research in ecology, evolution, and agriculture (Soltis et al., 2020, and a special issue in volume 8 of Applications in Plant Sciences, 2020). An early application was the development of powerful tools for identifying plant species with an astonishing level of accuracy (e.g., Carranza-Rojas et al., 2017). Likewise, the coupling of digitized herbarium images with machine learning has the potential to revolutionize capture of changes in plant phenology (budburst, flowering, fruiting) across space and time, providing a rich data resource that augments current observation networks of professionals and citizen scientists to assess phenological changes in a changing climate (e.g., Lorieul et al., 2019; Pearson et al., 2020; Willis et al., 2017). An emerging area is the use of herbarium images for scoring so-called “plant functional traits,” those features tied to key ecosystem functions, across species, space, and time for ecological analysis on local and global scales; the application of machine learning to functional trait extraction from images is just around the corner (Shouman et al., 2019; Soltis et al., 2020). Similar approaches are enabling the extraction of trait data from textual information in specimen records (such as body mass, reproductive status, or habitat information) for comparative analysis. Key to all emerging interdisciplinary research uses of digitized collections data is the linkage of collections to heterogeneous data representing environmental, climate, spatial, phylogenetic, and genomic information.

Increasing Data Visibility

Although digitization and sharing data with online open access data portals continue to provide more data for research and education, vast amounts of data produced through research and collecting endeavors, such as project-based collections data, are still not publicly available. This is particularly prevalent at institutions that lack permanent collections. Making these data public would increase the visibility of the data as well as promote research reproducibility and reduce redundancy. The primary onus of ensuring that data are captured and disseminated falls on funding agencies, reviewers, and publishers.
The NSF Directorate for Biological Sciences requires a data management plan as part of all research proposals, but while this is a prerequisite for funding for living stock collections, there is no requirement for digitization, publishing, or ensuring the long-term accessibility of specimens and their data for natural history collections. There is thus an opportunity to develop more stringent requirements for managing and archiving specimens and their data as part of a specimen management plan (see also Chapters 4 and 7). Likewise, there is no uniformity in the requirements for data citation in publications through journals. Publishing entities along with their editorial boards (and with pressure from funding agencies) could enact uniform requirements for data citations in order to promote reproducible science as well as to provide the necessary mechanisms for collection attribution.

BOX 5-3 When Electronic Data Become the Only Data

Diverse studies have revealed the existence of large numbers of viruses, bacteria, archaea, and protists (Cai et al., 2019; Coutinho et al., 2019; Ryan et al., 2019) that are not available from any physical collection. In these cases, the only record available is nucleic acid sequences, electron microscope pictures, or the metadata related to the sample and project where they were detected. This is also true for biological collections where specimens or biological material are consumed during research investigation, and the situation is particularly prevalent for environmental samples, such as soil for microbial analysis, marine or riverine water samples, or other new “collections” not yet explored. Without physical material, some collections of DNA cannot be identified taxonomically and therefore cannot be assigned a scientific name. In GenBank it is common to find large sets of sequences that have as source organisms “uncultured sea-water bacterium,” which at the time was the best identification possible. In the future, some of these records can play a key role in the definition of new taxa, and the metadata associated with the records represent an opportunity for increased access to data and metadata for an expanding array of biological research questions. For these collections, while common standards and best practices for long-term preservation and curation need to be developed, the biological collections community has the capacity to manage, curate, and integrate new molecular-only collections. For example, some genome projects are aimed at providing a phylogenomic framework to identify otherwise unidentified sequences and understand gene functions (Nagy et al., 2020).
In some cases, the increasing number of sequences with physical material that are being lodged with these aggregators can now be used to compare and confirm identification of these non-preserved sequences. As new research is conducted, digital records will need to be updated as physical specimens are re-determined, more organisms are described, and new taxa defined. Some of these “orphan” records with unnamed species could be assigned to these new organisms, but this effort will require careful curation and continuous scanning for taxonomic updates.

Tools to Improve Data Quality

Emerging efforts to provide online tools for improving data quality while also facilitating data integration, usability, and accessibility to a broader range of communities hold significant promise in many areas. Both discipline-specific efforts to address data quality and larger-scale efforts by data aggregators provide such opportunities. The aggregator community has a major role to play, with GBIF, iDigBio, ALA, GenBank, and VertNet having already incorporated data quality tests and assertions [26] into their portals, which, in some cases, automatically correct or augment records to enhance their fitness for use (Bouadjenek et al., 2019; Chapman et al., 2020). Most of the changes made as a result of these tests and assertions improve data quality by identifying georeferencing mismatches, genetic sequences that are inconsistent with the literature, taxonomic or geographic anomalies, duplicates, or issues related to data standards or vocabulary. Currently, there is no uniformity in the identification of the errors nor in the implementation of the edits across the various aggregators, but recommendations to improve data quality have been proposed (Chapman et al., 2020; Groom et al., 2019).
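As an illustration of what a data quality test or assertion might look like in practice, the following is a minimal sketch in Python. It assumes occurrence records are plain dictionaries keyed by Darwin Core term names (occurrenceID, scientificName, decimalLatitude, decimalLongitude) with numeric coordinates. The flag names and the sample records are hypothetical and are not the tests implemented by any particular aggregator.

```python
from collections import Counter


def check_record(record):
    """Return a list of (hypothetical) quality flags for one occurrence record."""
    flags = []
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        flags.append("COORDINATES_MISSING")
    elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("COORDINATES_OUT_OF_RANGE")
    elif lat == 0 and lon == 0:
        # A point at exactly (0, 0) is usually a placeholder, not a locality.
        flags.append("COORDINATES_ZERO")
    if not (record.get("scientificName") or "").strip():
        flags.append("SCIENTIFIC_NAME_MISSING")
    return flags


def run_assertions(records):
    """Return (index, occurrenceID, flags) for every record with a problem."""
    id_counts = Counter(r.get("occurrenceID") for r in records)
    report = []
    for index, record in enumerate(records):
        flags = check_record(record)
        if id_counts[record.get("occurrenceID")] > 1:
            flags.append("DUPLICATE_OCCURRENCE_ID")
        if flags:
            report.append((index, record.get("occurrenceID"), flags))
    return report


# Hypothetical sample data, for illustration only.
records = [
    {"occurrenceID": "urn:uuid:a1", "scientificName": "Peromyscus maniculatus",
     "decimalLatitude": 38.9, "decimalLongitude": -95.2},
    {"occurrenceID": "urn:uuid:a2", "scientificName": "",
     "decimalLatitude": 0, "decimalLongitude": 0},
    {"occurrenceID": "urn:uuid:a3", "scientificName": "Quercus alba",
     "decimalLatitude": 123.0, "decimalLongitude": 10.0},
    {"occurrenceID": "urn:uuid:a1", "scientificName": "Peromyscus maniculatus",
     "decimalLatitude": 38.9, "decimalLongitude": -95.2},
]

for index, occurrence_id, flags in run_assertions(records):
    print(index, occurrence_id, ", ".join(flags))
```

The sketch only flags records and never edits them, so the decision to correct a record stays with the data provider. An aggregator that automatically corrects or augments records would need additional rules beyond what is shown here.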
In addition, there is a need to create standardized and consistent mechanisms for feeding these corrections and data flags, or annotations created by users of the data, back to the data providers in order to inform data correction and augmentation at the source. In some cases, annotations and errors found by the users of the data are provided to the data providers in a format requiring corrections to be made individually, one record at a time, which is simply not feasible for large datasets. In the past few years many web annotation tools for eliminating these hurdles have become available (Suhrbier et al., 2017; Tschöpe et al., 2013). Partnering with computer scientists and software developers could lead to the deployment of mechanisms for routing data quality annotations to the data providers and for those annotations to be easily reviewed and integrated into the source data in batches. Machine learning and other forms of artificial intelligence may provide incremental increases in the annotation of certain collections, primarily through text recognition and OCR technologies using images of labels or card catalogs or ledgers. A systematic and standardized approach to improving data quality will result in a better user experience. Some portals have started to adopt the use of facets, filters, or auto-complete for searching, rather than completely blank entry fields. Such modifications are also steps toward increasing the accessibility of collections data to a wider range of users.

[26] A query that looks for problems in a biological collection dataset.

Promoting Integration and Attribution

Many national and international organizations have developed standards for collections data management that inform the integrity and format of digitized data. These data associated with specimens usually involve a suite of unique identifiers with taxonomic, locality, temporal, and preparation information as well as various collections management–based fields (catalog number, cataloger, etc.). While the fields of information captured may vary by discipline or collection, the widespread adoption of globally unique identifiers (GUIDs) would allow for a much deeper and broader integration of data both within and among collections. Collections with a critical body of digitized data based on or derived from the specimens are now interested in linking their basic collection metadata to information such as gene sequences, isotope values, or morphometric analyses. Such linkage will further improve data integration and create better connections between primary specimen records and extended data. Linking the data in this way creates what has become known as the “extended specimen” (Webster, 2017) (see Figure 1-2). Extending specimen data with these resources greatly increases the value of the digitized collection for downstream uses while promoting integrated science (Lendemer et al., 2020; Thiers et al., 2019).
A lack of integrated online resources will restrict access to valuable collections information, limiting the uses of the data in research and the potential scientific discoveries related to those data. For maximal use, digital data require integration and interoperability at multiple levels. At the specimen level, data derived from the diverse preparations of each specimen (e.g., skeletons, tissues, parasites, field notes, and publications) need to be connected in order to create full extended or holistic specimens for multidisciplinary applications. In addition, these data need to be integrated with the new data streams derived from subsequent investigations (e.g., GenBank sequences, IsoBank signatures, images, CT scans, viromes, and various -omic data). At the collection level, creating associations among taxonomically disparate specimens to highlight such relationships as tissue–voucher, host–parasite, pollinator–host plant, predator–prey, commensals, and others is crucial for integrative science. At the ecosystem level, many novel uses of biological collections data, such as evaluating species’ responses to global change, require integration with other forms of data, such as genetic, observation, trait, environmental, geographic, ecological, and remote sensing data. Such integration will not only require the collections to be more robust and complete but will also necessitate the creation of interoperable linkages among databases. Some levels of integration of disparate datasets are currently being achieved on a national and global scale through various aggregators and individual museum data management systems, but more coordination between these aggregators and developers is needed to simplify and standardize the landscape.
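The specimen-level and collection-level linkages described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes each specimen already has a globally unique identifier, that derived data (sequences, scans) and relationships (host–parasite) reference those identifiers, and that all identifiers, accession labels, and field names shown are hypothetical placeholders rather than records from any real repository.

```python
def build_extended_specimens(specimens, derived, relationships):
    """Attach derived data and specimen-to-specimen links to each core record.

    Returns the assembled records plus a list of items whose identifiers
    could not be resolved, so that broken links are reported, not dropped.
    """
    extended = {
        guid: {"core": core, "derived": [], "related": []}
        for guid, core in specimens.items()
    }
    unresolved = []
    for item in derived:
        target = extended.get(item["specimen"])
        if target is None:
            unresolved.append(item)
        else:
            target["derived"].append((item["kind"], item["ref"]))
    for subject, predicate, obj in relationships:
        if subject in extended and obj in extended:
            extended[subject]["related"].append((predicate, obj))
            extended[obj]["related"].append(("inverse: " + predicate, subject))
        else:
            unresolved.append((subject, predicate, obj))
    return extended, unresolved


# Hypothetical identifiers and references, for illustration only.
specimens = {
    "urn:uuid:host-0001": {"scientificName": "Peromyscus maniculatus"},
    "urn:uuid:para-0002": {"scientificName": "Ixodes scapularis"},
}
derived = [
    {"specimen": "urn:uuid:host-0001", "kind": "sequence", "ref": "example-accession-1"},
    {"specimen": "urn:uuid:host-0001", "kind": "ct_scan", "ref": "example-media-7"},
    {"specimen": "urn:uuid:lost-9999", "kind": "sequence", "ref": "example-accession-3"},
]
relationships = [("urn:uuid:para-0002", "parasite of", "urn:uuid:host-0001")]

extended, unresolved = build_extended_specimens(specimens, derived, relationships)
print(extended["urn:uuid:host-0001"]["derived"])
print(extended["urn:uuid:host-0001"]["related"])
print(unresolved)
```

The sketch depends entirely on stable identifiers: the sequence that points at an unknown identifier ends up in the unresolved list, which is the small-scale version of the integration failure that occurs when specimen records lack agreed identifiers.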
A cyberinfrastructure for biological collections could enable data integration while also providing annotation tools and a system for attribution of specimen data used in research, education, policy development, or other activities of this scope. Creating such a cyberinfrastructure will require robust technological tools to link data elements and also social incentives that will engage all actors in the data pipeline, from collections to researchers, aggregators, data authority providers (taxonomy), journal editors, and beyond. The promises of data integration and attribution were addressed in a Biodiversity Collections Network workshop in 2018 [27] (Bentley, 2018) in which a possible system of identifiers and linkage mechanisms was identified as a solution to better integration and attribution of digitized biodiversity data. For example, a number of systems that are intended to solve various aspects of the integration process are being developed (e.g., GenBank Linkout [28] and Pensoft ARPHA writing tool [29]), but while there are analogous systems in other domains that one could learn from or co-opt (e.g., RRIDs of the Resource Identification Initiative [30]), no comprehensive solution has been forthcoming. The more that such technological solutions are implemented, the less the community will need to rely on social solutions where all producers and users of data need to perform linkages manually. The broadened utility of collections data, through integration with other data sources, will eventually increase the use of collections (both physical and digital) and thereby increase the attribution, tracking, assessment of impact, and subsequent advocacy for these resources.

[27] See http://bcon.aibs.org/wp-content/uploads/2018/05/BCoN-Needs-Assessment-workshop-report-1.pdf.

Assigning identifiers to downloaded datasets from aggregators would also promote both attribution of data use to the providing institution and reproducible science. For example, GBIF assigns a DOI for a downloaded dataset, but recent research has shown that neither URLs nor DOIs are stable, even over short time frames, and suggests instead a method of cryptographic content-based identifiers (Elliott et al., 2020). Continued efforts to develop methods for identifiers of datasets to enable data integration, attribution, and reproducible science are needed.

One technological solution that could potentially resolve the data integration and attribution problem and that has recently received attention is blockchain (van Rossum, 2017). Blockchain is used most commonly in cryptocurrency where it provides an incorruptible digital ledger of economic transactions. A blockchain-inspired network has the necessary technological components to provide the identification of the various elements of the network while also tracking all transactions associated with each item.
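Two of the ideas above, content-based dataset identifiers and a tamper-evident record of transactions, can be illustrated with a short Python sketch built on the standard library. It is not the method of Elliott et al. (2020) or of any production blockchain. The canonical JSON serialization, the "sha256:" prefix, the Ledger class, and every identifier in the sample data (including the all-zero ORCID and the example DOI) are assumptions made for illustration.

```python
import hashlib
import json


def content_id(records):
    """Derive an identifier from a dataset's content rather than its location."""
    # Sort records and keys so identical data always serializes identically.
    canonical = json.dumps(
        sorted(records, key=lambda r: r["occurrenceID"]),
        sort_keys=True, separators=(",", ":"), ensure_ascii=False,
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Ledger:
    """Append-only log in which each entry commits to the entry before it."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(body):
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def append(self, subject, action, agent):
        previous = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {"subject": subject, "action": action,
                "agent": agent, "previous": previous}
        entry = dict(body, hash=self._digest(body))
        self.entries.append(entry)
        return entry

    def verify(self):
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        previous = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["previous"] != previous or entry["hash"] != self._digest(body):
                return False
            previous = entry["hash"]
        return True


# Hypothetical download and transactions, for illustration only.
download = [
    {"occurrenceID": "urn:uuid:a2", "scientificName": "Quercus alba"},
    {"occurrenceID": "urn:uuid:a1", "scientificName": "Peromyscus maniculatus"},
]
dataset_id = content_id(download)

ledger = Ledger()
ledger.append("urn:uuid:a1", "loan issued", "https://orcid.org/0000-0000-0000-0000")
ledger.append(dataset_id, "dataset downloaded", "example-aggregator")
ledger.append(dataset_id, "cited in publication", "doi:10.0000/example")
print(dataset_id, ledger.verify())
```

Because the identifier is computed from the data, two copies of the same download receive the same identifier wherever they are stored, and any change to a record yields a different one. The hash chain alone only makes edits detectable by someone holding a trusted copy of the latest hash. A real network would also need replication across institutions and some form of signing, which this sketch omits.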
The network could take advantage of the existing identifiers commonly used in the collections community (GUIDs, DOIs, ORCIDs, etc.) to effectively identify occurrence records, data downloads, publications, and agents. Transactions such as a change or augmentation of the record by the collection, an aggregator, or a user of the collection; a loan or gift of material by a collection to an end-user; the lodging of a DNA sequence with GenBank; or the publication of results depending on the use of physical specimens or data could all be digitally recorded by the blockchain network. Each of these individual transactions would be logged by the system and would be traceable and immutable. Further investigation of a blockchain-based cyberinfrastructure could yield innovations for managing and tracking all activities of biological collections.

Developing a National Cyberinfrastructure

As digitization spreads across scientific disciplines and data sharing becomes more common, the development of a flexible, unified, and sustained national cyberinfrastructure would provide greater opportunities to integrate and support disparate digital datasets such as living stock and natural history collections and would facilitate research and educational opportunities. This shared resource would not only serve the needs of the collections communities but also provide a baseline to all biodiversity knowledge. Partnerships and pooled resources may be the key to the development and implementation of a permanent, effective cyberinfrastructure in support of digitization, annotation, integration, and analysis of the nation’s collections. Because small collections may have unique holdings that reflect regional species pools or the expertise of present and past local collectors and researchers, making these collections digitally available will be a first step toward greater advocacy, visibility, use, and inclusion in large-scale studies.
However, some small collections do not have the resources to manage their own cyberinfrastructure or establish and maintain a portal to store their data or even publish them online. The cyberinfrastructure needs of these collections are in some cases being addressed at the community level through cloud hosting of collections databases (e.g., Arctos [31], Specify [32], or BioAware [33]). These web-based collection management packages offer information technology support which is often not provided in-house by the institution but which is necessary to facilitate digitization and publishing of collections. This model has the additional benefits of streamlining data publishing and making connections to external data repositories more robust (e.g., GenBank, Morphbank, IsoBank, Morphosource, Ontobrowser, DataOne). Data portals such as iDigBio provide global access to digital data from U.S. collections and are therefore a key feature of cyberinfrastructure, but they in turn rely on additional cyberinfrastructure components, such as hardware for servers and storage, an evolving database schema to accommodate innovations in digitization, and a workforce capable of adapting to a rapidly changing data sciences landscape (see Chapter 6). This type of infrastructure needs to be maintained at the national level, for use by and the benefit of the biological collections community as a whole (see Chapter 8).

[28] See https://www.ncbi.nlm.nih.gov/projects/linkout.
[29] See https://arpha.pensoft.net.
[30] See https://www.force11.org/group/resource-identification-initiative.
[31] See https://arctosdb.org/about.
[32] See https://www.sustain.specifysoftware.org.
[33] See https://www.bio-aware.com.

A Robust Cyberinfrastructure to Promote Coordination and Collaboration

Connecting data in order to generate shared resources has additional benefits. For example, researchers are increasingly interested in patterns of spatial, environmental, and genetic variation, particularly when evaluating how species might respond to climate change. Data from living stock and natural history collections, environmental databases, NEON (the National Ecological Observatory Network), and GenBank would all contribute to addressing these questions, and a cyberinfrastructure to support these linkages would enable important new science while ultimately reducing costs through the elimination of duplicated effort.
Moreover, the development and deployment of analytical tools and pipelines through unified resources would democratize biodiversity science by allowing accomplished biological specialists who are not well trained in informatics and computer science to address important basic and applied research. Collaboration with a national cyberinfrastructure for life sciences research, such as CyVerse [34] (funded by the NSF Directorate for Biological Sciences), the Texas Advanced Computing Center [35], the Extreme Science and Engineering Discovery Environment [36], and the Data Observation Network for Earth [37], could provide resources to support biological collections and lead to an enhanced national network of digital data from collections and other relevant repositories by improving accessibility to and linkages among data from different sources. The EarthCube community (see above) could serve as a model for how such a collaboration might be implemented.

A national cyberinfrastructure for biological collections that will support these collections and facilitate their ever-growing base of end-users will require collaboration, especially between the collections community and computer scientists and engineers, but also between collections staff from diverse collections types and communities (e.g., natural history and living stocks). Until recently, interactions between these communities have been limited due to a lack of funding and staff availability (see Chapters 6 and 7). However, the effective development and deployment of cyberinfrastructure for biological collections will require both (1) application of recent advances from computer science and engineering in new contexts and (2) innovation of cyberinfrastructure components to meet the unique needs of biological collections and an ever-widening user community (e.g., Heberling et al., 2019).
Successful implementation will require an interdisciplinarity that is only beginning to emerge among computer and data science and all fields of biology represented by biological collections. To date, innovations in the development of the world’s largest aggregators of data from natural history collections (e.g., GBIF, iDigBio, ALA) and living collections (e.g., GCM) have resulted from close collaborations among biologists, data scientists, and engineers. Moreover, as some computer and data scientists are embracing the data from these biological collections (Chen et al., 2019; Drew et al., 2017), interesting challenges for machine learning and analytical pipelines are being tackled. A similar, although perhaps less appreciated, facet of the situation is that biological collections provide unique and scientifically interesting challenges that could benefit the computer science community, perhaps with extensions to problems outside of collections. NSF’s Harnessing the Data Revolution “big idea” is certainly relevant to collections data, particularly as both the volume and heterogeneity of data increase and as researchers and educators are increasingly interested in connecting collections data with other data resources, from environmental to genomic data. However, continued progress and new advances will require expanded collaborations. Formal efforts to bring these groups together through, for example, workshops, shared funding, and other opportunities would reap large rewards for the design and extension of cyberinfrastructure in capturing the many elements of the extended specimen and aligning data resources in living and natural history collections.

[34] See https://www.cyverse.org.
[35] See https://www.tacc.utexas.edu/-/tacc-a-holistic-approach-to-making-cyberinfrastructure-accessible.
[36] See https://www.xsede.org.
[37] See https://www.dataone.org/working_groups/cyberinfrastructure.

CONCLUSIONS

Certain impediments will have to be overcome before the potential of a national cyberinfrastructure and the digitization it supports can be realized. Through varied programs past and present, NSF contributions to biological collections digitization and cyberinfrastructure have been critical in the United States. In order to be successful and sustainable, the digitization and development of a robust cyberinfrastructure will require continued support from NSF. Although digitization efforts have involved hundreds of collections, phylogenetic, geographic, temporal, and taxonomic gaps in digitization are evident. Harnessing the opportunity for data-driven discoveries and transdisciplinary collaboration will depend on a continuing effort to digitize new and existing biological collections using developed communities of practice (e.g., best practices and standards).
Investment in the development of new technologies and cost-effective, high-throughput workflows for digitizing collections that, to date, have lagged—such as entomological collections—will enhance both the number of specimens and the taxonomic scope of digitized collections. Future digitization initiatives will need to be prioritized to address this disparity in order to ensure better representation of data from these underrepresented groups. In addition, the identification, assessment, and accessioning of legacy project-based collections could bring a large number of valuable specimens and their digitized records into the public domain and prevent the future accumulation of inaccessible collections that diminish NSF’s investment in their assembly and future use in research and education. Compounding these issues is the lack of resources or associated workforce (see Chapter 6) and also staff who may not realize the value of the collections once digitized. If these “dark data” can be made available, both the physical collections and their digital representations can be used in future research, contributing to the growing fabric of networked collections. National and global portals and catalogues have made important contributions to the biological collections community by providing a platform with which to exchange and share data and promote standardization and consistency. Continual updating, augmenting, and improving digital data records using annotation tools and data assertions, for example, will greatly improve overall data quality and, in turn, lead to more comprehensive data integration and greater accessibility of digital data. However, mechanisms for data annotation and attribution require an interoperability of data and systems which may be impeded by global indecision about the application of globally unique identifiers for specimen records. 
In addition, despite some progress, integrated systems that enable the citation of data used in research publications and attribution to data providers are difficult to develop and will require an all-encompassing approach with social incentives and innovative technological solutions. These are not insurmountable problems, but it will be important to address them in the development of a comprehensive national cyberinfrastructure for the large-scale, long-term digitization and use of digitized data. The integration of specimen data with other biological components as well as with data sources outside of the biological realm will require the implementation of a network of cyberinfrastructure resources not yet realized. Future collaborations are potentially unlimited, but computer scientists and the collections community will require mechanisms to bring them together and instruction on how to communicate across disciplinary barriers. Rapid developments during the past few years argue for the value of these collaborations. Just as innovations in digitization have resulted from partnerships between these communities, further collaborations, particularly in the application of machine learning, will lead to even greater progress in digitization, georeferencing, and data analysis. A unified cyberinfrastructure that connects all types of biological collections, such as living and natural history collections, could accelerate research and provide innovative educational opportunities. Moreover, a permanent national cyberinfrastructure that supports the needs noted above in terms of expanded digitization of dark data, improvement in data quality, and an increased accessibility to digital data would certainly spur data use. Without this resource, collections, both physical and digital, will continue to be underused.

RECOMMENDATIONS FOR THE NEXT STEPS

Recommendation 5-1: The leadership (managers and directors) of biological collections should provide the necessary mechanisms for staff to keep pace with advances in digitization and data management through training in digitization techniques and publishing of standardized quality data that can be efficiently integrated into portals.

Recommendation 5-2: Professional societies should initiate and cultivate opportunities for research collaborations within the biological collections community. These collaborations should include working with the computer and data sciences communities to promote the development and implementation of tools to build the cyberinfrastructure (e.g., data storage, annotation, integration, and accessibility to expand the use of biological collections to a broader range of stakeholders).

Recommendation 5-3: The NSF Directorate for Biological Sciences should continue to provide funding for the digitization of biological collections and for the cyberinfrastructure to support both living and natural history collections.
Specifically, the NSF Directorate for Biological Sciences should:
• partner with other directorates within NSF (e.g., physics, chemistry, computer sciences, and education) and other federal agencies and departments (e.g., the Department of Health and Human Services, the Department of Agriculture, the Food and Drug Administration, the Department of the Interior, the National Oceanic and Atmospheric Administration, the National Aeronautics and Space Administration, and the Department of Energy);
• establish ongoing mechanisms for the biological collections community to meet, develop best practices, and work toward such goals as establishing and implementing unique identifiers, clear workflows, and standardized data pipelines; and
• promote and fund the development of a necessary national cyberinfrastructure, with appropriate tools and technology, to effect the efficient multi-layer integration of data and collections attribution.

REFERENCES

Adam, P. S., G. Borrel, C. Brochier-Armanet, and S. Gribaldo. 2017. The growing tree of archaea: New perspectives on their diversity, evolution and ecology. The ISME Journal 11(11):2407–2425.
Arbeláez-Cortés, E., A. R. Acosta-Galvis, C. DoNascimiento, D. Espitia-Reina, A. González-Alvarado, and C. A. Medina. 2017. Knowledge linked to museum specimen vouchers: Measuring scientific production from a major biological collection in Colombia. Scientometrics 112(3):1323–1341.
Bakker, F. T., A. Antonelli, J. A. Clarke, J. A. Cook, S. V. Edwards, P. G. P. Ericson, S. Faurby, N. Ferrand, M. Gelang, R. G. Gillespie, M. Irestedt, K. Lundin, E. Larsson, P. Matos-Maraví, J. Müller, T. von Proschwitz, G. K. Roderick, A. Schliep, N. Wahlberg, J. Wiedenhoeft, and M. Källersjö. 2020. The global museum: Natural history collections and the future of evolutionary science and public education. PeerJ 8:e8225.
Ball-Damerow, J. E., L. Brenskelle, N. Barve, P. S. Soltis, P. Sierwald, R. Bieler, R. LaFrance, A. H. Ariño, and R. Guralnick. 2019. Research applications of primary biodiversity databases in the digital age. PLoS One 14(9).

Bentley, A. 2018. Integration, attribution, and value in the web of natural history museum data. 2.
Bloom, T. D. S., A. Flower, and E. G. DeChaine. 2017. Why georeferencing matters: Introducing a practical protocol to prepare species occurrence records for spatial analysis. Ecology and Evolution 8(1):765–777.
Bouadjenek, M. R., J. Zobel, and K. Verspoor. 2019. Automated assessment of biological database assertions using the scientific literature. BMC Bioinformatics 20(1):216.
Brenskelle, L., R. P. Guralnick, M. Denslow, and B. J. Stucky. 2020. Maximizing human effort for analyzing scientific images: A case study using digitized herbarium sheets. Applications in Plant Sciences 8(6):e11370.
Cai, L., B. B. Jorgensen, C. A. Suttle, M. He, B. A. Cragg, N. Jiao, and R. Zhang. 2019. Active and diverse viruses persist in the deep sub-seafloor sediments over thousands of years. The ISME Journal 13(7):1857–1864.
Carranza-Rojas, J., H. Goeau, P. Bonnet, E. Mata-Montero, and A. Joly. 2017. Going deeper in the automated identification of herbarium specimens. BMC Evolutionary Biology 17(1):181.
Catlin-Groves, C. L. 2012. The citizen science landscape: From volunteers to citizen sensors and beyond. International Journal of Zoology 2012:349630.
Chapman, A., and J. Wieczorek. 2006. Guide to best practices for georeferencing. Copenhagen, Denmark: Global Biodiversity Information Facility.
Chapman, A. D., L. Belbin, P. F. Zermoglio, J. Wieczorek, P. J. Morris, M. Nicholls, E. R. Rees, A. K. Veiga, A. Thompson, A. M. Saraiva, S. A. James, C. Gendreau, A. Benson, and D. Schigel. 2020. Developing standards for improved data quality and for selecting fit for use biodiversity data. Biodiversity Information Science and Standards 4.
Chen, M. L., A. Doddi, J. Royer, L. Freschi, M. Schito, M. Ezewudo, I. S. Kohane, A. Beam, and M. Farhat. 2019. Beyond multidrug resistance: Leveraging rare variants with machine and statistical learning models in Mycobacterium tuberculosis resistance prediction. EBioMedicine 43:356–369.
Clark, J. A., J. M. Hoekstra, P. D. Boersma, and P. Kareiva. 2002. Improving U.S. Endangered Species Act recovery plans: Key findings and recommendations of the SCB recovery plan project. Conservation Biology 16(6):1510–1519.
Cobb, N. S., L. F. Gall, J. M. Zaspel, N. J. Dowdy, L. M. McCabe, and A. Y. Kawahara. 2019. Assessment of North American arthropod collections: Prospects and challenges for addressing biodiversity research. PeerJ 7:e8086.
Cook, J. A., S. V. Edwards, E. A. Lacey, R. P. Guralnick, P. S. Soltis, D. E. Soltis, C. K. Welch, K. C. Bell, K. E. Galbreath, C. Himes, J. M. Allen, T. A. Heath, A. C. Carnaval, K. L. Cooper, M. Liu, J. Hanken, and S. Ickert-Bond. 2014. Natural history collections as emerging resources for innovative education. BioScience 64(8):725–734.
Cook, J. A., S. Arai, B. Armién, J. Bates, C. A. C. Bonilla, M. B. d. S. Cortez, J. L. Dunnum, A. W. Ferguson, K. M. Johnson, F. A. A. Khan, D. L. Paul, D. M. Reeder, M. A. Revelez, N. B. Simmons, B. M. Thiers, C. W. Thompson, N. S. Upham, M. P. M. Vanhove, P. W. Webala, M. Weksler, R. Yanagihara, and P. S. Soltis. 2020. Integrating biodiversity infrastructure into pathogen discovery and mitigation of emerging infectious diseases. BioScience 70(6):531–534.
Coutinho, F. H., R. Rosselli, and F. Rodriguez-Valera. 2019. Trends of microdiversity reveal depth-dependent evolutionary strategies of viruses in the Mediterranean. mSystems 4(6).
Daru, B. H., D. S. Park, R. B. Primack, C. G. Willis, D. S. Barrington, T. J. S. Whitfeld, T. G. Seidler, P. W. Sweeney, D. R. Foster, A. M. Ellison, and C. C. Davis. 2018. Widespread sampling biases in herbaria revealed from large-scale digitization. New Phytologist 217(2):939–955.
Di Marco, M., S. Chapman, G. Althor, S. Kearney, C. Besancon, N. Butt, J. M. Maina, H. P. Possingham, K. Rogalla von Bieberstein, O. Venter, and J. E. M. Watson. 2017. Changing trends and persisting biases in three decades of conservation science. Global Ecology and Conservation 10:32–42.
Drew, J. A., C. S. Moreau, and M. L. J. Stiassny. 2017. Digitization of museum collections holds the potential to enhance researcher diversity. Nature Ecology & Evolution 1(12):1789–1790.

Elliott, M. J., J. H. Poelen, and J. A. B. Fortes. 2020. Toward reliable biodiversity dataset references. Ecological Informatics 59:101132.

Ellwood, E. R., P. Kimberly, R. Guralnick, P. Flemons, K. Love, S. Ellis, J. M. Allen, J. H. Best, R. Carter, S. Chagnoux, R. Costello, M. W. Denslow, B. A. Dunckel, M. M. Ferriter, E. E. Gilbert, C. Goforth, Q. Groom, E. R. Krimmel, R. LaFrance, J. L. Martinec, A. N. Miller, J. Minnaert-Grote, T. Nash, P. Oboyski, D. L. Paul, K. D. Pearson, N. D. Pentcheff, M. A. Roberts, C. E. Seltzer, P. S. Soltis, R. Stephens, P. W. Sweeney, M. von Konrat, A. Wall, R. Wetzer, C. Zimmerman, and A. R. Mast. 2018. Worldwide engagement for digitizing biocollections (WeDigBio): The biocollections community's citizen-science space on the calendar. BioScience 68(2):112–124.

Fontaine, B., A. Perrard, and P. Bouchet. 2012. 21 years of shelf life between discovery and description of new species. Current Biology 22(22):R943–R944.

Goodwin, Z. A., D. J. Harris, D. Filer, J. R. Wood, and R. W. Scotland. 2015. Widespread mistaken identity in tropical plant collections. Current Biology 25(22):R1066–R1067.

Groom, Q., P. Desmet, L. Reyserhove, T. Adriaens, D. Oldoni, S. Vanderhoeven, S. J. Baskauf, A. Chapman, M. McGeoch, R. Walls, J. Wieczorek, J. R. U. Wilson, P. F. F. Zermoglio, and A. Simpson. 2019. Improving Darwin Core for research and management of alien species. Biodiversity Information Science and Standards 3.

Güntsch, A., R. Hyam, G. Hagedorn, S. Chagnoux, D. Röpert, A. Casino, G. Droege, F. Glöckler, K. Gödderz, Q. Groom, J. Hoffmann, A. Holleman, M. Kempa, H. Koivula, K. Marhold, N. Nicolson, V. S. Smith, and D. Triebel. 2017. Actionable, long-term stable and semantic web compatible identifiers for access to biological collection objects. Database 2017.

Guralnick, R. P., P. F. Zermoglio, J. Wieczorek, R. LaFrance, D. Bloom, and L. Russell. 2016. The importance of digitized biocollections as a source of trait data and a new VertNet resource. Database.

Haston, E. M., R. W. N. Cubey, M. Pullan, H. Atkins, and D. Harris. 2012. Developing integrated workflows for the digitisation of herbarium specimens using a modular and scalable approach. ZooKeys 209:93–102.

Heberling, J. M., L. A. Prather, and S. J. Tonsor. 2019. The changing uses of herbarium data in an era of global change: An overview using automated content analysis. BioScience 69(10):812–822.

Hedrick, B. P., J. M. Heberling, E. K. Meineke, K. G. Turner, C. J. Grassa, D. S. Park, J. Kennedy, J. A. Clarke, J. A. Cook, D. C. Blackburn, S. V. Edwards, and C. C. Davis. 2020. Digitization and the future of natural history collections. BioScience 70(3):243–251.

Heidorn, P. B. 2008. Shedding light on the dark data in the long tail of science. Library Trends 57(2):280–299.

Hendy, A. J. W., and B. J. MacFadden. 2014. Digitizing paleontological collections for new audiences: Past practices and the potential for public participation. The Paleontological Society Special Publications 13:127–128.

Hill, A., R. Guralnick, A. Smith, A. Sallans, R. Gillespie, M. Denslow, J. Gross, Z. Murrell, T. Conyers, P. Oboyski, J. Ball, A. Thomer, R. Prys-Jones, J. de la Torre, P. Kociolek, and L. Fortson. 2012. The Notes from Nature tool for unlocking biodiversity records from museum records through citizen science. ZooKeys 209:219–233.

Hobern, D., B. Baptiste, K. Copas, R. Guralnick, A. Hahn, E. van Huis, E.-S. Kim, M. McGeoch, I. Naicker, L. Navarro, D. Noesgaard, M. Price, A. Rodrigues, D. Schigel, C. A. Sheffield, and J. Wieczorek. 2019. Connecting data and expertise: A new alliance for biodiversity knowledge. Biodiversity Data Journal 7:e33679.

James, S. A., P. S. Soltis, L. Belbin, A. D. Chapman, G. Nelson, D. L. Paul, and M. Collins. 2018. Herbarium data: Global biodiversity and societal botanical needs for novel research. Applications in Plant Sciences 6(2):e1024.

Jarrett, R. L., and K. McCluskey (eds.). 2019. The biological resources of model organisms. Boca Raton, FL: CRC Press.

Karim, T., R. Burkhalter, Ú. Farrell, A. Molineux, G. Nelson, J. Utrup, and S. Butts. 2016. Digitization workflows for paleontology collections. Palaeontologia Electronica. doi: 10.26879/566.

Krishtalka, L., E. Dalcin, S. Ellis, J. Ganglo, T. Hosoya, M. Nakae, I. Owens, D. Paul, M. Pignal, B. Thiers, and S. Masinde. 2016. Accelerating the discovery of biocollections data. Copenhagen, Denmark: GBIF Secretariat.

Lacey, E. A., T. T. Hammond, R. E. Walsh, K. C. Bell, S. V. Edwards, E. R. Ellwood, R. Guralnick, S. M. Ickert-Bond, A. R. Mast, J. E. McCormack, A. K. Monfils, P. S. Soltis, D. E. Soltis, and J. A. Cook. 2017. Climate change, collections and the classroom: Using big data to tackle big problems. Evolution: Education and Outreach 10(1).

Lendemer, J., B. Thiers, A. Monfils, J. Zaspel, E. Ellwood, A. Bentley, K. LeVan, J. Bates, D. Jennings, D. Contreras, L. Lagomarsino, P. Mabee, L. Ford, R. Guralnick, R. Gropp, M. Revelez, N. Cobb, K. Seltmann, and M. C. Aime. 2019. The extended specimen network: A strategy to enhance US biodiversity collections, promote research and education. BioScience 70:1–8.

Lorieul, T., K. Pearson, E. Ellwood, H. Goëau, J.-F. Molino, P. Sweeney, J. Yost, J. Sachs, E. Mata-Montero, G. Nelson, P. Soltis, P. Bonnet, and A. Joly. 2019. Toward a large-scale and deep phenological stage annotation of herbarium specimens: Case studies from temperate, tropical, and equatorial floras. Applications in Plant Sciences 7:e01233.

Meehan, C. J., G. A. Goig, T. A. Kohl, L. Verboven, A. Dippenaar, M. Ezewudo, M. R. Farhat, J. L. Guthrie, K. Laukens, P. Miotto, B. Ofori-Anyinam, V. Dreyer, P. Supply, A. Suresh, C. Utpatel, D. van Soolingen, Y. Zhou, P. M. Ashton, D. Brites, A. M. Cabibbe, B. C. de Jong, M. de Vos, F. Menardo, S. Gagneux, Q. Gao, T. H. Heupink, Q. Liu, C. Loiseau, L. Rigouts, T. C. Rodwell, E. Tagliani, T. M. Walker, R. M. Warren, Y. Zhao, M. Zignol, M. Schito, J. Gardy, D. M. Cirillo, S. Niemann, I. Comas, and A. Van Rie. 2019. Whole genome sequencing of Mycobacterium tuberculosis: Current standards and open issues. Nature Reviews Microbiology 17(9):533–545.

Meineke, E. K., C. C. Davis, and T. J. Davies. 2018. The unrealized potential of herbaria for global change biology. Ecological Monographs 88(4):505–525.

Monfils, A. K., K. E. Powers, C. J. Marshall, C. T. Martine, J. F. Smith, and L. A. Prather. 2017. Natural history collections: Teaching about biodiversity across time, space, and digital platforms. Southeastern Naturalist 16(sp10):47–57.

Nagy, L. G., Z. Merényi, B. Hegedüs, and B. Bálint. 2020. Novel phylogenetic methods are needed for understanding gene function in the era of mega-scale genome sequencing. Nucleic Acids Research 48(5):2209–2219.

Nekola, J. C., B. T. Hutchins, A. Schofield, B. Najev, and K. E. Perez. 2019. Caveat consumptor notitia museo: Let the museum data user beware. Global Ecology and Biogeography 28(12):1722–1734.

Nelson, G., and S. Ellis. 2018. The history and impact of digitization and digital data mobilization on biodiversity research. Philosophical Transactions of the Royal Society B 374(1763):20170391.

Nelson, G., D. Paul, G. Riccardi, and A. R. Mast. 2012. Five task clusters that enable efficient and effective digitization of biological collections. ZooKeys 209:19–45.

Nelson, G., P. Sweeney, L. E. Wallace, R. K. Rabeler, D. Allard, H. Brown, J. R. Carter, M. W. Denslow, E. R. Ellwood, C. C. Germain-Aubrey, E. Gilbert, E. Gillespie, L. R. Goertzen, B. Legler, D. B. Marchant, T. D. Marsico, A. B. Morris, Z. Murrell, M. Nazaire, C. Neefus, S. Oberreiter, D. Paul, B. R. Ruhfel, T. Sasek, J. Shaw, P. S. Soltis, K. Watson, A. Weeks, and A. R. Mast. 2015. Digitization workflows for flat sheets and packets of plants, algae, and fungi. Applications in Plant Sciences 3(9):apps.1500065.

Nelson, G., P. Sweeney, and E. Gilbert. 2018. Use of globally unique identifiers (GUIDs) to link herbarium specimen records to physical specimens. Applications in Plant Sciences 6(2):e1027.

Page, A. J., C. A. Cummins, M. Hunt, V. K. Wong, S. Reuter, M. T. Holden, M. Fookes, D. Falush, J. A. Keane, and J. Parkhill. 2015. Roary: Rapid large-scale prokaryote pan genome analysis. Bioinformatics 31(22):3691–3693.

Page, R. D. M. 2008. Biodiversity informatics: The challenge of linking data and the role of shared identifiers. Briefings in Bioinformatics 9(5):345–354.

Pearson, K. D., G. Nelson, M. F. J. Aronson, P. Bonnet, L. Brenskelle, C. C. Davis, E. G. Denny, E. R. Ellwood, H. Goëau, J. M. Heberling, A. Joly, T. Lorieul, S. J. Mazer, E. K. Meineke, B. J. Stucky, P. Sweeney, A. E. White, and P. S. Soltis. 2020. Machine learning using digitized herbarium specimens to advance phenological research. BioScience 70(7):610–620.

Powers, K. E., L. A. Prather, J. A. Cook, J. Woolley, H. L. Bart, Jr., A. K. Monfils, and P. Sierwald. 2014. Revolutionizing the use of natural history collections in education. Science Education Review 13(2):24–33.

Rabeler, R. 2015. Skeletal records accompanying images: Efficiency vs later utility. Presentation made to the annual meeting of the Society for the Preservation of Natural History Collections. https://www.idigbio.org/content/skeletal-records-accompanying-images-efficiency-vs-later-utility (accessed August 24, 2020).

Rocha, L. A., A. Aleixo, G. Allen, F. Almeda, C. C. Baldwin, M. V. L. Barclay, J. M. Bates, A. M. Bauer, F. Benzoni, C. M. Berns, M. L. Berumen, D. C. Blackburn, S. Blum, F. Bolaños, R. C. K. Bowie, R. Britz, R. M. Brown, C. D. Cadena, K. Carpenter, L. M. Ceríaco, P. Chakrabarty, G. Chaves, J. H. Choat, K. D. Clements, B. B. Collette, A. Collins, J. Coyne, J. Cracraft, T. Daniel, M. R. de Carvalho, K. de Queiroz, F. Di Dario, R. Drewes, J. P. Dumbacher, A. Engilis, M. V. Erdmann, W. Eschmeyer, C. R. Feldman, B. L. Fisher, J. Fjeldså, P. W. Fritsch, J. Fuchs, A. Getahun, A. Gill, M. Gomon, T. Gosliner, G. R. Graves, C. E. Griswold, R. Guralnick, K. Hartel, K. M. Helgen, H. Ho, D. T. Iskandar, T. Iwamoto, Z. Jaafar, H. F. James, D. Johnson, D. Kavanaugh, N. Knowlton, E. Lacey, H. K. Larson, P. Last, J. M. Leis, H. Lessios, J. Liebherr, M. Lowman, D. L. Mahler, V. Mamonekene, K. Matsuura, G. C. Mayer, H. Mays, J. McCosker, R. W. McDiarmid, J. McGuire, M. J. Miller, R. Mooi, R. D. Mooi, C. Moritz, P. Myers, M. W. Nachman, R. A. Nussbaum, D. Ó. Foighil, L. R. Parenti, J. F. Parham, E. Paul, G. Paulay, J. Pérez-Emán, A. Pérez-Matus, S. Poe, J. Pogonoski, D. L. Rabosky, J. E. Randall, J. D. Reimer, D. R. Robertson, M.-O. Rödel, M. T. Rodrigues, P. Roopnarine, L. Rüber, M. J. Ryan, F. Sheldon, G. Shinohara, A. Short, W. B. Simison, W. F. Smith-Vaniz, V. G. Springer, M. Stiassny, J. G. Tello, C. W. Thompson, T. Trnski, P. Tucker, T. Valqui, M. Vecchione, E. Verheyen, P. C. Wainwright, T. A. Wheeler, W. T. White, K. Will, J. T. Williams, G. Williams, E. O. Wilson, K. Winker, R. Winterbottom, and C. C. Witt. 2014. Specimen collection: An essential tool. Science 344(6186):814–815.

Rouhan, G., L. J. Dorr, L. Gautier, P. Clerc, S. Muller, and M. Gaudeul. 2017. The time has come for natural history collections to claim co-authorship of research articles. TAXON 66(5):1014–1016.

Ryan, S. J., C. J. Carlson, E. A. Mordecai, and L. R. Johnson. 2019a. Global expansion and redistribution of Aedes-borne virus transmission risk with climate change. PLOS Neglected Tropical Diseases 13(3):e0007213.

Ryan, S. J., C. A. Lippi, R. Nightingale, G. Hamerlinck, M. J. Borbor-Cordova, B. M. Cruz, F. Ortega, R. Leon, E. Waggoner, and A. M. Stewart-Ibarra. 2019b. Socio-ecological factors associated with dengue risk and Aedes aegypti presence in the Galápagos Islands, Ecuador. International Journal of Environmental Research and Public Health 16(5).

Seltmann, K. C., S. Lafia, D. L. Paul, S. A. James, D. Bloom, N. Rios, S. Ellis, U. Farrell, J. Utrup, M. Yost, E. Davis, R. Emery, G. Motz, J. Kimmig, V. Shirey, E. Sandall, D. Park, C. Tyrrell, R. S. Thackurdeen, M. Collins, V. O’Leary, H. Prestridge, C. Evelyn, and B. Nyberg. 2018. Georeferencing for research use (GRU): An integrated geospatial training paradigm for biocollections researchers and data providers. Research Ideas and Outcomes 4:e32449.

Shouman, S., N. Mason, J. M. Heberling, T. Kichey, D. Closset-Kopp, A. Kobeissi, and G. Decocq. 2020. Leaf functional traits at home and abroad: A community perspective of sycamore maple invasion. Forest Ecology and Management 464:118061.

Soltis, P. S., G. Nelson, A. Zare, and E. K. Meineke. 2020. Plants meet machines: Prospects in machine learning for plant biology. Applications in Plant Sciences 8(6):e11371.

Strasser, B. J. 2008. Genetics. GenBank—natural history in the 21st century? Science 322(5901):537–538.

Suhrbier, L., W. H. Kusber, O. Tschöpe, A. Güntsch, and W. G. Berendsohn. 2017. AnnoSys - implementation of a generic annotation system for schema-based data using the example of biodiversity collection data. Database: The journal of biological databases and curation 2017(1):bax018.

Thiers, B., A. Monfils, J. Zaspel, E. Ellwood, A. Bentley, K. LeVan, J. Bates, D. Jennings, D. Contreras, L. Lagomarsino, P. Mabee, L. Ford, R. Guralnick, R. Gropp, M. Revelez, N. Cobb, J. Lendemer, K. Seltmann, and M. C. Aime. 2019. Extending U.S. Biodiversity collections to promote research and education. https://bcon.aibs.org/wp-content/uploads/2019/04/Extending-Biodiversity-Collections-Full-Report.pdf (accessed August 24, 2020).

Troudet, J., P. Grandcolas, A. Blin, R. Vignes-Lebbe, and F. Legendre. 2017. Taxonomic bias in biodiversity data and societal preferences. Scientific Reports 7(1):9132.

Tschöpe, O., J. A. Macklin, R. A. Morris, L. Suhrbier, and W. G. Berendsohn. 2013. Annotating biodiversity data via the Internet. TAXON 62(6):1248–1258.

Tulig, M., N. Tarnowsky, M. Bevans, A. Kirchgessner, and B. Thiers. 2012. Increasing the efficiency of digitization workflows for herbarium specimens. ZooKeys 209:103–113.

van Rossum, J. 2017. Blockchain for research. Digital Science. Holtzbrinck Publishing Group: London.

Verslyppe, B., W. De Smet, B. De Baets, P. De Vos, and P. Dawyndt. 2014. StrainInfo introduces electronic passports for microorganisms. Systematic and Applied Microbiology 37(1):42–50.

Vollmar, A., J. Macklin, and L. Ford. 2010. Natural history specimen digitization: Challenges and concerns. Biodiversity Informatics 7:93–112.

Webster, M. S. (ed.). 2017. The extended specimen: Emerging frontiers in collections-based ornithological research. Boca Raton, FL: The American Ornithological Society. Original edition, Studies in Avian Biology.

Wieczorek, J., D. Bloom, R. Guralnick, S. Blum, M. Döring, R. Giovanni, T. Robertson, and D. Vieglais. 2012. Darwin Core: An evolving community-developed biodiversity data standard. PLoS One 7(1):e29715.

Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-Beltran, A. J. G. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. C. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3:160018.

Willemse, L., E. van Egmond, V. Runnel, H. Saarenmaa, A. C. Rubio, K. Gödderz, and X. Vermeersch. 2019. Future challenges in digitisation of private natural history collections. Biodiversity Information Science and Standards 3:e37640.

Willis, C. G., E. R. Ellwood, R. B. Primack, C. C. Davis, K. D. Pearson, A. S. Gallinat, J. M. Yost, G. Nelson, S. J. Mazer, N. L. Rossington, T. H. Sparks, and P. S. Soltis. 2017. Old plants, new tricks: Phenological research using herbarium specimens. Trends in Ecology & Evolution 32(7):531–546.

Biological collections are a critical part of the nation's science and innovation infrastructure and a fundamental resource for understanding the natural world. They underpin basic science discoveries and deepen our understanding of challenges such as global change, biodiversity loss, sustainable food production, ecosystem conservation, and human health and security. They are important resources for education, both in formal training for the science and technology workforce and in informal learning through schools, citizen science programs, and adult learning. However, the sustainability of biological collections is under threat. Without enhanced strategic leadership and investments in their infrastructure and growth, many biological collections could be lost.

Biological Collections: Ensuring Critical Research and Education for the 21st Century recommends approaches for biological collections to develop long-term financial sustainability, advance digitization, recruit and support a diverse workforce, and upgrade and maintain a robust physical infrastructure in order to continue serving science and society. The aim of the report is to stimulate a national discussion regarding the goals and strategies needed to ensure that U.S. biological collections not only thrive but continue to grow throughout the 21st century and beyond.
