A revolution in digital information is occurring across all realms of human endeavor. This revolution is characterized by a tremendous increase in the quantity of digital information, dynamism in both the purposes for and technology by which digital information is used, and multiplicity of digital information handlers. The implications of this revolution in digital information include an imperative for effective long-term digital curation and a workforce sufficient in skill and number to meet that challenge.
The vast increase in the quantity of digital information is made possible by computer and networking technologies that create, capture, copy, share, and store massive amounts of information easily and at very low cost. Many studies examining the production, use, and sharing of digital information confirm an astounding rate of increase (Lyman and Varian, 2003; Bohn and Short, 2009; IDC, 2014). The increase is occurring across all sectors, from scientific research to government administration, health care, business, and cultural and personal expression.
Research in the sciences is producing an enormous and rapidly growing flow of digital information. From distant satellites to medical implants, a great range of sensors permits the collection of unprecedented quantities of digital information across the scientific disciplines. That vast quantity in turn creates challenges of how best to share, store, manage, and analyze the data. Indeed, the availability of big data has transformed many scientific disciplines, with fields such as molecular biology, biodiversity, ecosystem studies, and geography becoming very data intensive. Hybrid fields such as bioinformatics, biodiversity informatics, ecoinformatics, and geospatial science have also emerged in response to the immense quantities of digital information.
Government is another source of the huge increase in digital information. The federal government generates a tremendous flow of administrative documents from agencies such as the Social Security Administration, Medicare, Medicaid, the Veterans Administration, and the Internal Revenue Service, while state and local governments produce records in such areas as education, voter registration, and property ownership. Federal and state governments also generate a large volume of statistical data that are collected and disseminated for policy research (Card et al., 2010).1
1 The major federal statistical agencies include the Census Bureau, Bureau of Labor Statistics, Bureau of Transportation Statistics, Bureau of Justice Statistics, National Center for Education Statistics, National Agricultural Statistics Service, and the National Center for Health Statistics. In addition, the Office of Management and Budget identifies approximately 80 other federal agencies and organizations that produce statistics (Office of Management and Budget, 2012).
The health sector also contributes to the unwieldy quantity of digital information. With the increased use of digital imaging technology and the shift to electronic health records, health care providers and insurance companies are recording detailed information on patients, diagnoses, treatments, and payments. The increased flow of digital information is used to analyze the efficacy of treatments, trends in diseases and chronic conditions over time, and costs of various procedures and treatments.
Collection of vast amounts of digital information is ubiquitous in the private sector as well. Retailers, banks, credit-rating agencies, and insurance companies record transactions as digital information. In the entertainment industry, digital media are the primary, and in some cases the exclusive, mode for distributing products, whether texts, music, games, or motion pictures (Academy of Motion Picture Arts and Sciences, 2007). Commercial strategies have been transformed by the use of digital information. Many companies use data analytics, web analytics, and other business intelligence techniques to analyze consumer behavior, target advertising toward specific individuals or consumer groups, manage inventory and production schedules, and devise business strategy (Manyika et al., 2011).
Private citizens, too, are creating and sharing enormous amounts of digital information. Social media are platforms for huge quantities of photos, videos, and personal information, much of it ephemera, yet also perhaps of cultural and historical value or useful in social research (Lee, 2011). According to an estimate of information generation in 2012, every minute of the day users uploaded 48 hours of new video to YouTube, sent 204,166,667 e-mail messages, and submitted over 2 million search queries to Google (Spencer, 2012).
Much of the increased flow of information is “born digital.” Research data from sensors as varied as particle accelerators, astronomical observatories, remote sensing platforms, or automated DNA sequencers are captured in digital formats. Government, commercial, and health records are initiated in digital formats. A variety of technologies permit personal and cultural creative expression as digital information. A substantial portion of this vast new trove of digital information also results from major initiatives to digitize analog data, from historic maps and weather almanacs to audio recordings, photographs, and even the label data on museum specimens. Libraries, archives, and museums are transitioning from physical to digital collections and from manual to automated processes for collections management.
One further component of the vast increase in digital information is metadata. Metadata, or data about data, describe the contexts and the content of data files. Accurate and complete metadata are essential for analyzing data and can themselves be an important resource for research. In some instances, the volume of metadata required for effective documentation exceeds the volume of the data being described.
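To make the idea of metadata concrete, the following is a minimal sketch in Python of a descriptive record for a single data file, together with a completeness check. The field names, the list of required fields, and the sample values are hypothetical and chosen only for illustration; in practice a repository would typically follow a metadata standard agreed on by its community.

```python
import json
from dataclasses import dataclass, asdict, field

# Hypothetical set of fields a curator might require; not taken from any standard.
REQUIRED_FIELDS = ("title", "creator", "date_created", "file_format", "description")


@dataclass
class MetadataRecord:
    """Data about one data file: its content and its context."""
    title: str = ""
    creator: str = ""
    date_created: str = ""      # ISO 8601 date, e.g. "2012-06-19"
    file_format: str = ""       # e.g. "text/csv"
    description: str = ""
    keywords: list = field(default_factory=list)


def missing_fields(record: MetadataRecord) -> list:
    """Return the names of required fields that are still empty."""
    values = asdict(record)
    return [name for name in REQUIRED_FIELDS if not values[name].strip()]


def to_json(record: MetadataRecord) -> str:
    """Serialize the record so it can be stored alongside the data file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    record = MetadataRecord(
        title="Sea surface temperature readings",   # illustrative values only
        creator="Example Research Vessel",
        file_format="text/csv",
        keywords=["temperature", "ocean"],
    )
    print(to_json(record))
    print("Incomplete fields:", missing_fields(record))
```

A check of this kind is one simple way to flag incomplete documentation before a file is accepted into a collection; it says nothing about whether the values supplied are accurate, which still requires human review.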
This vast increase in the sheer quantity of digital information—from whatever source or sector, whether originating in or transferred to digital format, and whether consisting of data or metadata—presents many challenges for digital curation. The capture, management, preservation, and storage of content this large require significant curatorial skills and knowledge.
In addition to the massive increase in the sheer quantity of digital information, other characteristics of the revolution in digital information demand curatorial expertise. Much digital information is being used or reused in ways not anticipated when that information was collected. Digital information moves readily across temporal boundaries, as digitized data from ships’ logs in the seventeenth century are analyzed by climate scientists in the twenty-first. Digital information also moves across sectorial boundaries, as epidemiologists examine commercial data on consumer searches for flu remedies. Digital information ignores disciplinary boundaries as well, as researchers in bioinformatics combine datasets originating in biology, genetics, and engineering. Such dynamism in the use and reuse of digital information places high demand on curatorial expertise. The curatorial challenges of interoperability and accessibility are great when the uses of digital information are so dispersed and fluid.
The technology for capturing, managing, and storing digital information is also in continual flux. Both the hardware and software for accessing, interpreting, and preserving digital information are continually being upgraded. Curatorial strategies for storage and retrieval are therefore never definitively settled. Dependencies between data, software, and metadata also raise challenges for curation. In a rapidly shifting technological environment, software and metadata can change independently of the data; old software quickly becomes unusable and old metadata become impossible to interpret. Anticipating such problems and developing strategies to mitigate them are a core activity of digital curation.
The revolution in digital information may also be characterized by the multiplicity and diversity of people handling digital information and the contexts in which they do so. Handlers of digital information include professionals trained and engaged in curation per se, as well as experts in a variety of domains who must do some digital curation in order to accomplish other aims. But there is also an enormous range of others—librarians, administrators, and those participating in crowdsourcing—who are producing or gaining access to stores of digital information. They do so in a variety of organizational contexts, from enormous government-funded data repositories to smaller libraries, from major commercial databanks and cloud services to startup companies, as well as in private archives and collections.
The multiplicity and diversity of people involved with digital information and the contexts of that involvement have implications for digital curation and curators. Different levels of curation might be appropriate to different types of producers and users of digital information. Not only the technology of access to digital information but also the propriety of it is a concern, with curators having to address issues of data security. Methods and approaches of digital curation will also need to be adjusted in response to the tremendous range of resources, methodologies, and organization of workflows in the very different settings where digital curation activities occur.
The revolution in digital information requires an accompanying surge in the advancement of digital curation, and therefore in the digital curation workforce. How to meet the demand for a digital curation workforce, suitable both in expertise and in number, is the challenge the study committee addressed. The specific charge to the study committee, as described in the Statement of Task, is as follows:
1. Identify the various practices and spectrum of skill sets that digital curation comprises, looking in particular at human versus automated tasks, both now and in the foreseeable future.
2. Examine the possible career path demands and options for professionals working in digital curation activities, and analyze the economic and social importance of these employment opportunities for the nation over time. In particular, identify and analyze the evolving roles and models of digital curation functions in research organizations, and their effects on employment opportunities and requirements.
3. Identify and assess the existing and future models for education and training in digital curation skill sets and career paths in various domains.
4. Produce a consensus report with findings and recommendations, taking into consideration the various stakeholder groups in the digital curation community, that address items 1–3, above.
The remainder of this chapter defines digital curation and its key elements, reflects further on the characteristics of digital curation and its workforce, delineates the topical scope and time frame of the committee’s work, and presents the organization of this report.
1.5.1 Digital Curation
For the purposes of this study, digital curation is defined as the active management and enhancement of digital information assets for current and future use. After reviewing numerous alternatives, the committee adopted this definition so as to encompass a wide range of curatorial activities and practices. This section considers the term digital curation itself and elaborates on each element of the definition.
Digital curation differs in several ways from curation as it is traditionally understood. Generally, curation denotes the selection, care, and preservation of collections of objects. The content of curated collections is typically relatively small, consisting of rare or unique works of art, rare books and manuscripts, important natural and physical specimens, or cultural artifacts. Curation takes place in relatively limited organizational contexts: libraries, archives, museums, art galleries, herbaria, and similar institutions; the work of specially trained curators has focused primarily on preserving and archiving collections within these settings. Digital curation displays some continuity with this tradition of curation. Regardless of whether a collection is physical or digital, a curator must appraise its value and relevance to the community of potential users; determine the need for preservation; document provenance and authenticity; describe, register, and catalog its content; arrange for long-term storage and preservation; and provide a means for access and use.
Yet digital information also poses many new challenges for curation: the immense and ever-increasing quantities of material to be curated, the need for active and ongoing management in a context of continually changing uses and technology, and the great diversity of organizational contexts in which curation occurs.
1.5.2 Active Management and Enhancement
The phrase “active management and enhancement” was chosen to distinguish curation from simply collecting and storing data and information. Active management denotes planned, systematic, purposeful, and directed actions that make digital information fit for a purpose. It includes coordinated activities that allow users to understand and exploit digital information assets and to ensure their integrity over time. Active management also refers to activities that ensure that digital information will remain discoverable, accessible, and useable for as long as potential users have a need or a right to use it. It may further involve securing digital information from unauthorized access.
Active management of digital information entails a wide range of both managerial and technical activities. Relevant managerial activities include developing policies for digital curation; assessing risks to the organization that might result from current technology, policies, and curation practices; identifying information assets; evaluating the effectiveness of systems and processes that support digital curation; monitoring compliance with regulations and best practices; mobilizing financial and technical resources for curation; and recruiting and training qualified digital curation personnel to support consistent curation practices across an organization. Technical activities include working directly with the hardware and software systems that support information management, such as establishing and operating repositories for long-term archival management of digital information, organizing and cataloging digital information assets, creating or enhancing the metadata associated with digital information objects, disseminating digital information, and managing access to repositories and their content.
Enhancement means taking measures to increase the value of digital information for current and future use. Most digital information is neither naturally useful nor immediately valuable at the moment it is created or collected. Curation processes that reduce or eliminate noise in the data and that detect and correct errors or other anomalies may increase its immediate utility. Data may need to be repackaged to prevent format obsolescence or represented in a form that satisfies the needs of specific user communities. Enhancement does not, of course, include intentional manipulation to support false conclusions.
The collection or assembly of descriptive metadata is another very important aspect of enhancement. Transforming digital data into useful information usually requires active intervention by skilled people and software applications. Furthermore, because digital information is fragile, corruptible, easily altered, and subject to accidental and intentional deletion, maintaining the integrity of information is a critical aspect of digital curation. Digital curation can enhance the integrity of digital information and increase its trustworthiness through security and restricted access to curation systems, replication, documentation of any transformations of the information, and auditable processes and procedures.
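As an illustration of how replication and documentation can support integrity, the sketch below uses a cryptographic checksum to verify a copy and to detect later alteration, and it records each action in a simple log. This is one common technique offered under stated assumptions, not a method prescribed by the committee; the function names, the log structure, and the sample file are hypothetical.

```python
import hashlib
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, target: Path, audit_log: list) -> None:
    """Copy a file, verify the copy, and record the action in the audit log."""
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    audit_log.append({
        "action": "replicate",
        "source": str(source),
        "target": str(target),
        "sha256": before,
        "verified": before == after,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })


def is_intact(path: Path, recorded_digest: str) -> bool:
    """Return True if the file still matches the digest recorded earlier."""
    return sha256_of(path) == recorded_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        original = Path(workdir) / "observations.csv"   # illustrative file
        original.write_text("station,reading\nA,12.5\n")
        copy = Path(workdir) / "observations_copy.csv"

        log = []
        replicate(original, copy, log)
        print("Copy verified:", log[0]["verified"])

        # Simulate accidental alteration of the copy.
        copy.write_text("station,reading\nA,99.9\n")
        print("Copy intact after change:", is_intact(copy, log[0]["sha256"]))
```

A checksum detects that a file has changed but not why or by whom, which is why the sketch pairs it with a log entry for each action; restricted access to curation systems remains a separate safeguard.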
1.5.3 Digital Information Assets
Assets have value. Not all digital information is an asset. In a stream of ephemera and communication, determining which digital information constitutes an asset, as opposed to a liability, an intermediate product, or just plain noise, is highly dependent on the context in which it is used or is anticipated to be used. Further, some digital information can become an asset through curatorial activity—not only through enhancement of its utility, but by measures to ensure ease of discovery, access, and distribution.
1.5.4 Current and Future Use
That digital information has both current uses and potential future uses has important implications for digital curation. The range of current uses across many sectors requires curation of digital information for that contemporaneous diversity of users and methodologies. Future use of digital information, both within and beyond the context in which it was first created or collected, places additional demands on curation. Attention must be paid to updating and upgrading technologies, software, and metadata, both for the preservation of the digital information and for maintenance of access to it.
To further introduce the committee’s approach to the topic of preparing a workforce for digital curation, some attributes of how, by whom, and where that work is conducted merit comment. An essential attribute of the work of digital curation is that it is accomplished along a continuum. It does not consist of a discrete profession labeled “digital curator” with a defined set of tasks undertaken in a dedicated setting. Rather, it is more usefully conceived as a series of activities undertaken by a range of personnel in a great variety of settings. This heterogeneity has major implications for measuring the work of digital curation, estimating demand for its workforce, and determining how best to train that workforce.
The continuum of professional positions including some responsibility for digital curation is very long. At one end of that continuum are specialists whose jobs consist exclusively or primarily of curation. They are designated personnel with specific expertise and training in the field of digital curation. At the opposite end of the continuum are jobs that may include curatorial tasks from time to time. Digital curation may be an essential but not predominant part of these jobs. The curatorial activities included in these jobs may be deemed a chore, even a distraction from the primary work to be accomplished; they may not even be recognized as curation.
Importantly, the two ends of that continuum are connected. Most digital information derives its meaning, value, and utility in relation to the domains, problems, or processes to which it is applied. Therefore, professional curators cannot make sound decisions, provide services, or add value to digital information without some knowledge of those domains, problems, or processes. In scientific fields, for example, digital curators need familiarity with the terminology, methods, common data types and formats, standards of acceptance, and norms of a specific scientific community. In commercial environments, digital curators need knowledge of the competitive environment, regulatory framework, and nomenclature of a particular line of business. At the other end of the continuum, professionals such as research scientists or marketing analysts who are engaged in work that is seen as far from curatorial must nonetheless be proficient in curation. Without the knowledge and skills to conduct sound curatorial practices, such as recording metadata or maintaining accessible formats or properly combining datasets, those professionals will fail at the rest of their work.
The organizational and institutional settings in which digital curation is accomplished also vary along a large spectrum. Some of these, such as formal data centers and repositories or government statistical agencies, may be explicitly dedicated to the curation of digital information. They may pursue curation as an end in itself. In other settings, the curation of digital information may be but one component of a very broad set of activities, in which curation serves but does not define the goal. There is variation all along the spectrum. In some organizational settings, the work of curation will be concentrated among specific personnel; in others it will be dispersed. Some settings will take responsibility for digital information from the moment it is collected or created, whereas others will begin to manage it only after the original producer has assigned metadata, or after the original user no longer has a need for it.
The various organizations and institutions in which digital curation is conducted also have very different resources, and therefore very different ways of organizing and accomplishing the work of curation. Investment in technology and human capital varies. The potential for automating some activities of digital curation is also variable. It may be affected by such factors as the size of organizations, the types of technical systems in place, the volume and types of information, and the degree to which curation tasks have been integrated into workflows and business processes. These variations can be found not only between different sectors (e.g., financial, retail, entertainment, manufacturing, health care, research, and education), but also within organizations in the same sector.
Digital curation is not the only term used to characterize activities that enhance the value of information. Information management, data management, data stewardship, data governance, and digital archiving are related terms used to describe processes and activities that overlap with curation. Information management is concerned with the full range of issues that affect acquisition, organization, processing, and delivery of information including efficiency of operations, controlling costs, and regulatory compliance, often in the context of an organization-wide information architecture. According to Data Management International (DAMA International), data management is the “development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise.” Information management and similar fields entail processes and activities that overlap with curation, yet are distinct from it. What distinguishes curation from these other fields is its emphasis on enhancing the value of information assets for current and future use and its attention to the repurposing and reuse of information, both within and beyond the context in which it was first created or collected.
In the absence of a formal occupational classification of digital curator responsible for a delimited set of tasks in a standard work setting, the committee’s approach was to identify and analyze digital curation activities, investigating different scenarios for distributing digital curation activities. The way responsibility for digital curation activities is distributed across and within organizations will be an important factor in determining the right mix of curation knowledge and skills in the workforce. The committee’s approach also reflects the dynamic nature of digital curation, in which standards and best practices are still evolving and automation lags behind the exponential growth in digital information.
To minimize confusion over the scope of activities and the necessary knowledge and skills that digital curation comprises, the committee reached a consensus about some fields that are related to digital curation, but beyond the scope of this report. Data science and data analytics are two related fields, which, like digital curation, are recent and lack clear definitions and boundaries. The committee understood data science to mean the application of mathematics, statistics, and computer science to extract meaning from data and solve complex problems using statistical techniques, algorithms, and visualization. Data analytics extends statistical analysis with descriptive and predictive models to obtain knowledge from data by using insight from analyses to identify trends, evaluate performance, characterize consumer behavior, detect anomalies, recommend action, or guide and communicate decision making.
After reviewing definitions of these new jobs, the committee decided that positions that focus exclusively or primarily on developing algorithms, refining statistical techniques, mining data, and applying analytics to data did not fall within the committee’s definition of digital curation. Digital curation differs from data science and analytics because curation is needed for many types of digital information, such as websites, blogs, social media, music, videos, geospatial information, online publications, and textual databases, to name but a few examples. Furthermore, data analytics and data science typically focus on the immediate use of data for scientific and commercial purposes, rather than on current and potential future use of digital information.
The statement of task asked the committee to identify and analyze the evolving roles and models of digital curation functions in research organizations. The report pays particular attention to digital curation capacity and needs in research environments where recent changes in policy may raise the visibility of digital curation.
A further aspect of the scope of this report follows from the committee’s decision to address digital curation activities rather than limit itself to examining the narrower occupational category of digital curator. Although emerging career paths of professional digital curators were investigated, it was not possible to analyze the economic and social importance of employment opportunities occurring across the full range of digital curation activities. The committee recognized that digital information flows readily across national borders through interconnected global infrastructures. Nevertheless, the committee determined that in order to make its task tractable, the primary focus of the report would be on workforce and educational needs for digital curation in the United States, while drawing on evidence from other countries and international efforts for salient examples and purposes of comparison.
Regarding the time frame, the committee was asked to consider the spectrum of knowledge and skills needed for digital curation now and into the foreseeable future. Given the dynamic nature of information technology and uncertainty about the speed at which organizations will develop policies, systems, and good practices for digital curation, the committee defined its time horizon as the next decade. Even within a 10-year time frame, numerous unknowns may influence the nature of digital curation activities and the demand for individuals with digital curation knowledge and expertise.
This chapter has characterized the revolution in digital information, defined digital curation, and reflected on some characteristics and contexts of curatorial work. It has also delineated the Statement of Task and clarified the scope of what the committee undertook in order to address that task. Chapter 2 examines the evolution, current state, and ongoing development of digital curation. It also considers how to measure the benefits and costs of digital curation. Chapter 3 uses a variety of resources to devise estimates of current and future demand for the workforce in digital curation. Chapter 4 addresses the education of a workforce sufficient to meet the varied challenges of digital curation as they arise across different sectors and domains, within different organizational settings, at many different levels. It identifies and assesses the current state of educational opportunities in digital curation and considers steps for future progress.
Academy of Motion Picture Arts and Sciences, Science and Technology Council. 2007. The Digital Dilemma: Strategic Issues in Archiving and Accessing Digital Motion Picture Materials. http://www.oscars.org/science-technology/council/projects/digitaldilemma/.
Bohn, R. E., and J. E. Short. 2009. How Much Information? 2009 Report on American Consumers. December. Global Information Industry Center, University of California at San Diego. http://hmi.ucsd.edu/pdf/HMI_2009_ConsumerReport_Dec9_2009.pdf. Accessed June 16, 2013.
Card, D., R. Chetty, M. S. Feldstein, and E. Saez. 2010. Expanding access to administrative data for research in the United States. In Ten Years and Beyond: Economists Answer NSF’s Call for Long-Term Research Agendas, C. L. Schultze and D. H. Newlon, eds. American Economic Association. Available at SSRN: http://ssrn.com/abstract=1888586 or http://dx.doi.org/10.2139/ssrn.1888586.
IDC. 2014. The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things. White Paper sponsored by EMC². http://www.emc.com/leadership/digital-universe/2014iview/index.htm?cmp=micro-big_data-general-emc&page=http%3A%2F%2Fwww.emc.com%2Fcampaign%2Fbigdata%2Findex.htm. Accessed October 14, 2014.
Lee, C. A., ed. 2011. I, Digital: Personal Collections in the Digital Era. Chicago, IL: Society of American Archivists.
Lyman, P., and H. R. Varian. 2003. How Much Information? http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/. Accessed June 16, 2013.
Manyika, J., M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation.
Office of Management and Budget. 2012. Statistical Programs of the United States Government: Fiscal Year 2012. http://www.whitehouse.gov/sites/default/files/omb/assets/information_and_regulatory_affairs/12statprog.pdf.
Spencer, N. 2012. How much data is created every minute? Blog post to Visual News posted on June 19. http://www.visualnews.com/2012/06/19/how-much-data-created-every-minute/?view=infographic. Accessed June 16, 2013.