Summary and Recommendations
The National Archives and Records Administration (NARA) launched its Electronic Records Archives (ERA) program in 1998 to create a system to preserve and provide access to federal electronic records. Early steps in the ERA program included NARA’s exploration of possible solutions and its undertaking of development projects with partners, including the San Diego Supercomputer Center, the University of Maryland, and the Georgia Tech Research Institute. In 2004, NARA released the final version of its ERA request for proposals (RFP) and selected two contractors to develop designs for the ERA.
The first report of the National Research Council’s Committee on Digital Archiving and the National Archives and Records Administration and its subsequent letter report provided recommendations on design, engineering, and related issues facing the ERA program, which was being conceived at the time of their writing. Although some of the reports’ conclusions relate to specific development initiatives and early design ideas, most of the observations about archive system design are not tied to these specifics and should remain useful to NARA as it develops, refines, and iterates the ERA program. The “Summary and Recommendations” chapter of the committee’s first report is reprinted in Appendix B of the present report, and its 2003 letter report is reprinted in Appendix C.
The ERA program has been involved in procurement activities, but this committee has not been given access to information about the ERA design since the issuance of the ERA program’s RFP, and this report does not contain, or comment on, technical details about the current system designs. The committee does urge that NARA find a mechanism, consistent with procurement regulations, for obtaining a comprehensive outside technical review of the ERA designs that are proposed by its contractors. The committee’s first report recommended that NARA establish an ongoing advisory group on digital preservation and information technology (IT) system design; such a body might be charged with carrying out a review of this type.
This final report focuses on longer-term, more strategic issues that are related to electronic records archiving at NARA, including technology and other trends that shape the context in which the ERA exists, the archival processes of the ERA itself, and the future evolution of the ERA system. In addition, it addresses an important set of technical and design issues associated with assuring record integrity and authenticity that were not covered in detail in the committee’s earlier reports.
NARA recognizes that it faces a significant challenge. The ERA program, the Records Management Redesign initiative, and NARA’s work on records management under the e-government initiative led by the Office of Management and Budget all reflect NARA’s awareness of these problems and represent positive steps in starting to address them. As the findings and recommendations below indicate, electronic records represent both a substantial challenge and a significant opportunity for NARA.
The committee’s findings and recommendations are organized under and support the following six high-level recommendations:
Get ready for a rapidly rising tide of electronic records.
Plan for continuing technology change and increasing user expectations.
Reengineer relations with federal agencies to help them create records that are archive-ready.
Do not assume that ERA is unique: become more involved with other organizations that have interests in preserving electronic records.
Learn how to exploit the enthusiasm and capabilities of the research community and work with others who do that well.
Take strong measures internally and provide government-wide leadership to ensure record integrity and provenance.
These recommendations are discussed in more detail below.
RECOMMENDATION 1. PLAN FOR A RISING TIDE OF ELECTRONIC RECORDS
Finding 1.1 Within a short time, the vast majority of records will be electronic. The volume of records and the challenge of preserving them will continue to grow significantly.
Four technology effects are creating an avalanche of digital materials: the growing fraction of information that is born digital and not systematically retained on paper, the relative ease and inexpensiveness of recording information digitally compared with the effort and cost of recording it on paper, the advent of new technologies for communicating and recording information, and the decreasing cost of storage, which drives the retention of information that previously would have been discarded.
It is reasonable to expect that the growth in the amount of digital information produced in society at large will likewise occur in government, and with it growth in the volume of permanent records. The volume of data to be stored will also grow because individual records will continue to become larger. In addition, the total number and variety of record data types that must be stored by NARA and other repositories will increase over time as new versions of existing types are created and entirely new ones are invented.
Recommendation 1.1 The National Archives and Records Administration (NARA) should devote a significant and growing fraction of its attention and resources to electronic records, especially in the areas of records management, appraisal, and transfer of records to NARA.
A rapid shift to electronic records has two important implications. First, as the findings and recommendations below detail, many of the processes associated with appraisal, scheduling, ingest (i.e., intake of records and associated metadata), and access will require major modification in order to accommodate the volume of electronic records expected. Second, as the share of records that are electronic continues to grow rapidly, the relative resources allocated for their management compared with resources for managing traditional records will need to grow as well. Over the coming decade or so, electronic records will become the primary line of business for NARA rather than a side activity. This change will have implications for NARA’s organizational culture, structure, and staffing.
Finding 1.2 New forms of records will emerge as electronic systems capture new kinds of information that may also constitute essential evidence.
The way that government conducts its business has been transformed by the pervasive use of electronic systems. As those systems evolve, so too will the types of records that must be handled by NARA. Certainly, a great many formal government communications continue to use familiar, traditional records: forms, memoranda, pictures, and so on. But other types of information that could well be deemed “essential evidence” are also being captured by electronic systems. They include new forms of communication, dynamically generated information, and transactions. Saving the potentially huge amounts of data associated with these new types of information, re-creating the behavior of old IT systems decades later, and establishing what should be retained as a permanent record are all extremely difficult challenges.
Recommendation 1.2 NARA should develop the capacity to predict and anticipate significant changes in record types and volume, styles of record keeping, and concepts of what should be preserved as a permanent record as well as the implications of such changes for future versions of the Electronic Records Archives (ERA).
Sustained growth in record volume, the continual introduction of new record types, new forms of record keeping, and new concepts of what constitutes a record mean that NARA will need the knowledge and agility to respond in a timely and effective manner so that it can make necessary changes to its systems and processes and so that it can advise and influence agencies as they implement new systems. To stay abreast of developments, NARA does not need detailed surveys, which are costly and difficult to conduct, but it does need to develop the capacity internally and externally (e.g., through research activities or advisory committees) to predict changes and discontinuities in record types by monitoring new data types, information production rates, and similar trends throughout society and within government.
RECOMMENDATION 2. PLAN FOR CONTINUING TECHNOLOGY CHANGE AND RISING USER EXPECTATIONS
Because plans at the outset are for the ERA to operate for a long time, considerable technology change will occur over the course of its existence. The committee does not have a crystal ball allowing it to see far into the technology future, but past experience does suggest some relevant technology trends that are likely to persist, as described below. Technology changes will, in turn, fuel user expectations: as new technology yields new capabilities that are deployed in systems outside NARA, users will expect the ERA to keep up. These expectations are also considered below.
Finding 2.1 Sustained performance improvements in the ERA’s component technologies will continue indefinitely.
Although precise changes in the ERA’s component technologies cannot be predicted, some general trends are apparent. For example, as in the past, new forms of storage providing such desired characteristics as higher volumetric density, lower cost, and faster access times will continue to emerge, and the commercial market will move toward these new technologies. Also, the familiar trends of ever-faster processors and networks will continue, making cost-effective network transfer and processing of ever-larger volumes of records possible. The past 20 years have seen sustained improvements in computing performance in many different respects, and there is no reason to suppose that such sustained improvements will not continue.
Recommendation 2.1 Because radical changes in information technology cannot be specifically anticipated, the ERA should be designed for change.
The continuous and significant change in technology capabilities into the indefinite future means that flexibility in the design of the Electronic Records Archives will be essential. The committee’s first report stressed the need for a modular design for the ERA to allow components to evolve or to be replaced incrementally—advice that remains critical for the success of the ERA’s initial design and future evolution. As the ERA evolves, improving its ability to change should be a major objective. For example, use of commercial off-the-shelf technology should be emphasized for the ERA, and ERA software should be written to be as portable as possible across offerings of different computer and storage vendors. Because network interfaces and protocols are likely to remain relatively stable, the ERA should conform to prevailing network standards. The ERA should also be designed to be readily extensible—that is, to accommodate change such as the introduction of new record formats without requiring immediate or widespread modification of the system.
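One way to picture the extensibility described above is a registry in which support for a new record format is added as a separate handler, without editing the code that calls it. The following is only a minimal sketch under stated assumptions: the names `register_format` and `render`, and the format labels, are hypothetical and are not drawn from any ERA design.

```python
from typing import Callable, Dict

# Hypothetical type: a handler turns the stored bits of a record into viewable text.
FormatHandler = Callable[[bytes], str]

# Hypothetical registry mapping a format label to its handler.
_HANDLERS: Dict[str, FormatHandler] = {}


def register_format(label: str) -> Callable[[FormatHandler], FormatHandler]:
    """Return a decorator that registers a handler for one record format."""
    def decorator(func: FormatHandler) -> FormatHandler:
        _HANDLERS[label.lower()] = func
        return func
    return decorator


def render(label: str, payload: bytes) -> str:
    """Render a record; unknown formats are described generically, not rejected."""
    handler = _HANDLERS.get(label.lower())
    if handler is None:
        # The bits are still held; only interpretation is deferred.
        return f"<uninterpreted {label} record, {len(payload)} bytes>"
    return handler(payload)


@register_format("text/plain")
def _render_text(payload: bytes) -> str:
    # Replace undecodable bytes rather than failing on damaged input.
    return payload.decode("utf-8", errors="replace")
```

In this sketch, supporting a new format later means adding one decorated function; `render` and its callers are unchanged, which is the property the recommendation asks for.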
Finding 2.2 The gains of new technologies will be highly skewed toward higher-volume products.
Certain computer technologies, such as ATA/IDE disk drives and Ethernet networks, are high-volume, low-cost products for which the cost-performance ratio steadily improves. These products generally conform to de facto industry standards that have a relatively long lifetime, and backward compatibility is generally provided across several product generations. Low-volume products, in contrast, rarely exhibit sustainable cost-performance characteristics and often disappear rapidly from use. Today’s trend in moving to disk from tape for mass storage is an example of what often happens in such cases, with a very high volume product replacing a relatively low volume product.
Recommendation 2.2 Components of the ERA should evolve in step with prevailing commercial trends.
The ERA will benefit from continuing technology evolution only if the ERA implementation stays roughly in step with mainstream commercial trends. NARA should therefore use high-volume products wherever possible. The functions of the ERA are not uncommon enough to support the use of unique hardware or software components; using such components would expose NARA to the risk of significantly greater costs and more difficult system evolution. NARA should be wary of special-purpose solutions that do not enjoy the benefits of continuous technology improvements driven by a large and active community of users. The ERA program should plan and budget for renovations of both hardware and software to stay on the technology curve of mainstream IT.
Finding 2.3 Full-content search, information retrieval, and information extraction offer inexpensive and reasonable, albeit imperfect, capabilities today for textual materials.
One of the major challenges for the ERA system is that of ingesting the anticipated volume of records without being bogged down by the human effort associated with producing the purely descriptive metadata that have historically been important for finding records. Realistically, there are unlikely to be enough resources (people or dollars) to rely on anything other than computer-based algorithms for this purpose.
One approach to reducing this burden is full-content searching. There has been considerable progress in this area—that an enormous body of publicly accessible Web pages and documents of multiple data types can be searched effectively by search engines such as Google is a major achievement of relevance to NARA. Furthermore, there is widespread demand for better searching technologies.
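The core idea behind full-content searching can be illustrated with a small inverted index, which maps each word to the records that contain it. This is a minimal sketch for textual records only; the function names and sample record identifiers are hypothetical, and production search engines add ranking, stemming, and much else that is omitted here.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(records: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every token to the set of record identifiers whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for record_id, text in records.items():
        for token in tokenize(text):
            index[token].add(record_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the records that contain every token in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Copy the first posting set so intersections do not alter the index.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

No descriptive metadata is consulted at any point: the index is built entirely from record content, which is why this approach reduces the human effort needed at ingest.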
Another approach is to use information-retrieval techniques such as the family of techniques for automatic metadata extraction. These methods, which make use of information-extraction techniques to extract metadata from the content of records, have reached a level of maturity that permits them to be used in production systems; the techniques will continue to improve steadily, but no breakthroughs are on the horizon. Information-extraction techniques do not and will not provide the accuracy of a human. Nevertheless, they are much less expensive than manual processing is, and in some instances the margin of error they provide may be acceptable to NARA (and to NARA’s customers, who may be more interested in having rapid access to a larger universe of archived records than in high precision). For the foreseeable future, automatic metadata extraction will need to be combined with expert guidance to achieve usable results.
Importantly, as is true with the provision of search capabilities, automatic metadata extraction need not be performed at ingest of records. The highest priority at ingest is capturing the records along with metadata that are not discernible from the content; the extraction of additional metadata can be deferred and can of course be repeatedly applied as techniques improve.
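The separation between capture at ingest and later, repeatable extraction might be sketched as follows. The record layout, the two extractor versions, and the regular expressions are all illustrative assumptions; real information-extraction methods are far more capable than these patterns.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


@dataclass
class StoredRecord:
    record_id: str
    content: str                  # captured at ingest
    supplied: Dict[str, str]      # metadata not discernible from the content
    extracted: Dict[str, List[str]] = field(default_factory=dict)
    extractor_version: int = 0    # 0 means extraction has not been run yet


def extract_v1(text: str) -> Dict[str, List[str]]:
    """First-generation extractor (hypothetical): ISO-style dates only."""
    return {"dates": sorted(set(DATE_RE.findall(text)))}


def extract_v2(text: str) -> Dict[str, List[str]]:
    """Later extractor (hypothetical): dates plus e-mail addresses."""
    result = extract_v1(text)
    result["emails"] = sorted(set(EMAIL_RE.findall(text)))
    return result


EXTRACTORS: Dict[int, Callable[[str], Dict[str, List[str]]]] = {
    1: extract_v1,
    2: extract_v2,
}


def refresh(record: StoredRecord, version: int) -> StoredRecord:
    """Re-run extraction only if the requested extractor is newer than the last one used."""
    if record.extractor_version < version:
        record.extracted = EXTRACTORS[version](record.content)
        record.extractor_version = version
    return record
```

Ingest in this sketch stores only `content` and `supplied`; `refresh` can be run at any later time, and run again whenever a better extractor becomes available.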
Finding 2.4 There is no general technical solution for the digital preservation problem today, and one is unlikely to appear any time soon.
As discussed in the committee’s first report, the ability to create, store, and retrieve the bits that represent a record is a fundamental requirement of the ERA. Good implementations of this capability will be quite complex and will require refinement and evolution over time. However, the basic requirements for this part of the ERA are fairly well understood, and technologies exist or are being developed to satisfy most of them.
Ensuring that bits can be understood by systems and by human beings in the distant future is much more difficult, however. The committee’s first report, citing the imperative of making significant progress toward implementing an archival system, proposed a pragmatic, short-term strategy. But as development of the ERA proceeds, NARA will need to seek out longer-term solutions.
For the long term, no generally applicable technical solution is evident. No single proposed approach to digital preservation, including migration, migration to durable fixed formats, or emulation, will work in all cases. Each method has its advantages; each has drawbacks and associated costs. One cannot know with any great certainty which proposed methods for preserving the interpretability of digital objects into the distant future will subsequently be deemed successful, much less relatively cost-effective. Preservation will therefore require a combination of approaches. Selecting the appropriate preservation techniques is part of the expertise that will be required of electronic records archivists. There will be a need for compromises based on user requirements and varying levels of service that NARA provides for different records.
The preservation problem is unlikely ever to be completely solved. Instead, our understanding will grow over time in IT areas that become relatively stable, such as relational databases and geographic information systems, and as digital preservation increasingly becomes a mainstream activity. In spite of this growth in understanding, new problems will arise as novel record data types emerge.
There is, however, a significant opportunity for NARA to ease some preservation problems by helping promote the development and adoption of formats that have desired attributes for preservation. These attributes include public specifications, broad applicability, compatibility with prior-format versions, and stability. NARA has already been involved in the development of one such format, the nascent Portable Document Format (PDF)/A standard. NARA will need to determine and periodically amend the list of formats that it preferentially supports and work to ensure that record-creating agencies provide records in these supported formats. This is an important part of ensuring that records are created archive-ready (see Recommendation 3.1).
Finding 2.5 User expectations will be high and will continue to grow as the capabilities offered by other information technology (IT) services improve.
Users are likely to expect NARA to keep pace with the aggressive rate of improvements that they see in commercial online services. Also, it is reasonable to expect that users will want appropriate transformation and processing—such as conversion from persistent forms into easily usable forms—and not just access to raw data. In other words, users will expect services well beyond the simple delivery of static records.
Given its limited resources, NARA cannot afford to meet all of these expectations for all records. Expectations will need to be reconciled with the notion of varying levels of service for different record types. One such adjustment would be to make all records easily accessible even though they were not all equally easy to view or process. Another approach would be to enable third parties to meet the requirements of users who have special retrieval or processing needs or who must work with infrequently used document formats (see Recommendation 4.1).
RECOMMENDATION 3. REENGINEER PROCESSES TO ACCOMMODATE CHALLENGES AND OPPORTUNITIES PRESENTED BY ELECTRONIC RECORDS
If NARA and federal agencies do not succeed in transforming their approach to transferring permanent records for preservation, the probable result will be significant difficulty in acquiring the desired permanent records and a costly, labor-intensive effort applied to those records that are acquired. As the subrecommendations below detail, NARA should therefore establish procedures and systems to ingest new records automatically on a routine basis, capture metadata at or close to record creation, take steps to increase the likelihood that records are provided archive-ready, and use automatic techniques to supply missing metadata. In implementing the recommendations below, NARA should determine the areas in which existing law and regulations provide sufficient flexibility and those in which changes to the policies governing records management may be required.
Finding 3.1 The growing volume of electronic records will swamp a system that relies on manual processing.
As the volume of electronic records to be stored in the ERA grows (Recommendation 1), existing processes, which rely on manual handling of records, will cease to be cost-effective, or even feasible. Affordability requires automation. It is, for example, not feasible to manually supply additional metadata about ingested records except at a very coarse level of granularity.
Finding 3.2 Preparation for the ingest of electronic records closer to their creation will make archiving easier, cheaper, and more accurate.
The ingest of electronic records decades after their creation could be problematic. This is because obsolete data and metadata formats may be difficult to interpret, and it may be quite difficult to modify the systems in which the records reside in order to facilitate ingest.
There are two very similar techniques for addressing this problem: a passive one, in which NARA is sent records early, on a provisional basis, and carries out a “dry run” of ingest; and an active one, in which NARA actually ingests records in anticipation of a future transfer of custody to NARA. Both give early warning of potential problems with ingest. Although the active mode gives NARA the most information about the records that it will have to handle, it is probably harder for agencies (and NARA) to implement. On the other hand, the resulting archival copies might also provide additional incentives for agencies to support the creation of archive-ready records and their delivery to NARA, as the ERA could provide off-site backup of agency records.
Finding 3.3 Automated ingest depends on records being provided in structured, standardized formats together with sufficiently standardized essential metadata.
Without some degree of standardization in the essential metadata (metadata that cannot be derived from the record itself through information extraction or search), the ingest of each set of records will require significant, costly human intervention. Nonetheless, NARA and the ERA system will have to be flexible in accepting records from originating agencies; that is, neither NARA processes nor the ERA design should be brittle with respect to variations resulting from legacy systems, poor implementation of unusual record types, and other circumstances.
Recommendation 3.1 To permit records to be ingested using largely automated processes, NARA should strive to change federal agency practice and system design so that permanent electronic records are created or provided to NARA “archive-ready.”
In order to achieve the requisite degree of standardization of formats and essential metadata, NARA will need to provide very explicit guidance and realistic requirements that record-keeping software systems can support. These would need to be requirements that agencies can implement without significant changes to normal business processes or without investing in large amounts of human effort to provide needed metadata. In particular, NARA should consider requiring all newly acquired agency systems that produce permanent records to do the following: create those records in formats acceptable to NARA, include explicit metadata in their output, and use standardized mechanisms for transferring records to NARA, such as using secure network communications. In the long run, archiving considerations will have to become part of the government process of software procurement and development for systems that produce or are deemed likely to produce permanent records.
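A check of whether a submission is archive-ready could, in outline, look like the sketch below. The accepted-format list and the essential metadata field names are placeholders invented for illustration; NARA's actual requirements are not specified in this report. Submissions that fail the check are routed to a fallback path rather than rejected, in keeping with the need for flexibility noted under Finding 3.3.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical values for illustration only; not an actual NARA list.
ACCEPTED_FORMATS = {"application/pdf", "text/plain", "text/xml"}
ESSENTIAL_FIELDS = ("originating_agency", "creation_date", "record_schedule")


@dataclass
class Submission:
    format: str
    metadata: Dict[str, str]


def archive_ready_problems(submission: Submission) -> List[str]:
    """List the reasons a submission is not archive-ready (empty list if it is)."""
    problems: List[str] = []
    if submission.format not in ACCEPTED_FORMATS:
        problems.append(f"format not on accepted list: {submission.format}")
    for name in ESSENTIAL_FIELDS:
        # A field that is absent or blank counts as missing.
        if not submission.metadata.get(name, "").strip():
            problems.append(f"missing essential metadata: {name}")
    return problems


def route(submission: Submission) -> str:
    """Send archive-ready submissions to automated ingest, all others to fallback."""
    return "automated-ingest" if not archive_ready_problems(submission) else "fallback"
```

Because the check returns specific reasons, the same function could in principle be given to agencies so that problems surface when records are created rather than at transfer.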
When a government agency does indeed revise or reengineer its business practices, it can (and sometimes does) vastly improve its records-management practices. These reengineering efforts are opportunities for NARA to furnish the guidance and standards that ensure that archive-ready records are created by the reengineered system.
Successful implementation of the ERA system may also require NARA to become more actively involved in the establishment of standards used in constructing federal systems, such as those governing the formats used to represent records and associated metadata. It may also require NARA—or other bodies responsible for information policy, such as the Office of Management and Budget—to establish guidance to agencies, together with auditing and enforcement tools as necessary.
Experience shows the difficulty of implementing systems that depend on user-supplied metadata or that depend on user compliance with externally imposed requirements that take effort and do not necessarily support the user’s internal business needs. Consequently, NARA should foster the development of software that automatically captures metadata at record creation. It should also work closely with agencies to determine whether its requirements can realistically be implemented and link archive-ready concerns with agency interests in supporting the agencies’ own record-keeping operations.
Finding 3.4 NARA should not let the existing backlog of electronic records that are not archive-ready undermine the long-term objectives of developing capabilities and processes to handle new records through automation.
Over the long term, the preceding recommendations address the move to automated ingest of records. However, a considerable backlog of electronic records already exists, and it will continue to grow before automatic measures take full effect. If NARA is to avoid being totally swamped, it will have to perform triage on the handling of the backlog. This will mean avoiding significant manual processing, preserving the obvious context and features of the environment that would support extraction of automatic metadata in the future, and using automatic metadata extraction (and accepting its shortcomings).
Recommendation 3.2 The ERA system should include fallback capabilities that do not require intensive manual processing for instances in which agencies do not provide records archive-ready. NARA should accept and accommodate the resulting varying levels of metadata quality.
There will be many cases, including those involving legacy and noncompliant systems, in which the specified or desired metadata are not provided along with agency records. As a fallback, NARA should employ low-cost if imperfect techniques for automatic metadata extraction. Such techniques are much more cost-effective than manual processing for handling enormous volumes of records. Using these techniques will require NARA to accept significant imperfection in some of its metadata and thus to shift from the current mindset of having very high quality metadata.
It will also be necessary for ERA systems and processes to carefully track the provenance of metadata and to distinguish among different metadata sources, which have varying quality. As metadata-extraction techniques improve, the metadata can be upgraded; the ERA should accommodate easily making such updates to record descriptors. Improved search techniques can also be used to improve future access.
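Tracking the provenance of metadata can be sketched by storing each value together with its source and, for automatically extracted values, the version of the tool that produced it. The source labels and their ranking below are assumptions made for illustration, as is the rule that a value is replaced only by one from a more trusted source or a newer tool.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical ranking of metadata sources; a higher number means more trusted.
SOURCE_RANK = {"auto-extracted": 1, "agency-supplied": 2, "archivist-reviewed": 3}


@dataclass(frozen=True)
class MetadataValue:
    value: str
    source: str            # must be a key of SOURCE_RANK
    tool_version: int = 0  # meaningful only for automatically extracted values


def upgrade(descriptor: Dict[str, MetadataValue], name: str, candidate: MetadataValue) -> bool:
    """Store the candidate if the field is empty or the candidate outranks the current value.

    Values are compared first by source rank and then by tool version, so a newer
    extractor can replace an older one but cannot overwrite agency-supplied or
    archivist-reviewed metadata. Returns True if the descriptor changed.
    """
    current = descriptor.get(name)
    if current is None:
        descriptor[name] = candidate
        return True
    current_key = (SOURCE_RANK[current.source], current.tool_version)
    candidate_key = (SOURCE_RANK[candidate.source], candidate.tool_version)
    if candidate_key > current_key:
        descriptor[name] = candidate
        return True
    return False
```

Keeping the source alongside each value also lets an access system show users how much confidence a given descriptor deserves.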
Recommendation 3.3 NARA should carefully consider its approach to the description, appraisal, and selection of electronic records for retention. The trade-off between cost and utility should always be considered.
As discussed above, it is essential for the ERA to be able to ingest records automatically and to have the agencies that create records take on greater responsibility for providing required metadata. However, this does not mean that agencies would have to take on the full burden traditionally borne by archivists.
The costs and benefits for varying levels of investment in metadata (using either manual or automatic processing) should be considered carefully. Electronic records present significant opportunities to exploit new capabilities, such as full-content searching, structured documents, and automated metadata creation. To the extent that NARA can make use of full-content searching, and exploit search and other access mechanisms that agencies themselves have created for their own records, there will be opportunities to create new kinds of finding aids that do not rely on manual processing.
Also, there are trade-offs between what is gained through fine-level appraisal and the costs of making the selection of electronic records for retention. The volume of information being produced is growing, automated techniques for finding records are continually improving, and the cost of storage is falling rapidly. Thus, instead of examining records carefully to mark only a select few for retention, it may become more expedient, and more appropriate, to retain larger “chunks” of records (even if some of the records may not warrant retention as permanent records), in order to ensure cost-effective preservation of the government’s output. Such a shift would appear to be consistent with NARA’s Records Management Initiative, with its emphasis on functional appraisal, which implicitly is more broadly granular.
Recommendation 3.4 NARA should continue to expand the use of cooperative arrangements that shift responsibility for preserving and providing access to records to the agencies that create them.
The model of having certain agencies retain responsibility for certain types of records by operating what are, in effect, affiliated archives, is well established (e.g., the arrangements covering the preservation of large scientific data sets; a NARA-Government Printing Office agreement). This model could be expanded. Under such arrangements, NARA’s role shifts to that of a standards setter, a records-access federator, and a repository of last resort.
NARA would benefit from allowing more agencies to operate affiliated archives because it could then spend its scarce resources on problems that the agencies cannot address on their own. Such arrangements would also provide a way for NARA to partner with agencies to provide better access to records. Advanced access services are best provided or supported by the mission agency that knows the domain; in this way the requisite expertise in managing a complex system need not be transferred to NARA. Advanced access capabilities can also be provided by third parties, possibly for a fee (see Recommendation 4.1). With these opportunities comes the responsibility for establishing and verifying appropriate standards for affiliated archives and for establishing fallback mechanisms in the event that agencies cannot fulfill their responsibilities at some point in the future.
For such partnerships to work, NARA will have to offer some form of incentive. For example, as NARA develops its electronic capabilities, it has the opportunity to provide generic expertise in electronic records management and preservation that complements the subject-matter expertise of the partner agencies. Finally, as the number of partnerships grows, NARA can collect, codify, and disseminate best preservation practices.
In addition to depending on NARA’s contributions, the success of cooperative agreements will depend on agencies honoring their commitments under these agreements. NARA may need to assume the role of auditor.
RECOMMENDATION 4. ENABLE PARTNERSHIPS WITH OTHER ORGANIZATIONS WITH INTERESTS IN ELECTRONIC RECORD PRESERVATION
In the long run, the ERA should become part of a larger collection of archives, libraries, and information services operated by governments, nonprofit organizations, and the private sector rather than a unique, stand-alone system that only NARA runs. Many other organizations are also developing large collections of electronic materials and associated systems and tools—notably, initiatives launched by the Library of Congress, the Government Printing Office, and the National Library of Medicine.
Finding 4.1 An open architecture with public interfaces would allow NARA to outsource some functionality, encourage competition of services, and foster innovation in access.
By designing the ERA system so that third parties can build new capabilities on top of it, NARA can accomplish several things. It can, at low cost, tap the resources and entrepreneurship of other organizations, commercial firms, and researchers to provide enhanced services, help fulfill NARA’s mission, increase the perceived value of NARA’s holdings, and explore new technologies and approaches.
Recommendation 4.1 NARA should use a design strategy for the ERA that enables a range of other organizations to provide access services that NARA does not offer.
NARA does not have to be the sole access provider. Instead, it can enable agency partners and private service providers to develop alternative access mechanisms. Different providers could, for example, offer different views of the same underlying data, index them differently, or derive new digital products from them. Capabilities that might be built on top of the ERA system include the following: data annotation to supplement and correct record metadata through third-party systems that are separate from NARA, but accessible as if part of a unified system; the continued introduction of specialized, enhanced capabilities by partners and vendors that support particular needs; and entrepreneurship to provide value-added services, either through application programming interfaces (APIs) or bulk download.
The result, and the desired outcome, is that NARA would be able to leverage considerable assistance from its customers, partners, and other third parties. Once the technical capabilities were in place, NARA could take the concept one step further by actively seeking out partners to work in areas of interest. This effort would provide NARA with a low-cost method of exploring new technologies and staying on the leading edge.
Recommendation 4.2 NARA should anticipate future demand for federated access to records in the ERA and other affiliated archives.
Federated access, in which a software layer allows one to access a collection of archives as if it were a single archive, would be a valuable capability. Federation would provide a common interface to federal records regardless of which agency happened to have responsibility for managing them. (Recommendation 3.4, above, recommends distributing some of the burden for archiving to institutions beyond NARA.) Federated access would also be valuable in allowing access to span NARA and presidential library resources, for example. Early iterations of the ERA design may well not include these capabilities, but the ERA should be designed with their future addition in mind.
Although there is a large relevant technology and standards base, there is currently no real corresponding base of operational practice in federating access to archives. Among the issues to consider are the minimum common standards that archives must agree to (e.g., unique record identifiers) to support federated access. The easiest way to accommodate federation might be to expose an API that accommodates it. And if that API is exposed, federation could be provided by third parties with minimal effort by NARA.
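As an illustration of how small such an API could be, the following Python sketch shows a federation layer that presents several archives as one. It is a sketch under stated assumptions, not a proposed ERA design: the `Archive` interface, the `InMemoryArchive` stand-in, and the identifier scheme `<archive name>:<local id>` are all hypothetical and serve only to show why a shared unique-identifier convention is the key common standard.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Protocol


@dataclass(frozen=True)
class Record:
    record_id: str  # assumed globally unique: "<archive name>:<local id>"
    title: str
    archive: str    # name of the archive that holds the record


class Archive(Protocol):
    """Minimal interface each member archive is assumed to expose."""
    name: str

    def search(self, term: str) -> Iterable[Record]: ...
    def get(self, record_id: str) -> Optional[Record]: ...


class InMemoryArchive:
    """Toy stand-in for a real archive, used only for illustration."""

    def __init__(self, name: str, titles: Dict[str, str]) -> None:
        self.name = name
        self._records = {
            f"{name}:{local_id}": Record(f"{name}:{local_id}", title, name)
            for local_id, title in titles.items()
        }

    def search(self, term: str) -> List[Record]:
        term = term.lower()
        return [r for r in self._records.values() if term in r.title.lower()]

    def get(self, record_id: str) -> Optional[Record]:
        return self._records.get(record_id)


class Federation:
    """Software layer that makes several archives look like one."""

    def __init__(self, archives: Iterable[Archive]) -> None:
        self._archives = {a.name: a for a in archives}

    def search(self, term: str) -> List[Record]:
        # Fan the query out to every member and merge the results.
        results: List[Record] = []
        for archive in self._archives.values():
            results.extend(archive.search(term))
        return sorted(results, key=lambda r: r.record_id)

    def get(self, record_id: str) -> Optional[Record]:
        # The identifier prefix says which archive is responsible.
        archive_name, _, _ = record_id.partition(":")
        archive = self._archives.get(archive_name)
        return archive.get(record_id) if archive else None
```

A third party could build `Federation` entirely outside NARA, provided each archive exposes `search` and `get` and honors the identifier convention. For example, `Federation([era, presidential_library]).search("budget")` would return matching records from both holdings in a single list.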
RECOMMENDATION 5. BROADEN INTERACTIONS WITH THE RESEARCH COMMUNITY
Finding 5.1 NARA has been too narrowly engaged with the research and development occurring in digital archiving, digital libraries, and related areas.
In the early stages of the ERA program, NARA sponsored several research and development projects in which it worked with limited segments of the relevant research community consisting of a small number of specific organizations. It has not, however, fully exploited opportunities to engage the broader research communities working in areas related to NARA’s interests. Prototypes are a useful part of the system development process and a valuable way to test and communicate the results of research, but prototypes by themselves do not advance the state of the art or science.
Recommendation 5.1 NARA should be more actively engaged with multiple and diverse sources of research so as to gain early experience with emerging technologies, build internal expertise, and ensure that its future needs are met.
By virtue of its extensive experience in archiving and its significant needs for preserving electronic materials, NARA should more actively engage relevant research communities—for example, by sponsoring research, suggesting problems that need to be addressed, engaging with researchers, hosting workshops and seminars, and hiring students. These activities would help meet NARA’s needs and those of the broader archival and preservation communities.
An organization like NARA should be involved in research for several reasons: (1) to gain early experience with technologies that might be expected to begin appearing in commercial products within roughly 3 to 5 years, (2) to learn more broadly about a research area through direct and regular engagement with related research communities and the experts within those communities, (3) to describe to researchers challenging problems faced by NARA, and (4) to build absorptive capacity to understand how to apply the results of relevant research sponsored by other organizations.
NARA should anticipate that since research involves a degree of risk, not all research projects will necessarily produce results that can be implemented directly. The trade-off is that NARA stands to benefit from innovative ideas that had not been anticipated.
To manage its research activities most effectively, NARA should seek to employ individuals who have existing relationships with the various relevant research communities. Such individuals should also bring with them a proven track record of producing or applying research results in these areas.
Finding 5.2 Most of the technical challenges faced by NARA are shared with other organizations operating large repositories for digital materials, but a few problems are specific to the operation of NARA and similar archives.
Two distinct classes of research areas in which NARA appears to have unmet technology needs are apparent:
Problems shared with other organizations. Most of NARA’s technical problems are the same as those faced by designers and operators of any large digital library or repository. For these types of problems, NARA’s greatest leverage will come from drawing on research sponsored by other organizations, learning about best practices, inducing others to work on problems of interest to NARA by offering corpora, and participating in joint research programs with other federal agencies and organizations that face similar challenges. Wherever technologies may be shared with other applications, partnerships to jointly address problems involving these technologies will help stretch limited research resources. By joining with others, NARA also can couple to a joint learning curve—that is, it won’t have to make unique mistakes.
Problems specific to government archives and similar institutions. Examples of research problems of particular importance to NARA and similar institutions include the following: how government transaction systems should be preserved, how to provide digital assurances over a very long period of time, and how to provide semiautomated redaction and declassification of materials. Working on these more specialized problems may require engagement with specific existing research communities. Partnership with agencies that have greater experience in managing IT research programs is also likely to be the most effective mechanism for this class of research problems.
Recommendation 5.2 NARA should develop a research strategy that lays out its technical needs and the partnerships that would best address them.
Considering NARA’s limited resources, its limited capacity to manage research programs, and a set of challenging technical problems related to the general topic of archiving and digital preservation, the issues of how and where NARA should invest arise. Questions for NARA to consider include these: What agencies share an interest in NARA’s problem or a similar problem? What research community is working on that problem or a similar problem? How can NARA most effectively join and influence that community? How can it learn from or participate in others’ research programs? What results are anticipated, and how will the results be brought back to NARA?
The development and deployment of the ERA will engender a number of research topics. Some of these topics can perhaps be anticipated now, but others will arise only as experience with the ERA accrues. Having a research agenda will help guide NARA in selecting which research initiatives and organizations to become involved with. As NARA defines its research agenda, it should be careful to frame its needs as problems that academic and commercial researchers can understand and address, and it should frame them in ways that are amenable to working in partnership with other organizations.
Developing a research agenda is likely to be daunting when an organization is in the midst of a rapid technological and cultural shift, as is NARA today. Fortunately, NARA need not (and should not) work in isolation or start from scratch in developing a research agenda. Partnering with more experienced research organizations, as recommended below (see Recommendation 5.3), will help, but NARA will nonetheless need to develop its own voice. Several recent research agendas in the area of archiving and preservation (including reports prepared for the National Science Foundation [NSF], the Library of Congress, and NARA’s National Historical Publications and Records Commission) provide a basis for NARA to find common ground with other research initiatives.
Recommendation 5.3 NARA should implement its research strategy by working with other agencies that have similar research needs and expertise in managing research programs.
Implementation of NARA’s research strategy will require parsing the agenda into a research program, building or joining a community of researchers, monitoring work, and creating incentives and opportunities for technology transfer. In carrying out all of these activities, NARA should seek partnerships with agencies for which research management is a core competency, such as NSF or the Defense Advanced Research Projects Agency (DARPA). These agencies have developed considerable expertise with selecting and reviewing research, engaging with research communities, and managing research projects. The selection of specific research proposals and routine management of the research process should be left to such agencies. NARA should work with its partners to ensure that it has a voice in developing research initiatives and has opportunities to engage directly with the organizations and experts that do the research.
NARA’s fundamental requirements for digital preservation and access are shared to varying degrees with a number of government organizations that also must maintain access to digital records into the indefinite future. Problems such as semiautomatic redaction or declassification can be investigated with partners such as DARPA or the intelligence community’s Advanced Research and Development Activity (ARDA). Another opportunity is presented by NSF’s program in digital preservation, which is explicitly designed to support multiagency participation.
Recommendation 5.4 NARA should also engage the research community by providing access to data in the form of electronic records.
Academic researchers are often starved for data: they have good ideas, but no way to test them in the real world. Providing interesting data in a form that the research community can use is often a sufficient incentive to get the best people working on a problem. It also provides a very cost-effective way of engaging a research community. The National Institute of Standards and Technology and a number of university groups have experience in this area; partnering with such an organization would help NARA learn how to create data sets that are interesting and useful to various research communities.
NARA might also consider providing direct access to some of its digital archives in the form of APIs that provide access to selected internal data or operations of the ERA system, enabling people to build new interfaces or applications on top of NARA’s archives. Such access would be attractive to a variety of computer science and social science research communities, and it could be provided by NARA at relatively low cost, considering the potential benefits.
RECOMMENDATION 6. ENHANCE CAPABILITIES AND PROVIDE GOVERNMENT-WIDE LEADERSHIP FOR RECORD INTEGRITY AND PROVENANCE
As NARA recognizes, electronic records are subject to a variety of threats, ranging from accidental corruption to deliberate tampering, that are quite different in character from the threats to which paper records are subject. It is essential for the ERA and the records-management practices that it supports or requires to be up to the task of adequately protecting the public record.
Fortunately, electronic records also afford important opportunities to improve integrity and authenticity assurance through the use of robust cryptographic techniques, automatic replication, and system design. In order to ensure trust in the digital records held by the ERA, stringent measures—including the use of cryptographic techniques—should be taken to protect records against deliberate or accidental compromise. These methods should be built into the ERA system design and its operational procedures. The overall trustworthiness of records also depends on the methods used by agencies that supply records to the ERA. Thus, NARA should concern itself with the trustworthiness of a record throughout its entire custody chain, and it should provide leadership on these issues.
Finding 6.1 NARA’s past procedures and practices would be inadequate for safeguarding digital records.
Concepts developed and applied to paper records, such as chain of custody, are also of value in assessing the authenticity of electronic records. They do not, however, adequately address the threats associated with electronic records. These threats include hardware, software, and operational errors, as well as individuals (both outsiders and insiders) who may either tamper in subtle ways with individual records or systematically modify large numbers of records.
Recommendation 6.1 The ERA should use appropriate technical and procedural methods available for assuring the integrity and authenticity of electronic records.
The new vulnerabilities introduced by virtue of the digital nature of the records to be preserved underscore the importance of complementing traditional procedures by making use of robust digital techniques and an appropriate overall system design for verifying integrity and authenticity.
Digital assurances for records are based fundamentally on maintaining multiple, geographically and administratively separated copies and on two cryptographic tools—hash digests for integrity checking and digital signatures for authentication. Digital signatures are an excellent technique for verifying that recently transmitted data—such as a set of records being transferred from an agency to NARA—actually came from where they were supposed to have come, but digital signatures are a poor means of verifying the origin of stored data. Instead, the origin can be verified (1) at ingest, by recording metadata on the outcome of having verified the origin of those data at the time they were received, and (2) from then on, by maintaining the integrity of the stored data and associated metadata.
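The two-step approach described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a description of any ERA mechanism: the names `IngestReceipt`, `ingest`, and `fixity_ok` are hypothetical, SHA-256 is chosen only as an example digest, and the `signature_ok` argument stands for the outcome of a digital signature check performed by whatever signature system the transferring agency and NARA have agreed on.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IngestReceipt:
    """Metadata recorded once, at the time a record is received."""
    record_id: str
    sha256: str            # hash digest used for later integrity checks
    origin: str            # who the record was verified as coming from
    origin_verified: bool  # outcome of the signature check at ingest
    verified_at: str       # UTC timestamp of that check


def ingest(record_id: str, data: bytes, origin: str, signature_ok: bool) -> IngestReceipt:
    """Step 1: record the outcome of origin verification at ingest."""
    if not signature_ok:
        raise ValueError("origin could not be verified; refusing ingest")
    return IngestReceipt(
        record_id=record_id,
        sha256=hashlib.sha256(data).hexdigest(),
        origin=origin,
        origin_verified=True,
        verified_at=datetime.now(timezone.utc).isoformat(),
    )


def fixity_ok(data: bytes, receipt: IngestReceipt) -> bool:
    """Step 2: from then on, check that the stored bytes are unchanged."""
    return hashlib.sha256(data).hexdigest() == receipt.sha256
```

After ingest, the signature itself no longer needs to remain verifiable. Trust in the origin rests on the receipt, so the receipt must be protected with the same care as the record, for example by including it in the data that is replicated and periodically checked.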
Designing digital assurances into an electronic records archive is similar to designing security measures: that is, the cryptographic techniques must be chosen carefully, but more importantly, the overall system design, not just the cryptographic mechanisms, must not allow openings for attackers. And as with security systems, provisions must be made to change the cryptographic algorithms or system design if the chosen cryptographic algorithms are ever found to be faulty (a very likely contingency at some time in the future) or if the system design is found to be susceptible to attack.
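One common way to make such a change possible is to store the name of the algorithm alongside each digest, so that records can be re-protected under a new algorithm without redesigning the system. The Python sketch below illustrates the idea; the function names are hypothetical, and the choice of SHA-256 and SHA3-256 in the usage note is only an example.

```python
import hashlib
from typing import Dict, Optional


def compute_digest(algorithm: str, data: bytes) -> str:
    """Compute a digest with an algorithm chosen by name."""
    return hashlib.new(algorithm, data).hexdigest()


def verify(data: bytes, digests: Dict[str, str]) -> bool:
    """True only if at least one digest is stored and all of them match."""
    return bool(digests) and all(
        compute_digest(algorithm, data) == expected
        for algorithm, expected in digests.items()
    )


def migrate(
    data: bytes,
    digests: Dict[str, str],
    new_algorithm: str,
    retire: Optional[str] = None,
) -> Dict[str, str]:
    """Add a digest under a new algorithm, optionally retiring an old one.

    The data are checked against the existing digests first, so that a
    corrupted record is never given a fresh, valid-looking digest.
    """
    if not verify(data, digests):
        raise ValueError("data fail existing digests; refusing to re-hash")
    updated = dict(digests)
    updated[new_algorithm] = compute_digest(new_algorithm, data)
    if retire is not None:
        updated.pop(retire, None)
    return updated
```

For instance, `migrate(data, {"sha256": old_digest}, "sha3_256")` returns a mapping that holds both digests, which allows a transition period in which old and new algorithms are checked side by side. The migration must take place while the old algorithm is still trusted, because the old digest is what vouches for the data being re-hashed.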
Cryptographic protections are only a part of the solution, however, in part because they detect damage or attack only after the fact. Prevention and repair are even more crucial parts of the solution and depend on careful procedures and system designs, including the following: (1) maintaining multiple copies that are geographically separated and independently administered; (2) tracking records from their origins, through storage and IT systems of various sorts, and eventually into the archive; (3) protecting the systems that hold records from unauthorized tampering, both by internal employees and external attackers; (4) employing access controls, audit logs, and procedural safeguards when changing the archive contents to reduce human errors (e.g., requiring multiple people to authorize a change); and (5) employing sound software development methodologies to reduce the likelihood of record corruption due to software bugs.
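Item (1) is what makes repair possible at all: a digest can reveal that a copy is damaged, but only another intact copy can restore it. The following Python sketch shows one simple form of this, under the assumptions that a trusted SHA-256 digest was recorded at ingest and that each replica can be read as bytes. The function name `audit_and_repair` and the site names in the usage note are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def audit_and_repair(
    replicas: Dict[str, bytes], expected_sha256: str
) -> Tuple[Dict[str, bytes], List[str]]:
    """Check every replica against the trusted digest and repair bad ones.

    Returns the repaired replica set and the names of the sites whose
    copies were found damaged. Raises if no intact copy survives.
    """
    intact_sites = [
        site
        for site, data in replicas.items()
        if hashlib.sha256(data).hexdigest() == expected_sha256
    ]
    if not intact_sites:
        raise RuntimeError("no intact replica remains; record cannot be repaired")

    good_copy = replicas[intact_sites[0]]
    damaged = sorted(set(replicas) - set(intact_sites))
    # Overwrite every damaged copy with a verified good one.
    repaired = {site: good_copy for site in replicas}
    return repaired, damaged
```

Given replicas at `site-a`, `site-b`, and `site-c` where only the `site-b` copy has been altered, the function reports `["site-b"]` as damaged and returns three identical, verified copies. In practice the list of damaged sites would also be written to an audit log, since a pattern of failures at one site may indicate an operational problem or an attack rather than random corruption.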
Recommendation 6.2 Throughout the ERA system design and operation, NARA should carry out an iterative threat modeling, analysis, and response process.
A threat analysis should be done early in the design of the ERA so that proper security measures can be designed and implemented from the outset. It is essential that ERA design proposals be analyzed against a threat model in order to gain an understanding of the degree to which alternative designs are vulnerable to attack. The threat analysis should be undertaken not only for the ERA itself, but also for the entire chain of custody of a record from its creation, through its retention for active use in a government agency, to its eventual transfer to the archive. This initial threat modeling would be only the first step of a larger, iterative threat-countering process that involved designing against expected threats, observing failures that occur, and designing new countermeasures. Threat modeling and analysis are a specialized art, which suggests that NARA should seek outside specialists to undertake these tasks under contract.
Recommendation 6.3 The ERA should be designed so that auditing, monitoring, and testing of ongoing operations verify that digital assurance measures are working as intended.
Methods for detecting very rare events such as the corruption of stored information are extremely susceptible to error because they are rarely exercised. The ERA should be designed to include ongoing processes that detect errors and that exercise error-detection and correction procedures. One of the threats to ERA integrity is software errors that corrupt file structures. For this reason, procedures associated with introducing new software must provide for adequate testing, gradual introduction of software changes so that impacts on data integrity are apparent early on, and capabilities to repair any errors that are introduced.
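One way to exercise error-detection procedures regularly is a drill that deliberately corrupts a scratch copy of a sample record and confirms that the detector notices. The Python sketch below illustrates the idea. It is only a sketch: the names `run_drill`, `flip_bit`, and `sha256_detector` are hypothetical, and a real drill would operate on test copies inside the production storage path rather than on in-memory bytes.

```python
import hashlib
from typing import Callable

# A detector returns True when it judges the data to be corrupted.
Detector = Callable[[bytes, str], bool]


def sha256_detector(data: bytes, expected_hex: str) -> bool:
    """Example detector: compare a fresh SHA-256 digest with the stored one."""
    return hashlib.sha256(data).hexdigest() != expected_hex


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of the data with a single bit inverted."""
    if not data:
        raise ValueError("cannot inject a fault into empty data")
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def run_drill(sample: bytes, detector: Detector, bit_index: int = 0) -> bool:
    """Confirm the detector passes clean data and catches an injected fault."""
    expected = hashlib.sha256(sample).hexdigest()
    clean_passes = not detector(sample, expected)
    fault_caught = detector(flip_bit(sample, bit_index), expected)
    return clean_passes and fault_caught
```

A drill that returns `False` indicates that the detection procedure itself has failed, for example because a software change silently disabled a check. Running such drills on a schedule turns a rarely exercised safeguard into a routinely tested one.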
Recommendation 6.4 NARA should ensure that it has access to expertise in digital assurance techniques and in the operational procedures that accompany them.
The basic cryptographic techniques and other trustworthiness measures discussed in this recommendation are widely used in a variety of contexts. However, their application in a system designed to provide very long term assurances of authenticity, integrity, and chain-of-custody management is likely to be leading edge for the foreseeable future. As a result, NARA will need access to highly specialized expertise on an ongoing basis—for example, to assess initial and subsequent system designs and operational procedures. As it would seem difficult for NARA to maintain this kind of highly specialized expertise in-house, it should consider establishing an expert advisory committee to specifically address the kinds of issues discussed in this recommendation. Digital assurance in the context of long-term preservation is also a possible research area for NARA.
Recommendation 6.5 NARA should lead an effort to enhance digital assurance for electronic records government-wide.
Agencies also need to protect records, lest NARA’s archiving activities turn into a “garbage-in, garbage-out” exercise. All of the subfindings and subrecommendations presented above are equally applicable to the agencies that originate records and later pass them to NARA and to agencies that manage records locally under cooperative agreements.
NARA’s role in records management offers an opportunity to establish federal standards and to inaugurate, advance, and even mandate their use in records management. By virtue of its central role and planned longevity, the ERA is the ultimate client, and perhaps the one with the most stringent requirements, for secure digital records management. NARA has undertaken leadership in other areas of records management through the Records Management Redesign program and the E-government Electronic Records Management initiative and has issued records-management guidance in several areas related to digital assurance. NARA should help spread the use of digital assurance techniques, as well as monitor their use in records management, throughout the government.
As the National Archives and Records Administration recognizes, its initiatives for electronic records are critical to fulfilling its mission to preserve essential evidence as society and government information continues to move to a born-digital and even only-digital state. In fulfilling its mission, NARA enters an era in which it must understand, cope with, and embrace the rapid pace of change that characterizes information technology. As these changes unfold, NARA will face a number of difficult questions, some of them alluded to above.
The rapid evolution in government information and information systems provides a good illustration. Formal government communications continue to be conventional records: forms, letters, pictures, and so on. But citizens and government officials increasingly interact via e-mail and text and voice messages, not memoranda and forms. Web pages increasingly present not only static documents but also ephemeral views that are not stored as records, but are merely computed from online databases through software that itself changes over time. What portions of this constantly changing material are essential evidence that documents the national experience? Handling the resulting flood of data and re-creating the behavior of ever-more-complex IT systems decades later are both extremely difficult challenges. The National Archives and Records Administration of an e-government will have to be equipped to address the technical challenges that arise, and to examine and possibly revise how it carries out its basic mission.