Summary and Recommendations

As in other sectors of society, much of the business—and thus record keeping—of the federal government depends on digital information. Documents are created, transmitted, and stored electronically. E-mail has become an important—and often primary—communications technology. And many records exist only in electronic form, stored in databases and other computer systems. Some of these records will, in time, be transferred to the custody of the National Archives and Records Administration (NARA) for long-term preservation.

NARA’s current systems for archival preservation of electronic records are limited in capability and ad hoc in nature. Recognizing the growing importance of electronic records to its mission of preserving “essential evidence,”1 NARA launched the Electronic Records Archives (ERA) initiative. It sponsored work through this program at the San Diego Supercomputer Center (SDSC), which resulted in a series of archival preservation demonstrations. Building on this experience and that of other institutions studying digital preservation, NARA’s new ERA Program Office plans to begin initial procurement for a production ERA in 2003. As of this writing, NARA has hired a contractor to assist with the ERA program and has started to define desired capabilities and requirements for the system, including a vision statement and concept of operations for the ERA.

THE IMPORTANCE OF THE ELECTRONIC RECORDS ARCHIVES PROGRAM (CHAPTER 1)

Finding 1. As NARA recognizes, it is critical to start developing new electronic records preservation capabilities quickly in order to continue to fulfill NARA’s mandate to preserve federal records.

1   National Archives and Records Administration (NARA). 2000. Ready Access to Essential Evidence: The Strategic Plan of the United States National Archives and Records Administration 1997-2002 (revised 2002). Government Printing Office, Washington, D.C.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

With the rapid increase in federal records in digital form—and with many records born digital or existing only in digital form—it is clear that solutions must be found for preserving these records in order for NARA to continue to fulfill its mandate. NARA has determined, and the committee concurs, that new capabilities for electronic record archiving are needed for NARA to perform its mission to preserve and provide access to federal records of enduring value.

The overall challenge facing NARA is substantial. The volume and diversity of digital records that will be eligible for transfer from the custody of federal agencies to NARA are projected to be very large. Indeed, it is reasonable to anticipate that in the not-too-distant future, the number of digital records is likely to exceed the number of records originating in paper form. NARA’s current systems for electronic records, designed primarily to support preservation of relational databases and similar highly structured records, cannot meet these demands.

The backlog of electronic records presents additional challenges. Under the paper-based model, NARA receives records from a few years to many decades after they are created. The transfer of electronic records to NARA has for the most part proceeded in similar fashion. Thus when the ERA system becomes operational, NARA will face a large backlog of electronic records that were created over the past few decades, many of which may pose challenging preservation problems owing to their age (media deterioration, loss of documentation and other metadata, and obsolescence of data types). For records yet to be created, there may be ways to avoid the technology obsolescence problem by restructuring records acquisition processes to obtain records closer to the time they are created. (Discussion of this opportunity and related process issues is deferred to the committee’s second report.)

If NARA fails to design and implement an electronic records archiving program that is capable of handling the projected volume and diversity of electronic records, important records are likely to be lost for the reasons discussed above. Likewise, significant delays in the ERA program would put records at greater risk of loss. The consequence of either failure to institute a program or a significant delay in doing so would be the possible—indeed even likely—loss of an important part of the nation’s history.

Finding 2. ERA systems can and should be built, but it is a challenging, leading-edge engineering undertaking, not a routine procurement.

Although no one has yet designed, built, or managed a production digital archives system on the scale that NARA envisions, the ERA program can be launched in a technically sound way. No off-the-shelf overall solution is available, but there are demonstrated solutions to many of the important system components the ERA will need, making it possible to start building ERA capabilities today. The projected scale and complexity of the ERA program mean that the task of designing, engineering, and evolving the system is a formidable challenge.

Recommendation 1. The ERA should comprise a series of interrelated systems that evolve over time to fulfill NARA’s digital preservation needs.

The scope of the ERA in terms of time and function demands that the digital archives be thought of as a set of systems that evolve over time. Digital materials (records and associated metadata) will need to be preserved for a very long time—longer than the lifetime of any physical device or software component that is part of the ERA. This means that hardware and software will have to be replaced or modernized many times without disrupting the archive.

Because requirements for NARA’s digital archives will change over time, the ERA systems must be designed to evolve gracefully. For example, new data types (or “file formats”2) will emerge and require modifications to parts of the system. Likewise, storage and other implementation technologies will evolve, necessitating the replacement or upgrading of parts of systems without disrupting the operation of the ERA as a whole. The volume of records to be stored will continue to increase, requiring a strategy for graceful scaling of storage and processing. New options for preservation will probably emerge and be incorporated into the ERA. User demands will also change over time, requiring modifications in the ways that records are located and accessed by users.

COMMONALITIES WITH OTHER DIGITAL PRESERVATION ACTIVITIES (CHAPTER 2)

Finding 3. The requirements of NARA’s ERA program have much in common with those of other digital preservation systems.

Although NARA’s statutory mandate to preserve federal records is unique, many organizations need to preserve digital objects and are taking steps to design and implement systems that address this need. They are developing architectures, techniques, experiments, pilot systems, and expertise relating to digital preservation. NARA has embraced some of these developments, such as the Open Archival Information System (OAIS) reference model for characterizing archival systems in terms of their ingest, storage, and access functions,3 and is aware of activities in organizations such as the Library of Congress, the National Aeronautics and Space Administration (NASA), and the Online Computer Library Center (OCLC). The technology to address the ERA largely overlaps the technology required to build other digital repositories.
For example, NARA and others require robust long-term storage of bits, accommodation for many different data types, metadata standards and processing techniques, and flexible searching and access provisions. Many systems require a high degree of scalability and the ability to evolve their architecture and implementation. And a number of organizations, government agencies, and businesses also face the challenge of long-term preservation of large volumes of data. As it examines commonalities with others working on digital preservation, NARA should evaluate what is already available as commercial software.

There are a few areas where NARA faces requirements that are more stringent than those typical of other digital repositories. NARA’s mandate to guarantee record authenticity is stronger than that of some but not all organizations. NARA must also be ready to preserve materials of very diverse types, even if they were created by obsolete systems or saved on obsolete storage media; by contrast, many digital libraries can simplify their challenge by accommodating only a limited set of contemporaneous or common data types. Indeed, the ERA will have to be capable of ingesting the full variety of data types used to create permanent records across the federal government, which will roughly correspond to the full variety of data types in use more broadly. Also, NARA must sometimes redact classified or restricted documents in order to produce versions that can be released to the public. While these and other requirements may be special to NARA’s electronic preservation systems, the list of special requirements is modest compared with the list of requirements that the ERA shares with other digital preservation systems.

2   “Data types” is a more general term than “file formats,” though the latter may be more familiar.

3   Consultative Committee for Space Data Systems (CCSDS). 2002. Reference Model for an Open Archival Information System (OAIS), CCSDS 650.0-B-1 (Blue Book). CCSDS Secretariat, National Aeronautics and Space Administration, Washington, D.C. January. Available online at <http://wwwclassic.ccsds.org/documents/pdf/CCSDS-650.0-B-1.pdf>.

Recommendation 2. NARA should emphasize the ERA program’s commonality with other digital preservation systems and engage with other programs and organizations wherever possible.

NARA would benefit from increased coordination with other federal entities (such as the Library of Congress or NASA), other institutions (such as OCLC, university libraries, archiving projects, and foreign national libraries and archives), and businesses that share common interests in digital preservation. Enhanced coordination should extend at least to increased information sharing as the various institutions move forward with digital repositories. It might extend to such activities as joint work on standards or best practices. However, the committee is not recommending coordinated or joint procurements, which could significantly complicate or delay the ERA program.

It is not enough to be aware of efforts by other institutions; NARA must engage them by becoming an active participant.
Engagement means not only that NARA will have better access to the expertise and artifacts that these efforts develop but also that it can influence the research, development, and deployment agendas of these groups when it is appropriate and help build a larger community addressing engineering issues related to digital preservation. NARA’s long-term objective should be to increase the commonalities between its preservation systems and those of other digital repositories such as libraries, because this will (1) make it possible to share software and metadata standards with other institutions, (2) help stimulate the development of commercial off-the-shelf (COTS) components by increasing the size of the market for these components, and (3) help grow the cadre of professionals trained in digital preservation and build ties that would help NARA recruit them to work on the ERA program.

LESSONS LEARNED FROM THE SDSC DEMONSTRATIONS (CHAPTER 3)

Finding 4. Demonstrations conducted at the San Diego Supercomputer Center (SDSC) for NARA have provided a useful opportunity for NARA to explore relevant technologies. However, the work has not informed many significant aspects of the ERA design, has not reduced the engineering risk of the program, and has not enhanced NARA’s operational capabilities for running ERA systems.

The SDSC proof-of-concept demonstration projects have provided NARA with the opportunity to interact with the information technology (IT) community and to explore approaches for a production digital archiving system. The SDSC projects have demonstrated options for parts of a production digital archiving system, but NARA should not interpret these projects as solutions to digital archiving issues, as a substitute for gaining experience with operational pilots, or as a source of components of a production system. The areas where the SDSC work falls short of what would be accomplished through operational pilots include the following:

Scale. The quantity of records tested in the SDSC demonstrations is small compared with the quantity of records that NARA anticipates ingesting.

Complexity. The demonstrations addressed only a few relatively simple record types (such as e-mail and Senate legislative records), so the experience is not easily transferable to more complex problems.

Attention to trustworthiness issues. For example, the SDSC work did not address requirements related to redundancy, integrity checks, or access controls.

Attention to operational matters. The SDSC work should be understood as a demonstration of technology rather than as a prototype of an operational system or an operational pilot. The SDSC demonstrations shed little light on the operational issues, such as the work flow associated with ingesting high volumes of varied records, that NARA will need to address in order to run ERA systems. Nor does this exploratory work substitute for having NARA staff work with a system in a production environment; users of SDSC’s scientific data management systems, unlike the potential users of the ERA, are generally very savvy about technical matters and can turn to a highly proficient support staff.

Finally, the SDSC work emphasized a particular strategy for digital preservation: migrating records to XML-based formats. Although XML has important applications in archiving, the use of an XML document format does not solve the problem of format obsolescence, and it would be inappropriate to rely on a migration strategy alone for long-term preservation (see Recommendation 4 below).

In addition, some aspects of the SDSC demonstrations were research work that attempted to express semantic constraints within records. This is a worthwhile long-term goal that may have some utility as a technique for ingesting or preserving certain types of records. However, this particular demonstration of “lifting knowledge” from a document is not persuasive; the technology is far from ready for inclusion in a production system.

ENGINEERING THE ERA (CHAPTERS 4 AND 5)

Finding 5. The broad principles and expectations that NARA has established thus far for the ERA are an insufficient basis for proceeding with its design, procurement, and operation.

NARA has thus far expressed the requirements and objectives for the ERA in very high-level terms. Some of these objectives stem directly from its statutory mandate, e.g., “preserve and provide access to any kind of electronic record.”4 It has also embraced the OAIS reference model to describe the high-level structure of the system.

4   The basic goals of NARA and the ERA program, as expressed in NARA documents, are presented in Appendix A of this report.

More preliminary work is required in setting expectations for the system and estimating its size and scope before NARA can start procuring a workable production system. This includes (1) characterizing the electronic records that the ERA should be expected to ingest in the near term and (2) making pragmatic engineering decisions and defining realistic requirements and priorities. Only by jointly considering archival and technical concerns can NARA chart a course to meet its preservation mandate with achievable IT systems.

Recommendation 3. Before proceeding with design and procurement of the ERA, NARA should gather more data about the electronic records that it expects to preserve in the near future.

To formulate some of the quantitative and qualitative requirements of the ERA, NARA needs more information about the population of government records that it will hold and projections about how the records will be used. To guide the engineering of the ERA, and especially of its early versions, NARA should obtain or estimate these data now. None of these attributes will remain unchanged over the life of the system, but it is nonetheless important to develop the system based on these initial requirements. When data are not available or are impractical to obtain, estimates should be prepared, justified, and made explicit; otherwise, a system design will reflect implicit estimates, which may be dangerously wrong. Importantly, the intent of this recommendation is not that the ERA program should be significantly delayed to conduct detailed surveys; only rough estimates are needed, and order-of-magnitude estimates will suffice in most cases. Examples of the required data include these:

Characterization of the population of digital records that will need to be preserved. How many government records, using what data types, will require preservation? How many records does NARA expect to receive in each future year? As remarked below, the ERA will need to prioritize handling of records based on archival and technical considerations. Both current data and forward estimates are required in order to make these decisions.

Estimates of size and scaling trajectory. How much data will be stored in the ERA, and how will it grow over time? These estimates, which are needed to inform the technical structure of the system, may follow directly from estimates of the record population but may also be governed by the overall project plan, ingest rates, and other considerations.

Mechanisms for delivering records to NARA. While today’s and future records can be delivered to NARA using secure networking techniques, records generated over the past 30 years or less may reside outside network-connected systems on media that are rapidly becoming obsolete. How many records are stored on which media?

Estimates of access rates. Since the ERA does not yet exist, access rates can only be estimated, perhaps based on experience with digital libraries (which might provide a better indication of user interest in online collections than would data on access to NARA’s non-electronic records). The system will certainly need to be designed to increase access performance as demand increases, but even an initial system will require some estimate of access rates.

Budget estimates. The quantitative estimates lead to estimates of costs for developing, procuring, and operating the ERA and are critical to making informed design decisions and investment trade-offs. If, for example, it turns out that the originally planned scope of the ERA would be unaffordable, criteria for preservation scheduling may have to be adjusted. Many design decisions concerning ingest processes and the amount of automation required for them will be driven by the cost of ingesting records of different types. The committee could not find much basis to support estimates for the cost of the ERA, nor could it determine whether current ERA plans are consistent with budgetary constraints.

Recommendation 4. NARA should address key design issues before commencing implementation and apply a pragmatic engineering approach to the ERA’s development and evolution.

Although it may be tempting to speak in absolutes (e.g., “every important record will be preserved forever”) when designing a system, engineering practice recognizes that there are objectives that are subject to constraints. Engineering the ERA will require specifying the objectives and constraints of such a system. This report describes, and the section below summarizes, some of the important design issues that need to be addressed and provides some advice on how to think about them. In some cases, the committee’s preliminary analysis led to design suggestions, but this is no substitute for comprehensive analysis by NARA. Key engineering tasks include these:

1. Prioritize the functions of the ERA and focus initial design on capabilities that permit rapid deployment of operational pilots. NARA’s most basic requirement is to save bits for a hundred years or more. To achieve this requires a combination of careful technical and operational design based on extensive industry experience with robust storage systems of shorter life.
Examples of measures to meet this goal are (1) redundant storage of bits at separate physical sites to survive physical destruction, (2) copying bits to new physical storage devices as existing devices age, (3) using storage systems with nonproprietary interfaces to prevent lock-in by any specific vendor, (4) careful system design and operation to guard against human error that might delete vital bits, and (5) diverse system implementations to guard against software errors that could lose data. One way to gain experience is to build an effective bit storage capability that supports pilot programs to begin preserving records in the short term and that provides a critical foundation for future systems with broader capabilities. Some basic ingest (i.e., intake of records and associated metadata) and access mechanisms are required in early ERA versions, but other functions that are less important can be deferred for later implementation, as long as the initial architecture, design concept, and implementation strategy are sufficiently flexible and evolvable and sufficient attention has been devoted to overall robustness, survivability, maintainability, and compatibility with critical long-term requirements.

2. Design for common cases. Given resource limitations, it is not reasonable to expect a system to preserve and provide equally good support for all records. A relatively small number of data types will likely support the majority of records that federal agencies are creating. It is also advisable that early system builds concentrate on a relatively small number of types. It will be necessary, therefore, to make some choices about what quality of service to provide for different types of record. To decide on the service level accorded each class of record, NARA will need to assess both technical and archival aspects of records. For some lower-priority formats, ERA support should, at least initially, be limited to capturing, storing, and providing access to the original bits and essential metadata.
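
As an illustration only, the minimal bit-level service described here, combined with the redundancy and integrity measures listed under task 1, might be sketched as follows. This is a hedged sketch, not a design from the report: the function names (`store`, `audit_and_repair`), the use of local directories to stand in for separate physical sites, and the choice of SHA-256 as the integrity check are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def store(replicas: list[Path], data: bytes) -> str:
    """Write the same original bits to every replica, keyed by digest.

    Each Path stands in for a storage site (an assumption of this sketch).
    """
    digest = sha256_hex(data)
    for root in replicas:
        root.mkdir(parents=True, exist_ok=True)
        (root / digest).write_bytes(data)
    return digest


def audit_and_repair(replicas: list[Path], digest: str) -> int:
    """Verify each copy against its digest and restore damaged copies.

    Returns the number of replicas that were repaired. Raises
    RuntimeError if no intact copy survives anywhere.
    """
    good = None
    bad = []
    for root in replicas:
        path = root / digest
        if path.exists() and sha256_hex(path.read_bytes()) == digest:
            good = good or path
        else:
            bad.append(path)
    if good is None:
        raise RuntimeError(f"no intact copy of {digest} remains")
    for path in bad:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(good.read_bytes())
    return len(bad)
```

Running `audit_and_repair` periodically over all stored digests is one simple way to detect silent corruption; the same loop could also be used to copy bits onto new storage devices as old ones are retired.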

3. Take pragmatic steps now to facilitate future access to records. NARA does not have to anticipate—or invest in—all the higher-level services that future users might want. Indeed, future archivists and researchers will be skilled in computing and thus will be more able to manipulate and interpret digital records. Also, many institutions, including NARA, share an interest in building an infrastructure of tools that support conversion, migration, and emulation; these will be available to NARA and its users. The ERA should be designed so that certain fundamental information about a record is saved to enable future access. This pragmatic strategy includes the following elements:

Be neutral with respect to migration, emulation, or other approaches. None of today’s approaches—such as migration or emulation—for dealing with data type obsolescence have been perfected, nor has one emerged as accepted archival practice. Today, one should rely foremost on saving the original bits—even if one is unable to decode or render those bits when the records are ingested—together with additional information that facilitates their interpretation (preferred derived forms and essential metadata; see below). The viability of whatever strategies are used in the future will depend on the availability of the original bits.

Save records in “preferred derived forms” in addition to the original bits. The derived forms simplify access to records because the formats are chosen pragmatically to be common, well-documented, and expected to last a long time. These derived forms can readily be created for many common record types by making use of existing export functions or conversion software. Derived forms are, however, no substitute for preserving and providing access to the original bits. (A related strategy that may be discussed in the committee’s second report is to encourage agencies to create records in preferred formats at the outset.)

Do not rely primarily on a strategy of converting records to platform- and vendor-independent archiving formats in an attempt to avoid obsolescence. This point follows from the previous two elements but is stated explicitly because it runs directly counter to the approach of converting all records to technology-independent formats, which is not likely to be effective. XML formats are often proposed for this role, but they cannot assume the role of the original data type because they cannot be relied on to faithfully encode all of the elements of all data types. By contrast, an XML representation may be a very useful derived form that serves as an adjunct to saving the original data type.

Save essential metadata. While it is often advisable to save as much metadata as possible, the most important metadata to save are those that cannot be derived from the record itself and thus would otherwise be lost (e.g., contextual metadata).

Save essential external references that are implicit or explicit in the record. Digital files often refer to other digital objects, such as embedded images, tables generated by running some program, and files belonging to other organizations. The possibilities for cross-reference in digital files are far richer than in paper files, and rules will have to be developed to decide which cross-references should be preserved by copy, by reference, or both.

Archive as much information as possible about the software and work flow processes used to ingest the original records. This information may be essential when future users of the archive wish to understand in detail how records have been processed. A desirable goal would be that the ingest process work flow be log-based and otherwise designed to facilitate analysis in case the preserved form of the record is later discovered to have been incorrectly ingested.
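
The elements of task 3 can be pictured as a single per-record package that keeps the original bits authoritative and attaches everything else to them. The sketch below is illustrative only and assumes a structure the report does not specify: the `ArchivalPackage` class, its field names, the tool label `ingest-sketch 0.1`, and the use of SHA-256 digests to identify stored bits are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ArchivalPackage:
    """Hypothetical manifest describing one ingested record."""
    original_sha256: str            # identifies the untouched original bits
    original_data_type: str         # as reported at transfer; may be obsolete
    contextual_metadata: dict       # metadata not derivable from the record
    external_references: list = field(default_factory=list)
    derived_forms: dict = field(default_factory=dict)   # format -> digest
    ingest_log: list = field(default_factory=list)      # append-only history

    def log(self, step: str, tool: str, detail: str = "") -> None:
        """Append one work flow step, recording which software performed it."""
        self.ingest_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "step": step,
            "tool": tool,
            "detail": detail,
        })


def ingest(original: bytes, data_type: str, metadata: dict,
           references: tuple = ()) -> ArchivalPackage:
    """Create a package for the original bits without decoding them."""
    pkg = ArchivalPackage(
        original_sha256=sha256_hex(original),
        original_data_type=data_type,
        contextual_metadata=dict(metadata),
        external_references=list(references),
    )
    pkg.log("receive", tool="ingest-sketch 0.1", detail=f"{len(original)} bytes")
    return pkg


def add_derived_form(pkg: ArchivalPackage, format_name: str,
                     derived: bytes, tool: str) -> None:
    """Register a derived form alongside the original, never replacing it."""
    pkg.derived_forms[format_name] = sha256_hex(derived)
    pkg.log("derive", tool=tool, detail=format_name)
```

A package built this way can be serialized with `json.dumps(asdict(pkg))`. Because every derivation is logged with the tool that produced it, a derived form later found to be faulty can be traced and regenerated from the original bits.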

4. Safeguard the bits. The risks of the various possible causes of data loss—such as malicious acts, natural disasters, software bugs, human error, and hardware failures—should be assessed and used to make informed engineering cost-benefit trade-offs. A combination of appropriate system design and operational policies and procedures will be required. Measures to consider include redundancy (e.g., geographically distributed replicas), media refresh (copying data to new media before old media fail due to age), integrity checks (e.g., to verify the integrity of records received from agencies, to detect errors in data storage systems, and to protect against tampering), access controls (e.g., to control who can write or modify records and to protect classified or otherwise nonpublic records), and auditing. Some records may be deemed more important than others and will justify greater investment than others to ensure that they persist.

5. Select the appropriate storage media. The economics, performance, and robustness of all-disk storage systems have recently begun to exceed those of systems that include magnetic tape either as a primary storage medium or as a backup. While not yet common practice, it is likely that robust disk-only storage systems will become an attractive alternative to tape storage early in the life of the ERA. NARA should seriously consider such designs for the first ERA systems; they are much simpler than storage involving both tape and disk. Offline, possibly write-once, storage may continue to play some useful role in storing infrequently accessed records; the cost, complexity, performance, and reliability trade-offs associated with each technology option should be carefully considered. Even if the ERA does not initially eschew tape, it should be designed to make it easy to switch away from tape in the future.

6. Decide where to invest in access capabilities.
Historically, the primary tool for finding physical (paper) records of interest has been the finding aid; finding aids, along with other surrogate records, are now used by computer systems that help people to find physical records. When the entire contents of digital records are available for computer processing, content-based retrieval techniques like full-text searching (which has proven to be a high-payoff, relatively low-cost method in other contexts) become possible. These will alter access interface designs, ingest, processing, access strategies, cost trade-offs, and even approaches to handling confidential or classified records.

7. Plan for consistent access to digital and physical records. Over time, it should become possible to use single, consistent access tools to search all the records in the custody of NARA, be they physical or digital. Indeed, over time, some current physical records may even be transferred to digital form. The ERA design needs to take this into account, and while such cross-collection capabilities would probably not be implemented in early iterations of the ERA, it is essential that the architecture recognize this long-term convergence.

INFORMATION TECHNOLOGY EXPERTISE (CHAPTER 6)

Finding 6. Greater information technology (IT) expertise is needed if NARA is to successfully design, acquire, and operate ERA systems.

Insufficient technical expertise at NARA is a major obstacle to successful development and acquisition of the ERA. Based on briefings and other interactions with NARA staff, the committee concludes that while there is recognition of the importance of the ERA program, few NARA staff members have experience with or fully understand the complexity of building and managing a program as challenging as the ERA. NARA today does not appear to have sufficient technical depth to assure success in launching the ERA program, that is, to define and manage the overall architecture, develop the appropriate request for proposals, evaluate technical responses, negotiate changes in the architecture with vendors, and manage the implementation of the system.

In addition to needing a quick ramp-up in the IT expertise necessary to oversee the early phases of procurement, NARA faces a longer-term need for a more pervasive culture change: IT skills related to preservation will need to be a core competence throughout the organization, on a par with its other institutional strengths. NARA recognizes the existence of this issue in its appointment of a change manager associated with the ERA program, but the difficulty in achieving this shift cannot be overestimated. It will not be possible to achieve the needed changes quickly; this pervasive change should be addressed in parallel with other facets of ERA development.

Recommendation 5. In order to pursue technical development of the ERA, NARA should first hire a small team of first-rate information technologists with systems design expertise.

The addition of a few employees with properly focused systems design expertise would greatly increase the likelihood that the ERA program will be successful. Preparing the architectural design of the ERA requires first-class talent having both archival and IT expertise. Whether the architecture is defined by NARA staff (the preferred approach; see Recommendation 7) or contractors, the challenges of hiring qualified IT people are almost identical for these two approaches. If it is to be successful, the contracting approach requires NARA’s expertise to equal that of the design contractors in order to determine whether a design will meet NARA’s needs. Contracting for system implementation once an architecture is defined likewise requires at a minimum an in-house contract monitoring staff (e.g., the contracting officer’s technical representative) with IT expertise at least as good as that of the contractor’s people. This expertise will be essential, in particular if NARA is to successfully pursue an iterative development approach (Recommendation 7).

Recommendation 6. To supplement its in-house expertise, NARA should recruit an advisory group of government, academic, and commercial experts with deep knowledge of digital preservation and IT system design.

A standing ERA advisory committee focused on digital preservation issues would provide an ongoing way to supplement NARA’s IT capabilities. By drawing expertise from the range of digital preservation efforts under way in government, industry, and elsewhere, the advisory committee would allow NARA to learn from those efforts and to foster collaboration (e.g., on techniques, standards, or common components) where warranted.

STRATEGY FOR EVOLVING AND ACQUIRING THE ERA (CHAPTER 7)

Finding 7. Building the ERA as a conventional procurement is unlikely to succeed owing to the difficulty of accurately anticipating all system requirements. Instead, an iterative approach is needed.

Procurement of the ERA is fundamentally different from procurement of a payroll or other commonplace IT system. No one has built an ERA before, and its requirements are not yet completely understood. Also, some of the requirements (such as safeguarding bits with very high confidence) are stringent. As a result, the procurement should emphasize modularity, iteration, and working with vendors to define and evolve the system rather than arm's-length specification and delivery of a completed, turnkey system. The ERA will need to evolve, perhaps rapidly, during its early years as technical and operational requirements are modified by experience. Later in its life, the ERA may evolve more slowly as new needs are identified and as old hardware and software components are replaced or upgraded.

Recommendation 7. The ERA should be designed as a modular system that can be built, maintained, modified, and evolved incrementally, subject to an overall architecture.

A proper modular design allows components to be upgraded without disrupting the operation of the system. An overall structure for the ERA is suggested by the OAIS reference model, but a design for the ERA will need to be much more detailed in order to exploit the benefits of modularity. A modular structure would make it easier to use COTS components for the ERA.

The system’s architecture ensures that the pieces fit together, that the system can be incrementally modified, and that it can evolve over time. Interfaces between major parts should be specified using an open approach that allows multiple vendors to supply components over a long lifetime. The architecture is itself subject to evolution over time as requirements are better understood or new requirements emerge. However, devising a good initial architecture is very important as the program’s success will be sensitive to the nature of the architectural decisions that are made early on. It is, for example, far harder to evolve the modular structure than to evolve individual modules.

The architecture of the ERA system(s) should be “owned” (specified and evolved over time) by NARA. This would help ensure that NARA understands the implications of alternative proposals, reduces its dependence on vendors and the risk of proprietary lock-in, and understands the limitations and strengths of systems that vendors deliver. Preparing the architectural design of the ERA requires first-class talent having both archival and IT expertise. So too does evolving it over time to meet new requirements.

A far poorer alternative to a NARA-owned architecture is for NARA to contract for one or more architectural designs. In this case, it may be worthwhile to obtain several proposals, because different contractors may have different opportunities or ideas for incorporating COTS elements. This alternative also requires first-class technical talent in order to specify the scope of a design contract, to evaluate resulting designs, and to proceed with acquiring and evolving a system.
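The value of specifying interfaces between major parts can be illustrated with a short sketch. The Python below is not drawn from any NARA or vendor design: the names StorageBackend, InMemoryBackend, and Archive are hypothetical, and SHA-256 is assumed as the integrity check. It shows one way an archive module might depend only on an agreed interface, so that a storage component can be replaced (for example, when moving away from tape) without changing the rest of the system.

```python
import hashlib
from typing import Dict, Protocol


class StorageBackend(Protocol):
    """Hypothetical open interface that any storage module must satisfy."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBackend:
    """Illustrative stand-in for a disk- or tape-based storage module."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class Archive:
    """Depends only on the interface, so the backend can be swapped later."""

    def __init__(self, backend: StorageBackend) -> None:
        self._backend = backend
        self._digests: Dict[str, str] = {}  # record id -> SHA-256 hex digest

    def ingest(self, record_id: str, data: bytes) -> str:
        """Store a record and remember its digest for later integrity checks."""
        digest = hashlib.sha256(data).hexdigest()
        self._backend.put(record_id, data)
        self._digests[record_id] = digest
        return digest

    def retrieve(self, record_id: str) -> bytes:
        """Return a record only if its stored bits still match the digest."""
        data = self._backend.get(record_id)
        if hashlib.sha256(data).hexdigest() != self._digests[record_id]:
            raise ValueError(f"integrity check failed for {record_id}")
        return data

    def migrate(self, new_backend: StorageBackend) -> None:
        """Copy every record to a new backend, verifying each one on the way."""
        for record_id in self._digests:
            new_backend.put(record_id, self.retrieve(record_id))
        self._backend = new_backend
```

In this sketch, migrate plays the role of a media refresh: records are verified as they are copied, and the Archive code is unchanged by the move. A production design would need far more detail, including persistent digests, replicas, and auditing.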

Recommendation 8. NARA should begin development of the ERA with a small number of focused pilot production systems designed to gain early experience and to converge ultimately into a smaller number of more comprehensive systems.

NARA should concurrently develop and deploy small, focused systems that rapidly build operational experience. All of these systems should be built within a common architectural framework (Recommendation 7), so that they may eventually coalesce into a smaller number of more comprehensive systems as experience and confidence grow. It is especially important that the data model (the data types and related metadata) conform to the architecture so that the digital data obtained by ingesting records into one of the early systems will carry forward into future evolutions.

The initial systems should be selected and scoped for rapid deployment; this is the key to gaining early experience to inform the requirements of later systems. The following are some examples of limited-scope systems that might be considered for early pilots:

U.S. State Department diplomatic cables. NARA is preparing to acquire a collection of diplomatic cables, which are simple structured text files, in digital form. Ingest might include automatic extraction of metadata from the cables; access might include full-text search or other methods appropriate to the collection. For quickest deployment, NARA might consider making these records available using software already developed for operating a digital library.

Records at the National Personnel Records Center. There is interest in preserving large but homogeneous collections of official military records scanned in TIFF image format when they are transferred to NARA’s National Personnel Records Center. Confidentiality considerations and the imperative to provide ready access to veterans or next-of-kin would require careful attention to access controls.

E-mail from the Clinton administration held by the Clinton Presidential Center. Metadata could be extracted from the e-mail headers, full-text search could be provided, and so on. The presence of attachments would permit gaining experience with preserving and providing access to a broad range of relatively contemporary data types.

These three examples illustrate collections that could be organized and made available quickly. Although these collections might lack the scale of the eventual ERA, early deployment of systems to preserve and access them would yield important operational experience for NARA and avoid costly mistakes in later, more complex systems.

Experience with early systems can be expected to lead to changes to the ERA architecture and to the substantial refinement of requirements for subsequent, more comprehensive systems. Managing the initial architecture, the first system deployments, the learning from early operations, and the revisions to architecture and specifications, as well as evolving the ERA itself, will be the task of NARA’s augmented IT staff.
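To make the e-mail pilot described above more concrete, the following is a minimal sketch of header metadata extraction and full-text search, written in Python using only its standard library. It is an illustration under stated assumptions, not a proposed ERA component: the function names, the metadata fields chosen, and the sample message are all hypothetical, and the sketch handles only plain-text bodies without transfer encoding.

```python
import re
from collections import defaultdict
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical sample message, invented for illustration only.
SAMPLE = (
    "From: alice@example.gov\n"
    "To: bob@example.gov\n"
    "Subject: Budget meeting\n"
    "Date: Mon, 06 Jan 2003 10:15:00 -0500\n"
    "\n"
    "The budget review is moved to Friday.\n"
)


def extract_metadata(raw_message: str) -> dict:
    """Pull a few descriptive fields from the headers of one message."""
    msg = message_from_string(raw_message)
    date_header = msg.get("Date")
    return {
        "from": msg.get("From"),
        "to": msg.get("To"),
        "subject": msg.get("Subject"),
        "date": parsedate_to_datetime(date_header).isoformat() if date_header else None,
        # File names of any attachments, useful for later format identification.
        "attachments": [p.get_filename() for p in msg.walk() if p.get_filename()],
    }


def body_text(raw_message: str) -> str:
    """Concatenate the plain-text parts of a message, skipping attachments."""
    msg = message_from_string(raw_message)
    parts = [
        p for p in msg.walk()
        if p.get_content_type() == "text/plain" and not p.get_filename()
    ]
    return "\n".join(p.get_payload() for p in parts)


def build_index(messages: dict) -> dict:
    """Build an inverted index mapping each word to the ids of messages containing it."""
    index = defaultdict(set)
    for record_id, raw in messages.items():
        for word in re.findall(r"[a-z0-9]+", body_text(raw).lower()):
            index[word].add(record_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the ids of messages whose bodies contain every word in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

For example, build_index({"msg-1": SAMPLE}) followed by search(index, "budget friday") returns the set containing "msg-1". A real collection would add requirements this sketch ignores, such as encoded bodies, attachment formats, and access controls on nonpublic records.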