Building an Electronic Records Archive: Letter Report October 16, 2003 THE NATIONAL ACADEMIES Advisers to the Nation on Science, Engineering, and Medicine Computer Science and Telecommunications Board 500 Fifth Street, NW Washington, DC 20001 202 334 2605 202 334 2318 (fax) www.cstb.org October 16, 2003 Dr. Kenneth Thibodeau Director, Electronic Records Archives (ERA) Program Management Office The National Archives and Records Administration 8601 Adelphi Road College Park, MD 20740 Dear Dr. Thibodeau: In this letter,1 the National Research Council’s Committee on Digital Archiving and the National Archives and Records Administration, chartered to study the Electronic Records Archives (ERA) program, offers comments on the ERA program’s draft request for proposals (DRFP)2 and the attached requirements document (RD)3 and ERA deployment concept (ERA DC)4 that were made available for public review in August 2003. The comments below elaborate on issues discussed in the committee’s first report, Building an Electronic Records Archive at the National Archives and Records Administration: Recommendations for Initial Development (hereafter referred to as the committee’s first report),5 tying the issues specifically to the DRFP, which was not available in early 2003 when the committee completed its report. A meeting with the ERA team on August 14, 2003, helped the committee to better understand the philosophy underlying the DRFP and also informed the comments below. The committee supports the National Archives and Records Administration’s (NARA’s) goal of building new electronic records preservation capabilities quickly to meet NARA’s and the nation’s needs to preserve federal records (committee’s first report, Findings 1 and 2, pp. 1-2). 
The committee understands that the general objective of the DRFP is to spell out these needs and to seek proposals that offer thoughtful and innovative ways to meet them (thus challenging industry), rather than to prescribe too narrowly how the ERA must be designed. As detailed in item 1 below, the committee is especially concerned that the software development process outlined in the DRFP and RD does not reflect Finding 7 of the committee’s first report (p. 11), “Building the ERA as a conventional procurement is unlikely to succeed owing to the difficulty of accurately anticipating all systems requirements. Instead, an iterative approach is needed.” The committee makes the following specific suggestions for how the DRFP might be improved: 1. Adopt an iterative design approach. The DRFP is written assuming a classic waterfall design methodology in which requirements are established early and development proceeds in a serial fashion. The committee’s first report advocated a quite different approach—spiral design—that makes use of early prototyping and continuous refinement of requirements (Finding 7, p. 11; Recommendation 8, p. 12; and discussion, pp. 67-70). The DRFP asks vendors to provide information on their proposed use of prototyping,6 but prototyping does not receive as much emphasis as it should. Discussion with NARA indicated that the two design phases and five deployment increments described in the DRFP are to be treated somewhat like iterative design: at each stage, a new set of requirements will be prepared, and parts of the system may be revised and new parts implemented for the
first time. Discussion also indicated that easier, less risky system parts would be designed first and the more difficult ones delayed in the expectation that new and better approaches would emerge. This strategy will not reap the full benefits of iterative design, which requires learning early on from prototypes and an ability to evolve requirements and improve subsequent designs as more is learned. For example, a number of important technical details—e.g., file formats supported, essential metadata tags selected, and the data model used to save ingested records—are likely to require iterative development. Some of these design decisions (e.g., the data model) will be much easier to change when the system is young and has not yet ingested a large amount of data. In fact, some revisions may require reading and reformatting the contents of the entire archive. Also, the mechanisms for ingesting new records are likely to require extensive iterative design. The degree of automation, human interface, and resulting productivity of the ingest process will be critical to the success of the system (committee’s first report, p. 49). Designing a good user interface is hard even when the needs and skills of users are well understood; neither is the case here. The DRFP is silent on the qualifications of individuals required to operate and maintain the ERA; such aspects as the human error rate will be critical to the long-term viability and cost of the system. 2. Specify that an explicit threat model be developed early in the ERA’s life cycle. 
The DRFP makes occasional mention of measures that might help in averting threats (physical distribution of storage sites, user authentication, audit logs, etc.), but it includes no overall requirement that the system be capable of surviving an attack or accident.7 Records stored in the ERA will be kept for hundreds of years and must be protected against loss due to natural disasters, hardware and software failures, operator errors, and potential attacks from both inside and outside NARA (committee’s first report, Recommendation 4, p. 9). The committee’s first report advocated that NARA develop an explicit threat model and evaluate ERA designs against the model (p. 54). Retrofitting of security measures has repeatedly been demonstrated to fail. Because security should be designed in from the beginning (committee’s first report, p. 53), a threat analysis should be done as early as possible in the ERA life cycle. Moreover, contractors may propose quite different system designs depending on the threat model that is developed. 3. Tighten up requirements related to security, integrity, survivability, reliability, and robustness. In addition to specifying that a threat analysis be done at the outset, the RD should be revised to clarify a number of additional points: ERA 1.2 (see note 8) should include a requirement that record transfers can be integrity-checked (committee’s first report, Recommendation 4, p. 9 and p. 56). ERA 1.2 mentions authorization and ERA 13.7 mentions authentication, but these safeguards are distinct from integrity-checking. Protection should be provided against human error and malicious insiders. Any request that would delete or permanently alter data in the archive should require confirmation by two or more separate humans who are authorized to act on the request, a requirement that is not in the current RD. 
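As an illustration only, the two safeguards just described (integrity-checking of record transfers, and confirmation of destructive requests by two or more separate authorized people) might be sketched as follows. The use of SHA-256 and the names `sha256_hex`, `verify_transfer`, and `DeletionRequest` are assumptions made for this sketch; neither the RD nor the committee’s first report prescribes a particular mechanism.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a record's bit stream as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_transfer(received: bytes, declared_digest: str) -> bool:
    """True if the received bits match the digest declared by the sender."""
    return sha256_hex(received) == declared_digest.lower()


@dataclass
class DeletionRequest:
    """Hypothetical destructive request that needs distinct authorized approvers."""
    record_id: str
    approvals: Set[str] = field(default_factory=set)

    def approve(self, user: str, authorized_users: Set[str]) -> None:
        # Only people authorized to act on the request may confirm it.
        if user not in authorized_users:
            raise PermissionError(f"{user} is not authorized to approve deletions")
        # A set ignores repeat approvals, so one person cannot confirm twice.
        self.approvals.add(user)

    def may_execute(self, required: int = 2) -> bool:
        """True once at least `required` separate people have confirmed."""
        return len(self.approvals) >= required
```

A transfer would be accepted only when `verify_transfer` returns true, and a deletion would proceed only when `may_execute` returns true. Note that a digest by itself detects accidental corruption; protecting against deliberate tampering would also require that the declared digest arrive over a trusted channel, which is a question for the threat model.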
ERA 13.4, “The system shall support virus detection,” and ERA 13.5, “The system shall support virus elimination.” How will the system handle a record that appears to include a virus? Is the virus part of the record? Whose responsibility is it to deal with embedded viruses—NARA’s or the end-user’s? ERA 13.8, “The system shall provide for backup of ERA.” How does this requirement relate to the deployment strategy of “safe stores”? Is such backup in addition to safe stores, or do safe stores meet this requirement? ERA 14.5, “The system shall use templates to check the authenticity of transferred electronic records.” This requirement is unclear, given the use of the vague term “template.” What information contained in a template can be used to check authenticity? How? Also, it might be better to require
simply that the authenticity of records be checked and leave it to contractors to develop a suitable approach. ERA 18.10.2, “The system shall output certified copies of electronic records in formats selectable by the user from available choices.” What is a “certified copy”? Will these copies be electronic files “signed” by NARA? Is the contractor being asked how to certify copies and how to record such certification? ERA 21.1, “The system shall register users.” This requirement appears to imply that access cannot be anonymous. Is this necessary or desirable? RD, section 2.7.4 second paragraph. The term “non-repudiation” is used incorrectly here (either that or the committee misunderstands the paragraph). 4. Explicitly require a levels-of-service approach. Service-level differentiation is a practical necessity (committee’s first report, p. 26). If the ERA must be able to accept all data types supplied by agencies, then “ideal” preservation solutions cannot practically be developed for the full spectrum of data types expected. The committee has observed a growing recognition at NARA that different record types or contexts will require different levels of service. The DRFP includes references to this approach,9 but the full ramifications of service-level differentiation are not yet clearly expressed, and the level-of-service approach does not appear in the requirements themselves. In the spirit of seeking industry ideas, the DRFP should include an express requirement to classify records as to levels of service, so that proposals will include mechanisms to help set, enforce, and revise the service levels associated with records or groups of records. Just as templates may be required to streamline the ingest of common record types, so also a codification of service levels may be an important part of the ERA’s structure. 
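One way to picture a codification of service levels is a small registry that can set, look up, and revise the level attached to a data type or to an individual record. The sketch below is hypothetical: the three level names, the `ServiceLevelRegistry` class, and the rule that a per-record setting overrides a per-type setting are assumptions for illustration and do not come from the DRFP or RD.

```python
from enum import IntEnum
from typing import Dict


class ServiceLevel(IntEnum):
    """Hypothetical levels, ordered from least to most preservation effort."""
    BITS_ONLY = 1          # keep and deliver the original bit stream only
    MIGRATED = 2           # also maintain a converted copy
    FULL_PRESENTATION = 3  # also support on-screen presentation


class ServiceLevelRegistry:
    """Sets, revises, and looks up service levels for types and records."""

    def __init__(self, default: ServiceLevel = ServiceLevel.BITS_ONLY) -> None:
        self._default = default
        self._by_type: Dict[str, ServiceLevel] = {}
        self._by_record: Dict[str, ServiceLevel] = {}

    def set_for_type(self, data_type: str, level: ServiceLevel) -> None:
        # Calling this again for the same type revises the level.
        self._by_type[data_type.lower()] = level

    def set_for_record(self, record_id: str, level: ServiceLevel) -> None:
        self._by_record[record_id] = level

    def level_for(self, record_id: str, data_type: str) -> ServiceLevel:
        # Assumed precedence: record override, then data type, then default.
        if record_id in self._by_record:
            return self._by_record[record_id]
        return self._by_type.get(data_type.lower(), self._default)

    def permits(self, record_id: str, data_type: str,
                needed: ServiceLevel) -> bool:
        """Enforcement check: is the requested service within the level?"""
        return self.level_for(record_id, data_type) >= needed
```

Under these assumptions, a data type with no explicit entry falls back to the bits-only default, so the archive can still accept the record even though no richer service has been defined for it.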
ERA 18.5.2, “The system shall provide the capability to electronically present electronic assets independently of the software with which they were created,” should also be clarified and reconciled with the notion of providing levels of service. Presumably the intent of ERA 18.5.2 is that lack of access to the software that created a record should not preclude access to the record itself. Yet if the level of service for a particular record provides for keeping only the bits—because, for example, the data type is so seldom used or so unimportant that no attempt has been made to emulate or migrate the record or do any other form of conversion—then “present” must mean “deliver the bits” rather than “display an image.” Finally, Table C-4, “ERA Supported Data Types,”10 should be filled in more completely to reflect (1) the data types that NARA has already announced it will support, such as Portable Document Format (PDF), (2) a concrete approach to requirement ERA 28, “The system shall accept all types of electronic records,” and (3) varying levels of service and how they would be provided for different record types. The ERA must be designed, at the outset, to support a broad range of record types, even if implementations of some types are deferred. 5. Avoid overspecification. In some places, the DRFP calls for specific approaches to implementation rather than asking for proposals for an overall system that exhibits particular desired behaviors. Calling for the latter would be consistent with the strategy presented at the August 14, 2003, meeting, which is to challenge industry to propose innovative solutions. Specific instances of the requirements specifying particular implementation approaches rather than input-output behaviors include the following: Multiple references to “persistent data formats.”11 It would be best to omit these references, because whether and how to employ persistent data formats is an internal implementation decision. 
Instead, NARA can evaluate proposed designs on the basis of how they handle the obsolescence of data types, the degree to which they introduce errors, and so on. Although the use of persistent formats may ultimately be the best strategy for at least some types of records, it is not the only possible approach
nor necessarily the best in all circumstances.12 The requirements should be neutral with respect to how a contractor may propose designing a system that will retain the requisite level of functionality and access over time. ERA 11.5, “The system shall store electronic records such that an individual electronic record does not span media volumes.” This decision should be up to the implementers as long as the input-output requirements are met. In particular, this requirement would seem to rule out redundant array of independent disks (RAID) technology and related techniques for file-system redundancy, which are arguably applicable to the ERA. ERA 184.108.40.206, “The system shall maintain an archive file directory defining the physical locations of all electronic records within the system.” This requirement also appears to overspecify the design. A requirement to be able to obtain the physical location of electronic records in the ERA should be made into an external (output) requirement. “ERA will support automated media maintenance and tools to recover data from failed media.”13 Does this requirement apply to ingesting records from media provided by others, or to media used within the ERA to store records? (See also ERA 11.4.) If the latter, this requirement would intrude on a contractor’s ability to provide the best technologies for storing data for long periods—chiefly, using redundancy techniques in such a way that recovering data from failed media is unnecessary. If the phrase “automated media maintenance” refers to robotic tape libraries, it, too, overconstrains the solution. Detailed requirements on search characteristics included under ERA 17.5. Is this detailed specification necessary? Proposals should include state-of-the-art searching capabilities. (See also points 7 and 8, below.) 6. Strengthen the requirement for the ERA to preserve the original bit stream. 
An essential input-output behavior is that the ERA be able to ingest, preserve, and produce the original bit stream of a record in its native format (committee’s first report, p. 31). Although the August 2003 discussion indicated that this is the intent, the committee believes this requirement may not be stated sufficiently strongly in the RD. Perhaps requirement ERA 7.2, “The system shall support retaining data files in the formats in which they were ingested,” is intended to address this requirement. If so, it should be strengthened to specify that the original bit stream of each record must be saved (the requirement as currently worded says only that the original format be used, not that the bits must remain unaltered). (Of course, depending on the level of service deemed appropriate for a particular record or class of records, additional preservation measures might well be undertaken as specified in the record’s preservation plan.) 7. Require more flexible search facilities. The RD as written assumes that there is a single search mechanism for the entire archive. But different collections or record types might be more accessible with quite different search engines or techniques. The committee recommends that the ERA be structured so that several search techniques can be accommodated, and so that search software can be easily changed or replaced over time. (Search technologies are discussed briefly in the committee’s first report, pp. 51-53.) 8. Explicitly require a federation interface that supports search and retrieval across multiple archives. Researchers often search more than a single archive and will benefit from federated search and access mechanisms whereby several archives can be accessed at once. (This capability to search for and access items across institutions would be in addition to whatever federation techniques are used within the ERA to manage storage.) 
If ERA, as the nation’s archive of electronic records, offers a federation mechanism and invites others to join, it will induce others to do so as well. The RD should challenge industry to design a federation interface based on approaches being developed or prototyped by others (e.g., the digital library community).
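Points 7 and 8 can be pictured together as a narrow search interface that any engine or any participating archive implements, plus a thin layer that fans a query out and merges the results. The following is a minimal sketch under stated assumptions: `SearchBackend`, `KeywordBackend`, `Hit`, and `federated_search` are hypothetical names, the keyword matcher is a toy stand-in for a real engine, and merging by raw score assumes that all backends score on a comparable scale, which real federations cannot take for granted.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol


@dataclass(frozen=True)
class Hit:
    archive: str    # which archive or collection produced the hit
    record_id: str
    score: float


class SearchBackend(Protocol):
    """Anything with a name and a search method can be plugged in."""
    name: str

    def search(self, query: str) -> List[Hit]: ...


class KeywordBackend:
    """Toy backend: scores a record by how often the query terms occur."""

    def __init__(self, name: str, documents: Dict[str, str]) -> None:
        self.name = name
        self._documents = documents

    def search(self, query: str) -> List[Hit]:
        terms = query.lower().split()
        hits = []
        for record_id, text in self._documents.items():
            words = text.lower().split()
            score = sum(words.count(term) for term in terms)
            if score > 0:
                hits.append(Hit(self.name, record_id, float(score)))
        return hits


def federated_search(query: str, backends: Iterable[SearchBackend]) -> List[Hit]:
    """Send the query to every backend and merge hits, best score first."""
    merged: List[Hit] = []
    for backend in backends:
        merged.extend(backend.search(query))
    return sorted(merged, key=lambda hit: hit.score, reverse=True)
```

Because callers depend only on the `search` method, a collection could swap its engine, or a new archive could join the federation, without changes to the code that issues queries.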
9. Plan for increasingly distributed operations. While the core of the ERA will be distributed among several sites and closely managed by NARA, some processes may be deployed elsewhere. For example, it would seem essential to permit record ingest to be performed within a client agency, or by a contractor working for the agency and/or NARA. Over time, agencies may come to play a greater role in record ingest, and NARA’s role will be to provide the ingest software or specifications, and to provide careful automatic checking of the records that have been ingested, together with associated data (data currently entered in Standard Form 258, metadata tags, file format integrity, authenticity checks, etc.). Another good example of the value of distributed operations is the opportunity to leverage externally supplied preservation tools, either for file migration or for on-the-fly presentation of an obsolete format. The ERA should, therefore, be designed from the outset so that certain parts can be operated at geographically separate sites to accommodate flexible workflows. Perhaps ERA 1.8 and ERA 1.9 are intended to address distribution, but it is not clear. ERA should, for example, be designed so that NARA employees or contractors physically outside a NARA facility can have secure access to the ERA. This capability will enable contracting out various processing tasks and also allow NARA employees to work from home. (Note that a remote access requirement also affects ERA 20; the user interface should permit operation from, say, standard PCs on the Internet.) Any need for other agencies or contractors to be able to examine records for sensitive content (ERA 12.4) will have an impact on workflow and other matters. Is there a need to record who (or which automated process) made a decision regarding a record’s sensitivity? 
Of course, a careful threat analysis (see point 2, above) with respect to distributed operations is essential, and security requirements associated with distributed operations should be included under ERA 13. 10. Specify more precisely the acceptable update lag between primary and “safe store” data repositories. The ERA DC correctly notes that there will be a lag while data stores synchronize.14 For system design to proceed, this lag will have to be specified more precisely: if the lag is on the order of seconds, large instantaneous bandwidth may be required to connect the safe store sites in order to communicate updates, whereas if the lag can be longer, then the communications need only meet average ingest rates and might even use mailed media rather than network connections for synchronization. 11. Provide better definitions for key terms. Despite an impressive glossary in RD Appendix B, some important terms are used without clear definition in the DRFP. The DRFP should introduce better definitions or cite literature in which these terms are explored further. For example: Essential characteristics—A clear idea of what this phrase means in practice would be valuable. For example, will the need for strong functional equivalence arise? Some concrete examples might be helpful. Persistent format—This term has been introduced in several papers written in conjunction with the ERA program, but the committee found especially helpful the clarification offered at the August 2003 meeting—that is, the use of persistent formats as a form of migration that tries to minimize the number of transformations. Self-describing format—This is a slippery term: every description is written in a “language” that itself must be described, thus begetting an infinite recursion. Again, examples would help to clarify a non-obvious term. See also ERA 11.3. Conceptual search—This phrase means different things to different people. 
If the intent is to distinguish between controlled-vocabulary (metadata) search and full-text search, it would be
sufficient to say simply “full-text search.” If something more elaborate is meant, the requirement is asking for a capability beyond the state of the art. Template—The term is defined in operational terms, but not clearly with respect to its contents or (detailed) role. Electronically—This term is used in an unnatural way in several places in the RD. For example, ERA 14.2 says, “The system shall provide the capability to accept transfers electronically.” Why not simply say “via a network connection”? Or is the term “electronically” meant to include disk and tape, too? Similarly, ERA 18.1 says, “The system shall be capable of electronically presenting all electronic record types.” Should the wording be “delivering via a network connection” instead of “electronically presenting”? For some records, the level of service may be limited to delivering the original bit stream. In each case, a more explicit definition would help. Sincerely, Robert F. Sproull, Chair Committee on Digital Archiving and the National Archives and Records Administration
NOTES
1 Support for this project was provided by the National Archives and Records Administration under Contract No. NAMA-02-C-0012. Any opinions, findings, conclusions, or recommendations expressed in this letter are those of the authors and do not necessarily reflect the views of the organization that provided support for the project. This letter report has been reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise, in accordance with procedures approved by the National Research Council’s Report Review Committee. The purpose of this independent review is to provide candid and critical comments that will assist the institution in making its published report as sound as possible and to ensure that the report meets institutional standards for objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process. We wish to thank the following individuals for their review of the report: William Y. Arms, Cornell University; David D. Clark, Massachusetts Institute of Technology; Judith L. Klavans, Columbia University; MacKenzie Smith, MIT Libraries; and J. Timothy Sprehe, Sprehe Information Management Associates. Although these reviewers provided many constructive comments, they were not asked to endorse the conclusions or recommendations, nor did they see the final draft of the report before its release. The review of this report was overseen by Robert J. Spinrad, Xerox Corporation (retired). Appointed by the National Research Council, he was responsible for making certain that an independent examination of this report was carried out in accordance with institutional procedures and that all review comments were carefully considered. Responsibility for the final content of this report rests entirely with the authoring committee and the institution.
2 Electronic Records Archives Program Management Office, National Archives and Records Administration (NARA). 2003. Electronic Records Archives Draft Request for Proposal (DRFP). NARA, College Park, Md., August 5. Available online at <http://www.archives.gov/electronic_records_archives/acquisition/draft_rfp.html>.
3 Electronic Records Archives Program Management Office, National Archives and Records Administration (NARA). 2003. Electronic Records Archives Requirements Document (RD), (DRFP, Section J, Attachment 2). NARA, College Park, Md., July 31. Available online at <http://www.archives.gov/electronic_records_archives/acquisition/draft_rfp.html>.
4 Electronic Records Archives Program Management Office, National Archives and Records Administration (NARA). 2003. ERA Deployment Concept (DC), (DRFP, Section J, Attachment 17). NARA, College Park, Md., July 29. Available online at <http://www.archives.gov/electronic_records_archives/acquisition/draft_rfp.html>.
5 National Research Council. 2003. Building an Electronic Records Archive at the National Archives and Records Administration: Recommendations for Initial Development, Robert F. Sproull and Jon Eisenberg, editors. National Academies Press, Washington, D.C. The committee’s final report will be issued in early 2004.
6 DRFP, subsection 3.1.2, p. L-18.
7 Storage robustness issues are discussed in the committee’s first report at pp. 42-44, security and access controls are discussed at pp. 53-55, and record integrity is discussed at pp. 56-57.
8 References in the form (ERA x.y) are to numbered requirements appearing in the RD, pp. 19-48.
9 “ERA must be capable of providing different levels of service,” RD, p. 8. “Levels of service—the ability of ERA to provide different capabilities for different records,” RD, p. B-6.
10 RD, p. C-4.
11 “For permanent records—those preserved forever—and for some temporary records which need to be kept for lengths of time that exceed several generations of information technology, it will be necessary to transform the records from the formats in which they were received to persistent formats,” RD, p. 10. “NARA’s goal is to preserve electronic records in persistent formats that will enable access to authentic electronic records indefinitely into the future,” RD, p. 14.
12 The committee’s first report cautions against relying primarily “on a strategy of converting records to platform- and vendor-independent archiving format to avoid obsolescence” (p. 32) and discusses various approaches to preservation (p. 8 and pp. 32-34).
13 RD, p. 15, last paragraph under heading 2.7.3.
14 DC Section 4.0, “Synchronization with the operational record stores,” p. 8.