7 Strategy for Evolution and Acquisition

Because the ERA must have a long life (many decades), it must evolve over time. Although it is possible to express some of today’s requirements for the ERA, others will become clear only as operational experience is gained, and the requirements will themselves evolve. The types of records to be preserved, the interests and capabilities of users, and other aspects of the context will change. So too will the technology available to NARA; computer capabilities and cost-performance evolve very fast, so what is difficult or expensive today can be much easier in 10 years. The sections that follow describe how, even though all the requirements cannot be anticipated, the system can be built using techniques that make evolution easy rather than hard.

STRATEGY FOR EVOLUTION

Modular Design

One of the most important techniques for developing a large system is to modularize: making a complex problem more tractable by breaking it down into a set of smaller components and enabling independent evolution of the pieces over time. NARA and the SDSC have embraced modular design in general and the OAIS overall framework for an archive.[1] The SDSC demonstrations successfully exploit the clear separation of the major modules of ingest, storage, and access.

[1] Consultative Committee for Space Data Systems (CCSDS). 2002. Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1 (Blue Book). CCSDS Secretariat, National Aeronautics and Space Administration, Washington, D.C. January. Available online at <http://wwwclassic.ccsds.org/documents/pdf/CCSDS-650.0-B-1.pdf>.

A modular approach presumes that each modular component will be changed several times as requirements change or more attractive technologies emerge. Properly designed components can be replaced with improved versions with minimal disruption to other modules. Modularization thus enables considerable flexibility in evolving a system. For example:

• Several prototypes can be built to explore alternative methods of accomplishing a particularly crucial goal. Prototypes of individual modules can be discarded without jeopardizing investments made in the rest of the system.
• A sequence of smaller, focused development projects can be used to add capability to the system.
• The design and implementation of different system components can be divided among multiple vendors or research centers with specialized abilities. For example, a modular design should permit developing a new access technique for a particular kind of record without altering other parts of the system.
• Commercial off-the-shelf (COTS) or other third-party packages (for, say, full-text indexing and search) can often be integrated individually into the system.

The hallmarks of modular design are the decomposition of the system into separate modules and the specification of the interfaces that surround each module, i.e., the details of the connections a module makes to other modules in the system. While it would be fairly easy to draw a high-level block diagram showing a modular system structure (such as was done in the OAIS framework[2]), more detailed evaluation of the required modules and interfaces is critical to obtaining the rewards of modular design. If modules are too big, they become hard to change or replace. If the interfaces become too complex or allow internal details of a module’s implementation to become known to other modules, the set of modules becomes “brittle,” and the ability to change one module independently of others may be lost. The details of modular structure and interface designs are critical.
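The point about interfaces hiding a module's internal details can be illustrated with a minimal Python sketch. Everything named here (the `Storage` interface, the two storage classes, and the `ingest` and `access` functions) is a hypothetical illustration and not part of any ERA or SDSC design. The sketch assumes a storage module that exposes only `write` and `read`, so that one implementation can be swapped for another without touching the ingest or access code.

```python
from typing import Dict, Protocol


class Storage(Protocol):
    """Hypothetical storage-module interface: callers see only these two calls."""

    def write(self, name: str, data: bytes) -> None: ...

    def read(self, name: str) -> bytes: ...


class InMemoryStorage:
    """First implementation: keeps objects in a private dictionary."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def write(self, name: str, data: bytes) -> None:
        self._objects[name] = data

    def read(self, name: str) -> bytes:
        return self._objects[name]


class PrefixedStorage:
    """Replacement implementation with a different internal layout.

    Every key is namespaced internally; callers never see the prefix.
    """

    def __init__(self, prefix: str) -> None:
        self._prefix = prefix
        self._objects: Dict[str, bytes] = {}

    def write(self, name: str, data: bytes) -> None:
        self._objects[self._prefix + name] = data

    def read(self, name: str) -> bytes:
        return self._objects[self._prefix + name]


def ingest(storage: Storage, name: str, text: str) -> None:
    """Ingest module: depends on the Storage interface, not on an implementation."""
    storage.write(name, text.encode("utf-8"))


def access(storage: Storage, name: str) -> str:
    """Access module: reads back through the same interface."""
    return storage.read(name).decode("utf-8")
```

Because `ingest` and `access` never touch the private dictionaries, either storage class can stand in for the other. If they reached into `_objects` directly, replacing one implementation would break them, which is the brittleness the text warns about.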
Possible Modules and Interfaces for the ERA

This section briefly discusses some considerations for modular design of an archive system and outlines some of the possible modules and interfaces that would need to be specified for the ERA. In describing these, the committee intends only to provide some concrete illustrations of issues to be faced in the system design, not to do detailed design work.

To exploit modular structure for incremental evolution, it is necessary to define the software interfaces by which each module connects to other modules. These interfaces require defining subroutine calls and data types transferred between the modules. The data model used to store records in the file system serves in effect as an interface between ingest and access modules: Even though an ingest module does not directly invoke an access module, the data created by any ingest module must be precisely understood by every corresponding access module.

Modular structure is also used for another purpose: to develop sets of modules that perform related functions. For example, the ERA might have a set of access modules and choose one to invoke based on the collection the user has requested. Similarly, a data-type converter is selected and invoked based on the types of its input and output.

The following are among the likely key interfaces in the ERA system:

• The file system interface. This interface provides functions to manipulate files in the repository: naming, reading, writing, access control, etc.[3]
• The ingest module interface. The principal role of an ingest module is to prepare and write data into the file system. However, a detailed system design will uncover other interfaces, for example, for tracking records as part of work flow management.
• The access module interface. Access modules will be used to build indexes of collections and to query previously built indexes. Access modules will need access to any files that are copied from the main file system into a cache.
• The data model and associated interface. These are for manipulating stored data according to the data model.

One objective of defining these interfaces is to allow multiple ingest and access modules to be written without disrupting other parts of the system.

[2] ISO Consultative Committee for Space Data Systems (CCSDS). 2002. Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1 (Blue Book). CCSDS Secretariat, National Aeronautics and Space Administration, Washington, D.C. January. Available online at <http://wwwclassic.ccsds.org/documents/pdf/CCSDS-650.0-B-1.pdf>.

[3] Strictly speaking, the use of the term “file” assumes a particular architecture, specifically that the archive lives on a file system and not on, say, a database. It also assumes that the file system does not have, for example, multiple resource forks per file (i.e., on some systems, you can have one logical file, but within it, you might have separate ways of accessing it that give you the data, the metadata, an alternative view, etc.). The term file is used in this report to stand for some sort of stored object or item, which may not necessarily actually be a file stored on a conventional file system.

There are also important smaller modules and their corresponding interfaces. For example, the following are likely to be required:

• Data-type converters, which transform data from one type into another. Converters may be used within ingest modules to generate derived forms, or within access modules to generate the data type requested by the user or necessary for preparing a visual presentation.
• Data-type checkers, which are simply a variant of data-type converters. They are used by ingest modules to verify the integrity of native and derived forms.
• Metadata extractors, which derive metadata from records. They may be used within ingest modules to derive metadata for records being ingested, or they may be used by indexers to extract metadata that are useful for searching a set of records.
• Indexers used by access modules to (1) prepare an index to a set of records and (2) respond to search queries.

There are certainly many others.
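The idea that a data-type converter is selected by the types of its input and output, and that a checker is a variant of a converter, can be sketched in Python as a small registry. The type labels, function names, and registry below are hypothetical and chosen only for illustration; a real design would define its own type vocabulary.

```python
from typing import Callable, Dict, Tuple

Converter = Callable[[bytes], bytes]

# Registry keyed by (input type, output type); the type labels are illustrative.
_CONVERTERS: Dict[Tuple[str, str], Converter] = {}


def register(src: str, dst: str) -> Callable[[Converter], Converter]:
    """Decorator that files a converter under its (input, output) type pair."""
    def decorator(func: Converter) -> Converter:
        _CONVERTERS[(src, dst)] = func
        return func
    return decorator


@register("text/latin-1", "text/utf-8")
def latin1_to_utf8(data: bytes) -> bytes:
    """Converter: re-encode Latin-1 text as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


@register("text/utf-8", "text/utf-8")
def check_utf8(data: bytes) -> bytes:
    """Checker as a converter variant: same type in and out, raises on bad input."""
    data.decode("utf-8")  # raises UnicodeDecodeError if the bytes are not valid UTF-8
    return data


def convert(data: bytes, src: str, dst: str) -> bytes:
    """Select a converter by its input and output types and invoke it."""
    try:
        converter = _CONVERTERS[(src, dst)]
    except KeyError:
        raise LookupError(f"no converter registered for {src} -> {dst}") from None
    return converter(data)
```

Under these assumptions, an ingest module could call `convert(data, "text/utf-8", "text/utf-8")` to verify a native form and `convert(data, "text/latin-1", "text/utf-8")` to produce a derived form, and a new converter could be added without changing either caller.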
Architecture

An architecture specification is an essential part of the long-term plan for evolving a system. It is the architecture of the system that ensures that the pieces fit together, that the system can be incrementally upgraded, and that it can evolve over time. The architecture needs to specify the interfaces between major parts using an open approach that allows multiple vendors to supply components over a long lifetime. It needs to be nonmonolithic, employing the modular structure discussed above. An example of a digital preservation architecture sketch in this spirit is that of the Library of Congress’ National Digital Information Infrastructure and Preservation Program (NDIIPP).[4]

A common architecture is especially important for a complex program such as the ERA, which must unify multiple systems. By building early iterations of these systems to be compliant with a common architectural framework, these systems can coalesce into a smaller number of more comprehensive systems as experience and confidence grow. Because changing the modular structure is far harder than evolving separate modules, devising a good initial architecture for the earliest deployed systems is important. But no matter how good the initial design, it will be subject to evolution over time as requirements are better understood or new requirements emerge.

The architecture of a system like the ERA should be “owned” (that is, specified and evolved over time) by its user. By building the internal expertise to perform this process itself, NARA will better understand the implications of alternative proposals, maintain better control over the development process, and be better able to use the resulting system and understand its limitations and strengths as it is delivered. By owning the architecture, NARA also reduces its dependence on the vendors selected to build implementations of the ERA and helps to avoid proprietary lock-in.

Preparing the architectural design of the ERA requires first-class talent having both archival and IT expertise. So too does managing the inevitable evolution over time to meet new requirements. Thus in both the short and long terms, NARA staff will need to understand deeply the IT aspects of the ERA systems and of digital preservation more generally.

Alternatively, NARA could contract for one or more architectural designs. This is a poor alternative to a NARA-owned architecture. It also does not relieve NARA of the need for first-class technical talent, for NARA would still have to specify the scope of a design contract, evaluate resulting designs, and proceed with acquiring and evolving a system. In this alternative, it might be worthwhile to obtain several proposals, because different contractors may have different technology opportunities, different ideas for incorporating COTS elements, and so forth.

NARA will need to carefully evaluate the one or more architectures it commissions. It may wish to seek the help of outside experts or others with similar needs (e.g., Library of Congress, National Library of Medicine, and other operators of digital libraries) to help in the evaluation. NARA might also choose to contract for help in evaluating multiple architectural proposals.

[4] National Digital Information Infrastructure and Preservation Program, Library of Congress. 2003. NDIIPP Plan Appendix. Library of Congress, Washington, D.C. Available online at <http://www.digitalpreservation.gov/ndiipp/repor/ndiipp_appendix.pdf>.
Some Other Strategies for Long System Life

Special considerations need to be given to ensuring that the ERA can continue to operate over many decades. Even if the system’s requirements were to remain invariant over such a long time, the COTS hardware and software on which the system runs would need to be replaced many times. The design of the system can simplify this process. In addition to modularity, discussed above, the following design ideas will facilitate long-term operation:

• Use networks based on Internet Protocol (IP) to interconnect hardware components. The structure of hardware components should allow flexible interconnection of hardware from various vendors using standard networking hardware and protocols (e.g., IP).[5] Replacing old hardware with new, or adding hardware capacity, is a matter of attaching the new hardware to the network and perhaps changing the network configuration. Although the physical layers of networks will evolve over time to increase performance (from Ethernet to gigabit Ethernet to 10-gigabit Ethernet and beyond), routers will connect the different physical layers into a single network. The Internet protocols may change, but very slowly, and with evolutionary support from hardware and system software vendors. IP-based networks are the best bet for long-term hardware interconnection.

• Design software to simplify porting to new hardware and software systems. Over the life of the ERA, software that implements the ERA will need to be deployed on new computers, perhaps computers that are not completely compatible, either in hardware or software, with the older computers. Writing portable software is a fairly common practice; it involves choosing appropriate programming languages and isolating hardware and software system dependencies in small “compatibility layers” of software.

• Avoid proprietary lock-in and choose truly common COTS products. NARA will need to retain sufficient intellectual property rights for the software it procures for the ERA system to be sure that the ERA software can be modified and ported to new hardware and software platforms. Likewise, interfaces and data types that play essential roles in the modular structure (and therefore, evolution) of the system should be free of proprietary features that might trap NARA into hiring only a particular contractor that holds the necessary intellectual property rights.

  In designing the architecture of the ERA, NARA wisely wishes to make use of COTS hardware and software products. Because of the long-lifetime objective of the ERA architecture, however, COTS components will have to be chosen with care to be sure that they, or equivalent replacements, are available for a long time. COTS hardware, including storage media, is essential. Trade-offs may be required between (1) using a few large off-the-shelf software components, which will facilitate system integration but may introduce critical dependencies affecting the system’s future viability should the components cease to be sold or supported, and (2) using a larger number of smaller components that could be replaced with custom versions if necessary. One way to achieve some flexibility is to partially insulate the ERA architecture from the details of a particular COTS offering by specifying a simple, generic interface, which is then attached to a specific COTS product with a shim (driver); this idea is commonly used in operating systems and is used in the Storage Resource Broker in the SDSC work.

  COTS modules must be chosen carefully to be valuable to a long-lived system. COTS offerings that have a large market (truly common products) and slowly changing specifications (perhaps today’s operating systems and network-attached file systems are examples) can probably be replaced with compatible or similar products for many years. But low-volume COTS modules involve risk: If, say, NARA were to buy a single vendor’s document management system to implement a major portion of the ERA, there might be no other COTS source for upgrades or replacements. Generally, COTS components are easier to build into systems with short lifetimes than long-lived systems such as the ERA.[6]

These strategies all depend on judgments about the expected life of hardware and software components. In extreme cases, modules may need to be redesigned or rewritten to replace components no longer available. But without modular structure and nonproprietary interfaces it will be far harder to keep the system running smoothly over a long period.

[5] Note that the fact that the components of the system use Internet protocols to interconnect internally is separate from the question of whether the NARA ERA is connected directly to the Internet.

ITERATIVE (SPIRAL) DEVELOPMENT

A series of implementations is built, fashioned to acquire experience with an initial system and then incorporate improvements; this technique is called “iterative design.” When precise requirements are not known in advance, the detailed requirements may be elicited by undertaking a series of iterative designs in which development follows a cycle of specify, design, implement, and operate. An explicit iterative design goal is to keep the cycles short: to complete a single cycle in months, not years. At each iteration of the cycle, the architecture is refined.
Military procurements call this technique “spiral development,” in reference to how requirements are refined and expanded as a result of experience with earlier, simpler systems.[7]

Iterative design allows users to operate a partially working system, or partially working system components, early in the development process. The approach provides rapid feedback about what works, what doesn’t, what needs to be refined or rejected, and what is missing. The initial systems usually have modest expectations; they gather experience to inform the next iteration. The initial systems are put in operation and subjected to a range of uses and tests in order to obtain as much experience as possible to influence subsequent iterations. This form of deliberate, rapid evolution is used to refine the initial set of requirements. Eventually, the requirements will become well understood.

In contrast to a more conventional procurement, the ERA program will also have to manage the evolution of the requirements themselves by virtue of the very long life of the ERA program. Requirements will inevitably change as electronic records evolve, new techniques for using records emerge, new preservation techniques become available, and so forth. Planning for the ERA program should, therefore, anticipate a sustained iterative development process.

Experience shows that an iterative design will be required to arrive at the best modular design for the ERA. While the very-high-level structure, with separate ingest, storage, and access components, is likely to change little, the detailed interface designs will evolve. With the present level of experience in the archiving community, it is possible to craft a good initial design but to not be assured it is correct. The modularity and interfaces themselves will have to be improved by iteration as requirements are better understood and change.

[6] There are, of course, many other issues associated with the use of COTS and other trade-offs between using custom and off-the-shelf software. These are not discussed in this report.

[7] See, for example, Barry Boehm (edited by Wilfred J. Hansen), 2000, “Spiral Development: Experience, Principles, and Refinements,” Special Report CMU/SEI-2000-SR-008, Spiral Development Workshop (February 9, 2000), Carnegie Mellon Software Engineering Institute, Pittsburgh, Pa. Available online at <http://www.sei.cmu.edu/cbs/spiral2000/february2000/BoehmSR.html>.

PILOTS: STARTING SMALL AND GAINING EXPERIENCE

An important early stage of the iterative development process is the building of “pilots”: systems that are sufficiently small and simple that they can be rapidly deployed but capable enough for production use. A pilot is relatively small compared with the ultimate system, meaning that its cost is also relatively small and that any failures that occur in the earliest stages have only modest consequences for the program as a whole. The crucial difference between pilots and “prototypes” is that pilots, unlike prototypes, are designed with enough functionality and sufficient scale to be operated in a production environment, allowing real-world experience to be gained that informs the requirements of later, more capable iterations of the system. This sort of staged approach is being used for the electronic deposit system at the Netherlands national library (Box 7.1).

For a program as complex as the ERA, it is very useful to “sample” the problem space by launching several pilots concurrently.
Each pilot provides experience with different aspects of the full problem; each pilot also offers opportunities to try different approaches to particular elements of the problem. Although the pilots will differ from one another in some respects, all the pilot systems should be constructed within a common architectural framework. By working within a common framework, the pilot systems can, in subsequent design iterations, eventually coalesce into a smaller number of more comprehensive systems as experience and confidence grow.

It is especially important that the data model (the data types and related metadata) used in each pilot conform to a common architecture so that the digital data obtained by ingesting records into one of the early systems will carry forward into future evolutions.[8] At some system iteration it may be necessary to rebuild the archive according to a new data model (especially if the data model is changed significantly), reading all the records archived using the old model, converting to the new model, and writing a new archive. This scenario (the wholesale reformatting of the archive) should be possible with early designs.

[8] No matter how well the data model is defined at the outset, it is still likely to change as the system design is iteratively refined. To minimize the disruptions caused by such changes, version numbers should be made explicit in the data model. In this way, the ERA software can respond to a range of versions correctly, rather than requiring the entire archive to be converted to a new data model when versions change.
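The recommendation to make version numbers explicit in the data model can be sketched as follows. This is a minimal Python illustration under stated assumptions: the record layout, the field names, and the version 1 to version 2 change (renaming `name` to `title` and adding `data_type`) are all invented for the example and do not describe any actual ERA data model.

```python
from typing import Callable, Dict

CURRENT_VERSION = 2  # hypothetical current data-model version


def _upgrade_v1_to_v2(record: dict) -> dict:
    """Hypothetical change: version 2 renames 'name' to 'title' and adds 'data_type'."""
    return {
        "version": 2,
        "title": record["name"],
        "data_type": record.get("data_type", "unknown"),
        "payload": record["payload"],
    }


# One upgrade step per older version; each step raises the version by one.
_UPGRADES: Dict[int, Callable[[dict], dict]] = {1: _upgrade_v1_to_v2}


def load(record: dict) -> dict:
    """Return the record in the current model, upgrading older versions on read."""
    version = record["version"]
    if version > CURRENT_VERSION:
        raise ValueError(f"record version {version} is newer than this software")
    while version < CURRENT_VERSION:
        if version not in _UPGRADES:
            raise ValueError(f"no upgrade path from version {version}")
        record = _UPGRADES[version](record)
        version = record["version"]
    return record
```

With this arrangement, records written under the older model stay on disk unchanged and are interpreted correctly when read. The same `load` function could also drive a wholesale rewrite of the archive if a later model change made that necessary.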
BOX 7.1
Staged Implementation of the Electronic Deposit System at the Netherlands National Library

A recent example of a staged implementation is the Electronic Deposit system at the Netherlands national library (the Koninklijke Bibliotheek, or KB, in The Hague), the first operational system of this kind in the world. Its goal is to archive for the long term all documents electronically published in the Netherlands. In 2000, the KB issued an RFP and then signed a development contract with IBM in the fall. A previous study had already recommended that the program (1) use the OAIS reference model and (2) focus entirely on document storage and retrieval functions, while leaving to the library system (already in existence) the responsibility for cataloguing, indexing, and search. The system was built following these recommendations. Using a tape robot, optical library, and RAID disk system, it relies as much as possible on off-the-shelf products (e.g., IBM Content Manager; Tivoli Storage Manager, a backup and archive system; and DB2, a database management system). The design objectives call for 20 TB in 2005, with a long-term objective of 500 TB. The system was delivered in fall 2002. In parallel, several joint KB-IBM studies were conducted,[1] most of them on actual preservation issues. Their results will fuel the next development stage, in which some preservation functionality will be added to the system.

[1] IBM-Koninklijke Bibliotheek (KB). 2002. IBM-Koninklijke Bibliotheek Long-Term Preservation Study. Koninklijke Bibliotheek, The Hague.

Experience with the pilot systems can be expected to lead to changes to the architecture and to substantial refinement of requirements for subsequent, more comprehensive systems. Successfully building on the pilot experience will require skill in developing an initial architecture, managing the first system developments, learning from early operations, making revisions to architecture and specifications, and evolving the overall program.
ERA pilots should exploit available collections of records that could be organized and made available quickly, and they should sample different dimensions of the overall archiving challenge. NARA will need to choose a limited set of objectives for the pilot systems. Examples of some of the limitations on archive content might be these:

• Deal only with records already held by NARA by reingesting them into a new system.
• Select a single collection.
• Select a single agency with a limited set of scheduled digital records.
• Select a diverse collection but provide the best access only to the six most common data types.
• Select a small collection that has challenging ingest problems because of old media, old software, unknown data types, etc.
Here are some specific examples of limited-scope systems that might be considered for early ERA pilots:

• U.S. State Department cables. NARA is preparing to acquire digital forms of State Department diplomatic cables, which are simple text files. One pilot might focus on preserving these cables, extracting appropriate metadata automatically from the cables, perhaps providing full-text search, or other access appropriate to the collection. For quickest deployment, NARA might consider making these records available using software already developed for operating a digital library.

• Records at the National Personnel Records Center. Military personnel records, traditionally stored on paper or microfilm, have more recently been managed by the Department of Defense as TIFF image files. There is interest in preserving these records in electronic form when they are transferred to NARA’s National Personnel Records Center. Containing millions of service and medical records of discharged and deceased veterans, these collections are large but homogeneous. Use and access considerations would be quite different than for the State Department cables because of confidentiality protections and the imperative to provide ready access to veterans or next of kin. Access controls would be required.

• E-mail from the Clinton administration held by the Clinton Presidential Center. This collection would lead to experience with a broader and more modern range of data types, because it contains e-mail attachments of all sorts. Metadata could be extracted from the e-mail headers, full-text search could be provided, and so forth. This pilot would provide useful information on the range of data types attached to e-mail and how best to preserve and access these records.
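As a rough illustration of the e-mail pilot's two technical ideas (extracting metadata from headers and surveying the data types of attachments), the following Python sketch uses the standard library `email` package. The function names and the choice of header fields are assumptions made for the example; it also assumes messages are available as raw RFC 5322 bytes, which may not match how any actual collection is stored.

```python
from collections import Counter
from email import message_from_bytes
from email.policy import default
from typing import Iterable, Optional


def _text(value) -> Optional[str]:
    """Convert a header object to a plain string, keeping None for missing headers."""
    return None if value is None else str(value)


def extract_metadata(raw: bytes) -> dict:
    """Derive simple metadata from one message's headers and attachments."""
    msg = message_from_bytes(raw, policy=default)
    return {
        "from": _text(msg["From"]),
        "to": _text(msg["To"]),
        "date": _text(msg["Date"]),
        "subject": _text(msg["Subject"]),
        # Declared MIME type of each attachment, e.g. "application/pdf".
        "attachment_types": [p.get_content_type() for p in msg.iter_attachments()],
    }


def tally_attachment_types(raw_messages: Iterable[bytes]) -> Counter:
    """Count attachment MIME types across a set of messages."""
    counts: Counter = Counter()
    for raw in raw_messages:
        counts.update(extract_metadata(raw)["attachment_types"])
    return counts
```

A tally of this kind reports only the MIME types that messages declare, which can be wrong or generic; identifying what an attachment really contains would need a separate data-type checker.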