4
Designing and Engineering the ERA

AN ENGINEERING APPROACH

The ERA program is a complex undertaking. Its archiving goals are ambitious, in keeping with NARA’s tradition of careful stewardship of the records it preserves. To meet these goals, the ERA will have to be engineered properly. Key elements of an engineering approach include expressing program objectives and constraints, measuring or estimating key parameters, defining realistic requirements, and making pragmatic engineering design decisions. This chapter discusses these issues in the context of the ERA program. The two chapters that follow address design issues related to key system properties, including scalability, reliability, trustworthiness, and longevity.

Attention to engineering principles is especially important because the work done by NARA to date, as evidenced in the documents and briefings provided to the committee, displays insufficient engineering perspective. Efforts to date have focused on articulating high-level preservation requirements, an essential early step for developing the system’s archival requirements. However, in order to develop a procurement plan through which successive iterations of working systems can be built, delivered, and made operational in a reasonable amount of time, NARA will have to address the specific engineering considerations and cost trade-offs that this chapter discusses.

Engineering practice depends strongly on experience with prior designs. In this case, there is no body of prior designs from which to draw direct lessons—no large-scale, long-lived, wide-scope electronic archives exist today. Instead, one must seek experience in the engineering portions of other systems that have properties required of the ERA and gain experience by building capabilities incrementally. For example, copious experience is available for important elements of the ERA system, e.g., a scalable, robust file system. Other qualities, such as very-long-term preservation, can be shaped by experience in the industry, even if there is scant experience with systems designed explicitly to preserve for the very long term. Finally,








there is a large body of more general experience in engineering complex, large-scale systems that can be applied.

DATA AND ESTIMATES TO SUPPORT THE DEFINITION OF INITIAL REQUIREMENTS

Engineering for the ERA program requires a solid understanding of requirements, including the data types1 to be accommodated, the quantity of records to be stored, the kinds of access to be provided, and the performance expected. Although these requirements can be expected to evolve at every stage of the system’s life as a result of changes in the characteristics of electronic records, the system will meet expectations only if its engineering is in step with its requirements. So it is essential, even for the very first system, to state these requirements carefully and explicitly.

1  Throughout this report, “data type” is used to identify the data-encoding rules whereby various kinds of records (documents, electronic mail messages, pictures, database entries, etc.) are expressed as a collection of bits. Thus an image might be represented by bits whose data type is TIFF or GIF or JPEG or any of a number of other such specifications. “File format” is often used interchangeably with “data type,” but “data type” is used throughout this report because the literal interpretation of “file format” is files of bits, which would be too restricted. For example, when an image is embedded in an e-mail message that is itself embedded in a “folder” of many messages saved in a file, the bits representing the image cannot properly be called a “file.”

A key to understanding the initial requirements is information about the population of government records that the system will hold and projections about how those records will be used. Importantly, great precision is not needed. Indeed, in some cases, data may be unavailable or impractical to obtain. It is not necessary to significantly delay the ERA program in order to conduct in-depth surveys. Rough, even order-of-magnitude estimates, if well justified, will suffice in most cases. It is important that estimates supporting initial requirements be made explicit; otherwise, a system design might reflect implicit estimates that are dangerously wrong. The assumptions and reasoning behind the estimates should also be made explicit. This will allow the estimates and consequent decisions to be modified whenever the assumptions change.

In order for the first iterations of the ERA to be designed, questions such as the following need to be answered:

What are the data types that the ERA must support, and what is their frequency of occurrence? In what forms do records currently exist—e.g., which data types, on what storage media, and with what kinds of supporting documentation or online metadata? If there is an inventory of digital records “waiting in the wings” to be archived, what are the properties of these collections? The system design must also anticipate and accommodate new data types and changes in their distribution over time.

How much data must be accommodated at the outset? A great many design decisions (such as the archive media, the implementation technology, and the techniques used to provide reliability) will require estimates of the scale of the archive. The committee heard estimates of 1 PB, but this figure needs to be verified and justified. What is the expected rate at which the archive will need to scale? Is the initial flow into the archive likely to be a small number of very large files, a very large number of small files, or some combination of these?

How will the records be delivered to NARA? While today’s and future records can be delivered to NARA using secure networking techniques, records generated over the past 30 years may reside on media that are rapidly becoming obsolete. How many records are stored on which media? Early versions of the ERA may bear a disproportionately greater burden in dealing with old media. Alternatively, NARA might decide to contract for media conversion services to copy the data to modern media. In any case, an inventory of media types and quantities is required.

What rates of access to the archive will be required? Access rate might be measured by counting retrieved documents per day, retrieved gigabytes per day, searches per day, or all three.

How will NARA provide for searching, retrieval, and access in the digital archive? Will users be able to search and retrieve files and items online? Will NARA be able to support search across and within individual record groups and series? Will there be provisions for full-text search? Will such services be provided by NARA or by third parties? Estimating access requirements in advance will be very difficult, and the estimates will surely change as the archive grows, but an initial estimate is essential to produce the first system implementation.

What financial resources are available, and how much will it cost to acquire and operate the system? No engineering project can be undertaken without some expectation of its costs. Back-of-the-envelope calculations can be done to estimate how much computer and storage equipment will be required, the cost of personnel to ingest records and operate the system, and the cost of building the system software.
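A back-of-the-envelope storage calculation of the kind just described can be sketched in a few lines. Every figure below (1 PB of holdings, three replicas, drive capacity and price) is an assumption invented for the example, not a committee or NARA estimate.

```python
# Illustrative archive-sizing arithmetic; all figures are assumptions
# chosen for the sake of the example, not NARA estimates.

PB = 10**15                    # bytes in a (decimal) petabyte

holdings = 1 * PB              # assumed initial archive size
replicas = 3                   # assumed independent copies for reliability
drive_capacity = 250 * 10**9   # assumed bytes per disk drive
drive_cost = 400               # assumed dollars per drive

raw = holdings * replicas              # total bytes to be stored
drives = -(-raw // drive_capacity)     # ceiling division
media_cost = drives * drive_cost

print(f"{drives:,} drives, ~${media_cost:,} in media")  # 12,000 drives, ~$4,800,000 in media
```

Even this crude arithmetic makes the scale of the procurement concrete and, more importantly, makes the assumptions behind it explicit and revisable.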
Of particular concern is the cost of manual processes required to ingest records, especially manually recording metadata for records. Because the ERA is not a routine procurement, it is difficult today to project the eventual costs of building and operating a full-scale system, but it is nonetheless important to estimate the costs of the early iterations. Experience with each iteration should be used to help inform cost estimates for future versions.

What is the expected technology lifespan? At what rate are components such as storage devices expected to be replaced? What is the unit of replacement?

The committee has seen scant evidence that questions of this sort have been posed carefully or answered with enough rigor thus far to set the stage for procuring ERA systems.

PRAGMATIC ENGINEERING DECISIONS

While it can be tempting to deal in absolutes when designing a system (“every important record will be preserved forever”), engineering practice recognizes that a system is designed to meet objectives subject to constraints. A bridge has a limited load capability, traffic capacity, lifetime, and budget. Engineering for the ERA will similarly require expressing its objectives and constraints. This will require some “engineering” considerations not often found in writing about archival processes:

1. Design for common cases. Although the archive has an obligation to save every record scheduled for preservation, a different quality of service can be applied to different records. Service-level differentiation is a practical necessity. It is not feasible to delay system development until a solution is developed for every conceivable record type (and no universal solution is on the horizon that would apply across record types and thus significantly reduce the incremental cost of handling new record types). The alternative, a strategy whereby agencies would be required to submit documents in specific data types, would put a burden on the agencies that could mean that some records of historical value would not be preserved. If NARA were to insist on a uniform level of service for all record formats, then the level of service in the ERA would probably be reduced to the lowest common denominator. The strategy “accept everything; provide different quality of access” ensures that all records are captured and leaves NARA the opportunity to provide enhanced access in the future as technology improves. There is a trade-off here: Access may, in fact, become more difficult over time as data types become obsolete, but one does not want to wait until more robust access functionality is available before taking steps to capture and preserve records. For example, more compute power, better conversion tools or emulators, or other technological advances might make accessing some records feasible in the future even though it is not feasible today. The quality-of-service approach leaves that option available without jeopardizing the construction or utility of a first system.

Service-level differentiation means that NARA may accept and preserve records in any data type, but certain access services will simply not be available for certain data types. Some records will be accessed more than others; some by the public, some only by scholarly researchers. It may not be cost-effective for NARA to provide online access facilities for obscure data types, but users should be able to retrieve the original bits and use their own resources to manipulate them.
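The “accept everything; provide different quality of access” policy amounts to a simple mapping from data type to offered services, with bit retrieval as the guaranteed floor. The sketch below illustrates the idea; the tier contents and type names are hypothetical, not NARA’s service definitions.

```python
# Sketch of service-level differentiation: every data type is accepted
# and its bits preserved, but richer access services are offered only
# for some types. Type names and tiers are illustrative assumptions.

SERVICES = {
    "word-processing": {"bit_retrieval", "rendering", "full_text_search"},
    "bitmapped_image": {"bit_retrieval", "rendering"},
    "proprietary_gis": {"bit_retrieval"},   # original bits only
}

def services_for(data_type):
    # Unknown or obscure types are still captured: retrieval of the
    # original bit stream is the guaranteed minimum service.
    return SERVICES.get(data_type, {"bit_retrieval"})

assert "full_text_search" in services_for("word-processing")
assert services_for("never-seen-before-type") == {"bit_retrieval"}
```

The design point is that the table can grow richer over time (as tools for a type improve) without any change to what was ingested.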
For example, it is perfectly reasonable that NARA might easily provide full-text search capabilities for some types of record but not for others.2 Another example: NARA might fairly easily provide services for viewing, manipulating, and copying a set of maps stored in common bitmapped image formats, while it might not offer such capabilities for maps embedded in a proprietary geographical information system.

2  One example where full-text search would be harder to support is records available only in bitmapped image format. However, applying optical character recognition can often recover searchable text from scanned documents. And a variety of research is under way to search collections of pictures for certain features, e.g., images of people. As part of its service-level definitions, NARA will decide whether to offer such services. As the ERA evolves, more such services are likely to be offered.

Once the notion of service-level differentiation is accepted, it can become a powerful approach for simplifying system design. One way to establish priorities is frequency of occurrence: Although the total universe of possible record data types is very large, a large fraction of NARA’s preservation needs can be met by devoting significant effort to the most commonly used data types. Assessments by archivists of significance or likelihood of access might be other ways to prioritize record types. NARA should therefore focus on commonly used data types and acknowledge that, by explicit design, documents in commonly used data types will be preserved with a higher quality of access service.3 As a result, an early step in the design process should be a survey or well-justified estimate (based, perhaps, on sampling) of the data types used in digital records found throughout the federal government as well as in records already in NARA custody. From the survey, data types can be prioritized based on frequency of occurrence and other criteria.

3  NARA’s recent development of guidance for transfer of records in three common formats—PDF, TIFF, and e-mail with attachments—is a good example of identifying and placing priority on common data types. See National Archives and Records Administration (NARA), 2002, Transfer of Permanent E-records to NARA, NARA, College Park, Md. Available online at <http://www.archives.gov/records_management/initiatives/transfer_to_nara.html>.

2. Prioritize the functions of the ERA and focus initial design on capabilities that permit rapid deployment of operational pilots. Which capabilities must the ERA have from the outset? Which can be added later? A key requirement is to save bits for a hundred years or more. Achieving this requires a combination of careful technical and operational design based on extensive industry experience with robust storage systems of shorter life (see Chapter 5). An effective bit storage capability is required for any pilot program and provides a critical foundation for future systems with broader capabilities. In addition to storage, certain ingest and access mechanisms are required in early ERA versions. However, as discussed above, it may be acceptable for collections to be correctly ingested but for access to them to be primitive at first, improving only later as new access functions are added to the system. Other, less important functions can be deferred for later implementation, as long as the initial architecture, design concept, and implementation strategy are sufficiently flexible and evolvable and devote sufficient attention to overall robustness, survivability, maintainability, and compatibility with critical long-term requirements. Later iterations of the ERA might include additional migration, emulation, or other preservation functionality as these technologies mature.

It may also be prudent to set priorities that determine the order in which records are added to the growing archive. These priorities might be similar to those used to establish quality of service, e.g., common record types are ingested first.
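The frequency-of-occurrence prioritization described above is easy to make concrete: rank the surveyed data types by count and find the smallest set that covers most of the holdings. The survey numbers below are invented for illustration.

```python
# Rank data types from a hypothetical survey and find the smallest set
# of types covering 90 percent of the sampled records. All counts are
# invented for the example.

survey = {"word-processing": 45000, "e-mail": 30000, "image": 15000,
          "database": 6000, "gis": 3000, "other": 1000}

total = sum(survey.values())                       # 100,000 sampled records
ranked = sorted(survey, key=survey.get, reverse=True)

covered, priority = 0, []
for dtype in ranked:
    priority.append(dtype)
    covered += survey[dtype]
    if covered / total >= 0.90:
        break

# In this invented sample, the three most common types cover 90 percent.
assert priority == ["word-processing", "e-mail", "image"]
```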
Record types that are infrequently used or difficult to process may be deferred. This prioritization of functions can be helpful in other contexts. Suppose that at some later point in time, funding for the ERA were to be curtailed (e.g., due to extreme budget pressures). Which functions would it need to continue to operate, and which could be reduced?

3. Support a division of labor. Must NARA build the entire ERA, or might other government organizations, commercial firms, and individual researchers fulfill parts of the ERA’s mission? The committee recommends that NARA define its essential mission quite narrowly, with an emphasis on saving digital records in their original form together with appropriate metadata and providing access to those records in their original form (the original bits).4 Additional services, such as interpreting obsolete data types; providing high-quality, high-performance Web access to rendered records; full-text searching; and finding aids of various kinds, could be provided by NARA as well as by others. NARA may decide to provide access itself if no one else does, but NARA has much to gain from partnering with others to provide these services. For example, if in the future common access techniques evolve for digital libraries (together with software to implement them), NARA’s ERA can serve as one of the accessible libraries. In cases where there is an identifiable commercial market for the information, third-party commercial organizations might also play a role in providing these additional services.

4  NARA’s activities are for the most part currently limited to a basic preservation task. For paper records, NARA typically provides records in bulk form, and users have the responsibility to sift through the material and select what is appropriate. Similarly, NARA does not provide photocopying services for paper records, but users can pay a third party to do this. In any case, people’s expectations for services for digital records will be shaped by their experiences with other online services. Partnerships with others provide one avenue for meeting some of these expectations while concentrating on the central preservation mission.

4. Consider how to integrate electronic and traditional, nondigital records. What should be the relationship between electronic records and records in other formats? It was unclear to the committee whether NARA proposes a unified approach for access to all records or whether it will develop specific access systems for particular formats of records (e.g., paper, film, photographs, maps, electronic records). This decision will influence the development of finding aids, metadata standards, and search capabilities in the ERA. If multiple format-specific access systems are part of NARA’s plan, then the ERA would, for example, need a capability for managing references between the electronic records and records in other formats and for providing facilities for searching across the systems. To minimize complexity and dependency on other systems, it is probably inadvisable to try to integrate access across electronic records and traditional formats in early iterations of the ERA. However, systems should be designed with an eye toward the future integration of traditional and electronic records. For example, it is desirable to assign unique identifiers to both electronic and traditional records to help unify the totality of the archives for researchers.

5. Consider future interoperation with other repositories. NARA should also bear in mind the need for interoperability between the ERA and digital repositories in other archives and research libraries.

6. Consider what can be automated.
Although all the details of work flow and task scheduling need not be specified at the outset, even the initial design must be cognizant of which tasks can and will be carried out manually and which can be automated. The degree of automation will have significant implications for the costs of operating the system. It will also affect NARA’s internal business processes, the number and types of staff required (archival vs. technical, professional vs. clerical), and NARA’s relationships with other federal agencies. (Detailed discussion of this issue is deferred to the committee’s second report.)

All of these considerations imply setting appropriate expectations for the ERA. An archive accessible to the public, in which every record can be presented through a Web browser (or whatever is the preferred public access technique of the day) within a few seconds of a request, could not be deployed today. However, critical ingest, storage, and access capabilities can and should be developed and deployed in pilot programs, with additional capabilities added over time.

SUPPORTING FUTURE ARCHIVISTS AND RESEARCHERS

Future archivists and researchers will be skilled in computing and will have better tools and methods than are available today, and thus will be better able to manipulate and interpret digital records. Today’s researchers are increasingly savvy about analyzing digital records, such as census files, economic data, electronic mail series, and so on. They work from the digital records, not from paper copies or other visual presentations of records. Researchers’ tools are constantly improving: Techniques for automatically extracting information from text, for summarizing text passages, or for finding complex relationships among several documents have been the subject of research and are becoming commercial products. Also, the digital archivist will be increasingly equipped to examine large quantities of records or collections and to build new catalogs and finding aids that depend on the underlying digital archive. Presentation, conversion, and emulation software of varying capabilities exists as commercial products today.5 Future users will also have available computers that are far more powerful than today’s.

Just as NARA will evolve with respect to skills and technologies to handle digital records, so too will researchers and many other customers of the archive. While the public may want records presented visually on their screens, some members of the research community and digital archivists will desire access to the original bits and associated documentation. They may, for example, wish to verify the accuracy of preferred derived forms or the results obtained through migration or emulation. To support the researcher of the future, the ERA should strive to save the information that will be essential to future reverse engineers: software operating manuals, documentation on data types, and source code when it is available.6 In some cases, it will also be useful to save executable code associated with the record, to be available for future emulation. It is also important to save information about the processes NARA uses to ingest records (including the source code for the software) so that future researchers can determine exactly how records were processed as they were archived.
The ERA should also be alert to new information becoming available—e.g., when a proprietary data type is made public, its specification should be entered into the archive. In short, use the archive to store all technical data about the archive itself.

5  One such effort to capture these tools systematically is the PRONOM system being developed by the Public Record Office, the national archives of England and Wales. This capability alone does not necessarily help with obsolescence—i.e., when the last piece of useful software that supports a file format can no longer be executed, the file format becomes unreadable. This need never happen, because a software emulator of the bare hardware can later simulate the execution of the application on the digital record. This is the emulation approach to obsolescence.

6  Even though some of this information may be copyrighted, NARA should make every effort to preserve it—perhaps based on fair use arguments—because it is essential for the future operation of the archive.

PRAGMATIC STEPS TO FACILITATE FUTURE ACCESS TO RECORDS

Although it may be the dream of every archivist to make records easily and immediately accessible, a more important objective is to preserve all the information necessary to allow today’s bits to be interpreted correctly far in the future. NARA does not have to anticipate or invest in all of the higher-level capabilities that future users might want. As discussed in the previous section, future researchers will have access to tools and expertise that will allow them to manipulate and interpret records. Also, many institutions share an interest with NARA in building tools that support conversion, migration, and emulation, and this technology base will be available to NARA and its users. However, certain fundamental information about a record is required to support future access. A pragmatic strategy to facilitate future access would include the following elements:

1. Save the original bits. As is generally appreciated, it is essential that the record be saved in its original form—the original bits—even if the ERA is unable to decode, render, or execute those bits at the time the record is ingested.7 As the archived data are refreshed onto new media over time, the physical recording of the bits will change; but the ERA will, of course, always be able to deliver the “original bit stream”—the digital data that were ingested. Saving the original bit stream alone is not a guarantee of future access, but it is the foundation on which future preservation measures depend.

2. Save records in “preferred derived forms” in addition to the original bits. In many cases it is advisable to save derived forms of the record as well, in order to facilitate various forms of access. For example, anticipating a need to make visual presentations of a record, it might be advantageous to prepare, at ingest, a PDF version of a word-processing document; an access module then need only know how to render PDF files rather than how to decode each word-processor data type used for the original records. As another example, anticipating a need to perform full-text searches of records, ingest processing might extract and save an ASCII text file. In some cases, a single derived form might serve both purposes—e.g., an XML encoding of a word-processor document, together with a style sheet,8 will simplify presentation as well as searching. Also, much like the Rosetta stone, a derived form is an aid to future researchers seeking to interpret the original bit streams. Although derived forms can be added to the archive long after a record is ingested, it is advisable to create the preferred derived forms as early as possible: The software for preparing such forms may be available at ingest time but may not be readily available years later when a record is accessed. Current record scheduling procedures used by NARA tend to delay ingestion until many years after the records are created. If this gap were shortened, problems of obsolete hardware and software at the time of ingest would be reduced.

Derived forms are, of course, no substitute for retaining and providing access to the original bit stream. A derived form may introduce distortions or errors into the original, or it may omit useful information from the original—e.g., rendering a document may suppress a change history recorded by the word processor in the original data type. Software that creates derived forms may have bugs that introduce errors.

7  The merits of saving original bits for records that were converted during ingestion to the National Archives are illustrated by the case of the NIPS files, which recorded detailed information about Department of Defense (DOD) activities during the Vietnam War. Records in this format, formally known as the National Military Command System 360 Formatted File System, could be read and interpreted only by software developed in the early 1960s for DOD. Later, DOD withdrew support for the NIPS software, which had the practical effect of making the NIPS format obsolete. When the files were transferred to NARS around 1977-1978, they were decoded and reformatted (“de-NIPSed”) to make the records software independent. In the late 1990s it was learned that this decoding, done around 1977-1980, had introduced data anomalies into some files. Fortunately, the original NIPS files had been retained and periodically transferred to new storage media at least twice over two decades. Thus, when the data anomalies were discovered, the original NIPS-encoded files could be accessed to correct the anomalies.

8  Although style sheets are problematic for long-term preservation (for example, they refer to an underlying rendering model that may change with time; see Chapter 3), it is reasonable to use them in conjunction with preferred derived forms, for which shorter lifetimes and less-than-perfect rendering are acceptable.
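Elements 1 and 2 can be sketched together as an ingest step that stores the original bit stream untouched, records a fixity digest so that later media refreshes can be verified against it, and attaches any derived forms separately, never in place of the original. The function and field names below are hypothetical illustrations, not NARA’s design.

```python
import hashlib

def ingest(record_id, original_bits, derived_forms=None):
    """Sketch of an ingest record: the original bits are kept unchanged,
    a SHA-256 digest supports future fixity checks, and derived forms
    (e.g., extracted plain text) live alongside the original.
    All names here are illustrative assumptions."""
    return {
        "id": record_id,
        "original": original_bits,
        "sha256": hashlib.sha256(original_bits).hexdigest(),
        "derived": dict(derived_forms or {}),
    }

rec = ingest("rg-001-item-42", b"original word-processor bytes",
             {"text/plain": "extracted searchable text"})

# After any future media refresh, the delivered bit stream must still
# match the digest recorded at ingest:
assert hashlib.sha256(rec["original"]).hexdigest() == rec["sha256"]
```

The digest travels with the record forever; the derived forms can be regenerated or replaced, but the original bits and their digest cannot.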

3. Be neutral with respect to migration, emulation, or other approaches. There is much debate today in the archival community concerning access to obsolete data types—those for which widely available software and/or hardware is no longer available. One approach is migration: to convert records expressed in expiring data types into modern data types, repeating as those data types in turn expire. An alternative is emulation: Save the original executable software used to manipulate each data type (applications and possibly supporting elements such as the operating system) and emulate the operations of the obsolete computer and thus emulate the old application on a new hardware and software platform.9 The emulation approach would, of course, also apply where the records themselves consist of executable code. Migration and emulation have disadvantages as well as advantages. There are other possible preservation approaches as well.10 None of these approaches has emerged as accepted practice—debate, experiments, and developments will continue.11 NARA’s most prudent strategy is to use archival procedures that will accommodate both emulation and migration. Both depend on saving the data files in their original data type. Both depend on saving additional information to be able to interpret the bits in the future. NARA should be saving information that will support both migration and emulation in the future. 4. Do not rely primarily on a strategy of converting records to platform- and vendor-independent archiving formats to avoid obsolescence. Conversion of each data type to a platform- and vendor-neutral data type at ingest is a form of migration, with all of its limitations. Such data types cannot, therefore, replace the role of the original data type because they cannot encode all of the elements of all data types. XML formats are often proposed for this role. 
It is important to realize that conversion to XML has the same limitations as conversion to other platform- and vendor-independent formats. Sometimes lossless derived types are available (i.e., where there is an inverse transformation between, for example, a native format and XML); even then, one still has to worry about bugs in the transformation software. Box 4.1 discusses XML as a preservation format in more detail. An XML (or other format) derived form may, however, be a very useful adjunct to saving the original data type, as discussed above.

5. Save ephemeral (nonderivable) metadata. Good archival practice is to save as much metadata as possible about each record (and the collection or group in which it resides). As with paper records, the most important metadata to save are those that would otherwise be lost. By contrast, a great deal of metadata can be recovered from the record itself (assuming it can be interpreted), and electronic records lend themselves to automated tools for extracting such derivable metadata.

9   See, e.g., Jeff Rothenberg, 1998, Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation, Council on Library and Information Resources, Washington, D.C. Available online at <http://www.clir.org/pubs/reports/rothenberg/contents.html>.

10   Raymond A. Lorie, 2001, "A Project on Preservation of Digital Data," RLG DigiNews 5(3). Available online at <http://www.rlg.org/preserv/diginews/diginews5-3.html>.

11   This report does not attempt to review the approaches in any detail or evaluate them, though the second report may do so.
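The prescriptions in points 3 through 5 (save the original bits, save the information needed for either migration or emulation, and save nonderivable provenance) can be made concrete as a minimal sketch of what an ingest record might carry. Every name and field below is illustrative, not drawn from any NARA design:

```python
from dataclasses import dataclass

# Hypothetical ingest record; field names are illustrative only,
# not part of any NARA specification.
@dataclass
class IngestRecord:
    original_bytes: bytes     # the record in its original data type, untouched
    data_type: str            # format identifier for the original
    format_spec_refs: list    # format documentation, to support future migration
    software_env_refs: list   # application/OS needed to render, to support emulation
    ephemeral_metadata: dict  # provenance that cannot be derived from the record later

record = IngestRecord(
    original_bytes=b"\xd0\xcf\x11\xe0",  # first bytes of a sample word-processor file
    data_type="application/msword",
    format_spec_refs=["binary .doc format documentation"],
    software_env_refs=["word processor installer", "contemporary OS image"],
    ephemeral_metadata={"source_pc": "staffer-ws-42",
                        "written": "1998-03-14T09:20:00"},
)
```

The point of the sketch is that the original bytes and the interpretive information sit side by side, so the same record can later feed either a migration pipeline or an emulation environment.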

BOX 4.1 The Role of XML in Preservation

XML is increasingly proposed as the representation of choice for the long-term preservation of electronic records. XML is a powerful tool and does have a role to play in digital archiving. In particular, it allows the expression of document content separate from presentation, in a nonproprietary, platform-independent manner. Moreover, XML is establishing itself as a widely used tool. However, it is not a panacea for preservation. Here is a brief outline of why it is not:

XML is an appealing way to embed markup within text, but it is not a self-contained document format. Even relatively straightforward uses of XML require a set of additional conventions. In particular, XML requires additional support to specify document appearance and does not deal with nontextual or dynamic data. Some of these limitations are being addressed by continual development of XML-related conventions (style languages, graphics languages, and so forth) and by embracing other conventions, such as those separately developed for image and multimedia data. Thus representing a document in its entirety requires components using these conventions, and interpreting and rendering an XML document in the future will require recourse to a number of conventions besides XML per se, each of whose proper interpretation would need to be preserved as well. Ensuring the faithful interpretation of these conventions in the future may be a smaller preservation problem than ensuring the faithful interpretation of individual document formats, but it is still a preservation problem.

XML conveys structure, but not meaning. XML provides a means of structuring data and allows for the expression of a vocabulary to describe data components of document classes. It does not in itself serve to record what the data mean.
Some text tagged as <TITLE> may be the title of the document, or it may be the title of a person (“Herr Doktor Professor”), or it may have neither meaning. Other documents might use <TI> or even <T> to tag their titles. Tags do not convey meaning. A particular markup tag may have meaning specific to the application that created the document, meaning that may not be completely captured by XML or other conventions. Even if one retains possession of the schema or Document Type Definition (DTD) used to create the document, one could not reconstruct the document's behavior in its entirety without somehow recording the application-specific semantics of its tags. In such cases, conversion to XML has not eliminated the difficult preservation problem of interpreting the document in the future.

Conversion to XML is potentially lossy. Converting a document to XML will inevitably lose information unless the set of XML-related conventions has provision for every feature required by the original document. As the point above suggests, this is not likely to be the case for application-specific semantics. In addition, conversion to XML is a migration, with all the intrinsic risks of approximations, inaccuracies, and errors. XML cannot, therefore, be considered an alternative to saving the original data objects in their original data types.

The set of XML-related conventions comprises components that are new, relatively untested, and not widely supported. XML itself is 5 years old; many of the related standards are only now coming into being. Even if XML itself remains unchanged forever, these supporting conventions may change considerably or may fail to find widespread support and be replaced by alternative approaches.1 Thus, using XML may not provide an eternal representation whose bits need only be made durable; rather, it may require migration or emulation of its components.
1   Indeed, DTDs are now being replaced by schemas (XML Schema) as a means for specifying the required structure of XML documents.
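The box's observation that XML conveys structure but not meaning can be demonstrated in a few lines. This sketch reuses the <TITLE> examples from the box; the parser sees two identically structured elements and records nothing about which meaning was intended:

```python
import xml.etree.ElementTree as ET

# Two documents use <TITLE> with entirely different meanings.
doc_a = "<memo><TITLE>Quarterly Budget Review</TITLE></memo>"    # document title
doc_b = "<person><TITLE>Herr Doktor Professor</TITLE></person>"  # personal honorific

title_a = ET.fromstring(doc_a).find("TITLE").text
title_b = ET.fromstring(doc_b).find("TITLE").text

# Both parse identically as "a TITLE element containing text";
# nothing in the markup itself records which sense of "title" applies.
```

A future reader of either document would need the application-specific semantics of the tag, which the XML encoding does not carry.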

Consider, by way of example, a memo in the form of a word-processor file obtained from a disk of a White House staffer; the text of the memo identifies the author, the date it was written, and the recipient. These items could be recorded as metadata in anticipation of indexing and searching, for example to find all documents written by a given author. If these metadata are not recorded as the record is ingested, they can be recorded later. By contrast, there are ephemeral metadata about the record, items such as which staffer's PC contained the disk, the date and time the file was written onto the disk, and the version of the operating system used, that are not evident from the record itself and that will be lost unless explicitly recorded during ingest.

Databases offer another example. One of the challenges of preserving older databases is that schema documentation is often absent. In particular, many critical integrity assumptions are not made explicit, nor can they be deduced by inspecting the data, though in many cases it may be possible to deduce them by analyzing a corpus of queries. Metadata that should be saved include formal and informal schema documentation, query libraries, and so forth.

6. Save essential external references that are implicit or explicit in the record. As is well known to archivists, a digital resource will often refer to other resources. In some cases, this is because those resources represent other components of the same “compound document.” For example, the image components of some document types are stored as physically separate resources. In general, digital records may comprise multiple explicit components, so to preserve such a record, one must be vigilant about archiving all of its components. Implicit references to resources such as default style sheets or fonts must also be considered.
It may be valuable to have a tool or process to ensure that records are in fact saved in their entirety.12 In principle, this is a straightforward goal, but no clear solutions exist for managing external references; it is an active area of research in the digital library community. Digital records present some other challenging problems, including these: (1) the cross-references are buried inside the representation rather than being explicitly visible, as are the citations in a paper report, and (2) digital cross-references often use naming schemes (for example, file numbers, local file names, or URLs) that are unstable, not standardized, and may not survive very long. Indeed, they may have stopped working by the time the document is ingested.

The full set of external references required to support the emulation approach includes the files required to install the application, operating system, and other supporting facilities on bare hardware. These are all implicit external references that would need to be preserved in a software repository, perhaps as part of the ERA or perhaps shared with other digital archives. The Vesta research project (Allan Heydon et al., 2002, The Vesta Software Configuration Management System, SRC Research Report 177, Compaq Systems Research Center, Palo Alto, Calif. Available online at <http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-177.html>) developed this sort of capability for all of the code and other resources required to build a large software system.

12   One way to do this is to simulate, at ingest time, access and presentation of the record, making sure that all resources used by the access and presentation processes are available in the archive.
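An ingest-time completeness check of the kind footnote 12 describes could be approximated as follows. This is a sketch only: the regular expression stands in for proper per-format parsing of each data type, and the archive's holdings are modeled as a simple set of resource names:

```python
import re

def external_references(record_text):
    # Crude extraction of cross-references buried in the representation;
    # a real tool would render the record and observe every resource it touches.
    return set(re.findall(r'href="([^"]+)"', record_text))

def missing_components(record_text, archive_holdings):
    # Referenced resources not present in the archive are reported as gaps.
    return external_references(record_text) - archive_holdings

record = '<p>See <a href="style.css">style</a> and <a href="figure1.png">figure</a>.</p>'
gaps = missing_components(record, archive_holdings={"style.css"})
# gaps names the resource that is referenced but was never archived
```

Running such a check at ingest, while the referenced resources may still be retrievable, addresses the observation above that cross-references often stop working over time.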