Building an Electronic Records Archive at the National Archives and Records Administration: Recommendations for a Long-Term Strategy
1 Ongoing Technology Change and Rising User Expectations
AN AVALANCHE OF DIGITAL INFORMATION
Several important trends are driving the increase in the volume of digital records: more information is being produced and stored, a greater fraction of information is born digital, and a greater fraction of the born-digital material is retained only in digital form. A study conducted at the University of California, Berkeley, estimated that the total amount of new information stored grew by about 30 percent a year between 1999 and 2002.[1] Much of this information is being produced and stored in digital form. Although it is difficult to isolate the amount of federal government information in these general data, there is no reason to assume that the federal government is not experiencing these broader trends.
A growing percentage of information is born digital, stored in file servers, database systems, correspondence-management systems, and other electronic information systems, and not systematically retained on paper. The University of California, Berkeley, study found that the amount of information printed on paper is still increasing but that printed documents represent less than 1 percent of the total information being produced today and that the vast majority of information is being stored on magnetic media. This result seems reasonable, given anecdotal evidence that the vast majority of documents, for example, are created using office automation software.
No paper copy may ever be produced of many of the various born-digital records that are stored in file servers, database systems, correspondence-management systems, and other electronic information systems. As electronic filing systems become more commonplace, it will become less likely than in the past for the paper copies that are printed to be retained systematically.
1 Peter Lyman and Hal R. Varian. 2003. How Much Information, 2003. School of Information Management and Systems, University of California, Berkeley. Available online at <http://www.sims.berkeley.edu/how-much-info2003>. Accessed January 7, 2004.
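As a rough illustration of what the cited growth rate implies, the sketch below compounds 30 percent a year over the three years from 1999 to 2002. It assumes a constant annual rate, which the study does not necessarily claim, and the function names are illustrative only.

```python
import math

ANNUAL_GROWTH = 0.30  # "about 30 percent a year", as cited from the Berkeley study
YEARS = 3             # 1999 to 2002


def compound_growth(rate: float, years: float) -> float:
    """Return the multiplier after compounding `rate` for `years` years."""
    return (1.0 + rate) ** years


def doubling_time(rate: float) -> float:
    """Return the years needed to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + rate)


growth_1999_2002 = compound_growth(ANNUAL_GROWTH, YEARS)
years_to_double = doubling_time(ANNUAL_GROWTH)

print(f"Growth over {YEARS} years: x{growth_1999_2002:.2f}")
print(f"Doubling time: {years_to_double:.1f} years")
```

Under that assumption the yearly volume of new stored information slightly more than doubles over the period, with a doubling time of a little over two and a half years.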
This trend away from paper documents is illustrated by the experience of the Government Printing Office, which reports that the number of documents that it prints has fallen by roughly two-thirds over the past decade. People are increasingly accessing documents via the Web, and many publications are no longer produced in paper form at all.
Five major technology trends support a prediction that the National Archives and Records Administration (NARA) will soon face an avalanche of digital materials:
Computer technology makes it easier and less expensive to record more information than can be recorded in the paper world. The increasing use of automated systems will make it possible to capture a larger number of records associated with government functions. Automated e-government services and systems, for example, will enable and encourage more transactions per citizen and per government employee.
Information and transactions are finer-grained. Information is becoming “finer-grained.” For example, it will be possible to record—and preserve—all of the individual changes made to a case file as well as its final contents. Content/document-management systems allow an audit trail for every individual who “touches” (i.e., creates, displays, copies, or modifies) a record. This audit trail metadata could be quite voluminous. Also, new kinds of records will be created as data from government-operated sensors are automatically logged. Many record-keeping systems are transaction-based, with databases whose contents are continually changing. In many cases, it may be impractical to log and preserve all of these transactions. The alternative is that systems are designed so that snapshots are captured and preserved on a regular basis. This way of preserving data will require changes in system design and careful consideration about what should be captured, what format it should be captured in, what the refresh rate should be, and so forth.
Information is often dynamically generated. Information used by the public and by decision makers is increasingly created on-the-fly. For example, many Web pages today are dynamically created and formatted for presentation from underlying databases. The problem of archiving databases that change over time is not a new one, but archivists will face more complex preservation decisions as more and more information is dynamically generated. Even when the underlying database itself is amenable to preservation, displaying information from that database for users poses enormous challenges. For example, the information displayed might be the output of analysis software that has gone through multiple revisions—what was actually viewed on any particular day is a function of the version of the software running on that day. Capturing such changes and transformations is both challenging and crucial.
Technologies will allow people to communicate and record information in new ways. In the 1990s, some e-mail messages began to be determined to be permanent records. Likewise, technologies today are creating records that may be selected in the future for permanent preservation; examples include video conferencing, instant messaging, voice over Internet Protocol, and video presentations.
The drop in storage costs drives the retention of information that previously would have been discarded. The cost per unit of storage currently falls by roughly a factor of two per year.[2] As a result, enormous quantities of information are being stored today. The University of California, Berkeley, study estimates, for example, that just over 400,000 terabytes of original information are stored each year on hard disks.[3] Indeed, storage costs have dropped so much in the past decade that many individuals and organizations have stopped “garbage collecting” (i.e., deleting old files) within their stored information. As this trend continues, businesses will increasingly destroy electronic records on the basis of the business risks of retaining records beyond regulatory mandates rather than because of storage costs. With digital storage growing ever cheaper, it becomes more economical to keep everything than it is to put in the effort needed to decide what can be thrown away. Tools capable of searching ever-larger volumes of information will help fuel demand for more things to be considered permanent records.
2 For a detailed examination of trends in disk drive storage, see, for example, H. Grochowski and R.D. Halem, 2003, “Technological Impact of Magnetic Hard Disk Drives on Storage Systems,” IBM Systems Journal 42(2).
The volume of data to be stored will also grow because individual records will continue to become larger. It is becoming easier and more commonplace to generate and digitally store multimedia and other rich content. A 30-minute video file is much larger than a 30-minute audio file. A PowerPoint presentation with embedded animation is much larger than a plaintext file.
Finally, the total number and variety of record data types that must be stored by the National Archives and Records Administration and other repositories will increase over time as new versions of existing data types and entirely new data types are invented.[4] The overall growth in the variety of data types will be accompanied by spurts of growth and contraction in the number of data types in active use at any point in time: as a new class of record types emerges, one will likely see an initial explosion of data types.
That proliferation will occur as various vendors introduce products, and it will be followed by a consolidation to a smaller number of de facto standards as the market picks winners and losers. Over time, barriers to market entry will tend to increase as an industry segment matures; those barriers will lead to a smaller number of vendors and higher expectations of greater interchangeability. There were, for example, many early word processors, each with its own file format, but today only a small number of data types are commonly used. Similarly, early Web sites depended on numerous ad hoc scripts, but today’s Web sites are commonly implemented using more standard approaches. Today, the broad range of Extensible Markup Language (XML)-related formats is relatively immature and can be expected to follow a similar pattern.
3 Peter Lyman and Hal R. Varian. 2003. How Much Information, 2003. School of Information Management and Systems, University of California, Berkeley. Available online at <http://www.sims.berkeley.edu/how-much-info2003>. Accessed January 7, 2004.
4 A data type refers to the data-encoding rules describing how various kinds of records—word-processing documents, e-mail messages, images—are expressed as a collection of bits. For example, an image might be represented by bits whose data type is TIFF, GIF, or JPEG, where the specification describes how the bits are to be interpreted as an image. The more commonplace term “file format” is often used interchangeably with data type, but data type is more appropriate. The literal interpretation of file format is a specification for a file made up of bits. In some cases the bits might be embedded within a file (e.g., a file might contain a folder that contains multiple e-mail messages, one of which contains an image of data type GIF). Some data types are wrappers that encapsulate other data types. For example, files of data type WAVE may contain linear pulse code modulated or highly compressed audio. Other data types support bundling of multiple collections of bits that represent a single record together with associated metadata. Compounding the complexity of the problem are the myriad options and subtypes associated with some data types. For example, TIFF, which is a well-established data type for images, specifies a wrapper that can contain several different image encodings. Newer image data types such as JPEG-2000 are even more complex. The Library of Congress Web site (“Digital Formats for Library of Congress Collections,” <http://www.digitalpreservation.gov/formats/>, accessed May 1, 2005), from which this discussion was partially drawn, provides information about various data types.
Despite expected changes over time, for the foreseeable future there will be many data types and subtypes in use, creating great difficulty for archiving. NARA will never be able to support all of these types and subtypes equally well, yet important records may be created in these formats. Compromises in terms of the level of service provided will be necessary (see the section “Long-Term Preservation,” below).
NARA will need to monitor the kinds of trends described above for several reasons. The volume and types of electronic records that it must handle have implications for the resources devoted to them. NARA must be able to anticipate trends in order to be able to make necessary changes to its own systems, to issue requests for proposals (RFPs) with sufficient scalability requirements, and to advise and influence agencies as they implement new systems. The objective is to avoid surprises like the one that occurred in the 1990s when use of e-mail became widespread and it proved difficult to adapt to this novel form of records. It is neither essential nor cost-effective to carry out detailed surveys of potential types and volumes of records. Internally, NARA will need staff with the expertise and responsibility to track changes. External advice, such as that obtained through research activities or advisory committees, can supplement NARA’s capacity to predict changes and discontinuities in record types and volume.
Such external advice would be based on an awareness of new data types on the horizon and on an understanding of information production inside government and more broadly in society.
PLANNING FOR CONTINUED TECHNOLOGY CHANGE
It is envisioned that NARA’s Electronic Records Archives (ERA) will operate for many decades. The enormous changes in information technology (IT) that have occurred just in the past 10 years suggest that the ERA will experience considerable technology change during its existence. These changes will come about in terms of both the types of records that the ERA will need to accommodate and the technology components that will be available for use in future iterations of the ERA. Indeed, the technology context and the system itself are likely to evolve during the ERA’s lifetime—so much so that the problem can be usefully viewed largely in terms of how the ERA system will evolve around the archive’s data and the data structures used to represent them, both of which are comparatively stable. An important component of planning for the ERA’s evolution will be to identify and distinguish various trends—those requiring system change, those requiring system additions, and those requiring simply that the system be sufficiently scalable.
Technology Performance Improvements
Although the committee does not have a crystal ball for seeing far into the technology future, past experience suggests some relevant technology trends that are likely to persist. Precise changes in the ERA’s component technologies, for example, cannot be predicted, but some general trends are apparent.
Storage will continue to become much denser and cheaper. As in the past, new forms of storage providing such desired characteristics as higher volumetric density, lower cost, and faster access times will continue to emerge, and the commercial market will move toward these new technologies. As was discussed in the committee’s first report, at present, tape-based storage is gradually being replaced by magnetic disk storage. Also, the familiar trend of processors and networks becoming ever faster will continue, making possible cost-effective network transfer and processing of ever-larger volumes of records.
The past 20 years have seen dramatic performance improvements in many directions. In 1984, for example, a state-of-the-art workstation had a processor that operated at 1 million instructions per second and had 1 megabyte of RAM. In 2004, a competent off-the-shelf personal computer (PC), which cost much less than the 1984 workstation, had a processor with a clock rate of around 2 gigahertz and 1 gigabyte of RAM. In 1984 a 10 megabyte hard drive for a PC retailed for $2,500; today a 250 gigabyte hard drive for a PC retails for under $200. The capacity is up by a factor of 25,000, and the cost per byte is down by a factor of about 300,000. In 1984, the fastest links in the Internet were 64 kilobits per second; today the fastest links are 40 gigabits per second, up by a factor of more than 600,000.
Experience has shown that it typically takes a decade or more to figure out what can be done with such new capabilities. For example, all of the Internet technology necessary to support the World Wide Web existed by 1982. That included a working Internet, a good-sized community of users with a need to communicate organized information of various types, the concept of hypertext, the Domain Name System, and desktop PCs connected to the network. But the World Wide Web itself wasn’t invented until 1991—it took a full 9 years to work out the implications and realize that this application was possible.
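The improvement factors quoted above for disk capacity, cost per byte, and link speed can be rechecked with a few lines of arithmetic. The sketch below assumes decimal units (1 megabyte = 10^6 bytes) and takes $200 as the 2004 drive price, although the text says only "under $200", so the cost factor shown is a lower bound. The variable names are illustrative.

```python
from fractions import Fraction

# Decimal unit prefixes are an assumption; the report does not say which it uses.
MB = 10**6
GB = 10**9
KBIT = 10**3
GBIT = 10**9

# Hard drive capacity: 10 megabytes (1984) versus 250 gigabytes (2004).
capacity_factor = Fraction(250 * GB, 10 * MB)

# Cost per byte: $2,500 for 10 MB versus (at most) $200 for 250 GB.
cost_per_byte_1984 = Fraction(2500, 10 * MB)
cost_per_byte_2004 = Fraction(200, 250 * GB)
cost_factor = cost_per_byte_1984 / cost_per_byte_2004

# Fastest Internet links: 64 kilobits per second versus 40 gigabits per second.
link_factor = Fraction(40 * GBIT, 64 * KBIT)

print(f"capacity up by a factor of {int(capacity_factor):,}")
print(f"cost per byte down by a factor of at least {int(cost_factor):,}")
print(f"link speed up by a factor of {int(link_factor):,}")
```

With these inputs the capacity factor is 25,000, the cost-per-byte factor is 312,500, and the link-speed factor is 625,000.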
Therefore, even if technology improvements were to stop in their tracks tomorrow, one would still expect a very high rate of change (improvement in performance and cost) in delivered systems for at least another two decades. And there is no reason to suppose that these sorts of sustained performance improvements will not continue into the indefinite future.
Exploiting the Commercial Mainstream
The ERA system will benefit from continuing technology evolution only if the ERA implementation stays roughly in step with mainstream commercial trends. The functions of the ERA are not uncommon enough to support the use of unique hardware or software components, and relying on such components would expose NARA to the risk of significantly greater costs and more difficult system evolution. There will be, to be sure, some solutions that lead the way and appear to have a good chance of being adopted by a larger community later on. However, NARA should be wary of special-purpose solutions that do not enjoy the benefits of continuous technology improvements.
Sustained improvements in the cost-performance ratio are reflected almost entirely in high-volume products. Examples of high-volume products today are personal computers; ATA/IDE[5] disk drives; compact-disk read-only memory (CD-ROM), CD-recordable (CD-R), digital versatile disk (DVD)-ROM and DVD-R optical storage devices; and Ethernet (10 megabit [Mb], 100 Mb, and 1 gigabit [Gb]) networks. Each of these products has seen sustained decreases in cost per unit of performance. In contrast, low-volume products rarely exhibit sustainable cost-performance improvement and often disappear rapidly from use. Today’s trend in moving to disk from tape for mass storage is an example of what often happens in such cases, with a very high volume product replacing a relatively low volume product.
5 The conventional industry shorthand for a family of widely used hard-drive interface standards known as Advanced Technology Attachment (ATA), Integrated Drive Electronics (IDE), Enhanced IDE (EIDE), and Ultra Direct Memory Access (UDMA).
Despite the anticipated rapid changes in many component technologies, past experience suggests that network interoperability will be relatively stable. Important examples include the backward compatibility among several generations of Ethernet and the persistence of the Internet’s Transmission Control Protocol/Internet Protocol (TCP/IP) standards. As networking hardware and software evolve, they are likely to offer backward compatibility of network protocols and interfaces. Thus it will be possible to connect new equipment to old equipment via network hardware of various kinds, and it will be relatively easy to add new capacity to an archival system and to federate archival systems.
Flexible, Modular, and Extensible Design
Effective upgrading of hardware and software to stay on the technology curve of mainstream IT also places a premium on flexibility. At a minimum, the software for the ERA system should be written to be as portable as possible across offerings of different computer and storage vendors. NARA should also seek out products that can be obtained from multiple vendors, so as to avoid being locked in to an arrangement with a single vendor. Another key to the successful functioning of the ERA over time will be careful modular design, which is critical to allowing components to evolve or to be replaced as needed.
It simply takes too long to redesign a monolithic system to exploit technology improvements. Modular design makes a complex problem more tractable by breaking it down into a set of smaller components and enabling the independent evolution of the parts. This type of system is decomposed into separate modules, each with a well-defined interface. Properly designed components can then be replaced with improved versions, while causing minimal disruption to other modules or to the system as a whole. If modules are too large, they become hard to modify or replace. If the interfaces are overly complex or if they permit internal details of a module’s implementation to become visible to other modules, the ability to change one module independently of others may be lost.
In addition to being designed for change, the ERA should be designed to be extensible—that is, to be able to adapt to some kinds of changes without requiring modification of the system, at least not immediately. For example, if NARA began accepting a new format, the system should at first accept it as a type “unknown.” That way, new records could be accepted without having to wait for the system to be modified. Later, a small, type-specific module could be added to facilitate more sophisticated handling of the type.
LONG-TERM PRESERVATION
The committee’s first report, citing the imperative of making significant progress toward implementing an archival system, proposed a pragmatic short-term strategy that included the following: (1) preserving the original bit stream representing a record; (2) converting records to “preferred variant” formats, where available, that are likely to be more readily interpretable in the future; and (3) preserving enough information for records amenable to emulation approaches so as not to foreclose that option. However, as development of the ERA proceeds, NARA will have to seek out longer-term preservation solutions.
As the committee’s first report discussed, a fundamental requirement of the ERA is the ability to store and retrieve the bit streams representing a record and its associated metadata. The requirements for this part of the ERA are fairly well understood (although good implementations will be quite complex, and they will certainly require refinement and evolution over time). In addition, technologies exist or are being developed to satisfy many of these requirements, and research projects are already reaching advanced states.
As a general rule, one keeps the original bits,[6] since this allows for using the original viewer if that is still possible, and it ensures that all information that was in a record is still there (even if interpreting it may be problematic). Also, the bit stream resulting from a reversible transformation applied to the original bits is as good as the original bit stream, provided that the transformation is error-free. The transformation of an uncompressed bitmap image into a run-length encoding produces a file equivalent to the original bit stream, assuming, of course, that the technical metadata identify or define the new representation correctly.
The lowest level of a system architecture for a digital archive is a storage component that sees the bit streams as objects. This component supports a simple object interface used to create, delete (in cases where a specific retention period applies), and retrieve an object. Many of the design issues associated with object storage were addressed in the committee’s first report. In practice, good implementations will be quite complex and they will certainly evolve.
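The simple object interface described above (create, retrieve, and delete only where a retention period applies) can be sketched as follows. This is a minimal in-memory illustration, not the ERA design: the class and method names are hypothetical, and identifying objects by their SHA-256 digest is an assumption made here so that retrieval can confirm the bits are unchanged.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class StoredObject:
    data: bytes                   # the preserved bit stream, kept byte for byte
    retain_until: Optional[date]  # None means the object is kept permanently


class ObjectStore:
    """Hypothetical in-memory object store keyed by SHA-256 digest."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def create(self, data: bytes, retain_until: Optional[date] = None) -> str:
        """Store a bit stream and return its identifier.

        Storing identical bits again replaces the earlier retention date.
        """
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = StoredObject(data, retain_until)
        return object_id

    def retrieve(self, object_id: str) -> bytes:
        """Return the stored bits after checking them against the identifier."""
        stored = self._objects[object_id]
        if hashlib.sha256(stored.data).hexdigest() != object_id:
            raise ValueError("stored bits no longer match their identifier")
        return stored.data

    def delete(self, object_id: str, today: date) -> None:
        """Delete only objects whose specific retention period has expired."""
        stored = self._objects[object_id]
        if stored.retain_until is None or today < stored.retain_until:
            raise PermissionError("retention period has not expired")
        del self._objects[object_id]


store = ObjectStore()
permanent_id = store.create(b"original bit stream")
temporary_id = store.create(b"temporary record", retain_until=date(2010, 1, 1))
store.delete(temporary_id, today=date(2011, 1, 1))
```

Everything above this interface (format handling, metadata, access) sees only identifiers and bytes, which is what allows the storage layer to be replaced without touching the rest of the system.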
Therefore, one of the main characteristics of such a subsystem should be its modularity, enabling components to be upgraded without affecting other parts. There is also a growing commercial interest in subsystems with such object stores. This interest is driven by such developments as new regulations that mandate long-term retention.
6 The “bit stream” or “data stream” to be preserved does not include media-specific encoding techniques, checksums, and so forth, that take place at the physical storage layer. The marketplace can generally be relied on to shake out any such “under the covers” concerns—for example, no one today worries about whether (or how) a DVD-ROM reversibly encodes the bits.
However, ensuring that bits can be understood in the distant future is a much more difficult problem than are issues involving object storage. Various approaches have been developed,[7] none of which is suitable in all cases. Moreover, one cannot know with any great certainty which proposed methods for preserving the interpretability of digital objects into the distant future will subsequently be deemed successful, much less prove relatively cost-effective.
7 There is an extensive literature on the long-term interpretation problem. A compilation of some recent work can be found in Gerard Clifton and Michael Day, 2004, “File Formats and Tools,” in DPC/PADI What’s New in Digital Preservation, No. 9, July-December, Digital Preservation Coalition, York, United Kingdom; available online at <http://www.dpconline.org/docs/whatsnew/issue9.html#2.3>, accessed May 1, 2005.
Preservation of records will thus require selecting the appropriate preservation techniques for particular types of records. An archive can and should, of course, use different approaches for different types of record, and it may pursue several approaches in parallel for a particular type of record. Electronic records archivists will need to understand the range of available preservation techniques and make judgments about which are appropriate.
Because each approach has potential drawbacks and associated costs, compromises will be needed, based on levels of service. A compromise may involve preserving a subset of the functionalities associated with a particular record. For example, the archivist may stipulate that the printed representation of a document is all that matters. In that case, saving the picture of the page may be sufficient (assuming that there is enough metadata to support efficient access). For another type of document, the archivist may require a higher level of service by asking that the comments hidden in the file also be archived. Then, in order to be equivalent to the original, the new representation will have to contain the comments. A still higher level of service may require that the interactive environment of the word processor itself be reproducible in the future. Box 1.1 provides some additional examples of levels of service.
For each document type (or document type in a particular collection), the archivist needs to determine the appropriate levels of service that should be used. In the example above, preserving the look of the page and the comments is sufficient if the look and comments constitute the essential information of the document. This means that any additional information (for example, which buttons were used to produce it!) is deemed nonessential, or incidental. Any incidental information may be discarded in the preservation process without any loss of value.
Deciding what is essential is crucial to good archiving. A report from the National Archives of Australia says it well: “Determining the essence of records … is essential to an efficient, effective and accountable preservation program.
Focusing on the essence of a record allows us to clearly state our archival requirements for the preservation of that record and to be held accountable against those requirements.”8 The archivist has three basic approaches among which to select: Keep the information in its original format and rely on running the original viewer on an emulation of the original system, or running a viewer (or at least a program that decodes the original data) on a stable virtual machine. Keep the information in a standard or intermediate format (i.e., converted once at most) and commit to providing a viewer for that format into the far future. Convert the format into a new one when obsolescence is threatening the readability of the record. Each option is discussed in more detail below. One General Approach to Preservation: Keep the Information in Its Original Format As described below, several approaches to preservation rely on keeping the information from a record in its original format. Relying on the Availability of the Original Viewer In the long term the original viewer will no longer work, but it is important to realize that the viewer may still exist in the medium-term. When the original representation is kept, that viewer can be used, and the threat of obsolescence is delayed, possibly by several years. The 8 Helen Heslop, Simon Davis, and Andrew Wilson. 2002. An Approach to the Preservation of Digital Records. National Archives of Australia, Canberra, Australia, December. Available online at <http://www.naa.gov.au/recordkeeping/er/digital_preservation/green_paper.pdf>. Accessed May 23, 2005.
OCR for page 26
BOX 1.1 Examples of Varying Levels of Service for Records Preservation

Preserving text (without presentation). Assuming that a file is an ASCII text file (original or obtained by converting the document once), metadata can be used to specify how to interpret the bits. The metadata must include the definition of every character, either in English or by showing the bitmap of each character in a common font (the bitmap format itself must be explained). But still, the amount of metadata remains very reasonable.

Images. It may be reasonable to assume that a few formats for images may be used as standards (the commonly used options within the Tagged Image File Format (TIFF) standard are an example) and that such formats will be explained in detail in so many places that decoding algorithms will always exist. However, some special formats will always exist as well, and once many formats exist, some will become obsolete more quickly. In such cases, it is preferable to make sure that the format interpretation is specified in the metadata; if the interpretation is too complicated, it may be better to convert the original bit stream into another one that may be less compressed but easier to explain. An alternative is to store a program executable on a virtual machine; this keeps the original format and only converts it to a simple bitmap image (a logical view) for presentation on demand. Note that the logical view can be the same whatever the original format is, greatly simplifying the job of the future user.

Data structures in XML (no presentation). Given the widespread and growing use of XML (Extensible Markup Language), an increasing amount of information will be organized and exchanged as XML structures. Nonetheless, metadata describing the semantic meanings of the tags must be captured.

Relational databases. Multiple levels of service can be identified. The most basic level only preserves the data. It comprises the content of the table in a character form and the schema containing the table and column names, comments on their semantics, units, integrity constraints, and so on. Instructions on how to decode the format in which the elements are stored must also be archived (as a precise explanation or as a program written for a virtual machine). A higher level of service will request that one preserve the querying and reporting capabilities of the initial system. If these capabilities were simply those of standard SQL (Structured Query Language), nothing other than the data would be needed, since a query component could be written in the future. But modern database systems employ user-defined functions. To be able in the future to ask the same questions as those asked today, the definition of these functions must also be preserved. If there is enough incentive to code these functions for a stable virtual machine, a future system would be able to make use of them. However, as soon as the complexity increases, a full reenactment of the SQL system may be needed, which in turn requires an emulation of the original machine. More generally, an application is often used to present the data (presentation is more than just a literal display of the data in the Relational Database Management System [RDBMS]), accept transactions, and so forth, and thus it embodies considerable information about the meaning of the data. The raw data (or data schema) alone does not specify the meaning; either very careful additional documentation must be preserved or there must be some way to execute or emulate the application at least in part (e.g., for presentation). Another approach is to consider the database as a vehicle in which records are stored so as to provide efficient access and other capabilities. Instead of archiving the vehicle itself, the records can be rebuilt from the database and archived as data structures.

General data structure. Not all records are relational or XML. Many files may have their own format to represent geographical data, engineering data, or statistical data. Not all files can be converted to XML because, in some cases, the volume of the information in them would increase dramatically. Thus, there is a need to specify how to decode these formats as well.

Documents with presentation. When it is important to preserve the look of a printed document, it is always possible to store an image of the individual pages. In any case, the original bit stream can be archived, together with information on how to decode it. But these formats can be so complicated that only a program can do the job. The program can be written for a virtual machine; it would essentially generate on demand a structure that contains both the tagged data elements of the document and their explicit presentation attributes, down to the individual characters’ images. Reconstructing a page is then easy; there is no need for any original word processor or viewer. An alternative is to do the conversion once at ingest time and to archive the resulting structure.

Spreadsheet. A spreadsheet is a document with presentation, so the item above applies. A first level of service may be to save the content as an image. But some of the values appearing in cells are specified as formulas that compute them as a function of the values in other cells. In the next level of service, these formulas could be stored as metadata to convey the mathematical relationships between cells. If the archivist decides that a future user should be able to execute these formulas, then the execution of a program becomes necessary. A virtual machine approach or emulation can be used.

Dynamic applications. Extreme cases such as video games or highly interactive applications may require the archiving of programs to reenact the original behavior. If the code exists in a high-level language, compiling it for a stable virtual machine is possible. But more often the best alternative is to emulate the original hardware and software on which the program ran. Since the emulator may have to be time-sensitive, the implementation would be much more involved.

metadata play an important role: they must specify (very precisely) the program to be used and all of the parameters that will ensure a faithful interpretation. The preservation subsystem may also contain information on various paths for accessing the right application program. For example, the same application program may run on Windows 2000 or Windows XP. Similarly, the same operating system may run on different hardware environments. Therefore, when a component becomes obsolete along one access path, another path may still be available. A particular use of such metadata may be to trigger an alarm condition when the last path is in danger of being broken by obsolescence.9

9 This functionality has been implemented in a development system. See R.J. Van Diessen, 2002, Preservation Requirements in a Deposit System, Koninklijke Bibliotheek, The Hague, The Netherlands. Available online at <http://www.kb.nl/hrd/dd/dd_onderzoek/reports/3-preservation.pdf>. Accessed May 1, 2005.
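The access-path metadata described above, with an alarm when the last path is threatened, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `AccessPath` class, the `alarm_level` function, the three-component path model, and the "Viewer 1.0" application name are all hypothetical and are not taken from the report or from the system cited in footnote 9.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPath:
    """One hypothetical chain of components needed to render a record."""
    application: str
    operating_system: str
    hardware: str

    def components(self):
        return (self.application, self.operating_system, self.hardware)


def viable_paths(paths, obsolete):
    """Return the paths in which no component is marked obsolete."""
    return [p for p in paths
            if not any(c in obsolete for c in p.components())]


def alarm_level(paths, obsolete):
    """Classify a record by how many access paths still work."""
    remaining = len(viable_paths(paths, obsolete))
    if remaining == 0:
        return "BROKEN"   # no way left to interpret the record
    if remaining == 1:
        return "ALARM"    # the last path is all that is left
    return "OK"


# The same (hypothetical) viewer runs on two operating systems.
paths = [
    AccessPath("Viewer 1.0", "Windows 2000", "x86 PC"),
    AccessPath("Viewer 1.0", "Windows XP", "x86 PC"),
]

print(alarm_level(paths, obsolete=set()))             # OK
print(alarm_level(paths, obsolete={"Windows 2000"}))  # ALARM
```

A real preservation subsystem would presumably track many more component types and raise the alarm before the last path actually breaks; the thresholds here are arbitrary.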
Relying on Metadata

In some simple cases, it is possible to include a specification of the process itself without introducing a program.10 For example, the definition of the BMP bitmap image data type can be explained in a few sentences. Unfortunately, however, processes are generally too complicated to be completely defined by a textual explanation.

Relying on Standards

If it were possible to rely only on a small set of well-established standards, it might be reasonable to imagine that the viewer could be rewritten every time the platform changed. Using standards is certainly good practice. However, the variety of formats that exists today is due in part to the fact that different formats have different performance or ease-of-use characteristics. Users and communities may not be satisfied with tools that they do not think are optimal for their applications and their environments.

Emulating the Original System

The approach of emulating the original system consists of keeping the original executable program that was used to create and/or manipulate the information in the first place. That program (with its operating system) works only on the original or an equivalent machine. In the future, the only way to execute the program would be to rely on an emulator. Initial proposals for emulation suggested that a description of the original machine architecture be archived in all its details so that an emulator might be written when needed. This suggestion is hard to carry out when the expertise on the original machine has evaporated or when the original machine is not available for testing. An alternative to this approach is to write some kind of emulator specification early, when the expertise exists, and to use that information to generate an emulator when the new machine on which it would run is known.

A second alternative is to rely on a stable virtual machine,11 allowing an emulator of the original machine to be written (and debugged) at the time the original machine is known. In the future, a virtual machine emulator will make it possible to execute the emulator of the original machine, which in turn re-creates the original machine and executes the original application code. The metadata must simply contain a user’s guide on how to run the program.

Preserving the execution of the original program is justifiable when reenacting that program’s behavior is of interest, but it is overkill for data archiving. In order to archive a collection of pictures when only the pictures themselves are of interest for posterity, it should not be necessary to save the full system that enabled the original user to create, modify, and enhance the pictures. To read an e-mail, it is not necessary to reactivate the whole e-mail system! There is another drawback: in many cases the application program may display the data in a certain way (for example, a graphical representation) without giving explicit access to the data themselves. In such a case, it is impossible to export the basic data from the old system to the new one, making the reuse of the data all but impossible.

10 J. Rothenberg. 1999. Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation, Council on Library and Information Resources, Washington, D.C.

11 R. Lorie. 2001. “Long-Term Preservation of Digital Information.” Paper presented at Joint Conference on Digital Libraries, Association for Computing Machinery/Institute of Electrical and Electronics Engineers, Roanoke, Va., June 2001.

Relying on a Stable Virtual Machine

The method of preservation that relies on a stable virtual machine consists of archiving, together with the original file, a decoding program P, which understands the internal structure and presents the information in a logical view (a hierarchical structure of tagged elements similar to those used in XML). That program P is written for the virtual machine. In the future, a restore-application program reads the bit stream and the program P; it passes P to an emulator of the virtual machine that executes it. During that execution the data are decoded and returned to the client according to the logical view. The schema of the view (the definition of the tags) can also be archived and retrieved by using a similar technique. Note that in addition to decoding the data structure, the program P may perform arbitrary computation on the data. For example, some data elements returned in the logical view can be obtained by computation on other data elements. If the virtual machine provides a minimal communication mechanism, the program P can receive parameters from the future environment and compute a new object content dynamically.12

A Second General Approach to Preservation: Convert Once to a Standard or Canonical Format

Relying on Metadata

A format can be converted once into a new format. It is quite possible that the new format is much simpler to explain than the original one was. This may be a reason to convert once, probably at record ingest time. On the other hand, the new format may be less efficient (with less compression, for example). There is also a risk of error or loss of features in any conversion. The trade-off must be carefully evaluated. However, converting once does not mean that the original file cannot be kept as well.

Converting the data (once) to XML is starting to be recognized as one approach to the preservation problem.13 Although it is true that XML can play a very effective role for purposes of preserving information, it is not by itself a panacea. XML is a convention for identifying data elements in a character string through the use of tags (the particular tags can be chosen to best fit the particular application). XML per se does not specify what to do with the data; in particular, it does not say how the data should be presented. However, the tags provide a way for other programs (such as style sheets), associated with the XML technology, to take care of the presentation by providing a syntax to specify the presentation attributes for each tag (for example: title in Helvetica font, 18-point, centered; author in Helvetica, 12-point, centered; and so on). But XML does not help in preserving the execution of style sheets for the future.

12 Raymond Lorie and Raymond van Diessen. 2005. “Long-Term Preservation of Complex Processes.” Paper presented at Image Science and Technology Archiving Conference, Washington, D.C., April.

13 For example, the National Archives of Australia has developed an open source software package that provides a plug-in architecture to convert a range of file formats (such as Microsoft Office and OpenOffice formats, relational databases, JPEG, GIF, TIFF, PNG, and BMP) to XML representations for longer-term access.
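The stable-virtual-machine method described earlier in this section (an archived bit stream, a decoding program P, and a restore application that returns a logical view of tagged elements) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy length-prefixed file format, the function names `encode_record`, `program_p`, and `restore`, and the tags `title` and `author` are invented, and an ordinary Python function stands in for a program that would really be written for, and executed on, an emulated virtual machine.

```python
import struct
import xml.etree.ElementTree as ET


def encode_record(title, author):
    """Build a toy archived bit stream: two length-prefixed UTF-8 fields."""
    out = b""
    for text in (title, author):
        data = text.encode("utf-8")
        out += struct.pack(">H", len(data)) + data  # 2-byte big-endian length
    return out


def program_p(bit_stream):
    """Stand-in for decoding program P: knows the internal structure and
    yields the logical view as (tag, value) pairs."""
    view = []
    offset = 0
    for tag in ("title", "author"):
        (length,) = struct.unpack_from(">H", bit_stream, offset)
        offset += 2
        value = bit_stream[offset:offset + length].decode("utf-8")
        offset += length
        view.append((tag, value))
    return view


def restore(bit_stream, decoder):
    """Stand-in for the restore application: runs the decoder on the bit
    stream and hands the client a tagged, XML-like structure."""
    root = ET.Element("record")
    for tag, value in decoder(bit_stream):
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


archived = encode_record("Annual Report", "J. Smith")
print(restore(archived, program_p))
# <record><title>Annual Report</title><author>J. Smith</author></record>
```

The point of the design is that the future client needs to understand only the logical view, not the original byte layout; only the decoder is format-specific.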
Relying on Standards

The fact that a format can be converted once into another format is very useful if the other format is so standard and stable that a viewer will remain available “forever.” Desirable properties of such standard and stable formats include a public definition, broad applicability, compatibility with prior versions, and a slow rate of change. NARA has played an important role in the development of one such format, the draft PDF/A standard, and can continue to help promote the development and adoption of formats that have attractive archival properties.

Relying on a Stable Virtual Machine

As explained above, a virtual machine can be used to preserve the decoding of an original file. The same applies to a file obtained by an initial conversion of the original file into a new format. This may be appropriate when tools are available for extracting all essential information from the original document and storing it in a new format that is simpler to decode.

A Third General Approach to Preservation: Migrate the Format into a New One When Obsolescence Threatens the Readability of the Record

The third general approach to preservation is what is generally understood by the term “migration.”14 Each time the configuration of hardware or software is changed in a manner that affects the interpretability of a document file, the data and/or the program must be altered. Conversion poses some special challenges for several reasons: (1) it imposes the ongoing burden of assessing the interpretability of formats contained in the archive; (2) it has unbounded costs, because there is no end to the conversions that need to be done; (3) it may introduce errors at each conversion; and (4) features of a record may be lost as they are mapped from one version to another.
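To make the third challenge concrete, the arithmetic below shows how a small per-conversion error rate can accumulate over repeated migrations. It is a back-of-the-envelope sketch, not an analysis from the report: it assumes that each conversion damages a given record independently with the same probability, and the function names and the 0.1 percent rate are purely illustrative.

```python
def prob_intact(p_error, n_conversions):
    """Probability that a record survives n conversions undamaged,
    assuming independent errors with probability p_error each time."""
    return (1.0 - p_error) ** n_conversions


def prob_at_least_one_error(p_error, n_conversions):
    """Complement: probability that at least one conversion went wrong."""
    return 1.0 - prob_intact(p_error, n_conversions)


# Illustrative only: a 0.1 percent failure rate per conversion.
for n in (1, 5, 10, 20):
    print(n, prob_at_least_one_error(0.001, n))
```

Under these assumptions the chance of damage grows roughly in proportion to the number of conversions while the rate is small, which is one way to read the concern about an open-ended series of migrations.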
The third consideration (the possibility of introducing errors) is particularly significant. As with other aspects of archive management, human inspection, except on a very sparse sample, will be prohibitively expensive. Even if the conversion failure rate is very low, the consequence of automatic conversion and little or no human checking is to place the burden of discovering errors and recovering from them on the ultimate users of the record. This eventuality is likely to lead the user to fall back on one of the other preservation approaches described above.

Conversion would be more attractive in some particular scenarios, for example, if the original form of a record is in some version of a very popular, well-characterized format such as PDF (which also happens to be relatively stable, in the backward-compatibility sense). When that record is migrated to succeeding versions of PDF and then into successor formats, each of these conversions is likely to be highly accurate because society in general invested a great deal in their being so at the time. Moreover, because all earlier (con)versions would have been kept in such cases, there would be a window of opportunity (while interpreters for the last format or two are still available) in which errors could be corrected. (And since “live” interpreters would be available for both original and new formats at the time, at least some form of automatic error checking would be possible, such as comparing resulting bitmaps.)

14 Note that the term “migration” is sometimes defined also to include the refreshing of bit streams onto new storage media. However, this refreshing is, of course, a fundamental requirement for all three approaches described above.

A variation on this strategy is migration on access, which postpones the conversion until a user requests a record. This approach has several possible advantages. For example, deferral can save on resources when most content is rarely accessed; additionally, on-the-fly conversion makes it more likely that the most up-to-date conversion software will be used.15 However, deferral pushes the problem to a time when data-type obsolescence would make it much more difficult to develop or validate a converter. Also, the advantages of migration on access are reduced in proportion to the fraction of data types that end up being accessed in the future.

15 David S.H. Rosenthal et al. 2005. “Transparent Format Migration of Preserved Web Content.” D-Lib Magazine 11(1). Available online at <http://www.dlib.org/>. Accessed May 1, 2005.

GROWING USER EXPECTATIONS

Users of online services, such as e-commerce sites, Web search engines, and the like, have developed expectations for near-immediate access to vast data sets. Users’ expectations for the ERA will include the following:

Easy, online access to entire collections. Users will expect the full contents of digital collections archived by the ERA to be searchable and accessible online. They will expect access to be available on an unmediated basis (that is, without going through an archivist), though they may sometimes seek such assistance. People increasingly expect to be able to locate government information online, and the Internet is rapidly becoming the primary channel for accessing government information.16

Free and anonymous nominal access. As with other online government information, users will expect to be able to search for and access information archived by the ERA without having to pay fees or to register. Users will be more willing to pay fees for bulk downloads and other such services.

Speed. Users have come to expect near-instantaneous response times even when searching databases containing many billions of items (e.g., Web search engines and e-commerce sites). Users are increasingly intolerant of slow response caused by high demand; those providing Web services today carefully architect and provision their systems to minimize response time. Similar requirements are being placed on government systems.17 Capacity planning, performance measurement, and production engineering will need to be done on an ongoing basis in order to provide a particular level of quality as public demand scales up. Of course, not all access can be provided on an instantaneous basis: much richer data types and more sophisticated approaches for searching records may require much more processing time. Nor can prebuilt indexes or other specialized access paths be made available for all types of records.

High levels of availability. Users have come to expect the generally high levels of availability that major commercial services provide by investing substantially in redundant systems and network connections.

A well-designed user interface. Commercial services offer well-designed user interfaces that are continually being refined. Users will expect the same of the ERA’s interface.

Federated search. Users are beginning to expect services that allow them to search across multiple sources of content. For example, search engines such as Google provide a federated search across the Web, and Fedstats.gov provides a federated search across data from U.S. federal statistical agencies. Users will, for example, increasingly expect to be able to search for government documents and records regardless of which agency has custody of them.

Appropriate interpretation and processing, not just access to the raw data (such as conversion from persistent forms into easily usable forms). If users were able to access the Nixon tapes online today, for example, they would expect them to be available in Real Player or Windows Media Player format.

16 John B. Horrigan and Lee Rainey. 2002. Counting on the Internet. Pew Internet & American Life Project, Washington, D.C., p. 9. Available online at <http://www.pewinternet.org/reports/pdfs/PIP_Expectations.pdf>. Accessed May 1, 2005.

17 For example, the FirstGov search engine was specified to have a response time under 5 seconds. William Matthews. 2002. “Vendors Vie for New FirstGov Contract,” Federal Computer Week, January 21. Available online at <http://www.fcw.com/fcw/articles/2002/0121/news-first-01-21-02.asp>. Accessed May 1, 2005.

User demand for certain types of records can also be quite high, exceeding initial projections. A recent instance was the United Kingdom’s release in 2002 of the 1901 census data. According to the U.K.’s National Archives, the Web service was designed to support 1.2 million users per day, but demand on the first day was 1.2 million users per hour. The demand continued to rise in the days that followed, and the Web site had to be taken down so that its capacity could be significantly increased.18

These are demanding expectations. In the list above, the last item in particular anticipates that users will expect services that go well beyond the simple delivery of static records on a disk. However, as this report recommends, NARA’s priority should be to ingest, manage, and provide basic access to records. Given limited resources, NARA cannot afford to meet all of these expectations for all records; expectations will need to be reconciled against the notion of varying levels of service for different record types. Therefore, third parties should be encouraged to develop user services beyond the basic level provided by NARA, an option discussed in Chapter 3 of this report.

18 Additional information is available online at <http://www.nationalarchives.gov.uk/about/operate/meetings/censusadvisory/17jan2002.htm>. Accessed May 1, 2005.

OTHER TECHNOLOGY TRENDS

Full-Content Search and Automated Metadata Extraction

One of the major challenges for the ERA system is that of ingesting the anticipated volumes of records without being bogged down by too much (costly) human effort. Some metadata tell how the records are structured, what kind of records they are, where they came from, when they were created, and so on: information needed both for the management of the records within NARA and to help users find records. Other metadata, such as subject indexing terms, have historically been assigned by humans and are used primarily for finding rather than management. Several technologies make it possible to avoid the human effort associated with generating the second kind of metadata.

One approach to reducing this burden is full-content searching. There has been considerable progress in this area (that an enormous body of publicly accessible Web pages can be searched effectively by search engines such as Google is a major achievement of relevance to NARA), and there is widespread demand for better searching technologies.

Another technique for increasing the degree of automation possible is the use of automatic metadata extraction. This type of technique involves a wide variety of algorithms and approaches, often referred to as information-extraction techniques, for extracting information from text that is either semistructured or natural language. To take a simple example, such techniques enable the sender, recipient, and subject of a memorandum to be extracted automatically even if these items are not explicitly tagged in the document. Information extraction appears in a number of production systems, including the following:

CiteSeer,19 in which a Web crawler monitors university Web sites, finds things that it identifies as research publications or technical reports, and extracts information such as the title and author names from the file (usually a PDF or PostScript version of the paper), as well as citations to other papers (at the end of the paper). This system, which has been operational since the late 1990s, is very heavily used in the computer science community.

WhizBang! Technologies’ job search system FlipDog (subsequently sold to Monster.com), which crawled the Web looking for job postings and extracted data (e.g., location, salary, and title) from them.20

Google, which uses information extraction to answer search queries such as “What is Jupiter?” The first link returned will probably be “Web definitions for Jupiter.” Clicking on it displays a list of several-sentence definitions that have been extracted from Web documents using a set of rules for finding things that look like definitions.

Shop-bots, which use information extraction of some kind. In the early days most shop-bots used handwritten extraction rules, and many probably still do. However, handwritten rules are fragile, because when a site changes, the handwritten wrappers stop working. Thus, there has been much research on automatically learning to extract information from e-commerce pages. The details of which algorithms are used by which sites are generally proprietary.

19 See C.L. Giles, K. Bollacker, and S. Lawrence, 1998, “CiteSeer: An Automatic Citation Indexing System,” in Proceedings of the 3rd ACM Conference on Digital Libraries (DL.98), pp. 89-98, Pittsburgh, Pa., June 23-26; S. Lawrence, K. Bollacker, and C.L. Giles, 1999, “Indexing and Retrieval of Scientific Literature,” in Proceedings of the Eighth International Conference on Information and Knowledge Management (CIKM 99), pp. 139-146, Kansas City, Missouri, November 2-6; Y. Petinot, C.L. Giles, V. Bhatnagar, P.B. Teregowda, H. Han, and I. Councill, 2004, “CiteSeer-API: Towards Seamless Resource Location and Interlinking for Digital Libraries,” Proceedings of the 13th Conference on Information and Knowledge Management (CIKM 2004), pp. 553-561, Association for Computing Machinery, New York.

20 Related techniques are discussed in U. Nahm and R. Mooney, 2002, “Text Mining with Information Extraction,” in Proceedings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, AAAI Press, Menlo Park, Calif.
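The memorandum example mentioned above (pulling out sender, recipient, and subject) can be approximated with a few handwritten rules of the kind the text describes as fragile. The sketch below is a deliberately simple stand-in rather than any of the production systems named: the header keywords, the `FIELD_PATTERNS` table, and the `extract_memo_metadata` function are assumptions made for illustration, and the rules work only when the memo uses conventional header lines.

```python
import re

# Hypothetical handwritten rules: one pattern per metadata field.
FIELD_PATTERNS = {
    "sender": re.compile(r"^\s*FROM\s*:\s*(.+)$",
                         re.IGNORECASE | re.MULTILINE),
    "recipient": re.compile(r"^\s*TO\s*:\s*(.+)$",
                            re.IGNORECASE | re.MULTILINE),
    "subject": re.compile(r"^\s*(?:SUBJECT|SUBJ|RE)\s*:\s*(.+)$",
                          re.IGNORECASE | re.MULTILINE),
}


def extract_memo_metadata(text):
    """Return a dict of extracted fields; a missing field maps to None."""
    metadata = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        metadata[field] = match.group(1).strip() if match else None
    return metadata


memo = """MEMORANDUM
To: Records Management Staff
From: Office of the Archivist
Re: Transfer schedule for electronic files

Please review the attached schedule.
"""

print(extract_memo_metadata(memo))
```

A memo that words its headers differently would defeat these rules, which is why the learned extractors discussed above are attractive at archive scale.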
Another technique for the automated supply of metadata is categorization, which discerns groupings of related content.

Although the techniques discussed here are valuable, they do not always yield correct metadata. The errors that occur will tend to have a minimal impact on the precision (the proportion of items retrieved that are relevant) of finding aids, although they obviously will reduce recall (the proportion of relevant items retrieved by a search). More complex natural language problems, such as parsing and “understanding,” are quite difficult and appear likely to remain so. Nevertheless, the information-extraction techniques that have been developed over a period of several decades have reached a level of maturity allowing their use in production systems when guided by human expertise. Although they do not and will not provide the accuracy of a human, they offer the considerable advantages of improved speed and a cost that is several orders of magnitude lower than that of manual processing.
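The two measures defined above can be computed directly from a set of retrieved items and a set of relevant items. The following sketch uses made-up record identifiers, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four relevant records exist; a search returns three
# items, two of which are relevant.
relevant = {"r1", "r2", "r3", "r4"}
retrieved = {"r1", "r2", "x9"}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # precision is 2/3, recall is 2/4
```

In this example, missing metadata on records r3 and r4 would lower recall without affecting precision, which matches the pattern of errors the text anticipates.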