5
Key Technical Issues

DATA MODEL

Which information and metadata are saved for each preserved record? How are they represented? Using which data types? The “data model” is the specification that answers these questions. The ingest process builds representations of records that conform to the model and the access process finds records, deciphers the representation according to the model, and presents the results to the user. The data model is thus a key interface in the system: it is the interface between the ingest process and the access process, using a linkage provided by the storage system. It is an “interface to the future.”1

The data model must be designed to evolve, because new data types and requirements will emerge over the life of the system. Thus a careful design for the data model that tries to anticipate future ERA needs will simplify the system. While it is easy to allow different models to coexist in a single archive simply by labeling each record with the identity of the data model used to store it, a proliferation of data models will result in a costly proliferation of software to interpret them.

This chapter highlights some of the properties of a data model, principally to tie it into discussion elsewhere in the section. Box 5.1 shows some possible elements of a digital record. Many variations are possible; for example, the metadata for each file (original and derived) might be recorded separately, with the metadata that pertain to the record as a whole kept in

1  

Although this expression may seem trite, it accurately describes the vital role of the data model. The ingest process must prepare a digital representation of the record that supports future access processes—some of which may not be created until years after the ingestion of the record. Thus, the interface must be designed to ignore the inner workings of access modules. At the same time, the properties of the interface will enable or constrain what future access modules can do with the record.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





a separate file. Metadata that pertain to collections as a whole might be stored in yet another file, referenced in the metadata for each record in the collection.

BOX 5.1
Some Possible Elements of a Digital Record

- Original files: one or more digital files, the original bit stream or native form of the record represented in its native data type. A record may consist of more than one file, e.g., a report in which each chapter is represented by a separate word-processor file.

- Optional derived forms: digital files obtained from the original files by a converter. ERA policies might encourage certain kinds of derived forms to be saved, for example these:
  - A form whose data type is chosen to simplify "presentation" or "rendering" of the record into a visible form for printing or display.
  - A form whose data type is chosen to simplify content searching.

- Metadata for the record. In addition to the usual metadata normally captured for records, there is additional metadata associated with electronic records, such as:
  - Data types of the digital files and derived forms, with sufficient information to allow finding documentation about the data types. Data types will usually be versioned.
  - Relationships among the digital files that constitute the record.
  - Integrity checks, e.g., a cryptographic hash, for the digital files and the metadata.
  - Ephemeral/nonderivable metadata, i.e., properties of the context in which the record was created that are not specified in the record itself.
  - Derived metadata, i.e., properties that have been extracted from the record.
  - Provenance and history, such as evidence that the record was transferred accurately to the archive (a form of ephemeral metadata). In the case of derived forms, metadata identify the converter used to obtain the derived form from the original.
  - Unique identifier of the record.
  - The data type of the metadata, i.e., the definitions of the metadata elements used to construct the metadata for this record.

The data model must also deal with embedding and aggregation. Embedding occurs when one record is embedded within another, e.g., a spreadsheet is embedded as an attachment within an e-mail message. Aggregation occurs when several records are saved together, e.g., a series of e-mail messages is saved in a single file, even though each message is to be treated as a separate record. Another form of aggregation that may be desirable is the container—e.g., as used by the SDSC demonstration—which simply collects a group of records into a single digital file for more efficient handling by the file system.2

2  A container is distinct from an archivist's "collection." A collection may span several containers, and several collections might fit within a single container.

The archive should contain complete documentation about all versions of the data model, including specifications of the data types it uses. Since metadata sets are likely to proliferate and be complex, it is essential that any use of metadata be linked to a complete definition of its terms.

Several aspects of the data model require more discussion:

Evolution. The key to smooth evolution of the data model is to carefully label each digital file that is part of the stored record with its type; when the file is read, the type identifier selects software that can interpret the file correctly. This is sometimes called "self-identifying data." Whenever a data type is chosen as part of the data model, the system designer should ask: How do I introduce a new version of this data type without disrupting existing records? Clearly, if new types are introduced too frequently, the proliferation of types will lead to overly complex software to decode them all. Note that types must be versioned—it is not enough to say "this is a Microsoft Word document." A version number and perhaps a platform (e.g., PC or Macintosh) are required for unambiguous identification.

Unique identifier. The ERA should assign to each record a unique identifier that can be used, both inside and outside the archive, to refer to records. Such identifiers are used inside a system to identify records without regard to their location or storage mechanism; they are used outside a system to specify a link to a specific record. A related question is how separate parts of the representation of a record (e.g., distinct original files or derived forms) are identified. The unique identifier for digital records should be harmonized with the ways of identifying other records held by NARA.3 There are a number of digital identifier techniques and implementations today, and it is entirely possible that the particular scheme used by the ERA may change a few times over the lifetime of the archive. The system design should accommodate such changes.

Modifying the archive.
Although the principal idea of an archive is that records, once ingested, must not be modified, the ERA must allow certain kinds of information to be added or changed. Records themselves, once ingested, are rarely, if ever, modified or removed. It may be advantageous to add new derived forms long after a record was first ingested, perhaps when improved migration tools and techniques become available. And some of the metadata surrounding records (such as conditions of use, where to find derived versions, and relation to records added at a later date) may need to be modified. Some metadata may need to be updated—for example, to record ephemeral metadata that come to light after the original ingestion, or to note changes to access controls for a record that result from new regulations or the passage of time. Importantly, all such modifications need to be logged in such a way that it is possible later to identify the causes of mistakes and untangle them. It may also be wise to design the system so that information cannot be deleted, only augmented.4

3  NARA does have a procedure for establishing a unique identifier for each record series for its paper records. Essentially, the records hierarchy has 13 levels, from Record Group to item, that assign a unique identifier based on the specific Record Group and the level of the record. This does not go all the way down to a unique ID for each record. It would be possible to add additional data (such as a date and sequence number) to provide unique identification.

4  Computer science researchers have designed "write-once" file systems in which it is not possible to delete a file, only to add new files or supplant old ones (add new versions of old files). Suppose, however, that certain records must be expunged from the archive, perhaps as a result of a court order. Even though rare, such a modification might be required, so a strictly write-once design may not be a wise approach for the ERA.
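The logging requirement above can be sketched as an append-only record of modifications. This is an illustrative sketch, not a NARA design; the class and field names are invented.

```python
import time

class ModificationLog:
    """Append-only log of archive modifications: entries may be added,
    never altered or deleted, so mistakes can later be traced and untangled."""

    def __init__(self):
        self._entries = []  # no method is provided to remove or edit entries

    def append(self, record_id, action, actor):
        entry = {"time": time.time(), "record": record_id,
                 "action": action, "actor": actor}
        self._entries.append(entry)
        return entry

    def history(self, record_id):
        """All logged modifications affecting one record, oldest first."""
        return [e for e in self._entries if e["record"] == record_id]
```

A real implementation would persist entries to durable, replicated storage; the essential design point is simply that the interface offers no way to delete or rewrite history.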

Commonalities with digital libraries. The issues confronting the design of a data model for the ERA are almost exactly the same as those for a digital library. Unfortunately, there are no standard data models that can simply be adopted by NARA. Nevertheless, NARA should seek to align its design with that of digital libraries, since this will increase the likelihood of providing uniform access to libraries and archives in the future.

Closely related to the specification of the data model is a set of policies used to operate the archive. For example, the data model determines how derived forms are recorded in the archive, but a policy will specify which kinds of derived forms should be generated for a collection, perhaps specified as part of the collection's profile. Policies will also apply to modifications to the archive—who may introduce modifications and under what conditions (perhaps even requiring two staffers to confirm certain kinds of changes, as is common practice in the financial services industry).

Data Types and Obsolescence

The data model confronts the most challenging problem of an archive: For records to be useful many decades after they are ingested, they must be expressed in the data model using data types that can still be decoded and interpreted at the time of access. By that time the computers and software used to create the original records may be obsolete. The archival community has written at length on this topic.5 The debate over emulation vis-à-vis migration continues (see the brief discussion of emulation and migration in the preceding chapter), and neither one is firmly established as the only way to preserve digital records. In fact, some researchers in digital preservation reject the notion that emulation and migration are mutually exclusive (assuming, as we do here, that the requisite information to support each is retained) and argue that each is appropriate under particular circumstances.
Consequently, the ERA should be designed to:

- Record in the archive the information necessary to support both emulation and migration when and if either becomes common. The original bit stream and careful records of the software environment in which it was created (data type and ephemeral metadata to record the fonts, operating system, application program, and other digital resources or pointers to them) are essential.

- Function adequately in the absence of either emulation or migration solutions. This requires some pragmatic choices, which are discussed here and in Chapter 4.

NARA shares with others an interest in making one or another of these preservation strategies operational, such as building emulators for a few of the most popular computers.

5  See, for example, Jeff Rothenberg, 1995 (revised in 1999), "Ensuring the Longevity of Digital Documents," Scientific American 272(1): 42-47, revised and expanded version available online at <http://www.clir.org/pubs/archives/ensuring.pdf>; Task Force on Archiving Digital Information, 1996, Preserving Digital Information (commissioned by the Commission on Preservation and Access and the Research Libraries Group, Inc. [RLG]), Mountain View, Calif.: RLG, available online at <http://www.rlg.org/ArchTF/tfadi.index.htm>; and Howard Besser, 2000, "Digital Longevity," in Handbook for Digital Projects: A Management Tool for Preservation and Access, Maxine Sitts, ed., Andover, Mass.: Northeast Document Conservation Center, pp. 155-166. See also Stephen Manes, 1998, "Time and Technology Threaten Digital Archives…," New York Times, April 7, p. F4.

This option is worthy of further exploration. However, because such strategies remain future options rather than present realities, the ERA program cannot depend on any particular preservation strategy today.

The preceding chapter describes a pragmatic approach to making records useful in the future: Anticipate common purposes to which records may be put in the future (e.g., display or text searching) and prepare, as each record is ingested, one or more derived forms of the record, chosen to streamline future access for those uses. As a complement to possible future efforts that provide access through such techniques as emulation or migration, the pragmatic strategy is to support access by using a smaller number of data types to express the derived forms of records. These are referred to in this report as "preferred data types."

This approach requires characterizing the most common (future) uses of different kinds of records and choosing associated preferred data types for the derived forms.6 These choices, and the kinds of access users seek, may, of course, change over time in ways not anticipated today—but some reasonable projections of use that address likely forms of access can be made. As an illustration, here are some common uses of records and possible choices of preferred data types to support each use:

- Presentation of a document in visual form (printed or displayed). Some preferred data types: image data types (fax, TIFF), Portable Document Format (PDF), ASCII text, and XML-encoded documents with a style sheet that specifies rendering parameters.7

- Searching the text of many documents. Some preferred data types: PDF,8 ASCII text, and XML-encoded documents.

- Loading a relational table into a database. Some preferred data types: comma-separated values (CSV), dBase, and XML encoding.

The derived forms may fail to reveal one or more properties of the original record but nevertheless will make the record accessible many years after its ingest.
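The pairing of anticipated uses with preferred data types sketched above might be captured in a small table consulted at ingest time. The table entries below simply mirror the report's illustrations; the function name and the table itself are hypothetical.

```python
# Anticipated use -> candidate preferred data types, in order of preference.
# A real table would be maintained, extended, and versioned by NARA.
PREFERRED_TYPES = {
    "presentation": ["application/pdf", "image/tiff", "text/plain"],
    "text-search": ["text/plain", "application/pdf"],
    "database-load": ["text/csv"],
}

def choose_derived_type(use, available_converters):
    """Return the first preferred type for `use` that ingest can actually
    produce with an available converter, or None if none can be produced."""
    for candidate in PREFERRED_TYPES.get(use, []):
        if candidate in available_converters:
            return candidate
    return None
```

Keeping the table small and explicit is the point: each entry represents a commitment to maintain access software for that type for decades.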
For example, a presentation created from an XML-encoded derived form may not break lines and pages in the same places as the original word-processing software; a PDF file derived from a word-processor file will not retain the change history recorded in the native data type. But supporting PDF as a presentation data type is far easier than supporting the very large number of data types from which PDF files can be derived.

The choice of a preferred data type will require consideration of the loss of fidelity compared with the native form.9 A future researcher considering the use of a derived form will need this information to determine whether to be satisfied with the derived form or to take extra steps to interpret the native data type. Retaining one or more derived forms will add somewhat to the storage requirements of the archive, but wise choices of preferred data types will probably not increase storage requirements by more than a factor of 2. In many cases, a single derived form can serve all the common uses anticipated for the record. In some cases, the native data type may itself be a preferred data type that satisfies all the anticipated uses.

6  It is important to note that manipulating (editing or changing) a preserved record is not a common use, so preferred data types can have much more constrained aims than native data types.

7  Box 4.1 discusses some of the limitations associated with XML encoding.

8  Most PDF writers create files that are easy to search for text strings. However, it is certainly possible to create PDF files that defy simple searching.

9  For some records, it may be difficult to carefully enumerate and document this loss. The cost and limitations of this evaluation should be factored into the decision to create preferred data types.

The reason for recording derived forms when the record is ingested is simple: It is at this time that software to create the derived forms is most likely to be available. Unfortunately, the ingest process often occurs many years after a record was originally created, by which time the software to create derived forms may already be obsolete. Then the process may require custom software or may be very difficult (e.g., if the native data type remained proprietary and fell into disuse).

If the derived forms of a record are to be useful decades after they are created, the preferred data types must be chosen carefully. NARA will need to select and extend the collection of preferred data types. Preferred data types would have at least some of the following properties:

- In common use. The data type should be in common use and reasonably expected to remain so for some time. NARA should not have to make decisions about common data types quickly, so the list can evolve slowly as NARA watches for common forms.10

- Well documented. The data type should have sufficient documentation to allow modestly skilled programmers to write software to process the data type for its intended uses, such as presentation. Standards are usually well documented, but proprietary data types often have fine documentation as well (e.g., PDF11 and RTF12 today) and may be reasonable candidates for the data types of derived forms. Many proprietary data types also have corresponding "external" or "interchange" data types that are candidates for preferred data types. It is an added benefit if the data type is simple.

- Slowly changing. If the definition of the data type is stable, the cost of supporting it is reduced.

- Free from intellectual property encumbrances.
It is a disadvantage if processing a data type requires licenses from intellectual property owners. (In the long run, NARA may wish to seek agreements that processing archival documents represents "fair use" of such property, or to seek specific legislative relief.) NARA—or the federal government—may wish to induce vendors to transfer obsolete data type specifications to the public domain.

- Software available in the public domain or in open-source form. For each preferred data type, NARA will need to obtain and maintain the associated access software. In many cases, NARA can take advantage of software written by others. Note, however, that to be useful to NARA the software must be in a form that is likely to remain useful decades from now—for example, one that can be ported as needed to new hardware and software platforms.

Choices of preferred data types for the ERA should also be influenced by common practice in digital libraries, other digital archives, and mainstream computing.13 Common choices would lead to opportunities for sharing access software among such partners and may increase the likelihood of commercial products that will reduce the cost of developing or evolving the ERA. Selecting a relatively small number of preferred data types has the additional advantage of allowing NARA staff to become truly expert in the technical aspects of particular formats.

10  If preferred data types are chosen carefully, it is possible that the period of time before a preferred data type must be decommissioned may be lengthened considerably.

11  Adobe Systems' Portable Document Format, <http://partners.adobe.com:80/asn/developer/acrosdk/docs/filefmtspecs/PDFReference.pdf>.

12  Microsoft's Rich Text Format, <http://msdn.microsoft.com/library/en-us/dnrtfspec/html/rtfspec.asp>.

Derived forms can also be used to address other needs of the ERA. For example, a database in which some fields are public and others have special access restrictions might have a derived form for public access, with the sensitive fields omitted. Redacted versions of a record might be stored as derived forms with relaxed access controls.

Derived forms may also be a simple way to deal with unique or complex data types. For example, an archivist seeking to preserve records stored in a unique IT system with unknown internal data structures might run a collection of "reports"—text files intended for printing—which taken together reveal all the information retained in the unknown data structures. Archiving these reports as derived forms might be the most practical approach to preserving the essential contents of a record.

Metadata

Metadata are conventionally expressed as a series of "attribute, value" pairs (also sometimes called "tag, value" pairs). A metadata set is a collection of attribute names and corresponding definitions used to express metadata. There are no standard universal metadata sets, though many popular subsets exist (e.g., Dublin Core14). Some attempts are under way to relate all metadata definitions to a common ontology, but this complex approach is very risky. It seems clear that metadata sets will change frequently over the lifetime of the ERA and that the system should be designed to accommodate such change, an observation confirmed by the experience of the SDSC demonstrations.
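The notion of a metadata set, attribute names paired with precise definitions, can be made concrete with a small sketch. The attribute names below are loosely modeled on Dublin Core elements; the set contents and function name are invented for illustration.

```python
# A metadata set: attribute names with their definitions (illustrative;
# the names are loosely modeled on Dublin Core elements).
EXAMPLE_SET = {
    "title": "A name given to the record.",
    "creator": "An entity primarily responsible for making the record.",
    "format": "The data type, preferably versioned, of the record's files.",
}

def metadata_pairs(values, metadata_set=EXAMPLE_SET):
    """Express metadata as (attribute, value) pairs, rejecting attributes
    the set does not define so that every tag has a precise definition."""
    unknown = set(values) - set(metadata_set)
    if unknown:
        raise KeyError(f"attributes not defined in set: {sorted(unknown)}")
    return sorted(values.items())
```

Rejecting undefined attributes at ingest enforces the requirement, discussed below, that any use of metadata be linked to a complete definition of its terms.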
XML offers one convenient syntax today for recording metadata, in part because it is easily extensible; it is already the representation of choice in the digital library community. For metadata saved in the ERA to be useful, users must be able to obtain precise definitions of metadata tags. It is for this reason that each metadata record should identify the metadata set it uses and provide a way to find, in the archive itself, the definitions of the tags in that set. As discussed in Chapter 3, the SDSC work showed that indexing pertinent metadata in a relational database offers a surprisingly simple and effective way to find archived records.

Some metadata saved in the ERA must be interpreted by the ERA software itself, e.g., tags that specify the data types of the digital files that constitute the record. To accommodate changing metadata sets without enshrining specific metadata tag names throughout the software, some form of lookup should be used to convert from the tag name used in the specific metadata set to a name that is meaningful only to the software.

13  Preferred data types have been selected in a number of programs. For example, the DSpace project (<www.dspace.org>) provides varying levels of support for different formats. The Federal Court system's Case Management/Electronic Case Filing (CM/ECF) system, which allows courts to accept filings and provides access to filed documents over the Internet, has adopted the PDF format. For more information on the CM/ECF system, see <http://pacer.psc.uscourts.gov/cmecf/> and <http://pacer.psc.uscourts.gov/documents/press.pdf>.

14  <http://dublincore.org>
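The two ideas above, a metadata record that names its metadata set and a lookup that converts set-specific tag names to internal names, can be sketched together. The set name, tag names, and XML shape here are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Each metadata set maps its own tag names to internal names understood by
# the software, so tag names are not enshrined in code (names invented).
TAG_MAPS = {
    "era-set-1": {"Title": "title", "FileType": "data_type"},
}

def to_xml(metadata_set, pairs):
    """Record metadata in XML, naming the set that defines its tags."""
    root = ET.Element("metadata", {"set": metadata_set})
    for tag, value in pairs.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")

def read_internal(text):
    """Parse a metadata record and convert its tags to internal names
    via the lookup table for the set the record declares."""
    root = ET.fromstring(text)
    tag_map = TAG_MAPS[root.get("set")]
    return {tag_map[child.tag]: child.text for child in root}
```

Because the record carries its set name, a new metadata set can be introduced by adding one lookup table, without touching the software that consumes internal names.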

STORAGE

The storage function of the ERA is used to save files for a long time (perhaps it should be called a "long-term file system"15). It thus serves as a transparent bridge between the ingest and access functions: The files produced by ingest are delivered without modification to access. In many respects it is like any other file system. Some of its requirements, though not commonplace, are shared with other very large, mission-critical systems, including these:

- Scalability. It must be able to grow to hold many (even hundreds of) petabytes of data, in many billions or even trillions of individual files.

- Robustness. It must not corrupt data or fail to deliver data requested.

- Access control. Only authorized software, under the control of authorized staff, may store files or make other modifications to what is stored.

It is in the area of robustness over a very long period of time—the lifetime of its data is longer than that of any known data storage device or medium—that the system presents unusual requirements.16 The following strategies can be used to preserve digital files for many decades:

- Store redundant replicas. To survive the failure of a storage device, information is stored redundantly. But care must be taken to categorize failure modes and determine appropriate storage strategies. For example, if a file is stored on two different disks, both of which are located in the same room, a fire might destroy both copies. So some kind of geographic replication is a must—usually achieved today by saving copies of a file at one or more remote sites connected by a high-speed network. Replication is governed by policies enforced by the file system that specify how many copies should be kept, where they should be stored, and so on.

- Detect errors and correct them automatically. To detect failed devices in a large file system, it is necessary to use a background task that constantly reads all the data in the system and checks for errors caused by deterioration. This process requires computing a hash or checksum of each file and comparing the value with a value computed from the same data when it was first stored.17 When a corrupted file is detected, one of its redundant copies is used to create a fresh copy somewhere else so as to conform to the replication policy. (Other files stored on the same device that produced errors are probably also moved, since the device as a whole may be failing. The device is then removed from service or replaced.)

- Refresh media.18 As storage media age, copy all the files stored on them to new devices, and remove the old devices from service. This step requires that new devices can be attached to the file system at all times over the life of the system.

- Rebuild directories and indexes by scanning the archive. If sensitive data that describe the structure of a file system are lost, the entire file system can be scanned to rebuild the data. Such a design requires storing files on the disk in a way that records their structure within the file system.

- Use implementation diversity. To guard against losing files because of errors in the file-system software, replicas of a file can be saved in file systems with distinct implementations.

15  Calling it an archival file system risks confusing it with the archival file systems offered by some computer system vendors, often as part of a hierarchical storage management product.

16  The robustness requirements may well not be uniform for all records. The requirements might, for example, be a property of a record that is assigned at ingest time and might be reflected in how much the storage system "invests" in preserving that record. This is a good example of how complicated some of the architectural issues for the ERA can be.

17  Disk and tape recording formats include checksums that are verified on each read, but these are viewed as insufficiently robust. A file-level hash is advisable as well.

18  The term "refresh" is preferred to "migrate," because the second term is used to describe a conversion of data type.

These techniques are all widely used (or at least widely advocated) today. The engineering challenge for NARA is that no one has yet demonstrated that they can, together, implement a file system that will preserve files for a hundred years or more.19

The scalability of the file system, as well as its ability to connect new storage devices to replace old ones, can be achieved using network-connected storage systems. When it is no longer possible or economically feasible to expand the capacity of an existing storage system, a new system is procured, attached to the network, and configured to participate in the overall file system. The overall file system operates as a "federation" of the network-connected storage components. This technique uses a software layer that makes the collection of network-connected systems appear to be a single file system.
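The background integrity check described above amounts to recomputing a file-level hash and comparing it with the value recorded at ingest. This sketch uses SHA-256 and an in-memory dictionary as a stand-in for replica storage; the function names are invented.

```python
import hashlib

def file_hash(data):
    """File-level hash recorded at ingest and recomputed by the scanner."""
    return hashlib.sha256(data).hexdigest()

def scan_replicas(replicas, recorded_hash):
    """Partition replica locations into (intact, corrupted) lists by
    recomputing each copy's hash; `replicas` maps location -> stored bytes."""
    intact, corrupted = [], []
    for location, data in replicas.items():
        if file_hash(data) == recorded_hash:
            intact.append(location)
        else:
            corrupted.append(location)
    return intact, corrupted
```

A repair step would then copy from any intact location to a fresh device until the replication policy's copy count is restored.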
The software layer would provide services such as these:

- Naming of files, and mapping names to the locations where files are stored;

- Routing file access requests to the appropriate network-connected store, incorporating newly added stores, and providing the necessary "drivers" for new classes of network storage services;

- Access control;

- Redundancy control, as described above (maintaining redundant copies, scanning for corrupted files, refreshing old media, removing old storage from service);

- An audit log, used to track all changes to the file system;

- Performance measurement, used to determine the load placed on the file system, its internal overhead (e.g., for integrity auditing or media refresh), occurrences of faulty media, and so on;

- Management processes, used to modify configurations when new storage is added to the federation or to direct that old storage should be evacuated and abandoned, etc.

The software layer insulates the ERA clients of the file system from the various implementations used in the federation. Implementations can be replaced by changing at most a driver in the software. The Storage Request Broker (SRB), used in the SDSC demonstrations, is one example of such a distributed file system, but the technology is quite common.20

19  Digital computers and their storage devices were unknown in 1903!

20  The federated file system model is a mature, well-understood technology. Multiple implementations exist, of which the SRB is only one example. The research community is exploring new technologies for decentralized storage, location, and retrieval of information that may offer additional capabilities in the future. But NARA should focus today on more mature technologies that enjoy commercial support and are likely to be around for a while.
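A minimal sketch of such a software layer follows, assuming an in-memory dictionary stands in for each network-connected store. A real federation would add drivers, access control, and audit logging; the class and store names are invented.

```python
class Federation:
    """Toy software layer: presents several network-connected stores as a
    single file system, maintaining replicas according to a simple policy."""

    def __init__(self, copies=2):
        self.copies = copies
        self.stores = {}   # store name -> {file name: bytes}
        self.catalog = {}  # file name -> store names holding replicas

    def add_store(self, name):
        """Attach a newly procured store to the federation."""
        self.stores[name] = {}

    def put(self, fname, data):
        """Write the required number of replicas and record their locations."""
        targets = list(self.stores)[: self.copies]
        for store in targets:
            self.stores[store][fname] = data
        self.catalog[fname] = targets

    def get(self, fname):
        """Route a read to any store still holding an intact replica."""
        for store in self.catalog.get(fname, []):
            if fname in self.stores.get(store, {}):
                return self.stores[store][fname]
        raise FileNotFoundError(fname)
```

Clients see only `put` and `get`; which store satisfies a request, and how many replicas exist, is entirely the federation layer's concern.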

Different storage policies may apply to different parts of the distributed file system. Working storage for both ingest and access modules may be provided by part of the file system federation. Because working storage need not be retained for decades, some of the redundancy, auditing, and media refreshing policies that are required for the long-term storage will not apply to the working stores.

The distributed file system may use COTS products to provide network-connected storage. Note that some of the requirements and policies of the distributed file system interact with properties of the network storage—e.g., a conventional network-accessible file system product will not be able to identify all files that should be moved when a corrupted file is detected. Modern file systems may have redundancy techniques built in (e.g., RAID21) that interact with redundancy policies of the federation. These issues will have to be carefully addressed to determine how COTS products can be used in the file system.

Storage must be geographically distributed in order to prevent catastrophic loss of data.22 One possibility is that remote components are nevertheless managed as part of a single file system. Since one of the key reasons for geographical diversity is to guard against total failure of a site, it will be important that the entire file system can survive loss of a site. Managing replication to avoid single points of failure is one of the jobs of the federation software.

Note that most properties of the file system apply to file systems required for digital libraries. There are opportunities for NARA to collaborate with others to arrive at common specifications and implementations. The file system should be designed without knowledge of the data model, so the file system implementation can be shared even if the data model is not.

File System Performance Requirements

The file system must be designed to meet the scale and performance requirements that the ERA will face.
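The redundancy-control service mentioned above—scanning for corrupted files and repairing them from replicas—can be illustrated with a checksum audit. This is a minimal sketch assuming checksums are recorded in a manifest at ingest; the data structures and function names are invented for illustration.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def audit_scan(replicas, manifest):
    """Scan every replica of every file against the checksum recorded at
    ingest; repair corrupt copies from a surviving good replica.

    replicas: dict mapping file name -> list of byte strings (the copies)
    manifest: dict mapping file name -> expected SHA-256 hex digest
    Returns the names of files with no intact copy (e.g., for restoration
    from a remote site)."""
    unrecoverable = []
    for name, copies in replicas.items():
        good = [c for c in copies if checksum(c) == manifest[name]]
        if not good:
            unrecoverable.append(name)
            continue
        # Overwrite corrupt copies with a verified one.
        replicas[name] = [good[0] if checksum(c) != manifest[name] else c
                          for c in copies]
    return unrecoverable
```

Note that this kind of audit is exactly what a conventional COTS file system does not do on its own, which is why the federation layer must implement it.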
Examples of some of the file-system-specific performance metrics required to guide design are these:

- Size. What is the target size of the initial repository? How many bytes? How many files? How will each of these scale over time?

- Bandwidth. How much bandwidth will be necessary to support ingest, access, file scanning, and media refresh?

- Refresh time. How long will it take to copy or recreate the entire archive? This parameter is important, because if the file system becomes so large that it takes 5 years to copy and the refresh cycle needs to be repeated every 4 years, the design will not work.

- Ingest rate. At what rate will new records be incorporated into the repository? Note that the geographic redundancy requirement means that the ingest rate heavily influences the bandwidth required for communication with remote storage, as new data are copied to remote sites.

21   The acronym originally referred to “redundant array of inexpensive disks,” but today it commonly refers to “redundant array of independent disks.”

22   The only other alternative is to make backup tapes and carry them far offsite. Although tape backup is simpler than a hierarchical storage management system using tape, a remote file system is even simpler because it avoids tapes altogether.
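The refresh-time constraint lends itself to a simple feasibility calculation. The sketch below uses assumed, purely illustrative numbers (archive size and sustained bandwidth), not NARA estimates.

```python
def refresh_time_years(archive_petabytes, usable_bandwidth_MB_per_s):
    """Years needed to copy the whole archive at a sustained bandwidth.
    (Illustrative model only; a real estimate must also budget for ingest,
    access, and audit traffic competing for the same bandwidth.)"""
    total_MB = archive_petabytes * 1e9       # 1 PB = 1e9 MB
    seconds = total_MB / usable_bandwidth_MB_per_s
    return seconds / (365 * 24 * 3600)

# Example with assumed numbers: a 10 PB archive refreshed at a sustained
# 100 MB/s would take roughly 3.2 years -- uncomfortably close to a 4-year
# refresh cycle, signaling that more parallel bandwidth would be needed.
```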

- Response time. Must the system support real-time queries? If so, what are reasonable response time targets?

- Life span. At what rate are components expected to be replaced (particularly the storage devices themselves)? What is the unit of replacement? What is the impact of replacement on performance?

Performance requirements for access are harder to anticipate than those for ingest and storage. Although the conventional model for access is a user conducting a search and reading records on a computer screen, it is also likely that users in the future will want to run statistical queries, perform data mining and automated cross-referencing and correlation, and do other things that a future access system may or may not support and which could have significant bearing on an access system’s performance requirements. Moreover, it is likely that access will involve creating full-text indexes of collections, which imposes an additional load on the long-term file system.

These targets may eliminate certain technology choices (for example, access performance requirements might eliminate the possibility that tape could be used as a primary preservation medium). They are essential to guide the implementation of software that controls the federation—for example, the design of the naming system is quite sensitive to anticipated system size.

Selecting Storage Media

Presently, NARA stores most of its electronic records using off-line tape storage; this is also the approach used in the SDSC demonstrations. For new systems, disks are becoming the preferred storage choice. Instead of storing files on tape and copying and storing backups offsite, digital archives are kept on geographically separated disk replicas. Data are transferred to other locations either over a network connection or by shipping disks or servers containing the files offsite.
While other storage media, such as optical disks, have been considered for long-term preservation, they suffer from many of the same drawbacks as tape. Disks have a number of advantages compared with tape, including these:

- Lower cost. The overall system costs of tape and disk storage are roughly equal, but the cost per byte of disk storage is declining faster than that of tape. (Cost comparisons of disk and tape are complicated by debates about what to count, especially the human support staff required.) Projections favor disks in coming years.

- Volumetric density. Disks take less physical space to store the same amount of data. Densities are improving by a factor of about 2 every year.

- Fast access. Disks allow fast access, suitable for interactive applications. By contrast, tape processing incurs a variety of access delays. Tapes must be mounted (either by a robot or manually) and, once mounted, must be read sequentially.

- Less complexity. Disk-only file systems are much simpler than hierarchical storage management schemes that must manage the migration of data between disk and tape. Tapes also require large file sizes to be efficient, which results in complex file-aggregation mechanisms. If accessing a record requires reading files from several disks or tapes because they were created at different times (e.g., metadata or derived forms that were added long after the record was ingested), disk-only systems will perform far better than tape systems. A tape system could be

the problem-solving skills and craftsmanship required. NARA will probably need to devise multiple ingest processes and associated software to cope with the variability of records presented to it. Designing ingest processes depends critically on the kinds and quantities of records to be processed. This is why it is important that NARA inventory digital records waiting to be ingested and survey the records that agencies will soon pass on to NARA.

In the past, a large fraction of the ingest effort has been devoted to dealing with the problems of extracting data from physical media, such as floppy disks of great age that have suffered untold abuses. These media-related problems will subside in the future for several reasons: (1) data are increasingly stored on hard disks, whose reliability and capacity have steadily improved; (2) because a high-speed network interface is standard on computers large and small, moving data to modern media is greatly simplified; and (3) computer owners—and especially professional IT departments—can and commonly do move important data to new media to ensure its continued accessibility. Where records are stored on off-line media such as tape and ingest is deferred for years after record creation, media issues will persist.

Unfortunately, a far more serious and growing problem confronts the digital archivist: a profusion of data types, many of which are very complex. Simply checking that records are represented in their claimed data type can be difficult—it may require running hard-to-find software on rare computer systems. While “data dumps” may have sufficed in the past to distinguish ASCII from EBCDIC databases, checking the integrity of a modern word processing file is much more difficult—it may require running the word processor software that created the record.25 Moreover, many of these data files may contain hidden references to external data that should be considered for archiving as well.
Another important problem is that of data management—determining which digital files represent records that should be preserved. When many files are saved in an ad hoc fashion on a government computer, such as the personal computer of a White House staffer, and no formal records-management process is used, the files must be sifted to find records to archive. For example, NARA was presented with several hundred hard disk drives from the White House containing the digital record of the Clinton administration. Extracting files from these disks was relatively easy, but discarding system files and duplicates recorded on several machines required additional steps.26

Although experience offers little advice on general strategies for ingest, the committee offers some suggestions:

- Try to reduce the variability of records scheduled for preservation. Identifying preferred data types and encouraging creating agencies to adopt these as native data types is one approach. Advance awareness of new data types being presented to NARA can guide adoption of new preferred derived forms and development of associated software. (The ERA will, of course, still have to be capable of ingesting the full variety of data types used in the federal government, which will roughly correspond to the full variety of data types in use more broadly.)

- Automate common cases, for example by creating scripts that carry out the transformations and checks required. For example, since it appears likely that e-mail records (with attachments) will be an increasingly common record form, it makes sense to invest in processing as much e-mail as possible automatically.

- Develop a plan for handling metadata. Ephemeral metadata must be captured as part of the ingest process. It may be easy to capture some derived metadata as well and store the results along with the archival record. However, sophisticated metadata extraction, text summarization, and searching need not be part of an initial system. Such processes can be deferred until later generations of the system without losing information, and better technology to automate these processes may be available in the future.

- Develop explicit work flow designs. Records may need to be reviewed by an archivist or other non-IT professional in order to make essential dispositions. For example, presidential records need to be categorized as government, political, or personal and handled accordingly. Some records may need to be reviewed for security or other access-control issues and then tagged with suitable access-control metadata. (This work flow requirement has implications for system design; see below.) The need to accommodate diverse work flows may be the most challenging aspect of designing ingest software. Both the work flows and the software must be able to evolve over time.

- Design work flows, software, and auditing processes so that the integrity of records is guaranteed. Ideally, the creating agency should be able to certify that the records, as they appear in the archive, are genuine.

25   It would be useful and relatively easy to save the validation software at ingest time.

26   Because these files come from the White House, NARA also has to examine each record to determine whether it should be classified as personal, political, or governmental, because the three types of records require different treatment.
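Automating the e-mail case might begin with a script like the following, which uses Python's standard email library to separate headers, body, and attachments. The set of metadata fields retained is an assumption for illustration; a real ingest script would follow the ERA data model.

```python
from email import policy
from email.parser import BytesParser

def split_email(raw_bytes):
    """Separate an RFC 822/MIME message into the pieces an ingest script
    might store: header metadata, body text, and attachments.
    (Sketch only; the record layout below is invented for illustration.)"""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    record = {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        "body": msg.get_body(preferencelist=("plain",)).get_content(),
        "attachments": [],
    }
    for part in msg.iter_attachments():
        # Each attachment is itself a file to be checked against its
        # claimed data type and stored alongside the message.
        record["attachments"].append(
            (part.get_filename(), part.get_content_type(),
             part.get_payload(decode=True)))
    return record
```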
- As noted above, record in the archive the details of processes used to ingest records so that future researchers are able to understand any processing or translation applied to the records during ingest.

Finally, NARA will need to establish crisp guidelines governing modifications that staff make to records as they are ingested. Vigilant checking for errors or inconsistencies in data might lead to a desire to fix errors or fill in gaps in data. Such actions would not be tolerated for paper records and should not be permitted for digital records. It is thus important to document what actions are taken with respect to media renewal or creating preferred format types. If any changes occur in the underlying bit stream of digital records (without regard to their impact on rendering), the preservation documentation should call attention to this. If data are missing and the agency cannot locate the missing data, then the documentation should call this to the attention of users. Digital records are susceptible to accidental or deliberate alteration; ingest processes should pay attention to end-to-end integrity assurance.

The highly variable nature of ingestion processes will, as noted above, probably result in an evolving set of ad hoc processes and software. However, there are a few system engineering ideas that could be applied to ingestion:

- The ingestion of a set of records (e.g., a collection) by possibly ad hoc processes should create a set of files conforming to the ERA data model, stored in such a way that the normal ERA access processes can retrieve them. These files might be considered to be provisionally entered into the archive. They are stored and accessed using the standard file-system interface, but they have yet to be formally released into the archive.27 In their provisional state, the files can be accessed normally to be checked by the creating agency or the NARA ingestion staff or to be reviewed by an archivist to assign record-specific properties such as access controls. Because the files are provisional, they may be deleted if errors are found. Finally, the provisional files are formally entered into the archive.

- A set of records ready to be entered into the archive (the provisional files above) should be checked for consistency using an automated checker to ensure that they conform to the data model. For example, metadata should be checked to ensure that essential metadata tags are present, that all files are properly described, etc.

- It may be useful to verify each file using an integrity checker associated with the file’s data type. For example, if one of the files is expressed in an XML encoding with an associated document type definition (DTD) or schema, the checker should verify that the XML file conforms to these specifications (i.e., that it is a valid XML document). Validation provides an opportunity to identify records that may have been garbled at some stage; marking nonconforming documents allows one to identify problems (e.g., when records are exported from the creating agency) and forestall downstream processing errors.

- Although ingestion software may be ad hoc, there are common modules that should be implemented once and shared, e.g., data-type checkers and converters.

The cost of operating the ERA will depend critically on the amount of manual labor required to staff the ingest process. If significant amounts of metadata must be entered and checked by humans or if digital records arrive at NARA in corrupted or incomplete form, the ingest process will bog down and ultimately limit the ERA’s ability to meet its mandate.
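An automated consistency check for provisional records might look like this sketch. The required-tag set is invented for illustration, and the XML check verifies only well-formedness with the standard library; validating against a DTD or schema would require a validating parser.

```python
import xml.etree.ElementTree as ET

# Illustrative set only; NARA's data model would define the real obligatory tags.
REQUIRED_TAGS = {"record-id", "title", "creator", "date", "data-type"}

def check_provisional(metadata: dict, xml_files: dict):
    """Run automated consistency checks before provisional files are
    formally released into the archive. Returns a list of problems; an
    empty list means the record set may be released."""
    problems = []
    missing = REQUIRED_TAGS - metadata.keys()
    if missing:
        problems.append(f"missing metadata tags: {sorted(missing)}")
    for name, text in xml_files.items():
        try:
            ET.fromstring(text)   # well-formedness only; full DTD/schema
                                  # validation needs a validating parser
        except ET.ParseError as e:
            problems.append(f"{name}: not well-formed XML ({e})")
    return problems
```

Flagging rather than rejecting lets an archivist decide whether a nonconforming record reflects export-time damage or a legitimate variant.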
Providing the right kind of user interfaces or automation for streamlining this process will depend on the details of the human processing required. This is another area where estimates based on records already produced by government agencies are required to design the ERA.

ACCESS

Accessing the ERA should be much like accessing a digital library: It requires a means to find a record, retrieve the record, and possibly convert it into another data type for delivery to the requestor. Digital libraries today routinely deliver their content via Web browsers or download using a number of standard presentation formats. Although these systems are largely custom built, commercial software components are increasingly used.

Perhaps the most vexing problem facing the ERA is that the performance required for accessing offerings, especially online access, is unknown. Some collections will be used a great deal; others will not. Moreover, the design and deployment of access software changes as demand changes. Doubtless access modules will need to be redesigned several times as access statistics become known or change.

For collections that are accessed frequently by online users, it will probably be advisable to “stage” the files by copying them from the long-term file system to a separate file system

27   The SDSC demonstrations exploited this idea to great benefit.

designed for high-performance access. As access demand for the collection grows, more access modules can be deployed on available computers, all working from the same set of staged files. The staged files act as a cache of the files held in the long-term file system. (Such staging, by decoupling the archival copy of the record from potentially malicious users, helps protect record integrity.)

Some kinds of access may require preprocessing an entire collection of records (or more). For example, full-text search software usually builds and saves an index to a corpus in order to offer faster searching than would be possible by simply scanning all the contents for each search request. Indexing requires that records be rendered in an appropriate format. Software that summarizes text or automatically extracts metadata (e.g., names of people or businesses cited in news feeds) likewise builds a database of extracted information. Even simple searches based on metadata of collections or records will require building an index.28 These indexing and access techniques will need to save extracted information temporarily on a separate file system—not part of the long-term file system.29 While these indexes can be deleted and rebuilt if necessary, dismal performance will result if a large portion of the archive must be scanned in order to rebuild one or more indexes. As a result, techniques are required for storing and updating these files incrementally, as changes are made to the archive.

NARA will need to set expectations for access to ERA records. In preparing this report, the committee has assumed that users will either receive a digital file representing the record (in its native data type or in one of the available derived forms) or be presented a visual representation of the record.
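The staged-file idea can be sketched as a small read-through cache in front of the long-term file system: the archive is consulted once per file, and all subsequent access requests are served from the staging area. The eviction policy and capacity shown are illustrative assumptions.

```python
class StagingCache:
    """Sketch of a staging area: access modules read staged copies, never
    the archival originals. A bounded least-recently-used policy keeps hot
    collections on the high-performance file system."""
    def __init__(self, archive_get, capacity=1000):
        self.archive_get = archive_get   # callable: name -> bytes (long-term FS)
        self.capacity = capacity
        self.staged = {}                 # name -> bytes, insertion-ordered

    def read(self, name):
        if name in self.staged:
            self.staged[name] = self.staged.pop(name)  # mark recently used
            return self.staged[name]
        data = self.archive_get(name)    # one copy out of the archive...
        if len(self.staged) >= self.capacity:
            self.staged.pop(next(iter(self.staged)))   # evict least recent
        self.staged[name] = data         # ...serves all later requests
        return data
```

Because the staging area holds disposable copies, it needs none of the redundancy, auditing, or refresh machinery of the long-term store.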
The committee has not addressed the much more difficult problems of presenting online access to software that can manipulate one or more records (or providing access to records that are themselves executable).

By way of example, consider the variety of online access methods offered today by the Census Bureau for the 2000 census data.30 Some offerings are simple tables, presented visually. Others are responses to trivial queries against an underlying database. Some are sophisticated statistical extraction and calculation applications working from a census database. What kinds of access to these data could ERA users expect decades hence? Probably the first, and perhaps the second (it is not hard to provide a simple search mechanism for tables). Offering complex applications, however, would not only place a computational burden on the ERA but also require emulation, porting, or some other technique to allow today’s software to run in the far future. NARA will have to make quality-of-service choices about whether and when to invest in such capabilities. It may be adequate to allow users to simply download all the data and process it themselves. Alternatively, if there is sufficient public demand, the Census Bureau itself might take on the task of providing such ongoing access to old census data sets.

28   The SDSC demonstrations used a relational database to record a subset of metadata information used for finding records.

29   Note that some subtle aspects of access control arise when indexing or extracting data from collections. A user not authorized to view a record must not see any extracts from that record.

30   See <http://www.census.gov/main/www/access.html>.

Finding Aids and Search

Over the lifetime of the ERA, access methods can be expected to change. In the last few years, for example, impressive Internet search services have emerged, and users now expect full-text searching to be available for any large collection.

The traditional method of access to archives uses finding aids that describe broad categories of records, for example at the series level. Finding aids are usually based on controlled-vocabulary metadata, for example, MARC AMC and Encoded Archival Description (EAD).31 Controlled-vocabulary descriptors are assigned manually to information objects, usually at ingest, which makes ingest a more labor-intensive process. NARA has invested considerable resources in describing records series in accordance with uniform practices.32

The traditional method of cataloging information and providing access is designed for archives of physical media (e.g., paper, pictures, movies). Different methods of cataloging and access are possible with electronic records, and NARA should include these methods in its planning.

Recent practice in the use of controlled-vocabulary metadata uses automated or semi-automated assignment of descriptors to information objects. One example is text categorization techniques used to assign subject codes to newswire articles,33 diagnostic codes to patient discharge summaries, and grades to practice GMAT exams.34 Studies show that current techniques assign descriptors as accurately as humans for some tasks. Because it is not limited by human labor, text categorization can be applied inexpensively, so it could be routinely used in an ERA. It can also be used to provide something that the traditional finding aids do not: by assigning metadata to individual records, which would be prohibitively expensive if done manually, NARA could provide significantly enhanced access at the individual record level.
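A deliberately tiny sketch of automatic descriptor assignment is shown below. Real systems use statistical classifiers trained on labeled examples, as in the text-categorization work cited; the hand-written descriptor vocabulary and cue sets here are invented solely to illustrate the input/output behavior.

```python
# Toy descriptor vocabulary and cue terms -- assumptions for illustration,
# not NARA's actual controlled vocabulary.
DESCRIPTORS = {
    "AGRICULTURE": {"farm", "crop", "harvest", "livestock"},
    "DEFENSE":     {"military", "troops", "weapons", "defense"},
    "TRADE":       {"tariff", "export", "import", "trade"},
}

def assign_descriptors(text, min_hits=2):
    """Assign every descriptor whose cue terms appear at least `min_hits`
    times in the record. A trained classifier replaces these hand-written
    cue sets with statistics learned from labeled examples."""
    words = [w.strip(".,;:") for w in text.lower().split()]
    assigned = []
    for label, cues in DESCRIPTORS.items():
        hits = sum(1 for w in words if w in cues)
        if hits >= min_hits:
            assigned.append(label)
    return sorted(assigned)
```

The point of the sketch is that assignment is pure computation over the record's content, so it scales with computing power rather than with cataloging staff.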
Finding aids based on full-text search have been developed by the digital library and information retrieval research communities,35 and are being adopted commercially, for example by WestLaw.36 Given a query, the available archives are ranked by how well their

31   See, respectively, the MARC Standards home page, at <http://www.loc.gov/marc/>, and the EAD standard’s home page, at <http://www.loc.gov/ead/>.

32   As of 2002, only 20 percent of “NARA’s vast holdings are described in ARC,” NARA’s online catalog system. See <http://www.archives.gov/research_room/arc>.

33   Yiming Yang and Xin Liu. 1999. “A Re-examination of Text Categorization Methods.” Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99), pp. 42-49.

34   Leah S. Larkey. 1998. “Automated Essay Grading Using Text Categorization Techniques.” Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98), Melbourne, Australia, pp. 90-95.

35   Luis Gravano and Hector Garcia-Molina. 1995. “Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies.” Proceedings of the 21st International Conference on Very Large Data Bases (VLDB 1995); L. Gravano, P. Ipeirotis, and M. Sahami. In press. “QProber: A System for Automatic Classification of Hidden-Web Databases.” ACM Transactions on Information Systems; Jamie Callan. 2000. “Distributed Information Retrieval,” in Advances in Information Retrieval, W.B. Croft, ed., Kluwer Academic Publishers, Boston, Mass., pp. 127-150.

36   Jack G. Conrad, Xi S. Guo, Peter Jackson, and Monem Meziou. 2002. “Database Selection Using Actual Physical and Acquired Logical Collection Resources in a Massive Domain-specific Operational Environment.” Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002); Jack G. Conrad and Joanne R.S. Claussen. Forthcoming in 2003. “Early User-System Interaction for Database Selection in Massive Domain-Specific Online Environments.” ACM Transactions on Information Systems.

contents match the query. This type of finding aid supports detailed information needs and information needs that controlled-vocabulary metadata do not anticipate. During the last decade the public has become familiar with full-text search in Web, e-mail, corporate, and personal document databases. It is likely that NARA will eventually be expected to provide full-text search capabilities within its archives. It may also be expected to provide for searches across sets of archives (sometimes called “federated search”), such as across both NARA and presidential library collections.37 For common record data types, full-text search can be provided inexpensively, with little manual intervention, using commercial software.38

Full-text search is merely the simplest form of content search, which may include searching images, sounds, animations, videos, hypermedia structures, etc. At present, full-text retrieval is fairly mature, while the technology for content-based retrieval of nontextual materials is still immature but developing quickly. Extensions of simple text search are, however, inevitable and will no doubt be demanded by future users if NARA’s holdings evolve to include significant multimedia holdings.

The standard method of cataloging archives has been manual assignment of controlled-vocabulary metadata, and NARA may initially face some resistance in adopting alternatives.
However, a fairly large body of research comparing full-text and controlled-vocabulary methods over a 35-year period indicates that (1) each method works “best” for particular types of information needs, (2) the two approaches provide about the same “average case” effectiveness, and (3) a combination of the two approaches is the most effective solution.39 The last conclusion is reflected in the National Library of Medicine’s PubMed system, which uses both full-text and controlled-vocabulary indexing.40

The cataloging and access methodologies used for physical media (e.g., paper, pictures, movies) are labor-intensive and expensive. Newer, content-based cataloging and access methods designed for digital resources are, in contrast, compute-intensive but increasingly inexpensive as computing becomes cheaper. NARA can exploit this property of electronic records to reduce its costs, to improve its ability to ingest information quickly, and to improve the quality of access services it provides. Traditional cataloging and access methodologies will continue to be needed, but NARA will almost certainly never have sufficient resources to apply them to all of the electronic records worth archiving. Techniques for automatic metadata assignment and/or building indexes for content search

37   Luo Si and Jamie Callan. 2002. “Using Sampled Data and Regression to Merge Search Engine Results.” Proceedings of the Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, Tampere, Finland, pp. 19-26.

38   If full-text indexing is to be provided, engineering calculations should include the storage required for the index, which can be anywhere from 50 to 300 percent of the size of the raw data, depending upon the capabilities one wants to offer. Of course, this information would not be stored in the archive, but rather in the working storage associated with an access system. Also, indexes can be regenerated, so the number of index replicas required is driven by such considerations as performance, not reliability.

39   Cyril W. Cleverdon. 1967. “The Cranfield Tests on Index Language Devices.” Aslib Proceedings 19:173-192, reprinted in Karen Sparck Jones and Peter Willett, eds. 1997. Readings in Information Retrieval. Morgan Kaufmann, San Francisco; T.B. Rajashekar and W.B. Croft. 1993. “Combining Automatic and Manual Index Representations in Probabilistic Retrieval.” Journal of the American Society for Information Science 46(4): 272-283.

40   See <http://www.ncbi.nlm.nih.gov/PubMed/>.

can be applied during ingestion or as a part of the access process. If they can extract essential metadata—i.e., elements that are deemed by NARA to be obligatory for every record—they should be used as part of the ingest process. However, since these techniques are improving rapidly, it is probably wise to defer their broad use to the time of access, when more modern techniques will be available. Metadata extraction or index generation can be applied, if desired, as a collection is staged from the archival file system.

At a minimum, the ERA and NARA overall should provide for full-text searching of all finding aids for its holdings regardless of physical format or data type. The Archival Research Catalog is a step in the right direction, but it is incomplete, especially with regard to electronic records.41

Access to Underlying Digital Files

While most users will want access to screen presentations of records or to modest numbers of digital files represented in common data types (e.g., word-processor documents, database tables), some researchers can benefit from access to the elements of the underlying data model used by the ERA. For example, researchers who reverse engineer obscure data types, explore automatic metadata extraction, or devise new methods for content searching (especially on difficult data types such as images, video clips, or executable files) will probably want access to the files stored in the archive without mediation or modification (subject, of course, to suitable access controls). These researchers will also make use of data type specifications, metadata definitions, and other information available from the archive that describes the details of the data model used to store records.

SECURITY AND ACCESS CONTROL

Security will need to be carefully designed into the NARA system to address all of the usual concerns about unauthorized access to systems and vulnerability to denial-of-service attacks or to natural or manmade disasters.
These security concerns must be addressed from the very beginning of system design. As an illustration of this principle, consider the basic question of whether the data in the ERA should be stored in cleartext or should be encrypted to prevent inadvertent disclosure of restricted information to the staff and vendors who are in frequent contact with the archive. If stored in cleartext, it is virtually certain that there will be one or more instances of compromised data over the (very long) life of the ERA.42 However, archivists are reluctant to encrypt data for fear that future generations might lose the key and thus all access to the data. Note that either approach entails risks.

To decide whether it is worth the cost and operational overhead (especially to ensure that

41   As of this writing, only one electronic data file series is included in the ARC.

42   For example, reports surfaced in late 2002 of the theft of 500,000 medical records by stealing hard disks from a Defense Department contractor. See Associated Press, 2003, “Military, Family Medical Files Stolen,” Washington Times, January 1. Available online at <http://www.washtimes.com/national/20030101-94263751.htm>.

keys are not lost) of encrypting the data is a difficult question that requires careful analysis. A cryptographically protected archive is a much more difficult design than a cleartext one, since one needs to worry about key storage and distribution, which pieces of the system operate encrypted and which in the clear, which pieces need to have access to keys, the performance of decryption, and so forth. These decisions will have many ramifications in the details of the design, e.g., in the module interfaces. On the other hand, converting from cleartext to encrypted storage would require recopying the entire archive. This is a massive undertaking, especially because it would take quite a while, and one does not want archive operations (for either ingest or access) to be hampered for such a period. Accordingly, this decision is better made at the outset than retrofitted later on.

Addressing this and other related security issues is part of a comprehensive system design. There are reasonably well-understood methodologies for doing the threat analysis and working the results into system requirements and design. Indeed, in many respects, the ERA’s security issues can be handled by straightforward application of engineering best practices. These issues are not discussed in detail in this initial report, and the reader is referred to the extensive literature on computer system security.43

In contrast to many digital libraries, the ERA must control access to many of its records. Access controls comprise three basic ingredients:

- A way to authenticate users who wish access, i.e., to verify their identity.44 What are the requirements for authentication?45 Does every user need to be identified individually, or as a member of a class, e.g., “Internet visitor”?46

- Properties of individual records or collections of records, recorded as metadata with the records, that indicate what kind of access is permitted.
-   A set of rules that determine whether a certain user may access a certain record. The rules may cover large classes of records, e.g., “Allow access to a record labeled ‘by citizen owner’ only if the value of the metadata tag named ‘social security number’ matches the property ‘social security number’ of the authenticated user.” Or the rules may be very specific, e.g., “John Wright has access to record NARA/ERA/Vietnam-72/104567.”

43   An overview of cybersecurity issues is provided in Computer Science and Telecommunications Board, National Research Council, 2002, Cybersecurity Today and Tomorrow: Pay Now or Pay Later, National Academy Press, Washington, D.C. An in-depth examination of trustworthiness issues and research challenges is provided in CSTB, NRC, 1999, Trust in Cyberspace, National Academy Press, Washington, D.C.

44   NARA is understandably reluctant to tackle the problem of authenticating all the principals who may have created government records (for example, by recording digital signatures with records and keeping enough information to check signatures many years later). However, there is no reason to avoid authenticating users of an archiving system. Experience elsewhere in the government, such as the Department of Defense’s DoD Common Access Card program, may help arrive at an appropriate design.

45   These and related issues are discussed in National Research Council, 2003, Who Goes There? Authentication Through the Lens of Privacy, The National Academies Press, Washington, D.C.

46   For access to public records, privacy considerations may mean that a detailed audit trail should not be retained, especially since there is no risk of a user damaging or stealing the only copy of an electronic record. So there may be a requirement to authenticate users to classes such as “general public” or “Internet visitor” rather than for individual identification.
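The interplay of these three ingredients can be sketched in code. The Python fragment below is a minimal illustration only: the record identifiers, metadata tags, and user attributes are hypothetical, not drawn from any NARA schema, and a real system would evaluate rules from a stored policy rather than hard-coded functions.

```python
# Illustrative sketch of metadata-driven access rules. All labels, tags,
# and identifiers here are hypothetical, not NARA's actual data model.

def citizen_owner_rule(user, record):
    """Class-wide rule: allow access to a record labeled 'by citizen owner'
    only if the user's authenticated SSN matches the record's SSN tag."""
    if record.get("access_label") != "by citizen owner":
        return None  # rule does not apply to this record
    return user.get("social security number") == record.get("social security number")

def specific_grant_rule(user, record):
    """Very specific rule: a named user has access to one named record."""
    if record.get("id") != "NARA/ERA/Vietnam-72/104567":
        return None
    return user.get("name") == "John Wright"

RULES = [citizen_owner_rule, specific_grant_rule]

def may_access(user, record):
    """Grant access if any applicable rule grants it; deny by default."""
    for rule in RULES:
        if rule(user, record):  # rule applies and grants access
            return True
    return False

# A hypothetical authenticated user and a hypothetical record.
user = {"name": "Jane Doe", "social security number": "078-05-1120",
        "class": "general public"}
record = {"id": "NARA/ERA/Example/000001",
          "access_label": "by citizen owner",
          "social security number": "078-05-1120"}

print(may_access(user, record))  # matching SSN -> True
```

The deny-by-default loop reflects the usual posture for restricted archives: a record is inaccessible unless some rule affirmatively grants access.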

In the ERA, these access rules will be complex and may change owing to the passage of time, specific events (e.g., the death of the person to whom the record belongs or refers), or legislation or court orders. The committee saw no evidence that NARA had begun to formalize access controls in a way that could reasonably be automated in the ERA. Perhaps access controls for NARA’s existing archives are suitable and can be easily codified for the ERA, but the committee did not see evidence that this had been done, and indeed heard a good deal that suggested otherwise, including extensive use of, and indeed reliance upon, human review just prior to the delivery of physical records from the existing archives. (It is not clear whether such review will be required in the ERA.)

One very real complication is that substantial numbers (though again, the committee has no quantification here) of records ingested into the ERA will be classified at various levels (in the official sense of government classification: SECRET, TOP SECRET, etc.). In briefings, the committee was told that classified and other national-security information would be physically segregated in separate instances of the ERA reserved for that purpose (that is, using “air gaps”); that it is not a requirement of an ERA system that it be able to hold both classified and unclassified information; and thus that the system would not have to attempt multilevel secure operation. Nonetheless, the classification of these records triggers a host of highly specific and highly structured design and operational requirements, which the committee has not examined at all. Here the committee raises only two issues:

-   Does NARA intend to actually build multiple complete and independent ERA systems operating at different levels of classification? If so, how many and of what relative sizes, and what constraints will this put on the ERA procurement strategy? The committee has seen no details on the practicality of this approach.

-   How will declassification be handled? As declassification occurs, material may need to flow from the various classified versions of the ERA to the unclassified one. In addition, provenance and source metadata will need to be propagated from one such system to another. Situations may also occur where the metadata are unclassified (or classified at a lower level than the actual records), and provision will need to be made for these situations. There is also the situation where the metadata can be public while the documents described are awaiting declassification review (for which the backlog is often very long).

As explained above, access controls depend in part on a way to authenticate users. NARA should think through how it wishes to authenticate users of the system, whether all users will need to be authenticated, and how attributes are tied to registered users. NARA should not invent a new technology for authenticating users to a computer system (there are several adequate schemes available already) but should determine how the authentication system is administered and how authenticated user identities are tied to access control rules. For example, how does a new user register for access to the system? How does an existing user apply for augmented access?47

47   For a discussion of authentication policies, particularly regarding privacy issues, see National Research Council, 2003, Who Goes There? Authentication Through the Lens of Privacy, The National Academies Press, Washington, D.C.
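The administrative questions just raised (registration, user classes, augmented access) can be made concrete with a small sketch. The registry structure, class names, and approval workflow below are the committee report's themes rendered as hypothetical code, not an actual NARA design.

```python
# Minimal sketch of a user registry tying authenticated identities to
# access classes and attributes. Class names and the approval workflow
# are hypothetical illustrations, not a NARA specification.

REGISTRY = {}  # authenticated identity -> profile
ADMINS = {"archivist@nara.example"}  # hypothetical administrator identities

def register_user(identity, access_class="Internet visitor", attributes=None):
    """New users start in a low-privilege class such as 'Internet visitor'."""
    REGISTRY[identity] = {"class": access_class,
                          "attributes": dict(attributes or {})}

def grant_augmented_access(identity, new_class, approved_by):
    """Augmented access takes effect only with an administrator's approval."""
    if identity not in REGISTRY:
        raise KeyError("unknown user: " + identity)
    if approved_by not in ADMINS:
        return False  # request denied; not approved by an administrator
    REGISTRY[identity]["class"] = new_class
    return True

register_user("visitor17")
register_user("jwright", attributes={"clearance": "none"})
grant_augmented_access("jwright", "registered researcher",
                       approved_by="archivist@nara.example")
```

The point of the sketch is the separation of concerns the text calls for: the authentication scheme itself can be off the shelf, while NARA's design effort goes into how identities, classes, and attributes are administered and linked to access rules.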

INTEGRITY OF RECORDS

Ensuring the survival, integrity, and authenticity of the records (and accompanying metadata) entrusted to it is at the core of NARA’s mission. In a digital environment, achieving these goals becomes considerably more complex and nuanced than it has been in an environment of paper records; designing appropriate measures requires an interdependent mixture of techniques from archival practice on the one hand and computer science, cryptology, and computer security on the other. The committee has not comprehensively investigated these questions in preparing this report, but it is clear that they need much more extensive structural consideration than they seem to have received to date.48

One set of questions pertains to the transfer of records from agencies to NARA and their ingest into an ERA system. Current NARA practice, as explained to the committee, consists of verifying that the file sent by the agency and the file received by NARA have the same length (the same number of bits); this falls well short of available and commonly used tools that provide much more effective ways of verifying integrity. Human intervention to examine records to see whether they are complete, match expected formats, and generally “make sense” (the approach adopted in NARA’s current process for ingesting databases) is insufficient to detect bogus records and does not scale up to large volumes of records; automated validation and ingest tools are required. Common best practice is to use checksums to establish that records have not been tampered with, together with some form of authentication of the record source. For example, transfers could be audited using the following procedure: the ingest protocol for a set of records would include a step in which the donating agency provides a list of the records to be given to NARA, each one accompanied by its hash checksum,49 and a later step in which NARA verifies that each of the records it has received does indeed have the same (recomputed) checksum as the one provided in the earlier step.

Once the files have entered the ERA (along with metadata that explain where they originated and what measures were taken to authenticate the source and to validate the file), it will be necessary to design various processes within the ERA to ensure that the bits are not corrupted while in the custody of the ERA. Checksums are the obvious approach here and can also be used to help protect against software errors.

The survival of records once they have entered the ERA depends on several factors. One is the use of redundant, geographically distributed storage to allow records to survive various sorts of physical catastrophes, accidental or deliberate. The committee has seen little discussion of what the design parameters need to be for redundancy and distribution of storage, and whether these parameters will vary from one class of records to the next. A detailed threat and requirements analysis is needed in this area.

The integrity of records once within the ERA also depends on the design and operation of effective computer security measures as part of the ERA to ensure that unauthorized people cannot add, delete, or alter objects within the ERA. Hash checksums, independently maintained, can offer a second line of defense for at least the detection (if not necessarily the repair) of alterations, be they due to attacks or to accidental failures of the types discussed earlier. However, in order to protect against malevolent change, the hash value associated with a digital object must be separately protected, so that an attacker who manages to gain access to change one cannot also change the other. Since a hash value can be written in a relatively small number of digits, one can protect it from change by publishing it in a very public place, such as a classified advertisement in the New York Times (which will, a short time after publication, be captured on microfilm that is distributed to many libraries), or by otherwise depositing the hash value in hundreds of libraries.50

48   One example of a subtle integrity issue that might arise in the future that the committee did not consider is that of executable records. Can the integrity of executables be safely preserved? What if an executable has an expiration date, time bomb, or some other feature that affects interpretation as a function of time?

49   A hash checksum for a file is the output of a cryptographic hash function (such as the function SHA-1, standardized by NIST) when it is given the file as input; the crucial property of such a function is that it is presumed to be computationally infeasible to find any other input that produces the same output. Therefore this output can serve as a characteristic digital fingerprint for the file, which can easily be recomputed at will and checked against the expected value.

50   For an archive that contains millions of digital objects, this idea would lead to purchasing an impractically large number of classified ads. A technique is available for combining the hash values of very large numbers of objects—such as the entire archive—and publishing only a single number that can be computed only by knowing all of the contents. See section 2.4 of Stuart Haber and W. Scott Stornetta, 1997, “Secure Names for Bit-strings,” Proceedings of the 4th ACM Conference on Computer and Communications Security, pp. 2-35, ACM Press, New York, April. The technology is offered commercially by Surety, <http://www.surety.com/solutions.php>.
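The two-step audit procedure and the idea of combining many hashes into one publishable value can be sketched concretely. The Python fragment below uses SHA-256 from the standard library (the footnote cites SHA-1; stronger hashes are now preferred); the manifest format and record names are hypothetical, and the linear hash combination stands in for the Merkle-tree construction of Haber and Stornetta.

```python
import hashlib

def file_checksum(data: bytes) -> str:
    # SHA-256 digest as a hex string: the record's digital "fingerprint".
    return hashlib.sha256(data).hexdigest()

def build_manifest(records: dict) -> dict:
    # Step 1 (donating agency): list each record with its hash checksum.
    return {name: file_checksum(data) for name, data in records.items()}

def verify_transfer(manifest: dict, received: dict) -> list:
    # Step 2 (NARA): recompute each checksum and report any record that
    # is missing or does not match the checksum provided earlier.
    problems = []
    for name, expected in manifest.items():
        if name not in received:
            problems.append((name, "missing"))
        elif file_checksum(received[name]) != expected:
            problems.append((name, "checksum mismatch"))
    return problems

def combined_root(manifest: dict) -> str:
    # Combine all per-record hashes into a single publishable value by
    # hashing them in a canonical (sorted) order. A production system
    # would use a Merkle tree so that individual records can be verified
    # without the whole archive; this simpler combination shows the idea
    # of publishing one number that commits to all the contents.
    h = hashlib.sha256()
    for name in sorted(manifest):
        h.update(name.encode() + b"\0" + manifest[name].encode() + b"\0")
    return h.hexdigest()

# A hypothetical transfer of two records.
sent = {"series-001/rec-0001": b"contents of record 1",
        "series-001/rec-0002": b"contents of record 2"}
manifest = build_manifest(sent)

received_ok = dict(sent)
received_bad = {"series-001/rec-0001": b"tampered!",
                "series-001/rec-0002": b"contents of record 2"}

print(verify_transfer(manifest, received_ok))   # [] -> transfer verified
print(verify_transfer(manifest, received_bad))  # flags rec-0001
```

Note that the manifest itself, like the published root hash, must be protected separately from the records: if an attacker can alter both a record and its manifest entry, the check detects nothing.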