They are routinely used by scientific computing applications: organizing and preserving scientific datasets over long periods is an important requirement for NPACI users.
For its NARA work, SDSC has built a number of demonstrations that use this data-storage infrastructure.1 The archival processing conforms to the OAIS model, with its principal ingest, storage, and access components. The demonstrations have treated a few record collections, building ingest and access functions suitable for each. A few examples follow:
Approximately a million electronic mail messages were ingested. Header fields, such as sender, recipient, date, and subject, were extracted to form metadata for each message. Messages were transformed into an XML representation to explicitly tag the metadata elements. A relational database of metadata allowed easy retrieval of messages based on metadata properties—e.g., all messages from a given sender on a given day. This experiment did not attempt to deal with e-mail attachments and their wide-ranging data types.
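The e-mail pipeline described above (header extraction, XML tagging, and a relational metadata index) can be illustrated with a minimal sketch. It is not SDSC's implementation: the XML element names, the `message_meta` table, the function names, and the sample message are all assumptions made for illustration, and the sketch uses only the Python standard library. Like the experiment, it ignores attachments.

```python
import sqlite3
import xml.etree.ElementTree as ET
from email import message_from_string
from email.utils import parsedate_to_datetime

# Header fields named in the text; the XML tag names derived from them are assumed.
HEADER_FIELDS = ("From", "To", "Date", "Subject")


def message_to_xml(raw):
    """Parse one raw message and tag its header fields and body as XML elements."""
    msg = message_from_string(raw)
    root = ET.Element("message")
    for field in HEADER_FIELDS:
        ET.SubElement(root, field.lower()).text = msg.get(field, "")
    payload = msg.get_payload()
    # Multipart payloads (attachments) are deliberately left out, as in the experiment.
    ET.SubElement(root, "body").text = payload if isinstance(payload, str) else ""
    return root


def create_index(conn):
    """Create the (hypothetical) relational table that holds per-message metadata."""
    conn.execute(
        "CREATE TABLE message_meta "
        "(msg_id TEXT PRIMARY KEY, sender TEXT, recipient TEXT, day TEXT, subject TEXT)"
    )


def index_message(conn, msg_id, xml_root):
    """Copy the tagged metadata of one message into the relational index."""
    date_text = xml_root.findtext("date")
    day = parsedate_to_datetime(date_text).date().isoformat() if date_text else None
    conn.execute(
        "INSERT INTO message_meta VALUES (?, ?, ?, ?, ?)",
        (msg_id, xml_root.findtext("from"), xml_root.findtext("to"),
         day, xml_root.findtext("subject")),
    )


def find_by_sender_and_day(conn, sender, day):
    """Return the ids of all messages from a given sender on a given day."""
    rows = conn.execute(
        "SELECT msg_id FROM message_meta WHERE sender = ? AND day = ? ORDER BY msg_id",
        (sender, day),
    )
    return [row[0] for row in rows]


# Invented sample message, used only to exercise the sketch.
RAW = (
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "Date: Mon, 04 Mar 2002 10:15:00 -0800\n"
    "Subject: Status report\n"
    "\n"
    "Ingest finished.\n"
)

conn = sqlite3.connect(":memory:")
create_index(conn)
index_message(conn, "msg-0001", message_to_xml(RAW))
print(find_by_sender_and_day(conn, "alice@example.org", "2002-03-04"))
```

Keeping the XML form as the preserved record and the relational table as a derived index means the index can be rebuilt from the archive at any time, which is one plausible reading of the division of labor described above.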
A description of the Senate Legislative Activities for the 106th Congress, expressed as 99 Rich Text Format (RTF) files, was ingested. These files were generated by an IT system (the Thomas system) that keeps track of bills, amendments, and resolutions for each senator. The SDSC project attempted to “lift knowledge” from the text representation to obtain something akin to the original database, first converting from RTF to an XML format and then (in effect) parsing the text to extract names of senators, committees, bills, etc. This work revealed inconsistencies in the original database (notably the omission of one senator from the collection). The result was expressed as a (new) database using XML syntax.
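The kind of consistency check that could expose a missing senator can be sketched as follows. The layout of the RTF-derived text is not described here, so the line pattern below is purely hypothetical, as are the function names and the placeholder names in the sample; the sketch shows only the general idea of extracting names from text and comparing them with an expected roster.

```python
import re

# Hypothetical line layout, e.g. "Senator Jane Doe (D-CA)"; the real files may differ.
SENATOR_LINE = re.compile(
    r"^Senator\s+(?P<name>[A-Z][A-Za-z.'\- ]+?)\s*\((?P<party>[A-Z])-(?P<state>[A-Z]{2})\)$"
)


def extract_senators(text):
    """Collect the senator names that appear on lines matching the assumed pattern."""
    found = set()
    for line in text.splitlines():
        match = SENATOR_LINE.match(line.strip())
        if match:
            found.add(match.group("name"))
    return found


def missing_from_collection(expected_roster, text):
    """List roster members for whom the collection contains no matching line."""
    return sorted(set(expected_roster) - extract_senators(text))


# Placeholder data, invented for illustration only.
SAMPLE_TEXT = (
    "Senator Jane Doe (D-CA)\n"
    "S. 100: A bill to illustrate parsing\n"
    "Senator John Roe (R-TX)\n"
)
ROSTER = ["Jane Doe", "John Roe", "Pat Poe"]

print(missing_from_collection(ROSTER, SAMPLE_TEXT))
```

A check of this sort depends on having an independent roster to compare against; without one, an omission in the source database would pass through the conversion unnoticed.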
An electronic database describing air missions over Vietnam, already held in NARA archives, was transformed into an XML format for preservation. One problem that required attention was normalizing several coordinate systems, including a military grid scheme no longer in use. In the process of building presentation tools, SDSC discovered inconsistencies in the geographic (and geometric) data in the database. Expressing the map data in XML made it possible to build fairly simple presentation viewers using commercial tools.
All of these demonstrations stressed the use of XML as a preservation format because of its independence from vendor- or computer-system specifics, in some cases taking advantage of freeware or commercial packages for processing XML.
In the remainder of this chapter, the committee assesses the usefulness of certain strategies and features of the SDSC work and classifies them as (1) lessons that might influence the construction of the NARA system, (2) aspects that may not apply, and (3) choices that NARA should not consider. These specific lessons complement the engineering issues discussed in the succeeding chapters.
1. The SDSC work increases confidence that it is possible to build an electronic archive system and that some of the assumptions behind the project are sound. The SDSC project demonstrated both
Reagan Moore. 2002. “The San Diego Project: Persistent Objects,” Proceedings of the Workshop on XML as a Preservation Language, Urbino, Italy, October. Available online at <http://www.sdsc.edu/NARA/Publications.htm>.