3
Specific Lessons to Be Learned from the SDSC Demonstration Projects

Starting in 1998, NARA cosponsored work at the San Diego Supercomputer Center (SDSC) to explore the long-term preservation of electronic records. Recently, NARA and the National Science Foundation (NSF) have been supporting work to extend and refine an architecture developed at SDSC and now referred to as “persistent archives.”

SDSC is part of the National Partnership for Advanced Computational Infrastructure (NPACI), a collaboration among 46 U.S. member institutions and foreign affiliates. The principal thrust of this collaboration is to develop the computational infrastructure required to support large-scale scientific computation. The partnership has developed techniques to link together large computers into a global grid, and a corresponding data grid for storing numerous large data sets used in scientific computation.

SDSC has considerable experience building and operating large data storage systems. The current data archive has a capacity of about 400 TB, in which tape robots move data between tape cartridges of roughly 20 GB capacity and a 1.6-TB disk cache. A high-speed network gateway delivers up to 90 MB/sec transfer rates to computational nodes via networks of various kinds. This system is designed principally for high capacity and the very-high-speed access required by supercomputers.
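A quick back-of-envelope check conveys the scale these figures imply; the derived numbers below are illustrative arithmetic, not SDSC specifications:

```python
# Rough scale implied by the quoted figures (illustrative arithmetic only).
archive_capacity_tb = 400      # total archive capacity
cartridge_capacity_gb = 20     # per tape cartridge
gateway_rate_mb_per_s = 90     # peak gateway transfer rate

# Number of cartridges needed to hold the full archive
cartridges = archive_capacity_tb * 1000 / cartridge_capacity_gb

# Time to stream the entire archive through the gateway at the peak rate
full_read_days = archive_capacity_tb * 1_000_000 / gateway_rate_mb_per_s / 86_400

print(f"~{cartridges:,.0f} cartridges")
print(f"~{full_read_days:.0f} days to stream the full archive")
```

Even at supercomputer transfer rates, simply reading the entire archive once would take weeks, which is one reason data access patterns figure so heavily in archival system design.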

SDSC has also developed data management techniques that allow uniform access to files from different kinds of computer systems. Principal among these is the Storage Resource Broker (SRB), which mediates between clients and storage of various kinds (file systems, databases, and the tape archive). Files are accessed by logical name; the SRB middleware determines where files are stored and how to access them. The SRB also maintains a metadata catalog, which principally records file metadata (such as a file’s location) but may also contain application- or domain-specific metadata. Several SRBs may work in concert to form a federated data management system, in which clients access the combined collections of the federation. These facilities have the important property that they hide computer- and vendor-dependent details, so that storage equipment can be upgraded without changing client software.
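The mediation idea can be sketched in a few lines. Everything below is an invented illustration of the concept: the class names, catalog layout, and backend interface are assumptions for exposition, not the SRB's actual API.

```python
# Sketch of SRB-style logical-name mediation (illustrative, not real SRB code).

class MetadataCatalog:
    """Maps a logical file name to (backend, physical path) plus optional
    application- or domain-specific metadata, as the SRB catalog does
    conceptually."""
    def __init__(self):
        self._entries = {}

    def register(self, logical_name, backend, physical_path, **extra_metadata):
        self._entries[logical_name] = {
            "backend": backend,
            "path": physical_path,
            "metadata": extra_metadata,
        }

    def resolve(self, logical_name):
        return self._entries[logical_name]


class StorageBroker:
    """Mediates between clients and heterogeneous storage. Clients use only
    logical names, so a storage backend can be replaced or upgraded without
    changing client code."""
    def __init__(self, catalog, backends):
        self.catalog = catalog
        self.backends = backends  # name -> object with a read(path) method

    def read(self, logical_name):
        entry = self.catalog.resolve(logical_name)
        return self.backends[entry["backend"]].read(entry["path"])


class DictBackend:
    """Trivial in-memory stand-in for a file system, database, or tape store."""
    def __init__(self, files):
        self.files = files

    def read(self, path):
        return self.files[path]


catalog = MetadataCatalog()
catalog.register("records/mail/0001", "disk", "/cache/a7/0001.xml",
                 collection="email")
broker = StorageBroker(
    catalog,
    {"disk": DictBackend({"/cache/a7/0001.xml": b"<message/>"})},
)
data = broker.read("records/mail/0001")  # client never sees the physical path
```

The essential property, as in the SRB, is that the physical location and storage technology appear nowhere in client code: only the catalog must change when equipment is swapped.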



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





These facilities are routinely used by scientific computing applications: organizing and saving scientific datasets over long periods is an important requirement for NPACI users.

For its NARA work, SDSC has built a number of demonstrations that use this data-storage infrastructure.1 The archival processing conforms to the OAIS model, with its principal ingest, storage, and access components. The demonstrations have treated a few record collections, building ingest and access functions suitable for each. A few examples follow:

- Approximately a million electronic mail messages were ingested. Header fields, such as sender, recipient, date, and subject, were extracted to form metadata for each message. Messages were transformed into an XML representation to explicitly tag the metadata elements. A relational database of metadata allowed easy retrieval of messages based on metadata properties, e.g., all messages from a given sender on a given day. This experiment did not attempt to deal with e-mail attachments and their wide-ranging data types.

- Files describing the Senate Legislative Activities for the 106th Congress, expressed as 99 Rich Text Format (RTF) files, were ingested. These files were created from an IT system (the Thomas system) that keeps track of bills, amendments, and resolutions for each senator. The SDSC project attempted to “lift knowledge” from the text representation to obtain something akin to the original database, by first converting from RTF to an XML format, then (in effect) parsing the text to extract the names of senators, committees, bills, and so on. This work exposed inconsistencies in the original database (notably the omission of one senator from the collection). The result was expressed as a new database using XML syntax.

- An electronic database already held in NARA archives describing air missions over Vietnam was transformed into an XML format for preservation. One problem that required attention was normalizing several coordinate systems, including a military grid scheme no longer in use. In the process of building presentation tools, SDSC discovered inconsistencies in the geographic (and geometric) data in the database. Expressing the map data in XML permitted building quite simple presentation viewers using commercial tools.

All of these demonstrations stressed the use of XML as a preservation format because of its independence from vendor- or computer-system specifics, in some cases taking advantage of freeware or commercial packages for processing XML.

In the remainder of this chapter, the committee assesses the usefulness of certain strategies and features of the SDSC work and classifies them as (1) lessons that might influence the construction of the NARA system, (2) aspects that may not apply, and (3) choices that NARA should not consider. These specific lessons complement the engineering issues discussed in the succeeding chapters.

_______________
1 Reagan Moore. 2002. “The San Diego Project: Persistent Objects,” Proceedings of the Workshop on XML as a Preservation Language, Urbino, Italy, October. Available online at <http://www.sdsc.edu/NARA/Publications.htm>.

LESSONS FROM THE SDSC PROJECT THAT MAY BE HELPFUL IN DESIGNING THE ERA

1. The SDSC work increases confidence that it is possible to build an electronic archive system and that some of the assumptions behind the project are sound. The SDSC project demonstrated both

archival and technical processes for embedding a few of NARA’s electronic collections within a prototype electronic archive. In particular, the demonstrations showed the following:

- The OAIS model provides a useful overall system structure, although it does little to help specify an implementation. The basic OAIS structure of ingest, storage, and access was successfully mirrored in the overall modular design of the SDSC demonstrations.

- XML is a useful way to represent metadata.

- A significant degree of independence from particular vendors and systems can be achieved.

The SDSC demonstrations also show how modern networking technologies can be used to interconnect heterogeneous machines and to add equipment as needed to increase capacity. Networking technologies also allow parts of the system to be physically separated. The XAPT ingest workbench demonstrated by SDSC can be operated anywhere; for example, it can run on workstations in the agency that originates the records even though other parts of the work flow, and the archival storage, are located elsewhere.

2. Metadata sets will be constantly changing. The writings of the electronic archive community make it clear that a universal metadata set is extremely unlikely. The SDSC projects exploited the ability to tailor metadata sets for each collection. The scientific data sets that SDSC archives make even more extensive and critical use of metadata (e.g., recording important physical parameters of instruments used to make measurements that are recorded in the file) than is likely in an archive of electronic government records.

3. Indexing metadata offers a simple and effective way to search for archived records. The SDSC system entered pertinent metadata into a relational database, which could be searched to locate records. This approach leverages the power of relational database systems, including interactive query software, and is easy to understand and use. Metadata searches are not as powerful as full-text searches, but the SDSC work demonstrated their value.

4. Placing a “federation layer” between the archival system and its file storage is a very useful technique, and the SDSC implementation of such a layer (the SRB) is quite extensive. The SRB is a piece of middleware that enables distributed clients to access storage resources in a heterogeneous environment. Among the benefits of the SRB approach, the following are particularly noteworthy:

- The SRB approach provides uniform access to and manipulation of files stored in file systems, databases, and archival storage.

- It allows new implementations of file systems (or storage types) to be added to the system as it evolves. This allows storage capacity to increase and new hardware to be introduced. However, the SRB does not provide automatic refreshing (copying of data from old storage equipment to new), which would be a desirable capability in a production system. Similarly, it allows unused file systems to be removed.

- It supports location-independent access to files by keeping a mapping between the permanent file identifier and the physical location of the file.

- It allows working files to be accessible in exactly the same fashion as archived files. This means that a collection can be tested before it is committed to the archive (“tested” may mean that audits are performed to ensure integrity or to verify that the access software works correctly). In other words, the same auditing and access software can access collections as works in progress as well as archived collections.

- Very-high-speed data transfer rates can be provided if necessary.

5. The project demonstrated several successful uses of significant COTS products. COTS file system implementations (hardware and software) are easily incorporated using SRB mediation. Commercial relational database software maintains a metadata index and processes ad hoc queries used to find records in the archive.

ASPECTS OF THE SDSC PROJECT THAT MIGHT NOT APPLY TO THE NARA SYSTEM

1. The SDSC file system’s exclusive use, at least for long-term archival storage, of off-line tape. Tape is clearly not the only way to build a robust, long-term file system, and a trend toward increased use of online disk storage is evident. Tape storage propagates significant complexity to the rest of the system; in particular, efficient use of tape requires files to be quite large, a minimum of several gigabytes. The trade-offs between tape and other media depend significantly on how often and in what patterns the data are accessed. Since the SDSC demonstrations made no attempt to mimic the scale of an operating ERA, they offer no evidence that tape archival storage will have adequate performance.

2. The conversion of each record to a single XML representation as a way to achieve persistence, i.e., to avoid obsolescence of data types. SDSC’s approach to persistence of data types relies on converting records into an XML representation as part of the ingest process. This method has a number of problems. For example, there is no mention of how a stylesheet specification might be archived and remain executable in the future. Style sheets refer to an underlying rendering model that may change with time; moreover, the existing standards for style sheets do not cover all possible rendering and presentation techniques. An example makes this concrete: one can instruct Microsoft Word to produce HTML or XML, but the output will contain many Word-specific tags. If these tags are not understood, one retains access to the text but may lose much of the information about layout, change tracking, and other features. Also, the SDSC project demonstrated the XML approach only for simple record formats, such as electronic mail messages; it tackled neither complex but increasingly common commercial record types, such as presentations with animation, nor the issues associated with preserving records that contain scripts or executable elements.

3. Validation of approaches through use in a production environment or for an extended period of time. Because the SDSC demonstrations were limited in scope and duration, they do not provide the same sort of operational experience that would be gained from operating early iterations of an ERA. For example, no provision was made for automated media refresh (automatically copying bits from an aging storage medium to a newer one). Nor did the

demonstrations address the problems of work flow and scale associated with ingest; for example, how will NARA be able to ingest electronic records fast enough to meet its needs? The SDSC demonstrations also shed little light on the operational capabilities NARA will need to run ERA systems. Each demonstration was carried out by highly skilled programmers capable of diagnosing problems in the input documents, the ingest processing, and the data finally added to the archive. There was no attempt to build an ingest system that could be operated on a routine basis by less skilled people. (Users of SDSC’s scientific data management systems are generally highly technically savvy, and they can turn to a highly proficient support staff.) ERA users are less likely to have such skills or recourse to the same sort of IT support. More generally, such exploratory work is no substitute for having NARA staff work with a system in a production environment.

AREAS WHERE THE SDSC PROJECT EXPERIENCE SHOULD NOT BE USED IN DESIGNING THE NARA SYSTEM

1. The archival file system used, HPSS, should not be used as a model for NARA because it lacks important properties. HPSS is designed for manipulating large data sets on large computers (i.e., scientific data on supercomputers), which have a set of requirements different from those of an electronic archive. For example, HPSS does not provide facilities for replication (redundant copies must be created explicitly), for automatically refreshing the storage media (media refresh has been done under HPSS, but it requires explicit management by staff), or for geographic redundancy.

2. The SDSC project covered a very small number of file formats and therefore cannot serve as a model for preserving records across the federal government, where a large number of formats will be encountered. The project, for example, did not consider the following:

- how to prioritize how much support to provide to which formats (quality of service),

- how to determine what subset of formats may cover many of the commonly found record types, and

- how to deal with formats for which there are no existing tools to extract the information from which an XML structure can be built.

3. The efforts to build a knowledge layer are not ready for deployment. Trying to express semantic constraints within records is a worthwhile long-term goal, but the demonstrations of how to “lift knowledge” from a document (such as the Senate legislative activity example discussed above) are not persuasive. These techniques are insufficiently developed to be planned for the NARA system.
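The fragility of pattern-based “knowledge lifting” can be seen in even a tiny example. The pattern and sample strings below are invented for illustration; SDSC’s actual extraction rules were more elaborate, but the failure mode is the same:

```python
# Illustrative (invented) pattern-based extraction of a sponsor name from
# legislative text, showing how easily such rules break.
import re

SPONSOR = re.compile(r"Sponsor:\s+Sen\.\s+([A-Z][a-z]+ [A-Z][a-z]+)")

def lift_sponsor(text):
    """Return the sponsor's name if the expected pattern matches, else None."""
    m = SPONSOR.search(text)
    return m.group(1) if m else None

ok = lift_sponsor("Sponsor: Sen. Jane Smith (D-XX)")       # "Jane Smith"
# A minor, perfectly legible formatting variation silently defeats the rule:
missed = lift_sponsor("Sponsor:  Senator Jane Smith (D-XX)")  # None
```

Such rules fail silently on minor formatting variations, which is one reason these pipelines require skilled operators to diagnose and patch them, as noted above.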