
Currently Skimming:

5 Key Technical Issues
Pages 35-57

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 35...
... Using which data types? The "data model" is the specification that answers these questions.
From page 36...
... Another form of aggregation that may be desirable is the container, e.g., as used by the SDSC demonstration, which simply collects a group of records into a single digital file for more efficient handling by the file system.2 The archive should contain complete documentation about all versions of the data model, including specifications of the data types it uses. Since metadata sets are likely to proliferate
2 A container is distinct from an archivist's "collection." A collection may span several containers, and several collections might fit within a single container.
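The container idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that the container is a tar archive built in memory; the report does not say which container format the SDSC demonstration used, and the function names `pack_container` and `unpack_container` are hypothetical.

```python
import io
import tarfile


def pack_container(records: dict[str, bytes]) -> bytes:
    """Bundle many small records into one tar archive held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        # Sort names so the same set of records always packs in the same order.
        for name, payload in sorted(records.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unpack_container(container: bytes) -> dict[str, bytes]:
    """Recover the individual records from a container built by pack_container."""
    records: dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(container), mode="r") as tar:
        for member in tar.getmembers():
            extracted = tar.extractfile(member)
            if extracted is not None:  # skip anything that is not a regular file
                records[member.name] = extracted.read()
    return records
```

The file system then handles one large object instead of many small ones, while each record remains individually addressable by name inside the container.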
From page 37...
... is to carefully label each digital file that is part of the stored record with its type; when the file is read, the type identifier selects software that can interpret the file correctly. This is sometimes called "self-identifying data." Whenever a data type is chosen as part of the data model, the system designer should ask, How do I introduce a new version of this data type without disrupting existing records?
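A minimal sketch of self-identifying data follows. The type identifiers (such as "text/plain;version=1"), the `DECODERS` registry, and the function names are assumptions made for illustration, not part of the report. The point is that a new version of a data type is added as a new registry entry, so records labeled with the old identifier keep decoding as before.

```python
from typing import Callable

# Registry mapping a type identifier to the software that interprets it.
DECODERS: dict[str, Callable[[bytes], str]] = {}


def register(type_id: str):
    """Decorator that registers a decoder under a type identifier."""
    def wrap(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        DECODERS[type_id] = func
        return func
    return wrap


@register("text/plain;version=1")
def decode_text_v1(payload: bytes) -> str:
    # Hypothetical first version of the type: ASCII only.
    return payload.decode("ascii")


@register("text/plain;version=2")
def decode_text_v2(payload: bytes) -> str:
    # Hypothetical second version: UTF-8. Version 1 records are unaffected.
    return payload.decode("utf-8")


def read_record(type_id: str, payload: bytes) -> str:
    """Select the decoder named by the file's type label and apply it."""
    try:
        decoder = DECODERS[type_id]
    except KeyError:
        raise ValueError(f"no decoder registered for {type_id!r}") from None
    return decoder(payload)
```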
From page 38...
... confronts the most challenging problem of an archive: For records to be useful many decades after they are ingested, they must be expressed in the data model using data types that can still be decoded and interpreted at the time of access. By that time the computers and software used to create the original records may be obsolete.
From page 39...
... As a complement to possible future efforts that provide access through such techniques as emulation or migration, the pragmatic strategy is to support access by using a smaller number of data types to express the derived forms of records. These are referred to in this report as "preferred data types." This approach requires characterizing the most common (future)
From page 40...
... In some cases, the native data type may itself be a preferred data type that satisfies all the anticipated uses. The reason for recording derived forms when the record is ingested is simple: It is at this time that software to create the derived forms is most likely to be available.
From page 41...
... Redacted versions of a record might be stored as derived forms with relaxed access controls. Derived forms may also be a simple way to deal with unique or complex data types.
From page 42...
... 18 The term "refresh" is preferred to "migrate," because the second term is used to describe a conversion of data type.
From page 43...
... , used in the SDSC demonstrations, is one example of such a distributed file system, but the technology is quite common.20
19 Digital computers and their storage devices were unknown in 1903!
20 The federated file system model is a mature, well-understood technology.
From page 44...
... The file system should be designed without knowledge of the data model, so the file system implementation can be shared even if the data model is not. File System Performance Requirements The file system must be designed to meet the scale and performance requirements that the ERA will face.
From page 45...
... Selecting Storage Media Presently, NARA stores most of its electronic records using off-line tape storage; this is also the approach used in the SDSC demonstrations. For new systems, disks are becoming the preferred storage choice.
From page 46...
... INGEST Ingest processes are designed according to the data types23 of incoming records and the work flows of the organization building the archive. Some digital document repositories have been created with a streamlined process for scanning large numbers of uniform paper documents or ingesting particular digital formats and building a repository using very little manual labor.
From page 47...
... Advance awareness of new data types being presented to NARA can guide adoption of new preferred derived forms and development of associated software. (The ERA will, of course,
25 It would be useful and relatively easy to save the validation software at ingest time.
From page 48...
... If data are missing and the agency cannot locate the missing data, then the documentation should call this to the attention of users. Digital records are susceptible to accidental or deliberate alteration; ingest processes should pay attention to end-to-end integrity assurance.
From page 49...
... It may be useful to verify each file using an integrity checker associated with the file's data type. For example, if one of the files is expressed in an XML encoding with an associated data type definition (DTD)
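A per-type integrity checker might look like the sketch below. It is deliberately weaker than what the excerpt describes: Python's standard library does not validate against a DTD, so this only checks that the XML is well formed. The `CHECKERS` table, the "application/xml" label, and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def xml_is_well_formed(payload: bytes) -> bool:
    """Return True if the payload parses as XML (well-formedness only, no DTD check)."""
    try:
        ET.fromstring(payload)
    except ET.ParseError:
        return False
    return True


# Hypothetical table associating a data type with its integrity checker.
CHECKERS: dict[str, Callable[[bytes], bool]] = {
    "application/xml": xml_is_well_formed,
}


def verify_file(type_id: str, payload: bytes) -> bool:
    """Run the checker for the file's data type; unknown types fail verification."""
    checker = CHECKERS.get(type_id)
    return checker(payload) if checker is not None else False
```

Full validation against a DTD or schema would need a validating parser, which is outside the standard library.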
From page 50...
... NARA will need to set expectations for access to ERA records. In preparing this report, the committee has assumed that users will receive either a digital file representing the record (in its native data type or in one of the available derived forms)
From page 51...
... 1998. "Automated Essay Grading Using Text Categorization Techniques," Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'98)
From page 52...
... It may also be expected to provide for searches across sets of archives (sometimes called "federated search"), such as across both NARA and presidential library collections.37 For common record data types, full-text search can be provided inexpensively, with little manual intervention, using commercial software.38 Full-text search is merely the simplest form of content search, which may include searching images, sounds, animations, videos, hypermedia structures, etc.
From page 53...
... , some researchers can benefit from access to the elements of the underlying data model used by the ERA. For example, researchers who reverse engineer obscure data types, explore automatic metadata extraction, or devise new methods for content searching (especially on difficult data types such as images, video clips, or executable files)
From page 54...
... Access controls are comprised of three basic ingredients:
· A way to authenticate users who wish access, i.e., to verify their identity.44 What are the requirements for authentication?45 Does every user need to be identified individually, or as a member of a class, e.g., "Internet visitor"?
From page 55...
... The committee saw no evidence that NARA had begun to formalize access controls in a way that could reasonably be automated in the ERA. Perhaps access controls for NARA's existing archives are suitable and can be easily codified for the ERA, but the committee did not see evidence that this had been done and indeed heard a good deal that suggested otherwise, including extensive use of, and indeed reliance upon, human review just prior to the delivery of physical records from the existing archives.
From page 56...
... In a digital environment, achieving these goals becomes considerably more complex and nuanced than has been the case in an environment of paper records; designing appropriate measures is an interdependent mixture of techniques from archival practice on the one hand and computer science, cryptology, and computer security on the other. The committee has not comprehensively investigated these questions in preparing this report, but it is clear that they need much more extensive structural consideration than they seem to have received to date.48 One set of questions pertains to the transfer of records from agencies to NARA and their ingest into an ERA system.
From page 57...
... of alterations, be they due to attacks or accidental failures of the types discussed earlier. However, in order to protect against malevolent change, the hash value associated with a digital object must be separately protected so that an attacker who manages to gain access to change one cannot also change the other.
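The need for separate protection can be shown with a minimal sketch. It assumes SHA-256 as the hash function, which the excerpt does not specify, and the function names are hypothetical. The digest returned by `fingerprint` is meant to be kept in a store the attacker cannot reach; if it sat next to the object, an attacker could alter the object and recompute the digest.

```python
import hashlib
import hmac


def fingerprint(payload: bytes) -> str:
    """Compute the hash value to be stored separately from the object itself."""
    return hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, trusted_digest: str) -> bool:
    """Compare the object's current hash with the separately protected digest."""
    # compare_digest avoids leaking how many leading characters match.
    return hmac.compare_digest(fingerprint(payload), trusted_digest)


# Usage: the digest is recorded at ingest and held in a separate, protected store.
original = b"Minutes of the meeting, 12 March."
trusted = fingerprint(original)

# An attacker who can change only the stored object is detected.
tampered = b"Minutes of the meeting, 13 March."
print(verify(original, trusted))
print(verify(tampered, trusted))
```

If the attacker could also overwrite `trusted`, the second check would pass for a matching forged digest, which is exactly the failure the excerpt warns about.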


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.