there is a large body of more general experience in engineering complex, large-scale systems that can be applied.
Engineering for the ERA program requires a solid understanding of requirements, including the data types¹ to be accommodated, the quantity of records to be stored, the kinds of access to be provided, and the performance expected. Although these requirements can be expected to evolve at every stage of a system’s life as a result of changes in the characteristics of electronic records, the system will successfully meet its expectations only if its engineering is in step with its requirements. So it is essential, even for the very first system, to state these requirements carefully and explicitly.
A key to understanding the initial requirements is information about the population of government records that the system will hold and projections about how those records will be used. Importantly, great precision is not needed. Indeed, in some cases, data may be unavailable or impractical to obtain. It is not necessary to delay the ERA program significantly in order to conduct in-depth surveys. Rough estimates, even to within an order of magnitude, will suffice in most cases if they are well justified.
It is important that the estimates supporting initial requirements be made explicit, together with the assumptions and reasoning behind them; otherwise, a system design might reflect implicit estimates that are dangerously wrong. Stating them explicitly allows the estimates and the decisions that follow from them to be revised whenever the assumptions change.
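One way to keep such estimates explicit is to record each as a range with its rationale attached, and to derive totals from those ranges rather than from single numbers. The following is a minimal sketch in Python under stated assumptions: the `Estimate` class, the `multiply` and `order_of_magnitude` helpers, and every numeric value are hypothetical placeholders chosen for illustration, not figures from this report or from NARA.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A rough quantity with explicit bounds and the reasoning behind it."""
    name: str
    low: float
    high: float
    rationale: str

    def __post_init__(self):
        if not (0 < self.low <= self.high):
            raise ValueError(f"{self.name}: need 0 < low <= high")


def multiply(name, *factors):
    """Combine positive factors: bounds multiply, rationales are carried along."""
    low = math.prod(f.low for f in factors)
    high = math.prod(f.high for f in factors)
    rationale = "; ".join(f"{f.name}: {f.rationale}" for f in factors)
    return Estimate(name, low, high, rationale)


def order_of_magnitude(estimate):
    """Return the powers of ten that bracket the estimate's range."""
    return (math.floor(math.log10(estimate.low)),
            math.ceil(math.log10(estimate.high)))


# Placeholder inputs for illustration only (assumptions, not report data).
records = Estimate("record count", 1e8, 1e9, "hypothetical inventory guess")
avg_bytes = Estimate("bytes per record", 1e4, 1e5, "hypothetical mix of text and images")
copies = Estimate("replicas", 2, 3, "hypothetical reliability policy")

total = multiply("archive bytes", records, avg_bytes, copies)
print(total.name, order_of_magnitude(total))
print(total.rationale)
```

Because each input carries its own rationale, a changed assumption (for example, a different replication policy) can be traced to the one estimate it affects, and the total can be recomputed without revisiting the rest.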
In order for the first iterations of the ERA to be designed, questions such as the following need to be answered:
What are the data types that it must support, and what is their frequency of occurrence? In what forms do records currently exist—e.g., which data types, on what storage media, and with what kinds of supporting documentation or online metadata? If there is an inventory of digital records “waiting in the wings” to be archived, what are the properties of these collections? The system design must also anticipate and accommodate new data types and changes in their distribution over time.
How much data must be accommodated at the outset? A great many design decisions (such as the archive media, the implementation technology, and the techniques used to provide reliability) will require estimates of the scale of the archive. The committee heard estimates of
¹ Throughout this report, “data type” is used to identify the data-encoding rules whereby various kinds of records (documents, electronic mail messages, pictures, database entries, etc.) are expressed as a collection of bits. Thus an image might be represented by bits whose data type is TIFF or GIF or JPEG or any of a number of other such specifications. “File format” is often used interchangeably with “data type,” but “data type” is used throughout this report because the literal interpretation of “file format” is files of bits, which would be too restricted. For example, when an image is embedded in an e-mail message that is itself embedded in a “folder” of many messages saved in a file, the bits representing the image cannot properly be called a “file.”