Copyright © 2009. National Academy of Sciences. All rights reserved.
4
The Opportunities: The Relationship of Technological
Advances to New Data Use and Retention Strategies
Rapid progress in information technology continually alters both the quantity and the quality of
scientific information and periodically stimulates fundamental modification of data management and
archiving strategies. Recent technological advances have enabled new methods and strategies for data
storage and retrieval and have created better ways of connecting users to data resources and to each other.
Moreover, the evolving technologies are catalysts for revising organizational structures to manage
scientific data archives much more effectively in a distributed manner. Assumptions about effective
management of scientific data that have been long and firmly held are being directly challenged by new
information technology. These assumptions have been based on experience with management of paper
records, generally in domains outside of science. Some of the outdated assumptions that are rapidly
losing their relevance include the following:
· Physical possession of the data is essential to their management and archiving. This principle
has outlived its usefulness in the context of electronic physical science data and has made access difficult
for legitimate users. Electronic information is easily copied and disseminated. This feature removes
constraints imposed by the limited physical access. Because most government physical science data are
considered to be in the public domain, the constraints of copyright and fee collection to the free
movement of data are removed as well.
· Cost of an archive increases in proportion to collection size and use. Physical archive cost is a
function of space, as well as cataloging, repair, and access efforts. Improved inventory technology has
eased some of the cost burden over the last several years, but, fundamentally, archives with large physical
holdings operate in traditional ways with linearly scaling costs. The
physical handling of items scales with use, whereas budgets reflect usage indirectly. In contrast,
electronic information storage and management costs have declined as rapidly as the costs of computer
technology and processing over the last 30 years. There is no foreseeable end to this process. Storing and
using the next byte will be cheaper than storing and using the most recent byte for a long time to come.
· Only archivists and librarians have the capabilities to manage archived data. While librarians
and archivists are important advisors and participants in scientific data management, the dominant
management responsibility falls to the scientific community and its designated scientific data managers
(who are a blend of scientist, computer scientist, and librarian/archivist). If practicing scientists do not
participate in the management of scientific information, such data will fall into obscurity or obsolescence.
· The locator information (catalogs) about the managed objects is simple and compact. Finding
relevant scientific information often requires searching the full content and this content generally is not
in the conveniently compressed form of text. For example, to search for all data sets where the
stratospheric ozone concentration is less than some ad hoc threshold in some region, one would need to
execute a complex algorithm on every data sample covering the region in question. Queries such as this
become even more complex if the region of interest is determined after retrieval (e.g., how many days in
a row was the areal extent of the ozone hole over open ocean greater than 5,000 square kilometers?). The
selection and use of scientific data to solve complex problems can be simplified through the use of the
concept of browsing information based on content. Browsing often involves examination of large
numbers of samples and data volumes. Specialized "browsing products" can be defined to locate records
of interest. For the query examples above, low-resolution ozone maps could be used to find candidate
data sets with high probability of relevance. Information about the processes (including sensor character-
istics, computer program capabilities, and calibration points) used to develop the data set is needed for its
proper use. Such information increases the size and complexity of the locator service.
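The ozone query used as an example in the last item above can be sketched in code. This is only an illustration under stated assumptions: each low-resolution "browsing product" is taken to be a small grid of daily ozone values, every grid cell is assumed to cover the same area, and the open-ocean restriction is left out. The function names, the threshold, and the cell area are hypothetical and do not come from any particular archive system.

```python
from typing import List, Sequence

# One low-resolution daily browse map: rows of ozone values (assumed layout).
Grid = Sequence[Sequence[float]]


def low_ozone_area_km2(grid: Grid, threshold: float, cell_area_km2: float) -> float:
    """Total area of the cells whose ozone value is below the threshold."""
    cells_below = sum(1 for row in grid for value in row if value < threshold)
    return cells_below * cell_area_km2


def longest_run_over(areas: Sequence[float], min_area_km2: float) -> int:
    """Longest run of consecutive days whose low-ozone area exceeds min_area_km2."""
    longest = current = 0
    for area in areas:
        current = current + 1 if area > min_area_km2 else 0
        longest = max(longest, current)
    return longest


def candidate_days(daily_maps: Sequence[Grid], threshold: float,
                   cell_area_km2: float, min_area_km2: float) -> List[int]:
    """Indices of the days worth retrieving at full resolution."""
    return [
        day for day, grid in enumerate(daily_maps)
        if low_ozone_area_km2(grid, threshold, cell_area_km2) > min_area_km2
    ]
```

Because the browse maps are small, a scan like this can narrow the search to a few candidate days before any full-resolution data are touched. The full-resolution data would still be needed to confirm the answer.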
The remainder of this chapter describes how advancing information technologies enable the data
manager, librarian, and archivist to deal with the challenges of scientific data management in
collaborative fashion with the scientific user community.
ENABLING TECHNOLOGIES AND RELATED DEVELOPMENTS
Table 4.1 provides a summary of aspects of scientific data management changed by new technologies
and related developments. These six areas are discussed in more detail below.
High-Performance Computer Networks
The rapid expansion of computer networks and their use for electronic mail and database access have
obviated the need for researchers and other users of scientific and technical data to be in physical
proximity to colleagues, information resources, and even advanced technical facilities. This has
presented a menu of choices about the best means to distribute data and the responsibility of managing
them.
A worldwide, "virtual" library is being created on the Internet. Application programs such as Mosaic
are demonstrating the power of free and simple navigation across an ocean of available resources.
Improvements in network capacity, reliability, performance, and security measures are helping to make these
resources more widely accessible and useful.
High-performance networks also support movement of information for new applications (e.g., for
producing safely managed backup copies, "profiling" information for individual users' needs, or staging
data through a number of refinement steps in different locations for focused research). Networks support
collaborative work and research projects that span traditional research boundaries. Such work requires
easy access to a variety of data sources at once.
High-performance networks enable scientific data resources to be widely distributed and managed by
groups of scientists. Users thus are freed to concentrate on the most effective use of the data, rather than
on their own data management issues. Networks can provide a vehicle for regularly distributing backup
copies of data and metadata to ensure safe storage. Distribution of data to users can be done via the
network in addition to, or instead of, via physical media such as tapes and CD-ROMs. Data can be linked
together to help users navigate among related items. This kind of linking is at the heart of the World Wide
Web concept and is brought to users by Mosaic. The population of information providers (e.g., people who
can contribute to the knowledge base) has now grown to include all networked members of a user
population. Such contributions can be as simple as an annotation on an existing item, or as complex as a
fully processed and peer-reviewed new item. Most profoundly, the evolving network infrastructure
enables new concepts for distribution of functions and responsibility in organizations (NRC, 1994).
Preserving Scientific Data on Our Physical Universe
TABLE 4.1 New Technologies and Related Developments That Enable a New Strategy for the Management of
Scientific and Technical Data
New Technology Trends and Related Developments | Key Features | What Is Enabled?
High-performance computer networks | Distributed functions; rapid delivery of large data volumes | Location of databases and archives where best managed; collaborative work; distributed organizations; distributed responsibility
Low and declining cost of storage | Inexpensive backup; continually declining cost; ease of migration | Deferral of archiving decisions; trust in distributed management due to safe storage backup
Advanced data management | Ability to rigorously and formally manage diverse data types | More complex data structures (other than "flat files") handled in archives, with great potential advantages
Changing requirements for information technology professionals | Ability of personnel with lower technical skills to succeed in data management roles | Ability to entrust scientific data management in a distributed environment
High reliability of technology components | Availability of better components and connections; reduced procurement and operations costs | Reduced cost and effort in data migration; trusted connections for communication and collaboration
Development and acceptance of standards | Agreement on terms, interfaces, media, procedures | Reduced effort to communicate and apply results of others; ability to concentrate on mission issues and not on technology support
Although networks can provide a quick and easy means to distribute data, it must be noted that CD-
ROMs have been used to distribute data for several years and have been very successful. CD-ROMs not
only permit users to have a huge local library of data, but they often come with a better set of data access
tools than are normally available. Some data sets are large enough that the most cost-effective method to
deliver them is on media such as Exabyte tapes (8 mm).
Low and Declining Cost of Storage
As for most aspects of computer hardware, the cost of storage has declined continuously and rapidly
for the 30 years of the modern computer age. New storage technology is also increasingly compact and
supports ever greater access speeds (Gelsinger et al., 1989). The historical trends are expected to
continue for up to 20 years. Already, laboratory engineering results confirm this projection for at least the
next decade. The most significant implication is that the decisions about sampling or discarding
scientific data can generally be deferred, particularly for data sets for which the necessary metadata exist
and whose quality has been certified. For relatively smaller data sets, the deliberation regarding long-
term retention may well cost more than the recurring acts of migration. The cost of storage is small in
relation to overall mission or investigation costs and therefore should not be a decision driver. Experience
suggests, however, that the funds to meet these costs need to receive special protection in the annual
agency budget cycles. The support for the data management aspects of scientific missions has typically
had a lower priority than the data collection aspects. The low cost of storage also implies that the
incremental cost of supporting a remote safe copy of data and metadata also will be small, except for the
very large data sets. Therefore, over the next few decades, data received and stored may be expected to be
cheaply and quickly migrated to new technologies when storage media reach their nominal limits of
reliability or for convenience of improved access.
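The effect of a steadily falling unit cost can be illustrated with a small calculation. The sketch below is an idealization: it assumes the cost of holding one unit of data falls by a fixed fraction every year, which the following paragraph cautions will not hold indefinitely. The function names and all figures are hypothetical.

```python
def linear_cost(volume_units: float, unit_cost: float, years: int) -> float:
    """Cost of holding a fixed volume when the yearly unit cost never changes."""
    return volume_units * unit_cost * years


def declining_cost(volume_units: float, unit_cost: float, years: int,
                   annual_decline: float) -> float:
    """Cost of holding the same volume when the unit cost falls by a fixed
    fraction (annual_decline, between 0 and 1) each year."""
    total = 0.0
    cost = unit_cost
    for _ in range(years):
        total += volume_units * cost
        cost *= 1.0 - annual_decline
    return total


def cost_ceiling(volume_units: float, unit_cost: float, annual_decline: float) -> float:
    """Upper bound on declining_cost for any number of years (geometric series)."""
    return volume_units * unit_cost / annual_decline
```

Under this assumption the cumulative cost of keeping a data set approaches a fixed ceiling instead of growing without bound. That is one way to read the observation that, for smaller data sets, deliberating about retention may cost more than the storage itself.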
It is important not to expect a perpetual advantage from this technological discontinuity. The fact
that data require significant time periods for their migration must be considered. The cost decay trend
will slow down at some point in the future, causing the overall cost of storage to return to something
closer to the linear relationship to volume. We also must be realistic and expect that funds will not always
be available to save and back up every data set. Decisions on retention or sampling will have to be made.
Nevertheless, the already low and continually declining cost of storage allows a priori decisions to be
made in certain circumstances to keep scientific data sets indefinitely. Backup or safe storage copies of
data are becoming more affordable as data migration becomes less expensive with smaller, faster, and
cheaper storage devices. Reliability also is improving with new software-based archive systems
(including migration and backup features). However, there is an enhanced need for ongoing technology
monitoring by an appropriate body for media, standards, and migration automation. Such monitoring
should be incorporated in any scientific data management and archiving strategy.
The rapid change of storage technologies suggests that efforts to protect today's scientific data legacy
must be accelerated. The obsolescence of media types and recorders/players is occurring within shorter
and shorter time periods. This implies that "salvage" activities will be increasingly difficult for data left
out of migrations to new media. This "join or be left behind" by-product of rapid technological change
intensifies short-term budget pressures on archives. It demands in response a strong management
commitment to provide resources and save important data sets.
If digital data are to survive, it is of fundamental importance to manage and constrain the costs of
archive maintenance. The problem is that new data will be coming in, old data will need to be migrated
to new media, the building will need to be repaired, and there usually will not be a lot of extra money for
new equipment or added staff. To avoid problems, the data migration process in the system design must
be almost totally automated. This refinement often has not been achieved, and it can cause unnecessary
budget difficulties. Finally, it is essential for agencies to preserve all the hardware and software
necessary to access all their data until the data have been successfully migrated or otherwise disposed of.
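A minimal sketch of one automated migration step follows, assuming the data are ordinary files and that a matching SHA-256 checksum is acceptable evidence of a successful copy. The function names and the record layout are hypothetical. The original file is deliberately left in place, in keeping with the point above that the means of access should be preserved until migration has succeeded.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file, read in 1 MiB chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source: Path, destination_dir: Path) -> dict:
    """Copy one file to the new storage location and verify the copy."""
    destination_dir.mkdir(parents=True, exist_ok=True)
    target = destination_dir / source.name
    before = sha256_of(source)
    shutil.copy2(source, target)  # copies contents and file timestamps
    after = sha256_of(target)
    if before != after:
        target.unlink()  # discard the bad copy; the source is untouched
        raise IOError(f"checksum mismatch while migrating {source}")
    return {"source": str(source), "target": str(target), "sha256": after}


def migrate_all(source_dir: Path, destination_dir: Path) -> list:
    """Migrate every regular file in source_dir and return the audit records."""
    return [
        migrate_file(path, destination_dir)
        for path in sorted(source_dir.iterdir())
        if path.is_file()
    ]
```

The returned records could be kept as an audit trail, so that a later migration can check the stored checksums again before retiring the older media.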
Advanced Data Management
There are signs that data management technology is beginning to address and, perhaps, to catch up
with the complexities of the very large volumes of scientific data. Improvements have occurred in
database management systems, hierarchical file systems, data representation standards, query optimizers,
data distribution techniques, specialized access methods, and data security tools (Silberschatz et al.,
1991). Further, investment in standards and cooperative approaches is accelerating, fueled in part by the
demands of medicine, education, entertainment, journalism, financial services, and other commercial
applications. While competing approaches and inconsistent vocabulary create near-term confusion, the
attention and investment levels bode well for the longer-term capability to go beyond "flat file"
representations of data that need to be archived. The new tools and techniques are more descriptive of the
data, their heritage, the processes that have worked upon the data, and the relationships of data to each
other.
New data management technology will enable easier representation of more diverse types of
scientific data. Because of the rigor that new techniques require (e.g., for self-documentation or for
precise definition of access methods), long-term archives will benefit from data structures other than flat
files. The new technology also implies that the creation of a richer set of metadata will be easier to
implement and that these data will be of high scientific value for content-based retrievals. To realize the
potential of this enabled facility with metadata, the scientific community will have to accept and support
efforts to develop and apply new metadata requirements.
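As an illustration of what a self-documenting alternative to a flat file might carry, the sketch below attaches descriptive metadata to a data set: its sensor, calibration notes, processing heritage, and related data sets. The field names are assumptions chosen to mirror the kinds of information this chapter mentions. They are not drawn from any metadata standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ProcessingStep:
    """One step in the heritage of a data set (hypothetical fields)."""
    program: str
    description: str


@dataclass
class DataSetRecord:
    """Descriptive metadata kept alongside the data themselves."""
    title: str
    variable: str
    units: str
    sensor: str
    calibration_notes: str
    heritage: List[ProcessingStep] = field(default_factory=list)
    related_data_sets: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to the data."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    def missing_fields(self) -> List[str]:
        """Names of fields left empty, as a simple completeness check."""
        return [name for name, value in asdict(self).items() if value in ("", [])]
```

A completeness check of this kind is one simple way a community could enforce agreed metadata requirements at the time a data set is deposited.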
The Changing Requirements for Information Technology Professionals
Information technology professionals with high skill levels can now be found in all parts of the
United States and around the world. But as they bring the information technology industry to higher
levels of maturity, the effect is to reduce the complexity of major tasks in managing information. Such
tasks previously required their skilled use of sophisticated assembly language or job control language
(JCL) programming. JCL programming refers to the steps formerly used at the system
console to get programs to run, attach the right files, print to the right printer, and perform similar functions.
Today, much of this work is masked, made automatic, and controlled through icons and other means.
These tasks can now be performed by competent scientists or professionals with lower technical skills,
rather than by highly trained specialists. Because more functions can be completely handled by
machines, management of the data can be greatly automated and operated by less skilled individuals. The
data themselves can be widely distributed without fear of loss, particularly with a backup copy in safe
storage.
Over the next 5 to 10 years, the costs for information technology professionals at individual scientific
data centers and archives can be dramatically reduced. The reasons for the reduction in costs include
more automatic processes for storage management, rudimentary learning capability in systems, services
performed by end users based on their preferences, improved systems management, higher component
reliability, improved application of standards, and vendor consistency with standards.
Although the dominant trend will be for a smaller, less technically skilled staff to manage the
physical aspects of the archive, there will be a pressing demand for fewer, highly skilled people who
blend the skills of physical scientist, computer scientist, and archivist. These people must be able to
handle the intellectual challenges of bridging these disciplines while providing the coaching and
direction to help develop data and operations standards for scientific communities.
High Reliability of Technology Components
Microprocessors, new storage media technologies, mature software, error correction capabilities,
improved packaging, and reduced power consumption have all made significant contributions to the
reliability of computer systems and networks. What was recently considered unreliable, requiring
constant attention and expensive repair, is now regarded as reliable and not worthy of effort to repair.
Although precautions have always been taken to protect against loss of valuable data, many of these
precautions are now built into the base of mature software or are increasingly familiar parts of facilities'
operating procedures.
High reliability of technology supports a capacity for high levels of trust and the ability to widely
distribute functions and databases. These distributed systems can achieve the same levels of quality and
trust as centralized archives through the use of the same underlying hardware and software technology,
operating procedures, safe storage of copies, and high-quality (error-corrected) telecommunication
connections. High reliability has enabled new applications such as the World Wide Web, in which
context switching from one machine to the next, on a worldwide basis, is readily accomplished.
Increased reliability also has allowed computing technology to be put into the hands of business
managers, consumers, and shop clerks. Without such reliability, maintenance effort would outweigh
productivity benefit. As a result, powerful organizational or operational frameworks can be built, much
as new materials enable new architecture or new machines.
Development and Acceptance of Standards
The development of effective standards has been pivotal to promoting the widespread use of
electronic information. Communication protocols such as TCP/IP have fueled the growth of the Internet.
Other format standards for documents support their interchange. For example, the Standard Generalized
Markup Language (SGML) provides a uniform way of formatting textual documents so that they can be
read by different document processing tools. The HyperText Markup Language (HTML) is a standard
used to represent and link documents; it is used to describe pages viewed with Internet viewers such as
Mosaic. Hardware and software standards such as the instruction set architectures for microprocessor-
based computers, modem protocols, media formats, and query languages also have played critical roles.
Standards can simplify many of the traditional data management jobs. For example, the time that
would be used to decipher a tape format is saved and the job of installing a new application is facilitated.
Having effective standards in place reduces the level of tedious, nonproductive effort and frees up time
for new tasks for the archivist. Standards determined now will typically be in effect for long periods of
time, perhaps a decade or more, with some small evolutionary augmentations. This means that a baseline
of appropriate standards can be selected for a body of information with some reasonable expectation that
they will not be quickly replaced. When it appears that the existing standards baseline needs to be
updated, the information can then be migrated to a new one. A deliberate data migration strategy based
on standards tracking is possible.
The role of standards certainly is not limited to the general computing community. Scientific teams
and discipline groups continuously work to codify best practices, definitions, and algorithms. These are
propagated as community standards. Standards developed by the scientific community are often the most
important to promote and apply. If properly promulgated, they can enable improved understanding,
broader collaboration, and facilitation of the data management and related research.
Finally, it should be emphasized that standards and guidelines to support long-term archiving must
not inhibit innovation, or the evolution of information systems and technology. Often the best standards
and guidelines are those that are independent of technology.
OPPORTUNITIES FOR NEW ORGANIZATIONAL STRUCTURES
With rapid technological improvements and newly enabled capabilities, it is sometimes easy to
forget the importance of long-term commitment by managers to policy and resource requirements. No
technological changes will by themselves replace the basic, unsung efforts of high-quality scientific data
management. In fact, although technology itself can improve the availability of data, truly accessible and
useful scientific information will be achieved only through such management commitment. This
commitment must be based on a coherent strategy for life-cycle management of data, including technol-
ogy acquisition, data and information management practices, and technology-independent standards to
ensure that the minimum levels of data content and consistency for research uses are met. Further, such
a comprehensive strategy will be successful only with the active and committed involvement of the
scientific community itself. The level of effort and change that may be required to achieve this
community involvement should not be underestimated, and fundamental change to the value system of the
community may be required.
Nevertheless, as discussed above, technological advances allow the creation of new infrastructure,
challenging existing organizational assumptions. Effective organizational designs based on new alloca-
tions of responsibility are enabled. For scientific data management, the technological changes support
organizations with the following attributes:
· Widely distributed responsibility. New telecommunications, data management, and standards
technology allows for high levels of trust in distributed data management. Physical possession of data by
archivists is no longer essential. The wide availability of information technology professionals and other
skilled data managers (along with the lower technical skill levels actually needed) enhances the ability to
distribute the data more broadly and increase user participation. Such distribution of data and their
ownership (whether actual or implied) by user groups improves the utility of the data and helps create
important support for long-term retention.
· High-value peer-to-peer communication. With access to data and to people on line, a variety of
new collaborative relationships can develop. Information can be broadcast to interested individuals in a
timely fashion. Data can be provided directly to field researchers to focus new data collection. Physical
proximity and formal lines of communication are no longer vital to effective organizational operation.
Indeed, closed, highly structured organizations often will be uncompetitive or fail to take full advantage
of innovation.
· Specialized data centers. Distribution of resources implies that some specific locations can
specialize and yet still contribute effectively to all. Specialized groups or institutions could be created in
a scientific discipline or in some aspect of data management, archives, or standards. Designation of such
specialized centers, in addition to those already in existence, is a significant mechanism for achieving
economies of scale, reducing overall costs while enhancing the effectiveness of certain functions for the
benefit of all.
· Explicit long-term (technology) strategies. A long-term technology strategy needs to be devel-
oped. The rapidly changing base of technology requires that a deliberate sequence of phases be selected,
through which data and data management will migrate. The constant evolution of information technolo-
gies demands that an organizational element take on this "technology navigation" function.
· Measurement as a vital tool. In a fast-paced, and, perhaps, widely distributed effort, metrics are
important to clearly communicate expectations of performance, register results, and help in detecting
weak spots for corrective action. In particular, metrics could be established to determine data set use and
to support archiving strategy decisions. Metrics also could be developed to help ensure high-quality
service and proper data protection.
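The use metrics mentioned in the last item above could start from something as simple as the following sketch. It assumes an access log is available as (user, data set) pairs. The function names and the review threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

AccessLog = Iterable[Tuple[str, str]]  # (user, data_set_id) pairs, assumed format


def usage_counts(access_log: AccessLog) -> Counter:
    """Number of recorded accesses per data set."""
    return Counter(data_set for _user, data_set in access_log)


def distinct_users(access_log: AccessLog) -> Dict[str, int]:
    """Number of different users who accessed each data set."""
    users: Dict[str, set] = {}
    for user, data_set in access_log:
        users.setdefault(data_set, set()).add(user)
    return {data_set: len(names) for data_set, names in users.items()}


def review_candidates(holdings: Iterable[str], counts: Counter,
                      min_accesses: int) -> List[str]:
    """Data sets in the holdings with fewer than min_accesses recorded uses."""
    return sorted(ds for ds in holdings if counts[ds] < min_accesses)
```

Low use alone would not justify discarding a data set. A list like this is better read as a prompt for review within a broader archiving strategy.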