| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 405
Page 405
48
Enabling Petabyte Computing
Reagan W. Moore
San Diego Supercomputer Center
Statement of the Problem
One expectation for the national information infrastructure
(NII) is that information be readily available and accessible over
the Internet. Large archives will be created that will make various
types of data available: remote-sensing data, economic statistics,
clinical patient records, digitized images of art, government
records, scientific simulation results, and so on. To remain
competitive, one must be able to analyze this information. For
example, researchers will want to "mine" data to find correlations
between disperate data sets such as epidemiology studies of patient
records. Others will want to analyze remote-sensing data to develop
better predictive models of the impact of government regulations on
the environment. Still others will want to incorporate direct field
measurements into their weather models to improve the models'
predictive power. In each of these cases, the ability to manipulate
massive data sets will become as important as computational
modeling is to solve problems.
A new technology is needed to enable this vision of "data
assimilation." It can be characterized as "petabyte computing," the
manipulation of terabyte-size data sets accessed from petabyte-size
archives 1R The infrastructure that
will sustain this level of data assimilation will constitute a
significant technological advantage. Because of recent advances in
archival storage, it is now feasible to implement a system that can
manipulate data on a scale a thousand times larger than is being
attempted today. Unfortunately, the software infrastructure to
control such data movement does not yet exist. An initiative to
develop the corresponding software technology is needed to enable
this vision by 2000.
Background
Advances in archival storage technology have made it possible to
consider manipulating terabyte data sets accessed from petabyte
storage archives. At the same time, advances in parallel computer
technology make it possible to process the retrieved data at
comparable rates. The combination of these technologies will enable
a new mode of science in which data assimilation becomes as
important as computational simulation is to the development of
predictive models.
Data assimilation can be viewed as combining data mining (in
which correlations are sought between large data sets) and data
modeling (in which observational data are combined with a
simulation model to provide an improved predictive system). These
approaches may require locating a single data set within a data
archive ("data picking") or deriving a data subset from data that
may be distributed uniformly throughout the archive. The latter
requires supporting the streaming of data through data-subsetting
platforms to create the desired data set. Therefore, handling large
data sets in this manner will become dependent on the ability to
manipulate parallel I/O streams.
Current capabilities are exemplified by commercial applications
of data mining in which companies maintain "just-in-time" inventory
by aggressively analyzing daily or weekly sales. The analysis of
sales trends allows a company to tune purchase orders to meet the
predicted demand, thus minimizing the cost and overhead of
maintaining a large inventory of goods. The amount of information
analyzed is limited by current storage and database technology.
Data sets up to 5 terabytes in size, consisting primarily of short
transaction records, can be analyzed.
OCR for page 406
OCR for page 407
OCR for page 408
OCR for page 409
OCR for page 410
OCR for page 411
Representative terms from entire chapter:
petabyte computing
Page 406
Examples of similar transaction-based data sets include bank
check logging archives, airline ticket archives, insurance claim
archives, and clinical patient hospital records. Each of these
archives constitutes a valuable repository of information that
could be mined to analyze trends, search for compliance with
federal laws, or predict usage patterns.
The size of individual such archives is expected to grow to
petabytes by 2000. Part of the growth in size is expected from the
aggregation of information over time. But an important component of
the size increase is expected to come from incorporating additional
ancillary information into the databases. Clinical patient records
will be augmented with the digitized data sets produced by modern
diagnostic equipment such as magnetic resonance imaging, positron
emission tomography, x-rays, and so on. Insurance claim archives
will be augmented with videotapes of each accident scene. Check
archives will be augmented with digitized images of each check to
allow validation of signatures.
In addition, virtually all scientific disciplines are producing
data sets of a size comparable to those found in industry. These
data sets, though, are distinguished from commercial data sets in
that they consist predominantly of binary large objects or "blobs,"
with small amounts of associated metadata that describe the
contents and format of each blob. A premier example of such data
sets is the Earth Observing System archive that will contain
satellite-based remote-sensing images of the Earth 2R The archive is expected to grow to 8
petabytes in size by 2006. The individual data sets will consist of
multifrequency digitized images of the Earth's surface below the
satellite flight paths. The multifrequency images will be able to
be analyzed to detect vegetation, heat sources, mineral
composition, glaciers, and many other features of the surface.
With such a database, it should be possible to examine the
effect of governmental regulations on the environment. For example,
it will be possible to measure the size of croplands and compare
those measurements to regulations on land use policies or water
usage. By incorporating economic and census information, it will be
possible to measure the impact of restrictions of water allocations
on small versus large farms. Another example will be to correlate
crop subsidies with actual crop yield and water availability.
Numerous other examples can be given to show the usefulness of
remote-sensing data in facilitating the development of government
regulations.
Remote-sensing data can also be used to improve our knowledge of
the environment. An interesting example is calculating the global
water budget by measuring the change in size of the world's
glaciers and the heights of the oceans. This information is needed
to understand global warming, better predict climate change, and
predict water availability for farming. All these examples require
the ability to manipulate massive amounts of data, both to pick out
individual data sets of interest and to stream large fractions of
the archive through data-subsetting platforms to find the
appropriate information.
Further examples of large scientific data sets include the
following:
•
Global change data sets. Simulations of the
Earth's climate are generated on supercomputers based on physical
models for different components of the environment. For instance,
100-year simulations are created based on particular models for
cloud formation over the Pacific Ocean. To understand the
difference in the predictions of the global climate as the models
are changed, time-dependent comparisons need to be made, both
between the models and with remote-sensing data. Such data
manipulations need support provided by petabyte computing.
•
Environmental data sets. Environmental
modeling of major bays in the United States is being attempted by
coupling remote-sensing data with simulations of the tidal flow
within the bays. Projects have been started for the Chesapeake Bay,
the San Diego Bay, the Monterey Bay, and the San Francisco Bay. In
each case, it should be possible to predict the impact of dredging
policies on bay ecology. Through fluid dynamics simulations of the
tides, it should be possible to correlate contaminant dispersal
within the bay and compare the predictions with actual
measurements. Each of these projects has the capability of
generating terabytes to petabytes of data and will need the
petabyte software infrastructure to support data comparisons.
•
Map data sets. The Alexandria project at
the University of California, Santa Barbara, is constructing a
digital library of digitized maps. Such a library can contain
information on economic infrastructure (roads, pipes, transmission
lines), land use (parcels, city boundaries), and governmental
policy (agricultural preserve boundaries). Correlating this
information will be essential to interpret much of the
remote-sensing data correctly.
Page 407
The capacity of today's hardware and software infrastructure for
supporting data assimilation of large data sets needs to be
increased by a factor of roughly 1,000, as shown in Table 1.
TABLE 1 Infrastructure Needed to Support Data
Assimilation
Technology
Current Capacity
Needed Capacity
Data archive capacity
Terabytes
Petabytes
Communication rates
Megabytes/second
Gigabytes/second
Data manipulation
Gigabytes of data stored on local disk
Terabytes
Execution rate on computer platform
Gigaflops
Teraflops
Furthermore, increased functionality is needed to support
manipulating data sets accessed from archives within databases. The
development of the higher-capacity, greater-functionality petabyte
computing system should be a national goal.
Analysis and Forecast
Technology advancements now make it possible to integrate
analysis of massive observational databases with computational
modeling. For example, coupling observed data with computer
simulations can provide greatly enhanced predictive systems for
understanding our interactions with our environment. This
represents an advance in scientific methodology that will allow
solution of radically new problems with direct application not only
to the largest academic research problems, but also to governmental
policymaking and commercial competitiveness.
The proposed system represents the advent of "petabyte
computing"the ability to solve important societal problems
that require processing petabytes of data per day. Such a system
requires the ability to sustain local data rates on the order of 10
gigabytes/second, teraflops compute power, and petabyte-size
archives. Current systems are a factor of 1,000 smaller in scale,
sustaining I/O at 10 to 20 megabytes/second, gigaflops execution
rates, and terabyte-size archives. The advent of petabyte computing
will be made possible by advances in archival storage technology
(hundreds of terabytes to petabytes of data available in a single
tape robot) and the development of scalable systems with linearly
expandable I/O, storage, and compute capabilities. It is now
feasible to design a system that can support distributed scientific
data mining and the associated computation.
A petabyte computing system will be a scalable parallel system.
It will provide archival storage space, data access transmission
bandwidth, and compute power in proportional amounts. As the size
of the archive is increased, the data access bandwidth and the
compute power should grow at the same rate. This implies both a
technology that can be expanded to handle even larger data archives
as well as one that can be reduced in scope for cost-effective
handling of smaller data archives. The system can be envisioned as
a parallel computer that supports directly attached peripherals
that constitute the archival storage system. Each peripheral will
need its own I/O channel to access a separate data-subsetting
platform, which in turn will be connected through the parallel
computer to other compute nodes.
The hardware to support such a system will be available in 1996.
Data archives will be able to store a petabyte of data in a single
tape robot. They will provide a separate controller for each
storage device, allowing transmission of large data sets in
parallel. The next-generation parallel computers will be able to
sustain I/O rates of 12 gigabytes/second to attached peripheral
storage devices, allowing the movement of a petabyte of data per
day between the data archive and the compute platform. The parallel
computers will also be able to execute at rates exceeding 100
gigaflops, thus providing the associated compute power needed to
process the data. Since the systems are scalable, commercial grade
systems could be constructed from the same technology by reducing
the number of nodes. Systems that support the analysis of terabytes
of data per day could be built at greatly reduced cost.
Page 408
What is missing is the software infrastructure to control data
movement. Mechanisms are needed that allow an application to
manipulate data stored in the archive through a database interface.
The advantages provided by such a system are that the application
does not have to coordinate the data movement; data sets can be
accessed through relational queries rather than by file name; and
data formats can be controlled by the database, eliminating the
need to convert between file formats. The software infrastructure
needed to do this consists of a library interface between the
application and the database to convert application read and write
requests into SQL-* queries to the database, an object-relational
database that supports user-extensible functions to allow
application-specific manipulations of the data sets, an interface
between the database and the archival storage system to support
data movement and control data placement, and an archive system
that masks hardware dependencies from the database. SQL-* is a
notation that has been developed in a previous project for a future
ANSI-standard extended Standard Query Language.
Each of these software systems is being built independently of
the others, with a focus on a system that can function in a local
area network. The petabyte computing capability requires the
integration of these technologies across a wide area network to
enable use of data-intensive analyses on the NII. Such a system
will be able to support access to multiple databases and archives,
allowing the integration of information from multiple sources.
Technology advances are needed in each of the underlying
software infrastructure components: database technology, archival
storage technology, and data-caching technology. Advances are also
needed in the interfaces that allow the integration of these
technologies over a wide area network. Each of these components is
examined in more detail in the following sections.
Database Technology
A major impediment to constructing a petabyte computing system
has been the UNIX file system. When large amounts of data that may
consist of millions of files must be manipulated, the researcher is
confronted with the need to design data format interface tools,
devise schemes to keep track of the large name space, and develop
scripts to cache the data as needed in the local file system.
Database technology eliminates many of these impediments.
Scientific databases appropriate for petabyte computing will
need to integrate the capabilities of both relational database
technology and object-oriented technology. Such systems are called
object-relational databases. They support queries based on SQL,
augmented by the ability to specify operations that can be applied
to the data sets. They incorporate support for user-defined
functions that can be used to manipulate the data objects stored in
the database. The result is a system that allows a user to locate
data by attribute, such as time of day when the data set was
created or geographic area that the data set represents, and to
perform operations on the data.
To be useful as a component in the scalable petabyte system, the
database must run on a parallel platform, and support multiple I/O
streams and coarse-grained operations on the data sets. This means
the result of a query should be translated into the retrieval of
multiple data sets, each of which is independently moved from the
archive to a data-subsetting platform where the appropriate
functions are applied. By manipulating multiple data sets
simultaneously, the system will be able to aggregate the required
I/O and compute rates to the level needed to handle petabyte
archives. Early versions of object-relational database technology
are available from needed to handle petabyte archives. Early
versions of object-relational database technology are available
from several companies, such as Illustra, IBM, and Oracle, although
they are designed to handle gigabyte data sets.
A second requirement for the object-relational database
technology is that it be interfaced to archival storage systems
that support the scientific data sets. Current database technology
relies on the use of direct attached disk to store both data
objects and the metadata. This limits the amount of data that can
be analyzed. Early prototypes have been built that store large
scientific data sets in a separate data archive. The performance of
such systems is usually limited by the single communication channel
typically used to link the archival storage system to the
database.
The critical missing software component for manipulating large
data sets is the interface between the database and the archival
storage system. Mechanisms are needed that will allow the database
to maintain large
Page 409
data sets on tertiary storage. The envisioned system will
support transparent access of tertiary storage by applications. It
will consist of a middleware product running on a supercomputer
that allows an application to issue read and write calls that are
transparently turned into SQL-* queries. The object-relational
database processes the query and responds by applying the requested
function to the appropriate data set. Since the database and
archival storage system are integrated, the application may access
data that are stored on tertiary tape systems. The data sets may in
fact be too large to fit on the local disk, and the query function
may involve creating the desired subset by using partial data set
caching. The result is transparent access of archived data by the
application without the user having to handle the data formatting
or data movement.
The missing software infrastructure is the interface between
object-relational databases and archival storage systems. Two
interfaces are needed: a data control interface to allow the
database to optimize data placement, grouping, layout, caching, and
migration within the archival storage system; and a data movement
interface to optimize retrieval of large data sets through use of
parallel I/O streams.
Prototype data movement interfaces are being built to support
data movement between object-relational databases and archival
storage systems that are compliant with the IEEE Mass Storage
System Reference Model. These prototypes can be used to analyze I/O
access patterns and help determine the requirements for the data
control interface.
Archival Storage Technology
Third-generation archival storage technology, such as the High
Performance Storage System (HPSS) that is being developed by the
DOE laboratories in collaboration with IBM, provides most of the
basic archival storage mechanisms needed to support petabyte
computing. HPSS will support partial file caching, parallel I/O
streams, and service classes. Service classes provide a mechanism
to classify the type of data layout and data caching needed to
optimize retrieval of a data set. Although HPSS is not expected to
be a supported product until 1996 or 1997, this time frame is
consistent with that expected for the creation of parallel
object-relational database technology.
Extensions to the HPSS architecture are needed to support
multiple bitfile movers on the parallel data-subsetting platforms.
Each of the data streams will need a separate software driver to
coordinate data movement with the direct attached storage device.
The integration of these movers with the database data access
mechanisms will require understanding how to integrate cache
management between the two systems. The database will need to be
able to specify data control elements, including data set
groupings, data set location, and caching and migration policies. A
generic interface that allows such control information to be passed
between commercially provided databases and archival storage
systems needs to become a standard for data-intensive computations
to become a reality.
To other national research efforts are investigating some of the
component technologies: the Scalable I/O Initiative at Caltech and
the National Science Foundation MetaCenter. The intent of the first
project is to support data transfer across multiple parallel I/O
streams from archival storage to the user application. The second
project is developing the requisite common authentication, file,
and scheduling systems needed to support distributed data
movement.
Data-Caching Technology
Making petabyte computing available as a resource on the NII to
access distributed sources of information will require
understanding how to integrate cache management across multiple
data-delivery mechanisms. In the wide area network that the NII
will provide, the petabyte computing system must integrate
distributed compute platforms, distributed data sources, and
network buffers. Caching systems form a critical component of wide
area network technology that can be used to optimize data movement
among these systems. In particular, the number of times data are
copied as they are moved from a data source to an application must
be minimized to relieve bandwidth congestion, improve access time,
and improve throughput. This is an important
Page 410
research area because many different data delivery mechanisms
must be integrated to achieve this goal; archival storage, network
transmission, databases, file systems, operating systems, and
applications.
Data-intensive analysis of information stored in distributed
data sources is an aggressive goal that is not yet feasible.
Movement of a petabyte of data per day corresponds to an average
bandwidth of 12 gigabytes/second. The new national networks
transmit data at 15 megabytes/second, with the bandwidth expected
to grow by a factor of 4 over the next 2 to 3 years. This implies
that additional support mechanisms must be provided if more than a
terabyte of data is going to be retrieved for analysis. The
implementation of the petabyte computing capability over a wide
area network may require the installation of caches within the
network to minimize data movement and support faster local access.
Management of network caches may become an important component of a
distributed petabyte computing system. Until this occurs,
data-intensive analyses will have to be done in a tightly coupled
system.
Future Systems
A teraflops computer will incorporate many of the features
needed to support data-intensive problems. Such a system will
provide the necessary scalable parallel architecture for
computation, I/O access, and data storage. Data will flow in
parallel from the data storage devices to parallel nodes on the
teraflops computer. Most such systems are being designed to support
computationally intensive problems.
The traditional perspective is that computationally intensive
problems generate information on the teraflops computer, which is
then archived. The minimum I/O rate needed to sustain just this
data archiving can be estimated from current systems by scaling
from data flow analyses of current CRAY supercomputers 3R. For the workload at the San Diego
Supercomputer Center, roughly 14 percent of the data written to
disk survives to the end of the computation and 2 percent of the
generated data is archived. The amount of data that is generated is
roughly proportional to the average workload execution rate, with
about 1 bit of data transferred for every 6 floating-point
operations 3R. Various scaling laws
for how the amount of transmitted data varies as a function of the
compute power can be derived for specific applications. For
three-dimensional computational fluid dynamics, the expected data
flow scales as the computational execution rate to the 3/4
power.
Using characterizations such as these, it is possible to project
the I/O requirements for a teraflops computer. The amount of data
movement from the supercomputer to the archival storage system is
estimated to be between 5 and 35 terabytes of data per day. If the
flow to the local disk is included, the amount will be up to seven
times larger. A teraflops computer will need to be able to support
an appreciable fraction of the data movement associated with
petabyte computing.
This implies that it will be quite feasible to consider building
a computer capable of processing a petabyte of data daily within
the next 2 years. Given the rate of technology advancement, the
system will be affordable for commercial use within 5 years. This
assumes that the software infrastructure described previously has
been developed.
The major driving force to develop this capability is the vision
that predictive systems based on data-intensive computations are
possible. This vision is based on the technological expertise that
is emerging from a variety of national research efforts. These
include the NSF/ARPA-funded Gigabit Testbeds, the Scalable I/O
Initiative, NSL UniTree/HPSS archival storage prototypes, and the
development of the MetaCenter concept. These projects can be
leveraged to build a significant technological lead for the United
States in data assimilation.
Recommendations
A collaborative effort is needed between the public sector and
software vendors to develop the underlying technology to enable
analysis of petabyte data archives. This goal is sufficiently
aggressive and incorporates a wide enough range of technologies
that no single vendor will be able to build a petabyte compute
capability. The implementation of such a capability can lead to
dramatic new ways for understanding our environment and the impact
our technology has upon the environment. Building the software
infrastructure through a public sector
Page 411
initiative will allow future commercial systems to be
constructed more rapidly. With the rapid advance of hardware
technology, commercial versions of the petabyte compute capability
will be feasible within 5 years.
References
[1] Moore, Reagan W. 1992. ''File Servers,
Networking, and Supercomputers," Advanced Information Storage
Systems, Vol. 4, SDSC Report GA-A20574.
[2] Davis, F., W. Farrell, J. Gray, C.R.
Mechoso, R.W. Moore, S. Sides, and M. Stonebraker. 1994. "EOSDIS
Alternative Architecture," final report submitted to HAIS,
September 6.
[3] Vildibill, Mike, Reagan W. Moore, and
Henry Newman. 1993. "I/O Analysis of the CRAY Y-MP8/864,"
Proceedings of the 31st Semi-annual Cray User Group Meeting,
Montreaux, Switzerland, March.