Enabling Petabyte Computing
Statement of the Problem
One expectation for the national information infrastructure (NII) is that information be readily available and accessible over the Internet. Large archives will be created that will make various types of data available: remote-sensing data, economic statistics, clinical patient records, digitized images of art, government records, scientific simulation results, and so on. To remain competitive, one must be able to analyze this information. For example, researchers will want to "mine" data to find correlations between disparate data sets such as epidemiology studies of patient records. Others will want to analyze remote-sensing data to develop better predictive models of the impact of government regulations on the environment. Still others will want to incorporate direct field measurements into their weather models to improve the models' predictive power. In each of these cases, the ability to manipulate massive data sets will become as important as computational modeling is to solve problems.
A new technology is needed to enable this vision of "data assimilation." It can be characterized as "petabyte computing," the manipulation of terabyte-size data sets accessed from petabyte-size archives [1]. The infrastructure that will sustain this level of data assimilation will constitute a significant technological advantage. Because of recent advances in archival storage, it is now feasible to implement a system that can manipulate data on a scale a thousand times larger than is being attempted today. Unfortunately, the software infrastructure to control such data movement does not yet exist. An initiative to develop the corresponding software technology is needed to enable this vision by 2000.
Advances in archival storage technology have made it possible to consider manipulating terabyte data sets accessed from petabyte storage archives. At the same time, advances in parallel computer technology make it possible to process the retrieved data at comparable rates. The combination of these technologies will enable a new mode of science in which data assimilation becomes as important as computational simulation is to the development of predictive models.
Data assimilation can be viewed as combining data mining (in which correlations are sought between large data sets) and data modeling (in which observational data are combined with a simulation model to provide an improved predictive system). These approaches may require locating a single data set within a data archive ("data picking") or deriving a data subset from data that may be distributed uniformly throughout the archive. The latter requires supporting the streaming of data through data-subsetting platforms to create the desired data set. Therefore, handling large data sets in this manner will become dependent on the ability to manipulate parallel I/O streams.
Current capabilities are exemplified by commercial applications of data mining in which companies maintain "just-in-time" inventory by aggressively analyzing daily or weekly sales. The analysis of sales trends allows a company to tune purchase orders to meet the predicted demand, thus minimizing the cost and overhead of maintaining a large inventory of goods. The amount of information analyzed is limited by current storage and database technology. Data sets up to 5 terabytes in size, consisting primarily of short transaction records, can be analyzed.
Examples of similar transaction-based data sets include bank check logging archives, airline ticket archives, insurance claim archives, and clinical patient hospital records. Each of these archives constitutes a valuable repository of information that could be mined to analyze trends, search for compliance with federal laws, or predict usage patterns.
The size of such individual archives is expected to grow to petabytes by 2000. Part of the growth in size is expected from the aggregation of information over time. But an important component of the size increase is expected to come from incorporating additional ancillary information into the databases. Clinical patient records will be augmented with the digitized data sets produced by modern diagnostic equipment such as magnetic resonance imaging, positron emission tomography, x-rays, and so on. Insurance claim archives will be augmented with videotapes of each accident scene. Check archives will be augmented with digitized images of each check to allow validation of signatures.
In addition, virtually all scientific disciplines are producing data sets of a size comparable to those found in industry. These data sets, though, are distinguished from commercial data sets in that they consist predominantly of binary large objects or "blobs," with small amounts of associated metadata that describe the contents and format of each blob. A premier example of such data sets is the Earth Observing System archive that will contain satellite-based remote-sensing images of the Earth [2]. The archive is expected to grow to 8 petabytes in size by 2006. The individual data sets will consist of multifrequency digitized images of the Earth's surface below the satellite flight paths. The multifrequency images can be analyzed to detect vegetation, heat sources, mineral composition, glaciers, and many other features of the surface.
With such a database, it should be possible to examine the effect of governmental regulations on the environment. For example, it will be possible to measure the size of croplands and compare those measurements to regulations on land use policies or water usage. By incorporating economic and census information, it will be possible to measure the impact of restrictions of water allocations on small versus large farms. Another example will be to correlate crop subsidies with actual crop yield and water availability. Numerous other examples can be given to show the usefulness of remote-sensing data in facilitating the development of government regulations.
Remote-sensing data can also be used to improve our knowledge of the environment. An interesting example is calculating the global water budget by measuring the change in size of the world's glaciers and the heights of the oceans. This information is needed to understand global warming, better predict climate change, and predict water availability for farming. All these examples require the ability to manipulate massive amounts of data, both to pick out individual data sets of interest and to stream large fractions of the archive through data-subsetting platforms to find the appropriate information.
Further examples of large scientific data sets include the following:
The capacity of today's hardware and software infrastructure for supporting data assimilation of large data sets needs to be increased by a factor of roughly 1,000, as shown in Table 1.
Furthermore, increased functionality is needed to support manipulating data sets accessed from archives within databases. The development of the higher-capacity, greater-functionality petabyte computing system should be a national goal.
Analysis and Forecast
Technology advancements now make it possible to integrate analysis of massive observational databases with computational modeling. For example, coupling observed data with computer simulations can provide greatly enhanced predictive systems for understanding our interactions with our environment. This represents an advance in scientific methodology that will allow solution of radically new problems with direct application not only to the largest academic research problems, but also to governmental policymaking and commercial competitiveness.
The proposed system represents the advent of "petabyte computing," the ability to solve important societal problems that require processing petabytes of data per day. Such a system requires the ability to sustain local data rates on the order of 10 gigabytes/second, teraflops compute power, and petabyte-size archives. Current systems are a factor of 1,000 smaller in scale, sustaining I/O at 10 to 20 megabytes/second, gigaflops execution rates, and terabyte-size archives. The advent of petabyte computing will be made possible by advances in archival storage technology (hundreds of terabytes to petabytes of data available in a single tape robot) and the development of scalable systems with linearly expandable I/O, storage, and compute capabilities. It is now feasible to design a system that can support distributed scientific data mining and the associated computation.
A petabyte computing system will be a scalable parallel system. It will provide archival storage space, data access transmission bandwidth, and compute power in proportional amounts. As the size of the archive is increased, the data access bandwidth and the compute power should grow at the same rate. This implies both a technology that can be expanded to handle even larger data archives as well as one that can be reduced in scope for cost-effective handling of smaller data archives. The system can be envisioned as a parallel computer that supports directly attached peripherals that constitute the archival storage system. Each peripheral will need its own I/O channel to access a separate data-subsetting platform, which in turn will be connected through the parallel computer to other compute nodes.
The hardware to support such a system will be available in 1996. Data archives will be able to store a petabyte of data in a single tape robot. They will provide a separate controller for each storage device, allowing transmission of large data sets in parallel. The next-generation parallel computers will be able to sustain I/O rates of 12 gigabytes/second to attached peripheral storage devices, allowing the movement of a petabyte of data per day between the data archive and the compute platform. The parallel computers will also be able to execute at rates exceeding 100 gigaflops, thus providing the associated compute power needed to process the data. Since the systems are scalable, commercial grade systems could be constructed from the same technology by reducing the number of nodes. Systems that support the analysis of terabytes of data per day could be built at greatly reduced cost.
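The relationship between a petabyte per day and the quoted 12 gigabytes/second can be checked with simple arithmetic. The sketch below assumes decimal units (1 petabyte = 10^15 bytes), which the text does not specify, and uses the 10 to 20 megabytes/second figure quoted earlier for current systems to estimate how many such channels would have to run in parallel. The function names are illustrative only.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
PETABYTE = 1e15  # decimal petabyte (assumption; the text does not say decimal or binary)
MEGABYTE = 1e6


def sustained_rate(bytes_per_day: float) -> float:
    """Average rate in bytes/second needed to move a daily volume."""
    return bytes_per_day / SECONDS_PER_DAY


def channels_needed(target_rate: float, per_channel_rate: float) -> int:
    """Number of parallel I/O channels needed to reach the target rate."""
    return math.ceil(target_rate / per_channel_rate)


rate = sustained_rate(PETABYTE)
print(f"1 PB/day = {rate / 1e9:.1f} GB/s")  # about 11.6 GB/s

for mb_per_s in (10, 20):
    n = channels_needed(rate, mb_per_s * MEGABYTE)
    print(f"{n} channels at {mb_per_s} MB/s each")
```

Under these assumptions the average rate is about 11.6 gigabytes/second, which rounds to the 12 gigabytes/second cited, and on the order of a thousand channels at current per-channel rates would be required.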
What is missing is the software infrastructure to control data movement. Mechanisms are needed that allow an application to manipulate data stored in the archive through a database interface. The advantages provided by such a system are that the application does not have to coordinate the data movement; data sets can be accessed through relational queries rather than by file name; and data formats can be controlled by the database, eliminating the need to convert between file formats. The software infrastructure needed to do this consists of a library interface between the application and the database to convert application read and write requests into SQL-* queries to the database, an object-relational database that supports user-extensible functions to allow application-specific manipulations of the data sets, an interface between the database and the archival storage system to support data movement and control data placement, and an archive system that masks hardware dependencies from the database. SQL-* is a notation that has been developed in a previous project for a future ANSI-standard extended Structured Query Language.
Each of these software systems is being built independently of the others, with a focus on a system that can function in a local area network. The petabyte computing capability requires the integration of these technologies across a wide area network to enable use of data-intensive analyses on the NII. Such a system will be able to support access to multiple databases and archives, allowing the integration of information from multiple sources.
Technology advances are needed in each of the underlying software infrastructure components: database technology, archival storage technology, and data-caching technology. Advances are also needed in the interfaces that allow the integration of these technologies over a wide area network. Each of these components is examined in more detail in the following sections.
A major impediment to constructing a petabyte computing system has been the UNIX file system. When large amounts of data that may consist of millions of files must be manipulated, the researcher is confronted with the need to design data format interface tools, devise schemes to keep track of the large name space, and develop scripts to cache the data as needed in the local file system. Database technology eliminates many of these impediments.
Scientific databases appropriate for petabyte computing will need to integrate the capabilities of both relational database technology and object-oriented technology. Such systems are called object-relational databases. They support queries based on SQL, augmented by the ability to specify operations that can be applied to the data sets. They incorporate support for user-defined functions that can be used to manipulate the data objects stored in the database. The result is a system that allows a user to locate data by attribute, such as time of day when the data set was created or geographic area that the data set represents, and to perform operations on the data.
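The idea of locating data by attribute and applying a user-defined function to the stored object can be illustrated on a very small scale. The sketch below uses SQLite from the Python standard library purely as a stand-in; it is not one of the object-relational systems discussed here and it does not use SQL-*. The table layout, the region and date values, and the function `mean_value` are all hypothetical.

```python
import sqlite3


def mean_value(blob: bytes) -> float:
    """Hypothetical user-defined function: mean of the byte values in a blob."""
    return sum(blob) / len(blob) if blob else 0.0


conn = sqlite3.connect(":memory:")
# Register the Python function so that it can be called from SQL.
conn.create_function("mean_value", 1, mean_value)

conn.execute(
    "CREATE TABLE images ("
    "id INTEGER PRIMARY KEY, region TEXT, acquired TEXT, pixels BLOB)"
)
conn.executemany(
    "INSERT INTO images (region, acquired, pixels) VALUES (?, ?, ?)",
    [
        ("alaska", "1995-06-01", bytes([10, 20, 30])),
        ("alaska", "1995-07-01", bytes([40, 50, 60])),
        ("sahara", "1995-06-01", bytes([200, 210, 220])),
    ],
)

# Locate data sets by attribute and apply the function to each blob.
rows = conn.execute(
    "SELECT acquired, mean_value(pixels) FROM images "
    "WHERE region = ? ORDER BY acquired",
    ("alaska",),
).fetchall()
print(rows)
```

The query selects by metadata (region) and returns a derived quantity rather than the raw blob, which is the pattern that matters when the blobs are too large to hand back to the application unprocessed.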
To be useful as a component in the scalable petabyte system, the database must run on a parallel platform and support multiple I/O streams and coarse-grained operations on the data sets. This means the result of a query should be translated into the retrieval of multiple data sets, each of which is independently moved from the archive to a data-subsetting platform where the appropriate functions are applied. By manipulating multiple data sets simultaneously, the system will be able to aggregate the required I/O and compute rates to the level needed to handle petabyte archives. Early versions of object-relational database technology are available from several companies, such as Illustra, IBM, and Oracle, although they are designed to handle gigabyte data sets.
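The coarse-grained pattern described above, in which each data set returned by a query is retrieved and subset independently, might be sketched as follows. The in-memory dictionary stands in for the archive, a thread pool stands in for the parallel data-subsetting platforms, and every name in the example is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for data sets held in an archive.
ARCHIVE = {
    "scene-001": list(range(0, 100)),
    "scene-002": list(range(50, 150)),
    "scene-003": list(range(100, 200)),
}


def fetch(name):
    """Stand-in for moving one data set from the archive."""
    return ARCHIVE[name]


def subset(data, lo, hi):
    """Stand-in for the subsetting function applied on arrival."""
    return [v for v in data if lo <= v < hi]


def retrieve_and_subset(name, lo, hi):
    """One independent stream: retrieve a data set, then subset it."""
    return name, subset(fetch(name), lo, hi)


def run_query(names, lo, hi, streams=4):
    """Process every data set named by a query result in parallel streams."""
    with ThreadPoolExecutor(max_workers=streams) as pool:
        return dict(pool.map(lambda n: retrieve_and_subset(n, lo, hi), names))


result = run_query(list(ARCHIVE), 90, 110)
for name, values in result.items():
    print(name, len(values))
```

Because each stream touches only its own data set, the aggregate rate grows with the number of streams until the archive or the network becomes the limit.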
A second requirement for the object-relational database technology is that it be interfaced to archival storage systems that support the scientific data sets. Current database technology relies on the use of direct attached disk to store both data objects and the metadata. This limits the amount of data that can be analyzed. Early prototypes have been built that store large scientific data sets in a separate data archive. The performance of such systems is usually limited by the single communication channel typically used to link the archival storage system to the database.
The critical missing software component for manipulating large data sets is the interface between the database and the archival storage system. Mechanisms are needed that will allow the database to maintain large data sets on tertiary storage. The envisioned system will support transparent access of tertiary storage by applications. It will consist of a middleware product running on a supercomputer that allows an application to issue read and write calls that are transparently turned into SQL-* queries. The object-relational database processes the query and responds by applying the requested function to the appropriate data set. Since the database and archival storage system are integrated, the application may access data that are stored on tertiary tape systems. The data sets may in fact be too large to fit on the local disk, and the query function may involve creating the desired subset by using partial data set caching. The result is transparent access of archived data by the application without the user having to handle the data formatting or data movement.
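A rough sketch of the middleware idea follows: the application issues an ordinary read call, and a thin library turns it into a query that returns only the requested portion of the data set. SQLite and plain SQL are used here as stand-ins for the object-relational database and SQL-*; the class `ArchiveReader`, the table `datasets`, and its columns are hypothetical.

```python
import sqlite3


class ArchiveReader:
    """Hypothetical middleware: turns read calls into database queries."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn

    def read(self, name: str, offset: int, length: int) -> bytes:
        """Return `length` bytes of data set `name`, starting at `offset`."""
        # SQLite's substr() is 1-based and operates on bytes for BLOBs,
        # so only the requested slice is returned to the application.
        row = self._conn.execute(
            "SELECT substr(payload, ?, ?) FROM datasets WHERE name = ?",
            (offset + 1, length, name),
        ).fetchone()
        if row is None:
            raise KeyError(name)
        return bytes(row[0])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (name TEXT PRIMARY KEY, payload BLOB)")
conn.execute(
    "INSERT INTO datasets (name, payload) VALUES (?, ?)",
    ("scan", bytes(range(256))),
)

reader = ArchiveReader(conn)
chunk = reader.read("scan", offset=10, length=4)
print(list(chunk))
```

The application never sees a file name or a storage location; it names the data set and the byte range, and the library decides how the request is expressed to the database.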
The missing software infrastructure is the interface between object-relational databases and archival storage systems. Two interfaces are needed: a data control interface to allow the database to optimize data placement, grouping, layout, caching, and migration within the archival storage system; and a data movement interface to optimize retrieval of large data sets through use of parallel I/O streams.
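The separation between the two interfaces can be made concrete with a small sketch. The method names, the policy strings, and the in-memory implementation below are all assumptions made for illustration; they are not drawn from any existing database or archival storage product.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class DataControl(ABC):
    """Hypothetical control interface: the database steers placement and caching."""

    @abstractmethod
    def set_group(self, dataset: str, group: str) -> None: ...

    @abstractmethod
    def set_caching_policy(self, dataset: str, policy: str) -> None: ...


class DataMovement(ABC):
    """Hypothetical movement interface: retrieval over parallel I/O streams."""

    @abstractmethod
    def open_streams(self, dataset: str, n_streams: int) -> List[Iterator[bytes]]: ...


class InMemoryArchive(DataControl, DataMovement):
    """Toy archive that stripes a data set across streams in fixed-size chunks."""

    def __init__(self, chunk_size: int = 4):
        self._data: Dict[str, bytes] = {}
        self.groups: Dict[str, str] = {}
        self.policies: Dict[str, str] = {}
        self._chunk_size = chunk_size

    def store(self, dataset: str, payload: bytes) -> None:
        self._data[dataset] = payload

    def set_group(self, dataset: str, group: str) -> None:
        self.groups[dataset] = group

    def set_caching_policy(self, dataset: str, policy: str) -> None:
        self.policies[dataset] = policy

    def open_streams(self, dataset: str, n_streams: int) -> List[Iterator[bytes]]:
        payload = self._data[dataset]
        size = self._chunk_size
        chunks = [payload[i:i + size] for i in range(0, len(payload), size)]
        # Chunk k goes to stream k mod n_streams (round-robin striping).
        return [iter(chunks[s::n_streams]) for s in range(n_streams)]


archive = InMemoryArchive(chunk_size=4)
archive.store("scene", bytes(range(20)))
archive.set_group("scene", "1995-alaska")
archive.set_caching_policy("scene", "keep-on-disk")
parts = [list(stream) for stream in archive.open_streams("scene", 2)]
print([len(p) for p in parts])
```

Keeping control and movement as separate interfaces lets a database issue placement hints without being involved in the byte transfer itself.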
Prototype data movement interfaces are being built to support data movement between object-relational databases and archival storage systems that are compliant with the IEEE Mass Storage System Reference Model. These prototypes can be used to analyze I/O access patterns and help determine the requirements for the data control interface.
Archival Storage Technology
Third-generation archival storage technology, such as the High Performance Storage System (HPSS) that is being developed by the DOE laboratories in collaboration with IBM, provides most of the basic archival storage mechanisms needed to support petabyte computing. HPSS will support partial file caching, parallel I/O streams, and service classes. Service classes provide a mechanism to classify the type of data layout and data caching needed to optimize retrieval of a data set. Although HPSS is not expected to be a supported product until 1996 or 1997, this time frame is consistent with that expected for the creation of parallel object-relational database technology.
Extensions to the HPSS architecture are needed to support multiple bitfile movers on the parallel data-subsetting platforms. Each of the data streams will need a separate software driver to coordinate data movement with the direct attached storage device. The integration of these movers with the database data access mechanisms will require understanding how to integrate cache management between the two systems. The database will need to be able to specify data control elements, including data set groupings, data set location, and caching and migration policies. A generic interface that allows such control information to be passed between commercially provided databases and archival storage systems needs to become a standard for data-intensive computations to become a reality.
Two other national research efforts are investigating some of the component technologies: the Scalable I/O Initiative at Caltech and the National Science Foundation MetaCenter. The intent of the first project is to support data transfer across multiple parallel I/O streams from archival storage to the user application. The second project is developing the requisite common authentication, file, and scheduling systems needed to support distributed data movement.
Making petabyte computing available as a resource on the NII to access distributed sources of information will require understanding how to integrate cache management across multiple data-delivery mechanisms. In the wide area network that the NII will provide, the petabyte computing system must integrate distributed compute platforms, distributed data sources, and network buffers. Caching systems form a critical component of wide area network technology that can be used to optimize data movement among these systems. In particular, the number of times data are copied as they are moved from a data source to an application must be minimized to relieve bandwidth congestion, improve access time, and improve throughput. This is an important research area because many different data delivery mechanisms must be integrated to achieve this goal: archival storage, network transmission, databases, file systems, operating systems, and applications.
Data-intensive analysis of information stored in distributed data sources is an aggressive goal that is not yet feasible. Movement of a petabyte of data per day corresponds to an average bandwidth of 12 gigabytes/second. The new national networks transmit data at 15 megabytes/second, with the bandwidth expected to grow by a factor of 4 over the next 2 to 3 years. This implies that additional support mechanisms must be provided if more than a terabyte of data is going to be retrieved for analysis. The implementation of the petabyte computing capability over a wide area network may require the installation of caches within the network to minimize data movement and support faster local access. Management of network caches may become an important component of a distributed petabyte computing system. Until this occurs, data-intensive analyses will have to be done in a tightly coupled system.
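The gap between wide area network rates and the petabyte-per-day target can be quantified directly from the figures above. The sketch below assumes decimal units and a fully utilized link with no protocol overhead, neither of which is stated in the text.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 1e15  # decimal units (assumption)


def terabytes_per_day(rate_mb_per_s: float) -> float:
    """Daily volume a link can carry if it is kept fully busy."""
    return rate_mb_per_s * 1e6 * SECONDS_PER_DAY / 1e12


def days_to_move(volume_bytes: float, rate_mb_per_s: float) -> float:
    """Days needed to move a volume at a constant rate."""
    return volume_bytes / (rate_mb_per_s * 1e6) / SECONDS_PER_DAY


current = 15.0            # megabytes/second, as quoted for the new national networks
projected = 4 * current   # after the expected factor-of-4 growth

for label, rate in (("current", current), ("projected", projected)):
    print(
        f"{label}: {terabytes_per_day(rate):.1f} TB/day, "
        f"{days_to_move(PETABYTE, rate):.0f} days per petabyte"
    )
```

At 15 megabytes/second a link carries about 1.3 terabytes per day, and even after a fourfold increase it carries about 5 terabytes per day, so moving a petabyte would take months rather than a day.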
A teraflops computer will incorporate many of the features needed to support data-intensive problems. Such a system will provide the necessary scalable parallel architecture for computation, I/O access, and data storage. Data will flow in parallel from the data storage devices to parallel nodes on the teraflops computer. Most such systems are being designed to support computationally intensive problems.
The traditional perspective is that computationally intensive problems generate information on the teraflops computer, which is then archived. The minimum I/O rate needed to sustain just this data archiving can be estimated from current systems by scaling from data flow analyses of current CRAY supercomputers [3]. For the workload at the San Diego Supercomputer Center, roughly 14 percent of the data written to disk survives to the end of the computation and 2 percent of the generated data is archived. The amount of data that is generated is roughly proportional to the average workload execution rate, with about 1 bit of data transferred for every 6 floating-point operations [3]. Various scaling laws for how the amount of transmitted data varies as a function of the compute power can be derived for specific applications. For three-dimensional computational fluid dynamics, the expected data flow scales as the computational execution rate to the 3/4 power.
Using characterizations such as these, it is possible to project the I/O requirements for a teraflops computer. The amount of data movement from the supercomputer to the archival storage system is estimated to be between 5 and 35 terabytes of data per day. If the flow to the local disk is included, the amount will be up to seven times larger. A teraflops computer will need to be able to support an appreciable fraction of the data movement associated with petabyte computing.
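One way to reproduce the scale of these projections is sketched below. It rests on an interpretation that the text does not state explicitly: that the 2 percent and 14 percent fractions apply to the total volume implied by 1 bit transferred per 6 floating-point operations, with the machine sustaining its full teraflops rate all day. The power-law helper is included for the 3/4-power scaling, but because no baseline system is given, it is exercised only with placeholder numbers.

```python
SECONDS_PER_DAY = 86_400
FLOPS = 1e12               # sustained teraflops (assumption: full rate all day)
FLOPS_PER_BIT = 6          # from the workload characterization
ARCHIVED_FRACTION = 0.02   # share of generated data that is archived
SURVIVING_FRACTION = 0.14  # share that survives on disk to the end of the job


def daily_transfer_tb(flops: float, flops_per_bit: float = FLOPS_PER_BIT) -> float:
    """Total data transferred per day, in decimal terabytes."""
    bytes_per_s = flops / flops_per_bit / 8
    return bytes_per_s * SECONDS_PER_DAY / 1e12


def power_law_flow(base_flow: float, base_rate: float, new_rate: float,
                   exponent: float = 0.75) -> float:
    """Scale a known data flow to a new execution rate by a power law."""
    return base_flow * (new_rate / base_rate) ** exponent


total = daily_transfer_tb(FLOPS)
archived = ARCHIVED_FRACTION * total
on_disk = SURVIVING_FRACTION * total
print(f"total {total:.0f} TB/day, archived {archived:.0f} TB/day, "
      f"surviving on disk {on_disk:.0f} TB/day")

# Placeholder values only: a 16-fold faster machine under 3/4-power scaling.
print(power_law_flow(base_flow=1.0, base_rate=1.0, new_rate=16.0))
```

Under this reading the linear estimate gives roughly 36 terabytes per day archived, close to the upper end of the quoted range, and the ratio of surviving to archived data is 14/2 = 7, matching the "seven times larger" figure. Sublinear scaling such as the 3/4-power law would give a smaller result, which is consistent with the range having a much lower bound.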
This implies that it will be quite feasible to consider building a computer capable of processing a petabyte of data daily within the next 2 years. Given the rate of technology advancement, the system will be affordable for commercial use within 5 years. This assumes that the software infrastructure described previously has been developed.
The major driving force to develop this capability is the vision that predictive systems based on data-intensive computations are possible. This vision is based on the technological expertise that is emerging from a variety of national research efforts. These include the NSF/ARPA-funded Gigabit Testbeds, the Scalable I/O Initiative, NSL UniTree/HPSS archival storage prototypes, and the development of the MetaCenter concept. These projects can be leveraged to build a significant technological lead for the United States in data assimilation.
A collaborative effort is needed between the public sector and software vendors to develop the underlying technology to enable analysis of petabyte data archives. This goal is sufficiently aggressive and incorporates a wide enough range of technologies that no single vendor will be able to build a petabyte compute capability. The implementation of such a capability can lead to dramatic new ways for understanding our environment and the impact our technology has upon the environment. Building the software infrastructure through a public sector initiative will allow future commercial systems to be constructed more rapidly. With the rapid advance of hardware technology, commercial versions of the petabyte compute capability will be feasible within 5 years.
[1] Moore, Reagan W. 1992. "File Servers, Networking, and Supercomputers," Advanced Information Storage Systems, Vol. 4, SDSC Report GA-A20574.
[2] Davis, F., W. Farrell, J. Gray, C.R. Mechoso, R.W. Moore, S. Sides, and M. Stonebraker. 1994. "EOSDIS Alternative Architecture," final report submitted to HAIS, September 6.
[3] Vildibill, Mike, Reagan W. Moore, and Henry Newman. 1993. "I/O Analysis of the CRAY Y-MP8/864," Proceedings of the 31st Semi-annual Cray User Group Meeting, Montreux, Switzerland, March.