Getting Up to Speed: The Future of Supercomputing
2 Explanation of Supercomputing
The term “supercomputer” refers to computing systems (hardware, systems software, and applications software) that provide close to the best currently achievable sustained performance on demanding computational problems. The term can refer either to the hardware/software system or to the hardware alone. Two definitions follow:
From Landau and Fink[1]: “The class of fastest and most powerful computers available.”
From the Academic Press Dictionary of Science and Technology: “1. any of a category of extremely powerful, large-capacity mainframe computers that are capable of manipulating massive amounts of data in an extremely short time. 2. any computer that is one of the largest, fastest, and most powerful available at a given time.”
“Supercomputing” is used to denote the various activities involved in the design, manufacturing, or use of supercomputers (e.g., “supercomputing industry” or “supercomputing applications”). Similar terms are “high-performance computing” and “high-end computing.” The latter terms are used interchangeably in this report to denote the broader range of activities related to platforms that share the same technology as supercomputers but may have lower levels of performance.
[1] Rubin H. Landau and Paul J. Fink, Jr. 1993. A Scientist’s and Engineer’s Guide to Workstations and Supercomputers. New York, N.Y.: John Wiley & Sons.
The meaning of the terms supercomputing or supercomputer is relative to the overall state of computing at a given time. For example, in 1994, when describing computers subject to export control, the Department of Commerce’s Bureau of Export Administration amended its definition of “supercomputer” to increase the threshold level from a composite theoretical performance (CTP) equal to or exceeding 195 million theoretical operations per second (MTOPS) to a CTP equal to or exceeding 1,500 MTOPS.[2] Current examples of supercomputers are contained in the TOP500 list of the 500 most powerful computer systems as measured by best performance on the Linpack benchmarks.[3]
Supercomputers provide significantly greater sustained performance than is available from the vast majority of installed contemporary mainstream computer systems. In applications such as the analysis of intelligence data, weather prediction, and climate modeling, supercomputers enable the generation of information that would not otherwise be available or that could not be generated in time to be actionable. Supercomputing can accelerate scientific research in important areas such as physics, material sciences, biology, and medicine. Supercomputer simulations can augment or replace experimentation in cases where experiments are hazardous, expensive, or even impossible to perform or to instrument. They can collapse time and enable us to observe the evolution of climate over centuries or the evolution of galaxies over billions of years; they can expand space and allow us to observe atomic phenomena or shrink space and allow us to observe the core of a supernova. They can save lives and money by producing better predictions on the landfall of a hurricane or the impact of an earthquake.
In most cases, the problem solved on a supercomputer is derived from a mathematical model of the physical world. Approximations are made when the world is represented using continuous models (partial differential equations) and when these continuous models are discretized. Validated approximate solutions will provide sufficient information to stimulate human scientific imagination or to aid human engineering judgment. As computational power increases, fewer compromises are made, and more accurate results can be obtained. Therefore, in many application domains, there is essentially no limit to the amount of compute power that can be usefully applied to a problem. As the committee shows in Chapter 4, many disciplines have a good understanding of how they would exploit supercomputers that are many orders of magnitude more powerful than the ones they currently use; they have a good understanding of how science and engineering will benefit from improvements in supercomputing performance in the years and decades to come.
[2] Federal Register, February 24, 1994, at <http://www.fas.org/spp/starwars/offdocs/940224.htm>.
[3] The TOP500 list is available at <http://www.top500.org>. The Linpack benchmark solves a dense system of linear equations; in the version used for TOP500, one picks a system size for which the computer exhibits the highest computation rate.
One of the principal ways to increase the amount of computing achievable in a given period of time is to use parallelism, that is, doing multiple coordinated computations at the same time. Some problems, such as searches for patterns in data, can distribute the computational workload easily. The problem can be broken down into subproblems that can be solved independently on a diverse collection of processors that are intermittently available and that are connected by a low-speed network such as the Internet.[4] Some problems necessarily distribute the work over a high-speed computational grid[5] in order to access unique resources such as very large data repositories or real-time observational facilities. However, many important problems, such as the modeling of fluid flows, cannot be so easily decomposed or widely distributed. While the solution of such problems can be accelerated through the use of parallelism, dependencies among the parallel subproblems necessitate frequent exchanges of data and partial results, thus requiring significantly better communication (both higher bandwidth and lower latency) between processors and data storage than can be provided by a computational grid.
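The contrast drawn above between easily distributed searches and tightly coupled problems such as fluid-flow modeling can be illustrated with a minimal sketch. It is not taken from the report. It assumes a toy one-dimensional smoothing stencil split across two subdomains, and all function names and sizes are hypothetical. The independent search needs no communication between pieces, whereas each stencil subdomain must receive a boundary ("halo") value from its neighbor on every time step.

```python
def search_independent(records, pattern):
    """Pattern search: each record can be scanned on a different machine,
    and the only communication is the final sum of the counts."""
    return sum(record.count(pattern) for record in records)


def step(u, left, right):
    """One smoothing step on the list u, given the values just outside
    its left and right ends (hypothetical toy stencil)."""
    padded = [left] + u + [right]
    return [0.25 * padded[i - 1] + 0.5 * padded[i] + 0.25 * padded[i + 1]
            for i in range(1, len(padded) - 1)]


def run_serial(u, steps):
    """Reference computation on the whole domain, fixed zero boundaries."""
    for _ in range(steps):
        u = step(u, 0.0, 0.0)
    return u


def run_decomposed(u, steps):
    """Same computation split into two subdomains a and b.
    Every step, each subdomain needs one value from the other."""
    mid = len(u) // 2
    a, b = u[:mid], u[mid:]
    exchanges = 0
    for _ in range(steps):
        halo_for_a = b[0]    # value that a's owner must receive from b
        halo_for_b = a[-1]   # value that b's owner must receive from a
        exchanges += 2
        a, b = step(a, 0.0, halo_for_a), step(b, halo_for_b, 0.0)
    return a + b, exchanges


u0 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
result, exchanges = run_decomposed(u0, 10)
print("matches serial result:", result == run_serial(u0, 10))
print("halo exchanges needed:", exchanges)
```

In this sketch the number of exchanges grows with the number of time steps, not with the amount of data, so the latency of each exchange directly limits how fast the decomposed computation can advance.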
Both computational grids and supercomputers hosted in one machine room are components of a cyberinfrastructure, defined in a recent NSF report as “the infrastructure based upon distributed computer, information and communication technology.”[6] This report focuses mostly on systems hosted in one machine room (such systems often require a large, dedicated room). To maintain focus, it does not address networking except to note its importance. Also, the report does not address special-purpose hardware accelerators. Special-purpose hardware has always played an important but limited role in supercomputing.[7] The committee has no evidence that this situation is changing. While it expects special-purpose hardware to continue to play an important role, the existence of such systems does not affect its discussion of general-purpose supercomputers.
[4] A good example is SETI@home: The Search for Extraterrestrial Intelligence, <http://setiathome.ssl.berkeley.edu>.
[5] A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities (I. Foster and C. Kesselman, 2003, The Grid 2: Blueprint for a New Computing Infrastructure, 2nd ed., San Francisco, Calif.: Morgan Kaufmann).
[6] NSF. 2003. Revolutionizing Science and Engineering Through Cyberinfrastructure. NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure.
Supercomputers in the past were distinguished by their unique (vector) architecture and formed a clearly identifiable product category. Today, clusters of commodity computers that achieve the highest performance levels in scientific computing are not very different from clusters of similar size that are used in various commercial applications. Thus, the distinction between supercomputers and mainstream computers has blurred. Any attempt to draw a clear dividing line between supercomputers and mainstream computers, e.g., by price or level of performance, will lead to arbitrary distinctions. Rather than attempting to draw such distinctions, the discussion will cover the topmost performing systems but will not exclude other high-performance computing systems that share to a significant extent common technology with the top-performing systems.
Virtually all supercomputers are constructed by connecting a number of compute nodes, each having one or more processors with a common memory, by a high-speed interconnection network (or switch). Supercomputer architectures differ in the design of their compute nodes, their switches, and their node-switch interface. The system software used on most contemporary supercomputers is some variant of UNIX; most commonly, programs are written in Fortran, C, and C++, augmented with language or library extensions for parallelism and application libraries. Global parallelism is most frequently expressed using the MPI message passing library,[8] while OpenMP[9] is often used to express parallelism within a node. Libraries and languages that support global arrays[10] are becoming more widely used as hardware support for direct access to remote memory becomes more prevalent. Most of the traditional algorithm approaches must be modified so that they scale effectively on platforms with a large number of processors. Supercomputers are used to handle larger problems or to introduce more accurate (but more computationally intensive) physical models. Both may require new algorithms. Some algorithms are very specialized to a particular application domain, whereas others (for example, mesh partitioners) are of general use.
[7] The GRAPE (short for GRAvity PipE) family of special-purpose systems for astrophysics is one example. See the GRAPE Web site for more information: <http://grape.astron.s.utokyo.ac.jp/grape/>.
[8] MPI: A Message-Passing Interface Standard; see <http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html>.
[9] Leonardo Dagum and Ramesh Menon. 1998. “OpenMP: An Industry-Standard API for Shared-Memory Programming.” IEEE Journal of Computational Science and Engineering (5)1.
[10] Tarek A. El-Ghazawi, William W. Carlson, and Jesse M. Draper, UPC Language Specification (V 1.1.1), <http://www.gwu.edu/~upc/docs/upc_spec_1.1.1.pdf>; Robert W. Numrich and John Reid, 1998, “Co-array Fortran for Parallel Programming,” SIGPLAN Fortran Forum 17(2), 1-31; J. Nieplocha, R.J. Harrison, and R.J. Littlefield, 1996, “Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers,” Journal of Supercomputing 10, 197-220; Katherine Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger, Susan Graham, David Gay, Philip Colella, and Alexander Aiken, 1998, “Titanium: A High-Performance Java Dialect,” Concurrency: Practice and Experience 10, 825-836.
Two commonly used measures of the overall productivity of high-end computing platforms are capacity and capability. The largest supercomputers are used for capability or turnaround computing where the maximum processing power is applied to a single problem. The goal is to solve a larger problem, or to solve a single problem in a shorter period of time. Capability computing enables the solution of problems that cannot otherwise be solved in a reasonable period of time (for example, by moving from a two-dimensional to a three-dimensional simulation, using finer grids, or using more realistic models). Capability computing also enables the solution of problems with real-time constraints (e.g., intelligence processing and analysis). The main figure of merit is time to solution.
Smaller or cheaper systems are used for capacity computing, where smaller problems are solved. Capacity computing can be used to enable parametric studies or to explore design alternatives; it is often needed to prepare for more expensive runs on capability systems. Capacity systems will often run several jobs simultaneously. The main figure of merit is sustained performance per unit cost. There is often a trade-off between the two figures of merit, as further reduction in time to solution is achieved at the expense of increased cost per solution; different platforms exhibit different trade-offs.
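The two figures of merit named above, time to solution for capability computing and sustained performance per unit cost for capacity computing, can be made concrete with a small sketch. The two platforms, their sustained rates, their costs, and the job size below are all hypothetical values chosen only to show the trade-off; none of them comes from the report.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    sustained_tflops: float  # sustained (not peak) rate on the application of interest
    cost_musd: float         # hypothetical system cost in millions of dollars


def time_to_solution_hours(platform, work_tflop):
    """Capability figure of merit: hours needed for a job that requires
    work_tflop * 10**12 floating-point operations."""
    seconds = work_tflop / platform.sustained_tflops
    return seconds / 3600.0


def sustained_per_unit_cost(platform):
    """Capacity figure of merit: sustained Tflops per million dollars."""
    return platform.sustained_tflops / platform.cost_musd


# Hypothetical systems, for illustration only.
big = Platform("capability system", sustained_tflops=10.0, cost_musd=100.0)
small = Platform("capacity system", sustained_tflops=2.0, cost_musd=10.0)
job = 72000.0  # hypothetical job size, in units of 10**12 operations

for p in (big, small):
    print(p.name,
          "| hours to solution:", time_to_solution_hours(p, job),
          "| sustained Tflops per $M:", sustained_per_unit_cost(p))
```

With these assumed numbers the larger system finishes the job five times sooner, while the smaller system delivers twice the sustained performance per dollar, which is the kind of trade-off the text describes.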
Capability systems are designed to offer the best possible capability, even at the expense of increased cost per sustained performance, while capacity systems are designed to offer a less aggressive reduction in time to solution but at a lower cost per sustained performance.[11]
A commonly used unit of measure for both capacity systems and capability systems is peak floating-point operations (additions or multiplications) per second, often measured in teraflops (Tflops), or 10^12 floating-point operations per second. For example, a 2003 report by the JASONs[12] estimated that within 10 years a machine of 1,000 Tflops (1 petaflops) would be needed to execute the most demanding Advanced Simulation and Computing (ASC) application (compared to the then-existing ASC platforms of highest capability, the White at Lawrence Livermore National Laboratory and the Q at Los Alamos National Laboratory, of 12.3 Tflops and 20 Tflops, respectively). Although peak flops are a contributing factor to performance, they are only a partial measure of supercomputer productivity because performance as delivered to the user depends on much more than peak floating-point performance (e.g., on local memory bandwidth and latency or interconnect bandwidth and latency).
[11] Note that the capacity or capability of a system depends on the mix of application codes it runs. The SETI@home grid system provides more sustained performance for its application than is possible on any single supercomputer platform; it would provide very low sustained performance on a weather simulation.
[12] JASON Program Office. 2003. Requirements for ASCI. July 29.
A system designed for high capability can typically be reconfigured into multiple virtual lower-capacity machines to run multiple less demanding jobs in parallel. There is much discussion about the use of custom processors for capacity computing (see Box 2.1). Commodity clusters are frequently used for capacity computing because they provide better cost/performance. However, for many capability applications, custom processors give faster turnaround, even on applications for which they are not the most cost-effective capacity machines.
A supercomputer is a scientific instrument that can be used by many disciplines and is not exclusive to one discipline. It can be contrasted, for example, with the Hubble Space Telescope, which has immense potential for enhancing human discovery in astronomy but little potential for designing automobiles. Astronomy also relies heavily on supercomputing to simulate the life cycle of stars and galaxies, after which results from simulations are used in concert with Hubble’s snapshots of stars and galaxies at various evolutionary stages to form consistent theoretical views of the cosmos.
It can be argued that supercomputing is no less important than the Hubble Telescope in achieving the goal of understanding the universe. However, it is likely that astronomers paid much less attention to ensuring that supercomputing resources would be available than they paid to carefully justifying the significant cost of the telescope. In astronomy, as in other disciplines, supercomputing is essential to progress but is not discipline-specific enough to marshal support to ensure that it is provided. Nevertheless, as the committee heard, the net contributions of supercomputing, when summed over a multitude of disciplines, are no less than monumental in their impact on overall human goals. Therefore, supercomputing in some sense transcends its individual uses and can be a driver of progress in the 21st century.
BOX 2.1 Custom Processors and Commodity Processors
Most supercomputers are built from commodity processors that are designed for a broad market and are manufactured in large numbers. A small number of supercomputers use custom processors that are designed to achieve high performance in scientific computing and are manufactured in small numbers. Commodity processors, because they benefit from economies of scale and sophisticated engineering, provide the shortest time to solution (capability) and the highest sustained performance per unit cost (capacity) for a broad range of applications that have significant spatial and temporal locality and therefore take good advantage of the caches provided by commodity processors. A small set of important scientific applications, however, has almost no locality. They achieve shorter time to solution and better sustained performance per unit cost on a custom processor that provides higher effective local memory bandwidth on access patterns having no locality. For a larger set of applications with low locality, custom processors deliver better time to solution but at a higher cost per unit of sustained performance.
Commodity processors are often criticized because of their low efficiency (the fraction of peak performance they sustain). However, peak performance, and hence efficiency, is the wrong measure. The system metrics that matter are sustained performance (on applications of interest), time to solution, and cost.
The rate at which operands can be transferred to/from the processor is the primary performance bottleneck for many scientific computing codes.[1,2] Custom processors differ primarily in the effective memory bandwidth that they provide on different types of access patterns. Whether a machine has a vector processor, a scalar processor, or a multithreaded processor is a secondary issue. The main issue is whether it has efficient support for irregular accesses (gather/scatter), high memory bandwidth, and the ability to hide memory latency so as to sustain this bandwidth. Vector processors, for example, typically have a short (if any) cache line and high memory bandwidth. The vectors themselves provide a latency hiding mechanism. Such features enable custom processors to more efficiently deliver the raw memory bandwidth provided by memory chips, which often dominate system cost. Hence, these processors can be more cost effective on applications that are limited by memory bandwidth.
Commodity processors are manufactured in high volume and hence benefit from economies of scale. The high volume also justifies sophisticated engineering; for example, the clock rate of the latest Intel Xeon processor is at least four times faster than the clock rate of the Cray X1. A commodity processor includes much of its memory system but little of its memory capacity on the processor chip, and this memory system is adapted for applications with high spatial and temporal locality. A typical commodity processor chip includes the level 1 and 2 caches on the chip and an external memory interface. This external interface limits sustained local memory bandwidth and requires local memory accesses to be performed in units of cache lines (typically 64 to 128 bytes in length[3]). Accessing memory in units of cache lines wastes a large fraction (as much as 94 percent) of local memory bandwidth when only a single word of the cache line is needed.
Many scientific applications have sufficient spatial and temporal locality that they provide better performance per unit cost on commodity processors than on custom processors. Some scientific applications can be solved more quickly using custom processors but at a higher cost. Some users will pay that cost; others will tolerate longer times to solution or restrict the problems they can solve to save money. A small set of scientific applications that are bandwidth-intensive can be solved both more quickly and more cheaply using custom processors. However, because this application class is small, the market for custom processors is quite small.[4]
In summary, commodity processors optimized for commercial applications meet the needs of most of the scientific computing market. For the majority of scientific applications that exhibit significant spatial and temporal locality, commodity processors are more cost effective than custom processors, making them better capability machines. For those bandwidth-intensive applications that do not cache well, custom processors are more cost effective and therefore offer better capacity on just those applications. They also offer better turnaround time for a wider range of applications, making them attractive capability machines. However, the segment of the scientific computing market (bandwidth-intensive and capability) that needs custom processors is too small to support the free market development of such processors.
The above discussion is focused on hardware and on the current state of affairs. As the gap between processor speed and memory speed continues to increase, custom processors may become competitive for an increasing range of applications.
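The cache-line waste quoted in Box 2.1 (as much as 94 percent) can be reproduced with simple arithmetic. The sketch below assumes that the "single word" is an 8-byte (64-bit) value; the box does not state the word size, so that figure is an assumption, and the function name is hypothetical.

```python
def wasted_fraction(line_bytes, useful_bytes=8):
    """Fraction of the transferred cache line that is not used when only
    useful_bytes of it are needed (8-byte word is an assumption)."""
    return 1.0 - useful_bytes / line_bytes


# Line sizes mentioned in the box: 64 and 128 bytes are typical,
# and footnote 3 cites a 512-byte level 3 cache line.
for line in (64, 128, 512):
    print(f"{line:4d}-byte line: {wasted_fraction(line):.4f} of bandwidth wasted")
```

Under the 8-byte assumption, a 128-byte line wastes 0.9375 of the transferred bytes, which rounds to the 94 percent stated in the box, and a 64-byte line wastes 0.875.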
From the software perspective, systems with fewer, more powerful processors are easier to program. Increasing the scalability of software applications and tools to systems with tens of thousands or hundreds of thousands of processors is a difficult problem, and the characteristics of the problem do not behave in a linear fashion. The cost of using, developing, and maintaining applications on custom systems can be substantially less than the comparable cost on commodity systems and may cancel out the apparent cost advantages of hardware for commodity-based high-performance systems, for applications that will run only on custom systems. These issues are discussed in more detail in Chapter 5.
[1] L. Carrington, A. Snavely, X. Gao, and N. Wolter. 2003. A Performance Prediction Framework for Scientific Applications. ICCS Workshop on Performance Modeling and Analysis (PMA03). Melbourne, June.
[2] S. Goedecker and A. Hoisie. 2001. Performance Optimization of Numerically Intensive Codes. Philadelphia, Pa.: SIAM Press.
[3] The IBM Power 4 has a 512-byte level 3 cache line.
[4] This categorization of applications is not immutable. Since commodity systems are cheaper and more broadly available, application programmers have invested significant effort in adapting applications to these systems. Bandwidth-intensive applications are those that are not easily adapted to achieve acceptable performance on commodity systems. In many cases the difficulty seems to be intrinsic to the problem being solved.
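Box 2.1 argues that sustained performance, not peak performance, is the metric that matters, and that the rate at which operands reach the processor is the primary bottleneck for many codes. One simple way to picture this is a bandwidth bound of the kind often called a roofline-style estimate. This model is not taken from the report; it is a common simplification introduced here for illustration, and the processor figures below are hypothetical.

```python
def sustained_bound_gflops(peak_gflops, mem_bw_gbytes_per_s, flops_per_byte):
    """Upper bound on sustained rate: a code cannot run faster than the
    peak rate, nor faster than memory can supply its operands."""
    return min(peak_gflops, mem_bw_gbytes_per_s * flops_per_byte)


def efficiency(sustained_gflops, peak_gflops):
    """Fraction of peak performance that is sustained."""
    return sustained_gflops / peak_gflops


# Hypothetical processor: 10 Gflops peak, 5 Gbytes/s effective memory bandwidth.
peak, bandwidth = 10.0, 5.0

# Hypothetical codes: one performs few operations per byte moved (low locality),
# the other reuses cached data heavily (high locality).
for label, intensity in (("low locality", 0.25), ("high locality", 4.0)):
    bound = sustained_bound_gflops(peak, bandwidth, intensity)
    print(label, "| bound (Gflops):", bound,
          "| efficiency:", efficiency(bound, peak))
```

In this sketch the low-locality code is limited by memory bandwidth to a small fraction of peak no matter how high the peak rate is, which is why raising effective memory bandwidth, as the box says custom processors do, helps such codes more than raising peak floating-point rate.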