The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Supercomputing Past and Present

This chapter provides background material on supercomputing to establish key elements of context. A summary of reports and government activities illuminates the recent history of supercomputing. A brief overview of the current state of supercomputing technology follows.

PREVIOUS REPORTS AND RECENT FEDERAL INITIATIVES

During the past few decades, a number of reports have dealt with supercomputing and its role in science and engineering research. The first of the modern reports is the Report of the Panel on Large Scale Computing in Science and Engineering (the Lax report).[1] The Lax report made four basic recommendations: (1) increase access for the science and engineering research community to regularly upgraded supercomputing facilities via high-bandwidth networks; (2) increase research in computational mathematics, software, and algorithms necessary to the effective and efficient use of supercomputing systems; (3) train people in scientific computing; and (4) invest in research and development basic to the design and implementation of new supercomputing systems of substantially increased capability and capacity, beyond that likely to arise from commercial requirements alone. In 1985, following the guidelines of the Lax report, the National Science Foundation (NSF) established five supercomputer centers. Following the renewal of four of the five NSF supercomputer centers in 1990 and the possible implications for them contained in the 1991 High Performance Computing Act (P.L. 102-194), the National Science Board (NSB) commissioned the NSF Blue Ribbon Panel on High Performance Computing to investigate the future changes in the overall scientific environment due to rapid advances in computers and scientific computing.[2] The panel's report, From Desktop to Teraflop: Exploiting the U.S. Lead in High Performance Computing (the Branscomb report), recommended a significant expansion in NSF investments, including accelerating progress in high-performance computing through computer and computational science research. In 1995, NSF formed a task force to advise it on the review and management of the supercomputer centers program.

[1] Panel on Large Scale Computing in Science and Engineering. 1982. Report. Sponsored by the Department of Defense and the National Science Foundation in cooperation with the Department of Energy and the National Aeronautics and Space Administration. Washington, D.C., December 26.
[2] National Science Foundation. 1993. From Desktop to Teraflop: Exploiting the U.S. Lead in High Performance Computing. NSF Blue Ribbon Panel on High Performance Computing, August.

The chief finding of the Report of the Task Force on the Future of the NSF

Supercomputer Centers Program (the Hayes report)[3] was that the Advanced Scientific Computing Centers funded by NSF had enabled important research in computational science and engineering and had also changed the way that computational science and engineering contribute to advances in fundamental research across many areas. The recommendation of the task force was to continue to maintain a strong Advanced Scientific Computing Centers program. Congress asked the National Research Council's Computer Science and Telecommunications Board (CSTB) to examine the High Performance Computing and Communications Initiative (HPCCI).[4] CSTB's 1995 report Evolving the High Performance Computing and Communications Initiative to Support the Nation's Infrastructure (the Brooks/Sutherland report)[5] recommended the continuation of government support of research in information technology; the continuation of the HPCCI; funding of a strong experimental research program in software and algorithms for parallel computing machines; HPCCI support for precompetitive research in computer architecture (but an end to direct HPCCI funding for development of commercial hardware by computer vendors and for "industrial stimulus" purchases of hardware); and the development of a teraflop computer as a research direction rather than a destination. In 1997, following the guidelines of the Hayes report, NSF established two Partnerships for Advanced Computational Infrastructure (PACI), one with the San Diego Supercomputer Center as a leading-edge site and the other with the National Center for Supercomputing Applications as a leading-edge site. Each partnership includes participants from other academic, industry, and government sites. A third participant, the Pittsburgh Supercomputing Center, was added in 2000. The PACI program is scheduled to end in the fall of 2004.
In 1999, the President's Information Technology Advisory Committee's (PITAC's) Report to the President: Information Technology Research: Investing in Our Future (the PITAC report) made recommendations similar to those of the Lax, Hayes, and Branscomb reports.[6] PITAC found that federal information technology R&D is too heavily focused on near-term problems and that investment was inadequate. The committee's main recommendation was to create a strategic initiative to support long-term research in fundamental issues in computing, information, and communications. Supercomputing applications have also been studied. In 1999, The Biomedical Information Science and Technology Initiative found that because the number of biomedical researchers who could profit from using supercomputing facilities was increasing, the National Institutes of Health (NIH) should take a strong leadership position and help support the national supercomputer centers.[7] The 2003 report Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure (the Atkins report)[8]

[3] National Science Foundation. 1995. Report of the Task Force on the Future of the NSF Supercomputer Centers Program. September 15.
[4] HPCCI was formally created when Congress passed the High-Performance Computing Act of 1991 (P.L. 102-194), which authorized a 5-year program in high-performance computing and communications. The goal of the HPCCI was to "accelerate the development of future generations of high-performance computers and networks and the use of these resources in the federal government and throughout the American economy" (Federal Coordinating Council for Science, Engineering, and Technology (FCCSET), 1992, Grand Challenges: High-Performance Computing and Communications, FY 1992 U.S. Research and Development Program, Office of Science and Technology Policy, Washington, D.C.). The initiative broadened from four primary agencies addressing grand challenges such as forecasting severe weather events and aerospace design research to more than 10 agencies addressing national challenges such as electronic commerce and health care.
[5] Computer Science and Telecommunications Board (CSTB), National Research Council. 1995. Evolving the High Performance Computing and Communications Initiative to Support the Nation's Infrastructure. Washington, D.C.: National Academy Press.
[6] President's Information Technology Advisory Committee (PITAC). 1999. Report to the President: Information Technology Research: Investing in Our Future. February.
[7] National Institutes of Health. 1999. The Biomedical Information Science and Technology Initiative. Working Group on Biomedical Computing, Advisory Committee to the Director, National Institutes of Health, June 3.
[8] National Science Foundation. 2003. Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. January.

found that scientific and engineering research is pushed by continuing progress in computing, information, and communication technology (among other things) and pulled by the expanding complexity, scope, and scale of today's research challenges. The panel's overall recommendation was that NSF should establish and lead a large-scale interagency and internationally coordinated advanced cyberinfrastructure program (ACP) to create, deploy, and apply cyberinfrastructure in ways that radically empower all scientific and engineering research and allied education. The panel strongly recommended that the U.S. academic research community have access to the most powerful computers that can be built and operated in production mode and that NSF should support five centers that will provide high-end computing resources.

There have also been studies of the use of supercomputing for missions important to the United States, such as national security. The DOE Accelerated Strategic Computing Initiative (ASCI)[9] was established in 1995 to transition from a test-based to a simulation-based certification program to analyze and predict the performance, reliability, and safety of nuclear weapons. The first ASCI supercomputer, ASCI Red, which had 1 Tflops performance, was delivered in 1996. Other ASCI supercomputers include ASCI Blue, ASCI White, and ASCI Q. The goal of ASCI Purple, scheduled for 2005, is 100 Tflops. In 1996, a study by the Office of the Director of Defense Research and Engineering (DDR&E) stated that in order for the United States to maintain supremacy in the high-end computing field, a major national security program would be necessary.[10] Two reports by the General Accounting Office (GAO) examined DOE's use of its computing capabilities. The titles of these reports summarize the GAO findings: Information Technology: Department of Energy Does Not Effectively Manage Its Supercomputers[11] and Nuclear Weapons: DOE Needs to Improve Oversight of the $5 Billion Strategic Computing Initiative.[12] The first report cited utilization rates that showed, in the GAO's view, that the national laboratories were underutilizing their supercomputing capacity and missing opportunities to share it. (DOE disputed those findings.) The lack of an investment strategy and a defined process was cited as a reason why DOE was not fully justifying its supercomputer acquisitions. The second report found that a lack of comprehensive planning and progress tracking systems in the ASCI program made assessment of the initiative's progress difficult and subjective. In 1998 NSA and DDR&E joined forces and funding to support the development of the SV-2 (now the X1) by Cray Research in order to meet government needs that could not be met elsewhere in the marketplace. In the third quarter of 2002, Cray delivered five early production versions of the X1. A 1,024-processor commercial X1 was delivered in early 2003. The Department of Defense sponsored the Report of the Defense Science Board Task Force on DOD Supercomputing Needs.[13] The task force found that there is a significant need for high-performance computers that provide extremely fast access to very large global memories and that such computers support a crucial national cryptanalysis capability. Task force recommendations included providing additional financial support for the development of the Cray SV-2 (now the X1), developing an integrated system based on commercial off-the-shelf (COTS) microprocessors and a new high-bandwidth memory system, and investing in long-term research on critical technologies.

[9] This initiative subsequently became the Advanced Simulation and Computing Program but is still often referenced as ASCI.
[10] Director of Defense Research and Engineering. 1996. DDR&E Integrated Process Team Study: A National Security High End Computing Program.
[11] General Accounting Office (GAO). 1998. Information Technology: Department of Energy Does Not Effectively Manage Its Supercomputers. Report to the Chairman, Committee on the Budget, House of Representatives (GAO/RCED-98-208). Washington, D.C.: GAO, July.
[12] GAO. 1999. Nuclear Weapons: DOE Needs to Improve Oversight of the $5 Billion Strategic Computing Initiative. Report to the Chairman, Subcommittee on Military Procurement, House Committee on Armed Services (GAO/RCED-99-195). Washington, D.C.: GAO, July.
[13] Defense Science Board. 2000. Report of the Defense Science Board Task Force on DOD Supercomputing Needs. October 11.

12 THE FUTURE OF SUPERCOM:PUTING: ANINTERIMREPORT In 2001, Charles Holland, principal assistant deputy under secretary of defense for science and technology, and a team of experts authored a reports that focused on DOD's research and redevelopment agenda for high-performance computing. The report found that current research does not adequately address medium- to long-term needs and proposed an agenda with three thrusts technology development, concept demonstration, and industry adoption to address the challenges of producing innovative ideas and reinvigorating the academic and industry research communities. Supercomputing architecture was the focus of Survey and Ana1/lysis of the Nationa1/t Security High Performance Computing Architectura1/t Requirements (the Games report).~5 The survey found that a major investment had been made by the national security community to migrate legacy applications from vector supercomputers to commodity high-performance computers (HPCs). It found that although vector supercomputers process more efficiently than commodity HPCs, most but not all large applications scale well on commodity HPCs. Finally, it reported that some researchers found it increasingly difficult to program distributed-memory commodity HPCs, which had a negative impact on their research productivity. Recommendations were to assess the usefulness of Japanese vector supercomputers, reach out to researchers through the use of OpenMP on shared-memory systems, promote flexibility through software that combines OpenMP and message passing interface and that switches between vector- and cache-based optimizations, and establish a multifaceted R&D program to improve the productivity of high-performance computing for national security applications. 
The goal of the DARPA High Productivity Computing Systems (HPCS) program, initiated in 2002, is to provide a new generation of economically viable, high-productivity computing systems for the national security and industrial user community in 2007-2010. It is focused on addressing the gap between the capability needed to meet mission requirements and the current offerings of the commercial marketplace. HPCS has three phases: an industrial concept study currently under way with Cray, SGI, IBM, HP, and Sun; an R&D phase that was awarded to Sun, Cray, and IBM in July 2003 and that lasts until 2006; and full-scale development, to be completed by 2010, ideally by the two best vendors from the second phase. The Defense Appropriations Bill for FY 2002 directed the Secretary of Defense to submit to Congress, by July 1, 2002, a development and acquisition plan for a comprehensive, long-range, integrated, high-end computing (IHEC) program. The resulting report, High Performance Computing for the National Security Community, was released in the spring of 2003. The report recommends an IHEC program that integrates applied research, advanced development, and engineering and prototype development. The applied research element will focus on developing the fundamental concepts in high-end computing and creating a pipeline of new ideas and graduate-level expertise for employment in industry and the national security community. The advanced development element will select and refine innovative technologies and architectures for potential integration into high-end systems. The engineering and prototype development element will build operational prototypes and system-level testbeds.
The report also emphasizes the importance of high-end computing laboratories that will test system software on dedicated large-scale platforms; support the development of software tools and algorithms; develop and advance benchmarking, modeling, and simulation for system architectures; and conduct detailed technical requirements analysis. The report suggests $390 million per year as the steady-state budget for this program. The program is planned to consolidate existing DARPA, DOE/NNSA, and NSA R&D programs and will feature a joint program office with DDR&E oversight. In addition to the study by the NRC's Committee on the Future of Supercomputing that resulted in this interim report, two other studies of the future of U.S. supercomputing are under way: one by the National Coordination Office for Information Technology Research and Development (ITRD) and another by the JASONs. The ITRD High-End Computing Revitalization Task Force has been charged with developing a plan and a 5-year roadmap to guide federal investments in high-end computing starting

[14] Charles J. Holland. 2001. DOD Research and Development Agenda for High Productivity Computing Systems. White paper, June 11.
[15] Richard A. Games. 2001. Survey and Analysis of the National Security High Performance Computing Architectural Requirements. June 4.

with fiscal year 2005. The final report is due in August 2003, in time to influence the FY 2005 budget. The JASONs' study, commissioned by DOE at the request of Congress, will identify the distinct requirements of the Stockpile Stewardship Program and its relation to the ASCI acquisition strategy. The JASONs are expected to complete their (classified) report in August 2003.

SUPERCOMPUTING TECHNOLOGY

Vendors

Supercomputers have been manufactured in the United States and abroad since early in the history of the computer industry. Since 1993, a list of the sites operating the 500 most powerful computer systems has been available to the public. This list, called the TOP500, is updated twice a year.[16] Performance is measured by the number of floating-point operations performed per second (flops) while executing the LINPACK benchmark to solve a dense system of linear equations.[17] According to the June 2003 TOP500 list, the United States and Japan dominate the use and manufacture of high-performance systems (although supercomputers are used in Europe, European computer companies have been limited to the integration of relatively small cluster systems). The TOP500 data show that the United States has a 50 percent share of installed supercomputers, Germany has 11 percent, and Japan has 8 percent, accounting for 69 percent of the total. Another interesting comparison is aggregate performance by country. From the distribution by performance, the United States has 54 percent of the aggregate performance of the TOP500 computers and Japan has 17 percent, together accounting for 71 percent of the total. Breaking the numbers down by manufacturer, the top three, all U.S. companies, are Hewlett-Packard (32 percent of the TOP500 machines), IBM (31 percent), and SGI (11 percent); together they account for 74 percent of the systems.
Measured by aggregate performance, IBM accounts for 35 percent, Hewlett-Packard for 24 percent, and NEC for 12 percent. Ninety-one percent of the top 500 systems are U.S. made. In summary, in both use and manufacture, the United States is the dominant participant, followed by Japan.[18] A small number of companies dominate the market. Germany is a large user of supercomputing but not a large producer.

Architecture

Contemporary supercomputers are all built by clustering large numbers of compute nodes. They span a spectrum of architectural choices, from clusters that are assembled from low-cost, high-volume components to systems that are custom built for high-end scientific computing. The main differentiators are the node technology, the switch (sometimes called the interconnect) technology, and the node-switch interface.

[16] See .
[17] No single number captures system performance across a wide range of applications and architectures. Flops in a dense linear algebra benchmark is but one figure of merit; however, it is the one used for this widely referenced list.
[18] Although these percentages would probably change if different metrics were used, the dominance of the United States over other countries would most likely remain.
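The LINPACK figure of merit described above reduces to arithmetic throughput on a dense linear solve. As a toy illustration only (this is not the actual benchmark, which has tuned implementations and strict reporting rules; the function name and problem size here are our own), the sketch below times a small Gaussian elimination and converts the standard 2n³/3 flop count into a flop/s figure:

```python
import random
import time

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting: solve A x = b."""
    n = len(A)
    A = [row[:] for row in A]   # work on copies; leave caller's data intact
    b = b[:]
    for k in range(n):
        # Partial pivoting: move the largest remaining entry into the pivot row.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n               # back substitution
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

n = 100
random.seed(1)
A = [[random.random() for _ in range(n)] for _ in range(n)]
b = [random.random() for _ in range(n)]
t0 = time.perf_counter()
x = solve_dense(A, b)
elapsed = time.perf_counter() - t0
flops = (2 * n**3 / 3 + 2 * n**2) / elapsed   # standard LU-solve flop count
print(f"n={n}: {flops:.2e} flop/s")
```

Per footnote 17, a rate obtained this way is one figure of merit among many; it says little about performance on irregular or memory-bound applications.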

Node Technology

Most low-cost clusters use 32-bit Intel or Advanced Micro Devices (AMD) microprocessors. These microprocessors are targeted for low-end servers and are produced in very large volumes (on the order of hundreds of millions). They are not optimized for scientific computing. Sixty-four-bit microprocessors (Alpha, Power, SPARC, MIPS, Itanium, Opteron) offer the advantages of support for larger memories, a better-performing memory subsystem, and support for larger shared-memory multiprocessor (SMP) configurations. The production volumes of these microprocessors are two orders of magnitude smaller than the volumes for 32-bit microprocessors. These microprocessors are mostly targeted for high-end commercial servers, although on occasion vendors will develop SMP configurations that are optimized for scientific computing. Both 32-bit and 64-bit scalar microprocessors are optimized for single-thread performance on codes that exhibit good temporal and spatial locality.[19] These codes make most of their memory references to an on-chip cache, with good cache reuse. Such processors have limited off-chip bandwidth, can support only a small number (at most 8 or 16) of simultaneously outstanding memory references, and have cache line mechanisms that are not ideal for scientific applications. In high-end application codes that do not make good use of caches, this approach to memory system design leads to a dramatic drop in actual performance when compared with the theoretical peak. This problem is mitigated in processors that employ either multithreading or vectors to generate a large number of outstanding memory references and that therefore tolerate long memory latencies while sustaining high bandwidth (thereby reducing the need for data locality). Over 20 percent of the systems are based on Intel and AMD 32-bit processors. About percent of the systems use vector processors.
Approximately 60 percent of the systems use 64-bit processors.

Switch Technology

Low-end clusters, including some on the TOP500 list, use high-volume switched Ethernet technology for the interconnect. Higher bandwidth and lower latency are achieved by using custom interconnects from third-party vendors (e.g., Quadrics and Myricom) or from the system vendors (e.g., Cray, IBM, NEC, and SGI). A key differentiator between systems is the fraction of total system cost allocated to the interconnect: A low-bandwidth network will represent less than 10 percent of total system cost; a high-bandwidth network may approach half of total system cost. Another important differentiator is the scalability of the interconnect to large numbers of nodes.

Node-Switch Interface

Nodes of low-end clusters connect to the switch via a standard I/O bus, such as the peripheral component interconnect (PCI) bus. This choice reuses high-volume, low-cost technology but limits the function and performance of the interconnect, since I/O interfaces are not optimized for fast processor-to-processor communication. In such systems, global bandwidth is an order of magnitude lower than local memory bandwidth. The software for communication typically uses message passing, further increasing communication latency and limiting bandwidth for short messages. A custom memory-connected interface, typically proprietary, can be used to increase bandwidth, reduce latency, or provide added functionality. Such interfaces are usually paired with higher-performance custom switches.[20] In particular, a custom interface can directly support shared-memory communication, allowing a processor to access the memory of a remote node via load and store

[19] Temporal locality is the property that data accessed recently in the past are likely to be accessed soon in the future. Spatial locality is the property that data that are stored very near one another tend to be accessed closely in time.
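The notions of temporal and spatial locality defined in footnote 19, and the performance cliff described above for codes that do not use caches well, can be made concrete with a toy direct-mapped cache model. The cache geometry below is invented for illustration and matches no particular processor; a unit-stride sweep misses only once per cache line, while a large power-of-two stride defeats the cache entirely:

```python
# Toy direct-mapped cache: 256 lines of 64 bytes (16 KB total), 8-byte words.
LINES, LINE_BYTES, WORD = 256, 64, 8
WORDS_PER_LINE = LINE_BYTES // WORD

def miss_rate(addresses):
    """Fraction of word accesses that miss in the toy cache."""
    tags = [None] * LINES
    misses = 0
    for a in addresses:
        line_no = a // WORDS_PER_LINE        # which memory line holds this word
        idx, tag = line_no % LINES, line_no // LINES
        if tags[idx] != tag:                 # not resident: count a miss, fill the line
            tags[idx] = tag
            misses += 1
    return misses / len(addresses)

N = 1 << 16                                   # 64K words, far larger than the cache
unit = list(range(N))                         # unit stride: good spatial locality
strided = [(i * 1024) % N for i in range(N)]  # 8 KB stride: no reuse of fetched lines

print("unit-stride miss rate: ", miss_rate(unit))     # 0.125 (one miss per 8-word line)
print("large-stride miss rate:", miss_rate(strided))  # 1.0 (every access misses)
```

In the strided case every fetched 64-byte line contributes a single useful word, which is exactly the regime in which vector or multithreaded processors, with many outstanding memory references, retain an advantage over cache-based scalar designs.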
[20] Recently, Intel and other companies have been attaching standard interconnects such as Ethernet and InfiniBand directly to the memory system rather than via the PCI bus; thus, some of the performance advantages of custom interfaces are becoming available with standard interconnects.

instructions. Since shared-memory communication has little software overhead, it has lower communication latency; however, the small number of pending memory references supported by mass-market microprocessors limits global bandwidth. Shared-memory support is generally believed to facilitate parallel programming because of the single name space it provides; it also facilitates the use of a single operating system image to control the entire machine. Approximately half the systems in the TOP500 list use proprietary switch interfaces.

Products

Close to 20 percent of the TOP500 systems are self-made or are assembled by system integrators from commodity components. Almost all of these systems use Intel or AMD 32-bit microprocessor nodes and run Linux. The use of this type of cluster architecture was popularized by the Beowulf project,[21] following on previous Network of Workstations projects. Such Beowulf clusters are increasingly used as commercial capacity machines (e.g., Web servers and search engines) and as departmental or project scientific computing machines in research and industry. Such clusters are attractive because of their low purchase cost, the large number of component suppliers, and the ease of adding components. Clusters of this type, which use low-cost Ethernet interconnects, are often used to run "embarrassingly parallel" jobs consisting of many almost independent sequential subtasks. The top U.S. vendors all offer clusters with 64-bit SMP nodes and custom switches. With the exception of HP, all provide custom switch interfaces. The top-ranked Hewlett-Packard (HP) systems, including the second-ranked (by TOP500) ASCI Q system, use AlphaServer SMP nodes connected (via a standard PCI interface) by a Quadrics switch; global communication uses message passing. Previous Hewlett-Packard clusters used the custom Hyperfabric interconnect.
The top-ranked IBM systems, including the fourth-ranked (by TOP500) ASCI White system, use Power SMP nodes connected with an IBM proprietary switch using a proprietary interface (Power4 systems currently use a standard I/O interface); global communication uses message passing. The Cray T3E uses Alpha uniprocessor nodes connected by a Cray proprietary switch with a proprietary interface that supports fast (put/get) remote memory access; the largest such system on the TOP500 list has 1,900 processors. (Cray is no longer pursuing the T3E architecture.) The SGI Origin uses MIPS quad-processor nodes connected with an SGI proprietary switch and an interface that supports cache-coherent global shared memory; the largest such system, with 1,024 processors, is deployed at NASA Ames (SGI is now shipping systems that use Itanium processors). The Sun Fire, with up to 106 SPARC processors, also supports global cache-coherent shared memory. NEC in Japan and Cray in the United States are at present the only vendors that manufacture vector processors for large-scale computing; their production volumes are significantly smaller than the volumes for nonvector 64-bit microprocessors. Such processors tend to be used in small-volume systems for the high end of the scientific and technical computing markets. In the past, other top Japanese vendors (Fujitsu and Hitachi) offered systems with vector processors. In the United States vector processors are being developed by Cray, with its new X1 product line. Ten systems on the current TOP500 list use Cray vector processor nodes.[22]

[21] Thomas Sterling, Donald J. Becker, Daniel Savarese, John E. Dorband, Udaya A. Ranawak, and Charles V. Packer. 1995. Beowulf: A Parallel Workstation for Scientific Computation. Proceedings of the 24th International Conference on Parallel Processing.
[22] Some mass-marketed microprocessors have limited support for vector instructions in a form that typically cannot be used to hide memory latency. The committee reserves the term "vector processor" for systems that have large vector register files and that support vector load/store instructions that can address noncontiguous memory locations.
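The defining feature in footnote 22, vector load/store instructions that address noncontiguous memory, amounts to hardware strided gather/scatter. A minimal sketch of the semantics (the helper names `vgather` and `vscatter` are ours; a real vector ISA performs each in a single instruction, issuing all the element accesses concurrently rather than one at a time):

```python
def vgather(memory, base, stride, vl):
    """Strided vector load: fetch memory[base + i*stride] for i = 0..vl-1."""
    return [memory[base + i * stride] for i in range(vl)]

def vscatter(memory, base, stride, values):
    """Strided vector store: write values[i] to memory[base + i*stride]."""
    for i, v in enumerate(values):
        memory[base + i * stride] = v

# Example: load one column of a 4x4 row-major matrix as a single "vector" operation.
mem = list(range(16))                     # flattened 4x4 matrix, row-major
col1 = vgather(mem, base=1, stride=4, vl=4)
print(col1)                               # [1, 5, 9, 13]
```

A cache-line-oriented scalar processor fetching the same column would pull in four full lines and discard most of each; the vector form states the access pattern explicitly, which is what lets the hardware hide the latency of the noncontiguous references.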

The NEC Earth Simulator

The most significant Japanese supercomputer manufacturer is NEC. In the spring of 2002, NEC released the Earth Simulator (ES), a system with a peak performance of 40 Tflops that is ranked first on the June 2003 TOP500 list. Based on the TOP500 LINPACK benchmark, the ES is the world's fastest computer by a factor of 2.58. An even greater ratio seems to hold for the geosciences applications that it was specifically designed to support. The ES is a cluster of 640 shared-memory multiprocessor (SMP) nodes. Each SMP node has eight processors based on the NEC SX-6 processor design; each processor is a vector processor with a clock frequency of 500 MHz. Eight vector units within each processor provide a peak performance of 8 Gflops per processor. The peak memory bandwidth is 32 GBps. Each processor has 72 vector registers, each with 256 elements. A robust crossbar network connects the nodes and provides a peak bandwidth of 16 GBps per node. The sustained bandwidth is approximately 12 GBps, full duplex. The design of the nodes of the ES (including the vector processor and memory system) is evolutionary within the SX vector family. Semiconductor technology and advanced packaging are used to achieve performance. The software is also evolutionary and fairly stable. It is instructive to compare the ES to the ASCI Q system at Los Alamos National Laboratory (LANL), which uses HP AlphaServer ES45 nodes and a Quadrics switch. Compared with the ASCI Q, the significant characteristics of the Earth Simulator are these:

• Higher ratio of memory bandwidth to floating-point rate (4 B/flop versus 0.8 B/flop). This ratio improves performance significantly for many codes that are memory intensive but do not exhibit the spatial and temporal locality exploited by caches. Although some such codes can be rewritten to be more cache-friendly, certain algorithms seem intrinsically difficult to localize.[23]
• Use of vector parallelism in addition to SMP and message-passing parallelism. The availability of a large number of vector registers and of vector load instructions makes it possible to prefetch data and to hide memory latency for codes where data accesses are predictable but not spatially localized. Codes that vectorize well can achieve a high fraction of the peak floating-point performance of the SX-6. On the other hand, the scalar performance of the SX-6 processor is not as good as the scalar performance of the Alpha processor, so the Alpha processor may achieve better performance on codes that do not vectorize well and are cache friendly.

• Use of a global switch with a higher ratio of global bandwidth to floating-point rate (0.2 B/flop versus 0.03 B/flop). This property contributes to performance on codes that require large amounts of global communication.

In summary, the ES achieves a higher fraction of peak floating-point performance on many codes because of better memory bandwidth, better global bandwidth, and the availability of a memory prefetch mechanism (vector registers and vector load/store operations). There are no new microarchitectural concepts or unique technologies that are noteworthy in the ES. Rather, the performance is achieved through the use of a purpose-built microprocessor with high memory bandwidth and latency-hiding hardware and through the acceptance of a different budget balance between node hardware and interconnect hardware.

[23] See, for example, the GUPS benchmark described in Brian R. Gaeke, Parry Husbands, Xiaoye S. Li, Leonid Oliker, Katherine A. Yelick, and Rupak Biswas. 2002. Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines. International Parallel and Distributed Processing Symposium (IPDPS).
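The bytes-per-flop ratios quoted for the ES follow from the figures given above: a per-processor peak of about 8 Gflops is implied by 40 Tflops spread over 640 × 8 processors, and the interconnect ratio appears to use the sustained 12 GBps node bandwidth rather than the 16 GBps peak. The arithmetic:

```python
# Earth Simulator figures quoted in the text above.
nodes = 640
procs_per_node = 8
peak_gflops_per_proc = 8.0        # Gflop/s per processor
mem_bw_per_proc = 32.0            # GB/s per processor
sustained_net_bw_per_node = 12.0  # GB/s per node, full duplex

mem_ratio = mem_bw_per_proc / peak_gflops_per_proc
net_ratio = sustained_net_bw_per_node / (procs_per_node * peak_gflops_per_proc)
print(f"memory:       {mem_ratio:.1f} B/flop")    # 4.0, as quoted
print(f"interconnect: {net_ratio:.2f} B/flop")    # ~0.19, i.e. the "0.2" quoted

# Sanity check on the quoted system peak of 40 Tflops.
system_peak_tflops = nodes * procs_per_node * peak_gflops_per_proc / 1000
print(f"system peak:  {system_peak_tflops:.2f} Tflops")
```

The same arithmetic applied to the ASCI Q figures would yield the 0.8 and 0.03 B/flop values quoted for comparison; those node-level inputs are not given in the text, so they are not reproduced here.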

Software

Message passing is the main programming model used to scale applications to large systems; the MPI standard message-passing library is available on all TOP500 systems, including shared-memory systems. Lower-overhead communication can be achieved using put/get libraries on systems with suitable switch interfaces, such as the Cray T3E. Shared-memory parallelism on SMP nodes is often exploited using OpenMP (i.e., C or FORTRAN with extensions for loop and task parallelism). However, OpenMP does not seem to be used for systemwide parallelism on large systems (even those supporting shared memory), perhaps because programmers lack the skill to use it well.

Almost all TOP500 systems use variants of Unix for their operating system. Shared-memory systems are controlled by one global OS image, while distributed-memory systems typically have one OS image per node. Lower-end Beowulf clusters typically use Linux, while higher-end systems use proprietary Unix systems. Libraries, programming tools, parallel file systems, and various system management tools complete the parallel programming environment available on these platforms.

Most vendor platforms use proprietary parallel programming environments. The proprietary software is often derived from open-source software; for example, all proprietary MPI implementations are derived from open-source MPI implementations. Beowulf clusters mostly use open-source parallel software contributed by developers worldwide.

Support for standard programming environments and interfaces across all platforms is an important goal that is only partially achieved. MPI and OpenMP are two successful standardization efforts in which industry adopted a de facto standard developed by the HPC community and/or the HPC vendors. Another successful model is provided by the TotalView parallel debugger, where a third-party software vendor supports the same software product across all main HPC platforms.
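The loop-level parallelism that OpenMP expresses can be sketched in plain Python with a worker pool. This is only an analogy, not OpenMP itself: OpenMP directives turn C or FORTRAN loops into real shared-memory threads, whereas this sketch (with the invented helper name `parallel_sum`) merely illustrates the pattern of splitting a loop's iterations across workers and combining partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, num_workers=4):
    """Split a sum reduction across workers, then combine partial
    results -- the same pattern an OpenMP 'parallel for' with a
    reduction clause applies to a C or FORTRAN loop."""
    # Divide the iteration space into roughly equal chunks.
    chunk = (len(data) + num_workers - 1) // num_workers
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # Each worker reduces its chunk; the partial sums are combined serially.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(sum, pieces)
    return sum(partials)

print(parallel_sum(list(range(1000))))  # 499500
```

The chunk-then-reduce structure is what makes such loops easy to express in OpenMP on one SMP node, while scaling the same pattern across thousands of distributed-memory nodes requires explicit message passing instead.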
However, TotalView is a singular example; attempts to standardize various tool interfaces and parallel system services, in particular parallel I/O, have had limited success. Although programming for large-scale parallel machines is more complex than programming for sequential machines, the typical programming environment available for scalable parallel computing is less sophisticated and less standardized than the environment available on small systems.

Algorithms

The algorithms used to run supercomputing applications are needed not just within the applications themselves but also to analyze the output data, store and transmit the data over unreliable media, balance load efficiently, and so on. The primary challenge introduced by supercomputing is that many conventional algorithms for these problems must be modified to scale effectively to much larger data sets or numbers of processors and to run efficiently on machines with deep memory hierarchies. For example, a numerical simulation on a very large mesh may involve converting an algorithm from one using dense matrices, or even direct solvers on sparse matrices, to one using a specialized iterative method that may still use a parallelized direct method on subproblems. Initially it may be possible to use a serial mesh partitioner to balance the matrix across processors, but as the matrix grows a parallel mesh partitioner may be needed. As another example, the problem may be scaled up in order to introduce new physical models (e.g., one that respects polycrystalline structure in plasticity models), requiring wholly new discretizations and subgrid models. As this example illustrates, some of these algorithms are very specialized to particular application domains, whereas others, like mesh partitioners, are of quite general use.
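The shift from dense direct solvers to sparse iterative methods mentioned above can be illustrated with a minimal Jacobi iteration on a 1-D Poisson problem. This toy sketch (pure Python; the function name is invented for illustration) shows why iterative methods suit very large meshes: each sweep touches only the few nonzero entries per matrix row, so work per sweep is linear in the mesh size, and sweeps over mesh partitions parallelize naturally:

```python
def jacobi_1d_poisson(f, sweeps=500):
    """Jacobi iteration for -u'' = f on a unit-spaced 1-D mesh with
    u = 0 at both boundaries.  Each update uses only a row's nonzeros
    (the two mesh neighbors), so a sweep costs O(n), versus O(n^3)
    for a dense direct solve of the same system."""
    n = len(f)
    u = [0.0] * n
    for _ in range(sweeps):
        # u_new[i] = (u[i-1] + u[i+1] + h^2 * f[i]) / 2, with h = 1
        # and zero values outside the mesh boundaries.
        u = [0.5 * ((u[i - 1] if i > 0 else 0.0) +
                    (u[i + 1] if i < n - 1 else 0.0) + f[i])
             for i in range(n)]
    return u

# For f = 2 everywhere, the discrete system 2*u_i - u_{i-1} - u_{i+1} = 2
# has the exact solution u_i = (i + 1) * (n - i); Jacobi converges to it.
u = jacobi_1d_poisson([2.0] * 9, sweeps=500)
```

In a distributed-memory setting, each processor would own a contiguous slice of `u` and exchange only the slice-boundary values with its neighbors each sweep, which is exactly the kind of restructuring, and load balancing via mesh partitioning, that the text describes.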