| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
Supercomputing Past and Present
This chapter provides background material on supercomputing to establish key elements of context.
A summary of reports and government activities illuminates the recent history of supercomputing. A
brief overview of the current state of supercomputing technology follows.
PREVIOUS REPORTS AND RECENT FEDERAL INITIATIVES
During the past few decades, a number of reports have dealt with supercomputing and its role in
science and engineering research. The first of the modern reports is the Report of the Panel on Large
Scale Computing in Science and Engineering (the Lax report).1 The Lax report made four basic
recommendations: (1) increase access for the science and engineering research community to regularly
upgraded supercomputing facilities via high bandwidth networks, (2) increase research in computational
mathematics, software, and algorithms necessary to the effective and efficient use of supercomputing
systems, (3) train people in scientific computing, and (4) invest in research and development basic to the
design and implementation of new supercomputing systems of substantially increased capability and
capacity, beyond that likely to arise from commercial requirements alone. In 1985, following the
guidelines of the Lax report, the National Science Foundation (NSF) established five supercomputer
centers.
Following the renewal of four of the five NSF supercomputer centers in 1990 and the possible
implications for them contained in the 1991 High Performance Computing Act (P.L. 102-194), the
National Science Board (NSB) commissioned the NSF Blue Ribbon Panel on High Performance
Computing to investigate the future changes in the overall scientific environment due to rapid advances in
computers and scientific computing.2 The panel's report, From Desktop to Teraflop: Exploiting the U.S.
Lead in High Performance Computing (the Branscomb report), recommended a significant expansion in
NSF investments, including accelerating progress in high-performance computing through computer and
computational science research.
In 1995, NSF formed a task force to advise it on the review and management of the supercomputer
centers program. The chief finding of the Report of the Task Force on the Future of the NSF
1Panel on Large Scale Computing in Science and Engineering. 1982. Report. Sponsored by the Department of
Defense and the National Science Foundation in cooperation with the Department of Energy and the National
Aeronautics and Space Administration. Washington, D.C., December 26.
2National Science Foundation. 1993. From Desktop to Teraflop: Exploiting the U.S. Lead in High Performance
Computing. NSF Blue Ribbon Panel on High Performance Computing, August.
THE FUTURE OF SUPERCOMPUTING: AN INTERIM REPORT
Supercomputer Centers Program (the Hayes report)3 was that the Advanced Scientific Computing
Centers funded by NSF had enabled important research in computational science and engineering and had
also changed the way that computational science and engineering contribute to advances in fundamental
research across many areas. The recommendation of the task force was to continue to maintain a strong
Advanced Scientific Computing Centers program.
Congress asked the National Research Council's Computer Science and Telecommunications Board
to examine the High Performance Computing and Communications Initiative (HPCCI).4 CSTB's 1995
report Evolving the High Performance Computing and Communications Initiative to Support the Nation's
Infrastructure (the Brooks/Sutherland report)5 recommended the continuation of government support of
research in information technology; the continuation of the HPCCI; funding of a strong experimental
research program in software and algorithms for parallel computing machines; HPCCI support for
precompetitive research in computer architecture (but an end to direct HPCCI funding for development of
commercial hardware by computer vendors and for "industrial stimulus" purchases of hardware); and the
development of a teraflop computer as a research direction rather than a destination.
In 1997, following the guidelines of the Hayes report, NSF established two Partnerships for
Advanced Computational Infrastructure (PACI), one with the San Diego Supercomputer Center as a
leading-edge site and the other with the National Center for Supercomputing Applications as a leading-
edge site. Each partnership includes participants from other academic, industry, and government sites. A
third participant, the Pittsburgh Supercomputer Center, was added in 2000. The PACI program is
scheduled to end in the fall of 2004.
In 1999, the President's Information Technology Advisory Committee's (PITAC's) Report to the
President: Information Technology Research: Investing in Our Future (the PITAC report) made
recommendations similar to those of the Lax, Hayes, and Branscomb reports.6 PITAC found that federal
information technology R&D was too heavily focused on near-term problems and that investment was
inadequate. The committee's main recommendation was to create a strategic initiative to support long-
term research in fundamental issues in computing, information, and communications.
Supercomputing applications have also been studied. In 1999, The Biomedical Information Science
and Technology Initiative found that because the number of biomedical researchers who could profit from
using supercomputing facilities was increasing, the National Institutes of Health (NIH) should take a
strong leadership position and help support the national supercomputer centers.7
The 2003 report Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of
the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure (the Atkins report)8
3National Science Foundation. 1995. Report of the Task Force on the Future of the NSF Supercomputer Centers
Program. September 15.
4HPCCI was formally created when Congress passed the High-Performance Computing Act of 1991 (P.L.102-
194), which authorized a 5-year program in high-performance computing and communications. The goal of the
HPCCI was to "accelerate the development of future generations of high-performance computers and networks and
the use of these resources in the federal government and throughout the American economy" (Federal Coordinating
Council for Science, Engineering, and Technology (FCCSET), 1992, Grand Challenges: High-Performance
Computing and Communications. FY 1992 U.S. Research and Development Program, Office of Science and
Technology Policy, Washington, D.C.). The initiative broadened from four primary agencies addressing grand
challenges such as forecasting severe weather events and aerospace design research to more than 10 agencies
addressing national challenges such as electronic commerce and health care.
5Computer Science and Telecommunications Board (CSTB), National Research Council. 1995. Evolving the
High Performance Computing and Communications Initiative to Support the Nation's Infrastructure. Washington,
D.C.: National Academy Press.
6President's Information Technology Advisory Committee (PITAC). 1999. Report to the President: Information
Technology Research: Investing in Our Future. February.
7National Institutes of Health. 1999. The Biomedical Information Science and Technology Initiative. Working
Group on Biomedical Computing, Advisory Committee to the Director, National Institutes of Health, June 3.
8National Science Foundation. 2003. Revolutionizing Science and Engineering Through Cyberinfrastructure:
Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. January.
found that scientific and engineering research is pushed by continuing progress in computing,
information, and communication technology (among other things) and pulled by the expanding
complexity, scope, and scale of today's research challenges. The panel's overall recommendation was
that NSF should establish and lead a large-scale interagency and internationally coordinated advanced
cyberinfrastructure program (ACP) to create, deploy, and apply cyberinfrastructure in ways that radically
empower all scientific and engineering research and allied education. The panel strongly recommended
that the U.S. academic research community have access to the most powerful computers that can be built
and operated in production mode and that NSF should support five centers that will provide high-end
computing resources.
There have also been studies of the use of supercomputing for missions important to the United States
such as national security. The DOE Accelerated Strategic Computing Initiative (ASCI)9 was established
in 1995 to transition from a test-based to a simulation-based certification program to analyze and predict
the performance, reliability, and safety of nuclear weapons. The first supercomputer, ASCI Red, which
had 1 Tflop performance, was delivered in 1996. Other ASCI supercomputers include ASCI Blue, ASCI
White, and ASCI Q. The goal of ASCI Purple, scheduled for 2005, is 100 Tflop.
In 1996, a study by the Office of the Director of Defense Research and Engineering (DDR&E) stated
that in order for the United States to maintain supremacy in the high-end computing field, a major
national security program would be necessary.10
Two reports by the General Accounting Office (GAO) examined DOE's use of its computing
capabilities. The titles of these reports summarize the GAO findings: Information Technology:
Department of Energy Does Not Effectively Manage Its Supercomputers11 and Nuclear Weapons: DOE
Needs to Improve Oversight of the $5 Billion Strategic Computing Initiative.12 The first report cited
utilization rates that showed, in the GAO's view, that the national laboratories were underutilizing their
supercomputing capacity and missing opportunities to share it. (DOE disputed those findings.) The lack
of an investment strategy and a defined process was cited as a reason why DOE was not fully justifying
its supercomputer acquisitions. The second report found that a lack of comprehensive planning and
progress tracking systems in the ASCI program made assessment of the initiative's progress difficult and
subjective.
In 1998 NSA and DDR&E joined forces and funding to support the development of the SV-2 (now
the X1) by Cray Research in order to meet government needs that could not be met elsewhere in the
marketplace. In the third quarter of 2002, Cray delivered five early production versions of the X1. A
1024-processor commercial X1 was delivered in early 2003.
The Department of Defense sponsored the Report of the Defense Science Board Task Force on DOD
Supercomputing Needs.13 The task force found that there is a significant need for high-performance
computers that provide extremely fast access to very large global memories and that such computers
support a crucial national cryptanalysis capability. Task force recommendations included providing
additional financial support for the development of the Cray SV-2 (now the X1), developing an integrated
system based on commercial off-the-shelf (COTS) microprocessors and a new high-bandwidth memory
system, and investing in long-term research on critical technologies.
9This initiative subsequently became the Advanced Simulation and Computing Program but is still often
referenced as ASCI.
10Director of Defense Research and Engineering. 1996. DDRE Integrated Process Team Study A National
Security High End Computing Program.
11General Accounting Office (GAO). 1998. Information Technology: Department of Energy Does Not
Effectively Manage Its Supercomputers. Report to the Chairman, Committee on the Budget, House of
Representatives (GAO/RCED-98-208). Washington, D.C.: GAO, July.
12GAO. 1999. Nuclear Weapons: DOE Needs to Improve Oversight of the $5 Billion Strategic Computing
Initiative. Report to the Chairman, Subcommittee on Military Procurement, House Committee on Armed Services
(GAO/RCED-99-195). Washington, D.C.: GAO, July.
13Defense Science Board. 2000. Report of the Defense Science Board Task Force on DOD Supercomputing
Needs. October 11.
In 2001, Charles Holland, principal assistant deputy under secretary of defense for science and
technology, and a team of experts authored a report14 that focused on DOD's research and development
agenda for high-performance computing. The report found that current research does not adequately
address medium- to long-term needs and proposed an agenda with three thrusts (technology
development, concept demonstration, and industry adoption) to address the challenges of producing
innovative ideas and reinvigorating the academic and industry research communities.
Supercomputing architecture was the focus of Survey and Analysis of the National Security High
Performance Computing Architectural Requirements (the Games report).15 The survey found that a major
investment had been made by the national security community to migrate legacy applications from vector
supercomputers to commodity high-performance computers (HPCs). It found that although vector
supercomputers process more efficiently than commodity HPCs, most but not all large applications scale
well on commodity HPCs. Finally, it reported that some researchers found it increasingly difficult to
program distributed-memory commodity HPCs, which had a negative impact on their research
productivity. Recommendations were to assess the usefulness of Japanese vector supercomputers, reach
out to researchers through the use of OpenMP on shared-memory systems, promote flexibility through
software that combines OpenMP and message passing interface and that switches between vector- and
cache-based optimizations, and establish a multifaceted R&D program to improve the productivity of
high-performance computing for national security applications.
The goal of the DARPA high productivity computing systems (HPCS) program, initiated in 2002, is
to provide a new generation of economically viable, high-productivity computing systems for the national
security and industrial user community in 2007-2010. It is focused on addressing the gap between the
capability needed to meet mission requirements and the current offerings of the commercial marketplace.
HPCS has three phases: an industrial concept study currently under way with Cray, SGI, IBM, HP, and
Sun; an R&D phase that was awarded to Sun, Cray, and IBM in July 2003 and will last until 2006; and
full-scale development, to be completed by 2010, ideally by the two best vendors from the second phase.
The Defense Appropriations Bill for FY 2002 directed the Secretary of Defense to submit a
development and acquisition plan for a comprehensive, long-range, integrated, high-end computing
(IHEC) program to Congress by July 1, 2002. The resulting report, High Performance Computing for the
National Security Community, was released in the spring of 2003. The report recommends an IHEC
program that integrates applied research, advanced development, and engineering and prototype
development. The applied research element will focus on developing the fundamental concepts in high-
end computing and creating a pipeline of new ideas and graduate-level expertise for employment in
industry and the national security community. The advanced development element will select and refine
innovative technologies and architectures for potential integration into high-end systems. The engineering
and prototype development element will build operational prototypes and system level testbeds. The
report also emphasizes the importance of high-end computing laboratories that will test system software
on dedicated large-scale platforms; support the development of software tools and algorithms; develop
and advance benchmarking, modeling, and simulations for system architectures; and conduct detailed
technical requirements analysis. The report suggests $390 million per year as the steady-state budget for
this program. The program is planned to consolidate existing DARPA, DOE/NNSA, and NSA R&D
programs and will feature a joint program office with DDR&E oversight.
In addition to the study by the NRC's Committee on the Future of Supercomputing that resulted in
this interim report, two other studies of the future of U.S. supercomputing are under way: one by the
National Coordination Office for Information Technology Research and Development (ITRD) and
another by the JASONs. The ITRD High-End Computing Revitalization Task Force has been charged
with developing a plan and a 5-year roadmap to guide federal investments in high-end computing starting
14Charles J. Holland. 2001. DOD Research and Development Agenda for High Productivity Computing
Systems. White paper, June 11.
15Richard A. Games. 2001. Survey and Analysis of the National Security High Performance Computing
Architectural Requirements. June 4.
with fiscal year 2005. The final report is due in August 2003, in time to influence the FY 2005 budget.
The JASONs' study, commissioned by DOE at the request of Congress, will identify the distinct
requirements of the Stockpile Stewardship Program and its relation to the ASCI acquisition strategy. The
JASONs are expected to complete their (classified) report in August 2003.
SUPERCOMPUTING TECHNOLOGY
Vendors
Supercomputers have been manufactured in the United States and abroad since early in the history of
the computer industry. Since 1993, a list of the sites operating the 500 most powerful computer systems
has been available to the public. This list, called the TOP500, is updated twice a year. Performance is
measured by the number of floating point operations performed per second (flops) while executing the
LINPACK benchmark to solve a dense system of linear equations.17
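As a rough illustration of how a flops figure can be derived from a dense linear solve, the sketch below times a small Gaussian elimination and divides a nominal operation count by the elapsed time. It is a toy written for this purpose, not the LINPACK benchmark code itself; the function names, the matrix size, and the use of the conventional count of 2n^3/3 + 2n^2 operations are assumptions made for the example.

```python
import random
import time


def linpack_flop_count(n):
    """Nominal operation count conventionally credited for an n x n dense solve."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (works on copies)."""
    n = len(b)
    a = [row[:] for row in a]
    x = b[:]
    for k in range(n):
        # Partial pivoting: move the largest remaining entry of column k to row k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            x[i] -= m * x[k]
    # Back substitution on the resulting upper-triangular system.
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - s) / a[i][i]
    return x


def measure_flops(n=60, seed=0):
    """Return (estimated flops, max residual) for one random n x n system."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    residual = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    return linpack_flop_count(n) / elapsed, residual
```

Calling measure_flops() returns a rate for an interpreted, single-threaded solve, which will be many orders of magnitude below the figures reported on the TOP500 list; the point is only the form of the measurement (operation count divided by time, with a residual check on the answer).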
According to the June 2003 TOP500 list, the United States and Japan dominate the use of and
manufacture of high-performance systems (although supercomputers are used in Europe, European
computer companies have been limited to the integration of relatively small cluster systems). The
TOP500 data show that the United States has a 50 percent share of installed supercomputers, Germany
has 11 percent, and Japan has 8 percent, together accounting for 69 percent of the total. Another interesting
comparison is to look at the aggregate performance by country. From the distribution by performance,
the U.S. has 54 percent of the aggregate performance of the TOP500 computers and Japan has 17 percent,
together accounting for 71 percent of the total.
Breaking the numbers down by manufacturers, the top three, all U.S. companies, are Hewlett-Packard
(32 percent of the TOP500 machines), IBM (31 percent), and SGI (11 percent); together they account for
74 percent of the systems. Performance by manufacturer shows that 35 percent of the performance is
attributable to IBM's aggregate share of 31 percent, Hewlett-Packard's 24 percent, and NEC's 12 percent.
Ninety-one percent of the top 500 systems are U.S. made.
In summary, in both use and manufacture, the United States is the dominant participant, followed by
Japan. A small number of companies dominate the market. Germany is a large user of supercomputing
but not a large producer.
Architecture
Contemporary supercomputers are all built by clustering large numbers of compute nodes. They span
a spectrum of architectural choices, from clusters that are assembled from low-cost, high-volume
components, to systems that are custom built for high-end scientific computing. The main differentiators
are the node technology, the switch (sometimes called the interconnect) technology, and the node-switch
interface.
16See .
17No single number captures system performance across a wide range of applications and architectures. Flops in
a dense linear algebra benchmark is but one figure of merit; however, it is the one used for this widely referenced
list.
18Although these percentages would probably change if different metrics were used, the dominance of the
United States over other countries would most likely remain.
Node Technology
Most low-cost clusters use 32-bit Intel or Advanced Micro Devices (AMD) microprocessors. These
microprocessors are targeted for low-end servers and are produced in very large volumes (on the order of
hundreds of millions). They are not optimized for scientific computing.
Sixty-four-bit microprocessors (Alpha, Power, SPARC, MIPS, Itanium, Opteron) offer the advantages
of support for larger memories, a better performing memory subsystem, and support for larger shared-
memory multiprocessor (SMP) configurations. The production volumes of these microprocessors are two
orders of magnitude smaller than the volumes for 32-bit microprocessors. These microprocessors are
mostly targeted for high-end commercial servers, although on occasion vendors will develop SMP
configurations that are optimized for scientific computing.
Both 32-bit and 64-bit scalar microprocessors are optimized for single-thread performance on codes
that exhibit good temporal and spatial locality.19 These codes make most of their memory references to
an on-chip cache, with good cache reuse. Such processors have limited off-chip bandwidth, can support
only a small number (at most 8 or 16) of simultaneously outstanding memory references, and have cache
line mechanisms that are not ideal for scientific applications. In high-end application codes that do not
make good use of caches, this approach to memory system design leads to a dramatic drop in actual
performance when compared with the theoretical peak. This problem is mitigated in processors that
employ either multithreading or vectors to generate a large number of outstanding memory references and
that therefore tolerate long memory latencies while sustaining high bandwidth (thereby reducing the need
for data locality). Over 20 percent of the systems are based on Intel and AMD 32-bit processors. About
percent of the systems use vector processors. Approximately 60 percent of the systems use 64-bit
processors.
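The effect of locality on cache behavior can be sketched with a very small cache model. The example below is an illustration only: it assumes a direct-mapped cache with a hypothetical 64 lines of 8 words each, and the function name hit_rate is invented for this sketch. It compares a unit-stride sweep (good spatial locality), a sweep whose stride equals the line size (no spatial locality), and a repeated sweep over a small working set (temporal locality).

```python
def hit_rate(addresses, num_lines=64, line_words=8):
    """Fraction of word accesses that hit in a toy direct-mapped cache.

    num_lines and line_words are hypothetical parameters chosen for
    illustration; they do not describe any particular processor.
    """
    tags = [None] * num_lines  # which memory line each cache slot currently holds
    hits = 0
    for addr in addresses:
        line = addr // line_words   # memory line containing this word
        slot = line % num_lines     # direct-mapped placement
        if tags[slot] == line:
            hits += 1
        else:
            tags[slot] = line       # miss: fetch the line, evicting the old one
    return hits / len(addresses)


# Unit stride: each fetched line serves the next seven accesses.
unit_stride = list(range(4096))

# Stride equal to the line size: every access touches a new line.
line_stride = [i * 8 for i in range(4096)]

# Two sweeps over a working set that fits in the cache: the second sweep reuses it.
reused = list(range(256)) * 2

print(hit_rate(unit_stride), hit_rate(line_stride), hit_rate(reused))
```

In this model the strided sweep never hits, which is the situation in which cache-based processors fall far below their theoretical peak and in which vector or multithreaded memory access is intended to help.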
Switch Technology
Low-end clusters, including some on the TOP500 list, use high-volume switched Ethernet technology
for the interconnect. Higher bandwidth and lower latency are achieved by using custom interconnects
from third-party vendors (e.g., Quadrics and Myricom) or from the system vendors (e.g., Cray, IBM,
NEC, and SGI). A key differentiator between systems is the fraction of total system cost allocated to the
interconnect: Low-bandwidth networks will represent less than 10 percent of total system cost; a high-
bandwidth network may approach half of total system cost. Another important differentiator is the
scalability of the interconnect to large numbers of nodes.
Node-Switch Interface
Nodes of low-end clusters connect to the switch via a standard I/O bus, such as peripheral component
interconnect (PCI). This choice reuses high-volume, low-cost technology but limits the function and
performance of the interconnect, since I/O interfaces are not optimized for fast processor-to-processor
communication. In such systems, global bandwidth is an order of magnitude lower than local memory
bandwidth. The software for communication typically uses message passing, further increasing
communication latency and limiting bandwidth for short messages.
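Why added latency limits bandwidth for short messages can be seen from a simple cost model in which the time to send a message is a fixed latency plus the message size divided by the peak bandwidth. This model and the parameter values used below (10 microseconds, 1 GB/s) are assumptions chosen for illustration; they are not measurements of any system discussed here.

```python
def effective_bandwidth(message_bytes, latency_s, peak_bytes_per_s):
    """Delivered bandwidth when each message costs latency + size / peak bandwidth."""
    transfer_time = latency_s + message_bytes / peak_bytes_per_s
    return message_bytes / transfer_time


def half_performance_size(latency_s, peak_bytes_per_s):
    """Message size at which delivered bandwidth reaches half of the peak."""
    return latency_s * peak_bytes_per_s


# Hypothetical interconnect: 10 microseconds of latency, 1 GB/s peak bandwidth.
LATENCY_S = 10e-6
PEAK_BYTES_PER_S = 1e9

for size in (64, 1024, 10_000, 1_000_000):
    fraction = effective_bandwidth(size, LATENCY_S, PEAK_BYTES_PER_S) / PEAK_BYTES_PER_S
    print(f"{size:>9} bytes: {fraction:.1%} of peak")
```

Under these assumed numbers a message must be about 10,000 bytes long before half of the peak bandwidth is delivered, so any software overhead that raises latency moves that break-even size upward.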
A custom memory-connected interface, typically proprietary, can be used to increase bandwidth,
reduce latency, or provide added functionality. Such interfaces are usually paired with higher-
performance custom switches.20 In particular, a custom interface can directly support shared memory
communication, allowing a processor to access the memory of a remote node via load and store
19Temporal locality is the property that data accessed recently in the past are likely to be accessed soon in the
future. Spatial locality is the property that data that are stored very near one another tend to be accessed closely in
time.
20Recently, Intel and other companies have been attaching standard interconnects such as Ethernet and
InfiniBand directly to the memory system rather than via the PCI bus; thus, some of the performance advantages of
custom interfaces are becoming available with standard interconnects.
instructions. Since shared memory communication has little software overhead, it has lower
communication latency; however, the small number of pending memory references supported by mass-
market microprocessors limits global bandwidth. Shared memory support is generally believed to
facilitate parallel programming, because of the single name space it provides; it also facilitates the use of
a single operating system image to control the entire machine. Approximately half the systems in the
TOP500 list use proprietary switch interfaces.
Products
Close to 20 percent of the TOP500 systems are self-made or are assembled by system integrators
from commodity components. Almost all of these systems use Intel or AMD 32-bit microprocessor
nodes and run Linux. The use of this type of cluster architecture was popularized by the Beowulf
project,21 following on previous Network of Workstations projects. Such Beowulf clusters are
increasingly used as commercial capacity machines (e.g., Web servers and search engines) and as
departmental or project scientific computing machines in research and industry. Such clusters are
attractive because of their low purchase cost, the large number of component suppliers, and the ease of
adding components. Clusters of this type, which use low-cost Ethernet interconnects, are often used to
run "embarrassingly parallel" jobs consisting of many almost independent sequential subtasks.
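An embarrassingly parallel job can be sketched as a set of subtasks that share no data and are combined only at the end. The example below is a minimal illustration, not a description of how any cluster scheduler works: it estimates pi by Monte Carlo sampling, gives each subtask its own random seed, and uses a thread pool from the Python standard library as a stand-in for the separate nodes of a cluster. The function names and task sizes are invented for the sketch.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(task):
    """One independent subtask: count random points that fall inside the unit circle."""
    seed, samples = task
    rng = random.Random(seed)  # private generator, so subtasks share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_tasks=8, samples_per_task=20000, workers=4):
    """Farm out independent subtasks and combine their results with a single sum."""
    tasks = [(seed, samples_per_task) for seed in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total_hits = sum(pool.map(count_hits, tasks))
    return 4.0 * total_hits / (num_tasks * samples_per_task)


print(estimate_pi())
```

Because the subtasks never communicate, the only traffic is the distribution of task descriptions and the collection of one number per task, which is why a low-cost Ethernet interconnect is adequate for this class of job.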
The top U.S. vendors all offer clusters with 64-bit SMP nodes and custom switches. With the
exception of HP, all provide custom switch interfaces. The top ranked Hewlett-Packard (HP) systems,
including the second TOP500-ranked ASCI Q system, use AlphaServer SMP nodes connected (via a
standard PCI interface) by a Quadrics switch; global communication uses message passing. Previous
Hewlett-Packard clusters used the custom Hyperfabric interconnect. The top-ranked IBM systems,
including the fourth-ranked (by TOP500) ASCI White system, use Power SMP nodes connected with an
IBM proprietary switch using a proprietary interface (Power 4 systems currently use a standard I/O
interface); global communication uses message passing. The Cray T3E uses Alpha uniprocessor nodes
connected by a Cray proprietary switch with a proprietary interface that supports fast (put/get) remote
memory access; the largest such system on the TOP500 list has 1,900 processors. (Cray is no longer
pursuing the T3E architecture.) The SGI Origin uses MIPS quad-processor nodes connected with an SGI
proprietary switch and an interface that supports cache-coherent global shared memory; the largest such
system, with 1024 processors, is deployed at NASA Ames (SGI is now shipping systems that use Itanium
processors). The Sun Fire, with up to 106 SPARC processors, also supports global cache-coherent shared
memory.
NEC in Japan and Cray in the United States are at present the only vendors that manufacture vector
processors for large-scale computing; their production volumes are significantly smaller than the volumes
for nonvector 64-bit microprocessors. Such processors tend to be used in small-volume systems for the
high end of the scientific and technical computing markets. In the past, other top Japanese vendors
(Fujitsu and Hitachi) offered systems with vector processors. In the United States vector processors are
being developed by Cray, with its new X1 product line. Ten systems on the current TOP500 list use Cray
vector processor nodes.22
21Thomas Sterling, Donald J. Becker, Daniel Savarese, John E. Dorband, Udaya A. Ranawake, and Charles V.
Packer. 1995. Beowulf: A Parallel Workstation for Scientific Computation. Proceedings of the 24th International
Conference on Parallel Processing.
22Some mass-marketed microprocessors have limited support for vector instructions in a form that typically
cannot be used to hide memory latency. The committee reserves the term "vector processor" for systems that have
large vector register files and that support vector load/store instructions that can address noncontiguous memory
locations.
The NEC Earth Simulator
The most significant Japanese supercomputer manufacturer is NEC. In the spring of 2002, NEC
released the Earth Simulator (ES), a system with a peak performance of 40 Tflops/sec that is ranked first
on the June 2003 TOP500 list. Based on the TOP500 LINPACK benchmark, the ES is the world's fastest
computer by a factor of 2.58. An even greater ratio seems to hold for geosciences applications that it was
specifically designed to support.
The ES is a cluster of 640 shared-memory multiprocessor (SMP) nodes. Each SMP node has eight
processors, based on the SX-6 NEC processor design; each processor is a vector processor with a clock
frequency of 500 MHz. Eight vector units within each processor provide a peak performance of
8 Gflop/sec per processor. The peak memory bandwidth is 32 GBps. Each processor has 72 vector
registers, each with 256 elements. A robust crossbar network connects the nodes and provides a peak
bandwidth of 16 GBps per node. The sustained bandwidth is approximately 12 GBps, full duplex. The
design of the nodes of the ES (including vector processor and memory system) is evolutionary within the
SX vector family. Semiconductor technology and advanced packaging are used to achieve performance.
The software is also evolutionary and fairly stable.
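The figures quoted for the ES can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (640 nodes, eight processors per node, a 40 Tflops system peak, 32 GBps memory bandwidth, and 16 GBps peak and 12 GBps sustained node bandwidth). It assumes that the 32 GBps figure applies to each processor and that the 40 Tflops peak is a rounded value; the function and key names are invented for the example.

```python
NODES = 640
PROCESSORS_PER_NODE = 8
SYSTEM_PEAK_TFLOPS = 40.0          # rounded system peak quoted in the text
MEMORY_BW_GBPS = 32.0              # assumed to be per processor
NODE_LINK_PEAK_GBPS = 16.0
NODE_LINK_SUSTAINED_GBPS = 12.0


def earth_simulator_balance():
    """Derive per-processor peak and bytes-per-flop ratios from the quoted figures."""
    processors = NODES * PROCESSORS_PER_NODE
    gflops_per_processor = SYSTEM_PEAK_TFLOPS * 1000.0 / processors
    gflops_per_node = gflops_per_processor * PROCESSORS_PER_NODE
    return {
        "processors": processors,
        "gflops_per_processor": gflops_per_processor,
        "memory_bytes_per_flop": MEMORY_BW_GBPS / gflops_per_processor,
        "global_bytes_per_flop_peak": NODE_LINK_PEAK_GBPS / gflops_per_node,
        "global_bytes_per_flop_sustained": NODE_LINK_SUSTAINED_GBPS / gflops_per_node,
    }


for name, value in earth_simulator_balance().items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the system has 5,120 processors at roughly 8 Gflop/sec each, about 4 bytes of memory bandwidth per flop, and about 0.2 bytes of sustained global bandwidth per flop.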
It is instructive to compare the ES to the ASCI Q system at Los Alamos National Laboratory
(LANL), which uses HP AlphaServer ES45 nodes and a Quadrics switch. Compared with the ASCI Q,
the significant characteristics of the Earth Simulator are these:
Higher ratio of memory bandwidth tofloating-point rate (4 B/flop versus 0.8 B/flop9. This ratio
improves performance significantly for many codes that are memory intensive but do not exhibit the
spatial and temporal locality exploited by caches. Although some such codes can be rewritten to be more
cache-friendly, certain algorithms seem intrinsically difficult to localize.23
.
Use of vector para1/~1/te1/tism in addition to SMP and message-passing para1/~1/te1/tism. The availability
of a large number of vector registers and of vector load instructions makes it possible to prefetch data and
to hide memory latency for codes where data accesses are predictable but not spatially localized. Codes
that vectorize well can achieve a high fraction of the peak floating performance of the SX-6. On the other
hand, the scalar performance of the SX-6 processor is not as good as the scalar performance of the Alpha
processor, so the Alpha processor may achieve better performance on codes that do not vector~ze well and
are cache fiiendly.
· Use of a g1/toba1/t switch with a higher ratio of g1/toba1/t bandwidth tofloating-point rate 60.2 B/flop
versus 0. 03 B/flop9. This property contributes to performance on codes that require large amounts of
gIobal communication.
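The bandwidth-to-flop ratios in the comparison above can be turned into a rough bound on achievable performance. The following is a minimal sketch, not a model used by the report: it assumes a bandwidth-limited (roofline-style) bound that ignores caches, so it applies only to codes without exploitable locality. The function name and the kernel intensity of 0.25 flop per byte are hypothetical; the balance figures are the ones quoted above.

```python
def attainable_fraction(bytes_per_flop, flops_per_byte):
    """Upper bound on the fraction of peak flop rate a kernel can reach.

    bytes_per_flop: machine balance (bandwidth divided by peak flop rate).
    flops_per_byte: arithmetic intensity of the kernel (hypothetical input).
    Caches are ignored, so this only fits codes with little data reuse.
    """
    return min(1.0, bytes_per_flop * flops_per_byte)


# Machine balances quoted in the comparison above, in B/flop.
MEMORY_BALANCE = {"ES": 4.0, "ASCI Q": 0.8}
GLOBAL_BALANCE = {"ES": 0.2, "ASCI Q": 0.03}

# Peak rate implied by the ES figures: 32 GBps / (4 B/flop) = 8 Gflop/sec.
es_implied_peak_gflops = 32.0 / MEMORY_BALANCE["ES"]

# Hypothetical kernel performing one flop for every 4 bytes it moves.
intensity = 0.25
for name, balance in MEMORY_BALANCE.items():
    print(name, attainable_fraction(balance, intensity))
```

Under these assumptions the hypothetical kernel can reach full peak on a machine with the ES memory balance but is capped at about one-fifth of peak on a machine with the ASCI Q balance. The same function can be applied to `GLOBAL_BALANCE` for communication-bound codes.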
In summary, ES achieves a higher fraction of peak floating-point performance on many codes because of
better memory bandwidth, better global bandwidth, and the availability of a memory prefetch mechanism
(vector registers and vector load/store operations). There are no new micro-architectural concepts or
unique technologies that are noteworthy in the ES. Rather, the performance is achieved through the use
of a purpose-built microprocessor with high memory bandwidth and latency-hiding hardware and through
the acceptance of a different budget balance between node hardware and interconnect hardware.
23 See, for example, the GUPS benchmark described in Brian R. Gaeke, Parry Husbands, Xiaoye S. Li, Leonid
Oliker, Katherine A. Yelick, and Rupak Biswas, 2002, Memory-Intensive Benchmarks: IRAM vs. Cache-Based
Machines, International Parallel and Distributed Processing Symposium (IPDPS).
Software
Message passing is the main programming model used to scale applications to large systems; the MPI
standard message-passing library is available on all TOP500 systems, including shared memory systems.
Lower overhead communication can be achieved using put/get libraries on systems with suitable switch
interfaces, such as the Cray T3E. Shared memory parallelism on SMP nodes is often exploited using
OpenMP (i.e., C or FORTRAN with extensions for loop and task parallelism). However, OpenMP does
not seem to be used for systemwide parallelism on large systems (even those supporting shared memory),
perhaps because programmers lack the skill to use it well.
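The message-passing model described above can be illustrated with a small sketch. This is not MPI: it uses Python threads and queues as a stand-in, with each simulated rank owning its own slice of the data and communicating only through explicit send and receive operations. The function `run_ranks` and its mailbox layout are hypothetical; a real application would use MPI calls such as `MPI_Send`, `MPI_Recv`, or `MPI_Reduce`.

```python
import queue
import threading


def run_ranks(num_ranks, data):
    """Sum `data` using explicit messages between simulated ranks."""
    inboxes = [queue.Queue() for _ in range(num_ranks)]  # one mailbox per rank
    result = {}  # used only to hand the final answer back to the caller

    def rank_main(rank):
        # Each rank works on a disjoint slice of the data.
        partial = sum(data[rank::num_ranks])
        if rank != 0:
            inboxes[0].put((rank, partial))   # "send" the partial sum to rank 0
        else:
            total = partial
            for _ in range(num_ranks - 1):
                _, value = inboxes[0].get()   # blocking "receive"
                total += value
            result["total"] = total

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


print(run_ranks(4, list(range(100))))
```

The point of the sketch is the structure: no rank reads another rank's data directly, so the same pattern carries over to distributed memory systems, whereas an OpenMP-style loop would rely on all threads seeing one shared array.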
Almost all TOP500 systems use variants of Unix for their operating system. Shared memory systems
are controlled by one global OS image, while distributed memory systems typically have one OS image
per node. Lower-end Beowulf clusters typically use Linux, while higher-end systems use proprietary
Unix systems. Libraries, programming tools, parallel file systems, and various system management tools
complete the parallel programming environment available on these platforms.
Most vendor platforms use proprietary parallel programming environments. The proprietary software
is often derived from open-source software; for example, many proprietary MPI implementations are derived
from open-source MPI implementations. Beowulf clusters mostly use open-source parallel software that
is contributed by developers worldwide. Support for standard programming environments and interfaces,
across all platforms, is an important goal that is only partially achieved. MPI and OpenMP are two
successful standardization efforts in which industry adopted a de facto standard developed by the HPC
community and/or the HPC vendors. Another successful model is provided by the TotalView parallel
debugger, where a third-party software vendor supports the same software product across all main HPC
platforms. However, TotalView is a singular example; attempts to standardize various tool interfaces and
parallel system services, in particular parallel I/O, have had limited success. Although programming for
large-scale parallel machines is more complex than programming for sequential machines, the typical
programming environment available for scalable parallel computing is less sophisticated and less
standardized than the environment available on small systems.
Algorithms
The algorithms used to run supercomputing applications are needed not just within the applications
themselves but also to analyze the output data, store and transmit the data over unreliable media, load
balance efficiently, and so on. The primary challenge introduced by supercomputing is that many
conventional algorithms for these problems must be modified so as to scale effectively to much larger
data sets or numbers of processors and to run efficiently on machines with deep memory hierarchies. For
example, a numerical simulation on a very large mesh may involve converting an algorithm from one
using dense matrices or even direct solvers on sparse matrices to one using a specialized iterative method
that may still use a parallelized direct method on subproblems. Initially it may be possible to use a
serial mesh partitioner to load balance the matrix across processors, but as the matrix grows, a parallel
mesh partitioner may be needed. As another example, the problem may be scaled in order to introduce
new physical models (e.g., one that respects polycrystalline structure in plasticity models), requiring
wholly new discretizations and subgrid models. As this example illustrates, some of these algorithms are
very specialized to particular application domains, whereas others, like mesh partitioners, are of quite
general use.
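The move from dense matrices to an iterative method can be sketched as follows. This is a minimal illustration under stated assumptions, not a method named by the report: it uses the conjugate gradient method on a 1D Laplacian test matrix that is never stored, so memory grows linearly with the mesh size rather than quadratically. The function names, the test problem, and the tolerance are all hypothetical choices.

```python
def apply_laplacian(x):
    """Return A @ x for the 1D Laplacian (2 on the diagonal, -1 beside it).

    The matrix A is never formed; only its action on a vector is computed.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite operator apply_a."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break            # residual norm is small enough
        ap = apply_a(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Build a right-hand side whose exact solution is a vector of ones.
n = 50
b = apply_laplacian([1.0] * n)
x = conjugate_gradient(apply_laplacian, b)
print(max(abs(xi - 1.0) for xi in x))
```

Each iteration needs only one operator application and a few vector operations, which is why methods of this kind are easier to distribute across processors than a dense factorization. On a parallel machine the dot products become global reductions, and the mesh partitioning discussed above determines which entries of `x` each processor owns.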
Representative terms from entire chapter:
scientific computing