Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 21
4
Existing Conditions
Jack Worlton
Los Alamos National Laboratory
and
Worlton and Associates
CHARACTERISTICS OF HIGH-PERFORMANCE COMPUTERS
Types of Computers
The wide variety of computers used in computational science and en-
gineering can be illustrated by plotting the performance and cost of all
of the computers available from the computing industry. The resulting
band of products, as illustrated in Figure 4.1, ranges over microcomputers,
minicomputers, superminicomputers, minisupercomputers, high-end main-
frames, and supercomputers.
There is a band rather than just a line because there is a range of
performance that is available for a given cost and a range of costs for a
given performance. In general, the lower edge of the band is represented
by the older products, and the upper edge of the band is represented by
the newer products. That is, newer products have a higher performance
for a given cost and a lower cost for a given performance than do older
products.
One of the tongue-in-cheek definitions of supercomputers is that a
supercomputer is any computer that costs $10 million. In fact, current
supercomputer prices range from about $1 million to $20 million, but
in terms of constant dollars, the $10 million average is a useful rule of
thumb over about 3 decades. For example, the IBM 704 in the mid-1950s
cost $2 million to $3 million and operated at about 10,000 operations per
second. However, if these 1950s dollars are converted to 1980s dollars,
21
OCR for page 22
22
( ~108
By
o
IL
cn 1 07
lo
1o6
lo
O 105
104
JACK WORLTON
109
an. ~ .;;; .; . ; ;;;;
A,, :, Otto - r, ::
0.1 1
1 00 1 000 1 0000
COST (x$1000)
FIGURE 4.1 Performance and cost of available types of computers. ( Reprinted, by
permission, from Worlton and Associates.)
the cost of the IBM 704 is approximately $10 million. One way to view
this development is that over the past 30 years, for a constant cost the
performance of supercomputers has increased by a factor of 10,000; or,
put another way, for a constant performance the cost has decreased by a
factor of 10,000. This is equivalent to an annually compounded rate of
improvement in the ratio of performance to cost of about 36 percent per
year, a factor of 2 every 2.25 years, or a factor of 10 every 7.5 years.
Taxonomy of High-Performance Computers
If we examine in greater detail the high-performance computers, we
distinguish at the first level three categories of computers: general purpose,
special purpose, and research (Table 4.1~.
General Purpose
The general-purpose high-performance computers include the super-
computers, high-end mainframes, and minisupercomputers. The super-
computers (such as the Cray Y-MP and ETAi°) are distinguished by their
relatively higher execution rates (sustained rates of about 108 to 109 oper-
ations per second), their larger memory capacities (up to several hundred
OCR for page 23
EXISTING CONDI7IONS
TABLE 4.1 Taxonomy of High-Performance Computers
Category
Instances
General purpose
Special purpose
Research
Supercomputers
High-end mainframes
Minisupercomputers
Single-purpose
Bit processors
Massively parallel
Systolic
National
Industry
University
23
million 64-bit words), their high-speed input-output systems, and their high
prices. The prices of the mainframes (such as the IBM 3090, Amdahl 5890,
and Control Data Corporation 990) are about the same as those of the
supercomputers, ranging from about $1 million to $20 million, but their
performance is typically about a factor of 5 below that of supercomputers;
also, their memory capacities are typically lower and their input-output
systems are slower. The minisupercomputers (such as the Convex C-Series,
the Alliant FX/8, and the SCS-40) have performances that are typically 5
to 10 times lower than those of supercomputers and prices in the range of
$100,000 to $1 million, although some of the larger systems have prices in
the $1 million to $2 million range.
Special Purpose
There is no clear line of distinction between general- and special-
purpose computers; rather, the distinction is relative. That is, the general-
purpose computers have relatively more applications that they can execute
with high efficiency than do the special-purpose computers. It has been
known for decades that a trade-off is available between execution rate
and generality; that is, for a given component technology, special-purpose
computers can be made that are faster than general-purpose computers.
However, the cost of a special-purpose computer is in addition to, not
instead of, the cost of a general-purpose computer, because users cannot
afford to acquire a special-purpose computer for every application and
therefore must always have general-purpose computers that are specialized
for particular applications through application software. Thus the role of
OCR for page 24
24
JACK WORLTON
special-purpose computers is to augment, not to replace, general-purpose
computers.
The most limited purpose of design is found in single-purpose com-
puters such as logic emulators. While limited in the range of problems
they can solve, these are highly effective when the workload is highly spe-
cialized. Bit processors are designed for applications in which the data can
be expressed by a single bit, such as image processing. Massively paral-
lel computers attempt to achieve high performance by using a very large
number of slow processors. Finally, the systolic computers implement an
algorithm by pumping data through a series of identical functional units;
for example, a systolic computer that implemented a matrix multiply would
have an array of identical multiply-add processors that communicate their
partial results to one another.
It should be observed that the generality of many of the special-purpose
processors is being broadened by both hardware and software techniques
in the newer designs.
Research
Research computers are designed to investigate how to make better
computers rather than for applications in science and engineering. Some
of these are supported at the national level, the best known being Japan's
Super-Speed Computer Project, which combines the efforts of Japan's six
largest computer companies to design a supercomputer that will operate
at 10~° floating-point operations per second and will be ready for demon-
stration by March 1990. This project is funded at the $100 million level
by the Ministry of International Made and Industry in Japan. There are
similar projects in Europe, such as the Alvey project in Great Britain, the
Marisis project in France, and the Suprenum project in West Germany. In
the American scheme of things, these types of projects are conducted in
universities; examples include the Cedar project at the University of Illi-
nois, the Ultra project at New York University, and the Hypercube project
at the California Institute of Technology. Finally, industrial firms are also
building research computers, including the RP3, the GFll, and the TF-1
projects at IBM.
THE SCIENTIFIC COMPUTING ENVIRONMENT A MATRIX
It is useful to think of the scientific computing environment in terms
of a matrix in which we match the different types of computers against the
generic information technologies, of which there are four: (1) processing, in
which we transform an input into an output according to an algorithm; (2)
storage, in which we transfer information over time; (3) communications, in
which we transfer information over space; and (4) the human interface, in
OCR for page 25
EXISTING CONDITIONS
TECHNOLOGIES
~:,,,,,, , ~
... ' 2 PROCESSING2.' ' ' ' '.' 2 '
............................... ; , .. 0
A......
......... ...... ~
..' COMMUNICATIONS.2''.2'.
by, :
.................................. ..,,,, ~
25
SYSTEMS
.......... ' LARGE C LE ........
............. ~...... ' ' ' '' ' ' ' ' ""'1 ~........................
Supers Minisupers Workstations
Mainframes Superminis PCs
Common file Local disk Floppies and
system systems hard disks
Site networks, Site networks, Site networks,
LANs, WANs LANs, WANs LANs, WANs
Transparency Transparency Transparency
and and and
Visualization Visualization Visualization
FIGURE 4.2 The scientific computing environment. Note: LAN, local-area network;
WAN, wide-area network. (Repented, by permission, from Worlton and Associates.)
which we present information to and accept information from the user. We
now classify the types of computers by the level of sharing among the users:
(1) the personal resources that are not shared (personal computers with
prices less than $10,000 and workstations with prices between $10,000 and
$100,000~; (2) mid-range computers that are typically shared by a few tens
of users (minisupers and superminis with prices in the range of $100,000 to
$1 million); and (3) the large-scale systems that are typically shared by a
few hundred users (supercomputers and high-end mainframes with prices
in excess of $1 million). By matching the generic information technologies
against these categories of computers, we thereby create a 12-way taxonomy
that defines the scientific computing environment, as illustrated in Figure
4.2. Some major trends and characteristics of the technologies are described
below.
Trends in the Generic Information Technologies
Processing
For all three types of computers, the processing power for a given
cost is increasing, and the cost of a given level of processing power is
decreasing. There are no fundamental, technological, or economical limits
that will prevent these trends from continuing over at least the next decade.
These trends may slow down somewhat, but they will continue.
OCR for page 26
26
Storage
JACK WORLTON
There is no single storage technology that meets all requirements
for fast access time and low cost, so a hierarchy of technologies is used,
including semiconductors for the main storage, magnetic disks for secondary
storage, and magnetic tapes for archival storage. Progress has been most
rapid in the main memory, where access times are now just a few tens of
nanoseconds (a nanosecond is 10-9 s). The access time to magnetic disks
is typically a few tens of milliseconds (a millisecond is 10-3 S. the time
being constrained by the physical rotation of the disks. Thus there is a
6-orders-of-magnitude gap between the access times of main memory and
those of disks. This causes unacceptable delays when storage requirements
exceed the capacity of main memory, so that a new level in the storage
hierarchy has been created, often called the solid-state disk (SSD), with
access times of tens of microseconds (a microsecond is 10-6 s). The third
level in the storage system, the archival level of storage, has been and
remains a problem, because magnetic tape is not an ideal medium for
very large archives: it has low volumetric efficiency, its shelf life is limited,
and it can be erased. Optical storage technology is being developed in
both disk and tape formats, but so far there are no recording standards
for optical media, and most users are unwilling to record their archives
on a nonstandard medium; thus magnetic tape will continue to be used
for archival storage by most scientific computing centers until recording
standards are developed for optical disks and tapes. An exception will be
found in specialized applications where the advantages of optical media
exceed their disadvantages.
Communications
Communication both within and among computers and their users
continues to grow in importance. Three major trends are evident: (1)
the rate of information transmission is increasing, (2) the cost per bit of
information transmitted is decreasing, and (3) the connectivi~the number
of users who have access to data communications ports is increasing.
The point about connectivity is a matter of urgent management con-
cern. In 1981 Los Alamos National Laboratory conducted a strategic plan-
ning exercise that led to the conclusion that data communications should
be provided to all employees as a utility comparable to heat, light, power,
and telephones. This policy was called the Data Communication Utility.
At that time the laboratory had only about 1000 communications ports for
a staff of about 7000, and new ports were being installed at the rate of
only 200 per year, so that providing the full staff with data communications
would have required about 30 years. However, the pace of port installation
has been increased to 800 to 1000 per year and there are now between
OCR for page 27
EXISTING CONDITIONS
27
5000 and 6000 ports for the staff, so that the Data Communication Utility
will soon be a reality at Los Alamos. The Data Communication Utility is
a policy that should be considered by all organizations.
The networks to which data communications ports provide access in-
clude local-area networks (LANs) that span a building, site networks that
span a whole site, and wide-area networks (WANs) that span continents.
These networks permit transfer of information among computing resources
and among an ever-growing set of scientists and engineers who communi-
cate with each other electronically.
Interface
The function of the human interface in information technology is
twofold: (1) to provide transparent access to the information technologies,
and (2) to provide v~sual~zai'on tools that lead to the insights that are the
very purpose of computational science and engineering.
It is a fundamental principle of technology development that all tech-
nologies strive toward the condition of transparency, that is, the ability to
achieve the function of the technology without specifying how the function
is achieved. For example, the driver of an automobile is concerned primar-
ily with just four devices, the steering wheel, the gear shift, the accelerator,
and the brake pedal, but not with the thousands of parts these interface
devices control. Similarly, the computer user should be concerned only
with the nature of the problem being solved and not with specifying how
each step in the solution is to be accomplished. Many computer systems are
deficient not because they lack the computational power the users require,
but because they are not transparent they require the user to specify in
great detail how the computation should proceed. Future systems will in-
creasingly allow users to specify what is to be done, not how. For example,
parallel-processing computers are being developed today, but these will not
be mature until parallel execution is transparent to the user.
Visualization tools are now recognized as being as important as pro-
cessing, storage, and communications in computational science and en-
gineering. Richard Hamming's dictum that the purpose of computing is
insight, not numbers, is the rationale for visualization. The visualiza-
tion technologies include graphics terminals, microfilm, video tape, system
software, and, more recently, the interdisciplinary collaboration between
computer scientists and artists to present information in powerful and easily
perceived forms.
Modes of Operation
Most science and engineering computer centers offer their users this
whole environment, and the users then choose from among the resources
OCR for page 28
28
RATIONALE FO R
HIGH-PERFORMANCE
COMPUTING
JACK WORLTON
COMPLEXITY
THE USER
( R e q u i r e m e n t s ) | TIMELINESS
TH E MANAG ER
(Benefits)
FEASIBILI TY
SA VINGS (Time costs)
FAILURE ANALYSIS
QUALITY ASSURANCE
PRODUCTI VI TY
FIGURE 4.3 The rationale for high-performance computing. (Reported, by permission,
from Worlton and Associates.)
those that best meet their needs. For example, some users prefer to use
a workstation for development of programs, visualization of results, and
execution of small-scale problems; when problems grow beyond the ability
of the workstation to provide an acceptable response time, the problems
are then submitted to a supercomputer. Other users find the local control
and ease of use of mid-range computers desirable as an intermediate step
between workstations and supercomputers. An important characteristic of
the computing environment should be that users have a uniform interface
across all three types of computers, so that they can move applications
among the types of computers without significant conversion effort. This is
the main reason for the growing popularity of the Unix operating system:
Unix is available on all three generic types of computing systems and hence
can provide a relatively seamless interface among them.
RATIONALE FOR USE OF SUPERCOMPUTERS
Why do scientists and engineers use powerful but expensive supercom-
puters instead of less powerful but less expensive mid-range and personal
computers? And why do they use supercomputers in addition to mid-range
and personal computers? This discussion has to do with the rationale for
supercomputing, which can be viewed from two perspectives: that of the
user and that of the manager, as illustrated in Figure 4.3.
OCR for page 29
EXISTING CONDITIONS
29
Requirements for Supercomputing:
The User Perspective
From the point of view of the user, supercomputers are required to
solve highly complex problems in a timely manner. If the computing en-
vironment does not have complex problems to solve, then supercomputers
are not required; also, if there are complex problems to be solved but
there is no time urgency for their solution, then again, supercomputers are
not required. However, if both complexity and timeliness are important to
an organization, then it is a management error not to provide the-most
powerful computing resources available, and these are by definition the
supercomputers.
We can analyze computing requirements of all kinds by using a simple
but powerful dimensional analysis in which we decompose execution rate
into two explanatory variables, complexity and response time:
Execution rate [operations per problem]
Complexity [operations per problem]
Response time [seconds per problem]
Complexity has units of operations per problem, where operations
refers to the ordinary things computers do add, subtract, multiply, and
so on. Response time refers to the time required to solve a problem, in
units of seconds per problem. Because this model is dimensionally correct,
we can use it as an algebraic equation in which any two of the variables
uniquely determine the third. For example, we can specie the complexity
of the problems we want to solve and the response time required, and this
determines the execution rate of the computer necessary to meet these
requirements. Alternatively, for a given computer we can specify either
problem complexity or response time, but not both.
Given this model we can see that there will be requirements for high
execution rates if the complexity of the problem is large, or if the response
time required is small, or both. It is the need to solve problems with
ever-larger complexities and ever-shorter response times that is driving
the unrelenting demands for higher execution rates in supercomputers.
Historians tell us it is inherent in the nature of science, technology, and
engineering that they grow in complexity. For example, a tour through the
Smithsonian's Air and Space Museum provides a perspective not only on
the history of aerospace but also on complexity. The Wright brothers' flyer
was a relatively simple device compared to the DC-3, the DC-3 was not
as complex as jet aircraft, and jet aircraft are not as complex as aerospace
vehicles. We can quantify this growing complexity: in the decade from
1910 to 1920 it required only on the order of 10,000 engineering hours to
OCR for page 30
30
JACK WORLTON
design large aircraft, but in the decade of the 1970s this metric had grown
to about 10,000,000 engineering hours.
Similarly, shorter response times are becoming more critical because
of intense international competition. In the September 1988 issue of IEEE
Spectrum, an executive of a semiconductor firm is quoted as saying, "The
limit for the time it takes to design an integrated circuit is a year. Any
longer and it will already be out of date when it is introduced" (p. 33~.
Success in the university, in industry, and in government is often determined
not only by the results produced but also by the time scale on which the
results are produced.
We can quantify this relationship with a Homograph, that is, a diagram
that shows a correct relationship among the variables for any straight line
drawn across it, as illustrated in Figure 4.4. The scale of response times
decreases as we go higher in the diagram, the scale of complexity increases
as we go higher, and the sustained execution rate links the two other
scales. Suppose we want to solve a problem that has a complexity of 109
operations, and we desire a 15-min response time, which is about 1000 s.
The required execution rate is then 109/103 = 106 operations per second.
However, if we required the solution on an interactive time scale, for
example 10 s, then the required execution rate would be 108 operations per
second 100 times faster. Alternatively, if the 15-min response time were
satisfactory but the complexity of the problem were 10~2 operations, then
the required execution rate would be 109 operations per second. This last
example is taken from the requirements at National Aeronautics and Space
Administration Ames for the Numerical Aerodynamic Simulator. Figure
4.4 shows that it is the combination of high complexity and short response
times that forces the use of high-performance computers.
The largest problems that are being solved on current supercomputers
have a complexity between 10~3 and 10~4 total operations. However, critical
problems, such as long-range climate studies, that have complexities that
are orders of magnitude more complex await the development of more
powerful supercomputers.
The Dimensions of Arithmetic Complexity
We can analyze complexity in more detail by continuing our dimen-
sional analysis; we decompose complexity into a 4-factor formula:
Complexity (operations per problem) = A' V' T'G
where
OCR for page 31
EXISTING CONDIT ONS
RESPONSE TIME
(SEC/PROBLEM)
31
EXECUTION
RATE (OP/SEC)
1o1
1ol2
_ =
COUP LEXITY
(OP/PROBLEM)
INK
XX
108 ~
1 o5
S U P E R CO M P U TE R S
HI N F RAM ES
~ \
_ SUPE~1INIS
104
1o6
;~ \
4 _ M I C R Of\
10 4.
1 o6
1 0
1 0
1014
-1 o1 3
1 o1 2
1 o11
1 0
FIGURE 4.4 Nomograph of computational science and engineenng. (Repented, by
permission, from Worlton and Associates.)
G = geometry [points per time step]
T = time [time steps per problem]
V = variables inumber of variables computed per point]
A = algorithm [operations per variable].
Geometrical complexity comes from the number of dimensions and the
resolution (number of points) within each dimension, dimension meaning
not only the space dimensions but also any variable that must be system-
atically varied over some range. For example, if we are tracking particles
that do different things depending on their energy and angle of flight, then
energr and angularity are treated as additional dimensions. If the problem
is time dependent, then calculations must be performed at all of the geo-
metrical points for each time step. For each point in space-time, a certain
number of variables are computed, and for each variable, a certain number
OCR for page 40
40
JACK WORLTON
per chip and therefore avoid transmission delays from chip to chip. CMOS
has a major disadvantage as a logic technology for supercomputers: its gate
delay is slower than that for ECL. However, its relatively slow speed can
be partially compensated for by cooling it with liquid nitrogen to a nominal
temperature of 77 K, which lowers the gate delay. The cycle times of the
ETAi° range from 24 us down to 7 ns.
GaAs. Gallium arsenide has two advantages relative to ECL: it has a lower
gate delay and lower power dissipation. The lower power dissipation implies
a potential for increases in packing density without causing heat dissipation
problems and therefore even further speed increases. This technology also
has two disadvantages: higher cost and lower gate density per chip. The
gate density of GaAs is projected to increase dramatically compared to
the gate density of traditional logic technologies, as illustrated in Figure
4.10. If this projection should prove to be true or even approximately true,
the gate density disadvantage of GaAs would no longer exist. The GaAs
industry is maturing rapidly, and the cost per gate is projected to fall as
increasing gate densities are achieved. Gallium arsenide is being used in
the design of the Cray-3 for delivery in about 1990, with a projected circle
time of 2 ns.
HEMT. The high-electron mobility transistor (HEMT) technology is an
aluminum-doped version of GaAs that is cooled to liquid nitrogen temper-
atures. It is more complex than GaAs and its development will take longer,
probably into the mid-199Os. If it is developed successfully, it may offer
cycle times even lower than those offered by GaAs.
Other Logic Technologies. Josephson junction technology is a superconduct-
ing technology that was investigated in the 1970s and early 1980s but then
was dropped by most companies because of its complexity and also because
its performance characteristics could be achieved with other technologies
that were more tractable; however, Japanese companies continue to study
this technology. Optical, molecular, and quantum-tunneling technologies
are being studied for their potential for high-performance computers, but
they seem to offer few prospects for the near term.
Summary. For the immediate future, it is expected that most supercomput-
ers and other high-performance computers will use ECL and CMOS logic
technology, but if the implementation of the Cray-3 in GaAs is successful,
it could create a guidepost for others to follow. The leading-edge cycle
time should be about 2 us around 1990, and cycle times of 1 ns should
appear in supercomputers by 1995.
OCR for page 41
EXISTING CONDITIONS
loo
~5
S 10
In
'S
C,
MC,~
O 103
1o2
41
GaAs ~ ~
T T I I
T T r
1980 1982 1984 1986 1988 1990
YEAR
~ _._
1992 1 994
FIGURE 4.10 Projected trends in gate density of high-speed logic circuits. (Repented, by
permission, from Bernard C. Cole, Gallium arsenide starts getting some respect, Electronics,
June 1988, p. 41.)
Milestones in Processor Architecture
Architectural Efficiency
Efficiency in computer architecture is measured in units of results
generated per clock cycle. Improvements in computer architecture that
have resulted in higher efficiency are due largely to concurrency, that is,
to doing more things at once. Some selected examples are illustrated in
Figure 4.11.
In the mid-1950s in computers such as the IBM 704, instructions were
executed in a sequential scalar mode; that is, they specified only one opera-
tion on one pair of operands, and the processing of instructions included a
series of sequential steps: fetching the instruction, decoding it, forming the
effective address, fetching the operand, and then executing the operation.
Beginning in about 1960 in computers like the IBM STRETCH, an instruc-
tion lookahead provided the ability to fetch and process instruction N + 1
while executing instruction N. By the mid-1960s in computers such as the
CDC 6600, multiple function units were included that allowed several exe-
cutions, such as add and multiply, to be performed concurrently. By about
1970, the operations were speeded up by pipelining, that is, subdividing
OCR for page 42
42
1ol2
Cal
Lo
-
0 1010
tar
z
o
C'
x
LL
U]
go
En
En
11
109
1o8
107
1o6
105
104
JACK WORETON
.
.
~.,..~.,.]
. ~
~.2 ~
An ,, ,,
I;;;; ~
Is.. Multiple FUs 7.~..]
~,. . . . .. ...,, ~
~1
me , ,,,,.,,,,.,,.,
1950 1960 1970 1980 1990 2000
YEAR
FIGURE 4.11 Some architectural contributions to higher performance. (Repented, by
permission, from Worlton and Associates.)
the steps of the operations into substeps and overlapping these substeps
for faster execution. The 1970s saw the development of vector processors
in which the same operation was applied to many operands or operand
pairs, rather than to just one pair as in the scalar designs. The major
supercomputers today use vector processing to achieve high speed. The
first generation of vector processors had memory-to-memory designs; that
is, they fetched the vectors from memory and returned the results to mem-
o~y. However, this caused long start-up times, and the second generation
of vector processors followed the lead of the Cray-1 in using vector regis-
ters that allowed faster start-up times. Most vector designs today use the
register-to-register design. The leading edge of supercomputer architecture
today is found in designs that incorporate multiple vector processors, or
parallel-vector designs.
Balance in Vector Processor Design
One of the key issues in vector processing is the balance between the
scalar speed (doing one operation per instruction) and the vector speed
(doing many repetitive operations per instruction). Figure 4.12 compares
the merits of two designs, one with a relative performance of 10 in the
scalar mode and 100 in the vector mode, and another with a relative
OCR for page 43
EXISTING CONDITIONS
1000 -.
C'
A
a:
100-
a:
o
Cal
43
1
0.0 0.2 0.4
VECTOR FRACTION
FIGURE 4.12 Balance in vector processor design. (Repented, by permission, from
Worlton and Associates.)
performance of 2.S in the scalar mode and 400 in the vector mode. The
overall performance of the system is a function of the fractions of the work
done in each of the different modes. For practical applications, the fraction
of vector work is in the range 0.3 to 0.7, so that even though the second
design has a higher peak performance than does the first design, the first
design, which is better balanced, is clearly to be preferred.
Parallel Processing
In addition to decreasing cycle time and increasing design efficiency, a
third approach to increasing computer speed is through the use of multiple
processors, and there are many design issues for these so-called parallel
computers: Should there be a few fast processors or many slow ones?
Should the memory be shared, attached to each processor, or both? What
kind of interconnect network should be used to provide communications
among the processors, and among the processors and the memory? Inhere
are literally thousands of ways to design parallel computers, and at the
moment there is wide disagreement in the industry about which are the best
OCR for page 44
44
JACK WORLTON
choices. To analyze future prospects in these options, we turn to a taxonomy
of computer architectures-an intellectual road map of possibilities.
Flynn's Taxonomy
Michael Flynn's taxonomy from the 1960s has been the most com-
monly used guide to generic types of architectures. It matches the control
concurrency expressed in the number of instruction streams (one or many)
against the execution concurrency expressed in the number of data streams
that are being processed (one or many), to generate the four types of ar-
chitectures: (1) single instruction, single data (SISD); (2) single instruction,
multiple data (SIMD); (3) multiple instruction, single data (MISD); and (4)
multiple instruction, multiple data (MIMD). This has been a valuable guide
for over 20 years, but it is no longer an adequate guide because it is not
comprehensive for all of the architectural design options being explored.
An Expanded Taxonomy of Architectures
In addition to the types of control concurrency serial and parallel-
included in the Flynn taxonomy, a third type is being used, called clustering.
In a clustered design, clusters of multiple-instruction-stream processors are
connected together with global control with access to global memory.
Also, the types of instructions employed in the 1960s were limited to scalar
instructions, and there are now several new options that need to be included
in any classification scheme.
We can systematically explore the options for designing instruction
types by considering a number pair (a,b), where a = the number of
v~ela.lvilS Mu ill The ilmc1 union and 0 = Ine number of operands (or
operand pairs) specified. There are thus four options, as follows:
· The (1,1) option defines the scalar type of instruction, in which one
operation is specified on one pair of operands.
· The (1,N) option defines the vector type of operation, in which
one operation is specified, but the operation is applied to many operands
or pairs of operands.
~,% , ~. . ~
~r~^t~qt~ ~.~;~A .~ ·~ : ~:_ _ ~ L ~ _ . ~
~ ne (M, 1) option cennes the systolic type of operation, in which
many operations are performed on each operand that is fetched from
memory.
· Finally, the (M,N) option defines the horizontal type of operation,
in which many operations are specified on many operands for each instruc-
tion. The horizontal category is also referred to as very long instruction
word (VLIW).
We can now create an expanded taxonomy of architectures that
matches the 3 control categories against the 4 instruction categories to
create a 12-way taxonomy, as illustrated in Figure 4.13.
OCR for page 45
EXISTING CONDITIONS
CONTROL
C ONCURRENCY
45
.
INSTRUCTION TYPES
SCALAR VECTOR SYSTOLIC
(1,1) (1,N) (IVI,1)
===
VLIW
(M,N)
SERIAL
PARALLEL
CLUSTERED
FIGURE 4.13 An expanded taxonomy of computer architectures. Notation (a,b): a,
number of operations specified; b, number of operands (or operand pairs) specified.
(Reprinted, by permission, from Worlton and Associates.)
Historically, through the CDC 7600 in about 1970, most supercomput-
ers had serial-scalar designs; that is, they executed one stream of scalar
instructions. First-generation vector processors had serial-vector designs;
that is, they also executed a single stream of instructions, but the instruc-
tions specified vector operations. Machines of this generation were typified
by the Cray-1 and the Cyber 205. In the 1980s, parallel-vector computers
have been developed in which multiple vector processors are interconnected
through a communication network. The Cray Y-MP and the CDC/ETAi°
are examples of computers in this category. The further development of
vector designs is expected to be generalized to clustered-vector designs,
with clusters of multiple vector processors interconnected with global net-
works. An example of this category is the Cedar project at the University
of Illinois that uses the Alliant I;XJ8 as the cluster.
The parallel-scalar category represents those designs that use a large
number of relatively slow processors to attain high performance, such as the
NCUBE/ten and the Intel iPSC. These can also be designed with clustering,
as found in the Myrias-2.
Systolic designs have been largely special-purpose computers that im-
plement a particular algorithm such as a matrix multiply or a fast Fourier
transform, and they can be developed in serial, parallel, or clustered form.
The WARP system developed by Carnegie Mellon University is the leading
edge of this type of design. The horizontal or VLIW designs attempt to
execute multiple operations per clock period, and they too can be devel-
oped in serial, parallel, or clustered form. Examples of the VLIW design
are found in the Cydra-S and the Multiflow Mace computers.
OCR for page 46
46
JACK WORLTON
This taxonomy can be extended by subdividing each of the categories
to develop an even more detailed taxonomy.
Key Issues in Parallel Processing
The trend toward parallel computation is perhaps the major trend
to watch in the next decade of computer architecture, and it is useful
to identify some of the key issues that are being explored in the dozens
of commercial and research parallel-processor projects. These issues can
be thought of in three categories: system software, system hardware, and
applications.
Parallel processing will affect system software in language design, com-
pilers, operating systems, and libraries. For example, should we continue
to use Fortran to specify parallel processing or should we abandon this
old language and adopt something new? Fortran is being modified by
adding new constructs that allow the specification of parallel processing,
and no doubt this language will continue to be used long into the future
because of the huge commitment to applications that exist for it. Compilers
are being influenced by parallel processing because they need to identify
opportunities for parallel execution in a transparent manner, that is, so
the user need not be concerned with the details of how the parallelism
is achieved but only with what is to be done. Operating systems are now
more complex because they need to maintain control and provide services
for many streams of instructions rather than just for one. Finally, system
libraries are being adapted to use parallel processors.
Applications will be affected by parallel processing by the need to
insert parallel control statements, to develop parallel algorithms, and to
rethink the mathematical models on which the algorithms are based.
One of the key issues in the design of high-performance parallel
computers is the trade-off between speed and parallelism. That is, in
principle we could select the number of processors to be from as few as
one or two to as many as thousands (or indeed millions), and we could
select the speed of the individual processors to be as slow as a bit-serial
microcomputer or as fast as a supercomputer, as illustrated in Figure 4.14.
However, a design with only a few slow processors would be too slow to
be of any use as a high-performance computer, and a design with thousands
of supercomputers in a single system would not be economically feasible,
so that the practical area of trade-off lies in between. The evolution of
parallel processing is following three guideposts.
First, the zone of "large-grain" parallelism is defined by the use of
a relatively small number of fast processors, 1 ~ P< 100. This form of
parallel-processor design was pioneered by Cray Research, Inc. in 1982 with
the introduction of the Cray X-MP product line, in which the individual
OCR for page 47
EXISTING CONDITIONS
o
In
In
o
C]
UJ
Lo
In
47
~ Large-graln ~
And ) $)
arm ) ~1~1~$0~
it.. Mealum-:araln ~
(, ", Parall'ell~sm' ~ ~,"3
~100 :c:'~'P2~c~"210.00})
~ . . : . .. ~.... . . _
_.... ~ _
_ ... . . .. _
<:~: Small-graln,~
'2:~.~'~'~:P:atralIe11sm~ .. It's
~ '':''', 2.'.{P:, :~.1 00'0} " . it,
NUMBER OF PROCESSORS (P)
FIGURE 4.14 The trade-off between speed and parallelism. (Repented, by permission,
from Worlton and Associates.)
processors are vector supercomputers in their own right and the number
of processors varies from the initial offering of 2 to the current offering
of 8 in the Cray Y-MP/~. Other vendors have followed this design as
a guidepost, including parallel-vector processors offered by Control Data
Corporation, IBM Corporation, Convex, Alliant, the Japanese Super-Speed
Computer Project, and widely rumored introductions of similar designs from
the Japanese supercomputer vendors. The number of processors in this
domain is being expanded from 8 processors to 16 in the near term, and to
32 and 64 in the generations in development for delivery in the early 1990s.
Second, the zone of "medium-grain" parallelism is defined by the use
of a larger number of slower processors, with the number of processors
being in the range of 100 < P < 1000. This form of parallelism is typified
by the BEN Butterfly, which has 256 processors, and by the Myrias SPS-2,
with 512 processors.
Third, the zone of "small-grain" parallelism is defined by the use of an
even larger number of processors, P ~ 1000, each of which is even slower
and is typically a bit-serial microprocessor. Here the number of processors
varies from 4096 in the AMT-610 DAP, to 16,384 in the Goodyear MPP, to
65,536 in the Connection Machine.
A Taxonomy of Limits to Massive Parallelism
The term massive parallelism is loosely applied to any of a variety of
computers in which the desired performance is achieved through a very
OCR for page 48
48
JACK WORLTON
DECOMPOSITION
ALGORITHMIC
LIMITS TO
MASSIVE PARALLELISM
SERIAL BOTTLENECK
COMMUNICATION
IMPLEMENTATION
SYNCHRONIZA TION
INTERFERENCE
FIGURE 4.15 Taxonomy of limits to massive parallelism. (Repented, by permission, from
Alex Yuen-Wai Kwok, Center for Supercomputer Research and Development Report No.
679, August 1987.)
large number of relatively slow processors, with the threshold beyond 1000
processors often used to mean massive.
The ability of a parallel computer to perform efficiently as the number
of processors is increased is referred to as scalability. A study of architec-
tural scalability published at the Center for Supercomputer Research and
Development at the University of Illinois presents a taxonomy of the limits
to scalability, as summarized in Figure 4.15.
Limits of the first kind are due to algorithmic constraints. For a system
with P processors to perform at full effectiveness would require that the
number of available tasks equal or exceed the number of processors through-
out the calculation. However, not all algorithms can be so decomposed,
so that there is a decomposition limit to scalability. When the number of
tasks available is just one, this is referred to as the serial bottleneck, which
has very serious implications for the effectiveness of massive parallelism.
Limits of the second kind are due to the implementation details, includ-
ing latencies caused by communications, synchronization, and interference.
As tasks are decomposed to finer levels of detail, the communications re-
quirements increase in proportion to the level of decomposition, and the
amount of actual communication latency is dependent on the system design.
A second implementation limit is caused by the need to integrate the results
of the decomposed tasks through various synchronization techniques. At
OCR for page 49
EXISTING CONDITIONS
49
the synchronization points, the number of available tasks decreases, so that
the number of idle processors increases, thus limiting performance. A third
implementation limit is caused by interference as the processors attempt
to access shared resources; these resources may include shared memory,
shared communication networks, or shared input-output systems.
Following Hockney's nils method, it can be shown mathematically that
the fraction of parallelism that is required to achieve 50 percent efficiency
for a parallel-processor system is given by
(P - 2)
7ri/2 = up _ 1'
where P = the number of processors, and hence the fraction of serial work
cannot exceed
1-~/2 = up _ i)
Thus as P grows, the domain of applicability becomes inherently more
limited, because ~/2 approaches 1.0 and 1 - ~/2 approaches 0 very
rapidly. This does not imply that it is impossible to use parallel processors
effectively, but it does provide a guide to the domain of application, with
a smaller number of very fast processors being more useful for general
purposes than a larger number of slow processors.
CONCLUSIONS
It has been over 300 years since Galileo unlocked the door to the
world of empirical science, using a telescope as a key, so to speak. In the
year Galileo died (1642), Isaac Newton was born, and he invented a key we
call the calculus that he used to unlock the door to the world of theoretical
science. And it has been a scant 4 decades since von Neumann recognized
that powerful electronic computers were the key to still another world of
science, computational science.
In the opening lines of Man of La Mancha, the author (Dale Wasser-
man) invites the audience, "Come, enter into the world of my imagination."
The supercomputer, too, bids us enter into the world of imagination. In
the real world, we are constrained by limits of time, space, energy, matter,
and costs, but in the world of the supercomputer, these constraints are
merely bits of information that we can readily manipulate. In the world
of the supercomputer, we can experiment with new designs for such things
as aircraft, automobiles, accelerators, weapons, and chemicals; we can ex-
periment with these designs until they either perform as we desire or we
discover why they fail. By experimenting in the world of the supercom-
puter, we can avoid many of the costs, delays, and failures that would have
OCR for page 50
so
JACK WORLTON
occurred if we had experimented with these designs in the real world. In
this new world we can create previously unthinkable systems such as black
holes, stand in space to watch them perform, and create videotapes of this
adventure to take back to the real world.
The university, the industry, or the nation that would be a leader
in the modern world of intense international competition must master the
information technologies, the leading edge of which are the supercomputing
technologies. It is imperative for our future success as a nation that we
accept the invitation offered by the supercomputer "Come, enter into the
world of . . . imagination."
Representative terms from entire chapter:
jack worlton