4 Existing Conditions

Jack Worlton
Los Alamos National Laboratory and Worlton and Associates

CHARACTERISTICS OF HIGH-PERFORMANCE COMPUTERS

Types of Computers

The wide variety of computers used in computational science and engineering can be illustrated by plotting the performance and cost of all of the computers available from the computing industry. The resulting band of products, as illustrated in Figure 4.1, ranges over microcomputers, minicomputers, superminicomputers, minisupercomputers, high-end mainframes, and supercomputers. There is a band rather than just a line because there is a range of performance available for a given cost and a range of costs for a given performance. In general, the lower edge of the band is represented by the older products, and the upper edge of the band is represented by the newer products. That is, newer products have higher performance for a given cost and lower cost for a given performance than do older products.

One tongue-in-cheek definition of a supercomputer is any computer that costs $10 million. In fact, current supercomputer prices range from about $1 million to $20 million, but in terms of constant dollars the $10 million average has been a useful rule of thumb over about three decades. For example, the IBM 704 in the mid-1950s cost $2 million to $3 million and operated at about 10,000 operations per second.

However, if these 1950s dollars are converted to 1980s dollars, the cost of the IBM 704 is approximately $10 million. One way to view this development is that over the past 30 years, for a constant cost the performance of supercomputers has increased by a factor of 10,000; or, put another way, for a constant performance the cost has decreased by a factor of 10,000. This is equivalent to an annually compounded rate of improvement in the ratio of performance to cost of about 36 percent per year, a factor of 2 every 2.25 years, or a factor of 10 every 7.5 years.

FIGURE 4.1 Performance and cost of available types of computers. (Reprinted, by permission, from Worlton and Associates.)
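These rates follow directly from the 10,000-fold improvement over roughly 30 years cited above. The short Python sketch below rederives them; the 10,000-fold and 30-year figures come from the text, and the helper name is mine.

```python
import math

# Improvement in performance per unit cost cited in the text:
# a factor of 10,000 over roughly 30 years.
total_factor = 10_000
years = 30

# Annually compounded improvement rate.
annual_rate = total_factor ** (1 / years) - 1   # ~0.36, i.e., about 36 percent per year

# Time to improve by a given factor at that compounded rate.
def years_to_factor(factor, rate=annual_rate):
    return math.log(factor) / math.log(1 + rate)

print(f"annual rate of improvement: {annual_rate:.1%}")
print(f"factor of 2 every {years_to_factor(2):.2f} years")    # ~2.25 years
print(f"factor of 10 every {years_to_factor(10):.1f} years")  # ~7.5 years
```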

Taxonomy of High-Performance Computers

If we examine the high-performance computers in greater detail, we distinguish at the first level three categories of computers: general purpose, special purpose, and research (Table 4.1).

TABLE 4.1 Taxonomy of High-Performance Computers

Category           Instances
General purpose    Supercomputers; high-end mainframes; minisupercomputers
Special purpose    Single-purpose; bit processors; massively parallel; systolic
Research           National; industry; university

General Purpose

The general-purpose high-performance computers include the supercomputers, high-end mainframes, and minisupercomputers. The supercomputers (such as the Cray Y-MP and the ETA10) are distinguished by their relatively higher execution rates (sustained rates of about 10^8 to 10^9 operations per second), their larger memory capacities (up to several hundred million 64-bit words), their high-speed input-output systems, and their high prices. The prices of the mainframes (such as the IBM 3090, Amdahl 5890, and Control Data Corporation 990) are about the same as those of the supercomputers, ranging from about $1 million to $20 million, but their performance is typically about a factor of 5 below that of supercomputers; also, their memory capacities are typically lower and their input-output systems are slower. The minisupercomputers (such as the Convex C-Series, the Alliant FX/8, and the SCS-40) have performances that are typically 5 to 10 times lower than those of supercomputers and prices in the range of $100,000 to $1 million, although some of the larger systems have prices in the $1 million to $2 million range.

Special Purpose

There is no clear line of distinction between general- and special-purpose computers; rather, the distinction is relative. That is, the general-purpose computers have relatively more applications that they can execute with high efficiency than do the special-purpose computers. It has been known for decades that a trade-off is available between execution rate and generality; that is, for a given component technology, special-purpose computers can be made that are faster than general-purpose computers. However, the cost of a special-purpose computer is in addition to, not instead of, the cost of a general-purpose computer, because users cannot afford to acquire a special-purpose computer for every application and therefore must always have general-purpose computers that are specialized for particular applications through application software. Thus the role of
special-purpose computers is to augment, not to replace, general-purpose computers.

The most limited design purpose is found in single-purpose computers such as logic emulators. While limited in the range of problems they can solve, these are highly effective when the workload is highly specialized. Bit processors are designed for applications in which the data can be expressed by a single bit, such as image processing. Massively parallel computers attempt to achieve high performance by using a very large number of slow processors. Finally, the systolic computers implement an algorithm by pumping data through a series of identical functional units; for example, a systolic computer that implemented a matrix multiply would have an array of identical multiply-add processors that communicate their partial results to one another. It should be observed that the generality of many of the special-purpose processors is being broadened by both hardware and software techniques in the newer designs.

Research

Research computers are designed to investigate how to make better computers rather than for applications in science and engineering. Some of these are supported at the national level, the best known being Japan's Super-Speed Computer Project, which combines the efforts of Japan's six largest computer companies to design a supercomputer that will operate at 10^10 floating-point operations per second and will be ready for demonstration by March 1990. This project is funded at the $100 million level by the Ministry of International Trade and Industry in Japan. There are similar projects in Europe, such as the Alvey project in Great Britain, the Marisis project in France, and the Suprenum project in West Germany. In the American scheme of things, these types of projects are conducted in universities; examples include the Cedar project at the University of Illinois, the Ultra project at New York University, and the Hypercube project at the California Institute of Technology. Finally, industrial firms are also building research computers, including the RP3, the GF11, and the TF-1 projects at IBM.

THE SCIENTIFIC COMPUTING ENVIRONMENT: A MATRIX

It is useful to think of the scientific computing environment in terms of a matrix in which we match the different types of computers against the generic information technologies, of which there are four: (1) processing, in which we transform an input into an output according to an algorithm; (2) storage, in which we transfer information over time; (3) communications, in which we transfer information over space; and (4) the human interface, in
which we present information to and accept information from the user. We now classify the types of computers by the level of sharing among the users: (1) the personal resources that are not shared (personal computers with prices less than $10,000 and workstations with prices between $10,000 and $100,000); (2) mid-range computers that are typically shared by a few tens of users (minisupers and superminis with prices in the range of $100,000 to $1 million); and (3) the large-scale systems that are typically shared by a few hundred users (supercomputers and high-end mainframes with prices in excess of $1 million). By matching the generic information technologies against these categories of computers, we thereby create a 12-way taxonomy that defines the scientific computing environment, as illustrated in Figure 4.2. Some major trends and characteristics of the technologies are described below.

FIGURE 4.2 The scientific computing environment. Note: LAN, local-area network; WAN, wide-area network. (Reprinted, by permission, from Worlton and Associates.)

Technology        Large-scale systems          Mid-range systems            Personal systems
Processing        Supers, mainframes           Minisupers, superminis       Workstations, PCs
Storage           Common file system           Local disk systems           Floppies and hard disks
Communications    Site networks, LANs, WANs    Site networks, LANs, WANs    Site networks, LANs, WANs
Human interface   Transparency, visualization  Transparency, visualization  Transparency, visualization

Trends in the Generic Information Technologies

Processing

For all three types of computers, the processing power for a given cost is increasing, and the cost of a given level of processing power is decreasing. There are no fundamental, technological, or economic limits that will prevent these trends from continuing over at least the next decade. These trends may slow somewhat, but they will continue.

Storage

There is no single storage technology that meets all requirements for fast access time and low cost, so a hierarchy of technologies is used, including semiconductors for main storage, magnetic disks for secondary storage, and magnetic tapes for archival storage. Progress has been most rapid in main memory, where access times are now just a few tens of nanoseconds (a nanosecond is 10^-9 s). The access time to magnetic disks is typically a few tens of milliseconds (a millisecond is 10^-3 s), the time being constrained by the physical rotation of the disks. Thus there is a 6-orders-of-magnitude gap between the access times of main memory and those of disks. This causes unacceptable delays when storage requirements exceed the capacity of main memory, so a new level in the storage hierarchy has been created, often called the solid-state disk (SSD), with access times of tens of microseconds (a microsecond is 10^-6 s). The third level in the storage system, the archival level, has been and remains a problem, because magnetic tape is not an ideal medium for very large archives: it has low volumetric efficiency, its shelf life is limited, and it can be erased. Optical storage technology is being developed in both disk and tape formats, but so far there are no recording standards for optical media, and most users are unwilling to record their archives on a nonstandard medium; thus magnetic tape will continue to be used for archival storage by most scientific computing centers until recording standards are developed for optical disks and tapes. An exception will be found in specialized applications where the advantages of optical media exceed their disadvantages.

Communications

Communication both within and among computers and their users continues to grow in importance. Three major trends are evident: (1) the rate of information transmission is increasing, (2) the cost per bit of information transmitted is decreasing, and (3) the connectivity (the number of users who have access to data communications ports) is increasing.

The point about connectivity is a matter of urgent management concern. In 1981 Los Alamos National Laboratory conducted a strategic planning exercise that led to the conclusion that data communications should be provided to all employees as a utility comparable to heat, light, power, and telephones. This policy was called the Data Communication Utility. At that time the laboratory had only about 1000 communications ports for a staff of about 7000, and new ports were being installed at the rate of only 200 per year, so that providing the full staff with data communications would have required about 30 years. However, the pace of port installation has been increased to 800 to 1000 per year and there are now between
5000 and 6000 ports for the staff, so that the Data Communication Utility will soon be a reality at Los Alamos. The Data Communication Utility is a policy that should be considered by all organizations.

The networks to which data communications ports provide access include local-area networks (LANs) that span a building, site networks that span a whole site, and wide-area networks (WANs) that span continents. These networks permit transfer of information among computing resources and among an ever-growing set of scientists and engineers who communicate with each other electronically.

Interface

The function of the human interface in information technology is twofold: (1) to provide transparent access to the information technologies, and (2) to provide visualization tools that lead to the insights that are the very purpose of computational science and engineering.

It is a fundamental principle of technology development that all technologies strive toward the condition of transparency, that is, the ability to achieve the function of the technology without specifying how the function is achieved. For example, the driver of an automobile is concerned primarily with just four devices, the steering wheel, the gear shift, the accelerator, and the brake pedal, but not with the thousands of parts these interface devices control. Similarly, the computer user should be concerned only with the nature of the problem being solved and not with specifying how each step in the solution is to be accomplished. Many computer systems are deficient not because they lack the computational power the users require, but because they are not transparent: they require the user to specify in great detail how the computation should proceed. Future systems will increasingly allow users to specify what is to be done, not how. For example, parallel-processing computers are being developed today, but these will not be mature until parallel execution is transparent to the user.

Visualization tools are now recognized as being as important as processing, storage, and communications in computational science and engineering. Richard Hamming's dictum that the purpose of computing is insight, not numbers, is the rationale for visualization. The visualization technologies include graphics terminals, microfilm, videotape, system software, and, more recently, the interdisciplinary collaboration between computer scientists and artists to present information in powerful and easily perceived forms.

Modes of Operation

Most science and engineering computer centers offer their users this whole environment, and the users then choose from among the resources
those that best meet their needs. For example, some users prefer to use a workstation for development of programs, visualization of results, and execution of small-scale problems; when problems grow beyond the ability of the workstation to provide an acceptable response time, the problems are then submitted to a supercomputer. Other users find the local control and ease of use of mid-range computers desirable as an intermediate step between workstations and supercomputers. An important characteristic of the computing environment should be that users have a uniform interface across all three types of computers, so that they can move applications among the types of computers without significant conversion effort. This is the main reason for the growing popularity of the Unix operating system: Unix is available on all three generic types of computing systems and hence can provide a relatively seamless interface among them.

RATIONALE FOR USE OF SUPERCOMPUTERS

Why do scientists and engineers use powerful but expensive supercomputers instead of less powerful but less expensive mid-range and personal computers? And why do they use supercomputers in addition to mid-range and personal computers? This discussion has to do with the rationale for supercomputing, which can be viewed from two perspectives: that of the user and that of the manager, as illustrated in Figure 4.3.

FIGURE 4.3 The rationale for high-performance computing: the user's requirements (complexity, timeliness) and the manager's benefits (feasibility, savings of time and costs, failure analysis, quality assurance, productivity). (Reprinted, by permission, from Worlton and Associates.)

Requirements for Supercomputing: The User Perspective

From the point of view of the user, supercomputers are required to solve highly complex problems in a timely manner. If the computing environment does not have complex problems to solve, then supercomputers are not required; also, if there are complex problems to be solved but there is no time urgency for their solution, then again, supercomputers are not required. However, if both complexity and timeliness are important to an organization, then it is a management error not to provide the most powerful computing resources available, and these are by definition the supercomputers.

We can analyze computing requirements of all kinds by using a simple but powerful dimensional analysis in which we decompose execution rate into two explanatory variables, complexity and response time:

Execution rate [operations per second] = Complexity [operations per problem] / Response time [seconds per problem]

Complexity has units of operations per problem, where operations refers to the ordinary things computers do: add, subtract, multiply, and so on. Response time refers to the time required to solve a problem, in units of seconds per problem. Because this model is dimensionally correct, we can use it as an algebraic equation in which any two of the variables uniquely determine the third. For example, we can specify the complexity of the problems we want to solve and the response time required, and this determines the execution rate of the computer necessary to meet these requirements. Alternatively, for a given computer we can specify either problem complexity or response time, but not both. Given this model we can see that there will be requirements for high execution rates if the complexity of the problem is large, or if the response time required is small, or both. It is the need to solve problems with ever-larger complexities and ever-shorter response times that is driving the unrelenting demands for higher execution rates in supercomputers.

Historians tell us it is inherent in the nature of science, technology, and engineering that they grow in complexity. For example, a tour through the Smithsonian's Air and Space Museum provides a perspective not only on the history of aerospace but also on complexity. The Wright brothers' flyer was a relatively simple device compared to the DC-3, the DC-3 was not as complex as jet aircraft, and jet aircraft are not as complex as aerospace vehicles. We can quantify this growing complexity: in the decade from 1910 to 1920 it required only on the order of 10,000 engineering hours to
design large aircraft, but in the decade of the 1970s this metric had grown to about 10,000,000 engineering hours.

Similarly, shorter response times are becoming more critical because of intense international competition. In the September 1988 issue of IEEE Spectrum, an executive of a semiconductor firm is quoted as saying, "The limit for the time it takes to design an integrated circuit is a year. Any longer and it will already be out of date when it is introduced" (p. 33). Success in the university, in industry, and in government is often determined not only by the results produced but also by the time scale on which the results are produced.

We can quantify this relationship with a nomograph, that is, a diagram that shows a correct relationship among the variables for any straight line drawn across it, as illustrated in Figure 4.4. The scale of response times decreases as we go higher in the diagram, the scale of complexity increases as we go higher, and the sustained execution rate links the two other scales. Suppose we want to solve a problem that has a complexity of 10^9 operations, and we desire a 15-min response time, which is about 1000 s. The required execution rate is then 10^9/10^3 = 10^6 operations per second. However, if we required the solution on an interactive time scale, for example 10 s, then the required execution rate would be 10^8 operations per second, 100 times faster. Alternatively, if the 15-min response time were satisfactory but the complexity of the problem were 10^12 operations, then the required execution rate would be 10^9 operations per second. This last example is taken from the requirements at National Aeronautics and Space Administration Ames for the Numerical Aerodynamic Simulator.

Figure 4.4 shows that it is the combination of high complexity and short response times that forces the use of high-performance computers. The largest problems being solved on current supercomputers have a complexity between 10^13 and 10^14 total operations. However, critical problems, such as long-range climate studies, whose complexities are orders of magnitude greater, await the development of more powerful supercomputers.

FIGURE 4.4 Nomograph of computational science and engineering: scales of response time (seconds per problem) and complexity (operations per problem), linked by the sustained execution rate (operations per second), with bands for microcomputers, superminis, mainframes, and supercomputers. (Reprinted, by permission, from Worlton and Associates.)
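Reading the nomograph amounts to applying the execution-rate model to particular numbers. The minimal Python sketch below reproduces the three examples just discussed; the function name and labels are illustrative, not from the original.

```python
def required_execution_rate(complexity_ops, response_time_s):
    """Execution rate [op/s] = complexity [op/problem] / response time [s/problem]."""
    return complexity_ops / response_time_s

examples = [
    ("10^9 ops, 15-min (about 1000 s) response", 1e9, 1e3),   # -> 10^6 op/s
    ("10^9 ops, interactive 10-s response",      1e9, 1e1),   # -> 10^8 op/s, 100 times faster
    ("10^12 ops, 15-min response (NASA Ames Numerical Aerodynamic Simulator)", 1e12, 1e3),  # -> 10^9 op/s
]

for label, ops, seconds in examples:
    print(f"{label}: {required_execution_rate(ops, seconds):.0e} operations per second")
```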

The Dimensions of Arithmetic Complexity

We can analyze complexity in more detail by continuing our dimensional analysis; we decompose complexity into a 4-factor formula:

Complexity [operations per problem] = A x V x T x G

where

A = algorithm [operations per variable],
V = variables [number of variables computed per point],
T = time [time steps per problem], and
G = geometry [points per time step].

Geometrical complexity comes from the number of dimensions and the resolution (number of points) within each dimension, dimension meaning not only the space dimensions but also any variable that must be systematically varied over some range. For example, if we are tracking particles that do different things depending on their energy and angle of flight, then energy and angularity are treated as additional dimensions. If the problem is time dependent, then calculations must be performed at all of the geometrical points for each time step. For each point in space-time, a certain number of variables are computed, and for each variable, a certain number of operations is required by the algorithm.
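As a concrete illustration of the 4-factor formula, the sketch below computes the complexity of a hypothetical three-dimensional, time-dependent calculation and the execution rate needed for a chosen response time. All of the numerical values are invented for illustration and are not taken from the chapter.

```python
# Hypothetical problem, purely to illustrate Complexity = A x V x T x G.
G = 100 ** 3      # geometry: 100 points in each of 3 dimensions -> 10^6 points per time step
T = 10_000        # time: time steps per problem
V = 10            # variables computed per point
A = 100           # algorithm: operations per variable

complexity = A * V * T * G            # operations per problem -> 10^13
response_time = 10_000                # seconds (roughly a 3-hour turnaround)
required_rate = complexity / response_time   # -> 10^9 operations per second

print(f"complexity:    {complexity:.0e} operations per problem")
print(f"required rate: {required_rate:.0e} operations per second")
```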

per chip and therefore avoid transmission delays from chip to chip. CMOS has a major disadvantage as a logic technology for supercomputers: its gate delay is longer than that of ECL. However, its relatively slow speed can be partially compensated for by cooling it with liquid nitrogen to a nominal temperature of 77 K, which lowers the gate delay. The cycle times of the ETA10 range from 24 ns down to 7 ns.

GaAs. Gallium arsenide has two advantages relative to ECL: it has a lower gate delay and lower power dissipation. The lower power dissipation implies a potential for increases in packing density without causing heat dissipation problems, and therefore even further speed increases. This technology also has two disadvantages: higher cost and lower gate density per chip. The gate density of GaAs is projected to increase dramatically compared to the gate density of traditional logic technologies, as illustrated in Figure 4.10. If this projection should prove to be true, or even approximately true, the gate density disadvantage of GaAs would no longer exist. The GaAs industry is maturing rapidly, and the cost per gate is projected to fall as increasing gate densities are achieved. Gallium arsenide is being used in the design of the Cray-3 for delivery in about 1990, with a projected cycle time of 2 ns.

HEMT. The high-electron-mobility transistor (HEMT) technology is an aluminum-doped version of GaAs that is cooled to liquid nitrogen temperatures. It is more complex than GaAs and its development will take longer, probably into the mid-1990s. If it is developed successfully, it may offer cycle times even lower than those offered by GaAs.

Other Logic Technologies. Josephson junction technology is a superconducting technology that was investigated in the 1970s and early 1980s but then was dropped by most companies because of its complexity and also because its performance characteristics could be achieved with other technologies that were more tractable; however, Japanese companies continue to study this technology. Optical, molecular, and quantum-tunneling technologies are being studied for their potential for high-performance computers, but they seem to offer few prospects for the near term.

Summary. For the immediate future, it is expected that most supercomputers and other high-performance computers will use ECL and CMOS logic technology, but if the implementation of the Cray-3 in GaAs is successful, it could create a guidepost for others to follow. The leading-edge cycle time should be about 2 ns around 1990, and cycle times of 1 ns should appear in supercomputers by 1995.

FIGURE 4.10 Projected trends in gate density of high-speed logic circuits (gates per chip versus year, 1980-1994), showing GaAs gaining on traditional logic technologies. (Reprinted, by permission, from Bernard C. Cole, Gallium arsenide starts getting some respect, Electronics, June 1988, p. 41.)

Milestones in Processor Architecture

Architectural Efficiency

Efficiency in computer architecture is measured in units of results generated per clock cycle. Improvements in computer architecture that have resulted in higher efficiency are due largely to concurrency, that is, to doing more things at once. Some selected examples are illustrated in Figure 4.11.

In the mid-1950s, in computers such as the IBM 704, instructions were executed in a sequential scalar mode; that is, they specified only one operation on one pair of operands, and the processing of instructions included a series of sequential steps: fetching the instruction, decoding it, forming the effective address, fetching the operand, and then executing the operation. Beginning in about 1960, in computers like the IBM STRETCH, an instruction lookahead provided the ability to fetch and process instruction N + 1 while executing instruction N. By the mid-1960s, in computers such as the CDC 6600, multiple function units were included that allowed several executions, such as add and multiply, to be performed concurrently. By about 1970, the operations were speeded up by pipelining, that is, subdividing
the steps of the operations into substeps and overlapping these substeps for faster execution.

FIGURE 4.11 Some architectural contributions to higher performance (sustained execution rate versus year, 1950-2000). (Reprinted, by permission, from Worlton and Associates.)

The 1970s saw the development of vector processors, in which the same operation was applied to many operands or operand pairs, rather than to just one pair as in the scalar designs. The major supercomputers today use vector processing to achieve high speed. The first generation of vector processors had memory-to-memory designs; that is, they fetched the vectors from memory and returned the results to memory. However, this caused long start-up times, and the second generation of vector processors followed the lead of the Cray-1 in using vector registers that allowed faster start-up times. Most vector designs today use the register-to-register design. The leading edge of supercomputer architecture today is found in designs that incorporate multiple vector processors, or parallel-vector designs.

Balance in Vector Processor Design

One of the key issues in vector processing is the balance between the scalar speed (doing one operation per instruction) and the vector speed (doing many repetitive operations per instruction). Figure 4.12 compares the merits of two designs, one with a relative performance of 10 in the scalar mode and 100 in the vector mode, and another with a relative
performance of 2.5 in the scalar mode and 400 in the vector mode. The overall performance of the system is a function of the fractions of the work done in each of the different modes. For practical applications, the fraction of vector work is in the range 0.3 to 0.7, so that even though the second design has a higher peak performance than does the first design, the first design, which is better balanced, is clearly to be preferred.

FIGURE 4.12 Balance in vector processor design: overall performance versus vector fraction for the two designs. (Reprinted, by permission, from Worlton and Associates.)
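A minimal sketch of this comparison, assuming the standard weighted-harmonic-mean (Amdahl-style) model in which the vector fraction is the fraction of operations executed in vector mode; the exact model behind Figure 4.12 is not spelled out in the text, so this is an illustration rather than a reconstruction of the figure.

```python
def overall_rate(vector_fraction, scalar_rate, vector_rate):
    """Harmonic-mean performance model: total time is split between scalar and vector work."""
    f = vector_fraction
    return 1.0 / ((1.0 - f) / scalar_rate + f / vector_rate)

design_a = dict(scalar_rate=10.0, vector_rate=100.0)   # better balanced
design_b = dict(scalar_rate=2.5,  vector_rate=400.0)   # higher peak, weaker scalar

for f in (0.3, 0.5, 0.7, 0.9):
    a = overall_rate(f, **design_a)
    b = overall_rate(f, **design_b)
    print(f"vector fraction {f:.1f}: design A = {a:5.1f}, design B = {b:5.1f}")

# In the practical range 0.3 to 0.7 the balanced design A is several times faster;
# design B overtakes it only when the vector fraction approaches 1.
```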

Parallel Processing

In addition to decreasing cycle time and increasing design efficiency, a third approach to increasing computer speed is through the use of multiple processors, and there are many design issues for these so-called parallel computers: Should there be a few fast processors or many slow ones? Should the memory be shared, attached to each processor, or both? What kind of interconnect network should be used to provide communications among the processors, and among the processors and the memory? There are literally thousands of ways to design parallel computers, and at the moment there is wide disagreement in the industry about which are the best choices. To analyze future prospects in these options, we turn to a taxonomy of computer architectures, an intellectual road map of possibilities.

Flynn's Taxonomy

Michael Flynn's taxonomy from the 1960s has been the most commonly used guide to generic types of architectures. It matches the control concurrency, expressed in the number of instruction streams (one or many), against the execution concurrency, expressed in the number of data streams being processed (one or many), to generate the four types of architectures: (1) single instruction, single data (SISD); (2) single instruction, multiple data (SIMD); (3) multiple instruction, single data (MISD); and (4) multiple instruction, multiple data (MIMD). This has been a valuable guide for over 20 years, but it is no longer adequate because it is not comprehensive for all of the architectural design options being explored.

An Expanded Taxonomy of Architectures

In addition to the types of control concurrency included in the Flynn taxonomy (serial and parallel), a third type is being used, called clustering. In a clustered design, clusters of multiple-instruction-stream processors are connected together under global control with access to global memory. Also, the types of instructions employed in the 1960s were limited to scalar instructions, and there are now several new options that need to be included in any classification scheme. We can systematically explore the options for designing instruction types by considering a number pair (a,b), where a = the number of operations specified in the instruction and b = the number of operands (or operand pairs) specified. There are thus four options, as follows:

The (1,1) option defines the scalar type of instruction, in which one operation is specified on one pair of operands.

The (1,N) option defines the vector type of operation, in which one operation is specified, but the operation is applied to many operands or pairs of operands.

The (M,1) option defines the systolic type of operation, in which many operations are performed on each operand that is fetched from memory.

Finally, the (M,N) option defines the horizontal type of operation, in which many operations are specified on many operands for each instruction. The horizontal category is also referred to as very long instruction word (VLIW).

We can now create an expanded taxonomy of architectures that matches the 3 control categories against the 4 instruction categories to create a 12-way taxonomy, as illustrated in Figure 4.13.

FIGURE 4.13 An expanded taxonomy of computer architectures: control concurrency (serial, parallel, clustered) versus instruction type (scalar (1,1), vector (1,N), systolic (M,1), VLIW (M,N)). Notation (a,b): a, number of operations specified; b, number of operands (or operand pairs) specified. (Reprinted, by permission, from Worlton and Associates.)
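The 12 categories are simply the cross product of the two axes. As a small illustration (the names and layout below are mine, not from the chapter), the taxonomy can be enumerated programmatically:

```python
from itertools import product

# Axes of the expanded taxonomy (Figure 4.13).
CONTROL = ("serial", "parallel", "clustered")
INSTRUCTION = {               # (a, b): a = operations specified, b = operand sets specified
    "scalar":   ("1", "1"),
    "vector":   ("1", "N"),
    "systolic": ("M", "1"),
    "VLIW":     ("M", "N"),
}

# Enumerate the 12-way taxonomy.
for control, (itype, (a, b)) in product(CONTROL, INSTRUCTION.items()):
    print(f"{control:9s}-{itype:8s}  (a,b) = ({a},{b})")

# Examples cited in the text include serial-vector (Cray-1), parallel-vector (Cray Y-MP),
# clustered-vector (Cedar), parallel-scalar (NCUBE/ten, Intel iPSC), systolic (WARP),
# and VLIW (Multiflow Trace).
```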

Historically, through the CDC 7600 in about 1970, most supercomputers had serial-scalar designs; that is, they executed one stream of scalar instructions. First-generation vector processors had serial-vector designs; that is, they also executed a single stream of instructions, but the instructions specified vector operations. Machines of this generation were typified by the Cray-1 and the Cyber 205. In the 1980s, parallel-vector computers have been developed in which multiple vector processors are interconnected through a communication network. The Cray Y-MP and the CDC/ETA10 are examples of computers in this category. The further development of vector designs is expected to be generalized to clustered-vector designs, with clusters of multiple vector processors interconnected with global networks. An example of this category is the Cedar project at the University of Illinois, which uses the Alliant FX/8 as the cluster. The parallel-scalar category represents those designs that use a large number of relatively slow processors to attain high performance, such as the NCUBE/ten and the Intel iPSC. These can also be designed with clustering, as found in the Myrias SPS-2.

Systolic designs have been largely special-purpose computers that implement a particular algorithm such as a matrix multiply or a fast Fourier transform, and they can be developed in serial, parallel, or clustered form. The WARP system developed by Carnegie Mellon University is the leading edge of this type of design. The horizontal or VLIW designs attempt to execute multiple operations per clock period, and they too can be developed in serial, parallel, or clustered form. Examples of the VLIW design are found in the Cydra 5 and the Multiflow Trace computers.

This taxonomy can be extended by subdividing each of the categories to develop an even more detailed taxonomy.

Key Issues in Parallel Processing

The trend toward parallel computation is perhaps the major trend to watch in the next decade of computer architecture, and it is useful to identify some of the key issues that are being explored in the dozens of commercial and research parallel-processor projects. These issues can be thought of in three categories: system software, system hardware, and applications.

Parallel processing will affect system software in language design, compilers, operating systems, and libraries. For example, should we continue to use Fortran to specify parallel processing, or should we abandon this old language and adopt something new? Fortran is being modified by adding new constructs that allow the specification of parallel processing, and no doubt this language will continue to be used long into the future because of the huge commitment to applications that exists for it. Compilers are being influenced by parallel processing because they need to identify opportunities for parallel execution in a transparent manner, that is, so the user need not be concerned with the details of how the parallelism is achieved but only with what is to be done. Operating systems are now more complex because they need to maintain control and provide services for many streams of instructions rather than just for one. Finally, system libraries are being adapted to use parallel processors.

Applications will be affected by parallel processing by the need to insert parallel control statements, to develop parallel algorithms, and to rethink the mathematical models on which the algorithms are based.

One of the key issues in the design of high-performance parallel computers is the trade-off between speed and parallelism. That is, in principle we could select the number of processors to be from as few as one or two to as many as thousands (or indeed millions), and we could select the speed of the individual processors to be as slow as a bit-serial microcomputer or as fast as a supercomputer, as illustrated in Figure 4.14. However, a design with only a few slow processors would be too slow to be of any use as a high-performance computer, and a design with thousands of supercomputers in a single system would not be economically feasible, so the practical area of trade-off lies in between.

The evolution of parallel processing is following three guideposts. First, the zone of "large-grain" parallelism is defined by the use of a relatively small number of fast processors, 1 < P < 100. This form of parallel-processor design was pioneered by Cray Research, Inc. in 1982 with the introduction of the Cray X-MP product line, in which the individual
processors are vector supercomputers in their own right and the number of processors varies from the initial offering of 2 to the current offering of 8 in the Cray Y-MP/8. Other vendors have followed this design as a guidepost, including parallel-vector processors offered by Control Data Corporation, IBM Corporation, Convex, Alliant, and the Japanese Super-Speed Computer Project, and there are widely rumored introductions of similar designs from the Japanese supercomputer vendors. The number of processors in this domain is being expanded from 8 processors to 16 in the near term, and to 32 and 64 in the generations in development for delivery in the early 1990s.

Second, the zone of "medium-grain" parallelism is defined by the use of a larger number of slower processors, with the number of processors being in the range of 100 < P < 1000. This form of parallelism is typified by the BBN Butterfly, which has 256 processors, and by the Myrias SPS-2, with 512 processors.

Third, the zone of "small-grain" parallelism is defined by the use of an even larger number of processors, P > 1000, each of which is even slower and is typically a bit-serial microprocessor. Here the number of processors varies from 4096 in the AMT-610 DAP, to 16,384 in the Goodyear MPP, to 65,536 in the Connection Machine.

FIGURE 4.14 The trade-off between speed and parallelism: processor speed versus number of processors (P), with zones of large-grain (P < 100), medium-grain (100 < P < 1000), and small-grain (P > 1000) parallelism. (Reprinted, by permission, from Worlton and Associates.)

A Taxonomy of Limits to Massive Parallelism

The term massive parallelism is loosely applied to any of a variety of computers in which the desired performance is achieved through a very
large number of relatively slow processors, with the threshold beyond 1000 processors often used to mean massive.

The ability of a parallel computer to perform efficiently as the number of processors is increased is referred to as scalability. A study of architectural scalability published at the Center for Supercomputer Research and Development at the University of Illinois presents a taxonomy of the limits to scalability, as summarized in Figure 4.15.

FIGURE 4.15 Taxonomy of limits to massive parallelism: algorithmic limits (decomposition, serial bottleneck) and implementation limits (communication, synchronization, interference). (Reprinted, by permission, from Alex Yuen-Wai Kwok, Center for Supercomputer Research and Development Report No. 679, August 1987.)

Limits of the first kind are due to algorithmic constraints. For a system with P processors to perform at full effectiveness would require that the number of available tasks equal or exceed the number of processors throughout the calculation. However, not all algorithms can be so decomposed, so that there is a decomposition limit to scalability. When the number of tasks available is just one, this is referred to as the serial bottleneck, which has very serious implications for the effectiveness of massive parallelism.

Limits of the second kind are due to implementation details, including latencies caused by communication, synchronization, and interference. As tasks are decomposed to finer levels of detail, the communication requirements increase in proportion to the level of decomposition, and the amount of actual communication latency is dependent on the system design. A second implementation limit is caused by the need to integrate the results of the decomposed tasks through various synchronization techniques. At
the synchronization points, the number of available tasks decreases, so that the number of idle processors increases, thus limiting performance. A third implementation limit is caused by interference as the processors attempt to access shared resources; these resources may include shared memory, shared communication networks, or shared input-output systems.

Following Hockney's n_1/2 method, it can be shown mathematically that the fraction of parallelism required to achieve 50 percent efficiency for a parallel-processor system is given by

π_1/2 = (P - 2)/(P - 1),

where P = the number of processors, and hence the fraction of serial work cannot exceed

1 - π_1/2 = 1/(P - 1).

Thus as P grows, the domain of applicability becomes inherently more limited, because π_1/2 approaches 1.0 and 1 - π_1/2 approaches 0 very rapidly. This does not imply that it is impossible to use parallel processors effectively, but it does provide a guide to the domain of application, with a smaller number of very fast processors being more useful for general purposes than a larger number of slow processors.
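This is the 50-percent-efficiency condition of the usual Amdahl speedup model, speedup = 1/(s + (1 - s)/P) for serial fraction s on P processors, rearranged for the parallel fraction. The short sketch below verifies the bound numerically; the function names are mine.

```python
def max_serial_fraction_for_half_efficiency(P):
    """Largest serial fraction that still allows 50 percent efficiency on P processors."""
    return 1.0 / (P - 1)

def efficiency(serial_fraction, P):
    """Amdahl's-law efficiency: speedup / P."""
    speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / P)
    return speedup / P

for P in (8, 64, 1000, 65_536):
    s_max = max_serial_fraction_for_half_efficiency(P)
    print(f"P = {P:6d}: serial work must stay below {s_max:.6f} "
          f"(efficiency at that bound: {efficiency(s_max, P):.3f})")
```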

CONCLUSIONS

It has been over 300 years since Galileo unlocked the door to the world of empirical science, using a telescope as a key, so to speak. In the year Galileo died (1642), Isaac Newton was born, and he invented a key we call the calculus that he used to unlock the door to the world of theoretical science. And it has been a scant 4 decades since von Neumann recognized that powerful electronic computers were the key to still another world of science, computational science.

In the opening lines of Man of La Mancha, the author (Dale Wasserman) invites the audience, "Come, enter into the world of my imagination." The supercomputer, too, bids us enter into the world of imagination. In the real world we are constrained by limits of time, space, energy, matter, and costs, but in the world of the supercomputer these constraints are merely bits of information that we can readily manipulate. In the world of the supercomputer we can experiment with new designs for such things as aircraft, automobiles, accelerators, weapons, and chemicals; we can experiment with these designs until they either perform as we desire or we discover why they fail. By experimenting in the world of the supercomputer, we can avoid many of the costs, delays, and failures that would have occurred if we had experimented with these designs in the real world. In this new world we can create previously unthinkable systems such as black holes, stand in space to watch them perform, and create videotapes of this adventure to take back to the real world.

The university, the industry, or the nation that would be a leader in the modern world of intense international competition must master the information technologies, the leading edge of which are the supercomputing technologies. It is imperative for our future success as a nation that we accept the invitation offered by the supercomputer: "Come, enter into the world of . . . imagination."