Read "Supercomputers: Directions in Technology and Applications" at NAP.edu

« Previous: Part B: The Changing Landscape of Supercomputer Technology

Page 21 Cite

Suggested Citation:"4. Exisiting Conditions." National Research Council and National Academy of Sciences. 1989. Supercomputers: Directions in Technology and Applications. Washington, DC: The National Academies Press. doi: 10.17226/1405.

Page 22 Cite

Page 23 Cite

Page 24 Cite

Page 25 Cite

Page 26 Cite

Page 27 Cite

Page 28 Cite

Page 29 Cite

Page 30 Cite

Page 31 Cite

Page 32 Cite

Page 33 Cite

Page 34 Cite

Page 35 Cite

Page 36 Cite

Page 37 Cite

Page 38 Cite

Page 39 Cite

Page 40 Cite

Page 41 Cite

Page 42 Cite

Page 43 Cite

Page 44 Cite

Page 45 Cite

Page 46 Cite

Page 47 Cite

Page 48 Cite

Page 49 Cite

Page 50 Cite

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

4 Existing Conditions Jack Worlton Los Alamos National Laboratory and Worlton and Associates CHARACTERISTICS OF HIGH-PERFORMANCE COMPUTERS Types of Computers The wide variety of computers used in computational science and en- gineering can be illustrated by plotting the performance and cost of all of the computers available from the computing industry. The resulting band of products, as illustrated in Figure 4.1, ranges over microcomputers, minicomputers, superminicomputers, minisupercomputers, high-end main- frames, and supercomputers. There is a band rather than just a line because there is a range of performance that is available for a given cost and a range of costs for a given performance. In general, the lower edge of the band is represented by the older products, and the upper edge of the band is represented by the newer products. That is, newer products have a higher performance for a given cost and a lower cost for a given performance than do older products. One of the tongue-in-cheek definitions of supercomputers is that a supercomputer is any computer that costs $10 million. In fact, current supercomputer prices range from about $1 million to $20 million, but in terms of constant dollars, the $10 million average is a useful rule of thumb over about 3 decades. For example, the IBM 704 in the mid-1950s cost $2 million to $3 million and operated at about 10,000 operations per second. However, if these 1950s dollars are converted to 1980s dollars, 21

22 ( ~108 By o IL cn 1 07 lo 1o6 lo O 105 104 JACK WORLTON 109 an. ~ .;;; .; . ; ;;;; A,, :, Otto - r, :: 0.1 1 1 00 1 000 1 0000 COST (x$1000) FIGURE 4.1 Performance and cost of available types of computers. ( Reprinted, by permission, from Worlton and Associates.) the cost of the IBM 704 is approximately $10 million. One way to view this development is that over the past 30 years, for a constant cost the performance of supercomputers has increased by a factor of 10,000; or, put another way, for a constant performance the cost has decreased by a factor of 10,000. This is equivalent to an annually compounded rate of improvement in the ratio of performance to cost of about 36 percent per year, a factor of 2 every 2.25 years, or a factor of 10 every 7.5 years. Taxonomy of High-Performance Computers If we examine in greater detail the high-performance computers, we distinguish at the first level three categories of computers: general purpose, special purpose, and research (Table 4.1~. General Purpose The general-purpose high-performance computers include the super- computers, high-end mainframes, and minisupercomputers. The super- computers (such as the Cray Y-MP and ETAi°) are distinguished by their relatively higher execution rates (sustained rates of about 108 to 109 oper- ations per second), their larger memory capacities (up to several hundred

EXISTING CONDI7IONS TABLE 4.1 Taxonomy of High-Performance Computers Category Instances General purpose Special purpose Research Supercomputers High-end mainframes Minisupercomputers Single-purpose Bit processors Massively parallel Systolic National Industry University 23 million 64-bit words), their high-speed input-output systems, and their high prices. The prices of the mainframes (such as the IBM 3090, Amdahl 5890, and Control Data Corporation 990) are about the same as those of the supercomputers, ranging from about $1 million to $20 million, but their performance is typically about a factor of 5 below that of supercomputers; also, their memory capacities are typically lower and their input-output systems are slower. The minisupercomputers (such as the Convex C-Series, the Alliant FX/8, and the SCS-40) have performances that are typically 5 to 10 times lower than those of supercomputers and prices in the range of $100,000 to $1 million, although some of the larger systems have prices in the $1 million to $2 million range. Special Purpose There is no clear line of distinction between general- and special- purpose computers; rather, the distinction is relative. That is, the general- purpose computers have relatively more applications that they can execute with high efficiency than do the special-purpose computers. It has been known for decades that a trade-off is available between execution rate and generality; that is, for a given component technology, special-purpose computers can be made that are faster than general-purpose computers. However, the cost of a special-purpose computer is in addition to, not instead of, the cost of a general-purpose computer, because users cannot afford to acquire a special-purpose computer for every application and therefore must always have general-purpose computers that are specialized for particular applications through application software. Thus the role of

24 JACK WORLTON special-purpose computers is to augment, not to replace, general-purpose computers. The most limited purpose of design is found in single-purpose com- puters such as logic emulators. While limited in the range of problems they can solve, these are highly effective when the workload is highly spe- cialized. Bit processors are designed for applications in which the data can be expressed by a single bit, such as image processing. Massively paral- lel computers attempt to achieve high performance by using a very large number of slow processors. Finally, the systolic computers implement an algorithm by pumping data through a series of identical functional units; for example, a systolic computer that implemented a matrix multiply would have an array of identical multiply-add processors that communicate their partial results to one another. It should be observed that the generality of many of the special-purpose processors is being broadened by both hardware and software techniques in the newer designs. Research Research computers are designed to investigate how to make better computers rather than for applications in science and engineering. Some of these are supported at the national level, the best known being Japan's Super-Speed Computer Project, which combines the efforts of Japan's six largest computer companies to design a supercomputer that will operate at 10~° floating-point operations per second and will be ready for demon- stration by March 1990. This project is funded at the $100 million level by the Ministry of International Made and Industry in Japan. There are similar projects in Europe, such as the Alvey project in Great Britain, the Marisis project in France, and the Suprenum project in West Germany. In the American scheme of things, these types of projects are conducted in universities; examples include the Cedar project at the University of Illi- nois, the Ultra project at New York University, and the Hypercube project at the California Institute of Technology. Finally, industrial firms are also building research computers, including the RP3, the GFll, and the TF-1 projects at IBM. THE SCIENTIFIC COMPUTING ENVIRONMENT A MATRIX It is useful to think of the scientific computing environment in terms of a matrix in which we match the different types of computers against the generic information technologies, of which there are four: (1) processing, in which we transform an input into an output according to an algorithm; (2) storage, in which we transfer information over time; (3) communications, in which we transfer information over space; and (4) the human interface, in

EXISTING CONDITIONS TECHNOLOGIES ~:,,,,,, , ~ ... ' 2 PROCESSING2.' ' ' ' '.' 2 ' ............................... ; , .. 0 A...... ......... ...... ~ ..' COMMUNICATIONS.2''.2'. by, : .................................. ..,,,, ~ 25 SYSTEMS .......... ' LARGE C LE ........ ............. ~...... ' ' ' '' ' ' ' ' ""'1 ~........................ Supers Minisupers Workstations Mainframes Superminis PCs Common file Local disk Floppies and system systems hard disks Site networks, Site networks, Site networks, LANs, WANs LANs, WANs LANs, WANs Transparency Transparency Transparency and and and Visualization Visualization Visualization FIGURE 4.2 The scientific computing environment. Note: LAN, local-area network; WAN, wide-area network. (Repented, by permission, from Worlton and Associates.) which we present information to and accept information from the user. We now classify the types of computers by the level of sharing among the users: (1) the personal resources that are not shared (personal computers with prices less than $10,000 and workstations with prices between $10,000 and $100,000~; (2) mid-range computers that are typically shared by a few tens of users (minisupers and superminis with prices in the range of $100,000 to $1 million); and (3) the large-scale systems that are typically shared by a few hundred users (supercomputers and high-end mainframes with prices in excess of $1 million). By matching the generic information technologies against these categories of computers, we thereby create a 12-way taxonomy that defines the scientific computing environment, as illustrated in Figure 4.2. Some major trends and characteristics of the technologies are described below. Trends in the Generic Information Technologies Processing For all three types of computers, the processing power for a given cost is increasing, and the cost of a given level of processing power is decreasing. There are no fundamental, technological, or economical limits that will prevent these trends from continuing over at least the next decade. These trends may slow down somewhat, but they will continue.

26 Storage JACK WORLTON There is no single storage technology that meets all requirements for fast access time and low cost, so a hierarchy of technologies is used, including semiconductors for the main storage, magnetic disks for secondary storage, and magnetic tapes for archival storage. Progress has been most rapid in the main memory, where access times are now just a few tens of nanoseconds (a nanosecond is 10-9 s). The access time to magnetic disks is typically a few tens of milliseconds (a millisecond is 10-3 S. the time being constrained by the physical rotation of the disks. Thus there is a 6-orders-of-magnitude gap between the access times of main memory and those of disks. This causes unacceptable delays when storage requirements exceed the capacity of main memory, so that a new level in the storage hierarchy has been created, often called the solid-state disk (SSD), with access times of tens of microseconds (a microsecond is 10-6 s). The third level in the storage system, the archival level of storage, has been and remains a problem, because magnetic tape is not an ideal medium for very large archives: it has low volumetric efficiency, its shelf life is limited, and it can be erased. Optical storage technology is being developed in both disk and tape formats, but so far there are no recording standards for optical media, and most users are unwilling to record their archives on a nonstandard medium; thus magnetic tape will continue to be used for archival storage by most scientific computing centers until recording standards are developed for optical disks and tapes. An exception will be found in specialized applications where the advantages of optical media exceed their disadvantages. Communications Communication both within and among computers and their users continues to grow in importance. Three major trends are evident: (1) the rate of information transmission is increasing, (2) the cost per bit of information transmitted is decreasing, and (3) the connectivi~the number of users who have access to data communications ports is increasing. The point about connectivity is a matter of urgent management con- cern. In 1981 Los Alamos National Laboratory conducted a strategic plan- ning exercise that led to the conclusion that data communications should be provided to all employees as a utility comparable to heat, light, power, and telephones. This policy was called the Data Communication Utility. At that time the laboratory had only about 1000 communications ports for a staff of about 7000, and new ports were being installed at the rate of only 200 per year, so that providing the full staff with data communications would have required about 30 years. However, the pace of port installation has been increased to 800 to 1000 per year and there are now between

EXISTING CONDITIONS 27 5000 and 6000 ports for the staff, so that the Data Communication Utility will soon be a reality at Los Alamos. The Data Communication Utility is a policy that should be considered by all organizations. The networks to which data communications ports provide access in- clude local-area networks (LANs) that span a building, site networks that span a whole site, and wide-area networks (WANs) that span continents. These networks permit transfer of information among computing resources and among an ever-growing set of scientists and engineers who communi- cate with each other electronically. Interface The function of the human interface in information technology is twofold: (1) to provide transparent access to the information technologies, and (2) to provide v~sual~zai'on tools that lead to the insights that are the very purpose of computational science and engineering. It is a fundamental principle of technology development that all tech- nologies strive toward the condition of transparency, that is, the ability to achieve the function of the technology without specifying how the function is achieved. For example, the driver of an automobile is concerned primar- ily with just four devices, the steering wheel, the gear shift, the accelerator, and the brake pedal, but not with the thousands of parts these interface devices control. Similarly, the computer user should be concerned only with the nature of the problem being solved and not with specifying how each step in the solution is to be accomplished. Many computer systems are deficient not because they lack the computational power the users require, but because they are not transparent they require the user to specify in great detail how the computation should proceed. Future systems will in- creasingly allow users to specify what is to be done, not how. For example, parallel-processing computers are being developed today, but these will not be mature until parallel execution is transparent to the user. Visualization tools are now recognized as being as important as pro- cessing, storage, and communications in computational science and en- gineering. Richard Hamming's dictum that the purpose of computing is insight, not numbers, is the rationale for visualization. The visualiza- tion technologies include graphics terminals, microfilm, video tape, system software, and, more recently, the interdisciplinary collaboration between computer scientists and artists to present information in powerful and easily perceived forms. Modes of Operation Most science and engineering computer centers offer their users this whole environment, and the users then choose from among the resources

28 RATIONALE FO R HIGH-PERFORMANCE COMPUTING JACK WORLTON COMPLEXITY THE USER ( R e q u i r e m e n t s ) | TIMELINESS TH E MANAG ER (Benefits) FEASIBILI TY SA VINGS (Time costs) FAILURE ANALYSIS QUALITY ASSURANCE PRODUCTI VI TY FIGURE 4.3 The rationale for high-performance computing. (Reported, by permission, from Worlton and Associates.) those that best meet their needs. For example, some users prefer to use a workstation for development of programs, visualization of results, and execution of small-scale problems; when problems grow beyond the ability of the workstation to provide an acceptable response time, the problems are then submitted to a supercomputer. Other users find the local control and ease of use of mid-range computers desirable as an intermediate step between workstations and supercomputers. An important characteristic of the computing environment should be that users have a uniform interface across all three types of computers, so that they can move applications among the types of computers without significant conversion effort. This is the main reason for the growing popularity of the Unix operating system: Unix is available on all three generic types of computing systems and hence can provide a relatively seamless interface among them. RATIONALE FOR USE OF SUPERCOMPUTERS Why do scientists and engineers use powerful but expensive supercom- puters instead of less powerful but less expensive mid-range and personal computers? And why do they use supercomputers in addition to mid-range and personal computers? This discussion has to do with the rationale for supercomputing, which can be viewed from two perspectives: that of the user and that of the manager, as illustrated in Figure 4.3.

EXISTING CONDITIONS 29 Requirements for Supercomputing: The User Perspective From the point of view of the user, supercomputers are required to solve highly complex problems in a timely manner. If the computing en- vironment does not have complex problems to solve, then supercomputers are not required; also, if there are complex problems to be solved but there is no time urgency for their solution, then again, supercomputers are not required. However, if both complexity and timeliness are important to an organization, then it is a management error not to provide the-most powerful computing resources available, and these are by definition the supercomputers. We can analyze computing requirements of all kinds by using a simple but powerful dimensional analysis in which we decompose execution rate into two explanatory variables, complexity and response time: Execution rate [operations per problem] Complexity [operations per problem] Response time [seconds per problem] Complexity has units of operations per problem, where operations refers to the ordinary things computers do add, subtract, multiply, and so on. Response time refers to the time required to solve a problem, in units of seconds per problem. Because this model is dimensionally correct, we can use it as an algebraic equation in which any two of the variables uniquely determine the third. For example, we can specie the complexity of the problems we want to solve and the response time required, and this determines the execution rate of the computer necessary to meet these requirements. Alternatively, for a given computer we can specify either problem complexity or response time, but not both. Given this model we can see that there will be requirements for high execution rates if the complexity of the problem is large, or if the response time required is small, or both. It is the need to solve problems with ever-larger complexities and ever-shorter response times that is driving the unrelenting demands for higher execution rates in supercomputers. Historians tell us it is inherent in the nature of science, technology, and engineering that they grow in complexity. For example, a tour through the Smithsonian's Air and Space Museum provides a perspective not only on the history of aerospace but also on complexity. The Wright brothers' flyer was a relatively simple device compared to the DC-3, the DC-3 was not as complex as jet aircraft, and jet aircraft are not as complex as aerospace vehicles. We can quantify this growing complexity: in the decade from 1910 to 1920 it required only on the order of 10,000 engineering hours to

30 JACK WORLTON design large aircraft, but in the decade of the 1970s this metric had grown to about 10,000,000 engineering hours. Similarly, shorter response times are becoming more critical because of intense international competition. In the September 1988 issue of IEEE Spectrum, an executive of a semiconductor firm is quoted as saying, "The limit for the time it takes to design an integrated circuit is a year. Any longer and it will already be out of date when it is introduced" (p. 33~. Success in the university, in industry, and in government is often determined not only by the results produced but also by the time scale on which the results are produced. We can quantify this relationship with a Homograph, that is, a diagram that shows a correct relationship among the variables for any straight line drawn across it, as illustrated in Figure 4.4. The scale of response times decreases as we go higher in the diagram, the scale of complexity increases as we go higher, and the sustained execution rate links the two other scales. Suppose we want to solve a problem that has a complexity of 109 operations, and we desire a 15-min response time, which is about 1000 s. The required execution rate is then 109/103 = 106 operations per second. However, if we required the solution on an interactive time scale, for example 10 s, then the required execution rate would be 108 operations per second 100 times faster. Alternatively, if the 15-min response time were satisfactory but the complexity of the problem were 10~2 operations, then the required execution rate would be 109 operations per second. This last example is taken from the requirements at National Aeronautics and Space Administration Ames for the Numerical Aerodynamic Simulator. Figure 4.4 shows that it is the combination of high complexity and short response times that forces the use of high-performance computers. The largest problems that are being solved on current supercomputers have a complexity between 10~3 and 10~4 total operations. However, critical problems, such as long-range climate studies, that have complexities that are orders of magnitude more complex await the development of more powerful supercomputers. The Dimensions of Arithmetic Complexity We can analyze complexity in more detail by continuing our dimen- sional analysis; we decompose complexity into a 4-factor formula: Complexity (operations per problem) = A' V' T'G where

EXISTING CONDIT ONS RESPONSE TIME (SEC/PROBLEM) 31 EXECUTION RATE (OP/SEC) 1o1 1ol2 _ = COUP LEXITY (OP/PROBLEM) INK XX 108 ~ 1 o5 S U P E R CO M P U TE R S HI N F RAM ES ~ \ _ SUPE~1INIS 104 1o6 ;~ \ 4 _ M I C R Of\ 10 4. 1 o6 1 0 1 0 1014 -1 o1 3 1 o1 2 1 o11 1 0 FIGURE 4.4 Nomograph of computational science and engineenng. (Repented, by permission, from Worlton and Associates.) G = geometry [points per time step] T = time [time steps per problem] V = variables inumber of variables computed per point] A = algorithm [operations per variable]. Geometrical complexity comes from the number of dimensions and the resolution (number of points) within each dimension, dimension meaning not only the space dimensions but also any variable that must be system- atically varied over some range. For example, if we are tracking particles that do different things depending on their energy and angle of flight, then energr and angularity are treated as additional dimensions. If the problem is time dependent, then calculations must be performed at all of the geo- metrical points for each time step. For each point in space-time, a certain number of variables are computed, and for each variable, a certain number

32 JACK WORLTON of operations must be performed in its computation. Thus total complexity is the product of all of these 4 factors. In practice, one of the difficult decisions in computational science and engineering is the trade-off among these variables. There is a practical limit on total complexity, so that as we increase any one of these factors we must decrease another. For example, as we increase geometrical complexity, we may have to use simpler physics; or if we put in better physics, we may have to decrease the number of dimensions or the resolution. An Example of Complexity The following example illustrates the structure of a typical supercom- puter calculation: Spatial resolution G Time resolution T Variables V Algorithm A = 104 mesh points (100 x 100) = 400{) time steps = 100 variables per point 30 operations per variable per time step per point The total complexity is the product of these factors, equal to 1.2 x 10~i total operations. The problem was executed on a Cray-1 supercomputer at Los Alamos National Laboratory in 1983 at an execution rate of about 20 million operations per second, so the response time was 1.2 x 10~/2.0 x 107 = 6000 s = 1.67 h. = The Scale of Response Times There is a broad range of requirements for response times, from only a few seconds up to about 100 h. Problems exceeding 100 h are usually deemed to be intractable, both because they delay progress on the part of the user for so long and because they consume so much of the computing resource that they deny access to other users. A problem that requires 100 h of total execution time cannot be executed nonstop but must be executed at a rate of a few hours per night, so that the real-time requirement for completion is weeks to months. The scale of response times In h '`. He of as follows: r ~^ ~- - ~ A · Interactive. Some response times must be only a few seconds, and ideally the delay should be imperceptible to the user for such simple tasks as entering a line of instruction during code development. · Preproduction and postproduction. Before a long calculation is executed, it is often submitted for a short run to determine if the input is correct, to avoid wasting time. Also, after a longer production run is made, it is often necessary to analyze the results in calculations that are typically

EXISTING CONDITIONS 1o8 107 a, 1 on - 105 LL - Lu 104 in in o ILL An Or 103 1o2 jo1 10° 33 1 Year 1 Month c_ 50 Hours - 5 Hours 15 Min. - 87 Hours ~c 7 Hours 30 Minutes 3 Minutes ~ 10 Seconds 100 RELATIVE EXECUTION RATE FIGURE 4.5 Scale of response times. Note: Solid square, interactive; open tnangle, preproduction and postproduction; solid triangle, production; open circle, benchmark; and closed Circle, intractable. (Reprinted, lay permission, from Worlton and Associates.) minutes to tens of minutes long and that are run during the daytime. Small-scale production calculations are also of this size. · Production. Problems that require hours of run time are run at night during the so-called production period. Thus any problem of this size can yield only one result per day to the user. · Benchmark At rare intervals, problems of higher complexity than the production runs are executed to check on the sensitivity of production runs to higher resolution or better science. The run times of benchmark calculations are typically in the range of 10 to 100 h. · Intractable. Calculations requiring over 100 h are so costly in both the expenditure of resources and in delays in real time that they are rarely, if ever, performed. 1b understand the importance of more powerful computers, consider the effect of executing the above scale of problems on a computer that is 100 times faster than the original computer, as illustrated in Figure 4.5. For ease of reference, we refer to the original computer as 1X and to the faster computer as lOOX. Problems that would require a full year of calculation on a 1X, and hence would be unthinkable, could be executed

34 JACK WORLTON as benchmark calculations on the lOOX in 87 h. Intractable problems that would require nonstop runs of 1 month (720 h) on the 1X could be done as routine production jobs on the lOOX in about 7 h. The SO-h benchmark calculations on the 1X could be completed in just 30 min on the lOOX. Five-hour production runs on the 1X could be done in only 3 min on the lOOX, and 15-min calculations on the 1X could be completed in just 10 s on the lOOX. This comparison is relevant to the relative performance of mid-range computers and supercomputers and to the relative performance of current and future supercomputers. Benefits of Supercomputing: The Managerial Perspective The above discussion of requirements is of interest primarily to users who have to solve complex problems in a timely manner, and this is, of course, an important part of why supercomputers are used. However, the issues of problem complexity and solution time are only of periph- eral interest to managers. Managers are interested in the benefits to the organization of making the substantial investment in these computational resources. Users sometimes make the mistake of trying to explain to their managers why they need supercomputers in terms of problem complexity and response time, when what the manager wants to know is, What can we do with supercomputers that we couldn't do without them? What savings in time and money can we achieve? How will their use affect the productivity of the organization? Five benefits are briefly outlined as follows: · Feasibility. The ability to overcome the limitations of experimental and theoretical methods through computational science and engineering makes it possible to solve problems that are intractable by conventional methods alone. · Savings. By using computational science and engineering to guide experimentation, costly and time-consuming experiments can be focused on the most productive areas, thereby economizing on manpower, time, and budgets. · Failure analysis. By identifying failure modes early in a project in the world of information rather than later in the production or operational phase, the consequences of failures can often be avoided. The space shuttle Challenger is a tragic example, but failures of all kinds need to be avoided to minimize delays, wasted resources, and embarrassment to an organization. · Quality assurance. If the maximum set of options possible within the constraints of the time of a project are explored, optimum quality results can be produced. In the modern world of intense international competition, high quality is essential for survival.

EXISTING CONDITIONS . 35 Producavi~. By increasing output and reducing time and cost, the supercomputer increases the productivity of an organization. TRENDS IN SUPERCOMPUTER TECHNOLOGY This analysis of trends in supercomputer technology focuses primarily on system hardware, and within that topic, on trends in the technologies of processing and storage. The equally important topics of system and applications software, and the technologies of communications and the human interface within system hardware, are beyond the scope of this analysis. fiend in Execution Rate at Los Alamos National Laboratory High-PeJorman ce Computing Los Alamos National Laboratory has been, and continues to be, one of the world's leading organizations in the application of high-performance computers to science and engineering, and it is instructive to analyze the history of computing at Los Alamos for insights into how high-performance computer technology has evolved. This history is illustrated in Figure 4.6. When the laboratory was first established in 1943, there were no electronic computers, so punched-card accounting machines were used in early R&D efforts; these operated at about one operation per second. As electronic computers became available, the most powerful of these were installed at Los Alamos; Figure 4.6 illustrates the approximate sustained execution rate of the fastest of these computers (in units of operations per second normalized to the CDC 7600 for administrative purposes). The fastest computer currently installed at Los Alamos is the Cray Y-MP, which has a sustained execution rate on the order of 109 operations per second. Thus there has been an increase of some 8 to 9 orders of magnitude in the execution rate of computers at Los Alamos in the past 45 years. Whether future improvements in the execution rate of supercomputers will match that of the past decades is a matter of deep concern to computational scientists and engineers. Serial versus Parallel Processors All of the computers installed at Ins Alamos through the Cray-l, the first of which was installed at Los Alamos in 1976, were serial processors that is, they were designed to execute only a single stream of instructions in sequential mode. It is clear from the shape of the trend line in Figure 4.6 that there was a gradual slowing of the development of these serial

36 10 12 10 1 1 10 10. 10 9 10 8 10 7 10 6 10 5 By o `_ 104 1 0 3 10' 10 1 10 o JACK WORLTON __ -SERIAL we_. A 1940 1950 1960 om Tar r, C)JEC l-El _ ~PARAt-t E: _ ~ _ _~_AAAA^~ r ~ 1970 1980 1990 2000 YEAR FIGURE 4.6 History and projection of execution rate at Los Alamos National Laboratory. Note: solid circle, senal; solid tnangle, parallel; open tnangle, projected. (Repented, by permission, from Worlton and Associates.) designs, but in 1982 a new trend line began with the installation of the first of the parallel-processor supercomputers, the Cray X-MP/2 with two vector processors, and later models in that line with four and eight vector processors. Future prospects for faster supercomputers will be based not only on improvements in component technology and the architecture of single processors, but also on the increasing number of processors used in supercomputers. Whereas the currently available supercomputers use up to 8 processors, supercomputers developed during the period 1990 to 1995 will use from 16 to 64 processors. The dramatic increase in execution rate for the projected points in Figure 4.6 is expected to come from both the increase in the number of processors and the development of higher-performance logic circuits (see below, "~end in High-Speed Logic Technologies"~.

EXISTING CONDITIONS 37 Dimensional Analysis of Execution Rate We can systematically explore the prospects for designing faster super- computers by using a dimensional analysis of execution rate. We decompose the units of execution rate (operations per second) into three explanatory variables: (1) the cycle time, in units of seconds per cycle; (2) design efficiency, in units of results per cycle per processing element (PE); and (3) the degree of parallelism, in units of the number of PEs. Execution rate [results per second] _ Design efficiency [results per cycle per PE] Cycle time [time per cycle] x Parallelism [number of PEs] There are three methods of increasing overall execution rate: (1) decrease the cycle time, (2) increase the single-processor efficiency, or (3) increase the number of processors. We will explore the trend for each of these variables. Trend in Component Densities Both speed and cost improvements in both logic and memory compo- nents depend on increases in the density of components per chip, and the general trend lines are shown in Figure 4.7. During the period 1959 to 1972, the density of components per chip increased by about a factor of 2 every 2 years, a trend first identified by Gordon Moore of Intel and known as Moore's Law. It is useful to track these trends in terms of their quadrupling time, rather than their doubling time, because memory chip generations increase by a factor of 4. Thus, in this early period, the density of components per chip increased by a factor of 4 every 2 years. Since 1972 the density of components per chip has been quadrupling about every 3 years and is expected to slow down to a quadrupling every 4 years sometime in the early l990s. This pattern of change is a 'Cpiecewise exponential," that is, a series of exponentials in which the successive exponents become gradually smaller, ultimately approaching a limit. This is a pattern commonly found in the evolution of electronics and high-performance computing, including semiconductor memory, magnetic disks, and the execution rates of supercomputers. Both dynamic and static RAM memories follow this pattern, with a quadrupling period of 3 years, whereas the quadrupling period of magnetic disk density is much longer, about 8 years. fiend in Cycle Times A scatter diagram of the cycle times of leading-edge supercomputers since the mid-1960s is shown in Figure 4.8. There has been a decrease in

38 I ~S _ 10 4 ~L, 1 0 of 2 0 10 Cal 1ol 10 3 10 loo JACK WORLTON 10 _ 8 10 - 7 10 _ /Dt = Do. 4 / at = Do. 4 ,3' - /Dt= Do. 4 ~' ~ - ~ ~ ~ 1 9SS 1 965 1 975 1 985 1 99S 200S Yea r FIGURE 4.7 llend in component density. (Adapted from James D. Meindl, 1987, Chips for advanced computing, Sci. Am. 256:78-88.) supercomputer cycle time from 100 ns in the CDC 6600 in 1964, to 27.5 ns in the CDC 7600 in 1969, to 12.5 ns in the Cray-1 in 1976, to 4 ns in the Cray-2 in 1985, and there is a projected decrease to the 2- to 3-ns range in several computers scheduled for delivery in 1990. The recent trend in the leading edge of cycle-time development shows a decrease of about a factor of 2 in 4 to S years. This is a fairly slow rate of improvement compared to the improvements in earlier decades. For example, in the period l9SS to 1970, cycle times improved from about 12 ,us to 27.5 ns, a factor of 436; however, in the next 15 years, cycle times improved from 27.5 ns to 4 ns, a factor of only about 7. This is a major reason for the slowdown in the growth of serial-processor execution rates. Those computers having cycle times that fall above the leading edge of this trend have attempted to use architectural features to compensate for their slower cycle times. fiend in High-Speed Logic Technologies The prospects for faster logic technologies are illustrated in Figure 4.9, which shows the gate delay (in nanoseconds) and the power per gate (in milliwatts) for several high-speed logic technologies. Technologies to be preferred are those with both low gate delays and low power dissipation per gate. Not shown in Figure 4.9 are other important characteristics of logic technologies such as gate density and cost. The traditional logic technology used for high-performance computers has been bipolar emitter-coupled logic (ECL), which is the fastest of the silicon technologies and is used in most modern supercomputers. Two other logic technologies being used

EXISTING CONDITIONS 100 oh 10 ._ - 39 C DC 6600 ~Cyber 205 ~ ETA-1 OP C DC 7600 m ~ ET A - 10 a ~ ~ S810 CRAY-1 X-MP ~ SX-2 C RAY -2 ETA-1 OE ETA-I OG Y-MP 820 N EC SX-3 C R A Y-3 .1 - 1960 1965 1970 1975 Year 1980 1985 1990 1995 FIGURE 4.8 1tend in cycle times of supercomputers. (Reprinted, by permission, from Worlton and Associates.) -8 10 10 1 o~1 ° 1 0 0-12 - W~ ~ of. A ~ `_ ~ A ~ At A ~ ~ ~ At ~ J J ~] A ~ ~ A A, A ~ A A' i_ //~'~'~'~'~'~C5~.'_~ \ _ ~ 'I' ~ at' ' ~ ~-~ ~ ,~;W _ ~ . ~ ~ ~ ~ ~ ~ ~ ~ .~ . ~ ~ Ga~,~,~,~,~,~,~,~ -~f//~]:: ~ .% .~ .~ .~ ~ - ~ ~ ~ ~ ~ 1~ 001 .01 .1 1 10 100 POWER DISSIPATION (mW) FIGURE 4.9 High-speed logic technologies. Note: CMOS, complementary metal-oxide semiconductor; E CL, bipolar emitter-coupled logic; GaAs, gallium arsenide; HEMT, high- electron mobility transfer; JJ, Josephson junction. (Courtesy Hiroshi Kashiwagi, Electro- Technical Laboratory, Japan.) in recent designs are complementary metal-oxide semiconductor (CMOS) and gallium arsenide (GaAs). CMOS. The CMOS technology being used in the ETAi ° has two advantages relative to ECL: lower power dissipation and higher gate density. The lower power dissipation makes it possible to achieve high densities of gates per chip; the higher gate density makes it possible to incorporate more functions

40 JACK WORLTON per chip and therefore avoid transmission delays from chip to chip. CMOS has a major disadvantage as a logic technology for supercomputers: its gate delay is slower than that for ECL. However, its relatively slow speed can be partially compensated for by cooling it with liquid nitrogen to a nominal temperature of 77 K, which lowers the gate delay. The cycle times of the ETAi° range from 24 us down to 7 ns. GaAs. Gallium arsenide has two advantages relative to ECL: it has a lower gate delay and lower power dissipation. The lower power dissipation implies a potential for increases in packing density without causing heat dissipation problems and therefore even further speed increases. This technology also has two disadvantages: higher cost and lower gate density per chip. The gate density of GaAs is projected to increase dramatically compared to the gate density of traditional logic technologies, as illustrated in Figure 4.10. If this projection should prove to be true or even approximately true, the gate density disadvantage of GaAs would no longer exist. The GaAs industry is maturing rapidly, and the cost per gate is projected to fall as increasing gate densities are achieved. Gallium arsenide is being used in the design of the Cray-3 for delivery in about 1990, with a projected circle time of 2 ns. HEMT. The high-electron mobility transistor (HEMT) technology is an aluminum-doped version of GaAs that is cooled to liquid nitrogen temper- atures. It is more complex than GaAs and its development will take longer, probably into the mid-199Os. If it is developed successfully, it may offer cycle times even lower than those offered by GaAs. Other Logic Technologies. Josephson junction technology is a superconduct- ing technology that was investigated in the 1970s and early 1980s but then was dropped by most companies because of its complexity and also because its performance characteristics could be achieved with other technologies that were more tractable; however, Japanese companies continue to study this technology. Optical, molecular, and quantum-tunneling technologies are being studied for their potential for high-performance computers, but they seem to offer few prospects for the near term. Summary. For the immediate future, it is expected that most supercomput- ers and other high-performance computers will use ECL and CMOS logic technology, but if the implementation of the Cray-3 in GaAs is successful, it could create a guidepost for others to follow. The leading-edge cycle time should be about 2 us around 1990, and cycle times of 1 ns should appear in supercomputers by 1995.

EXISTING CONDITIONS loo ~5 S 10 In 'S C, MC,~ O 103 1o2 41 GaAs ~ ~ T T I I T T r 1980 1982 1984 1986 1988 1990 YEAR ~ _._ 1992 1 994 FIGURE 4.10 Projected trends in gate density of high-speed logic circuits. (Repented, by permission, from Bernard C. Cole, Gallium arsenide starts getting some respect, Electronics, June 1988, p. 41.) Milestones in Processor Architecture Architectural Efficiency Efficiency in computer architecture is measured in units of results generated per clock cycle. Improvements in computer architecture that have resulted in higher efficiency are due largely to concurrency, that is, to doing more things at once. Some selected examples are illustrated in Figure 4.11. In the mid-1950s in computers such as the IBM 704, instructions were executed in a sequential scalar mode; that is, they specified only one opera- tion on one pair of operands, and the processing of instructions included a series of sequential steps: fetching the instruction, decoding it, forming the effective address, fetching the operand, and then executing the operation. Beginning in about 1960 in computers like the IBM STRETCH, an instruc- tion lookahead provided the ability to fetch and process instruction N + 1 while executing instruction N. By the mid-1960s in computers such as the CDC 6600, multiple function units were included that allowed several exe- cutions, such as add and multiply, to be performed concurrently. By about 1970, the operations were speeded up by pipelining, that is, subdividing

42 1ol2 Cal Lo - 0 1010 tar z o C' x LL U] go En En 11 109 1o8 107 1o6 105 104 JACK WORETON . . ~.,..~.,.] . ~ ~.2 ~ An ,, ,, I;;;; ~ Is.. Multiple FUs 7.~..] ~,. . . . .. ...,, ~ ~1 me , ,,,,.,,,,.,,., 1950 1960 1970 1980 1990 2000 YEAR FIGURE 4.11 Some architectural contributions to higher performance. (Repented, by permission, from Worlton and Associates.) the steps of the operations into substeps and overlapping these substeps for faster execution. The 1970s saw the development of vector processors in which the same operation was applied to many operands or operand pairs, rather than to just one pair as in the scalar designs. The major supercomputers today use vector processing to achieve high speed. The first generation of vector processors had memory-to-memory designs; that is, they fetched the vectors from memory and returned the results to mem- o~y. However, this caused long start-up times, and the second generation of vector processors followed the lead of the Cray-1 in using vector regis- ters that allowed faster start-up times. Most vector designs today use the register-to-register design. The leading edge of supercomputer architecture today is found in designs that incorporate multiple vector processors, or parallel-vector designs. Balance in Vector Processor Design One of the key issues in vector processing is the balance between the scalar speed (doing one operation per instruction) and the vector speed (doing many repetitive operations per instruction). Figure 4.12 compares the merits of two designs, one with a relative performance of 10 in the scalar mode and 100 in the vector mode, and another with a relative

EXISTING CONDITIONS 1000 -. C' A a: 100- a: o Cal 43 1 0.0 0.2 0.4 VECTOR FRACTION FIGURE 4.12 Balance in vector processor design. (Repented, by permission, from Worlton and Associates.) performance of 2.S in the scalar mode and 400 in the vector mode. The overall performance of the system is a function of the fractions of the work done in each of the different modes. For practical applications, the fraction of vector work is in the range 0.3 to 0.7, so that even though the second design has a higher peak performance than does the first design, the first design, which is better balanced, is clearly to be preferred. Parallel Processing In addition to decreasing cycle time and increasing design efficiency, a third approach to increasing computer speed is through the use of multiple processors, and there are many design issues for these so-called parallel computers: Should there be a few fast processors or many slow ones? Should the memory be shared, attached to each processor, or both? What kind of interconnect network should be used to provide communications among the processors, and among the processors and the memory? Inhere are literally thousands of ways to design parallel computers, and at the moment there is wide disagreement in the industry about which are the best

44 JACK WORLTON choices. To analyze future prospects in these options, we turn to a taxonomy of computer architectures-an intellectual road map of possibilities. Flynn's Taxonomy Michael Flynn's taxonomy from the 1960s has been the most com- monly used guide to generic types of architectures. It matches the control concurrency expressed in the number of instruction streams (one or many) against the execution concurrency expressed in the number of data streams that are being processed (one or many), to generate the four types of ar- chitectures: (1) single instruction, single data (SISD); (2) single instruction, multiple data (SIMD); (3) multiple instruction, single data (MISD); and (4) multiple instruction, multiple data (MIMD). This has been a valuable guide for over 20 years, but it is no longer an adequate guide because it is not comprehensive for all of the architectural design options being explored. An Expanded Taxonomy of Architectures In addition to the types of control concurrency serial and parallel- included in the Flynn taxonomy, a third type is being used, called clustering. In a clustered design, clusters of multiple-instruction-stream processors are connected together with global control with access to global memory. Also, the types of instructions employed in the 1960s were limited to scalar instructions, and there are now several new options that need to be included in any classification scheme. We can systematically explore the options for designing instruction types by considering a number pair (a,b), where a = the number of v~ela.lvilS Mu ill The ilmc1 union and 0 = Ine number of operands (or operand pairs) specified. There are thus four options, as follows: · The (1,1) option defines the scalar type of instruction, in which one operation is specified on one pair of operands. · The (1,N) option defines the vector type of operation, in which one operation is specified, but the operation is applied to many operands or pairs of operands. ~,% , ~. . ~ ~r~^t~qt~ ~.~;~A .~ ·~ : ~:_ _ ~ L ~ _ . ~ ~ ne (M, 1) option cennes the systolic type of operation, in which many operations are performed on each operand that is fetched from memory. · Finally, the (M,N) option defines the horizontal type of operation, in which many operations are specified on many operands for each instruc- tion. The horizontal category is also referred to as very long instruction word (VLIW). We can now create an expanded taxonomy of architectures that matches the 3 control categories against the 4 instruction categories to create a 12-way taxonomy, as illustrated in Figure 4.13.

EXISTING CONDITIONS CONTROL C ONCURRENCY 45 . INSTRUCTION TYPES SCALAR VECTOR SYSTOLIC (1,1) (1,N) (IVI,1) === VLIW (M,N) SERIAL PARALLEL CLUSTERED FIGURE 4.13 An expanded taxonomy of computer architectures. Notation (a,b): a, number of operations specified; b, number of operands (or operand pairs) specified. (Reprinted, by permission, from Worlton and Associates.) Historically, through the CDC 7600 in about 1970, most supercomput- ers had serial-scalar designs; that is, they executed one stream of scalar instructions. First-generation vector processors had serial-vector designs; that is, they also executed a single stream of instructions, but the instruc- tions specified vector operations. Machines of this generation were typified by the Cray-1 and the Cyber 205. In the 1980s, parallel-vector computers have been developed in which multiple vector processors are interconnected through a communication network. The Cray Y-MP and the CDC/ETAi° are examples of computers in this category. The further development of vector designs is expected to be generalized to clustered-vector designs, with clusters of multiple vector processors interconnected with global net- works. An example of this category is the Cedar project at the University of Illinois that uses the Alliant I;XJ8 as the cluster. The parallel-scalar category represents those designs that use a large number of relatively slow processors to attain high performance, such as the NCUBE/ten and the Intel iPSC. These can also be designed with clustering, as found in the Myrias-2. Systolic designs have been largely special-purpose computers that im- plement a particular algorithm such as a matrix multiply or a fast Fourier transform, and they can be developed in serial, parallel, or clustered form. The WARP system developed by Carnegie Mellon University is the leading edge of this type of design. The horizontal or VLIW designs attempt to execute multiple operations per clock period, and they too can be devel- oped in serial, parallel, or clustered form. Examples of the VLIW design are found in the Cydra-S and the Multiflow Mace computers.

46 JACK WORLTON This taxonomy can be extended by subdividing each of the categories to develop an even more detailed taxonomy. Key Issues in Parallel Processing The trend toward parallel computation is perhaps the major trend to watch in the next decade of computer architecture, and it is useful to identify some of the key issues that are being explored in the dozens of commercial and research parallel-processor projects. These issues can be thought of in three categories: system software, system hardware, and applications. Parallel processing will affect system software in language design, com- pilers, operating systems, and libraries. For example, should we continue to use Fortran to specify parallel processing or should we abandon this old language and adopt something new? Fortran is being modified by adding new constructs that allow the specification of parallel processing, and no doubt this language will continue to be used long into the future because of the huge commitment to applications that exist for it. Compilers are being influenced by parallel processing because they need to identify opportunities for parallel execution in a transparent manner, that is, so the user need not be concerned with the details of how the parallelism is achieved but only with what is to be done. Operating systems are now more complex because they need to maintain control and provide services for many streams of instructions rather than just for one. Finally, system libraries are being adapted to use parallel processors. Applications will be affected by parallel processing by the need to insert parallel control statements, to develop parallel algorithms, and to rethink the mathematical models on which the algorithms are based. One of the key issues in the design of high-performance parallel computers is the trade-off between speed and parallelism. That is, in principle we could select the number of processors to be from as few as one or two to as many as thousands (or indeed millions), and we could select the speed of the individual processors to be as slow as a bit-serial microcomputer or as fast as a supercomputer, as illustrated in Figure 4.14. However, a design with only a few slow processors would be too slow to be of any use as a high-performance computer, and a design with thousands of supercomputers in a single system would not be economically feasible, so that the practical area of trade-off lies in between. The evolution of parallel processing is following three guideposts. First, the zone of "large-grain" parallelism is defined by the use of a relatively small number of fast processors, 1 ~ P< 100. This form of parallel-processor design was pioneered by Cray Research, Inc. in 1982 with the introduction of the Cray X-MP product line, in which the individual

EXISTING CONDITIONS o In In o C] UJ Lo In 47 ~ Large-graln ~ And ) $) arm ) ~1~1~$0~ it.. Mealum-:araln ~ (, ", Parall'ell~sm' ~ ~,"3 ~100 :c:'~'P2~c~"210.00}) ~ . . : . .. ~.... . . _ _.... ~ _ _ ... . . .. _ <:~: Small-graln,~ '2:~.~'~'~:P:atralIe11sm~ .. It's ~ '':''', 2.'.{P:, :~.1 00'0} " . it, NUMBER OF PROCESSORS (P) FIGURE 4.14 The trade-off between speed and parallelism. (Repented, by permission, from Worlton and Associates.) processors are vector supercomputers in their own right and the number of processors varies from the initial offering of 2 to the current offering of 8 in the Cray Y-MP/~. Other vendors have followed this design as a guidepost, including parallel-vector processors offered by Control Data Corporation, IBM Corporation, Convex, Alliant, the Japanese Super-Speed Computer Project, and widely rumored introductions of similar designs from the Japanese supercomputer vendors. The number of processors in this domain is being expanded from 8 processors to 16 in the near term, and to 32 and 64 in the generations in development for delivery in the early 1990s. Second, the zone of "medium-grain" parallelism is defined by the use of a larger number of slower processors, with the number of processors being in the range of 100 < P < 1000. This form of parallelism is typified by the BEN Butterfly, which has 256 processors, and by the Myrias SPS-2, with 512 processors. Third, the zone of "small-grain" parallelism is defined by the use of an even larger number of processors, P ~ 1000, each of which is even slower and is typically a bit-serial microprocessor. Here the number of processors varies from 4096 in the AMT-610 DAP, to 16,384 in the Goodyear MPP, to 65,536 in the Connection Machine. A Taxonomy of Limits to Massive Parallelism The term massive parallelism is loosely applied to any of a variety of computers in which the desired performance is achieved through a very

48 JACK WORLTON DECOMPOSITION ALGORITHMIC LIMITS TO MASSIVE PARALLELISM SERIAL BOTTLENECK COMMUNICATION IMPLEMENTATION SYNCHRONIZA TION INTERFERENCE FIGURE 4.15 Taxonomy of limits to massive parallelism. (Repented, by permission, from Alex Yuen-Wai Kwok, Center for Supercomputer Research and Development Report No. 679, August 1987.) large number of relatively slow processors, with the threshold beyond 1000 processors often used to mean massive. The ability of a parallel computer to perform efficiently as the number of processors is increased is referred to as scalability. A study of architec- tural scalability published at the Center for Supercomputer Research and Development at the University of Illinois presents a taxonomy of the limits to scalability, as summarized in Figure 4.15. Limits of the first kind are due to algorithmic constraints. For a system with P processors to perform at full effectiveness would require that the number of available tasks equal or exceed the number of processors through- out the calculation. However, not all algorithms can be so decomposed, so that there is a decomposition limit to scalability. When the number of tasks available is just one, this is referred to as the serial bottleneck, which has very serious implications for the effectiveness of massive parallelism. Limits of the second kind are due to the implementation details, includ- ing latencies caused by communications, synchronization, and interference. As tasks are decomposed to finer levels of detail, the communications re- quirements increase in proportion to the level of decomposition, and the amount of actual communication latency is dependent on the system design. A second implementation limit is caused by the need to integrate the results of the decomposed tasks through various synchronization techniques. At

EXISTING CONDITIONS 49 the synchronization points, the number of available tasks decreases, so that the number of idle processors increases, thus limiting performance. A third implementation limit is caused by interference as the processors attempt to access shared resources; these resources may include shared memory, shared communication networks, or shared input-output systems. Following Hockney's nils method, it can be shown mathematically that the fraction of parallelism that is required to achieve 50 percent efficiency for a parallel-processor system is given by (P - 2) 7ri/2 = up _ 1' where P = the number of processors, and hence the fraction of serial work cannot exceed 1-~/2 = up _ i) Thus as P grows, the domain of applicability becomes inherently more limited, because ~/2 approaches 1.0 and 1 - ~/2 approaches 0 very rapidly. This does not imply that it is impossible to use parallel processors effectively, but it does provide a guide to the domain of application, with a smaller number of very fast processors being more useful for general purposes than a larger number of slow processors. CONCLUSIONS It has been over 300 years since Galileo unlocked the door to the world of empirical science, using a telescope as a key, so to speak. In the year Galileo died (1642), Isaac Newton was born, and he invented a key we call the calculus that he used to unlock the door to the world of theoretical science. And it has been a scant 4 decades since von Neumann recognized that powerful electronic computers were the key to still another world of science, computational science. In the opening lines of Man of La Mancha, the author (Dale Wasser- man) invites the audience, "Come, enter into the world of my imagination." The supercomputer, too, bids us enter into the world of imagination. In the real world, we are constrained by limits of time, space, energy, matter, and costs, but in the world of the supercomputer, these constraints are merely bits of information that we can readily manipulate. In the world of the supercomputer, we can experiment with new designs for such things as aircraft, automobiles, accelerators, weapons, and chemicals; we can ex- periment with these designs until they either perform as we desire or we discover why they fail. By experimenting in the world of the supercom- puter, we can avoid many of the costs, delays, and failures that would have

so JACK WORLTON occurred if we had experimented with these designs in the real world. In this new world we can create previously unthinkable systems such as black holes, stand in space to watch them perform, and create videotapes of this adventure to take back to the real world. The university, the industry, or the nation that would be a leader in the modern world of intense international competition must master the information technologies, the leading edge of which are the supercomputing technologies. It is imperative for our future success as a nation that we accept the invitation offered by the supercomputer "Come, enter into the world of . . . imagination."

Next: 5. Toward the Future »

Supercomputers: Directions in Technology and Applications (1989)

Chapter: 4. Exisiting Conditions

Welcome to OpenBook!

Get Email Updates