Toward the Future

Steve Chen
Supercomputer Systems, Inc.

If Jack Worlton is a lifetime fellow-user of supercomputers, I have become a longtime pursuer of a dream machine. I have chased this machine for more than 10 years. I still have not found the perfect machine to fulfill the users' needs. This has become very challenging but also very rewarding work. My hope is that some day we can come up with a machine that is about 100 times faster than today's machines. This machine, as one of the fundamental tools, will be used by scientists and engineers in many different disciplines to study things they cannot do today. I would like to share with you some of my thoughts on the future developments in supercomputing and their potential impact. I will speak only from a designer's point of view.

THE CURRENT STAGE IN SUPERCOMPUTING

Supercomputing has come a long way, when viewed from many angles: in speed, the central processing units (CPUs), memory size, input/output (I/O), peripherals, physical size, and software.

Speed

You have heard about machine clock rate coming down from 100 ns to 50 ns, then to 25 ns, 12.5 ns, 6 ns, and 4 ns. And each time the clock rate is reduced by half, the underlying component technology becomes more
complex. Furthermore, the requirements for data space increase. So the challenge we face in designing the machine gets worse.

Central Processing Units

The central processing unit (CPU) is the heart of the system. When we cannot get more speed out of a single CPU, we start combining more CPUs. But this is not an easy job either. We cannot just tie many boxes together and make the machine faster. My favorite analogy: to build a faster racing car, we have to decrease the car size and at the same time have more engines in the chassis. We cannot put in larger engines because the car would become big and clumsy. So for each generation, we have to invent a smaller engine that runs faster than the previous one and link together as many engines as possible, such that the car can run efficiently when all engine power is applied concurrently. We have seen the number of CPUs increasing from 1 to 2, to 4, to 8 in a machine. But keep in mind that each CPU has to be faster than the previous generation. That makes the development work tough!

Memory Size

We start with 1 million words per CPU for data space. Next we see the words increasing to 4, then to 8, then to 16 megawords per CPU. The data space is increased to allow solving bigger problems as each generation's machines harness more and faster CPUs. We are trying to stay one step ahead of the application. Unfortunately, sometimes we have felt that we are fighting a losing game. The memory component designer can only give us a bigger memory chip with very little improvement in speed. Hence the data access time from memory becomes slower relative to data compute time. We must now figure out all kinds of tricks to compensate for the gap between the memory chip and the CPU speed.

Input/Output

Many years ago an input/output (I/O) channel could run about 1 megabyte per second.
This was increased to 10 megabytes per second, and then to 100 megabytes per second, which soon will become a standard rate for anything usable. So the trend is clear. To solve bigger problems of the future, we cannot just add memory size and CPU power without significantly increasing the I/O transfer rate.

Peripherals

Peripherals are also a serious problem. Advances in storage technology are falling behind the CPU's improvement in terms of capacity and speed.
Ten years ago it was common to have disks with hundreds of megabytes and a 1-megabyte-per-second transfer rate. Today, we have gigabyte storage units with a 10-megabyte-per-second transfer rate. In the meantime, we still have to use a solid-state secondary memory device as a buffer to smooth out the speed difference between CPUs and peripherals.

Physical Size

Not too many people recognize the changes in the physical size of supercomputers. Many years ago the CDC 6600 filled about 500 square feet of floor space. The CRAY X-MP occupied roughly 100 square feet. The CPU module of the CRAY Y-MP is suitcase-sized. Future products may shrink even further. But that does not mean that such a CPU is easy to design. We cannot just squeeze everything together. As each generation of machine comes down in size, the heat dissipation becomes harder to deal with. We can increase circuit density in the chip, but we cannot proportionally reduce the power per gate. For example, a suitcase-sized supercomputer may dissipate a couple of thousand watts of power. We may be able to put it on the desktop, but we will have an instant meltdown in case of a cooling malfunction: it will go right through the table. We are dealing with a fantastic problem. It's no small design challenge to try to keep a supercomputer cool.

Software

No one paid attention to software initially. Most people were thinking about supercomputers as just pieces of hardware. The user was forced to figure out how to use the machine and then hand-code to optimize everything. Later on we had a little primitive compiler software. Then slowly, people started to recognize that this was not good enough anymore. Production-quality compiler software was developed for vector processing over the past 10 years. User expectations for software functionality and performance features continue to rise as more and more supercomputers become available and are widely used.
Systems

Let's view supercomputer development from a different perspective to appreciate how far we have come. When we look at the 10-year period from 1955 to 1965, we can see that the CDC 6600 was a dominant factor in the supercomputer arena, with 1 million to 10 million floating-point operations per second.
In the period 1965 to 1975, the CDC 7600, the TI ASC, the Burroughs BSP, and the Illiac IV were developed. They reached from 10 million to 100 million floating-point operations per second. The CDC 7600 was the major workhorse during this time period.

From 1975 to 1985, thanks to Seymour Cray, a new machine took the lead. Cray created the CRAY-1 architecture to take advantage of extensive pipelined vector processing. In addition, supercomputer systems became more reliable. The mean time between failures jumped from 10 hours to 100 hours and then to 1000 hours, making the supercomputer a viable product for use in commercial industry. After the CRAY X-MP was introduced, applications expanded rapidly, from pure laboratory research to various commercial product areas. During this time, more machines and manufacturers entered the market: the CRAY-2, the CDC Cyber-205, and also, from overseas, the Fujitsu, Hitachi, and NEC models. These machines generally reached from 100 million to 1 billion floating-point operations per second. Many more players have joined in because they see the importance of supercomputing, not only in the computer industry itself, but also in its wide effects on many key industry applications.

Personally, I have had the good fortune to work with two of the best designers in the world, Dave Kuck and Seymour Cray. I have learned a lot from them. Dave Kuck inspired me with the Illiac IV and with the follow-on Burroughs BSP project. These projects gave me a deeper inside view of the system and software areas. I was also pleased to be able to join Cray Research. Seymour Cray was a good model of the best designer in the hardware and packaging areas. Finally, I was lucky to have the opportunity to participate in designing the CRAY X-MP and the Y-MP, to put my first foot in the water.

THE NEXT STAGE IN SUPERCOMPUTING

What's in store in the next 10 years? Definitely more companies will enter the competition, but also some will fall out.
The important thing is that high speed will become widespread. In the highest-performance arena, instead of going 10 times faster, machines will go as much as 100 times faster. We will see machines with 32 to 256 CPUs in production use. Machine speed will reach between 1 billion and 100 billion floating-point operations per second. This is based on the technology as far as we can see, barring any major breakthroughs. Even this may not be fast enough. The Director of the National Center for Atmospheric Research, Bill Buzbee, once told me that the next generation of ocean problems may take about 100 to 1000 hours of current supercomputer time. I couldn't even comprehend the problem he was
describing. But the problem definitely cannot be solved today. We need to continue to push supercomputer technology forward in order to fulfill those requirements. My personal goal in the future is to develop such a computational engine for scientists and engineers to open new frontiers in science and industry, similar to those made possible by the electron microscope and by steam- and gas-powered engines in earlier days. I have discovered that developing such a machine is not an easy job anymore. No single person or single company can do it alone. We must depend on various technologies (component, software, and application) to advance in a balanced way. We need to take advantage of every technology we can get and stretch to move all these areas ahead.

Parallel Processing Environment

We are going into the arena of parallel processing, and it is just a matter of time before people will learn how to do it. I know it is painful. But we have moved from assembly language to Fortran. We took a long time to get there, and now Fortran may never die. Now we must move from Fortran to parallel Fortran. It took about 10 years to grow from serial Fortran into vector Fortran, and now it may take another 10 years to go from vector to parallel Fortran. But if we don't start now, we may never be able to take advantage of the performance of future machines. So we see where the train is going.

Today, and in the near future, we will have in production 1- to 16-processor, high-performance machines. But we also have seen experimental or developmental machines that have 32 to 256 processors or even 1000 processors. Right now such machines are in the research and development stage; the critical task is to study how to use them. Because each processor is quite slow, these machines are not used in production for general applications. Our goal is to move gradually toward more and faster processors, while maintaining a consistent system architecture.
This approach will ensure that no users will suffer a degradation of performance in running their existing production codes on the next-generation, more parallel machines when they become available. In the meantime, as users gain experience in developing more parallel application algorithms, they will be able to explore higher performance through the added number of processors. I believe this is a sensible approach to protect the users' software investment and, at the same time, encourage the long-term development of parallel applications.

Next, let us focus on how the three key technology areas (component, system, and application) may proceed in developing a future high-performance supercomputer.
Component Technology Development

We will stretch the currently available component technology. We must combine improvements in many elements to enhance the design of the machine.

Device Speed

Device speeds have come down from 1 ns to 0.5 ns, and then to 250 ps and 125 ps. They may even come down to the 50-ps range. Complementary metal oxide semiconductor (CMOS), gallium arsenide (GaAs), and bipolar devices are all viable. Each has its own advantages and disadvantages.

Circuit Density

Depending on the device type, today's circuit density is approaching the 1 K-gate level for GaAs, the 10 K-gate level for bipolar, and the 100 K-gate level for CMOS. In the future, we may see even larger-scale integrated circuits. How usable are these big chips? Bigger doesn't always mean better. The advantage of these superchips depends on the trade-off of speed, power, circuit complexity, and overall system considerations.

Metal Interconnect

As circuit density increases, more transistors have to be connected in a relatively small and expensive silicon area. One way to keep the chip size down is to make the interconnect metal thinner, so that more signal lines can be placed next to each other. However, a thin metal line may degrade the signal speed and integrity. As a result, the electronic signal may travel more slowly between transistors, even though each transistor's intrinsic switching speed is very fast. And, in the worst case, the signal may not travel far enough before it disappears. Furthermore, very thin metal may cause an electromigration problem in a high-speed (high-current) application: the high current density gradually displaces the metal atoms inside the chip, eventually breaking the line and leading to unreliable components. Hence we have to develop a better metal interconnect system within the integrated circuit to allow sufficient current-carrying capability (for speed), while maintaining smaller physical size (for density).
The balancing act between speed and density is among the most demanding requirements facing our component designers in the future.

Substrate Material

The substrate material used to fabricate the printed circuit board is another critical factor. The traditional fiberglass-like material may not be
sufficient for future high-speed and high-density applications. The electrical properties of the material may cause the signal to slow down and become noisy and lossy as speed increases. In addition, the mechanical and thermal properties of the material are also important in deciding the number of signal layers, the density of signal lines, and the compatibility between chip and substrate. We should continue to enhance current substrate materials and search for new ones to give us the maximum component packaging density required for a high-performance system.

Power Consumption

As I mentioned earlier, for a given technology, power per gate in a chip is not coming down as fast as we would like it to. We have seen improvements from 50 to 100 milliwatts per gate dropping to 10 to 20 milliwatts per gate (a factor of 5 reduction), then to 5 to 10 milliwatts per gate (only a factor of 2 reduction). This power-reduction trend appears to have flattened out. Hence, while we are increasing circuit density, the total power per chip is rising, causing difficult cooling problems at the component and system levels. This is a very critical area, and we need intensive cooperative research efforts with component manufacturers in the future.

Packaging

Many of the integrated circuits we are using are getting faster. Unfortunately, the performance gains at the component level are eroded significantly because of the packaging loss all the way up to the system level. Multiple levels of interconnect media, such as printed circuit boards, chip attachments, connectors, backplane wires, and so on, all affect performance. As clock rate increases, component, module, and system packaging becomes a very critical issue for the total system design.

Testing and Measurement

The bigger the chip, the more pins there are to handle. Future chips might have 250 to 1000 pins. In addition, they will operate at high speeds and high power levels.
As a result, the problem of testing chips becomes quite complex and expensive. The same is true for high-speed measurement equipment for circuit board and system checkout. Because a piece of test equipment may cost up to $5 million, the availability of cost-effective, high-performance test equipment has become a more visible concern. Unfortunately, it is getting harder to find suppliers of advanced test and measurement equipment to satisfy the performance requirements. Companies in the United States keep dropping out of the market, and some equipment is only available from overseas. Without such equipment, one
may have the best design, but one cannot build, test, and ship the machine. So this is also a very important area to watch.

System Technology Development

Architecture Concepts

Once we have the best components, the next step is to put the system together in the slickest way. There are many ways we can do this. We hear about many different architectural concepts being explored: single versus multiple processors; system throughput versus processor speed; single-level versus multiple-level parallelism; loosely coupled versus tightly coupled system interconnects; monolithic versus distributed memory; and special-purpose versus general-purpose system design. If one looks underneath the design of future machines, it will have one or more of these architectural flavors. However, the most important thing is to design a balanced architecture and provide good software to support an application or many applications. The user, in general, should be aware of but not be bothered with the complexity of system design.

Solution Time

As I have mentioned before, the issue now is not how fast one can design a machine to do A + B; the real issue is solution time. In earlier days, people compared different machines by counting how many millions of floating-point "add" or "multiply" operations could be done in a second (MFLOPS). That measurement is similar to the RPM (revolutions per minute) rate of the wheels of a racing car. The RPM rate is not an indicator of how much usable horsepower is available when driving on a real road. Similarly, the peak MFLOPS of supercomputers bear little relation to the performance obtainable on real user applications. Later, when performance was measured by how fast a machine could compute "Livermore Loops," some people could not differentiate between a real supercomputing system and a "designer machine" targeted for Livermore Loops. We should raise ourselves to a higher level.
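The racing-car analogy can be made concrete. Peak MFLOPS is just the product of clock rate, arithmetic pipes per CPU, and CPU count; what a real application sustains depends on how often those pipes are actually busy. A minimal sketch, with hypothetical numbers not drawn from any particular machine:

```python
# Illustrative sketch of the peak-vs.-sustained distinction described
# above. All numbers are hypothetical, not measurements of a real machine.

def peak_mflops(clock_ns: float, flops_per_cycle: int, n_cpus: int) -> float:
    """Theoretical peak rate: every pipe produces a result every cycle."""
    cycles_per_second = 1e9 / clock_ns
    return cycles_per_second * flops_per_cycle * n_cpus / 1e6

# A 6-ns, 8-CPU machine with add and multiply pipes
# (2 floating-point results per cycle per CPU):
peak = peak_mflops(clock_ns=6.0, flops_per_cycle=2, n_cpus=8)

# Suppose a real application keeps the pipes busy only 15 percent of
# the time (memory traffic, scalar sections, I/O waits):
sustained = 0.15 * peak

print(f"peak:      {peak:8.1f} MFLOPS")
print(f"sustained: {sustained:8.1f} MFLOPS")
```

The gap between the two printed numbers is exactly the gap between the RPM of the wheels and the speed of the car on a real road.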
It took me about 5 years of preaching (I can tell you, that's how long I've kept arguing the point) to convince users to find a new performance measurement yardstick. Fortunately, now they have gone up one notch to use LINPACK, a set of mathematical subroutines for solving linear algebra problems that is, in general, more usable than just the Livermore Loops rate or the peak MFLOPS rate. Even so, the performance numbers on LINPACK are still only an indicator of the computation time for a small part of the total solution process. To be successful in future high-performance parallel processing
systems, we must strive for overall system performance and start to talk about solution time. And we need the users' help to define what we mean by solution time rather than computation time. For example, three-dimensional seismic processing may involve reading more than 20,000 tapes of earth data before a machine begins to do A + B. The process starts by getting data into the machine from the 20,000 tapes and then generating the analysis and output to see exactly what is underneath the ground. The whole process may take 3 months of today's supercomputer time, during which only a few days may be spent on numeric-intensive computational tasks. We need to define this whole process so that we can measure "total time to get results." We want to make sure the scientists can do their thinking instead of playing around with the computer system, or running around the computer room.

If I give a machine to an aircraft designer, that person should be able to construct a model, pick a grid point, describe the airfoil, wing, and tail, and then simulate it to see if the design is correct. The model should include structure, air flow, control, and other interdisciplinary conditions that have to be satisfied in one design. The designer should be able to define this design process from beginning to end and measure the machine performance by the total time that must be spent completing this design process. This measurement is called solution time. The solution time includes all of the following elements:

Data acquisition/entry;
Data access/storage;
Data motion/sharing;
Data computation/process; and
Data interpretation/visualization.

How to capture the raw and digitized design data, how to store it, and how to move it efficiently in and out of the disk, solid-state secondary memory, and main memory during computation are all essential to the solution process. Then, after all that has been done, how quickly can the results be interpreted?
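The seismic figures quoted earlier (3 months of machine time, only a few days of it numeric-intensive) can be sketched as a simple accounting over the five elements. The stage durations below are hypothetical, chosen only to match the rough proportions in the text:

```python
# A back-of-the-envelope sketch of the seismic example: total solution
# time versus the numeric-intensive computation alone. Stage durations
# are hypothetical, scaled to the "3 months total, a few days of
# computation" proportions given in the text.

stages_days = {
    "data acquisition/entry (20,000 tapes)": 45.0,
    "data access/storage":                   20.0,
    "data motion/sharing":                   15.0,
    "data computation/process":               4.0,
    "data interpretation/visualization":      6.0,
}

total = sum(stages_days.values())
compute = stages_days["data computation/process"]
print(f"total solution time:   {total:.0f} days")
print(f"computation alone:     {compute:.0f} days ({compute / total:.1%})")

# Making only the compute stage 100 times faster barely moves the total:
faster_total = total - compute + compute / 100
print(f"with 100x faster CPUs: {faster_total:.2f} days")
```

The arithmetic makes the chapter's point: speeding up only the A + B stage leaves solution time nearly unchanged, which is why the whole process must be measured and balanced.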
When data can be generated very rapidly, a whole week may be required to digest the numbers. I would rather see the visual: the underground picture, or the heat flow on the surface of the integrated circuit chip. When an alpha particle hits the electronic device, I want to see the electromagnetic field moving while I watch. I want to be able to start, or stop and restart again, the simulation process any time I want. While I am simulating an airfoil for an aircraft, I want to see if a particular region of the airfoil is subject to high pressure or temperature. If I feel something is going wrong, I want to zoom into a particular area to test it again or try out a different algorithm or analysis. I need to have an
interactive design or analysis capability on the system. And last but not least, I want to be able to complete this whole process without leaving my own design station. I hope these examples illustrate the important difference between the computation time and the solution time that involves the whole process. Whoever designs it, the machine with minimal solution time will be the best system in real applications.

Exploitation of Parallelism

To achieve high performance on future parallel systems, we should work from two directions (see box).

Levels of Parallelism
  System         User specified, system scheduled
  Job            System scheduled
  Job Step       User and utility specified
  Program        User and compiler
  Procedure      Compiler and user
  Basic Blocks   Compiler and user
  Loops          Compiler
  Statements     Compiler

From the bottom up, we should continue to improve the compiler techniques to exploit automatically the parallelism in user programs. This includes extending vector detection capability to the detection of parallel processable code. From the top down, we should provide system and applications support in terms of libraries, utilities, and packages, all designed to help users prepare their applications to get the most performance out of the parallelism existing at the highest level.

One way to think of a parallel application in the future is as a multiple-domain approach. We have many, many processors at our disposal. How do we decompose a problem and make it 99 percent parallel? It is not difficult. If we look at natural phenomena, most are parallel. Unfortunately, we are trained to think sequentially. Take the aircraft design example again. We simulate one wing, then another wing, then the body, the tail. Each part is called one domain. We can now simulate all domains at the same time.
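The importance of reaching 99 percent parallel can be quantified with Amdahl's law (not named in the text, but it is the standard way to state the point): with 256 processors, whatever serial fraction remains dominates the achievable speedup. A minimal sketch:

```python
# A minimal Amdahl's-law sketch of the "99 percent parallel" question
# posed above. All numbers are illustrative.

def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup over one processor when only parallel_fraction of the
    work can be spread evenly across n_processors."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# With 256 processors, the residual serial fraction dominates:
for fraction in (0.90, 0.99, 0.999):
    speedup = amdahl_speedup(fraction, 256)
    print(f"{fraction:6.1%} parallel on 256 CPUs -> {speedup:6.1f}x speedup")
```

Even at 99 percent parallel, a 256-processor machine delivers well under a third of its nominal speedup, which is why top-down decomposition of the whole application, not just loop-level compilation, matters so much.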
We can also think of a parallel application as a multiple-stage pipeline approach. Take the seismic processing example. First we start with tape input, and then comes data verification and alignment. The next step is analysis and simulation. The final step is data interpretation and visualization of the underground picture. All stages of the whole process can be done concurrently on the system. The first stage can be performed by a group of a few processors, with data flowing continuously to support the next stage on another group of processors, and so on.

Take this one step further. If we look at future application development, we can bring different disciplines into one design solution, a multidiscipline approach. For example, in the design of a space shuttle, materials, structure, aerodynamics, and control problems can all be evaluated at the same time with various design criteria. The analysis step of each discipline can be processed in parallel by different groups of processors.

These examples are just a few of the ideas for exploring future parallel systems to achieve much higher system performance through a top-down application decomposition than can be obtained only by the bottom-up compiler approach. The key to success is the adaptability of the system architecture. Users should not have to change application algorithms when they migrate to more highly parallel machines.

Application Technology Development

Many examples indicate that supercomputers have proved very useful in various industries: in the defense, petroleum, aerospace, automotive, meteorological, electronic, and chemical segments. Today, all the industrial countries of the world are developing their own application techniques using supercomputers. These tools improve their competitiveness in creating new materials, developing better processes and products, or making new scientific discoveries.
We see existing applications expanding to include more complex geometry or more refined theory as machine capability and capacity keep improving. We also see the potential in new areas, especially materials science. We need help to find new materials, whether we are designing integrated circuits for supercomputers or developing industrial products. Other emerging application areas include biomedical engineering, pharmaceuticals, and financial analysis. New applications will also evolve from interdisciplinary areas. We have to think about how to develop future application technology along with future system design. We must start earlier to interact with leading application scientists and engineers to develop the next generation of algorithms to make the greatest use of parallel processing. These efforts
will also help to speed up the migration of existing application codes onto new machines. Our challenge is to start using these machines in production as soon as they become available. The worst thing we can do in this country is to design the best machines but then not use them. Then some other country will jump in to make use of them ahead of us. We have already seen this happening in some industries with the current generation of supercomputers. We certainly want to keep our leadership position in application technology development for future machines.

SUMMARY

New Directions

In summary, I will point out a few new directions that may evolve in supercomputing:

Comprehensive support for parallel processing;
Development of open systems that enhance productivity and competition;
Total system design to minimize solution time;
Seamless services environment and distribution of functions; and
Wider applications in scientific, engineering, and commercial fields.

In the future there will be more comprehensive support for parallel processing, from very primitive to very sophisticated levels. This means that more compiler and system software features will be made available for supporting users in parallelizing their application algorithms as well as developing and debugging parallel programs.

The open system concept is spreading rapidly. Participants are working from many directions to exchange ideas and codes. An open system environment will allow us to concentrate our development and application resources only on those extension areas related to performance or functionality. This will prevent the "reinventing the wheel" syndrome and enhance our productivity in delivering competitive products.

A total system design that minimizes solution time is an important key. We will measure machines by solution time instead of by computation time.
The user will see a seamless services environment with distribution of functions: the supercomputer merged with mainframes and workstations. Users won't have to tackle different kinds of environments. Instead, an integrated design, engineering, and manufacturing computing environment will emerge, greatly enhancing user productivity and industry efficiency.
We will also see a broad expansion of applications for science, engineering, and commercial endeavors. Scientists and engineers will explore the unknown and develop new technologies. Industry will be more competitive and productive through its development of new products or processes.

Potential Impact

Developments in supercomputing technology strongly influence not only the competitiveness of key industries in our national economy but also the vitality of the computing industry itself. This influence on the computer industry can be shown in a simple triangle (Figure 5.1). The base of the triangle represents personal computers and workstations. The middle section contains mainframe or mid-range computers. At the top is the supercomputer. All three levels of technology are interacting heavily. For example, the basic component technology, parallel architecture concepts, and software and hardware design exploited in the supercomputer arena will trickle down to the mainframe and workstation level; vice versa, the user interface software and application tools commonly seen at the workstation level will be introduced at the supercomputer level. As a result, supercomputing technology pulls the computer industry upward, creating new market opportunities and enhancing user productivity.

Need for Technological Leadership

I used to say, "How do we stay there?" I have changed my mind. Now I say, "How do we get there?" The race is too close to call at this time. I don't think we have much leadership in component technology. I have worked on this problem for many years. Each year I become more humble when I see how difficult it is to build this kind of machine without a competitive and sustainable technology base. We are losing by months from many points of view. We are starting to lose some of the critical components. We have tried to help U.S. companies, to work with them, to drive their capability forward to meet with us.
But sometimes it is like wrestling with a big boat. Our competitors have the advantage. Their work is integrated. They can focus on something and stay in there for a long time. They can sacrifice one segment of their industry to pay for another one as long as it is strategically important to their long-term technology objectives. In the past, we in the United States seemed not to be able to do that no matter how hard we tried. Thus, to reverse this trend, some component and computer industry leaders need to work together intensively to develop and maintain a strong component technology base in this country.
FIGURE 5.1 Impact of supercomputing technology. [The figure is a triangle with workstations and PCs at the base, mainframes in the middle, and supercomputers at the top. Vendors named include Apple, Sun, Silicon Graphics, NeXT, and HP/Apollo (workstations and PCs); IBM, DEC, CDC, Unisys, Fujitsu, Hitachi, and NEC (mainframes); and SSI, Cray, CDC, IBM, Fujitsu, Hitachi, and NEC (supercomputers). The shared technology elements listed are integrated circuits, architecture, printed circuits, hardware, packaging, cooling, power, software, user environment, and applications.] (Note: Manufacture of supercomputers by CDC was discontinued in April 1989.)

Fortunately, we still have some lead in software and application technology, especially with respect to parallel processing. My hope is to combine our resources with those of government, universities, and industry. It is important for us to keep this cooperative development effort moving. In 5 years, we can design a machine that is 100 times faster than today's, but nobody will be able to use it unless we ship it with good software and application tools.
We must start working with users today. It may take 5 years to develop an application. Beginning now, while users are developing their next-generation applications for a high-performance parallel machine, we can be developing our next-generation system software and application libraries and tools for a high-efficiency user environment. We are entering a new paradigm of supercomputing in which user application (and productivity) is at the center, instead of hardware (peak rate) as in the last decade. That is my goal.

We have to keep this technology leadership. We can accomplish it as long as we have a common view of the future. In order to develop and sustain supercomputing technology, we must take a long-term view. We must be willing to take risks. We have learned from our past experience. Also, most importantly, we should have a focus. We have many resources in this country, but they are scattered and never focused enough. That is why we are losing, step by step, in some areas. These are just some of my personal observations and experiences that I would like to share with you. Certainly, I am not done yet. I am still chasing that dream machine!
DISCUSSION

Michel Gouilloud: Steve Chen, you have come with a long list of challenges and problems. Can you suggest some priorities; in other words, which of these problems do you see as the most critical in the path of developing your next generation of machines?

Steve Chen: I think the underlying component technology is the most critical problem. For example, in silicon technology we seem to have reached a plateau in speed and power. The next-generation chip we see is denser but not faster, and it requires more power. We certainly don't want a machine that is 100 times faster but needs 100 times more power; we might have to build a power substation next to the computer room. That problem is real. We need a breakthrough in this area. Another critical area is high-density cooling. We have to be able to cool a small area that has very dense heat dissipation, e.g., 10,000 to 20,000 watts. The next area is application. We need to work with users to design machines that are balanced, while at the same time preparing their future applications to take full advantage of parallel processing.

Michael Teter: We at Corning Glass are interacting fairly heavily with the Cornell Supercomputer Facility. We seem to notice that, independent of the size of the supercomputer there, as soon as users start competing for time, the amount that any individual scientist has for his own research becomes essentially negligible, and he would almost be better off buying a VAX and working by himself.

Larry Smarr: The largest university user of the NCSA has received 10,000 hours in the last year. Several users have used over 1000 hours. Kodak uses more than 100 hours a month. It is management of the allocation of time that is important. The national centers are still learning how to do this. In fact, it is only within the past several months that the blue-ribbon peer review boards for each center have taken over completely the allocation of time.
Previously, individual program officers at the NSF simply forwarded any good proposal they received, and that caused some real saturation problems. Our goal is certainly to upgrade the facilities as rapidly as we can. That requires leadership and support from Congress and the NSF. I believe we are all now beginning to pull together on that. Our goal is to give to those users who are on the machine both supercomputer response time and supercomputer power, even if that means that we have to limit by strict peer review the number of users on the system. Arthur Freeman: I would like to add to the discussion about whether it
is better to use a VAX. If you can use a VAX, don't go to a supercomputer. One thing that is very clear is that 100,000 VAXs don't add up to a supercomputer in terms of capability, just as 100,000 Volkswagen engines don't add up to a Saturn engine. Supercomputers are very different from VAXs. I think people have to understand this difference between capacity and capability. Capability just is not there on a VAX; it is there on a supercomputer. We want to increase that capability all the time.

George Super: My question addresses a concern outside the operational discussion that has just been going on. Steve Chen, you very accurately described that one of the major challenges you face is decomposing problems and understanding how to think differently about solution sets. You said that we need to increase the demand function, because we are possibly at a stage now in our society where our supply of supercomputing capacity exceeds our ability to use it wisely. I wonder if you think that we are facing a major intellectual challenge, a computational mechanics challenge, that is even greater than the technical challenge of building faster machines?

Steve Chen: Yes, we face a psychological challenge. I was joking with Jack Worlton. For many years, every time I spoke with him, he always said he needed a machine 100 times faster. Now I say, "I will give you that machine, but tell me how to use it." Each time I have given him a machine at Los Alamos, it was already too slow. But, at the same time, the machine was not used to exploit its full performance features. We had a four-processor system for more than 5 years, but the users were still using the system as a throughput machine without going to parallel processing. This was because it was so easy to port all the existing application codes onto the new system and run it as a four-way throughput machine instead of a four-way parallel machine. In contrast, the overseas users are more aggressive.
A good example is the European Centre for Medium-Range Weather Forecasts. Anticipating the future performance required by finer-resolution forecasting models and upcoming parallel machines, they have already decomposed their problem with a general n-way parallel approach, where n is greater than 1. They demonstrated their parallel algorithms in a research model before the new machines arrived. Hence they have been able to continue upgrading their production forecast model from 1 processor to 2 processors, to 4 processors today; next they will have 8, 16, and even higher numbers of processors as soon as those become available. Their transition from a research to a production model has been quick and successful, because they took a long-term view and broke that psychological barrier very quickly. We in the United States are behind in this respect. We have got to catch up in this area.
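The approach described here, writing the algorithm once as a general n-way parallel program so that the same code runs unchanged as n grows from 1 to 2 to 4 processors, can be sketched in modern terms. This is a hypothetical illustration, not the actual forecast code: the grid, the `advance_subdomain` stand-in, and the thread-based workers are all assumptions made for the sketch.

```python
# Sketch of a general n-way parallel decomposition: the worker count n is
# a runtime parameter, so the identical program scales as bigger machines
# arrive (1, 2, 4, 8, ... processors).
from concurrent.futures import ThreadPoolExecutor

def advance_subdomain(cells):
    # Stand-in for one model time step applied to one slice of the grid.
    return [c * 0.5 + 1.0 for c in cells]

def step(grid, n):
    """Advance the whole grid one step using n parallel workers."""
    size = len(grid)
    # Partition the grid into n roughly equal subdomains.
    bounds = [(i * size // n, (i + 1) * size // n) for i in range(n)]
    chunks = [grid[lo:hi] for lo, hi in bounds]
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(advance_subdomain, chunks))
    # Reassemble the global grid from the subdomain results.
    return [c for chunk in results for c in chunk]

if __name__ == "__main__":
    grid = [float(i) for i in range(16)]
    # The answer is identical whether n is 1 or 4; only the speed differs.
    assert step(grid, 1) == step(grid, 4)
```

The design point is the one Chen credits to the European group: because correctness is independent of n, the psychological barrier is crossed once, in the research model, and every subsequent machine upgrade is just a larger n.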
John Riganai: Steve, in the earliest days of vector architecture, Seymour Cray made a presentation to Lawrence Livermore National Laboratory. At the end of the presentation he was asked what made him believe that the vector architecture he was discussing was really a general-purpose machine (it didn't exist at the time) and whether the problems at Livermore would be able to map onto it. The way Harry Nelson tells the story, Seymour just smiled enigmatically and said, "We'll see." Well, we did see, and the vector architecture has proven to be quite general purpose. But the architectures that are evolving now are one step more difficult to understand. Can you help us, especially from a user point of view, to understand why the parallel architectures, the cluster architectures, really will be general purpose in the sense that general applications will map onto them?

Steve Chen: Yes. Let me refer to my earlier remarks. You can think about your applications and decompose them from the top down, e.g., using the multiple-domain approach, the multiple-stage pipeline approach, or the multi-discipline approach. These are natural approaches by which you can easily map many applications onto the parallel architecture. You get the best performance that way. With the proper tool set, the user should be able to exploit this high-level parallelism in a simple and general way without entanglement with the lowest-level machine complexity.

Mark Karplus: You gave us the hope that in 5 or 6 years you might have a machine that is 100 times faster, combining some improvements in technology plus modest parallelism. What many people wonder about is doing much better. I think there will be people who will very easily figure out how to use a machine that is 100 times faster and who will want more.
But there is very little discussion of massive parallelism, and many people say, from the computer point of view, that the future is to get machines that are 1000 or 10,000 times faster.

Steve Chen: I can only give you my personal viewpoint. I think those are worthwhile research activities at this moment, and I would like to see that effort moving forward. But as far as putting 1000 microprocessors together, I don't think you can achieve the same capability we're talking about in solving general applications problems. I would rather evolve from the currently available smaller parallel machines to larger parallel machines, step by step. We have to move the whole community, instead of just one or two very bright scientists. A few people might be able to sit down at a terminal and decompose a problem into 1000 parallel tasks. That would be very good. But I don't think we can bring in the whole community that way in a short period of time. However, I do see the possibility of special-purpose massively parallel machines cooperating with the general-purpose supercomputer.

Edward Mason: At Amoco Corporation we use supercomputers and
massive computation for geophysics, but we also have a chemical company and a refining company. One of our biggest problems is retraining or educating people who are very good in particular fields of science but who have not used supercomputers, so that they can solve problems by taking advantage of the opportunities provided by computational science and, when appropriate, by supercomputers. Visualization and transparency are crucial. Parallel computing has been discussed a lot here, but the biologist or the chemist could not care less how it is done. The concern is what can be done. And the problem is to have those experts in chemistry, biology, and other fields become familiar with how to exploit supercomputers simply, from their own point of view.

Larry Smarr: Critical to the success of that education and training, which I think is issue number one, is having the industrial users live and work in the university environment where, because of the NSF initiative, we have such a vast number of faculty and students who are not having to relearn but are very energetically going directly into using supercomputers. Having them work shoulder to shoulder with the people from industry is proving to be very effective in bringing about that technology transfer. I would very much like to see more support from the government for this education and training part of the program.