The Future of Supercomputing—Conclusions and Recommendations
Chapters 1 through 9 describe a long and largely successful history of supercomputing, a present state of turmoil, and an uncertain future. In this chapter the committee summarizes what it has learned during this study and what it recommends be done.
Supercomputing has a proud history in the United States. Ever since the 1940s our nation has been a leader in supercomputing. Although early applications were primarily military ones, by the 1960s there was a growing supercomputer industry with many nonmilitary applications. The only serious competition for U.S. vendors has come from Japanese vendors. While Japan has enhanced vector-based supercomputing, culminating in the Earth Simulator, the United States has made major innovations in parallel supercomputing through the use of commodity components. Much of the software running on the Earth Simulator and on supercomputer platforms everywhere originates from research performed in the United States.
Conclusion: Since the inception of supercomputing, the United States has been a leader and an innovator in the field.
Ever since the 1960s, there have been differences between supercomputing and the broader, more mainstream computing market. One difference has been the higher performance demanded (and paid for) by supercomputer users. Another difference has been the emphasis of
supercomputer users on the mathematical aspects of software and on the data structures and computations that are used in scientific simulations. However, there has always been interplay between advances in supercomputing (hardware and software) and advances in mainstream computing.
There has been enormous growth in the dissemination and use of computing in the United States and in the rest of the world since the 1940s. The growth in computing use overall has been significantly greater than the growth in the use of supercomputing. As computing power has increased, some former users of supercomputing have found that their needs are satisfied by computing systems closer to the mainstream.
Conclusion: Supercomputing has always been a specialized form at the cutting edge of computing. Its share of overall computing has decreased as computing has become ubiquitous.
Supercomputing has been of great importance throughout its history because it has been the enabler of important advances in crucial aspects of national defense, in scientific discovery, and in addressing problems of societal importance. At the present time, supercomputing is used to tackle challenging problems in stockpile stewardship, in defense intelligence, in climate prediction and earthquake modeling, in transportation, in manufacturing, in societal health and safety, and in virtually every area of basic science understanding. The role of supercomputing in all of these areas is becoming more important, and supercomputing is having an ever-greater influence on future progress. However, despite continuing increases in capability, supercomputer systems are still inadequate to meet the needs of these applications. Although it is hard to quantify in a precise manner the benefits of supercomputing, the committee believes that the returns on increased investments in supercomputing will greatly exceed the cost of these investments.
Conclusion: Supercomputing has played, and continues to play, an essential role in national security and in scientific discovery. The ability to address important scientific and engineering challenges depends on continued investments in supercomputing. Moreover, the increasing size and complexity of new applications will require the continued evolution of supercomputing for the foreseeable future.
Supercomputing benefits from many technologies and products developed for the broad computing market. Most of the TOP500-listed systems are clusters built of commodity processors. As commodity processors have increased in speed and decreased in price, clusters have benefited. There is no doubt that commodity-based supercomputing systems are cost effective in many applications, including some of the most demanding ones.
However, the design of commodity processors is driven by the needs of commercial data processing or personal computing; such processors are not optimized for scientific computing. The Linpack benchmark that is used to rank systems in the TOP500 list is representative of supercomputing applications that do not need high memory bandwidth (because caches work well) and do not need high global communication bandwidth. Such applications run well on commodity clusters. Many important applications need better local memory bandwidth and lower apparent latency (i.e., better latency hiding), as well as better global bandwidth and latency. Technologies for better bandwidth and latency exist. Better local memory bandwidth and latency are only available in custom processors. Better global bandwidth and latency are only available in custom interconnects with custom interfaces. The availability of local and global high bandwidth and low latency improves the performance of the many codes that leverage only a small fraction of the peak performance of commodity systems because of bottlenecks in access to local and remote memories. The availability of local and global high bandwidth can also simplify programming, because less programmer time needs to be spent in tuning memory access and communication patterns, and simpler programming models can be used. Furthermore, since memory access time is not scaling at the same rate as processor speed, more commodity cluster users will become handicapped by low effective memory bandwidth. Although increased performance must be weighed against increased cost, there are some applications that cannot achieve the needed turnaround time without custom technology.
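The bandwidth argument can be made concrete with a simple roofline-style estimate for a bandwidth-bound kernel such as the STREAM triad. The peak rate and memory bandwidth figures below are hypothetical, chosen only to illustrate the arithmetic; they do not describe any particular machine.

```python
# Roofline-style sketch: why a bandwidth-bound kernel sees only a small
# fraction of peak on a processor with limited memory bandwidth.
# All numeric parameters are hypothetical, for illustration only.

def triad_flops_per_sec(mem_bandwidth_bytes, peak_flops):
    """Achievable rate for a STREAM-triad-like kernel a[i] = b[i] + s*c[i]."""
    bytes_per_iter = 3 * 8    # read b[i], read c[i], write a[i] (8-byte doubles)
    flops_per_iter = 2        # one multiply, one add
    intensity = flops_per_iter / bytes_per_iter  # flops per byte moved
    # The kernel runs at whichever is lower: the compute roof or the
    # bandwidth roof (arithmetic intensity times memory bandwidth).
    return min(peak_flops, intensity * mem_bandwidth_bytes)

peak = 4e9    # hypothetical 4 Gflop/s peak per processor
bw = 6.4e9    # hypothetical 6.4 GB/s sustained memory bandwidth

achieved = triad_flops_per_sec(bw, peak)
print(f"{achieved / peak:.0%} of peak")  # prints "13% of peak"
```

Under these illustrative numbers the kernel is limited to roughly an eighth of peak speed no matter how fast the processor is, which is why higher-bandwidth custom memory systems, rather than faster commodity processors, are what such codes need.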
Conclusion: Commodity clusters satisfy the needs of many supercomputer users. However, some important applications need the better main memory bandwidth and latency hiding that are available only in custom supercomputers; many need the better global bandwidth and latency interconnects that are available only in custom or hybrid supercomputers; and most would benefit from the simpler programming model that can be supported well on custom systems. The increasing gap between processor speed and communication latencies is likely to increase the fraction of supercomputing applications that achieve acceptable performance only on custom and hybrid supercomputers.
Supercomputing systems consist not only of hardware but also of software. There are unmet needs in supercomputing software at all levels, from the operating system to the algorithms to the application-specific software. These unmet needs stem from both technical difficulties and
difficulties in maintaining an adequate supply of people in the face of competing demands on software developers. Particularly severe needs are evident in software to promote productivity—that is, to speed the solution process by reducing programmer effort or by optimizing execution time. While many good algorithms exist for problems solved on supercomputers, needs remain for a number of reasons: (1) because the problems being attempted on supercomputers have difficulties that do not arise in those being attempted on smaller platforms, (2) because new modeling and analysis needs arise only after earlier supercomputer analyses point them out, and (3) because algorithms must be modified to exploit changing supercomputer hardware characteristics.
Conclusion: Advances in algorithms and in software technology at all levels are essential to further progress in solving applications problems using supercomputing.
Supercomputing software, algorithms, and hardware are closely bound. As architectures change, new software solutions are needed. If architectural choices are made without considering software and algorithms, the resulting system may be unsatisfactory. Because a supercomputing system is a kind of ecosystem, significant changes are both disruptive and expensive. Attention must therefore be paid to all aspects of the ecosystem and to their interactions when developing future generations of supercomputers.
Educated and skilled people are an important part of the supercomputing ecosystem. Supercomputing experts need a mix of specialized knowledge in the applications with which they work and in the various supercomputing technologies.
Conclusion: All aspects of a particular supercomputing ecosystem, be they hardware, software, algorithms, or people, must be strong if the ecosystem is to function effectively.
Computer suppliers are by nature economically opportunistic and move into areas of greatest demand and largest potential profit. Because of the high cost of creating a supercomputing ecosystem and the relatively small customer base, the supercomputing market is less profitable and riskier. Custom systems form a small and decreasing fraction of the supercomputer market and are used primarily for certain government applications. The commercial demand for such systems is not sufficient to support vendors of custom supercomputers or a broad range of commercial providers of software for high-performance science and engineering applications. As the commodity market has grown, and as the costs of developing commodity components have risen, government missions are
less able to influence the design of commodity products (they might not succeed, for example, in having certain features included in instruction sets). Although spillovers from solutions to the technical problems facing supercomputing will eventually benefit the broader market, there is not sufficient short-term benefit to motivate commercial R&D.
The government has always been the primary consumer and funder of supercomputing. It has sponsored advances in supercomputing in order to ensure that its own needs are met. It is a customer both directly, through purchases for government organizations, and indirectly, through grants and contracts to organizations that in turn acquire supercomputers. Although supercomputing applications could be very important to industry in areas such as transportation, energy sources, and product design, industry is not funding the development of new supercomputer applications or the major scaling of current applications.
Conclusion: The supercomputing needs of the government will not be satisfied by systems developed to meet the demands of the broader commercial market. The government has the primary responsibility for creating and maintaining the supercomputing technology and suppliers that will meet its specialized needs.
DoD must ensure the development and production of cutting-edge weapons systems such as aircraft and submarines, which are not developed or produced for the civilian market. To do this, it continuously analyzes which capabilities are needed in the defense industrial base, maintains those capabilities, and pursues an ongoing long-term investment strategy to guarantee that there will always be suppliers to develop and produce these systems. Similarly, to ensure its access to specialized custom supercomputers that would not be produced without government involvement, DoD needs the same kind of capability analysis and investment strategy. The strategy should leverage trends in the commercial computing marketplace as much as possible, but in the end, responsibility for an effective R&D and procurement strategy rests with the government agencies that need the custom supercomputers.
However, the analogy with aircraft and submarines breaks down in one essential aspect: Not only are custom supercomputers essential to our security, they can also accelerate many other research and engineering endeavors. The scientific and engineering discovery enabled by such supercomputers has broad societal and economic benefits, and government support of the R&D for these supercomputers may broaden their use by others outside the government. Broader use by industry is desirable and should be encouraged, because of the positive impact on U.S. competitiveness and the positive impact on supercomputing vendors.
Conclusion: Government must bear primary responsibility for maintaining the flow of resources that guarantees access to the custom systems it needs. While an appropriate strategy will leverage developments in the commercial computing marketplace, the government must routinely plan for developing what the commercial marketplace will not, and it must budget the necessary funds.
For a variety of reasons, the government has not always done a good job in its stewardship role. Predictability and continuity are important prerequisites for enhancing supercomputing performance for use in applications. Unstable government funding and a near-term planning focus can result in (and have resulted in) high transition costs, limiting the exploitation of supercomputing advances for many applications. Uneven and unpredictable acquisition patterns have meant fewer industrial suppliers of hardware and software, as companies have closed or moved into other areas of computing. Insufficient investment in long-term basic R&D and in research access to supercomputers has eroded opportunities to make major progress in the technical challenges facing supercomputing.
Conclusion: The government has lost opportunities for important advances in applications using supercomputing, in supercomputing technology, and in ensuring an adequate supply of supercomputing ecosystems in the future. Instability of long-term funding and uncertainty in policies have been the main contributors to this loss.
Taken together, the conclusions reached from this study lead to an overall recommendation:
Overall Recommendation: To meet the current and future needs of the United States, the government agencies that depend on supercomputing, together with the U.S. Congress, need to take primary responsibility for accelerating advances in supercomputing and ensuring that there are multiple strong domestic suppliers of both hardware and software.
The government is the primary user of supercomputing. Government-funded research is pushing the frontiers of knowledge and bringing important societal benefits. Advances in supercomputing must be accelerated to maintain U.S. military superiority, to achieve the goals of stockpile stewardship, and to maintain national security. Continued advances in supercomputing are also vital for a host of scientific advancements in biology, climate, economics, energy, material science, medicine, physics,
and seismology. Because all of these are, directly or indirectly, the responsibility of the government, it must ensure that the supercomputing infrastructure adequately supports the nation’s needs in coming years. These needs are distinct from those of the broader information technology industry because they involve platforms and technologies that are unlikely on their own to have a broad enough market any time soon to satisfy the needs of the government.
To facilitate the government’s assumption of that responsibility, the committee makes eight recommendations.
Recommendation 1. To get the maximum leverage from the national effort, the government agencies that are the major users of supercomputing should be jointly responsible for the strength and continued evolution of the supercomputing infrastructure in the United States, from basic research to suppliers and deployed platforms. The Congress should provide adequate and sustained funding.
A small number of government agencies are the primary users of supercomputing, either directly, by themselves acquiring supercomputer hardware or software, or indirectly, by awarding contracts and grants to other organizations that purchase supercomputers. These agencies are also the major funders of supercomputing research. At present, those agencies include the Department of Energy (DOE), including its National Nuclear Security Administration and its Office of Science; the Department of Defense (DoD), including its National Security Agency (NSA); the National Aeronautics and Space Administration (NASA); the National Oceanic and Atmospheric Administration (NOAA); and the National Science Foundation (NSF). (The increasing use of supercomputing in biomedical applications suggests that the National Institutes of Health (NIH) should be added to the list.) Although the agencies have different missions and different needs, they benefit from the synergies of coordinated planning and acquisition strategies and coordinated support for R&D. In short, they need to be part of the supercomputing ecosystem. For instance, many of the technologies, in particular the software, need to be broadly available across all platforms. If the agencies are not jointly responsible and jointly accountable, the resources spent on supercomputing technologies are likely to be wasted as efforts are duplicated in some areas and underfunded in others.
Achieving collaborative and coordinated government support for supercomputing is a challenge that many previous studies have addressed without effecting much improvement in day-to-day practice. What is needed is an integrated plan rather than the coordination of distinct supercomputing plans through a diffuse interagency coordination structure. Such integration across agencies has not been achieved in the past,
and interagency coordination mechanisms have served mostly to communicate independently planned activities. A possible explanation is that although each agency needs to obtain supercomputing for its own purposes, no agency has the responsibility to ensure that the necessary technology will be available to be acquired.
Today, much of the coordination happens relatively late in the planning process and reflects decisions rather than goals. In order for the agencies to meet their own mission responsibilities and also take full advantage of the investments made by other agencies, collaboration and coordination must become much more long range. To make that happen, the appropriate incentives must be in place—collaboration and coordination must be based on an alignment of interests, not just on a threat of vetoes from higher-level management.
One way to facilitate that process is for the agencies with a need for supercomputing to create and maintain a joint 5- or 10-year written plan for high-end computing (HEC) based on both the roadmap that is the subject of Recommendation 5 and the needs of the participating agencies. That HEC plan, which would be revised annually, would be increasingly specific with respect to development and procurement as the time remaining to achieve particular goals decreased. Included in the plan would be a clear delineation of which agency or agencies would be responsible for contracting and overseeing a large procurement, such as a custom supercomputer system or a major hardware or software component of such a system. The plan would also include cost estimates for elements of the plan, but it would not be an overall budget. For example, planning for the development and acquisition of what the High End Computing Revitalization Task Force (HECRTF) report calls "leadership systems" would be part of this overall HEC plan, but the decisions about what to fund would not be made by the planners. Each new version of the plan would be critically reviewed by a panel of outside experts and updated in response to that review.
Appropriate congressional committees in the House and Senate would have the funding and oversight responsibility to ensure that the HEC plan meets the long-term needs of the nation. Both the House and Senate authorization and appropriation subcommittees and the Office of Management and Budget would require (1) that every budget request concerning supercomputing describe how the request is aligned with the HEC plan and (2) that an agency budget request not omit a supercomputing investment (for which it has responsibility according to the HEC plan) on which other agencies depend. Similarly, House and Senate appropriation committees would ensure (1) that budgets passed into law are consistent with the HEC plan and (2) that any negotiated budget reductions do not adversely affect other investments that depend on the reduced items. Consistency does not imply that every part of every request would be in the plan. Mission agencies sometimes face short-term needs to meet short-term deliverables that cannot be anticipated. New disruptive technologies sometimes provide unanticipated opportunities. However, revisions to the plan would be responsive to those needs and opportunities.
The use of an HEC plan would not preclude agencies from individual activities, nor would it prevent them from setting their own priorities. Rather, the intent is to identify common needs at an early stage and to leverage shared efforts to meet those needs, while minimizing duplicative efforts. For example,
• Research and development in supercomputing will continue to be the responsibility of the agencies that fund research and also use supercomputing, notably NSF, DOE (the National Nuclear Security Administration and the Office of Science), DoD, NSA, NASA, NOAA, and NIH. A subset of these agencies, working in loose coordination, will focus on long-term basic research in supercomputing technologies. Another subset of these agencies, working in tighter coordination, will be heavily involved in industrial supercomputing R&D.
• Each agency will continue to be responsible for the development of the domain-specific technologies, in particular domain-specific applications software, that satisfy its needs.
• The acquisition of supercomputing platforms will be budgeted for by each agency according to its needs. Joint planning and coordination of acquisitions will increase the efficiency of the procurement processes from the government viewpoint and will decrease variability and uncertainty from the vendor viewpoint. In particular, procurement overheads and delays can be reduced with multiagency acquisition plans whereby once a company wins a procurement bid issued by one agency, other agencies can buy versions of the winning system.
• Tighter integration in the funding of applied research and development in supercomputing will ease the burden on application developers and will enhance the viability of domestic suppliers.
Until such a structure is in place, the agencies whose missions rely on supercomputing must take responsibility for the future availability of leading supercomputing capabilities. That responsibility extends to the basic research on which future supercomputing depends. These agencies should cooperate as much as they can—leveraging one another’s efforts is always advantageous—but they must move ahead whether or not a formal long-term planning and coordination framework exists. More specifically, it continues to be the responsibility of the NSF, DoD, and DOE, as the primary sponsors of basic research in science and engineering, to support both the research needed to drive progress in supercomputing and
the infrastructure needs of those using supercomputing for their research. Similarly, it is the responsibility of those agencies whose mission is the safety and security of the nation or the health and well-being of its citizens to plan for future supercomputing needs essential to their missions, as well as to provide for present-day supercomputing needs.
Recommendation 2. The government agencies that are the primary users of supercomputing should ensure domestic leadership in those technologies that are essential to meet national needs.
Some critical government needs justify a premium for faster and more powerful computation that most or all civilian markets cannot justify commercially. Many of these critical needs involve national security. Because the United States may want to be able to restrict foreign access to some supercomputing technology, it will want to create these technologies here at home. Even if there is no need for such restrictions, the United States will still need to produce these technologies domestically, simply because it is unlikely that other countries will do so given the lack of commercial markets for many of these technologies. U.S. leadership in unique supercomputing technologies, such as custom architectures, is endangered by inadequate funding, inadequate long-term plans, and the lack of coordination among the agencies that are the major funders of supercomputing R&D. Those agencies should ensure that our country has the supercomputers it needs to satisfy critical requirements in areas such as cryptography and nuclear weapon stewardship as well as for systems that will provide the breakthrough capabilities that bring broad scientific and technological progress for a strong and robust U.S. economy.
The main concern of the committee is not that the United States is being overtaken by other countries, such as Japan, in supercomputing. Rather, it is that current investments and current plans are not sufficient to provide the future supercomputing capabilities that our country will need. That the first-place computer in the June 2004 TOP500 list was located in Japan is not viewed by this committee as a compelling indication of loss of leadership in technological capability. U.S. security is not necessarily endangered if a computer in a foreign country is capable of doing some computations faster than U.S.-based computers. The committee believes that had our country made an investment similar to Japan’s at the same time, it could have created a powerful and equally capable system. The committee’s concern is that the United States has not been making the investments that will guarantee its ability to create such a system in the future.
Leadership is measured by a broad technological capability to acquire and exploit effectively machines that can best reduce the time to solution of important computational problems. From this perspective, it is not the
Earth Simulator system that is worrisome but rather the fact that its construction was such a singular event. It seems that without significant government support, custom high-bandwidth processors are not viable products. Two of the three Japanese companies that were manufacturing such processors no longer do so, and the third (NEC) may also bow to market realities in the not-too-distant future, since the Japanese government now seems less willing to subsidize the development of leading supercomputing technologies. The software technology of the Earth Simulator is at least a decade old. The same market realities prevail in the United States. No fundamentally new high-bandwidth architecture has emerged as a product in the last few years in either Japan or the United States. No significant progress has occurred in commercially available supercomputing software for more than a decade. No investment matching the time scale and magnitude of the Japanese investment in the Earth Simulator has been made in the United States.
The agencies responsible for supercomputing can ensure that key supercomputing technologies, such as custom high-bandwidth processors, will be available to satisfy their needs only by maintaining our nation’s world leadership in these technologies. Recommendations 3 through 8 outline some of the actions that need to be taken by these agencies to maintain this leadership.
Recommendation 3. To satisfy its need for unique supercomputing technologies such as high-bandwidth systems, the government needs to ensure the viability of multiple domestic suppliers.
The U.S. industrial base must include suppliers on whom the government can rely to build custom systems for problems that are unique to the government's role. Since only a few units of such systems are ever needed, there is no broad market for them and hence no commercial off-the-shelf suppliers. Domestic supercomputing vendors are a source of both the components and the engineering talent necessary to construct low-volume systems for the government.
To ensure their continuing existence, the domestic suppliers must be able to sustain a viable business model. For a public company, that means having predictable and steady revenue recognizable by the financial market. A company cannot continue to provide cutting-edge products without R&D. At least two models of support have been used successfully: (1) an implicit guarantee of a steady purchase of supercomputing systems, giving the companies a steady income stream with which to fund ongoing R&D and (2) explicit funding for a company’s R&D. Stability is a key issue. Suppliers of such systems or components often are small companies that can easily lose viability; uncertainty can mean the loss of skilled personnel to other sectors of the larger computing industry or the loss of
investors. Historically, government priorities and technical directions have changed more frequently than technology lifetimes would justify, creating market instabilities. The chosen funding model must ensure stability. The agencies responsible for supercomputing might consider the model proposed by the UKHEC initiative in the United Kingdom, whereby the government solicits and funds proposals for the procurement of three successive generations of a supercomputer family over 4 to 6 years.
It is important to have multiple suppliers for any key technology, in order to maintain competition, to prevent technical stagnation, to provide diverse supercomputing ecosystems to address diverse needs, and to reduce risk. (Cray's near-death experience in the 1990s is a good example of such risk.) On the other hand, it is unrealistic to expect that such narrow markets will attract a large number of vendors. As happens for many military technologies, one may typically end up with only a few suppliers. The risk of stagnation is mitigated by the continued pressure coming from commodity supercomputer suppliers.
The most important unique supercomputing technology identified in this report is high-bandwidth, custom supercomputing systems. The vector systems developed by Cray have been the leading example of this technology. Cray is now the only domestic manufacturer of such systems. The R&D cost to Cray for a new product has been estimated by IDC to be close to $200 million; assuming a 3-year development cycle, this results in an annual R&D cost of about $70 million, or about $140 million per year for two vendors. Note that Cray has traditionally been a vertically integrated company that develops and markets a product stack that goes from chips and packaging to system software, compilers, and libraries. However, Cray seems to be becoming less integrated, and other suppliers of high-bandwidth systems may choose to be less integrated, resulting in a different distribution of R&D costs among suppliers. Other suppliers may also choose high-bandwidth architectures that are not vector.
Another unique supercomputing technology identified in this report is that of custom switches and custom, memory-connected switch interfaces. Companies such as Cray, IBM, and SGI have developed such technologies and have used them exclusively for their own products; the Cray Red Storm interconnect is a recent example. Myricom (a U.S. company) and Quadrics (a European company) develop scalable, high-bandwidth, low-latency interconnects for clusters but use a standard I/O bus (PCI-X) interface and sustain themselves through the broader cluster market. The R&D costs for such products are likely to be significantly lower than for a full custom supercomputer.
These examples are not meant to form an exhaustive list of leadership supercomputing technologies. The agencies that are the primary users of supercomputing should, however, establish such a list, aided by the
roadmap described in Recommendation 5, and should ensure that there are viable domestic suppliers.
Similar observations can be made about software for high-performance computing. Our ability to efficiently exploit leading supercomputing platforms is hampered by inadequate software support. The problem is not only the lack of investment in research but also, and perhaps more seriously, the lack of sustained investments needed to promote the broad adoption of new software technologies that can significantly reduce time to solution at the high end but that have no viable commercial market.
Recommendation 4. The creation and long-term maintenance of the software that is key to supercomputing requires the support of those agencies that are responsible for supercomputing R&D. That software includes operating systems, libraries, compilers, software development and data analysis tools, application codes, and databases.
The committee believes that the current low-level, uncoordinated investment in supercomputing software significantly constrains the effectiveness of supercomputing. It recommends larger and better targeted investments by those agencies that are responsible for supercomputing R&D.
The situation for software is somewhat more complicated than that for hardware: Some software—in particular, application codes—is developed and maintained by national laboratories and universities, and some software, such as the operating system, compiler, and libraries, is provided with the hardware platform by a vertically integrated vendor. The same type of software, such as a compiler or library, that is packaged and sold by one (vertical) vendor with the hardware platform may also be developed and maintained by a (horizontal) vendor as a stand-alone product that is available on multiple platforms. Additionally, an increasing amount of the software used in supercomputing is developed in an open source model. The same type of software, such as a communication library, may be freely available in open source and also available from vendors under a commercial license.
Different funding models are needed to accommodate these different situations. A key goal is to ensure the stability and longevity of organizations that maintain and evolve software. The successful evolution and maintenance of complex software systems are critically dependent on institutional memory—that is, on the continuous involvement of the few key developers that understand the software design. Stability and continuity are essential to preserve institutional memory. Whatever model of support is used, it should be implemented so that a stable organization with a lifetime of decades can maintain and evolve the software. Many of
the supercomputing software vendors are very small (tens of employees) and can easily fail or be bought out, even if they are financially viable. For example, several vendors of compilers and performance tools for supercomputing were acquired by Intel in the last few years. As a result, developers who were working on high-performance computing products shifted to work on technologies with a broader market. The open source model is not, per se, a guarantee of stability, because it does not ensure continuing stable support for the software.
It is also important to provide funding for software integration, since poor integration is often a major source of functional and performance bugs. Such integration was traditionally done by vertically integrated vendors, but new models are needed in the current, less integrated world of supercomputing.
As it invests in supercomputing software, the government must carefully balance its need to ensure the availability of software against the possibility of driving its commercial suppliers out of business by subsidizing their competitors, be they in government laboratories or in other companies. The government should not duplicate successful commercial software packages but should instead invest in technology that does not yet exist. When new commercial providers emerge, the government should purchase their products and redirect its own efforts toward technology that it cannot acquire off the shelf. HPSS and Totalview are examples of successful partnerships between government and the supercomputing software industry. NASTRAN and Dyna are examples of government-funded applications that were successfully transitioned to commercial suppliers.
Barriers to the replacement of application programming interfaces are very high owing to the large sunk investments in application software. Any change that significantly enhances our ability to program very large systems will entail a radical, coordinated change of many technologies, creating a new ecosystem. To make this change, the government needs long-term coordinated investments in a large number of interlocking technologies.
Recommendation 5. The government agencies responsible for supercomputing should underwrite a community effort to develop and maintain a roadmap that identifies key obstacles and synergies in all of supercomputing.
A roadmap is necessary to ensure that investments in supercomputing R&D are prioritized appropriately. The challenges in supercomputing are very significant, and the amount of ongoing research is quite limited. To make progress, it is important to identify and address the key roadblocks. Furthermore, technologies in different domains are interdependent:
Progress on a new architecture may require, in addition to computer architecture work, specific advances in packaging, interconnects, operating system structures, programming languages and compilers, and so forth. Thus, investments need to be coordinated. To drive decisions, one needs a roadmap of the technologies that affect supercomputing. The roadmap needs to have quantitative and measurable milestones.
Some examples of roadmap-like planning activities are the semiconductor industry’s roadmap, the ASC curves and barriers workshops, and the petaflops workshops. However, none of these is a perfect model. It is important that a supercomputing roadmap be driven both top-down by application needs and bottom-up by technology barriers and that mission needs as well as science needs be incorporated. Its creation and maintenance should be an open process that involves a broad community. That community should include producers—commodity as well as custom, components as well as full systems, hardware as well as software—and consumers from all user communities. The roadmap should focus on the evolution of each specific technology and on the interplay between technologies. It should be updated annually and undergo major revisions at suitable intervals.
The roadmap should be used by agencies and by Congress to guide their long-term research and development investments. Those roadblocks that will not be addressed by industry without government intervention need to be identified, and the needed research and development must be initiated. Metrics must be developed to support the quantitative aspects of the roadmap. It is also important to invest in some high-risk, high-return research ideas that are not indicated by the roadmap, to avoid being blindsided.
Recommendation 6. Government agencies responsible for supercomputing should increase their levels of stable, robust, sustained multiagency investment in basic research. More research is needed in all the key technologies required for the design and use of supercomputers (architecture, software, algorithms, and applications).
The top performance of supercomputers has increased rapidly in the last decades, but their sustained performance has lagged, and the productivity of supercomputing users has lagged as well.1 During the last decade the advance in supercomputing performance has been largely due to the advance in microprocessor performance driven by increased miniaturization, with limited contributions from increasing levels of parallelism.2

1. See, for example, Figure 1 in the HECRTF report, at <http://www.hpcc.gov/pubs/2004_hecrtf/20040702_hecrtf.pdf>.
It will be increasingly difficult for supercomputing to benefit from improvements in processor performance in the coming decades. For reasons explained in Chapter 5, the rate of improvement in single-processor performance is decreasing; chip performance is improved mainly by increasing the number of concurrent threads executing on a chip (an increase in parallelism). Additional parallelism is also needed to hide the increasing relative memory latency. Thus, continued improvement in supercomputer performance at current rates will require a massive increase in parallelism, requiring significant research progress in algorithms and software. As the relative latencies of memory accesses and global communications increase, the performance of many scientific codes will shrink relative to the performance of more cache-friendly and more loosely coupled commercial codes. The cost/performance advantage of commodity systems for these scientific codes will erode. As discussed in Chapter 5, an extrapolation of current trends clearly indicates the need for fundamental changes in the structure of supercomputing systems in the not-too-distant future. To effect these changes, new research in supercomputing architecture is also needed.
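The scale of the parallelism needed to hide memory latency can be illustrated with Little's law (outstanding operations = latency × throughput). The sketch below is illustrative only: the latency and bandwidth figures are hypothetical assumptions chosen to show the trend, not measurements from any system in this report.

```python
# Illustrative application of Little's law to latency hiding: to keep a
# memory system fully utilized, the number of memory words in flight must
# equal latency x delivery rate. All numbers are hypothetical.

def required_concurrency(latency_ns, bandwidth_gbs, word_bytes=8):
    """Outstanding 8-byte words needed to sustain full memory bandwidth."""
    words_per_ns = bandwidth_gbs / word_bytes   # 1 GB/s = 1 byte/ns
    return latency_ns * words_per_ns

# As effective latency and bandwidth both grow, the concurrency needed to
# hide latency grows multiplicatively.
past = required_concurrency(latency_ns=100, bandwidth_gbs=1)     # 12.5 words
future = required_concurrency(latency_ns=200, bandwidth_gbs=50)  # 1250 words

print(past, future)
```

Doubling the latency while bandwidth grows fiftyfold requires one hundred times the in-flight parallelism, which is why rising relative memory latency forces a massive increase in parallelism across the whole software stack.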
Perhaps as a result of the success of commodity-based systems, the last decade saw few novel technologies introduced into supercomputer systems and a reduction in supercomputing research investments. The number and size of supercomputing-related grants in computer architecture or computer software have decreased. As the pressure for fundamental changes grows, it is imperative to increase investments in supercomputing research.
The research investments should be balanced across architecture, software, algorithms, and applications. They should be informed by the supercomputing roadmap but not constrained by it. It is important to focus on technologies that have been identified as roadblocks and that are beyond the scope of industry investments in computing. It is equally important to support long-term speculative research in potentially disruptive technical advances. The research investment should also be informed by the “ecosystem” view of supercomputing—namely, that progress must come on a broad front of interrelated technologies rather than in the form of individual breakthroughs.
One of the needs of an ecosystem is for skilled and well-educated
people. Opportunities to educate and train supercomputing professionals should be part of every research program. Steady funding for basic research at universities, together with opportunities for subsequent employment at research institutions and private companies, might attract more students to prepare for a career in supercomputing.
Research should include a mix of small, medium, and large projects. Many small individual projects are necessary for the development of new ideas. A smaller number of large projects that develop technology demonstrations are needed to bring these ideas to maturity and to study the interaction between various technologies in a realistic environment. Such demonstration projects (which are different from product prototyping activities) should not be expected to be stable platforms for exploitation by users, because the need to maintain a stable platform conflicts with the ability to use the platform for experiments. It is important that the development of such demonstration systems have the substantial involvement of academic researchers, particularly students, to support the education of the new generation of researchers, and that the fruits of such projects not be proprietary. In Chapter 9, the necessary investments in such projects were estimated at about $140 million per year. This does not include investments in the development and use of application-specific software.
Large-scale research in supercomputing can occur in a vertical model, whereby researchers from multiple disciplines collaborate to design and implement one technology demonstration system. Or, it can occur in a horizontal model, in a center that emphasizes one discipline or focuses on the technology related to one roadblock in the supercomputing roadmap. A large effort focused on a demonstration system brings together people from many disciplines and is a good way of generating unexpected breakthroughs. However, such an effort must be constructed carefully so that each of the participants is motivated by the expectation that the collaboration will advance his or her research goals.
In its early days, supercomputing research generated many ideas that eventually became broadly used in the computing industry. Pipelining, multithreading, and multiprocessing are familiar examples. The committee expects that such influences will continue in the future. Many of the roadblocks faced today by supercomputing are roadblocks that affect all computing, but affect supercomputing earlier and to a more significant extent. One such roadblock is the memory wall,3 which is due to the slower progress in memory speeds than in processor speeds. Supercomputers
are disproportionately affected by the memory wall owing to the more demanding characteristics of supercomputing applications. There can be little doubt that solutions developed to solve this problem for supercomputers will eventually influence the broad computing industry, so that investments in basic research in supercomputing are likely to be of broad benefit to information technology.
Recommendation 7. Supercomputing research is an international activity; barriers to international collaboration should be minimized.
Research has always benefited from the open exchange of ideas and the opportunity to build on the achievements of others. The national leadership advocated in these recommendations is enhanced, not compromised, by early-stage sharing of ideas and results. In light of the relatively small community of supercomputing researchers, international collaborations are particularly beneficial. The climate modeling community, for one, has long embraced that view.
Research collaboration must include access to supercomputing systems. Many research collaborations involve colocation. Many of the best U.S. graduate students are foreigners, many of whom ultimately become citizens or permanent residents. Access restrictions based on citizenship hinder collaboration and are contrary to the openness that is essential to good research. Such restrictions will reduce the ability of research and industry to benefit from advances in supercomputing and will restrict the transfer of the most talented people and the most promising ideas to classified uses of supercomputing.
Restrictions on the import of supercomputers to the United States have not benefited the U.S. supercomputing industry and are unlikely to do so in the future. Restrictions on the export of supercomputers have hurt supercomputer manufacturers by restricting their market. Some kinds of export controls—on commodity systems, especially—lack any clear rationale, given that such systems are in fact built from widely available COTS components, most of which are manufactured overseas. It makes little sense to restrict sales of commodity systems built from components that are not export controlled.
Although the supercomputing industry is similar in ways to some military industries (small markets, small ecosystems, and critical importance to government missions), there are significant differences that increase the benefits and decrease the risks of a more open environment.
A faster computer in another country does not necessarily endanger U.S. security; U.S. security requires a broad technological capability to acquire and effectively exploit machines that can best reduce the time to solution of important computational problems. Such technological capability is embodied not in one platform or one code but in a broad community of researchers and developers in industry, academia, and government who collaborate and exchange ideas with as few impediments as possible.
The computer and semiconductor technologies are (still) moving at a fast pace and, as a result, supercomputing technology is evolving rapidly. The development cycles of supercomputers are only a few years long compared with decades-long cycles for many weapons. To maintain its vitality, supercomputing R&D must have strong ties to the broad, open research community.
The supercomputing market shares some key hardware and software components with the much larger mainstream computing markets. If the supercomputing industry is insulated from this larger market, there will be costly reinvention and/or costly delays. Indeed, the levels of investment needed to maintain a healthy supercomputing ecosystem pale when they are compared with the cost of a major weapon system. A more segregated R&D environment will inevitably lead to a higher price tag if fast progress is to be maintained.
Supercomputers are multipurpose (nuclear simulations, climate modeling, and so on). In particular, they can be used to support scientific research, to advance engineering, and to help solve important societal problems. If access to supercomputers were restricted, important public benefits would be lost. Moreover, the use of supercomputers for broader applications in no way precludes their use for defense applications.
Finally, advances in supercomputing technology can benefit the broader IT industry; application codes developed in national laboratories can benefit industrial users. Any restriction of this technology flow reduces the competitiveness of the U.S. industry.
Restrictions on the export of supercomputing technology may hamper international collaboration, reduce the involvement of the open research community in supercomputing, and reduce the use of supercomputers in research and in industry. The benefit of denying potential adversaries or proliferators access to key supercomputing technology has to be carefully weighed against the damage that export controls do to research within the United States, to the supercomputing industry, and to international collaborations.
Recommendation 8. The U.S. government should ensure that researchers with the most demanding computational requirements have access to the most powerful supercomputing systems.
Access to the most powerful supercomputers is important for the advancement of science in many disciplines. The committee believes that a model in which top supercomputing capabilities are provided by different agencies with different missions is a healthy model. Each agency is the primary supporter of certain research or mission-driven communities; each agency should have a long-term plan and budget for the acquisition of the supercomputing systems that are needed to support its users. The users should be involved in the planning process and should be consulted in setting budget priorities for supercomputing. Budget priorities should be reflected in the HEC plan proposed in Recommendation 1. In Chapter 9, the committee estimated that a healthy procurement process, one that would satisfy the capability supercomputing needs (but not the capacity needs) of the major agencies using supercomputing and that would include the platforms primarily used for research, would cost about $800 million per year. This estimate includes both platforms used for mission-specific tasks and platforms used to support science.
The NSF supercomputing centers have traditionally provided open access to a broad range of academic users. They have been responsive to their scientific users in installing and supporting software packages and providing help to both novice and experienced users. However, some of the centers in the PACI program have increased the scope of their activities, even in the face of a flat budget, to include research in networking and grid computing and to expand their education mission. The focus of their activity has shifted as their mission has broadened. The increases in scope have not been accompanied by sufficient increases in funding. The expanded mission and the flat budget have diluted the centers’ attention to the support of computational scientists with capability needs. Similar difficulties have arisen at DOE’s NERSC.
It is important to repair the current situation at NSF, in which the computational science users of supercomputing centers appear to have too little involvement in programmatic and budgetary planning. All the research communities in need of supercomputing have a continuing responsibility to help to provide direction for the supercomputing infrastructure that is used by scientists of a particular discipline and to participate in sustaining the needed ecosystems. These communities should prioritize funding for the acquisition and operation of the research supercomputing infrastructure against their other infrastructure needs. Further, such funding should clearly be separated from funding for computer and computational science and engineering research. Users of DOE and DoD centers have a similar responsibility to provide direction. This does not mean that supercomputing centers must be disciplinary. Indeed, multidisciplinary centers provide incentives for collaborations that would not occur otherwise, and they enable the participation of small communities. A multidisciplinary center should be supported by the agencies (such as NSF or NIH) that support the disciplines involved, but with serious commitment from the user communities supported by these agencies.
The planning and funding process followed by each agency must ensure stability from the users' viewpoint. Many research groups end up using their own computer resources, or they spend time ensuring that their codes run on a wide variety of systems, not necessarily because it is the most efficient strategy but because they believe it minimizes the risk of depending on systems they do not control. This strategy traps users into a lowest-common-denominator programming model, which in turn constrains the performance they might otherwise achieve by using more specialized languages and tools. More stability in the funding and acquisition process can ultimately lead to a more efficient use of resources. Finally, the mechanism used for allocating supercomputing resources must ensure that almost all of the computer time on capability systems is allocated to jobs for which that capability is essential. The Earth Simulator usage policies are illustrative. Supercomputers are scarce and expensive resources that should be used not to accommodate the largest number of users but to solve the largest, most difficult, and most important scientific problems.