Early in the 21st century, single-processor performance stopped growing exponentially, and it now improves at a modest pace, if at all. The abrupt shift is due to fundamental limits on the power efficiency of complementary metal oxide semiconductor (CMOS) integrated circuits (used in virtually all computer chips today) and apparent limits on what sorts of efficiencies can be exploited in single-core architectures. A sequential-programming model will no longer be sufficient to facilitate future information technology (IT) advances.
Although efforts to advance low-power technology are important, the only foreseeable way to continue advancing performance is with parallelism. To that end, the hardware industry recently began doubling the number of cores per chip rather than focusing solely on more performance per core and began deploying more aggressive parallel options, for example, in graphics processing units (GPUs). Attaining dramatic IT advances in the future will require programs and supporting software systems that can access vast parallelism. The shift to explicitly parallel hardware will fail unless there is a concomitant shift to useful programming models for parallel hardware. There has been progress in that direction: extremely skilled and savvy programmers can exploit vast parallelism (for example, in what has traditionally been referred to as high-performance computing), domain-specific languages flourish (for example, SQL and DirectX), and powerful abstractions hide complexity (for example, MapReduce). However, none of those developments comes close to the ubiquitous support for programming parallel hardware that is required to sustain
growth in computing performance and meet society’s expectations for IT. (See Box 5.1 for additional context regarding other aspects of computing research that should not be neglected while the push for parallelism in software takes place.)
The findings and results described in this report represent a serious set of challenges not only for the computing industry but also for the many sectors of society that depend on advances in IT and computation. The findings also pose challenges to U.S. competitiveness: a slowdown in the growth of computing performance will have global economic and political repercussions. The committee has developed a set of recommended actions aimed at addressing the challenges, but the fundamental power and energy constraints mean that even our best efforts may not offer a complete solution. This chapter presents the committee’s recommendations in two categories: research—the best science and engineering minds must be brought to bear; and practice—how we go about developing computer hardware and software today will form a foundation for future performance gains. Changes in education are also needed; the emerging generation of technical experts will need to understand quite different (and in some cases not yet developed) models of thinking about IT, computation, and software.
In light of the inevitable trend toward parallel architectures and emerging applications, one must ask whether existing applications are algorithmically amenable to decomposition on a parallel architecture. Algorithms based on context-dependent state machines are not easily amenable to parallel decomposition. Applications based on those algorithms have always been around and are likely to gain importance as security needs grow. Even so, there is a large amount of throughput parallelism in these applications, in that many independent tasks usually need to be processed simultaneously by a data center.
At the other extreme, there are applications that have obvious parallelism to exploit. The abundance of parallelism in a vast majority of those underlying algorithms is data-level parallelism. One simple example of data-level parallelism for mass applications is found in two-dimensional (2D) and three-dimensional (3D) media processing (image, signal, graphics, and so on), which has an abundance of primitives (such as blocks, triangles, and grids) that need to be processed simultaneously. Continuous growth in the size of input datasets (from the text-heavy Internet of the past to 2D-media-rich current Internet applications to emerging 3D Internet applications) has been important in the steady increase in available parallelism for these sorts of applications.

A large and growing collection of applications lies between those extremes. In these applications, there is parallelism to be exploited, but it is not easy to extract: it is less regular and less structured in its spatial and temporal control and in its data-access and communication patterns. One might argue that these applications have been the focus of the high-performance computing (HPC) research community for many decades and thus are well understood with respect to the aspects that are amenable to parallel decomposition. The research community also knows that the algorithms best suited to a serial machine (for example, quicksort, simplex, and Gaston) differ from their counterparts best suited to parallel machines (for example, mergesort, interior-point, and gSpan). Given the abundance of single-thread machines in mass computing, the commonly found implementations of these algorithms on mass-market machines are almost always the nonparallel, serial-friendly versions. Attempts to extract parallelism from the serial implementations are unproductive exercises and are likely to be misleading if they lead one to conclude that the original problem is inherently sequential. There is therefore an opportunity to benefit from the learning and experience of the HPC research community and to reformulate problems in terms amenable to parallel decomposition.

Box 5.1
React, But Don’t Overreact, to Parallelism

As this report makes clear, software and hardware researchers and practitioners should address important concerns regarding parallelism. At such critical junctures, enthusiasm seems to dictate that all talents and resources be applied to the crisis at hand. Taking the longer view, however, a prudent research portfolio must include concomitant efforts to advance all systems aspects, lest they become tomorrow’s bottlenecks or crises.

For example, in the rush to innovate on chip multiprocessors (CMPs), it is tempting to ignore sequential core performance and to deploy many simple cores. That approach may prevail, but history and Amdahl’s law suggest caution. Three decades ago, a hot technology was vectors. Pioneering vector machines, such as the Control Data STAR-100 and Texas Instruments ASC, advanced vector technology without great concern for improving other aspects of computation. Seymour Cray, in contrast, designed the Cray-1 to have great vector performance as well as to be the world’s fastest scalar computer.1 Ultimately, his approach prevailed, and the early machines faded away.

Moreover, Amdahl’s law raises concern.2 Amdahl’s limit argument assumed that a fraction, P, of software execution is infinitely parallelizable without overhead, whereas the remaining fraction, 1 - P, is totally sequential. From that assumption, it follows that the speedup with N cores—execution time on one core divided by execution time on N cores—is governed by 1/[(1 - P) + P/N]. Many learn that equation, but it is still instructive to consider its harsh numerical consequences. For N = 256 cores and a parallel fraction P = 99%, for example, speedup is bounded by 72. Gustafson3 made good “weak scaling” arguments for why some software will fare much better. Nevertheless, the committee is skeptical that most future software will avoid sequential bottlenecks. Even such a very parallel approach as MapReduce4 has near-sequential activity as the reduce phase draws to a close.

For those reasons, it is prudent to continue work on faster sequential cores, especially with an emphasis on energy efficiency (for example, on large content-addressable-memory structures) and perhaps on-demand scaling (to be responsive to software bottlenecks). Hill and Marty5 illuminate some potential opportunities by extending Amdahl’s law with a corollary that models CMP hardware. They find, for example, that as Moore’s law provides more transistors, many CMP designs benefit from increasing sequential core performance and from considering asymmetric (heterogeneous) designs in which some cores provide more performance (statically or dynamically).

Finally, although the focus in this box is on core performance, many other aspects of computer design continue to require innovation to keep systems balanced. Memories should be larger, faster, and less expensive. Nonvolatile storage should be larger, faster, and less expensive and may merge with volatile memory. Networks should be faster (higher-bandwidth) and less expensive, and interfaces to networks may need to become more closely coupled to host nodes. All components must be designed for energy-efficient operation and for even greater energy efficiency when idle.

1Richard M. Russell, 1978, The Cray-1 computer system, Communications of the ACM 21(1): 63-72.
2Gene M. Amdahl, 1967, Validity of the single-processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings, Atlantic City, N.J., April 18-20, 1967, pp. 483-485.
3John L. Gustafson, 1988, Reevaluating Amdahl’s law, Communications of the ACM 31(5): 532-533.
4Jeffrey Dean and Sanjay Ghemawat, 2004, MapReduce: Simplified data processing on large clusters, Symposium on Operating System Design and Implementation, San Francisco, Cal., December 6-8, 2004.
5Mark D. Hill and Michael R. Marty, 2008, Amdahl’s law in the multicore era, IEEE Computer 41(7): 33-38, available online at http://www.cs.wisc.edu/multifacet/papers/tr1593_amdahl_multicore.pdf.
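The Amdahl bound and the Hill-Marty corollary discussed in Box 5.1 are easy to explore numerically. The sketch below adopts Hill and Marty’s modeling assumption that a core built from r base-core-equivalent (BCE) resources delivers sequential performance of about √r:

```python
import math

def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores when fraction p is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)

def perf(r):
    """Hill-Marty assumption: a core built from r BCEs runs sqrt(r) times faster."""
    return math.sqrt(r)

def asymmetric_speedup(p, n, r):
    """Hill-Marty asymmetric CMP: one big core of r BCEs plus n - r base cores.
    The sequential fraction runs on the big core alone; the parallel fraction
    uses the big core and all the base cores together."""
    return 1.0 / ((1.0 - p) / perf(r) + p / (perf(r) + n - r))

# The box's example: 256 cores, 99% parallel -> speedup capped near 72.
print(amdahl_speedup(0.99, 256))         # ~72.1

# Same 256-BCE budget, but 64 BCEs devoted to one fast core: a higher bound,
# illustrating why asymmetric designs can beat a sea of simple cores.
print(asymmetric_speedup(0.99, 256, 64))
```

The second figure exceeds the first because the fast core attacks the sequential bottleneck directly, which is exactly the argument for retaining strong sequential performance.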
Three additional observations are warranted in the modern context of data-intensive connected computing:
- In the growing segment of the entertainment industry, in contrast with the scientific computing requirements of the past, approximate or sometimes even incorrect solutions are often good enough if end users are insensitive to the details. An example is cloud simulation for gaming compared with cloud simulation for weather prediction. Looser correctness requirements almost always make problems more amenable to parallel algorithms because strict dependence requires synchronized communication, whereas an approximation often can eliminate communication and synchronization.
- For a fixed problem size, the serial fraction of any parallel algorithm will eventually dominate performance—a manifestation of Amdahl’s law (described in Box 2.2) in the regime typically referred to as strong scaling. However, if the problem size continues to grow, one observes continuously improving performance scaling on a parallel architecture, provided that it can effectively handle the larger data input. This weak-scaling behavior is the so-called Gustafson corollary1 to Amdahl’s law. Current digitization trends are leading to input-dataset scaling for most applications (for example, today there might be 1,000 songs on a typical iPod, but in another couple of years there may be 10,000).
- Massive, easily accessible real-time datasets have turned some previously sparse inputs into much denser ones. This has at least two important algorithmic implications: the problem becomes more regular and hence more amenable to parallelism, and better training data yield better classification accuracies, making additional parallel formulations usable in practice. Examples include scene completion in photographs2 and language-neutral translation systems.3
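The contrast between the two scaling regimes in the observations above can be made concrete; the 99-percent-parallel workload here is an illustrative assumption:

```python
def amdahl_speedup(p, n):
    """Strong scaling: the problem size is fixed while cores are added,
    so the serial fraction (1 - p) caps the achievable speedup."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    """Weak scaling (Gustafson): the parallel work grows with n while the
    serial work stays fixed, giving scaled speedup (1 - p) + p * n."""
    return (1.0 - p) + p * n

# Adding cores to a fixed problem saturates; growing the input keeps scaling.
for n in (16, 256, 4096):
    print(n, amdahl_speedup(0.99, n), gustafson_speedup(0.99, n))
```

Under strong scaling the speedup stalls below 100 even at 4,096 cores, whereas the weak-scaling figure keeps growing with the input, which is why dataset growth (the iPod example above) matters so much for parallel architectures.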
1John L. Gustafson, 1988, Reevaluating Amdahl’s law, Communications of the ACM 31(5): 532-533.
2See James Hays and Alexei A. Efros, 2008, Scene completion using millions of photographs, Communications of the ACM 51(10): 87-94.
3See Jim Giles, 2006, Google tops translation ranking, Nature.com, November 7, 2006, available online at http://www.nature.com/news/2006/061106/full/news061106-6.html.

For many of today’s applications, the underlying algorithms in use do not assume or exploit parallel processing explicitly, except in the cases described above. Instead, software creators typically depend, implicitly or explicitly, on compilers and other layers of the programming environment to parallelize where possible, leaving the software developer free to think sequentially and focus on higher-level issues. That state of affairs will need to change: a fundamental focus on parallelism will be needed in designing solutions to specific problems as well as in general programming paradigms and models.
Recommendation: Invest in research in and development of algorithms that can exploit parallel processing.
Programming Methods and Systems
Many of today’s programming models, languages, compilers, hypervisors, and operating systems are targeted primarily at single-core hardware. In the future, all these layers of the stack must explicitly target parallel hardware. The intellectual keystone of this endeavor is rethinking programming models. Programmers must have appropriate models of computation that express application parallelism in such a way that diverse and evolving hardware and software systems can balance computation and minimize communication among multiple computational units. There was a time in the late 1970s when even the conventional sequential-programming model was thought to be an eventual limiter of software creation, but better methods and training largely ameliorated that concern. We need advances in programmer productivity for parallel systems similar to the advances brought first by structured programming languages, such as Fortran and C, and later by managed programming languages, such as C# and Java.
The models themselves may or may not be explicitly parallel; it is an open question whether and when most programmers should be exposed to explicit hardware parallelism. The committee does not call for a singular programming model, because a unified solution may or may not exist. Instead, it recommends the exploration of alternative models—perhaps domain-specific—that can serve as sources of possible future unification. Moreover, the committee expects that some programming models will favor ease of use by a broad base of programmers who are not necessarily expert whereas others will target expert programmers who seek the highest performance for critical subsystems that get extensively reused.
Additional research is needed in the development of new libraries and new programming languages that embody the new programming models described above. Development of such libraries will facilitate rapid prototyping of complementary and competing ideas. The committee expects that some of the languages will be easier for more typical programmers to use—that is, they will appear on the surface to be sequential or declarative—and that others will target efficiency and, consequently, expert programmers.
New programming languages—especially those whose abstractions are far from the underlying parallel hardware—will require new compilation and runtime support. Fortress, Chapel, and X10 are three recently proposed general-purpose parallel languages, but none of them has yet developed a strong following.4 Experience has shown that it is generally exceedingly difficult to parallelize sequential code effectively—or even to parallelize and redesign highly sequential algorithms. Nevertheless, we must redouble our efforts on this front, in part by changing the languages, targeting specific domains, and enlisting new hardware support.
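A toy illustration (not drawn from the report) of why sequential code resists automatic parallelization: a loop whose iterations are independent parallelizes trivially, but a loop-carried dependence forces the iterations to run in order unless the algorithm itself is redesigned.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))

# Independent iterations: each element is transformed in isolation, so the
# work can be farmed out to a pool of workers in any order.
with ThreadPoolExecutor() as pool:
    squares = list(pool.map(lambda x: x * x, data))

# Loop-carried dependence: each step consumes the previous step's result
# (a running sum), so this loop, as written, cannot be split across workers.
# Exposing parallelism here requires redesigning the algorithm (for example,
# as a parallel prefix scan), not merely recompiling the loop.
prefix = []
total = 0
for x in data:
    total += x
    prefix.append(total)

print(squares[7], prefix[7])  # → 49 28
```

This is the gap compilers face: the first pattern is mechanically detectable, while the second demands the kind of algorithmic reformulation discussed above.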
We also need more research in system software for highly parallel systems. Although the hypervisors and operating systems of today can handle some modest parallelism, future systems will include many more cores (and multithreaded contexts), whose allocation, load-balancing, and data communication and synchronization interactions will be difficult to handle well. Solving those problems will require a rethinking of how computation resources are viewed, much as increased physical memory size led to virtual memory a half-century ago.
Recommendation: Invest in research in and development of programming methods that will enable efficient use of parallel systems not only by parallel systems experts but also by typical programmers.
Computer Architecture and Hardware
Most 20th-century computers used a single sequential processor, but many larger computers—hidden in the backroom or by the Internet—harnessed multiple cores on separate chips to form a symmetric multiprocessor (SMP). When industry was unable to use more transistors on a chip for a faster core effectively, it turned, by default, to implementing multiple cores per chip to provide an SMP-like software model. In addition, special-purpose processors—notably GPUs and digital signal processing (DSP) hardware—exploited parallelism and were very successful in important niches.
4For more on Fortress, see the website of the Project Fortress community, at http://projectfortress.sun.com/Projects/Community. For more on Chapel, see the website The Chapel parallel programming language, at http://chapel.cray.com. For more on X10, see the website The X10 programming language, at http://x10.codehaus.org/.
5James Larus, 2009, Spending Moore’s dividend, Communications of the ACM 52(5): 62-69.

Researchers must now determine the best way to spend the transistor bounty still provided by Moore’s law.5 On the one hand, we must examine and refine CMPs and associated architectural approaches. But CMP architectures bring numerous issues to the fore. Will multiple cores work in most computer deployments, such as in desktops and even in mobile phones? Under what circumstances should some cores be more capable than others or even use different instruction-set architectures? How can cores be harnessed together temporarily in an automated or semiautomated fashion to overcome sequential bottlenecks? What mechanisms and policies will best exploit locality and ease communication? How should synchronization and scheduling be handled? How will challenges associated with power and energy be addressed? What do the new architectures mean for system-level features, such as reliability and security?
Research in computer architecture must focus on providing useful, programmable systems driven by important applications. It is well known that customizing hardware for a specific task yields more efficient and higher-performance hardware. DSP chips are one example. Research is needed to understand application characteristics to drive parallel-hardware design. There is a bit of a chicken-and-egg problem. Without effective CMP hardware, it is hard to motivate programmers to build parallel applications; but it is also difficult to build effective hardware without parallel applications. Because of the lack of parallel applications, hardware designers are at risk of developing idiosyncratic CMP hardware artifacts that serve as poor targets for applications, libraries, compilers, and runtime systems. In some cases, progress may be facilitated by domain-specific systems that may lead to general-purpose systems later.
CMPs have now inherited the computing landscape from performance-stalled single cores. To promote robust, long-term growth, however, we need to look for alternatives to CMPs. Some of the alternatives may prove better; some may pioneer improvements in CMPs; and even if no alternative proves better, we would then know that CMPs have withstood the assault of alternatives. The research could eschew conventional cores. It could, for example, view the chip as a tabula rasa of billions of transistors, which translates to hundreds of functional units; the best organization of those units into a programmable architecture is an open question. Nevertheless, researchers must be keenly aware of the need to enable useful, programmable systems. Examples include evolving GPUs, game processors, or computational accelerators used as coprocessors toward more general-purpose programming, and exploiting special-purpose, energy-efficient engines at some level of granularity for computations such as fast Fourier transforms, codecs, and encryption. Other tasks to which increased computational capability could be applied include architectural support for machine learning; communication compression, decompression, encryption, and decryption; and dedicated engines for GPS, networking, human interfaces, search, and video analytics. Those approaches have demonstrated potential advantages in performance and energy efficiency relative to a more conservative CMP approach.
Ultimately, we must question whether the CMP-architecture direction, as currently defined, is a good approach for designing most computers. The current CMP architecture preserves object-code compatibility, the heart of the architectural franchise that keeps such companies as Intel and AMD investing heavily. Despite their motivation and ability to expend resources, if systems with CMP architectures cannot be effectively programmed, an alternative will be needed. Is using homogeneous processors in CMP architectures the best approach, or will computer architectures that include multiple but heterogeneous cores be more effective—for example, a single high-performance but power-inefficient processor for programs that are stubbornly sequential and many power-efficient but lower-performance cores for other applications? Perhaps truly effective parallel hardware needs to follow a model that does not assume shared-memory parallelism, instead exploiting single-instruction multiple-data approaches, streaming, dataflow, or other paradigms yet to be invented. Are there other important niches like those exploited by GPUs and DSPs? Alternatively, will cores support more graphics and GPUs support more general-purpose programs, so that the line between the two blurs? And most important, are any of those alternatives sufficient to keep the industry driving forward at a pace that can avoid the challenges described elsewhere?
We may also need to consider fundamentally rethinking the nature of hardware in light of today’s near-universal connectivity to the Internet. The trend is likely to accelerate. When Google needed to refine the general Internet search problem, it used the MapReduce paradigm so that it could easily and naturally harness the computational horsepower of a very large number of computer systems. Perhaps an equivalent basic shift in how we think about engineering computer systems themselves ought to be considered.
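The MapReduce style of harnessing many machines can be sketched with a single-process word-count toy (the function names are illustrative, not Google’s API). Note that the final reduce step is the near-sequential tail flagged in Box 5.1:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs; in a real deployment each document
    # could be processed on a different machine, fully in parallel.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key before reduction.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values; collecting the final results
    # is the phase with the least available parallelism.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the map phase is parallel", "the reduce phase has a sequential tail"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])   # → 2
```

The value of the paradigm is that programmers write only the map and reduce functions; the framework handles distribution, fault tolerance, and the shuffle.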
The slowing of growth in single-core performance provides the best opportunity to rethink computer hardware since the von Neumann model was developed in the 1940s. While a focus on the new research challenges is critical, continuing investments are needed in new computation substrates whose underlying power efficiency promises to be fundamentally better than silicon-based CMOSs. In the best case, investment will yield devices and manufacturing methods—as yet unforeseen—that will dramatically surpass the transistor-based integrated circuit. In the worst case, no new technology will emerge to help solve the problems. It is therefore essential to invest in parallel approaches, as outlined previously, and to do so now. Performance is needed immediately, and society cannot wait the decade or two needed to refine a new technology, which may or may not even be on the horizon. Moreover, even if we discover a groundbreaking new technology, the investment in parallelism would not be wasted, inasmuch as it is very likely that advances in parallelism would exploit
new technology as well.6 Substantial research investment should focus on approaches that eschew conventional cores and develop new experimental structures for each chip’s billions of transistors.
Recommendation: Invest in research on and development of parallel architectures driven by applications, including enhancements of chip multiprocessor systems and conventional data-parallel architectures, cost-effective designs for application-specific architectures, and support for radically different approaches.
Computer scientists and engineers manage complexity by separating interface from implementation. In conventional computer systems, the separation is recursive and forms the traditional computing stack: applications, programming language, compiler, runtime and virtual machine environments, operating system, hypervisor, and architecture. The committee has expressed above and in Chapter 4 the need for innovation with reference to that stack. However, some long-term work should focus on whether the von Neumann stack is best for our parallel future. The examination will require teams of computer scientists in many subdisciplines. Ideas may focus on changing the details of an interface (for example, new instructions) or even on displacing a portion of the stack (for example, compiling straight down to field-programmable gate arrays). Work should explore first what is possible and later how to move IT from where it is today to where we want it to be.
Recommendation: Focus long-term efforts on rethinking of the canonical computing “stack”—applications, programming language, compiler, runtime, virtual machine, operating system, hypervisor, and architecture—in light of parallelism and resource-management challenges.
6For example, in the 1930s AT&T could see the limitations of relays and vacuum tubes for communication switches and began the search for solid-state devices. Ultimately, AT&T Bell Labs discovered the solid-state semiconductor transistor, which, after several generations of improvements, became the foundation of today’s IT. Even earlier, the breakthrough innovation of the stored-program computer architecture (EDSAC) replacing the patch-panel electronic calculator (ENIAC) changed the fundamental approach to computing and opened the door for the computing revolution of the last 60 years. See Arthur W. Burks, Herman H. Goldstine, and John von Neumann, 1946, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, Princeton, N.J.: Institute for Advanced Study, available online at http://www.cs.unc.edu/~adyilie/comp265/vonNeumann.html.

Finally, the fundamental question of power efficiency merits considerable research attention. Chapter 3 explains in great detail the power limitations that we are running up against with CMOS technology. But the power challenges go beyond chip and architectural considerations and warrant attention at all levels of the computing system. New parallel-programming models and approaches will also have an effect on power needs. Thus, research and development efforts are needed in multiple dimensions, with high priority going to software, then to application-specific devices, and then, as described earlier in this report, to alternative devices.7
Recommendation: Invest in research and development to make computer systems more power-efficient at all levels of the system, including software, application-specific approaches, and alternative devices. Such efforts should address ways in which software and system architectures can improve power efficiency, such as by exploiting locality and the use of domain-specific execution units.
The need for power efficiency at the processor level was explored in detail in Chapter 3, which examined the decreasing rate of energy-use reduction in silicon technology as feature sizes shrink. One consequence of that trend is a flattening of the energy efficiency of computing devices; that is, a given level of performance improvement from a new generation of devices comes with a higher energy need than was the case in previous generations. The increased energy need has broad implications for the sustainability of computing growth from an economic and environmental perspective. That is particularly true for the kinds of server-class systems relied on by businesses and by users of cloud-computing services.8
7Indeed, a new National Science Foundation science and technology center, the Center for Energy Efficient Electronics Science (ES3), has recently been announced. The press release for the center quotes center Director Eli Yablonovitch: “There has been great progress in making transistor circuits more efficient, but further scientific breakthroughs will be needed to achieve the six-orders-of-magnitude further improvement that remain before we approach the theoretical limits of energy consumption.” See Sarah Yang, 2010, NSF awards $24.5 million for center to stem increase of electronics power draw, UC Berkeley News, February 23, 2010, available online at http://berkeley.edu/news/media/releases/2010/02/23_nsf_award.shtml.
8For more on data centers, their design, energy efficiency, and so on, see Luiz Barroso and Urs Holzle, 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, San Rafael, Cal.: Morgan & Claypool, available online at http://www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006.

If improvements in the energy efficiency of computing devices flatten out while hardware-cost improvements continue at near-historical rates, there will be a shift in the economic costs of computing. The cost basis for deploying computer servers will change as energy-related costs begin to grow as a fraction of total IT expenses. To some extent, that shift has already been observed by researchers and IT professionals, and it is partially responsible for the increased attention being given to so-called green-computing efforts.9
The following simple model illustrates the relative weight of two of the main components of IT expenses for large data centers: server-hardware depreciation and electricity consumption. Assume a data center filled mostly with a popular midrange server system that is marketed as a high-efficiency system: a Dell PowerEdge Smart 2950 III. As of December 2008, a reasonable configuration of the system was priced at about US$6,000 and may consume from 208 W (at idle) to 313 W (under scientific workload) with an average consumption estimated at 275 W.10 When the system is purchased as part of a large order, vendors typically offer discounts of at least 15 percent, bringing the actual cost closer to US$5,000. With servers having an operational lifetime of about 4 years, the total energy used by this server in operation is 9,636 kWh, which translates to US$674.52 if it is using the U.S. average industrial cost of electricity for 2008, US$0.0699/kWh.11 The typical energy efficiency of data-center facilities can multiply IT power consumption by 1.8-2.0,12 which would result in an actual electricity cost of running the server of up to about US$1,300.
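The arithmetic of the model above can be reproduced directly. This sketch uses the figures quoted in the text; the result agrees with the text’s US$674.52 to within about a dollar of rounding:

```python
HOURS_PER_YEAR = 8760

avg_power_w = 275          # estimated average server draw, in watts
lifetime_years = 4         # assumed operational lifetime
price_per_kwh = 0.0699     # 2008 U.S. average industrial electricity price, $/kWh
pue = 2.0                  # facility overhead multiplier (quoted range 1.8-2.0)

# Lifetime energy drawn by the server itself.
energy_kwh = avg_power_w / 1000 * HOURS_PER_YEAR * lifetime_years

# Direct electricity cost, then the cost including data-center overhead.
it_electricity_cost = energy_kwh * price_per_kwh
facility_cost = it_electricity_cost * pue

print(round(energy_kwh))           # → 9636
print(round(it_electricity_cost))  # ~ $674
print(round(facility_cost))        # ~ $1,347, the "up to about US$1,300" figure
```

Against the roughly US$5,000 discounted hardware price, this puts lifetime electricity at about one-fourth of hardware cost, the ratio used in the next paragraph.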
9See, for example, Maury Wright’s article, which examines improving power-conversion efficiency (arguably low-hanging fruit among the suite of challenges that need to be addressed): Maury Wright, 2009, Efficient architectures move sources closer to loads, EE Times Design, January 26, 2009, available online at http://www.eetimes.com/showArticle.jhtml?articleID=212901943&cid=NL_eet. See also Randy H. Katz, 2009, Tech titans building boom, IEEE Spectrum, February 2009, available online at http://www.spectrum.ieee.org/green-tech/buildings/tech-titans-building-boom.
10See an online Dell power calculator in Planning for energy requirements with Dell servers, storage, and networking, available online at http://www.dell.com/content/topics/topic.aspx/global/products/pedge/topics/en/config_calculator?c=us&cs=555&l=en&s=biz.
11See U.S. electric utility sales at a site of DOE’s Energy Information Administration: 2010, U.S. electric utility sales, revenue and average retail price of electricity, available online at http://www.eia.doe.gov/cneaf/electricity/page/at_a_glance/sales_tabs.html.
12See the TPC-C executive summary for the Dell PowerEdge 2900 at the Transactions Processing Performance Council Web site, June 2008, PowerEdge 2900 Server with Oracle Database 11g Standard Edition One, available online at http://www.tpc.org/results/individual_results/Dell/Dell_2900_061608_es.pdf.

According to that rough model, electricity costs for the server could correspond to about one-fourth of its hardware costs. If hardware-cost efficiency (performance/hardware costs) continues to improve at historical rates but energy efficiency (performance/electricity costs) stops improving, the electricity costs would surpass hardware costs within 3 years. At that point, electricity use could become a primary limiting factor in the growth of aggregate computing performance. Another implication of such a scenario is that most IT expenses would then be funding development and innovation not in the computing field but in the energy generation and distribution sectors of the economy, and this would adversely affect the virtuous cycle described in Chapter 2 that has propelled so many advances in computing technology.
Energy use could curb the growth in computing performance in another important way: by consuming too much of the planet’s energy resources. We are keenly aware today of our planet’s limited energy budget, especially for electricity generation, and of the environmental harm that can result from ignoring such limitations. Computing confers an immense benefit on society, but that benefit is offset in part by the resources that it consumes. As computing becomes more pervasive and the full value to society of the field’s great advances over the last few decades begins to be recognized, its energy footprint becomes more noticeable.
An Environmental Protection Agency report to Congress in 200713 states that servers consumed about 1.5 percent of the total electricity generated in the United States in 2006 and that server energy use had doubled from 2000 to 2006. The same report estimated that under current efficiency trends server power consumption could double once more from 2006 to 2011—growth that would correspond to the output of 10 new power plants (about 5 GW).14 An interesting way to understand the effect of such growth rates is to compare them with the projections for growth in electricity generation in the United States. The U.S. Department of Energy estimated that about 87 GW of new summer generation capacity would come on line in 2006-2011—an increase of less than 9 percent in that period.15
On the basis of those projections, growth in server energy use is outpacing growth in overall electricity use by a wide margin; server use is expected to grow at about 14 percent a year compared with overall electricity generation at about 1.74 percent a year. If those rates are maintained, server electricity use will surpass 5 percent of the total U.S. generating capacity by 2016.
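The compounding behind that projection can be made concrete with a small calculation. The 2006 baseline of 1.5 percent and both growth rates come from the text; reading the server rate as the five-year doubling rate (about 14.9 percent a year) is our interpretation:

```python
# Compound-growth comparison: server electricity use vs. total U.S.
# generation, using the report's figures.
server_growth = 2 ** (1 / 5) - 1   # doubling every 5 years -> ~14.9%/yr
total_growth = 0.0174              # projected growth in total generation

share_2006 = 1.5                   # servers' share of U.S. generation (%)
years = 10                         # 2006 -> 2016
share_2016 = share_2006 * ((1 + server_growth) / (1 + total_growth)) ** years

print(f"server growth rate: {server_growth:.1%}")   # 14.9%
print(f"projected 2016 share: {share_2016:.2f}%")   # just above 5%
```

Under those assumptions the servers' share of generation crosses 5 percent in 2016, consistent with the estimate in the text.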
The net environmental effect of more (or better) computing capabilities goes beyond simply accounting for the resources that server-class computers consume. It must also include the energy and emission savings that are enabled by additional computing capacity. A joint report by The Climate Group and the Global e-Sustainability Initiative (GeSI) states that although the worldwide carbon footprint of the computing and telecommunication sectors might triple from 2002 to 2020, the same sectors could deliver over 5 times their footprint in emission savings in other industries (including transportation and energy generation and transmission).16
13See the Report to Congress of the U.S. Environmental Protection Agency (EPA) on the Energy Star Program (EPA, 2007, Report to Congress on Server and Data Center Energy Efficiency Public Law 109-431, Washington, D.C.: EPA, available online at http://www.energystar.gov/ia/partners/prod_development/downloads/EPA_Datacenter_Report_Congress_Final1.pdf).
14An article in the EE Times suggests that data-center power requirements are increasing by as much as 20 percent per year. See Mark LaPedus, 2009, Green-memory movement takes root, EE Times, May 18, 2009, available online at http://www.eetimes.com/showArticle.jhtml?articleID=217500448&cid=NL_eet.
15Find information on DOE planned nameplate capacity additions from new generators at DOE, 2010, Planned nameplate capacity additions from new generators, by energy source, available online at http://www.eia.doe.gov/cneaf/electricity/epa/epat2p4.html.
Whether that prediction is accurate depends largely on how smartly computing is deployed in those sectors. It is clear, however, that even if the environmental effect of computing machinery is dwarfed by the environmental savings made possible by its use, computing will remain a large consumer of electricity, so curbing consumption of natural resources should continue to have high priority.
Transitioning of Legacy Applications
It will take time for results of the proposed research agenda to come to fruition. Society has an immediate and pressing need to use current and emerging chip multiprocessor systems effectively. To that end, the committee offers two recommendations related to current development and engineering practices.
Although we expect long-term success in the effective use of parallel systems to come from rethinking architectures and algorithms and developing new programming methods, this strategy will probably sacrifice the backward-platform and cross-platform compatibility that has been an economic cornerstone of IT for decades. To salvage value from the nation's current, substantial IT investment, we should seek ways to bring sequential programs into the parallel world. On the one hand, we expect no silver bullets to enable automatic black-box transformation. On the other hand, it is prohibitively expensive to rewrite many applications. In fact, the committee believes that industry will not migrate its installed base of software to a new parallel future without good, reliable tools to facilitate the migration. Not only can industry not afford a brute-force migration financially, but also it cannot take the chance that latent bugs will manifest, potentially many years after the original software engineers created the code being migrated. If we cannot find a way to smooth the transition, this single item could stall the entire parallelism effort, and innovation in many types of IT might well stagnate. The committee urges industry and academe to develop tools that provide a middle ground and give experts "power tools" that can assist with the hard work that will be necessary for vastly increased parallelization. In addition, emphasis should be placed on tools and strategies to enhance code creation, maintenance, verification, and adaptation. All are essential, and current solutions, which are often inadequate even for single-thread software development, are unlikely to be useful for parallel systems.
16Global e-Sustainability Initiative, 2008, Smart2020: Enabling the Low Carbon Economy in the Information Age, Brussels, Belgium: Global e-Sustainability Initiative, available online at http://www.smart2020.org.
Recommendation: Invest in the development of tools and methods to transform legacy applications to parallel systems.
Competition in the private sector often (appropriately) encourages the development of proprietary interfaces and implementations that seek to create competitive advantage. In computer systems, however, a lack of standardization can also impede progress when many incompatible approaches allow none to achieve the benefits of wide adoption and reuse—and this is a major reason that industry participates in standards efforts. We therefore encourage the development of programming interface standards. Standards can facilitate wide adoption of parallel programming and yet encourage competition that will benefit all. Perhaps a useful model is the one used for Java: the standard was initially developed by a small team (not a standards committee), protected in incubation from devolving into many incompatible variants, and yet made public enough to facilitate use and adoption by many cooperating and competing entities.
Recommendation: To promote cooperation and innovation by sharing, encourage development of open interface standards for parallel programming rather than proliferating proprietary programming environments.
As described earlier in this report, future growth in performance will be driven by parallel programs. Because most programs now in use are not parallel, we will need to rely on the creation of new parallel programs. Who will create those programs? Students must be educated in parallel programming at both the undergraduate and the graduate levels, both in computer science and in other domains in which specialists use computers.
Current State of Programming
One view of the current pool of practicing programmers is that there is a large disparity between the very best programmers and the rest in both time to solution and elegance of solution. The conventional wisdom in the field is that the difference in skills and productivity between the average programmer and the best programmers is a factor of 10 or more.17 Opinions may vary on the specifics, but the pool of programmers breaks down roughly as follows:
A. A few highly trained, highly skilled, and highly productive computer science (CS) system designers.
B. A few highly trained, highly skilled, and highly productive CS application developers.
C. Many moderately well-trained (average), moderately productive CS system developers.
D. Many moderately productive developers without CS training.
The developers who are not CS-trained are domain scientists, business people, and others who use computers as a tool to solve their problems. There are many such people. It is possible that fewer of those people will be able to program well in the future, because of the difficulty of parallel programming. However, if the CS community develops good abstractions and programming languages that make it easy to program in parallel, even more of those types of developers will be productive.
There is some chance that we will find solutions in which most programmers still program sequentially. Some existing successful systems, such as databases and Web services, exploit parallelism but do not require parallel programs to be written by most users and developers. For example, a developer writes a single-threaded database query that operates in parallel with other queries managed by the database system. Another more modern and popularly known example is MapReduce, which abstracts many programming problems for search and display into a sequence of Map and Reduce operations, as described in Chapter 4.
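To make the division of labor concrete, here is a minimal single-machine sketch of the MapReduce pattern just described; the word-count task and all names are our illustration, not the report's. The developer writes only the sequential map and reduce functions, and the framework owns the grouping step, which is where the parallelism lives in a real system.

```python
from collections import defaultdict

def map_fn(document):
    # Purely sequential user code: emit (word, 1) for every word.
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    # Also sequential user code: combine all counts for one word.
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle phase: group intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Reduce phase: in a real system each key (and each map call above)
    # could run on a different core or machine.
    return dict(reduce_fn(k, vals) for k, vals in groups.items())

print(run_mapreduce(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

The user-visible code contains no threads, locks, or partitioning logic; that is precisely the property that makes such abstractions attractive for nonexpert programmers.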
Those examples are compelling and useful, but we cannot assume that such domain-specific solutions will generalize to all important and pervasive problems. In addition to the shift to new architectural approaches, attention must be paid to the curriculum to ensure that students are prepared to keep pace with the expected changes in software systems and development. Without adequate training, we will not produce enough of the highly skilled category (A) and (B) programmers described above. Without them, who will build the programming-abstraction systems?
17In reality, the wizard programmers can have an even greater effect on the organization than the one order of magnitude cited. The wizards will naturally gravitate to an approach to problems that saves tremendous amounts of effort and will debug later, and they will keep a programming team out of trouble far out of proportion to the 10:1 ratio mentioned. Indeed, as in the arts generally, there is a sense in which no number of ordinary people can be combined to accomplish what one gifted person can contribute.
Parallel computing, and thus parallel programming, showed great promise in the 1980s, with comparably great expectations about what could be accomplished. However, apart from horizontally scalable programming paradigms, such as MapReduce, limited success has led to frustration and comparatively little progress in recent years. Accordingly, the recent focus has been more on publishable research results on the theory of parallelism and on new languages and approaches and less on simplifying the expression and practical use of parallelism and concurrency.
There has been much investment and comparatively limited success in the approach of automatically extracting parallelism from sequential code. There has been considerably less focus on effective expression of parallelism in such a way that software is not expected to guess what parallelism was present in the original problem or computational formulation. Those questions remain unresolved. What should we teach?
Modern Computer-Science Curricula Ill-Equipped for a Parallel Future
In the last 20 years, what is considered CS has greatly expanded, and it has become increasingly interdisciplinary. Recently, many CS departments—such as those at the Massachusetts Institute of Technology, Cornell University, Stanford University, and the Georgia Institute of Technology—have revised their curricula by reducing or eliminating a required core and adding multiple "threads" of concentrations from which students choose one or more specializations, such as computational biology, computer systems, theoretical computing, human-computer interaction, graphics, robotics, or artificial intelligence. With respect to the topic of the present report, however, the CS curriculum is not training undergraduate and graduate students in either effective parallel programming or parallel computational thinking. That knowledge is now necessary for effective programming of current commodity parallel hardware, which is increasingly common in the form of CMPs and graphics processors, not to mention possible changes in systems of the future.
Developers and system designers are needed. Developers design and program application software; system designers design and build parallel-programming systems—which include programming languages, compilers, runtime systems, virtual machines, and operating systems—to make them work on computer hardware. In most universities, parallel programming is not part of the undergraduate curriculum for either CS students or scientists in other domains and is often offered only as a graduate elective for CS and electrical and computer engineering students. In the coming world of nearly ubiquitous parallel architectures, relegating parallelism to the boundaries of the curriculum will not suffice. Instead, parallelism will increasingly be a practical tool for domain scientists and will be immediately useful for software, system, and application development.
Parallel programming—even the parallel programming of today—is hard, but there are enough counterexamples to suggest that it may not be intractable. Computational reasoning for parallel problem-solving—the intellectual process of mapping the structure of a problem to a strategy for solution—is fairly straightforward for computer scientists and domain scientists alike, regardless of the level of parallelism involved or apparent in the solution. Most domain scientists—in such fields as physics, biology, chemistry, and engineering—understand the concepts of causality, correlation, and independence (parallelism vs sequence). However, there is a mismatch between how scientists and other domain specialists think about their problems and how they must express parallelism in their code, and that mismatch makes it difficult for computer scientists and noncomputer scientists alike to write parallel programs. The straightforwardness of the underlying reasoning is lost in the current expression of parallel programming. It is possible, and even common, to express a parallel program in a way that is complex and difficult to understand, but the recommendations in this report are aimed at developing models and approaches in which such complexity is not necessary.
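The mismatch can be seen even in a toy example (a hypothetical illustration of ours, not from the report): the model runs below are fully independent, yet the sequential version leaves that independence implicit while the parallel version forces the programmer to introduce execution machinery that has nothing to do with the problem itself.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(param):
    # Stand-in for an independent model run (no shared state).
    return param * param

params = list(range(8))

# Sequential expression: the independence of the runs is implicit.
sequential = [simulate(p) for p in params]

# Parallel expression: the problem is unchanged, but the programmer must
# now introduce an executor and think about scheduling. (For CPU-bound
# work a process pool would replace the thread pool shown here.)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(simulate, params))

assert parallel == sequential
```

Both versions compute the same list; a programming model that let the first form run in parallel directly would close exactly the gap described above.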
Arguably, computational experimentation—performing science exploration with computer models—is becoming an important part of modern scientific endeavor. Computational experimentation is modernizing the scientific method. Consequently, the ability to express scientific theories and models in computational form is a critical skill for modern scientists. If computational models are to be targeted to parallel hardware, as we argue in this report, parallel approaches to reasoning and thinking will be essential. Jeannette Wing has argued18 for the importance of computational thinking, broadly, and a current National Research Council study is exploring that notion. A recent report of that study also touched on concurrency and parallelism as part of computational thinking.19 With respect to the CS curriculum, because no general-purpose paradigm has emerged, universities should teach diverse parallel-programming languages, abstractions, and approaches until effective ways of teaching and programming emerge. The shape of the needed changes will not be clear until some reasonably general parallel-programming methods have been devised and shown to be promising. Nevertheless, possible models for reform include making parallelism an intrinsic part of every course (algorithms, architecture, programming, operating systems, compilers, and so on) as a fundamental way of solving problems; adding specialized courses, such as parallel computational reasoning, parallel algorithms, parallel architecture, and parallel programming; and creating an honors section for advanced training in parallelism (this option is much less desirable in that it reinforces the notion that parallel programming is outside mainstream approaches). It will be important to try many parallel-programming languages and models in the curriculum and in research to sort out which ones work best and to learn the most effective teaching methods.
18Jeannette M. Wing, 2006, Computational thinking, Communications of the ACM 49(3): 33-35.
19See NRC, 2010, Report of a Workshop on the Scope and Nature of Computational Thinking, Washington, D.C.: The National Academies Press, available online at http://www.nap.edu/catalog.php?record_id=12840.
Recommendation: Incorporate in computer science education an increased emphasis on parallelism, and use a variety of methods and approaches to prepare students better for the types of computing resources that they will encounter in their careers.
Since the invention of the transistor and the stored-program computer architecture in the late 1940s, we have enjoyed over a half-century of phenomenal growth in computing and its effects on society. Will the second half of the 20th century be recorded as the golden age of computing progress, or will we now step up to the next set of challenges and continue the growth in computing that we have come to expect?
Our computing models are likely to continue to evolve quickly in the foreseeable future. We expect that there are still many changes to come, which will require evolution of combined software and hardware systems. We are already seeing substantial centralization of computational capability in the cloud-computing paradigm with its attendant challenges to data storage and bandwidth. It is also possible to envision an abundance of Internet-enabled embedded devices that run software that has the sophistication and complexity of software running on today’s general-purpose processors. Networked, those devices will form a ubiquitous and invisible computing platform that provides data and services that we can only begin to imagine today. These drivers combine with the technical constraints and challenges outlined in the rest of this report to reinforce the notion that computing is changing at virtually every level.
The end of the exponential runup in uniprocessor performance and the market saturation of the general-purpose processor mark the end of the “killer micro.” This is a golden time for innovation in computing architectures and software. We have already begun to see diversity in computer designs to optimize for such metrics as power and throughput. The next generation of discoveries will require advances at both the hardware and the software levels.
There is no guarantee that we can make future parallel computing ubiquitous and as easy to use as yesterday’s sequential computer, but unless we aggressively pursue efforts suggested by the recommendations above, it will be game over for future growth in computing performance. This report describes the factors that have led to the limitations on growth in the use of single processors based on CMOS technology. The recommendations here are aimed at supporting and focusing research, development, and education in architectures, power, and parallel computing to sustain growth in computer performance and enjoy the next level of benefits to society.