Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 80
3 Power Is Now Limiting Growth in Computing Performance T he previous chapters define computer performance and discuss why its continued growth is critical for the U.S. economy. This chapter explores the relationship between computer performance and power consumption. The limitations imposed by power consumption are responsible for the present movement toward parallel computation. This chapter argues that such limitations will constrain the computation performance of even parallel computing systems unless computer design- ers take fundamentally different approaches. The laws of thermodynamics are evident to anyone who has ever used a laptop computer on his or her lap: the computer becomes hot. When you run a marathon, your body is converting an internal supply of energy formed from the food you eat into two outputs: mechanical forces produced by your body’s muscles and heat. When you drive your car, the engine converts the energy stored in the gasoline into the kinetic energy of the wheels and vehicle motion and heat. If you put your hand on an illuminated light bulb, you discover that the bulb is not only radiat- ing the desired light but also generating heat. Heat is the unwanted, but inevitable, side effect of using energy to accomplish any physical task—it is not possible to convert all the input energy perfectly into the desired results without wasting some energy as heat. A vendor who describes a “powerful computer” is trying to character- ize the machine as fast, not thermally hot. In this report, the committee uses power to refer to the use of electric energy and performance to mean computational capability. When we refer to power in the context of a 80
OCR for page 81
81 POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE computer system, we are talking about energy flows: the rate at which energy must be provided to the computer system from a battery or wall outlet, which is the same as the rate at which that energy, now converted to heat, must be extracted from the system. The temperature of the chip or system rises above the ambient temperature and causes heat energy to flow into the environment. To limit the system’s temperature rise, heat must be extracted efficiently. (Transferring heat from the computer system to the environment is the task of the cooling subsystem in a computer.) Thus, referring to a chip’s power requirements is equivalent to talking about power consumed and dissipated. When we talk about scaling computing performance, we implicitly mean to increase the computing performance that we can buy for each dollar we spend. If we cannot scale down the energy per function as fast as we scale up the performance (functions per second), the power (energy per second) consumed by the system will rise, and the increase in power consumption will increase the cost of the system. More expensive hard- ware will be needed to supply the extra power and then remove the heat that it generates. The cost of managing the power in and out of the system will rise to dominate the cost of the hardware. Historically, technology scaling has done a good job of scaling down energy cost per function as the total cost per function dropped, and so the overall power needs of systems were relatively constant as perfor- mance (functions per second) dramatically increased. Recently, the cost per function has been dropping faster than the power per function, which means that the overall power of constant-cost chips has been growing. The power problem is getting worse because of the recent difficulty in continuing to scale down power-supply voltages, as is described later in this chapter. Our ability to supply power and cool chips is not improving rapidly, so for many computers the performance per dollar is now limited by power issues. In addition, computers are increasingly available in a vari - ety of form-factors and many, such as cell phones, have strict power limits because of user constraints. People do not want to hold hot cell phones, and so the total power budget needs to be under a few watts when the phone is active. Designers today must therefore find the best performance that can be achieved within a specified power envelope. To assist in understanding these issues, this chapter reviews inte - grated circuit (IC) technology scaling. It assumes general knowledge of electric circuits, and some readers may choose to review the findings listed here and then move directly to Chapter 4. The basic conclusions of this chapter are as follows:
OCR for page 82
82 THE FUTURE OF COMPUTING PERFORMANCE · Power consumption has become the limiting constraint on future growth in single-processor performance. · Power limitations on individual processors have resulted in chips that have multiple lower-power processors that, in some applica- tions, yield higher aggregate performance than chips with single power-limited processors. · Even as the computing industry successfully shifts to multiple, simpler, and lower-power processor cores per chip, it will again face performance limits within a few years because of aggregate, full-chip power limitations. · Like complementary metal oxide semiconductor (CMOS) tran- sistors, many of the electronic devices being developed today as potential replacements for CMOS transistors are based on regu- lating the flow of electrons over a thermodynamic barrier and will face their own power limits. · A promising approach to enabling more power-efficient compu- tation is to design application-specific or algorithm-specific com- putational units. For that approach to succeed, new chip design and verification methods (such as advanced electronic-design automation tools) will need to be developed to reduce the time and cost of IC design, and new IC fabrication methods will be needed to reduced the one-time mask costs. · The present move to chip multiprocessors is a step in the direction of using parallelism to sustain growth in computing performance. However, this or any other hardware architecture can succeed only if appropriate software can be developed to exploit the par- allelism in the hardware effectively. That software challenge is the subject of Chapter 4. For intrepid readers prepared to continue with this chapter, the com - mittee starts by explaining how classic scaling enabled the creation of cheaper, faster, and lower-power circuits; in essence, scaling is responsible for Moore’s law. The chapter also discusses why modern scaling produces smaller power gains than before. With that background in technology scal- ing, the chapter then explains how computer designers have used improv- ing technology to create faster computers. The discussion highlights why processor performance grew faster than power efficiency and why the problem is more critical today. The next sections explain the move to chips that have multiple processors and clarify both the power-efficiency advantage of parallel computing and the limitations of this approach. The chapter concludes by mentioning some alternative technologies to assess the potential advantages and the practicality of these approaches. Alter- natives to general-purpose processors are examined as a way to address
OCR for page 83
83 POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE power limitations. In short, although incremental advances in computing performance will continue, overcoming the power constraint is difficult or potentially impossible and will require radical rethinking of computation and of the logic gates used to build computing systems. BASIC TECHNOLOGY SCALING Although this report focuses on computers based on IC processors, it is useful to remember that computers have historically used various tech - nologies. The earliest electric computers were built in the 1940s and used mechanical relays.1,2 Vacuum tubes enabled faster electronic computers. By the 1960s, the technology for building computers changed again, to transistors that were smaller and had lower cost, lower power, and greater reliability. Within a decade, computers migrated to ICs instead of discrete transistors and were able to scale up performance as the technology scaled down the size of transistors in the ICs. Each technology change, from relays to vacuum tubes to transistors to ICs, decreased the cost, increased the performance of each function, and decreased the energy per function. Those three factors enabled designers to continue to build more capable computers for the same cost. Early IC computers were built with bipolar transistors3 in their ICs, which offered high performance but used relatively high power (com - pared with other IC options). By the late 1970s, low-end computers used NMOS4 technology, which offered greater density and thus lower cost per function but also lower-speed gates than the bipolar alternative. As scal - ing continued, the cost per function was dropping rapidly, but the energy needs of each gate were not dropping as rapidly, so the power-dissipation 1 Raúl Rojas, 1997, Konrad Zuse’s legacy: The architecture of the Z1 and Z3, IEEE Annals of the History of Computing 19(2): 5-19. 2 IBM, 2010, Feeds, speeds and specifications, IBM Archives, website, available online at http://www-03.ibm.com/ibm/history/exhibits/markI/markI_feeds.html. 3 Silicon ICs use one of two basic structures for building switches and amplifiers. Both transistor structures modify the silicon by adding impurities to it that increase the concen - tration of electric carriers—electrons for N regions and holes for P regions—and create three regions: two Ns separated by a P or two Ps separated by an N. That the electrons are blocked by holes (or vice versa) means that there is little current flow in all these structures. The first ICs used NPN bipolar transistors, in which the layers are formed vertically in the material and the current flow is a bulk property that requires electrons to flow into the P region (the base) and holes to flow into the top N region (the emitter). 4 NMOS transistors are lateral devices that work by having a “gate” terminal that controls the surface current flow between the “source” and “drain” contacts. The source-drain ter- minals are doped N and supply the electrons that flow through a channel; hence the name NMOS. Doping refers to introducing impurities to affect the electrical properties of the semiconductor. PMOS transistors make the source and drain material P, so holes (electron deficiencies) flow across the channels.
OCR for page 84
84 THE FUTURE OF COMPUTING PERFORMANCE requirements of the chips were growing. By the middle 1980s, most pro - cessor designers moved from bipolar and NMOS to CMOS5 technology. CMOS gates were slower than those of NMOS or bipolar circuits but dis- sipated much less energy, as described in the section below. Using CMOS technology reduced the energy per function by over an order of mag- nitude and scaled well. The remainder of this chapter describes CMOS technology, its properties, its limitations, and how it affects the potential for growth in computing performance. CLASSIC CMOS SCALING Computer-chip designers have used the scaling of feature sizes (that is, the phenomenon wherein the same functionality requires less space on a new chip) to build more capable, more complex devices, but the result- ing chips must still operate within the power constraints of the system. Early chips used circuit forms (bipolar or NMOS circuits) that dissipated power all the time, whether the gate6 was computing a new value or just holding the last value. Even though scaling allowed a decrease in the power needed per gate, the number of gates on a chip was increasing faster than the power requirements were falling; by the early to middle 1980s, chip power was becoming a design challenge. Advanced chips were dissipating many watts;7 one chip, the HP Focus processor, for exam- ple, was dissipating over 7 W, which at the time was a very large number.8 Fortunately, there was a circuit solution to the problem. It became possible to build a type of gate that dissipated power only when the out- put value changed. If the inputs were stable, the circuit would dissipate practically no power. Furthermore, the gate dissipated power only as long as it took to get the output to transition to its new value and then returned to a zero-power state. During the transition, the gate’s power requirement was comparable with those of the previous types of gates, but because the transition lasts only a short time, even in a very active machine a gate 5 The C in CMOS stands for complementary. CMOS uses both NMOS and PMOS transistors. 6 A logic gate is a fundamental building block of a system. Gates typically have two to four inputs and produce one input. These circuits are called logic gates because they compute simple functions used in logic. For example, an AND gate takes two inputs (either 1s or 0s) and returns 1 if both are 1s and 0 if either is 0. A NOT gate has only one input and returns 1 if the input is 0 and 0 if the input is 1. 7 Robert M. Supnick, 1984, MicroVAX 32, a 32 bit microprocessor, IEEE Journal of Solid State Circuits 19(5): 675-681, available online at http://ieeexplore.ieee.org/stamp/stamp.js p?arnumber=1052207&isnumber=22598. 8 Joseph W. Beyers, Louis J. Dohse, Jospeh P. Fucetola, Richard L. Kochis, Cliffird G. Lob, Gary L. Taylor, and E.R. Zeller, 1981, A 32-bit VLSI CPU chip, IEEE Journal of Solid-State Circuits 16(5): 537-542, available online at http://ieeexplore.ieee.org/stamp/stamp.jsp?ar number=1051634&isnumber=22579.
OCR for page 85
85 POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE TABLE 3.1 Scaling Results for Circuit Performance Device or Circuit Parameter Scaling Factor Device dimension tox, L, W 1/k Doping concentration Na k Voltage V 1/k Current I 1/k Capacitance eA/t 1/k Delay time per circuit VC/I 1/k 2 Power dissipation per circuit VI 1/k Power density VI/A 1 SOURCE: Reprinted from Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien. Yu, V. Leo Rideout, Ernest Bassous, and Andre R. LeBlanc, 1974, Design of ion-implanted MOSFETS with very small physical dimensions, IEEE Journal of Solid State Circuits 9(5): 256-268. would be in transition around 1 percent of the time. Thus, moving to the new circuit style decreased the power consumed by a computation by a factor of over 30.9 The new circuit style was called complementary MOS, or CMOS. A further advantage of CMOS gates was that their performance and power were completely determined by the MOS transistor properties. In a classic 1974 paper, reprinted in Appendix D, Robert Dennard et al. showed that the MOS transistor has a set of very convenient scaling properties.10 The scaling properties are shown in Table 3.1, taken from that paper. If all the voltages in a MOS device are scaled down with the physi - cal dimensions, the operation of the device scales in a particularly favor- able way. The gates clearly become smaller because linear dimensions are scaled. That scaling also causes gates to become faster with lower energy per transition. If all dimensions and voltages are scaled by the scaling fac- tor k (k has typically been 1.4), after scaling the gates become (1/k)2 their previous size, and k2 more gates can be placed on a chip of roughly the same size and cost as before. The delay of the gate also decreases by 1/k, and, most important, the energy dissipated each time the gate switches decreases by (1/k)3. To understand why the energy drops so rapidly, note that the energy that the gate dissipates is proportional to the energy that is stored at the output of the gate. That energy is proportional to a quan- 9 The old style dissipated power only half the time; this is why the improvement was by a factor of roughly 30. 10 Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien. Yu, V. Leo Rideout, Ernest Bassous, and Andre R. LeBlanc, 1974, Design of ion-implanted MOSFETS with very small physical dimensions, IEEE Journal of Solid State Circuits 9(5):256–268.
OCR for page 86
86 THE FUTURE OF COMPUTING PERFORMANCE tity called capacitance11 and the square of the supply voltage. The load capacitance of the wiring decreases by 1/k because the smaller gates make all the wires shorter and capacitance is proportional to length. Therefore, the power requirements per unit of space on the chip (mm2), or energy per second per mm2, remain constant: Power = (number of gates)(CLoad/gate)(Clock Rate)(Vsupply2) Power density = Ng Cload Fclk Vdd2 Ng = CMOS gates per unit area Cload = capacitive load per CMOS gate Fclk = clock frequency Vdd = supply voltage Power density = ( k2 )( 1/k )( k )(1/k)2 = 1 That the power density (power requirements per unit space on the chip, even when each unit space contains many, many more gates) can remain constant across generations of CMOS scaling has been a critical property underlying progress in microprocessors and in ICs in general. In every technology generation, ICs can double in complexity and increase in clock frequency while consuming the same power and not increasing in cost. Given that description of classic CMOS scaling, one would expect the power of processors to have remained constant since the CMOS tran- sition, but this has not been the case. During the late 1980s and early 1990s, supply voltages were stuck at 5 V for system reasons. So power density would have been expected to increase as technology scaled from 2 mm to 0.5 mm. However, until recently supply voltage has scaled with technology, but power densities continued to increase. The cause of the discrepancy is explained in the next section. Note that Figure 3.1 shows no microprocessors above about 130 W; this is because 130 W is the physical limit for air cooling, and even approaching 130 W requires massive heat sinks and local fans. 11 Capacitance is a measure of how much electric charge is needed to increase the voltage between two points and is also the proportionality constant between energy stored on a wire and its voltage. Larger capacitors require more charge (and hence more current) to reach a voltage than a smaller capacitor. Physically larger capacitors tend to have larger capaci - tance. Because all wires have at least some parasitic capacitance, even just signaling across the internal wires of a chip dissipates some power. Worse, to minimize the time wasted in charging or discharging, the transistors that drive the signal must be made physically larger, and this increases their capacitance load, which the prior gate must drive, and costs power and increases the incremental die size.
OCR for page 87
87 POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE 1,000 100 10 1 1985 1990 1995 2000 2005 2010 Year of Introduction FIGURE 3.1 Microprocessor power dissipation (watts) over time (1985-2010). HOW CMOS-PROCESSOR PERFORMANCE IMPROVED EXPONENTIALLY, AND THEN SLOWED Microprocessor performance, as measured against the SPEC2006 benchmark12,13,14 was growing exponentially at the rate of more than 50 percent per year (see Figure 3.2). That phenomenal single-processor per- formance growth continued for 16 years and then slowed substantially15 partially because of power constraints. This section briefly describes how those performance improvements were achieved and what contributed to the slowdown in improvement early in the 2000s. To achieve exponential performance growth, microprocessor design- ers scaled processor-clock frequency and exploited instruction-level paral- 12 For older processors, SPEC2006 numbers were estimated from older versions of the SPEC benchmark by using scaling factors. 13 John L. Henning, 2006, SPEC CPU2006 benchmark descriptions, ACM SIGARCH Com - puter Architecture News 34(4): 1-17. 14 John L. Henning, 2007, SPEC CPU suite growth: An historical perspective, ACM SI - GARCH Computer Architecture News 35(1): 65-68. 15 John L. Hennessy and David A. Patterson, 2006, Computer Architecture: A Quantitative Approach, fourth edition, San Francisco, Cal.: Morgan Kauffman, pp. 2-4.
OCR for page 88
88 THE FUTURE OF COMPUTING PERFORMANCE 100 10 1 0 0.1 0 0.01 0.001 0 1985 1990 1995 2000 2005 2010 Year of Introduction FIGURE 3.2 Integer application performance (SPECint2006) over time (1985-2010). lelism (ILP) to increase the number of instructions per cycle.16,17,18 The power problem arose primarily because clock frequencies were increasing faster than the basic assumption in Dennard scaling (described previously). The assumption there is that clock frequency will increase inversely pro - portionally to the basic gate speed. But the increases in clock frequency were made because of improvements in transistor speed due to CMOS- technology scaling combined with improved circuits and architecture. The designs also included deeper pipelining and required less logic (fewer operations on gates) per pipeline stage.19 Separating the effect of technology scaling from those of the other improvements requires examination of metrics that depend solely on the improvements in underlying CMOS technology (and not other improve- 16 Ibid. 17 Mark Horowitz and William Dally, 2004, How scaling will change processor architecture, IEEE International Solid States Circuits Conference Digest of Technical Papers, San Fran - cisco, Cal., February 15-19, 2004, pp. 132-133. 18 Vikas Agarwal, Stephen W. Keckler, and Doug Burger, 2000, Clock rate versus IPC: The end of the road for conventional microarchitectures, Proceedings of the 27th International Symposium Computer Architecture, Vancouver, British Columbia, Canada, June 12-14, 2000, pp. 248-259. 19 Pipelining is a technique in which the structure of a processor is partitioned into simpler, sequential blocks. Instructions are then executed in assembly-line fashion by the processor.
OCR for page 89
89 POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE ments in circuits and architecture). (See Box 3.1 for a brief discussion of this separation.) Another contribution to increasing power requirements per chip has been the nonideal scaling of interconnecting wires between CMOS devices. As the complexity of computer chips increased, it was not sufficient simply to place two copies of the previous design on the new chip. To yield the needed performance improvements, new commu- BOX 3.1 Separating the Effects of CMOS Technology Scaling on Performance by Using the FO4 Metric To separate the effect of CMOS technology scaling from other sorts of optimizations, processor clock-cycle time can be characterized by using the technology-dependent delay metric fanout-of-four delay (FO4), which is defined as the delay of one inverter driving four copies of an equally sized inverter.1,2 The metric measures the clock cycle in terms of the basic gate speed and gives a number that is relatively technology-independent. In Dennard scaling, FO4/ cycle would be constant. As it turns out, clock-cycle time decreased from 60-90 FO4 at the end of the 1980s to 12-25 in 2003-2004. The increase in frequency caused power to increase and, combined with growing die size, accounted for most of the power growth until the early 2000s. That fast growth in clock rate has stopped, and in the most recent machines the number of FO4 in a clock cycle has begun to increase. Squeezing cycle time further does not result in substantial performance improvements, but it does increase power dissipation, complexity, and cost of design.3,4 As a result, clock frequency is not increasing as fast as before (see Figure 3.3). The decrease in the rate of growth in of clock frequency is also forecast in the 2009 ITRS semi- conductor roadmap,5 which shows the clock rate for the highest-performance single processors no more than doubling each decade over the foreseeable future. 1 David Harris, Ron Ho, Gu-Yeon Wei, and Mark Horowitz, The fanout-of-4 inverter delay metric, Unpublished manuscript, May 29, 2009, available online at http://www-vlsi.stanford. edu/papers/dh_vlsi_97.pdf. 2 David Harris and Mark Horowitz, 1997, Skew-tolerant Domino circuits, IEEE Journal of Solid-State Circuits 32(11): 1702-1711. 3 Mark Horowitz and William Dally, 2004, How scaling will change processor architecture, IEEE International Solid States Circuits Conference Digest of Technical Papers, San Francisco, Cal., February 15-19, 2004, pp. 132-133. 4Vikas Agarwal, Stephen W. Keckler, and Doug Burger, 2000, Clock rate versus IPC: The end of the road for conventional microarchitectures. Proceedings of the 27th International Sym- posium on Computer Architecture, Vancouver, British Columbia, Canada, June 12-14, 2000, pp. 248-259. 5See http://www.itrs.net/Links/2009ITRS/Home2009.htm.
OCR for page 90
90 THE FUTURE OF COMPUTING PERFORMANCE nication paths across the entire machine were needed—interconnections that did not exist in the previous generation. To provide the increased interconnection, it was necessary to increase the number of levels of metal interconnection available on a chip, and this increased the total load capacitance faster than assumed in Dennard scaling. Another fac- tor that has led to increases in load capacitance is the slight scaling up of wire capacitance per length. That has been due to increasing side-to- side capacitance because practical considerations limited the amount of vertical scaling possible in wires. Technologists have attacked both those issues by creating new insulating materials that had lower capacitance per length (known as low K dielectrics); this has helped to alleviate the problem, but it continues to be a factor in shrinking technologies. One reason that increasing clock rate was pushed so hard in the 1990s, apart from competitive considerations in the chip market, was that find- ing parallelism in an application constructed from a sequential stream of instructions (ILP) was difficult, required large hardware structures, and was increasingly inefficient. Doubling the hardware (number of transistors available) generated only about a 50 percent increase in performance—a relationship that at Intel was referred to as Pollack’s rule.20 To continue to scale performance required dramatic increases in clock frequency, which drove processor power requirements. By the early 2000s, processors had attained power dissipation levels that were becoming difficult to handle cheaply, so processor power started to level out. Consequently, single- processor performance improvements began to slow. The upshot is a core finding and driver of the present report (see Figure 3.3), namely, Finding: After many decades of dramatic exponential growth, single- processor performance is increasing at a much lower rate, and this situ - ation is not expected to improve in the foreseeable future. HOW CHIP MULTIPROCESSORS ALLOW SOME CONTINUED PERFORMANCE-SCALING One way around the performance-scaling dilemma described in the previous section is to construct computing systems that have multiple, explicitly parallel processors. For parallel applications, that arrangement should get around Pollack’s rule; doubling the area should double the 20 Patrick P. Gelsinger, 2001, Microprocessors for the new millennium: Challenges, op - portunities, and new frontiers, IEEE International Solid-State Circuits Conference Digest of Technical Papers, San Francisco, Cal., February 5-7, 2001, pp. 22-25. Available online at http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=912412&isnumber=19686.
OCR for page 94
94 THE FUTURE OF COMPUTING PERFORMANCE PROBLEMS IN SCALING NANOMETER DEVICES If voltages could continue to be scaled with feature size (following classic Dennard scaling), CMP performance could continue to be scaled with technology. However, early in this decade scaling ran into some fundamental limits that make it impossible to continue along that path, 26 and the improvements in both performance and power achieved with technology scaling have slowed from their historical rates. The net result is that even CMPs will run into power limitations. To understand those issues and their ramifications, we need to revisit technology scaling and look at one aspect of transistor performance that we ignored before: leak - age current. As described earlier, CMOS circuits have the important property that they dissipate energy only when a node changes value. Consider the simple but representative CMOS logic circuits in Figure 3.4. One type of CMOS device, a pMOS transistor, is connected to the power supply (Vsupply). When its input is low (Vgnd), it turns on, connects Vsupply to the output, and drives the output high to Vsupply. When the input to the pMOS device is high (Vsupply), it disconnects the output from Vsupply. The other type of CMOS device, an nMOS transistor, has the complementary behavior: when its input is high (Vsupply), it connects the output to Vgnd; when its input is low (Vgnd), it disconnects the output from Vgnd. Because of the construction of the CMOS logic, the pMOS and nMOS transistors are never driving the output at the same time. Hence, the only current that flows through the gate is that needed to charge or discharge the capacitances associated with the gate, so the energy consumed is mostly the energy needed to change the voltage on a capacitor with transistors, which is Cload multiplied by Vsupply2. For that analysis to hold, it is impor- tant that the off transistors not conduct any current in the off state: that is, they should have low leakage. However, the voltage scaling that the industry has been following has indirectly been increasing leakage current. Transistors operate by chang- ing the height of an energy barrier to modulate the number of carriers that can flow across them. One might expect a fairly sharp current transition, so that when the barrier is higher than the energy of the carriers, there is no current, and when it is lowered, the carriers can “spill” over and flow across the transistor. The actual situation is more complex. The basic reason is related to thermodynamics. At any finite temperature, although 26 Sam Naffziger reviews the Vdd limitations and describes various approaches (circuit, architecture, and so on) to future processor design given the voltage scaling limitations in the article High-performance processors in a power-limited world, Proceedings of the IEEE Symposium on VLSI Circuits, Honolulu, Hawaii, June 15-17, 2006, pp. 93-97, available on - line at http://ewh.ieee.org/r5/denver/sscs/Presentations/2006_11_Naffziger_paper.pdf.
OCR for page 95
95 POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE Vsupply Vsupply pMOS Output Input Output Input B nMOS Input A Vgnd V gnd FIGURE 3.4 Representative CMOS logic circuits. there is a well-defined average energy for the carriers, the energy of each individual carrier follows a probability distribution. The probability of having an energy higher than the average falls off exponentially, with a characteristic scale factor that is proportional to the temperature of the transistors measured measured in kelvins. The hotter the device, the wider the range of energies that the carriers can have. That energy distribution is critical in the building of transistors. Even with an energy barrier that is higher than the average energy of the car- riers, some carriers will flow over the barrier and through the transistor; the transistor will continue to conduct some current when we would like it to be off. The energy scale is kT, where k is the Boltzmann constant and T is the temperature in kelvins. We can convert it into voltage by divid- ing energy by the charge on the particle, an electron in this case: q = 1.6 × 10–19 coulombs. kT/q is around 26 mV at room temperature. Thus, the current through an off transistor drops exponentially with the height of the energy barrier, falling by slightly less than a factor of 3 for each 26-mV increase in the barrier height. The height of the barrier is normally called the threshold voltage (Vth) of the transistor, and the leakage current can be written as q(Vgs – Vth ) I ds = I o e , α kT
OCR for page 96
96 THE FUTURE OF COMPUTING PERFORMANCE where Io is a constant around 1 mA (microampere) per micrometer of tran- sistor width at room temperature, Vgs is the voltage applied to the control gate of the transistor, and a is a number greater than 1 (generally around 1.3) that represents how effectively the gate voltage changes the energy barrier. From the equation, it is easy to see that the amount of leakage cur- rent through an off transistor (Vgs = 0) depends heavily on the transistor’s threshold voltage. The leakage current increases by about a factor of 10 each time the threshold voltage drops by another 100 mV. Historically, Vths were around 800 mV, so the residual transistor leak- age currents were so small that they did not matter. Starting from high Vth values, it was possible to scale Vth, Vsupply, and L together. While leak- age current grew exponentially with shrinking Vth, the contribution of subthreshold leakage to the overall power was negligible as long as Vth values were still relatively large. But ultimately by the 90-nm node, the leakage grew to a point where it started to affect overall chip power.27 At that point, Vth and Vsupply scaling slowed dramatically. One approach to reduce leakage current is to reduce temperature, inasmuch as this makes the exponential slope steeper. That is possible and has been tried on occasion, but it runs into two problems. The first is that one needs to consider the power and cost of providing a low-temperature environment, which usually dwarf the gains provided by the system; this is especially true for small or middle-size systems that operate in an office or home environment. The second is related to testing, repair, thermal cycling, and reliability of the systems. For those reasons, we will not consider this option further in the present report. However, for suf - ficiently large computing centers, it may prove advantageous to use liquid cooling or other chilling approaches where the energy costs of operating the semiconductor hardware in a low-temperature environment do not outweigh the performance gains, and hence energy savings, that are pos- sible in such an environment. Vth stopped scaling because of increasing leakage currents, and Vsupply scaling slowed to preserve transistor speed with a constant (Vsupply – Vth). Once leakage becomes important, an interesting optimization between Vsupply and Vth is possible. Increasing Vth decreases leakage current but also makes the gates slower because the number of carriers that can flow through a transistor is roughly proportional to the decreasing (Vsupply – Vth). One can recover the lost speed by increasing Vsupply, but this also increases the power consumed to switch the gate dynamically. For a given gate delay, the lowest-power solution is one in which the marginal energy cost of increasing Vdd is exactly balanced by the marginal energy savings 27 Edward J. Nowak, 2002, Maintaining the benefits of CMOS scaling when scaling bogs down, IBM Journal of Research and Development 46(2/3): 169-180.
OCR for page 97
97 POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE of increasing Vth. The balance occurs when the static leakage power is roughly 30 percent of the dynamic power dissipation. This leakage-constrained scaling began at roughly the 130-nm tech - nology node, and today both Vsupply and Vth scaling have dramatically slowed; this has also changed how gate energy and speed scale with tech- nology. The energy required to switch a gate is C multiplied by V supply2, which scales only as 1/k if Vsupply is not scaling. That means that technol- ogy scaling reduces the power by k only if the scaled circuit is run at the same frequency. That is, if gate speed continued to increase, half the die (the size of the scaled circuit) would dissipate the same power as the full die in the previous generation and would operate k times, that is 1.4 times, faster, much less than the three-fold performance increase we have come to expect. Clearly, that is not optimal, so many designers are scaling Vdd slightly to increase the energy savings. That works but lowers the gate speeds, so some parallelism is needed just to recover from the slowing single-thread performance. The poor scaling will eventually limit the performance of CMPs. Combining the lessons of the last several sections of this chapter, the committee concluded that neither CMOS nor chip multiprocessors can overcome the power limits facing modern computer systems. That leads to another core conclusion of this report. Basic laws of physics and constraints on chip design mean that the growth in the performance of computer systems will become limited by their power and thermal requirements within the next decade. Optimists might hope that new technologies and new research could overcome that limitation and allow hardware to continue to drive future performance scaling akin to what we have seen with single-thread performance, but there are reasons for caution, as described in the next section. Finding: The growth in the performance of computing systems—even if they are multiple-processor parallel systems—will become limited by power consumption within a decade. ADVANCED TECHNOLOGY OPTIONS If CMOS scaling, even in chip-multiprocessor designs, is reaching limits, it is natural to ask whether other technology options might get around the limits and eventually overtake CMOS, as CMOS did to nMOS and bipolar circuits in the 1980s. The answer to the question is mixed. 28 It 28 Mark Bohr, Intel senior fellow, gave a plenary talk at ISSCC 2009 on scaling in an SOC world in which he argues that “our challenge . . . is to recognize the coming revolutionary
OCR for page 98
98 THE FUTURE OF COMPUTING PERFORMANCE is clear that new technologies and techniques will be created and applied to scaled technologies, but these major advances—such as high-k gate dielectrics, low-K interconnect dielectrics, and strained silicon—will prob- ably be used to continue technology scaling in general and not create a disruptive change in the technology. Recent press reports make it clear, for example, that Intel expects to be using silicon supplemented with other materials in future generations of chips.29 A recent study compared estimated gate speed and energy of transis- tors built with exotic materials that should have very high performance.30 Although the results were positive, the maximum improvement at the same power was modest, around a factor of 2 for the best technology. Those results should not be surprising. The fundamental problem is that Vth does not scale, so it is hard to scale the supply voltage. The limitation on Vth is set by leakage of carriers over an energy barrier, so any device that modulates current by changing an energy barrier should have similar limi- tations. All the devices used in the study cited above used the same cur- rent-control method, as do transistors made from nanotubes, nanowires, graphene, and so on. To get around that limitation, one needs to change “the game” and build devices that work by different principles. A few options are being pursued, but each has serious issues that would need to be overcome before they could become practical. One alternative approach is to stop using an energy barrier to con- trol current flow and instead use quantum mechanical tunneling. That approach eliminates the problem with the energy tails by using carriers that are constrained by the energy bands in the silicon, which have fixed levels. Because there is no energy tail, they can have, in theory, a steep turnon characteristic. Many researchers are trying to create a useful device of this type, but there are a number of challenges. The first is to create a large enough current ratio in a small enough voltage range. The tunneling current will turn on rapidly, but its increase with voltage is not that rapid. Because a current ratio of around 10,000 is required, we need a device that can transition through this current range in a small voltage (<400 mV). changes and opportunities and to prepare to utilize them (Mark Bohr, 2009, The new era of scaling in an SOC world, IEEE International Solid-State Circuits Conference , San Fran- cisco, Cal., February 9, 2009, available online at http://download.intel.com/technology/ architecture-silicon/ISSCC_09_plenary_paper_Bohr.pdf). 29 Intel CEO Paul Ottelini was said to have declared that silicon was in its last decade as the base material of the CPU (David Flynn, 2009, Intel looks beyond silicon for processors past 2017, Apcmag.com, October 29, 2009, available online at http://apcmag.com/intel-looks- beyond-silicon-for-processors-past-2017.htm). 30 Donghyun Kim, Tejas Krishnamohan1, and Krishna C. Saraswat, 2008, Performance evaluation of 15nm gate length double-gate n-MOSFETs with high mobility channels: III-V, Ge and Si, Electrochemical Society Transactions 16(11): 47-55.
OCR for page 99
99 POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE Even if one can create a device with that current ratio, another problem arises. The speed of the gates depends on the transistor current. So not only do we need the current ratio, we also need devices that can supply roughly the same magnitude of current as CMOS transistors provide. Tunnel currents are often small, so best estimates indicate that tunnel FETs might be much slower (less current) than in CMOS transistors. Such slowness will make their adoption difficult. Another group of researchers are trying to leverage the collective effort of many particles together to get around the voltage limits of CMOS. Recall that the operating voltage is set by the thermal energy (kT) divided by the charge on one electron, because that is the charged particle. If the charged particle had a charge of 2q, the voltage requirements would be half what it is today. That is the approach that nerve cells use to operate robustly at low voltages. The proteins in the voltage-activated ion chan- nels have a charge that allows them to operate easily at 100 mV. Although some groups have been trying to create paired charge carriers, most are looking at other types of cooperative processes. The ion channels in nerves go though a physical change, so many groups are trying to build logic from nanorelays (nanomicroelectromechanical systems, or nano MEMS). Because of the large number of charges on the gate electrode and the posi- tive feedback intrinsic in electrostatic devices, it is theoretically possible to have very low operating voltages; indeed, operation down to a couple of tenths of a volt seems possible. Even as researchers work to overcome that hurdle, there are a number of issues that need to be addressed. The most important is determining the minimum operating voltage that can reliably overcome contact sticking. It might not take much voltage to cre- ate a contact, but if the two surfaces that connect stick together (either because of molecular forces or because of microwelding from the current flow), larger voltages will be needed to break the contact. Devices will have large numbers of these structures, so the voltage must be less than CMOS operating voltages at similar performance. The second issue is per- formance and reliability. This device depends on a mechanically moving structure, so the delay will probably be larger than that of CMOS (around 1 nanosecond), and it will probably be an additional challenge to build structures that can move for billions of cycles without failing. There is also promising research in the use of electron-spin-based devices (spintronics) in contrast with the charge-based devices (electron - ics) in use today. Spin-based devices—and even pseudospin devices, such as the BiSFET31—have the potential to greatly reduce the power dissi- 31 Sanjay K.Banerjee, Leonard F. Register, Emanuel Tutuc, Dharmendar Reddy, and Allan H. MacDonald, 2009, Bilayer pseudospin field-effect transistor (BiSFET): A proposed new logic device, IEEE Electron Device Letters 30(2): 158-160.
OCR for page 100
100 THE FUTURE OF COMPUTING PERFORMANCE pated in performing basic logic functions. However, large fundamental and practical problems remain to be solved before spintronic systems can become practical.32 Those or other approaches (such as using the correlation of particles in ferro materials33) might yield a breakthrough. However, given the complexity of today’s chips, with billions of working transistors, it is likely to take at least a decade to introduce any new tech - nology into volume manufacturing. Thus, although we should continue to invest in technology research, we cannot count on it to save the day. It is unlikely to change the situation in the next decade. Recommendation: Invest in research and development to make com- puter systems more power-efficient at all levels of the system, includ - ing software, application-specific approaches, and alternative devices. R&D should be aimed at making logic gates more power-efficient. Such efforts should address alternative physical devices beyond incremental improvements in today’s CMOS circuits. APPLICATION-SPECIFIC INTEGRATED CIRCUITS Although the shift toward chip multiprocessors will allow industry to continue to scale the performance of CMPs based on general-purpose processor cores for some time, general-purpose chip multiprocessors will reach their own limit. As discussed earlier, CMP designers can trade off single-thread performance of individual processors against lower energy dissipation per instruction, thus allowing more instructions by multiple processors while the same amount of energy is dissipated by the chip. However, that is possible only within some range of energy performance. Beyond some limit, lowering energy per instruction by processor sim - plification can lead to overall CMP performance degradation because processor performance starts to decrease faster than energy per instruc - tion. That range is likely to be a factor of about 10, that is, energy per instruction cannot be reduced by more than a factor of 10 compared with the highest-performance single-processor chip, such as the Intel Pentium 4 or the Intel Itanium.34 When such limits are reached, we will need to create other approaches 32 In their article, cited in the preceding footnote, Banerjee et al. look at a promising technol - ogy that still faces many challenges. 33 See, for instance, the research of Sayeef Salahuddin at the University of California, Berkeley. 34 The real gain might be even smaller because with an increase in the number of proces - sors on the chip, more energy will be dissipated by the memory system and interconnect, or the performance of many parallel applications will scale less than linearly with the number of processors.
OCR for page 101
101 POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE to create an energy-efficient computation unit. On the basis of the histori- cal data, the answer seems clear: we will need to create more application- optimized processing units. It is well known that tuning the hardware and software toward a specific application or set of applications allows a more energy-efficient solution. That work started with the digital watch many decades ago and continues today. Figure 3.5 shows data for general-pur- pose processors, digital-signal processors, and application-specific inte - grated circuits (ASICs) from publications presented at the International Solid-State Circuits Conference. The data are somewhat dated, but all chips were designed for similar 0.18- to 0.25-µm CMOS technology, and one can see that the ASIC designs are roughly 3 orders of magnitude more energy-efficient than the general-purpose processors. The main reason for such a difference is a combination of algorithm and hardware tuning and the ability to reduce the use of large memory structures as general interconnects: instead of a value’s being stored in a register or memory, it is consumed by the next function unit. Doing only what needs to be done saves both energy and area (see Figure 3.6). More recently, researchers at Lawrence Berkeley National Laboratory, interested in building peta-scale supercomputers for kilometer-scale cli - mate modeling, argued that designing a specialized supercomputer based on highly efficient customizable embedded processors can be attractive in terms of energy cost.35 For example, they estimated that a peta-scale climate supercomputer built with custom chips would consume 2.5 MW of electric power whereas a computer with the same level of performance but built with general-purpose AMD processors would require 179 MW. The current design trend, however, is away from building custom- ized solutions; increasing design complexity has caused the nonrecurring engineering costs for designing these chips to grow rapidly. Typical ASIC design requires $20-50 million, which limits the range of market segments to very few with volumes high enough to justify the initial engineering investment. Thus, if we do need to create more application-optimized computing systems, we will need to create a new approach to design that will allow a small team to create an application-specific chip at reasonable cost. That leads to this chapter’s overarching recommendation. Efforts are needed along multiple paths to deal with the power limitations that modern scaling and computer-chip designs are encountering. 35 Michael Wehner, Leonid Oliker, and John Shalf, 2008, Towards ultra-high resolution models of climate and weather, International Journal of High Performance Computing Ap - plications 22(2): 149-165.
OCR for page 102
102 THE FUTURE OF COMPUTING PERFORMANCE Chip Year Paper Description Chip Year Paper Description # # µP - S/390 1 1997 10.3 11 1998 18.1 DSP -Graphics µP – PPC 2 2000 5.2 12 1998 18.2 DSP - DSPs (SOI) Multimedia µP - G5 3 1999 5.2 13 2000 14.6 DSP – Multimedia µP - G6 4 2000 5.6 14 2002 22.1 DSP – Microprocessors Mpeg Decoder µP - Alpha 5 2000 5.1 15 1998 18.3 DSP - Multimedia µP - P6 6 1998 15.4 16 2001 21.2 Encryption Processor µP - Alpha 7 1998 18.4 17 2000 14.5 Hearing Aid Dedicated Processor µP – PPC 8 1999 5.6 18 2000 4.7 FIR for Disk Read Head 9 1998 18.6 DSP - 19 1998 2.1 MPEG DSP’s StrongArm Encoder 10 2000 4.2 DSP – Comm 20 2002 7.2 802.11a Baseband WLAN 100 NEC Energy (Power) Efficiency ( MOPS/mW ) DSP General Microprocessors Dedicated 10 P u r p o se D S P 1 PPC 0.1 0.01 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Chip Number FIGURE 3.5 Energy efficiency comparison of CPUs, DSPs, and ASICs. SOURCE: 3.5 bottom Robert Brodersen of the University of California, Berkeley, and Teresa Meng of Stanford University. Data published at International Solid-State Circuits Confer- ence (0.18- to 0.25-µm).
OCR for page 103
103 POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE 100 Aop (mm 2 per operation) 10 Microprocessors Dedicated General 1 Purpose DSP 0.1 0.01 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Chip Number FIGURE 3.6 Area efficiency comparison of CPUs, DSPs, and ASICs. SOURCE: Robert Brodersen of the University of3.6 California at Berkeley and Teresa Meng of Stanford University. Recommendation: Invest in research and development of parallel archi - tectures driven by applications, including enhancements of chip mul- tiprocessor systems and conventional data-parallel architectures, cost- effective designs for application-specific architectures, and support for radically different approaches. BIBLIOGRAPHY Broderson, Robert. “Interview: A Conversation with Teresa Meng.” ACM Queue 2(6): 14-21, 2004. Chabini, Noureddine, Ismaïl Chabini, El Mostapha Aboulhamid, and Yvon Savaria. “Meth - ods for Minimizing Dynamic Power Consumption in Synchronous Designs with Mul - tiple Supply Voltages.” In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 22(3): 346-351, 2003. Copeland, Jack. Colossus: The Secrets of Bletchley Park’s Code-Breaking Computers. New York: Oxford University Press, 2006. International Technology Roadmap for Semiconductors (ITRS). “System Drivers.” ITRS 2007 Edition. Available online at http://www.itrs.net/Links/2007ITRS/Home2007.htm. Kannan, Hari, Fei Guo, Li Zhao, Ramesh Illikkal, Ravi Iyer, Don Newell, Yan Solihin, and Christos Kozyrakis. “From Chaos to QoS: Case Studies in CMP Resource Manage - ment.” At the 2nd Workshop on Design, Architecture, and Simulation of Chip-Multi - processors (dasCMP). Orlando, Fla., December 10, 2006. Khailany, Brucek, William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles, and Andrew Chang. “Imagine: Media Pro- cessing with Streams.” IEEE Micro 21(2): 35-46, 2001.
OCR for page 104
104 THE FUTURE OF COMPUTING PERFORMANCE Knight, Tom. 1986.“An Architecture for Mostly Functional Languages.” In Proceedings of ACM Conference on LISP and Functional Programming. Cambridge, Mass., August 4-6, 1986, pp. 105-112. Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. “Niagara: A 32-way Multithreaded Sparc Processor.” IEEE Micro 25(2):21-29 2005, 2005. Lee, Edward A. “The Problem with Threads.” IEEE Computer 39(5): 33-42, 2006. Lee, Walter, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, and Saman Amarasinghe. “Space-time Scheduling of Instruction-level Parallel - ism on a Raw Machine.” In Proceedings of the Eighth International Conference on Architec- tural Support for Programming Language and Operating Systems. San Jose, Cal., October 3-7, 1998, pp. 46-57. Lomet, David B., “Process Structuring, Synchronization, and Recovery Using Atomic Ac - tions,” In Proceedings of the ACM Conference on Language Design for Reliable Software. Raleigh, N.C., March 28-30, 1977, pp. 128-137. Marković, Dejan, Borivoje Nikolić, and Robert W. Brodersen. “Power and Area Minimiza- tion for Multidimensional Signal Processing.” IEEE Journal of Solid-State Circuits 42(4): 922-934, 2007. Nowak, Edward J. “Maintaining the Benefits of CMOS Scaling When Scaling Bogs Down.” IBM Journal of Research and Development 46(2/3):169-180, 2002. Rixner, Scott, William J. Dally, Ujval J. Kapasi, Brucek Khailany, Abelardo López-Lagunas, Peter R. Mattson, and John D. Owens. “A Bandwidth-Efficient Architecture for Media Processing.” In Proceedings of the International Symposium on Microarchitecture. Dallas, Tex.: November 30-December 2, 1998, pp. 3-13, 1998. Rusu, Stefan, Simon Tam, Harry Muljono, David Ayers, and Jonathan Chang. “A Dual-core Multi-threaded Xeon Processor with 16MB L3 Cache.” In IEEE International Solid-State Circuits Conference Digest of Technical Papers. San Francisco, Cal., February 6-9, 2006, pp. 315-324. Sandararajan, Vijay, and Keshab Parhi. “Synthesis of Low Power CMOS VLSI Circuits Using Dual Supply Voltages.” In Proceedings of the 35th Design Automation Conference. New Orleans, La.., June 21-25, 1999, pp. 72-75. Sutter, Herb, and James Larus. “Software and the Concurrency Revolution.” ACM Queue 3(7): 54-62, 2005. Taur, Yuan, and Tak H. Ning, Fundamentals of Modern VLSI Devices, Ninth Edition, New York: Cambridge University Press, 2006. Thies, Bill, Michal Karczmarek, and Saman Amarasinghe. “StreamIt: A Language for Stream - ing Applications.” In Proceedings of the International Conference on Compiler Construction. Grenoble, France, April 8-12, 2002, pp. 179-196. Wehner, Michael, Leonid Oliker, and John Shalf. “Towards Ultra-High Resolution Models of Climate and Weather.” International Journal of High Performance Computing Application 22(2): 149-165, 2008. Zhao, Li, Ravi Iyer, Ramesh Illikkal, Jaideep Moses, Srihari Makineni, and Don Newell. “CacheScouts: Fine-Grain Monitoring of Shared Caches in CMP Platforms.” In Proceed- ings of the 16th International Conference on Parallel Architecture and Compilation Techniques. Brasov, Romania, September 15-19, 2007, pp. 339-352.