Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 80
3
Power Is Now Limiting Growth
in Computing Performance
T
he previous chapters define computer performance and discuss
why its continued growth is critical for the U.S. economy. This
chapter explores the relationship between computer performance
and power consumption. The limitations imposed by power consumption
are responsible for the present movement toward parallel computation.
This chapter argues that such limitations will constrain the computation
performance of even parallel computing systems unless computer design-
ers take fundamentally different approaches.
The laws of thermodynamics are evident to anyone who has ever
used a laptop computer on his or her lap: the computer becomes hot.
When you run a marathon, your body is converting an internal supply
of energy formed from the food you eat into two outputs: mechanical
forces produced by your body’s muscles and heat. When you drive your
car, the engine converts the energy stored in the gasoline into the kinetic
energy of the wheels and vehicle motion and heat. If you put your hand
on an illuminated light bulb, you discover that the bulb is not only radiat-
ing the desired light but also generating heat. Heat is the unwanted, but
inevitable, side effect of using energy to accomplish any physical task—it
is not possible to convert all the input energy perfectly into the desired
results without wasting some energy as heat.
A vendor who describes a “powerful computer” is trying to character-
ize the machine as fast, not thermally hot. In this report, the committee
uses power to refer to the use of electric energy and performance to mean
computational capability. When we refer to power in the context of a
80
OCR for page 81
81
POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE
computer system, we are talking about energy flows: the rate at which
energy must be provided to the computer system from a battery or wall
outlet, which is the same as the rate at which that energy, now converted
to heat, must be extracted from the system. The temperature of the chip
or system rises above the ambient temperature and causes heat energy to
flow into the environment. To limit the system’s temperature rise, heat
must be extracted efficiently. (Transferring heat from the computer system
to the environment is the task of the cooling subsystem in a computer.)
Thus, referring to a chip’s power requirements is equivalent to talking
about power consumed and dissipated.
When we talk about scaling computing performance, we implicitly
mean to increase the computing performance that we can buy for each
dollar we spend. If we cannot scale down the energy per function as fast
as we scale up the performance (functions per second), the power (energy
per second) consumed by the system will rise, and the increase in power
consumption will increase the cost of the system. More expensive hard-
ware will be needed to supply the extra power and then remove the heat
that it generates. The cost of managing the power in and out of the system
will rise to dominate the cost of the hardware.
Historically, technology scaling has done a good job of scaling down
energy cost per function as the total cost per function dropped, and so
the overall power needs of systems were relatively constant as perfor-
mance (functions per second) dramatically increased. Recently, the cost
per function has been dropping faster than the power per function, which
means that the overall power of constant-cost chips has been growing.
The power problem is getting worse because of the recent difficulty in
continuing to scale down power-supply voltages, as is described later in
this chapter.
Our ability to supply power and cool chips is not improving rapidly,
so for many computers the performance per dollar is now limited by
power issues. In addition, computers are increasingly available in a vari -
ety of form-factors and many, such as cell phones, have strict power limits
because of user constraints. People do not want to hold hot cell phones,
and so the total power budget needs to be under a few watts when the
phone is active. Designers today must therefore find the best performance
that can be achieved within a specified power envelope.
To assist in understanding these issues, this chapter reviews inte -
grated circuit (IC) technology scaling. It assumes general knowledge of
electric circuits, and some readers may choose to review the findings
listed here and then move directly to Chapter 4. The basic conclusions of
this chapter are as follows:
OCR for page 82
82 THE FUTURE OF COMPUTING PERFORMANCE
· Power consumption has become the limiting constraint on future
growth in single-processor performance.
· Power limitations on individual processors have resulted in chips
that have multiple lower-power processors that, in some applica-
tions, yield higher aggregate performance than chips with single
power-limited processors.
· Even as the computing industry successfully shifts to multiple,
simpler, and lower-power processor cores per chip, it will again
face performance limits within a few years because of aggregate,
full-chip power limitations.
· Like complementary metal oxide semiconductor (CMOS) tran-
sistors, many of the electronic devices being developed today as
potential replacements for CMOS transistors are based on regu-
lating the flow of electrons over a thermodynamic barrier and will
face their own power limits.
· A promising approach to enabling more power-efficient compu-
tation is to design application-specific or algorithm-specific com-
putational units. For that approach to succeed, new chip design
and verification methods (such as advanced electronic-design
automation tools) will need to be developed to reduce the time
and cost of IC design, and new IC fabrication methods will be
needed to reduced the one-time mask costs.
· The present move to chip multiprocessors is a step in the direction
of using parallelism to sustain growth in computing performance.
However, this or any other hardware architecture can succeed
only if appropriate software can be developed to exploit the par-
allelism in the hardware effectively. That software challenge is the
subject of Chapter 4.
For intrepid readers prepared to continue with this chapter, the com -
mittee starts by explaining how classic scaling enabled the creation of
cheaper, faster, and lower-power circuits; in essence, scaling is responsible
for Moore’s law. The chapter also discusses why modern scaling produces
smaller power gains than before. With that background in technology scal-
ing, the chapter then explains how computer designers have used improv-
ing technology to create faster computers. The discussion highlights why
processor performance grew faster than power efficiency and why the
problem is more critical today. The next sections explain the move to
chips that have multiple processors and clarify both the power-efficiency
advantage of parallel computing and the limitations of this approach. The
chapter concludes by mentioning some alternative technologies to assess
the potential advantages and the practicality of these approaches. Alter-
natives to general-purpose processors are examined as a way to address
OCR for page 83
83
POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE
power limitations. In short, although incremental advances in computing
performance will continue, overcoming the power constraint is difficult or
potentially impossible and will require radical rethinking of computation
and of the logic gates used to build computing systems.
BASIC TECHNOLOGY SCALING
Although this report focuses on computers based on IC processors, it
is useful to remember that computers have historically used various tech -
nologies. The earliest electric computers were built in the 1940s and used
mechanical relays.1,2 Vacuum tubes enabled faster electronic computers.
By the 1960s, the technology for building computers changed again, to
transistors that were smaller and had lower cost, lower power, and greater
reliability. Within a decade, computers migrated to ICs instead of discrete
transistors and were able to scale up performance as the technology scaled
down the size of transistors in the ICs. Each technology change, from
relays to vacuum tubes to transistors to ICs, decreased the cost, increased
the performance of each function, and decreased the energy per function.
Those three factors enabled designers to continue to build more capable
computers for the same cost.
Early IC computers were built with bipolar transistors3 in their ICs,
which offered high performance but used relatively high power (com -
pared with other IC options). By the late 1970s, low-end computers used
NMOS4 technology, which offered greater density and thus lower cost per
function but also lower-speed gates than the bipolar alternative. As scal -
ing continued, the cost per function was dropping rapidly, but the energy
needs of each gate were not dropping as rapidly, so the power-dissipation
1 Raúl Rojas, 1997, Konrad Zuse’s legacy: The architecture of the Z1 and Z3, IEEE Annals
of the History of Computing 19(2): 5-19.
2 IBM, 2010, Feeds, speeds and specifications, IBM Archives, website, available online at
http://www-03.ibm.com/ibm/history/exhibits/markI/markI_feeds.html.
3 Silicon ICs use one of two basic structures for building switches and amplifiers. Both
transistor structures modify the silicon by adding impurities to it that increase the concen -
tration of electric carriers—electrons for N regions and holes for P regions—and create three
regions: two Ns separated by a P or two Ps separated by an N. That the electrons are blocked
by holes (or vice versa) means that there is little current flow in all these structures. The first
ICs used NPN bipolar transistors, in which the layers are formed vertically in the material
and the current flow is a bulk property that requires electrons to flow into the P region (the
base) and holes to flow into the top N region (the emitter).
4 NMOS transistors are lateral devices that work by having a “gate” terminal that controls
the surface current flow between the “source” and “drain” contacts. The source-drain ter-
minals are doped N and supply the electrons that flow through a channel; hence the name
NMOS. Doping refers to introducing impurities to affect the electrical properties of the
semiconductor. PMOS transistors make the source and drain material P, so holes (electron
deficiencies) flow across the channels.
OCR for page 84
84 THE FUTURE OF COMPUTING PERFORMANCE
requirements of the chips were growing. By the middle 1980s, most pro -
cessor designers moved from bipolar and NMOS to CMOS5 technology.
CMOS gates were slower than those of NMOS or bipolar circuits but dis-
sipated much less energy, as described in the section below. Using CMOS
technology reduced the energy per function by over an order of mag-
nitude and scaled well. The remainder of this chapter describes CMOS
technology, its properties, its limitations, and how it affects the potential
for growth in computing performance.
CLASSIC CMOS SCALING
Computer-chip designers have used the scaling of feature sizes (that
is, the phenomenon wherein the same functionality requires less space on
a new chip) to build more capable, more complex devices, but the result-
ing chips must still operate within the power constraints of the system.
Early chips used circuit forms (bipolar or NMOS circuits) that dissipated
power all the time, whether the gate6 was computing a new value or just
holding the last value. Even though scaling allowed a decrease in the
power needed per gate, the number of gates on a chip was increasing
faster than the power requirements were falling; by the early to middle
1980s, chip power was becoming a design challenge. Advanced chips
were dissipating many watts;7 one chip, the HP Focus processor, for exam-
ple, was dissipating over 7 W, which at the time was a very large number.8
Fortunately, there was a circuit solution to the problem. It became
possible to build a type of gate that dissipated power only when the out-
put value changed. If the inputs were stable, the circuit would dissipate
practically no power. Furthermore, the gate dissipated power only as long
as it took to get the output to transition to its new value and then returned
to a zero-power state. During the transition, the gate’s power requirement
was comparable with those of the previous types of gates, but because
the transition lasts only a short time, even in a very active machine a gate
5 The C in CMOS stands for complementary. CMOS uses both NMOS and PMOS transistors.
6 A logic gate is a fundamental building block of a system. Gates typically have two to four
inputs and produce one input. These circuits are called logic gates because they compute
simple functions used in logic. For example, an AND gate takes two inputs (either 1s or 0s)
and returns 1 if both are 1s and 0 if either is 0. A NOT gate has only one input and returns
1 if the input is 0 and 0 if the input is 1.
7 Robert M. Supnick, 1984, MicroVAX 32, a 32 bit microprocessor, IEEE Journal of Solid
State Circuits 19(5): 675-681, available online at http://ieeexplore.ieee.org/stamp/stamp.js
p?arnumber=1052207&isnumber=22598.
8 Joseph W. Beyers, Louis J. Dohse, Jospeh P. Fucetola, Richard L. Kochis, Cliffird G. Lob,
Gary L. Taylor, and E.R. Zeller, 1981, A 32-bit VLSI CPU chip, IEEE Journal of Solid-State
Circuits 16(5): 537-542, available online at http://ieeexplore.ieee.org/stamp/stamp.jsp?ar
number=1051634&isnumber=22579.
OCR for page 85
85
POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE
TABLE 3.1 Scaling Results for Circuit Performance
Device or Circuit Parameter Scaling Factor
Device dimension tox, L, W 1/k
Doping concentration Na k
Voltage V 1/k
Current I 1/k
Capacitance eA/t 1/k
Delay time per circuit VC/I 1/k
2
Power dissipation per circuit VI 1/k
Power density VI/A 1
SOURCE: Reprinted from Robert H. Dennard, Fritz H. Gaensslen,
Hwa-Nien. Yu, V. Leo Rideout, Ernest Bassous, and Andre R. LeBlanc,
1974, Design of ion-implanted MOSFETS with very small physical
dimensions, IEEE Journal of Solid State Circuits 9(5): 256-268.
would be in transition around 1 percent of the time. Thus, moving to the
new circuit style decreased the power consumed by a computation by a
factor of over 30.9 The new circuit style was called complementary MOS,
or CMOS.
A further advantage of CMOS gates was that their performance and
power were completely determined by the MOS transistor properties.
In a classic 1974 paper, reprinted in Appendix D, Robert Dennard et
al. showed that the MOS transistor has a set of very convenient scaling
properties.10 The scaling properties are shown in Table 3.1, taken from that
paper. If all the voltages in a MOS device are scaled down with the physi -
cal dimensions, the operation of the device scales in a particularly favor-
able way. The gates clearly become smaller because linear dimensions are
scaled. That scaling also causes gates to become faster with lower energy
per transition. If all dimensions and voltages are scaled by the scaling fac-
tor k (k has typically been 1.4), after scaling the gates become (1/k)2 their
previous size, and k2 more gates can be placed on a chip of roughly the
same size and cost as before. The delay of the gate also decreases by 1/k,
and, most important, the energy dissipated each time the gate switches
decreases by (1/k)3. To understand why the energy drops so rapidly, note
that the energy that the gate dissipates is proportional to the energy that
is stored at the output of the gate. That energy is proportional to a quan-
9 The old style dissipated power only half the time; this is why the improvement was by
a factor of roughly 30.
10 Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien. Yu, V. Leo Rideout, Ernest Bassous,
and Andre R. LeBlanc, 1974, Design of ion-implanted MOSFETS with very small physical
dimensions, IEEE Journal of Solid State Circuits 9(5):256–268.
OCR for page 86
86 THE FUTURE OF COMPUTING PERFORMANCE
tity called capacitance11 and the square of the supply voltage. The load
capacitance of the wiring decreases by 1/k because the smaller gates make
all the wires shorter and capacitance is proportional to length. Therefore,
the power requirements per unit of space on the chip (mm2), or energy
per second per mm2, remain constant:
Power = (number of gates)(CLoad/gate)(Clock Rate)(Vsupply2)
Power density = Ng Cload Fclk Vdd2
Ng = CMOS gates per unit area
Cload = capacitive load per CMOS gate
Fclk = clock frequency
Vdd = supply voltage
Power density = ( k2 )( 1/k )( k )(1/k)2 = 1
That the power density (power requirements per unit space on the
chip, even when each unit space contains many, many more gates) can
remain constant across generations of CMOS scaling has been a critical
property underlying progress in microprocessors and in ICs in general. In
every technology generation, ICs can double in complexity and increase
in clock frequency while consuming the same power and not increasing
in cost.
Given that description of classic CMOS scaling, one would expect
the power of processors to have remained constant since the CMOS tran-
sition, but this has not been the case. During the late 1980s and early
1990s, supply voltages were stuck at 5 V for system reasons. So power
density would have been expected to increase as technology scaled from
2 mm to 0.5 mm. However, until recently supply voltage has scaled with
technology, but power densities continued to increase. The cause of the
discrepancy is explained in the next section. Note that Figure 3.1 shows no
microprocessors above about 130 W; this is because 130 W is the physical
limit for air cooling, and even approaching 130 W requires massive heat
sinks and local fans.
11 Capacitance is a measure of how much electric charge is needed to increase the voltage
between two points and is also the proportionality constant between energy stored on a wire
and its voltage. Larger capacitors require more charge (and hence more current) to reach
a voltage than a smaller capacitor. Physically larger capacitors tend to have larger capaci -
tance. Because all wires have at least some parasitic capacitance, even just signaling across
the internal wires of a chip dissipates some power. Worse, to minimize the time wasted in
charging or discharging, the transistors that drive the signal must be made physically larger,
and this increases their capacitance load, which the prior gate must drive, and costs power
and increases the incremental die size.
OCR for page 87
87
POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE
1,000
100
10
1
1985 1990 1995 2000 2005 2010
Year of Introduction
FIGURE 3.1 Microprocessor power dissipation (watts) over time (1985-2010).
HOW CMOS-PROCESSOR PERFORMANCE IMPROVED
EXPONENTIALLY, AND THEN SLOWED
Microprocessor performance, as measured against the SPEC2006
benchmark12,13,14 was growing exponentially at the rate of more than 50
percent per year (see Figure 3.2). That phenomenal single-processor per-
formance growth continued for 16 years and then slowed substantially15
partially because of power constraints. This section briefly describes how
those performance improvements were achieved and what contributed to
the slowdown in improvement early in the 2000s.
To achieve exponential performance growth, microprocessor design-
ers scaled processor-clock frequency and exploited instruction-level paral-
12 For older processors, SPEC2006 numbers were estimated from older versions of the
SPEC benchmark by using scaling factors.
13 John L. Henning, 2006, SPEC CPU2006 benchmark descriptions, ACM SIGARCH Com -
puter Architecture News 34(4): 1-17.
14 John L. Henning, 2007, SPEC CPU suite growth: An historical perspective, ACM SI -
GARCH Computer Architecture News 35(1): 65-68.
15 John L. Hennessy and David A. Patterson, 2006, Computer Architecture: A Quantitative
Approach, fourth edition, San Francisco, Cal.: Morgan Kauffman, pp. 2-4.
OCR for page 88
88 THE FUTURE OF COMPUTING PERFORMANCE
100
10
1
0
0.1
0
0.01
0.001
0
1985 1990 1995 2000 2005 2010
Year of Introduction
FIGURE 3.2 Integer application performance (SPECint2006) over time (1985-2010).
lelism (ILP) to increase the number of instructions per cycle.16,17,18 The power
problem arose primarily because clock frequencies were increasing faster
than the basic assumption in Dennard scaling (described previously).
The assumption there is that clock frequency will increase inversely pro -
portionally to the basic gate speed. But the increases in clock frequency
were made because of improvements in transistor speed due to CMOS-
technology scaling combined with improved circuits and architecture. The
designs also included deeper pipelining and required less logic (fewer
operations on gates) per pipeline stage.19
Separating the effect of technology scaling from those of the other
improvements requires examination of metrics that depend solely on the
improvements in underlying CMOS technology (and not other improve-
16 Ibid.
17 Mark Horowitz and William Dally, 2004, How scaling will change processor architecture,
IEEE International Solid States Circuits Conference Digest of Technical Papers, San Fran -
cisco, Cal., February 15-19, 2004, pp. 132-133.
18 Vikas Agarwal, Stephen W. Keckler, and Doug Burger, 2000, Clock rate versus IPC: The
end of the road for conventional microarchitectures, Proceedings of the 27th International
Symposium Computer Architecture, Vancouver, British Columbia, Canada, June 12-14, 2000,
pp. 248-259.
19 Pipelining is a technique in which the structure of a processor is partitioned into simpler,
sequential blocks. Instructions are then executed in assembly-line fashion by the processor.
OCR for page 89
89
POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE
ments in circuits and architecture). (See Box 3.1 for a brief discussion of
this separation.) Another contribution to increasing power requirements
per chip has been the nonideal scaling of interconnecting wires between
CMOS devices. As the complexity of computer chips increased, it was
not sufficient simply to place two copies of the previous design on the
new chip. To yield the needed performance improvements, new commu-
BOX 3.1
Separating the Effects of CMOS Technology Scaling
on Performance by Using the FO4 Metric
To separate the effect of CMOS technology scaling from other sorts of
optimizations, processor clock-cycle time can be characterized by using the
technology-dependent delay metric fanout-of-four delay (FO4), which is defined
as the delay of one inverter driving four copies of an equally sized inverter.1,2
The metric measures the clock cycle in terms of the basic gate speed and gives
a number that is relatively technology-independent. In Dennard scaling, FO4/
cycle would be constant. As it turns out, clock-cycle time decreased from 60-90
FO4 at the end of the 1980s to 12-25 in 2003-2004. The increase in frequency
caused power to increase and, combined with growing die size, accounted for
most of the power growth until the early 2000s.
That fast growth in clock rate has stopped, and in the most recent machines
the number of FO4 in a clock cycle has begun to increase. Squeezing cycle time
further does not result in substantial performance improvements, but it does
increase power dissipation, complexity, and cost of design.3,4 As a result, clock
frequency is not increasing as fast as before (see Figure 3.3). The decrease in
the rate of growth in of clock frequency is also forecast in the 2009 ITRS semi-
conductor roadmap,5 which shows the clock rate for the highest-performance
single processors no more than doubling each decade over the foreseeable
future.
1 David Harris, Ron Ho, Gu-Yeon Wei, and Mark Horowitz, The fanout-of-4 inverter delay
metric, Unpublished manuscript, May 29, 2009, available online at http://www-vlsi.stanford.
edu/papers/dh_vlsi_97.pdf.
2 David Harris and Mark Horowitz, 1997, Skew-tolerant Domino circuits, IEEE Journal of
Solid-State Circuits 32(11): 1702-1711.
3 Mark Horowitz and William Dally, 2004, How scaling will change processor architecture,
IEEE International Solid States Circuits Conference Digest of Technical Papers, San Francisco,
Cal., February 15-19, 2004, pp. 132-133.
4Vikas Agarwal, Stephen W. Keckler, and Doug Burger, 2000, Clock rate versus IPC: The
end of the road for conventional microarchitectures. Proceedings of the 27th International Sym-
posium on Computer Architecture, Vancouver, British Columbia, Canada, June 12-14, 2000,
pp. 248-259.
5See http://www.itrs.net/Links/2009ITRS/Home2009.htm.
OCR for page 90
90 THE FUTURE OF COMPUTING PERFORMANCE
nication paths across the entire machine were needed—interconnections
that did not exist in the previous generation. To provide the increased
interconnection, it was necessary to increase the number of levels of
metal interconnection available on a chip, and this increased the total
load capacitance faster than assumed in Dennard scaling. Another fac-
tor that has led to increases in load capacitance is the slight scaling up
of wire capacitance per length. That has been due to increasing side-to-
side capacitance because practical considerations limited the amount of
vertical scaling possible in wires. Technologists have attacked both those
issues by creating new insulating materials that had lower capacitance
per length (known as low K dielectrics); this has helped to alleviate the
problem, but it continues to be a factor in shrinking technologies.
One reason that increasing clock rate was pushed so hard in the 1990s,
apart from competitive considerations in the chip market, was that find-
ing parallelism in an application constructed from a sequential stream of
instructions (ILP) was difficult, required large hardware structures, and
was increasingly inefficient. Doubling the hardware (number of transistors
available) generated only about a 50 percent increase in performance—a
relationship that at Intel was referred to as Pollack’s rule.20 To continue to
scale performance required dramatic increases in clock frequency, which
drove processor power requirements. By the early 2000s, processors had
attained power dissipation levels that were becoming difficult to handle
cheaply, so processor power started to level out. Consequently, single-
processor performance improvements began to slow. The upshot is a core
finding and driver of the present report (see Figure 3.3), namely,
Finding: After many decades of dramatic exponential growth, single-
processor performance is increasing at a much lower rate, and this situ -
ation is not expected to improve in the foreseeable future.
HOW CHIP MULTIPROCESSORS ALLOW SOME
CONTINUED PERFORMANCE-SCALING
One way around the performance-scaling dilemma described in the
previous section is to construct computing systems that have multiple,
explicitly parallel processors. For parallel applications, that arrangement
should get around Pollack’s rule; doubling the area should double the
20 Patrick P. Gelsinger, 2001, Microprocessors for the new millennium: Challenges, op -
portunities, and new frontiers, IEEE International Solid-State Circuits Conference Digest
of Technical Papers, San Francisco, Cal., February 5-7, 2001, pp. 22-25. Available online at
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=912412&isnumber=19686.
OCR for page 94
94 THE FUTURE OF COMPUTING PERFORMANCE
PROBLEMS IN SCALING NANOMETER DEVICES
If voltages could continue to be scaled with feature size (following
classic Dennard scaling), CMP performance could continue to be scaled
with technology. However, early in this decade scaling ran into some
fundamental limits that make it impossible to continue along that path, 26
and the improvements in both performance and power achieved with
technology scaling have slowed from their historical rates. The net result
is that even CMPs will run into power limitations. To understand those
issues and their ramifications, we need to revisit technology scaling and
look at one aspect of transistor performance that we ignored before: leak -
age current.
As described earlier, CMOS circuits have the important property
that they dissipate energy only when a node changes value. Consider
the simple but representative CMOS logic circuits in Figure 3.4. One type
of CMOS device, a pMOS transistor, is connected to the power supply
(Vsupply). When its input is low (Vgnd), it turns on, connects Vsupply to
the output, and drives the output high to Vsupply. When the input to the
pMOS device is high (Vsupply), it disconnects the output from Vsupply. The
other type of CMOS device, an nMOS transistor, has the complementary
behavior: when its input is high (Vsupply), it connects the output to Vgnd;
when its input is low (Vgnd), it disconnects the output from Vgnd. Because
of the construction of the CMOS logic, the pMOS and nMOS transistors
are never driving the output at the same time. Hence, the only current
that flows through the gate is that needed to charge or discharge the
capacitances associated with the gate, so the energy consumed is mostly
the energy needed to change the voltage on a capacitor with transistors,
which is Cload multiplied by Vsupply2. For that analysis to hold, it is impor-
tant that the off transistors not conduct any current in the off state: that
is, they should have low leakage.
However, the voltage scaling that the industry has been following has
indirectly been increasing leakage current. Transistors operate by chang-
ing the height of an energy barrier to modulate the number of carriers that
can flow across them. One might expect a fairly sharp current transition,
so that when the barrier is higher than the energy of the carriers, there
is no current, and when it is lowered, the carriers can “spill” over and
flow across the transistor. The actual situation is more complex. The basic
reason is related to thermodynamics. At any finite temperature, although
26 Sam Naffziger reviews the Vdd limitations and describes various approaches (circuit,
architecture, and so on) to future processor design given the voltage scaling limitations in
the article High-performance processors in a power-limited world, Proceedings of the IEEE
Symposium on VLSI Circuits, Honolulu, Hawaii, June 15-17, 2006, pp. 93-97, available on -
line at http://ewh.ieee.org/r5/denver/sscs/Presentations/2006_11_Naffziger_paper.pdf.
OCR for page 95
95
POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE
Vsupply
Vsupply
pMOS
Output
Input Output
Input B
nMOS
Input A
Vgnd
V gnd
FIGURE 3.4 Representative CMOS logic circuits.
there is a well-defined average energy for the carriers, the energy of each
individual carrier follows a probability distribution. The probability of
having an energy higher than the average falls off exponentially, with a
characteristic scale factor that is proportional to the temperature of the
transistors measured measured in kelvins. The hotter the device, the
wider the range of energies that the carriers can have.
That energy distribution is critical in the building of transistors. Even
with an energy barrier that is higher than the average energy of the car-
riers, some carriers will flow over the barrier and through the transistor;
the transistor will continue to conduct some current when we would like
it to be off. The energy scale is kT, where k is the Boltzmann constant and
T is the temperature in kelvins. We can convert it into voltage by divid-
ing energy by the charge on the particle, an electron in this case: q = 1.6
× 10–19 coulombs. kT/q is around 26 mV at room temperature. Thus, the
current through an off transistor drops exponentially with the height of
the energy barrier, falling by slightly less than a factor of 3 for each 26-mV
increase in the barrier height. The height of the barrier is normally called
the threshold voltage (Vth) of the transistor, and the leakage current can
be written as
q(Vgs – Vth )
I ds = I o e ,
α kT
OCR for page 96
96 THE FUTURE OF COMPUTING PERFORMANCE
where Io is a constant around 1 mA (microampere) per micrometer of tran-
sistor width at room temperature, Vgs is the voltage applied to the control
gate of the transistor, and a is a number greater than 1 (generally around
1.3) that represents how effectively the gate voltage changes the energy
barrier. From the equation, it is easy to see that the amount of leakage cur-
rent through an off transistor (Vgs = 0) depends heavily on the transistor’s
threshold voltage. The leakage current increases by about a factor of 10
each time the threshold voltage drops by another 100 mV.
Historically, Vths were around 800 mV, so the residual transistor leak-
age currents were so small that they did not matter. Starting from high
Vth values, it was possible to scale Vth, Vsupply, and L together. While leak-
age current grew exponentially with shrinking Vth, the contribution of
subthreshold leakage to the overall power was negligible as long as Vth
values were still relatively large. But ultimately by the 90-nm node, the
leakage grew to a point where it started to affect overall chip power.27 At
that point, Vth and Vsupply scaling slowed dramatically.
One approach to reduce leakage current is to reduce temperature,
inasmuch as this makes the exponential slope steeper. That is possible and
has been tried on occasion, but it runs into two problems. The first is that
one needs to consider the power and cost of providing a low-temperature
environment, which usually dwarf the gains provided by the system;
this is especially true for small or middle-size systems that operate in
an office or home environment. The second is related to testing, repair,
thermal cycling, and reliability of the systems. For those reasons, we will
not consider this option further in the present report. However, for suf -
ficiently large computing centers, it may prove advantageous to use liquid
cooling or other chilling approaches where the energy costs of operating
the semiconductor hardware in a low-temperature environment do not
outweigh the performance gains, and hence energy savings, that are pos-
sible in such an environment.
Vth stopped scaling because of increasing leakage currents, and Vsupply
scaling slowed to preserve transistor speed with a constant (Vsupply – Vth).
Once leakage becomes important, an interesting optimization between
Vsupply and Vth is possible. Increasing Vth decreases leakage current but
also makes the gates slower because the number of carriers that can flow
through a transistor is roughly proportional to the decreasing (Vsupply –
Vth). One can recover the lost speed by increasing Vsupply, but this also
increases the power consumed to switch the gate dynamically. For a given
gate delay, the lowest-power solution is one in which the marginal energy
cost of increasing Vdd is exactly balanced by the marginal energy savings
27 Edward J. Nowak, 2002, Maintaining the benefits of CMOS scaling when scaling bogs
down, IBM Journal of Research and Development 46(2/3): 169-180.
OCR for page 97
97
POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE
of increasing Vth. The balance occurs when the static leakage power is
roughly 30 percent of the dynamic power dissipation.
This leakage-constrained scaling began at roughly the 130-nm tech -
nology node, and today both Vsupply and Vth scaling have dramatically
slowed; this has also changed how gate energy and speed scale with tech-
nology. The energy required to switch a gate is C multiplied by V supply2,
which scales only as 1/k if Vsupply is not scaling. That means that technol-
ogy scaling reduces the power by k only if the scaled circuit is run at the
same frequency. That is, if gate speed continued to increase, half the die
(the size of the scaled circuit) would dissipate the same power as the
full die in the previous generation and would operate k times, that is
1.4 times, faster, much less than the three-fold performance increase we
have come to expect. Clearly, that is not optimal, so many designers are
scaling Vdd slightly to increase the energy savings. That works but lowers
the gate speeds, so some parallelism is needed just to recover from the
slowing single-thread performance. The poor scaling will eventually limit
the performance of CMPs.
Combining the lessons of the last several sections of this chapter,
the committee concluded that neither CMOS nor chip multiprocessors
can overcome the power limits facing modern computer systems. That
leads to another core conclusion of this report. Basic laws of physics and
constraints on chip design mean that the growth in the performance
of computer systems will become limited by their power and thermal
requirements within the next decade. Optimists might hope that new
technologies and new research could overcome that limitation and allow
hardware to continue to drive future performance scaling akin to what
we have seen with single-thread performance, but there are reasons for
caution, as described in the next section.
Finding: The growth in the performance of computing systems—even
if they are multiple-processor parallel systems—will become limited by
power consumption within a decade.
ADVANCED TECHNOLOGY OPTIONS
If CMOS scaling, even in chip-multiprocessor designs, is reaching
limits, it is natural to ask whether other technology options might get
around the limits and eventually overtake CMOS, as CMOS did to nMOS
and bipolar circuits in the 1980s. The answer to the question is mixed. 28 It
28 Mark Bohr, Intel senior fellow, gave a plenary talk at ISSCC 2009 on scaling in an SOC
world in which he argues that “our challenge . . . is to recognize the coming revolutionary
OCR for page 98
98 THE FUTURE OF COMPUTING PERFORMANCE
is clear that new technologies and techniques will be created and applied
to scaled technologies, but these major advances—such as high-k gate
dielectrics, low-K interconnect dielectrics, and strained silicon—will prob-
ably be used to continue technology scaling in general and not create a
disruptive change in the technology. Recent press reports make it clear, for
example, that Intel expects to be using silicon supplemented with other
materials in future generations of chips.29
A recent study compared estimated gate speed and energy of transis-
tors built with exotic materials that should have very high performance.30
Although the results were positive, the maximum improvement at the
same power was modest, around a factor of 2 for the best technology.
Those results should not be surprising. The fundamental problem is that
Vth does not scale, so it is hard to scale the supply voltage. The limitation
on Vth is set by leakage of carriers over an energy barrier, so any device
that modulates current by changing an energy barrier should have similar limi-
tations. All the devices used in the study cited above used the same cur-
rent-control method, as do transistors made from nanotubes, nanowires,
graphene, and so on. To get around that limitation, one needs to change
“the game” and build devices that work by different principles. A few
options are being pursued, but each has serious issues that would need
to be overcome before they could become practical.
One alternative approach is to stop using an energy barrier to con-
trol current flow and instead use quantum mechanical tunneling. That
approach eliminates the problem with the energy tails by using carriers
that are constrained by the energy bands in the silicon, which have fixed
levels. Because there is no energy tail, they can have, in theory, a steep
turnon characteristic. Many researchers are trying to create a useful device
of this type, but there are a number of challenges. The first is to create a
large enough current ratio in a small enough voltage range. The tunneling
current will turn on rapidly, but its increase with voltage is not that rapid.
Because a current ratio of around 10,000 is required, we need a device that
can transition through this current range in a small voltage (<400 mV).
changes and opportunities and to prepare to utilize them (Mark Bohr, 2009, The new era
of scaling in an SOC world, IEEE International Solid-State Circuits Conference , San Fran-
cisco, Cal., February 9, 2009, available online at http://download.intel.com/technology/
architecture-silicon/ISSCC_09_plenary_paper_Bohr.pdf).
29 Intel CEO Paul Ottelini was said to have declared that silicon was in its last decade as the
base material of the CPU (David Flynn, 2009, Intel looks beyond silicon for processors past
2017, Apcmag.com, October 29, 2009, available online at http://apcmag.com/intel-looks-
beyond-silicon-for-processors-past-2017.htm).
30 Donghyun Kim, Tejas Krishnamohan1, and Krishna C. Saraswat, 2008, Performance
evaluation of 15nm gate length double-gate n-MOSFETs with high mobility channels: III-V,
Ge and Si, Electrochemical Society Transactions 16(11): 47-55.
OCR for page 99
99
POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE
Even if one can create a device with that current ratio, another problem
arises. The speed of the gates depends on the transistor current. So not
only do we need the current ratio, we also need devices that can supply
roughly the same magnitude of current as CMOS transistors provide.
Tunnel currents are often small, so best estimates indicate that tunnel
FETs might be much slower (less current) than in CMOS transistors. Such
slowness will make their adoption difficult.
Another group of researchers are trying to leverage the collective
effort of many particles together to get around the voltage limits of CMOS.
Recall that the operating voltage is set by the thermal energy (kT) divided
by the charge on one electron, because that is the charged particle. If the
charged particle had a charge of 2q, the voltage requirements would be
half what it is today. That is the approach that nerve cells use to operate
robustly at low voltages. The proteins in the voltage-activated ion chan-
nels have a charge that allows them to operate easily at 100 mV. Although
some groups have been trying to create paired charge carriers, most are
looking at other types of cooperative processes. The ion channels in nerves
go though a physical change, so many groups are trying to build logic
from nanorelays (nanomicroelectromechanical systems, or nano MEMS).
Because of the large number of charges on the gate electrode and the posi-
tive feedback intrinsic in electrostatic devices, it is theoretically possible
to have very low operating voltages; indeed, operation down to a couple
of tenths of a volt seems possible. Even as researchers work to overcome
that hurdle, there are a number of issues that need to be addressed. The
most important is determining the minimum operating voltage that can
reliably overcome contact sticking. It might not take much voltage to cre-
ate a contact, but if the two surfaces that connect stick together (either
because of molecular forces or because of microwelding from the current
flow), larger voltages will be needed to break the contact. Devices will
have large numbers of these structures, so the voltage must be less than
CMOS operating voltages at similar performance. The second issue is per-
formance and reliability. This device depends on a mechanically moving
structure, so the delay will probably be larger than that of CMOS (around
1 nanosecond), and it will probably be an additional challenge to build
structures that can move for billions of cycles without failing.
There is also promising research in the use of electron-spin-based
devices (spintronics) in contrast with the charge-based devices (electron -
ics) in use today. Spin-based devices—and even pseudospin devices, such
as the BiSFET31—have the potential to greatly reduce the power dissi-
31 Sanjay K.Banerjee, Leonard F. Register, Emanuel Tutuc, Dharmendar Reddy, and Allan
H. MacDonald, 2009, Bilayer pseudospin field-effect transistor (BiSFET): A proposed new
logic device, IEEE Electron Device Letters 30(2): 158-160.
OCR for page 100
100 THE FUTURE OF COMPUTING PERFORMANCE
pated in performing basic logic functions. However, large fundamental
and practical problems remain to be solved before spintronic systems
can become practical.32 Those or other approaches (such as using the
correlation of particles in ferro materials33) might yield a breakthrough.
However, given the complexity of today’s chips, with billions of working
transistors, it is likely to take at least a decade to introduce any new tech -
nology into volume manufacturing. Thus, although we should continue
to invest in technology research, we cannot count on it to save the day. It
is unlikely to change the situation in the next decade.
Recommendation: Invest in research and development to make com-
puter systems more power-efficient at all levels of the system, includ -
ing software, application-specific approaches, and alternative devices.
R&D should be aimed at making logic gates more power-efficient. Such
efforts should address alternative physical devices beyond incremental
improvements in today’s CMOS circuits.
APPLICATION-SPECIFIC INTEGRATED CIRCUITS
Although the shift toward chip multiprocessors will allow industry
to continue to scale the performance of CMPs based on general-purpose
processor cores for some time, general-purpose chip multiprocessors will
reach their own limit. As discussed earlier, CMP designers can trade off
single-thread performance of individual processors against lower energy
dissipation per instruction, thus allowing more instructions by multiple
processors while the same amount of energy is dissipated by the chip.
However, that is possible only within some range of energy performance.
Beyond some limit, lowering energy per instruction by processor sim -
plification can lead to overall CMP performance degradation because
processor performance starts to decrease faster than energy per instruc -
tion. That range is likely to be a factor of about 10, that is, energy per
instruction cannot be reduced by more than a factor of 10 compared with
the highest-performance single-processor chip, such as the Intel Pentium
4 or the Intel Itanium.34
When such limits are reached, we will need to create other approaches
32 In their article, cited in the preceding footnote, Banerjee et al. look at a promising technol -
ogy that still faces many challenges.
33 See, for instance, the research of Sayeef Salahuddin at the University of California,
Berkeley.
34 The real gain might be even smaller because with an increase in the number of proces -
sors on the chip, more energy will be dissipated by the memory system and interconnect, or
the performance of many parallel applications will scale less than linearly with the number
of processors.
OCR for page 101
101
POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE
to create an energy-efficient computation unit. On the basis of the histori-
cal data, the answer seems clear: we will need to create more application-
optimized processing units. It is well known that tuning the hardware and
software toward a specific application or set of applications allows a more
energy-efficient solution. That work started with the digital watch many
decades ago and continues today. Figure 3.5 shows data for general-pur-
pose processors, digital-signal processors, and application-specific inte -
grated circuits (ASICs) from publications presented at the International
Solid-State Circuits Conference. The data are somewhat dated, but all
chips were designed for similar 0.18- to 0.25-µm CMOS technology, and
one can see that the ASIC designs are roughly 3 orders of magnitude more
energy-efficient than the general-purpose processors.
The main reason for such a difference is a combination of algorithm
and hardware tuning and the ability to reduce the use of large memory
structures as general interconnects: instead of a value’s being stored in a
register or memory, it is consumed by the next function unit. Doing only
what needs to be done saves both energy and area (see Figure 3.6).
More recently, researchers at Lawrence Berkeley National Laboratory,
interested in building peta-scale supercomputers for kilometer-scale cli -
mate modeling, argued that designing a specialized supercomputer based
on highly efficient customizable embedded processors can be attractive
in terms of energy cost.35 For example, they estimated that a peta-scale
climate supercomputer built with custom chips would consume 2.5 MW
of electric power whereas a computer with the same level of performance
but built with general-purpose AMD processors would require 179 MW.
The current design trend, however, is away from building custom-
ized solutions; increasing design complexity has caused the nonrecurring
engineering costs for designing these chips to grow rapidly. Typical ASIC
design requires $20-50 million, which limits the range of market segments
to very few with volumes high enough to justify the initial engineering
investment. Thus, if we do need to create more application-optimized
computing systems, we will need to create a new approach to design that
will allow a small team to create an application-specific chip at reasonable
cost. That leads to this chapter’s overarching recommendation. Efforts
are needed along multiple paths to deal with the power limitations that
modern scaling and computer-chip designs are encountering.
35 Michael Wehner, Leonid Oliker, and John Shalf, 2008, Towards ultra-high resolution
models of climate and weather, International Journal of High Performance Computing Ap -
plications 22(2): 149-165.
OCR for page 102
102 THE FUTURE OF COMPUTING PERFORMANCE
Chip Year Paper Description Chip Year Paper Description
# #
µP - S/390
1 1997 10.3 11 1998 18.1 DSP -Graphics
µP – PPC
2 2000 5.2 12 1998 18.2 DSP -
DSPs
(SOI) Multimedia
µP - G5
3 1999 5.2 13 2000 14.6 DSP –
Multimedia
µP - G6
4 2000 5.6 14 2002 22.1 DSP –
Microprocessors Mpeg Decoder
µP - Alpha
5 2000 5.1 15 1998 18.3 DSP -
Multimedia
µP - P6
6 1998 15.4 16 2001 21.2 Encryption
Processor
µP - Alpha
7 1998 18.4 17 2000 14.5 Hearing Aid
Dedicated Processor
µP – PPC
8 1999 5.6 18 2000 4.7 FIR for Disk
Read Head
9 1998 18.6 DSP - 19 1998 2.1 MPEG
DSP’s StrongArm Encoder
10 2000 4.2 DSP – Comm 20 2002 7.2 802.11a
Baseband
WLAN
100
NEC
Energy (Power) Efficiency ( MOPS/mW )
DSP
General
Microprocessors
Dedicated
10
P u r p o se D S P
1
PPC
0.1
0.01
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Chip Number
FIGURE 3.5 Energy efficiency comparison of CPUs, DSPs, and ASICs. SOURCE:
3.5 bottom
Robert Brodersen of the University of California, Berkeley, and Teresa Meng of
Stanford University. Data published at International Solid-State Circuits Confer-
ence (0.18- to 0.25-µm).
OCR for page 103
103
POWER IS NOW LIMITING GROWTH IN COMPUTING PERFORMANCE
100
Aop (mm 2 per operation)
10
Microprocessors Dedicated
General
1
Purpose DSP
0.1
0.01
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Chip Number
FIGURE 3.6 Area efficiency comparison of CPUs, DSPs, and ASICs. SOURCE:
Robert Brodersen of the University of3.6
California at Berkeley and Teresa Meng of
Stanford University.
Recommendation: Invest in research and development of parallel archi -
tectures driven by applications, including enhancements of chip mul-
tiprocessor systems and conventional data-parallel architectures, cost-
effective designs for application-specific architectures, and support for
radically different approaches.
BIBLIOGRAPHY
Broderson, Robert. “Interview: A Conversation with Teresa Meng.” ACM Queue 2(6): 14-21,
2004.
Chabini, Noureddine, Ismaïl Chabini, El Mostapha Aboulhamid, and Yvon Savaria. “Meth -
ods for Minimizing Dynamic Power Consumption in Synchronous Designs with Mul -
tiple Supply Voltages.” In IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 22(3): 346-351, 2003.
Copeland, Jack. Colossus: The Secrets of Bletchley Park’s Code-Breaking Computers. New York:
Oxford University Press, 2006.
International Technology Roadmap for Semiconductors (ITRS). “System Drivers.” ITRS 2007
Edition. Available online at http://www.itrs.net/Links/2007ITRS/Home2007.htm.
Kannan, Hari, Fei Guo, Li Zhao, Ramesh Illikkal, Ravi Iyer, Don Newell, Yan Solihin, and
Christos Kozyrakis. “From Chaos to QoS: Case Studies in CMP Resource Manage -
ment.” At the 2nd Workshop on Design, Architecture, and Simulation of Chip-Multi -
processors (dasCMP). Orlando, Fla., December 10, 2006.
Khailany, Brucek, William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung
Namkoong, John D. Owens, Brian Towles, and Andrew Chang. “Imagine: Media Pro-
cessing with Streams.” IEEE Micro 21(2): 35-46, 2001.
OCR for page 104
104 THE FUTURE OF COMPUTING PERFORMANCE
Knight, Tom. 1986.“An Architecture for Mostly Functional Languages.” In Proceedings of
ACM Conference on LISP and Functional Programming. Cambridge, Mass., August 4-6,
1986, pp. 105-112.
Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. “Niagara: A 32-way
Multithreaded Sparc Processor.” IEEE Micro 25(2):21-29 2005, 2005.
Lee, Edward A. “The Problem with Threads.” IEEE Computer 39(5): 33-42, 2006.
Lee, Walter, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek
Sarkar, and Saman Amarasinghe. “Space-time Scheduling of Instruction-level Parallel -
ism on a Raw Machine.” In Proceedings of the Eighth International Conference on Architec-
tural Support for Programming Language and Operating Systems. San Jose, Cal., October
3-7, 1998, pp. 46-57.
Lomet, David B., “Process Structuring, Synchronization, and Recovery Using Atomic Ac -
tions,” In Proceedings of the ACM Conference on Language Design for Reliable Software.
Raleigh, N.C., March 28-30, 1977, pp. 128-137.
Marković, Dejan, Borivoje Nikolić, and Robert W. Brodersen. “Power and Area Minimiza-
tion for Multidimensional Signal Processing.” IEEE Journal of Solid-State Circuits 42(4):
922-934, 2007.
Nowak, Edward J. “Maintaining the Benefits of CMOS Scaling When Scaling Bogs Down.”
IBM Journal of Research and Development 46(2/3):169-180, 2002.
Rixner, Scott, William J. Dally, Ujval J. Kapasi, Brucek Khailany, Abelardo López-Lagunas,
Peter R. Mattson, and John D. Owens. “A Bandwidth-Efficient Architecture for Media
Processing.” In Proceedings of the International Symposium on Microarchitecture. Dallas,
Tex.: November 30-December 2, 1998, pp. 3-13, 1998.
Rusu, Stefan, Simon Tam, Harry Muljono, David Ayers, and Jonathan Chang. “A Dual-core
Multi-threaded Xeon Processor with 16MB L3 Cache.” In IEEE International Solid-State
Circuits Conference Digest of Technical Papers. San Francisco, Cal., February 6-9, 2006,
pp. 315-324.
Sandararajan, Vijay, and Keshab Parhi. “Synthesis of Low Power CMOS VLSI Circuits Using
Dual Supply Voltages.” In Proceedings of the 35th Design Automation Conference. New
Orleans, La.., June 21-25, 1999, pp. 72-75.
Sutter, Herb, and James Larus. “Software and the Concurrency Revolution.” ACM Queue
3(7): 54-62, 2005.
Taur, Yuan, and Tak H. Ning, Fundamentals of Modern VLSI Devices, Ninth Edition, New York:
Cambridge University Press, 2006.
Thies, Bill, Michal Karczmarek, and Saman Amarasinghe. “StreamIt: A Language for Stream -
ing Applications.” In Proceedings of the International Conference on Compiler Construction.
Grenoble, France, April 8-12, 2002, pp. 179-196.
Wehner, Michael, Leonid Oliker, and John Shalf. “Towards Ultra-High Resolution Models of
Climate and Weather.” International Journal of High Performance Computing Application
22(2): 149-165, 2008.
Zhao, Li, Ravi Iyer, Ramesh Illikkal, Jaideep Moses, Srihari Makineni, and Don Newell.
“CacheScouts: Fine-Grain Monitoring of Shared Caches in CMP Platforms.” In Proceed-
ings of the 16th International Conference on Parallel Architecture and Compilation Techniques.
Brasov, Romania, September 15-19, 2007, pp. 339-352.