1 Computer and Semiconductor Technology Trends and Implications

Computing and information and communications technology has had incredible effects on nearly every sector of society. Until recently, advances in information and communications technology have been driven by steady and dramatic gains in single-processor (core) speeds. However, current and future generations of users, developers, and innovators will be unable to depend on these improvements in computing performance.

In the several decades leading up to the early 2000s, single-core processor performance doubled about every 2 years. These repeated performance doublings came to be referred to in the popular press as "Moore's Law," even though Moore's Law itself was a narrow observation about the economics of chip lithography feature sizes.1 This popular understanding of Moore's Law was enabled by both technology--higher clock rates, reductions in transistor size, and faster switching via fabrication improvements--and architectural and compiler innovations that increased performance while preserving software compatibility with previous-generation processors. Ongoing and predictable improvements in processor performance created a cycle of improved single-processor performance followed by enhanced software functionality. However, it is no longer possible to increase performance via higher clock rates, because of power and heat dissipation constraints. These constraints are themselves manifestations of more fundamental challenges in materials science and semiconductor physics at increasingly small feature sizes.

A National Research Council (NRC) report, The Future of Computing Performance: Game Over or Next Level?,2 explored the causes and implications of the slowdown in the historically dramatic exponential growth in computing performance and the end of the dominance of the single microprocessor in computing. The findings and recommendations from that report are provided in Appendix D. The authoring committee of this report concurs with those findings and recommendations. This chapter draws on material in that report and the committee's own expertise and discusses the technological challenges to sustaining growth in computing performance and their implications for computing and innovation. The chapter concludes with a discussion of the implications of these technological realities for United States defense. Subsequent chapters have a broader emphasis, beyond technology, on the implications for global technology policy and innovation issues.

1.1 Interrelated Challenges to Continued Performance Scaling

The reasons for the slowdown in the traditional exponential growth in computing performance are many. Several technical drivers have led to a shift from ever-faster single-processor computer chips as the foundation for nearly all computing devices to an emphasis on what have been called "multicore" processors--placing

1 The technological and economic challenges are intertwined. For example, Moore's Law is enabled by the revenues needed to fund the research and development necessary to advance the technology. See, for example, "The Economic Limit to Moore's Law," IEEE Transactions on Semiconductor Manufacturing, Vol. 24, No. 1, February 2011.
2 NRC, The Future of Computing Performance: Game Over or Next Level?, Washington, D.C.: The National Academies Press (available online at http://www.nap.edu/catalog.php?record_id=12980).
6 THE GLOBAL ECOSYSTEM IN ADVANCED COMPUTING
multiple processors, sometimes of differing power and/or performance characteristics and functions, on a single chip. This section describes those intertwined technical drivers and the resulting challenges to continued growth in computing performance. This shift away from an emphasis on ever-increasing speed has disrupted what has historically been a continuing progression of dramatic sequential performance improvements and associated software innovation and evolution atop a predictable hardware base followed by increased demand for ever more software innovations that in turn motivated hardware improvements. This disruption has profound implications not just for the information technology industry, but for society as a whole. This section first describes the benefits of this virtuous cycle--now ending--that we have depended on for so long. The technical challenges related to scaling nanometer devices, what the shift to multicore architectures means for architectural innovation, programming explicitly parallel hardware, increased heterogeneity in hardware, and the need for correct, secure, and evolvable software are then discussed.

1.1.1 Hardware-Software Virtuous Cycle

The hardware and performance improvements described above came with a stable programming interface between hardware and software. This interface persisted over multiple hardware generations and in turn contributed to the creation of a virtuous hardware-software cycle (see Figure 1-1). Hardware and software capabilities and sophistication each grew dramatically in part because hardware and software designers could innovate in isolation from each other, while still leveraging each other's advances in a predictable and sustained fashion. For example, hardware designers added sophisticated out-of-order instruction issue logic, branch prediction, data prefetching, and instruction prefetching to the capabilities. Yet, even as the hardware became more complex, application software did not have to change to take advantage of the greater performance in the underlying hardware and, consequently, achieve greater performance on the software side as well.

Software designers were able to make grounded and generally accurate assumptions about future capabilities of the hardware and could--and did--create software that needed faster, next-generation processors with larger memories even before chip and system architects actually were able to deliver them. Moreover, rising hardware performance allowed software tool developers to raise the level of abstraction for software development via advanced libraries and programming models, further accelerating application development. New, more demanding applications that only executed on the latest, highest performance hardware drove the market for the newest, fastest, and largest memory machines as they appeared.

FIGURE 1-1 Cracks in the hardware-software virtuous cycle. SOURCE: Adapted from a 2011 briefing presentation on the Computer Science and Telecommunications Board report The Future of Computing Performance: Game Over or Next Level?

Another manifestation of the virtuous cycle in software was the adoption of high-level programming language abstractions, such as object orientation, managed runtimes, automatic memory management, libraries, and domain-specific languages. Programmers embraced these abstractions (1) to manage software size, sophistication, and complexity and (2) to leverage existing components developed by others. However, these abstractions are not without cost and rely on system software (i.e., compilers, runtimes, virtual machines, and operating systems) to manage software complexity and to map abstractions to efficient hardware implementations. In the past, as long as the software used a sequential programming interface, the cost of abstraction was hidden by ongoing, significant improvements in hardware performance. Programmers embraced abstraction and consequently produced working software faster.

Looking ahead, it seems likely that the right choice of new abstractions will expand the pool of programmers further. For example, a domain specialist can become a programmer if the language is intuitive and the
abstractions match his or her domain expertise well. Higher-level abstractions and domain-specific toolkits, whether for technical computing or World Wide Web services, have allowed software developers to create complex systems quickly and with fewer common errors. However, implicit in this approach has been an assumption that hardware performance would continue to increase (hiding the overhead of these abstractions) and that developers need not understand the mapping of the abstractions to hardware to achieve adequate performance.3 As these assumptions break down, the difficulty in achieving high performance from software will rise, requiring hardware designers and software developers to work together much more closely and exposing increasing amounts of parallelism to software developers (discussed further below). One possible example of this is the use of computer-aided design tools for hardware-software co-design. Another source of continued improvements in delivered application performance could also come from efficient implementation techniques for high-level programming language abstractions.

1.1.2 Problems in Scaling Nanometer Devices

Early in the 2000s, semiconductor scaling--the process of technology improvement so that it performs the same functionalities at ever smaller scales--encountered fundamental physical limits that now make it impractical to continue along the historical paths to ever-increasing performance.4 Expected improvements in both performance and power achieved with technology scaling have slowed from their historical rates, whereas implicit expectations were that chip speed and performance would continue to increase dramatically. There are deep technical reasons for (1) why the scaling worked so well for so long and (2) why it is no longer delivering dramatic performance improvements. See Appendix E for a brief overview of the relationship between slowing processor performance growth and Dennard scaling and the powerful implications of this slowdown.

In fact, scaling of semiconductor technology hit several coincident roadblocks that led to this slowdown, including architectural design constraints, power limitations, and chip lithography challenges (both the high costs associated with patterning smaller and smaller integrated circuit features and with fundamental device physics). As described below, the combination of these challenges can be viewed as a perfect storm of difficulty for microprocessor performance scaling.

With regard to power, through the 1990s and early 2000s the power needed to deliver performance improvements on the best performing microprocessors grew from about 5-10 watts in 1990 to 100-150 watts in 2004 (see Figure 1-2). This increase in power stopped in 2004, because cooling and heat dissipation proved inadequate. Furthermore, the exploding demand for portable devices, such as phones, tablets, and netbooks, increased the market importance of lower-power and energy-efficient processor designs.

FIGURE 1-2 Thirty five years of microprocessor trend data. SOURCE: Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. Dotted-line extrapolations by C. Moore: Chuck Moore, 2011, "Data processing in exascale-class computer systems," The Salishan Conference on High Speed Computing, April 27, 2011. (www.lanl.gov/orgs/hpc/salishan)

In the past, computer architects increased performance with clever architectural techniques such as ILP (instruction-level parallelism through the use of deep pipelines, multiple instruction issue, and speculation) and memory locality (multiple levels of caches). As the number of transistors per unit area on a chip continued to increase (as predicted by Moore's Law), microprocessor designers used these transistors to, in part, increase the potential to exploit ILP by increasing the number of instructions executed in parallel (IPC, or instructions per

3 Such abstractions may increase the energy costs of computation over time; a focus on energy costs (as opposed to performance) may have led to radically different strategies for both hardware and software. Hence, energy-efficient software abstractions are an important area for future development.
4 In "High-Performance Processors in a Power-Limited World," Sam Naffziger reviews the Vdd limitations and describes various approaches (circuit, architecture) to future processor design given the voltage scaling limitations: Sam Naffziger, 2006, "High-performance processors in a power-limited world," Proceedings of the IEEE Symposium on VLSI Circuits, Honolulu, HI, June 15-17, 2006, p. 93-97.
clock cycle).5 Transistors were also used to achieve higher frequencies than were supported by the raw transistor speedups, for example, by duplicating logic and by reducing the depth of logic between pipeline latches to allow faster clock cycles. Both of these efforts yielded diminishing returns in the mid-2000s. ILP improvements are continuing, but also with diminishing returns.6

Continuing the progress of semiconductor scaling--whether used for multiple cores or not--is now dependent on innovation in structures and materials to overcome the reduced performance scaling traditionally provided by Dennard scaling.7

Continued scaling also depends on continued innovation in lithography. Current state-of-the-art manufacturing uses a 193-nanometer wavelength to print structures that are only tens of nanometers in size. This apparent violation of optical laws has been supported by innovations in mask patterning and compensated for by increasingly complex computational optics. Future lithography scaling is dependent on continued innovation.

1.1.3 The Shift to Multicore Architectures and Related Architectural Trends

The shift to multicore architectures meant that architects began using the still-increasing transistor counts per chip to build multiple cores per chip rather than higher-performance single-core chips. Higher-performance cores were eschewed in part because of diminishing performance returns and emerging chip power constraints that made small performance gains at a cost of larger power use unattractive. When single-core scaling slowed, a shift in emphasis to multicore chips was the obvious choice, in part because it was the only alternative that could be deployed rapidly. Multicore chips consisting of less complex cores that exploited only the most effective ILP ideas were developed. These chips offered the promise of performance scaling linearly with power. However, this scaling was only possible if software could effectively make use of them (a significant challenge). Moreover, early multicore chips with just a few cores could be used effectively at either the operating system level, avoiding the need to change application software, or by a select group of applications retargeted for multicore chips.

With the turn to multicore, at least three other related architectural trends are important to note to understand how computer designers and architects seek to optimize performance--a shift toward increased data parallelism, accelerators and reconfigurable circuit designs, and system-on-a-chip (SoC) integrated designs.

First, a shift toward increased data parallelism is evident particularly in graphics processing units (GPUs). GPUs have evolved, moving from fixed-function pipelines to somewhat configurable ones to a set of throughput-oriented "cores" that allowed more successful general-purpose GPU (GP-GPU) programming.

Second, accelerators and reconfigurable circuit designs have matured to provide an intermediate alternative between software running on fixed hardware, for example, a multicore chip, and a complete hardware solution such as an application-specific integrated circuit, albeit with their own cost and configuration challenges. Accelerators perform fixed functions well, such as encryption-decryption and compression-decompression, but do nothing else. Reconfigurable fabrics, such as field-programmable gate arrays (FPGAs), sacrifice some of the performance and power benefits of fixed-function accelerators but can be retargeted to different needs. Both offer intermediate solutions in at least four ways: time needed to design and test, flexibility, performance, and power.

Reconfigurable accelerators pose some serious challenges in building and configuring applications; tool chain issues need to be addressed before FPGAs can become widely used as cores. To use accelerators and reconfigurable logic effectively, their costs must be overcome when they are not in use. Fortunately, if power, not silicon area, is the primary cost measure,

5 Achieved application performance depends on the characteristics of the application's resource demands and on the hardware.
6 ILP improvements are incremental (10-20 percent), leading to single-digit compound annual growth rates.
7 According to Mark Bohr, "Classical MOSFET scaling techniques were followed successfully until around the 90nm generation, when gate-oxide scaling started to slow down due to increased gate leakage" (Mark Bohr, February 9, 2009, "ISSCC Plenary Talk: The New Era of Scaling in an SOC World"). At roughly the same time, subthreshold leakage limited the scaling of the transistor Vt (threshold voltage), which in turn limited the scaling of the voltage supply in order to maintain performance. Since the active power of a circuit is proportional to the square of the supply voltage, this reduced scaling of supply voltage had a dramatic impact on power. This interaction between leakage power and active power has led chip designers to a balance where leakage consumes roughly 30 percent of the power budget. Several approaches are being undertaken. Copper interconnects have replaced aluminum. Strained silicon and Silicon-on-Insulator have provided improved transistor performance. Use of a low-K dielectric material for the interconnect layers has reduced the parasitic capacitance, improving performance. High-K metal gate transistor structures restarted gate "oxide" scaling with orders of magnitude reduction in gate leakage. Transistor structures such as FinFET, or Intel's Tri-Gate have improved control of the transistor channel, allowing additional scaling of Vt for improved transistor performance and reduced active and leakage power.
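The relationship in footnote 7 between supply voltage and active power can be illustrated with a small calculation. What follows is only a sketch under the common first-order textbook model, in which switching power is activity × C × V² × f, and under idealized Dennard-style scaling assumptions (capacitance and area per transistor shrink with feature size while frequency rises). The function names and the normalized baseline values are illustrative assumptions, not taken from the report.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """First-order switching (active) power: activity * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency


def scaled_power_density(k, voltage_scales=True):
    """Relative power density after shrinking linear dimensions by a factor k.

    All baseline quantities are normalized to 1 (an illustrative assumption).
    Under idealized scaling, capacitance shrinks by 1/k, frequency rises by k,
    and area per transistor shrinks by 1/k^2. The supply voltage shrinks by
    1/k only if voltage_scales is True.
    """
    capacitance = 1.0 / k
    voltage = 1.0 / k if voltage_scales else 1.0
    frequency = float(k)
    area = 1.0 / k ** 2
    return dynamic_power(capacitance, voltage, frequency) / area


if __name__ == "__main__":
    for k in (1, 2, 4):
        ideal = scaled_power_density(k, voltage_scales=True)
        stalled = scaled_power_density(k, voltage_scales=False)
        print(f"k={k}: density with voltage scaling={ideal:.2f}, "
              f"with fixed voltage={stalled:.2f}")
```

In this simplified model, power density stays constant when the supply voltage shrinks along with feature size, but grows as k² when the voltage is held fixed, which is one way to see why stalled voltage scaling had such a large effect on power.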
turning the units off when they are not needed reduces energy consumption (see discussion of dark and dim silicon, below).

Third, increasing levels of integration that made the microprocessor possible four decades ago now enable complete SoCs. They combine most of the functions of a motherboard onto a single chip, usually with off-chip main memory. These processors integrate memory and input/output controllers, graphics processors, and other special-purpose accelerators. These (SoC) designs are widely used in almost all devices, from servers and personal computers to smartphones and embedded devices.

Fourth, power efficiency is increasingly a major factor in the design of multicore chips. Power has gone from a factor to optimize in the near-final design of computer architectures to a second-order constraint to, now, a first-order design constraint. As the right side of Figure 1-2 projects, future systems cannot achieve more performance from simply a linear increase in core count at a linear increase in power. Chips deployed in everything from phones, tablets, and laptops to servers and data centers must take into account power needs.

One technique for enabling more transistors per chip at better performance levels without dramatically increasing the power needed per chip is dark silicon. Dark silicon refers to a design wherein a chip has many transistors, but only a fraction of them are powered on at any one time to stay within a power budget. Thus, function-specific accelerators can be powered on and off to maximize chip performance. A related design is dim silicon where transistors operate in a low-power but still-useful state. Dark and dim silicon make accelerators and reconfigurable logic more effective. However, making dark and dim silicon practical is not easy, because adding silicon area per chip always raises cost, even if the silicon only provides value when it is on. This also presents significant software challenges, as each heterogeneous functional unit requires efficient code (e.g., this may mean multiversion code, as well as compilers and tool chains designed for many variations). Thus, even as dark and dim silicon become more widely adopted, using them to create value is a significant open challenge.

Moreover, emerging transistors have more variability than in the past, due to variations in the chip fabrication process: Some transistors will be faster, while others are slower, and some use more power and others use less. This variability is emerging now, because some aspects of fabrication technology (e.g., gate oxides) are reaching atomic dimensions. Classically, hardware hid almost all errors from software (except memory errors) with techniques (such as guard bands) that conservatively set parameters well above a mean value to tolerate variation while creating the illusion of error-free hardware. As process variation grows relative to mean values, guard bands become overly conservative. This means that new errors will be exposed more frequently to software, posing software and system reliability challenges.

1.1.4 Game Changer: Programming for Explicitly Parallel Commodity Hardware

The advent of multicore chips changes the software interface. Sequential software no longer becomes faster with every hardware generation, and software needs to be written to leverage parallel hardware explicitly. Current trends in hardware, specifically multicore, might seem to suggest that every technology generation will increase the number of processors and, accordingly, that parallel software written for these chips would speed up in proportion to the number of processors (often referred to as scalable software).

Reality is not so straightforward. There are limits to the number of cores that can usefully be placed on a chip. Moreover, even software written in parallel languages typically has a sequential component. In addition, there are intrinsic limits in the theoretically available parallelism in some problems, as well as in their solution via currently known algorithms. Even a small fraction of sequential computation significantly compromises scalability (see Figure 1-3), compromising expected improvements that might be gained by additional processors on the chip.

FIGURE 1-3 Amdahl's Law example of potential speedup on 16 cores based on the fraction of the program that is parallel.
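The scalability limit discussed in Section 1.1.4 and plotted in Figure 1-3 can be reproduced numerically. The following is a minimal sketch, assuming the standard textbook form of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the program that is parallel and n is the number of cores. The function name and the sample fractions are illustrative choices, not values taken from the report.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup under Amdahl's Law.

    parallel_fraction: share of the work that can run in parallel (0.0 to 1.0).
    cores: number of processors applied to the parallel share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Tabulate the potential speedup on 16 cores for several parallel fractions.
    for fraction in (0.50, 0.75, 0.90, 0.95, 0.99, 1.00):
        speedup = amdahl_speedup(fraction, 16)
        print(f"parallel fraction {fraction:.2f}: speedup {speedup:.2f}x")
```

Under this model, a program that is 90 percent parallel speeds up by only about 6.4 times on 16 cores, which illustrates how a small sequential component limits the benefit of adding processors.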
As part of ongoing research programs, it will be important to analyze the interplay among the available parallelism in applications, energy consumption by the resulting chip (under load with real applications), the performance of different algorithmic formulations, and programming complexity.

In addition, most programs written in existing parallel languages are dependent on the number of hardware processors. Further developments in parallel software are necessary for it to be performance portable, that is, it should execute on a variety of parallel computing platforms and should show performance in proportion to the number of processors on all these platforms without modifications, within some reasonable bounds.

Deeply coupled to parallelism is data communication. To operate on the same data in parallel on different processors, the data must be communicated to each processor. More processors imply more communication. Communicating data between processors on the same chip or between chips is costly in power and time. Unfortunately, most parallel programming systems result in programs whose performance heavily depends on the memory hierarchy organization of the processor. Where the data is located in a system directly affects performance and energy. Consequently, sequential and even existing parallel software is not performance portable to successive generations of evolving parallel hardware, or even between two machines of the same generation with the same number of processors if they have different memory organizations. Software designers currently must modify software for it to run efficiently on each multicore machine. The need for such efforts breaks the virtuous cycle described above and makes building and evolving correct, secure, and performance-portable software a substantial challenge.

Finally, automatic parallelization systems, which seek to transform a sequential program into a parallel program without programmer intervention, have mostly failed. Had they been successful, programmers would be able to write in a familiar sequential language and yet still see the performance benefits of parallel execution. As a result, research now focuses on programmer-specified parallelism.

1.1.5 Heterogeneity in Hardware

As mentioned above, not only are technology trends leading designers and developers of computer hardware to focus on multicore systems, but they are also leading to an emphasis on specialization and heterogeneity to provide power, performance, and energy efficiency. This specialization is a marked contrast to previous approaches. GPUs are an example of hardware specialization designed to be substantially more power efficient for a specific workload. The problem with this trend is three-fold.

First, hardware specialization can only be justified for ubiquitous and/or high-value workloads due to the high cost of chip design and fabrication. Second, creating software that exploits hardware specialization and heterogeneity closely couples hardware and software--such coupling may be good for performance, power, and energy, but it typically sacrifices software portability to different hardware, a mainstay expectation in computing over many decades.

Third, the lead time needed for effective software support of these heterogeneous devices may reduce the time they can be competitive in the marketplace. If it takes longer to deliver the tools (compilers, domain-specific language, and so on) than it takes to design and deliver the chip, then the tools will appear after the chip, with negative consequences.8 This problem, however, is not new. For example, by holding the IA-32 instruction set architecture relatively constant across generations of hardware, software could be delivered in a timely manner. Designing and building a software system for hardware that does not exist, or is not similar to prior hardware, requires well-specified hardware-software interfaces and accurate simulators to test the software independently. Because executing software on simulators takes tens to thousands of times longer than executing on actual hardware, software will lag hardware without careful system and interface design. In summary, writing portable and high-performance software is hard, making such software parallel is harder, and developing software that can exploit heterogeneous parallel architectures is even harder.

1.1.6 Correct and Secure Software that Evolves

Performance--in the sense of ever-increasing chip speed--is not the only critical demand of modern application and system software. Although performance is fungible and ever-faster computer chips can be used to enable a variety of functionality, software is the underpinning of virtually all our economic, government, and military infrastructure. Thus, the criticality of Secure, Parallel, Evolvable, Reliable, and Correct software cannot be overemphasized. This report uses the term SPERC software to refer to these software properties.

8 The same is true for hardware-software co-design efforts. Success in co-design requires that both the hardware and software be delivered at roughly the same time. If the software lags behind the hardware, it diminishes the strategy's effectiveness.
Achieving each of these desirable SPERC software properties is difficult in isolation, and each property is still the subject of much research. In an era where new technologies--at all levels of the system--appear quickly, yet the rate of hardware performance improvement is slowing, an alternative to the virtuous cycle described earlier is essential. Rather than remaining oblivious to hardware shifts, new approaches and methodologies are needed that allow our complex software systems to evolve nimbly, using new technologies and adapting to changing conditions for rapid deployment. This flexibility and rapid adaptation will be key to continued superiority, for all large-scale enterprises, including military and defense needs.

In addition to flexibility and nimbleness, as the world becomes more connected, building software that executes reliably and guarantees some security properties is critical. For example, modern programming systems for languages such as PHP, JavaScript, Java, and C#, while more secure than native systems because of their type and memory safety, do not guarantee provably secure programs. For example, mainstream programming models do not yet support concise expression of semantic security properties such as "only an authenticated user can access their own data," which is key to proving security properties. Even recipes of best practices for secure programming remain an open problem.

Finally, functional correctness remains a major challenge. Designing and building correct parallel software is a daunting task. For example, static verification is the process that analyzes code to ensure that it guarantees certain properties and user-defined specifications. Static verification of even basic properties of sequential software in some cases cannot be decided, and computing approximations often involves exponential amounts of computation to analyze properties on all programming paths. Evaluating the same properties in parallel programs is even harder, since the analysis must consider all possible execution interleavings of concurrent statements in distinct parallel tasks. Current practice sometimes verifies small critical components of large systems, but for the most part, executes the program on a variety of test inputs (testing) to detect errors. Correctness and security demands on software may trump performance in some cases, but applications will typically need to combine these properties with high performance and parallelism.

Even assuming that there are programming models that establish a solid foundation for creating SPERC software, adoption will be a challenge. Commodity and defense software will need to be created or ported to use them. The enormous investment in legacy software and the large cost of porting software to new languages and platforms will be a barrier to adoption.

1.2 Future Directions for Hardware and Software Innovation

Section 1.1 outlined many of the technological challenges to continued growth in computing performance and some of the implications (e.g., the shift to multicore and increased emphasis on power efficiency). This section provides a brief overview of current hardware and software research strategies for building and evolving future computer systems that seek continued improvements to high performance and energy efficiencies.

1.2.1 Advanced Hardware Technology Options

Earlier sections of this chapter described issues that have hindered continued scaling of modern semiconductor technology and some of the current innovations in materials and structures that have allowed continued progress. All are variations on historical approaches. Are there more radical innovations that may deliver future improvements? In principle, yes, but there are daunting challenges.

Transistors built from alternative materials such as germanium (Ge) and Group III-V materials, such as gallium arsenide, indium phosphide, indium arsenide, and indium antimonide, promise improved power efficiency,9 but only by about a factor of two, as they also suffer from the same threshold voltage limits and limits on supply voltage scaling inherent in current complementary-symmetry metal-oxide semiconductor (CMOS) technologies.

Advances in packaging technology continue, and some of those offer promise for power and performance improvements. For example, 3D stacking and through-silicon vias are being explored for some SoC designs. The primary limitation for 3D stacking of memory, however, is capacity (i.e., only limited dynamic random-access memory can be placed in the stack).

Finally, more exotic alternatives to the use of electrons as the "tokens," coupled with an energy barrier as the control--the method used by all modern computer chips--are under investigation.10 Although all of these

9 Donghyun Kim, Tejas Krishnamohan, and Krishna C. Saraswat, 2008, "Performance Evaluation of 15nm Gate Length Double-Gate n-MOSFETs with High Mobility Channels: III-V, Ge and Si," The Electroch. Soc. Trans. 16(11): 47-55.
10 K. Bernstein, R. Calvin, W. Porod, A. Seabaugh, and J. Welser, 2010, "Device and Architecture Outlook for Beyond CMOS Switches," Proceedings of the IEEE 98(12): 2169-2184.
technologies show potential, each has serious challenges that need to be resolved through
continued fundamental research before they could be adopted for high-volume manufacturing.

1.2.2 Prospects for Performance Improvements

In the committee's view, there is no "silver bullet" to address current challenges in computer
architecture and the slowdown in growth in computing performance. Rather, efforts in
complementary directions to continue to increase value, that is, performance, under power
constraints, will be needed. Early multicore chips offered homogeneous parallelism.
Heterogeneous cores on a single chip are now part of an effort to achieve greater power
efficiency, but they present even greater programming challenges.

Efforts to advance conventional multicore chips and to create more power-efficient core designs
will continue. On one hand, researchers will continue to explore techniques that reduce the
power used by individual cores without unduly sacrificing performance. In turn, this will allow
placement of more cores on each chip. Researchers could also explore radical redesigns of cores
that focus on power first, for example, by minimizing nonlocal information movement through
spatially aware designs that limit communication of data (see Section 1.1.4).

GP-GPU computing, in particular, and vector and single-instruction multiple-data operation, in
general, offer promise for specific workloads. Each of these reduces power consumption by
amortizing the cost of dealing with an instruction (e.g., fetch and decode) across the benefit
of many data manipulations. All offer great peak performance, but this performance can be hard
to achieve without deep expertise coupling algorithm and architecture, hardly a prescription for
broad programmability. Moreover, software that runs on such chips must cope with allocating
work to heterogeneous computing units, such as throughput-oriented GPUs and latency-oriented
conventional central processing units, highlighting the need for advances in software and
programming methodologies as described earlier.

More heterogeneity will arise from expanded use of accelerators and reconfigurable logic,
described earlier, that is needed for increased performance under power constraints.
Accelerators are so-named because they can accelerate performance. While this is true, recent
work shows that the greater benefit of accelerators may be in reducing power.11 However,
accelerator effectiveness can be blunted by the overheads of communicating control and data to
and from accelerators, especially if someone seeks to offload even smaller amounts of work to
expand the availability of off-loadable work.

11 Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee,
Stephen Richardson, Christos Kozyrakis, and Mark Horowitz, 2010, "Understanding Sources of
Inefficiency in General-Purpose Chips," Proceedings of the 37th International Symposium on
Computer Architecture (ISCA), Saint-Malo, France, June 2010.

Reconfigurable designs, such as FPGAs, described earlier, may provide a middle ground, but they
are not yet easily programmable. Similarly, SoCs combine specialized accelerators on a single
chip and have had great success in the embedded market, such as smartphones and tablets. As
SoCs continue to proliferate, the challenge will be simplifying software and hardware design
and programmability while maximizing performance and power efficiency.

Moreover, communication at all levels--close, cross-chip, off-chip, off-board, off-node,
offsite--must be minimized to save energy. For example, moving operands from a close-by
register file can use energy comparable to an operation (e.g., floating-point multiply-add),
while moving them from cross- or off-chip uses tens to hundreds of times more energy. Thus, a
focus on reducing computational energy without a concomitant focus on reducing communication is
doomed to have limited effect.

Finally, a reconsideration of the hardware-software boundary may be in order. While abstraction
layers hide complexity at each boundary, they also hide optimization and innovation
possibilities. For decades, software and hardware experts innovated independently on opposite
sides of the instruction set architecture boundary. Multicore chips began the end of the era of
separation. Going forward, co-design is needed, where chip functionality and software are
designed in concert, with repeated design and optimization feedback between the hardware and
software teams. However, since the software development cycle typically lags significantly
behind the hardware development cycle, effective co-design will also require more rapid
deployment of effective tools in a timescale commensurate with the specialized hardware if its
full functionality is to be realized.

1.2.3 Software

Creating software systems and applications for parallel, power-constrained computing systems on
a single chip requires innovations at all levels of the software-hardware design and
engineering stack: algorithms, programming models, compilers, runtime
systems, operating systems, and hardware-software interfaces.

One strategy for addressing the challenges inherent to parallel programming is to first design
application-specific languages and system software and then seek generalizations. The most
successful examples of parallelism come from distributed search systems, Web services, and
databases executing on distinct devices, as opposed to the challenge of parallelism within a
single device (chip) that is addressed here. Parallel algorithm and system design success
stories include MapReduce12 for processing data used in search, databases for persistent
storage and retrieval, and domain-specific toolkits with parallel support, such as MATLAB. Part
of their success is rooted in providing a level of abstraction in which programmers write
sequential components, while the runtime and system software implement and manage the
parallelism. On the other hand, GPU programming is also a success story, but when used for game
engineering development, for instance, it relies on expert programmers with deep knowledge of
parallelism, algorithm-to-hardware mappings, and performance tuning. General-purpose computing
on GPUs does not require in-depth knowledge about graphics hardware, but does require
programmers to understand parallelism, locality, and bandwidth--general-purpose computing
primitives.

12 In 2004, Google introduced the software framework, MapReduce, to support distributed
computing on large datasets on clusters of computers.

More research is needed in domain-specific parallel algorithms, because most applications are
sequential. Sequential algorithms are almost never appropriate for parallel systems. Expressing
algorithms in such a way that they satisfy the key SPERC properties and are performance
portable across different parallel hardware and generations of parallel hardware requires
investment and research in new programming models and programming languages.

These programming models must enable expert, typical, and potentially naïve programmers to use
parallel hardware effectively. Since parallel programming is extremely complex, the expertise
necessary to effectively work in this realm is currently only within reach of the most expert
programmers, and the majority of existing systems are not performance portable. A key
requirement will be to create modular programming models that make it possible to encapsulate
parallel software in libraries in such a way that (1) they can be reused by many applications
and (2) the system adapts and controls the total amount of parallelism that effectively
utilizes the hardware, without over- or undersubscription. Mapping applications, which use
these new models, to parallel hardware will require new compiler, runtime, and operating system
services that model, observe, and reason, and then adapt to and change dynamic program
behaviors to satisfy performance and energy constraints.

Because power and energy are now the first-order constraint in hardware design, there is an
opportunity for algorithmic design and system software to play a much larger role in power and
energy management. This area is a critical research topic with broad applicability.

1.3 The Rise of Mobile Computing, Services, and Software

Historically, the x86 instruction set architecture has come to dominate the commercial
computing space, including laptops, desktops, and servers. The architecture was developed
originally by Intel and licensed by AMD, and its commercial success has either eliminated or
forced into smaller markets other architectures developed by MIPS, HP, DEC, and IBM, among
others. More than 300 million PCs are sold each year, most of them powered by x86
processors.13 Further, since the improvement in capabilities of single-core processors started
slowing dramatically, nearly all laptops, desktops, and servers are now shipping with multicore
processors.

13 See http://www.gartner.com/it/page.jsp?id=1893523. Last accessed on February 7, 2012.

Over the past decade, the availability of capable, affordable, and very low-power processor
chips has spurred a fast rise in mobile computing devices in the form of smartphones and
tablets. The annual sales volume of smartphones and tablets already exceeds that of PCs and
servers.14 The dominant architecture is U.K.-based ARM, rather than x86. ARM does not
manufacture chips; instead it licenses the architecture to third parties for incorporation into
custom SoC designs by other vendors. The openness of the ARM architecture has facilitated its
adoption by many hardware manufacturers. In addition, these mobile devices now commonly
incorporate two cores, and at least one SoC vendor has been shipping four-core designs in
volume since early 2012. Furthermore, new heterogeneous big- and small-core designs that couple
a higher performance, higher power core with a lower performance, lower power core have
recently been announced.15 Multicore chips are now ubiquitous across the entire range of
computing devices.

14 See http://www.canalys.com/newsroom/smart-phones-overtake-client-pcs-2011. Last accessed on
February 7, 2012.
15 See www.tegra3.org; http://www.reuters.com/article/2011/10/19/arm-idUSL5E7LJ42H20111019.
Last accessed on June 25, 2012.
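
The big- and small-core pairing described above can be illustrated with a toy scheduling rule: run each task on whichever core meets its deadline with the least energy. The sketch below is only an illustration under stated assumptions. The `Core` type, the `pick_core` function, and all throughput and power figures are hypothetical and are not taken from any vendor's design, and the model ignores migration cost, idle power, and voltage/frequency scaling.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Core:
    name: str
    ops_per_second: float  # hypothetical sustained throughput
    watts: float           # hypothetical active power draw


def pick_core(cores: Sequence[Core], work_ops: float,
              deadline_s: float) -> Optional[Core]:
    """Return the core that finishes work_ops within deadline_s
    using the least energy, or None if no core meets the deadline."""
    best: Optional[Core] = None
    best_energy_j = float("inf")
    for core in cores:
        runtime_s = work_ops / core.ops_per_second
        if runtime_s > deadline_s:
            continue  # this core is too slow for the task
        energy_j = core.watts * runtime_s  # energy = power x time
        if energy_j < best_energy_j:
            best, best_energy_j = core, energy_j
    return best


# Illustrative numbers only: the big core is 2.5x faster but draws 5x the power.
BIG = Core("big", ops_per_second=2.0e9, watts=2.0)
LITTLE = Core("little", ops_per_second=0.8e9, watts=0.4)

if __name__ == "__main__":
    for work_ops in (1.0e8, 1.0e9, 4.0e9):
        chosen = pick_core([BIG, LITTLE], work_ops, deadline_s=1.0)
        print(f"{work_ops:.0e} ops ->", chosen.name if chosen else "no core meets deadline")
```

With these assumed figures, light work lands on the small core because it spends less energy per operation, and the big core is chosen only when the small core would miss the deadline.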
The rise of the ARM architecture in mobile computing has the potential to adjust the balance of
power in the computing world as mobile devices become more popular and supplant PCs for many
users. Although the ARM architecture comes from the United Kingdom, Qualcomm, Texas
Instruments, and NVIDIA are all U.S.-based companies with strong positions in this space.
However, the shift does open the door to more foreign competition, such as Korea's Samsung, and
new entries, because ARM licenses are relatively inexpensive, allowing many vendors to design
ARM-based chips and have them fabricated in Asia.

However, just as technical challenges are changing the hardware and software development cycle
and the software-hardware interface, the rise of mobile computing and its associated software
ecosystems are changing the nature of software deployment and innovation in applications. In
contrast to developing applications for general-purpose PCs--where any application developer,
for example, a U.S. defense contractor or independent software vendor, can create software that
executes on any PC of their choosing--in many cases, developing software for mobile devices
imposes additional requirements on developers, with "apps" having to be approved by the
hardware vendors before deployment. There are advantages and disadvantages to each approach,
but changes in the amount and locus of control over software deployments will have implications
for what kind of software is developed and how innovation proceeds.

A final inflection point is the rise of large-scale services, as exemplified by search engines,
social networks, and cloud-hosting services. At the largest scale, the systems supporting each
of these are larger than the entire Internet was just a few years ago. Associated innovations
have included a renewed focus on analysis of unstructured and ill-structured data (so-called
big data), packaging and energy efficiency for massive data centers, and the architecture of
service delivery and content distribution systems. All of these are the enabling technologies
for delivery of services to mobile devices. The mobile device is already becoming the primary
personal computing system for many people, backed up by data storage, augmented computational
horsepower, and services provided by the cloud.

Leadership in the technologies associated with distributed cloud services, data center hardware
and software, and mobile devices will provide a competitive advantage in the global computing
marketplace.

Software innovations in mobile systems where power constraints are severe (battery life
directly affects user experience) are predicted to use a different model than PCs, in which
more and more processing is performed in the "cloud" rather than on the mobile device. A
flexible software infrastructure and algorithms that optimize for network availability, power
on the device, and precision are heralding a challenging ecosystem.

Fundamental to these technologies are algorithms for ensuring properties such as reliability,
availability, and security in a distributed computing system, as well as algorithms for deep
data mining and inference. These algorithms are very different in nature from parallel
algorithms suitable for traditional supercomputing applications. While U.S. researchers have
made investments in these areas already, the importance and commercial growth potential demand
research and development into algorithmic areas including encryption, machine learning, data
mining, and asynchronous algorithms for distributed systems protocols.

1.4 Summary and Implications

Semiconductor scaling has encountered fundamental physical limits, and improvements in
performance and power are slowing. This slowdown has, among other things, driven a shift from
single-microprocessor computer architectures to homogeneous and now heterogeneous multicore
processors, which break the virtuous cycle that most software innovation has expected and
relied on. While innovations in transistor materials, lithography, and chip architecture
provide promising opportunities for improvements in performance and power, there is no
consensus by the semiconductor and computer industry on the most promising path forward.

It is likely that these limitations will require a shift in the locus of innovation away from
dependence on single-thread performance, at least in the way performance has been achieved
(i.e., increasing transistor count per chip at reduced power). Performance at the processor
level will continue to be important, as that performance can be translated into desired
functionalities (such as increased security, reliability, and more capable software). But new
ways of thinking about overall system goals and how to achieve them may be needed.

What, then, are the most promising opportunities for innovation breakthroughs by the
semiconductor and computing industry? The ongoing globalization of science and technology and
increased--and cheaper--access to new materials, technologies, infrastructure, and markets have
the potential to shift the U.S. competitive advantage in the global computing ecosystem, as
well as to refocus opportunities for innovation in the computing space. In addition, the
computing and semiconductor
industry has become a global enterprise, fueled by increasingly competitive overseas
semiconductor markets and firms that have made large and focused investments in the computing
space over the last decade. The possibility of new technological approaches emerging both in
the United States and overseas reinforces the critical need for the United States to assess the
geographic and technological landscape of research and development focused on this and other
areas of computer and semiconductor innovation.