1 Computer and Semiconductor Technology Trends and Implications

Computing and information and communications technology has had incredible effects on nearly every sector of society. Until recently, advances in information and communications technology have been driven by steady and dramatic gains in single-processor (core) speeds. However, current and future generations of users, developers, and innovators will be unable to depend on these improvements in computing performance.

In the several decades leading up to the early 2000s, single-core processor performance doubled about every 2 years. These repeated performance doublings came to be referred to in the popular press as "Moore's Law," even though Moore's Law itself was a narrow observation about the economics of chip lithography feature sizes.1 This popular understanding of Moore's Law was enabled by both technology--higher clock rates, reductions in transistor size, and faster switching via fabrication improvements--and architectural and compiler innovations that increased performance while preserving software compatibility with previous-generation processors. Ongoing and predictable improvements in processor performance created a cycle of improved single-processor performance followed by enhanced software functionality. However, it is no longer possible to increase performance via higher clock rates, because of power and heat dissipation constraints. These constraints are themselves manifestations of more fundamental challenges in materials science and semiconductor physics at increasingly small feature sizes.

A National Research Council (NRC) report, The Future of Computing Performance: Game Over or Next Level?,2 explored the causes and implications of the slowdown in the historically dramatic exponential growth in computing performance and the end of the dominance of the single microprocessor in computing. The findings and recommendations from that report are provided in Appendix D. The authoring committee of this report concurs with those findings and recommendations. This chapter draws on material in that report and the committee's own expertise and discusses the technological challenges to sustaining growth in computing performance and their implications for computing and innovation. The chapter concludes with a discussion of the implications of these technological realities for United States defense. Subsequent chapters have a broader emphasis, beyond technology, on the implications for global technology policy and innovation issues.

1.1 Interrelated Challenges to Continued Performance Scaling

The reasons for the slowdown in the traditional exponential growth in computing performance are many. Several technical drivers have led to a shift from ever-faster single-processor computer chips as the foundation for nearly all computing devices to an emphasis on what have been called "multicore" processors--placing

1 The technological and economic challenges are intertwined. For example, Moore's Law is enabled by the revenues needed to fund the research and development necessary to advance the technology. See, for example, "The Economic Limit to Moore's Law," IEEE Transactions on Semiconductor Manufacturing, Vol. 24, No. 1, February 2011.
2 NRC, The Future of Computing Performance: Game Over or Next Level?, Washington, D.C.: The National Academies Press (available online at http://www.nap.edu/catalog.php?record_id=12980).
THE GLOBAL ECOSYSTEM IN ADVANCED COMPUTING

multiple processors, sometimes of differing power and/or performance characteristics and functions, on a single chip. This section describes those intertwined technical drivers and the resulting challenges to continued growth in computing performance.

This shift away from an emphasis on ever-increasing speed has disrupted what has historically been a continuing progression of dramatic sequential performance improvements and associated software innovation and evolution atop a predictable hardware base, followed by increased demand for ever more software innovations that in turn motivated hardware improvements. This disruption has profound implications not just for the information technology industry, but for society as a whole. This section first describes the benefits of this virtuous cycle--now ending--that we have depended on for so long. The technical challenges related to scaling nanometer devices, what the shift to multicore architectures means for architectural innovation, programming explicitly parallel hardware, increased heterogeneity in hardware, and the need for correct, secure, and evolvable software are then discussed.

1.1.1 Hardware-Software Virtuous Cycle

The hardware and performance improvements described above came with a stable programming interface between hardware and software. This interface persisted over multiple hardware generations and in turn contributed to the creation of a virtuous hardware-software cycle (see Figure 1-1). Hardware and software capabilities and sophistication each grew dramatically in part because hardware and software designers could innovate in isolation from each other, while still leveraging each other's advances in a predictable and sustained fashion. For example, hardware designers added sophisticated out-of-order instruction issue logic, branch prediction, data prefetching, and instruction prefetching to processor capabilities. Yet, even as the hardware became more complex, application software did not have to change to take advantage of the greater performance in the underlying hardware and, consequently, achieve greater performance on the software side as well. Software designers were able to make grounded and generally accurate assumptions about future capabilities of the hardware and could--and did--create software that needed faster, next-generation processors with larger memories even before chip and system architects actually were able to deliver them. Moreover, rising hardware performance allowed software tool developers to raise the level of abstraction for software development via advanced libraries and programming models, further accelerating application development. New, more demanding applications that only executed on the latest, highest performance hardware drove the market for the newest, fastest, and largest memory machines as they appeared.

FIGURE 1-1 Cracks in the hardware-software virtuous cycle. SOURCE: Adapted from a 2011 briefing presentation on the Computer Science and Telecommunications Board report The Future of Computing Performance: Game Over or Next Level?

Another manifestation of the virtuous cycle in software was the adoption of high-level programming language abstractions, such as object orientation, managed runtimes, automatic memory management, libraries, and domain-specific languages. Programmers embraced these abstractions (1) to manage software size, sophistication, and complexity and (2) to leverage existing components developed by others. However, these abstractions are not without cost and rely on system software (i.e., compilers, runtimes, virtual machines, and operating systems) to manage software complexity and to map abstractions to efficient hardware implementations. In the past, as long as the software used a sequential programming interface, the cost of abstraction was hidden by ongoing, significant improvements in hardware performance. Programmers embraced abstraction and consequently produced working software faster.

Looking ahead, it seems likely that the right choice of new abstractions will expand the pool of programmers further. For example, a domain specialist can become a programmer if the language is intuitive and the
abstractions match his or her domain expertise well. Higher-level abstractions and domain-specific toolkits, whether for technical computing or World Wide Web services, have allowed software developers to create complex systems quickly and with fewer common errors. However, implicit in this approach has been an assumption that hardware performance would continue to increase (hiding the overhead of these abstractions) and that developers need not understand the mapping of the abstractions to hardware to achieve adequate performance.3 As these assumptions break down, the difficulty in achieving high performance from software will rise, requiring hardware designers and software developers to work together much more closely and exposing increasing amounts of parallelism to software developers (discussed further below). One possible example of this is the use of computer-aided design tools for hardware-software co-design. Another source of continued improvements in delivered application performance could also come from efficient implementation techniques for high-level programming language abstractions.

1.1.2 Problems in Scaling Nanometer Devices

Early in the 2000s, semiconductor scaling--the process of technology improvement so that it performs the same functionalities at ever smaller scales--encountered fundamental physical limits that now make it impractical to continue along the historical paths to ever-increasing performance.4 Expected improvements in both performance and power achieved with technology scaling have slowed from their historical rates, whereas implicit expectations were that chip speed and performance would continue to increase dramatically. There are deep technical reasons for (1) why the scaling worked so well for so long and (2) why it is no longer delivering dramatic performance improvements. See Appendix E for a brief overview of the relationship between slowing processor performance growth and Dennard scaling and the powerful implications of this slowdown.

In fact, scaling of semiconductor technology hit several coincident roadblocks that led to this slowdown, including architectural design constraints, power limitations, and chip lithography challenges (both the high costs associated with patterning smaller and smaller integrated circuit features and with fundamental device physics). As described below, the combination of these challenges can be viewed as a perfect storm of difficulty for microprocessor performance scaling.

With regard to power, through the 1990s and early 2000s the power needed to deliver performance improvements on the best performing microprocessors grew from about 5-10 watts in 1990 to 100-150 watts in 2004 (see Figure 1-2). This increase in power stopped in 2004, because cooling and heat dissipation proved inadequate. Furthermore, the exploding demand for portable devices, such as phones, tablets, and netbooks, increased the market importance of lower-power and energy-efficient processor designs.

FIGURE 1-2 Thirty-five years of microprocessor trend data. SOURCE: Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. Dotted-line extrapolations by C. Moore: Chuck Moore, 2011, "Data processing in exascale-class computer systems," The Salishan Conference on High Speed Computing, April 27, 2011. (www.lanl.gov/orgs/hpc/salishan)

In the past, computer architects increased performance with clever architectural techniques such as ILP (instruction-level parallelism through the use of deep pipelines, multiple instruction issue, and speculation) and memory locality (multiple levels of caches). As the number of transistors per unit area on a chip continued to increase (as predicted by Moore's Law), microprocessor designers used these transistors to, in part, increase the potential to exploit ILP by increasing the number of instructions executed in parallel (IPC, or instructions per

3 Such abstractions may increase the energy costs of computation over time; a focus on energy costs (as opposed to performance) may have led to radically different strategies for both hardware and software. Hence, energy-efficient software abstractions are an important area for future development.
4 In "High-Performance Processors in a Power-Limited World," Sam Naffziger reviews the Vdd limitations and describes various approaches (circuit, architecture) to future processor design given the voltage scaling limitations: Sam Naffziger, 2006, "High-performance processors in a power-limited world," Proceedings of the IEEE Symposium on VLSI Circuits, Honolulu, HI, June 15-17, 2006, pp. 93-97.
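The power wall described above follows from the dynamic-power relation for CMOS logic, P = alpha * C * V^2 * f, combined with the end of Dennard voltage scaling discussed in Appendix E. A minimal numeric sketch; all constants below are illustrative round numbers, not data from the report:

```python
# CMOS dynamic (switching) power: P = alpha * C * V^2 * f.
# Illustrative sketch; the constants are invented round numbers.

def dynamic_power(alpha, capacitance, voltage, frequency):
    """Switching power in watts."""
    return alpha * capacitance * voltage**2 * frequency

s = 0.7  # classic linear shrink per process generation

# Dennard scaling: C and V both shrink by s, f rises by 1/s, and
# transistor density rises by 1/s**2, so power per unit area is flat.
p_old = dynamic_power(0.1, 1e-15, 1.0, 2e9)
p_new = dynamic_power(0.1, 1e-15 * s, 1.0 * s, 2e9 / s)
density_growth = 1 / s**2
print(p_new / p_old * density_growth)    # ~1.0: power density stays flat

# Post-Dennard: leakage keeps V from scaling further, so per-device
# power shrinks only by s * (1/s) = 1 and density still doubles.
p_stuck = dynamic_power(0.1, 1e-15 * s, 1.0, 2e9 / s)
print(p_stuck / p_old * density_growth)  # ~2.0: power density doubles
```

Under Dennard scaling the V^2 term fell fast enough to hold power density constant; once voltage stopped scaling, each shrink roughly doubled power density, which is the constraint that capped clock rates in 2004.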
clock cycle).5 Transistors were also used to achieve higher frequencies than were supported by the raw transistor speedups, for example, by duplicating logic and by reducing the depth of logic between pipeline latches to allow faster clock cycles. Both of these efforts yielded diminishing returns in the mid-2000s. ILP improvements are continuing, but also with diminishing returns.6

Continuing the progress of semiconductor scaling--whether used for multiple cores or not--is now dependent on innovation in structures and materials to overcome the reduced performance scaling traditionally provided by Dennard scaling.7

Continued scaling also depends on continued innovation in lithography. Current state-of-the-art manufacturing uses a 193-nanometer wavelength to print structures that are only tens of nanometers in size. This apparent violation of optical laws has been supported by innovations in mask patterning and compensated for by increasingly complex computational optics. Future lithography scaling is dependent on continued innovation.

1.1.3 The Shift to Multicore Architectures and Related Architectural Trends

The shift to multicore architectures meant that architects began using the still-increasing transistor counts per chip to build multiple cores per chip rather than higher-performance single-core chips. Higher-performance cores were eschewed in part because of diminishing performance returns and emerging chip power constraints that made small performance gains at a cost of larger power use unattractive. When single-core scaling slowed, a shift in emphasis to multicore chips was the obvious choice, in part because it was the only alternative that could be deployed rapidly. Multicore chips consisting of less complex cores that exploited only the most effective ILP ideas were developed. These chips offered the promise of performance scaling linearly with power. However, this scaling was only possible if software could effectively make use of them (a significant challenge). Moreover, early multicore chips with just a few cores could be used effectively at either the operating system level, avoiding the need to change application software, or by a select group of applications retargeted for multicore chips.

With the turn to multicore, at least three other related architectural trends are important to note to understand how computer designers and architects seek to optimize performance--a shift toward increased data parallelism, accelerators and reconfigurable circuit designs, and system-on-a-chip (SoC) integrated designs.

First, a shift toward increased data parallelism is evident particularly in graphics processing units (GPUs). GPUs have evolved, moving from fixed-function pipelines to somewhat configurable ones to a set of throughput-oriented "cores" that allowed more successful general-purpose GPU (GP-GPU) programming.

Second, accelerators and reconfigurable circuit designs have matured to provide an intermediate alternative between software running on fixed hardware, for example, a multicore chip, and a complete hardware solution such as an application-specific integrated circuit, albeit with their own cost and configuration challenges. Accelerators perform fixed functions well, such as encryption-decryption and compression-decompression, but do nothing else. Reconfigurable fabrics, such as field-programmable gate arrays (FPGAs), sacrifice some of the performance and power benefits of fixed-function accelerators but can be retargeted to different needs. Both offer intermediate solutions in at least four ways: time needed to design and test, flexibility, performance, and power.

Reconfigurable accelerators pose some serious challenges in building and configuring applications; tool chain issues need to be addressed before FPGAs can become widely used as cores. To use accelerators and reconfigurable logic effectively, their costs must be overcome when they are not in use. Fortunately, if power, not silicon area, is the primary cost measure,

5 Achieved application performance depends on the characteristics of the application's resource demands and on the hardware.
6 ILP improvements are incremental (10-20 percent), leading to single-digit compound annual growth rates.
7 According to Mark Bohr, "Classical MOSFET scaling techniques were followed successfully until around the 90nm generation, when gate-oxide scaling started to slow down due to increased gate leakage" (Mark Bohr, February 9, 2009, "ISSCC Plenary Talk: The New Era of Scaling in an SOC World"). At roughly the same time, subthreshold leakage limited the scaling of the transistor Vt (threshold voltage), which in turn limited the scaling of the voltage supply in order to maintain performance. Since the active power of a circuit is proportional to the square of the supply voltage, this reduced scaling of supply voltage had a dramatic impact on power. This interaction between leakage power and active power has led chip designers to a balance where leakage consumes roughly 30 percent of the power budget. Several approaches are being undertaken. Copper interconnects have replaced aluminum. Strained silicon and silicon-on-insulator have provided improved transistor performance. Use of a low-K dielectric material for the interconnect layers has reduced the parasitic capacitance, improving performance. High-K metal gate transistor structures restarted gate "oxide" scaling with orders-of-magnitude reduction in gate leakage. Transistor structures such as FinFET, or Intel's Tri-Gate, have improved control of the transistor channel, allowing additional scaling of Vt for improved transistor performance and reduced active and leakage power.
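One way to see why throughput-oriented, data-parallel designs such as GPUs help: per-instruction overheads (fetch, decode) are paid once per instruction rather than once per data element. A toy cost model, with unit costs invented purely for illustration:

```python
# Why data-parallel hardware saves energy: instruction fetch/decode
# cost is paid per instruction, not per data element. Toy cost model;
# the unit costs are made-up numbers for illustration only.
import math

FETCH_DECODE = 10  # per-instruction overhead (arbitrary units)
ALU_OP = 1         # per-element arithmetic cost (arbitrary units)

def scalar_cost(n):
    """One instruction issued per element."""
    return n * (FETCH_DECODE + ALU_OP)

def simd_cost(n, lanes):
    """One instruction drives `lanes` elements at a time."""
    return math.ceil(n / lanes) * FETCH_DECODE + n * ALU_OP

print(scalar_cost(1024))    # 11264
print(simd_cost(1024, 32))  # 1344: the fetch/decode overhead is
                            # amortized across 32 lanes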
turning the units off when they are not needed reduces energy consumption (see discussion of dark and dim silicon, below).

Third, increasing levels of integration that made the microprocessor possible four decades ago now enable complete SoCs, which combine most of the functions of a motherboard onto a single chip, usually with off-chip main memory. These processors integrate memory and input/output controllers, graphics processors, and other special-purpose accelerators. SoC designs are widely used in almost all devices, from servers and personal computers to smartphones and embedded devices.

Fourth, power efficiency is increasingly a major factor in the design of multicore chips. Power has gone from a factor to optimize in the near-final design of computer architectures, to a second-order constraint, to, now, a first-order design constraint. As the right side of Figure 1-2 projects, future systems cannot achieve more performance from simply a linear increase in core count at a linear increase in power. Chips deployed in everything from phones, tablets, and laptops to servers and data centers must take power needs into account.

One technique for enabling more transistors per chip at better performance levels without dramatically increasing the power needed per chip is dark silicon. Dark silicon refers to a design wherein a chip has many transistors, but only a fraction of them are powered on at any one time to stay within a power budget. Thus, function-specific accelerators can be powered on and off to maximize chip performance. A related design is dim silicon, where transistors operate in a low-power but still-useful state. Dark and dim silicon make accelerators and reconfigurable logic more effective. However, making dark and dim silicon practical is not easy, because adding silicon area per chip always raises cost, even if the silicon only provides value when it is on. This also presents significant software challenges, as each heterogeneous functional unit requires efficient code (e.g., this may mean multiversion code, as well as compilers and tool chains designed for many variations). Thus, even as dark and dim silicon become more widely adopted, using them to create value is a significant open challenge.

Moreover, emerging transistors have more variability than in the past, due to variations in the chip fabrication process: Some transistors will be faster, while others are slower, and some use more power and others use less. This variability is emerging now, because some aspects of fabrication technology (e.g., gate oxides) are reaching atomic dimensions. Classically, hardware hid almost all errors from software (except memory errors) with techniques (such as guard bands) that conservatively set parameters well above a mean value to tolerate variation while creating the illusion of error-free hardware. As process variation grows relative to mean values, guard bands become overly conservative. This means that new errors will be exposed more frequently to software, posing software and system reliability challenges.

1.1.4 Game Changer: Programming for Explicitly Parallel Commodity Hardware

The advent of multicore chips changes the software interface. Sequential software no longer becomes faster with every hardware generation, and software needs to be written to leverage parallel hardware explicitly. Current trends in hardware, specifically multicore, might seem to suggest that every technology generation will increase the number of processors and, accordingly, that parallel software written for these chips would speed up in proportion to the number of processors (often referred to as scalable software).

Reality is not so straightforward. There are limits to the number of cores that can usefully be placed on a chip. Moreover, even software written in parallel languages typically has a sequential component. In addition, there are intrinsic limits on the theoretically available parallelism in some problems, as well as in their solution via currently known algorithms. Even a small fraction of sequential computation significantly compromises scalability (see Figure 1-3), undermining the improvements that might be expected from additional processors on the chip.

FIGURE 1-3 Amdahl's Law example of potential speedup on 16 cores based on the fraction of the program that is parallel.
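The limit that Figure 1-3 illustrates is Amdahl's Law, which bounds the speedup on n cores by the fraction of the program that must run sequentially. It can be computed directly; the fractions chosen below are illustrative:

```python
# Amdahl's Law: speedup on n cores is capped by the serial fraction.
def amdahl_speedup(parallel_fraction, n_cores):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# On 16 cores, even a small sequential component bites hard:
for p in (1.00, 0.95, 0.90, 0.50):
    print(f"{p:.0%} parallel -> {amdahl_speedup(p, 16):.1f}x")
# 100% parallel -> 16.0x; 95% -> 9.1x; 90% -> 6.4x; 50% -> 1.9x
```

A program that is 95 percent parallel wastes almost half of a 16-core chip, which is why the text calls even a small sequential component a significant compromise to scalability.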
As part of ongoing research programs, it will be important to analyze the interplay among the available parallelism in applications, energy consumption by the resulting chip (under load with real applications), the performance of different algorithmic formulations, and programming complexity.

In addition, most programs written in existing parallel languages are dependent on the number of hardware processors. Further developments are necessary to make parallel software performance portable; that is, it should execute on a variety of parallel computing platforms and should show performance in proportion to the number of processors on all these platforms, without modifications, within some reasonable bounds.

Deeply coupled to parallelism is data communication. To operate on the same data in parallel on different processors, the data must be communicated to each processor. More processors imply more communication. Communicating data between processors on the same chip or between chips is costly in power and time. Unfortunately, most parallel programming systems result in programs whose performance heavily depends on the memory hierarchy organization of the processor. Where the data is located in a system directly affects performance and energy. Consequently, sequential and even existing parallel software is not performance portable to successive generations of evolving parallel hardware, or even between two machines of the same generation with the same number of processors if they have different memory organizations. Software designers currently must modify software for it to run efficiently on each multicore machine. The need for such efforts breaks the virtuous cycle described above and makes building and evolving correct, secure, and performance-portable software a substantial challenge.

Finally, automatic parallelization systems, which seek to transform a sequential program into a parallel program without programmer intervention, have mostly failed. Had they been successful, programmers would be able to write in a familiar sequential language and yet still see the performance benefits of parallel execution. As a result, research now focuses on programmer-specified parallelism.

1.1.5 Heterogeneity in Hardware

As mentioned above, not only are technology trends leading designers and developers of computer hardware to focus on multicore systems, but they are also leading to an emphasis on specialization and heterogeneity to provide power, performance, and energy efficiency. This specialization is a marked contrast to previous approaches. GPUs are an example of hardware specialization designed to be substantially more power efficient for a specific workload. The problem with this trend is three-fold.

First, hardware specialization can only be justified for ubiquitous and/or high-value workloads, due to the high cost of chip design and fabrication. Second, creating software that exploits hardware specialization and heterogeneity closely couples hardware and software--such coupling may be good for performance, power, and energy, but it typically sacrifices software portability to different hardware, a mainstay expectation in computing over many decades. Third, the lead time needed for effective software support of these heterogeneous devices may reduce the time they can be competitive in the marketplace. If it takes longer to deliver the tools (compilers, domain-specific languages, and so on) than it takes to design and deliver the chip, then the tools will appear after the chip, with negative consequences.8 This problem, however, is not new. For example, by holding the IA-32 instruction set architecture relatively constant across generations of hardware, software could be delivered in a timely manner. Designing and building a software system for hardware that does not exist, or is not similar to prior hardware, requires well-specified hardware-software interfaces and accurate simulators to test the software independently. Because executing software on simulators requires tens to thousands of times more time than executing on actual hardware, software will lag hardware without careful system and interface design. In summary, writing portable and high-performance software is hard, making such software parallel is harder, and developing software that can exploit heterogeneous parallel architectures is even harder.

1.1.6 Correct and Secure Software that Evolves

Performance--in the sense of ever-increasing chip speed--is not the only critical demand of modern application and system software. Although performance is fungible and ever-faster computer chips can be used to enable a variety of functionality, software is the underpinning of virtually all our economic, government, and military infrastructure. Thus, the criticality of Secure, Parallel, Evolvable, Reliable, and Correct software cannot be overemphasized. This report uses the term SPERC software to refer to these software properties.

8 The same is true for hardware-software co-design efforts. Success in co-design requires that both the hardware and software be delivered at roughly the same time. If the software lags behind the hardware, it diminishes the strategy's effectiveness.
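The programmer-specified parallelism that Section 1.1.4 describes puts the decomposition of work in the programmer's hands rather than the compiler's. A minimal sketch using Python's standard library; the function names and chunking strategy are our own illustration, and threads are used here only for portability (in CPython, true multicore speedup would require processes or another language):

```python
# Programmer-specified parallelism in miniature: the programmer, not
# an automatic parallelizer, decides how the work is split up.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """The sequential kernel each worker runs."""
    return sum(x * x for x in chunk)

def parallel_sum_squares(data, workers=4):
    # The decomposition is explicit: the programmer chooses the chunks
    # and must also handle the sequential combine step at the end.
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum_squares(range(100)))  # 328350, same as the sequential sum
```

Note that the final `sum` over partial results is exactly the sequential component that Amdahl's Law penalizes, and that a different chunking would be needed on a machine with a different memory organization, which is the performance-portability problem described above.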
technologies show potential, each has serious challenges that need to be resolved through continued fundamental research before they could be adopted for high-volume manufacturing.

1.2.2 Prospects for Performance Improvements

In the committee's view, there is no "silver bullet" to address current challenges in computer architecture and the slowdown in growth in computing performance. Rather, efforts in complementary directions to continue to increase value, that is, performance under power constraints, will be needed. Early multicore chips offered homogeneous parallelism. Heterogeneous cores on a single chip are now part of an effort to achieve greater power efficiency, but they present even greater programming challenges.

Efforts to advance conventional multicore chips and to create more power-efficient core designs will continue. On one hand, researchers will continue to explore techniques that reduce the power used by individual cores without unduly sacrificing performance. In turn, this will allow placement of more cores on each chip. Researchers could also explore radical redesigns of cores that focus on power first, for example, by minimizing nonlocal information movement through spatially aware designs that limit communication of data (see Section 1.1.4).

GP-GPU computing, in particular, and vector and single-instruction multiple-data operation, in general, offer promise for specific workloads. Each of these reduces power consumption by amortizing the cost of dealing with an instruction (e.g., fetch and decode) across the benefit of many data manipulations. All offer great peak performance, but this performance can be hard to achieve without deep expertise coupling algorithm and architecture, hardly a prescription for broad programmability. Moreover, software that runs on such chips must cope with allocating work to heterogeneous computing units, such as throughput-oriented GPUs and latency-oriented conventional central processing units, highlighting the need for advances in software and programming methodologies as described earlier.

More heterogeneity will arise from expanded use of the accelerators and reconfigurable logic, described earlier, that are needed for increased performance under power constraints. Accelerators are so named because they can accelerate performance. While this is true, recent work shows that the greater benefit of accelerators may be in reducing power.11 However, accelerator effectiveness can be blunted by the overheads of communicating control and data to and from accelerators, especially if someone seeks to offload ever smaller amounts of work to expand the availability of off-loadable work.

Reconfigurable designs, such as FPGAs, described earlier, may provide a middle ground, but they are not yet easily programmable. Similarly, SoCs combine specialized accelerators on a single chip and have had great success in the embedded market, such as smartphones and tablets. As SoCs continue to proliferate, the challenge will be simplifying software and hardware design and programmability while maximizing performance and power efficiency.

Moreover, communication at all levels--close, cross-chip, off-chip, off-board, off-node, offsite--must be minimized to save energy. For example, moving operands from a close-by register file can use energy comparable to an operation (e.g., floating-point multiply-add), while moving them from across or off the chip uses tens to hundreds of times more energy. Thus, a focus on reducing computational energy without a concomitant focus on reducing communication is doomed to have limited effect.

Finally, a reconsideration of the hardware-software boundary may be in order. While abstraction layers hide complexity at each boundary, they also hide optimization and innovation possibilities. For decades, software and hardware experts innovated independently on opposite sides of the instruction set architecture boundary. Multicore chips began the end of the era of separation. Going forward, co-design is needed, where chip functionality and software are designed in concert, with repeated design and optimization feedback between the hardware and software teams. However, since the software development cycle typically lags significantly behind the hardware development cycle, effective co-design will also require more rapid deployment of effective tools on a timescale commensurate with the specialized hardware if its full functionality is to be realized.

1.2.3 Software

Creating software systems and applications for parallel, power-constrained computing systems on a single chip requires innovations at all levels of the software-hardware design and engineering stack: algorithms, programming models, compilers, runtime

11 Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz, 2010, "Understanding Sources of Inefficiency in General-Purpose Chips," Proceedings of the 37th International Symposium on Computer Architecture (ISCA), Saint-Malo, France, June 2010.
COMPUTER AND SEMICONDUCTOR TECHNOLOGY TRENDS AND IMPLICATIONS 13

systems, operating systems, and hardware-software interfaces.

One strategy for addressing the challenges inherent to parallel programming is to first design application-specific languages and system software and then seek generalizations. The most successful examples of parallelism come from distributed search systems, Web services, and databases executing on distinct devices, as opposed to the challenge of parallelism within a single device (chip) that is addressed here. Parallel algorithm and system design success stories include MapReduce12 for processing data used in search, databases for persistent storage and retrieval, and domain-specific toolkits with parallel support, such as MATLAB. Part of their success is rooted in providing a level of abstraction in which programmers write sequential components, while the runtime and system software implement and manage the parallelism.

On the other hand, GPU programming is also a success story, but when used for game engine development, for instance, it relies on expert programmers with deep knowledge of parallelism, algorithm-to-hardware mappings, and performance tuning. General-purpose computing on GPUs does not require in-depth knowledge about graphics hardware, but does require programmers to understand parallelism, locality, and bandwidth--general-purpose computing primitives.

More research is needed in domain-specific parallel algorithms, because most applications are sequential. Sequential algorithms are almost never appropriate for parallel systems. Expressing algorithms in such a way that they satisfy the key SPERC properties and are performance portable across different parallel hardware and generations of parallel hardware requires investment and research in new programming models and programming languages.

These programming models must enable expert, typical, and potentially naïve programmers to use parallel hardware effectively. Since parallel programming is extremely complex, the expertise necessary to work effectively in this realm is currently only within reach of the most expert programmers, and the majority of existing systems are not performance portable. A key requirement will be to create modular programming models that make it possible to encapsulate parallel software in libraries in such a way that (1) they can be reused by many applications and (2) the system adapts and controls the total amount of parallelism so as to utilize the hardware effectively, without over- or undersubscription. Mapping applications, which use these new models, to parallel hardware will require new compiler, runtime, and operating system services that model, observe, and reason about, and then adapt to and change, dynamic program behaviors to satisfy performance and energy constraints.

Because power and energy are now the first-order constraints in hardware design, there is an opportunity for algorithmic design and system software to play a much larger role in power and energy management. This area is a critical research topic with broad applicability.

1.3 The Rise of Mobile Computing, Services, and Software

Historically, the x86 instruction set architecture has come to dominate the commercial computing space, including laptops, desktops, and servers. Developed originally by Intel and licensed by AMD, the commercial success of this architecture has either eliminated or forced into smaller markets other architectures developed by MIPS, HP, DEC, and IBM, among others. More than 300 million PCs are sold each year, most of them powered by x86 processors.13 Further, since the improvement in capabilities of single-core processors started slowing dramatically, nearly all laptops, desktops, and servers are now shipping with multicore processors.

Over the past decade, the availability of capable, affordable, and very low-power processor chips has spurred a fast rise in mobile computing devices in the form of smartphones and tablets. The annual sales volume of smartphones and tablets already exceeds that of PCs and servers.14 The dominant architecture is U.K.-based ARM, rather than x86. ARM does not manufacture chips; instead it licenses the architecture to third parties for incorporation into custom SoC designs by other vendors. The openness of the ARM architecture has facilitated its adoption by many hardware manufacturers. In addition, these mobile devices now commonly incorporate two cores, and at least one SoC vendor has been shipping four-core designs in volume since early 2012. Furthermore, new heterogeneous big- and small-core designs that couple a higher performance, higher power core with a lower performance, lower power core have recently been announced.15 Multicore chips are now ubiquitous across the entire range of computing devices.

12 In 2004, Google introduced the software framework MapReduce to support distributed computing on large datasets on clusters of computers.
13 See http://www.gartner.com/it/page.jsp?id=1893523. Last accessed on February 7, 2012.
14 See http://www.canalys.com/newsroom/smart-phones-overtake-client-pcs-2011. Last accessed on February 7, 2012.
15 See www.tegra3.org; http://www.reuters.com/article/2011/10/19/arm-idUSL5E7LJ42H20111019. Last accessed on June 25, 2012.
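The MapReduce abstraction discussed in Section 1.2.3--programmers write sequential components while the runtime and system software manage the parallelism--can be illustrated with a minimal word-count sketch. This example is illustrative only (it is not drawn from the report or from Google's implementation): the programmer supplies a sequential map function and a sequential reduce function, and in a real MapReduce system the runtime, not the programmer, would distribute these across a cluster and handle data movement.

```python
from collections import Counter
from functools import reduce

def map_phase(document: str) -> Counter:
    # Sequential component: count words within a single document.
    return Counter(document.lower().split())

def reduce_phase(a: Counter, b: Counter) -> Counter:
    # Sequential component: merge two intermediate counts.
    return a + b

# A MapReduce runtime would run map_phase on many machines in
# parallel and combine results with reduce_phase; here we apply
# the same two phases serially to show the programming model.
documents = ["the chip", "the core and the chip"]
word_counts = reduce(reduce_phase, map(map_phase, documents))
print(word_counts["the"])  # 3
```

The point of the pattern is that neither function mentions threads, machines, or communication; all parallelism lives in the framework that invokes them.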
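The requirement that a modular parallel library let the system control the total amount of parallelism--avoiding both over- and undersubscription--can be sketched as follows. The names (`parallel_map`, `_MAX_WORKERS`) are hypothetical choices for illustration, and a production runtime would account for nesting and contention (and, in Python specifically, the global interpreter lock); the sketch shows only the design principle of sizing parallelism to the hardware inside the library rather than in each application.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Library-level cap derived from the hardware, so that applications
# composing many parallel calls do not oversubscribe the machine.
_MAX_WORKERS = os.cpu_count() or 1

def parallel_map(fn, items):
    """Apply fn to items with library-managed parallelism."""
    items = list(items)
    if not items:
        return []
    # Use no more workers than cores (avoids oversubscription) and
    # no more than tasks (avoids idle workers / undersubscription).
    workers = max(1, min(_MAX_WORKERS, len(items)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))

squares = parallel_map(lambda x: x * x, range(8))
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because callers never choose a thread count, the same application code can run unchanged on a 2-core phone or a 64-core server, which is one form of the performance portability the section calls for.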
The rise of the ARM architecture in mobile computing has the potential to adjust the balance of power in the computing world as mobile devices become more popular and supplant PCs for many users. Although the ARM architecture comes from the United Kingdom, Qualcomm, Texas Instruments, and NVIDIA are all U.S.-based companies with strong positions in this space. However, the shift does open the door to more foreign competition, such as Korea's Samsung, and to new entrants, because ARM licenses are relatively inexpensive, allowing many vendors to design ARM-based chips and have them fabricated in Asia.

However, just as technical challenges are changing the hardware and software development cycle and the software-hardware interface, the rise of mobile computing and its associated software ecosystems is changing the nature of software deployment and innovation in applications. In contrast to developing applications for general-purpose PCs--where any application developer, for example, a U.S. defense contractor or independent software vendor, can create software that executes on any PC of their choosing--in many cases, developing software for mobile devices imposes additional requirements on developers, with "apps" having to be approved by the hardware vendors before deployment. There are advantages and disadvantages to each approach, but changes in the amount and locus of control over software deployments will have implications for what kind of software is developed and how innovation proceeds.

A final inflection point is the rise of large-scale services, as exemplified by search engines, social networks, and cloud-hosting services. At the largest scale, the systems supporting each of these are larger than the entire Internet was just a few years ago. Associated innovations have included a renewed focus on analysis of unstructured and ill-structured data (so-called big data), packaging and energy efficiency for massive data centers, and the architecture of service delivery and content distribution systems. All of these are the enabling technologies for delivery of services to mobile devices. The mobile device is already becoming the primary personal computing system for many people, backed up by data storage, augmented computational horsepower, and services provided by the cloud. Leadership in the technologies associated with distributed cloud services, data center hardware and software, and mobile devices will provide a competitive advantage in the global computing marketplace.

Software innovations in mobile systems, where power constraints are severe (battery life directly affects user experience), are predicted to use a different model than PCs, in which more and more processing is performed in the "cloud" rather than on the mobile device. A flexible software infrastructure and algorithms that optimize for network availability, power on the device, and precision are heralding a challenging ecosystem.

Fundamental to these technologies are algorithms for ensuring properties such as reliability, availability, and security in a distributed computing system, as well as algorithms for deep data mining and inference. These algorithms are very different in nature from parallel algorithms suitable for traditional supercomputing applications. While U.S. researchers have made investments in these areas already, the importance and commercial growth potential demand research and development into algorithmic areas including encryption, machine learning, data mining, and asynchronous algorithms for distributed systems protocols.

1.4 Summary and Implications

Semiconductor scaling has encountered fundamental physical limits, and improvements in performance and power are slowing. This slowdown has, among other things, driven a shift from single-microprocessor computer architectures to homogeneous and now heterogeneous multicore processors, which break the virtuous cycle that most software innovation has expected and relied on. While innovations in transistor materials, lithography, and chip architecture provide promising opportunities for improvements in performance and power, there is no consensus in the semiconductor and computer industry on the most promising path forward.

It is likely that these limitations will require a shift in the locus of innovation away from dependence on single-thread performance, at least in the way performance has been achieved (i.e., increasing transistor count per chip at reduced power). Performance at the processor level will continue to be important, as that performance can be translated into desired functionalities (such as increased security, reliability, more capable software, and so on). But new ways of thinking about overall system goals and how to achieve them may be needed.

What, then, are the most promising opportunities for innovation breakthroughs by the semiconductor and computing industry? The ongoing globalization of science and technology and increased--and cheaper--access to new materials, technologies, infrastructure, and markets have the potential to shift the U.S. competitive advantage in the global computing ecosystem, as well as to refocus opportunities for innovation in the computing space. In addition, the computing and semiconductor
industry has become a global enterprise, fueled by increasingly competitive overseas semiconductor markets and firms that have made large and focused investments in the computing space over the last decade. The possibility of new technological approaches emerging both in the United States and overseas reinforces the critical need for the United States to assess the geographic and technological landscape of research and development focused on this and other areas of computer and semiconductor innovation.