The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
4 The End of Programming as We Know It

Future growth in computing performance will have to come from software parallelism that can exploit hardware parallelism. Programs will need to be expressed by dividing work into multiple computations that execute on separate processors and that communicate infrequently or, better yet, not at all. This chapter first explains how current software reaped the benefits of Moore's law and how much of the resulting software is not well suited to parallelism. It then explores the challenges of programming parallel systems. The committee explores examples of software and programming parallelism successes and possibilities for leveraging these successes, as well as examples of limitations of parallelism and challenges to programming parallel systems.

The sudden shift from single-core to multiple-core processor chips requires a dramatic change in programming, but software developers are also challenged by the continuously widening gap between memory-system and processor performance. That gap is often referred to as the "memory wall," but it reflects a continuous rather than discrete shift in the balance between the costs of computational and memory operations and adds to the difficulty of obtaining high performance. To optimize for locality, software must be written to minimize both communication between processors and data transfers between processors and memory.

MOORE'S BOUNTY: SOFTWARE ABSTRACTION

Moore's bounty is a portable sequential-programming model.1 Programmers did not have to rewrite their software—software just ran faster on the next generation of hardware. Programmers therefore designed and built new software applications that executed correctly on current hardware but were often too compute-intensive to be useful (that is, it took too long to get a useful answer in many cases), anticipating the next generation of faster hardware. The software pressure built demand for next-generation hardware. The sequential-programming model evolved in that ecosystem as well. To build innovative, more capable, sophisticated software, software developers turned increasingly to higher-level sequential-programming languages and higher levels of abstraction (that is, the reuse of libraries and component software for common tasks). Moore's law helped to drive the progression in sequential-language abstractions because increasing processor speed hid their costs.

For example, early sequential computers were programmed in assembly language. Assembly-language statements have a one-to-one mapping to the instructions that the computer executes. In 1957, Backus and his colleagues at IBM recognized that assembly-language programming was arduous, and they introduced the first implementation of a high-level sequential scientific-computing language, called Fortran, for the IBM 704 computer.2 Instead of writing assembly language, programmers wrote in Fortran, and then a compiler translated Fortran into the computer's assembly language. The IBM team made the following claims for that approach:

1. Programs will contain fewer errors because writing, reading, and understanding Fortran is easier than performing the same tasks in assembly language.

2. It will take less time to write a correct program in Fortran because each Fortran statement is an abstraction of many assembly instructions.

3. The performance of the program will be comparable with that of assembly language given good compiler technology.

1 Jim Larus makes this argument in a 2009 article: Spending Moore's dividend, Communications of the ACM 52(5): 62-69.

2 J.W. Backus, R.I. Beeber, S. Best, R. Goldberg, L.M. Haibt, H.L. Herrick, R.A. Nelson, D. Sayre, P.B. Sheridan, H. Stern, I. Ziller, R.A. Hughes, and R. Nutt, 1957, The Fortran automatic coding system, Proceedings of the Western Joint Computer Conference, Los Angeles, Cal., pp. 187-198, available online at http://archive.computerhistory.org/resources/text/Fortran/102663113.05.01.acc.pdf.

The claimed benefits of high-level languages are now widely accepted. In fact, as computers got faster, modern programming languages added more and more abstractions. For example, modern languages—such as Java, C#, Ruby, Python, F#, PHP, and JavaScript—provide such features as automatic memory management, object orientation, static typing, dynamic typing, and referential transparency, all of which ease the programming task. They often do so at a performance cost, but companies chose these languages to improve the correctness and functionality of their software, which they valued more than performance, mainly because the progress of Moore's law hid the costs of abstraction. Although higher levels of abstraction often result in performance penalties, the initial transition away from hand-coded assembly language came with performance gains, in that compilers manage the complexity of low-level code generation, such as register allocation and instruction scheduling, better than most programmers do.

That pattern of increases in processor performance coupled with increases in the use of programming-language abstractions has played out repeatedly. The above discussion describes the coupling for general-purpose computing devices, and it is at various stages in other hardware devices, such as graphics hardware, cell phones, personal digital assistants, and other embedded devices. High-level programming languages have made it easier to create large, capable, sophisticated sequential programs. During the height of the synergy between software and increasing single-processor speeds in 1997, Nathan Myhrvold, former chief technology officer for Microsoft, postulated four "laws of software":3

1. Software is a gas. Software always expands to fit whatever container it is stored in.

2. Software grows until it becomes limited by Moore's law. The growth of software is initially rapid, like gas expanding, but is inevitably limited by the rate of increase in hardware speed.

3. Software growth makes Moore's law possible. People buy new hardware because the software requires it.

4. Software is limited only by human ambition and expectation. We will always find new algorithms, new applications, and new users.

3 These laws were described in a 1997 presentation that the Association for Computing Machinery hosted on the next 50 years of computing (Nathan P. Myhrvold, 1997, The next fifty years of software, Presentation, available at http://research.microsoft.com/en-us/um/siliconvalley/events/acm97/nmNoVid.ppt).

Myhrvold's analysis explains both the expansion of existing applications and the explosive growth in innovative applications. Some of the code expansion can be attributed to a lack of attention to performance and memory use: it is often easier to leave old and inefficient code in a system than to optimize it and clean it up. But the growth in performance also enabled the addition of new features into existing software systems and new paradigms for computing. For example, Vincent Maraia reports that in 1993 the Windows NT 3.1 operating system (OS) consisted of 4-5 million lines of code and that by 2003 the Windows Server OS had 50 million lines, 10 times as many.4 Similarly, from 2000 to 2007, the Debian Linux OS grew from about 59 million lines in version 2.2 to 283 million lines in version 4.0, about 5 times as many.5 Those operating systems added capabilities, such as better reliability, without slowing down the existing features, and users experienced faster operating-system startup time and improvement in overall performance. Furthermore, the improvements described by Moore's law enabled new applications in such domains as science, entertainment, business, and communication. Thus, the key driver in the virtuous cycle of exploiting Moore's law is that applications benefited from processor performance improvements without having to be adapted to changes in hardware. Programs ran faster on successive generations of hardware, allowing new features to be added without slowing the application performance.

The problem is that much of the innovative software is sequential and is designed to execute on only one processor, whereas the previous chapters explained why all future computers will contain multiple processors.
Thus, current programs will not run faster on successive generations of hardware.6 The shift in the hardware industry has broken the performance-portability connection in the virtuous cycle—sequential programs will not benefit from increases in processor performance that stem from the use of multiple processors. There were and are many problems—for example, in search, Web applications, graphics, and scientific computing—that require much more processing capability than a single processor provides. The developers of those applications and programming systems have made much progress in providing appropriate abstractions (discussed in detail below), but not enough, in that most developers and programming systems currently use the sequential model. Conventional sequential programs and programming systems are ill equipped to support parallel programming because they lack abstractions to deal with the problems of extracting parallelism, synchronizing computations, managing locality, and balancing load. In the future, however, all software must be able to exploit multiple processors to enter into a new virtuous cycle with successive generations of parallel hardware that expands software capabilities and generates new applications.7

Finding: There is no known alternative to parallel systems for sustaining growth in computing performance; however, no compelling programming paradigms for general parallel systems have yet emerged.

To develop parallel applications, future developers must invent new parallel algorithms and build new parallel applications. The applications will require new parallel-programming languages, abstractions, compilers, debuggers, execution environments, operating systems, and hardware virtualization systems. We refer to those tools collectively as a programming system. Future programming systems will need to take advantage of all those features to build applications whose performance will be able to improve on successive generations of parallel hardware that increase their capabilities by increasing the number of processors. In contrast, what we have today are conventional sequential-programming systems based on two abstractions that are fundamentally at odds with parallelism and locality. First, they tie the ordering of statements in a program to a serial execution order of the statements. Any form of parallelism violates that model unless it is unobservable.

4 Vincent Maraia, 2005, The Build Master: Microsoft's Software Configuration Management Best Practices, Addison-Wesley Professional.

5 Debian, Wikipedia, http://en.wikipedia.org/wiki/Debian.

6 Successive generations of hardware processors will not continue to increase in performance as they have in the past; this may be an incentive for programmers to develop tools and methods to optimize and extract the most performance possible from sequential programs. In other words, working to eliminate the inefficiencies in software may yield impressive gains, given that past progress in hardware performance encouraged work on new functions rather than optimizing existing functions. However, optimizing deployed software for efficiency will ultimately reach a point of diminishing returns and is not a long-term alternative to moving to parallel systems for more performance.
Second, conventional programs are written on the assumption of a flat, uniform-cost global memory system. Managing locality (minimizing the number of expensive main-memory references) is at odds with the flat model of memory, which does not distinguish between fast and slow memory (for example, on- and off-chip). Parallelism and locality are also often in conflict in that

7 Indeed, the compiler community is bracing for the challenges ahead. On p. 62 of their article, Mary Hall et al. observe that "exploiting large-scale parallel hardware will be essential for improving an application's performance or its capabilities in terms of execution speed and power consumption. The challenge for compiler research is how to enable the exploitation of the power [that is, performance, not thermals or energy] of the target machine, including its parallelism, without undue programmer effort." (Mary Hall, David Padua, and Keshav Pingali, 2009, Compiler research: The next 50 years, Communications of the ACM 52(2): 60-67.)

locality will encourage designs that put all data close to a single processor to avoid expensive remote references, whereas performing computations in parallel requires spreading data among processors.

SOFTWARE IMPLICATIONS OF PARALLELISM

There are five main challenges to increasing performance and energy efficiency through parallelism:

· Finding independent operations.
· Communicating between operations.
· Preserving locality between operations.
· Synchronizing operations.
· Balancing the load represented by the operations among the system resources.

The first challenge in making an application parallel is to design a parallel algorithm that solves the problem at hand and provides enough independent operations to keep the available parallel resources busy. Some demanding problems have large amounts of data parallelism—that is, a single operation can be performed for every data element of a set, and the operations are independent of one another (or can be made so via transformations). Some problems also have moderate amounts of control or task parallelism, in which different operations can be performed in parallel on different data items. In both task and data parallelism, an operation may comprise a sequence of instructions. For some applications, the parallelism is limited by a sequence of dependent operations, and performance is limited not by throughput but by the latency along this critical path.8

The second challenge, communication, arises when computations that execute in parallel are not entirely independent and must communicate. Some demanding problems cannot be divided into completely independent parallel tasks, but they can be divided into parallel tasks that communicate to find a solution cooperatively. For example, to search for a particular object in an image, one may divide the image into pieces that are searched by independent tasks. If the object crosses pieces, the tasks will need to communicate.
The programming system can perform communication through inputs and outputs along dependences by reading and writing to shared data structures or by explicitly sending messages between parallel operations. Even in the implicit case, some data will need to transfer between processors to allow access to shared data.

Locality, the third challenge, reduces the costs associated with communication by placing two operations that access the same data near each other in space or in time. Scheduling operations nearby in space, on the same processor, avoids communication entirely, and placing them on nearby processors may reduce the distance that data need to travel. Scheduling operations nearby in time shortens the lifetime of data produced by one operation and consumed by another; this reduces the volume of live data and allows the data to be captured in small on-chip memories. Locality avoids the need for communication between processors, but it is also critical for avoiding another form of communication: the movement of data between memory and processors.

The fourth challenge, synchronization, is also needed to provide cooperation between parallel computations. Some operations must be performed in a particular order to observe dependences. Other operations may be performed in an arbitrary order but must be grouped so that some sequences execute atomically (without interference from other sequences). Synchronization is used to serialize parts of an otherwise parallel execution, and there is often a tension between the performance gained from parallelism and the correctness ensured by synchronization. For example, barrier synchronization forces a set of parallel computations to wait until all of them reach the barrier. Locks are used to control access to shared data structures by allowing only one thread to hold a given lock at a time. Unnecessary synchronization may occur when an entire data structure is locked to manipulate one element or when a barrier is placed on every loop iteration even when the iterations are independent.

8 This limit on parallelism is often called Amdahl's law, after Gene Amdahl. For more on this law, see Box 2.4.
Finally, load balancing involves distributing operations evenly among processors. If the load becomes unbalanced because some processors have more work than others or take more time to perform their work, other processors will be idle at barrier synchronizations or when program execution ends. The difficulty of load balancing depends on the characteristics of the application. Load balancing is trivial if all parallel computations have the same cost, more difficult if they have different costs that are known in advance, and even more difficult if the costs are not known until the tasks execute.

Locality's Increasing Importance

Effective parallel computation is tied to coordinating computations and data; that is, the system must collocate computations with their data. Data are stored in memory. Main-memory bandwidth, access energy, and latency have all scaled at a lower rate than the corresponding characteristics of processor chips for many years. In short, there is an ever-widening gap between processor and memory performance. On-chip cache memories are used to bridge the gap between processor and memory performance partially. However, even a cache with the best algorithm for predicting the next operands needed by the processor does not have a success rate high enough to close the gap effectively. The advent of chip multiprocessors means that the bandwidth gap will probably continue to widen, in that the aggregate rate of computation on a single chip will continue to outpace main-memory capacity and performance improvements. The gap between memory latency and computation is also a limitation on software performance, although this gap will not grow with multicore technology, inasmuch as clock rates are relatively constant. In addition to performance concerns, the movement of data between cores and between the processor and memory chips consumes a substantial fraction of a system's total power budget. Hence, to keep memory from severely limiting system power and performance, applications must have locality, and we must increase the amount of locality. In other words, the mapping of data and computation should minimize the distance that data must travel.

To see the importance of locality in future systems, it is instructive to examine the relative energy per operation for contemporary systems and how it is expected to scale with technology. In a contemporary 40-nm CMOS process, performing a 64-bit floating-point multiply-add (FMA) operation requires energy, Eop, of about 100 pJ. The energy consumed in moving data over 1 mm of wire, Ew, is 200 fJ/bit-mm, or 12.8 pJ/word-mm (for 64-bit words). Moving data off-chip takes energy, EP, of 2 pJ/bit (128 pJ/word) or more.
Supplying the four operands (three inputs and one output) of the FMA operation from even 1 mm away takes 51.2 pJ—half as much energy as the operation itself. Supplying the data globally on-chip—say, over a distance of 20 mm—takes about 1 nJ, an order of magnitude more energy than the operation. Moving data off-chip is comparably expensive. Thus, to avoid having the vast majority of all energy be spent in moving data, it is imperative that data be kept local.

Locality is inherently present in many algorithms, but the computation must be properly ordered to express it. For dense matrix computations, ordering is usually expressed by blocking the algorithm. For example, consider multiplying two 10,000 × 10,000 matrices. The straightforward algorithm performs 2 × 10^12 arithmetic operations. If we perform the operations in a random order, there is little locality, and 4 × 10^12 memory references will be required to compute the result, so both arithmetic operations and data accesses grow with the cube of the matrix dimension. Even with a natural implementation based on three nested loops, data accesses will grow with the cube of the matrix

dimension, because one of the matrices will be accessed in an order that allows little reuse of data in the cache. However, if we decompose the problem into smaller matrix-multiplication problems, we can capture locality, reusing each word fetched from memory many times. Suppose we have a memory capable of holding 256 kB (32 kwords) 1 mm from our floating-point unit. The local memory is large enough to hold three 100 × 100 submatrices, one for each input operand and one for the partial result. We can then perform a 100 × 100 matrix multiplication entirely out of the local memory, performing 2 × 10^6 operations with only 4 × 10^4 memory references—a ratio of 50 operations per reference. We can apply this blocking recursively: if there is an aggregate on-chip memory of 32 MB (4 Mwords), we can hold three 1,000 × 1,000 submatrices at that level of the storage hierarchy. In a seminal paper, Hong and Kung proved that idea to be optimal for matrix multiplication in the sense that this kind of blocked algorithm moves the minimum amount of data possible between processor and memory system.9 Other array computations, including convolutions and fast Fourier transforms, can be blocked in this manner—although with different computation-to-communication ratios—and there are theoretical results on the optimality of communication for several linear-algebra problems on both parallel and serial machines.

The recursive nature of the blocked algorithm also led to the notion of "cache-oblivious" algorithms, in which the recursive subdivision produces successively smaller subproblems that eventually fit into a cache or other fast memory layer.10 Whereas other blocked algorithms are implemented to match the size of a cache, the oblivious algorithms are optimized for locality without having specific constants, such as cache size, in their implementation.
Locality optimizations for irregular codes, such as graph algorithms, can be much more difficult because the data structures are built with pointers or indexed structures that lead to random memory accesses. Even so, some graph algorithms have considerable locality that can be realized by partitioning the graph into subgraphs that fit into a local memory and reorganizing the computation to operate on each subgraph, with reuse, before moving on to the next subgraph. There are many algorithms and software libraries for performing graph partitioning that minimize edge cuts for locality while keeping subgraph sizes equal for load balancing.11

A key challenge in exploiting locality is developing abstractions that allow a programmer to express the locality in a program independent of any particular target machine. One promising approach, used by the Sequoia programming system,12 is to present the programmer with an abstract memory hierarchy. The programmer views the machine as a tree of memories; the number of levels in the tree and the size of the memory at each level are unspecified. The programmer describes a decomposition method, which subdivides the problem at one level into smaller problems at the next level and combines the partial solutions, and a leaf method, which solves the subproblem at the lowest level of the hierarchy. An autotuner then determines the number of times to apply the decomposition method and the appropriate data sizes at each level to map the program optimally onto a specific machine. The result is a programming approach that gives good locality with portability among diverse target machines.

Software Abstractions and Hardware Mechanisms Needed

Simplifying the task of parallel programming requires software abstractions that provide powerful mechanisms for synchronization, load balance, communication, and locality, as described above, while hiding the underlying details. Most current mechanisms for these operations are low-level and architecture-specific. The mechanisms must be carefully programmed to obtain good performance with a given parallel architecture, and the resulting programs are typically not performance-portable; that is, they do not exhibit better performance on a similar parallel architecture that has more processors.

9 See Hong Jia-Wei and H.T. Kung, 1981, I/O complexity: The red-blue pebble game, Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, Milwaukee, Wis., May 11-13, 1981, pp. 326-333.

10 Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran, 1999, Cache-oblivious algorithms, Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, New York, N.Y., October 17-19, 1999, pp. 285-297.
Successful software abstractions are needed to enable programmers to express the parallelism that is inherent in a program and the dependences between operations and to structure a program to enhance locality, all without being bogged down in low-level architectural details. Which abstractions make parallel programming convenient and result in performance-portable programs is an open research question. Successful abstractions will probably involve global address spaces, accessible ways to describe or invoke parallel operations over

11 For one example of a graph-partitioning library, see George Karypis and Vipin Kumar, 1995, METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Technical report, Minneapolis, Minn.: University of Minnesota.

12 See Kayvon Fatahalian, Timothy J. Knight, Mike Houston, Mattan Erez, Daniel Reiter Horn, Larkhoon Leem, Ji Young Park, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan, 2006, Sequoia: Programming the memory hierarchy, Proceedings of the ACM/IEEE Conference on Supercomputing, Tampa, Fla., November 11-17, 2006.

collections of data, and constructs for atomic operations. Abstractions may also involve abstract machine models that capture resource costs and locality while hiding details of particular machines. Abstractions for parallelism are typically encapsulated in a programming system and execution model.

At the same time, reasonable performance requires efficient underlying hardware mechanisms, particularly in cases that need fine-grained communication and synchronization. Some parallel machines require interactions between processors to occur by means of high-overhead message transfers or by passing data via shared memory locations. Such mechanisms are useful but can be cumbersome and can restrict the granularity of parallelism that can be efficiently exploited. Resolving those details will require research, but successful mechanisms will enable low-overhead communication and synchronization and will facilitate the migration of data and operations to balance load. There are several emerging directions in hardware support for parallel computation. It is too early to know which hardware architecture or architectures will prove most successful, but several trends are evident:

· Multiple processors sharing a memory. This direction was taken by chip multiprocessors and was the primary approach of semiconductor companies once they could no longer continue to speed up their single-processor products.

· Multiple computers interconnected via a high-speed communication network. When very large computation facilities are needed for research or business, it is impractical for all the processors to share a memory, and a high-speed interconnect is used to tie the hundreds or thousands of processors together in a single system. Data centers use this model.

· A single processor containing multiple execution units. In this architecture, a single processor, or instruction stream, controls an array of similar execution units.
This is sometimes termed single-instruction-stream, multiple-data-stream (SIMD) architecture.

· Arrays of specialized processors. This approach is effective for executing a specialized task, such as a graphics- or video-processing algorithm. Each individual processor and its interconnections can be tailored and simplified for the target application.

· Field-programmable gate arrays (FPGAs), used in some parallel computing systems. FPGAs with execution units embedded in their fabric can yield high performance because they exploit locality and program their on-chip interconnects to match the data flow of the application.

gramming. TBB is also relatively recent but has demonstrated support in both industry and academe.

In the 1980s, when shared-memory hardware was especially popular, the functional-language community introduced the notion of a "future"14 that wraps a function invocation and results in an implicit synchronization on thread completion: any attempt to access the return value of the function waits until the thread has completed. The closely related idea of a "promise"15 also wraps a function invocation but uses a special return type that must be explicitly unwrapped, making the wait for thread completion explicit.

Two issues with dynamic thread creation are the overhead of thread creation and the policy for load-balancing threads among processors. A program written to create a thread everywhere that one could be used will typically overwhelm the scheduler. Several research efforts address such problems, including an extension of C called Cilk,16 which is now supported by the Cilk Arts company. Cilk uses the syntactic block structure of the language to restrict thread creation and completion to simple nested patterns. It also uses a lazy thread-creation model, which allows many threads to execute with low overhead as simple function calls if no processor is available to execute the thread; instead, the runtime system on an idle processor steals work randomly from other processors. Allowing lazy thread creation affects the semantics of the threading model: the PThread semantics require that each thread eventually execute even if there are enough other threads to keep the processors busy. In contrast, Cilk makes no such guarantee, so in a Cilk program, if one thread waits for a variable to be set by another thread, it may wait forever.
Waiting for a variable to be set or a data structure to be updated without explicit synchronization is generally considered dubious programming practice in parallel code, although it is a popular technique for avoiding the overhead associated with system-provided synchronization primitives. In scientific computing, the most popular programming interface for shared-memory programming is OpenMP, a standard that emphasizes loop-level parallelism but also has support for more general task parallelism. OpenMP addresses the thread-overhead issue by dividing a set of iterations into groups so that each thread handles a set of iterations, and the programmer is able to control the load-balancing policy. It also gives more flexibility in restricting data that are private to a thread, as opposed to allowing them to be shared by all threads; and controlled forms of synchronization and parallelism avoid some kinds of programming errors. OpenMP is sometimes used with message-passing to create a hybrid programming model: large-scale computing clusters in which individual nodes have hardware support for shared memory.

The biggest drawbacks of thread programming are the potential for uncontrolled access to shared variables and the lack of locality control. Shared-variable access results in race conditions, in which two threads access a variable and at least one writes it; the result is indeterminate behavior that depends on the access order and may vary from one run to the next. Those accesses make testing and debugging especially difficult. Synchronization primitives—such as locks, which are used to avoid races—have their own forms of subtle errors: threads acquiring multiple locks can form deadlocks, in which each of two threads waits for a lock held by the other. Some tools have been developed by the research community to detect those kinds of parallel-programming errors, but they have not reached the level of generality, accuracy, and speed that would encourage widespread deployment. Thread programming remains an error-prone process best handled by expert programmers, not the broader programming community of persons who have little formal training in programming and who would find it extremely challenging to create and maintain reliable code with these models. The broader community fueled the growth in computing applications and the associated economic and social effects.

14 Robert Halstead, 1985, MULTILISP: A language for concurrent symbolic computation, ACM Transactions on Programming Languages and Systems 7(4): 501-538.
15 Barbara Liskov, 1988, Distributed programming in Argus, Communications of the ACM 31(3): 300-312.
16 For more on Cilk, see its project page at The Cilk Project, MIT website, at http://supertech.csail.mit.edu/cilk/index.html.
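The race just described, and the lock discipline that prevents it, can be demonstrated with Python's threading module. This is an illustrative sketch; C threads expose the same hazard with far less forgiveness.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        # An unsynchronized "counter += 1" is a read-modify-write race:
        # two threads can read the same old value and lose an update.
        # Holding the lock makes the update atomic.
        with lock:
            counter += 1

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000, deterministic only because of the lock
```

Removing the `with lock:` line turns the update back into an unsynchronized read-modify-write, and the final count can silently fall short of 40,000; exactly the indeterminate, run-to-run behavior that makes such bugs hard to test for.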
The lack of locality support in threaded models limits the scalability of the underlying architecture and calls for some form of cache coherence, which traditionally has been a hardware challenge that grows exponentially harder as the number of processors grows.17 On the scale of chip multiprocessor systems available today, the problem is tractable, but even with on-chip data-transfer rates, it is unclear how performance will be affected as core counts grow. Further complicating the programming problem for shared memory, many shared-memory machines with coherent caches use a relaxed consistency model; that is, some memory operations performed by one thread may appear to be performed in a different order by another thread. There is some research on mapping OpenMP and Cilk to distributed-memory systems or on building shared-memory support with cache coherence in software, but the locality-aware model has often proved superior in performance even on systems with hardware-supported shared memory. Savvy thread programmers will use system mechanisms to control data layout and thread affinity to processors, but in the end, this model is best reserved for a relatively small set of compiler writers, runtime-system developers, and low-level library programmers.

17 More recent parallel languages, such as Chapel and X10, have explicitly included support for locality.

Message-Passing Interface

The combination of the scalability limits of shared-memory architectures and the cost and performance benefits of building parallel machines from commodity processors made distributed-memory multiprocessors a popular architecture for high-end parallel computing. Those systems can vary from generic clusters built from personal computers and an Ethernet network to more specialized supercomputers with low-latency, high-bandwidth networks that are more closely integrated into the processor nodes. As this architectural model became dominant in the early 1990s, several message-passing systems were developed by scientific-programming communities, computer scientists, and industry. In the early 1990s, a group with representatives from those communities began a process to specify the message-passing interface (MPI). MPI has since emerged as the de facto standard programming model for high-performance computing and has nearly ubiquitous support among machines, including open-source implementations, such as MPICH and Open MPI, that can be ported to new interconnection networks with modest effort.18 Although the standardization decisions in defining MPI were far from obvious, the relative ease with which application developers moved code from the previous models to MPI reflects the commonality of base concepts that already existed in the community. MPI has also proved to be a highly scalable programming model and is used today in applications that regularly run on tens of thousands of processor cores, and sometimes over 100,000, in the largest supercomputers in the world.
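The message-passing style that MPI embodies, in which each process owns its data and communicates only by explicit sends and receives, can be sketched with standard-library queues. Threads stand in for MPI ranks and a queue per rank stands in for the network; the structure and names here are ours, not MPI's API.

```python
import queue
import threading

NRANKS = 4
inbox = [queue.Queue() for _ in range(NRANKS)]  # one "endpoint" per rank
results = []

def rank_main(rank, data):
    # Each rank owns a slice of the data; no shared state is read.
    partial = sum(data[rank::NRANKS])
    if rank == 0:
        # Rank 0 performs the reduction: one blocking receive
        # of a partial sum from every other rank.
        total = partial
        for _ in range(NRANKS - 1):
            total += inbox[0].get()
        results.append(total)
    else:
        inbox[0].put(partial)  # explicit send to rank 0

data = list(range(100))
threads = [threading.Thread(target=rank_main, args=(r, data))
           for r in range(NRANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results[0])  # 4950, the reduced global sum
```

In real MPI this pattern would be written with MPI_Send/MPI_Recv or, more idiomatically, a single MPI_Reduce collective; the point is that all communication is explicit.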
MPI is generally used to program a single cluster or supercomputer that resides at one site, but "grid-computing" variations of MPI and related libraries support programming among machines at multiple sites. It is also low-level programming and has analogues to the challenges presented by the machine-language programming mentioned earlier.

MPI has enabled tremendous scientific breakthroughs in nearly every scientific domain, with some of the largest computations in climate change, chemistry, astronomy, and various aspects of defense. Computer simulations have demonstrated human effects on climate change and are critical for international environmental-policy negotiations, as recognized in the 2007 Nobel Peace Prize awarded to the Intergovernmental Panel on Climate Change. The climate-modeling problem is far from solved as researchers attempt to identify particular phenomena, such as the disappearance of polar ice caps; effects on the frequency or severity of hurricanes, floods, or droughts; and the effects of various mitigation proposals. The size and accuracy of the computations continue to push the limits of available computing systems, consuming tens of millions of processor-hours each year. The Community Climate System Model and nearly all other major codes used for global-scale, long-term climate simulations are written in MPI. A related problem that also relies heavily on MPI is weather forecasting, which is more detailed but of shorter term and on smaller, regional scales than climate modeling. MPI programs have been used to understand the origins of the universe on the basis of a study of the cosmic microwave background (CMB) and to study quantum chromodynamics in particle physics; both uses are based on Nobel Prize-winning work in physics. MPI is used in the design of new materials, in crash simulations in the automobile industry, in simulating earthquakes to improve building designs and standards, and in fluid-dynamics simulations for improving aircraft design, engine design, and biomedical simulations.

There are limitations, however. MPI is designed for comparatively coarse-grained parallelism—among clusters of computers—not for parallelism between cores on a chip. For example, most supercomputers installed in the last few years have dual-core or quad-core processing chips, and most applications use one MPI process per core. The model is shifting as application scientists strive to make more effective use of shared-memory chip multiprocessors by exploiting a mixture of MPI with OpenMP or PThreads.

18 For more on the MPI standard, see the final version of the draft released in May 1994, available online at http://www.mcs.anl.gov/Projects/mpi/standard.html. See also Open MPI: Open source high performance computing, at http://www.open-mpi.org, and Peter Pacheco, 1997, Parallel Programming with MPI, fourth edition, San Francisco, Cal.: Morgan Kaufmann.
MPI-3 is a recent effort to implement MPI on multicore processors, although memory-per-core constraints are still a barrier to MPI-style weak scaling. Indeed, the motivation to mix programming models is often driven more by memory-footprint concerns than by performance itself, because the shared-memory model exploits fine-grained parallelism better in that it requires less memory per thread. Moreover, there is a broad class of applications of particular interest to the defense and intelligence communities for which MPI is not appropriate, owing to its mismatch with the computational patterns of particular problems.

MapReduce: Exploiting Data Parallelism and Distributed Computation

MapReduce is a data-processing infrastructure19 developed internally by Google and later popularized in the Hadoop open-source version.20 MapReduce is targeted at large-scale data-parallel workloads in which the input is represented as a set of key-value pairs and computation is expressed as a sequence of two user-provided functions: map and reduce. The map function processes the input to create intermediate key-value pairs. Intermediate pairs with the same key are aggregated by the system and fed to the reduce function, which produces the final output.

What makes MapReduce particularly compelling is that it frees the programmer from the need to worry about much of the complexity required to run large-scale computations. The programmer needs only to produce the bodies of the two MapReduce methods, and the system takes care of parallelization, data distribution, fault tolerance, and load-balancing.

The class of problems that can be mapped to this relatively simple processing framework is surprisingly broad. MapReduce was conceived to simplify the data processing involved in creating a Web index from a set of crawled documents, but developers have also used it for large-scale graph algorithms (for example, finding citation occurrences among scientific articles), for computing query-term popularity (for example, Google Trends21), for creating language models for statistical machine translation (for example, finding and keeping a count for every unique sequence of five or more terms in a large corpus), and for other applications. Within Google itself, MapReduce's popularity has exploded since its introduction.

MapReduce has also been found to be useful for systems much smaller than the large distributed clusters for which it was designed. Ranger et al. examined the adequacy of the MapReduce framework for multicore and multiprocessor systems22 and found that it was equally compelling as a programming system for this class of machines. In a set of eight applications coded both with a version of MapReduce (the Phoenix runtime) and by hand, the authors found that the MapReduce-coded versions roughly matched the performance of the hand-coded versions for five of them. The remaining three applications did not fit well with MapReduce's key-value data-stream model, and the hand-coded versions performed significantly better.

Despite MapReduce's success as a programming system beyond its initial application domain and machine-architecture context, it is far from a complete solution for extracting and managing parallelism in general. It is limited, for example, to batch-processing systems and is therefore not suitable for many online serving systems. Nor does MapReduce extract implicit parallelism from an otherwise sequential algorithm; instead, it facilitates the partitioning, distribution, and runtime management of an application that is already essentially data-parallel. Such systems as MapReduce and NVIDIA's CUDA,23 however, point to a solution strategy for the general programming challenge of large-scale parallel systems. The solutions are not aimed at a single programming paradigm for all possible cases but are based on a small set of programming systems that can be specialized for particular types of applications.

19 See Jeffrey Dean and Sanjay Ghemawat, 2008, MapReduce: Simplified data processing on large clusters, Communications of the ACM 51(1): 107-113, and Michael Noth, 2006, Building a computing system for the world's information, invited talk, University of Iowa, IT Tech Forum Presentations, April 20, 2006.
20 See the Apache Hadoop project, available at http://hadoop.apache.org.
21 See Google Trends, available at http://trends.google.com.
22 Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis, 2007, Evaluating MapReduce for multi-core and multiprocessor systems, Proceedings of the IEEE 13th International Symposium on High-Performance Computer Architecture, Phoenix, Ariz., February 10-14, 2007.

Distributed Computation—Harnessing the World's Spare Computing Capacity

The increasing quality of Internet connectivity around the globe has led researchers to contemplate the possibility of harnessing the unused computing capability of the world's personal computers and servers to perform extremely compute-intensive parallel tasks. The most notable examples have come in the form of volunteer-based initiatives, such as SETI@home (http://setiathome.berkeley.edu/) and Folding@home (http://folding.stanford.edu/).
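What such initiatives do can be sketched as a coordinator that splits one large job into self-contained work units, and volunteers that pull units, compute locally, and send back small results. Threads stand in for volunteer machines here, and the structure is a hypothetical illustration, not any project's actual protocol.

```python
import queue
import threading

tasks = queue.Queue()    # work units published by the coordinator
results = queue.Queue()  # small results returned by volunteers

def volunteer():
    # A volunteer repeatedly pulls a self-contained work unit, does
    # substantial local computation, and returns only a small result;
    # volunteers never communicate with one another.
    while True:
        try:
            lo, hi = tasks.get_nowait()
        except queue.Empty:
            return
        results.put(sum(i * i for i in range(lo, hi)))

# Coordinator: split one big job into 10 independent subtasks.
for lo in range(0, 1000, 100):
    tasks.put((lo, lo + 100))

volunteers = [threading.Thread(target=volunteer) for _ in range(5)]
for v in volunteers:
    v.start()
for v in volunteers:
    v.join()

# Coalesce: for a sum, the order in which results arrive is irrelevant.
total = sum(results.get() for _ in range(10))
print(total)  # the same value a sequential sum of squares over range(1000) gives
```

The pattern works precisely because each unit carries everything it needs and the coalescing step is order-insensitive; real deployments add redundancy and result verification because volunteers are untrusted.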
The model consists of breaking a very large-scale computation into subtasks that operate on relatively small input and result data, require substantial processing over that input set, and do not require much (or any) communication between subtasks other than the passing of inputs and results. Those initiatives attract volunteers (individuals or organizations) who sympathize with their scientific goals to donate computing cycles on their equipment, taking on and executing a number of subtasks and returning the results to a coordinating server that coalesces and combines the final results.

The relatively few success cases of the model have relied not only on the friendly nature of the computation to be performed (vast data parallelism with very low communication requirements for each unit of computing) but also on the volunteers' trust that the code is safe to execute on their machines. For more widespread adoption, this particular programming model would require continuing improvements in secure-execution technologies, incorporation of an economic model that gives users an incentive to donate their spare computing capacity, and improvements in Internet connectivity. To some extent, one can consider large, illegitimately assembled networks of hijacked computers (botnets) to be an exploitation of this computing model; this exemplifies the potential value of harnessing a large number of well-connected computing systems toward nobler aims.

23 CUDA is a parallel computing approach aimed at taking advantage of NVIDIA graphical processing units. For more, see CUDA zone, NVIDIA.com, available at http://www.nvidia.com/object/cuda_home.html.

Summary Observations

These success stories show that there are already a wide variety of computational models for parallel computation and that science and industry are successfully harnessing parallelism in some domains. The parallelism success stories bode well for the future if we can find ways to map more applications to the existing models or, for computations that do not map well to them, if we can develop new models.

PARALLEL-PROGRAMMING SYSTEMS AND THE PARALLEL SOFTWARE “STACK”

The general problem of designing parallel algorithms and programming them to exploit parallelism is an extremely important, timely, and unsolved problem. The vast majority of software in use today is sequential. Although the previous section described examples of parallel approaches that work in particular domains, general solutions are still lacking. Many successful parallel approaches are tied to a specific type of parallel hardware (MPI and distributed clusters; the storage-cluster architecture, which heavily influenced MapReduce; OpenGL and SIMD graphics processors; and so on).
The looming crisis that is the subject of this report comes down to the question of how to continue to improve performance scalability as architectures continue to change and as more and more processors are added. There has been some limited success, but there is not yet an analogue of the sequential-programming models that have been so successful in software for decades. We know some things about what new parallel-programming approaches will need. A high-level, performance-portable programming model is the only way to restart the virtuous cycle described in Chapter 2. The new model will need to be portable over successive generations of chips, multiple architectures, and different kinds of parallel hardware, and it will need to scale well. For all of those goals to be achieved, the entire software stack will need to be rethought, and architectural assumptions will need to be included in the stack. Indeed, in the future, the term software stack will be a misnomer; a "software-hardware stack" will be the norm. The hardware, the programming model, and the applications will all need to change.

A key part of modern programming systems is the software stack that executes the program on the hardware. The stack must also allow reasoning about the five main challenges to scalable and efficient performance: parallelism, communication, locality, synchronization, and load-balancing. The components of a modern software stack include

· Libraries: Generic and domain-specific libraries provide application programmers with predefined software components that can be included in applications. Because library software may be reused in many applications, it is often highly optimized by hand or with automated tools.
· Compiler: An ahead-of-time or a just-in-time compiler translates the program into assembly code and optimizes it for the underlying hardware. Just-in-time compilers combine profile data from the current execution with static program analysis to perform optimizations.
· Runtime system or virtual machine: These systems manage fine-grain memory resources, application-thread creation and scheduling, runtime profiling, and runtime compilation.
· Operating system: The operating system manages processes and their resources, including coarse-grain memory management.
· Hypervisors: Hypervisors abstract the hardware context to provide performance portability for operating systems among hardware platforms.

Because programming systems are mostly sequential, the software stack mostly optimizes and manages sequential programs.
Optimizing and understanding the five challenges at all levels of the stack for parallel approaches will require substantial changes in these systems and their interfaces, and perhaps researchers should reconsider whether the overall structure of the stack is a good one for parallel systems.

Researchers have made some progress in providing and supporting parallel-programming models.24 Over the years, researchers and industry have developed parallel-programming system tools, which include languages, compilers, runtime environments, libraries, components, and frameworks, to assist programmers and software developers in managing parallelism. We list some examples below.

· Runtime abstractions: multiprogramming and virtualization. The operating system can exploit chip multiprocessors at a coarse granularity immediately because operating systems can run multiple user and kernel processes in parallel. Virtualization runs multiple operating systems in parallel. However, it remains challenging to manage competition for shared resources, such as caches, when a particular application load varies dramatically.
· Components: database transactions and Web applications. Database-transaction systems provide an extremely effective abstraction in which programs use a sequential model, without the need to worry about synchronization and communication, and the database coordinates all the parallelism between user programs. Success in this regard emerged from over 20 years of research in parallelizing database systems.
· Frameworks: three-tiered Web applications and MapReduce. Such frameworks as J2EE and WebSphere make it easy to create large-scale parallel Web applications. For example, MapReduce (described above) simplifies the development of a large class of distributed applications that combine the results of the computation of distributed nodes. Web applications follow the database-transaction model, in which users write sequential tasks and the framework manages the parallelism.
· Libraries: graphics libraries. Graphics libraries for DirectX 10 and OpenGL hide the details of parallelism in graphics hardware from the user.
· Languages: CUDA, Fortress, Cilk, X10, and Chapel.
These languages seek to provide an array of high-level and low-level abstractions that help programmers to develop classes of efficient parallel software faster.

What those tools suggest is that managing parallelism is another, more challenging facet of software engineering; it can be thought of as akin to a complex version of the problem of resource management. Parallel-program productivity can be improved if we can develop languages that provide useful software-engineering abstractions for parallelism, parallel components and libraries that programmers can reuse, and a software-hardware stack that facilitates reasoning about all of them.

24 One recent example was a parallel debugger, STAT, from the Lawrence Livermore National Laboratory, available at http://www.paradyn.org/STAT/STAT.html, presented at Supercomputing 2008. See Gregory L. Lee, Dong H. Ahn, Dorian C. Arnold, Bronis R. de Supinski, Matthew Legendre, Barton P. Miller, Martin Schulz, and Ben Liblit, 2008, Lessons learned at 208K: Towards debugging millions of cores, available online at ftp://ftp.cs.wisc.edu/paradyn/papers/Lee08ScalingSTAT.pdf, last accessed on November 8, 2010.

MEETING THE CHALLENGES OF PARALLELISM

The need for robust, general, and scalable parallel-software approaches presents challenges that affect the entire computing ecosystem. There are numerous possible paths toward a future that exploits abundant parallelism while managing locality. Parallel programming and parallel computers have been around since the 1960s, and much progress has been made. Much of the challenge of parallel programming deals with making parallel programs efficient and portable without requiring heroic efforts on the part of the programmer. No subfield or niche will be able to solve the problem of sustaining growth in computing performance on its own.

The uncertainty about the best way forward is inhibiting investment. In other words, there is currently no parallel-programming approach that can help drive hardware development. Historically, a vendor might have taken on risk and invested heavily in developing an ecosystem, but given all the uncertainty, there is not enough of this investment, which entails risk as well as innovation. Research investment along multiple fronts, as described in this report, is essential.25

Software lasts a long time. The huge entrenched base of legacy software is part of the reason that people resist change and resist investment in new models, which may or may not take advantage of the capital investment represented by legacy software. Rewriting software is expensive.
The economics of software creates pressure against innovative models and approaches. It also explains why the approaches we have seen have had relatively narrow applications or been incremental. Industry, for example, has turned to chip multiprocessors (CMPs) that replicate existing cores a few times (and many times in the future). Care is taken to maintain backward compatibility to bring forward the existing multi-billion-dollar installed software base. With prospects dim for repeated doublings of single-core performance, CMPs inherit the mantle as the most obvious alternative, and industry is motivated to devote substantial resources to moving compatible CMPs forward. The downside is that the core designs being replicated are optimized for serial code with support for dynamic parallelism discovery, such as speculation and out-of-order execution, which may waste area and energy for programs that are already parallel. At the same time, they may be missing some of the features needed for highly efficient parallel programming, such as lightweight synchronization, global communication, and locality control in software. A great deal of research remains to be done on on-chip networking, cache coherence, and distributed cache and memory management.

One important role for academe is to explore CMP designs that are more aggressive than industry's designs. Academics should project both hardware and software trends much further into the future to seek possible inflection points even if they are not sure when, or even whether, transitioning a technology from academe to industry will occur. Moreover, researchers have the opportunity to break the shackles of strict backward compatibility. Promising ideas should be nurtured to see whether they can either create enough benefit to be adopted without portability or enable portability strategies to be developed later. There needs to be an intellectual ecosystem that enables ideas to be proposed, cross-fertilized, and refined and, ultimately, the best approaches to be adopted. Such an ecosystem requires sufficient resources to enable contributions from many competing and cooperating research teams. Meeting the challenges will involve essentially all aspects of computing.

25 A recent overview in Communications of the ACM articulates the view that developing software for parallel cores needs to become as straightforward as writing software for traditional processors: Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick, 2009, A view of the parallel computing landscape, Communications of the ACM 52(10): 56-67, available online at http://cacm.acm.org/magazines/2009/10/42368-a-view-of-the-parallel-computing-landscape/fulltext.
Focusing on a single component—assuming a CMP architecture or a particular number of transistors, focusing on data parallelism or on heterogeneity, and so on—will be insufficient to the task. Chapter 5 discusses recommendations for research aimed at meeting the challenges.