Most supercomputers are built from commodity processors that are designed for a broad market and manufactured in large numbers. A small number of supercomputers use custom processors that are designed to achieve high performance in scientific computing and manufactured in small numbers. Commodity processors, because they benefit from economies of scale and sophisticated engineering, provide the shortest time to solution (capability) and the highest sustained performance per unit cost (capacity) for a broad range of applications that have significant spatial and temporal locality and therefore take good advantage of their caches. A small set of important scientific applications, however, has almost no locality. These applications achieve shorter time to solution and better sustained performance per unit cost on a custom processor that provides higher effective local memory bandwidth on access patterns with no locality. For a larger set of applications with low locality, custom processors deliver better time to solution, but at a higher cost per unit of sustained performance.
Commodity processors are often criticized because of their low efficiency (the fraction of peak performance they sustain). However, peak performance, and hence efficiency, is the wrong measure. The system metrics that matter are sustained performance (on applications of interest), time to solution, and cost.
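The point can be made concrete with a back-of-the-envelope comparison. All numbers below are invented for illustration and describe no real system:

```python
# Hypothetical systems running the same application of interest.
# name: (peak TFLOPS, sustained fraction of peak, cost in $M)
systems = {
    "commodity": (40.0, 0.05, 10.0),   # high peak, low efficiency
    "custom":    (10.0, 0.30, 20.0),   # low peak, high efficiency
}

for name, (peak, eff, cost) in systems.items():
    sustained = peak * eff           # TFLOPS actually delivered
    per_dollar = sustained / cost    # sustained TFLOPS per $M
    print(f"{name}: sustained={sustained:.2f} TFLOPS, "
          f"per-cost={per_dollar:.3f} TFLOPS/$M")
```

With these made-up numbers the custom machine has six times the efficiency and also sustains more performance (shorter time to solution), yet the commodity machine still delivers more sustained performance per dollar. Efficiency alone predicts neither outcome, which is why sustained performance, time to solution, and cost are the metrics that matter.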
The rate at which operands can be transferred to/from the processor is the primary performance bottleneck for many scientific computing codes.1,2 Custom processors differ primarily in the effective memory bandwidth that they provide on different types of access patterns. Whether a machine has a vector processor, a scalar processor, or a multithreaded processor is a secondary issue. The main issue is whether it has efficient support for irregular accesses (gather/scatter), high memory bandwidth, and the ability to hide memory latency so as to sustain this bandwidth. Vector processors, for example, typically have a short (if any) cache line and high memory bandwidth. The vectors themselves provide a latency hiding mechanism. Such features enable custom processors to more efficiently deliver the raw memory bandwidth provided by memory chips, which often dominate system cost. Hence, these processors can be more cost effective on applications that are limited by memory bandwidth.
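One rough way to quantify this effect is bandwidth utilization: if every memory access transfers a full cache line but a no-locality gather consumes only one word of it, most of the raw bandwidth moves wasted bytes. The sketch below uses invented numbers (50 GB/s of raw bandwidth, 8-byte words) purely to illustrate the arithmetic:

```python
def effective_bandwidth(raw_gb_s, access_bytes, useful_bytes):
    """Useful bandwidth delivered when each memory access transfers
    access_bytes but the program consumes only useful_bytes of them
    (for example, one 8-byte word gathered from a random address)."""
    utilization = min(1.0, useful_bytes / access_bytes)
    return raw_gb_s * utilization

# Cache-line interface (commodity-style): 128-byte lines, 8-byte words.
print(effective_bandwidth(50.0, 128, 8))   # 1/16 of raw bandwidth is useful
# Single-word interface (vector/gather-style): 8-byte accesses.
print(effective_bandwidth(50.0, 8, 8))     # all of the raw bandwidth is useful
```

Under these assumptions the single-word interface delivers sixteen times the effective bandwidth from the same memory chips, which is why a processor built around short or absent cache lines can be more cost effective when memory dominates system cost and the access pattern has no locality.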
Commodity processors are manufactured in high volume and hence benefit from economies of scale. The high volume also justifies sophisticated engineering—for example, the clock rate of the latest Intel Xeon processor is at least four times that of the Cray X1. A commodity processor includes much of its memory system, but little of its memory capacity, on the processor chip, and this memory system is adapted to applications with high spatial and temporal locality. A typical commodity processor chip includes the level 1 and 2 caches on the chip and an external memory interface. This external interface limits sustained local memory bandwidth and requires local memory accesses to be performed in units of cache lines (typically 64 to 128 bytes in length3). Accessing