Warehouse-Scale Computing: The Machinery That Runs the Cloud
LUIZ ANDRÉ BARROSO
As high-bandwidth Internet connectivity becomes more ubiquitous, an increasing number of applications are being offered as Internet services that run on remote data-center facilities instead of on a user’s personal computer. The two classes of machines enabling this trend can be found at the very small and very large ends of the device spectrum. On the small end, mobile devices focus on user interaction and Internet connectivity but have limited processing capabilities. On the large end, massive computing and storage systems (referred to here as warehouse-scale computers [WSCs]) implement many of today’s Internet (or cloud) services (Barroso and Hölzle, 2009).
Cost efficiency is critical for Internet services because only a small fraction of them generate revenue directly from users; most revenue comes from online advertising. WSCs are particularly efficient for popular computing- and data-intensive online services, such as Internet search or language translation. Because a single search request may query the entire Web, including images, videos, news sources, maps, and product information, such services require computing capacity well beyond the capabilities of a personal computing device. They are therefore only economically feasible when amortized over a very large user population.
In this article I provide a brief description of the hardware and software in WSCs and highlight some of their key technical challenges.
WSC hardware consists of three primary subsystems: computing equipment per se, power-distribution systems, and cooling infrastructure. A brief description of each subsystem follows below.
WSCs are built of low-end or mid-range server-class computers connected in racks of 40 to 80 units by a first-level networking switch; each switch connects in turn to a cluster-level network fabric that ties together all of the racks. The clusters, which tend to be composed of several thousand servers, constitute the primary units of computing for Internet services. WSCs can be composed of one or many clusters. Storage is provided either as disk drives connected to each server or as dedicated file-serving appliances.
The use of near PC-class components is a departure from the supercomputing model of the 1970s, which relied on extremely high-end technology, and reflects the cost sensitivity of the WSC application space. Lower-end server components that can leverage technology and development costs in high-volume consumer markets are therefore highly cost efficient.
Peak electricity needs of computing systems in a WSC can be more than 10 MW—roughly equivalent to the average power usage of 8,000 U.S. households. At those levels, WSC computing systems must tap into high-voltage, long-distance power lines (typically 10 to 20 kilovolts); the voltage level must then be converted down to 400 to 600 volts, the levels appropriate for distribution within the facility.
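The household comparison above follows from simple arithmetic. The sketch below assumes an average U.S. household draws about 1.25 kW, a commonly cited round figure consistent with the text's equivalence:

```python
# Rough arithmetic behind the household comparison. The household
# figure is an assumption (about 1.25 kW average draw), chosen to
# match the equivalence stated in the text.
wsc_peak_watts = 10e6          # 10 MW peak for the WSC computing systems
household_avg_watts = 1.25e3   # assumed average U.S. household power draw

households = wsc_peak_watts / household_avg_watts
print(int(households))  # → 8000
```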
Before power is distributed to computing equipment, however, it is fed to an uninterruptible power supply (UPS) system that acts as an energy supply buffer against utility power failures. UPS systems are designed to support less than a minute of demand, since diesel generators can jump into action within 15 seconds of a utility outage.
Virtually all energy provided to the computing equipment becomes heat that must be removed from the facility so the equipment can remain within its designed operating temperature range. This is accomplished by air-conditioning units inside the building that supply cold air (18 to 22°C) to the machinery, coupled by a liquid coolant loop to a cooling plant situated outside the building. The cooling plant uses chillers or cooling towers to expel heat to the environment.
The capital costs of the three main subsystems of a WSC vary depending on the facility design. The cost of the non-computing components is proportional to peak power-delivery capacity, and the cooling infrastructure is generally more expensive than the power-distribution subsystem. If high-end, energy-efficient computing components are used, computing-system costs tend to dominate; if lower-end, less energy-efficient components are used, cooling and power-distribution costs usually predominate. Energy therefore affects WSC costs in two ways: (1) directly, through the cost of the electricity consumed; and (2) indirectly, through the cost of the cooling and power-distribution plants.
Designing a WSC represents a formidable challenge. Some of the most difficult issues are deciding between scale-up (e.g., bigger servers) and scale-out (e.g., more servers) approaches and determining the right aggregate capacity and performance balance among the subsystems. For example, we may have too much CPU firepower and too little networking bandwidth.
These decisions are ultimately based on workload characteristics. For example, search workloads tend to compute heavily within server nodes and exchange comparatively little networking traffic. Video serving workloads do relatively little processing but are network intensive. An Internet services provider that offers both classes of workloads might have to design different WSCs for each class or find a common sweet spot that accommodates the needs of both. Common designs, when possible, are preferable, because they allow the provider to dynamically re-assign WSC resources to workloads as business priorities change, which tends to happen frequently in the still-young Internet services area.
Given the impact of energy on the overall costs of WSCs, it is critical that we understand where energy is used. The data-center industry has developed a metric, called power usage effectiveness (PUE), that objectively characterizes the efficiency of the non-computing elements in a facility. PUE is derived by measuring the total energy that enters a facility and dividing it by the amount consumed by the computing equipment. Typical data centers are rather inefficient, with PUEs hovering around 2 (for every watt consumed by computing equipment, another watt is lost to power distribution and cooling). State-of-the-art facilities have reported PUEs as low as 1.13 (Google, 2010); at such levels, the energy-efficiency focus shifts back to the computing equipment itself.
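The PUE calculation itself is a simple ratio, which a minimal sketch makes concrete:

```python
def pue(total_facility_watts, it_equipment_watts):
    """Power usage effectiveness: total facility power divided by
    the power consumed by the computing (IT) equipment."""
    return total_facility_watts / it_equipment_watts

# A typical facility: 2 MW enters, 1 MW reaches computing equipment.
print(pue(2.0e6, 1.0e6))   # → 2.0
# A state-of-the-art facility at the efficiency level cited in the text.
print(pue(1.13e6, 1.0e6))
```

A PUE of 1.0 would mean every watt entering the facility reaches the computing equipment; real facilities always exceed it.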
Mobile and embedded devices have been the main targets of low-power technology development for decades, and many of the energy-saving features that make their way into servers had their beginnings in those devices. However, mobile systems have focused on techniques that save power when components are idle, a feature that is less useful for WSCs, which are rarely completely idle. Energy-efficient WSCs therefore require energy proportionality: system behavior that yields energy-efficient operation across the whole range of activity levels (Barroso and Hölzle, 2007).
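A simple linear power model illustrates why proportionality matters. The idle-power fraction below is an assumed, illustrative figure (conventional servers have often drawn roughly half of peak power while doing nothing):

```python
def power_draw(utilization, idle_fraction):
    """Linear power model: a server draws idle_fraction of peak power
    at 0% utilization, rising linearly to full power at 100%."""
    return idle_fraction + (1.0 - idle_fraction) * utilization

# Assumed figures: a conventional server drawing ~50% of peak while
# idle, versus an ideal energy-proportional server drawing ~0%.
for u in (0.1, 0.3, 0.5):
    conventional = power_draw(u, idle_fraction=0.5)
    proportional = power_draw(u, idle_fraction=0.0)
    print(f"{u:.0%} load: conventional draws {conventional:.0%} of peak, "
          f"proportional draws {proportional:.0%}")
```

At the low-to-moderate utilizations typical of WSCs, the non-proportional server wastes most of its power budget.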
The software that runs on WSCs can be broadly divided into two layers: infrastructure software and workloads. Both are described below.
The software infrastructure in WSCs includes some basic components that enable their coordinated scheduling and use. For example, each Google WSC cluster has a management software stack that includes a scheduling master and a storage master, and corresponding slaves in each machine. The scheduling master takes submitted jobs and creates job-task instances in various machines. Enforcing resource allocations and performance isolation among tasks is accomplished by per-machine scheduling slaves in coordination with the underlying operating system (typically a Linux-based system). The role of storage servers is to export local disks to cluster-wide file-system users.
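The master/slave scheduling idea can be sketched in a few lines. This is a deliberately simplified illustration of the concept, not Google's actual stack; the class and method names are hypothetical:

```python
# A toy sketch of cluster scheduling: a master places submitted job
# tasks onto machines with free capacity. (Structure and names are
# illustrative only; the real system also handles resource isolation,
# failures, and priorities.)
class SchedulingMaster:
    def __init__(self, machine_slots):
        self.free = dict(machine_slots)  # machine name -> free task slots

    def submit(self, job, num_tasks):
        """Greedily place num_tasks instances of job on free slots."""
        placements = []
        for machine in self.free:
            while self.free[machine] > 0 and len(placements) < num_tasks:
                placements.append((job, machine))
                self.free[machine] -= 1
            if len(placements) == num_tasks:
                break
        return placements

master = SchedulingMaster({"m1": 2, "m2": 2})
placements = master.submit("index-build", 3)
print(placements)
```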
WSC workloads can include thousands of individual job tasks with diverse behavior and communication patterns, but they tend to fall into two broad categories: data processing and online services. Data processing workloads are large-batch computations necessary to analyze, reorganize, or convert data from one format to another. Examples of data-processing workloads might include stitching individual satellite images into seamless Google Earth tiles or building a Web index from a large collection of crawled documents. The structure of these workloads tends to be relatively uniform, and the keys for high performance are finding the right way to partition them among multiple tasks and then place those tasks closer to their corresponding data. Programming systems, such as MapReduce (Dean and Ghemawat, 2004), have simplified the building of complex data-processing workloads.
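The MapReduce model the text cites splits a computation into a map phase that emits key-value pairs and a reduce phase that aggregates them. A toy single-machine word count, the canonical example of the model, shows the two phases (the real system distributes map and reduce tasks across thousands of machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
result = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
print(result["the"])  # → 2
```

The appeal of the model is that partitioning and task placement, the keys to performance noted above, are handled by the framework rather than by each workload's author.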
Web search is the best example of a demanding online-services workload. For these workloads, keeping users happy means providing very quick response times. In some cases, the system may have to process tens of terabytes of index data to respond to a single query. Thus, although high processing throughput is a requirement for both data-processing and online-services workloads, the latter have much stricter latency constraints per individual request. The main challenge for online-services workloads is to provide predictable performance from thousands of cooperating nodes on sub-second timescales.
Similar to the hardware-design problem for WSCs, the complexity of software development for a WSC hardware platform can be an obstacle for both workload and infrastructure software developers. The complexity derives from a combination of scale and limits of electronic technology and physics. For example, a processor accessing its local memory can do so at rates of more than 10 gigabytes per second, but accessing memory attached to another processor in the facility may only be feasible at rates that are slower by orders of magnitude.
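A back-of-the-envelope comparison makes the locality gap vivid. The local-DRAM figure comes from the text; the cross-cluster rate is an assumed figure chosen only to illustrate an "orders of magnitude" difference:

```python
# Illustrative bandwidth figures for the locality gap described above.
local_dram_bps = 10e9   # ~10 GB/s local memory bandwidth (from the text)
remote_bps = 100e6      # assumed ~100 MB/s effective cross-cluster rate

data_bytes = 1e9  # time to move 1 GB of data
print(f"local:  {data_bytes / local_dram_bps:.2f} s")   # → 0.10 s
print(f"remote: {data_bytes / remote_bps:.2f} s")       # → 10.00 s
```

Software that ignores this gap, treating all memory in the facility as equally close, cannot perform well; placing computation near its data is a first-order design concern.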
WSC software designers must also be able to cope with failures. Two server crashes per year may not sound particularly damaging, but if the system runs on 5,000 servers it will see approximately one failure every hour. Programming efficient WSC workloads requires handling complex performance trade-offs and creating reliable systems in the face of high failure rates.
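The failure-rate claim follows directly from the numbers in the text:

```python
# Aggregate failure rate: the per-server numbers and fleet size are
# taken from the text; only the unit conversion is added here.
crashes_per_server_per_year = 2
servers = 5000
hours_per_year = 365 * 24  # 8,760 hours

failures_per_hour = crashes_per_server_per_year * servers / hours_per_year
print(round(failures_per_hour, 2))  # → 1.14, roughly one failure per hour
```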
The rapid increase in the popularity of Internet services as a model for provisioning computing and storage solutions has given rise to a new class of massive-scale computers outside of the traditional application domain of super-computing-class systems. Some of the world’s largest computing systems are the WSCs behind many of today’s Internet services. Building and programming this emerging class of machines are the subjects of some of the most compelling research being conducted today on computer systems.
Barroso, L.A., and U. Hölzle. 2007. The case for energy-proportional computing. IEEE Computer, December 2007.
Barroso, L.A., and U. Hölzle. 2009. The Datacenter as a Computer—An Introduction to the Design of Warehouse-Scale Machines. Synthesis Series on Computer Architecture. San Rafael, Calif.: Morgan & Claypool Publishers.
Dean, J., and S. Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, California, December, 2004.
Google. 2010. Google Data Center Efficiency Measurements. Available online at http://www.google.com/corporate/green/datacenters/measuring.html.