Glossary and Acronym List
Advanced Simulation and Computing program, the current name for the program formerly known as ASCI.
Accelerated Strategic Computing Initiative, which provides simulation and modeling capabilities and technologies as part of the DOE/NNSA Stockpile Stewardship Program.
The automatic creation of parallel code from sequential code by a compiler.
The amount of data that can be passed along a communications channel in a unit of time. Thus, memory bandwidth is the amount of data that can be passed between processor and memory in a unit of time, and global communication bandwidth is the amount of data that can be passed between two nodes through the interconnect in a unit of time. Both can be a performance bottleneck. Bandwidth is often measured in megabytes (million bytes) per second (Mbyte/sec), gigabytes (billion bytes) per second (Gbyte/sec), or megawords (million words) per second (Mword/sec). Since a word (in this context) consists of 8 bytes, 1 Gbyte/sec = 125 Mword/sec = 1,000 Mbyte/sec.
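The unit conversion at the end of this entry can be checked with a one-line sketch (the function name is ours, chosen for illustration):

```python
def gbyte_to_mword_per_sec(gbytes_per_sec, bytes_per_word=8):
    # 1 Gbyte/sec = 1e9 bytes/sec; divide by 8 bytes/word, then by 1e6
    # to express the result in megawords per second.
    return gbytes_per_sec * 1_000_000_000 / bytes_per_word / 1_000_000
```

With an input of 1 Gbyte/sec this yields 125 Mword/sec, matching the identity above.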
An experiment that enables the measurement of some meaningful property of a computer system; a program or a computational task or a set of such programs or tasks that is used to measure the performance of a computer.
Basic Linear Algebra Subprograms, a set of subprograms commonly used to solve dense linear algebra problems. Level 1 BLAS includes vector-vector operations, level 2 BLAS includes vector-matrix operations, and level 3 BLAS includes matrix-matrix operations. BLAS subroutines are frequently optimized for each specific hardware platform.
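The first two BLAS levels can be illustrated with naive Python versions (these sketches show only the operation each level performs; real BLAS implementations are heavily optimized per platform, and of the two names below only `daxpy` is an actual BLAS routine name):

```python
def daxpy(alpha, x, y):
    # Level 1 BLAS (vector-vector): y := alpha*x + y.
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def matvec(a, x):
    # Level 2 BLAS-style operation (matrix-vector): y := A*x,
    # written naively for illustration.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]
```

A level 3 (matrix-matrix) routine such as matrix multiplication follows the same pattern one level up, which is where optimized BLAS gains the most from cache blocking.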
Blue Gene/Light (IBM).
A small, fast storage area close to the central processing unit (CPU) of a computer that holds the most frequently used memory contents. Caches aim to provide the illusion of a memory as large as the main computer memory with fast performance. They succeed in doing so if memory accesses have good temporal locality and good spatial locality.
The unit of data that is moved between cache and memory. It typically consists of 64 or 128 consecutive bytes (8 or 16 consecutive double words).
cache memory system.
Modern computers typically have multiple levels of caches (named level 1, level 2, and so on) that are progressively larger and slower; together they comprise the cache memory system.
Computer-aided engineering. The construction and analysis of objects using virtual computer models. This may include activities of design, planning, construction, analysis, and production planning and preparation.
The use of the most powerful supercomputers to solve the largest and most demanding problems, in contrast to capacity computing. The main figure of merit in capability computing is time to solution. In capability computing, a system is often dedicated to running one problem.
The use of smaller and less expensive high-performance systems to run parallel problems with more modest computational requirements, in contrast to capability computing. The main figure of merit in capacity computing is the cost/performance ratio.
Community Climate System Model.
Control Data Corporation.
Time required for a signal to propagate through a circuit, measured in picoseconds per gate. It is a key aspect of processor performance.
The NSF Directorate for Computing and Information Science and Engineering. This directorate is responsible for NSF-funded supercomputing centers.
clock rate or clock speed.
The frequency of the clock that drives the operation of a CPU, measured in gigahertz (GHz). Clock rate and instructions per cycle (IPC) determine the rate at which a CPU executes instructions.
A group of computers connected by a high-speed network that work together as if they were one machine with multiple CPUs.
Complementary metal oxide semiconductor. CMOS is the semiconductor technology that is currently used for manufacturing processors and memories. While other technologies (silicon-germanium and gallium-arsenide) can support higher clock rates, their higher cost and lower integration levels have precluded their successful use in supercomputers.
A processor that is designed for a broad market and manufactured in large numbers, in contrast to a custom processor.
A supercomputer built from commodity parts.
The movement of data from one part of a system to another. Local communication is the movement of data between the processor and memory; global communication is the movement of data from one node to another.
composite theoretical performance.
CTP is a measure of the performance of a computer that is calculated using a formula that combines various system parameters. CTP is commonly measured in millions of theoretical operations per second (MTOPS). Systems with a CTP above a threshold (currently 190,000 MTOPS) are subject to stricter export controls. The threshold is periodically raised. While CTP is relatively easy to compute, it bears limited relationship to actual performance.
computational fluid dynamics (CFD).
The simulation of flows, such as the flow of air around a moving car or plane.
Originally used to denote a hardware and software infrastructure that enables applying the resources of many computers to a single problem. Now increasingly used to denote more broadly a hardware and software infrastructure that enables coordinated resource sharing within dynamic organizations consisting of individuals, institutions, and resources.
Parallelism that is achieved by the simultaneous execution of multiple threads.
The ratio between the cost of a system and the effective performance of the system. This ratio is sometimes estimated by the ratio between the purchase cost of a computer and the performance of the computer as measured by a benchmark. A more accurate but hard to estimate measure is the ratio between the total cost of ownership of a platform and the value contributed by the platform.
Central processing unit, the core unit of a computer that fetches instructions and data and executes the instructions. Often used as a synonym for processor.
The Computer Science and Telecommunications Board is part of the National Research Council.
A processor that is designed for a narrow set of computations and is manufactured in small numbers; in particular, a processor designed to achieve high-performance in scientific computing.
A supercomputer built with custom processors.
An infrastructure based on grids and on application-specific software, tools, and data repositories that support research in a particular discipline.
Defense Advanced Research Projects Agency, the central research and development organization of the Department of Defense (DoD).
Parallelism that is achieved by the application of the same operation to all the elements of a data aggregate, under the control of one instruction. Vector operations are the main example of data parallelism.
dense linear algebra.
Linear algebra computations (such as the solution of a linear system of equations) that involve dense matrices, where most entries are nonzero.
The process of replacing a continuous system of differential equations by a finite discrete approximation that can be solved on a computer.
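A minimal example of discretization: replacing the differential equation u'(t) = -u(t) by discrete forward-Euler time steps (the choice of equation and scheme is ours, for illustration):

```python
def forward_euler(f, u0, t_end, n_steps):
    # Replace the continuous rule du/dt = f(u) by the discrete rule
    # u_{k+1} = u_k + dt * f(u_k), taking n_steps steps of size dt.
    dt = t_end / n_steps
    u = u0
    for _ in range(n_steps):
        u = u + dt * f(u)
    return u
```

With f(u) = -u and u0 = 1, the discrete solution approaches the exact value exp(-t) as n_steps grows, which is the sense in which the finite approximation replaces the continuous system.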
distributed memory parallel system.
A parallel system, such as a cluster, with hardware that does not support shared memory.
Department of Defense.
Department of Energy. DOE is a major funder and user of supercomputing, through the ASC program and the various science programs of the Office of Science.
Dynamic random access memory. The technology used in the main memory of a computer; DRAM is denser, consumes less power, and is cheaper but slower than SRAM. Two important performance measures are memory capacity, measured in megabytes or gigabytes, and memory access time, or memory latency. The memory access time depends on the memory access pattern; row access time (or row access latency) is the worst-case access time, for irregular accesses.
The rate at which a processor performs operations (for a particular computation), often measured in operations per second. Often used as a shorthand for effective floating-point performance. More generally, the rate at which a computer system computes solutions.
efficiency or processor efficiency.
The ratio between the effective performance of a processor and its peak performance.
Earth Simulator, a large custom supercomputer installed in Japan in early 2002 in support of earth sciences research. The ES topped the TOP500 list from its installation until June 2004 and still provides significantly better performance than the largest U.S. supercomputers on many applications.
Fast Fourier transform.
Fastest Fourier transform in the West.
Additions and multiplications involving floating-point numbers, i.e., numbers in scientific notation.
The rate at which a computer performs floating-point operations, measured in floating-point operations per second; see in particular peak floating-point performance and effective floating-point performance.
Floating point operations per second. Flops is used as a metric for a computer’s performance.
front-side bus (FSB).
The connection of a microprocessor to the memory subsystem.
1,000,000,000 cycles per second, often the unit used to measure computer clock rates.
A synonym for computational grid.
The activity of using a computational grid.
High End Computing Revitalization Task Force, a task force established in March 2003 to develop a roadmap for high-end computing (HEC) R&D and discuss issues related to federal procurement of HEC platforms.
A custom processor designed to provide significantly higher effective memory bandwidth than commodity processors normally provide.
high-end computing (HEC).
A synonym for HPC.
high-performance computing (HPC).
Computing on a high-performance machine. There is no strict definition of high-performance machines, and the threshold for high performance will change over time. Systems listed in the TOP500 or technical computing systems selling for more than $1 million are generally considered to be high-performance.
High Performance Computing and Communications Initiative, which was established in the early 1990s as an umbrella for federal agencies that support research in computing and communication, including HPC.
High Productivity Computing Systems, a DARPA program started in 2002 to support R&D on a new generation of HPC systems that reduce time to solution by addressing performance, programmability, portability, and robustness.
High-Performance Fortran, a language designed in the early 1990s as an extension of Fortran 90 to support data parallelism on distributed memory machines. The language was largely discarded in the United States but continues to be used in other countries and is used for some codes on the Earth Simulator.
A supercomputer built with commodity processors but with a custom interconnect and a custom interface to the interconnect.
International Data Corporation.
Formally, the Report on High Performance Computing for the National Security Community, a report requested by the House of Representatives from the Secretary of Defense and nominally submitted in July 2002. It describes an integrated high-end computing program.
The concurrent execution of multiple instructions in a processor.
instructions per cycle (IPC).
Average number of instructions executed per clock cycle in a processor. IPC depends on the processor design and on the code run. The product of IPC and clock speed yields the instruction execution rate of the processor.
interconnect or interconnection network.
The hardware (cables and switches) that connect the nodes of a parallel system and support the communication between nodes. Also known as a switch.
irregular memory access.
A pattern of access to memory where successively accessed words are not equally spaced.
Independent software vendor.
Los Alamos National Laboratory.
Linear Algebra PACKage, a package that has largely superseded Linpack. The LAPACK library makes heavy use of the BLAS subroutines.
A measure of delay. Memory latency is the time needed to access data in memory; global communication latency is the time needed to effect a communication between two nodes through the interconnect. Both can be a performance bottleneck.
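Latency and bandwidth are often combined in a simple linear cost model for a memory access or a message transfer (the model itself is a standard assumption, not part of this entry, and the names are ours):

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    # Linear cost model: a fixed latency term plus a size-dependent
    # bandwidth term. Latency dominates for small transfers,
    # bandwidth for large ones.
    return latency_s + message_bytes / bandwidth_bytes_per_s
```

For example, with 1 microsecond latency and 1 Gbyte/sec bandwidth, a 1 Mbyte message takes about 1.001 milliseconds: the bandwidth term dominates, whereas for an 8-byte message the latency term dominates.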
Lawrence Berkeley National Laboratory.
A linear algebra software package; also a benchmark derived from it that consists of solving a dense system of linear equations. The Linpack benchmark has different versions, according to the size of the system solved. The TOP500 ranking uses a version where the chosen system size is large enough to get maximum performance.
A version of the UNIX operating system initially developed by Linus Torvalds and now widely used. The Linux code is freely available in open source.
Lawrence Livermore National Laboratory.
The faster increase in processor speed relative to memory access time, often called the memory wall. The growing gap is expected to hamper future improvements in processor performance.
A program that partitions a mesh into submeshes of roughly equal size, with few edges between submeshes. Such a program is needed to map a computation on a mesh to a parallel computer.
A method of communication between processes that involves one process sending data and the other process receiving the data, via explicit send and receive calls.
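The send/receive model can be sketched in Python using a pipe between two processes (this illustrates the model only; it is not MPI or any specific HPC library, and all names are ours):

```python
from multiprocessing import Pipe, Process

def worker(conn):
    # Explicit receive call: blocks until the other process sends.
    data = conn.recv()
    # Explicit send call: pass the computed result back.
    conn.send(sum(data))
    conn.close()

def message_passing_demo(values):
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(values)      # send the data to the other process
    result = parent_end.recv()   # receive the reply
    p.join()
    return result
```

The key property of the model is visible here: the two processes share no memory, and data moves only through the explicit send and receive calls.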
A processor on a single integrated circuit chip.
Millions of instructions per second. A measure of a processor’s speed.
Message Passing Interface, the current de facto standard library for message passing.
Millions of theoretical operations per second; the unit used to measure the composite theoretical performance (CTP) of high-performance systems.
Mean time to failure, the time from when a system or an application starts running until it is expected to fail.
A technique for the numerical solution of the linear systems that often arise from differential equations. It alternates the use of grids of various resolutions, achieving faster convergence than computations on fine grids and better accuracy than computations on coarse grids.
Numerical simulations that use multiple levels of discretization for a given domain, mixing coarser and finer discretizations; multigrid is an example of multilevel.
A simulation that combines various physical models. For example, a simulation of combustion that combines a fluid model with a model of chemical reactions.
A system comprising multiple processors. Each processor executes a separate thread. A single-chip multiprocessor is a system where multiple processors reside on one chip.
A processor that executes multiple threads concurrently or simultaneously, where the threads share computational resources (as distinct from a multiprocessor, where threads do not share computational resources). A multithreaded processor can use its resources better than a multiprocessor: when a thread is idling, waiting for data to arrive from memory, another thread can execute and use the resources.
A form of parallelism where multiple threads run concurrently and communicate via shared memory.
NASA’s Advanced Supercomputing Division (previously known as the Numerical Aerospace Simulation systems division). The NAS benchmarks are a set of benchmarks that were developed by NAS to represent numerical aerospace simulation workloads.
National Aeronautics and Space Administration.
A structural analysis package developed in the mid-1960s at NASA and widely used by industry. It is now available both in open source and as a supported product.
National Center for Atmospheric Research.
National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, one of three extant NSF supercomputing centers.
National Energy Research Scientific Computation Center, a supercomputing center maintained by DOE at the Lawrence Berkeley National Laboratory to support basic scientific research.
An online repository of mathematical software maintained by the University of Tennessee at Knoxville and by the Oak Ridge National Laboratory.
National Institutes of Health, the focal point for federally funded health research.
Networking and Information Technology R&D, a federal program. The program includes (among other areas) the High End Computing Program Component Area. This involves multiple federal agencies (NSF, NIH, NASA, DARPA, DOE, the Agency for Healthcare Research and Quality (AHRQ), NSA, NIST, NOAA, EPA, the Office of the Director of Defense Research and Engineering (ODDR&E), and the Defense Information Systems Agency (DISA)). The National Coordination Office for Information Technology Research and Development (NCO/IT R&D) coordinates the programs of the multiple agencies involved in NITRD.
National Nuclear Security Administration, the organization within DOE that manages the Stockpile Stewardship Program that is responsible for manufacturing, maintaining, refurbishing, surveilling, and dismantling the nuclear weapons stockpile.
National Oceanic and Atmospheric Administration.
The building block in a parallel machine that usually consists of a processor or a multiprocessor, memory, an interface to the interconnect and, optionally, a local disk.
Goods that suppliers cannot prevent some people from using while allowing others to use them.
Goods that each consumer can enjoy without diminishing anyone else’s ability to enjoy them.
The National Research Council is the operating arm of the National Academies.
National Security Agency, America’s cryptologic organization. NSA is a major user of supercomputing.
National Science Foundation, an independent federal agency with responsibility for scientific and engineering research. NSF funds research in computer science and engineering and supports three national supercomputing centers that serve the science community.
A computational chemistry package developed at the DOE Pacific Northwest National Laboratory (PNNL).
Ordinary differential equation.
Software that is available to users in source form and can be used and modified freely. Open source software is often created and maintained through the shared efforts of voluntary communities.
Partnership for Advanced Computational Infrastructure at NSF.
The ratio between the speedup achieved with p processors and the number of processors p. Parallel efficiency is an indication of scalability; it normally decreases as the number of processors increases, indicating a diminishing marginal return as more processors are applied to the solution of one problem.
The ratio between the time needed to solve a problem with one processor and the time needed to solve it with p processors, as a function of p. A larger parallel speedup indicates that parallelism is effective in reducing execution time.
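These two ratios, together with the well-known Amdahl's law bound (our addition, not part of this entry), can be sketched as:

```python
def parallel_speedup(t_serial, t_parallel):
    # Ratio of one-processor time to p-processor time.
    return t_serial / t_parallel

def parallel_efficiency(t_serial, t_parallel, p):
    # Speedup divided by the processor count p.
    return parallel_speedup(t_serial, t_parallel) / p

def amdahl_speedup(serial_fraction, p):
    # Amdahl's law: the fraction of work that cannot be parallelized
    # bounds the speedup achievable with p processors.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)
```

For example, a run taking 100 seconds on one processor and 25 seconds on eight processors has a speedup of 4 but a parallel efficiency of only 0.5, illustrating the diminishing marginal return described above.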
parallel file system.
A file system designed to efficiently support a large number of simultaneous accesses to one file by distinct processes.
The concurrent execution of operations to achieve higher performance.
Partial differential equation.
Highest performance achievable by a system. Often used as a shorthand for peak floating-point performance, the highest possible rate of floating-point operations that a computer system can sustain. Often estimated by considering the rate at which the arithmetic units of the processors can perform floating-point operations but ignoring other bottlenecks in the system. Thus, it is often the case that no program, and certainly no program of interest, can possibly achieve the peak performance of a system. Also known as never-to-exceed performance.
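The usual peak-performance estimate described here can be sketched as follows (parameter names are ours; as the entry notes, the formula ignores memory and other bottlenecks, so no real program is guaranteed to reach this rate):

```python
def peak_flops(n_processors, clock_hz, flops_per_cycle_per_processor):
    # Peak = processors x clock rate x floating-point operations
    # issued per cycle per processor. An upper bound only.
    return n_processors * clock_hz * flops_per_cycle_per_processor
```

For example, a hypothetical 1,024-processor machine clocked at 2 GHz with 4 flops per cycle per processor has a peak of about 8.2 teraflops.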
A package for the parallel solution of sparse linear algebra and PDE problems developed at DOE’s Argonne National Laboratory (ANL).
Processor in memory, a technique that combines DRAM and processor on the same chip to avoid the memory wall problem.
President’s Information Technology Advisory Committee, chartered by Congress in 1991 and 1998 as a federal advisory committee to provide the President with independent, expert advice on federal information technology R&D programs.
The moving of data from memory to cache in anticipation of future accesses by the processor to the data, so as to hide memory latency.
An executing program that runs in its own address space. A process may contain multiple threads.
An abstract conceptual view of the structure and operation of a computing system.
Goods that are nonrival and nonexcludable. Publicly available software that is not protected by a copyright or patent is an example of a public good.
A model of communication between processes that allows one process to read from (get) or write to (put) the memory of another process with no involvement of the other process.
Research and development.
A processor that operates only on scalar (i.e., single-word) operands; see vector processor.
A type of memory access where multiple words are loaded from distinct memory locations (gather) or stored at distinct locations (scatter). Vector processors typically support scatter/gather operations. Similarly, a global communication where data are received from multiple nodes (gather) or sent to multiple nodes (scatter).
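Gather and scatter through an index vector can be sketched as follows (a naive sequential sketch of what the hardware does in one vector operation; names are ours):

```python
def gather(memory, indices):
    # Load words from distinct, possibly irregularly spaced, locations.
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    # Store words to distinct locations; returns the updated memory.
    for i, v in zip(indices, values):
        memory[i] = v
    return memory
```

The index vector is what makes these accesses irregular: unlike a stride-one vector load, consecutive elements may come from arbitrary addresses.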
Strategic Computing Initiative, a large program initiated by DARPA in the 1980s to foster computing technology in the United States.
shared memory multiprocessor (SMP).
A multiprocessor where hardware supports access by multiple processors to a shared memory. The shared memory may be physically distributed across processors.
A message passing library developed for the Cray T3E and now available on many systems that support put/get communication operations.
Scalable processor architecture.
sparse linear algebra.
Linear algebra computations (such as the solution of a linear system of equations) that involve sparse matrices, where many entries are zero. Sparse linear algebra codes use data structures that store only the nonzero matrix entries, thus saving storage and computation time but resulting in irregular memory accesses and more complex logic.
The property that data stored near one another tend to be accessed closely in time. Good (high) spatial locality ensures that the use of multiple word cache lines is worthwhile, since when a word in a cache line is accessed there is a good chance that other words in the same line will be accessed soon after.
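The effect described here can be illustrated by comparing the address order of row-major versus column-major traversal of a matrix stored row by row (a sketch in which flat indices stand in for memory addresses):

```python
def row_major_order(n_rows, n_cols):
    # Traversing along rows touches consecutive addresses, so each
    # cache line fetched is fully used: good spatial locality.
    return [r * n_cols + c for r in range(n_rows) for c in range(n_cols)]

def column_major_order(n_rows, n_cols):
    # Traversing along columns jumps n_cols words per access, so most
    # of each fetched cache line goes unused: poor spatial locality.
    return [r * n_cols + c for c in range(n_cols) for r in range(n_rows)]
```

Both orders touch the same words, but only the first touches them in an order that makes multiword cache lines worthwhile.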
Set of benchmarks maintained by the Standard Performance Evaluation Corporation (SPEC); see <http://www.spec.org>. SPECfp is the floating-point component of the SPEC CPU benchmark that measures performance for compute-intensive applications (the other component is SPECint). The precise definition of the benchmark has evolved—the official name of the current version is SPEC CFP2000. The changes are small, however, and the mean flops rate achieved on the benchmarks is a good measure of processor performance evolution.
See parallel speedup.
Static random access memory. SRAM is faster but consumes more power and is less dense and more expensive than DRAM. SRAM is usually used for caches, while DRAM is used for the main memory of a computer.
Stockpile Stewardship Program.
A program established at DOE by the FY 1994 Defense Authorization Act to develop science-based tools and techniques for assessing the performance of nuclear weapon systems, predicting their safety and reliability, and certifying their functionality in the face of a halt in nuclear tests. The program includes computer simulation and modeling (ASC) as well as new experimental facilities.
Refers to those computing systems (hardware, systems software, and applications software) that provide close to the best currently achievable sustained performance on demanding computational problems.
Used to denote the various activities involved in the design, manufacture, or use of supercomputers.
Communication between threads with the effect of constraining the relative order that the threads execute code.
The property that data accessed recently in the past are likely to be accessed soon again. Good (high) temporal locality ensures that caches can effectively capture most memory accesses, since most accesses will be to data that were accessed recently in the past and that reside in the cache.
The basic unit of program execution.
time to solution.
Total time needed to solve a problem, including getting a new application up and running (the programming time), waiting for it to run (the execution time), and, finally, interpreting the results (the interpretation time).
TOP500.
A list, generated twice a year, of the sites operating the 500 most powerful computer systems in the world, as measured by the Linpack benchmark. While the list is often used for ranking supercomputers (including in this study), it is widely understood that the TOP500 ranking provides only a limited indication of the ability of supercomputers to solve real problems.
total cost of ownership.
The total cost of owning a computer, including the cost of the building hosting it, operation and maintenance costs, and so on. Total cost of ownership can be significantly higher than the purchase cost, and systems with a lower purchase cost can have higher total cost of ownership.
An operating system (OS) developed at Bell Laboratories in the late 1960s. UNIX is the most widely used OS on high-end computers. Different flavors of UNIX exist, some proprietary and some open source, such as Linux.
Unified Parallel C.
An operation that involves vector operands (consisting of multiple scalars), such as the addition of two vectors, or the loading from memory of a vector. Vector loads and stores can be used to hide memory latency.
A processor that supports vector operations.