Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 21
3
THE CURRENT STATE OF THE ART IN EVALUATION OF SUPERCOMPUTERS
Large-scale computations may be characterized as those that require a
major fraction of the resources of the most powerful computing systems
available at any particular time. Historically, large-scale
computations have occurred primarily in areas such as weather
forecasting, aeronautical design, orbital mechanics, nuclear energy
studies, and nuclear weapon design. Additional areas include automotive
design, seismic analysis, petroleum reservoir modeling, condensed matter
and statistical physics, nuclear and elementary particle physics, and
the modeling of large molecules. Many other applications will emerge in
the near future. In many of these areas, scientists and engineers
continually face problems whose computational requirements exceed the
capability of any available computing system. This is why organizations
such as the U.S. Department of Energy (DOE), the National Aeronautics
and Space Administration (NASA), and now the National Science Foundation
(NSF) have consistently procured and will continue to procure advanced
supercomputers as soon as they become available--to treat problems that
were previously intractable. These advances challenge existing methods
of performance measurement because we cannot precisely forecast the
nature of all the computational techniques that eventually will be used
in these new applications. Nevertheless, performance measurement of
existing systems and workloads provides valuable insight in improving
current applications and guiding future developments.
LES SONS FROM THE PAST
The events of the past ten years provide valuable lessons in evaluating
performance of large-scale computers. In the early to mid 1970s,
machines were introduced with architectures different enough from
earlier machines that measurement of their performance was not a simple
21
OCR for page 22
2 ·/
extension of previous techniques. Before that time, improvements in
computer performance came largely through advanced technology and modest
architecture improvements, like overlapped operations, which did not
impact seriously the algorithms used or the way codes were structured.
Noteworthy successes in performance evaluation of sequential instruction
machines were achieved and a substantial body of literature on the
subject was built.
In the mid to late 1970s the introduction of the ILLIAC IV, STAR-100,
Cray-1, and CYBER 2XX series brought the so-called single-instruction
multiple-data (SIMD) or "vector" architectures to scientific computing.
Access to the potentially high performance of these machines required
the codes to be structured to take advantage of machine architectures.
In some cases, codes not so structured actually ran more slowly than on
. . .
previous mac; cones.
While it was relatively easy to estimate or measure the maximum
number of floating point operations per second of which SIMD machines
were capable, performance of real jobs was much more complicated to
assess. Even though most codes were written in FORTRAN language at the
time, the ability of the compilers to optimize performance was
negligible in the early days. Codes that ran at high performance were
painstakingly modified by hand.
During the ensuing ten years, the science of vector computing has
come remarkably far. Compiler technology has developed to the point
where many constraints are automatically vectorized. Researchers have
discovered techniques and algorithms that are well suited to vector
machines, including many problems previously thought intractable with
these architectures. Today it is axiomatic that supercomputer system
performance is a function of problems, implementation, and algorithms.
The current trend toward performance enhancement by using multiple
processors (multiple vector processors for large applications) brings
back all the problems and uncertainties of the early days of vector
processing. As with the early days of vector architectures, limited
ability exists for the software to partition tasks automatically.
Currently, for systems with few (four to eight) processors, partitioning
can be done by hand for certain problems; but eventually software will
have to improve and there is reasonable expectation that it will. In
addition, whereas there is a limited range of variations in SIMD
architecture, such as vector length, bandwidth of the communication
channels to memory, and start-up time, there are myriad possible
multiple-instruction multiple-data (MIND) architectures. Examples of
variations are speed of interprocessor communication and shared versus
local memory.
OCR for page 23
23
The following conclusion captures-these lessons from the past:
Some of the methods, and corresponding metrics, used in the past are
not well suited for the performance evaluation of current and future
supercomputer systems.
PERFORMANCE METRICS
There are no universally accepted metrics of supercomputer performance.
The most widely used metric is millions of floating point operations per
second (MFLOPS), or megaflops, which is a partial measure of processor
performance. For examples, see Lubeck et al. (1985), Dongarra and Hinds
(1985), Los Alamos National Laboratory (in press), and Moore and Bucher
(1981~. Evaluations based on this metric require great restraint. A
cursory examination of the literature reveals that megaflops has been
reported to characterize performance with respect to the following
diverse features:
o Peak capability of the hardware
o Small computational loops
o Computational kernels extracted from large computations
o Entire programs
o Entire workloads.
These measurements can vary widely for the same supercomputer,
especially as a function of software and implementation (Worlton, 1984;
Dongarra, 19861. So it is always important to know the context in which
the measurements are made. Regrettably, such measurements are
occasionally lumped together and out of context, particularly in
advertising literature. The result is that megaflops, as currently
being used, is neither a reliable nor a precise measure of system
performance. In spite of these deficiencies, the trade press, having
few alternatives, tends to focus on peak hardware rates, so that the
problem perseveres.
Also, the floating point operation rate constitutes only a simplistic
metric because of certain technical shortcomings:
o It may not dominate performance for a particular application
o It can be hard to measure on real problems.
O It can be counted in different ways.
O It can be performed in either scalar or vector mode.
As to the last point, most state-of-the-art supercomputers offer scalar
and vector processing. The performance levels of these two modes, as
OCR for page 24
24
measured in megaflops, may differ by a factor of 5 to 100. Thus,
overall performance of an application will depend on the distribution of
work between these two modes. This distribution, in turn, depends on
the algorithms used, the details of their implementation, and the
specific problem being solved.
There is an acute need to develop and use more comprehensive
performance metrics than megaflops alone. Some progress has been made.
Hockney and Jesshope (1981) have introduced a two-parameter
characterization of vector processors. The two parameters are (1) peak,
or asymptotic, performance, commonly designated as TOO, and (2)
half-performance vector length, commonly designated as nay, which is
necessary to achieve half the asymptotic performance. This
characterization facilitates comparison of different vector
architectures as well as assessment of potential performance of a vector
processor (Buzbee, in press). A three-parameter model of multiple
vector processors is developed in Bucher and Simmons (1985~. More work
is needed on performance characterizations. In the meantime the
following conclusion can be drawn:
Some performance measures and metrics in current use for the
evaluation of supercomputers, such as rates of hardware operation and
simple kernel execution. are often misleading and misused.
TEST PROGRAMS FOR PERFORMANCE MEASUREMENT
The selection of representative test programs for measurement of the
performance range on a supercomputer system is a nontrivial task (Adams
et al., 1985~. One approach that is growing in acceptance is to define
a hierarchy of programs ranging from hardware demonstration programs to
full application packages. However, if these application packages are
written in FORTRAN language, a barrier to performance evaluation may
result because of the major hurdles that FORTRAN puts in the way of
algorithmic modification. There is a clear need to start building major
benchmark programs that can easily be transported, with appropriate
algorithmic changes, to new systems. If performance measurement is to
become a science, then the writing of benchmark programs should also be
rationalized and not be dependent on ad hoc procedures to overcome the
difficulty of FORTRAN for portability.
Hardware Demonstration Programs
Hardware demonstration programs test the basic speed of a machine on
simple computational or input-output (I/O) operations. The intent is to
OCR for page 25
25
have operations that can be measured with no loss of information, yet
are simple enough to be analyzed completely. Examples of such
operations can be found in Lubeck et al. (1985) and McMahon et al.
(1972~. From measurements of the basic machine performance, it is
possible to determine the costs of memory references, bank conflicts,
paging, disk accesses, filling and emptying pipelines, and hitting or
missing cache. From this class of programs, it is also possible to
determine the vector lengths at which half the peak performance is
attained, the vector lengths at which the breakeven point for
vector-scalar tradeoffs is reached, the asymptotic peak performance for
vector operations, and the amount of work that must be performed between
synchronization points in order to overcome the cost of communication in
a multiprocessor system.
However, the use of hardware demonstration programs for benchmarking
and performance measurement is useful only at a very low level--for
machine designers and for providing parameters for use in higher level
performance modeling. Results from this level of measurement may not be
well representative of supercomputer performance.
Program Kernels
Program kernels represent the computationally intense loops of actual
programs; performance on such loops therefore begins to reflect what can
be expected of a system when it executes a full workload. Examples
include the well-known Lawrence Livermore Loops, commonly used for
benchmarking (McMahon, in press), and NASA Ames Kernels (Bailey and
Barton, 198S).
In general, such program kernels measure central processing unit
(CPU) performance only, although some may be designed to measure
primarily I/O performance. Rarely are they designed to measure an
integrated system, and their lack of program complexity can create some
performance misconceptions. Therefore, relying too heavily on the
assumption that there is a high correlation of program kernels with the
total workload can pose problems. Finally, because kernels tend to be
relatively simple loops in structure, the performance that they
individually indicate can vary across the entire range of potential
system performance depending on whether the system compiler can
recognize vectorizability. If an overall figure of merit must be
derived from the sum of such kernels, it should be based on a weighted
harmonic mean rather than an arithmetic mean in order to estimate fairly
the effects~of the peaks and valleys of performance (Worlton, 1984~.
However, combining a set of measurements into a single number like this
should not be done without knowledge of the application environment.
OCR for page 26
26
Basic Routines
Basic routines can be defined to represent the main computational
elements used in a range of applications. Several examples are given in
Table 4-1 of Chapter 4. Another example is LINPACK (Dongarra, 19861.
As such, these routines should provide a reasonable indicator of
performance for applications in which they consume major portions of the
total requisite resources. The performance of a system on these basic
routines does not cover the interactions and communications required
between them, as well as the remainder of the full application program
that is outside their scope. If the time consumed by the remainder of
an application is not extremely small relative to the amount of time
used by the measured routines, then the value of this set of sample
programs declines as a measure of system performance.
Stripped-Down Versions of Major Application Packages
Generally, stripped-down versions of major application packages will
retain what we have labeled as basic routines, and they will permit
some, but not all, of the interactions required between basic routines.
They contain little of the pre- and post-processing required by full
application packages and may or may not include interactions of CPU and
I/O requirements. A well-known example is the SIMPLE program (Crowley
et al., 19781. Stripped-down programs are easier to manipulate and to
transport from one system to another than are full application packages,
and in cases where the full application packages are proprietary or
classified, they permit testing that would otherwise be impossible. One
of the problems that often occurs in this type of test routine is that
by creating a facsimile of a full program for test purposes, some of the
system requirements for memory capacity and I/O performance are
relaxed. There is thus a real danger, in constructing a stripped-down
version that will fit on a system that must be tested, of losing some of
the original structure. In particular, this can create havoc when
testing a parallel processing system. For example, if a program is
scaled to fit on a system with limited memory, its performance might be
totally different from the performance that would be expected if the
appropriate memory sizes were maintained. Whereas the version
constructed for a large memory might perform well when executed in
parallel on a system with sufficient memory, the version constructed for
a small memory may appear to spend more time communicating than it would
in computing on a balanced system with large memory and fast processors.
OCR for page 27
27
Full Application Packages
The class of test programs designated here as "full application
packages"' provides the most accurate picture of system performance for
actual problems, yet it too has drawbacks. Full application packages
are hard to transport from system to system because users have adapted
them to their own environments, tailoring algorithms to the subtle
points of the architecture. Even though the same adaptation will be
made in the new environment, it is harder to measure programming style
and level of effort invested in fitting a full application package to
given architecture than in doing so for the other types of
representative test programs. Measurements on this class of test
programs should be made, but are best approached only after the results
of measurements on the other test programs have been fully understood
and the computing environment has been carefully defined.
Developmental Programs
A final class, developmental programs, reflects the need to consider
future workload directions and the need to admit novel approaches to
algorithm and architecture design. As such, developmental programs
would be used to try out new algorithms, to devise approaches
appropriate for new architectures, and to experiment with them. The
idea is to be aware of what is just ahead in both algorithm and
architecture development, and to plan how to handle new capabilities in
software and hardware. In some cases that will mean using radical
algorithm approaches to test their utility on existing systems or on new
designs. In other cases it will mean staying abreast of new
architectures and attempting to design and characterize the programs
that will be suitable in this context. Results from developmental
programs may be hard to generalize, but should be used to indicate the
potential of new supercomputer systems for which new programs would be
created and new workloads developed.
ANALYTICAL APPROACH TO PERFORMANCE EVALUATION
The analytical approach to performance evaluation establishes procedures
and methods for evaluating performance with respect to some set of
applications or computational structures. For examples, see Baskett and
Keller (1977) and Bucher and Martin (1982~. Benchmarking is often
either followed or preceded by modeling to extend the results to other
configurations and job mixes. Simulated execution of benchmarks may be
used to evaluate systems that are not yet operational. For machines and
OCR for page 28
28
systems in normal business applications, analytical performance modeling
is a mature approach with a strong commercial base.
The literature on analytic performance modeling is extensive, with
groups like ACM Sigmetrics, the Computer Measurement Group, and the
International Federation of Information Processing Working Group 7.3 on
Computer System Modeling devoting annual conferences and publications to
the topic. A referreed journal, Performance Measurement, is published
quarterly; and the National Science Foundation has sponsored projects in
performance evaluation and monitoring for many years. Most of the work
is based on the application of queuing theory to computer systems. In
spite of the wealth of work, not much has been directly applied to
supercomputer systems, primarily because the tradeoffs in supercomputing
generally involve maximizing computational performance instead of trying
to optimize overall system performance.
SUMMARY
State-of-the-art practice in performance measurement generally involves
a hierarchy of test programs wherein each level of the hierarchy has its
unique strengths and weaknesses with respect to a specific set of
applications or computational structures. Historically, computability
has been related primarily to arithmetic complexity, and thus we have
naturally evolved toward heavy use of megaflops as a performance
metric. However, because of inconsistent practices and some technical
shortcomings, this metric should be used with restraint. More work is
needed to develop multidimensional characterizations such as those in
Hockney and Jesshope (1981~.
Perhaps more importantly, we are entering an era in which there will
be a diversity of supercomputer architectures, along with a growing set
of applications that span a wide spectrum of complexity. Our experience
with vector processors demonstrates that as architecture grows in
complexity, performance becomes increasingly dependent upon achieving a
good match between problems, implementation, and algorithms. This
dependence will become even more pronounced as we move into an era of
multiprocessor systems. Achieving good matches will hinge to some
extent on our capabilities in performance measurement. Thus it will
become a much more important aspect of our work, with a corresponding
requirement that we nurture advancement of performance measurement
technology.
The message of this chapter is crystallized in the following
recommendation:
Performance evaluation methods for supercomputers. beyond those
already available. should be developed and used: and these methods
should be based on sufficiently general and unifying concepts to
address the whole computer system. including multiple processors.
memory. input-output subsystems. and software.
OCR for page 29
29
REFERENCES
Adams, G. B., Brown, R. L. and Denning, P. J. 1985. On Evaluating
Parallel Computer Systems. RIACS TR 85.3. Moffett Field,
California: Research Institute for Advanced Computer Science.
September.
Bailey, D. H., and J. T. Barton. 1985. The NAS Kernel Benchmark
Program. NASA Technical Memorandum 86711. Moffett Field,
California: Ames Research Center, National Aeronautics and Space
Administration. August.
Baskett, F., and T. Keller. 1977. An evaluation of the Cray-1
computer. fin High Speed Computer and Algorithm Organization. D.
Kuck, D. Lawrie, and A. Sameh, eds. New York: Academic Press.
Bucher, I. Y., and J. L. Martin. 1982. Methodology for
Characterizing a Scientific Workload. Unclassified Release
LA-UR-82-1702. Los Alamos, New Mexico: Los Alamos National
Laboratory.
Bucher, I. Y., and M. L. Simmons. 1985. Performance Assessment
of Supercomputers. Computing Surveys. Preprint LA-UR-85-1505. Los
Alamos, New Mexico: Los Alamos National Laboratory.
Buzbee, B. L. 1986. A strategy for vectorization. Parallel Computing
3(July):187-192.
Crowley, W. P., C. P. Hendrickson, and T. E. Rudy. 1978. The SIMPLE
Code. Report UCID-17715. Livermore, California: Lawrence Livermore
National Laboratory. February.
Dongarra, J. J., and A. Hinds. 198S. Comparison of the Cray X-MP-4,
Fujitsu VP-200, and Hitachi S-810/20: An Argonne Perspective.
Technical Report ANL-85-19. Chicago: Argonne National Laboratory.
October.
Dongarra, J. J. 1986. Performance of Various Computers Using Standard
Linear Equations Software in a Fortran Environment. Technical
Memorandum 23. Chicago: Argonne National Laboratory. April 30.
Hockney, R. W., and C. R. Jesshope. 1981. Parallel Computer
Architecture, Programming and Algorithms. Philadelphia: Heydon.
Los Alamos National Laboratory. In press. Los Alamos National
Laboratory Computer Benchmarking 1986. LA-10151-MS. Los Alamos, New
Mexico: Los Alamo s National Laboratory. To be published December
1986.
OCR for page 30
30
Lubeck, O., J. Moore, and R. Mendez. 1985. A benchmark comparison of
Fujitsu VP-200, Hitachi S810/20, and Cray
18~12~:10-23. December.
three supercomputers:
X-MP/2. IEEE Computer
McMahon, F. H. In press. The Livermore Fortran Kernels: A CPU
Floating Point Performance Range Test. Livermore, California:
Lawrence Livermore National Laboratory.
McMahon, F. H., C. J. Sloan, and G. A. Long. 1972. STACKLIB:
Vector Function Library. Report UCID-30083. Livermore, California
Lawrence Livermore National Laboratory. November
.
Moore, J. W., and I. Y. Bucher. 1981. Comparative Performance
Evaluation of Two Supercomputers: CDC Cyber-205 and CRI Cray-l.
Unclassified Release LA-UR-81-1977. Los Alamos, New Mexico: Los
Alamos National Laboratory.
Worlton, J. 1984
Datamation, pp
.
Understanding supercomputer benchmarks
121-130. September 1.
Representative terms from entire chapter:
performance evaluation