Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 21
3 THE CURRENT STATE OF THE ART IN EVALUATION OF SUPERCOMPUTERS Large-scale computations may be characterized as those that require a major fraction of the resources of the most powerful computing systems available at any particular time. Historically, large-scale computations have occurred primarily in areas such as weather forecasting, aeronautical design, orbital mechanics, nuclear energy studies, and nuclear weapon design. Additional areas include automotive design, seismic analysis, petroleum reservoir modeling, condensed matter and statistical physics, nuclear and elementary particle physics, and the modeling of large molecules. Many other applications will emerge in the near future. In many of these areas, scientists and engineers continually face problems whose computational requirements exceed the capability of any available computing system. This is why organizations such as the U.S. Department of Energy (DOE), the National Aeronautics and Space Administration (NASA), and now the National Science Foundation (NSF) have consistently procured and will continue to procure advanced supercomputers as soon as they become available--to treat problems that were previously intractable. These advances challenge existing methods of performance measurement because we cannot precisely forecast the nature of all the computational techniques that eventually will be used in these new applications. Nevertheless, performance measurement of existing systems and workloads provides valuable insight in improving current applications and guiding future developments. LES SONS FROM THE PAST The events of the past ten years provide valuable lessons in evaluating performance of large-scale computers. In the early to mid 1970s, machines were introduced with architectures different enough from earlier machines that measurement of their performance was not a simple 21
OCR for page 22
2 ·/ extension of previous techniques. Before that time, improvements in computer performance came largely through advanced technology and modest architecture improvements, like overlapped operations, which did not impact seriously the algorithms used or the way codes were structured. Noteworthy successes in performance evaluation of sequential instruction machines were achieved and a substantial body of literature on the subject was built. In the mid to late 1970s the introduction of the ILLIAC IV, STAR-100, Cray-1, and CYBER 2XX series brought the so-called single-instruction multiple-data (SIMD) or "vector" architectures to scientific computing. Access to the potentially high performance of these machines required the codes to be structured to take advantage of machine architectures. In some cases, codes not so structured actually ran more slowly than on . . . previous mac; cones. While it was relatively easy to estimate or measure the maximum number of floating point operations per second of which SIMD machines were capable, performance of real jobs was much more complicated to assess. Even though most codes were written in FORTRAN language at the time, the ability of the compilers to optimize performance was negligible in the early days. Codes that ran at high performance were painstakingly modified by hand. During the ensuing ten years, the science of vector computing has come remarkably far. Compiler technology has developed to the point where many constraints are automatically vectorized. Researchers have discovered techniques and algorithms that are well suited to vector machines, including many problems previously thought intractable with these architectures. Today it is axiomatic that supercomputer system performance is a function of problems, implementation, and algorithms. The current trend toward performance enhancement by using multiple processors (multiple vector processors for large applications) brings back all the problems and uncertainties of the early days of vector processing. As with the early days of vector architectures, limited ability exists for the software to partition tasks automatically. Currently, for systems with few (four to eight) processors, partitioning can be done by hand for certain problems; but eventually software will have to improve and there is reasonable expectation that it will. In addition, whereas there is a limited range of variations in SIMD architecture, such as vector length, bandwidth of the communication channels to memory, and start-up time, there are myriad possible multiple-instruction multiple-data (MIND) architectures. Examples of variations are speed of interprocessor communication and shared versus local memory.
OCR for page 23
23 The following conclusion captures-these lessons from the past: Some of the methods, and corresponding metrics, used in the past are not well suited for the performance evaluation of current and future supercomputer systems. PERFORMANCE METRICS There are no universally accepted metrics of supercomputer performance. The most widely used metric is millions of floating point operations per second (MFLOPS), or megaflops, which is a partial measure of processor performance. For examples, see Lubeck et al. (1985), Dongarra and Hinds (1985), Los Alamos National Laboratory (in press), and Moore and Bucher (1981~. Evaluations based on this metric require great restraint. A cursory examination of the literature reveals that megaflops has been reported to characterize performance with respect to the following diverse features: o Peak capability of the hardware o Small computational loops o Computational kernels extracted from large computations o Entire programs o Entire workloads. These measurements can vary widely for the same supercomputer, especially as a function of software and implementation (Worlton, 1984; Dongarra, 19861. So it is always important to know the context in which the measurements are made. Regrettably, such measurements are occasionally lumped together and out of context, particularly in advertising literature. The result is that megaflops, as currently being used, is neither a reliable nor a precise measure of system performance. In spite of these deficiencies, the trade press, having few alternatives, tends to focus on peak hardware rates, so that the problem perseveres. Also, the floating point operation rate constitutes only a simplistic metric because of certain technical shortcomings: o It may not dominate performance for a particular application o It can be hard to measure on real problems. O It can be counted in different ways. O It can be performed in either scalar or vector mode. As to the last point, most state-of-the-art supercomputers offer scalar and vector processing. The performance levels of these two modes, as
OCR for page 24
24 measured in megaflops, may differ by a factor of 5 to 100. Thus, overall performance of an application will depend on the distribution of work between these two modes. This distribution, in turn, depends on the algorithms used, the details of their implementation, and the specific problem being solved. There is an acute need to develop and use more comprehensive performance metrics than megaflops alone. Some progress has been made. Hockney and Jesshope (1981) have introduced a two-parameter characterization of vector processors. The two parameters are (1) peak, or asymptotic, performance, commonly designated as TOO, and (2) half-performance vector length, commonly designated as nay, which is necessary to achieve half the asymptotic performance. This characterization facilitates comparison of different vector architectures as well as assessment of potential performance of a vector processor (Buzbee, in press). A three-parameter model of multiple vector processors is developed in Bucher and Simmons (1985~. More work is needed on performance characterizations. In the meantime the following conclusion can be drawn: Some performance measures and metrics in current use for the evaluation of supercomputers, such as rates of hardware operation and simple kernel execution. are often misleading and misused. TEST PROGRAMS FOR PERFORMANCE MEASUREMENT The selection of representative test programs for measurement of the performance range on a supercomputer system is a nontrivial task (Adams et al., 1985~. One approach that is growing in acceptance is to define a hierarchy of programs ranging from hardware demonstration programs to full application packages. However, if these application packages are written in FORTRAN language, a barrier to performance evaluation may result because of the major hurdles that FORTRAN puts in the way of algorithmic modification. There is a clear need to start building major benchmark programs that can easily be transported, with appropriate algorithmic changes, to new systems. If performance measurement is to become a science, then the writing of benchmark programs should also be rationalized and not be dependent on ad hoc procedures to overcome the difficulty of FORTRAN for portability. Hardware Demonstration Programs Hardware demonstration programs test the basic speed of a machine on simple computational or input-output (I/O) operations. The intent is to
OCR for page 25
25 have operations that can be measured with no loss of information, yet are simple enough to be analyzed completely. Examples of such operations can be found in Lubeck et al. (1985) and McMahon et al. (1972~. From measurements of the basic machine performance, it is possible to determine the costs of memory references, bank conflicts, paging, disk accesses, filling and emptying pipelines, and hitting or missing cache. From this class of programs, it is also possible to determine the vector lengths at which half the peak performance is attained, the vector lengths at which the breakeven point for vector-scalar tradeoffs is reached, the asymptotic peak performance for vector operations, and the amount of work that must be performed between synchronization points in order to overcome the cost of communication in a multiprocessor system. However, the use of hardware demonstration programs for benchmarking and performance measurement is useful only at a very low level--for machine designers and for providing parameters for use in higher level performance modeling. Results from this level of measurement may not be well representative of supercomputer performance. Program Kernels Program kernels represent the computationally intense loops of actual programs; performance on such loops therefore begins to reflect what can be expected of a system when it executes a full workload. Examples include the well-known Lawrence Livermore Loops, commonly used for benchmarking (McMahon, in press), and NASA Ames Kernels (Bailey and Barton, 198S). In general, such program kernels measure central processing unit (CPU) performance only, although some may be designed to measure primarily I/O performance. Rarely are they designed to measure an integrated system, and their lack of program complexity can create some performance misconceptions. Therefore, relying too heavily on the assumption that there is a high correlation of program kernels with the total workload can pose problems. Finally, because kernels tend to be relatively simple loops in structure, the performance that they individually indicate can vary across the entire range of potential system performance depending on whether the system compiler can recognize vectorizability. If an overall figure of merit must be derived from the sum of such kernels, it should be based on a weighted harmonic mean rather than an arithmetic mean in order to estimate fairly the effects~of the peaks and valleys of performance (Worlton, 1984~. However, combining a set of measurements into a single number like this should not be done without knowledge of the application environment.
OCR for page 26
26 Basic Routines Basic routines can be defined to represent the main computational elements used in a range of applications. Several examples are given in Table 4-1 of Chapter 4. Another example is LINPACK (Dongarra, 19861. As such, these routines should provide a reasonable indicator of performance for applications in which they consume major portions of the total requisite resources. The performance of a system on these basic routines does not cover the interactions and communications required between them, as well as the remainder of the full application program that is outside their scope. If the time consumed by the remainder of an application is not extremely small relative to the amount of time used by the measured routines, then the value of this set of sample programs declines as a measure of system performance. Stripped-Down Versions of Major Application Packages Generally, stripped-down versions of major application packages will retain what we have labeled as basic routines, and they will permit some, but not all, of the interactions required between basic routines. They contain little of the pre- and post-processing required by full application packages and may or may not include interactions of CPU and I/O requirements. A well-known example is the SIMPLE program (Crowley et al., 19781. Stripped-down programs are easier to manipulate and to transport from one system to another than are full application packages, and in cases where the full application packages are proprietary or classified, they permit testing that would otherwise be impossible. One of the problems that often occurs in this type of test routine is that by creating a facsimile of a full program for test purposes, some of the system requirements for memory capacity and I/O performance are relaxed. There is thus a real danger, in constructing a stripped-down version that will fit on a system that must be tested, of losing some of the original structure. In particular, this can create havoc when testing a parallel processing system. For example, if a program is scaled to fit on a system with limited memory, its performance might be totally different from the performance that would be expected if the appropriate memory sizes were maintained. Whereas the version constructed for a large memory might perform well when executed in parallel on a system with sufficient memory, the version constructed for a small memory may appear to spend more time communicating than it would in computing on a balanced system with large memory and fast processors.
OCR for page 27
27 Full Application Packages The class of test programs designated here as "full application packages"' provides the most accurate picture of system performance for actual problems, yet it too has drawbacks. Full application packages are hard to transport from system to system because users have adapted them to their own environments, tailoring algorithms to the subtle points of the architecture. Even though the same adaptation will be made in the new environment, it is harder to measure programming style and level of effort invested in fitting a full application package to given architecture than in doing so for the other types of representative test programs. Measurements on this class of test programs should be made, but are best approached only after the results of measurements on the other test programs have been fully understood and the computing environment has been carefully defined. Developmental Programs A final class, developmental programs, reflects the need to consider future workload directions and the need to admit novel approaches to algorithm and architecture design. As such, developmental programs would be used to try out new algorithms, to devise approaches appropriate for new architectures, and to experiment with them. The idea is to be aware of what is just ahead in both algorithm and architecture development, and to plan how to handle new capabilities in software and hardware. In some cases that will mean using radical algorithm approaches to test their utility on existing systems or on new designs. In other cases it will mean staying abreast of new architectures and attempting to design and characterize the programs that will be suitable in this context. Results from developmental programs may be hard to generalize, but should be used to indicate the potential of new supercomputer systems for which new programs would be created and new workloads developed. ANALYTICAL APPROACH TO PERFORMANCE EVALUATION The analytical approach to performance evaluation establishes procedures and methods for evaluating performance with respect to some set of applications or computational structures. For examples, see Baskett and Keller (1977) and Bucher and Martin (1982~. Benchmarking is often either followed or preceded by modeling to extend the results to other configurations and job mixes. Simulated execution of benchmarks may be used to evaluate systems that are not yet operational. For machines and
OCR for page 28
28 systems in normal business applications, analytical performance modeling is a mature approach with a strong commercial base. The literature on analytic performance modeling is extensive, with groups like ACM Sigmetrics, the Computer Measurement Group, and the International Federation of Information Processing Working Group 7.3 on Computer System Modeling devoting annual conferences and publications to the topic. A referreed journal, Performance Measurement, is published quarterly; and the National Science Foundation has sponsored projects in performance evaluation and monitoring for many years. Most of the work is based on the application of queuing theory to computer systems. In spite of the wealth of work, not much has been directly applied to supercomputer systems, primarily because the tradeoffs in supercomputing generally involve maximizing computational performance instead of trying to optimize overall system performance. SUMMARY State-of-the-art practice in performance measurement generally involves a hierarchy of test programs wherein each level of the hierarchy has its unique strengths and weaknesses with respect to a specific set of applications or computational structures. Historically, computability has been related primarily to arithmetic complexity, and thus we have naturally evolved toward heavy use of megaflops as a performance metric. However, because of inconsistent practices and some technical shortcomings, this metric should be used with restraint. More work is needed to develop multidimensional characterizations such as those in Hockney and Jesshope (1981~. Perhaps more importantly, we are entering an era in which there will be a diversity of supercomputer architectures, along with a growing set of applications that span a wide spectrum of complexity. Our experience with vector processors demonstrates that as architecture grows in complexity, performance becomes increasingly dependent upon achieving a good match between problems, implementation, and algorithms. This dependence will become even more pronounced as we move into an era of multiprocessor systems. Achieving good matches will hinge to some extent on our capabilities in performance measurement. Thus it will become a much more important aspect of our work, with a corresponding requirement that we nurture advancement of performance measurement technology. The message of this chapter is crystallized in the following recommendation: Performance evaluation methods for supercomputers. beyond those already available. should be developed and used: and these methods should be based on sufficiently general and unifying concepts to address the whole computer system. including multiple processors. memory. input-output subsystems. and software.
OCR for page 29
29 REFERENCES Adams, G. B., Brown, R. L. and Denning, P. J. 1985. On Evaluating Parallel Computer Systems. RIACS TR 85.3. Moffett Field, California: Research Institute for Advanced Computer Science. September. Bailey, D. H., and J. T. Barton. 1985. The NAS Kernel Benchmark Program. NASA Technical Memorandum 86711. Moffett Field, California: Ames Research Center, National Aeronautics and Space Administration. August. Baskett, F., and T. Keller. 1977. An evaluation of the Cray-1 computer. fin High Speed Computer and Algorithm Organization. D. Kuck, D. Lawrie, and A. Sameh, eds. New York: Academic Press. Bucher, I. Y., and J. L. Martin. 1982. Methodology for Characterizing a Scientific Workload. Unclassified Release LA-UR-82-1702. Los Alamos, New Mexico: Los Alamos National Laboratory. Bucher, I. Y., and M. L. Simmons. 1985. Performance Assessment of Supercomputers. Computing Surveys. Preprint LA-UR-85-1505. Los Alamos, New Mexico: Los Alamos National Laboratory. Buzbee, B. L. 1986. A strategy for vectorization. Parallel Computing 3(July):187-192. Crowley, W. P., C. P. Hendrickson, and T. E. Rudy. 1978. The SIMPLE Code. Report UCID-17715. Livermore, California: Lawrence Livermore National Laboratory. February. Dongarra, J. J., and A. Hinds. 198S. Comparison of the Cray X-MP-4, Fujitsu VP-200, and Hitachi S-810/20: An Argonne Perspective. Technical Report ANL-85-19. Chicago: Argonne National Laboratory. October. Dongarra, J. J. 1986. Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment. Technical Memorandum 23. Chicago: Argonne National Laboratory. April 30. Hockney, R. W., and C. R. Jesshope. 1981. Parallel Computer Architecture, Programming and Algorithms. Philadelphia: Heydon. Los Alamos National Laboratory. In press. Los Alamos National Laboratory Computer Benchmarking 1986. LA-10151-MS. Los Alamos, New Mexico: Los Alamo s National Laboratory. To be published December 1986.
OCR for page 30
30 Lubeck, O., J. Moore, and R. Mendez. 1985. A benchmark comparison of Fujitsu VP-200, Hitachi S810/20, and Cray 18~12~:10-23. December. three supercomputers: X-MP/2. IEEE Computer McMahon, F. H. In press. The Livermore Fortran Kernels: A CPU Floating Point Performance Range Test. Livermore, California: Lawrence Livermore National Laboratory. McMahon, F. H., C. J. Sloan, and G. A. Long. 1972. STACKLIB: Vector Function Library. Report UCID-30083. Livermore, California Lawrence Livermore National Laboratory. November . Moore, J. W., and I. Y. Bucher. 1981. Comparative Performance Evaluation of Two Supercomputers: CDC Cyber-205 and CRI Cray-l. Unclassified Release LA-UR-81-1977. Los Alamos, New Mexico: Los Alamos National Laboratory. Worlton, J. 1984 Datamation, pp . Understanding supercomputer benchmarks 121-130. September 1.
Representative terms from entire chapter: