Click for next page ( 32


The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Terms of Use and Privacy Statement



Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 31
4 IMPROVEMENTS ATTAINABLE IN PERFORMANCE EVALUATION I often say that when you can measure what you are speaking about and express it in numbers, you know something about it; but when you can not measure it, when you can not express it in numbers, your knowledge is of meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely advanced to the stage of science, whatever that matter may be. --Lord Kelvin Despite the significance of supercomputers to modern science, engineering, and industrial technology, the process of measuring, evaluating, and predicting supercomputer performance is imprecise at best. This chapter describes methods for characterizing applications and architectures and points toward an emerging and promising approach for accomplishing their pairing. It is hoped that this description of what might be attainable will encourage the development of supercomputer performance evaluation as a science. A central issue in understanding performance measurements on supercomputers involves the pairing of architectures and applications. Is there an inherent matching and, if so, what essential properties of the architectures and applications must be characterized for their contributions to overall system performance? In particular, what attributes enhance the suitability of particular applications for specific architectures? Clear answers to these questions are elusive because the terms ''applications" and "architectures" have not been defined precisely. Classifications and formal descriptions are needed to aid in understanding the interaction of hardware and software. That is, only after we have described an application and an architecture within.a well defined classification scheme, can we ask what it implies to say that application A has a performance P on machine M. Naturally, experimentation will be a crucial component of the design phase during 31

OCR for page 31
32 which these classifications are developed. However, until we are able to define a reference frame around which to develop models of performance, progress cannot be made toward a coherent theory. We may conduct experiments on many discrete applications and architectures without being able to unify and generalize any of the results. To advance supercomputer performance evaluation as a science, measurements must be made in the context of defined models of architectures and applications. Comparative measurements should be made on a constant set of applications and across a spectrum of systems. As pointed out in Chapter 2, the measurement process should cycle through the construction of abstract theoretical models of applications and architectures, the design of experiments to measure specific parameters relative to those models, the development of metrics and their use in a fully understood environment, and the consequent recreation of basic models. These observations led the committee to the following conclusion: No well developed scientific foundation exists for supercomputer performance evaluation. This is not to say that analytic modeling, mentioned Chapter 3, is not worthwhile or that it is not a well-developed science. It says that it is a science that has been developed to address different issues than the ones in which we are now interested and that, for the performance evaluation questions on which we have chosen to focus, the scientific basis is indeed not present. MODELS AND CLASSIFICATION SCHEMES Classifications and formal descriptions may help reduce the continuum of architectures and applications to a reasonable number of cases, and thereby allow their mapping onto manageable sets of information; this may make our matching endeavor more practicable. Descriptions of Architectures The taxonomy proposed by Flynn (1972), and introduced in Chapter 3, provides a useful starting point whereby computers can be classified as single-instruction single-data (SISD) ? single-instruction multiple-data

OCR for page 31
33 (SIMD), or multiple-instruction multiple-data (MIMD). Kuck (1978) has transformed this into a considerably more specific classification by distinguishing between multiple or single instruction streams (MIS or SIS) on the one hand, and multiple or single execution streams (MES or SES) on the other. The first distinction is made on the basis of the number of "programs" being executed at once, where a program is assumed to need a single instruction location register in some control unit for its execution. The second distinction is based on the ability of the control unit to sequence one or more operation types at once, where an operation type corresponds roughly to a single operation code. So far the correspondence between Flynn's and Kuck's schemes is obvious. Going further, Kuck's classification distinguishes between scalar and array (denoted by the addition of the letter A) operation types, where the latter refers to vector instructions of a vector machine like the CYBER 205 as well as to logically corresponding instructions for multiprocessors like the Burroughs BSP. Thus a CDC 6600 machine can be characterized as SIS-MES, a HEP as MIS-MES, a Gray 1 as SISA-MESA, and an IBM 3090 as SISA-SESA (if one is talking about one processor). While Kuck's introduction of the instruction and execution layers appears to be convenient and alleviates the ambiguities of Flynn's classification, we feel that we need more ingredients to characterize the performance of a machine and to map an application successfully onto it. The connection scheme between multiprocessors, the memory structure and hierarchy, and the manner in which data items are accessed and manipulated have become meaningful criteria in the description process. Of importance are not only the types of instruction and execution but also categories like the concurrency or complexity of instructions and the source and destination of their operands. Dongarra and Duff (1985) have proposed to describe computers via formalized templates. Other taxonomies have been attempted (for example, Wallich and Zorpette, 1986; Schwartz, 1983) and have highlighted different aspects of architectural characteristics, but each appears to have some weaknesses in addition to its strengths. Although it may not be possible to define a perfect taxonomy of computer architectures that would be free of ambiguity and allow the discrete placement of any system within its structure, much can be learned in the process of seeking definition. Furthermore, for the purposes of performance evaluation, good taxonomy of architectures should permit the development of performance methods and theories for classes of machines, rather than single entities. For these reasons, the continued development of taxonomies should be encouraged.

OCR for page 31
34 Descriptions of Applications Applications can be described at various levels, where the step from one level to the next can be thought of as being one of successive refinement from the general to the specific. The following levels are important ones in this hierarchy: 1. Identification of a scientific or technical problem 2. Choice of mathematical model 3. Choice of "discretization" method (changing from continuous to discrete vary ables and functions) 4. Ident' fication of "basic" building blocks 5. Choice of numerical methods 6. Arrangement and implementation of algorithms. The first two levels can be considered as identifying the application. Levels 3 and 4 characterize the "high-level implementation" for reaching the solution. Levels 5 and 6 characterize "intermediate-level" and "low-level" implementation respectively. It may be useful to note that, although the scientific or technical problem and to a certain extent the mathematical model is given, each of the subsequent steps has an influence on the suitability of the application for a specific architecture. Some of the steps may involve tradeoffs between mathematical properties, like local accuracy or rate of convergence, and more algorithmic aspects, like computational complexity. Having just said this, we must nevertheless admit that massively parallel architectures are having an impact on the development of mathematical models. Research in progress may eventually support arguments for considering only the scientific problem as fixed, and the mathematical model chosen for the solution as being architecture dependent. When an application is broken down into its main computational elements, it will quite frequently turn out that some of them are of a general purpose type; that is, they occur in many applications from different fields. Let these be called "basic routines," in the sense defined in Chapter 3. Sameh (1986) has presented a set of applications and a certain number of basic routines contained in them as shown in Table 4-1. Routines that require further partitioning are marked in the table by Gil. The basic routines 1 to 10 can be allocated to levels 3 to 5 of the hierarchy mentioned at the beginning of this section. As more applications are added and characterized, the grid should grow much more slowly in the horizontal direction than in the vertical. That is, we expect that a comparatively small set of basic routines, for which the question of an efficient implementation can be thoroughly studied, will

OCR for page 31
35 o X X X r_ U) so o A: V, A 5- U. . - o CO ED U) a) - o C) - so o - c) - X X X ~ X X X X X X X ~ X X X X Cal X X X X ~ ~ X X ~ ~ X X u) in 0 C) .,4 ~ C) . - u ~ u' a) .,4 ~ ~ ~ o o o . - U U. . - ~ ^ - o ~ ~ ~ ~ ~ ~ o o .,, . - ~ in ~ so cd e ~ = = ~ 0' U ~ ~ ~ U - ~ V U . - =~ - ~ U ~ V ~ ~ ~ U ~ ~ - 0 ~ ~ ~ U - o, ~ C~ V 0 U a~~ ~J u . - U) s~ a' U ~ a, 0 ~: ~q o s~ N ~ .,' . - U S~ a) ~ U a, hO - u c~ a) s~ .. o u, ~ cn ~ ~ O _' ~ ~ _ O _' cn (Q ~ . - O S~ ~ U u' _~ c~ S~ u O ~q ~Q ~ O S~ ~ ~ ~ U .,- ~ ~ 0 a u cn u ~ a U2 ~Q ~ EE ~ ~ ~ -3c 0 ~ ~ ~ 0 ~ ~ ~ ~ C) ,= ~ ~ ~ ~ U] _~ U] U r: ~ a) ~ ~ u, ~ :^ ~ ~ ~ ~ ~ ~ a) a; 0 U, ~ ~ ~ -= U ^m ~ ~ ~ s~ U) C~ ~ U 5- C~ ~ ~ ~ ~ ~ ~ ~ 0 ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ U -! .- s~ ~ ~ ~ ~ ~ 5- U ~ ~ ~ :d ~ ~ -= Cd ~ a, ~ 0 0 u v 0 a) ~ ~ ~ .- U, . - ~ ~ V] s~ ~ ~ ~ u ~ ~ 0 ~ ~ a, V ~ ~ ~ U] ~ ~ ~ r: ~ U o ~ ~ u = o U] C~ z =: ~ U] ~ _ .~ ~ ~ c~ ~ ~ ~ ~ r~ oo ~ O ~1

OCR for page 31
36 be major constituents of a large spectrum of supercomputer applications. In addition to giving a functional decomposition of an application, it is useful to present some quantitative information on the computational complexity of its basic routines to understand their relative weights in the total computation. In many cases it will be possible to describe the dimensions of a partial problem in terms of the parameters of the total model. As an illustration one may think of a problem that is based on a ~ * M * N grid involving the solution of K * M tridiagonal linear systems of dimension N. In this way we arrive at a formal description of an application, similar in spirit with the one we proposed for the various architectures. Once such a profile is fully constructed, the focus in evaluating the performance of a particular supercomputer system for a given application involves measuring the performance of the basic routines on the computer system and combining the fractional contributions of their occurrence and their respective computational characteristics. Finally, because alternative approaches to applications may be radically different, new approaches should be included as subcategories of specific application areas as they are developed and studied. For example, computational fluid dynamics has an entry in the Table 4-1 as it stands, but it might have an additional entry that recognizes the characteristics of its solution by methods of cellular automata--a radical and debatable approach--as this technique matures. Example of an Application In this section we illustrate the concept of a hierarchical description of an application to indicate that the end user, the scientist who creates applications for supercomputer systems, already views problems in the way that we are advocating. An application refers to a particular implementation of a more general scientific problem. The original problem statement is independent of any notion of machine architecture. As an example, we use the code SNEX as described by Wienke (19851. The physical problem of interest is in the area of transport physics. The mathematical model chosen to solve this problem involves the solution of the transport equation. The author notes that a choice has been made regarding the solution of the transport equation. Whereas a typical method of solution involves finite difference or finite element techniques on fixed or iteratively defined meshes, this implementation assigns discrete values-to some of the independent variables by the discrete ordinates method and then solves the resulting system of uncoupled ordinary linear differential equations. The

OCR for page 31
37 numerical method chosen to effect the quadrature s is an adaptive Newton-Cotes 7-point routine. The SNEX code is the final high-level language encoding of the application for the choices that have been made. We note that, from the description of SNEX, the hierarchical model that we advocate is not foreign to the developers of large scientific and engineering applications. REQUIREMENTS FOR PRODUCING MEASUREMENTS Riganati and Schneck (1984) have observed that there is an absence of basic metrics for calibrating supercomputer performance in an absolute reference frame. As to the parameters noted in Chapter 3 for characterization of vector processors, Hackney (1985) has reported on the utility of m/2, the vector length necessary to achieve half the asymptotic performance; sl/2, the parallel task size necessary to achieve half the asymptotic performance; and TOO, the asymptotic performance in millions of floating point operations per second (MFLOPS). These measures permit the analysis of the overhead associated with the use of vector or parallel hardware features, but they fall short of characterizing the entire system. Additional parameters must be established, and corresponding metrics defined, to permit thorough measurement and comprehensive analysis of the individual components that combine to produce total system performance. High performance can be achieved only through a combination of hardware and software advances, and these advances must be based on a thorough understanding of the measured interaction of the application and the architecture on which it is executed. Hardware monitors, such as those currently available on the Cray X-MP, provide necessary input to the formulation of models of execution. Software monitors and simulators provide different information and, if designed in a general fashion, can be directed toward understanding performance across systems (Martin et al., 1982; Martin et al., 19839. Some of the questions that can be resolved by the use of monitoring tools include (but are not limited to) the following: o How does the instruction profile of an application change relative to the architecture on which it executes? In particular, how significant to the application are classes of instructions such as floating point operations, integer arithmetic, memory references, jumps, or address computations? o For vector architectures how much vectorization is being exploited, how much of the vector power can be used concurrently, what are the average vector lengths, and what are the significant strides?

OCR for page 31
o What is the memory referencing pattern? Is it regular? If not, how is it irregular? 0 For a parallel system, what degree of parallelism is being achieved, how much time is wasted when parallel tasks wait to coordinate, and how many memory references are local or global? How much communication is required per computation? O How fully is the architecture used? For example, are there instructions, registers, communication links, and other capabilities that are never being used by the compiled code (Summer Workshop on Parallel Algorithms and Architectures, 19861? These considerations support the following recommendation: Supercomputer systems should be provided with hardware and software for the collection of performance evaluation data. FIVE STAGES IN PERFORMANCE EVALUATION The following approach attempts to synthesize the considerations described above, outlining several stages in the process of performance evaluation. The goals are both to determine performance capabilities of existing systems and to predict performance of future systems on specified programs, representative of larger classes of applications. Determining the future system requirements for these and developmental applications can provide the means by which to influence designs of future generations of supercomputers. Stage 1 The first stage in the process is to determine the major supercomputer application areas and the predominant mathematical solution techniques. If radical techniques (for example, chaotic relaxation (Beaudet, 1978~) are under development, they should also be noted. Several reports present some of the current work (Kashiwagi, 1985; Hwang, 1985; Fernbach, 1984; Rodrigue et al., 1984), the report on the Japanese superspeed project being a notable contribution. Evolving areas, such as knowledge processing, should be included, although their characterization will be significantly more difficult. Stage 2 The second stage is to select a collection of representative programs covering the scientific disciplines and solution techniques for the

OCR for page 31
39 application areas. For this work, full applications are much more desirable than benchmark programs or simple kernels of problems. Simultaneously, a selection should be made regarding the target architectures for the study. Some effort should be expended at this stage to determine the appropriate representation of the application for the research. Ideally, a machine-independent representation would yield the greatest amount of information, yet this is an unlikely possibility since applications are generally written for a particular architecture. A philosophical choice must be made regarding how extensively to modify a program as it is analyzed on the defined target architectures. Modifications could be high-level language changes to enable execution, syntax changes to permit a compiler to recognize vector and parallel sections of code, algorithmic changes to enhance the suitability of a specific architecture, or total reconsideration of the underlying mathematical model. Clearly, each of these considerations has an associated cost. Stage 3 The third stage is to define the appropriate parameters of the applications and the architectures that will allow models to be developed. At a practical level, these will be specific architectural and computational models. At a conceptual level, they will provide a frame for abstractions and generalizations. Examples of possible sets of parameters follow. Architecture Definition, A1 Let the architecture definition, A1, be given by A1 = (N. CT, TOO, F. PP, M1, M2, M3, M4, M5 where o N = number of processors (connection scheme, bandwidth, approximate communication cost, synchronization overhead, sl/2) o CT = cycle time o r00 = peak rate (in MFLOPS) o F = number of functional units (per processor) o PP = number of pipes (per processor), cost of use, length o M1 = number of vector registers (per processor), length, m/2 o M2 = cache size, access time, cost for misses 0 M3 = main memory size, access time, contention, banks

OCR for page 31
40 o M4 ~ extended or virtual memory size, access time o MS ~ I/O behavior, bandwidth. Application Definition, A2 Let the application definition, A2, be given by where A2 = (CA, V, P. M, I/O), 0 CA = characteristics of the application (number of floating point operations or logical inferences, amount of memory traffic, total storage requirements, branch behavior) o ~ = degree of vectorization (natural and obtainable by a vectorizing compiler), average vector lengths, strides 0 P = degree and type of parallelism, granularity, balance 0 M = memory references, number relative to floating point operations, access patterns, likelihood of occurring in M1 to M4 as defined above o I/O = I/O requirements (if they exist beyond the capacity of M41. Stage 4 In the fourth stage, the metrics necessary to understand the performance of the models relative to the parameters of Stage 3 will be defined. Using these metrics, measurements, for both experimentation and evaluation, can be made in an environment controlled by the investigator. The environment must be defined in a consistent manner, with the realization that changing any of the components of the environment can have an impact on measured performance. Resulting from this stage are an environment vector EV and a performance vector PV. Environment Vector, EV Let the environment vector, EV, be given by where EV = (Al, C, OS, PE), o Al = architecture definition o C = compiler (automatic optimization, vectorization, parallelization) 0 0S = operating system (scaled toward throughput (multiple users) or turnaround (single user))

OCR for page 31
41 0 PE ~ peripheral equipment (for example, I/O system). Performance Vector, PV Let the performance vector, Pa, be given by PV = (A2, R), where o A2 = application definition o R = net processing rate of Al on A2 (in MFLOPS or millions of operations per second). Stage 5 Finally, the fifth stage will be to assess the relationship between the computational and architectural models. Is there an inherent pairing? What parameters and metrics are missing? An initial attempt at mapping applications to architectures will be made at this point, and with it an assessment of the performance of specific systems given a predefined applications set. The results will be in the form of a set of ordered pairs of performance information; specifically, each ordered pair will provide the net processing rate of a particular implementation of an application on an architecture, executed within a defined environment. There will not be a single performance value, but rather an array of values. By carefully weighting the elements of the array, classes of applications and architectures will be partitioned according to their mutual conformability. SYSTEM PERFORMANCE I S SUES In addition to the challenge of matching applications and architectures, we need technology for specifying and evaluating system performance of advanced architectures. We have noted that the greatest value of increased supercomputer capability lies in the potential to solve problems that are not now solvable, that is, to create new workloads. Thus, not only do we need advanced architectures, we also need productive environments for developing new applications. Designers have always attempted to balance supercomputer systems carefully; for example, high-performance processors were complemented by large memories and large bandwidth I/O. Future supercomputers should present a balanced hardware-software system that includes the following features:

OCR for page 31
42 o Efficient management of asynchronous tasks and efficient interprocessor communication o Large, efficient memories 0 High-speed channels and peripherals o High-performance secondary storage 0 High-speed networking technology, for example, protocols and interfaces 0 Very high-speed graphics o Program monitors (profilers) o Productive environment for software development and management o Hardware assists for debugging and performance measurement o Fault tolerance and graceful degradation of multiprocessor systems o Scalability to various sizes, so that expansion will permit upward compatibility of applications programs, and so that small versions may be built with identical software interfaces. System measurement and evaluation are extremely complex. Thus, they are active areas of basic research that merit encouragement. CONCLUSION The performance of an application on an architecture cannot be considered in isolation. The execution environment, including the compiler, operating system, and peripheral equipment, must be fully understood before measurements can be analyzed in context. Defining and quantifying these contributions and also accounting for the level of human effort required to use a system efficiently are critical to the scientific evaluation of supercomputer performance. We have presented a method for structuring experiments to measure performance within a controlled environment. By developing detailed descriptions of applications and architectures within this context, and using the information gathered at successive stages of the defined hierarchies, we hope to model the performance of an application on a supercomputer system. Information at the lowest levels will provide the processing potential of each possible computational mode (scalar, vector, parallel), and extensions of Amdahl's Law (Amdahl, 1967) will direct us toward expectations of total performance based on the relative weights assigned to the various processing modes for a given workload. * Amdahl's Law gives the ratio of the time to to do a given task at a given rate to the time to do do it if the task is divided into a fraction f, performed at the given rate, and a fraction 1 - f, performed at a rate r times as great. The relationship may be written symbolically as to/t = (f + (1 - f)/r)~ .

OCR for page 31
43 This theoretical prediction will then be tested against measurements of the system on the given workload. If the comparison of expected and achieved performance produces large discrepancies, then the models will be re-examined and modified to include information that was initially missing. Successive iterations of this process, as suggested by Figure 2-3, should enable the understanding of current system performance, and should permit reasonable predictions of future behavior of modified workloads on the current system, or of the current workload on a different system. Critical to the entire process is the recording of sufficient information concerning the environment of the experiments to ensure reproducibility. The goals of this plan are long term. Nevertheless, there is much to be learned incrementally. Requirements for fulfilling the goals, such as tools for making and analyzing measurements, can have an immediate impact on current performance evaluation work. Certainly, architecture designers, application developers, and system engineers will all profit from the establishment of a body of information, collected in a scientific manner, describing the analysis and measurement of a variety of applications on a spectrum of existing computer systems. This chapter has pointed to research directions that should extend the scope of supercomputer performance evaluation. The level of interest in the field may be inferred by the numerous references cited. Chapter 2 discusses the importance of the topic, while Chapter 5 cites various programs with interest in high-performance computing and associated evaluation of it. As the field advances, increased interaction among its participants will enhance the usual verification and adoption of useful results. The following conclusion embraces these points: Performance evaluation of supercomputers is an emerging area of significant interest and importance. There are numerous ongoing efforts by industrial. governmental and academic laboratories; but more effective mechanisms for teaching assessing and disseminating progress in this area are desirable. REFERENCE S Amdahl, G. 1967. The validity of the single processor approach to achieving large scale computing capabilities. American Federation of Information Processing Societies Conference Proceedings. Vol. 30. Beaudet, C. M. 1978. Asynchronous iterative methods for multi-processor. Journal of the Association of Computing Machinery. Vol. 25. April.

OCR for page 31
44 Dongarra, J. J., and I. S. Duff. 1985. Advanced Architecture Computers. Technical Memorandum 57. Chicago: Argonne National Laboratory. October 10. Fernbach, S. 1984. Applications of supercomputers in the USA--today and tomorrow. In Supercomputers: Design and Applications. Los Alamitos California: Computer Society, Institute of Electrical and Electronics Engineers. Flynn, M. J. 1972. IEEE Transactions on Computers C-21:948-960. Hackney, R. W. 1985. (TOO, m/2, sl/2) Measurements on the 2-CPU Cray X-MP. Parallel Computing 2:1-14. Hwang, K. 1985. Multiprocessor supercomputers for scientific/engineering applications. IEEE Computer 18~6~:57-83. Kashiwagi, H. 1985. The Japanese super-speed computer project. Future Generations Computer Systems 1:~3~153-160. Kuck, D. J. 1978. The Structure of Computers and Computations. Vol. 1, pp. 316-318. New York: John Wiley & Sons. Martin, J. L., I. Y. Bucher, and T. T. Warnock. 1982. Workload Characterization: Tools and Techniques. Technical Report LA-UR-82-3213. Los Alamos, New Mexico: Los Alamos National Laboratory. December. Martin, J. L., A. L. Dana, and T. T. Warnock. 1983. Tools for measuring software performance of vector architectures. In Proceedings. Symposium on Application and Assessment of Automated Tools for Software Development. November. Riganati, J. P., and P. B. Schneck. 1984. Supercomputing. IEEE Computer 17~10~:97-110. Rodrigue, G., et al. 1984. Large-scale scientific computation. In Supercomputers: Design and Applications. Los Alamitos, California: Computer Society, Institute of Electrical and Electronics Engineers. Sameh, A. 1986. Personal communication. Urbana, Illinois: University of Illinois. March.

OCR for page 31
45 Schwartz, J. 1983 on SS Designs. Institute, New York University. . A Taxonomic Table of Parallel Computers Based Technical Report UCN-69. New York: Courant Summer Workshop on Parallel Algorithms and Architectures. 1986. Report. UMIACS-TR-86-1, CS-TR-162S. College Park, Maryland: University of Maryland. February. No Wallich, P., and G. Zorpette, eds. 1986 Spectrum 23~1~:36-39. January. . Minis and mainframes . IEEE Wienke, B. R. 1985. SNEX: Semianalytic solution of the one-dimensional discrete ordinates transport equation with diamond differenced angular fluxes. Computer Physics Communications 38:397-402.

OCR for page 31