interprocessor faces to interior cells which is directly proportional to the granularity, and efficiency decays. Figure 11 displays parallel performance for the Euler equations, evaluated on an IBM SP2, a distributed memory machine, and confirms the good scalability of our algorithm. The effects of increasing the number of processors and increasing the size of the mesh are both apparent. Results of parallel RANS solvers under development in our laboratory for aeronautical applications confirm the theoretical efficiency increase that will be obtained when viscous fluxes are switched on.
The essential algorithm (convective and dissipative flux calculation, multigrid, viscous terms, etc.) is exactly the same as the one applied to the single block case. The only difference resides in the fact that an additional outer loop over all the blocks in the domain is added . The parallelization strategy, however, is quite different. Similarly to the single block code, the multiblock is parallelized using a domain decomposition model, a SPMD strategy, and the MPI Library for message passing. Since the sizes of the blocks can be quite small, sometimes further partitioning severely limits the number of multigrid levels that can be used in the flows. For this reason, it was decided to allocate complete blocks to each processor.
The underlying assumption is that there always will be more blocks than processors available. If this is the case, every processor in the domain would be responsible for the computations inside one or more blocks. In the case in which there are more processors than blocks available, the blocks can be adequately partitioned during a pre-processing step in order to at least have as many blocks as processors. This approach has the advantage that the number of multigrid levels that can be used in the parallel implementation of the code is always the same as in the serial version. Moreover, the number of processors in the calculation can now be any integer number, since no restrictions are imposed by the partitioning in all coordinate directions used by the single block program.
The only drawback of this approach is the loss of the exact load balancing that one has in the single block implementation. All blocks in the calculation can have different sizes, and consequently, it is very likely that different processors will be assigned a different total number of cells in the calculation. This, in turn, will imply that some of the processors will be waiting until the processor with the largest number of cells has completed its work and parallel performance will suffer. The approach that we have followed to solve the load balancing problem is to assign to each processor, in a pre-processing step, a certain number of blocks such that the total number of cells is as close as possible to the exact share for perfect load balancing.
One must note that load balancing based on the total number of cells in each processor is only an approximation to the optimal solution of the problem. Other variables such as the number of blocks, the size of each block, and the size of the buffers to be communicated play an important role in proper load balancing, and are the subject of current study. The implementation is fully scalable. Figure 12 shows an H-O grid around the Model 5415 hull divided into 20 blocks. Figure 13 shows speedups obtained on the 5415 for the zero Froude number condition. The shapes in the curves results from an interplay of forces. Increased cache hits push the curve to a superlinear (better than ideal) region. The wiggles are a result of deviations in the load balance from unity, or equal work (number cells) in each processor. Since the blocks are not all equal in size, the constraint that blocks are not shared among processors causes the taper as the number of processors approaches the number of blocks in the grid.
By utilizing a cell-center formulation suitable for parallel computing, flow solutions about complex geometries on the order of a half hour for a grid size up to one million mesh points have been achieved on 16 processors of an IBM SP2. Such efficiency makes our methodology suitable for routine calculations in the early stages of ship design. Also, an extension to the computation of unsteady flows has been made feasible by the speedup. Underwater control surfaces and transom sterns warrant the necessity of multiblock meshes. Preliminary testing of a multiblock version displays the scalability and efficiency of the method.
Our work has benefited greatly from the support of the Office of Naval Research through Grant N00014–93-I-0079, under the supervision of Dr. E.P.Rood. The selection, and implementation of the parallelization strategy presented here is the fruit of extensive collaborations with other students of the Princeton University CFD Laboratory. In particular we wish to acknowledge the contribution of Juan J.Alonso, and Andrey Belov.
 Toda, Y., Stern, F., and Longo, J., “Mean-Flow Measurements in the Boundary Layer and Wake and Wave Field of a Series 60 CB=0.6 Ship Model-Part 1: Froude Numbers 0.16 and 0.316,” Journal of Ship Research, v. 36, n. 4, pp. 360–377, 1992.
 Longo, J., Stern, F., and Toda, Y., “Mean-Flow Measurements in the Boundary Layer and Wake and Wave Field of a Series 60 CB=0.6 Ship Model-Part 2: Effects on Near-Field Wave Patterns and Comparisons with Inviscid Theory,” Journal of Ship Research, v. 37, n. 1, pp. 16–24, 1993.
 Hino, T., “Computation of Free Surface Flow Around an Advancing Ship by the Navier-Stokes Equations,” Proceedings,