uniform memory access (NUMA) inter-node communications. Although this hybrid programming model adds another layer of complexity, it offers a useful path to parallelism if it can be efficiently implemented. For example, rather than doing two-dimensional domain decomposition in longitude and latitude, with the hybrid model one might decompose and use MPI only in latitude while treating parallelism in longitude with threads spread across the processors on each node.
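As a rough illustration of the hybrid decomposition just described, the following minimal sketch assigns latitude bands to MPI ranks and longitude chunks to threads. It uses no MPI or threading library; the function names `split_range` and `hybrid_decomposition`, the grid dimensions, and the rank and thread counts are all hypothetical, chosen only to show the index arithmetic.

```python
def split_range(n, parts, index):
    """Return the half-open range [start, stop) of block `index` when
    n items are divided as evenly as possible among `parts` blocks."""
    base, extra = divmod(n, parts)
    # The first `extra` blocks each receive one additional item.
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def hybrid_decomposition(nlat, nlon, n_ranks, n_threads):
    """Map (rank, thread) -> (latitude range, longitude range).

    Latitude bands stand in for MPI ranks (assumed one per node);
    longitude chunks stand in for threads sharing that node's memory.
    """
    layout = {}
    for rank in range(n_ranks):
        lat_range = split_range(nlat, n_ranks, rank)
        for thread in range(n_threads):
            lon_range = split_range(nlon, n_threads, thread)
            layout[(rank, thread)] = (lat_range, lon_range)
    return layout


# Hypothetical grid: 64 latitudes x 128 longitudes, 4 ranks, 8 threads each.
layout = hybrid_decomposition(nlat=64, nlon=128, n_ranks=4, n_threads=8)
print(layout[(1, 3)])  # ((16, 32), (48, 64))
```

In this layout only the latitude boundaries between ranks would require message passing; neighbouring longitude chunks belong to threads on the same node and can exchange data through shared memory.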
Important considerations from a software standpoint are:
1. optimizing the placement of data with respect to the processor(s) that will use it most;
2. minimizing the number and maximizing the size of messages sent between nodes;
3. maximizing the number of operations performed on data that is in cache, while minimizing the amount of data required to be in cache for these operations to occur.
Normally, one expects the operating system or job scheduler to take care of item 1 automatically. If data is not located on the same node as the processor that uses it most often, performance will suffer and is likely to vary considerably from run to run. Item 2 requires careful planning of MPI calls. Item 3 requires the most code changes, such as subdividing the computational domain into blocks small enough that all data for any single block fits into cache. More radical steps include converting code from Fortran 90 back to Fortran 77 to obtain explicit do-loops, re-ordering array and loop indices, in-lining subroutine calls, fusing loops, and other optimizations that one would leave to the compilers if only they were capable of them.
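The blocking and loop-fusion ideas can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the cache size, the number of fields, and the names `block_edge`, `tiles`, and `blocked_fused_mean_and_norm` are hypothetical, and pure Python will not show the cache benefit itself. Only the loop structure carries over to a compiled language.

```python
import math


def block_edge(cache_bytes, n_fields, bytes_per_value=8):
    """Largest square tile edge such that n_fields arrays of
    edge*edge values fit within cache_bytes (assumed cache budget)."""
    values_per_field = cache_bytes // (n_fields * bytes_per_value)
    return max(1, math.isqrt(values_per_field))


def tiles(nlat, nlon, edge):
    """Yield half-open (lat0, lat1, lon0, lon1) tiles covering the domain."""
    for lat0 in range(0, nlat, edge):
        for lon0 in range(0, nlon, edge):
            yield lat0, min(lat0 + edge, nlat), lon0, min(lon0 + edge, nlon)


def blocked_fused_mean_and_norm(a, b, edge):
    """Compute the pointwise mean of a and b and the sum of its squares.

    The two operations are fused into one loop body, so each value is
    reused while it is still at hand instead of being re-read in a
    second pass over the whole domain.
    """
    nlat, nlon = len(a), len(a[0])
    mean = [[0.0] * nlon for _ in range(nlat)]
    sum_sq = 0.0
    for lat0, lat1, lon0, lon1 in tiles(nlat, nlon, edge):
        for i in range(lat0, lat1):
            for j in range(lon0, lon1):
                m = 0.5 * (a[i][j] + b[i][j])
                mean[i][j] = m
                sum_sq += m * m
    return mean, sum_sq


# Hypothetical numbers: a 32 KiB cache shared by 4 double-precision fields.
edge = block_edge(cache_bytes=32 * 1024, n_fields=4)
print(edge)                              # 32
print(len(list(tiles(64, 128, edge))))   # 8
```

The tile edge is derived from the cache budget rather than hard-coded, so the same loop nest can be re-blocked for a machine with a different cache size by changing one argument.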