Systematic errors, particularly in satellite data, create biases in the simplest statistical measures, be they spatial or temporal averages. In addition to the problem of limited sample size discussed above (see also Preisendorfer and Barnett, 1983), such gross statistics can obscure important characteristics of the differences such as geographical or temporal biases (see, e.g., Barnett and Jones, 1992). For the SST example above, such biases may arise from systematic errors in the algorithms applied to correct for atmospheric effects on satellite estimates of SST. As an example, volcanic aerosols injected into the atmosphere by the El Chichon volcano in 1982 contaminated infrared-based satellite estimates of SST within about 30° of the equator for a period of about 9 months. As another example, microwave-based satellite estimates of SST have been found to be biased upward in regions of high surface winds because of incomplete corrections for the effects of wind speed on ocean surface emissivity.
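The way a gross average can conceal a geographical bias can be illustrated with a small sketch. The numbers below are synthetic and chosen only for illustration: a hypothetical difference field (satellite minus in situ SST) is given a bias of one sign within 30° of the equator and the opposite sign elsewhere. The function names `area_weighted_mean` and `band_mean` are assumptions of this example, not part of any established package.

```python
import math

def area_weighted_mean(values, lats_deg):
    """Mean of values weighted by cos(latitude), a proxy for zonal-band area."""
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def band_mean(values, lats_deg, lat_min, lat_max):
    """Area-weighted mean restricted to latitudes in [lat_min, lat_max]."""
    selected = [(v, lat) for v, lat in zip(values, lats_deg)
                if lat_min <= lat <= lat_max]
    return area_weighted_mean([v for v, _ in selected],
                              [lat for _, lat in selected])

# Synthetic zonal-mean differences on 10-degree bands (centers at -85, ..., 85).
# Hypothetical values: -0.5 within 30 degrees of the equator, +0.5 elsewhere.
lats = list(range(-85, 86, 10))
diff = [-0.5 if abs(lat) <= 30 else 0.5 for lat in lats]

global_mean = area_weighted_mean(diff, lats)
tropical_mean = band_mean(diff, lats, -30, 30)

print(f"global mean difference:   {global_mean:+.4f}")
print(f"tropical mean difference: {tropical_mean:+.4f}")
```

In this contrived case the global mean difference is close to zero even though every band is biased, which is why the comparison needs to be resolved in space (or time) before an average is interpreted.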
Evaluation of numerical model simulations, whether against observations or against other model simulations, presents additional problems. Models produce a large number of output variables on a dense space-time grid. An ocean circulation model, for example, typically outputs current velocities, temperatures, and salinities at a number of different depths, as well as the sea surface elevation. It is not reasonable to expect present models to reproduce the details of the actual circulation, but one hopes that basic statistics such as the mean or variance of some characteristics of the actual circulation are well represented by the model. Assessing the strengths and weaknesses of a model is thus complicated by the large number of possible variables that can be considered. For example, present global ocean circulation models can reproduce the statistics of sea level variability with some accuracy but generally underestimate the surface eddy kinetic energy computed from surface velocities (see, e.g., Morrow et al., 1992). A model that successfully represents the statistics of some geophysical quantity at one depth level may misrepresent the statistics of the same quantity at a different depth level. An even more stringent test of a model is how accurately it represents cross-covariances between different variables (which can be shown to be related to eddy fluxes of quantities such as heat, salt, or momentum). Some of these issues are discussed by Semtner and Chervin (1992) with regard to comparisons of numerical model output to satellite altimeter estimates of sea level variance and eddy kinetic energy. The overall goal of such comparisons is to guide further research in an effort to develop more accurate numerical models.
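The second-order statistics mentioned above can be made concrete with a minimal sketch. It assumes the usual definitions: eddy kinetic energy as half the sum of the variances of the two horizontal velocity components, and an eddy flux as the time-mean product of anomalies of two variables (for instance, meridional velocity and temperature). The short time series are synthetic, and the "model" series is simply the "observed" velocity scaled by a hypothetical factor of 0.5.

```python
from statistics import fmean

def anomalies(series):
    """Departures of a time series from its own time mean."""
    mean = fmean(series)
    return [x - mean for x in series]

def eddy_kinetic_energy(u, v):
    """EKE = 0.5 * (mean(u'^2) + mean(v'^2)) for velocity components u, v."""
    up, vp = anomalies(u), anomalies(v)
    return 0.5 * (fmean([a * a for a in up]) + fmean([b * b for b in vp]))

def cross_covariance(x, y):
    """Zero-lag cross-covariance mean(x'y'); with x = v and y = T this is
    proportional to an eddy heat flux."""
    xp, yp = anomalies(x), anomalies(y)
    return fmean([a * b for a, b in zip(xp, yp)])

# Synthetic "observed" series at one location (illustrative values only).
u_obs = [0.1, -0.1, 0.1, -0.1]      # zonal velocity, m/s
v_obs = [0.2, -0.2, 0.2, -0.2]      # meridional velocity, m/s
t_obs = [10.5, 9.5, 10.5, 9.5]      # temperature, deg C

# Hypothetical "model" velocities with half the observed amplitude.
u_mod = [0.5 * x for x in u_obs]
v_mod = [0.5 * x for x in v_obs]

eke_obs = eddy_kinetic_energy(u_obs, v_obs)
eke_mod = eddy_kinetic_energy(u_mod, v_mod)
flux_obs = cross_covariance(v_obs, t_obs)
flux_mod = cross_covariance(v_mod, t_obs)

print(f"EKE ratio (model/obs):  {eke_mod / eke_obs:.2f}")
print(f"v'T' ratio (model/obs): {flux_mod / flux_obs:.2f}")
```

Because eddy kinetic energy is quadratic in velocity, halving the velocity amplitude reduces it by a factor of four, whereas the covariance with an unchanged temperature series is only halved. A model can therefore appear adequate in one statistic and deficient in another.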
The types of questions that need to be addressed by techniques for comparing two different geophysical fields, whether they consist of observations or model simulations, are indicated by the following:
How, where, and when do the two independent estimates of a field differ?
Are the differences statistically significant? Addressing this question may lead to the development of appropriate bootstrap techniques for estimating probability distributions.
What statistical comparisons are most appropriate for evaluating a model?
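As one possible illustration of the second question, the sketch below estimates a percentile confidence interval for the mean difference between two estimates of a field using a moving-block bootstrap. This is a swapped-in, generic technique rather than a method prescribed here: resampling blocks instead of individual values is one common way to respect serial correlation, and the block length, number of resamples, and the function name `block_bootstrap_mean_ci` are all assumptions of the example. The difference series is synthetic.

```python
import math
import random
from statistics import fmean

def block_bootstrap_mean_ci(diffs, block_length=1, n_resamples=2000,
                            alpha=0.05, seed=0):
    """Percentile confidence interval for the mean of `diffs`.

    Resamples contiguous blocks of `block_length` values with replacement;
    block_length=1 reduces to the ordinary bootstrap for independent data.
    """
    n = len(diffs)
    if not 1 <= block_length <= n:
        raise ValueError("block_length must be between 1 and len(diffs)")
    rng = random.Random(seed)            # fixed seed for reproducibility
    n_blocks = -(-n // block_length)     # ceiling division
    max_start = n - block_length
    means = []
    for _ in range(n_resamples):
        sample = []
        for _ in range(n_blocks):
            start = rng.randint(0, max_start)   # inclusive on both ends
            sample.extend(diffs[start:start + block_length])
        means.append(fmean(sample[:n]))  # trim to the original length
    means.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx
    return means[lo_idx], means[hi_idx]

# Synthetic paired differences (estimate A minus estimate B) at 60 times:
# a hypothetical offset of 0.5 plus a smooth, serially correlated component.
diffs = [0.5 + 0.2 * math.sin(i) for i in range(60)]

lower, upper = block_bootstrap_mean_ci(diffs, block_length=5)
print(f"95% interval for the mean difference: [{lower:.3f}, {upper:.3f}]")
print("interval excludes zero:", lower > 0 or upper < 0)
```

If the interval excludes zero, the mean difference would be judged significant at the chosen level under the assumptions of the resampling scheme. For real geophysical fields, the choice of block length and the treatment of spatial correlation are open design decisions that this sketch does not resolve.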