well as the mean level in the validation period, in which case CE will be close to zero. An ideal validation procedure would measure skill at different timescales, or in different frequency bands using wavelet or power spectrum calculations. Unfortunately, the paucity of validation data places severe limits on the sensitivity of such measures. For instance, a focus on variations on decadal or longer timescales with the 45 years of validation data used by Mann et al. (1998) would give statistics with just (2 × 45 ÷ 10) = 9 degrees of freedom, too few to adequately quantify skill. This discussion also motivates the choice of a validation period that exhibits the same kind of variability as the calibration period. Simply using the earliest part of the instrumental series may not be the best choice for validation.
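The degrees-of-freedom arithmetic quoted above can be reproduced with a short sketch. The function name and arguments are hypothetical and chosen only for illustration; the rule 2 × (record length) ÷ (shortest timescale of interest) is taken directly from the example in the text and is a rough approximation, not an exact count.

```python
def approx_degrees_of_freedom(n_years, shortest_period_years):
    """Rule-of-thumb degrees of freedom for variations at or beyond a timescale.

    Assumption: follows the 2 * N / T arithmetic used in the text, where N is
    the length of the validation record and T the shortest timescale of
    interest, both in years.
    """
    if n_years <= 0 or shortest_period_years <= 0:
        raise ValueError("both arguments must be positive")
    return 2.0 * n_years / shortest_period_years


# 45 years of validation data, decadal or longer timescales.
decadal_dof = approx_degrees_of_freedom(45, 10)    # 9.0

# The same record examined at 30-year or longer timescales.
tridecadal_dof = approx_degrees_of_freedom(45, 30)  # 3.0
```

Under this rule the count falls in proportion to the timescale examined, which is why a short validation record says little about skill at low frequencies.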
Besides supplying an unbiased appraisal of the accuracy of the reconstruction, the validation period can also be used to adjust the uncertainty measures for the reconstruction. For example, the MSE calculated for the validation period provides a useful measure of the accuracy of the reconstruction; the square root of MSE can be used as an estimate of the reconstruction standard error. Reconstructions that have poor validation statistics (i.e., low CE) will have correspondingly wide uncertainty bounds, and so can be objectively identified as unreliable. Moreover, a CE statistic close to zero or negative suggests that the reconstruction is no better than the validation-period mean, and so its skill for time averages shorter than the validation period will be low. Some recent results reported in Table 1S of Wahl and Ammann (in press) indicate that their reconstruction, which uses the same procedure and full set of proxies used by Mann et al. (1999), gives CE values ranging from 0.103 to –0.215, depending on how far back in time the reconstruction is carried. Although some debate has focused on when a validation statistic, such as CE or RE, is significant, a more meaningful approach may be to concentrate on the implied prediction intervals for a given reconstruction. Even a low CE value may still provide prediction intervals that are useful for drawing particular scientific conclusions.
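As a minimal sketch of how these quantities relate, the following computes MSE, its square root, and CE over a validation period, and then forms an approximate prediction interval. The function names are hypothetical. CE is taken here as one minus the ratio of the sum of squared errors to the sum of squared deviations of the observations from their validation-period mean. The interval assumes roughly Gaussian, independent errors with standard deviation equal to the validation RMSE, which is a simplifying assumption and not a method prescribed by the text.

```python
import math


def validation_stats(observed, reconstructed):
    """Return MSE, RMSE, and CE for a validation period."""
    n = len(observed)
    if n == 0 or n != len(reconstructed):
        raise ValueError("series must be non-empty and of equal length")

    mean_obs = sum(observed) / n
    # Sum of squared reconstruction errors.
    sse = sum((o - r) ** 2 for o, r in zip(observed, reconstructed))
    # Sum of squared deviations from the validation-period mean.
    ss_mean = sum((o - mean_obs) ** 2 for o in observed)
    if ss_mean == 0:
        raise ValueError("observations are constant; CE is undefined")

    mse = sse / n
    return {"mse": mse, "rmse": math.sqrt(mse), "ce": 1.0 - sse / ss_mean}


def prediction_interval(value, rmse, z=1.96):
    """Approximate interval around one reconstructed value.

    Assumption: Gaussian errors; z = 1.96 gives roughly 95% coverage.
    """
    return (value - z * rmse, value + z * rmse)


# Illustrative numbers only, not data from any published reconstruction.
observed = [0.0, 1.0, 2.0, 3.0]
flat_reconstruction = [1.5, 1.5, 1.5, 1.5]  # reproduces only the mean level
stats = validation_stats(observed, flat_reconstruction)
interval = prediction_interval(flat_reconstruction[0], stats["rmse"])
```

In this example the reconstruction matches only the validation-period mean, so CE is zero, and the interval is as wide as the variability of the observations themselves.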
The work of Bürger and Cubasch (2005) considers different variations on the reconstruction method to arrive at 64 different analyses. Although they do not report CE, examination of Figure 1 in their paper suggests that many of the variant reconstructions will have low CE and that selecting a reconstruction based on its CE value could be a useful way to winnow the choices for the reconstruction. Using CE to judge the merits of a reconstruction is known as cross-validation and is a common statistical technique for selecting among competing models and subsets of data. When the validation period is independent of the calibration period, cross-validation avoids many of the overfitting problems that would arise if models were selected simply on the basis of RE.
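The selection step described above might be sketched as follows: each candidate reconstruction is scored by its CE over an independent validation period, and the candidates are ranked by that score. The function names, variant labels, and numbers are hypothetical illustrations and are not taken from Bürger and Cubasch (2005) or any other study.

```python
def coefficient_of_efficiency(observed, reconstructed):
    """CE = 1 - SSE / (sum of squares about the validation-period mean)."""
    n = len(observed)
    if n == 0 or n != len(reconstructed):
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / n
    sse = sum((o - r) ** 2 for o, r in zip(observed, reconstructed))
    ss_mean = sum((o - mean_obs) ** 2 for o in observed)
    if ss_mean == 0:
        raise ValueError("observations are constant; CE is undefined")
    return 1.0 - sse / ss_mean


def rank_by_ce(observed, candidates):
    """Return (name, CE) pairs sorted from highest to lowest CE."""
    scores = {
        name: coefficient_of_efficiency(observed, series)
        for name, series in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative validation-period observations and three made-up variants.
observed = [0.0, 1.0, 2.0, 3.0]
candidates = {
    "variant_a": [0.1, 0.9, 2.1, 2.9],  # tracks the observations closely
    "variant_b": [1.5, 1.5, 1.5, 1.5],  # matches only the mean level
    "variant_c": [3.0, 2.0, 1.0, 0.0],  # trend of the wrong sign
}
ranking = rank_by_ce(observed, candidates)
best_name, best_ce = ranking[0]
```

Because the validation data play no part in fitting the candidates, a variant cannot improve its ranking merely by fitting the calibration period more closely.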
The statistical framework based on regression provides a basis for attaching uncertainty estimates to the reconstructions. It should be emphasized, however, that this is only the statistical uncertainty and that other sources of error need to be addressed from a scientific perspective. These sources of error are specific to each proxy and are discussed in detail in Chapters 3–8 of this report. The quantification of statistical uncertainty depends on the stationarity and linearity assumptions cited above, the