The role of a validation period is to provide an independent assessment of the accuracy of the reconstruction method. As discussed above, it is possible to overfit the statistical model during the calibration period, which has the effect of underestimating the prediction error. Reserving a subset of the data for validation is a natural way to offset this problem. If the validation period is independent of the calibration period, any skill measures used to assess the quality of the reconstruction will not be biased by the potential overfit during the calibration period. An inherent difficulty in validating a climate reconstruction is that the validation period is limited to the historical instrumental record, so it is not possible to obtain a direct estimate of the reconstruction skill at earlier periods. Because of the autocorrelation in most geophysical time series, a validation period adjacent to the calibration period cannot be truly independent; if the autocorrelation is short term, the lack of independence does not seriously bias the validation results.
Some common measures used to assess the accuracy of statistical predictions are the mean squared error (MSE), reduction of error (RE), coefficient of efficiency (CE), and the squared correlation (r²). The mathematical definitions of these quantities are given in Box 9.1. MSE measures how close a set of predictions is to the actual values and is widely used throughout the geosciences and statistics. It is usually normalized and presented in the form of either the RE statistic (Fritts 1976) or the CE statistic (Cook et al. 1994). The RE statistic compares the MSE of the reconstruction to the MSE of a reconstruction that is constant in time with a value equal to the sample mean of the calibration data. If the reconstruction has any predictive value, one would expect it to do better than just the sample average over the calibration period; that is, one would expect RE to be greater than zero.
The CE, on the other hand, compares the MSE to the performance of a reconstruction that is constant in time with a value equal to the sample mean of the validation data. This second constant reconstruction depends on the validation data, which are withheld from the calibration process, and therefore presents a more demanding comparison. In fact, CE can never exceed RE, and the gap between them widens as the difference between the sample means for the validation and the calibration periods increases.
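The relationship between RE and CE described above can be illustrated with a minimal sketch. It assumes the standard definitions, in which each statistic is one minus the ratio of the reconstruction's validation-period MSE to the MSE of the corresponding constant reconstruction. The function name `skill_scores` and the data values are hypothetical and chosen only for illustration; they are not taken from any reconstruction discussed here.

```python
import numpy as np

def skill_scores(obs_cal, obs_val, pred_val):
    """Return (mse, re, ce) for predictions over a validation period.

    obs_cal  : observed values in the calibration period
    obs_val  : observed values in the validation period
    pred_val : reconstructed values for the validation period
    """
    obs_cal = np.asarray(obs_cal, dtype=float)
    obs_val = np.asarray(obs_val, dtype=float)
    pred_val = np.asarray(pred_val, dtype=float)

    # MSE of the reconstruction over the validation period.
    mse = np.mean((obs_val - pred_val) ** 2)

    # MSE of the two constant-in-time reference reconstructions.
    mse_cal_mean = np.mean((obs_val - obs_cal.mean()) ** 2)  # RE baseline
    mse_val_mean = np.mean((obs_val - obs_val.mean()) ** 2)  # CE baseline

    re = 1.0 - mse / mse_cal_mean
    ce = 1.0 - mse / mse_val_mean
    return mse, re, ce

# Made-up values: the validation period is cooler than the calibration period.
obs_cal = [0.1, 0.3, 0.2, 0.4]
obs_val = [-0.2, 0.0, -0.1, -0.3]
pred_val = [-0.1, 0.1, 0.0, -0.2]

mse, re, ce = skill_scores(obs_cal, obs_val, pred_val)
print(f"MSE = {mse:.4f}, RE = {re:.2f}, CE = {ce:.2f}")
```

With these made-up numbers the reconstruction has an MSE of 0.01, an RE of roughly 0.94, and a CE of roughly 0.20. The gap arises because the calibration and validation means differ by 0.4: the mean squared deviation of the validation data about the calibration mean equals its mean squared deviation about the validation mean plus the squared difference of the two means, so the RE baseline is easier to beat.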
If the calibration has any predictive value, one would expect it to do better than just the sample average over the validation period and, for this reason, CE is a particularly useful measure. The squared correlation statistic, denoted as r², is usually adopted as a measure of association between two variables. Specifically, r² measures the strength of a linear relationship between two variables when the linear fit is determined by regression. For example, the correlation between the variables in Figure 9-1 is 0.88, which means that the regression line explains 100 × 0.88² = 77.4 percent of the variability in the temperature values. However, r² measures how well some linear function of the predictions matches the data, not how well the predictions themselves perform. The coefficients in that linear function cannot be calculated without knowing the values being predicted, so r² is not in itself a useful indication of merit. A high CE