
6 Statistical Power and Interpretation of the Study

Statistical power is discussed in section VII of the HTDS Draft Final Report and in additional material given to the subcommittee: appendix H of the HTDS protocol (May 1993), section II.G of the HTDS Pilot Study Final Report (January 1995), and a memorandum by Ken Kopecky (December 14, 1995). This review focuses primarily on thyroid malignancies as an example, although most or all of the points can be made about the other disease end points of interest.

FACTORS IN STATISTICAL POWER

Because the study results were essentially negative (that is, a finding of no increase in thyroid disease among those with higher estimated levels of ¹³¹I exposure compared with those with lower estimated levels), a critical issue is how to interpret the negative findings correctly. Part of the process of interpretation is to subject the study to a series of reality checks. Were the data of sufficiently good quality? Do the underlying patterns of exposure and disease agree with or counter the negative association? Are the confidence intervals wide enough for the results to be compatible with other studies that have found an association between ¹³¹I exposure and disease?

If, for example, this negative study did a very good job of estimating thyroid-disease rates but found that milk-drinkers have higher rates of disease than non-milk-drinkers and that those who lived directly downwind of the site have higher rates of disease than those in the more northern ("low-dose") counties, we might conclude that the pattern of thyroid morbidity fits the likely pattern of exposure and that the lack of a dose-response association could well be due to poor dose estimates. Similarly, a negative finding that is due to limitations in the assessment of thyroid-disease rates, because of either poor data collection or small numbers of subjects, does not support a conclusion that downwinders' disease patterns are unrelated to Hanford exposure patterns.

The power of a study of this type to detect a hypothesized increase in disease prevalence per unit dose (an absolute risk of 2.5% per Gy was used in the power calculations for thyroid carcinoma) depends largely on

· The size of the sample studied.
· The background prevalence of disease in the sample.
· The absence of biases that are caused by subject selection, reporting, or inaccurate detection of disease.
· The dose distribution in the sample.
· The adequacy of the dosimetry system in characterizing the dose of an individual.
· The independence of disease between individuals in the study, conditional on dose (for example, if there is no geographic clustering of disease or systematic geographic difference in disease rates due to other unknown factors).

SAMPLE SIZE AND ASSUMED BACKGROUND PREVALENCE OF DISEASE

The HTDS was successful enough in obtaining subjects who were willing to participate in the study that it essentially met its goals, so sample size is not a concern if the statistical-power assumptions and methods were appropriate. For the second point listed above, it was assumed in section VII of the HTDS Draft Final Report that the cumulative background prevalence of thyroid carcinoma, either previously diagnosed or detected in screening, would be 0.7% in females and 0.4% in males; these translate into 19 expected background cases (in the absence of a Hanford dose) in the cohort. Thus, if there is no increase in thyroid cancer due to the Hanford exposures, the 20 observed cases closely match the assumptions made in the statistical design. (For further discussion of the expected number of cases, see chapter 5 of the present report.)

EFFECT OF DOSIMETRY ERROR ON STATISTICAL POWER

The primary issues in determining whether the statistical power of the study was as expected are the dose distribution of the sample and the precision of the dosimetry system at the individual level. Distributional assumptions must be made in computing the power of the statistical tests or the sample size for a specified power, and one can ask how robust the results were to these assumptions.

The HEDR project acknowledged that parameter values used in its dose-reconstruction process were uncertain. Some of the uncertainties (those associated with release, dispersion, deposition, uptake modeling, and so on) were common to many individuals' dose estimates, whereas others (associated with food consumption, lifestyle, and so on) were individual-specific. The HEDR project expresses parameter-value uncertainty with subjective probability distributions, which quantify the state of knowledge as judged by the HEDR analysts. The propagation of uncertainty through the HEDR models results in a subjective probability distribution for each individual's dose estimate. The resulting distributions are very complicated, so, rather than an analytic characterization, a numerically produced random sample (the 100 alternative realizations of the HEDR dose) served as an approximate quantitative expression of the combined influence of the parameter-value uncertainties on the estimation of the individuals' doses. The random sample was drawn so that the correlation among individuals' dose estimates due to common sources of uncertainty would be preserved. To summarize the dose distribution for each person, the median of the 100 dose realizations was calculated, and the median forms the single "dose estimate".

The effect of dosimetry errors on the expected or achieved statistical power of the HTDS is not mentioned in the protocol section on statistical power. Instead, in the power calculations, the distribution of the median dose estimate for each person is used as though it is equivalent to true dose. It would be valid to ignore the dosimetry errors in the calculation of the statistical power for detecting nonzero parameter values in a linear dose-response model if both the following criteria hold:

· The average value of true dose for all subjects with the same estimated dose is equal to the estimated dose.
· Dose errors are independent from subject to subject, or at least any correlation between subjects' true doses (given estimated doses) is due to additive rather than multiplicative components of error.

Consider the second criterion. If all the doses were off by a constant unknown additive amount, then only the intercept terms, not the slope coefficients in the linear models, would be affected by the correlated dose errors. However, for most shared sources of uncertainty, a multiplicative effect is likely. For example, if all the errors in the doses were due to uncertainties in the milk transfer coefficient appropriate for herds near Hanford, this would affect all doses multiplicatively; the estimated slope terms would retain the uncertainty, even in an infinitely large epidemiologic study.

VIOLATION OF THE BERKSON ERROR ASSUMPTION

The statistical literature indicates that, if the first criterion is met, the estimation of linear dose-response models is essentially unaffected by independent dosimetry errors. In fact, an important measurement-error correction technique, known as "regression calibration", consists of the calculation of the average value of true dose, given estimated dose (see chapter 3 of Carroll and others, 1995), and the substitution of this average in the regression analysis. The first criterion is sometimes called the Berkson model of measurement error. Berkson errors arise when dose estimates are given as the average value of possible doses of a category of subjects who individually have a range of possible doses. The aim of the designers of the dosimetry system (the HEDR project) was evidently to provide the same average dose for all members of a particular category defined by "input data" (such as specific geographic location on a particular day with particular meteorologic conditions and specific age). Because the input data do not by themselves define the dose precisely, a Monte Carlo procedure was used in which all possible factors affecting a given individual's dose were considered to be random and 100 possible doses were drawn. Use of the average of possible individual doses as the dose estimate ideally results in Berkson error.
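The contrast between Berkson error and classical error in a linear dose-response model can be illustrated with a small simulation. The sketch below is not part of the HTDS or HEDR analyses: the lognormal dose distribution, the baseline risk of 0.5%, the log-scale error standard deviation of 0.5, the sample size, and all function names are assumptions chosen only for illustration. Under these assumptions the least-squares slope on the assigned dose should stay close to the true slope when the error is of the Berkson type, and should be attenuated when the error is classical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                          # large, so sampling noise is small (illustrative)
baseline, true_slope = 0.005, 0.025    # assumed background risk; risk per Gy
err_sd = 0.5                           # hypothetical log-scale SD of the dose error


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def simulate_outcome(true_dose):
    """Binary disease indicator with risk linear in true dose."""
    risk = np.clip(baseline + true_slope * true_dose, 0.0, 1.0)
    return (rng.random(true_dose.size) < risk).astype(float)


def unit_mean_lognormal(size):
    """Multiplicative error factor with mean exactly 1."""
    return rng.lognormal(mean=-0.5 * err_sd**2, sigma=err_sd, size=size)


# Berkson error: true doses scatter around the assigned (estimated) dose,
# so Avg(true dose | estimated dose) equals the estimated dose.
assigned = rng.lognormal(mean=np.log(0.1), sigma=0.8, size=n)
true_berkson = assigned * unit_mean_lognormal(n)
slope_berkson = ols_slope(assigned, simulate_outcome(true_berkson))

# Classical error: reported doses scatter around the true dose.
true_classical = rng.lognormal(mean=np.log(0.1), sigma=0.8, size=n)
reported = true_classical * unit_mean_lognormal(n)
slope_classical = ols_slope(reported, simulate_outcome(true_classical))

print(f"true slope      : {true_slope:.4f}")
print(f"Berkson slope   : {slope_berkson:.4f}")
print(f"classical slope : {slope_classical:.4f}")
```

The simulation treats errors as independent between subjects, so it speaks only to the first criterion; a shared multiplicative factor would shift every true dose together and is not represented here.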
However, because uncertainties in multiplicative factors that affect all doses simultaneously are admitted by the HEDR project (source terms, milk transfer coefficients, and so on) even under ideal circumstances, violations of the second criterion are expected. These violations can be considerable. As described in section VII of the Draft Final Report, the noncentrality parameter governing the power of the study is equal to

$$\mathrm{NCP} = \tfrac{1}{2}\, N B^{2} \sigma^{2} \left[ \frac{1}{p_m(1 - p_m)} + \frac{1}{p_f(1 - p_f)} \right] \tag{1}$$

where $N$ = sample size (here 3,190), $B$ = assumed dose-response relationship (absolute risk, 2.5%/Gy for thyroid malignancies), and $p_m$ and $p_f$ = cumulative incidence of thyroid malignancies in males and females, respectively, in the population as a whole (assumed to be 0.4% and 0.7%). The variance term $\sigma^2$ has to do with the variance of the dose distribution. If doses were observed without error, this term would be equal to the variance of the dose estimates. When doses are observed with error (whether Berkson or from any other model) but errors are independent or dependent because of purely additive components (that is, satisfying the first criterion), $\sigma^2$ is replaced in equation 1 with the variance of the average of true dose given estimated dose: $\sigma^2$ = Var(Avg(True dose | estimated dose)). (This follows as a consequence of the "regression-calibration" approach to measurement-error correction.) By definition, if one accepts the argument that the HEDR system produces Berkson errors and that the correlations are small, the value of $\sigma^2$ to use in equation 1 is just the variance of the average dose for each individual. This is essentially what was done in the sample-size and power calculations for the HTDS, except that the median rather than the estimated mean dose was used.

Dosimetry error always reduces study power, because it reduces the correlation between study outcomes and the exposure estimates, relative to the correlation that would be seen if true dose were known. However, in linear models of the probability of occurrence of disease and for dosimetry errors that satisfy the two conditions above, the slope parameter being estimated will remain statistically unbiased if the average dose estimates are used. Thus, the formula for the noncentrality parameter in equation 1 holds, except that the value of $\sigma^2$ being used is now the variance of the average dose estimates, that is, Var(Avg(True dose | estimated dose)), which is always less than the variance of the true exposures.

DOSE ERROR DUE TO UNCERTAINTIES IN INPUT DATA

One reason for doubt about the substitution of the variance of estimated doses for $\sigma^2$ is that the input data themselves are subject to obvious errors. The input data consist of such factors as location of residence (probably known fairly well) and milk-consumption habits in early childhood (undoubtedly known much less well; more will be stated about this below). If the fundamental input data are known with error, then in general Var(Avg(True dose | estimated dose)) will be overestimated by the dosimetry system. That occurs because the averaging process required to calculate Avg(True dose | input data) uses too few scenarios, and too little overlap is assumed between the scenarios that correspond to the distinct sets of reported input data. The mean estimated dose in the HTDS Draft Final Report is 182 mGy with a variance equal to (227 mGy)².
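As a rough check of how the noncentrality parameter responds to the dose variance, the sketch below evaluates it for the inputs quoted above. It assumes that equation 1 has the form NCP = (1/2) N B² σ² [1/(p_m(1 − p_m)) + 1/(p_f(1 − p_f))]. The conversion from NCP to power is also an assumption: a one-sided normal approximation at a 5% significance level, which is not necessarily the test used in the HTDS power calculations, so the printed power need not reproduce the figures cited in this chapter. Function names are hypothetical.

```python
from statistics import NormalDist


def noncentrality(n, slope, dose_sd, p_male, p_female):
    """Assumed form of equation 1; dose_sd in Gy, slope in risk per Gy."""
    inverse_binomial_var = (
        1.0 / (p_male * (1.0 - p_male)) + 1.0 / (p_female * (1.0 - p_female))
    )
    return 0.5 * n * slope**2 * dose_sd**2 * inverse_binomial_var


def approx_power(ncp, alpha=0.05):
    """One-sided normal approximation (an assumption, not the HTDS method)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return NormalDist().cdf(ncp**0.5 - z_alpha)


# Inputs quoted in the text: N = 3,190, B = 2.5%/Gy, sigma = 227 mGy,
# p_m = 0.4%, p_f = 0.7%.
ncp = noncentrality(n=3190, slope=0.025, dose_sd=0.227,
                    p_male=0.004, p_female=0.007)
print(f"NCP = {ncp:.2f}, approximate power = {approx_power(ncp):.3f}")
```

Because NCP is proportional to σ², any reduction in the effective dose variance lowers the noncentrality parameter in the same proportion, whatever test is used to convert it to power.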
The sensitivity of the power of the study to the assumption that dose errors are of the Berkson type is approached as follows. Suppose that the distribution of true dose is lognormal and that instead of a Berkson-error model we assume a classical-error model on the log scale so that

log(estimated dose) = log(true dose) + error. (2)

In this model, it is the estimated doses that are randomly distributed, multiplicatively, around the true doses. This model has often been considered potentially appropriate if input data are known with errors that are independent from subject to subject. The relevant aspect of the model here is that the average of true dose, Avg(True dose | estimated dose), derived from it has smaller variability, from person to person, than does the estimated dose itself. (The large estimated doses are reduced, and the small estimated doses increased.) The reduction of variance (of the average of true dose compared with estimated dose) is governed by the relative sizes of the variances of the last two terms, log(true dose) and error, in equation 2.

Assume, for example, that errors in log(estimated dose) have mean zero and standard deviation equal to 0.30 and are independent between subjects. This roughly corresponds to measurement error with a coefficient of variation of 30% (quite small compared with the variation seen in the 100 HEDR dose replications discussed in section VII, figure VIII.4). In this case, it can be shown that if the estimated dose distribution has a mean of 182 mGy and a variance of (227 mGy)², the variance of Avg(True dose | estimated dose) would be equal to (178 mGy)². Substituting that for $\sigma^2$ in equation 1 reduces the power from 96% to about 85%. Assuming larger errors in equation 2 has correspondingly larger effects on the analysis. Reduction of the power to below 60% (generally regarded as a study of low power) would occur when the standard deviation of the error in equation 2 equaled 0.48, because this will reduce Var(Avg(True dose | estimated dose)) to about (125 mGy)².
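The shrinkage described above can be sketched in closed form under one particular reading of equation 2. The sketch assumes that the estimated dose is exactly lognormal with the stated mean and standard deviation, that the error is normal on the log scale with mean zero, and that true dose and error are independent. These assumptions, and the function name, are illustrative; the values the sketch prints depend on them and need not match the (178 mGy)² and (125 mGy)² quoted above, which may rest on details not given here.

```python
import math


def calibrated_dose_sd(est_mean, est_sd, log_error_sd):
    """SD of Avg(True dose | estimated dose) under an assumed lognormal
    classical-error model: log(estimated) = log(true) + error."""
    # Log-scale variance and mean of the estimated dose, from its mean and SD.
    var_log_est = math.log(1.0 + (est_sd / est_mean) ** 2)
    mu_log = math.log(est_mean) - 0.5 * var_log_est
    # The error variance is carved out of the variance of log(estimated dose).
    var_log_true = var_log_est - log_error_sd**2
    if var_log_true <= 0.0:
        raise ValueError("error variance exceeds variance of log(estimated dose)")
    # Fraction of the log-scale variance attributable to true dose.
    reliability = var_log_true / var_log_est
    # Avg(True | estimated) is itself lognormal, with log-scale variance
    # reliability^2 * var_log_est and mean equal to the mean true dose.
    mean_true = math.exp(mu_log + 0.5 * var_log_true)
    return mean_true * math.sqrt(math.exp(reliability**2 * var_log_est) - 1.0)


for log_error_sd in (0.0, 0.30, 0.48):
    sd = calibrated_dose_sd(182.0, 227.0, log_error_sd)
    print(f"error SD {log_error_sd:.2f}: SD of calibrated dose = {sd:.0f} mGy")
```

With no error the function returns the standard deviation of the estimated doses, and the calibrated standard deviation falls as the assumed error grows, which is the qualitative behavior the argument relies on.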
Note that 0.48 is still quite small compared with the overall variability seen in the 100 estimates of HEDR doses. A value of 0.48 in equation 2 gives dose variations of about a factor of 2.8 compared with the even larger value of 4 seen in figure VIII.4 in the HTDS Draft Final Report. To reiterate, the important thing about equation 2 is that the variance of the average true dose is smaller than the variance of the estimated doses. For example, if the CIDER program tended to overestimate the average true dose for all subjects by about 80%, the actual power of the test would again be 60% instead of the 96% claimed, because it would also correspond to $\sigma^2$ = (126 mGy)².

We conclude that if a substantial, but not overwhelming, fraction of the variability of the HEDR individual dose estimates actually is due to non-Berkson error, as in equation 2, or if there is a substantial additional component of error due to uncertainties in input data, the power of the study likely was reduced to below levels that would usually be considered acceptable. A worst-case scenario, in which all the error seen in figure VIII.4 of the Draft Final Report is due to independent errors in equation 2, would produce very low power to detect a positive dose-response relationship.

For a number of reasons, however, it is considered unlikely that such a worst case actually applies. Given that the dosimetry system is based on extensive Monte Carlo calculations over many scenarios for each individual's input data, it seems reasonable to believe that some errors in the dose estimates do correspond to Berkson error. Also, the point is made in the later parts of the HTDS results section (section VIII) that a primary feature of the data is that two of the geostrata with the lowest estimated doses (Okanogan County and Ferry-Stevens Counties) actually had the highest rates of many of the thyroid diseases considered. Basic considerations of such factors as the prevailing wind directions would indicate that those counties should have had less ¹³¹I deposition than the other counties in the study. Unless such basic assumptions in the dosimetry system are incorrect, it is difficult to believe that this is an artifact of dosimetry error.

EFFECT OF ERRORS IN ASSESSING CHILDHOOD MILK CONSUMPTION

The effect of errors in assessment of childhood milk consumption on the power of the HTDS to detect dose-response relationships depends on the fraction of variation, M, in between-person thyroid dose that is due to between-person variability in milk consumption. If R is the coefficient of correlation between reported milk consumption and true consumption, the sample size needed to maintain the same power, relative to a study of size N with no errors in reported consumption, can be approximated to a first order as N/(1 − M + MR²) (see appendix D). If, for example, half the variation in thyroid dose is due to variation in milk consumption and the correlation between true and reported milk consumption is 0.3, the sample size needed is 1.8N. Thus, about 80% more subjects are needed, in this example, to make up for the poor correlation between reported and true milk consumption.

The HTDS Draft Final Report does not discuss errors in the milk-consumption estimates relative to the calculation of the HEDR doses, so we assume that no allowance for such errors has been made. The HTDS Draft Final Report did investigate the effect of substituting defaults for estimated milk consumption (reference values, rather than each subject's individual estimate) in the HEDR model for thyroid dose estimation. The use of the reference values did not change the overall trends in the dose-response analyses, but more information about this analysis is needed. A comparison of the variance of the HEDR dose estimates using individual versus HEDR reference milk-consumption estimates would be helpful for two reasons. First, it would allow calculation of the statistical power to detect the hypothesized dose-response relationship in the analyses that used the reference diet, an important point not explicitly discussed.
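The first-order approximation N/(1 − M + MR²) for the required sample size is simple to evaluate. The short sketch below computes the inflation factor relative to N; the function name is hypothetical, and the formula is taken as given in the text rather than derived.

```python
def sample_size_inflation(m, r):
    """Factor by which N must grow to keep the same power, using the
    first-order approximation 1 / (1 - M + M * R^2) quoted in the text.

    m: fraction of between-person dose variation due to milk consumption
    r: correlation between reported and true milk consumption
    """
    if not (0.0 <= m <= 1.0 and -1.0 <= r <= 1.0):
        raise ValueError("m must lie in [0, 1] and r in [-1, 1]")
    return 1.0 / (1.0 - m + m * r**2)


for m, r in [(0.5, 0.3), (1.0, 0.3), (0.1, 0.3), (0.5, 1.0)]:
    print(f"M = {m:.1f}, R = {r:.1f}: need {sample_size_inflation(m, r):.2f} x N")
```

With M = 0.5 and R = 0.3 the factor is about 1.8, as stated above. When M equals 1 the factor is 1/R², so the required sample size becomes very sensitive to R as M approaches 1.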
Second, by allowing the estimation of M, it would partly address the extent to which the HTDS might have been overoptimistic about the value of the retrospective reports of diet. In particular, if the variance of the individual diet-based HEDR estimates is much larger than the variance of the reference diet-based estimates (that is, if M approaches 1), the power of the primary analysis (which used individual diet) could be very sensitive to low values of R for the correlation between estimated and true individual consumption of milk. (If M is close to 1 and R is 0.3, it would take a study size perhaps 10 times as large to obtain the power that knowing true consumption would yield.) But, if M is relatively small, then the power of the HTDS to find dose-response relationships is much less sensitive to assumptions about the accuracy of the retrospective surrogate reports of diet, because other factors (such as location of residence) dominate the calculation of the estimated doses. Both power (of the reference diet analysis) and sensitivity (of the primary analysis) to error in the individual diet estimates should be discussed in future revisions of the Draft Final Report.

CORRELATED DOSE ERRORS

Correlations between individuals in dose errors also affect the power of the study to detect a dose-response relationship. For example, if the CIDER program tended to overestimate the average dose for all individuals by about 80%, the actual power of the study would be 60% instead of the 96% claimed, because this would correspond to $\sigma^2$ = (126 mGy)². But if doses were consistently underestimated, the study power would increase. In general, highly correlated multiplicative errors lead to wider confidence intervals (and hence reduced power) for estimated slope terms in the linear model, inasmuch as the common uncertainties in dose need to be accounted for. Because the Monte Carlo procedures used by the HEDR project involved averaging over possible values of a number of parameters (source term, milk transfer coefficients, and so on) that are expected to affect doses multiplicatively, some analysis of the correlation between doses should have been performed as a part of the power calculations for the HTDS.
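The arithmetic behind the 80% overestimation example can be written out directly. The sketch below assumes that a shared multiplicative bias rescales every dose by the same factor, so that the standard deviation of the doses is divided by that factor and a noncentrality parameter proportional to the dose variance is divided by its square. The function names are hypothetical, and the sketch does not itself compute power.

```python
def rescaled_dose_sd(estimated_sd, overestimation_factor):
    """SD of the doses after removing a shared multiplicative overestimate."""
    return estimated_sd / overestimation_factor


def ncp_ratio(overestimation_factor):
    """Ratio of corrected to nominal noncentrality parameter, assuming the
    parameter is proportional to the dose variance."""
    return 1.0 / overestimation_factor**2


factor = 1.8  # doses overestimated by about 80%
print(f"dose SD: {rescaled_dose_sd(227.0, factor):.0f} mGy")
print(f"NCP multiplied by {ncp_ratio(factor):.2f}")
```

A factor below 1, corresponding to consistent underestimation, raises both quantities, in line with the statement that power would then increase.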

EFFECT OF GEOGRAPHIC VARIATION IN DISEASE RATES ON STATISTICAL POWER

A notable feature of the HTDS data is that there were indications of heterogeneity by geostratum for many of the thyroid-disease outcomes considered. Part of this heterogeneity was that the two low-dose geostrata often had higher rates of diseases than the other areas, whether or not dose was considered as a risk factor. That sort of heterogeneity of background rates of disease can have important effects on the power of studies of this type. Essentially, the issue is related to the last statistical "factor" noted earlier (see page 100): whether, conditional on dose, disease outcome is independent from individual to individual.

It is possible that important known or unknown covariates for thyroid-disease risk, not considered in the study, could lead to biases or loss of power if they tend to cluster by geostratum, distance from the Hanford facility, or otherwise in ways that affect estimated doses. To take an unlikely example, adult weight has been found in case-control studies (McTiernan and others, 1987; Preston-Martin and others, 1993) to play an important role in thyroid-cancer risk, with the heaviest subjects in one study (Goodman and others, 1992) having up to 5 times the risk of the lightest. If for any reason the average weight of subjects differed substantially by geostratum or distance from Hanford, this could make the estimation of a dose-response relationship quite difficult. Similarly, thyroid-cancer rates have been noted to differ by ethnicity, with an especially high rate among Filipino women in Hawaii (Kolonel, 1985). The Hanford study population is ethnically homogeneous, but the evidence, for many of the thyroid diseases or abnormalities, that risk differs by geostratum raises the question of whether clustering of important unconsidered factors could have reduced the power of the study by violating the independence assumption.