

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




OCR for page 66
8 Modeling of Sources of Variability and Biases

As discussed in the previous chapter, a number of errors and biases can arise when estimating the distribution of nutrient intake in a population. The estimated prevalence is derived directly from the estimated distribution of nutrient intake, as described by the probability approach. Therefore, errors and biases in the estimation of the intake distribution will carry over to the estimation of prevalence. Identifying the different sources of error will enable us to assess the impact of these errors on the estimate of prevalence.

Some of the errors will be due to random sampling variation. The magnitude of these errors can be determined directly from the data, and their impact on prevalence estimates can be determined with statistical theory. Other sources of bias cannot be determined directly. It is, however, important to consider their impact on prevalence estimates. Once they are identified, indirect evidence or judgments can serve as a basis for estimating the magnitude of error, and sensitivity analysis can be used to determine how these errors may affect estimates of prevalence.

The first step in the Nationwide Food Consumption Survey (NFCS) is to select the respondents. Information on the previous day's intake is elicited by interview. For the day of the interview and the following day, foods are recorded by respondents at the time of consumption. Foods are then assigned to categories for coding; the coded foods are converted to nutrients by multiplying the amount of food eaten by the nutrient content per 100 g. The nutrient content information is obtained from reference data on food composition, which are maintained and updated periodically by the U.S. Department of Agriculture (USDA).
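The conversion step described above can be sketched in a few lines. This is only an illustration: the function name, the food items, and the composition values are hypothetical and are not taken from the USDA reference tables or the NFCS coding system.

```python
# Hypothetical protein contents in g per 100 g of food; illustrative values only,
# not taken from the USDA reference tables.
PROTEIN_PER_100G = {"whole milk": 3.3, "white bread": 8.5, "cooked chicken": 27.0}


def nutrient_intake(grams_eaten, content_per_100g):
    """Sum over food items of (grams eaten / 100) * (nutrient per 100 g)."""
    return sum(
        grams / 100.0 * content_per_100g[food]
        for food, grams in grams_eaten.items()
    )


# One hypothetical day's record: coded food item -> grams eaten.
day_record = {"whole milk": 240.0, "white bread": 60.0, "cooked chicken": 100.0}
protein_g = nutrient_intake(day_record, PROTEIN_PER_100G)
print(f"protein: {protein_g:.1f} g")
```

Any error in the reported grams or in the table values passes straight through this sum, which is why the chapter treats reporting errors and food composition errors as separate sources.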

In the following sections, the sources of error and bias are broken down according to random sampling errors, errors in reporting food intake, and errors in the food composition tables.

VARIABILITY DUE TO SAMPLING OF RESPONDENTS

In this discussion, errors in selecting respondents are presumed to arise randomly from an unclustered, unstratified random probability sample of the population. In this case, the prevalence estimate, p, has a standard error of $[p(1 - p)/N]^{1/2}$, where N is the size of the sample. These standard errors are presented in Table 8-1 and compared to the increases in standard errors when random variability in food intakes and food nutrient compositions are also taken into account. The proportional increases are relatively small and would be even smaller if one were to use the standard errors of the NFCS, which are slightly higher due to clustering.

RANDOM VARIABILITY IN FOODS CONSUMED

When sampling a population at random, many people are sampled for several days, and the amount of food eaten is classified according to many different food items or categories. The following system of notation will be used to examine the random variability: Different individuals will be denoted by the index i, different days by the index j, and different food items by the index k. For example, let $A_{ijk}$ denote the amount of the kth food item category eaten by the ith individual on the jth day. When a random components model is used to model the errors resulting from a random sample,

$A_{ijk} = \mu_k + I_{ik} + D_{ijk}$,

where $\mu_k$ is the population mean amount of the kth food item eaten in one day and $I_{ik}$ is the difference between the average amount of the kth food item eaten by individual i and the population mean. $I_{ik}$ can be thought of as a random variable with values varying across a population, centered at zero, i.e., $E(I_{ik}) = 0$, and with variance equal to $\sigma^2(I_k)$. The value $\sigma^2(I_k)$ is called the interindividual variability for that food.
The term $D_{ijk}$ refers to the difference between the amount of the kth food eaten on the jth day by individual i and the average amount eaten by individual i. The values $D_{ijk}$ are also considered to be random variables varying across days for the same individual with mean zero and variance $\sigma^2(D_k)$. This is called the intraindividual variation for that food.
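The random components model for a single food item can be sketched as a small simulation, followed by a moment-based separation of the interindividual and intraindividual variances. This is a minimal sketch under stated assumptions: a balanced design (the same number of days for every person), normally distributed components, and hypothetical parameter values and function names that do not come from the NFCS data.

```python
import random
import statistics


def simulate_amounts(n_people, n_days, mu, sd_inter, sd_intra, seed=0):
    """Draw A_ij = mu + I_i + D_ij for one food item (index k dropped)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_people):
        person_effect = rng.gauss(0.0, sd_inter)  # I_i, fixed for the person
        data.append(
            [mu + person_effect + rng.gauss(0.0, sd_intra) for _ in range(n_days)]
        )
    return data


def variance_components(data):
    """Return (mean, interindividual variance, intraindividual variance)."""
    n_days = len(data[0])
    person_means = [statistics.fmean(row) for row in data]
    # Pooled within-person variance estimates sigma^2(D).
    intra = statistics.fmean([statistics.variance(row) for row in data])
    # Variance of person means estimates sigma^2(I) + sigma^2(D) / n_days.
    inter = statistics.variance(person_means) - intra / n_days
    return statistics.fmean(person_means), max(inter, 0.0), intra


data = simulate_amounts(2000, 3, mu=100.0, sd_inter=30.0, sd_intra=40.0)
mean_hat, inter_hat, intra_hat = variance_components(data)
print(f"mean={mean_hat:.1f}  inter={inter_hat:.0f}  intra={intra_hat:.0f}")
```

With these hypothetical inputs the estimates should land near the simulated values of 900 for the interindividual variance and 1600 for the intraindividual variance, which illustrates why a few days per person are enough to separate the two components.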

TABLE 8-1. The Mean and Standard Errors of Proportion with Inadequate Intake Resulting from Errors in Food Composition Tables for Different Nutrients for Males and Females, Contrasted to Estimates Obtained with the Delta Method when Table Errors Were Not Considered

                                       Proportion with Inadequate Intake, %
                                       (mean ± standard error)
                          Error in
Nutrient                  Estimate     Male             Female

Protein                   Delta(a)     1.2 ± 0.19       7.2 ± 0.52
                          FCT(b)       1.2 ± 0.32       7.2 ± 1.1
Iron                      Delta        2.7 ± 0.31       NA(c)
                          FCT          3.1 ± 2.2        NA
Vitamin C                 Delta        43.0 ± 1.16      55.4 ± 0.99
                          FCT          43.0 ± 1.55      55.4 ± 1.13
Vitamin A                 Delta        59.4 ± 1.27      59.8 ± 1.03
                          FCT          59.5 ± 2.18      59.7 ± 1.89
Thiamin (mg/day)          Delta        32.2 ± 1.04      NA
                          FCT          32.2 ± 5.85      NA
Thiamin (kcal/day)        Delta        2.6 ± 0.36       NA
                          FCT          3.4 ± 2.49       NA
Vitamin C (minimum        Delta        0.25 ± 0.06      1.13 ± 0.16
  requirements)           FCT          0.29 ± 0.17      1.18 ± 0.47

(a) Delta denotes error in estimates from variation in survey as obtained using the delta method (Bickel and Doksum, 1977) and includes sampling and reporting error only.
(b) FCT denotes error in estimate resulting from food composition tables, and includes sampling and reporting error.
(c) NA = Data were not available to the subcommittee.
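For orientation, the simple-random-sample standard error quoted in the sampling discussion, $[p(1 - p)/N]^{1/2}$, can be computed as follows. This is a sketch only: the function names are hypothetical, the sample size of 2,500 is an assumed value (the survey's N is not stated here), and the interval uses the estimate ± 2 standard errors rule that the chapter suggests for approximate 95% confidence intervals.

```python
import math


def prevalence_standard_error(p, n):
    """Standard error of a prevalence estimate p from a simple random sample of size n."""
    return math.sqrt(p * (1.0 - p) / n)


def approx_95_interval(p, n):
    """Approximate 95% interval: estimate plus or minus 2 standard errors."""
    se = prevalence_standard_error(p, n)
    return p - 2.0 * se, p + 2.0 * se


# Hypothetical inputs: a prevalence of 7.2% and an assumed sample of 2,500 people.
se = prevalence_standard_error(0.072, 2500)
low, high = approx_95_interval(0.072, 2500)
print(f"SE = {100 * se:.2f} percentage points; interval = {100 * low:.1f}% to {100 * high:.1f}%")
```

Because N sits under the square root, halving this standard error requires roughly four times as many respondents.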

Not all errors and biases will arise from random sampling. Some derive from the methods used in determining the amount of nutrient that a person consumes each day. Errors in reporting of foods eaten also require attention. Previously, the amount of food consumed in one day was denoted by $A_{ijk}$. The amount of food actually reported by the subject will be denoted by

$A^*_{ijk} = A_{ijk} + RI_{ik} + RR_{ijk} + RB_k$.

$RR_{ijk}$ denotes the random error within an individual in reporting foods. This quantity is assumed to vary at random from day to day, and is centered at zero for each individual with variance equal to $\sigma^2(RR_k)$. Some people may consistently overreport or underreport certain foods. The average of this consistent over- or underreporting of the kth food type across a population will be denoted by $RB_k$, and the ith individual's over- or underreporting will be denoted by $RI_{ik}$. The variable $RI_{ik}$ varies across individuals in a population and is centered at zero with a variance of $\sigma^2(RI_k)$. The value of $RB_k$ is assumed to be a constant.

Random error in food reporting enters into intraindividual variation. Because the adjustment of the intake distribution described in Chapter 4 separates interindividual variation from intraindividual variation, this type of intraindividual reporting error will have no effect on the estimation of prevalence.

Consistent under- or overreporting of food intake will be part of the interindividual variation and will not be removed in the adjustment of intake data. Thus, it can affect the estimate of prevalence. The value $\sigma^2(RI)$, if it exists, would add to the true interindividual variation and, hence, would artificially inflate the spread of the actual intake distribution. The standard deviation for the intake distribution, which should be $\sigma(I)$, will be estimated by $[\sigma^2(I) + \sigma^2(RI)]^{1/2}$. Unless $\sigma^2(RI)$ is substantially large, this will have little effect on the prevalence estimates.
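The square-root-of-summed-variances rule above can be sketched as a short calculation. It assumes that the reporting error is independent of true intake and that both coefficients of variation (CVs) are expressed relative to the same mean; the function name is hypothetical. The inputs are the interindividual CVs of 30% and 50% and the reporting-error CVs of 10% and 20% used in the chapter's own example.

```python
import math


def inflated_cv(cv_inter, cv_report):
    """Observed interindividual CV once an independent reporting-error CV is added."""
    return math.sqrt(cv_inter ** 2 + cv_report ** 2)


for cv_inter in (30.0, 50.0):
    for cv_report in (10.0, 20.0):
        observed = inflated_cv(cv_inter, cv_report)
        print(f"true CV {cv_inter:.0f}%, reporting CV {cv_report:.0f}% -> observed CV {observed:.1f}%")
```

Because the components add as variances rather than as standard deviations, a reporting error that is small relative to the true spread changes the observed CV only slightly.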
For example, the coefficient of variation (CV) of the interindividual variation for many of the nutrients ranges from 30% to 50% (see Appendix A). If the over- and underreporting errors are symmetrical but on the order of 10%, so that $\sigma(RI)$ has a CV of about 10%, then this would inflate the CV from 30% to 32% or from 50% to 51%. Similarly, if the reporting errors are on the order of ±20%, then this would inflate the CV from 30% to 36% or from 50% to 54%.

This is not true, however, if the over- and underreporting is not symmetrical, that is, if there is an overall systematic bias in reporting for the entire population, so that the bias term $RB_k$ is not equal to zero. Sensitivity analyses have shown that changes in the mean could have a substantial effect on the estimate of prevalence. Hence, systematic over- or underreporting of certain foods by a population must be taken very seriously.

VARIABILITY IN FOOD COMPOSITION DATA

Using statistical notation, one can summarize the errors and biases that may occur in the compilation of food composition tables. When the amount of nutrient per 100 g of a food item is to be measured, the analysis is performed on a theoretically representative sample of the food. Although the food composition tables (USDA, 1976-1984) give a single number representing the mean nutrient content per 100 g of the food item, the importance of the distribution of nutrients per food item must be recognized.

To examine the impact of possible errors in these data, let us denote $F_k$ as the true mean nutrient content for the distribution of the kth food item. Let $FR_{ijk}$ denote the difference between the mean nutrient content $F_k$ and the actual amount of nutrient in the kth food eaten by the ith individual on the jth day. The variable $FR_{ijk}$ is assumed to be randomly distributed with a mean of zero and a variance of $\sigma^2(FR_k)$. The variance represents the true variability of nutrient content that is found within a population of certain types of food items. It is assumed that the foods eaten from day to day are random samples from this distribution.
If people do not randomly select their foods from a group of specific food items but, rather, systematically and regularly select specific items (e.g., a certain brand of fortified cereal rather than samples of many kinds of cereals), then bias will be introduced. This bias will be denoted by the term $FB_{ik}$, the difference between the average amount of nutrient that the ith individual eats and the population mean $F_k$. It can be assumed that the variable $FB_{ik}$ varies from individual to individual and is centered at zero with a variance of $\sigma^2(FB_k)$.

Finally, $C_k$ will be used to denote the difference between the true mean nutrient content $F_k$ and the content as estimated from the food composition tables. The value $C_k$ includes many components of error, such as laboratory error, sampling error, and biases in relation to the foods chosen for laboratory analysis. Estimates of the nutrient content of foods are obtained by averaging the content of food samples.

Ideally, the recorded nutrient intake equals the true amount of food eaten multiplied by the true nutrient content of the foods and summed over all food items. Hence, the actual amount of nutrient intake for the ith individual on the jth day could be expressed as

$N_{ij} = \sum_k (\mu_k + I_{ik} + D_{ijk})(F_k + FB_{ik} + FR_{ijk})$.

The measured nutrient intake is the amount of food reported, multiplied by the nutrient content of each food as given in the food tables and summed over all food items. The following expression describes the measured nutrient intake of the ith individual on the jth day:

$N^*_{ij} = \sum_k (\mu_k + I_{ik} + D_{ijk} + RB_k + RI_{ik} + RR_{ijk})(F_k + C_k)$.

The difference between the true nutrient intake and the measured nutrient intake is:

$N_{ij} - N^*_{ij} = \sum_k (FB_{ik} + FR_{ijk} - C_k)(\mu_k + I_{ik} + D_{ijk}) - \sum_k (F_k + C_k)(RB_k + RI_{ik} + RR_{ijk})$.

When there is no systematic bias in the reporting of foods, the following conditions apply: $RB_k$ and $RI_{ik}$ are both equal to zero, and there is no systematic bias in the ways individuals select particular kinds of food, i.e., $FB_{ik}$ is equal to zero. Under these constraints, where the only errors are random,

$N^*_{ij} = N_{ij} - \sum_k FR_{ijk}(\mu_k + I_{ik} + D_{ijk}) + \sum_k (F_k + C_k) RR_{ijk} + \sum_k C_k D_{ijk} + \sum_k C_k \mu_k + \sum_k C_k I_{ik}$.

This can be written as

$N^*_{ij} = N_{ij} + X + Y_i + Z_{ij}$,

where $X = \sum_k C_k \mu_k$, $Y_i = \sum_k C_k I_{ik}$, and $Z_{ij} = \sum_k C_k D_{ijk} + \sum_k (F_k + C_k) RR_{ijk} - \sum_k FR_{ijk}(\mu_k + I_{ik} + D_{ijk})$.

We now turn to estimating the effects of these errors on prevalence estimates.

EFFECT OF RANDOM STATISTICAL ERROR ON ESTIMATION OF PREVALENCE

The amount of nutrient intake is estimated for each person on each day of the survey from information about food consumed obtained in the survey together with the food composition tables. As described in detail in Appendix C, two approaches for estimating prevalence have been suggested: the parametric approach, which assumes that the distribution of nutrient intake, or some transformation of the data, is normal, and the nonparametric approach, which does not make this assumption. The nonparametric approach would probably be the preferred method for estimating prevalence; however, the statistical methods used are much more difficult to model than those in the parametric approach. For this reason, the parametric approach is used in this chapter to generate an approximate measure of the degree of variability in the estimate of prevalence. Where the estimates of prevalence calculated in the two approaches differ, the difference should be only slight; however, in such a case the estimate obtained with the nonparametric approach is the one of choice.

As indicated in Appendix C, prevalence estimates based on the parametric approach are derived from the population mean and the interindividual variation of nutrient intake, which are obtained from an analysis of variance (ANOVA) of the nutrient data. If the nutrient content recorded for each subject on each day of observation is exactly correct, then the only error in estimating prevalence would be statistical fluctuation resulting from random sampling. The magnitude of this statistical fluctuation will be measured by the standard error of the estimate.
The formulas and theory necessary to find the standard error of the prevalence estimate are given in Appendix C.

The assumption was made that the distribution of actual intakes was log-normal. Where a log-normal distribution cannot be assumed, this method may not be appropriate and a larger class of transformations should be considered (Box and Cox, 1964). However, the major purpose of this exercise is to get some sense of the degree of statistical variation in the estimation procedure. For this purpose, the log-normal assumption will be adequate. To obtain 95% confidence intervals, the estimate ± 2 standard errors could be used.

As was noted previously, the amount of nutrient recorded does not exactly reflect the amount of nutrient ingested. In fact, even when there is no systematic bias in the reporting or choices of foods eaten,

$N^*_{ij} = N_{ij} + X + Y_i + Z_{ij}$,

where $N^*_{ij}$ is the amount of nutrient reported for the ith individual on day j and $N_{ij}$ is the actual amount of nutrient for the ith individual on day j.

The component $Z_{ij}$ is incorporated as part of the day-to-day variability and will be taken out by the analysis of variance. Hence, $Z_{ij}$ will not have any effect on the estimate of standard error of proportion with inadequate intake. In the probability approach, an analysis of variance of the true $N_{ij}$ should be used to estimate the population mean of the nutrient and the interindividual variation. These estimates are then used to estimate the proportion with inadequate intake. In actuality, however, the analysis of variance is made on the $N^*_{ij}$. Hence, the population mean that is being estimated is the true population mean $\sum_k F_k \mu_k$ plus the value X, which is a realization of the error terms coming from the food tables. Also, the interindividual variation that is being estimated is equal to the true interindividual variation plus the variance of $Y_i$. $Y_i$ has a minimal effect on the estimate of interindividual variation and almost no effect on the estimate of proportion with inadequate intake.
Therefore, we shall only consider the effect of error term X on the estimate of proportion with inadequate intake.
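The argument above about which error terms matter can be illustrated with a small simulation. It is a sketch under stated assumptions, not the subcommittee's computation: the design is balanced, all components are normal, X is fixed at a single hypothetical realization, and the standard deviations chosen for the person-level term and the day-level term are hypothetical, with the person-level term taken to be small relative to the true interindividual spread.

```python
import random
import statistics


def anova_components(data):
    """Balanced one-way ANOVA: (grand mean, between-person var, within-person var)."""
    n_days = len(data[0])
    person_means = [statistics.fmean(row) for row in data]
    within = statistics.fmean([statistics.variance(row) for row in data])
    between = statistics.variance(person_means) - within / n_days
    return statistics.fmean(person_means), between, within


rng = random.Random(1)
N_PEOPLE, N_DAYS = 3000, 3
X = 4.0  # one hypothetical realization of the food-table error, shared by everyone

true_intake, measured_intake = [], []
for _ in range(N_PEOPLE):
    usual = 80.0 + rng.gauss(0.0, 25.0)  # person's true usual intake
    y_i = rng.gauss(0.0, 3.0)            # person-level error term
    true_days = [usual + rng.gauss(0.0, 20.0) for _ in range(N_DAYS)]
    # Measured intake adds X, the person-level term, and a day-level error term.
    measured_days = [n + X + y_i + rng.gauss(0.0, 10.0) for n in true_days]
    true_intake.append(true_days)
    measured_intake.append(measured_days)

true_mean, true_between, true_within = anova_components(true_intake)
meas_mean, meas_between, meas_within = anova_components(measured_intake)
print(f"mean shift:             {meas_mean - true_mean:.2f}")
print(f"between-person change:  {meas_between - true_between:.1f}")
print(f"within-person change:   {meas_within - true_within:.1f}")
```

In this setup the day-level error lands almost entirely in the within-person variance, the person-level error changes the between-person variance only slightly, and the mean moves by roughly X, which is the pattern the text relies on when it keeps only X for further study.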

The proportion of the population with inadequate intake, say P, is a function of the population mean $\mu$ and the interindividual variation $\sigma_I^2$. That is, $P = S(\mu, \sigma_I^2)$. As mentioned previously, we are estimating $P^* = S(\mu + X, \sigma_I^2)$, where $X = \sum_k C_k \mu_k$ can be thought of as a random variable with a mean of zero and a variance equal to $\sum_k \mu_k^2 \mathrm{Var}(C_k)$.

To derive some sense of how much P* could be expected to vary from P, a sensitivity analysis was performed in the following manner. First, it was necessary to assign an approximate value for the variance, which will be denoted by $\sigma_X^2 = \sum_k \mu_k^2 \mathrm{Var}(C_k)$. (More will be said about this later.) Random values $X_1, X_2, \ldots, X_{500}$ are generated from the distribution of X, assumed to be normally distributed with mean zero and variance $\sigma_X^2$. Values of $P^*_i = S(\mu + X_i, \sigma_I^2)$ were computed, as were their mean and standard deviation. The values of $\mu$ and $\sigma_I^2$ are estimates obtained from the original analysis of variance. Although the exercise will not produce precise estimates of the standard error resulting from food composition tables, it can be used to assess the impact of errors in food tables on the estimates of the prevalence of inadequate intake.

To estimate $\sigma_X^2$, the standard error in the mean nutrient composition was obtained for a typical diet. The most recent set of reference tables on food composition published by the USDA (1976-1984) provides some information about the number of samples analyzed and the standard error of the mean for some foods. Using methodology similar to that described in Appendix E (using standard error instead of standard deviation), the subcommittee obtained a rough approximation of the standard error in the mean nutrient consumed in a sample diet as a result of random sampling of foods from the food composition table. In all cases, the estimation errors relating to the errors in food composition tables are larger than errors resulting from the survey data.
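The sensitivity analysis described above can be sketched as follows. This is a minimal illustration rather than the subcommittee's program. The form of S is an assumption: usual intake is taken as log-normal with the given mean and standard deviation, requirements as normal with a CV of 15%, and prevalence as the average risk of inadequacy over the intake distribution. The mean intake of 90, the standard deviation of 27, and the standard deviation of 4 for X are hypothetical; only the protein requirement mean of 43 g/day and the 15% requirement CV are quoted elsewhere in the chapter. All function names are hypothetical.

```python
import math
import random
import statistics

STD_NORMAL = statistics.NormalDist()
# Standard-normal quantiles at 400 equal-probability midpoints, used for integration.
Z_GRID = [STD_NORMAL.inv_cdf((i + 0.5) / 400) for i in range(400)]


def prevalence(mean_intake, sd_intake, req_mean, req_cv=0.15):
    """S(mu, sigma_I^2): share of people whose usual intake is below their requirement."""
    # Log-normal usual intake matched to the given mean and standard deviation.
    log_var = math.log(1.0 + (sd_intake / mean_intake) ** 2)
    log_mean = math.log(mean_intake) - log_var / 2.0
    log_sd = math.sqrt(log_var)
    risks = []
    for z in Z_GRID:
        intake = math.exp(log_mean + log_sd * z)
        # Risk at this intake: probability that a normal requirement exceeds it.
        risks.append(1.0 - STD_NORMAL.cdf((intake - req_mean) / (req_cv * req_mean)))
    return statistics.fmean(risks)


def sensitivity(mean_intake, sd_intake, req_mean, sd_x, n_draws=500, seed=0):
    """Mean and standard deviation of P*_i = S(mu + X_i, sigma_I^2) over random X_i."""
    rng = random.Random(seed)
    draws = [
        prevalence(mean_intake + rng.gauss(0.0, sd_x), sd_intake, req_mean)
        for _ in range(n_draws)
    ]
    return statistics.fmean(draws), statistics.stdev(draws)


base = prevalence(90.0, 27.0, 43.0)
mean_p, sd_p = sensitivity(90.0, 27.0, 43.0, sd_x=4.0)
print(f"P without table error: {100 * base:.2f}%")
print(f"P* over 500 draws: mean {100 * mean_p:.2f}%, SD {100 * sd_p:.2f}%")
```

The standard deviation of the 500 simulated prevalences plays the role of the food-table contribution to the standard error; it depends on the spread of X and not on the number of survey respondents.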
The effect of the errors in the food table on estimates of prevalence cannot be diminished by larger surveys. Improvement can be made only with more accurate food tables.

IMPACT OF RANDOM UNDER- AND OVERREPORTING

A number of the dietary methodology studies reported in Chapter 6 suggest that there may be under- and overreporting of intake. This is to be distinguished from systematic misreporting by a population or population group (see Chapter 7). If the random element relates to individual reporting from day to day, the effect will be removed during the process of adjusting the distribution to remove the impact of day-to-day variation. However, if some people systematically underreport while others systematically overreport, the between-individual variance will be incorporated in the estimate of the distribution of usual intake. This effect can be expected to have an impact on estimates of the prevalence of inadequate intake. The subcommittee used a series of simulations to examine the nature and magnitude of the impact.

To provide some perspective on the potential magnitude of interindividual random under- and overreporting, Table 8-2 portrays, using simulation techniques, the effects that might be seen in population data if there is bias in reporting by an individual. A comparison of observed and reported intakes for single meals is discussed by Schnakenberg et al. (1981). In their data, there was an apparent overall bias toward underestimation. Of more importance for the present purpose, the standard deviation of misreporting expressed as a proportion of observed intake was 28% of the mean score (32 g) for protein. This demonstrates the magnitude of under- and overreporting of single meals. Unless there is evidence that each person is consistent in his or her misreporting, the measured variance must include an unmeasured proportion of intraindividual variation. That is, the CV of 30% overestimates the error that would be expected in estimates of the distribution of usual intake (i.e., intake across multiple meals per day and many days). Many other studies cited in Chapter 7, in which one dietary methodology was compared with another in reporting under- and overestimation, fail to take into account the impact of differences in the number of days of observation. Thus, the literature does not provide a direct estimate of the magnitude of under- and overreporting that might be expected in data sets adjusted to remove day-to-day variability of intake. The report of Schnakenberg et al. (1981) indicates that a realistic worst-case situation would be a CV of 20% for misreporting the estimate of usual intake distributions.

TABLE 8-2. Magnitude of Expected Effect of Random Under- and Overreporting in Population Data(a)

Coefficient       Distribution of Deviations Between Recorded and True Intake
of Variation      (% of Subjects Exhibiting Deviation)
(% of Mean)       30%      25%      20%      15%      10%      5%

 5                 2.7      3.4      4.2      5.2      6.4      8.2
10                 5.3      6.7      8.4     10.4     12.8     16.5
15                 7.8     10.1     12.6     15.5     19.2     24.7
20                10.5     13.5     16.8     20.7     25.6     32.9
25                13.1     16.8     21.0     25.9     32.0     41.1
30                15.8     20.2     25.2     31.1     38.4     49.4

(a) The body of the table displays the magnitude of the deviation between recorded intake and a measure of true intake that would be expected in the proportion of subjects specified in the column. Deviations of the magnitude shown or greater would be expected. Assumes a Gaussian distribution for this simulation.

The potential effects of random interindividual misreporting on estimates of the prevalence of inadequate intakes can be simulated as shown in Table 8-3. The adjusted distributions of usual intake for protein and for vitamin C (Appendix A) were further adjusted to incorporate or to remove the effect of a component of random variation. This was done by using ratios of standard deviations of the derived distribution of usual intakes. The standard deviations adjusted to add or remove a variance component are used in the same manner as reported in Appendix A, except that the distributions were not normalized first. The approach preserves the skew of the distribution of usual intakes. The probability approach was then used to derive estimates of the apparent prevalence of inadequate intakes.
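The adjustment just described can be sketched as a rescaling of each person's deviation from the mean by a ratio of standard deviations, which changes the spread while leaving the mean and the shape (including the skew) untouched. This is an illustration under stated assumptions and not the subcommittee's procedure in detail: the error CV is taken relative to the mean intake, requirements are taken as normal with a CV of 15%, the intake data are synthetic log-normal draws, and all names are hypothetical.

```python
import math
import random
import statistics


def rescale_spread(intakes, error_cv, add=True):
    """Add or remove a variance component (CV relative to the mean) by an SD ratio."""
    mean = statistics.fmean(intakes)
    sd = statistics.pstdev(intakes)
    error_var = (error_cv * mean) ** 2
    new_var = sd ** 2 + error_var if add else sd ** 2 - error_var
    if new_var <= 0.0:
        # The error term would equal or exceed the interindividual variation.
        raise ValueError("error term is as large as the interindividual variation")
    ratio = math.sqrt(new_var) / sd
    # Scaling deviations about the mean preserves the skew of the distribution.
    return [mean + (x - mean) * ratio for x in intakes]


def apparent_prevalence(intakes, req_mean, req_cv=0.15):
    """Probability approach: average risk that a normal requirement exceeds intake."""
    requirement = statistics.NormalDist(req_mean, req_cv * req_mean)
    return statistics.fmean([1.0 - requirement.cdf(x) for x in intakes])


rng = random.Random(0)
# Synthetic, right-skewed usual intakes (hypothetical; not the Appendix A data).
intakes = [rng.lognormvariate(4.457, 0.29) for _ in range(2000)]
widened = rescale_spread(intakes, error_cv=0.20, add=True)
narrowed = rescale_spread(intakes, error_cv=0.20, add=False)

for label, values in (("removed", narrowed), ("as observed", intakes), ("added", widened)):
    print(f"{label:12s} prevalence = {100 * apparent_prevalence(values, 43.0):.2f}%")
```

The `ValueError` branch mirrors the situation in which removing the error term is impossible because it would leave no interindividual variance.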
From these simulations it can be seen that random interindividual under- and overreporting of intake can affect the prevalence estimate. Consideration of the underlying theory indicates that the magnitude of the effect will depend on the magnitude of the variance of this component in relation to the true interindividual variation as well as on the means of the intake and requirement distributions. In the case of protein, with the estimated variation of usual intakes (see Appendix A) much smaller than that of vitamin C, the impact of adding or removing a component of variance would be much greater. The requirement and intake distributions for protein are also much more widely separated than in the case of vitamin C, which would accentuate the effect. Protein then can be used as a worst-case scenario.

TABLE 8-3. Potential Impact of Random Interindividual Misreporting of Intake on the Estimate of Prevalence of Inadequate Intakes in Adult Males

CV(a) of          Impact of Addition            Impact of Removing
Random            of Error Term (%)(b)          Error Term (%)
Error (%)         Protein     Vitamin C         Protein     Vitamin C

 0                2.1         41.0              2.1         41.0
 5                2.3         41.1              2.0         41.0
10                2.8         41.3              1.5         40.8
15                3.7         41.6              0.8         40.4
20                4.9         42.1              0.2         39.8
25                6.5         42.6              0           39.1
30                8.3         43.1              (c)         38.0

(a) CV = coefficient of variation.
(b) Values are the apparent prevalences of inadequate intake computed by the probability approach. Mean requirement of protein taken as 43 g/day and mean requirement of vitamin C taken as 46.2 mg/day. CV of requirement taken as 15% of the mean.
(c) Cannot be computed. In this case the error term would be equal to or greater than the estimated interindividual variation, an impossible situation.

Table 8-3 also gives the potential effects of removing an independent variance component. In the NFCS data set, the potential impact of removing a component of variance is of particular interest. Removing the variance component permits examination of any bias that might be present in the estimate of prevalence because of random under- and overreporting present in the original data set, whereas addition of variance would arise only if there was evidence of a negative correlation between intake and overestimation. In this case, the correlation itself would have to be considered. The 20% estimate of the CV used here would be a generous estimate of the possible variation attributable to this source, and the analysis shown in Table 8-3 suggests that the impact, although real, would not be serious. The prevalence estimate for protein might fall from 2.1% to 0.2%, both of which would be considered very low prevalences. This analysis has used a worst-case scenario with a high error term and a prevalence of inadequate intake that falls in the tail of the intake distribution. In contrast, the prevalence estimate for vitamin C might change from 41.0% to 39.8%, an operationally undetectable change. Only if it could be argued that the random error greatly exceeds the real variation, after day-to-day variability had been factored out, would the magnitude of the error be totally unacceptable for the purpose of survey data interpretation.

Again the subcommittee emphasizes that this phenomenon is quite different from systematic under- or overreporting across individuals. That effect is discussed as bias in the estimate of intake earlier in this chapter.