8
Modeling of Sources of Variability
and Biases
As discussed in the previous chapter, a number of
errors and biases can arise when estimating the distri-
bution of nutrient intake in a population. The estimated
prevalence is derived directly from the estimated distri-
bution of nutrient intake, as described by the probability
approach. Therefore, errors and biases in the estimation
of intake distribution will carry over to the estimation
of prevalence. Identifying the different sources of error
will enable us to assess the impact of these errors on the
estimate of prevalence.
Some of the errors will be due to random sampling varia-
tion. The magnitude of these errors can be determined
directly from the data, and their impact on prevalence esti-
mates can be determined with statistical theory. Other
sources of bias cannot be determined directly. It is, how-
ever, important to consider their impact on prevalence esti-
mates. Once identified, indirect evidence or judgments can
serve as a basis for estimating the magnitude of error, and
sensitivity analysis can be used to determine how these
errors may affect estimates of prevalence.
The first step in the Nationwide Food Consumption Survey
(NFCS) is to select the respondents. Information on the
previous day is elicited by interview. For the day of the
interview and the following day, foods are recorded by
respondents at the time of consumption. Foods are then
assigned to categories for coding; the coded foods are
converted to nutrients by multiplying the amount of food
eaten by the nutrient content per 100 g. The nutrient
content information is obtained from reference data on food
composition, which are maintained and updated periodically
by the U.S. Department of Agriculture (USDA).
In the following sections, the sources of error and bias
are broken down according to random sampling errors, errors
in reporting food intake, and errors in the food composi-
tion tables.
VARIABILITY DUE TO SAMPLING OF RESPONDENTS
In this discussion, errors in selecting respondents are
presumed to arise randomly from an unclustered, unstrati-
fied random probability sample of the population. In this
case, the prevalence estimate, p, has a standard error of
$[p(1 - p)/N]^{1/2}$, where N is the size of the sample.
These are presented in Table 8-1 and compared to the
increases in standard errors when random variability in
food intakes and food nutrient compositions are also taken
into account. The proportional increases are relatively
small and would be even smaller if one were to use the
standard errors of the NFCS, which are slightly higher due
to clustering.
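
As a minimal sketch of the standard error formula above, the following Python function evaluates $[p(1 - p)/N]^{1/2}$ for a simple random sample. The function name and the example values (a prevalence of 7.2% in a sample of 2,500 people) are hypothetical and are not taken from the NFCS.

```python
import math


def prevalence_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p from a simple random sample of size n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical example values, chosen only for illustration.
se = prevalence_standard_error(0.072, 2500)
print(f"standard error = {100 * se:.2f} percentage points")
```

The formula applies only to an unclustered, unstratified sample; a clustered design such as the NFCS would call for a somewhat larger standard error.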
RANDOM VARIABILITY IN FOODS CONSUMED
When sampling a population at random, many people are
sampled for several days, and the amount of food eaten is
classified according to many different food items or cate-
gories. The following system of notation will be used to
examine the random variability: Different individuals will
be denoted by the index i, different days by the index j,
and different food items by the index k. For example, let
$A_{ijk}$ denote the amount of food eaten by the ith individual
on the jth day of the kth food item category. When a
random components model is used to model the errors resulting
from a random sample, $A_{ijk} = \mu_k + I_{ik} + D_{ijk}$,
where $\mu_k$ is the population mean amount of the kth food
item eaten in one day and $I_{ik}$ is the difference between
the average amount of the kth food item eaten by individual
i and the population mean. $I_{ik}$ can be thought of as
a random variable with values varying across a population
centered at zero, i.e., $E(I_{ik}) = 0$, and with variance equal
to $\sigma^2(I_k)$. The value $\sigma^2(I_k)$ is called the interindividual
variability for that food. The term $D_{ijk}$
refers to the difference in the amount of the kth food
eaten on the jth day for the individual i and the average
amount eaten by the individual i. The values $D_{ijk}$ are
also considered to be random variables varying across days
for the same individual with mean zero and variance
$\sigma^2(D_k)$. This is called the intraindividual variation
for that food.
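
The random components model above can be illustrated with a short Python sketch for a single food item k. It simulates amounts of the form population mean plus individual effect plus day effect, and then recovers the interindividual and intraindividual variances with a balanced one-way analysis of variance. The normal distributions, the parameter values, and the function names are assumptions made only for illustration; three days per person mirrors the three days of observation described for the NFCS.

```python
import random
import statistics


def estimate_variance_components(data):
    """Balanced one-way ANOVA; data[i][j] is the amount for individual i on day j.

    Returns (grand mean, interindividual variance, intraindividual variance).
    """
    n = len(data)          # number of individuals
    m = len(data[0])       # number of days per individual
    person_means = [statistics.fmean(row) for row in data]
    grand_mean = statistics.fmean(person_means)
    ss_within = sum((a - pm) ** 2
                    for row, pm in zip(data, person_means) for a in row)
    ms_within = ss_within / (n * (m - 1))
    ms_between = m * sum((pm - grand_mean) ** 2 for pm in person_means) / (n - 1)
    var_intra = ms_within
    # Method-of-moments estimate, truncated at zero.
    var_inter = max((ms_between - ms_within) / m, 0.0)
    return grand_mean, var_inter, var_intra


def simulate(n_people, n_days, mu, sd_inter, sd_intra, seed=0):
    """Draw amounts mu + I_i + D_ij with normal I_i and D_ij (an assumption)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_people):
        individual_effect = rng.gauss(0.0, sd_inter)
        data.append([mu + individual_effect + rng.gauss(0.0, sd_intra)
                     for _ in range(n_days)])
    return data


# Hypothetical parameter values, chosen only for illustration.
sample = simulate(n_people=2000, n_days=3, mu=100.0, sd_inter=30.0, sd_intra=40.0)
mean, var_inter, var_intra = estimate_variance_components(sample)
print(f"mean={mean:.1f}  inter SD={var_inter ** 0.5:.1f}  intra SD={var_intra ** 0.5:.1f}")
```

With only a few days per person, the spread of individual means overstates the interindividual variation, which is why the within-person mean square is subtracted before dividing by the number of days.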
TABLE 8-1. The Mean and Standard Errors of Proportion with
Inadequate Intake Resulting from Errors in Food
Composition Tables for Different Nutrients for
Males and Females, Contrasted to Estimates
Obtained with the Delta Method when Table
Errors Were Not Considered

                               Proportion with Inadequate Intake, %
                               (mean ± standard error)
                Error in
Nutrient        Estimate       Male             Female
Protein         Delta^a         1.2 ± 0.19       7.2 ± 0.52
                FCT^b           1.2 ± 0.32       7.2 ± 1.1
Iron            Delta           2.7 ± 0.31      NA^c
                FCT             3.1 ± 2.2       NA
Vitamin C       Delta          43.0 ± 1.16      55.4 ± 0.99
                FCT            43.0 ± 1.55      55.4 ± 1.13
Vitamin A       Delta          59.4 ± 1.27      59.8 ± 1.03
                FCT            59.5 ± 2.18      59.7 ± 1.89
Thiamin         Delta          32.2 ± 1.04      NA
(mg/day)        FCT            32.2 ± 5.85      NA
Thiamin         Delta           2.6 ± 0.36      NA
(kcal/day)      FCT             3.4 ± 2.49      NA
Vitamin C       Delta           0.25 ± 0.06      1.13 ± 0.16
(minimum        FCT             0.29 ± 0.17      1.18 ± 0.47
requirements)

^a Delta denotes error in estimates from variation in
survey as obtained using the delta method (Bickel and
Doksum, 1977) and includes sampling and reporting error
only.
^b FCT denotes error in estimate resulting from food
composition tables, and includes sampling and reporting
error.
^c NA = Data were not available to the subcommittee.
Not all errors and biases will arise from random sam-
pling. Some derive from the methods used in determining
the amount of nutrient that a person consumes each day.
Errors in reporting of foods eaten also require attention.
Previously, the amount of food consumed in one day was
denoted by $A_{ijk}$. The actual amount of food reported by
the subject will be denoted by
$A^*_{ijk} = A_{ijk} + RI_{ik} + RR_{ijk} + RB_k$.
$RR_{ijk}$ denotes the random error
within an individual in reporting foods. The quantity is
assumed to vary at random from day to day, and is centered
at zero for each individual with variance equal to
$\sigma^2(RR_k)$.
Some people may consistently overreport or underreport
certain foods. The average of this consistent over- or
underreporting of the kth food type across a population
will be denoted by $RB_k$, and the ith individual's over-
or underreporting will be denoted by $RI_{ik}$. The variable
$RI_{ik}$ varies across individuals in a population and is
centered at zero with a variance of $\sigma^2(RI_k)$. The
value of $RB_k$ is assumed to be a constant.
Random error in food reporting enters into intraindi-
vidual variation. Because the adjustment of the intake
distribution described in Chapter 4 separates interindividual
variation from intraindividual variation, this type
of intraindividual reporting error will have no effect on
the estimation of prevalence.
Consistent under- or overreporting of food intake will
be part of the interindividual variation and will not be
removed in the adjustment of intake data. Thus, it can
affect the estimate of prevalence. The value $\sigma^2(RI)$, if
it exists, would add to the true interindividual
variation and, hence, would artificially inflate the
spread of the actual intake distribution. The standard
deviation for the intake distribution, which should be
$\sigma(I)$, will be estimated by $[\sigma^2(I) + \sigma^2(RI)]^{1/2}$.
Unless $\sigma^2(RI)$ is substantially large, this will have
little effect on the prevalence estimates. For example,
the coefficient of variation (CV) of the interindividual
variation for many of the nutrients ranges from 30% to 50%
(see Appendix A). If the over- and underreporting errors
are symmetrical but on the order of 10%, so that $\sigma(RI)$
has a CV of about 10%, then this would inflate the CV from
30% to 32% or from 50% to 51%. Similarly, if the reporting
errors are on the order of ±20%, then this would inflate
the CV from 30% to 36% or from 50% to 54%.
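
The arithmetic behind these figures can be checked with a few lines of Python. The sketch assumes that the reporting error is independent of the true interindividual variation and that both CVs are expressed relative to the same mean, so that the variances simply add; the function name is hypothetical.

```python
import math


def inflated_cv(cv_true: float, cv_reporting: float) -> float:
    """CV of the observed distribution when an independent reporting error
    with the given CV is added to the true interindividual variation."""
    return math.sqrt(cv_true ** 2 + cv_reporting ** 2)


for cv_true in (30.0, 50.0):
    for cv_reporting in (10.0, 20.0):
        observed = inflated_cv(cv_true, cv_reporting)
        print(f"true CV {cv_true:.0f}%, reporting CV {cv_reporting:.0f}% "
              f"-> observed CV {observed:.1f}%")
```

Because the components combine as the square root of a sum of squares, a reporting error that is small relative to the true variation has a disproportionately small effect on the observed spread.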
This is not true, however, if the over- and underreporting
is not symmetrical, that is, if there is an overall
systematic bias in reporting for the entire population,
so that the bias term $RB_k$ is not equal to zero. Sensitivity
analyses have shown that changes in the mean could have a
substantial effect on the estimate of prevalence. Hence,
systematic over- or underreporting of certain foods by a
population must be taken very seriously.
VARIABILITY IN FOOD COMPOSITION DATA
Using statistical notation, one can summarize the
errors and biases that may occur in the compilation of
food composition tables. When the amount of nutrient per
100 g of a food item is to be measured, the analysis is
performed on a theoretically representative sample of the
food. Although the food composition tables (USDA, 1976-
1984) give a single number representing the mean nutrient
content per 100 g of the food item, the importance of the
distribution of nutrients per food item must be recognized.
To examine the impact of possible errors in these data,
let us denote $F_k$ as the true mean nutrient content for
the distribution of the kth food item. Let $FR_{ijk}$ denote
the difference between the mean nutrient content $F_k$ and
the actual amount of nutrient in the kth food eaten by the
ith individual on the jth day. The variable $FR_{ijk}$ is
assumed to be randomly distributed with a mean of zero and
a variance of $\sigma^2(FR_k)$. The variance represents the true
variability of nutrient content that is found within a
population of certain types of food items. It is assumed
that the foods eaten from day to day are random samples
from this distribution.
If people do not randomly select their foods from a
group of specific food items but, rather, systematically
and regularly select specific items (e.g., a certain brand
of fortified cereal rather than samples of many kinds of
cereals), then bias will be introduced. This bias will be
denoted by the term $FB_{ik}$, the difference between the
average amount of nutrient that the ith individual eats
and the population mean $F_k$. It can be assumed that the
variable $FB_{ik}$ varies from individual to individual and
is centered at zero with a variance of $\sigma^2(FB_k)$.
Finally, Ck will be used to denote the difference
between the true mean nutrient content Fk and the content
as estimated from the food composition tables. The value
Ck includes many components of error, such as laboratory
error, sampling error, and biases, in relation to foods for
laboratory analysis. Estimates of the nutrient content of
foods are obtained by averaging the content of food samples.
Ideally, the recorded nutrient intake equals the true
amount of food eaten multiplied by the true nutrient con-
tent of the foods and summed over all food items. Hence,
the actual amount of nutrient intake for the ith indi-
vidual on the jth day could be expressed as
$$N_{ij} = \sum_k (\mu_k + I_{ik} + D_{ijk})(F_k + FB_{ik} + FR_{ijk}).$$
The measured nutrient content is the amount of food
reported, multiplied by the nutrient content of each food
as given in the food tables and summed over all food items.
The following expression describes the measured nutrient
content of the ith individual on the jth day:
$$N^*_{ij} = \sum_k (\mu_k + I_{ik} + D_{ijk} + RB_k + RI_{ik} + RR_{ijk})(F_k + C_k).$$
The difference between the true nutrient intake and the
measured nutrient intake is:
$$N^*_{ij} - N_{ij} = \sum_k (FB_{ik} + FR_{ijk} + C_k)(\mu_k + I_{ik} + D_{ijk}) + \sum_k C_k(RB_k + RI_{ik} + RR_{ijk}).$$
When there is no systematic bias in the reporting of
foods, the following conditions apply: $RB_k$ and $RI_{ik}$
are both equal to zero, and there is no systematic bias in
the ways individuals select particular kinds of food, i.e.,
$FB_{ik}$ is equal to zero. Under these constraints, where
the only errors are random,
$$N^*_{ij} = N_{ij} + \sum_k FR_{ijk}(\mu_k + I_{ik} + D_{ijk}) + \sum_k C_k(RR_{ijk} + D_{ijk}) + \sum_k C_k \mu_k + \sum_k C_k I_{ik}.$$
This can be written as
$$N^*_{ij} = N_{ij} + X + Y_i + Z_{ij},$$
where $X = \sum_k C_k \mu_k$, $Y_i = \sum_k C_k I_{ik}$, and
$$Z_{ij} = \sum_k FR_{ijk}(\mu_k + I_{ik} + D_{ijk}) + \sum_k C_k(RR_{ijk} + D_{ijk}).$$
We now turn to estimating the effects of these errors on
prevalence estimates.
EFFECT OF RANDOM STATISTICAL ERROR ON ESTIMATION OF
PREVALENCE
The amount of nutrient intake is estimated for each
person on each day of the survey from information about
food consumed obtained in the survey together with the
food composition tables. As described in detail in
Appendix C, two approaches for estimating prevalence have
been suggested: the parametric approach, which assumes that
the distribution of nutrient intake, or some transformation
of the data, is normal, and the nonparametric approach,
which does not make this assumption.
The nonparametric approach would probably be the
preferred method for estimating prevalence; however, the
statistical methods used are much more difficult to model
than those in the parametric approach. For this reason, the
parametric approach is used in this chapter to generate an
approximate measure of the degree of variability in the estimate
of prevalence. Where the estimates of prevalence calculated
in the two approaches differ, the difference should be only
slight; however, in such a case the estimate obtained with
the nonparametric approach is the one of choice.
As indicated in Appendix C, prevalence estimates based on
the parametric approach are derived from the population
mean and interindividual variation of nutrient intake, which
are obtained from an analysis of variance (ANOVA) of the
nutrient data. If the nutrient content recorded for each
subject on each day of observation is exactly correct, then
the only error in estimating prevalence would be statistical
fluctuation resulting from random sampling. The magnitude
of this statistical fluctuation will be measured by the
standard error of the estimate.
The formulas and theory necessary to find the standard
error of the prevalence estimate are given in Appendix C.
The assumption was made that the distribution of actual
intakes was log-normal.
The log-normal assumption may not always be appropriate,
in which case a larger class of transformations
should be considered (Box and Cox, 1964). However,
the major purpose of this exercise is to get some sense of
the degree of statistical variation in the estimation procedure.
For this purpose, the log-normal assumption will be
adequate. To obtain 95% confidence intervals, the estimate
± 2 standard errors could be used.
As was noted previously, the amount of nutrient recorded
does not exactly reflect the amount of nutrient ingested.
In fact, even when there is no systematic bias in the
reporting or choices of foods eaten,
$$N^*_{ij} = N_{ij} + X + Y_i + Z_{ij},$$
where Nij* is the amount of nutrient reported from the
ith individual on day j and Nij is the actual amount of
nutrient for the ith individual on day j.
The component Zij is incorporated as part of the day-
to-day variability and will be taken out by the analysis of
variance. Hence, Zij will not have any effect on the
estimate of standard error of proportion with inadequate
intake.
In the probability approach, an analysis of variance of
the true Nij should be used to estimate the population
mean of the nutrient and interindividual variation. These
estimates are then used to estimate the proportion with
inadequate intake. In actuality, however, the analysis of
variance is made on the Nij*. Hence, the population mean
that is being estimated is the true population mean $\sum_k F_k \mu_k$
plus the value X, which is a realization of the error terms
coming from the food tables. Also, the interindividual
variation that is being estimated is equal to the true
interindividual variation plus the variance of Yi.
$Y_i$ has a minimal effect on the estimate of interindividual
variation and almost no effect on the estimate of
proportion with inadequate intake. Therefore, we shall only
consider the effect of error term X on the estimate of
proportion with inadequate intake.
The proportion of the population with inadequate intake,
say P, is a function of the population mean $\mu$ and
interindividual variation $\sigma_I^2$. That is, $P = S(\mu, \sigma_I^2)$. As
mentioned previously, we are estimating $P^* = S(\mu + X, \sigma_I^2)$,
where $X = \sum_k C_k \mu_k$ can be thought of as a random variable
with a mean of zero and a variance equal to $\sum_k \mu_k^2 \mathrm{Var}(C_k)$.
To derive some sense of how much P* could be expected to
vary from P, a sensitivity analysis was performed in the
following manner. First, it was necessary to assign an
approximate value for the variance, which will be denoted
by $\sigma_X^2 = \sum_k \mu_k^2 \mathrm{Var}(C_k)$. (More will be said about this later.)
Random values $X_1, X_2, \ldots, X_{500}$ are generated from the
distribution of X, assumed to be normally distributed with
mean zero and variance $\sigma_X^2$. Values of $P_i^* = S(\mu + X_i, \sigma_I^2)$
were computed, as were their mean and standard deviation.
The values of $\mu$ and $\sigma_I^2$ are estimates obtained from the
original analysis of variance. Although the exercise will
not produce precise estimates of the standard error
resulting from food composition tables, it can be used to
assess the impact of errors in food tables on the estimates
of the prevalence of inadequate intake.
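
The sensitivity analysis described above can be sketched in Python as follows. The chapter does not give the form of the function S, so the sketch substitutes a simple stand-in: intake and requirement are assumed to be independent and normally distributed, in which case the proportion with intake below requirement has a closed form. The chapter itself assumes log-normal intakes, so this is a simplification. All numerical inputs are hypothetical except the mean protein requirement of 43 g/day and its CV of 15%, which are the values quoted in the footnote to Table 8-3.

```python
import math
import random
import statistics
from statistics import NormalDist


def prevalence(mean_intake, sd_intake, mean_req, sd_req):
    """Stand-in for S: P(requirement > intake), assuming independent
    normal intake and requirement distributions (an assumption)."""
    z = (mean_req - mean_intake) / math.sqrt(sd_intake ** 2 + sd_req ** 2)
    return NormalDist().cdf(z)


def sensitivity(mean_intake, sd_intake, mean_req, sd_req, sd_x,
                n_draws=500, seed=0):
    """Shift the mean by n_draws normal errors X_i with SD sd_x and return
    the mean and standard deviation of the resulting prevalence values."""
    rng = random.Random(seed)
    draws = [
        prevalence(mean_intake + rng.gauss(0.0, sd_x),
                   sd_intake, mean_req, sd_req)
        for _ in range(n_draws)
    ]
    return statistics.mean(draws), statistics.stdev(draws)


# Hypothetical intake mean/SD and food-table error SD, for illustration only.
mean_p, sd_p = sensitivity(mean_intake=90.0, sd_intake=30.0,
                           mean_req=43.0, sd_req=0.15 * 43.0, sd_x=3.0)
print(f"prevalence: mean {100 * mean_p:.2f}%, SD {100 * sd_p:.2f}%")
```

The standard deviation of the simulated prevalences plays the role of the additional standard error attributable to the food composition tables. It does not shrink as the survey sample grows, because X is a single realization shared by every respondent.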
To estimate $\sigma_X^2$, the variance of X, the standard error in the mean
nutrient composition was obtained for a typical diet. The
most recent set of reference tables on food composition that
have been published by the USDA (1976-1984) provide some
information about the number of samples analyzed and the
standard error of the mean for some foods. Using methodology
similar to that described in Appendix E (using
standard error instead of standard deviation), the subcommittee
obtained a rough approximation of the standard
error in the mean nutrient consumed in a sample diet as a
result of random sampling of foods from the food composition
table.
In all cases, the estimation errors relating to the errors
in food composition tables are larger than errors resulting
from the survey data. The effect of the errors in the food
table on estimates of prevalence cannot be diminished by
larger surveys. Improvement can be made only with more
accurate food tables.
IMPACT OF RANDOM UNDER- AND OVERREPORTING
A number of the dietary methodology studies reported in
Chapter 6 suggest that there may be under- and overreporting
of intake. This is to be distinguished from
systematic misreporting by a population or population group
(see Chapter 7). If the random element relates to
individual reporting from day to day, the effect will be
removed during the process of adjusting the distribution to
remove the impact of day-to-day variation. However, if some
people systematically underreport while others systematically
overreport, the between-individual variance will be
incorporated in the estimate of the distribution of usual
intake. This effect can be expected to have an impact on
estimates of the prevalence of inadequate intake. The
subcommittee used a series of simulations to examine the
nature and magnitude of the impact.
To provide some perspective on the potential magnitude of
interindividual random under- and overreporting, Table 8-2
portrays, using simulation techniques, the effects that might
be seen in population data if there is bias in reporting by
an individual. A comparison of observed and reported intakes
for single meals is discussed by Schnakenberg et al. (1981).
In their data, there was an apparent overall bias toward
underestimation. Of more importance for the present purpose,
TABLE 8-2. Magnitude of Expected Effect of Random Under-
and Overreporting in Population Data^a

Coefficient     Distribution of Deviations Between Recorded and
of Variation    True Intake (% of Subjects Exhibiting Deviation)
(% of Mean)      30%     25%     20%     15%     10%      5%
  5              2.7     3.4     4.2     5.2     6.4     8.2
 10              5.3     6.7     8.4    10.4    12.8    16.5
 15              7.8    10.1    12.6    15.5    19.2    24.7
 20             10.5    13.5    16.8    20.7    25.6    32.9
 25             13.1    16.8    21.0    25.9    32.0    41.1
 30             15.8    20.2    25.2    31.1    38.4    49.4

^a The body of the table displays the magnitude of the
deviation between recorded intake and a measure of true
intake that would be expected in the proportion of subjects
specified in the column. Deviations of the magnitude
shown or greater would be expected. Assumes a
Gaussian distribution for this simulation.
the standard deviation of misreporting expressed as a
proportion of observed intake was 28% of the mean score
(32~) for protein. This demonstrates the magnitude of
under- and overreporting of single meals. Unless there is
evidence that each person is consistent in his or her
misreporting, the measured variance must include an
unmeasured proportion of intraindividual variation. That
is, the CV of 30% overestimates the error that would be
expected in estimates of the distribution of usual intake
(i.e., intake across multiple meals per day and many days).
Many other studies cited in Chapter 7, in which one dietary
methodology was compared with another in reporting under-
and overestimation, fail to take into account the impact of
differences in the number of days of observation. Thus, the
literature does not provide a direct estimate of the magni-
tude of under- and overreporting that might be expected in
data sets adjusted to remove day-to-day variability of
intake. The report of Schnakenberg et al. (1981) indicates
that a realistic worst-case situation would be a CV of 20%
for misreporting the estimate of usual intake distributions.
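
The entries of Table 8-2 can be approximated analytically. The sketch below reads each column as a one-sided tail: the deviation (as a percentage of the mean) that is exceeded in the stated fraction of subjects when the reporting error is Gaussian with the given CV. That one-sided reading is an interpretation rather than something stated in the table, and the function name is hypothetical. The table itself came from simulation, so small differences from the printed values are to be expected.

```python
from statistics import NormalDist


def deviation_exceeded(cv_percent: float, fraction_of_subjects: float) -> float:
    """Deviation (% of mean) exceeded by the given fraction of subjects,
    assuming a Gaussian error with the given CV (one-sided tail)."""
    z = NormalDist().inv_cdf(1.0 - fraction_of_subjects)
    return cv_percent * z


fractions = (0.30, 0.25, 0.20, 0.15, 0.10, 0.05)
for cv in (5, 10, 15, 20, 25, 30):
    row = "  ".join(f"{deviation_exceeded(cv, f):5.1f}" for f in fractions)
    print(f"CV {cv:2d}%: {row}")
```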
The potential effects of random interindividual mis-
reporting on estimates of the prevalence of inadequate
intakes can be simulated as shown in Table 8-3. The
adjusted distributions of usual intake for protein and
for vitamin C (Appendix A) were further adjusted to
incorporate or to remove the effect of a component of
random variation. This was done by using ratios of standard
deviations of the derived distribution of usual
intakes. The standard deviations adjusted to add or
remove a variance component are used in the same manner
as reported in Appendix A, except that the distributions
were not normalized first. The approach preserves the
skew of the distribution of usual intakes. The proba-
bility approach was then used to derive estimates of the
apparent prevalence of inadequate intakes.
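
One plausible reading of this procedure is sketched below in Python. Each usual intake is moved toward or away from the mean by the ratio of the adjusted to the original standard deviation, which changes the spread while leaving the mean and the shape (skew) of the distribution unchanged. The probability approach is then applied by averaging, over individuals, the probability that a normally distributed requirement exceeds the individual's intake. The function names, the simulated log-normal sample, and the normal requirement distribution are assumptions; the requirement mean of 43 g/day and CV of 15% are the values given in the footnote to Table 8-3. The output is illustrative and is not expected to reproduce Table 8-3.

```python
import math
import random
import statistics
from statistics import NormalDist


def rescale_spread(intakes, cv_error, remove=False):
    """Add (or remove) an independent variance component whose SD is
    cv_error times the mean, by rescaling deviations from the mean."""
    mean = statistics.fmean(intakes)
    sd = statistics.pstdev(intakes)
    var_error = (cv_error * mean) ** 2
    new_var = sd ** 2 - var_error if remove else sd ** 2 + var_error
    if new_var <= 0:
        # Error term is at least as large as the observed variation.
        raise ValueError("error variance is not smaller than observed variance")
    ratio = math.sqrt(new_var) / sd
    return [mean + (x - mean) * ratio for x in intakes]


def apparent_prevalence(intakes, mean_req, cv_req=0.15):
    """Probability approach: mean over individuals of P(requirement > intake),
    assuming a normal requirement distribution."""
    req = NormalDist(mean_req, cv_req * mean_req)
    return statistics.fmean([1.0 - req.cdf(x) for x in intakes])


# Hypothetical skewed sample of usual protein intakes (g/day).
rng = random.Random(0)
usual = [rng.lognormvariate(math.log(90.0), 0.3) for _ in range(5000)]

print(f"unadjusted: {100 * apparent_prevalence(usual, 43.0):.1f}%")
for cv in (0.05, 0.10, 0.20):
    added = apparent_prevalence(rescale_spread(usual, cv), 43.0)
    removed = apparent_prevalence(rescale_spread(usual, cv, remove=True), 43.0)
    print(f"CV {100 * cv:.0f}%: added {100 * added:.1f}%, removed {100 * removed:.1f}%")
```

The `ValueError` branch corresponds to the case in which the error term equals or exceeds the estimated interindividual variation, where no adjusted distribution can be computed.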
From these simulations it can be seen that random
interindividual under- and overreporting of intake can
affect the prevalence estimate. Consideration of the
underlying theory indicates that the magnitude of the
effect will depend on the magnitude of the variance of
this component in relation to the true interindividual
variation as well as on the means of the intake and
requirement distributions. In the case of protein, with
the estimated variation of usual intakes (see Appendix
TABLE 8-3. Potential Impact of Random Interindividual
Misreporting of Intake on the Estimate of
Prevalence of Inadequate Intakes in Adult
Males

CV^a of       Impact of Addition         Impact of Removing
Random        of Error Term (%)^b        Error Term (%)
Error (%)     Protein    Vitamin C       Protein    Vitamin C
  0            2.1        41.0            2.1        41.0
  5            2.3        41.1            2.0        41.0
 10            2.8        41.3            1.5        40.8
 15            3.7        41.6            0.8        40.4
 20            4.9        42.1            0.2        39.8
 25            6.5        42.6            0          39.1
 30            8.3        43.1            ^c         38.0

^a CV = coefficient of variation.
^b Values are the apparent prevalences of inadequate
intake computed by the probability approach. Mean
requirement of protein taken as 43 g/day and mean
requirement of vitamin C taken as 46.2 mg/day. CV of
requirement taken as 15% of the mean.
^c Cannot be computed. In this case the error term would
be equal to or greater than the estimated interindividual
variation, an impossible situation.
A) much smaller than that of vitamin C, the impact of
adding or removing a component of variance would be much
greater. The requirement and intake distributions for
protein are also much more widely separated than in the
case of vitamin C, which would accentuate the effect.
Protein then can be used as a worst-case scenario.
Table 8-3 also gives the potential effects of removing
an independent variance component. In the NFCS data set,
the potential impact of removing a component of variance is
of particular interest. Removing the variance component
permits examination of any bias that might be present in
the estimate of prevalence because of random under- and
overreporting present in the original data set, whereas
addition of variance would arise only if there was evidence
of a negative correlation between intake and overestimation.
In this case, the correlation itself would have to be con-
sidered.
The 20% estimate of the CV used here would be a generous
estimate of the possible variation attributable to this
source, and the analysis shown in Table 8-3 suggests that
the impact, although real, would not be serious. The
prevalence estimate for protein might fall from 2.1% to
0.2%, both of which would be considered very low preva-
lences. This analysis has used a worst-case scenario with
a high error term and a prevalence of inadequate intake
that falls in the tail of the intake distribution. In
contrast, the prevalence estimate for vitamin C might
change from 41.0% to 39.8%, an operationally undetectable
change. Only if it could be argued that the random error
greatly exceeds the real variation, after day-to-day vari-
ability had been factored out, would the magnitude of the
error be totally unacceptable for the purpose of survey
data interpretation. Again the subcommittee emphasizes
that this phenomenon is quite different from systematic
under- or overreporting across individuals. That effect is
discussed as bias in the estimate of intake earlier in this
chapter.