Design Issues in the Gulf War Veterans Health Study
Prepared for the Institute of Medicine Committee on Measuring the Health of Gulf War Veterans.
RAND, Santa Monica, California
Robert O. Valdez
School of Public Health, University of California at Los Angeles
The design for the Gulf War Veterans Health Study poses a variety of challenges. In order to study changes in the Gulf War (GW) veterans' health status over time, a panel design (also known as the prospective cohort design) is indicated. Since the study also aims to examine the levels of the GW veterans' health status at various time points, consideration needs to be given to the cross-sectional representativeness of the panel; thus a repeated panel design should be considered as a potential alternative to a permanent panel design. Due to the anticipated deterioration in the quality of the locating information on the GW veterans, recruiting the GW veterans will likely require a substantial effort to track and trace the sampled participants, making it unattractive to use designs such as a rotating panel that require repeated recruitment of new panels. Given the closed nature of the GW veteran population (there are no new entries), it is important for the study to provide timely information of value to the GW veterans during their lifetime. Thus, the study should be designed with more frequent data collection in the early years, when the information obtained has a longer "useful life." Based on the consideration of various trade-offs, three of the most promising designs are the permanent panel design, the repeated panel design, and a combination of the two. A promising design is to recruit an initial panel and follow the panel every 3 years for three waves. An assessment shall be made after the third wave to evaluate the quality of the panel and to determine whether to continue following the same panel, to switch to a new panel, or to take a combination of the two. The survey frequency might be reduced in the second decade and beyond.
The design of a study is usually determined by the research questions that need to be addressed, and the target population to be studied. The proposed Gulf War Veterans Health Study (GWVHS) aims to address the following research questions:
How healthy are Gulf War veterans?
In what ways does the health of Gulf War veterans change over time?
Now and in the future, how does the health of Gulf War veterans compare with:
- The general population?
- Persons in the military at the time of the Gulf War but not deployed?
- Persons in the military at the time of the Gulf War who were deployed to nonconflict areas?
- Persons in the military deployed to other conflicts, such as Bosnia, Somalia, and so on?
What individual and environmental characteristics are associated with observed differences in health between Gulf War veterans and comparison groups?
Since the study aims to address both the levels at specific time points (the first and third research questions) and changes over time (the second and fourth research questions) for Persian Gulf veterans' health status, we will consider study designs appropriate for both types of research questions. In particular, we will consider repeated cross-sectional surveys and various panel survey designs (also known as prospective cohort designs).
Unlike the general population (either civilian or military), Gulf War (GW) veterans are a closed population: there is no birth, enlistment, or migration into this population. The membership in this population was determined by the participation in the Persian Gulf War, that is, those who served in the Gulf War theater of operations between August 2, 1990, and June 13, 1991. The closed nature of the GW veteran population has important implications for the study design, such as on the merits of replenishing the panel.1 Further discussions are given in the section on temporal structure below.
Given the closed nature of the GW veteran population, it is important for the GWVHS to provide timely information for the GW veterans: the information obtained in the GWVHS will be of value to them only during their lifetime. This objective gives the GWVHS a stronger focus on public health (to serve the GW veterans) than on basic science research (to obtain scientific knowledge applicable to future patients). Timeliness of information should therefore be taken into consideration in the design: the GWVHS should be designed with more frequent data collection in the early years, when the information obtained has a longer "useful life" to the GW veterans, and less frequent data collection in later years, when the information has a shorter "useful life." Further discussions are given in the section on Survey Frequency.
In order to strengthen our ability to understand the health problems of GW veterans and their trajectories, it is worth considering a case-control component in GWVHS. The comparison group would consist of patients with medically unexplained physical symptoms in the same geographical areas as the GW veterans in the GWVHS sample. Geographical matching has an important implication for the study design, namely, the extent to which the GWVHS sample should be clustered geographically. Further discussions are given in the section on Survey Modality and Geographical Clustering.
Another important unique feature about the GW veteran population is that our ability to locate individuals in this population is likely to deteriorate over time. While the Department of Defense (DoD) maintains locating information on record for all veterans, including GW veterans, the accuracy of this information is likely to decrease over time. As this report is being written, 8 years have elapsed since the Gulf War. We anticipate, therefore, that a substantial tracking and tracing effort is necessary to recruit a representative sample of GW veterans, and a substantial level of nonresponse will still occur despite this effort. This feature has important implications for the design of GWVHS, making it less appealing to recruit multiple cohorts into the study.
In population studies of Gulf War veterans conducted to date, response rates ranged from a low of 31% in the study conducted by Stretch et al. (1995) to 97% of those located in a survey of women who served in the U.S. Air Force during the Gulf War conducted by Pierce (1997). Further details are given in Chapter 5 of the report. A low response rate is a concern for the validity of the data, therefore it is important to engage in efforts to increase the response rate, using survey research tools such as tracking and tracing of participants, and incentives. Further discussions on nonresponse, tracking, and tracing can be found in the section "Nonresponse, Attrition, Tracking, and Tracing."
In order to help interpret the health status of GW veterans (especially the changes over time), several comparison groups (listed under the third research question above) will be included in GWVHS. Those comparison groups will be recruited and surveyed using the same design to be used for GW veterans to maximize the comparability. Since the membership in the GW veteran population versus the comparison groups is not randomized, the comparison will be vulnerable to potential selection bias problems common to all observational studies: GW veterans might be different from the comparison groups even in the absence of the Gulf War experience. In order to account for such differences, the
best possible effort needs to be made to collect data on potential confounding factors.
Based on the consideration of various trade-offs in the ability of various study designs to accomplish the objectives for the GWVHS, three of the most promising designs are the permanent panel design, the repeated panel design, and a combination of the two. A promising design is to recruit an initial panel and follow them every 3 years for three waves. An assessment shall be made to evaluate the quality of the panel at the end of the third wave, to determine whether to continue following the same panel, to switch to a new panel starting in year 10,2 or to adopt a combination of the two. The survey frequency might be reduced in the second decade and beyond. Note that the final design decision can be made after the first three waves have been fielded; thus it can be (and should be) based on the actual cost data obtained in the field.
In order to facilitate the recruitment of the second panel if warranted, it is worth considering recruiting a "reserve" sample along with the initial panel, giving them a brief enrollment interview to collect contact information, and maintaining contact with them over time through tracking. This provision will reduce the potential deterioration in our ability to locate and contact the GW veterans who were not sampled in the initial panel.
Temporal Structure for Survey Studies
Given the dual goals in GWVHS to study both levels and changes for GW veterans' health status, it is important to consider the structure of the survey study design both in terms of the units (individuals) to be surveyed and the time(s) those units are to be surveyed. We discuss in this section the candidate designs, the pros and cons for those designs, and specific considerations for the GWVHS.
Taxonomy of Survey Studies
The temporal structure for survey study design can be classified according to the following taxonomy,3 listed in order of increasing emphasis on temporal versus unit-specific data collection.
Single cross-sectional survey. A cross-sectional sample of observation units is identified at one time point and surveyed once. This design only allows for the estimation of level parameters, such as the prevalence of medically unexplained physical symptoms, at the specific time point, and usually does not allow for the assessment of prospective changes over time. The survey might include retrospective data items to inquire about past events; thus, it might provide some information on past changes, although the reliability of the retrospective data might be compromised. Additionally, the validity of retrospective data might be affected by mortality and other forms of exit from the target population. For example, in order for patients who experienced a life-threatening disease to be available to report on past changes, they must have survived the disease. Therefore the survival rate estimated from the retrospective data is likely to be severely biased.
Repeated cross-sectional surveys without unit overlap. A cross-sectional sample of observation units is identified at each of several time points; each sample is surveyed once. The samples are drawn with no provisions for overlap; a few overlap cases might occur if the samples are drawn with replacement and the sampling rates are sufficiently high. This design allows for the estimation of level parameters at each of the selected time points, as well as the average of the level parameters over time. In addition, this design also allows for the estimation of net changes in the population parameters over time, such as the increase or reduction of the prevalence of medically unexplained physical symptoms over time. It usually does not allow for the assessment of individual gross changes, such as the persistence of these unexplained symptoms, whether the same patients are affected over time, and the amount of turnover. (Retrospective inquiries might provide some information, but might be of compromised quality.)
Repeated cross-sectional surveys with unit overlap. This design is similar to (2), with the exception that a portion of each subsequent sample is drawn from the previous sample. In other words, the membership in the previous sample is used as a stratifying variable in the subsequent sample, with the units in the previous sample being oversampled relative to the new units. This design has capabilities similar to those of (2); it also allows for the estimation of gross changes on the individual level, using the portion of the sample that overlaps with the previous sample.
Repeated panel surveys with temporal overlap. This design is similar to (2), with the exception that each sample is surveyed several times, usually at regular time intervals, thus each sample serves as a panel. The temporal spans for the panels overlap in time: a new panel is initiated before the previous panel has retired, thus several panels are usually active in the field at the same time. This design is essentially the same as the rotating panel survey design. (There are some fine distinctions between those two designs, but the similarity dominates their differences.) For each time point, the level parameters can be estimated using the panels active at the time, usually including a new panel (just initiated) and several ongoing panels. The quality of the ongoing panels for the level estimates might be compromised by panel artifacts such as attrition and
panel conditioning, to be discussed later. (A similar compromise might also occur in design (3), among the overlap portion of the subsequent samples.) However, the presence of multiple panels in different stages of progression might help mitigate some of those problems. On the other hand, this design usually provides more precise estimates on the changes (both net and gross) than the previous designs.
Repeated panel surveys without temporal overlap. This design is similar to (4), with the exception that the panels do not overlap in time: a new panel is initiated after the previous panel is retired. (Thus the temporal span for each panel is distinct and does not overlap with other panels.) Each panel allows the estimation of level parameters at the time of its initiation, as well as the estimation of those parameters at each follow-up. Similar to (4), this design might also be vulnerable to attrition and panel conditioning, and is likely more vulnerable than (4) at follow-up waves because there are no other panels active at the same time to help mitigate those problems.
Permanent panel survey. A single sample is drawn at one specific time point, then followed for the entire duration of the study. This design provides more information on changes, especially on long-term gross changes on the individual level. (The ability of designs (4) and (5) to provide direct information on long-term gross changes is limited by the duration of the individual panels. It might be possible to "splice" distinct panels to assess long-term gross changes. This usually requires strong assumptions such as the Markovian properties on the nature of the gross changes.) On the other hand, this design is more vulnerable than (4) and (5) to attrition and panel conditioning.
Time series study. A single unit is chosen and followed intensively for the entire duration of the study. This design will provide the most intensive information on gross changes, and practically no information on level parameters.
There are many possible variations on and combinations of the designs listed above. For example, a study focused on a specific disease or condition might follow all cases with the condition to assess their trajectories, and follow a subsample only for the noncases to assess the incidence rates.
Pros and Cons for Alternative Study Designs
Given the dual goals for GWVHS to obtain both level and change estimates, designs (1) and (7) should be ruled out. Among the remaining candidate designs (2)-(6), the sequence in which they appear in the above section on Taxonomy is ranked in increasing order of the emphasis on repeated measurements on the same individuals, that is, the overlap of the cohort over time.
Generally speaking, the more overlap there is across time in the units surveyed, the more information is available on estimating changes. This premise is self-evident for gross changes on the individual level: we observe gross changes
only among the units that overlap in time.4 The permanent panel survey is therefore the preferred design for studies focused on gross changes.
For net changes, the overlap is usually viewed as a plus because the same individuals serve as their own controls over time. This is illustrated in the following simple model:
Yit = α + tβ + θi + εit, i = 1, . . ., n; t = 0,1,
where Y denotes the outcome of interest, α denotes the baseline population status, β denotes the net change of interest, θ denotes the time-invariant individual variation, and ε denotes the temporal random error. With less overlap, such as in repeated cross-sectional surveys, distinct individuals are observed at times 0 and 1. The net change is estimated using the difference between the two sample means:

β̂ = Ȳ1 - Ȳ0,

where the subscripts 0 and 1 refer to the distinct cross-sectional samples. The uncertainty in the estimated net change includes both the sampling error in the cross-sectional samples, θ, and the temporal random error, ε. With more overlap, such as in panel surveys, the same individuals are observed at times 0 and 1, thus the sampling error θ is cancelled when we compare times 0 and 1, resulting in a more precise estimate for the net change.5
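This precision argument can be checked in a small simulation sketch (the parameter values below are assumed for illustration only, not GWVHS data), comparing the two estimators of the net change β under the model above:

```python
import numpy as np

# Simulation sketch (assumed parameter values): compare the precision of the
# net-change estimate under a repeated cross-sectional design versus a panel
# design, using Y_it = alpha + t*beta + theta_i + eps_it.
rng = np.random.default_rng(0)
alpha, beta = 50.0, 2.0        # baseline level and true net change
sd_theta, sd_eps = 10.0, 3.0   # individual variation vs. temporal random error
n, reps = 200, 5000

est_cross, est_panel = [], []
for _ in range(reps):
    # Repeated cross-sections: distinct individuals at times 0 and 1.
    y0 = alpha + rng.normal(0, sd_theta, n) + rng.normal(0, sd_eps, n)
    y1 = alpha + beta + rng.normal(0, sd_theta, n) + rng.normal(0, sd_eps, n)
    est_cross.append(y1.mean() - y0.mean())
    # Panel: the same individuals (same theta_i) observed at both times.
    theta = rng.normal(0, sd_theta, n)
    p0 = alpha + theta + rng.normal(0, sd_eps, n)
    p1 = alpha + beta + theta + rng.normal(0, sd_eps, n)
    est_panel.append((p1 - p0).mean())

# The theta term cancels in the panel differences, so the panel estimator's
# standard error is far smaller than the cross-sectional one.
print(f"cross-sectional SE: {np.std(est_cross):.2f}")
print(f"panel SE:           {np.std(est_panel):.2f}")
```

With the individual variation (sd_theta) much larger than the temporal noise (sd_eps), as is typical for health outcomes, the gain from the panel design is substantial.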
For estimating levels, the overlap is usually a disadvantage.6 More specifically, consider the estimation of the average level of Y across the population and also over time. This is usually estimated using the grand mean of all observed Yit's. With repeated cross-sectional surveys, the sampling error variance is reduced by a factor of 2n, because two distinct (and independent) samples are drawn at times 0 and 1. For a panel survey, the sampling error variance is reduced by a factor of n, because the same sample is used at times 0 and 1. Therefore the
estimate based on the panel survey is less precise.7 This comparison is important, for example, for detecting persistent rare conditions: the chance for detecting such conditions is much higher with repeated cross-sectional surveys than panel surveys (because more individuals are surveyed).
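The detection point can be illustrated with a back-of-the-envelope calculation (the prevalence and sample size below are assumed for illustration, not GWVHS estimates):

```python
# Back-of-the-envelope detection probabilities: the chance of observing at
# least one case of a persistent rare condition across two survey waves.
p = 0.001   # assumed prevalence of the persistent rare condition
n = 2000    # assumed number of individuals surveyed per wave

# Panel design: the same n individuals in both waves, so only n distinct chances.
panel_detect = 1 - (1 - p) ** n
# Repeated cross-sections: 2n distinct individuals across the two waves.
cross_detect = 1 - (1 - p) ** (2 * n)

print(f"panel: {panel_detect:.3f}, repeated cross-sectional: {cross_detect:.3f}")
```

Because the condition is persistent, resurveying the same panel members adds no new chances of detection, while fresh cross-sectional samples do.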
A related advantage for the panel survey design is its ability to control for time-invariant unobserved confounding factors. As an illustration, consider the following extension of the earlier model:
Yit = α + tβ + xitγ + wiδ + θi + εit, i = 1,. . ., n; t = 0,1,
where x denotes the predictors of interest, and w denotes unobserved confounding factors (assumed to be time-invariant). If the confounding factors w were observed, it would be possible to control for them in cross-sectional data. With panel data, it is possible to control for unobserved time-invariant confounding factors by taking the difference across time, resulting in the following difference model:
Δi = Yi1 - Yi0 = β + (xi1 - xi0)γ + (εi1 - εi0), i = 1,. . ., n.
We then regress the change in Y (Δ) on the change in x. Assuming that the confounding factors w are time-invariant, they would be cancelled out in the difference model.
It should be noted, though, that the ability of the difference model to control for confounding factors and estimate the effects of interest depends critically on the temporal variation in the predictors of interest. If the predictors x do not vary over time, the difference model does not allow us to estimate their effects. Even if the predictors x do vary over time, the precision for the estimated effects might be poor if the temporal variation in x is small.
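The confounding-cancellation argument can be sketched in a small simulation (the coefficients and variances below are assumed and purely illustrative):

```python
import numpy as np

# Simulation sketch (assumed values): first-differencing cancels a
# time-invariant unobserved confounder w in the model above.
rng = np.random.default_rng(1)
n = 5000
beta, gamma, delta = 1.0, 2.0, 3.0   # time trend, effect of x, effect of w

w = rng.normal(0, 1, n)              # unobserved, time-invariant confounder
x0 = 0.8 * w + rng.normal(0, 1, n)   # x correlated with w at time 0 ...
x1 = 0.8 * w + rng.normal(0, 1, n)   # ... and at time 1, so x varies over time
theta = rng.normal(0, 1, n)
y0 = x0 * gamma + w * delta + theta + rng.normal(0, 1, n)
y1 = beta + x1 * gamma + w * delta + theta + rng.normal(0, 1, n)

# Cross-sectional regression of y0 on x0 omits w, so its slope is biased.
naive_slope = np.polyfit(x0, y0, 1)[0]
# Difference model: regressing (y1 - y0) on (x1 - x0) cancels w and theta.
diff_slope = np.polyfit(x1 - x0, y1 - y0, 1)[0]

print(f"naive cross-sectional slope: {naive_slope:.2f} (true gamma = {gamma})")
print(f"difference-model slope:      {diff_slope:.2f}")
```

The naive slope absorbs part of the confounder's effect through the correlation between x and w, while the difference-model slope recovers γ.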
Another important advantage of the panel design is that it might help improve the quality of the recall by using the events observed in the earlier waves to bound the time frame of recall in future waves (see, e.g., Neter and Waksberg, 1964). However, since the anticipated between-wave lags are much longer than the recall period for GWVHS, this feature is unlikely to be applicable. Further discussions on bounding the time frame and related recall error issues can be found in the section, Measurement Error.
While the panel design has much merit, it also has some limitations. One important limitation is that the panel design can be especially vulnerable to nonresponse. Nonresponse is an important limitation to all survey studies under both cross-sectional and panel designs. Almost all survey studies fail to obtain complete data on some sampled subjects due to various reasons: some subjects cannot be located or reached, some are too sick to be interviewed, some refuse to be
interviewed. It is usually plausible that the nonrespondents are different from the respondents in terms of the attributes of interest. For example, Groves and Couper (1998) reported that households with many members and households with elderly persons or young children are easier to contact, while urban households are more difficult to contact than rural households; once contacted, those in military service, racial and ethnic minorities, and households with young children or young adults are more likely to cooperate with surveys. Given the potential for respondents to differ from nonrespondents, the analyses based on the respondents might provide biased estimates for the target population. The severity of the nonresponse bias is usually associated with the nonresponse rate (see, e.g., Kish, 1965, Section 13.4B). If the nonresponse rate is low, say, less than 10%, the nonresponse bias is likely to be small and negligible. If the nonresponse rate is high, say, more than 30%, there is a potential that the nonresponse bias might be serious, thus the conclusions based on the respondents might be flawed.
There are a number of statistical and econometric techniques that can be used to mitigate the impact of nonresponse, such as nonresponse weighting, (multiple) imputation, pattern mixture modeling, and selectivity modeling. The section Nonresponse, Attrition, Tracking, and Tracing provides further discussion.
While nonresponse is usually a limitation for both cross-sectional and panel designs, it is usually a bigger problem for panel designs because the nonresponse can accumulate over time. A panel study that is designed and implemented well usually holds the attrition over time to a very low level, such as 5 to 10 percent in each wave. Furthermore, some sampled subjects who did not respond to an early wave might be "resurrected" in a later wave. However, the nonresponse usually accumulates across waves, and reaches a substantial level after multiple waves. For example, the wave nonresponse accumulated to 19% at wave 7 in the 1987 Survey of Income and Program Participation (SIPP) panel (Jabine et al., 1990), one of the exemplary panel studies. Therefore, the potential for nonresponse bias becomes more severe in later waves.8
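The compounding arithmetic is easy to verify; with an assumed per-wave retention of 97% (a rate in the range of well-run panels), cumulative nonresponse reaches roughly the SIPP level by wave 7:

```python
# Compounding sketch (retention rate assumed for illustration): even modest
# per-wave attrition accumulates to a substantial cumulative loss.
per_wave_retention = 0.97   # assumed 3% attrition at each follow-up wave
for wave in range(1, 8):
    cumulative_nonresponse = 1 - per_wave_retention ** wave
    print(f"wave {wave}: {cumulative_nonresponse:.0%} cumulative nonresponse")
```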
A related limitation for the panel design is the omission of new members in the target population, due to birth, enrollment, immigration, and so on. While the panel was designed to be representative of the target population at the beginning of the study, the panel ages over time and does not represent the new members. Therefore the representativeness of the panel becomes compromised in subsequent waves, both because of the omission of new members, and because of the cumulative attrition discussed earlier. Some panel studies refresh the sample by adding a sample of new members who joined the target population since the original sample was drawn. This can be costly unless there is an easy way to identify the new members. However, since the GW veterans are a closed population, there are no new members entering the target population, thus the omission of new members is not an issue.
Another important limitation in the panel design is panel conditioning: the observed responses might be affected by the participation in the panel, thus compromising the validity of the data obtained in later waves. Further discussions on panel conditioning can be found in the section, Measurement Errors Specific to Panel Surveys.
Gulf War Veterans Health Study Design
Repeated panel surveys with temporal overlap (rotating panel surveys) is the commonly used compromise when both levels and changes are of interest. However, this design requires recruiting new cohorts from the target population regularly; thus, it might not be appropriate for the GWVHS. Given the anticipated deterioration of the quality of the DoD records data on the GW veterans, the recruitment cost is likely high for the GW veteran population, requiring substantial tracking and tracing efforts. In order to economize the design, it would be desirable to reduce the need to recruit new cohorts. Based on those considerations, either a permanent panel design or repeated panel surveys without temporal overlap would be the preferred choice for GWVHS, to avoid conducting costly recruitment on a regular basis.
The permanent panel design has the advantage that it allows the direct assessment of long-term gross changes on the same individuals; with the repeated panel surveys without temporal overlap, we need to "splice" the trajectory from different panels to assess changes across waves that fall under different panels. However, the permanent panel design will be more vulnerable to cumulative attrition and panel conditioning. Therefore, a promising design also worth considering for the GWVHS is repeated panel surveys without temporal overlap. The study should review the quality of the initial panel after its third wave,9 to determine the extent to which the validity of the inference based on the panel is compromised by cumulative attrition and panel conditioning.10 If the quality of the panel is judged to be satisfactory, the study would continue following the same panel. If the validity is judged to be unsatisfactory, the study would switch to a new panel. If the validity is judged to be marginal, it is conceivable that a hybrid design analogous to a rotating panel design could be used, continuing to follow a random subsample of the initial panel, and drawing a new panel to make up for the discontinued portion of the initial panel.
In order to facilitate the recruitment of a second panel if it is warranted, it is worth considering that a "reserve" sample be recruited at the same time of the initial sample. The "reserve" sample will be enrolled into the study, and given a brief survey to collect the contact information.11 This sample will then be sent into "hibernation," and will be reactivated if a decision is reached later to recruit a second panel. While the "reserve" sample is in "hibernation," we will maintain tracking to make it feasible to reactivate this sample if needed.12 This provision will require a nontrivial amount of resources, but will guard against the risk that the contact information in the DoD records will deteriorate further during the tenure of the initial panel, making it impossible to recruit a second panel.
If the "reserve" sample is implemented successfully, the rotating panel design might be a viable option. We can activate a third of the "reserve" in waves 2, 3, and 4, respectively, and retire a corresponding portion of the original panel. The maintenance cost for the "reserve" sample will be lower under the rotating panel design, because the size of the "reserve" sample is reduced over time. On the other hand, the recruitment cost is still likely to be substantial even with the "reserve" sample, therefore it might be more economical to activate the "reserve" sample in one lump sum instead of in pieces.
There are important analytic trade-offs between those designs. With the repeated panel design without temporal overlap, the entire panel is available in the first three waves, allowing more precise estimates for changes (both net and gross). Furthermore, this design allows the option of switching to a permanent panel design (without activating the "reserve" sample), either in part or in full, if warranted. However, we cannot estimate gross changes between the two panels, say, between the third and fourth waves. On the other hand, the rotating panel design includes less overlap across the first three waves; thus, it provides less precision for estimating changes (especially gross changes) among those waves. However, it does allow for the estimation of gross changes in later waves, say, between the third and fourth waves. As discussed in the Introduction to this appendix, the public health focus of the GWVHS indicates that it is more important to assess changes across the first three waves (repeated panel surveys without temporal overlap is preferable for those objectives), than assessing changes that occur in later waves (rotating panel design is preferred for those objectives). See Survey Frequency for further discussion.
Another analytic advantage in the rotating panel design is that it allows us to assess the presence and severity of panel conditioning regularly because a new portion of the "reserve" sample is activated in each wave. However, given our anticipation that panel conditioning is unlikely to be a major problem for the GWVHS (see Survey Modality and Geographic Clustering), this advantage is less critical. In summary, the challenges in recruiting the GW veterans make the rotating panel design less appealing for the GWVHS than either the permanent panel design or the repeated panel design without temporal overlaps.
To summarize, three of the most promising designs for the GWVHS are the permanent panel design, the repeated panel design, and a combination of the two. A promising design is to recruit an initial panel, and follow them every 3 years for three waves. An assessment shall be made to evaluate the quality of the panel at the end of the third wave (to be fielded during the 9th year of the study), to determine whether to continue following the same panel, to switch to a new panel starting in year 10, or to adopt a combination of the two. It is worth considering recruiting a reserve sample to be activated as needed for the second panel. Note that the final design decision can be made after the first three waves have been fielded; thus, it can be (and should be) based on the actual cost data obtained in the field.
An important parameter in the design of panel studies is the survey frequency: how often the participants will be surveyed. We assume for now that the survey is conducted with equal spacing between waves; this is the design commonly employed in most panel studies. We also assume that the overall duration of the study has been determined in advance. The survey frequency is then determined by the number of waves to be conducted within the given duration.
For many panel studies, the survey frequency is determined by the reference periods for the key outcome measures (see the section, Measurement Error below), so as to obtain a contiguous stream of those outcome measures over time. Since the intervals between GWVHS follow-up waves will be much longer than the reference periods for the outcome measures, this aspect is not applicable.
For studies with a fixed budget, there is usually a trade-off between the survey frequency and the number of participants surveyed in each wave. With a higher survey frequency, the number of participants surveyed in each wave will be smaller.
For panel studies, a simplistic measure of the total amount of information collected is the number of person-waves of surveys conducted. The cost per person-wave surveyed usually decreases with the survey frequency. More specifically, the marginal cost to conduct an additional wave of the follow-up survey for a participant already recruited into the study is usually lower than the cost to recruit an additional participant into the study for the first wave of data collection, because tracking and tracing costs for ongoing participants are
usually lower than the recruitment cost for new participants. Therefore, this simplistic measure (the number of person-waves surveyed) usually favors collecting many waves of follow-up on the same sample of participants.
The simplistic measure described above does not convey all information relevant to the design of a panel study. We must also consider the statistical information obtained in the study. Due to intraclass correlation at the individual level, the statistical information per person-wave of survey is likely to decrease with the survey frequency, thus counterbalancing the reduction in the cost per person-wave of survey. The intraindividual correlation can occur in the levels: a healthy individual will usually remain healthy over time. The correlation can also occur in the changes: the trajectory of an individual's health change during, say, the first 3 years might be predictive of his or her trajectory in the future.
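This counterbalancing can be quantified with the standard design-effect formula for repeated measurements under an exchangeable correlation structure (the correlation value below is an assumed illustration):

```python
# Design-effect sketch (rho assumed for illustration): with intraindividual
# correlation rho, n subjects surveyed over W waves contribute an effective
# sample size of n*W / (1 + (W - 1)*rho) for estimating a population mean,
# far less than the n*W person-waves actually collected.
def effective_n(n, waves, rho):
    return n * waves / (1 + (waves - 1) * rho)

n, rho = 1000, 0.6   # assumed panel size and intraindividual correlation
for waves in (1, 2, 4, 8):
    print(f"{waves} waves: {n * waves} person-waves, "
          f"effective n = {effective_n(n, waves, rho):.0f}")
```

Each additional wave on the same panel thus yields diminishing returns in statistical information, even though its marginal cost is low.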
As was noted in the Introduction, the repeated measurements on the same participants usually contribute less information towards the estimation of cumulative levels, such as the detection of rare conditions. For estimating changes, the panel design is usually preferable to the repeated cross-sectional design. However, the statistical information does not accumulate proportionately with the number of waves surveyed. Assuming a linear time trend, the largest contribution to the statistical information of the estimated rate of change comes from the first and last waves. The marginal contribution from each intermediate wave is smaller.
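The point about the first and last waves can be illustrated with a small calculation. Under ordinary least squares with independent errors, the information about a linear rate of change is proportional to the sum of squared deviations of the measurement times from their mean; the 12-year schedule below is hypothetical:

```python
def slope_info(times):
    """Fisher information for the slope of a straight-line trend under OLS
    with independent, homoscedastic errors: proportional to the sum of
    squared deviations of the wave times from their mean.  (This ignores
    intraindividual correlation, so it is only a rough guide.)"""
    tbar = sum(times) / len(times)
    return sum((t - tbar) ** 2 for t in times)

# Hypothetical 12-year study: five equally spaced waves vs. endpoints only.
five_waves = slope_info([0, 3, 6, 9, 12])  # 90.0
endpoints = slope_info([0, 12])            # 72.0
# The two end waves alone carry 80% of the slope information; the three
# intermediate waves together add only the remaining 20%.
```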
The appropriate survey frequency needs to balance between the cost per person-wave of survey and the statistical information per person-wave of survey. Overall and Doyle (1994) and Hedeker, Gibbons, and Waternaux (1999) provided the methodologies for power calculations for longitudinal studies under a variety of model assumptions. Those techniques can be used in conjunction with information about survey costs to compare alternative survey frequencies to determine the appropriate design that provides the optimal power and precision under the available budget.
In addition, one also needs to take into consideration attrition and measurement error issues discussed in the section Nonresponse, Attrition, Tracking, and Tracing and in Measurement Error, in order to arrive at the appropriate choice for the survey frequency. This is not an easy task. Presser (1989) observed that "There are difficult trade-offs here . . . a given decision might decrease sampling error but increase potential nonresponse bias, while at the same time reducing conditioning and increasing telescoping . . . the paucity of relevant studies means we are frequently operating in the dark."
A useful recommendation was given in Cantor (1989) to base decisions on survey frequency on "the amount of time it takes to expect meaningful change and/or occurrence in the variables that are of substantive interest. . . . For example, epidemiological studies typically require a number of years for a follow-up period in order to allow for physical and mental development." In other words, the survey frequency should allow enough time between waves for meaningful changes worth measuring to take place.
We have assumed up until now that the survey is conducted with equal spacing between waves—this is the design commonly employed in most panel studies. This might not be the most appropriate design for the GWVHS, for several reasons.
First, the usual practice of conducting longitudinal studies with equal spacing is not well-grounded in statistical theory. Maxwell (1998) observed that "equal spacing of observation periods is a typical mathematical assumption, in part because many longitudinal designs follow this practice. Nevertheless, if the straight-line growth model is in fact the correct model, greater power can be achieved by spacing all of the assessments farther from the middle of the overall assessment period and closer to the extremes (i.e., closer to the pretest and to the posttest). On balance, the equally spaced design often offers a better choice in practice. . . ."
The observation that nonequal spacing can improve power is important. In the extreme, if there is compelling evidence that the change in the outcomes follows the straight-line model (the rate of change is constant over time), the intermediate waves do not contribute much information, thus the study design can be improved by placing more observations near the beginning and the end of the study period. The optimality under the straight-line model needs to be balanced against the robustness for the design to perform well under deviations from the straight-line model, thus some waves should be placed in the intermediate time range. However, as long as the straight-line model is a reasonable approximation to the actual pattern of the change over time, the balance between optimality and robustness is unlikely to justify the equally spaced design—the appropriate design should still place more emphasis near the beginning and the end of the study period.
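Using the same slope-information measure, a sketch (with hypothetical wave times, again ignoring intraindividual correlation) shows how shifting waves toward the extremes increases power under the straight-line model:

```python
def slope_info(times):
    """Slope information under OLS with independent errors (sum of squared
    deviations of wave times from their mean); see caveats in the text."""
    tbar = sum(times) / len(times)
    return sum((t - tbar) ** 2 for t in times)

# Five waves over a hypothetical 12-year period:
equal_spacing = slope_info([0, 3, 6, 9, 12])     # 90.0
# Two waves at each extreme, plus one mid-study wave retained for
# robustness against departures from the straight-line model:
extreme_spacing = slope_info([0, 0, 6, 12, 12])  # 144.0
```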
In addition, the public health (vs. basic science research) focus of the GWVHS makes it important to take into consideration the timeliness of interim data. As discussed in Section 1, the information obtained in GWVHS will be of value to the GW veterans only during their lifetime. Therefore timeliness of information should be taken into consideration for the design of the GWVHS.
In basic science research, we usually assume an infinite time horizon.13 The scientific knowledge acquired from the study is expected to serve a large number of future clients in the years to come. The study participants might receive suboptimal treatments in order to provide the scientific knowledge to serve future clients. For example, in placebo-controlled trials, the participants assigned to the control group receive the placebo instead of the experimental treatment.
Although ethical considerations dictate that the trial be conducted only in the absence of conclusive evidence favoring the experimental condition over the control condition, it is conceivable that some partial evidence is available to suggest plausible benefits of the experimental condition. During the course of the trial, interim comparisons between the conditions are usually made to monitor the accumulation of evidence. The trial will usually be terminated when conclusive evidence is found favoring either the experimental or the control condition. However, if the interim data indicate weak but not-yet-conclusive evidence, the trial usually continues, to allow further accumulation of evidence. In essence, the trial participants are making a short-term sacrifice to contribute to the benefits of future patients. The performance of the trial is usually judged based upon the value of the scientific knowledge acquired at the end of the study. Interim evidence is usually not used for the benefit of the trial participants except in extreme cases when the interim evidence is conclusive. Timeliness of information (the value of the interim data) is usually not taken into consideration in study designs, because the dissemination phase subsequent to the trial has an infinite time horizon.
On the other hand, a public-health-oriented study like GWVHS aims to serve today's target population; thus, it does not have an infinite time horizon.14
The following diagram illustrates the pattern of information accumulation in a basic science research study that attempts to estimate the rate of change over time (assumed to be constant) in the outcome variable, using five waves of data collection.
The horizontal axis in the diagram denotes time. The baseline survey is conducted at time zero, corresponding to the left edge of the diagram. The first follow-up is conducted at time one, corresponding to the first vertical line inside the diagram. The second follow-up is denoted by the next vertical line, and so on. As an illustration of the basic science research paradigm, we assume that the study period is relatively short compared to the dissemination period during which the information obtained in the study is utilized.
The vertical axis denotes the level of uncertainty about the rate of change. Prior to the first follow-up, the study provides no information about the rate of change. The first white bar between time zero and time one denotes the level of
uncertainty based on prior information. The first follow-up provides some new information about the rate of change, thus reducing the uncertainty somewhat. The second white bar between times one and two denotes the level of uncertainty remaining after the first follow-up. The portion of the second bar shaded in light gray denotes the information gain (the reduction in uncertainty) due to the first follow-up. Further reduction in uncertainty is accomplished with each follow-up, until the fourth and final follow-up. The white space subsequent to time four denotes the uncertainty remaining after the conclusion of the study. (The reduction in uncertainty is not linear in time; the diagram should be viewed as a conceptual demonstration.)
The gray areas in the diagram denote the reduction in uncertainty. The portion of the reduction due to the interim data is shaded in light gray. The portion of the reduction available after the conclusion of the study is shaded in dark gray. Since the study period is short compared to the dissemination period, the value of the interim data is low compared to the value of the final study results available after the conclusion of the study.
The next diagram illustrates a public-health-oriented study like the GWVHS. This study has the same five waves of data collection. However, the time horizon is much shorter than in the previous study. The objective of this type of study is to serve the current population being studied, rather than to serve future populations. As can be seen from the diagram, the value of the interim data is high compared to the value of the final study results. Indeed, if the GWVHS follows the GW veterans until the end of their life cycles, the final study results will be of very little value to the GW veterans.
Under the paradigm discussed above, there is an important rationale to place more emphasis on data collection in the early years of the GWVHS: this information can be utilized to serve the GW veteran population over a longer time period than the data collected in later years. It appears reasonable that a higher survey frequency should be used in the early years of the GWVHS to maximize the overall utility of the information obtained over time, as illustrated in the diagram below. Note that we have varied both the spacing between waves and the amount of uncertainty reduction corresponding to each wave: with shorter spacing between waves in the early years, the reduction in uncertainty due to each wave is likely to be smaller, because less change is expected to occur. The appropriate design needs to balance the smaller uncertainty reduction associated with shorter between-wave spacing in the early years against the longer useful life of the information obtained.
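The trade-off just described can be made concrete with a toy calculation. Everything in this sketch is an assumption made for illustration: the 40-year horizon, the two wave schedules, and the square-root form of the diminishing information gain per wave.

```python
import math

def total_utility(wave_times, horizon):
    """Toy utility: each wave's information gain (assumed to grow with the
    square root of the spacing since the previous wave, reflecting
    diminishing returns) is weighted by the years it remains useful."""
    utility, prev = 0.0, 0.0
    for t in wave_times:
        info_gain = math.sqrt(t - prev)   # assumed concave in spacing
        useful_life = horizon - t         # years the information serves veterans
        utility += info_gain * useful_life
        prev = t
    return utility

HORIZON = 40  # assumed years remaining for the veteran cohort
even = total_utility([8, 16, 24, 32, 40], HORIZON)
front_loaded = total_utility([3, 6, 9, 18, 40], HORIZON)
# Under these assumptions the front-loaded schedule yields more total utility.
```

The conclusion depends on the assumed concavity of the information gain; with a gain strictly proportional to spacing, the comparison can reverse, which is precisely why further research on variable-frequency designs is warranted.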
Motivated by the design issues in the GWVHS, we have conducted some preliminary research that indicates the variable-frequency design can be beneficial, especially if the uncertainty is high prior to the study. Further research in this area would be worthwhile, not only to address the design issues in the GWVHS, but also to address similar public-health-oriented issues.
Survey Modality and Geographical Clustering
With the revolution in communication and information technologies in recent years, there are many modalities to be considered for conducting a survey study, including face-to-face interview, telephone interview, mail administration, and combinations of those modalities. Computer-assisted interview is commonly used in the face-to-face and telephone interviews. Audio-assisted interview is sometimes used in the face-to-face modality for sensitive topics, for the respondents to be interviewed privately without direct interface with the interviewer. Internet-based interview via e-mail or the World Wide Web is becoming a possibility. The choice of survey modality can affect both the response rate and the quality of the data (see, e.g., Groves, 1989, Chapter 11; McHorney et al., 1994; Weinberger et al., 1994, 1996; and Wu et al., 1997).
Face-to-face interview is usually considered to be the most reliable modality. This modality allows for the use of auxiliary material such as printed response categories and prompts that are difficult to use under the telephone modality. The face-to-face contact between the interviewer and the respondent also facilitates the retrieval of written or printed documents such as insurance benefit brochures. This modality also allows for direct examination of the respondents, such as the taking of physical measurements. There is some evidence that this modality might result in more socially desirable responses, such as underreporting of stigmatized behavior. Most importantly, this is usually the most costly modality, because it requires the physical co-location of the interviewer and the respondent at the same time.
Telephone interview is less costly than face-to-face interview. It is limited by the access to the respondent by telephone; thus, it is difficult to utilize for
respondents who do not have direct access to a telephone. Mail survey is usually the least costly, but usually results in a higher nonresponse rate than the other modalities.
An important design implication of the face-to-face modality is that it is usually economical to implement this modality using a geographically clustered sample, to reduce the travel cost for the interviewers. More specifically, the sample is usually drawn in multiple stages; a sample of geographical locales is drawn first, usually using the probability-proportional-to-size (PPS) design. A sample of individuals is then drawn from each sampled locale.
There are several other design features in the GWVHS that might also indicate the need for a geographically clustered sample. The case-control portion of the GWVHS will also require geographical clustering, to facilitate the identification of non-GW veteran cases with medically unexplained physical symptoms. In order to maximize the comparability between the GWVHS sample and the non-GW veterans sample with medically unexplained physical symptoms, it would be desirable to take a portion of the GWVHS sample from the same geographical locales where the non-GW veteran cases with medically unexplained physical symptoms are sampled. In addition, if physical examinations will be conducted on a subsample of GWVHS participants, it would be worth considering making the subsample geographically clustered, to facilitate the face-to-face contact for the physical examination. Finally, some of the GWVHS analyses might require the use of contextual data, such as the supply of health care in the geographical locale in which the respondent resides. While some contextual data can be obtained with little effort for all geographical locales, some detailed contextual data would require direct data collection from the agencies in the geographical locales being studied. Geographical clustering of the GWVHS sample will also reduce the cost of this type of data collection.
The clustering of the sample usually reduces the statistical information available in the sample, due to the intracluster correlation (ICC): the respondents from the same geographical locale are usually more similar to each other than the respondents from different locales. Due to the ICC, the statistical information obtained from each respondent usually exhibits diminishing return with the sample size in the same locale: the second respondent from a locale contributes less information than the first respondent, the third respondent contributes less than the second, and so on. In the extreme case of perfect ICC (all respondents from the same locale have identical attributes), the first respondent conveys all information available; subsequent respondents from the same locale contribute no additional information. Therefore the design of a geographically clustered sample needs to balance between the cost savings due to clustering and the diminishing return in the statistical information (see, e.g., Kish, 1965, Chapter 5).
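The diminishing return described above is commonly summarized by the Kish design effect, 1 + (m − 1) × ICC for clusters of size m. A minimal sketch follows; the cluster size, total sample, and ICC below are illustrative assumptions, not GWVHS parameters:

```python
def design_effect(cluster_size, icc):
    """Kish design effect for a clustered sample with equal cluster sizes:
    the factor by which the variance of an estimate is inflated relative
    to a simple random sample of the same total size."""
    return 1.0 + (cluster_size - 1.0) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Size of the simple random sample carrying the same information."""
    return n_total / design_effect(cluster_size, icc)

# 5,000 respondents drawn in clusters of 50 with a modest ICC of 0.05:
deff = design_effect(50, 0.05)                 # 3.45
n_eff = effective_sample_size(5000, 50, 0.05)  # ≈ 1,449
```

Even a small ICC, multiplied by a large cluster size, can cut the effective sample size by more than half, which is why the cost savings of clustering must be weighed against this loss.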
It is challenging to maintain the geographically clustered design in a panel study: the level of clustering usually dissipates gradually over time, due to migration.15 Some participants will migrate out of the original locales they were recruited from. Some of the migrants might move to a new geographical locale that is part of the sampled locales, or near a sampled locale; thus, they can be followed easily using the face-to-face modality. Some of the migrants might move to a new geographical locale that is not near any sampled locales. It will be costly to follow those migrants using the face-to-face modality.
To the extent that face-to-face contact is important, either for the entire GWVHS sample or for a subsample, the dissipation of geographical clustering might make it prohibitively costly to follow the sample over time, thus making it necessary to draw a new panel.
Nonresponse, Attrition, Tracking, and Tracing
Nonresponse is a common challenge to the implementation of most survey studies, both for cross-sectional and longitudinal studies (see, e.g., Bailar, 1989; Brick and Kalton, 1996; Groves, 1989, 1998; Kalton, 1986; Kish, 1965, Chapter 13; Laird, 1988; Lepkowski, 1989; and Lessler and Kalsbeek, 1992). For longitudinal studies, nonresponse can occur both at the baseline and in subsequent follow-up surveys. We begin our discussions with baseline nonresponse, followed by attrition (follow-up nonresponse), and techniques commonly used to reduce nonresponse.
The GWVHS is likely to be especially vulnerable to baseline nonresponse. As this report is being written, 8 years have elapsed since the Gulf War. While DoD maintains the locating information on records for all veterans, including the GW veterans, the accuracy of this information is likely to deteriorate over time. We anticipate, therefore, that a substantial tracking and tracing effort will be necessary to recruit a representative sample of GW veterans, and a substantial level of baseline nonresponse will still occur despite this effort.
A substantial level of baseline nonresponse raises concerns about nonresponse bias. The nonrespondents are likely to differ from the respondents; thus, the data obtained from the respondents might be biased and fail to represent the target population appropriately. To a limited extent, baseline nonresponse bias can be mitigated using nonresponse weighting and poststratification weighting, (multiple) imputation, pattern mixture modeling, and selectivity modeling (see, e.g., Brick and Kalton, 1996; Copas and Farewell, 1998; Heckman, 1979; Little, 1993, 1994; Little and Rubin, 1987; Rubin, 1987, 1996; and Schafer, 1997). Those techniques usually rely on strong assumptions that are difficult to verify empirically. Nonresponse weighting and imputation usually rely on the assumption that the nonrespondents are missing at random (MAR) after controlling for characteristics observed for both the respondents and nonrespondents. Pattern mixture models provide a way to assess the sensitivity of the analysis
results to informative missingness that violates the MAR assumption. Selectivity models usually rely on assumptions about the joint distribution of the nonresponse mechanism and the outcome data to assess the informative missingness. Those assumptions should be viewed as (at best) approximations to reality. If the nonresponse rate is low, those techniques might help reduce the nonresponse bias. If the nonresponse rate is high, the potential for nonresponse bias is likely to remain even after applying those techniques.
As an example of those techniques, nonresponse weighting is based on comparisons between respondents and nonrespondents, using characteristics available for both groups, such as the administrative information available from DoD. A nonresponse weighting model16 is then used to estimate the response rate for each category of subjects; the reciprocal for the estimated response rate is used as the nonresponse weight. The subjects with low estimated response rates are given larger nonresponse weights; those with high estimated response rates are given smaller nonresponse weights. The nonresponse weights therefore adjust the distribution in the observed sample so as to compensate for the distortion resulting from the nonresponse. Similarly, poststratification weighting compares the respondents with the target population using characteristics available for both groups, to adjust for distortion that might have occurred in the observed sample.
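The weighting-class calculation described above can be sketched as follows; the class labels and response rates are hypothetical:

```python
from collections import defaultdict

def nonresponse_weights(subjects):
    """Weighting-class adjustment: within each class defined by
    characteristics observed for respondents and nonrespondents alike,
    each respondent's weight is the reciprocal of the class's observed
    response rate.  `subjects` is a list of (class_label, responded)."""
    totals, responded = defaultdict(int), defaultdict(int)
    for cls, resp in subjects:
        totals[cls] += 1
        responded[cls] += int(resp)
    return {cls: totals[cls] / responded[cls]
            for cls in totals if responded[cls] > 0}

# Hypothetical classes from DoD records: "active" members respond at 80%,
# "reserve" members at 50%, so reserve respondents are weighted up.
sample = ([("active", True)] * 8 + [("active", False)] * 2
          + [("reserve", True)] * 5 + [("reserve", False)] * 5)
weights = nonresponse_weights(sample)  # {"active": 1.25, "reserve": 2.0}
```

In practice a response-propensity model (e.g., logistic regression) often replaces the simple cells shown here, but the reciprocal-of-response-rate logic is the same.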
Among respondents to the baseline survey, some might fail to respond to some or all of the subsequent follow-up surveys. Many panel surveys accomplish very high follow-up rates (95% or more) from wave to wave. However, nonresponse usually accumulates across waves, thus the overall response rate might be substantially lower after many waves, and attrition becomes an increasingly severe problem.
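The compounding of wave-to-wave nonresponse is simple arithmetic; the rates below are illustrative rather than projections for the GWVHS:

```python
def cumulative_response_rate(baseline_rate, retention_rate, n_followups):
    """Overall response rate after compounding baseline nonresponse with
    wave-to-wave attrition (assumes no conversion of prior nonrespondents)."""
    return baseline_rate * retention_rate ** n_followups

# Even a 95% per-wave retention rate, compounded over six follow-ups on
# top of a 70% baseline response rate, leaves an overall rate near 51%.
overall = cumulative_response_rate(0.70, 0.95, 6)
```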
Like baseline nonresponse, nonresponse during the follow-up waves is also likely to result in bias in the observed data. In particular, the cross-sectional representativeness of the cohort is likely to deteriorate over time, as the baseline nonresponse is compounded with attrition. This is one of the main reasons many panel studies use the repeated panel design to regain the cross-sectional representativeness. To a limited extent, wave nonresponse bias can also be mitigated using the techniques discussed earlier for baseline nonresponse (see, e.g., Copas and Farewell, 1998; Diggle, 1989; Diggle and Kenward, 1994; Diggle et al., 1994; Heckman and Robb, 1989; Kalton, 1986; Kyriazidou, 1997; Laird, 1988; and Lepkowski, 1989).
Many panel surveys restrict the follow-up to the respondents to the previous wave, thus the nonrespondents to the baseline survey are eliminated from all follow-ups, the nonrespondents to the first follow-up survey are eliminated from all
subsequent follow-ups, and so on. This practice is usually based on the expectation that the nonrespondents are unlikely to convert into respondents in future waves.17 While this expectation is not unreasonable, this practice usually results in a substantial accumulation of nonresponse across waves. In order to reduce the level of cumulative attrition, we believe it would be appropriate to make efforts to survey the nonrespondents to earlier waves (including the baseline nonrespondents), unless they have given explicit instructions not to be contacted again.
There are a number of procedures commonly used in panel studies to reduce nonresponse, namely, tracing and tracking (see, e.g., Burgess, 1989).18 Given the anticipated difficulties in recruiting the GW veterans, those procedures are important to help strengthen the quality of the GWVHS data.
Tracing is, in essence, looking for a missing person using available information. Prospective tracing is necessary at the baseline to locate the participants who cannot be located using the DoD record data. Retrospective tracing is necessary for participants who were surveyed in an earlier wave but could not be located at a subsequent follow-up wave.
A variety of public information sources are usually used for tracing, such as telephone directories, credit records, property records, court records, mortality records (to identify deceased sampled subjects), and so on. It is important in the tracing procedure to verify the identity of the subjects located, to avoid false identification.
Customized tracing procedures can also be utilized, such as visiting the subject's prior residences, neighbors, and known and possible associates. Those procedures are labor intensive, and thus are likely to be too costly for the national scope of the GWVHS sample. It is conceivable that those procedures can be utilized for participants clustered in a limited number of geographical areas if a (partially) clustered design is used for the GWVHS.
While tracing is used to locate missing subjects, tracking is used to maintain contact with subjects already located. During the baseline survey, the interviewer collects contact information (both primary and secondary) from the participants, to help locate the participants for subsequent follow-up surveys. The contact information is usually updated during each follow-up survey. For studies with infrequent survey frequencies, additional tracking procedures are usually deployed to maintain contact with the participants between waves. This includes sending postcards, birthday cards, and newsletters to the participants at regular intervals, requesting postal notification of change of address, and requesting the participants to submit change-of-address information to the study (an incentive is usually offered to encourage the participants to provide this information). In addition to updating the contact information, some of those procedures (e.g., birthday cards and newsletters) might also enhance goodwill among the participants, so as to facilitate their cooperation at subsequent follow-up surveys.
Given the long lag between waves for GWVHS, additional procedures can be utilized to help maintain the contact information. One possibility is to conduct "light-duty" tracing procedures (such as retrieving easy-to-access public records) on the participants regularly. Burgess (1989) recommended that "If the intersurvey period is five years, . . . it may be more cost-effective to trace a person five times over five years than once after five years."
Another possible procedure is to make brief telephone contacts with participants between waves, to greet the participants (hopefully to enhance the goodwill), and to request updates on contact information. It is conceivable that more intensive tracing procedures can also be used between waves to maintain the contact information. Those procedures are more costly, therefore it might be appropriate to restrict them to participants anticipated to be more difficult to follow, such as those who were difficult to locate during an earlier wave.
Measurement Error

Empirical data are almost always subject to measurement error. Survey data are no exception. Some types of measurement errors are general, and apply to both cross-sectional and panel surveys. Some types of measurement errors are specific to panel surveys. We discuss both types of measurement errors, and remedies that can help mitigate problems resulting from those measurement errors (see, e.g., Bailar, 1989; Groves, 1989; Groves and Couper, 1998; Kish, 1965, Chapter 13; and Lessler and Kalsbeek, 1992).
General Measurement Error Issues
Part of the measurement error can be attributable to the respondent. The respondent might intentionally provide an inaccurate response to a survey question. For example, the respondent might intentionally provide a socially desirable response, or refuse to report a stigmatized condition. The inaccurate response might also be given unintentionally, because the respondent does not have the necessary information, or does not want to make the effort to compile the necessary information.
Part of the measurement error can be attributable to the interviewer; this is applicable to face-to-face and telephone interviews delivered by an interviewer, but not to self-administered mail surveys. For example, the interviewer might not accurately follow the branching logic to deliver the appropriate survey questions to the respondent; might not convey the survey question clearly to the respondent; might not guide and motivate the respondent to compile and process the information necessary to provide accurate responses; might record the respondent's responses erroneously; or might not be alert in identifying inconsistencies in the respondent's responses and request the respondent to confirm them. In the worst scenario, the data might be forged in part or in their entirety by the interviewer.
Part of the measurement error can be attributable to the survey instrument. For example, the branching logic in the survey instrument might be inappropriate, leading the respondent to miss applicable questions; the survey questions might not be organized in a user-friendly sequence to make it easy for the respondent to compile and process the information accurately; the wording of the survey question might not be cognitively clear, resulting in confusion and misinterpretation by the respondent; the response categories might not be defined clearly (mutually exclusive and exhaustive) for the respondent to classify his or her status according to the given categories.
Finally, part of the measurement error might be attributable to data processing subsequent to the interview, such as data entry errors, coding errors, secondary errors introduced in data editing, errors in matching records, and so on.
The nature of measurement error can usually be classified into systematic error and random error. The quality of survey responses is usually characterized using validity and reliability: validity measures the level of systematic error; reliability measures the level of random error. We use the term "accuracy" below to refer to the combination of validity and reliability.
The level of measurement error can be evaluated using a number of techniques, such as test-retest (to assess reliability), comparisons with alternative data sources such as records data (to assess validity), and so on.
Systematic error occurs when similar measurement error persists across multiple waves of surveys, and/or when similar measurement error occurs across respondents. For example, the respondents might overreport outpatient medical visits systematically. Systematic error usually leads to bias in estimated population parameters such as the prevalence for a disease condition or the average level of outpatient service use.
Random error usually varies over time across multiple waves of surveys, and/or varies across respondents. For estimating aggregate population parameters such as the disease prevalence or average service use, random error usually results in reduced power and precision, but does not result in bias. However, random error might result in the overestimation of individual-level gross changes. There are many techniques and procedures that can be used to mitigate measurement error in the survey data. We describe several below.
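A small simulation, with assumed score and error distributions, illustrates how random measurement error inflates observed individual-level gross change even when every respondent's true status is stable:

```python
import random

random.seed(0)  # for reproducibility of this sketch

# Assumed distributions: stable true scores ~ N(50, 10); each wave's
# report adds independent N(0, 5) measurement error.
n = 10_000
true_scores = [random.gauss(50, 10) for _ in range(n)]

def report(score, error_sd=5.0):
    return score + random.gauss(0, error_sd)

# True individual change between the two waves is exactly zero, yet the
# observed absolute change averages roughly 5.6 points (the difference of
# two independent errors has standard deviation 5 * sqrt(2) ≈ 7.1).
spurious = [abs(report(s) - report(s)) for s in true_scores]
mean_spurious_change = sum(spurious) / n
```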
Many of the general sources of measurement errors can be mitigated with computer-assisted survey techniques. For example, the computerized survey instrument usually incorporates built-in branching logic, thus avoiding interviewer mistakes in following the branching logic. It is of course still crucial that the branching logic be designed accurately and programmed accurately. Data
entry errors are essentially eliminated in computerized surveys, to the extent the interviewer records the respondent's responses accurately.
Thorough interviewer training and monitoring is essential to mitigate measurement error. In addition, the match between the interviewer and the respondent can help improve the rapport for the interview, such as the match in race and ethnicity or the use of HIV-positive interviewers in surveys of HIV-positive respondents.
In-depth cognitive testing of the survey instrument can be used to identify ambiguities in the wording of the survey questions and response categories. The results of the cognitive testing can be used to revise the instrument, improve the clarity, and reduce the measurement error. Similar laboratory-based testing of other design features about survey questions and instruments, such as the sequential order of survey questions in an instrument, can also help address potential measurement error issues.
An alternative to laboratory-based testing prior to the deployment of the survey is to include substudies in the survey study to assess important measurement error issues. As an example, the RAND Health Insurance Experiment (Newhouse et al., 1993) included a substudy on the frequency of health reports, randomizing the respondents to various levels of reporting frequency, to assess the potential that the health report might prompt the respondents to seek medical care.
Some sources of measurement error can be mitigated with the appropriate choice of survey modality. For example, audio-assisted interview can be incorporated into the face-to-face modality for sensitive and stigmatized topics, to reduce the respondent's concern about providing socially undesirable responses to the interviewer. Sometimes a randomized response design (Horvitz, 1967) is used, in which the respondent's response is randomized to help alleviate his or her concerns.
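A sketch of a simple randomized response scheme in the spirit of the cited design (the truth probability and prevalence below are assumptions): each respondent answers truthfully with known probability p and answers the negation otherwise, and an unbiased prevalence estimate is recovered from the aggregate without the interviewer ever knowing any individual's true answer.

```python
import random

random.seed(1)

P_TRUTH = 0.7          # assumed probability of answering truthfully
TRUE_PREVALENCE = 0.2  # assumed true prevalence of the sensitive trait

def randomized_response(has_trait):
    """With probability P_TRUTH answer the sensitive question truthfully;
    otherwise answer its negation.  The interviewer cannot tell which."""
    return has_trait if random.random() < P_TRUTH else not has_trait

n = 100_000
yes_rate = sum(randomized_response(random.random() < TRUE_PREVALENCE)
               for _ in range(n)) / n

# Because E[yes_rate] = p*pi + (1-p)*(1-pi), the prevalence pi can be
# recovered from the aggregate yes rate alone:
estimate = (yes_rate - (1 - P_TRUTH)) / (2 * P_TRUTH - 1)  # ≈ 0.2
```

The price of the privacy protection is added noise: the randomization inflates the variance of the prevalence estimate relative to direct questioning.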
For survey questions that inquire about the respondent's past experience or future anticipation, the level of measurement error is determined in part by the reference period (see below). It is therefore important to choose an appropriate reference period to reduce measurement error.
The nature of the measurement error for a specific attribute usually depends on the time frame for the trait being measured. Some traits are time-invariant, such as birth year, gender, race, and ethnicity; time frame is usually irrelevant for those traits. Other traits vary over time, so the specific measure needs to take the time frame into consideration. Some measures are specific to the current status and thus can be viewed as snapshots, such as current health status (excellent, good, fair, poor), current marital status, and current employment status. Some measures inquire about events that occurred during a specified time interval (the reference period), such as the number of outpatient medical visits during the last 6 weeks. Some measures inquire about the accumulation or the central tendency of a time-varying trait over a reference period, such as the household income during the calendar year 1998 (accumulated over time), the number of cigarettes smoked each day (on average) over the last 6 weeks, and the level of satisfaction with the primary care provider during the last 6 weeks (presumably the central tendency over this interval). Most survey questions are retrospective and inquire about reference periods in the past; some are prospective and inquire about the respondent's anticipation of the future. The reference period might be a fixed time interval determined by the calendar (such as the calendar year 1998), a fixed time interval defined relative to the time of the interview (the last 6 weeks, the next 2 weeks), or a time interval defined relative to an easily recognized milestone (since the last interview, since the most recent discharge from a hospital, until the anticipated surgery). Note that the duration of a milestone-based reference period might vary from respondent to respondent; it might even be unknown (for prospective milestones). The analysis needs to take those variations into account, since there is more chance for events to occur in a longer reference period. A special type of milestone-based reference period is the lifetime experience (since the respondent's birth) or lifetime anticipation (until the respondent's death).19
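Because milestone-based reference periods vary in length across respondents, pooled estimates should be weighted by person-time rather than computed as an average of raw counts. A minimal sketch of this adjustment (Python, with toy numbers, not actual survey data):

```python
def events_per_person_year(records):
    """Pool respondents whose reference periods differ in length by
    dividing total events by total person-time; averaging raw counts
    would overweight respondents with longer reference periods.

    Each record is a tuple (n_events, reference_period_days).
    """
    total_events = sum(events for events, _ in records)
    total_person_years = sum(days for _, days in records) / 365.25
    return total_events / total_person_years

# Toy data: four respondents with milestone-based periods of varying length.
records = [(1, 90), (0, 180), (2, 365), (0, 30)]
rate = events_per_person_year(records)
print(f"{rate:.2f} events per person-year")
```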
The nature of the reference period for a specific survey measure has important implications for measurement error (see, e.g., Bailar, 1989; Neter and Waksberg, 1964). The respondent might misclassify the timing of specific events relative to the reference period, resulting in telescoping (inclusion of events that occurred outside the reference period) and omission of events that occurred inside it. Omission can also occur irrespective of the reference period: the reason might be the respondent's failure to recognize or report a specific event, rather than a failure to classify the timing of the event accurately. "Fabrication" of nonexistent events can also occur irrespective of the reference period.
The accuracy of the survey response usually decreases with the length of the reference period: both telescoping and omission are more likely when the respondent must recall or anticipate events distant in the past or future. Exceptions to this general rule might occur if a longer reference period is easier for the respondent to recognize. For example, it might be easier for the respondent to report taxable income for the calendar year 1998 (the available information is likely organized by calendar year) than for the last 6 months (the respondent might have difficulty determining whether specific payments were received within or prior to the last 6 months). The presence of milestones might also help the respondent respond accurately, even though the milestone-based reference period might be longer than an alternative fixed reference period.
The appropriate reference period for a survey question depends on the trait being asked about. For major events such as hospitalization (more specifically, discharge from a hospital), the accuracy of the respondent's recall usually remains high even for reference periods as long as 6 months or a year. For less "impressive" events such as outpatient visits, accuracy might deteriorate substantially beyond a few weeks.
In addition to the accuracy of the survey measure, the choice of the appropriate reference period should also take into consideration the sampling variation associated with the time frame. In the absence of measurement error, the statistical information in the survey measure increases with the length of the reference period; in a sense, the effective sample size should be measured in terms of person-time. A longer reference period allows more events to be accumulated and thus provides more information. For example, it is conceivable that we can obtain nearly perfect reporting of hospital discharges during the last 7 days. However, very few respondents will have experienced a hospital discharge during such a short reference period, so the precision of the estimated rate of hospitalization will be poor due to high sampling error.
The rate of hospitalization needs to be defined relative to time, such as the number of hospital discharges per thousand person-years. A sample of a thousand individuals asked about a 7-day reference period contributes only about 20 person-years' worth of data; the same sample asked about a 12-month reference period contributes a thousand person-years' worth. The latter design might be preferable even if measurement error is larger with the 12-month reference period. The ultimate choice of the time interval needs to be based on the trade-off between the reduction in sampling error and the increase in measurement error due to the longer interval.
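The trade-off can be quantified under a simple no-measurement-error model. The sketch below (Python; the 10% annual hospitalization rate is an illustrative assumption, not a GWVHS figure) treats hospital discharges as a Poisson process, so the variance of the estimated rate is the rate divided by the person-years observed:

```python
import math

def rate_precision(n_respondents, reference_days, true_rate_per_py=0.10):
    """Approximate precision of an estimated event rate (events per
    person-year) as a function of reference-period length.

    Assumes events follow a Poisson process with no measurement error,
    so Var(estimated rate) = rate / person_years.
    """
    person_years = n_respondents * reference_days / 365.25
    expected_events = true_rate_per_py * person_years
    se = math.sqrt(true_rate_per_py / person_years)
    return person_years, expected_events, se

for days, label in [(7, "7-day"), (365, "12-month")]:
    py, events, se = rate_precision(1_000, days)
    print(f"{label}: {py:.0f} person-years, ~{events:.0f} events, "
          f"SE of rate = {se:.3f}")
```

Under these assumptions the 12-month design yields a standard error roughly seven times smaller than the 7-day design; the question is whether the accompanying increase in recall error outweighs that gain.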
Most data items to be used in the GWVHS will likely be standard, with known properties in the quality of the recall measures. If some recall data elements are new, or if there are concerns about the recall properties among the GW veterans for some existing data elements, one might consider a substudy that uses different recall periods for randomly partitioned subsamples, say, a 3-month reference period for one random subsample and a 6-month period for the rest. The consistency between the two versions of the survey instrument can then be tested by comparing the two subsamples.
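Such a substudy can be sketched as follows. The code below (Python) randomizes the sample into two recall-period arms and converts the reported counts to a common per-month scale for comparison; the arm labels, respondent IDs, and toy data are illustrative, not part of any actual GWVHS instrument.

```python
import random

def assign_recall_arms(respondent_ids, seed=42):
    """Randomly partition the sample into two recall-period arms
    (3-month vs. 6-month), as in the substudy described above."""
    rng = random.Random(seed)
    ids = list(respondent_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"3-month": set(ids[:half]), "6-month": set(ids[half:])}

def monthly_rates(reports, arms):
    """Convert reported visit counts to per-month rates so the two arms
    are comparable despite different reference-period lengths."""
    months = {"3-month": 3, "6-month": 6}
    rates = {}
    for arm, members in arms.items():
        counts = [reports[i] for i in members]
        rates[arm] = sum(counts) / (len(counts) * months[arm])
    return rates

arms = assign_recall_arms(range(10))
reports = {i: 2 for i in range(10)}  # toy data: everyone reports 2 visits
print(monthly_rates(reports, arms))
```

A substantial gap between the two arms' per-month rates would suggest recall decay over the longer reference period (e.g., omission of visits beyond 3 months).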
Measurement Error Issues Specific to Panel Surveys
One of the advantages of panel surveys is that the previous interview, or events reported during the previous interview, can be used as milestones to bound the reference period and thereby improve the accuracy of the respondent's recall. This advantage applies when the reference period coincides with the lag between successive waves. However, it is unlikely to apply to the GWVHS: the anticipated lag time between GWVHS waves is much longer than the reference periods appropriate for most health outcome measures, so prior interviews are unlikely to serve as milestones bounding the reference period for subsequent interviews.
A unique measurement error issue for panel surveys is panel conditioning: the observed responses might be affected by the participation in the panel, thus compromising the validity of the data obtained in later waves (see, e.g., Bailar, 1975, 1989; Cantor, 1989; Corder and Horvitz, 1989; Holt, 1989; Presser, 1989; Silberstein and Jacobs, 1989; and Waterton and Lievesley, 1989).
There are a number of possible interpretations of panel conditioning. First, participation in the panel might affect the respondents' actual behavior. For example, the survey might serve as a prompt for participants to attend to their health care needs. Under this scenario, the survey responses in subsequent waves might reflect actual behavior and its consequences, but that behavior might not be representative of what would take place in the general population in the absence of the earlier survey. The potential for this participation effect is especially important if a physical examination is conducted on a subsample of the GWVHS participants: the examination might reveal a health condition that requires medical care, thus affecting the health status of participants in this subsample. The impact might be both short-term and long-term: the medical care received might alter the trajectory of the participant's health status.
Second, participants might learn from earlier waves that certain "trigger" items lead to additional items; they might then reduce their burden by responding negatively to the "trigger" items in future waves. Under this scenario, the survey responses in subsequent waves would be biased towards underreporting of the "trigger" conditions.
Third, participants might learn from earlier waves what information the survey requires, making them more capable of compiling and processing the information needed to provide accurate responses. Under this scenario, panel conditioning reduces measurement error.
The presence of panel conditioning is easy to detect under the rotating panel design. In each wave, we have respondents at various levels of "seniority" on the panel: some are new, some have had experience in earlier waves, and some have completed their tenure and are ready to retire from the panel. We can therefore compare the responses given by respondents at various levels of "seniority" to assess the presence of panel conditioning. (It is important, though, to control for attrition in those comparisons.) If panel conditioning is judged to be important, the rotating panel design should be considered because it makes conditioning easy to detect and address.
It is much more difficult to assess panel conditioning with either the permanent panel design or repeated panel surveys without temporal overlap. Comparisons across waves cannot be used to assess panel conditioning because conditioning is confounded with true changes over time. It is conceivable that some comparisons with records data can be made, perhaps for a subsample, to assess the measurement error due to panel conditioning; this will not, however, address the impact of panel conditioning on actual behavior.
It is possible to assess panel conditioning with a substudy that varies the follow-up frequency. For example, we can take a random subsample, interview its members at a higher frequency, say annually, and compare their responses to those in the rest of the sample. This might be too costly to be worthwhile, although the more frequent interviews also yield more data in the subsample, which might allow a reduction in the overall sample size.
The lag between waves is anticipated to be fairly long for the GWVHS, so panel conditioning is unlikely to occur, except through the long-term impact of the physical examination. We should therefore focus the assessment of panel conditioning on the impact of the physical examination and place a low priority on its other components. More specifically, if we do not detect a long-term impact of the physical examination, it would be reasonable to assume the absence of the other components of panel conditioning. If we do detect a long-term impact, we might need to consider either rotating the panel or switching to a new panel.
If a physical examination is to be conducted in the GWVHS, it should be designed as a randomized substudy, with a random subsample assigned to receive the examination. It will then be straightforward to assess the long-term impact of the examination by comparing health status in the examined subsample with that in the rest of the sample.
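With random assignment, the comparison reduces to a standard two-sample contrast. A minimal sketch (Python; the health scores, group sizes, and effect size are simulated placeholders, not GWVHS data):

```python
import math, random

def exam_effect(scores_exam, scores_control):
    """Estimate the impact of the physical-examination substudy as the
    difference in mean health scores between the randomly examined
    subsample and the rest of the sample, with a large-sample SE."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
        return m, v
    m1, v1 = mean_var(scores_exam)
    m0, v0 = mean_var(scores_control)
    diff = m1 - m0
    se = math.sqrt(v1 / len(scores_exam) + v0 / len(scores_control))
    return diff, se

# Simulated placeholder data: examined subsample scores 2 points higher on average.
rng = random.Random(1)
exam = [70 + rng.gauss(0, 10) for _ in range(500)]      # examined subsample
control = [68 + rng.gauss(0, 10) for _ in range(1500)]  # rest of the sample
diff, se = exam_effect(exam, control)
print(f"difference = {diff:.1f} (SE {se:.2f})")
```

Because assignment is random, a difference well outside sampling error can be attributed to the examination itself rather than to selection.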
References

Armitage, P. 1960. Sequential Medical Trials. Springfield, Illinois: Thomas.
Bailar, B.A. 1989. Information Needs, Surveys, and Measurement Errors. In: Panel Surveys, (Eds. D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley. Pp. 1–25.
Bailar, B.A. 1975. The Effects of Rotation Group Bias on Estimates from Panel Surveys. Journal of the American Statistical Association 70(349):23–30.
Brick, J.M., and Kalton, G. 1996. Handling Missing Data in Survey Research. Statistical Methods in Medical Research. 5:215–238.
Burgess, R.D. 1989. Major Issues and Implications of Treating Survey Respondents. In: Panel Surveys (Eds. D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley. Pp. 52–75.
Cantor, D. 1989. Substantive Implications of Longitudinal Design Features: The National Crime Survey as a Case Study. In: Panel Surveys (Eds. D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley. Pp. 25–51.
Coad, D.S., and Rosenberger, W.F. 1999. A Comparison of the Randomized Play-the-Winner Rule and the Triangular Test for Clinical Trials with Binary Responses. Statistics in Medicine 18:761–769.
Copas, A.J., and Farewell, V.T. 1998. Dealing with Non-Ignorable Non-Response by Using an "Enthusiasm-to-Respond" Variable. Journal of the Royal Statistical Society, Series A 161(3):385–396.
Corder, L.S., and Horvitz, D.G. 1989. Panel Effects in the National Medical Care Utilization and Expenditure Survey. In: Panel Surveys (Eds. D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley. Pp. 304–319.
Day, N.E. 1969. Two-Stage Designs for Clinical Trials. Biometrics 25:111–118.
Diehr, P., Martin, D.C., Koepsell, T., Cheadle, A., et al. 1995. Optimal Survey Design for Community Intervention Evaluations: Cohort or Cross-Sectional? Journal of Clinical Epidemiology 48(12):1461–1472.
Diggle, P.J. 1989. Testing for Random Dropouts in Repeated Measurement Data. Biometrics 45:1255–1258.
Diggle, P.J., and Kenward, M.G. 1994. Informative Drop-Out in Longitudinal Data Analysis. Applied Statistics 43(1):49–93.
Diggle, P.J., Liang, K-Y., and Zeger, S.L. 1994. Analysis of Longitudinal Data. Oxford: Clarendon Press.
Duncan, G.J., and Kalton, G. 1987. Issues of Design and Analysis of Surveys across Time. International Statistical Review 55(1):97–117.
Duncan, G.J., Juster, F.T., and Morgan, J.N. 1984. The Role of Panel Studies in a World of Scarce Research Resources. In: The Collection and Analysis of Economic and Behavior Data (Eds. S. Sudman and M.A. Spaeth). Champaign, Ill.: Bureau of Economic and Business Research & Survey Research Laboratory. Pp. 94–129.
Gail, M.H., Mark, S.D., Carroll, R.J., Green, S.B., and Pee, D. 1996. On Design Considerations and Randomization-Based Inference for Community Intervention Trials. Statistics in Medicine 15:1069–1092.
Groves, R.M. 1989. Survey Errors and Survey Costs. New York: John Wiley.
Groves, R.M., and Couper, M.P. 1998. Nonresponse in Household Interview Surveys. New York: John Wiley.
Heckman, J.J. 1979. Sample Selection Bias as a Specification Error. Econometrica 47:153–161.
Heckman, J.J., and Robb, R. 1989. The Value of Longitudinal Data for Solving the Problem of Selection Bias in Evaluating the Impact of Treatments on Outcomes. In: Panel Surveys (Eds. D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley. Pp. 512–539.
Hedeker, D., Gibbons, R.D., and Waternaux, C. 1999. Sample Size Estimation for Longitudinal Designs with Attrition: Comparing Time-Related Contrasts Between Two Groups. Journal of Educational and Behavioral Statistics, in press.
Hirano, K., Imbens, G.W., Ridder, G., and Rubin, D.B. 1998. Combining Panel Data Sets with Attrition and Refreshment Samples, NBER Technical Working Paper No. 230. Pp. 1–37.
Holt, D. 1989. Panel Conditioning; Discussion. In: Panel Surveys (Eds. D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley. Pp. 340–347.
Horvitz, D.G., Shah, B.V., and Simmons, W.R. 1967. The Unrelated Question Randomized Response Model. American Statistical Association: Proceedings of the Social Statistics Section. Pp. 67–72.
Hsiao, C. 1986. Analysis of Panel Data. New York: Cambridge University Press.
Hsiao, C. 1985. Benefits and Limitations of Panel Data. Econometric Reviews 4(1):121–174.
Jabine, T.B., King, K.E., and Petroni, R.J. 1990. Survey of Income and Program Participation Quality Profile. Bureau of the Census, U.S. Department of Commerce, Washington, D.C.
Kalton, G. 1986. Handling Wave Nonresponse in Panel Surveys. Journal of Official Statistics 2(3):303–314.
Kalton, G., and Citro, C.F. 1993. Panel Surveys: Adding the Fourth Dimension. Survey Methodology 19(2):205–215.
Kish, L. 1965. Survey Sampling. New York: John Wiley.
Kyriazidou, E. 1997. Estimation of a Panel Data Sample Selection Model. Econometrica 65:1335–1364.
Lai, T.L., Levin, B., Robbins, H., and Siegmund, D. 1980. Sequential Medical Trials. Proceedings of the National Academy of Sciences USA 77(6):3135–3138.
Laird, N.M. 1988. Missing Data in Longitudinal Studies. Statistics in Medicine 7:305–315.
Lepkowski, J.M. 1989. Treatment of Wave Nonresponse in Panel Surveys. In: Panel Surveys (Eds. D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley. Pp. 348–374.
Lessler, J.T., and Kalsbeek, W.D. 1992. Nonsampling Error in Surveys. New York: John Wiley.
Little, R.A. 1993. Pattern-Mixture Models for Multivariate Incomplete Data. Journal of the American Statistical Association 88:125–134.
Little, R.A. 1994. A Class of Pattern-Mixture Models for Normal Incomplete Data. Biometrika 81(3):471–483.
Little, R.A., and Rubin, D.B. 1987. Statistical Analysis with Missing Data. New York: John Wiley.
Maxwell, S.E. 1998. Longitudinal Designs in Randomized Group Comparisons: When Will Intermediate Observations Increase Statistical Power? Psychological Methods 3(3):275–290.
McHorney, C.A., Kosinski, M., and Ware, J.E. 1994. Comparisons of the Costs and Quality of Norms for the SF-36 Health Survey Collected by Mail versus Telephone Interview: Results from a National Survey. Medical Care 32(6):351–367.
Neter, J., and Waksberg, J. 1964. A Study of Response Errors in Expenditure Data from Household Interviews. Journal of the American Statistical Association 59:18–55.
Newhouse, J.P., and the Insurance Experiment Group. 1993. Free for All? Lessons from the RAND Health Insurance Experiment. Cambridge Massachusetts: Harvard University Press.
Overall, J.E., and Doyle, S.R. 1994. Estimating Sample Sizes for Repeated Measurement Designs. Controlled Clinical Trials 15:100–123.
Pierce, P. 1997. Physical and Emotional Health of Gulf War Veteran Women. Aviation, Space, and Environmental Medicine, P. 68.
Presser, S. 1989. Collection and Design Issues: Discussion. In: Panel Surveys (Eds. D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley. Pp. 75–79.
Robbins, H. 1974. A Sequential Test for Two Binomial Populations. Proceedings of the National Academy of Sciences, USA 71:4435–4436.
Rubin, D.B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley.
Rubin, D.B. 1996. Multiple Imputation after 18+ Years. Journal of the American Statistical Association 91:473–489.
Schafer, J.L. 1997. Analysis of Incomplete Multivariate Data. London: Chapman and Hall.
Silberstein, A.R., and Jacobs, C.A. 1989. Symptoms of Repeated Interview Effects in the Consumer Expenditure Interview Survey. In: Panel Surveys (Eds. D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley. Pp. 289–303.
Stretch, R.H., Bliese, P.D., Marlowe, D.H., Wright, K.M., Knudson, K.H., and Hoover, C.H. 1995. Physical Health Symptomatology of Gulf War-Era Service Personnel from the States of Pennsylvania and Hawaii. Military Medicine 160:131–136.
Waterton, J., and Lievesley, D. 1989. Evidence of Conditioning Effects in the British Social Attitudes Panel. In: Panel Surveys (Eds. D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh). New York: John Wiley. Pp. 319–339.
Wei, L.J., and Durham, S. 1978. The Randomized Play-the-Winner Rule in Medical Trials. Journal of the American Statistical Association 73:840–843.
Weinberger, M., Nagle, B., Hanlon, J.T., Samsa, G.P., et al. 1994. Assessing Health-Related Quality of Life in Elderly Outpatients: Telephone versus Face-to-Face Administration. Journal of the American Geriatrics Society 42:1295–1299.
Weinberger, M., Oddone, E.Z., Samsa, G.P., and Landsman, P.B. 1996. Are Health-Related Quality-of-Life Measures Affected by the Mode of Administration? Journal of Clinical Epidemiology 49(2):135–140.
Weinstein, M.C. 1974. Allocation of Subjects in Medical Experiments. New England Journal of Medicine 291:1278–1285.
Whitehead, J. 1997. The Design and Analysis of Sequential Clinical Trials, revised 2nd edition. New York: John Wiley.
Wu, A.W., Jacobson, D.L., Berzon, R.A., Revicki, D.A., et al. 1997. The Effect of Mode of Administration on Medical Outcomes Study Health Ratings and EuroQol Scores in AIDS. Quality of Life Research 6:3–10.
Zelen, M. 1969. Play the Winner Rule and the Controlled Clinical Trial. Journal of the American Statistical Association 64:131–146.