CONSIDERATIONS IN IDENTIFYING AND EVALUATING THE LITERATURE
This chapter reviews the approach that the committee took to identify and evaluate the health studies of Gulf War veterans. It discusses the major types of epidemiologic studies considered by the committee, the factor analysis used by many of the studies that were evaluated, and finally the committee’s inclusion and evaluation criteria.
The committee limited its review of the literature primarily to epidemiologic studies of Gulf War veterans to determine the prevalence of diseases and symptoms in that population. The studies typically examine veterans’ health outcomes in comparison with outcomes in their nondeployed counterparts. Because this report is a review of disease or symptom prevalence, no attempt is made to associate diseases or symptoms with specific biologic or chemical agents potentially encountered in the gulf. In a general way, however, the committee did examine studies that assessed exposures in veterans (Chapter 2) and the influence of exposure information on the interpretation of veterans’ health.
The committee members identified numerous cohort studies (Chapter 4) and case-control studies (Chapter 5) that they objectively reviewed without preconceived ideas about health outcomes. To assist them in their work, they developed criteria to determine which studies to include in their review.
TYPES OF EPIDEMIOLOGIC STUDIES
The committee focused on epidemiologic studies because epidemiology deals with the determinants, frequency, and distribution of disease in human populations. A focus on populations distinguishes epidemiology from medical disciplines that focus on the individual. Epidemiologic studies examine the relationship between exposures to agents of interest in a population and the development of health outcomes (in this review, deployment is the exposure). Such studies can be used to generate hypotheses for study or to test hypotheses posed by investigators. This section describes the major types of epidemiologic studies considered by the committee.
A cohort study is an epidemiologic study that follows a defined group, or cohort, over a period of time. It can test hypotheses about whether exposure to a specific agent is related to the development of disease and can examine multiple disease outcomes that might be associated
with exposure to a given agent (or, for example, to deployment). A cohort study starts with people who are free of a disease (or other outcome) and classifies them according to whether they have been exposed to the agent of interest. It compares health outcomes in people who have been exposed to the agent in question with those who have not.
Cohort studies can be prospective or retrospective. In a prospective cohort study, investigators select a group of subjects and determine who has been exposed and who has not been exposed to a given predictor (independent) variable. They then follow the cohort to determine the rate or risk of the disease (or other health outcome) in the exposed and comparison groups.
A retrospective (or historical) cohort study differs from a prospective study in temporal direction; investigators look back to classify past exposures in the cohort and then track the cohort forward to ascertain the rate of disease.
Cohort studies can be used to estimate a risk difference or a relative risk, two statistics that measure the association between exposure and disease. The risk difference, or attributable risk, is the rate of disease in exposed persons minus the rate in unexposed persons; it represents the number of extra cases of disease attributable to the exposure. The relative risk is determined by dividing the rate of disease in the exposed group (for example, the deployed group) by the rate of disease in the nonexposed group (for example, the nondeployed group). A relative risk greater than 1 suggests an association between exposure and disease onset; the higher the relative risk, the stronger the association.
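As a brief illustration of the two statistics, using entirely hypothetical counts (the group sizes and case numbers below are invented for the example, not drawn from any Gulf War study):

```python
# Hypothetical counts: disease cases among deployed and nondeployed veterans.
deployed_cases, deployed_total = 150, 10_000
nondeployed_cases, nondeployed_total = 100, 10_000

rate_exposed = deployed_cases / deployed_total          # 0.015
rate_unexposed = nondeployed_cases / nondeployed_total  # 0.010

# Risk difference (attributable risk): extra risk associated with deployment.
risk_difference = rate_exposed - rate_unexposed  # 0.005, i.e., 5 extra cases per 1,000

# Relative risk: a value greater than 1 suggests an association.
relative_risk = rate_exposed / rate_unexposed    # 1.5
```

Here the hypothetical deployed group has 1.5 times the disease rate of the nondeployed group, and deployment would account for 5 extra cases per 1,000 persons.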
Cohort studies have several advantages and disadvantages. Generally, the advantages outweigh the disadvantages if the study is well designed and executed. The advantages of cohort studies include the following:
The investigator knows that the predictor variable preceded the outcome variable.
Exposure can be defined and classified at the beginning of the study and subjects can be selected based on exposure definition.
Information on potential confounding variables can be collected in a prospective cohort study so that they may be controlled in the analysis.
Rare or unique exposures (such as Gulf War exposures) can be studied, and the investigators can study multiple health outcomes.
Absolute rates or risks of disease incidence and prevalence can be calculated.
Disadvantages of cohort studies include the following:
They are often expensive because of the long periods of followup (especially if the disease has a delayed onset, for example, cancer), attrition of study subjects, and delay in obtaining results.
They are inefficient for the study of rare diseases or diseases of long latency.
There is a possibility of the “healthy-worker effect” (Monson 1990), the tendency of employed (or deployed) populations to be healthier than the general population, which can introduce bias and mask a true disease-exposure relationship.
In a case-control study, subjects (cases) are selected on the basis of having a disease; controls are selected on the basis of not having the disease. Cases and controls are asked about their exposures to specific agents. Cases and controls can be matched with regard to such characteristics as age, sex, and socioeconomic status to eliminate those characteristics as causes of observed differences, or those variables can be controlled in the analysis. The odds of exposure to the agent among the cases are then compared with the odds of exposure among the controls. The comparison generates an odds ratio, a statistic that depicts the odds of having the disease among those exposed to the agent of concern relative to the odds of having the disease among those unexposed. An odds ratio greater than 1 indicates a potential association between exposure to the agent and the disease; the greater the odds ratio, the stronger the association.
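The odds-ratio calculation can be sketched from a standard 2x2 table; the counts below are hypothetical and chosen only to make the arithmetic visible:

```python
# Hypothetical 2x2 case-control table (counts are illustrative only):
#              exposed   unexposed
# cases         a = 40     b = 60
# controls      c = 20     d = 80
a, b, c, d = 40, 60, 20, 80

odds_cases = a / b         # odds of exposure among cases
odds_controls = c / d      # odds of exposure among controls
odds_ratio = odds_cases / odds_controls  # equivalently the cross-product (a*d)/(b*c)
```

In this invented example the odds ratio is about 2.67, which would suggest a potential association between the exposure and the disease.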
Case-control studies are useful for testing hypotheses about the relationships between exposure to specific agents and disease. They are especially useful and efficient for studying the etiology of rare diseases. Case-control studies have the advantages of ease, speed, and relatively low cost. They are also valuable for their ability to probe multiple exposures or risk factors. However, case-control studies are vulnerable to several types of bias, such as recall bias, which can dilute or enhance associations between disease and exposure. Other problems include identifying representative groups of cases, choosing suitable controls, and collecting comparable information about exposures on both cases and controls. Those problems might lead to unidentified confounding variables that differentially influence the selection of cases or control subjects or the detection of exposure. For the reasons discussed above, case-control studies are often the first approach to testing a hypothesis, especially one related to a rare outcome.
A nested case-control study draws cases and controls from a previously defined cohort; thus, it is said to be “nested” inside a cohort study. Baseline data are collected at the time the cohort is identified, ensuring a more uniform set of data on cases and controls. Within the cohort, individuals identified with the disease serve as cases, and a sample of those who are disease-free serve as controls. Using the baseline data, exposure in cases and controls is compared, as in a regular case-control study. Nested case-control studies are efficient in time and cost because exposure histories need be reconstructed only for the cases and a sample of controls rather than for the entire cohort. Additionally, because the cases and controls come from the same previously established cohort, concerns about unmeasured confounders and selection bias are decreased.
The main differentiating feature of a cross-sectional study is that exposure and disease information is collected at the same point (period) of time. The selection of people for the
study—unlike selection for cohort and case-control studies—is independent of both exposure to the agent under study and disease characteristics. Cross-sectional studies seek to uncover potential associations between exposure to specific agents and development of disease. In a cross-sectional study, effect size is measured as a relative risk, prevalence ratio, or prevalence odds ratio. Such a study might compare disease or symptom rates between groups with and without exposure to the specific agent. Several health studies of Gulf War veterans are cross-sectional studies that compare a sample of veterans who were deployed to the Gulf War with a sample of veterans who served during the same period but were not deployed to the Gulf War.
Cross-sectional studies are easier and less expensive to perform than cohort studies and can identify the prevalence of diseases and exposures in a defined population. They are useful for generating hypotheses, but they are much less useful for determining cause-effect relationships, because disease and exposure data are collected simultaneously (Monson 1990). It might also be difficult to determine the temporal sequence of exposures and symptoms or disease.
Epidemiologic studies can establish statistical associations between exposure to specific agents or situations (for example, deployment to the Gulf War) and health effects, and associations are generally estimated by using relative risks or odds ratios. Epidemiologists seldom consider a single study sufficient to establish an association. It is desirable to replicate findings in other studies to draw conclusions about an association. Results of separate studies are sometimes conflicting. It is sometimes possible to attribute discordant study results to such characteristics as soundness of study design, quality of execution, and the influence of different forms of bias. Studies that result in a statistically precise measure of association suggest that the observed result is unlikely to have been due to chance. When the measure of association does not show a statistically precise effect, it is important to consider the size of the sample and whether the study had the power to detect an effect of a given size.
DEFINING A NEW SYNDROME
As the committee reviewed the literature on the health of Gulf War veterans, one fundamental question arose: does the constellation of veterans’ unexplained symptoms constitute a syndrome? If so, is the symptom constellation best studied and treated as a new syndrome or as a variant of a known syndrome (IOM 2000)? Identification of a new set of unexplained symptoms in a group of patients does not automatically mean that a new syndrome has been found. Rather, it constitutes the beginning of a process to demonstrate that the patients are affected by a clinical entity distinct from other established clinical entities.
The process of defining a new syndrome usually begins with establishment of a case definition that lists criteria for distinguishing the potentially new patient population from patients with known clinical diagnoses. Developing the first case definition is a vital milestone intended to spur research and surveillance. More like a hypothesis than a conclusion, the definition is an early step in the process; it is often revised as more evidence comes to light. Case definitions usually are a mix of clinical, demographic, epidemiologic, and laboratory criteria.
A case definition leads to the creation of a more homogeneous patient population, another step in the eventual establishment of a new syndrome. A potential disadvantage of any case definition—if it is inaccurate—is the mislabeling or misclassification of a condition, which can
thwart medical progress for years, if not decades (Aronowitz 2001). Classification of a new patient population also stimulates further understanding of prevalence, treatment, natural history, risk factors, and ultimately etiology and pathogenesis. As more knowledge unfolds about etiology and pathogenesis, an established syndrome can rise to the level of a disease. The renaming of a syndrome as a disease implies that the etiology or pathology has been identified.
STATISTICAL TECHNIQUES USED TO DEVELOP A CASE DEFINITION
Two statistical techniques have been used by investigators to identify symptom clusters that could be used to develop case definitions suggestive of a new syndrome: factor analysis and cluster analysis. Many of the studies reviewed by the committee use those techniques. The aims of the techniques differ: factor analysis seeks to identify groups of highly related symptoms, whereas cluster analysis seeks to identify groups of people who have similar symptoms. Stated another way, factor analysis analyzes patterns of symptoms, and cluster analysis categorizes people on the basis of their symptoms.
Factor analysis has been used far more frequently than cluster analysis in the major cohort studies. It has been used to identify groups of symptoms that might potentially point to a new syndrome. However, factor analysis by itself cannot definitively identify a new syndrome. That requires more research about the putative syndrome’s clinical features, natural history, genetics, response to treatment, etiology, and pathogenesis (Robins and Guze 1970; Taub et al. 1995).
Factor analysis seeks to identify a small number of groups of highly related variables among a much larger number of measured variables. In the context of Gulf War research, the measured variables are the symptoms that veterans report in surveys. Factor analysis aggregates veterans’ symptoms into smaller groups to discern more fundamental, unmeasured (latent) variables, which are referred to as factors. The idea is that a factor and the group of symptoms that it represents are related pathophysiologically and that the symptoms within a factor are different symptomatic manifestations of the same underlying disease process. In a research context, the factors could be used, for instance, as clinical criteria for a new syndrome.
In the Gulf War literature, the key issue is whether the factors identified by factor analysis are exclusive to deployed veterans as opposed to a comparison population, usually nondeployed veterans. Finding factors peculiar to deployed veterans would imply that they might have a new syndrome with specific symptoms that could indicate biologic plausibility or a common pathophysiology, presumably triggered by an exposure in the Gulf War Theater. The names given to the new factors—such as neurologic factor or cognitive factor—are merely descriptive labels. Their purpose is to convey what the investigators believe the nature of a new syndrome might be, depending on which symptoms group or “load” onto the factor, and, pending further research, to suggest the putative syndrome’s features.
There are several key characteristics by which one can evaluate the methodology and findings of a particular factor analysis. Several are straightforward, such as the sample size, the sample population, the type of symptom reporting (for example, interval, ordinal, or dichotomous), and the particular symptoms subsumed under each factor. Several others, such as the method, rotation, factor-loading cutoff, number of factors isolated, and percentage of variance explained, require further explanation.
Factor analysis examines how closely symptoms on a questionnaire are related in a study population. Studies typically elicit responses to long lists of potential symptoms (such as
numbness and tingling in the extremities) and how severe the symptoms are. Conventional factor analysis correlates either the presence or the severity of each symptom with that of every other symptom. Factor analysis not only identifies symptoms that correlate with one another but also distinguishes them from symptoms with which they are relatively uncorrelated. For instance, in acute gastroenteritis, severe nausea might correlate with severe vomiting and severe diarrhea but not with cough. Classically, factor analysis examines the severity of symptoms rather than just their presence or absence, with severity scored on a continuous (for example, 100-point) scale. Many studies of Gulf War veterans, however, have used dichotomous scales (“present” vs “absent”) or ordinal scales (“none”, “mild”, “moderate”, or “severe”) without using statistical techniques designed for such variables. That can result in underestimation of the strength of a correlation (Muthen 1984).
In conducting a factor analysis, the first issue is which of the many symptoms on a questionnaire should be retained in a factor, that is, how strongly symptoms need to be intercorrelated to be part of the factor. A commonly used method is “factor loading”, or use of a “factor coefficient”, which quantifies how well an individual symptom correlates with a potential factor. There are different statistical techniques (such as principal factor, principal component, iterated principal component, and maximum likelihood factor analyses) to determine how well symptoms load onto a particular factor.
Once a preliminary list of factors is determined, the next step is to decide their relative importance and how well they can explain the universe of symptoms collected from the study population. Several investigators have taken an additional step and have examined potential factors to determine how meaningful or plausible they are clinically or pathophysiologically and to discard the implausible ones.
The investigator examines two statistics to determine how well the proposed factors describe the array of symptoms in the study population. The first is the “factor loading”, which measures how strongly an individual symptom correlates with a factor. Loadings range from -1 to 1, and an absolute loading greater than 0.4 is conventionally used as the cutoff between symptoms that are strongly associated with a factor and those that are not. The second statistic, the “eigenvalue”, measures how well each factor explains the observed relationships among all the symptoms in the study population (technically, the proportion of variance in the data accounted for by the factor). An eigenvalue of more than 1.0 is conventionally taken to mean that a factor should be retained. Taken together, the two statistics can be used in an iterative fashion to establish the best fit to the data. A related technique, “rotation”, transforms the solution so that the factor loadings are easier to interpret.
Once a relatively robust set of factors has been identified, the investigator can examine how well the individual symptoms correlate with one another within each factor. Finally, the proportion of variance explained by each factor is plotted against the factors arranged in order of decreasing eigenvalue (a scree plot). Investigators look for a sharp dropoff in the curve as an arbitrary cut point between important factors and minor or weak ones. The result is a small number of factors that explain a large proportion of the variance observed in participants’ answers to the symptom questionnaires.
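A minimal sketch of the eigenvalue retention rule, using hypothetical severity scores for just two symptoms (the symptom names and scores are invented for illustration): for a two-variable correlation matrix, the eigenvalues are simply 1 + r and 1 - r, where r is the Pearson correlation between the two symptoms, so the conventional eigenvalue > 1.0 rule retains a single shared factor whenever the symptoms are positively correlated.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical severity scores (0-100 scale) for two symptoms in six veterans.
fatigue = [80, 65, 70, 20, 10, 30]
myalgia = [75, 60, 80, 25, 15, 20]

r = pearson_r(fatigue, myalgia)
# For a 2-variable correlation matrix, the eigenvalues are 1 + r and 1 - r.
eigenvalues = [1 + r, 1 - r]
retained = [ev for ev in eigenvalues if ev > 1.0]  # eigenvalue > 1 retention rule
```

With strongly correlated symptoms (r near 1), nearly all the variance loads on a single factor, mirroring the retention logic described above; real factor analyses apply the same idea to correlation matrices of dozens of symptoms.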
A somewhat related technique, cluster analysis, has also been used in several of the Gulf War studies (see Chapter 5) to determine how groups of patients with certain symptoms might relate to one another. A k-means cluster analysis partitions study subjects into k clusters of individuals on the basis of their reported symptoms. Individual participants are then reassigned in an iterative fashion until a best fit is reached and the distinctness of the clusters is not improved by further reassignment. For instance, in a group of persons exposed to staphylococcal enterotoxin at a picnic, there would probably be two clusters: one of persons with severe vomiting and nausea and one of persons with mild or even absent symptoms. Although cluster analysis takes a different statistical approach to examining symptom complexes, studies that use it, like factor-analytic studies, ultimately depend on representative samples and on the accuracy of self-reported data; they can suffer from both selection (or participation) bias and recall bias to the extent that persons differ in their willingness to participate and in their recall of symptoms.
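The iterative reassignment described above can be sketched with a minimal k-means implementation; the (nausea, vomiting) severity scores below are hypothetical, echoing the picnic example.

```python
import math

def kmeans(points, k, iters=100):
    """Minimal k-means: partition subjects' symptom-score vectors into k clusters."""
    centers = [list(p) for p in points[:k]]  # naive seeding: first k subjects
    assign = None
    for _ in range(iters):
        # Assign each subject to the nearest cluster center.
        new_assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_assign == assign:
            break  # no further reassignments: best fit reached
        assign = new_assign
        # Recompute each center as the mean of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

# Hypothetical (nausea, vomiting) severity scores after a picnic outbreak.
scores = [(9, 8), (8, 9), (9, 9), (1, 0), (0, 1), (2, 1)]
labels, centers = kmeans(scores, k=2)
```

On these invented data the algorithm separates the severely ill subjects from the mildly affected ones; production analyses would use a library implementation with better seeding and would run multiple starts.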
The committee’s evaluation included studies that would enable it to respond to its charge “to help inform the Department of Veterans Affairs of illnesses among Gulf War veterans that might not be fully appreciated”. The committee included studies that would answer the question, What does the literature tell us about the health status of Gulf War veterans? To that end, the committee searched the literature and included descriptive epidemiologic studies of health outcomes in military personnel who served in the Gulf War Theater. The studies were not restricted to US personnel. A study also needed to demonstrate rigorous methods (for example, publication in a peer-reviewed journal, a detailed description of methods, a control or reference group, the statistical power to detect effects, and reasonable adjustments for confounders), to include information regarding a persistent health outcome, and to include a medical evaluation conducted by a health professional, with laboratory testing as appropriate. Those types of studies constituted the committee’s primary literature. The committee did not evaluate studies of acute trauma, rehabilitation, or transient illness.
Studies reviewed by the committee that did not necessarily meet all the criteria of a primary study are considered secondary studies. Secondary studies are typically not as methodologically rigorous as primary studies and might present subclinical findings, that is, findings of altered functioning consistent with later development of a diagnosis but without clear predictive value.
Another step that the committee took in organizing the literature was to determine how the studies were related to one another. Numerous Gulf War cohorts have been assembled in several countries, and from those original cohorts many derivative studies have been conducted. The committee organized the literature into the major cohorts and their derivative studies because it did not want to interpret findings from the same cohort as though they were results from unique groups (Chapter 4).
Finally, in assessing the descriptive studies, the committee was especially attentive to potential sources of bias, confounding, chance, and multiple comparisons, as discussed in the next section.
In addition to determining the primary and secondary literature that would be used to draw conclusions, the committee considered other characteristics of the studies. Those characteristics have to do with the methods used in designing and conducting the studies and include bias, confounding, and chance.
Bias refers to systematic, or nonrandom, error. Bias causes an observed value to deviate from the true value and can weaken an association or generate a spurious one. Because all studies are susceptible to bias, a goal of research design is to minimize bias or to adjust the observed value of an association by correcting for it. There are several types of bias. Selection bias occurs when systematic error in the recruitment or retention of participants distorts the observed association between exposure and outcome.
Information bias results from the manner in which data are collected and can result in measurement errors, imprecise measurement, and misdiagnosis. Those types of errors might be uniform in an entire study population or might affect some parts of the population more than others. Information bias might result from misclassification of study subjects with respect to the outcome variable or from misclassification of exposure. Other common sources of information bias are the inability of study subjects to recall the circumstances of their exposure accurately (recall bias) and the likelihood that one group more frequently reports what it remembers than another group (reporting bias). Information bias is especially harmful in interpreting study results when it affects one comparison group more than another.
Confounding occurs when a variable known to be predictive of an outcome and associated with the exposure (but not on the causal pathway) accounts for part or all of an apparent association. A confounding variable that is uncontrolled influences the outcome of a study to an unknown extent and makes precise evaluation of the exposure’s effect impossible. Carefully applied statistical adjustments can often control for or reduce the influence of a confounder.
Chance is a type of error that can lead to an apparent association between an exposure to an agent and a health effect when no association is present. An apparent effect of an agent on a health outcome might be the result of random variation due to sampling in assembly of the study population rather than the result of exposure to the agent. Standard methods that use confidence intervals, for example, allow one to assess the role of chance variation due to sampling.
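As a small illustration of assessing chance variation with a confidence interval, consider a hypothetical survey result (the counts are invented); this uses the standard normal approximation for a proportion:

```python
import math

# Hypothetical survey: 300 of 1,000 veterans report a given symptom.
cases, n = 300, 1_000
p = cases / n                     # observed prevalence, 0.30
se = math.sqrt(p * (1 - p) / n)   # standard error of the proportion

# Approximate 95% confidence interval (normal approximation).
ci_low, ci_high = p - 1.96 * se, p + 1.96 * se
```

The resulting interval, roughly 0.27 to 0.33, conveys the range of prevalences consistent with the observed data given sampling variation alone; a wider interval would signal a greater role for chance.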
When an investigator performs a large number of statistical comparisons on the same dataset, multiplicity of comparisons poses a problem: among so many comparisons, the investigator is bound to find something of note by chance alone. In many Gulf War veteran studies, for example, the investigators compare multiple outcomes and multiple exposures. There are, however, ways to correct for multiple comparisons. One is the Bonferroni correction, a statistical adjustment that effectively raises the standard of proof required when an investigator examines a wide array of hypotheses simultaneously.
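The Bonferroni correction itself is simple arithmetic: the nominal significance level is divided by the number of comparisons. The number of comparisons and the p-values below are hypothetical:

```python
alpha = 0.05   # nominal significance level for a single comparison
m = 40         # number of simultaneous comparisons (hypothetical)

# Bonferroni-adjusted threshold: each test must now reach p < 0.00125.
bonferroni_alpha = alpha / m

p_values = [0.004, 0.0009, 0.03]  # illustrative unadjusted p-values
significant = [p for p in p_values if p < bonferroni_alpha]
```

In this example, a result that looks significant at p = 0.004 no longer meets the corrected threshold once 40 comparisons are taken into account.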
Assignment of Causality
In addition to general considerations of research quality, the assessment of the studies reviewed raises a complex set of issues related to the assignment of causality (Pearl 2000). For purposes of this study, the committee scrutinized the degree to which studies were likely to provide strong causal evidence. To that end, the committee was guided by the Bradford Hill criteria (Hill 1965). In the spirit of those criteria (Phillips and Goodman 2004), the inferences expressed in this report are based on the totality of evidence reviewed and the committee’s collective judgment.
LIMITATIONS OF GULF WAR VETERAN STUDIES
The studies to date have provided valuable information regarding the health of Gulf War veterans; however, many of them have limitations that hinder accurate assessment of the veterans’ health status. Chapter 4 discusses the limitations, which include the possibility that study samples are not representative of the entire Gulf War population, low rates of participation, self-reporting of symptoms and exposures, narrowness of the assessment of health status, insensitivity of instruments for detecting abnormalities in deployed veterans, and periods of investigation too brief to detect health outcomes of long latency, such as cancer. In addition, many of the US studies are cross-sectional, and this limits the opportunity to learn about symptom duration and chronicity, latency of onset, and prognosis. Finally, the problem of multiple comparisons common to many of the Gulf War studies creates uncertainty about whether an observed effect is real or due to chance. Those limitations make it difficult to interpret the findings, particularly when several well-conducted studies produce inconsistent results.
The committee reviewed and evaluated studies from the scientific and medical literature that were identified with searches of bibliographic databases and other methods. The committee adopted a policy of using only peer-reviewed published literature as the basis of its conclusions. Publications that were not peer-reviewed had no evidentiary value for the committee, that is, they were not used as evidence for arriving at the committee’s conclusions about the prevalence of health effects. The process of peer review by fellow professionals promotes high standards of quality, although it does not guarantee the validity or generalizability of a study’s findings.
Committee members read each article critically. In some instances, nonpeer-reviewed publications provided background information for the committee and raised issues that required further research. The committee, however, did not collect original data, nor did it perform any secondary data analysis. In its evaluation of the peer-reviewed literature, the committee considered several important issues, including quality and relevance; error, bias, and confounding; and the diverse nature of the evidence and the research.
REFERENCES
Aronowitz RA. 2001. When do symptoms become a disease? Annals of Internal Medicine 134(9 Pt 2):803-808.
Hill AB. 1965. The Environment and disease: Association or causation? Proceedings of the Royal Society of Medicine 58(10):295-300.
IOM (Institute of Medicine). 2000. Gulf War and Health, Volume 1. Depleted Uranium, Sarin, Pyridostigmine Bromide, Vaccines. Washington, DC: National Academy Press.
Monson R. 1990. Occupational Epidemiology. 2nd ed. Boca Raton, FL: CRC Press, Inc.
Muthen B. 1984. A general structural equation model with dichotomous, ordered, categorical, and continuous latent variable indicators. Psychometrika 49(1):115-132.
Pearl J. 2000. Causality: Models, Reasoning and Inference. Cambridge, UK: Cambridge University Press.
Phillips CV, Goodman KJ. 2004. The missed lessons of Sir Austin Bradford Hill. Epidemiologic Perspectives and Innovations 1(1):3.
Robins E, Guze SB. 1970. Establishment of diagnostic validity in psychiatric illness: Its application to schizophrenia. American Journal of Psychiatry 126(7):983-987.
Taub E, Cuevas JL, Cook EW 3rd, Crowell M, Whitehead WE. 1995. Irritable bowel syndrome defined by factor analysis. Gender and race comparisons. Digestive Diseases and Sciences 40(12):2647-2655.