3 Determining Treatment Effectiveness
Once the committee identified the health conditions upon which to focus (see Chapter 2), it had to determine how to evaluate the effectiveness of treatments for those conditions. There are a number of ways to show that a given treatment is effective in treating a disease or clinical condition. Studies of treatments typically start either with laboratory studies establishing a possible or plausible effect of a treatment or with uncontrolled clinical observations of that effect. Small pilot studies, larger controlled trials, and, finally, studies of efficacy in large clinical populations gradually build a case for the value of a given treatment. There is no point along this sequence when a treatment is unequivocally “proven” efficacious, since no single study is totally free of all methodological flaws and even a set of studies may be flawed and produce misleading conclusions. The strength of evidence for or against a given treatment can be graded, however, and there is a point at which the medical and scientific communities can reach consensus about the efficacy (or lack thereof) of a treatment (Guyatt et al. 2000). In this chapter, we will review these “rules of evidence” and indicate how they can be applied to treatments for Gulf War veterans' health problems.
STUDY DESIGNS AND STRENGTH OF INFERENCE
Questions of treatment effectiveness are fundamentally questions about cause-and-effect relationships. If an effective treatment is applied, some detectable improvement in a patient's condition should occur. If the treatment is not applied, no improvement occurs or the patient gets worse.
If the treatment is applied in higher doses or more frequently (at least up to a point), the improvement should be greater or occur sooner. Because there may be other causes of improvement besides the treatment in question, the improvement must be shown in multiple patients, in multiple settings, and in circumstances where as many other possible causal factors as possible can be ruled out.
Assessments of the efficacy of specific treatments typically start with some evidence of biological plausibility. Basic laboratory studies or other kinds of knowledge that do not involve direct tests of a treatment in live human patients may suggest that a treatment or class of treatments should work. No matter how compelling the arguments for plausibility, though, plausibility per se is not evidence for treatment efficacy.
The randomized controlled trial (RCT) is the most reliable methodology for assessing the efficacy of treatments in medicine. In such a trial a defined group of study patients is assigned to either receive the treatment or not, or to receive different doses of the treatment, through a formal process of randomization. A coin flip is the simplest example of a random process. In a study with two “arms” (e.g., treatment or no treatment), each eligible patient would receive whatever a coin flip indicated—heads for treatment and tails for no treatment. In a large number of patients, any clinical or demographic factors such as age, height, weight, illness history, other illnesses, or any other unknown factor that might affect the results of the treatment would be equivalent in the two groups. These will all be eliminated, then, as plausible competing explanations for any observed difference in outcome between the two groups.
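The balancing effect of randomization can be illustrated with a small simulation (a hypothetical sketch with invented numbers, not data from any actual trial): simulated patients are assigned to two arms by coin flip, and with a large enough sample the average age in the two arms comes out nearly identical.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# 10,000 hypothetical patients with ages spread uniformly from 20 to 80.
ages = [random.uniform(20, 80) for _ in range(10_000)]

# Coin-flip assignment: "heads" -> treatment arm, "tails" -> control arm.
treatment, control = [], []
for age in ages:
    (treatment if random.random() < 0.5 else control).append(age)

def mean(xs):
    return sum(xs) / len(xs)

diff = abs(mean(treatment) - mean(control))
print(f"treatment n={len(treatment)}, control n={len(control)}")
print(f"difference in mean age: {diff:.2f} years")  # close to zero
```

The same balance holds, in expectation, for any patient characteristic, measured or unmeasured, which is precisely what removes those characteristics as competing explanations for an observed difference in outcomes.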
Randomized trials typically include other features that increase the strength of the conclusions about cause-and-effect relationships between the treatment and the outcome of interest. Some patients may be excluded from the study because they have conditions that make it impossible to evaluate outcomes or gather data (e.g., extremely elderly patients may be excluded from a study of a cancer treatment because too many of them would die of other conditions before the end of a five-year follow-up period). The study of the efficacy of a drug may include lab tests that measure the level of the drug in the bloodstream. This is done to ensure that the patients assigned to the treatment group actually received the drug while the patients randomized to “no treatment” did not take it on their own. A study may include near-term clinical measures of benefit (e.g., reduction in blood pressure or cholesterol level) as well as long-term objective measures of benefit (e.g., remissions of tumors, mortality) or long-term subjective measures of benefit (e.g., self-reported pain or functional status levels).
Even though an RCT provides strong evidence for or against the efficacy of a given treatment being tested, no one study is ever so perfect that the results cannot be challenged. A study may show an absence of an
effect of a drug, for example, because the dose chosen for study was too low. The patients in a particular study may be unique in some way that makes them not representative of all patients to whom the treatment might be given in the future. For example, if all study subjects are middle-age white men, it is not clear whether the treatment would work in the same way for older or younger Asian women. Results that appear significant in a near-term follow-up may change with longer follow-up (e.g., a treatment shrinks tumors dramatically for two months, but the cancers recur and the patients die after 18 months). The most powerful evidence of treatment efficacy comes from the cumulative, consistent results of several RCTs, preferably in different patient populations and in different settings, and with extensive follow-up periods.
Other kinds of studies (i.e., quasi-experimental designs) can provide evidence of treatment efficacy, too. In situations where it is technically or ethically impossible to run concurrent control groups, a series of “off/on” periods of treatment in a single group of patients can be studied. In these studies treatment is administered to a single group of patients and then taken away. Evidence of efficacy is provided if the benefit is consistently seen when treatment is given and the benefit disappears when treatment is not given. This is a specific example of a before-after study design without controls. A single round of off/on provides very weak evidence for effectiveness unless results are unusual and dramatic, because many other things occurring at the same time as the treatment may have caused the result. Being able to repeat the effect over and over again strengthens the argument for the treatment, rather than something else, being the cause of the effect.
One could also use a cohort study. In this study design a large number of patients who receive a treatment are followed over time to observe a possible benefit and are compared to those who did not receive the treatment. It may offer strong evidence of treatment effectiveness if the group studied is particularly large so that other possible causes of an effect may be evaluated through statistical analysis, or if the result is unusually strong and/or consistent in the large group. For example, one might observe a lower rate of heart attacks in a large group of men taking an aspirin every day over a 10-year period compared to similar men who did not take it. One might challenge the results, though, and ask whether the men taking aspirin became more health conscious in general and also lost weight, drank less, quit smoking, or did something else that was actually the cause of the reduced rate of heart attacks. Being able to go into the database and find the effect in a subset of men who did not lose weight or quit smoking or drink less would offer an answer to the challenge.
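The stratified check described above can be sketched in a few lines (the cohort records here are entirely invented for illustration): restricting the comparison to men whose lifestyle did not change removes lifestyle change as a competing explanation for the lower heart-attack rate.

```python
# Hypothetical cohort records: (took_aspirin, changed_lifestyle, had_heart_attack)
cohort = [
    # aspirin takers who also improved their lifestyle
    *[(True, True, False)] * 90, *[(True, True, True)] * 10,
    # aspirin takers with unchanged lifestyle
    *[(True, False, False)] * 85, *[(True, False, True)] * 15,
    # non-takers with unchanged lifestyle
    *[(False, False, False)] * 70, *[(False, False, True)] * 30,
]

def attack_rate(records):
    """Proportion of records in which a heart attack occurred."""
    return sum(1 for _, _, attack in records if attack) / len(records)

# Restrict to the stratum with no lifestyle change, then compare arms.
stable = [r for r in cohort if not r[1]]
aspirin_rate = attack_rate([r for r in stable if r[0]])
control_rate = attack_rate([r for r in stable if not r[0]])

print(f"aspirin: {aspirin_rate:.2f}, no aspirin: {control_rate:.2f}")
# A lower rate within this stratum supports aspirin itself as the cause.
```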
Another possible study design is a case-control study. In this study design, patients are assigned to study groups based on the results of treatment rather than the treatment itself. One might, for example, identify
some patients who survived one year after a heart attack and other patients who died within a year. If most or all of the patients who survived received a certain kind of treatment and few or none of those who died received the treatment, the treatment might have improved the odds of survival. In this kind of study and in the cohort study, there is no direct control over who receives what treatment, so there may be competing explanations for the effect. Perhaps only healthier, stronger patients received the treatment, with the possibility that the survival benefit was related to their health and strength rather than the treatment. Because this design leaves open many competing explanations, it is less powerful than an RCT for assessing treatment effectiveness and is usually followed by an RCT to confirm findings.
Still weaker evidence for treatment effectiveness comes from uncontrolled clinical observations or anecdotes. These types of studies may involve small numbers of patients, treatments that are not well defined or that vary from patient to patient, variable periods of follow-up, unstated or varying rules for selecting patients for study, or outcome measures of unknown validity (e.g., patient says he “feels better”). They may not be studies at all in the sense that there is any organized effort to answer a scientific question about a treatment. They may simply be the collected reports of the experiences of patients who share some basic characteristics of illness and treatment. Although these kinds of studies can appear in major professional journals because they do provide some evidence of potential treatment effectiveness, they are generally not considered to provide strong evidence because of the many possible competing explanations for the observed effect that they cannot rule out.
In summary, there are a number of study designs that can provide varying levels of evidence of treatment efficacy. They include (from strongest to weakest):
Multiple well-designed randomized controlled trials
Single well-designed RCT or multiple small RCTs
Cohort or case-control study; before-after study, particularly one with multiple “off/on” cycles
Series of clinical observations or anecdotes
EFFICACY VERSUS EFFECTIVENESS
The discussion above has used the term “efficacy” to refer to the typical change in a patient's (or group of patients') health status brought about by a given treatment. There is an important distinction, however, between two similar-sounding terms:
Efficacy is the benefit produced by a given treatment in tightly
controlled, perhaps artificial, study conditions in which patients are carefully selected and may be more frequently observed, tested, and monitored than is typically the case in routine clinical practice.
Effectiveness is the benefit produced by the given treatment in day-to-day clinical practice, in unselected patient populations that do not receive extra tests, education, or visits because of participation in a study.
With this distinction in mind, most RCTs would be properly labeled efficacy studies rather than effectiveness studies. The results, then, reflect treatment efficacy rather than treatment effectiveness. Treatment effectiveness would typically be established after one or more RCTs, when the treatment came to be widely used in a variety of clinical settings in diverse patient populations.
It is possible to imagine an “effectiveness RCT” that would combine the design features of the RCT with a set of sampling and analytical features that would permit direct extrapolation of findings to routine clinical practice (Roper et al. 1988). An effectiveness RCT could be viewed as a hybrid that combines the real-world features of effectiveness studies with some of the study design features typically found in efficacy studies. In an effectiveness RCT, one would have relatively light (if any) patient exclusion criteria, so that the patients in the trial would be as similar as possible to those to whom the results would be generalized. The study would be run (to the extent possible) in a range of treatment settings rather than in a single academic medical center context. The treatment would be provided by the same kinds of providers (e.g., community physicians or nurses) who would provide the treatment in nonstudy settings. There would not be an elaborate data collection infrastructure (e.g., extra lab test or imaging studies) that would create a different “information environment” for treating clinicians and patients than the one that would be found in real-world treatment settings. Analysis would be done on an “intention to treat” basis. The study would have random assignment of patients to treatment arms and would have one or more control groups (e.g., placebo controls, waiting list controls, different dose or regimen controls, or other controls that would make sense for the question being asked).
If we adhere to this terminology, we will find that there is very little formal evidence of treatment effectiveness for most treatments for medical problems in Gulf War veterans because relatively few true effectiveness studies have been done on any medical condition.
In an influential 1988 article, Paul Ellwood coined the term “outcomes management” and challenged the medical community to formally
assess patient outcomes, not only in the context of specific research projects but also in the context of routine clinical care. His concept of outcome included not only the directly measurable outcomes of clinical trials, such as objective tumor shrinkage or mortality, but also a range of self-reported patient outcomes, such as functional health status, pain, ability to work, and overall quality of life. Part of his concept included the building of large data repositories in which clinicians would build a “collected clinical experience” of the baseline characteristics, treatments, and outcomes of thousands and thousands of patients. With this database available, he argued, future clinicians could enter the characteristics of a patient with a given problem and almost immediately obtain data on the relative effectiveness of several treatment choices in patients like the one in question.
There are a number of outcomes research projects that represent the closest approximation to a study of true treatment effectiveness (Magid et al. 2000). The National Registry of Myocardial Infarction is one example; the APACHE III and the Spine Surgery Consortium for Outcomes Research are others. The American Medical Group Association has been the sponsor of a large number of related outcomes projects that have been carried out in the past 10 years.
In the scheme of study designs described above, these studies may perhaps best be categorized as cohort studies, although the cohort is defined in terms of the clinical condition rather than by the receipt of a specific treatment. They do not typically have the tight control of patient eligibility criteria and the random assignment of treatments that are the hallmarks of the RCT. It is presumed that the effects of possible alternative causal factors can be identified and controlled through appropriate statistical techniques, and this is made possible by large sample sizes and the collection of a wide range of relevant variables.
The committee is not aware of any ongoing outcomes management or outcomes research projects involving either a specific cohort of Gulf War veterans or a specific clinical condition of particular concern to Gulf War veterans. There clearly is an opportunity, though, in both VA and DoD settings to organize such a project since some of the functions of identifying a cohort of patients and gathering baseline data have already been completed through their registry programs.
We have stated that the strongest form of evidence for treatment effectiveness is that of several well-designed RCTs whose results are consistent. What about the situation, though, in which study results are not fully consistent or where there are multiple studies of differing sizes and
degrees of design rigor? The technique of meta-analysis was developed to address precisely this situation.
In a meta-analysis the results of multiple studies are combined to yield an overall cross-study estimate of treatment effectiveness (DerSimonian and Laird 1986). The key characteristics of a meta-analysis include:
explicit criteria for deciding which studies are relevant and are to be included;
explicit criteria for reviewing published or unpublished literature and choosing candidate studies for the analysis;
explicit criteria for grading studies according to rigor of design and execution and resulting strength of evidence;
explicit criteria for assigning “weights” to individual studies that reflect the strength of evidence in each one; and
a statistical method for aggregating the results of different studies that may have widely varying sample sizes, definitions of study end-points, follow-up intervals, and statistical tests of effect.
In principle, the results of a good meta-analysis should provide the closest possible approximation to “definitive evidence” of treatment effectiveness, since they are based on a formal, well-defined integration of the results of multiple studies conducted in different populations in different settings. Just as is the case with other study designs, however, meta-analysis does not always live up to its promise. It is necessary to carefully examine the criteria and methods used in each meta-analysis to determine the quality of evidence provided by its conclusions.
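The aggregation step can be made concrete with a minimal sketch of the random-effects method of DerSimonian and Laird (1986), cited above; the effect sizes and variances in the example are invented for illustration. Each study is weighted by the inverse of its within-study variance plus an estimate of the between-study variance, so that small, imprecise studies are down-weighted but not ignored.

```python
def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate (DerSimonian & Laird 1986).

    effects: per-study effect estimates
    variances: the corresponding within-study variances
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    pooled_fe = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    # Cochran's Q measures observed between-study heterogeneity.
    q = sum(wi * (e - pooled_fe) ** 2 for wi, e in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # estimated between-study variance
    # Re-weight each study by its total (within + between) variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    return sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)

# Two hypothetical studies: a small imprecise one and a large precise one.
print(dersimonian_laird([0.6, 0.1], [0.04, 0.01]))  # pooled estimate near 0.32
```

Note how the random-effects estimate sits between the precision-weighted fixed-effect answer (0.2) and the simple average (0.35): once heterogeneity is acknowledged, the large study no longer dominates.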
LEVELS OF EVIDENCE FOR EFFICACY AND EFFECTIVENESS RESEARCH
A recent description of levels of evidence from the Evidence-Based Medicine Working Group (Guyatt et al. 2000) repeats the well-accepted hierarchy that places multiple well-designed randomized controlled trials at the top of the hierarchy and uncontrolled clinical observations at the bottom. The article discusses “N of 1 randomized trials,” indicating that for individual patients a carefully controlled trial of potentially effective treatments (with blinding, placebo controls, etc.) provides the most compelling evidence of effectiveness for that patient. It is not the case, however, that a number of “N of 1” studies added together would be comparable to an RCT.
A revision to the generally accepted hierarchy of evidence might be indicated, however, particularly when the question of treatment effectiveness (rather than efficacy) is being addressed. From the perspective of the
evaluation of treatment effectiveness, there are two general classes of studies, each with a balance of strengths and weaknesses:
Treatment efficacy studies, including prospective randomized trials, emphasize internal validity at the expense of external validity. That is, the sampling, data collection, and data analysis procedures are designed to support the strongest possible inferences about associations between independent and dependent variables (i.e., cause and effect) in a tightly controlled context. The best, strongest studies in tightly controlled situations may still lack generalizability, and therefore applicability, to routine medical practice.
Treatment effectiveness studies, including the largest and most comprehensive outcomes studies, emphasize external validity at the expense of internal validity. They may involve very large samples that are fully representative of the patients seen in routine clinical practice but may include confounding factors that weaken the inferences about cause-and-effect relationships.
In determining a policy to follow in developing guidelines about effective treatments for Gulf War illnesses, we suggest that both types of studies be considered in developing a hierarchy of evidence. We suggest that this be done by considering a “parallel” hierarchy with efficacy studies on one side and effectiveness studies on the other, as illustrated in Table 3-1.
SPECIFIC LEVELS OF EVIDENCE OR WEIGHTS
Table 3-1 is organized to suggest that types of studies at the same vertical position in the two columns should be seen as equally powerful for establishing treatment effectiveness. That is, for purposes of evaluating treatment effectiveness, the results of a single well-designed outcomes study should be considered to be as compelling as the results of a single well-controlled randomized trial. The former will have few concerns about the generalizability of its findings to real-world settings (external validity) but perhaps some serious concerns about internal validity; the latter will have the opposite pattern of strengths and weaknesses. If studies of both were available with similar results, the combined evidence would be quite powerful. Studies of the two types with conflicting findings would essentially cancel each other out and no conclusion could be drawn.
TABLE 3-1

Emphasis on Efficacy (left column):
Systematic review (e.g., meta-analysis) of several well-controlled randomized trials—consistent results
Single, well-controlled randomized trial

Emphasis on Effectiveness (right column):
Systematic review (e.g., meta-analysis) of several well-designed outcomes studies or “effectiveness RCTs”—consistent results
Single, well-designed outcomes study or “effectiveness RCT”

Spanning both columns, in descending order of strength:
Consistent findings from multiple cohort, case-control, or observational studies*
Single cohort, case-control, or observational study
Uncontrolled experiment, unsystematic observation, expert opinion, or consensus judgments

* It is not clear in many cases whether an observational or case-control study is an efficacy study or an effectiveness study. In principle, the label depends on the extent to which the study sample and study procedures reflect the complexities and realities of daily clinical practice. For any one study the distinction may be ambiguous; when it is clear, more credence should be given to studies that truly reflect effectiveness.

NOTE: N of 1 effectiveness studies: This study design offers the strongest possible evidence for the effectiveness of a treatment in a given patient but does not necessarily speak to the issue of effectiveness for patients in general (Sackett et al. 1991). The VA and DoD may use this evidence to support policies on payment for treatment or for deciding on promising treatments to be investigated with more definitive designs. Evidence from a single N of 1 study should not be used to establish general treatment guidelines.

In terms of the levels of evidence such as those used by the Clinical Preventive Services Task Force (CPSTF), we suggest that both effectiveness and efficacy studies at or above the split into two columns be considered “Level I or Level II evidence” in the future. The set of cohort, observational, and case-control studies would be “Level III or IV evidence,” depending on whether the evidence is based on multiple studies or a single study, and the set of expert opinion or uncontrolled clinical observations would be “Level V evidence.” For the efficacy studies, our use is fully consistent with the CPSTF's use of the terms. We do, however, add outcomes/effectiveness studies to those study designs capable of providing the highest levels of evidence when the major question is one of effectiveness rather than efficacy. This hierarchy also implies that, when the focus of evaluation is on treatment effectiveness, and in the absence of RCTs specifically designed to assess effectiveness in real-world settings, evidence from well-designed outcomes studies may provide Level I or Level II evidence and serve as the basis for clinical policies and treatment guidelines. It also implies that Level I evidence of efficacy, even if based on a compelling set of RCTs, may not provide Level I evidence of treatment effectiveness.
Defining an Effect
Any study of efficacy or effectiveness must have one or more defined outcomes or endpoints to assess. These can include “ultimate” endpoints like mortality rates or length of survival or can include “intermediate” endpoints like reduction in blood pressure or shrinkage of tumor size. The endpoints may also be objective (i.e., can be measured reliably by an outside observer) or subjective (i.e., only observable and reportable by the patient or based on the judgment of a clinician).
In the population of Gulf War veterans, many of the treatment outcomes will be subjective. Since the most frequently reported symptoms include dimensions of pain, fatigue, difficulty concentrating, and difficulty performing normal work and social activities, the measures of effectiveness must presumably be in these domains. There are reliable, valid, and sensitive measures in virtually all of these domains. Not all of the measures are equally reliable, valid, or sensitive in all possible study populations, however, so it is necessary to evaluate whether the measure used in any one study is an appropriate measure of treatment effectiveness. Measures designed to detect differences in large samples of relatively healthy people, for example, may not be suitable for detecting differences in individual patients who are extremely ill.
For any specific measure chosen, there is an additional question of how an effect is defined, either for an individual patient or for a group of patients. Achieving complete remission or cure, for example, is different (and presumably harder) than achieving an objectively or subjectively detectable improvement. A treatment that has a “significant effect” in a large group of patients may actually have no effect or a negative effect on some individuals in the treatment group. A group that is better off on average may include individuals who are not better off or who are worse off. Depending on the design of the study and the amount of statistical analysis done, it may or may not be possible to identify those who will benefit from receiving the treatment in the future and those who will not.
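The distinction between a group-average effect and individual benefit can be shown with a toy example (the change scores below are invented): a treatment group can show a clearly positive average change even though several individuals in it are worse off.

```python
# Hypothetical change scores (positive = improvement) for ten treated patients.
changes = [8, 5, 6, -3, 7, 4, -2, 9, 3, -1]

mean_change = sum(changes) / len(changes)
worse_off = [c for c in changes if c < 0]

print(f"average change: {mean_change:+.1f}")    # positive improvement on average
print(f"patients worse off: {len(worse_off)}")  # yet some individuals declined
```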
Defining the Study Population and Universe of Patients
Most treatment effectiveness studies are carried out on patients with well-defined diseases or other clinical conditions. There is typically a body of scientific knowledge of disease etiology, basic biological mechanisms, and the way in which treatment is expected to affect those mechanisms.
For Gulf War veterans, the situation is more complicated. Although several studies have identified specific symptoms or possibly clusters of symptoms that occur with greater frequency in Gulf War veterans, there
continues to be no single “Gulf War syndrome” or any clearly established etiology for the symptoms or symptom clusters. Veterans who are experiencing severe fatigue, for example, may or may not meet criteria for a diagnosis of chronic fatigue syndrome (CFS). Studies of CFS, then, may or may not be generalizable to the entire population of veterans experiencing fatigue even if they are deemed generalizable to veterans with CFS. Similarly, results of studies of fibromyalgia, depression, or migraine may or may not be applicable to Gulf War veterans who are experiencing symptoms of these conditions but do not have all the characteristics that would justify a formal diagnosis. One should not assume that Gulf War veterans with medically unexplained symptoms have one of the diseases of unknown etiology discussed here. Results of studies on conditions with unknown etiology may not generalize directly to Gulf War veterans whose similar symptoms may have a different etiology. However, given currently available diagnostic information and the lack of effectiveness studies conducted on Gulf War veterans, identification of effective treatments for such conditions as these may offer the best opportunity for alleviating the health problems of Gulf War veterans.
Another complication is the current absence of criteria for defining the presence, duration, progression, or severity of Gulf War veterans' health problems that match similar criteria for diagnosable illnesses. Clinical trials may be done on conditions that are “acute” or “uncomplicated,” or “recurrent” or “severe.” The patients recruited for such trials must meet certain explicit criteria for being in that category. Although the specific symptoms experienced by veterans may be categorized in these ways, and future prospective studies may be conducted on groups defined in this way, it will be challenging in the near term (and in the context of this report) to draw conclusions from existing published literature about the possible effectiveness of treatments in Gulf War veterans. Development of a standard language for describing Gulf War veterans' health problems (including severity and temporal characteristics) would facilitate the conduct of treatment effectiveness trials.
The preceding discussion and analysis describes approaches used for assessing treatment effectiveness. Based on this analysis, to implement well-designed and valuable treatment effectiveness studies, the committee recommends that the VA:
use a hierarchy of evidence structure that includes effectiveness studies as well as efficacy studies for any future treatment guidelines it develops for symptoms or illnesses of Gulf War veterans;
design future studies of treatment effectiveness that include outcomes research and effectiveness randomized clinical trials; and
develop a standard language for describing Gulf War veterans' symptoms, including their severity and temporal patterns, and use that standard language in conducting treatment effectiveness studies and developing treatment guidelines.
Further, the committee recommends that those conducting ongoing cohort studies of veterans' health (e.g., the national VA study, the Iowa follow-up study on Gulf War veterans, and the Millennium Cohort Study being implemented by DoD) include collection of data on treatments and health-related quality of life.
The committee also recommends that current VA and DoD Gulf War registries be used as one way to identify patient samples and serve as a sampling frame for future treatment effectiveness studies.
(NOTE: The current information collected and analyzed in the Gulf War registries does not include basic data on treatment approaches, health status, or outcomes. Addition of such information to the registries would greatly increase their value as a basis for identifying patient samples for more in-depth study.)
Although treatment effectiveness studies have not been conducted for the population of Gulf War veterans, there are approaches to treatment that have been shown to be beneficial across diagnostic categories. The following chapter explores a patient-centered approach to care that can be used with all patients, regardless of diagnosis, but which may prove especially beneficial to those who are experiencing symptoms but have no identifiable diagnosis.