KEY SPEAKER THEMES
• Heterogeneity in treatment effects is likely to be ubiquitous, and the failure to detect it represents a failure of science.
• Hidden heterogeneity often leads to misleading results from randomized controlled trials (RCTs), but current approaches to subgroup analysis of clinical trials are inadequate and can be misleading.
• Pooling of data from multiple RCTs can identify effect heterogeneity, but subgroups can still be too small to generate statistically significant differences.
• Well-planned observational studies conducted to answer specific questions with data from large databases can identify heterogeneity in treatment effects and enable individualized recommendations.
• Individual patient differences are of crucial importance, and the study of heterogeneity over broad subgroups may not be useful in comparative effectiveness research.
• Algorithmic predictions generated from large observational data sets and then validated in confirmatory studies may be a promising way to guide clinical decision making.
• Methods to generate individual-level predictions from large observational studies must deal with the causal inference problems of such data. Newer instrumental variable methods are being developed to address these issues.
Moderator Richard Platt, chair of ambulatory care and prevention and chair of population medicine at Harvard University, introduced this session by noting that its focus on treatment subgroups was a fitting end to a day of discussions that highlighted the importance of sorting out treatment effects on subgroups in observational studies and randomized controlled trials (RCTs). Toward that end, David M. Kent, director of the Clinical and Translational Science Program at the Tufts University Sackler School of Graduate Biomedical Sciences, presented an overview on the detection of heterogeneity in treatment effects; Mark A. Hlatky discussed a specific example in which heterogeneity in treatment effects was observed in RCTs and observational studies; and Anirban Basu, associate professor and director of the Program in Health Economics and Outcomes Methodology at the University of Washington, spoke about the use of instrumental variables to identify heterogeneity in treatment effects. Mary E. Charlson, chief of clinical epidemiology and evaluative sciences research at Weill Cornell Medical College, and Mark R. Cullen, professor of medicine at the Stanford School of Medicine, commented on the presentations before the floor was opened for discussion.
The first concept that David M. Kent discussed was the fallacy of division: the error of making inferences about individuals or subgroups from aggregate results. He acknowledged, though, that evidence-based medicine, by necessity, is based largely on such inferences. “We take group data, we measure treatment effects, and then we make inferences to the individuals in that group,” he said. He explained that individual treatment effects are, in general, inherently unobservable because it is impossible to measure the outcome in an individual
patient simultaneously both on and off treatment. Instead, clinical trials study a group of patients that receive treatment and a matched group that does not, measure the outcomes in those two groups, and then determine the benefit by comparing the proportion of patients in each group with a particular outcome. The treatment effect summarizing the difference in outcomes between the groups is then described in a probabilistic or stochastic manner and applied back to individuals.
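The group-comparison logic described above reduces to simple arithmetic: each arm's outcome proportion is computed, and the contrast between the two proportions is reported as the average treatment effect. A minimal sketch, with hypothetical counts:

```python
# Illustrative sketch of summarizing a two-arm trial as a single
# probabilistic treatment effect, as described above.
# All counts are hypothetical.

def treatment_effect(events_treated, n_treated, events_control, n_control):
    """Return (risk_treated, risk_control, absolute_risk_reduction, relative_risk)."""
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    return risk_t, risk_c, risk_c - risk_t, risk_t / risk_c

# 1,000 patients per arm; 80 events on treatment vs. 100 on control.
risk_t, risk_c, arr, rr = treatment_effect(80, 1000, 100, 1000)
print(f"risk treated={risk_t:.2f}, risk control={risk_c:.2f}")
print(f"absolute risk reduction={arr:.2f}, relative risk={rr:.2f}")
```

It is this single averaged number, applied back to every individual in the group, that the fallacy-of-division argument warns about.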
Heterogeneity in treatment effects, in Kent’s mind, is ubiquitous, and the failure to detect it represents a failure of science. Although he conceded that this is not a universally accepted idea, he said that it is consistent with the available evidence. Failure to detect heterogeneity even in the presence of marked differences in treatment effects across individuals can occur for myriad reasons. For example, observable covariates may be totally unrelated to the causal determinants of variations in treatment effects, or the causal mechanisms may be so complex that it is difficult to distinguish them from statistical noise.
More potentially addressable problems include the limitations of the current analytical approach to subgroup analysis. For example, the statistical power to detect heterogeneity in the expected treatment effect is typically woefully inadequate when studies are powered only to detect an overall treatment benefit. Analytic failure can also result from a limitation of conventional subgroup analysis, what Kent calls the “one-variable-at-a-time approach.” This type of subgroup analysis—for example, of males versus females or diabetics versus nondiabetics—is convenient but artificial because real patients differ on multiple variables simultaneously. As a result, one-variable-at-a-time subgroups do not capture the full heterogeneity of patients, and treatment effect heterogeneity that would be detectable when combinations of variables are used (such as in multivariate risk models) can be missed.
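Kent’s point about power can be made concrete with a standard back-of-the-envelope calculation: under a normal approximation with two equal-sized subgroups, detecting a treatment-by-subgroup interaction as large as the overall effect requires roughly four times the sample size. A sketch, with illustrative numbers:

```python
# Back-of-the-envelope sketch of why conventional subgroup analyses are
# underpowered. With two equal-sized subgroups, the variance of an
# estimated treatment-by-subgroup interaction is roughly four times the
# variance of the overall treatment effect, so an interaction as large
# as the main effect needs roughly four times the sample to detect.
# Normal-outcome approximation; all numbers are illustrative.
from statistics import NormalDist

def n_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Sample size per arm for a two-sample z-test of a mean difference."""
    z = NormalDist().inv_cdf
    return 2 * ((z(1 - alpha / 2) + z(power)) * sd / effect) ** 2

n_main = n_per_arm(effect=0.5, sd=1.0)   # trial powered for the overall effect
n_interaction = 4 * n_main               # same-sized interaction effect
print(f"per arm, main effect: {n_main:.0f}; interaction: {n_interaction:.0f}")
```

The factor of 4 arises because the interaction estimate is a difference of two subgroup-specific treatment effects, each estimated on half the data, which quadruples its variance relative to the overall effect.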
Hidden heterogeneity often leads to misleading results from RCTs, said Kent in summary, but subgroup analysis of clinical trials is inadequate (because it is typically underpowered) and can also be misleading (because it is prone to spurious false-positive results). Despite voicing this negative outlook on treatment effect heterogeneity, Kent recommended three steps that could be taken to at least partially address some of these issues. First, investigators need to limit the number of hypothesis-testing subgroup analyses that they perform. One approach that may be used to do this is to explicitly specify which analyses are primary. Such analyses include those few analyses that are supported by strong prior evidence and that are clinically actionable. Any other subgroup analyses, said Kent, should be explicitly labeled as exploratory. Such analyses are meant not to inform clinical practice but to inform future research. The second step investigators
need to take is to increase the power of their studies, and third, they must replicate and confirm subgroup analyses when they do identify a meaningful heterogeneous treatment effect. The last two steps require reengineering of the clinical enterprise to enable much larger studies and might require a greater reliance on observational studies.
Each year, more than 1 million coronary revascularization procedures are performed worldwide, said Mark A. Hlatky, in explaining his interest in comparing two such procedures. Angioplasty, or percutaneous coronary intervention (PCI), is most often used for single-vessel disease, whereas the far more invasive coronary artery bypass grafting (CABG) is used for extensive triple-vessel disease. Either procedure is feasible for treating midseverity coronary artery disease, yet, despite the commonness of these procedures, their effects on mortality are uncertain. The two procedures have been compared in RCTs and observational studies, but these studies have focused on the general population and do not provide much guidance for specific patients. Hlatky noted that one of the early trials comparing the two procedures—the Bypass Angioplasty Revascularization Investigation trial in diabetics—did provide some evidence of a heterogeneous treatment effect. This was a controversial finding that was followed up in a number of studies, but all suffered from the concern that they were post hoc analyses that were, as he put it, “big fishing expeditions.”
If RCTs are the preferred approach to comparing PCI with CABG but the trials are large enough only to examine the main effect, one way to look for heterogeneity in treatment effects is to pool together several RCTs and test whether the answer lies in a larger sample size. Hlatky and his collaborators did just that, organizing a collaboration of 10 randomized trials of bypass surgery and angioplasty and convincing the investigators to share data, which he noted is more of a political than a technical challenge. In the end, the resulting dataset included almost 8,000 patients and 1,200 deaths, and the data did reveal some treatment effect heterogeneity, which Hlatky illustrated using a forest plot (see Figure 5-1).
He explained that in the youngest patients in the trials, PCI produced better outcomes, whereas bypass surgery produced better outcomes in the oldest patients. The data confirmed that diabetes was a strong modifier of risk and that age also modified comparative effectiveness. Additional subgroups showed evidence of heterogeneous treatment effects, but too few of the enrolled subjects fell into these subgroups to produce statistically significant differences. The latter finding is not surprising, he said, because the selection criteria for most RCTs limit generalization by excluding comorbidities. Another factor that worked against generalization was that these trials were run at large medical centers with highly skilled staff, and the results may not represent those obtained by physicians in other clinical settings.
FIGURE 5-1 Outcomes in subgroups from 10 randomized clinical trials comparing coronary artery bypass grafting (CABG) and percutaneous coronary intervention (PCI).
SOURCE: Hlatky et al., 2009.
To address some of these limitations, Hlatky and his collaborators examined observational data to see if they could replicate and extend the findings from their pooled RCT study (Hlatky et al., 2013). They used the 20 percent Medicare sample from 1992 to 2008 to identify patients who were 66 years of age or older (which gave them at least 1 year in which to document comorbidities), who received fee-for-service coverage (which yielded billing codes), and who underwent multivessel PCI or CABG. Hlatky and his colleagues used propensity score matching but also forced exact matches on the year in which patients received their treatment, whether they had diabetes, and their age within a year. The matches were forced on procedure year because of suspected secular effects in the outcomes related to the year that the procedure was done. Each arm of the study had 105,000 patients. Treatment-covariate
interactions were prespecified to produce relative differences, or hazard ratios, and absolute differences in terms of 5-year survival and the number of years of life added. The goal was to produce information that was actionable for patients.
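The matching scheme just described (propensity score matching with forced exact matches on procedure year, diabetes status, and age) can be sketched in miniature. The toy data, the greedy matcher, and the caliper below are all illustrative assumptions, not the method actually used in the Medicare analysis:

```python
# Toy sketch of propensity-score matching with exact ("forced") matches
# on year, diabetes status, and age. Simulated patients and a naive
# greedy 1:1 matcher; illustrative only.
import random

random.seed(0)

def make_patient(treated):
    return {
        "treated": treated,
        "year": random.randint(1992, 2008),
        "diabetes": random.random() < 0.3,
        "age": random.randint(66, 90),
        "ps": random.random(),  # stand-in for a fitted propensity score
    }

cabg = [make_patient(True) for _ in range(1000)]
pci = [make_patient(False) for _ in range(1000)]

def match(treated_group, control_group, caliper=0.05):
    """Greedy 1:1 match on propensity score within exact strata."""
    pairs, used = [], set()
    for t in treated_group:
        best, best_dist = None, caliper
        for j, c in enumerate(control_group):
            if j in used:
                continue
            # forced exact-match constraints: same year, diabetes status, age
            if (c["year"], c["diabetes"], c["age"]) != (t["year"], t["diabetes"], t["age"]):
                continue
            dist = abs(c["ps"] - t["ps"])
            if dist < best_dist:
                best, best_dist = j, dist
        if best is not None:
            used.add(best)
            pairs.append((t, control_group[best]))
    return pairs

pairs = match(cabg, pci)
# Every matched pair agrees exactly on the forced variables.
assert all(t["year"] == c["year"] and t["diabetes"] == c["diabetes"]
           and t["age"] == c["age"] for t, c in pairs)
print(f"matched {len(pairs)} of {len(cabg)} treated patients")
```

The forced exact matches guarantee balance on the variables the analysts most distrust, while the propensity score handles the remaining observed covariates.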
The main findings of this study were that those who underwent CABG had lower mortality overall and a higher 5-year survival rate. Diabetes, heart failure, peripheral artery disease, and tobacco use all produced a significant modification of the treatment effect; and treatment effectiveness varied substantially. In fact, the analysis predicted that 41 percent of the population would have better survival if they underwent angioplasty, even though the overall result showed that CABG was superior in terms of mortality. This analysis demonstrated that a substantial heterogeneity in treatment effect could affect people’s decision making, said Hlatky, and that seven patient variables could be used together to make individual predictions. He and his collaborators used these findings to create a coronary heart disease procedure calculator that could input a variety of patient characteristics—age, gender, tobacco use, prior hospitalization for a heart attack, diabetes, peripheral artery disease, and heart failure—and make an individual projection on which procedure would produce a higher 5-year survival, the range of the increased risk of mortality in similar patients, and the benefit in terms of longer life expectancy.
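A calculator of the kind Hlatky described might be structured as follows. To be clear, this is a purely hypothetical stand-in: the seven input variables match those listed above, but the coefficients and the scoring rule are invented for illustration and bear no relation to the published model:

```python
# Purely illustrative stand-in for the kind of procedure calculator
# described above. The seven variables mirror those in the text, but the
# coefficients are INVENTED for illustration and have no clinical meaning.

# Hypothetical contributions favoring CABG (positive) or PCI (negative).
COEFS = {
    "age_over_75": -0.10,
    "male": 0.02,
    "tobacco_use": 0.15,
    "prior_mi_hospitalization": 0.08,
    "diabetes": 0.25,
    "peripheral_artery_disease": 0.12,
    "heart_failure": 0.20,
}

def recommend(patient):
    """Sum the (made-up) contributions and name the favored procedure."""
    score = sum(COEFS[k] for k, v in patient.items() if v)
    return "CABG" if score > 0 else "PCI"

patient = {
    "age_over_75": False,
    "male": True,
    "tobacco_use": False,
    "prior_mi_hospitalization": False,
    "diabetes": True,
    "peripheral_artery_disease": False,
    "heart_failure": False,
}
print(recommend(patient))
```

The point of the design is that the prediction depends on the joint pattern of all seven inputs, not on any single subgroup label.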
Hlatky closed his presentation by reflecting on the pooled RCT data and observational analyses he had described, saying that he was reassured that both found evidence of heterogeneity of treatment effectiveness. Finally, he called for refocusing efforts from estimating overall effects toward better detecting and understanding heterogeneity.
Anirban Basu noted that the focus of his work is on effect heterogeneity and not response heterogeneity, which he said is an important distinction to make. “We are really interested in how much the incremental outcome between two treatments varies across people,” he explained. As a recap of the day’s earlier presentation on instrumental variables, he reminded the workshop participants that instrumental variables are those that influence treatment choices but that are independent of factors that determine potential outcomes, that they are viewed as natural randomizers, and that they can be used to establish the causal effects of a treatment by accounting for both overt and hidden biases.
Before discussing his own work, he described one classic example of the use of instrumental variables. In that study, the investigators used Medicare data to examine the effect of invasive cardiac treatment on long-term
mortality rates (Stukel et al., 2007). The observed confounders were those typically found in Medicare data—age, sex, race, socioeconomic status, comorbidities, and inpatient treatment—and the unobserved confounder was the patient’s underlying risk. Basu explained that the investigators were concerned that only low-risk patients were given invasive cardiac treatment, which by itself could produce better outcomes among invasively treated patients. When they adjusted for observed differences in risk through the use of the propensity score or regression analysis, they found a huge positive effect of invasive cardiac treatment. However, when they repeated the analysis using the regional catheterization rate as the instrumental variable, thereby accounting for the selection of lower-risk patients into invasive cardiac treatment, the effect was much less substantial.
Building on that example, Basu discussed one approach to interpreting the results of an instrumental variable analysis when there is treatment effect heterogeneity. In the presence of heterogeneity in the treatment effect, there is no reason to believe that the causal effect estimated from an observational study should be the same as the causal effect estimated from an RCT, that the average treatment effect is a relevant metric for evaluation, or that the effect estimated with the instrumental variable has a relevant interpretation. The first step in addressing these issues is to develop a choice model that starts with the assumption that the choice of treatment is based on an underlying latent index; the latent index is a function of observed confounders and instrumental variables, as well as of unobserved confounders and stochastic error. If the latent index is greater than zero, people choose to get treatment, and if it is less than zero, they do not. This notion of an underlying latent index is pervasive today across choice models used in both statistics and econometrics, explained Basu. He also said that this type of model provides a good picture of who is and who is not selecting treatment and of how a treatment effect can be identified when instrumental variables exist.
By use of this simple model construct, instrumental variable methods estimate the treatment effect by comparing a group of people with one level of the instrumental variable with another group with a different level, with all observed characteristics held constant. The difference in outcomes between those who choose treatment and those who do not then yields an estimate of the treatment effect only for those people whose treatment choice changes with the value of the instrument. However, if treatment effects are heterogeneous, then this instrumental variable treatment effect is conditional on the unobserved level of confounders, a situation known as “essential heterogeneity” (Heckman and Vytlacil, 1999). A newer method, called the “local instrumental variable,” provides a way around essential heterogeneity. What the local instrumental variable does, explained Basu, is help identify the marginal treatment effect, that is, the treatment effect for the person who is at the margin of choice defined by the choice model.
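The latent-index construction and the complier-specific nature of an instrumental variable estimate under essential heterogeneity can be illustrated with a small simulation. Every functional form, coefficient, and instrument here is an invented assumption for illustration, not Basu’s actual model:

```python
# Simulation sketch of a latent-index choice model with "essential
# heterogeneity." All numbers and functional forms are illustrative.
import random

random.seed(42)
n = 200_000

z = [random.random() < 0.5 for _ in range(n)]   # binary instrument
u = [random.gauss(0, 1) for _ in range(n)]      # unobserved confounder
e = [random.gauss(0, 1) for _ in range(n)]      # stochastic error

# Latent index: treatment is chosen when the index is above zero.
# The instrument shifts the index; low-u patients select into treatment.
d = [(0.5 * zi - ui + ei) > 0 for zi, ui, ei in zip(z, u, e)]

# Outcome with a heterogeneous effect (1 + u) correlated with selection,
# plus direct confounding through u.
y = [(1 + ui) * di + 2 * ui + random.gauss(0, 1) for ui, di in zip(u, d)]

def mean(xs):
    return sum(xs) / len(xs)

# Naive treated-vs-untreated contrast: badly biased by the confounder.
naive = (mean([yi for yi, di in zip(y, d) if di])
         - mean([yi for yi, di in zip(y, d) if not di]))

# Wald instrumental-variable estimator: recovers the effect only for the
# people at the margin of choice, whose treatment flips with z.
num = (mean([yi for yi, zi in zip(y, z) if zi])
       - mean([yi for yi, zi in zip(y, z) if not zi]))
den = (mean([di for di, zi in zip(d, z) if zi])
       - mean([di for di, zi in zip(d, z) if not zi]))
wald = num / den

print(f"naive contrast: {naive:.2f}; IV (marginal/complier) effect: {wald:.2f}")
```

In this toy setup the naive contrast is badly biased (here it is even negative), while the Wald estimate recovers the effect for the subpopulation at the margin of choice, which is precisely the conditional, instrument-dependent quantity that local instrumental variable methods then generalize into marginal treatment effects.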
Once the marginal treatment effect is estimated, it is possible to determine a marginal treatment effect conditioned on various levels of observed and unobserved characteristics and to then aggregate across various populations to determine the average treatment effect and to extend that to a person-centered treatment effect. To illustrate one application of this approach, Basu discussed how he applied it using Medicaid data to compare the treatment effect of older, generic antipsychotic drugs and newer so-called atypical antipsychotics. The original Clinical Antipsychotic Trials of Intervention Effectiveness study funded by the National Institute of Mental Health found similar effectiveness between the two classes of drugs over an 18-month period (Lieberman et al., 2005). As a result, 40 percent of state-run Medicaid programs have instituted prior authorization restrictions on some atypical antipsychotic drugs, which may have played a role in the waning commitment by pharmaceutical companies to develop new drugs in the area of neuroscience.
Basu’s analysis of Medicaid data showed tremendous treatment effect heterogeneity that could not be explained by any single covariate. This analysis showed that when patients received optimal therapy, in comparison with the status quo, the average number of hospitalizations in the 12 months following the initiation of atypical antipsychotic therapy was predicted to be nearly 28 percent lower. He concluded his remarks by noting that differences between individual patients are of crucial importance and that the study of heterogeneity over broad subgroups may not be useful in comparative effectiveness research. He also noted that the use of algorithmic predictions generated from large sets of data from observational studies and then validated in confirmatory studies may be a promising way to guide clinical decision making.
Heterogeneity is something that the field needs to stop running from, said Mary E. Charlson in commenting on the two presentations. Investigators have become too focused on standardization and uniformity, but this focus “leads us away from really doing the investigation of treatment heterogeneity response that I think can inform the next series of questions about what is working for patients and what is not.” For example, she suggested that a closer examination of the baseline variables in the studies discussed by Hlatky might reveal that important aspects of the patients’ experiences during and after the CABG or PCI procedures could be driving overall differences in mortality. She indicated that her work has found that
patients who undergo CABG make important secondary lifestyle changes that likely have an impact on mortality, whereas the vast majority of PCI patients do not. She said that investigation of drivers of variability or heterogeneity in the treatment response helps expand the ability to help patients and improve the prognostic armamentarium.
Charlson called for the field to look at variables beyond those typically studied, such as depression and social isolation, and to expand the ability to collect data from patients directly through the use of modern technologies and techniques, such as crowd sourcing. She noted that patients are already sharing their experiences on social media sites and believed that collection of those stories, along with some quantitative data, could shape how the heterogeneity in the treatment response is studied. The capture of this kind of information is a major opportunity to learn more about actual patient experiences and outcomes and to customize how to better inform patients about which treatments are right for them.
Mark R. Cullen, whose background is in the study of the upstream causes of morbidity and mortality in the workplace, said that the workplace is an interesting and challenging environment in which to study heterogeneity in treatment effects for at least three reasons. First, all studies must be observational, because RCTs would be impractical and unethical in the workplace environment. Second, extraordinary and conspicuous selection pressures exist in the populations coming into and going out of the workplace. The quantification and management of these pressures are methodologically problematic, but such pressures are highly visible in ways that confounders at the bedside are not. Third, substantial heterogeneity exists in the way in which people respond to adverse physical elements in the workplace environment.
With that as background, Cullen said that it was surprising that the field has not developed a simplifying rule for model making that breaks down questions into those that involve a treatment choice at a single point in time. From that choice made at a single point in time, it would then be possible to develop a marginal structural model to deal with the changes that occur over the course of observation. He agreed with Miguel A. Hernán’s view, presented in the prior session, that observational research and RCTs have a great deal in common and that one of the major issues in working in the observational domain is where and how to use instrumental variables or other strategies to effectively randomize the environment in which randomization is not occurring.
Cullen noted that the field needs a new way to look at studies of large populations. “We have bought in to a very stochastic way of looking at questions of whether something has efficacy that is based on a certain set of assumptions,” he said. “We are averaging over all kinds of different effects and are imagining that those effects are in essence random.” In contrast, he
noted, “many of our colleagues not represented here do not take that view at all, but live in a very deterministic world in which they would imagine the only thing interesting about heterogeneity is how much we do or do not know about the underlying biology.” As an example, he cited the situation in the cancer world, in which breast cancer is no longer treated as one heterogeneous disease but as distinctly different diseases based on the underlying biology. This will play out in a way in which these biological discriminators are not likely to fall into the categories now used to explain heterogeneity, such as race, sex, and age. He thought that one area in which this type of heterogeneity could begin to be evaluated would be in individuals with unique, preexisting diseases.
He concluded his remarks by discussing how the combined use of a propensity score approach and an instrumental variable might be useful for examining heterogeneity. In a study of neonatal intensive care units at two hospitals, one of Cullen’s students used proximity to the unit as the instrumental variable and formed matched pairs that were identical on everything observable except distance from the neonatal intensive care unit. From this subset of individuals, the student was able to demonstrate an enormous mortality benefit for infants at one hospital relative to the other. The student is now examining subgroups to see whether differences exist with sufficient power to draw conclusions about particular subsets of high-risk mothers or high-risk infants. The beauty of this approach, said Cullen, is the simplicity of the analysis.
Platt said he was impressed that all five speakers were to some degree sanguine about the prospects for doing subgroup analysis. A workshop participant then remarked that to him, the idea of creating risk models to create subgroups or models that identify those who may have different absolute reductions in risk from a treatment is a good one but that it is not the right approach for finding factors that could modify the treatment effect, particularly biological risk factors. Instead, the use of actual biological modifiers rather than risk models should be the more appropriate approach. He encouraged the field to think hard about how the biology works and create subgroups that are not based on risk models but that are based on models of how the biology might be modified by those factors.
Another participant said that he would like to see forest plots created according to absolute rather than relative risk reduction. He also commented that many physicians ignore the findings of RCTs because of the population that was tested. He cited two examples: gynecologists who say that the findings from the Women’s Health Initiative study on hormone replacement therapy do not apply to their patients who are younger than
age 60 years and orthopedic surgeons who still perform vertebroplasty, despite the data from a high-quality trial showing that this procedure does not benefit patients.
A participant commented that the talks so far failed to make a connection between observational studies and what they can bring to a learning health care system and suggested that the Institute of Medicine hold another meeting to deal explicitly with that connection. The same participant also wondered how to take the heterogeneity of the effectiveness of an intervention being measured and use that to study what are inherently complex, multiattribute decisions. He questioned, too, whether many examples of a treatment demonstrating efficacy in an RCT but not being effective when used correctly really exist.
The same participant wondered if data heterogeneity is being swamped by the heterogeneity of medical practice and patient preferences. Basu and Charlson thought that might be the case but that good opportunities to learn from those individual patient experiences exist and that this knowledge may inform a learning health care system. Charlson reiterated her earlier proposal that the field needs to develop methods to more systematically capture individual patient experiences beyond those recorded in electronic health records. Sheldon Greenfield agreed with this proposal because he believes that many of the variables related to heterogeneity that are now considered unobservable would be observable if both patients and physicians were queried more systematically. Hlatky agreed with these recommendations but cautioned that the solution lies not simply in more data but in more good-quality data.
Marc L. Berger asked the panelists if there were opportunities for examining heterogeneity in Phase II studies and generating hypotheses that could be examined in Phase III studies as a way of improving the productivity of the drug development pipeline. Basu replied that the only way to do that is to greatly expand the number of subgroups—and the budget—for Phase II trials. Charlson said that the field needs to look more carefully at adaptive clinical trial designs. Joe V. Selby noted that this would be a good topic for another workshop.
In response to a question about whether the subgroups in his studies were based at all on biology, Hlatky responded that they were not and that it would be interesting to look for biological correlates to the subgroup classifiers. Cullen pointed out that the subgroup classifiers that Hlatky and his colleagues identified are the ones that surgeons are most likely to use to make real therapeutic decisions when facing a patient. Hlatky added that in the case of cardiac surgery, certain psychosocial factors that are not biological play major roles in the outcome.
Steven N. Goodman remarked that he had not heard anyone address the subject of multiplicity, which he said will become important with the
advent of massive databases containing biological data from genomics, proteomics, and other -omics and data-mining tools that will generate what he said will effectively be an infinite number of subgroups. “I am definitely not a zealot about correcting for the multiplicity,” he said, “but it reflects in some ways indirectly our lack of understanding of biologic processes [in our] explanations. It is something we cannot ignore.”
Heckman, J. J., and E. J. Vytlacil. 1999. Local instrumental variables and latent variable models for identifying and bounding treatment effects. Proceedings of the National Academy of Sciences of the United States of America 96(8):4730–4734.
Hlatky, M. A., D. B. Boothroyd, D. M. Bravata, E. Boersma, J. Booth, M. M. Brooks, D. Carrié, T. C. Clayton, N. Danchin, M. Flather, C. W. Hamm, W. A. Hueb, J. Kähler, S. F. Kelsey, S. B. King, A. S. Kosinski, N. Lopes, K. M. McDonald, A. Rodriguez, P. Serruys, U. Sigwart, R. H. Stables, D. K. Owens, and S. J. Pocock. 2009. Coronary artery bypass surgery compared with percutaneous coronary interventions for multivessel disease: A collaborative analysis of individual patient data from ten randomised trials. Lancet 373(9670):1190–1197.
Hlatky, M. A., D. B. Boothroyd, L. Baker, D. S. Kazi, M. D. Solomon, T. I. Chang, D. Shilane, and A. S. Go. 2013. Comparative effectiveness of multivessel coronary bypass surgery and multivessel percutaneous coronary intervention: A cohort study. Annals of Internal Medicine 158(10):727–734.
Lieberman, J. A., T. S. Stroup, J. P. McEvoy, M. S. Swartz, R. A. Rosenheck, D. O. Perkins, R. S. E. Keefe, S. M. Davis, C. E. Davis, B. D. Lebowitz, J. Severe, and J. K. Hsiao for the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) Investigators. 2005. Effectiveness of antipsychotic drugs in patients with chronic schizophrenia. New England Journal of Medicine 353:1209–1223.
Stukel, T. A., E. S. Fisher, D. E. Wennberg, D. A. Alter, D. J. Gottlieb, and M. J. Vermeulen. 2007. Analysis of observational studies in the presence of treatment selection bias: Effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. Journal of the American Medical Association 297(3):278–285.