KEY SPEAKER THEMES
• Observational studies can be powerful tools for generalizing the results of randomized controlled trials (RCTs), but only if they are well designed to answer specific clinical questions.
• To maximize the ability of observational studies to generalize the results of RCTs, create a three-dimensional informatics infrastructure comprising electronic health records; high-quality, granular, and detailed registries; and patient-reported outcomes.
• Ask the right questions so that comparisons of RCTs and observational studies are done with the right analytical tools.
• The terms “effectiveness” and “efficacy” are too vague to define the questions of interest.
• For comparative effectiveness research, RCTs will need to be analyzed as though they were the results of observational studies.
• When a synthesis across different types of data is being conducted, it is important to examine all of the available evidence to ensure that any inferences are consistent, explainable, and applicable to the identified population of interest.
• Observational studies can be used to generalize from RCTs, but such a generalization should rely on an analytical framework that explicitly describes the parameters estimable from each type of data and the relationships among these parameters.
Randomized controlled trials (RCTs) use a rigorous experimental design to evaluate the average overall benefit and risk of a specific therapy when it is used by a narrowly defined, select group of patients treated under carefully controlled conditions. Because most RCT protocols limit enrollment eligibility to select groups of individuals, application of the findings from an RCT to broader populations can be problematic, particularly if the treatment effect differs for those who are not represented by the population chosen for an RCT.
Robert M. Califf, vice chancellor for clinical research at the Duke University Medical Center, started the session by reviewing the challenges that the medical community faces in translating the results of RCTs to broader populations. Miguel A. Hernán, professor of epidemiology at Harvard University, showed how the utility of both RCTs and observational studies can be increased when the right clinical questions are asked before the studies are designed, and Eloise E. Kaizar, associate professor in the Department of Statistics at The Ohio State University, considered whether it is possible to generalize the findings of studies designed to evaluate efficacy to inform effectiveness. William S. Weintraub, the John H. Ammon Chair of Cardiology at Christiana Care Health Services, and Constantine Frangakis, professor in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, commented on the three presentations and joined the panel for an open discussion with the workshop attendees.
To start the session, Robert M. Califf provided a framework for thinking about the issue of efficacy and effectiveness in a broader population. The first question that must be asked is, Does the therapy work at all, and can it be distinguished from placebo or the standard of care? The best way to answer this question is to conduct an RCT. Once efficacy is established, clinical studies are needed to determine how the treatment should be used,
the types of patients who should use it, and how long it should be used. The traditional approach to those issues involved subgroup analysis, an approach that Califf says is flawed because of the small amount of data for most subgroups in most clinical trials.
Califf commented on how the standard of practice in acute cardiac care came to be established in the United States, noting that all the foundational studies were conducted by entering patients into clinical trials as soon as they entered the hospital emergency room. This approach led to tens of thousands of people being randomized into clinical trials, and it generated findings that were clear-cut, making it easy to develop clinical practice guidelines.
What has become clear is that these guidelines are effective only when they are followed. To support this point, Califf discussed large national studies showing that for every 10 percent improvement in adherence to the practices that worked in RCTs, mortality decreased by 11 percent. Similar results have been seen in heart failure patients, among whom each 10 percent improvement in adherence to the composite care recommended by clinical guidelines reduced mortality by 13 percent.
In contrast, erythropoietin prescribing guidelines for kidney dialysis patients were determined from dozens of what Califf called “shoddy” observational studies. These studies purported to show that high-dose erythropoietin benefited patients with chronic kidney disease who were on dialysis, and the medical community went along with those recommendations. When Califf and others conducted RCTs of erythropoietin, the results were distressing: patients on high-dose erythropoietin experienced higher mortality and worse clinical outcomes than patients receiving low doses of the drug. The medical community, said Califf, had been misled by “dozens of mutually reinforcing observational studies.”
In his opinion, the situation with erythropoietin is likely to become more common in everyday clinical practice unless the health care community takes a wholly systematic approach to evaluation of the results of different types of trials that are used to demonstrate efficacy and effectiveness. The challenge is to move from a situation in which validity and generalizability are low to one in which both are high.
He then discussed another example of a trial in which the average result was not broadly applicable. In this case, a worldwide trial compared the efficacy of the well-established anticlotting agent clopidogrel with that of a new drug, ticagrelor (Wallentin et al., 2009). The primary endpoint of this study was the time to cardiovascular death, myocardial infarction, or stroke. The results showed that ticagrelor lowered the cumulative incidence of major cardiac events and reduced total mortality, both in the worldwide population and in every predefined subgroup except one: patients in North America. A sophisticated statistical analysis of observational data suggested that the reason for this difference was that North Americans take more aspirin than everyone else.
In his closing comments, Califf said the answer to the problem of how to generalize from RCTs lies in creation of a fundamental informatics infrastructure that forms a three-dimensional matrix. Along one dimension are the data from every American’s electronic health record, whereas the second dimension consists of high-quality, granular, and detailed registries created by every relevant patient advocacy group and professional society. The third dimension is patient-reported outcomes recorded and reported by use of the ubiquitous cell phone.
Miguel A. Hernán began his presentation by noting that although observational studies face important challenges in comparative effectiveness research, they do have some strengths. “They are faster, less expensive, [and] have fewer ethical problems, and the results may be more transportable to other populations,” said Hernán. Observational studies are more transportable, he explained, because the patients in such studies are more similar to real-life patients and they are often followed for longer periods of time. In addition, the treatments being compared are implemented under more realistic settings in an observational study. What would happen, Hernán asked, “if randomized trials and observational studies had the same patients, the same follow-up, and the same type of interventions? Would they be answering the same question?” He would argue that the answer is “no” because the two types of studies are not typically analyzed in the same way.
“We usually consider a randomized trial and an observational follow-up study as two different types of follow-up designs, but they may actually be very similar, except that the randomized trial treatment is randomized at baseline,” said Hernán. “We can think of randomized clinical trials as follow-up studies with baseline randomization.” If that is the case, he argued, it might be more useful to classify studies according to whether they had baseline randomization, without automatic assignment of greater validity to follow-up studies with baseline randomization. For example, in large simple trials and so-called pragmatic trials, the benefits of baseline randomization can be overshadowed by high rates of noncompliance and loss of patients to follow-up, and typically, data are not collected to adjust for these biases.
Regardless, the analysis of results from studies with these two types of designs differs, he explained. Most randomized trials use an intent-to-treat (ITT) analysis, whereas most observational studies use an as-treated analysis, which enables adjustment for baseline confounding. Both study designs, however, require adjustment for postbaseline confounding and for the selection bias that may result from patients lost to follow-up after baseline, neither of which is controlled by randomization.
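The contrast Hernán draws between grouping patients by baseline assignment (ITT) and grouping them by the treatment actually received (as-treated) can be illustrated with a small simulation. The sketch below is not from the workshop; the probabilities, the "frail" prognostic factor, and all variable names are invented. It shows how a factor that drives both treatment discontinuation and outcomes leaves the ITT contrast diluted but unconfounded, while the as-treated contrast absorbs the confounding.

```python
# Simulated illustration (hypothetical numbers throughout): ITT vs. as-treated.
import random

random.seed(0)

n = 10_000
patients = []
for _ in range(n):
    assigned = random.random() < 0.5        # baseline randomization
    frail = random.random() < 0.3           # unmeasured prognostic factor
    # Frail patients assigned to treatment often stop taking it, so the
    # treatment actually received is entangled with frailty.
    received = assigned and (not frail or random.random() < 0.4)
    # Treatment truly lowers event risk by 10 points; frailty raises it.
    risk = 0.30 - 0.10 * received + 0.25 * frail
    event = random.random() < risk
    patients.append((assigned, received, event))

def event_rate(rows):
    return sum(event for _, _, event in rows) / len(rows)

# ITT: compare by assignment, regardless of what was actually taken.
itt_effect = (event_rate([p for p in patients if p[0]])
              - event_rate([p for p in patients if not p[0]]))
# As-treated: compare by treatment received; frailty now confounds the contrast.
as_treated_effect = (event_rate([p for p in patients if p[1]])
                     - event_rate([p for p in patients if not p[1]]))
```

On this simulated population the ITT difference is smaller in magnitude than the true 10-point benefit, diluted by nonadherence, while the as-treated difference overstates the benefit because the untreated group is disproportionately frail.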
The purpose of the ITT design is to identify the effect of being assigned to a treatment independently of what happens after the baseline. The goal is to compare those who are assigned to Treatment A and continue with follow-up until the end of the study with those who are assigned to Treatment B and continue with follow-up until the end of the study. The analysis may require adjustment for selection bias because of differential loss to follow-up, and that adjustment requires information on postbaseline confounders.
Hernán added that follow-up studies with baseline randomization may also want to examine the effect of the treatment as specified in the study protocol. In this case, the goal is to determine the effect of Treatment A over the entire course of the study, unless the patient experiences toxic effects, versus Treatment B over the entire course of the study, unless the patient experiences toxic effects. A follow-up study with baseline randomization can also aim to quantify the effect of some treatment or some form of that treatment and the effect of that treatment received in some other way that is not specified in the protocol. Again, what happens after the baseline is not controlled by randomization in either of these study designs and an adjustment for time-varying confounding and selection bias will need to be made. However, in most reported studies, said Hernán, investigators do not adjust for time-varying, postbaseline variables.
Turning to the subject of efficacy and effectiveness, Hernán said that these terms are probably useful in simple settings with short-term interventions, but they become ambiguous in complex settings with sustained interventions over long periods. Rather, he said, what would be more informative is an explicit definition of the interventions that define the causal effect of interest. He noted that an ITT effect does not necessarily measure effectiveness in the real world, whereas a per protocol effect may measure effectiveness but not, in general, efficacy. What would be useful, he said, would be to define the observational analogs of ITT and per protocol effects. Having such definitions would provide an idea of what questions can be generalized. A good start toward such definitions, he said, are the new user designs that aim to estimate ITT effects in observational studies.
Hernán briefly reviewed the methods that he and his colleagues used to reanalyze the data on the effects of hormone therapy on the risk of heart disease (Hernán et al., 2008). Most observational studies conducted in the 1980s and 1990s had found that women currently on hormone replacement therapy had a 30 percent lower risk of developing heart disease than did women who were not on hormone replacement therapy (Stampfer and Colditz, 1991). Then, the Women’s Health Initiative RCT found that women initiating hormone replacement therapy had a 20 percent increased
risk of developing heart disease compared with the risk for those who were given placebo (Manson et al., 2003). Hernán explained that these studies were asking two very different questions and that the results were not comparable. The question asked by the randomized trial was, what is the heart disease risk in women assigned to initiation of hormone therapy compared with women assigned to no initiation of hormone therapy? The observational study asked, what is the heart disease risk in women who are currently taking hormone therapy compared with women who are not? When he and his colleagues reanalyzed the observational data to estimate the analog of the ITT effect measured in the RCT, they found that risk was elevated in the first 2 years of therapy but that there was no elevated risk in women who were within 10 years of menopause.
Although this reanalysis did not produce exactly the same result, the discrepancy was far smaller because both analyses were asking the same ITT question, which compares groups based on their baseline randomization. However, one problem with the ITT comparison was that close to 40 percent of the women did not comply with the protocol. When these data were reanalyzed to identify the per protocol effect, which is based on whether participants fulfilled the protocol, the results from the RCT and observational studies were also similar.
Hernán then discussed an example of analysis of the findings of an RCT as if it were an observational study (Hernán et al., 2008). In this case, the data came from the Women’s Health Initiative RCT, and the reanalysis attempted to estimate a per protocol effect by using inverse probability weighting and by taking advantage of the data that were collected after randomization. When the data were analyzed in this manner, the per protocol hazard ratio for breast cancer was 1.7, whereas the intent-to-treat hazard ratio was 1.2. Hernán characterized the difference as large and significant for a woman who is considering whether to go on hormone replacement therapy. As Hernán put it, the relevant question for a woman planning to take the therapy is, what is the effect of hormone replacement therapy in women who actually complied with the therapy, not what is the effect of hormone replacement therapy in women who enrolled in a clinical trial?
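The reanalysis Hernán describes rests on inverse probability weighting. The single-time-point simulation below uses invented data (the actual Women’s Health Initiative reanalysis applies time-varying weights) and only sketches the mechanics: simply dropping nonadherers biases the per protocol contrast when a postbaseline factor drives both adherence and risk, whereas weighting each adherer by the inverse of his or her adherence probability restores the randomized comparison.

```python
# Simulated sketch of an inverse-probability-weighted per protocol estimate.
# All probabilities and names are hypothetical; this is a one-step
# simplification of the time-varying method used in practice.
import random

random.seed(1)

TRUE_EFFECT = 0.10          # treatment truly raises event risk by 10 points
n = 50_000
rows = []
for _ in range(n):
    arm = random.random() < 0.5             # randomized at baseline
    sick = random.random() < 0.4            # postbaseline prognostic factor
    # Sick patients on treatment tend to stop it, so adherers in the treated
    # arm are healthier than adherers in the control arm.
    p_adhere = 0.4 if (arm and sick) else 0.9
    adhere = random.random() < p_adhere
    event = random.random() < 0.20 + TRUE_EFFECT * arm + 0.30 * sick
    rows.append((arm, sick, adhere, event))

def per_protocol_diff(weight_fn):
    """Weighted event-rate difference among adherers, treated minus control."""
    rates = {}
    for a in (True, False):
        pairs = [(weight_fn(arm, sick), event)
                 for arm, sick, adhere, event in rows if arm == a and adhere]
        rates[a] = sum(w * e for w, e in pairs) / sum(w for w, _ in pairs)
    return rates[True] - rates[False]

naive_diff = per_protocol_diff(lambda arm, sick: 1.0)    # drop nonadherers only
ipw_diff = per_protocol_diff(                            # reweight the adherers
    lambda arm, sick: 1.0 / (0.4 if (arm and sick) else 0.9))
```

The naive contrast understates the harm because the healthiest patients dominate the treated adherers; the weighted contrast recovers an estimate close to the true 10-point difference.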
In conclusion, Hernán said that the question of interest must be stated clearly for both RCTs and observational studies, specifying whether the goal is to estimate an ITT effect, a per protocol effect, or some other effect. Once the question is in hand, the analysis of RCTs and observational studies should be the same, except for the adjustment for baseline confounding. He also reiterated his earlier comment that the terms “effectiveness” and “efficacy” are too vague to define the questions of interest.
After agreeing with Hernán that efficacy and effectiveness are indeed vague concepts in terms of formulating research questions, Eloise E. Kaizar said that these terms are useful for classifying the types of effects that investigators should think about when designing their analyses. In the development setting, the term “efficacy” helps define the best population of individuals to be treated to design the most cost-efficient trial, to have the greatest chance of showing a positive treatment effect, if one exists, and to avoid recruiting people who might be harmed by the treatment. In the policy-setting arena, thinking about effectiveness brings a focus on the population or subpopulation that needs to be studied.
She then turned to the question at hand: is it possible to generalize information from studies designed to evaluate efficacy to inform effectiveness? She first defined two populations: the target population and the trial population. A target population is the population of all individuals for whom a treatment may be considered for its intended purpose, whereas a trial population is a theoretical population that consists of all individuals who would be eligible to enroll in an RCT. Although generalizing to the trial population from an RCT analysis is straightforward, the key is to define or model how the trial population relates to the target population.
These two populations can be considered in relation to each other in a number of ways, she continued. In the best-case scenario, the trial population is a simple random sample from the target population. If this is the case, the distribution of baseline variables would be identical between the two populations and the distribution of outcome variables would be logically related. Comparison of the distribution of baseline variables in trial participants and a representative sample of the target population from administrative or survey sources could identify discrepancies or evidence against the use of a simple random sample from the target population. For example, in a study of suicidality associated with antidepressant use, the RCTs suggested that taking antidepressants was associated with increased rates of suicidality among adolescents (Greenhouse et al., 2008). She and her colleagues thought it reasonable to assume that a greater proportion of subjects enrolled in the trials would be taking antidepressants than in the population at large, so they should observe a higher rate of suicidality in the trials than in the observational data. That was not the case, however, providing evidence that in this case they could not treat the trial population as a simple random sample from the target population.
The next and more complicated way to relate the two populations is to think of the trial population as a weighted sample of the target population. The idea is that the trial population samples from all parts of the target population but that the distribution of attributes differs from that in the target population. Reweighting schemes include probability sampling methods from the survey sampling research community, such as poststratification and propensity-based standardization. However, patients are not recruited into trials as a weighted sample, given the number of inclusion and exclusion criteria imposed on the trial population, and studies have consistently shown, Kaizar noted, that eligibility criteria typically cause RCTs to include at most half of the target population. What is not known, though, is whether these exclusions are important for generalization from the results of RCTs.
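Kaizar's weighted-sample framing can be made concrete with a toy poststratification calculation. The strata, shares, and effect sizes below are entirely hypothetical, and a real analysis would standardize over many covariates (for example, with propensity-based weights) rather than a single stratifying variable.

```python
# Hypothetical poststratification: reweight stratum-specific trial effects
# by the stratum mix of the target population.
trial = {               # stratum -> (share of trial, estimated risk difference)
    "younger": (0.8, -0.05),
    "older":   (0.2, -0.01),
}
target_share = {"younger": 0.4, "older": 0.6}   # mix in the target population

trial_avg = sum(share * eff for share, eff in trial.values())
target_avg = sum(target_share[s] * eff for s, (_, eff) in trial.items())
```

Because the trial overrepresents the stratum with the larger benefit, the naive trial average (-0.042) overstates the benefit expected under the target population's mix (-0.026).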
The bottom line, she said, is that once an analysis moves to a population outside of the trial population, it becomes an extrapolation, which then requires additional steps to determine whether the extrapolation is reasonable. For example, a sensitivity analysis can estimate how large the effect size in the excluded subpopulation would need to be for the inference from the trial population to change for the target population. If that effect size is implausibly large, then the extrapolation is likely to be reasonable. Another approach, which was discussed by earlier speakers, is to compare data from an RCT with data from parallel observational studies. One way to do such a comparison, said Kaizar, is to apply the exclusion criteria to data representative of the target population through the use of methods based on a cross-design synthesis framework, which combines results from studies with complementary designs (U.S. General Accounting Office, 1992). The idea here, she explained, is to stratify the target population into those represented in the trial and those not represented in the trial. This approach assumes that no residual confounding exists within subpopulations or that residual confounding is separate from the exclusion criteria. She recommended that this type of approach be incorporated into Phase IV explorations to promote a learning health care environment.
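Both ideas in the paragraph above, the sensitivity analysis and the cross-design synthesis, reduce to simple share-weighted arithmetic. The numbers below are invented for illustration: the RCT supplies the effect for the trial-eligible stratum, an observational study supplies the effect for the excluded stratum, and the target-population estimate averages the two.

```python
# Hypothetical cross-design synthesis across trial-eligible and excluded strata.
p_eligible = 0.5             # share of target population meeting RCT criteria
rct_effect = -0.04           # risk difference in the eligible stratum (RCT)
obs_excluded_effect = -0.10  # risk difference in the excluded stratum (observational)

target_effect = (p_eligible * rct_effect
                 + (1 - p_eligible) * obs_excluded_effect)

# Sensitivity analysis: how large (and harmful) would the excluded-stratum
# effect have to be for the combined target-population effect to cross zero?
break_even = -p_eligible * rct_effect / (1 - p_eligible)
```

Here the combined estimate is -0.07, and the excluded stratum would need a harmful effect of +0.04 just to cancel the benefit seen in the trial; if such a reversal is implausible, the extrapolation is defensible.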
As an example, she briefly discussed an RCT of insulin pump use and the generalization of findings to the target population (Doyle et al., 2004). The RCT showed, as expected, that in the very controlled environment of the trial, insulin pump use was more effective than self-administration of insulin at improving metabolic control, as measured by blood hemoglobin A1c levels. However, the criteria for this trial excluded the very population that would most likely benefit from automated insulin administration: those patients who do not check their blood glucose levels regularly. In this case, the RCT likely underestimated the average effect size in the target population, which Kaizar noted was probably of interest to insurance companies that may be more willing to pay for these pumps.
In concluding her presentation, Kaizar said the point that she wanted to emphasize was that when one is synthesizing across types of data, it is important to examine all of the available evidence to ensure that any inferences are consistent or explainable. She said, too, that the field needs to start thinking about how to work with multiple treatments. Stratification, she said in closing, “is very much an artifact of believing that the people who are in the trial represent everyone in that population who would be eligible. This is a bit naïve, in that patients also opt out of the trial for various reasons that we certainly cannot define a priori.” She noted that it is important to start thinking about how to capture that phenomenon, perhaps thinking about population membership as a more fuzzy or soft classification.
In his comments, William S. Weintraub said that he heard two important messages. The first, from Hernán, is to ask the right questions so that comparisons between RCTs and observational studies are done with the right analytical tools and are in fact comparing apples to apples. The second, from Kaizar, is that observational studies can be used to generalize from RCTs, but only if the confounders are the same, and if they are not, then an analytical framework must be used to account for any differences. As an example of the latter, he cited RCTs that showed that revascularization was beneficial for relatively young patients with acute coronary syndrome. The issue was whether this effect was generalizable to older populations. In this case, the confounders were not the same: elderly patients in the RCTs were less sick than those in the target population.
Weintraub also discussed an example in which the results from RCTs and observational studies did not match. In this case, a large observational study of hundreds of thousands of subjects undergoing revascularization showed a survival advantage for those who received drug-eluting stents compared with those who received bare metal stents, but no difference in repeat revascularization. In contrast, the RCTs showed the opposite: a benefit in terms of revascularization but not in terms of mortality. In this case, he said, size did not overcome what were in fact large and unaccounted-for biases in the observational studies. As a result, the findings of the RCTs were the more believable of the two types of studies.
To illustrate the power of use of a combination of study results to make more generalizable conclusions, he described a study funded by the U.S. Food and Drug Administration (FDA) to examine the comparative safety of different stents (Weintraub et al., 2012). The original RCTs were small and were designed to demonstrate efficacy, so Weintraub and his colleagues combined the data from the RCTs with data from a prospective observational study and a large patient registry. Using data from some 60,000 patients, they found that one of the devices had an odds ratio for vascular complications of 4, which was high enough to be believable and
exclude the possibility of treatment selection bias. “That device was off of the market within a couple of weeks,” said Weintraub. “To me, that was one of the triumphs of comparative effectiveness research.”
As a final comment, Weintraub noted that the use of patient registries is good for comparing quality of care and treatment adherence but that too often they are used to compare outcomes. Their use for comparison of outcomes presents a danger, however, because of the confounders resulting from the variations in the evaluation of thousands of health care providers. He made a plea to create registries that are coupled with electronic health records (EHRs) to address this issue.
Constantine Frangakis said that an important point that Hernán made was that investigators need to pay close attention to possible differences in the standard estimation approaches of randomized and observational studies, particularly given that most RCTs use an ITT analysis and most observational studies analyze data on an as-treated basis. Another potential issue, said Frangakis, is that in an RCT, it is often possible to ask participating physicians about deviations from the therapeutic protocol, something that cannot be done with observational studies. Therefore, the models being used in the two types of studies would differ in ways that are not identifiable. He added that post-treatment confounders make it difficult to use observational studies to generalize from RCTs. Regarding Kaizar’s talk, Frangakis said that he thought covariates, propensity scores, and extrapolation are useful and important analytical considerations.
Session moderator Harold C. Sox asked Kaizar if he understood her correctly that removing the excluded patients from the target population and comparing the remaining subjects with those in the RCT provides a more valid estimate of the treatment effect in the target population minus those who were excluded from participating in the RCT. She replied that his understanding was correct. She added that it is necessary to adjust for the excluded population with data from the observational study to average the effects in the two subpopulations and argued that the result would be a reasonable effect size in the target population. She reiterated that such extrapolation always involves making assumptions that must be tested for validity.
Sox then asked the panel if it is now standard practice for randomized trials to have an observational cohort representing patients who were excluded. Weintraub answered that it is not; if anything, he indicated, because of financial constraints, this is being done less frequently now than it was 30 years ago. Robert M. Califf added that the only group that he knows of that does this routinely is the Society of Thoracic Surgeons, largely because it maintains a registry database of nearly everyone who undergoes a cardiothoracic procedure in North America. This registry enables investigators conducting RCTs to know who was randomized to the trial population and who was not. He also noted that the Health Systems Research Collaboratory of the National Institutes of Health is now conducting seven clinical trials using EHRs in a similar manner.
Califf commented that he believes that the field is in transition right now and that as a researcher who conducts clinical trials, he would be dubious about the result of any observational study with an odds ratio of less than 2 to 3. However, the situation will improve greatly when the field moves into the era in which everyone has an EHR. “In my view, we just got to live through this really agonizing period of time,” he said. He also added that he believes that large, expensive, data-intensive RCTs are going to be a thing of the past when this transition is complete.
Sean Hennessy, associate professor of epidemiology at the University of Pennsylvania, asked Hernán whether the instrumental variable approach would be the best for comparison of data from an RCT with as-treated observational data. For simple studies, Hernán said, instrumental variables work well, but they are not developed enough for use with more complex studies that have time-varying confounding and selection bias, for example.
Joe V. Selby asked Kaizar if it would be reasonable, when one is planning an observational study designed to extend findings from an RCT, to build into the cohort of the observational study the same group that was in the RCT. Kaizar agreed that that was a good idea, particularly for the first observational study of a particular problem, but that the study should also be designed to collect data from a wide range of patients.
Califf remarked that one problem that frustrates him is the lack of information on observational studies that were conducted but failed to produce the desired result and so were never published. He said he hopes that the Patient-Centered Outcomes Research Institute will enforce the same rules requiring publication of positive and negative results that are now in effect for the clinicaltrials.gov website. Hernán voiced strong support for this idea, adding that selective reporting is a major problem for the field.
Mitchell H. Gail raised the point that RCTs themselves can be an important source of information on generalizability. He said that if an RCT finds that treatment effects are homogeneous throughout the study population, generalizability should be more straightforward and reweighting becomes simpler. Kaizar agreed with that statement but also said that too many studies make that leap without much evidence. Califf added that almost every RCT shows effect heterogeneity across subgroups but that issues arise when attempts are made to generalize to populations far different from the trial population. He reminded the workshop participants that the majority of RCTs select more homogeneous trial populations to increase
the odds that they will demonstrate a positive effect of the treatment, so the detection of subgroup heterogeneity within the trial population should not be surprising.
When asked by a workshop participant from a remote site about how to generalize treatment profiles in RCTs, Califf said that this is a big problem that the field needs to address. He cited as an example the Neonatal Intensive Care Unit Network Trial on the effects of oxygen saturation in neonates. The findings of this study were the opposite of those expected from observational studies, and a follow-up showed that mortality among neonates who were not enrolled in the study was higher even than that in the arm of the clinical trial with the worst mortality. One possibility is that whatever other treatments were being used outside of the RCT protocol were not only more variable but worse. Califf added that global trials on diet and medicinal herbs are extreme examples in which the variability in treatment profiles is so large as to be unmeasurable with current instruments.
Robert Temple noted that forest plots can provide a great deal of useful information about generalizability but that most reported studies on symptomatic treatments do not include them. He added that FDA is writing guidance that will encourage the use of forest plots and that will require demographic subset analysis.
Steven N. Goodman remarked that most of the work on heterogeneity and extrapolation that has been done has focused on the benefits of treatment but not the potential harms. Because most people want to know both the benefits and harms of any potential treatment that they might undergo, the issue of absolute versus relative risk becomes important. In that case, a therapy must surpass a higher standard, that is, whether it works well enough to overcome the potential harms and not just whether it works. In that regard, exclusion criteria in RCTs leave a large gap in the data because comorbidities are likely to play an important role in determining the risk of harm. Califf agreed with this point and noted that homogeneity often vanishes when one is looking at the benefit-to-risk ratio rather than just the benefits of treatment.
Doyle, E. A., S. A. Weinzimer, A. T. Steffen, J. A. H. Ahern, M. Vincent, and W. V. Tamborlane. 2004. A randomized, prospective trial comparing the efficacy of continuous subcutaneous insulin infusion with multiple daily injections using insulin glargine. Diabetes Care 27:1554–1558.
Greenhouse, J. B., E. E. Kaizar, K. Kelleher, H. Seltman, and W. Gardner. 2008. Generalizing from clinical trial data: A case study. The risk of suicidality among pediatric antidepressant users. Statistics in Medicine 27(11):1801–1813.
Hernán, M. A., A. Alonso, R. Logan, F. Grodstein, K. B. Michels, W. C. Willett, J. E. Manson, and J. M. Robins. 2008. Observational studies analyzed like randomized experiments: An application to postmenopausal hormone therapy and coronary heart disease. Epidemiology 19(6):766–779.
Manson, J. E., J. Hsia, K. C. Johnson, J. E. Rossouw, A. R. Assaf, N. L. Lasser, M. Trevisan, H. R. Black, S. R. Heckbert, R. Detrano, O. L. Strickland, N. D. Wong, J. R. Crouse, E. Stein, M. Cushman, and the Women’s Health Initiative Investigators. 2003. Estrogen plus progestin and the risk of coronary heart disease. New England Journal of Medicine 349(6):523–534.
Stampfer, M. J., and G. A. Colditz. 1991. Estrogen replacement therapy and coronary heart disease: A quantitative assessment of the epidemiologic evidence. Preventive Medicine 20(1):47–63.
U.S. General Accounting Office. 1992. Cross-Design Synthesis: A New Strategy for Medical Effectiveness Research. Report GAO/PEMD-92-18. Washington, DC: U.S. General Accounting Office. http://archive.gao.gov/d31t10/145906.pdf (accessed May 16, 2013).
Wallentin, L., R. C. Becker, A. Budaj, C. P. Cannon, H. Emanuelsson, C. Held, J. Horrow, S. Husted, S. James, H. Katus, K. W. Mahaffey, B. M. Scirica, A. Skene, P. G. Steg, R. F. Storey, and R. A. Harrington for the PLATO Investigators. 2009. Ticagrelor versus clopidogrel in patients with acute coronary syndromes. New England Journal of Medicine 361:1045–1057.
Weintraub, W. S., M. V. Grau-Sepulveda, J. M. Weiss, S. M. O’Brien, E. D. Peterson, P. Kolm, Z. Zhang, L. W. Klein, R. E. Shaw, C. McKay, L. L. Ritzenhaler, J. J. Popma, J. C. Messenger, D. M. Shahian, F. L. Grover, J. E. Mayer, C. M. Shewan, K. N. Garratt, I. D. Moussa, G. D. Dangas, and F. H. Edwards. 2012. Comparative effectiveness of revascularization strategies. New England Journal of Medicine 366(16):1467–1476.