KEY SPEAKER THEMES
• Although issues concerning study design are important, the field of clinical research also needs to start considering methods with the goal of enabling physician inquiries.
• The Patient-Centered Outcomes Research Institute should promote the development of strategies to allow for approximate matching of individual patients with records of similar patients and then conduct tests to see what happens when physicians make treatment decisions using these strategies.
• Propensity score matching can be used to correct for bias from measured covariates in observational studies. However, it requires measurements for all confounding variables, which is rarely the case.
• Implicit propensity score matching can be used to overcome limitations related to sparse information on confounders in databases of spontaneous reports of drug adverse events.
• The diagnostic gestalt is overrated because each physician has too many biases and faces too many variables to accurately compute therapeutic outcome probabilities.
• Data from electronic health records, although imperfect, yield reasonably accurate statistical prediction models that are often better than those based on the simple staging strategies currently used to predict risk.
The bottom line for medical decision making is to develop a course of therapy that will provide the greatest benefit to an individual patient with the lowest chance of harm. In this session, three speakers and two additional panelists discussed the challenges with the use of data from groups of individuals from randomized controlled trials (RCTs) and observational studies to guide treatment choices for specific individuals. Burton H. Singer, professor at the Emerging Pathogens Institute at the University of Florida, provided a short introduction to the subject. Nicholas Tatonetti, assistant professor of biomedical informatics at Columbia University, described the use of data-driven prediction models, and Michael W. Kattan, chair of the Quantitative Health Sciences Department at the Cleveland Clinic, described one method of predicting individual risk from treatment. Mitchell H. Gail and Peter Bach, an attending physician in the Department of Epidemiology and Biostatistics at the Memorial Sloan-Kettering Cancer Center, commented on the two presentations before an open discussion period.
Burton H. Singer asked the workshop to put aside for the moment thinking about study design, which was such an important focus of the previous day’s discussions, and instead consider the concept of inquiry. Consider a clinician talking to a patient, he said. In front of this physician is an elaborate patient record that includes biomarker information; notes about the patient’s experiences in care, such as the ones that Mary E. Charlson emphasized; a clinical history; and perhaps a history of the patient’s psychosocial well-being. Having read and absorbed all of that information, the clinician’s job at that moment is to determine the likely performance of a contemplated treatment regimen for the patient. Singer stressed the word “regimen,” because this treatment will not be an activity that takes place at just one point in time but is one that will occur over time. It is right then, at that moment, that the physician wants to conduct an inquiry, not a study.
What the physician wants at that moment is access to a large database that he or she can query to find patients who are approximate matches to the patient sitting across the desk. These approximate matches would not be matches just on particular variables but, rather, would be matches on an entire history as the unit of analysis, which would be characterized by a description of the patient’s condition at multiple time points. Next, the physician would query the database to identify the experiences that these approximate matches have had with the intended treatment regimen and compare them with the experiences that others have had. With that information in hand, the physician could then talk to his or her patient about the potential benefits and risks of the planned treatment in the context of what others with similar disease and personal characteristics have experienced. All of this, said Singer, presumes that the database is populated with high-quality data.
To reach this ideal state of medical practice, the Patient-Centered Outcomes Research Institute (PCORI) should promote an effort to develop approximate matching strategies and then conduct tests to see what happens when physicians make treatment decisions using these strategies. Singer noted that heterogeneity among both patients and physician practice will play a large role in this effort. For physicians, heterogeneity will show itself in terms of the types of treatments that they use and how they react to and use the answers to their queries: will they take the advice from the database query, ignore it, or modify it in some way? He acknowledged the substantial methodological challenge in accounting for physician-associated heterogeneity but suggested that these challenges can be tackled first by evaluation of individual disease processes.
He concluded his presentation with a quote from a 1977 paper published in Science by John Tukey, who wrote:
It is a difficult task to drive the nearly incompatible two-horse team: on the one hand, knowledge of a most carefully evaluated kind, where, in particular, questions of multiplicity are faced up to; and, on the other, informed professional opinion, where impressions gained from statistically inadequate numbers of cases often, and so far as we see, often should, control the treatment of individual patients. The same physician or surgeon must be concerned with both what is his knowledge and what is his informed professional opinion, often as part of treating a single patient. I wish I understood better how to help in this essentially ambivalent task.
Singer said that in his mind this statement is still relevant today and that part of PCORI’s mission should be to address the issues that the statement raises by thinking about clinical inquiry as a part of making individualized predictions.
Nicholas Tatonetti noted in his opening remarks that his presentation was not going to be directly applicable to the prediction of individualized responses but that the methodologies that he would be discussing represent the first step in that direction. With that as a caveat, he said that one of his interests is in drug safety, and he spoke about the balance between the health benefits and risks associated with the small-molecule drugs that he characterized as the cornerstone of modern medical practice. Discussing the alert issued by the U.S. Food and Drug Administration (FDA) in 2012 for the statin family of cholesterol-controlling drugs, Tatonetti said that although statins are one of the safest groups of drugs known, even drugs considered safe and effective can unexpectedly cause dangerous side effects. He also reminded the workshop of Vioxx and Avandia, two drugs that were approved and later pulled from the market when reports of severe side effects with widespread use began appearing.
In the aftermath of the Vioxx and Avandia incidents, there was a public call for the establishment of a public database to monitor drug safety, but FDA has been maintaining such a database for more than 30 years as part of the Adverse Event Reporting System. Today, this database comprises more than 3 million reports, an enormous observational database, but it is of limited utility because it is only sparsely populated with data on patient age, sex, weight, and country; the drugs that a patient was taking at the time of the adverse event; or the conditions for which a patient was receiving treatment. As a result of the sparse information, these reports are hard to interpret, Tatonetti said, and in fact, such spontaneous reporting systems in general are biased and introduce “synthetic associations” in terms of concomitant drug use and indication. As an example of the former, he said that a naïve analysis of the FDA Adverse Event Reporting System or most any clinical observational dataset would reveal a connection between aspirin use and heart attack. However, a deeper assessment of the data in the dataset would show an enormous signal for Vioxx and heart attacks that creates a false association with other drugs coprescribed with Vioxx. As an example of an indication effect, he said drugs given to diabetics are more likely to be associated with hyperglycemia in the adverse report dataset, which he termed a nonsensical result, given that the real association is with poor control of diabetes and not a particular diabetes drug.
Propensity score matching, said Tatonetti, is a technique used with observational studies that corrects for these very types of effects arising from the bias of measured covariates. Use of this technique requires identification of matched controls for the studied cases and modeling of the likelihood that the patient is selected for inclusion in the study on the basis of covariates, which in this case would be drug exposure. This method produces effect estimates that are close to those seen in idealized RCTs, but its drawback is that it requires the availability of measurements for all confounding variables, which is rarely the case with spontaneously collected data.
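The mechanics just described (model each patient’s likelihood of exposure from measured covariates, then compare outcomes between exposed records and their nearest-scoring unexposed matches) can be sketched in miniature. The data and the hand-rolled logistic fit below are invented for illustration; this is a toy sketch of the general technique, not any speaker’s implementation:

```python
import math

# Toy observational records: (covariate x, exposed?, outcome?).
# The covariate confounds: high-x patients are more likely to be exposed
# AND more likely to have the outcome, so a naive comparison is biased.
data = [
    (0.90, 1, 1), (0.80, 1, 1), (0.70, 1, 1), (0.62, 1, 1),
    (0.85, 0, 1), (0.65, 0, 1), (0.50, 0, 0),
    (0.40, 0, 0), (0.30, 0, 0), (0.20, 0, 0),
]

def fit_propensity(records, steps=2000, lr=0.5):
    """Logistic regression of exposure on the covariate, fit by gradient descent."""
    w = b = 0.0
    n = len(records)
    for _ in range(steps):
        gw = gb = 0.0
        for x, exposed, _ in records:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - exposed) * x
            gb += (p - exposed)
        w -= lr * gw / n
        b -= lr * gb / n
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))

score = fit_propensity(data)
exposed = [(score(x), y) for x, t, y in data if t == 1]
controls = [(score(x), y) for x, t, y in data if t == 0]

# Naive contrast: mean outcome among exposed minus mean among unexposed.
naive = (sum(y for _, y in exposed) / len(exposed)
         - sum(y for _, y in controls) / len(controls))

# 1:1 nearest-neighbor matching on the propensity score (with replacement).
matched = (sum(y - min(controls, key=lambda c: abs(c[0] - s))[1]
               for s, y in exposed) / len(exposed))

print(f"naive difference: {naive:+.2f}, matched difference: {matched:+.2f}")
```

In this toy, the naive contrast is inflated by the confounding covariate, while the matched contrast recovers the null effect built into the data.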
To address this limitation, Tatonetti developed implicit propensity score matching (IPSM), an adapted form of propensity matching that assumes that combinations of drugs and indications describe the patient’s covariates. Having a list of the drugs that a patient is taking can provide much information about a patient, he explained. This method starts with reports in the database that list both the coprescribed drugs and significantly correlated indications, and this group of reports serves as the source of the matched controls. As an example, he discussed the nonsensical association that he had mentioned earlier between diabetes drugs and hyperglycemia. In the adverse events database, 17.7 percent of reports of diabetes drugs list hyperglycemia as an adverse event. If the entire database served as the control, approximately 1.5 percent of the reports would list hyperglycemia as an adverse event, producing the apparent association. However, if the control cohort is restricted by use of the IPSM technique, the frequency at which hyperglycemia would be an expected adverse event reported in association with all other drugs that a patient is taking would be 17.6 percent, revealing the false association with diabetes drugs.
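The observed-versus-expected logic of this example can be sketched as follows. The drug names, reports, and the simple matching rule are invented for illustration; this is a loose sketch of the idea, not Tatonetti’s algorithm:

```python
# Hypothetical spontaneous reports: (set of drugs listed, set of adverse events).
# "drug_S" stands in for a diabetes drug; "drug_I", "drug_M", and "drug_Z" are
# coprescribed with it in the case reports; the rest are unrelated patients.
reports = [
    ({"drug_S", "drug_I"}, {"hyperglycemia"}),
    ({"drug_S", "drug_M"}, {"hyperglycemia"}),
    ({"drug_S", "drug_Z"}, {"nausea"}),
    ({"drug_I"},           {"hyperglycemia"}),
    ({"drug_M"},           {"hyperglycemia"}),
    ({"drug_M", "drug_Z"}, {"nausea"}),
    ({"drug_A"},           {"rash"}),
    ({"drug_A"},           {"headache"}),
    ({"drug_P"},           {"nausea"}),
    ({"drug_P"},           {"rash"}),
]

def event_rate(reps, event):
    """Fraction of reports listing the given adverse event."""
    return sum(event in events for _, events in reps) / len(reps)

def ipsm_style_rates(reports, drug, event):
    cases = [r for r in reports if drug in r[0]]
    others = [r for r in reports if drug not in r[0]]
    # Matched controls: reports without the drug that share at least one of
    # the drugs coprescribed with it in the case reports.
    co_drugs = set().union(*(drugs - {drug} for drugs, _ in cases))
    matched = [r for r in others if r[0] & co_drugs]
    return (event_rate(cases, event),
            event_rate(others, event),
            event_rate(matched, event))

case, background, expected = ipsm_style_rates(reports, "drug_S", "hyperglycemia")
print(f"among drug_S reports: {case:.2f}")
print(f"naive background:     {background:.2f}")
print(f"matched expectation:  {expected:.2f}")
```

Against the whole database the event looks overrepresented among the drug’s reports, but among matched controls with the same co-medications the expected rate is just as high, so no signal remains, mirroring the 17.7 versus 17.6 percent example above.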
He described another example in which IPSM corrects for the association between arrhythmia and antiarrhythmic drugs: 10 of the 13 drugs flagged by the uncorrected analysis were no longer associated with what is known as the “proarrhythmic effect” after IPSM correction. The three drugs whose proportional reporting ratio still exceeded the significance threshold after correction do, in fact, have proarrhythmic effects that limit their use. IPSM, he added, can also correct for other biases, such as age or sex.
Another issue that this approach addresses is under- and nonreporting of adverse events. In this case, severe adverse events are identified by the presence of more minor—and more common—side effects. In essence, Tatonetti explained, this is much like the way in which a physician detects disease: the physician uses observable side effects to form a hypothesis about the underlying disease. In this case, the procedure first involves identification of common side effects that are harbingers of the underlying severe adverse event. These are then combined to form an effect profile for an adverse event.
As an example, he showed how IPSM was used with electronic health record (EHR) data to identify a drug-drug interaction between paroxetine and pravastatin, a combination that some 1 million patients take annually (Tatonetti et al., 2011). A previous observational study that used data from EHRs showed evidence of an interaction between these two drugs that resulted in increased blood glucose levels, but this study could have been biased by confounders, such as the use of combinations of other drugs from the same classes (other selective serotonin reuptake inhibitors and other statins), the time of day that glucose readings were taken, and the concomitant use of other medications. IPSM analysis found that none of these were significant confounders. Skeptics remained, however, so he and his collaborators ran a small experiment in mice and found virtually the same result seen in humans: the combination of paroxetine and pravastatin, particularly in insulin-resistant mice, produces a significant increase in blood glucose levels.
Michael W. Kattan’s interest in predicting personal risk began when his physician told him he had Stage IV Hodgkin’s lymphoma. When he asked about his chances for surviving this disease, he was shown the typical prognostic plots, which were based solely on disease staging, and concluded that this particular counseling tool was not designed to predict his prognosis accurately. Kattan explained that when asked that question, physicians have two choices: either quote an overall average to all patients or make a prediction based on knowledge and experience. What the physician needs instead is some way of taking as many relevant pieces of information about a specific patient as possible and using that to inform a model to make a personalized prediction that informs therapeutic choices.
As an example of a simple model, he described a preoperative nomogram for prostate cancer that assigns point values for three characteristics—the prostate-specific antigen (PSA) level, the clinical stage, and the biopsy Gleason grade—and then relates the point total to the probability that the patient will remain free of prostate cancer recurrence after 5 years (Kattan et al., 1998). This model can be modified to include risk stratification depending on surgery or radiation therapy and can therefore inform a patient’s decision on the basis of the risk of recurrence at 5 years with one type of therapy or the other. He showed a similar nomogram that was also better than simple staging at predicting the probability of 5-year survival for patients with gastric cancer. He concluded that, on the basis of his experience, this type of continuous regression equation modeling produces at least a small gain in discrimination compared with that achieved with simple staging systems.
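A toy version of such a point-based nomogram can make the mechanics concrete. The point values, cutoffs, and probability table below are invented for illustration and do not come from the published model:

```python
# Hypothetical point assignments; the published nomogram (Kattan et al.,
# 1998) uses different, continuously scaled values.
PSA_POINTS = [(4.0, 0), (10.0, 20), (20.0, 40), (float("inf"), 60)]  # ng/mL upper bounds
STAGE_POINTS = {"T1c": 0, "T2a": 10, "T2b": 25, "T2c": 35}
GLEASON_POINTS = {5: 0, 6: 10, 7: 35, 8: 60, 9: 75, 10: 85}
# Total points -> probability of remaining recurrence free at 5 years.
PROB_TABLE = [(30, 0.95), (60, 0.85), (100, 0.70), (140, 0.50), (float("inf"), 0.30)]

def lookup(table, value):
    """Return the result for the first band whose upper bound covers value."""
    for upper, result in table:
        if value <= upper:
            return result

def five_year_recurrence_free(psa, stage, gleason):
    total = lookup(PSA_POINTS, psa) + STAGE_POINTS[stage] + GLEASON_POINTS[gleason]
    return total, lookup(PROB_TABLE, total)

total, prob = five_year_recurrence_free(psa=6.2, stage="T2a", gleason=7)
print(f"{total} points -> {prob:.0%} probability of remaining recurrence free")
```

The point here is only the shape of the tool: three clinical inputs are converted to points, summed, and mapped to an individualized probability rather than a stage-wide average.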
One problem that Kattan has noticed with these predictive tools is that surgeons are reluctant to use them. He recounted one study that he and his colleagues conducted in which they presented 10 case descriptions from real prostate cancer patients to 17 urologists. The urologists were provided with the PSA level, biopsy Gleason grades, clinical stage, patient age, systematic biopsy details, previous biopsy results, and PSA history. They were also provided with the preoperative nomogram and were asked to predict the probability of 5-year progression-free survival with or without the use of the nomogram. The results showed that concordance with the patients’ actual outcomes was 67 percent with the nomogram, whereas it was 55 percent when the urologists made the prediction on the basis of experience (Ross et al., 2002). In a similar study in which a nomogram was used to predict the likelihood of additional nodal metastases in breast cancer patients with a positive sentinel node biopsy result, the nomogram was 72 percent accurate at making predictions, whereas clinicians were 54 percent accurate (Van Zee et al., 2003).
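The concordance figures quoted above are c-statistics, which can be computed as sketched here on invented predictions for six hypothetical patients (the numbers are illustrative, not data from the cited studies):

```python
def concordance(risks, outcomes):
    """C-statistic for a binary outcome: of all (event, non-event) patient
    pairs, the fraction in which the event patient was assigned the higher
    predicted risk (ties count one-half)."""
    pairs = concordant = 0.0
    for ri, yi in zip(risks, outcomes):
        for rj, yj in zip(risks, outcomes):
            if yi == 1 and yj == 0:
                pairs += 1
                if ri > rj:
                    concordant += 1
                elif ri == rj:
                    concordant += 0.5
    return concordant / pairs

# Hypothetical predicted 5-year progression risks for six patients.
outcomes  = [1, 1, 1, 0, 0, 0]                    # 1 = progressed within 5 years
nomogram  = [0.80, 0.60, 0.30, 0.50, 0.20, 0.10]  # ranks events higher, mostly
clinician = [0.60, 0.40, 0.35, 0.45, 0.50, 0.30]  # ranks them less consistently

print(f"nomogram c-index:  {concordance(nomogram, outcomes):.2f}")
print(f"clinician c-index: {concordance(clinician, outcomes):.2f}")
```

A c-index of 0.5 is chance-level ranking and 1.0 is perfect discrimination, which is why a 67 versus 55 percent gap between nomogram and clinician is a meaningful difference.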
Based on these types of comparisons, Kattan concludes that the diagnostic gestalt is overrated because each physician has too many biases and the presence of too many variables makes it difficult to accurately compute the probabilities of various therapeutic outcomes. He indicated that he would like to see the field develop comparative effectiveness tables. He acknowledged that tailoring such tables to an individual is difficult, but he added, “I think efficiencies and better decision making would take place if we could get that type of information handed to us.” He showed an example of a risk calculator developed at the Cleveland Clinic that uses its EHR to fill in an individual’s information on age, gender, comorbid conditions, medications, blood pressure, lipid levels, smoking status, and other personal characteristics to produce a table that provides 6-year probabilities of mortality, stroke, coronary artery disease, liver injury, heart failure, renal insufficiency, and diabetic nephropathy for each of four classes of diabetes drugs (see Table 6-1).
After describing some of the programming that is needed to create these tools, he noted that the Cleveland Clinic has developed more than two dozen of these risk calculators that input patient EHR data; the calculators are available to the public at http://rcalc.ccf.org. Also available is a tool for constructing new risk calculators that anyone is free to use with registration. In conclusion, he said that data from EHRs, although imperfect, yield reasonably accurate statistical prediction models, though many modeling options should be compared to determine their predictive accuracies. He added that the collection of follow-up data is the biggest challenge with verification of the accuracy of these models.

TABLE 6-1 Example Individualized Predictive Risks of Seven Outcomes Across Four Drugs Given to Diabetic Patients

Outcome                    BIG      MEG      SFU      TZD
Coronary artery disease    0.024    0.005    0.028    0.033

NOTE: BIG = biguanides; MEG = meglitinides; SFU = sulfonylurea; TZD = thiazolidinedione.
SOURCE: Reprinted with permission from Michael W. Kattan.
In his comments, Mitchell H. Gail noted that the concept of matching that Singer described had been widely used since the 1800s, until the RCT was introduced. He said that it has been argued that databases can address a wider range of questions than can RCTs, but that is true only if the right things can be matched in a way that controls confounding by indication. As an example, he cited work done in the 1980s on thyroid cancer that showed that patients who received radiation therapy did far worse than those who had surgery, but the study could not conclude that radiation was the problem because the registry contained no information indicating why a patient received radiation instead of surgery.
He then commended Tatonetti for his assessment of how observational data from large data sets can be used to identify adverse events. This approach to controlling for confounding by indication was interesting, but he took a wait-and-see attitude as to whether this method works, pending further studies to gain more experience with IPSM. Gail agreed with Kattan’s assessment about the need for highly discriminating and well-calibrated risk models, but he said that the approach that he took in his work avoided the question of confounding by indication.
Peter Bach agreed with Gail that the issue of confounding is important, as are concerns about the quality of the data in large data sets. He said that one of the challenges that the field faces today is the tension between the use of ever larger data sets and not being able to overcome some of the intrinsic, unmeasured differences between groups. In fact, he said that he worries that “the illusion of precise information could actually move us in the wrong direction.” As an example, he cited a seeming 10-fold difference in mortality that appeared in the diabetes table that Kattan presented. “It seems inconceivable to me that a single drug could drive that kind of mortality difference. But certainly, if I saw it as a patient and believed the numbers in front of me, it would certainly heavily influence my decision-making process,” he explained.
The issue of collecting follow-up data is an important problem, Bach continued. In the cancer world, for example, although the time and cause of death are relatively easy to obtain, data on progression and time to progression are difficult to capture because patients move around and because they are actually heavily influenced by surveillance schedules and other aspects of treatment that are in themselves confounded. He then discussed work on lung cancer screening that showed the power of these predictive models to identify people at risk for interventions and counsel patients about the benefits and risks. He also remarked that the field needs to do more work on risk communication both to promote more research in the area and to help patients use available tools.
William S. Weintraub raised the issue of confounding by indication and remarked how none of the different risk models that could be developed using observational studies and RCTs fully address this issue. He asked members of the panel for their thoughts on two questions: How should the field go forward in developing good risk models? What are the best methods of assessing the quality of a given risk model? Responding to the first question, Kattan agreed that confounding by indication is a tough problem to solve. He said that RCTs could provide the best solutions but that in the cardiac field, surgeons and radiologists would never allow the kinds of trial designs that would answer questions regarding confounding by indication. Tatonetti agreed that RCTs are the preferred source of risk data but that they become infeasible when drug-drug interactions or comorbidity effects are being studied. “The number of patients you need is simply too large and the costs are too large, so these technologies need to be investigated,” said Tatonetti. “The problem is, we do not have a lot of validation that they produce reliable effect risk estimates.”
Horwitz asked the panel if the current risk models are providing data that may be misleading patients when they make decisions. Kattan replied that it was possible but added that the current models may still be providing information better than that which one would have if no models were available. Tatonetti said that his is a data-mining method and that it is not designed for developing a precise risk model in any one setting. “I would not be confident enough that I corrected for confounding so well that I would trust the risk estimates that come out of it,” said Tatonetti, adding that rather than trying to correct for confounding, he tries to corroborate the results with those from another dataset. “I think that is essential, especially when you are using these confounded observational data sets to continue to try and find a complementary dataset that has slightly different information and slightly different biases and you can start to build up a kind of corpus of evidence that suggests that maybe your hypothesis is true.”
Gail noted that “there are books written on the best way to formulate models and on the criteria for evaluating them, and I think very well established ways of checking how well calibrated a model is. So I think some of the technical aspects of modeling have received a lot of attention and are continuing to receive attention, but there are adequate methods.” He agreed with Bach that observational data can add to the estimation of risk in a way that is meaningful to the patient, particularly because the trial population in an RCT can be too small to provide reliable baseline estimates of absolute risk.
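One of the well-established calibration checks Gail alludes to is simple to sketch: group patients by predicted risk and compare the mean predicted probability with the observed event rate in each group. The data below are invented, and this is a simplified illustration of the idea (in the spirit of, but not identical to, a Hosmer-Lemeshow test):

```python
def calibration_table(preds, outcomes, bins=3):
    """Group patients by predicted risk and, for each group, compare the
    mean predicted probability with the observed event rate."""
    ranked = sorted(zip(preds, outcomes))  # order patients by predicted risk
    size = len(ranked) // bins
    rows = []
    for b in range(bins):
        group = ranked[b * size:(b + 1) * size] if b < bins - 1 else ranked[b * size:]
        mean_pred = sum(p for p, _ in group) / len(group)
        obs_rate = sum(y for _, y in group) / len(group)
        rows.append((round(mean_pred, 2), round(obs_rate, 2)))
    return rows

# Hypothetical predictions for nine patients, roughly calibrated by tercile.
preds    = [0.1, 0.1, 0.2, 0.4, 0.4, 0.5, 0.7, 0.8, 0.9]
outcomes = [0,   0,   0,   1,   0,   1,   1,   1,   1]

for mean_pred, obs in calibration_table(preds, outcomes):
    print(f"mean predicted {mean_pred:.2f} vs observed {obs:.2f}")
```

A well-calibrated model shows close agreement between the two columns in every risk group; large gaps flag the kind of misleading absolute risks that Bach worried about.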
Michael Pencina, associate professor in the Department of Mathematics and Statistics at Boston University, remarked that he considers website prediction models such as the ones that Kattan described to be controversial. “We have a lot of them, but very few have been validated,” he said, noting that he has had discussions with the FDA regarding whether the agency should be monitoring this activity and to what extent. The big question, he said, is how to validate these models. “My answer is [that] it almost does not matter which metric we use to assess model performance as [much as] it does whether we understand what they tell us and what the standards are for interpretation,” said Pencina. In other words, he added, “What is good enough?” Kattan agreed with this critique of Web-based models for public use and said that he and his colleagues do not post a prediction tool until they are comfortable with the foundational procedures and the error measures. He thought that, ideally, all such tools should have a link to a publication that a physician can access before recommending the tool to a patient.
David M. Kent, commenting on Kattan’s use of risk models in concert with clinical trials, said that his research has been finding what he called a surprising degree of risk variation even in efficacy trials and that the typical patient has a much lower risk than is suggested by the summary effect in the overall trial results. He also noted that although Kattan’s work showed that the gestalt of physicians often does not agree with the actual risk, prediction models often disagree with one another as well and produce different recommendations. In response, Bach said that it is vitally important to understand the user of these predictions and thought that this was an area ripe for study, in particular, how doctors comprehend risk prediction.
Sanford Schwartz, professor of internal medicine, health care management, economics, and medicine at the University of Pennsylvania, said that it is important to identify the clinical objective before a model is designed. “Is the objective to identify risk?” he asked. “Is it to inform the doctor and the patient about alternative trade-offs of prognosis or alternative trade-offs of treatment?” It is more important, he said, to consider clinical utility than absolute accuracy, particularly in the context of advising patients in a learning health system and what PCORI is trying to accomplish in that context. In his mind, the risk and health care costs of a false-positive result versus a false-negative result may be greater for one application, but the opposite could be true for another application.
Providing feedback for PCORI, Schwartz said that observational studies are going to be critically important because RCTs can look at only a subset of outcomes and not the range of outcomes that are important to doctors and patients. The key, he said, is to fill in missing data, and he recommended that PCORI focus on “trying to generate registries or observational data sets where there is an emphasis on follow-up, on getting [data about] what happens to the person longitudinally.” He added that PCORI should also focus on developing ways for presenting information to doctors and patients and on understanding how that information will be interpreted by patients and physicians.
In response to a question from Horwitz about what needs to be done to provide predictions that reflect longitudinal changes in treatment, condition, or comorbidities, Gail said that RCTs or adaptive RCTs can be designed to address those issues in some cases, but doing so requires that the intervention and clinical question be carefully designed at the very outset of the project. He added that researchers are developing approaches to answering some of these longitudinal questions using observational data, “and to the extent that they do account for confounding by indication and for the longitudinal nature of confounding, they may be getting closer to giving good advice.” Singer agreed that adaptive trial designs are a good start toward addressing longitudinal questions. Bach added that from a methodological perspective, it is “exponentially more complicated to model changes over time,” referring not to the confounding issue but the mathematics.
A workshop participant from a remote site asked the panel to comment on whether it was possible to use observational studies to validate data-mining results or to use data mining as a preliminary step before observational studies to allow sound hypotheses to be made. Tatonetti said that data mining does in fact generate hypotheses and that the work that he presented aims to generate the best hypotheses, given the biases in the data sets, that can then be validated through the use of data from observational studies. He added that he was not sure that data mining had yet reached the point where it generated hypotheses better than those of an expert biologist or clinician, but that was the goal of his work.
Mary E. Charlson commented that she thought that risk prediction would benefit if the field could agree on a common set of perhaps 20 items on socioeconomic status, location, mental status, and other characteristics that everyone would collect and report in a uniform manner. Both Bach and Kattan thought this to be a great idea, but both noted that physicians may balk if the list is longer than eight items, unless the data are collected within the context of an EHR. Sheldon Greenfield remarked that he is part of a trial of elderly individuals that is trying to collect such data, and he wondered about the feasibility and cost of the use of these kinds of data to sort patients into risk groups. Bach responded that this was already being done in breast cancer prevention trials. “It is highly feasible, and in this case the effect on [the] power of selecting patients based on event probabilities is incredibly valuable,” said Bach.
REFERENCES

Kattan, M. W., J. A. Eastham, A. M. Stapleton, T. M. Wheeler, and P. T. Scardino. 1998. A preoperative nomogram for disease recurrence following radical prostatectomy for prostate cancer. Journal of the National Cancer Institute 90(10):766–771.
Ross, P. L., C. Gerigk, M. Gonen, O. Yossepowitch, I. Cagiannos, P. C. Sogani, P. T. Scardino, and M. W. Kattan. 2002. Comparisons of nomograms and urologists’ predictions in prostate cancer. Seminars in Urologic Oncology 20(2):82–88.
Tatonetti, N. P., J. C. Denny, S. N. Murphy, G. H. Fernald, G. Krishnan, V. Castro, P. Yue, P. S. Tsao, I. Kohane, D. M. Roden, and R. B. Altman. 2011. Detecting drug interactions from adverse-event reports: Interaction between paroxetine and pravastatin increases blood glucose levels. Clinical Pharmacology and Therapeutics 90(1):133–142.
Tukey, J. W. 1977. Some thoughts on clinical trials, especially problems of multiplicity. Science 198(4318):679–684.
Van Zee, K. J., D. M. Manasseh, J. L. Bevilacqua, S. K. Boolbol, J. V. Fey, L. K. Tan, P. I. Borgen, H. S. Cody III, and M. W. Kattan. 2003. A nomogram for predicting the likelihood of additional nodal metastases in breast cancer patients with a positive sentinel node biopsy. Annals of Surgical Oncology 10(10):1140–1151.