4

Tools for Generating and Synthesizing Evidence

Important Points Emphasized by Individual Speakers

  • Test development needs to be rigorous, using meaningful and well-designed studies, proper statistical analysis, independent external validation, and interdisciplinary expertise.
  • The identification and dissemination of best practices for the entire pathway of test development ensures that everyone understands the test development process.
  • Clinical utility needs to receive earlier and more intense focus, with more education about how to interpret the results.
  • Appropriate control groups are important to determine whether a biomarker distinguishes a group that benefits from a treatment.
  • Studies of clinical utility should be conducted in settings that are relevant to more real-world clinical decisions.
  • Focusing more on value than on cost-effectiveness in assessments of molecular diagnostics will enable analyses to be descriptive in addition to prescriptive and will allow consideration of the full context of care.
  • The collection of blood and tissue from every cancer patient, including those who die, could greatly advance research.


The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.



GENOME-BASED DIAGNOSTICS

Five speakers covered diverse aspects of the development of methodologies and tools that, as the statement of task put it, are related to demonstrating the evidentiary requirements for clinical validity and clinical utility that meet the needs of all stakeholders. Discussions included guidelines for test development, the role of comparative-effectiveness research (CER) in demonstrating clinical utility, statistical techniques, cost-utility analyses, and innovation mechanisms in small companies. Common themes included the need for clearly defined standards for analyses, the importance of context in determining clinical utility, and the importance of access to well-documented biospecimens.

Developing Omics Tests

Debra Leonard, professor and vice chair in the Department of Pathology and Laboratory Medicine and director of the Clinical Laboratories at Weill Cornell Medical Center, summarized the findings of a recent Institute of Medicine (IOM) report titled Evolution of Translational Omics: Lessons Learned and the Path Forward (2012b). The report was written by an IOM committee in response to the development of gene-expression array tests at Duke University that were said to predict sensitivity to chemotherapeutic agents. Papers written about the tests suggested that they represented a major advance and would better direct cancer therapy. Clinical trials were initiated in 2007, with the tests being used to select which chemotherapeutic agent patients would receive.

A paper by Baggerly and Coombes (2009), however, pointed to numerous errors and inconsistencies in the data and stated that the results could not be reproduced. Following a 2010 letter from more than 30 bioinformaticians and statisticians to the National Cancer Institute (NCI) urging the suspension of the clinical trials and an investigation of the test and computational models by the NCI, the clinical trials were stopped.
The NCI then asked the IOM to review the situation and provide guidance for the field. The IOM committee was charged with recommending an evaluation process to determine when omics-based tests are fit for use in a clinical trial. It also was asked to apply these criteria to omics-based tests used in the three cancer clinical trials conducted by the Duke investigators and to recommend ways to ensure adherence to the developed framework.

A Recommended Framework

An omics test is defined as being composed of or derived from multiple molecular measurements and interpreted by a fully specified computational model to produce a clinically actionable result (IOM, 2012b). The test can assess genomics, transcriptomics, proteomics, epigenetics, and so on. Characteristics of an omics test include the use of complex, high-dimensional datasets and interpretation by a computational model, with a high risk that the computational model will overfit the data. The term is not meant to apply to a single gene test or to noncomplex testing, though Leonard added that she believes the committee’s findings should apply to the development of any test.

The committee developed a recommended framework for the evaluation of omics tests from discovery to clinical use (see Figure 4-1). The framework begins with a discovery phase in which a candidate test is developed on a training set of data. Then the computational model is fully defined and locked down. The testing method and computational model are subsequently confirmed on a separate set of specimens or a subset of samples from the discovery set that were not used for training. If the test is intended for clinical development and eventual use, the data, computer code, and metadata should be made available to the public. The candidate test should be defined precisely, including not only the molecular measurements and computational model but also the intended clinical use for the test. This is standard for FDA, Leonard explained, but academic investigators tend not to think about how a test will be used in the clinic.

After the discovery phase, the committee concluded, test validation should be done under approval from an institutional review board (Jacobs et al., 1990) and in consultation with the FDA. The clinical testing method should be defined along with the analytical validation or confirmation of the analytical performance characteristics of a test. This can be done in a looped process, Leonard said, in which the test is modified to achieve the desired analytical performance.
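
The discovery-phase sequence described above (develop a candidate test on a training set, lock down the computational model, then confirm it on specimens not used for training) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the committee's procedure: the one-parameter cutoff "model", the names `LockedTest`, `split_specimens`, `accuracy`, and `discover`, and all of the data are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LockedTest:
    """A fully specified candidate test; frozen so it cannot change after lock-down."""
    cutoff: float        # hypothetical one-parameter "computational model"
    intended_use: str    # the intended clinical use is part of the test definition

    def is_positive(self, score):
        return score >= self.cutoff


def split_specimens(ids, holdout_fraction=0.3, seed=0):
    """Randomly partition specimen IDs into (training, confirmation) lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(test, scores, labels):
    """Fraction of specimens whose test call matches the known label."""
    hits = sum(test.is_positive(s) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def discover(scores, labels, intended_use):
    """Discovery phase: choose the cutoff with the best training accuracy, then lock it."""
    candidates = [LockedTest(c, intended_use) for c in sorted(set(scores))]
    return max(candidates, key=lambda t: accuracy(t, scores, labels))


# Hypothetical specimens: id -> (assay score, responder label).
specimens = {i: pair for i, pair in enumerate([
    (0.10, 0), (0.30, 0), (0.40, 0), (0.35, 1), (0.70, 1),
    (0.90, 1), (0.20, 0), (0.80, 1), (0.60, 1), (0.15, 0),
])}
train_ids, confirm_ids = split_specimens(specimens, holdout_fraction=0.3, seed=1)

locked = discover([specimens[i][0] for i in train_ids],
                  [specimens[i][1] for i in train_ids],
                  intended_use="illustration only")

# The confirmation set is scored once, with the locked test, and is never used for tuning.
confirmation_accuracy = accuracy(locked,
                                 [specimens[i][0] for i in confirm_ids],
                                 [specimens[i][1] for i in confirm_ids])
print(locked, confirmation_accuracy)
```

Making the test object immutable is one simple way to express "locked down" in code: any later change requires constructing a new test, which under the framework would mean starting validation again.
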
The defined test method should then be used on a validation sample set, with the intended use, assay, computational procedures, interpretation criteria, and target patient population all defined. The sample set can come from the discovery phase if the samples were from patients. If the discovery phase was done on cell lines or samples that were not like the patient samples, however, then validation needs to be done on a patient sample set. If the sample set is annotated with patient treatments and outcomes, it needs to be blinded to those doing the testing. Once the test is defined and locked down, validation cannot be done iteratively. Rather, if the test needs to be changed, it must then begin the validation phase anew.

The committee recommended that the test be discussed with the FDA prior to the validation studies to learn what the FDA would want to see for approval of the test. The test development and validation should be performed in a Clinical Laboratory Improvement Amendments (CLIA)-accredited clinical laboratory if the test is intended to direct patient management. The CLIA laboratory should design, optimize, validate, and implement the test under the current clinical laboratory standards, and CLIA requirements should be met by each laboratory in which the test will be performed for the clinical trial.

[Figure 4-1 panels: Discovery and Test Validation Stage (Discovery Phase; Test Validation Phase, with IRB approval and FDA consultation), followed by Evaluation for Clinical Utility and Use Stage (three potential pathways: prospective/retrospective study with archived specimens; prospective clinical trial in which the test does not direct patient management; prospective clinical trial in which the test directs patient management), leading to FDA approval/clearance or LDT process, additional evidence of clinical utility, practice guidelines and reimbursement, and clinical use.]

FIGURE 4-1 The recommended framework for the evaluation of omics tests extends from discovery to clinical use.
NOTE: FDA, U.S. Food and Drug Administration; IDE, investigational device exemption; IRB, institutional review board; LDT, laboratory-developed test.
SOURCE: IOM, 2012b.

Pathways to Clinical Utility

During the final stage, which is separated by a bright line from discovery and validation, the test is evaluated for clinical utility and use. Clinical utility is not assessed by the FDA or in the LDT process, so the committee recommended that the process of gathering evidence to support clinical use begin before the test is introduced into clinical practice.

Three potential pathways are available for developing evidence of clinical utility, Leonard observed. Prospective or retrospective studies can be conducted using archived specimens from previously conducted clinical trials. Also, prospective clinical trials can be performed where either the test does or does not direct patient management. Whether test results direct patient management affects both the design of the prospective clinical trials and where the test is done.

Regardless of the chosen method, the study or trial should receive approval from the institutional review board, and the FDA should be consulted. For investigators conducting a clinical trial in which the test will be used to manage patient care, the committee recommended that they communicate early with the FDA regarding the process and requirements of an investigational device exemption. Omics-based tests should not be changed during the clinical trial without a protocol amendment and discussion with the FDA. A substantive change to the test may require restarting the study, noted Leonard.

If supportive evidence is generated using these pathways, FDA approval or clearance can be sought or the test can be further developed as an LDT, said Leonard.
Evidence can continue to be generated during this period to facilitate coverage and reimbursement discussions with payers and adoption into clinical practice guidelines.

Concluding Observations

The test development pathway is segmented, and different parts of it are done by different groups, especially in the academic environment, Leonard noted after describing the committee’s recommendations. But oftentimes the groups do not fully comprehend the impact they have on each other. The IOM report, by describing the entire pathway of test development, defines best practices so that everyone can understand the interrelatedness of the test development process.

Unfortunately, the report does not look in depth at the barriers to the recommended pathway, Leonard noted. In addition to the cost of the clinical trial, test validation and development is expensive, and how this expense will be covered is not clear. Also, there is a lack of availability and access to annotated specimens and datasets. The National Institutes of Health (NIH) does not routinely fund the maintenance and biobanking of specimens along with associated clinical data. Finally, there is no process for establishing whether a test will be covered or the level of payment that will be received for a test.

Leonard also suggested the idea of a clearinghouse that holds data gathered from various sources to determine whether a product has clinical utility. These data could be used both in guidelines development and in deciding whether to cover or not cover the clinical use of a molecular diagnostic.

The Role of Comparative-Effectiveness Research

One way to build the evidence base for decision making in cancer genomic medicine is through CER, said Andrew Freedman, chief of the Epidemiology and Genomics Research Program’s Clinical and Translational Epidemiology Branch at the NCI. CER is intended to create evidence for decision making by finding out “what works” in health care. According to the IOM (2009, p. 13), “CER is the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat, and monitor a clinical condition or to improve the delivery of care. The purpose of CER is to assist consumers, clinicians, purchasers, and policymakers to make informed decisions that will improve health care at both the individual and population levels.”

One problem with CER is that it typically focuses on average treatment effects. Some treatments may have a significant effect across a broad population but may in reality only benefit particular patients and not others.
Alternatively, a treatment may not reach significance and be considered ineffective when looked at across a large population, but in fact work for a certain subpopulation. The failure to recognize the heterogeneity of treatment effects can undermine the interpretation of the clinical trial results and the generalizability of those findings to patient populations. In cancer genomics, the goal is to figure out, for patients with a similar diagnosis, what tumor markers or genomic markers predict who will respond to treatment, who will not respond to treatment, and who will have adverse effects.

The methods used in CER are typically the same as in traditional genomic studies (see Table 4-1), but CER has a different orientation in that it encourages the stakeholders, including patients, clinicians, and payers, to prioritize research to help deliver the answers they need. The stakeholders identify the questions that would generate the needed evidence, and a synthesis of the evidence then informs clinical practice. CER emphasizes new therapies in usual care rather than in highly selected clinical trials, because the studies need to be relevant to clinical practice and not look just at clinical end points. Important considerations include quality of life, resources used, costs, and patient preferences, among other factors.

TABLE 4-1 Comparative-Effectiveness Research Versus Traditional Studies of Genomic Tests for Cancer

Feature of Research: Priority of study among alternatives
  • Comparative-Effectiveness Research: Determined by multiple stakeholders, using criteria such as disease burden or cost, lack of information, variability in care
  • Traditional Studies: Opportunity as dictated by expert assessment of emerging technology

Feature of Research: Study design
  • Comparative-Effectiveness Research: Retrospective or prospective analysis
  • Traditional Studies: Retrospective analysis of existing tumor specimens; occasional prospective analysis of observational data

Feature of Research: Comparisons
  • Comparative-Effectiveness Research: Direct comparisons of new therapy with usual care
  • Traditional Studies: Direct comparisons of competing therapies, often not considering usual care

Feature of Research: Topics
  • Comparative-Effectiveness Research: Prevention, treatment, monitoring, and other broad topics
  • Traditional Studies: In most cases, prediction of narrow effects such as serious drug interactions, response to treatment, tumor recurrence

Feature of Research: Perspectives
  • Comparative-Effectiveness Research: Multiple, including clinician, patient, purchaser, and policy maker
  • Traditional Studies: Clinician and patient

Feature of Research: Study populations and settings
  • Comparative-Effectiveness Research: Representative of clinical practice
  • Traditional Studies: Highly selected

Feature of Research: Data elements
  • Comparative-Effectiveness Research: Patient characteristics, quality of life, safety of treatment, resource use and costs, patients’ preferences
  • Traditional Studies: Patient characteristics, clinical end points

Feature of Research: Funding
  • Comparative-Effectiveness Research: “Coverage with evidence development” programs, public–private partnerships
  • Traditional Studies: Private investors, research grants from federal sources such as the National Institutes of Health

SOURCE: Ramsey et al., 2011.

Types of Randomized Controlled Trials

In a randomized controlled trial (RCT), patients are randomized into two groups on the basis of either the treatment or the genomic test or marker. Freedman focused particularly on several varieties of RCTs as sources of evidence for both traditional genomic studies and CER. He divided RCTs into explanatory RCTs,[1] adaptive clinical trials, pragmatic clinical trials, and cluster randomized trials (Meyer, 2011).

The demonstration that HER2-positive patients benefited from treatment with Herceptin compared with others is the classic example in cancer genomics of an explanatory RCT (Smith, 2001).

Adaptive clinical trials are “learn-as-you-go trials” where the biomarker, the treatments, or both are changed as the trial progresses and more results become available, said Freedman. One or more decision points are built into the trial design for analysis of outcomes and associated patient or disease characteristics to identify subgroups that are responding favorably.

Pragmatic clinical trials (also called practical clinical trials, effectiveness trials, or large simple trials) are designed to help decision makers choose between options for care in routine clinical practice. These trials include a broad range of health outcomes, including morbidity, quality of life, symptom severity, and costs, and are similar in many ways to explanatory trials. A purely pragmatic trial is not necessarily looking for regulatory approval of efficacy; rather, it is trying to figure out what works in the real world.
It may not have an ideal experimental setting and is aimed more at normal practice. Generally, such trials have broader eligibility criteria that are not highly restrictive.

An example is the ongoing 4,000-patient RxPONDER trial that is looking at the use of Oncotype DX to predict chemotherapeutic benefit for patients with estrogen receptor (ER)-positive/HER2-negative breast cancer where the cancer has been detected in one to three lymph nodes (Wong et al., 2012). Patients are being randomized to receive either chemotherapy and endocrine therapy or just endocrine therapy in order to identify the best cutoff point for use of the Oncotype DX recurrence score. It has elements of not only a traditional explanatory RCT but also a pragmatic trial. It convened a stakeholder group to identify end points and the elements that would make the trial more relevant to current practice (Ramsey et al., 2013). The setting is reasonably representative of general clinical practice, as opposed to a more ideal clinical trial setting. A fairly simple intervention is being compared with usual care, and insurance firms are helping to pay for some of the tests. The outcomes are the cutoff point and disease-free survival, but quality of life, decision making, and cost-effectiveness are also being examined.

[1] Explanatory RCTs examine an intervention under ideal circumstances while pragmatic clinical trials assess interventions under real-world circumstances.

Cluster randomized controlled trials are a type of pragmatic trial in which social units or clusters rather than individuals are randomly allocated to use of a treatment. For example, such a trial can be randomized on clinical practices, where some practices deliver the intervention and some do not. Relatively few of these have occurred in oncology, especially with biomarkers.

Observational Trials Versus Randomized Controlled Trials

Observational studies are a valuable and complementary approach to generating evidence, said Freedman. These studies are nonrandomized and include retrospective analysis of biospecimens from RCTs, retrospective and prospective cohort designs, studies based on registries, and studies with case-control designs.

The strength of the evidence increases in the progression from observational studies to pragmatic RCTs to explanatory RCTs. External validity generally gets stronger in the opposite direction, however. Trade-offs, therefore, have to be made in the decisions of which type of study to use.

Freedman proposed several sets of criteria that can be used to make these decisions. RCTs may be more suitable to determine comparative effectiveness in genomic medicine in the following situations:

  • When decisions require the highest level of certainty.
  • When detecting small or modest differences in the results of treatment or testing.
  • When ensuring high levels of internal validity (by controlling for selection bias, patient compliance, and other confounding factors).
  • When accessible biospecimens are required for all participants.
  • When detailed information on outcomes is needed.
  • When genomic markers are incorporated in the design.
  • When examining complex testing of multitherapy treatments.

Observational studies may be more suitable in the following situations:

  • When study populations are not represented in RCTs, as with patients distinguished by age, comorbidities, or medications.
  • When larger studies and diverse populations are needed, especially when looking at rare outcomes or the analysis of subgroups.
  • When long-term follow-up is needed.
  • When an RCT is not ethical or feasible.
  • When testing or treatments are used off label.
  • When comparing outcomes from multiple treatment regimens.
  • When detecting larger differences in the results of treatment or testing.
  • When confirming results from RCTs.
  • When generating hypotheses to be tested in RCTs.
  • When study results need to be generalizable.
  • When study results are needed quickly.
  • When treatment adherence differs.

An RCT is neither desirable nor feasible in every circumstance, said Freedman. High-quality observational study designs and evidence of underlying biological mechanisms can contribute to the evidentiary framework. For example, large prospective cohort studies with very large effects and compelling data can make a convincing argument for clinical utility even if the evidence was only generated with observational studies.

The major concerns about using observational studies to inform clinical utility are that they can be poorly designed, their findings may be difficult to replicate, reliable outcome measures may be difficult to obtain, and they may be subject to bias and confounding through such factors as selection, response, adherence, attrition, or misclassification. These limitations have not been lost on epidemiologists, and several groups are now working on guidelines to ensure high-quality studies that can be replicated. With bias and confounding, for example, new techniques such as instrumental variables or propensity score matching can adjust for the nonrandomization of subjects studied. Also, sensitivity analysis can assess the variability in results.
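
To make the idea of adjusting for nonrandomized treatment assignment concrete, the sketch below uses inverse probability weighting on an estimated propensity score, a simpler relative of the propensity score matching mentioned above rather than matching itself. Everything in it is an assumption made for illustration: the function names, the single binary confounder (for example, sicker versus healthier patients), and the counts, which were invented so that the true treatment benefit is a 0.20 gain in response rate within each stratum.

```python
from collections import defaultdict


def make_records(confounder, treated, n_events, n_total):
    """Build (confounder, treated, outcome) tuples for one cell of hypothetical data."""
    return ([(confounder, treated, 1)] * n_events
            + [(confounder, treated, 0)] * (n_total - n_events))


def naive_difference(records):
    """Unadjusted difference in response rate, treated minus untreated."""
    treated = [y for _, t, y in records if t == 1]
    control = [y for _, t, y in records if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def propensity_by_stratum(records):
    """Estimate P(treated | confounder) from observed frequencies."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_treated, n_total]
    for c, t, _ in records:
        counts[c][0] += t
        counts[c][1] += 1
    return {c: n_treated / n for c, (n_treated, n) in counts.items()}


def ipw_difference(records):
    """Inverse-probability-weighted difference in response rate.

    Assumes every stratum contains both treated and untreated patients,
    and that the listed confounder is the only source of confounding.
    """
    e = propensity_by_stratum(records)
    n = len(records)
    mean_treated = sum(t * y / e[c] for c, t, y in records) / n
    mean_control = sum((1 - t) * y / (1 - e[c]) for c, t, y in records) / n
    return mean_treated - mean_control


# Hypothetical cohort: sicker patients (confounder = 1) are treated more often
# and respond less often, whether or not they are treated.
records = (make_records(0, 1, 18, 20) + make_records(0, 0, 56, 80)
           + make_records(1, 1, 40, 80) + make_records(1, 0, 6, 20))

print(round(naive_difference(records), 2))  # about -0.04: treatment looks harmful
print(round(ipw_difference(records), 2))    # about 0.20: the benefit built into the data
```

The adjustment recovers the built-in effect only because the single measured confounder fully explains who was treated; unmeasured confounding would remain, which is one reason sensitivity analysis is paired with these methods.
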

Considering the Context

Health policy decisions have to take into consideration the clinical context, the type of genomic application, the quality and availability of evidence to assess a marker’s benefits and risks, and the risk to patients that a wrong decision could pose, Freedman said. Clinical utility is particularly difficult to determine because it encompasses the context in which the application is being used. Strong evidence is critical, but situations in which the benefits outweigh the risks often have to be weighed on a case-by-case basis.

A comprehensive approach is needed to resolve questions about the clinical utility of genomic applications, said Freedman. He offered the following suggestions:

  • Future research should consider more outcomes measures and be conducted in settings that are relevant to more real-world clinical decisions.
  • A multitude of stakeholders having a role in evidence generation could lead to better studies.
  • New strategies involving transformation of the research infrastructure to “learning systems” could allow continual additions to the knowledge base.
  • Any changes to the evidentiary framework should uphold rigorous best-research practice standards.
  • Collaborations among cancer centers are essential, particularly to investigate rare cancers.
  • Clear priorities for CER could ensure that limited resources are used to resolve the most compelling questions.
  • An evidentiary framework needs to articulate the minimal evidence necessary before clinical application is warranted.

Designing Studies to Evaluate Biomarkers for Clinical Applications

Lisa McShane, senior mathematical statistician in the Biometric Research Branch, Division of Cancer Treatment and Diagnosis, at the NCI, provided a statistician’s perspective on the issues associated with the development of genomic-based tests. There are “widespread problems in the literature,” she said, and everyone associated with the development of these tests needs to understand how to ensure the quality and interpretability of the studies evaluating them.

Molecular diagnostics have a range of potential roles in medicine (see Figure 4-2). McShane focused on two of those roles: as a prognostic indicator and as a predictive indicator. Prognostic indicators are molecular signatures associated with clinical outcomes in the absence of therapy or with a standard therapy that all patients are likely to receive. In the latter case, an untreated population generally will not exist.

Predictive indicators are molecular signatures associated with the benefit or lack of benefit, or potential harm, from a particular therapy relative to other available therapies. In the simplest case, a patient group with a particular biomarker may benefit from a therapy while a group without that biomarker may not benefit.

[Figure 4-2 panels: Pre-Diagnosis (risk, screening, early detection); Diagnosis (confirmation, staging, subtyping); Pre-Treatment (prognostic, predictive); Intra-Treatment (early response or futility, toxicity monitoring); Post-Treatment (early end point, recurrence or progression monitoring). FOCUS: Initial therapy selection.]

FIGURE 4-2 Potential roles for molecular diagnostics in medicine extend from prediagnosis to posttreatment.
SOURCE: Lisa McShane, workshop presentation, May 24, 2012.

When is a prognostic test clinically useful? McShane asked. First, statistically significant does not mean clinically useful. “The literature is absolutely polluted with studies that show that a biomarker or a genomic test has some kind of prognostic information in it. But very, very few of those will ever make it into anything that is clinically useful,” she said. If a prognostic marker can split a patient population into two groups, one of which has an outcome so much better than the other that different treatment decisions are made, then that might reach the necessary level of utility. An example is the Oncotype DX test, which can identify patients who have such good outcomes after surgery that they do not need chemotherapy.

A test may also split a patient population into two distinguishable groups that both have bad outcomes, however. Unless the test can guide treatment or monitoring decisions, that information may not be clinically useful. Basing clinical decisions on results from such a test may rest on very little evidence.

Test developers also can confuse prognostic and predictive markers, said McShane. A marker may be prognostic of outcomes, but a new treatment may have the same effect in people with or without the marker. This dichotomy emphasizes the importance, said McShane, of appropriate control groups to determine whether a marker is predictive and distinguishes a group that benefits from a treatment.

To determine when a predictive test is clinically useful, a treatment-by-marker interaction can be assessed, but care needs to be taken in interpreting the results. One should not simply rely on statistically significant p-values, McShane said. For example, a new treatment may help only people with a particular biomarker, or it may help people regardless of biomarker status but to different extents. In the latter case, it might be beneficial to give everyone the new treatment, depending on such considerations as toxicity, cost, and patient preferences.

Prospective Versus Retrospective Studies

Prospective studies to establish the clinical utility of molecular tests are much easier to design than retrospective studies, McShane observed. They can use unbiased patient cohorts and adjust for standard variables, and several design options are available to answer specific questions (Freidlin et al., 2010; IOM, 2012b). The challenges with these studies tend to be their feasibility and cost.

Retrospective studies can provide a high level of evidence if they are performed properly. For example, a prospective-retrospective study can produce high levels of evidence for the value of a test if specimens were collected carefully in a clinical trial, if a sufficient number of representative specimens were collected, if the assay was analytically validated, if an analysis plan was prespecified, and if results were validated in one or more similar but separate studies (Simon et al., 2009).

Unfortunately, McShane said, many retrospective studies are poorly conducted. There may not be sufficient specimens available from completed trials or there might not be any suitable trials to address the question of interest. The retrospective study design may be flawed. Patient characteristics may be heterogeneous, and treatments may be unknown, nonrandomized, and not standardized. Specimens may be poorly characterized and accompanied by data of uncertain quality. And the studies may be subject to misinterpretation or deficient reporting.
As an example of the problems that can arise, McShane quoted from the conclusion of an American Society of Clinical Oncology update of the recommendations for the use of tumor markers in breast cancer (Harris et al., 2007, pp. 5287–5289): “The primary literature is characterized by studies that included small patient numbers, that are retrospective, and that commonly perform multiple analyses until one reveals a statistically significant result. Furthermore, many tumor marker studies fail to include descriptions of how patients were treated or analyses of the marker in different treatment subgroups.”

A major problem is that, as McShane said, “if you torture the data long enough, they will confess to anything.” For example, multiple testing of many markers, many end points, many subgroups, and so on can produce false positives. Such testing is particularly problematic when there is no prespecified analysis plan and findings are selectively reported on the basis of statistical significance. “People will do a zillion analyses, and what they will put into the paper are the ones that came up significant,” she said.

Model overfitting is also a major problem, especially with high-dimensional marker data such as those generated by omics technologies. With overly complex models that have too many parameters or predictor variables, the model will describe random error or noise instead of the true underlying relationship. Similarly, if there are many more variables than independent subjects or if the data are sparse in a high-dimensional biomarker space, a model will generally have poor predictive performance on an independent dataset.

Model validation is essential, said McShane, yet many researchers make errors when attempting validations. In particular, as discussed by Leonard, they often use the same dataset to validate the model that they used to train it, even though such resubstitution estimates of model performance are useless (Subramanian and Simon, 2010). There are ways to do internal validation of models using data from the same sample set, but they require careful planning (e.g., Molinaro et al., 2005).

McShane also described several other common problems with evaluations of predictive tests. Randomized clinical trials adequately powered to detect treatment effects are often not sufficiently powered to establish predictive marker effects. For example, the nonsignificance of a treatment effect in a “marker negative” subgroup is often misinterpreted as no treatment effect, even though the test may not be adequately powered to exclude a treatment benefit. Also, sufficient information is sometimes not reported in studies to know whether an effect is meaningful. The p-value does not have much meaning without looking at an estimated effect size along with a measure of its uncertainty (e.g., a confidence interval).
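
To make the last point concrete, the following minimal sketch (not taken from the workshop) computes a risk difference, its 95 percent confidence interval, and a two-sided p-value for a hypothetical marker-negative subgroup. The event counts (12 of 40 treated patients, 18 of 40 controls), the function name risk_difference_ci, and the choice of a simple Wald (normal-approximation) interval are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(events_trt, n_trt, events_ctl, n_ctl, level=0.95):
    """Return (estimate, (lower, upper), p_value) for a difference in event risk.

    Uses the Wald normal approximation with an unpooled standard error,
    which is an illustrative choice and not a recommendation.
    """
    p_trt = events_trt / n_trt
    p_ctl = events_ctl / n_ctl
    diff = p_trt - p_ctl
    se = sqrt(p_trt * (1 - p_trt) / n_trt + p_ctl * (1 - p_ctl) / n_ctl)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Two-sided p-value for the null hypothesis of no difference.
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Hypothetical marker-negative subgroup: events among treated vs. control patients.
diff, (lo, hi), p_value = risk_difference_ci(12, 40, 18, 40)
print(f"risk difference = {diff:+.2f}, 95% CI ({lo:+.2f}, {hi:+.2f}), p = {p_value:.2f}")
```

With these made-up counts the p-value is about 0.16, which is not significant at the 0.05 level, yet the interval runs from roughly a 36-point reduction in risk to a 6-point increase. A subgroup this small cannot exclude a meaningful treatment benefit, which is the misinterpretation McShane warned against.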
Improving Assessments

McShane had several ideas about how to improve assessments of predictive tests. One is to place earlier and more intense focus on clinical utility, with more education about the proper interpretation of results. The test-development process and study design need to be rigorous, using meaningful and well-designed studies, proper statistical analysis, independent external validation, and interdisciplinary expertise.

A biomarker study registry, as suggested by Andre et al. (2011), could aid in identifying relevant biomarker studies for overviews and meta-analyses. It also could make study protocols available, including prespecified analysis plans, and help reduce nonpublication bias and selective reporting. Several sets of guidelines exist that could encourage more complete and transparent reporting (Altman et al., 2008, 2012; McShane et al., 2005; Moore et al., 2011b).

The Worldwide Innovative Networking consortium in personalized cancer medicine has provided seed funding to establish such a biomarker study registry. The aim is to create an entity similar to the website clinicaltrials.gov. Just as journals require that a clinical trial therapy protocol be registered with clinicaltrials.gov from the start, the same would be required of biomarker studies. This would provide a placeholder for studies that do not result in publication. It also would allow the distribution of prespecified analysis plans, thus addressing not only nonpublication bias but also selective reporting in published papers.

Finally, expanded access to useful specimens, including alternative sources of specimens because trial specimens are optimal but limited, would be especially useful. Specimens should be well annotated with clinicopathologic data, treatments, and clinical outcomes. Health maintenance organizations and other large health care entities could be important partners in such efforts because people move around, which makes it difficult to piece together a patient’s data for retrospective analyses.

McShane cautioned that we do not know the optimal way to collect a specimen and preserve it for every technology that might appear in the next several decades. Also, she has been involved with large biobanks of specimens that do not get used for a variety of reasons. But representative specimen collections from populations for which the tests might be used, along with data on treatment, outcome, the handling of specimens, and so on, are an important place to start. Adequate funding for collection and storage of annotated specimens from clinical trials and other carefully followed cohorts would also be very helpful, she added.

Assessing the Value of Oncology-Based Molecular Diagnostics

In theory, molecular diagnostics should save money or at least provide better care at lower cost.
But some observers have called personalized medicine either a myth or hype, and others have declared it unaffordable, said Kathryn Phillips, professor of health economics and health services research at the University of California, San Francisco. Others have called more testing the medical equivalent of Moore’s law in computing, claiming that testing causes more visits to the doctor, which results in exponentially more visits to the doctor. “There’s a lot of debate regarding whether molecular diagnostics are really ever going to provide cost-effective care,” she concluded.

Phillips conducted an informal study using data from the Tufts registry of cost-utility analyses to learn what has already been done on the economics of molecular diagnostics for cancer.2 More cost-utility analyses are available now than in the past, with about 14 percent focusing on cancer. The cost-effectiveness of diagnostics used to treat cancer is similar to that of other conditions. About half have a reasonable incremental cost-effectiveness ratio, but only about 10 percent of the analyses demonstrate that these interventions save money, Phillips said. Another 10 percent of the studies concluded that the interventions cost more and provide fewer health benefits than the standard of care, and these studies generally were done before the recent increase in high-cost diagnostics and cancer drugs. “In general, new health-care interventions do not save money,” said Phillips. “They provide better health [benefits] at a reasonable cost.” Furthermore, the cost-utility analyses of 64 molecular diagnostics for cancer were even less encouraging, with 20 percent costing more and resulting in fewer health benefits.

Challenges to Cost-Effectiveness Analyses

Phillips listed several well-known challenges to the use of cost-effectiveness analyses:

  • The lack of data on effectiveness and costs.
  • The need to consider the effect of the diagnostic on downstream decisions and outcomes.
  • No or limited use of cost-effectiveness analyses by stakeholders.

She noted that diagnostics are complicated to analyze because of the evolving nature of the field, the complexity of the tests, the uncertainties surrounding them, and the nature of the disease. Also, because cancer can be an inherited condition, diagnostic tests may require consideration of family members. For example, a cost-effectiveness analysis of Lynch syndrome screening found that the screening is only cost-effective if family members change their behavior (Ladabaum et al., 2011). “And that’s a big if,” said Phillips.
“If you are just looking at what happens to the proband, then you shouldn’t be doing Lynch syndrome screening.”

Whole-genome sequencing will be the next big dilemma, Phillips said. The complexity is far worse because of the huge amount of data available. Information may range from clinically actionable, to not directly clinically actionable, to having unknown or no clinical significance, with various levels of risk and possible outcomes.

2 The registry is available at https://research.tufts-nemc.org/cear4/Default.aspx (accessed August 10, 2012).

A Focus on Value

Phillips recommended that future analyses focus not only on cost-effectiveness but also on value. Cost-effectiveness analysis is “a hard sell,” she said. Methodological concerns are common, such as defining and measuring quality-adjusted life years. Furthermore, in the United States, there is a lack of support for explicit consideration of cost.

Different frameworks for assessing value exist, said Phillips. Cost-effectiveness analysis can compare one alternative to another, but without the context. Alternatively, just the benefits and risks can be compared. Still, neither of these takes into account the larger impact of the technology on the health care system. For example, a technology may have a good cost-benefit ratio but have little impact because only a few individuals are being treated. A variety of methods can be used to capture the magnitude and scope of the technology being examined, said Phillips. Also, budget impact analysis can be used to assess whether a technology is affordable, for example, within a particular health plan.

As an example of a process for assessing value, Phillips laid out potential steps for conducting a multicriteria analysis:

  1. Establish a decision context. What are the aims? Who are the decision makers?
  2. Identify options.
  3. Identify objectives and criteria that reflect the value associated with the consequences of each option.
  4. Describe the expected performance of each option against the criteria and score options.
  5. Assign weights for each criterion to reflect relative importance.
  6. Combine weights and scores for each of the options to derive overall value.
  7. Examine the results.
  8. Conduct sensitivity analyses of the results.

She noted that this type of framework could be used to make systematic decisions about new health care interventions in a way that still captures costs and benefits but does not solely focus on these variables.
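
Steps 4 through 6 and step 8 of the multicriteria analysis can be sketched as a weighted-sum calculation. This is a minimal illustration under stated assumptions, not the method Phillips presented: the options (test_A, test_B), the criteria, the scores on a 0 to 100 scale, and the weights are all hypothetical, and a linear additive model is only one of several ways to combine weights and scores.

```python
def overall_value(scores, weights):
    """Combine per-criterion scores into one weighted value per option (step 6)."""
    total_weight = sum(weights.values())
    return {
        option: sum(weights[c] * option_scores[c] for c in weights) / total_weight
        for option, option_scores in scores.items()
    }


def rank(values):
    """Order options from highest to lowest overall value (step 7)."""
    return sorted(values, key=values.get, reverse=True)


# Step 4: hypothetical scores (0 = worst, 100 = best) for two hypothetical options.
scores = {
    "test_A": {"health_benefit": 80, "cost": 40, "feasibility": 70},
    "test_B": {"health_benefit": 60, "cost": 90, "feasibility": 60},
}

# Step 5: hypothetical weights reflecting the relative importance of each criterion.
base_weights = {"health_benefit": 5, "cost": 3, "feasibility": 2}
base_values = overall_value(scores, base_weights)
print(base_values, rank(base_values))

# Step 8: a one-way sensitivity analysis that shifts weight from cost to benefit.
alt_weights = {"health_benefit": 7, "cost": 1, "feasibility": 2}
alt_values = overall_value(scores, alt_weights)
print(alt_values, rank(alt_values))
```

In this made-up example the preferred option switches from test_B to test_A when more weight is placed on health benefit, which shows why the sensitivity analysis in step 8 matters: the ranking depends on the weights the decision makers choose, not only on the scores.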
Phillips pointed out that cost-effectiveness analyses based on the ideal world may not adequately reflect actual implementation. Cost-effectiveness analyses are often normative, demonstrating what should be better, when they need to be descriptive, that is, taking into account the full context of care. For example, Elkin et al. (2011) demonstrated that few cost-effectiveness analyses of breast cancer diagnostics explicitly evaluated the relationships among the methods of targeting, the accuracy of the test, and the outcomes of the intervention. The analyses tended to assume that the tests were perfect and did not consider the impact of test thresholds. As another example, Phillips (2008) found many real-world impacts on the cost-effectiveness of testing strategies. Data were missing on groups, especially the uninsured, Medicaid recipients, and minorities. Test results could be inaccurate, and some of the treatment courses did not correspond with what would be recommended by the test results. In addition, the claims and medical records for testing did not match 25 percent of the time.

Cost-effectiveness analyses are being applied to molecular diagnostics for cancer, Phillips concluded, but they raise methodological and political challenges. Focusing more on value than on cost-effectiveness will allow for headway to be made for molecular diagnostics but will also require changes in methods and public discourse. But shifting the focus to real-world analyses of value will allow decisions to be considered in the full context of care.

Advancing the Utility of Oncology Diagnostics

As Robert Bast did in the previous session, Noel Doheny, chief executive officer of Epigenomics, used a specific disease to make several general points about the tools available to assess the clinical utility of molecular diagnostics in oncology.

Colorectal cancer is the second largest cancer killer in the United States, with 50,000 deaths and 140,000 new cases each year, or 15 to 20 deaths per 100,000 inhabitants (ACS, 2012). It is a disease of the developed world, with the highest prevalence in North America and Europe. Colorectal cancer is curable, though, if it is detected early enough. The 5-year survival rate for diagnosed and treated stage I or II colorectal cancer is 90 percent (ACS, 2011). Colorectal cancer costs the U.S. health care system $17 billion per year: $7 billion in the initial year, $5 billion in the continuing-care years, and $5 billion in the last year of life (CDC, 2011).

The key challenge to improving health and controlling costs, said Doheny, is changing current noncompliance with colonoscopy or stool-based screening. “If you [detect colorectal cancer] early, you never get to that last year where the costs go through the roof,” he said.

In the United States today, 100 million people are eligible for colorectal screening. Half of them get a colonoscopy; another 12 percent or so are tested by fecal tests; and the rest do not undergo any form of screening. Major impediments to screening are unpleasantness associated with stool tests; time constraints, risks, and fears associated with the colonoscopy preparation and anesthesia; and unreimbursed costs. Noncompliance is also greater among people who lack health insurance, have no other source of health care, and have not visited a doctor within the preceding year. In addition, patients are often lulled into a false sense of security after one colonoscopy and fail to undergo subsequent testing.

Colonoscopy and fecal tests are not perfect. Colonoscopists differ in their ability to detect an adenoma, a polyp, or colorectal cancer in patients, and there are discrepancies in screening between the left and right side of the colon. Nevertheless, if everyone older than age 50 were screened regularly, as many as 60 percent of deaths from colorectal cancer could be prevented, suggested Doheny.

A Blood Test for Colorectal Cancer

The availability of a blood test could promote higher rates of screening by providing the ability to evaluate patients who would not otherwise have been screened, said Doheny. In a survey of more than 1,300 adults, 75 percent said they would be more likely to get screened more frequently if a blood test were available.3

Epigenomics has been developing a blood test for colorectal screening based on a circulating marker called methylated Septin9. The test, which was going through the FDA’s premarket approval process at the time of the workshop (Vogel et al., 2006), uses real-time polymerase chain reaction to detect free-circulating tumor DNA in blood. If one of three triplicate polymerase chain reaction tests is positive, the assay is considered to be positive, and patients are sent for a colonoscopy. The intended use of the test as filed in the premarket approval application is as a qualitative assay to aid in screening for patients with average risk.

During the test’s technology development phase, Epigenomics conducted case-control studies using patients’ samples that were both positive and negative for colorectal cancer to optimize the research protocol and increase the sensitivity and specificity of the test. It then gauged the performance of the test in two prospective studies.
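
The effect of calling the assay positive when one of three replicate reactions is positive can be illustrated with a short calculation. The sketch below assumes that "one of three" means at least one, that the three replicates err independently of one another (replicates run on the same blood sample are likely to be correlated, so this is an idealization), and that every replicate has the same sensitivity and specificity. The per-replicate values of 0.50 and 0.97 are hypothetical and are not performance figures for the Septin9 test.

```python
def any_positive_rule(sensitivity, specificity, replicates=3):
    """Assay-level sensitivity and specificity when any positive replicate counts.

    Assumes replicates are independent and identically accurate (an idealization).
    """
    # The assay misses a cancer only if every replicate misses it.
    assay_sensitivity = 1 - (1 - sensitivity) ** replicates
    # The assay is negative in a cancer-free person only if every replicate is negative.
    assay_specificity = specificity ** replicates
    return assay_sensitivity, assay_specificity


# Hypothetical per-replicate performance, not measured values for any real test.
assay_sens, assay_spec = any_positive_rule(0.50, 0.97)
print(f"assay sensitivity = {assay_sens:.3f}, assay specificity = {assay_spec:.3f}")
```

Under these assumptions the rule raises sensitivity from 0.50 to 0.875 while lowering specificity from 0.97 to about 0.913. A rule of this kind trades more false positives for fewer missed cancers, which fits a screening test whose positive results are followed by colonoscopy.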
In one of the trials, of about 8,000 people, about two-thirds had no evidence of disease, and the test found 51 cancers among this group. A “very positive dialogue” with the FDA subsequently changed the number of patients considered to have no evidence of disease, which “clarified what we needed to do in this trial,” said Doheny. At the time of the workshop, the company was getting ready to submit its clinical data later in 2012.

Epigenomics has been developing the test as an LDT, but on an interim basis. It has given licenses for the test to two laboratories, licenses that will automatically end when the test receives regulatory approval and clearance. This procedure ensures that the laboratories switch to using the regulated product. About 1,000 tests per week are being done in the United States currently, which means that about five patients each week are being detected by a blood test as having colorectal cancer that would not have been detected if that test had not been done. “That, in our eyes, is a very positive position to be in,” said Doheny.

3 Results of the survey are available at http://www.prnewswire.com/news-releases/nearly-one-in-three-men-and-women-age-50-and-over-have-not-been-screened-for-colon-cancer-one-in-four-say-their-healthcare-provider-didnt-recommend-screening-124002559.html (accessed August 9, 2012).

Remaining Challenges

Doheny discussed some of the issues that “keep him awake at night.” Imperfect standards are firmly entrenched, he said. Colonoscopies are not perfect, but they have become the gold standard. Also, incentives are misaligned for providers and payers, with current rewards disproportionately skewed toward chemotherapy.

Innovation is being driven by small companies in a difficult capital environment. Large companies tend to buy de-risked assets, and many small companies cannot afford the up-front costs of diagnostics development. Said Doheny, “It’s very difficult to build an accurate, meaningful, and valid business model.”

He also described several opportunities for improvement. Research partnerships are needed between pharmaceutical companies, diagnostic companies, and government. Although the incentives for these groups are different, they could be aligned if ways were found to combine complementary strengths, Doheny said.

Too few care approaches allow full cost clarity from first patient encounters to interventions and follow-up, Doheny observed. But several experiments, such as those being conducted by the Kaiser and Veterans Administration systems, could show a path forward.

He also suggested that the payment process could be linked through a visible mechanism to the regulatory process.
In Japan, for example, when a company receives clearance on a project, a level of payment is established for each time the test is performed, which is an effective “closed loop mechanism,” according to Doheny. Similarly, rewards could be differentiated on the basis of the regulatory credential of the offering.

A registry of patients with apparent false positives is needed to demonstrate clinical utility, he said. Also, a “platinum” standard to compensate for colonoscopy variability should be identified.

The collection of blood and tissue from patients with cancer, including those who die, could greatly advance research, Doheny said. He also recommended creating an accelerated review and publication format specifically for personalized medicine assays to overcome the extended and biased review cycles in traditional publications. Finally, real-world LDT performance should be linked to FDA filings, he said.

Doheny concluded by saying that “[waiting for] perfection takes too long. Why don’t we just move ahead on some of these as appropriate?”
