

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




OCR for page 55
3 Assessment: Problems and Proposed Solutions

The purpose of this chapter is to describe some of the problems one encounters in evaluating diagnostic technology and to propose an approach that avoids many of them. Our underlying premise is that the public is ill-served by current approaches to assessment of diagnostic technology.

Medical tests are more difficult to evaluate than medical treatments. A treatment is typically evaluated through a clinical trial in which patients are randomly assigned to the treatment group or to a control group, which may receive a placebo or conventional therapy. The endpoint of the trial may be physiologic (for example, blood pressure) or functional (such as the ability to walk without developing chest pain), but most often the endpoint is the development of a disease (for example, acute myocardial infarction) or death. In the best trials, therapy subsequent to randomization is controlled, and the only variable that differentiates the intervention and control groups is the intervention. Under these circumstances, the investigators are often able to attribute differences in outcome to the intervention.

By contrast, evaluation of a diagnostic test can occur at several levels, as outlined by Fineberg (Fineberg et al. 1977).

¹ Parts of this chapter are adapted from a technical report (Sox 1987).

MEASURES OF CLINICAL EFFICACY

Technical Capability

This measure answers the question, "Does the test do what the manufacturer says it does?" For example, an MRI scanner meets this criterion if it produces a crisp image of the brain, regardless of whether that image fairly reflects the true state of the brain. The Food and Drug Administration currently requires this level of assessment for diagnostic technologies before it will issue premarketing approval.

Sensitivity and Specificity

These two measures of test performance are the most widely used indicators of efficacy. They may help one to decide which of several diagnostic tests is superior, but the verdict is sometimes a split decision: one technology has a lower false-negative rate while the other has a lower false-positive rate. Furthermore, these measures are not sufficient to indicate whether the test should be done. In many cases, the test is so inaccurate and the treatment so safe and effective that the patient should be treated without testing in order to avoid the possibility of being misled by a false-negative result.

Diagnostic Impact

Do the test results alter the pattern of diagnostic testing? Does the test replace other tests, including some that are more hazardous or costly? This outcome is relatively easy to measure, and, because it occurs in the near term, one can often attribute an effect on patterns of testing to a new technology. Noninvasive methods of imaging internal organs have had a major impact on medical care because the information they provide has reduced the number of invasive diagnostic studies performed. CT scanning of the head has reduced the number of craniotomies for head trauma (Ambrose et al. 1976). This measure of efficacy, however, is not sufficient to answer the important question, "Should I do this test on this patient?"

Resolution of diagnostic uncertainty is one measure of diagnostic impact.
There is ample evidence that patients seek relief from uncertainty and that diagnostic tests play a role in satisfying them (Sox et al. 197S, Marton et al. 1980). The physician must use reassurance when it is indicated, as when a test result reduces the probability of disease to the point where no further intervention is needed. Reassurance following a negative result on a test that has a high false-negative rate may not be appropriate in some cases, particularly if the physician strongly suspected that disease was present before doing the test.

Therapeutic Impact

If a test alters the choice of the treatment for the patient, it meets this criterion for efficacy. The threshold model is built around the assumption that an effect on therapy is the sine qua non for doing a test. But, as indicated in Chapter 2, a test may alter therapy in one patient but not in another, depending on the pretest probability and the treatment threshold.

Impact on Clinical Outcomes

The ultimate measure of a test is its ability to alter the patient's outlook by leading to changes in management that reduce symptoms or prolong life. The determinants of long-term outcome are many. The accuracy, cost, and morbidity of a test may be much less important than when it is done in the natural history of the illness. The most important determinant of clinical outcome is therapy rather than diagnosis (Abrams and McNeil 1978). Improved imaging of metastases to the liver from a colon cancer does not improve the patient's outcome because there are no highly effective treatments for metastatic colon cancer. Imaging of metastases may, however, spare a patient from abdominal exploratory surgery that cannot alter the long-term prognosis. An improved short-term outcome does not necessarily imply an improved long-term outcome.

This summary of the measures of clinical efficacy indicates the futility of basing a decision about a technology on a single dimension. The way through this dilemma is to focus on the patient's needs. The right question about a technology is "Will this maximize this patient's chances for the best achievable outcome?"
In some cases, the answer to this question is the same for a large class of patients, and one can formulate a general recommendation. In others, the answer depends on the value that the individual patient places on the outcomes that the illness and its treatment may entail. In this case, a general recommendation may not be possible. In this chapter, we show how a technology assessment can provide the data that allow a physician to identify which management alternative will maximize the patient's chances for a good outcome.
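
The earlier caution about reassurance after a negative result can be made concrete with Bayes' theorem. The following is a minimal sketch and is not part of the original chapter: the function name and every number in it (a 30 percent false-negative rate, 90 percent specificity, and the two pretest probabilities) are hypothetical values chosen only for illustration.

```python
def post_test_probability(pretest, sensitivity, specificity, positive):
    """Probability of disease after a test result, by Bayes' theorem."""
    if positive:
        p_result_if_diseased = sensitivity
        p_result_if_healthy = 1.0 - specificity
    else:
        p_result_if_diseased = 1.0 - sensitivity   # false-negative rate
        p_result_if_healthy = specificity
    numerator = pretest * p_result_if_diseased
    denominator = numerator + (1.0 - pretest) * p_result_if_healthy
    return numerator / denominator


# Hypothetical test: sensitivity 0.70 (false-negative rate 0.30), specificity 0.90.
low_suspicion = post_test_probability(0.05, 0.70, 0.90, positive=False)
high_suspicion = post_test_probability(0.60, 0.70, 0.90, positive=False)

print(f"Negative result, pretest 0.05: {low_suspicion:.3f}")
print(f"Negative result, pretest 0.60: {high_suspicion:.3f}")
```

Under these assumed numbers, the same negative result leaves a probability of disease below 2 percent when suspicion was low but about one in three when suspicion was high, which is why reassurance may be unwarranted in the second patient.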

THE SYSTEM OF TECHNOLOGY ASSESSMENT IN THE UNITED STATES

The Pyramid

The system of technology assessment in the United States is a pyramid with several layers. The broad base of the system consists of clinical studies in which physicians subject patients to a technology and observe them for its effects. These studies include a research question, which usually calls for the comparison of several technologies; they also include a rigorous study design and meticulous implementation of the study protocol. In studies of diagnostic technology, the index test, one or more competing tests, and a gold-standard test are performed in a series of patients. The published reports of these studies seldom put their findings into a form that helps the clinician to decide which patients will benefit most from the test or procedure.

A second layer is made up of individuals who review the literature and try to distill the evidence into recommendations that are true to the facts. These individuals are typically clinicians who have training in the disciplines of meta-analysis, clinical epidemiology, decision analysis, and cost-effectiveness analysis. This method has been used to identify outmoded, overused, and ineffective technologies. More recently, it has been used to make recommendations for using a diagnostic test or choosing among tests.

On the next layer are organizations that do technology assessment. They differ in their approach, but the starting point is frequently a technical background paper written by an individual who reads the literature and proposes guidelines for using the technology. The conclusions of this paper are reviewed by others, and clinical policy recommendations are forged by some consensus process. The American College of Physicians' Clinical Efficacy Assessment Program (CEAP) is a prototype of this approach.

Policymakers sit atop the pyramid and are the ultimate consumers of technology assessment.
What they consume is the product of analysis and consensus, and it generally takes the form of recommendations about the usefulness of a technology. The individual physician, who bases decisions about using technology on published reports of assessments, is a policymaker. Other policymakers work for third-party payers, who exert control over medical practice by their coverage policy.

This description of the technology assessment system in the United States shows that many individuals and organizations depend on good studies of technology. We use the term "primary technology assessment" to denote studies in which clinical data are obtained systematically on patients who have been subjected to a health intervention, such as a diagnostic test or treatment. In the next section, we discuss some of the methodological problems that are encountered in doing primary assessments of diagnostic tests.

PROBLEMS WITH THE CURRENT SYSTEM

Standards of evidence are incomplete. The standard of evidence for the efficacy of therapeutic technologies, such as surgical operations or drugs, has become the randomized clinical trial. This standard may be insufficient for clinical decisionmaking. One drug is considered superior to another if there is a statistically significant difference in a measure of outcome, such as survival. Achieving this criterion does not mean that the drug should be used in all patients. This decision may depend on the characteristics of the individual patient, including the value he or she places on the benefits, adverse effects, and costs of the drug. One can use expected-value decisionmaking to identify the best alternative for an individual patient.

According to the decision model described in Chapter 2, the usefulness of a test depends on the clinical circumstances. Among these is the pretest probability of disease. One of a pair of competing tests may be preferred in patients with a low pretest probability of disease, while the other test should be preferred in patients with a high pretest probability. In summary, we suggest that the efficacy of a test is context-dependent.

Studies do not gather the data needed for decisions in individual patients. Large-scale, randomized trials sometimes lead to the conclusion that a given therapy is preferred only in a subgroup of patients. They do so by gathering the clinical data necessary to subclassify patients. Studies of diagnostic tests could be carried out in the same way, but they seldom are.
For example, published studies of diagnostic tests infrequently report clinical prediction rules for estimating pretest probability.

Studies of technology often apply only to a narrow spectrum of patients. As discussed in Chapter 2, the patients who are enrolled in a study of a diagnostic test are often a small minority of those who actually receive it (Philbrick et al. 1980). Similarly, randomized clinical trials may exclude many patients, such as those with more than one disorder, in order to maximize the chances of obtaining an unequivocal answer. The results of these studies may not apply to many patients of concern to clinicians.
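
The suggestion earlier in this section that one of two competing tests may be preferred at a low pretest probability and the other at a high pretest probability can be illustrated with expected-value arithmetic. The sketch below is not from the chapter: the two tests, their sensitivities and specificities, and the four utility values are all hypothetical, and the strategy modeled is simply "test, then treat only if the result is positive."

```python
# Hypothetical utilities (0 = worst outcome, 1 = best outcome).
UTILITIES = {
    "treated_diseased": 0.90,
    "untreated_diseased": 0.50,
    "treated_healthy": 0.95,
    "untreated_healthy": 1.00,
}

# Hypothetical competing tests: A misses fewer cases, B raises fewer false alarms.
TEST_A = {"sensitivity": 0.95, "specificity": 0.80}
TEST_B = {"sensitivity": 0.80, "specificity": 0.95}


def expected_utility(pretest, sensitivity, specificity, utilities):
    """Expected utility of testing and treating only patients who test positive."""
    if_diseased = (sensitivity * utilities["treated_diseased"]
                   + (1 - sensitivity) * utilities["untreated_diseased"])
    if_healthy = ((1 - specificity) * utilities["treated_healthy"]
                  + specificity * utilities["untreated_healthy"])
    return pretest * if_diseased + (1 - pretest) * if_healthy


def preferred_test(pretest):
    """Return 'A' or 'B', whichever has the higher expected utility."""
    eu_a = expected_utility(pretest, **TEST_A, utilities=UTILITIES)
    eu_b = expected_utility(pretest, **TEST_B, utilities=UTILITIES)
    return "A" if eu_a > eu_b else "B"


for p in (0.05, 0.50):
    print(f"pretest probability {p:.2f}: prefer Test {preferred_test(p)}")
```

With these assumed values the more specific test is preferred when disease is unlikely and the more sensitive test when disease is likely. A fuller analysis would also compare the "treat without testing" and "do nothing" alternatives, which this sketch omits.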

Studies of diagnostic tests often do not compare a new test with an established test. Randomized clinical trials of treatments usually compare a new therapy to an established therapy or a placebo. Studies of diagnostic tests often do not compare one test with a competing test. When competing tests are compared, the design of the study usually precludes a complete answer to such questions as, "Should I do Test A but not Test B? Both Test A and Test B? Test A followed by Test B only if Test A is negative?"

Studies are seldom timely. The earliest studies of a new technology tend to be misleadingly optimistic about its performance, often because the study populations are not clinically relevant (Ransohoff and Feinstein 1978). Practice patterns are often established on the basis of early studies. Similarly, when hospital managers decide to invest in a new technology, they must often base their decision on early studies. Therefore, the quality of early studies must be improved.

Technology is constantly changing. By the time a study is completed, the test or imaging device has changed, and no one believes that the results apply to the new, improved technology. Technical changes may improve the image provided by a scanner, but they do not necessarily lead to a lower false-negative or false-positive rate, nor do they guarantee improvement in clinical outcomes. Technology assessment should be done quickly. For example, a multi-institutional study could take but a few months. Also, there should be a system for monitoring, and perhaps reevaluating, the technology as it matures.

The results of a study may apply to a narrow spectrum of the users of the technology. Published assessments of a diagnostic technology are usually done in academic medical centers. The use of the technology in such centers may differ greatly from its use in a community hospital.
The indications for using the test, the spectrum of patients, the technique for using the equipment, and the skill of the clinician who interprets the results are but a few of the areas in which an academic medical center may differ from a community hospital.

Two recent case reports illustrate some of the difficulties that are caused by inadequate primary technology assessment.

Case Report 1. Premature obsolescence: standard chest X-ray tomography. Computed tomography (CT) was widely adopted before it had been compared with what was then the standard method for imaging the chest, standard X-ray tomography. Relatively few studies had compared the tests in the same patients. A review that compared their accuracy brought out some unexpected findings (Inouye and Sox 1986). CT was superior to standard tomography for some indications. When 16 studies of chest tomography for mediastinal metastases were reviewed, however, the frequency of false-negative results was lower for CT, but the frequency of false-positive results was lower for standard tomography. Furthermore, the differences in accuracy were too small to be important for decisionmaking. By now, however, most radiologists consider standard tomography to be obsolete in the study of most intrathoracic disorders.

Comment: Large-scale, multi-institutional, prospective studies comparing CT and standard tomography should have been done very early in the history of the new technology. These might have shown that the two procedures were equivalent in most patients and might have defined patient subgroups in which one test was clearly superior.

Case Report 2. Premature adoption of a new technology: magnetic resonance imaging. Magnetic resonance imaging (MRI) is being adopted by hospitals throughout the United States and may eventually replace computed tomography (CT) in studies of the central nervous system (Steinberg et al. 1985). MRI provides a remarkable definition of central nervous system structures. The images are striking in their detail, but those who purchase MRI scanners or use them should ask several pertinent questions: Does the improved image lead to lower false-negative rates without increasing false-positive rates? Does MRI lead to useful changes in diagnostic certainty, choice of therapy, or even clinical outcome? The answers to these questions were not available when many MRI scanners were purchased, because most early studies of MRI were relatively unsatisfactory (Kent et al. 198S, NIH Consensus Conference 1988).
We now turn to a discussion of how diagnostic tests should be evaluated.

RANDOMIZED TRIALS OF DIAGNOSTIC TESTS

A well-designed and well-executed randomized clinical trial is widely regarded as the most powerful method for comparing technologies. Sources of ambiguity in data interpretation are, in principle, removed by randomization, because this process assures that all potentially influential variables, known and unknown, are distributed equitably among the study groups. Blinding of the investigator and the patient to the assigned intervention reduces bias in obtaining data from patients. A well-conducted trial has internal safeguards to assure strict adherence to the study protocol.

Limitations of Randomized Trials

The cost may be high. Randomized trials can be very costly if standardization of the intervention requires special care for patients. By focusing on effectiveness (measuring effects under usual patient care conditions) rather than efficacy (measuring effects under ideal circumstances), the costs of a randomized trial can be kept to a minimum.

The study population may be too small. Many chronic diseases progress slowly, and outcome events accumulate slowly unless the study population is very large. Evaluating an intervention in subgroups of patients may require an unrealistically large number of patients. A large study population is also required if the intervention is expected to have a small effect. These problems can often be avoided by self-discipline when formulating the study hypotheses. Sometimes the requirement for a large sample size is unavoidable, and many medical centers may be required to assemble a sufficient sample of patients.

The technology may become obsolete before the study is complete. Studies that continue for many years run the risk that the results will be irrelevant because of technological advances that have occurred during the years of the study.

The results may apply to a narrow spectrum of patients. Most randomized trials exclude many patients. For example, only 12.7 percent of the patients in the Coronary Artery Surgery Study were randomized to receive surgery or medical therapy (CASS Principal Investigators 1983). The remainder were not enrolled because they met one of many exclusion criteria.
A study performed in a single institution may have a limited spectrum of study patients. Because of these problems, the results of a study may not necessarily apply to patients who are important both to clinicians and to policymakers. The exclusion of patients older than age 65 from the Coronary Artery Surgery Study is an example (CASS Principal Investigators 1983). Ideally, a randomized trial should include a wide spectrum of care facilities and should enroll patients who might be excluded from other studies.

The trial may not measure outcomes of clinical interest. By focusing on the principal clinical hypothesis, past randomized control trials have often failed to study other measures of the effect of the intervention. Return to work, psychological status, and social function all measure the impact of successful treatment. Many observers feel that these "secondary endpoints" are as important as the primary endpoint of the study, which is usually mortality from the disease. For example, cost-effectiveness is becoming a study endpoint in many trials.

Trial Design

A randomized trial of a diagnostic test is a powerful method for evaluating its effects on patient care. The test can be compared against another test or against no test. In general, a comparison with no test is not ethically sound. Most would agree that a patient could be randomized to no test only if the clinical history could be relied upon to be sure that the patient does not have the disease in question. Under these circumstances, little can be learned about the effect of the test, other than its psychologically mediated effects (Sox et al. 1981). In general, a randomized trial will compare two putatively similar diagnostic tests, such as MRI and CT. There have been few such studies, and this approach deserves greater use.

One of the advantages of a randomized trial is that the principal study endpoint is a clinical outcome (for example, length of hospitalization, use of other tests, morbidity, or mortality). In contrast to studies that measure test accuracy, there is no need to perform a potentially dangerous gold-standard test on all patients. This advantage suggests two types of randomized trials of diagnostic tests.

"Off the Gold Standard"

If the goal of the study is to measure clinical outcomes rather than test accuracy, one can ethically enroll anyone who needs the index test.
Study patients are randomly assigned to have the index test or the alternative test and are then monitored for the occurrence of the endpoint of the study, which could be length of hospitalization, total cost of care, or functional status one month after enrolling. Being able to enroll all patients means that there is little problem with bias in selecting patients, and the findings will apply to primary care populations. A randomized study can show which diagnostic test is superior, and subgroup analysis can identify patients who benefit particularly from a given test. Nevertheless, this type of randomized trial cannot measure the true-positive and false-positive rates of the index test, because the gold-standard test will be performed irregularly, or perhaps not at all. Therefore, this study does not provide all the information that is required to interpret a test or to decide if it is necessary to perform the test.

"On the Gold Standard"

Studies that directly compare two tests are important. One way to do these studies is to perform both tests on each of a series of patients. This approach is costly, time-consuming, and potentially risky for the patient, and many patients may refuse to enroll. The alternative is to allocate patients at random to one of two putatively equivalent tests, perform the gold-standard test on all patients, and then measure clinical outcomes. This approach allows one to compare accuracy and effect on short-term clinical outcomes, such as short-term morbidity and mortality, reduction in diagnostic uncertainty, and altered choice of therapy and other technologies. The shortcoming of this approach is that many patients in the clinically relevant population will not be enrolled because their physicians do not refer them for the gold-standard test. The result will be biased measures of test performance and a relatively select study population, which compromises the generalizability of the outcome studies.

A randomized trial that compares the effect of two diagnostic tests on clinical outcomes poses another potential problem. In a trial, the test is done on all patients who are assigned to have it, rather than on those selected because the test was indicated. If there is a narrow range of pretest probabilities for which a test is likely to be useful, few patients who are randomly assigned to get the test will benefit from it. As a result, the number of patients needed to detect a clinically significant effect on outcomes may be very large, and there is a particularly high probability that a negative result will fail to detect a clinically significant true difference.

The randomized trial of the effect of a test on clinical outcomes has been underutilized and deserves greater attention from investigators.
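
The sample-size problem described above can be sketched numerically. The example below is an illustration added here, not a calculation from the chapter. It uses the usual normal-approximation formula for comparing two proportions, and it assumes hypothetical figures: a 20 percent adverse-outcome rate in the comparison arm, a 10 percent rate among patients whose pretest probability falls in the range where the index test can change management, and a varying fraction of randomized patients who fall in that range.

```python
import math
from statistics import NormalDist


def patients_per_group(rate_control, rate_index, alpha=0.05, power=0.80):
    """Approximate patients per arm to detect a difference in two outcome rates
    (two-sided test, normal approximation, unpooled variance)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = rate_control * (1 - rate_control) + rate_index * (1 - rate_index)
    difference = rate_control - rate_index
    return math.ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)


def diluted_rate(rate_control, rate_if_helped, fraction_helped):
    """Outcome rate in the index-test arm when only a fraction of patients benefit."""
    return rate_control - fraction_helped * (rate_control - rate_if_helped)


# Hypothetical rates: 0.20 in the comparison arm, 0.10 among patients who benefit.
for fraction in (1.0, 0.5, 0.25):
    rate_index = diluted_rate(0.20, 0.10, fraction)
    n = patients_per_group(0.20, rate_index)
    print(f"fraction who can benefit {fraction:.2f}: about {n} patients per arm")
```

Because the required number grows roughly with the inverse square of the observed difference, halving the fraction of patients who can benefit roughly quadruples the number of patients needed under these assumptions.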
Much of this attention should be directed at the potential problems of study design and interpretation.

A PROPOSAL FOR MODEL-DRIVEN TECHNOLOGY ASSESSMENT

Most studies of diagnostic tests have measured little more than the false-negative rate and false-positive rate of a given test. This section describes an approach that we call "model-driven." In model-driven technology assessment, the data to be obtained are specified by a method for making decisions (Sox 1987, Phelps and Mushlin 1988). We have used the threshold model for test-treatment selection to illustrate this discussion, but the particulars of the model are less important than the principle: one should obtain the data that will enable the clinician to identify the decision alternative that will be most useful to the patient.

A Technology Should Be Compared with a Competing Technology

The decision to adopt a new technology often means abandoning an older technology. In evaluating a technology of any kind, one should ask in what ways it is better than another (its marginal effectiveness). Many studies of diagnostic technology have not been comparative. There have been very few randomized trials comparing the effects of tests on outcomes. Too few studies have compared the accuracy of two tests by doing both of them on a series of patients.

The ideal study. The marginal effectiveness of a technology may be measured and its true value discovered only by comparison with another clinical method. A new technology may be compared with an old one, or two established technologies may be compared.

There are two types of studies of diagnostic tests. Ideally, a diagnostic test will always be compared to some other method for obtaining information, such as the patient's history and physical examination or another diagnostic test.

First, the effects of two or more tests on clinical outcomes can be compared. The marginal effect of a new test may be discerned by a randomized trial in which the effect of the test on patient care outcomes is measured directly, rather than inferred from probabilistic and decision-analytic models. The potential limitations of this approach are discussed in the preceding section.

Second, the performance characteristics of the tests can be compared.
Comparing the frequency of false-negative and false-positive results in two or more tests provides the necessary data for a decision model that will help to indicate which test is preferred. Patients can be randomized to have one test or the other, or both tests can be done for each patient.

Studies Should Be Planned Before Enrolling the First Patient

Most studies of diagnostic tests have retrospectively analyzed data that had been obtained for another purpose. Thus, they describe clinical experience rather than planned research. Typically, the index test has been performed on many patients, but only a few have had the gold-standard test. The characteristics of the index test group are seldom compared with the characteristics of those who also undergo the gold-standard test. Other clinical data have been obtained irregularly. Most of the defects of past studies are attributable to their retrospective character.

The ideal study. A study should be planned in advance to assure adherence to a uniform data collection protocol. Bias in selecting patients and interpreting data can be reduced by planning. In a multicenter study, all participants follow the same data collection protocol.

All the Data that Are Needed for Clinical Decision Making Should Be Collected

Past studies have measured the accuracy of a test, but they have not collected all the data required to help physicians make decisions concerning individual patients. For instance, sequences of tests are not reported, although physicians must often choose between doing such a sequence or doing one test. Using Bayes' theorem to interpret the second test in a sequence usually requires assuming that the performance of the second test is conditionally independent of the results of the first. In some studies, two tests have been performed on a series of patients, and the operating characteristics of each test have been reported; seldom reported, however, is the frequency of each combination of results (both positive, Test A positive and Test B negative, Test A negative and Test B positive, both negative) in diseased and nondiseased patients.

Test results are not reported as continuous variables. The interpretation of a test usually depends on the extent of the abnormality. Thus, an orange-size lung mass is more likely to be malignant than a pea-size mass.
To make use of this information in decisionmaking, the false-negative rate and false-positive rate for an orange-size mass should be reported separately from these rates for a pea-size mass. In most published studies, the results have been reported simply as "normal" or "abnormal." As discussed in Chapter 2, the operating characteristics of a test really reflect the criterion for calling a particular result "abnormal." Thus, optimal decisionmaking requires reporting the true-positive rate and false-positive rate of each of a series of definitions of an abnormal result.

The ideal study. The ideal study of technology is model-driven: the data to be obtained are those required by a model of the decisionmaking process. According to the principles of expected-value decisionmaking, the clinician must know the pretest probability of disease and must be able to calculate the patient's expected utility for each of the decision alternatives: treat without testing, do Test A, do Test B, or do nothing. To provide the data needed for decision making, studies of diagnostic tests should:

Develop clinical prediction rules for estimating pretest probability. Clinical prediction rules estimate the probability of disease from the history and physical examination and other data (see Chapter 2). To develop a clinical prediction rule, one must obtain a complete problem-related data set on each patient and do a gold-standard test to define his or her true state. These data are easily obtained at little additional cost in a prospective study to measure the false-negative rate and false-positive rate of a test.

Measure the false-negative rate and false-positive rate of sequences of tests. In a study that compares several diagnostic tests, each test should be performed on each patient in the study, and the results of one test will be reported separately under each set of results for the other tests. For example, suppose that Test A and Test B are performed on all study patients. The false-negative rate and false-positive rate of Test B will be reported in patients who had a positive result on Test A and in patients who had a negative result on Test A.

Report the operating characteristics for several different results of the test. A study should enroll enough patients to report the accuracy of the test in subgroups of patients who show increasingly abnormal results. Results should be reported as receiver operating characteristic (ROC) curves. Tests can be compared by calculating the area under their ROC curves, although a more clinically useful comparison is the range of disease probability over which the test is preferred.

Provide a decision model for identifying the preferred option.
The clinician can use the principles of expected-value decisionmaking to identify the decision that will maximize the patient's chances for a favorable outcome. The principles of expected-value decisionmaking, and of designing a decision model or decision tree, are described in Chapter 2. The decision tree requires one to estimate the probabilities at the chance nodes, which are usually obtained from published studies but could be obtained from analysis of insurance claims data (Barry et al. 1988). The tree also requires a quantitative measure for each outcome. This measure could be life expectancy or a measure of patient preference, such as utility. Each study patient's utility for each outcome state will be measured using standard utility assessment techniques.

Consider treatment threshold probabilities. For many problems, the threshold model of decisionmaking can help the physician decide whether to treat, withhold treatment, or do a test or sequence of tests. The physician can use intuition to estimate an individual patient's treatment threshold or can use the analytic methods that were described in Chapter 2. The treatment threshold will vary from patient to patient because of their different outcome preferences and their different clinical characteristics. The distribution of treatment thresholds will provide an essential background for physicians as they estimate the individual patient's threshold. If the range of treatment thresholds is relatively narrow, one can make general recommendations for using diagnostic tests.

Bias in Patient Selection Should Be Avoided

In past studies of diagnostic tests, the study population has differed significantly from the patients who undergo the test in the usual course of medical care. This defect of past studies is the most important and the most difficult to solve. Chapter 2 contains a description of the selective forces that lead to a biased spectrum of study patients. This defect leads to test measurements that lack external validity and could seriously mislead the clinician.

The ideal study. All patients who receive an established test in customary and usual practice should be included in the study population if possible. Exclusion and inclusion criteria, if they are needed, should be stated in the study protocol.

The most troublesome selective factor is "workup bias." There are several ways to avoid this problem.
The best way is to avoid using a gold-standard test that is unpleasant, costly, and risky. For example, in evaluating the accuracy of rectal ultrasound for evaluating prostate nodules, one can use needle biopsy of the prostate as the gold standard. This procedure can be performed so easily that there is no barrier to referring patients.

Another way to avoid workup bias is to be sure that a positive index test is not used as a criterion for obtaining the gold-standard test. One way to assure compliance is to obtain the index test only in patients who have had the gold-standard test.

A third way to avoid workup bias is to use long-term follow-up as an ultimate measure of whether or not the patient had the disease. Thus, all patients who do not get an invasive gold-standard test for cancer would be evaluated periodically for the appearance of a cancer that was initially missed by the index test.

Patients Should Be Observed for Adverse Effects of the Index Test

Most studies of diagnostic tests have not included any clinical outcome measures other than diagnosis. Ill effects have seldom been assessed, other than to note direct complications (death and disability from the procedure itself). Other ill effects, such as psychological dependence on test results (Sox et al. 1978), expensive workup of false-positive results, and mistakenly labelling the patient as diseased, have seldom been investigated.

The ideal study. All patients should be monitored to detect any delayed effects of the test. A prospective study can incorporate these important study endpoints at a small additional cost. A research assistant can perform clinical follow-up of each patient by administering a questionnaire and by reviewing the patient's medical record.

Interpretation of Data Should Be Free of Bias

The index test and the gold-standard test should be interpreted independently to avoid having the results of one influence the interpretation of the other. In some published reports, each test has been interpreted independently, but the protocol for interpreting the index test and the gold-standard test is not usually described. One way to avoid biased interpretation is to have standardized, written criteria for classifying test results.

The ideal study. The gold-standard test and each test being evaluated are interpreted independently, according to standardized criteria. To achieve this goal will require the active cooperation of the clinicians who perform and interpret the test.
Interobserver Disagreement Should Be Measured

Studies have often shown considerable disagreement among observers in labelling an image or tracing as abnormal (Koran 1975). Very few studies of diagnostic tests have included measures of interobserver disagreement.
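One way to put a number on the disagreement described above is a chance-corrected agreement statistic. The chapter does not name a particular statistic; the sketch below assumes Cohen's kappa, a common choice for two readers, and uses made-up readings of ten images. The function name and the data are illustrative only.

```python
from collections import Counter


def cohen_kappa(reader_a, reader_b):
    """Cohen's kappa for two readers who classified the same cases."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(reader_a)
    # Observed agreement: fraction of cases given the same label by both readers.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Agreement expected by chance, from each reader's marginal label frequencies.
    freq_a = Counter(reader_a)
    freq_b = Counter(reader_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical readings (not data from any study cited in this chapter).
reader_1 = ["abnormal", "abnormal", "normal", "normal", "normal",
            "abnormal", "normal", "normal", "abnormal", "normal"]
reader_2 = ["abnormal", "normal", "normal", "normal", "normal",
            "abnormal", "normal", "abnormal", "abnormal", "normal"]

kappa = cohen_kappa(reader_1, reader_2)
print(f"kappa = {kappa:.2f}")
```

The same function accepts more than two categories (for example, several degrees of abnormality), although a weighted statistic may be more appropriate when the categories are ordered.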

The ideal study. At least two people should examine images or tracings and categorize the result according to prospectively defined criteria. These test result categories could be limited to normal and abnormal or could include several degrees of abnormality. The level of agreement should be characterized quantitatively.

There Should Be Enough Patients to Report the Results in Clinically Useful Subgroups of Patients

Typical studies of diagnostic tests enroll fewer than 100 patients, far too few to evaluate the performance of a test in clinically important subsets of patients. One large clinical study has shown that the accuracy of a diagnostic test varies among clinically defined patient subgroups (Weiner et al. 1979). Patients who appear very sick often have extensive disease that a test can detect easily. Disease is often less extensive, and therefore less easily detected, in patients who do not appear ill. Applying results obtained in very sick patients may lead to incorrect interpretation of test results in other patients.

The ideal study. The study should enroll enough patients to measure test performance in subgroups of patients, and it should prospectively establish criteria for different categories of disease severity. The operating characteristic of the index test should be measured in these subgroups, as well as in the entire patient population.

SUMMARY

The chief importance of this chapter is that it sets out expectations for future studies of diagnostic tests. There are a few basic principles. Do comparative studies: a test can be compared with a competing test, either by randomly allocating patients to one test or the other or by performing both tests on all patients. Do clinically relevant studies: the investigators should gather all the data that are required to implement a model for making clinical decisions. Avoid bias: the study population should be all those who get the index test in the course of usual care.
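The reporting rule for sequences of tests given earlier in this chapter (the false-negative and false-positive rates of Test B stated separately for patients who were positive and negative on Test A) can be sketched as follows. This is a minimal illustration, assuming both tests and the gold standard were obtained on every patient; the record layout, the function names, and the counts are hypothetical.

```python
def error_rates(records):
    """Return (false_negative_rate, false_positive_rate) for Test B.

    Each record is a pair (test_b_positive, diseased), where `diseased`
    is the gold-standard result. A rate is None if its denominator is zero.
    """
    tp = fn = fp = tn = 0
    for test_b_positive, diseased in records:
        if diseased:
            if test_b_positive:
                tp += 1
            else:
                fn += 1
        else:
            if test_b_positive:
                fp += 1
            else:
                tn += 1
    fnr = fn / (fn + tp) if (fn + tp) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    return fnr, fpr


def rates_of_b_given_a(patients):
    """Report Test B's error rates separately for each Test A result."""
    report = {}
    for label, a_result in (("A positive", True), ("A negative", False)):
        subgroup = [(p["test_b"], p["diseased"])
                    for p in patients if p["test_a"] == a_result]
        report[label] = error_rates(subgroup)
    return report


def expand(test_a, test_b, diseased, count):
    """Build `count` identical patient records (illustrative helper)."""
    return [{"test_a": test_a, "test_b": test_b, "diseased": diseased}] * count


# Made-up counts for 41 patients, all of whom had both tests and the gold standard.
patients = (
    expand(True, True, True, 8) + expand(True, False, True, 2)
    + expand(True, True, False, 1) + expand(True, False, False, 4)
    + expand(False, True, True, 3) + expand(False, False, True, 3)
    + expand(False, True, False, 2) + expand(False, False, False, 18)
)

results = rates_of_b_given_a(patients)
for label, (fnr, fpr) in results.items():
    print(label, "false-negative rate:", fnr, "false-positive rate:", fpr)
```

With these invented counts, Test B misses a larger share of diseased patients among those who were negative on Test A than among those who were positive, which is the kind of dependence between tests that a pooled estimate would hide.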
REFERENCES

Abrams, H.L., and McNeil, B.J. Medical implications of computed tomography ("CAT" scanning). New England Journal of Medicine 298:255-261, 310-318, 1978.

Ambrose, J., Gooding, M.R., and Uttley, D. E.M.I. scan in the management of head injuries. Lancet 1:847-848, 1976.

Barry, M.J., Mulley, A.G., Fowler, F.J., and Wennberg, J.W. Watchful waiting vs. immediate transurethral resection for symptomatic prostatism. Journal of the American Medical Association 259:3010-3017, 1988.

CASS Principal Investigators. Coronary Artery Surgery Study (CASS): A randomized trial of coronary artery bypass surgery: Survival data. Circulation 68:939-950, 1983.

Fineberg, H.V., Bauman, R., and Sosman, M. Computerized cranial tomography: Effect on diagnostic and therapeutic plans. Journal of the American Medical Association 238:224-230, 1977.

Haughton, V.M. MR imaging of the spine. Radiology 166:297-301, 1988.

Inouye, S.K., and Sox, H.C. A comparison of computed tomography and standard tomography in neoplasms of the chest. Annals of Internal Medicine 105:906-924, 1986.

Kent, D.L., and Larson, E.B. Magnetic resonance imaging of the brain and the spine. Annals of Internal Medicine 108:402-423, 1988.

Koran, L.M. The reliability of clinical methods, data, and judgment. New England Journal of Medicine 293:642-646, 695-700, 1975.

Marton, K.I., Sox, H.C., Wasson, J.H., and Duisenberg, C.E. The clinical value of the upper gastrointestinal series. Archives of Internal Medicine 140:191-195, 1980.

Modic, M.T., Steinberg, P.M., Ross, J.S., Masaryk, T.J., and Carter, J.R. Degenerative disk disease: Assessment of changes in vertebral body marrow with MR imaging. Radiology 166(part 1):193-199, 1988.

NIH Consensus Conference. Magnetic resonance imaging. Journal of the American Medical Association 259:2132-2138, 1988.

Phelps, C.E., and Mushlin, A.I. Focusing medical technology assessment using medical decision theory. Medical Decision Making 8:279-289, 1988.

Philbrick, J.T., Horwitz, R.I., and Feinstein, A.R. Methodologic problems of exercise testing for coronary artery disease: Groups, analysis, and bias. American Journal of Cardiology 46:807-812, 1980.

Philbrick, J.T., Horwitz, R.I., Feinstein, A.R., et al. The limited spectrum of patients studied in exercise test research: Analyzing the tip of the iceberg. Journal of the American Medical Association 248:2467-2470, 1982.

Ransohoff, D.F., and Feinstein, A.R. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. New England Journal of Medicine 299:926-930, 1978.

Sox, H.C. Centers for Excellence in Technology Assessment: A proposal for the national program for the study of health care technology. In Roe, W., Anderson, M., Gong, I., and Strauss, M., eds., A Forward Plan for Medicare Coverage and Technology Assessment. Washington, D.C., Department of Health and Human Services, 1987.

Sox, H.C., Margulies, I., and Sox, C.H. Psychologically mediated effects of diagnostic tests. Annals of Internal Medicine 95:680-685, 1981.

Steinberg, E.P., Sisk, J.E., and Locke, K.E. X-ray CT and magnetic resonance imagers: Diffusion patterns and policy issues. New England Journal of Medicine 313:859-864, 1985.

Weiner, D.A., Ryan, T.J., McCabe, C.H., et al. Exercise stress testing: Correlations among history of angina, ST-segment response, and prevalence of coronary-artery disease in the Coronary Artery Surgery Study (CASS). New England Journal of Medicine 302:230-235, 1979.