

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




4
Primary Assessment of Diagnostic Tests: Barriers to Implementation

In the two preceding chapters, we have presented a series of guidelines for conducting the ideal study of a diagnostic technology. The goal of this chapter is to examine the practical difficulties that are often encountered when the guidelines are applied to the design and execution of a typical protocol. We will also address briefly a number of methodological issues concerning the interpretation and reporting of the study data. Five specific stages of a primary technology assessment will be addressed: planning and protocol development, recruitment, implementation, interpretation, and reporting. (Parts of this chapter are adapted from a paper published previously by one of the authors (Abrams 1987).)

PLANNING AND PROTOCOL DEVELOPMENT

The key to designing a useful technology assessment is that studies should be model-driven; the data that are gathered should be specified by a model of decisionmaking. We pointed out in Chapter 3 that the best way to obtain the information that will make the model usable (that is, the data needed by a physician to make a decision about the care of an individual patient) is to plan the study before enrolling the first patient. What are the elements of the planning and protocol development stage? What specific issues must be addressed and how can they be resolved? Weinstein (1985) has discussed many of these issues as they relate to

planning a trial of cost-effectiveness of diagnostic technology, and we draw on his work. First, the planners must clearly delineate the objectives of the study. A number of critical questions must be asked:

Which clinical condition should be investigated?
Which patient population should be included in the study?
Will the endpoints be accuracy, outcome, or both?
What type of study design will be used?
Will the assessment also be an economic evaluation?
Will the study assess efficacy or effectiveness?
What is the appropriate comparison technology?
How large a sample will be needed?
When should the study be conducted?
Is there institutional support for the study?

The answers to these questions will greatly influence the design of the protocol and the nature of the data to be gathered. We will therefore consider each of them in more detail.

Choosing a Clinical Condition

A diagnostic imaging technique has numerous potential applications. For example, it was estimated in 1984 that MRI examinations might be used in up to 250 diagnosis-related groups (DRGs) (Steinberg and Cohen 1984, Weinstein 1985). Defining the role of MRI for each of these categories would require many studies and a tremendous investment of time and resources. Recognizing that society may not be able to afford to assess every application of a diagnostic technology, we must establish priorities for technology assessment.

In choosing the clinical problem to be evaluated in a diagnostic technology assessment, policy-oriented investigators would use criteria such as the frequency of a condition, the cost of the technology, and the potential impact of the study result on clinical practice. Other factors that might influence the choice include the potential effect of the test on patient management and outcome and deficiencies in existing diagnostic methods (Figure 4.1) (Guyatt and Drummond 1985).
Planners may use policy considerations to select a study problem that will have a significant societal impact; but they must also ask if the study is feasible. The feasibility of the study depends on a number of variables, such as cost and the availability of a gold standard. (The costs of studies of

diagnostic technology are discussed in detail in Chapter 5.) Open-ended questions and poorly defined goals may limit feasibility. Assessing the efficacy of CT or MRI of "the liver" ignores the sharp distinctions among biliary obstruction, mass lesions, and diffuse hepatocellular disease. Each topic requires separate consideration. One prospective study of CT, ultrasound (US), and scintigraphy focused on the tests' ability to detect metastatic liver disease from several types of primary carcinoma. No difference was observed in the diagnostic capabilities of these technologies (Smith et al. 1982). Nevertheless, the results of a more recent study, restricted to patients with carcinoma of the breast or colon, suggest that differences do exist in the diagnostic yield of the three modalities when pathologically distinct lesions are analyzed separately. These differences may have been obscured in the first study because the clinical problem was too broadly defined (Alderson et al. 1983).

FIGURE 4.1 Factors that influence the choice of the clinical condition to be studied: gold standard available? frequency of condition? cost of technology? potential to affect patient management? potential to affect clinical outcome? existing diagnostic methods inadequate?

One possible way to set priorities for technology assessment is to use decision-analytic techniques for determining the value of perfect information. Suppose we are considering an assessment of the accuracy of a new test for patients with condition X. Let us assume that the new test provides perfect information, thereby resolving all uncertainty about the true state of the patient, and that we can determine the value of the

information in dollars. According to the model, if we find that the cost of performing the test is greater than we would be willing to pay for perfect information, using the new test to diagnose patients with condition X would not be worthwhile (Phelps and Mushlin 1988). The model uses a hypothetical test that is 100 percent accurate to maximize its potential value. If the information from an ideal test is not worth the test cost, we can expect, all other things being equal, that the information from a real, imperfect test would be worth even less. It follows that we would not want to expend resources to evaluate the test's performance in this clinical situation. This methodology provides a powerful tool for determining beforehand whether we should expend the resources necessary to evaluate a particular use of a technology.

Patient Population

The study population must be well defined. When certain subsets of eligible patients are excluded because of other, coexisting disease, a physician may be unable to generalize the study result to the whole spectrum of patients encountered in clinical practice. (See Chapters 2 and 3 for a more thorough discussion of sources of bias in selecting patients and their negative impact on studies of diagnostic technology.) Choosing a specific clinical problem for a study of diagnostic technology defines the diagnostic category of patients who may participate in the study. Within this category, the population should include a representative spectrum of patients. Inclusion and exclusion criteria are needed to define the boundaries of the study population. They must be explicit and they must be applied consistently. In the University Group Diabetes Study, these criteria were not applied uniformly, leading to the admission of a number of ineligible patients and the exclusion of some patients who were eligible.
These errors compromised the generalizability of the study conclusion and wasted resources (Feinstein 1971). Wide variance in test performance (that is, accuracy) within the study population may obscure differences in the performance of two tests. Investigators may need to specify and analyze the results of a test in subgroups of patients for whom they suspect the test will perform differently. For example, the sensitivity and specificity of the exercise thallium treadmill test, used to diagnose coronary artery disease, are different in groups of patients segregated according to the severity of their chest pain (Weiner et al. 1979). Although there may be no significant difference in test

performance when the population is considered as a whole, there may be differences when subgroups within the population are compared.

Endpoints and Study Design

The endpoint of a diagnostic test assessment will determine how the results will be used; it is, therefore, critical. Fineberg has proposed the following hierarchy for the evaluation of diagnostic tests: technical capability, diagnostic accuracy, therapeutic impact, and impact on patient outcome (Fineberg et al. 1977). Early reports of excellent technical capability are often the basis for the later studies of diagnostic accuracy and clinical value (impact on therapy and patient outcome). The critical question in the planning and protocol development stage is: Will the study attempt to measure diagnostic accuracy (that is, sensitivity and specificity), the impact of the test on clinical outcome, or both? (Note that we define the outcome of a diagnostic test as any change in the posttest process. It should not be considered synonymous with the terms morbidity and mortality.)

Accuracy

Studies of diagnostic accuracy use a "gold standard" to verify the presence or absence of disease. A potential difficulty in a study of accuracy occurs when there is no accepted "gold standard"; it may not be clear which of the available reference standards should be used (Schwartz 1986). All reference standards are imperfect. The coronary angiogram is used as the gold standard in studies of diagnostic tests for coronary artery disease, such as the stress electrocardiogram. Yet, pathologic examination of tissue from patients who have had an angiogram demonstrates that the radiologic procedure underestimates the severity of disease (Abrams 1982). Physicians must interpret the results of studies of accuracy in this context. Perfect or not, in practice the appropriate gold standard will be the test or procedure that physicians use to define the true state of patients with a particular disease.
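The accuracy measures named here, sensitivity and specificity, are computed by cross-tabulating index-test results against the chosen gold standard. The following sketch is ours, with invented data, and simply illustrates the arithmetic:

```python
# Sensitivity and specificity of an index test, taking the gold standard
# as the definition of the patient's true state (illustrative data only).

def accuracy_measures(index_results, gold_results):
    """Each argument is a parallel list of booleans (True = positive)."""
    pairs = list(zip(index_results, gold_results))
    tp = sum(1 for i, g in pairs if i and g)          # true positives
    fn = sum(1 for i, g in pairs if not i and g)      # false negatives
    fp = sum(1 for i, g in pairs if i and not g)      # false positives
    tn = sum(1 for i, g in pairs if not i and not g)  # true negatives
    sensitivity = tp / (tp + fn)  # fraction of diseased patients detected
    specificity = tn / (tn + fp)  # fraction of disease-free patients cleared
    return sensitivity, specificity

# Ten hypothetical patients; the gold standard calls the first four diseased.
gold = [True] * 4 + [False] * 6
index = [True, True, True, False, False, False, True, False, False, False]
sens, spec = accuracy_measures(index, gold)
print(sens, spec)  # 0.75 and roughly 0.833
```

Note that, as the text stresses, these figures are only as trustworthy as the gold standard used to define the true state.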
Outcome

Because the purpose of diagnostic technology is to provide information that will improve patient outcome, patient outcome is an important endpoint in technology assessment. Making inferences from data on outcome

may be more difficult than interpreting data from studies of diagnostic accuracy. When long-term measures of outcome are used, the technology may be obsolete before the study is completed. Furthermore, long-term outcome may be an unrealistic criterion, "because the impact of diagnostic technologies generally is subordinated to that of other factors, such as the nature of the disease process itself, patient compliance, the efficacy of treatment, etc." (McNeil 1979, p. 37). Improvement in long-term outcome may not be the most important effect of a test. If intervening variables act to obscure differences in the long-term effects of two technologies, perhaps the differences are not really important. Investigators must keep in mind that two patients with identical long-term outcomes may have experienced very different posttest processes.

A variety of intermediate variables may be important indicators of the effects of a test. Furthermore, these variables may be more practical to evaluate than long-term effects. For example, a study could measure the ability of a diagnostic technology to obviate the need for further invasive diagnostic procedures. In patients with lung cancer, thoracotomy could be avoided if a test could accurately predict the presence of mediastinal metastases. The test will not improve the five-year survival of such patients, but avoiding an unnecessary thoracotomy would be a major benefit (McNeil et al. 1978) and would therefore represent an improvement in the posttest process. Outcome studies must track intermediate outcomes and patients' attitudes toward those outcomes.

Combined Studies: Accuracy and Outcome

The alternative combinations of study design (randomized or nonrandomized) and endpoint (accuracy and/or outcome) are depicted in Figure 4.2. The design of a technology assessment influences the feasibility of conducting each type of study.
In a randomized design, each patient undergoes only one of the study tests; in a nonrandomized design, each patient would undergo all of the study tests, although randomization may be used to assign a patient to a particular sequence of tests. The advantages and disadvantages of a randomized design have already been discussed in Chapter 3. The following example illustrates that a study design may not be compatible with the endpoints selected for evaluation. In an ideal study to compare the accuracy of two tests, each patient would have both examinations. Guyatt and Drummond have suggested that investigators

use this approach to assess both the accuracy and the impact on outcome of two relatively noninvasive imaging modalities, such as MRI and CT, in a single study. To compare the effects of the tests on outcome for the same patient in whom accuracy is determined, however, the result of one of the two tests would have to be withheld from the patient's physician (Guyatt and Drummond 1985). The design of this study poses ethical problems because patients will undergo a diagnostic examination that cannot affect their care (Weinstein 1985). Patients and physicians alike may be reluctant to participate. In Chapter 3, we suggest that a randomized design may be preferred to a nonrandomized design for assessing outcome. Planners could also shift their focus from long-term to short-term outcomes.

FIGURE 4.2 Alternative combinations of endpoint and study design. (Randomized design: each patient undergoes only one of the study tests. Nonrandomized design: each patient undergoes all of the study tests; patients may be randomized to a sequence of tests. Depending on the combination, gold-standard evaluation, follow-up, or both are required.)

Short-Term Outcomes: A Synthetic Approach

The synthetic approach is a method for assessing short-term outcomes, such as the impact of a diagnostic test on the management of the patient. It involves obtaining detailed information from physicians about their pretest treatment strategies and comparing them to the posttest management of the patient (Guyatt et al. 1986). In the example above, each physician would write down a plan for managing the patient before knowing the CT and MRI results.
Using a randomized scheme, the result of one of the two tests would be given to each physician, who would then

formulate and record a treatment plan based on the test result. Next, the result of the other test would be revealed, and the patient's care would ultimately be based on all available information. A test has had an impact if the physician's plans changed because of the test result.

Economic Analysis

In the current era of cost containment and limited resources, a consideration of cost will be an important study endpoint. First, the investigators planning a technology assessment must decide which type of analysis to use (for example, net resource costs, cost-effectiveness analysis, or cost-benefit analysis). Since cost-effectiveness analysis is comparative and does not require that health outcomes be valued in monetary terms, it is the type of analysis used most frequently. Second, investigators must choose an appropriate perspective for the analysis, because this will greatly influence which costs and effects are included. The societal perspective is the broadest, and it is adopted when the results of the cost-effectiveness analysis are needed to guide government decisions about how to allocate resources. Third, the investigators must recognize the degree to which additional time and personnel will be needed when an economic evaluation is included in the study design. (For a complete discussion of these issues see Weinstein and Stason 1977; OTA 1980a,b; or OTA 1981.)

Efficacy Versus Effectiveness

The conditions of the study can imitate real life or they can be idealized. The choice between efficacy, the performance of the test under ideal conditions, and effectiveness, its performance under ordinary conditions of clinical practice, will determine the type of question the study can answer. Consider a study designed to assess the diagnostic accuracy of barium enema (BE) in detecting colonic polyps (also see Figure 4.3). A study of effectiveness would enroll all patients who are referred for BE in clinical practice.
Patients would be given the usual pretest instructions and, although some would be less than optimally prepared, all would undergo the examination. This would be performed under usual conditions, by the individuals who normally perform it (radiology staff or house staff). It would be interpreted by clinicians at varying levels of skill who would be provided with whatever clinical information is generally

available at the time. There would not be a protocol for subsequent patient care.

FIGURE 4.3 Differing requirements: studies of efficacy vs. studies of effectiveness. (Efficacy: a more homogeneous patient population screened for coexisting illnesses and compliance; standardized procedures; ideal testing conditions; interpretation blinded to clinical data; objective, "hard" outcome data, e.g., death. Effectiveness: a heterogeneous population including all patients who usually have the procedure; more flexible procedures; the conditions of everyday practice; interpretation using other clinical data; more subjective, "soft" outcome data, e.g., improved quality of life.)

A study of efficacy should assess the potential benefit of the technology when applied to a specific clinical problem in a defined population under ideal conditions. The protocol would be designed to maximize the chance that the true accuracy of the test will be demonstrated by reducing sources of variability. Thus, a study of efficacy would: (1) enroll a more select group of patients; (2) ensure that all patients were adequately and consistently prepared prior to the exam; (3) use only state-of-the-art equipment; (4) employ the most skilled clinicians to perform and interpret the test; and (5) make sure that interpreters were blinded to other clinical information. It would also standardize aftercare.

The individuals who develop the protocol may disagree about which type of assessment is most appropriate, making the choice a difficult one. Feinstein (1983) has suggested a useful way to conceptualize the two approaches to design: the "fastidious" and the "pragmatic."

The fastidious approach. Fastidious designers might include the biostatistician or the scientist who developed the technology. This group would argue that a study of efficacy is the only way to determine the "true" value of the technology. For example, an efficacy design will increase the chances of arriving at an unequivocal answer to the study question by standardizing procedures and removing many of the sources of variability that characterize clinical practice. If such a study concluded that a test was not efficacious, there would be no need to perform further evaluations.

The pragmatic approach. This approach would be adopted by the practicing clinician. The "clean" results of an efficacy assessment may have little value for the physician whose patients will receive their tests under "usual" rather than "ideal" conditions. The pragmatist would argue that only studies of effectiveness, which attempt to mimic clinical reality, provide the information physicians need to make decisions about individual patients.

Resolution of the conflict between the fastidious and pragmatic approaches may involve combining features of both. In any case, the protocol as it is actually carried out may end up as a hybrid, because protocols that have been designed to assess efficacy will often encounter real-world obstacles that make the ideal arrangement impossible. These problems will be covered in detail in the section of this chapter that considers implementation.

Comparative Assessment

In Chapter 3 we emphasized that technology assessments must be comparative if they are to provide useful data to the practicing physician. For example, the physician may need answers to either or both of the following questions: (1) When used instead of existing tests, does the new test have a greater impact on the outcome of the patient? (2) When used in combination with existing methods, does the new test add information that will improve the outcome of the patient?
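The two questions correspond to two allocation schemes, one substitutive and one additive. A minimal sketch of the randomization (the arm labels, function, and seeding are our own invention, not part of the chapter):

```python
import random

# Randomization for two comparative designs, with an existing test A and a
# new test B (illustrative structure only):
#   substitutive: each patient gets A alone or B alone
#   additive:     each patient gets A alone or A followed by B
ARMS = {
    "substitutive": (["A"], ["B"]),
    "additive": (["A"], ["A", "B"]),
}

def allocate(patient_ids, design, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the allocation auditable
    return {pid: rng.choice(ARMS[design]) for pid in patient_ids}

for pid, tests in allocate(range(6), "additive").items():
    print(pid, tests)
```

In a real protocol the allocation list would be concealed from the clinicians enrolling patients; this sketch only shows how the two designs differ in what each arm receives.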
These questions suggest two comparative designs (see Figure 4.4). In one design, the study would detect any positive impact when the new test is substituted for the old. Patients would be randomized to either the new technology or the existing technology. In the other design, the study would evaluate the impact of a technology when it is used as an addition. Patients would be randomized to undergo either the old test and the new test in sequence or the old test alone. Both designs could be used to compare the diagnostic accuracy of the tests (or combination of tests) and their impact on the outcome of disease. (A nonrandomized design, in

which all patients undergo all tests, would also be appropriate for a study of accuracy; refer to the section "Endpoints and Study Design.") Which of these designs is more appropriate when comparing an existing technology with a new one? Despite promising reports, physicians may be hesitant, in the short term, to change over completely to a new technology. An additive design would answer their questions about using the new test as part of a sequence. Nevertheless, one goal of technology assessment is to foster appropriate changes in practice habits and to discourage the use of additional tests whenever they will not have an impact. In the long term, if we want to encourage physicians to abandon ineffective tests in favor of more effective ones, we will need to evaluate the technology's substitutive value as well (Weinstein 1985).

FIGURE 4.4 Designs for comparative assessment. (Design 1, the new technology as a substitute for the old: test A vs. test B. Design 2, the new technology as an addition to the old: test A vs. test A and test B. Over time the appropriate design may change: in the short term physicians may prefer the combination of tests; in the long term they may abandon the less effective test.)

reduce the number of cases that could be used in the final analysis, prolonging the recruitment effort and increasing the cost and time required to complete the study. Furthermore, combining data obtained with different procedures may compromise the validity and generalizability of the study conclusions.

Summarized below are some other problems encountered in a prospective study evaluating the diagnostic value of ventilation-perfusion scanning in patients with suspected pulmonary embolism (Hull et al. 1985):

The index test, ventilation scanning, could not be performed in 20 of the patients because of the lack of availability of 137Xe or for other "technical reasons." In 2 other patients, the results of the scan "were inadequate for interpretation."

Of the potentially eligible patients, 51 were too ill to undergo the gold standard, pulmonary angiography. The gold-standard test was not performed in additional patients: 4 patients were allergic to contrast agents; 2 patients were pregnant; 11 patients were too ill; 9 patients refused permission; and 4 patients were excluded for other "technical" reasons.

What are the consequences of these difficulties? Besides reducing the number of patients available for the final analysis, these problems can change the character of the study population. When patients are excluded for ill-defined reasons or do not have the required follow-up with the gold-standard test, the study population, and thus the patients to whom the study conclusions apply, become difficult to define.

Follow-Up Studies

An assessment designed to evaluate the impact of a diagnostic test on patient outcome will require clinical follow-up. In addition, when studies of diagnostic accuracy employ a risky gold standard, patients with negative index tests may not be referred for the gold-standard test, and clinical follow-up may be used as a substitute. There are several ways to conduct follow-up studies.
First, responsibility for collecting the data and filling out the forms can be placed with the referring physician or with physicians and staff at the study center. This approach is useful when physical examinations and testing are part of the follow-up plan. The method is cheap, but risky;

physicians may fail to gather all the data or may use nonstandard methods. Patients may move or may fail to keep follow-up appointments. Such patients are considered "lost to follow-up" and present a challenge to the individuals who must analyze the data.

Second, a research assistant can conduct a structured telephone interview with the patient in order to assess outcome. This method may be more convenient for the patient and may increase the chance of successful follow-up on patients who have moved. It is not useful if tests or a physical examination are needed.

Third, patients can fill out a follow-up questionnaire and return it to the study center by mail. This approach is the least expensive, but compliance is likely to be poor and the cost of contacting noncompliers will be high.

Follow-up can be complicated by a number of factors, particularly if it requires observation or data collection over a period of years or requires the assessment of other than dichotomous variables. Some factors relate to patients. Patients may perceive follow-up as a continued intrusion into their lives and simply refuse to cooperate. The patient may experience a change in health status that makes evaluation of outcome more difficult. In a randomized study, the patient may "cross over" and have a diagnostic evaluation for the same indication by the competing technology, making the assessment of the impact of the first test nearly impossible. Furthermore, the patient is not always a reliable source of information. In one study, only 60 percent of patients with heart disease and 70 percent of patients with asthma reported these diagnoses when asked what condition they had (Ludwig and Coletti 1971).

The environment in which follow-up is conducted may also present a problem. The technology under study may change, or a newer technology may be developed so that the answer to the study question seems much less important.
When interest wanes, follow-up may be inadequate. The nature of the endpoint chosen for evaluation can also influence the success of follow-up studies. A dichotomous variable such as life or death is easy to assess. Obtaining and coding subjective information about the impact of a test on the patient's functional status or quality of life requires more complex methods. Researchers have recognized the importance of these endpoints and have developed the tools needed to conduct these types of follow-up studies.

Some studies of diagnostic accuracy determine the patient's true state by using the gold-standard test in certain patients and clinical follow-up for those who do not undergo the gold-standard test. Follow-up is very

important in such studies. In McNeil's (1979) evaluation of the CT/RN study, she states that inadequate follow-up made it impossible to determine whether some of the patients entered into the study did or did not have neurological disease. There must be a contingency plan for patients who do not comply with follow-up, and the costs of follow-up must be included in the study budget.

Summary: Implementation

The obstacles encountered in this stage of a technology assessment may be the most difficult to resolve. Randomization, data collection, test performance, and follow-up are all subject to poor compliance and poor performance. To facilitate compliance, the requirements of the protocol should be as explicit and as simple as possible, and they should be written out in detail. The study should be planned to minimize the number of patients who must be randomized at the time examinations are scheduled. The most important element, however, is the motivation of the patients, physicians, and other staff who carry out the protocol. Those individuals involved in carrying out the protocol should receive training before the study begins, and there should be ongoing monitoring of study personnel (Cummings et al. 1988). The best way to avoid implementation problems is expensive: hire a research assistant and assign as many data collection chores as possible to this person.

TEST INTERPRETATION

The choice between efficacy and effectiveness is important in designing the interpretation stage of an assessment. In a study of efficacy, test interpretation must be as accurate, consistent, and objective as possible. The ideal study would include multiple interpretations of both the index test and the gold standard for the purpose of determining interobserver variability. In a study of effectiveness, tests would be interpreted as they are in usual clinical practice. The procedure for interpretation would not necessarily be standardized.
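Interobserver variability of the kind mentioned above is commonly summarized with Cohen's kappa, which corrects the raw agreement between two readers for the agreement expected by chance. The statistic is standard, but this sketch and its data are our own illustration:

```python
from collections import Counter

def cohens_kappa(reader1, reader2):
    """Chance-corrected agreement between two readers' categorical calls."""
    n = len(reader1)
    observed = sum(a == b for a, b in zip(reader1, reader2)) / n
    c1, c2 = Counter(reader1), Counter(reader2)
    # Chance agreement: probability both readers name the same category if
    # each chose independently at their own marginal rates.
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n ** 2
    return (observed - expected) / (1 - expected)

# Two hypothetical radiologists reading ten films as normal (nl) or
# abnormal (abn); they agree on 8 of 10.
r1 = ["abn", "abn", "nl", "nl", "nl", "abn", "nl", "nl", "abn", "nl"]
r2 = ["abn", "nl", "nl", "nl", "nl", "abn", "nl", "abn", "abn", "nl"]
print(round(cohens_kappa(r1, r2), 2))  # 0.58
```

A kappa of 1.0 means perfect agreement and 0 means agreement no better than chance, so 0.58 here is substantially weaker than the raw 80 percent agreement suggests.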
Accuracy

Many factors affect the accuracy of data interpretation. Some, such as physician fatigue (frogmen et al. 1978), are difficult to control. Data from one early study indicated a substantial improvement in radiologists' use

of CT to detect pancreatic carcinoma after the first 1,000 body scans (Sheedy et al. 1977). Improvement in physicians' skills with experience clearly demonstrates the importance of the learning curve. Early estimates of the accuracy of a new test, when physicians' experience is limited, may be a better reflection of their interpretative skills than the potential accuracy of the method.

Consistency and Multiple-Test Interpretations

Consistency is best guaranteed by having the same observer interpret all examinations for a particular technology and by using a standardized definition of an abnormal test result. Ideally, all interpreters using the different methods should be at a similar level of experience. ROC analysis is appropriate for assessing tests with results expressed as continuous variables (Metz 1978). By determining a series of true-positive/false-positive pairs in which different criteria separate the normal from the abnormal, the ROC curve neutralizes observer biases associated with excessively conservative or liberal strategies (Hanley and McNeil 1982).

In a large-scale study, data interpretation might require a full-time commitment from specialists, such as radiologists. It may be difficult to find someone who will devote this amount of time to a study, and equally difficult to recruit the group of specialists who will be needed to reinterpret at least a selected sample of the exams for the purpose of determining interobserver variability. The participation of these individuals should be solicited early, and their time should be a budgeted expense of the project.

Objectivity

How can objectivity of interpretation be obtained? There must be no cross-talk between those who interpret different examinations on the same patient.
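The ROC construction described in this section, sweeping the criterion that separates "normal" from "abnormal" and recording the resulting true-positive/false-positive pairs, can be sketched as follows (the scores and disease labels are invented for illustration):

```python
def roc_points(scores, diseased):
    """One (false-positive rate, true-positive rate) pair per criterion.
    scores: continuous test results (higher = more abnormal);
    diseased: parallel booleans giving the gold-standard truth."""
    pos = sum(diseased)
    neg = len(diseased) - pos
    points = []
    for cut in sorted(set(scores)):  # each distinct score is one criterion
        tpr = sum(1 for s, d in zip(scores, diseased) if s >= cut and d) / pos
        fpr = sum(1 for s, d in zip(scores, diseased) if s >= cut and not d) / neg
        points.append((fpr, tpr))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
diseased = [True, True, False, True, False, False]
for fpr, tpr in roc_points(scores, diseased):
    print(fpr, tpr)
```

Plotting these pairs traces the ROC curve; a reader's conservative or liberal reporting style moves the operating point along the curve rather than changing the curve itself.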
A physician interpreting the index test should be blinded to the result of the gold standard to avoid test-review bias; similarly, a physician interpreting the gold standard should be blinded to the result of the index test to avoid diagnosis-review bias (see Chapter 3). Both types of bias can lead to an overestimate of the true-positive and false-positive rates of the index test. Blinded interpretation of the index test and the gold-standard test is absolutely essential. Yet most reports of studies of diagnostic tests do not indicate that this precaution has been taken.

In a study of efficacy, blinded interpretation is the most objective way to determine the accuracy of a test. It may not be ethically sound,
however, to make decisions about patient care based on a test that was interpreted without the benefit of all relevant clinical data. In a study of effectiveness, the interpretation would depend on the combination of clinical information and that derived from the specific imaging examination. This method, although less objective, is the one used in clinical practice.

A study can be designed to accommodate interpretation under both "ideal" and "usual" conditions. There should be two separate interpretations: one (nonblinded) interpretation used for patient care (and thus for effectiveness) and the other (blinded) used for efficacy studies. In general, if we separate study interpretation from interpretation related to patient care, we can blind observers to all other data more ethically.

REPORTING

The clinical utility of an otherwise well-executed diagnostic technology assessment depends on the success with which the results are communicated to the physicians who use the tests. In addition, meta-analysis, a form of secondary technology assessment that synthesizes recommendations from published reports, depends on thorough reporting of methods and results (Pillemer and Light 1980, Hunter 1982). A number of authors have proposed standards for assessing and reporting randomized controlled trials; many of these standards can be applied to studies of diagnostic technology. Two groups in particular (Mosteller et al. 1980, Chalmers et al. 1981) have described 16 key features of a good report:

1. a precise statement of the study question, including any prior hypotheses regarding specific subgroups in whom the value of the tests might differ;

2. a complete description of the study population, of inclusion and exclusion criteria (if used), and of patients who were rejected or who may have withdrawn from the study, so that clinicians can determine how their patients compare to the study population, with particular attention to clinical issues that define the spectrum of severity of disease;

3. the dates of the enrollment period, to allow interpretation of the results in light of other developments that may have occurred during that time (for example, technological advances);

4. a detailed description of the study protocol, including the methods for performing tests (or appropriate references for the methodology) and the procedure for randomization (if applicable);

5. a statement of the acceptable level of type I and type II errors, and the size of the sample required to detect the specified difference in study endpoint;

6. presentation of the distribution of pretest variables (for randomized studies), so that clinicians can check for biased assignment of patients to study groups;

7. an indication of the level of compliance with the protocol, with a description of deviations and how they were handled;

8. specification of the reference standard used to define the true state of the patient, taking care to show that there is no use of index test results (or clinical data used for clinical prediction rules) to define the diseased and nondiseased states;

9. the results of the index test(s) and gold-standard test (in a 2-by-2 table, if applicable), with appropriate statistical analyses (for example, ROC analysis for studies of test accuracy where results can be expressed as continuous variables);

10. subgroup analysis: results of tests as in no. 9 in patient subgroups of interest;

11. the results of follow-up (when patient outcome is an endpoint), with confidence limits, life-table analysis, or other statistical analyses as appropriate;

12. a description of the method for handling postintervention withdrawals and patients lost to follow-up;

13. a description of the method used to avoid test-referral bias;

14. a description of the method used to blind those who interpret the index and gold-standard tests;

15. the number of tests that were technically suboptimal or were considered uninterpretable; and

16. the source of funding for the study, to allow identification of possible conflicts of interest.

Two of these items deserve additional attention, because they can be sources of hidden bias in a study of diagnostic technology. Number 8 refers to the pitfall of "circular assessment," which must be avoided when choosing a reference standard. This occurs when the result of one of the index tests in a comparative study is used to define the true state of the patient. To obtain a valid measure of each test's performance, the tests must be assessed independently of one another, using a different method to verify the presence or absence of disease.

Number 15 in the list above alludes to another potential source of bias:

reports of studies of diagnostic technology seldom include the number of test results that were considered uninterpretable or indeterminate. In one review of ten papers on CT, only five dealt explicitly with the number of unsatisfactory exams. Such information is essential, however, if efficacy is to be judged. For example, if a test detects renal lesions in 70 of 100 patients, misses them in 10, and results in technically suboptimal examinations in 20, the overall sensitivity is 70 over 100 (70 percent). Frequently, the 20 poor-quality exams are excluded, and the sensitivity reported is 70 divided by 80 (88 percent) (Abrams 1981). Thus, if investigators fail to consider the impact of ignoring poor-quality exams, the true-positive and false-positive rates may be artificially inflated (Begg et al. 1986).

CONCLUSION

In this chapter, we have examined the difficulties encountered in each stage of a primary technology assessment, from the planning and design process through the production of the final report. The solutions to some of these problems are relatively straightforward. For example, we have methods to avoid test-review and diagnosis-review bias. We also know that increasing the level of cooperation among participating individuals and institutions will go a long way toward improving the outcome of a study. The solutions to other problems, such as when to conduct the assessment or which application to assess, are less obvious. In emphasizing some of the barriers to primary data collection, we have attempted to forestall such difficulties in future assessments. In posing a number of unanswered questions, we hope to encourage the research necessary to resolve these problems, and thus enhance the value of diagnostic technology assessment.

REFERENCES

Abrams, H.L. Evaluating computed tomography. In Alterman, P.S., Gastel, B., and Eliastam, M., eds., Assessing Computed Tomography, pp. 1-17. National Center for Health Care Technology Monograph Series. Washington, D.C., U.S. Department of Health and Human Services, May 1981.

Abrams, H.L. The Garland lecture. Coronary arteriography: Pathologic and prognostic implications. American Journal of Roentgenology 139:1-18, 1982.

Abrams, H.L., and Hessel, S. Health technology assessment: Problems and challenges. American Journal of Roentgenology 149:1127-1132, 1987.

Alderson, P.O., Adams, D.F., McNeil, B.J., et al. Computed tomography, ultrasound, and scintigraphy of the liver in patients with colon or breast carcinoma: A prospective comparison. Radiology 149:225-230, 1983.

Alperovitch, A. Controlled assessment of diagnostic techniques: Methodological problems. Effective Health Care 1:187-190, 1983.

Angell, M. Patients' preferences in randomized clinical trials. New England Journal of Medicine 310:1385-1387, 1984.

Begg, C.B., Greenes, R.A., and Iglewicz, B. The influence of uninterpretability on the assessment of diagnostic tests. Journal of Chronic Diseases 39:575-584, 1986.

Brogdon, B.G., Kelsey, C.A., and Moseley, R.D. Effect of fatigue and alcohol on observer perception. American Journal of Roentgenology 130:971-974, 1978.

Brown, B.W., Jr., and Hollander, M. Statistics: A Biomedical Introduction. New York, John Wiley & Sons, 1977.

Cassileth, B.R., Lusk, E.J., Miller, D.S., and Hurwitz, S. Attitudes toward clinical trials among patients and the public. Journal of the American Medical Association 248:968-970, 1982.

Cassileth, B.R., Zupkis, R.V., Sutton-Smith, K., et al. Informed consent: Why are its goals imperfectly realized? New England Journal of Medicine 302:896-900, 1980.

Chalmers, T.C., Smith, H., Jr., Blackburn, B., et al. A method for assessing the quality of a randomized control trial. Controlled Clinical Trials 2:31-49, 1981.

Croke, G. Recruitment for the National Cooperative Gallstone Study. In Roth, H.P., and Gordon, R.S., Jr., eds., Proceedings of the National Conference on Clinical Trials Methodology, October 1977. Clinical Pharmacology and Therapeutics 25:691-694, 1979.

Cummings, S.R., Hulley, S.B., and Siegel, D. Implementing the study: Pre-testing, quality control and protocol revisions. In Hulley, S.B., and Cummings, S.R., eds., Designing Clinical Research: An Epidemiological Approach. Baltimore, Williams and Wilkins, 1988.

Drummond, M. Guidelines for health technology assessment: Economic evaluation. In Feeny, D., Guyatt, G., and Tugwell, P., eds., Health Care Technology: Effectiveness, Efficacy and Public Policy. Montreal, The Institute for Research on Public Policy, 1986.

Feinstein, A.R. An additional science for clinical medicine: II. The limitations of randomized trials. Annals of Internal Medicine 99:544-550, 1983.

Feinstein, A.R. Clinical biostatistics. VIII. An analytic appraisal of the University Group Diabetes Program (UGDP) study. Clinical Pharmacology and Therapeutics 12:167-191, 1971.

Ferguson, J.H. Director, Office of Medical Applications of Research. Personal communication, 1988.

Ferris, F.L., and Ederer, F. External monitoring in multiclinic trials: Applications from ophthalmologic studies. In Roth, H.P., and Gordon, R.S., Jr., eds., Proceedings of the National Conference on Clinical Trials Methodology, October 1977. Clinical Pharmacology and Therapeutics 25:720-723, 1979.

Fineberg, H.V., Bauman, R., and Sosman, M. Computerized cranial tomography: Effect on diagnostic and therapeutic plans. Journal of the American Medical Association 238:224-230, 1977.

Fineberg, H.V., and Hiatt, H.H. Evaluation of medical practices: The case for technology assessment. New England Journal of Medicine 301:1086-1091, 1979.

Freiman, J.A., Chalmers, T.C., Smith, H., Jr., and Kuebler, R.R. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. New England Journal of Medicine 299:690-694, 1978.

Guyatt, G., and Drummond, M. Guidelines for the clinical and economic assessment of health technologies: The case of magnetic resonance. International Journal of Technology Assessment in Health Care 1:551-566, 1985.

Guyatt, G.H., Tugwell, P.X., Feeny, D.H., et al. The role of before-after studies of therapeutic impact in the evaluation of diagnostic technologies. Journal of Chronic Diseases 39:295-304, 1986.

Hanley, J.A., and McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29-36, 1982.

Hessel, S.J., Siegelman, S.S., McNeil, B.J., et al. A prospective evaluation of computed tomography and ultrasound of the pancreas. Radiology 143:129-133, 1982.

Hopwood, M.D., Mabry, J.C., and Sibley, W.L. A first-order characterization of clinical trials. Prepared for the National Institutes of Health by the Rand Corporation. R-2653-NIH, September 1980: 61-62.

Hull, R.D., Hirsh, J., Carter, C.J., et al. Diagnostic value of ventilation-perfusion lung scanning in patients with suspected pulmonary embolism. Chest 88:819-828, 1985.

Hunter, J.E. Meta-analysis: Cumulating Research Findings Across Studies. Beverly Hills, California, Sage Publications, 1982.

Kent, D.L., and Larson, E.B. Diagnostic technology assessment: Problems and prospects. Annals of Internal Medicine 108:759-761, 1988.

Lidz, C.W., Meisel, A., Osterweis, M., et al. Barriers to informed consent. Annals of Internal Medicine 99:539-543, 1983.

Ludwig, E.G., and Coletti, J.C. Some misuses of health statistics. Journal of the American Medical Association 216:493-499, 1971.

Marks, J.W., Croke, G., Gochman, N., et al. Major issues in the organization and implementation of the National Cooperative Gallstone Study (NCGS). Controlled Clinical Trials 5:1-12, 1984.

Mattson, M.E., Curb, J.D., McArdle, R., et al. Participation in a clinical trial: The patients' point of view. Controlled Clinical Trials 6:156-167, 1985.

McNeil, B.J. Pitfalls in and requirements for evaluations of diagnostic technologies. In Wagner, J., ed., Proceedings of a Conference on Medical Technologies, DHEW Pub. No. (PHS) 79-3254, pp. 33-39. Washington, D.C., U.S. Government Printing Office, 1979.

McNeil, B.J., Sanders, R., Alderson, P.O., et al. A prospective study of computed tomography, ultrasound, and gallium imaging in patients with fever. Radiology 139:647-653, 1981.

McNeil, B.J., Weichselbaum, R., and Pauker, S.G. Fallacy of the five-year survival in lung cancer. New England Journal of Medicine 299:1397-1401, 1978.

Metz, C.E. Basic principles of ROC analysis. Seminars in Nuclear Medicine 8:283-298, 1978.

Mosteller, F., Gilbert, J.P., and McPeek, B. Reporting standards and research strategies for controlled trials: Agenda for the editor. Controlled Clinical Trials 1:37-58, 1980.

Office of Technology Assessment, U.S. Congress. The Implications of Cost-Effectiveness Analysis of Medical Technology. Stock No. 051-003-00765-7. Washington, D.C., U.S. Government Printing Office, 1980a.

Office of Technology Assessment, U.S. Congress. The Implications of Cost-Effectiveness Analysis of Medical Technology. Background Paper #1: Methodological issues and literature review. Washington, D.C., U.S. Government Printing Office, 1980b.

Office of Technology Assessment, U.S. Congress. The Implications of Cost-Effectiveness Analysis of Medical Technology. Background Paper #2: Case studies of medical technologies. Case Study #2: The feasibility of economic evaluation of diagnostic procedures: The case of CT scanning. Washington, D.C., U.S. Government Printing Office, 1981.

Phelps, C.E., and Mushlin, A.I. Focusing technology assessment using medical decision theory. Medical Decision Making 8:279-289, 1988.

Pillemer, D.B., and Light, R.J. Synthesizing outcomes: How to use research evidence from many studies. Harvard Educational Review 50:176-195, 1980.

Prout, T.E. Other examples of recruitment problems and solutions. In Roth, H.P., and Gordon, R.S., Jr., eds., Proceedings of the National Conference on Clinical Trials Methodology, October 1977. Clinical Pharmacology and Therapeutics 25:695-696, 1979.

Schoenberger, J.A. Recruitment in the Coronary Drug Project and the Aspirin Myocardial Infarction Study. In Roth, H.P., and Gordon, R.S., Jr., eds., Proceedings of the National Conference on Clinical Trials Methodology, October 1977. Clinical Pharmacology and Therapeutics 25:681-684, 1979.

Schwartz, J.S. Evaluating diagnostic tests: What is done and what needs to be done. Journal of General Internal Medicine 1:266-267, 1986.

Sheedy, P.F., Stephens, D.H., Hattery, R.R., et al. Computed tomography in patients suspected of having carcinoma of the pancreas: Recent experience (abstract). Presented at the scientific assembly and annual meeting of the Radiological Society of North America, Chicago, Ill., November 1977.

Smith, T.J., Kemeny, M.M., Sugarbaker, P.H., et al. A prospective study of hepatic imaging in the detection of metastatic disease. Annals of Surgery 195:486-491, 1982.

Sox, H.C., Jr. Probability theory in the use of diagnostic tests: An introduction to critical study of the literature. Annals of Internal Medicine 104:60-66, 1986.

Steinberg, E.P., and Cohen, A.B. Office of Technology Assessment, U.S. Congress. Nuclear Magnetic Resonance Imaging Technology: A Clinical, Industrial, and Policy Analysis. Technology Case Study 27. Washington, D.C., U.S. Government Printing Office, 1984.

Taylor, K.M., Margolese, R.G., and Soskolne, C.L. Physicians' reasons for not entering eligible patients in a randomized clinical trial of surgery for breast cancer. New England Journal of Medicine 310:1363-1367, 1984.

Vreim, C. Project officer, Prospective Investigation of Pulmonary Embolism Diagnosis (PIOPED) project. Personal communication, 1988.

Weiner, D.A., Ryan, T.J., McCabe, C.H., et al. Exercise stress testing: Correlation among history of angina, ST-segment response and prevalence of coronary-artery disease in the Coronary Artery Surgery Study (CASS). New England Journal of Medicine 301:230-235, 1979.

Weinstein, M.C. Methodologic considerations in planning clinical trials of cost-effectiveness of magnetic resonance imaging (with a commentary on Guyatt and Drummond). International Journal of Technology Assessment in Health Care 1:567-581, 1985.

Weinstein, M.C., and Stason, W.B. Foundations of cost-effectiveness analysis for health and medical practices. New England Journal of Medicine 296:716-721, 1977.