

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




4
Primary Assessment of Diagnostic Tests: Barriers to Implementation

In the two preceding chapters, we have presented a series of guidelines for conducting the ideal study of a diagnostic technology. The goal of this chapter is to examine the practical difficulties that are often encountered when the guidelines are applied to the design and execution of a typical protocol. We will also address briefly a number of methodological issues concerning the interpretation and reporting of the study data. Five specific stages of a primary technology assessment will be addressed: planning and protocol development, recruitment, implementation, interpretation, and reporting. (Parts of this chapter are adapted from a paper published previously by one of the authors (Abrams 1987).)

PLANNING AND PROTOCOL DEVELOPMENT

The key to designing a useful technology assessment is that studies should be model-driven; the data that are gathered should be specified by a model of decisionmaking. We pointed out in Chapter 3 that the best way to obtain the information that will make the model usable (that is, the data needed by a physician to make a decision about the care of an individual patient) is to plan the study before enrolling the first patient. What are the elements of the planning and protocol development stage? What specific issues must be addressed and how can they be resolved? Weinstein (1985) has discussed many of these issues as they relate to

planning a trial of cost-effectiveness of diagnostic technology, and we draw on his work. First, the planners must clearly delineate the objectives of the study. A number of critical questions must be asked:

Which clinical condition should be investigated?
Which patient population should be included in the study?
Will the endpoints be accuracy, outcome, or both?
What type of study design will be used?
Will the assessment also be an economic evaluation?
Will the study assess efficacy or effectiveness?
What is the appropriate comparison technology?
How large a sample will be needed?
When should the study be conducted?
Is there institutional support for the study?

The answers to these questions will greatly influence the design of the protocol and the nature of the data to be gathered. We will therefore consider each of them in more detail.

Choosing a Clinical Condition

A diagnostic imaging technique has numerous potential applications. For example, it was estimated in 1984 that MRI examinations might be used in up to 250 diagnosis-related groups (DRGs) (Steinberg and Cohen 1984, Weinstein 1985). Defining the role of MRI for each of these categories would require many studies and a tremendous investment of time and resources. Recognizing that society may not be able to afford to assess every application of a diagnostic technology, we must establish priorities for technology assessment.

In choosing the clinical problem to be evaluated in a diagnostic technology assessment, policy-oriented investigators would use criteria such as the frequency of a condition, the cost of the technology, and the potential impact of the study result on clinical practice. Other factors that might influence the choice include the potential effect of the test on patient management and outcome and deficiencies in existing diagnostic methods (Figure 4.1) (Guyatt and Drummond 1985).
Planners may use policy considerations to select a study problem that will have a significant societal impact; but they must also ask if the study is feasible. The feasibility of the study depends on a number of variables, such as cost and the availability of a gold standard. (The costs of studies of

diagnostic technology are discussed in detail in Chapter 5.) Open-ended questions and poorly defined goals may limit feasibility. Assessing the efficacy of CT or MRI of "the liver" ignores the sharp distinctions among biliary obstruction, mass lesions, and diffuse hepatocellular disease. Each topic requires separate consideration. One prospective study of CT, ultrasound (US), and scintigraphy focused on the tests' ability to detect metastatic liver disease from several types of primary carcinoma. No difference was observed in the diagnostic capabilities of these technologies (Smith et al. 1982). Nevertheless, the results of a more recent study, restricted to patients with carcinoma of the breast or colon, suggest that differences do exist in the diagnostic yield of the three modalities when pathologically distinct lesions are analyzed separately. These differences may have been obscured in the first study because the clinical problem was too broadly defined (Alderson et al. 1983).

FIGURE 4.1 Factors that influence the choice of the clinical condition to be studied: gold standard available? frequency of condition? cost of technology? potential to affect patient management? potential to affect clinical outcome? existing diagnostic methods inadequate?

One possible way to set priorities for technology assessment is to use decision-analytic techniques for determining the value of perfect information. Suppose we are considering an assessment of the accuracy of a new test for patients with condition X. Let us assume that the new test provides perfect information, thereby resolving all uncertainty about the true state of the patient, and that we can determine the value of the

information in dollars. According to the model, if we find that the cost of performing the test is greater than we would be willing to pay for perfect information, using the new test to diagnose patients with condition X would not be worthwhile (Phelps and Mushlin 1988). The model uses a hypothetical test that is 100 percent accurate to maximize its potential value. If the information from an ideal test is not worth the test cost, we can expect, all other things being equal, that the information from a real, imperfect test would be worth even less. It follows that we would not want to expend resources to evaluate the test's performance in this clinical situation. This methodology provides a powerful tool for determining beforehand whether we should expend the resources necessary to evaluate a particular use of a technology.

Patient Population

The study population must be well defined. When certain subsets of eligible patients are excluded because of other, coexisting disease, a physician may be unable to generalize the study result to the whole spectrum of patients encountered in clinical practice. (See Chapters 2 and 3 for a more thorough discussion of sources of bias in selecting patients and their negative impact on studies of diagnostic technology.) Choosing a specific clinical problem for a study of diagnostic technology defines the diagnostic category of patients who may participate in the study. Within this category, the population should include a representative spectrum of patients. Inclusion and exclusion criteria are needed to define the boundaries of the study population. They must be explicit and they must be applied consistently. In the University Group Diabetes Study, these criteria were not applied uniformly, leading to the admission of a number of ineligible patients and the exclusion of some patients who were eligible.
These errors compromised the generalizability of the study conclusion and wasted resources (Feinstein 1971). Wide variance in test performance (that is, accuracy) within the study population may obscure differences in the performance of two tests. Investigators may need to specify and analyze the results of a test in subgroups of patients for whom they suspect the test will perform differently. For example, the sensitivity and specificity of the exercise thallium treadmill test, used to diagnose coronary artery disease, are different in groups of patients segregated according to the severity of their chest pain (Weiner et al. 1979). Although there may be no significant difference in test

performance when the population is considered as a whole, there may be differences when subgroups within the population are compared.

Endpoints and Study Design

The endpoint of a diagnostic test assessment will determine how the results will be used; it is, therefore, critical. Fineberg has proposed the following hierarchy for the evaluation of diagnostic tests: technical capability, diagnostic accuracy, therapeutic impact, and impact on patient outcome (Fineberg et al. 1977). Early reports of excellent technical capability are often the basis for the later studies of diagnostic accuracy and clinical value (impact on therapy and patient outcome). The critical question in the planning and protocol development stage is: Will the study attempt to measure diagnostic accuracy (that is, sensitivity and specificity), the impact of the test on clinical outcome, or both? (Note that we define the outcome of a diagnostic test as any change in the posttest process. It should not be considered synonymous with the terms morbidity and mortality.)

Accuracy

Studies of diagnostic accuracy use a "gold standard" to verify the presence or absence of disease. A potential difficulty in a study of accuracy occurs when there is no accepted "gold standard"; it may not be clear which of the available reference standards should be used (Schwartz 1986). All reference standards are imperfect. The coronary angiogram is used as the gold standard in studies of diagnostic tests for coronary artery disease, such as the stress electrocardiogram. Yet, pathologic examination of tissue from patients who have had an angiogram demonstrates that the radiologic procedure underestimates the severity of disease (Abrams 1982). Physicians must interpret the results of studies of accuracy in this context. Perfect or not, in practice the appropriate gold standard will be the test or procedure that physicians use to define the true state of patients with a particular disease.
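The accuracy measures named here, sensitivity and specificity, are computed by cross-tabulating index-test results against the chosen gold standard. The following sketch is ours, with invented data, and simply illustrates the arithmetic:

```python
# Sensitivity and specificity of an index test, taking the gold standard
# as the definition of the patient's true state (illustrative data only).

def accuracy_measures(index_results, gold_results):
    """Each argument is a parallel list of booleans (True = positive)."""
    pairs = list(zip(index_results, gold_results))
    tp = sum(1 for i, g in pairs if i and g)          # true positives
    fn = sum(1 for i, g in pairs if not i and g)      # false negatives
    fp = sum(1 for i, g in pairs if i and not g)      # false positives
    tn = sum(1 for i, g in pairs if not i and not g)  # true negatives
    sensitivity = tp / (tp + fn)  # fraction of diseased patients detected
    specificity = tn / (tn + fp)  # fraction of disease-free patients cleared
    return sensitivity, specificity

# Ten hypothetical patients; the gold standard calls the first four diseased.
gold = [True] * 4 + [False] * 6
index = [True, True, True, False, False, False, True, False, False, False]
sens, spec = accuracy_measures(index, gold)
print(sens, spec)  # 0.75 and roughly 0.833
```

Note that, as the text stresses, these figures are only as trustworthy as the gold standard used to define the true state.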
Outcome

Because the purpose of diagnostic technology is to provide information that will improve patient outcome, patient outcome is an important endpoint in technology assessment. Making inferences from data on outcome

may be more difficult than interpreting data from studies of diagnostic accuracy. When long-term measures of outcome are used, the technology may be obsolete before the study is completed. Furthermore, long-term outcome may be an unrealistic criterion, "because the impact of diagnostic technologies generally is subordinated to that of other factors, such as the nature of the disease process itself, patient compliance, the efficacy of treatment, etc." (McNeil 1979, p. 37). Improvement in long-term outcome may not be the most important effect of a test. If intervening variables act to obscure differences in the long-term effects of two technologies, perhaps the differences are not really important. Investigators must keep in mind that two patients with identical long-term outcomes may have experienced very different posttest processes.

A variety of intermediate variables may be important indicators of the effects of a test. Furthermore, these variables may be more practical to evaluate than long-term effects. For example, a study could measure the ability of a diagnostic technology to obviate the need for further invasive diagnostic procedures. In patients with lung cancer, thoracotomy could be avoided if a test could accurately predict the presence of mediastinal metastases. The test will not improve the five-year survival of such patients, but avoiding an unnecessary thoracotomy would be a major benefit (McNeil et al. 1978) and would therefore represent an improvement in the posttest process. Outcome studies must track intermediate outcomes and patients' attitudes toward those outcomes.

Combined Studies: Accuracy and Outcome

The alternative combinations of study design (randomized or nonrandomized) and endpoint (accuracy and/or outcome) are depicted in Figure 4.2. The design of a technology assessment influences the feasibility of conducting each type of study.
In a randomized design, each patient undergoes only one of the study tests; in a nonrandomized design, each patient would undergo all of the study tests, although randomization may be used to assign a patient to a particular sequence of tests. The advantages and disadvantages of a randomized design have already been discussed in Chapter 3. The following example illustrates that a study design may not be compatible with the endpoints selected for evaluation. In an ideal study to compare the accuracy of two tests, each patient would have both examinations. Guyatt and Drummond have suggested that investigators

use this approach to assess both the accuracy and the impact on outcome of two relatively noninvasive imaging modalities, such as MRI and CT, in a single study. To compare the effects of the tests on outcome for the same patient in whom accuracy is determined, however, the result of one of the two tests would have to be withheld from the patient's physician (Guyatt and Drummond 1985). The design of this study poses ethical problems because patients will undergo a diagnostic examination that cannot affect their care (Weinstein 1985). Patients and physicians alike may be reluctant to participate. In Chapter 3, we suggest that a randomized design may be preferred to a nonrandomized design for assessing outcome. Planners could also shift their focus from long-term to short-term outcomes.

FIGURE 4.2 Alternative combinations of endpoint and study design. (Randomized design: each patient undergoes only one of the study tests. Nonrandomized design: each patient undergoes all of the study tests; patients may be randomized to a sequence of tests. Depending on the combination, gold-standard evaluation, follow-up, or both are required.)

Short-Term Outcomes: A Synthetic Approach

The synthetic approach is a method for assessing short-term outcomes, such as the impact of a diagnostic test on the management of the patient. It involves obtaining detailed information from physicians about their pretest treatment strategies and comparing them to the posttest management of the patient (Guyatt et al. 1986). In the example above, each physician would write down a plan for managing the patient before knowing the CT and MRI results.
Using a randomized scheme, the result of one of the two tests would be given to each physician, who would then

formulate and record a treatment plan based on the test result. Next, the result of the other test would be revealed, and the patient's care would ultimately be based on all available information. A test has had an impact if the physician's plans changed because of the test result.

Economic Analysis

In the current era of cost containment and limited resources, a consideration of cost will be an important study endpoint. First, the investigators planning a technology assessment must decide which type of analysis to use (for example, net resource costs, cost-effectiveness analysis, or cost-benefit analysis). Since cost-effectiveness analysis is comparative and does not require that health outcomes be valued in monetary terms, it is the type of analysis used most frequently. Second, investigators must choose an appropriate perspective for the analysis, because this will greatly influence which costs and effects are included. The societal perspective is the broadest, and it is adopted when the results of the cost-effectiveness analysis are needed to guide government decisions about how to allocate resources. Third, the investigators must recognize the degree to which additional time and personnel will be needed when an economic evaluation is included in the study design. (For a complete discussion of these issues see Weinstein and Stason 1977; OTA 1980a,b; or OTA 1981.)

Efficacy Versus Effectiveness

The conditions of the study can imitate real life or they can be idealized. The choice between efficacy, the performance of the test under ideal conditions, and effectiveness, its performance under ordinary conditions of clinical practice, will determine the type of question the study can answer. Consider a study designed to assess the diagnostic accuracy of barium enema (BE) in detecting colonic polyps (also see Figure 4.3). A study of effectiveness would enroll all patients who are referred for BE in clinical practice.
Patients would be given the usual pretest instructions and, although some would be less than optimally prepared, all would undergo the examination. This would be performed under usual conditions, by the individuals who normally perform it (radiology staff or house staff). It would be interpreted by clinicians at varying levels of skill who would be provided with whatever clinical information is generally

available at the time. There would not be a protocol for subsequent patient care.

FIGURE 4.3 Differing requirements: studies of efficacy vs. studies of effectiveness. (Efficacy: a more homogeneous patient population screened for coexisting illnesses and compliance; standardized procedures; ideal testing conditions; interpretation blinded to clinical data; objective, "hard" outcome data, e.g., death. Effectiveness: a heterogeneous population including all patients who usually have the procedure; more flexible procedures; the conditions of everyday practice; interpretation using other clinical data; more subjective, "soft" outcome data, e.g., improved quality of life.)

A study of efficacy should assess the potential benefit of the technology when applied to a specific clinical problem in a defined population under ideal conditions. The protocol would be designed to maximize the chance that the true accuracy of the test will be demonstrated by reducing sources of variability. Thus, a study of efficacy would: (1) enroll a more select group of patients; (2) ensure that all patients were adequately and consistently prepared prior to the exam; (3) use only state-of-the-art equipment; (4) employ the most skilled clinicians to perform and interpret the test; and (5) make sure that interpreters were blinded to other clinical information. It would also standardize aftercare.

The individuals who develop the protocol may disagree about which type of assessment is most appropriate, making the choice a difficult one. Feinstein (1983) has suggested a useful way to conceptualize the two approaches to design: the "fastidious" and the "pragmatic."

The fastidious approach. Fastidious designers might include the biostatistician or the scientist who developed the technology. This group would argue that a study of efficacy is the only way to determine the "true" value of the technology. For example, an efficacy design will increase the chances of arriving at an unequivocal answer to the study question by standardizing procedures and removing many of the sources of variability that characterize clinical practice. If such a study concluded that a test was not efficacious, there would be no need to perform further evaluations.

The pragmatic approach. This approach would be adopted by the practicing clinician. The "clean" results of an efficacy assessment may have little value for the physician whose patients will receive their tests under "usual" rather than "ideal" conditions. The pragmatist would argue that only studies of effectiveness, which attempt to mimic clinical reality, provide the information physicians need to make decisions about individual patients.

Resolution of the conflict between the fastidious and pragmatic approaches may involve combining features of both. In any case, the protocol as it is actually carried out may end up as a hybrid, because protocols that have been designed to assess efficacy will often encounter real-world obstacles that make the ideal arrangement impossible. These problems will be covered in detail in the section of this chapter that considers implementation.

Comparative Assessment

In Chapter 3 we emphasized that technology assessments must be comparative if they are to provide useful data to the practicing physician. For example, the physician may need answers to either or both of the following questions: (1) When used instead of existing tests, does the new test have a greater impact on the outcome of the patient? (2) When used in combination with existing methods, does the new test add information that will improve the outcome of the patient?
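The two questions correspond to two allocation schemes, one substitutive and one additive. A minimal sketch of the randomization (the arm labels, function, and seeding are our own invention, not part of the chapter):

```python
import random

# Randomization for two comparative designs, with an existing test A and a
# new test B (illustrative structure only):
#   substitutive: each patient gets A alone or B alone
#   additive:     each patient gets A alone or A followed by B
ARMS = {
    "substitutive": (["A"], ["B"]),
    "additive": (["A"], ["A", "B"]),
}

def allocate(patient_ids, design, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the allocation auditable
    return {pid: rng.choice(ARMS[design]) for pid in patient_ids}

for pid, tests in allocate(range(6), "additive").items():
    print(pid, tests)
```

In a real protocol the allocation list would be concealed from the clinicians enrolling patients; this sketch only shows how the two designs differ in what each arm receives.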
These questions suggest two comparative designs (see Figure 4.4). In one design, the study would detect any positive impact when the new test is substituted for the old. Patients would be randomized to either the new technology or the existing technology. In the other design, the study would evaluate the impact of a technology when it is used as an addition. Patients would be randomized to undergo either the old test and the new test in sequence or the old test alone. Both designs could be used to compare the diagnostic accuracy of the tests (or combination of tests) and their impact on the outcome of disease. (A nonrandomized design, in

which all patients undergo all tests, would also be appropriate for a study of accuracy; refer to the section "Endpoints and Study Design.") Which of these designs is more appropriate when comparing an existing technology with a new one? Despite promising reports, physicians may be hesitant, in the short term, to change over completely to a new technology. An additive design would answer their questions about using the new test as part of a sequence. Nevertheless, one goal of technology assessment is to foster appropriate changes in practice habits and to discourage the use of additional tests whenever they will not have an impact. In the long term, if we want to encourage physicians to abandon ineffective tests in favor of more effective ones, we will need to evaluate the technology's substitutive value as well (Weinstein 1985).

FIGURE 4.4 Designs for comparative assessment. (Design 1, the new technology as a substitute for the old: test A vs. test B. Design 2, the new technology as an addition to the old: test A vs. test A and test B. Over time the appropriate design may change: in the short term physicians may prefer the combination of tests; in the long term they may abandon the less effective test.)

reduce the number of cases that could be used in the final analysis, prolonging the recruitment effort and increasing the cost and time required to complete the study. Furthermore, combining data obtained with different procedures may compromise the validity and generalizability of the study conclusions.

Summarized below are some other problems encountered in a prospective study evaluating the diagnostic value of ventilation-perfusion scanning in patients with suspected pulmonary embolism (Hull et al. 1985):

The index test, ventilation scanning, could not be performed in 20 of the patients because of the lack of availability of 137Xe or for other "technical reasons." In 2 other patients, the results of the scan "were inadequate for interpretation."

Of the potentially eligible patients, 51 were too ill to undergo the gold standard, pulmonary angiography. The gold-standard test was not performed in additional patients: 4 patients were allergic to contrast agents; 2 patients were pregnant; 11 patients were too ill; 9 patients refused permission; and 4 patients were excluded for other "technical" reasons.

What are the consequences of these difficulties? Besides reducing the number of patients available for the final analysis, these problems can change the character of the study population. When patients are excluded for ill-defined reasons or do not have the required follow-up with the gold-standard test, the study population, and thus the patients to whom the study conclusions apply, become difficult to define.

Follow-Up Studies

An assessment designed to evaluate the impact of a diagnostic test on patient outcome will require clinical follow-up. In addition, when studies of diagnostic accuracy employ a risky gold standard, patients with negative index tests may not be referred for the gold-standard test, and clinical follow-up may be used as a substitute. There are several ways to conduct follow-up studies.
First, responsibility for collecting the data and filling out the forms can be placed with the referring physician or with physicians and staff at the study center. This approach is useful when physical examinations and testing are part of the follow-up plan. The method is cheap, but risky;

physicians may fail to gather all the data or may use nonstandard methods. Patients may move or may fail to keep follow-up appointments. Such patients are considered "lost to follow-up" and present a challenge to the individuals who must analyze the data.

Second, a research assistant can conduct a structured telephone interview with the patient in order to assess outcome. This method may be more convenient for the patient and may increase the chance of successful follow-up on patients who have moved. It is not useful if tests or a physical examination are needed.

Third, patients can fill out a follow-up questionnaire and return it to the study center by mail. This approach is the least expensive, but compliance is likely to be poor and the cost of contacting noncompliers will be high.

Follow-up can be complicated by a number of factors, particularly if it requires observation or data collection over a period of years or requires the assessment of other than dichotomous variables. Some factors relate to patients. Patients may perceive follow-up as a continued intrusion into their lives and simply refuse to cooperate. The patient may experience a change in health status that makes evaluation of outcome more difficult. In a randomized study, the patient may "cross over" and have a diagnostic evaluation for the same indication by the competing technology, making the assessment of the impact of the first test nearly impossible. Furthermore, the patient is not always a reliable source of information. In one study, only 60 percent of patients with heart disease and 70 percent of patients with asthma reported these diagnoses when asked what condition they had (Ludwig and Coletti 1971).

The environment in which follow-up is conducted may also present a problem. The technology under study may change, or a newer technology may be developed so that the answer to the study question seems much less important.
When interest wanes, follow-up may be inadequate. The nature of the endpoint chosen for evaluation can also influence the success of follow-up studies. A dichotomous variable such as life or death is easy to assess. Obtaining and coding subjective information about the impact of a test on the patient's functional status or quality of life requires more complex methods. Researchers have recognized the importance of these endpoints and have developed the tools needed to conduct these types of follow-up studies.

Some studies of diagnostic accuracy determine the patient's true state by using the gold-standard test in certain patients and clinical follow-up for those who do not undergo the gold-standard test. Follow-up is very

important in such studies. In McNeil's (1979) evaluation of the CT/RN study, she states that inadequate follow-up made it impossible to determine whether some of the patients entered into the study did or did not have neurological disease. There must be a contingency plan for patients who do not comply with follow-up, and the costs of follow-up must be included in the study budget.

Summary: Implementation

The obstacles encountered in this stage of a technology assessment may be the most difficult to resolve. Randomization, data collection, test performance, and follow-up are all subject to poor compliance and poor performance. To facilitate compliance, the requirements of the protocol should be as explicit and as simple as possible, and they should be written out in detail. The study should be planned to minimize the number of patients who must be randomized at the time examinations are scheduled. The most important element, however, is the motivation of the patients, physicians, and other staff who carry out the protocol. Those individuals involved in carrying out the protocol should receive training before the study begins, and there should be ongoing monitoring of study personnel (Cummings et al. 1988). The best way to avoid implementation problems is expensive: hire a research assistant and assign as many data collection chores as possible to this person.

TEST INTERPRETATION

The choice between efficacy and effectiveness is important in designing the interpretation stage of an assessment. In a study of efficacy, test interpretation must be as accurate, consistent, and objective as possible. The ideal study would include multiple interpretations of both the index test and the gold standard for the purpose of determining interobserver variability. In a study of effectiveness, tests would be interpreted as they are in usual clinical practice. The procedure for interpretation would not necessarily be standardized.
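Interobserver variability of the kind mentioned above is commonly summarized with Cohen's kappa, which corrects the raw agreement between two readers for the agreement expected by chance. The statistic is standard, but this sketch and its data are our own illustration:

```python
from collections import Counter

def cohens_kappa(reader1, reader2):
    """Chance-corrected agreement between two readers' categorical calls."""
    n = len(reader1)
    observed = sum(a == b for a, b in zip(reader1, reader2)) / n
    c1, c2 = Counter(reader1), Counter(reader2)
    # Chance agreement: probability both readers name the same category if
    # each chose independently at their own marginal rates.
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n ** 2
    return (observed - expected) / (1 - expected)

# Two hypothetical radiologists reading ten films as normal (nl) or
# abnormal (abn); they agree on 8 of 10.
r1 = ["abn", "abn", "nl", "nl", "nl", "abn", "nl", "nl", "abn", "nl"]
r2 = ["abn", "nl", "nl", "nl", "nl", "abn", "nl", "abn", "abn", "nl"]
print(round(cohens_kappa(r1, r2), 2))  # 0.58
```

A kappa of 1.0 means perfect agreement and 0 means agreement no better than chance, so 0.58 here is substantially weaker than the raw 80 percent agreement suggests.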
Accuracy

Many factors affect the accuracy of data interpretation. Some, such as physician fatigue (frogmen et al. 1978), are difficult to control. Data from one early study indicated a substantial improvement in radiologists' use

of CT to detect pancreatic carcinoma after the first 1,000 body scans (Sheedy et al. 1977). Improvement in physicians' skills with experience clearly demonstrates the importance of the learning curve. Early estimates of the accuracy of a new test, when physicians' experience is limited, may be a better reflection of their interpretative skills than the potential accuracy of the method.

Consistency and Multiple-Test Interpretations

Consistency is best guaranteed by having the same observer interpret all examinations for a particular technology and by using a standardized definition of an abnormal test result. Ideally, all interpreters using the different methods should be at a similar level of experience. ROC analysis is appropriate for assessing tests with results expressed as continuous variables (Metz 1978). By determining a series of true-positive/false-positive pairs in which different criteria separate the normal from the abnormal, the ROC curve neutralizes observer biases associated with excessively conservative or liberal strategies (Hanley and McNeil 1982).

In a large-scale study, data interpretation might require a full-time commitment from specialists, such as radiologists. It may be difficult to find someone who will devote this amount of time to a study, and equally difficult to recruit the group of specialists who will be needed to reinterpret at least a selected sample of the exams for the purpose of determining interobserver variability. The participation of these individuals should be solicited early, and their time should be a budgeted expense of the project.

Objectivity

How can objectivity of interpretation be obtained? There must be no cross-talk between those who interpret different examinations on the same patient.
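The ROC construction described in this section, sweeping the criterion that separates "normal" from "abnormal" and recording the resulting true-positive/false-positive pairs, can be sketched as follows (the scores and disease labels are invented for illustration):

```python
def roc_points(scores, diseased):
    """One (false-positive rate, true-positive rate) pair per criterion.
    scores: continuous test results (higher = more abnormal);
    diseased: parallel booleans giving the gold-standard truth."""
    pos = sum(diseased)
    neg = len(diseased) - pos
    points = []
    for cut in sorted(set(scores)):  # each distinct score is one criterion
        tpr = sum(1 for s, d in zip(scores, diseased) if s >= cut and d) / pos
        fpr = sum(1 for s, d in zip(scores, diseased) if s >= cut and not d) / neg
        points.append((fpr, tpr))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
diseased = [True, True, False, True, False, False]
for fpr, tpr in roc_points(scores, diseased):
    print(fpr, tpr)
```

Plotting these pairs traces the ROC curve; a reader's conservative or liberal reporting style moves the operating point along the curve rather than changing the curve itself.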
A physician interpreting the index test should be blinded to the result of the gold standard to avoid test-review bias; similarly, a physician interpreting the gold standard should be blinded to the result of the index test to avoid diagnosis-review bias (see Chapter 3). Both types of bias can lead to an overestimate of the true-positive and false-positive rates of the index test. Blinded interpretation of the index test and the gold-standard test is absolutely essential. Yet most reports of studies of diagnostic tests do not indicate that this precaution has been taken.

In a study of efficacy, blinded interpretation is the most objective way to determine the accuracy of a test. It may not be ethically sound,
however, to make decisions about patient care based on a test that was interpreted without the benefit of all relevant clinical data. In a study of effectiveness, the interpretation would depend on the combination of clinical information and that derived from the specific imaging examination. This method, although less objective, is the one used in clinical practice.

A study can be designed to accommodate interpretation under both "ideal" and "usual" conditions. There should be two separate interpretations: one (nonblinded) interpretation used for patient care (and thus for effectiveness) and the other (blinded) used for efficacy studies. In general, if we separate study interpretation from interpretation related to patient care, we can blind observers to all other data more ethically.

REPORTING

The clinical utility of an otherwise well-executed diagnostic technology assessment depends on the success with which the results are communicated to the physicians who use the tests. In addition, meta-analysis, a form of secondary technology assessment that synthesizes recommendations from published reports, depends on thorough reporting of methods and results (Pillemer and Light 1980, Hunter 1982). A number of authors have proposed standards for assessing and reporting randomized controlled trials; many of these standards can be applied to studies of diagnostic technology. Two groups in particular (Mosteller et al. 1980, Chalmers et al. 1981) have described 16 key features of a good report:

1. a precise statement of the study question, including any prior hypotheses regarding specific subgroups in whom the value of the tests might differ;

2. a complete description of the study population, of inclusion and exclusion criteria (if used), and of patients who were rejected or who may have withdrawn from the study, so that clinicians can determine how their patients compare to the study population, with particular attention to clinical issues that define the spectrum of severity of disease;

3. the dates of the enrollment period, to allow interpretation of the results in light of other developments that may have occurred during that time (for example, technological advances);

4. a detailed description of the study protocol, including the methods for performing tests (or appropriate references for the methodology) and the procedure for randomization (if applicable);

5. a statement of the acceptable level of type I and type II errors, and the size of the sample required to detect the specified difference in study endpoint;

6. presentation of the distribution of pretest variables (for randomized studies), so that clinicians can check for biased assignment of patients to study groups;

7. an indication of the level of compliance with the protocol, with a description of deviations and how they were handled;

8. specification of the reference standard used to define the true state of the patient, taking care to show that there is no use of index test results (or clinical data used for clinical prediction rules) to define the diseased and nondiseased states;

9. the results of the index test(s) and gold-standard test (in a 2-by-2 table, if applicable), with appropriate statistical analyses (for example, ROC analysis for studies of test accuracy where results can be expressed as continuous variables);

10. subgroup analysis: results of tests as in no. 9 in patient subgroups of interest;

11. the results of follow-up (when patient outcome is an endpoint), with confidence limits, life-table analysis, or other statistical analyses as appropriate;

12. a description of the method for handling postintervention withdrawals and patients lost to follow-up;

13. a description of the method used to avoid test-referral bias;

14. a description of the method used to blind those who interpret the index and gold-standard tests;

15. the number of tests that were technically suboptimal or were considered uninterpretable; and

16. the source of funding for the study, to allow identification of possible conflicts of interest.

Two of these items deserve additional attention, because they can be sources of hidden bias in a study of diagnostic technology. Number 8 refers to the pitfall of "circular assessment," which must be avoided when choosing a reference standard. This occurs when the result of one of the index tests in a comparative study is used to define the true state of the patient. To obtain a valid measure of each test's performance, the tests must be assessed independently of one another, using a different method to verify the presence or absence of disease.

Number 15 in the list above alludes to another potential source of bias:

reports of studies of diagnostic technology seldom include the number of test results that were considered uninterpretable or indeterminate. In one review of ten papers on CT, only five dealt explicitly with the number of unsatisfactory exams. Such information is essential, however, if efficacy is to be judged. For example, if a test detects renal lesions in 70 of 100 patients, misses them in 10, and results in technically suboptimal examinations in 20, the overall sensitivity is 70 over 100 (70 percent). Frequently, the 20 poor-quality exams are excluded, and the sensitivity reported is 70 divided by 80 (88 percent) (Abrams 1981). Thus, if investigators fail to consider the impact of ignoring poor-quality exams, the true-positive and false-positive rates may be artificially inflated (Begg et al. 1986).

CONCLUSION

In this chapter, we have examined the difficulties encountered in each stage of a primary technology assessment, from the planning and design process through the production of the final report. The solutions to some of these problems are relatively straightforward. For example, we have methods to avoid test-review and diagnosis-review bias. We also know that increasing the level of cooperation among participating individuals and institutions will go a long way toward improving the outcome of a study. The solutions to other problems, such as when to conduct the assessment or which application to assess, are less obvious. In emphasizing some of the barriers to primary data collection, we have attempted to forestall such difficulties in future assessments. In posing a number of unanswered questions, we hope to encourage the research necessary to resolve these problems, and thus enhance the value of diagnostic technology assessment.

REFERENCES

Abrams, H.L. Evaluating computed tomography. In Alterman, P.S., Gastel, B., and Eliastam, M., eds., Assessing Computed Tomography, pp. 1-17. National Center for Health Care Technology Monograph Series. Washington, D.C., U.S. Department of Health and Human Services, May 1981.

Abrams, H.L. The Garland lecture. Coronary arteriography: Pathologic and prognostic implications. American Journal of Roentgenology 139:1-18, 1982.

Abrams, H.L., and Hessel, S. Health technology assessment: Problems and challenges. American Journal of Roentgenology 149:1127-1132, 1987.

Alderson, P.O., Adams, D.F., McNeil, B.J., et al. Computed tomography, ultrasound, and scintigraphy of the liver in patients with colon or breast carcinoma: A prospective comparison. Radiology 149:225-230, 1983.

Alperovitch, A. Controlled assessment of diagnostic techniques: Methodological problems. Effective Health Care 1:187-190, 1983.

Angell, M. Patients' preferences in randomized clinical trials. New England Journal of Medicine 310:1385-1387, 1984.

Begg, C.B., Greenes, R.A., and Iglewicz, B. The influence of uninterpretability on the assessment of diagnostic tests. Journal of Chronic Diseases 39:575-584, 1986.

Brogdon, B.G., Kelsey, C.A., and Moseley, R.D. Effect of fatigue and alcohol on observer perception. American Journal of Roentgenology 130:971-974, 1978.

Brown, B.W., Jr., and Hollander, M. Statistics: A Biomedical Introduction. New York, John Wiley & Sons, 1977.

Cassileth, B.R., Lusk, E.J., Miller, D.S., and Hurwitz, S. Attitudes toward clinical trials among patients and the public. Journal of the American Medical Association 248:968-970, 1982.

Cassileth, B.R., Zupkis, R.V., Sutton-Smith, K., et al. Informed consent: Why are its goals imperfectly realized? New England Journal of Medicine 302:896-900, 1980.

Chalmers, T.C., Smith, H., Jr., Blackburn, B., et al. A method for assessing the quality of a randomized control trial. Controlled Clinical Trials 2:31-49, 1981.

Croke, G. Recruitment for the National Cooperative Gallstone Study. In Roth, H.P., and Gordon, R.S., Jr., eds., Proceedings of the National Conference on Clinical Trials Methodology, October 1977. Clinical Pharmacology and Therapeutics 25:691-694, 1979.

Cummings, S.R., Hulley, S.B., and Siegel, D. Implementing the study: Pre-testing, quality control and protocol revisions. In Hulley, S.B., and Cummings, S.R., eds., Designing Clinical Research: An Epidemiological Approach. Baltimore, Williams and Wilkins, 1988.

Drummond, M. Guidelines for health technology assessment: Economic evaluation. In Feeny, D., Guyatt, G., and Tugwell, P., eds., Health Care Technology: Effectiveness, Efficacy and Public Policy. Montreal, The Institute for Research on Public Policy, 1986.

Feinstein, A.R. An additional science for clinical medicine: II. The limitations of randomized trials. Annals of Internal Medicine 99:544-550, 1983.

Feinstein, A.R. Clinical biostatistics. VIII. An analytic appraisal of the University Group Diabetes Program (UGDP) study. Clinical Pharmacology and Therapeutics 12:167-191, 1971.

Ferguson, J.H. Director, Office of Medical Applications of Research. Personal communication, 1988.

Ferris, F.L., and Ederer, F. External monitoring in multiclinic trials: Applications from ophthalmologic studies. In Roth, H.P., and Gordon, R.S., Jr., eds., Proceedings of the National Conference on Clinical Trials Methodology, October 1977. Clinical Pharmacology and Therapeutics 25:720-723, 1979.

Fineberg, H.V., Bauman, R., and Sosman, M. Computerized cranial tomography: Effect on diagnostic and therapeutic plans. Journal of the American Medical Association 238:224-230, 1977.

Fineberg, H.V., and Hiatt, H.H. Evaluation of medical practices: The case for technology assessment. New England Journal of Medicine 301:1086-1091, 1979.

Freiman, J.A., Chalmers, T.C., Smith, H., Jr., and Kuebler, R.R. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. New England Journal of Medicine 299:690-694, 1978.

Guyatt, G., and Drummond, M. Guidelines for the clinical and economic assessment of health technologies: The case of magnetic resonance. International Journal of Technology Assessment in Health Care 1:551-566, 1985.

Guyatt, G.H., Tugwell, P.X., Feeny, D.H., et al. The role of before-after studies of therapeutic impact in the evaluation of diagnostic technologies. Journal of Chronic Diseases 39:295-304, 1986.

Hanley, J.A., and McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29-36, 1982.

Hessel, S.J., Siegelman, S.S., McNeil, B.J., et al. A prospective evaluation of computed tomography and ultrasound of the pancreas. Radiology 143:129-133, 1982.

Hopwood, M.D., Mabry, J.C., and Sibley, W.L. A first-order characterization of clinical trials. Prepared for the National Institutes of Health by the Rand Corporation. R-2653-NIH, September 1980: 61-62.

Hull, R.D., Hirsh, J., Carter, C.J., et al. Diagnostic value of ventilation-perfusion lung scanning in patients with suspected pulmonary embolism. Chest 88:819-828, 1985.

Hunter, J.E. Meta-analysis: Cumulating Research Findings Across Studies. Beverly Hills, California, Sage Publications, 1982.

Kent, D.L., and Larson, E.B. Diagnostic technology assessment: Problems and prospects. Annals of Internal Medicine 108:759-761, 1988.

Lidz, C.W., Meisel, A., Osterweis, M., et al. Barriers to informed consent. Annals of Internal Medicine 99:539-543, 1983.

Ludwig, E.G., and Coletti, J.C. Some misuses of health statistics. Journal of the American Medical Association 216:493-499, 1971.

Marks, J.W., Croke, G., Gochman, N., et al. Major issues in the organization and implementation of the National Cooperative Gallstone Study (NCGS). Controlled Clinical Trials 5:1-12, 1984.

Mattson, M.E., Curb, J.D., McArdle, R., et al. Participation in a clinical trial: The patients' point of view. Controlled Clinical Trials 6:156-167, 1985.

McNeil, B.J. Pitfalls in and requirements for evaluations of diagnostic technologies. In Wagner, J., ed., Proceedings of a Conference on Medical Technologies, DHEW Pub. No. (PHS) 79-3254, pp. 33-39. Washington, D.C., U.S. Government Printing Office, 1979.

McNeil, B.J., Sanders, R., Alderson, P.O., et al. A prospective study of computed tomography, ultrasound, and gallium imaging in patients with fever. Radiology 139:647-653, 1981.

McNeil, B.J., Weichselbaum, R., and Pauker, S.G. Fallacy of the five-year survival in lung cancer. New England Journal of Medicine 299:1397-1401, 1978.

Metz, C.E. Basic principles of ROC analysis. Seminars in Nuclear Medicine 8:283-298, 1978.

Mosteller, F., Gilbert, J.P., and McPeek, B. Reporting standards and research strategies for controlled trials: Agenda for the editor. Controlled Clinical Trials 1:37-58, 1980.

Office of Technology Assessment, U.S. Congress. The Implications of Cost-Effectiveness Analysis of Medical Technology. Stock No. 051-003-00765-7. Washington, D.C., U.S. Government Printing Office, 1980a.

Office of Technology Assessment, U.S. Congress. The Implications of Cost-Effectiveness Analysis of Medical Technology. Background Paper #1: Methodological issues and literature review. Washington, D.C., U.S. Government Printing Office, 1980b.

Office of Technology Assessment, U.S. Congress. The Implications of Cost-Effectiveness Analysis of Medical Technology. Background Paper #2: Case studies of medical technologies. Case Study #2: The feasibility of economic evaluation of diagnostic procedures: The case of CT scanning. Washington, D.C., U.S. Government Printing Office, 1981.

Phelps, C.E., and Mushlin, A.I. Focusing technology assessment using medical decision theory. Medical Decision Making 8:279-289, 1988.

Pillemer, D.B., and Light, R.J. Synthesizing outcomes: How to use research evidence from many studies. Harvard Educational Review 50:176-195, 1980.

Prout, T.E. Other examples of recruitment problems and solutions. In Roth, H.P., and Gordon, R.S., Jr., eds., Proceedings of the National Conference on Clinical Trials Methodology, October 1977. Clinical Pharmacology and Therapeutics 25:695-696, 1979.

Schoenberger, J.A. Recruitment in the Coronary Drug Project and the Aspirin Myocardial Infarction Study. In Roth, H.P., and Gordon, R.S., Jr., eds., Proceedings of the National Conference on Clinical Trials Methodology, October 1977. Clinical Pharmacology and Therapeutics 25:681-684, 1979.

Schwartz, J.S. Evaluating diagnostic tests: What is done and what needs to be done. Journal of General Internal Medicine 1:266-267, 1986.

Sheedy, P.F., Stephens, D.H., Hattery, R.R., et al. Computed tomography in patients suspected of having carcinoma of the pancreas: Recent experience (abstract). Presented at the scientific assembly and annual meeting of the Radiological Society of North America, Chicago, Ill., November 1977.

Smith, T.J., Kemeny, M.M., Sugarbaker, P.H., et al. A prospective study of hepatic imaging in the detection of metastatic disease. Annals of Surgery 195:486-491, 1982.

Sox, H.C., Jr. Probability theory in the use of diagnostic tests: An introduction to critical study of the literature. Annals of Internal Medicine 104:60-66, 1986.

Steinberg, E.P., and Cohen, A.B. Office of Technology Assessment, U.S. Congress. Nuclear Magnetic Resonance Imaging Technology: A Clinical, Industrial, and Policy Analysis. Technology Case Study 27. Washington, D.C., U.S. Government Printing Office, 1984.

Taylor, K.M., Margolese, R.G., and Soskolne, C.L. Physicians' reasons for not entering eligible patients in a randomized clinical trial of surgery for breast cancer. New England Journal of Medicine 310:1363-1367, 1984.

Vreim, C. Project officer, Prospective Investigation of Pulmonary Embolism Diagnosis (PIOPED) project. Personal communication, 1988.

Weiner, D.A., Ryan, T.J., McCabe, C.H., et al. Exercise stress testing: Correlation among history of angina, ST-segment response and prevalence of coronary-artery disease in the Coronary Artery Surgery Study (CASS). New England Journal of Medicine 301:230-235, 1979.

Weinstein, M.C. Methodologic considerations in planning clinical trials of cost-effectiveness of magnetic resonance imaging (with a commentary on Guyatt and Drummond). International Journal of Technology Assessment in Health Care 1:567-581, 1985.

Weinstein, M.C., and Stason, W.B. Foundations of cost-effectiveness analysis for health and medical practices. New England Journal of Medicine 296:716-721, 1977.