OCR for page 55
3
Assessment:
Problems and
Proposed Solutions¹
The purpose of this chapter is to describe some of the problems one
encounters in evaluating diagnostic technology and to propose an ap-
proach that avoids many of them. Our underlying premise is that the
public is ill-served by current approaches to assessment of diagnostic
technology.
Medical tests are more difficult to evaluate than medical treatments. A
treatment is typically evaluated through a clinical trial in which patients
are randomly assigned to the treatment group or to a control group, which
may receive a placebo or conventional therapy. The endpoint of the trial
may be physiologic (for example, blood pressure) or functional (such as
the ability to walk without developing chest pain), but most often the
endpoint is the development of a disease (for example, acute myocardial
infarction) or death. In the best trials, therapy subsequent to randomiza-
tion is controlled, and the only variable that differentiates the intervention
and control groups is the intervention. Under these circumstances, the
investigators are often able to attribute differences in outcome to the
intervention. By contrast, evaluation of a diagnostic test can occur at
several levels, as outlined by Fineberg (Fineberg et al. 1977).
¹Parts of this chapter are adapted from a technical report (Sox 1987).
ASSESSMENT OF DIAGNOSTIC TECHNOLOGY
MEASURES OF CLINICAL EFFICACY
Technical Capability
This measure answers the question, "Does the test do what the manu-
facturer says it does?" For example, an MRI scanner meets this criterion
if it produces a crisp image of the brain, regardless of whether that image
fairly reflects the true state of the brain. The Food and Drug Admini-
stration currently requires this level of assessment for diagnostic tech-
nologies before it will issue premarketing approval.
Sensitivity and Specificity
These two measures of test performance are the most widely used indi-
cators of efficacy. They may help one to decide which of several diagnos-
tic tests is superior, but the verdict is sometimes a split decision: one tech-
nology has a lower false-negative rate while the other has a lower false-
positive rate. Furthermore, these measures are not sufficient to indicate
whether the test should be done. In many cases, the test is so inaccurate
and the treatment so safe and effective that the patient should be treated
without testing in order to avoid the possibility of being misled by a false-
negative result.
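The "split decision" described above can be made concrete with a small calculation. The following is a minimal sketch, assuming each test has been compared with a gold standard in 100 diseased and 100 nondiseased patients; the counts and the function name `performance_rates` are hypothetical and are not taken from any study cited in this chapter.

```python
def performance_rates(true_pos, false_neg, false_pos, true_neg):
    """Summarize a 2x2 table of index-test result versus gold-standard result."""
    diseased = true_pos + false_neg
    nondiseased = false_pos + true_neg
    return {
        "sensitivity": true_pos / diseased,
        "specificity": true_neg / nondiseased,
        "false_negative_rate": false_neg / diseased,
        "false_positive_rate": false_pos / nondiseased,
    }

# Hypothetical counts: 100 diseased and 100 nondiseased patients per test.
rates_a = performance_rates(true_pos=90, false_neg=10, false_pos=30, true_neg=70)
rates_b = performance_rates(true_pos=75, false_neg=25, false_pos=10, true_neg=90)

for name, rates in (("Test A", rates_a), ("Test B", rates_b)):
    print(name, {key: round(value, 2) for key, value in rates.items()})
```

With these illustrative numbers, Test A misses fewer diseased patients while Test B mislabels fewer nondiseased patients, so neither test is superior on both measures.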
Diagnostic Impact
Do the test results alter the pattern of diagnostic testing? Does the test
replace other tests, including some that are more hazardous or costly?
This outcome is relatively easy to measure, and, because it occurs in the
near term, one can often attribute an effect on patterns of testing to a new
technology. Noninvasive methods of imaging internal organs have had a
major impact on medical care because the information they provide has
reduced the number of invasive diagnostic studies performed. CT scan-
ning of the head has reduced the number of craniotomies for head trauma
(Ambrose et al. 1976). This measure of efficacy, however, is not suffi-
cient to answer the important question, "Should I do this test on this
patient?"
Resolution of diagnostic uncertainty is one measure of diagnostic
impact. There is ample evidence that patients seek relief from uncertainty
and that diagnostic tests play a role in satisfying them (Sox et al. 1978,
Marton et al. 1980). The physician must use reassurance when it is
indicated, as when a test result reduces the probability of disease to the
point where no further intervention is needed. Reassurance following a
negative result on a test that has a high false-negative rate may not be
appropriate in some cases, particularly if the physician strongly suspected
that disease was present before doing the test.
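The caution about reassurance after a negative result follows from Bayes' theorem. The sketch below is illustrative only: the sensitivity, specificity, and pretest probabilities are hypothetical values chosen to show the effect, and the function name is an assumption of this example.

```python
def posttest_probability_after_negative(pretest, sensitivity, specificity):
    """Probability of disease given a negative test result (Bayes' theorem)."""
    false_negative_rate = 1.0 - sensitivity
    negative_and_diseased = pretest * false_negative_rate
    negative_and_healthy = (1.0 - pretest) * specificity
    return negative_and_diseased / (negative_and_diseased + negative_and_healthy)

# Hypothetical test with a high false-negative rate (40 percent).
sensitivity, specificity = 0.60, 0.90

strong_suspicion = posttest_probability_after_negative(0.80, sensitivity, specificity)
weak_suspicion = posttest_probability_after_negative(0.05, sensitivity, specificity)

print(f"pretest 0.80 -> after negative result: {strong_suspicion:.2f}")
print(f"pretest 0.05 -> after negative result: {weak_suspicion:.3f}")
```

Under these assumptions, a negative result leaves the probability of disease well above one half when the physician strongly suspected disease beforehand, whereas the same result brings a low pretest probability close to zero.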
Therapeutic Impact
If a test alters the choice of the treatment for the patient, it meets this
criterion for efficacy. The threshold model is built around the assumption
that an effect on therapy is the sine qua non for doing a test. But, as
indicated in Chapter 2, a test may alter therapy in one patient but not in
another, depending on the pretest probability and the treatment threshold.
Impact on Clinical Outcomes
The ultimate measure of a test is its ability to alter the patient's outlook
by leading to changes in management that reduce symptoms or prolong
life. The determinants of long-term outcome are many. The accuracy,
cost, and morbidity of a test may be much less important than when it is
done in the natural history of the illness. The most important determinant
of clinical outcome is therapy rather than diagnosis (Abrams and McNeil
1978). Improved imaging of metastases to the liver from a colon cancer
does not improve the patient's outcome because there are no highly
effective treatments for metastatic colon cancer. Imaging of metastases
may, however, spare a patient from abdominal exploratory surgery that
cannot alter the long-term prognosis. An improved short-term outcome
does not necessarily imply an improved long-term outcome.
This summary of the measures of clinical efficacy indicates the futility
of basing a decision about a technology on a single dimension. The way
through this dilemma is to focus on the patient's needs. The right
question about a technology is "Will this maximize this patient's chances
for the best achievable outcome?" In some cases, the answer to this
question is the same for a large class of patients, and one can formulate a
general recommendation. In others, the answer depends on the value that
the individual patient places on the outcomes that the illness and its
treatment may entail. In this case, a general recommendation may not be
possible. In this chapter, we show how a technology assessment can
provide the data that allow a physician to identify which management
alternative will maximize the patient's chances for a good outcome.
THE SYSTEM OF TECHNOLOGY ASSESSMENT IN THE
UNITED STATES
The Pyramid
The system of technology assessment in the United States is a pyramid
with several layers. The broad base of the system consists of clinical
studies in which physicians subject patients to a technology and observe
them for its effects. These studies include a research question, which
usually calls for the comparison of several technologies; they also include
a rigorous study design and meticulous implementation of the study
protocol. In studies of diagnostic technology, the index test, one or more
competing tests, and a gold-standard test are performed in a series of
patients. The published reports of these studies seldom put their findings
into a form that helps the clinician to decide which patients will benefit
most from the test or procedure.
A second layer is made up of individuals who review the literature and
try to distill the evidence into recommendations that are true to the facts.
These individuals are typically clinicians who have training in the disci-
plines of meta-analysis, clinical epidemiology, decision analysis, and
cost-effectiveness analysis. This method has been used to identify out-
moded, overused, and ineffective technologies. More recently, it has
been used to make recommendations for using a diagnostic test or choos-
ing among tests.
On the next layer are organizations that do technology assessment.
They differ in their approach, but the starting point is frequently a techni-
cal background paper written by an individual who reads the literature and
proposes guidelines for using the technology. The conclusions of this
paper are reviewed by others, and clinical policy recommendations are
forged by some consensus process. The American College of Physicians'
Clinical Efficacy Assessment Program (CEAP) is a prototype of this
approach.
Policymakers sit atop the pyramid and are the ultimate consumers of
technology assessment. What they consume is the product of analysis and
consensus, and it generally takes the form of recommendations about the
usefulness of a technology. The individual physician, who bases deci-
sions about using technology on published reports of assessments, is a
policymaker. Other policymakers work for third-party payers, who exert
control over medical practice by their coverage policy.
This description of the technology assessment system in the United
States shows that many individuals and organizations depend on good
studies of technology. We use the term "primary technology assessment"
to denote studies in which clinical data are obtained systematically on
patients who have been subjected to a health intervention, such as a
diagnostic test or treatment. In the next section, we discuss some of the
methodological problems that are encountered in doing primary assess-
ments of diagnostic tests.
PROBLEMS WITH THE CURRENT SYSTEM
Standards of evidence are incomplete. The standard of evidence for
the efficacy of therapeutic technologies, such as surgical operations or
drugs, has become the randomized clinical trial. This standard may be in-
sufficient for clinical decisionmaking. One drug is considered superior to
another if there is a statistically significant difference in a measure of
outcome, such as survival. Achieving this criterion does not mean that the
drug should be used in all patients. This decision may depend on the
characteristics of the individual patient, including the value he or she
places on the benefits, adverse effects, and costs of the drug. One can use
expected-value decisionmaking to identify the best alternative for an
individual patient.
According to the decision model described in Chapter 2, the usefulness
of a test depends on the clinical circumstances. Among these is the pretest
probability of disease. One of a pair of competing tests may be preferred
in patients with a low pretest probability of disease, while the other test
should be preferred in patients with a high pretest probability. In sum-
mary, we suggest that the efficacy of a test is context-dependent.
Studies do not gather the data needed for decisions in individual
patients. Large-scale, randomized trials sometimes lead to the conclusion
that a given therapy is preferred only in a subgroup of patients. They do
so by gathering the clinical data necessary to subclassify patients. Studies
of diagnostic tests could be carried out in the same way, but they seldom
are. For example, published studies of diagnostic tests infrequently report
clinical prediction rules for estimating pretest probability.
Studies of technology often apply only to a narrow spectrum of pa-
tients. As discussed in Chapter 2, the patients who are enrolled in a study
of a diagnostic test are often a small minority of those who actually
receive it (Philbrick et al. 1980). Similarly, randomized clinical trials
may exclude many patients, such as those with more than one disorder, in
order to maximize the chances of obtaining an unequivocal answer. The
results of these studies may not apply to many patients of concern to
clinicians.
Studies of diagnostic tests often do not compare a new test with an
established test. Randomized clinical trials of treatments usually com-
pare a new therapy to an established therapy or a placebo. Studies of
diagnostic tests often do not compare one test with a competing test.
When competing tests are compared, the design of the study usually
precludes a complete answer to such questions as, "Should I do Test A but
not Test B? Both Test A and Test B? Test A followed by Test B only if
Test A is negative?"
Studies are seldom timely. The earliest studies of a new technology
tend to be misleadingly optimistic about its performance, often because
the study populations are not clinically relevant (Ransohoff and Weinstein
1978). Practice patterns are often established on the basis of early studies.
Similarly, when hospital managers decide to invest in a new technology,
they must often base their decision on early studies. Therefore, the quality
of early studies must be improved.
Technology is constantly changing. By the time a study is completed,
the test or imaging device has changed, and no one believes that the
results apply to the new, improved technology. Technical changes may
improve the image provided by a scanner, but they do not necessarily lead
to a lower false-negative or false-positive rate, nor do they guarantee
improvement in clinical outcomes. Technology assessment should be
done quickly. For example, a multi-institutional study could take but a
few months. Also, there should be a system for monitoring, and perhaps
reevaluating, the technology as it matures.
The results of a study may apply to a narrow spectrum of the users of
the technology. Published assessments of a diagnostic technology are
usually done in academic medical centers. The use of the technology in
such centers may differ greatly from its use in a community hospital. The
indications for using the test, the spectrum of patients, the technique for
using the equipment, and the skill of the clinician who interprets the
results are but a few of the areas in which an academic medical center
may differ from a community hospital.
Two recent case reports illustrate some of the difficulties that are
caused by inadequate primary technology assessment.
- Case Report 1. Premature obsolescence: standard chest X-ray to-
mography. Computed tomography (CT) was widely adopted before
it had been compared with what was then the standard method for
imaging the chest, standard X-ray tomography. Relatively few
studies had compared the tests in the same patients. A review that
compared their accuracy brought out some unexpected findings
(Inouye and Sox 1986). CT was superior to standard tomography
for some indications. When 16 studies of chest tomography for
mediastinal metastases were reviewed, however, the frequency of
false-negative results was lower for CT, but the frequency of false-
positive results was lower for standard tomography. Furthermore,
the differences in accuracy were too small to be important for
decisionmaking. By now, however, most radiologists consider stan-
dard tomography to be obsolete in the study of most intrathoracic
disorders.
Comment: Large-scale, multi-institutional, prospective studies com-
paring CT and standard tomography should have been done very
early in the history of the new technology. These might have shown
that the two procedures were equivalent in most patients and might
have defined patient subgroups in which one test was clearly supe-
rior.
- Case Report 2. Premature adoption of a new technology: magnetic
resonance imaging. Magnetic resonance imaging (MRI) is being
adopted by hospitals throughout the United States and may eventu-
ally replace computed tomography (CT) in studies of the central
nervous system (Steinberg et al. 1985). MRI provides a remarkable
definition of central nervous system structures. The images are
striking in their detail, but those who purchase MRI scanners or use
them should ask several pertinent questions: Does the improved
image lead to lower false-negative rates without increasing false-
positive rates? Does MRI lead to useful changes in diagnostic
certainty, choice of therapy, or even clinical outcome? The answers
to these questions were not available when many MRI scanners
were purchased, because most early studies of MRI were relatively
unsatisfactory (Kent et al. 1988, NIH Consensus Conference 1988).
We now turn to a discussion of how diagnostic tests should be evalu-
ated.
RANDOMIZED TRIALS OF DIAGNOSTIC TESTS
A well-designed and well-executed randomized clinical trial is widely
regarded as the most powerful method for comparing technologies. Sources
of ambiguity in data interpretation are, in principle, removed by randomi-
zation, because this process assures that all potentially influential vari-
ables, known and unknown, are distributed equitably among the study
groups. Blinding of the investigator and the patient to the assigned
intervention reduces bias in obtaining data from patients. A well-con-
ducted trial has internal safeguards to assure strict adherence to the study
protocol.
Limitations of Randomized Trials
The cost may be high. Randomized trials can be very costly if stan-
dardization of the intervention requires special care for patients. By
focusing on effectiveness (measuring effects under usual patient care
conditions) rather than efficacy (measuring effects under ideal circum-
stances), the costs of a randomized trial can be kept to a minimum.
The study population may be too small. Many chronic diseases prog-
ress slowly, and outcome events accumulate slowly unless the study
population is very large. Evaluating an intervention in subgroups of
patients may require an unrealistically large number of patients. A large
study population is also required if the intervention is expected to have a
small effect. These problems can often be avoided by self-discipline
when formulating the study hypotheses. Sometimes the requirement for a
large sample size is unavoidable, and many medical centers may be
required to assemble a sufficient sample of patients.
The technology may become obsolete before the study is complete.
Studies that continue for many years run the risk that the results will be
irrelevant because of technological advances that have occurred during
the years of the study.
The results may apply to a narrow spectrum of patients. Most random-
ized trials exclude many patients. For example, only 12.7 percent of the
patients in the Coronary Artery Surgery Study were randomized to re-
ceive surgery or medical therapy (CASS Principal Investigators 1983).
The remainder were not enrolled because they met one of many exclusion
criteria. A study performed in a single institution may have a limited
spectrum of study patients. Because of these problems, the results of a
study may not necessarily apply to patients who are important both to
clinicians and to policymakers. The exclusion of patients older than age
65 from the Coronary Artery Surgery Study is an example (CASS Princi-
pal Investigators 1983). Ideally, a randomized trial should include a wide
spectrum of care facilities and should enroll patients who might be ex-
cluded from other studies.
The trial may not measure outcomes of clinical interest. By focusing
on the principal clinical hypothesis, past randomized control trials have
often failed to study other measures of the effect of the intervention.
Return to work, psychological status, and social function all measure the
impact of successful treatment. Many observers feel that these "secon-
dary endpoints" are as important as the primary endpoint of the study,
which is usually mortality from the disease. For example, cost-effective-
ness is becoming a study endpoint in many trials.
Trial Design
A randomized trial of a diagnostic test is a powerful method for
evaluating its effects on patient care. The test can be compared against
another test or against no test. In general, a comparison with no test is not
ethically sound. Most would agree that a patient could be randomized to
no test only if the clinical history could be relied upon to be sure that the
patient does not have the disease in question. Under these circumstances,
little can be learned about the effect of the test, other than its psychologi-
cally mediated effects (Sox et al. 1981). In general, a randomized trial
will compare two putatively similar diagnostic tests, such as MRI and CT.
There have been few such studies, and this approach deserves greater use.
One of the advantages of a randomized trial is that the principal study
endpoint is a clinical outcome (for example, length of hospitalization,
use of other tests, morbidity, or mortality). In contrast to studies that
measure test accuracy, there is no need to perform a potentially dangerous
gold-standard test on all patients. This advantage suggests two types of
randomized trials of diagnostic tests.
"Off the Gold Standard"
If the goal of the study is to measure clinical outcomes rather than test
accuracy, one can ethically enroll anyone who needs the index test. Study
patients are randomly assigned to have the index test or the alternative test
and are then monitored for the occurrence of the endpoint of the study,
which could be length of hospitalization, total cost of care, or functional
status one month after enrolling. Being able to enroll all patients means
that there is little problem with bias in selecting patients, and the findings
will apply to primary care populations. A randomized study can show
which diagnostic test is superior, and subgroup analysis can identify
patients who benefit particularly from a given test. Nevertheless, this type
of randomized trial cannot measure the true-positive and false-positive
rates of the index test, because the gold-standard test will be performed
irregularly, or perhaps not at all. Therefore, this study does not provide all
the information that is required to interpret a test or to decide if it is
necessary to perform the test.
"On the Gold Standard"
Studies that directly compare two tests are important. One way to do
these studies is to perform both tests on each of a series of patients. This
approach is costly, time-consuming, and potentially risky for the patient,
and many patients may refuse to enroll. The alternative is to allocate
patients at random to one of two putatively equivalent tests, perform the
gold-standard test on all patients, and then measure clinical outcomes.
This approach allows one to compare accuracy and effect on short-term
clinical outcomes, such as short-term morbidity and mortality, reduction
in diagnostic uncertainty, and altered choice of therapy and other tech-
nologies. The shortcoming of this approach is that many patients in the
clinically relevant population will not be enrolled because their physi-
cians do not refer them for the gold-standard test. The result will be biased
measures of test performance and a relatively select study population,
which compromises the generalizability of the outcome studies.
A randomized trial that compares the effect of two diagnostic tests on
clinical outcomes poses another potential problem. In a trial, the test is
done on all patients who are assigned to have it, rather than on those
selected because the test was indicated. If there is a narrow range of
pretest probabilities for which a test is likely to be useful, few patients
who are randomly assigned to get the test will benefit from it. As a result,
the number of patients needed to detect a clinically significant effect on
outcomes may be very large, and there is a particularly high probability
that a negative result will fail to detect a clinically significant true differ-
ence.
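The effect of this dilution on study size can be illustrated with the usual normal-approximation formula for comparing two proportions. This is a rough sketch, not a trial-planning tool: the event rates, the fraction of patients who can benefit, and the helper names are all hypothetical, and the two-sided 5 percent significance level and 80 percent power are conventional assumptions rather than values from this chapter.

```python
import math
from statistics import NormalDist

def patients_per_arm(rate_control, rate_index, alpha=0.05, power=0.80):
    """Approximate patients per arm needed to detect a difference in event rates."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = rate_control * (1 - rate_control) + rate_index * (1 - rate_index)
    difference = rate_control - rate_index
    return math.ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)

def diluted_rate(rate_control, reduction_if_useful, fraction_useful):
    """Event rate in the index arm when only a fraction of patients can benefit."""
    return rate_control - fraction_useful * reduction_if_useful

baseline = 0.20    # hypothetical event rate with the alternative test
reduction = 0.10   # hypothetical absolute reduction in patients who can benefit

n_all_benefit = patients_per_arm(baseline, diluted_rate(baseline, reduction, 1.0))
n_few_benefit = patients_per_arm(baseline, diluted_rate(baseline, reduction, 0.2))

print("patients per arm if every patient can benefit:", n_all_benefit)
print("patients per arm if one in five can benefit:  ", n_few_benefit)
```

Because the required sample size grows roughly with the inverse square of the observable difference, restricting the benefit to one patient in five raises the required enrollment far more than fivefold under these assumptions.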
The randomized trial of the effect of a test on clinical outcomes has
been underutilized and deserves greater attention from investigators. Much
of this attention should be directed at the potential problems of study
design and interpretation.
A PROPOSAL FOR MODEL-DRIVEN TECHNOLOGY
ASSESSMENT
Most studies of diagnostic tests have measured little more than the
false-negative rate and false-positive rate of a given test. This section
describes an approach that we call "model-driven." In model-driven
technology assessment, the data to be obtained are specified by a method
for making decisions (Sox 1987, Phelps and Mushlin 1988). We have
used the threshold model for test-treatment selection to illustrate this
discussion, but the particulars of the model are less important than the
principle, that is, that one should obtain the data that will enable the
clinician to identify the decision alternative that will be most useful to the
patient.
A Technology Should Be Compared with a Competing Technology
The decision to adopt a new technology often means abandoning an
older technology. In evaluating a technology of any kind, one should ask
in what ways it is better than another (its marginal effectiveness). Many
studies of diagnostic technology have not been comparative. There have
been very few randomized trials comparing the effects of tests on out-
comes. Too few studies have compared the accuracy of two tests by
doing both of them on a series of patients.
The ideal study. The marginal effectiveness of a technology may be
measured and its true value discovered only by comparison with another
clinical method. A new technology may be compared with an old one, or
two established technologies may be compared.
There are two types of studies of diagnostic tests. Ideally, a diagnostic
test will always be compared to some other method for obtaining infor-
mation, such as the patient's history and physical examination or another
diagnostic test.
First, the effects of two or more tests on clinical outcomes can be
compared. The marginal effect of a new test may be discerned by a
randomized trial in which the effect of the test on patient care outcomes is
measured directly, rather than inferred from probabilistic and decision-
analytic models. The potential limitations of this approach are discussed
in the preceding section.
Second, the performance characteristics of the tests can be compared. Compar-
ing the frequency of false-negative and false-positive results in two or
more tests provides the necessary data for a decision model that will help
to indicate which test is preferred. Patients can be randomized to have
one test or the other, or both tests can be done for each patient.
Studies Should Be Planned Before Enrolling the First Patient
Most studies of diagnostic tests have retrospectively analyzed data that
had been obtained for another purpose. Thus, they describe clinical
experience rather than planned research. Typically, the index test has
been performed on many patients, but only a few have had the gold-
standard test. The characteristics of the index test group are seldom
compared with the characteristics of those who also undergo the gold-
standard test. Other clinical data have been obtained irregularly. Most of
the defects of past studies are attributable to their retrospective character.
The ideal study. A study should be planned in advance to assure adher-
ence to a uniform data collection protocol. Bias in selecting patients and
interpreting data can be reduced by planning. In a multicenter study, all
participants follow the same data collection protocol.
All the Data that Are Needed for Clinical Decision Making Should Be
Collected
Past studies have measured the accuracy of a test, but they have not
collected all the data required to help physicians make decisions concern-
ing individual patients. For instance, sequences of tests are not reported,
although physicians must often choose between doing such a sequence or
doing one test. Using Bayes' theorem to interpret the second test in a
sequence usually requires assuming that the performance of the second
test is conditionally independent of the results of the first. In some
studies, two tests have been performed on a series of patients, and the
operating characteristics of each test have been reported; seldom reported,
however, is the frequency of each combination of results (both positive,
Test A positive and Test B negative, Test A negative and Test B positive,
both negative) in diseased and nondiseased patients.
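A short calculation suggests why the frequency of each combination of results matters. The sketch below compares the post-test probability after two positive results computed from the observed joint frequencies with the value obtained by assuming conditional independence. The counts are hypothetical, constructed so that the two tests tend to err in the same patients, and the helper names are assumptions of this example.

```python
def posttest_probability(pretest, p_result_if_diseased, p_result_if_nondiseased):
    """Bayes' theorem for any test result or combination of results."""
    numerator = pretest * p_result_if_diseased
    return numerator / (numerator + (1 - pretest) * p_result_if_nondiseased)

def fraction(counts, predicate):
    """Fraction of patients whose (Test A, Test B) result pair satisfies predicate."""
    total = sum(counts.values())
    return sum(n for pair, n in counts.items() if predicate(pair)) / total

# Hypothetical counts of each (Test A, Test B) combination.
diseased = {("+", "+"): 70, ("+", "-"): 10, ("-", "+"): 5, ("-", "-"): 15}
nondiseased = {("+", "+"): 8, ("+", "-"): 12, ("-", "+"): 7, ("-", "-"): 73}

pretest = 0.30
both_positive = lambda pair: pair == ("+", "+")

# Using the observed joint frequencies of the "both positive" combination.
from_joint = posttest_probability(
    pretest, fraction(diseased, both_positive), fraction(nondiseased, both_positive)
)

# Assuming conditional independence: multiply each test's marginal rates.
sens_a = fraction(diseased, lambda pair: pair[0] == "+")
sens_b = fraction(diseased, lambda pair: pair[1] == "+")
fpr_a = fraction(nondiseased, lambda pair: pair[0] == "+")
fpr_b = fraction(nondiseased, lambda pair: pair[1] == "+")
from_independence = posttest_probability(pretest, sens_a * sens_b, fpr_a * fpr_b)

print(f"joint frequencies:        {from_joint:.3f}")
print(f"independence assumption:  {from_independence:.3f}")
```

In this constructed example the independence assumption overstates the probability of disease after two positive results, which is the kind of error that reporting the four combinations would let a reader detect.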
Test results are not reported as continuous variables. The interpreta-
tion of a test usually depends on the extent of the abnormality. Thus, an
orange-size lung mass is more likely to be malignant than a pea-size mass.
To make use of this information in decisionmaking, the false-negative
rate and false-positive rate for an orange-size mass should be reported
separately from these rates for a pea-size mass. In most published studies,
the results have been reported simply as "normal" or "abnormal." As
discussed in Chapter 2, the operating characteristics of a test really reflect
the criterion for calling a particular result "abnormal." Thus, optimal
decisionmaking requires reporting the true-positive rate and false-positive
rate of each of a series of definitions of an abnormal result.
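Reporting the rates for each of a series of definitions of "abnormal" amounts to tabulating one pair of rates per cutoff. The following minimal sketch assumes the test result is a lesion diameter and that larger lesions are more suspicious; the diameters, cutoffs, and function name are hypothetical.

```python
def rates_by_cutoff(diseased_values, nondiseased_values, cutoffs):
    """True-positive and false-positive rate for each definition of 'abnormal'.

    A result is called abnormal when the measured value is at or above the cutoff.
    """
    table = []
    for cutoff in cutoffs:
        tpr = sum(v >= cutoff for v in diseased_values) / len(diseased_values)
        fpr = sum(v >= cutoff for v in nondiseased_values) / len(nondiseased_values)
        table.append((cutoff, tpr, fpr))
    return table

# Hypothetical lung mass diameters in centimeters.
malignant = [1.0, 2.5, 3.0, 4.0, 6.0, 8.0]
benign = [0.5, 0.8, 1.0, 1.5, 2.0, 3.5]

points = rates_by_cutoff(malignant, benign, cutoffs=[1.0, 2.0, 4.0])
for cutoff, tpr, fpr in points:
    print(f"abnormal if >= {cutoff} cm: true-positive rate {tpr:.2f}, "
          f"false-positive rate {fpr:.2f}")
```

Raising the cutoff lowers both rates, so each row describes a different trade-off rather than a single fixed property of the test.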
The ideal study. The ideal study of technology is model-driven: the
data to be obtained are those required by a model of the decisionmaking
process. According to the principles of expected-value decisionmaking,
the clinician must know the pretest probability of disease and must be able
to calculate the patient's expected utility for each of the decision alterna-
tives: treat without testing, do Test A, do Test B, or do nothing. To
provide the data needed for decision making, studies of diagnostic tests
should:
Develop clinical prediction rules for estimating pretest probability.
Clinical prediction rules estimate the probability of disease from the
history and physical examination and other data (see Chapter 2). To
develop a clinical prediction rule, one must obtain a complete problem-
related data set on each patient and do a gold-standard test to define his or
her true state. These data are easily obtained at little additional cost in a
prospective study to measure the false-negative rate and false-positive
rate of a test.
Measure the false-negative rate and false-positive rate of sequences of
tests. In a study that compares several diagnostic tests, each test should be
performed on each patient in the study, and the results of one test will be
reported separately under each set of results for the other tests. For
example, suppose that Test A and Test B are performed on all study
patients. The false-negative rate and false-positive rate of Test B will be
reported in patients who had a positive result on Test A and in patients
who had a negative result on Test A.
Report the operating characteristics for several different results of the
test. A study should enroll enough patients to report the accuracy of the
test in subgroups of patients who show increasingly abnormal results.
Results should be reported as receiver operating characteristic (ROC)
curves. Tests can be compared by calculating the area under their ROC
curves, although a more clinically useful comparison is the range of
disease probability over which the test is preferred.
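The area under an ROC curve can be approximated from a handful of reported operating points by the trapezoidal rule. This is one common estimate, offered here as a sketch; the operating points and the function name are hypothetical, and the endpoints (0, 0) and (1, 1) are added on the assumption that the curve spans the full range of cutoffs.

```python
def area_under_roc(operating_points):
    """Trapezoidal area under an ROC curve.

    operating_points: iterable of (false_positive_rate, true_positive_rate) pairs.
    """
    points = sorted(set(operating_points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical operating points for one test at two definitions of "abnormal".
test_points = [(0.10, 0.60), (0.30, 0.85)]
uninformative = [(0.50, 0.50)]  # a test no better than chance

print(f"area for the hypothetical test: {area_under_roc(test_points):.4f}")
print(f"area for a chance-level test:   {area_under_roc(uninformative):.2f}")
```

The area summarizes accuracy across all cutoffs in one number, which is why it cannot by itself show the range of disease probability over which a given test is preferred.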
Provide a decision model for identifying the preferred option. The
clinician can use the principles of expected-value decisionmaking to
identify the decision that will maximize the patient's chances for a favor-
able outcome. The principles of expected-value decisionmaking, and of
designing a decision model or decision tree, are described in Chapter 2.
The decision tree requires one to estimate the probabilities at the chance
nodes, which are usually obtained from published studies but could be
obtained from analysis of insurance claims data (Barry et al. 1988). The
tree also requires a quantitative measure for each outcome. This measure
could be life expectancy or a measure of patient preference, such as
utility. Each study patient's utility for each outcome state will be
measured using standard utility assessment techniques.
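A one-level decision tree of this kind can be sketched as follows. The option names, probabilities, and utilities are supplied by the caller and are purely illustrative; the functions are a hypothetical sketch of expected-value decisionmaking, not the model from Chapter 2.

```python
def expected_value(branches):
    """Expected value at a chance node.

    branches: list of (probability, outcome_value) pairs, where the
    outcome value might be life expectancy or a utility.
    """
    total_probability = sum(p for p, _ in branches)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("branch probabilities must sum to 1")
    return sum(p * value for p, value in branches)

def preferred_option(options):
    """Return the option with the highest expected value.

    options: dict mapping an option name to its chance-node branches.
    """
    values = {name: expected_value(b) for name, b in options.items()}
    best = max(values, key=values.get)
    return best, values
```

For example, with a 0.3 probability of disease, one might pass `{"treat": [(0.3, 0.9), (0.7, 0.95)], "no_treat": [(0.3, 0.5), (0.7, 1.0)]}`, where every number is an invented utility on a 0 to 1 scale.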
Consider treatment threshold probabilities. For many problems, the
threshold model of decisionmaking can help the physician decide whether
to treat, withhold treatment, or do a test or sequence of tests. The
physician can use intuition to estimate an individual patient's treatment
threshold or can use the analytic methods that were described in Chapter
2. The treatment threshold will vary from patient to patient because of
their different outcome preferences and their different clinical
characteristics. The distribution of treatment thresholds will provide an
essential background for physicians as they estimate the individual
patient's threshold. If the range of treatment thresholds is relatively
narrow, one can make general recommendations for using diagnostic tests.
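One common way to compute a treatment threshold analytically is to find the disease probability at which treating and not treating have equal expected utility. The sketch below assumes that standard two-option formulation; it may differ in detail from the methods of Chapter 2, and the four utility arguments are hypothetical inputs that would come from each patient's utility assessment.

```python
def treatment_threshold(u_treat_diseased, u_treat_well,
                        u_notreat_diseased, u_notreat_well):
    """Disease probability at which treating and not treating tie.

    Assumed formulation: threshold = harm / (harm + benefit), where
    benefit is the utility gained by treating a diseased patient and
    harm is the utility lost by treating a non-diseased patient.
    """
    benefit = u_treat_diseased - u_notreat_diseased
    harm = u_notreat_well - u_treat_well
    if benefit <= 0 or harm < 0:
        raise ValueError("treatment must help the diseased and not help the well")
    return harm / (harm + benefit)

def threshold_range(thresholds):
    """Summarize a set of per-patient thresholds as (min, max, width)."""
    low, high = min(thresholds), max(thresholds)
    return low, high, high - low
```

Above the threshold probability, treatment has the higher expected utility; below it, withholding treatment does. A small width returned by `threshold_range` corresponds to the narrow range of thresholds that would support a general recommendation.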
Bias in Patient Selection Should Be Avoided
In past studies of diagnostic tests, the study population has differed
significantly from the patients who undergo the test in the usual course of
medical care. This defect of past studies is the most important and the
most difficult to solve. Chapter 2 contains a description of the selective
forces that lead to a biased spectrum of study patients. This defect leads
to test measurements that lack external validity and could seriously
mislead the clinician.
The ideal study. All patients who receive an established test in
customary and usual practice should be included in the study population if
possible. Exclusion and inclusion criteria, if they are needed, should be
stated in the study protocol. The most troublesome selective factor is
"workup bias." There are several ways to avoid this problem.
The best way is to avoid using a gold-standard test that is unpleasant,
costly, and risky. For example, in evaluating the accuracy of rectal
ultrasound for evaluating prostate nodules, one can use needle biopsy of
the prostate as the gold standard. This procedure can be performed so
easily that there is no barrier to referring patients.
Another way to avoid workup bias is to be sure that a positive index
test is not used as a criterion for obtaining the gold-standard test. One
way to assure compliance is to obtain the index test only in patients who
have had the gold-standard test.
A third way to avoid workup bias is to use long-term follow-up as an
ultimate measure of whether or not the patient had the disease. Thus, all
patients who do not get an invasive gold-standard test for cancer would be
evaluated periodically for the appearance of a cancer that was initially
missed by the index test.
Patients Should Be Observed for Adverse Effects of the Index Test
Most studies of diagnostic tests have not included any clinical outcome
measures other than diagnosis. Ill effects have seldom been assessed,
other than to note direct complications (death and disability from the
procedure itself). Other ill effects, such as psychological dependence on
test results (Sox et al. 1978), expensive workup of false-positive results,
and mistakenly labelling the patient as diseased, have seldom been
investigated.
The ideal study. All patients should be monitored to detect any
delayed effects of the test. A prospective study can incorporate these
important study endpoints at a small additional cost. A research assistant
can perform clinical follow-up of each patient by administering a
questionnaire and by reviewing the patient's medical record.
Interpretation of Data Should Be Free of Bias
The index test and the gold-standard test should be interpreted inde-
pendently to avoid having the results of one influence the interpretation of
the other. In some published reports, each test has been interpreted
independently, but the protocol for interpreting the index test and the
gold-standard test is not usually described. One way to avoid biased
interpretation is to have standardized, written criteria for classifying test
results.
The ideal study. The gold-standard test and each test being evaluated
are interpreted independently, according to standardized criteria. To
achieve this goal will require the active cooperation of the clinicians who
perform and interpret the test.
Interobserver Disagreement Should Be Measured
Studies have often shown considerable disagreement among observers
in labelling an image or tracing as abnormal (Koran 1975). Very few
studies of diagnostic tests have included measures of interobserver
disagreement.
The ideal study. At least two people should examine images or
tracings and categorize the result according to prospectively defined
criteria. These test result categories could be limited to normal and
abnormal or could include several degrees of abnormality. The level of
agreement should be characterized quantitatively.
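The text does not name a specific statistic for agreement. One widely used choice, shown here as an assumption rather than as the chapter's prescription, is Cohen's kappa, which corrects the observed agreement between two readers for the agreement expected by chance.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers rating the same cases.

    ratings_a, ratings_b: equal-length sequences of category labels
    (for example "normal" / "abnormal", or several degrees of abnormality).
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of cases on which the two observers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each observer's category frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both observers use one category")
    return (observed - expected) / (1.0 - expected)
```

A kappa of 1 indicates perfect agreement and a kappa of 0 indicates agreement no better than chance.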
There Should Be Enough Patients to Report the Results in Clinically
Useful Subgroups of Patients
Typical studies of diagnostic tests enroll fewer than 100 patients, far
too few to evaluate the performance of a test in clinically important
subsets of patients. One large clinical study has shown that the accuracy
of a diagnostic test varies among clinically defined patient subgroups
(Weiner et al. 1979). Patients who appear very sick often have extensive
disease that a test can detect easily. Disease is often less extensive, and
therefore less easily detected, in patients who do not appear ill. Applying
results obtained in very sick patients may lead to incorrect interpretation
of test results in other patients.
The ideal study. The study should enroll enough patients to measure
test performance in subgroups of patients, and it should prospectively
establish criteria for different categories of disease severity. The operat-
ing characteristic of the index test should be measured in these subgroups,
as well as in the entire patient population.
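To see why small studies cannot support subgroup reporting, it helps to attach a confidence interval to each subgroup's sensitivity. The sketch below uses the Wilson score interval, which is a choice made for this illustration and not a method named in the text; the subgroup names and counts are supplied by the caller.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width

def subgroup_sensitivity(subgroups):
    """Sensitivity and interval per subgroup.

    subgroups: dict mapping a subgroup name to
    (true_positives, number_of_diseased_patients).
    """
    report = {}
    for name, (true_positives, n_diseased) in subgroups.items():
        low, high = wilson_interval(true_positives, n_diseased)
        report[name] = {
            "sensitivity": true_positives / n_diseased,
            "ci": (low, high),
        }
    return report
```

With only 10 diseased patients in a subgroup, the interval around an observed sensitivity of 0.8 spans roughly 0.49 to 0.94, which is too wide to guide interpretation of the test in that subgroup.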
SUMMARY
The chief importance of this chapter is that it sets out expectations for
future studies of diagnostic tests. There are a few basic principles. Do
comparative studies: a test can be compared with a competing test, either
by randomly allocating patients to one test or the other or by performing
both tests on all patients. Do clinically relevant studies: the investigators
should gather all the data that are required to implement a model for
making clinical decisions. Avoid bias: the study population should be all
those who get the index test in the course of usual care.
REFERENCES
Abrams, H.L., and McNeil, B.J. Medical implications of computed
tomography ("CAT" scanning). New England Journal of Medicine
298:261, 310-318, 1978.
Ambrose, J., Gooding, M.R., and Uttley, D. E.M.I. scan in the manage-
ment of head injuries. Lancet 1:847-848, 1976.
Barry, M.J., Mulley, A.G., Fowler, F.J., and Wennberg, J.W. Watchful
waiting vs. immediate transurethral resection for symptomatic prosta-
tism. Journal of the American Medical Association 259:3010-3017,
1988.
CASS Principal Investigators. Coronary Artery Surgery Study (CASS):
A randomized trial of coronary artery bypass surgery: Survival data.
Circulation 68:939-950, 1983.
Fineberg, H.V., Bauman, R., and Sosman, M. Computerized cranial
tomography: Effect on diagnostic and therapeutic plans. Journal of
the American Medical Association 238:224-230, 1977.
Haughton, V.M. MR imaging of the spine. Radiology 166:297-301,
1988.
Inouye, S.K., and Sox, H.C. A comparison of computed tomography and
standard tomography in neoplasms of the chest. Annals of Internal
Medicine 105:906-924, 1986.
Kent, D.L., and Larson, E.B. Magnetic resonance imaging of the brain
and the spine. Annals of Internal Medicine 108:402-423, 1988.
Koran, L.M. The reliability of clinical methods, data, and judgment.
New England Journal of Medicine 293:642-646, 695-700, 1975.
Marton, K.I., Sox, H.C., Wasson, J.H., and Duisenberg, C.E. The clinical
value of the upper gastrointestinal series. Archives of Internal Medi-
cine 140:191-195, 1980.
Modic, M.T., Steinberg, P.M., Ross, J.S., Masaryk, T.J., and Carter, J.R.
Degenerative disk disease: Assessment of changes in vertebral body
marrow with MR imaging. Radiology 166(part 1):193-199, 1988.
NIH Consensus Conference. Magnetic resonance imaging. Journal of the
American Medical Association 259:2132-2138, 1988.
Phelps, C.E., and Mushlin, A.I. Focusing medical technology assessment
using medical decision theory. Medical Decision Making 8:279-289,
1988.
Philbrick, J.T., Horwitz, R.I., and Feinstein, A.R. Methodologic prob-
lems of exercise testing for coronary artery disease: Groups, analysis,
and bias. American Journal of Cardiology 46:807-812, 1980.
Philbrick, J.T., Horwitz, R.I., Feinstein, A.R., et al. The limited spectrum
of patients studied in exercise test research: Analyzing the tip of the
iceberg. Journal of the American Medical Association 248:2467-
2470, 1982.
Ransohoff, D.F., and Feinstein, A.R. Problems of spectrum and bias in
evaluating the efficacy of diagnostic tests. New England Journal of
Medicine 299:926-930, 1978.
Sox, H.C. Centers for Excellence in Technology Assessment: A proposal
for the national program for the study of health care technology. In
Roe, W., Anderson M., Gong, I., and Strauss, M., eds., A Forward
Plan for Medicare Coverage and Technology Assessment. Washing-
ton, D.C., Department of Health and Human Services, 1987.
Sox, H.C., Margulies, I., and Sox, C.H. Psychologically mediated effects
of diagnostic tests. Annals of Internal Medicine 95:680-685, 1981.
Steinberg, E.P., Sisk, J.E., and Locke, K.E. X-ray CT and magnetic
resonance imagers: Diffusion patterns and policy issues. New Eng-
land Journal of Medicine 313:859-864, 1985.
Weiner, D.A., Ryan, T.J., McCabe, C.H., et al. Exercise stress testing:
Correlations among history of angina, ST-segment response, and
prevalence of coronary-artery disease in the Coronary Artery Surgery
Study (CASS). New England Journal of Medicine 302:230-235, 1979.