4
Primary Assessment of
Diagnostic Tests:
Barriers to Implementation
In the two preceding chapters, we have presented a series of guidelines
for conducting the ideal study of a diagnostic technology. The goal of this
chapter is to examine the practical difficulties that are often encountered
when the guidelines are applied to the design and execution of a typical
protocol. We will also address briefly a number of methodological issues
concerning the interpretation and reporting of the study data. Five specific
stages of a primary technology assessment will be addressed: planning
and protocol development, recruitment, implementation, interpretation,
and reporting.
PLANNING AND PROTOCOL DEVELOPMENT
The key to designing a useful technology assessment is that studies
should be model-driven; the data that are gathered should be specified by
a model of decisionmaking. We pointed out in Chapter 3 that the best
way to obtain the information that will make the model usable (that is, the
data needed by a physician to make a decision about the care of an
individual patient) is to plan the study before enrolling the first patient.
What are the elements of the planning and protocol development stage?
What specific issues must be addressed and how can they be resolved?
Weinstein (1985) has discussed many of these issues as they relate to
planning a trial of cost-effectiveness of diagnostic technology, and we
draw on his work.
Parts of this chapter are adapted from a paper published previously by one of the
authors (Abrams 1987).
ASSESSMENT OF DIAGNOSTIC TECHNOLOGY
First, the planners must clearly delineate the objectives of the study. A
number of critical questions must be asked:
· Which clinical condition should be investigated?
· Which patient population should be included in the study?
· Will the endpoints be accuracy, outcome, or both?
· What type of study design will be used?
· Will the assessment also be an economic evaluation?
· Will the study assess efficacy or effectiveness?
· What is the appropriate comparison technology?
· How large a sample will be needed?
· When should the study be conducted?
· Is there institutional support for the study?
The answers to these questions will greatly influence the design of the
protocol and the nature of the data to be gathered. We will therefore
consider each of them in more detail.
Choosing a Clinical Condition
A diagnostic imaging technique has numerous potential applications.
For example, it was estimated in 1984 that MRI examinations might be
used in up to 250 diagnosis-related groups (DRGs) (Steinberg and Cohen
1984, Weinstein 1985). Defining the role of MRI for each of these
categories would require many studies and a tremendous investment of
time and resources. Recognizing that society may not be able to afford to
assess every application of a diagnostic technology, we must establish
priorities for technology assessment.
In choosing the clinical problem to be evaluated in a diagnostic technology
assessment, policy-oriented investigators would use criteria such
as the frequency of a condition, the cost of the technology, and the
potential impact of the study result on clinical practice. Other factors that
might influence the choice include the potential effect of the test on
patient management and outcome and deficiencies in existing diagnostic
methods (Figure 4.1) (Guyatt and Drummond 1985). Planners may use
policy considerations to select a study problem that will have a significant
societal impact; but they must also ask if the study is feasible.
[Figure 4.1: diagram in which six questions point to the choice of clinical problem: Gold standard available? Frequency of condition? Cost of technology? Potential to affect patient management? Potential to affect clinical outcome? Existing diagnostic methods inadequate?]
FIGURE 4.1 Factors that influence the choice of the clinical condition to be studied.
The feasibility of the study depends on a number of variables, such as
cost and the availability of a gold standard. (The costs of studies of
diagnostic technology are discussed in detail in Chapter 5.) Open-ended
questions and poorly defined goals may limit feasibility. Assessing the
efficacy of CT or MRI of "the liver" ignores the sharp distinctions among
biliary obstruction, mass lesions, and diffuse hepatocellular disease. Each
topic requires separate consideration. One prospective study of CT,
ultrasound (US), and scintigraphy focused on the tests' ability to detect
metastatic liver disease from several types of primary carcinoma. No
difference was observed in the diagnostic capabilities of these technologies
(Smith et al. 1982). Nevertheless, the results of a more recent study,
restricted to patients with carcinoma of the breast or colon, suggest that
differences do exist in the diagnostic yield of the three modalities when
pathologically distinct lesions are analyzed separately. These differences
may have been obscured in the first study because the clinical problem
was too broadly defined (Alderson et al. 1983).
One possible way to set priorities for technology assessment is to use
decision-analytic techniques for determining the value of perfect information.
Suppose we are considering an assessment of the accuracy of a new
test for patients with condition X. Let us assume that the new test
provides perfect information, thereby resolving all uncertainty about the
true state of the patient, and that we can determine the value of the
information in dollars. According to the model, if we find that the cost of
performing the test is greater than we would be willing to pay for perfect
information, using the new test to diagnose patients with condition X
would not be worthwhile (Phelps and Mushlin 1988). The model uses a
hypothetical test that is 100 percent accurate to maximize its potential
value. If the information from an ideal test is not worth the test cost, we
can expect, all other things being equal, that the information from a real,
imperfect test would be worth even less. It follows that we would not
want to expend resources to evaluate the test's performance in this clinical
situation. This methodology provides a powerful tool for determining
beforehand whether we should expend the resources necessary to evaluate
a particular use of a technology.
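The reasoning above can be sketched numerically. What follows is a minimal illustration rather than the model of Phelps and Mushlin: it assumes a single treat-or-wait decision, two possible patient states, and dollar payoffs that are entirely hypothetical. The function names, the prior probability, and the test cost are all assumptions introduced for this sketch.

```python
# Hypothetical sketch: value of perfect information for a treat-or-wait decision.
# Every payoff, probability, and name here is an illustrative assumption.

ACTIONS = ("treat", "wait")
STATES = ("disease", "no disease")


def expected_value_without_test(p_disease, payoff):
    """Best expected payoff when the action is chosen on the prior alone."""
    prob = {"disease": p_disease, "no disease": 1 - p_disease}
    return max(
        sum(prob[state] * payoff[(action, state)] for state in STATES)
        for action in ACTIONS
    )


def expected_value_with_perfect_test(p_disease, payoff):
    """Expected payoff when the true state is known before acting."""
    prob = {"disease": p_disease, "no disease": 1 - p_disease}
    return sum(
        prob[state] * max(payoff[(action, state)] for action in ACTIONS)
        for state in STATES
    )


def value_of_perfect_information(p_disease, payoff):
    """Most that a decision maker should pay for a 100 percent accurate test."""
    return (expected_value_with_perfect_test(p_disease, payoff)
            - expected_value_without_test(p_disease, payoff))


# Hypothetical net payoffs in dollars for each (action, true state) pair.
payoff = {
    ("treat", "disease"): 8000.0,
    ("treat", "no disease"): -1000.0,
    ("wait", "disease"): 0.0,
    ("wait", "no disease"): 0.0,
}

evpi = value_of_perfect_information(0.2, payoff)
test_cost = 1000.0  # hypothetical cost of performing the new test
worth_assessing = test_cost < evpi
print(f"Value of perfect information: {evpi:.2f}")
print(f"Worth assessing the test: {worth_assessing}")
```

In this sketch the hypothetical test cost exceeds the value of perfect information, so even an ideal test would not repay its cost and an assessment of a real, imperfect test would be hard to justify.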
Patient Population
The study population must be well defined. When certain subsets of
eligible patients are excluded because of other, coexisting disease, a
physician may be unable to generalize the study result to the whole
spectrum of patients encountered in clinical practice. (See Chapters 2 and
3 for a more thorough discussion of sources of bias in selecting patients
and their negative impact on studies of diagnostic technology.) Choosing
a specific clinical problem for a study of diagnostic technology defines
the diagnostic category of patients who may participate in the study.
Within this category, the population should include a representative spectrum
of patients.
Inclusion and exclusion criteria are needed to define the boundaries of
the study population. They must be explicit and they must be applied
consistently. In the University Group Diabetes Study, these criteria were
not applied uniformly, leading to the admission of a number of ineligible
patients and the exclusion of some patients who were eligible. These
errors compromised the generalizability of the study conclusion and wasted
resources (Feinstein 1971).
Wide variance in test performance (that is, accuracy) within the study
population may obscure differences in the performance of two tests.
Investigators may need to specify and analyze the results of a test in
subgroups of patients for whom they suspect the test will perform differently.
For example, the sensitivity and specificity of the exercise thallium
treadmill, used to diagnose coronary artery disease, are different in groups
of patients segregated according to the severity of their chest pain (Weiner
et al. 1979). Although there may be no significant difference in test
performance when the population is considered as a whole, there may be
differences when subgroups within the population are compared.
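A small calculation shows how pooled figures can hide subgroup differences. The sketch below assumes that each subgroup's results have been tabulated against the gold standard as counts of true positives, false negatives, false positives, and true negatives. The subgroup labels and every count are hypothetical and are not taken from Weiner et al.

```python
# Hypothetical sketch: sensitivity and specificity overall and by subgroup.
# Subgroup names and all counts are invented for illustration.

def sensitivity_specificity(counts):
    """Return (sensitivity, specificity) from a dict of 2x2 table counts."""
    diseased = counts["tp"] + counts["fn"]
    healthy = counts["fp"] + counts["tn"]
    if diseased == 0 or healthy == 0:
        raise ValueError("need at least one diseased and one healthy patient")
    return counts["tp"] / diseased, counts["tn"] / healthy


def pool(tables):
    """Add the 2x2 tables of several subgroups cell by cell."""
    return {cell: sum(t[cell] for t in tables) for cell in ("tp", "fn", "fp", "tn")}


subgroups = {
    "mild chest pain": {"tp": 30, "fn": 20, "fp": 5, "tn": 45},
    "severe chest pain": {"tp": 45, "fn": 5, "fp": 15, "tn": 35},
}

for name, table in subgroups.items():
    sens, spec = sensitivity_specificity(table)
    print(f"{name}: sensitivity {sens:.2f}, specificity {spec:.2f}")

overall_sens, overall_spec = sensitivity_specificity(pool(subgroups.values()))
print(f"whole population: sensitivity {overall_sens:.2f}, specificity {overall_spec:.2f}")
```

With these invented counts the pooled estimates fall between the two subgroup estimates, which is why an analysis of the whole population alone could miss a clinically relevant difference.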
Endpoints and Study Design
The endpoint of a diagnostic test assessment will determine how the
results will be used; it is, therefore, critical. Fineberg has proposed the
following hierarchy for the evaluation of diagnostic tests: technical capability,
diagnostic accuracy, therapeutic impact, and impact on patient
outcome (Fineberg et al. 1977). Early reports of excellent technical
capability are often the basis for the later studies of diagnostic accuracy
and clinical value (impact on therapy and patient outcome). The critical
question in the planning and protocol development stage is: Will the study
attempt to measure diagnostic accuracy (that is, sensitivity and specificity),
the impact of the test on clinical outcome, or both? (Note that we
define the outcome of a diagnostic test as any change in the posttest
process. It should not be considered synonymous with the terms morbidity
and mortality.)
Accuracy
Studies of diagnostic accuracy use a "gold standard" to verify the
presence or absence of disease. A potential difficulty in a study of
accuracy occurs when there is no accepted "gold standard"; it may not be
clear which of the available reference standards should be used (Schwartz
1986). All reference standards are imperfect. The coronary angiogram is
used as the gold standard in studies of diagnostic tests for coronary artery
disease, such as the stress electrocardiogram. Yet, pathologic examination
of tissue from patients who have had an angiogram demonstrates that
the radiologic procedure underestimates the severity of disease (Abrams
1982). Physicians must interpret the results of studies of accuracy in this
context. Perfect or not, in practice the appropriate gold standard will be
the test or procedure that physicians use to define the true state of patients
with a particular disease.
Outcome
Because the purpose of diagnostic technology is to provide information
that will improve patient outcome, patient outcome is an important endpoint
in technology assessment. Making inferences from data on outcome
may be more difficult than interpreting data from studies of diagnostic
accuracy. When long-term measures of outcome are used, the technology
may be obsolete before the study is completed. Furthermore, long-term
outcome may be an unrealistic criterion, "because the impact of diagnostic
technologies generally is subordinated to that of other factors, such as
the nature of the disease process itself, patient compliance, the efficacy of
treatment, etc." (McNeil 1979, p. 37).
Improvement in long-term outcome may not be the most important
effect of a test. If intervening variables act to obscure differences in the
long-term effects of two technologies, perhaps the differences are not
really important. Investigators must keep in mind that two patients with
identical long-term outcomes may have experienced very different posttest
processes.
A variety of intermediate variables may be important indicators of the
effects of a test. Furthermore, these variables may be more practical to
evaluate than long-term effects. For example, a study could measure the
ability of a diagnostic technology to obviate the need for further invasive
diagnostic procedures. In patients with lung cancer, thoracotomy could
be avoided if a test could accurately predict the presence of mediastinal
metastases. The test will not improve the five-year survival of such
patients, but avoiding an unnecessary thoracotomy would be a major
benefit (McNeil et al. 1978) and would therefore represent an improvement
in the posttest process. Outcome studies must track intermediate
outcomes and patients' attitudes toward those outcomes.
Combined Studies: Accuracy and Outcome
The alternative combinations of study design (randomized or nonrandomized)
and endpoint (accuracy and/or outcome) are depicted in Figure
4.2. The design of a technology assessment influences the feasibility of
conducting each type of study. In a randomized design, each patient
undergoes only one of the study tests; in a nonrandomized design, each
patient would undergo all of the study tests, although randomization may
be used to assign a patient to a particular sequence of tests. The advantages
and disadvantages of a randomized design have already been discussed
in Chapter 3.
FIGURE 4.2 Alternative combinations of endpoint and study design.
RANDOMIZED DESIGN (each patient undergoes only one of the study tests)
  Endpoint accuracy: gold-standard evaluation required only
  Endpoint outcome: follow-up required only
  Endpoint both: gold-standard evaluation and follow-up required
NONRANDOMIZED DESIGN (each patient undergoes all of the study tests; patients may be randomized to a sequence of tests)
  Endpoint accuracy: gold-standard evaluation required only
  Endpoint outcome: follow-up required only (useful only when patient randomized to a sequence)
  Endpoint both: gold-standard evaluation and follow-up required (follow-up patient care based on only one study test result)
The following example illustrates that a study design may not be
compatible with the endpoints selected for evaluation. In an ideal study
to compare the accuracy of two tests, each patient would have both
examinations. Guyatt and Drummond have suggested that investigators
use this approach to assess both the accuracy and the impact on outcome
of two relatively noninvasive imaging modalities, such as MRI and CT, in
a single study. To compare the effects of the tests on outcome for the
same patient in whom accuracy is determined, however, the result of one
of two tests would have to be withheld from the patient's physician
(Guyatt and Drummond 1985).
The design of this study poses ethical problems because patients will
undergo a diagnostic examination that cannot affect their care (Weinstein
1985). Patients and physicians alike may be reluctant to participate. In
Chapter 3, we suggest that a randomized design may be preferred to a
nonrandomized design for assessing outcome. Planners could also shift
their focus from long-term to short-term outcomes.
Short-Term Outcomes: A Synthetic Approach
The synthetic approach is a method for assessing short-term outcomes,
such as the impact of a diagnostic test on the management of the patient.
It involves obtaining detailed information from physicians about their
pretest treatment strategies and comparing them to the posttest management
of the patient (Guyatt et al. 1986). In the example above, each
physician would write down a plan for managing the patient before
knowing the CT and MRI results. Using a randomized scheme, the result
of one of the two tests would be given to each physician, who would then
formulate and record a treatment plan based on the test result. Next, the
result of the other test would be revealed, and the patient's care would
ultimately be based on all available information. A test has had an impact
if the physician's plans changed because of the test result.
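The bookkeeping for such a comparison is simple, and a minimal sketch may make it concrete. It assumes that each case is recorded as a pair of plans (the pretest plan and the plan written after the first test result was revealed), grouped by which test was revealed first. The plan labels, the grouping, and all data are hypothetical.

```python
# Hypothetical sketch: proportion of management plans changed by a test result.
# Plan labels and case data are invented for illustration.

def proportion_of_plans_changed(cases):
    """cases: list of (pretest_plan, posttest_plan) pairs for one test."""
    if not cases:
        raise ValueError("no cases recorded")
    changed = sum(1 for pretest, posttest in cases if pretest != posttest)
    return changed / len(cases)


# Cases grouped by the test whose result was revealed first.
cases_by_first_test = {
    "CT": [("surgery", "surgery"), ("observe", "biopsy"), ("biopsy", "biopsy")],
    "MRI": [("surgery", "observe"), ("observe", "biopsy"), ("biopsy", "biopsy")],
}

impact = {
    test: proportion_of_plans_changed(cases)
    for test, cases in cases_by_first_test.items()
}
for test, share in impact.items():
    print(f"{test}: plan changed in {share:.0%} of cases")
```

A real protocol would also need rules for deciding when two written plans count as different, which this sketch reduces to a simple inequality of labels.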
Economic Analysis
In the current era of cost containment and limited resources, a consideration
of cost will be an important study endpoint. First, the investigators
planning a technology assessment must decide which type of analysis
to use (for example, net resource costs, cost-effectiveness analysis, or
cost-benefit analysis). Since cost-effectiveness analysis is comparative
and does not require that health outcomes be valued in monetary terms, it
is the type of analysis used most frequently. Second, investigators must
choose an appropriate perspective for the analysis, because this will
greatly influence which costs and effects are included. The societal
perspective is the broadest, and it is adopted when the results of the
cost-effectiveness analysis are needed to guide government decisions about
how to allocate resources. Third, the investigators must recognize the
degree to which additional time and personnel will be needed when an
economic evaluation is included in the study design. (For a complete
discussion of these issues see Weinstein and Stason 1977; OTA 1980a,b;
or OTA 1981.)
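Because cost-effectiveness analysis is comparative, its result is commonly summarized as an incremental ratio: the extra cost of the new strategy divided by the extra health effect it produces. The chapter does not name this ratio, so the sketch below is an assumption about how the comparison might be expressed. The function name, the costs, and the effect measure (proportion of patients correctly diagnosed) are all hypothetical.

```python
# Hypothetical sketch: incremental cost-effectiveness ratio of a new test
# compared with an existing one. All figures are invented for illustration.

def incremental_cost_effectiveness_ratio(cost_new, effect_new, cost_old, effect_old):
    """Extra cost per extra unit of effect when the new test replaces the old."""
    extra_effect = effect_new - effect_old
    if extra_effect == 0:
        raise ValueError("tests are equally effective; the ratio is undefined")
    return (cost_new - cost_old) / extra_effect


# Cost per patient in dollars; effect as proportion correctly diagnosed.
ratio = incremental_cost_effectiveness_ratio(
    cost_new=1200.0, effect_new=0.90,
    cost_old=800.0, effect_old=0.80,
)
print(f"Extra cost per additional correct diagnosis: {ratio:.0f} dollars")
```

The effect is left in natural units, which reflects the point made above that health outcomes need not be valued in monetary terms.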
Efficacy Versus Effectiveness
The conditions of the study can imitate real life or they can be idealized.
The choice between efficacy, the performance of the test under ideal
conditions, and effectiveness, its performance under ordinary conditions
of clinical practice, will determine the type of question the study can
answer. Consider a study designed to assess the diagnostic accuracy of
barium enema (BE) in detecting colonic polyps (also see Figure 4.3).
A study of effectiveness would enroll all patients who are referred for
BE in clinical practice. Patients would be given the usual pretest instructions
and, although some would be less than optimally prepared, all would
undergo the examination. This would be performed under usual conditions,
by the individuals who normally perform it: radiology staff or
house staff. It would be interpreted by clinicians at varying levels of skill
who would be provided with whatever clinical information is generally
available at the time. There would not be a protocol for subsequent
patient care.
FIGURE 4.3 Differing requirements: studies of efficacy vs. studies of effectiveness.
PATIENT POPULATION
  Efficacy: more homogeneous; screened for coexisting illnesses/compliance
  Effectiveness: heterogeneous; includes all patients who usually have procedure
PROCEDURES
  Efficacy: standardized
  Effectiveness: more flexible
TESTING CONDITIONS
  Efficacy: ideal
  Effectiveness: conditions of everyday practice
TEST INTERPRETATION
  Efficacy: blinded to clinical data
  Effectiveness: using other clinical data
TYPE OF OUTCOME DATA
  Efficacy: objective, "hard" events, e.g., death
  Effectiveness: more subjective, "soft" events, e.g., improved quality of life
A study of efficacy should assess the potential benefit of the technology
when applied to a specific clinical problem in a defined population under
ideal conditions. The protocol would be designed to maximize the chance
that the true accuracy of the test will be demonstrated by reducing sources
of variability. Thus, a study of efficacy would: (1) enroll a more select
group of patients; (2) ensure that all patients were adequately and consistently
prepared prior to the exam; (3) use only state-of-the-art equipment;
(4) employ the most skilled clinicians to perform and interpret the test;
and (5) make sure that interpreters were blinded to other clinical information.
It would also standardize aftercare.
The individuals who develop the protocol may disagree about which
type of assessment is most appropriate, making the choice a difficult one.
Feinstein (1983) has suggested a useful way to conceptualize the two
approaches to design: the "fastidious" and the "pragmatic."
The fastidious approach. Fastidious designers might include the biostatistician
or the scientist who developed the technology. This group
would argue that a study of efficacy is the only way to determine the
"true" value of the technology. For example, an efficacy design will
increase the chances of arriving at an unequivocal answer to the study
question by standardizing procedures and removing many of the sources
of variability that characterize clinical practice. If such a study concluded
that a test was not efficacious, there would be no need to perform further
evaluations.
The pragmatic approach. This approach would be adopted by the
practicing clinician. The "clean" results of an efficacy assessment may
have little value for the physician whose patients will receive their tests
under "usual" rather than "ideal" conditions. The pragmatist would argue
that only studies of effectiveness, which attempt to mimic clinical reality,
provide the information physicians need to make decisions about individual
patients.
Resolution of the conflict between the fastidious and pragmatic approaches
may involve combining features of both. In any case, the protocol
as it is actually carried out may end up as a hybrid, because protocols
that have been designed to assess efficacy will often encounter real-world
obstacles that make the ideal arrangement impossible. These problems
will be covered in detail in the section of this chapter that considers
implementation.
Comparative Assessment
In Chapter 3 we emphasized that technology assessments must be
comparative if they are to provide useful data to the practicing physician.
For example, the physician may need answers to either or both of the
following questions: (1) When used instead of existing tests, does the new
test have a greater impact on the outcome of the patient? (2) When used
in combination with existing methods, does the new test add information
that will improve the outcome of the patient?
These questions suggest two comparative designs (see Figure 4.4). In
one design, the study would detect any positive impact when the new test
is substituted for the old. Patients would be randomized to either the new
technology or the existing technology. In the other design, the study
would evaluate the impact of a technology when it is used as an addition.
Patients would be randomized to undergo either the old test and the new
test in sequence or the old test alone. Both designs could be used to
compare the diagnostic accuracy of the tests (or combination of tests) and
their impact on the outcome of disease. (A nonrandomized design, in
which all patients undergo all tests, would also be appropriate for a study
of accuracy; refer to the section, "Endpoints and Study Design," pp. 77-80.)
FIGURE 4.4 Designs for comparative assessment.
(a) Design 1: the new technology as a substitute for the old technology (Test A vs. Test B).
(b) Design 2: the new technology as an addition to the old technology (Test A vs. Test A and Test B).
(c) Over time the appropriate design may change: initially, Test A; in the short term, physicians may prefer the combination of tests (Test A and Test B); in the long term, physicians may abandon the less effective test (Test B).
Which of these designs is more appropriate when comparing an existing
technology with a new one? Despite promising reports, physicians
may be hesitant, in the short term, to change over completely to a new
technology. An additive design would answer their questions about using
the new test as part of a sequence. Nevertheless, one goal of technology
assessment is to foster appropriate changes in practice habits and to
discourage the use of additional tests whenever they will not have an
impact. In the long term, if we want to encourage physicians to abandon
ineffective tests in favor of more effective ones, we will need to evaluate
the technology's substitutive value as well (Weinstein 1985).
[Pages 84-95 are not included in this excerpt.]
reduce the number of cases that could be used in the final analysis,
prolonging the recruitment effort and increasing the cost and time required
to complete the study. Furthermore, combining data obtained with
different procedures may compromise the validity and generalizability of
the study conclusions.
Summarized below are some other problems encountered in a prospective
study evaluating the diagnostic value of ventilation-perfusion scanning
in patients with suspected pulmonary embolism (Hull et al. 1985):
· The index test, ventilation scanning, could not be performed in 20 of
the patients because of the lack of availability of radioactive xenon or for
other "technical reasons."
· In 2 other patients, the results of the scan "were inadequate for
interpretation."
· Of the potentially eligible patients, 51 were too ill to undergo the
gold standard, pulmonary angiography.
· The gold-standard test was not performed in additional patients: 4
patients were allergic to contrast agents; 2 patients were pregnant; 11
patients were too ill; 9 patients refused permission; and 4 patients were
excluded for other "technical" reasons.
What are the consequences of these difficulties? Besides reducing the
number of patients available for the final analysis, these problems can
change the character of the study population. When patients are excluded
for ill-defined reasons or do not have the required follow-up with the
gold-standard test, the study population, and thus the patients to whom the
study conclusions apply, become difficult to define.
Follow-Up Studies
An assessment designed to evaluate the impact of a diagnostic test on
patient outcome will require clinical follow-up. In addition, when studies
of diagnostic accuracy employ a risky gold standard, patients with negative
index tests may not be referred for the gold-standard test, and clinical
follow-up may be used as a substitute.
There are several ways to conduct follow-up studies.
First, responsibility for collecting the data and filling out the forms can
be placed with the referring physician or with physicians and staff at the
study center. This approach is useful when physical examinations and
testing are part of the follow-up plan. The method is cheap, but risky;
physicians may fail to gather all the data or may use nonstandard methods.
Patients may move or may fail to keep follow-up appointments. Such
patients are considered "lost to follow-up" and present a challenge to the
individuals who must analyze the data.
Second, a research assistant can conduct a structured telephone interview
with the patient in order to assess outcome. This method may be
more convenient for the patient and may increase the chance of successful
follow-up on patients who have moved. It is not useful if tests or a
physical examination are needed.
Third, patients can fill out a follow-up questionnaire and return it to the
study center by mail. This approach is the least expensive, but compliance
is likely to be poor and the cost of contacting noncompliers will be
high.
Follow-up can be complicated by a number of factors, particularly if it
requires observation or data collection over a period of years or requires
the assessment of other than dichotomous variables. Some factors relate
to patients. Patients may perceive follow-up as a continued intrusion into
their lives and simply refuse to cooperate. The patient may experience a
change in health status that makes evaluation of outcome more difficult.
In a randomized study, the patient may "cross over" and have a diagnostic
evaluation for the same indication by the competing technology, making
the assessment of the impact of the first test nearly impossible. Furthermore,
the patient is not always a reliable source of information. In one
study, only 60 percent of patients with heart disease and 70 percent of
patients with asthma reported these diagnoses when asked what condition
they had (Ludwig and Collette 1971).
The environment in which follow-up is conducted may also present a
problem. The technology under study may change, or a newer technology
may be developed so that the answer to the study question seems
much less important. When interest wanes, follow-up may be inadequate.
The nature of the endpoint chosen for evaluation can also influence the
success of follow-up studies. A dichotomous variable such as life or
death is easy to assess. Obtaining and coding subjective information about
the impact of a test on the patient's functional status or quality of life
requires more complex methods. Researchers have recognized the importance
of these endpoints and have developed the tools needed to conduct
these types of follow-up studies.
Some studies of diagnostic accuracy determine the patient's true state
by using the gold-standard test in certain patients and clinical follow-up
for those who do not undergo the gold-standard test. Follow-up is very
important in such studies. In McNeil's (1979) evaluation of the CT/RN
study, she states that inadequate follow-up made it impossible to determine
whether some of the patients entered into the study did or did not
have neurological disease. There must be a contingency plan for patients
who do not comply with follow-up, and the costs of follow-up must be
included in the study budget.
Summary: Implementation
The obstacles encountered in this stage of a technology assessment
may be the most difficult to resolve. Randomization, data collection, test
performance, and follow-up are all subject to poor compliance and poor
performance. To facilitate compliance, the requirements of the protocol
should be as explicit and as simple as possible, and they should be written
out in detail. The study should be planned to minimize the number of
patients who must be randomized at the time examinations are scheduled.
The most important element, however, is the motivation of the patients,
physicians, and other staff who carry out the protocol. Those individuals
involved in carrying out the protocol should receive training before the
study begins, and there should be ongoing monitoring of study personnel
(Cummings et al. 1988). The best way to avoid implementation problems
is expensive: hire a research assistant and assign as many data collection
chores as possible to this person.
TEST INTERPRETATION
The choice between efficacy and effectiveness is important in designing
the interpretation stage of an assessment. In a study of efficacy, test
interpretation must be as accurate, consistent, and objective as possible.
The ideal study would include multiple interpretations of both the index
test and the gold standard for the purpose of determining interobserver
variability. In a study of effectiveness, tests would be interpreted as they
are in usual clinical practice. The procedure for interpretation would not
necessarily be standardized.
Accuracy
Many factors affect the accuracy of data interpretation. Some, such as
physician fatigue (frogmen et al. 1978), are difficult to control. Data from
one early study indicated a substantial improvement in radiologists' use
of CT to detect pancreatic carcinoma after the first 1,000 body scans
(Sheedy et al. 1977). Improvement in physicians' skills with experience
clearly demonstrates the importance of the learning curve. Early estimates
of the accuracy of a new test, when physicians' experience is
limited, may be a better reflection of their interpretative skills than the
potential accuracy of the method.
Consistency and Multiple-Test Interpretations
Consistency is best guaranteed by having the same observer interpret
all examinations for a particular technology and by using a standardized
definition of an abnormal test result. Ideally, all interpreters using the
different methods should be at a similar level of experience. ROC analysis
is appropriate for assessing tests with results expressed as continuous
variables (Metz 1978). By determining a series of true-positive/false-positive
pairs in which different criteria separate the normal from the abnormal,
the ROC curve neutralizes observer biases associated with excessively
conservative or liberal strategies (Hanley and McNeil 1982).
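The construction of true-positive/false-positive pairs can be sketched in a few lines. This is a minimal illustration under stated assumptions: readers score each examination on a five-point confidence scale, the gold standard has already divided patients into diseased and healthy, and the area under the curve is approximated with the trapezoidal rule. The scores and function names are hypothetical, and this is not the estimation method of Hanley and McNeil.

```python
# Hypothetical sketch: ROC points from reader confidence scores.
# Scores and names are invented; the area uses the trapezoidal rule.

def roc_points(scores_diseased, scores_healthy):
    """Return (false-positive rate, true-positive rate) pairs, one for each
    criterion, starting at (0, 0) and ending at (1, 1)."""
    criteria = sorted(set(scores_diseased) | set(scores_healthy), reverse=True)
    points = [(0.0, 0.0)]
    for criterion in criteria:
        # A result is called abnormal when the score meets the criterion.
        tpr = sum(s >= criterion for s in scores_diseased) / len(scores_diseased)
        fpr = sum(s >= criterion for s in scores_healthy) / len(scores_healthy)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Confidence ratings from 1 (definitely normal) to 5 (definitely abnormal).
scores_diseased = [3, 4, 4, 5, 5]
scores_healthy = [1, 2, 2, 3, 4]

points = roc_points(scores_diseased, scores_healthy)
auc = area_under_curve(points)
for fpr, tpr in points:
    print(f"false-positive rate {fpr:.1f}, true-positive rate {tpr:.1f}")
print(f"area under the curve: {auc:.2f}")
```

Because every criterion contributes one point, a reader who is habitually strict and one who is habitually lenient land at different positions on the same curve rather than appearing to differ in accuracy.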
In a large-scale study, data interpretation might require a full-time
commitment from specialists, such as radiologists. It may be difficult to
find someone who will devote this amount of time to a study, and equally
difficult to recruit the group of specialists who will be needed to reinterpret
at least a selected sample of the exams for the purpose of determining
interobserver variability. The participation of these individuals should be
solicited early, and their time should be a budgeted expense of the project.
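Once a sample of exams has been reinterpreted, interobserver variability has to be summarized in some way. The text does not name a statistic; the sketch below assumes Cohen's kappa, one common chance-corrected measure of agreement between two readers. The function name and the readings are hypothetical.

```python
from collections import Counter
from typing import List


def cohens_kappa(reader_a: List[str], reader_b: List[str]) -> float:
    """Chance-corrected agreement between two readers of the same exams.

    Assumes both lists are in the same exam order and that the readers do
    not both use a single category for every exam (expected agreement < 1).
    """
    n = len(reader_a)
    # Proportion of exams on which the two readers agree.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Agreement expected by chance, from each reader's category frequencies.
    freq_a, freq_b = Counter(reader_a), Counter(reader_b)
    categories = set(reader_a) | set(reader_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical readings of ten exams ("A" = abnormal, "N" = normal).
reader_a = ["A", "A", "A", "A", "N", "N", "N", "N", "N", "N"]
reader_b = ["A", "A", "A", "N", "N", "N", "N", "N", "N", "A"]

kappa = cohens_kappa(reader_a, reader_b)
```

In this made-up sample the readers agree on 8 of 10 exams, but because 52 percent agreement would be expected by chance alone, kappa is noticeably lower than the raw agreement.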
Objectivity
How can objectivity of interpretation be obtained? There must be no
cross-talk between those who interpret different examinations on the
same patient. A physician interpreting the index test should be blinded to
the result of the gold standard to avoid test-review bias; similarly, a
physician interpreting the gold standard should be blinded to the result of
the index test to avoid diagnosis-review bias (see Chapter 3). Both types
of bias can lead to an overestimate of the true-positive and false-positive
rates of the index test. Blinded interpretation of the index test and the
gold-standard test is absolutely essential. Yet most reports of studies of
diagnostic tests do not indicate that this precaution has been taken.
In a study of efficacy, blinded interpretation is the most objective way
to determine the accuracy of a test. It may not be ethically sound,
however, to make decisions about patient care based on a test that was
interpreted without the benefit of all relevant clinical data. In a study of
effectiveness, the interpretation would depend on the combination of
clinical information and that derived from the specific imaging examination.
This method, although less objective, is the one used in clinical
practice.
A study can be designed to accommodate interpretation under both
"ideal" and "usual" conditions. There should be two separate interpretations:
one (nonblinded) interpretation used for patient care (and thus for
effectiveness) and the other (blinded) used for efficacy studies. In general,
if we separate study interpretation from interpretation related to
patient care, we can blind observers to all other data more ethically.
REPORTING
The clinical utility of an otherwise well-executed diagnostic technol-
ogy assessment depends on the success with which the results are commu-
nicated to physicians who use the tests. In addition, meta-analysis, a form
of secondary technology assessment that synthesizes recommendations
from published reports, depends on thorough reporting of methods and
results (PiUemer and Light 1980, Hunter 1982~. A number of authors
have proposed standards for assessing and reporting randomized con-
trolled trials; many of these standards can be applied to studies of diag-
noetic technology. Two groups in particular (Mosteller et al. 1980, Chalm-
ers et al. 1981) have descnbed 16 key features of a good report.
1. a precise statement of the study question, including any prior hypotheses
regarding specific subgroups in whom the value of the tests
might differ;
2. a complete description of the study population, of inclusion and
exclusion criteria (if used), and of patients who were rejected or who may
have withdrawn from the study, so that clinicians can determine how their
patients compare to the study population, with particular attention to
clinical issues that define the spectrum of severity of disease;
3. the dates of the enrollment period, to allow interpretation of the
results in light of other developments that may have occurred during that
time (for example, technological advances);
4. a detailed description of the study protocol, including the methods
for performing tests (or appropriate references for the methodology) and
the procedure for randomization (if applicable);
5. a statement of the acceptable level of type I and type II errors, and
the size of the sample required to detect the specified difference in study
endpoint;
6. presentation of the distribution of pretest variables (for randomized
studies), so that clinicians can check for biased assignment of patients to
study groups;
7. an indication of the level of compliance with the protocol, with a
description of deviations and how they were handled;
8. specification of the reference standard used to define the true state
of the patient, taking care to show that there is no use of index test results
(or clinical data used for clinical prediction rules) to define the diseased
and nondiseased states;
9. the results of the index test(s) and gold-standard test (in a 2-by-2
table, if applicable), with appropriate statistical analyses (for example,
ROC for studies of test accuracy where results can be expressed as
continuous variables);
10. subgroup analysis: results of tests as in no. 9 in patient subgroups
of interest;
11. the results of follow-up (when patient outcome is an endpoint)
with confidence limits, life-table analysis, or other statistical analyses as
appropriate;
12. a description of the method for handling postintervention with-
drawals and patients lost to follow-up;
13. a description of the method used to avoid test-referral bias;
14. a description of the method used to blind those who interpret the
index and gold-standard tests;
15. the number of tests that were technically suboptimal or were
considered uninterpretable; and
16. the source of funding for the study, to allow identification of pos-
sible conflicts of interest.
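The 2-by-2 table called for in item 9 can be illustrated with a short sketch. The class name, field names, and counts below are hypothetical and serve only to show how the true-positive and false-positive rates follow from the four cells.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Index test result cross-classified against the gold standard."""
    tp: int  # index test positive, disease present
    fp: int  # index test positive, disease absent
    fn: int  # index test negative, disease present
    tn: int  # index test negative, disease absent

    @property
    def true_positive_rate(self) -> float:
        """Sensitivity: positives among patients with disease."""
        return self.tp / (self.tp + self.fn)

    @property
    def false_positive_rate(self) -> float:
        """1 - specificity: positives among patients without disease."""
        return self.fp / (self.fp + self.tn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


# Hypothetical counts for 100 patients, half of whom have the disease.
table = TwoByTwo(tp=45, fp=10, fn=5, tn=40)
```

Reporting the four cell counts, rather than only the derived rates, lets readers recompute the rates and pool the data in a later meta-analysis.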
Two of these items deserve additional attention, because they can be
sources of hidden bias in a study of diagnostic technology. Number 8
refers to the pitfall of "circular assessment," which must be avoided when
choosing a reference standard. This occurs when the result of one of the
index tests in a comparative study is used to define the true state of the
patient. To obtain a valid measure of each test's performance, the tests must
be assessed independently of one another, using a different method to
verify the presence or absence of disease.
Number 15 in the list above alludes to another potential source of bias:
reports of studies of diagnostic technology seldom include the number of
test results that were considered uninterpretable or indeterminate. In one
review of ten papers on CT, only five dealt explicitly with the number of
unsatisfactory exams. Such information is essential, however, if efficacy
is to be judged. For example, if a test detects renal lesions in 70 of 100
patients, misses them in 10, and results in technically suboptimal examinations
in 20, the overall sensitivity is 70 over 100 (70 percent). Frequently,
the 20 poor-quality exams are excluded, and the sensitivity
reported is 70 divided by 80 (88 percent) (Abrams 1981). Thus, if
investigators fail to consider the impact of ignoring poor-quality exams,
the true-positive and false-positive rates may be artificially inflated (Begg
et al. 1986).
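The arithmetic in the renal lesion example can be checked with a small sketch. The function and parameter names are hypothetical; the counts are those given in the text, and all 100 patients are assumed to have lesions, as the example implies.

```python
def sensitivity(detected: int, missed: int, uninterpretable: int,
                include_uninterpretable: bool) -> float:
    """Sensitivity among patients with disease.

    When include_uninterpretable is True, technically suboptimal exams stay
    in the denominator (they count as failures to detect); when False, they
    are dropped, which is the practice the text warns against.
    """
    denominator = detected + missed
    if include_uninterpretable:
        denominator += uninterpretable
    return detected / denominator


# Counts from the example in the text.
overall = sensitivity(70, 10, 20, include_uninterpretable=True)
excluding = sensitivity(70, 10, 20, include_uninterpretable=False)
```

Here overall is 70/100 and excluding is 70/80, so dropping the 20 suboptimal exams raises the apparent sensitivity from 70 percent to about 88 percent without any change in the test itself.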
CONCLUSION
In this chapter, we have examined the difficulties encountered in each
stage of a primary technology assessment, from the planning and design
process through the production of the final report. The solutions to some
of these problems are relatively straightforward. For example, we have
methods to avoid test-review and diagnosis-review bias. We also know
that increasing the level of cooperation among participating individuals
and institutions will go a long way to improving the outcome of a study.
The solutions to other problems, such as when to conduct the assessment
or which application to assess, are less obvious. In emphasizing some of
the barriers to primary data collection, we have attempted to forestall such
difficulties in future assessments. In posing a number of unanswered
questions, we would hope to encourage the research necessary to resolve
these problems, and thus enhance the value of diagnostic technology
assessment.
REFERENCES
Abrams, H.L. Evaluating computed tomography. In Alterman, P.S.,
Gastel, B., and Eliastam, M., eds., Assessing Computed Tomography,
pp. 1-17. National Center for Health Care Technology Monograph
Series, Washington, D.C., U.S. Department of Health and Human
Services, May 1981.
Abrams, H.L. Garland lecture. Coronary arteriography: Pathologic and
prognostic implications. American Journal of Roentgenology 139:1-18,
1982.
Abrams, H.L., and Hessel, S. Health technology assessment: Problems
and challenges. American Journal of Roentgenology 149:1127-1132,
1987.
Alderson, P.O., Adams, D.F., McNeil, B.J., et al. Computed tomography,
ultrasound, and scintigraphy of the liver in patients with colon or
breast carcinoma: A prospective comparison. Radiology 149:225-230,
1983.
Alperovitch, A. Controlled assessment of diagnostic techniques:
Methodological problems. Effective Health Care 1:187-190, 1983.
Angell, M. Patients' preferences in randomized clinical trials. New
England Journal of Medicine 310:1385-1387, 1984.
Begg, C.B., Greenes, R.A., and Iglewicz, B. The influence of uninterpre-
tability on the assessment of diagnostic tests. Journal of Chronic
Diseases 39:575-584, 1986.
Brogdon, B.G., Kelsey, C.A., and Moseley, R.D. Effect of fatigue and
alcohol on observer perception. American Journal of Roentgenology
130:971-974, 1978.
Brown, B.W., Jr., and Hollander, M. Statistics: A Biomedical Introduction.
New York, John Wiley & Sons, 1977.
Cassileth, B.R., Lusk, E.J., Miller, D.S., and Hurwitz, S. Attitudes toward
clinical trials among patients and the public. Journal of the American
Medical Association 248:968-970, 1982.
Cassileth, B.R., Zupkis, R.V., Sutton-Smith, K., et al. Informed consent:
Why are its goals imperfectly realized? New England Journal of
Medicine 302:896-900, 1980.
Chalmers, T.C., Smith, H., Jr., Blackburn, B., et al. A method for
assessing the quality of a randomized control trial. Controlled Clinical
Trials 2:31-49, 1981.
Croke, G. Recruitment for the National Cooperative Gallstone Study. In
Roth, H.P., and Gordon, R.S., Jr., eds., Proceedings of the National
Conference on Clinical Trials Methodology, October 1977. Clinical
Pharmacology and Therapeutics 25:691-694, 1979.
Cummings, S.R., Hulley, S.B., and Siegel, D. Implementing the study:
Pre-testing, quality control and protocol revisions. In Hulley, S.B.,
and Cummings, S.R., eds., Designing Clinical Research: An Epi-
demiological Approach. Baltimore, Williams and Wilkins, 1988.
Drummond, M. Guidelines for health technology assessment: Economic
evaluation. In Feeny, D., Guyatt, G., and Tugwell, P., eds., Health
Care Technology: Effectiveness, Efficacy and Public Policy. Mon-
treal, The Institute for Research on Public Policy, 1986.
Feinstein, A.R. An additional science for clinical medicine: II. The
limitations of randomized trials. Annals of Internal Medicine 99:544-550,
1983.
Feinstein, A.R. Clinical biostatistics-VIII. An analytic appraisal of the
University Group Diabetes Program (UGDP) study. Clinical Pharmacology
and Therapeutics 12:167-191, 1971.
Ferguson, J.H. Director, Office of Medical Applications Research. Personal
communication, 1988.
Ferris, F.L., and Ederer, F. External monitoring in multiclinic trials:
Applications from ophthalmologic studies. In Roth, H.P., and Gordon,
R.S., Jr., eds., Proceedings of the National Conference on Clinical
Trials Methodology, October 1977. Clinical Pharmacology and
Therapeutics 25:72~723, 1979.
Fineberg, H.V., Bauman, R., and Sosman, M. Computerized cranial
tomography: Effect on diagnostic and therapeutic plans. Journal of
the American Medical Association 238:224-230, 1977.
Fineberg, H.V., and Hiatt, H.H. Evaluation of medical practices: The
case for technology assessment. New England Journal of Medicine
301:1086-1091, 1979.
Freiman, J.A., Chalmers, T.C., Smith, H., Jr., and Kuebler, R.R. The
importance of beta, the type II error and sample size in the design and
interpretation of the randomized control trial. New England Journal
of Medicine 299:690-694, 1978.
Guyatt, G., and Drummond, M. Guidelines for the clinical and economic
assessment of health technologies: The case of magnetic resonance.
International Journal of Technology Assessment in Health Care
1:551-566, 1985.
Guyatt, G.H., Tugwell, P.X., Feeny, D.H., et al. The role of before-after
studies of therapeutic impact in the evaluation of diagnostic technolo-
gies. Journal of Chronic Diseases 39:295-304, 1986.
Hanley, J.A., and McNeil, B.J. The meaning and use of the area under a
receiver operating characteristic (ROC) curve. Radiology 143:29-
36, 1982.
Hessel, S.J., Siegelman, S.S., McNeil, B.J., et al. A prospective evalu-
ation of computed tomography and ultrasound of the pancreas. Radi-
ology 143:129-133, 1982.
Hopwood, M.D., Mabry, J.C., and Sibley, W.L. A first-order characterization
of clinical trials. Prepared for the National Institutes of Health
by the Rand Corporation. R-2653-NIH; September 1980: 61-62.
Hull, R.D., Hirsh, J., Carter, C.J., et al. Diagnostic value of ventilation-perfusion
lung scanning in patients with suspected pulmonary embolism.
Chest 88:819-828, 1985.
Hunter, J.E. Meta-analysis: Cumulating Research Findings Across Stud-
ies. Beverly Hills, California, Sage Publications, 1982.
Kent, D.L., and Larson, E.B. Diagnostic technology assessment: Problems
and prospects. Annals of Internal Medicine 108:759-761, 1988.
Lidz, C.W., Meisel, A., Osterweis, M., et al. Barriers to informed consent.
Annals of Internal Medicine 99:539-543, 1983.
Ludwig, E.G., and Coletti, J.C. Some misuses of health statistics. Journal
of the American Medical Association 216:493-499, 1971.
Marks, J.W., Croke, G., Gochman, N., et al. Major issues in the organization
and implementation of the National Cooperative Gallstone Study
(NCGS). Controlled Clinical Trials 5:1-12, 1984.
Mattson, M.E., Curb, J.D., McArdle, R., et al. Participation in a clinical
trial: The patients' point of view. Controlled Clinical Trials 6:156-167,
1985.
McNeil, B.J. Pitfalls in and requirements for evaluations of diagnostic
technologies. In Wagner, J., ed., Proceedings of a Conference on
Medical Technologies, DHEW Pub. No (PHS) 79-3254, pp. 33-39.
Washington, D.C., U.S. Government Printing Office, 1979.
McNeil, B.J., Sanders, R., Alderson, P.O., et al. A prospective study of
computed tomography, ultrasound, and gallium imaging in patients
with fever. Radiology 139:647-653, 1981.
McNeil, B.J., Weichselbaum, R., and Pauker, S.G. Fallacy of the five-
year survival in lung cancer. New England Journal of Medicine
299:1397-1401, 1978.
Metz, C.E. Basic principles of ROC analysis. Seminars in Nuclear
Medicine 8:283-298, 1978.
Mosteller, F., Gilbert, J.P., and McPeek, B. Reporting standards and
research strategies for controlled trials: Agenda for the editor. Con-
trolled Clinical Trials 1:37-58, 1980.
Office of Technology Assessment, U.S. Congress. The Implications of
Cost-Effectiveness Analysis of Medical Technology. Stock No. 051-
003-00765-7. Washington, D.C., U.S. Government Printing Office,
1980a.
Office of Technology Assessment, U.S. Congress. The Implications of
Cost-Effectiveness Analysis of Medical Technology. Background
paper #1: Methodological issues and literature review. Washington,
D.C., U.S. Government Printing Office, 1980b.
Office of Technology Assessment, U.S. Congress. The Implications of
Cost-Effectiveness Analysis of Medical Technology. Background
paper #2: Case studies of medical technologies. Case Study #2: The
feasibility of economic evaluation of diagnostic procedures: The case
of CT scanning. Washington, D.C., U.S. Government Printing Office,
1981.
Phelps, C.E., and Mushlin, A.I. Focusing technology assessment using
medical decision theory. Medical Decisionmaking 8:279-289, 1988.
Pillemer, D.B., and Light, R.J. Synthesizing outcomes: How to use
research evidence from many studies. Harvard Educational Review
50:176-195, 1980.
Prout, T.E. Other examples of recruitment problems and solutions. In
Roth, H.P., and Gordon, R.S., Jr., eds., Proceedings of the National
Conference on Clinical Trials Methodology, October 1977. Clinical
Pharmacology and Therapeutics 25:695-696, 1979.
Schoenberger, J.A. Recruitment in the Coronary Drug Project and the
Aspirin Myocardial Infarction Study. In Roth, H.P., and Gordon,
R.S., Jr., eds., Proceedings of the National Conference on Clinical
Trials Methodology, October 1977. Clinical Pharmacology and
Therapeutics 25:681-684, 1979.
Schwartz, J.S. Evaluating diagnostic tests: What is done--what needs to
be done. Journal of General Internal Medicine 1:266-267, 1986.
Sheedy, P.F., Stephens, D.H., Hattery, R.R., et al. Computed tomography
in patients suspected of having carcinoma of the pancreas: Recent
experience (abstract). Presented at the scientific assembly and annual
meeting of the Radiological Society of North America, Chicago, Ill.,
November 1977.
Smith, T.J., Kemeny, M.M., Sugarbaker, P.H., et al. A prospective study
of hepatic imaging in the detection of metastatic disease. Annals of
Surgery 195:486-491, 1982.
Sox, H.C., Jr. Probability theory in the use of diagnostic tests: An
introduction to critical study of the literature. Annals of Internal
Medicine 104:60-66, 1986.
Steinberg, E.P., and Cohen, A.B. Office of Technology Assessment, U.S.
Congress. Nuclear Magnetic Resonance Imaging Technology: A
Clinical, Industrial, and Policy Analysis. Technology case study 27.
Washington, D.C., U.S. Government Printing Office, 1984.
Taylor, K.M., Margolese, R.G., and Soskolne, C.L. Physicians' reasons
for not entering eligible patients in a randomized clinical trial of
surgery for breast cancer. New England Journal of Medicine 310:1363-1367,
1984.
Vreim, C. Project officer, Prospective Investigation of Pulmonary Embolic
Diagnosis project (PIOPED). Personal communication, 1988.
Weiner, D.A., Ryan, T.J., McCabe, C.H., et al. Exercise stress testing:
Correlation among history of angina, ST-segment response and prevalence
of coronary artery disease in the Coronary Artery Surgery Study (CASS).
New England Journal of Medicine 301:230-235, 1979.
Weinstein, M.C. Methodologic considerations in planning clinical trials
of cost-effectiveness of magnetic resonance imaging (with a commentary
on Guyatt and Drummond). International Journal of Technology
Assessment in Health Care 1:567-581, 1985.
Weinstein, M.C., and Stason, W.B. Foundations of cost-effectiveness
analysis for health and medical practices. New England Journal of
Medicine 296:716-721, 1977.