Calibration in Computer Models for Medical Diagnosis and Prognostication
LUCILA OHNO-MACHADO
University of California, San Diego
FREDERIC RESNIC
Brigham and Women’s Hospital and Harvard Medical School
Cambridge, Massachusetts
MICHAEL MATHENY
Vanderbilt University
Nashville, Tennessee
Predictive models to support diagnoses and prognoses are being developed in
virtually every medical specialty. These models provide individualized estimates,
such as a prognosis for a patient with cardiovascular disease, based on specific
information about that individual (e.g., genotype, family history, past medical
history, clinical findings). Statistical and machine-learning techniques applied to
large clinical data sets are used to develop the models, which are used by both
health care professionals and patients. However, a critical step in the evaluation
of a model is often overlooked: verifying that the estimated probabilities of events
truly reflect the underlying probability for a particular individual. This property
is known as calibration.
ASSESSING CALIBRATION
A simplistic type of calibration is calibration-in-the-large, or bias. If the outcome
is binary (e.g., “0” if a patient is not diseased and “1” if a patient is diseased),
the bias corresponds to the average error of the estimates. For example, an estimate
of 89 percent for a patient whose outcome is “1” contributes an individual error of
0.11. The average error over all patients is the measure of calibration-in-the-large.
Calibration-in-the-large may be appropriate for considering a group of patients, but
it says little about how well calibrated each estimate is. For example, assigning the
prior probability of an event as the estimate or risk score for every patient would
result in a perfectly calibrated-in-the-large model but would provide no individualized
information. Hence calibration-in-the-large is of limited practical utility for
assessing predictive models.
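As a minimal sketch of the computation described above, the following Python snippet
computes calibration-in-the-large as the average signed error (outcome minus estimate).
The sign convention, the function name calibration_in_the_large, and the ten-patient
cohort are illustrative assumptions and are not taken from the chapter.

```python
def calibration_in_the_large(outcomes, estimates):
    """Average signed error (outcome minus estimate) over all patients."""
    if not outcomes or len(outcomes) != len(estimates):
        raise ValueError("outcomes and estimates must be non-empty and of equal length")
    return sum(y - p for y, p in zip(outcomes, estimates)) / len(outcomes)


# Individual error from the text: an estimate of 0.89 for a patient with outcome 1.
individual_error = 1 - 0.89

# Hypothetical cohort: 3 events among 10 patients (illustrative data only).
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
prior = sum(outcomes) / len(outcomes)

# Assigning the prior to every patient removes the bias
# but carries no individualized information.
constant_estimates = [prior] * len(outcomes)
bias_of_prior = calibration_in_the_large(outcomes, constant_estimates)

print(f"individual error: {individual_error:.2f}")
print("bias of the constant-prior model is zero:", abs(bias_of_prior) < 1e-12)
```

The constant-prior model has zero bias up to floating-point rounding, even though
it assigns the same risk to every patient.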
A fundamental problem in evaluating the calibration of a model in a health
care setting is the lack of a gold standard against which individual risk estimates
can be compared. A gold standard would be based on a sufficient number of exact
replicas of the individual, accurately diagnosed or followed without censoring,
so that the proportion of observed events would be equal to the “true estimate”
for that individual.
Since every individual is unique, meaningful approximations of true probability
are only possible for relatively large groups of similar individuals. However, the
way the similarity of patient profiles is defined is a critical factor. Currently,
calibration is measured by comparing health outcomes in sets of patients with
similar estimated risks. That is, given a predicted risk for an individual, a set
of neighboring individuals (in the sense of proximity in the single dimension of
the estimates, or “output space”) is assembled and the bias for this set is assessed.
This measure of “calibration-in-the-small” is the same as the measure of
“calibration-in-the-large,” except that it is applied to a smaller set of individuals
who received similar estimates from a given model.
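One way to sketch this idea in Python is to take the k patients whose estimates lie
closest to a given predicted risk and compute the bias within that neighborhood. The
use of a fixed k, the function name calibration_in_the_small, and the eight-patient
cohort are assumptions made for illustration; the chapter does not prescribe how the
neighborhood is chosen.

```python
def calibration_in_the_small(outcomes, estimates, target_estimate, k):
    """Bias among the k patients whose estimates are closest to target_estimate."""
    if not 1 <= k <= len(estimates):
        raise ValueError("k must be between 1 and the number of patients")
    # Rank patients by proximity in the one-dimensional output space.
    ranked = sorted(zip(estimates, outcomes),
                    key=lambda pair: abs(pair[0] - target_estimate))
    neighbors = ranked[:k]
    return sum(y - p for p, y in neighbors) / k


# Hypothetical cohort (illustrative data only): four low and four high estimates.
estimates = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
outcomes = [1, 1, 0, 0, 0, 0, 1, 1]

bias_overall = sum(y - p for y, p in zip(outcomes, estimates)) / len(outcomes)
bias_low = calibration_in_the_small(outcomes, estimates, target_estimate=0.1, k=4)
bias_high = calibration_in_the_small(outcomes, estimates, target_estimate=0.9, k=4)

print(f"low-risk neighborhood bias:  {bias_low:+.2f}")
print(f"high-risk neighborhood bias: {bias_high:+.2f}")
```

In this toy cohort the overall bias is essentially zero, yet the bias is about +0.4
among the low-risk estimates and about -0.4 among the high-risk estimates, which is
the kind of miscalibration that calibration-in-the-large cannot reveal.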
OUTPUT-SPACE SIMILARITY
One of the most widely used indices for assessing the calibration of predictive
models was developed in the context of logistic regression by Lemeshow and
Hosmer (1982). The idea behind the test is simple: if cases are sorted according to
their estimated level of risk and the mean estimate for each decile of risk is very
close to the proportion of positive cases in the decile, then one cannot reject the
hypothesis that the model is correct (Hosmer and Hjort, 2002; Hosmer et al., 1991,
1997). The statistic sums, over all deciles and both outcomes, the squared difference
between the sum of estimates and the number of events in each decile, divided by the
sum of estimates in that decile; it is reported to follow a χ² distribution with
8 degrees of freedom. If p < 0.05, we reject the hypothesis that the model fits the
data. The H-L-C statistic based on deciles of risk is defined as:
$$C = \sum_{D=0}^{1} \sum_{l=1}^{10} \frac{(\pi_{Dl} - O_{Dl})^{2}}{\pi_{Dl}},$$
where $\pi_{Dl}$ and $O_{Dl}$ are, respectively, the sum of estimates and the observed
frequency in the same decile, for cells indexed by group (decile) l and outcome D. Hosmer
and Lemeshow showed via simulations that C is approximately distributed as χ² with
10 − 2 = 8 degrees of freedom (the number of groups minus 2) when the fitted model is
the correct one and the estimated expected cell frequencies are sufficiently large.
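The definition above can be sketched in a few lines of Python. This is an illustrative
implementation under stated assumptions, not the authors' code: groups are formed by
slicing the sorted cases into ten near-equal parts, ties in the estimates are not
treated specially, every estimate is assumed to lie strictly between 0 and 1, and the
200-patient data set is invented for the example.

```python
def hosmer_lemeshow_c(outcomes, estimates, groups=10):
    """Hosmer-Lemeshow C statistic over near-equal-sized groups of sorted risk."""
    cases = sorted(zip(estimates, outcomes))  # sort patients by estimated risk
    n = len(cases)
    if n < groups:
        raise ValueError("need at least one patient per group")
    c = 0.0
    for l in range(groups):
        group = cases[l * n // groups:(l + 1) * n // groups]
        pi_1 = sum(p for p, _ in group)   # sum of estimates: expected events (D = 1)
        o_1 = sum(y for _, y in group)    # observed events (D = 1)
        pi_0 = len(group) - pi_1          # expected non-events (D = 0)
        o_0 = len(group) - o_1            # observed non-events (D = 0)
        # Assumes 0 < estimate < 1, so neither denominator is zero.
        c += (pi_1 - o_1) ** 2 / pi_1 + (pi_0 - o_0) ** 2 / pi_0
    return c


# Hypothetical data: 10 groups of 20 patients. Group l has estimate (2l + 1) / 20
# and exactly 2l + 1 events, so observed and expected counts agree in every decile.
estimates, outcomes = [], []
for l in range(10):
    events = 2 * l + 1
    estimates += [events / 20] * 20
    outcomes += [1] * events + [0] * (20 - events)

c_matched = hosmer_lemeshow_c(outcomes, estimates)

# Shrinking every estimate halfway toward 0.5 keeps the ranking
# but breaks the agreement between expected and observed counts.
shrunk = [0.5 + 0.5 * (p - 0.5) for p in estimates]
c_shrunk = hosmer_lemeshow_c(outcomes, shrunk)

print(f"C (matched estimates): {c_matched:.3f}")
print(f"C (shrunk estimates):  {c_shrunk:.3f}")
```

For the matched estimates C is essentially zero, while for the shrunk estimates it is
roughly 19, which is large for a χ² distribution with 8 degrees of freedom.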
Note that the H-L statistic is model-dependent because the statistic compares
the average estimate in each decile of estimated risk with the proportion of events
in that decile. To visualize the calibration of a predictive model, it is common to
plot the average estimate against the proportion of events for groups defined by
either (1) percentiles of estimated risk, as described above, or (2) pre-defined
ranges of the estimates. The latter is commonly used in clinical predictive models.
INPUT-SPACE SIMILARITY
We described above how output-space similarity can be used to measure
calibration in a more refined way. However, output-space similarity is
model-dependent and difficult to understand. Similarity in the input space is much
simpler (e.g., calculation of neighborhoods using features obtained directly from
the data) and may be an equally legitimate way to assess calibration.
We describe a simulation in which we established in advance four tight clusters
of “patients” (100 in each cluster) according to two variables, x1 and x2. The
purposes of this simulation were to illustrate the H-L “goodness-of-fit” statistic
and to check whether differences in calibration can be determined using this
statistic. Bi-normal distributions were generated with identical standard deviations
(0.1) and centered at (0,0), (0,1), (1,0), and (1,1) for clusters 1 to 4, respectively.
The binary outcome for each patient in a cluster was generated from a Bernoulli
distribution with probabilities 0.01, 0.4, 0.6, and 0.99 for clusters 1 to 4,
respectively. Figure 1 shows the spatial distribution of the clusters. For verification,
the four clusters were automatically re-discovered using the Expectation-Maximization
algorithm.
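A data set of this kind can be generated with a short Python sketch such as the one
below. The cluster centers, standard deviation, cluster size, and event probabilities
come from the description above; the independence of x1 and x2 within a cluster, the
random seed, and the function name simulate_patients are assumptions, so the draws
will not reproduce the authors' data.

```python
import random

CENTERS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]   # clusters 1 to 4
EVENT_PROBABILITIES = [0.01, 0.4, 0.6, 0.99]                 # clusters 1 to 4
STD_DEV = 0.1
PATIENTS_PER_CLUSTER = 100


def simulate_patients(seed=0):
    """Return a list of (cluster, x1, x2, outcome) tuples for 400 simulated patients."""
    rng = random.Random(seed)  # seed chosen arbitrarily for reproducibility
    patients = []
    for cluster, ((c1, c2), prob) in enumerate(zip(CENTERS, EVENT_PROBABILITIES), start=1):
        for _ in range(PATIENTS_PER_CLUSTER):
            # Assumption: x1 and x2 are drawn independently (zero correlation).
            x1 = rng.gauss(c1, STD_DEV)
            x2 = rng.gauss(c2, STD_DEV)
            outcome = 1 if rng.random() < prob else 0  # Bernoulli draw
            patients.append((cluster, x1, x2, outcome))
    return patients


patients = simulate_patients()
for cluster in range(1, 5):
    events = sum(y for c, _, _, y in patients if c == cluster)
    print(f"cluster {cluster}: proportion of events = {events / PATIENTS_PER_CLUSTER:.2f}")
```

With only 100 patients per cluster, the observed proportion of events in a cluster
will generally differ somewhat from its underlying probability.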
A logistic regression model fitted to these data is highly significant. For comparison,
we built a neural network with hidden units so that it was capable of finding a
nonlinear function relating the predictors and outcomes (Figure 1, bottom panel). An
ideal model would assign the true underlying probability for each case (i.e., 0.01,
0.4, 0.6, and 0.99, depending on which cluster the case belonged to). A neural network
with enough parameters was able to get closer to that goal than a semi-linear model,
such as logistic regression.
Table 1 shows descriptive statistics for the estimates obtained by the two
types of models. The H-L-C statistic for the logistic-regression model was 6.43
(p = 0.59); hence we would not reject the hypothesis that the model is calibrated.
Although the neural network model had a less favorable H-L-C of 11.773 (p =
0.16), the overall errors were smaller.
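The p-values quoted above can be checked against the upper tail of a χ² distribution
with 8 degrees of freedom, which has a closed form when the number of degrees of
freedom is even. The sketch below uses only the Python standard library; the function
name chi2_upper_tail_8df is an illustrative choice.

```python
import math


def chi2_upper_tail_8df(c):
    """P(X >= c) for a chi-square variable with 8 degrees of freedom.

    For an even number of degrees of freedom 2k, the upper tail equals
    exp(-c/2) * sum((c/2)**i / i! for i in range(k)); here k = 4.
    """
    half = c / 2.0
    return math.exp(-half) * (1 + half + half ** 2 / 2 + half ** 3 / 6)


p_lr = chi2_upper_tail_8df(6.43)     # H-L-C reported for logistic regression
p_nn = chi2_upper_tail_8df(11.773)   # H-L-C reported for the neural network

print(f"logistic regression: p = {p_lr:.3f}")
print(f"neural network:      p = {p_nn:.3f}")
```

Both values come out close to the reported p-values and lie well above 0.05, so
neither model would be rejected on the basis of this statistic.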
In this example, the neural network provided better approximations of the true
underlying probability of the event in clusters 2 and 3, as can be seen in the ranges
of estimates in these clusters, as well as in their maximum residuals. However,
neither the H-L-C statistic nor the calibration plot (Figure 2) indicates that the
neural network is the better model in this case.
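The comparison of ranges and maximum residuals can be made concrete with the minimum
and maximum estimates listed in Table 1. In the sketch below, a residual is taken to
be the absolute difference between an estimate and the true underlying probability of
its cluster; that definition and the helper names are assumptions, since the chapter
does not state how the residuals were computed.

```python
TRUE_PROBABILITY = {1: 0.01, 2: 0.40, 3: 0.60, 4: 0.99}

# (minimum, maximum) estimate per cluster, copied from Table 1.
LR_RANGE = {1: (0.0066, 0.0949), 2: (0.1955, 0.8013),
            3: (0.2873, 0.8507), 4: (0.9301, 0.9954)}
NN_RANGE = {1: (0.015, 0.059), 2: (0.431, 0.584),
            3: (0.647, 0.732), 4: (0.985, 0.992)}


def max_residual(true_p, estimate_range):
    """Largest absolute distance between the true probability and any estimate."""
    low, high = estimate_range
    return max(abs(low - true_p), abs(high - true_p))


def ranges_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


for cluster in (2, 3):
    lr = max_residual(TRUE_PROBABILITY[cluster], LR_RANGE[cluster])
    nn = max_residual(TRUE_PROBABILITY[cluster], NN_RANGE[cluster])
    print(f"cluster {cluster}: max residual LR = {lr:.4f}, NN = {nn:.4f}")

lr_overlap = ranges_overlap(LR_RANGE[2], LR_RANGE[3])
nn_overlap = ranges_overlap(NN_RANGE[2], NN_RANGE[3])
print("LR estimates for clusters 2 and 3 overlap:", lr_overlap)
print("NN estimates for clusters 2 and 3 overlap:", nn_overlap)
```

Under this definition the neural network has the smaller maximum residual in both
clusters, and only the logistic-regression estimates for clusters 2 and 3 overlap.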

[Figure 1: two scatter plots of x2 (vertical axis) against x1 (horizontal axis). Points
are marked by cluster (1 to 4) and outcome (Y=0 or Y=1); the top panel also shows the
“axis” from logistic regression.]
FIGURE 1 Simulation with four predefined non-overlapping bi-normal clusters of individuals
with known underlying probability of an event (0.01, 0.40, 0.60, and 0.99 for clusters
1 to 4, respectively). Top panel: Two-dimensional data are projected into one dimension by
the logistic-regression model. Dotted diagonal lines divide quartiles of risk. A patient from
cluster 2, indicated by the arrow, has an estimate closer to the average estimate for cluster
3 than to the average for cluster 2. Confidence in this estimate should be lower than for
a patient in the middle of one of the clusters. The input-space clusters, as opposed to the
quartiles of risk, can be explained because patients in cluster 1 have low x1 and low x2,
while patients in cluster 2 have low x1 and high x2, and so on. Bottom panel: Projection
of the points into an “axis” for neural network estimates. The neural network model comes
closer to the true probabilities for clusters 2 and 3 than the logistic-regression model.

TABLE 1 Descriptive statistics of logistic-regression (LR) and neural network (NN) estimates
according to input-space clusters. Note that NN estimates do not overlap (i.e., the minimum
estimate for cluster 3 is greater than the maximum estimate for cluster 2).
Cluster  Proportion  LR      NN      LR       NN       LR       NN       LR       NN
         of Events   Mean    Mean    Std Dev  Std Dev  Minimum  Minimum  Maximum  Maximum
1        0.01        0.0338  0.0219  0.0172   0.0069   0.0066   0.015    0.0949   0.059
2        0.42        0.4129  0.4819  0.1080   0.0207   0.1955   0.431    0.8013   0.584
3        0.64        0.6291  0.6852  0.1149   0.0146   0.2873   0.647    0.8507   0.732
4        0.98        0.9740  0.9908  0.0127   0.0011   0.9301   0.985    0.9954   0.992

[Figure 2: calibration plot with axes labeled “Expected” (vertical, 0 to 40) and
“Observed” (horizontal, 0 to 40), with one series for the neural network and one for
logistic regression.]
FIGURE 2 Calibration plot for logistic-regression and neural network models based on
deciles of risk. There is no apparent superiority of one model over the other.
IMPLICATIONS FOR MEDICAL DECISIONS
In clinical practice, incorrect estimates have significant implications. For
example, the widely used clinical practice guideline from the report by the Adult
Treatment Panel III (NCEP, 2002) uses cardiovascular risk estimates similar to
those available from online calculators to recommend particular treatment regimens.
If these estimates are not calibrated, medication to manage cholesterol levels may
be used inappropriately.
Computer-based post-marketing tools for the surveillance of new medications
and medical devices use models that adjust risk for the population being treated
(Matheny et al., 2006). These models depend on the accuracy of the estimates to
trigger appropriate alerts for unsafe technologies and drugs. For non-calibrated
estimates, risk adjustments may result in a large number of false positives or of
false negatives, either of which would incur large costs to the health care system.
It is critical, therefore, to assess the calibration of estimates before using models
in clinical settings.
We and others have shown, in different domains, that the calibration of
medical diagnostic and prognostic models can vary significantly according to the
population to which they are applied (Hukkelhoven et al., 2006; Matheny et al.,
2005; Ohno-Machado et al., 2006), even though discrimination indices such as the
areas under the ROC curves may not vary. Although some efforts are being made
to recalibrate models for different populations and study reclassification rates,
Web-based calculators that estimate individualized risk do not yet take this issue
into account and may present incorrect estimates for particular individuals.
We have proposed methods for taking into account input-space clusters in
predictive models (Osl et al., 2008; Robles et al., 2008), but much remains to be
done to inform health care workers and the public about the potential shortcomings
of this aspect of personalized medicine. As new molecular-based biomarkers
for a variety of health conditions are developed and used in multidimensional
models to diagnose or prognosticate these conditions, it will become even more
important to develop accurate methods of assessing the quality of estimates
derived from predictive models.
ACKNOWLEDGMENTS
The authors acknowledge support from the National Library of Medicine,
NIH, FDA, and VA grants R01LM009520 (LO), R01 LM008142 (FR),
HHSF 223200830058C (FR), and VA HSR&D CDA2-2008-020 (MM).
REFERENCES
Hosmer, D.W., and N.L. Hjort. 2002. Goodness-of-fit processes for logistic regression: simulation
results. Statistics in Medicine 21(18): 2723–2738.
Hosmer, D.W., T. Hosmer, S. Le Cessie, and S. Lemeshow. 1997. A comparison of goodness-of-fit
tests for the logistic regression model. Statistics in Medicine 16(9): 965–980.
Hosmer, D.W., S. Taber, and S. Lemeshow. 1991. The importance of assessing the fit of logistic
regression models: a case study. American Journal of Public Health 81(12): 1630–1635.
Hukkelhoven, C.W., A.J. Rampen, A.I. Maas, E. Farace, J.D. Habbema, A. Marmarou, L.F. Marshall,
G.D. Murray, and E.W. Steyerberg. 2006. Some prognostic models for traumatic brain injury
were not valid. Journal of Clinical Epidemiology 59(2): 132–143.
Lemeshow, S., and D.W. Hosmer Jr. 1982. A review of goodness of fit statistics for use in the
development of logistic regression models. American Journal of Epidemiology 115(1): 92–106.
Matheny, M.E., L. Ohno-Machado, and F.S. Resnic. 2005. Discrimination and calibration of mortality
risk prediction models in interventional cardiology. Journal of Biomedical Informatics 38(5):
367–375.
Matheny, M.E., L. Ohno-Machado, and F.S. Resnic. 2006. Monitoring device safety in interventional
cardiology. Journal of the American Medical Informatics Association 13(2): 180–187.
NCEP (National Cholesterol Education Program). 2002. Third report of the national cholesterol
education program expert panel on detection, evaluation, and treatment of high blood cholesterol
in adults (Adult Treatment Panel III) final report. Circulation 106(25): 3143–3421.
Ohno-Machado, L., F.S. Resnic, and M.E. Matheny. 2006. Prognosis in critical care. Annual Review
of Biomedical Engineering 8: 567–599.
Osl, M., L. Ohno-Machado, and S. Dreiseitl. 2008. Improving calibration of logistic regression models
by local estimates. AMIA Annual Symposium Proceedings 2008: 535–539.
Robles, V., C. Bielza, P. Larranaga, S. Gonzales, and L. Ohno-Machado. 2008. Optimizing logistic
regression coefficients for discrimination and calibration using estimation of distribution
algorithms. TOP: An Official Journal of the Spanish Society of Statistics and Operations Research
16(2): 345–366.