7
Some Methodological Issues in Making Predictions
John B. Copas and Roger Tarling
Methodological considerations are central to all quantitative or actuarial predictions, although each particular prediction study invariably presents its own special issues. At its most general level, a prediction study investigates the extent to which criterion measures (the dependent variables) can be predicted by one or more measures of other factors (the predictor or independent variables).

It is outside the scope of this paper to discuss all the important methodological steps in the process: the selection and measurement of appropriate information; the choice of statistical method; the practical application of a prediction instrument and its utility. Instead, we concentrate on four aspects. First, we examine in detail the Burgess and Glueck point-scoring methods, which have been used extensively in criminological prediction. Second, we consider the important topic of validating and calibrating the prediction instrument. Third, we review the various measures that have been proposed to assess an instrument's predictive power. Fourth, we describe methods for reusing samples to carry out a prospective validation. At each stage we attempt to synthesize some of the previous work in the area and present the results of our more recent statistical and methodological research.

John B. Copas is professor of statistics, University of Birmingham, England; Roger Tarling is deputy head, Home Office Research and Planning Unit, England.
POINT-SCORING METHODS
A variety of statistical methods have been used to construct prediction instruments. Chief among them are the Burgess and Glueck point-scoring methods, multiple regression, log-linear methods, and logistic regression. In addition, various clustering, classification, and segmentation techniques have been used. (The latter group of techniques are not discussed here; see Fielding, 1979; Tarling and Perry, 1985.)¹ For examples of the application of all these methods in criminological research, see the studies included in Farrington and Tarling (1985).

¹The statistical methods listed above have severe limitations for much criminal career research, especially when the dependent variable is not binary and the focus of interest is on the time interval to some event, for example, the next offense. We would suggest that alternative statistical methods, stochastic point-process models, and failure-rate regression models are more appropriate in these situations and should receive more attention from criminologists.

Invariably, these methods have been used in studies in which the dependent variable is binary (e.g., reconvicted/not reconvicted). Many criminologists have found that simple point-scoring methods are more efficient or robust than more sophisticated methods and shrink less when applied to a validation sample. This seems especially so when the data contain measurement errors or "noise" (S. D. Gottfredson and Gottfredson, 1985; Wilbanks, 1985). This finding, plus the fact that point-scoring methods are simple in conception and administratively easy to use, has led to their being adopted in practice, particularly in studies of parole and sentencing decision making (D. M. Gottfredson, Wilkins, and Hoffman, 1978; Nuttall et al., 1977). However, some commentators have said that point-scoring methods are intolerably crude, have no statistical foundations, and do not result in any direct probabilistic interpretation. In this section we explore point-scoring methods to see if we can resolve some of these tensions and anomalies. In addition, we show how point-scoring methods, reconceptualized in the way we recommend, can be extended.

There are two basic point-scoring methods, one ascribed to Burgess (1928) and the other to Glueck and Glueck (1950). In the Burgess method each subject is given a score of either 0 or 1 on each of a number of predictors, depending on whether the subject falls into a category with a below- or above-average success rate. The Glueck method is more sophisticated in that, instead of contributing a score of 0 or 1, each category of each predictor is weighted according to the percentage of subjects in that category who are successes. The Glueck method can be applied to polychotomous independent variables, but in practice it has only been used for binary predictors. We keep to this simpler situation in our discussion.

Both the Burgess and the Glueck methods have their parallels in standard statistical theory: the "independence Bayes method." First, consider the Burgess method.
Let $x_i$ be a series of binary predictive factors, let $q$ be the overall success rate, and suppose that within the success ($S$) and failure ($F$) groups separately, the factors $x_i$ are statistically independent of each other. Let

$$h_i = P(x_i = 1 \mid S), \qquad g_i = P(x_i = 1 \mid F).$$

Assume the $x_i$'s are coded such that $h_i > g_i$. Then, by Bayes theorem,

$$P(S \mid x_i = 1) = h_i q / p_i,$$
$$P(S \mid x_i = 0) = (1 - h_i) q / (1 - p_i),$$

where

$$p_i = P(x_i = 1) = h_i q + g_i (1 - q).$$

By independence and Bayes theorem again,

$$\frac{P(S \mid x)}{P(F \mid x)} = \frac{q}{1 - q} \cdot \frac{\prod_i P(x_i \mid S)}{\prod_i P(x_i \mid F)},$$

and so the log odds for $S$ after observing $x$ is

$$\log \frac{q}{1 - q} + \sum_i \log \frac{P(x_i \mid S)}{P(x_i \mid F)}.$$

This can be written as

$$k + \sum_i W_i x_i,$$

where

$$W_i = \log \frac{h_i (1 - g_i)}{(1 - h_i) g_i},$$

which is just the log odds ratio for the $2 \times 2$ table classifying $x_i = 1$ or $0$ against $S$ and $F$. Given independence, these are therefore the optimum weights. By the Neyman-Pearson Lemma in statistical theory, any other set of weights must be less efficient (i.e., they do not use all the information available in the $x_i$'s).
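The derivation above can be turned directly into an estimation recipe: estimate $q$, $h_i$, and $g_i$ from the construction data and form the score $k + \sum_i W_i x_i$. The following sketch is our own illustration, not the chapter's (the continuity correction and all function names are our assumptions):

```python
import math

def independence_bayes(construction, outcomes):
    """Fit the independence Bayes score from binary data.

    construction: list of binary feature vectors x = (x_1, ..., x_m)
    outcomes:     list of 0/1 success indicators (1 = S, 0 = F)
    Returns (k, weights) so that the log odds of S given x is
    k + sum_i W_i x_i, with W_i the log odds ratio of the 2x2 table.
    """
    m = len(construction[0])
    n = len(outcomes)
    n_s = sum(outcomes)
    q = n_s / n                              # overall success rate
    k = math.log(q / (1 - q))
    weights = []
    for i in range(m):
        # h_i = P(x_i = 1 | S), g_i = P(x_i = 1 | F), with a small
        # continuity correction (our choice) to avoid zero counts
        s1 = sum(x[i] for x, y in zip(construction, outcomes) if y == 1)
        f1 = sum(x[i] for x, y in zip(construction, outcomes) if y == 0)
        h = (s1 + 0.5) / (n_s + 1.0)
        g = (f1 + 0.5) / ((n - n_s) + 1.0)
        weights.append(math.log(h * (1 - g) / ((1 - h) * g)))
        # the x_i = 0 cells contribute a constant absorbed into k
        k += math.log((1 - h) / (1 - g))
    return k, weights

def prob_success(k, weights, x):
    """P(S | x) = e^s / (1 + e^s) for the score s."""
    s = k + sum(w * xi for w, xi in zip(weights, x))
    return math.exp(s) / (1 + math.exp(s))
```

Note that the method returns a probability directly, which is advantage 2 claimed for the independence Bayes approach below.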
The Burgess method has $W_i = 1$, or, since a scale factor in the score is irrelevant, $W_i = \text{constant}$. Thus, the Burgess method is only optimum if the cross-product ratio is the same for each factor (i.e., each $x_i$ gives the same amount of information about $S$ or $F$).

The Glueck method is equivalent to

$$W_i = P(S \mid x_i = 1) - P(S \mid x_i = 0),$$

which, from above, simplifies to

$$W_i = \frac{q(1 - q)(h_i - g_i)}{p_i(1 - p_i)}.$$

Again, a constant multiple is irrelevant, so essentially

$$W_i = \frac{h_i - g_i}{p_i(1 - p_i)} \neq \text{log odds ratio for } x_i.$$

However, if $x_i$ has only modest predictive power, we can write

$$h_i = g_i + \epsilon_i,$$

where $\epsilon_i$ is small. We can then show that

$$\log \frac{h_i(1 - g_i)}{g_i(1 - h_i)} = \frac{h_i - g_i}{p_i(1 - p_i)} + \text{terms involving } \epsilon_i^2.$$

Hence the Glueck method is approximately optimum if $\epsilon_i$ is small, that is, if each individual $x_i$ contributes only a modest amount of information. In many practical cases the score may involve a relatively large number of $x_i$'s, none of which by itself is spectacular, but together they may be useful. This, we suggest, accounts for the apparent success of the Glueck method.
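The approximation can be checked numerically. The sketch below is ours (the particular values of $g_i$, $\epsilon_i$, and $q$ are arbitrary illustrations): when each $\epsilon_i$ is small, the Glueck weight tracks the optimal log odds ratio closely.

```python
import math

# Numerical check (our illustration, not the chapter's): when
# h = g + eps with eps small, the Glueck weight (h - g) / (p(1 - p))
# agrees with the optimal weight log[h(1-g) / ((1-h)g)] up to O(eps^2).
def bayes_weight(h, g):
    return math.log(h * (1 - g) / ((1 - h) * g))

def glueck_weight(h, g, q):
    p = h * q + g * (1 - q)            # p_i = P(x_i = 1)
    return (h - g) / (p * (1 - p))     # constant factor q(1-q) dropped

q = 0.4
for g, eps in [(0.3, 0.02), (0.5, 0.05), (0.2, 0.01)]:
    h = g + eps
    print(bayes_weight(h, g), glueck_weight(h, g, q))
```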
As set out above, Burgess and Glueck are not separate and distinct models but are, in fact, simple log-linear models in which all the predictor variables are treated as independent, i.e., they are not correlated. We would advocate the use of the formal independence Bayes method in preference to the more ad hoc Burgess and Glueck approaches because it has several important advantages:

1. It is equally simple yet is based on a coherent theory and is optimum within the framework of that theory.

2. It provides a direct estimate $P(S \mid x)$, whereas the scoring methods of Burgess and Glueck have to be separately calibrated on the data; that is, the probability of success given a certain score is estimated by calculating the proportion of all subjects with that score who succeeded.

3. Similarly, the value of the score is seen to be a log odds ratio. Hence if the score is $s$, the probability of success must be of the form

$$\frac{e^s}{1 + e^s}.$$

There are two further advantages of the Bayes method that make it extendable in ways not possible for the Burgess and Glueck methods. (Extensions of this kind have been considered in the medical literature under the name of "computer-aided diagnosis models.") First, it can more readily accommodate $x_i$'s that are not binary. The formula is then

$$\text{log odds for } S \text{ given } x = \log \frac{q}{1 - q} + \sum_i \log \frac{f_i(x_i)}{g_i(x_i)},$$

where

$$f_i(x_i) = P(x_i \mid S) \quad \text{and} \quad g_i(x_i) = P(x_i \mid F).$$
Of course, all these probabilities are estimated from the data. Note that we need the proportions of the various values of $x_i$ within the F and S groups separately and not the proportions of S and F within the groups defined by various values of $x_i$ (a crucial distinction). The above formula is not necessarily linear in each $x_i$ (but there is no reason to expect it to be). Thus we avoid the need arbitrarily to dichotomize each predictor variable, the full information in each value of $x_i$ being retained in an optimum way. Of course, if the $x_i$'s are divided into too many categories, each term, such as $P(x_i \mid S)$, is estimated less accurately, and so, if there are too many categories (e.g., age measured in years), it is better to treat $x_i$ as a continuous variable and use a regression technique. Thus if some $x_i$'s are continuous, the term

$$\log \frac{f_i(x_i)}{g_i(x_i)}$$

can be estimated directly as a regression on $x_i$. Hence the method can accommodate mixed data in which some $x_i$'s are continuous, e.g., age, and some $x_i$'s are binary, e.g., sex (cf. analysis of covariance methods).
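One simple way to estimate the $\log f_i(x_i)/g_i(x_i)$ term for a continuous predictor is to fit a parametric density within each group. The sketch below is entirely our own illustration (the within-group normality assumption, the data, and all names are ours, not the chapter's):

```python
import math
from statistics import mean, stdev

# A minimal sketch of the mixed-data extension: for a continuous x_i,
# estimate log f_i(x)/g_i(x) by fitting a normal density within the
# S and F groups separately (a strong assumption made for illustration).
def log_density_ratio(x_s, x_f):
    """Return a function x -> log f(x)/g(x) under normal assumptions."""
    mu_s, sd_s = mean(x_s), stdev(x_s)
    mu_f, sd_f = mean(x_f), stdev(x_f)
    def term(x):
        log_f = -math.log(sd_s) - 0.5 * ((x - mu_s) / sd_s) ** 2
        log_g = -math.log(sd_f) - 0.5 * ((x - mu_f) / sd_f) ** 2
        return log_f - log_g           # shared constants cancel
    return term

# Hypothetical example: ages of successes cluster higher than ages of
# failures, so the term raises the log odds of S for older subjects.
ages_s = [24, 27, 30, 31, 35, 29]
ages_f = [17, 19, 21, 22, 18, 23]
term = log_density_ratio(ages_s, ages_f)
```

If the two group standard deviations are equal, the term is linear in $x$, which matches the suggestion in the text that it can be estimated as a regression on $x_i$; unequal standard deviations give a quadratic term.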
Second, the Bayes method can be generalized to take account of particular circumstances concerning the distribution of the $x_i$'s. For example, if the $x_i$'s are not independent but correlated to a roughly equal extent (e.g., they are all positively correlated), a modification simply involves multiplying $W_i$ by a constant, and so the relative weights remain essentially the same. Thus, if the Bayes formula is recalibrated on the data (which allows an appropriate linear transformation of the score to be estimated), it works well even when the $x_i$'s are moderately correlated with each other. If the $x_i$'s are correlated, but not all to the same degree, the so-called "Lancaster models" can be used, which are based on a second-order approximation to the joint distribution of the $x_i$'s. These models have been found useful in medical diagnosis applications; see the review in Titterington et al. (1981).

Apart from the obvious simplicity, an important advantage of all these methods is the relative precision with which the weights (or coefficients, if viewed as a log-linear model) are estimated. This is because the assumption of independence allows each weight to be estimated separately, and any sampling effects in the intercorrelations of the $x$'s have no effect. If the sample size is relatively small, and the correlations between the $x$'s are, at most, modest, point-scoring methods do well. Larger correlations between the $x$'s, but with a similar sample size, can be dealt with in an approximate way by one of the modifications mentioned above. For somewhat larger sample sizes, however (say several hundred), a prediction equation should make proper allowance for the dependence between the $x$'s, and a logistic model or log-linear model (in the usual sense for categorical data) is the preferred alternative. In such models, each weight or coefficient is, of course, not just a function of the relevant $x_i$ but depends in a much more complicated way on the joint distribution of all the $x_i$'s. The complexity of the model affects the degree of shrinkage, which will be discussed later in the paper. If our suggestions for correcting for shrinkage are used, the increased shrinkage of these complicated models should not present a problem.
PREDICTIVE POWER, CALIBRATION, AND SHRINKAGE OF PREDICTION EQUATIONS
Much statistical work in criminology has been concerned with the construction and use of prediction equations. For each individual, some response $y$ (a binary yes-no variable, a time to arrest, et cetera) is measured, along with values of explanatory variables $x = (x_1, x_2, \ldots)$, and on the basis of these $x$'s a predicted value of $y$, say $\hat{y}$, is formulated. How good is $\hat{y}$ as a predictor of $y$? Issues related to this general question are discussed in this section. We are concerned here with the underlying methodology of the assessment of prediction equations, rather than with details of prediction equations in specific applications.
There are two contrasting, and yet complementary, approaches to the discussion of this question, corresponding roughly to the two philosophies of statistical inference and decision theory as understood in the statistical literature. The inference approach is taken up in the next section, where we ask: Given that an individual is described by $x = (x_1, x_2, \ldots)$, what information does that give us about $y$? A prediction equation, with value $\hat{y}$, is seen as an estimate of the expectation of $y$ in some sense. The properties and behavior of a prediction instrument are studied in terms of the accuracy of $\hat{y}$ over the totality of all different values of $y$ and $x$. We argue that a particular advantage of the inference approach is that a clear discussion of shrinkage is possible. Our discussion leads to a correction for shrinkage, or to "preshrunk" prediction equations as we will call them.
The other approach is more pragmatic; it views a prediction equation as a means to an end, that of a decision instrument. All the issues are illustrated by a binary classification, conventionally labeled positive-negative. Each individual falls into one or other group (e.g., success-failure), the decision as to which is the true group being made on the basis of $x$. The discussion focuses entirely on the frequencies of correct and incorrect decisions. A confusing array of measures of predictive power has appeared in the criminological literature (and in the parallel literature on computer-aided diagnosis in medicine). We show that the more important of these are in fact very closely related to each other.
There is an obvious link between the two approaches. If $y$ is an observed response, a binary classification could be: success if $y \geq k_1$ and failure if $y < k_1$. The classification from the prediction equation would by analogy be: success if $\hat{y} \geq k_2$ and failure if $\hat{y} < k_2$ (there is no reason to insist that $k_1 = k_2$). We would argue in favor of formulating $\hat{y}$ to optimize such properties as calibration and validation (discussed in the next section) and then choosing $k_2$ to secure desirable aspects of error rates and/or utility (discussed later).

It is worth noting, however, that prediction equations are sometimes useful as a research tool in their own right, not just as a means of implementing the positive-negative decision. For instance, to control for differences between cases in a study, the value of an appropriate prediction $\hat{y}$ could sensibly be used either as a covariate in statistical analysis using covariance adjustments or as a criterion for matching cases and controls in a matched-pairs design. An example of the former approach is in Bottoms and McClintock (1973:Chapter 11).
Validation and Shrinkage
It is almost universal experience that, when a prediction equation is fitted to data and then applied to some new cases or a new cohort, the usefulness and accuracy of the prediction are much more disappointing than expected. The term "shrinkage" has been used to describe this deterioration in predictive power. Although the effect is real enough, and noted in many studies, the term has never been given a precise definition. Quite independently of the experience of criminologists in using prediction equations, there has been the remarkable development in the statistical literature of so-called "shrinkage estimation," a technique whereby a set of related parameters can be estimated more accurately (on average) than by conventional techniques, such as least squares. The use of the same term in these different contexts has appeared at best coincidental and at worst grotesquely misleading. However, there are known to be close connections between them, as discussed in Copas (1983b). Using the theory described in that paper it is possible to (a) clarify the manifestations of shrinkage, (b) highlight the reasons for them, (c) derive alternative methods of fitting prediction equations that will eliminate some of the adverse effects of shrinkage, and (d) enable the extent of shrinkage in any given application to be estimated in advance from the original data. These points are discussed in this section, and a brief outline of Copas's theory is illustrated by a criminological example.
In fitting a prediction equation to data, we will have, as before, observations on some response $y$ (e.g., the number of convictions in a long-term follow-up, or a binary factor describing whether some event, such as rearrest, has occurred) together with information on a number of predictive factors $x$ (number of previous convictions, age, et cetera). The aim is to formulate a predictor $\hat{y} = f(x)$ for some function $f$ [e.g., multiple regression, in which case $f(x) = \alpha + \beta' x$]. The fit of the equation relates to the proximity of $\hat{y}$ to the actual observed values of $y$. Two aspects of the prediction equation are distinguished:

1. Calibration. Here we group cases with the same or similar values of $\hat{y}$ and ask whether the average of the associated $y$'s is equal to the predicted value $\hat{y}$. The greater the difference, the worse the calibration.

2. Efficacy. Here we ask whether values of $\hat{y}$ discriminate clearly between cases with different $x$'s. A simple measure of this is the correlation between $y$ and $\hat{y}$. (In the case of multiple regression this is just the multiple correlation coefficient, $R$.) A large $R$ shows that $\hat{y}$ changes substantially as $x$ changes, while a small $R$ means that $\hat{y}$ is almost the same for all $x$ (and so is useless as a predictor).
The ideal predictor, never realized in practice, is one in which $\hat{y} = y$ for all $x$, which calibrates perfectly and has maximal efficacy ($R = 1$). In practice, if the model behind the prediction equation is correct, when judged by values of $y$ and $\hat{y}$ in the data, $\hat{y}$ will calibrate well but have $R$ somewhat less than 1 (this is essentially the Gauss-Markov theorem of least squares).

A second crucial distinction is between retrospective fit and validation fit. Retrospective fit concerns the comparison between values of $y$ and $\hat{y}$ in the data on which the prediction equation is fitted. Validation fit envisages the prediction equation being applied to a new set of cases or subjects and compares the actual values of $y$ in the new data with the predictions $f(x)$, calculated using the original prediction equation $f$ but using the new values of $x$. The difference between the sets of data is emphasized by the terms "construction data" and "validation data." Shrinkage implies that validation fit is worse than retrospective fit. In practice, the predictions $\hat{y}$ calibrate well in the construction data but less well, and sometimes very badly, in the validation data. Efficacy is nearly always worse in the validation data than in the construction data. Copas's theory quantifies both these aspects of the deterioration of fit.

There are (at least) three possible causes of the deterioration in both these aspects of fit: (a) a purely statistical effect that is the inevitable result of unexplained (random) variation in the data; (b) changes in the population of $x$'s from construction data to validation data (e.g., there might be some intermediate change of policy or other intervention that alters the range of subjects available for study); and (c) the underlying associations between $y$ and $x$ might change (e.g., a change in some latent factor that is not observed in $x$). Each of these causes of shrinkage is discussed below.
Shrinkage as a Statistical Effect: Cause (a)
Cause (a) will be illustrated in the case of multiple regression, in which the statistical model is

$$y = \alpha + \beta' x + \epsilon,$$

$\epsilon$ being the usual random error. Without loss of generality, we can assume the $x$'s are standardized to have mean zero, so that $\alpha$ merely reflects the overall average value of $y$. Suppose causes (b) and (c) do not operate, so that we have a stable population of $x$'s and constant true values of $\alpha$ and $\beta$ as we go from construction to validation data. This, therefore, represents the ideal situation as far as fitting and validating a prediction equation is concerned.

If $\hat{\alpha}$ and $\hat{\beta}$ are least squares estimates in the construction data, the prediction equation is

$$\hat{y} = \hat{\alpha} + \hat{\beta}' x.$$

Suppose we test this out on a very large validation sample, so that we compare $y = \alpha + \beta' x + \epsilon$ with $\hat{\alpha} + \hat{\beta}' x$ over a population of new cases $(y, x)$. To study calibration, we calculate the average $y$ (i.e., $\alpha + \beta' x$) over those cases $x$ that relate to a specific prediction $\hat{y}$. This is done by fitting a linear regression of $y$ on $\hat{y}$, which can be shown to have slope

$$K = \frac{\beta' V \hat{\beta}}{\hat{\beta}' V \hat{\beta}},$$

where $V$ is the variance-covariance matrix of the $x$'s. The average of $K$, over statistical errors in $\hat{\beta}$, which is evaluated in Copas (1983b), is always less than 1. Hence large values of $\hat{y}$ tend to be overestimates and small values of $\hat{y}$ tend to be underestimates. This is because

$$E(\hat{\beta}' V \hat{\beta}) = \beta' V \beta + \frac{m \sigma^2}{n} > \beta' V \beta = E(\beta' V \hat{\beta}),$$

where $n$ is the sample size in the construction data and $m$ is the number of variables measured in $x$. By the same reasoning, $\beta' V \beta$ can be estimated by $\hat{\beta}' V \hat{\beta} - m\hat{\sigma}^2/n$, where $\hat{\sigma}^2$ is the usual residual mean square, and so $K$ itself can be estimated by

$$\hat{K} = \frac{\hat{\beta}' V \hat{\beta} - m \hat{\sigma}^2 / n}{\hat{\beta}' V \hat{\beta}} = 1 - \frac{1}{F},$$

where $F$ is the usual $F$-ratio of multiple regression. A more thorough analysis, valid if $m \geq 3$, shows that the slightly modified estimate

$$\hat{K} = 1 - \frac{m - 2}{m F}$$

is unbiased in the sense that $E(\hat{K}) = E(K)$. Thus $\hat{K}$ measures the deterioration in calibration; in a set of validation data, the average value of $y$ to be expected for a given $\hat{y}$ is not $\hat{y}$, as might be anticipated from the construction data, but

$$\tilde{y} = \bar{y} + \hat{K}(\hat{y} - \bar{y}),$$

where $\bar{y}$ is the overall observed average of $y$. The smaller $K$ is, the greater the distortion in calibration. Of course, this is itself a prediction in the sense that $\tilde{y}$ is calculated from the construction data and cannot be expected to be invariably correct when applied to practical validation data. However, on average, and to an approximation examined in detail in Copas (1983b),

$$E(y \mid \tilde{y}) = \tilde{y}$$

for a typical validation case $(y, x)$. Thus $\tilde{y}$ can be said to be preshrunk in the sense that it is expected to calibrate well (show no calibration shrinkage) on validation data. Of course $\tilde{y}$ will not calibrate well on the construction data (because it is $\hat{y}$ that does), but, from a pragmatic point of view, retrospective performance of a predictor is irrelevant.
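The preshrinking recipe above reduces to two small computations. The sketch below is ours (the numerical values are hypothetical, chosen only to illustrate the formulas):

```python
from statistics import mean

# Estimate the calibration slope K from the construction-sample F ratio,
# then pull each prediction toward the overall mean before use on new cases.
def k_hat(m, f_ratio):
    """Unbiased estimate K = 1 - (m - 2)/(mF), valid for m >= 3."""
    if m < 3:
        raise ValueError("preshrinking requires at least 3 predictors")
    return 1 - (m - 2) / (m * f_ratio)

def preshrink(y_hat, y_bar, k):
    """y-tilde = y-bar + K (y-hat - y-bar)."""
    return [y_bar + k * (p - y_bar) for p in y_hat]

# Hypothetical numbers: m = 10 predictors, construction F ratio of 4.0
k = k_hat(10, 4.0)                        # 1 - 8/40 = 0.8
y_hat = [0.1, 0.5, 0.9]
print(preshrink(y_hat, mean(y_hat), k))   # each value moves 20% of the
                                          # way toward the mean 0.5
```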
The pedigree of $\tilde{y}$ is confirmed in Copas (1983b), in that $\tilde{y}$ corresponds exactly to a "shrinkage estimator" in the sense of the term used in the statistical literature. It is proved that, within the assumptions outlined above, $\tilde{y}$ is uniformly better than $\hat{y}$ in the mean squared error sense, i.e.,

$$E(y - \tilde{y})^2 < E(y - \hat{y})^2$$

over validation data $(y, x)$, provided $m \geq 3$, where $m$ is the number of $x$ variables. If $m = 2$, $\hat{K} = 1$ and so preshrinkage has no effect. If $m = 1$, the whole theory breaks down, since the expectations of quantities such as $\hat{K}$ cease to exist (the relevant infinite integrals diverge). In fact, it is shown that for $m = 1$ and $m = 2$ no uniform improvement on least squares is possible. The theory of preshrinking is therefore useful only if there are three or more predictive variables in $x$.

Turning to efficacy, but still in the multiple regression case, the deterioration in correlation is inevitable and cannot be removed by preshrinking. In fact

$$\mathrm{Corr}(y, \tilde{y}) = \mathrm{Corr}(y, \hat{y}),$$

and so the discrimination afforded by $\tilde{y}$ is identical to that of $\hat{y}$. The inevitable decline in correlation is simply due to the fact that in the construction data $\hat{y}$ has knowledge of the actual $y$'s, whereas in validation data it does not. The above theory is immediately extended to predict the validation correlation of $y$ and $\hat{y}$ (or $\tilde{y}$): it is

$$\tilde{R} = \frac{(n - 1)R^2 - m}{(n - m - 1)R},$$

where $R$ is the multiple correlation coefficient in the construction data. Always we have $\tilde{R} < R$. For prediction, the retrospective $R$ is irrelevant; efficacy should be measured by $\tilde{R}$, which on average will be (approximately) the correlation obtained if the predictor ($\hat{y}$ or $\tilde{y}$) were to be validated.
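The formula for $\tilde{R}$ is easy to evaluate; the sketch below is ours, and the illustrative call plugs in the figures reported for the absconding example later in the text ($n = 500$, $m = 22$, $R = 0.322$):

```python
# Validation correlation predicted from the construction-sample fit,
# using the formula in the text.
def validation_r(n, m, r):
    """R-tilde = [(n - 1) R^2 - m] / [(n - m - 1) R]."""
    return ((n - 1) * r ** 2 - m) / ((n - m - 1) * r)

print(round(validation_r(500, 22, 0.322), 3))
```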
A minor point to mention is that $\hat{K}$ can be negative, in which case $\tilde{y}$ inverts the predictions made by $\hat{y}$. However, in the worst case, in which $x$ has no effect ($\beta = 0$), $E(F) > 1$ and so $E(\hat{K}) > 0$. Thus, if $\hat{K}$ is negative, the correlations between $y$ and $x$ are even worse than one would expect from pure random numbers, and it would be apparent that any prediction equation based on $x$ is doomed to failure. The same comment applies to the circumstance that $\tilde{R} < 0$.

The multiple regression model being discussed implicitly assumes that $y$ is a continuous variable. Models for discrete and categorical data are mentioned elsewhere in this paper, including the important case of binary data. Suppose that $y$ is defined to be 1 if an event occurs (success) and 0 if it does not (failure), with the predictive factors $x$ as before. A multiple regression of $y$ on $x$ can still be fitted, with $E(y)$ being interpreted as the probability of success. All the above quantities in shrinkage theory can be calculated in the same way, although their mathematical validity can only be taken as an approximation (but often a reasonable one if $n$ is large and the correlations between $y$ and each $x$ are not too close to 1). The more informative model is logistic regression, for which

$$P(y = 1 \mid x) = \frac{e^{\alpha + \beta' x}}{1 + e^{\alpha + \beta' x}}.$$

The overall significance of a fitted model of this kind is measured by a value of $\chi^2$ ("deviance" in computer output from the statistical package GLIM), and it can be shown that in many practical cases $\chi^2 \approx mF$, where $F$ is the $F$-ratio in an ordinary multiple regression of $y$ on $x$. Thus $\hat{K}$ becomes

$$\hat{K} = 1 - \frac{m - 2}{\chi^2}.$$
Calibration relates to the probability of success rather than to the average value of $y$. A binary predictor is well calibrated if, over all cases in which $\hat{p} = f(x) = p$, say, the proportion of successful cases is in fact $p$. In a large validation sample, this proportion will be expected to be

$$\tilde{p} = \frac{e^{\hat{\alpha} + \hat{K} \hat{\beta}' x}}{1 + e^{\hat{\alpha} + \hat{K} \hat{\beta}' x}}$$

for the same reasons as in the multiple regression case. Thus $\tilde{p}$ is the preshrunk form of the predictor, by analogy with $\tilde{y}$ above.
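A sketch of the logistic version (ours, not the chapter's code): estimate $\hat{K}$ from the model deviance, then damp the fitted linear predictor before converting to a probability. Keeping $\hat{\alpha}$ fixed follows the displayed formula; in practice the intercept may also be recalibrated. The deviance figures are those of the absconding example discussed next.

```python
import math

def k_hat_logistic(m, chi2):
    """K = 1 - (m - 2)/chi2, with chi2 the model deviance on m d.f."""
    return 1 - (m - 2) / chi2

def preshrunk_prob(alpha, beta, x, k):
    """p-tilde = e^(a + K b'x) / (1 + e^(a + K b'x))."""
    s = alpha + k * sum(b * xi for b, xi in zip(beta, x))
    return math.exp(s) / (1 + math.exp(s))

# Absconding study figures: full model m = 22, chi2 = 50.2; reduced
# model m = 4, chi2 = 29.0 (the naive value if selection is ignored)
print(round(k_hat_logistic(22, 50.2), 3), round(k_hat_logistic(4, 29.0), 3))
```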
This is illustrated in a particular application to the problem of predicting the probability of absconding from open borstals, taking into account known social and criminological indicators (using data kindly made available by the Prison Department's Young Offender Psychology Unit, Home Office, England). Here $y = 1$ if the trainee absconded during sentence, $y = 0$ otherwise, and $m = 22$ predictive factors were studied. A logistic regression on $n = 500$ cases gives $\chi^2 = 50.2$ on 22 degrees of freedom, which is highly significant; $\hat{K}$ is 0.602. Calibration was examined by using a nonparametric smoothing method to plot the actual proportion of absconding cases, say $\bar{p}$, against the predicted proportions $\hat{p} = f(x)$; the method is from Copas (1983a). This is shown in Figure 1, in which both axes are on logistic scales. The calibration is satisfactory in the construction data, in that the plotted curve (labeled "construction data") is tolerably close to the diagonal line $\bar{p} = \hat{p}$. A further set of 1,500 cases was then used as validation data and the plotting process repeated. The shrinkage is very marked (Figure 1); the plotted curve is much shallower than the diagonal (large $\hat{p}$'s are overestimated by $\hat{p}$, small $\hat{p}$'s underestimated). The use of $\tilde{p}$ instead of $\hat{p}$ is equivalent to retaining the graph with $\hat{p}$ as the horizontal coordinate, but replacing the diagonal line with a line of slope $\hat{K} = 0.602$, shown as the dashed line. The reasonable fit of the validation curve to the dashed line confirms that the validation calibration of $\tilde{p}$ is satisfactory.

The ordinary multiple correlation coefficient between $y$ and $x$ for these data is $R = 0.322$, whereas the validation correlation discussed above is $\tilde{R} = 0.194$. The substantial shrinkage has almost halved the correlation, the efficacy of the predictor on validation being extremely modest. This magnitude of the drop in correlation is not at all unusual in practice (e.g., Simon, 1971).
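The smoothed calibration plot itself is from Copas (1983a); a crude grouped stand-in can convey the same picture. The sketch below is entirely our own (equal-count binning is our arbitrary choice): bin cases by predicted probability and compare logits of predicted and observed proportions, which would lie on the diagonal for perfect calibration.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def calibration_points(y, p_hat, n_bins=5):
    """Points (logit mean p-hat, logit observed proportion) per bin."""
    pairs = sorted(zip(p_hat, y))
    size = max(len(pairs) // n_bins, 1)
    points = []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        p_mean = sum(p for p, _ in chunk) / len(chunk)
        obs = sum(yv for _, yv in chunk) / len(chunk)
        if 0 < obs < 1:                  # logit undefined at 0 or 1
            points.append((logit(p_mean), logit(obs)))
    return points
```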
The multiple and logistic regression models discussed above are fixed models in the sense that the variables in $x$ are fixed in advance. In practice, prediction equations are often simplified by using stepwise regression or some other procedure for subset selection; the variables in $x$ are then selected using the data, and only those $x$'s showing reasonably strong correlation with $y$ are retained. The usual theory of least squares is, of course, completely upset by such selection. A recent discussion in the Journal of the Royal Statistical Society (…) be used. The formula for shrinkage of the correlation coefficient is modified to

$$\tilde{R} = \frac{(n - 1)R^{*2} - m}{(n - m - 1)R},$$

where $R^*$ is the multiple correlation between $y$ and the selected $x$'s, and, as before, $R$ is the corresponding correlation for all the $x$'s.

Since many $x$'s in the absconding study appeared to be of low predictive value, a subset of just four $x$'s was chosen for the logistic regression, with $\chi^2 = 29.0$ on four degrees of freedom. If selection is ignored, this would give $\hat{K} = 0.931$ (indicating very little shrinkage). For the full logistic regression $\hat{K} = 0.602$, as before. The validation fit of the reduced regression is shown in Figure 2, which was constructed in the same way as Figure 1. As can be seen, the shrinkage is consistent with $\hat{K} = 0.602$ and much greater than that implied by the value $\hat{K} = 0.931$.

FIGURE 1 Shrinkage for absconding study (full regression). Source: Derived from data provided by Prison Department's Young Offender Psychology Unit, Home Office, England.

FIGURE 2 Shrinkage for absconding study (stepwise regression). Source: Derived from data provided by Prison Department's Young Offender Psychology Unit, Home Office, England.

Shrinkage in the Light of Changes in the Population: Cause (b)

The theory expounded so far accommodates cause (a), the purely statistical effect, but assumes that there are no changes in the distribution of $x$ [cf. (b)] or the response function [cf. (c)]. Neither assumption will be exactly true, although each will often hold to a reasonable approximation. In this section we discuss the effect of changes in the population (i.e., in the distribution of $x$) on the validation performance of predictors. We suppose that $x$ has mean $m_1$ in the construction sample and mean $m_2$ in the validation sample, with the variance-covariance matrices $V_1$ and $V_2$ defined in an analogous way. We therefore wish to study the case in which $m_1 \neq m_2$ and/or $V_1 \neq V_2$.
A number of approaches are discussed in turn, corresponding to various ways in which changes in distribution can occur, and to different ways in which the performance of predictors can be assessed. Some of these correspond to well-established results in the statistical literature, others to work in progress.
Wishart Variation. Perhaps the simplest case is to assume that the construction and validation samples are both sampled randomly from the same underlying population. The matrices V1 and V2 will then be independent samples from the same Wishart distribution indexed by the (unknown) true variance-covariance matrix. Similarly, m1 and m2 will be independent with identical multivariate normal distributions. It can be shown that the uniform improvement of the shrinkage predictor over least squares continues to hold in this more general setting, i.e.,

E(y − ỹ)² < E(y − ŷ)²,

where ỹ is the preshrunk predictor, ŷ is the least-squares predictor, and the expectation is over (y, x) in the validation sample, over the distribution of regression parameters, as well as over sampling variation in the m's and V's. The only requirement is, as before, that m ≥ 3. Again, the improvement holds over all possible true regression parameters, no matter what the underlying parameters of the population are. Thus differences in samples caused by sampling variation only do not affect the shrinkage arguments put forward in the last section.
Mathematical Conditions for Uniform Improvement. The Wishart variation case suggests that if m1 − m2 and V1 − V2 are small, shrinkage theory is unaffected. To investigate what happens when these differences are larger, define the matrix
context of a particular study, as will be discussed in a later section of this paper. Two particular applications lend themselves to the monitoring of screening. First, a simulation study can be undertaken in which the prediction equation is fitted to a random subset of the data, and the remaining cases are screened in the appropriate way to form the validation sample. The random sampling of the construction data is repeated a large number of times to obtain expected values of prediction mean squared error or other measures of predictive performance. The second method involves the bootstrap: both construction and validation data are artificially sampled with replacement from the complete set of available data. The method of screening under study is applied to the validation cases before the prediction equation is evaluated. Again, some detailed results are given in Jones and Copas (1985); the general conclusion is similar to that made earlier, namely, that a moderate degree of screening does not usually affect the advantages of the shrinkage correction.
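The first of these monitoring methods can be sketched in a few lines of code. This is our own illustrative construction, not taken from the studies cited: a single least-squares predictor stands in for the prediction equation, the data are synthetic, and the screening rule x > −0.5 is arbitrary.

```python
import random
import statistics

# Sketch of the simulation method: fit the equation to a random subset,
# screen the remaining cases, and average the validation mean squared
# error over many repetitions. All names and data here are illustrative.

random.seed(1)

def make_case():
    x = random.gauss(0, 1)
    return x, 0.8 * x + random.gauss(0, 1)

data = [make_case() for _ in range(400)]

def fit_ols(sample):
    xs, ys = zip(*sample)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = (sum((x - mx) * (y - my) for x, y in sample)
         / sum((x - mx) ** 2 for x in xs))
    return mx, my, b  # predict y by my + b*(x - mx)

mses = []
for _ in range(200):                     # repeated random construction samples
    random.shuffle(data)
    construction, rest = data[:200], data[200:]
    validation = [(x, y) for x, y in rest if x > -0.5]   # screened validation cases
    mx, my, b = fit_ols(construction)
    mses.append(statistics.fmean(
        (y - (my + b * (x - mx))) ** 2 for x, y in validation))

expected_mse = statistics.fmean(mses)    # expected prediction MSE under screening
```

Comparing `expected_mse` with the corresponding unscreened average gives a direct check on whether the screening rule degrades validation performance.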
Shrinkage Correction Adapted for a Change in Population. Comments so far in this section have concerned robustness, i.e., the study of how the preshrunk predictor ỹ performs in the light of changes in the distribution of x. If some particular change in population is envisaged, can the shrinkage correction be designed to take account of it? A reworking of the theory leading to the correction K, explained above, leads to a modified correction K*, which depends on the difference in means m2 − m1 and on tr(V1⁻¹V2) as well as on n and the fitted coefficients. Note that K* = K if V1 = V2. The corresponding form of the preshrunk predictor is

ỹ* = [α̂ + β̂′(m2 − m1)](1 − K*) + K*ŷ.

Unfortunately, the sampling theory of K* and ỹ* is very much more complicated than that of K and ỹ, and optimum mean squared error properties have yet to be proved. Presumably, if m2 − m1 and V2 − V1 are both fairly small, the favorable properties of ỹ will continue to hold, but the situation for large population changes is less clear.
An Adaptive Formulation of Shrinkage Based on Cross-validation. A very different approach is reported in Copas (1984). Here none of the usual assumptions of linear regression is made (e.g., constant variance of residuals); instead a shrinkage correction K** is estimated directly from the available construction data. Following the sample-reuse approach mentioned above, the sampling distribution of the empirical slope of y on ŷ for randomly chosen subsets of the data is studied mathematically, and an asymptotic approximation to the expected shrinkage is thereby obtained. The form of this approximation is applied to the whole set of data, giving the nonparametric shrinkage correction K**. It is shown in Copas (1984) that, as expected, K** is equal to K if and only if the usual assumptions of the underlying model hold. The correction K** is most sensitive to heteroskedasticity of the residuals; K** can shrink more or less than K according to the particular observed pattern of model residuals. Case studies carried out using this new approach suggest that only exceptionally will K** differ markedly from K, and the validation properties of the corresponding nonparametric shrinkage predictor will often be rather similar to those of ỹ.
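The cross-validation idea behind K** can be illustrated with a toy version. This is our own simplified sketch, not Copas's exact estimator: a point-scoring-style fit (one marginal coefficient per x) stands in for the prediction equation, and the empirical slope of y on ŷ in randomly chosen held-out subsets estimates the expected shrinkage.

```python
import random
import statistics

# Toy illustration: with many noisy predictors the held-out slope of
# observed y on predicted y-hat falls below 1, and its average over
# random splits is a nonparametric shrinkage estimate.

random.seed(2)
M = 10  # predictors; only the first carries any signal

def make_case():
    xs = [random.gauss(0, 1) for _ in range(M)]
    return xs, 0.5 * xs[0] + random.gauss(0, 1)

data = [make_case() for _ in range(160)]

def fit(sample):
    """Point-scoring-style fit: one marginal coefficient per x."""
    my = statistics.fmean(y for _, y in sample)
    mxs, coefs = [], []
    for j in range(M):
        col = [(xs[j], y) for xs, y in sample]
        mx = statistics.fmean(x for x, _ in col)
        mxs.append(mx)
        coefs.append(sum((x - mx) * (y - my) for x, y in col)
                     / sum((x - mx) ** 2 for x, _ in col))
    return my, mxs, coefs

def predict(xs, my, mxs, coefs):
    return my + sum(b * (x - mx) for b, x, mx in zip(coefs, xs, mxs))

def slope(pairs):
    """Least-squares slope of observed y on predicted y-hat."""
    mp = statistics.fmean(p for p, _ in pairs)
    mo = statistics.fmean(y for _, y in pairs)
    return (sum((p - mp) * (y - mo) for p, y in pairs)
            / sum((p - mp) ** 2 for p, _ in pairs))

slopes = []
for _ in range(100):
    random.shuffle(data)
    train, held_out = data[:120], data[120:]
    my, mxs, coefs = fit(train)
    slopes.append(slope([(predict(xs, my, mxs, coefs), y) for xs, y in held_out]))

k_hat = statistics.fmean(slopes)  # empirical shrinkage; below 1 when the fit overstates
```

Here `k_hat` plays the role of the empirical correction: multiplying centered predictions by it recalibrates them for new data without assuming constant residual variance.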
Changes in the Regression Relationship - Cause (c)

It is obvious that if the relationship between y and the x's changes dramatically between construction and validation data, the shrinkage will be equally dramatic and nothing in the way of useful prediction will be possible. Conversely, minor changes in the coefficients α and β1, β2, ... will result in only small changes in predictive performance, and ỹ can still be regarded as an adequate approximation. Little work has been done in studying the effects of changes of intermediate size. As in the discussion of cause (b) in the previous section, if something is known in advance about the likely changes, corresponding modifications to the prediction equation can be made (e.g., if a 10 percent rise or fall in values of y is anticipated). However, such circumstances will occur rarely, if ever, and so this remains an open research problem.
Some Concluding Remarks

We conclude this discussion of validation and shrinkage with a few comments that may help in formulating guidelines on the choice of prediction equation in any given application.

First, a simple method shrinks less than a complex one. (This can be seen in the above algebra by noting that the denominator of K exceeds the numerator by mσ²/n on average; this quantity increases as m, the number of variables in the equation, increases.) However, this is not so when a preshrinking correction is applied; provided the model and assumptions hold true, a preshrunk predictor is always approximately well calibrated. Thus the argument that a simple model (e.g., point scoring) is preferable to a more complicated one (e.g., multiple regression) because of shrinkage effects alone cannot be sustained. Proper statistical principles should be used in assessing the fit between a given model and the data; any shrinkage problems that arise are allowed for by preshrinking rather than by distorting the model being fitted.
Second, in selecting from among several x variables using a stepwise procedure, it is often supposed that a small subset is better than a large one because the smaller number of coefficients causes less shrinkage. In general this argument is false. As explained above, the empirical selection effect itself leads to an increase in shrinkage. Again, a larger subset, with appropriate preshrinking correction, is better than an artificially small set with its own shrinkage correction. Usually, however, very little is gained by the later variables entering a stepwise regression procedure, and so on the grounds of simplicity, with little loss of efficacy, a sensible subset (with preshrinking) will nearly always be used in the final prediction equation. For example, in the absconding study mentioned above, there is little basis for choosing on statistical grounds between the fits with the total of 22 x's and with a subset of just 4 x's (Figures 1 and 2).
Third, caution is needed if a prediction equation is to be applied outside the range of the construction data. The new theory of robustness to changes in the distribution of the x's, outlined above, suggests that modest changes can be tolerated within the framework of the same preshrinking method. However, if very marked changes are anticipated, or if erratic changes in the model are likely to occur, no prediction equation can be expected to work well. These circumstances are perhaps the only ones in which oversimplified methods (e.g., Glueck) can be justified on the grounds of robustness, but a clear formulation of such properties would be difficult.
Fourth, a prediction equation is essentially a statement of conditional expectation: if the x's are such and such, then the expectation of y is estimated to be such and such. In reality no particular model is exactly correct, and so an argument that one set of x's is "right" and another is "wrong" has no logical basis. One can imagine values of the response variable (y) and the explanatory variables (x's) being distributed jointly in some space, each subset of x's, and each particular model, providing a separate form of conditional expectation of y. Choosing a prediction equation involves choosing which conditional expectation is closest to the actual values of y (has least conditional variance), such a choice being made over whatever set of candidates is available. It may be that y is most closely correlated with an x that cannot actually be used in routine prediction, and so no subset containing such an x can be entertained. Typically, the best subsets or models will be ones that act as the best proxies to the prohibited x. Such equations may do less well than others involving the sensitive variable, but they cannot be discredited on statistical grounds alone.
Practical Utility

Predictive Power

Our starting point in this section is the familiar "risk classification," which compares predicted and actual outcomes. This approach to assessing the utility of different prediction instruments is completely different from (yet complementary to) that discussed in the previous section.
Risk classes can be defined as the range of the predicted probability of some event (e.g., k1 = 0 to <0.1, k2 = 0.1 to <0.2, et cetera); as a score, such as the Salient Factor Score calculated in parole prediction research (D. M. Gottfredson, Wilkins, and Hoffman, 1978); or by some other classification, such as low-, medium-, and high-rate offenders, as in Greenwood's (1982) study of criminal careers. The example adopted here to illustrate and develop the discussion of predictive power is taken from Copas and Whiteley (1976) as it was subsequently used by Tarling (1982) to show the relationship between various measures.

Copas and Whiteley's aim was to construct a prediction instrument to evaluate the effects of therapeutic treatment at the Henderson Hospital. The criterion of success was taken to be no further admission to a psychiatric hospital or no further conviction for a criminal offense during the 2 to 3 years following release. Table 1 sets out the results for their construction and validation samples.
Several summary statistics have been proposed to measure the predictive power of this and similar risk classifications, in particular mean cost rating (MCR) (Duncan et al., 1953) and P(A), the area under the receiver operating characteristic curve in signal detection theory (Fergusson, Fifield, and Slater, 1977). However, as the risk classification in Table 1 can be regarded as an ordered contingency table, Kendall's rank correlation coefficient τc (Kendall, 1970) and Goodman and Kruskal's gamma, γ (Goodman and Kruskal, 1963), can also be used to measure the degree of association. There is as yet no consensus about the measure to be adopted, but Tarling (1982) has in fact shown that all four measures are related because all are functions of the statistic S (where S = P − Q, P being the number of "concordant pairs" and Q the number of "discordant pairs").

TABLE 1 Predicted Success and Observed Outcome, Construction and Validation Samples

                                Construction Sample            Validation Sample
Risk       Probability       Success  Failure  Total        Success  Failure  Total
Class (ki) of Success (p)      (si)     (fi)    (ti)          (si)     (fi)    (ti)
k1         0 to .3               5       33      38             7       18      25
k2         .3 to .5              7       12      19            14       15      29
k3         .5 to .7             21       12      33            12        9      21
k4         .7 to 1.0            11        3      14             8        4      12
Total                      Ns = 44  Nf = 60  T = 104      Ns = 41  Nf = 46  T = 87
                           MCR = .57                      MCR = .28
                           P(A) = .78                     P(A) = .64
                           τc = .55                       τc = .28
                           γ = .71                        γ = .40

SOURCE: Copas and Whiteley (1976) data as used by Tarling (1982).

Expressing each as a function of S and using the notation of Table 1, the four measures can be defined as:

MCR = S / (Ns·Nf)

P(A) = 1/2 + S / (2·Ns·Nf)

τc = 4S / T²

γ = S / (P + Q)
Two advantages follow from knowing that all four measures are functions of S. First, by calculating S the calculation of all four measures is greatly simplified. Second, as the distribution of S has long been known, a test of the null hypothesis E(S) = 0 is a test that prediction is no better than chance.
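These relations are easy to verify numerically. The sketch below is our own code (the helper name `concordance_measures` is hypothetical); it recomputes S and the four measures from the construction-sample counts of Table 1, using the two-column form τc = 4S/T² quoted above.

```python
# Compute S = P - Q and the four related measures for an ordered
# risk classification such as Table 1.

def concordance_measures(rows):
    """rows: (successes, failures) per risk class, ordered from the
    lowest to the highest predicted probability of success."""
    P = Q = 0
    for i, (s_hi, f_hi) in enumerate(rows):
        for s_lo, f_lo in rows[:i]:
            P += s_hi * f_lo  # concordant: success in higher class, failure in lower
            Q += f_hi * s_lo  # discordant: failure in higher class, success in lower
    S = P - Q
    Ns = sum(s for s, _ in rows)
    Nf = sum(f for _, f in rows)
    T = Ns + Nf
    return {"S": S,
            "MCR": S / (Ns * Nf),
            "P(A)": 0.5 + S / (2 * Ns * Nf),
            "tau_c": 4 * S / T ** 2,
            "gamma": S / (P + Q)}

construction = [(5, 33), (7, 12), (21, 12), (11, 3)]  # k1..k4 of Table 1
m = concordance_measures(construction)
# rounds to MCR .57, P(A) .78, tau_c .55, gamma .71, as printed under Table 1
```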
The measures τc and γ have a further advantage over MCR and P(A) in that the variance of both can be estimated, thus permitting tests of alternative hypotheses and facilitating comparison of alternative prediction instruments or their respective power in the construction and validation samples. For τc, however, only an upper bound to the variance is available, so only a conservative test for the difference of two observed values is possible. On the other hand, the exact value of the variance of γ is available (Goodman and Kruskal, 1963), which permits a more powerful test. For this reason Tarling (1982) recommended that γ should generally be preferred.
Prediction Errors

The four measures discussed above are still only indicators of overall fit and give just an indirect assessment of how a prediction instrument will perform in practice. It is essential, therefore, to calculate the number or proportion of correct and incorrect predictions that would result from the application of any rule.

Given the discussion of overfitting and shrinkage in the previous section, estimates should be derived from a validation sample. Before applying the Copas and Whiteley instrument to identify likely successes, a cutoff point must be chosen. From the risk classification, as it is presented above, there are three possible cutoff points: all subjects with a predicted probability of success of .7 or above; all those with a predicted probability of .5 or above; and all those with a predicted probability of .3 or above.
Figure 3 shows, for each cutoff point in the validation sample, the following:

1. the number of true positives (TP), that is, the number of subjects predicted to succeed who did in fact succeed;
2. the number of false positives (FP), that is, the number of subjects predicted to succeed who in fact failed;
3. the number of false negatives (FN), that is, the number of subjects predicted to fail who in fact succeeded; and
4. the number of true negatives (TN), that is, the number of subjects predicted to fail who did in fact fail.
The two marginal distributions of these tables are usually defined as the base rate and the selection ratio. The base rate (or
A: Cutoff point .7 and above

                     Actual Outcome
Predicted Outcome    Success    Failure
Success              TP =  8    FP =  4
Failure              FN = 33    TN = 42

Ns = 41, Nf = 46; NPs = 12, NPf = 75
Base rate = .471; Selection ratio = .138

B: Cutoff point .5 and above

                     Actual Outcome
Predicted Outcome    Success    Failure
Success              TP = 20    FP = 13
Failure              FN = 21    TN = 33

Ns = 41, Nf = 46; NPs = 33, NPf = 54
Base rate = .471; Selection ratio = .379

C: Cutoff point .3 and above

                     Actual Outcome
Predicted Outcome    Success    Failure
Success              TP = 34    FP = 28
Failure              FN =  7    TN = 18

Ns = 41, Nf = 46; NPs = 62, NPf = 25
Base rate = .471; Selection ratio = .713

FIGURE 3 Correct predictions and errors for each cutoff point.
the prevalence or the incidence) is the proportion of the sample that actually succeeded. It can be seen that this is the same for all three cutoff points (i.e., 47.1 percent). The second marginal distribution, the selection ratio, is the proportion of the sample predicted to succeed. It can be seen that the selection ratio changes depending on the cutoff point: it is 13.8 percent when the cutoff point is set at .7 and above, 37.9 percent when the cutoff point is set at .5 and above, and 71.3 percent when the cutoff point is set at .3 and above.

Defining the base rate and the selection ratio in terms of the four outcomes:

Base rate, BR = (TP + FN)/T, and (1 − BR) = (FP + TN)/T

Selection ratio, SR = (TP + FP)/T, and (1 − SR) = (FN + TN)/T

where T = total sample = TP + FP + FN + TN.

Considering the relationship between the base rate and the selection ratio reveals several interesting properties. Since SR − BR = (FP − FN)/T, when the selection ratio is larger than the base rate, false positives exceed false negatives; conversely, when the base rate is larger than the selection ratio, false negatives exceed false positives. When the base rate equals the selection ratio, the number of false positives and false negatives is the same. Furthermore, when both the base rate and the selection ratio equal .5, prediction becomes most accurate and results in fewest total errors (FP + FN). However, when the base rate (which is fixed) is not .5, as is often the case in practice, total errors are minimized when the selection ratio is set to equal the base rate. These phenomena are revealed in Figure 3 and can be used to guide the choice of the appropriate cutoff point.
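The identity SR − BR = (FP − FN)/T behind these properties can be checked directly against the three panels of Figure 3. The sketch below is our own code over the validation-sample counts quoted above.

```python
# Validation-sample counts from Figure 3, keyed by cutoff point.
tables = {
    ".7": (8, 4, 33, 42),    # TP, FP, FN, TN
    ".5": (20, 13, 21, 33),
    ".3": (34, 28, 7, 18),
}

for cutoff, (TP, FP, FN, TN) in tables.items():
    T = TP + FP + FN + TN              # 87 for every panel
    BR = (TP + FN) / T                 # base rate, fixed at 41/87 = .471
    SR = (TP + FP) / T                 # selection ratio, varies with the cutoff
    # the key identity: SR - BR = (FP - FN)/T
    assert abs((SR - BR) - (FP - FN) / T) < 1e-12
    print(f"cutoff {cutoff}: BR={BR:.3f} SR={SR:.3f} total errors={FP + FN}")
```

With these counts the .5 cutoff, whose selection ratio (.379) lies closest to the base rate (.471), gives the fewest total errors (34 against 37 and 35), in line with the rule stated above.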
Dunn (1981) sets out the various measures that can be derived from the kind of information presented in Figure 3, for example, sensitivity and specificity, but they are not discussed in any detail here. Loeber and Dishion (1982, 1983) also discuss the significance of the base rate and the selection ratio. They point out that the base rate and the selection ratio determine the maximum number of correct predictions that could be achieved by the prediction instrument but, further, that a certain number of correct predictions could be expected by chance alone. Loeber and Dishion therefore propose a measure, relative improvement over chance (RIOC), which attempts to assess how an instrument performs relative to its expected performance and its best possible performance given the base rate and the selection ratio.
They define RIOC as:

RIOC = (AC − RC) / (MC − RC)

where AC = actual number of correct predictions, RC = randomly expected number of correct predictions, and MC = maximum possible number of correct predictions. In the notation of Figure 3 it can be seen that

AC = TP + TN

RC = [(TP + FN)(TP + FP) + (FP + TN)(FN + TN)] / T

MC = TN + TP + 2·min(FN, FP).
Substituting for AC, RC, and MC in the above equation, RIOC reduces to:

RIOC = (TP·TN − FP·FN) / {[TP + min(FN,FP)][TN + min(FN,FP)]}

From the relationships presented earlier, RIOC can also be expressed in terms of the base rate and the selection ratio. Substituting in the denominator, RIOC reduces to:

RIOC = (TP·TN − FP·FN) / {T²[min(BR,SR) − BR·SR]}

A commonly used measure of association for 2 × 2 classifications such as Figure 3 is φ, which is the product moment correlation coefficient for dichotomous variables. In the notation of Figure 3,

φ = (TP·TN − FP·FN) / [(TP + FP)(TP + FN)(FP + TN)(FN + TN)]^1/2

Expressing the denominator in terms of BR and SR, φ reduces to:

φ = (TP·TN − FP·FN) / [T²(BR·SR − BR·SR² − BR²·SR + BR²·SR²)^1/2]
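As a check, both reductions can be verified numerically on panel B of Figure 3 (cutoff .5 and above, validation sample); the sketch is our own code.

```python
import math

# Panel B of Figure 3 (cutoff .5 and above), validation sample.
TP, FP, FN, TN = 20, 13, 21, 33
T = TP + FP + FN + TN

# RIOC from its definition ...
AC = TP + TN
RC = ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) / T
MC = TP + TN + 2 * min(FN, FP)
rioc_def = (AC - RC) / (MC - RC)

# ... and from the reduced closed form
m = min(FN, FP)
rioc_reduced = (TP * TN - FP * FN) / ((TP + m) * (TN + m))
assert math.isclose(rioc_def, rioc_reduced)

# phi from the cell counts and from BR and SR
phi = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (FP + TN) * (FN + TN))
BR, SR = (TP + FN) / T, (TP + FP) / T
phi_from_rates = (TP * TN - FP * FN) / (
    T ** 2 * math.sqrt(BR * SR * (1 - BR) * (1 - SR)))
assert math.isclose(phi, phi_from_rates)
```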
The relationship between RIOC and
say, rather than to minimize total errors, we could have used the approach outlined there to guide our choice of cutoff point. However, decision theory provides a more direct framework for taking into account the weights to be attached to different types of outcome. Although the decision-theory approach has been widely advocated in criminological applications (e.g., Loeber and Dishion, 1983), it has not been used to any great extent, except by Blumstein, Farrington, and Moitra (1985). While it is outside the scope of this paper to discuss decision theory in any detail, we would recommend that more attention be paid to it in prediction research, especially when the results are to be applied in practice.
SAMPLE-REUSE METHODS
Previous sections of the paper have stressed the distinction between retrospective fit and prospective (or validation) fit of a prediction instrument. A simple way of carrying out a prospective validation, and the one most commonly used in criminology, is the split-half method, which divides the data into two halves (at random). The equation is fitted to the first half (the construction sample) and tested on the second (the validation sample). Although unbiased estimates of shrinkage and error rates result from this method, there are two obvious disadvantages: (a) construction of the prediction instrument does not use all available information, but only half the sample, and (b) the comparability of the two subsamples will always be open to doubt; for example, there is a 1-in-20 chance that the two subsamples will be significantly different at the 5 percent level. Various techniques have been developed in the statistical literature to overcome these two problems. The principle underlying them is to generate many subsamples rather than merely two.

The first, simple extension of the principle is cross-validation, of which the split-half method is merely a special case. To construct and validate the prediction instrument, the sample need not be split in half but could, instead, be split in many different ways; for example, 80 percent of the sample could be used for the construction sample and the remaining 20 percent could form the validation sample. Moreover, any number of construction and validation subsamples could be drawn. The jackknife and the bootstrap techniques are more formal developments of this latter idea. The jackknife (see, for example, R. G. Miller, 1974), or "hold-one-out," proceeds as follows. Suppose the sample has N members; delete one member, develop the prediction instrument on the remaining N − 1, and use it to predict y for the missing member. The procedure is repeated N times, a different member being omitted each time. By this means a set of independent values of y and ŷ are obtained, and shrinkage and error rates can be calculated using the methods presented earlier as if these values related to a completely new sample of N cases.²

The bootstrap technique (Efron, 1982) proceeds slightly differently. If sampling with replacement is permitted, a large number of samples of size N can be drawn, 2^N as opposed to only N by the jackknife procedure. The bootstrap replications can be used to assess the prediction instrument. The method is illustrated by an example given in Efron and Gong (1981, 1983) that is analogous to many criminological prediction studies. Efron and Gong were concerned to construct an instrument to predict whether patients

²These ideas can be extended to other problems relevant to the construction of prediction instruments; Mabbett, Stone, and Washbrook (1980), for instance, consider the stepwise choice of variables in forming a binary predictor.
suffering from acute hepatitis would survive or die. There were 155 patients in the sample, 33 of whom died. There were 19 independent variables available for analysis. A prediction instrument was developed in the usual way. First, only x variables associated at the 5 percent level were retained; this left 13 variables. Second, a kind of forward stepwise multiple-logistic-regression program was used, stopping when no additional variable achieved the 5 percent significance level. Four of the 13 variables were included in the final prediction instrument. The cutoff point c was set at c = log 33/122. Full information was available for 133 of the original 155 patients. When the prediction instrument was applied to the 133 patients, 21 were misclassified, giving an error rate of 21/133 = .158. The bootstrap technique was then used to assess how overoptimistic this error rate was, or how much it could be expected to shrink. Five hundred bootstrap samples were drawn and the same procedure was used to construct a prediction instrument. On each occasion the "overoptimism random variable," R', was calculated, which is merely the error rate for the bootstrap replication minus .158. The 500 values of R' were plotted, and the mean of R' was found to be .045, which suggests that the expected overoptimism is about one-third as large as the apparent error rate .158. This gives the bias-corrected estimated error rate .158 + .045 = .203. In addition, the standard deviation of R' was .036. Another advantage of the bootstrap technique is illustrated by this example. At each replication a check was made of the variables included in the prediction instrument, and this revealed, for example, that one variable was selected 37 percent of the time, another 59 percent of the time, and so on, giving an intuitive, if not theoretically rigorous, indication of the importance of the various predictor variables.
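The optimism calculation can be sketched in a few lines. This is our own minimal version with synthetic two-group data and a deliberately crude midpoint classifier; nothing here reproduces the hepatitis study itself.

```python
import random
import statistics

# Bootstrap optimism sketch: the overoptimism R' of each replication is
# its error on the full sample minus its apparent error on the bootstrap
# sample; the mean of R' corrects the apparent error rate.

random.seed(3)
data = [(random.gauss(mu, 1), label)
        for label, mu in ((0, 0.0), (1, 0.7)) for _ in range(75)]

def train(sample):
    """Classify by the midpoint between the two class means."""
    m0 = statistics.fmean(x for x, c in sample if c == 0)
    m1 = statistics.fmean(x for x, c in sample if c == 1)
    cut = (m0 + m1) / 2
    return (lambda x: int(x > cut)) if m1 > m0 else (lambda x: int(x < cut))

def error_rate(rule, sample):
    return statistics.fmean(rule(x) != c for x, c in sample)

apparent = error_rate(train(data), data)   # error of the rule on its own data

optimism = []
for _ in range(500):
    boot = [random.choice(data) for _ in data]   # resample N cases with replacement
    rule = train(boot)
    optimism.append(error_rate(rule, data) - error_rate(rule, boot))

corrected = apparent + statistics.fmean(optimism)  # bias-corrected error estimate
```

Recording which variables a stepwise fit selects on each replication, as Efron and Gong did, only requires storing the chosen subset alongside each value of R'.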
Technical details of sample-reuse methods are given in Efron (1982), and simplified descriptions appear in Diaconis and Efron (1983) and Efron and Gong (1983). Comparing and contrasting the various methods, split-half or cross-validation methods are the simplest to perform but have certain limitations. The advent of computer power and the increasing availability of appropriate algorithms make the jackknife and the bootstrap methods more attractive and relatively easy to use. The jackknife and the bootstrap are in fact theoretically closely related: the jackknife is almost a bootstrap itself. The bootstrap is entirely nonparametric and is, therefore, more flexible. Efron (1982) suggests that the jackknife performs less well than the bootstrap in situations that he has investigated, but it requires less computation. The close relation between sample-reuse methods and Copas's theory of shrinkage and validation was discussed earlier.
CONCLUSIONS

At the beginning of this paper we showed how simple point-scoring methods could be incorporated within the framework of general linear models, along with regression, logistic regression, and log-linear models. In addition, we noted that point-scoring methods, reconceptualized in the way we suggest, permit certain extensions that have been found useful in medical diagnosis.

It has long been recognized and empirically demonstrated that a prediction instrument developed on one sample will perform less well when applied to a subsequent sample. The phenomenon of shrinkage has recently been subjected to rigorous theoretical investigation, which we outlined. The findings stemming from this work enable the researcher to understand and anticipate the degree of shrinkage that can be expected in any study and, where necessary, to make any adjustments to (or preshrink) the prediction equation.
To examine shrinkage in practice, researchers have tended to use split-half subsamples. We pointed out the range of other and superior "sample-reuse" methods, including the jackknife and the bootstrap.

The usefulness of a prediction instrument can also be gauged by the number of errors and correct decisions that result from its application. We pointed out the similarity between many of the indices that have been proposed to assess the utility of a risk classification. In addition, we showed the importance of the base rate and the selection ratio in determining false-positive and false-negative errors and how the selection ratio can be set to alter the balance between the two.

When predicting rare events it may be the case that any prediction instrument will not improve significantly over the base rate. For example, a prediction instrument developed to identify "dangerous" offenders may result in more errors than occur by merely classifying all offenders as not dangerous. This has led some commentators to eschew attempts to predict these kinds of events. An analogous situation occurs in medical science, where mass-screening programs are costly and may result in large false-positive errors, causing considerable stress, but where they are nevertheless considered to be worthwhile to detect the small number of true positives who actually have the rare disease. Therefore, the worth of any prediction instrument depends on the values to be attached to the various outcomes emanating from its application, not simply on the total number of errors that may accrue. Decision theory provides a framework for making these assessments and could be used more widely in prediction in criminology.
REFERENCES
Blumstein, A., Farringon, D. P., and Moitra, S.
1985 Delinquent careers: innocents, desisters and
persisters. Pp. 187219 in M. Tonry and N.
Morris, eds., Crime and Justice. Vol. 6. Chi
cago, Ill.: University of Chicago Press.
Bottoms, A. E., and McClintock, F. H.
1973 Criminals Coming of Age. London, En
gland: Heinmann.
Brown, P. J., and Zidek, J. V.
1980 Adaptive multivariate ridge regression. An
nals of Statistics 8:6474.
Burgess, E. W.
1928 Factors determining success or failure on
parole. In A. A. Bruce, A. J. Harno, E. W.
Burgess, and J. Landesco, eds., The Work
ings of the IndeterminateSentence Law and
the Parole System in Illinois. Springfield,
Ill.: Illinois State Board of Parole.
Copas, J. B.
1983a Plotting p against x. Applied Statistics
32:2~31.
1983b Regression, prediction and shrinkage (with
discussion). Journal of the Royal Statistical
Society, Series B 45:311~54.
1984 Crossvalidation Shrinkage of Regression
Predictors. Research Report, Department of
Statistics. Birmingham, England: University
of Birmingham.
Copas, J. B., and Whiteley, J. S.
1916 Predicting success in the treatment of psy
chopaths. British Journal of Psychiatry
129:388~392.
Diaconis, P., and Efron, B.
1983 Computerintensive methods in statistics.
Scientific American 248~51:9~108.
Duncan, O. D., Ohlin, L. E., Reiss, A. J., and
Stanton, [I. R.
1953 Formal devices for making selection deci
sions. American Journal of Sociology
58:57~584.
Dunn, C. S.
1981 Prediction problems and decision logic in
longitudinal studies of delinquency. Crimi
nalJustice and Behavior 8:439~76.
Efron, B.
1982 The Jackknife, the Bootstrap and Other
Resampling Plans. Philadelphia, Pa.: Society
for Industrial and Applied Mathematics.
Efron, B., and Gong, G.
1981 Statistical Theory and the Computer. Un
published manuscript. Department of Statis
tics, Stanford University, Calif.
1983 A leisurely look at the bootstrap, the jack
knife, and crossvalidation. American Statis
tician 37(1~:36~8.
OCR for page 291
SOME METHODOLOGICAL ISSUES IN MAKING PREDICTIONS
Farrington, D. P., and Tarling, R., eds.
1985 Prediction in Criminology. Albany, N.Y.:
SUNY Press.
Fergusson, D. M., Fifield, J. K., and Slater, S. W.
1977 Signal detectability theory and the evalua
tion of prediction tables.Journal of Research
in Crime and Delinquency 14:237246.
Fielding, A.
1979 Binary segmentation. In C. A. O'Muirchear
taigh and C. Payne, eds., Exploring Data
Structure. Vol. 1 of The Analysis of Survey
Data. London, England: John Wiley.
Glueck, S., and Glueck, E. T.
1950 Unraveling Juvenile Delinquency. Carn
bridge, Mass.: Harvard University Press.
Goodman, L. A., and Kruskal, W. H.
1963 Measures of association for cross classifica
tions III. Journal of the American Statistical
Association 58:310 364.
GottEredson, D. M., Wilkins, L. T., and Hoffman
P. B.
1978 Guidelines for Parole and Sentencing
Lexington, Mass.: Lexington Books.
Gottfredson, S. D., and Gottfredson, D. M.
1985 Screening for risk among parolees. Pp. 54-77 in D. P. Farrington and R. Tarling, eds., Prediction in Criminology. Albany, N.Y.: SUNY Press.
Greenwood, P. W.
1982 Selective Incapacitation. Santa Monica, Calif.: Rand Corporation.
Jones, M. C., and Copas, J. B.
1985 On the Robustness of Shrinkage Predictors in Regression: Exemplifying and Using the Theory. Research report. Department of Statistics, University of Birmingham, England.
In press On the Robustness of Shrinkage Predictors in Regression: Some Theoretical Considerations. Journal of the Royal Statistical Society, Series B 48.
Kendall, M. G.
1970 Rank Correlation Methods. London: Griffin.
Loeber, R., and Dishion, T. J.
1982 Strategies for Identifying At-Risk Youths. Unpublished report. Oregon Social Learning Center, Eugene.
1983 Early predictors of male delinquency: a review. Psychological Bulletin 94:68-99.
Mabbett, A., Stone, M., and Washbrook, J.
1980 Cross-validatory selection of binary variables in differential diagnosis. Applied Statistics 29:198-204.
Miller, A. J.
1984 Selection of subsets of regression variables (with discussion). Journal of the Royal Statistical Society, Series A 147:389-425.
Miller, R. G.
1974 The jackknife: a review. Biometrika 61(1):1-15.
Nuttall, C. P., et al.
1977 Parole in England and Wales. Home Office Research Study No. 38. London, England: Her Majesty's Stationery Office.
Simon, F. H.
1971 Prediction Methods in Criminology. Home Office Research Study No. 7. London, England: Her Majesty's Stationery Office.
Tarling, R.
1982 Comparison of measures of predictive power. Educational and Psychological Measurement 42:479-487.
Tarling, R., and Perry, J. A.
1985 Statistical methods in criminological prediction. Pp. 210-231 in D. P. Farrington and R. Tarling, eds., Prediction in Criminology. Albany, N.Y.: SUNY Press.
Titterington, D. M., Murray, G. D., Murray, L. S., Spiegelhalter, D. J., Skene, A. M., Habbema, J. D. F., and Gelpke, G. J.
1981 Comparison of discrimination techniques applied to a complex data set of head injured patients. Journal of the Royal Statistical Society, Series A 144:145-175.
Wilbanks, W. L.
1985 Predicting failures on parole. Pp. 78-94 in D. P. Farrington and R. Tarling, eds., Prediction in Criminology. Albany, N.Y.: SUNY Press.