Read "Criminal Careers and "Career Criminals,": Volume II" at NAP.edu


Suggested Citation:"7. Some Methodological Issues in Making Predictions." National Research Council. 1986. Criminal Careers and "Career Criminals,": Volume II. Washington, DC: The National Academies Press. doi: 10.17226/928.



7
Some Methodological Issues in Making Predictions

John B. Copas and Roger Tarling

[John B. Copas is professor of statistics, University of Birmingham, England; Roger Tarling is deputy head, Home Office Research and Planning Unit, England.]

Methodological considerations are central to all quantitative or actuarial predictions, although each particular prediction study invariably presents its own special issues. At its most general level, a prediction study investigates the extent to which criterion measures (the dependent variables) can be predicted by one or more measures of other factors (the predictor or independent variables). It is outside the scope of this paper to discuss all the important methodological steps in the process: the selection and measurement of appropriate information; the choice of statistical method; the practical application of a prediction instrument and its utility. Instead, we concentrate on four aspects. First, we examine in detail the Burgess and Glueck point-scoring methods, which have been used extensively in criminological prediction. Second, we consider the important topic of validating and calibrating the prediction instrument. Third, we review the various measures that have been proposed to assess an instrument's predictive power. Fourth, we describe methods for reusing samples to carry out a prospective validation. At each stage we attempt to synthesize some of the previous work in the area and present the results of our more recent statistical and methodological research.

POINT-SCORING METHODS

A variety of statistical methods have been used to construct prediction instruments. Chief among them are the Burgess and Glueck point-scoring methods, multiple regression, log-linear methods, and logistic regression. In addition, various clustering, classification, and segmentation techniques have been used.
(The latter group of techniques are not discussed here; see Fielding, 1979; Tarling and Perry, 1985. [1]) For examples of the application of all these methods in criminological research, see the studies included in Farrington and Tarling (1985). Invariably, these methods have been used in studies in which the dependent variable is binary (e.g., reconvicted/not reconvicted).

[Footnote 1: The statistical methods listed above have severe limitations for much criminal career research, especially when the dependent variable is not binary and the focus of interest is on the time interval to some event, for example, the next offense. We would suggest that alternative statistical methods, stochastic point-process models, and failure-rate regression models are more appropriate in these situations and should receive more attention from criminologists.]

Many criminologists have found that simple point-scoring methods are more efficient or robust than more sophisticated methods and shrink less when applied to a validation sample. This seems especially so when the data contain measurement errors or "noise" (S. D. Gottfredson and Gottfredson, 1985; Wilbanks, 1985). This finding, plus the fact that point-scoring methods are simple in conception and administratively easy to use, has led to their being adopted in practice, particularly in studies of parole and sentencing decision making (D. M. Gottfredson, Wilkins, and Hoffman, 1978; Nuttall et al., 1977). However, some commentators have said that point-scoring methods are intolerably crude, have no statistical foundations, and do not result in any direct probabilistic interpretation. In this section we explore point-scoring methods to see if we can resolve some of these tensions and anomalies. In addition, we show how point-scoring methods, reconceptualized in the way we recommend, can be extended.

There are two basic point-scoring methods, one ascribed to Burgess (1928) and the other to Glueck and Glueck (1950). In the Burgess method each subject is given a score of either 0 or 1 on each of a number of predictors, depending on whether the subject falls into a category with a below- or above-average success rate. The Glueck method is more sophisticated in that, instead of contributing a score of 0 or 1, each category of each predictor is weighted according to the percentage of subjects in that category who are successes. The Glueck method can be applied to polychotomous independent variables, but in practice it has only been used for binary predictors. We keep to this simpler situation in our discussion.

Both the Burgess and the Glueck methods have their parallels in standard statistical theory: the "independence Bayes method." First, consider the Burgess method. Let $x_i$ be a series of binary predictive factors, let $q$ be the overall success rate, and suppose that within the success (S) and failure (F) groups separately, the factors $x_i$ are statistically independent of each other. Let $h_i = P(x_i = 1 \mid S)$ and $g_i = P(x_i = 1 \mid F)$. Assume the $x_i$'s are coded such that $h_i > g_i$. Then, by Bayes theorem,

$$P(S \mid x_i = 1) = \frac{h_i q}{p_i}, \qquad P(S \mid x_i = 0) = \frac{(1 - h_i)\, q}{1 - p_i},$$

where $p_i = P(x_i = 1) = h_i q + g_i (1 - q)$. By independence and Bayes theorem again,

$$\frac{P(S \mid x)}{P(F \mid x)} = \frac{q}{1 - q} \cdot \frac{\prod_i P(x_i \mid S)}{\prod_i P(x_i \mid F)},$$

and so the log odds for S after observing $x$ is

$$\log \frac{q}{1 - q} + \sum_i \log \frac{P(x_i \mid S)}{P(x_i \mid F)}.$$

This can be written as $k + \sum_i w_i x_i$, where

$$w_i = \log \frac{h_i (1 - g_i)}{(1 - h_i)\, g_i},$$

which is just the log odds ratio for the 2 x 2 table classifying $x_i = 1$ or 0 against S and F. Given independence, these are therefore the optimum weights. By the Neyman-Pearson Lemma in statistical theory, any other set of weights must be less efficient (i.e., they do not use all the information available in the $x_i$'s).

The Burgess method has $w_i = 1$, or, since a scale factor in the score is irrelevant, $w_i$ = constant. Thus, the Burgess method is only optimum if the cross-product ratio is the same for each factor (i.e., each $x_i$ gives the same amount of information about S or F).

The Glueck method is equivalent to

$$w_i = P(S \mid x_i = 1) - P(S \mid x_i = 0),$$

which, from above, simplifies to

$$w_i = \frac{q (1 - q)(h_i - g_i)}{p_i (1 - p_i)}.$$

Again, a constant multiple is irrelevant, so essentially

$$w_i = \frac{h_i - g_i}{p_i (1 - p_i)} \neq \text{log odds ratio for } x_i.$$

However, if $x_i$ has only modest predictive power, we can write $h_i = g_i + \varepsilon_i$ where $\varepsilon_i$ is small. We can then show that

$$\log \frac{h_i (1 - g_i)}{g_i (1 - h_i)} = \frac{h_i - g_i}{p_i (1 - p_i)} + \text{terms involving } \varepsilon_i^2.$$

Hence the Glueck method is approximately optimum if $\varepsilon_i$ is small, that is, if each individual $x_i$ contributes only a modest amount of information. In many practical cases the score may involve a relatively large number of $x_i$'s, none of which by itself is spectacular, but together they may be useful. This, we suggest, accounts for the apparent success of the Glueck method.

As set out above, Burgess and Glueck are not separate and distinct models but are, in fact, simple log-linear models in which all the predictor variables are treated as independent, i.e., they are not correlated. We would advocate the use of the formal independence Bayes method in preference to the more ad hoc Burgess and Glueck approaches because it has several important advantages:

1. It is equally simple yet is based on a coherent theory and is optimum within the framework of that theory.
2. It provides a direct estimate of $P(S \mid x)$, whereas the scoring methods of Burgess and Glueck have to be separately calibrated on the data, that is, the probability of success given a certain score is estimated by calculating the proportion of all subjects with that score who succeeded.
3. Similarly, the value of the score is seen to be a log odds ratio. Hence if the score is $s$, the probability of success must be of the form $e^s / (1 + e^s)$.

There are two further advantages of the Bayes method that make it extendable in ways not possible for the Burgess and Glueck methods. (Extensions of this kind have been considered in the medical literature under the name of "computer-aided diagnosis models.") First, it can more readily accommodate $x_i$'s that are not binary. The formula is then

$$\text{log odds for S given } x = \log \frac{q}{1 - q} + \sum_i \log \frac{f_i(x_i)}{g_i(x_i)},$$

where $f_i(x_i) = P(x_i \mid S)$ and $g_i(x_i) = P(x_i \mid F)$. Of course, all these probabilities are estimated from the data. Note that we need the proportions of the various values of $x_i$ within the F and S groups separately and not the proportions of S and F within the groups defined by various values of $x_i$ (a crucial distinction). The above formula is not necessarily linear in each $x_i$ (but there is no reason to expect it to be). Thus we avoid the need arbitrarily to dichotomize each predictor variable, the full information in each value of $x_i$ being retained in an optimum way. Of course, if the $x_i$'s are divided into too many categories, each term, such as $P(x_i \mid S)$, is estimated less accurately, and so, if there are too many categories (e.g., age measured in years), it is better to treat $x_i$ as a continuous variable and use a regression technique. Thus if some $x_i$'s are continuous, the term $\log [f_i(x_i) / g_i(x_i)]$ can be estimated directly as a regression on $x_i$. Hence the method can accommodate mixed data in which some $x_i$'s are continuous, e.g., age, and some $x_i$'s are binary, e.g., sex (cf. analysis of covariance methods).

Second, the Bayes method can be generalized to take account of particular circumstances concerning the distribution of the $x_i$'s. For example, if the $x_i$'s are not independent but correlated to a roughly equal extent (e.g., they are all positively correlated), a modification simply involves multiplying $w_i$ by a constant, and so the relative weights remain essentially the same. Thus, if the Bayes formula is recalibrated on the data (which allows an appropriate linear transformation of the score to be estimated), it works well even when the $x_i$'s are moderately correlated with each other. If the $x_i$'s are correlated, but not all to the same degree, the so-called "Lancaster models" can be used, which are based on a second-order approximation to the joint distribution of the $x_i$'s.
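The relation between the three sets of weights for binary predictors can be sketched numerically. The following is a minimal illustration only: the function names are hypothetical, and the values of q, h_i, and g_i are made up for the example rather than estimated from any data set. It computes the independence Bayes intercept and weights (the log odds ratios), the Glueck weights, and the probability of success implied by a score.

```python
import math

def independence_bayes(q, h, g):
    """Return (k, w) such that the log odds of success is k + sum(w[i] * x[i]).

    q    : overall success rate P(S)
    h[i] : P(x_i = 1 | S)
    g[i] : P(x_i = 1 | F)
    """
    k = math.log(q / (1 - q))
    w = []
    for hi, gi in zip(h, g):
        # Contribution of factor i when x_i = 0 goes into the intercept.
        k += math.log((1 - hi) / (1 - gi))
        # Extra contribution when x_i = 1: the log odds ratio of the 2 x 2 table.
        w.append(math.log(hi * (1 - gi) / ((1 - hi) * gi)))
    return k, w

def glueck_weights(q, h, g):
    """Glueck weight for each factor: P(S | x_i = 1) - P(S | x_i = 0)."""
    w = []
    for hi, gi in zip(h, g):
        p = hi * q + gi * (1 - q)  # P(x_i = 1)
        w.append(hi * q / p - (1 - hi) * q / (1 - p))
    return w

def success_probability(k, w, x):
    """Convert the score s = k + sum(w_i x_i) into e^s / (1 + e^s)."""
    s = k + sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + math.exp(-s))

# Hypothetical inputs chosen for illustration only.
q = 0.5
h = [0.55, 0.60, 0.70]
g = [0.45, 0.40, 0.30]

k, w_bayes = independence_bayes(q, h, g)
w_glueck = glueck_weights(q, h, g)
w_burgess = [1.0] * len(h)  # Burgess: every factor counts equally

for wb, wg in zip(w_bayes, w_glueck):
    # A near-constant ratio means the Glueck weights are nearly optimal.
    print(round(wb, 3), round(wg, 3), round(wb / wg, 2))

print(success_probability(k, w_bayes, [1, 1, 0]))
```

With these made-up inputs the ratio of Bayes to Glueck weights is close to constant for the two weaker factors and drifts upward for the strongest one, which is the behavior the small-ε approximation leads one to expect; the Burgess weights ignore the differences between factors altogether.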
The Lancaster models have been found useful in medical diagnosis applications; see review in Titterington et al. (1981).

Apart from the obvious simplicity, an important advantage of all these methods is the relative precision with which the weights (or coefficients, if viewed as a log-linear model) are estimated. This is because the assumption of independence allows each weight to be estimated separately, and any sampling effects in the intercorrelations of the x's have no effect. If the sample size is relatively small, and the correlations between the x's are, at most, modest, point-scoring methods do well. Larger correlations between the x's, but with a similar sample size, can be dealt with in an approximate way by one of the modifications mentioned above. For somewhat larger sample sizes, however (say several hundred), a prediction equation should make proper allowance for the dependence between the x's, and a logistic model or log-linear model (in the usual sense for categorical data) is the preferred alternative. In such models, each weight or coefficient is, of course, not just a function of the relevant $x_i$ but depends in a much more complicated way on the joint distribution of all the $x_i$'s. The complexity of the model affects the degree of shrinkage, which will be discussed later in the paper. If our suggestions for correcting for shrinkage are used, the increased shrinkage of these complicated models should not present a problem.

PREDICTIVE POWER, CALIBRATION, AND SHRINKAGE OF PREDICTION EQUATIONS

Much statistical work in criminology has been concerned with the construction and use of prediction equations. For each individual, some response y (a binary yes-no variable, a time to arrest, et cetera) is measured, along with values of explanatory variables $x = (x_1, x_2, \ldots)$, and on the basis of these x's a predicted value of y, say $\hat{y}$, is formulated. How good is $\hat{y}$ as a predictor of y? Issues related to this general question are to be discussed in this section. We are concerned here with the underlying methodology of the assessment of prediction equations, rather than with details of prediction equations in specific applications.

There are two contrasting, and yet complementary, approaches to the discussion of this question, corresponding roughly to the two philosophies of statistical inference and decision theory as understood in the statistical literature. The inference approach is taken up in the next section, where we ask: Given that an individual is described by $x = (x_1, x_2, \ldots)$, what information does that give us about y? A prediction equation, with value $\hat{y}$, is seen as an estimate of the expectation of y in some sense. The properties and behavior of a prediction instrument are studied in terms of the accuracy of $\hat{y}$ over the totality of all different values of y and x. We argue that a particular advantage of the inference approach is that a clear discussion of shrinkage is possible. Our discussion leads to a correction for shrinkage or to "preshrunk" prediction equations as we will call them.

The other approach is more pragmatic; it views a prediction equation as a means to an end, that of a decision instrument. All the issues are illustrated by a binary classification, conventionally labeled positive-negative. Each individual falls into one or other group (e.g., success-failure), the decision as to which is the true group being made on the basis of x. The discussion focuses entirely on the frequencies of correct and incorrect decisions.
A confusing array of measures of predictive power has appeared in the criminological literature (and in the parallel literature on computer-aided diagnosis in medicine). We show that the more important of these are in fact very closely related to each other.

There is an obvious link between the two approaches. If y is an observed response, a binary classification could be: success if $y \geq k_1$ and failure if $y < k_1$. The classification from the prediction equation would by analogy be: success if $\hat{y} \geq k_2$ and failure if $\hat{y} < k_2$ (there is no reason to insist that $k_1 = k_2$). We would argue in favor of formulating $\hat{y}$ to optimize such properties as calibration and validation (discussed in the next section) and then choosing $k_2$ to secure desirable aspects of error rates and/or utility (discussed later).

It is worth noting, however, that prediction equations are sometimes useful as a research tool in their own right, not just as a means of implementing the positive-negative decision. For instance, to control for differences between cases in a study, the value of an appropriate prediction $\hat{y}$ could sensibly be used either as a covariate in statistical analysis using covariance adjustments or as a criterion for matching cases and controls in a matched-pairs design. An example of the former approach is in Bottoms and McClintock (1973:Chapter 11).

Validation and Shrinkage

It is almost universal experience that, when a prediction equation is fitted to data and then applied to some new cases or a new cohort, the usefulness and accuracy of the prediction are much more disappointing than expected. The term "shrinkage" has been used to describe this deterioration in predictive power. Although the effect is real enough, and noted in many studies, the term has never been given a precise definition. Quite independently of the experience of criminologists in using prediction equations, there has been the remarkable development in the statistical literature of so-called "shrinkage estimation," a technique whereby a set of related parameters can be estimated more accurately (on average) than by conventional techniques, such as least squares. The use of the same term in these different contexts has appeared at best coincidental and at worst grotesquely misleading. However, there are known to be close connections between them, as discussed in Copas (1983b). Using the theory described in that paper it is possible to (a) clarify the manifestations of shrinkage, (b) highlight the reasons for them, (c) derive alternative methods of fitting prediction equations that will eliminate some of the adverse effects of shrinkage, and (d) enable the extent of shrinkage in any given application to be estimated in advance from the original data. These points are discussed in this section, and a brief outline of Copas's theory is illustrated by a criminological example.

In fitting a prediction equation to data, we will have, as before, observations on some response y (e.g., the number of convictions in a long-term follow-up, or a binary factor describing whether some event, such as rearrest, has occurred) together with information on a number of predictive factors x (number of previous convictions, age, et cetera). The aim is to formulate a predictor $\hat{y} = f(x)$ for some function $f$ [e.g., multiple regression, in which case $f(x) = \hat{\alpha} + \hat{\beta}'x$]. The fit of the equation relates to the proximity of $\hat{y}$ to the actual observed values of y. Two aspects of the prediction equation are distinguished:

1. Calibration. Here we group cases with the same or similar values of $\hat{y}$ and ask whether the average of the associated y's is equal to the predicted value $\hat{y}$. The greater the difference, the worse the calibration.
2. Efficacy. Here we ask whether values of $\hat{y}$ discriminate clearly between cases with different x's.
A simple measure of this is the correlation between y and $\hat{y}$. (In the case of multiple regression this is just the multiple correlation coefficient, R, the square root of the coefficient of determination.) A large R shows that $\hat{y}$ changes substantially as x changes, while a small R means that $\hat{y}$ is almost the same for all x (and so is useless as a predictor).

The ideal predictor, never realized in practice, is one in which $\hat{y} = y$ for all x, which calibrates perfectly and has maximal efficacy (R = 1). In practice, if the model behind the prediction equation is correct, when judged by values of y and $\hat{y}$ in the data, $\hat{y}$ will calibrate well but have R somewhat less than 1 (this is essentially the Gauss-Markov theorem of least squares).

A second crucial distinction is between retrospective fit and validation fit. Retrospective fit concerns the comparison between values of y and $\hat{y}$ in the data on which the prediction equation is fitted. Validation fit envisages the prediction equation being applied to a new set of cases or subjects and compares the actual values of y in the new data with the predictions $f(x)$, calculated using the original prediction equation $f$ but using the new values of x. The difference between the sets of data is emphasized by the terms "construction data" and "validation data." Shrinkage implies that validation fit is worse than retrospective fit. In practice, the predictions $\hat{y}$ calibrate well in the construction data but less well, and sometimes very badly, in the validation data. Efficacy is nearly always worse in the validation data than in the construction data. Copas's theory quantifies both these aspects of the deterioration of fit.

There are (at least) three possible causes of the deterioration in both these aspects of fit: (a) a purely statistical effect that is the inevitable result of unexplained (random) variation in the data; (b) changes in the population of x's from construction data to validation data (e.g., there might be some intermediate change of policy or other intervention that alters the range of subjects available for study); and (c) the underlying associations between y and x might change (e.g., a change in some latent factor that is not observed in x). Each of these causes of shrinkage is discussed below.

Shrinkage as a Statistical Effect: Cause (a)

Cause (a) will be illustrated in the case of multiple regression, in which the statistical model is

$$y = \alpha + \beta'x + \varepsilon,$$

$\varepsilon$ being the usual random error. Without loss of generality, we can assume the x's are standardized to have mean zero, so that $\alpha$ merely reflects the overall average value of y. Suppose causes (b) and (c) do not operate, so that we have a stable population of x's and constant true values of $\alpha$ and $\beta$ as we go from construction to validation data. This, therefore, represents the ideal situation as far as fitting and validating a prediction equation is concerned.

If $\hat{\alpha}$ and $\hat{\beta}$ are least squares estimates in the construction data, the prediction equation is

$$\hat{y} = \hat{\alpha} + \hat{\beta}'x.$$

Suppose we test this out on a very large validation sample, so that we compare $y = \alpha + \beta'x + \varepsilon$ with $\hat{\alpha} + \hat{\beta}'x$ over a population of new cases (y, x). To study calibration, we calculate the average y (i.e., $\alpha + \beta'x$) over those cases x that relate to a specific prediction $\hat{y}$. This is done by fitting a linear regression of y on $\hat{y}$, which can be shown to have slope

$$K = \frac{\hat{\beta}'V\beta}{\hat{\beta}'V\hat{\beta}},$$

where V is the variance-covariance matrix of the x's. The average of K, over statistical errors in $\hat{\beta}$, which is evaluated in Copas (1983b), is always less than 1. Hence large values of y tend to be overestimated and small values of y tend to be underestimated. This is because

$$E(\hat{\beta}'V\hat{\beta}) = \beta'V\beta + \frac{m\sigma^2}{n} > \beta'V\beta = E(\hat{\beta}'V\beta),$$

where n is the sample size in the construction data and m is the number of variables measured in x. By the same reasoning, $\beta'V\beta$ can be estimated by $\hat{\beta}'V\hat{\beta} - m\hat{\sigma}^2/n$, where $\hat{\sigma}^2$ is the usual residual mean square, and so K itself can be estimated by

$$\hat{K} = \frac{\hat{\beta}'V\hat{\beta} - m\hat{\sigma}^2/n}{\hat{\beta}'V\hat{\beta}} = 1 - \frac{1}{F},$$

where F is the usual F-ratio of multiple regression. A more thorough analysis, valid if $m \geq 3$, shows that the slightly modified estimate

$$\hat{K} = 1 - \frac{m - 2}{mF}$$

is unbiased in the sense that $E(\hat{K}) = E(K)$. Thus $\hat{K}$ measures the deterioration in calibration; in a set of validation data, the average value of y to be expected for a given $\hat{y}$ is not $\hat{y}$, as might be anticipated from the construction data, but

$$\tilde{y} = \bar{y} + \hat{K}(\hat{y} - \bar{y}),$$

where $\bar{y}$ is the overall observed average of y. The smaller $\hat{K}$ is, the greater the distortion in calibration. Of course, this is itself a prediction in the sense that $\tilde{y}$ is calculated from the construction data and cannot be expected to be invariably correct when applied to practical validation data. However, on average, and to an approximation examined in detail in Copas (1983b),

$$E(y \mid \tilde{y}) = \tilde{y}$$

for a typical validation case (y, x). Thus $\tilde{y}$ can be said to be preshrunk in the sense that it is expected to calibrate well (show no calibration shrinkage) on validation data. Of course $\tilde{y}$ will not calibrate well on the construction data (because it is $\hat{y}$ that does), but, from a pragmatic point of view, retrospective performance of a predictor is irrelevant.

The pedigree of $\tilde{y}$ is confirmed in Copas (1983b), in that $\tilde{y}$ corresponds exactly to a "shrinkage estimator" in the sense of the term used in the statistical literature. It is proved that, within the assumptions outlined above, $\tilde{y}$ is uniformly better than $\hat{y}$ in the mean squared error sense, i.e.,

$$E(y - \tilde{y})^2 < E(y - \hat{y})^2$$

over validation data (y, x), provided $m \geq 3$, where m is the number of x variables. If m = 2, $\hat{K} = 1$ and so preshrinkage has no effect. If m = 1, the whole theory breaks down, since the expectations of quantities such as K cease to exist (the relevant infinite integrals diverge). In fact, it is shown that for m = 1 and m = 2 no uniform improvement on least squares is possible. The theory of preshrinking is therefore useful only if there are three or more predictive variables in x.

Turning to efficacy, but still in the multiple regression case, the deterioration in correlation is inevitable and cannot be removed by preshrinking. In fact

$$\mathrm{Corr}(y, \tilde{y}) = \mathrm{Corr}(y, \hat{y}),$$

and so the discrimination afforded by $\tilde{y}$ is identical to that of $\hat{y}$. The inevitable decline in correlation is simply due to the fact that in the construction data $\hat{y}$ has knowledge of the actual y's, whereas in validation data it does not. The above theory is immediately extended to predict the validation correlation of y and $\hat{y}$ (or $\tilde{y}$): it is

$$\tilde{R} = \frac{(n - 1)R^2 - m}{(n - m - 1)R},$$

where R is the multiple correlation coefficient in the construction data. Always we have $\tilde{R} < R$. For prediction, the retrospective R is irrelevant; efficacy should be measured by $\tilde{R}$, which on average will be (approximately) the correlation obtained if the predictor ($\hat{y}$ or $\tilde{y}$) were to be validated.

A minor point to mention is that $\hat{K}$ can be negative, in which case $\tilde{y}$ inverts the predictions made by $\hat{y}$.
However, in the worst case, in which x has no effect ($\beta = 0$), $E(F) > 1$ and so $E(\hat{K}) > 0$. Thus, if $\hat{K}$ is negative, the correlations between y and x are even worse than one would expect from pure random numbers, and it would be apparent that any prediction equation based on x is doomed to failure. The same comment applies to the circumstance that $\tilde{R} < 0$.

The multiple regression model being discussed implicitly assumes that y is a continuous variable. Models for discrete and categorical data are mentioned elsewhere in this paper, including the important case of binary data. Suppose that y is defined to be 1 if an event occurs (success) and 0 if it does not (failure), with the predictive factors x as before. A multiple regression of y on x can still be fitted, with E(y) being interpreted as the probability of success. All the above quantities in shrinkage theory can be calculated in the same way, although their mathematical validity can only be taken as an approximation (but often a reasonable one if n is large and the correlations between y and each x are not too close to 1). The more informative model is logistic regression, for which

$$\hat{p} = f(x) = \frac{e^{\hat{\alpha} + \hat{\beta}'x}}{1 + e^{\hat{\alpha} + \hat{\beta}'x}}.$$

The overall significance of a fitted model of this kind is measured by a value of $\chi^2$ ("deviance" in computer output from the statistical package GLIM), and it can be shown that in many practical cases $\chi^2 \approx mF$, where F is the F-ratio in an ordinary multiple regression of y on x. Thus $\hat{K}$ becomes

$$\hat{K} = 1 - \frac{m - 2}{\chi^2}.$$

Calibration relates to the probability of success rather than to the average value of y. A binary predictor is well calibrated if, over all cases in which $f(x) = \hat{p}$, say, the proportion of successful cases is in fact $\hat{p}$. In a large validation sample, this proportion will be expected to be

$$\tilde{p} = \frac{e^{\hat{\alpha} + \hat{K}\hat{\beta}'x}}{1 + e^{\hat{\alpha} + \hat{K}\hat{\beta}'x}}$$

for the same reasons as in the multiple regression case. Thus $\tilde{p}$ is the preshrunk form of the predictor, by analogy with $\tilde{y}$ above.

This is illustrated in a particular application to the problem of predicting the probability of absconding from open borstals, taking into account known social and criminological indicators (using data kindly made available by the Prison Department's Young Offender Psychology Unit, Home Office, England). Here y = 1 if the trainee absconded during sentence, y = 0 otherwise, and m = 22 predictive factors were studied. A logistic regression on n = 500 cases gives $\chi^2 = 50.2$ on 22 degrees of freedom, which is highly significant; $\hat{K}$ is 0.602. Calibration was examined by using a nonparametric smoothing method to plot the actual proportion of absconding cases, say p, against the predicted proportions $\hat{p}$ [$= f(x)$]; the method is from Copas (1983a). This is shown in Figure 1, in which both axes are on logistic scales. The calibration is satisfactory in the construction data, in that the plotted curve (labeled "construction data") is tolerably close to the diagonal line $p = \hat{p}$. A further set of 1,500 cases was then used as validation data and the plotting process repeated. The shrinkage is very marked (Figure 1); the plotted curve is much shallower than the diagonal (large p's are overestimated by $\hat{p}$, small p's underestimated). The use of $\tilde{p}$ instead of $\hat{p}$ is equivalent to retaining the graph with $\hat{p}$ as the horizontal coordinate, but replacing the diagonal line with a line of slope $\hat{K} = 0.602$, shown as the dashed line.
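The arithmetic behind the quoted shrinkage factor can be checked with a short sketch. The function names below are hypothetical. The inputs m = 22 and chi-squared = 50.2 are those reported for the absconding study; the values of the fitted intercept and linear predictor used in the last lines are made up for illustration. The formulas are the ones stated for the multiple regression and logistic cases: K = 1 - (m - 2)/(mF), K = 1 - (m - 2)/chi-squared, the preshrunk predictor, and the validation correlation.

```python
import math

def shrinkage_factor_f(m, f_ratio):
    """Estimated K for multiple regression: 1 - (m - 2) / (m * F)."""
    return 1 - (m - 2) / (m * f_ratio)

def shrinkage_factor_chi2(m, chi2):
    """Estimated K for logistic regression, using chi2 in place of m * F."""
    return 1 - (m - 2) / chi2

def preshrunk_value(y_hat, y_bar, k):
    """Preshrunk regression prediction: y_bar + K * (y_hat - y_bar)."""
    return y_bar + k * (y_hat - y_bar)

def preshrunk_probability(alpha_hat, linear_part, k):
    """Preshrunk logistic prediction; linear_part stands for beta_hat' x."""
    z = alpha_hat + k * linear_part
    return 1 / (1 + math.exp(-z))

def validation_correlation(n, m, r):
    """Predicted validation correlation: ((n - 1) R^2 - m) / ((n - m - 1) R)."""
    return ((n - 1) * r ** 2 - m) / ((n - m - 1) * r)

# Absconding study, full logistic regression: m = 22, chi2 = 50.2.
k_full = shrinkage_factor_chi2(m=22, chi2=50.2)
print(round(k_full, 3))  # 0.602

# Hypothetical fitted values, for illustration only.
alpha_hat, linear_part = -1.0, 2.0
p_hat = preshrunk_probability(alpha_hat, linear_part, 1.0)   # no shrinkage
p_tilde = preshrunk_probability(alpha_hat, linear_part, k_full)
print(p_hat > p_tilde)  # a high prediction is pulled back toward the middle
```

Setting K = 1 recovers the ordinary fitted probability, so the same function serves for both the unshrunk and the preshrunk predictor; only the multiplier on the linear part changes.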
The reasonable fit of the validation curve to the dashed line confirms that the vaTicia- tion calibration of p is satisfactory. The ordinary multiple correlation coef- ficient between y and x for these data is R - 0.322, whereas the vaTiclation corre- lation discussed above is R = 0.194. The substantial shrinkage has almost halved the correlation, the efficacy of the predic- tor on validation being extremely modest. This magnitude of the drop in correlation is not at all unusual in practice (e.g., Simon, 19711. The multiple and logistic regression models discussed above are fixed models in the sense that the variables in x are fixed in advance. In practice, prediction equations are often simplified by using stepwise regression or some other proce- clure for subset selection; the variables in x are then selected using the data, and only those x's showing reasonably strong correlation with y are retained. The usual theory of least squares is, of course, com- pletely upset by such selection. A recent discussion in the Journal of the Royal Statistical Society (<A. I. Miller, 1984) has highlighter] the complex issues involvecI. Shrinkage theory has been extended to stepwise regression, but the details in Copas (1983b) will not be repeated here. The main result is that shrinkage for re- gression on a subset of x is greater, usu- ally much greater, than would be antici- pated if the subset were fixed in advance. Given certain assumptions, the value of K corresponding to validation calibration is the value as calculated from the full re- gression on all x's and not as calculated using the above formula based on the subset actually usecI. These assumptions are often reasonable in practice, and in cases of doubt a rather elaborate signifi- cance test proposed in Copas (1983b) can

300 be used. The formula for shrinkage of the correlation coefficient is modified to (n- 1)R2-m R= R* In- 1 - m)R CRIMINAL CAREERS AND CAREER CRIMINALS with K = 0.602 and much greater than that implied by the value K = 0.931. Shrinkage in the Light of Changes in the Population Cause (b) where R* is the multiple correlation be- The theory expounded so far accommo tween y en c] the selected x's, and as be- dates cause (a) the purely statistical ef fore, R is the corresponding correlation feet but assumes that there are no for all the x's. changes in the distribution of x Ecf. (b)] or the response function [cf. acid. Neither assumption wit! be exactly true, although each will often hold to a reasonable ap proximation. In this section we discuss the effect of changes in the population (i.e., in the distribution of x) on the vaTi clation performance of predictors. We suppose that x has mean me in the con struction sample and mean m2 in the validation sample, with the variance covariance matrices V, and V2 definer! in an analogous way. We therefore wish to Since many x's in the absconding stucly appeared to be of Tow predictive value, a subset of just four x's was chosen for the logistic regression, with x2 = 29.0 on four degrees of freedom. If selection is ig- nored, this would give K = 0.931 (indicat- ing very little shrinkage). For the full logistic regression Zip = 0.602, as before. The validation fit of the reducer! regres- sion is shown in Figure 2, which was constructed in the same way as Figure 1. As can be seen the shrinkage is consistent o -1 _ Q 4 - ._ o J IVY -2 - 4 _ // -4 - 3 -2 -1 0 1 2 logit p FIGURE 1 Shrinkage for absconding study (full regression). Source: Derived from data provided by Prison Department's Young Offender Psychology Unit, Home Office, England.

study the case in which $m_1 \neq m_2$ and/or $V_1 \neq V_2$. A number of approaches are discussed in turn, corresponding to various ways in which changes in distribution can occur, and to different ways in which the performance of predictors can be assessed. Some of these correspond to well-established results in the statistical literature, others to work in progress.

FIGURE 2 Shrinkage for absconding study (stepwise regression); horizontal axis: logit p. Source: Derived from data provided by Prison Department's Young Offender Psychology Unit, Home Office, England.

Wishart Variation. Perhaps the simplest case is to assume that the construction and validation samples are both sampled randomly from the same underlying population. The matrices $V_1$ and $V_2$ will then be independent samples from the same Wishart distribution indexed by the (unknown) true variance-covariance matrix. Similarly, $m_1$ and $m_2$ will be independent with identical multivariate normal distributions. It can be shown that the uniform improvement of the shrinkage predictor $\tilde{y}$ over the least squares predictor $\hat{y}$ continues to hold in this more general setting, i.e.,

$$E(y - \tilde{y})^{2} < E(y - \hat{y})^{2},$$

where the expectation is over (y, x) in the validation sample, over the distribution of regression parameters, as well as over sampling variation in the m's and V's. The only requirement is, as before, that $m \geq 3$. Again, the improvement holds over all possible true regression parameters, no matter what the underlying parameters of the population are. Thus differences in samples caused by sampling variation only do not affect the shrinkage arguments put forward in the last section.

Mathematical Conditions for Uniform Improvement. The Wishart variation case suggests that if $m_1 - m_2$ and $V_1 - V_2$ are small, shrinkage theory is unaffected. To investigate what happens when these differences are larger, define the matrix

$$W = V_1^{-1/2}\left[V_2 + (m_2 - m_1)(m_2 - m_1)'\right]V_1^{-1/2},$$

and let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ be the m ordered eigenvalues of W. Then it is shown in Brown and Zidek (1980) that the prediction mean squared error of $\tilde{y}$ is better than that of $\hat{y}$ for all possible regression parameters if

$$\lambda_1 \leq \frac{2}{m+2}\sum_{i=1}^{m}\lambda_i.$$

Roughly speaking, the largest eigenvalue of W should not exceed about twice the average of all the eigenvalues. (If $V_1 = V_2$ and $m_1 = m_2$, W is the identity matrix and so all the $\lambda_i$'s are unity.) This puts an upper bound on the differences between construction and validation samples that can be allowed if $\tilde{y}$ is to remain uniformly superior to $\hat{y}$.

Robustness Region for Superiority of $\tilde{y}$. When the matrix W leads to failure of the above inequality, the question of whether $\tilde{y}$ has a lower prediction mean squared error than $\hat{y}$ depends on the true (and unknown) vector of regression parameters $\beta$. Typically, when the inequality fails, shrinkage will be better if the coefficients of $\beta$ are sufficiently small, but worse if the coefficients are large. Extremely large regression coefficients do not usually characterize empirical relationships in the social sciences, and so in practice differences between construction and validation samples can often be considerably larger than implied by the Brown-Zidek inequality. Explicitly, it is possible to define a "robustness region," $RR(m_1, m_2, V_1, V_2)$, such that the preshrunk predictor is superior to least squares if, and only if, $\beta \in RR(m_1, m_2, V_1, V_2)$. Jones and Copas (in press) have formulated a precise specification of RR and, further, have developed a significance test by which the hypothesis $\beta \in RR$ can be assessed in the light of the estimated regression coefficient vector.
Thus, when constructing a prediction equation, $m_1$ and $V_1$ are taken directly from the construction sample; the likely superiority of the shrinkage correction can then be checked using the robustness region test against a variety of changes in population that might be contemplated.

The Effect of Screening Based on One or More Explanatory Variables. A common way in which populations can change is represented by screening on one or more of the explanatory variables. Suppose, for example, that the values of $x_i$ in the construction sample are representative of the underlying population of subjects, but that future use of the prediction equation will be restricted to subjects with $x_i > c$, where c is some screening threshold. This may happen, for instance, if some intervention or change of policy occurs following a preliminary prediction study. Given $m_1$ and $V_1$ from the construction sample and the value of any screening threshold, it is possible to estimate $m_2$ and $V_2$ corresponding to the appropriately truncated distribution of the validation sample and, hence, to estimate W. If the Brown-Zidek inequality fails, the robustness region test can be carried out for the observed vector of regression parameter estimates. By this procedure the value of the shrinkage correction can be assessed. A number of case studies along these lines have been investigated in Jones and Copas (1985). In general, quite a heavy truncation can be tolerated while retaining the superiority of the shrinkage predictor; at least one-half and as much as two-thirds of a population can be screened out in this way.

Sample-Reuse Studies of Screening. Sample-reuse methods provide a rich source of techniques for studying the properties of a prediction equation in the

context of a particular study, as will be discussed in a later section of this paper. Two particular applications lend themselves to the monitoring of screening. First, a simulation study can be undertaken in which the prediction equation is fitted to a random subset of the data, and the remaining cases are screened in the appropriate way to form the validation sample. The random sampling of the construction data is repeated a large number of times to obtain expected values of prediction mean squared error or other measures of predictive performance. The second method involves the bootstrap: both construction and validation data are artificially sampled with replacement from the complete set of available data. The method of screening under study is applied to the validation cases before the prediction equation is evaluated. Again, some detailed results are given in Jones and Copas (1985); the general conclusion is similar to that made earlier, namely, that a moderate degree of screening does not usually affect the advantages of the shrinkage correction.

Shrinkage Correction Adapted for a Change in Population. Comments so far in this section have concerned robustness, i.e., the study of how the preshrunk predictor $\tilde{y}$ performs in the light of changes in the distribution of x. If some particular change in population is envisaged, can the shrinkage correction be designed to take account of it? A reworking of the theory leading to the correction K, explained above, leads to

$$K^{*} = 1 - \frac{(m-2)\,\hat{\sigma}^{2}\,\operatorname{tr}(V_1^{-1}V_2)}{m\,n\,\hat{\beta}'V_2\hat{\beta}}.$$

Note that $K^{*} = K$ if $V_1 = V_2$. The corresponding form of the preshrunk predictor is

$$y^{*} = \left\{\bar{y} + \hat{\beta}'(m_2 - m_1)\right\}(1 - K^{*}) + K^{*}\hat{y}.$$

Unfortunately, the sampling theory of $K^{*}$ and $y^{*}$ is very much more complicated than that of K and $\tilde{y}$, and optimum mean squared error properties have yet to be proved.
Presumably, if $m_2 - m_1$ and $V_2 - V_1$ are both fairly small, the favorable properties of $\tilde{y}$ will continue to hold, but the situation for large population changes is less clear.

An Adaptive Formulation of Shrinkage Based on Cross-validation. A very different approach is reported in Copas (1984). Here none of the usual assumptions of linear regression is made (e.g., constant variance of residuals), but instead a shrinkage correction $K^{**}$ is estimated directly from the available construction data. Following the sample-reuse approach mentioned above, the sampling distribution of the empirical slope of y on $\hat{y}$ for randomly chosen subsets of the data is studied mathematically, and an asymptotic approximation to the expected shrinkage is thereby obtained. The form of this approximation is applied to the whole set of data, giving the nonparametric shrinkage correction $K^{**}$. It is shown in Copas (1984) that, as expected, $K^{**}$ is equal to K if and only if the usual assumptions of the underlying model hold. The correction $K^{**}$ is most sensitive to heteroskedasticity of the residuals; $K^{**}$ can shrink more or less than K according to the particular observed pattern of model residuals. Case studies carried out using this new approach suggest that only exceptionally will $K^{**}$ differ markedly from K, and the validation properties of the corresponding nonparametric shrinkage predictor will often be rather similar to those of $\tilde{y}$.

Changes in the Regression Relationship: Cause (c)

It is obvious that if the relationship between y and the x's changes dramatically between construction and validation

data, the shrinkage will be equally dramatic and nothing in the way of useful prediction will be possible. Conversely, minor changes in the coefficients $\alpha$ and $\beta_1, \beta_2, \ldots$ will result in only small changes in predictive performance, and $\tilde{y}$ can still be regarded as an adequate approximation. Little work has been done in studying the effects of changes of intermediate size. As in the discussion of cause (b) in the previous section, if something is known in advance about the likely changes, corresponding modifications to the prediction equation can be made (e.g., a 10 percent rise or fall in values of y is anticipated). However, such circumstances will occur rarely, if ever, and so this remains an open research problem.

Some Concluding Remarks

We conclude this discussion of validation and shrinkage with a few comments that may help in formulating guidelines on the choice of prediction equation in any given application.

First, a simple method shrinks less than a complex one. (This can be seen in the above algebra by noting that the denominator of K exceeds the numerator by $m\sigma^{2}/n$ on average; this quantity increases as m, the number of variables in the equation, increases.) However, this is not so when a preshrinking correction is applied; provided the model and assumptions hold true, a preshrunk predictor is always approximately well calibrated. Thus the argument that a simple model (e.g., point scoring) is preferable to a more complicated one (e.g., multiple regression) because of shrinkage effects alone cannot be sustained. Proper statistical principles should be used in assessing the fit between a given model and the data; any shrinkage problems that arise are allowed for by preshrinking rather than by distorting the model being fitted.

Second, in selecting from among several x variables using a stepwise procedure, it is often supposed that a small
subset is better than a large one because the smaller number of coefficients causes less shrinkage. In general this argument is false. As explained above, the empirical selection effect itself leads to an increase in shrinkage. Again, a larger subset, with appropriate preshrinking correction, is better than an artificially small set with its own shrinkage correction. Usually, however, very little is gained by the later variables entering a stepwise regression procedure and so on the grounds of simplicity, with little loss of efficacy, a sensible subset (with preshrinking) will nearly always be used in the final prediction equation. For example, in the absconding study mentioned above, there is little basis for choosing on statistical grounds between the fits with the total of 22 x's and with a subset of just 4 x's (Figures 1 and 2).

Third, caution is needed if a prediction equation is to be applied outside the range of the construction data. The new theory of robustness to changes in the distribution of the x's, outlined above, suggests that modest changes can be tolerated within the framework of the same preshrinking method. However, if very marked changes are anticipated, or if erratic changes in the model are likely to occur, no prediction equation can be expected to work well. These circumstances are perhaps the only ones in which oversimplified methods (e.g., Glueck) can be justified on the grounds of robustness, but a clear formulation of such properties would be difficult.

Fourth, a prediction equation is essentially a statement of conditional expectation: if the x's are such and such, then the expectation of y is estimated to be such and such. In reality no particular model is exactly correct, and so an argument that one set of x's is "right" and another is "wrong" has no logical basis.
One can imagine values of the response variable (y) and the explanatory variables (x's) being distributed jointly in some space, each subset of x's, and each particular

model, providing a separate form of conditional expectation of y. Choosing a prediction equation involves choosing which conditional expectation is closest to the actual values of y (has least conditional variance), such a choice being made over whatever set of candidates is available. It may be that y is most closely correlated with an x that cannot actually be used in routine prediction, and so no subset containing such an x can be entertained. Typically, the best subsets or models will be ones that act as the best proxies to the prohibited x. Such equations may do less well than others involving the sensitive variable, but they cannot be discredited on statistical grounds alone.

Practical Utility

Predictive Power

Our starting point in this section is the familiar "risk classification," which compares predicted and actual outcomes. This approach to assessing the utility of different prediction instruments is completely different from (yet complementary to) that discussed in the previous section. Risk classes can be defined as the range of the predicted probability of some event (e.g., $k_1$ = 0 to under 0.1, $k_2$ = 0.1 to under 0.2, et cetera); as a score, such as the Salient Factor Score calculated in parole prediction research (D. M. Gottfredson, Wilkins, and Hoffman, 1978); or by some other classification, such as low-, medium-, and high-rate offenders, as in Greenwood's (1982) study of criminal careers. The example adopted here to illustrate and develop the discussion of predictive power is taken from Copas and Whiteley (1976) as it was subsequently used by Tarling (1982) to show the relationship between various measures. Copas and Whiteley's aim was to construct a prediction instrument to evaluate the effects of therapeutic treatment at the Henderson Hospital.
The criterion of success was taken to be no further admission to a psychiatric hospital or no further conviction for a criminal offense during the 2 to 3 years following release. Table 1 sets out the results for their construction and validation samples.

TABLE 1 Predicted Success and Observed Outcome, Construction and Validation Samples

Risk     Probability     Construction Sample              Validation Sample
Class    of Success      Success   Failure   Total        Success   Failure   Total
(k_i)    (p)             (s_i)     (f_i)     (t_i)        (s_i)     (f_i)     (t_i)
k1       0 to .3            5        33        38            7        18        25
k2       .3 to .5           7        12        19           14        15        29
k3       .5 to .7          21        12        33           12         9        21
k4       .7 to 1.0         11         3        14            8         4        12
Total                    Ns = 44   Nf = 60   T = 104      Ns = 41   Nf = 46   T = 87

Construction sample: MCR = .57, P(A) = .78, τc = -.55, γ = -.71
Validation sample: MCR = .28, P(A) = .64, τc = -.28, γ = -.40

SOURCE: Copas and Whiteley (1976) data as used by Tarling (1982).

Several summary statistics have been proposed to measure the predictive power of this and similar risk classifications, in particular mean cost rating (MCR) (Duncan et al., 1953) and P(A), the area under the receiver operating characteristic curve in signal detection theory (Fergusson, Fifield, and Slater, 1977). However, as the risk classification in Table 1 can be regarded as an ordered

contingency table, Kendall's rank correlation coefficient tau, $\tau_c$ (Kendall, 1970), and Goodman and Kruskal's gamma, $\gamma$ (Goodman and Kruskal, 1963), can also be used to measure the degree of association. There is as yet no consensus about the measure to be adopted, but Tarling (1982) has in fact shown that all four measures are related because all are functions of the statistic S (where S = P - Q, where P is the number of "concordant pairs" and Q is the number of "discordant pairs"). Expressing each as a function of S and using the notation of Table 1, the four measures can be defined as:

$$\mathrm{MCR} = \frac{-S}{N_s N_f}, \qquad P(A) = \frac{N_s N_f - S}{2N_s N_f}, \qquad \tau_c = \frac{4S}{T^{2}}, \qquad \gamma = \frac{S}{P + Q}.$$

Two advantages follow from knowing that all four measures are a function of S. First, by calculating S the calculation of all four measures is greatly simplified. Second, as the distribution of S has long been known, a test of the null hypothesis, E(S) = 0, is a test that prediction is no better than chance.

The measures $\tau_c$ and $\gamma$ have a further advantage over MCR and P(A) in that the variance of both can be estimated, thus permitting tests of alternative hypotheses and facilitating comparison of alternative prediction instruments or their respective power in the construction and validation samples. For $\tau_c$, however, only an upper bound to the variance is available, so only a conservative test for the difference of two observed values is possible. On the other hand, the exact value of the variance of $\gamma$ is available (Goodman and Kruskal, 1963), which permits a more powerful test. For this reason Tarling (1982) recommended that $\gamma$ should generally be preferred.

Prediction Errors

The four measures discussed above are still only indicators of overall fit and just give an indirect assessment of how a prediction instrument will perform in practice.
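As a rough check on these relationships, the following sketch computes S and the four measures for the construction sample of Table 1. The function names are illustrative, and the pairing convention (a success in a lower class paired with a failure in a higher class counts toward P) is an assumption chosen because it reproduces the signs reported in Table 1.

```python
def s_statistics(successes, failures):
    """Return (P, Q, S) for an ordered k x 2 risk classification.

    Rows are risk classes in increasing predicted probability of success;
    columns are ordered (success, failure), as in Table 1.
    The assignment of pairs to P and Q is an assumed convention.
    """
    k = len(successes)
    # P: a success in a lower class paired with a failure in a higher class.
    p = sum(successes[i] * sum(failures[i + 1:]) for i in range(k))
    # Q: a failure in a lower class paired with a success in a higher class.
    q = sum(failures[i] * sum(successes[i + 1:]) for i in range(k))
    return p, q, p - q


def predictive_power(successes, failures):
    """Compute MCR, P(A), tau_c and gamma as functions of S."""
    p, q, s = s_statistics(successes, failures)
    n_s, n_f = sum(successes), sum(failures)
    total = n_s + n_f
    return {
        "S": s,
        "MCR": -s / (n_s * n_f),
        "P(A)": (n_s * n_f - s) / (2 * n_s * n_f),
        "tau_c": 4 * s / total ** 2,
        "gamma": s / (p + q),
    }


# Construction sample of Table 1 (classes k1 to k4).
construction = predictive_power([5, 7, 21, 11], [33, 12, 12, 3])
for name, value in construction.items():
    print(name, round(value, 2))
```

For the construction sample this gives S = -1500 and, to two decimals, the MCR, P(A), τc, and γ values shown in Table 1. Only S, the marginal totals, and P + Q are needed, which is the simplification the text describes.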
It is essential, therefore, to calculate the number or proportion of correct and incorrect predictions that would result from the application of any rule. Given the discussion of overfitting and shrinkage in the previous section, estimates should be derived from a validation sample. Before applying the Copas and Whiteley instrument to identify likely successes, a cutoff point must be chosen. From the risk classification, as it is presented above, there are three possible cutoff points: all subjects with a predicted probability of success of .7 or above; all those with a predicted probability of .5 or above; and all those with a predicted probability of .3 or above. Figure 3 shows, for each cutoff point in the validation sample, the following:

1. the number of true positives (TP), that is, the number of subjects predicted to succeed who did in fact succeed;
2. the number of false positives (FP), that is, the number of subjects predicted to succeed who in fact failed;
3. the number of false negatives (FN), that is, the number of subjects predicted to fail who in fact succeeded; and
4. the number of true negatives (TN), that is, the number of subjects predicted to fail who did in fact fail.

FIGURE 3 Correct predictions and errors for each cutoff point (validation sample; Ns = 41, Nf = 46, base rate = .471 in every panel).

A: Cutoff point .7 and above (selection ratio = .138)
                        Actual success   Actual failure   Total predicted
   Predicted success       TP = 8           FP = 4             12
   Predicted failure       FN = 33          TN = 42            75

B: Cutoff point .5 and above (selection ratio = .379)
                        Actual success   Actual failure   Total predicted
   Predicted success       TP = 20          FP = 13            33
   Predicted failure       FN = 21          TN = 33            54

C: Cutoff point .3 and above (selection ratio = .713)
                        Actual success   Actual failure   Total predicted
   Predicted success       TP = 34          FP = 28            62
   Predicted failure       FN = 7           TN = 18            25

The two marginal distributions of these tables are usually defined as the base rate and the selection ratio. The base rate (or

the prevalence or the incidence) is the proportion of the sample that actually succeeded. It can be seen that this is the same for all three cutoff points (i.e., 47.1 percent). The second marginal distribution, the selection ratio, is the proportion of the sample predicted to succeed. It can be seen that the selection ratio changes depending on the cutoff point: it is 13.8 percent when the cutoff point is set at .7 and above, 37.9 percent when the cutoff point is set at .5 and above, and 71.3 percent when the cutoff point is set at .3 and above.

Defining the base rate and the selection ratio in terms of the four outcomes:

$$\text{Base rate, } BR = \frac{TP + FN}{T} \quad\text{and}\quad (1 - BR) = \frac{FP + TN}{T},$$

$$\text{Selection ratio, } SR = \frac{TP + FP}{T} \quad\text{and}\quad (1 - SR) = \frac{FN + TN}{T},$$

where T = total sample = TP + FP + FN + TN.

Considering the relationship between the base rate and the selection ratio reveals several interesting properties. When the selection ratio is larger than the base rate, false positives exceed false negatives; conversely, when the base rate is larger than the selection ratio, false negatives exceed false positives. When the base rate equals the selection ratio, the number of false positives and false negatives is the same. Furthermore, when both the base rate and the selection ratio equal .5, prediction becomes most accurate and results in fewest total errors (FP + FN). However, when the base rate (which is fixed) is not .5, as is often the case in practice, total errors are minimized when the selection ratio is set to equal the base rate. These phenomena are revealed in Figure 3 and can be used to guide the choice of the appropriate cutoff point.

Dunn (1981) sets out the various measures that can be derived from the kind of information presented in Figure 3, for example, sensitivity and specificity, but they are not discussed in any detail here.

Loeber and Dishion (1982, 1983) also discuss the significance of the base rate and the selection ratio.
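The panels of Figure 3 follow directly from the validation columns of Table 1. The sketch below, with illustrative function names, rebuilds the four outcome counts for each cutoff and reports the base rate, selection ratio, and total errors. It relies only on the definitions above, from which SR - BR = (FP - FN)/T, so the sign of FP - FN always matches the sign of SR - BR.

```python
# Validation-sample counts from Table 1, ordered from class k1 to class k4.
SUCCESS = [7, 14, 12, 8]
FAILURE = [18, 15, 9, 4]


def confusion_at_cutoff(first_selected_class):
    """Return (TP, FP, FN, TN) when classes from the given 0-based index
    upward are predicted to succeed (index 3 selects class k4 only)."""
    tp = sum(SUCCESS[first_selected_class:])
    fp = sum(FAILURE[first_selected_class:])
    fn = sum(SUCCESS[:first_selected_class])
    tn = sum(FAILURE[:first_selected_class])
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Return (base rate, selection ratio) for one 2 x 2 table."""
    total = tp + fp + fn + tn
    return (tp + fn) / total, (tp + fp) / total


for label, index in [(".7 and above", 3), (".5 and above", 2), (".3 and above", 1)]:
    tp, fp, fn, tn = confusion_at_cutoff(index)
    br, sr = rates(tp, fp, fn, tn)
    print("%s: BR=%.3f SR=%.3f FP=%d FN=%d total errors=%d"
          % (label, br, sr, fp, fn, fp + fn))
```

Running it shows false negatives exceeding false positives at the two cutoffs where the selection ratio is below the base rate, and the reverse at the .3 cutoff.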
They point out that the base rate and the selection ratio determine the maximum number of correct predictions that could be achieved by the prediction instrument but, further, that a certain number of correct predictions could be expected by chance alone. Loeber and Dishion therefore propose a measure, relative improvement over chance (RIOC), which attempts to assess how an instrument performs relative to its expected performance and its best possible performance given the base rate and the selection ratio. They define RIOC as:

$$\mathrm{RIOC} = \frac{AC - RC}{MC - RC},$$

where AC = actual number of correct predictions, RC = randomly expected number of correct predictions, and MC = maximum possible number of correct predictions. In the notation of Figure 3 it can be seen that

$$AC = TP + TN,$$

$$RC = \frac{(TP + FN)(TP + FP) + (FP + TN)(FN + TN)}{T},$$

$$MC = TN + TP + 2\min(FN, FP).$$
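A minimal sketch of this definition, computed step by step from AC, RC, and MC, is given below. The function name is illustrative, and the counts passed in are the three panels of Figure 3.

```python
def rioc(tp, fp, fn, tn):
    """Relative improvement over chance, from the AC, RC and MC definitions."""
    total = tp + fp + fn + tn
    actual_correct = tp + tn
    # Correct predictions expected by chance, given the two margins.
    random_correct = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total
    # Best achievable number of correct predictions, given the two margins.
    maximum_correct = tp + tn + 2 * min(fn, fp)
    # Undefined (division by zero) if the margins leave no room above chance.
    return (actual_correct - random_correct) / (maximum_correct - random_correct)


for panel, counts in [("3A", (8, 4, 33, 42)),
                      ("3B", (20, 13, 21, 33)),
                      ("3C", (34, 28, 7, 18))]:
    print(panel, round(rioc(*counts), 3))
```

Because MC depends on the margins, RIOC can rise even when the total number of errors rises, which is worth keeping in mind when comparing cutoff points.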

Substituting for AC, RC, and MC in the above equation, RIOC reduces to:

$$\mathrm{RIOC} = \frac{TP \cdot TN - FP \cdot FN}{[TP + \min(FN, FP)][TN + \min(FN, FP)]}.$$

From the relationships presented earlier, RIOC can also be expressed in terms of the base rate and the selection ratio. Substituting in the denominator, RIOC reduces to:

$$\mathrm{RIOC} = \frac{TP \cdot TN - FP \cdot FN}{T^{2}[\min(BR, SR) - BR \cdot SR]}.$$

A commonly used measure of association for 2 x 2 classifications such as Figure 3 is $\phi$, which is the product moment correlation coefficient for dichotomous variables. In the notation of Figure 3,

$$\phi = \frac{TP \cdot TN - FP \cdot FN}{[(TP + FP)(TP + FN)(FP + TN)(FN + TN)]^{1/2}}.$$

Expressing the denominator in terms of BR and SR, $\phi$ reduces to:

$$\phi = \frac{TP \cdot TN - FP \cdot FN}{T^{2}(BR \cdot SR - BR \cdot SR^{2} - BR^{2} \cdot SR + BR^{2} \cdot SR^{2})^{1/2}}.$$

The relationship between RIOC and $\phi$ is therefore:

$$\mathrm{RIOC} = \phi\,\frac{(BR \cdot SR - BR \cdot SR^{2} - BR^{2} \cdot SR + BR^{2} \cdot SR^{2})^{1/2}}{\min(BR, SR) - BR \cdot SR}.$$

However, if the base rate BR equals the selection ratio SR, an important result follows. Substituting BR for SR:

$$\mathrm{RIOC} = \phi\left\{\frac{BR - BR^{2}}{BR - BR^{2}}\right\},$$

i.e., RIOC = $\phi$. By using any of the above formulae, it can be calculated that for Figure 3A, RIOC = .3696 (or 37.0 percent) and $\phi$ = .157; Figure 3B, RIOC = .2549 (or 25.5 percent) and $\phi$ = .211; Figure 3C, RIOC = .4056 (or 40.6 percent) and $\phi$ = .243.

The above set of results suggests that care should be exercised when using RIOC or $\phi$. The measure RIOC is less (25.5 percent) for cutoff point .5 and above, as depicted in Figure 3B, than for the other two cutoff points: .7 and above, Figure 3A, and .3 and above, Figure 3C; $\phi$, too, is lower than for Figure 3C, although it is greater than for Figure 3A. However, the total number of errors in Figure 3B is 34, 1 less than for Figure 3C and 3 less than for Figure 3A.

It should also be emphasized perhaps that the measures discussed are merely point estimates.
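The point estimates quoted for Figure 3 can be reproduced with a short sketch. The function names are illustrative; `rioc_from_phi` uses the scaling factor written in terms of BR and SR, with the square-root term factored as BR(1 - BR)SR(1 - SR), which is algebraically the same expression.

```python
import math


def phi(tp, fp, fn, tn):
    """Product moment correlation for a 2 x 2 classification."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (fp + tn) * (fn + tn))
    return (tp * tn - fp * fn) / denominator


def rioc_from_phi(tp, fp, fn, tn):
    """RIOC obtained by rescaling phi with the base rate and selection ratio."""
    total = tp + fp + fn + tn
    br = (tp + fn) / total
    sr = (tp + fp) / total
    scale = math.sqrt(br * (1 - br) * sr * (1 - sr)) / (min(br, sr) - br * sr)
    return phi(tp, fp, fn, tn) * scale


for panel, (tp, fp, fn, tn) in [("3A", (8, 4, 33, 42)),
                                ("3B", (20, 13, 21, 33)),
                                ("3C", (34, 28, 7, 18))]:
    print(panel,
          "phi=%.3f" % phi(tp, fp, fn, tn),
          "RIOC=%.3f" % rioc_from_phi(tp, fp, fn, tn),
          "total errors=%d" % (fp + fn))
```

The scaling factor exceeds 1 whenever BR and SR differ, so RIOC is at least as large as $\phi$ in absolute value, and the gap widens as the margins become more unequal, as in Figure 3A.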
Another study that found, for instance, 340 errors in a sample of 870 subjects would give the same total error rate, although it would be considered as more accurate since it is derived from a larger sample. This suggests the construction of confidence intervals around these estimates to get a range of plausible values. Invariably criminologists have not presented confidence intervals for their estimates although they are relatively straightforward to calculate. Tables exist for binomial confidence intervals, but for large samples the normal approximation may be used. The standard deviation is given by:

$$\mathrm{S.D.} = \left[\frac{(n/N)(1 - n/N)}{N}\right]^{1/2},$$

where n is the numerator and N the denominator of the rate. In this example the 95 percent confidence limits for the total error rate are .289 and .493.

Before leaving this section there is just one final point that we would like to make. A criticism of the measures so far discussed is that they do not reflect the relative seriousness of the different types of outcome but assign equal value to true and false positives and true and false negatives. In practice, and dependent on the issues under consideration, it is usually the case that the consequence of one type of outcome is more important than another. Had our interest in the previous section been to minimize false positives,

say, rather than to minimize total errors, we could have used the approach outlined there to guide our choice of cutoff point. However, decision theory provides a more direct framework for taking into account the weights to be attached to different types of outcome. Although the decision-theory approach has been widely advocated in criminological applications (e.g., Loeber and Dishion, 1983), it has not been used to any great extent, except by Blumstein, Farrington, and Moitra (1985). While it is outside the scope of this paper to discuss decision theory in any detail, we would recommend that more attention be paid to it in prediction research, especially when the results are to be applied in practice.

SAMPLE-REUSE METHODS

Previous sections of the paper have stressed the distinction between retrospective fit and prospective (or validation) fit of a prediction instrument. A simple way of carrying out a prospective validation, and the one most commonly used in criminology, is the split-half method, which divides the data into two halves (at random). The equation is fitted to the first half (the construction sample) and tested on the second (the validation sample). Although unbiased estimates of shrinkage and error rates result from this method, there are two obvious disadvantages: (a) construction of the prediction instrument does not use all available information, but only half the sample, and (b) the comparability of the two subsamples will always be open to doubt; for example, there is a 1-in-20 chance that the two subsamples will be significantly different at the 5 percent level. Various techniques have been developed in the statistical literature to overcome these two problems. The principle underlying them is to generate many subsamples rather than merely two.

The first, simple extension of the principle is cross-validation, of which the split-half method is merely a special case. To construct and validate the prediction instrument, the sample need not be split in half but could, instead, be split in many different ways; for example, 80 percent of the sample could be used for the construction sample and the remaining 20 percent could form the validation sample. Moreover, any number of construction and validation subsamples could be drawn. The jackknife and the bootstrap techniques are more formal developments of this latter idea.

The jackknife (see, for example, R. G. Miller, 1974), or "hold-one-out," proceeds as follows. Suppose the sample has N members; delete one member and develop the prediction instrument on the remaining N - 1 and use it to predict y for the missing member. The procedure is repeated N times, a different member being omitted each time. By this means a set of independent values of y and $\hat{y}$ are obtained, and shrinkage and error rates can be calculated using the methods presented earlier as if these values related to a completely new sample of N cases.²

² These ideas can be extended to other problems relevant to the construction of prediction instruments; Mabbett, Stone, and Washbrook (1980), for instance, consider the stepwise choice of variables in forming a binary predictor.

The bootstrap technique (Efron, 1982) proceeds slightly differently. If sampling with replacement is permitted, a large number of samples of size N can be drawn, 2N as opposed to only N by the jackknife procedure. The bootstrap replications can be used to assess the prediction instrument. The method is illustrated by an example given in Efron and Gong (1981, 1983) that is analogous to many criminological prediction studies. Efron and Gong were concerned to construct an instrument to predict whether patients

suffering from acute hepatitis would survive or die. There were 155 patients in the sample, 33 of whom died. There were 19 independent variables available for analysis. A prediction instrument was developed in the usual way. First, only x variables associated at the 5 percent level were retained; this left 13 variables. Second, a kind of forward, stepwise, multiple-logistic-regression program was used, stopping when no additional variable achieved the 5 percent significance level. Four of the 13 variables were included in the final prediction instrument. The cutoff point c was set at c = log(33/122). Full information was available for 133 of the original 155 patients. When the prediction instrument was applied to the 133 patients, 21 were misclassified, giving an error rate of 21/133 = .158.

The bootstrap technique was then used to assess how overoptimistic this error rate was or how much it could be expected to shrink. Five hundred bootstrap samples were drawn and the same procedure was used to construct a prediction instrument. On each occasion the "overoptimism random variable," R', was calculated, which is merely "the error rate for the bootstrap replication minus .158." The 500 values of R' were plotted and the mean of R' was found to be .045, which suggests that the expected overoptimism is about one-third as large as the apparent error rate .158. This gives the bias-corrected estimated error rate .158 + .045 = .203. In addition, the standard deviation of R' was .036.

Another advantage of the bootstrap technique is illustrated by this example. At each replication a check was made of the variables included in the prediction instrument and this revealed, for example, that one variable was selected 37 percent of the time, another 59 percent of the time, and so on, giving an intuitive, if not theoretically rigorous, indication of the importance of the various predictor variables.
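The bootstrap correction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic, a one-variable threshold rule stands in for the stepwise logistic regression used by Efron and Gong, and optimism is taken as the error of the refitted rule on the original data minus its apparent error on the bootstrap sample, a common formulation that may differ in detail from theirs. All function names are hypothetical.

```python
import random


def fit_threshold(data):
    """Pick the cutoff on x that minimises the training error of the
    rule 'predict 1 if x > cutoff' (a stand-in for a fitted instrument)."""
    best_cutoff, best_errors = None, len(data) + 1
    for cutoff in sorted({x for x, _ in data}):
        errors = sum((x > cutoff) != bool(y) for x, y in data)
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff


def error_rate(cutoff, data):
    """Proportion of cases misclassified by the threshold rule."""
    return sum((x > cutoff) != bool(y) for x, y in data) / len(data)


def bootstrap_optimism(data, replications=100, seed=0):
    """Average over bootstrap samples of (error on original data minus
    apparent error on the bootstrap sample) for the refitted rule."""
    rng = random.Random(seed)
    n = len(data)
    differences = []
    for _ in range(replications):
        sample = [data[rng.randrange(n)] for _ in range(n)]  # with replacement
        cutoff = fit_threshold(sample)  # repeat the whole fitting procedure
        differences.append(error_rate(cutoff, data) - error_rate(cutoff, sample))
    return sum(differences) / replications


# Synthetic, purely illustrative data: the outcome is more likely when x > 0.
generator = random.Random(1)
data = []
for _ in range(120):
    x = generator.gauss(0.0, 1.0)
    y = 1 if generator.random() < (0.7 if x > 0 else 0.2) else 0
    data.append((x, y))

apparent = error_rate(fit_threshold(data), data)
optimism = bootstrap_optimism(data)
print("apparent error rate:", round(apparent, 3))
print("estimated optimism:", round(optimism, 3))
print("bias-corrected error rate:", round(apparent + optimism, 3))
```

The essential design point is that the entire construction procedure, including any variable selection, is repeated inside each replication; otherwise the optimism due to selection is not captured.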
Technical details of sample-reuse methods are given in Efron (1982), and simplified descriptions appear in Diaconis and Efron (1983) and Efron and Gong (1983). Comparing and contrasting the various methods, split-half or cross-validation methods are the simplest to perform but have certain limitations. The advent of computer power and the increasing availability of appropriate algorithms make the jackknife and the bootstrap methods more attractive and relatively easy to use. The jackknife and the bootstrap are in fact theoretically closely related: the jackknife is almost a bootstrap itself. The bootstrap is entirely nonparametric and is, therefore, more flexible. Efron (1982) suggests that the jackknife performs less well than the bootstrap in situations that he has investigated but it requires less computation. The close relation between sample-reuse methods and Copas's theory of shrinkage and validation was discussed earlier.

CONCLUSIONS

At the beginning of this paper we showed how simple point-scoring methods could be incorporated within the framework of general linear models, along with regression, logistic regression, and log-linear models. In addition, we noted that point-scoring methods, reconceptualized in the way we suggest, permit certain extensions that have been found useful in medical diagnosis.

It has long been recognized and empirically demonstrated that a prediction instrument developed on one sample will perform less well when applied to a subsequent sample. The phenomenon of shrinkage has recently been subjected to rigorous theoretical investigation, which we outlined. The findings stemming from this work enable the researcher to understand and anticipate the degree of shrinkage that can be expected in any study and, where necessary, to make any adjustments to (or preshrink) the prediction equation.

To examine shrinkage in practice, researchers have tended to use split-half subsamples. We pointed out the range of other and superior "sample-reuse" methods, including the jackknife and the bootstrap.

The usefulness of a prediction instrument can also be gauged by the number of errors and correct decisions that result from its application. We pointed out the similarity between many of the indices that have been proposed to assess the utility of a risk classification. In addition, we showed the importance of the base rate and the selection ratio in determining false-positive and false-negative errors and how the selection ratio can be set to alter the balance between the two.

When predicting rare events it may be the case that any prediction instrument will not improve significantly over the base rate. For example, a prediction instrument developed to identify "dangerous" offenders may result in more errors than occur by merely classifying all offenders as not dangerous. This has led some commentators to eschew attempts to predict these kinds of events. An analogous situation occurs in medical science, where mass-screening programs are costly and may result in large false-positive errors, causing considerable stress, but where they are nevertheless considered to be worthwhile to detect the small number of true positives who actually have the rare disease. Therefore, the worth of any prediction instrument depends on the values to be attached to the various outcomes emanating from its application, not simply on the total number of errors that may accrue. Decision theory provides a framework for making these assessments and could be used more widely in prediction in criminology.

REFERENCES

Blumstein, A., Farrington, D. P., and Moitra, S.
1985 Delinquent careers: innocents, desisters and persisters. Pp. 187-219 in M. Tonry and N. Morris, eds., Crime and Justice. Vol. 6. Chicago, Ill.: University of Chicago Press.

Bottoms, A. E., and McClintock, F. H.
1973 Criminals Coming of Age. London, England: Heinemann.

Brown, P. J., and Zidek, J. V.
1980 Adaptive multivariate ridge regression. Annals of Statistics 8:64-74.

Burgess, E. W.
1928 Factors determining success or failure on parole. In A. A. Bruce, A. J. Harno, E. W. Burgess, and J. Landesco, eds., The Workings of the Indeterminate-Sentence Law and the Parole System in Illinois. Springfield, Ill.: Illinois State Board of Parole.

Copas, J. B.
1983a Plotting p against x. Applied Statistics 32:25-31.
1983b Regression, prediction and shrinkage (with discussion). Journal of the Royal Statistical Society, Series B 45:311-354.
1984 Cross-validation Shrinkage of Regression Predictors. Research Report, Department of Statistics. Birmingham, England: University of Birmingham.

Copas, J. B., and Whiteley, J. S.
1976 Predicting success in the treatment of psychopaths. British Journal of Psychiatry 129:388-392.

Diaconis, P., and Efron, B.
1983 Computer-intensive methods in statistics. Scientific American 248(5):96-108.

Duncan, O. D., Ohlin, L. E., Reiss, A. J., and Stanton, H. R.
1953 Formal devices for making selection decisions. American Journal of Sociology 58:573-584.

Dunn, C. S.
1981 Prediction problems and decision logic in longitudinal studies of delinquency. Criminal Justice and Behavior 8:439-476.

Efron, B.
1982 The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia, Pa.: Society for Industrial and Applied Mathematics.

Efron, B., and Gong, G.
1981 Statistical Theory and the Computer. Unpublished manuscript. Department of Statistics, Stanford University, Calif.
1983 A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician 37(1):36-48.

Farrington, D. P., and Tarling, R., eds.
1985 Prediction in Criminology. Albany, N.Y.: SUNY Press.

Fergusson, D. M., Fifield, J. K., and Slater, S. W.
1977 Signal detectability theory and the evaluation of prediction tables. Journal of Research in Crime and Delinquency 14:237-246.

Fielding, A.
1979 Binary segmentation. In C. A. O'Muircheartaigh and C. Payne, eds., Exploring Data Structure. Vol. 1 of The Analysis of Survey Data. London, England: John Wiley.

Glueck, S., and Glueck, E. T.
1950 Unraveling Juvenile Delinquency. Cambridge, Mass.: Harvard University Press.

Goodman, L. A., and Kruskal, W. H.
1963 Measures of association for cross classifications III. Journal of the American Statistical Association 58:310-364.

Gottfredson, D. M., Wilkins, L. T., and Hoffman, P. B.
1978 Guidelines for Parole and Sentencing. Lexington, Mass.: Lexington Books.

Gottfredson, S. D., and Gottfredson, D. M.
1985 Screening for risk among parolees. Pp. 54-77 in D. P. Farrington and R. Tarling, eds., Prediction in Criminology. Albany, N.Y.: SUNY Press.

Greenwood, P. W.
1982 Selective Incapacitation. Santa Monica, Calif.: Rand Corporation.

Jones, M. C., and Copas, J. B.
1985 On the Robustness of Shrinkage Predictors in Regression: Exemplifying and Using the Theory. Research report. Department of Statistics, University of Birmingham, England.
In press On the Robustness of Shrinkage Predictors in Regression: Some Theoretical Considerations. Journal of the Royal Statistical Society, Series B 48.

Kendall, M. G.
1970 Rank Correlation Methods. London: Griffin.

Loeber, R., and Dishion, T. J.
1982 Strategies for Identifying At-Risk Youths. Unpublished report. Oregon Social Learning Center, Eugene.
1983 Early predictors of male delinquency: a review. Psychological Bulletin 94:68-99.

Mabbett, A., Stone, M., and Washbrook, J.
1980 Cross-validatory selection of binary variables in differential diagnosis. Applied Statistics 29:198-204.

Miller, A. J.
1984 Selection of subsets of regression variables (with discussion). Journal of the Royal Statistical Society, Series A 147:389-425.

Miller, R. G.
1974 The jackknife - a review. Biometrika 61(1):1-15.

Nuttall, C. P., et al.
1977 Parole in England and Wales. Home Office Research Study No. 38. London, England: Her Majesty's Stationery Office.

Simon, F. H.
1971 Prediction Methods in Criminology. Home Office Research Study No. 7. London, England: Her Majesty's Stationery Office.

Tarling, R.
1982 Comparison of measures of predictive power. Educational and Psychological Measurement 42:479-487.

Tarling, R., and Perry, J. A.
1985 Statistical methods in criminological prediction. Pp. 210-231 in D. P. Farrington and R. Tarling, eds., Prediction in Criminology. Albany, N.Y.: SUNY Press.

Titterington, D. M., Murray, G. D., Murray, L. S., Spiegelhalter, D. J., Skene, A. M., Habbema, J. D. F., and Gelpke, G. J.
1981 Comparison of discrimination techniques applied to a complex data set of head injured patients. Journal of the Royal Statistical Society, Series A 144:145-175.

Wilbanks, W. L.
1985 Predicting failures on parole. Pp. 78-94 in D. P. Farrington and R. Tarling, eds., Prediction in Criminology. Albany, N.Y.: SUNY Press.


