K
Statistical Analysis of Bullet Lead Data

By Karen Kafadar and Clifford Spiegelman

1. INTRODUCTION

The current procedure for assessing a "match" (analytically indistinguishable chemical compositions) between a crime-scene (CS) bullet and a potential suspect's (PS) bullet starts with three pieces from each bullet or bullet fragment. Nominally, each piece is measured in triplicate with inductively coupled plasma-optical emission spectrophotometry (ICP-OES) on seven elements: As, Sb, Sn, Cu, Bi, Ag, and Cd, against three standards. Analyses in previous years measured three to six elements, and in some cases fewer than three pieces can be extracted from a bullet or bullet fragment. Parts of the analysis below will consider fewer than seven elements, but we will always assume measurements on three pieces in triplicate. The three replicates on each piece are averaged, and then means, standard deviations (SDs), and ranges (minimum to maximum) over the three pieces are calculated for each element for all CS and PS bullets. Throughout this appendix, the three averages (from the triplicate readings) on the three pieces are denoted the three "measurements" (even though occasionally very small bullet fragments may not have yielded three measurements).

Once the chemical analysis has been completed, a decision must be based on the measurements. Are the data consistent with the hypothesis that the mean chemical concentrations of the two bullets are the same or different? If the data suggest that the mean chemical concentrations are the same, the bullets or fragments are assessed as "analytically indistinguishable." Intuitively, it makes sense that if the seven average concentrations (over the three measurements) of the CS bullet are "far" from those of the PS bullet, the data would be deemed more consistent with the hypothesis of "no match." But if the seven averages are "close," the data would be more consistent with the hypothesis that the two bullets "match." The role of statistics is to determine how close, that is, to determine limits beyond which the bullets are deemed to have come from sources that have different mean concentrations and within which they are deemed to have come from sources that have the same mean concentrations.




1.1 Statistical Hypothesis Tests

The classical approach to deciding between the two hypotheses was developed in the 1930s. The standard hypothesis-testing procedure consists of the following steps (a minimal numerical sketch follows the list):

- Set up the two hypotheses. The "assumed" state of affairs is generally the null hypothesis, for example, "drug is no better than placebo." In the compositional analysis of bullet lead (CABL) context, the null hypothesis is "bullets do not match," or "mean concentrations of the materials from which these two bullets were produced are not the same" (assume "not guilty"). The converse is called the alternative hypothesis, for example, "drug is effective" or, in the CABL context, "bullets match" or "mean concentrations are the same."

- Determine an acceptable level of risk of rejecting the null hypothesis when it is actually true. The level is set according to the circumstances. Conventional values in many fields are 0.05 and 0.01; that is, in one of 20 or one of 100 cases in which the test is conducted, it will erroneously decide in favor of the alternative hypothesis ("bullets match") when the null hypothesis was actually correct ("bullets do not match"). The preset level is considered inviolate; a procedure will not be considered if its "risk" exceeds it. We consider below tests with desired risk levels of 0.30 to 0.0004. (The value of 0.0004 is equivalent to 1 in 2,500, thought by the FBI to be the current level.)

- Calculate a quantity based on the data (for example, involving the sample mean concentrations of the seven elements in the two bullets), known as a test statistic. The value of the test statistic will be used to test the null hypothesis against the alternative hypothesis.

- The preset level of risk and the test statistic together define two regions, corresponding to the two hypotheses. If the test statistic falls in one region, the decision is to fail to reject the null hypothesis; if it falls in the other region (called the critical region), the decision is to reject the null hypothesis and conclude the alternative hypothesis. The critical region has the following property: over the many times that this protocol is followed, the probability of falsely rejecting the null hypothesis does not exceed the preset level of risk.
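As a purely illustrative sketch of how a preset level of risk defines a critical region, the following Python fragment simulates the null distribution of a simple test statistic and takes the cutoff at the (1 - alpha) quantile. The 5% risk level, sample size, and concentration values are assumptions, and the conventional null hypothesis "the two means are equal" is used here; the CABL procedures recommended in Section 4 reverse the roles of the two hypotheses.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 0.05     # preset level of risk (illustrative)
    n = 3            # measurements per bullet, as in the protocol

    # Distribution of the test statistic |difference of sample means| when the
    # null hypothesis is true (both bullets share the same mean concentration).
    null_stats = np.empty(100_000)
    for t in range(null_stats.size):
        x = rng.normal(loc=100.0, scale=2.5, size=n)   # hypothetical concentrations
        y = rng.normal(loc=100.0, scale=2.5, size=n)
        null_stats[t] = abs(x.mean() - y.mean())

    # Critical region: statistics beyond the (1 - alpha) quantile.  By construction,
    # the null hypothesis is falsely rejected in roughly 5% of cases.
    critical_value = np.quantile(null_stats, 1 - alpha)
    print(f"reject the null hypothesis when |mean difference| > {critical_value:.2f}")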

The recommended test procedure in Section 4 has a further property: if the alternative hypothesis holds, the procedure will have the greatest chance of correctly rejecting the null hypothesis.

The FBI protocol worked in reverse. Three test procedures were proposed, described below as "2-SD overlap," "range overlap," and "chaining." Thus, the first task of the authors was to calculate the level of risk that would result from the use of these three procedures. More precisely, we developed a simulation, guided by information about bullet concentrations from various sources and from data sets that were published or provided to the committee (described in Section 3.2), to calculate the probability that the 2-SD-overlap and range-overlap procedures would claim a match between two bullets whose mean concentrations differed by a specified amount. The details of that simulation and the resulting calculations are described in Section 3.3, along with a discussion of chaining.

An alternative approach, based on the theory of equivalence t tests, is presented in Section 4. A level of risk is set for each equivalence t test used to compare two bullets on each of the seven elemental concentrations; if the mean concentrations of all seven elements are sufficiently close, the overall false-positive probability (FPP) of a match between two bullets that actually differ is less than 0.0004 (one in 2,500). The method is described in detail so that the reader can apply it with another value of the FPP, such as one in 500 or one in 10,000. A multivariate version of the seven separate tests (Hotelling's T2) is also described. Details of the statistical theory are provided in the other appendixes: Appendix E contains basic principles of statistics; Appendix F provides a theoretical derivation that characterizes the FBI procedures and the equivalence tests, along with some additional analyses not shown in this appendix; Appendix H describes the principal-component analysis for assessing the added contribution of each element for purposes of discrimination; and Appendix G provides further analyses of the data sets.

1.2 Current Match Procedure

The FBI presented three procedures for assessing a match between two bullets (a small computational sketch of the first two follows the description of chaining):

- "2-SD overlap." Measurements of each element are combined to form an interval with lower limit mean - 2SD and upper limit mean + 2SD. The means and SDs are based on the three measurements (the averages of the triplicate readings) on each of the specimens. If all seven intervals for a given CS bullet overlap the corresponding seven intervals for a given PS bullet, the CS and PS bullets are deemed a match.

- "Range overlap." Intervals for each element are calculated as the minimum to the maximum of the three measurements on each of the specimens. If all seven intervals for a given CS bullet overlap the corresponding seven intervals for a given PS bullet, the CS and PS bullets are deemed a match.

- Chaining. As described in the FBI Laboratory document Comparative Elemental Analysis of Firearms Projectile Lead by ICP-OES (Ref. 1, pp. 10-11):

  a. CHARACTERIZATION OF THE CHEMICAL ELEMENT DISTRIBUTION IN THE KNOWN PROJECTILE LEAD POPULATION

  The mean element concentrations of the first and second specimens in the known material population are compared based upon twice the measurement uncertainties from their replicate analysis. If the uncertainties overlap in all elements, they are placed into a composition group; otherwise they are placed into separate groups. The next specimen is then compared to the first two specimens, and so on, in the same manner until all of the specimens in the known population are placed into compositional groups. Each specimen within a group is analytically indistinguishable for all significant elements measured from at least one other specimen in the group and is distinguishable in one or more elements from all the specimens in any other compositional group. (It should be noted that occasionally in groups containing more than two specimens, chaining occurs. That is, two specimens may be slightly separated from each other, but analytically indistinguishable from a third specimen, resulting in all three being included in the same compositional group.)

  b. COMPARISON OF UNKNOWN SPECIMEN COMPOSITION(S) WITH THE COMPOSITION(S) OF THE KNOWN POPULATION(S)

  The mean element concentrations of each individual questioned specimen are compared with the element concentration distribution of each known population composition group. The concentration distribution is based on the mean element concentrations and twice the standard deviation of the results for the known population composition group. If all mean element concentrations of a questioned specimen overlap within the element concentration distribution of one of the known material population groups, that questioned specimen is described as being "analytically indistinguishable" from that particular known group population.

The SD of the "concentration distribution" is calculated as the SD of the averages (over the three measurements for each bullet) from all bullets in the "known population composition group." In Ref. 2, the authors (Peele et al., 1991) apply this "chaining algorithm" to intervals formed by the ranges (minimum and maximum of the three measurements) rather than to (mean ± 2SD) intervals.
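To make the two interval rules concrete, here is a small sketch (not FBI code) that applies the 2-SD-overlap and range-overlap checks to two bullets, each represented by its three measurements per element. The numeric values are the Sb and Cu measurements for Federal bullets F001 and F002 shown in Table K.1 below; the dictionary layout and function names are illustrative assumptions.

    import numpy as np

    def intervals(meas):
        """Return the (mean - 2SD, mean + 2SD) and (min, max) intervals
        for one element of one bullet, given its three measurements."""
        m = np.asarray(meas, dtype=float)
        mean, sd = m.mean(), m.std(ddof=1)      # SD with divisor n - 1 = 2
        return (mean - 2 * sd, mean + 2 * sd), (m.min(), m.max())

    def overlap(a, b):
        """True if two closed intervals overlap."""
        return a[0] <= b[1] and b[0] <= a[1]

    def compare(bullet1, bullet2):
        """Apply both overlap rules element by element; a 'match'
        requires overlap on every measured element."""
        two_sd_match, range_match = True, True
        for elem in bullet1:
            i1_sd, i1_rg = intervals(bullet1[elem])
            i2_sd, i2_rg = intervals(bullet2[elem])
            two_sd_match &= overlap(i1_sd, i2_sd)
            range_match &= overlap(i1_rg, i2_rg)
        return two_sd_match, range_match

    # Sb and Cu measurements for Federal bullets F001 and F002 (Table K.1).
    f001 = {"Sb": [29276, 29506, 29000], "Cu": [285, 275, 283]}
    f002 = {"Sb": [28996, 28833, 28893], "Cu": [278, 279, 282]}
    print(compare(f001, f002))   # 2-SD intervals overlap; range intervals fail on Sb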

The "2-SD overlap" and "range-overlap" procedures are illustrated with data from an FBI-designed study of elemental concentrations of bullets from different boxes (Ref. 2). The three measurements on each of three pieces for each of seven elements (in units of parts per million, ppm) are shown in Table K.1 below for bullets F001 and F002 from one of the boxes of bullets provided by Federal Cartridge Company (described in more detail in Section 3.2). Each piece was measured three times against three different standards; only the average is provided, and in this report it is called the "measurement."

TABLE K.1 Illustration of Calculations for 2-SD-Overlap and Range-Overlap Methods on Federal Bullets F001 and F002 (Concentrations in ppm)

Federal Bullet F001
             icpSb      icpCu    icpAg    icpBi    icpAs     icpSn
a            29276      285      64       16       1415      1842
b            29506      275      74       16       1480      1838
c            29000      283      66       16       1404      1790
mean         29260.67   281.00   68.00    16       1433.00   1823.33
SD           253.35     5.29     5.29     0        41.07     28.94
mean - 2SD   28753.97   270.42   57.42    16       1350.85   1765.46
mean + 2SD   29767.36   291.58   78.58    16       1515.15   1881.21
minimum      29000      275      64       16       1404      1790
maximum      29506      285      74       16       1480      1842

Federal Bullet F002
             icpSb      icpCu    icpAg    icpBi    icpAs     icpSn
a            28996      278      76       16       1473      1863
b            28833      279      67       16       1439      1797
c            28893      282      77       15       1451      1768
mean         28907.33   279.67   73.33    15.67    1454.33   1809.33
SD           82.44      2.08     5.51     0.58     17.24     48.69
mean - 2SD   28742.45   275.50   62.32    14.51    1419.84   1711.96
mean + 2SD   29072.21   283.83   84.35    16.82    1488.82   1906.71
minimum      28833      278      67       15       1439      1768
maximum      28996      282      77       16       1473      1863

Table K.1 shows the three measurements, their means, their SDs (equal to the square root of the sum of the three squared deviations from the mean divided by 2), the "2-SD interval" (mean - 2SD to mean + 2SD), and the "range interval" (minimum to maximum). For all seven elements, the 2-SD interval for Federal bullet 1 overlaps with the 2-SD interval for Federal bullet 2; equivalently, the difference between the means is less than twice the sum of the two SDs. For example, the 2-SD interval for Cu in bullet 1 is (270.42, 291.58), and the interval for Cu in bullet 2 is (275.50, 283.83), which lies completely within the Cu 2-SD interval for bullet 1. Equivalently, the difference between the means (281.00 and 279.67) is 1.33, less than 2(5.29 + 2.08) = 14.74. Thus, the 2-SD-overlap procedure would conclude that the two bullets are analytically indistinguishable (Ref. 3) on all seven elements, so the bullets would be claimed to be analytically indistinguishable.

The range-overlap procedure would find the two bullets analytically indistinguishable on all elements except Sb, because for all other elements the range interval for bullet 1 overlaps with the corresponding interval for bullet 2; for example, for Cu, (275, 285) overlaps with (278, 282), but for Sb the range interval (29,000, 29,506) just fails to overlap (28,833, 28,996), by only 4 ppm. Hence, by the range-overlap procedure, the bullets would be analytically distinguishable.

2. DESCRIPTION AND ANALYSIS OF DATASETS

2.1 Description of Data Sets

This section describes three data sets made available to the authors in time for analysis. The analysis of these data sets resulted in the following observations:

- The uncertainty in measuring the seven elements is usually 2.0-5.0%.

- The distribution of the measurements is approximately lognormal; that is, logarithms of the measurements are approximately normally distributed. Because the uncertainty in the three measurements on a bullet is small (frequently less than 5%), the lognormal distribution with a small relative SD is similar to a normal distribution. For purposes of comparing the measurements on two bullets, the measurements need not be transformed with logarithms, but it is often more useful to do so.

- The distributions of the concentrations of a given element across many different bullets from various sources are lognormal, with much more variability than is seen from within-bullet measurement error or within-lot error. For purposes of comparing average concentrations across many different bullets, the concentrations should be transformed with logarithms first, and then means and SDs can be calculated. The results can be reported on the original scale by taking antilogarithms, for example, exp(mean of logs).

- The errors in the measurements of the seven elements may not be uncorrelated. In particular, the errors in measuring Sb and Cu appear to be highly correlated (correlation approximately 0.7); the correlation between the errors in measuring Ag and Sb or between the errors in measuring Ag and Cu is approximately 0.3. Thus, if the 2-SD intervals for Sb on two bullets overlap, the 2-SD intervals for Cu may be more likely to overlap also.

These observations will be described in the analysis part of this section. The three data sets studied by the authors are denoted here the "800-bullet data set," the "1,837-bullet data set," and the "Randich et al. data set."

1. 800-bullet data set (Ref. 4): This data set contains triplicate measurements on 50 bullets in each of four boxes from each of four manufacturers—CCI, Federal, Remington, and Winchester—measured as part of a careful study conducted by Peele et al. (1991). Measured elements in the bullet lead were Sb, Cu, and As (measured with neutron activation analysis, NAA) and Sb, Cu, Bi, and Ag (measured with ICP-OES); in the Federal bullet lead, As and Sn were measured with both NAA and ICP-OES. This 800-bullet data set provided individual measurements on the three bullet lead samples, which permitted calculation of means and SDs on the log scale and of within-bullet correlations among six of the seven elements measured with ICP-OES (As, Sb, Sn, Bi, Cu, and Ag); see Section 3.2.

2. 1,837-bullet data set (Ref. 5): The bullets in this data set were extracted from a larger, historical file of 71,000+ bullets analyzed by the FBI Laboratory during the last 15 years. According to the notes that accompanied the data file, the bullets in it were selected to include one bullet (or sometimes more) determined to be distinct from the other bullets in the case; a few are research samples "not associated with any particular case," and a few "were taken from the ammunition collection (again, not associated with a particular case)." The notes that accompanied this data set stated:

  To assure independence of samples, the number of samples in the full data set was reduced by removing multiple bullets from a given known source in each case. To do this, evidentiary submissions were considered one case at a time. For each case, one specimen from each combination of bullet caliber, style, and nominal alloy class was selected and that data was placed into the test sample set. In instances where two or more bullets in a case had the same nominal alloy class, one sample was randomly selected from those containing the maximum number of elements measured.... The test set in this study, therefore, should represent an unbiased sample in the sense that each known production source of lead is represented by only one randomly selected specimen. [Ref. 6]

All bullets in this subset were measured three times (three fragments). Bullets from 1,005 cases between 1989 and 2002 are included; in 578 of these cases, only one bullet was selected. The numbers of cases for which different numbers of bullets were selected are given in Table K.2. The cases that had 11, 14, and 21 bullets were cases 834, 826, and 982, respectively. Because of the way in which these bullets were selected, they do not represent a random sample of bullets from any population—even the population of bullets analyzed by the laboratory. The selection probably produced a data set whose variability among bullets is higher than might be seen in the complete data set or in the population of all manufactured bullets.

TABLE K.2 Number of Cases Having b Bullets in the 1,837-Bullet Data Set

b = no. bullets    1    2    3   4   5   6   7  8  9  10  11  14  21
No. cases         578  238  93  48  24  10  7  1  1   2   1   1   1

Only averages and SDs of the (unlogged) measurements are available, not the three individual measurements themselves, so a precise estimate of the measurement uncertainty (relative SD within bullets) could not be calculated, as it could be for the 800-bullet data set. (One aspect of the nonrandomness of this data set is that it is impossible to determine whether the "selected" bullets tended to have larger or smaller relative SDs (RSDs) than the RSDs of all 71,000+ bullets.) Characteristics of this data set are given in Table K.3. Only Sb and Ag were measured in all 1,837 bullets; all but three of the 980 missing Cd values occurred within the first 1,030 bullets (before 1997). In only 854 of the 1,837 bullets were all seven elements measured; in 522 bullets, six elements were measured (in all but three of the 522, the missing element is Cd); in 372 bullets, only five elements were measured (in all but 10, the missing elements are Sn and Cd); and in 86 bullets, only four elements were measured (in all but eight, the missing elements are As, Sn, and Cd). The data on Cd are highly discrete: of the 572 nonzero measured averages, 139, 96, 40, 48, 32, and 28 bullets showed average Cd concentrations of only 10, 20, 30, 40, 50, and 60 ppm, respectively (0.00001-0.00006). The remaining 189 nonzero Cd concentrations were spread out from 70 to 47,880 ppm (0.00007 to 0.04788). This data set provided some information on the distributions of the averages of the various elements and some correlations between the averages.

Combining the 854 bullets in which all seven elements were measured with the 519 bullets in which all but Cd were measured yielded a subset of 1,373 bullets in which only 519 values of Cd needed to be imputed (estimated from the data). These 1,373 bullets then had measurements on all seven elements. The average Cd concentration in a bullet appeared to be uncorrelated with the average concentration of any other element, so the missing Cd concentration in each of the 519 bullets was imputed by selecting at random one of the 854 Cd values measured in the 854 bullets in which all seven elements were measured (a small sketch of this imputation step follows Table K.3). The 854- and 1,373-bullet subsets were used in some of the analyses below.

3. Randich et al. (2002) data set (Ref. 7): These data come from Table 1 of the article by Randich et al. (Ref. 7). Six elements (all but Cd) were measured in three pieces of wire from 28 lots of wire. The three pieces were selected from the beginning, middle, and end of the wire reel. The analysis of this data set confirms the homogeneity of the material in a lot within measurement error.

TABLE K.3 Characteristics of 1,837-Bullet Data Set

Element                          As     Sb     Sn     Bi     Cu     Ag     Cd
No. bullets with no data         87     0      450    8      11     0      980
No. bullets with data            1,750  1,837  1,387  1,829  1,826  1,837  857
No. bullets with nonzero data    1,646  1,789  838    1,819  1,823  1,836  572
Pooled RSD, %                    2.26   2.20   2.89   0.66   1.48   0.58   1.39
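A minimal sketch of that random ("hot-deck") imputation of the missing Cd averages, assuming the averages are held in simple arrays; the array contents and names are placeholders for illustration, not the committee's code or data.

    import numpy as np

    rng = np.random.default_rng(12345)

    # cd_854: the measured average Cd concentrations (ppm) from the 854 complete bullets.
    cd_854 = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 47880.0])   # illustrative values

    # cd_1373: average Cd for the 1,373-bullet subset, with NaN where Cd was not measured.
    cd_1373 = np.array([10.0, np.nan, 30.0, np.nan, 20.0])                   # illustrative values

    # Impute each missing Cd by drawing, with replacement, one of the measured values.
    missing = np.isnan(cd_1373)
    cd_1373[missing] = rng.choice(cd_854, size=missing.sum(), replace=True)
    print(cd_1373)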

2.2 Lognormal Distributions

The SDs of measurements made with ICP-OES tend to be proportional to their means; hence, one typically refers to the relative standard deviation (RSD), usually expressed as 100% × (SD/mean). When the measurements are first transformed via logarithms, the SD of the log(measurements) is approximately, and conveniently, equal to the RSD on the original scale; that is, the SD on the log scale will be very close to the RSD on the original scale. The mathematical details of this result are given in Appendix E. A further benefit of the transformation is that the resulting transformed measurements have distributions that are much closer to the familiar normal (Gaussian) distribution, an assumption that underlies many classical statistical procedures. The 800-bullet data set therefore allowed calculation of the RSD by calculating the ordinary SD of the logarithms of the measurements. (A small numerical check of this approximation appears at the end of this subsection.)

The bullet means in the 1,837-bullet data set tend to be lognormally distributed, as shown by the histograms in Figures 3.1-3.4. The data on log(Sn) show two modes, and the data on Sb are split into Sb < 0.05 and Sb > 0.05; the histograms suggest that the concentrations of Sb and Sn in this data set consist of mixtures of lognormal distributions. Carriquiry et al. (Ref. 8) also used lognormal distributions in analyzing the 800-bullet data set. Calculating means and SDs on the log scale was not possible with the 1,837-bullet data set, because only the means and SDs of the three measurements are given. However, when the RSD is very small (say, less than 5%), the difference between the lognormal and normal distributions is very small; that was true of the three measurements of As, Sb, Bi, Cu, and Ag for about 80% of the bullets in the 1,837-bullet data set.
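As a quick numerical check of the claim that the SD of the logarithms approximates the RSD when the RSD is small, the following sketch simulates lognormal measurements with a 3% RSD; the concentration, seed, and sample size are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    true_mean, rsd = 280.0, 0.03          # e.g., Cu near 280 ppm with a 3% RSD

    # Lognormal measurements whose SD is roughly proportional to the mean.
    x = true_mean * rng.lognormal(mean=0.0, sigma=rsd, size=100_000)

    rsd_original_scale = x.std(ddof=1) / x.mean()
    sd_log_scale = np.log(x).std(ddof=1)
    print(round(rsd_original_scale, 4), round(sd_log_scale, 4))   # both approximately 0.03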

2.3 Within-Bullet Variances and Covariances

800-Bullet Data Set

From the 800-bullet data set, which contains the three measurements on each bullet (not just their mean and SD), one can estimate the measurement SD in each set of three measurements. As mentioned above, when the RSD is small, the lognormally distributed measurement error will have a distribution that is close to normal. The within-bullet covariances shown below were calculated on the log-transformed measurements (results on the untransformed measurements were very similar). The 800-bullet data set (200 bullets from each of four manufacturers) permits estimation of the within-bullet variances and covariances for each manufacturer as

    s_jl = [1 / (2 × 200)] Σ_{k=1..200} Σ_{i=1..3} (x_ijk − x̄_jk)(x_ilk − x̄_lk),     (1)

where x_ijk denotes the logarithm of the ith measurement (i = 1, 2, 3; called "a, b, c" in the data file) of element j in bullet k, and x̄_jk is the mean of the three log(measurements) of element j in bullet k. When l = j, the formula for s_jj reduces to a pooled within-bullet sample variance for the jth element; compare Equations E.2 and E.3 in Appendix E. Because s_jj is based on within-bullet SDs from 200 bullets, the square root of s_jj (called a pooled standard deviation) provides a more accurate and precise estimate of the measurement uncertainty than an SD based on only one bullet with three measurements (see Appendix F).
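A compact sketch of Equation 1, assuming one manufacturer's bullets are held in an array of shape (200 bullets × 3 measurements × number of elements); the array layout, names, and simulated values are illustrative assumptions rather than the committee's code.

    import numpy as np

    def pooled_within_bullet_cov(logged):
        """Pooled within-bullet covariance matrix in the spirit of Equation 1.

        `logged` has shape (n_bullets, 3, n_elements) and holds the logarithms
        of the three measurements of each element for each bullet.
        """
        n_bullets, n_meas, _ = logged.shape
        centered = logged - logged.mean(axis=1, keepdims=True)   # x_ijk - mean_jk
        # Sum the cross-products over bullets and replicates, divide by (n - 1) per bullet.
        return np.einsum("kij,kil->jl", centered, centered) / (n_bullets * (n_meas - 1))

    # Simulated data standing in for one manufacturer's 200 bullets,
    # 3 log-measurements each, on 5 elements (purely illustrative).
    rng = np.random.default_rng(2)
    data = rng.normal(size=(200, 3, 5)) * 0.02
    S = pooled_within_bullet_cov(data)
    sd = np.sqrt(np.diag(S))        # pooled SDs; on the log scale these approximate the RSDs
    R = S / np.outer(sd, sd)        # within-bullet correlation matrix (covariance / product of SDs)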

The within-bullet covariance matrices were estimated separately for each manufacturer, on both the raw (untransformed) and the log-transformed scales, for Sb, Cu, Bi, and Ag (measured with ICP-OES for all four manufacturers) and As (measured with NAA for all four manufacturers). Only the variances and covariances calculated on the log scale are shown in Table K.4, because the square roots of the variances (diagonal terms) are then estimates of the RSDs. (These RSDs differ slightly from those cited in Table 2.2 in Chapter 2.) The within-bullet covariance matrices are pooled (averaged) across manufacturers, and the correlation matrix is derived in the usual way: the correlation between elements i and j equals the covariance divided by the product of the SDs, that is, r_ij = s_ij / √(s_ii s_jj). (The correlation matrix based on the untransformed data is very similar.) As and Sn were also measured with ICP-OES on the Federal bullets only, so the 6 × 6 within-bullet variances and covariances, and the within-bullet correlations among those six measurements, are given in Appendix F.

TABLE K.4 Within-Bullet Covariances (× 10^5), by Manufacturer (800-Bullet Data Set)

CCI
          NAA-As  ICP-Sb  ICP-Cu  ICP-Bi  ICP-Ag
NAA-As    118     10      6       4       17
ICP-Sb    10      48      33      34      36
ICP-Cu    6       33      46      31      36
ICP-Bi    4       34      31      193     29
ICP-Ag    17      36      36      29      54

Federal
          NAA-As  ICP-Sb  ICP-Cu  ICP-Bi  ICP-Ag
NAA-As    34      8       6       15      7
ICP-Sb    8       37      25      18      39
ICP-Cu    6       25      40      14      42
ICP-Bi    15      18      14      90      44
ICP-Ag    7       39      42      44      681

Remington
          NAA-As  ICP-Sb  ICP-Cu  ICP-Bi  ICP-Ag
NAA-As    345     -1      -3      13      3
ICP-Sb    -1      32      21      16      18
ICP-Cu    -3      21      35      15      12
ICP-Bi    13      16      15      169     18
ICP-Ag    3       18      12      18      49

Winchester
          NAA-As  ICP-Sb  ICP-Cu  ICP-Bi  ICP-Ag
NAA-As    555     5       7       -5      16
ICP-Sb    5       53      42      45      27
ICP-Cu    7       42      69      37      31
ICP-Bi    -5      45      37      278     31
ICP-Ag    16      27      31      31      51

Average over manufacturers
          NAA-As  ICP-Sb  ICP-Cu  ICP-Bi  ICP-Ag
NAA-As    263     6       4       7       10
ICP-Sb    6       43      30      28      30
ICP-Cu    4       30      47      24      30
ICP-Bi    7       28      24      183     30
ICP-Ag    10      30      30      30      209

Average within-bullet correlation matrix
          NAA-As  ICP-Sb  ICP-Cu  ICP-Bi  ICP-Ag
NAA-As    1.00    0.05    0.04    0.03    0.04
ICP-Sb    0.05    1.00    0.67    0.32    0.31
ICP-Cu    0.04    0.67    1.00    0.26    0.30
ICP-Bi    0.03    0.32    0.26    1.00    0.16
ICP-Ag    0.04    0.31    0.30    0.16    1.00

The estimated correlation matrix indicates mostly small correlations between the errors in measuring the elements. Four notable exceptions are the correlation between the errors in measuring Sb and Cu, estimated as 0.67, and the correlations between the errors in measuring Ag and Sb, Ag and Cu, and Sb and Bi, all estimated as 0.30-0.32. Figure K.1 demonstrates the Sb-Cu association with plots of the three Cu measurements versus the three Sb measurements, centered at their mean values so that (0, 0) is roughly the center of each plot, for 20 randomly selected bullets from one of the four boxes from CCI (Ref. 2). In all 20 plots, the three points increase from left to right. A plot of three points does not show very much, but one would not expect all 20 plots to show consistent directions if there were no association between the measurement errors of Sb and Cu. In fact, for all four manufacturers,

[…]

The allowance used in the 2-SD interval, 2(s_xj + s_yj) calculated for each element, is too wide, for three reasons:

- The measurement uncertainty in the difference between two sample means, each based on three observations, is σ√(2/3) ≈ 0.82σ. The average value of s_xj + s_yj, even when the measurements are known to be normally distributed, is (0.8862σ + 0.8862σ) = 1.7724σ, or roughly 2.17 times as large.

- A sample SD based on only three observations has a rather high probability (0.21) of overestimating σ by 25%, whereas a pooled SD based on 50 bullets each measured three times (compare Equation 2 in Appendix E) has a very small probability (0.00028) of overestimating σ by 25%. (That is one of the reasons that the authors urge the FBI to use pooled SDs in its statistical testing procedures.) The 2 in 2(s_xj + s_yj) is about 2-2.5 times too large, assuming that the measurement uncertainty σ is estimated by using a pooled SD.

- The procedure is designed to claim a match only if the true mean element concentrations differ by roughly the measurement uncertainty (δ ≈ σ ≈ 2-4%) or, at most, δ ≈ 1.5σ ≈ 3-6%. Measured differences in mean concentrations smaller than that amount would be considered analytically indistinguishable; measured differences larger than δ would be consistent with the hypothesis that the bullets came from different sources.

For these three reasons, the 2-SD interval claims a "match" for bullets that lie within an interval that is, on the average, about 3.5σ (σ = measurement uncertainty), or about 7-17 percent. Hence, bullets whose mean concentrations differ by less than 3.5σ (about 7-17 percent) on all seven elements have a high probability of being called "analytically indistinguishable." The expected range of three normally distributed observations is 1.6926σ, so the range-overlap method tends to result in intervals that are, on average, about half as wide as the intervals used in the 2-SD-overlap procedure. This fact explains the results showing that the range-overlap method had a lower rate of false matches than the 2-SD-overlap method. (The constants 0.8862, 0.21, 0.00028, and 1.6926 cited above can be checked with the short calculation below.)
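A short check of those constants using standard normal and chi-square facts; scipy is assumed to be available, and this is an illustrative verification rather than part of the committee's analysis.

    import numpy as np
    from scipy import stats
    from scipy.special import gamma

    n = 3
    # SD of the difference of two means of three measurements, in units of sigma.
    sd_diff = np.sqrt(2.0 / n)                                        # about 0.8165

    # c4 constant: E[s] = c4 * sigma for a sample of n normal observations.
    c4 = np.sqrt(2.0 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)   # about 0.8862
    print(2 * c4 / sd_diff)                                           # about 2.17

    # P(s > 1.25 sigma) from 3 observations: (n-1) s^2 / sigma^2 ~ chi-square(n-1).
    print(stats.chi2.sf((n - 1) * 1.25**2, df=n - 1))                 # about 0.21
    # Same probability for a pooled SD with 100 degrees of freedom (50 bullets x 2 df).
    print(stats.chi2.sf(100 * 1.25**2, df=100))                       # about 0.00028

    # Expected range of three normal observations, by simulation.
    rng = np.random.default_rng(3)
    sims = rng.normal(size=(200_000, n))
    print((sims.max(axis=1) - sims.min(axis=1)).mean())               # about 1.69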

4.2 Individual Equivalence t Tests

An alternative approach is to set a per-element FPP of, say, 0.30 on any one element, so that the FPP on all seven elements is small, say, 0.30^5 = 0.00243 (1 in 412) to 0.30^6 = 0.000729 (1 in 1,372). This approach leads to an equivalence t test, which proceeds as follows (a short computational sketch appears at the end of this discussion):

- Estimate the measurement uncertainty in measuring each element using a pooled SD, that is, the root mean square of the sample SDs from 50 to 100 bullets, where the sample SD for each bullet is based on the logarithms of its three measurements. (The sample SDs on bullets should be monitored with a process-monitoring chart, called an s-chart; see Ref. 12, pages 76-78.) Denote the pooled SD for element j as s_j,pool.

- Calculate the mean of the logarithms of the three measurements of each bullet. Denote the sample means on element j (j = 1, 2, ..., 7) for the CS and PS bullets as x̄_j and ȳ_j, respectively.

- Calculate the difference between the sample means on each element, x̄_j − ȳ_j. If the means differ by less than 0.63 times s_j,pool (about two-thirds of the pooled standard deviation for that element) for all seven elements, the bullets are deemed "analytically indistinguishable (match)." If the sample means differ by less than 1.07 times s_j,pool (slightly more than one pooled standard deviation for that element) for all seven elements, the bullets are deemed "analytically indistinguishable (weak match)."

The limit 0.63 [or 1.07] allows for the fact that each sample mean concentration will vary slightly about its true mean (with measurement uncertainty roughly σ/√3) and follows from the specification that (a) a false match on a single element has a probability of 0.30 and (b) a decision of "no match" suggests that the mean element concentrations are likely to differ by at least 1σ [or 1.5σ], the uncertainty of a single measurement. That is, assuming that the uncertainty in measuring a single element is 2.5 percent and the true mean difference between the two bullets' concentrations of this element is at least 2.5 percent [3.8 percent], then, with a probability of 0.30 (caused by the uncertainty in the measurement process and hence in the sample means x̄_j and ȳ_j), the two sample means will, by chance, lie within 0.63s_j,pool [or 1.07s_j,pool] of each other, and the bullets will be judged analytically indistinguishable on this one element (even though the mean concentrations of this element differ by 2.5%). A match occurs only if the bullets are analytically indistinguishable on all seven elements. Obviously, these limits can be changed simply by choosing a different value for the per-element false-match probability and a different value of δ (here δ = 1.0σ for a "match" and δ = 1.5σ for a "weak match").

If the measurement errors in all elements were independent, this procedure could be expected to have an overall FPP of 0.30^7 = 0.00022, or about 1 in 4,572. The estimated correlation matrix in Section 3.3 suggests that the measurement errors are not all independent. A brief simulation comparing probabilities for seven independent normal variates and seven correlated normal variates (using the correlation matrix based on the Federal bullets given in Appendix F) indicated that the FPP is closer to 0.30^5.2 = 0.002, or about 1 in 500. To achieve the FBI's stated FPP of 0.0004 (1 in 2,500), one could use a per-element error rate of 0.222 instead of 0.30, because 0.222^5.2 = 0.0004. The limits for "match" and "weak match" would then change from 0.63s_j,pool and 1.07s_j,pool to 0.47s_j,pool (about one-half of s_j,pool) and 0.88s_j,pool, respectively.
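A sketch of the per-element comparison, using the multipliers quoted above; the function assumes log-scale measurements and a pooled SD per element, and the arrays below are simulated placeholders rather than FBI data.

    import numpy as np

    def equivalence_match(cs_logs, ps_logs, s_pool, k_match=0.63, k_weak=1.07):
        """Compare two bullets element by element on the log scale.

        cs_logs, ps_logs: arrays of shape (3 measurements, n_elements), already logged.
        s_pool: array of pooled SDs, one per element.
        Returns "match", "weak match", or "no match".
        """
        diff = np.abs(cs_logs.mean(axis=0) - ps_logs.mean(axis=0))
        if np.all(diff < k_match * s_pool):
            return "match"
        if np.all(diff < k_weak * s_pool):
            return "weak match"
        return "no match"

    # Illustrative use with two simulated bullets on 7 elements (2.5% measurement RSD).
    rng = np.random.default_rng(4)
    true_logs = np.log([30000, 280, 70, 16, 1450, 1800, 20])     # hypothetical concentrations
    cs = true_logs + rng.normal(scale=0.025, size=(3, 7))
    ps = true_logs + rng.normal(scale=0.025, size=(3, 7))
    s_pool = np.full(7, 0.025)
    print(equivalence_match(cs, ps, s_pool))

    # Overall false-positive probability if all seven elements were independent:
    print(0.30**7)     # about 0.00022, i.e., roughly 1 in 4,572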

Table K.14 shows the calculations involved in the equivalence t tests on Federal bullets F001 and F002, using the data in Section 3.1 (log concentrations). The calculations are based on the pooled standard deviations from 200 Federal bullets (400 degrees of freedom; see Appendix F). Not all of the relative mean differences on the elements are less than 0.86 in magnitude, but they are all less than 1.05 in magnitude; hence, the bullets would be deemed "analytically indistinguishable (weak match)." The allowance 0.86s_j,pool can be written in terms of a constant (here, 0.645) that arises from a noncentral t distribution (see Appendix F), used in an equivalence t test (Ref. 13), assuming that n = 3, that at least 100 bullets are used in the estimate s_j,pool (here, 200 bullets, or 400 degrees of freedom), and that mean concentrations with δ = σ (that is, within the measurement uncertainty) are considered analytically indistinguishable. The constant changes if one instead allows mean concentrations with δ = 1.5σ to be considered "analytically indistinguishable." Other values of the constant are given in Appendix F; they depend slightly on n (here, three measurements per sample mean) and on the number of bullets used to estimate the pooled variance (here, assumed to be at least 100) and, most importantly, on the per-element FPP (here, 0.30) and on δ/σ (here, 1-1.5).

The choice of δ ≈ σ used in the procedure is based on the observation that the differences between mean concentrations of the seven elements (δ_j, j = 1, ..., 7) in three pairs of bullets in the 854-bullet subset of the 1,837-bullet data set (in which all seven elements were measured), pairs that were assumed to be unrelated, can be as small as the measurement uncertainty (δ_j/σ_j ≤ 1 on all seven elements; compare Table K.8). Allowing matches between mean differences within 1.5, 2.0, or 3.0 times the measurement uncertainty increases the constant from 0.767 to 1.316, 1.925, or 3.147, respectively, and increases the allowance from 0.63s_j,pool ("match") to 1.07s_j,pool ("weak match"), 1.57s_j,pool, and 2.57s_j,pool, respectively (resulting in progressively weaker matches). The FBI allowance of 2(s_xj + s_yj), on average about 3.5s_j,pool, for the same per-element FPP of 0.30 corresponds to δ/σ = 4.0. That is, concentrations within roughly 4.3 times the measurement uncertainty would yield an FPP of roughly 0.30 on each element. (Because the measurement uncertainty for all seven elements is roughly 2-5%, this corresponds to claiming that bullets are analytically indistinguishable whenever their concentrations lie within 8-20% of each other.) Those wide intervals resulted in 693 false matches among all possible pairs of the 1,837 bullets in the 1,837-bullet data set, or 47 false matches among all possible pairs of the 854 bullets in which all seven elements were measured. In contrast, using the limit 1.07s_j,pool resulted in zero matches among the 854 bullets.

TABLE K.14 Equivalence t Tests on Federal Bullets F001 and F002

log(concentration) on F001
         ICP-Sb    ICP-Cu   ICP-Ag   ICP-Bi   ICP-As   ICP-Sn
a        10.28452  5.65249  4.15888  2.77259  7.25488  7.51861
b        10.29235  5.61677  4.30407  2.77259  7.29980  7.51643
c        10.27505  5.64545  4.18965  2.77259  7.24708  7.48997
mean     10.28397  5.63824  4.21753  2.77259  7.26725  7.50834
SD       0.00866   0.01892  0.07650  0.00000  0.02845  0.01594

log(concentration) on F002
         ICP-Sb    ICP-Cu   ICP-Ag   ICP-Bi   ICP-As   ICP-Sn
a        10.27491  5.62762  4.33073  2.77259  7.29506  7.52994
b        10.26928  5.63121  4.20469  2.77259  7.27170  7.49387
c        10.27135  5.64191  4.34381  2.70805  7.28001  7.47760
mean     10.27185  5.63358  4.29308  2.75108  7.28226  7.50047
SD       0.00285   0.00743  0.07682  0.03726  0.01184  0.02679

s_j,pool       0.0192   0.0200   0.0825   0.0300   0.0432   0.0326
RMD/s_j,pool   0.631    0.233    -0.916   0.717    -0.347   0.241

The use of equivalence t tests for comparing two bullets depends only on a model for the measurement error (a lognormal distribution or, if σ/µ is small, a normal distribution) and on the definition of a "CIVL" as a volume of lead small enough that the variability of the elemental concentrations within that volume is much smaller than the measurement uncertainty (that is, within-lot variability is much smaller than σ). It does not depend on any assumptions about the distribution of elemental concentrations in the general population of bullets, for which we have no valid data sets that would allow statistical inference. Probabilities such as the FBI's claim of "1 in 2,500" are inappropriate when based on a data set such as the 1,837-bullet data set; as noted in Section 3.2, it is not a random collection of bullets from the population of all bullets, or even from the complete 71,000+ bullet data set from which it was extracted.

The use of either 0.63s_j,pool or 1.07s_j,pool (requiring x̄_j and ȳ_j to be within 1.0 to 1.5 times the measurement uncertainty) might seem too demanding when only three pairs of bullets among the 854 bullets (the subset of the 1,837-bullet data set in which all seven elements were measured) showed differences of less than or equal to 1 SD on all seven elements (eight pairs of bullets had maximal RMDs within 1.5). However, as noted in the paragraph describing the data set, the 1,837 bullets were selected to be unrelated (Ref. 6) and hence do not represent, in any way, a random sample from the population of bullets. We cannot say, on the basis of this data set, how frequently two bullets manufactured from different sources may have concentrations within 1.0σ of each other. We do know that such instances can occur.
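The 693- and 47-pair counts mentioned above come from screening every possible pair of bullets in a data set. A minimal sketch of that kind of pairwise screening, assuming each bullet is summarized by its vector of mean log concentrations; the arrays, between-bullet spread, and threshold below are placeholders for illustration.

    import numpy as np
    from itertools import combinations

    def count_pairwise_matches(means, s_pool, k=1.07):
        """Count pairs of bullets whose mean log concentrations are within
        k * s_pool on every element (the 'weak match' rule)."""
        n_matches = 0
        for i, j in combinations(range(len(means)), 2):
            if np.all(np.abs(means[i] - means[j]) < k * s_pool):
                n_matches += 1
        return n_matches

    # Illustrative run on simulated "bullets": 854 mean vectors on 7 elements.
    rng = np.random.default_rng(5)
    between_bullet_sd = 0.5                        # spread across bullets (much larger than sigma)
    means = rng.normal(scale=between_bullet_sd, size=(854, 7))
    s_pool = np.full(7, 0.025)
    print(count_pairwise_matches(means, s_pool))   # few matches expected with these settings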

A carefully designed study representative of all bullets that might exist now or in the future may help to assess the distribution of differences between mean concentrations of different bullets and may lead to a different choice of the constant, depending on the level of δ/σ that the procedure is designed to protect against. Constants for other values of the per-element FPP (0.01, 0.05, 0.10, 0.20, 0.222, and 0.30) and of δ (0.25, 0.50, 1.0, 1.5, 2.0, and 3.0), for n = 3 and n = 5, are given in Appendix F. See also Box K.1.

4.3 Hotelling's T2

A statistical test procedure designed to compare the two sets of seven sample means simultaneously, rather than with seven individual tests, one at a time, as in the previous section, uses the estimated covariance matrix of the measurement errors. The test statistic, Hotelling's two-sample T2, is a quadratic form in the vector of differences between the two bullets' sample mean (log) concentrations, weighted by the inverse of the estimated measurement-error covariance matrix S (equivalently, by R^-1 after the differences are scaled by the measurement SDs s), where:

n = number of measurements in each sample mean (here, n = 3).
p = number of elements being measured (here, p = 7).
s = vector of SDs in measuring the elements (length p).
S^-1 = inverse of the estimated matrix of variances and covariances among the measurement errors (seven rows and seven columns).
R^-1 = inverse of the estimated matrix of correlations among the measurement errors (seven rows and seven columns).
v = number of degrees of freedom used in estimating S, the matrix of variances and covariances (here, 2 times the number of bullets if three measurements are made on each bullet).

Under the assumptions that the measurements are normally distributed (for example, if lognormal, then the logarithms of the measurements are normally distributed), that the matrix of variances and covariances is estimated very well, using v degrees of freedom (for example, v = 200 if three measurements are made on each of 100 bullets and the variances and covariances within each set of three measurements are pooled across the 100 bullets), and that the bullet means truly differ by δ/σ = 1 in each element, [(v + 1 − p)/(pv)]T2 should not exceed a critical value determined by the noncentral F distribution with p and v + 1 − p degrees of freedom and noncentrality parameter n(δ/σ)′R^-1(δ/σ), which equals 3 times the sum of the elements of the inverse of the estimated correlation matrix when δ/σ = 1 for each element (Ref. 16, pp. 541-542).
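A sketch of computing the statistic itself, assuming the standard two-sample form for equal numbers of measurements per bullet; the n/2 factor and all inputs below are illustrative assumptions, and the match limits quoted in the text come from the noncentral F calculation described above, not from this fragment.

    import numpy as np

    def hotelling_t2(cs_logs, ps_logs, S):
        """Two-sample Hotelling T^2 for two bullets measured n times each.

        cs_logs, ps_logs: arrays of shape (n, p) of log measurements.
        S: pooled (p x p) covariance matrix of the measurement errors.
        """
        n, p = cs_logs.shape
        d = cs_logs.mean(axis=0) - ps_logs.mean(axis=0)       # vector of mean differences
        return (n / 2.0) * d @ np.linalg.solve(S, d)          # (n/2) * d' S^{-1} d

    # Illustrative inputs: 3 measurements on 7 elements for each bullet, with a
    # pooled covariance matrix estimated elsewhere (here, diagonal 2.5% RSDs).
    rng = np.random.default_rng(6)
    S = np.diag(np.full(7, 0.025**2))
    true_logs = np.log([30000, 280, 70, 16, 1450, 1800, 20])
    cs = true_logs + rng.normal(scale=0.025, size=(3, 7))
    ps = true_logs + rng.normal(scale=0.025, size=(3, 7))
    print(hotelling_t2(cs, ps, S))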

BOX K.1 True Matches and Assessed Matches

The recommended statistical test procedure for assessing a match involves the calculation of the sample means of the measurements (transformed via logarithms) on the CS and PS bullets and of a pooled standard deviation (as an estimate of the measurement uncertainty). If the sample means on all seven elements are "too close," relative to the variability that is expected for a difference between two sample means, a "match" is declared. "Too close" is determined by a constant that arises from either a noncentral t distribution, if a t test is performed on each individual element, or a noncentral F distribution, if Hotelling's T2 test is used, in which the relative mean differences are combined and weighted in accordance with the correlation among the seven measurement errors.

Two types of questions may be posed. The first type involves conditioning on the difference between the bullet means: Given that two bullets really did come from the same CIVL (compositionally indistinguishable volume of lead), what is the probability that the statistical test procedure correctly claims "match"? Similarly, given two bullets that are known to have come from different CIVLs, what is the probability that the test correctly claims "no match"? Stated formally, if δ represents the vector of true mean differences in the seven elemental concentrations, and if P(A | B) denotes the probability of A given that B holds, then questions of the first type can be written: What are P(claim "match" | δ = 0) and P(claim "nonmatch" | δ = 0) (these two expressions sum to 1, and the second is the false-nonmatch probability), and what are P(claim "match" | δ > 0) and P(claim "nonmatch" | δ > 0) (again, these sum to 1, and the first is the false-match probability)? In other words, one can ask about the performance of the test given the true connection between the bullets. Using a combination of statistical theory and simulation, these probabilities can be estimated for the FBI's current match procedures as well as for the alternative procedures recommended here.

The second type of question reverses the terms: it conditions on the assessment and asks about the state of the bullets. One version of this type of question is: Given that the statistical test indicates "match," what is the probability that the two bullets came from the same CIVL? The answer depends on several factors. First, as indicated in Chapter 3, we cannot guarantee uniqueness of the mean concentrations of all seven elements simultaneously. Uniqueness seems plausible, given the characteristics of the manufacturing process and the possible changes in the industry over time (e.g., a very slight increase in silver concentrations over time), but it cannot be assured. Therefore, at best, we can address only the following modified question: "If CABL analysis indicates 'match,' what is the probability that these two bullets were manufactured from CIVLs that have the same mean concentrations of all seven elements, compared with the probability that these two bullets were manufactured from CIVLs that differ in mean concentration of one or more of the seven elements?"

Using the notation above, this probability can be written P(δ = 0 | claim "match"), which is 1 − P(δ > 0 | claim "match"). Similarly, one can ask about P(δ = 0 | claim "nonmatch"), which is 1 − P(δ > 0 | claim "nonmatch"). By applying Bayes' rule (Ref. 8),

P(δ = 0 | claim "match") = P(claim "match" | δ = 0) P(δ = 0) / P(claim "match")

and

P(δ > 0 | claim "match") = P(claim "match" | δ > 0) P(δ > 0) / P(claim "match").

The ratio of these two probabilities, P(δ = 0 | claim "match") / P(δ > 0 | claim "match"), is therefore equal to

P(claim "match" | δ = 0) P(δ = 0) / [P(claim "match" | δ > 0) P(δ > 0)].     (*)

One might ask, "Given that the CABL analysis indicates 'match,' what is the probability that the bullets came from populations with the same mean concentrations, compared with the probability that the bullets came from different populations?" A large ratio would be strong evidence that the bullets came from CIVLs with the same mean concentrations. (In practice, one might allow a small δ0, so that "δ < δ0" is effectively a "match" and "δ > δ0" is effectively a "nonmatch"; the choice of δ0 will be discussed later, but for now we take δ0 = 0.) The equation above shows that this ratio is actually a product of two ratios. The first, P(claim "match" | δ = 0) / P(claim "match" | δ > 0), can be estimated, as indicated above, through simulation, and a larger value of it indicates a more sensitive test. The second, P(δ = 0) / P(δ > 0), depends on the values of the mean concentrations across the entire universe of CIVLs (past, present, and future). Section 3 below estimates probabilities of the form of the first ratio and shows that it exceeds 1 for all of the tests, but especially for the alternative procedures recommended here. However, the second ratio is unknown and, in fact, depends on many factors: the consistency of elemental concentration within a CIVL ("within-CIVL homogeneity"); the number of bullets that can be manufactured from such a homogeneous CIVL; the number of CIVLs that are analytically indistinguishable from a given CIVL (in particular, the CIVL from which the CS bullet was manufactured); and the number of CIVLs that are not analytically indistinguishable from a given CIVL. These factors will vary by type of bullet, by manufacturer, and perhaps by locale (i.e., more CIVLs are readily accessible to residents of a large metropolitan area than to those in a small town).

This appendix analyzes data made available to the committee in an attempt to estimate a frequency distribution for values of δ in the population, which is needed for the probabilities in the second ratio above. However, as will be seen, these data sets are biased, precluding unbiased inferences. In the end, one can conclude only that P(δ = 0 | claim "match") > P(δ = 0); that is, given the result of a test that suggests "match," the probability that the two bullets came from the same CIVL is higher than it would be if the two bullets had not been measured at all. This, of course, is a weak statement.

A stronger statement, namely, that the ratio of the probabilities in (*) exceeds 1, is possible only through a carefully designed sampling scheme, from which estimates, and corresponding confidence intervals, for the probability in question could be obtained. No such unbiased information is currently available. Consequently, the recommended alternative statistical procedures (Hotelling's T2 test and individual Student's t tests on the seven elements separately) consider only the measurable component of variability in the problem, namely, the measurement error, and not the other sources of variability (within-CIVL and between-CIVL variability) that would be needed to estimate this probability.

We note, as a further complication, that the linkage between a "match" of the CS and PS bullets and the inference that the two bullets came from the same CIVL depends on how a CIVL is defined. If a CS bullet is on the boundary of a CIVL, the likelihood of a match to bullets outside that CIVL may be much higher than if the CS bullet is in the middle of a CIVL.

(End of Box K.1)

When p = 7 and v = 400 degrees of freedom, and using the correlation matrix estimated from the Federal data (which measured six of the seven elements with ICP-OES; see Appendix F) and assuming that the measurement error for Cd is 5% and is uncorrelated with the others, this test procedure claims "analytically indistinguishable (match)" only if T2 is less than 1.9 (δ/σ = 1 for each element) and claims "analytically indistinguishable (weak match)" only if T2 is less than 6.0 (δ/σ = 1.5 for each element), to ensure an overall FPP of no more than 0.0004 (1 in 2,500).[1] (When applied to the log concentrations of Federal bullets F001 and F002 in Table K.14, the value of Hotelling's T2 statistic, using only six elements, is 2.354, which is small enough to claim "analytically indistinguishable" when δ/σ = 1.0 and the overall FPP is 0.002, or 1 in 500.)

The limit 1.9 depends on quite a large number of assumptions. The test is indeed more sensitive if the correlation among the measurement errors is substantial (as it may be here for at least some pairs of elements) and if the differences in element concentrations tend to be spread out across all seven elements rather than concentrated in only one or two. However, the validity of Hotelling's T2 test in the face of departures from those assumptions is not well understood. For example, the limit 1.9 was based on an estimated covariance matrix from one set of 200 bullets (Federal) from one study conducted in 1991, and inferences from it may no longer apply to the current measurement procedure.

[1] For an overall FPP of 0.002 (1 in 500), the test would claim "match" or "weak match" if T2 does not exceed 1.9 or 8.1, respectively. For an overall FPP of 0.01 (1 in 100), the test would claim "match" or "weak match" if T2 does not exceed 4.5 or 11.5, respectively.

Also, although Hotelling's T2 test is more sensitive at detecting small differences in concentrations across all elements, it is less sensitive than the individual t tests if the main difference between two bullets arises from one fairly large difference in a single element. (That can be seen from the fact that, if the measurement errors were independent, T2/p reduces to the average of the squared two-sample t statistics on the p = 7 separate elements, so one large difference is spread out across the seven dimensions, causing [(v + 1 − 7)/v]T2/p to be small and thus to declare a match even when the bullets differ quite significantly in one element.) Many more studies would be needed to assess the reliability of Hotelling's T2 (for example, studies of the types of differences typically seen between bullet concentrations, of the precision of estimates of the variances and covariances of the measurement errors, and of departures from (log)normality).

4.4 Use of T Tests in Court

One reason for the authors' recommendation of seven individual equivalence t tests, rather than the multivariate analog based on Hotelling's T2, is the familiarity of the form. Student's t tests are in common use and familiar to many users of statistics; the only difference here is the multiplier ("0.63" for "match" or "1.07" for "weak match," instead of "2.0" in a conventional t test with α = 0.05). The choice of FPP, and therefore the determination of δ, could appear arbitrary to a jury and could subject the examiner to a difficult cross-examination. However, the choice of δ is in reality no more arbitrary than the choice of α in the conventional t test; the "convention" referred to in the name is in fact the choice α = 0.05, leading to a "2.0-sigma" confidence interval. The conventional t test has the serious disadvantage that it begins from the null hypothesis that the crime-scene bullet and the suspect's bullet match; that is, it starts from the assumption that the defendant is guilty ("bullets match") and sets at 0.05 the probability of falsely concluding that a guilty person's bullet does not match. This drawback could be overcome by computing the complement of the conventional t test's Type II error rate (the rate at which the test fails to reject the null hypothesis when it is false, which in this case would be a false-positive result) for a range of alternatives to the null hypothesis and expressing the results as a power curve in order to judge the power of the test. However, this is not as appealing from the statistician's viewpoint as the equivalence t test. (It is important to note that the matching error rate of the standard t test will fluctuate by bullet manufacturer and bullet type, because differences among CABLs are characteristic of manufacturer and bullet type.) Table K.15 presents a comparison of false-positive and false-negative rates obtained with the FBI's statistical methods and with the equivalence and conventional t tests.

TABLE K.15 Simulated False-Positive and False-Negative Probabilities Obtained with Various Statistical Testing Procedures

                        Composition Identical    Composition Not Identical
                        (δ = 0)                  (δ = 1.5)
CABL claims "match"     True Positive            False Positive
FBI-2SD                 0.933                    0.571
FBI-rg                  0.507                    0.050
Conv t                  0.746                    0.065
Equiv-t (1.3)           0.272                    0.004
HotelT2 (6.0)           0.115                    0.001

CABL claims "no match"  False Negative           True Negative
FBI-2SD                 0.067                    0.429
FBI-rg                  0.493                    0.948
Conv t                  0.254                    0.935
Equiv-t (1.3)           0.728                    0.996
HotelT2 (6.0)           0.885                    0.999

Note: The simulation is based on 100,000 trials. In each trial, three measurements on seven elements were simulated from a normal distribution with mean vector µx, standard deviation vector σx, and within-measurement correlation matrix R, where µx is the vector of seven mean concentrations from one of the bullets in the 854-bullet data set, σx is the vector of seven standard deviations for that same bullet, and R is the within-measurement correlation matrix based on data from 200 Federal bullets (see Appendix F). Three further measurements on seven elements were simulated from a normal distribution with mean vector µy = µx + kσx, the same standard deviation vector σx, and the same within-measurement correlation matrix R; that is, µy equals the same vector of mean concentrations plus an offset equal to k times the measurement uncertainty in each element. The simulated probabilities for each test (FBI 2-SD overlap, FBI range overlap, conventional t, equivalence t) equal the proportions of the 100,000 trials in which the test claimed "match" or "no match" (for example, for a "match," the sample means on all seven elements were within 0.63 of the pooled estimate of the uncertainty in measuring that element). For the first column, the simulation was run with k = 0 (mean concentrations are the same); for the second column, the simulation was run with k = 1.5 (mean concentrations differ by 1.5 times the measurement uncertainty). With 100,000 trials, the uncertainties in these simulated probabilities (two standard errors) do not exceed 0.003. Note that σx is the measurement error, which can be considered equal to √(σ_l² + σ_inh²), where σ_l is the analytical measurement uncertainty and σ_inh is the uncertainty due to inhomogeneity.
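A condensed sketch of one cell of a simulation of this form (the equivalence-t rule with a specified offset), assuming independent measurement errors and illustrative values for the mean vector, RSDs, and threshold; it shows only the structure of the trials and is not intended to reproduce the table's numbers.

    import numpy as np

    rng = np.random.default_rng(7)
    n_trials = 20_000                       # the table used 100,000 trials
    mu_x = np.log([30000, 280, 70, 16, 1450, 1800, 20])   # hypothetical mean log concentrations
    sigma_x = np.full(7, 0.025)             # hypothetical measurement SDs (2.5%)
    k = 1.5                                 # offset between the two bullets, in units of sigma
    threshold = 0.63 * sigma_x              # equivalence-test "match" allowance

    false_positives = 0
    for _ in range(n_trials):
        cs = mu_x + rng.normal(scale=sigma_x, size=(3, 7))                  # CS bullet
        ps = (mu_x + k * sigma_x) + rng.normal(scale=sigma_x, size=(3, 7))  # PS bullet, offset by k*sigma
        diff = np.abs(cs.mean(axis=0) - ps.mean(axis=0))
        false_positives += np.all(diff < threshold)

    print(false_positives / n_trials)       # simulated false-positive probability for this rule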

It is important to note that this appendix has considered tests of a "match" between a single CS bullet and a single PS bullet. If the CS bullet were compared with, say, five PS bullets, all of which came from CIVLs whose mean concentrations differed from the CS bullet's by at least 1.5 times the measurement uncertainty (δ = 1.5σ), then, by Bonferroni's inequality, the chance that the CS bullet would match at least one of the PS bullets could be as high as five times the nominal FPP (e.g., 0.01, or 1 in 100, if the "1 in 500" rate were chosen). Multiplying the current false-positive rates for the FBI 2-SD-overlap and range-overlap procedures shown in Table K.15 by the number of bullets being tested results in a very high probability that at least one of the bullets will appear to "match" simply by chance, even when the mean CIVL concentrations of the two bullets differ by 1.5 times the measurement uncertainty (3-7%). The small FPP of the equivalence t test results in a small probability that some CS bullet will match the PS bullet by chance alone, so long as the number of PS bullets is not very large.

REFERENCES

1. FBI Laboratory Chemistry Unit. Comparative Elemental Analysis of Firearms Projectile Lead by ICP-OES. Issue date: October 11, 2002. Unpublished (2002).
2. Peele, E. R.; Havekost, D. G.; Peters, C. A.; Riley, J. P.; Halberstam, R. C.; and Koons, R. D. USDOJ (ISBN 0-932115-12-8), 1991, 57.
3. Peters, C. A. Foren. Sci. Comm. 2002, 4(3). <http://www.fbi.gov/hq/lab/fsc/backissu/july2002/peters.htm> as of Aug. 8, 2003.
4. 800-bullet data set provided by the FBI in an email from Robert D. Koons to Jennifer J. Jackiw, February 24, 2003.
5. 1,837-bullet data set provided by the FBI (CD). Received by the committee May 12, 2003.
6. Koons, R. D. Personal communication to the committee (CD). Received by the committee May 12, 2003. Description of the 1,837-bullet data set.
7. Randich, E.; Duerfeldt, W.; McLendon, W.; and Tobin, W. Foren. Sci. Int. 2002, 127, 174-191.
8. Carriquiry, A.; Daniels, M.; and Stern, H. "Statistical Treatment of Case Evidence: Analysis of Bullet Lead." Unpublished report, Dept. of Statistics, Iowa State University, 2002.
9. Grant, D. M. Personal communication to the committee. April 14, 2003.
10. Koons, R. D. Personal communication to the committee via email to Jennifer J. Jackiw. March 3, 2003.
11. Koons, R. D. "Bullet Lead Elemental Composition Comparison: Analytical Technique and Statistics." Presentation to the committee. February 3, 2003.
12. Vardeman, S. B. and Jobe, J. M. Statistical Quality Assurance Methods for Engineers; Wiley: New York, NY, 1999.
13. Wellek, S. Testing Statistical Hypotheses of Equivalence; Chapman and Hall: New York, NY, 2003.
14. Owen, D. B. "Noncentral t Distribution" in Encyclopedia of Statistical Sciences, Volume 6; Kotz, S.; Johnson, N. L.; and Read, C. B., Eds.; Wiley: New York, NY, 1985, pp 286-290.
15. Tiku, M. "Noncentral F Distribution" in Encyclopedia of Statistical Sciences, Volume 6; Kotz, S.; Johnson, N. L.; and Read, C. B., Eds.; Wiley: New York, NY, 1985, pp 280-284.
16. Rao, C. R. Linear Statistical Inference and Its Applications; Wiley: New York, NY, 1973.