Statistical Analysis of Bullet Lead Data

OCR for page 26

Forensic Analysis Weighing Bullet Lead Evidence
3
Statistical Analysis of Bullet Lead Data
INTRODUCTION
Assume that one has acquired samples from two bullets, one from a crime scene (the CS bullet) and one from a weapon found with a potential suspect (the PS bullet). The manufacture of bullets is, to some extent, heterogeneous by manufacturer, and by production run within manufacturer. A CIVL, a “compositionally indistinguishable volume of lead” (which could be smaller than a production run, or “melt”), is an aggregate of bullet lead that can be considered to be homogeneous. That is, a CIVL is the largest volume of lead produced in one production run at one time for which measurements of elemental composition are analytically indistinguishable (within measurement error). The chemical composition of bullets produced from different CIVLs from various manufacturers can vary much more than the composition of those produced by the same manufacturer from a single CIVL. (See Chapter 4 for details on the manufacturing process for bullets.) The fundamental issue addressed here is how to determine from the chemical compositions of the PS and the CS bullets one of the following: (1) that there is a non-match, that is, the compositions of the CS and PS bullets are so disparate that it is unlikely that they came from the same CIVL, (2) that there is a match, that is, the compositions of the CS and PS bullets are so alike that it is unlikely that they came from different CIVLs, and (possibly) (3) that the compositions of the two bullets are neither so clearly disparate as to assert that they came from different CIVLs, nor so clearly similar as to assert that they came from the same CIVL. Statistical methods are needed in this context for two important purposes: (a) to find ways of making these assertions based on the evidence so that the error rates (the chance of falsely asserting a match and the chance of falsely asserting a non-match) are both acceptably small, and (b) to estimate the size of these error rates for a given procedure, which need to be communicated along with the assertions of a match or a non-match so that the reliability of these assertions is understood.1 Our general approach is to outline some of the possibilities and recommend specific statistical approaches for assessing matches and non-matches, leaving to others the selection of one or more critical values to separate cases (1), (2), and perhaps (3) above.2
Given the data on any two bullets (e.g., CS and PS bullets), one crucial objective of compositional analysis of bullet lead (CABL) is to provide information that bears on the question: “What is the probability that these two bullets were manufactured from the same CIVL?” While one cannot answer this question directly, CABL analysis can provide relevant evidence, the strength of that evidence depending on several factors.
First, as indicated in this chapter, we cannot guarantee uniqueness in the mean concentrations of all seven elements simultaneously. However, there is certainly variability between CIVLs given the characteristics of the manufacturing process and possible changes in the industry over time (e.g., very slight increases in silver concentrations over time). Since uniqueness cannot be assured, at best, we can address only the following modified question:
“What is the probability that the CS and PS bullets would match given that they came from the same CIVL compared with the probability that they would match if they came from different CIVLs?”
The answer to this question depends on:
1. the number of bullets that can be manufactured from a CIVL,
2. the number of CIVLs that are analytically indistinguishable from a given CIVL (in particular, the CIVL from which the CS bullet was manufactured), and
3. the number of CIVLs that are not analytically indistinguishable from a given CIVL.
The answers to these three items will depend upon the type of bullet, the manufacturer, and perhaps the locale (e.g., more CIVLs may be more readily accessible to residents of a large metropolitan area than to those in a small urban town). A carefully designed sampling scheme may provide information from which estimates, and corresponding confidence intervals, for the probability in question can be obtained. No comprehensive information on this is currently available. Consequently, this chapter has given more attention to the only fully measurable component of variability in the problem, namely, the measurement error, and not to the other sources of variability (between-CIVL variability) which would be needed to estimate this probability.
1
This chapter is concerned with the problem of assessing the match status of two bullets. If, on the other hand, a single CS bullet were compared with K PS bullets, the usual issues involving multiple comparisons arise. A simple method for using the results provided here to assess false match and false non-match probabilities is through use of Bonferroni’s inequality. Using this method, if the PS bullets came from the same CIVL, an estimate of the probability that the CS bullet would match at least one of the PS bullets is bounded above by, but often very close to, K times the probability that the CS bullet would match a single PS bullet.
2
The purposive selection of disparate bullets by those engaged in crimes could reduce the value of this technology for forensic use.
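The Bonferroni bound mentioned in footnote 1 can be illustrated with a short calculation. This is a minimal sketch, not part of the committee's analysis: the function names and the single-comparison match probability of 0.01 are hypothetical, and the second function adds the assumption that the K comparisons are independent, which the footnote does not require.

```python
def bonferroni_bound(p_single, k):
    """Upper bound on P(CS bullet matches at least one of k PS bullets)."""
    # Bonferroni's inequality: P(union of events) <= sum of the event probabilities.
    return min(1.0, k * p_single)


def any_match_if_independent(p_single, k):
    """P(at least one match) under the extra assumption of independent comparisons."""
    return 1.0 - (1.0 - p_single) ** k


# Hypothetical values chosen only for illustration.
p = 0.01   # assumed probability that the CS bullet matches a single PS bullet
K = 5      # number of PS bullets compared with the CS bullet

bound = bonferroni_bound(p, K)
independent = any_match_if_independent(p, K)
print(f"Bonferroni bound: {bound:.4f}; under independence: {independent:.4f}")
```

When the single-comparison probability is small, the two numbers are close, which is the sense in which the bound is "often very close to" K times that probability.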
Test statistics that measure the degree of closeness of the chemical compositions of two bullets are parameterized by critical values that define the specific ranges for the test statistics that determine which pairs of bullets are asserted to be matches and which are asserted to be non-matches. The error rates associated with false assertions of matches or non-matches are determined by these critical values. (We refer to these error rates here as the operating characteristics of a statistical test. They are usually expressed as the significance level, or Type I error rate, and the power, which is one minus the Type II error rate.)
This chapter describes and critiques the statistical methods that the FBI currently uses, and proposes alternative methods that would be preferred for assessing the degree of consistency of two samples of bullet lead. In proposing improved methods, we will address the following issues:
1. General approaches to assessing the closeness of the measured chemical compositions of the PS and CS bullets,
2. Data sets that are currently available for understanding the characteristics of data on bullet lead composition,
3. Estimation of the standard deviation of measures of bullet lead composition, a crucial parameter in determining error rates, and
4. How to determine the false match and false non-match rates implied by different cut-off points (the critical values) for the statistical procedures advocated here to define ranges associated with matches, non-matches, and (possibly) an intermediate situation of no assertion of match status.
Before we address these four topics, we critique the procedures now used by the FBI. At the end, we will recommend statistical procedures for measuring the degree of consistency of two samples of bullet lead, leaving the critical values to be determined by those responsible for making the trade-offs involved.
FBI’s Statistical Procedures Currently in Use
The FBI currently uses the following three procedures to assert a “match,” that is, that a CS bullet and a PS bullet have compositions that are sufficiently similar3 for an FBI expert to assert that they were manufactured from CIVLs with the same chemical composition. First, the FBI collects three pieces from each bullet or bullet fragment (CS and PS), and nominally each piece is measured in triplicate. (These sample sizes are reduced when there is insufficient bullet lead to make three measurements on each of three samples.) Let us denote by CSki the kth measurement of the ith fragment of the crime scene bullet, and similarly for PSki. Of late, this measurement is done using inductively coupled plasma-optical emission spectrophotometry (ICP-OES) on seven elements that are known to differ among bullets from different manufacturers and between different CIVLs from the same manufacturer. The seven elements are arsenic (As), antimony (Sb), tin (Sn), copper (Cu), bismuth (Bi), silver (Ag), and cadmium (Cd).4
3
The term “analytically indistinguishable chemical composition” is used to describe two bullets that have compositions that are considered to match.
The three replicates on each piece are averaged, and means, standard deviations, and ranges (minimum to maximum) for each element in each of the three pieces are calculated for all CS and PS bullets.5 Specifically, the following are computed for each of the seven elements:
$CS_i = (CS_{1i} + CS_{2i} + CS_{3i})/3$, the average measurement for the ith piece from the CS bullet,
$\mathrm{avg}(CS) = (CS_1 + CS_2 + CS_3)/3$, the overall average over the three pieces for the CS bullet,
$\mathrm{sd}(CS) = \sqrt{\sum_{i=1}^{3} (CS_i - \mathrm{avg}(CS))^2 / 2}$, the within-bullet standard deviation of the fragment means for the CS bullet, essentially the square root of the average squared difference between the average measurements for each of the three pieces and the overall average across pieces (the denominator is 2 rather than 3 because one degree of freedom is used in estimating the overall average),
$\mathrm{range}(CS) = \max(CS_1, CS_2, CS_3) - \min(CS_1, CS_2, CS_3)$, the spread from highest to lowest of fragment means for the three pieces for the CS bullet.
The same statistics are computed for the PS bullet.
4
As explained below, analyses in previous years measured only three to six elements, and in some cases, fewer than three pieces can be abstracted from a bullet or bullet fragment. However, in general, the following analysis will assume measurements on three pieces in triplicate for seven elements.
5
Throughout this chapter, the triplicate measurements are ignored and the three averages are treated as the basic measurements. We have not found any analysis of the variability of measurements within a single sample; the FBI should conduct such an analysis as an estimate of pure measurement error, as distinct from variability within a single bullet. If the difference is trivial, use of the three fragments rather than the nine separate measurements is justified.

The overall mean, avg(CS), is a measure of the concentration for a given element in a bullet. The overall mean could have differed: (1) had we used different fragments of the same bullet for measurement of the overall average, since even an individual bullet may not be completely homogeneous in its composition, and (2) because of the inherent variability of the measurement method. This variability in the overall mean can be estimated by the within-bullet standard deviation divided by √3 (since the mean is an average over 3 observations). Further, for normally distributed data, the variability in the overall mean can also be estimated by the range/3. Thus the standard deviation (divided by √3) and the range (divided by 3) can be used as approximate measures of the reliability of the sample mean concentration due to both of these sources of variation.
Since seven elements are used to measure the degree of similarity, there are seven different values of CSi and PSi, and hence seven summary statistics for each bullet. To denote this we sometimes use the notation CSi (As) to indicate the average for the ith bullet fragment for arsenic, for example, with similar notation for the other above statistics and the other elements.
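The per-element summary statistics described above can be sketched in a few lines. This is an illustrative sketch only: the function name `summarize` and the concentration values are hypothetical, and the code assumes the nominal design of three pieces measured in triplicate for a single element.

```python
import math


def summarize(triplicates):
    """Summarize one element for one bullet.

    triplicates: list of pieces, each a list of replicate measurements.
    """
    # Average the replicates on each piece to get the piece means CS_i.
    piece_means = [sum(piece) / len(piece) for piece in triplicates]
    n = len(piece_means)
    avg = sum(piece_means) / n
    # Within-bullet standard deviation of the piece means (denominator n - 1 = 2).
    sd = math.sqrt(sum((m - avg) ** 2 for m in piece_means) / (n - 1))
    spread = max(piece_means) - min(piece_means)
    return {
        "piece_means": piece_means,
        "avg": avg,
        "sd": sd,
        "range": spread,
        # Two approximate measures of the variability of the overall mean.
        "sd_over_sqrt_n": sd / math.sqrt(n),
        "range_over_n": spread / n,
    }


# Hypothetical arsenic measurements for a CS bullet (three pieces, three replicates each).
cs_as = [
    [0.0120, 0.0122, 0.0121],
    [0.0125, 0.0123, 0.0124],
    [0.0119, 0.0121, 0.0120],
]
stats = summarize(cs_as)
print(stats)
```

In practice the same function would be applied to each of the seven elements and to both the CS and the PS bullet.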
Assessment of Match Status
As stated above, in a standard application the FBI would measure each of these seven elements three times in each of three samples from the CS bullet and again from the PS bullet. The FBI presented to the committee three statistical approaches to judge whether the concentrations of these seven elements in the two bullets are sufficiently close to assert that they match, or are sufficiently different to assert a non-match. The three statistical procedures are referred to as: (1) 2-SD overlap, (2) range overlap, and (3) chaining. The crucial issues that the panel examined for the three statistical procedures are their operating characteristics, i.e., how often bullets from the same CIVL are identified as not matching, and how often bullets from different CIVLs are identified as matching. We describe each of these procedures in turn. Later, the probability of falsely asserting a match or a non-match is examined directly for the first two procedures, and indirectly for the last.
2-SD Overlap First, consider one of the seven elements, say arsenic. If the absolute value of the difference between the average compositions of arsenic for the CS bullet and the PS bullet is less than twice the sum of the standard deviations for the CS and the PS bullets, that is, if |avg(CS) − avg(PS)| < 2(sd(CS) + sd(PS)), then the bullets are judged as matching for arsenic. Mathematically, this is the same criterion as having the 95 percent6 confidence interval for the overall average arsenic concentration for the CS bullet overlap the corresponding 95 percent confidence interval for the PS bullet. This computation is repeated, in turn, for each of the seven elements. If the two bullets match using this criterion for all seven elements, the bullets are deemed a match; otherwise they are deemed a non-match.7
6
The 95 percent confidence interval for the difference of the two means, which is a more relevant construct for assessing match status, would utilize the square root of the variance of this difference, which is the square root of the sum of the two individual variances divided by the sample size for each mean (here, 3), not the sum of the standard deviations.
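The 2-SD overlap rule can be sketched as follows. This is a minimal illustration, not the FBI's software: the function names and concentrations are hypothetical, and each bullet is represented by its three piece averages per element. The helper `se_of_difference` computes the quantity that footnote 6 identifies as the more relevant measure of variability for the difference of two means.

```python
import math
import statistics


def two_sd_match(cs_means, ps_means):
    """2-SD overlap for one element: |avg(CS) - avg(PS)| < 2(sd(CS) + sd(PS))."""
    diff = abs(statistics.mean(cs_means) - statistics.mean(ps_means))
    return diff < 2 * (statistics.stdev(cs_means) + statistics.stdev(ps_means))


def bullets_match_2sd(cs, ps):
    """Bullets match only if every element passes the 2-SD overlap test."""
    return all(two_sd_match(cs[element], ps[element]) for element in cs)


def se_of_difference(cs_means, ps_means):
    """Standard error of avg(CS) - avg(PS), assuming equal sample sizes."""
    n = len(cs_means)
    total_var = statistics.variance(cs_means) + statistics.variance(ps_means)
    return math.sqrt(total_var / n)


# Hypothetical piece averages for two elements.
cs = {"As": [0.0121, 0.0124, 0.0120], "Sb": [0.0300, 0.0302, 0.0301]}
ps = {"As": [0.0126, 0.0129, 0.0127], "Sb": [0.0400, 0.0402, 0.0401]}

diff_as = abs(statistics.mean(cs["As"]) - statistics.mean(ps["As"]))
print("As passes 2-SD overlap:", two_sd_match(cs["As"], ps["As"]))
print("As difference in units of its standard error:",
      diff_as / se_of_difference(cs["As"], ps["As"]))
print("All elements match:", bullets_match_2sd(cs, ps))
```

In this made-up example the arsenic means pass the 2-SD overlap test even though they differ by more than three standard errors of the difference, which illustrates why the sum of the standard deviations is a generous yardstick.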
Range Overlap The procedure for range overlap is similar to that for the 2-standard deviation overlap, except that instead of determining whether 95 percent confidence intervals overlap, one determines whether the intervals defined by the minimum and maximum measurements overlap. Formally, the two bullets are considered as matching on, say, arsenic, if both max(CS1,CS2,CS3) > min(PS1,PS2,PS3), and min(CS1,CS2,CS3) < max(PS1,PS2,PS3). Again, if the two bullets match using this criterion for each of the seven elements, the bullets are deemed a match; otherwise they are deemed a non-match.
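The range overlap rule is simple enough to state directly in code. The sketch below is illustrative only; the function names and the piece averages are hypothetical.

```python
def range_overlap_match(cs_means, ps_means):
    """Range overlap for one element: the [min, max] intervals must intersect."""
    return max(cs_means) > min(ps_means) and min(cs_means) < max(ps_means)


def bullets_match_range(cs, ps):
    """Bullets match only if the ranges overlap for every element."""
    return all(range_overlap_match(cs[element], ps[element]) for element in cs)


# Hypothetical piece averages for arsenic.
cs_as = [0.0121, 0.0124, 0.0120]
ps_far = [0.0126, 0.0129, 0.0127]    # lowest PS value lies above the highest CS value
ps_near = [0.0123, 0.0129, 0.0127]   # lowest PS value falls inside the CS range

print(range_overlap_match(cs_as, ps_far))    # ranges do not overlap
print(range_overlap_match(cs_as, ps_near))   # ranges overlap
```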
Chaining The description of chaining as presented in the FBI Laboratory document Comparative Elemental Analysis of Firearms Projectile Lead by ICP-OES is included here as a footnote.8 There are several different interpretations of this language that would lead to different statistical methods. We provide a description here of a specific methodology that is consistent with the ambiguous FBI description. However, it is important that the FBI provide a rigorous definition of chaining so that it can be properly evaluated prior to use.
7
The characterization of the 2-SD procedure here is equivalent to the standard description provided by the FBI. The equivalence can be seen as follows. Overlap is not occurring when either avg(CS) + 2sd(CS) < avg(PS) − 2sd(PS) or avg(PS) + 2sd(PS) < avg(CS) − 2sd(CS), which can be rewritten avg(PS) − avg(CS) > 2(sd(CS) + sd(PS)) or avg(CS) − avg(PS) > 2(sd(CS) + sd(PS)), which is equivalent to the single expression |avg(CS) − avg(PS)| > 2(sd(CS) + sd(PS)).
8
a. CHARACTERIZATION OF THE CHEMICAL ELEMENT DISTRIBUTION IN THE KNOWN PROJECTILE LEAD POPULATION The mean element concentrations of the first and second specimens in the known material population are compared based upon twice the measurement uncertainties from their replicate analysis. If the uncertainties overlap in all elements, they are placed into a composition group; otherwise they are placed into separate groups. The next specimen is then compared to the first two specimens, and so on, in the same manner until all of the specimens in the known population are placed into compositional groups. Each specimen within a group is analytically indistinguishable for all significant elements measured from at least one other specimen in the group and is distinguishable in one or more elements from all the specimens in any other compositional group. (It should be noted that occasionally in groups containing more than two specimens, chaining occurs. That is, two specimens may be slightly separated from each other, but analytically indistinguishable from a third specimen, resulting in all three being included in the same compositional group.)
b. COMPARISON OF UNKNOWN SPECIMEN COMPOSITION(S) WITH THE COMPOSITION(S) OF THE KNOWN POPULATION(S): The mean element concentrations of each individual questioned specimen are compared with the element concentration distribution of each known population composition group. The concentration distribution is based on the mean element concentrations and twice the standard deviation of the results for the known population composition group. If all mean element concentrations of a questioned specimen overlap within the element concentration distribution of one of the known material population groups, that questioned specimen is described as being “analytically indistinguishable” from that particular known group population.
Chaining is defined for a situation in which one has a population of reference bullets. (Such a population should be collected through simple random sampling from the appropriate subpopulation of bullets relevant to a particular case, which to date has not been carried out, perhaps because an “appropriate” subpopulation would be very difficult to define, acquire, and test.) Chaining involves the formation of compositionally similar groups of bullets. This is done by first assuming that each bullet is distinct and forms its own initial “compositional group.” One of these bullets from the reference population is selected.9 This bullet is compared to each of the other bullets in the reference population to determine whether it is a match using the 2-SD overlap procedure.10, 11 When the bullet is determined to match another bullet, their compositional groups are collapsed into a single compositional group. This process is repeated for the entire reference set. The remaining bullets are similarly compared to each other. In this way, the compositional groups grow larger and the number of such groups decreases.
This process is repeated, matching all of the bullets and groups of bullets to the other bullets and groups of bullets, until the entire reference population of bullets has been partitioned into compositional groups (some of which might still include just one bullet). Presumably, the intent is to join bullets into groups that have been produced from similar manufacturing processes. When the process is concluded, every bullet in any given compositional group matches at least one other bullet in that group, and no two bullets from different groups match.
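One way to read the grouping step just described is as a search for connected components: any two bullets that pass the 2-SD overlap test are linked, and linked bullets end up in the same compositional group. The sketch below follows that interpretation under stated assumptions. It compares every bullet with every other bullet (so the order of selection is immaterial), represents each bullet by a hypothetical (average, standard deviation) pair per element, and uses a union-find structure that is our choice of implementation, not something specified by the FBI.

```python
def match_2sd(a, b):
    """2-SD overlap on all elements; a and b map element -> (avg, sd)."""
    return all(abs(a[e][0] - b[e][0]) < 2 * (a[e][1] + b[e][1]) for e in a)


def chain_groups(bullets):
    """Partition bullets into compositional groups by chaining pairwise matches."""
    parent = {bullet_id: bullet_id for bullet_id in bullets}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(bullets)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if match_2sd(bullets[a], bullets[b]):
                parent[find(a)] = find(b)   # collapse the two groups into one

    groups = {}
    for bullet_id in ids:
        groups.setdefault(find(bullet_id), set()).add(bullet_id)
    return list(groups.values())


# Hypothetical reference set with a single element, for illustration.
reference = {
    "A": {"As": (1.0, 0.1)},
    "B": {"As": (1.3, 0.1)},
    "C": {"As": (1.6, 0.1)},
    "D": {"As": (5.0, 0.1)},
}
groups = chain_groups(reference)
print(sorted(sorted(g) for g in groups))
```

Here A matches B and B matches C, so all three share a group even though A and C do not match each other directly.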
The process to this point involves only the reference set. Once the compositional groups have been formed, let us denote the chemical composition (for one of the seven elements of interest) of the kth bullet in a given compositional group as CG(k), k = 1, ..., K. Then the compositional group average and the compositional group standard deviation12 are computed for this compositional group (assuming K members) as follows, for each element:
$\mathrm{avg}(CG) = \frac{1}{K} \sum_{k=1}^{K} CG(k)$ and $\mathrm{sd}(CG) = \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} \left(CG(k) - \mathrm{avg}(CG)\right)^2}$.
9
Assuming all bullets are ultimately compared to all other bullets, the order of selection of bullets is immaterial. Otherwise, the order can make a difference.
10
The range overlap procedure could also be used.
11
In the event that all three measurements for a bullet are identical, and hence the standard deviation is zero, the FBI specifies a minimum standard deviation and range for use in the computations.
12
Note that the standard deviation of a compositional group with one member cannot be defined.

Now, suppose that one has collected data for CS and PS bullets and one is interested in determining whether they match. If, for any compositional group, |avg(CS) − avg(CG)| ≤ 2sd(CG) for all seven elements, then the CS bullet is considered to be a match with that compositional group. (Note that the standard deviation of CS is not used.) If using the analogous computation, the PS bullet is also found to be a match with the same compositional group, then the CS and the PS bullets are considered to be a match.
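The comparison of a CS or PS bullet with a compositional group can be sketched as below. This is an illustration under assumptions: the function names and data are hypothetical, sd(CG) is taken to be the usual sample standard deviation with denominator K − 1, and groups with a single member are skipped because their standard deviation is undefined.

```python
import statistics


def matches_group(bullet_avg, group_avgs):
    """Test |avg(bullet) - avg(CG)| <= 2 sd(CG) for every element.

    bullet_avg: dict element -> overall average for the CS or PS bullet.
    group_avgs: list of dicts, one per group member, element -> average.
    """
    if len(group_avgs) < 2:
        raise ValueError("sd(CG) is undefined for a group with one member")
    for element, value in bullet_avg.items():
        members = [m[element] for m in group_avgs]
        centre = statistics.mean(members)
        spread = statistics.stdev(members)   # assumed K - 1 denominator
        if abs(value - centre) > 2 * spread:
            return False
    return True


def cs_ps_match_via_groups(cs, ps, groups):
    """CS and PS match if both match the same compositional group."""
    return any(
        len(g) >= 2 and matches_group(cs, g) and matches_group(ps, g)
        for g in groups
    )


# Hypothetical group of three reference bullets, single element.
group = [{"As": 1.0}, {"As": 1.3}, {"As": 1.6}]
cs = {"As": 0.8}
ps = {"As": 1.85}
print(cs_ps_match_via_groups(cs, ps, [group]))
```

In this example the CS and PS averages differ by 1.05, more than three times the spacing between neighbouring group members, yet both fall within two group standard deviations of the group average and so are declared a match.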
This description leaves some details of implementation unclear. (Note that the 7-dimensional shapes of the compositional groups may have odd features; one could even be completely enclosed in another.) First, since sd(CG) is undefined for groups of size one, it is not clear how to test whether the CS or PS bullet matches a compositional group of one member. Second, it is not clear what happens if the CS or the PS bullet matches more than one compositional group. Third, it is not clear what happens when neither the CS nor the PS bullet matches any compositional group.
An important feature of chaining is that in forming the compositional groups with the reference population, if bullet A matches bullet B, and similarly if bullet B matches bullet C, bullet A may not match bullet C. (An example of the variety of bullets that can be matched is seen in Figure 3.1.) One could construct examples (which the panel has done using data provided by the FBI) in which large chains could be created and include bullets that have little compositionally in common with others in the same group. Further, a reference bullet with a large standard deviation across all seven chemical compositions has the potential of matching many other bullets. Having such a bullet in a compositional group could cause much of the non-transitivity13 just described.
Also, as more bullets are added to the reference set, any compositional groups that have been formed up to that point in the process may be merged if individual bullets in those compositional groups match. This merging may reduce the ability of the groups to separate new bullets into distinct groups. In an extreme case, one can imagine situations in which the whole reference set forms a single compositional group. The extent to which distinctly dissimilar bullets are assigned to the same compositional group in practice is not known, but clearly chaining can increase the rate of falsely asserting that two bullets match in comparison to the use of the 2-SD and range overlap procedures.
The predominant criticisms of all three of these procedures are that (1) the error rates for false matching and false non-matching are not known, even if one were to assume that the measured concentrations are normally distributed, and (2) these procedures are less efficient, again assuming (log) normally distributed data, in using the bullet lead data to make inferences about matching, than competing procedures that will be proposed for use below.
13
Non-transitivity is where A matches B, and B matches C, but A does not match C.
FIGURE 3.1 Illustration of chaining shows the 2-SD interval for bullet 1044 (selected at random) as first line in each set of elements, followed by the 2-SD interval for each of 41 bullets whose 2-SD intervals overlap with that of bullet 1044.
Distance Functions
In trying to determine whether two bullets came from the same CIVL, one uses the “distance” between the measurements as the starting point. For a single element, the distance may be taken as the difference between the values obtained in the laboratory. Because that difference depends, at least in part, on the degree of natural variation in the measurements, it should be adjusted by expressing it in terms of a standard unit, the standard deviation of the measurement. The standard deviation is not known, but can be estimated from either the present data set or data collected in the past. The form of the distance function is then:
$|X - Y| / s$,
where X and Y are the values measured for the CS and PS bullets and s is the estimate of the standard deviation.
The situation is more complicated when there are measurements on two separate elements in the bullets, though the basic concept is the same. One needs the two-dimensional distance between the measurements and the natural variability of that distance, which depends on the standard deviations of measurements of the two elements, and also on the correlation between them. To illustrate in simple terms, if one is perfectly correlated (or perfectly negatively correlated) with the other, the second conveys no new information, and vice versa. If one measurement is independent of the other, distance measures can treat each distance separately. In intermediate cases, the analyst needs to understand how the correlation between measurements affects the assessment of distance. One possible distance function is the largest difference for either of the two elements. A second distance function is to add the differences across elements; this is equivalent to saying that the difference between two street addresses when the streets are on a grid is the sum of the north-south difference plus the east-west difference. A third is to take the distance “as the crow flies,” or as one might measure it in a straight line on a map. This last definition of distance is in accord with many of our uses and ideas about distance, but might not be appropriate for estimates of (say) the time needed to walk from one place to another along the sidewalks. Other distance functions could also be defined. Again, we only care about distance and not direction, and for mathematical convenience we often work with the square of the distance function.
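The three distance functions described above (largest difference, sum of differences, and straight-line distance) can be compared on a small example. The sketch is illustrative: the function names and numbers are hypothetical, each difference is first expressed in standard-deviation units, and the correlation between elements is ignored, which is appropriate only when the measurements are independent.

```python
import math


def standardized_differences(x, y, s):
    """Per-element |x - y| / s, where s holds the estimated standard deviations."""
    return [abs(a - b) / sd for a, b, sd in zip(x, y, s)]


def largest_difference(d):
    """Largest standardized difference over the elements."""
    return max(d)


def city_block_distance(d):
    """Sum of the differences, like distance along a street grid."""
    return sum(d)


def straight_line_distance(d):
    """Distance 'as the crow flies' (Euclidean)."""
    return math.sqrt(sum(v * v for v in d))


# Hypothetical measurements on two elements, already in comparable units.
x = [3.0, 1.0]   # CS bullet
y = [0.0, 5.0]   # PS bullet
s = [1.0, 1.0]   # assumed standard deviations

d = standardized_differences(x, y, s)
print(largest_difference(d), city_block_distance(d), straight_line_distance(d))
```

The squared straight-line distance here is 25; working with the square avoids the square root, which is the mathematical convenience mentioned in the text.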
The above extends to three dimensions: One needs an appropriate function of the standard deviations and correlations among the measurements, as well as a specific way to define difference (e.g., if the measurements define two opposite corners of a box, one could use the largest single dimension of the box, the sum of the sides of the box, the distance in a straight line from one corner to the other, or some other function of the dimensions). Again, the distance is easier to use if it is squared.
These concepts extend directly to more than three measurements, though the physical realities are harder to picture. A specific squared distance function, known as Hotelling’s T², is generally preferred over other ways to define the difference between sets of measurements because it summarizes the information on all of the elements measured and provides a simple statistic that has small error under common conditions for assessing, in this application, whether the two bullets came from the same CIVL.
Technical details on the T² test
For any number d of dimensions (including one, two, three, or seven),
$T^2 = \frac{n}{2} (X - Y)' S^{-1} (X - Y)$,
where X is a vector of seven average measured concentrations on the CS bullet, Y is a vector of seven average measured concentrations on the PS bullet, ′ denotes matrix transposition, n = number of measurements in each sample mean (here, n = 3), and $S^{-1}$ = inverse of the 7 by 7 matrix of estimated variances and covariances.
Under the assumptions that
the measurements are normally distributed (if lognormal, then the logarithms of the measurements are normally distributed),
the matrix of variances and covariances is well-estimated, using ν degrees of freedom (for example, ν = 200, if three measurements are made on each of 100 bullets and the variances and covariances within each set of three measurements are pooled across the 100 bullets),
and the difference in the means of X and Y is δ = (δ1, …, δ7) and the standard deviation of X equals the standard deviation of Y equals (σ1, σ2, …, σ7),
then [(ν − 6)/(7ν)]T² should not exceed a critical value determined by the noncentral F distribution with p and ν degrees of freedom and noncentrality parameter, which is a function of δ, σ, and $S^{-1}$.
When ν = 400 degrees of freedom, and using the correlation matrix estimated from the data from one of the manufacturers of bullet lead (which measured six of the seven elements with ICP-OES; see Appendix F), and assuming that the measurement uncertainty on Cd is 5 percent and is uncorrelated with the others, the choice of the following critical values will provide a procedure with a false match
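Hotelling's T² as described above can be sketched in plain Python. The code below is a minimal illustration, not the committee's implementation. It assumes the equal-sample-size two-sample form T² = (n/2)(X − Y)′S⁻¹(X − Y), uses a small Gaussian-elimination solver in place of an explicit matrix inverse, and works on a hypothetical two-element example rather than the seven elements used in practice. The scaling factor (ν − p + 1)/(pν) reduces to (ν − 6)/(7ν) when p = 7.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):                       # back substitution
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def hotelling_t2(x, y, s, n):
    """T^2 = (n / 2) (x - y)' S^{-1} (x - y), with n measurements per sample mean."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(s, diff)                                   # z = S^{-1} (x - y)
    return (n / 2.0) * sum(d * zi for d, zi in zip(diff, z))


def scaled_statistic(t2, p, nu):
    """Scale T^2 for comparison with an F critical value."""
    return (nu - p + 1) / (p * nu) * t2


# Hypothetical two-element example.
x = [1.0, 2.0]                     # CS bullet averages
y = [0.0, 0.0]                     # PS bullet averages
s = [[1.0, 0.0], [0.0, 4.0]]       # assumed variance-covariance matrix
t2 = hotelling_t2(x, y, s, n=3)
print(t2, scaled_statistic(t2, p=2, nu=400))
```

Choosing the critical value against which the scaled statistic is compared requires the noncentral F distribution, which is outside the scope of this sketch.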

large groups will tend to have a number of bullets that as pairs may have concentrations that are substantially different.
To see the effect of chaining, consider bullet 1,044, selected at random from the 1,837-bullet data set. The data for this bullet are given in the first two lines of Table 3.11.
Bullet 1,044 matched 12 other bullets; that is, the 2-SD interval overlapped on all elements with the 2-SD interval for 12 other bullets. In addition, each of the 12 other bullets in turn matched other bullets; in total, 42 unique bullets were identified. The variability in the averages and the standard deviations of the 42 bullets would call into question the reasonableness of placing them all in the same compositional group. The overall average and average standard deviation of the 42 average concentrations of the 42 “matching” bullets are given in the third and fourth lines of Table 3.11. In all cases, the average standard deviations are at least as large as, and usually 3–5 times larger than, the standard deviation of bullet 1,044, and larger standard deviations are associated with wider intervals and hence more false matches. Although this illustration does not present a comprehensive analysis of the false match probability for chaining, it demonstrates that this method of assessing matches could possibly create more false matches than either the 2-SD-overlap or the range-overlap procedures.
One of the questions presented to the committee (see Chapter 1) was, “Can known variations in compositions introduced in manufacturing processes be used to model specimen groupings and provide improved comparison criteria?” Bullets from the major manufacturers at a specific point in time might be able to be partitioned based on the elemental compositions of bullets produced. However, there are variations in the manufacturing process by hour and by day, there are a large number of smaller manufacturers, and there may be broader trends in composition over time. These three factors will erode the boundaries between these partitions. Given this and the reasons outlined above, chaining is unlikely to serve the desired purposes of identifying matching bullets with any degree of reliability. In part due to the many diverse methods that could be applied, the panel has not examined other algorithms for partitioning or clustering bullets to determine whether they might overcome the deficiencies of chaining. FBI support for such a study may provide useful information and a more appropriate partitioning algorithm that has a lower false match rate than chaining appears to have.
TABLE 3.11 Elemental Concentrations for Bullet 1,044
                  As      Sb      Sn      Bi      Cu       Ag       Cd
Average           0.0000  0.0000  0.0000  0.0121  0.00199  0.00207  0.00000
SD                0.0002  0.0002  0.0002  0.0002  0.00131  0.00003  0.00001
Avg of 42 Avgs    0.0004  0.0004  0.0005  0.0110  0.00215  0.00208  0.00001
SD of 42 Avgs     0.0006  0.0005  0.0009  0.0014  0.00411  0.00017  0.00001

Alternative Testing Strategies
We have discussed the strategies used by the FBI to assess match status. An important issue is the substantial false match rate that occurs when using the 2-SD overlap procedure for bullets with elemental compositions that differ by amounts moderately larger than the within-bullet standard deviation. (This concern arises to a somewhat lesser degree for the range overlap procedure.) In addition, all three of the FBI’s procedures fail to represent current statistical practice, and as a result the data are not used as efficiently as they would be if the FBI were to adopt one of the alternative test strategies proposed for use here. A result of this inefficiency is either false match rates, false non-match rates, or both, that are larger than they could otherwise be.
This section describes alternative approaches to assessing the match status for two bullets, CS and PS, in a manner that makes effective and efficient use of the data collected, so that neither the false match nor the false non-match rates can be made smaller without an increase in the other, and so that estimates of these error rates can be calculated and the reliability of the assessment of match status can be determined.
The basic problem is to judge whether 21 numbers (each an average of three measurements), measuring seven elemental concentrations for each of three bullet fragments from the CS bullet, are far enough away from the analogous 21 numbers from the PS bullet to be consistent with the hypothesis that the mean concentrations of the CIVLs from which the bullets came are different, or are so close together that they are more consistent with the hypothesis that the CIVL means are the same. There are also other data available with information about the standard deviations and correlations of these measurements, and the use of this information is an important issue.
Let us consider one element to start. Again, we denote the three measurements on the CS and PS bullets CS1, CS2, CS3 and PS1, PS2, PS3, respectively. The basic question is whether three measurements of the concentrations of one of the seven elements from two bullets are sufficiently different to be consistent with the following hypothesis, or are sufficiently close to be inconsistent with that hypothesis: that the mean values for the elemental concentrations for the bullets manufactured from the same CIVL with given elemental concentrations, of which the PS bullet is a member, are different from the mean values for the elemental concentrations for the bullets manufactured from a different CIVL of which the CS bullet is a member.
Assuming that the measurements of any one element come from a distribution that is well-behaved (in the sense that wildly discrepant observations are extremely unlikely to occur), and assuming that the standard deviation of the CS measurements is the same as the standard deviation of the PS measurements, the standard statistic used to measure the closeness of the six numbers for this single element is the two-sample t-test:

$$t = \frac{|\overline{CS} - \overline{PS}|}{\sqrt{(s_{CS}^2 + s_{PS}^2)/3}},$$

where $\overline{CS}$ and $\overline{PS}$ are the averages of the three measurements on each bullet and $s_{CS}$ and $s_{PS}$ are the corresponding sample standard deviations. When additional data on the within-bullet standard deviation are available, whose use we strongly recommend here, the denominator is replaced with a pooled estimate of the assumed common standard deviation sp, resulting in the t-statistic

$$t = \frac{|\overline{CS} - \overline{PS}|}{s_p\sqrt{2/3}}.$$

To use t, one sets a critical value tα so that when t is smaller than tα the averages are considered so close that the hypothesis of a “non-match” must be rejected, and the bullets are judged to match, and when t is larger than tα the averages are considered to be so far apart that the bullets are judged to not match.
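As a minimal sketch of the single-element comparison just described, the following computes the absolute two-sample t statistic, either from the six measurements alone or with a pooled within-bullet standard deviation supplied by the caller. The function names and the sample values are hypothetical, and the measurements are assumed to be already log-transformed.

```python
import math
from statistics import mean, stdev


def t_statistic(cs, ps, pooled_sd=None):
    """Absolute two-sample t statistic for one element.

    cs, ps: replicate measurements (three each in this application).
    pooled_sd: optional within-bullet SD pooled over reference bullets;
    if omitted, the SD is estimated from the two bullets being compared.
    """
    n_cs, n_ps = len(cs), len(ps)
    if pooled_sd is None:
        # Equal-variance estimate from the six measurements only.
        ss = (n_cs - 1) * stdev(cs) ** 2 + (n_ps - 1) * stdev(ps) ** 2
        sd = math.sqrt(ss / (n_cs + n_ps - 2))
    else:
        sd = pooled_sd
    std_err = sd * math.sqrt(1.0 / n_cs + 1.0 / n_ps)
    return abs(mean(cs) - mean(ps)) / std_err


def judged_match(cs, ps, t_alpha, pooled_sd=None):
    """Match when the statistic falls below the chosen critical value."""
    return t_statistic(cs, ps, pooled_sd) < t_alpha


# Hypothetical log-concentrations for one element.
cs_example = [1.0, 2.0, 3.0]
ps_example = [2.0, 3.0, 4.0]
t_own = t_statistic(cs_example, ps_example)
t_pooled = t_statistic(cs_example, ps_example, pooled_sd=0.5)
```

With three replicates per bullet, the standard error reduces to the standard deviation times the square root of 2/3, so the two branches correspond to the two forms of the t statistic given in the text.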
Setting a critical value simultaneously determines a power function. For any given difference in the true mean concentrations for the CS and the PS bullets, δ, there is an associated probability of being judged a match and a probability of being judged a non-match. If δ equals 0, the probability of having t exceed the critical value tα is the probability of a false non-match. If δ is larger than 0, the probability of having t smaller than the critical value tα is the probability of a false match (as a function of δ).
As mentioned early in this chapter, one may also set two critical values to define three regions: match, no decision, and no match. Doing this may have important advantages in helping to achieve error rates that are more acceptable, at the expense of having situations for which no judgment on matching is made. When the assumptions given above hold (assuming use of the logarithmic transformation) and the data are normal, the two-sample t-test has several useful properties for the study of a single element, and is clearly the procedure of choice. In practice we can check how close to normality the bullet data or transformed bullet data appear to be, and if they appear to be close to normality with no outliers we can have confidence that our procedure will behave reasonably.
The spirit of the 2-SD overlap procedure is similar to the two-sample t-test for one element, but it results in an effectively much larger critical value than would ordinarily be used. The reason is that its “SD” is the sum of two standard deviations, SD(CS) + SD(PS), rather than an estimate of the standard deviation of the difference between the two sample means such as $\sqrt{(SD(CS)^2 + SD(PS)^2)/3}$; the sum substantially overestimates that standard deviation. This reduces the false non-match rate when the bullets are identical, and simultaneously increases false match rates when they are different.
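To make the size of this effect concrete, the sketch below expresses the 2-SD overlap threshold in units of the estimated standard error of the difference between two three-replicate means. The function name is hypothetical, and the calculation assumes the overlap rule declares a match when the absolute difference in means is at most twice the sum of the two standard deviations.

```python
import math


def effective_critical_value(sd_cs, sd_ps, n=3, width=2.0):
    """2-SD overlap threshold divided by the standard error of the difference
    between two n-replicate means."""
    overlap_threshold = width * (sd_cs + sd_ps)
    std_err = math.sqrt((sd_cs ** 2 + sd_ps ** 2) / n)
    return overlap_threshold / std_err


# When the two bullets have equal SDs the ratio does not depend on their size.
equal_sd_value = effective_critical_value(1.0, 1.0)  # sqrt(24), about 4.9
```

For equal standard deviations the overlap rule therefore behaves like a t-type test with a critical value near 4.9, compared with, for example, 2.776 for a two-sided 5% t-test with 4 degrees of freedom.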
To apply the two-sample t-test, the only remaining questions are (a) how to choose tα, and (b) how to estimate the common standard deviation of the measurement error. To estimate the common standard deviation using pooling, it would be necessary to carry out analysis of reference bullets to determine what factors were associated with heterogeneity in within-bullet standard deviations. Having done that, all reference bullets that could be safely assumed to have equal within-bullet standard deviations could be pooled using the following formula:

$$s_p = \sqrt{\frac{\sum_{i=1}^{K}(N_i - 1)\,s_i^2}{\sum_{i=1}^{K}(N_i - 1)}},$$

where si is the within-bullet standard deviation of the ith bullet, Ni is the number of replications for the ith bullet used in the computation (typically 3 here), and K is the total number of bullets used for pooling. When Ni is the same for all bullets (in this application likely N1 = N2 = … = NK = 3), sp is just the square root of the mean of the squared within-bullet standard deviations.
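The pooling step can be sketched as follows. The function name and the example values are hypothetical; the calculation is the usual degrees-of-freedom-weighted pooling of within-bullet variances, which is what the definitions of Ni and K above imply.

```python
import math


def pooled_sd(sds, counts):
    """Pooled within-bullet SD.

    sds: within-bullet standard deviations, one per reference bullet.
    counts: number of replicate measurements behind each SD (N_i).
    """
    numerator = sum((n - 1) * s ** 2 for s, n in zip(sds, counts))
    denominator = sum(n - 1 for n in counts)
    return math.sqrt(numerator / denominator)


# Hypothetical SDs from three reference bullets with three replicates each.
example_sp = pooled_sd([1.0, 2.0, 3.0], [3, 3, 3])
```

With equal replicate counts the weights cancel, so the result is the square root of the average within-bullet variance.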
Assuming that the measurements (after transforming using logarithms) are roughly normally distributed, tables exist that, given tα and δ, provide the false match and false non-match rates. (These are tables of the central and non-central t distribution.) Under the assumption of normality, the two-sample t-test has operating characteristics (the error rates corresponding to different values of δ) that are as small as possible. That is, given a specific δ, one cannot find a test statistic that has both a lower false match rate and a lower false non-match rate.
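The trade-off between the two error rates can be sketched numerically. The code below is an approximation under a stated assumption: the pooled standard deviation is treated as known (as it would nearly be when pooled over many reference bullets), so the normal distribution is swapped in for the central and non-central t distributions mentioned in the text. Function names and numerical values are hypothetical.

```python
import math
from statistics import NormalDist


def match_probability(delta, sd, t_alpha, n=3):
    """P(|t| < t_alpha) when the true mean difference is delta.

    Normal approximation: the within-bullet SD `sd` is treated as known,
    and each bullet contributes n replicate measurements.
    """
    std_err = sd * math.sqrt(2.0 / n)
    shift = delta / std_err
    z = NormalDist()
    return z.cdf(t_alpha - shift) - z.cdf(-t_alpha - shift)


def false_non_match_rate(sd, t_alpha, n=3):
    """Chance of declaring a non-match when the true means are equal."""
    return 1.0 - match_probability(0.0, sd, t_alpha, n)


def false_match_rate(delta, sd, t_alpha, n=3):
    """Chance of declaring a match when the true means differ by delta."""
    return match_probability(delta, sd, t_alpha, n)


# Hypothetical settings: SD of 1 unit, critical value 1.96.
fnm = false_non_match_rate(sd=1.0, t_alpha=1.96)
fm_small_gap = false_match_rate(delta=1.0, sd=1.0, t_alpha=1.96)
fm_large_gap = false_match_rate(delta=3.0, sd=1.0, t_alpha=1.96)
```

Under these assumptions, lowering tα reduces the false match rate at every δ and raises the false non-match rate, which is the trade-off the text describes.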
The setting of tα, which determines both error rates, is not a matter to be decided here, since it is not a statistical question. One can make the argument that the false match rate should be controlled to a level at which society is comfortable. In that case, one would take a particular value of δ, the difference between the CS and PS bullet concentrations that one finds important to discriminate between, and determine tα to obtain that false match rate, at the same time accepting the associated false non-match rate. Appropriate values of δ to use will depend on the situation, the manufacturer, and the type of bullet. Having an acceptable false match rate for values where the within-bullet standard deviation becomes unlikely to be a reasonably full explanation for a difference in means would be very beneficial. However, in this case it would still be essential to compute and communicate the false non-match rate, since greatly reducing the false match rate by making tα extremely small may result in an undesirable trade-off of one error rate versus the other.21 Further, if one cannot make both error rates as small as would be acceptable, then there may be non-standard steps that can be taken to decrease both error rates, such as taking more readings per bullet or decreasing the measurement error in the laboratory analysis. (This assumes that the main part of within-bullet variability is due to measurement error and not due to within-bullet heterogeneity, which has yet to be confirmed.)
21
It is unlikely for there to be testimony in cases in which there is a non-match, since the evidence will not be included in the case. However, determining this error rate would nevertheless still be valuable to carry out.

Now we add the complication that seven elements are used to make the judgment of match status. The 2-SD overlap procedure uses a unanimous vote for matching based on the seven individual assessments by element of match or non-match status. A problem is that several of the differences for the seven elements may each be close to, but smaller than, the 2-SD overlap criterion, yet in some collective sense, the differences are too large to be compatible with a match. The 2-SD overlap procedure provides no opportunity to accumulate these differences to reach a conclusion of “no match.”
To address this, assume first that the within-bullet correlations between elemental concentrations are all equal to zero. In that case, the theoretically optimal procedure, assuming multivariate normality, is to add the squares of the separate t-statistics for the seven elements and to use the sum as the test statistic. The distribution of this test statistic is well-known, and false match rates and false non-match rates can be determined for a range of possible critical values and vectors of separation, δ. (There is a separation vector, since with seven elements, to determine a false match rate, one must specify the true distances between the means for the bullets for each of the seven elements.) Again, under the assumptions given, this procedure is theoretically optimal statistically in the sense that no test statistic can have a simultaneously lower false non-match rate and lower false match rate, given a specific separation vector.
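A minimal sketch of this accumulation follows. It assumes uncorrelated elements and within-bullet standard deviations that are effectively known, so that the sum of seven squared statistics can be referred to a chi-square distribution with 7 degrees of freedom; the 5% level, the function names, and the example values are illustrative choices, not values from the report.

```python
CHI2_95_DF7 = 14.067  # tabulated upper 5% point of chi-square with 7 df


def sum_of_squared_t(t_values):
    """Combine per-element t statistics into a single distance-like statistic."""
    return sum(t * t for t in t_values)


def judged_match_uncorrelated(t_values, critical_value=CHI2_95_DF7):
    """Match when the accumulated squared differences stay below the cutoff."""
    return sum_of_squared_t(t_values) < critical_value


# Hypothetical case: every element is moderately discrepant, none extreme.
moderate = [1.8] * 7
small = [1.0] * 7
```

Seven statistics of 1.8 each would pass an element-by-element check against a cutoff of 2, yet their squares sum to about 22.7, well above the cutoff, so the combined statistic can register the collective discrepancy that a unanimous-vote rule misses.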
However, as seen from the 800-bullet data set, it is apparently not the case that the within-bullet measurements of elemental composition are uncorrelated. If the standard deviations and correlations could be well estimated, the theoretically optimal procedure, assuming multivariate normality, to judge the closeness of the 21 numbers from the CS and the PS bullets would be to use Hotelling’s T² statistic. However, there are three complications regarding the use of T². First, the within-bullet correlations and standard deviations have not, to date, been estimated accurately. Second, the T² statistic has best power against alternative hypotheses for which all of the mean elemental concentrations are different between the CS and the PS bullets. If this is not the case, T² averages the impact of the differences that exist over seven anticipated differences, thus reducing their impact. Given situations where only three or four of the elements exhibit differences, T² will have a relatively high false match error rate relative to procedures, like 2-SD overlap, that can key on one or two large differences. Third, T² is somewhat sensitive to large deviations from normality, and the bullet lead data do seem to have frequent outlying observations, whether from heterogeneity within bullets or inadequately controlled measurement methods.
Even given these concerns, once the needed within-bullet correlations have been well-estimated, and the non-(log) normality has been addressed, the use of T² should be preferred to the use of either the 2-SD overlap or the range overlap procedures. This is because T² retains the theoretical optimality properties of the simpler tests described above. (It is the direct analogue of the two-sample t-test in more than one dimension.) One way to describe the theoretical optimality of T², given data that are multivariate normal, is that, for any given critical value, T² defines a region of observed separation vectors that are the most probable if there were no difference between the means of the concentrations of the CS and the PS bullets.

Theoretical Optimality of T² Procedure
Hotelling’s T² uses the observations to calculate the following statistic: $T^2 = n(X - Y)' S^{-1} (X - Y)$, which, without the factor n, is known as the (squared) Mahalanobis distance. Here X and Y denote the vectors of average (log) concentrations for the two bullets and S the estimated within-bullet covariance matrix. This statistic, whether it is used in a formal test or not, has a theoretical optimality property. The same distance between the center (mean) and a contour appears in the exponent of the multivariate normal density. The statistic therefore defines “contours” of equal probability around the center of the distribution, and the contours are at lower and lower levels of probability as the statistic increases. This means that, if the observations are multivariate normal, as seems to be approximately the case for the logged concentrations in bullet lead, the probability is most highly concentrated within such a contour. No other function of the data can have this property. The practical result is that the T² statistic and the chosen critical value define a region around the observed values of the differences between the PS and the CS bullets that is as small as possible, so that the probability of falsely declaring a match is also as small as possible (given a fixed rate for the probability of false non-matches). This is a powerful argument in favor of using the T² statistic.
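The quadratic form behind the T² statistic can be sketched in plain Python. The function names are hypothetical, the covariance matrix is assumed to come from pooled reference data, and the multiplier (the n of the formula) is left to the caller as `scale`; for two bullets with three replicates each, the usual two-sample choice is 3·3/(3+3) = 1.5.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def hotelling_t2(mean_cs, mean_ps, cov, scale):
    """scale * (X - Y)' S^{-1} (X - Y) for mean vectors X, Y and covariance S."""
    diff = [x - y for x, y in zip(mean_cs, mean_ps)]
    weighted = solve_linear(cov, diff)  # S^{-1} (X - Y) without forming S^{-1}
    return scale * sum(d * w for d, w in zip(diff, weighted))


# Hypothetical two-element example with correlated measurement errors.
cov_example = [[1.0, 0.5], [0.5, 1.0]]
t2_example = hotelling_t2([1.0, 1.0], [0.0, 0.0], cov_example, scale=1.0)
```

With an identity covariance matrix the statistic reduces to the scaled sum of squared differences, which is the uncorrelated case discussed earlier; positive correlation between two elements reduces the weight given to differences that move in the same direction.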
The panel has identified an alternative to the use of the T² test statistic that retains some of the benefits of being derived from the univariate t-test statistic, but also has the advantage of being able to reject a match based on one moderately substantial difference in one dimension, which is an advantage of the 2-SD overlap procedure. This approach, which we will call the “successive t-test statistics” approach, is as follows:
(1) Estimate the within-bullet standard deviation for each element using a pooled within-bullet standard deviation sp from a large number of bullets, as shown above.
(2) Calculate, for each element, the difference between the means of the (log-transformed) measurements of the CS and the PS bullets.
(3) If the absolute differences are less than kαsp for each of the seven elements, for some constant kα, the bullets are deemed a match; otherwise they are a non-match.
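The steps above can be sketched as follows. The function name, the two-element example, and the value of the constant are hypothetical; the measurements are assumed to be log-transformed and the pooled standard deviations to have been estimated beforehand from reference bullets.

```python
from statistics import mean


def successive_t_match(cs, ps, pooled_sds, k_alpha):
    """Successive t-test decision.

    cs, ps: per-element lists of replicate (log) measurements.
    pooled_sds: pooled within-bullet SD for each element.
    k_alpha: multiplier applied to each pooled SD.
    Returns True (match) only if every element's mean difference is
    smaller in absolute value than k_alpha times its pooled SD.
    """
    for cs_reps, ps_reps, sp in zip(cs, ps, pooled_sds):
        if abs(mean(cs_reps) - mean(ps_reps)) >= k_alpha * sp:
            return False
    return True


# Hypothetical two-element illustration (the application uses seven elements).
cs_bullet = [[1.0, 1.1, 0.9], [2.0, 2.0, 2.0]]
ps_close = [[1.05, 1.0, 1.1], [2.1, 2.1, 2.1]]
ps_far = [[1.05, 1.0, 1.1], [2.5, 2.5, 2.5]]
sds = [0.1, 0.1]
```

Because the rule stops at the first element that fails, a single large difference is enough to declare a non-match, which is the property the approach shares with the 2-SD overlap procedure.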
Unfortunately, the estimation of false match rates and false non-match rates for the successive t-test statistics is complicated by the lack of independence of within-bullet measurements between the different elements. The panel carried out a number of simulations that support estimating the false match rate by raising the probability of a match for a single element to the fifth power (rather than the seventh power, which would be correct if the within-bullet measurements were independent). That is, raising the individual probabilities to the fifth power provides a reasonable approximation to the true error rates. This is somewhat ad hoc, but further analysis may show that for the modest within-bullet correlations in use, this is a reasonable approximation.22 In any event, for a specific separation vector, simulation studies can always be used to assess, to some degree of approximation, the false match and false non-match probabilities of this procedure. The advantage of the successive t-test statistics is that the approach has the ability to notice single large differences, but also retains the use of efficient measures of variability.
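The fifth-power approximation amounts to a one-line calculation, sketched here with a hypothetical function name and a hypothetical single-element match probability of 0.9, assumed for simplicity to be the same for every element.

```python
def approx_all_elements_match(p_single, effective_elements=5):
    """Approximate probability that all seven elements match, using the
    panel's ad hoc exponent of 5 in place of 7 to allow for correlation."""
    return p_single ** effective_elements


p_single = 0.9  # hypothetical probability of a match on one element
approx_correlated = approx_all_elements_match(p_single)        # 0.9 ** 5
if_independent = approx_all_elements_match(p_single, 7)        # 0.9 ** 7
```

The smaller exponent gives a larger overall match probability than independence would (about 0.59 versus about 0.48 here), reflecting the fact that positively correlated elements tend to agree or disagree together.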
Similar to the above, choices of kα form a parametric family of test procedures, each of which trades off one of the two error rates against the other. The choice of kα is again a policy matter that we will not discuss except to stress that whatever the choice of kα is, if the FBI adopts this suggested procedure, both the false match and the false non-match probabilities must be estimated and communicated in conjunction with the use of this evidence in court.
In summary, the two alternatives to the FBI’s test statistics advocated by the panel are the T² test statistic and the successive t-test statistics procedure. If the underlying data are approximately (log) normally distributed, and if pooled estimates, over an appropriate reference set of bullets, are available to estimate within-bullet standard deviations and within-bullet correlations, and finally, if all seven elements are relatively active in discriminating between the CS and the PS bullets, then T² is an excellent statistic for assessing match status. The successive t-test statistics procedure is somewhat less dependent on normality and can be used in situations in which a relatively small number of elements are active. However, quick assessment of error rates involves an approximation. Given the different strengths of these two procedures, there are good reasons to report both results. In addition, the FBI should examine the 71,000-bullet data set for recent data to see whether all seven elements now in use are routinely active, or whether there may be advantages from reducing the elements considered. This would be an extension of the panel’s work described above on the 1,837-bullet data set.
22
The FBI should remain open to the possibility, if the within-bullet correlations are higher than current estimates, of dropping one element of any pair involved in a very substantial correlation (over .9) to reduce the size of this problem, and should also consider the possibility of adding other elements if differences in those concentrations by manufacturer appear.
In the meantime, both of the recommended approaches have advantages over the use of the current FBI procedures. They are both based on more efficient univariate statistical tests, and they both allow direct estimation (in one case, approximate estimation) of the false match and false non-match rates. One procedure, successive t-test statistics, is better at identifying non-matching situations in which there are a few larger discrepancies for a subset of the seven elements, and the other, T², is better at identifying non-matching situations in which there are modest differences for all seven elements. In addition, if T² is to be used, given the small amount of data collected on the PS and the CS bullets, pooling across a reference data set of bullets to estimate the within-bullet standard deviations and correlations is vital to support this approach.23
If both of these procedures are adopted, the FBI must guard against the temptation to compute both statistics and report only the one showing the more favorable result.
We have stressed in several places that prior to use of these test procedures, the operating characteristics, i.e., the false match and false non-match rates, should be calculated and communicated along with the results of the specific match. (Even though non-matches are unlikely to be presented as evidence in court, knowing the false non-match error rate protects against setting critical values that too strongly favor one error rate against the other.) A different false match rate is associated with each non-zero separation vector δ (in seven dimensions). It is difficult to prescribe a specific set of separation vectors to use for this communication purpose. However, as in the univariate case, having an acceptable false match rate for separation vectors where the within-bullet standard deviations become unlikely to be a reasonably full explanation for differences in means would be very beneficial. It would also be useful to include a separation vector that demonstrates the performance of the procedure when not all mean concentrations for elements differ.
In addition, for any procedure that the FBI adopts, a much more comprehensive study of the procedure’s false non-match and false match rates should be carried out than can be summarized in a small number of false match rates.
In discussing the calculation of false match rates, the panel is devoting its attention to cases that are at least somewhat unclear, since those are the cases for which the choice of procedure is most important. However, for a large majority of bullet pairs that are clearly dissimilar, there would be strong agreement between the procedures that the FBI is using today and the two procedures recommended here as preferred alternatives.
23
There is a technical point here: in using pooled standard deviations and correlations to form the estimated covariance matrix for use with the T² test statistic, it is important to check that the resulting estimated covariance matrix is positive definite. This is unlikely to be a problem in this application.
Finally, the 2-SD and range overlap procedures, the T² test statistic, and to a lesser extent, the successive t-test statistics procedure, are all sensitive to the assumption of normality. By sensitive, we mean that the error rates computed under the assumption of (log) normality may be unrealistic if the assumption does not hold. Specifically, the presence of outlying values is likely to inflate the estimates of variability more than the differences in concentrations, so that more widely disparate bullet pairs will be found to match using these test statistics. (See Eaton and Efron, 1970; Holloway and Dunn, 1967; Chase and Bulgren, 1971; and Everitt, 1979, for the non-robustness of T².) The FBI could take two actions to address this sensitivity. First, if the non-normality is not a function of laboratory error or contamination or other sources that can be reduced over time, the FBI should use, in addition to the two procedures recommended here, a “robust” test procedure such as a permutation test, to see if there is agreement with the normal-theory based procedure. If there is agreement between the robust and non-robust procedures, one may safely report the results from the standard procedure. If, on the other hand, there is disagreement, the source of the disagreement would need to be investigated to see if outliers or other data problems were at fault. Second, if the non-normality may be a function of human error, the data should be examined prior to use to identify any discrepant measurements so that they can be repeated in order to replace the outlying observation. Identifying outliers from a sample of size three is not easy, but over time, procedures (such as control charts) could be identified that would be effective at determining when additional measurements would be valuable to take.
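As one possible form of such a robust check, the sketch below runs an exact permutation test on a single element, using the absolute difference in means as the test statistic. The function name and the data are hypothetical, and this particular construction is an assumption for illustration rather than a procedure specified in the text.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(cs, ps):
    """Exact two-sided permutation p-value for the difference in means.

    Every way of relabeling the pooled measurements into a 'CS' group of
    the original size is enumerated; the p-value is the share of relabelings
    whose absolute mean difference is at least the observed one.
    """
    pooled = list(cs) + list(ps)
    observed = abs(mean(cs) - mean(ps))
    count = 0
    total = 0
    for idx in combinations(range(len(pooled)), len(cs)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical measurements with complete separation between the bullets.
p_separated = permutation_p_value([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
p_identical = permutation_p_value([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

With three measurements per bullet there are only 20 relabelings, so the smallest attainable two-sided p-value is 2/20 = 0.1. A single-element permutation test on these sample sizes therefore cannot reject at conventional levels on its own, which is worth keeping in mind when comparing it with the normal-theory result.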
RECOMMENDATIONS
The largest source of error in the use of CABL is the unknown variability within the population of bullets in the United States due to variations within and across manufacturing processes. (The manufacturing process and its effect on the interpretation of CABL evidence is discussed in detail in Chapter 4.) This variability is not sufficiently taken into account by the statistical methods currently in use in the analysis of CABL data. In addition, the FBI’s methods are not representative of current statistical practice. Several steps can be taken to remedy these problems. A key need is the identification of statistical tests that have acceptable levels of rates of false matches and false non-matches. The committee has proposed a variety of analyses to increase understanding of the variability in the composition of bullet lead, and how to make better use of statistical methods in analyzing this information.
The discussion above supports the following recommendations.
Recommendation: The committee recommends that the FBI estimate within-bullet standard deviations on separate elements and correlations for element pairs, when used for comparisons among bullets, through use of pooling over bullets that have been analyzed with the same ICP-OES measurement technique. The use of pooled within-bullet standard deviations and correlations is strongly preferable to the use of within-bullet standard deviations that are calculated from the two bullets being compared. Further, estimated standard deviations should be charted regularly to ensure the stability of the measurement process; only standard deviations within control-chart limits are eligible for use in pooled estimates.
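One way the charting of standard deviations might be carried out is a Shewhart s-chart, sketched below. The choice of chart, the function names, and the data are assumptions for illustration; the constants B3 = 0 and B4 = 2.568 are the standard tabulated s-chart factors for subgroups of three measurements.

```python
from statistics import mean

B3_N3, B4_N3 = 0.0, 2.568  # standard s-chart factors for subgroups of size 3


def s_chart_limits(sds, b3=B3_N3, b4=B4_N3):
    """Lower limit, center line, and upper limit for within-bullet SDs."""
    s_bar = mean(sds)
    return b3 * s_bar, s_bar, b4 * s_bar


def in_control(sds):
    """Flag each SD as inside (True) or outside (False) the chart limits."""
    lower, _, upper = s_chart_limits(sds)
    return [lower <= s <= upper for s in sds]


# Hypothetical within-bullet SDs for one element; the last is unusually large.
example_sds = [0.010, 0.012, 0.011, 0.009, 0.060]
flags = in_control(example_sds)
eligible = [s for s, ok in zip(example_sds, flags) if ok]
```

In this sketch only the standard deviations inside the limits would be carried forward into a pooled estimate; in practice the limits would usually be set from a stable reference period rather than from data that include the suspect value.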
In choosing a statistical test to apply when determining a “match,” the goal was to choose a test that had good performance properties as measured by (1) its rate of false non-matches and (2) its rates of false matches, evaluated at a variety of separations between the concentrations of the CS and the PS bullets. The latter corresponds to the probability of providing false evidence of guilt, which our society views as important to keep extremely low.
Given arguments of statistical efficiency that translate into lower error rates, it is attractive to consider either the T² test statistic, or the successive t-test statistics procedure, since they are more representative of current statistical practice. The application of both procedures is illustrated using some sample data in Appendix K.
Recommendation: The committee recommends that the FBI use either the T² test statistic or the successive t-test statistics procedure in place of the 2-SD overlap, range overlap, and chaining procedures. The tests should use pooled standard deviations and correlations, which can be calculated from the relevant bullets that have been analyzed by the FBI Laboratory. Changes in the analytical method (protocol, instrumentation, and technique) will be reflected in the standard deviations and correlations, so it is important to monitor these statistics for trends and, if necessary, to recalculate the pooled statistics.
The committee recognizes that some work remains in order to provide additional rigor for the use of this testing methodology in criminal cases. Further exploration of the several issues raised in this chapter should be carried out. As part of this effort, it will be necessary to further mine the extant data resources on lead bullet composition to establish an empirical base for the methodology’s use. In addition, this analysis may discover deficiencies in the extant data resources, thereby identifying additional data collection that is needed.
Recommendation: To confirm the accuracy of the values used to assess the measurement uncertainty (within-bullet standard deviation) in each element, the committee recommends that a detailed statistical investigation using the FBI’s historical data set of over 71,000 bullets be conducted. To confirm the relative accuracy of the committee’s recommended approaches to those used by the FBI, the cases that match using the committee’s recommended approaches should be compared with those obtained with the FBI approaches, and causes of discrepancies between the two approaches (such as excessively wide intervals from larger-than-expected estimates of the standard deviation, data from specific time periods, or examiners) should be identified. As the FBI adds new bullet data to its 71,000+ data set, it should note matches for future review in the data set, and the statistical procedures used to assess match status.
No matter which statistical test is utilized by examiners, it is imperative that the same statistical protocol be applied in all investigations to provide a replicable procedure that can be evaluated.
Recommendation: The FBI’s statistical protocol should be properly documented and followed by all examiners in every case.
REFERENCES
Carriquiry, A.; Daniels, M.; and Stern, H. “Statistical Treatment of Case Evidence: Analysis of Bullet Lead,” Unpublished report, Dept. of Statistics, Iowa State University, 2002.
Chase, G.R., and Bulgren, W.G., “A Monte Carlo Investigation of the Robustness of T²,” Journal of the American Statistical Association, 1971, 66, pp. 499–502.
Eaton, M.L. and Efron, B., “Hotelling’s T² Test Under Symmetry Conditions,” Journal of the American Statistical Association, 1970, 65, pp. 702–711.
Everitt, B.S., “A Monte Carlo Investigation of the Robustness of Hotelling’s One- and Two-Sample T² Tests,” Journal of the American Statistical Association, 1979, 74, pp. 48–51.
Holloway, L.N. and Dunn, O.L., “The Robustness of Hotelling’s T²,” American Statistical Association Journal, 1967, pp. 124–136.
Owen, D.B. “Noncentral t distribution” in Encyclopedia of Statistical Sciences, Volume 6; Kotz, S.; Johnson, N. L.; and Read, C.B.; Eds.; Wiley: New York, NY 1985, pp 286–290.
Peele, E. R.; Havekost, D. G.; Peters, C. A.; Riley, J. P.; Halberstam, R. C.; and Koons, R. D. USDOJ, (ISBN 0-932115-12-8), 1991, 57.
Peters, C. A. Comparative Elemental Analysis of Firearms Projectile Lead by ICP-OES, FBI Laboratory Chemistry Unit. Issue date: Oct. 11, 2002. Unpublished (2002).
Peters, C. A. Foren. Sci. Comm. 2002, 4(3). <http://www.fbi.gov/hq/lab/fsc/backissu/july2002/peters.htm> as of Aug. 8, 2003.
Randich, E.; Duerfeldt, W.; McLendon, W.; and Tobin, W. Foren. Sci. Int. 2002, 127, 174–191.
Rao, C.R., Linear Statistical Inference and Its Applications: Wiley, New York, NY 1973.
Tiku, M. “Noncentral F distribution” in Encyclopedia of Statistical Sciences, Volume 6; Kotz, S.; Johnson, N. L.; and Read, C.B.; Eds.; Wiley: New York, NY 1985, pp 280–284.
Vardeman, S. B. and Jobe, J. M. Statistical Quality Assurance Methods for Engineers, Wiley: New York, NY 1999.
Wellek, S. Testing Statistical Hypotheses of Equivalence; Chapman and Hall: New York, NY 2003.