Appendix E

Basic Principles of Statistics^{1}
All measurements are subject to error. Analytical chemical measurements often have the property that the error is proportional to the value. Denote the i^{th} measurement on bullet k as X_{ik} (we will consider only one element in this discussion and hence drop the subscript j utilized in Chapter 3). Let µ_{xk} denote the mean of all measurements that could ever be taken on this bullet, and let ε_{ik} denote the error associated with this measurement. A typical model for analytical measurement error might be

X_{ik} = µ_{xk} · ε_{ik}
Likewise, for a given PS bullet measurement, Y_{ik}, with mean µ_{yk} and error in measurement η_{ik},

Y_{ik} = µ_{yk} · η_{ik}
Notice that if we take logarithms of each equation, these equations become additive rather than multiplicative in the error term:

log(X_{ik}) = log(µ_{xk}) + log(ε_{ik})
log(Y_{ik}) = log(µ_{yk}) + log(η_{ik})
Models with additive rather than multiplicative error are the basis for most statistical procedures. In addition, as discussed below, the logarithmic transformation yields more normally distributed data as well as transformed measurements with constant variance. That is, an estimate of log(µ_{xk}) is the logarithm of the sample average of the three measurements on bullet k, and a plot of these log(averages) shows more normally distributed values than a plot of the averages alone. We denote the variances of log(µ_{xk}) and log(µ_{yk}) as τ_{x}^{2} and τ_{y}^{2}, and the variances of the error terms log(ε_{ik}) and log(η_{ik}) as σ_{x}^{2} and σ_{y}^{2}, respectively. It is likely that the between-bullet variation is the same for the populations of both the CS and the PS bullets; therefore, since τ_{x}^{2} should be the same as τ_{y}^{2}, we will denote the between-bullet variance as τ^{2}. Similarly, if the measurements on both the CS and PS bullets were taken at the same time, their errors should also have the same variances; we will denote this within-bullet variance as σ_{x}^{2} = σ_{y}^{2}, or simply σ^{2} when we are concentrating on just the within-bullet (measurement) variability.
Thus, for three reasons—the nature of the error in chemical measurements, the approximate normality of the distributions, and the more constant variance (that is, the variance is not a function of the magnitude of the measurement itself)—logarithmic transformation of the measurements is advisable. In what follows, we will assume that x_{i} denotes the logarithm of the i^{th} measurement on a given CS bullet and one particular element, µ_{x} denotes the mean of these log(measurement) values, and ε_{i} denotes the error in this i^{th} measurement. Similarly, let y_{i} denote the logarithm of the i^{th} measurement on a given PS bullet and the same element, µ_{y} denote the mean of these log(measurement) values, and η_{i} denote the error in this i^{th} measurement.
NORMAL (GAUSSIAN) MODEL FOR MEASUREMENT ERROR
All measurements are subject to measurement error:

x_{i} = µ_{x} + ε_{i},  y_{i} = µ_{y} + η_{i}

Ideally, ε_{i} and η_{i} are small, but in all instances they are unknown from measured replicate to replicate. If the measurement technique is unbiased, we expect the mean of the measurement errors to be zero. Let σ_{x}^{2} and σ_{y}^{2} denote the measurement errors’ variances. Because µ_{x} and µ_{y} are assumed to be constant, and hence have variance 0, Var(x_{i}) = σ_{x}^{2} and Var(y_{i}) = σ_{y}^{2}. The distribution of measurement errors is often (not always) assumed to be normal (Gaussian). That assumption is often the basis of a convenient model for the measurements and implies that

P(µ_{x} − 1.96σ_{x} < x_{i} < µ_{x} + 1.96σ_{x}) = 0.95     (E.1)
if µ_{x} and σ_{x} are known (and likewise for y_{i}, using µ_{y} and σ_{y}). (The value 1.96 is often conveniently rounded to 2.) Moreover, the average of the three replicates, x̄ = (x_{1} + x_{2} + x_{3})/3, will also be normally distributed, also with mean µ_{x} but with a smaller variance, σ_{x}^{2}/3; therefore

P(µ_{x} − 1.96σ_{x}/√3 < x̄ < µ_{x} + 1.96σ_{x}/√3) = 0.95
Referring to Part (b) of the Federal Bureau of Investigation (FBI) protocol for forming “compositional groups” (see Chapter 3), its calculation of the standard deviation of the group is actually a standard deviation of averages of three measurements, or an estimate of σ_{x}/√3 in our notation, not of σ_{x}. In practice, however, µ_{x} and σ_{x} are unknown, and interest centers not on an individual x_{i} but rather on µ_{x}, the mean of the distribution of the measured replicates. If we estimate µ_{x} and σ_{x} using x̄ and s_{x} from only three replicates, as in the current FBI procedure, but still assume that the measurement error is normally distributed, then a 95 percent confidence interval for the true µ_{x} can be derived from Equation E.1 by rearranging the inequalities, using the correct multiplier (taken not from the Gaussian distribution, that is, not 1.96 as in Equation E.1, but from Student’s t distribution) and the correct standard deviation, s_{x}/√3, instead of s_{x}:

x̄ ± t_{2,0.025} · s_{x}/√3 = x̄ ± 4.303 · s_{x}/√3 = x̄ ± 2.484 · s_{x}
Use of the multiplier 2 instead of 2.484 yields a confidence coefficient of 0.926, not 0.95.
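The multiplier 2.484 and the 0.926 coverage can be checked directly; this sketch (not part of the report) uses the closed-form distribution functions of Student’s t with 2 degrees of freedom:

```python
# Check of the confidence-interval multiplier for n = 3 replicates,
# using closed forms for Student's t with 2 degrees of freedom:
#   CDF(x) = 1/2 + x / (2*sqrt(x**2 + 2))
#   ppf(p) = c*sqrt(2/(1 - c**2)), where c = 2p - 1
from math import sqrt

n = 3
c = 2 * 0.975 - 1
t_crit = c * sqrt(2 / (1 - c * c))      # t_{2, 0.025} = 4.303
mult = t_crit / sqrt(n)                 # correct multiplier for x̄ ± mult·s_x

x = 2 * sqrt(n)                         # the rounded multiplier 2, in t units
coverage = 2 * (0.5 + x / (2 * sqrt(x * x + 2))) - 1

print(round(mult, 3), round(coverage, 3))   # 2.484 0.926
```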
CLASSICAL HYPOTHESIS-TESTING: TWO-SAMPLE t STATISTIC
The present situation involves the comparison between the sample means x̄ and ȳ from two bullets. Classical hypothesis-testing states the null and alternative hypotheses as H_{0}: µ_{x} = µ_{y} versus H_{1}: µ_{x} ≠ µ_{y} (reversed from our situation), and states that the two samples of observations (here, x_{1}, x_{2}, x_{3} and y_{1}, y_{2}, y_{3}) are normally distributed as N(µ_{x}, σ^{2}) and N(µ_{y}, σ^{2}). Under those conditions, x̄, ȳ, and s_{p} are highly efficient estimates of µ_{x}, µ_{y}, and σ, respectively, where s_{p} is a pooled estimate of the standard deviation that is based on both samples:

s_{p}^{2} = [(n_{x} − 1)s_{x}^{2} + (n_{y} − 1)s_{y}^{2}] / (n_{x} + n_{y} − 2)     (E.2)
Evidence in favor of H_{1}: µ_{x} ≠ µ_{y} occurs when x̄ and ȳ are “far apart.” Formally, “far apart” is determined when the so-called two-sample t statistic (which, under H_{0}, has a central Student’s t distribution on n_{x} + n_{y} − 2 = 3 + 3 − 2 = 4 degrees of freedom) exceeds a critical point from this Student’s t_{4} distribution. To ensure a false null hypothesis rejection probability of no more than 100α%, where α is the probability of rejecting H_{0} when it is correct (that is, claiming “different” when the means are equal), we reject H_{0} in favor of H_{1} if

|x̄ − ȳ| / (s_{p} · √(1/n_{x} + 1/n_{y})) > t_{n_{x}+n_{y}−2,α/2}     (E.3)

where t_{n_{x}+n_{y}−2,α/2} is the value beyond which only 100 · α/2% of the Student’s t distribution (on n_{x} + n_{y} − 2 degrees of freedom) lies.
When n_{x} = n_{y} = 3, Equation E.3 reduces to:

|x̄ − ȳ| / (s_{p} · √(2/3)) > t_{4,α/2}     (E.4)
This procedure for testing H_{0} versus H_{1} has the following property: among all possible tests of H_{0} whose false rejection probability does not exceed α, this two-sample Student’s t test has the maximum probability of rejecting H_{0} when H_{1} is true (that is, has the highest power to detect when µ_{x} and µ_{y} are unequal). If the two-sample t statistic is less than this critical value (2.776 for α = 0.05), the interpretation is that the data do not support the hypothesis of different means. A larger critical value would reject the null hypothesis (“same means”) less often.
The FBI protocol effectively uses s_{x} + s_{y} in the denominator instead of s_{p} · √(2/3) and uses a “critical value” of 2 instead of 2.776. Simulation suggests that the distribution of the ratio (s_{x} + s_{y})/s_{p} has a mean of 1.334 (its 10%, 25%, 75%, and 90% quantiles are 1.198, 1.288, 1.403, and 1.413, respectively). Substituting 1.334 · s_{p} for s_{x} + s_{y} suggests that the approximate error in rejecting H_{0} when it is true for the FBI statistic would also be 0.05 if it used a “critical point” of 2.776 · √(2/3)/1.334 = 1.699. Replacing 1.334 with the quantiles 1.198, 1.288, 1.403, and 1.413 yields values of 1.892, 1.760, 1.616, and 1.604, respectively, all smaller than the FBI value of 2. The FBI value of 2 would correspond to an approximate error of 0.03. A larger critical value (smaller error) leads to fewer rejections of the null hypothesis; that is, the procedure is more likely to claim “equality” and less likely to claim “different.”
If the null hypothesis is H_{0}: µ_{x} − µ_{y} = δ (δ ≠ 0), the two-sample t statistic in Equation E.4 has a noncentral t distribution with noncentrality parameter (δ/σ) · √(n_{x}n_{y}/(n_{x} + n_{y})), which reduces to (δ/σ) · √(n/2) when n_{x} = n_{y} = n. When the null hypothesis is H_{0}: |µ_{x} − µ_{y}| = δ, the square of the two-sided two-sample t statistic (Equation E.4) has a noncentral F distribution with 1 and n_{x} + n_{y} − 2 = 2(n − 1) degrees of freedom and noncentrality parameter (δ/σ)^{2} · (n/2).
The use of Student’s t statistic is valid (that is, the probability of falsely rejecting H_{0} when the means µ_{x} and µ_{y} are truly equal is α) only when the x’s and y’s are normally distributed. The appropriate critical value (here, 2.776 for α = 0.05 and δ = 0) is different if the distributions are not normal, or if σ_{x} ≠ σ_{y}, or if H_{0}: |µ_{x} − µ_{y}| ≥ δ ≠ 0, or if (s_{x} + s_{y})/2 is used instead of s_{p} (Equation E.2), as in the FBI’s current statistical method. The test also has the highest power (the highest probability of claiming H_{1} when in fact µ_{x} ≠ µ_{y}), subject to the condition that the probability of erroneously rejecting H_{0} is no more than α.
The assumption “σ_{x} = σ_{y}” is probably reasonably valid if the measurement process is consistent from bullet sample to bullet sample: one would expect the error in measuring the concentration of a particular element for the crime scene (CS) bullet (σ_{x}) to be the same as that in measuring the concentration of the same element in the potential suspect (PS) bullet (σ_{y}). However, the normality assumption may be questionable here; as noted in Ref. 1, average concentrations for different bullets tend to be lognormally distributed, so that log(As average), for example, is approximately normal, as are the log(averages) of the other six elements. When the measurement uncertainty is very small (say, σ_{x} < 0.2), the lognormal distribution differs little from the normal distribution (Ref. 2), so these assumptions will be reasonably well satisfied for precise measurement processes. Only a few of the standard deviations in the data sets were greater than 0.2 (see the section titled “Description of Data Sets” in Chapter 3).
The case of CABL differs from the classical situation primarily in the reversal of the null and alternative hypotheses of interest. That is, the null hypothesis here is H_{0}: µ_{x} ≠ µ_{y} vs H_{1}: µ_{x} = µ_{y}. We accommodate the difference by stating a specific relative difference between µ_{x} and µ_{y}, say |µ_{x} − µ_{y}| = δ, and rely on the noncentral F distribution as mentioned above.
EQUIVALENCE t TESTS^{2}
An equivalence t test is designed to handle our situation:
H_{0}: means are different.
H_{1}: means are similar.
Those hypotheses are quantified more precisely as

H_{0}: |µ_{x} − µ_{y}| ≥ δ  vs  H_{1}: |µ_{x} − µ_{y}| < δ
We must choose a value of δ that adequately reflects the condition that “two bullets came from the same compositionally indistinguishable volume of material (CIVL), subject to specification limits on the element given by the manufacturer.” For example, if the manufacturer claims that the Sb concentrations in a given lot of material are 5% ± 0.20%, a value of δ = 0.20 might be deemed reasonable. The test statistic is still the two-sample t as before, but now we reject H_{0} if x̄ and ȳ are too close. As before, we ensure that the false match probability cannot exceed a particular value by choosing a critical value so that the probability of falsely rejecting H_{0} (falsely claiming a “match”) is no greater than α (here, we will choose α = 1/2,500 = 0.0004, for example). The equivalence test has the property that, subject to false match probability ≤ α = 0.0004, the probability of correctly rejecting H_{0} (that is, claiming that two bullets match when the means of the batches from which the bullets came differ by less than δ) is maximized. The left panel of Figure E.1 shows a graph of the distribution of the difference x̄ − ȳ under the null hypothesis that δ/σ = 0.25 (that is, either µ_{x} − µ_{y} = −0.25σ or µ_{x} − µ_{y} = +0.25σ) and n = 100 fragment averages in each sample, subject to false match probability ≤ 0.05; the equivalence test in this case rejects H_{0} when the t statistic falls below the corresponding critical value. The right panel of Figure E.1 shows the power of this test: when the true difference is zero, the probability of correctly rejecting the null hypothesis (“means differ by more than 0.25σ”) is about 0.60, whereas the probability of rejecting the null hypothesis when the true difference equals 0.25σ is only 0.05 (as it should be, given the specifications of the test). Figure E.1 is based on the information given in Wellek (Ref. 3); similar figures apply for the case when α = 0.0004, n = 3 measurements in each sample, and δ/σ = 1 or 2.

^{2} Note that the form of this test is referred to as successive t-test statistics in Chapter 3. In that description, the setting of error rates is not prescribed.
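One simple way to sketch this logic in code is the two one-sided tests (TOST) construction, which claims a “match” only when both one-sided t tests reject. This is an illustrative approximation, not the exact procedure of Wellek (Ref. 3); the data and δ are hypothetical, and triplicates (4 degrees of freedom) are assumed:

```python
# TOST-style equivalence sketch for two triplicates (df = 4); an
# approximation to the exact equivalence test, with hypothetical data
# and delta, and alpha = 0.0004 as in the text.
from math import sqrt
from statistics import mean, stdev

def t4_cdf(t):
    """Closed-form CDF of Student's t with 4 degrees of freedom."""
    x = t / sqrt(t * t + 4)
    return 0.5 + 0.75 * x * (1 - x * x / 3)

def tost_match(x, y, delta, alpha=0.0004):
    """Claim a 'match' only if both one-sided tests reject at level alpha."""
    nx, ny = len(x), len(y)                     # both 3, so df = 4
    sp = sqrt(((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2)
              / (nx + ny - 2))
    se = sp * sqrt(1 / nx + 1 / ny)
    d = mean(x) - mean(y)
    p_lower = 1 - t4_cdf((d + delta) / se)      # H0: mu_x - mu_y <= -delta
    p_upper = t4_cdf((d - delta) / se)          # H0: mu_x - mu_y >= +delta
    return max(p_lower, p_upper) < alpha

print(tost_match([4.61, 4.63, 4.60], [4.62, 4.61, 4.63], delta=0.2))  # True
```

Note that the burden of proof is reversed relative to the classical test: the two samples must agree closely enough, relative to both δ and the measurement noise, before “match” is claimed.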
DIGRESSION: LOGNORMAL DISTRIBUTIONS
This section explains two benefits of transforming measurements via logarithms for the statistical analysis.
The standard deviations of measurements made with inductively coupled plasma-optical emission spectroscopy are generally proportional to their means; hence, one typically refers to relative error, or coefficient of variation, σ_{x}/µ_{x}, sometimes expressed as a percentage, 100 · σ_{x}/µ_{x}%. When the measurements are transformed first via logarithms, the standard deviation of the log(measurements) is approximately, and conveniently, equal to the coefficient of variation (COV), sometimes called relative error (RE), in the original scale. This can be seen easily through standard propagation-of-error formulas (Refs. 4, 5), which rely on a first-order Taylor series expansion of the transformation f (here, the natural logarithm) about the mean in the original scale,

f(X) ≈ f(µ_{x}) + f′(µ_{x}) · (X − µ_{x}),

so that

Var[f(X)] ≈ [f′(µ_{x})]^{2} · Var(X)

because the variance of a constant (such as µ_{x}) is zero. Letting f(X) = log(X), so that f′(µ_{x}) = 1/µ_{x}, it follows that

Var[log(X)] ≈ σ_{x}^{2}/µ_{x}^{2},  that is,  SD[log(X)] ≈ σ_{x}/µ_{x} = COV
Moreover, the distribution of the logarithms for each element tends to be more normal than that of the raw data. Thus, to obtain more-normally distributed data and, as a by-product, a simple calculation of the COV, the data should first be transformed via logarithms. Approximate confidence intervals are calculated in the log scale and then can be transformed back to the original scale via the antilogarithm, exp(·).
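The approximation SD[log(X)] ≈ COV is easy to check by simulation; this sketch (illustrative values, not from the text) draws lognormal data with a 5 percent relative error:

```python
# Check that the standard deviation of the logs approximates the
# coefficient of variation; hypothetical lognormal data, 5% relative error.
import math
import random

random.seed(1)
xs = [random.lognormvariate(math.log(50), 0.05) for _ in range(100_000)]

mean_x = sum(xs) / len(xs)
sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in xs) / (len(xs) - 1))

logs = [math.log(v) for v in xs]
mean_l = sum(logs) / len(logs)
sd_log = math.sqrt(sum((v - mean_l) ** 2 for v in logs) / (len(logs) - 1))

print(round(sd_x / mean_x, 3), round(sd_log, 3))   # both close to 0.05
```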
DIGRESSION: ESTIMATING σ^{2} WITH POOLED VARIANCES
The FBI protocol for statistical analysis estimates the variances of the triplicate measurements in each bullet with only three observations, which leads to highly variable estimates (the estimates can range over a factor of 10, 20, or even more). Assuming that the measurement variation is the same for both the PS and CS bullets, the classical two-sample t statistic pools the variances into s_{p}^{2} (Equation E.2), which has four degrees of freedom and is thus more stable than either individual s_{x}^{2} or s_{y}^{2} alone (each based on only two degrees of freedom). The pooled variance need not rely on only the six observations from the two samples if the within-replicate variance is the same for several bullets. That condition is likely to hold if bullets are analyzed with a consistent measurement process. If three measurements are used to calculate each within-replicate standard deviation s_{k} from each of, say, B bullets, a better, more stable estimate of σ^{2} is

s_{pooled}^{2} = (s_{1}^{2} + s_{2}^{2} + ⋯ + s_{B}^{2})/B
Such an estimate of σ^{2} is based on not just 2(2) = 4 degrees of freedom but rather 2B degrees of freedom. A stable and repeatable measurement process offers many estimates of σ^{2} from the many bullets analyzed by the laboratory over several years; the within-replicate variances from those bullets may be used in the above equation. To verify the stability of the measurement process, standard deviations should be plotted in a control-chart format (s-chart) (Ref. 7), with limits that, if exceeded, indicate a change in precision. Standard deviations that fall within the limits should be pooled as in the equation above. Using pooled standard deviations guards against the possibility of claiming a match simply because the measurement variability on a particular day happened to be large by chance, creating wider intervals and hence greater chances of overlap.
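The pooling can be illustrated with hypothetical triplicates from B = 3 bullets; because every bullet contributes the same number of degrees of freedom, the pooled variance is simply the average of the per-bullet variances:

```python
# Pooling within-bullet variances across B bullets measured in triplicate.
# With equal degrees of freedom per bullet, the pooled variance is the
# simple average of the per-bullet sample variances. Data are hypothetical.
from statistics import variance

bullets = [
    [4.61, 4.63, 4.60],
    [4.70, 4.68, 4.71],
    [4.55, 4.57, 4.56],
]

B = len(bullets)
pooled_var = sum(variance(b) for b in bullets) / B
df = sum(len(b) - 1 for b in bullets)   # 2B degrees of freedom
print(df)   # 6
```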
To determine whether a given standard deviation, say, s_{g}, might be larger than the s_{p} determined from measurements on B previous bullets, one can compare the ratio s_{g}^{2}/s_{p}^{2} with an F distribution on 2 and 2B degrees of freedom. Assuming that the FBI has as many as 500 such estimates, the 5% critical point from an F distribution on 2 and 1,000 degrees of freedom is 3.005. Thus, if a given standard deviation is more than √3.005 ≈ 1.73 times larger than the pooled standard deviation for that element, one should consider remeasuring that element, in that the variability may be larger than expected by chance alone (5% of the time).
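The critical point quoted above can be reproduced without tables, because the F distribution with 2 numerator degrees of freedom has a closed-form tail; this check is illustrative:

```python
# Reproduce the 5% critical point of F(2, 2B) via the closed form
# P(F(2, m) > q) = (1 + 2q/m)**(-m/2), with B = 500 previous bullets.
import math

B = 500
m = 2 * B                                   # denominator degrees of freedom
crit = (m / 2) * (0.05 ** (-2 / m) - 1)     # 5% point of F(2, 1000)
print(round(crit, 3))                       # 3.005
print(round(math.sqrt(crit), 2))            # 1.73: flag s_g > 1.73 * s_p
```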
REFERENCES
1. Carriquiry, A.; Daniels, M.; and Stern, H. “Statistical Treatment of Case Evidence: Analysis of Bullet Lead”, Unpublished report, 2002.
2. Antle, C.E. “Lognormal distribution” in Encyclopedia of Statistical Sciences, Vol 5, Kotz, S.; Johnson, N. L.; and Read, C. B., Eds.; Wiley: New York, NY, 1985, pp. 134–136.
3. Wellek, S. Testing Statistical Hypotheses of Equivalence; Chapman and Hall: New York, NY, 2003.
4. Ku, H.H. Notes on the use of propagation of error formulas, Journal of Research of the National Bureau of Standards-C. Engineering and Instrumentation, 1966, 70C(4), 263–273. Reprinted in Precision Measurement and Calibration: Selected NBS Papers on Statistical Concepts and Procedures, NBS Special Publication 300, Vol. 1, H.H. Ku, Ed., 1969, pp. 331–341.
5. Cameron, J.E. “Error analysis” in Encyclopedia of Statistical Sciences, Vol 2, Kotz, S.; Johnson, N. L.; and Read, C. B., Eds., Wiley: New York, NY, 1982, pp. 545–541.
6. Mood, A.; Graybill, F.; and Boes, D. Introduction to the Theory of Statistics, Third Edition McGraw-Hill: New York, NY, 1974.
7. Vardeman, S. B. and Jobe, J. M. Statistical Quality Assurance Methods for Engineers, Wiley: New York, NY, 1999.