Appendix B
Performance Metrics for ASPs and PVTs

"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."

John W. Tukey (1962), "The Future of Data Analysis," Annals of Mathematical Statistics 33(1), pp. 1-67. (The quotation appears on p. 12.)

When evaluating the performance of instruments to identify the system best suited to a given task, one needs to choose the correct metric for the comparison. For systems such as the advanced spectroscopic portals (ASPs), conventional measures such as sensitivity and specificity provide useful information, but they do not directly assess test performance in actual field operation. The metrics of interest concern the probabilities of making incorrect calls: the probability that the cargo actually contained dangerous material when the test system allowed it to pass (a false negative call), and the probability that the cargo actually contained benign material when the test system alarmed (a false positive call). This appendix describes the calculations leading to estimates of these probabilities, the uncertainties in these estimates, and how the estimated probabilities can be used to compare two systems under consideration.

NOMENCLATURE

Test system performance usually is characterized in terms of detection probabilities. The notation for these probabilities comes from the literature on comparing medical diagnostic tests, and we use the same notation here for radiation detection systems, so we begin with some terminology. In formal notation, the absolute probability of event A is written P{A}. The probability that event A happens given condition or event B is written P{A|B}. The event after the vertical bar "|" is the event on which the probability is conditioned, i.e., the event that preexists. For the rest of this appendix, we define the following.
A = cargo contains SNM
B = test system alarms
Ac = the complement of A: cargo contains no SNM (benign)[47]
Bc = the complement of B: test system does not alarm

Sensitivity (S) = probability that the test system alarms, given that the underlying cargo truly contains special nuclear material (SNM): S = P{B|A}

Specificity (T) = probability that the test system does not alarm, given that the underlying cargo truly contains benign (non-SNM) material: T = P{Bc|Ac}

[47] Some non-SNM radioactive material is not benign, but for simplicity, this appendix refers to non-SNM material as benign.
Prevalence (p) = probability that cargo contains SNM: p = P{A}

Positive predictive value (PPV) = probability that the underlying cargo truly contains SNM, given that the test system alarms: PPV = P{A|B}

Negative predictive value (NPV) = probability that the underlying cargo truly contains non-SNM, given that the test system did not alarm: NPV = P{Ac|Bc}

WHAT WE WANT TO KNOW AND WHAT WE CAN KNOW

Although we might want to know the sensitivity and specificity of the detection systems, their definitions rely on true knowledge of the cargo contents, so we can estimate a system's sensitivity and specificity only from a designed experiment. The experimenters insert into the cargo either SNM or benign material, and then run the cargo through the test systems; the proportion of SNM runs that properly set off the test system alarm estimates the test's sensitivity, and the proportion of benign runs that properly pass the test system estimates the test's specificity. Such designed studies are artificial scenarios intended to represent a range of possible real-world events.

In real life we do not know the cargo contents. We see only the result of the test system: either the test system alarmed, or it did not alarm, and the probability of getting an alarm given that SNM is present is not necessarily the same as the probability that SNM is present given that the system alarmed (P{B|A} ≠ P{A|B}). Operationally, if the system alarms, SNM is suspected; if the system does not alarm, the cargo is allowed to pass. We are concerned especially with this question: Given that the test system did not alarm, what is the probability that the cargo contained SNM? That is, what risk do we take by allowing a "no-alarm" container to pass? From the standpoint of practical operational effectiveness, this false negative call probability (FNCP = P{A|Bc} = 1 - NPV)[48] has grave consequences.
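The distinction between P{B|A} and P{A|B} can be made concrete with a short numerical sketch (ours, with hypothetical numbers, not values from any test campaign): a system with high sensitivity can still have a tiny positive predictive value when the prevalence is low.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P{A|B} via Bayes' rule, from P{A}, P{B|A}, and P{B|Ac}."""
    num = p_b_given_a * prior_a
    den = num + p_b_given_not_a * (1.0 - prior_a)
    return num / den

# Hypothetical numbers: sensitivity 0.95, false-alarm rate 0.05,
# prevalence 0.001 (1 threat per 1,000 containers).
sens, false_alarm, p = 0.95, 0.05, 0.001
ppv = bayes_posterior(p, sens, false_alarm)
print(f"P{{B|A}} = {sens}, but P{{A|B}} = {ppv:.4f}")
```

With these inputs the sensitivity is 0.95 yet fewer than 2% of alarms correspond to actual threats, which is the asymmetry the paragraph above describes.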
As shown below by Bayes' Theorem, the FNCP is a function of sensitivity (S) and specificity (T), as well as of prevalence (p = P{A}), but a comparison between two test systems on the same scenario (i.e., the same threat) involves the same prevalence, so prevalence does not enter into the comparison of effectiveness for the two test systems. Accurate estimation of sensitivity and specificity is important, in that it allows us to compare accurately the performance of two test systems using the relevant, practically meaningful metric. As noted above, we can estimate S and T from designed studies, such as those conducted at the Nevada Test Site. We also can derive confidence limits on S and T from such designed experiments, and hence we can estimate 1 - NPV and associated confidence intervals. More importantly, we can compare the two systems via a ratio, say the FNCP ratio (1 - NPV1)/(1 - NPV2). A FNCP ratio whose lower confidence limit exceeds 1 indicates preference for test system 2, while a ratio whose upper confidence limit falls below 1 indicates preference for test system 1. Note that these ratios may differ for different scenarios; a table of these ratios may suggest strategies for associating the ratios with the threat levels presented by different scenarios.

Notice also that the probability of making a false positive call (FPCP) is likewise of interest for purposes of evaluating costs and benefits: Too many false positive calls can also be costly by slowing down commerce, diverting CBP personnel from potential threats as they spend

[48] Some analyses refer to "false discovery rate" and "false non-discovery rate," which are related to (1 - PPV) and (1 - NPV), respectively, but their definitions are slightly different (see Box B.1).
Box B.1: A comment on notation

We denoted by FNCP the probability of making a false negative call and by FPCP the probability of making a false positive call; i.e.,

FNCP = P{ true + | test calls "-" }
FPCP = P{ true - | test calls "+" }.

We related these probabilities to the following generic two-way table of test outcomes (notation from Benjamini and Hochberg, 1995, p. 291, referred to as BH95, is in parentheses):

                      Test calls      Test calls
Truth                 "Positive"      "Negative"      Total tests
True POSITIVE         N++ (S)         N+- (U)         N+ = m - m0
True NEGATIVE         N-+ (V)         N-- (T)         N- = m0
Total calls           R               m - R           m

We estimated the false negative call probability via the proportion of negative-call tests (m - R) that were in fact positive (N+-), or U/(m - R) in BH95 notation. Similarly, we estimated the false positive call probability via the proportion of positive-call tests (R) that were in fact negative (N-+), or V/R in BH95 notation. BH95 address the situation known as "multiple testing," where one is conducting many hypothesis tests (e.g., hundreds or thousands of tests, as occurs in gene expression experiments) and wants to control the frequency with which one declares as "significant" (e.g., "positive") tests that in fact are negative. Hence Benjamini and Hochberg (1995) define the expected proportion of false positive calls, E(V/R), as the "false discovery rate," or FDR. They provide a procedure based on the m p-values from the m tests so that one has assurance that, on average, the proportion of "declared significant" tests that in fact are not significant remains below a pre-set threshold (e.g., 0.05). If we estimate the FPCP as V/R, we can think of this estimated FPCP as an estimate of Benjamini and Hochberg's FDR. In analogy with E(V/R) = FDR, some have termed E(U/(m - R)) the "false non-discovery rate." Our situation differs from the multiple testing situation in two ways.
First, our two-way table arises from a designed experiment where the values of m0 and m are set by design. Second, our bigger concern lies not with false positive calls but rather with false negative calls, i.e., with the probability that a cargo declared "safe" (negative) actually is dangerous (true positive). The table suggests that we can estimate FNCP as U/(m - R). Some authors have called the expected value of this ratio, E(U/(m - R)), the "false non-discovery rate" (see Genovese and Wasserman 2004; Sarkar 2006). But with both FNCP and FPCP, one needs further information about the frequency of true "positives" and true "negatives" (in the form of p = probability that cargo contains SNM or other threatening material) beyond the m tests given in the design. In fact, as further tests are conducted, better estimates of FNCP and FPCP can be obtained by incorporating better estimates of sensitivity and specificity, as well as p, into the formulas for FNCP and FPCP. For that reason, we have chosen to derive the relevant probabilities using Bayes' formula, rather than using the terms "false discovery rate" and "false non-discovery rate," which often are estimated from only the table of outcomes from multiple tests. For further information, see the references listed at the end of Appendix B.
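As a concrete illustration of the multiple-testing procedure that Box B.1 summarizes, the following sketch implements the Benjamini-Hochberg step-up rule on a hypothetical set of p-values (the p-values and the threshold q = 0.05 are illustrative, not taken from the report):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: reject the hypotheses with the
    k smallest p-values, where k is the largest 1-based rank such that
    p_(k) <= (k/m) * q.  Returns the (sorted) indices of rejected tests."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k = rank  # step-up: remember the LARGEST rank passing the test
    return sorted(order[:k])

# Hypothetical p-values from m = 4 tests.
rejected = benjamini_hochberg([0.01, 0.50, 0.02, 0.03], q=0.05)
print(rejected)  # -> [0, 2, 3]
```

On this toy input, the three small p-values all fall below their step-up thresholds (0.0125, 0.025, 0.0375), so they are declared "significant" while 0.50 is not.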
time investigating benign cargo, reducing confidence in the value of the system, and increasing the likelihood that operators might not take results seriously. Two detection systems that have exactly the same probability of a false negative call for a given scenario, but substantially different probabilities of making a false positive call, may indicate a preference for one system over the other. The probability of making a false positive call equals 1 - PPV.

We illustrate these calculations with hypothetical data below. Suppose we have 24 trucks; we place SNM in 12 of them and leave only benign material in the remaining 12. We run all 24 trucks through two test systems and observe the following results:

                        Test System 1                 Test System 2
                  Alarm   No Alarm   Total      Alarm   No Alarm   Total
SNM in cargo        10        2        12         11        1        12
Non-SNM in cargo     4        8        12          2       10        12
Total               14       10        24         13       11        24

Sensitivity is the probability that the system alarmed, given the presence of SNM in the cargo: among the 12 trucks that contained SNM, 10 alarmed for test system 1 (estimated sensitivity S1 = 10/12) and 11 alarmed for test system 2 (estimated S2 = 11/12). Similarly, we estimate the specificities of the two test systems as 8/12 and 10/12, respectively (the fraction of "no alarm" results among the 12 non-SNM trucks). Because we specified the number of runs in each condition (n1 = 12 SNM runs and n2 = 12 non-SNM runs), we can estimate the uncertainties in these probabilities using the conventional binomial distribution. In this case, the 95% confidence intervals determined from the binomial distribution based on n1 = n2 = 12 are:

                                  Test System 1      Test System 2
Estimated sensitivity             0.833 (10/12)      0.917 (11/12)
  95% confidence interval         (0.516, 0.979)     (0.615, 0.998)
Estimated specificity             0.667 (8/12)       0.833 (10/12)
  95% confidence interval         (0.349, 0.901)     (0.516, 0.979)

(The wide intervals result from the small sample sizes.)
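The intervals in the table behave like exact (Clopper-Pearson) binomial intervals, which invert the binomial distribution directly. A minimal sketch of that computation, using only the standard library and a simple bisection search:

```python
from math import comb

def binom_sf(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for x successes in n trials."""
    def solve(f, target):
        # Bisection for the p where the increasing function f crosses target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if x == 0 else solve(lambda p: binom_sf(x, n, p), alpha / 2)
    upper = 1.0 if x == n else solve(lambda p: -binom_cdf(x, n, p), -alpha / 2)
    return lower, upper

print(clopper_pearson(10, 12))  # sensitivity of system 1: 10 alarms in 12 runs
```

Applied to 10/12 and 8/12, this reproduces intervals close to (0.516, 0.979) and (0.349, 0.901), matching the table above.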
More importantly, the negative predictive value (NPV, the probability that the truck truly did not contain SNM, given that the alarm did not sound) is 8/10 for test system 1 and 10/11 for test system 2, and hence we estimate the probability of making a false negative call for the two systems as

• proportion of cases where test system 1 did not alarm (10 cases) but the cargo actually contained SNM (2 cases): 2/10 = 0.20
• proportion of cases where test system 2 did not alarm (11 cases) but the cargo actually contained SNM (1 case): 1/11 = 0.09
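These point estimates are simple ratios over the "no alarm" column of the hypothetical experiment; a small sketch, using the counts above:

```python
# Counts of "no alarm" runs from the hypothetical 24-truck experiment.
# Note: these column totals (10 and 11) are outcomes of the experiment,
# not quantities fixed in advance by the study design.
no_alarm_sys1 = {"snm": 2, "benign": 8}    # system 1: 10 no-alarm runs
no_alarm_sys2 = {"snm": 1, "benign": 10}   # system 2: 11 no-alarm runs

def fncp_estimate(no_alarm):
    """Fraction of no-alarm runs whose cargo actually contained SNM (1 - NPV)."""
    return no_alarm["snm"] / (no_alarm["snm"] + no_alarm["benign"])

print(fncp_estimate(no_alarm_sys1))  # 0.2
print(fncp_estimate(no_alarm_sys2))  # about 0.09
```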
Clearly, test system 1 appears to be less reliable than test system 2. Calculating lower bounds on these estimated probabilities is not as straightforward as using the binomial distribution, as was done for sensitivity and specificity, because the denominator (10 in the outcome of the performance tests of system 1 and 11 in the outcome of the performance tests of system 2) arose from the test results, not from the number of trials set by the study design. That is, the denominator "10" for test system 1 (and "11" for test system 2) is the sum of two numbers that might differ if the test were re-run. Confidence bounds can be obtained as a function of sensitivity (S) and specificity (T) (see Box B.2). Bayes' rule (Navidi, 2006) states:

FNCP = P{A|Bc} = [P{Bc|A} × P{A}] / [(P{Bc|A} × P{A}) + (P{Bc|Ac} × P{Ac})]     (1)

where

P{A|Bc} = probability that event A occurs, given confirmation that event Bc has occurred (here, P{cargo contains SNM | test system does not alarm} = 1 - NPV)
P{Bc|A} = probability that event Bc occurs, given confirmation that event A has occurred (here, P{test system does not alarm | cargo contains SNM} = 1 - S)
P{Bc|Ac} = probability that event Bc occurs, given confirmation that event Ac has occurred (here, P{test system does not alarm | cargo contains no SNM} = T).

Box B.2: Uncertainty in the ratio FNCP1/FNCP2

The uncertainty in the ratio FNCP1/FNCP2 ≈ [(1 - S1)/(1 - S2)][T2/T1] can be approximated using propagation-of-error formulas. Let ratio = N/D denote a generic ratio (N = numerator, D = denominator).
SE(ratio) = SE(N/D) ≈ ratio × sqrt[ Var(N)/N² + Var(D)/D² ]

When T and S have binomial distributions, Var(T1) = T1(1 - T1)/n1 and Var(S1) = S1(1 - S1)/n1, and likewise for Var(T2) and Var(S2), where n1 [n2] is the number of trials from which S1 and T1 [S2 and T2] are estimated (in experimental runs at the Nevada Test Site, n1 ≈ n2 ≈ 12 or 24). Hence, the standard error (square root of the variance) of (1 - S1)/T1 is approximately

SE[(1 - S1)/T1] ≈ [(1 - S1)/T1] × sqrt[ S1/(n1(1 - S1)) + (1 - T1)/(n1 T1) ]

so the standard error of the ratio of false negative call probabilities (when p is tiny) is approximately

SE(FNCP1/FNCP2) ≈ (FNCP1/FNCP2) × sqrt[ Var(FNCP1)/FNCP1² + Var(FNCP2)/FNCP2² ].

So,

SE(FNCP1/FNCP2) ≈ [T2(1 - S1)/(T1(1 - S2))] × sqrt[ (S1/(1 - S1) + (1 - T1)/T1)/n1 + (S2/(1 - S2) + (1 - T2)/T2)/n2 ].
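Box B.2's final expression can be sketched directly in code; plugging in the estimates from the hypothetical 24-truck experiment (n1 = n2 = 12) gives a sense of how large the uncertainty is at these sample sizes:

```python
from math import sqrt

def se_fncp_ratio(s1, t1, n1, s2, t2, n2):
    """Approximate standard error of FNCP1/FNCP2 for tiny prevalence p,
    using the propagation-of-error formula from Box B.2, where the ratio
    is approximately [T2(1-S1)] / [T1(1-S2)]."""
    ratio = (t2 * (1 - s1)) / (t1 * (1 - s2))
    rel_var = (s1 / (1 - s1) + (1 - t1) / t1) / n1 \
            + (s2 / (1 - s2) + (1 - t2) / t2) / n2
    return ratio * sqrt(rel_var)

# Estimates from the hypothetical experiment: S1=10/12, T1=8/12, S2=11/12, T2=10/12.
se = se_fncp_ratio(s1=10/12, t1=8/12, n1=12, s2=11/12, t2=10/12, n2=12)
print(round(se, 3))
```

With these inputs the point estimate of the ratio is 2.5 but its standard error is near 2.9, underscoring how little 12 runs per arm can distinguish the two systems.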
The sensitivity (S = P{B|A}) and specificity (T = P{Bc|Ac}) can be estimated from the experimental test runs and, factoring in the prevalence p, we can calculate FNCP:

FNCP = P{A|Bc} = (1 - S)p / [(1 - S)p + T(1 - p)] = 1/(1 + y),     (2)

where y = [T/(1 - S)] × [(1 - p)/p]. Systems with lower values of P{A|Bc}, i.e., with higher values of y, are preferred. Denoting by S1, T1, S2, T2 the sensitivities and specificities of systems 1 and 2, respectively, system 1 is preferred over system 2 if FNCP1 < FNCP2; i.e., if y1 > y2; i.e., if

[T1/(1 - S1)] × [(1 - p)/p] > [T2/(1 - S2)] × [(1 - p)/p]

which is the same as either

T1/(1 - S1) > T2/(1 - S2)     (3)

or

T1/T2 > (1 - S1)/(1 - S2).     (4)

That is, a comparison of FNCP for test system 1 (FNCP1) with that for test system 2 (FNCP2) reduces to a comparison of [(specificity)/(1 - sensitivity)] for the two systems. We can estimate uncertainties in our estimates of sensitivity and specificity (based on the binomial distribution; see the discussion above). Hence, we can approximate the uncertainty in (1 - S)/T, and ultimately the uncertainty in the ratio of false negative call probabilities (see Box B.2), which does not involve assumptions about p (the likelihood of the threat). Notice that test system 1 is always preferred if T1 ≥ T2 and S1 ≥ S2, because T1 ≥ T2 implies that the left-hand side of Equation (4) is at least 1, and S1 ≥ S2 implies that the right-hand side of Equation (4) is at most 1; hence Equation (4) is satisfied. (If T1 = T2 and S1 = S2, then the test systems are equivalent in terms of sensitivity, specificity, and false negative call probability, so either can be selected.) In real situations, however, one test system may have a higher sensitivity but a lower specificity.
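Equation (2) and criterion (3) translate into a few lines of code; the comparison itself does not depend on p, since the prevalence terms cancel (the example values used here are hypothetical, in the spirit of this section):

```python
def fncp(sens, spec, p):
    """False negative call probability, equation (2): P{SNM | no alarm}."""
    y = (spec / (1 - sens)) * ((1 - p) / p)
    return 1 / (1 + y)

def preferred_system(s1, t1, s2, t2):
    """Compare two systems by criterion (3): larger T/(1-S) means lower FNCP.
    Returns 1, 2, or 0 when the systems are equivalent."""
    y1, y2 = t1 / (1 - s1), t2 / (1 - s2)
    if y1 > y2:
        return 1
    return 2 if y2 > y1 else 0

# The preference is the same for every prevalence p:
assert preferred_system(0.95, 0.70, 0.93, 0.80) == 1
print(fncp(0.95, 0.70, p=0.01))  # about 0.00072
```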
For example, if T1 = 0.70 and T2 = 0.80 (test system 2 is more likely to remain silent on truly benign cargo than test system 1), but S1 = 0.950 and S2 = 0.930 (test system 1 is slightly more likely to alarm if the cargo truly contains SNM), then (4) says that
test system 1 is preferred, because T1/T2 = 0.875 and (1 - S1)/(1 - S2) = 0.05/0.07 = 0.714. The FNCPs for the two systems are

FNCP1 = 1 / [1 + (T1/(1 - S1)) × ((1 - p)/p)] = 1 / [1 + 14.00 (1 - p)/p]

FNCP2 = 1 / [1 + (T2/(1 - S2)) × ((1 - p)/p)] = 1 / [1 + 11.43 (1 - p)/p]

so clearly FNCP1 < FNCP2. Calculations for this example (S1 = 0.95, S2 = 0.93, T1 = 0.70, T2 = 0.80), for different threat levels p, are:

• p = 0.10: FNCP1 = 0.007874 and FNCP2 = 0.009629 (ratio = 0.81777);
• p = 0.05: FNCP1 = 0.003745 and FNCP2 = 0.004584 (ratio = 0.81701);
• p = 0.01: FNCP1 = 0.000721 and FNCP2 = 0.000883 (ratio = 0.81646);
• p = 0.001: FNCP1 = 0.7150 × 10^-4 and FNCP2 = 0.8758 × 10^-4 (ratio = 0.81634);
• p = 0.0001: FNCP1 = 0.7142 × 10^-5 and FNCP2 = 0.8751 × 10^-5 (ratio = 0.81633).

The prevalence p has little effect on the ratio of FNCPs, but its effect on the absolute magnitude of the FNCP is noticeable. Regardless of its value, however, the probability of a false negative call will be very small whenever the probability of a threat is small (e.g., less than 0.1).

When the differences in sensitivities are much larger, the FNCPs also are quite different. Consider the case S1 = 0.90, S2 = 0.30, T1 = 0.70, T2 = 0.90, for the same threat levels:

• p = 0.10: FNCP1 = 0.015625 and FNCP2 = 0.079545 (ratio = 0.19643);
• p = 0.05: FNCP1 = 0.007463 and FNCP2 = 0.038326 (ratio = 0.18977);
• p = 0.01: FNCP1 = 0.001441 and FNCP2 = 0.007795 (ratio = 0.18485);
• p = 0.001: FNCP1 = 0.000143 and FNCP2 = 0.000780 (ratio = 0.18379);
• p = 0.0001: FNCP1 = 0.1429 × 10^-4 and FNCP2 = 0.7778 × 10^-4 (ratio = 0.18369).

Here, even with test system 2's higher specificity, the increase in sensitivity from 0.3 (test 2) to 0.9 (test 1) results in a roughly five-fold decrease in the FNCP.
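The bulleted values above follow directly from equation (2); a short sketch that reproduces the first list and shows how little the ratio moves as p shrinks:

```python
def fncp(sens, spec, p):
    """Equation (2): P{SNM | no alarm} = 1 / (1 + [T/(1-S)][(1-p)/p])."""
    return 1 / (1 + (spec / (1 - sens)) * ((1 - p) / p))

# First example: S1 = 0.95, T1 = 0.70 versus S2 = 0.93, T2 = 0.80.
for p in (0.10, 0.05, 0.01, 0.001, 0.0001):
    f1, f2 = fncp(0.95, 0.70, p), fncp(0.93, 0.80, p)
    print(f"p = {p}: FNCP1 = {f1:.6f}, FNCP2 = {f2:.6f}, ratio = {f1 / f2:.5f}")
```

The printed ratios stay near 0.817 across four orders of magnitude in p, while the absolute FNCPs fall roughly in proportion to p.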
With either test, the FNCP is small, even when the threat level is 0.01 (1 in 100 trucks carries threatening cargo).

Calculations for the probability of a false positive call (FPCP = 1 - PPV) are similar. Again from Bayes' Theorem:

FPCP = P{Ac|B} = [P{B|Ac} × P{Ac}] / [(P{B|Ac} × P{Ac}) + (P{B|A} × P{A})]     (5)

where
Ac = complement of A = event that cargo does not contain SNM
Bc = complement of B = event that test system does not alarm
P{Ac|B} = probability that event Ac occurs even though B occurred (here, P{cargo contains no SNM | test system alarms} = 1 - PPV)
P{B|Ac} = probability that event B occurs, given confirmation that event Ac has occurred (here, P{test system alarms | cargo contains no SNM} = 1 - T)
P{B|A} = probability that event B occurs, given confirmation that event A has occurred (here, P{test system alarms | cargo contains SNM} = S).

Hence

FPCP = (1 - T)(1 - p) / [(1 - T)(1 - p) + Sp] = 1/(1 + z),

where z = (Sp)/[(1 - T)(1 - p)]. So test system 1 would be preferred, in these terms, over system 2 if

[S1/(1 - T1)] × [p/(1 - p)] > [S2/(1 - T2)] × [p/(1 - p)]

i.e., if

(1 - T1)/S1 < (1 - T2)/S2.

To calculate the magnitude of FPCP (not just the ratio of the probabilities for the two systems), consider that p is likely small and that S1 (or S2) may not be orders of magnitude larger than 1 - T1 (or 1 - T2). In this case, the "1 +" in the denominator does matter for the absolute magnitude of FPCP. For the example above, where S1 = 0.95, S2 = 0.93, T1 = 0.70, T2 = 0.80, the corresponding FPCPs for p = 0.10, 0.05, 0.01, 0.001, and 0.0001 are:

• p = 0.10: FPCP1 = 0.96610 and FPCP2 = 0.97666 (ratio = 0.9892);
• p = 0.05: FPCP1 = 0.98365 and FPCP2 = 0.98881 (ratio = 0.99478);
• p = 0.01: FPCP1 = 0.99682 and FPCP2 = 0.99783 (ratio = 0.99899);
• p = 0.001: FPCP1 = 0.99968 and FPCP2 = 0.99978 (ratio = 0.99990);
• p = 0.0001: FPCP1 = 0.99997 and FPCP2 = 0.99998 (ratio = 0.99999).

For these examples, the chance of having to re-inspect a sounded alarm, only to find benign material, is virtually identical in the two systems (and very close to 1 for both).
The same is true when S1 = 0.90, S2 = 0.30, T1 = 0.60, T2 = 0.80:

• p = 0.10: FPCP1 = 0.95294 and FPCP2 = 0.93103 (ratio = 1.02353);
• p = 0.05: FPCP1 = 0.97714 and FPCP2 = 0.96610 (ratio = 1.01143);
• p = 0.01: FPCP1 = 0.99553 and FPCP2 = 0.99331 (ratio = 1.00223);
• p = 0.001: FPCP1 = 0.99956 and FPCP2 = 0.99933 (ratio = 1.00022);
• p = 0.0001: FPCP1 = 0.99996 and FPCP2 = 0.99993 (ratio = 1.00002).
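The base-rate effect behind these near-1 values can be sketched with the FPCP formula derived above; the sensitivity and specificity below (S = 0.90, T = 0.99) are hypothetical, chosen to show that even a highly specific system yields mostly false positive calls when p is small:

```python
def fpcp(sens, spec, p):
    """False positive call probability (1 - PPV):
    P{no SNM | alarm} = (1-T)(1-p) / [(1-T)(1-p) + S*p]."""
    num = (1 - spec) * (1 - p)
    return num / (num + sens * p)

# Hypothetical high-specificity system: sensitivity 0.90, specificity 0.99.
for p in (0.10, 0.01, 0.001, 0.0001):
    print(f"p = {p}: FPCP = {fpcp(0.90, 0.99, p):.5f}")
```

Even with 99% specificity, once the threat prevalence drops to 1 in 10,000, over 99% of alarms are false positives, which is the cost-and-workload concern raised earlier in the appendix.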
The DNDO criteria for "significant improvement in operational effectiveness" involve comparisons of sensitivity and specificity. As noted above, a test system that has higher sensitivity and higher specificity will have a lower false negative rate. But the above calculations also demonstrate that "nearly equal" sensitivities and specificities result in nearly equivalent systems, and hence offer rather limited benefit for the cost.

For completeness, we re-write the DNDO criteria for "significant improvement in operational testing" (see Box 2, pp. 40-41) using the S, T notation for sensitivity and specificity. Let S_A^(1)(SNM, noNORM) denote the sensitivity of the ASP system in primary (1) screening when the cargo truly contains SNM and no NORM; i.e., S_A^(1)(SNM, noNORM) = P{ASP alarms | cargo contains SNM, no NORM}. Likewise, let S_P^(1)(SNM, noNORM) denote the sensitivity of the current (PVT+RIID) system in primary (1) screening when the cargo truly contains SNM and no NORM; i.e., S_P^(1)(SNM, noNORM) = P{PVT alarms in primary screening | cargo contains SNM, no NORM}. Using T to denote specificity, let T_P^(2)(SNM, noNORM) = P{PVT/RIID does not alarm in secondary screening | cargo contains no SNM, but possibly NORM}.

More generally, denote by S_A^(1) and S_P^(1) the sensitivities of the ASP and the PVT+RIID combination, respectively, in primary screening, and by T_A^(1) and T_P^(1) the corresponding specificities; superscript (2) indicates secondary screening. DNDO has specified its criteria for "operational effectiveness" as follows (see Sidebar 3.1 on page 29):

1. S_A^(1)(SNM, noNORM) ≥ S_P^(1)(SNM, noNORM)
2. S_A^(1)(SNM+NORM) ≥ S_P^(1)(SNM+NORM) (the analogue of criterion 1 when NORM is present)
3. T_A^(1)(MI-Iso) ≥ T_P^(1)(MI-Iso) (where "MI-Iso" indicates licensable medical or industrial isotopes)
4. 1 - T_A^(1)(NORM) ≤ 0.20[1 - T_P^(1)(NORM)]; i.e., T_A^(1)(NORM) ≥ 0.8 + 0.2 T_P^(1)(NORM)
5. 1 - S_A^(2)(SNM) ≤ 0.5[1 - S_P^(2)(SNM)]; i.e., S_A^(2)(SNM) ≥ 0.5[1 + S_P^(2)(SNM)]
6. Time in secondary screening for ASP ≤ time in secondary screening for RIID (no connection to sensitivity/specificity).

Since criterion 4 is more stringent than criterion 3, and criterion 5 is more stringent than criterion 1, we concentrate on values of sensitivity and specificity that satisfy criteria 4 and 5. When these two conditions are satisfied (i.e., TA ≥ 0.8 + 0.2TP and SA ≥ 0.5 + 0.5SP), the ratio of false negative call probabilities (A to P) can be as small as 1:900, almost 1000 times smaller. For such improvements, the ratios of both the sensitivities and the specificities must be on the order of 0.99/0.10 or 0.95/0.10; in such cases, the false negative call probabilities are on the order of 10^-8 to 10^-5. Tables of the probabilities of both false negative calls and false positive calls were calculated with each of T_A, S_A, T_P, and S_P set equal to 0.1, 0.2, ..., 0.8, 0.9, 0.95, 0.99; of the 11^4 = 14,641 combinations, only 858 satisfied criteria 4 and 5. These 858 combinations were evaluated at 5 different values of p: 0.01 (threat cargo present in 1 of 100 trucks), 0.001, 0.0001, 0.00001, and 0.000001 (1 in 1,000,000 trucks). A plot of the smaller false
negative call probability (denoted FNCP2 in the figure) versus the larger one (denoted FNCP1) is shown in Figure B.1 (the red dashed line corresponds to the line where the two false negative call probabilities are equal). The upper left corner shows the cases where the FNCPs are most different (0.00112 < FNCP1/FNCP2 < 0.00311), which occurred in 26 of the 858 cases (26 × 5 points are shown, corresponding to the 5 values of p). More frequently, the ratio is less dramatic: 0.00317 < FNCP1/FNCP2 < 0.03161 for 257 of the 858 cases; 0.0316 < FNCP1/FNCP2 < 0.3162 for 535 of the 858 cases; and 0.3165 < FNCP1/FNCP2 < 0.4819 for 40 of the 858 cases. In each case, the absolute magnitudes of the false negative call probabilities are quite small, and the ratios of the false positive call probabilities are almost 1.

Figure B.1: Plot of FNCP2 versus FNCP1 for cases satisfying the criteria TA ≥ 0.8 + 0.2TP and SA ≥ 0.5 + 0.5SP, for different levels of p (1 × 10^-2, 1 × 10^-3, 1 × 10^-4, 1 × 10^-5, 1 × 10^-6). The red dashed line corresponds to FNCP1 = FNCP2. The results are stratified by magnitude of the ratio FNCP1/FNCP2 (specifically, rounded values of the common logarithm of the ratio: -3, -2, -1, 0, respectively, for the four plots).

APPENDIX B REFERENCES

Benjamini, Y., and Y. Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57:289-300.
Genovese, C.R., and L. Wasserman. 2004a. Controlling the false discovery rate: Understanding and extending the Benjamini-Hochberg method. http://www.stat.cmu.edu/genovese/talks/pitt-11-01.pdf.

Genovese, C.R., and L. Wasserman. 2004. A stochastic process approach to false discovery control. Annals of Statistics 32(3):1035-1061.

Ku, H.H. 1962. Notes on the use of propagation of error formulas. Journal of Research of the National Bureau of Standards 70C(4):269.

Navidi, W. 2006. Statistics for Engineers and Scientists. McGraw-Hill.

Pawitan, Y., S. Michels, S. Koscielny, A. Gusnanto, and A. Ploner. 2005. False discovery rate, sensitivity, and sample size in microarray studies. Bioinformatics 21(13):3017-3024.

Sarkar, S.K. 2006. False discovery and false non-discovery rates in single-step multiple testing procedures. Annals of Statistics 34(1):394-415.

Vardeman, S.B. 1994. Statistics for Engineering Problem Solving. PWS Publishing, Boston, Massachusetts, p. 257.