Appendix B
Performance Metrics for ASPs and PVTs

"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."
– John W. Tukey (1962), "The Future of Data Analysis," Annals of Mathematical Statistics 33(1), pp. 1–67. (The quotation appears on p. 12.)

When evaluating the performance of instruments to identify the system best suited to a given task, one needs to choose the correct metric for the comparison. For systems such as the advanced spectroscopic portals (ASPs), conventional measures such as sensitivity and specificity provide useful information, but they do not directly assess test performance in actual field operation. The metrics of interest concern the probabilities of making incorrect calls: the probability that the cargo actually contained dangerous material when the test system allowed it to pass (a false negative call), and the probability that the cargo actually contained benign material when the test system alarmed (a false positive call). This appendix describes the calculations leading to estimates of these probabilities, the uncertainties in these estimates, and how the estimated probabilities can be used to compare two systems under consideration.

NOMENCLATURE

Test system performance usually is characterized in terms of detection probabilities. The notation for these probabilities comes from the literature on comparing medical diagnostic tests, and we use the same notation here for radiation detection systems, so we begin with some terminology. In formal notation, the absolute probability of event A is written P{A}. The probability that event A happens given condition or event B is written P{A|B}. The event after the vertical bar "|" is the conditioning event, i.e., the event assumed to have occurred. For the rest of this appendix, we define the following:

A = cargo contains SNM
B = test system alarms
Ac = the complement of A: cargo contains no SNM (benign [47])
Bc = the complement of B: test system does not alarm

Sensitivity (S) = probability that the test system alarms, given that the underlying cargo truly contains special nuclear material (SNM): S = P{B|A}.

Specificity (T) = probability that the test system does not alarm, given that the underlying cargo truly contains benign (non-SNM) material: T = P{Bc|Ac}.

[47] Some non-SNM radioactive material is not benign, but for simplicity this appendix refers to non-SNM material as benign.

Prevalence (p) = probability that cargo contains SNM: p = P{A}.

Positive predictive value (PPV) = probability that the underlying cargo truly contains SNM, given that the test system alarms: PPV = P{A|B}.

Negative predictive value (NPV) = probability that the underlying cargo truly contains non-SNM, given that the test system did not alarm: NPV = P{Ac|Bc}.

WHAT WE WANT TO KNOW AND WHAT WE CAN KNOW

Although we might want to know the sensitivity and specificity of the detection systems, because their definitions rely on true knowledge of the cargo contents, we can estimate a system's sensitivity and specificity only from a designed experiment. The experimenters insert into the cargo either SNM or benign material and then run the cargo through the test systems; the proportion of SNM runs that properly set off the test system alarm is an estimate of the test's sensitivity, and the proportion of benign runs that properly pass the test system is an estimate of the test's specificity. Such designed studies are artificial scenarios intended to represent a range of possible real-world events.

In real life we do not know the cargo contents. We see only the result of the test system: either the test system alarmed, or it did not. The probability of getting an alarm given that SNM is present is not necessarily the same as the probability that SNM is present given that the system alarmed (P{B|A} ≠ P{A|B}). Operationally, if the system alarms, SNM is suspected; if the system does not alarm, the cargo is allowed to pass. We are concerned especially with this question: Given that the test system did not alarm, what is the probability that the cargo contained SNM? That is, what risk do we take by allowing a "no-alarm" container to pass?

From the standpoint of practical operational effectiveness, this false negative call probability (FNCP = P{A|Bc} = 1 − NPV) [48] has grave consequences. As shown below by Bayes' theorem, it is a function of sensitivity (S) and specificity (T), as well as of prevalence (p = P{A}), but a comparison between two test systems on the same scenario (i.e., the same threat) involves the same prevalence, so prevalence does not enter into the comparison of effectiveness for the two test systems. Accurate estimation of sensitivity and specificity is important, in that it allows us to compare accurately the performance of two test systems using the relevant, practically meaningful metric. As noted above, we can estimate S and T from designed studies, such as those conducted at the Nevada Test Site. We also can derive confidence limits on S and T from such designed experiments, and hence we can estimate FNCP = 1 − NPV and associated confidence intervals. More importantly, we can compare the two systems via a ratio, say the FNCP ratio (1 − NPV1)/(1 − NPV2). An FNCP ratio whose lower confidence limit exceeds 1 indicates a preference for test system 2, while a ratio whose upper confidence limit falls below 1 indicates a preference for test system 1. Note that these ratios may differ for different scenarios; a table of these ratios may suggest strategies for associating the ratios with the threat levels presented by different scenarios.
Notice also that the probability of making a false positive call (FPCP) is likewise of interest for purposes of evaluating costs and benefits: too many false positive calls can also be costly by slowing down commerce and diverting CBP personnel from potential threats as they spend time investigating benign cargo, reducing confidence in the value of the system and increasing the likelihood that operators might not take results seriously.

[48] Some analyses refer to "false discovery rate" and "false non-discovery rate," which are related to (1 − PPV) and (1 − NPV), respectively, but their definitions are slightly different (see Box B.1).
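To make these call probabilities concrete, the short Python sketch below estimates FNCP = 1 − NPV and FPCP = 1 − PPV directly from the outcome counts of a designed test run and forms the FNCP ratio for two systems. It is our own illustrative code, not part of the original analysis; the counts are those of the hypothetical 24-truck example presented later in this appendix.

```python
# Minimal sketch: estimate FNCP = 1 - NPV and FPCP = 1 - PPV from a 2x2
# table of designed-test outcomes and compare two systems via the FNCP
# ratio. Counts are the hypothetical 24-truck example from this appendix.

def call_probabilities(alarm_snm, noalarm_snm, alarm_benign, noalarm_benign):
    """Return (FNCP, FPCP) estimated from a 2x2 table of test outcomes."""
    # FNCP: fraction of "no alarm" results whose cargo actually carried SNM
    fncp = noalarm_snm / (noalarm_snm + noalarm_benign)
    # FPCP: fraction of alarms whose cargo actually carried only benign material
    fpcp = alarm_benign / (alarm_benign + alarm_snm)
    return fncp, fpcp

fncp1, fpcp1 = call_probabilities(alarm_snm=10, noalarm_snm=2,
                                  alarm_benign=4, noalarm_benign=8)
fncp2, fpcp2 = call_probabilities(alarm_snm=11, noalarm_snm=1,
                                  alarm_benign=2, noalarm_benign=10)

print(f"FNCP1 = {fncp1:.3f}, FNCP2 = {fncp2:.3f}")
print(f"FNCP ratio (system 1 / system 2) = {fncp1 / fncp2:.2f}")
```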

Box B.1: A comment on notation

We denote by FNCP the probability of making a false negative call and by FPCP the probability of making a false positive call; i.e.,

FNCP = P{ truly positive | test calls "negative" }
FPCP = P{ truly negative | test calls "positive" }.

We relate these probabilities to the following generic two-way table of test outcomes. The letters in parentheses follow the notation of Benjamini and Hochberg (1995, p. 291; referred to below as BH95); in this box those letters denote cell counts and should not be confused with the S and T used elsewhere in this appendix for sensitivity and specificity.

Truth                  Test calls "Positive"   Test calls "Negative"   Total tests
True POSITIVE (N+)             (S)                     (T)              m − m0
True NEGATIVE (N−)             (V)                     (U)              m0
Total calls                     R                      m − R            m

We estimate the false negative call probability via the proportion of negative-call tests (m − R) that were in fact positive (N+), or T/(m − R) in BH95 notation. Similarly, we estimate the false positive call probability via the proportion of positive-call tests (R) that were in fact negative (N−), or V/R in BH95 notation.

BH95 address the situation known as "multiple testing," where one is conducting many hypothesis tests (e.g., hundreds or thousands of tests, as occurs in gene expression experiments) and wants to control the frequency with which one declares as "significant" (e.g., "positive") tests that in fact are negative. Hence Benjamini and Hochberg (1995) define the expected proportion of false positive calls, E(V/R), as the "false discovery rate," or FDR. They provide a procedure based on the m p-values from the m tests that gives assurance that, on average, the proportion of "declared significant" tests that in fact are not significant remains below a pre-set threshold (e.g., 0.05). If we estimate the FPCP as V/R, we can think of this estimated FPCP as an estimate of Benjamini and Hochberg's FDR.

Our situation differs from the multiple testing situation in two ways. First, our two-way table arises from a designed experiment in which the values of m0 and m are set by design. Second, our bigger concern lies not with false positive calls but rather with false negative calls; i.e., with the probability that a cargo declared "safe" (negative) actually is dangerous (truly positive). The table suggests that we can estimate FNCP as T/(m − R). In analogy with E(V/R) = FDR, some authors have called the expected value of this ratio, E(T/(m − R)), the "false non-discovery rate" (see Genovese and Wasserman 2004; Sarkar 2006). But with both FNCP and FPCP, one needs further information about the frequency of true "positives" and true "negatives" (in the form of p = probability that cargo contains SNM or other threatening material) beyond the m tests given in the design. In fact, as further tests are conducted, better estimates of FNCP and FPCP can be obtained by incorporating better estimates of sensitivity and specificity, as well as of p, into the formulas for FNCP and FPCP. For that reason, we have chosen to derive the relevant probabilities using Bayes' formula, rather than use the terms "false discovery rate" and "false non-discovery rate," which often are estimated only from the table of outcomes from multiple tests. For further information, see the references listed at the end of Appendix B.
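Box B.1 refers to the procedure of Benjamini and Hochberg (1995) for controlling the FDR from the m p-values of a multiple-testing experiment. As background only, here is a minimal Python sketch of that standard step-up procedure; the example p-values are invented for illustration and the function name is ours.

```python
# Minimal sketch of the Benjamini-Hochberg (1995) step-up procedure for
# controlling the false discovery rate at level q. Standard textbook
# algorithm, shown only as background for Box B.1; example p-values are
# made up.

def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of the hypotheses rejected at FDR level q."""
    m = len(p_values)
    # Sort p-values while remembering their original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest k (1-based) with p_(k) <= (k/m) * q
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject the hypotheses with the k_max smallest p-values (none if k_max == 0)
    return sorted(order[:k_max])

example_p = [0.001, 0.009, 0.04, 0.20, 0.35, 0.62]
print(benjamini_hochberg(example_p, q=0.05))   # indices of rejected tests: [0, 1]
```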

Two detection systems that have exactly the same probability of a false negative call for a given scenario, but substantially different probabilities of making a false positive call, may indicate a preference for one system over the other. The probability of making a false positive call equals 1 − PPV.

We illustrate these calculations with hypothetical data below. Suppose we have 24 trucks; into 12 of them we place SNM, and we leave only benign material in the remaining 12. We run all 24 trucks through two test systems and observe the following results:

                        Test System 1                    Test System 2
                  Alarm   No Alarm   Total Runs    Alarm   No Alarm   Total Runs
SNM in cargo        10        2          12          11        1          12
Non-SNM in cargo     4        8          12           2       10          12
Total               14       10          24          13       11          24

Sensitivity is the probability that the system alarmed, given the presence of SNM in the cargo: among the 12 trucks that contained SNM, 10 alarmed for test system 1 (estimated sensitivity S1 = 10/12) and 11 alarmed for test system 2 (estimated S2 = 11/12). Similarly, we estimate specificity for the two test systems as 8/12 and 10/12, respectively (the fraction of "no alarm" results out of the 12 non-SNM trucks). Because we specified the number of runs in each condition (n1 = 12 for SNM runs and n2 = 12 for non-SNM runs), we can estimate the uncertainties in these probabilities using the conventional binomial distribution. In this case, the 95% confidence intervals determined from the binomial distribution based on n1 = n2 = 12 are:

                                  Test System 1      Test System 2
Estimated sensitivity              0.833 (10/12)      0.917 (11/12)
  95% confidence interval         (0.516, 0.979)     (0.615, 0.998)
Estimated specificity              0.667 (8/12)       0.833 (10/12)
  95% confidence interval         (0.349, 0.901)     (0.516, 0.979)

(The wide intervals result from the small sample sizes.) More importantly, the negative predictive value (NPV, the probability that the truck truly did not contain SNM, given that the alarm did not sound) is 8/10 for test system 1 and 10/11 for test system 2, and hence we estimate the probability of making a false negative call for the two systems as:

• proportion of cases where test system 1 did not alarm (10 cases) but the cargo actually contained SNM (2 cases) = 2/10 = 0.20
• proportion of cases where test system 2 did not alarm (11 cases) but the cargo actually contained SNM (1 case) = 1/11 = 0.09
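The intervals tabulated above are consistent with exact (Clopper–Pearson) binomial confidence intervals. The report does not state which method was used, so the following minimal Python sketch, which reproduces those numbers, should be read as an assumption about the method rather than a description of it.

```python
# Minimal sketch: exact (Clopper-Pearson) 95% binomial confidence intervals
# for the estimated sensitivities and specificities. The exact-interval
# method and the use of SciPy are our assumptions; the output matches the
# tabulated intervals to three decimals.
from scipy.stats import beta

def clopper_pearson(successes, trials, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    lower = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, trials - successes + 1)
    upper = 1.0 if successes == trials else beta.ppf(1 - alpha / 2, successes + 1, trials - successes)
    return lower, upper

for label, k in [("Sensitivity, system 1", 10), ("Sensitivity, system 2", 11),
                 ("Specificity, system 1", 8), ("Specificity, system 2", 10)]:
    lo, hi = clopper_pearson(k, 12)
    print(f"{label}: {k}/12 = {k/12:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```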

Clearly, test system 1 appears to be less reliable than test system 2. The calculation of lower bounds on these estimated probabilities is not as straightforward as using the binomial distribution, as was done for sensitivity and specificity, because the denominator (10 in the outcome of the performance tests of system 1 and 11 in the outcome of the performance tests of system 2) arose from the test results, not from the number of trials set by the study design. That is, the denominator "10" for test system 1 (and "11" for test system 2) is the sum of two numbers that might differ if the test were re-run. Confidence bounds can be obtained as a function of sensitivity (S) and specificity (T) (see Box B.2).

Bayes' rule (Navidi, 2006) states:

$$\mathrm{FNCP} = P\{A \mid B^c\} = \frac{P\{B^c \mid A\}\,P\{A\}}{P\{B^c \mid A\}\,P\{A\} + P\{B^c \mid A^c\}\,P\{A^c\}} \qquad (1)$$

where

P{A|Bc} = probability that event A occurs, given that event Bc has occurred (here, P{cargo contains SNM | test system does not alarm} = 1 − NPV)
P{Bc|A} = probability that event Bc occurs, given that event A has occurred (here, P{test system does not alarm | cargo contains SNM} = 1 − S)
P{Bc|Ac} = probability that event Bc occurs, given that event Ac has occurred (here, P{test system does not alarm | cargo contains no SNM} = T).

Box B.2: Uncertainty in the ratio FNCP1/FNCP2

The uncertainty in the ratio FNCP1/FNCP2 ≈ [(1 − S1)/(1 − S2)][T2/T1] can be approximated using propagation-of-error formulas. Let ratio = N/D denote a generic ratio (N = numerator, D = denominator). Then

$$SE(\mathrm{ratio}) = SE(N/D) \approx \mathrm{ratio}\times\sqrt{\frac{\mathrm{Var}(N)}{N^2} + \frac{\mathrm{Var}(D)}{D^2}}.$$

When T and S have binomial distributions, Var(T1) = T1(1 − T1)/n1 and Var(S1) = S1(1 − S1)/n1, and likewise for Var(T2) and Var(S2), where n1 [n2] is the number of trials on which S1 and T1 [S2 and T2] are estimated (in the experimental runs at the Nevada Test Site, n1 = n2 = 12 or 24). Hence the standard error (square root of the variance) of (1 − S1)/T1 is approximately

$$\frac{1 - S_1}{T_1}\sqrt{\frac{S_1}{n_1 (1 - S_1)} + \frac{1 - T_1}{n_1 T_1}}$$

so the standard error of the ratio of false negative call probabilities (when p is tiny) is approximately

$$SE\!\left(\frac{\mathrm{FNCP}_1}{\mathrm{FNCP}_2}\right) \approx \frac{\mathrm{FNCP}_1}{\mathrm{FNCP}_2}\sqrt{\frac{\mathrm{Var}(\mathrm{FNCP}_1)}{\mathrm{FNCP}_1^2} + \frac{\mathrm{Var}(\mathrm{FNCP}_2)}{\mathrm{FNCP}_2^2}}.$$

So,

$$SE\!\left(\frac{\mathrm{FNCP}_1}{\mathrm{FNCP}_2}\right) \approx \frac{T_2 (1 - S_1)}{T_1 (1 - S_2)}\sqrt{\frac{1}{n_1}\!\left(\frac{S_1}{1 - S_1} + \frac{1 - T_1}{T_1}\right) + \frac{1}{n_2}\!\left(\frac{S_2}{1 - S_2} + \frac{1 - T_2}{T_2}\right)}.$$
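As a numerical companion to Box B.2, the following minimal Python sketch (our own code; the function name is ours) evaluates the small-p approximation to the FNCP ratio and its propagation-of-error standard error, using the hypothetical 24-truck estimates as inputs.

```python
# Minimal sketch of the Box B.2 propagation-of-error approximation for the
# standard error of FNCP1/FNCP2, valid when the prevalence p is tiny.
# Inputs are the hypothetical 24-truck estimates, not real test data.
import math

def fncp_ratio_se(s1, t1, n1, s2, t2, n2):
    """Approximate ratio FNCP1/FNCP2 and its standard error (small-p limit)."""
    ratio = ((1 - s1) / (1 - s2)) * (t2 / t1)
    rel_var = (1 / n1) * (s1 / (1 - s1) + (1 - t1) / t1) \
            + (1 / n2) * (s2 / (1 - s2) + (1 - t2) / t2)
    return ratio, ratio * math.sqrt(rel_var)

# System 1: S = 10/12, T = 8/12;  System 2: S = 11/12, T = 10/12 (n = 12 each)
ratio, se = fncp_ratio_se(10/12, 8/12, 12, 11/12, 10/12, 12)
print(f"FNCP1/FNCP2 ~= {ratio:.2f}, approximate SE ~= {se:.2f}")
```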

The sensitivity (S = P{B|A}) and specificity (T = P{Bc|Ac}) can be estimated from the experimental test runs and, factoring in the prevalence p, we can calculate FNCP:

$$\mathrm{FNCP} = P\{A \mid B^c\} = \frac{(1 - S)\,p}{(1 - S)\,p + T\,(1 - p)} = \frac{1}{1 + y}, \qquad (2)$$

where y = [T/(1 − S)][(1 − p)/p]. Systems with lower values of P{A|Bc}, i.e., with higher values of y, are preferred. Denoting by S1, T1, S2, T2 the sensitivities and specificities of systems 1 and 2, respectively, system 1 is preferred over system 2 if FNCP1 < FNCP2; i.e., if y1 > y2; i.e., if

$$\frac{T_1}{1 - S_1}\cdot\frac{1 - p}{p} > \frac{T_2}{1 - S_2}\cdot\frac{1 - p}{p},$$

which is the same as either

$$\frac{T_1}{1 - S_1} > \frac{T_2}{1 - S_2} \qquad (3)$$

or

$$\frac{T_1}{T_2} > \frac{1 - S_1}{1 - S_2}. \qquad (4)$$

That is, a comparison of the FNCP for test system 1 (FNCP1) with that for test system 2 (FNCP2) reduces to a comparison of [specificity/(1 − sensitivity)] for the two systems. We can estimate uncertainties on our estimates of sensitivity and specificity (based on the binomial distribution; see the discussion above). Hence, we can approximate the uncertainty in [(1 − S)/T], and ultimately the uncertainty in the ratio of false negative call probabilities (see Box B.2), which does not involve assumptions about p (the likelihood of the threat).

Notice that test system 1 is always preferred if T1 ≥ T2 and S1 ≥ S2, because T1 ≥ T2 implies that the left-hand side of Equation (4) exceeds or equals 1, and S1 ≥ S2 implies that the right-hand side of Equation (4) is less than or equal to 1; hence Equation (4) is satisfied. (If T1 = T2 and S1 = S2, then the two test systems are equivalent in terms of sensitivity, specificity, and false negative call probability, so either can be selected.)
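To make Equations (2) through (4) concrete, here is a minimal Python sketch (our own code; function names are ours) that evaluates the FNCP and the preference criterion, using the hypothetical sensitivity and specificity estimates from the 24-truck example.

```python
# Minimal sketch of Equation (2): FNCP as a function of sensitivity S,
# specificity T, and prevalence p, plus the preference criterion of
# Equations (3)-(4). Inputs are the hypothetical 24-truck estimates.

def fncp(s, t, p):
    """False negative call probability P{SNM | no alarm} via Bayes' rule."""
    return (1 - s) * p / ((1 - s) * p + t * (1 - p))

def prefers_system_1(s1, t1, s2, t2):
    """Equation (3): system 1 preferred when T1/(1 - S1) > T2/(1 - S2)."""
    return t1 / (1 - s1) > t2 / (1 - s2)

s1, t1 = 10/12, 8/12    # system 1 estimates
s2, t2 = 11/12, 10/12   # system 2 estimates
for p in (1e-2, 1e-4, 1e-6):
    print(f"p = {p:.0e}: FNCP1 = {fncp(s1, t1, p):.2e}, FNCP2 = {fncp(s2, t2, p):.2e}")
print("System 1 preferred?", prefers_system_1(s1, t1, s2, t2))   # False here
```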

In real situations, however, one test system may have a higher sensitivity but a lower specificity. For example, if T1 = 0.70 and T2 = 0.80 (test system 2 is more likely to remain silent on truly benign cargo than test system 1), but S1 = 0.950 and S2 = 0.930 (test system 1 is slightly more likely to alarm if the cargo truly contains SNM), then Equation (4) says that test system 1 is preferred, because T1/T2 = 0.875 and (1 − S1)/(1 − S2) = 0.05/0.07 = 0.714. The FNCPs for the two systems are

$$\mathrm{FNCP}_1 = \frac{1}{1 + \dfrac{T_1}{1 - S_1}\cdot\dfrac{1 - p}{p}} = \frac{1}{1 + 14.00\,\dfrac{1 - p}{p}}$$

$$\mathrm{FNCP}_2 = \frac{1}{1 + \dfrac{T_2}{1 - S_2}\cdot\dfrac{1 - p}{p}} = \frac{1}{1 + 11.43\,\dfrac{1 - p}{p}}$$

so clearly FNCP1 < FNCP2 for every value of p.
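For readers who want to verify the arithmetic in this example, here is a short, self-contained Python check (our own illustrative code, not part of the original analysis).

```python
# Minimal check of the worked example: T1 = 0.70, S1 = 0.95 versus
# T2 = 0.80, S2 = 0.93. Reproduces the factors 14.00 and 11.43 and
# confirms FNCP1 < FNCP2 for several values of the prevalence p.
s1, t1 = 0.95, 0.70
s2, t2 = 0.93, 0.80

print(t1 / t2, (1 - s1) / (1 - s2))   # 0.875 vs ~0.714 -> Equation (4) holds
print(t1 / (1 - s1), t2 / (1 - s2))   # 14.00 vs ~11.43

def fncp(s, t, p):
    return (1 - s) * p / ((1 - s) * p + t * (1 - p))

for p in (0.01, 0.001, 0.0001):
    print(p, fncp(s1, t1, p) < fncp(s2, t2, p))   # True for each p
```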
A parallel application of Bayes' rule gives the probability of making a false positive call. Recall that

Ac = complement of A = event that cargo does not contain SNM
Bc = complement of B = event that test system does not alarm
P{Ac|B} = probability that event Ac occurs even though B occurred (here, P{cargo contains no SNM | test system alarms} = 1 − PPV)
P{Bc|Ac} = probability that event Bc occurs, given that event Ac has occurred (here, P{test system does not alarm | cargo contains no SNM} = T)
P{Bc|A} = probability that event Bc occurs, given that event A has occurred (here, P{test system does not alarm | cargo contains SNM} = 1 − S).

Then

$$\mathrm{FPCP} = \frac{(1 - T)(1 - p)}{(1 - T)(1 - p) + S\,p} = \frac{1}{1 + z}, \qquad \text{where } z = \frac{S\,p}{(1 - T)(1 - p)}.$$

So test system 1 would be preferred, in these terms, over system 2 if

$$\frac{S_1\,p}{(1 - T_1)(1 - p)} \ge \frac{S_2\,p}{(1 - T_2)(1 - p)},$$

i.e., if

$$\frac{1 - T_1}{S_1} \le \frac{1 - T_2}{S_2}.$$

To calculate the magnitude of FPCP (not just the ratio of the probabilities for the two systems), consider that p is likely small and that S1 (or S2) may not be orders of magnitude larger than 1 − T1 (or 1 − T2). In this case, the "1 +" in the denominator does matter for the absolute magnitude of FPCP. For the example above, where S1 = 0.95, S2 = 0.93, T1 = 0.70, and T2 = 0.80, the corresponding FPCP values for p = 0.10, 0.05, 0.01, 0.001, and 0.0001 are:

• p = 0.10: FPCP1 = 1/[1 + 0.31579(1/9)] = 0.96610, FPCP2 = 0.97666 (ratio = 0.9892)
• p = 0.05: FPCP1 = 0.98365, FPCP2 = 0.98881 (ratio = 0.99478)
• p = 0.01: FPCP1 = 0.99682, FPCP2 = 0.99783 (ratio = 0.99899)
• p = 0.001: FPCP1 = 0.99968, FPCP2 = 0.99978 (ratio = 0.99990)
• p = 0.0001: FPCP1 = 0.99997, FPCP2 = 0.99998 (ratio = 0.99999)

For these examples, the chance of having to re-inspect a sounded alarm only to find benign material is virtually identical for the two systems (and very close to 1 for both). The same is true when S1 = 0.90, S2 = 0.30, T1 = 0.60, and T2 = 0.80:

• p = 0.10: FPCP1 = 0.95294, FPCP2 = 0.93103 (ratio = 1.02353)
• p = 0.05: FPCP1 = 0.97714, FPCP2 = 0.96610 (ratio = 1.01143)
• p = 0.01: FPCP1 = 0.99553, FPCP2 = 0.99331 (ratio = 1.00223)
• p = 0.001: FPCP1 = 0.99956, FPCP2 = 0.99933 (ratio = 1.00022)
• p = 0.0001: FPCP1 = 0.99996, FPCP2 = 0.99993 (ratio = 1.00002)

The DNDO criteria for "significant improvement in operational effectiveness" involve comparisons of sensitivity and specificity. As noted above, a test system that has higher sensitivity and higher specificity will have a lower false negative call probability. But the calculations above also demonstrate that "nearly equal" sensitivities and specificities result in nearly equivalent systems, and hence offer rather limited benefit for the cost.

For completeness, we re-write the DNDO criteria for "significant improvement in operational testing" (see Box 2, pp. 40–41) using the S, T notation for sensitivity and specificity. Let S_A^(1)(SNM, no NORM) denote the sensitivity of the ASP system in primary (1) screening when the cargo truly contains SNM and no NORM; i.e., S_A^(1)(SNM, no NORM) = P{ASP alarms in primary screening | cargo contains SNM, no NORM}. Likewise, let S_P^(1)(SNM, no NORM) denote the sensitivity of the current (PVT + RIID) system in primary (1) screening when the cargo truly contains SNM and no NORM; i.e., S_P^(1)(SNM, no NORM) = P{PVT alarms in primary screening | cargo contains SNM, no NORM}. Using T to denote specificity, let T_P^(2) = P{PVT/RIID does not alarm in secondary screening | cargo contains no SNM, but possibly NORM}. In general, S_A^(1) and S_P^(1) denote the sensitivities of the ASP and PVT + RIID systems, respectively, in primary screening, and T_A^(1) and T_P^(1) their specificities; the superscript (2) indicates secondary screening. DNDO has specified its criteria for "operational effectiveness" as follows (see Sidebar 3.1 on page 29):

1. S_A^(1)(SNM, no NORM) ≥ S_P^(1)(SNM, no NORM)
2. S_A^(1)(SNM + NORM) ≥ S_P^(1)(SNM + NORM) (the version of criterion 1 with NORM present)
3. T_A^(1)(MI-Iso) ≥ T_P^(1)(MI-Iso), where "MI-Iso" indicates licensable medical or industrial isotopes
4. 1 − T_A^(1)(NORM) ≤ 0.20[1 − T_P^(1)(NORM)], i.e., T_A^(1)(NORM) ≥ 0.8 + 0.2 T_P^(1)(NORM)
5. 1 − S_A^(2)(SNM) ≤ 0.5[1 − S_P^(2)(SNM)], i.e., S_A^(2)(SNM) ≥ 0.5[1 + S_P^(2)(SNM)]
6. Time in secondary screening for ASP ≤ time in secondary screening for RIID (no connection to sensitivity or specificity).

Since criterion 4 is more stringent than criterion 3, and criterion 5 is more stringent than criterion 1, we concentrate on values of sensitivity and specificity that satisfy criteria 4 and 5. When these two conditions are satisfied (i.e., T_A ≥ 0.8 + 0.2 T_P and S_A ≥ 0.5 + 0.5 S_P), the ratio of false negative call probabilities (ASP to PVT) can be as small as about 1:900, i.e., almost 1,000 times smaller. For such improvements, the ratio of both the sensitivities and the specificities must be on the order of 0.99/0.10 or 0.95/0.10; in such cases, the false negative call probabilities are on the order of 10^-8 to 10^-5. Tables of the probabilities of both false negative calls and false positive calls were calculated with T_A, S_A, T_P, and S_P each set equal to 0.1, 0.2, ..., 0.8, 0.9, 0.95, 0.99; of the 11^4 = 14,641 combinations, only 858 satisfied criteria 4 and 5. These 858 combinations were crossed with 5 different values of p: 0.01 (SNM is present in 1 of 100 trucks), 0.001, 0.0001, 0.00001, and 0.000001 (1 in 1,000,000 trucks).
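The grid search just described is easy to reproduce. The following minimal Python sketch (our own code, using the grid values and criteria 4 and 5 as stated above) recovers the 858 qualifying combinations and computes the corresponding FNCP ratios for a chosen prevalence p.

```python
# Minimal sketch of the grid search described above: enumerate all
# combinations of T_A, S_A, T_P, S_P over the stated grid, keep those
# satisfying DNDO criteria 4 and 5, and compute the FNCP ratio (ASP to PVT)
# for a given prevalence p. Reproduces the count of 858 combinations.
from itertools import product

GRID = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]

def fncp(s, t, p):
    return (1 - s) * p / ((1 - s) * p + t * (1 - p))

def satisfies_criteria_4_and_5(t_a, s_a, t_p, s_p, eps=1e-9):
    # eps guards against floating-point round-off at boundary equalities
    return (t_a >= 0.8 + 0.2 * t_p - eps) and (s_a >= 0.5 + 0.5 * s_p - eps)

p = 1e-4
qualifying = [(t_a, s_a, t_p, s_p)
              for t_a, s_a, t_p, s_p in product(GRID, repeat=4)
              if satisfies_criteria_4_and_5(t_a, s_a, t_p, s_p)]
print(len(qualifying))                      # 858

ratios = [fncp(s_a, t_a, p) / fncp(s_p, t_p, p)
          for t_a, s_a, t_p, s_p in qualifying]
print(min(ratios), max(ratios))             # range of FNCP ratios (ASP to PVT)
```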

A plot of the larger false negative call probability (denoted FNCP2 in the figure) versus the smaller one (denoted FNCP1) is shown in Figure B.1 (the red dashed line corresponds to the line where the two false negative call probabilities are equal). The upper left corner shows the cases where the FNCPs are most different (0.00112 ≤ FNCP1/FNCP2 ≤ 0.00311), which occurred in 26 of the 858 cases (26 × 5 points are shown, corresponding to the 5 values of p). More frequently, the ratio is less dramatic: 0.00317 ≤ FNCP1/FNCP2 ≤ 0.03161 for 257 of the 858 cases; 0.0316 ≤ FNCP1/FNCP2 ≤ 0.3162 for 535 of the 858 cases; and 0.3165 ≤ FNCP1/FNCP2 ≤ 0.4819 for 40 of the 858 cases. In each case, the absolute magnitudes of the false negative call probabilities are quite small, and the ratios of the false positive call probabilities are almost 1.

Figure B.1: Plot of FNCP2 versus FNCP1 for cases satisfying the criteria T_A ≥ 0.8 + 0.2 T_P and S_A ≥ 0.5 + 0.5 S_P, for different levels of p (1 × 10^-2, 1 × 10^-3, 1 × 10^-4, 1 × 10^-5, 1 × 10^-6). The red dashed line corresponds to FNCP1 = FNCP2. The results are stratified by the magnitude of the ratio FNCP1/FNCP2 (specifically, by the rounded value of the common logarithm of the ratio: −3, −2, −1, and 0, respectively, for the four plots).

APPENDIX B REFERENCES

Benjamini, Y., and Y. Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57: 289–300.

Genovese, C.R., and L. Wasserman. 2004a. Controlling the false discovery rate: Understanding and extending the Benjamini–Hochberg method. http://www.stat.cmu.edu/genovese/talks/pitt-11-01.pdf.

Genovese, C.R., and L. Wasserman. 2004. A stochastic process approach to false discovery control. Annals of Statistics 32(3): 1035–1061.

Ku, H.H. 1966. Notes on the use of propagation of error formulas. Journal of Research of the National Bureau of Standards 70C(4), p. 269.

Navidi, W. 2006. Statistics for Engineers and Scientists. McGraw-Hill.

Pawitan, Y., S. Michiels, S. Koscielny, A. Gusnanto, and A. Ploner. 2005. False discovery rate, sensitivity, and sample size in microarray studies. Bioinformatics 21(13): 3017–3024.

Sarkar, S.K. 2006. False discovery and false non-discovery rates in single-step multiple testing procedures. Annals of Statistics 34(1): 394–415.

Vardeman, S.B. 1994. Statistics for Engineering Problem Solving. PWS Publishing, Boston, Massachusetts, p. 257.