the most liberal. Other possible thresholds would also have yielded substantial differences among interpreters in false-positive rates.
It is possible to produce an empirical ROC curve on the basis of the performance of a diagnostic test in a field or laboratory setting. This can be accomplished in several ways. An efficient one is for the diagnostician to set several thresholds at once, in effect using several response categories, say five or six, ranging from “very definitely a signal” to “very definitely only noise.” Points on the ROC curve are then calculated successively from each category boundary: first, considering only the top category positive and the rest negative; then considering the top two categories positive; and so on. This rating procedure can be expanded by having the diagnostician give probabilities from 0 to 1 (to two decimal places) that a signal is present. The 100 categories implied may then be used as is or condensed in analysis to perhaps 10, which would give nine ROC points through which a curve can be fitted (the first point is always [0.0, 0.0], at which all tests are considered negative; the final point is always [1.0, 1.0], at which all tests are considered positive).

An example of this rating procedure is the use of three categories, corresponding to the yes/no/inconclusive decisions of many polygraph diagnostic systems. Treating this three-alternative scoring system as a rating procedure gives a two-point ROC curve.10 Because of the way polygraph data are most commonly reported, our analyses in Chapter 5 draw heavily on two-point ROC curves obtained when “no-opinion” or “inconclusive” judgments are reported.
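The cumulative procedure just described can be sketched in code. The function below takes counts of signal-present and noise-only cases in each rating category (ordered from most to least confident that a signal is present) and computes one ROC point per category boundary; the five-category counts in the usage example are invented for illustration, not taken from any study:

```python
def roc_points(signal_counts, noise_counts):
    """Empirical ROC points from rating-category counts.

    Categories are ordered from "very definitely a signal" to
    "very definitely only noise."  Each successive point treats one
    more category as positive, cumulating true- and false-positive
    counts across the category boundaries.
    """
    n_signal = sum(signal_counts)
    n_noise = sum(noise_counts)
    # First point: every test considered negative.
    points = [(0.0, 0.0)]
    tp = fp = 0
    for s, n in zip(signal_counts, noise_counts):
        tp += s  # signal cases now called positive
        fp += n  # noise cases now called positive
        points.append((fp / n_noise, tp / n_signal))
    # Last point is (1.0, 1.0): every test considered positive.
    return points


# Hypothetical counts for 70 signal and 70 noise cases across
# five rating categories:
pts = roc_points([30, 20, 10, 5, 5], [2, 5, 10, 20, 33])
```

With five categories this yields four interior ROC points plus the fixed endpoints [0.0, 0.0] and [1.0, 1.0], matching the nine interior points that ten condensed probability categories would give.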
Treating no-opinion or inconclusive judgments as an intermediate category and estimating two ROC points neatly handles a problem that is not addressed when percent correct is used to estimate accuracy. In that case, reported performance depends on how often individual examiners use the inconclusive category, especially if examiners treat the “inconclusive” records, which are the ones they find most difficult to score, as if the subject had not been tested. Examiners vary considerably in how frequently their records are scored inconclusive. For example, nine datasets reported in four screening studies completed between 1989 and 1997 at the U.S. Department of Defense Polygraph Institute showed rates of no-opinion judgments ranging from 0 to 50 percent (materials presented to the committee, March 2001). By using the inconclusive category liberally and excluding inconclusive tests, an examiner can appear very accurate
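The distortion can be shown with a small worked example; the examiner counts below are invented for illustration. An examiner who shunts the 20 hardest of 100 records into the inconclusive category and is right on 72 of the remaining 80 reports 90 percent correct, while an examiner who scores all 100 records and is right on 80 reports only 80 percent, despite resolving more cases correctly:

```python
def percent_correct(correct, incorrect, inconclusive,
                    exclude_inconclusive=True):
    """Percent correct as commonly reported: inconclusive tests
    are dropped from the denominator unless told otherwise."""
    total = correct + incorrect
    if not exclude_inconclusive:
        total += inconclusive
    return correct / total


# Examiner A: 20 of 100 records called inconclusive, 72 of the
# remaining 80 scored correctly.
a_reported = percent_correct(72, 8, 20)            # 0.90
a_all = percent_correct(72, 8, 20,
                        exclude_inconclusive=False)  # 0.72

# Examiner B: no inconclusives, 80 of 100 scored correctly.
b_reported = percent_correct(80, 20, 0)            # 0.80
```

Estimating ROC points from the three-category data instead keeps the inconclusive cases in the analysis, so neither examiner gains by routing difficult records out of the denominator.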