oped. Based on a set of data with predictor variables (features in the polygraph test) of known deceptive and nondeceptive subjects, one attempts to find a function of the predictor variables with high values for deceptive and low values for nondeceptive subjects. The conversion of continuous polygraph readings into a set of numeric predictor variables requires many steps and detailed decisions, which we outline below. In particular, we discuss aspects of choosing a small number of these predictors that together do the best job of predicting deception, and we consider the dangers of attempting to use too many variables when the test data set is relatively small.
We examined the two scoring systems with sufficient documentation to allow evaluation. The CPS system was designed with the goal of automating what careful human scorers currently do and has focused from the outset on a relatively small set of data features; PolyScore was developed from a much larger set of features and is more difficult to evaluate because details of its development are lacking. Updates to these systems exist, but their details are proprietary and were not shared with us. The description here focuses on the PolyScore and CPS scoring algorithms, since no information is publicly available on the statistical methods used by the more recently developed algorithms, although the penultimate section includes a summary of the performance of five algorithms, based on Dollins, Krapohl, and Dutton (2000).2
Since the 1970s, papers in the polygraph literature have offered evidence claiming to show that automated classification algorithms could accomplish the objective of minimizing both false positive and false negative error rates. Our own analyses, based on a set of several hundred actual polygraphs from criminal cases provided by the U.S. Department of Defense Polygraph Institute (DoDPI), suggest that it is easy to develop algorithms that appear to achieve perfect separation of deceptive and nondeceptive individuals by using a large number of features or classifying variables selected by discriminant analysis, logistic regression, or a more complex data-mining technique. Statisticians have long recognized, however, that such a process often leads to “overfitting” of the data and to classifiers whose performance deteriorates badly under proper cross-validation assessment (see Hastie, Tibshirani, and Friedman for a general discussion of feature selection). Such overestimation of accuracy still occurs whenever the same data are used both for fitting and for estimating accuracy, even when the appropriate set of features is predetermined (see Copas and Corbett, 2002). Thus, on a new set of data, these complex algorithms often perform less effectively than alternatives based on a small set of simple features.
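The overfitting effect described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not a reconstruction of CPS, PolyScore, or our own analyses: the data are pure random noise, the sample size (40 subjects), the number of candidate features (500), and the number retained (10) are arbitrary illustrative choices, and the classifier is a simple nearest-centroid rule with hypothetical helper names. Because the features carry no information about the labels, any accuracy well above chance is an artifact of selecting and evaluating on the same data.

```python
import random


def select_and_fit(X, y, k):
    """Keep the k features with the largest gap between class means and
    return a nearest-centroid style prediction rule built on them."""
    p = len(X[0])
    n1 = sum(y)
    n0 = len(y) - n1
    mean0 = [0.0] * p
    mean1 = [0.0] * p
    for row, label in zip(X, y):
        target = mean1 if label == 1 else mean0
        for j, value in enumerate(row):
            target[j] += value
    mean0 = [m / n0 for m in mean0]
    mean1 = [m / n1 for m in mean1]
    gaps = [mean1[j] - mean0[j] for j in range(p)]
    chosen = sorted(range(p), key=lambda j: abs(gaps[j]), reverse=True)[:k]
    mids = {j: (mean0[j] + mean1[j]) / 2 for j in chosen}

    def predict(row):
        # Positive score: closer to the "deceptive" (label 1) centroid.
        score = sum(gaps[j] * (row[j] - mids[j]) for j in chosen)
        return 1 if score > 0 else 0

    return predict


def apparent_accuracy(X, y, k):
    """Accuracy when the same data are used for selection, fitting, and testing."""
    predict = select_and_fit(X, y, k)
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def loo_cv_accuracy(X, y, k):
    """Leave-one-out accuracy with feature selection repeated inside each fold."""
    hits = 0
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predict = select_and_fit(X_train, y_train, k)
        hits += predict(X[i]) == y[i]
    return hits / len(y)


# Illustrative assumption: noise-only features, so the true accuracy is chance.
rng = random.Random(0)
n, p, k = 40, 500, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [0] * (n // 2) + [1] * (n // 2)  # 0 = nondeceptive, 1 = deceptive

apparent = apparent_accuracy(X, y, k)
cross_validated = loo_cv_accuracy(X, y, k)
print(f"apparent accuracy:        {apparent:.2f}")
print(f"cross-validated accuracy: {cross_validated:.2f}")
```

The key design choice is that `loo_cv_accuracy` repeats the feature selection within every fold. Selecting features once on the full data set and cross-validating only the final fit would still leak information from the held-out subject and overstate accuracy.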
In a recent comparison, various computer scoring systems performed similarly and with only modest accuracy on a common data set used for