these variations do not affect the relationship between polygraph measurement and deception and empirically testing this hypothesis, or by modeling heterogeneity across the studies as a random effect around some central measure of the relationship’s strength, perhaps also correcting estimates of the observed variability in effect sizes for sampling error, which is likely to be a serious concern in a research literature where small samples are the norm. However, the literature contains too few studies of adequate quality to allow meaningful statistical analysis of such hypotheses or models. Without such analysis, it is not clear that there is any scientific or statistical basis for aggregating the studies into a single population estimate. Were such an estimate obtained, it would be far from clear what combination of population and polygraph test conditions it would represent.
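For readers unfamiliar with the random-effects approach mentioned above, the following is a minimal sketch of one common estimator of that kind, the DerSimonian-Laird method. It is offered only as an illustration of what such a model involves, not as an analysis performed for this report. The function name and the input values are hypothetical and are not drawn from the polygraph literature. The sketch assumes each study supplies an effect estimate and its sampling variance.

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooling via the DerSimonian-Laird moment estimator.

    effects   -- per-study effect estimates (hypothetical inputs)
    variances -- per-study sampling variances, all positive
    Returns (pooled_mean, standard_error, tau2), where tau2 is the
    estimated between-study variance after removing sampling error.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and weighted mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q: observed weighted variability around the fixed-effect mean.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))

    # Variability beyond what sampling error alone predicts (k - 1),
    # rescaled and truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau2 to each study's sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Made-up illustrative inputs, not data from any polygraph study.
example_effects = [0.6, 1.0, 0.8]
example_variances = [0.01, 0.02, 0.015]
print(dersimonian_laird(example_effects, example_variances))
```

With only a handful of small studies, the between-study variance in such a model is itself estimated very imprecisely, which is consistent with the concern stated above about the size and quality of the available literature.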
Our main substantive concern is the relevance of the available literature to our task of reviewing the scientific evidence on polygraph testing, with particular attention to national security screening applications. Only a single study provides directly relevant data on the performance of the polygraph in this context (Brownlie et al., 1998), and because it uses global impressionistic scoring of the polygraph tests, its data do not meet our basic criteria for inclusion in the quantitative analysis. The great majority of the studies address the accuracy of specific-issue polygraph tests for revealing deception about specific criminal acts, real or simulated. Even in the few studies that simulate security screening polygraph examinations, the stakes are low for both the examiners and the examinees, the base rate for deception is quite high (that is, the examiners know that there is a high probability that the examinee is lying), and there is little or no ambiguity about ground truth (both examiners and examinees know what the specific target transgression is, and both are quite clear about the definitions of lying and truthfulness). Given the dubious relevance to security screening of even the closest analog studies, as well as the heterogeneity of the literature, we do not believe there is anything to be gained by using precise distributional models to summarize their findings.
Rather than developing and testing meta-analytic models, we have taken the simpler and less potentially misleading approach of presenting descriptive summaries and graphs. The studies vary greatly in quality and include several with extreme outcomes due to sampling variability, bias, or non-generalizable features of the study design. Thus, we do not give much weight to the studies with outcomes at the extremes of the group. Instead, we summarize the sample of studies using the values of the accuracy index (A) that are most representative of the distribution of study outcomes: the median and the interquartile range. As Chapter 5 and Appendix I show, such a tabulation reveals sufficiently for our purposes the