subgroup. Our approach is empirical: we compare different subpopulations and also, to mimic a worst-case scenario, perform sample calculations deliberately using an inappropriate database.
A simple random sample of a given size from a population is one chosen so that each possible sample has an equal chance of being selected. Ideally, the reference data set from which genotype frequencies are calculated would be a simple random sample or a stratified or otherwise scientifically structured random sample from the relevant population. Several conditions make the actual situation less than ideal. One is a lack of agreement as to what the relevant population is (should it be the whole population or only young males? should it be local or national?) and the consequent need to consider several possibilities. A second is that we are forced to rely on convenience samples, chosen not at random but because of availability or cost. It is difficult, expensive, and impractical to arrange a statistically valid random-sampling scheme. The saving point is that the features in which we are interested are believed theoretically and found empirically to be essentially uncorrelated with the means by which samples are chosen. Comparison of estimated profile frequencies from different data sets shows relative insensitivity to the source of the data, as we document later in the chapter. Furthermore, the VNTRs and STRs used in forensic analysis are usually not associated with any known function and therefore should not be correlated with occupation or behavior. For the purpose of estimating genotype frequencies, those convenience samples can therefore be treated as effectively random.
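The argument in the preceding paragraph can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a description of how any actual database was assembled: the population, the single two-allele locus, the allele frequency (0.3), the "donor" trait, and the function names `allele_frequency` and `simulate` are all hypothetical. The key assumption is that the trait that determines who ends up in the convenience sample is drawn independently of genotype, which is the condition the text says is believed to hold for forensic markers.

```python
import random


def allele_frequency(people):
    """Fraction of allele copies that are the hypothetical allele 'A'."""
    copies = sum(person["genotype"].count("A") for person in people)
    return copies / (2 * len(people))


def simulate(seed=0, size=100_000, p_allele=0.3, p_donor=0.2, n=5_000):
    """Compare a simple random sample with a convenience sample.

    All parameters are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(size):
        # Two allele copies per person, each 'A' with probability p_allele.
        genotype = "".join(
            "A" if rng.random() < p_allele else "a" for _ in range(2)
        )
        # Assumption: donor status is drawn independently of genotype.
        is_donor = rng.random() < p_donor
        population.append({"genotype": genotype, "donor": is_donor})

    # Simple random sample: every subset of size n is equally likely.
    srs = rng.sample(population, n)

    # Convenience sample: the first n people who happen to be donors.
    donors = [person for person in population if person["donor"]]
    convenience = donors[:n]

    return (
        allele_frequency(population),
        allele_frequency(srs),
        allele_frequency(convenience),
    )


if __name__ == "__main__":
    true_freq, srs_freq, conv_freq = simulate()
    print(f"population:         {true_freq:.4f}")
    print(f"simple random:      {srs_freq:.4f}")
    print(f"convenience sample: {conv_freq:.4f}")
```

Under the independence assumption, the two sample estimates differ from the population value only by ordinary sampling error. If donor status were instead made to depend on genotype, the convenience estimate would be biased, which is why the empirical comparisons among databases matter.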
The convenience samples from which the databases are derived come from various sources. Some data come from blood banks. Some come from genetic-counseling and disease-screening centers. Others come from mothers and putative fathers in paternity tests. The data summarized in FBI (1993b), which we have used in previous chapters and will again in this chapter, are from a variety of sources around the world, from blood banks, paternity-testing centers, molecular-biology and human-genetics laboratories, hospitals and clinics, law-enforcement officers, and criminal records.
As mentioned previously, most markers used for DNA analysis, VNTRs and STRs in particular, are from regions of DNA that have no known function. They are not related in any obvious way to gene-determined traits (see footnote 2), and there is no reason to suspect that persons who contribute to blood banks or who have been
2 Some loci used in PCR-based typing are associated with genes. It is important to determine whether a particular forensic allele is associated with a disease state and hence subject to selection. A forensic marker might happen to be closely linked to an important gene, such as one causing some observable trait, and could conceivably be in strong linkage disequilibrium with it. As the number of mapped genes increases, such linkage will become increasingly common. But for that to affect the reliability of a database, the trait would have to appear disproportionately in the populations that contribute to the database.