We genotyped 100 individuals with ancestry from Puerto Rico, the Dominican Republic, Ecuador, and Colombia on Illumina 610K arrays. We extracted 400 European, 365 African American, and 112 Mexican samples from the GlaxoSmithKline POPRES project, a resource of nearly 6,000 control individuals from North America, Europe, and Asia genotyped on the Affymetrix GeneChip 500K Array Set (Nelson et al., 2008). To select the POPRES European individuals, we randomly sampled 15 individuals from each European country where possible, and otherwise took the maximum number of individuals available. Further description of sampling locations, genotyping, and data quality control is available elsewhere (Nelson et al., 2008). We included 165 CEU and 167 YRI individuals from the HapMap project, thinned to the same SNP set (Frazer et al., 2007). We also included all European, Native American, and African individuals from the HGDP genotyped on Illumina 610K arrays (Jakobsson et al., 2008). Finally, we included all Native American populations from the Mao et al. (2007) study, genotyped on Affymetrix 500K arrays.
For each dataset, we used annotation information to determine the strand on which the data were given and to map all Affymetrix and Illumina marker IDs to the corresponding dbSNP reference IDs (rsIDs). SNPs without valid rsIDs were excluded from analysis. Each dataset was then converted to the forward strand to facilitate merging. Data from the various platforms were merged using the PLINK toolset, version 1.06 (Purcell et al., 2007). Nonmissing genotype calls that disagreed between datasets were also omitted.
Demographic data for all individuals included in this study are available on GenBank. All samples were collected under protocols approved by the institutional review boards of their respective studies.
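The strand conversion and the handling of disagreeing calls described above can be illustrated with a minimal Python sketch. It is not the PLINK implementation used in the study. The function names `to_forward` and `merge_calls`, the two-letter genotype strings, the `"+"`/`"-"` strand labels, and the use of `None` for a missing call are all assumptions made for illustration.

```python
# Illustrative sketch only; names and data layout are hypothetical.
# A genotype is a two-letter string such as "AG"; None marks a missing call.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_forward(genotype, strand):
    """Return the genotype expressed on the forward strand.

    strand is "+" for forward-strand data and "-" for reverse-strand data
    (an assumed encoding). Alleles are sorted so "GA" and "AG" compare equal.
    """
    if genotype is None:
        return None
    flipped = genotype.translate(COMPLEMENT) if strand == "-" else genotype
    return "".join(sorted(flipped))


def merge_calls(calls):
    """Merge forward-strand calls for one individual at one SNP.

    Missing calls are ignored. If the nonmissing calls agree, that call is
    returned; if they disagree (or none exist), the result is missing.
    """
    observed = {c for c in calls if c is not None}
    return observed.pop() if len(observed) == 1 else None


if __name__ == "__main__":
    # The same SNP reported on opposite strands by two hypothetical platforms.
    a = to_forward("TC", "-")
    b = to_forward("GA", "+")
    print(a, b, merge_calls([a, b]))
```

In this sketch a disagreement between datasets is treated by setting the merged call to missing, which is one plausible reading of "omitted"; the source does not specify the mechanism.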
The HapMap II release 23, HGDP, Mao et al., and POPRES samples were genotyped and called according to their respective quality control procedures (Frazer et al., 2007; Mao et al., 2007; Jakobsson et al., 2008; Nelson et al., 2008). Our final merged dataset contains 73,901 SNPs across 5,104 individuals, with per-SNP genotype missingness <0.1 and per-individual missingness <0.05.
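As a rough illustration of the two missingness thresholds, the following Python sketch filters a small genotype table. The function names, the nested-dictionary layout, and the order of the two filters (SNPs first, then individuals) are assumptions. The source states only the thresholds, not how or in what order they were applied.

```python
# Illustrative sketch only; names, data layout, and filter order are assumed.
# genotypes maps SNP id -> {individual id -> call}, with None for a missing call.


def missing_rate(calls):
    """Fraction of calls that are missing; an empty input counts as fully missing."""
    calls = list(calls)
    return sum(c is None for c in calls) / len(calls) if calls else 1.0


def filter_by_missingness(genotypes, max_snp_missing=0.1, max_ind_missing=0.05):
    """Drop SNPs, then individuals, whose missingness is not below the thresholds."""
    # Step 1 (assumed order): keep SNPs with missingness below max_snp_missing.
    snps = {s: row for s, row in genotypes.items()
            if missing_rate(row.values()) < max_snp_missing}
    # Step 2: keep individuals with missingness below max_ind_missing,
    # computed over the SNPs retained in step 1.
    individuals = {i for row in snps.values() for i in row}
    kept = {i for i in individuals
            if missing_rate(row.get(i) for row in snps.values()) < max_ind_missing}
    return {s: {i: g for i, g in row.items() if i in kept}
            for s, row in snps.items()}


if __name__ == "__main__":
    toy = {
        "rs1": {"a": "AA", "b": "AG", "c": None},
        "rs2": {"a": "CC", "b": "CT", "c": None},
        "rs3": {"a": None, "b": None, "c": None},
    }
    # Looser thresholds than in the text, so the toy data show both filters.
    print(filter_by_missingness(toy, max_snp_missing=0.5, max_ind_missing=0.5))
```

The defaults mirror the thresholds reported in the text (0.1 per SNP, 0.05 per individual); the toy call uses looser values only because three individuals cannot meaningfully satisfy them.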