The third session of the workshop covered some lesser-known approaches from the survey statistics literature for designing and collecting samples of rare or small populations, including both probability and nonprobability sampling methods. As covered in the session, while innovative and useful, probability methods may not work for very rare or sensitive populations. In these situations, researchers turn to nonprobability sampling approaches.
The session, moderated by steering committee member Graham Kalton (Westat), began with a presentation by Marc Elliott (RAND Corporation) on probability sampling methods for small populations. The second and third presentations, by Sunghee Lee (University of Michigan) and Patrick Sullivan (Emory University), illustrated current methods used for nonprobability sampling. Invited discussant Krista Gile (University of Massachusetts Amherst) summarized the strengths and weaknesses of the approaches presented.
PROBABILITY SAMPLING METHODS FOR SMALL POPULATIONS
Marc Elliott provided background on probability sampling and shared examples well suited for probability sampling of rare populations: screening, disproportionate stratification at the level of the cluster and the person, and network sampling.
The purpose of probability sampling, Elliott said, is to make valid and reliable inferences about characteristics of a large population of interest from a smaller sample. Typically, information about many variables is collected from the sample in a single survey effort. Parameters that may be of interest for estimation include proportions (prevalence), means (population averages), and the relationship between two variables as measured by a regression coefficient. A researcher calculates statistics based on the sample as estimates of those features in the larger population. In the traditional probability sampling way of thinking, the sampling design provides the link between a small sample and the larger population: it supports mathematical statements about the population from the sample.
The first distinction between types of samples is between probability samples and all other types. In a probability sample everyone in the target population has a non-zero probability of being selected. These probabilities are either known or can be estimated. They do not have to be equal; in fact, the techniques of greatest interest for rare populations will involve sampling different groups at different rates.
There are a variety of ways to implement probability sampling, but the key advantage is that it provides a basis for formal statistical inference about the population from the sample. With rare and hard-to-reach populations, however, sometimes there is not an approach to getting an adequate probability sample or it is not cost-effective. In those situations, researchers may need to turn to nonprobability samples (discussed in the subsequent presentations in this session).
Elliott summarized different kinds of probability samples. In the most basic, a simple random sample, everybody has equal probability of being selected. Stratified sampling assigns people to groups and samples groups at different rates. Multiple-stage samples might first sample geographic areas, then sample practices within those areas, then sample people within those practices. Examples of probability sampling for rare populations include screening, oversampling clusters and households, and network sampling.
Screening

Elliott said there is often no compact list (or frame) of the rare population of interest. In that case, the only option is a large-scale general population survey with some form of screening to try to identify the rare population. Typically, the sample must be very large to identify enough of the target population. To make this approach economically feasible, the initial contact uses a very inexpensive technique, perhaps a mail survey or adding a few questions to a telephone survey. In the first stage of this multiple-stage approach, the screening questions must have high sensitivity, can have low specificity, and have to be quick. The idea is to design questions to which the population of interest is most likely to say “yes,” even though others may also say “yes.” The follow-up survey, administered to those who said “yes” to the screener, is longer and provides more specific detail with high specificity.
Elliott described a project that used screening to estimate the national prevalence of a rare urinary tract disorder, interstitial cystitis/painful bladder syndrome. The first stage involved adding a single question about bladder symptoms to a large national telephone survey. Those who screened positive were administered a longer questionnaire by telephone. People who screened positive in the second stage were asked to send in a urine sample, which allowed for detection of the bladder condition and estimation of national prevalence.
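The arithmetic of this two-stage design can be sketched as follows; the counts below are invented for illustration and are not the study’s actual figures.

```python
# Illustrative two-stage screening estimate (hypothetical counts).
# Stage 1: a cheap, high-sensitivity question on a large general-population
# survey. Stage 2: a specific follow-up among those who screened positive.

n_stage1 = 50_000          # general-population screener contacts
screen_positive = 1_500    # said "yes" to the broad symptom question
n_followed_up = 1_500      # completed the longer, high-specificity follow-up
confirmed = 90             # met the full case definition at follow-up

# Estimated prevalence = P(screen positive) * P(confirmed | screen positive)
p_screen = screen_positive / n_stage1
p_confirm = confirmed / n_followed_up
prevalence = p_screen * p_confirm
print(f"estimated prevalence: {prevalence:.4%}")  # 0.18% with these counts
```

The expensive, specific instrument is applied only to the small screened-in group, which is what makes the overall design affordable.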
Oversampling at the Cluster and Household Levels
Oversampling at the cluster level is another approach to sampling rare populations. The clusters might be hospitals with a high incidence of a condition or geographic areas with a high concentration of a target population. Researchers deliberately assign higher probabilities of selection to clusters likely to have a greater concentration of the target population. Estimates are then created by weighting observations inversely to their probability of selection to counteract bias. The benefit is a larger sample and greater precision for the group of interest.
An example of oversampling at the household level involves a mental health study of Cambodians who had been in Cambodia at the time of the Pol Pot regime and immigrated to the United States before 1993.1 Researchers focused on Long Beach, California, which has the largest and highest-density Cambodian refugee population in the United States. The challenge was that even in that community, only about 12 percent of households included a member of the target population.
A local community expert said he could quickly and accurately determine whether a household was likely to contain Cambodians using visual cues such as footwear outside the door, Buddhist altars, and traditional Southeast Asian plants. He walked the entire area and classified every household as likely or unlikely to have a Cambodian resident. The result was a list of households likely to have Cambodian residents. Although incomplete, this list was used in combination with a complete list of all households, which is called an incomplete or overlapping frame approach.
Every household designated as likely to be Cambodian was sampled, and one in four households designated as unlikely was sampled. The
1 Marshall, G.N., Schell, T.L., Elliott, M.N., and Chun, C.A. (2005). Mental health of Cambodian refugees two decades after resettlement in the United States. Journal of the American Medical Association, 294(5):571-579.
“unlikely” households were important to include in the sample to make sure all Cambodians had a chance of inclusion in the sample, even if not on the “likely” list. Among the sample of likely households, 58 percent contained members of the target population. Among the sample of unlikely households, 2 percent contained members. This approach resulted in greater eligible yield per sampled unit and lower cost than relying on a sample from the general population.
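Under this design, the weighted estimate can be sketched as follows; the counts are invented but chosen to match the 58 percent and 2 percent yields reported, with “likely” households sampled with certainty (weight 1) and “unlikely” households at one in four (weight 4).

```python
# Hypothetical sketch of design-weighted estimation for the likely/unlikely
# household design. Counts are invented for illustration.

strata = {
    # sampled households, of which eligible, and sampling probability
    "likely":   {"sampled": 400, "eligible": 232, "prob": 1.00},  # ~58% yield
    "unlikely": {"sampled": 900, "eligible": 18,  "prob": 0.25},  # ~2% yield
}

# Horvitz-Thompson style: weight each sampled household by 1/prob
weighted_eligible = sum(s["eligible"] / s["prob"] for s in strata.values())
weighted_total = sum(s["sampled"] / s["prob"] for s in strata.values())
share_eligible = weighted_eligible / weighted_total
print(f"estimated share of households with an eligible member: {share_eligible:.1%}")
```

The weights undo the deliberate oversampling of “likely” households, so the estimate refers to the whole community rather than to the sample as drawn.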
Another example involved using overlapping nongeographic frames to conduct a survey of Chinese Americans (about 1 percent of U.S. residents). Conducting a national-level survey of the general population with a standard screening approach to identify Chinese Americans would have a very low yield, making the survey very expensive. However, there are commercial lists of people highly likely to be Chinese, often generated by matching surnames and addresses in telephone directories. Directly sampling from these lists would not produce a valid probability sample because the lists are incomplete. However, a parallel weighted study can be run in which some cases are included through a traditional screening-based approach and some from the list. Using this approach requires knowing whether a person contacted for screening is also on the list, so that probabilities of selection can be estimated.
Elliott cautioned that these techniques are not miraculous: the incomplete list needs to cover a sizable fraction (at least one-third) of the population of interest, and because of the way the weighting works, the sample needs to contain about as many cases from the expensive technique as from the inexpensive one. The technique yields some reduction in cost when it is important to do probability sampling.
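A minimal sketch of the overlapping-frame weighting logic, with hypothetical selection probabilities: a person on the commercial list can be reached through either route, so their overall selection probability is higher and their weight correspondingly lower.

```python
# Sketch of overlapping-frame selection probabilities (invented values).
p_screener = 0.001   # hypothetical chance of selection via general screening
p_list = 0.05        # hypothetical chance of selection via the commercial list

def selection_prob(on_list: bool) -> float:
    """P(selected by at least one of the two routes)."""
    if on_list:
        return 1 - (1 - p_screener) * (1 - p_list)
    return p_screener

# Base weights are the inverse of the selection probabilities
weight_on_list = 1 / selection_prob(True)
weight_off_list = 1 / selection_prob(False)
print(weight_off_list / weight_on_list)  # list members get much smaller weights
```

This is why the screener must record whether each contacted person is also on the list: without that, the combined selection probability cannot be computed.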
Network Sampling

The inspiration for network sampling is that the expensive part of the process is frequently contacting a household. With network sampling, each householder names all the people in a defined group about which they are very well informed; for example, householders could be asked about their adult siblings. Depending on the strategy, they could be asked to provide either survey information or contact information for the siblings. As a result, one contact provides information about multiple households.
The method requires work to compute the probabilities of selection by estimating the number of ways each person could have been reached. Additionally, related individuals may be more like each other than the general public. However, under the right circumstances, network sampling yields gains in precision from a probability sample at reduced cost.
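One common way to compute such selection probabilities is multiplicity weighting: a person who could have been reported by any of k households is roughly k times as likely to be found, so each report is down-weighted by 1/k. A sketch with invented data:

```python
# Multiplicity-weighting sketch for network sampling (hypothetical data).
# Each tuple: (trait indicator reported for the person,
#              number of households that could have reported this person)
reports = [(1, 2), (0, 3), (1, 1), (0, 2), (1, 4)]

# Down-weight each report by its multiplicity k
weighted_sum = sum(y / k for y, k in reports)
weight_total = sum(1 / k for _, k in reports)
estimate = weighted_sum / weight_total
print(round(estimate, 3))
```

Without the 1/k weights, people embedded in large sibling networks would be overrepresented in the estimate.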
TWO APPLICATIONS OF RESPONDENT-DRIVEN SAMPLING
Sunghee Lee provided two examples of respondent-driven sampling: one focusing on injection drug users in Michigan and the other on foreign-born Korean Americans. She noted that there is frequently no clear or practical way to use probability sampling approaches for rare, hard-to-reach groups, for two main reasons. First, the costs of screening for rare groups with a general population survey are likely to be very high. Second, respondents might not want to reveal to an interviewer that they inject drugs or engage in other illegal or stigmatized behavior.
Lee noted that respondent-driven sampling was originally proposed by Douglas Heckathorn.2 It has frequently been applied in public health research to gather data on rare populations. Respondent-driven sampling uses an initial sample of the target population called seeds to start the process in the first wave. The seeds are asked to recruit a small number of individuals to participate in the study from their own existing social networks. They are given incentives and coupons as part of the recruitment process. The individuals recruited by the seeds constitute wave 2. They, too, are given incentives and coupons to use in recruiting wave 3. The recruitment chain process continues in waves until a pre-specified stopping point.
Network sampling, as described by Elliott, and respondent-driven sampling both use social networks, but they are different. In network sampling, the networks are specified very clearly by the researchers. Network-sample studies often rely on specific individuals, such as siblings. The researcher selects and controls the sampling procedures. In contrast, in respondent-driven sampling, other than for the seeds, the researchers do not control who comes to the study.
Another important aspect of respondent-driven sampling is chain length. All of the participants coming out of the first seed are considered one recruitment chain; if there are a total of S seeds, there may be as many as S recruitment chains. In this way the design resembles cluster sampling, though not exactly. The characteristics of the seed and of the participants in wave 2 may look similar because people who know each other tend to be alike; the characteristics of the seed and of participants in wave 3 may be less alike. This is the property of memorylessness: the further down the chain, the less likely it is that the characteristics of that wave resemble those of the seed.
Respondent-driven sampling relies on many assumptions. In Lee’s
2 Heckathorn, D.D. (1997). Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems, 44(2):174-199; and Heckathorn, D.D., Semaan, S., Broadhead, R.S., and Hughes, J.J. (2002). Extensions of respondent-driven sampling: A new approach to the study of injection drug users aged 18-25. AIDS and Behavior, 6(1):55-67.
assessment, the most problematic assumption is that recruitment is done at random within each individual’s network. If this assumption holds, the recruitment chains are Markov chains that become memoryless and reach equilibrium. Under these assumptions, approximately unbiased estimators can be obtained after equilibrium using weights based on each subject’s reported network size.
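One widely used weighting scheme from this literature is the RDS-II (Volz-Heckathorn) estimator, which weights each respondent inversely to their reported network size (degree); this sketch uses invented data and is not Lee’s actual analysis.

```python
# Minimal sketch of the RDS-II (Volz-Heckathorn) estimator with invented data.
# Each tuple: (trait indicator y_i, reported network size d_i)
sample = [(1, 4), (0, 10), (1, 6), (0, 20), (1, 3), (0, 12)]

# Weight each respondent by 1/d_i: well-connected people are more likely
# to be recruited, so they are down-weighted.
numerator = sum(y / d for y, d in sample)
denominator = sum(1 / d for _, d in sample)
rds_ii = numerator / denominator

naive = sum(y for y, _ in sample) / len(sample)
print(f"unweighted: {naive:.3f}, RDS-II: {rds_ii:.3f}")
```

In this toy example the respondents with the trait report smaller networks, so the RDS-II estimate is higher than the unweighted mean; with real data the direction of the adjustment depends on how the trait correlates with degree.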
Injection Drug Users in Michigan
In her first application of respondent-driven sampling, Lee discussed Positive Attitude Toward Health (PATH), sponsored by the National Science Foundation. PATH targeted injection drug users in southeast Michigan. The study protocol followed very closely the existing Centers for Disease Control and Prevention’s (CDC’s) National HIV Behavioral Surveillance System (NHBS) for its injection drug user component.
There were three data collection sites: urban Detroit, suburban Macomb, and mostly rural St. Clair. Researchers posted flyers in community centers in the three locations, and individuals were invited to call a telephone number. The process started with a telephone screener in which potential seeds were asked about their eligibility. Another round of screening at one of the study sites verified that they were injection drug users by checking for physical injection marks. After the in-person screen, participants completed a survey and were given up to three coupons to recruit other injection drug users. The survey was in the field for 6 months.
In Detroit, they started with 22 seeds and had a final sample size of 285. In St. Clair, they started with 14 seeds and had a final sample size of 106. In Macomb, however, they started with 10 seeds and had a sample size of only 19 after 3 months; the process there was ended so resources could be devoted to the other sites. Lee noted that CDC’s NHBS had interviewed the injection drug user community in Detroit three times, so this community was familiar with the recruitment process. In St. Clair and Macomb, the communities were completely new to the study process and were hesitant to collaborate with Lee’s team.
Lee noted that participants in Detroit were very different from those in St. Clair and Macomb combined. They differed in age, race, education level, employment level, and experiences of homelessness. Respondents also reported very different substance use: in Detroit, heroin was the drug of choice; in St. Clair and Macomb, other types of drugs were used in addition to heroin.
Health and Life Study of Koreans
In the second example of respondent-driven sampling, the Health and Life Study of Koreans, the target population was foreign-born Korean
American adults in Los Angeles County and in Michigan. Korean Americans make up about 0.6 percent of the population, so foreign-born Koreans are even rarer.
Unlike injection drug users, Korean Americans are rare, but not a highly stigmatized group. Frequently when immigrants come to the United States, they develop ethnic enclaves. These social networks are quite important to them. As a result, Lee and her colleagues thought that respondent-driven sampling might work for studying Korean Americans.
This survey was conducted on the web. Participants could visit the website3 to learn about the study and agree to participate in the survey. Each potential participant was provided with a unique number used to monitor participation. Incentives were provided using bank checks as a way to make sure no one took the survey more than once.
The target sample size was 800. As of January 2018, the study (ongoing at the time of the workshop) had about 600 completes. What is unique is that some benchmarks about foreign-born Koreans are available from the American Community Survey (ACS), so sample estimates from respondent-driven sampling can be compared to ACS estimates.
The study started with formative research, an important part of respondent-driven sampling. Because the target population is likely to be unfamiliar to researchers, formative research is a way for researchers to understand the community and how to approach it. Lee and her colleagues conducted three rounds of focus groups, just over 30 participants in total, with two groups conducted in Korean and one in English. The discussion focused on the purpose of the study, respondent-driven sampling, and the use of coupons. Participants discussed these issues and provided input on the incentive levels to use in different components of the study.
From these discussions, Lee said, it became clear that researchers had to describe the study purpose very clearly. Participants wanted to see “University of Michigan” in the survey link, so the website URL mattered a lot. They understood the concept of respondent-driven sampling, and from the focus groups it seemed realistic for each participant to recruit two other people. Focus group participants said that there should be no incentive for recruiting, contrary to guidance in the respondent-driven sampling literature. Researchers decided to use two coupons that expired within 2 weeks. Incentives included a $20 coupon for completing the main survey and a $5 coupon for taking the follow-up survey.
Lee said that they started with 12 seeds in Los Angeles in June 2016. Seeds were recruited through referral and selected to have balance on gender, age, and dominant language. Researchers conducted in-person meetings with each seed, describing the importance of the project and inviting them
3 See http://sites.lsa.umich.edu/korean-healthlife-study/ [March 2018].
to take the survey. After 2 weeks there were few new participants, so researchers began offering recruitment incentives of $5 per recruit and increased the number of seeds. Lee reported that as of January 2018, they had 306 completed interviews from 110 seeds in Los Angeles; in Michigan, they had 250 completed interviews from 85 seeds. Lee observed that when seeds do not recruit other participants, the chain stops at the seed, leaving some chains very short. In this situation, the memorylessness assumption is unlikely to hold.
Lee provided comparisons between estimates computed five different ways from this study and estimates from the ACS. The first estimator was the unweighted mean. The second and third, known as RDS-I and RDS-II, are well-known estimators from the respondent-driven sampling literature. The fourth was the unweighted estimate post-stratified by known population totals for age, gender, and education. The fifth combined RDS-II with post-stratification.
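The post-stratification step in the fourth and fifth estimators can be sketched as follows; the data and benchmark shares are invented, and the two-group breakdown stands in for the full age-by-gender-by-education cells.

```python
# Post-stratification sketch (invented data): rescale weights so each
# group's weighted share matches a known population share (e.g., from the
# ACS), then re-estimate the mean of y.

# Each tuple: (group, y value, base weight)
sample = [("young", 1, 1.0), ("young", 1, 1.0), ("young", 0, 1.0),
          ("old", 0, 1.0), ("old", 1, 1.0)]
population_share = {"young": 0.4, "old": 0.6}  # assumed known benchmarks

# Total base weight in each group
weight_by_group = {}
for g, _, w in sample:
    weight_by_group[g] = weight_by_group.get(g, 0.0) + w

total_w = sum(weight_by_group.values())
estimate = 0.0
for g, y, w in sample:
    # Rescale so group g's weighted share equals its population share
    adj = w * population_share[g] * total_w / weight_by_group[g]
    estimate += adj * y
estimate /= total_w
print(round(estimate, 4))
```

The adjustment forces agreement with the benchmarks on the post-stratifying variables themselves; whether it improves other variables depends on how strongly they relate to those benchmarks, which is the pattern Lee reported.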
She observed that the unweighted estimates were quite far from the ACS benchmark. The sample recruited for the web survey was younger, more highly educated, and more likely to have limited English proficiency than the ACS benchmark. More surprising to the researchers, the web survey respondents reported more problems with activities of daily living than the ACS benchmark. She also reported that using RDS-I and RDS-II did not necessarily improve the estimates. The post-stratified RDS-II estimator improved the estimates for age, gender, and education (the post-stratifying variables), but only some of the remaining variables improved.
In summary, Lee said that the main conclusion is that noncooperation (participants not recruiting other people) is a problem for generating long chains, which calls the memorylessness property into question. Making respondent-driven sampling work requires improvisation. In addition, sample size (and hence chain length) is a random variable in respondent-driven sampling, so inference is quite limited. The benefit of respondent-driven sampling is the ability to recruit people who are typically hard to recruit. However, noncooperation must be addressed to meet the method’s theoretical assumptions, and this has yet to be addressed in the literature.
VENUE-BASED AND ONLINE SAMPLING
Patrick Sullivan quoted from an issue of the Lancet (2012)4 about the difficulty in addressing HIV in much of the world where men who have sex with men (MSM) are in danger if their sex lives are exposed. He observed
4 The Lancet Series on HIV in Men Who Have Sex with Men. July 20, 2012. See https://www.thelancet.com/series/hiv-in-men-who-have-sex-with-men [May 2018].
that every specific population comes with its own layers of complication from being small. This history speaks to the evolution of efforts to sample MSM for health research.
In the United States, gay and bisexual men make up about 2 percent of the population, but they accounted for over two-thirds of all new HIV diagnoses in 2015; black and Hispanic/Latino men accounted for two-thirds of those cases, significantly higher than their representation in the population. Young men of color are a critical part of the expanding HIV epidemic.
Sullivan said researchers may want to reach MSM for HIV prevention research, but it is also a population with significant health disparities with respect to mental health, cancer, substance use, and smoking. He described three ways to reach these men: (1) venue-based sampling, (2) online sampling, and (3) virtual venues such as sex-seeking apps. Until about 20 years ago, same-sex behavior was criminalized, and bars and sex venues were the places to find gay men, who often were not open in other settings. Sampling in these venues created a bias toward men with a higher level of sexual activity. With decreased stigma and improvements in human rights and laws, gay men are more integrated into U.S. society than in the past, but they are also harder to sample because for younger generations being a gay man is often not their primary identity.
Sullivan’s sample studies have three goals: (1) recruit enough men to study; (2) ensure the sample includes younger men and men of color, the most critical subgroups; and (3) apply methods replicable across time to support the evaluation of time trends.
Sullivan described a study5 concerning the implementation of venue-based time-space sampling in the CDC’s NHBS, a large survey of 10,000 individuals conducted in 20 cities every third year. The process first involves formative work to enumerate venues: places where, during a 4-hour period, at least 8 members of the target population are available. In a recent study in Atlanta, where they were recruiting MSM, they identified 183 such venues using a threshold of 30 men per time period. These enumerations need to be validated by stopping suspected members of the target population and asking about demographic characteristics, age, and other information.
This universe of venues forms a first-stage sampling frame. For each venue, researchers identify time periods when it would be likely to reach the threshold of
5 MacKellar, D.A., Gallagher, K.M., Finlayson, T., Sanchez, T., Lansky, A., and Sullivan, P. (2007). Surveillance of HIV risk and prevention behaviors of men who have sex with men: A national application of venue-based, time-space sampling. Public Health Reports, 122:39-47.
the target population. Each venue is assigned the time periods when the threshold would be met, and the list of venue/time periods becomes the sampling frame. A sampling calendar is developed with venues as primary sampling units and time periods as secondary units. Within a sampled venue, the flow of men past a specific point is observed and every nth man is approached. Known as systematic flow-based sampling, this design combines cluster sampling with flow-based sampling at the final stage.
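The final-stage selection can be sketched as a simple systematic sample from the observed flow; the names and sampling interval here are illustrative only.

```python
# Sketch of systematic flow-based sampling: within a sampled venue/time
# unit, observe the flow of people past a fixed point and approach every
# nth person (hypothetical interval and flow).

def systematic_sample(flow, interval, start=0):
    """Yield every `interval`-th person from the observed flow."""
    for i, person in enumerate(flow):
        if i % interval == start:
            yield person

flow = [f"person_{i}" for i in range(20)]
approached = list(systematic_sample(flow, interval=5))
print(approached)  # ['person_0', 'person_5', 'person_10', 'person_15']
```

Because everyone passing the point faces the same 1-in-n chance of being approached, selection probabilities within the venue/time unit are known, which is what keeps this final stage a probability mechanism.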
This process has been used every 3 years since 2005, with a cycle completing in 2017. These data demonstrate6 that the prevalence of HIV in these samples increased from 2009 to 2014 among young black and Hispanic MSM and decreased among young white MSM. Although the venues change from cycle to cycle and the venue sampling frame is refreshed for each cycle in each city, the methods have been consistent. Sullivan noted that in the surveillance reports, CDC does not weight or adjust the data; biases are minimized by keeping the process consistent, and essentially raw numbers are reported. In academic publications, the researchers account for clustering by venue and then adjust for differences in other demographic characteristics over time.
Sullivan described work to evaluate the validity of the sampling frame of venues using a geospatial sex-seeking app. Delaney7 used it to prepare maps of Atlanta showing the density of black and white MSM listed on the app. At each selected point on a grid, Delaney determined the radius of a circle, centered on that point, that would contain 50 sex-seeking men. These circles were used to estimate the relative density of MSM, separately for black and white men. This analysis identified a major area of high activity that was missing venues on the sampling frame: the Atlanta University Center, the site of four historically black colleges and universities.
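The density calculation implied by this approach can be sketched as follows; the radii are hypothetical.

```python
# Back-of-envelope density calculation behind the app-based mapping
# (hypothetical radii): if the nearest 50 listed men fall within a circle
# of radius r kilometers, local density is roughly 50 / (pi * r**2).
import math

def density_from_radius(n_men: int, radius_km: float) -> float:
    """Estimated men per square kilometer."""
    return n_men / (math.pi * radius_km ** 2)

# A smaller radius for the same 50 men implies a denser area.
dense = density_from_radius(50, 0.5)    # ~63.7 men per sq km
sparse = density_from_radius(50, 3.0)   # ~1.8 men per sq km
```

Mapping these values over a grid highlights high-density areas, which can then be checked against the venue frame for gaps like the one found at the Atlanta University Center.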
Online Sampling and Social Media
Sullivan started his discussion of online sampling approaches by illustrating an evaluation of bias.8 He and his colleagues did a study almost
6 Wejnert, C., Hess, K.L., Rose, C.E., Balaji, A., Smith, J.C., and Paz-Bailey, G. (2015). Age-specific race and ethnicity disparities in HIV infection and awareness among men who have sex with men: 20 U.S. cities 2008-2014. The Journal of Infectious Diseases, 212(11). doi: 10.1093/infdis/jiv500.
7 Delaney, K.P., Kramer, M.R., Waller, L.A., Flanders, W.D., and Sullivan, P.S. (2014). Using a geolocation social networking application to calculate the population density of sex-seeking gay men for research and prevention services. Journal of Medical Internet Research, 16(11):e249.
8 Sullivan, P.S., Khosropour, C.M., Luisi, N., Amsden, M., Coggia, T., Wingood, G.M., and DiClemente, R.J. (2011). Bias in online recruitment and retention of racial and ethnic minority men who have sex with men. Journal of Medical Internet Research, 13(2):e38. doi: 10.2196/jmir.1797.
10 years ago using the social networking site MySpace. The motivation was a review of data from online studies that found black and Hispanic MSM were underrepresented relative to their prevalence in the population of New York City. One hypothesis was that the visual features of an ad on the site might influence the probability that someone responds to it. They developed two ads each with white, black, and Hispanic models to test whether matching ads to demographic groups could help target under-recruited groups. They also evaluated incomplete surveys, showing that completion rates were highest for white men and lowest for black men, a potential source of bias. Sullivan noted that further work is needed to find incentives, including nonmonetary incentives, that increase participation rates and reduce bias.
The American Men’s Internet Survey (AMIS) is hosted by Emory University and has just finished its 5th year of data collection, with about 10,000 respondents per year. AMIS recruits through four different online channels: (1) general social networking (such as Facebook or Twitter), (2) general gay interest (such as politics, advocacy, or style), (3) gay social networking, and (4) sex-seeking apps.9 Recruiting is unincentivized, so the only costs are for ads and staff. In the past few years, the sample has included some men who took the survey in a previous year and expressed interest in participating again. An open question is whether mixing or changing recruitment sources might change bias over time and affect time-trend analysis.
According to Sullivan, the recruiting approach matters. For example, MSM recruited through a sex-seeking app were more likely to have had an HIV test, to have had an STI test, to be living with HIV, and to report condomless anal intercourse, and were somewhat more likely to use marijuana. As a result, in preparing survey results, AMIS uses standardization to a general population to examine time trends and compare estimates over time.
Sullivan noted that even though there are questions about the largely uncontrolled methodology of online sampling, the cost of recruiting 10,000 men this way is about one-hundredth the cost of a 10,000-person survey using venue-time-space sampling. Comparisons between estimates made with different approaches help show where they agree and where they differ. He said online sampling can be viewed as a complement to other methods, and at a high level it seems to give trends consistent with other sampling methods.
9 Zlotorzynska, M., Sullivan, P., and Sanchez, T. (2017). The annual American Men’s Internet Survey of behaviors of men who have sex with men in the United States: 2015 key indicators report. JMIR Public Health and Surveillance, 3(1):e13.
Sullivan provided one more comparison.10 This study was conducted in Atlanta and used a combination of venue-time-space sampling and online sampling through Facebook. The researchers followed a cohort of HIV-negative MSM, half black and half white, for 2 years. One question they wanted to address was how men recruited through Facebook differed from those recruited through venue-based sampling in terms of HIV prevalence, STI prevalence, retention in the study, and risk behaviors. The results indicate that for most of the outcomes, the men recruited through Facebook were similar to those recruited through venue-based sampling. Overall, he said, the two methods are complementary and, in this case, Facebook and venues were two different access points to largely similar populations.
Sullivan stressed that MSM are the major risk group in the U.S. HIV epidemic and also experience other health disparities. Venue-based sampling has the advantage of being a systematic approach, and expanding the types of venues beyond bars would eliminate some of the biases of sampling only at sex partner–meeting venues. Online sampling can also be used to access MSM; however, black and Hispanic men are under-recruited and have greater retention loss than white men. These biases need to be addressed.
DISCUSSION

Krista Gile began by summarizing the sampling approaches presented. She reminded the audience that the goal of identifying and finding members of small populations is to make statements about the whole population. To do this, researchers need to think about statistical issues, including (1) the size of the target population, (2) the population proportions of characteristics of interest, and (3) the associations between variables or multivariate results. Many of these methods do not handle quantifying uncertainty well. If uncertainty can be quantified, confidence intervals can be prepared and hypotheses tested. The ability to quantify uncertainty is an important consideration when choosing among methods.
Overview and Comparison of the Methods
Gile compared the sampling frames of the four methods discussed (probability sampling, respondent-driven sampling, venue-based sampling, and online sampling). A key point for probability sampling is to start with a sampling frame, such as a list of people. All probability samples start
10 Hernandez-Romieu, A.C., Sullivan, P.S., Sanchez, T.H., Kelley, C.F., Peterson, J.L., Del Rio, C., Salazar, L.F., Frew, P.M., and Rosenberg, E.S. (2014). The comparability of men who have sex with men recruited from venue-time-sampling and Facebook: A cohort study. Journal of Medical Internet Research, 3(3):e37. doi: 10.2196/resprot.3342.
from a sampling frame, and researchers think about the characteristics of inclusion and exclusion of the target population.
As discussed by Lee, Gile said, respondent-driven sampling starts by selecting seeds, and the seeds lead to web-like network samples. The ultimate sample depends on who the seeds are, how they were selected (often by a convenience mechanism), and their network within the target population. Who ends up in the sample depends on where the process started. She expressed hope that improvements in implementation and inference, along with the memoryless property of the recruitment process, will ultimately overcome this drawback of respondent-driven sampling.
In venue-based sampling, the population is divided into those who might be found at venues of interest and those who might not. Some individuals show up at multiple venues, and others at no venue at all. The basic sampling unit is a venue-time unit. In these settings, it is important to think about who is excluded and who may be overrepresented.
Gile noted online sampling, through different websites or ads, might reach different parts of the target population. The ad might be displayed to individuals in different ways. The question remains who sees that ad and, critically, who is going to click on it.
Gile compared the four methods on different points: elements of formative research and rapport, setting up the sampling frame and what is known about sampling rates and decisions about participation, methods for statistical inference (point estimates, confidence intervals), dependence between sampled individuals, and populations not suitable for each of the methods.
Gile referred to Sullivan’s summary of the formative research needed for venue-based sampling and Elliott’s presentation on effective probability sampling for a very small population. In these situations, the literature is quite extensive. For respondent-driven sampling, formative research is also important to learn about the target population, help to select diverse seeds, and get buy-in from the community. In the end, the researcher wants a small number of seeds, and from there hopefully the study will spread. Formative research is also needed in online sampling.
Everyone who answers survey questions, particularly within a sensitive group, is giving researchers time and information about her or his truth. The more trust, the better the quality of information provided and the more likely researchers are to get the answers they want. Gile noted that as a statistician, she does not usually think about these issues, but would rather see data from a place where people are thinking about how to authentically connect with the target population, which questions are relevant and will be answered well, and who is engaged in participating in the survey. These factors influence the quality and completeness of the data.
In particular, she said, respondent-driven and venue-based sampling require large amounts of trust. Researchers need to find seeds for respondent-driven sampling. For venue-based sampling, formative research is needed to find venue-times and develop relationships. All sampling methods are helped by knowing and having close connections with the target population.
Gile commented that a survey can observe only the people within the sampling frame. A key question is who is in and who is out. In a probability sample, the sampling frame hopefully covers everyone, although coverage needs to be assessed. In respondent-driven sampling, it is assumed that people are connected by a network, and that their self-reported number of ties reflects their rates of inclusion in the study. In venue-based sampling, the assumption is that people from the target population frequent the places sampled. In online sampling, people must visit the particular websites.
If the differential sampling rates of people in the target population are known, estimates can be adjusted for the fact that different people within the frame are more or less likely to be sampled. If those rates are not known, creative work is needed. It is important to think about what is driving differential sampling rates.
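The adjustment Gile describes is, in essence, inverse-probability weighting of the Horvitz-Thompson type. A minimal sketch, with entirely hypothetical outcomes and inclusion probabilities, shows how known differential sampling rates change an estimate:

```python
# Sketch: adjusting a sample mean for known differential sampling rates
# using inverse-probability (Horvitz-Thompson-style) weights.
# The outcome values and inclusion probabilities below are hypothetical.

values = [1, 1, 0, 1, 0]               # outcome for each sampled person
incl_prob = [0.5, 0.5, 0.1, 0.1, 0.1]  # probability each person was sampled

weights = [1.0 / p for p in incl_prob]  # weight = inverse of inclusion probability
weighted_mean = sum(w * y for w, y in zip(weights, values)) / sum(weights)
naive_mean = sum(values) / len(values)

print(f"naive mean:    {naive_mean:.3f}")     # 0.600
print(f"weighted mean: {weighted_mean:.3f}")  # 0.412
```

The two people sampled at a high rate are down-weighted relative to the three sampled at a low rate, which is what pulls the weighted estimate below the naive one. When the inclusion probabilities are unknown, as in several of the methods discussed, this is exactly the quantity that must be approximated or modeled.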
In a probability sample, the design controls the different rates of inclusion for the different people in the sampling frame. This allows for more straightforward and clean-cut inference, which is why probability sampling is the gold standard for survey research. In respondent-driven sampling, there is an expectation that the sampling process depends on a person’s connectivity in the network. In venue-based sampling, sampling probabilities might also depend on the extent of venue use. In online sampling, visiting the website of interest and clicking on an ad are the two distinct features. In venue-based and online sampling, there is discussion but no consensus on how to determine sampling weights for members of the population.
In many probability samples with in-person interviewers, a potential respondent is approached by the interviewer. That person may refuse to participate, but the interviewer is close at hand to help him or her decide whether to be in the study. Similarly, in venue-based sampling, the potential respondent is approached by an interviewer. With respondent-driven sampling, coupons are passed out by other participants in the study. Researchers do not know how many people were approached and declined to participate. Similarly, with online sampling, decisions happen in private, and the researcher has no idea what goes into that process.
The methods for statistical inference rely on sampling probabilities. The probability sample, when it can be done, enables many powerful things with statistics. Respondent-driven sampling has many methods for inference, but requires many assumptions. With venue-based sampling, the sample is drawn at the first level on venue-times. However, inferences are desired for the population of people. Gile observed that it is unclear how a sample of venue-times can be extrapolated to inferences about a population of people. The probability of being sampled may depend on people’s usage of
the venues. With online data, perhaps post-stratification would be possible, but post-stratification requires valid reference data.
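The post-stratification Gile mentions can be sketched briefly: stratum means from the sample are reweighted by known population shares. The strata, sample, and reference shares below are invented for illustration; in practice the shares must come from valid reference data, which is exactly the requirement she flags.

```python
# Sketch: post-stratification of an online sample to known population shares.
# The sample and the reference shares are hypothetical.

# Sample: (stratum, outcome) pairs, outcome = 1 if the respondent reports the behavior.
sample = [("18-29", 1), ("18-29", 1), ("18-29", 0), ("30+", 1), ("30+", 0)]

# Population shares of each stratum, assumed known from a valid reference source.
pop_share = {"18-29": 0.4, "30+": 0.6}

# Mean outcome within each stratum of the sample.
by_stratum = {}
for stratum, y in sample:
    by_stratum.setdefault(stratum, []).append(y)
stratum_means = {s: sum(ys) / len(ys) for s, ys in by_stratum.items()}

# Post-stratified estimate: each stratum mean weighted by its population share.
estimate = sum(pop_share[s] * m for s, m in stratum_means.items())
print(f"post-stratified estimate: {estimate:.3f}")  # 0.567
```

Here the younger stratum is overrepresented in the sample (3 of 5 respondents) relative to its population share (0.4), so post-stratification shifts the estimate toward the older stratum's mean.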
Low dependence between sampled individuals means that when a new individual is sampled, she or he will provide a great deal of additional information. With high dependence between sampled individuals, the information may be similar to the information already collected. With venue-based sampling, there might be similarities among the people who frequent that venue during that time. Similarly, in respondent-driven sampling, the people recruited may be similar. As a result, each additional person surveyed is going to provide less additional information than a process that involves independent samples.
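The information loss from dependence that Gile describes is commonly quantified with the standard design-effect approximation for clustered samples, deff = 1 + (m - 1) * rho, where m is the average cluster (e.g., venue-time) size and rho is the intraclass correlation. The numbers below are illustrative, not from any study discussed:

```python
# Sketch: design effect for clustered sampling (e.g., venue-time units),
# using the standard approximation deff = 1 + (m - 1) * rho.
# The cluster size, intraclass correlation, and sample size are illustrative.

def design_effect(avg_cluster_size: float, rho: float) -> float:
    """Approximate variance inflation from within-cluster similarity."""
    return 1.0 + (avg_cluster_size - 1.0) * rho

n = 1000  # nominal sample size
deff = design_effect(avg_cluster_size=20, rho=0.05)
effective_n = n / deff  # equivalent size of an independent sample

print(f"design effect: {deff:.2f}")                 # 1.95
print(f"effective sample size: {effective_n:.0f}")  # 513
```

Even a modest intraclass correlation of 0.05 roughly halves the effective sample size here, which is the sense in which each additional respondent from the same venue-time or recruitment chain provides less new information.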
If there is no suitable sampling frame, probability sampling cannot be used. Gile noted Elliott provided clever examples of how to do probability samples, but if a frame cannot be defined, it is not possible. Respondent-driven sampling needs people who are well connected by a network. Venue-based sampling needs people who congregate in a physical place. Online sampling needs a population with online activity who are likely to click on an ad.
Advantages and Weaknesses
Gile said probability sampling allows for straightforward and valid inference in a wide variety of settings. If probability sampling is feasible, it is preferred for this reason. Respondent-driven sampling is good at reaching unknown parts of the population and allows for approximately valid inference. Studies have shown that some respondent-driven sampling can get to people who might not have been reached by other methods. Venue-based sampling presents a valid sampling frame based on times and locations and avoids many biases that might occur in more subjective sampling methods. Finally, online sampling offers great ease of implementation and a tremendous cost advantage over the other approaches.
Gile posed a few questions for discussion. How can sampling weights for venue-based samples be estimated? How can missingness in online surveys be monitored? How can multiple methods for surveying a population be used? She noted the importance of protecting the confidentiality of respondents’ identification. Finally, she said, in terms of a data analysis or a data mining challenge, a company or political campaign can be “greedy” in how it draws inference and makes decisions. However, researchers and public health officials do not just want to find another person; they need to be fair and careful to not disadvantage certain populations. This may mean that some of the methods used and developed seem less powerful on the surface than some of those used by other actors, but it is important to be responsible to constituents.
Gordon Willis (NCI) asked Gile about the choice of sample design for small population studies based on study objectives, and in particular whether the study involves estimation of population frequencies as opposed to identifying associations in the data. For example, if he wanted to know the unemployment rate in a particular population, he would lean toward a population-based probability sample approach. On the other hand, if he was assessing a stop-smoking intervention known to be effective in one population, it might be preferable to use a more limited and less expensive nonprobability approach to assess whether the intervention is effective in another small population.
Gile said the method used should depend on what one wants to learn. Power analysis and sample size calculations are intended for this purpose; they can determine how large a sample is needed. If only a rough answer is needed, a more basic approach may be fine. If a very precise answer is needed, researchers need to do something more precise.
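Gile's point about rough versus precise answers can be made concrete with the textbook sample size formula for estimating a proportion, n = z² p(1 − p) / e², using the conservative p = 0.5. The margins of error below are arbitrary examples:

```python
# Sketch: standard sample size calculation for estimating a proportion
# within a given margin of error at 95 percent confidence.
# Uses the conservative assumption p = 0.5; margins of error are illustrative.

import math

def sample_size(margin_of_error: float, p: float = 0.5, z: float = 1.96) -> int:
    """Sample size so a 95% confidence interval has half-width <= margin_of_error."""
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

print(sample_size(0.05))  # rough answer (+/- 5 points): 385
print(sample_size(0.01))  # precise answer (+/- 1 point): 9604
```

Tightening the margin of error from 5 percentage points to 1 multiplies the required sample by 25, which is why precision is so costly for rare populations.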
Elliott added that another way to reframe the question would be to ask whether a regression coefficient is less biased than a population prevalence estimate when using a nonprobability sample. He said studies that have addressed this question suggest there is probably less bias on average in estimating the effectiveness of an intervention using a regression relationship. He cautioned that using a nonprobability sample to test effectiveness still has a chance of bias.
Sullivan pointed out that for some populations there are no sampling frames, so probability sampling is not possible. Respondent-driven sampling arose because of hidden populations. In some situations, nonprobability methods may be the only choice. He also observed that in some situations it is more important to monitor changes over time in health behaviors. In these situations, it may be better to compromise some of the accuracy of point estimates to make sure methods are sufficiently replicable to identify changes over time.
Graham Kalton pointed out two large projects with issues of generalizability: the UK Biobank cohort study that has enrolled around 500,000 people aged 40 to 69 in selected areas in the United Kingdom with a very low response rate, and the planned All of Us study in the United States that will enroll about 1 million volunteers. In his view, the argument that such studies can be used for measuring associations needs to be treated with due caution and evaluated.
Richard Moser (NCI) asked Lee about the respondent-driven sampling assumption that recruitment is random within each individual’s network. He questioned whether a seed or recruiter would use a random method. Lee
responded that the assumption could easily be violated in reality. However, it was an assumption that respondent-driven sampling relied on when it first started. She acknowledged it is tricky to check these assumptions because researchers frequently do not know much about the target population. Gile agreed that in many cases the assumption about recruitment being random is violated in respondent-driven sampling. One of her former students is working on an estimator that adjusts for differential recruitment effectiveness.
Robert Croyle (NCI) asked about a hypothetical grant process where an application proposes something that is really a census. For example, the proposal might be a study of the 125 Korean American breast cancer cases in the state of Iowa. The researcher proposes to use the population registry, include every case of breast cancer among Korean Americans, and invite all such individuals to participate. Then, the researcher proposes to make post-recruitment adjustments to account for differences (in age, wealth, or something else) between those who agreed to be in the study and those who refused. He asked about other things these researchers should be concerned with from a statistical perspective.
Sullivan responded that “if you have a census, then you have a census.” However, there are guidelines for the evaluation of registries or surveillance systems. One of the questions he would ask is whether there has been a check on the coverage of the system. Have there been any validations done to know what proportion of the actual cases the 125 represent? If the registry includes less than 80 or 90 percent of the cases, then most statistical methods and the assumptions that would be used to make inference are not applicable, and one is left with describing an incomplete surveillance system. He noted the NHBS is a surveillance system that tries to collect a census and does quality checks to make sure it covers at least 85 percent of cases. As a result, it typically just reports the numbers without adjustment.
Elliott provided another point of view, pointing to different schools of thought on how to analyze what might be called a census. In particular, one of the areas where people have different thoughts is whether the sample should be characterized as having uncertainty. One school of thought would say no; it is a census and one is trying just to describe those people. He went on to argue, however, that is not typically the question. Sometimes researchers might be implicitly asking about the likely situation in the future if things were unchanged. These situations carry some uncertainty.
Kalton said the underlying question relates to the population of statistical inference. If the population of inference is just 125 people that is one matter. However, he said, researchers generally want to make inferences about a larger population. This situation is actually a nonprobability sample of this larger population. The researcher can sometimes make weighting adjustments to bring the sample in line with known characteristics of the larger population. He said the evidence is that such weighting adjustments
do not work very well, but it might be better to do them than not to do them.
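One common form of the weighting adjustment Kalton describes is raking (iterative proportional fitting), which brings a sample's weighted margins into line with known characteristics of the larger population. A minimal sketch, with a hypothetical sample and invented target shares:

```python
# Sketch: raking (iterative proportional fitting) a sample to known
# population margins on two characteristics. All data are hypothetical.

sample = [("F", "young"), ("F", "old"), ("M", "young"),
          ("M", "young"), ("M", "old")]
weights = [1.0] * len(sample)

targets = [
    (0, {"F": 0.5, "M": 0.5}),        # known population share by sex
    (1, {"young": 0.4, "old": 0.6}),  # known population share by age group
]

for _ in range(100):  # alternate margin adjustments until convergence
    for dim, share in targets:
        total = sum(weights)
        by_cat = {}
        for attrs, w in zip(sample, weights):
            by_cat[attrs[dim]] = by_cat.get(attrs[dim], 0.0) + w
        # Scale each weight so this dimension's weighted shares match the target.
        weights = [w * share[attrs[dim]] * total / by_cat[attrs[dim]]
                   for attrs, w in zip(sample, weights)]

share_f = sum(w for attrs, w in zip(sample, weights) if attrs[0] == "F") / sum(weights)
share_young = sum(w for attrs, w in zip(sample, weights) if attrs[1] == "young") / sum(weights)
print(f"weighted share female: {share_f:.3f}")      # ~0.500
print(f"weighted share young:  {share_young:.3f}")  # ~0.400
```

The adjusted weights reproduce the known margins, but, as Kalton cautions, matching margins does not guarantee that the sample matches the population on the outcomes of interest.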
Croyle noted that the classic context where this comes up is American Indian and Alaskan Native studies. The study is about a small population. Reviewers may not want to see a study on a small number of people. Then it becomes a cultural scientific issue because of beliefs about generalizability. Tribes do not necessarily believe that study results on another tribe apply to them. The question about replication comes back to the funders. Clearly, identifying the desired inference is the key issue. The other part of the challenge with small population studies is that the denominator is smaller than people are comfortable with, especially since scientists usually have as many variables as there are people in the sample.
Scarlett Gomez (University of California, San Francisco) said she has experience with some of these methods as part of a case control study of breast cancer in Asian Americans. She said she thinks case control studies have fallen out of favor because the conventional thinking is that it is impossible to recruit representative controls. Her study started out with cases from a cancer registry, so they knew what the population-based sampling frame should be. They sampled based on Asian ethnicity, age, and socioeconomic status. They used several different recruitment methods, including probability sampling from a mailing list, online recruitment, social media, and respondent-driven sampling. They compared the composition of their sample to the American Community Survey and found good agreement. They monitored the people they were recruiting to make sure their characteristics matched the cases from the cancer registry. The different sampling approaches helped them find people with different characteristics and to fine-tune recruitment of the controls. She wondered what the panel thought about the approach of using multiple sampling methods to recruit a representative control group for a case control study.
Gile responded that as a mechanism to match characteristics of two groups, she viewed this example as a targeted convenience sample of people who match the cases. She noted Gomez described a challenging problem, and, as a convenience mechanism, it seems like a reasonable approach.
Sullivan said that as long as the analysis includes an unconditional regression with the factors they tried to set equal, it is a good approach. The question of how many features to match on is different. He said studies in the published literature have looked at case control studies and reported that the trend in epidemiology has been toward less matching.
James Allen (University of Minnesota) said the issue of why a population becomes difficult to reach often comes out of history, stigmatized behavior, or severe consequences related to the behavior. The question of trust implicit in different sampling methods led him to think that from a statistical assumptions perspective, one method may be preferred, but in
terms of the quality of the data, a different method may provide enhanced access with some difficult-to-reach populations. Perhaps this is a tradeoff to consider, he suggested.
Related to trust, Sullivan observed that Lee reported that in her qualitative work, the respondents wanted to see the University of Michigan connection in the ads. He and his colleagues do qualitative work before each study and usually find that universities are seen as trusted research partners. Participants consistently say they want to see a university mentioned, and they prefer to see a local university. In terms of reporting same-sex behavior or attraction, there may be underreporting even in a trusted environment. In some large studies, there are men who deny having anal intercourse even though they have rectal STIs. Biological markers increase the sensitivity of the data. It is not really about ascribing misclassification, but rather increasing the correctness of the data to increase sensitivity.
Sullivan noted other gradients across methods might be trust or the ability to reach marginalized people who might participate anonymously online, but not go to a venue, communicate with friends about it, or want their names on a list. Despite progress in the United States, there are places where receiving an HIV test kit in the mail may cause social harm.
Kalton commented that Gile had called probability sampling the gold standard. He posited that all of the methods have serious problems. Probability sampling faces increasing nonresponse issues. This brings into question the valid inferences possible with complete data. Probability sampling also faces problems in measuring membership in sensitive populations. The other methods have equally serious difficulties. The message, he said, is to be careful about how to interpret data.