Experimental Methods for Assessing Discrimination
As we discussed in Chapter 5, at the core of assessing discrimination is a causal inference problem. When racial disparities in life outcomes occur, explicit or subtle prejudice leading to discriminatory behavior and processes is a possible cause, so that the outcomes could represent, at least in part, the effect of discrimination. Accurately determining what constitutes the effect of discrimination, personal choice, and other related and unrelated factors requires the ability to draw clear causal inferences. In this chapter, we review two experimental approaches that have been used by researchers to reach causal conclusions about racial discrimination: laboratory experiments and field experiments (particularly audit studies).
To permit valid causal inferences about racial discrimination, the design of an experiment and the analytic method used in conjunction with that design must address several issues. First, there are frequently intervening or confounding variables that are not of direct interest but that may affect the outcome. The effects of these variables must be accounted for in the study design and analysis. In controlled laboratory experiments, the investigator manipulates a variable of interest, randomly assigns participants to different conditions of the variable or treatments, and measures their responses to the manipulation while attempting to control for other relevant conditions or attributes. As described in the previous chapter, randomization greatly increases the likelihood of being able to infer that an observed difference between the treatment and control groups is causal. Observing a difference in outcome between the groups of participants can be the basis for a causal inference. In controlled field experiments, researchers analyze the results of a deliberately manipulated factor of interest, such as the race of an interviewer. They attempt to control carefully for any intervening or confounding variables. Random assignment of treatments to participants is frequently used to reduce any doubts about lingering effects of unobserved variables, provided, of course, that one can actually apply the randomization to the variable of interest.
In addition to the problem of credibly designing an experiment that supports a causal inference, a common weakness of experiments is a lack of external validity. That is, the results of the experiment may not generalize to individuals other than those enrolled in the experiment, or to different areas or populations with different economic or sociological environments, or to attributes that differ from those tested in the experiment.
Despite these problems, the strengths of experiments for answering some types of questions are undeniable. Even if their results may not be completely generalizable and even if they do not always capture all the relevant aspects of the issue of interest, experiments provide more credible evidence than other methods for measuring the effects of an attribute (e.g., race) in one location and on one population.
Using Experiments to Measure Racial Discrimination
Use of an experimental design to measure racial discrimination raises important questions because race cannot be directly manipulated or assigned randomly to participants. Researchers who use randomized controlled experiments to measure discrimination, therefore, can either manipulate race by varying the “apparent” race of a target person as the experimental treatment or manipulate “apparent” discrimination by randomly assigning study participants to be treated with different degrees of discrimination.
In the first case, the experimenter varies the treatment, namely, the apparent race, by such means as providing race-related cues on job applications (e.g., name or school attended) or showing photographs to participants in which the only differences are skin color and facial features. The experimenter then measures whether participants respond differently under one race treatment compared with another (e.g., evaluating black versus white job applicants or associating positive or negative attributes with photographs of blacks versus whites). In such a study, the experimenter elicits responses from the participants to determine the effect of apparent race on their behavior (e.g., whether the participants engage in discriminatory behavior toward black and not white applicants). That is, they measure the behavior of potential discriminators toward targets of different races. If successful, then finding a difference in behavior would indicate an effect of race.
In the second instance, experimenters randomly assign participants to be treated differently, that is, either with or without discrimination. This type of experiment attempts to measure the response to discrimination rather than directly measure the expression of discrimination—that is, it measures the behavior of potential targets of discrimination. Because race cannot be experimentally manipulated, an explicit specification of the behavioral process is needed that allows the translation of results from such experiments into causal statements about the actual discrimination mechanism measured in the experiment (i.e., the extent to which the experimenter can manipulate some other factor related to race, such as perception). To our knowledge, no one has attempted to carry out such formal reverse reasoning, and we believe that doing so is especially crucial when arguing for the external validity of experimental results.
One of the few examples of attempts to perform similar inferential reversals is the special case of understanding odds ratios (and adjusted odds ratios) in the context of comparing retrospective and prospective studies on categorical variables. In retrospective studies, the data are collected only after the treatment has taken place, whereas in prospective studies the data are collected on possible covariates before treatment and on outcomes after the treatment. If one has both categorical explanatory and categorical response variables, one can estimate their relationship in the prospective study based on a retrospective sample. If the logistic causal model is correct, the inference about the key causal coefficient from the retrospective study is the same as if one had done a prospective sampling on the explanatory variable.1 Those results, however, do not generalize to relationships among continuous variables.
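As a rough numerical illustration of this invariance, the following Python sketch builds a 2 × 2 population table and then thins it either by explanatory category (prospective sampling) or by outcome category (retrospective sampling). The counts, sampling fractions, and function names are hypothetical and chosen only for the example; they are not drawn from any study. Under these assumptions the odds ratio is the same in all three tables, whereas the outcome rate within an explanatory category cannot be recovered from the retrospective sample.

```python
from fractions import Fraction

def odds_ratio(table):
    """Odds ratio of a 2x2 table [[a, b], [c, d]].

    Rows are the two levels of the explanatory variable;
    columns are outcome present / outcome absent.
    """
    (a, b), (c, d) = table
    return (Fraction(a) * d) / (Fraction(b) * c)

def sample_by_rows(table, row_fractions):
    """Prospective-style sampling: keep a fixed fraction of each explanatory group."""
    return [[Fraction(x) * f for x in row] for row, f in zip(table, row_fractions)]

def sample_by_columns(table, column_fractions):
    """Retrospective-style sampling: keep a fixed fraction of each outcome group."""
    return [[Fraction(x) * f for x, f in zip(row, column_fractions)] for row in table]

def outcome_rate_first_row(table):
    """Share of the first explanatory group in which the outcome is present."""
    a, b = table[0]
    return Fraction(a) / (Fraction(a) + Fraction(b))

# Hypothetical population counts (illustrative only).
population = [[40, 160],   # explanatory level 1: outcome present, outcome absent
              [20, 380]]   # explanatory level 2: outcome present, outcome absent

# Expected counts under each sampling scheme (fractions are assumptions).
prospective = sample_by_rows(population, [Fraction(1, 2), Fraction(1, 4)])
retrospective = sample_by_columns(population, [Fraction(1), Fraction(1, 10)])

print("odds ratio, population:   ", odds_ratio(population))
print("odds ratio, prospective:  ", odds_ratio(prospective))
print("odds ratio, retrospective:", odds_ratio(retrospective))
print("outcome rate, population:   ", outcome_rate_first_row(population))
print("outcome rate, retrospective:", outcome_rate_first_row(retrospective))
```

The sketch works with expected counts rather than random draws, so it shows the algebraic invariance only; sampling variability in a real retrospective study would add noise to the estimated odds ratio.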
Laboratory experiments, like all experiments, include the standard features of (1) an independent variable that researchers can manipulate (i.e., assign conditions or treatments to participants); (2) random assignment to treatment conditions; and (3) control over extraneous variables that otherwise might be confounded with the independent variable of interest, potentially undermining the interpretation of causality. Laboratory experiments occur in a controlled setting, chosen for its ability to minimize confounding variables and other extraneous stimuli.
Laboratory experiments on discrimination would ideally measure reactions to the exact same person while manipulating only that person’s race. As noted above, while strictly speaking one cannot manipulate the actual race of a single person, experimenters do typically either manipulate the apparent race of a target person or randomly assign subjects or study participants to the experimental condition while attempting to hold constant all other attributes of possible relevance. One common method of varying race is for experimenters to train several experimental confederates—both black and white—to interact with study participants according to a prepared script, to dress in comparable style, and to represent comparable levels of baseline physical attractiveness (see, e.g., Cook and Pelfrey, 1985; Dovidio et al., 2002; Henderson-King and Nisbett, 1996; Stephan and Stephan, 1989). Another common method of varying race involves preparing written materials and either incidentally indicating race or attaching a photograph of a black or white person to the materials (e.g., Linville and Jones, 1980).
Effects of race occur in concert with other situational or personal factors, called moderator variables, that may increase or decrease the effect of race on the participants’ responses. In addition to manipulating a person’s apparent race, for example, investigators may manipulate the person’s apparent success or failure, cooperation or competition, helpfulness, friendliness, dialect, or credentials (see, e.g., Cook and Pelfrey, 1985; Dovidio et al., 2002; Henderson-King and Nisbett, 1996; Linville and Jones, 1980; Stephan and Stephan, 1989). Even more often, experimenters will manipulate features of the situation expected to moderate levels of bias toward black and white targets; examples involve anonymity, potential retaliation, norms, motivation, time pressure, and distraction (Crosby et al., 1980). Finally, the study participants frequently are black and white college students (e.g., Crosby et al., 1980; Correll et al., 2002; Judd et al., 1995).
Strengths of Laboratory Experiments
Laboratory experiments, if well designed and executed, can have high levels of internal validity for causal inference—that is, they are designed to measure exactly what causes what. The direction of causality follows from the manipulation of randomly assigned independent variables that control for two kinds of unwanted, extraneous effects: systematic (confounding) variables and random (noise) variables.
Laboratory experiments are the method of choice for isolating a single variable of interest, particularly when fine-tuned manipulation of precisely defined independent variables is required. Laboratory studies also allow precise measurement of dependent variables (such as response time or inches in seating distance). The laboratory setting gives experimenters a great degree of control over the attention of participants, potentially allowing them to maximize the impact of the manipulation in an otherwise bland environment.
Because of these fine-grained methods, laboratory experiments on discrimination are well suited to examining psychological processes. Both face-to-face interactions and processes in which single individuals react to racial stimuli are readily studied in such experiments. The most sophisticated experiments show not only the effect of some variable (e.g., expectancies) on an outcome variable (e.g., discriminatory behavior) but also the mechanism or process that mediates the effect (e.g., biased interpretations, nonverbal hostility, stereotypic associations). That is, when an experiment manipulates the apparent race of two otherwise equivalent job candidates or interaction partners (as in the interracial interaction studies described later in this chapter; see Dovidio et al., 2002; Word et al., 1974), the experiment ideally should also measure some of the proposed explanatory psychological mechanisms (such as emotional prejudices and cognitive stereotypes, either implicit or explicit), as well as the predicted discrimination (either implicit behaviors, such as nonverbal reactions, or more explicit behaviors, such as verbal reactions).
A hallmark of the better laboratory experiments is that they not only test useful theories but also show how important, compelling phenomena (e.g., the automaticity of discrimination) can and do occur. Laboratory studies often show that very small, subtle alterations in a situation can have substantial effects on important outcome variables.
Measuring Racial Discrimination
Experimenters measure varying degrees of discrimination. Laboratory measures of discrimination begin with verbal hostility (e.g., in studies of interracial aggression), which can constitute discrimination when, for example, negative personal comments result in a hostile work environment (see Chapter 3). At the next level are disparaging written ratings of an individual member of a particular group (Talaska et al., 2003). If unjustified, such negative evaluations can constitute discrimination in a school or workplace.
At the subtle behavioral level, laboratory studies measure nonverbal indicators of hostility, such as seating distance or tone of voice (Crosby et al., 1980). Related nonverbal measures include coding of overt facial expressions, as well as measurement of minute nonvisible movements in the facial muscles that constitute the precursors of a frown. Experimenters study these nonverbal behaviors because they, too, could result in a hostile environment.
Moving up a level, laboratory measures of discriminatory avoidance include participants’ choice of whether to associate or work with a member of a racial outgroup, volunteer to help an organization, or provide direct aid to an outgroup member who requests it (Talaska et al., 2003). In a laboratory setting, segregation can be measured by how people constitute small groups or choose leaders in organizational teams (Levine and Moreland, 1998; Pfeffer, 1998). Finally, aggression against outgroups can be measured in laboratory settings by competitive games or teacher–learner scenarios in which one person is allowed to punish another—an outgroup member—with low levels of shock, blasts of noise, or other aversive experiences (Crosby et al., 1980; Talaska et al., 2003).
A review of laboratory studies as of the early 1980s (Crosby et al., 1980) summarized the findings as follows. Experiments on unobtrusive forms of bias and prejudice showed that white bias was more prevalent than indicated by surveys. Experiments on helping, aggressive, and nonverbal behaviors indicated that (1) whites tended to help whites more often than they helped blacks, especially when they did not have to face the person in need of help directly; (2) under sanctioned conditions (e.g., in competitive games or administration of punishment), whites acted aggressively against blacks more than against whites but only when the consequences to the aggressor were low (under conditions of no retaliation, no censure, and anonymity); and (3) white nonverbal behavior displayed a discrepancy between verbal nondiscrimination and nonverbal hostility or discomfort, betrayed in tone of voice, seating distance, and the like. This review sparked the realization, discussed in earlier chapters, that modern forms of discrimination can be subtle, covert, and possibly unconscious, representing a new challenge to careful measurement, both inside and outside the laboratory (survey measures for these forms of discrimination are discussed in Chapter 8).
Since the 1980s, laboratory experiments on discrimination have concentrated more on measuring subtle forms of bias and less on examining overt behaviors, such as helping others. This shift occurred precisely because of the discrepancy between some people’s overtly egalitarian responses on surveys and their discriminatory responses when they think no one is looking, or at least when they have a nonprejudiced excuse for their discriminatory behavior. In Boxes 6-1 through 6-3, we describe three of the best examples of controlled laboratory experiments on discrimination, ranging from simpler classic to more recent sophisticated studies. In a classic example, Word et al. (1974) created working definitions of race and discrimination to investigate subtle yet potentially powerful effects of stereotypical expectations hypothesized to result in discrimination (see Box 6-1). Another famous experiment showed that researchers can study social perception processes hypothesized to underlie discrimination, in which people see what they want to see by interpreting ambiguous evidence to fit their stereotypical biases (Darley and Gross, 1983; see Box 6-2). And in a final experiment, Dovidio et al. (2002) showed that implicit forms of prejudice tended to lead to implicit but potentially important forms of discrimination, whereas explicit forms of prejudice tended to lead to explicit forms of discrimination (see Box 6-3).
Box 6-1
In a pair of experiments, Word and colleagues (1974) elicited subtle nonverbal discriminatory behaviors from white interviewers against black job applicants and then demonstrated that such behaviors used against white applicants elicited behaviors stereotypically associated with blacks. The researchers first asked white college students to interview black and white high school applicants for a team that would plan a marketing campaign. Interviewers expected to see several applicants; the first applicant always was white, followed by black and white applicants in a randomly counterbalanced order. Unbeknownst to the interviewer participants, the applicants were confederates of the experimenters, trained to respond in a standard way. Debriefing indicated that study participants were unaware of its purposes or that the alleged applicants were confederates (probably aided by the sequence including two white applicants and one black applicant). Extensive debriefing indicated no suspicion about the confederates.
The interviewers’ nonverbal behavior indicated less immediacy (i.e., greater discomfort and less warmth) toward black than white applicants on a number of measures scored by judges behind one-way mirrors: greater physical seating distance, shorter interviews, and more speech errors. Although judges were not blind to the race of the confederates and therefore may have been influenced in their coding of the white interviewers’ behavior, three points suggest that the researchers were able to obtain fairly unbiased coding: (1) the coding consisted largely of physical measurement (e.g., seating distance, number of minutes) and counting (e.g., number of speech errors); (2) judges were unaware of the study’s hypotheses; and (3) replications of the coded behaviors produced the expected results in a second study.
In the second experiment, white interviewers were confederates trained to behave nonverbally in either a more or less immediate way toward naive white applicants; that is, they were trained to treat some of the white applicants as the black applicants in the previous study had been treated. White applicants treated as if they were black reciprocated with greater seating distance and more speech errors. They perceived the interviewer to be less friendly and less adequate. They also performed worse in the interview; were judged less adequate for the job; and appeared less calm, composed, and relaxed.
The overall point of this pair of experiments—the basic methods of which have since been replicated repeatedly—is that researchers can investigate how simulating discrimination against whites can bring about the very behaviors that are stereotypically associated with blacks (or another disadvantaged racial group), and they can measure the hypothesized mechanisms involved—that is, subtle nonverbal cues unlikely to be analyzed consciously by either perceiver or target. Moreover, researchers can mimic the employment interview context to examine the potentially large effects that nonverbal forms of discrimination are hypothesized to have on people’s ability to obtain a job.
Other provocative recent experiments have shown that actual discriminatory behavior can follow from subliminal exposure to racial and other demographic stimuli (Bargh et al., 1996). This work has revealed that exposure to concepts and stereotypes at speeds too fast for conscious recognition primes relevant behavior, even though participants cannot remember or report having seen the priming stimuli. For example, researchers randomly assigned participants to see, at subliminal speeds, words related to rudeness or neutral topics and showed that those participants exposed to rude words responded more rudely to an experimenter. In a parallel experiment, subliminal exposure to photographs of unfamiliar black male faces, as compared with white ones, was followed by more rude, hostile behavior when the white experimenter subsequently made an annoying request. Similar results have been demonstrated for exposure to phenomena related to being elderly, which resulted in participants walking more slowly to the elevator after the experiment. The point is that researchers can manipulate racial cues without participants’ conscious awareness and measure subtle forms of behavior that, if occurring selectively toward members of one racial group or another, could constitute a hostile environment form of discrimination. Other more direct forms of discrimination are also possible to measure in such experiments, such as making negative comments in a job interview.
Box 6-2
In a laboratory experiment conducted by Darley and Gross (1983), participants viewed a child depicted in a 6-minute videotape as coming from either a high or low socioeconomic background, based on the setting in which she was shown playing. When asked to rate her academic performance, they acknowledged not having enough information and rated her ability at grade level. Other participants saw the initial 6-minute videotape depicting socioeconomic status but also saw an additional 12-minute videotape that depicted the child taking an oral test on which her performance was mixed. Participants shown the second video after the first no longer demurred regarding the child’s academic performance. Instead, they rated her performance as well below grade level if they had viewed the 6-minute video depicting low socioeconomic status and at grade level if they had seen the video depicting high socioeconomic status. Control participants shown the test video alone, and not also the socioeconomic status video, rated the child’s performance at about grade level.
Thus, the researchers were able to show how people perceived the academic performance tape (itself quite neutral) through the lens of their expectations, convincing themselves that they had evidence on which to base their biased judgments. Although the child on the tape was white, the applicability of this sort of socioeconomic status-based stereotype to racially tainted judgments of academic performance appears clear, and manipulating such variables sheds light on hypothesized processes of discrimination. Methods for assessing this kind of perceptual confirmation process have been replicated repeatedly. For example, in a study conducted by Sagar and Schofield (1980), black and white sixth-grade boys viewed depictions of various ambiguously aggressive behaviors by black and white actors. Participants read identical verbal descriptions of four ambiguously aggressive incidents common in middle schools: bumping in the hallway, requesting another student’s food, poking in the classroom, and using another’s pencil without permission. The race of actors and targets was not specified verbally, but each incident was accompanied by one of four drawings of the event, identical except for the depicted race of the actor and target. Participants saw each incident only once and each in just one of the four possible combinations of actor race and target race. Participants rated how mean, threatening, friendly, and playful each incident was. Researchers were able to show that all the participants, regardless of race, rated the behaviors as more mean and threatening when a black child enacted them than when a white child did.
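One way to satisfy the constraint in the Sagar and Schofield (1980) design described above, in which each participant sees every incident once and every actor-race and target-race combination once, is a cyclic Latin square. The Python sketch below is an assumption about how such counterbalancing could be arranged, not a description of the procedure the original researchers used; the incident labels paraphrase the text, and the function name is hypothetical.

```python
from itertools import product

# Incident labels paraphrased from the study description.
INCIDENTS = [
    "bumping in the hallway",
    "requesting another student's food",
    "poking in the classroom",
    "using another's pencil",
]

# The four (actor_race, target_race) drawings available for each incident.
RACE_COMBINATIONS = list(product(["black", "white"], repeat=2))

def drawings_for(participant_index):
    """Return {incident: (actor_race, target_race)} for one participant.

    A cyclic shift of the combination list gives a Latin square: within a
    participant every combination appears once, and across any four
    consecutive participants every incident is paired with every combination.
    """
    shift = participant_index % len(RACE_COMBINATIONS)
    return {
        incident: RACE_COMBINATIONS[(position + shift) % len(RACE_COMBINATIONS)]
        for position, incident in enumerate(INCIDENTS)
    }

for participant in range(4):
    print(participant, drawings_for(participant))
```

In practice a researcher would probably also randomize the order in which incidents are presented and the mapping of participants to rows of the square; the sketch fixes both for readability.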
These examples illustrate the range of aspects of racial discrimination that can be examined in laboratory settings. Such experiments can manipulate racial and moderator variables; test various hypothesized mechanisms of discrimination, such as attitudes; and assess various hypothesized manifestations of discrimination, including verbal, nonverbal, and affiliative responses. They can also simulate pieces of real-world situations of interest, such as job applications. Most of the phenomena studied in experiments on race discrimination have been replicated in studies of gender discrimination and sometimes age, disability, class, or other ingroup–outgroup variations. Research indicates that gender, race, and age are the most salient, immediately encoded social categories (Fiske, 1998).
Limitations of Laboratory Experiments
Laboratory experiments usually are limited in time and measurement, so they generally do not aim to answer questions about behavior over long periods of time or behavior related to entire batteries of measures. The purpose of a laboratory experiment may include one or more of the following: (1) to demonstrate that an effect indeed can occur, at least under some conditions, with some people, for some period of time; (2) to create a simulation or microcosm that includes the most important factors; (3) to create a realistic psychological situation that is intrinsically compelling; or (4) to test a theory that has obvious larger importance.
Laboratory experiments are also at risk for various biases related to the settings in which they occur. For example, they may be set up in such a narrow, constraining way that the participants have no choice but to respond as the experimenters expect (Orne, 1962). Crafting more subtle manipulations and providing true choice in response options can sometimes be used to limit the potential biases in such cases. In addition, the experimenter may inadvertently bias presentation of the manipulations and measures, so that participants are equally inadvertently induced to confirm the hypotheses (Rosenthal, 1976). This problem can often be addressed using double-blind methods, in which experimenters as well as participants are not aware of the treatment assigned to them. Participants may also worry about whether their behavior is socially acceptable (Marlowe and Crowne, 1961) and fail to react spontaneously. Nonreactive, unobtrusive, disguised measurement can avert this problem. It is worth noting that not all of these issues are unique to the laboratory. Many of the potential biases and artifacts of laboratory experiments also occur at least as often in other kinds of experiments (e.g., field experiments, which we turn to next), as well as with nonexperimental methods (natural experiments and observational studies, such as surveys).
Box 6-3
Laboratory experiments can create working definitions of manipulated race, randomly assign participants to interact with black or white confederates, and measure a variety of proposed psychological mechanisms (implicit and explicit attitudes) to determine their effect on various types of discriminatory behavior. For example, Dovidio et al. (2002) conducted a multiphase experiment on how whites’ explicit and implicit racial attitudes predict bias and perceptions of bias in interracial interactions. At the beginning of the term, white college students completed a 20-item standardized measure of prejudice, the Attitudes Toward Blacks Scale. Later in the semester, 40 students (15 male and 25 female) participated in what they believed to be two separate studies. In the first, a decision task required participants to respond as quickly as possible—after the letter P or H was displayed on a computer screen—as to whether a given word displayed for each trial could ever describe a person or a house. Unbeknownst to them, on critical trials versus practice trials the letter P was preceded by a standardized schematic sketch of a black or white man or woman, presented at subliminal speeds (0.250 seconds). This level of presentation has been shown repeatedly to prime relevant associations in memory and, in particular, stereotypes. As in countless other studies (e.g., see Fazio and Olson, 2003), the findings in this study revealed subtle forms of stereotypic association when people responded more quickly to negative words (“bad,” “cruel,” “untrustworthy”) preceded by a black face and to positive words (“good,” “kind,” “trustworthy”) preceded by a white face, and more slowly to the converse combinations. As is typical with this method, no participant reported being aware of the subliminal faces. Such studies show how researchers can measure automatic and unconscious racial bias, regardless of expressed levels of prejudice (Devine, 1989).
At this point, then, the experimenters had access to two kinds of attitudes: the explicit ones expressed on the questionnaire and the implicit ones suggested by the participants’ speed of stereotypic associations. These are the psychological causes of different kinds of discrimination hypothesized in the next step.
In what participants assumed to be a separate study focused on acquaintance processes, the participants met separately with two interaction partners (one white and one black) for a 3-minute conversation about dating in the current era. Five white and four black student confederates, trained to behave comparably to each other, played the role of interaction partners. All were unaware of the study’s hypotheses and the participants’ levels of implicit and explicit prejudice. After each interaction, both the participant and the confederate (in separate rooms) completed scales assessing their own and each other’s perceived friendliness (pleasant and not cold, unfriendly, unlikable, or cruel). Two coders used the same scales to rate, separately, participants’ verbal and nonverbal behavior, respectively, from audiotapes and from videotapes on which only the participant was visible. Two more coders rated participants’ overall friendliness from audio and video information combined. Analyses compared the differences in the participants’ responses to the white and black confederates as rated by the participants themselves, the confederates, and the observers.
Two patterns of response emerged: one an explicit and overt sequence of processes, and the other an implicit and subtle sequence of processes. The explicit sequence involved overt measures of verbal behavior. White participants’ scores on the attitudes questionnaire and their self-reported friendliness (both measures of explicit, overt prejudice) correlated with each other; that is, whites’ self-reported attitudes predicted bias in verbal friendliness toward black relative to white confederates. These measures also correlated with verbal friendliness as rated by observers from audiotapes (a measure of explicit, overt discriminatory behavior).
In contrast, the implicit sequence of processes was indicated by responses to subliminal primes (an implicit, subtle measure of prejudice), which correlated significantly with a series of implicit, subtle forms of discriminatory behavior: nonverbal behavior rated by observers from silent videotapes, confederate perceptions of participants’ friendliness, and overall friendliness rated by other observers, which also correlated significantly with each other. In other words, whites’ implicit attitudes predicted their bias and others’ perceptions of bias in nonverbal friendliness. None of the explicit and implicit measures correlated significantly with each other, indicating that the implicit and explicit sequences are independent. Each sequence is important: Effect sizes were moderate to large by social science standards.
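The response-time logic of the priming task in the Dovidio et al. (2002) study can be summarized as a difference in mean latencies between stereotype-incongruent and stereotype-congruent trials. The Python sketch below is a simplified illustration under stated assumptions: the trial records and latencies are invented, the scoring rule is a plain difference of means, and the function names are hypothetical. It is not the scoring procedure reported by the original authors.

```python
from statistics import mean

def is_congruent(prime_race, word_valence):
    """A trial is stereotype-congruent when a black prime precedes a negative
    word or a white prime precedes a positive word (as described in the text)."""
    return (prime_race, word_valence) in {("black", "negative"), ("white", "positive")}

def implicit_bias_score(trials):
    """Mean latency on incongruent trials minus mean latency on congruent trials.

    Each trial is (prime_race, word_valence, latency_ms). A positive score means
    faster responses on stereotype-congruent pairings.
    """
    congruent = [rt for race, valence, rt in trials if is_congruent(race, valence)]
    incongruent = [rt for race, valence, rt in trials if not is_congruent(race, valence)]
    return mean(incongruent) - mean(congruent)

# Invented latencies in milliseconds, for illustration only.
example_trials = [
    ("black", "negative", 520),
    ("white", "positive", 510),
    ("black", "positive", 580),
    ("white", "negative", 570),
]

print("implicit bias score (ms):", implicit_bias_score(example_trials))
```

A score of this kind, computed per participant, is the sort of implicit measure that could then be correlated with coded nonverbal behavior; real analyses typically also handle error trials and outlying latencies, which the sketch omits.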
Translating Experimental Effects
Laboratory experiments are useful for measuring psychological mechanisms that lead to discriminatory behavior (e.g., implicit or explicit stereotypes), but they do not describe the frequency of occurrence of such behavior in the world. They cannot, by their nature, say how often or how much a particular phenomenon occurs, such as what proportion of a racial disparity is a function of discriminatory behavior. Thus, they can be legitimately criticized on the grounds of low external validity—that is, limited generalizability to other samples, other settings, and other measures. Laboratory experimenters can sometimes make a plausible case for generalizability by varying plausible factors that might limit the applicability of the experiment. For example, if there are theoretically or practically compelling reasons for suspecting that an effect is limited to college sophomores, one might also replicate the study with business executives on campus for a seminar or retirees passing through for an Airstream conference. But laboratory experiments rarely randomly sample participants from the population of interest. Thus by themselves they cannot address external validity, and it is an empirical question whether or how well their findings translate into discrimination occurring in the larger population. In well-designed and well-executed experiments, the effects of confounding variables are randomized, allowing researchers to dismiss competing explanations as unlikely, but they are not entirely eliminated. For this reason, replication is important. In the study of discrimination, there are many laboratory experiment results that do not generalize in field settings. Findings either may diminish or not hold up over time. However, many other effects tested both in the laboratory and in the field have been consistent, some showing even stronger effects in the field (Brewer and Brown, 1998; Crosby et al., 1980; Johnson and Stafford, 1998).
Field experiments have many of the standard features commonly found in laboratory experiments. The term field experiment refers to any fully randomized research design in which people or other observational units found in a natural setting are assigned to treatment and control conditions. The typical field experiment uses a two-group, post-test-only control group design (Campbell and Stanley, 1963). In such a design, people are randomly assigned to treatment and control groups. An experimental manipulation is administered to the treatment group, and an outcome measure is obtained for both treatment and control groups. Because of random assignment, differences between the two groups provide some evidence of an effect of the manipulation. However, because no preexperiment measure for the outcome is obtained (which is an option in laboratory experiments), one cannot be altogether sure whether the groups are similar prior to the experiment. Nonetheless, randomization protects against this problem because it ensures that, on average, the two groups are similar except for the treatment.
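The logic of the two-group, post-test-only design can be illustrated with a minimal sketch. Everything in it is hypothetical: the units, the baseline values, the treatment effect of 1.0, and the function names are made up for illustration and are not drawn from any study discussed here. The sketch shows that, with random assignment, the difference in group means estimates the treatment effect, but that in any single experiment the estimate can differ from the true effect because the groups' baselines are similar only on average.

```python
import random
from statistics import mean

def assign_two_groups(units, seed=0):
    """Randomly split units into treatment and control groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def posttest_difference(treated_outcomes, control_outcomes):
    """Post-test-only estimate: difference in mean outcomes between the groups."""
    return mean(treated_outcomes) - mean(control_outcomes)

# Hypothetical data: 10 units with made-up baselines and a made-up effect of 1.0.
units = list(range(10))
treatment, control = assign_two_groups(units, seed=42)
baseline = {u: 0.1 * u for u in units}
TRUE_EFFECT = 1.0

# No pre-experiment measure is used; only post-test outcomes are compared.
treated_outcomes = [baseline[u] + TRUE_EFFECT for u in treatment]
control_outcomes = [baseline[u] for u in control]
estimate = posttest_difference(treated_outcomes, control_outcomes)
print(f"estimated effect: {estimate:.2f} (true effect in this toy setup: {TRUE_EFFECT})")
```

Rerunning the sketch with different seeds gives estimates scattered around the true effect, which is the sense in which randomization makes the groups comparable on average rather than in every realization.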
Field experiments are attractive and often persuasive because, when done well, they can eliminate many of the obstacles to valid statistical inference. They can measure the impact of differential treatment more cleanly than nonexperimental approaches, yet they have the advantage of occurring in a realistic setting and hence are more directly generalizable than laboratory experiments. Furthermore, for measuring discrimination, they appear to reflect the broader public vision of what discrimination means—the treatment of two (nearly) identical people differently.
The social scientific knowledge necessary to design effective field experiments is stronger in some areas than in others. For example, our knowledge of the mechanisms and incentives underlying real estate markets is arguably more advanced than our knowledge of the incentives underlying labor markets (Yinger, 1995). Hence, our ability to use field experiments is correspondingly stronger for measuring behavior in housing markets than in other areas. We therefore focus our discussion below on a common methodology—audit or paired testing—used particularly to assess discrimination in housing markets as well as in other areas. With the exception of a study we describe later (in Box 6-5), we do not review other types of field experiments in the domain of racial discrimination.
Audit or paired-testing methodology is commonly used to measure the level or frequency of discrimination in particular markets, usually in the labor market or in housing (Ross, 2002; for a summary of paired-testing studies in the labor and housing markets, see Bendick et al., 1994; Fix et al., 1993; Neumark, 1996; Riach and Rich, 2002). Auditors or testers are randomly assigned to pairs (one of each race) and matched on equivalent characteristics (e.g., socioeconomic status), credentials (e.g., education), tastes, and market needs. Members of each pair are typically trained to act in a similar fashion and are equipped with identical supporting documents. To avoid research subjects becoming suspicious when they confront duplicate sets of supporting documents, researchers sometimes vary the documents while keeping them similar enough that the two testers have equivalent levels of support.
As part of the study, testers are sent sequentially to a series of relevant locations to obtain goods or services or to apply for employment, housing, or college admission (Dion, 2001; Esmail and Everington, 1993; Fix et al., 1993; National Research Council, 1989; Schuman et al., 1983; Turner et al., 1991a, 1991b; Yinger, 1995). The order of arrival at the location is randomly assigned. For example, in a study of hiring, testers have identical résumés and apply for jobs, whereas in a study of rental housing, they have identical rental histories and apply for housing. Once the study has been completed, researchers use the differences in treatment experienced by the testers as an estimate of discrimination.
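The randomization of arrival order described above can be sketched as follows. The tester identifiers, location labels, and the function name are hypothetical, and choosing a pair at random for each location is an assumption made for illustration; actual studies use their own assignment protocols.

```python
import random

def schedule_visits(pairs, locations, seed=0):
    """For each location, pick a tester pair at random and randomize who arrives first."""
    rng = random.Random(seed)
    schedule = []
    for location in locations:
        pair = rng.choice(pairs)
        order = list(pair)
        rng.shuffle(order)  # random order of arrival within the pair
        schedule.append({"location": location, "first": order[0], "second": order[1]})
    return schedule

# Hypothetical tester pairs (one tester of each race) and sampled locations.
pairs = [("tester_A1", "tester_B1"), ("tester_A2", "tester_B2")]
locations = ["ad_001", "ad_002", "ad_003", "ad_004"]
schedule = schedule_visits(pairs, locations, seed=7)
for visit in schedule:
    print(visit)
```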
To the extent that testers are matched on a relevant set of nonracial characteristics, systematic differences by the race of the testers can be used to measure discrimination on the basis of race. Propensity score matching is sometimes used when there are too many relevant characteristics to match on every one. In propensity score matching, an index of similarity is created by fitting a logistic regression with the outcome variable being race and the explanatory variables being the relevant characteristics on which one wishes to match. Subjects of one race are then paired or matched with subjects of the other race having similar fitted logit values, the propensity score index (see Rosenbaum, 2002, and the references therein for a more complete description).
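The matching procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the covariate values are made up, the logistic regression is fitted by plain gradient ascent rather than a statistical package, and the greedy one-to-one nearest-neighbor rule is only one of several matching rules in use.

```python
import math

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit a logistic regression by batch gradient ascent; returns [intercept, coefs...]."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            resid = yi - 1.0 / (1.0 + math.exp(-z))
            grad[0] += resid
            for j, xj in enumerate(xi):
                grad[j + 1] += resid * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

def logit_score(w, x):
    """Fitted logit (the propensity score index) for one subject."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def match_on_logit(X, group):
    """Greedy 1:1 matching of group-1 subjects to group-0 subjects on the fitted logit."""
    w = fit_logistic(X, group)
    scores = [logit_score(w, x) for x in X]
    ones = [i for i, g in enumerate(group) if g == 1]
    zeros = [i for i, g in enumerate(group) if g == 0]
    pairs = []
    for i in ones:
        if not zeros:
            break
        j = min(zeros, key=lambda k: abs(scores[k] - scores[i]))
        pairs.append((i, j))
        zeros.remove(j)  # each comparison subject is used at most once
    return pairs, scores

# Hypothetical standardized covariates (e.g., education, income) and a 0/1 group code.
X = [[0.2, 1.0], [0.5, 0.3], [-0.4, -0.2], [1.1, 0.8],
     [-0.9, -1.0], [0.1, 0.4], [-0.3, 0.9], [0.7, -0.6]]
group = [1, 0, 1, 0, 1, 0, 0, 1]
pairs, scores = match_on_logit(X, group)
print(pairs)
```

In practice one would also check covariate balance within the matched pairs and might discard matches whose scores differ by more than a chosen tolerance.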
Paired-testing studies use an experimental design in natural settings to obtain information on apparently real outcomes and to assess the occurrence and prevalence of discrimination. An advantage to using paired tests is that individuals are matched on observed characteristics relevant to a particular market. Effective matching decreases the likelihood that differences are due to chance rather than discrimination because many factors are controlled for.
Paired testing is used in audit studies, such as the U.S. Department of Housing and Urban Development’s (HUD’s) national study of housing discrimination, to estimate overall levels of discrimination against racial and ethnic minorities. Audit studies can be highly effective enforcement tools for assessing treatment or detecting unfavorable treatment of members of disadvantaged groups (see Ross and Yinger, 2002).4 Studies in the housing market (e.g., Wienk et al., 1979; Yinger, 1995) and in the labor market (e.g., Bendick et al., 1994; Cross et al., 1990; Neumark, 1996; Turner et al., 1991b) using the paired-testing methodology provide evidence of discrimination against racial minorities (see National Research Council, 2002b; Ross and Yinger, 2002). In the case of housing, these studies might involve selecting a random sample of newspaper advertisements and then investigating the behavior of real estate agencies associated with these advertisements (Ross and Yinger, 2002). Employment audits are similarly based on a random sample of advertised jobs. While providing the generality valued by researchers, these studies also make it possible to observe the behavior of individual agencies or firms. This approach has been applied to other areas as well (see the examples in the next section).
Box 6-4
Perhaps the most common method of assessing discrimination in housing is the fair housing audit. This approach, also referred to as paired testing in an enforcement context, is used in fair housing enforcement by private fair housing groups, public fair housing agencies, and the U.S. Department of Justice (Yinger, 1995). HUD has conducted several times what is by far the largest field experiment using matched-pair methodology, the Housing Discrimination Study (HDS). Results of the most recent 2000 HDS (released in November 2001) show that housing discrimination has declined since 1989 for African Americans and Hispanics, but it nonetheless persists: Non-Hispanic whites are consistently favored over African Americans and Hispanics in metropolitan rental and sales markets (Turner et al., 2002b). Similarly, Asians and Pacific Islanders in metropolitan areas nationwide (particularly homebuyers) face significant levels of discrimination (Turner et al., 2003; also, see National Research Council, 2002b, for a review of the 2000 HDS design).
Housing audits conducted after the passage of the Fair Housing Act (Title VIII of the Civil Rights Act of 1968) have been used to address discrimination and ensure equal opportunity in housing. The first audits were carried out by local fair housing organizations, often for purposes of enforcement but also to gather information. Results of the earliest audits were impaired by small sample sizes, nonrandom assignment methods, and failure to use standardized instruments and procedures. However, practices and methods gradually improved, and the cumulative body of work consistently showed that African Americans continued to suffer from various forms of housing discrimination despite the legal prohibition of such discrimination (see Galster, 1990a, 1990b, for reviews of local studies).
The first attempt to measure housing discrimination nationally was carried out by HUD in the HDS of 1977. This study covered 40 metropolitan areas chosen to represent areas with central cities that were at least 11 percent black. The study confirmed the results of earlier local housing audits and demonstrated that discrimination was not confined to a few isolated cases (Wienk et al., 1979).
The 1977 HDS was replicated in 1988. Twenty audit sites were randomly selected from metropolitan areas having central-city populations exceeding 100,000 and that were more than 12 percent black. Real estate ads in major metropolitan newspapers were randomly sampled, and realtors were approached by auditors who inquired about the availability of the advertised unit and other units that might be on the market. The study covered both housing rentals and sales, and the auditors were assigned incomes and family characteristics appropriate to the housing unit advertised (Turner et al., 1991a).
The resulting data offered little evidence that discrimination against blacks had declined since the 1977 assessment (Yinger, 1993). The incidence of discriminatory treatment (defined as the percentage of encounters in which discrimination occurred) was over 50 percent in both the rental and the sales markets. The severity of the discrimination was also very high (severity being the number of units made available to whites but not blacks). Across indicators (e.g., number of advertised units shown, number of other units mentioned or shown, and location of units shown), between 60 and 90 percent of the housing units made available to whites were not brought to the attention of blacks. Over the course of the 1990s, various researchers carried out housing audits in different metropolitan areas using various methods (Galster, 1998; Massey and Lundy, 2001; Ondrich et al., 2000).
Much of the use of audit or paired-testing methodology to study discrimination flows primarily from federal investigations concerning housing discrimination. National results of the 2000 Housing Discrimination Study (2000 HDS), conducted by the Urban Institute for HUD, show that housing discrimination persists, although its incidence has declined since 1989 for African Americans and Hispanics. Non-Hispanic whites are consistently favored over African Americans and Hispanics in metropolitan rental and sales markets (Turner et al., 2002b); similarly, Asians and Pacific Islanders in metropolitan areas nationwide (particularly homebuyers) face significant levels of discrimination (Turner et al., 2003; see Box 6-4 for a brief history of housing audits). In another example, Yinger (1986) studied the Boston housing rental and sales markets in 1981. In the rental market, whites discussed 17 percent more units with a rental agent and were invited to inspect 57 percent more units than blacks. In the sales market, whites discussed 35 percent more houses and were invited to inspect 34 percent more houses; moreover, the difference in treatment was larger for low-income families and families with children. Yinger also found substantial variation in treatment across neighborhoods. Taken together, these results document significant discrimination in the housing market.
As reported by Ross and Yinger (2002) and by Riach and Rich (2002), although the typical audit study concerns housing (e.g., Donnerstein et al., 1975; Schafer, 1979; Wienk et al., 1979; Yinger, 1986), researchers have used variants of the design described above to examine discrimination in other areas. Areas studied include the labor market (Turner et al., 1991b), entry-level hiring (Cross et al., 1990), automobile purchases (e.g., Ayres and Siegelman, 1995), helping behaviors (Benson et al., 1976), small favors (Gaertner and Bickman, 1971), being reported for shoplifting (Dertke et al., 1974), obtaining a taxicab (Ridley et al., 1989), preapplication behavior by lenders (Smith and Delair, 1999; Turner et al., 2002a), and home insurance (Squires and Velez, 1988; Wissoker et al., 1997).
In an example involving automobile purchases, Ayres and Siegelman (1995) sent 38 testers (19 pairs) to 153 randomly selected Chicago-area new-car dealers to bargain over nine car models. Testers bargained for the same model (a model of their mutual choice) at the same location within a few days of each other. In contrast with the common paired-testing design, pair membership was not limited to a single pair; instead, testers were assigned to multiple pairs. Also, testers did not know that the study was intended to investigate discrimination or that another tester would be sent to the same dealership. Testers were randomly allocated to dealerships, and the order of their visits was also randomly assigned. The testers were trained to follow a bargaining script in which they informed the dealer early on that they would not need financing. They followed two different bargaining strategies: one that depended on the behavior of the seller and another that was independent of seller behavior.
Ayres and Siegelman found that initial offers to white males were approximately $1,000 over dealer cost, whereas initial offers to black males were approximately $1,935 over dealer cost. White and black females received initial offers that were $1,110 and $1,320 above dealer cost, respectively. Final offers were lower, as expected, but the gaps remained largely unchanged. Compared with white males, black males were asked to pay $1,100 more to purchase a car, black females were asked to pay $410 more, and white females were asked to pay $92 more. These examples of evidence gleaned on market discrimination show the value of paired-testing methods for studying discrimination.
In Box 6-5, we provide an example of a field experiment on job hiring (Bertrand and Mullainathan, 2002) that emulates some of the best features of laboratory and audit studies. This study uses a large sample and avoids many of the problems of audit studies (e.g., auditor heterogeneity) by randomly assigning race to different résumés. It is a particularly good example of the possibilities of field study methodology to investigate racial discrimination.
Limitations of Audit Studies
Ross and Yinger (2002) discuss two main issues raised by researchers concerning the use of paired-testing methodology: (1) the accuracy of audit evidence and (2) its validity, particularly with respect to the target population. Such studies also typically require extensive effort to prepare and implement, and they can be very expensive.
The Accuracy Issue
Many claim that the designs of audit studies are not true between-subjects experiments because research subjects (e.g., employer or housing agent) are not assigned to treatment or control groups but are exposed to both treatment and control (see Chapter 7 for a discussion of issues in repeated-measures designs). Also, although the order of exposure for each subject is randomized so that it should balance out, the time lapse between exposures makes it possible for the difference to be unrelated to the concept of focus (i.e., discrimination). In the time between two visits to an establishment, for example, someone else other than a tester may take the job or apartment of interest.
In the housing market, newspaper advertisements are used as a sampling frame (National Research Council, 2002b), but they may not accurately represent the sample of houses that are available or affordable to members of disadvantaged racial groups. Newspaper advertisements can be limiting because the sampling frame is restricted to members of disadvantaged racial groups who respond to typical advertisements and are qualified for the advertised housing unit or job. This limited sample may lead to a very specific interpretation of discrimination. For example, members of the sample may not be aware of alternative search strategies or know of other available housing units or jobs of interest. The practical difficulties associated with any sampling frame other than newspaper advertisements (and the associated steps of training auditors and assigning characteristics to them) are difficult to overcome.
The Validity Issue
Inferential target: estimating an effect of discrimination. Researchers have also debated the validity of audit studies (see the discussion in Ross and Yinger, 2002). Heckman and colleagues criticize the calculation of measures of discrimination (Heckman, 1998; Heckman and Siegelman, 1993). They argue that an estimate of discrimination at a randomly selected firm (or in an advertisement) does not measure the impact of discrimination in a market. Rather, discrimination should be measured by looking at (1) the average difference in the treatment of disadvantaged racial groups and whites or (2) the actual experience of the average member of a disadvantaged racial group, as opposed to examining the average experience of members of disadvantaged racial groups in a random sample of firms (i.e., the focus should be on the average across the population of applicants rather than the population of firms). Both of these proposed approaches to measuring discrimination are valid, but each has limitations.
Researchers typically determine the incidence of discrimination by mea-
Bertrand and Mullainathan (2002) conducted a large-scale field experiment on job hiring by sending résumés in response to over 1,300 help-wanted advertisements in Boston and Chicago newspapers (submitting four résumés per ad). In all they submitted 4,890 résumés. For each city, the authors took résumés of actual job seekers, made them anonymous, and divided them into two pools based on job qualifications—high and low. Two résumés from each pool were assigned to each advertisement, and race was randomly assigned within each pair. Thus, they randomly assigned white-sounding names (e.g., Allison and Brad) to two of the résumés and black-sounding names (e.g., Ebony and Darnell) to the remaining two résumés. This crucial randomization step breaks the tie between the résumé characteristics and race. Addresses were also randomized across résumés so that the ties between race and neighborhood characteristics and résumé attributes and neighborhood characteristics were also broken. Thus for each ad the researchers were able to observe differential callbacks by race both within and between the high- and low-qualified résumé pools.
Using callback rate as the outcome of interest, the authors found that on average, applicants with white-sounding names received 50 percent more callbacks than applicants with black-sounding names. Specifically, the researchers found a 12 percent callback rate for interviews for “white” applicants compared with a 7 percent callback rate for interviews for “black” applicants. They also found that higher-quality résumés yielded significant returns for white applicants (14 percent callback rate for white applicant/high-quality résumés versus 10 percent for white applicant/low-quality résumés) but not for black applicants (7.7 percent callback rate for black applicant/high-quality résumés versus 7.0 percent for black applicant/low-quality résumés). The authors concluded that for blacks having more productive skills may not necessarily reduce discrimination.
By randomizing the assignment of race, the authors made it possible to directly estimate the usual missing counterfactual: whether a callback would have been received if the résumé had belonged to an applicant likely to be perceived as being of the other race. Two résumés were selected from each pool (high- and low-qualified) because the same résumé could not be sent in response to a single advertisement with different names and addresses attached but otherwise identical content. Because race was randomized within each quality pair, any difference by race in the résumé quality (within a quality pool) for a particular advertisement could be expected to average out over a large number of advertisements. Thus the outcomes of the two résumés within a quality level could be compared, and the average of these comparisons could provide an estimate of the effect of race on callbacks within each quality level, which would also provide an estimate of the effect of any interaction between race and qualifications.
More formally, with the analysis done at the résumé level, the causal effect of interest is as follows, where CB stands for callback, W for white, and B for black:
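One way to write this effect, consistent with the definitions just given, is as a difference in callback probabilities by race. The notation below is an assumption made for illustration and is not necessarily the form used by the authors.

```latex
\[
\Delta \;=\; \Pr(CB = 1 \mid W) \;-\; \Pr(CB = 1 \mid B)
\]
```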
Because race was randomized within quality levels, which were assigned to particular advertisements within particular cities, this causal difference by race can be estimated within each of those categories by calculating
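A natural estimate of this causal difference is the difference between the observed callback proportions for the two sets of résumés within a category. The notation is assumed for illustration: $n_W$ and $n_B$ denote the numbers of résumés in the category carrying white-sounding and black-sounding names, and $CB_r$ equals 1 if résumé $r$ received a callback and 0 otherwise.

```latex
\[
\hat{\Delta} \;=\; \frac{1}{n_W} \sum_{r \in W} CB_r \;-\; \frac{1}{n_B} \sum_{r \in B} CB_r
\]
```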
In addition, estimates for subpopulations within a quality level, city, or type of advertisement can be obtained by summing over just those subpopulations.
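The calculation described above, a difference in callback rates by race computed overall or within subpopulations, can be sketched as follows. The records are made-up illustrative data, not the study's data, and the function and field names are hypothetical. The sketch assumes every stratum contains at least one résumé of each race; otherwise a division by zero would occur.

```python
from collections import defaultdict

def callback_gap(records, by=()):
    """Difference in callback rates (white minus black) within each stratum.

    records: dicts with 'race' ('W' or 'B'), 'callback' (0 or 1), and stratum fields.
    by: names of the fields that define the strata; () means a single overall stratum.
    """
    counts = defaultdict(lambda: {"W": [0, 0], "B": [0, 0]})  # [callbacks, résumés]
    for r in records:
        key = tuple(r[k] for k in by)
        cell = counts[key][r["race"]]
        cell[0] += r["callback"]
        cell[1] += 1
    return {key: c["W"][0] / c["W"][1] - c["B"][0] / c["B"][1]
            for key, c in counts.items()}

def make(quality, race, outcomes):
    """Build hypothetical résumé-level records for one quality pool and race."""
    return [{"quality": quality, "race": race, "callback": cb} for cb in outcomes]

# Hypothetical outcomes: four résumés per race in each quality pool.
records = (make("high", "W", [1, 1, 0, 0]) + make("high", "B", [1, 0, 0, 0])
           + make("low", "W", [1, 0, 0, 0]) + make("low", "B", [1, 0, 0, 0]))

overall = callback_gap(records)                      # key is the empty tuple ()
by_quality = callback_gap(records, by=("quality",))  # keys ("high",) and ("low",)
print(overall, by_quality)
```

Comparing the gaps across quality pools is one simple way to look for the race-by-qualification interaction discussed in the text.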
These observations about the design and estimand of interest, along with the assumption of unit treatment additivity for city, advertisement, and a quality-by-race interaction effect, suggest the following model:
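A model of this kind might be written as below. The notation is an assumption made for illustration, with $i$ indexing city, $j$ advertisement, $k$ résumé quality level, and $l$ race; the authors' exact specification may differ.

```latex
\[
CB_{ijkl} \;=\; f\bigl(\mathrm{City}_i + \mathrm{Quality}_k + \mathrm{Race}_l
  + (\mathrm{Quality} \times \mathrm{Race})_{kl}\bigr) \;+\; \varepsilon_{ijkl}
\]
```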
where f(.) is a function that produces a probability of callback. The outcome is measured with errors εijkl that are correlated within an advertisement, as they would be for observations within a cluster in a sample. Alternatively, the advertisements themselves can be included in the model, which makes the error terms independent. This model would take the form
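With advertisement effects included, the model might be written as follows. The notation is assumed for illustration, with $i$ indexing city, $j$ advertisement, $k$ résumé quality level, and $l$ race, and with $\mathrm{Ad}_{j(i)}$ denoting the effect of advertisement $j$ in city $i$.

```latex
\[
CB_{ijkl} \;=\; f\bigl(\mathrm{City}_i + \mathrm{Ad}_{j(i)} + \mathrm{Quality}_k + \mathrm{Race}_l
  + (\mathrm{Quality} \times \mathrm{Race})_{kl}\bigr) \;+\; \varepsilon_{ijkl}
\]
```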
where the extra subscript on the Ad variable acknowledges the fact that advertisements are nested within a city.
This design has several advantages over audit studies. One advantage is the ability to use a large number of résumés, as opposed to a smaller number of auditors, and thus the ability to send those résumés out to a large number of employers. The most significant advantage of this design is the ability of the researchers to randomize race, or a proxy for race, instead of trying to match actual people on as many characteristics as possible. The significant constraint this strategy imposes is that the outcome measured—receiving a callback from an employer—is from the early stages of the job search, as is necessary when the only contact is a résumé.
One concern regarding this study is that there may be real or perceived characteristics, such as class, that are associated with distinctively African American or distinctively white names that differ from the real or perceived characteristics of these groups more generally. The authors checked whether differences in mothers’ educational status by particular distinctive names correlated with differences in callback rates for particular names and found no significant correlation. However, this check does not address the present concern; rather, it suggests that the researchers have the data to determine whether the educational status of mothers who give their children distinctively African American names differs from that of both African American and white mothers who do not give such names. The authors also report having conducted a survey in Chicago in which respondents were given a name and asked to assess features of the person. This was done to check that respondents identified the correct race with the racially distinctive name, but also could have been used to check whether there are perceptions of other characteristics that vary within race based on how racially distinct a name is.
suring (1) the proportion of cases in which a white tester reports more favorable treatment than a nonwhite tester reports (gross adverse treatment) or (2) the difference between the proportion of cases in which a white tester reports favorable treatment and the proportion of cases in which a nonwhite tester reports favorable treatment (net adverse treatment) (for further discussion of these measures, see Fix et al., 1993; Heckman and Siegelman, 1993; Ondrich et al., 2000; Ross, 2002). Because statistical measures are “model-based” aggregates, net measures correctly measure the parameters in those models conditional on important stratifying variables. The gross measure may provide useful supplemental information to the net measure if the balancing disparities are large.
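The two incidence measures defined above can be computed as in the following sketch. The audit outcomes are made-up illustrative data, and the coding of each audit as a pair of 0/1 indicators (favorable treatment for the white tester, favorable treatment for the nonwhite tester) is an assumption made for illustration. With this binary coding, the net measure equals the gross measure minus the share of audits in which the nonwhite tester was favored.

```python
def adverse_treatment(audits):
    """Gross and net adverse treatment from paired audit outcomes.

    audits: list of (white_favorable, nonwhite_favorable) pairs, each coded 0 or 1.
    """
    n = len(audits)
    white_favored = sum(1 for w, m in audits if w > m) / n
    nonwhite_favored = sum(1 for w, m in audits if m > w) / n
    gross = white_favored  # share of audits in which only the white tester was favored
    # Net: difference between the two testers' rates of favorable treatment.
    net = sum(w for w, _ in audits) / n - sum(m for _, m in audits) / n
    return {"gross": gross, "net": net, "nonwhite_favored": nonwhite_favored}

# Hypothetical results of 8 audits.
audits = [(1, 0)] * 3 + [(0, 1)] * 1 + [(1, 1)] * 2 + [(0, 0)] * 2
result = adverse_treatment(audits)
print(result)
```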
Ross and Yinger (2002) note that it would be valuable to know the true experiences of members of disadvantaged racial groups on average, but such information could not reveal the extent to which these individuals change their behavior to avoid experiencing discrimination. As a result, discrimination encountered by averaging over members of a disadvantaged racial group is not a complete measure of the impact of racial discrimination (Holzer and Ludwig, 2003). It is valuable to determine how much discrimination exists before such behavioral responses take place—which is the amount estimated using paired testing—and whether discrimination arises under certain circumstances.
The key observation of Murphy (2002) relates to the inferential target: Are we interested in estimating an overall or a market-level discrimination effect? Several distinct effects might be estimated, and they need to be distinguished because the estimates that result will not necessarily be identical. What is the appropriate population of real estate agents or ads from which to sample? Do we want to use only those agents that minorities actually visit? If past discrimination affects choice of agent, this population may vary from the population of agents selling houses that members of a nonwhite population could reasonably afford. Thus, the estimated effect of discrimination will be different under these alternative sampling strategies. Would it make sense to sample from agents or ads that could not reasonably be expected to be appropriate for most members of the nonwhite population? Murphy recommends ascertaining “discrimination in situations in which Blacks are qualified buyers” (2002:72).
Auditor heterogeneity. Heckman and colleagues (Heckman, 1998; Heckman and Siegelman, 1993) also argue that average differences in treatment by race may be driven by differences in the unobserved characteristics of testers (i.e., auditor heterogeneity) rather than by discrimination.5 Such characteristics (e.g., accent, height, body language, or physical attractiveness) of one or the other member of the pair may have a significant impact on interpersonal interactions and judgments and thus lead to invalid results (Smith, 2002). The role of these characteristics cannot be eliminated because of the paucity of observations of the research subjects. Ross (2002) addresses the problem by suggesting that, instead of trying to match testers exactly (which is virtually impossible), one can train testers to ensure that their true characteristics, as opposed to their assigned characteristics, have little influence on their behavior during the test.
Murphy (2002) addresses most of the issues raised by Heckman (1998) and discussed above. She lays out a framework showing that “as long as audit pairs are matched on all qualifications that vary in distribution by race, audit results averaged over realtors, circumstances of the visits, and auditors can be viewed as an unbiased estimate of overall-level discrimination” (Murphy, 2002:69). Murphy formally delineates the circumstances under which an estimate of discrimination will be erroneous if the researcher fails to account for individual auditor characteristics that do not vary in distribution by race and therefore were not used in the matching process.
The problem is the effect of the heterogeneity among applicants and agents. The strategy of matching on all characteristics that vary in distribution by race (including observed, unobserved, and unobservable characteristics) substitutes for randomization. The problem, of course, is that we do not know whether we have in fact matched on all characteristics that vary by race. If all unmatched characteristics have the same distribution across racial groups, and if the auditors were selected to be representative of the distribution of these characteristics, we will have managed to balance the covariates across racial groups and can estimate an unbiased effect of race. But as Heckman and others note, there are a variety of reasons to believe that this goal of matching is elusive.
Heckman and Siegelman (1993) make the point that the problem of auditor heterogeneity poses a challenge particularly for employment audits, as well as for studies of wage discrimination, because the determinants of productivity within a firm are not well understood and are difficult to measure. Ross and Yinger (2002:45) note: “Heckman and Siegelman argue that matching may ultimately exacerbate the biases caused by unobserved auditor characteristics because those characteristics are the only ones on which [testers] differ; however, the direction and magnitude of this type of bias [are] not known.” Heckman and his colleague further argue that the factors that employers use to differentiate applicants are not well known; thus, equating testers on those factors can be difficult, if not impossible. This lack of knowledge may make experimental designs particularly problematic for labor market behaviors. However, it does not affect designs in areas with a well-known or identifiable set of legitimate cues to which establishments or authorities may respond (e.g., the rental market).
There are several other problems associated with paired testing. First, paired testing cannot be used to measure discrimination at points beyond the entry level of the housing or labor market. Examples are job assignments, promotions, discharges, or terms of housing agreements and loans. Second, the assignments and training provided to testers may not correspond to qualifications and behaviors of members of racially disadvantaged groups during actual transactions. Third, actual home or job seekers do not randomly assign themselves to housing agents or employers but select them for various reasons. Finally, different employees in the same establishment may behave differently. If a rental office has more than one agent who shows apartments, different experiences of the members of the pair may be traceable to differences in the behavior of the agent with whom they dealt.
Addressing the Limitations of Audit Studies
Ross and Yinger (2002) offer several options for addressing the limitations of audit studies. Three of the approaches they identify to address the problem of accuracy are (1) broaden the sampling frame to encompass methods other than newspaper advertisements (e.g., searching neighborhoods for rental or help-wanted signs); (2) examine whether the characteristics of the specific goods or services involved (e.g., housing unit) instead of the characteristics of the testers affect the probability of discrimination (Yinger, 1995); and (3) use actual characteristics, as opposed to assigned characteristics, of testers and determine whether controlling for these characteristics influences estimates of discrimination.
To address validity concerns, Ross and Yinger (2002) suggest a strategy of sending multiple pairs to each establishment, which would allow researchers to obtain the data needed to reduce the effects of the idiosyncratic characteristics of single pairs of testers. Testers could then be debriefed after each experience to determine the agent with whom they had dealt. Doing so would not remove the potential effect of different agents on the results obtained, but it would allow researchers to assess that effect. Use of additional pairs of testers would also address issues regarding the calculation of outcome measures. Using multiple pairs might help in distinguishing systematic from random behaviors of an establishment and should, at the very least, tighten the bounds one might calculate on the basis of different mathematical formulas. Of course, care would need to be taken to avoid sending so many pairs of confederates that the research would become obvious.
Another approach to addressing the limitations of omitted variables is to collect extensive information on the actual characteristics of testers, as opposed to assigning their characteristics, and to determine whether controlling for these characteristics influences estimates of discrimination. HUD’s national audit study of housing discrimination, conducted in 2000, explicitly collected information on many actual characteristics of testers, such as their income (as opposed to the income assigned to them for the study), their education, and their experience in conducting tests.6
SUMMARY AND RECOMMENDATIONS
True experiments involve manipulation of the variable hypothesized to be causal, random assignment of participants to the experimental condition, and control of confounding variables. Experimental methods potentially provide the best solution to addressing causal inference (e.g., assigning disparate racial outcomes to discrimination per se) because well-designed and well-executed experiments have high levels of internal validity. In the language of contemporary statistics, experiments come closest to addressing the counterfactual question of how a person would have been treated but for his or her race, although they do not do so in a form that is easily translatable into direct measurement of the discriminatory effect.
6 Results based on analyses of this information are available at http://www.huduser.org/publications/hsgfin/phase1.html [accessed August 19, 2003].
The experimental method faces challenges when applied to race, which cannot be randomly assigned to an actual person. Experimental researchers frequently manipulate racial cues (e.g., racial designations or photographs on a résumé) or train black and white confederates to respond in standard ways. In both approaches, an attempt is made to manipulate apparent race, while holding all other variables constant, and to elicit a response from the participants. Although the experimental method has uncovered many subtle yet powerful psychological mechanisms, a laboratory experiment does not address the generalizability or external validity of its effects. Therefore, it is unable to estimate what proportion of observed disparities is actually a function of discrimination.
Over the past two decades, laboratory experiments have focused more on measuring subtle forms of bias and nonverbal forms of discriminatory behavior and less on examining overt behaviors, such as assisting others. If laboratory studies were to focus more on behaviors of the kind that occur in real-world settings, they could help analysts who use statistical models for developing causal inferences from observational data (see Chapter 7). Specifically, the results of real-world-oriented laboratory studies could provide more fully fleshed-out theories of discriminatory mechanisms to guide the modeling work. In turn, real-world studies based on laboratory-developed theories could be usefully conducted to try to replicate, and thereby validate, laboratory results.
Because laboratory experiments have limited external validity, researchers turn to field experiments, which emphasize real-world generalizability but inevitably sacrifice some methodological precision. Field audit studies randomly assign experimental and control treatments (e.g., black and white apartment hunters) to units (e.g., a rental agency) and measure outcomes (e.g., number of apartments shown). Aggregated over many encounters and units of analysis, audit studies come closer than laboratory experiments to assessing levels of discrimination in a particular market. Both the accuracy and the validity of audit studies on discrimination have been questioned, however. Advocates of paired-testing and survey experiments have responded that many of these limitations can be remedied.
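The aggregation step described above can be sketched in a few lines. The example assumes that each unit yields one paired difference in a count outcome (apartments shown to the white tester minus apartments shown to the black tester) and summarizes those differences with their mean and an exact two-sided sign test. The sign test is an illustrative choice made for this sketch, not the method used in any particular audit study, and the data are hypothetical.

```python
from math import comb

def paired_summary(differences):
    """Summarize paired audit differences (white minus black outcome).

    Returns the mean difference and the exact two-sided sign-test
    p-value. Tied pairs (difference of zero) are dropped from the test,
    which is the usual convention for the sign test.
    """
    if not differences:
        raise ValueError("no paired differences supplied")
    mean_diff = sum(differences) / len(differences)
    pos = sum(1 for d in differences if d > 0)
    neg = sum(1 for d in differences if d < 0)
    n = pos + neg
    if n == 0:
        # Every pair was tied, so there is no evidence either way.
        return mean_diff, 1.0
    # Under the null of no differential treatment, each untied pair is
    # equally likely to favor either tester: Binomial(n, 1/2).
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return mean_diff, p_value

# Hypothetical differences in apartments shown across ten rental agencies.
diffs = [1, 2, 0, 1, -1, 3, 1, 0, 2, 1]
mean_diff, p_value = paired_summary(diffs)
print(f"mean difference = {mean_diff:.2f}, sign-test p = {p_value:.4f}")
```

A summary of this kind describes differential treatment within the sampled units only. It does not address the questions of accuracy and validity raised in the text.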
Although generally limited to particular aspects of housing and labor markets (e.g., showing of apartments or houses and callbacks to job applicants), audit studies to measure racial discrimination in housing and employment have demonstrated useful results. It is likely that audit studies of racial discrimination in other domains (e.g., schooling and health care) could produce useful results as well, even though their use will undoubtedly present methodological challenges specific to each domain.
Recommendation 6.1. To enhance the contribution of laboratory experiments to measuring racial discrimination, public and private funding agencies and researchers should give priority to the following:
Laboratory experiments that examine not only racially discriminatory attitudes but also discriminatory behavior. The results of such experiments could provide the theoretical basis for more accurate and complete statistical models of racial discrimination fit to observational data.
Studies designed to test whether the results of laboratory experiments can be replicated in real-world settings with real-world data. Such studies can help establish the general applicability of laboratory findings.
Recommendation 6.2. Nationwide field audit studies of racially based housing discrimination, such as those implemented by the U.S. Department of Housing and Urban Development in 1977, 1989, and 2000, provide valuable data and should be continued.
Recommendation 6.3. Because properly designed and executed field audit studies can provide an important and useful means of measuring discrimination in various domains, public and private funding agencies should explore appropriately designed experiments for this purpose.