Paired Testing and the 2000 Housing Discrimination Survey
Stephen L. Ross
This paper was prepared for a National Research Council workshop on the use of paired testing to study racial and ethnic discrimination in housing markets. A primary motivation for the workshop was to examine methodological issues surrounding the use of newspaper advertisements for initiating tests. This methodology was used in the 1989 Housing Discrimination Study (HDS) and is being used in Phase I of the 2000 HDS. The approach involves a two-stage sampling of newspaper advertisements from medium-sized and large U.S. metropolitan areas with substantial minority populations. In the first stage, metropolitan areas are selected as test sites; in the second, tests are conducted within each site on the basis of a sample of advertisements from the major metropolitan newspaper.
This paper is organized into two major sections. The first introduces the concept of paired testing and reviews the major issues surrounding its use. The second provides a brief summary of the design of Phase I of the 2000 HDS, including a more detailed discussion of the advertisement-based sampling approach and potential alternatives.
Stephen L. Ross is an associate professor of economics in the Department of Economics, University of Connecticut.
PAIRED TESTING METHODOLOGY
The basic logic behind a paired test for discrimination is fairly straightforward. Two testers, one white and one minority, are matched on characteristics that are relevant to the market transaction being considered. Each tester is then sent to inquire about a market transaction under fairly controlled and highly similar circumstances. For example, in the case of rental housing, the two testers would be similar in age and physical appearance, assigned the same income and family status, and sent to inquire about the same rental unit and/or to the same rental agency using a common protocol. The result of each tester's inquiry and the treatment experienced are reported and documented in isolation from the other tester. The two testers' experiences are combined and compared at a later date by an independent third party.
Any difference between the paired testers' experiences is considered evidence of adverse or differential treatment. Paired testing is designed to measure the level or frequency of adverse treatment discrimination in a given market, where adverse treatment discrimination is defined as instances in which the treatment of an individual is adversely affected by his or her race, ethnicity, or other legally protected characteristic. Paired testing measures the level or frequency observed based on a specific protocol for sampling the market. Therefore, the testing cannot measure the actual impact of discrimination on individuals in the marketplace. For example, if real estate agents steer minority home buyers away from discriminatory lenders, a paired test of the mortgage market will not capture the mitigating effect of this behavior.
In addition, paired testing will not uncover the existence of adverse impact discrimination in a given market. Adverse impact discrimination is defined as follows. A firm or a set of firms in a market engages in many economic transactions, and for each transaction there is a relevant population of reasonable candidates. Adverse impact discrimination occurs when the policy of one or a number of firms places the minority group within the relevant population at a disadvantage relative to the majority even when the policy is applied uniformly, and this policy cannot be justified by business necessity. Naturally, this type of discrimination cannot be detected by testing because the policy is applied uniformly, and systematic racial differences in treatment may not exist.
Paired Testing Versus Analysis of Market Outcomes
As mentioned earlier, the key difference between findings based on testing data and those based on analysis of market outcomes is that testing isolates the incidence or level of discrimination observed when pairs of testers are assigned to enter a market following exogenous sampling and testing protocols. This structure raises issues concerning the relevance of the observed patterns of adverse treatment. First, the sampling and testing protocols may not yield a sample of market entries that is representative of the types of experiences typically observed in the marketplace. For example, in the 1989 HDS, a sample of units advertised in major metropolitan areas may not have been representative of the available housing stock. Likewise, the testing protocol, which required testers to walk into a real estate agency and refer to an advertisement they had found in the newspaper, may not resemble the approach followed by most consumers when entering the housing market. Second, testers are sampled in a nonrandom manner based on a hiring process, which may lead to systematic differences between the population of white and minority testers. Finally, results based on testing data ignore the mitigating influence of minority attempts to avoid discrimination or mitigate the impact of experienced discriminatory behavior.
While these concerns are important when interpreting the results of a testing study, the design features that lead to these concerns are also important positive attributes of testing as a research tool. Studies of market outcomes often face considerable design challenges because unobserved individual characteristics may influence key determinants of treatment, such as income, education, and work history, and also influence treatment directly (endogeneity bias), and these unobservables may influence individuals' choices concerning whether and how to enter a specific market (selection bias). For example, Ondrich et al. (2001) find that the initial request of a potential home buyer has a large influence on the treatment experienced, but such a request is typically unobserved in market data. In a testing study, by contrast, many of the observable determinants of treatment are assigned and therefore uncorrelated with tester unobservables. In addition, the protocols eliminate any possibility of selection bias by exogenously sampling from a population and by establishing a testing protocol that is followed carefully by both testers.
Of course, actual characteristics of testers, such as education or work experience, may influence their behavior during a test and as a result affect their treatment. If so, these characteristics may bias the results of a testing study because of across-race differences in these characteristics or the nonrandom assignment of testers to particular tests. Naturally, the goal of the testing protocols and tester training is to minimize the variation in behavior across testers, which should in turn limit the influence of actual characteristics on testers' behavior and therefore on observed treatment. A well-designed paired-testing study may in fact dramatically limit the potential for omitted-variable bias by insulating observed outcomes from individual characteristics that are often difficult to observe or record and potentially correlated with race within the population. Heckman and Siegelman (1993) and Ondrich et al. (2000, 2001) test whether testers are heterogeneous over attributes that influence treatment in employment and housing tests, respectively. The evidence for employment tests is mixed, and the evidence for housing tests does not support the conclusion that testers are heterogeneous in a way that influences treatment.
Moreover, the interpretation of observed racial differences is much more straightforward with testing data than with market data. First, tests for discrimination based on market data completely incorporate the effects of any compensating behavior by the individuals being discriminated against even if such behavior imposes additional costs on the minority group. For example, in mortgage markets, a home buyer may avoid potential discrimination in underwriting by seeking out a higher-cost lender with lower standards. Alternatively, a home buyer may obtain a mortgage from a second lender after being discriminated against, but only after losing his or her first-choice home.
Second, observed racial differences in testing data represent adverse treatment against minorities. On the other hand, analyses of market data often combine the outcomes of individuals who engaged in economic transactions with different firms. Even in a model that controls for all relevant individual characteristics, observed racial differences may arise because on average, minorities engage in economic transactions with firms that have different policies, standards, or prices from those of firms that are typically engaged by whites. If these behavioral differences between firms are not justified by business necessity, the observed racial differences would be described as adverse impact discrimination. However, the behavioral differences may arise because the firms operate in different market segments and therefore represent legitimate business practices, in which case the observed racial differences in the market should not be classified as discrimination. Market analyses often cannot distinguish among these three explanations for racial differences in outcomes (see Ross and Yinger, 1999).
The paired structure of the tests also provides two significant advantages. First, the comparison is based on observationally equivalent individuals being treated differently by the same firm or individual, and the results of such comparisons carry considerable narrative power in both legal and policy arenas. Second, the structure of a paired test results in substantial statistical power for detecting discrimination. Specifically, the likelihood of similar treatment of two testers is very high because they have the same relevant characteristics and have been sent into very similar circumstances. The high probability of similar treatment decreases the likelihood that differences in treatment arise by chance and increases the ability to statistically isolate systematic adverse treatment of a given group.
Measuring Adverse Treatment
The results of a test are typically described using two measures of adverse treatment—gross and net. Gross adverse treatment is the portion or fraction of tests in which the white tester received more favorable treatment than the minority tester based on the reports of the two testers and a predetermined criterion for favorable treatment. Net adverse treatment is the fraction of tests in which whites were favored minus the fraction of tests in which minorities were favored. If the treatment can be described by a binary variable in which favorable treatment for one tester is recorded as a one and unfavorable treatment as a zero, the white tester is favored over the minority tester when the former records a one and the latter a zero. If the treatment is described by an ordinal or continuous variable, the white tester is favored if he or she records a higher value than the minority tester. For continuous variables, a threshold will usually be established, and the testers are assumed to have experienced equal treatment if the difference in white and minority treatment does not exceed the threshold.
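The two measures can be illustrated with a minimal sketch. The function name, the tuple layout (white outcome first, minority outcome second), and the sample data are hypothetical and are not taken from any HDS data file; higher values are assumed to mean more favorable treatment.

```python
from typing import Sequence, Tuple


def adverse_treatment(
    pairs: Sequence[Tuple[float, float]], threshold: float = 0.0
) -> Tuple[float, float]:
    """Return (gross, net) adverse treatment for (white, minority) outcome pairs.

    A tester is favored only when the difference in outcomes exceeds
    `threshold`; smaller differences are counted as equal treatment.
    With binary outcomes the default threshold of 0 is appropriate.
    """
    n = len(pairs)
    if n == 0:
        raise ValueError("at least one test is required")
    white_favored = sum(1 for w, m in pairs if w - m > threshold)
    minority_favored = sum(1 for w, m in pairs if m - w > threshold)
    gross = white_favored / n
    net = (white_favored - minority_favored) / n
    return gross, net


# Hypothetical binary outcomes: 1 = favorable treatment, 0 = unfavorable.
tests = [(1, 1), (1, 0), (0, 1), (1, 0), (0, 0), (1, 1), (1, 0), (1, 1)]
gross, net = adverse_treatment(tests)
print(f"gross = {gross:.3f}, net = {net:.3f}")  # gross = 0.375, net = 0.250
```

Here three of eight tests favor the white tester and one favors the minority tester, so the gross measure is 3/8 and the net measure is (3 - 1)/8.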
Both the gross and net measures of adverse treatment may provide misleading estimates of the actual extent of discrimination even within the sampling frame being examined by the set of tests. The gross measure is likely to include differences in treatment that arise simply because the testers' visits differed in some unobserved way, and it may therefore overstate discrimination. The net measure is intended to correct for this problem by subtracting instances in which the white tester experiences adverse treatment relative to the minority tester. The net measure is constructed under the assumption that adverse treatment against the white tester occurs only because the testers' visits differed, and so adverse treatment against the white tester provides an accurate measure of the number of instances of minority adverse treatment that arose because the testers' visits differed. In some cases, however, adverse treatment of the white tester may have been based on the tester's race. For example, in a housing test, the white tester may not be shown a unit in a minority neighborhood because he or she is white. In this case, the net measure will understate discrimination because the frequency of white adverse treatment overstates the frequency of minority adverse treatment that arose from differences between the two testers' visits. For alternative discussions of net and gross adverse treatment, see Fix et al. (1993) and Heckman and Siegelman (1993).
This problem may be avoided by the use of a three-person test, often called a “sandwich test.” In a sandwich test, two white testers and one minority tester are matched, assigned similar characteristics, and sent into the same market conditions. In this test, the potential exists for two individuals of the same race to receive differential treatment. These differences in treatment cannot be caused by race and must have arisen because of differences between the visits. Therefore, these differences can be used to construct a net measure that measures discrimination more accurately. Specifically, the frequency of adverse treatment of one white tester relative to the other, which can arise only because of differences in the two testers' visits, is subtracted from the frequency of adverse treatment of the minority tester relative to a white tester.
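A minimal sketch of the sandwich-test adjustment follows. It assumes, purely for illustration, that one white tester in each test is designated the comparison tester and that both same-race and cross-race differences are measured against that tester; the text does not specify how the comparison tester is chosen, and the function name and data are hypothetical.

```python
from typing import Sequence, Tuple


def sandwich_net(tests: Sequence[Tuple[int, int, int]]) -> float:
    """Adjusted net measure from (white_a, white_b, minority) outcomes.

    white_a is the (assumed) comparison tester. The share of tests in which
    white_b fares worse than white_a estimates differences that arise from
    the visits alone, and is subtracted from the share of tests in which
    the minority tester fares worse than white_a.
    """
    n = len(tests)
    if n == 0:
        raise ValueError("at least one test is required")
    minority_adverse = sum(1 for wa, wb, m in tests if wa > m)
    same_race_adverse = sum(1 for wa, wb, m in tests if wa > wb)
    return (minority_adverse - same_race_adverse) / n


# Hypothetical binary outcomes: 1 = favorable treatment, 0 = unfavorable.
sandwich = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 1, 0), (1, 1, 0)]
print(sandwich_net(sandwich))  # (3 - 1) / 5 = 0.4
```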
Alternatively, additional information concerning each test might be used to uncover the extent of discrimination experienced in a sample of tests. For example, in the housing market, information may be available concerning whether the white and minority testers saw the same agent during their visits or whether the advertised unit was in a neighborhood with a large percentage of minority residents. If the gross measure declines dramatically for the subsample in which the testers saw the same agent, that measure must seriously overstate discrimination. Alternatively, if the vast majority of white-favored tests occur when the advertised units are located in neighborhoods with large minority populations, the net measure must understate discrimination.
Ondrich et al. (2000) use this information and the structure provided by a parametric model to estimate upper and lower bounds on housing discrimination using the 1989 HDS. The frequency measures of adverse treatment discussed above can be thought of as simple nonparametric estimates of the probability of adverse treatment. The same probabilities can be predicted using the estimates from a parametric model of a test. A paired test can be modeled as two separate decisions by an economic agent, where the unobservables associated with those two decisions share a common component because of the paired nature of the test. One possible specification is a bivariate probit in which each equation models the treatment of one tester, and there is a correlation between the treatments received by the pair. Unobservable differences between the two testers' visits are likely to decrease the correlation between the treatments and increase the predicted probability of adverse treatment of the minority tester relative to the white tester (the gross measure). Ondrich et al. (2000) control for differences between the visits by increasing the correlation between the equations to eliminate differences between the visits and revise the gross measure downward.
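The role of the correlation can be seen in a deliberately simplified special case. This is not the Ondrich et al. (2000) model. The sketch assumes that the two latent treatment indices are standard bivariate normal with zero means, so that each tester has a 50 percent chance of favorable treatment and there is no race effect at all. Under that assumption, the probability that the white tester is treated favorably and the minority tester is not has the closed form 1/4 - arcsin(rho)/(2*pi), a standard result for the bivariate normal distribution.

```python
import math


def discordant_probability(rho: float) -> float:
    """P(white favorable, minority unfavorable) for zero-mean latent indices.

    Assumes a standard bivariate normal with correlation `rho` and a
    favorable outcome whenever a tester's latent index is positive.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    return 0.25 - math.asin(rho) / (2.0 * math.pi)


# The predicted gross measure falls as the correlation rises, even though
# this illustrative setup contains no discrimination.
for rho in (0.0, 0.5, 0.9):
    print(f"rho = {rho:.1f}: P(white favored) = {discordant_probability(rho):.4f}")
```

In this special case any positive gross measure reflects only imperfect correlation between the two visits, which is the component that a higher assumed correlation removes.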
In the abstract, the strategy of sending a pair of testers to attempt the same market transaction following a common protocol appears simple and fairly straightforward. However, many market transactions are quite complex, involving substantially more interactions than simply a negotiation of prices and quantities, and only limited information concerning the nature and form of these transactions may be available. A testing effort will be successful only if the design sends testers into the market in a systematic and realistic manner.
The first design step for a testing effort is to define a point of entry into the market. This point of entry becomes the basis for sampling the market. A test must be initiated by random or stratified sampling from a well-defined population. For example, in the case of a rental housing market, tests might be initiated on the basis of sampling the population of available rental units or the population of agents who represent rental properties. However, there is no reliable source for the population of available units or even the population of agents for rental properties. Even if the population of agents could be observed for a specific metropolitan area, it is unlikely that any information would be available on the volume of business handled by individual agents. An alternative approach used in the 1989 HDS was to sample from the population of housing advertisements appearing in the major metropolitan newspaper, which was easily observable and provided a reasonable mechanism for entering the market.
Once a test has been initiated, the testers must approach the economic agent who has been sampled or who represents the property, job, or good that has been sampled. A tester's approach should be consistent with both the sampling frame discussed above and approaches commonly witnessed by the economic agent being tested. In the case of the 1989 HDS, testers walked into a real estate agency and inquired about a unit in an advertisement that had been selected randomly from the newspaper. This behavior would be expected by real estate agents since advertisements are typically used to attract customers, and this protocol also explicitly tied the treatment experienced to the unit that had been sampled. In some markets, however, a realistic point of entry is more difficult to implement. For example, independent mortgage brokers would be a very difficult group to test because many mortgage brokers obtain the majority of their business through referrals from builders or real estate agents. These brokers would notice if they and their competitors simultaneously experienced a substantial increase in direct contacts either by phone or by walk-in.
1989 Housing Discrimination Study
The 1989 HDS was a major national study of discrimination against African Americans and Hispanics in both the rental and sales housing markets. The study sampled newspaper advertisements in 25 metropolitan areas to produce national estimates of housing discrimination. For each advertisement sampled, a pair of testers who were matched by age and gender were assigned an appropriate income level for the sampled housing. The testers were then sent to the agency that had placed the advertisement to inquire about the advertised unit and to request to see it and any other similar available housing.
The 1989 HDS was designed to measure the national incidence of discrimination arising during visits by qualified home seekers to a sample of units advertised for sale or rent in major metropolitan area newspapers across the United States. The sample of advertised units was drawn in two stages. First, a sample of metropolitan areas was drawn from major U.S. metropolitan areas with a central city population of 100,000 or more and a substantial proportion of African Americans and/or Hispanics based on the 1980 census (12 percent African American and/or 7 percent Hispanic). Additional tests were conducted in five of these sites to support more in-depth analysis. These sites were chosen with certainty based on their substantial minority population to increase the statistical precision of the national estimates. Each selected area became an African American-white and/or a Hispanic-white site for the 1989 HDS. Within each site, weekly samples of advertisements were drawn randomly from the Sunday newspaper.
A system of weights was generated to represent the inverse of the probability of selection for any given advertisement and to adjust for oversampling and nonresponse. These weights represented the joint probability of site selection and advertisement selection within a site, controlling for advertisement volume from week to week, saturation of the housing market within any week, and attrition within the sample of advertisements. Weighted racial differences in treatment provide an estimate of average adverse treatment in a national sample of advertisements.
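A weighted version of the gross and net measures can be sketched as follows. The record layout and the example weights are hypothetical, and the estimator shown (weighted counts divided by the sum of the weights) is one common choice rather than a documented description of the HDS estimation procedure.

```python
from typing import Sequence, Tuple


def weighted_adverse_treatment(
    records: Sequence[Tuple[float, int, int]]
) -> Tuple[float, float]:
    """Return weighted (gross, net) from (weight, white, minority) records.

    Each weight is assumed to be the inverse of the test's overall
    selection probability, already adjusted for nonresponse.
    """
    total = sum(w for w, _, _ in records)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    white_favored = sum(w for w, a, b in records if a > b)
    minority_favored = sum(w for w, a, b in records if b > a)
    return white_favored / total, (white_favored - minority_favored) / total


# Hypothetical tests: weights of 2 and 4 correspond to selection
# probabilities of 1/2 and 1/4.
records = [(2.0, 1, 0), (4.0, 1, 0), (4.0, 1, 1), (2.0, 0, 1), (4.0, 0, 0)]
gross, net = weighted_adverse_treatment(records)
print(f"gross = {gross:.3f}, net = {net:.3f}")  # gross = 0.375, net = 0.250
```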
The study provided estimates of adverse treatment for a variety of measures covering housing availability, sales effort, terms and conditions (rental only), and financing assistance (sales only). For the treatment variable “Was the advertised unit available to the tester?” the gross incidence of adverse treatment was 17.2, 15.5, 11.1, and 9.5 percentage points for African American-white rental, Hispanic-white rental, African American-white sales, and Hispanic-white sales tests, respectively. The corresponding net incidence of adverse treatment for these samples was 5.5, 8.4, 5.5, and 4.2 percentage points. (See Yinger, 1995, for an in-depth look at the results of the 1989 HDS.) The study also examined geographic differences in treatment for the five in-depth sites and provided estimates of racial steering by neighborhood racial composition, per capita income, and median house value (see Turner and Mikelsons, 1991, for these results).
The first systematic application of paired testing to hiring, conducted in 1989, focused on discrimination against Hispanic men applying for entry-level jobs in Chicago and San Diego. In each of these sites, approximately 150 paired tests were conducted, based on random samples of job openings advertised in the major metropolitan newspapers. A similar study of hiring discrimination against African American men was conducted a year later in Chicago and Washington, D.C. Again, about 200 paired tests were conducted in each metro area, based on random samples of advertised job openings. Both studies found that white applicants were able to advance further in the hiring process than their minority counterparts in a statistically significant share of cases. Specifically, in the Hispanic-white tests in which both testers were able to submit an application, whites received an interview and Hispanics did not 22 percent of the time, while in the African American-white tests, only whites received an interview 9 percent of the time. These numbers are based on the gross measure of adverse treatment. Net adverse treatment was 14 and 6 percent for Hispanic-white and African American-white tests, respectively. In addition, whites were significantly more likely to receive encouragement in the hiring process (Kenney and Wissoker, 1994).
A 1998 pilot study used paired testing to assess the extent and forms of possible discrimination in the home insurance market. Testers in three metropolitan areas posed as buyers of closely matched homes located in minority and white neighborhoods. They called insurance agents on the telephone to seek insurance quotes. The homes, neighborhoods, and insurance seekers were matched on a wide range of characteristics so that the primary difference within a paired test was whether the home was located in a minority or white neighborhood. Results indicated that buyers in white neighborhoods were no more likely than those in minority neighborhoods to receive quotes, but they were slightly more likely to be offered some desirable types of coverage (in one site) and to receive higher levels of service than minorities (in another site). In Phoenix, substantially higher premiums were quoted for homes in Hispanic neighborhoods, but because the white and Hispanic neighborhoods were in different insurance rating territories, the study could not determine definitively whether the difference in premiums might have been due to legitimate differences in rates of risk and loss (Wissoker et al., 1998).
The 1999 Homeownership Testing Project is a pilot study of discrimination in the pre-application phase of the mortgage market. This testing effort includes tests for African Americans and Hispanics in two major metropolitan areas. In each area, a stratified sample of lenders was selected by loan volume based on Home Mortgage Disclosure Act data. The testers were assigned income, assets, and debts sufficient to qualify to purchase a home priced at the median sales price in the area. The assignment was structured so that the qualifying price was constrained by the down payment, and income and debts were assigned so that the mortgage would conform to standard secondary market guidelines. The testers were also provided with an A– credit history profile. The results of this study are not yet available.
In 1999, the Urban Institute analyzed enforcement tests that had been conducted by the National Fair Housing Alliance (NFHA) in five sites. In two of the sites, statistically significant differences were found between the treatment of white and African American testers. In 16 percent of the tests in Chicago and 25 percent of the tests in Atlanta, the white applicant received a quote, defined as information about a loan product with an estimate of monthly mortgage payments and closing costs, and the African American applicant did not. The net measures of adverse treatment in Chicago and Atlanta were 13 and 25 percent, respectively. It should be noted that the lender sample for the NFHA tests was not random; rather, lenders were chosen using indicators based on the Home Mortgage Disclosure Act data (Smith and Delair, 1999).
2000 HOUSING DISCRIMINATION STUDY: PHASE I
Basic Structure of Study
Phase I of the 2000 HDS is designed to study discrimination in both rental and sales housing markets against African Americans, Hispanics, Asian Americans, and Native Americans. The study will provide estimates of the national incidence and severity of discrimination against African Americans and Hispanics in medium-sized and large metropolitan area housing markets. The study will also provide less precise metropolitan-level estimates of discrimination for all African American and Hispanic sites, as well as metropolitan-level estimates for the pilot Asian American and Native American sites. In the Asian American pilot study, separate estimates will be developed on the basis of different major ethnic sub-groups to assess the importance of ethnicity in the treatment of Asian Americans. Finally, given the concentration of the Native American population in small metropolitan and rural areas, the study will include pilot testing for Native Americans in two small metropolitan areas and the surrounding hinterland.
The 2000 HDS follows the basic methodology of the 1989 HDS. The point of entry to the market is an advertisement in a major metropolitan newspaper. The study is based on a sampling of advertisements in the relevant major metropolitan newspapers, followed by a test in which the testers approach the relevant agent or agency and identify their interest in the advertised unit and similar units. The tests are paired in the sense that two individuals, one white and one minority, pose as otherwise identical home seekers. Observed differences in treatment between racial groups are interpreted as the adverse treatment expected to be experienced by a qualified minority member inquiring about a randomly chosen housing unit advertised in the newspaper.
The use of a sample of newspaper advertisements offers several advantages. First, the classified advertisements provide a clearly defined list of housing units that are currently on the market and for which information is available to individuals in search of housing. Second, newspaper advertisements provide a credible starting point for each test. This common starting point increases the match between the two testers' visits relative to simply approaching a real estate agency and therefore increases the statistical power available from a given-sized sample of tests. Finally, the advertisement sampling approach matches the sampling methodology of the 1989 HDS, increasing comparability between the two studies. The weaknesses of the advertisement sampling frame are discussed later in this section.
The national samples of African American-white and Hispanic-white tests are two-stage samples. First, a sample of sites (16 African American-white and 10 Hispanic-white) is selected from the population of medium-sized to large metropolitan areas with substantial populations of the minority group being tested. A site is included in the sample if the central city population exceeds 100,000 and the percentage of the minority group in the site exceeds that in the U.S. population overall. Probabilities of selection from the population of sites are based on the metropolitan area population. Then, advertisements are drawn weekly from the major metropolitan newspaper in each site. The samples of Asian American and Native American tests are single-stage samples drawn weekly from the major newspapers of individual metropolitan areas (two Asian American sites with three ethnic groups and one Native American site). In all sites, sufficient tests are being conducted to provide metropolitan-level estimates of adverse treatment (72 tests per tenure).
The sampling of advertisements is a centralized process conducted at the Urban Institute in Washington, D.C. The real estate sections of the Sunday newspapers for all sites are shipped to the Urban Institute every Sunday. A site must be sampled within a couple of hours of receipt so the sample can be relayed back to the local fair housing group for testing in a timely fashion. For each site, the order of the advertisement sample is randomized, and the advertisements are forwarded to the local group one at a time (see the next subsection for a more detailed discussion).
One of two sampling methods is used to select advertisements for rental and sales tests: systematic sampling or grid sampling. Systematic sampling involves the “numbering” of advertisements in a newspaper and the subsequent selection of a systematic sample using an interval designed to yield the target number of selections. Systematic sampling is employed when the number of advertisements is relatively small (say, fewer than 1,000) and confined to a specific format in the classified section. All rental advertisement selections are made using this method. Grid sampling is essentially an area sampling technique whereby a randomly assigned sampling grid is overlaid on the newspaper to reveal the areas (rectangles) that represent the sample. (Application of one grid is tantamount to a 1 in 24 sampling fraction.) Each advertisement is defined by a single point on the newspaper using an objective rule (i.e., the upper corner of the first letter of the first word in the line of descriptive text). Accordingly, all advertisements have the same chance of selection regardless of their size. Grid sampling is used for very large newspaper classified sections that include one or more supplements and can contain up to 3,000 advertisements.
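Systematic sampling of numbered advertisements can be sketched as follows. The text does not state how the interval or the starting point is chosen, so the fractional interval with a random start used here is an assumption (it is one common variant), and the function name and parameters are hypothetical.

```python
import random
from typing import List


def systematic_sample(n_ads: int, target: int, rng: random.Random) -> List[int]:
    """Return 1-based advertisement numbers chosen by systematic sampling.

    The interval n_ads / target yields exactly `target` selections; a
    random start within the first interval gives every advertisement the
    same chance of selection.
    """
    if not 0 < target <= n_ads:
        raise ValueError("target must be between 1 and n_ads")
    interval = n_ads / target
    start = rng.random() * interval  # uniform on [0, interval)
    # min() guards against floating-point rounding at the upper end.
    return [min(int(start + k * interval) + 1, n_ads) for k in range(target)]


# Hypothetical rental section with 950 numbered advertisements.
rng = random.Random(2000)
selected = systematic_sample(950, 40, rng)
print(len(selected), selected[:5])
```

Because the interval is at least one advertisement wide, the selected numbers are distinct and increasing, with roughly 24 advertisements between consecutive selections in this example.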
Regardless of the selection method, once an advertisement has been selected, it is reviewed to determine eligibility. To be eligible, a housing unit must be within the metropolitan area boundaries, and must be a rental property in a complex represented by an agent or a single-family home or condominium for sale. For example, the rental tests exclude shared rentals, seasonal rentals, and properties rented by owners, while sales tests exclude seasonal or temporary housing, income-generating properties, and properties for sale by the owner. Finally, the advertisement itself may not clearly identify whether a housing unit is eligible, so the eligibility criteria are also applied by the local testing agency on the basis of information gathered on site. The sampling team at the Urban Institute draws substantially more advertisements than the number of tests planned in case some are determined to be ineligible by local testing agencies.
At the analysis stage, sample weights will be developed for each ethnic group at both the metropolitan and national levels for the African American-white and Hispanic-white tests. The national sampling weights will be based on the product of the site selection probability and the probability of selection of the advertisement, with each test weighted by the inverse of this overall selection probability. This weight will be adjusted for nonresponse to form a national analytic weight for use in national analyses (trends since 1989, as well as year 2000 estimates).
Separate metropolitan analytic weights are being developed for each site. These will be used in creating metropolitan report cards (i.e., developing metro-specific estimates). The metropolitan analytic weight is the product of a sampling weight and a nonresponse adjustment. The sampling weight reflects the probability of selection of the advertisement and incorporates selection within the classified section as well as across weeks. In addition, the sampling weight controls for market saturation within a week if it occurs. In other words, in some small markets or during a week when many advertisements are ineligible, the entire pool of advertisements sent to the local office at a site may be used. Finally, the weights will be adjusted for nonresponse.
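The construction of the analytic weights can be sketched as follows. The sketch assumes the conventional definition in which a sampling weight is the reciprocal of the overall selection probability. It also assumes a simple nonresponse adjustment, equal to the ratio of eligible selected advertisements to completed tests within a weighting class. The function names and the numbers in the example are hypothetical; the actual HDS adjustments may differ.

```python
def base_weight(p_site, p_ad):
    """Reciprocal of the overall selection probability (site stage x ad stage).

    For a metropolitan weight, set p_site = 1. Under market saturation,
    when every advertisement in the pool is used, p_ad = 1.
    """
    if not (0 < p_site <= 1 and 0 < p_ad <= 1):
        raise ValueError("probabilities must lie in (0, 1]")
    return 1.0 / (p_site * p_ad)


def nonresponse_adjustment(n_eligible_selected, n_completed):
    """Assumed adjustment: eligible selections per completed test."""
    if not 0 < n_completed <= n_eligible_selected:
        raise ValueError("need 0 < n_completed <= n_eligible_selected")
    return n_eligible_selected / n_completed


def analytic_weight(p_site, p_ad, n_eligible_selected, n_completed):
    """Sampling weight multiplied by the nonresponse adjustment."""
    return base_weight(p_site, p_ad) * nonresponse_adjustment(
        n_eligible_selected, n_completed
    )


# Hypothetical example: site selected with probability 0.5, advertisement
# selected with a 1 in 24 grid fraction, 80 of 100 eligible tests completed.
w = analytic_weight(0.5, 1 / 24, 100, 80)
```

In this hypothetical case the base weight is 48 and the adjustment is 1.25, so each completed test carries a weight of about 60.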
Using the sample weights, confidence intervals will be generated for the gross measures and hypothesis tests will be conducted for the net measures. The standard errors of estimates will be adjusted to account for the complex sampling design; see Kish (1965) and Wolter (1985). Given the small number of tests available in any given test site, statistical analysis will also be conducted for the metropolitan report cards using exact permutation tests (see Agresti, 1990, for a general discussion and Heckman and Siegelman, 1993, for the use of these tests in a testing context).
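A minimal, unweighted sketch of these measures and of one exact test is given below. It assumes that each test is coded as "white_favored", "minority_favored", or "equal", that the gross measure is the share of tests in which the white tester is favored, and that the net measure subtracts the share in which the minority tester is favored. The exact test shown is the sign test, which is the permutation test obtained by exchanging race labels within each pair; it conditions on the tests with unequal treatment. The outcome labels and function names are hypothetical, and the sketch ignores sample weights and the complex design.

```python
from math import comb


def gross_and_net(outcomes):
    """Return (gross, net) adverse treatment rates for a list of test outcomes."""
    if not outcomes:
        raise ValueError("no tests supplied")
    n = len(outcomes)
    white = sum(1 for o in outcomes if o == "white_favored")
    minority = sum(1 for o in outcomes if o == "minority_favored")
    return white / n, (white - minority) / n


def exact_sign_test(n_white_favored, n_minority_favored):
    """Two-sided exact p-value for the null of symmetric treatment.

    Under the null, the number of white-favored tests among the n tests
    with unequal treatment is Binomial(n, 1/2).
    """
    n = n_white_favored + n_minority_favored
    if n == 0:
        return 1.0
    dev = abs(n_white_favored - n / 2)
    # Count outcomes at least as far from n/2 as the observed count.
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= dev)
    return extreme / 2 ** n


# Hypothetical site with 20 tests: 9 favor the white tester, 1 favors the
# minority tester, and 10 show equal treatment.
tests = ["white_favored"] * 9 + ["minority_favored"] * 1 + ["equal"] * 10
gross, net = gross_and_net(tests)
p_value = exact_sign_test(9, 1)
```

Because the exact test relies only on the binomial distribution of the unequal-treatment tests, it remains valid with the small number of tests available at a single site.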
A test begins with the selection of an eligible advertisement at the Urban Institute and the submission of a test authorization form to the local test coordinator specifying the type of test to be conducted, the order in which the testers should contact the housing provider, and whether a narrative (a quality control measure) must be completed for this test. Selection proceeds in order of the randomized list of advertisements. An advance call by a nonminority individual to obtain information concerning availability (rental tests only), price, size, and location is made for all rental tests, and for sales tests if this information is not available in the advertisement. Tester income and financial characteristics (sales tests only) are assigned to match the price of the housing unit. Occupations and employers are assigned consistent with these characteristics, but specific occupations (e.g., law enforcement) and regional employers are excluded based on the belief that these occupations or employers might receive some special treatment. Marital status and family structure are assigned on the basis of the size of the unit and the desire to obtain a fairly equal distribution of family types.
The local agency assigns the selected advertisement to one minority and one white tester as soon as two testers of the same gender and comparable ages are available. The testers each call to set up an appointment and visit in alternating order. These calls should be 1 to 6 hours apart for rental testing and 24 to 48 hours apart for sales testing. The actual tester visits should also be 1 to 6 hours apart for rental testing and 24 to 96 hours apart for sales testing. Rental testers make one visit to a rental housing site to inquire about the availability of the advertised and similar units. A similar protocol is followed by sales testers, except that the tester is available for a follow-up visit to see additional units, and provision has been made to record follow-up phone calls by the real estate agent. Testers are required to take notes during their visit and to document its results on standardized forms within 1 hour of completing the visit. The local test coordinator debriefs all testers, and also collects and reviews all test file materials. Test narratives are required on a small number of randomly chosen tests to provide information for a quality control review of test files. Testers are not told before performing the test that a narrative will be required.
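The timing rules in this protocol lend themselves to a simple compliance check, sketched below. The windows are the ones stated above; treating the endpoints as inclusive and the names used are assumptions made for illustration.

```python
from datetime import datetime

# Allowed gap between the two testers' contacts, in hours, as stated in the
# protocol. Inclusive endpoints are an assumption.
WINDOWS = {
    ("rental", "call"): (1, 6),
    ("rental", "visit"): (1, 6),
    ("sales", "call"): (24, 48),
    ("sales", "visit"): (24, 96),
}


def gap_within_protocol(test_type, contact, first, second):
    """Return True if the gap between two contacts falls inside the window."""
    low, high = WINDOWS[(test_type, contact)]
    hours = abs((second - first).total_seconds()) / 3600
    return low <= hours <= high


# Example: two rental visits three hours apart satisfy the protocol.
ok = gap_within_protocol(
    "rental", "visit", datetime(2000, 5, 1, 9, 0), datetime(2000, 5, 1, 12, 0)
)
```

A test coordinator could apply such a check to the recorded contact times when reviewing test files.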
Limitations of and Alternatives to Random Sampling of Advertisements
While the use of a sample of advertisements offers many advantages, there are a number of disadvantages associated with this sampling strategy. First, the units advertised in the newspaper may not accurately represent the population of available housing units. Units may be advertised because they are especially attractive or in desirable neighborhoods and will attract clients to the agency. Alternatively, some units may not be advertised to more closely control the population of home seekers who have access to a unit or the neighborhood in which it is located. Moreover, in the case of sales tests, most home buyers do not learn from the newspaper about the home they actually purchase. Finally, the importance of newspapers in marketing housing may be declining as the Internet is increasingly used to market a wide variety of products.
Within Phase I of HDS 2000, a two-pronged strategy is being used in a small number of pilot sites to examine the limitations of the newspaper sampling frame. First, newspapers list housing advertisements by community and sometimes by smaller geographic regions for large central cities. The distribution of advertisements by community will be examined and compared with estimates of the distribution of rental and owner-occupied housing across communities in each of the pilot sites. This comparison will make it possible to identify communities in which housing units are underrepresented in newspaper advertisements and to draw additional samples of advertisements from these communities.
Second, after the completion of a test, the actual address of the advertised unit is available. The Urban Institute will perform a geographic analysis of these addresses in an attempt to identify regions of the metropolitan area that do not appear in the sample of advertisements, and will acquire socioeconomic characteristics for regions and/or neighborhoods from private vendors, such as Claritas. Once these regions have been identified, six to ten neighborhoods (three to five per tenure) will be selected, and a variety of local or neighborhood-level sources will be used to identify housing units for testing. These tests will be used for comparison with traditional advertisement-based tests, but cannot be combined with the sample of the latter because of the nonrandom nature of the selection process.
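The first prong of this strategy, the comparison of advertisement counts with the housing stock by community, can be sketched as follows. The sketch computes, for each community, the ratio of its share of advertisements to its share of housing units and flags communities whose ratio falls below a cutoff. The cutoff of 0.5, the community labels, and the counts are hypothetical and are not drawn from HDS data.

```python
def representation_ratios(ad_counts, housing_counts):
    """Ratio of each community's share of ads to its share of housing units."""
    total_ads = sum(ad_counts.values())
    total_units = sum(housing_counts.values())
    if total_ads == 0 or total_units == 0:
        raise ValueError("need at least one advertisement and one housing unit")
    ratios = {}
    for community, units in housing_counts.items():
        if units == 0:
            continue  # no housing stock, so no share to compare against
        ad_share = ad_counts.get(community, 0) / total_ads
        unit_share = units / total_units
        ratios[community] = ad_share / unit_share
    return ratios


def underrepresented(ratios, threshold=0.5):
    """Communities whose ratio is below the (hypothetical) threshold."""
    return sorted(c for c, r in ratios.items() if r < threshold)


# Hypothetical pilot site with three communities.
ads = {"A": 60, "B": 35, "C": 5}
units = {"A": 50, "B": 30, "C": 20}
flagged = underrepresented(representation_ratios(ads, units))
```

In this hypothetical site, community C holds 20 percent of the housing units but only 5 percent of the advertisements, so it would be a candidate for supplemental sampling.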
There are many other types of marketing that might be considered for Phases II and III of HDS 2000. First, the population of advertisements might be expanded to include Internet and other easily observable metropolitan-wide sources of advertisements. This expansion would likely increase the base of marketed units covered without sacrificing comparability to Phase I because the sample would still be drawn from a metropolitan-wide sample of advertisements. The addition of local sources of advertisements (below the metropolitan level) as discussed above would expand the base further, but at the expense of comparability. Other, more extreme modifications to the protocols might involve a sampling of agencies and agents rather than advertisements. As discussed earlier, it can be quite difficult to compile a complete, nonduplicative list of rental or sales real estate agents, and nearly impossible to obtain any measure of volume for these agents. Finally, attempts might be made to sample available units. One possibility is the random sampling of streets and the second-stage selection of street-level advertisements from the selected streets. This approach might provide a fairly representative sample of units for sales tests (with the exception of condominiums), but is unlikely to provide a representative sample for rental tests since the use of street-level advertisements for rental properties is far less uniform.
Imperfect Pairs and Differences Across Visits
As discussed earlier, the paired-testing approach is unlikely to yield a perfect match within a test. First, the testers approach the selected agent at different times, and as a result the circumstances and treatment they encounter may differ. The possibility of such differences implies that the frequency of adverse treatment of minority testers, the gross measure, may capture differences that do not represent discrimination. In addition, testers are paired only on gender and age, and therefore may differ on many characteristics that might influence behavior during a test. This second problem may exacerbate the error in gross adverse treatment as a measure of discrimination while creating the potential for more severe biases in the analysis. Specifically, the populations of white and minority testers may differ systematically on characteristics that influence treatment. If so, the net and gross measures capture a combination of discrimination and the effect of racial differences in unobserved tester characteristics.
The 2000 HDS is attempting to address these issues. To the author's knowledge, Phase I of this study is the first paired-testing effort that records actual tester characteristics and makes those characteristics available for analysis. The characteristics collected include employment status and history, education level, individual and household income, household structure, and experience as a home seeker. Earlier research by Heckman and Siegelman (1993) and Ondrich et al. (2000, 2001) found only limited evidence that tester characteristics affect treatment. The data analyzed in these studies, however, contain no information about testers beyond an identification number, and these analyses were based on examining the experiences of pairs of testers who conducted multiple tests together. In HDS 2000, the analysis will exploit the information on actual tester characteristics, as well as test characteristics such as the attributes of the advertised unit and observed circumstances during a tester's visit, to determine whether these factors influence treatment and whether such influences affect observed net and gross adverse treatment.
Finally, Phase II of the 2000 HDS will include three-person or triplet tests to examine the influence of random differences between visits and testers on observed adverse treatment. These tests will take two forms: minority-white-white and white-minority-minority. The form will be randomized over tests. This approach will minimize noise by limiting the time between same-race visits while also ensuring that the first two visits of each triplet will yield a standard paired test.
REFERENCES
Agresti, A. 1990 Categorical Data Analysis. New York: John Wiley and Sons.
Fix, M., G.C. Galster, and R.J. Struyk 1993 An overview of auditing for discrimination. In Clear and Convincing Evidence: Testing for Discrimination in America, M. Fix and R. Struyk, eds. Washington, D.C.: Urban Institute Press.
Heckman, J.J., and P. Siegelman 1993 The Urban Institute studies: Their methods and findings. In Clear and Convincing Evidence: Testing for Discrimination in America, M. Fix and R. Struyk, eds. Washington, D.C.: Urban Institute Press.
Kenney, G.M., and D.A. Wissoker 1994 An analysis of the correlates of discrimination facing young Hispanic job-seekers. American Economic Review 84(June): 674-683.
Kish, L. 1965 Survey Sampling. New York: John Wiley and Sons.
Ondrich, J., S.L. Ross, and J. Yinger 2001 Now You See It, Now You Don't: Why Some Homes Are Hidden From Black Buyers. Unpublished manuscript.
Ondrich, J., S.L. Ross, and J. Yinger 2000 How common is housing discrimination? Improving on traditional measures. Journal of Urban Economics 47: 470-500.
Ross, S.L., and J. Yinger 1999 Does discrimination exist? The Boston Fed study and its critics. In Mortgage Lending Discrimination: A Review of Existing Evidence, M. Turner and F. Skidmore, eds. Urban Institute Monograph Series on Race and Discrimination. Washington, D.C.: Urban Institute Press.
Smith, R., and M. DeLair 1999 New evidence from lender testing: Discrimination at the pre-application stage. In Mortgage Lending Discrimination: A Review of Existing Evidence, M. Turner and F. Skidmore, eds. Urban Institute Monograph Series on Race and Discrimination. Washington, D.C.: Urban Institute Press.
Turner, M.A., and M. Mikelsons 1991 Patterns of racial steering in four metropolitan areas. Journal of Housing Economics 2: 199-234.
Wissoker, D.A., W. Zimmermann, and G. Galster 1998 Testing for Discrimination in Home Insurance. Washington, D.C.: Urban Institute Press.
Wolter, K.M. 1985 Introduction to Variance Estimation. New York: Springer-Verlag, Inc.
Yinger, J. 1995 Closed Doors, Opportunities Lost: The Continuing Costs of Housing Discrimination. New York: Russell Sage Foundation.