One session of the workshop was devoted to the topic of advantages and disadvantages of standardizing social science indicators. Disability indices, high school completion rates, and the construction of race and ethnicity categories were the primary examples discussed. A consistent theme was the paramount need for theory as well as a public policy purpose for motivating standardization of measurement for a particular construct.
THE STANDARDIZATION OF INDICATORS USED IN POLICY
Geoff Mulgan (The Young Foundation) described his perspective on the use of standardized indicators in policy making and decision making. He has experience working for several political leaders committed to using such indicators, including former prime minister Tony Blair, Australian prime minister Kevin Rudd, and the prime minister of Greece, George Papandreou, who is currently addressing a set of issues around harmonization and standardization related to national debt. He noted that, in the United Kingdom, the National Institute for Clinical Excellence is an independent, formal government body set up to determine the cost-effectiveness of different health treatments, from pharmaceuticals to smoking cessation. He is currently encouraging governments in other countries to develop similar institutes in other fields, such as education and criminal justice.
In addressing the political context around standardization, Mulgan stated that governments in the 17th and 18th centuries tended to standardize and measure for central control (for example, tax collection), but that
approach has evolved into one of viewing standards as a tool for accountability and democracy. In addition, he observed, some of the long-run trends involve measuring not things but less tangible concepts and intangibles, as well as moving from single measures to indices and from activities to outputs and outcomes. There is also a movement from objective facts to subjective measures of experience: for example, fear of crime as well as crime volume, patient satisfaction, and other relational measures of trust and feedback alongside classic health outcomes. He observed that a broader shift to complement output and outcome measures with relationship measures is moving quickly around the world, although with less speed in the United States. In addition, he told the audience, measurement has moved from being primarily an issue for policy makers and the state to becoming a source enabling the public and the media to assess the progress made by government. The latter includes measuring performance at the local level, with indicators set at the level of very small neighborhoods as well as the town or city.
Mulgan asserted that these new uses of indicators regarding place raise two major issues related to experiential relational data and the balancing of present performance and future prospects. Specifically, what is the appropriate benchmark? And how can these measures of assessment of current performance be combined with some dynamic indicators to determine the future success of that area, for example, in terms of individual and business resilience?
Weaknesses and Risks in Standardization
Mulgan listed several classic weaknesses inherent in more widespread use of metrics in policy:
Excess simplicity: There is a risk of using excessively simple responses to complex problems, such as unemployment rates, that can distort reality or encourage excessive focus (e.g., targeting measures of household burglary may divert resources from other equally important crimes). In contrast, discussions under way in the United Kingdom on reducing cancer mortality focus on increasing the quality of clinical services, as well as addressing the environment, stress, and a host of other presumably causal factors.
Distortions to behavior: There are many ways in which bureaucracies and professions respond to standardized targets, particularly when monetary or other incentives are involved. Examples include suppressing performance for fear that improvements will be used as baselines for impossible targets or bringing in extra resources during periods of intense scrutiny. In general, measures that are more about outcomes than outputs are less vulnerable to distortion.
Diminishing utility: For example, as soon as any measure of money supply becomes an official policy target for a government, it immediately becomes less useful because of market behavior anticipating movements in the indicator. Another example is the use of standardized tests and international benchmarking. Not only have these been powerful tools to drive up standards in mathematics and science literacy, but they also have diverted attention away from equally important but less measurable aspects of learning, such as noncognitive skills, social skills, resilience, motivation, and other key predictors of lifetime earnings, social mobility, and life success.
Obsolescence: Some standardized measures reflect society or the economy at a particular point and become less useful over time. The utility for policy makers of evolving indicators may outweigh the utility of consistency.
Limited relevance: While standard measurements may reflect the views of officials and professionals, they may be very different from those used by the public. For example, quality in health care services may be measured by official statistics in terms of waiting times or mortality, but the public may describe such factors as service style as most important.
Categories of Standardized Measurement: Underlying Causes and Relationship
Mulgan commented that, in most areas of public policy, there is little agreement about the fundamentals of causation and theory. Grade retention in school in the United States, for example, can be explained by economists as an issue of economic incentives of the labor market. Sociologists will insist that peer pressure is a key factor. Educators will claim that a student’s performance at age 11 affects performance at age 14, and psychologists may focus on personality structure. Consequently, he said, policy makers may not agree on which causal model is correct, and there is no single approach to resolving disagreements.
In addition, he continued, there are also fields in which new indicators are needed, for example, the use of the Internet for public services. Related to this topic, Mulgan reported on a review that he recently conducted on the state of knowledge about behavior change and its relevance to health policy. He found an uneven evidence base on the efficacy of either financial incentives or “nudge-type” methods of environmental shaping of behavior.
Disaggregation and Aggregation
Mulgan acknowledged the difficulty in using any kind of aggregate indicators or aggregate population measures; at the same time, the key to measuring behavior change in any field rests in large part on knowing how to disaggregate or segment the population. For example, a practitioner may consider interventions to reduce recidivism among prisoners or to reduce obesity by assuming that particular interventions will be highly effective for perhaps 10 or 20 percent of the population, if selection of participants is made by cognitive style, culture, etc. However, the intervention will probably be ineffective if an entire population group is selected without segmentation. At the same time, the segmentation tools used in health services, which are based on commercial marketing, are unproven and often dismissed, he observed. According to Mulgan, there is a greater need for targeting and segmentation, yet national statistical offices, academies of science, and other similar organizations seem to want to discourage development of robust segmentation tools.
Measurements of Well-Being and Psychological Need
Mulgan identified as a major research concern in the United Kingdom the failure of many of the current measures of poverty to capture actual need. He explained that the earlier focus on material needs (e.g., money, housing, nutrition) does not cover such factors as psychological well-being, the strength of social relationships, and the like. More specifically, a person who is isolated yet reasonably materially well off may be more in need than a person who is materially poor but has very strong family support. He reported that the Young Foundation has been investigating, both through statistical analyses and case studies, ways to understand the dynamics of need in a contemporary society, giving equal weight to material, psychological, and psychosocial measures.
While psychological measures are not as well-developed as material ones, Mulgan noted, these are needed to measure well-being, life satisfaction, and other factors, such as social connectedness. He emphasized the strong impact of cultural norms in terms of how people present their levels of well-being.
Valuing Social Impact
According to Mulgan, it is important to measure social value by creating standardized metrics or tools to compare investments in programs. While the question of measuring social value has been alive in the world of policy since before the mid-1960s, he noted that there have been several
waves of effort to define usable indicators. However, none has succeeded in defining anything remotely as widely accepted as GDP.1 Mulgan offered several reasons why these methods have not been used to guide decision making, from the very nature of social science, which involves many variables, to the difficulties of allocating value, to issues with competing values. For example, economic analysis of the social benefits of not sending someone to prison conflicts with the public’s view that punishment has intrinsic virtue. In this case, a conceptual clash cannot be resolved by analysis. Mulgan also believes that time horizons, used in standard commercial discount rates, are often very inappropriate for valuing social and environmental goods.
Mulgan reported that the Young Foundation has been commissioned by the British Health Service to develop a set of tools for measuring social value and the value of health service innovations, as part of a broader effort to try to guide public services to think about the long-term productivity of specific interventions. This method attempts to gather together in a reasonably consistent framework, not a single metric, but elements that are incommensurable. The process involves a consistent way of weighting everything from quality-adjusted life years and patient satisfaction, to the cost-effectiveness of different treatments, to the benefits for other public-sector bodies, like municipalities, as well as the assessment of practical implementation tools. Standardization tools are needed to compare investments in different types of activity, he observed. They are also critical to apply in the United States and the United Kingdom, where, in the next four or five years, the dominant public policy issue will be related to dramatic cuts in public spending—up to 10 or 20 percent in the United Kingdom, he said. This type of priority is forcing more attention to productivity in public services and in the private sector. He reiterated that the use of cost-based measures in GDP for public services is “completely ridiculous” and actually discredits the GDP measures themselves as well as the public-sector measures.
Mulgan summarized his major points as follows:
There are definite benefits to standardization of some metrics applied to public policy today.
In the context of democratic politics, there is a drive to humanize data to make measures better fit human experience, including
addressing issues like relationships. The latter may not be very important to policy makers, scientists, or academics, but in fact is becoming very significant in the day-to-day practice of public services.
Indicators are essentially feedback systems to guide decision making in public policy, but there is a risk to linking indicators too closely to policy decisions. Social science needs consistent and comparable time-series data, whereas the needs of government are more variable.
A judgment about indicators needs to address both their construction and their use. Are they used to constrain fluid actions and decision making by governments and to assist competitive actions? Are they assisting effective judgment on conditions of considerable uncertainty and fuzzy data?
Both data and the institutions to use them are needed. Having authoritative public bodies make judgments using standardized metrics in transparent ways is as important as having the metrics themselves, and just as important as the recognition that all of these have, in Mulgan’s words, “limited half lives.”
Mulgan ended by sharing his belief that even the best indicator will be useful for a time but will then need to be updated or replaced, because that is simply the nature of social knowledge.
In his presentation, Robert Pollak raised concerns about the premature application of standards and about the notion that standardization makes for successful science, rather than that successful science generates standardization. He gave a number of examples to illustrate his point. Family structure, in this case marital status, provides an excellent opportunity to explore the use of standardization, he said, posing several questions. Does marital status mean that one is legally married? Are cohabitants included? Are couples who are legally married but not living together included? Does the definition of marital status used as an independent variable affect the outcome when researchers try to predict educational outcomes, for example, whether a child will finish high school?
Taking this examination of factors and standardization further, Pollak raised the question of what it means to complete high school. In other words, should people with general educational development (or GED) credentials be treated as high school graduates? Citing work by James Heckman on labor market effects demonstrating clearly that GED is not equivalent to high school graduation, Pollak concluded that the question
of what is being examined may determine what measurement standards are used.
Turning to disability measures, Pollak considered as a real barrier to progress the lack of consistency in the concept and definition of disability and in the analysis of trends in prevalence. Some clarification on what is meant by the term “disability” is needed, he said, since different definitions suggest different kinds of solutions and indicate different targets for interventions and actions. Pollak further stated that it is unclear in this case what standardization will achieve.
He observed that public perceptions would certainly be affected by the standardization of disability and that policy may even be affected. However, questions remain: What should be the basis for standardization? What should be the underlying assumptions? Should the definition of disability be more or less inclusive? In Pollak’s view, whether standardized measurement would lead to better policy can be discussed only in terms of a particular standardization of measurement and a particular view of what constitutes better policy.
In economics, theory has implications for measurement, and economists regard measurement without theory with skepticism (Koopmans, 1947). Pollak used the example of the consumer price index (CPI) and the cost-of-living index to examine the use of standardized measurements for disability. The crucial aspect of having a theory is that it provides a way of dealing with many of the hard problems that arose in constructing the CPI. The underlying theory provides a framework to refer to when questions arise that challenge the components of the index. Pollak posed the question: What counts as an argument if there is no theory to appeal to? Without a theory, he asserted, any treatment of a difficult problem is as good as any other. He further stated that another main advantage of theory is that it depoliticizes some of the serious choices that do have impacts on the behavior of the index.
Turning to the issue of how disability is perceived, Pollak divided the literature into three sections: (1) disability among children, (2) disability among working-age adults, and (3) disability among the elderly. Using the example of activities of daily living (ADLs), such as transferring, dressing, bathing, toileting, eating, and walking across a room, Pollak proposed that if individuals or their proxies were asked which activities pose difficulties, an index could be derived by adding up the number of positive responses. However, the questions of how the items in this index were chosen and how their weight was determined are significant. For example, does the standard list of ADLs give too little weight to cognitive impairment relative to mobility impairment? How is it determined if a new item needs to be added to the ADL list and, if it is, what weight is it assigned? Pollak argued that
there is no possible response to this kind of question without an underlying model and theory.
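Pollak’s point about item choice and weighting can be made concrete with a small sketch. The item names and alternative weights below are hypothetical, chosen only to show that two defensible-looking weighting schemes yield different index values for the same respondent; nothing here reflects an actual ADL instrument.

```python
# Hypothetical sketch: an ADL-based disability index as a (possibly weighted)
# count of reported difficulties. Items and weights are illustrative only;
# as Pollak argues, choosing them requires an underlying theory.

ADL_ITEMS = ["transferring", "dressing", "bathing", "toileting", "eating", "walking"]

def adl_index(difficulties, weights=None):
    """Sum the weights of the ADL items a respondent reports trouble with.

    difficulties: set of ADL item names reported as difficult.
    weights: optional dict mapping item -> weight; equal weights if omitted.
    """
    if weights is None:
        weights = {item: 1.0 for item in ADL_ITEMS}
    return sum(weights[item] for item in ADL_ITEMS if item in difficulties)

# Equal weighting simply counts difficulties:
print(adl_index({"bathing", "walking"}))  # → 2.0

# A different (hypothetical) weighting that privileges mobility changes
# the index for the very same respondent:
mobility_heavy = {"transferring": 2.0, "dressing": 0.5, "bathing": 1.0,
                  "toileting": 1.0, "eating": 1.0, "walking": 2.0}
print(adl_index({"bathing", "walking"}, mobility_heavy))  # → 3.0
```

The two schemes can also rank respondents differently, which is exactly why, without a theory to justify one weighting over another, the resulting index is arbitrary.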
Continuing in the context of disability, Pollak delineated three possible models or theoretical constructs. One model, which he attributed to Dennis Fryback, is an appeal to utility that attempts to identify what people actually value. Another method entails using the theory of disability to predict the probability of nursing home entry within the next year, on which an index of disability could be based. His final example was an index that predicted the medical costs associated with an individual over his or her lifetime. All of these approaches are different and imply different weights, items, and methods of calculating disability. Pollak reiterated that, without an accepted theoretical framework, there is no touchstone for resolving any of the practical problems that arise in index construction.
Pollak emphasized that his focus is on nontrivial standardization, for which the measurement choices are really about choosing what is important, commenting that this is essentially a scientific question. With nontrivial standardization, the choices between measurement protocols convey different information. In his view, measurement without theory often means measurement using implicit theory. Implicit theory is better when made explicit, so it can be openly debated. He ended by saying that science is better done in the open.
Nancy Cartwright expanded on Pollak’s presentation by delineating three separate avenues by which theory contributes to measurement. The first she described as “coming up with the representation” using “heavy theory,” with a lot of assumptions in the theory that are very well worked out, along with a measure that gives an upper and a lower bound in constructing the particular index. A second way is using theory, or at least empirical regularities that connect the intended quantity with the actual procedures employed in carrying out measurement. This is done to ensure that those procedures are measuring the intended concept, especially when a quantity is measured indirectly via the components of an index, and even more especially when the components of this index are aggregated into a single number. The third way is distinguishing among different concepts going under the same name across a variety of theories. Proper precise scientific definition and explicit procedures are required when the emphasis is on making predictions about future behavior or forecasting the effect of policies. Different studies serving different purposes prescribe different definitions and procedures, yet they often use the same word. It is important to keep clear which of these more exact concepts is causally connected with the outcomes of interest.
Mulgan proposed that, in addition to a different theoretical foundation of a construct, it is important to consider the existence of different philosophical lenses. For example, he delineated three perspectives related
to disability: a disability rights’ perspective, a public value view, and a fiscal or bureaucratic viewpoint.
Robert Willis concurred with Pollak that theory is in some sense a stabilizing influence on the nature of the measure. However, he thought the issue of invariance, an old philosophical and scientific issue, has unclear implications in an economic and social context.
Robert Hauser contested the analysis provided by Pollak and Mulgan that assumes the necessity of choosing a criterion with respect to an array of measures like ADLs. He referred to the Multiple Indicators, Multiple Causes (MIMIC) model presented by George Bohrnstedt, observing that if the data are benign, a criterion may not have to be chosen.
Bohrnstedt pointed out that such outcomes as nursing home, medical, and home care costs have been the focus of the discussion. There are also costs associated with disability status with respect to income or reduced income, and these factors may all comport with the same metric, which helps in weighting on the indicator side what is making a difference.
Pollak reiterated that although there are different ways to build a framework for constructing an index, the main point is to choose one and to factor in the possibility of biases.
HIGH SCHOOL COMPLETION RATES
In his presentation, John Robert Warren (University of Minnesota) considered indicators related to the measurement of high school completion. Because completing high school matters for economic, social, political, personal, and academic reasons, he argued that it is imperative to develop accurate and meaningful measures of the rate at which people complete or drop out of high school. While many people assume that it should be easy to quantify high school dropout or completion rates, Warren described the confusion associated with the actual estimates. Not only are there data discrepancies between surveys, but there are also inconsistencies between the data on high school completion and the data on dropouts. He outlined three reasons why the widely used measures of high school completion and dropouts differ so much from one another: (1) different objectives and purposes, (2) technical differences in measures, and (3) differences in the accuracy of the data.
In Warren’s view, the biggest step that could be taken toward clarifying understanding of high school dropout or completion rates in the United States is to be consistently clear and forthcoming about why they are measured in the first place. An important reason why estimates for dropout and
completion rates differ so much from one another is that they differ with respect to what they are trying to accomplish.
Economists or business leaders may be interested in characterizing the level of human capital in a population or in a region. For this purpose, the timing of high school completion (how long ago or at what age people completed high school) is not important. According to Warren, dropout status or completion rates computed from cross-sectional sample surveys are best suited to describing levels of human capital in a population. Because the goal is to describe the share of all individuals who have obtained a credential, it is important to use data that include people who may have gotten those credentials from any number of places: public schools, private schools, GED programs, community colleges, adult education programs, prisons, or the Internet. Administrative data alone are not sufficient for measuring the percentage of people in the population who fall into a particular status group.
Education policy makers may instead focus on quantifying school performance in evaluating schools (within a school district or against national standards) with respect to their “holding power.” How well do schools move young people from the first day of high school through to successful high school completion?
Both the timing of high school completion and the manner in which students complete high school are necessary factors to consider. Schools may be deemed successful at moving young people through to completion of high school only if they grant regular high school diplomas within four years.
Researchers may be more interested in characterizing students’ experiences in navigating through educational institutions, or in predicting the likelihood of dropping out, or in modeling the consequences of dropping out. These measures are designed to describe characteristics of students or groups of students rather than a school’s attributes.
Technical Differences in Measures
Another reason that high school dropout and completion rates differ involves technical differences in how they are constructed. This is true even when comparing measures that are intended for the same purpose. All high school completion and dropout rates are based on a ratio with a numerator and a denominator: the numerator is the number of high school completers or dropouts, and the denominator is the number of people at risk of completing or dropping out. But even when measuring the same concept, there are frequently differences with respect to who has been counted as a completer or a dropout in the numerator and who is at risk of being in one of those statuses in the denominator.
While it is easier to quantify success or failure in the numerator, Warren identified a number of scenarios that complicate measuring the denominator. For example, how should the denominator of a measure account for migration into or out of a particular geographic area? How should students who are expelled or otherwise pushed out of high school be counted in the denominator? When students transfer from one school to another, should they be counted in the first school’s denominator, the second school’s denominator, neither, or both?
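One way to see how these denominator decisions bite is to write the rate down. The sketch below follows the general shape of a migration-adjusted four-year cohort rate; the adjustment rules, field names, and numbers are illustrative assumptions, not the official regulatory definition.

```python
# Illustrative sketch of a migration-adjusted four-year cohort graduation
# rate. The adjustments here (add transfers in, subtract verified transfers
# out and deceased students) are assumptions for exposition only.

def adjusted_cohort_rate(entering_cohort, transfers_in, transfers_out,
                         deceased, on_time_graduates):
    """Regular-diploma graduates within four years, divided by the
    migration-adjusted cohort."""
    denominator = entering_cohort + transfers_in - transfers_out - deceased
    return on_time_graduates / denominator

# Hypothetical district: 1,000 first-time 9th graders, 50 transfer in,
# 80 verifiably transfer out, 2 die, 700 earn regular diplomas in 4 years.
rate = adjusted_cohort_rate(1000, 50, 80, 2, 700)
print(round(rate, 3))  # → 0.723

# If unverified leavers are (improperly) counted as transfers out, the
# denominator shrinks and the rate looks better than it is:
inflated = adjusted_cohort_rate(1000, 50, 180, 2, 700)
print(round(inflated, 3))  # → 0.806
```

The second call illustrates Warren’s later point: a district that reclassifies dropouts or expelled students as transfers removes them from the denominator and mechanically improves its reported rate.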
In his overview of status rates, Warren explained that these rates measure the fraction of the population that falls into a particular subcategory at a given point in time. For the purpose of describing amounts of human capital in a population or a geographic area, he addressed how status completion or dropout rates are imperfect. For example, status rates treat all high school credentials as essentially equivalent; however, as Pollak noted in the previous presentation, this is not necessarily the best approach, because economists have long questioned the relative labor market value of GEDs, and little is known about alternative credentials.
To measure a school’s holding power, dropout and completion rates need to directly and accurately reflect a specific location. This involves the use of cohort rates, which measure the fraction of individuals who transition into a particular status among those who share a common status at the outset. Cohort rates are based on longitudinal administrative data that school districts and states keep about students. School districts are increasingly using longitudinal tracking systems to follow students over time; however, there are still problems with the way states and districts define numerators and denominators in ways that lower their dropout rates. Warren argued that the most effective data would represent each graduating or incoming student cohort and be made available annually.
He discussed how few trend analyses have been completed, because measures change over time and cross-state or cross-district comparisons have been difficult to carry out. In this regard, the movement toward using standards that were initially proposed in 2008 by the National Governors Association and the U.S. Department of Education is a step forward. These standards include restricting the numerator to regular diploma recipients who obtain diplomas within four years and the denominator to people who are at risk of getting those diplomas and appropriately accounting for things like migration. If states consistently implement the standards laid out by the Department of Education, eventually cohort rates can be compared over time and across states.
Until consistently defined cohort rates that are comparable over time and space become regular practice, Warren observed, it is best to use aggregate cohort rates based on the Common Core of Data or similar data for research purposes. It is also important to account for the weaknesses and limitations of these sorts of measures and acknowledge the bias in research results. Individual-level data based on longitudinal sample surveys, like the National Education Longitudinal Study and the other longitudinal surveys administered by the National Center for Education Statistics, are best suited for describing students’ progress through the secondary school system. However, these types of surveys are limiting because they are very expensive, are not conducted regularly, and suffer from problems of coverage bias and sample attrition.
Accuracy of Data
The third reason Warren outlined for the differences in high school dropout and completion rates has to do with the accuracy of the underlying data used to construct them. Even when the measures are intended to quantify the same thing and even when they agree on the technical definition of the numerator and the denominator, the estimates often differ. Another weakness with status completion and dropout rates has to do with the validity and reliability of respondents’ self-reports of whether and how they completed high school.
A COMMON METRIC FOR RACE AND ETHNICITY?
In his presentation, Matthew Snipp (Stanford University) referred to race and ethnicity as a set of universal characteristics that exist over time and space. He observed that the human species relies heavily on the ability to visualize and identify difference, and some people have argued that the ability to make distinctions on the basis of race may have even been a selective advantage. More specifically, identifying people who look the same in terms of physical appearance, stature, diet, etc., may be a way to recognize those who are less likely to cause harm (or vice versa).
Snipp noted that the color coding of race, however, is something that is even more recent, beginning with the emergence of biology and the racial sciences in the late 18th and early 19th centuries. The rise of the racial sciences in the 19th century, principally ethnology and eugenics, focused heavily on the physiognomy of race. In the late 19th and early 20th centuries, people began contesting the research and thinking on race, especially the concepts of physiognomy and the notion of inherent racial hierarchies. In the mid-20th century, attention began to shift from trying to define race to categorizing types of race. Today, administrative definitions are probably
most familiar, because they are based on some sort of administrative or political agenda.
Constructions of Race in America
Snipp explained that people construct race socially by taking behavioral and physical characteristics associated with human difference and agglomerating them into a set of traits that are called race or racial distinctions. Three entities are important in terms of determining what race is in America: legal definitions, the Census Bureau, and the Office of Management and Budget (OMB).
With regard to legal definitions, “white” is a default category conventionally understood to have some sort of European continental origin. Snipp explained that African Americans traditionally have been identified by the rule of hypodescent, the “one-drop rule,” which has been reinforced by Supreme Court and federal court rulings. He noted that, in contrast, the rule of hyperdescent has been applied to American Indians, which requires minimum ancestry that very clearly restricts the magnitude of federal obligations. Each of the 562 tribes has its own criteria for determining who is an American Indian. While there is no history of either hypodescent or hyperdescent for Asians, Snipp mentioned that there is a history of restrictions regarding immigration and citizenship that was built into the 1882 Chinese Exclusion Act. Lately, discussion among the Latino community has centered around whether “brown” is a separate race, whether Latinos are a separate race, or whether the idea of Hispanic white makes sense for those who are of mixed indigenous and European origin, for example, many Mexicans.
Snipp observed that ever since the first census, conducted in 1790, questions about race have been asked. By the 1970s, there was an enormous amount of legislation, programs, and operations that required data about race. To facilitate comparison of race data, OMB issued Directive No. 15, which identifies the categories that federal government agencies should use for statistical collection and reporting.2 The directive also notes that these classifications should not be interpreted as being scientific or anthropological in nature. All agencies, grantees, and contractors (with the exception of small businesses) were required to use this set of categories. The American people became used to seeing these categories and thus thinking about them in terms of race and ethnicity. The categories filtered into the social sciences and were reflected in textbooks about race and ethnicity. Snipp said that these categories became the foundation for basically everything that is known about race in this country.
In his view, the 1990 census was a turning point in racial measurement for a variety of reasons. It had a long list of categories that included legacy races, like white, black, and American Indian. Other categories, which listed nationalities, were followed by an instruction to circle one of them, causing many protests by such groups as Arabs, Taiwanese, and Native Hawaiians, who could not self-identify with the groups shown. Native Hawaiians, in particular, objected because they did not want to be included as Asians and Other Pacific Islanders. In addition, interracial family organizations protested against privileging one race over another in identifying children of biracial families. Others were exercised to learn that the Census Bureau editing procedure allocated individuals to one race category (mostly white) even if they had reported multiple categories.
In 1994 the National Research Council held a conference and published Spotlight on Heterogeneity: The Federal Standards for Racial and Ethnic Classification (National Research Council, 1996). OMB hearings were held around the country, and an interagency working group was formed. The Census Bureau conducted a number of tests in anticipation of revising the racial classifications. In October 1997, OMB released a revision of Directive No. 15 with two major changes: (1) a separate category for Native Hawaiians and (2) the option to report more than one race. The implementation of this new standard was slated to occur no later than January 1, 2003.
The Spotlight report developed eight principles for creating a racial classification, although very few of them have been honored. The most obvious shortcoming relates to the dictum that “the number of categories be of manageable size.” Allowing multiple responses and using the five basic race categories yields 20 unique race categories; overlaying these categories with Hispanicity creates 40 unique categories. The 2000 census used 13 categories, resulting in 63 unique combinations, or 126 with the addition of Hispanic/non-Hispanic. Few would argue that these distinct categories constitute a manageable number. The fact that the Census Bureau has rarely published data for all 126 combinations is evidence that this system is unworkable for producing specifications for congressional redistricting, civil rights, or voting rights enforcement, for example. Other problems have resulted from these race categories:
•	The inability of federal agencies to agree on which categories or subsets of categories to use for decision making.
•	The need for OMB to produce a memorandum outlining a subset of categories that should receive special attention for civil rights enforcement (it resorted to the doctrine of hypodescent).
•	Lack of compliance with Directive No. 15 by states, local governments, and other entities, thus hindering the exchange of statistical reports among agencies and creating obstacles to implementing the categories for different uses, including education.
•	The finding that the five race categories of the original and revised versions of Directive No. 15 are not meaningful to a sizable number of Hispanics.
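The combinatorics behind the counts cited above can be checked with a short script. This is an illustrative sketch, not part of the workshop discussion: the function below counts the nonempty subsets of a set of race categories available under a mark-one-or-more rule, and the 63 and 126 figures correspond to the six race groups the Census Bureau tabulated in 2000 (the five OMB categories plus “some other race”), doubled by the Hispanic/non-Hispanic overlay.

```python
from itertools import combinations

def unique_race_combinations(n_categories: int) -> int:
    """Count the nonempty subsets of n race categories when
    respondents may mark one or more categories."""
    return sum(1 for k in range(1, n_categories + 1)
               for _ in combinations(range(n_categories), k))

# Six tabulated race groups yield 2**6 - 1 = 63 combinations;
# overlaying Hispanic/non-Hispanic doubles the count to 126.
six_groups = unique_race_combinations(6)
with_ethnicity = 2 * six_groups
print(six_groups, with_ethnicity)  # prints "63 126"
```

The count is simply 2^n − 1, so the number of reportable combinations roughly doubles with each category added, which is why the Spotlight report’s “manageable size” principle is so easily violated.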
In 2007, the U.S. Department of Education issued new guidance and a simplified set of categories in which all persons identified as Hispanic, regardless of their race, are counted simply as “Hispanic.” The five original single race categories of the revised version of Directive No. 15 are retained and persons reporting two or more races are categorized as “two or more races.” One of the flaws of this system is that about 14 percent of the total American Indian population claim Hispanic origin, but this is not reported; Cubans and Puerto Ricans of African descent are also not identified. This system, in fact, undermines the comparability of data with data from agencies adhering to the 1997 standard, such as the Census Bureau.
Measuring Race: Outstanding Considerations
While the validity and reliability of data for race and ethnicity receive relatively little attention in the literature, Snipp observed that questions about this topic are becoming increasingly inescapable. Current thinking regarding reliability, for example, demonstrates that racial data are more fluid and dynamic than believed in the past. In addition, instability in the reporting of race, once viewed as a result of random fluctuations arising from poorly created instruments, can be systematically modeled and therefore merits further inquiry as an object of social scientific research.
In terms of validity, Snipp underscored two considerations: (1) some concordance of understanding about the meaning of race must exist between the researcher and the research subject and (2) there is no ability to determine entitlement to a particular heritage. Other challenges facing researchers include
•	Ensuring content validity, including determining whether the race-specific categories under consideration are the correct ones and whether there is sufficient sample size to yield reliable estimates for smaller populations.
•	Determining whether the complex content entailed by the idea of race is comprehensively measured by one or more items on a survey questionnaire or interview schedule.
•	The ability of respondents, particularly those of mixed racial heritage, to ignore instructions and choose to identify with a race that best reflects their own understanding of “race.”
•	Perceptions of others versus perceptions of self that are influenced by one’s cognitive organization of racial identification.
•	The use of indicia (characteristics used by an observer) versus criteria (formally established conditions) in determining group membership.
The ability to trace the continental origins of human DNA and to connect this information with other genetic traits yields a tempting schematic for measuring race in a way that can be standardized, measured objectively, and is invariant with respect to evolving attitudes and shifting public opinion. However, Snipp warned, genotypes do not necessarily correspond to phenotypes; phenotypic traits observable in the everyday lived experience of race may or may not correspond to the continental origins measured by genetic testing. Consequently, one may wonder about the connection between heritage and the observed human differences associated with race. In addition, although genes may have a great deal to say about the great migrations of human beings, they have little bearing on the everyday lived social experience surrounding racial differences. He commented that, although assays of genetic ancestry may be a convenient way to standardize race as a feature of biology, they are unlikely to prove a productive strategy for the social sciences attempting to capture and understand human action based on perceived and self-understood differences.
Snipp ended by noting that it would be ideal to have a tool for social science research that could capture the dynamic and reflexive nature of race and ethnicity, an instrument that would yield a standard unit of measure across time and space. However, he cautioned, there are few clues on how to devise such an instrument. He considered it more important to recognize that a useful measure for scientific inquiry depends on a clearly articulated definition or understanding of the concept under study—something currently lacking in the social sciences for the concept of race.
Kenneth Prewitt (Columbia University) commented first on Pollak’s presentation, pointing out that he made a powerful and useful statement that, without theory, any indicator is weak to the point of being useless in policy making. This effect is clearly demonstrated by Pollak’s juxtaposition of the CPI and the disability index. The CPI has a strong theoretical foundation, whereas the disability index does not and consequently its use inserts ambiguities into the policy process. Pollak argued that the dropout rate clearly demonstrates the straightforward nature of the relationship between theory and indices. Theory must be anchored in the policy process for the data and measures to have significant use. The primacy of purpose—for
example, determining human capital labor skills in a given area for plant location or tracking students in school systems—drives how the denominator and numerator are conceptualized and the choice of methodology.
Turning next to Warren’s presentation, Prewitt reinforced the point that the policy objective also drives the use of the data set. He suggested that, in terms of developing common metrics, more conversation is needed about the differences between administrative data and survey data. Survey data have the characteristic of being variable rich and case poor due to cost restrictions. Administrative data have the opposite characteristics: they are case rich and variable poor. Administrative data are not organized to support regression analyses of individual-level behavior. By examining the information systems of different national governments, the differences between administrative data and survey data become more apparent. In Europe the ratio is 85:15 administrative/survey data. In the United States, the ratio is roughly 80:20 survey/administrative data. If the indicators used are based in theory, then the theory itself has to connect to a public policy purpose that is primarily fixed by the administrative agency collecting the data. Control of the data in fact rests with the administrative agency that collects them.
As an aside, Prewitt remarked that digital data will have a significant impact on the development of standardized measurements. The cost of the census in the United States is unsustainable, and this will result in a shift from its current reliance on survey data to increased use of administrative records and perhaps eventually digital data. A digital footprint leaves enormous amounts of data and raises questions about what constitutes proprietary data.
Prewitt then commented on Mulgan’s presentation describing the evolution of the measurement system, based on the constant interaction between the quality of the science and the ways in which the data are used. He said that Mulgan tracked effectively the movement from easily measured items to more abstract concepts that include subjective well-being, social resilience, or social capital. This progression is reflected in policy discussions about the use of data and the role of the scientific community in influencing policy makers. It is important, he continued, to watch how measurement behaves at the boundaries of a threshold, for example, when attention and money are concentrated on those “above the threshold” in order to obtain more funding. Prewitt acknowledged that social scientists need to live with certain distortions, but at the same time, he noted, the scientific community has to build in as many protections as possible so that the system cannot be gamed, as well as to maintain transparency.
Prewitt emphasized one of Mulgan’s key points about the direction of social science—the need to incorporate the constituencies affected into measurement, for example, in the creation of a new disability index. Prewitt lauded the Oregon benchmark program identified in Mulgan’s presentation,
which created the indices against which progress and the capability of its own government are measured. Prewitt also pointed to the global project Measuring the Progress of Societies, which is hosted by the Organisation for Economic Co-operation and Development (OECD) and run in collaboration with other international and regional partners. It is illustrative of the recognized importance of significant economic, social, and environmental indicators beyond GDP, such as measures of subjective well-being, organizational capacities, and innovation, to assess societal progress, he said. He noted that in the OECD conversation about progress, there is always a footnote that participating countries ought to define measures in their own way, thus undercutting the OECD’s drive for standardization.
Turning to Snipp’s presentation, Prewitt observed that the U.S. standardization of races into five categories in 1977 reflected patterns that trace to 220 years earlier, an indication of what he termed “bureaucratic inertia.” He commented that the race classification system in the United States has attached itself successively to different policy regimes, from those that supported the Three-Fifths Rule (which drove American history for the first 60 years), to immigration restrictions, to affirmative action. Even in the 2010 census, he observed, the race classification is still based on historical patterns of discrimination.
Prewitt indicated that there is little theoretical basis for the race classification system in use today. He stated that it is impossible to standardize the race measure, especially cross-culturally. He noted that all the presentations in this session made the same major point about the need for theory and the need for public purpose. The latter, including the relevant measures, must be embedded in a conversation with the population, not just among statisticians.
Regarding the genomic revolution and its impact on classifications, Prewitt commented that genomic projects conducted around the world are being forced into the coding schemes of the United States—specifically, the OMB classifications. He expressed concern about current directions and “rebiologizing race.” Prewitt saw the challenge as going beyond scientific standardization to focus on how such a system would be used, as well as its political and policy implications.
Barbara Schneider (Michigan State University) agreed with Prewitt about the lack of adequate research about administrative data. As more longitudinal data are being collected, she asked how these new data will be integrated into measures that have been based primarily on surveys. Prewitt commented that the big issue regarding administrative data is the potential ability to cross data sets from education with those on health and social services.
Jack Triplett commented on a point raised by both Prewitt and Snipp about controlling the denominator: if a ratio’s numerator and denominator are based on different classification systems, the resulting data are difficult to use for many purposes. As an example, Snipp pointed out that the U.S. Department of Education is not using the same categories as the Census Bureau, so the denominator comes from a different set of categories than the numerator.
Harris Cooper praised the quality of the papers presented, which he felt were especially valuable in relation to one another. Based on his understanding of the day’s presentations, he did not consider it a problem that common social and health metrics and indices are not possible. It is not that they are impossible, responded Pollak, but rather it depends on the definition. For example, if the marriage category is defined only as being legally married and living together, then that definition can be used in any data set as an independent and dependent variable. He contended that it is better to have the raw data in order to see what independent variables are correlated with a given definition. While it is possible to define some notion and insist that it is used by everybody, this approach may not be advisable, he continued. Hauser said that aggregation, rather than data collection or measurement, is the key issue; the American Community Survey asks for national origin, and it is a completely open-ended question.
Prewitt and Snipp both expressed concern about the use of genetic markers in conjunction with racial and environmental characteristics, thinking that some lines of research should be avoided. Pollak raised a different topic concerning the benefits and limitations of self-reported race on the decennial census. On one hand, he said, it raises an interesting behavioral theory of what people report, but on the other it is also a topic for people interested in discrimination. He emphasized that there are different purposes in a social science context, and it is important to keep them in mind when considering various research questions. For this reason he is less concerned than Prewitt and Snipp about incorporating genomic issues related to medicine.
Grusky noted that Prewitt’s preference for administrative data comes with associated costs, and he asked for Prewitt’s reaction to the view that researchers can exert some leverage when the data are intended for research purposes. Grusky continued by raising a point regarding Pollak’s main concern that, in the absence of theory, standardized measurements would be vulnerable to political manipulation. He suggested that there may be other ways to protect against manipulation aside from theory, since the goal is to have consensus, which can be secured in other ways. He offered the examples of unemployment and official poverty measures as ones that are not defined by theory but are prevalent in usage. Setting the question of
theory aside, Pollak considered the prevailing unemployment and poverty measures part of the status quo, which is not the same thing as a consensus, although the status quo may indeed come to serve as one.
Karen Jones (Customs and Border Protection) raised the question of how best to combine good program design with common metrics. She attended a briefing by the U.S. Government Accountability Office that addressed practical issues on conducting pure empirical research and how to mitigate its limitations by using the correct statistics to evaluate the data gathered. However, she said, there was very little emphasis on common metrics to evaluate training programs in one field, such as law enforcement. In her field, if something works in a given situation, it is often used in other situations as long as it meets the minimum criteria for good program evaluation design. She questioned how people like her can influence organizations, like OMB, that continually request adverse impact studies for training based on arbitrary racial categories.
Referring to the Health and Retirement Study, Willis returned to the issue of administrative data in connection with surveys. First, an obvious advantage is a more robust data set resulting from linking representative survey data with administrative data. Second, this pairing raises the issue of which agencies are willing or unwilling to link the data. An agency that has no policy or policy research aspect will be less inclined to interact productively with social scientists. Willis argued for a two-way flow of information, noting that federally funded Research Data Centers that allow researchers access to restricted data have benefited from the exchange between Census Bureau personnel and academics. Prewitt said that ideally interaction between the producers of administrative data and social scientists would develop in such a way as to yield high-quality data, as well as better program administration from the resulting data.
Pollak had stated that one should think of a measure in terms of how the measure works in predicting a certain outcome. Triplett expressed concern about the concept of centering measurement in the political process. While measurement needs to be of value for analysis, in political and other contexts, the potential for political or other gaming poses a serious problem for statistical agencies. The unemployment rate serves as an interesting example; in the 1970s, it was extremely controversial. The issue was settled not by theory but in part by the work of Julius Shiskin, who launched several different versions of the unemployment rate (called U-1 through U-7) and showed that they all moved together over the business cycle.
The CPI also generated political debate during Triplett’s tenure at the Bureau of Labor Statistics, and after. Many of the debates about changing the CPI focused on technical issues and how to apply the theory underlying the index. Ultimately, this specific debate did not call into question the integrity of the statistical agency. However, Triplett recalled the creation
of unemployment rates for states and smaller areas, for which no reliable sample existed. He expressed skepticism about statistical programs that are generated from a political process.
The question of how to best use data collected from or generated by transactions conducted over the Internet was raised by Christine Bachrach. Are there research programs in place to evaluate the data, their use, and their cost-effectiveness? What will be the implications of these data on standardization?
Mulgan reported a dramatic change in the use of administrative data in the United States, the United Kingdom, and Australia. These governments have made commitments to make raw data available to the public as a default. This potentially transforms the relationship between administrative and survey data. For example, the Australian government runs competitions to see who can get the most cross-correlations, which would yield more case-rich data.
Mulgan cited other examples of the co-evolution of policy and science. One was the initiative in the United Kingdom to maintain a time-series database of health, education, and other records, mainly for children at risk of poverty and social exclusion. The impetus for the initiative came from the academic community in an effort to learn more about the life course, protective factors, and risk factors, among others. The program is likely to be terminated for political reasons and concerns about human rights and privacy. Another example is the history of the unemployment rate in the United Kingdom, which has undergone a range of treatments, from political manipulation to a return to a theoretical measure of surplus labor supply. Returning to the discussion of race, Mulgan gave the example of the large Pakistani and Bangladeshi community in the United Kingdom that is calling for identity through faith, not race. This has created a challenge for the state as it tries to identify this community through a set of regressive, semibiological racial terms.
Prewitt proceeded to discuss the political implications of classification categories on surveys like the census. He used the example of how multiple races have been categorized in the decennial census. In 2000, when people were allowed to choose more than one race category, the category of “other” (which had been included on prior census forms to allow respondents to indicate that they were of two or more races) was not removed from the form. Even though “other” no longer served any theoretical purpose after the mark-one-or-more option was introduced in 2000, it remained on the form. Nearly half of the Hispanic population, mostly Mexican and Central Americans, used the “other” category to identify their race. After the 2000 census, the Census Bureau decided that the term was not a good measure and wanted to remove it from the form; however, a member of the House Committee on Appropriations included in the budget the provision that the
Census Bureau shall always include the word “other” if it asks any race questions.
Prewitt believes that the government must have a proper reason for asking questions of its population. Consequently, he saw the need for a connection between some kind of policy issue or possibilities and the concepts that the government is trying to measure. He further observed that the science of social measurement in the United States is most protected in statistical agencies. He argued that they care more than program agencies about data quality, continuity across time, standardization, and privacy and confidentiality. He then addressed the issues surrounding the ownership and management of digital data. While some Research Data Centers have already started thinking about the relationship between administrative and survey data, they have not yet addressed digital data. Prewitt raised concerns about the quality control of digital data being used by the U.S. Department of Homeland Security, since without public access there is no way to know how it is being maintained. He asserted that discussion is still needed about how to make sure society’s information system is going to be housed in a place that is concerned with quality protection.
In the future, the way administrative records and surveys are linked will become increasingly important. Snipp cautioned that the scientific community will face a number of ethical issues, such as confidentiality and privacy concerns with respect to transactional data, survey data, and its linkages to administrative data. He mentioned that Stanford University, like a number of other institutions, has created a secure data center, but this kind of precaution is not being undertaken in the scientific community at large.