Measurement in the Social Sciences
In his overview, George Bohrnstedt (American Institutes for Research) provided a short history and review of measurement in the social sciences. He began by introducing measurement in the physical sciences and then discussed measurement approaches in the social sciences, touching in particular on seminal developments that have facilitated or impeded progress. He also introduced the topic of index construction, observing that indicators often turn out to be determinants of the construct rather than just reflecting it.
MEASUREMENT STANDARDIZATION IN THE PHYSICAL SCIENCES
Bohrnstedt made three observations about measurement standardization in the physical sciences:
Measures are social constructs, and the process of gaining standardization around measures is very much a social process involving social actors and negotiations, like any science or any political process.
Standardization is impelled along when there are strong commercial, political, or scientific forces at work.
Science has a strong, central role to play in the development of standards.
An example of the adoption of standards as a social process can be seen in the way political and commercial interests worked against adoption in the United States of the metric system, despite the involvement of scientists from many countries to lend scientific stature to the use of this measurement system.
Turning to physical measurements more generally, Bohrnstedt described them as characterized by standards that are based on strong theory and experimentation. In the physical sciences, theory is often viewed as a necessary precursor for measurement. With strong theory, measurements can often be used to confirm, reject, or refine hypotheses. In social science disciplines, the lack of strong theories is often reflected in the lack of well-accepted common metrics.
MEASUREMENT STANDARDIZATION IN THE SOCIAL SCIENCES
According to Bohrnstedt, there are some clear, tangible measures in the social sciences—such as birth, age, marital status, number of children—but the picture becomes murkier when one considers such concepts as attitudes, values, and beliefs at the individual or organizational level, or such concepts as school climate and organizational learning, or societal-level concepts, such as anomie and social disorganization. In the social sciences, it is often unclear whether the problem is the theory, the measures, or both. Bohrnstedt observed that researchers have not yet discovered how to define the kind of fundamental quantities in the social sciences that exist in the physical sciences. Social science concepts are large in number, fuzzy, and do not bear a simple relationship to one another, as is more frequently the case in the physical sciences. As a result, strong axiomatic theories against which to evaluate and inform measures are lacking. He cautioned, however, that it is not clear that social scientists would develop better measures if in fact strong theories existed.
Bohrnstedt traced the history of social science measurement, beginning with Pierre Guillaume Frédéric Le Play (1806-1882), who is credited with establishing what has become the modern-day social survey. He followed with mention of Guttman scales, popular in the 1950s and 1960s, which order both items and persons on a scale and are an important precursor to item response theory (IRT) scaling, developed in the early 1960s primarily to measure latent ability and achievement; application of psychophysical work on sensation and perception to attitude and value measurement using the method of paired comparisons; the scaling of attitude items, which led to development of the law of comparative judgment; the measurement of intelligence and the earliest factor analyses; the use of linear composites in the social sciences; and one-parameter Rasch models and subsequent two- and three-parameter models. There is increasing interest in IRT applications for the measurement of social and psychological latent concepts. One example is the measurement of health-related quality of life using the Patient-Reported Outcomes Measurement Information System (PROMIS) at the National Institutes of Health (NIH).
Bohrnstedt ended with a set of ideas for constructing good measures in which the items reflect constructs:
Define the concept as carefully as possible, specifying the domain of meaning.
Use factor analysis to explore the dimensionality of the concept.
After determining dimensionality, do a confirmatory factor analysis to verify.
Estimate the internal consistency reliability of the measures constructed on the basis of the analysis.
Fit the items for each dimension to a Rasch model.
If the items will not fit a one-parameter or Rasch model, then fit them to a two-parameter model.
Ensure that parameter estimates are invariant for various subpopulations.
Develop new items to bolster sparse areas on the latent dimensions.
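As a rough illustration of two of the steps above (estimating internal consistency, and moving from a Rasch model to a two-parameter model), the following Python sketch computes Cronbach's alpha, one common internal consistency estimate, and evaluates a logistic item response function. It is not drawn from Bohrnstedt's presentation: the function names (cronbach_alpha, item_response_prob), the logistic form with discrimination a and difficulty b, and the response data are all assumptions made here for illustration.

```python
import math
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_columns = list(zip(*rows))  # transpose: one tuple of scores per item
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in rows])  # variance of the summed scale
    return (k / (k - 1)) * (1 - sum_item_var / total_var)


def item_response_prob(theta, b, a=1.0):
    """Probability of endorsing an item, given latent trait theta.

    With a held at 1.0 this is the one-parameter (Rasch) form;
    letting a differ by item gives the two-parameter form.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical data: 5 respondents answering 3 items on a 1-5 scale.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 4],
]
alpha = cronbach_alpha(responses)

# Same person (theta = 0.5) and item difficulty (b = 0.0), different discrimination.
p_rasch = item_response_prob(0.5, 0.0)
p_two_param = item_response_prob(0.5, 0.0, a=2.0)
```

A high alpha indicates only that the items covary; it does not by itself show that they measure a single dimension, which is one reason the factor-analytic steps precede the reliability step in the list.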
With respect to index construction, Bohrnstedt observed that in sociology, economics, and policy research, in some cases the assumption is that the indicators define the construct rather than the other way around. This is sometimes called a “formative” as opposed to a “reflective” model of index construction. Examples include an index of socioeconomic status, consisting of education, income, and occupation, and the consumer price index, which is based on a market basket of goods and services. The construct is in fact determined by or defined by the indicators that go into it. Typically, the indicators are simply unit-weighted, but in some cases they are weighted on the basis of theory, differential utilities, or other preferences (e.g., relative importance based on a community survey). One can estimate the weights of the indicators if there are multiple indicators and multiple causes (the MIMIC model).
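The difference between unit weighting and differential weighting in a formative index can be sketched in a few lines of Python. The indicator values, the weights, and the choice to standardize each indicator before combining them are assumptions made here for illustration; they are not taken from any official index or from the workshop presentation.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one indicator so indicators in different units can be combined."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("indicator has no variation")
    return [(v - m) / s for v in values]


def formative_index(indicators, weights=None):
    """Weighted average of standardized indicators; unit weights by default."""
    names = list(indicators)
    if weights is None:
        weights = {name: 1.0 for name in names}  # unit weighting
    z = {name: zscores(indicators[name]) for name in names}
    n_cases = len(z[names[0]])
    total_weight = sum(weights[name] for name in names)
    return [
        sum(weights[name] * z[name][i] for name in names) / total_weight
        for i in range(n_cases)
    ]


# Hypothetical indicators for four persons.
indicators = {
    "education": [10, 12, 16, 18],   # years of schooling
    "income": [20, 35, 50, 90],      # thousands of dollars
    "occupation": [30, 45, 40, 70],  # occupational score
}
unit_weighted = formative_index(indicators)
# Hypothetical differential weights, e.g. from theory or a community survey.
reweighted = formative_index(
    indicators, weights={"education": 2.0, "income": 1.0, "occupation": 1.0}
)
```

In this formative setup the index values change whenever the set of indicators or their weights change, which is the sense in which the indicators define the construct.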
COMPARABLE METRICS: SOME EXAMPLES
Robert Hauser (Division of Behavioral and Social Sciences and Education, National Research Council, Washington, DC, and Vilas Research Professor, Emeritus, University of Wisconsin, Madison) reflected on the tradeoff inherent in standardization. In the social, behavioral, and economic sciences, standardization of measures can help the accumulation of evidence because it permits valid comparisons across time, place, or units of observations (e.g., persons, families, settings, localities, organizations). Standardization also can create common understandings, when measurement intersects with policy. At the same time, however, standardization can entail the loss of information, and too much standardization may make extensive evidence uninformative and misleading. A delicate balance must be negotiated, he said, between standardization of measurement and validity of social scientific constructs. This can be complicated, because measurement can overlap with representation (who or what is being measured), analysis (how data will be described and used), theory, and policy.
Hauser then illustrated his point with a number of public metrics, in declining order of success, based on his judgment of the validity and usage of the measures:
The unemployment rate is a social scientific invention based on a detailed behavioral report of job searching during a reference week by members of the labor force. It is defective in the sense that the officially unemployed do not include “discouraged workers,” persons who have given up on their search for employment, or the underemployed. This defect is exacerbated when unemployment is high, as the measure underestimates the extent of economic distress.
The official poverty line is a more recent scientific invention frequently used in policy applications despite major weaknesses that greatly limited its validity and usefulness from the outset. It is an absolute standard in real dollars, updated only to reflect changes in the consumer price index. Because of this and the fact that living standards and the share of food in family budgets have changed, the standard has become increasingly obsolete. In Hauser’s estimation, the official poverty line has been overused in thousands of research papers and books, and perceptions about poverty and the poor would differ if a standard measure of greater validity were widely accepted.
Academic achievement levels offer a more recent example of a nominally social scientific, standardized measure that has become visible and influential in public discourse and policy. Although based on questionable and subjective methods, academic achievement levels have nevertheless become ubiquitous in reports on diverse subjects at state and national levels. Public and political demands for understandable metrics of academic accountability have trumped their negative evaluations, he said. In this case, Hauser pointed out, the creation of a supposedly scientific set of standards led to their reification in law, to the creation of competing standards, and to comparisons of populations in differing but nominally identical metrics.
The 1992 National Adult Literacy Study reported five levels of literacy, based on four cutoff points set at equal intervals, without specific descriptors, that presumably indicate discrete breaks in competence. From this score distribution, it is not possible to determine the number of people who are considered illiterate in the United States. The National Center for Education Statistics, when it was about to undertake the successor National Assessment of Adult Literacy (NAAL) in 2003, asked the National Research Council (NRC) to recommend standards for adult literacy that could be used in the NAAL and applied retroactively to the National Adult Literacy Study in order to compare literacy levels across the decade among all adults and specific population groups. The NRC report Measuring Literacy: Performance Levels for Adults (National Research Council, 2005) developed five categories with explicit descriptions corresponding roughly to readiness for successive levels of formal education. The NRC report concludes from experimental work that the whole enterprise of line drawing is on very shaky ground.
The Voluntary National Tests were a 1997 proposal of the Clinton administration for tests of reading at grade 4 and mathematics at grade 8 that became a dramatic and failed effort to create a common metric for the assessment of academic achievement and changes in it. The proposal was to give the same assessment to all students nationwide, and individual reports would be shared with students, parents, teachers, and school administrators. Advocates believed that this diagnostic information would increase motivation to improve academic achievement. Hauser said that the project ultimately died due to strong opposition from Republicans who believed it would destroy the traditional prerogatives of local school systems and from minority groups afraid it would stigmatize them. He mentioned two proposals by Congress for NRC studies to address measurement issues in ways that would permit this project to go forward without giving everyone the same test. The first one, to equate the scales of existing tests, was considered not feasible. The second proposal, to insert modest numbers of existing items from national assessments into existing tests on state assessments, also was rejected because of substantial differences in context or administration between the state and national testing programs. Hauser was struck by the fact that Congress directly addressed technical issues of comparability in measurement, at least attempting to establish national comparability in the measurement of individual academic performance in its proposals to the NRC.
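The first item in the list above, the unemployment rate, lends itself to a small numerical sketch. The counts below are hypothetical, and the broader rate shown here (adding discouraged workers to both the unemployed and the labor force) is only one simple way to illustrate the omission Hauser described; it is not presented as an official definition.

```python
def unemployment_rate(unemployed, employed):
    """Share of the labor force (employed plus unemployed) that is unemployed."""
    labor_force = unemployed + employed
    return unemployed / labor_force


def rate_with_discouraged(unemployed, employed, discouraged):
    """Illustrative broader rate: counts discouraged workers as unemployed
    and as part of the labor force."""
    return (unemployed + discouraged) / (unemployed + employed + discouraged)


# Hypothetical counts, in thousands of persons.
unemployed, employed, discouraged = 8_000, 152_000, 1_000

official = unemployment_rate(unemployed, employed)
broader = rate_with_discouraged(unemployed, employed, discouraged)
gap = broader - official  # how much the narrower measure leaves out
```

Because discouraged workers are excluded from both the numerator and the denominator of the narrower rate, the gap between the two rates widens as their number grows.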
Accumulating Evidence, Comparing Effects
According to Hauser, social scientific examples of standardization range from qualitative classifications, like race/ethnicity and social class; to numerical scales describing psychological traits, social standing, or economic amounts; to normalized measures of the fit of statistical models and the effects of variables in such models. He discussed social class, occupational prestige, and occupational socioeconomic status as examples involving the normalization of metrics.
Social class is a core concept of sociology. It is ubiquitous, yet there is endless disagreement about how to measure it. In recent sociological research, there have been three main contenders on how to measure social class: a neo-Marxist classification developed by Erik Wright (1993),1 a neo-Weberian classification developed by Robert Erikson and John Goldthorpe (1992),2 and variants of the Edwards scale, a socioeconomic classification of occupations by the U.S. Census Bureau that was developed in the 1930s. The Edwards scale captures a central hierarchical dimension of the occupational structure, but major classification changes in the Census Bureau’s occupational system and the federal system more generally have made it difficult to maintain in any comparable form.3 This system has a stronger empirical than theoretical grounding. The Wright and Erikson-Goldthorpe class schemes have a strong basis in sociological theory, but each also has notable empirical weaknesses.
All three classification schemes exemplify the strengths and weaknesses of common metrics. On the positive side, the schemes have been used extensively in cumulative and comparative research, as well as for social reporting. However, each of the three schemes competes with the other two, thus reducing the set of comparable studies and observations.
To see how well the three schemes compare, Miech and Hauser (2001) looked at health outcomes in relation to all three of these measures in the Wisconsin Longitudinal Study. They found that if used in occupational classification to explain health differentials, the Edwards scale was really the best choice, yet a simple classification of educational attainment actually dominated any of the occupational components.
Hauser closed this discussion by raising the broader problem with the use of any of the standard measures of “social class”: the belief that these, or closely related measures of social standing, taken alone, fully represent the social and economic standing of a person, household, or family. He argued that this simplistic view fails to recognize the complexity of contemporary systems of social stratification, in which inequalities are created and maintained in a substantially but by no means highly correlated mix of psychological, educational, occupational, and economic dimensions. He stated that this, more than the details of class measurement, is the greatest disadvantage of standardization in the measurement of social class.
Occupational prestige, based on lay or expert reports of the “general social standing” of occupations, was found in the mid-1950s to correlate highly across national populations, later across time, and between blacks and whites. Research by Donald Treiman (1976) produced the Standard International Occupational Prestige Scale. Hauser surmised that this scale did not take hold in part because sociologists around the world were more interested in the peculiarities of social mobility in their own nations and less concerned about comparability, and in part because empirical research showed that prestige was not the main dimension of occupational persistence.
Studies of occupational prestige in the United States beginning as early as 1947 covered only modest numbers of occupational titles. In the absence of a complete set of prestige scores, Duncan created a proxy measure, the Socioeconomic Index for All Occupations (SEI),4 which has been widely used in U.S. studies of occupational mobility, including intergenerational mobility. Hauser emphasized that the SEI represents occupational standing alone, not individual or family socioeconomic status. This measure and its competitors (e.g., the Hollingshead Index of Social Position, the Nam-Powers Index) all have limitations.5 For example, all of these indexes are based on male workers alone, so they are not valid in today’s market, in which women are a very important component of the labor force. In addition, it turns out that education alone generates a better scale than composite indexes, such as those that include both income and education. The story of the Duncan SEI is a case history of the rise and fall of a standard sociological measure that became obsolete over time. There is now an international socioeconomic index developed by Treiman and colleagues that is well suited for comparative work.6
Normalization of Metrics
Multiplicative scales and log transformations are analytic schemes for normalizing metrics to achieve comparability in levels or effects. Hauser discussed how such transformations can range from truly useful to utterly misleading.
One of the simplest and most powerful transformations, under appropriate circumstances, is the log transformation. Because log transformations reduce positive skew and increase negative skew, it is often desirable to add a constant (start value) before transforming the original variable.
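A brief Python sketch can make the effect of a log transformation with a start value concrete. One common reason for adding a constant is that the logarithm is undefined at zero, so variables such as income that include zeros cannot be logged directly. The income values, the start value of 1, and the helper names (skewness, log_with_start) are assumptions chosen here for illustration, not values discussed at the workshop.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_with_start(values, start=1.0):
    """Natural log after adding a start value, so zeros remain defined."""
    return [math.log(v + start) for v in values]


# Hypothetical, strongly right-skewed incomes (one zero, one very large value).
incomes = [0, 10, 20, 30, 40, 1000]

raw_skew = skewness(incomes)
log_skew = skewness(log_with_start(incomes, start=1.0))
# The logged variable is much less positively skewed than the raw one.
```

The choice of start value is itself a judgment call: a different constant changes the spacing of small values on the transformed scale, so it should be reported along with the results.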
Both location and metric affect comparisons. Hauser pointed out that interaction effects may be an artifact of differences in location on the same scale (when effects are not linear). As an example, he pointed to comparisons of returns to education among blacks and whites in the United States (Hauser et al., 2000). Vignette measurement circumvents some of these problems by trying to calibrate individual scales, rather than trying to assume that there is a common scale for everyone in ordering objects (see King et al., 2004).
Hauser turned next to meta-analysis, which typically involves statistical analyses of the combined results from different analytic studies. In his view, meta-analysis is vastly inferior to pooled analyses of primary data. In particular, the dominant use of “effect size” in standard deviation units does not create common understanding, since these units are not necessarily in the same metric and are not real units. As data sharing increases and as people’s capabilities to use multiple sources of data increase, his hope is that meta-analysis will become less important.
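One way to see the concern about effect sizes in standard deviation units is a small Python sketch with two hypothetical studies. The means and standard deviations below are invented for illustration, and the simple formula used (difference in means divided by a standard deviation) is only one of several standardized effect size definitions.

```python
def standardized_effect(mean_treated, mean_control, sd):
    """Difference in means expressed in standard deviation units."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean_treated - mean_control) / sd


# Two hypothetical studies with the same raw effect (5 points on the same test)
# but samples that differ in how variable the outcome is.
study_a = standardized_effect(105.0, 100.0, sd=10.0)
study_b = standardized_effect(105.0, 100.0, sd=20.0)

raw_difference_a = 105.0 - 100.0
raw_difference_b = 105.0 - 100.0
```

Here the raw effects are identical, yet the standardized effect in the first study is twice that in the second solely because the second sample is more heterogeneous. Pooled analysis of the primary data in the original units would not introduce this discrepancy.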
Hauser’s selective review of past efforts provides a cautionary account of the prospects for useful and valid common metrics in the social sciences. He ended his presentation with seven lessons for the creation of sound, standard, and comparable social, economic, and behavioral measures:
Repeated use gives meaning to a metric; overuse may reify it.
Meet a real scientific and/or policy need. If no one else will use a measure, it is not worth the effort. Widespread use is rewarding. A check of citation indexes attests to the fact that the biggest citation counts go to people who develop useful measures, not those who analyze data.
Seek simplicity in content and construction. To the extent that an indicator is hard to ascertain, is complicated to construct, and admits multiple interpretations, it will be less useful.
Avoid relative measurement: above all, avoid percentile ranks, standard deviations, and shares of variance.
Avoid descriptive terms for arbitrarily or subjectively determined ranges of a quantitative indicator. Such terms invite misinterpretation.
Study the operational and analytical behavior of a measure to assess its validity, not merely the details of its construction.
Weigh the balance between internal and external validity. Information loss may vary positively with comparability, and sometimes loss is gain.
His closing remark was that nothing is more important and scientifically rewarding than the development of standard metrics that are useful in theory and in practice.
In her discussant remarks, Christine Bachrach (Duke University and University of Maryland) posed three broad questions to further extend the range of issues based on her reading of the workshop papers and presentations.
First, how healthy is measurement science in the social sciences? Understanding the focus of the workshop to be common metrics that advance social science theory, Bachrach probed whether theory is actually advancing metrics, common or not, in an adequate fashion in the social sciences. It is important to carefully define the constructs one wants to measure, she cautioned.
In addition, Bachrach noted that the seriousness with which measurement is approached and the degree to which it is grounded in scientific principles and scientific methods actually vary tremendously across the behavioral and social sciences. She observed that there are structural factors that contribute to placing measurement on the sidelines, chief among them the balkanization of disciplines, with some disciplines placing greater emphasis on measurement issues than others.
In the field of demography, the use of common measures is fairly well accepted in the design and development of omnibus surveys. Although this has many positive benefits, Bachrach noted that it also leads to the development of “habitual measurement practices,” that is, relying on the same measures regardless of whether they truly represent the theoretical constructs of interest. For example, years of schooling are measured quite similarly across the social sciences, although the measure is used to operationalize very different theoretical constructs ranging from opportunity cost to human capital to social class. She echoed a point made by Hauser about users reading into measures what they want. Thus, common measures alone are insufficient if there is a lack of common understanding as to what those measures represent. She identified the structure of peer review as yet another set of factors that influences the health of measurement science in the social sciences. NIH has recently shifted its review criteria to try to nudge reviewers away from a very detailed focus on the technical approach used in grant applications to a focus on impact, significance, and innovation. There always has been tremendous variation across different review groups as to how much attention is given to the quality of measurement and the approach taken to measurement; she supposed that this new change may further dampen attention to measurement. Bachrach saw similar variations in the peer review of journal articles in terms of the importance accorded to measurement issues.
Second, what is meant by common metrics? Bachrach encountered multiple meanings in her reading of the workshop papers. The workshop planners describe common metrics in terms of researchers who are pursuing a line of inquiry that relies on common measures for the variables under study. Some people mean the development of standard measures that are driven by policy needs and institutional requirements (e.g., poverty, race, high school completion). Hauser referred to these as public metrics, but said that through their use in policy they may take on a life of their own. Another meaning suggests the development of methods for aligning different measures with each other, as illustrated by international benchmarking of educational measures and approaches to normalizing and transforming metrics to achieve better comparability. Yet another meaning that is less explicit is associated with the idea that investigators situate their measures with respect to others in play.
Although the focus of the workshop is on social science theory, Bachrach observed, the papers are more concerned with the needs of policy. She cautioned that how one goes about developing common metrics for advancing policy may differ from the approach recommended for advancing theory. Even the definitions captured in the workshop description cover a very broad set of scenarios depending on how a line of investigation is interpreted. In her view, perhaps the best contribution that this workshop could make would be to map out the very different forms that pursuing common metrics can take, depending on the state of the science and the goals in play. She also said that it would be worthwhile for the workshop to address how the different forms fit together and whether there are cases in which insufficient attention to the value of common metrics is holding back science.
Third, how does the social science community move from the successes of the past to tackling new opportunities and challenges? She noted two examples of metrics that have stood the test of time through very careful, thoughtful revision. One is the definition of the meter, which was adopted in 1791 and grounded in the physical sciences. The measure was revised at least four times, and these revisions were driven by changes in the science used to translate the definition of a meter into an actual metric. Another example is the Duncan socioeconomic index, a measure that has been extremely successful in advancing research on social mobility. It, too, has required adaptation because of changes in the occupational structure itself and because of changes in the labor force. Bachrach suggested that there is the opportunity for developing flexible common measurement strategies that can better keep up with the diversity of experience over time and accommodate the diversity of experience that exists at any one point in time. She asked whether there might be a way to tap into new technologies, new scientific advances, to develop adaptive models of measurement that can be widely used.
At NIH, Bachrach saw many instances of disciplinary divides obstructing the flow of knowledge about constructs and appropriate measurement between the health sciences and the social sciences. She considered the balkanization of disciplines as weakening links between science and measurement because the development of measures used in one discipline may benefit from science in another discipline. Thus, the movement toward interdisciplinary research promises greater commonality of measurement. She believes there has been progress in bridging these divides.
Robert Pollak (Washington University, St. Louis, Missouri) picked up on a different sort of disciplinary divide by distinguishing between measurement reports for their own sake and measurement for use in analysis. In the latter case, he said, people ought to think about what the independent variables and the dependent variable are. For example, with respect to outcomes for children, one might be thinking about health or education outcomes (e.g., highest grade completed, test scores), labor market outcomes, or crime. He also cautioned that seemingly simple variables (such as marital status) actually can be very complex. It has become conventional practice to combine those who are cohabiting with those who are married, for example. But Pollak raised additional questions, such as how one should think about married couples who are not living together or who commute. He recognized that these are empirical questions and not ones that can be settled easily.
Nancy Cartwright (University of California, San Diego, and London School of Economics and Political Science) emphasized the need to consider sociology and politics not only outside the academic community but also within it. She observed that there can be pressure in the academic community to use the measures of one’s supervisor or to pursue the kind of results that are likely to bring professional rewards.
Harris Cooper (Duke University) turned the discussion to meta-analyses, contending that a more modern view of meta-analyses sees them as not providing definitive answers but perhaps setting the stage for where one should look to define the next experiment or investigation. He acknowledged that meta-analyses can only be as good as the studies that are included in them. He sees his colleagues in medicine as leading the way with regard to use of what they refer to as individual patient data meta-analyses. Hauser responded that he sees the challenge as going from effect sizes in different studies to metrics that have more meaning. He is convinced of the need for overlapping metrics in different studies in order to get to a real metric in the course of analysis.
WHAT CAN BE LEARNED FROM THE ECONOMIC SCIENCES?
Robert Willis (University of Michigan) provided some history on standardization, touching on the politics associated with standardizing measures in economics before turning attention to the U.S. national accounts. National accounts represent a standardization of method and approach that has been quite powerful yet incomplete in a fundamental way. There are ways to make them more complete by essentially using extensions of standard methodology, such as gathering better data and developing better theory. Willis discussed another approach, which is to complement so-called objective measures with more subjective ones. He also argued that established statistical agencies have had to apply economic theory in order to produce economic data that are useful and credible for science and for policy.
Historical and Political Considerations
Willis began by observing that because economics is so directly relevant to policy and politics in a democratic society, the development of standardized economic data has gone hand in hand with the development of the idea of data in public service. He recounted the history of the formation of the National Bureau of Economic Research (NBER) to illustrate the tradition in the field of connecting facts (data) and policy. The NBER was founded in 1920 as a private institution, and its charter incorporates appreciation for the explicit connection between facts and policy, emphasis on scientific principles and impartiality, and the expectation that the bureau should abstain from making recommendations on policy (Fabricant, 1984).
As recounted by Willis, the first NBER project can be considered a case study of professionalization in the production of standard measures. National income measurement is based on a close connection between economic theory and the definition of the measurement tasks. In the 1930s, the project moved to the newly formed Bureau of Economic Analysis (BEA) in the U.S. Department of Commerce, national income accounts became part of the official statistics of the United States, and the methodology was adopted by other countries around the world. Willis noted the explicit attempt, first in the founding of the NBER itself and later in the incorporation of this work into the government, to make the production of the data as resistant as possible to political and other pressures.
In addition to BEA, Willis counted the Census Bureau, the Bureau of Labor Statistics, and others as federal statistical agencies committed to the collection of objective data free of partisanship and advocacy. He recalled various crises in which professionals in statistical agencies have stood their ground, refusing to manipulate a measure, such as the unemployment rate, for political advantage. A case in point can be seen in the advice given by Francis Walker—the superintendent of the 1870 census, the founding commissioner of the Bureau of Labor Statistics, the inaugural president of the American Economic Association, and a vice president of the National Academy of Sciences—to the first commissioner of the Massachusetts Bureau of Labor Statistics (Walker, 1877: vii-viii as cited in Prewitt, 1987):
Your office has only to prove itself superior to partisan dictation and to the seductions of theory, in order to command the cordial support of the press and the body of citizens…. I have strong hopes that you will distinctively and decisively disconnect [the bureau] from politics.
Measurement in Economic Life
In elaborating on the connection between theory and policy, Willis turned next to measurement in economic life. People enter exchanges only if they believe they are getting more than they give. Just as standardized measurement of physical quantities and monetary values has ancient origins (see Bohrnstedt, 2010), so do the actions of private actors and sovereigns to subvert the standards, or capitalize on asymmetric information, for their own advantage.
Willis pointed to measurement of gross domestic product (GDP) as the canonical example of standardized measurement in economics. GDP is reported by BEA in the United States and similar agencies throughout the world. The basis for these aggregate measures lies in micro-level surveys of households, firms, and units of government, as well as administrative records. The measure is intended to allow comparisons of real income levels in a given country across time and across countries at a given time. To make the comparisons, adjustments must be made for differences in the purchasing power of a monetary unit using price indices.7
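The price-index adjustment mentioned above can be illustrated with a minimal sketch. The figures and the function name `real_income` are hypothetical and are not BEA data or BEA methodology; the sketch simply assumes a price index normalized to 100 in the base period.

```python
def real_income(nominal_income, price_index, base=100.0):
    """Express nominal income in base-period prices (index = 100 in the base period)."""
    if price_index <= 0:
        raise ValueError("price index must be positive")
    return nominal_income * base / price_index


# Hypothetical figures: nominal income rises 10 percent while prices rise 5 percent.
income_year0 = real_income(50_000, 100.0)
income_year1 = real_income(55_000, 105.0)

# Growth in real income is smaller than the 10 percent nominal increase.
real_growth = income_year1 / income_year0 - 1
print(f"real income growth: {real_growth:.1%}")
```

Under these assumed numbers, real income grows by a little under 5 percent, which is the kind of comparison across time that the deflated measure is meant to support.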
Another of the NBER’s projects concerns business cycles, work that is empirical and atheoretical, motivated by the idea that one needs to gather an abundance of facts to understand business cycles. In 1947, Tjalling Koopmans made a very strong argument that measurement should be guided by theory, and economists by and large have abided by this ever since, with a standard set of beliefs in common practice.
Willis outlined a number of assumptions that have been very important in the history of economic thought, all of which are quite innocuous on their own. These assumptions include utility-maximizing consumers and profit-maximizing firms in a perfectly competitive market economy, with all quantities and prices being observable. He noted the scientific contribution of measures of price, quantity, and income as follows:
Income and related variables are cardinal measures that can be added, subtracted, multiplied, divided, logged, and exponentiated.
At the micro level, these variables are the outcomes and determinants of the behavior of individuals and firms that economic science seeks to explain.
At the macro level, short-run macroeconomics and long-run studies of economic growth depend on consistent measurement of aggregate quantities over time.
Real income and related measures provide meaningful, interpersonally and intertemporally comparable measures of welfare that can be compared across subgroups.
Willis elaborated on the idea that real income, which is income adjusted for inflation, can be used for economic welfare analyses that are relevant to policy often without knowing very much about individual characteristics or preferences. In discussing data demands for welfare analyses, Willis explained that the method of revealed preferences requires knowledge of the full choice set. An individual’s choice set is determined by his or her income derived from the ownership of resources and the market price of the goods and services available. He noted that data on goods and services consumed and market prices omit much of what goes into people’s preferences. For example, public goods provided by the government enter into the national accounts at cost, since there is no way of valuing them. The environment is a nonmarketed shared resource, nothing of which appears in the national accounts. Individual consumption in families and households, as well as future (or lifetime) consumption based on a set of expectations under states of uncertainty, also are not directly measured. When markets are absent, there is little alternative but to try direct measurement of “output.”
Willis related a frustrating tale told by Angus Deaton at Princeton University about a largely failed attempt to determine trends in the number or proportion of people living on less than $1 per day. Deaton used the purchasing power parity (PPP) approach to develop a measure of the amount of local currency needed to buy $1 worth of goods in countries around the world. Extreme poverty measured in this way is the number of people who live on $1 a day or less. Deaton could not get sensible results until he incorporated measures of self-rated well-being from a Gallup poll. He traced the problem with PPP to a lack of price data on comparable items, such as shirts of comparable quality in Kenya, New York, and London. Willis interpreted Deaton’s experience to reflect not so much the inadequacy of mainstream theory as the difficulty of measuring the variables demanded by the theory.
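The sensitivity of a PPP-based poverty count to the conversion rate can be sketched as follows. This is an illustration under stated assumptions, not Deaton's procedure: the incomes, the PPP rates, and the function name `poverty_headcount` are all hypothetical.

```python
def poverty_headcount(daily_incomes_local, ppp_rate, line_usd=1.0):
    """Count people whose daily income is at or below the poverty line.

    ppp_rate is the amount of local currency needed to buy 1 USD worth of goods,
    so the line in local currency is line_usd * ppp_rate.
    """
    line_local = line_usd * ppp_rate
    return sum(1 for income in daily_incomes_local if income <= line_local)


# Hypothetical daily incomes in local currency for five people.
incomes = [20.0, 35.0, 50.0, 80.0, 120.0]

# Two hypothetical PPP rates, as might result from pricing different baskets.
count_low_rate = poverty_headcount(incomes, ppp_rate=40.0)
count_high_rate = poverty_headcount(incomes, ppp_rate=50.0)
print(count_low_rate, count_high_rate)
```

With the same incomes, the count moves from two people to three when the assumed PPP rate changes, which suggests why price data on noncomparable items can make the resulting trends hard to interpret.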
He turned next to recent development of measurements that fall outside the conventional accounting framework used in economics. He observed that economists are increasingly willing to consider supplementing their market-based measures with subjective ones. Health is a good example for which objective measures are hard to come by, and it is not clear if self-reported measures of health are valid and interpersonally comparable. Anchoring vignettes can be a way to try to disentangle the rating scale from “true” value (see Hauser, 2010).
Willis ended with his belief that economics has developed a powerful method for using market data on prices and quantities to create standardized measures of income and related variables that can be compared across people, countries, and time. They can be aggregated and disaggregated by the economic framework, but the framework fails to account well for goods and services that are produced and consumed outside markets. One approach to deal with this is to develop new measures of choice sets and behavior in implicit markets. Another is to relax the economist’s preference for objective data and revealed preference in favor of subjective measures. He sees very few measures based on implicit markets or subjective measurement ready for standardization in the sense of official statistics. There is great value, he said, in having comparable measures available for research that will allow improvement of new measures. Meanwhile, one needs to recognize the dangers of using imperfect or incomplete standardized measures as guides to policy.
MEASURING HEALTH-RELATED QUALITY OF LIFE
Dennis Fryback (University of Wisconsin, Madison) introduced a typology for health measures and then focused on the need for standardized “health-related quality of life” (HRQoL) indexes.
In his basic typology of health measures, Fryback distinguished between mortality-based and morbidity-based measures. Mortality-based measures are among the easiest to ascertain—life expectancies, whether someone is alive or dead. Morbidity measures and nonfatal outcomes are more difficult to track. The health field tends to rely on morbidity indicators that are usually countable (e.g., tuberculosis rate, Caesarian section rate, percentage of the population that exercises). He briefly reviewed examples of morbidity-based indicators, including Healthy People 2010, the Core Health Indicators of the World Health Organization (WHO), America’s Health Rankings, and the Wisconsin County Health Rankings. Many of these measures either contain too many indicators for a useful overall assessment of progress (e.g., Healthy People 2010, WHO) or arbitrarily sum several indicators to get rankings of states to stimulate policy. Fryback argued for less arbitrary ways of summarization.
One level up are summary health status measures that proxy point-in-time summaries of a person’s health, but with respect to a particular disease or organ. They are sensitive to changes in symptoms or functional impairment due to a particular disease process. Examples include the Arthritis Impact Measurement System (AIMS), the Vision Function Questionnaire, the McGill Pain Questionnaire, and the New York Heart Association Classification.
There are also generic health status measures that aim to obtain a full-spectrum profile of an individual’s health. These use a relatively brief questionnaire that touches on all of the major domains of health (or at least the relatively agreed-on ones) and is not tied to just one disease or organ system. These are useful particularly in measuring the health of people who have multiple disease conditions. The ubiquitous measure throughout the world is perhaps the SF-36 health profile, Fryback said. Its 36 questions cover 8 domains of health,8 with separate scores generated for each of two subscales: the physical component and the mental component.
Of all the generic health status indexes, Fryback favors the HRQoL to represent the overall health of the individual. The scale or score is neither a simple, psychometric sum of items nor a sum of responses to items on a questionnaire. Instead, it reflects preferences for different aspects of health, with 1 = perfect health and 0 = dead. Econometric methods are used to elicit utility weights (preferences) for health states, with average preference weights from a community sample of people. He acknowledged that defining perfect health can be a problem.
Fryback returned to the two areas of concern as health outcomes—morbidity and mortality. Morbidity is how people feel, how health problems affect them, abilities, disabilities, functional capacity, independence, and other aspects of health and well-being. Mortality is how long people live. Health care and health interventions affect both of these aspects of health.
According to Fryback, one summary measure, HRQoL, combines all the aspects of morbidity. A second summary measure, quality-adjusted life expectancy (QALE), combines HRQoL and mortality into a single number. QALE would be the expected number of quality-adjusted life years (QALYs) experienced by a cohort of the same starting age and quality of life. It is perhaps the best estimate of future health-adjusted life years for a random member of that cohort.
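The relationship between HRQoL weights, QALYs, and QALE can be sketched in a few lines. This is a minimal illustration that assumes one HRQoL weight per year lived on the 0 (dead) to 1 (perfect health) scale; the cohort data and the function names `qalys` and `qale` are hypothetical and do not come from any of the indexes discussed.

```python
def qalys(hrqol_by_year):
    """Quality-adjusted life years: the sum of one HRQoL weight per year lived."""
    return sum(hrqol_by_year)


def qale(cohort):
    """Average QALYs across cohort members of the same starting age."""
    if not cohort:
        raise ValueError("cohort must not be empty")
    return sum(qalys(member) for member in cohort) / len(cohort)


# Hypothetical cohort: each list holds one weight per year a member survives.
cohort = [
    [0.9, 0.8, 0.7],       # lives 3 years in declining health
    [1.0, 1.0, 0.9, 0.8],  # lives 4 years
    [0.6],                 # lives 1 year in poor health
]

cohort_qale = qale(cohort)
mean_years_lived = sum(len(member) for member in cohort) / len(cohort)
print(f"mean years lived: {mean_years_lived:.2f}, QALE: {cohort_qale:.2f}")
```

In this toy cohort the average member lives about 2.67 years but accrues about 2.23 QALYs, so the single number reflects both survival and morbidity.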
Fryback shared other uses of QALYs. Canada follows HRQoL over time with a large longitudinal panel as well as with successive cross-sectional population surveys. The U.S. Panel on Cost-Effectiveness in Health and Medicine tried to standardize cost-effectiveness analyses (CEA), calling for something like QALYs as the generic outcome measure for meaningful analysis. Fryback considered CEA to be more prominent in the United Kingdom, where the National Institute for Clinical Excellence uses QALYs as a basis for policy on what gets into the National Health Service, particularly for drug therapies.
Fryback described how cross-sectional samples of individuals’ HRQoL at a point in time can be used for meaningful population health measures. Community averages of HRQoL summarize health at a point in time. Cross-sectional HRQoL data can be combined with mortality data, and life table techniques can be used to weight life expectancy computations (Molla et al., 2001). To illustrate this, he presented data on women in the United States from the 2000 census and the National Health Interview Survey (NHIS). The life expectancy for women ages 55 to 59 at that time was 27.1 years, but the QALE was 20.5 years, about a 25 percent difference. For women 10 years older, ages 65 to 69 at that time, the QALE was 13.8 years, which means that for the cohort between ages 55 and 65, the expected QALY at that time was about 6.7 years (or 20.5 less 13.8 years). It would have been 10 years had the quality of life not degraded during this period.
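The arithmetic behind these figures can be checked directly. The sketch below uses only the three numbers reported in the text and follows the simple subtraction described there; the variable names are illustrative.

```python
# Figures reported in the text for U.S. women in 2000.
life_expectancy_55_59 = 27.1  # remaining years of life at ages 55 to 59
qale_55_59 = 20.5             # quality-adjusted life expectancy at ages 55 to 59
qale_65_69 = 13.8             # quality-adjusted life expectancy at ages 65 to 69

# Share of remaining life expectancy lost to reduced quality of life.
relative_gap = (life_expectancy_55_59 - qale_55_59) / life_expectancy_55_59

# Expected QALYs accrued over the 10 years between the two age groups.
expected_qalys_55_to_65 = qale_55_59 - qale_65_69

# Shortfall relative to 10 years lived in undiminished quality of life.
shortfall = 10 - expected_qalys_55_to_65

print(f"relative gap: {relative_gap:.1%}")
print(f"expected QALYs, ages 55 to 65: {expected_qalys_55_to_65:.1f}")
print(f"shortfall from 10 years: {shortfall:.1f}")
```

The relative gap works out to a little over 24 percent, consistent with the "about 25 percent" cited, and the 10-year interval yields about 6.7 expected QALYs, or roughly 3.3 fewer than 10.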
According to Fryback, the key to making meaningful comparisons over time and across populations is the systematic collection of standardized measures with sufficient sample sizes. To date in the United States, only a few data sets have suitable measures, and only one has committed to longitudinal data collection. He argued that the population data system should facilitate computing QALE over time. This would allow population tracking and measuring improvement in both survival and HRQoL over time. He noted that there are several potential HRQoL indexes available today that have been developed over the past 40 years,9 each with an associated questionnaire varying from 5 to nearly 60 questions, with varying times to completion from 2 to 15 minutes on average. All of these indexes conceive of HRQoL as multidimensional, generally capturing physical, mental, and social functions, as well as experience and feelings vis-à-vis some important symptoms (e.g., pain, anxiety, depression). They all attempt to locate the individual in a multidimensional health space; that multidimensional health state is then scored by some sort of preference-based weighting function based on population data.
The HRQoL indexes all differ. They use different dimensions, or they conceptualize dimensions differently. They rely mostly on Guttman scales or Likert scales to describe dimensions, but they use different categories, different levels, and different numbers of categories. Their scoring functions are based on utility assessments made by people sampled from the populations, but different populations and different econometric methods to elicit these preferences are used. As a result, the indexes are related but different, and each has flaws (e.g., differential coverage and differential sensitivity among health domains, ceiling and floor effects), which may explain why the United States has not adopted a standard HRQoL measure. Perhaps the most contentious issue among the different indexes is where they place the dead. Three of the scales have health states worse than dead.
In an effort to assess the different indexes and how they relate to a common underlying latent scale of health, Fryback et al. (2010) used item response theory in a novel way to put six of them on a common scale and compare them. Two appeared linearly related, but the others showed ceiling effects and therefore were not linearly related. The authors concluded that these indexes are clearly not identical and are imprecisely correlated.
Fryback identified a number of other barriers to adopting a standard HRQoL index for U.S. surveys:
Competing developers and proprietary interests, which discourage U.S. agencies from endorsing a measure that would create a financial winner and losers.
The perceived large incremental response burden to add an entire HRQoL questionnaire onto a national survey, when it can be challenging to add even one or two questions.
U.S. aversion to using weights from other countries.
Lack of interest from NIH institutes that are generally disease- or organ-focused and seek measures sensitive to their issues, with the National Institute on Aging being the exception.
Patient-Reported Outcomes Measurement Information System
PROMIS is part of the NIH Roadmap and uses item response theory (IRT) to scale the different domains of health. Each dimension has its own item bank and scale, which can be improved over time. Fryback described the conceptual framework for health in PROMIS as similar to HRQoL, but the measurement basis is very different. In PROMIS, IRT is used to create a separate psychometric scale for each dimension; there is no combining of scales into a single summary. He reported that PROMIS has developed an Internet-based interface using computer-adaptive testing to minimize response burden. Item banks are now available for only a few of the health dimensions. Once there are item banks for all of the dimensions, PROMIS can use psychometric techniques, including IRT, to scale health items, and the scales can be improved over time as questions are added and refined, because they are not tied to a single questionnaire. The final step, said Fryback, is to implement a scoring function to complete the HRQoL index. He believes there are many reasons to implement standard measures of HRQoL and that PROMIS offers a path for the future.
In his remarks, Jack Triplett (Brookings Institution) discussed the role of theory in economic measurement, but balanced it with a discussion of the limitations of economic theory as a guide to economic measurement. On the measurement of medical care in economics, he tied Fryback’s presentation on medical outcomes measures to problems in the economic measurement of medical care prices and output to show how some measurement problems in economics require information from outside economics.
To illustrate the usefulness of economic theory, he offered three examples of cases in which construction of economic data is guided by economic theory:
Gross domestic product: The basic structure of GDP is given in the equation Y = C + I + G, where Y is income generated, C is consumption expenditures, I is investment expenditures, and G is government expenditure. This equation is a fundamental analytical equation that comes right out of macroeconomic theory developed by Maynard Keynes more than 70 years ago. The basic structure of macroeconomics is still built around this equation. In the national accounts literature, this equation is usually called an accounting identity, but it is properly understood from macroeconomic theory as an equilibrium condition. There is therefore a linkage between the basic macroeconomic theory, the macro structure of the accounts, and macroeconomic analysis, which is based on the theory.
Consumer price index: Triplett noted that the Bureau of Labor Statistics (BLS) considers the CPI to be an approximation to a cost of living index, which is an established concept in economic theory. He added that the BLS regards the producer price index as an approximation to a different economic concept, which is based on the theory of the output price index. The BLS is therefore an example of a statistical agency producing economic series that explicitly correspond to economic theory.
Economic classifications: Triplett recalled that since 1997 the United States (indeed, all of North America) has produced industry classifications that were guided by the economic theory of aggregation. An industry is an aggregation of producing units.
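The first of these examples, the equation Y = C + I + G, can be written out as a small sketch. It follows the three-term form given in the text; the expenditure figures and the function name `gdp_expenditure` are hypothetical and are not drawn from any national accounts.

```python
def gdp_expenditure(consumption, investment, government):
    """Y = C + I + G, the three-term form of the equation given in the text."""
    return consumption + investment + government


# Hypothetical expenditure components, in billions of a currency unit.
components = {"consumption": 700.0, "investment": 180.0, "government": 120.0}
income = gdp_expenditure(**components)

# Each component's share of total income generated.
shares = {name: value / income for name, value in components.items()}

print(f"Y = {income:.0f}")
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Read as an accounting identity, the sum simply defines Y from measured components; read as an equilibrium condition, as Triplett suggested, it states that income generated equals planned spending.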
Triplett also offered examples of economic statistics for which no theory seems to apply. For example, he knew of no economic theory that guides the unemployment rate. Economists use it as a measure of excess supply, but there is no tight linkage between the unemployment rate and the concept of excess supply. He remarked that the questionnaire used as the basis for estimating the unemployment rate is motivated by search theory, not labor supply (that is, it asks if the respondent has looked for work, not the number of hours the respondent wants to work at existing wage rates).
As an additional limit to the application of economic theory to economic measurement, he noted that sometimes economists disagree on the interpretation of theory. In other cases, some economists may deny that a particular theory applies to an economic measurement—this has happened in some discussions of the CPI in recent years.
Triplett turned next to quality differences between goods and services that can undermine cross-country comparisons and inter-temporal comparisons. Constructing any price index or output measure must take into account gradations of quality. In medical care, quality change arises with changes in treatment. The basic unit of measurement for medical care output is a treatment for a disease. However, Triplett pointed out, the treatment can change over time. If treatments are improving, simply counting identical treatments will underestimate the growth in medical output. What is needed to adjust medical care output measures (and medical care price measures) for improved treatments is a medical outcome measure, of the type discussed by Dennis Fryback. The importance of probable mismeasurement of medical care output is highlighted by the fact that current measurement approaches result in reported negative productivity growth in the U.S. medical care industry. This is an area in which improved measurement does not depend on economic theory. What are needed are measures of medical outcomes, like those Fryback discussed. Triplett added that there are many cases in economics in which improvement of an economic measurement depends on getting information from other social and natural sciences.
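A small sketch can show why counting treatments understates output growth when treatments improve. It assumes, purely for illustration, that the outcome of a treatment is summarized as QALYs gained per treatment; that weighting choice, the figures, and the function name `growth` are hypothetical and are not a method Triplett or Fryback presented.

```python
def growth(previous, current):
    """Proportional change from one period to the next."""
    return current / previous - 1


# Hypothetical data: the same number of treatments in both years,
# but each treatment yields a better outcome in the second year.
treatments = {"year0": 1000, "year1": 1000}
qalys_gained_per_treatment = {"year0": 0.50, "year1": 0.55}

# Output measured by a simple count of treatments.
count_growth = growth(treatments["year0"], treatments["year1"])

# Output measured by treatments weighted by their outcome.
adjusted_output = {
    year: treatments[year] * qalys_gained_per_treatment[year] for year in treatments
}
adjusted_growth = growth(adjusted_output["year0"], adjusted_output["year1"])

print(f"count-based growth: {count_growth:.1%}")
print(f"outcome-adjusted growth: {adjusted_growth:.1%}")
```

Under these assumed numbers the count shows no growth while the outcome-adjusted measure grows by 10 percent, which is the direction of bias described above.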
In her discussion, Kathleen Cagney (University of Chicago) distilled some of the main points from Fryback’s presentation and focused on challenges and opportunities related to the measurement of HRQoL. Turning attention to the three classes of HRQoL measures—generic health indices and profiles, disease-specific measures, and preference-based measures—and their interplay, Cagney considered how generic and disease-specific measures focus on the presence, absence, severity, frequency, or duration of symptoms and how these are drawn from psychometric theory, whereas the preference-based measures relevant for assessing preferences of individuals for alternative health states or outcomes are drawn from economic theory and ideas of comprehensiveness and comparability.
Cagney referred to the seminal work of Patrick and Erickson (1993), which defines HRQoL as the value assigned to duration of life as modified by impairments, functional states, perceptions, and social opportunities that are influenced by disease, injury, treatment, or policy. In contrast, the definition offered by the Centers for Disease Control and Prevention assumes that HRQoL is synonymous with health status but also encompasses reactions to coping with life circumstances.
Cagney referred also to the objectives of health status assessment as outlined by Patrick and Erickson (1993): to discriminate among persons at a single point in time, to predict some future outcome or results of a more intrusive or costly criterion measure, and to measure change over time (e.g., cohort study). Consistent with the tenor of Fryback’s presentation, Cagney shared Colleen McHorney’s (1999) observation that the “field of health status assessment is regarded more for how it quantifies and validates health status indicators than for how and why it conceptualizes health.” Cagney considered the SF-36 a standard in health status assessment. It is responsive to 44 disease conditions, and it has been translated into more than 50 languages. However, as McHorney has pointed out, there are 8,360 different ways to score 50 on the SF-36 physical functioning scale, which is only half of the SF-36 measure. What is important in Cagney’s view is to consider the progression of disease over the life course and how one shifts from the initial position of health decline to a later state of physical frailty.
Cagney summarized a number of challenges associated with the HRQoL measure. She highlighted Fryback’s sense that HRQoL scores describe but do not actually value health, a goal that may be informed by the work of PROMIS. Other challenges include difficulty agreeing on a common set of metrics and the need to create or demonstrate valid and reliable measures across population groups. There is also the challenge of assessing change that involves not only aging but also the perception of the change with age, taking into account individual abilities to adapt. Another challenge is the tension between the needs of large-scale survey enterprises and clinical settings. Measures that have emerged in a clinical setting have a different set of goals (to augment clinical decision making) than those in population surveys (to more broadly inform social science and policy), and, in that sense, they may not be robust in a larger social survey setting. Fryback also had observed that rankings seem to mobilize the American psyche. Cagney remarked on the importance of thinking about policy-related goals when using HRQoL measures.
Cagney closed with a summary of opportunities. She saw HRQoL measures as potentially providing insight into geographic variation. These measures also instill a greater appreciation for the role of subjective assessments, as Willis noted in his presentation. She endorsed the idea of potentially triangulating survey data resources with clinical assessments that come from the hospital or from a physician’s office. There is also opportunity to focus on a framework for the study of cultural comparisons, to consider the larger social context and the bridging of mental and physical components, and to operationalize the social component for inclusion in social surveys. Cagney cautioned that even a very simple notion of walking across a small room, which is used as a robust indicator of disability status, is not necessarily translatable. Another opportunity is to think about HRQoL measures in concert with biomarkers. She saw an opportunity for the social sciences to improve the understanding of health, pointing again to the potential of PROMIS and other data sources to augment this understanding.
David Grusky (Stanford University) picked up on the comment that one of the major obstacles to adoption or standardization of HRQoL measures is that there is debate about whether or not to allow respondents to score some states worse than dead. According to Fryback, some measures do not allow for states worse than dead, despite the fact that there is always a small segment of the population that identifies certain conditions (such as chronic unremitting pain, inability to do self-care activities, dependence on others for toileting, dressing, etc.) as worse than dead. Pollak raised two other points that support the reality of states worse than dead: (1) in estate planning, the use of living wills and advance directives reflects an expression of preference for death and (2) suicide suggests that there are states worse than dead.
The challenge is how to incorporate these opinions into a data set and model them mathematically in a nonarbitrary way. The analytic methods are not yet developed, and not all index developers agree that states worse than dead should be allowed. In fact, there are some who refuse to believe that such states exist. According to Fryback, the only tool available right now is to average all the different points of view. In comparing the different indexes, Fryback speculated that the difference is more a matter of preference in scaling than a substantive issue. He wondered if it would make sense to try to reach different preference subgroups with different scales.
Hauser was not convinced that it is necessary to obtain different evaluations for different population subgroups. It struck him that a good quality of life metric would need to demonstrate some invariant properties for the ratings across different populations and different segments of the same geographically defined population. Otherwise, it would be difficult to make sense of those ratings as utilities in an aggregate analysis. Fryback pointed out that there is nothing in the theory to suggest that everyone has the same underlying set of preferences that would lead to such invariance. As there is no way to assign people to one preference or another, he saw no way around needing to ask respondents their preferences. However, he pointed out that many different HRQoL systems order the states and scale them in approximately the same way.
Paul Courtney (National Cancer Institute) expressed concern about the trade-off between an overly reductionist approach and fidelity of measurement. Fryback agreed that the tension between essentially descriptive detail and the ability to summarize aggregated higher levels with standard measures is very real for health measures. His interest has mostly been in the measures that aggregate rather than disaggregate for deep understanding of pathways to outcomes. But he pointed to the WHO aggregate measure, which includes an extensive list of environmental factors (e.g., curb cuts) that can greatly affect the quality of life for someone with restricted mobility. Fryback further believes that the social environment must be included more than it has been. This would include consideration of whether a person can interact with friends, perform a job role, engage in outside social activities, and have intimacy. Fryback saw the potential for PROMIS in reinforcing the idea of standardized patient-reported outcome measures across the NIH and across the broad front of medicine.
Robert Michael (University of Chicago) directed attention to the distinction between standardization and harmonization. Harmonization has more to do with coordination, and it is often encouraged as a way to facilitate joint analyses and thus preferred to rote standardization. Willis described how harmonization has been a major issue for the Health and Retirement Study (HRS), as it has generated comparable studies throughout Europe and Asia. He elaborated that in comparisons of indicators from more than one country, it is important that observed differences be attributable to actual differences in behavior of the people in those countries and not to differences in measurement. Hauser added occupation-based measures of social class as a positive example of harmonization, for which it is relatively easy to obtain all the information needed to produce several different measures in a single survey operation. Willis pointed to the cross-fertilization of ideas across surveys as key to driving innovation in the HRS. He favored keeping studies like it as live scientific enterprises, drawing mutual inspiration from other studies.
Bohrnstedt agreed that harmonization is one way to think about common metrics, but he did not want to neglect the fact that more effort should go into improving measurement, that is, trying to understand some latent construct and how it should best be represented.