SURVEY MEASUREMENT OF WORK DISABILITY: Summary of a Workshop

3

Methodological Issues in the Measurement of Work Disability

Nancy A. Mathiowetz

Joint Program in Survey Methodology

University of Maryland, College Park

The collection of information about persons with disabilities presents a particularly complex measurement issue because of the variety of conceptual paradigms that exist, the complexity of the various paradigms, and the numerous means by which alternative paradigms have been operationalized in different survey instruments (see Chapter 2 by Jette and Badley for a review). For example, disability is often defined in terms of environmental accommodation of an impairment; hence, two individuals with the same impairment may not be similarly disabled or share the same perception of their impairment. For an individual with mobility limitations who lives in an assisted-living environment that accommodates the impairment, the environmental adaptations may result in little or no disability. The same individual living on the second floor of an apartment building with no elevator may have a very different perception of the impairment and may see him- or herself as disabled because of the environmental barriers that exist within his or her immediate environment.

The Social Security Administration (SSA) is currently reengineering its disability claims process for providing benefits to blind and disabled persons under the Social Security Disability Insurance (SSDI) and Supplemental Security Income (SSI) programs. As part of the effort to redesign the claims process, SSA has initiated a research effort designed to address the growth in disability programs, including the design and conduct of the Disability Evaluation Study (DES). The DES will provide SSA with comprehensive information concerning the number and characteristics of persons with impairments severe enough to meet SSA's statutory definition of disability, as well as the number and characteristics of people who are not currently eligible but who could be eligible as a





result of changes in the disability decision process. For those years in which the DES is not conducted, SSA will need to monitor the potential pool of applicants. One means by which SSA can monitor the size and characteristics of potential beneficiaries is through other ongoing federal data collection efforts. For both the conduct of the DES and monitoring of the pool of potential beneficiaries through the use of various data collection efforts, it is critical to understand the measurement error properties associated with the identification of persons with disabilities as a function of the essential survey conditions under which the data have been and will be collected. The extent to which alternative instruments designed to measure persons with disabilities map to various eligibility criteria under consideration by SSA is also important.

BACKGROUND

The collection of disability data is an evolving field. Although a large and growing number of scales attempt to measure functional status and work disability, little is known about the measurement error properties of various questions and composite scales. The empirical literature provides clear evidence of variation in the estimates of the number of persons with disabilities in the United States, depending upon the conceptual paradigm of interest, the analytic objectives of the particular measurement process, and the essential survey conditions under which the information is collected (e.g., Haber, 1990; McNeil, 1993; Sampson, 1997).
This literature suggests that estimates of the disabled population not only are related to the conceptual framework underlying the measurement construct but are also a function of the essential survey conditions under which the measurement occurred, including the specific questions used to measure disability, the context of the questions, the source of the information (self- versus proxy response), variations in the mode and method of data collection, and the sponsor of the data collection effort. Furthermore, terms such as impairment, disability, functional limitation, and participation are often used inconsistently, resulting in different and conflicting estimates of prevalence. Attempts to measure not only the prevalence but also the severity of an impairment or disability further complicate the measurement process.

Recent shifts in the conceptual paradigm of disability, in which disability is viewed as a dynamic process rather than a static measure and as an interaction between an individual with an impairment and the environment rather than as a characteristic only of the individual, imply that those responsible for the development of disability measures must separate the measurement of the impact of environmental factors in the enablement-disablement process from the measurement of ability. Viewing disability as a dynamic state resulting from an interaction between a person's impairment and a particular environmental context further complicates the assessment of the quality of various survey measures of disability, specifically, the reliability of a measure. As a dynamic characteristic, one would anticipate changes in the reports of disability as a function of changes in the individual as well as changes in the social and environmental contexts. The challenge for the measurement process is to disentangle true change from unreliability.

This workshop comes at a time when the federal government is undertaking several initiatives with respect to the measurement of disability in federal data collection efforts. The Americans with Disabilities Act of 1990 (ADA) defines disability as (1) a physical or mental impairment that substantially limits one or more of the major life activities of the individual, (2) a record of a substantially limiting impairment, or (3) being regarded as having a substantially limiting impairment. Although the measurement of disability within household surveys is not bound by the ADA definition, the passage of the ADA provides a socio-environmental framework for how society comprehends and uses terms such as disability and impairment (e.g., in the popular press and in court rulings on ADA-related litigation). These definitions will evolve as a function of litigation related to ADA legislation and presentation of that litigation in the press. Hence, society is entering a period in which potential dynamic shifts in the comprehension and interpretation of the language associated with the measurement of persons with disabilities can be anticipated.

The paper presented in this chapter is intended to serve as a means of facilitating discussion among individuals from diverse theoretical and empirical disciplines concerning the methodological issues related to the measurement of persons with disabilities. As a first step toward achieving this goal, a common language and framework need to be established for the enumeration and assessment of the various sources of error that affect the survey measurement process.
The chapter draws from several empirical investigations to provide evidence as to the extent of knowledge concerning the error properties associated with various approaches to the measurement of functional limitations and work disability.

SOURCES OF ERROR IN THE SURVEY PROCESS: THE SURVEY RESEARCH PERSPECTIVE

For the purpose of defining a framework that can be used to examine error associated with the measurement of persons with disabilities, I draw upon the conceptual structure and language used by Groves (1989), based on earlier work of Kish (1965) and used by Andersen et al. (1979). Suchman and Jordan (1990) have described errors in surveys as the discrepancy between the concept of interest to the researcher and the quantity actually measured in the survey. Bias, according to Kish (1965, p. 509), refers to systematic errors in a statistic that affect any sample taken under a specified survey design with the same constant error or, as stated by Groves (1989), is the type of error that affects the statistic in all implementations of a survey. Variable errors are those errors that are specific to a particular implementation of a design, that is, specific to the particular trial.
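The bias versus variable error distinction can be illustrated with a small simulation of repeated survey implementations. This is only a sketch: the population size, true prevalence of 0.15, sample size, and the 20 percent chance that a person with a disability fails to report it are all invented for illustration.

```python
# Sketch: bias vs. variable error over repeated implementations of a survey.
# Population values and the underreporting rate are invented assumptions.
import random

random.seed(3)

# Hypothetical population: 1 = person has a work disability (prevalence 0.15).
population = [1] * 1_500 + [0] * 8_500

def one_trial(n=400, miss_rate=0.2):
    """One implementation: draw a sample; each person with a disability
    fails to report it with probability miss_rate (a systematic error)."""
    sample = random.sample(population, n)
    reported = [x if (x == 0 or random.random() > miss_rate) else 0
                for x in sample]
    return sum(reported) / n

estimates = [one_trial() for _ in range(500)]
mean_est = sum(estimates) / len(estimates)
spread = (sum((e - mean_est) ** 2 for e in estimates)
          / (len(estimates) - 1)) ** 0.5

# Bias: a constant shortfall present in essentially every trial
# (mean estimate near 0.12 rather than the true 0.15).
print(f"true prevalence 0.150, mean estimate {mean_est:.3f}")
# Variable error: trial-to-trial spread around that mean.
print(f"std. deviation across trials {spread:.3f}")
```

The systematic underreporting shifts every trial in the same direction (bias), while the sampling of different respondents in each trial produces spread around that shifted value (variable error).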

The concept of variable error requires the possibility of repeating the survey, with changes in the units of replication, that is, the particular set of respondents, interviewers, supervisors, coding, editing, and data entry staff.

Errors of Nonobservation

Within the framework of survey methodology, both variable error and bias are further characterized in terms of errors of nonobservation and errors of observation. As one would expect from the term, errors of nonobservation reflect failure to obtain observations for some segment of the population or for all elements to be measured. Errors of nonobservation are most often classified as arising from three sources: sampling, coverage, and nonresponse.

Sampling Error

Sampling error represents one type of nonobservation variable error; it arises from the fact that measurements (observations) are taken for only a subset of the population. Sampling variance refers to changes in the value of some statistic over possible replications of a survey in which the sample design is fixed but different individuals are selected for the sample. Estimates based on a particular sample will not be identical to estimates based on a different subset of the population (selected in the same manner) or to estimates based on the full population.

Coverage Error

Coverage error defines the failure to include all eligible population members on the list or frame used to identify the population of interest. Those members not identified on the frame have a zero probability of selection and are never measured. For example, in the United States, approximately 5 percent of the population live in households without telephone service; any survey that is conducted by telephone and that attempts to describe the entire household-based population of the United States therefore suffers from coverage error.
To the extent that those without telephones differ from those with telephones for the construct of interest, the resulting estimates will be biased.

Nonresponse Error

Nonresponse error can arise from failure to obtain any information from the persons selected to be measured (unit nonresponse) or from failure to obtain complete information from all respondents to a particular question (item nonresponse). The extent to which nonresponse affects survey statistics is a function of both the rate of nonresponse and the difference between respondents and nonrespondents, as illustrated in the following formula:

bias(ȳ_r) = (m/n)(Ȳ_r − Ȳ_m)

where n is the total sample size, m is the number of nonrespondents, and Ȳ_r and Ȳ_m are the respective means for respondents and nonrespondents.
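Nonresponse bias of a mean can be written as (m/n)(Ȳ_r − Ȳ_m), the nonresponse rate times the respondent–nonrespondent difference, and noncoverage bias has the same rate-times-difference structure. A minimal numeric sketch, in which every prevalence figure is a hypothetical value chosen for illustration:

```python
# Sketch: nonresponse (and coverage) bias as rate x group difference.
# All prevalence values below are hypothetical.

def missingness_bias(missing_rate, mean_observed, mean_missing):
    """Bias of the observed mean: (m/n) * (Ybar_observed - Ybar_missing).
    The same expression applies to noncoverage (e.g., nontelephone
    households) and to unit nonresponse."""
    return missing_rate * (mean_observed - mean_missing)

# Coverage: 5% of households lack telephones; assume disability prevalence
# of 0.14 among telephone households and 0.30 among nontelephone households.
coverage_bias = missingness_bias(0.05, 0.14, 0.30)

# Nonresponse: a 10% nonresponse rate with a large respondent-nonrespondent
# gap vs. a 35% rate with a small gap.
bias_high_rr = missingness_bias(0.10, 0.14, 0.40)
bias_low_rr = missingness_bias(0.35, 0.15, 0.16)

print(f"coverage bias      {coverage_bias:+.4f}")
print(f"90% response rate  {bias_high_rr:+.4f}")   # larger bias
print(f"65% response rate  {bias_low_rr:+.4f}")    # smaller bias
```

Note that the survey with the higher response rate carries the larger bias here, because its nonrespondents differ more sharply from its respondents.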

Knowing the response rate is not sufficient to determine the level of nonresponse bias; studies with both high and low rates of nonresponse can suffer from nonresponse bias. As noted by Groves and Couper (1998), it is useful to further distinguish among the types of unit nonresponse, each of which may be related to the failure to measure different types of persons. For most household data collection efforts involving interviewers, the final outcome of an interview attempt is often classified into one of the following four categories: completed or partial interview, refusal, noncontact, and other noninterview.1 Survey design features can affect the distribution of cases across the various categories. Noncontact rates are affected by the length of the field period (short field periods result in higher noncontact rates than longer field periods). Surveys that place greater demands on the respondent may suffer from higher refusal rates than less burdensome instruments. The choice of respondent rule affects the rate of nonresponse; designs that permit any knowledgeable adult within the household to serve as the respondent provide an interviewer with some flexibility, should one adult within the household refuse or be unable to participate. Field efforts that fail to accommodate non-English-speaking respondents or that focus their attention on frail subpopulations tend to experience higher rates of other noninterviews.

Errors of Observation

Observational errors can arise from any of the elements directly engaged in the measurement process, including the questionnaire, the respondent, and the interviewer, as well as the characteristics that define the measurement process (e.g., the mode and method of data collection). This section briefly reviews the theoretical framework and empirical findings related to the various sources of measurement error in surveys.
1   Other noninterview is used to classify cases in which contact was made with the members of the household in which the sample person resides, but for reasons such as physical or mental health, language difficulties, or other reasons not related to reluctance to participate, the interviewer was unable to conduct the interview.
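The four final-outcome categories above combine into simple outcome rates. The case counts in this sketch are hypothetical, and the unweighted rate shown is the simplest possible version; production surveys use more refined definitions (e.g., the AAPOR standard definitions, which handle unknown eligibility).

```python
# Sketch: combining the four final-outcome categories into outcome rates.
# Case counts are invented for illustration.

cases = {
    "interview": 820,           # completed or partial interviews
    "refusal": 95,
    "noncontact": 60,
    "other_noninterview": 25,   # health, language, etc. (see footnote 1)
}

eligible = sum(cases.values())
rates = {outcome: count / eligible for outcome, count in cases.items()}

for outcome, rate in rates.items():
    print(f"{outcome:<18} {rate:.3f}")
```

Tracking the refusal, noncontact, and other-noninterview rates separately, rather than the response rate alone, shows which design features (field period length, respondent burden, language accommodation) are driving the nonresponse.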

Questionnaire as Source of Measurement Error

Tourangeau (1984) and others (see Sudman et al. [1996] for a review) have categorized the survey question-and-answer process as a four-step process involving comprehension of the question, retrieval of information from memory, assessment of the correspondence between the retrieved information and the requested information, and communication of the response. In addition, the encoding of information, a process outside the control of the survey interview, determines a priori whether the information of interest is available for the respondent to retrieve.

Comprehension of the question involves the assignment of meaning to the question by the respondent. Ideally, the question will convey the meaning of interest to the researcher. However, several linguistic, structural, and environmental factors affect the interpretation of the question by the respondent. These factors include the specific wording of the question, the structure of the question, the order in which the questions are presented, the overall topic of the questionnaire, whether the question is read by the respondent (self-administration) or is presented to the respondent by an interviewer, and the mode of communication used by the interviewer (that is, telephone versus face-to-face presentation).

The wording of a question is often seen as one of the major problems in survey research: although one can standardize the language read by the respondent or the interviewer, standardization of the language does not imply standardization of the meaning. For example, “Do you own a car?” appears to be a simple question from the perspective of semantics and structure.
However, several of the words in the question are subject to variation in interpretation, including “you” (just the respondent or the respondent and his or her family), “own” (completely paid for, purchased as opposed to rented), and even the word “car” (does this include vans and trucks?). The goal for the questionnaire designer is to develop questions that exhaust the range of possible interpretations, making sure that the particular concept of interest is the concept that the respondent has in mind when responding to the item.

One source of variation in a respondent's comprehension of survey questions is due to differences in the perceived intent or meaning of the question. Perceived intent can be shaped by the sponsorship of the survey, the overall topic of the questionnaire, or the environment more immediate to the question of interest, such as the context of the previous question or set of questions or the specific response options associated with the question.

Respondent as Source of Measurement Error

Once the respondent comprehends the question, he or she must retrieve the relevant information from memory, make a judgment as to whether the retrieved information matches the requested information, and communicate a response.

Much of the measurement error literature has focused on the retrieval stage of the question-answering process, classifying the lack of reporting of an event as retrieval failure on the part of the respondent and comparing the characteristics of events that are reported with those that are not reported. Several factors have been found to be related to the quality of reporting, including the length of the reference period of interest and the salience of the information. For example, the literature suggests that the greater the length of the recall period, the greater the expected bias in the reporting of episodic information (e.g., Cannell et al., 1965; Sudman and Bradburn, 1973). Salience is hypothesized to affect the strength of the memory trace and, subsequently, the effort involved in retrieving the information from long-term memory. The weaker the trace, the greater the effort needed to locate and retrieve the information.

As part of the communication of the response, the respondent must determine whether he or she wishes to reveal the information as part of the survey process. Survey instruments often ask questions about socially and personally sensitive topics. It is widely believed and well documented that such questions elicit patterns of underreporting (for socially undesirable behaviors and attitudes), as well as overreporting (for socially desirable behaviors and attitudes). The determination of social desirability is a dynamic process and is a function of the topic of the question, the immediate social context, and the broader social environment at the time the question is asked. Even if the respondent is able to retrieve accurate information, he or she may choose to edit this information at the response formation stage as a means of reducing the costs associated with revealing the information.
The use of proxy reporters, that is, asking individuals within sampled households to provide information about other members of the household, is a design decision that is often framed as a trade-off among costs, sampling errors, and nonsampling errors. The use of proxy informants to collect information about all members of a household can increase the sample size (and hence reduce the sampling error) at a lower marginal data collection cost than increasing the number of households. The use of proxy respondents also facilitates the provision of information for those who would otherwise be lost to nonresponse because of an unwillingness or inability to participate in the survey interview. However, the cost associated with the use of proxy reporting may be an increase in the rate of errors of observation associated with poorer-quality reporting for others compared with the quality that would have been obtained under a rule of all self-response.

Most of the evaluations of the quality of proxy responses compared with the quality of self-reports have focused on the reporting of autobiographical information (e.g., Mathiowetz and Groves, 1985; Moore, 1988), with some recent investigations examining the convergence of self and proxy reports of attitudes (Schwarz and Wellens, 1997). The literature is, however, for the most part silent with respect to the quality of proxy reports for personal characteristics, the exception being a small body of literature that addresses self-reporting versus proxy reporting effects in the reporting of race/ethnicity (Hahn et al., 1996) and the reporting of activities of daily living (e.g., Mathiowetz and Lair, 1994; Rodgers and Miller, 1997). The findings suggest that proxy reports of functional limitations tend to be higher than self-reports; the research is inconclusive as to whether the discrepancy is a function of overreporting on the part of proxy informants, underreporting on the part of self-respondents, or both.

Interviewers as Sources of Measurement Error

For interviewer-administered questionnaires, interviewers may affect the measurement process in one of several ways, including failure to read the question as written; variation in interviewers' ability to perform the other tasks associated with interviewing, for example, probing insufficient responses, selecting appropriate respondents, and recording the information provided by the respondent; and demographic, socioeconomic, and voice characteristics that influence the behavior of the respondent and the responses provided by the respondent. The first two factors contribute to measurement error from a cognitive or psycholinguistic perspective in that different respondents are exposed to different stimuli; thus, variation in responses is, in part, a function of the variation in stimuli. All three factors suggest that the interviewer effect contributes to an increase in variable error across interviewers. If all interviewers erred in the same direction (or their characteristics resulted in errors of the same direction and magnitude), interviewer bias would result. For the most part, the literature indicates that among well-trained interview staff, interviewer error contributes to the overall variance of estimates as opposed to resulting in biased estimates (Lyberg and Kasprzyk, 1991).
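The interviewer contribution to variable error is commonly summarized as an intraclass correlation: the share of response variance attributable to interviewers. The sketch below simulates interviewer effects (the workload sizes, effect standard deviation of 0.05, and response noise of 0.30 are all invented assumptions) and recovers the intraclass correlation with the standard one-way ANOVA estimator.

```python
# Sketch: estimating the interviewer intraclass correlation (rho) from a
# one-way ANOVA layout.  Interviewer effects are simulated, not real data.
import random

random.seed(11)

k, m = 20, 25   # 20 interviewers, 25 respondents per interviewer
workloads = []
for _ in range(k):
    effect = random.gauss(0, 0.05)   # systematic interviewer-specific shift
    workloads.append([effect + random.gauss(0, 0.30) for _ in range(m)])

grand = sum(sum(w) for w in workloads) / (k * m)
means = [sum(w) / m for w in workloads]

# One-way ANOVA mean squares: between interviewers and within workloads.
ms_between = m * sum((mi - grand) ** 2 for mi in means) / (k - 1)
ms_within = sum((x - mi) ** 2
                for w, mi in zip(workloads, means)
                for x in w) / (k * (m - 1))

# Standard estimator of the intraclass correlation.
rho = (ms_between - ms_within) / (ms_between + (m - 1) * ms_within)
print(f"estimated interviewer rho {rho:.3f}")   # true value here is ~0.027
```

Even a rho of a few percent inflates the variance of estimates noticeably when each interviewer completes a large workload, which is why interviewer training and workload limits matter.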
Other Essential Survey Conditions as Sources of Measurement Error

Any data collection effort involves decisions concerning the features that define the overall design of the survey, referred to here as the “essential survey conditions.” In addition to the sample design and the wording of individual questions and response options, these decisions include the following: whether to use interviewers or to collect information via some form of self-administered questionnaire; the means for selecting and training interviewers (if applicable); the mode of data collection for interviewer administration (telephone versus face to face); the method of data collection (paper and pencil, computer assisted); whether to contact respondents for a single interview (cross-sectional design) or follow respondents over time (longitudinal or panel design); for longitudinal designs, the frequency and periodicity of measurement; the identification of the organization for whom the data are collected; and the identification of the data collection organization.

No single design feature is clearly superior with respect to overall data quality. For example, as noted above, interviewer variance is one source of variability that can be eliminated through the use of a self-administered questionnaire. However, the use of an interviewer may aid in the measurement process by providing the respondent with clarifying information or by probing insufficient responses. The use of a panel survey design, with repeated measurements with the same individuals, facilitates more efficient estimation of change over time (compared with the use of multiple cross-sectional samples); however, panel designs may be subject to higher rates of nonresponse (as a result of nonresponse at every round of data collection) or panel conditioning bias, an effect in which respondents alter their reporting behavior as a result of exposure to a set of questions during an earlier interview.

The following scenario is an illustration of statistical measures of error used by survey methodologists. Assume that the measure of interest is personal earnings among all adults in the United States. A “true value” exists if the construct of interest is carefully defined. The data will be collected as part of a household-based health survey being conducted by telephone. The decision to use the telephone for data collection implies that approximately 5 percent of the adults will not be eligible for selection.
To the extent that the personal earnings of adults without telephones differ significantly from those of adults with telephones, population-based estimates for the entire adult population will suffer from coverage bias. Similarly, not all eligible sample persons will participate in the interview because of refusal to cooperate, an inability on the part of the survey organization to contact the respondent, or other reasons, such as language barriers or poor health that limits participation. Once again, to the extent that the earnings of those who participate differ significantly from the earnings of those who do not participate, population-based estimates of earnings will suffer from nonresponse bias.

If all respondents misreport their earnings, underreporting them by 10 percent, and consistently do so in response to repeated measures, the measure will be reliable but not valid, and population estimates based on the question (e.g., population means) will be biased. However, multivariate model-based estimates that examine the relationship between earnings and human capital investment would not be biased, since all respondents erred in the same direction and relative magnitude. Differential response error, for example, the overreporting of earnings by low-income individuals and the underreporting of earnings by high-income individuals, may produce unbiased population estimates (e.g., mean earnings per person) but biased model-based estimates related to individual behavior.

MEASUREMENT ERROR: THE PSYCHOMETRIC PERSPECTIVE

The language and concepts of measurement error in psychometrics are different from those used within the fields of survey methodology and statistics. The focus for psychometrics is on variable errors; from the perspective of classical true score theory, all questions produce unbiased estimates, but not necessarily valid estimates, of the construct of interest. Confusion arises in that both statistics and psychometrics use the terms validity and reliability sometimes to refer to very similar concepts and sometimes to refer to concepts that are quite different.

Within psychometrics, the terms validity and reliability are used to describe two types of variable error. Validity refers to “the correlation between the true score and the respondent's answer over trials” (Groves, 1991, p. 8). From this perspective, validity can be assessed only for a population of respondents, whereas the survey methodological literature assesses the validity of both population estimates and individuals' responses. Reliability refers to the ratio of the true score variance to the observed variance, where variance refers to variability over persons in the population and over trials within a person (Bohrnstedt, 1983). Once again, the measurement of reliability from this perspective does not yield a measure for an individual person but rather a measure of reliability specific to the particular set of individuals for whom the measurement was taken.
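The earnings scenario above (uniform 10 percent underreporting versus differential error by income level) can be simulated directly. This is only a sketch: the education range, earnings equation, and the exact 1.10/0.90 differential error rule are invented assumptions chosen to mimic the two cases described in the text.

```python
# Sketch of the earnings scenario: uniform 10% underreporting biases the
# population mean but leaves a correlation intact, while differential
# error can do the reverse.  All data are simulated.
import random

random.seed(7)

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

mean = lambda v: sum(v) / len(v)

educ = [random.uniform(8, 20) for _ in range(2_000)]                 # years
earn = [15_000 + 3_000 * e + random.gauss(0, 5_000) for e in educ]   # true

# (a) Reliable but not valid: everyone underreports by exactly 10%.
const_err = [y * 0.90 for y in earn]

# (b) Differential error: below-median earners overreport, above-median
# earners underreport (roughly preserving the mean).
med = sorted(earn)[len(earn) // 2]
diff_err = [y * (1.10 if y < med else 0.90) for y in earn]

print(f"true         mean {mean(earn):>9,.0f}  corr {corr(educ, earn):.3f}")
print(f"constant     mean {mean(const_err):>9,.0f}  corr {corr(educ, const_err):.3f}")
print(f"differential mean {mean(diff_err):>9,.0f}  corr {corr(educ, diff_err):.3f}")
```

Constant error shifts the mean but, being a pure rescaling, leaves the education–earnings correlation unchanged; the differential error leaves the mean roughly intact while weakening the correlation, damaging model-based estimates instead.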
The psychometric literature identifies several means by which validity can be assessed; the choice of measures is, in part, a function of the purpose of the measurement. These measures of validity include content, construct, concurrent, predictive, and criterion validity. If one considers that the questions included in a particular instrument represent a sampling of all questions that could have been included to measure the construct of interest, content validity refers to the comprehensiveness as well as the relevance of those questions, that is, the extent to which the question or questions reflect the domain or domains in the conceptual definition. Face validity refers to the extent to which each item appears to measure that which it purports to measure. Cognitive interviewing techniques that focus on the comprehension of items by respondents are, to some extent, a test of face validity.

Criterion-related validity evaluates the extent to which the measure of interest correlates highly with a “gold standard.” The gold standard could consist of a different self-reported measure, a behavioral measure, or an observation or evaluation outside the measurement process (e.g., a clinical evaluation). Criterion-related validity is further categorized as concurrent validity or predictive validity. Concurrent validity refers to the correlation between the item of interest and some other item, event, or behavior measured at the same point in time, whereas predictive validity refers to the correlation between an indicator measured at time t and some other measure, event, or behavior measured at time t + 1. When no gold standard exists, validity is evaluated in terms of the correlation between the measure of interest and other measures, according to theory-based hypotheses. As noted by McDowell and Newell (1996), “construct validation begins with a conceptual definition of the topic or construct to be measured, indicating the internal structure of its components and the theoretical relationship of scale scores to external criteria” (p. 33).

Measures of reliability include internal consistency (often referred to as coefficient alpha or Cronbach's alpha), test-retest reliability, and interrater reliability. Internal consistency measures the extent to which all items in a scale measure the same underlying concept; it is applicable only to multi-item scales. The reliability coefficient is a function of both the extent to which the items are homogeneous and the number of items in the scale; the coefficient increases with an increase in either the homogeneity of the items or the number of items. Test-retest reliability involves the measurement of the same person under the same measurement conditions at two points in time and can be used for single-item measures as well as multi-item scales.2 Interrater reliability refers to the consistency with which different raters or observers rating the same person agree with one another.
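Cronbach's alpha can be computed directly from its definition: alpha = (k/(k − 1)) × (1 − Σ item variances / variance of the total score). The sketch below uses a toy matrix of invented scale scores (five respondents answering four items) purely to show the mechanics.

```python
# Sketch: Cronbach's alpha (internal consistency) on invented toy data.

def variance(values):
    """Sample variance (n - 1 denominator)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

def cronbach_alpha(items):
    """items: list of k item-score lists, one list per item.
    alpha = (k / (k - 1)) * (1 - sum(item variances) / var(total score))."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # per-person totals
    item_var = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

# Four scale items answered by five respondents (rows = items); the
# respondents answer the items consistently, so alpha should be high.
items = [
    [4, 2, 5, 3, 1],
    [4, 3, 5, 2, 1],
    [5, 2, 4, 3, 2],
    [4, 2, 5, 3, 1],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha {alpha:.2f}")
```

Test-retest reliability, by contrast, reduces to correlating the same measure taken from the same persons at two points in time, and interrater reliability to the agreement between raters scoring the same persons.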
Returning to the example of the measurement of earnings to illustrate the measurement error properties of the construct in psychometric terms, assume that the question or questions designed to measure earnings are both comprehensive and relevant. The questions would therefore be assessed as having content validity (and face validity). If, as noted above, all respondents underreported their earnings by 10 percent, the construct would score lower with respect to criterion validity; but because all respondents erred in the same direction and with the same magnitude, the indicator would still have construct validity. If repeated measurement resulted in consistent reports by all respondents, test-retest measures would indicate a high degree of reliability, a conclusion not dissimilar to that drawn by statisticians.

POTENTIAL SOURCES OF MEASUREMENT ERROR SPECIFIC TO PERSONS WITH DISABILITIES

Similar to any other measurement of persons via the survey process, the identification of persons with disabilities is subject to the various sources of error

2   Within survey research, the conduct of a reinterview under the same essential survey conditions as the original interview is an example of a test-retest assessment of reliability.

disability. The second concept of interest involves the notion of social identity and the groups, statuses, and social categories to which the members of society are recognized as belonging. If the social identity category is ambiguous, the self-concept related to that social identity will also be ambiguous. As noted by Jette and Badley in Chapter 2, the measurement of disability is often presented in surveys as an “all or nothing phenomenon.” This approach to the measurement of disability assumes (1) that the respondent recognizes and identifies with the socially defined label and (2) that he or she is willing to reveal membership in the group. If disability were an “all-or-nothing” phenomenon, identification with the classification would be less ambiguous; however, as already noted, the enablement-disablement process is a dynamic one, subject to variation as a function of both self and society. To the extent that identification or affiliation with group membership carries with it any type of social stigma, willingness to reveal membership in the group also carries a social cost, not unlike other phenomena subject to social desirability bias. Ambiguous social classification categories are also more likely to be subject to context effects; respondents use the specific wording of questions, the immediately prior questions, or the overall focus of the questionnaire as a means of interpreting questions on disability.

From a theoretical perspective, it is therefore not surprising to find that estimates of the number of persons with disabilities vary as a function of differences in the specific wording of the question, the number of questions used to determine the prevalence and severity of impairments and disabilities, the context of the questions immediately proximate to the question of interest, and the overall focus of the questionnaire (health versus employment versus program participation).
EMPIRICAL EVIDENCE CONCERNING ERROR IN THE MEASUREMENT OF DISABILITY

To date, most investigations of the error properties associated with the measurement of persons with disabilities or the measurement of persons with work disabilities have focused on errors of observation, ignoring differences in estimates due to coverage error and nonresponse error. This review of the empirical literature is therefore focused on errors of observation. As an illustration of the type of empirical investigation concerning error in the measurement of disability, this section begins by examining the work that has been done to date with respect to measures of activities of daily living (ADLs). The intent is to provide an illustration of the type of work that has been done (and not done) with respect to a frequently used measure of functional limitation. The focus then turns to the measurement of persons with work disabilities.

Measurement of ADLs, Functional Limitations, and Sensory Impairments

Although there are several different measurement methods for the assessment of physical disability, one of the most often used (within the context of survey measurement) is the Index of Activities of Daily Living, often referred to as the Index of ADL (Katz et al., 1963). The index was originally developed to measure the physical functioning of elderly and chronically ill patients, but several national surveys of the general population administer the index to adults of all ages. The index assesses independence in six activities: bathing, dressing, toileting, transferring from a bed or chair, continence, and feeding.

Despite its wide acceptance and use, the psychometric properties of the index have not been well documented. Brorsson and Asberg (1984) reported reliability scores of 0.74 to 0.88 (based on 100 patients). Katz et al. (1970) applied the Index of ADL as well as other indexes to a sample of patients discharged from hospitals for the chronically ill and reported correlations between the index and a mobility scale and between the index and a confinement measure of 0.50 and 0.39, respectively. Most assessments of the Index of ADL have examined the predictive validity of the index with respect to independent living (e.g., Katz and Akpom, 1976) or the length of hospitalization and discharge to home or death (e.g., Asberg, 1987). These studies indicate relatively high levels of predictive validity. Despite the psychometric findings, a growing body of survey literature suggests that the measurement of functional limitations via ADL scales is subject to substantial amounts of measurement error and that measurement error is a significant factor in the apparent improvement or decline in functional health observed in longitudinal data.
Jette (1994) found that minor changes in the wording of the questions resulted in significant differences in the percentage of the population identified as being limited. Rodgers and Miller (1997) directly compared responses by the same respondents (or, more specifically, for the same target individuals) using different sets of ADL items and different modes.5 They conclude that measurements of functional limitations with respect to counts of ADLs, indications of the use of assistive devices or personal help, and indications of any difficulty are all subject to large amounts of measurement error, of which a substantial portion is random error. Similar to other empirical work (e.g., Mathiowetz and Lair, 1994), their findings indicate that the use of proxy respondents results in higher levels of reporting, of which only 25 to 33 percent can be explained by demographic characteristics and health variables of the target individual. The finding suggests that higher levels of functional limitations reported by proxy respondents are not simply a result of selection bias, in which those with the most severe limitations are reported by proxy.6 Their analyses also suggest that there was no clear effect of mode of data collection on estimates of functional limitations.

5   Note, however, that the allocation across modes was not experimentally varied but rather was an artifact of the design, in which older respondents (80 years and older) were assigned to the face-to-face mode of data collection and those less than 80 years of age were assigned to the telephone mode. However, a substantial number of respondents were interviewed in a mode other than the one to which they were originally assigned; the crossover permits determination of both main and interaction effects related to the mode of data collection.

As illustrative of the variability and lack of reliability evident in survey estimates of functional limitations, Tables 3-1 and 3-2 present findings from the 1990 decennial census and the Content Reinterview Survey (CRS) (U.S. Bureau of the Census, 1993; McNeil, 1993). The CRS was conducted approximately 5 to 9 months after the 1990 decennial census, with a sample of 15,000 housing units selected from among those housing units assigned to complete the long form of the census. With respect to mobility limitations, estimates from the two surveys appear similar (2.03 versus 2.05 percent), but examination of the responses for individuals indicates a low rate of consistent responses (less than 50 percent) among those who reply affirmatively in either survey. With respect to personal care limitations, once again, a high rate of inconsistency in the responses is seen among individuals who respond affirmatively to the question in either survey.

TABLE 3-1. Mobility Limitations: Distributions to Census Question 19a and Content Reinterview Survey Question 34a, Persons 16 to 64 Years of Age, United States, 1990

                                   CRS: Difficulty Going Outside
Census Long Form:
Difficulty Going Outside        Yes         No      Total
  Yes                           146        152        298
  No                            155     14,194     14,346
  Total                         301     14,346     14,647

NOTE: Prevalence rate based on the census: 2.03 percent, of which 49.0 percent were consistent responses. Prevalence rate based on the Content Reinterview Survey: 2.05 percent, of which 48.5 percent were consistent responses.
SOURCE: McNeil, 1993.
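The summary figures in the note to Table 3-1 follow directly from the cell counts; a short sketch makes the arithmetic explicit (counts transcribed from the table).

```python
# Deriving the Table 3-1 summary figures from its 2x2 cell counts
# (census response x CRS response on difficulty going outside).
yes_yes, yes_no = 146, 152    # census "yes" row
no_yes, no_no = 155, 14_194   # census "no" row

total = yes_yes + yes_no + no_yes + no_no   # all persons in the comparison
census_yes = yes_yes + yes_no               # "yes" on the census long form
crs_yes = yes_yes + no_yes                  # "yes" on the CRS

census_prev = 100 * census_yes / total          # prevalence per census (~2.03)
crs_prev = 100 * crs_yes / total                # prevalence per CRS (~2.06)
consistent_census = 100 * yes_yes / census_yes  # consistent among census "yes" (~49.0)
consistent_crs = 100 * yes_yes / crs_yes        # consistent among CRS "yes" (~48.5)
```

The two marginal prevalence rates nearly coincide even though fewer than half of the affirmative responses agree at the individual level, which is exactly the pattern the text describes.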
For example, among those 16 to 64 years of age, the large majority (83.4 percent) of those who report a self-care limitation at the time of the census fail to report a self-care limitation in the CRS. Comparison of the percentage of persons with mobility and self-care limitations from the two surveys is confounded by differences in the essential survey conditions under which the data were collected, differences that most likely contribute to the discrepancies evident in the data. These differences include:

6   In comparisons of self-reports and proxy reports with clinical evaluations, Rubenstein et al. (1984) found self-responses to be more “optimistic” and responses obtained by proxy to be more pessimistic, findings that suggest that both self and proxy responses are subject to measurement error, albeit in different directions.

OCR for page 28
TABLE 3-2. Self-Care Limitations: Distributions to Census Question 19b and Content Reinterview Survey Question 34b, Persons 16–64 Years of Age, United States, 1990

                                              CRS: Difficulty Taking Care
                                              of Personal Needs
Census Long Form:
Difficulty Taking Care of Personal Needs     Yes        No      Total
  Yes                                         69       346        415
  No                                         120    13,856     13,976
  Total                                      189    14,202     14,391

NOTE: Prevalence rate based on the census: 2.9 percent, of which 16.6 percent were consistent responses. Prevalence rate based on the Content Reinterview Survey: 1.3 percent, of which 36.5 percent were consistent responses.
SOURCE: McNeil, 1993.

• Differences in the mode of data collection. The decennial census is, for the most part, a self-administered questionnaire, whereas the CRS is interviewer administered, conducted either by telephone (84 percent) or face to face (16 percent). McHorney et al. (1994) report that telephone administration of the SF-36 led to lower levels of reporting of chronic conditions and self-reports of poor health compared with a self-administered version of the SF-36.

• Differences in the context in which the questions were asked. Although the wording of the specific items is almost the same with respect to mobility limitations and self-care limitations, as can be seen from a comparison of the two questionnaires, the context in which the questions are asked differs in the two instruments. Several additional questions concerning sensory impairments, the use of assistive devices for mobility, mobility limitations related to walking a quarter mile or climbing a flight of steps, and the ability to lift and carry objects weighing up to 10 pounds precede the items of interest in the CRS. A large body of literature documents the existence of context effects in attitude measurement (e.g., Schuman and Presser, 1981).
The asking of additional questions could prime the respondent to think about impairments that he or she did not consider while answering the census questions, thereby resulting in increased reporting of limitations. Alternatively, having just answered questions about a number of sensory impairments and limitations, respondents may assume that the more general question is intended to capture information not already reported; in this case one would expect the CRS estimates to be lower than those based on the census form. (See Sudman et al. [1996] for a review of the theoretical underpinnings of context effects and a thorough discussion of addition and subtraction effects.)

• Self-reporting versus proxy reporting. There is little information as to who provided the information on either the census form or the CRS. Although the CRS attempts to obtain self-reports from each adult household member, information for approximately 25 percent of persons was reported by proxy. As noted earlier, proxy respondents tend to report more activity limitations and more severe limitations than self-respondents.

Finally, the possibility that the lack of reliability is indicative of real change between the time of the census and the time of the CRS must also be considered. Although one can enumerate possible sources that explain the low rate of consistency between the two surveys, the lack of an experimental design does not permit identification of the relative contributions of the various design features to the overall lack of stability of these estimates.

Empirical evidence shows that even when questions are administered under the same essential survey conditions, responses are subject to a high rate of inconsistency. This evidence comes from the administration of the same topical module on functional limitations and disability to respondents in the 1992–1993 panel of the Survey of Income and Program Participation (SIPP). The module was administered between October 1993 and January 1994 (Time 1) and again between October 1994 and January 1995 (Time 2). The context of the questionnaire is the same in both waves; the topical module is preceded by the core interview, which focuses on earnings, transfer income, program participation, and other forms of income. Information is collected for all members of the household, usually by having one person report for himself or herself and all other family members. In addition, information as to who served as the respondent is recorded; thus one can examine consistency in the reporting of information across time among all self-responses.
Table 3-3 presents selected comparisons of functional limitations and sensory impairments reported at Time 1 with those reported at Time 2. The comparisons clearly reveal high levels of inconsistency, even among self-respondents. For example, among those who report an inability to walk at Time 1, only 70.3 percent report the same status at Time 2. Limiting the comparison to self-reports only does not greatly improve the consistency: among self-reporters, 76.7 percent of those reporting an inability to walk at Time 1 report the same status in the subsequent interview.

These empirical findings illustrate some of the error properties associated with the measurement of functional limitations and sensory impairments. The research indicates that despite psychometric measures suggesting a relatively high degree of reliability, survey applications offer several examples of low levels of reliability, even when the essential survey conditions are held constant. Subtle changes in the wording of questions, the order of questions, or the immediately prior context offer further illustration of the lack of robustness of these items. Although one can enumerate the factors that may contribute to this volatility, the relative contributions of the various factors have not been experimentally determined.
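A back-of-the-envelope calculation shows how classification error alone, with no true change between waves, can produce conditional consistency of roughly the magnitude observed above. All parameter values below are invented for illustration; they are not estimates from SIPP or any other survey.

```python
# Hypothetical parameters (assumptions, not survey estimates):
p = 0.07          # assumed true prevalence of the limitation, constant over time
sens = 0.90       # assumed P(report "yes" | truly limited), each wave
false_pos = 0.02  # assumed P(report "yes" | not limited), each wave

# With independent classification errors across waves and zero true change:
p_yes = p * sens + (1 - p) * false_pos                  # P("yes" at a given wave)
p_yes_both = p * sens ** 2 + (1 - p) * false_pos ** 2   # P("yes" at both waves)

# Among Time 1 "yes" reporters, the share reporting "yes" again at Time 2:
consistency = p_yes_both / p_yes   # roughly 0.70 despite no true change
```

Under these assumed error rates, only about 70 percent of Time 1 reporters repeat their report at Time 2 even though no one's true status changed, which illustrates why observed wave-to-wave inconsistency cannot be read directly as real improvement or decline.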

TABLE 3-3. Selected Survey of Income and Program Participation Panel Data: Time 1 (October 1993–January 1994) and Time 2 (October 1994–January 1995) Comparisons, United States

                                     All Cases                 Self-Respondents Both Times
Status at Time 1               No. of     Percentage at        No. of     Percentage at
                               Persons    Time 2 with          Persons    Time 2 with
                                          Disability                      Disability
Uses cane, crutches, walker       508         45.5               286          50.0
Uses a wheelchair                 175         61.7                83          68.7
Unable to see                     159         49.1                87          49.4
Unable to hear                    121         50.4                41          48.8
Unable to speak                    47         68.1                 5          80.0
Unable to walk                  1,045         70.3               587          76.7
Unable to lift/carry              975         61.2               566          65.6
Unable to climb stairs          1,132         68.3               658          72.3
Needs help outside                699         53.5               302          57.3
Needs help bathing                271         52.0               114          54.4
Needs help dressing               237         49.8                80          55.0

SOURCE: McNeil, 1998.

Empirical Evidence Concerning Error in the Measurement of Work Disability

The assessment of work disability in federal surveys has focused on variants of a limited number of questions, most of which concern whether the individual is limited in the kind or amount of work he or she is able to do, or is unable to work at all, because of a physical, mental, or emotional problem. Not dissimilar to the assessment of functional limitations, work disability is measured in data collection efforts that vary with respect to the essential survey conditions, the specific wording of questions, the number of questions asked, and the determination of severity, duration, and the role of assistive devices or environmental barriers. As McNeil (1993) points out, one of the problems with the current set of indicators designed to measure work disability is that many fail to acknowledge the role of environmental barriers and accommodations.
He states:

    Questions can be raised about the validity of data on persons who are “limited in kind or amount of work they can do” or are “prevented from working.” The work disability questions make no mention of environmental factors, even though it is obvious that a person's ability to work cannot be meaningfully separated from his or her environment. Work may be difficult or impossible under one set of environmental factors but productive and rewarding under another. It would certainly be logical for a respondent to answer “no” to the question, “Do you have a condition that prevents you from working?” if the real reason he or she is not working is the inaccessibility of the transportation system or the lack of accommodations at the workplace. (pp. 3–4)

As noted in Chapter 2, the “fundamental conceptual issue of concern is that health-related restriction in work participation may not be solely or even primarily related to the health condition. . . .” One of the challenges facing questionnaire designers is the development of questions that match the conceptual framework of interest with respect to work disability: specifically, whether the focus is on the health condition that limits the individual's ability to perform specific tasks related to a specific job, the external factors related to the performance of work, other factors that affect participation in the work environment (e.g., transportation), or all three sets of factors.

Although McNeil (1993) raises questions concerning the validity of the work disability measures currently in use, several empirical investigations raise questions about the reliability of these measures, not unlike the findings with respect to the measurement of functional limitations and sensory impairments. Once again, differences in the wording of the questions, the context in which they are asked, the nature of the respondent, and other essential survey conditions, including the data collection organization and the sponsorship of the survey, may contribute to differences in estimates of the working-age disabled population. Haber (1990, as revised from Haber and McNeil [1983]) examined work disability estimates from selected surveys conducted between 1966 and 1988.
He notes that “despite a high degree of consistency in the social and economic composition of the disabled population over a variety of studies, the overall level of disability prevalence has varied considerably” (p. 43). Haber's findings are reproduced in Table 3-4. The estimates from the various surveys reflect differences in the year of administration, the wording of the questions, the overall content of the survey, the mode of administration, the organization collecting the information, and the organization sponsoring the study. Although the wording of the questions is quite similar across the various surveys, there are some minor differences in specific wording (e.g., differences with respect to the emphasis on a health condition) and in the order of the questions (e.g., whether the questions begin, as in the NHIS, by asking whether a health condition keeps the person from working or begin, as in the SSA surveys, by asking whether the person's health limits the kind or amount of work that the person can do). As is evident from Table 3-4, the survey's content appears to be related to the overall estimate; the lowest rates of work disability prevalence come from the census and the March Supplement to the Current Population Survey (8.5 to 9.4 percent), and the highest rates come from the surveys sponsored by SSA (14.3 to 17.2 percent).
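The spread across survey families can be made concrete by grouping the prevalence figures by data source; this is a simple transcription-and-summary sketch, with the total-column values copied from Table 3-4.

```python
# Work disability prevalence (total column of Table 3-4), grouped by
# survey family; values transcribed from the table.
estimates = {
    "SSA":    [17.2, 14.3, 17.2],               # 1966, 1972, 1978
    "Census": [9.4, 8.5],                       # 1970, 1980
    "CPS":    [9.0, 8.9, 8.7, 8.6, 8.8, 8.8],   # March 1981-1986
    "NHIS":   [11.9, 13.5, 13.5],               # 1969, 1980, 1986
    "SEO":    [14.0],
    "SIE":    [13.3],
    "SIPP":   [12.1],
}

# Range of estimates within each survey family, and the overall spread.
ranges = {name: (min(v), max(v)) for name, v in estimates.items()}
overall = [x for v in estimates.values() for x in v]
spread = max(overall) - min(overall)   # 17.2 - 8.5 = 8.7 percentage points
```

The within-family ranges are narrow (the six CPS supplements vary by only 0.4 points) while the between-family spread is 8.7 percentage points, underscoring that survey content and sponsorship, not sampling noise, drive the variation.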

TABLE 3-4. Prevalence of Work Disability Across Various Surveys, United States, 1966–1988

                                     Percentage Classified with a Work Disability
Data Source (age range [years])        Total      Males      Females
1966 SSA (18–64)                        17.2       17.2        17.2
1967 SEO (17–64)                        14.0       14.0        14.0
1969 NHIS (17–64)                       11.9       13.1        10.9
1970 Census (16–64)                      9.4       10.2         8.6
1972 SSA (20–64)                        14.3       13.6        15.0
1976 SIE (18–64)                        13.3       13.3        13.3
1978 SSA (18–64)                        17.2       16.1        18.4
1980 Census (16–64)                      8.5        9.0         8.0
1980 NHIS (17–64)                       13.5       14.3        12.8
March 1981 CPS (16–64)                   9.0        9.5         8.5
March 1982 CPS (16–64)                   8.9        9.3         8.5
March 1983 CPS (16–64)                   8.7        9.0         8.3
March 1984 CPS (16–64)                   8.6        9.2         8.1
1984 SIPP (16–64)                       12.1       11.7        12.4
March 1985 CPS (16–64)                   8.8        9.2         8.4
March 1986 CPS (16–64)                   8.8        9.4         8.2
1986 NHIS (18–64)                       13.5       14.3        12.8

NOTE: SSA = Social Security Administration Disability Survey; SEO = Survey of Economic Opportunity; NHIS = National Health Interview Survey; SIE = Survey of Income and Education; March CPS = Annual March Supplement (Income Supplement) to the Current Population Survey; SIPP = Survey of Income and Program Participation.
SOURCE: Haber, 1990.

The lack of stability that was evident for estimates of mobility and self-care limitations between the 1990 census and the CRS is also evident for estimates of work disability. Table 3-5 presents the comparison of responses between the 1990 census and the CRS with respect to whether the person is limited in the kind or amount of work, or is prevented from working at a job, because of a physical, mental, or other health condition. Once again, between one-third and almost one-half of the respondents are inconsistent in their responses. More recent investigations have used the extensive data from the NHIS-D to investigate alternative estimates of the population with work disabilities.
TABLE 3-5. Work Disability: Distributions to Census Questions 18a and 18b and Content Reinterview Survey Questions 33a and 33b, Persons 16–64 Years of Age, United States, 1990

                                        CRS: Limited in Kind or Amount of
                                        Work or Prevented from Working
Census Long Form:
Limited in Kind or Amount of
Work or Prevented from Working        Yes         No      Total
  Yes                                 778        366      1,144
  No                                  650     12,988     13,638
  Total                             1,428     13,354     14,782

NOTE: Prevalence rate based on the census: 7.7 percent, of which 68 percent were consistent responses. Prevalence rate based on the Content Reinterview Survey: 9.7 percent, of which 54.5 percent were consistent responses.
SOURCE: McNeil, 1993.

The data also provide an opportunity to examine inconsistencies in the reporting of work disability and the receipt of SSI or SSDI benefits. For example, LaPlante (1999) found that, based on the data from the NHIS-D, 9.5 million adults 18 to 64 years of age report being unable to work because of a health problem. Among these 9.5 million adults, 5.3 million (or 56 percent) do not report receipt of SSI or SSDI benefits. Among those who do report receiving SSI or SSDI benefits, 75 percent report that they are unable to work and 13 percent report that they are limited in the kind or amount of work that they can perform, but 12.3 percent do not report any limitation with respect to work.

Although these variations in estimates derived from different surveys suggest instability in the estimates of the proportion of persons with work disabilities as a function of the wording of the question, the nature of the respondent, and the essential survey conditions under which the measurement was taken, they provide little information about measurement error within the framework of either survey statistics or psychometrics. Little is known about the validity or the reliability of these items, whether one views validity from the perspective of survey statistics, as deviations from the true value, or from the perspective of psychometrics, as criterion-related or construct validity. The relative contributions of the various sources of error are, for the most part, unknown; it is known only that various combinations of design features produce different estimates. None of the studies address errors of nonobservation.
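The chapter reports raw consistency rates; one statistic it does not report, Cohen's kappa, puts the census/CRS agreement in Tables 3-1, 3-2, and 3-5 on a common chance-corrected scale. The sketch below uses cell counts transcribed from those tables; the kappa values are computed here, not figures from the source.

```python
# Cohen's kappa for the three census/CRS 2x2 tables. Raw agreement looks
# high mainly because most respondents answer "no" in both surveys;
# kappa corrects for that chance agreement.
def cohens_kappa(a, b, c, d):
    """2x2 table counts: a=yes/yes, b=yes/no, c=no/yes, d=no/no."""
    n = a + b + c + d
    p_obs = (a + d) / n                       # observed agreement
    # chance agreement from the row and column margins
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (p_obs - p_chance) / (1 - p_chance)

kappa_mobility = cohens_kappa(146, 152, 155, 14_194)  # Table 3-1, approx. 0.48
kappa_selfcare = cohens_kappa(69, 346, 120, 13_856)   # Table 3-2, approx. 0.21
kappa_work = cohens_kappa(778, 366, 650, 12_988)      # Table 3-5, approx. 0.57
```

All three kappas fall well short of conventional "substantial agreement" thresholds even though raw agreement exceeds 93 percent in every table, reinforcing the point that little can be claimed about the reliability of these items.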
QUESTION WORDING ISSUES RELATED TO SELECTED MEASURES OF WORK DISABILITY

Jette and Badley point out in Chapter 2 the conceptual problems inherent in many questions designed to measure persons with work disabilities, including the failure of most questions to enumerate the separate elements related to the role of work. That failure is evident in most work disability screening questions designed to be administered to the general adult population. The gap between the conceptual framework and the questions used to screen for work disability is illustrated below using questions from several federal data collection efforts.

The long form of the decennial census for the year 2000 includes the following question:

    Because of a physical, mental, or emotional condition lasting 6 months or more, does this person have any difficulty in doing any of the following activities: . . . (Answer if this person is 16 years old or over.) Working at a job or business?

The respondent is to check a box corresponding to “Yes” or “No.” The question is complex for several reasons:

• The respondent must consider multiple dimensions of health (physical, mental, and emotional) and attribute difficulty working at a job or business to one or more of these health problems. The explicit enumeration of physical, mental, or emotional conditions clarifies for the respondent that the question is intended to cover all three dimensions of health, but at the cost of additional cognitive processing by the respondent.

• The respondent must also assess the duration of the condition and determine whether the 6 months is intended to convey 6 months specifically or a more general concept of a “long-term” condition.

• The term “difficulty” is subject to interpretation. Cognitive evaluation of the term suggests that for some respondents it implies the capacity or ability to perform the activity but does not imply actual participation in the activity.

• What is or is not included in the concept of working is further subject to interpretation by the respondent (e.g., the inclusion or exclusion of sheltered workshops).

• As with many single screening items, the question fails to address accommodations that facilitate participation or barriers that prohibit participation.
For example, if an individual is currently employed in an environment that accommodates a health condition, the respondent must determine whether the person should be considered as having difficulty working, even though the present employment situation presents no difficulty to the person.

The NHIS asks two questions concerning work limitations:

    Does any impairment or health problem NOW keep _____ from working at a job or business?

    Is _____ limited in the kind OR amount of work _____ can do because of any impairment or health problem?

In contrast to the question in the census long form, the NHIS questions do not enumerate the various areas of health for consideration, nor does either question include a qualifying statement with respect to duration. The two questions are more specific in addressing the impact on working; compared with the term “difficulty” used in the census questionnaire, the NHIS probes whether a condition prevents the person from working or limits the kind or amount of work. Once again, note the lack of distinction between the ability to perform the activities associated with the actual performance of the job and those activities related to the role of work. For those who retire early because of a health condition or impairment, would the respondent consider that health problem as keeping the person from working?

IMPLICATIONS FOR METHODOLOGICAL RESEARCH

The point of the examples presented above is not to criticize the questionnaires in which they appear but rather to illustrate the problem of attempting to measure a complex, multidimensional, dynamic construct with a single question or a set of two questions. No single question, or even pair of questions, can possibly tap into the various components of work disability. Clearly, the first step toward a robust set of screening items is the acceptance of a shared conceptual framework and understanding of the dimensions of the construct of interest. That framework must consider the social environment in which the measurement of interest will be taken, with the understanding that comprehension of a question is shaped not only by the specific words used in the question and the context of the question but also by the perceived intent of the question. The use of cognitive laboratory techniques can aid in the identification of problems of comprehension due to the use of inherently vague terms and differential perceptions of the intent of the question. Such techniques will aid in the understanding of the validity of the questions and, through refinement of the wording of questions, should improve the reliability of the items.
Simply documenting that variation in the essential survey conditions of the measurement process contributes to different estimates of persons with work disabilities is not sufficient; the marginal effects of the various factors need to be measured, and their impact needs to be reduced through the use of alternative design features. Both can be accomplished only through a program of experimentation. Similarly, the psychometric properties of these measures need to be assessed. Without a thorough program of development and evaluation, the discrepant estimates evident in the empirical literature will persist.