4 Improving the Validity of Cross-Population Comparisons

This chapter discusses methods for improving the validity of cross-population comparisons, within and across countries, for measures of disability obtained in population surveys. The presentations covered three issues:

1. Developing additional measures of limitations in cognitive functioning and disability that could be used in population surveys
2. Using vignettes for validating judgmental reports in population surveys
3. Approaches to cognitive and field testing of disability measures for cross-cultural and cross-national comparability

ADDITIONAL MEASURES OF LIMITATIONS IN COGNITIVE FUNCTIONING AND DISABILITY

The presentation by Craig Velozo (University of Florida and the Veterans Affairs Medical Center in Gainesville) addressed the relationship of limitations in cognitive functioning and disability, additional measures of cognition, and item response theory (IRT) and computer-adaptive testing (CAT).

Velozo explained that the issue of cognition is very relevant to disability among the elderly population. A quick review of the literature shows a positive relationship between cognitive function and status on activities of daily living (ADL) and instrumental activities of daily living (IADL) (Barberger-Gateau et al., 1999; Steen et al., 2001); decreases in ADL and IADL performance associated with cognitive decline and mild cognitive impairment and disability (Di Carlo et al., 2000; Kumamoto et al., 2000;
IMPROVING THE MEASUREMENT OF LATE-LIFE DISABILITY

Purser et al., 2005; Raji et al., 2005; Ishizaki et al., 2006); and cognitive decline and ADL limitations associated with increased mortality (Wu et al., 2004; Schupf et al., 2005).

Current Measurement

Velozo pointed out that the following cognitive instruments are typically used in national population surveys:

• The Mini-Mental State Exam (MMSE): 11 questions covering 5 areas, namely (1) orientation, (2) registration, (3) attention and calculation, (4) recall, and (5) language
• The Medical Expenditure Panel Survey instrument: questions addressing memory loss, confusion, problems making decisions, and supervision for safety

Cognitive instruments generally used in rehabilitation include

• the Functional Independence Measure (FIM), used for inpatient rehabilitation, which has five questions that address memory, comprehension, expression, social interaction, and problem solving;
• the Minimum Data Set, used in skilled nursing facilities, which has approximately 11 questions that address long-term memory, short-term memory, daily cognition, awareness, and speech and understanding; and
• the Outcome and Assessment Information Set, used in home health, which contains a subset of questions that are partly cognitive and partly functional, such as managing oral medications, using the telephone, cognitive function, and speech clarity.

Velozo said that these instruments have some limitations, both in content and in measurement. Relative to content limitations, MMSE does not address the effects of cognition in a person's daily life. MMSE also does not generate separate cognitive domain measures that are more typical in the neuropsychological literature, such as attention, memory, and executive function.
Relative to measurement limitations of cognitive assessments, although FIM is widely used and has a relatively extensive literature on its psychometrics, these psychometric studies focus on the "motoric" or ADL component of FIM, not the cognitive component.

Recent developments in the area of "applied" or "functional" cognition offer one potential solution for content limitations. Coster and colleagues (2004) have defined applied or functional cognition as discrete functional activities whose performance depends most critically on the application of cognitive skills with limited movement requirements: for
example, daily activities that require cognition, such as finding keys; conversing with more than one person; and resolving a simple problem, such as scheduling a doctor's appointment. They developed a measure of applied cognition that includes 59 items, which are based on the International Classification of Functioning, Disability and Health (ICF). These investigators tested the items on 477 patients who were receiving rehabilitation services. They applied Rasch measurement (an IRT methodology) and used principal components analysis (PCA) to investigate the unidimensionality of this set of items. Of the 59 items, 46 fit the Rasch model; 25 percent of the sample was at the ceiling; and the PCA suggested that the instrument was unidimensional.

In contrast to the available traditional cognitive measures, Coster and colleagues (2004) used an IRT approach that involves the use of relatively large item banks to measure individuals. Associated with IRT is the calibration of items according to their "difficulties" (see Chapter 3). Velozo stressed that the calibration is an important aspect of the IRT approach that offers some benefits in terms of understanding the measures. (As discussed in Chapter 3, IRT is the statistical foundation for CAT, which is a method to administer subsets of items that are individualized for the respondent.)

A New Applied Measure

Velozo described his work with colleagues in which they used IRT and CAT in the development of an applied measure of cognition for stroke patients. The purpose of the study is to develop a measure of cognition that reflects the impact of cognitive challenges in everyday life; to design measures for separate domains of cognition (e.g., attention, memory, executive function); and to maximize measurement efficiency and precision using IRT approaches and CAT.
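The link between item calibration and adaptive administration can be illustrated with a minimal sketch. It assumes the dichotomous Rasch model, in which the probability that a person succeeds on an item depends only on the difference between the person's measure and the item's difficulty, both in logits. The item names, the difficulty values, and the helper functions are hypothetical and are not taken from the studies described here; the selection rule (administer the most informative remaining item) is one common CAT strategy, not necessarily the one the investigators used.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Fisher information of a Rasch item: p * (1 - p), largest when p = 0.5."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


def next_item(ability_estimate, item_bank, administered):
    """Return the name of the unadministered item with the most information."""
    candidates = {
        name: difficulty
        for name, difficulty in item_bank.items()
        if name not in administered
    }
    if not candidates:
        raise ValueError("every item in the bank has already been administered")
    return max(
        candidates,
        key=lambda name: item_information(ability_estimate, candidates[name]),
    )


# Hypothetical three-item bank; difficulties are illustrative logit values.
item_bank = {"easy_item": -0.75, "matched_item": 0.0, "hard_item": 0.75}

for name, difficulty in item_bank.items():
    # A person measured at 0 logits is likelier to succeed on easier items.
    print(name, round(rasch_probability(0.0, difficulty), 2))

# For a person currently estimated at 0 logits, the best-matched item is chosen.
print(next_item(0.0, item_bank, administered=set()))
```

Because information peaks where the success probability is 0.5, the rule selects the item whose difficulty is closest to the current estimate, which is why an adaptive test can reach a given precision with fewer items than a fixed-length form.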
This work involved two studies: (1) developing a Computer Adaptive Measure of Functional Cognition (CAMFC) for Traumatic Brain Injury and (2) developing a similar measure for stroke. Velozo gave an overview of the stroke study. Although it did not include typical aging individuals, within the stroke population there are individuals who have no or fairly mild cognitive problems and so may be reflective of what might be seen with an aging population.

The four steps in developing a measure of functional cognition for stroke patients were as follows:

1. Develop domains of functional cognition (Donovan et al., 2008), with input from an advisory panel on initially proposed domains.
2. Develop an item pool of cognitive items, using focus groups that included health care professionals, patients, and caregivers for the initially proposed sets of items.
3. Field test the item bank, using confirmatory factor analysis (CFA), Rasch psychometrics, and correlations with neuropsychological and functional assessments.
4. Develop a CAT version of the measure.

The 10 final domains of functional cognition included language, reading and writing, numeric calculation, limb praxis (which is very specific to the area of stroke), social use of language, visuospatial functioning, emotional function, attention, executive function, and memory. Operational definitions were developed for each of these domains. Each domain had subsets of items; the number of items per domain ranged from 9 to 41. An item pool was developed for each domain, which resulted in 244 functional cognitive items across the 10 domains.

A total of 128 individuals were tested: 49 were acute stroke patients and 79 were chronic stroke patients. Psychometric analysis was performed on 252 ratings: 128 were self-ratings and 124 were proxy ratings from caregivers. Concurrent validity (CAMFC-Stroke domains against neuropsychological-functional tests) was investigated on a random selection of 63 participants.

CFA supported treating the 10 domains as a "single measure." Except for the limb praxis domain, the majority of correlations across domains were in the moderate range. CFA of each domain provided mixed results in supporting the unidimensionality and hypothesized multiple-factor structure of the domains: 5 of the 10 domains showed support for both a unidimensional factor structure and a multiple-factor structure (based on neuropsychological subdomains); 1 of 10 domains showed support for only a hypothesized multiple-factor structure; and 4 of 10 domains failed to support either a hypothesized unidimensional or a multidimensional structure.
A single measure across all domains (as rated by patients) and the domain measures (with the exception of limb praxis) showed a high percentage of items fitting the Rasch measurement model. Both the single measure and the domain measures (except limb praxis) showed good internal consistency and construct validity.

With regard to measurement sensitivity, the single measure showed excellent sensitivity in separating the sample into different "ability" levels. Except for limb praxis and numeric calculation (patient-reported), domain measures showed good sensitivity in differentiating the sample. The single measure showed no floor or ceiling effects. The domain measures showed no floor effects, but 5 of the 10 domains showed ceiling effects.

Results of rater comparisons showed that patient self-reports correlated with the caregiver proxy reports in the fair to moderate range for all domains except limb praxis. Patients and caregivers rated items in a similar way, as indicated by low levels of differential item functioning (DIF).
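Floor and ceiling effects of the kind reported above are usually summarized as the share of respondents who obtain the lowest or highest possible score. The sketch below shows that calculation; the scores, the score range, and the 15 percent flagging threshold are illustrative assumptions, not values from the CAMFC-Stroke study.

```python
def floor_ceiling_shares(scores, minimum, maximum):
    """Return (floor_share, ceiling_share): fractions at the scale extremes."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_share = sum(1 for s in scores if s == minimum) / n
    ceiling_share = sum(1 for s in scores if s == maximum) / n
    return floor_share, ceiling_share


# Hypothetical raw scores on a 0-10 scale for eight respondents.
scores = [0, 10, 10, 5, 7, 10, 3, 10]
floor_share, ceiling_share = floor_ceiling_shares(scores, minimum=0, maximum=10)

# Assumed convention: flag an effect when more than 15 percent sit at an extreme.
THRESHOLD = 0.15
print("floor effect:", floor_share > THRESHOLD)
print("ceiling effect:", ceiling_share > THRESHOLD)
```

A respondent at the ceiling has answered every item at the top category, so the instrument cannot distinguish that person from anyone more able; this is why a ceiling effect limits sensitivity at the high end of the scale.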
With regard to concurrent validity, domain measures showed fair to moderate correlations with analogous neuropsychological-functional tests. Caregiver proxy reports had a tendency to show stronger correlations with analogous neuropsychological-functional measures than patient self-reports.

In conclusion, Velozo said that CAMFC-Stroke can exist as a single measure or as a battery of nine domain measures (excluding limb praxis). The advantage of the single measure is excellent measurement sensitivity, and the advantage of the domain measures is the ability to monitor domain-specific outcomes. In their study, both patients and caregivers provided acceptable CAMFC-Stroke measures.

The item difficulty hierarchy within each domain of the CAMFC-Stroke offers considerable information. It provides support for the hypothesized domain development structure and provides a basis for interpreting the measures that are generated. Unique to this kind of measure is the capability of interpreting the generated measures in terms of what the patient can and cannot do within the content of the measure. For example, for a person who receives a measure of 0 logit, items such as "copies information correctly" and "pays attention to an hour-long TV program" should match his or her ability level; items at −0.75 logit (such as "correctly answers yes/no questions" and "greets someone who enters the room") should be easy for the individual; and items at 0.75 logit (such as "has a conversation in a noisy environment" and "reads 30 minutes without taking a break") should be difficult for the individual.

In summary, newly developed measures such as the CAMFC-Stroke extend the capability of measuring cognition on several fronts. First, IRT approaches maximize precision by generating measures from groups of items (i.e., item banks). Second, in combination with CAT approaches, IRT-generated measures reduce respondent burden.
Finally, since IRT approaches provide item-difficulty calibrations, measures generated with these instruments can be interpreted in terms of what individuals can and cannot do. While still in their infancy, IRT-CAT approaches to measuring cognition and its impact on everyday life show promise for population-based measurement.

USING VIGNETTES TO IMPROVE CROSS-POPULATION COMPARABILITY OF SELF-RATED DISABILITY MEASURES

Arthur van Soest (Tilburg University and RAND) began his presentation by noting that work-limiting disability is a major problem in many developed countries. It reduces participation and national productivity and increases the social welfare burden. Individuals with work disabilities lose
income, and their quality of life is lowered. The problem will increase as the population ages and people retire later in life.

A simple work disability self-assessment (such as "Do you have an impairment or health problem that limits the amount or type of work you can do?") is often used to measure work disability and compare work disability rates across countries or socioeconomic groups. Large and significant differences in self-reported work disability rates are found between countries that seem to be at similar levels of development.

Van Soest reported on a study comparing workers in the United States and the Netherlands (Kapteyn et al., 2007). In the study sample, 4.9 percent of the people in the United States reported being on disability rolls, compared to 10.7 percent in the Netherlands. There may be several explanations for this difference. First, programs providing disability benefits in the two countries differ in terms of financial incentives, access criteria, and application procedures. Second, people in the United States may be healthier than those in the Netherlands. Third, American employers may accommodate workers with a health problem better than Dutch employers.

In this study, the researchers focused on the second and third explanations. Are American workers really healthier than Dutch workers, or are employers in the United States better able to accommodate workers with a handicap than employers in the Netherlands? The question "Do you have an impairment or health problem that limits the amount or type of work you can do?" is a very general measure of work-related disability. Typically it, or some rephrasing of it, is the only question used in general socioeconomic surveys, where there is little room for elaborating on each specific topic.
The responses to this question show that the prevalence of work-related health problems according to self-reports is much higher in the Netherlands than it is in the United States for all age groups. Are these "real" differences or differences in "reporting style"?

If these are real differences, then one would expect to observe similar differences in the prevalence of chronic conditions that may lead to work-related health problems. Some examples are diabetes, arthritis, hypertension, heart problems, stroke, and emotional problems. However, a comparison for the age group 55 to 64 shows that people in the Netherlands actually suffer less from chronic health conditions than people in the United States. This finding suggests that the two countries may differ less in measured work disability than is reported by individuals.

Van Soest then reported on research using anchoring vignettes in his and his colleagues' survey in the Netherlands and the United States. The methodology for this work was based on the earlier work by a group at Harvard in cooperation with the World Health Organization (WHO) (King et al., 2004). Van Soest and his colleagues found that reporting differences explain more than half of the observed differences in self-reported work disability, leaving less than
half as differences in underlying real health or employer accommodation of workers with a handicap. So looking only at self-reports, one will draw misleading conclusions. Correcting for differences in the responses is essential in order to compare the actual distributions of health in the two countries. What is needed is to correct for the fact that people in different cultures have different response scales, different norms to say whether they have work-related health problems or not, or different norms for the severity of their work-related health problem.

Self-reports provide a good alternative to the difficult (or impossible) and expensive task of creating a complete and comprehensive objective measure of work disability, but they have the drawback of possible differences in reporting styles. Technically, such differences are called DIF. Vignettes can be used to analyze these differences in response scales. Vignettes are a new experimental tool that can correct self-reports and make them comparable across countries or socioeconomic groups; they work particularly well across countries because those are the comparisons for which differences in reporting styles are largest.

A vignette describes the health of a hypothetical person and then asks the respondent to evaluate that person's health on the same scale used for the self-report on health. Since the vignette description is the same in the two countries, the actual health of the person described in the vignette is the same. Therefore, any difference in reported country evaluations must be due to DIF.

Van Soest and his colleagues applied the vignette approach to work-limiting disability to obtain not only international comparisons that are corrected for DIF, but also comparisons of different groups within a given country, such as systematic testing of hypotheses of differences by sex, age, or socioeconomic status.
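The anchoring idea can be made concrete with the simple nonparametric recoding described by King and colleagues (2004), in which a self-rating is re-expressed relative to the same respondent's ratings of the vignettes. The sketch below is a simplified illustration of that recoding; the study discussed in this section relied on a parametric econometric model rather than on this recoding. The function name, the five-point scale, and the example ratings are hypothetical, and respondents whose vignette ratings are tied or out of order are simply returned as `None` here, whereas the original method assigns them a range of values.

```python
def vignette_relative_rank(self_rating, vignette_ratings):
    """Recode a self-rating relative to the respondent's own vignette ratings.

    vignette_ratings must be listed from the least to the most severe vignette.
    With J vignettes the result runs from 1 (self rated below every vignette)
    to 2J + 1 (self rated above every vignette). Returns None when the
    vignette ratings are tied or misordered (a simplification).
    """
    for earlier, later in zip(vignette_ratings, vignette_ratings[1:]):
        if earlier >= later:
            return None
    rank = 1
    for rating in vignette_ratings:
        if self_rating < rating:
            return rank
        if self_rating == rating:
            return rank + 1
        rank += 2
    return rank


# Hypothetical five-point scale: 1 = no limitation ... 5 = extreme limitation.
# Both respondents give themselves a 3, but they use the scale differently.
respondent_a = vignette_relative_rank(3, [2, 4])  # rates vignettes mildly
respondent_b = vignette_relative_rank(3, [4, 5])  # rates vignettes severely
print(respondent_a, respondent_b)
```

The two respondents report the same raw score, yet the recoded values differ: the first places himself or herself between the two vignette persons, while the second places himself or herself below both. Because the vignette descriptions are fixed, that difference reflects response-scale use rather than the respondents' own health.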
Vignettes were developed in three domains of disability: (1) back pain, (2) mental problems, and (3) cardiovascular disease. Respondents in both countries were presented with these vignettes and asked several questions about them, on a two-point scale and a five-point scale, that were very similar to those asked about themselves. They were asked to evaluate the hypothetical persons presented in each of the vignettes. The response scales were the same as the response scale for the self-reports. The responses were used to estimate several versions of an econometric model generalizing the model introduced in the work of King and colleagues (2004).

Van Soest and colleagues found that U.S. respondents were "harder" on the vignette persons than the Dutch respondents: many more U.S. respondents than Dutch respondents said that the vignette person had no or only mild problems with working, whereas many more Dutch respondents thought the person had more serious problems in terms of working.

Using simulations on the basis of the estimates, they found that according to a model not using vignettes, the percentage of people aged 51–64 with a work disability (on a yes/no scale) was 36 percent in the Netherlands
and 23 percent in the United States. Correcting for the response-scale differences using vignettes and the benchmark model, the difference was reduced substantially. In the simulation, every respondent was given the U.S. scales. Nothing changed for the U.S. respondents, but the Dutch respondents appeared to have much less work disability when using the U.S. scales than when using their own scales. Accordingly, the model with vignettes and accounting for response-scale differences gave a much smaller difference in work disability between the two countries than a standard model that assumes everyone uses the same scales.

Within the basic structure, van Soest and his colleagues used several models to test the sensitivity of the main result to different model assumptions. Basically, they were all technical changes to the model, and not many changes in the results were observed. They consistently found that vignettes on work-limiting disabilities do help to correct for cross-country differences in scales used in self-reports. Corrections using vignettes reduced the estimated difference in work-limiting disability between the United States and the Netherlands by more than half. This result was robust to specification choices as long as the vignettes on all three domains (pain, cardiovascular disease, and mental health problems) were used.

What explains the remaining difference? That is something the researchers still do not know. Can the differences be explained as employer accommodation? Is it possible that employers in the United States are more used to having employees with a disability than employers in the Netherlands, where it is traditional for people with disabilities not to work? The researchers were unable to study the distinction between health and employers' accommodations to it.

Similar studies have been conducted in a number of European countries.
In the Survey of Health, Aging, and Retirement in Europe (SHARE), similar questions were asked in eight countries in 2004. Analysis of the COMPARE subsample of SHARE 2006–2007, which used the same work disability vignettes, found that, if everyone uses the same response scale, the percentages responding on the five-point scale that a person has no problem or just a mild work-related health problem are not too different between countries.

Finally, vignettes as a methodological tool can be applied in many other domains. In earlier work, they have been used by the Harvard Group and WHO in the fields of health and health care quality and political efficacy.

Footnote: COMPARE is part of the family of research projects linked to SHARE. Data collection is parallel to the SHARE data collection in waves 2004 and 2006–2007 and follows the same procedures.
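The counterfactual used in these comparisons, in which every respondent is assigned the same response scale, can be sketched with a deliberately simplified model. The sketch assumes that latent work limitation is normally distributed with unit variance in each group and that a respondent reports a disability when the latent value exceeds a group-specific threshold. The group names, means, and thresholds are hypothetical and are not estimates from the studies described here, which used a richer model with covariates and multiple thresholds.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def share_reporting_disability(latent_mean, threshold):
    """Share whose latent limitation (normal, sd = 1) exceeds the threshold."""
    return 1.0 - normal_cdf(threshold - latent_mean)


# Hypothetical parameters: group_a uses a lower reporting threshold than group_b.
groups = {
    "group_a": {"latent_mean": 0.0, "threshold": 0.4},
    "group_b": {"latent_mean": -0.1, "threshold": 0.8},
}

# Reported rates: each group applies its own threshold.
own_scale = {
    name: share_reporting_disability(g["latent_mean"], g["threshold"])
    for name, g in groups.items()
}

# Counterfactual rates: everyone applies group_b's threshold.
common_threshold = groups["group_b"]["threshold"]
common_scale = {
    name: share_reporting_disability(g["latent_mean"], common_threshold)
    for name, g in groups.items()
}

gap_own = own_scale["group_a"] - own_scale["group_b"]
gap_common = common_scale["group_a"] - common_scale["group_b"]
print(round(gap_own, 3), round(gap_common, 3))
```

Under these assumed values the gap shrinks once both groups are scored on the same scale, and nothing changes for the group whose scale is used as the benchmark. The part of the gap that remains reflects the difference in latent means rather than in reporting thresholds.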
In addition, the American Life Panel, the Dutch CentERpanel, and the COMPARE samples on the age 50 and older populations in 10 European countries have vignettes on satisfaction with income, work, or daily activities, and general well-being.

NEW APPROACHES TO COGNITIVE AND FIELD TESTING OF DISABILITY MEASURES

Julie D. Weeks (National Center for Health Statistics, NCHS) began her presentation by emphasizing that it is nearly impossible to discuss the recent advances by NCHS in question testing and evaluation methods without first considering two international question development projects, because that work has informed and transformed the testing work. There are two characteristics of the question development projects that have significantly influenced the way in which the testing and evaluation methods have developed: the specific desire to not rely blindly on existing questions, which may erroneously be considered "gold standards," and the fact that the questions are intended for use in trend analysis and cross-cultural comparative work. Weeks stated that she would describe the question development initiatives and then turn to the impact that these initiatives have had on the way cognitive question testing is conducted now, both at NCHS and at partner sites around the world.

Question Development

At the international level, there is largely an absence of comparable measures that can be used to paint a broad statistical picture of population health and disability. This is not to say that comparisons are not made (there certainly is information on births and deaths and life expectancy used to make general statements about the health of a population), but consistently measured, specific, standardized measures of health and disability status do not exist.
Furthermore, standards with regard to the conceptualization, definition, and collection of those measures and the conduct of analyses typically are also lacking.

Footnote: The American Life Panel is the U.S. analogue of the CentERpanel; it is representative of the U.S. population.

Footnote: The CentERpanel is an Internet survey based on a random sample of the Dutch people aged 25 and older. It is administered by CentERdata, a research institute affiliated with Tilburg University.

Under the auspices of the United Nations, national statistical offices, and the Conference of European Statisticians, two groups were formed and charged with developing such measures that would provide basic information on population health and disability, for both within-country and international comparisons. Those two groups are the Washington Group on Disability Statistics and the Budapest Initiative. The Washington Group on Disability Statistics operates under the aegis of the U.N. Statistical Commission. Its main purpose is the promotion and coordination of international cooperation in the area of health statistics by focusing on disability measures suitable for censuses and national surveys. The Budapest Initiative, which is formally the Joint United Nations Economic Commission for Europe/WHO/Eurostat Task Force on the Measurement of Health Status, was organized under the Work Programme of the Conference of European Statisticians. Its main purpose is the development of an internationally accepted standard set of questions for assessing general health state in the context of population surveys. An important objective of both efforts is to maximize the cross-individual and cross-population comparability of survey questions and resulting data.

Participants in the Washington Group include representatives from over 60 countries, national statistical offices, international organizations, and nongovernmental organizations, as well as some disabled persons' organizations. Using ICF as a framework, the group's interest is in the measurement of basic actions at the level of the whole person. Disability is defined as the intersection of basic actions and the environment and affects participation in society. Disability is treated as a demographic variable, comparing populations and subgroups by disability status.

The Washington Group first developed a short set of six disability questions that have been tested, adopted, and now are being included in plans for censuses around the world.
The group is now engaged in the development of longer sets of questions that include increasingly complex activities and more domains of health.

The major focus of the Budapest Initiative is on the development of measures suitable for population surveys that capture health status or "health state." In this context, health state reflects one's functional ability ("within the skin" as opposed to with the use of aids or other assistance); that is, capacity, rather than performance, in a reasonable environment.

Like the Washington Group, one of the objectives of the Budapest Initiative is to develop a question set that describes individuals' overall health state, by examining functioning in basic levels of activity across a number of health domains. A second objective is to describe trends in health over time within a country, across subgroups of a population, and across countries. In this way, something meaningful about a population's health can be said when examining differences between countries and assessing trends over time.

Weeks noted that the objectives of the two groups are very similar. When their respective work groups mapped out the possible health domains and basic activities that could be measured, the list included six categories:
1. Mobility: walking, climbing stairs, bending, reaching or lifting, using hands
2. Sensory: seeing and hearing
3. Communicating: understanding and speaking
4. Cognitive functions: learning, remembering, making decisions, and concentrating
5. Emotional functioning: interpersonal interaction and psychological well-being
6. Other: affect, pain, fatigue, and self-care

The challenge that each group faced was demonstrating in some easily digestible way exactly where the questions being developed were located in what seemed like an ever-increasing map of health and disability. Both groups are measuring functional ability at the level of basic actions, but across multiple health domains. How does one clearly show what is being considered and what remains to be developed? Moreover, the Washington Group is defining disability as the intersection of the person and the environment, so one has to know something about the environment. Finally, it became increasingly apparent that in a survey setting, in which there is room for a relatively larger number of questions, the question of what additional aspects should be measured within any health domain adds to the complexity.

After many months and iterations, a small group of members developed "the matrix." At the simplest level, this matrix outlines in what areas work has occurred and in what areas it needs to continue to develop a full spectrum of questions on health and disability. The goal is to populate each cell with questions that have been subject to rigorous testing so that countries can use comparable measures and can choose those measures that fit with their survey and budgetary agendas.

Particularly noteworthy about this matrix is that it succinctly conveys the use of explicit definitions of disability (or health) and is, in essence, a roadmap for future survey and question development work.
However, finding questions that have been cognitively evaluated is nearly impossible. Furthermore, in earlier efforts, it quickly became clear that one simply cannot rely on data collected in separate studies, nor can the findings be compared.

Question Testing and Analysis

The question testing and evaluation phase of work required nearly as much development as did the questions. Ultimately, the goal of both the Washington Group and the Budapest Initiative is to develop internationally comparable data that are suitable for censuses and surveys and that capture most disabled people (or the broad spectrum of health states) in a consistent fashion. The goal then, for the cognitive test, is to ensure that the questions meet those goals, without relying on a gold standard. Unfortunately, many
of the traditional aspects of cognitive testing do not produce consistency or standardization. Furthermore, it is often impossible to assess if differences are "real." Some of the aspects of traditional cognitive testing methods that hinder comparative analysis include small, nonrepresentative samples; nonstandard interviewing protocols, outputs, and reports; underdeveloped literature and practice regarding the rigor of analysis; and lack of standardized criteria for what constitutes a cognitive interview finding.

Also, in the area of disability measurement (as in so many other disciplines), existing questions are often used as gold standards, and new questions are evaluated by examining the relationship between the two. This is a problem for several reasons. First, often the gold standard has not been rigorously tested, so it is not clear what is being compared and which measure might, in fact, be superior. Moreover, the purpose of the gold standard questions and new questions may differ, even slightly, so that making such a comparison may not be entirely appropriate. Finally, the strategy does not address cross-cultural comparability, unless it was addressed in the development of the original question considered the gold standard, which one would never know because questions currently in use rarely come with evidence of such study.

The cross-cultural nature of the Washington Group and Budapest Initiative projects underscores the need to clearly demonstrate that a question works and what is being measured. It is no longer sufficient to know just that the "questions worked"; one needs the question wording, interpretation, and outputs, as well as how respondents interact with and answer the question. This need required a huge paradigm shift in the cognitive testing lab and ultimately changed the way the testing and the evaluations are being conducted.
In effect, the qualitative process is subject to far more of the scientific principles associated with quantitative analysis: a structured cognitive interview, data quality, data analysis (multiple levels of analysis, including an examination of patterns of respondent interpretation and calculation), and transparency and replicability in all processes, but, most importantly, in the transformation of qualitative data into quantitative results. In addition, these methods have to be implemented in a consistent and standardized fashion across all of the participating countries.

Weeks next described the cognitive testing used by the Washington Group and Budapest Initiative. One of the most important steps during the initial phase is stating as specifically as possible the research questions to be tested. Moreover, one does not simply want to know that the "questions worked," but rather:

• How do specific respondents move through the cognitive processes (comprehension, retrieval, judgment, response)?
IMPROVING THE VALIDITY OF CROSS-POPULATION COMPARISONS 63 â¢ How much error is there (false negatives and false positives)? â¢ Why are there differences and how does one account for the way respondents with different socioeconomic conditions, cultures, and languages interpret, consider, and respond to survey questions? The challenge of answering these three questions is heightened even further when attempting to design an internationally comparable measure for a concept as complex and dynamic as disability. Next in the process is putting together a testing work group that meets on a regular, frequent basis. There is an initial, mandatory training meeting for all participating countries (or sites). A purposeful sampling procedure is designed; it is not a convenience sample. Translation is a major activity re- quiring a great deal of time and care. As much time is spent in translation as is spent in nearly all of the rest of the planning phase of testing. Ultimately, if one is going to administer questions about how âsad, blue, or depressedâ a respondent is in one site (the United States, for example) one has to know that these terms mean exactly the same thing in the other test sites. In Italy, for example, there is no concept of the term âblue.â Even if one can find the word, does it mean the same thing? Is it going to result in comparable data? Finally, time-intensive work must be done to take notes and to translate those notes into some kind of quantitative format. However, the data that are generated are rich and very quantitatively informative. When this work is completed, one has not only the typical qualitative notes taken during a cognitive interview, but also a narrative that follows a semi-scripted format, with as much detail as possible. In turn, those nar- ratives are entered into QNotes (software designed at NCHS) and form the basis of the data from the cognitive testing that are analyzed in a very quantitative fashion. 
The results of this process are

• validity tied to rich detail,
• findings that are grounded,
• insight into question interpretation,
• insight into patterns of calculation, and
• knowledge of question performance.

Weeks next described what is different about the analysis stage. In quantitative terms, she and other NCHS staff liken within-interview analysis to frequencies, across-interview analysis to conducting crosstabs, and across-subgroup analysis to controlling for specific variables. The point is that each type of analysis offers some type of understanding. The goal is to perform the type of analysis that answers the question of most interest, but typically all three are performed.

The analysis itself should be conceptualized in three distinct layers. The first and simplest level of analysis occurs within the interview, specifically, as the interviewer attempts to understand how one respondent has come to understand, process, and then answer a survey question. The interviewer must act as analyst during the interview, evaluating the information that the respondent describes and following up with additional questions if there are gaps, incongruencies, or disjunctures in the explanation. From this vantage point (i.e., within a single cognitive interview), basic response errors, such as recall trouble or misinterpretation, can be identified; these are errors that can be linked to question design problems.

The second layer of analysis occurs through a systematic examination of all interviews together. Specifically, interviews are examined to identify patterns in the way respondents interpret and process the question. By making comparisons across all of the interviews, patterns can be identified and then examined for consistency and degree of variation among respondents. Inconsistencies in the way respondents interpret questions may not necessarily mean misinterpretation, but they can illustrate even the subtle interpretation differences that respondents use as they consider the question in relation to their own life circumstances. From this vantage point, it is possible to identify the phenomena that are captured by the particular survey question, illustrating the substantive meaning behind the statistic. Additionally, from this layer of analysis, it is possible to identify patterns of calculation across respondents. This is particularly useful in understanding how qualifying clauses, such as "in the past 2 weeks" or "on average," affect the way respondents form their answer and whether respondents consistently use the clauses in their calculation.
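As an illustration only, the analogy between the layers of analysis and frequencies, crosstabs, and controlling for a variable can be sketched in a few lines of Python. Everything in the sketch is hypothetical: the record layout, the site labels, the question identifiers, and the interpretation codes are invented for the example and do not come from QNotes or from any Washington Group or Budapest Initiative data.

```python
from collections import Counter, defaultdict

# Hypothetical coded interview notes: one row per respondent per question.
# Field names, site labels, and interpretation codes are illustrative only.
NOTES = [
    {"respondent": "R1", "site": "A", "question": "Q1", "code": "intended"},
    {"respondent": "R1", "site": "A", "question": "Q2", "code": "intended"},
    {"respondent": "R2", "site": "A", "question": "Q1", "code": "intended"},
    {"respondent": "R2", "site": "A", "question": "Q2", "code": "other"},
    {"respondent": "R3", "site": "B", "question": "Q1", "code": "other"},
    {"respondent": "R3", "site": "B", "question": "Q2", "code": "other"},
    {"respondent": "R4", "site": "B", "question": "Q1", "code": "other"},
    {"respondent": "R4", "site": "B", "question": "Q2", "code": "intended"},
]


def within_interview(notes, respondent):
    """Frequencies of interpretation codes within a single interview."""
    return Counter(n["code"] for n in notes if n["respondent"] == respondent)


def across_interviews(notes):
    """Crosstab of question by interpretation code over all interviews."""
    table = defaultdict(Counter)
    for n in notes:
        table[n["question"]][n["code"]] += 1
    return {question: dict(counts) for question, counts in table.items()}


def across_subgroups(notes, key="site"):
    """The same crosstab, computed separately within each subgroup."""
    groups = defaultdict(list)
    for n in notes:
        groups[n[key]].append(n)
    return {group: across_interviews(rows) for group, rows in groups.items()}


if __name__ == "__main__":
    print(within_interview(NOTES, "R1"))
    print(across_interviews(NOTES))
    print(across_subgroups(NOTES))
```

In this invented data set, the pooled crosstab shows an even split of interpretations for Q1, while the tables computed by site show that each site interprets Q1 uniformly but differently. That is the kind of pattern that only a comparison across subgroups would reveal.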
The last level, the heart of the cross-cultural analysis, occurs through an examination of the patterns across subgroups, identifying whether particular groups of respondents interpret or process a question differently. This level of analysis is particularly important because it is the level where potential for bias would occur.

Thus far, this testing protocol has been used by both the Washington Group and the Budapest Initiative in approximately 30 countries. In a subset of these countries, staff has also combined the cognitive testing with field testing. Preparations are now being made for a combined cognitive and field testing effort in the U.N. Economic and Social Commission for Asia and the Pacific region, which will include Cambodia, Fiji, Maldives, Philippines, Sri Lanka, and Vietnam. It has been a remarkable endeavor, one that has produced exciting results in both the development of questions and testing for cross-cultural purposes.

In summary, the Washington Group and Budapest Initiative question development work is located at the most basic levels of activity and participation in core health domains. The goal is to measure ability (or inability) to carry out basic activities and to treat disability as a demographic variable, comparing participation by disability status. In this context, health state reflects one's functional ability "within the skin" (without the use of aids or other assistance), rather than performance, in a reasonable environment. A new methodology for integrating cognitive testing concepts into a standardized, quantitative testing procedure was developed in the process in order to meet the specific need of testing measures of disability and health state suitable for international comparisons.

Information was collected on the question response process, patterns of interpretation, and evaluation of decision-making patterns. This information was then used to help identify potential response error and to test the suitability of the questions for the purpose for which they were designed: to generate a meaningful, internationally comparable general prevalence measure for disability. The pattern analysis was particularly advantageous. Examining consistencies and inconsistencies across various questions allowed for an evaluation of the Washington Group questions without establishing existing questions as a gold standard. This pattern analysis allowed for old and new questions to be compared while maintaining a neutral or agnostic view of the other questions.

The results indicate the usefulness of this approach for testing the design of cross-national indicators, as well as lending support to the reliability of the particular measures developed (and now adopted) by the Washington Group on Disability Statistics.

DISCUSSION

The discussion focused on three issues: vignettes, item pools and CAT, and the Washington Group.

Vignettes

Several participants asked questions: Are some of the differences in reporting on work disability in the Netherlands and the United States explained by differences in the interview context and the organizations conducting the surveys?
Can some of the differences be explained through the framing effect, that is, the location of the questions in the questionnaire? Can some of the results showing that African Americans have disability profiles at ages 10 years younger than whites be explained using vignette methodology? Has the vignette methodology been used to look at racial or ethnic differences within the United States?

Arthur van Soest responded that the surveys were Internet surveys in both countries. However, he noted that they were not exactly the same complete survey, so there is possibly a framing effect. One of the interpretations of what the authors found is that the country's, and its institutions', specific context does make a difference. The researchers also experimented with telephone interviews, but reading all those hypothetical stories did not work as well. One kind of experiment could be to ask somebody in the Netherlands about a hypothetical person in the United States and vice versa. However, they thought that would be too confusing and so did not use that approach. He noted that in the Netherlands and U.S. study, race was not included, mainly because in the Netherlands there is not enough variation in the data; items such as education, gender, and age were included.

In response to the question about whether the vignette methodology has been used to look at racial or ethnic differences in the United States, van Soest replied that it could be used in such surveys as the Health and Retirement Study, but he does not know if it has been used. The focus for using vignettes to date has been almost exclusively on cross-national comparisons.

Item Pools and Computer-Adaptive Testing

Noting that the vignette scheme is particularly suited to identifying and making adjustments for DIF, Robert Hauser wondered whether an item pool for CAT would be valid if it had DIF in the items. To what extent have item pools for CAT been tested for DIF?

Whether an item bank has DIF is based on where one is looking for it. There are many different levels at which one can look. The small number of items that show differential functioning can be removed from an item bank, especially if the bank has hundreds of items. Also, one can calibrate a group-specific item differential so that, for example, it would have a different item differential for Dutch and U.S. populations. It would be very similar to the vignette scheme.

It was noted that application of these methods across different population groups would be important. Gender is an example.
The standard general health questions do not show any gender differences, but adjustments using these methods might show that there is a gender difference and that gender accounts for the difference between self-reported symptoms, for which there are always gender differences in the general health.

The Washington Group

A question was asked if there are some domains that the group simply cannot get to work across the multiple sites being evaluated.

Julie Weeks responded that, anticipating that some domains would be harder than others to work with across the multiple sites being evaluated, the group started with some of the easier domains. In the Budapest Initiative, they are encountering difficulty with two domains: how to ask about pain and fatigue cross-culturally. How people interpret the concepts of pain and fatigue and how much they are willing to admit to having them are very different. Those two domains are in the third round of testing, without success as yet. Obviously, this work needs to continue.

Connie Citro (Committee on National Statistics, DBASSE) commended the Washington Group initiative, which is clearly going back to basics as the way of making a start at getting some very carefully tested questions that will provide basic monitoring information across a whole range of countries. She said that is very important work.