OCR for page 51
4
Improving the Validity of Cross-Population Comparisons
This chapter discusses methods for improving the validity of cross-population comparisons, within and across countries, for measures of disability obtained in population surveys. The presentations covered three issues:
1. Developing additional measures of limitations in cognitive functioning and disability that could be used in population surveys
2. Using vignettes for validating judgmental reports in population surveys
3. Approaches to cognitive and field testing of disability measures for cross-cultural and cross-national comparability
ADDITIONAL MEASURES OF LIMITATIONS IN
COGNITIVE FUNCTIONING AND DISABILITY
The presentation by Craig Velozo (University of Florida and the Veterans Affairs Medical Center in Gainesville) addressed the relationship of limitations in cognitive functioning and disability, additional measures of cognition, and item response theory (IRT) and computer-adaptive testing (CAT). Velozo explained that the issue of cognition is very relevant to disability among the elderly population. A quick review of the literature shows a positive relationship between cognitive function and status in activities of daily living (ADL) and instrumental activities of daily living (IADL) (Barberger-Gateau et al., 1999; Steen et al., 2001); decreases in ADL and IADL performance associated with cognitive decline and mild cognitive impairment and disability (Di Carlo et al., 2000; Kumamoto et al., 2000;
IMPROVING THE MEASUREMENT OF LATE-LIFE DISABILITY
Purser et al., 2005; Raji et al., 2005; Ishizaki et al., 2006) and cognitive
decline and ADL limitations associated with increased mortality (Wu et al.,
2004; Schupf et al., 2005).
Current Measurement
Velozo pointed out that the following cognitive instruments are typically used in national population surveys:
• The Mini Mental State Exam (MMSE), which has 11 questions covering 5 areas: (1) orientation, (2) registration, (3) attention and calculation, (4) recall, and (5) language
• The Medical Expenditure Panel Survey instrument, which has questions addressing memory loss, confusion, problems making decisions, and supervision for safety
Cognitive instruments generally used in rehabilitation include the following:
• The Functional Independence Measure (FIM), used for inpatient rehabilitation, which has five questions that address memory, comprehension, expression, social interaction, and problem solving
• The Minimum Data Set, used in skilled nursing facilities, which has approximately 11 questions that address long-term memory, short-term memory, daily cognition, awareness, and speech and understanding
• The Outcome and Assessment Information Set, used in home health, which contains a subset of questions that are somewhat cognitive and somewhat leaning toward function, such as managing oral medications, using the telephone, cognitive function, and speech clarity
Velozo said that these instruments have some limitations, both in content and in measurement. Relative to content limitation, MMSE does not address the effects of cognition in a person’s daily life. MMSE also does not generate separate cognitive domain measures that are more typical in the neuropsychological literature, such as attention, memory, and executive function. Relative to measurement limitations of cognitive assessments, although FIM is widely used and has a relatively extensive literature on its psychometrics, these psychometric studies focus on the “motoric” or ADL component of FIM, not the cognitive component.
Recent developments in the area of “applied” or “functional” cognition offer one of the potential solutions for content limitations. Coster and colleagues (2004) have defined applied or functional cognition as discrete functional activities whose performance depends most critically on the application of cognitive skills with limited movement requirements: for example, daily activities that require cognition, such as finding keys; conversing with more than one person; and resolving a simple problem, such as scheduling a doctor’s appointment. They developed a measure of applied cognition that includes 59 items, which are based on the International Classification of Functioning, Disability and Health (ICF). These investigators tested the items on 477 patients who were receiving rehabilitation services. They applied Rasch measurement (an IRT methodology) and used principal components analysis (PCA) to investigate the unidimensionality of this set of items. Of the 59 items, 46 fit the Rasch model; 25 percent of the sample was at the ceiling; and the PCA suggested that the instrument was unidimensional.
In contrast to the available traditional cognitive measures, Coster and
colleagues (2004) used an IRT approach that involves the use of relatively
large item banks to measure individuals. Associated with the IRT is the
calibration of items according to their “difficulties” (see Chapter 3). Velozo
stressed that the calibration is an important aspect of the IRT approach
that offers some benefits in terms of understanding the measures. (As
discussed in Chapter 3, IRT is the statistical foundation for CAT, which
is a method to administer subsets of items that are individualized for the
respondent.)
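As a rough illustration of how CAT can draw on calibrated item difficulties, the sketch below picks the next item from a small bank by maximizing Fisher information at the current ability estimate under a dichotomous Rasch model. The item names and difficulty values are hypothetical, and the selection rule is a common textbook choice rather than the procedure used in the work described in this chapter.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a positive response under the dichotomous Rasch model.

    Both arguments are on the logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Fisher information of a Rasch item at a given ability: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


def select_next_item(ability, item_bank, administered):
    """Return the not-yet-administered item that is most informative at `ability`."""
    candidates = [name for name in item_bank if name not in administered]
    return max(candidates, key=lambda name: item_information(ability, item_bank[name]))


# Hypothetical calibrated item bank: item name -> difficulty in logits.
item_bank = {"item_a": -1.5, "item_b": -0.5, "item_c": 0.4, "item_d": 1.6}

# Start at an ability estimate of 0 logits, then suppose the estimate
# moves up to 1.2 logits after the first response.
first = select_next_item(0.0, item_bank, administered=set())
second = select_next_item(1.2, item_bank, administered={first})
print(first, second)
```

Under the Rasch model, information peaks where item difficulty equals the ability estimate, so the rule amounts to choosing the unused item whose calibration is closest to the respondent's current estimate. This is why a CAT can reach a given precision with fewer items than a fixed questionnaire.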
A New Applied Measure
Velozo described his work with colleagues in which they used IRT and CAT in the development of an applied measure of cognition for stroke patients. The purpose of the study is to develop a measure of cognition that reflects the impact of cognitive challenges in everyday life; to design measures for separate domains of cognition (e.g., attention, memory, executive function); and to maximize measurement efficiency and precision using IRT approaches and CAT. This work involved two studies: (1) developing a Computer Adaptive Measure of Functional Cognition (CAMFC) for Traumatic Brain Injury and (2) developing a similar measure for stroke.
Velozo gave an overview of the stroke study. Although it did not
include typical aging individuals, within the stroke population there are
individuals who have no or fairly mild cognitive problems and so may be
reflective of what might be seen with an aging population.
The four steps in developing a measure of functional cognition for stroke patients were as follows:
1. Develop domains of functional cognition (Donovan et al., 2008), with input from an advisory panel on initially proposed domains.
2. Develop an item pool of cognitive items, using focus groups that included health care professionals, patients, and caregivers for the initially proposed sets of items.
3. Field test the item bank, using confirmatory factor analysis (CFA), Rasch psychometrics, and correlations with neuropsychological and functional assessments.
4. Develop a CAT version of the measure.
The 10 final domains of functional cognition included language, reading and writing, numeric calculation, limb praxis (which is very specific to the area of stroke), social use of language, visuospatial functioning, emotional function, attention, executive function, and memory. Operational definitions were developed for each of these domains. Each domain had subsets of items; the number of items per domain ranged from 9 to 41. An item pool was developed for each domain, which resulted in 244 functional cognitive items across the 10 domains.
A total of 128 individuals were tested: 49 were acute stroke patients and 79 were chronic stroke patients. Psychometric analysis was performed on 252 ratings: 128 were self-ratings and 124 were proxy ratings from caregivers. Concurrent validity (CAMFC-Stroke domains against neuropsychological-functional test) was investigated on a random selection of 63 participants. CFA supported treating the 10 domains as a “single measure.” Except for the limb praxis domain, the majority of correlations across domains were in the moderate range. CFA of each domain provided mixed results in supporting the unidimensionality and hypothesized multiple-factor structure of the domains: 5 of the 10 domains showed support for both a unidimensional factor structure and a multiple-factor structure (based on neuropsychological subdomains); 1 of 10 domains showed support for only a hypothesized multiple-factor structure; and 4 of 10 domains failed to support either a hypothesized unidimensional or a multidimensional structure.
A single measure across all domains (as rated by patients) and the domain measures (with the exception of limb praxis) showed a high percentage of items fitting the Rasch measurement model. Both the single measure and the domain measures (except limb praxis) showed good internal consistency and construct validity.
With regard to measurement sensitivity, the single measure showed
excellent sensitivity in separating the sample into different “ability” levels.
Except for limb praxis and numeric calculations (patient-reported), domain
measures showed good sensitivity in differentiating the sample. The single
measure showed no floor or ceiling effects. The domain measures showed
no floor effects, but 5 of the 10 domains showed ceiling effects.
Results of rater comparisons showed that patient self-reports correlated with the caregiver proxy reports in the fair to moderate range for all domains except limb praxis. Patients and caregivers rated items in a similar way, as indicated by low levels of differential item functioning (DIF).
With regard to concurrent validity, domain measures showed fair to moderate correlations with analogous neuropsychological-functional tests. Caregiver proxy reports had a tendency to show stronger correlations with analogous neuropsychological-functional measures than patient self-reports.
In conclusion, Velozo said that CAMFC-Stroke can exist as a single measure or as a battery of nine domain measures (excluding limb praxis). The advantage of the single measure is excellent measurement sensitivity, and the advantage of the domain measures is the ability to monitor domain-specific outcomes. In their study, both patients and caregivers provided acceptable CAMFC-Stroke measures.
The item difficulty hierarchy within each domain of the CAMFC-Stroke offers considerable information. It provides support for the hypothesized domain development structure and provides a basis for interpreting the measures that are generated. Unique to this kind of measure is the capability of interpreting the generated measures in terms of what the patient can and cannot do within the content of the measure. For example, for a person who receives a measure of 0 logit, items such as “copies information correctly” and “pays attention to an hour-long TV program” should match his or her ability level; items at –0.75 logit (such as “correctly answers yes/no questions” and “greets someone who enters the room”) should be easy for the individual; and items at 0.75 logit (such as “has a conversation in a noisy environment” and “reads 30 minutes without taking a break”) should be difficult for the individual.
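The logit example above can be made concrete with a short calculation. The sketch below assumes the simplest (dichotomous) Rasch model, in which the probability of success depends only on the difference between the person measure and the item difficulty; the CAMFC-Stroke items may well use a rating-scale variant, so the numbers are illustrative only.

```python
import math


def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: P(success) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


person_measure = 0.0  # the person in the example, at 0 logit

# Item difficulties taken from the example in the text.
items = [
    ("easier items", -0.75),
    ("matched items", 0.0),
    ("harder items", 0.75),
]

for label, difficulty in items:
    p = rasch_probability(person_measure, difficulty)
    print(f"{label} ({difficulty:+.2f} logit): P(success) = {p:.2f}")
```

Under this assumption, a person at 0 logit has a success probability of about 0.68 on the items at –0.75 logit, exactly 0.50 on the matched items, and about 0.32 on the items at 0.75 logit, which is the sense in which those items are "easy," "matched," and "difficult."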
In summary, newly developed measures such as the CAMFC-Stroke extend the capability of measuring cognition on several fronts. First, IRT approaches maximize precision by generating measures from groups of items (i.e., item banks). Second, in combination with CAT approaches, IRT-generated measures reduce respondent burden. Finally, since IRT approaches provide item-difficulty calibrations, measures generated with these instruments can be interpreted in terms of what individuals can and cannot do. While still in their infancy, IRT-CAT approaches to measuring cognition and its impact on everyday life show promise for population-based measurement.
USING VIGNETTES TO IMPROVE CROSS-POPULATION
COMPARABILITY OF SELF-RATED DISABILITY MEASURES
Arthur van Soest (Tilburg University and RAND) began his presentation by noting that work-limiting disability is a major problem in many developed countries. It reduces participation and national productivity and increases the social welfare burden. Individuals with work disabilities lose income, and their quality of life is lowered. The problem will increase as the population ages and people retire later in life.
A simple work disability self-assessment, such as “Do you have an impairment or health problem that limits the amount or type of work you can do?”, is often used to measure work disability and compare work disability rates across countries or socioeconomic groups. Self-reported work disability rates differ substantially and significantly between countries that seem to be at similar levels of development.
Van Soest reported on a study comparing workers in the United States and the Netherlands (Kapteyn et al., 2007). In the study sample, 4.9 percent of the people in the United States reported being on disability rolls, compared to 10.7 percent in the Netherlands. There may be several explanations for this difference. First, programs providing disability benefits in the two countries differ in terms of financial incentives, access criteria, and application procedures. Second, people in the United States may be healthier than those in the Netherlands. Third, American employers may accommodate workers with a health problem better than Dutch employers.
In this study, the researchers focused on the second and third explanations. Are Americans really healthier than Dutch workers, or are employers in the United States better able to accommodate workers with a handicap than employers in the Netherlands? The question “Do you have an impairment or health problem that limits the amount or type of work you can do?” is a very general measure of work-related disability. Typically this question, or some rephrasing of it, is the only one used in general socioeconomic surveys, where there is little room for elaborating on each specific topic. The responses to this question show that the prevalence of work-related health problems according to self-reports is much higher in the Netherlands than it is in the United States for all age groups. Are these “real” differences or differences in “reporting style”?
If these are real differences, then one would expect to observe similar differences in the prevalence of chronic conditions that may lead to work-related health problems. Some examples are diabetes, arthritis, hypertension, heart problems, stroke, and emotional problems. However, a comparison for the age group 55 to 64 shows that people in the Netherlands actually suffer less from chronic health conditions than people in the United States. This finding suggests that the two countries may differ less in measured work disability than is reported by individuals.
Van Soest then reported on research using anchoring vignettes in his and his colleagues’ survey in the Netherlands and the United States. The methodology for this work was based on the earlier work by a group at Harvard in cooperation with the World Health Organization (WHO) (King et al., 2004). Van Soest and his colleagues found that reporting differences explain more than half of the observed differences in self-reported work disability, leaving less than half as differences in underlying real health or employer accommodation of workers with a handicap. Looking only at self-reports will therefore lead to misleading conclusions. Correcting for differences in the responses is essential in order to compare the actual distributions of health in the two countries. What is needed is to correct for the fact that people in different cultures have different response scales, different norms to say whether they have work-related health problems or not, or different norms for the severity of their work-related health problem.
Self-reports provide a good alternative to the difficult (or impossible) and expensive task of creating a complete and comprehensive objective measure of work disability, but they have the drawback of possible differences in reporting styles. Technically, such differences are called differential item functioning (DIF). Vignettes can be used to analyze these differences in response scales. Vignettes are a new experimental tool that can correct self-reports and make them comparable across countries or socioeconomic groups; they work particularly well across countries because those are the comparisons for which differences in reporting styles are largest. A vignette describes the health of a hypothetical person and then asks the respondent to evaluate that person’s health on the same scale used for the self-report on health. Since the vignette description is the same in the two countries, the actual health of the person described in the vignette is the same. Therefore, any difference between countries in the reported evaluations must be due to DIF.
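One simple way to see how a vignette anchors a self-report is to recode each self-rating relative to the same respondent's rating of the vignette person. The sketch below is a simplified, single-vignette version of the nonparametric recoding idea described by King and colleagues (2004); it is not the econometric model used by van Soest and his colleagues, and the respondents and ratings are invented for illustration.

```python
def relative_rating(self_rating, vignette_rating):
    """Recode a self-report relative to the respondent's own vignette rating.

    Returns 1 if the respondent rates their own limitation below the
    vignette person's, 2 if equal, and 3 if above.
    """
    if self_rating < vignette_rating:
        return 1
    if self_rating == vignette_rating:
        return 2
    return 3


# Hypothetical respondents on a 1-5 severity scale (1 = none, 5 = extreme).
# Both give the same raw self-report but rate the identical vignette differently.
respondents = [
    {"country": "A", "self": 3, "vignette": 4},
    {"country": "B", "self": 3, "vignette": 2},
]

recoded = [relative_rating(r["self"], r["vignette"]) for r in respondents]
print(recoded)
```

Both hypothetical respondents report "3" for themselves, yet the recoded values differ: the first places themselves below the vignette person and the second above. Because the vignette describes the same health state for everyone, the recoded value removes the part of the raw rating that reflects only the respondent's reporting scale.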
Van Soest and his colleagues applied the vignette approach to work-limiting disability to obtain not only international comparisons that are corrected for DIF, but also comparisons of different groups within a given country, such as systematic testing of hypotheses of differences by sex, age, or socioeconomic status. Vignettes were developed in three domains of disability: (1) back pain, (2) mental problems, and (3) cardiovascular disease. Respondents in both countries were presented with these vignettes and asked to evaluate the hypothetical persons described, answering questions very similar to those asked about themselves, on a two-point scale and a five-point scale. The response scales were the same as the response scale for the self-reports.
The responses were used to estimate several versions of an econometric model generalizing the model introduced in the work of King and colleagues (2004). Van Soest and colleagues found that U.S. respondents were “harder” on the vignette persons than the Dutch respondents: many more U.S. respondents than Dutch respondents said that the vignette person had no or only mild problems with working, whereas many more Dutch respondents thought the person had more serious problems in terms of working.
Using simulations on the basis of the estimates, they found that according to a model not using vignettes, the percentage of people aged 51–64 with a work disability (on a yes/no scale) was 36 percent in the Netherlands and 23 percent in the United States. Correcting for the response-scale differences using vignettes and the benchmark model, the difference was reduced substantially. In the simulation, every respondent was given the U.S. scales. Nothing changed for the U.S. respondents, but the Dutch respondents appeared to have much less work disability when using the U.S. scales than when using their own scales. Accordingly, the model with vignettes and accounting for response-scale differences gave a much smaller difference in work disability between the two countries than a standard model that assumes everyone uses the same scales.
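The logic of giving every respondent the U.S. scales can be sketched with a toy threshold model: a respondent reports a work disability when a latent severity exceeds a country-specific reporting threshold. All severities and thresholds below are made-up values chosen for illustration; the actual study estimated an econometric model from survey and vignette data, which this sketch does not reproduce.

```python
def prevalence(latent_severities, threshold):
    """Share of respondents whose latent severity exceeds the reporting threshold."""
    flagged = sum(1 for severity in latent_severities if severity > threshold)
    return flagged / len(latent_severities)


# Hypothetical latent severities (arbitrary units) for two small samples.
dutch_latent = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
us_latent = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8]

# Hypothetical reporting thresholds: the lower Dutch threshold means the same
# severity is more readily reported as a work disability.
dutch_threshold = 1.1
us_threshold = 1.5

# Each country reports on its own scale.
own_scale_gap = prevalence(dutch_latent, dutch_threshold) - prevalence(us_latent, us_threshold)

# Counterfactual: Dutch respondents are assigned the U.S. threshold.
common_scale_gap = prevalence(dutch_latent, us_threshold) - prevalence(us_latent, us_threshold)

print(f"gap on own scales:   {own_scale_gap:.0%}")
print(f"gap on common scale: {common_scale_gap:.0%}")
```

With these invented numbers the gap shrinks from 30 to 10 percentage points once both groups are evaluated on the same threshold. The remaining gap reflects the difference in latent severity itself, which mirrors the distinction in the text between reporting differences and differences in underlying health or accommodation.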
Within the basic structure, van Soest and his colleagues used several models to test the sensitivity of the main result to different model assumptions. Basically, they were all technical changes to the model, and not many changes in the results were observed. They consistently found that vignettes on work-limiting disabilities do help to correct for cross-country differences in scales used in self-reports. Corrections using vignettes reduced the estimated difference in work-limiting disability between the United States and the Netherlands by more than half. This result was robust to specification choices as long as the vignettes on all three domains (pain, cardiovascular disease, and mental health problems) were used.
What explains the remaining difference? That is something the researchers still do not know. Can the differences be explained as employer accommodation? It is possible that employers in the United States are more used to having employees with a disability than employers in the Netherlands, where it is traditional for people with disabilities not to work. The researchers were unable to study the distinction between health and employers’ accommodations to it.
Similar studies have been conducted in a number of European countries. In the Survey of Health, Aging, and Retirement in Europe (SHARE), similar questions were asked in eight countries in 2004. The COMPARE¹ subsample of the SHARE 2006–2007 with the same work disability vignettes found that, if everyone uses the same response scale, on the five-point scale the percentages responding that a person has no problem or just a mild work-related health problem are not too different between countries.
Finally, vignettes as a methodological tool can be applied in many other domains. In earlier work, they have been used by the Harvard Group and WHO in the fields of health and health care quality and political efficacy. In addition, the American Life Panel,² the Dutch CentERpanel,³ and the COMPARE samples on the age 50 and older populations in 10 European countries have vignettes on satisfaction with income, work, or daily activities, and general well-being.
¹ COMPARE is part of the family of research projects linked to SHARE. Data collection is parallel to the SHARE data collection in waves 2004 and 2006–2007 and follows the same procedures.
NEW APPROACHES TO COGNITIVE AND FIELD
TESTING OF DISABILITY MEASURES
Julie D. Weeks (National Center for Health Statistics, NCHS) began her presentation by emphasizing that it is nearly impossible to discuss the recent advances by NCHS in question testing and evaluation methods without first considering two international question development projects, because that work has informed and transformed the testing work. There are two characteristics of the question development projects that have significantly influenced the way in which the testing and evaluation methods have developed: the specific desire to not rely blindly on existing questions, which may erroneously be considered “gold standards,” and the fact that the questions are intended for use in trend analysis and cross-cultural comparative work. Weeks stated that she would describe the question development initiatives and then turn to the impact that these initiatives have had on the way cognitive question testing is conducted now, both at NCHS and at partner sites around the world.
Question Development
At the international level, there is largely an absence of comparable measures that can be used to paint a broad statistical picture of population health and disability. This is not to say that comparisons are not made (there certainly is information on births and deaths and life expectancy used to make general statements about the health of a population), but consistently measured, specific, standardized measures of health and disability status do not exist. Furthermore, standards with regard to the conceptualization, definition, and collection of those measures and the conduct of analyses typically are also lacking.
Under the auspices of the United Nations, national statistical offices, and the Conference of European Statisticians, two groups were formed and charged with developing such measures that would provide basic information on population health and disability, for both within-country and international comparisons. Those two groups are the Washington Group on Disability Statistics and the Budapest Initiative. The Washington Group on Disability Statistics operates under the aegis of the U.N. Statistical Commission. Its main purpose is the promotion and coordination of international cooperation in the area of health statistics by focusing on disability measures suitable for censuses and national surveys. The Budapest Initiative, which is formally the Joint United Nations Economic Commission for Europe/WHO/Eurostat Task Force on the Measurement of Health Status, was organized under the Work Programme of the Conference of European Statisticians. Its main purpose is the development of an internationally accepted standard set of questions for assessing general health state in the context of population surveys. An important objective of both efforts is to maximize the cross-individual and cross-population comparability of survey questions and resulting data.
² The American Life Panel is the U.S. analogue of the CentERpanel; it is representative of the U.S. population.
³ The CentERpanel is an Internet survey based on a random sample of the Dutch people aged 25 and older. It is administered by CentERdata, a research institute affiliated with Tilburg University.
Participants in the Washington Group include representatives from
over 60 countries, national statistical offices, international organizations,
and nongovernmental organizations, as well as some disabled persons’
organizations. Using ICF as a framework, the group’s interest is in the
measurement of basic actions at the level of the whole person. Disability is
defined as the intersection of basic actions and the environment and affects
participation in society. Disability is treated as a demographic variable,
comparing populations and subgroups by disability status.
The Washington Group first developed a short set of six disability questions that have been tested, adopted, and now are being included in plans for censuses around the world. The group is now engaged in the development of longer sets of questions that include increasingly complex activities and more domains of health.
The major focus of the Budapest Initiative is on the development of
measures suitable for population surveys that capture health status or
“health state.” In this context, health state reflects one’s functional ability
(“within the skin” as opposed to with the use of aids or other assistance);
that is, capacity, rather than performance, in a reasonable environment.
Like the Washington Group, one of the objectives of the Budapest Initiative is to develop a question set that describes individuals’ overall health state, by examining functioning in basic levels of activity across a number of health domains. A second objective is to describe trends in health over time within a country, across subgroups of a population, and across countries. In this way, something meaningful about a population’s health can be said when examining differences between countries and assessing trends over time.
Weeks noted that the objectives of the two groups are very similar. When their respective work groups mapped out the possible health domains and basic activities that could be measured, the list included six categories:
1. Mobility: walking, climbing stairs, bending, reaching or lifting, using hands
2. Sensory: seeing and hearing
3. Communicating: understanding and speaking
4. Cognitive functions: learning, remembering, making decisions, and concentrating
5. Emotional functioning: interpersonal interaction and psychological well-being
6. Other: affect, pain, fatigue, and self-care
The challenge that each group faced was demonstrating in some easily digestible way exactly where the questions being developed were located in what seemed like an ever-increasing map of health and disability. Both groups are measuring functional ability at the level of basic actions, but across multiple health domains. How does one clearly show what is being considered and what remains to be developed? Moreover, the Washington Group is defining disability as the intersection of the person and the environment, so one has to know something about the environment. Finally, it became increasingly apparent that in a survey setting, in which there is room for a relatively larger number of questions, the question of what additional aspects should be measured within any health domain adds to the complexity. After many months and iterations, a small group of members developed “the matrix.” At the simplest level, this matrix outlines in what areas work has occurred and in what areas it needs to continue in order to develop a full spectrum of questions on health and disability.
The goal is to populate each cell with questions that have been subject to rigorous testing so that countries can use comparable measures and can choose those measures that fit with their survey and budgetary agendas. Particularly noteworthy about this matrix is that it succinctly conveys the use of explicit definitions of disability (or health) and is, in essence, a roadmap for future survey and question development work. However, finding questions that have been cognitively evaluated is nearly impossible. Furthermore, in earlier efforts, it quickly became clear that one simply cannot rely on data collected in separate studies, nor can the findings be compared.
Question Testing and Analysis
The question testing and evaluation phase of work required nearly as much development as did the questions. Ultimately, the goal of both the Washington Group and the Budapest Initiative is to develop internationally comparable data that are suitable for censuses and surveys and that capture most disabled people (or the broad spectrum of health states) in a consistent fashion. The goal then, for the cognitive test, is to ensure that the questions meet those goals, without relying on a gold standard. Unfortunately, many of the traditional aspects of cognitive testing do not produce consistency or standardization. Furthermore, it is often impossible to assess if differences are “real.” Some of the aspects of traditional cognitive testing methods that hinder comparative analysis include small, nonrepresentative samples; nonstandard interviewing protocols, outputs, and reports; underdeveloped literature and practice regarding the rigor of analysis; and lack of standardized criteria for what constitutes a cognitive interview finding.
Also, in the area of disability measurement (as in so many other disci-
plines), existing questions are often used as gold standards, and new ques-
tions are evaluated by examining the relationship between the two. This is
a problem for several reasons. First, often the gold standard has not been
rigorously tested, so it is not clear what is being compared and which mea-
sure might, in fact, be superior. Moreover, the purpose of the gold standard
questions and new questions may differ, even slightly, so that making such
a comparison may not be entirely appropriate. Finally, the strategy does
not address cross-cultural comparability, unless it was addressed in the
development of the original question considered the gold standard, which
one would never know because questions currently in use rarely come with
evidence of such study.
The cross-cultural nature of the Washington Group and Budapest Ini-
tiative projects underscores the need to clearly demonstrate that a question
works and what is being measured. It is no longer sufficient to know just
that the “questions worked”; one needs the question wording, interpreta-
tion, and outputs, as well as how respondents interact with and answer the
question.
This need required a huge paradigm shift in the cognitive testing lab
and ultimately changed the way the testing and the evaluations are be-
ing conducted. In effect, the qualitative process is subject to far more of
the scientific principles associated with quantitative analysis: a structured
cognitive interview, data quality, data analysis (multiple levels of analysis,
including an examination of patterns of respondent interpretation and
calculation), and transparency and replicability in all processes, but, most
importantly, in the transformation of qualitative data into quantitative
results. In addition, these methods have to be implemented in a consistent
and standardized fashion across all of the participating countries.
Weeks next described the cognitive testing used by the Washington
Group and Budapest Initiative. One of the most important steps during the
initial phase is stating as specifically as possible the research questions to
be tested. Moreover, one does not simply want to know that the “questions
worked,” but rather:
• How do specific respondents move through the cognitive processes
(comprehension, retrieval, judgment, response)?
• How much error is there (false negatives and false positives)?
• Why are there differences and how does one account for the way
respondents with different socioeconomic conditions, cultures, and
languages interpret, consider, and respond to survey questions?
The challenge of answering these three questions is heightened even
further when attempting to design an internationally comparable measure
for a concept as complex and dynamic as disability.
Next in the process is putting together a testing work group that meets
on a regular, frequent basis. There is an initial, mandatory training meeting
for all participating countries (or sites). A purposeful sampling procedure is
designed; it is not a convenience sample. Translation is a major activity re-
quiring a great deal of time and care. As much time is spent in translation as
is spent in nearly all of the rest of the planning phase of testing. Ultimately,
if one is going to administer questions about how “sad, blue, or depressed”
a respondent is in one site (the United States, for example) one has to know
that these terms mean exactly the same thing in the other test sites. In Italy,
for example, there is no concept of the term “blue.” Even if one can find the
word, does it mean the same thing? Is it going to result in comparable data?
Finally, time-intensive work must be done to take notes and to translate
those notes into some kind of quantitative format. However, the data that
are generated are rich and very quantitatively informative.
When this work is completed, one has not only the typical qualitative
notes taken during a cognitive interview, but also a narrative that follows
a semi-scripted format, with as much detail as possible. In turn, those nar-
ratives are entered into QNotes (software designed at NCHS) and form
the basis of the data from the cognitive testing that are analyzed in a very
quantitative fashion. The results of this process are
• validity tied to rich detail,
• findings that are grounded,
• insight into question interpretation,
• insight into patterns of calculation, and
• knowledge of question performance.
Weeks next described what is different about the analysis stage. In
quantitative terms, she and other NCHS staff liken within-interview analy-
sis to frequencies, across-interview analysis to conducting crosstabs, and
across-subgroup analysis to controlling for specific variables. The point is
that each type of analysis offers a different kind of understanding. The goal
is to perform the type of analysis that answers the question of most interest,
although typically all three are carried out.
The analysis itself should be conceptualized in three distinct layers. The
first and simplest level of analysis occurs within the interview, specifically,
as the interviewer attempts to understand how one respondent has come to
understand, process, and then answer a survey question. The interviewer
must act as analyst during the interview, evaluating the information that the
respondent describes and following up with additional questions if there are
gaps, incongruencies, or disjunctures in the explanation. From this vantage
point (i.e., within a single cognitive interview), basic response errors, such
as recall trouble or misinterpretation, can be identified—errors that can be
linked to question design problems.
The second layer of analysis occurs through a systematic examination
of all interviews together. Specifically, interviews are examined to identify
patterns in the way respondents interpret and process the question. By
making comparisons across all of the interviews, patterns can be identified
and then examined for consistency and degree of variation among respon-
dents. Inconsistencies in the way respondents interpret questions may not
necessarily mean misinterpretation, but they can illustrate even the subtle
interpretation differences that respondents use as they consider the ques-
tion in relation to their own life circumstances. From this vantage point,
it is possible to identify the phenomena that are captured by the particular
survey question, illustrating the substantive meaning behind the statistic.
Additionally, from this layer of analysis, it is possible to identify patterns
of calculation across respondents. This is particularly useful in understand-
ing how qualifying clauses, such as “in the past 2 weeks” or “on average,”
affect the way respondents form their answer and whether respondents
consistently use the clauses in their calculation.
The last level, the heart of the cross-cultural analysis, occurs through
an examination of the patterns across subgroups, identifying whether par-
ticular groups of respondents interpret or process a question differently.
This level of analysis is particularly important because it is the level at
which the potential for bias arises.
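The analogy between the three layers and frequencies, crosstabs, and controlling for a variable can be sketched in a few lines. This is only an illustration under stated assumptions: the records, field names (`site`, `interpretation`, `used_time_frame`), and category labels are hypothetical and are not the QNotes format or actual Washington Group data.

```python
from collections import Counter, defaultdict

# Hypothetical coded findings: one record per cognitive interview for a single
# survey question. Field names and category labels are illustrative assumptions.
interviews = [
    {"id": 1, "site": "A", "interpretation": "physical difficulty", "used_time_frame": True},
    {"id": 2, "site": "A", "interpretation": "physical difficulty", "used_time_frame": True},
    {"id": 3, "site": "A", "interpretation": "emotional state", "used_time_frame": False},
    {"id": 4, "site": "B", "interpretation": "emotional state", "used_time_frame": False},
    {"id": 5, "site": "B", "interpretation": "emotional state", "used_time_frame": True},
    {"id": 6, "site": "B", "interpretation": "physical difficulty", "used_time_frame": False},
]

def frequencies(records, field):
    """Frequency analogue: how often each coded value occurs."""
    return Counter(r[field] for r in records)

def crosstab(records, row_field, col_field):
    """Crosstab analogue: joint counts of two coded fields across interviews."""
    table = defaultdict(Counter)
    for r in records:
        table[r[row_field]][r[col_field]] += 1
    return {row: dict(cols) for row, cols in table.items()}

def by_subgroup(records, group_field, field):
    """Subgroup analogue: the same frequencies, computed within each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_field]].append(r)
    return {g: frequencies(rs, field) for g, rs in groups.items()}

overall = frequencies(interviews, "interpretation")
pattern = crosstab(interviews, "interpretation", "used_time_frame")
per_site = by_subgroup(interviews, "site", "interpretation")
```

In this toy data the overall split between interpretations is even, while the per-site counts differ, which is the kind of subgroup pattern the third layer is meant to surface.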
Thus far, this testing protocol has been used by both the Washington
Group and the Budapest Initiative in approximately 30 countries. In a sub-
set of these countries, staff has also combined the cognitive testing with field
testing. Preparations are now being made for a combined cognitive and field
testing effort in the U.N. Economic and Social Commission for Asia and
the Pacific region, which will include Cambodia, Fiji, Maldives, Philippines,
Sri Lanka, and Vietnam. It has been a remarkable endeavor, one that has
produced exciting results in both the development of questions and testing
for cross-cultural purposes.
In summary, the Washington Group and Budapest Initiative question
development work is located at the most basic levels of activity and partici-
pation in core health domains. The goal is to measure ability (or inability)
to carry out basic activities and to treat disability as a demographic vari-
able, comparing participation by disability status. In this context, health
state reflects one’s functional ability “within the skin” (without the use of
aids or other assistance), rather than performance, in a reasonable environ-
ment. A new methodology for integrating cognitive testing concepts into a
standardized, quantitative testing procedure was developed in the process in
order to meet the specific need of testing measures of disability and health
state suitable for international comparisons.
Information was collected on the question response process, patterns of
interpretation, and evaluation of decision-making patterns. This informa-
tion was then used to help identify potential response error and to test the
suitability of the questions for the purpose for which they were designed:
to generate a meaningful, internationally comparable general prevalence
measure for disability. The pattern analysis was particularly advantageous.
Examining consistencies and inconsistencies across various questions allowed
for an evaluation of the Washington Group questions without establishing
existing questions as a gold standard. This pattern analysis allowed for old
and new questions to be compared while maintaining a neutral or agnostic
view of the other questions.
The results indicate the usefulness of this approach for testing the de-
sign of cross-national indicators, as well as lending support to the reliability
of the particular measures developed (and now adopted) by the Washington
Group on Disability Statistics.
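One way to picture comparing old and new questions without treating either as a gold standard is a symmetric agreement statistic plus a list of discordant cases to revisit in the interview narratives. The sketch below uses Cohen's kappa, which is a choice made here for illustration and not a method the workshop presentation attributes to the Washington Group; the paired answers are hypothetical.

```python
from collections import Counter

def cohen_kappa(pairs):
    """Chance-corrected agreement between two questions answered by the same people.

    `pairs` is a list of (answer_to_existing_question, answer_to_new_question).
    The statistic is symmetric, so neither question is treated as the truth.
    """
    n = len(pairs)
    if n == 0:
        raise ValueError("no paired answers")
    observed = sum(1 for a, b in pairs if a == b) / n
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    # Agreement expected by chance from the two marginal distributions.
    expected = sum(first[c] * second[c] for c in first) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa undefined: both questions give one constant answer")
    return (observed - expected) / (1 - expected)

def discordant_cases(pairs):
    """Indices of respondents whose two answers differ, for narrative review."""
    return [i for i, (a, b) in enumerate(pairs) if a != b]

# Hypothetical paired answers (existing question, new question).
pairs = ([("yes", "yes")] * 4 + [("no", "no")] * 4
         + [("yes", "no")] + [("no", "yes")])

kappa = cohen_kappa(pairs)
to_review = discordant_cases(pairs)
```

The discordant indices are the useful output for qualitative work: they point to the interviews whose narratives may explain why the two questions diverge.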
DISCUSSION
The discussion focused on three issues: vignettes, item pools and CAT,
and the Washington Group.
Vignettes
Several participants asked questions: Are some of the differences in
reporting on work disability in the Netherlands and the United States
explained by differences in the interview context and the organizations
conducting the surveys? Can some of the differences be explained through
the framing effect, that is, the location of the questions in the questionnaire?
Can some of the results showing that African Americans have disability
profiles at ages 10 years younger than whites be explained using vignette
methodology? Has the vignette methodology been used to look at racial or
ethnic differences within the United States?
Arthur van Soest responded that the surveys were Internet surveys in
both countries. However, he noted that they were not exactly the same
complete survey, so there is possibly a framing effect. One of the interpreta-
tions of what the authors found is that the country’s, and its institutions’,
specific context does make a difference. The researchers also experimented
with telephone interviews, but reading all those hypothetical stories did
not work as well. One kind of experiment could be to ask somebody in
the Netherlands about a hypothetical person in the United States and vice
versa. However, they thought that would be too confusing and so did not
use that approach.
He noted that in the Netherlands and U.S. study, race was not included,
mainly because in the Netherlands there is not enough variation in the data;
items such as education, gender, and age were included. In response to the
question about whether the vignette methodology has been used to look
at racial or ethnic differences in the United States, van Soest replied that it
could be used in such surveys as the Health and Retirement Study, but he
does not know if it has been used. The focus for using vignettes to date has
been almost exclusively on cross-national comparisons.
Item Pools and Computer-Adaptive Testing
Noting that the vignette scheme is particularly suited to identifying and
making adjustments for DIF, Robert Hauser wondered whether an item
pool for CAT would be valid if it had DIF in the items. To what extent have
item pools for CAT been tested for DIF?
Whether an item bank shows DIF depends on where one looks for it.
There are many different levels at which one can look. The small number
of items that show differential functioning can be removed from an item
bank, especially if the bank has hundreds of items. Also, one can calibrate
a group-specific item differential so that, for example, it would have a dif-
ferent item differential for Dutch and U.S. populations. It would be very
similar to the vignette scheme.
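The idea of screening an item bank and setting aside the few items that function differently across groups can be sketched as follows. The Mantel-Haenszel common odds ratio is used here as one widely known DIF screening statistic; the discussion does not say which statistic was used, and the item names, counts, and the 0.5 to 2.0 flagging band are hypothetical assumptions for illustration.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across strata of similar overall score.

    Each stratum is (ref_yes, ref_no, focal_yes, focal_no): counts of
    respondents endorsing or not endorsing the item in the reference and
    focal groups. A value near 1.0 suggests little DIF.
    """
    num = 0.0
    den = 0.0
    for ref_yes, ref_no, focal_yes, focal_no in strata:
        n = ref_yes + ref_no + focal_yes + focal_no
        if n == 0:
            continue  # an empty stratum carries no information
        num += ref_yes * focal_no / n
        den += ref_no * focal_yes / n
    if den == 0:
        raise ValueError("odds ratio undefined for these counts")
    return num / den

# Hypothetical item bank: two score strata per item.
item_bank = {
    "walk_100m": [(30, 10, 30, 10), (20, 20, 20, 20)],
    "feel_blue": [(30, 10, 15, 25), (20, 20, 8, 32)],
}

# Illustrative band; a real study would choose and justify its own criterion.
LOW, HIGH = 0.5, 2.0
flagged = [
    name for name, strata in item_bank.items()
    if not LOW <= mantel_haenszel_odds_ratio(strata) <= HIGH
]
retained = [name for name in item_bank if name not in flagged]
```

Flagged items could either be dropped from a large bank or kept with group-specific calibration, which are the two options mentioned in the discussion.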
It was noted that application of these methods across different popu-
lation groups would be important. Gender is an example. The standard
general health questions do not show any gender differences, but adjust-
ments using these methods might show that there is a gender difference and
that gender accounts for the difference between self-reported symptoms, for
which there are always gender differences in the general health.
The Washington Group
A participant asked whether there are some domains that the group simply
cannot get to work across the multiple sites being evaluated.
Julie Weeks responded that, anticipating that some domains would be
harder than others to work with across the multiple sites being evaluated,
the group started with some of the easier domains. In the Budapest Initia-
tive, they are encountering difficulty with two domains—how to ask about
pain and fatigue cross-culturally. How people interpret the concepts of pain
and fatigue and how much they are willing to admit to having them are
very different. Those two domains are in the third round of testing, without
success as yet. Obviously, this work needs to continue.
Connie Citro (Committee on National Statistics, DBASSE) commended
the Washington Group initiative, which is clearly going back to basics as
the way of making a start at getting some very carefully tested questions
that will provide basic monitoring information across a whole range of
countries. She said that is very important work.