OCR for page 31
3
Potential Methods for Revising
Measures to Foster Comparability
Across Subgroups
This chapter focuses on potential methods for refining or augmenting
current measures of late-life disability used in population surveys
to foster comparability across key subgroups. The presentations
covered four topics:
1. Performance measures in population surveys
2. Improving patient-reported measurement of disability using item
response theory (IRT) and computer-adaptive testing (CAT)
3. The possible use of easily collected biomarkers of chronic diseases
to supplement ADLs and IADLs, which may be able to track decline
in functionality over the life course and capture change in
functionality across thresholds
4. The potential for using time-use data to augment existing measures
of ADLs and IADLs
PERFORMANCE MEASURES IN SURVEYS
Jack Guralnik (National Institute on Aging, National Institutes of
Health [NIH]) focused on estimating functional status in surveys using
performance measures and on identifying points across the spectrum of
performance that are associated with self-reported disability in different
population groups. He briefly described several studies he has undertaken
with colleagues with some new comparisons both across countries and
among U.S. surveys.
IMPROVING THE MEASUREMENT OF LATE-LIFE DISABILITY
Guralnik observed that the Nagi theoretical model of the pathway from
disease to disability has been very helpful in terms of operationalizing the
assessment of steps along the pathway and particularly useful in thinking
about where performance measures fit. Objective measures of performance
can be taken at several of the steps in the model: impairment, functional
limitation, and disability. At the impairment step, they objectively measure
physiologic functioning. At the final step, disability, one may be observing
people in standardized home-type environments. However, performance
measures, such as gait speed, chair rises, and pegboard tests, have been used
mostly in the domain of functional limitations.
Guralnik offered three performance assessments to illustrate the points
made in his presentation: gait speed, the index of mobility-related physical
limitations (MOBLI) developed by Lan and Melzer (Lan et al., 2002), and
the Short Physical Performance Battery (SPPB).
MOBLI was developed using data from the National Health and Nutri-
tion Examination Survey III (NHANES III), empirically looking at measures
that were related to mobility, which include gait speed, chair rises, and a
pulmonary function test. The index was then validated in other studies.
The components of SPPB include timed standing balance, a timed 4-meter
walk, and timed multiple chair rises. This battery was first developed in the
Established Populations for Epidemiologic Studies of the Elderly (EPESE)
in 1988. SPPB has very good psychometric properties: It predicts mortal-
ity, nursing home admission, new disability, and health care expenditures,
among other things, and it has good reproducibility. It is sensitive to clini-
cally important change.
One issue with performance testing in general is that the scoring of
some tests has no way of handling people who are unable to perform the
task. For example, if gait speed is used, what score should be assigned to
someone who cannot walk at all? People have approached this issue in
different ways, but it is a limitation of performance measures that
researchers rarely address. Even in determining why a test was not done,
people often struggle with the data, trying to understand whether the data
are missing because the person really was not able to do the test, and so
should be scored as a 0 or given a poor score, or whether the person simply
refused.
Sometimes even refusals can be vague. People may refuse because they
are afraid to do the test, knowing that they will be unable to do it.
Sometimes the responsibility for a refusal is placed on the examiner,
which is a bit unfair, but it is sometimes hard to sort out when the
researcher does not know what the data on the performance test mean. One
solution to this problem, used in the SPPB, is to create categorical scores
that cover the range of functioning and give a 0 score to those unable to
do the test.
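The categorical approach described above can be sketched in code. The following is a minimal illustration, not the published SPPB scoring: the function name and the cut points (in seconds) are hypothetical placeholders chosen for the example. It shows only the idea of mapping a timed test onto ordered categories while reserving 0 for people who are unable to do the test.

```python
from typing import Optional, Sequence

# Hypothetical cut points in seconds for a timed walk.
# These are placeholders for illustration, NOT the published SPPB thresholds.
EXAMPLE_CUTS = (4.0, 6.0, 8.0)


def categorical_score(time_s: Optional[float],
                      cuts: Sequence[float] = EXAMPLE_CUTS) -> int:
    """Map a timed test to an ordered category.

    time_s is None when the person was unable to do the test; that case
    scores 0. Otherwise the score runs from 1 (slowest) to len(cuts) + 1
    (fastest).
    """
    if time_s is None:
        return 0
    if time_s <= 0:
        raise ValueError("a completed test must have a positive time")
    # Each cut point that the time exceeds lowers the score by one category.
    slower_than = sum(time_s > c for c in cuts)
    return len(cuts) + 1 - slower_than
```

In this sketch a refusal would have to be recorded separately as missing data rather than passed as None, which is exactly the distinction the text notes can be hard to make in practice.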
Gait speed is an important performance measure. It is a simple test,
but it is highly predictive, and interest in it has been increasing
recently, with many longitudinal studies showing a clear stepwise gradient
of greater risk for mortality with decreasing gait speed. Analysis of data
from the InCHIANTI Study, a population-based study in the Chianti region
of Italy (Alessandro Ble and Luigi Ferrucci, Longitudinal Studies Section,
Clinical Research Branch, National Institutes of Health, Baltimore, MD,
unpublished data), shows a graded response for mortality according to
quintiles of preferred walking speed. In this analysis, persons with cancer
actually showed better survival than persons in the lowest quintile of
gait speed at baseline.
In recent work done in the Whitehall Study of British civil servants
(Brunner et al., 2009), it was found that gait speed rose steadily across
the six employment grades that were used to classify participants, none of
whom was poor and all of whom were full-time employees. It was impres-
sive just how sensitive the gait speed was to employment grades, which
range from the highest (administrative level) to the lowest (clerical) level.
Gait speed is picking up something about the health disparities across this
gradient of socioeconomic status in a very impressive way.
The question often asked is whether performance tests can replace
self-reports, whether both should be done, or which one should be used in
what situations. Most people who work in the field have generally agreed
that self-reports and performance tests are really complementary. They are
measuring different concepts, different aspects of functioning; there is a fair
amount of evidence to support this view. One example is work on which
Guralnik collaborated with David Reuben (Reuben et al., 1990). The study
population was stratified in two ways, according to self-reports of being
independent in mobility and ADLs and according to categories of SPPB, and
mortality was studied as an outcome. In the group reporting no disability,
there was a clear grading of mortality risk across SPPB
scores. The same was true with the group that was dependent in mobility
but independent in ADLs. Complementary information is being picked up,
and the performance batteries are showing something that is not available
from self-reports.
Finally, among the most severely disabled subset of this cohort, those
who were dependent in mobility and with one or more ADL disabilities,
there were high rates of mortality. Few people in this subset have high SPPB
scores, but even across the remainder of the SPPB spectrum, there was not
much of a gradient for mortality risk. Therefore, at the very disabled end
of the spectrum, performance measures may not be adding much to the
estimation of prognosis, but for those with little or no disability, performance
measures make a valuable contribution to characterizing prognosis.
Performance Measures in Large Surveys
Banks and colleagues (2006) used both gait speed and the SPPB from
the English Longitudinal Study of Aging (ELSA) in its 2004 wave. They
used cut points for poor functioning of less than or equal to 8 on the SPPB
or gait speed of less than or equal to 0.5 meter per second. The cut points,
previously shown to be related to high risk of future adverse events, show
a clear age effect, with much higher proportions of people in the 80 years
and older group being in the high-risk group according to the SPPB and
gait speed, as well as a difference between men and women, with women
having poorer functioning.
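As a rough illustration of how such cut points might be applied to survey records, the sketch below converts a timed walk to gait speed and flags poor functioning using the two thresholds quoted above (SPPB of 8 or less, gait speed of 0.5 meter per second or less). The function names are hypothetical, and treating the two measures as separate flags is an assumption of the example rather than a description of the published analysis.

```python
SPPB_POOR_MAX = 8        # SPPB score at or below this value counts as poor
GAIT_POOR_MAX_MPS = 0.5  # gait speed (meters per second) at or below this counts as poor


def gait_speed_mps(distance_m: float, time_s: float) -> float:
    """Gait speed in meters per second from a timed walk."""
    if distance_m <= 0 or time_s <= 0:
        raise ValueError("distance and time must be positive")
    return distance_m / time_s


def poor_by_sppb(sppb_score: int) -> bool:
    """True when the SPPB score falls at or below the cut point."""
    return sppb_score <= SPPB_POOR_MAX


def poor_by_gait_speed(speed_mps: float) -> bool:
    """True when gait speed falls at or below the cut point."""
    return speed_mps <= GAIT_POOR_MAX_MPS
```

For example, a 4-meter walk completed in 10 seconds gives a speed of 0.4 meter per second, which falls below the gait speed cut point.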
The presentation by Guralnik compared data for SPPB scores of less
than or equal to 8 in EPESE, ELSA, and the InCHIANTI study, with sub-
stantially poorer performance seen in EPESE. However, because the EPESE
was conducted 10 to 15 years earlier than the other studies, the trends in
observed performance may mirror the trends toward less self-reported dis-
ability. Similar effects were observed for men as well as for women.
Gait speed showed somewhat different results. For gait speed, Guralnik
included NHANES III data from 1988 to 1994. The InCHIANTI popula-
tion showed a substantially smaller proportion of individuals with slow
gait speed, for both men and women. Some of this difference may be real;
some of it may be that the test was done in a slightly different way. In the
InCHIANTI study, the researchers used automatic timers and participants
took a step before the timers were tripped, whereas in the other studies,
stopwatches were used and the time was measured from a standing start.
It is likely, however, that the Italians, who tend to walk much more than
Americans, do have less mobility limitation.
In the NHANES III data from 1988 to 1994, an 8-foot walk was mea-
sured. The 2001/2002 NHANES did the 20-foot walk but also timed the
first 8 feet to make comparisons with the 1988 data. The results showed
a large reduction from one time to the next in the proportion of people
who have very slow gait speed. However, some of the difference may be
explained by the fact that these tests were done somewhat differently, and
in a 20-foot walk people may see a longer walk ahead of them and may go
faster for the first 8 feet.
Using Performance Measures to Calibrate Self-Reports
Guralnik stated that he found two examples that represent a way of
using performance measures of functioning to calibrate responses to
self-report items in questionnaires. In the first example, from the World
Health Organization (WHO), Iburg and colleagues (2001) used a modeling
technique called Hierarchical Ordered Probit Modeling. They used performance
tests from the NHANES III and created a vector of performance that they
considered similar to a latent variable representing the true underlying level
of performance. Then they used the model to look at how different subgroups
reported disability at different levels of this background latent variable.
This analysis was done both for physician reports and self-reports.
Guralnik and Melzer (Melzer et al., 2004) did similar kinds of analyses
using MOBLI, derived from the NHANES, and observed similar results.
People who were 60–69 years old did not report disability until they had a
poorer level of performance than people who were older. Also, large
differences were observed between men and women, with men not reporting
disability until reaching lower levels of performance than women. There
were also differences in disability cut points by race, with blacks and
Hispanics not reporting disability until their background level of
functioning was poorer than the level at which whites reported disability.
For income, people with the highest income did not report disability until
their performance was at a poorer level than that of lower income people
who reported disability. Higher income people may be denying their
disability, or they may be able to compensate successfully for a lower
level of functioning.
This kind of approach can be very useful. Comparison of U.S. data with
those from the Longitudinal Aging Study Amsterdam (Melzer et al., 2004)
showed that people in the Netherlands did not report their disabilities
until they had more severe levels of background dysfunction. Therefore,
the lower levels of self-reported disability in the Netherlands could be ex-
plained, at least in part, by this differential reporting as it relates to level
of background performance.
In conclusion, Guralnik observed that there are potential applications
of performance measures in improving population surveys of disability,
particularly in making comparisons across subgroups of a population and
for cross-national and cross-cultural comparisons. Trends over time can
be directly observed with performance testing, but this will require strict
standardization of test administration and quality control procedures to en-
sure that the tests are administered precisely the same way in every survey.
Performance tests can be used to identify high levels of functioning, which
cannot be done well with self-reported disability. They can be used to iden-
tify nondisabled persons at increased risk of disability, sometimes referred
to as preclinical disability. The concept of calibrating self-reports by using
a background measure of performance could be quite valuable. It may be
that even something as simple as gait speed could be used for this kind of
calibration and could be valuable for cross-national studies.
IMPROVING PATIENT-REPORTED MEASURES USING ITEM
RESPONSE THEORY AND COMPUTER-ADAPTIVE TESTING¹
Karon Cook’s (University of Washington) presentation covered four
topics:
1. A brief introduction to IRT and CAT
2. Description of Patient-Reported Outcomes Measurement Informa-
tion System (PROMIS), which applies these methodologies
3. Opportunities and barriers to using modern psychometric methods
in population surveys and monitoring population trends
4. Envisioning the future and how modern measurement methods
might be helpful in advancing disability research
Item Response Theory
IRT models are probability-based models in which both the levels of
the trait being measured (e.g., physical function) and the difficulty of the
item are located on a common underlying continuum, or “ruler.” The prob-
abilities of answering in particular ways to items that ask about the trait
being measured are modeled as functions of how much of the trait a person
has relative to the difficulty of the items.
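The simplest case of such a model, the Rasch model for an item with two response options, can be written in a few lines. This is a sketch for illustration only: the function name is hypothetical, and real instruments often use items with more than two response categories, which need a more general model.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of endorsing a two-option item under the Rasch model.

    theta is the person's trait level and difficulty is the item's location;
    both sit on the same underlying continuum (the common "ruler").
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))
```

When the trait level equals the item difficulty the probability is one half; it rises toward 1 as the trait level moves above the item's location and falls toward 0 as it moves below.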
Classical test theory and IRT are different in several ways. Commonly
used reliability and validity estimates are based on classical test theory, in
which scores on measures are usually obtained by manipulating the item
scores (e.g., summing to get a total score). In IRT, scores on measures are
obtained on the basis of probability functions, not by averaging or totaling
item scores.
Another important difference is that, in classical test theory, unlike
IRT, variations in difficulty or intensity of items are not accounted for. For
example, a shoulder function scale score on an item that asks about throw-
ing a softball overhand 20 yards is weighted the same as an item that asks
about using the affected arm to flip a light switch. In IRT, differences in
difficulty or intensity are accounted for. In some IRT models, item discrimi-
nation is also included.
Yet another difference is that, in classical test theory, the scores are
ordinal-level indicators of individual differences. With IRT, especially with
the Rasch model, scores at least approximate interval-level measurement.
There is a great deal of debate about how well the scores approximate
equal-interval measurement, but they come closer than classical test theory
scores.
¹ For a general review of IRT, CAT, and PROMIS, see De Ayala (1993); Ware et al. (2000);
Cook et al. (2005); Cella et al. (2007).
Thus, with classical test theory, the focus is on total score or average
score or something along those lines; in IRT the focus is on the item re-
sponse. The IRT approach gives a great deal more flexibility when develop-
ing measures because one gets more information about specific items and
how they function.
A very important difference is in the area of reliability and precision.
Researchers typically say that a measure has a reliability of, for example,
0.89. Intuitively, everyone knows that it is very likely that a measure will
measure at different levels of a trait with different levels of precision, but
with classical test theory, all one gets is an average. With IRT, one gets an
estimate of the precision of a measure for every level of the trait that is be-
ing measured, which gives a great deal more information.
Cook explained that IRT is a mathematical model—a probability
model. IRT models estimate how likely persons are to respond in particular
ways to a particular item, depending on how much people have of the trait
being measured, and what the characteristics of the item are (e.g., how
difficult). What is not in the model is the total score. With IRT, different
people can answer different items yet their scores are estimated on the same
mathematical metric.
Information function in IRT is analogous to reliability in classical test
theory. The information function has an inverse relationship to the standard
error of measurement; that is, when precision is high, standard errors are
low, and when precision is low, standard errors are high. Thus, one can
identify what ranges of the trait level are measured with more precision and
what areas are measured with less precision. IRT information functions can
be estimated at both the scale and the item level.
Computer-Adaptive Testing
IRT is the math behind a very important application—CAT. CAT is a
process of measuring in which not all available items are administered to
any one respondent. Instead, the items chosen for a particular person are
based on that person’s responses to the previously administered items.
CAT begins with what is called a large “item pool.” Then items of the
pool are calibrated in advance on the basis of the known characteristics of
the items, their difficulty or intensity, or, in some cases, their discrimination
parameters. Once the item pool is calibrated, it is called an “item bank.”
Cook then described how CAT works. An initial item is presented to
a person and that person responds. Then a gross estimate is made of that
respondent’s level of the trait being measured. On the basis of that
trait-level estimate, the next item chosen from the item bank is the one that
gives maximum additional information. Each item has its own information
function, and the computer algorithm identifies the item that will give the
greatest additional precision for the person’s particular trait level. The test
administration stops when a predefined “stopping rule” is reached, such as
stopping when a person gets to a certain level of precision or stopping after
asking a specific number of items.
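The loop just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of any particular CAT system: items are two-option with a two-parameter logistic model, the trait estimate is a posterior mean computed on a fixed grid with a standard normal prior, and all names (run_cat, eap_estimate, the bank layout) are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Item = Tuple[float, float]  # (discrimination a, difficulty b); layout assumed for this sketch

GRID = [x / 10.0 for x in range(-40, 41)]  # trait levels from -4.0 to 4.0


def p_endorse(theta: float, item: Item) -> float:
    a, b = item
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_info(theta: float, item: Item) -> float:
    p = p_endorse(theta, item)
    return item[0] ** 2 * p * (1.0 - p)


def eap_estimate(responses: Sequence[Tuple[Item, bool]]) -> Tuple[float, float]:
    """Posterior mean and standard deviation of the trait on the grid."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized standard normal prior
        for item, endorsed in responses:
            p = p_endorse(t, item)
            w *= p if endorsed else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank: Sequence[Item],
            answer: Callable[[int], bool],
            se_target: float = 0.4,
            max_items: int = 10) -> Tuple[float, float, List[int]]:
    """Administer items adaptively; answer(i) returns the response to item i."""
    remaining = list(range(len(bank)))
    responses: List[Tuple[Item, bool]] = []
    asked: List[int] = []
    theta, se = eap_estimate(responses)  # starting estimate from the prior alone
    # Stopping rule: precision reached, item limit reached, or bank exhausted.
    while remaining and len(asked) < max_items and se > se_target:
        # Pick the unused item with the most information at the current estimate.
        best = max(remaining, key=lambda i: item_info(theta, bank[i]))
        remaining.remove(best)
        asked.append(best)
        responses.append((bank[best], answer(best)))
        theta, se = eap_estimate(responses)
    return theta, se, asked
```

A call such as run_cat(bank, answer) returns the final trait estimate, its standard error, and the indices of the items that were administered, so different respondents can end up answering different items while being scored on the same metric.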
CAT is not the only application of IRT. It is important to know that
with a calibrated bank of items, static instruments can be developed, that
is, instruments in which everyone answers the same items. Also, an excit-
ing application is the ability to construct different short forms that target
specific clinical populations or specific measurement contexts. For example,
one might want a very short measure if respondent burden is an issue. One
might want a longer one if precision is of more interest. One might want
to target lower levels of the outcome if that is the population of interest, or
one might choose items from the bank that seem particularly relevant for
a given clinical population.
The important thing to know about item banks is that whether one
administers the items with CAT, a long, static instrument, or one of several
short forms, the scores are reported on the same mathematical metric. They
are not based on a total score but rather on a probability function that
takes into account a person’s level of trait and responses to items and the
characteristics of the items.
Patient-Reported Outcomes Measurement Information System
Patient-Reported Outcomes Measurement Information System (PROMIS)
is an example of an application that uses IRT and CAT; it is funded by
NIH. The goal was to develop item banks that measure patient-reported
outcomes (PROs) across many different chronic conditions. The focus was
on PROs, such as pain, fatigue, physical function, social function, depres-
sion, and sleep. Part of the mandate also was to create a computer-adaptive
system for administering PRO-based tests to measure such outcomes. The
item banks that were developed have the flexibility to create multiple short
forms to measure the same traits on the same metric.
Cook explained that one of the most helpful things about PROMIS is that
scores on all measures have been calibrated to the general U.S. population.
For example, for fatigue scores, the mean for the U.S. population is based
on a sample weighted to the U.S. census: The mean is 50 and the standard
deviation is 10. Suppose one gives the PROMIS fatigue measure to a
particular sample and the average score is 60. This score has inherent
meaning: In comparison with the general population, the study population is
one standard deviation above the mean. This is much better than using
traditional measures, in which scores usually have meaning only to people
who have long experience using them.
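The arithmetic behind this kind of norm-referenced score is simple enough to sketch. The example assumes a trait estimate expressed relative to a reference population mean and standard deviation; the function names are hypothetical and the code is not taken from any PROMIS software.

```python
def to_norm_score(theta: float, ref_mean: float = 0.0, ref_sd: float = 1.0) -> float:
    """Rescale a trait estimate so the reference population has mean 50, SD 10."""
    if ref_sd <= 0:
        raise ValueError("reference standard deviation must be positive")
    return 50.0 + 10.0 * (theta - ref_mean) / ref_sd


def sd_units_from_mean(norm_score: float) -> float:
    """How many reference-population standard deviations a score lies from the mean."""
    return (norm_score - 50.0) / 10.0
```

With these definitions, an average score of 60 corresponds to sd_units_from_mean(60.0) == 1.0, matching the interpretation given in the text.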
Opportunities and Barriers
Cook asked: Are these methods—IRT and CAT—appropriate for sur-
veying populations and for monitoring trends? She said yes; in fact, these
particular methods offer some distinct advantages, such as the item-level
approach, that are very helpful in developing better measures. With trait-
specific standard errors, one knows how well one is measuring portions of
the population. CAT offers measurement efficiency, and it can be adminis-
tered in a lot of different ways—by Internet, by telephone, or in person.
IRT also allows linking new instruments to legacy instruments through
concordance tables. If two measures are measuring the same trait, it is pos-
sible to link them and do a crosswalk between them so that the results of
two studies can be compared, even if they used different measures for the
same trait.
Cook noted, however, that there are downsides to using IRT and
CAT. One is that calibration to an IRT model requires specialized and not
particularly user-friendly software and specialized expertise. Also, to use
CAT, for example, the respondent or an interviewer has to interface with a
computer. If it is an interviewer, then mode effects are introduced that might
be problematic. Also, unique qualities of IRT-based measurement require
meeting assumptions of the model, and these are not always easy to meet.
Challenges in disability measurement with these particular models are
substantial, and so are the advantages. Some of the disadvantages are
not limited to the newer methods, however. Because IRT and classical
test theory assume unidimensionality, both are probably better suited to
measurement of functional limitations than of disability. Disability often
gets defined as a multidimensional, interactional, and social construction.
Defined as such, it does not lend itself to either IRT or classical test theory
methods. Functional limitations typically are defined in much narrower
terms and are better suited to measurement models.
Envisioning the Future
In summing up, Cook noted that the psychometric methods that have
been developed in the past few years have improved exponentially and
have increased researchers’ ability to develop good measures in terms of
psychometric properties. However, the ability to assign any kind of meaning
to those scores is lagging behind. That is an area in need of further effort.
Norm referencing is one possibility, but a great deal more needs to be done.
Levels and changes in levels of outcomes associated with maintenance of
capacity and onset of disability need to be identified. The newer
psychometric methods she discussed have some unique properties that make
them suitable for this endeavor, but they will require longitudinal
studies, close monitoring of functions and outcomes, and identification of
measurable “marker” clinical events that are associated with changes in
PROs.
USE OF EASILY COLLECTED BIOMARKERS
OF CHRONIC DISEASES
David Weir’s (University of Michigan) presentation focused on the
possible use of easily collected biomarkers of chronic diseases to supplement
ADL and IADL measures; such biomarkers may track decline in functionality
over the life course and capture changes in functionality across thresholds.
He based his remarks on the results from the 2006 major redesign of the
Health and Retirement Study (HRS).
HRS is a longitudinal survey of 22,000 Americans over age 50,
who are interviewed every 2 years. Prior to the 2006 redesign, HRS was
primarily a telephone survey conducted every 2 years, which included
fairly useful self-reports on, among other items, functional limitations,
chronic conditions, and care received. Beginning in 2006, the sample was
randomly split into two halves: in one, participants are interviewed in
person every 4 years beginning in 2006; in the other, they are interviewed
in person every 4 years beginning in 2008. The in-person interviews in-
clude anthropometric measures, performance measures, dried blood spots,
DNA samples, and some other measures not included in the telephone
interviews.
Weir explained that the scientific focus of HRS is on two main areas.
One area is biomarkers in the narrow sense of biological samples, which
are focused essentially on measures of cardiovascular risk. Those biomark-
ers are relatively easy and straightforward to measure and are of great
importance and high prevalence in the population, such as blood pressure,
cholesterol, hemoglobin A1c measure of blood glucose, C-reactive protein,
waist size, height, and weight. They are also closely related to obesity and
metabolic syndrome, which are looming public health concerns. The second
area of focus is on physical performance measures, which are targeted more
at the older population as measures of frailty.
These two areas can be brought together by taking into account the
relationship between chronic disease and disability. Most disability is a
product of chronic disease. The chronic diseases that directly produce
disability, such as stroke, heart disease, and cognitive impairment, are
themselves often produced by other antecedent conditions (e.g.,
hypertension, diabetes) that often have few symptoms. There is a need to
model these processes, as suggested in a previous presentation, beginning
long before
people have difficulties with ADLs or IADLs. To understand the total pro-
cess by which people become disabled, it is necessary to look at the whole
evolution of chronic disease.
Disability starts long before a person experiences limitation in ADLs.
Some measures that are sensitive at those earlier stages are needed. In
HRS, there are 12 items, which include such Nagi items as walking several
blocks, climbing stairs, and pushing a heavy object. These items are quite
useful at documenting the earlier stages of disability. The HRS data indicate
that ADL and IADL limitations really only begin around age 75. There
are some people with limitations at earlier ages, but these cases mostly do
not reflect changes by age. Rather, ADL and IADL limitations are really a
feature of the very old. The percentage of people who receive more than 1
hour of care per day also is very low prior to about age 75; after age 75,
that percentage increases very rapidly. However, when people who report
having no ADL or IADL limitations and therefore are not reporting any
hours of care are asked how many of the Nagi limitation items they have
any difficulty with, the percentage also rises with age in a very linear way.
If people who reported no ADL or IADL difficulties in 2004 are arrayed by
the number of Nagi limitation items they had in 2004 and then compared on
whether they had an ADL or IADL difficulty by 2006, a very graded
relationship is seen. Just counting the number of these difficulties
provides some insight into the people who are at risk for developing
further disability.
Chronic Disease and Disability
In discussing chronic disease and disability, Weir used a combined measure
that is the sum of the Nagi items plus ADL limitations plus IADL limitations.
As stated above, chronic disease underlies most disability, even at
younger ages. People under 62 years of age with disability that prevents
them from working are eligible for Social Security Disability Insurance
(SSDI). The distribution of SSDI recipients by the cause of disability on
which the disability award is based shows that injuries account for less
than 5 percent and infectious disease for about 2 percent. The percentage
of recipients reporting diabetes is about the same as for injuries overall
(4.6 percent). Cardiovascular disease is about twice that (10.4 percent),
and arthritis is more than twice as large again (23.6 percent). Psychiatric
conditions taken together, of which depression is the largest, are the
single largest cause of reported disability for SSDI recipients
(27.9 percent). Even at younger ages, at which most people might think of
the disabled as being physically injured, most disability results in some
way from chronic conditions. Psychiatric conditions are actually quite
important even at younger ages.
In the HRS population, the number of physical limitations rises linearly
with the number of chronic disease diagnoses regardless of age, although
at 75 years and older the number of limitations reported increases slightly,
even at the same number of chronic conditions.
Obesity and Disability
Weir stated that the relationship between obesity and disability is com-
plex. One needs to consider multiple measurement perspectives on obesity.
There may be direct effects on mobility: For example, it is harder to move
around a lot of weight than a little weight. A somewhat less direct effect
is cumulative stress on joints from managing the excess weight. Even less
direct effects are through the risk of cardiovascular disease, which may
take a long time to manifest itself. There are also inverse effects, such as
sarcopenia (the age-related loss of muscle mass, strength, and function)
and weight loss.
In HRS, a number of indicators were examined by quantiles: each variable
was cut at every 5th percentile, and the mean number of limitations was
analyzed at each level of the variable. The relationship between disability and
weight, as measured by body mass index (BMI), is nonlinear. The lowest 5 percent
of the BMI distribution is very disabled; this is the frail group. There is relatively little variation for
most of the U.S. population, indicating that a range of moderate obesity has
relatively low correlation with disability. The number of limitations starts
to increase at about the top 25 percent of BMI and especially at the very
top—BMIs of 35 and higher.
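The percentile-cut analysis described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `mean_by_percentile_bin`, the equal-count binning rule, and the toy numbers are all hypothetical and are not taken from HRS.

```python
from statistics import mean

def mean_by_percentile_bin(values, outcomes, n_bins=20):
    """Sort cases by `values`, split them into `n_bins` equal-count groups
    (20 groups corresponds to cuts at every 5th percentile), and return the
    mean outcome in each group, lowest group first."""
    if len(values) != len(outcomes):
        raise ValueError("values and outcomes must have the same length")
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    result = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        # A bin can be empty when there are fewer cases than bins.
        result.append(mean(outcomes[i] for i in members) if members else None)
    return result

# Toy, made-up data (NOT survey results): BMI and number of limitations.
toy_bmi = [17, 19, 22, 24, 26, 28, 33, 38]
toy_limitations = [5, 4, 1, 1, 1, 2, 3, 6]

# Four bins are used here only because the toy sample is tiny.
print(mean_by_percentile_bin(toy_bmi, toy_limitations, n_bins=4))
```

With real survey data one would normally also apply sampling weights, which this sketch omits.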
With regard to the measurement of height and weight, Weir said they
added little to the information from self-reports. In fact, they have almost
no value analytically. However, it is important to have those data, as they
add to age and gender as a predictor of disability.
Waist circumference has a more monotonic relationship with disabil-
ity. For waist sizes larger than about 40–41 inches, there is a substantial
increase in disability. This measure adds considerably to BMI alone. Even
with both in the model, it adds significantly: That is, waist size is inde-
pendently a highly significant predictor of disability. What is it measuring
that BMI is not? Weir suggested that it may be central adiposity, which is
an independent risk factor for cardiovascular disease. Also, BMI does not
distinguish lean body mass from fat. Lean body mass is almost certainly
protective in many ways, particularly against mobility difficulties.
Physical performance measures are highly correlated with self-reported
limitations. Grip strength, expiratory volume, and timed walk are inde-
pendently associated with limitations. HRS for a long time has measured
cognition, and it is independently predictive of disability, particularly if
IADLs are included. A word recall measure has been included in HRS since
1993 and is useful. However, the strongest predictor among the cognitive
measures is the eight-item count of depressive symptoms. Depression and
other psychiatric symptoms are a major cause of disability. However, they
are correlated negatively in self-reports. Separating what may be some kind
of affect in a person’s reporting style from what is the real effect of depres-
sion on disabilities is difficult.
Biological Biomarkers
HRS had high levels of cooperation for collecting biological samples:
80 percent of respondents agreed to provide them. The distributions showed
a good match to the distributions from the NHANES, except for two
measures. One was total cholesterol, which is a difficult assay to do in dried
blood spots, and the other was diastolic blood pressure; that discrepancy is almost
certainly due to the fact that machines and humans find that point differently.
The biomarkers have good internal validity; prospective validity is to
be determined over time.
Disability is only slightly related to current levels of blood pressure, and
only at the high end. In contrast, Weir said that quite a strong relationship
exists between disability and hemoglobin A1c measures of blood glucose.
Disability is correlated with obesity. Hemoglobin A1c nevertheless has some
independent value even after waist circumference and other obesity measures
are taken into consideration.
“Good cholesterol” (high density lipoprotein, HDL) is associated with
lower disability. However, disability also is negatively correlated with higher
values of total cholesterol, which is quite puzzling. Consequently, a com-
mon measure of risk, the ratio of total cholesterol to HDL cholesterol, is
not related to disability, at least on its own.
The true value of these blood assays is yet to be determined, in part
because a few more assays are still to be done from 2006 data. One is
C-reactive protein, which is a marker of inflammation and which may
be related to disability through both arthritis and cardiovascular disease.
Another is cystatin C, a measure of kidney function. And because blood
assays are predictors of cardiovascular disease progression, they are ex-
pected to predict future cardiovascular events, which then are precipitators
of disability.
DEVELOPING MEASURES OF TIME USE TO STUDY DISABILITY
Vicki Freedman (University of Medicine and Dentistry of New Jersey)
described time-use measures and how they may be used to study disability.
She also shared some of the lessons from the development phase of a time-
use pilot study that she and her colleagues at the University of Michigan are
developing for the Panel Study of Income Dynamics (PSID) with funding
from the National Institute on Aging.
Her presentation included an overview of three issues:
1. How do time-use data fit in with existing measures of disability
(ADLs and IADLs) and some of the conceptual frameworks dis-
cussed in this workshop?
2. What are the various approaches for measuring time use to study
disability in population-based surveys?
3. What lessons have been learned from the development phase of the
PSID’s pilot project, Disability and Use of Time (DUST)?
It is not immediately evident how time use fits in with the existing
measures of disability. In the Institute of Medicine’s (1991) model of the
disablement process, conditions and impairments may or may not lead to
functional limitations, which in turn, depending on the environment, may
or may not lead to disability. It also is not clear how time use fits in with
the parallel language offered by the more recent International Classification
of Functioning, Disability and Health (ICF) model from the WHO (World
Health Organization, 2001). In the ICF model, health conditions may or
may not lead to impairments in body functions and structures that may in
turn lead to activity limitations and participation restrictions. However,
unlike the IOM model, ICF also offers a set of positive analogs for describ-
ing functioning. In positive language, the ICF links body functions and
structures to activities and participation in daily life. Time-use measures
convey the latter concepts: what people do (activities) and the extent to
which they engage in social, productive, and other aspects of daily life
(participation).
Domains of Time Use
There is no consensus in the literature about how best to classify ac-
tivities, but if one looks across literatures related to aging, time use, and
participation, several key “domains” emerge:
• Basic self-care activities (includes ADLs and other activities that
people do to care for themselves, such as management of chronic
conditions)
• Household maintenance activities (includes IADLs and other
household-related activities that are essential for daily life)
• Regenerative activities (includes hobbies, arts, music, gardening,
puzzles, taking classes, etc.)
• Physical activities (includes exercise, walking for pleasure, partici-
pating in team sports, etc.)
• Social participation (includes socializing with friends and family,
attending group functions)
• Productive participation (includes work, volunteering, providing
child and adult care, etc.)
• Political or civic participation (includes involvement in home as-
sociations or board meetings, political participation involving col-
lective decision making, etc.)
Not all activities fall uniquely into one of these categories or into just
these categories, but these are some of the most common domains of time
use for older adults (see Waidmann and Freedman, 2007, for frequency of
participation in these types of activities).
Approaches to Measuring Time Use
Freedman explained that there are three main approaches for measur-
ing time use to study disability in population-based surveys. The first is a
24-hour diary. In such an approach, people are asked a series of questions
about everything they did yesterday. The American Time Use Survey, conducted
by the Bureau of Labor Statistics, for example, asks respondents
what they were doing starting at 4:00 a.m. the previous day, for how long
they did it, where they were, and who else was present. The respondents
are then asked what they did next, and so on, until a 24-hour diary is
completed.
A second approach asks questions about how much time was spent on
various types of activities over a longer period of time. These questions are
referred to as stylized time-use questions. For example, a stylized question
might ask: “During the past week how much time did you spend _____?”
The reference periods are typically a week or a month or sometimes longer
if the activity is rare. This approach can capture activities that are not done
frequently.
A third approach, experiential sampling, involves contacting study
participants at random times of day (with either phones, beepers, or per-
sonal digital assistants [PDAs]). The participant is then asked questions
about what she or he has been doing in a brief window (e.g., 15 minutes)
just before the contact. Depending on the technology, the respondent either
answers the question by phone or perhaps types the answers into a PDA.
The participants may be asked not only what they have been doing, but
also who they were with, where they were, and how they felt.
These three different approaches to collecting time-use data have dif-
ferent strengths and weaknesses. The relative cognitive demands on the
respondents vary with each approach. Questions about activities obtained
through experiential sampling methods, for example, likely impose the
least demand on cognitive skills because of the focus on an immediate
time frame. At the other extreme, stylized questions often impose relatively
greater cognitive demands, because respondents need to review a longer
time period and may need to add or multiply or come up with averages
to obtain the number of hours in a week, month, or year for an activity.
Somewhere in between is the approach of a 24-hour diary. Another key
feature that varies across these approaches is the ability to add descriptors
about each activity, such as who the respondent was with, where he or she
was, and how the person felt. These questions can be added easily to both
the 24-hour diary and the experiential sampling methods; they cannot be
easily incorporated into the stylized question approach.
Reliability and validity issues also differ somewhat for each of these
approaches. For the 24-hour diaries, for example, weekday and weekend
patterns of time use differ. There is also considerable within-person varia-
tion across weekdays. Consequently, unless multiple diaries are collected
for every person, the 24-hour diary approach is better suited for analyzing
population patterns and trends than for analyzing within-person trajecto-
ries as people age.
With stylized questions, there are tradeoffs between reference period
and accuracy: the longer the window, the greater the measurement error
tends to be. One option to minimize measurement error is to use as recent a
reference period as possible (e.g., a week) and to focus only on commonly
occurring activities.
The experiential sampling approach presents a potentially interesting
analytic issue for studying the implications of functioning for time use.
Contacting respondents at a specific time and asking them what they are
doing yields oversampling of longer lasting activities. If type and length of
the activity vary by a respondent’s level of functioning, it is possible that
bias can be introduced into comparisons of activity duration by functional
status. Freedman noted that there are well-established techniques for ana-
lyzing length-biased samples, but it is not clear that they have been applied
to study disability and time use.
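To make the length-bias issue concrete, the sketch below builds a sample in which each activity appears in proportion to its duration, as happens when respondents are contacted at random moments. It then compares the naive mean duration with an inverse-duration-weighted mean, which is one standard correction for length-biased samples. The function names and the durations are hypothetical, and the source does not say which technique Freedman had in mind.

```python
def naive_mean(sampled_durations):
    """Plain average of the durations observed at random contact times."""
    return sum(sampled_durations) / len(sampled_durations)

def length_bias_corrected_mean(sampled_durations):
    """Weight each sampled duration by 1/duration (a harmonic mean).

    If an activity's chance of being caught is proportional to its length,
    this weighting recovers the average length per activity episode.
    """
    return len(sampled_durations) / sum(1.0 / d for d in sampled_durations)

# Hypothetical activity episodes, in minutes (made-up numbers).
true_durations = [10, 30, 60]

# Idealized length-biased sample: each episode appears once per minute it
# lasts, so longer episodes are overrepresented.
sample = [d for d in true_durations for _ in range(d)]

print(sum(true_durations) / len(true_durations))  # true mean per episode
print(naive_mean(sample))                         # inflated by length bias
print(length_bias_corrected_mean(sample))         # close to the true mean
```

In this toy case the naive mean is pulled toward the 60-minute episode, while the weighted mean returns to the per-episode average. A real analysis would also need to handle episodes truncated by the sampling window, which the sketch ignores.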
Disability and Use of Time Development Phase
The purpose of the DUST project² is twofold: to study the relationship
among functioning, time use, and well-being among older couples
and to lay the groundwork for potentially collecting time diaries with all
adults in the PSID. Approximately 1,600 time diaries will be collected by
telephone from 400 married couples aged 50 and older in 2009. Spouses
will be interviewed about the same days. Couples will be interviewed about
both a randomly selected weekday and a weekend day so that four diaries
per couple will be completed in all. For each spouse, the first interview
will also include supplemental stylized time-use questions,
detailed measures of functioning, and global and detailed measures
of well-being.
² The DUST project is being led by Vicki Freedman, Frank Stafford, Norbert Schwarz, and
Fred Conrad with funding from the National Institute on Aging (P01-AG029409 to Robert
Schoeni).
The DUST team has spent almost 2 years developing the instrument.
The development phase included a series of focus groups, cognitive testing
of the instrument, an assessment of the reliability of diary pre-codes,³ and
a pretest with 27 couples. In terms of questionnaire design, the team began
with the American Time Use Survey questions, which ask respondents what
they were doing, for how long, who was in the room with them, and where
they were. DUST investigated several expansions, which included
• The distinction among who actively participated in the activity with
the respondent, who was there but not actively participating, and
for whom the activity was carried out. This involved the testing
of nine pre-codes that route respondents to different (“tailored”)
follow-up questions depending on the type of activity reported.
• Introduction of a tailored follow-up to determine whether the re-
spondent received help with each reported activity or did it on his
or her own.
• The addition of a single-affect measure for each activity in the
diary that correlates well with established measures of well-being
from diaries such as the Day Reconstruction Method (Kahneman
et al., 2004) and the Princeton Affect and Time Study (Krueger and
Stone, 2008).
Freedman reported that several useful lessons about time-use diary
measurement have emerged from the development phase of DUST.
The first lesson is that activity descriptors about help may not be con-
sistently interpreted. Focus group activities suggested that adding a follow-
up question about receipt of help with each activity might yield inconsistent
responses. Interpretation of what constitutes “help” varied and was related
to the couple’s division of labor and the spouse’s ability to carry out such
activities. This lesson was learned early in the development phase, and
therefore this line of questioning was dropped prior to cognitive testing.
³ The term “pre-code” is used to distinguish from the type of coding that more typically
occurs after the diaries are collected (i.e., post-processing or post-coding). Both types of coding
will be done in this project.
The second lesson is that pre-codes to tailor descriptors can be reliably
incorporated into the time diary. To tailor descriptors to different types of
activities reported in the 24-hour diary, the team piloted nine pre-codes to
be coded during the interview. For example, for household chores, helping,
and care-related activities, a follow-up question, “Who did you do that
for?” was asked, along with questions about who did that with you, and
who else was there with you. The interrater reliability of selecting one of
nine pre-codes in two rounds of testing with four interviewers was very high
(kappa > 0.9). Furthermore, in pretest interviews, which yielded over 1,500
activities, interviewer pre-codes agreed with the coding of the principal
investigators more than 90 percent of the time.
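For readers unfamiliar with the kappa statistic cited above, the following sketch computes Cohen's kappa for two coders. It is an illustration only: the source does not say which kappa variant the DUST team used with its four interviewers, and the function name, category labels, and codes below are hypothetical.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two non-empty code lists of equal length")
    n = len(codes_a)
    # Share of items on which the two coders chose the same category.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's category frequencies.
    count_a, count_b = Counter(codes_a), Counter(codes_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)

# Made-up pre-codes assigned by two hypothetical interviewers to six activities.
coder_1 = ["chore", "chore", "care", "help", "care", "chore"]
coder_2 = ["chore", "chore", "care", "help", "chore", "chore"]

print(round(cohens_kappa(coder_1, coder_2), 3))
```

Here the coders agree on five of six activities, but kappa is lower than that raw agreement rate because some agreement would be expected by chance alone.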
The third lesson is that a succinct measure of well-being can be incor-
porated into the diary as a valid activity descriptor. The team developed and
tested an activity descriptor to tap affect for all activities reported during
the previous day. In focus groups, respondents were asked an open-ended
question about how they felt for each activity reported during the previous
morning. Participants were then asked to classify these emotions as mostly
unpleasant, mostly pleasant, or neither. Participants were able to classify
their emotions in ways that made sense, but they needed direction in cases
in which they experienced both positive and negative emotions. From this
experience, the team developed the following question: “How did you feel
while you were ___? If you had more than one feeling, please tell me about
the strongest one. Would you say mostly unpleasant, mostly pleasant, or
neither?” Based on the pretest data, the correlation between responses to
this item and to more detailed questions about activities that occurred during
three randomly selected times of day, which were modeled after the Day
Reconstruction Method, was relatively strong (r > 0.7; N = 155).
The fourth lesson is that less cognitively demanding stylized questions
can be successfully administered to couples. Rather than asking how much
time respondents spent in the last week or month doing specific kinds of
activities, DUST included in its cognitive testing and pretest questions of
the form: “On how many of the last 7 days did you ____?” Respondents
were provided with the following categorical answers to choose from:
none, 1–2, 3–4, 5 or more. Every one of the participants in the cognitive
testing was able to answer these questions. When asked how they arrived
at their answer, some participants reported knowing their schedules and
others reported reviewing and counting each day in the previous week that
they performed the activity. No problems were identified with these items
in subsequent pretesting.
The DUST team anticipates making the data available for public use by
the end of 2010 on the PSID website. The pilot will offer not only a larger
sample size, but also multiple days for each person and same-day diaries
for couples so that investigators can explore a number of crucial questions
related to older couples’ functioning, time use, and well-being.
DISCUSSION
Participants asked several questions for clarification or elaboration,
mostly focused on four topics: PROMIS, CAT, time-use measures to study
disability, and analysis of late-life disability.
Patient-Reported Outcomes Measurement Information System
Several questions were asked about PROMIS. The ability of PROMIS
measures to distinguish people who score very low on a trait, such as physi-
cal function, was discussed. Cook explained that PROMIS item banks are
developed so that there are items that target both high and low levels of
the trait. Large item banks are potentially better at discriminating among
frail elders, for example, because there are many items that match their
levels of function. Although PROMIS scores are normed (i.e., the average
scores in the general population are known), PROMIS measures still do a
good job of measuring people with extreme levels of a trait (e.g., very low
physical function).
Norms for different age groups and different clinical and social groups
can also be calculated. For example, PROMIS has calculated the average
scores for persons in different age categories, for gender, for different clini-
cal conditions, and for persons with none, one, two, or more chronic or
disabling conditions.
Computer-Adaptive Testing
A clarification was made about the difference between CAT and screen-
ing questions. Researchers studying trends in disability worry that a screen-
ing question might prevent getting information about the prevalence of
something asked about in a follow-up question. Fortunately, the items that
are presented with a CAT are not screened; they do not keep someone from
being asked about some other condition; they only help decide which ques-
tions will give the most information about someone’s level of, for example,
mobility.
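The distinction between adaptive item selection and screening can be illustrated with a small sketch. It assumes a two-parameter logistic IRT model and a maximum-information selection rule, which is a common textbook simplification and not necessarily what PROMIS uses. The item names and parameters are hypothetical.

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, bank, asked):
    """Pick the unasked item that is most informative at the current estimate.

    No item is ever ruled out by an earlier answer; items are only ranked.
    """
    candidates = [name for name in bank if name not in asked]
    if not candidates:
        raise ValueError("all items in the bank have been asked")
    return max(candidates, key=lambda name: item_information(theta, *bank[name]))

# Hypothetical mobility items: name -> (discrimination a, difficulty b).
bank = {
    "walk_one_block": (1.5, -2.0),
    "climb_stairs": (1.5, 0.0),
    "run_one_mile": (1.5, 2.0),
}

print(next_item(-1.8, bank, asked=set()))               # low-function respondent
print(next_item(-1.8, bank, asked={"walk_one_block"}))  # next most informative
print(next_item(1.9, bank, asked=set()))                # high-function respondent
```

Every item stays eligible for every respondent; the current estimate only changes the order in which items are offered.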
A potential problem with CAT was mentioned: “framing,” the phenomenon
in which one question on a measure might
cause a person to think about the rest of the questions in a particular way.
Researchers realize that the question asked beforehand
affects how one answers a particular question. This is a serious issue to
consider with CAT. Short forms do not have this issue as much, or rather,
they have this issue, but it is the same for everyone taking the test.
Time-Use Measures to Study Disability
Participants discussed the point made in the session that a more dis-
abled person might take longer to do something than a less disabled person.
In experiential sampling, that person is more likely to be picked up doing
the specific activity than others. Is that a length-biased sample of activi-
ties or a measure? The issue is what is being measured—the proportion of
people doing activities and the length of time on activities are two different
questions.
Another use of the time-use method is to track changes in patterns of
activity or participation over time, which may reflect changes in health as
well as disability status. But at a point in time, how does one determine
what is normative and what may be reflective of poor health or disability?
Still another issue raised concerns how to interpret the responses if one asks
people on how many of the past 7 days they have done some activity that
is considered elective. They may choose to do it or not do it. How does one
know whether the decision to do it or not to do it is related to their func-
tioning or that they are just not interested in the activity, such as socializing
or going to meetings? There are a couple of ways to answer that question:
asking people whether they do an activity as much as they would like to and,
if not, whether health-related reasons are involved; or asking people what it is that they
value and then tracking their participation in those activities. Questions can
be individualized to what people say is important to them.
Analysis of Late-Life Disabilities
A participant commented that late-life disabilities are a manifestation
of the life-long accumulation of activities. The data now available in the
United States do not allow a life-course study of how early exposures to
negative factors, in personal traits and in the environment, result in
late-life disabilities. The earliest data available on late life are probably from
HRS.
Guralnik observed that in contrast to the situation in the United States,
the British have birth cohorts, the oldest of which is now over 60 years old.
They also have cohorts that started a little bit later in life that are now aged.
The evidence is clear that early life factors play a very large role in mid-life
and late-life functioning. Participants agreed that such cohorts are invalu-
able for studying the life-long development of disabilities. In some cases,
existing cohorts could be used if there are mid-life or earlier data and one
can recontact people when they are older. In that vein, it was noted that
one of the rationales for adding the study of disability to the PSID is that
the panel study is over 40 years old now and does have some predictive
measures, mostly economic ones, of distress in early life.