Improving the Measurement of Late-Life Disability in Population Surveys: Beyond ADLs and IADLs: Summary of a Workshop

Suggested Citation: "3 Potential Methods for Revising Measures to Foster Comparability Across Subgroups." National Research Council. 2009. Improving the Measurement of Late-Life Disability in Population Surveys: Beyond ADLs and IADLs: Summary of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/12740.

3 Potential Methods for Revising Measures to Foster Comparability Across Subgroups

This chapter focuses on potential methods for refining or augmenting current measures of late-life disability used in population surveys to foster comparability across key subgroups. The presentations covered four topics:

1. Performance measures in population surveys
2. Improving patient-reported measurement of disability using item response theory (IRT) and computer-adaptive testing (CAT)
3. The possible use of easily collected biomarkers of chronic diseases to supplement activities of daily living (ADLs) and instrumental activities of daily living (IADLs), which may be able to track decline in functionality over the life course and capture change in functionality across thresholds
4. The potential for using time-use data to augment existing measures of ADLs and IADLs

PERFORMANCE MEASURES IN SURVEYS

Jack Guralnik (National Institute on Aging, National Institutes of Health [NIH]) focused on estimating functional status in surveys using performance measures and on identifying points across the spectrum of performance that are associated with self-reported disability in different population groups. He briefly described several studies he has undertaken with colleagues, along with some new comparisons both across countries and among U.S. surveys.

Guralnik observed that the Nagi theoretical model of the pathway from disease to disability has been very helpful in operationalizing the assessment of steps along the pathway and particularly useful in thinking about where performance measures fit. Objective measures of performance can be applied at several of the steps in the model: impairment, functional limitation, and disability. Impairment measures objectively assess physiologic functioning. At the final step, disability, one may be observing people in standardized home-type environments. However, performance measures, such as gait speed, chair rises, and pegboard tests, have been used mostly in the domain of functional limitations.

Guralnik offered three performance assessments to illustrate the points made in his presentation: gait speed, the index of mobility-related physical limitations (MOBLI) developed by Lan and Melzer (Lan et al., 2002), and the Short Physical Performance Battery (SPPB). MOBLI was developed using data from the National Health and Nutrition Examination Survey III (NHANES III) by empirically examining measures related to mobility, which include gait speed, chair rises, and a pulmonary function test. The index was then validated in other studies. The components of SPPB include timed standing balance, a timed 4-meter walk, and timed multiple chair rises. This battery was first developed in the Established Populations for Epidemiologic Studies of the Elderly (EPESE) in 1988. SPPB has very good psychometric properties: It predicts mortality, nursing home admission, new disability, and health care expenditures, among other things; it has good reproducibility; and it is sensitive to clinically important change.

One of the issues related to performance testing in general is that the scoring of some tests does not have a way of dealing with people who are unable to perform the task.
It is therefore difficult to know how to handle such cases. For example, if gait speed is used, what should be done when someone cannot walk at all? People have approached this issue in different ways, but it is a limitation of performance measures that researchers rarely address. Even in determining why a test was not done, people often fuss with the data, trying to understand whether the data are missing because the person really was not able to do the test, and so should be scored as 0 or given a poor score, or because the person simply refused. Even refusals can be ambiguous: some people refuse because they are afraid to attempt a test that they expect to be unable to do. Sometimes the responsibility for a refusal is placed on the examiner, which is a bit unfair, but it is hard to sort out when the researcher does not know what the data on the performance test mean. One solution to this problem, used in the SPPB, is to create categorical scores that cover the range of functioning and give a score of 0 to those unable to do the test.
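
The categorical scoring idea just described can be illustrated with a short sketch. This is not the published SPPB scoring algorithm: the function name and the cut points below are hypothetical placeholders chosen only to show how an "unable" result can be given a score of 0 while completed tests are mapped onto ordered categories.

```python
from typing import Optional, Sequence


def categorical_score(time_seconds: Optional[float],
                      cut_points: Sequence[float]) -> int:
    """Map a timed test onto an ordinal score, with 0 meaning unable.

    cut_points are ascending time thresholds supplied by the caller
    (hypothetical values here). Faster times earn higher scores.
    Refusals should be kept as missing data by the caller and not
    passed in as None, because None is reserved for "unable".
    """
    if time_seconds is None:          # participant could not do the test
        return 0
    score = len(cut_points) + 1       # best category, for the fastest times
    for threshold in cut_points:
        if time_seconds <= threshold:
            return score
        score -= 1
    return score                      # slower than every cut point -> 1


# Hypothetical cut points in seconds, for illustration only.
EXAMPLE_CUTS = [5.0, 6.5, 9.0]
```

With these placeholder cut points, a completed test always scores between 1 and 4, so a person who cannot do the test remains distinguishable from a person who does it slowly.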

Gait speed is an important performance measure. It is a simple test, but it is highly predictive, and interest in it has grown recently, with many longitudinal studies showing a clear stepwise gradient of greater mortality risk with decreasing gait speed. Analysis of data from the InCHIANTI Study, a population-based study in the Chianti region of Italy (Alessandro Ble and Luigi Ferrucci, Longitudinal Studies Section, Clinical Research Branch, National Institutes of Health, Baltimore, MD, unpublished data), shows a graded response for mortality according to quintiles of preferred walking speed. In this analysis, persons with cancer actually showed better survival than persons in the lowest quintile of gait speed at baseline.

In recent work done in the Whitehall Study of British civil servants (Brunner et al., 2009), gait speed rose steadily across the six employment grades used to classify participants, none of whom was poor and all of whom were full-time employees. It was impressive just how sensitive gait speed was to employment grade, which ranges from the highest (administrative) to the lowest (clerical) level. Gait speed is picking up something about health disparities across this gradient of socioeconomic status.

The question often asked is whether performance tests can replace self-reports, whether both should be done, or which one should be used in what situations. Most people who work in the field agree that self-reports and performance tests are complementary: They measure different concepts and different aspects of functioning, and there is a fair amount of evidence to support this view.
One example is work in which Guralnik collaborated with David Reuben (Reuben et al., 1990). The study population was stratified in two ways, according to self-reports of being independent in mobility and ADLs and according to categories of SPPB, and mortality was studied as an outcome. In the group reporting no disability, there was a clear grading of mortality risk across SPPB scores. The same was true for the group that was dependent in mobility but independent in ADLs. Complementary information is being picked up, and the performance batteries are showing something that is not available from self-reports. Finally, among the most severely disabled subset of this cohort, those who were dependent in mobility and had one or more ADL disabilities, there were high rates of mortality. Few people in this subset have high SPPB scores, and across the remainder of the SPPB spectrum there was not much of a gradient for mortality risk. Therefore, at the very disabled end of the spectrum, performance measures may not add much to the estimation of prognosis, but for those with little or no disability, performance measures make a valuable contribution to characterizing prognosis.

Performance Measures in Large Surveys

Banks and colleagues (2006) used both gait speed and the SPPB from the 2004 wave of the English Longitudinal Study of Aging (ELSA). They used cut points for poor functioning of less than or equal to 8 on the SPPB or gait speed of less than or equal to 0.5 meters per second. The cut points, previously shown to be related to high risk of future adverse events, show a clear age effect, with much higher proportions of people in the 80 years and older group falling in the high-risk group according to both the SPPB and gait speed. They also show a difference between men and women, with women having poorer functioning.

The presentation by Guralnik compared data for SPPB scores of less than or equal to 8 in EPESE, ELSA, and the InCHIANTI study, with substantially poorer performance seen in EPESE. However, because the EPESE was conducted 10 to 15 years earlier than the other studies, the trends in observed performance may mirror the trends toward less self-reported disability. Similar effects were observed for men and for women.

Gait speed showed somewhat different results. For gait speed, Guralnik included NHANES III data from 1988 to 1994. The InCHIANTI population showed a substantially smaller proportion of individuals with slow gait speed, for both men and women. Some of this difference may be real; some of it may arise because the test was done in a slightly different way. In the InCHIANTI study, the researchers used automatic timers and participants took a step before the timers were tripped, whereas in the other studies, stopwatches were used and the time was measured from a standing start. It is likely, however, that the Italians, who tend to walk much more than Americans, do have less mobility limitation.

In the NHANES III data from 1988 to 1994, an 8-foot walk was measured.
The 2001/2002 NHANES used a 20-foot walk but also timed the first 8 feet to allow comparisons with the 1988-1994 data. The results showed a large reduction between the two surveys in the proportion of people with very slow gait speed. However, some of the difference may be explained by the fact that the tests were done somewhat differently: In a 20-foot walk, people may see a longer walk ahead of them and go faster over the first 8 feet.

Using Performance Measures to Calibrate Self-Reports

Guralnik stated that he found two examples that represent a way of using performance measures of functioning to calibrate responses to self-report items in questionnaires. In the first example, from the World Health Organization (WHO), Iburg and colleagues (2001) used a modeling technique called Hierarchical Ordered Probit Modeling. They used performance tests from the NHANES III and created a vector of performance that they considered similar to a latent variable representing the true underlying level of performance. They then used the model to look at how different subgroups reported disability at different levels of this background latent variable. This analysis was done for both physician reports and self-reports.

Guralnik and Melzer (Melzer et al., 2004) did similar analyses using MOBLI, derived from the NHANES, and observed similar results. People who were 60-69 years old did not report disability until they had a poorer level of performance than people who were older. Large differences were also observed between men and women, with men not reporting disability until reaching lower levels of performance than women. There were also differences in disability cut points by race, with blacks and Hispanics not reporting disability until their background level of functioning was poorer than the level at which whites reported disability. For income, people with the highest income did not report disability until their performance was at a poorer level than that of people with lower income who reported disability. They may be denying their disability, or they may be able to compensate successfully for a lower level of functioning.

This kind of approach can be very useful. Comparison of U.S. data with those from the Longitudinal Aging Study Amsterdam (Melzer et al., 2004) showed that people in the Netherlands did not report disability until they had more severe levels of background dysfunction. Therefore, the lower levels of self-reported disability in the Netherlands could be explained, at least in part, by this differential reporting as it relates to level of background performance.
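
The notion of group-specific reporting cut points on a background performance measure can be made concrete with a deliberately simplified sketch. It is not the hierarchical ordered probit model used in the studies cited above: it assumes a single yes/no disability report, a probit link, and hypothetical threshold values, and all names are illustrative.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_reports_disability(performance: float,
                            group_threshold: float,
                            noise_sd: float = 1.0) -> float:
    """Probability of a self-report of disability under a simple probit rule.

    A person tends to report disability once measured performance falls
    below the reporting threshold of his or her group. A lower threshold
    means the group waits until performance is poorer before reporting.
    """
    return normal_cdf((group_threshold - performance) / noise_sd)


# Hypothetical thresholds on a standardized performance scale.
THRESHOLD_GROUP_A = -0.5   # reports only at poorer performance
THRESHOLD_GROUP_B = 0.0
```

At the same performance level of -0.25, a member of group A is less likely to report disability than a member of group B, which is the kind of differential reporting that a calibration against performance is meant to reveal.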

In conclusion, Guralnik observed that performance measures have potential applications in improving population surveys of disability, particularly in making comparisons across subgroups of a population and in cross-national and cross-cultural comparisons. Trends over time can be directly observed with performance testing, but this requires strict standardization of test administration and quality control procedures to ensure that the tests are administered in precisely the same way in every survey. Performance tests can be used to identify high levels of functioning, which cannot be done well with self-reported disability. They can also be used to identify nondisabled persons at increased risk of disability, sometimes referred to as preclinical disability. The concept of calibrating self-reports by using a background measure of performance could be quite valuable. Even something as simple as gait speed might be used for this kind of calibration and could be valuable for cross-national studies.
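
As a small illustration of the cut points cited earlier in this section (SPPB of 8 or less, gait speed of 0.5 meters per second or less), the sketch below converts a timed walk into gait speed and applies the two flags. The function names are hypothetical, and the sketch ignores the protocol differences (timers, standing starts, course length) that the presentation warned can affect comparability.

```python
FEET_TO_METERS = 0.3048  # exact conversion factor


def gait_speed_m_per_s(distance_m: float, time_s: float) -> float:
    """Gait speed in meters per second from a timed walk."""
    if time_s <= 0:
        raise ValueError("time must be positive")
    return distance_m / time_s


def slow_gait(speed_m_per_s: float, cut_point: float = 0.5) -> bool:
    """Flag gait speed at or below the cut point cited in the text."""
    return speed_m_per_s <= cut_point


def low_sppb(sppb_score: int, cut_point: int = 8) -> bool:
    """Flag an SPPB total at or below the cut point cited in the text."""
    return sppb_score <= cut_point


# An 8-foot walk completed in 5 seconds (hypothetical participant).
example_speed = gait_speed_m_per_s(8 * FEET_TO_METERS, 5.0)
```

Expressing both the 8-foot and the 4-meter walks in meters per second is what allows the same cut point to be applied across surveys that used different course lengths.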

IMPROVING PATIENT-REPORTED MEASURES USING ITEM RESPONSE THEORY AND COMPUTER-ADAPTIVE TESTING

Karon Cook's (University of Washington) presentation covered four topics:

1. A brief introduction to IRT and CAT
2. Description of the Patient-Reported Outcomes Measurement Information System (PROMIS), which applies these methodologies
3. Opportunities and barriers to using modern psychometric methods in population surveys and monitoring population trends
4. Envisioning the future and how modern measurement methods might be helpful in advancing disability research

Item Response Theory

IRT models are probability-based models in which both the level of the trait being measured (e.g., physical function) and the difficulty of the item are located on a common underlying continuum, or "ruler." The probabilities of answering in particular ways to items that ask about the trait are modeled as functions of how much of the trait a person has relative to the difficulty of the items.

Classical test theory and IRT differ in several ways. Commonly used reliability and validity estimates are based on classical test theory, in which scores on measures are usually obtained by manipulating the item scores (e.g., summing to get a total score). In IRT, scores on measures are obtained on the basis of probability functions, not by averaging or totaling item scores.

Another important difference is that classical test theory, unlike IRT, does not account for variations in the difficulty or intensity of items. For example, in a shoulder function scale, an item that asks about throwing a softball overhand 20 yards is weighted the same as an item that asks about using the affected arm to flip a light switch. In IRT, differences in difficulty or intensity are accounted for. In some IRT models, item discrimination is also included.
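
A minimal sketch of the kind of probability function involved may help. It assumes a dichotomous (yes/no) item and the standard logistic form; with the discrimination fixed at 1 it corresponds to the Rasch model, and with a free discrimination it corresponds to a two-parameter logistic model. The difficulty values for the two shoulder items are hypothetical.

```python
import math


def prob_endorse(theta: float, difficulty: float,
                 discrimination: float = 1.0) -> float:
    """Probability that a person with trait level theta endorses an item.

    Person and item sit on the same "ruler": the probability depends only
    on the distance between theta and the item difficulty, scaled by the
    item discrimination.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical difficulties on a standardized scale.
THROW_SOFTBALL = 2.0     # hard item: requires a high level of function
FLIP_LIGHT_SWITCH = -2.0  # easy item
```

For a person of average function (theta of 0), the model gives a low probability of being able to throw the softball and a high probability of being able to flip the light switch, so the two items are no longer treated as interchangeable.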
Yet another difference is that, in classical test theory, the scores are ordinal-level indicators of individual differences. With IRT, especially with the Rasch model, scores at least approximate interval-level measurement. (For a general review of IRT, CAT, and PROMIS, see De Ayala, 1993; Ware et al., 2000; Cook et al., 2005; Cella et al., 2007.) There is a great deal of debate about how well the scores approximate equal-interval measurement, but they come closer than classical test theory scores.

Thus, with classical test theory, the focus is on the total score, the average score, or something along those lines; in IRT the focus is on the item response. The IRT approach gives a great deal more flexibility when developing measures because one gets more information about specific items and how they function.

A very important difference is in the area of reliability and precision. Researchers typically say that a measure has a reliability of, for example, 0.89. Intuitively, a measure is very likely to have different levels of precision at different levels of a trait, but classical test theory yields only an average. IRT provides an estimate of the precision of a measure for every level of the trait being measured, which gives a great deal more information.

Cook explained that IRT is a mathematical model, specifically a probability model. IRT models estimate how likely persons are to respond in particular ways to a particular item, depending on how much they have of the trait being measured and on the characteristics of the item (e.g., how difficult it is). What is not in the model is the total score. With IRT, different people can answer different items, yet their scores are estimated on the same mathematical metric.

The information function in IRT is analogous to reliability in classical test theory. The information function has an inverse relationship to the standard error of measurement; that is, when precision is high, standard errors are low, and when precision is low, standard errors are high. Thus, one can identify which ranges of the trait level are measured with more precision and which with less. IRT information functions can be estimated at both the scale and the item level.
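
The inverse relationship between information and standard error can be sketched for dichotomous items under a two-parameter logistic model, for which item information is a^2 * p * (1 - p) and the standard error at a trait level is 1 / sqrt(total information). The item parameters below are hypothetical, and real instruments with polytomous items use more elaborate formulas.

```python
import math
from typing import Iterable, Tuple


def prob_endorse(theta: float, difficulty: float,
                 discrimination: float = 1.0) -> float:
    """Logistic item response function for a yes/no item."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def item_information(theta: float, difficulty: float,
                     discrimination: float = 1.0) -> float:
    """Item-level information at trait level theta."""
    p = prob_endorse(theta, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)


def standard_error(theta: float,
                   items: Iterable[Tuple[float, float]]) -> float:
    """Scale-level standard error: 1 / sqrt(sum of item information)."""
    total = sum(item_information(theta, b, a) for b, a in items)
    return 1.0 / math.sqrt(total)


# Hypothetical (difficulty, discrimination) pairs clustered near the middle.
ITEMS = [(-1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
```

Because these hypothetical items are clustered near the middle of the scale, the standard error is smaller for a person at theta of 0 than for a person at theta of 4, which is the trait-level precision that a single reliability coefficient cannot show.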

Computer-Adaptive Testing

IRT is the math behind a very important application: CAT. CAT is a process of measuring in which not all available items are administered to any one respondent. Instead, the items chosen for a particular person are based on that person's responses to the previously administered items. CAT begins with what is called a large "item pool." The items in the pool are calibrated in advance on the basis of their known characteristics: their difficulty or intensity and, in some cases, their discrimination parameters. Once the item pool is calibrated, it is called an "item bank."

Cook then described how CAT works. An initial item is presented to a person and that person responds. Then a gross estimate is made of that respondent's level of the trait being measured. On the basis of that trait-level estimate, the next item chosen from the item bank is the one that gives maximum additional information. Each item has its own information function, and the computer algorithm identifies the item that will give the greatest additional precision for the person's particular trait level. The test administration stops when a predefined "stopping rule" is reached, such as stopping when a certain level of precision is reached or after a specific number of items has been asked.

CAT is not the only application of IRT. With a calibrated bank of items, static instruments can be developed, that is, instruments in which everyone answers the same items. Another exciting application is the ability to construct different short forms that target specific clinical populations or specific measurement contexts. For example, one might want a very short measure if respondent burden is an issue, or a longer one if precision is of more interest. One might want to target lower levels of the outcome if that is the population of interest, or one might choose items from the bank that seem particularly relevant for a given clinical population.

The important thing to know about item banks is that whether one administers the items with CAT, a long static instrument, or one of several short forms, the scores are reported on the same mathematical metric. They are not based on a total score but rather on a probability function that takes into account a person's level of the trait, the person's responses to items, and the characteristics of the items.

Patient-Reported Outcomes Measurement Information System

The Patient-Reported Outcomes Measurement Information System (PROMIS) is an example of an application that uses IRT and CAT; it is funded by NIH. The goal was to develop item banks that measure patient-reported outcomes (PROs) across many different chronic conditions. The focus was on PROs such as pain, fatigue, physical function, social function, depression, and sleep.
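
The adaptive procedure described in the preceding section (choose the most informative item, update the trait estimate, stop at a precision target or an item limit) can be sketched as follows. This is an illustration under stated assumptions, not the PROMIS implementation: it uses yes/no Rasch items, a posterior-mean estimate on a coarse grid with a standard normal prior, and a hypothetical five-item bank.

```python
import math
from typing import Callable, Dict, List, Tuple

GRID = [g / 10.0 for g in range(-40, 41)]  # candidate trait values, -4 to 4


def prob_yes(theta: float, difficulty: float) -> float:
    """Rasch probability of a "yes" response."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def estimate(responses: List[Tuple[float, bool]]) -> Tuple[float, float]:
    """Posterior mean and standard deviation of the trait on GRID."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)        # standard normal prior
        for difficulty, said_yes in responses:
            p = prob_yes(theta, difficulty)
            w *= p if said_yes else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank: Dict[str, float], answer: Callable[[str], bool],
            se_target: float = 0.5,
            max_items: int = 10) -> Tuple[float, float, List[str]]:
    """Administer items adaptively until a stopping rule is met."""
    remaining = dict(bank)
    responses: List[Tuple[float, bool]] = []
    asked: List[str] = []
    theta, se = estimate(responses)
    while remaining and len(asked) < max_items and se > se_target:
        def info(name: str) -> float:
            # Rasch item information p(1 - p) at the current estimate.
            p = prob_yes(theta, remaining[name])
            return p * (1.0 - p)
        item = max(remaining, key=info)           # most informative item
        responses.append((remaining.pop(item), answer(item)))
        asked.append(item)
        theta, se = estimate(responses)
    return theta, se, asked


# Hypothetical calibrated bank: item name -> difficulty.
EXAMPLE_BANK = {"a": -2.0, "b": -1.0, "c": 0.0, "d": 1.0, "e": 2.0}
```

Calling `run_cat(EXAMPLE_BANK, answer)` with a function that returns each respondent's answer yields a trait estimate, its standard error, and the list of items actually asked. The two stopping rules mentioned in the text correspond to `se_target` and `max_items`.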
Part of the mandate also was to create a computer-adaptive system for administering PRO-based tests to measure such outcomes. The item banks that were developed have the flexibility to create multiple short forms that measure the same traits on the same metric.

Cook explained that one of the most helpful things about PROMIS is that scores on all measures have been calibrated to the general U.S. population. For example, for fatigue, the scores are normed on a sample weighted to the U.S. census: The mean is 50 and the standard deviation is 10. Suppose one gives the PROMIS fatigue measure to a particular sample and the average score is 60. This score has inherent meaning: In comparison with the general population, the study population is one standard deviation above the mean. This is much better than using traditional measures, in which scores usually have meaning only to people who have long experience using them.

Opportunities and Barriers

Cook asked: Are these methods, IRT and CAT, appropriate for surveying populations and for monitoring trends? She said yes; in fact, these methods offer some distinct advantages, such as the item-level approach, that are very helpful in developing better measures. With trait-specific standard errors, one knows how well one is measuring different portions of the population. CAT offers measurement efficiency, and it can be administered in many different ways: by Internet, by telephone, or in person.

IRT also allows linking new instruments to legacy instruments through concordance tables. If two measures are measuring the same trait, it is possible to link them and create a crosswalk between them so that the results of two studies can be compared, even if they used different measures for the same trait.

Cook noted, however, that there are downsides to using IRT and CAT. One is that calibration to an IRT model requires specialized expertise and specialized software that is not particularly user-friendly. Also, to use CAT, the respondent or an interviewer has to interface with a computer. If it is an interviewer, then mode effects are introduced that might be problematic. In addition, IRT-based measurement requires meeting the assumptions of the model, and these are not always easy to meet. The challenges in disability measurement with these models are substantial, and so are the advantages.

Some of the disadvantages are not limited to the newer methods, however. Because IRT and classical test theory assume unidimensionality, both are probably better suited to measurement of functional limitations than of disability. Disability often gets defined as a multidimensional, interactional, and social construction.
Defined as such, it does not lend itself to either IRT or classical test theory methods. Functional limitations typically are defined in much narrower terms and are better suited to measurement models.

Envisioning the Future

In summing up, Cook noted that the psychometric methods developed in the past few years have improved exponentially and have increased researchers' ability to develop measures with good psychometric properties. However, the ability to assign meaning to the resulting scores is lagging behind and is an area in need of effort. Norm referencing is one possibility, but a great deal more needs to be done. Levels and changes in levels of outcomes associated with maintenance of capacity and onset of disability need to be identified. The newer psychometric methods she discussed have some unique properties that make them suitable for this endeavor, but they will require longitudinal studies, close monitoring of functions and outcomes, and identification of measurable "marker" clinical events that are associated with changes in PROs.

USE OF EASILY COLLECTED BIOMARKERS OF CHRONIC DISEASES

David Weir's (University of Michigan) presentation focused on the possible use of easily collected biomarkers of chronic diseases to supplement ADL and IADL measures; such biomarkers may track decline in functionality over the life course and capture changes in functionality across thresholds. He based his remarks on results from the major 2006 redesign of the Health and Retirement Study (HRS).

HRS is a longitudinal survey of 22,000 Americans over age 50, who are interviewed every 2 years. Prior to the 2006 redesign, HRS was primarily a telephone survey, which included fairly useful self-reports on, among other items, functional limitations, chronic conditions, and care received. Beginning in 2006, the sample was randomly split into two halves: in one, participants are interviewed in person every 4 years beginning in 2006; in the other, they are interviewed in person every 4 years beginning in 2008. The in-person interviews include anthropometric measures, performance measures, dried blood spots, DNA samples, and some other measures not included in the telephone interviews.

Weir explained that the scientific focus of HRS is on two main areas. One area is biomarkers in the narrow sense of biological samples, which are focused essentially on measures of cardiovascular risk.
Those biomarkers are relatively easy and straightforward to measure and are of great importance and high prevalence in the population; they include blood pressure, cholesterol, the hemoglobin A1c measure of blood glucose, C-reactive protein, waist size, height, and weight. They are also closely related to obesity and metabolic syndrome, which are looming public health concerns. The second area of focus is physical performance measures, which are targeted more at the older population as measures of frailty.

These two areas can be brought together by taking into account the relationship between chronic disease and disability. Most disability is a product of chronic disease. The chronic diseases that directly produce disability, such as stroke, heart disease, and cognitive impairment, are themselves often produced by other antecedent conditions (e.g., hypertension, diabetes) that often have few symptoms. There is a need to model these processes, as suggested in a previous presentation, beginning long before people have difficulties with ADLs or IADLs. To understand the total process by which people become disabled, it is necessary to look at the whole evolution of chronic disease.

Disability starts long before a person experiences limitation in ADLs, so measures that are sensitive at those earlier stages are needed. In HRS, there are 12 items, which include such Nagi items as walking several blocks, climbing stairs, and pushing a heavy object. These items are quite useful for documenting the earlier stages of disability.

The HRS data indicate that ADL and IADL limitations really only begin around age 75. There are some people with limitations at earlier ages, but these cases mostly do not reflect changes by age. Rather, ADL and IADL limitations are a feature of the very old. The percentage of people who receive more than 1 hour of care per day is also very low prior to about age 75; after age 75, that percentage increases very rapidly. However, when people who report no ADL or IADL limitations, and therefore report no hours of care, are asked how many of the Nagi limitation items they have any difficulty with, the percentage reporting difficulty rises linearly with age. If people who reported no ADL or IADL difficulties in 2004 are arrayed by the number of Nagi limitation items they had in 2004 and then by whether they had an ADL or IADL difficulty by 2006, a very graded relationship is seen. Just counting the number of these difficulties provides some insight into the people who are at risk for developing further disability.

Chronic Disease and Disability

In discussing chronic disease and disability, Weir used a combined measure that is the sum of the Nagi items, ADL limitations, and IADL limitations. As stated above, chronic disease underlies most disability, even at younger ages.
People under 62 years of age with a disability that prevents them from working are eligible for Social Security Disability Insurance (SSDI). The distribution of SSDI recipients by the cause of disability on which the award is based shows that injuries account for less than 5 percent and infectious disease for about 2 percent. The percentage of recipients reporting diabetes is about the same as that for injuries overall (4.6 percent). Cardiovascular disease is about twice as large (10.4 percent), and arthritis is more than twice as large again (23.6 percent). Psychiatric conditions as a group, of which depression is the largest, are the single largest cause of reported disability for SSDI recipients (27.9 percent). Even at younger ages, at which most people might think of the disabled as being physically injured, most disability results in some way from chronic conditions. Psychiatric conditions are quite important even at younger ages.

In the HRS population, the number of physical limitations rises linearly with the number of chronic disease diagnoses regardless of age, although at 75 years and older the number of limitations reported is slightly higher, even at the same number of chronic conditions.

Obesity and Disability

Weir stated that the relationship between obesity and disability is complex. One needs to consider multiple measurement perspectives on obesity. There may be direct effects on mobility: For example, it is harder to move around a lot of weight than a little weight. A somewhat less direct effect is cumulative stress on joints from managing the excess weight. Even less direct effects operate through the risk of cardiovascular disease, which may take a long time to manifest itself. There are also inverse effects, such as sarcopenia (the age-related loss of muscle mass, strength, and function) and weight loss.

In HRS, a number of indicators were analyzed by quantiles, with 5-percentile cuts through the variable and then analysis of the mean number of limitations at each level of the variable. The relationship between disability and weight, as measured by body mass index (BMI), is nonlinear. The lowest 5 percent is very disabled; this is the frail group. There is relatively little variation for most of the U.S. population, indicating that a range of moderate obesity has relatively low correlation with disability. The number of limitations starts to increase at about the top 25 percent of BMI and especially at the very top, at BMIs of 35 and higher.

With regard to the measurement of height and weight, Weir said that the measured values added little to the information from self-reports. In fact, they have almost no value analytically. However, it is important to have those data, as they add to age and gender as a predictor of disability.

Waist circumference has a more monotonic relationship with disability. For waist sizes larger than about 40-41 inches, there is a substantial increase in disability. This measure adds considerably to BMI alone.
Even with both in the model, it adds significantly: that is, waist size is independently a highly significant predictor of disability. What is it measuring that BMI is not? Weir suggested that it may be central adiposity, which is an independent risk factor for cardiovascular disease. Also, BMI does not distinguish lean body mass from fat. Lean body mass is almost certainly protective in many ways, particularly against mobility difficulties.

Physical performance measures are highly correlated with self-reported limitations. Grip strength, expiratory volume, and timed walk are independently associated with limitations.

HRS for a long time has measured cognition, and it is independently predictive of disability, particularly if IADLs are included. A word recall measure has been included in HRS since 1993 and is useful. However, the strongest predictor among the cognitive measures is the eight-item count of depressive symptoms. Depression and other psychiatric symptoms are a major cause of disability. However, they are correlated negatively in self-reports. Separating what may be some kind of affect in a person’s reporting style from the real effect of depression on disabilities is difficult.

Biological Biomarkers

HRS had high levels of cooperation for collecting biological samples: 80 percent of respondents agreed to provide them. The distributions showed a good match to the distributions from the NHANES, except for two measures. One was total cholesterol, which is a difficult assay to do in dry blood spots, and the other was diastolic blood pressure; the latter discrepancy is almost certainly due to the fact that machines and humans find that point differently. The biomarkers have good internal validity; prospective validity is to be determined over time.

Disability is only slightly related to current levels of blood pressure, and only at the high end. In contrast, Weir said that quite a strong relationship exists between disability and hemoglobin A1c measures of blood glucose. Disability is also correlated with obesity, but the A1c measure has some independent value even after waist circumference and other obesity measures are taken into consideration.

“Good cholesterol” (high density lipoprotein, HDL) is associated with lower disability. However, disability also is negatively correlated with higher values of total cholesterol, which is quite puzzling. Consequently, a common measure of risk, the ratio of total cholesterol to HDL cholesterol, is not related to disability, at least on its own.

The true value of these blood assays is yet to be determined, in part because a few more assays are still to be done from 2006 data. One is C-reactive protein, which is a marker of inflammation and which may be related to disability through both arthritis and cardiovascular disease. Another is cystatin C, a measure of kidney function.
And because blood assays are predictors of cardiovascular disease progression, they are expected to predict future cardiovascular events, which then are precipitators of disability.

DEVELOPING MEASURES OF TIME USE TO STUDY DISABILITY

Vicki Freedman (University of Medicine and Dentistry of New Jersey) described time-use measures and how they may be used to study disability. She also shared some of the lessons from the development phase of a time-use pilot study that she and her colleagues at the University of Michigan are developing for the Panel Study of Income Dynamics (PSID) with funding from the National Institute on Aging. Her presentation included an overview of three issues:

1. How do time-use data fit in with existing measures of disability (ADLs and IADLs) and some of the conceptual frameworks discussed in this workshop?
2. What are the various approaches for measuring time use to study disability in population-based surveys?
3. What lessons have been learned from the development phase of the PSID’s pilot project, Disability and Use of Time (DUST)?

It is not immediately evident how time use fits in with the existing measures of disability. In the Institute of Medicine’s (1991) model of the disablement process, conditions and impairments may or may not lead to functional limitations, which in turn, depending on the environment, may or may not lead to disability. It also is not clear how time use fits in with the parallel language offered by the more recent International Classification of Functioning, Disability and Health (ICF) model from the WHO (World Health Organization, 2001). In the ICF model, health conditions may or may not lead to impairments in body functions and structures that may in turn lead to activity limitations and participation restrictions. However, unlike the IOM model, ICF also offers a set of positive analogs for describing functioning. In positive language, the ICF links body functions and structures to activities and participation in daily life. Time-use measures convey the latter concepts: what people do (activities) and the extent to which they engage in social, productive, and other aspects of daily life (participation).
Domains of Time Use

There is no consensus in the literature about how best to classify activities, but if one looks across literatures related to aging, time use, and participation, several key “domains” emerge:

• Basic self-care activities (includes ADLs and other activities that people do to care for themselves, such as management of chronic conditions)
• Household maintenance activities (includes IADLs and other household-related activities that are essential for daily life)
• Regenerative activities (includes hobbies, arts, music, gardening, puzzles, taking classes, etc.)
• Physical activities (includes exercise, walking for pleasure, participating in team sports, etc.)
• Social participation (includes socializing with friends and family, attending group functions)

• Productive participation (includes work, volunteering, providing child and adult care, etc.)
• Political or civic participation (includes involvement in home associations or board meetings, political participation involving collective decision making, etc.)

Not all activities fall uniquely into one of these categories or into just these categories, but these are some of the most common domains of time use for older adults (see Waidmann and Freedman, 2007, for frequency of participation in these types of activities).

Approaches to Measuring Time Use

Freedman explained that there are three main approaches for measuring time use to study disability in population-based surveys. The first is a 24-hour diary. In such an approach, people are asked a series of questions about everything they did yesterday. The American Time Use Survey, conducted by the Bureau of Labor Statistics, for example, asks respondents what they were doing starting at 4:00 a.m. the previous day, for how long they did it, where they were, and who else was present. The respondents are then asked what they did next, and so on, until a 24-hour diary is completed.

A second approach asks questions about how much time was spent on various types of activities over a longer period of time. These questions are referred to as stylized time-use questions. For example, a stylized question might ask: “During the past week how much time did you spend _____?” The reference periods are typically a week or a month or sometimes longer if the activity is rare. This approach can capture activities that are not done frequently.

A third approach, experiential sampling, involves contacting study participants at random times of day (by phone, beeper, or personal digital assistant [PDA]). The participant is then asked questions about what she or he has been doing in a brief window (e.g., 15 minutes) just before the contact.
Depending on the technology, the respondent either answers the question by phone or perhaps types the answers into a PDA. The participants may be asked not only what they have been doing, but also who they were with, where they were, and how they felt.

These three approaches to collecting time-use data have different strengths and weaknesses. The relative cognitive demands on the respondents vary with each approach. Questions about activities obtained through experiential sampling methods, for example, likely impose the least demand on cognitive skills because of the focus on an immediate time frame. At the other extreme, stylized questions often impose relatively greater cognitive demands, because respondents need to review a longer time period and may need to add or multiply or come up with averages to obtain the number of hours in a week, month, or year for an activity. Somewhere in between is the approach of a 24-hour diary.

Another key feature that varies across these approaches is the ability to add descriptors about each activity, such as who the respondent was with, where he or she was, and how the person felt. These questions can be added easily to both the 24-hour diary and the experiential sampling methods; they cannot be easily incorporated into the stylized question approach.

Reliability and validity issues also differ somewhat for each of these approaches. For the 24-hour diaries, for example, weekday and weekend patterns of time use differ. There is also considerable within-person variation across weekdays. Consequently, unless multiple diaries are collected for every person, the 24-hour diary approach is better suited for analyzing population patterns and trends than for analyzing within-person trajectories as people age.

With stylized questions, there are tradeoffs between reference period and accuracy, with longer windows associated with greater measurement error. One option to minimize measurement error is to use as recent a reference period as possible (e.g., a week) and to focus only on commonly occurring activities.

The experiential sampling approach presents a potentially interesting analytic issue for studying the implications of functioning for time use. Contacting respondents at a specific time and asking them what they are doing yields oversampling of longer lasting activities. If type and length of the activity vary by a respondent’s level of functioning, it is possible that bias can be introduced into comparisons of activity duration by functional status.
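The oversampling of longer lasting activities can be illustrated with a small simulation. The sketch below is not taken from the workshop; the episode durations are hypothetical values chosen for illustration, and the function name `sample_episode_durations` is an assumption. It draws the episode in progress at random contact times, shows that the naive mean of sampled durations overstates the true mean episode length, and applies one standard correction for length-biased samples, the harmonic mean (equivalent to weighting each sampled episode by the inverse of its duration).

```python
import random
from statistics import harmonic_mean, mean

# Hypothetical episode durations (minutes) for one respondent's waking day.
# These values are illustrative only; they are not data from any survey.
episode_minutes = [10, 10, 20, 30, 30, 60, 120, 200]


def sample_episode_durations(durations, n_contacts, seed=0):
    """Return the duration of the episode in progress at each random contact.

    Contacting a respondent at a uniformly random moment selects an episode
    with probability proportional to its duration, so weighting the draw by
    duration reproduces experiential sampling.
    """
    rng = random.Random(seed)
    return rng.choices(durations, weights=durations, k=n_contacts)


sampled = sample_episode_durations(episode_minutes, n_contacts=20000)

true_mean = mean(episode_minutes)          # 60 minutes per episode
naive_mean = mean(sampled)                 # biased upward (about 126 in expectation)
corrected_mean = harmonic_mean(sampled)    # inverse-duration weighting

print(f"true mean episode length:   {true_mean:.1f}")
print(f"naive mean from contacts:   {naive_mean:.1f}")
print(f"length-bias corrected mean: {corrected_mean:.1f}")
```

If activity durations differ by functional status, the naive comparison would inherit this bias to different degrees in each group, which is the concern raised above; the correction shown is only one of several possible approaches.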
Freedman noted that there are well-established techniques for analyzing length-biased samples, but it is not clear that they have been applied to study disability and time use.

Disability and Use of Time Development Phase

The purpose of the DUST project is twofold: to study the relationship among functioning, time use, and well-being among older couples and to lay the groundwork for potentially collecting time diaries with all adults in the PSID. Approximately 1,600 time diaries will be collected by telephone from 400 married couples aged 50 and older in 2009. Spouses will be interviewed about the same days. Couples will be interviewed about both a randomly selected weekday and a weekend day so that four diaries per couple will be completed in all. For each spouse, the first interview will also include supplemental questions, including stylized time-use questions, detailed measures of functioning, and global and detailed measures of well-being.

[Footnote: The DUST project is being led by Vicki Freedman, Frank Stafford, Norbert Schwarz, and Fred Conrad with funding from the National Institute on Aging (P01-AG029409 to Robert Schoeni).]

The DUST team has spent almost 2 years developing the instrument. The development phase included a series of focus groups, cognitive testing of the instrument, an assessment of the reliability of diary pre-codes, and a pretest with 27 couples.

In terms of questionnaire design, the team began with the American Time Use Survey questions, which ask respondents what they were doing, for how long, who was in the room with them, and where they were. DUST investigated several expansions, which included

• The distinction among who actively participated in the activity with the respondent, who was there but not actively participating, and for whom the activity was carried out. This involved the testing of nine pre-codes that route respondents to different (“tailored”) follow-up questions depending on the type of activity reported.
• Introduction of a tailored follow-up to determine whether the respondent received help with each reported activity or did it on his or her own.
• The addition of a single-affect measure for each activity in the diary that correlates well with established measures of well-being from diaries such as the Day Reconstruction Method (Kahneman et al., 2004) and the Princeton Affect and Time Study (Krueger and Stone, 2008).

Freedman reported that several useful lessons about time-use diary measurement have emerged from the development phase of DUST.

The first lesson is that activity descriptors about help may not be consistently interpreted. Focus group activities suggested that adding a follow-up question about receipt of help with each activity might yield inconsistent responses.
Interpretation of what constitutes “help” varied and was related to the couple’s division of labor and the spouse’s ability to carry out such activities. This lesson was learned early in the development phase, and therefore this line of questioning was dropped prior to cognitive testing.

The second lesson is that pre-codes to tailor descriptors can be reliably incorporated into the time diary. To tailor descriptors to different types of activities reported in the 24-hour diary, the team piloted nine pre-codes to be coded during the interview. For example, for household chores, helping, and care-related activities, a follow-up question, “Who did you do that for?” was asked, along with questions about who did the activity with the respondent and who else was present. The interrater reliability of selecting one of nine pre-codes in two rounds of testing with four interviewers was very high (kappa > 0.9). Furthermore, in pretest interviews, which yielded over 1,500 activities, interviewer pre-codes agreed with the coding of the principal investigators more than 90 percent of the time.

[Footnote: The term “pre-code” is used to distinguish from the type of coding that more typically occurs after the diaries are collected (i.e., post-processing or post-coding). Both types of coding will be done in this project.]

The third lesson is that a succinct measure of well-being can be incorporated into the diary as a valid activity descriptor. The team developed and tested an activity descriptor to tap affect for all activities reported during the previous day. In focus groups, respondents were asked an open-ended question about how they felt for each activity reported during the previous morning. Participants were then asked to classify these emotions as mostly unpleasant, mostly pleasant, or neither. Participants were able to classify their emotions in ways that made sense, but they needed direction in cases in which they experienced both positive and negative emotions. From this experience, the team developed the following question: “How did you feel while you were ___? If you had more than one feeling, please tell me about the strongest one. Would you say mostly unpleasant, mostly pleasant, or neither?” Based on the pretest data, the correlation between responses to this item and to more detailed questions about activities that occurred during three randomly selected times of day, which were modeled after the Day Reconstruction Method, was relatively strong (> 0.7; N = 155).

The fourth lesson is that less cognitively demanding stylized questions can be successfully administered to couples.
Rather than asking how much time respondents spent in the last week or month doing specific kinds of activities, DUST included in its cognitive testing and pretest questions of the form: “On how many of the last 7 days did you ____?” Respondents were provided with the following categorical answers to choose from: none, 1–2, 3–4, 5 or more. Every one of the participants in the cognitive testing was able to answer these questions. When asked how they arrived at their answer, some participants reported knowing their schedules and others reported reviewing and counting each day in the previous week that they performed the activity. No problems were identified with these items in subsequent pretesting.

The DUST team anticipates making the data available for public use by the end of 2010 on the PSID website. The pilot will offer not only a larger sample size, but also multiple days for each person and same-day diaries for couples so that investigators can explore a number of crucial questions related to older couples’ functioning, time use, and well-being.
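For readers unfamiliar with the kappa statistic cited for the pre-codes, the following is a minimal sketch of Cohen's kappa for two raters, which measures agreement beyond what chance alone would produce. The summary does not say which kappa variant the DUST team used (with four interviewers, a multi-rater statistic may have been applied), and the pre-code labels and ratings below are hypothetical, not DUST data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one category to each item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pre-codes assigned to ten diary activities by two interviewers.
interviewer_1 = ["chores", "care", "chores", "leisure", "care",
                 "leisure", "chores", "care", "leisure", "chores"]
interviewer_2 = ["chores", "care", "chores", "leisure", "care",
                 "leisure", "chores", "care", "leisure", "care"]

kappa = cohens_kappa(interviewer_1, interviewer_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the raters agree on 9 of 10 activities (90 percent), yet kappa is lower than 0.9 because some of that agreement would be expected by chance; this is why percent agreement and kappa are reported separately in the text.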

DISCUSSION

Participants asked several questions for clarification or elaboration, mostly focused on four topics: PROMIS, CAT, time-use measures to study disability, and analysis of late-life disability.

Patient-Reported Outcomes Measurement Information System

Several questions were asked about PROMIS. The ability of PROMIS measures to distinguish people who score very low on a trait, such as physical function, was discussed. Cook explained that PROMIS item banks are developed so that there are items that target both high and low levels of the trait. Large item banks are potentially better at discriminating among frail elders, for example, because there are many items that match their levels of function. Although PROMIS scores are normed (i.e., the average scores in the general population are known), PROMIS measures still do a good job of measuring people with extreme levels of a trait (e.g., very low physical function).

Norms for different age groups and different clinical and social groups can also be calculated. For example, PROMIS has calculated the average scores for persons in different age categories, for gender, for different clinical conditions, and for persons with none, one, two, or more chronic or disabling conditions.

Computer-Adaptive Testing

A clarification was made about the difference between CAT and screening questions. Researchers studying trends in disability worry that a screening question might prevent getting information about the prevalence of something asked about in a follow-up question. Fortunately, the items that are presented with a CAT are not screened; they do not keep someone from being asked about some other condition; they only help decide which questions will give the most information about someone’s level of, for example, mobility.
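The idea that a CAT chooses the question expected to give the most information about a person's level of a trait can be sketched as follows. This is a simplified illustration, not the PROMIS algorithm: it assumes a two-parameter logistic IRT model with yes/no items, whereas operational item banks often use polytomous models, and the item names and parameter values are hypothetical.

```python
import math


def prob_endorse(theta, discrimination, difficulty):
    """2PL probability that a person at trait level theta endorses the item."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def item_information(theta, discrimination, difficulty):
    """Fisher information of a 2PL item at theta: a^2 * p * (1 - p)."""
    p = prob_endorse(theta, discrimination, difficulty)
    return discrimination ** 2 * p * (1.0 - p)


def next_item(theta, item_bank, asked):
    """Pick the unasked item that is most informative at the current estimate."""
    candidates = [name for name in item_bank if name not in asked]
    if not candidates:
        return None  # item bank exhausted
    return max(candidates, key=lambda name: item_information(theta, *item_bank[name]))


# Hypothetical mobility item bank: name -> (discrimination, difficulty).
mobility_bank = {
    "walk_one_block": (1.5, -2.0),   # targets low levels of function
    "climb_stairs": (1.5, 0.0),
    "run_one_mile": (1.5, 2.0),      # targets high levels of function
}

# A respondent currently estimated to have low mobility gets an easy item,
# while one estimated to have high mobility gets a hard item.
print(next_item(-1.8, mobility_bank, asked=set()))
print(next_item(1.9, mobility_bank, asked=set()))
```

No respondent is screened out under this scheme: every item remains eligible, and the selection rule only determines which item is asked next given the current estimate.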
A potential problem with CAT was mentioned: “framing,” the phenomenon in which one question on a measure might cause a person to think about the rest of the questions in a particular way. Researchers realize that the question asked beforehand affects how one answers a particular question. This is a serious issue to consider with CAT, because different respondents are presented with different sequences of questions. Short forms do not have this issue as much, or rather, they have this issue, but it is the same for everyone taking the test.

Time-Use Measures to Study Disability

Participants discussed the point made in the session that a more disabled person might take longer to do something than a less disabled person. In experiential sampling, that person is more likely than others to be picked up doing the specific activity. Is that a length-biased sample of activities or a measure? The issue is what is being measured: the proportion of people doing activities and the length of time on activities are two different questions.

Another use of the time-use method is to track changes in patterns of activity or participation over time, which may reflect changes in health as well as disability status. But at a point in time, how does one determine what is normative and what may be reflective of poor health or disability?

Still another issue raised concerns how to interpret the responses if one asks people on how many of the past 7 days they have done some activity that is considered elective. They may choose to do it or not do it. How does one know whether the decision to do it or not to do it is related to their functioning or that they are just not interested in the activity, such as socializing or going to meetings? There are a couple of ways to answer that question: asking people whether they do an activity as much as they would like to, as well as linking it to health-related reasons; or asking people what it is that they value and then tracking their participation in those activities. Questions can be individualized to what people say is important to them.

Analysis of Late-Life Disabilities

A participant commented that late-life disabilities are a manifestation of the life-long accumulation of activities. The data now available in the United States do not allow a life-course study of how early exposures to negative factors in personal traits, and also the environment, result in any late-life disabilities.
The earliest data available on late life are perhaps from HRS. Guralnik observed that in contrast to the situation in the United States, the British have birth cohorts, the oldest of which is now over 60 years old. They also have cohorts that started a little bit later in life that are now aged. The evidence is clear that early life factors play a very large role in mid-life and late-life functioning. Participants agreed that such cohorts are invaluable for studying the life-long development of disabilities. In some cases, existing cohorts could be used if there are mid-life or earlier data and one can recontact people when they are older. In that vein, it was noted that one of the rationales for adding the study of disability to the PSID is that the panel study is over 40 years old now and does have some predictive measures, mostly economic ones, of distress in early life.
