The U.S. Social Security Administration (SSA) makes a determination about whether an individual is disabled after considering all of the relevant evidence in the applicant’s case record.1 SSA currently defines evidence as any information related to an individual’s application that is submitted by the applicant or anyone else for consideration by SSA, as well as information the agency obtains while developing the claim (SSA, 2018a). The categories of evidence are objective medical evidence (signs, laboratory findings, or both), medical opinion, other medical evidence from medical sources, evidence from nonmedical sources, and prior administrative medical findings of state or federal Disability Determination Services medical and psychological consultants (SSA, 2018b). The applicant and individuals connected to him or her bear the burden of proof of a medically determinable impairment and associated limitations in activities. When no relevant records are available from a treating medical source, SSA has an obligation to obtain one or more consultative examinations. The extent and types of medical evidence in an applicant’s file likely will be affected by the availability and cost of specific tests.
This chapter begins with an overview of the types, sources, and quality of information about individuals’ functioning that SSA may encounter in its consideration of evidence in an applicant’s case record. Next is a discussion of psychometric and other properties of measurement instruments, followed by sections on professionals with the training and expertise to assess function and on potential threats to the validity of functional assessment. The final section presents findings and conclusions.
1 20 CFR 404.1520; 20 CFR 416.920.
Various sources of information and specific types of tools to assess function can be mapped to the committee’s conceptual framework presented in Chapter 2. For example, medical records typically contain information about the individual’s specific health issue and its manifestation in body function and structure. The results of X-rays and various types of scans, for example, provide information on body structure, and performance-based tests and instruments typically provide information on body function and ability to perform physical and mental activities or tasks. Self-report instruments can provide information on symptoms (e.g., pain, fatigue, anxiety); performance of physical and mental activities and tasks; and factors, including “interrupters,” that interfere with an individual’s ability to perform work on a regular and continuing basis.
Collecting information about individuals that is useful for evaluating function and work disability is a complex process requiring the careful application of rigorous, modern techniques and selection of the best information collection methods, measurement instruments, and performance measures. This section provides an overview of these tools and related considerations, including the strengths and weaknesses of different information collection methods; the pitfalls of conducting clinical interviews; the use of proxy respondents; the exploitation of clinical records, including electronic health records (EHRs); the conduct of clinical and functional observations; and the direct measurement of function. The committee notes that the collection and handling of clinical information has important ethical dimensions (e.g., informed consent for data release in accordance with the provisions of the Health Insurance Portability and Accountability Act) (NRC, 2007), but it is assumed here that SSA has substantial experience with these issues.
The Building Blocks of Health Measurement
The practice of clinical medicine and the clinical records that ensue employ a vocabulary (nomenclature) used to describe patient complaints and health reports and express the diagnostic conclusions of health professionals, as reported in the clinical records. This vocabulary includes standard terms used for the building blocks of health measurement, such as symptoms (abnormal bodily perceptions), signs (visible abnormalities), abnormal bodily movements or more complex behaviors (observed by patients,
clinicians, and others), abnormalities in function (disabilities reported by patients or other observers), and physiological measures usually made by professionals in the health care system (e.g., biomarkers, imaging procedures, and other physiological measures). The findings in these areas are then gathered, as actual or provisional diagnoses, into known or predicted clusters (i.e., syndromes) or formally designated conditions. While the focus of the following discussion is on the assessment of function and disability, it is important to note that essentially all of the aforementioned building blocks of health measurement are part of standard nomenclature systems, such as the International Classification of Diseases or the many resources of the U.S. National Library of Medicine, that can be important in evaluating medical records with digital tools employing machine-reading techniques (e.g., text mining, natural language processing,2 data mining), as discussed later in the chapter. It is also important to note the need to focus on the coexistence and interaction of multiple disabling conditions as part of the disability evaluation process (Oni et al., 2014; Ubalde-Lopez et al., 2016).
General Considerations in Information Collection
Although the accuracy of the information collected for disability evaluation depends primarily on the reliability and validity of the collection method used, as described below, some general considerations apply to nearly all collection methods related to function. First, characteristics of the individual may affect the accuracy of the information collected. Differences in gender, race, ethnicity, and culture can affect individuals’ perceptions of illness and their reporting of relevant health information (Anglin et al., 2008; Carpenter-Song et al., 2010; Forestier et al., 2019; Fuentes and Aranda, 2018; Zdunek et al., 2015). People also have different levels of general and health literacy, educational attainment and language experience, and cognitive or speech impairments (e.g., dyslexia), for example, which may affect their ability to convey information about their condition. Some difficulties in individuals’ reporting of information may also be related to the underlying condition being evaluated, whether physical or psychiatric, while others may be related to the adverse effects of various medical treatments, particularly medications with psychotropic properties. In addition, individuals may be subject to social or other stressors, whether related to work or not, that can impact the quality of the information they provide. In general, even when validated information collection methods are employed, it is often difficult to determine the causes of errors in applying so-called standard methods.
2 In simple terms, natural language processing is “a branch of artificial intelligence that helps computers understand, interpret and manipulate human language” (SAS Institute, 2019a), such as the free text in clinical health records.
Second, the setting in which information is collected is important. Standardized laboratory settings with published protocols are usually necessary for performance testing, to ensure the reproducibility and accurate interpretation of the test results. In many ways, the same is true for the collection of information through interviews, and particularly for psychological testing. Test settings require good lighting, absence of noise, and freedom from important distractions, as well as special accommodations for persons with disabilities that can impede testing procedures. Characteristics of respondents and data acquisition methods that may alter the findings of information collection are called “response variables” in survey research.
Types of Functional Assessment Measures and Sources of Information
Sources of information for assessment of function include (1) clinical records, (2) performance-based measures, and (3) self- or proxy-reported measures. Each of these three sources has strengths and weaknesses, and the results of one are often used to validate those of another (Oude Voshaar et al., 2019; Schalet et al., 2015). The use of all three sources entails certain assumptions, such as the stability of the environment used for diagnostic testing and conditions in that environment that are conducive to the testing process, and the use of a measurement instrument with sufficiently strong and established psychometric properties to meet conventional standards. In addition, for performance-based and self- or proxy-report measures, it is assumed that the respondent has sufficient cognitive and reading skills to understand and follow directions and that the language and references in the performance-based and self- or proxy-report measures are appropriately compatible with the respondent.
Among the information found in an individual’s health record are the results of diagnostic testing. In addition to the results of physiological measures, such as blood pressure measurements, blood tests and imaging procedures, and any functional tests performed, clinical records provide information about diagnoses and treatments, including prescribed medications, appliances, and devices. Professionally witnessed physical or behavioral events or characteristics of patients also may be documented (e.g., seizures or spasticity), although occasional episodic events may not occur during a clinical examination. Clinical records, including results of directed laboratory testing, may include information on biomarkers that are physiologically or biochemically relevant to an individual’s health and functional
status and disability-related medical condition. Clinical records also may provide information over extended periods of time, yielding insight into clinical and functional trajectories of the person’s condition.
Despite their importance as an information source, clinical records have many actual or potential limitations that need to be recognized. The records may, for example, be derived from many institutional sources and have varying formats, content, and authors. At times, clinical findings in the records may be unclear or contradictory, a problem exacerbated by differences in clinical observers and the ways in which assessments are performed and documented. And many elements of a clinical record, particularly the medical history, are by definition based on patient self-report, so they may be subjective and unverified. Other potential problems with clinical records are worth noting. Diagnoses may be incomplete or not sufficiently accurate. Some records contain only tentative diagnoses, pending further evaluation. Some types of health considerations are known to be frequently missing, such as safety events that occur in hospitals and adverse drug reactions. A common problem is that records of consultations that occur outside of the parent institution, such as for rehabilitation or psychiatric treatment, may not be present.
Also, the availability of specific tests (e.g., certain cardiovascular tests and psychological batteries) that are valid and potentially useful to disability evaluations may be limited by costs and individuals’ access to specialized health care. Relevant health care data may not be easily available because an individual may lack insurance coverage or be underinsured, or the means of obtaining the information needed may be denied by insurance as not medically necessary. Health and care disparities can have a significant impact on the collection of health information available to inform disability determinations. In the United States, lower socioeconomic status is associated with less access to high-quality care and health care professionals (AHRQ, 2018; IOM, 2001, 2003), including those with expertise in providing information on functional status relevant to disability determinations. Thus, disparities in access to care and health outcomes can affect not only the quantity of assessments conducted in the context of disability determinations but also the quality of the assessments that are conducted and the resulting information (IOM, 2015; NASEM, 2016). Disability applicants who are uninsured or underinsured are less likely to have a well-developed body of health data, including the results of expensive, specialized tests, to demonstrate evidence of disability.
In addition, the acquisition of clinical records may be difficult for several reasons: providers’ fear of sharing confidential information, the limited capacity of a provider’s organization to gather and transmit records, and high administrative costs for record transfer. If clinical records are in electronic format, they can be shared and transmitted more easily, and digital
applications can enhance the types of information available, as described later. However, most EHRs are supported by commercial vendors with differing digital systems, and a lack of interoperability may make reading and manipulating such records complex and costly. Still another difficulty is that information and representations concerning function and disability are often not standardized, although some systems, such as the publicly available Patient-Reported Outcomes Measurement Information System (PROMIS), are progressing toward standard taxonomy-driven published measures (HealthMeasures, 2018). In addition, EHRs typically contain a substantial amount of free text, which may impede easy analysis and summarization, although natural language processing software is available for this purpose. Finally, use of EHRs makes it easy for providers to “cut and paste” previous entries, potentially affecting accuracy.
Performance-based measures require that the individual being assessed perform a set of functional tasks so that his or her ability to execute them can be ascertained. Examples of such measures include assessments of gait, balance, and lifting in the physical realm and cognition in the mental realm. In addition, a number of instruments are available for integrated assessment of physical and mental function, often entailing evaluation of specific activities of daily living (ADLs) and instrumental activities of daily living (IADLs). ADL and IADL measures can be applied reasonably well but have the problem of not being directly applicable to specific workplace settings and activities. Other potential weaknesses of integrated and impairment-specific functional assessments are that they may fail to capture the social interaction tasks required in certain workplaces and the speed and endurance demanded by various work tasks. In general, performance-based measurement requires formal and substantial control of assessment conditions and of the information as it is collected. When the intent is to assess function based on maximal effort, the fact that function is observer reported (as opposed to self-reported) may itself affect the information collected.
There are two general approaches to performance-based functional assessment: direct measurement and in situ observation of task/activity performance.
Direct measurement Direct measurement involves testing of relevant, perhaps general or stereotyped, tasks in a clinical “laboratory” setting, often in rehabilitation or occupational medicine facilities. Direct performance testing for physical and neurocognitive functional abilities is well developed for various common illnesses and conditions and defined injury-related impairments. Such testing typically is used to assess common disease-specific deficits and
to monitor functional increments or decrements over time. Such measures may be useful for tracking the progress of those diseases, but they are not necessarily generalizable to other potentially disabling conditions.
Work-related tasks are tested using best approximations of the work settings being simulated. Physiological measurements are often performed, and the data collected may require some translation by the observer as to whether the client is likely to be capable of carrying out the work tasks under real-life conditions. These testing procedures can be valuable but may require substantial resources, including client transport to and from the testing facility.
In situ observation of task/activity A second approach is to test performance abilities by observing the client in a work setting that is identical, or nearly so, to that in which the tasks/activities in question would be carried out. In situ observation has some advantages over direct measurement. Assuming that the client being evaluated can be taught the procedural requisites and safety procedures related to performing in that particular workplace, this approach has the distinct advantage of representing both many of the challenges of the tasks/activities and the availability of potential adaptations to accommodate the client’s condition. Nevertheless, there are disadvantages to in situ observation as well. It is more logistically complex, time- and resource-intensive, and costly than direct measurement. Additional challenges include simulating social and cognitive tasks; assessing the client’s endurance in performing these tasks/activities in the particular work setting over a 40-hour workweek; and anticipating variation in the tasks/activities as they evolve over time, as well as variations and fluctuations in the individual’s symptoms over time. It should be noted that job coaches in the context of supported employment can provide useful information about their clients’ performance in the specific job setting, which takes account of some of these challenges.
Self- or Proxy-Reported Measures
Self- or proxy-reported measures are those that require an individual being assessed or a third party to complete a questionnaire asking about the overall ability of the individual to perform a specific set of functional tasks. Patient-reported outcome measures, ADL questionnaires, and some types of psychological tests are examples of such measures. Note that self-report is not the same as self-administration, and different modes of administration (e.g., mail, computerized surveys, “pencil and paper,” automated audio telephone surveys) may alter the nature of self-reported findings. An advantage of self-administered instruments is that they can be completed over longer periods of time relative to the other two methods, providing a time
perspective on function and potentially improving accuracy by allowing for more response verification.
Self- and proxy-reported instruments may be based on either classical test theory methods or item response theory (IRT) methods.
Instruments based on classical test theory methods Instruments based on classical test theory are designed to be completed in toto via either self-report, proxy (surrogate) report, or interview. Self-report is feasible when an instrument’s reading level accommodates respondents with low literacy and limited concentration. Proxy respondents are helpful when instruments focus on observable behaviors. Personal interview administration is helpful when respondents need assistance staying on task or would benefit from the rapport that can be established during an interview. If the primary respondent is cognitively impaired, for example, assistance may improve the accuracy of the information obtained.
Health, functional, and disability assessments may be enhanced by acquiring information from third-party respondents. Relevant third-party respondents may include, for example, friends and family members, health care and social professionals, and workplace colleagues and employers. Such individuals can be particularly helpful for providing ancillary information on health and behavioral matters, physical and mental functioning, and workplace performance, sometimes supported by written documents. Applicants who are injured or ill can benefit in particular from appropriate third-party observers. Yet, while such reports can provide valuable information, it is important to understand the nature of respondents’ relationship to the individual being evaluated. They may not be skilled in the types of observations needed, may not be suitably familiar with the person, or may have biases or interests of their own that could affect the accuracy of information they provide (Gill et al., 2002; Lum et al., 2005). Collection of information about the length and nature of a third-party respondent’s relationship with the individual being assessed can help in interpreting the information gathered. It is important to note that tests assessing beliefs, attitudes, moods, and other internal states are not suitable for proxy respondents (Dorman et al., 1997; Duncan et al., 2002; Mathias et al., 1997; Oczkowski and O’Donnell, 2010; Pickard and Knight, 2005; Poulin and Desrosiers, 2008).
In general, personally administered interviews will provide greater accuracy and potentially obtain more complex responses relative to self- and proxy reports obtained by other means. Reading a survey to respondents with language or literacy problems will likely improve the quality of the information provided. Interviewers can also determine the most acceptable pace of data collection, ensure completion of all desired items, sometimes elicit more sensitive information, and explain items otherwise not fully
understood. At the same time, however, administered interviews consume more resources, and administration of a survey of self-reported function by an interviewer may actually reduce the accuracy of the information collected. Also of note, both self- and interviewer-administered survey instruments (as well as computer-administered instruments, discussed below) may allow for “adaptive interviews,” where, for example, certain items may be omitted or added if they are redundant or require respondent-specific information.
Whether surveys are self- or interviewer administered, the accuracy of survey information is based mainly on respondent characteristics, including abilities, knowledge, motivations, and competing burdens. The accuracy of self-reported information can be affected, intentionally or unintentionally, by the respondent. For example, some individuals who want their condition or the magnitude of their perceived distress to be taken seriously may overestimate their difficulty in performing various tasks. Conversely, other individuals may overestimate their abilities out of a desire to please the interviewer or to maintain independence or not appear weak. In addition, certain individuals, for example, some with traumatic brain injury or stroke, may have poor self-awareness or an inability to assess their limitations accurately because of a neurological deficit (e.g., anosognosia). The use of instruments or test batteries that include validity measures3 can help testers determine the validity of the results obtained (IOM, 2015). Another consideration is gender, racial, ethnic, and cultural variation in individuals’ perception of illness and symptoms and whether relevant self-report measures have been assessed for equivalency of scores in different populations.
Instruments based on item response theory methods Measurement tools may be built on IRT and computer adaptive testing (CAT). IRT is used to establish an individual’s position on the continuum of a trait of interest by asking him or her a series of questions. In the case of functional assessment, for example, the questions would be calibrated to a scale covering the range of function in one dimension, such as mobility. CAT instruments can be used to administer a selected sample of questions from an IRT-calibrated “item bank,” choosing questions based on how the respondent answered the previous questions (Chan, 2018). CAT instruments are highly efficient, typically involving shorter administration times and requiring respondents to answer fewer questions than would be required by a conventional test (Cheville et al., 2012; Fliege et al., 2009; Ware et al., 2003). Such instruments also may include embedded validity measures. Under optimal circumstances, computer-based surveys have several additional advantages: they can provide helpful prompts if a respondent indicates a question is unclear or provides a response that is “out of range”; questions can be revisited on request; and real-time response editing can indicate that a particular response cannot be accurate and needs to be reconsidered.
3 Validity measures are used to provide information about an individual’s effort on tests of maximal performance, such as cognitive tests, or information about the consistency and accuracy of an individual’s self-report of symptoms he or she is experiencing (IOM, 2015).
CAT has generally shown its enhanced value in a number of applications relevant to disability situations. For example, relative to conventional balance measures, CAT has more precisely discriminated persons who fall frequently (Pardasaney et al., 2014), and it has similarly improved discrimination of those with disabling back pain compared with standard testing (Choi, 2015). Use of a CAT approach to evaluating PROMIS-based upper-extremity function showed better psychometric properties compared with a non-CAT comparator instrument (Tyser et al., 2014), as well as improved sensitivity to functional change after hip and knee prosthetic procedures (McDonough et al., 2016). The use of measures developed using IRT that can be administered using CAT can also decrease respondent burden by reducing survey length and administration time while minimizing measurement error.
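To make the adaptive item-selection logic concrete, the following is a minimal, illustrative sketch of a CAT loop under a one-parameter (Rasch) IRT model. The five-item bank, the maximum-information selection rule, and the one-step Newton-Raphson ability update are simplified assumptions for illustration, not a description of any operational SSA or PROMIS instrument.

```python
import math

def rasch_prob(theta, b):
    """Probability of a positive response under a one-parameter (Rasch) IRT model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability level theta: p * (1 - p)."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)

def cat_administer(item_bank, respond, max_items=5):
    """Minimal CAT loop: repeatedly pick the unused item with maximum
    information at the current ability estimate, record the response,
    and update theta with a one-step Newton-Raphson move on the
    Rasch log-likelihood."""
    theta = 0.0                      # start at the middle of the scale
    used, responses = [], []
    for _ in range(max_items):
        candidates = [i for i in range(len(item_bank)) if i not in used]
        best = max(candidates, key=lambda i: item_information(theta, item_bank[i]))
        used.append(best)
        responses.append(respond(item_bank[best]))
        # Newton-Raphson: gradient and information of the log-likelihood.
        grad = sum(r - rasch_prob(theta, item_bank[i]) for i, r in zip(used, responses))
        info = sum(item_information(theta, item_bank[i]) for i in used)
        theta += grad / max(info, 1e-6)
    return theta, used

# Hypothetical item bank (difficulties on a logit scale) and a
# deterministic respondent who succeeds on items easier than 1.0.
theta, used = cat_administer([-2.0, -1.0, 0.0, 1.0, 2.0],
                             lambda b: 1 if b < 1.0 else 0)
```

Because each iteration targets the item most informative at the current estimate, the respondent is routed quickly toward items near his or her functional level, which is why CAT can shorten administration while holding down measurement error.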
In addition, computerized surveys can provide immediate document structuring and formatting of the interview, supplying text versions of responses for immediate examination by professionals. A further potential advantage of electronically administered interviews is that the perceived privacy may allow the acquisition of more sensitive information than might otherwise be obtained. It should also be noted that certain types of cognitive and psychological testing protocols can be administered online.
Regardless of the sources and types of information, convergence of the information is important in weighing the validity of the evidence presented in a claim of disability. Divergent evidence erodes confidence, whereas convergence adds confidence that the reported information is accurate.
Potential Digital Applications of Clinical Records and Related Information
One important consequence of the use of digital, computerized clinical records and other digitized information is the ability to take advantage of techniques increasingly being applied in clinical settings. Many of these techniques could be considered decision-support tools, and while each technique that exploits big data (i.e., extremely large datasets) and predictive analytics4 (Shah et al., 2018) may have weaknesses or suffer from incomplete development, these techniques have the potential to become useful aids in the collection of information to inform disability determinations. For example, techniques such as text and data mining, machine learning, and natural language processing—some of which fall under the heading of clinical information extraction applications (Wang et al., 2018)—can help in summarizing the content, quality, and completeness of collected information. The process of reading, organizing, and interpreting clinical information has improved with the use of machine-reading (artificial intelligence) programs that have emerged in the past several years.
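As a toy illustration of clinical information extraction, the sketch below scans clinical free text for phrases suggesting functional limitations using simple pattern matching. The lexicon, domain names, and sample note are invented for illustration; a real system would draw on curated terminologies (e.g., resources of the National Library of Medicine) and full natural language processing rather than a handful of regular expressions.

```python
import re

# Illustrative lexicon mapping surface phrases in clinical free text to
# functional domains; entirely hypothetical, for demonstration only.
FUNCTION_LEXICON = {
    "mobility": [r"difficulty walking", r"unable to ambulate", r"uses a walker"],
    "self-care": [r"needs help (?:with )?dressing", r"unable to bathe"],
    "cognition": [r"memory loss", r"difficulty concentrating"],
}

def extract_functional_mentions(note):
    """Return the functional domains mentioned in a clinical note, with
    the matched text snippets, using case-insensitive pattern matching."""
    findings = {}
    for domain, patterns in FUNCTION_LEXICON.items():
        hits = []
        for pat in patterns:
            hits.extend(m.group(0) for m in re.finditer(pat, note, re.IGNORECASE))
        if hits:
            findings[domain] = hits
    return findings

note = ("Patient reports difficulty walking more than one block and "
        "memory loss over the past year; uses a walker at home.")
mentions = extract_functional_mentions(note)
# mentions -> {'mobility': ['difficulty walking', 'uses a walker'],
#              'cognition': ['memory loss']}
```

Even this crude approach shows how free text can be summarized into structured evidence about function, which is the core task such applications perform at scale.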
The modeling of predictive analytics can offer reasonably accurate estimates of clinical outcomes for various conditions over the ensuing months and years. As of 2012, for example, there were about 800 predictive risk models for outcomes of cardiovascular conditions (Wessler et al., 2015). In another important application, the U.S. Veterans Health Administration has developed a detailed risk score (the Care Assessment of Need Score), based on medical records from hospitalized patients, that predicts 1-year mortality and the risk of rehospitalization (Ruiz et al., 2018). With appropriate available information, it may be useful to evaluate these information technology techniques and others as aids in predicting future health and functional trajectories and outcomes of individuals applying for disability benefits. Predictive modeling also has progressed with respect to the functional outcomes of mental conditions (Koutsouleris et al., 2018). An example of an emerging digital technology in this realm is the use of linguistic analysis (of recorded natural speech) to help assess neurological and psychiatric conditions (deBoer et al., 2018).
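Predictive risk models of this kind typically reduce to a fitted regression equation applied to an individual's characteristics. The sketch below shows the general shape using a logistic model with entirely hypothetical coefficients and predictors; it is not the Care Assessment of Need Score or any published model.

```python
import math

# Entirely hypothetical coefficients for a toy logistic risk model;
# operational models are fit on large clinical datasets and use many
# more predictors than these three.
COEFFICIENTS = {
    "intercept": -3.0,
    "age_over_75": 1.2,       # indicator: 1 if the patient is over 75
    "prior_admissions": 0.6,  # count of hospital admissions in the past year
    "low_adl_score": 0.9,     # indicator: 1 if ADL score falls below a cutoff
}

def predicted_risk(age_over_75, prior_admissions, low_adl_score):
    """Predicted probability of the outcome: 1 / (1 + exp(-linear predictor))."""
    lp = (COEFFICIENTS["intercept"]
          + COEFFICIENTS["age_over_75"] * age_over_75
          + COEFFICIENTS["prior_admissions"] * prior_admissions
          + COEFFICIENTS["low_adl_score"] * low_adl_score)
    return 1.0 / (1.0 + math.exp(-lp))
```

The appeal of such models for disability adjudication would lie in turning routinely recorded clinical facts into a calibrated probability of a future health or functional outcome, though their validity for that purpose would need separate evaluation.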
When evaluating the utility of a functional assessment instrument for SSA’s adjudication process, it is important to collect and consider the available evidence on that instrument’s design and performance. To guide potential users of the PROMIS and National Institutes of Health (NIH) Toolbox instruments, NIH (2018) suggests evaluating the available evidence related to the eight key instrument properties described by the Scientific Advisory Committee of the Medical Outcomes Trust (2002). The present committee adapted these properties for use in selecting the functional assessment instruments considered in this report:
- conceptual model and measurement approach,
- reliability,
- validity,
- sensitivity to change and responsiveness,
- interpretability of results (e.g., self-report and trained observer rating),
- administrative and respondent burden,
- alternative modes of administration, and
- cultural and language adaptations (e.g., translations).
Conceptual Model and Measurement Approach
The conceptual model on which an instrument is based provides “the rationale for and description of the concepts” (Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 198) the instrument is intended to measure and the “populations [it] is intended to assess” (Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 198). The underlying measurement approach operationalizes the conceptual model and involves a range of decisions made during the instrument’s design and testing, such as the use of a scale or set of scales, the corresponding measurement units, modes of collection, scoring procedures, and empirical strategies (e.g., principal components analysis, confirmatory factor analysis). Brandt and colleagues (2011), for example, describe (1) SSA’s five-step sequential review process in the context of contemporary conceptualizations of disability and (2) the potential of using an IRT-CAT assessment tool for measuring the multiple dimensions of disability—both of which would guide the development of the Work Disability-Functional Assessment Battery (Meterko et al., 2015, 2018). In the context of functional assessment, an underlying conceptual model is one piece of evidence supporting the overall credibility of the measurement results and subsequent decisions.
Reliability
Reliability denotes “the degree to which an instrument is free from random error” at a point in time (Joint Commission on Accreditation of Healthcare Organizations, 2003, p. 29). Several types of reliability are typically investigated in developing an instrument: internal consistency, interrater reliability, and test-retest reliability.
Internal consistency reliability is “the precision of a scale, based on the homogeneity (intercorrelations) of the scale’s items” (Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 196). For example, the developers of the Kessler Psychological Distress Scale (K10)—a self-report questionnaire with 10 brief items used to identify psychological distress—established internal consistency reliability using the responses of individuals to whom the scale was administered to estimate Cronbach’s alpha, which indicates the correlation among items (Kessler et al., 2002).
Greater correlation suggests that the items are identifying similar phenomena, which may be consistent with the assumed conceptual model. In the context of functional assessment, internal consistency reliability strengthens the credibility of results by minimizing the potential for collecting contradictory data.
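The alpha computation itself is straightforward. The sketch below is a plain-Python illustration using made-up item scores rather than K10 data; it implements the standard formula, k/(k−1) × (1 − Σ item variances / variance of total scores).

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a set of items.

    items: list of equal-length lists, one list of scores per item
    (each inner list holds one score per respondent).
    """
    k = len(items)
    n_respondents = len(items[0])

    def pop_var(xs):
        # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # total score per respondent across all items
    totals = [sum(item[r] for item in items) for r in range(n_respondents)]
    return k / (k - 1) * (1 - sum(pop_var(i) for i in items) / pop_var(totals))

# two perfectly correlated items -> alpha = 1.0
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))  # → 1.0
```

Less strongly intercorrelated items yield alpha values below 1, consistent with the interpretation of alpha as an index of scale homogeneity.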
Interrater and test-retest reliability are, respectively, the degree to which different raters are consistent in their observations and scoring at one point in time and the degree to which the results of a test are consistent over time. Wittchen and colleagues (1991), for example, compared the diagnoses of the same sample of patients made by clinicians and nonclinicians using the Composite International Diagnostic Interview, which is designed to identify mental disorders, and estimated Cohen’s kappa, which indicates the extent of agreement between the two types of raters. In the context of functional assessment, establishing interrater and test-retest reliability is critical, since an instrument should yield the same score when it is used by different clinicians or when the same claimant is retested, all else being held constant.
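Cohen’s kappa compares the observed agreement between two raters with the agreement expected by chance alone. A minimal sketch, using toy ratings rather than the Wittchen data:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters
    who each assign one categorical label per case."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # expected agreement if raters labeled independently at their base rates
    categories = set(rater_a) | set(rater_b)
    p_exp = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                for c in categories)
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)

# identical ratings -> kappa = 1.0
print(cohens_kappa(["dx", "none", "dx", "none"],
                   ["dx", "none", "dx", "none"]))  # → 1.0
```

When two raters agree no more often than chance would predict, kappa falls to 0, which is why it is preferred over raw percent agreement.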
The validity of an instrument is “the degree to which the instrument measures what it purports to measure” (Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 200). Three forms of validity are typically investigated when an instrument is developed. The first, content validity, is the degree to which “the domain of an instrument [i.e., what it purports to measure] is appropriate relative to its intended use.” Construct validity is the degree to which the “proposed interpretation of scores [is] based on theoretical implications associated with the constructs being measured.” And criterion validity is the “extent to which scores of the instrument are related to a criterion measure,” where criterion measures are “measures of the target construct that are widely accepted as scaled, valid measures of that construct” (all quotations from Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 200). Kessler and colleagues (2002), for example, compared results from the K10 self-report scale with the results of clinical assessments, the latter of which are assumed to be closer to the true level of psychological distress. Establishing the criterion validity of functional assessments is critical to their use in disability claims adjudication and to the underlying accuracy and defensibility of subsequent claims decisions.
Sensitivity to Change and Responsiveness
Although conceptually similar and sometimes used interchangeably, sensitivity to change and responsiveness have been defined differently (Corzillius et al., 1999; Pardasaney et al., 2012). Sensitivity to change is defined as the ability of an instrument to detect change in a state over time reliably, regardless of whether the change is meaningful, whereas responsiveness is defined as the ability of an instrument to detect a meaningful or clinically relevant change over time that is reproducible against an alternative measure or criterion (Corzillius et al., 1999; Pardasaney et al., 2012). For example, an assessment instrument might detect a significant increase in an individual’s leg muscle strength (sensitivity), but the changes might not be detectable using an alternative measure or criterion, such as the ability to stand unassisted. In the context of functional assessment, the ability to detect clinically relevant changes over time is relevant to assessing functional decline and recovery.
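One common index used to quantify responsiveness is the standardized response mean (SRM): the mean change between two assessments divided by the standard deviation of those changes. The sketch below uses invented pre/post scores purely for illustration:

```python
def standardized_response_mean(baseline, follow_up):
    """SRM = mean(change) / sample SD(change); larger magnitudes
    indicate change that stands out against within-group variability."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    n = len(changes)
    mean_c = sum(changes) / n
    sd_c = (sum((c - mean_c) ** 2 for c in changes) / (n - 1)) ** 0.5
    return mean_c / sd_c

# e.g., leg-strength scores before and after rehabilitation (invented)
print(standardized_response_mean([10, 12, 11, 13], [13, 15, 13, 16]))  # → 5.5
```

A large SRM shows only that the instrument detected change reliably (sensitivity); as the text notes, whether that change is meaningful still requires comparison against an external measure or criterion (responsiveness).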
Interpretability of Results
Interpretability refers to “the degree to which one can assign easily understood meaning to an instrument’s quantitative scores” (Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 202). An instrument’s interpretability “is facilitated by information that translates a quantitative score or change in scores to a qualitative category or other external measure that has a more familiar meaning” (Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 202). Different types of information contribute to the interpretation of scores, including their relationship to clinical conditions or significant life events (Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 202). With regard to disability determination, interpreting scores in the context of job requirements is crucially important.
Administrative and Respondent Burden
The burden involved in using an instrument is considered from the perspective of the person administering the instrument (administrative burden) and the person responding to the instrument (respondent burden). Burden encompasses “time, effort, and other demands” (Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 202), including the “amount of training and level of education or professional expertise and experience needed” (Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 203) by the person administering the instrument. Respondents with cognitive disabilities may experience additional
burden resulting from the cognitive load required to complete an instrument. Respondents with fine motor control limitations may encounter difficulties with reporting formats that require fine motor control of fingers. Respondents who are distractible and have difficulty remaining on task may require reminders or redirection to attend to test materials.
Alternative Modes of Administration
The burden of an instrument is related to the available alternative modes of administration, which include “self-report, interviewer-administered, trained observer rating, computer-assisted interviewer-administered, [and] performance-based measures” (Scientific Advisory Committee of the Medical Outcomes Trust, 2002, p. 203). In the context of consultative examinations, understanding the administrative aspects of an assessment instrument is important to the cost and timeliness of an assessment.
Cultural and Language Adaptations
Lastly, an instrument may be adapted or translated for use with populations that differ culturally and linguistically from those for which the instrument was initially developed. The cultural and language adaptations or translations of an instrument involve two primary steps: (1) assessment of conceptual and linguistic equivalence, and (2) reevaluation of the seven priorities described above. Üstün and colleagues (2010), for example, compared the results of psychometric testing of the World Health Organization (WHO) Disability Assessment Schedule 2.0 across multiple countries. In the context of disability claims adjudication, it is important that any cultural and linguistic barriers experienced by the applicant be noted and that inferences from assessment results be appropriately qualified.
An important consideration when evaluating the validity, reliability, and usefulness of an instrument is the education, professional experience, and training required of those who can administer the instrument and, by extension, of those who can interpret the instrument’s results. When an instrument is developed, validated, and/or normed on a population, such requirements typically are specified or at least recommended. The developers may also specify requirements for additional training before individuals are certified as qualified to administer the test and/or interpret the results.
The types of professionals who are qualified to administer instruments and/or interpret their results for purposes of functional assessment of
physical and mental abilities are as varied as the conditions they represent, the multiple disciplines of the medical and allied health workforce, and the current state of the scientific literature with respect to the assessment target (e.g., major depression, ischemic heart disease, fibromyalgia, traumatic brain injury). To illustrate, a trained laboratory technician is qualified to perform neuroimaging tests using sophisticated diagnostic equipment, but is not qualified to interpret, diagnose, and report the results of those tests. Conversely, in the case of real-world applications of screening and assessment in which the goal is to identify treatments, interventions, and practices, tools may have been adapted for administration and scoring by a more diverse workforce. An example is depression screening tools (e.g., Patient Health Questionnaire-9), which are widely available and come with easy-to-follow instructions on how to administer and score them and interpret their results. Information on requirements for persons who are qualified to administer and/or interpret the results of selected instruments for assessing physical and mental abilities relevant to work is provided in Chapters 5 and 6 (and associated annex tables), respectively.
Of particular note is that in the last decade or so, community health workers have been assuming responsibility for administering and/or interpreting the results of instruments that previously were considered largely the purview of more highly trained or certified specialists or assessors. These community health workers fill an important gap, especially for underresourced communities and service programs experiencing workforce shortages, freeing up skilled health care providers to perform more complex health care tasks. They typically are recruited because of their unique knowledge of and ability to navigate patient, family, and community expectations and norms around health, functional well-being, and access to care (Crigler et al., 2011; Hartzler et al., 2018).
Another consideration pertaining to the identification of professionals with the training and expertise to perform functional assessments related to work requirements is the importance of the balance between the snapshot of an examinee’s performance provided by a particular instrument at a single point in time and the understanding gained from repeated assessments or observations over time by professionals who have frequent interactions with patients by nature of their role and responsibilities in a clinical or rehabilitative setting or other system of care. Licensed clinical social workers, occupational therapists, physical therapists, and other professionals may administer ongoing assessments in their respective roles on a multidisciplinary team. They may have responsibility for repeated assessments using standardized assessment tools and procedures, and thus may render more detailed and accurate evaluations of an individual’s physical and/or mental functioning over time than can be provided by medical specialists who have
less frequent interactions with the person and less time per encounter during the same observation period.
Also important is identifying which professionals may be best suited to evaluating an applicant for disability benefits. For example, physicians who are skilled, trained, and experienced in determining impairments, such as occupational medicine physicians and rehabilitation medicine physicians (physiatrists), may be best qualified to perform these evaluations by virtue of their training and expertise in understanding not only the physical aspects of impairment, but also the work environment and how the abilities of impaired individuals may match that environment. Accordingly, they may provide information that is most relevant and useful to the disability determination process. Other clinicians and health care providers, such as physician assistants and nurse practitioners, who have followed the person being assessed for an extended period may be best suited to providing information on the individual’s functional abilities over time, regardless of whether they have expertise in evaluating the person’s specific impairment. Other clinicians with experience in evaluating and treating impairments—such as occupational therapists, speech-language pathologists, and physical therapists—also may be well qualified to conduct these evaluations.
Beyond identifying those professionals with appropriate expertise to perform functional assessments, acquiring information helpful in determining individuals’ functional abilities relevant to work is facilitated by asking clear and specific questions that target the information of greatest use in making a disability determination. To this end, forms with such questions can be provided to relevant professionals. For example, asking one item per question imposes less cognitive load on the professional, and asking discrete rather than open-ended questions yields specific responses that are typically more useful to the adjudicator, since the usefulness of responses to open-ended questions depends on how the professional chooses to respond.
One useful approach is to front-load the short-answer—specific and straightforward—questions, such as: How long have you known the individual? What is/are the diagnoses for which you are seeing the person? What are the individual’s main symptoms? and Approximately how long has the individual had these symptoms? An expert in questionnaire development and/or psychometrics may be helpful in formulating questions that will be least ambiguous for the responding professionals and yield responses that the adjudicator will find most helpful. Such care in question design may be especially important given that providers often have limited time to evaluate an individual, and it allows for a more efficient process for both the provider and the adjudicator.
Also helpful would be providing the clinician with information that is as specific and detailed as possible about why the applicant is seeking disability benefits, as well as the type of information that would be useful to
the adjudicator for making the determination. Providing as much guidance as possible to the clinician would contribute to a more efficient evaluation. A model for consideration is the recent Accreditation Council for Graduate Medical Education requirement that residents have a formal process for transferring information used in the care of a patient, known as “hand-off,” because it has been shown that discontinuity creates opportunities for error and miscommunication (PSNet, 2018; Riebschleger and Philibert, 2011).
SSA is particularly interested in information that pertains to individuals’ capacity to sustain physical and mental work activities on an ongoing and independent basis in the face of functional limitations. However, information in applicants’ health records typically is gathered for other purposes (e.g., treatment, rehabilitation) and so does not speak unambiguously to an individual’s capacity to sustain work-related physical or mental activities for 8 hours per day, 5 days per week. Because the information obtained is shaped by the purpose for which it was originally gathered, drawing inferences from it for a different purpose is difficult, and assessments of work disability often require an inferential leap. The committee identified six primary threats to the validity of assessments of functional abilities: (1) testing maximal versus typical performance, (2) assessment of episodic activity versus sustained task performance, (3) absence of standardized testing conditions, (4) mixed-motive incentives, (5) compromised test integrity in high-stakes testing, and (6) diversity in the test population.
Testing Maximal Versus Typical Performance
In some cases, functional assessments are performed under conditions that best resemble maximal rather than typical performance, which by definition implies continuous and independent performance. For instance, the controlled settings in which physical and cognitive activities often are assessed typically fail to replicate the actual conditions under which such activities are performed at work. Specifically, such variables as social pressure (e.g., irate customers), hostile environmental conditions (e.g., temperature, humidity, noise), and continuous repetition over extended time periods and in a variety of settings are not always well replicated in the settings in which assessments of functional capacity are conducted.
Presentations to the committee by several stakeholder representatives raised similar concerns (Ford et al., 2018; Liebkemann, 2016; Liebkemann and Lang, 2017). Specifically, some functional assessments conducted in
controlled conditions may overestimate an individual’s capacity to perform work activities independently on a sustained basis. Certain medical conditions (e.g., cardiac, mental health) may interact with the context in which these assessments are carried out, so that a person is shown to be capable of performing work activities in a single episode in the presumably less adverse conditions of a controlled setting but proves unable to perform those activities in an ongoing and independent manner as demanded in an actual work setting. Research on typical versus maximal performance suggests that the antecedents of maximal performance on certain tasks do not always coincide with those of typical performance on the same tasks (DuBois et al., 1993; Salgado et al., 2015). Therefore, the individual qualities that facilitate success on assessments conducted under controlled conditions may not always ensure success in sustained performance in contexts subject to constant change. In the physical realm, the difference between peak capacity and sustained performance can be quantified specifically for aerobic functional capacity by means of cardiopulmonary exercise testing, but similar assessments of function are currently not available for all organ systems.
Assessment of Episodic Activity Versus Sustained Task Performance
Clearly, sustained and independent performance of a job involves more than the ability to perform each of the work activities separately. In other words, a job is more than the sum of its tasks; rather, it entails a series of coordination and integration processes involving the frequency, sequence, and duration of those tasks. We define meta-task processes as those concerning the coordination, integration, and sequencing of tasks; such processes require a clear understanding of task interdependence and criticality, or the consequences of an error in task performance.
Unfortunately, meta-task processes are not necessarily evident in assessments of a single work activity performed in a controlled setting. In such cases, it may be necessary to make a difficult inferential leap from evidence of the individual’s ability to perform some of the work activities involved in a job in isolation, and often in ecologically unrealistic contexts that do not accurately represent actual working conditions, to the individual’s ability to sustain the full range of job activities and meta-task processes demanded in those conditions. Therefore, assessments ideally would be representative and encompass both the full range of tasks involved in the job and the ability to meta-task as necessary to perform the job. Consider, for example, the case of someone who performs successfully in a variety of episodic assessments targeting several activities or even broader tasks of a job. This same individual might prove unable to engage in continued performance of those activities or tasks because of his or her inability to decide on their
sequencing as the result of a health condition (e.g., mental illness) or the inability to sustain required physical activities over the necessary period of time. In other words, an activity-by-activity assessment of capacity, even if it covers the entire scope of activities and tasks involved in a job, may not provide a valid prediction of the individual’s ability to sustain performance of the job over time in an independent manner.
Absence of Standardized Testing Conditions
Standardization is a basic precept of valid and reliable assessment. Assessments are sometimes administered under varied conditions, such as test administrators who go out of their way to encourage applicants or who allow examinees to rest between tasks. The results of such assessments would not be comparable to those performed in the absence of encouragement or recovery periods. Variations in testing conditions are likely to be most problematic when nonstandardized instruments or methods (e.g., clinical judgment based on a potentially unrepresentative sample of behavior) are employed.
In some cases, customized assessments focused on the performance of activities unique to a certain job type (e.g., dexterity for a wet-bench lab worker where pipettes are used) may be appropriate. Such customized assessments are less generalizable and not as rigorous as standardized ones, but they allow assessments with a higher degree of “fidelity” to the job and, therefore, possess better face validity (i.e., more likely to be perceived as relevant by applicants) (Salgado et al., 2015).
Understanding the conditions under which assessments are administered is therefore of critical importance in interpreting assessment results. For this reason, it is important to gather as much information as possible on the standardized conditions under which assessments were conducted, as well as a detailed description of any exceptions to or deviations from these conditions.
Also important is understanding the potential influence of test administrators and other third parties operating under mixed-motive incentives, in which those conducting assessments operate under conflicting external pressures that motivate them to both adjudicate and not adjudicate disability benefits. For instance, test administrators, whether consciously or not, may at times conduct assessments in a manner that provides the examinee with much greater encouragement to perform work activities than is typically encountered in an actual work setting, thereby rendering results unrepresentative of the actual conditions under which the work activities being assessed
need to be performed. This may be the case, for example, for rehabilitation specialists and other professionals engaged in therapeutic interventions. By virtue of their training, these professionals are interested in helping their patients succeed, and so might display an unduly optimistic and cheerful demeanor during the assessment. Individuals being assessed by such professionals may be able to complete certain work activities in these unusually motivating circumstances, but be unable to sustain those same activities on an ongoing basis and in the independent manner demanded by the job. The literature suggests that low-ability individuals may be particularly susceptible to this type of motivational stimulation and benefit more from motivational interventions (e.g., goal setting) relative to high-ability individuals (Kanfer and Ackerman, 1989). Thus, it is not unreasonable to think that low-ability individuals assessed under unusually motivating circumstances (e.g., a cheerful therapist) would show better results than they would be capable of achieving in a less supportive or encouraging environment.
Whenever possible, then, it is advisable to declare and formally document the mixed motives of test administrators so that the extent to which they may have motivated the examinee can be assessed, regardless of whether they intended to alter the examination conditions. Potential ways to address this issue include enacting standardized protocols and interpretive guidelines that regulate or codify the conditions of assessments and the manner in which they should be conducted by third parties, as well as communicating the goals and purpose of the assessments to those charged with administering them.
It is also important to understand the purpose for which an assessment was conducted in interpreting its results. As previously mentioned, test results gathered for one purpose (e.g., therapeutic or rehabilitation), where the objective may be to determine the maximum extent to which an individual can perform an activity even if he or she cannot sustain that activity over time, may not provide accurate information for a different purpose (e.g., assessment of work disability), where the objective is to learn the maximum extent to which the person can sustain that activity safely on a regular and continuing basis. It is likely, for instance, that some of the physical functional assessments documented in an individual’s health records were conducted by occupational and physical therapists in the context of interventions aimed at helping or motivating the person to perform at maximum capacity, rather than determining his or her capacity to sustain work activities independently over time. For this reason, it is important to understand the purpose of the original assessment when using its results for a different purpose.
Compromised Test Integrity in High-Stakes Testing
A high-stakes test is one whose results constitute the basis of a major decision, typically one involving the individual taking the test. Given the significant consequences of the adjudication of disability benefits, assessments conducted in the context of disability determination are a case of high-stakes testing. In light of the personal and social significance of adjudication decisions, the motivation to skew an assessment in the desired direction is rather high. Therefore, the use of assessment instruments developed for relatively low-stakes applications, such as research and teaching purposes—often available to a wide range of professionals for many years—is problematic. In addition, these instruments may be administered in less than fully standardized conditions and through a variety of platforms that are not always secure. In some cases, use of these instruments in less than secure conditions and over extended periods of time may result in their becoming publicly available, which is likely to compromise their integrity (AERA et al., 2014). If the content of a test became widely available publicly (e.g., on the Internet), prospective examinees could potentially preview the test questions and prepare accordingly, undermining the validity of the results. Therefore, it is important to ascertain, to the extent possible, the integrity of the assessment instruments used to inform disability determinations.
Diversity in the Test Population
The literature has long highlighted the role of race, ethnicity, and culture in assessments of physical and mental abilities, and to this day is punctuated by extensive debates over the validity and cultural relevance of procedures, tests, and assessments across racial, ethnic, and cultural groups, as well as groups identified by age and gender (Alonso et al., 2013; Baird et al., 2007; CNPAAEMI, 2016; Wild et al., 2005; Wong et al., 2000).
Indeed, patient-reported symptomatology measures and clinician/observer-rendered assessments vary in the degree to which they have been tested or adapted across diverse racial, ethnic, and cultural populations. Cross-cultural adaptations and validations of assessments in different cultural contexts and languages are predicated on the notion that such efforts take into account distinct groups’ experiences and meanings of health, behaviors, illness, symptomatology, and disability and help-seeking behaviors (Forestier et al., 2019; Fuentes and Aranda, 2018; Odole et al., 2016; Tennant et al., 2004; Zdunek et al., 2015).
Moreover, assessment instruments developed in research and training settings may not account for cultural, linguistic, or literacy factors, such as limited English proficiency or low literacy, that may limit access to such assessments. As a result, few or no assessments may be available that can
capture valid and reliable administration and scoring information for these populations. In sum, when evaluating the utility of a functional assessment instrument for informing disability determinations, it is important to consider the instrument’s performance across multiple subgroups (e.g., age, gender, socioeconomic status, race, ethnicity, cultural group) as a principle of good practice (Wild et al., 2005).
Findings

3-1. A variety of methods can be used to collect functional information (e.g., diagnostic testing, performance-based measures, self- or proxy-report measures), each of which has strengths and weaknesses, and the results of one are often used to validate those of another. Each method can yield instruments with satisfactory psychometric properties that allow their implementation in disability decision making.
3-2. It is important to consider eight properties when evaluating the quality of functional assessment instruments:
- conceptual model and measurement approach,
- reliability,
- validity,
- sensitivity to change and responsiveness,
- interpretability of results,
- administrative and respondent burden,
- alternative modes of administration (e.g., self-report and trained observer rating), and
- cultural and language adaptations (e.g., translations).
3-3. The validity of functional assessment tests is enhanced when the test users administer them for the purpose and in the context for which they were designed (e.g., target population).
3-4. Assessment instruments developed for use in research and training settings may not account for cultural, linguistic, or literacy factors, such as limited English proficiency or low literacy, that can limit access to such assessments.
3-5. Direct performance testing of physical and neurocognitive functional abilities is well developed and typically is used to assess common disease-specific deficits and monitor functional increments or decrements over time. Such testing may be useful for tracking the progress of those diseases, but its results are not necessarily generalizable to other disabling conditions.
3-6. The accuracy of self-reported information can be affected, intentionally or unintentionally, by the respondent, who may either under- or overestimate his or her ability to perform different tasks.
3-7. The use of instruments or test batteries that include validity measures can help testers determine the validity of the results obtained.
3-8. Third-party sources (e.g., friends and family members, health care and social service professionals, workplace colleagues and employers) who are suitably familiar with the applicant’s activities, health, and functional status can be particularly helpful in providing ancillary information on health and behavioral matters, physical and mental functioning, and workplace performance, sometimes supported by written documents. Such reports are at times influenced by factors such as self-interest, mixed motives, or inaccurate observations. Tests assessing beliefs, attitudes, moods, and other internal states are not suitable for proxy respondents.
3-9. Threats to the validity of assessments of functional abilities include testing of maximal versus typical performance, assessment of episodic activity versus sustained task performance, absence of standardized testing conditions, mixed-motive incentives, compromised test integrity owing to prior use of the test in low-stakes testing applications, and diverse test populations in whom tests may not have been validated.
3-10. Information obtained as evidence is often influenced by the purpose for which it was originally gathered (e.g., treatment, rehabilitation), which makes it difficult to draw inferences from that information for a different purpose (e.g., determination of work disability).
3-11. A variety of professionals representing multiple disciplines of the medical and allied health workforce are qualified to administer and interpret results of assessments of physical and mental function and have the capacity and experience to provide valuable information regarding individuals’ functional abilities.
3-12. Community health workers have assumed responsibilities for administration and/or interpretation of instruments that previously were typically considered the purview of more highly trained or certified specialists or assessors. In so doing, they have filled an important gap, especially for underresourced communities and service programs experiencing workforce shortages.
3-13. Health care data relevant to disability determinations, such as the results of specific, expensive tests (e.g., certain cardiovascular tests and psychological batteries) that are valid and potentially useful, may not be readily available because an individual lacks insurance coverage or is underinsured, or because insurance denies coverage for the tests on the grounds that they are not considered medically necessary.
3-14. Lower socioeconomic status is associated with less access to high-quality care and health care professionals, including those with expertise in providing information relevant to disability determination.
3-15. Patient-reported symptom measures and clinician/observer-rendered assessments vary in the degree to which they have been tested or adapted across diverse racial, ethnic, and cultural populations.
CONCLUSIONS

3-1. The use of measures based on item response theory that can be administered using computer adaptive testing can decrease respondent burden by reducing survey length and administration time while minimizing measurement error.
3-2. Professionals with responsibility for repeated assessments using standardized assessment tools and procedures may render more detailed and accurate evaluations of an individual’s physical and/or mental functioning over time relative to medical specialists who have less frequent interactions with the person and less time per encounter during the same observation period.
3-3. It is important to understand the nature of a proxy informant’s relationship to the individual being assessed, because a proxy may not be suitably familiar with the person or may have biases or interests that affect the accuracy of the information provided. Collecting information about the length and nature of that relationship can therefore help in interpreting the information gathered.
3-4. It is important to collect information about the nature and original purpose of an assessment instrument, as well as the conditions and context in which it was administered, to help in understanding the results with respect to potential limitations on their generalizability.
3-5. When evaluating the utility of a functional assessment instrument for informing disability determinations, it is important to consider the instrument’s performance across multiple subgroups (e.g., age, gender, socioeconomic status, race, ethnicity, cultural group) as a principle of good practice.
3-6. Disparities in access to care and health outcomes can affect not only the quantity of assessments conducted in the context of disability determinations but also the quality of the assessments that are conducted and the resulting information.
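The burden-reduction mechanism described in Conclusion 3-1 can be made concrete with a minimal sketch. The fragment below is an illustrative toy, not any instrument in actual use: it assumes a hypothetical two-parameter logistic (2PL) item bank with invented parameters, selects each successive item to maximize Fisher information at the respondent’s current estimated ability, and stops once the standard error of the estimate falls below a target — which is how computer adaptive testing shortens administration while controlling measurement error.

```python
import math

# Hypothetical 2PL item bank (a = discrimination, b = difficulty).
# Parameter values are invented for illustration only.
ITEM_BANK = [
    {"a": 1.8, "b": -1.0},
    {"a": 1.2, "b": -0.5},
    {"a": 2.0, "b": 0.0},
    {"a": 1.5, "b": 0.5},
    {"a": 1.0, "b": 1.0},
    {"a": 1.7, "b": 1.5},
]

def prob_correct(theta, item):
    """Probability of a positive response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-item["a"] * (theta - item["b"])))

def fisher_info(theta, item):
    """Fisher information an item provides at ability level theta."""
    p = prob_correct(theta, item)
    return item["a"] ** 2 * p * (1.0 - p)

def estimate_theta(administered, responses):
    """Crude maximum-likelihood estimate of theta over a coarse grid."""
    def loglik(theta):
        ll = 0.0
        for idx, r in zip(administered, responses):
            p = prob_correct(theta, ITEM_BANK[idx])
            ll += math.log(p if r else 1.0 - p)
        return ll
    grid = [g / 10.0 for g in range(-40, 41)]  # theta in [-4, 4]
    return max(grid, key=loglik)

def run_cat(respond, theta0=0.0, se_stop=0.6, max_items=5):
    """Administer items adaptively until SE(theta) drops below se_stop.

    `respond` is a callable simulating the respondent: item -> bool.
    """
    theta, administered, responses = theta0, [], []
    for _ in range(max_items):
        remaining = [i for i in range(len(ITEM_BANK)) if i not in administered]
        # Pick the unadministered item most informative at the current theta.
        idx = max(remaining, key=lambda i: fisher_info(theta, ITEM_BANK[i]))
        administered.append(idx)
        responses.append(respond(ITEM_BANK[idx]))
        theta = estimate_theta(administered, responses)
        total_info = sum(fisher_info(theta, ITEM_BANK[i]) for i in administered)
        if 1.0 / math.sqrt(total_info) < se_stop:
            break  # Precision target reached: stop early, saving items.
    return theta, administered
```

Because each item is chosen where it is most informative for that particular respondent, a fixed precision target is typically reached with fewer items than a full fixed-length survey would require — the property the conclusion attributes to IRT-based adaptive measures.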
REFERENCES

AERA, APA, and NCME (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education). 2014. Standards for educational and psychological testing. Washington, DC: AERA, APA, and NCME.
AHRQ (Agency for Healthcare Research and Quality). 2018. 2017 National healthcare quality and disparities report. Rockville, MD: AHRQ. http://www.ahrq.gov/research/findings/nhqrdr/nhqdr17/index.html (accessed April 3, 2019).
Alonso, J., S. J. Bartlett, M. Rose, N. K. Aaronson, J. E. Chaplin, F. Efficace, A. Leplège, A. Lu, D. S. Tulsky, H. Raat, U. Ravens-Sieberer, D. Revicki, C. B. Terwee, J. M. Valderas, D. Cella, C. B. Forrest, and the PROMIS International Group. 2013. The case for an international Patient-Reported Outcomes Measurement Information System (PROMIS®) initiative. Health and Quality of Life Outcomes 11:210.
Anglin, D. M., P. M. Alberti, B. G. Link, and J. C. Phelan. 2008. Racial differences in beliefs about the effectiveness and necessity of mental health treatment. American Journal of Community Psychology 42(1–2):17–24. doi: 10.1007/s10464-008-9189-5.
Baird, A. D., M. Ford, and K. Podell. 2007. Ethnic differences in functional and neuropsychological test performance in older adults. Archives of Clinical Neuropsychology 22(3):309–318.
Brandt, D. E., A. J. Houtenville, M. T. Huynh, L. Chan, and E. K. Rasch. 2011. Connecting contemporary paradigms to the Social Security Administration’s disability evaluation process. Journal of Disability Policy Studies 22(2):116–128.
Carpenter-Song, E., E. Chu, R. E. Drake, M. Ritsema, B. Smith, and H. Alverson. 2010. Ethno-cultural variations in the experience and meaning of mental illness and treatment: Implications for access and utilization. Transcultural Psychiatry 47(2):224–251.
Chan, L. 2018. The WD-FAB: Development and validation testing. Presentation to the European Union of Medicine in Assurance and Social Security, March 2. https://www.eumass.eu/wp-content/uploads/2018/03/Leighton-Porcino.pdf (accessed April 3, 2019).
Cheville, A. L., K. J. Yost, D. R. Larson, K. Dos Santos, M. M. O’Byrne, M. T. Chang, T. M. Therneau, F. E. Diehn, and P. Yang. 2012. Performance of an item response theory-based computer adaptive test in identifying functional decline. Archives of Physical Medicine and Rehabilitation 93(7):1153–1160.
Choi, B. 2015. Measurement precision for Oswestry Back Pain Disability Questionnaire versus a web-based computer adaptive testing for measuring back pain. Journal of Back and Musculoskeletal Rehabilitation 28(1):145–152.
CNPAAEMI (Council of National Psychological Associations for the Advancement of Ethnic Minority Interests). 2016. Testing and assessment with persons & communities of color. Washington, DC: American Psychological Association. https://www.apa.org/pi/oema/resources/testing-assessment-monograph.pdf (accessed April 3, 2019).
Corzillius, M., P. Fortin, and G. Stucki. 1999. Responsiveness and sensitivity to change of SLE disease activity measures. Lupus 8(8):655–659.
Crigler, L., K. Hill, R. Furth, and D. Bjerregaard. 2011. Community Health Worker Assessment and Improvement Matrix (CHW AIM): A toolkit for improving CHW programs and services. Bethesda, MD: U.S. Agency for International Development.
de Boer, J. N., A. E. Voppel, M. J. H. Begemann, H. G. Schnack, F. Wijnen, and I. E. C. Sommer. 2018. Clinical use of semantic space models in psychiatry and neurology: A systematic review and meta-analysis. Neuroscience and Biobehavioral Reviews 93:85–92. doi: 10.1016/j.neubiorev.2018.06.008.
Dorman, P. J., F. Waddell, J. Slattery, M. Dennis, and P. Sandercock. 1997. Are proxy assessments of health status after stroke with the EuroQol questionnaire feasible, accurate, and unbiased? Stroke 28(10):1883–1887.
DuBois, C. L., P. R. Sackett, S. Zedeck, and L. Fogli. 1993. Further exploration of typical and maximum performance criteria: Definitional issues, prediction, and white-black differences. Journal of Applied Psychology 78(2):205–211. doi: 10.1037/0021-9010.78.2.205.
Duncan, P. W., S. M. Lai, D. Tyler, S. Perera, D. M. Reker, and S. Studenski. 2002. Evaluation of proxy responses to the Stroke Impact Scale. Stroke 33(11):2593–2599.
Fliege, H., J. Becker, O. B. Walter, M. Rose, J. B. Bjorner, and B. F. Klapp. 2009. Evaluation of a computer-adaptive test for the assessment of depression (D-CAT) in clinical application. International Journal of Methods in Psychiatric Research 18(1):23–36.
Ford, M., K. Lang, K. Liebkemann, and B. Silverstone. 2018. Discussion with stakeholder representatives. Presentation to the Committee on Functional Assessment for Adults with Disabilities, Washington, DC, February 26.
Forestier, B., E. Anthoine, Z. Reguiai, C. Fohrer, and M. Blanchin. 2019. A systematic review of dimensions evaluating patient experience in chronic illness. Health and Quality of Life Outcomes 17:19.
Fuentes, D., and M. P. Aranda. 2018. Disclosing psychiatric diagnosis to close others: A cultural framework based on older Latin@s participating in a depression trial in Los Angeles County. Aging & Mental Health 1–9. doi: 10.1080/13607863.2018.1506738.
Gill, T. M., S. E. Hardy, and C. S. Williams. 2002. Underestimation of disability in community-living older persons. Journal of the American Geriatrics Society 50(9):1492–1497.
Hartzler, A. L., L. Tuzzio, C. Hsu, and E. H. Wagner. 2018. Roles and functions of community health workers in primary care. Annals of Family Medicine 16(3):240–245.
HealthMeasures. 2018. PROMIS. http://www.healthmeasures.net/explore-measurementsystems/promis (accessed April 4, 2019).
IOM (Institute of Medicine). 2001. Crossing the quality chasm: A new health system for the 21st century. Washington, DC: National Academy Press.
IOM. 2003. Unequal treatment: Confronting racial and ethnic disparities in health care. Washington, DC: The National Academies Press.
IOM. 2015. Psychological testing in the service of disability determination. Washington, DC: The National Academies Press.
Joint Commission on Accreditation of Healthcare Organizations. 2003. Improving the quality of pain management through measurement and action. Oakbrook Terrace, IL: Joint Commission on Accreditation of Healthcare Organizations.
Kanfer, R., and P. L. Ackerman. 1989. Motivation and cognitive abilities: An integrative/aptitude-treatment interaction approach to skill acquisition. Journal of Applied Psychology 74(4):657–690.
Kessler, R. C., G. Andrews, L. J. Colpe, E. Hiripi, D. K. Mroczek, S. L. T. Normand, E. E. Walters, and A. M. Zaslavsky. 2002. Short screening scales to monitor population prevalences and trends in non-specific psychological distress. Psychological Medicine 32(6):959–976.
Koutsouleris, N., L. Kambeitz-Ilankovic, S. Ruhrmann, M. Rosen, A. Ruef, D. B. Dwyer, M. Paolini, K. Chisholm, J. Kambeitz, T. Haidl, A. Schmidt, J. Gillam, F. Schultze-Lutter, P. Falkai, M. Reiser, A. Riecher-Rössler, R. Upthegrove, J. Hietala, R. K. R. Salokangas, C. Pantelis, E. Meisenzahl, S. J. Wood, D. Beque, P. Brambilla, and S. Borgwardt. 2018. Prediction models of functional outcomes for individuals in the clinical high risk state for psychosis or with recent-onset depression: A multimodal, multisite machine learning analysis. JAMA Psychiatry 75(11):1156–1172. doi: 10.1001/jamapsychiatry.2018.2165.
Liebkemann, K. 2016. National Disability Forum: Developing and assessing medical evidence for extreme limitations in the ability to focus on tasks. https://www.ssa.gov/ndf/documents/Panelist%20-%20Kevin%20Liebkemann.pdf (accessed April 4, 2019).
Liebkemann, K., and K. Lang. 2017 (unpublished). Letter to the Committee on Functional Assessment for Adults with Disabilities on behalf of the National Coalition of Social Security and Supplemental Security Income Advocates, December 5.
Lum, T. Y., W. C. Lin, and R. L. Kane. 2005. Use of proxy respondents and accuracy of minimum data set assessments of activities of daily living. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences 60(5):654–659.
Mathias, S. D., M. M. Bates, D. J. Pasta, M. G. Cisternas, D. Feeny, and D. L. Patrick. 1997. Use of the Health Utilities Index with stroke patients and their caregivers. Stroke 28(10):1888–1894.
McDonough, C. M., E. Stoiber, I. M. Tomek, P. Ni, Y. J. Kim, F. Tian, and A. M. Jette. 2016. Sensitivity to change of a computer adaptive testing instrument for outcome measurement after hip and knee arthroplasty and periacetabular osteotomy. Journal of Orthopaedic & Sports Physical Therapy 46(9):756–767.
Meterko, M., E. E. Marfeo, C. M. McDonough, A. M. Jette, P. Ni, K. Bogusz, E. K. Rasch, D. E. Brandt, and L. Chan. 2015. The Work Disability Functional Assessment Battery (WD-FAB): Feasibility and psychometric properties. Archives of Physical Medicine and Rehabilitation 96(6):1028–1035. doi: 10.1016/j.apmr.2014.11.025.
Meterko, M., M. Marino, P. Ni, E. Marfeo, C. M. McDonough, A. Jette, K. Peterik, E. Rasch, D. E. Brandt, and L. Chan. 2018. Psychometric evaluation of the improved Work-Disability Functional Assessment Battery. Archives of Physical Medicine and Rehabilitation. Corrected proof available online December 19, 2018. doi: 10.1016/j.apmr.2018.09.125.
NASEM (National Academies of Sciences, Engineering, and Medicine). 2016. Informing Social Security’s process for financial capability determination. Washington, DC: The National Academies Press.
NIH (National Institutes of Health). 2018. Selecting a HealthMeasure. http://www.healthmeasures.net/applications-of-healthmeasures/guidance/selecting-a-healthmeasure (accessed April 4, 2019).
NRC (National Research Council). 2007. Engaging privacy and information technology in a digital age. Washington, DC: The National Academies Press.
Oczkowski, C., and M. O’Donnell. 2010. Reliability of proxy respondents for patients with stroke: A systematic review. Journal of Stroke and Cerebrovascular Diseases 19(5):410–416.
Odole, A. C., P. O. Ibikunle, and U. Useh. 2016. Culturally sensitive and environment-friendly outcome measures in knee and hip osteoarthritis. African Journal of Biomedical Research 19(2):71–77.
Oni, T., N. McGrath, R. BeLue, P. Roderick, S. Colagiuri, C. R. May, and N. S. Levitt. 2014. Chronic diseases and multimorbidity: A conceptual modification to the WHO ICCC model for countries in health transition. BMC Public Health 14:575. doi: 10.1186/1471-2458-14-575.
Oude Voshaar, M. A. H., H. E. Vonkeman, D. Courvoisier, A. Finckh, L. Gossec, Y. Y. Leung, K. Michaud, G. Pinheiro, E. Soriano, N. Wulfraat, A. Zink, and M. A. F. J. van de Laar. 2019. Towards standardized patient reported physical function outcome reporting: Linking ten commonly used questionnaires to a common metric. Quality of Life Research 28(1):187–197.
Pardasaney, P. K., N. K. Latham, A. M. Jette, R. C. Wagenaar, P. Ni, M. D. Slavin, and J. F. Bean. 2012. Sensitivity to change and responsiveness of four balance measures for community-dwelling older adults. Physical Therapy 92(3):388–397.
Pardasaney, P. K., P. Ni, M. D. Slavin, N. K. Latham, R. C. Wagenaar, J. Bean, and A. M. Jette. 2014. Computer-adaptive balance testing improves discrimination between community-dwelling elderly fallers and nonfallers. Archives of Physical Medicine and Rehabilitation 95(7):1320–1327.
Pickard, A. S., and S. J. Knight. 2005. Proxy evaluation of health-related quality of life: A conceptual framework for understanding multiple proxy perspectives. Medical Care 43(5):493–499.
Poulin, V., and J. Desrosiers. 2008. Participation after stroke: Comparing proxies’ and patients’ perceptions. Journal of Rehabilitation Medicine 40(1):28–35.
PSNet (Patient Safety Network). 2018. Handoffs and signouts. https://psnet.ahrq.gov/primers/primer/9/Handoffs-and-Signouts (accessed April 4, 2019).
Riebschleger, M., and I. Philibert. 2011. New standards for transitions of care: Discussion and justification. In The ACGME 2011 duty hour standard: Enhancing quality of care, supervision, and resident professional development, edited by I. Philibert and S. Amis, Jr. Chicago, IL: Accreditation Council for Graduate Medical Education. Pp. 57–59.
Ruiz, J. G., S. Priyadarshni, Z. Rahaman, K. Cabrera, S. Dang, W. M. Valencia, and M. J. Mintzer. 2018. Validation of an automatically generated screening score for frailty: The Care Assessment Need (CAN) score. BMC Geriatrics 18:106. doi: 10.1186/s12877-018-0802-7.
Salgado, J. F., S. Moscoso, J. I. Sanchez, P. Alonso, B. Choragwicka, and A. Berges. 2015. Validity of the five-factor model and their facets: The impact of performance measure and facet residualization on the bandwidth-fidelity dilemma. European Journal of Work and Organizational Psychology 24(3):325–349.
SAS Institute. 2019a. Natural language processing: What it is and why it matters. https://www.sas.com/en_us/insights/analytics/what-is-natural-language-processing-nlp.html (accessed April 4, 2019).
SAS Institute. 2019b. Predictive analytics: What it is and why it matters. https://www.sas.com/en_us/insights/analytics/predictive-analytics.html (accessed April 4, 2019).
Schalet, B. D., D. A. Revicki, K. F. Cook, E. Krishnan, J. F. Fries, and D. Cella. 2015. Establishing a common metric for physical function: Linking the HAQ-DI and SF-36 PF subscale to PROMIS® physical function. Journal of General Internal Medicine 30(10):1517–1523. doi: 10.1007/s11606-015-3360-0.
Scientific Advisory Committee of the Medical Outcomes Trust. 2002. Assessing health status and quality-of-life instruments: Attributes and review criteria. Quality of Life Research 11(3):193–205.
Shah, N. D., E. W. Steyerberg, and D. M. Kent. 2018. Big data and predictive analytics: Recalibrating expectations. JAMA 320(1):27–28. doi: 10.1001/jama.2018.5602.
SSA (U.S. Social Security Administration). 2018a. POMS DI 22505.001: Medical and nonmedical evidence. https://secure.ssa.gov/poms.nsf/lnx/0422505001#a (accessed April 4, 2019).
SSA. 2018b. POMS DI 24503.005: Categories of evidence. https://secure.ssa.gov/poms.nsf/lnx/0424503005 (accessed April 4, 2019).
Tennant, A., M. Penta, L. Tesio, G. Grimby, J. L. Thonnard, A. Slade, G. Lawton, A. Simone, J. Carter, A. Lundgren-Nilsson, M. Tripolski, H. Ring, F. Biering-Sørensen, C. Marincek, H. Burger, and S. Phillips. 2004. Assessing and adjusting for cross-cultural validity of impairment and activity limitation scales through differential item functioning within the framework of the Rasch model: The PRO-ESOR project. Medical Care 42(1 Suppl.):I37–I48.
Tyser, A. R., J. Beckmann, J. D. Franklin, C. Cheng, S. D. Hon, A. Wang, and M. Hung. 2014. Evaluation of the PROMIS physical function computer adaptive test in the upper extremity. The Journal of Hand Surgery 39(10):2047–2051.
Ubalde-Lopez, M., G. L. Delclos, F. G. Benavides, E. Calvo-Bonacho, and D. Gimeno. 2016. Measuring multimorbidity in a working population: The effect on incident sickness absence. International Archives of Occupational and Environmental Health 89(4):667–678. doi: 10.1007/s00420-015-1104-4.
Üstün, T. B., N. Kostanjsek, S. Chatterji, and J. Rehm. 2010. Measuring health and disability: Manual for the WHO disability assessment schedule. Geneva, Switzerland: WHO Press.
Wang, Y., L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, and H. Liu. 2018. Clinical information extraction applications: A literature review. Journal of Biomedical Informatics 77:34–49. doi: 10.1016/j.jbi.2017.11.011.
Ware, J. E., Jr., M. Kosinski, J. B. Bjorner, M. S. Bayliss, A. Batenhorst, C. G. Dahlöf, S. Tepper, and A. Dowson. 2003. Applications of computerized adaptive testing (CAT) to the assessment of headache impact. Quality of Life Research 12(8):935–952.
Wessler, B. S., Y. L. Lai, W. Kramer, M. Cangelosi, G. Raman, J. S. Lutz, and D. M. Kent. 2015. Clinical predictive models for cardiovascular disease: Tufts predictive analytics and comparative effectiveness clinical prediction model database. Circulation: Cardiovascular Quality and Outcomes 8(4):368–375. doi: 10.1161/CIRCOUTCOMES.115.001693.
Wild, D., A. Grove, M. Martin, S. Eremenco, S. McElroy, A. Verjee-Lorenz, and P. Erikson. 2005. Principles of good practice for the translation and cultural adaptation process for patient reported outcomes (PRO) measures: Report of the ISPOR Task Force for Translation and Cultural Adaptation. Value in Health 8(2):94–104.
Wittchen, H. U., L. N. Robins, L. Cottler, N. Sartorius, J. Burke, and D. Regier. 1991. Cross-cultural feasibility, reliability and sources of variance of the Composite International Diagnostic Interview (CIDI). Results of the multicenter WHO/ADAMHA Field Trials. British Journal of Psychiatry 159(5):645–653. doi: 10.1192/bjp.159.5.645.
Wong, T. M., T. L. Strickland, E. Fletcher-Janzen, A. Ardila, and C. R. Reynolds. 2000. Theoretical and practical issues in the neuropsychological assessment and treatment of culturally dissimilar patients. In Handbook of cross-cultural neuropsychology, edited by E. Fletcher-Janzen, T. L. Strickland, and C. R. Reynolds. Boston, MA: Springer. Pp. 3–18.
Zdunek, M., L. A. Jason, M. Evans, R. Jantke, and J. L. Newton. 2015. A cross cultural comparison of disability and symptomatology associated with CFS. International Journal of Psychology and Behavioral Sciences 5(2):98–107.