The previous chapter identifies eight competencies that the available research suggests are related to undergraduate persistence and success: (1) behaviors related to conscientiousness; (2) sense of belonging; (3) academic self-efficacy; (4) growth mindset; (5) utility goals and values; (6) intrinsic goals and interest; (7) prosocial goals and values; and (8) positive future self. This chapter turns to the measurement of students’ status relative to these competencies, a necessary precursor to research and practice that can enable both deeper understanding of the relationship between these competencies and student success and evidence-based decisions and programmatic actions that capitalize on these competencies to promote students’ persistence and success. But measurement is a complex science. Just because an assessment claims to or is intended to measure a given competency does not ensure that it does so fairly and with appropriate levels of precision. The quality of assessment matters.
This chapter focuses on the nature and quality of existing assessments of the identified competencies; summarizes well-established principles of assessment development and validation that higher education stakeholders should keep in mind as they develop, select, and/or use assessments to support student success; and considers available options and future directions for improving assessment practices. Importantly, the chapter draws on longstanding tenets that have been honed largely in the context of measuring cognitive competencies. Although the formats used in typical assessments of this study’s eight competencies (e.g., self-report surveys) may differ from the formats used to assess students’ cognitive skills, the underlying principles of measurement remain the same. Further, although this chapter focuses on the measurement of individuals’ competency, the committee fully recognizes that the identified competencies influence and are influenced by the college environment and other cultural and contextual variables. The climate on a college campus, for example, can greatly affect a student’s sense of belonging, and it may be the college context that needs to be measured and improved rather than the student’s competency.
The chapter is divided into four major sections. The first reports on the nature of the methods currently being used to assess intra- and interpersonal competencies, focusing especially on methods used to assess the eight identified competencies. The second section presents key principles of assessment quality, introducing the concepts of validity, reliability, and fairness. The third section applies these principles to evaluate the quality of current assessments of the eight competencies. The fourth section lays out a pathway to better measurement through a professional approach to developing and validating assessments for serious use and through current and future innovations in measurement and analysis. The chapter ends with conclusions and recommendations.
The committee examined the status of current assessments based on its review of the literature on intra- and interpersonal competencies, focusing primarily on the eight identified competencies. In its examination, the committee drew predominantly on three sources of evidence: (1) analysis of the assessment instruments used in the intervention studies that the committee used to judge the strength of evidence supporting each competency, (2) general review of the results of a literature search on assessments of the identified competencies (see Appendix B), and (3) close analysis of a small sample of established assessment instruments targeting one or more of the eight competencies. To provide a sense of the current landscape of these assessments, the following sections describe the types of assessment formats that could potentially be used to measure intra- and interpersonal competencies and note the extent to which each format is used in existing assessment instruments. The final subsection provides a summary of the formats used in current assessments of the eight competencies.
Self-Ratings (Selected Response)
A typical self-rating presents a trait term, or a statement, and requires the respondent to indicate the extent to which he or she agrees with the statement on a Likert scale (e.g., strongly agree, agree, neutral, disagree, strongly disagree) or to indicate the frequency of engaging in the thought or behavior described in the statement (e.g., never or rarely, sometimes, often, always or almost always). For example, the Programme for International Student Assessment (PISA) 2000 measured “sense of belonging” using a four-point agreement scale for the statements, “School is a place where…I feel like an outsider (or left out of things),” “. . . I feel awkward or out of place,” “. . . I feel lonely,” “. . . I feel like I belong,” and “. . . other students seem to like me” (Willms, 2003, p. 64). Note that of these five statements, the latter two would be positively keyed and the others negatively keyed.
The most common approach for scoring a set of self-ratings is to average the scale values (e.g., 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree) across items. This approach is commonly used because (1) the average is interpretable on the same 1-5 scale, and (2) low scores are then the result of an actual low standing on the construct, not missing responses. However, appropriate interpretation of the results of this approach rests on three assumptions: no item effects (i.e., every item has the same mean response); no item x person effects (i.e., a person who responds more highly than others on one item will respond more highly than others on all other items); and an equal interval scale (i.e., the difference in the underlying trait is the same at every score interval). All of these assumptions can be tested, and violations can be addressed with more sophisticated scoring approaches (e.g., item response theory-based scoring methods; latent class analyses).
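The averaging-with-reverse-keying procedure described above can be sketched in a few lines of code. The item names, ratings, and function below are hypothetical illustrations rather than part of any published instrument:

```python
# Illustrative sketch (not from any published instrument): scoring a
# five-item agreement scale in which some items are negatively keyed,
# as in the PISA sense-of-belonging items described above.

def score_scale(responses, reverse_keyed, scale_max=5, scale_min=1):
    """Average Likert responses after reflecting negatively keyed items.

    responses: dict mapping item name -> integer rating (scale_min-scale_max)
    reverse_keyed: set of item names whose ratings must be reflected
    """
    total = 0
    for item, rating in responses.items():
        if item in reverse_keyed:
            # Reflect the rating so that a high score always indicates
            # a high standing on the construct (e.g., 5 -> 1, 4 -> 2).
            rating = scale_max + scale_min - rating
        total += rating
    return total / len(responses)

# Hypothetical respondent: agrees with the positively keyed items and
# disagrees with the negatively keyed ones -> high sense of belonging.
ratings = {"outsider": 1, "awkward": 2, "lonely": 1, "belong": 5, "liked": 4}
negatively_keyed = {"outsider", "awkward", "lonely"}
print(score_scale(ratings, negatively_keyed))  # -> 4.6
```

Note that the result stays on the original 1-5 metric, which is one of the two reasons for averaging cited above.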
Self-ratings are by far the most common means used for measuring intra- and interpersonal competencies in general (Duckworth and Yeager, 2015; Robins et al., 2007) and the eight identified competencies in particular. Among the 87 assessment instruments used across the intervention studies discussed in Chapter 2, 74 (85%) used self-ratings with selected response formats (mostly Likert scales; see Appendix B). In addition, the available meta-analyses of the correlations between these competencies and college outcomes reviewed in Chapter 2 are based almost exclusively on data from assessments using self-rating formats (Noftle and Robins, 2007; Poropat, 2009; Richardson et al., 2012). Finally, the committee’s search for and review of measures of the eight competencies similarly revealed reliance on self-report scales. Among these, both ACT and the Educational Testing Service (ETS) have developed self-report Likert surveys of admitted students’ intra- and interpersonal competencies as an aid to identifying students’ likelihood of success, providing students with direct feedback and support services, and placing them in appropriate course levels (e.g., developmental versus college-level mathematics; see further discussion in Chapter 4). Both tests measure some of the eight competencies, along with others. For example, the ACT Engage College subscales of Academic Discipline, General Determination, Goal Striving, and Study Skills generally align with behaviors related to conscientiousness, while the subscales of Academic Self-Confidence and Social Connection relate to academic self-efficacy and sense of belonging, respectively. Similarly, ETS’s SuccessNavigator Tools and Strategies subscale includes items on organization, a behavior related to conscientiousness, while the Self-Management subscale includes items used to measure academic self-efficacy.
Despite their prevalence in assessments of the eight competencies, self-rating scales have several well-known limitations, which are discussed further later in this chapter. The first is social desirability: respondents may distort their responses to avoid embarrassment and project a favorable image (Zerbe and Paulhus, 1987). In this regard, studies of biodata have suggested that asking respondents to provide a rationale for their ratings, even when such content is not analyzed, may result in more accurate ratings (Schmitt and Kunce, 2002; Schmitt et al., 2003).
A second issue is the interpretation of item rating scales. Individual respondents may vary on how they interpret or use the rating scale, and as a result, ratings will not have the same meaning across individuals. For example, response style refers to the systematic tendency to respond a particular way regardless of the construct being measured. Common response styles are acquiescence (or “yea saying,” the tendency to respond positively) and extreme response style (the tendency to choose the extremes of the response scale) (Stricker, 1963). These tendencies can distort relationships, and research continues to focus on determining whether corrections for them might improve the quality of resulting data (Falk and Cai, 2016; He and van de Vijver, 2015).
Biographical and Personal Essays and Statements
Personal statements providing biographical information (biodata) and admissions essays have long been required in college admissions and scholarship award contexts (Willingham and Breland, 1982; Willingham et al., 1987), and they provide information about intrapersonal and interpersonal competencies. A study conducted by ETS found that college administrators and faculty members reported using personal statements to draw inferences about students’ intra- and interpersonal competencies (Kyllonen, 2008; Walpole et al., 2002), including some of those identified in Chapter 2 (e.g., perseverance, a behavior related to conscientiousness). Such use of personal statements persists despite the shortage of evidence for their predictive validity, especially after controlling for grades and test scores (Murphy et al., 2009); the potential validity threat posed by the possibility that the essay was prepared by someone other than the applicant (Willingham and Breland, 1982); and the apparent lack of standard procedures for scoring or evaluating these statements.
One utility value intervention, for example, required students enrolled in an introductory biology course to write short (one- to two-page) essays about the personal value of course material, such as how animal physiology concepts might inform their personal workout and exercise program. Three such writing assignments, staggered over the semester, were integrated into the course. Research assistants coded the essays on a 0-4 scale based on how specific and personal the utility value connection was to the individual, providing a measure of implementation fidelity and strength of utility value, with 0 indicating no utility and 4 indicating a strong connection to the individual, reflecting deep appreciation of the material. In another example, involving less extended responses, Walton and colleagues (2012) asked students participating in a sense of belonging intervention to generate two reasons for their success and/or failure in math. Two raters independently coded whether each reason indicated (1) social-relational factors, or sense of belonging with others in math; (2) nonrelational social factors (e.g., interest in math relative to other students); (3) academic self-efficacy in math; or (4) unspecified factors. The authors calculated a valence score for each category by subtracting the number of negative from the number of positive reasons generated. This valence score and scores from a separate self-report measure of the perceived warmth and fairness of the math department were standardized and averaged to form a composite measure of social connectedness to math.
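The valence and composite scoring just described involves only simple arithmetic. The sketch below, using invented numbers rather than the study’s actual data, shows how such a composite might be computed:

```python
# Illustrative sketch of the valence-and-composite arithmetic described
# above. All data are hypothetical; in the actual study, trained raters
# coded students' open-ended reasons.

def valence(positive_reasons, negative_reasons):
    """Valence score: number of positive minus number of negative reasons."""
    return positive_reasons - negative_reasons

def standardize(values):
    """Convert raw scores to z-scores (mean 0, SD 1 across the sample)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# Hypothetical sample of four students: valence of social-relational
# reasons, plus a separate self-report rating of departmental warmth.
valences = [valence(3, 1), valence(0, 2), valence(2, 2), valence(4, 0)]
warmth = [4.0, 2.5, 3.0, 4.5]

# Composite: standardize each measure, then average them per student.
z_val, z_warm = standardize(valences), standardize(warmth)
composite = [(a + b) / 2 for a, b in zip(z_val, z_warm)]
```

Standardizing before averaging puts the coded valence scores and the Likert-based warmth ratings on a common metric, so neither measure dominates the composite by virtue of its raw scale.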
Others’ Ratings, Including Letters of Recommendation
Ratings by others generally take the same form as self-ratings and personal statements, except they are provided by knowledgeable others. That is, others are asked to respond to similar selected-response surveys, often involving Likert scales, for purposes of reporting on an individual’s traits, behaviors, attitudes, skills, or dispositions, or to produce letters of recommendation that provide biodata and comment on the individual’s competencies or qualifications and personal history. Faculty members in fact report using information from letters of recommendation to draw inferences about a range of competencies of student applicants (Walpole et al., 2002). Similarly, although not apparent in the studies reviewed for this report, Likert-type scales are common in others’ ratings of individuals.
Current assessments of the eight competencies identified in Chapter 2 do not rely on others’ ratings, although ETS formerly offered the Personal Potential Index (PPI), which used this approach to measure several competencies, including planning and organization—behaviors related to conscientiousness. The PPI (Kyllonen, 2008) was a standardized letter of recommendation used for collecting information on graduate school applicants’ cognitive, intrapersonal, and interpersonal competencies as determined from recommenders’ ratings on 24 statements in six dimensions (knowledge and creativity, communication skills, teamwork, resilience, planning and organization, and ethics and integrity), along with an overall evaluation. Preliminary research showed that the instrument predicted graduate school cumulative grade point average (GPA), above and beyond predictions based on undergraduate GPA and Graduate Record Examination (GRE) scores (Klieger et al., 2015), and showed smaller subgroup differences than these other measures.
In general, ratings by others have been found to be substantially more predictive of academic achievement and job performance outcomes relative to self-ratings (Connelly and Ones, 2010), and others’ ratings add to the predictions derived from self-ratings (whereas the reverse is not true). A meta-analysis of studies of letters of recommendation (Kuncel et al., 2014), for example, found that they predicted various higher education outcomes, such as GPA and degree attainment, and for degree attainment, some of this prediction provided unique information beyond that derived from other quantitative aspects of students’ academic records. Letters typically are not standardized, however, and a key question is whether standardized letters might be fairer for applicants; be more consistent and less fatiguing for recommenders; and provide better prediction of academic outcomes, including intra- and interpersonal competencies.
Interviews
Interviews are common in higher education admissions, particularly for graduate school, medical school, and other professional schools. They often are intended, even if implicitly and imperfectly, to assess some of the eight identified intrapersonal competencies, such as growth mindset, and they suffer from the same limitations that characterize other self-report measures.
Interviews vary widely in their structure, content, interpretation or scoring, and use. They fall into three broad types: standardized, behavioral, and informal. Standardized interviews ask each respondent the same questions, such as “Tell me about why you want to pursue x” or “Tell me about your qualifications to pursue y.” Behavioral interviews ask respondents such questions as “Tell me about a time when you had to give up a planned event to meet a deadline.” And informal interviews give the interviewer free rein to pursue various questions, such as those conditional on prior responses. Interviews also vary as to whether they are administered face-to-face or online (Kell et al., in press). Although there has been little work on interviews in higher education, a meta-analysis of studies of employment interviews (McDaniel et al., 1994) found that standardized interviews generally had higher correlations with employment outcomes than informal interviews.
Performance Assessments and Behavioral Measures
Performance assessments encompass a wide variety of methods, but all involve an individual’s creating or constructing a response as opposed to answering a multiple-choice question, responding to an interview question, or being rated by a peer or teacher. Students are asked to engage in a given task, and their behavioral response to the task is then tracked and/or evaluated.
Performance assessment has become common as a way to assess students’ deeper learning and ability to apply their knowledge to solve real-world problems—for example, devising solutions for a given social problem, creating a business plan for a new product, mounting an advertising campaign to convince the members of the public to change some aspect of their behavior, or engineering a new approach to developing solar cars. Team projects can be a context for assessing students’ interpersonal competencies (see Chapter 5), and students’ involvement in complex individual performance tasks can serve as context for assessing their intrapersonal competencies. In the K-12 arena, for example, it is becoming increasingly common to ask students to assess their effort and efficacy in completing their work.
Similarly, students’ responses to challenging problems provided indicators of behaviors related to conscientiousness in a number of the intervention studies discussed in Chapter 2. For example, Walton and colleagues (2015) monitored the time students spent on an insoluble math puzzle as a measure of motivation and self-regulation, while Yeager and colleagues (2014) measured time spent on completing math problems versus consuming online media to assess these same conscientiousness-related behaviors.
While performance assessment requires that students respond to a given task, behavioral measures similarly monitor student responses or behaviors but need not be tied to a specific project or task. Instead, the assessment monitors specific behavioral indicators in given contexts and/or over specific periods of time. For example, Vansteenkiste and colleagues (2004a) tracked the number of times students visited the library and/or the recycling center to learn more about recycling as a measure of their intrinsic interest in contributing to the community.
Technology-based assessments can serve as a means of easily and unobtrusively monitoring students’ behaviors, including self-regulation and other behaviors related to conscientiousness, as in the examples cited above. At the same time, however, performance assessment can be time-consuming and costly, given the time required for both task completion and scoring, which often must be done by humans. Thus performance assessment currently can be difficult to scale up for larger studies. Further, measures often are based on responses to a single task, which raises questions about the generalizability of the scores, since research in K-12 settings has demonstrated substantial variation in performance across different tasks and topics. Students’ behaviors related to conscientiousness (e.g., self-regulation) when approaching different problem sets, for example, may well depend on the topics of the problems.
Situational Judgment Tests (SJTs)
SJTs differ from self- and others’ ratings in that they provide a hypothetical situation and ask the respondent to select the most appropriate response to that situation from a set of possibilities or to rate the appropriateness of each possibility. To date, assessment instruments designed to measure the eight competencies identified in Chapter 2 have not used this format. However, in a study conducted by Oswald and colleagues (2004), college students were administered an SJT containing hypothetical situations explicitly designed to measure 12 competencies, including perseverance, a behavior related to conscientiousness (the others were knowledge, learning, artistic, multicultural, leadership, interpersonal, citizenship, health, career, adaptability, and ethics). The following is an example used to measure the competency of perseverance:
You realize about the fourth week of the term that you have too much coursework and other activities to get them all done—at least within the amount of time you are currently doing homework. What would you do?
- Drop the nonacademic activities.
- Use your time management skills to figure out a new study plan, putting the most important coursework at the top of the list.
- Analyze how much time you are spending on the homework and consider getting help if the work seems to be taking longer than it should.
- Evaluate whether you can reduce the number of credits you are taking, or reduce the number of activities you are involved in.
- Put homework and schoolwork first—it’s why you’re at school.
SJTs are most commonly administered in paper-pencil format, with written situations and responses, but video-based versions also have been developed. Meta-analyses suggest that SJTs provide information not available from self-report personality measures or from cognitive ability tests (McDaniel et al., 2001).
Summary of Current Assessment Methods in Higher Education
The committee’s review of assessments used in higher education revealed evidence that institutions use a number of different types of assessment formats. Biographical and personal essays and statements; others’ ratings, including letters of recommendation; and interviews are common in college admissions. These assessment methods tend to address implicitly some of the eight identified competencies, including behaviors related to conscientiousness, intrinsic goals and interest, growth mindset, and academic self-efficacy. However, these methods are not evident in the research reviewed in Chapter 2. Rather, self-report surveys, particularly those using Likert-type scales, are ubiquitous in the correlational and intervention studies reviewed. A limited number of examples of performance and behavioral measures were also found in the research or evaluation studies.
The committee’s analysis of the assessment instruments used in the intervention studies reviewed (which is discussed further below) revealed these same patterns. The committee analyzed all assessments of the eight competencies found in the 61 intervention studies meeting its criteria for inclusion in the literature search conducted for this study (see Box 2-1 in Chapter 2). Overall, of 87 total instruments, 74 (85%) used self-report scales, and of these, most were Likert scales. Nevertheless, at least one assessment format other than self-report scales was used in the intervention studies that assessed five of the eight competencies (see Appendix B). These included six performance and behavioral measures related to conscientiousness, one behavioral measure of intrinsic goals and interest, and four performance and behavioral measures of sense of belonging. For example, as one measure of female engineering students’ sense of belonging in the field, Walton and colleagues (2015) used a behavioral measure, asking students to list the initials, gender, and major of their five closest friends. The authors used the number of male friends in engineering as one indicator of sense of belonging in the field.
Further, it is noteworthy that the great majority of assessments found in the intervention research were investigator-developed, although the development process often relied on previously published assessment instruments. For example, in a study of growth mindset, Brady and colleagues (2016) evaluated a values affirmation intervention among a sample of 183 Latino and white students. To assess the outcomes, the authors created a new assessment of adaptive adequacy by combining three preexisting self-report measures that loaded on a single factor (α = 0.86): (1) self-integrity, measured with seven items rated on a six-point Likert scale (adapted from Sherman et al., 2009; α = 0.87); (2) self-esteem, measured with the ten-item Rosenberg self-esteem scale and rated on a six-point Likert scale (Rosenberg, 1965; α = 0.93); and (3) hope, measured with an eight-item adult hope scale (Snyder et al., 1991; α = 0.82). In another example, Hausmann and colleagues (2009) used a three-item sense of belonging subscale created earlier by Bollen and Hoyle (1990), with responses on a five-point Likert scale.
In a few cases, different investigators used all or parts of the same, previously published instrument to assess the same competency. Fitch and colleagues (2012) used five scales from the 44-item version of the Motivated Strategies for Learning Questionnaire (MSLQ; Pintrich et al., 1991) to measure behaviors related to conscientiousness, while Haynes and colleagues (2008) used eight items from the same questionnaire, also to measure behaviors related to conscientiousness. And to measure intrinsic goals and interest, Hamm and colleagues (2014) used the Manitoba Motivation and Academic Achievement database, which includes data for two decades of separate cohorts of introductory psychology students (1992-2012). Data for the 2001-2002 cohort include responses to the Intrinsic Motivation Scale (Hall et al., 2007), which in turn was adapted from the MSLQ (Pintrich et al., 1993). In another example, Walton and Cohen (2007, Study 2) used an investigator-developed, 17-item self-report assessment of sense of belonging, which they refer to as a “social fit scale.” In a related study of an intervention to increase sense of belonging, Walton and colleagues (2012) used these 17 items, along with others, in a daily online/email survey. These examples are exceptions, however. Overall, the assessments used to measure each of the eight identified competencies differed across investigators.
As essential background for the evaluation of the quality of assessments currently being used, this section describes three foundational concepts that the measurement community uses to judge assessment quality.
Overview of Key Measures of Assessment Quality
The committee’s perspective on assessment quality is shaped largely by the Standards for Educational and Psychological Testing (Standards), sponsored jointly by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (2014). The Standards define a trinity of principles—validity, reliability/precision, and fairness—as the foundation for sound measurement. Validity refers to the nature and weight of the evidence that supports the intended interpretation and use of an assessment—that is, evidence demonstrating the extent to which the assessment actually measures what it is intended to measure, and does so in a manner that serves its intended purpose(s), such as making accurate construct-relevant distinctions among the individuals or groups being assessed. Reliability reflects the consistency, precision, and replicability of scores from a measure. Fairness in measurement is actually a validity issue: it refers to the validity of score inferences for all individuals and groups in the intended population for the test. A fair measure does not disadvantage some individuals because of characteristics that are unrelated to the construct being measured. Because reliability is a necessary but not sufficient precursor to both validity and fairness, the discussion below starts with it.
Reliability
Reliability refers to the degree to which “test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and consistent for an individual test taker” (American Educational Research Association et al., 2014, p. 223). Reliability can be estimated statistically given the availability of appropriate data. Such estimates can take many forms but in general provide a basic level of confidence that test scores may offer useful results.
When one student obtains a higher score than another on an assessment of academic self-efficacy, for instance, does this really mean that he or she has higher levels of this competency? Students’ assessment scores reflect both their status on a given construct and error arising from transient measurement conditions and other sources (termed random noise or measurement error variance), which can cause students’ observed score or rank ordering on an assessment to differ from their true score or rank ordering on the underlying competency. When reliability is higher, certain types of random errors are reduced, and observed scores on the assessment become more closely aligned with students’ actual standing on the intended construct (assuming that the assessment measures the intended construct).
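The true-score/error distinction described above is formalized in classical test theory. A standard statement of the model, using conventional psychometric notation rather than notation drawn from this chapter, is:

```latex
% Observed score X decomposes into true score T plus random error E,
% with the error uncorrelated with the true score:
X = T + E, \qquad \operatorname{Cov}(T, E) = 0
% Reliability is the proportion of observed-score variance attributable
% to true-score variance; it rises toward 1 as error variance shrinks:
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X}
           = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```

Under this model, a reliability of 0.90 means that 90 percent of the variance in observed scores reflects true differences among respondents and 10 percent reflects measurement error.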
Reliability is therefore one of several necessary conditions for scores on assessments of any of the eight identified competencies to be appropriate reflections of the actual competencies, and it is a prerequisite for obtaining solid validity evidence for an assessment. High levels of estimated reliability help ensure that decisions based on assessment scores will lead to consistent, justifiable decisions in many contexts, such as making admissions decisions, awarding scholarships, adapting instruction to a student’s performance or ability level, conducting outcomes assessment, and evaluating proficiency with respect to standards.
Reliability is estimated in multiple ways, depending on the statistical framework used and on the facets over which the user seeks to generalize the assessment results (e.g., over time, across different forms). Three major statistical frameworks are used for estimating the reliability of an assessment: classical test theory (Lord, 1955), generalizability theory (Cronbach et al., 1972), and item response theory (de Ayala, 2009). All three are based on the idea that an assessment score and the item responses contributing to it contain both variance related to the construct of interest and variance reflecting some form (or forms) of measurement error. Regardless of the statistical framework used for estimating the reliability of an assessment, however, it is generally the case, all other things being equal, that the larger the sample of behavior (across observations, items, or raters), the higher the reliability of the measurement.
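As one concrete illustration, Cronbach’s alpha, the classical-test-theory internal-consistency coefficient reported (as α) for several instruments cited in this chapter, can be computed directly from item-level responses. The sketch below uses invented data:

```python
# Cronbach's alpha: a classical-test-theory estimate of internal
# consistency. The data below are hypothetical item responses
# (rows = respondents, columns = items on a 1-5 Likert scale).

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(data):
    """data: list of respondents, each a list of k item scores."""
    k = len(data[0])
    item_vars = [variance([row[i] for row in data]) for i in range(k)]
    total_var = variance([sum(row) for row in data])
    # Alpha compares the sum of item variances with the variance of
    # the total score; consistent items inflate the total variance.
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
]
alpha = cronbach_alpha(responses)  # high for these consistent ratings
```

Consistent with the point about sample size above, alpha generally increases as construct-relevant items are added to a scale, which is one reason multi-item scales are preferred to single items.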
Validity

Although validity traditionally referred to the degree to which an assessment measures what it purports to measure (McDonald, 1999), more recent treatments of the concept also highlight the importance of the specific intended use(s) and interpretation(s) of test scores in evaluating validity. This shift is reflected in the Standards, which define validity as “the degree to which accumulated evidence and theory support a specific interpretation of test scores for a given use of a test” (American Educational Research Association et al., 2014, p. 225). Validity thus does not refer to a test per se, but to particular uses and interpretations of its scores. The Standards lay out five sources of evidence for evaluating the validity of an assessment for a particular purpose.
Evidence Based on Test Content
Evidence based on test content refers to the degree to which the content of the assessment aligns with the construct-relevant behaviors, attitudes, activities, and so on that the assessment is designed to measure (Achieve, 2010). Typically, such alignment studies involve expert panels conducting detailed reviews of test content, which provide evidence and judgments about how well the test represents its intended construct(s), including the adequacy of construct representation or evidence of construct underrepresentation. The Standards point to construct underrepresentation—defined as “the extent to which a test fails to capture important aspects of the construct domain that the test is designed to measure, resulting in test scores that do not fully represent that construct” (American Educational Research Association et al., 2014, p. 217)—as a major concern in evaluating validity. If a construct is inadequately represented by an assessment, any conclusions drawn from it are limited and may need additional verification. Expert content reviews also typically address the possibility that construct-irrelevant aspects of an assessment may influence individual scores—for example, construct-irrelevant item characteristics such as unnecessarily complex language or references to privileged experiences that could bias the assessment of a particular student’s standing on the construct (see also the discussion of fairness below).
Content-related evidence is stronger and easier to collect when there is clear definition of and agreement on the construct to be measured. However, in the domain of intrapersonal competencies, there often is a problem of overlapping definitions and fluctuating terminology. Consider that conscientiousness, grit, self-regulation, persistence, pluck, and stick-to-itiveness are closely related if not identical competencies, despite being named differently (Jackson and Roberts, 2015). In fact, assessment instruments used to measure all of these competencies might contain similar or identical item content. Credé and colleagues (2016) recently found that scores on assessments of grit were highly correlated with scores on assessments of perseverance, which in turn is a facet of the broader trait of conscientiousness. Conversely, assessments may use the same name when in fact measuring different competencies. For example, the concept of grit encompasses both perseverance of effort and consistency of interest over time (Duckworth et al., 2007), and different assessments of grit may measure one of these dimensions or both (e.g., Bowman et al., 2015; Duckworth and Quinn, 2009). Such errors have been termed “jingle-jangle fallacies” (Kelley, 1927; Pedhazur and Schmelkin, 1991; Roeser et al., 2006). That is, one cannot necessarily assume that two assessments with the same competency label actually measure the same construct (jingle fallacy) or that two assessments bearing different competency labels actually reflect different competencies (jangle fallacy).
Evidence Based on Response Processes
Evidence based on response processes ideally demonstrates that an individual’s response draws primarily on the specific competency being assessed, such that all other influences on a response are essentially random and minimized. Interviewing respondents about what they are thinking while responding to a test item or survey question has become standard practice prior to pilot or field testing of an instrument. These interviews go by such names as cognitive interviews, cognitive labs (Ruiz-Primo, 2015), protocol analysis (Ericsson and Simon, 1993), talk-aloud protocols, and verbal reports (Leighton et al., 2009). Typically, cognitive interviews involve in-depth, semistructured interviews with a small number of respondents similar to those that will be targeted in the assessment. They are commonly conducted while an individual is responding to the assessment, or sometimes retrospectively, after the examinee has completed a potential assessment item.
Cognitive interviews also are used to establish that respondents understand the questions in the manner intended. Such interviews can be valuable because the respondent may have a different interpretation of a word or phrase in a question from what is intended by the survey developers or may interpret a response category differently from how others might interpret it. For example, the respondent may choose the “neither agree nor disagree” category because of a neutral stance or because he or she is unsure. Through cognitive interviews, interpretational differences can be determined in terms of either variation at the student level or thematic differences among relevant subgroups (e.g., gender, race/ethnicity, grade level).
Evidence Based on Internal Structure
Evidence based on internal structure concerns the relationships among items in an assessment. Item factor analyses, cluster analyses, and reliability analyses are commonly conducted to examine the internal structure of items on an assessment. If an assessment is designed to measure one particular construct, such as sense of belonging, items in the assessment should be positively correlated with each other. From a factor analysis perspective, items should empirically support unidimensionality, meaning that only one factor (in the present case, the competency) is responsible for the inter-item correlations. Furthermore, because items almost always differ in the strength of their wording, they tend to elicit different levels of response, so that the mean levels of a student’s response will differ across items measuring a competency, such as sense of belonging (e.g., “I strongly believe that I belong at my university” versus “I feel comfortable on campus”).
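As a concrete, hedged illustration of internal-structure evidence, the sketch below simulates responses to a hypothetical five-item scale in which every item reflects a single underlying construct, then computes coefficient alpha, a widely reported internal consistency index. All numbers are simulated; none are drawn from any real instrument.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_respondents, n_items) response matrix."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical 5-item scale: each response = shared construct signal
# plus item-specific noise (all values simulated for illustration).
rng = np.random.default_rng(0)
construct = rng.normal(size=(500, 1))                 # latent construct
responses = construct + 0.8 * rng.normal(size=(500, 5))
alpha = cronbach_alpha(responses)                     # strong inter-item correlation -> high alpha
```

Because all five simulated items share the same construct signal, they are positively intercorrelated and alpha is high; items dominated by noise, or items tapping different constructs, would pull alpha down.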
Evidence Based on Relations to Other Variables
Evidence based on relations to other variables involves the empirical association between scores on the measure of the intended competencies with scores on other, validated measures. There are two general sources of evidence in this category: convergent and discriminant evidence and test-criterion relationships.
Convergent and discriminant evidence examines the extent to which scores from an assessment of a given construct are more strongly related to scores from another measure of the same or a closely related construct (convergent) than to scores from a measure of a different construct (discriminant). For example, because sense of belonging and growth mindset are conceptually distinct constructs,
it would make sense that different assessments of sense of belonging should correlate more highly with one another than with assessments of growth mindset, and vice versa. Similarly, one can look for convergent and discriminant relationships with other variables by examining how specific conditions are expected to relate to an individual’s standing on a construct. For example, an experiment designed to increase one’s sense of belonging (see Chapter 2) can be evaluated for actually doing so, and it also can be evaluated for not showing similar increases in other constructs, such as achievement motivation or emotional stability. If a treatment increases levels of these rival constructs (perhaps even more than sense of belonging itself), the treatment may appropriately be understood as a broader one than initially envisioned, with effects other than those intended.
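The logic of convergent and discriminant evidence can be sketched with simulated data: two hypothetical assessments of the same latent construct (here, sense of belonging) should correlate more strongly with each other than either does with an assessment of a distinct construct (here, growth mindset). Everything below is simulated for illustration only; no real instruments or data are involved.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Two conceptually distinct latent constructs (hypothetical simulation).
belonging = rng.normal(size=n)
mindset = rng.normal(size=n)

# Two different assessments of belonging and one of mindset;
# each observed score = latent construct + measurement error.
belong_a = belonging + 0.6 * rng.normal(size=n)
belong_b = belonging + 0.6 * rng.normal(size=n)
mindset_a = mindset + 0.6 * rng.normal(size=n)

convergent = np.corrcoef(belong_a, belong_b)[0, 1]     # same construct: high
discriminant = np.corrcoef(belong_a, mindset_a)[0, 1]  # different construct: near zero
```

Here the convergent correlation is substantial while the discriminant correlation hovers near zero, the pattern one would hope to see in a real validation study.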
Test-criterion relationships When people say they “validated a test,” they often mean that they correlated the scores from the test with a criterion or predicted outcome of the intended construct (e.g., the score on the SAT with first-year GPA or graduation rates). A typical or more traditional validation study might involve first administering an intra- or interpersonal competency assessment (e.g., of sense of belonging or growth mindset), and then correlating scores from the assessment with such outcome measures as graduation, GPA, absenteeism, time to degree, or any number of academic outcomes valued by higher education leaders. Such correlations between test scores and indicators of college success provide evidence for interpreting the scores as indicating readiness for college success.
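A minimal sketch of a test-criterion study, again with simulated numbers only: scores on a hypothetical competency assessment are correlated with first-year GPA, yielding a validity coefficient analogous to those reported in real predictive validity studies. The effect size and noise level below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

# Hypothetical validation study: competency scores modestly predict GPA.
competency_score = rng.normal(size=n)
gpa = 3.0 + 0.25 * competency_score + 0.5 * rng.normal(size=n)

r = np.corrcoef(competency_score, gpa)[0, 1]  # test-criterion correlation
```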
Evidence of Consequential Validity
Evidence of consequential validity involves the consequences, intended and unintended, that follow from the use of a test and decisions based on it (see Kane, 2006; Messick, 1995). Given that tests often signal the skills that institutions or employers find important, changing what is tested is a strategy for encouraging curricular change (Frederiksen, 1984). Likewise, some colleges and universities already are signaling the importance of students’ intra- and interpersonal competencies by explicitly assessing and developing them. If assessment of these competencies in higher education were to grow, it could promote greater emphasis on teaching intra- and interpersonal competencies in high school.
In evaluating the use of an assessment, it also is important to consider whether it may have unintended negative consequences. In considering whether to augment its current admissions criteria with measures of intra- or interpersonal competencies, for example, an institution would want to evaluate whether such a change might have adverse consequences for some subgroups of students. Such evaluation would include not only the
aforementioned expert review of assessment items, but also consideration of broader influences on recruitment, admissions, student development, and student success.
When assessments produce results that have serious consequences for individuals, they often are known as high-stakes tests, defined in the Standards as “a test used to provide results that have important, direct consequences for individuals, programs, or institutions involved in the testing” (American Educational Research Association et al., 2014, p. 219). By contrast, low-stakes tests carry little or no consequences for those who are assessed. The stakes associated with an assessment can engender behavior that in turn affects validity.
Some Validity Threats: Faking, Cheating, and Motivation
Validity can be compromised much more quickly than it can be established. For assessment in general, security is a major issue, and this is particularly true for high-stakes tests.
An entire industry is devoted to security topics, as are annual conferences (e.g., the 2015 Conference on Test Security), and most of the major test publishers have offices or departments dedicated to the matter. At this point, because assessments of intra- and interpersonal competencies are not being used in a high-stakes manner, issues of test security have not yet become prominent. Nevertheless, as results from such assessments begin being used in decision making that affects individuals and institutions, security measures may soon be needed for securing items, storing and handling test results, and anticipating possible security breaches, as is done with cognitive test materials. The Standards include extensive discussions and standards pertaining to these issues.
A more immediate validity issue concerns faking responses on intra- and interpersonal competency assessments, particularly if they include Likert-scale response items, which is the case for 85 percent of the assessments used in the interventions reviewed. Respondents can be tempted to use the extremes of the scale—for example, to endorse all positive statements as “most like me” and all negative statements as “least like me.” Indeed, students and applicants are often motivated to appear diligent, enthusiastic, and appreciative to those for whom they are completing an assessment (e.g., faculty members, potential employers, even researchers). High-stakes settings, such as college admissions, also provide incentives for test takers to present themselves in the most desirable light. This is not necessarily a problem: self-presenting effectively may (1) imply higher intra- and interpersonal competencies in the first place, (2) predict the outcomes that intra- and interpersonal competencies are supposed to predict, and (3) be related to the sort of self-presentation that is required to achieve the outcomes of interest (e.g., landing a job by interviewing more successfully than others).
Several approaches can be used to reduce the effects of faking, cheating, or other phenomena that distort and threaten the validity of assessment scores. One approach is to provide pretest warnings against faking (Converse et al., 2008; Dwight and Donovan, 2003; McFarland, 2003); another is to provide such warnings during the assessment itself, tailoring them to students who appear to be responding in a way that implies faking or cheating (see Pace and Borman, 2006, who illustrate other types of warnings); and a third is to require respondents to elaborate and provide follow-up justification for their test responses (Schmitt and Kunce, 2002; Schmitt et al., 2003). Note, however, that an elaboration requirement could impose an unnecessary burden and cognitive load, which could be responsible for lowering test scores in addition to serving the intended purpose of keeping the respondent honest.
As a final note, although a large literature is focused on social desirability (i.e., the tendency to think of and present oneself in a favorable light), measures of social desirability typically have not been proven useful for adjusting scores on an assessment to control for this tendency statistically (Sackett, 2012). Situational judgment tests and performance tests, discussed earlier, and forced-choice formats, discussed below, are intended to reduce the possibility of faking on intra- and interpersonal competency assessments and thereby increase the validity of scores. Regardless of the method employed, however, the higher the stakes associated with an assessment, the greater is the need for strong evidence of validity, as it would be unfair to make important decisions about an individual based on an assessment score that lacked credibility.
For low-stakes tests, defined by the Standards as those tests yielding data that have relatively minor consequences for individuals, programs, or institutions involved in the testing, the primary threat to validity is lack of motivation. In cognitive tests, differences in motivation among examinees can lead to substantial differences in scores (Liu et al., 2012).
In summary, the stakes associated with assessment results influence the validity issues that require investigation. The committee agrees with the Standards (American Educational Research Association et al., 2014, p. 22), which specify that
the amount and character of evidence required to support a provisional judgment of validity often vary between areas and also within an area as research on a topic advances. For example, prevailing standards of evidence may vary with the stakes involved in the use or interpretation of test scores. Higher stakes may entail higher standards of evidence.
In the assessment context, fairness is a broad concept that generally refers to the degree to which a test measures the same construct and scores have the same meaning for different individuals, or more commonly, for different subgroups of the population for which the test is intended. In other words, fairness is a question of the validity of scores for all intended individuals and subgroups.
Typical subgroup characteristics are sex (male, female), race/ethnicity (e.g., white, Hispanic, black, American Indian), culture, first language (e.g., English versus another language), socioeconomic status, and immigrant status. One might also consider college-specific subgroup categories, such as first- versus continuing-generation students, domestic versus international students, or on-campus students versus commuters.
The Standards lay out an overarching standard for fairness:
All steps in the testing process, including test design, validation, development, administration and scoring procedures, should be designed in such a way as to minimize construct irrelevant variance and to promote valid score interpretations for the intended uses for all examinees in the intended population. (American Educational Research Association et al., 2014, p. 63)
This means that fairness needs to be designed in and investigated at multiple points of the test development, validation, and use cycle, using multiple methods. A first issue is being clear on for whom the test is intended and taking into consideration the diversity of the examinee pool in designing a test. For example, if some in the intended population are not fully English proficient or have disabilities, care must be taken to design an assessment that is accessible for those students—for example, using principles of universal design, avoiding unnecessarily complex language (if language is not related to the construct of interest), and developing accommodations for those who otherwise would be unfairly disadvantaged in demonstrating their competency (universal design and accommodations are discussed further below). During the test development process, test items and forms usually undergo fairness reviews by committees comprising members of the pertinent subgroups as one step in ensuring fairness.
In addition, fairness is evaluated statistically following pilot or field testing and/or based on ongoing large-scale operational testing. These analyses help determine whether there is empirical support that the construct is being measured in the same manner across subgroups at both the item and test levels (Dorans and Holland, 1992). Items and tests that appear to function differently across subgroups are said to exhibit differential item functioning and differential test functioning, respectively. When tests
function similarly across subgroups, they are said to show measurement equivalence or measurement invariance. Generally speaking, testing for measurement invariance entails administering successive tests to determine whether subgroups are equivalent in terms of (1) the number of factors underlying the assessment, then (2) the extent to which each item reflects the construct of interest, and then (3) whether the underlying subgroup means on the construct are unbiased and can therefore be used to make subgroup comparisons. The literature provides the technical details of these tests in both the item response theory and confirmatory factor analysis frameworks (Raju et al., 2002; Vandenberg and Lance, 2000).
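To make the idea of differential item functioning concrete, the sketch below simulates a short test in which two groups have identical ability distributions but one item is deliberately made harder for the focal group. Matching respondents on their rest score (the total excluding the studied item) and comparing endorsement rates within matched strata is a simplified stand-in for formal DIF procedures such as Mantel-Haenszel; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 4000, 6
theta = rng.normal(size=n)           # ability; same distribution in both groups
group = rng.integers(0, 2, size=n)   # 0 = reference group, 1 = focal group

# Item difficulties; item 3 is 1.0 logits harder for the focal group only,
# i.e., a deliberately biased item (all parameter values hypothetical).
b = np.linspace(-1.0, 1.0, k)
difficulty = np.tile(b, (n, 1))
difficulty[group == 1, 3] += 1.0

# Rasch-style response probabilities and simulated 0/1 responses.
prob = 1.0 / (1.0 + np.exp(-(theta[:, None] - difficulty)))
resp = (rng.random((n, k)) < prob).astype(int)

# Simplified DIF screen: within strata matched on the rest score (total
# excluding item 3), compare item-3 endorsement rates across groups.
rest = resp.sum(axis=1) - resp[:, 3]
gaps = []
for s in range(1, k - 1):            # interior rest-score strata
    in_stratum = rest == s
    ref_rate = resp[in_stratum & (group == 0), 3].mean()
    focal_rate = resp[in_stratum & (group == 1), 3].mean()
    gaps.append(ref_rate - focal_rate)
mean_gap = float(np.mean(gaps))      # clearly positive: DIF against the focal group
```

Because respondents are matched on the rest of the test, a persistent gap in endorsement of the studied item signals that the item functions differently across groups rather than reflecting a true group difference in the construct.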
The possibility of differential validity also is a major concern in evaluating fairness. Evidence of differential validity means that correlations between an assessment and an outcome may differ statistically between subgroups (e.g., males versus females, black versus white students), indicating as well that the meaning of the scores differs across groups (Young and Kobrin, 2001). A distinct but related concept is differential prediction (Cleary, 1966), where regression lines show statistically different slopes between subgroups when an outcome is being predicted, implying different predicted values for each subgroup. To address fairness, the measurement of intra- and interpersonal competencies needs to consider both differential validity and differential prediction when test-criterion relationships are being analyzed and explored.
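Differential prediction can be illustrated with a small simulation: if the regression of GPA on test scores has a different slope in each subgroup, a single pooled regression line will systematically over- or under-predict for one of the groups. The slopes and noise levels below are hypothetical values chosen to make the contrast visible.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical subgroups with different score-to-GPA slopes.
score_a = rng.normal(size=n)
score_b = rng.normal(size=n)
gpa_a = 3.0 + 0.40 * score_a + 0.3 * rng.normal(size=n)
gpa_b = 3.0 + 0.15 * score_b + 0.3 * rng.normal(size=n)

slope_a, intercept_a = np.polyfit(score_a, gpa_a, 1)
slope_b, intercept_b = np.polyfit(score_b, gpa_b, 1)
# Differing slopes imply different predicted GPAs for the same test score,
# the signature of differential prediction.
```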
Fairness in the assessment of intra- and interpersonal competencies also becomes especially important in multicultural contexts because of issues of the comparability of the constructs across cultures and language groups and thus the comparability of measurements of such constructs. Intra- and interpersonal competency assessments will raise unique challenges related to cultural judgments of fairness, given that different cultures may express and value these competencies differently (Sato et al., 2015). The appropriateness of construct definitions for different cultural groups and the comparability of the constructs and of the measures across cultural groups are central concerns in deriving valid score-based inferences from an assessment (Ercikan and Oliveri, 2016).
This section considers evidence on the reliability, validity, and fairness of existing assessments of the eight intra- and interpersonal competencies identified in Chapter 2. The discussion is based on an examination of the assessments used in the intervention studies cited in Chapter 2 and on close analysis of two specific, established assessment instruments. One of the latter instruments was selected because it measures several of the eight competencies
and includes subscales that were used in several of the intervention studies reviewed; the other was selected because it measures sense of belonging, a competency for which the committee found promising evidence of a relationship to college success.
Assessment Quality in Intervention Studies
As noted earlier, the committee closely examined the intervention studies cited in Chapter 2 to identify what assessments were used in each study and what evidence of their reliability, validity, and fairness was provided in the study reports (see Appendix B). This review indicated that overall, the investigators paid little attention to the quality of the assessments used. As shown in Table 3-1, of the 46 studies that assess at least one of the eight competencies, fewer than half provide evidence of the reliability of the assessments used. Studies reporting on reliability almost uniformly report coefficient alpha, a measure of internal consistency; the reported values range from 0.60 to 0.93, indicating a modest to high level of reliability.

TABLE 3-1 Evidence of Measurement Quality in Intervention Studies

| Competency | Number of Studies Assessing Competency | Studies with Reliability Evidence | Studies with Validity Evidence | Studies with Fairness Evidence |
| --- | --- | --- | --- | --- |
| Behaviors Related to Conscientiousness | 15 | 5 (range 0.67-0.98) | 0 | 0 |
| Academic Self-Efficacy | 2 | 2 (range 0.76-0.95) | 0 | 0 |
| Growth Mindset | 12 | 2 (range 0.63-0.88) | 0 | 0 |
| Intrinsic Goals and Interest | 2 | 1 (0.72) | 0 | 0 |
| Positive Future Self | 3 | 3 (range 0.60-0.92) | 0 | 0 |
| Prosocial Goals and Values | 0 | 0 | 0 | 0 |
| Sense of Belonging | 8 | 6 (range 0.63-0.93) | 1 | 0 |
| Utility Goals and Values | 4 | 3 (range 0.78-0.93) | 0 | 0 |

Only one study includes any evidence of validity: Cohen and Garcia (2005) cite evidence of convergent validity in the strong correlation of their Racial Identification scale with the Race Centrality subscale of the Multidimensional Inventory of Black Identity (Sellers et al., 1997) (r(34) = 0.79, p < .01) (Cohen and Garcia, 2005, pp. 571-572). None of the 46 studies explicitly reports evidence of fairness.
Assessment Quality in Established Instruments
To date, relatively few assessment instruments measuring one or more of the eight competencies have undergone enough research, development, and testing to yield durable evidence of reliability, validity, and fairness. That said, there are a few notable exceptions. The focus here is on the MSLQ (Pintrich et al., 1993), which measures several of the eight competencies (along with a few other more cognitive competencies), along with the Sense of Belonging scale (Bollen and Hoyle, 1990). Brief mention also is made of the extensive validation research that has been conducted on ACT’s Engage and ETS’s SuccessNavigator, two assessments for admitted students that address a broad range of intra- and interpersonal competencies, including a few of those identified in Chapter 2.
Motivated Strategies for Learning Questionnaire
The MSLQ is a self-report survey used widely in higher education research. The assessment underwent a 10-year development, refinement, and validation process, with funding from the U.S. Department of Education’s research division, then known as the Office of Educational Research and Improvement (Pintrich et al., 1991). It was initially developed by Duncan, Pintrich, and McKeachie as a product of their theoretical model of college student motivation and self-regulated learning (Duncan and McKeachie, 2005).
The full 81-item instrument is composed of two sections, each focused on student responses to a particular classroom course. The first section, consisting of six related scales, focuses on motivation. Five of the six tap constructs related to the competencies identified in Chapter 2 (the sixth is test anxiety). These five scales reflect students’ course goals (including intrinsic goals and interest), perceived control of learning (i.e., growth mindset), value beliefs (i.e., utility value), and sense of self-efficacy for learning and performance (i.e., academic self-efficacy). The second section, which assesses learning strategies, contains nine scales addressing students’ cognitive and metacognitive strategies and their management of learning resources. Of these, four appear congruent with behaviors related to conscientiousness: organization, metacognitive self-regulation, time/study environmental management, and effort regulation. For each item, students are presented with a construct-related statement and are asked to rate themselves on a seven-point scale (from 1 = not true at all of me to 7 = very true of me). Coefficient alphas for the various scales relevant to this report range from 0.62 to 0.93, with the majority in the range of 0.68 to 0.80, thus revealing moderate to good reliability; the task value and self-efficacy scales show alphas at or above 0.90.
Validity for the MSLQ can be attributed in part to the content derived from its strong theoretical base, which is situated in a social-cognitive view of motivation and learning strategies. Complementing this, both confirmatory factor analysis and structural modeling in a large sample of college students (N = 380) provided empirical support for the constructs within the motivation and learning strategies sections (Pintrich et al., 1993). The statistical fit of these models to the data was reasonable according to a range of goodness-of-fit indices, including the chi-squared test, goodness-of-fit index (GFI), adjusted goodness-of-fit index (AGFI), and root mean square residual (RMR). The relationship between the 15 scales and final course grades showed modest evidence of predictive validity, with self-efficacy for learning and performance showing the highest validity (r = 0.41). Overall, given the patterns of convergent and discriminant validity found among the scales and in their correlations with final course grades, the authors propose that the MSLQ scales are “valid measures of the motivational and cognitive constructs” (p. 811). The validation studies, however, provide no evidence related to fairness toward relevant subgroups, such as those defined by gender and race/ethnicity, in operational settings, a gap warranting further exploration.
Sense of Belonging Scale
The three-item Sense of Belonging scale is based on Bollen and Hoyle’s (1990) work on measuring perceived cohesion and has been part of the Diverse Learning Environment Survey, funded by the Ford Foundation and conducted by the Cooperative Institutional Research Program (CIRP) of the Higher Education Research Institute (HERI) at the University of California, Los Angeles. The authors based their original six-item Perceived Cohesion scale on a theoretical definition of the construct: “Perceived cohesion encompasses an individual’s sense of belonging to a particular group and his or her feelings of morale associated with membership in the group” (p. 482). Three of the six items measure students’ sense of belonging to a community or institution, and the remaining three relate to their feelings of morale. The three sense of belonging items are (1) I feel a sense of belonging to____________________; (2) I feel that I am a member of the _________________ community; and (3) I see
myself as part of the ___________________ community. Students respond to these statements on an 11-point Likert scale (from 0 = strongly disagree to 10 = strongly agree).
Bollen and Hoyle (1990) tested the validity of their scale in a study involving two samples of respondents, the first a randomly selected group of 102 undergraduates from a small, northeastern college reputed to have strong school spirit, and the second a random sample of 110 residents of a midsized northeastern city. The study tested the hypothesis that the college students would show greater cohesion than the city residents. Analyses of both samples found that a two-factor model, reflecting, respectively, items on sense of belonging and feelings of morale, was a good fit to the data based on a number of model fit indices (i.e., chi-squared test, GFI, AGFI). However, the unrestricted model revealed that the two factors were empirically indistinguishable. Nonetheless, the authors argue that although the two may be empirically indistinct, they remain theoretically useful as an overall construct of cohesion. The analysis revealed the measures were reliable with a high degree of structural invariance across the two samples (high and equal factor loadings). Further, the latent group means for the college group on these two constructs underlying cohesion were both higher than those of the city resident group—as the authors hypothesize, supporting the validity of interpreting scores as students’ standing on the construct of group cohesion. The sole example of evidence related to fairness was an investigation of potential bias for individuals from the middle versus the working class. None was found.
As noted, the Sense of Belonging scale currently is part of HERI’s CIRP, which conducted additional expert and practitioner review and psychometric analysis of the scale as part of the pilot and field testing of the Diverse Learning Environment Survey (Hurtado and Guillermo-Wann, 2013). The three-item scale again showed high internal consistency reliability (α = .93); in such a short scale, this means the three items were highly correlated (people who responded at a high or low level on a given belongingness item generally did the same for the others).
Engage and SuccessNavigator
ACT and ETS, the major publishers of college admissions tests, have each developed assessments of college readiness—Engage and SuccessNavigator, respectively—that measure a few of the eight competencies identified by the committee, along with a range of other competencies. Both instruments are designed for use with students already admitted into college to identify proactively those who may require additional support and to assist colleges in identifying the developmental interventions that will increase the likelihood of students’ persistence and academic success. Both instruments have undergone extensive research and development (Le et al., 2005;
Markle et al., 2013a; Rikoon et al., 2014; Robbins et al., 2004, 2006; Wiley et al., 2010). Similar to the above examples, this work included thorough grounding in available theory; pilot and field testing to establish reliability; first- and second-order factor analyses to provide empirical support for the instruments’ conceptual design; and analysis of the relationships between the assessment scores and various criteria, including course grades, retention, and graduation (e.g., Moore et al., 2015). Further supporting the validity of score use, studies have examined the relationship between the use of the scores for placement and subsequent student success with coursework (Rikoon et al., 2014). Fairness also has been explicitly examined. For example, SuccessNavigator was subjected to tests for both measurement (reliability) and structural (validity/prediction) invariance by gender and race/ethnicity (Markle et al., 2013a).
The committee’s analysis of the quality of existing assessments of the eight identified competencies indicates room for improvement. Such improvement starts, the committee believes, with a professional approach to assessment development, considers new measurement options that may mitigate existing shortcomings, and includes the use of multiple measures and multiple levels of analysis.
Validity and fairness are driving concerns throughout a rigorous test development and validation process. As the Standards note:
. . . all steps in the testing process, including design, validation, development, administration, and scoring procedures, should be designed in such a manner as to minimize construct-irrelevant variance and to promote valid score interpretations for the intended uses for all examinees in the intended population. (American Educational Research Association et al., 2014, p. 63)
Borrowing from the Standards, Downing and Haladyna’s (2006) 12-step framework, and Mislevy and colleagues’ (2003) evidence-centered design, the subsections below outline a systematic process of test specification, item development and review, administration, and validation.2 Although the
2 While it is beyond the scope of this chapter to provide a detailed discussion of the full range of issues involved in assessment development, the interested reader is referred to more extensive treatments elsewhere (Downing and Haladyna, 2006; Lane et al., 2015; Schmeiser and Welch, 2006).
Standards are a primary source for the discussion, there are several other significant, but perhaps more specialized, sets of assessment standards also worthy of consideration.3
Test Specifications: Overall Plan for Assessment Development
Because the quality of an assessment is evaluated largely by how well it measures intended constructs and serves the intended purposes, it is important to begin the assessment development process with a clear articulation of the constructs to be assessed and the purposes, or set of purposes, to be served. Test specifications provide an overall plan for assessment development that begins with a description of these purposes.
The specified content definition elaborates the meaning of the construct to be measured. Dweck (2006), for example, argues that growth mindset is the belief that intelligence is not fixed, but is malleable; it changes with learning and experience, and it can be taught (see Chapter 2). Thus a content definition of the growth mindset construct would include these elements, and a sampling plan would articulate how an assessment would be based on sampling those elements (what percentage of items would address one’s beliefs in a fixed versus malleable view of intelligence, one’s beliefs about the efficacy of changing intelligence, and so on).
The test purpose(s) and score interpretations also are specified in the assessment development process. Doing so is important because different purposes and score interpretations have implications for the assessment’s design and psychometric characteristics. For example, an assessment intended to determine and interpret a student’s level of growth mindset for the purpose of informing decisions about admission or placement will need to differentiate students at various specified levels in a psychometrically reliable manner. By contrast, an assessment that will be used to evaluate how well an intervention influences the development of the construct needs to measure growth mindset over time, perhaps at the group level.
In addition, test specifications lay out a detailed plan for who will be tested and how the test will work. The Standards (American Educational Research Association et al., 2014, p. 85) present a comprehensive and detailed plan for what should be included:
3 These include the Society for Industrial and Organizational Psychology’s (2003) Principles for the Validation and Use of Personnel Selection Procedures (4th ed.); the International Test Commission’s (2014) Guidelines on Quality Control in Scoring, Test Analysis, and Reporting of Test Scores; and the Educational Testing Service’s (2014) ETS Standards for Quality and Fairness.
- the purpose(s) of the test,
- the definition of the construct or domain measured,
- the intended examinee population,
- interpretations for intended uses,
- the content of the test,
- the proposed test length,
- the item formats,
- the desired psychometric properties of the test items and the test, and
- the intended sequencing of items and sections.
The Standards also suggest that the process of documenting the validity of the interpretation of test scores starts with the rationale for the test specifications. The specifications should be subject to external review of their quality by qualified and diverse experts (American Educational Research Association et al., 2014, p. 87, Standard 4.6) who can provide objective judgments. Particularly important is an evaluation of the content definition and the extent to which it represents the intended intra- or interpersonal construct(s). This definition not only will guide test development but also will serve as a critical touch point in evaluating existing assessments for use. Critically, the Standards point to construct underrepresentation as a major issue in assessment validity. If a construct is inadequately represented by an assessment, any conclusions drawn from the assessment results are limited and may need additional verification.
Close examination of test specifications when selecting assessment instruments also can help stakeholders in higher education guard against the “jingle-jangle” fallacies (Kelley, 1927; Pedhazur and Schmelkin, 1991; Roeser et al., 2006) mentioned earlier: the erroneous assumptions that two assessments bearing the same name necessarily measure the same construct, or that two assessments bearing different names necessarily measure different constructs.
Item Development and Review
According to the Standards, assessment items are developed, based on the test specifications, by teams of individuals trained and qualified in the process, which typically would mean that they are familiar with the construct, with the item types used to measure the construct, and with how to interpret psychometric findings from a pilot or field test.
Universal design and accommodations in item development Attention to individuals with disabilities and to those for whom English is not the first language, as well as issues surrounding accommodations for those individuals, is one of the ways in which assessment development in the United States has changed most dramatically over the past 15 years (Thurlow et al., 2006). These concerns have become particularly important since passage of the 2008 Americans with Disabilities Amendments Act.4 A central issue concerns test development procedures that avoid biased scores due to construct-irrelevant variance related to a student’s disability or English language proficiency, thereby supporting comparability of scores and comparability of inferences about the intended construct across all intended test takers.
Universal design—a concept that originated in architecture to make buildings accessible to all—has now become standard practice in assessment development (Johnstone et al., 2008) and increasingly is required by state and federal education agencies (Laitusis, 2007). In the present context, universal design means designing items that will be accessible to as wide a range of the intended examinee population as possible—for example, by eliminating unnecessarily complex language (when such language is construct-irrelevant) so as not to bias results for non-English-fluent students or those with reading disabilities. Accommodations for students with disabilities include such things as screen and text readers (software that converts text from a screen into speech and allows use of a keyboard rather than a mouse) and tactile graphics (figures raised from the text, often with captions in braille) (Educational Testing Service, 2010; Laitusis, 2007). Test administrators also may allow for alternative response modes (e.g., speaking, pointing), additional testing time, and unproctored testing (Beaty et al., 2011). Because intra- and interpersonal competencies often are assessed through written self-report instruments, and language ability is not the target of such measures, the use of simplified English expression to reduce the language demands of a test may be particularly important for ensuring accessibility of items for students who are not fully fluent in English.5
Item review Once assessment items and any accommodations have been developed, they are subject to content and fairness reviews, typically conducted by committees of substantive and psychometric experts, along with experts knowledgeable about the populations to be tested. These experts evaluate the extent to which the items match their construct targets, reflect the test specifications, and are accurate in content. Items also are reviewed for fairness to ensure that they do not contain construct-irrelevant characteristics that could impede some students’ ability to understand or respond appropriately, which would distort “the meaning of the scores and thereby
4 Public Law 110-325, http://www.access-board.gov/about/laws/ada-amendments.htm [July 2016].
5 The chapter in the Standards on fairness (Chapter III) provides additional detail on several key principles in the use of appropriate test accommodations and modifications for English language learners and students with disabilities and in the reporting of scores from such assessments.
decrease the validity of the proposed interpretation” (American Educational Research Association et al., 2014, p. 217).
Test Design, Assembly, and Field Testing
Test design and assembly refers to the process of compiling the collection of items and tasks that will form the actual test (also called a “test form”) so that they conform to the test specifications. Test assembly often involves selecting items to satisfy psychometric requirements, such as test form reliability, as well as content requirements. There are increasingly sophisticated approaches for conducting test design and assembly, including evidence-centered design (Mislevy and Riconscente, 2006), automatic item generation (Gierl and Haladyna, 2013; Irvine and Kyllonen, 2002), optimal test design (van der Linden, 2005), and assessment engineering (Luecht, 2013).
Alternative test forms are those that measure the same construct, are built to the same set of test specifications, and are to be administered under similar conditions. Alternative forms are useful because the security of item content can be compromised through item exposure, and alternative forms can help minimize exposure of the content over time for any particular item. Alternative forms may also be useful in the context of a program evaluation in which an assessment is conducted both prior to and following the treatment (e.g., pretest, posttest, delayed posttest). In other contexts, items may partially overlap across alternative forms, such as in adaptive testing, where items are tailored to each examinee’s responses, or when one seeks to measure a construct at the group level, but each student receives only a limited number of items.
Test administration encompasses proctoring, the qualifications of those involved in the test administration, security, timing, translations, and issues associated with accommodations for test takers with special needs. All of these issues are considered to ensure that an assessment measures the construct it is intended to measure and to minimize the effects of cheating, adverse testing conditions, and other factors that might otherwise induce construct-irrelevant bias or variance on test scores. Although these issues are routinely considered critical for cognitive assessments, they are just as important in assessing intra- and interpersonal competencies. Conditions under which assessments are administered, including timing and instructions, also should be standardized to ensure fairness and comparability of scores across sessions.
Validation of Score Inferences
Evidence supporting the validity of score interpretation and use is collected throughout the test development and administration process, including through special validation studies. Content and bias reviews of test specifications, items, and forms can provide content-related evidence about the extent to which an assessment measures the intended competency and the range and depth of its construct representation, and ensure that test items are free of extraneous attributes that otherwise could constrain some students’ ability to demonstrate their competence. Pilot testing of test items often includes think-aloud protocols that can elicit evidence both that students understand the assessment questions and/or expected responses and that response processes actually reflect the intended competency.
Field testing and/or operational test administration then follows, which typically generates evidence of reliability; internal structure-related validity evidence; and, assuming adequate subgroup representation, various differential item functioning (DIF) analyses related to fairness. Special studies then need to be conducted to collect additional validity evidence related to hypothesized relationships between test scores and other variables. These studies include analyses of convergent-divergent evidence, analyses of test-criterion relationships, and studies exploring the possibility of differential prediction and/or criterion relationships for diverse student groups (see prior sections of this chapter on validity, reliability, and fairness).
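To make the reliability evidence gathered at this stage concrete, the internal-consistency coefficient most commonly reported for scale scores, Cronbach’s alpha, can be computed directly from an examinee-by-item response matrix. The sketch below is illustrative only; the responses are fabricated:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal-consistency reliability for a respondents-by-items matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 6 students answering a 4-item Likert scale (1-4)
responses = np.array([
    [4, 4, 3, 4],
    [3, 3, 3, 3],
    [2, 3, 2, 2],
    [4, 3, 4, 4],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
])
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

In practice, developers would report such coefficients per scale and per subgroup, alongside the DIF analyses described above.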
New Testing Techniques Supporting Improvement
Rigorous test development processes can help improve the quality of existing measures of the eight competencies, as can new approaches to ameliorating some of the shortcomings of self-report measures, the most common type of measures currently used to assess the eight identified competencies.
Anchoring vignettes are a technique developed to mitigate response style bias in Likert measures among individuals or groups, particularly cross-culturally (King et al., 2004). The technique requires respondents to rate one or more anchoring vignettes, which are brief descriptions of a hypothetical person or situation. Respondents then rate themselves or their own situation on the same rating scale they used to rate the anchoring vignettes. Comparison of the self-ratings with the anchored ratings is then used to create an adjusted score; both parametric and nonparametric
scoring methods are employed (King and Wand, 2007). Data from the Programme for International Student Assessment (PISA) 2012 indicate that this approach has improved the comparability of intra- and interpersonal competency scores for individuals and groups (Kyllonen and Bertling, 2013).
To illustrate the concept of anchoring vignettes, consider a vignette pertaining to the intrapersonal competency of sense of belonging. An example item reflecting sense of belonging can be found in the PISA 2012 survey (OECD, 2013b): “I feel like I belong at school.” Students report their response on the four-point Likert scale “strongly agree,” “agree,” “disagree,” and “strongly disagree.” A corresponding anchoring vignette for this item might be something like the following: “After a class lecture, Rodrigo will discuss the class with his peers comfortably and without a sense of competition. He also shows a sense of humor interacting with his peers on his intramural volleyball team. Indicate how much you agree that Rodrigo believes he belongs at school: ‘strongly agree,’ ‘agree,’ ‘disagree,’ or ‘strongly disagree.’” Based on the self-rating and the vignette rating, an anchoring vignette-adjusted score would then be computed. This score would be related to the difference between the two ratings, reflecting the degree to which respondents rated themselves higher or lower than they rated the hypothetical Rodrigo.
Anchoring vignettes such as this operate under two assumptions: the vignette equivalence assumption, that all respondents interpret the vignette in the same way (Bago d’Uva et al., 2011); and the response consistency assumption, that respondents use the same scale when evaluating themselves and the person in the vignette (Kapteyn et al., 2011).
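A minimal sketch of the nonparametric scoring approach described by King and Wand (2007) may clarify how the adjustment works. The function name and the numeric Likert coding (1 = strongly disagree through 4 = strongly agree) are illustrative assumptions, not part of the published method’s notation:

```python
def vignette_adjusted_score(self_rating, vignette_ratings):
    """Nonparametric rescaling after King and Wand (2007): locate the
    self-rating relative to the respondent's own ordered vignette ratings.
    With J vignettes the adjusted score ranges from 1 (below every
    vignette) to 2J + 1 (above every vignette)."""
    score = 1
    for rating in sorted(vignette_ratings):
        if self_rating < rating:
            return score          # below this vignette
        if self_rating == rating:
            return score + 1      # tied with this vignette
        score += 2                # strictly above this vignette
    return score

# Hypothetical coding: two students give the same raw self-rating
# (3, "agree") but anchor the scale differently through their rating
# of the "Rodrigo" vignette.
print(vignette_adjusted_score(3, [2]))  # self above the vignette -> 3
print(vignette_adjusted_score(3, [4]))  # self below the vignette -> 1
```

The two students’ identical raw ratings thus yield different adjusted scores, which is exactly the response-style correction the technique is meant to provide.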
The forced-choice, or ipsative, method requires respondents to select from among two or more alternatives, such as two statements. In a strict forced-choice format, a respondent might be asked to indicate which statement is “most like me” from the statements “I enjoy working with others” and “I set high personal standards.” Choosing one statement necessarily means not choosing the other. This can be difficult for the respondent, which is often the point—that a choice must be made despite the difficulty, and a better assessment should result. Another common approach is a more relaxed form of forced choice that entails providing four statements from which the respondent is asked to select the one “most like me,” as well as the one “least like me.” The four statements might include, for example, “I enjoy working with others,” “I set high personal standards,” “I manage to relax easily,” and “I am careful about detail” (Brown and Maydeu-Olivares, 2011).
The advantage of the forced-choice format is that respondents can be made to choose between pairs of statements that have equal levels of social desirability (i.e., equally desirable or equally undesirable), which can reduce bias due to social desirability. Although the ipsative method has been used for decades (Edwards, 1957), a traditional assumption was that the information gathered was relevant only for understanding within-person choices or relative preferences, not for understanding differences among people (Cattell, 1944). More recently, however, quasi-ipsative test designs and psychometric approaches to forced-choice measures have been developed. There now is evidence that scores on assessments using both approaches (quasi-ipsative and forced choice), as well as ranking approaches, provide comparable information about a respondent’s trait standing. Results of an experimental laboratory study suggest that Likert scale and forced-choice methods provided similar information about respondents with respect to the traits measured, although assessment scores in both formats were affected by instructed faking conditions (Heggestad et al., 2006). A meta-analysis (Salgado and Táuriz, 2014), however, indicates that across studies, forced-choice methods tended to provide stronger predictions of educational outcomes (GPA) and workforce outcomes (performance, training) relative to rating-scale methods.
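The ipsative character of classically scored forced-choice data can be seen in a simple tally: because every block contributes one +1 and one -1, each respondent’s trait scores sum to a constant, so the scores describe relative preferences within a person rather than absolute standing. The block content echoes the statements quoted above, but the trait keys are hypothetical labels added for illustration:

```python
from collections import defaultdict

# One forced-choice block: four statements, each keyed to a hypothetical
# trait label. The respondent marks one statement "most like me" and one
# "least like me."
BLOCKS = [
    {"teamwork": "I enjoy working with others",
     "standards": "I set high personal standards",
     "calm": "I manage to relax easily",
     "detail": "I am careful about detail"},
]

def score_forced_choice(blocks, choices):
    """Classical tally: +1 for each "most like me" pick, -1 for each
    "least like me" pick, summed per trait across blocks.
    choices: one (most_key, least_key) pair per block."""
    totals = defaultdict(int)
    for _block, (most, least) in zip(blocks, choices):
        totals[most] += 1
        totals[least] -= 1
    return dict(totals)

scores = score_forced_choice(BLOCKS, [("standards", "calm")])
print(scores)  # every respondent's tallied scores sum to zero
```

Recovering normative (between-person) trait estimates from exactly this kind of data is the goal of the Thurstonian item response models developed by Brown and Maydeu-Olivares (2011).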
New Methods Enabled by Technology
Advances in technology will continue to push the options for assessment. For example, automatic scoring methods based on natural language processing (NLP) techniques, currently under development, may be able to alleviate the scoring burden of performance tasks, essays, and other constructed-response items and thus enable larger-scale implementation. Beigman Klebanov and colleagues (2016) demonstrated that NLP techniques could be used to assess the utility value students perceive in biology topics. Specifically, the authors used these techniques successfully to evaluate the degree to which essays were in compliance with the following instructions:
Write an essay addressing this question and discuss the relevance of the concept or issue to your own life. Be sure to include some concrete information that was covered in this unit, explaining why this specific information is relevant to your life or useful for you. Be sure to explain how the information applies to you personally and give examples. (Harackiewicz et al., 2015)
The authors showed that essay writing features such as the use of appropriate general and genre-topic vocabulary, affect and social vocabulary, and argumentative and narrative elements were useful in measuring the degree
to which student essay writers complied with instructions to reflect on and express the value of the biology topics in their personal and social lives.
In another example from the intervention literature, Yeager and colleagues (2014, Study 4) used a technology-assisted “diligence task” to measure academic self-regulation, a behavior related to conscientiousness. Designed to mirror students’ real-world behavior when trying to complete homework in the face of digital distractions, the task gave participants the choice of completing boring single-digit subtraction problems or consuming media (watching brief viral videos or playing the video game Tetris). The software unobtrusively tracked the number of math problems completed successfully to assess academic self-regulation.
To assess female engineering students’ sense of belonging in the field, Walton and colleagues (2015) used a modified version of the Implicit Attitude Test (Yoshida et al., 2012), a computerized assessment designed to address the challenge that individuals may provide a socially desirable response rather than a true response when asked to self-report on attitudes that may be viewed socially as biased or prejudiced. Specifically, the test measured the reaction time when students were asked to associate the concept “most undergraduates at your university like” (versus “most undergraduates at your university don’t like”) with the concept “female engineers.” Participants were asked to categorize a series of words and images as quickly and as accurately as possible using keys on the left and right sides of the keyboard to indicate the category to which each word or image belonged. Students holding negative associations with most people’s evaluation of female engineers were expected to find the task more difficult, and respond more slowly, when “most people like” and “female engineers” were presented together than when “most people don’t like” and “female engineers” were presented together. Higher scores represented more positive implicit attitudes toward female engineers, suggesting a greater sense of belonging.
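Scoring of such reaction-time measures typically contrasts mean latencies across the two pairing conditions, scaled by the variability of the latencies. The following is a simplified sketch loosely modeled on implicit-association scoring; it omits the error penalties and latency trimming of operational algorithms, and the latencies are fabricated:

```python
import statistics

def iat_d_score(compatible_rt, incompatible_rt):
    """Simplified IAT-style effect size: mean latency difference between
    the two pairing conditions, divided by the standard deviation of all
    trials (loosely after the D measure used in implicit-association
    research; error handling and trimming are omitted)."""
    all_rts = list(compatible_rt) + list(incompatible_rt)
    sd = statistics.stdev(all_rts)
    return (statistics.mean(incompatible_rt)
            - statistics.mean(compatible_rt)) / sd

# Hypothetical latencies (ms): faster responses when "most people like"
# is paired with "female engineers" suggest a positive implicit association.
compatible = [620, 650, 600, 640]    # "like" + "female engineers"
incompatible = [750, 800, 770, 760]  # "don't like" + "female engineers"
d = iat_d_score(compatible, incompatible)
print(round(d, 2))  # positive -> more positive implicit attitude
```

A positive score for this hypothetical respondent would be interpreted, in the Walton and colleagues (2015) framing, as a more positive implicit attitude toward female engineers and hence a greater sense of belonging.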
A more novel, recent example of the promise of technology is Pentland’s (2008) work using unobtrusive badges that automatically track how frequently individuals speak to each other, turn toward each other, mirror each other’s gestures, and so on to assess authentic communication and teamwork. The author notes that humans express subtle interaction patterns that can be interpreted as honest signals in the form of timing, energy, and the variability of expressions. These honest signals provide clues regarding one’s degree of influence on other people, the degree to which one unconsciously mimics others, how one expresses interest and excitement in the form of activity and demonstrativeness, and consistency in speech and movement that convey focus or signal openness to influence from others. Likewise, building on the idea that certain cues (e.g., facial expressions, gestures, vocal prosody) can provide information about a person’s
intra- and interpersonal competencies, recent work in computer science is concerned with automating the evaluation or scoring of interviews based on multimodal emotion detection methods (Chen et al., 2014). Applications of such technologies are beginning to emerge, and as the technologies become more available and less expensive, they are likely to influence assessment.
There are other rapidly developing areas of assessment based on advances in technology and the concomitant advances in psychometrics that support new assessment types. These include game-based assessments (Mislevy et al., 2014) and collaborative problem-solving tasks (von Davier et al., 2017).
Use of Multiple Measures
It is axiomatic that any single measure can provide only one perspective on any given construct. Consider the construct of conscientiousness, which can be measured in a variety of ways, for a variety of purposes. Conscientiousness might be measured with a self-report rating scale (e.g., “I continue working on tasks until they are finished. Select one: strongly disagree, disagree, agree, strongly agree.”) or a peer or teacher rating (e.g., “How often does X turn in assignments on time? Select one: rarely or never, seldom, sometimes, often, always, or almost always.”). Conscientiousness also might be measured by college administrative records (e.g., available records on class attendance); with a situational judgment test (e.g., “You have a test the next day and don’t feel fully prepared. You are very tired and you are not thinking clearly. What do you do?”); or in a behavioral interview (e.g., “Tell me about a time when you had to persist on a task despite many barriers in your way.”). Or it might be measured indirectly with a performance task, such as working on an impossible puzzle (Walton et al., 2012); choosing a difficult over an easy anagram task (Gerhards and Gravert, 2015) or a difficult over an easy addition task (Alan et al., 2015); or working quickly on a tedious task, such as a picture-number lookup task (Segal, 2008).
Because each of these methods can have its benefits and drawbacks, using more than one method can increase the precision of measurement and the strength of inferences that can be drawn. A few of the intervention studies in Chapter 2 used multiple assessment methods to gather additional information about the target competency (see Appendix B). For example, Walton and colleagues (2012) assessed motivation, a behavior related to conscientiousness, using both a self-report instrument and the amount of time spent on an insoluble math puzzle. In another example, Vansteenkiste and colleagues (2004a, Study 1) assessed intrinsic interest in the intervention topic (recycling) using both a self-report, selected-response instrument and a behavioral measure of visits to the library and/or the recycling center.
Because students’ library visits were automatically recorded with a card swipe, data on this measure of intrinsic interest were readily available.
Furthermore, even within a specific method, one often can use a wide range of equally legitimate content when developing an assessment. In addition, time and other practicalities always limit the number of items that can be used for an assessment, and in general, as noted earlier, the longer the assessment, the more reliable it is likely to be. By extension, it is a truism of assessment that longer tests are associated with (but do not guarantee) higher levels of validity, a point that the Standards note as particularly critical when high-stakes decisions depend on assessment results.
Ultimately, the usefulness of any given assessment depends on the fundamental characteristics of validity, reliability, and fairness discussed above. When put to their best use, assessments are designed to provide data with which to answer one or more specific questions (e.g., What level of conscientiousness does the student have? Is that level enough to help him/her persist through challenges at college? Does the new intervention being used to increase students’ conscientiousness actually work?). The use of multiple measures of the targeted competency is likely to yield more valid answers to such questions.
Recognizing the Multilevel Context
It is important to emphasize that, beyond the factors discussed thus far in this chapter, educators and researchers need to understand and be sensitive to the factors that define the context in which assessment occurs if they are seeking to measure, understand, and intervene on key intra- and interpersonal competencies that ultimately influence student success. This systems perspective is broader than the charge to this committee and the scope of this report, and practically speaking, any research study necessarily will focus on a small portion of such a system. Nonetheless, it must be recognized that the context of intra- and interpersonal competency assessment is wide-ranging and encompasses numerous individual, group, and institutional entities operating and interacting simultaneously (e.g., diverse students and peer groups, instructors with varying roles and experience, classrooms with the potential to create and facilitate opportunities to exhibit and develop intra- and interpersonal competencies, institutions and departments that help establish both mission and culture).
Today, available statistical methods and computational power can be used to analyze assessments and context at the same time. Without providing a comprehensive review of those methods, it is worth noting that multilevel models (Gelman and Hill, 2007) provide both a conceptual and statistical basis for determining whether levels in a hierarchy are related. Referring to Figure 3-1, suppose one wanted to know whether new disci-
plinary accreditation standards calling for assessment of particular intra- and interpersonal competencies influenced a community college president, who then implemented departmental policies that affected faculty members, who in turn interacted with students who were assessed on their intra- and interpersonal competencies before and after the policy was formally enacted. In addition to answering this sort of question, multilevel models can incorporate longitudinal data (e.g., whether a construct measured among instructors at time A affected another construct among students at time B); measurement error variance (e.g., modeling relationships while accounting for the fact that psychological measures are never perfectly reliable); and different error structures (e.g., cyclical trends experienced year to year in a department or institution, or autocorrelation between neighboring events in any span of time).
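As a sketch of how such a cross-level question might be examined, the following fits a random-intercept multilevel model to simulated data, estimating a department-level policy effect on student competency scores while allowing each department its own intercept. The data, variable names, and effect sizes are invented for illustration:

```python
# Students (level 1) nested in departments (level 2); a 0/1 department-level
# policy indicator predicts a simulated competency score.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_depts, n_students = 20, 30
dept = np.repeat(np.arange(n_depts), n_students)
policy = np.repeat(rng.integers(0, 2, n_depts), n_students)
dept_effect = np.repeat(rng.normal(0, 0.5, n_depts), n_students)
score = 3.0 + 0.4 * policy + dept_effect + rng.normal(0, 1.0, dept.size)

data = pd.DataFrame({"score": score, "policy": policy, "dept": dept})
# Random-intercept model: score ~ policy, with departments as groups
model = smf.mixedlm("score ~ policy", data, groups=data["dept"]).fit()
print(model.summary())  # fixed policy effect plus department-level variance
```

The same machinery extends to the longitudinal, measurement-error, and error-structure elaborations noted above, although those require correspondingly richer model specifications.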
The overall point is that the choice of an assessment strategy needs to be sensitive to the assessment context across all levels, such as those just mentioned and depicted in Figure 3-1. In addition, this context will have a
significant influence on what range of best practices should be considered in developing, administering, scoring, and interpreting an assessment, regardless of whether it is to be used for high- or low-stakes purposes. Researchers and assessment experts in higher education are encouraged to incorporate data on context (e.g., culture, climate, department) into their analyses and interpretations of intra- and interpersonal competency assessments.
The committee reviewed the nature and quality of existing assessments of the eight competencies identified in Chapter 2, together with research and professional standards related to the overall process of developing; validating; implementing; and interpreting, evaluating, and using the results of assessments of intra- and interpersonal competencies. The test development practices used to create assessments of cognitive knowledge and skills that meet professional standards are equally applicable to intra- and interpersonal competency assessments.
The committee examined the assessments used in the intervention studies targeting the eight competencies identified above and commissioned a literature search on measurement of these competencies. Drawing on both sources, the committee also identified and closely analyzed a small sample of established assessment instruments targeting one or more of the eight competencies. Overall, the review revealed that self-report methods, with their known limitations, predominated in the assessments of the eight competencies. Analysis of the quality of the assessments used in the intervention studies revealed spotty attention to reliability and almost no reported evidence of validity or fairness. However, more evidence of assessment quality was found for some established assessment instruments used in higher education research, particularly those efforts that have received funding for assessment research and development. These instruments provide evidence on reliability and validity but lack evidence on fairness. Assessments developed by professional testing companies provide even more evidence of quality, including fairness data; however, these assessments target a wider range of competencies, only partially addressing some of the eight competencies.
Conclusion: Most current assessments of the eight identified competencies are uneven in quality, providing only limited evidence to date that they meet professional standards of reliability, validity, and fairness.
Assessments for High-Stakes Purposes
Developers of all types of assessments, whether they aim to measure cognitive, intrapersonal, or interpersonal competencies, must exercise particular care when an assessment will serve a high-stakes purpose. Assessments are considered high-stakes when their results carry serious consequences for individuals or institutions.
Conclusion: The development and validation of assessments of intra- and interpersonal competencies for high-stakes purposes is a rigorous, time-consuming, and expensive process that depends critically on expertise in assessment and psychometrics. Validity, reliability, and fairness are essential considerations in evaluating assessment quality.
RECOMMENDATION 5: When developing and validating intra- and interpersonal competency assessments to be used for high-stakes purposes, stakeholders in higher education (e.g., faculty, administrators, student services offices) should comply with professional standards, legal guidelines, and best practices to enable appropriate interpretations of the assessment results for particular uses.
RECOMMENDATION 6: Institutions of higher education should not make high-stakes decisions based solely on current assessments of the eight identified competencies, given the relatively limited research to date demonstrating their validity for predicting college success.
Assessments for Low-Stakes Purposes
Researchers and practitioners in higher education also use assessments for low-stakes purposes, such as to evaluate the quality of interventions, policies, and instructional practices or simply to monitor student change over time. When used for these low-stakes purposes, assessments need not meet the high evidentiary requirements of individual high-stakes student assessments, such as college admissions tests. Professional testing standards clearly state that the amount and type of evidence needed to support a test’s validity may vary depending on the use or interpretation of the test scores. At the same time, even when assessments are not used for high-stakes purposes, they need to be sensitive to the competencies they are intended to measure.
Conclusion: Even low-stakes uses of intra- and interpersonal competency assessments require attention to validity, reliability, and fairness, although they need not meet the high evidentiary requirements of high-stakes assessments.
RECOMMENDATION 7: Those who develop, select, or use intra- and interpersonal competency assessments should pay heed to, and collect evidence of, validity, reliability, and fairness as appropriate for the intended high-stakes or low-stakes uses.
Definition of Constructs Being Assessed
After reviewing both general principles for assessment development and use and recent research on measurement of the eight identified competencies, the committee concluded that defining each competency clearly and comprehensively is a critical first step in developing high-quality assessments. Clear definitions are especially important in light of the wide variety of terms used for these competencies. For example, conscientiousness, grit, and persistence are closely related constructs, despite being named differently. In fact, the content of items used to assess all of these constructs may be quite similar. Conversely, assessments bearing the same name may in fact measure different constructs.
Conclusion: High-quality assessment begins with a clear definition of the competency to be measured, and identifies how the assessment will be used and what kinds of inferences it will support.
Construct definitions guide assessment development and selection by making it possible to evaluate how well the assessment represents the construct it is intended to measure, thereby supporting appropriate inferences about the construct for particular uses. High-quality assessments avoid construct underrepresentation, represent the breadth and depth of the construct, and minimize any distortions caused by construct-irrelevant influences.
Innovative Methods and Technologies for Assessment
Self-report measures, such as those frequently used to assess the eight identified competencies, have several limitations. First, individuals responding to both high- and low-stakes assessments may be motivated to present themselves in a favorable light. In addition, people often express themselves on a response scale in habitual or characteristic ways, such as tending to mark the extremes (e.g., “strongly agree” or “strongly disagree”) or to agree or respond positively regardless of the question. Respondents also tend to compare themselves with those around them, such as their close
peers. This tendency can compromise the use of the responses to measure growth or to compare groups of individuals because such comparisons depend on an absolute rather than a relative standard. Because self-report measures are widely used, these limitations affect a broad swath of current intra- and interpersonal competency assessments.
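One common statistical adjustment for habitual response styles is within-person centering (sometimes called ipsatizing), which removes a respondent's overall tendency to agree or to mark extremes before making between-person comparisons. The sketch below is illustrative only; the items, scale, and respondents are hypothetical.

```python
# Sketch: adjusting Likert-scale self-reports for acquiescence bias by
# within-person centering ("ipsatizing"). Hypothetical 1-5 scale data.

def ipsatize(responses):
    """Center each respondent's ratings on that respondent's own mean,
    removing a habitual tendency to agree (or to mark extremes) from
    between-person comparisons."""
    mean = sum(responses) / len(responses)
    return [r - mean for r in responses]

# Two respondents with the same relative profile across four items but
# different habitual response styles (one agrees with nearly everything):
acquiescent = [5, 5, 4, 5]
moderate    = [3, 3, 2, 3]

print(ipsatize(acquiescent))  # [0.25, 0.25, -0.75, 0.25]
print(ipsatize(moderate))     # [0.25, 0.25, -0.75, 0.25]
```

After centering, the two response vectors coincide, illustrating how the adjustment recovers a comparable relative profile; note, however, that ipsatized scores support only relative (within-person) interpretations, which is one reason the committee emphasizes absolute standards for measuring growth.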
Recent research has identified various methods that can mitigate these limitations. Ratings by others avoid some of the problems of self-reports and have been found to yield more reliable and predictive data in many contexts. The use of forced-choice and ranking methods for collecting self-evaluations avoids response-style bias by circumventing traditional rating scales altogether. The use of anchoring vignettes also addresses response-style bias by having raters make use of detailed objective anchors, and may mitigate reference group effects as well. Other nontraditional measures include situational judgment tests as well as games or simulations, which avoid many of the documented limitations of self-ratings. Further research is needed to develop, extend, and refine these promising new approaches.
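The anchoring-vignette idea can be made concrete with a simple nonparametric rescoring: each respondent rates several vignettes (written to be objectively ordered from low to high on the competency) on the same scale as the self-report item, and the self-rating is then recoded as its rank relative to that respondent's own vignette ratings. The vignette values below are hypothetical and the function is a minimal sketch of the rescoring step only.

```python
# Sketch of nonparametric anchoring-vignette rescoring: the self-rating
# is re-expressed relative to the respondent's own ratings of objectively
# ordered vignettes, stripping out idiosyncratic use of the raw scale.

def rescore(self_rating, vignette_ratings):
    """Return how many of the respondent's vignette ratings fall at or
    below the self-rating. This rank (0..len(vignette_ratings)) is more
    comparable across respondents than the raw scale value."""
    return sum(1 for v in sorted(vignette_ratings) if v <= self_rating)

# Two respondents with the same underlying standing who use the 1-5
# scale differently; rescoring against their own anchors agrees:
print(rescore(4, [2, 3, 5]))  # lenient scale user  -> 2
print(rescore(3, [1, 2, 4]))  # stricter scale user -> 2
```

Real applications must also handle ties and vignette-order violations, which this sketch ignores; the point is that the anchors convert scale-dependent ratings into a common frame of reference.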
Conclusion: Most existing assessments of the eight identified competencies, as well as many existing assessments of other intra- and interpersonal competencies, use self-report measures, which have well-documented limitations. These limitations may constrain or preclude certain uses of the results. Innovative approaches for assessing intra- and interpersonal competencies can address these limitations.
RECOMMENDATION 8: Federal agencies and foundations should support additional research, development, and validation of new intra- and interpersonal competency assessments that address the shortcomings of existing measures.
Fairness in Assessment
The Standards make clear that fairness to all individuals for whom an assessment is intended should be a driving concern throughout assessment development, validation, and use. Assessment development should minimize construct-irrelevant characteristics that would interfere with the ability of some individuals or subgroups to show their standing on a competency or lead to individual or subgroup differences in the meaning of test scores. Whenever differences in subgroup scores are observed, follow-up research may be needed to examine the reasons, the potential sources of bias, and the comparability of score interpretations across individuals and subgroups in light of the intended uses of the assessment results. The committee applied these fairness principles in its review of current assessments of the eight identified competencies.
Conclusion: Despite the ever-increasing diversity of undergraduate student populations, attention to fairness for diverse populations is often inadequate in the development, validation, and use of current assessments of the eight identified competencies.
RECOMMENDATION 9: Researchers and practitioners in higher education should consider evidence on fairness during the development, selection, and validation of intra- and interpersonal competency assessments.
Consideration of Contextual Factors
Self-, peer, or instructor ratings of an intrapersonal competency such as conscientiousness or an interpersonal competency such as teamwork may vary depending on local norms (e.g., reference group effects). In addition, contextual variables may mediate or moderate the relationships between intra- and interpersonal competencies and educational outcomes. For example, an intervention intended to develop a sense of belonging may be effective only for disadvantaged student groups (see Chapter 2).
Conclusion: Appropriate interpretation of the results of intra- and interpersonal competency assessments requires consideration of contextual factors such as student background, college climate, and department or discipline.
RECOMMENDATION 10: Higher education researchers and assessment experts should incorporate data on context (e.g., culture, climate, discipline) into their analyses and interpretations of the results of intra- and interpersonal competency assessments.
Implementing this recommendation will require that higher education researchers use appropriate statistical analyses that incorporate data on context when examining assessment results. Such analyses include use of multilevel statistical models, measurement invariance analyses, application of differential item functioning, and mediator and moderator analyses. These analyses can enhance understanding of the complex interactions between students' individual competencies and the features of higher education contexts that together contribute to students' persistence and success. Multiple measures also can be used to minimize the possibility that inferences about a student's intra- and interpersonal competencies are due to a particular measurement approach.
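A typical first step in such a multilevel analysis is gauging how much of the score variance lies between contexts (e.g., departments or campuses) rather than between individuals, via the intraclass correlation ICC(1). The groups and scores below are hypothetical, and a real analysis would use a mixed-effects modeling package rather than this hand-rolled one-way ANOVA sketch.

```python
# Sketch: ICC(1) from a one-way ANOVA over equal-sized groups, as a
# quick check of how much score variance is attributable to context.
# Formula: (MSB - MSW) / (MSB + (n - 1) * MSW).

def icc1(groups):
    """Intraclass correlation for a list of equal-sized groups of scores."""
    k = len(groups)               # number of groups (contexts)
    n = len(groups[0])            # group size (assumed equal here)
    grand = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    msw = sum((x - m) ** 2
              for g, m in zip(groups, means)
              for x in g) / (k * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)

# Hypothetical sense-of-belonging scores clustered by department:
depts = [[4, 5, 4], [2, 3, 2], [3, 3, 4]]
print(round(icc1(depts), 2))  # -> 0.73
```

A substantial ICC, as in this toy example, signals that scores cluster strongly by context, so analyses that ignore the grouping (and interpretations that treat the scores as purely individual attributes) would be misleading; this is precisely the situation in which the multilevel models named above are needed.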