Joan Herman, chair of the steering committee and discussant at the workshop, posed two questions: Should we assess 21st century skills? If so, do we know how to do it?
In response to the first question, she said her answer was a wholehearted “yes.” In her view, all of the workshop presentations demonstrated the importance of these skills. From Richard Murnane’s presentation, which highlighted the critical relationships between these skills and labor market outcomes, to the presentations by Nathan Kuncel, Stephen Fiore, and Rick Hoyle, speakers emphasized that these skills are needed to function well in today’s society. One after another, the presenters made a case for students to be well rounded in their abilities to think critically; solve problems; interact effectively with others; and manage their own learning, emotions, and development. To Herman, it would be a disservice to students and society at large to focus schooling solely on narrow academic content while neglecting the broader aspects of development.1 More important than simply assessing the skills, Herman noted, is integrating the assessment and teaching of 21st century skills with academic content. As she put it, “This should not be something added on to what teachers are already required to do, but should be part of their routine practice for building academic knowledge.”
1For additional discussion about breadth of instruction, see Bok (2006) and Lewis (2006).
But do we know how to assess 21st century skills? Herman’s answer to this question was that it depends on the kinds of skills. With respect to cognitive skills, Herman thinks we know how to assess problem solving embedded in content, as Kuncel argued. She noted that we also know how to develop assessments that require students to apply their knowledge, to evaluate evidence, and to perform other critical thinking and analytical reasoning tasks. There appear to be rich learning models on which to base these assessments, she added, but evaluating higher-order thinking skills has not received the attention it might have over the past few years.
With respect to some of the interpersonal and intrapersonal skills discussed at the workshop, she was somewhat more hesitant, but she said her hesitancy was in relation to the purposes and uses for the assessments, not the relative importance of the skills. She noted these days the word “assessment” has come to mean only large-scale, summative, accountability assessment, and, in her judgment, many of the measures of interpersonal and intrapersonal skills are clearly not ready to be used for this purpose. As she put it, “The long research histories in each area give rise to any number of measures for assessing individual constructs, but measures that are suitable for summative accountability purposes are few and far between.” Assessments can serve many purposes, however. For teachers, she pointed out, assessments are most useful if they provide information that can be used for formative purposes, to help make instructional decisions on a day-to-day basis. Some of the measures of interpersonal and intrapersonal skills seem to be well suited for this purpose or for purposes that involve small-scale administration.
As part of this discussion session, presenters and audience members raised a number of issues with regard to strategies for assessing 21st century skills, particularly the skills classified as interpersonal and intrapersonal. This chapter provides a synthesis of some of the main points raised by steering committee members and workshop participants and closes with a discussion of the implications for policy and strategies for moving forward.
REFLECTIONS ON ASSESSMENT STRATEGIES
Naming the Skill, Defining the Constructs
One point that arose repeatedly over the course of the workshop was the issue of labeling and defining the skills, from the name given to 21st century skills in general to the specific definitions of the constructs. Together, the collection of 21st century skills is sometimes referred to as “noncognitive” skills, a term to which several participants objected because all of the skills require some sort of cognition. These skills are sometimes referred to as “soft skills,” a term that some participants dislike because it seems to downplay their importance. Others quibbled with the term “21st century skills” because it implies the skills were not needed in the 20th century and appears not to recognize that more than a decade of the 21st century has already passed. Thus, there is an issue with terminology at the broadest level.
Concerns were also expressed about placing these skills into three clusters (cognitive, interpersonal, and intrapersonal), as the committee had done. Some workshop participants pointed out that it is misleading to imply the clusters of skills are independent and mutually exclusive. For instance, all of the skills included within the interpersonal and intrapersonal clusters require cognition. That is, it is impossible to perform skills such as collaboration, complex communication, or self-regulation without using cognition. Likewise, intrapersonal skills and interpersonal skills are interdependent. For instance, self-management skills certainly come into play when participating in a collaborative task. The committee’s classifications were useful for the purposes of structuring the workshop, but there are issues with implying that the clusters are discrete and unrelated.
At a finer level, there are also issues with defining the constructs subsumed under the three broad categories identified by the committee. Stephen Fiore addressed this in his remarks in relation to interpersonal skills, noting “there is a proliferation of concepts associated with interpersonal skills, and it is problematic because we have different labels that may be describing the same construct, and we have the same label that may be describing a different construct.” For example, with regard to interpersonal skills, terms like social competence, soft skills, social self-efficacy, and social intelligence may all be used to refer to the same skills, or they may each refer to a different set of capabilities. Likewise, in discussing intrapersonal skills, Rick Hoyle pointed out the lack of consensus in the field with regard to defining skills like self-regulation. There is little agreement among researchers, he said, and sometimes the same researcher defines it differently within a single paper.
Settling on terminology for this set of skills and definitions for the constructs needs to be done before assessments can be developed. As Hoyle described this need in relation to self-regulation, “the current state of the conceptualization of self-regulation is the primary obstacle to producing assessments of it.” Defining the skills in a clear and precise way is fundamental to development of assessment tasks and essential for ensuring that the resulting scores support the intended inferences.
Validity, Reliability, and Authenticity
Another issue highlighted by workshop participants was the extent to which assessments of these skills are trustworthy and have fidelity. This concern is essentially about reliability and validity: that is, do the assessments provide accurate results that support the intended inferences? The discussion centered on a number of issues related to reliability and validity, such as whether the assessments measure what they are intended to measure; how susceptible they are to faking; how well they capture the actual processes involved in demonstrating the skill; and how reliable they are. The summary below elaborates on these issues in relation to each cluster of skills.
With regard to skills in the cognitive cluster, such as critical thinking and problem solving, Kuncel pointed out, “We have a good understanding of these constructs when they are considered from a domain-specific perspective.” As he described, “we know what it means to think critically in certain contexts, such as when considering a physics problem or evaluating a study in cognitive psychology, and we have a good understanding of how to assess these skills from a domain-specific perspective.” The example assessments of cognitive skills presented at the workshop were all set within a context. For the PISA problem-solving test, each task specifies a context drawn from situations encountered in daily life. The Multistate Bar Exam poses critical thinking questions within the context of the situations lawyers encounter. Operation ARIES! focuses on evaluating scientific evidence, and Packet Tracer focuses on solving problems with computer networking.
According to Kuncel, the problems arise with domain-general conceptions of these skills. In his view, focusing on broad critical thinking skills, such as understanding the law of large numbers, and training students to apply these skills, is not a useful endeavor. In his work, he has found no evidence that learning these sorts of skills improves critical thinking in general or in ways that can be transferred from one domain to another. Further, he finds little evidence that a domain-general concept of critical thinking is distinct from general cognitive ability.
With regard to interpersonal skills, Fiore reminded the audience of the complexity of interpersonal interactions. Interpersonal skills involve a mix of attitudinal, behavioral, and cognitive factors, all of which are used to read the person in the context of the interaction and determine the most appropriate way to respond. Designing assessments to measure these processes is challenging. One issue that Fiore described is the fidelity of the assessment: that is, the extent to which the assessment involves observations of actual interactions and actual emotional responses to the interactions. He noted the scenario-based learning examples described by Louise Yarnall and the portfolio assessments described by Bob Lenz represent real-life interactions with authentic exchanges. With the scenario-based learning examples, the students are introduced to a problem through a real-life mechanism, such as an online letter from a manager. The students have to work in teams to address the situation and collaborate to figure out how to solve a complex work-related problem. These assessments integrate technical and social skills.
Fiore views the portfolio examples as somewhat less authentic. While the portfolios are structured collections of student work in which students have documented the application of knowledge in a particular classroom context, the evaluation of interpersonal skills is based on self-, peer, and teacher ratings. Although these ratings are drawn from actual situations the student was involved in, there is no control over the context or the nature of the interaction. For instance, the situations may or may not have involved conflict in the context of the collaborative projects. The type of communication on which the student is evaluated may differ from one student to the other. These variations interfere with both reliability and validity, Fiore commented, in that the sampling of behavior and performance included in the portfolio may not be consistent from year to year or even from student to student.
The other two examples—situational judgment tests and assessment center tasks—assess interpersonal skills in more contrived, controlled situations, Fiore said. The assessor sets up the situation to which the test taker is responding or in which the test taker is interacting. This guarantees that certain samplings of behavior are observed, but they are not as authentic as the other approaches. For instance, assessment centers obtain simulated examples of behavior; the observers see how job candidates perform in the situation simulated at the assessment center but not how the candidate performs when he or she actually encounters that situation in real life.
Fiore thinks situational judgment tests are even more removed from real-world situations in that the test taker simply chooses what he or she judges to be the best response. The candidate does not have to perform the skill or demonstrate the capability. Fiore characterizes these assessments as low in fidelity: low in enactive fidelity (the amount of true interaction that takes place) and low in affective fidelity (the extent to which the experience elicits an emotional response). He also highlighted certain problems that have arisen with these assessments. First, there is some complexity in understanding why a candidate may have responded incorrectly. To respond to the problem, the test taker has to choose the appropriate response to the situation, but he or she also has to interpret the situation. When the test taker responds incorrectly, it is impossible to discern whether he or she did not know the appropriate response or did not understand the situation. Fiore said situational judgment tests are also susceptible to faking in that test takers can make guesses about the most socially acceptable response. To address this concern, some assessments ask the test taker to choose the best and worst response, not just the best response. These issues present potential threats to the validity of the test results.
Assessment of intrapersonal skills is also challenging because of the complexity of the processes involved. Hoyle reminded audience members that intrapersonal skills involve planfulness, self-discipline, delay of gratification, dealing with distractions, and adjusting the course when things do not go as planned—all characteristics of self-regulation or, put another way, the management of goal pursuit. The examples presented involved assessments of integrity, conduct disorders/antisocial behavior, self-regulated learning, and emotional intelligence. While these are all skills involved with self-regulation, Hoyle said one of the first things to consider is whether these skills are separable from personality. For instance, with regard to integrity, is there a certain personality profile associated with people who are prone to engage in dishonest behavior, or conversely people who are likely to operate with integrity in the workplace? Similar issues were raised by Gerald Matthews with respect to the distinction between emotional intelligence and personality.
The examples included a variety of strategies for assessing these skills. For tests of integrity, the strategies include both direct measures, such as self-report in which the test taker clearly knows the purpose of the assessment, and indirect measures, where the purpose is masked from the test taker. With regard to self-report measures of integrity, Hoyle questioned their utility, asking “How useful is it to ask a person who is dishonest to tell you if they are dishonest?” Nevertheless, he pointed out, considerable evidence documents their reliability, validity, and usefulness in employee selection. It is important to remember, however, that these assessments are used to reduce the prevalence of counterproductive behaviors in the aggregate, and test takers never receive their scores or any feedback on their performance. This is an important distinction from the type of testing done in the K-12 setting, where the focus is on reporting and interpreting scores in order to improve performance.
For evaluating antisocial behaviors and conduct disorders, a single assessment strategy has been adopted by the field: the Child Behavior Checklist (or Achenbach system). In this case, there is broad consensus in the field about the characteristics of the disorder, and the construct is well defined. The checklist includes versions that allow it to be administered and scored from the point of view of the child or adolescent, the parents, or the teachers, which permits multiple sources of information in making a diagnosis. Hoyle noted it has been shown to be both valid and reliable. He highlighted Odgers’ research documenting that early identification and intervention can vastly improve outcomes for people with these disorders. Several participants also called attention to the recently skyrocketing problems with bullying in schools and noted that early identification of conduct disorders may help reduce the incidence of this behavior.
The other two examples were of assessments still used for research purposes. Hoyle found the assessments of self-regulated learning that Tim Cleary is exploring to be both intriguing and promising. The assessment strategies allow the researchers to directly observe someone engaged in the activity of learning, and one of the alternatives that Cleary discussed is having children report online as they actually proceed through the learning process. Hoyle commented on the multitude of insights that can be obtained by having children report on what they are doing before they begin an activity, while they are engaged in the activity, and then reflecting on it afterward. Preliminary work suggests these measures are predictive of course grades. With regard to the assessments of emotional intelligence, Hoyle tended to agree with Matthews that the construct is not yet well defined, and questions remain about its distinction from personality. As Hoyle put it, the measures Matthews discussed tend to be highly correlated with personality to the extent that “one wonders if one really needs separate measures of emotional intelligence or if, in fact, one is able to capture that variability in standard personality measurement.”
Fairness and Accessibility
A third issue discussed throughout the workshop was fairness. As explained in Chapter 5, in a testing context, fairness means the assessment should be designed so that test takers can demonstrate their proficiency on the targeted skill without irrelevant factors interfering with their performance. Fairness is an essential component of validity. Some of the constructs discussed during the workshop raised considerable concern about fairness and possible sources of bias. One issue alluded to previously in this chapter is whether the assessments are measuring the skills they purport to measure or are actually measuring personality traits or intelligence. To what extent is a domain-general conception of critical thinking distinct from general cognitive ability (intelligence)? To what extent are emotional intelligence and integrity distinct from personality? There is some research to help answer these questions, but it is important to be clear on what exactly is being assessed.
Related to this is the notion of trainability or malleability: that is, that proficiency on the particular skill can increase as a consequence of training and practice. To what extent can a person learn to have more integrity, to become more self-regulated, or to have better social skills? Some students may come to school better prepared to collaborate with others or to manage their own learning. This may occur as a result of family background characteristics, home environment, or other out-of-school experiences. To what extent would assessments be measuring skills that can be learned in school versus family background? There is some research on these issues as well, but as Greg Duncan, professor of education with the University of California, Irvine, noted, the findings are not definitive. Related to this issue is the notion of opportunity to learn. If these skills are indeed trainable, to what extent will all students have equal exposure to instruction in the skills? If students are expected to acquire these skills and teachers are held accountable for teaching them, instructional programs will be needed so that students have the opportunity to learn them. This issue has direct bearing on fairness and ultimately on the validity of assessments. Workshop participants noted that these issues will need to be investigated and understood before moving into wide use of assessments of these skills, particularly if the results are used to make important decisions about students.
There were also considerable concerns about the issue of construct-irrelevant variance, particularly as it relates to English language learners. Patrick Kyllonen, director of the Center for Academic and Workplace Readiness and Success at the Educational Testing Service, cited statistics that in the state of California, 25 percent of all public school students are English language learners, with the numbers increasing rapidly in other states as well (e.g., see National Research Council, 2011). For an assessment like the situational judgment test that presents a verbally dense description of a situation, language skills are critically important. For students with weak English language skills, the assessments would function as a reading test, not a measure of interpersonal skills.
IMPLICATIONS FOR POLICY
Herman posed two additional questions to the group during the discussion session. If 21st century skills were included in assessments, what would the assessment system look like? And how would we go about implementing such a system?
In responding to the first question, she returned to her point about the many types of assessments and the many ways of using the results. She highlighted the fact that throughout the workshop, participants repeatedly raised questions about the purposes of the assessments and the levels at which they would be used. In her view, the full spectrum of assessment purposes should be explored in determining ways to incorporate these skills into K-12 schooling. She said she would advocate for a system that included a variety of formative components intended both to guide instructional decision making and to enable early identification of potential problems. These might be combined with assessments used for a variety of summative purposes, including accountability for schools, teachers, and students, under the goal of ensuring students receive the exposure and engagement they need to develop the skills that are critical for college and workforce readiness.
In addressing the second question, she called for work to identify the constructs on which to focus. Throughout the workshop, a variety of skills and constructs were discussed, but as Herman put it, “we cannot do everything at once.” The initial work would be to identify the most critical skills and predispositions for students to learn, set priorities on what is most important, and then develop strategies for teaching and assessing them.
She referred to the Race to the Top (RTTT) assessment consortia2 as one vehicle for moving this work forward. She said the changes enacted through the RTTT efforts provide a timely opportunity for bringing attention to new skills. The cognitive skills of critical thinking and problem solving, she noted, are already incorporated into the common core standards. The next step would be to make sure these skills are included in the curriculum and the assessments and then to encourage focus on some of the interpersonal and intrapersonal skills.
2See http://www2.ed.gov/programs/racetothetop-assessment/index.html [May 2011].
As part of this discussion, Patrick Kyllonen commented about the idea of “consequential validity” or the social/educational consequences of having the assessment in place and making use of the test results. There are many examples, he noted, of tests inserted into testing systems, not necessarily because they will improve psychometric properties, but because of the consequences they might bring about. An example would be the inclusion of writing assessments in many standardized assessments (such as the SAT, GRE, MCAT, and LSAT) despite the fact that they may not significantly improve the predictive validity of the assessment. In this case, the notion is that including an assessment of writing, and attaching stakes to it, should bring about an increased focus on developing writing skills, both by teachers in their instruction and by potential test takers as they prepare for the assessment. Currently, in K-12 education, Kyllonen continued, accountability systems revolve almost entirely around the ability of students to take reading and math tests. Thus, one consequence of incorporating 21st century skills into the assessment or the accountability system would be to encourage teachers and students to spend more time on these skills. As characterized by one workshop participant, what is tested is taught, and what is not tested is not taught.
Herman also spoke about teacher and teaching capacity. She summarized comments from workshop participants who pointed out that the development of 21st century skills and their integration with academic content is not a regular feature of curriculum or instruction; in some school systems, there may be some focus on the cognitive skills, but this is certainly not the case for the interpersonal and intrapersonal skills. While some teachers may have experience with assigning grades for effort, attitude, and behavior, the interpersonal and intrapersonal skills discussed at the workshop go far beyond these measures. This means the teaching and assessment of 21st century skills will require changes in curriculum and teacher practices that will require a substantial amount of teacher development. As emphasis on these skills takes on new meaning, teachers would need a good deal of assistance both to understand the nature of these constructs and to learn how to develop them in their students so that all students have the opportunity to learn them. This has implications both for teacher preparation programs and for teacher inservice professional development.
Herman also called for transparency. She noted that the changes required in curriculum, instruction, teacher training, and assessment can be made more smoothly if the process is transparent. Being transparent will help teachers and students understand the skills that are being emphasized and will help the assessment developers better understand the skills that are to be measured.
Feasibility and Moving Forward
As one workshop participant pointed out, students in U.S. schools already spend considerable time taking tests. Many educators would not readily welcome the idea of adding more tests to the school day. However, this objection assumes the assessments would be something imposed on students rather than an integrated part of the curriculum. The view of the assessments endorsed by Herman and other workshop participants was that the various constructs would be incorporated into the academic curriculum so that their teaching would be an integral part of the instructional program. For instance, it is not difficult to imagine incorporating a team project into the regular science, social studies, language arts, or mathematics program. Incorporating activities in which students must solve problems, think creatively, and communicate their work to others using multiple types of media seems natural in academic settings. Adding ongoing formative assessments that help to guide instruction of these skills does not seem like a heavy burden to place on teachers and students. As John Behrens noted in describing the Packet Tracer, the system relies on “stealth assessments”; often students do not even realize they are being tested.
At the same time, other workshop participants stressed it is important not to lose sight of the need to ensure that students in the United States learn the basic academics. As Paul Sackett put it, “If we were at a different conference, we would be spending time lamenting the fact that students in the U.S. are not up to par on some fundamental academic skills.” Likewise, Deirdre Knapp noted that not all 21st century skills are equal: some are clearly more important for students to learn than others, and we are further along in knowing how to assess some skills than others. Thus, it is critical to set priorities for where and how to spend the limited time, money, and resources.
Kyllonen also emphasized the importance of considering the cost tradeoffs. He noted the various examples of assessments included some “ingenious low-cost assessments and some dazzling high-cost assessments.” He encouraged work to study the differences in order to figure out where high-cost investment is cost-effective and where it might not make a difference. He and others pointed to examples other than those presented at the workshop that might be important resources and models. For instance, Herman mentioned the work that David Conley, with the Educational Policy Improvement Center (EPIC), has been doing to identify critical components of college and career readiness, as well as similar efforts by the National Assessment of Educational Progress (NAEP) to focus the 12th grade assessment on these skills. Kyllonen also spoke of the exams used to assess critical-thinking skills at the college level, such as the Collegiate Learning Assessment (CLA), the ACT CAAP Test, and the ETS Proficiency Profile Test. They are all operational programs, he pointed out, that may serve as models. Knapp noted the work the military has been doing to evaluate temperament, persistence, and stamina. Others commented that while the Envision High School was featured at the workshop, a number of such high schools throughout the country are working to incorporate instruction and assessment of 21st century skills into the curriculum in innovative ways.
Defining the overall purpose of the assessments was an issue raised repeatedly in deciding on a path for moving forward. Sackett framed the issue as deciding between a focus on individual results or group-level results. He asked, “Do we want students to leave school with an individualized certificate that documents their level of competence in each skill? Or do we want to document how the nation is doing in aggregate?” He cautioned that obtaining precise and reliable assessment at the individual level is difficult, costly, and time consuming. On the other hand, Steve Wise questioned how best to address the different aspirations that students have. While there is currently a heavy emphasis on ensuring all students pursue higher education, in reality, that is not likely to occur. Students have different goals. Do we design a system that is a “one size fits all plan,” he asked, do we focus on minimal competency across the board, or do we design a system that attends to the specific needs of the individual?
Several workshop participants spoke of the types of research needed in order to move forward with assessments of these skills. Deirdre Knapp pointed out that many assessments are “pushing the envelope” as far as psychometric capabilities are concerned. For example, how does one evaluate the reliability of assessments such as those used by Art Graesser’s AutoTutor? Greg Duncan called for research in two areas. First, he noted, if we are to relate these skills to training in school, we need to know what it takes to change these skills. That is, how malleable are they and what is involved in improving them? Second, he called for more in-depth study of the predictive power of the various skills, noting that what is needed is not simply correlations among the variables but well-controlled analyses to demonstrate that improvement in these skills results in improvement in academic and labor market outcomes. Finally, Juan Sanchez, professor of management and international business at Florida International University, called for increased levels of cross-disciplinary efforts, stressing that successfully tackling these issues will require the collaboration of expertise from many disciplines including measurement, cognitive psychology, and information technology.