Stakes attached to large-scale assessment results heighten the need for reliable and valid scores, particularly scores that are resistant to faking. Cost and feasibility are also dominant issues for large-scale assessments. Each of the instrument types has limitations relative to these criteria. Self-report, social network analysis, and situational judgment tests, which can provide relatively efficient, reliable, and cost-effective measures, are all subject to social desirability bias, the tendency to give socially desirable and socially rewarded rather than honest responses to assessment items or tasks. While careful design can help to minimize or correct for social desirability bias, the bias would likely be heightened if any of these three types of assessment instruments were used for high-stakes educational testing.
Behavioral ratings, in contrast, present challenges in ensuring reliability and cost feasibility. For example, if students’ interpersonal skills are assessed based on self, peer, or teacher ratings of students’ presentations of portfolios of their past work (including work as part of a team), a number of factors may limit the reliability and validity of the scores. These include differences in the nature of the interactions reflected in the portfolios for different students or at different times; differences in raters’ application of the scoring rubric; and differences in the groups with whom individual students have interacted. This lack of uniformity in the sample of interpersonal skills included in the portfolio poses a threat to both validity and reliability (National Research Council, 2011a). Dealing with these threats takes additional time and money beyond that required simply for delivering and scoring student presentations.
Collaborative problem-solving tasks currently under development by PISA offer one of the few examples today of a direct, large-scale assessment targeting social and collaboration competencies; other prototypes are under development by the ATC21S project and by the military. The quality and practical feasibility of these measures are not yet fully documented. However, like many of the promising cognitive measures, they rely on the capacity of technology to engage students in interaction, to simulate others with whom students can interact, to track students’ ongoing responses, and to draw inferences from those responses.
In summary, there are a variety of constructs and definitions of cognitive, intrapersonal, and interpersonal competencies and a paucity of high-quality measures for assessing them. All of the examples discussed above are measures of maximum performance rather than of typical performance (see Cronbach, 1970). They measure what students can do rather than what they are likely to do in a given situation or class of situations. While