Building on the discussion in Chapter 4 of possible changes to the current FSI test and its scoring that might be motivated by a principled-approach review of the test, this chapter considers how FSI test scores are interpreted. A key element of this consideration is the role played by the skill-level descriptions of the Interagency Language Roundtable (ILR) framework.
FSI and many other government language testing programs use the skill-level descriptions of the ILR framework to understand language proficiency across all levels for all languages. Because the descriptions are used in so many different ways as a foundation for government language testing programs, it can sometimes be difficult in the government context to see the distinction between different aspects of assessment programs as shown in Figure 1-1 (in Chapter 1).
In government testing programs, the ILR framework is used as a substitute for a detailed description of the target language use domain of interest for a specific test. However, as discussed in Chapter 2, a full understanding of the target language use domain for any specific government language use requires more domain-specific detail than is included in the ILR skill-level descriptions.
For example, Foreign Service officers need to use the target language to engage in social conversation and write informal email messages, in addition to understanding formal presentations, which are different language
uses from those that an analyst in one of the intelligence services might need. It is important for the FSI testing program to develop a detailed understanding of language use that is specific for Foreign Service officers. Since the ILR framework describes language proficiency broadly, its descriptions are not sufficiently detailed to design a test for a specific purpose and build a solid validity argument for that test (see Chapter 2).
The use of the ILR framework can also obscure the distinctions between the ILR level scores awarded on the FSI test, their interpretation, and their use to make decisions. Because the FSI test is scored in terms of the ILR skill-level descriptions, which have defined interpretations and known uses in making decisions for Foreign Service officers, it can appear that there is no distinction between the score awarded on the test and its interpretation and subsequent use. Yet scoring, interpretation, and use are distinct, as shown graphically in Figure 1-1 (in Chapter 1):
- The score on the test reflects an evaluation of a specific test-taker’s performance on specific test tasks based on a set of skill-level descriptions.
- The interpretation of the score involves a generalization from the language proficiency elicited in the test and evaluated through the descriptions to the test-taker’s proficiency in the real world.
- The uses that flow from score interpretation involve decisions that reflect the adequacy of the test-taker’s inferred language proficiency to function meaningfully and appropriately in the target language use domain.
Fundamentally, the ILR framework provides a way for many government testing programs to interpret a test score in terms of what the government considers general or functional language proficiency and to link that interpretation to a set of personnel decisions, with related consequences. The ILR framework makes it possible to discuss personnel policies related to assessment of employees’ language proficiency in common terms across government agencies. As described in Chapter 3, most language-designated positions for Foreign Service officers are specified as requiring certification at the ILR level 3 in both speaking and reading. That certification is a requirement for long-term retention in the Foreign Service and is linked to incentive pay. The corresponding personnel policies of other government agencies with assessment of employees’ language proficiency are similarly described with respect to the levels defined within the ILR framework.
However, as a widely used framework across the government, the ILR framework cannot fully specify the necessary assessment details that are specific to FSI’s context and purpose. The importance of these details is highlighted in Figure 1-1 by the ring related to understanding of sociocultural and institutional contexts. For example, the ILR framework does not incorporate the details about professional-level Foreign Service vocabulary that are reflected in the assessment tasks and topics used in the FSI test and the underlying scoring process used to evaluate performances on those specific tasks. Similarly, although the ILR framework provides examples of different levels of language proficiency, it does not reflect the critical language uses for which a test-taker’s language proficiency is being inferred. Finally, although the ILR framework is used in Foreign Service personnel policies that affect retention and pay decisions that can be compared across government agencies, it does not specify the kinds of mission-critical consequences that could arise for Foreign Service officers in the field who do not have adequate language proficiency for their positions (see Box 3-1 in Chapter 3).
As explained in Chapter 1, a principled approach to test development will rest on a detailed understanding of the target language use domain, tasks that elicit performances that reflect key aspects of the domain, clear rules for scoring those test performances, and interpretations of those scores that lead to inferences about language proficiency in the domain. In the current FSI testing program, each of these aspects is described in terms of the ILR framework—rather than the target language use domain for Foreign Service officers. As a result, the entire testing program is geared toward producing a result that can be compared with other government testing programs based on the ILR framework.
A shift in focus to the target language use domain has the potential to strengthen the FSI test. However, this shift would mean that many aspects of the assessment would rest on the target language use domain in the Foreign Service, which may not be specifically addressed in the ILR framework. With such an approach, the ILR framework could retain its essential role in helping coordinate personnel policies across government agencies that assess employees’ language proficiency, but FSI’s testing program would not necessarily be defined solely in terms of the ILR framework. The testing program could use a more detailed and specific understanding of the target language use domain in the Foreign Service as the basis for designing tasks, scoring test-taker performances on those tasks, and interpreting those performances with respect to the required language proficiency of Foreign Service officers.
In some ways, the FSI testing program already elaborates its understanding of the ILR framework to consider the target language use domain for the Foreign Service, especially with respect to the specific tasks in the speaking test, which is different from the more common oral proficiency
interview used in other government agencies. However, explicitly acknowledging that the understanding of the target language use domain is driving the test raises the possibility of providing numerical scores for the test that are not directly described in terms of the ILR skill-level descriptions. In this approach, it may be necessary to map the resulting test scores to the ILR framework to link to the common personnel policies across government agencies.
For example, suppose FSI decided to augment its current speaking test with a technology-based test using many short listening tasks (see Chapter 4) that are scored correct or incorrect. This new test might result in a score continuum of 0 to 60 points, bearing no relation to the levels of the ILR framework. The results of this new test would need to be combined in some way with the results of the current speaking test to produce an aggregate result. One way to do this might be to simply add the score from the new technology-based test to the 120-point “index scale” that is produced (though not specifically used or reported) during the scoring process of the current speaking test. Or the combination could reflect different weights or thresholds, depending on the meaning of the two different scales. For either approach, or any other, the resulting aggregate numerical score would still need to be mapped to the levels of the ILR framework.
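To make the two combination options concrete, the following minimal sketch illustrates them using the hypothetical 0–60 and 0–120 scales mentioned above. The function names, the particular weights, and the rescaling to a common 0–100 scale are illustrative assumptions, not FSI procedure:

```python
# Illustrative only: the weights (0.4/0.6) and the rescaling to a common
# 0-100 scale are assumptions, not an actual FSI scoring rule.

def aggregate_score(listening: float, speaking_index: float,
                    w_listening: float = 0.4, w_speaking: float = 0.6) -> float:
    """Weighted combination on a common 0-100 scale."""
    listening_pct = listening / 60 * 100        # rescale 0-60 to 0-100
    speaking_pct = speaking_index / 120 * 100   # rescale 0-120 to 0-100
    return w_listening * listening_pct + w_speaking * speaking_pct

def simple_sum(listening: float, speaking_index: float) -> float:
    """The simpler option: add the raw scores, yielding a 0-180 scale."""
    return listening + speaking_index

print(round(aggregate_score(45, 90), 2))  # → 75.0
print(simple_sum(45, 90))                 # → 135
```

Either aggregate would still be an arbitrary numerical scale; the mapping to ILR levels described next would be needed before the score could feed into personnel decisions.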
There are well-developed procedures for carrying out such mappings and providing evidence in support of interpreting performances on an assessment in terms of an external set of proficiency levels (such as the ILR skill-level descriptions). One way of performing the mapping is by a “contrasting groups” (or “examinee-centered” or “empirical”) approach, in which test takers with known ILR level scores from the current test would be given the new test as well (see, e.g., Livingston and Zieky, 1982; see also, e.g., Cizek, 2012; Cizek and Bunch, 2007; Hambleton and Pitoniak, 2006). By having a set of test takers take both tests, it would be possible to understand the relationship between the numerical scores on the new test and the ILR level scores from the current test. This information could then be used to map the numerical scores on the new test to the ILR skill-level descriptions for policy purposes—such as personnel decisions—in a way that would produce similar numbers of examinees reaching those levels as the current test.
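The distribution-preserving logic of such an examinee-centered mapping can be sketched as follows. All scores, levels, and the quantile rule below are invented for illustration; operational contrasting-groups studies involve considerably more statistical care (see the standard-setting references cited above):

```python
# Hypothetical sketch: choose cut scores on a new 0-60 test so that roughly
# the same proportion of examinees reaches each ILR level as on the current
# test. The paired data below are invented for illustration.

def distribution_matching_cuts(new_scores, ilr_levels, levels=(2, 3)):
    """Return {ILR level: cut score on the new test} chosen so the share of
    examinees at or above each cut matches the current-test distribution."""
    n = len(new_scores)
    ranked = sorted(new_scores)
    cuts = {}
    for level in levels:
        # Share of examinees the current test places at or above this level.
        share = sum(1 for lvl in ilr_levels if lvl >= level) / n
        # The cut sits at the matching quantile of the new-test scores.
        idx = max(0, min(n - 1, round((1 - share) * n)))
        cuts[level] = ranked[idx]
    return cuts

# Eight examinees who took both tests: current ILR levels and new-test scores.
ilr = [1, 2, 2, 2, 3, 3, 3, 4]
new = [20, 25, 30, 35, 40, 45, 50, 55]
print(distribution_matching_cuts(new, ilr))  # → {2: 25, 3: 40}
```

With these invented data, seven of eight examinees score at or above the level-2 cut and four of eight at or above the level-3 cut, mirroring the current test's outcomes.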
Another way of mapping from the scores of the new test to the ILR skill-level descriptions would be by using standard-setting (or “test-centered”) processes, which use groups of qualified panelists to go through a standardized procedure with a test’s tasks to define one or more cut scores between different levels of performance (see, e.g., Cizek, 2012; Cizek and Bunch, 2007; Hambleton and Pitoniak, 2006). The Council of Europe’s manual for relating language examinations to the CEFR provides detailed guidance on such standard-setting procedures.
The widely used Test of English as a Foreign Language (TOEFL) provides an example of using a “test-centered” mapping method. TOEFL has its own score scale, which test users have long used to make decisions about test takers. A major use of TOEFL iBT is to determine English-language readiness for pursuing a course of study at an English-language university. Schools, for example, may require a TOEFL iBT score of 85 out of the 120 total for international students to meet a university admission criterion related to language proficiency, though each college or university can set its own cut score for admission. However, with the increasing use of the CEFR as a framework for describing language proficiency, a number of English-speaking universities outside North America wanted to define their language proficiency admission criteria in terms of the six CEFR levels. In response, the Educational Testing Service (ETS) conducted a study to map performances of the TOEFL iBT onto the defined proficiency levels described by the CEFR (Tannenbaum and Wylie, 2008). The study was done “[f]or test users and decision makers who wished to interpret TOEFL iBT test scores in terms of the CEFR levels in order to inform their decisions” (Papageorgiou et al., 2015, p. 2). Based on the study, ETS established boundary scores in terms of scale scores on the TOEFL iBT that could be interpreted as entry scores into CEFR proficiency levels.
Although “test-centered” standard-setting processes would provide evidence for the correspondence between the meaning of the scores on the new test and the ILR skill-level descriptions, the procedures will not ensure that the new test produces roughly similar numbers of examinees achieving those ILR levels. If it is important for FSI that a new test maintain roughly similar distributions of outcomes in terms of ILR level scores, then the mapping should be carried out using an “examinee-centered” approach.
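One widely used test-centered procedure, the modified Angoff method, can be sketched in a few lines. Here each panelist estimates, for every item, the probability that a borderline examinee at a given level would answer correctly, and the cut score is the sum of the mean item ratings. The panel ratings below are invented, and real Angoff studies add training, feedback rounds, and impact data:

```python
# Hypothetical modified-Angoff sketch. Ratings are invented: ratings[p][i] is
# panelist p's estimated probability that a borderline examinee at the target
# level answers item i correctly.

def angoff_cut_score(ratings):
    """Cut score = expected raw score of a borderline examinee,
    i.e., the sum over items of the panel's mean probability estimates."""
    n_items = len(ratings[0])
    item_means = [sum(panelist[i] for panelist in ratings) / len(ratings)
                  for i in range(n_items)]
    return sum(item_means)

panel = [
    [0.9, 0.7, 0.5, 0.4],  # panelist 1's estimates for 4 items
    [0.8, 0.6, 0.6, 0.2],  # panelist 2
    [1.0, 0.8, 0.4, 0.3],  # panelist 3
]
print(round(angoff_cut_score(panel), 2))  # → 2.4
```

Note that nothing in this procedure references the score distribution of actual examinees, which is why, as discussed above, a test-centered cut score need not reproduce the current test's pass rates.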
Fundamentally, the ILR framework provides a way for multiple government testing programs to interpret a test score in terms of a common government understanding of language proficiency and to link that interpretation to a set of personnel decisions. As discussed above, however, although the ILR framework defines some of the context for the FSI testing program and the interpretation of the scores it produces, it cannot provide the full level of detail needed to design and validate a test for FSI. A principled approach to test development for the FSI testing program will rest on a detailed understanding of language, the target language use domain, and the sociocultural and institutional contexts of the Foreign Service; assessment tasks that elicit performances that reflect key aspects of the target language use domain; scoring that fairly and reliably evaluates those test performances; and interpretations of those scores that lead to appropriate inferences about language proficiency in the target language use domain for the purpose of making decisions about Foreign Service officers.

1 CEFR, the Common European Framework of Reference for Languages: Learning, Teaching, Assessment, is a guideline used to describe achievements of learners of foreign languages, principally in Europe.
As FSI uses principled approaches to understand its current test and consider possible changes to it, it may be worth considering approaches to scoring based on a scale score that is not so directly linked to the skill-level descriptions of the ILR framework. It would still be possible to maintain the framework’s role in coordinating language proficiency personnel policies across government agencies by mapping a new FSI test score to the skill-level descriptions. A variety of techniques for setting cut scores can be used to perform such a mapping.