7
Balancing Evaluation and the Implementation of New Approaches
Throughout this report, the committee notes a number of specific options related to language assessment that FSI may want to explore, given the research literature on language assessment. The committee was specifically asked by FSI not to provide recommendations about FSI’s language assessment program since the agency is responsible for determining its own program. Thus, in this concluding chapter the committee addresses the basic choice that FSI faces in determining how to proceed: the balance between evaluation and the implementation of new approaches. Decisions about how to set this balance will influence all aspects of the development of FSI’s assessment program.
BASIC CONSIDERATIONS
At the heart of FSI’s choice about how to strengthen its testing program lies a decision about the balance between (1) conducting an evaluation to understand how the current program is working and could be changed in light of a principled approach to assessment, and (2) beginning the implementation of new approaches. Evaluation and implementation are both necessary: evaluation of the current program without implementation of new approaches to bring improvements will have no effect, and implementation of new approaches without a full evaluation of the current test could be very harmful to the current program. However, given limited time and resources, it is important to decide the relative attention to give to each.
Through this report, the committee addresses both evaluation, through the presentation of a principled approach to assessment, and implementation, through the presentation of new approaches to assessment. For evaluation, the report stresses the importance of understanding the target language use domain as a foundation for both the design and the validation of a testing program, and it briefly describes the available techniques for developing that understanding (Chapter 3). Furthermore, the report addresses the role the understanding of language plays in undergirding a testing program and the importance of understanding its sociocultural and institutional contexts. The report also stresses the importance of evaluating the validity of the uses of the test in light of the target language use domain and key details related to the test (Chapter 6). For implementation, the report suggests possible changes to the current test that might reflect identified goals for strengthening the current test, on the basis of what FSI currently knows about the strengths and weaknesses of the test or information that would result from an evaluation of the test (Chapter 4). These arguments and discussions raise the essential question about whether to emphasize, at this time, further evaluation to better understand the test and how well it is working or initial steps toward implementing plausible changes.
For FSI’s decision about the relative attention to give to each, it will be important to consider how well FSI’s current language assessment practices address the language proficiency needed by Foreign Service officers. The committee’s limited understanding of the language proficiency needed by Foreign Service officers and the current language assessment suggests that there are certainly points of commonality. FSI’s assessment is clearly different from a language assessment that might be used in other settings—such as certifying the language abilities of medical professionals or admitting graduate students to a course of study—and the distinctive aspects of FSI’s assessment appear to reflect the language tasks of Foreign Service officers. However, the committee had insufficient evidence about the nature of the language proficiency needed and the alignment of FSI’s assessment to those abilities to draw any conclusions about how close the alignment is. The committee’s discussion of some possible changes to the current test highlights a number of ways that the coverage of the language proficiency of Foreign Service officers may be limited in capturing all important aspects of their language-related tasks, but the committee has no information about the relative importance of these omitted aspects of Foreign Service tasks.
One of the key issues for FSI to consider is whether it has sufficient information to draw firm conclusions about the degree of alignment between the aspects of language proficiency measured by the test and the aspects that affect the performance of key Foreign Service tasks. If the available information is not sufficient to draw firm conclusions, then obtaining better information about the alignment is particularly important. However, if there is already good information about the degree of alignment, then that information can help guide the consideration of changes to the current language assessment program.
The alignment between the language proficiency demonstrated by the current test and the language proficiency needed by Foreign Service officers is only one example of a key piece of evidence needed by a language assessment program to consider possible changes; it is captured in the first example validity claim discussed in Chapter 6. The other example validity claims discussed in that chapter suggest other instances of the tradeoff between evaluation and the implementation of new approaches. What is already known about the scoring process, the interpretation of the scores, and the relative benefits of the use of the scores? In each case, there could be very limited information, suggesting the importance of evaluation to improve understanding, or there could already be sufficient information to suggest that the test should be strengthened in some particular way or that there are no clear weaknesses.
One way to find a good balance between an evaluation of the current test and beginning implementation of new approaches to assessment is to consider the examples of validity evidence discussed in Chapter 6 and the best practices for testing programs recommended by the professional standards. For example:
- Does the FSI testing program have evidence related to the four example validity claims?
- Does the program incorporate the best practices recommended by the professional standards?
If the answer to either of these questions is “no,” then it makes sense to place more weight on the evaluation side, that is, to first gather evidence to better understand how the current program is working. If the answer to both of these questions is “yes,” then there is probably already sufficient information to suggest particular ways that the test could be strengthened.
SOME CONSIDERATIONS ON THE EVALUATION SIDE
To the extent that FSI chooses to emphasize the evaluation side of the evaluation-implementation tradeoff, there are a number of important considerations. The discussions in Chapters 3 and 6 point toward a number of concrete questions that FSI could usefully further investigate related to the target language use domain and the different validity claims related to the current test. In addition, the possible changes discussed in Chapter 4 could each become a topic of evaluation, as an early step toward implementation. The exploration of new testing approaches on an experimental basis allows a testing program to better understand the tradeoffs of a change before any major decision to implement those approaches for an entire assessment program.
Beyond the specific evaluation questions themselves, there are questions about the institutional structure that supports evaluation research at FSI and provides an environment that fosters continuous improvement. Many assessment programs incorporate regular input from researchers into the operation of their program. This input can include two different elements. First, technical advisory groups are often used to provide an assessment program with regular opportunities for discussion of technical options with outside researchers who become familiar with the program’s context and constraints during their service as advisors. Second, assessment programs also sometimes provide opportunities for researchers to work in-house as visiting researchers or interns to conduct research related to the program, such as conducting validity studies. Both of these routes allow assessment programs to receive new ideas from experts who come to understand the testing program and can provide tailored, useful advice. It is likely that there are constraints related to privacy and international security issues that could limit sharing data and publishing research on FSI outcomes, but it is possible that these constraints can be addressed with techniques to anonymize and share limited data for research. There also are costs associated with these activities, but many ongoing testing programs decide that these costs are outweighed by the long-term benefits of receiving regular input from outside researchers.
SOME CONSIDERATIONS ON THE IMPLEMENTATION SIDE
There are two salient constraints in the FSI testing program that are likely to strongly influence the consideration of possible new approaches. In the context of FSI’s current testing program, these constraints appear to be fixed. However, it is worth considering the possibility that these constraints may be more flexible than currently presumed.
The first constraint relates to the policy that all languages should be assessed using the same approach. The fairness concerns that provide the foundation for this policy are understandable, but the comparability of results from the testing process is what actually matters for fairness, not an identical testing procedure. If the fairness issue can be addressed, it may be possible to consider using different testing approaches across languages.
It is worth noting that FSI’s current assessment program already involves some limited variation in assessment procedures. The most prominent of these variations is the possibility for a test taker to interact with the evaluators over the phone or by video conference rather than in person. In addition, in cases where only one assessor is available in a particular language, the assessor used for the test can be the test taker’s instructor. The general finding in the literature is that both of these variations can have an effect on assessment outcomes. However, in the context of FSI’s assessment program, these variations are accepted as providing scores that are sufficiently comparable to those provided in a standard in-person assessment with an assessor who does not already know the test taker.
The reason to consider using different approaches to assess different languages is the practical implications of the number of test takers. Some assessment techniques—such as technology-mediated approaches—have relatively high development costs but relatively low administration costs per person. Thus, such techniques may be cost-effective only for relatively high-volume tests. In considering possible new language testing approaches, FSI needs to decide whether the practical limitations that might prevent the use of some approaches for the low-frequency languages should automatically disqualify their use for all languages.
The second constraint relates to the role of the Interagency Language Roundtable (ILR) framework. As FSI considers possible changes to its language assessments, it may want to consider options—such as the use of multiple measures—that may be awkward to score directly in terms of the ILR framework. However, the use of the ILR framework for coordination of personnel policies across government agencies does not need to be interpreted as a constraint requiring the use of ILR skill-level descriptions for all aspects of FSI scoring. As detailed in Chapter 5, whatever assessment approaches may be developed can always be mapped to the ILR framework for the purposes of final scoring and the determination of language proficiency.
The committee appreciates that FSI faces complicated choices about possible changes to its language proficiency testing, and the agency’s interest in exploring the many aspects of modern language testing is commendable. The committee hopes that this report’s discussion of the research in the field contributes to FSI’s forward-looking decision process.