This chapter turns to possible changes to the design of FSI’s assessment of foreign language proficiency, including consideration of tasks that could be used in the assessment, the performances they can elicit, and the scores that result from the evaluation of those performances. It builds on the two previous chapters: the FSI testing context (Chapter 2) and the nature of language and language use (Chapter 3).
Scope of Committee’s Work
The chapter’s discussion of possible changes to the current FSI test is guided by the description of the considerations in language assessment development in Figure 1-1 (in Chapter 1). The presentation of possible changes is based on the assumption that an analysis of the test’s key validity claims has been carried out, guided by a principled approach. Such an analysis would look at the relationships among the components of the assessment, the way the resulting test scores are interpreted and used, the target language use domain, the sociocultural and institutional contexts of the test, and the current understanding of language proficiency (see discussion in Chapter 6).
The chapter does not present a comprehensive description of all possible methods for assessing language proficiency. For surveys of the literature, the reader can consult any number of authoritative textbooks and handbooks, including the following: for overall treatments of language assessment, Bachman and Palmer (2010), Fulcher and Davidson (2012), Green (2014), and Kunnan (2018); for assessing speaking, Luoma (2004), Taylor (2011), and Taylor and Falvey (2007); for assessing writing, Cushing-Weigle (2004), Plakans (2014), Taylor and Falvey (2007), and Weigle (2002); for assessing listening, Buck (2001) and Ockey and Wagner (2018b); for assessing reading, Alderson (2000); for assessing grammar, Purpura (2004); for assessing vocabulary, Read (2000); for assessing integrated tasks, Cumming (2013), Gebril and Plakans (2014), and Knoch and Sitajalabhorn (2013); and for assessing language for professional purposes, Knoch and Macqueen (2020).
Rather than describing all possible approaches to assessment, the committee selected the changes to discuss on the basis of its knowledge of the research in language assessment and its understanding of FSI’s context, target language use domain, and current test. Each of these possible changes is keyed to the possible goals for improvement that it might address, along with its implications for other important considerations. The committee does not suggest ways that the different possible changes might be integrated into the current test. Some of the changes are minor alterations to current testing procedures; others involve wholly new tasks or testing approaches that might complement or substitute for parts of the current test.
The discussion of each possible approach provides examples of testing programs that have implemented the approach and relevant research references, as they are available. However, it is important to note that the field of language testing often proceeds by developing approaches first and then carrying out research as experience accumulates related to the innovations. As a result, the different possible approaches discussed below reflect a range of available examples and supporting research.
The first possible change we discuss below is multiple measures, which can be understood as a meta-strategy to combine different tests with complementary strengths to produce a more complete and comprehensive assessment of an individual than is available from any single test. The possible use of multiple measures is the first change discussed because many of the other changes could be carried out alongside the tasks or approaches of the current test. Whether to think of the possible changes as complements to or substitutes for the current test is one of the choices FSI would need to consider during a program of research and development.
All the other possible changes were chosen in response to two plausible goals for improvement that might emerge from an evaluation of the complete context of the FSI test: (1) broadening the construct of the test or the test’s coverage of that construct, and (2) increasing the reliability of the test scores and the fairness of their use. It is important to note that any changes to the test will have other effects beyond these two goals. In the discussion
of possible changes, the committee considers two in particular: the effects of a test change on instruction, and practical effects related to the cost and feasibility of the testing program. In another context, these two effects could themselves be considered the primary goals for the improvement of a testing program; however, in the context of FSI’s request, the committee has taken potential instructional and practical effects as important additional considerations, not as the primary goals for improvement likely to emerge from an evaluation of the current test. Before considering the specific possible changes to the current test, we elaborate on these two goals and effects.
Goals and Effects of Possible Changes
As discussed in Chapter 3, the construct measured by a test and its alignment with the target language use domain are critical to the validity of the inferences about a test-taker’s language proficiency in that domain. A consideration of the range of different language constructs stemming from job and target language use analyses related to the specific language needs of Foreign Service officers could suggest aspects of language proficiency that are important in Foreign Service work but that are not reflected, or perhaps not sufficiently reflected, in the current test. Listening proficiency, for example, is perhaps not sufficiently reflected in the current test, and writing proficiency is not reflected at all. In addition to these clear examples, the committee’s consideration of the Foreign Service context and the FSI test indicated some other possible improvements related to the assessed construct, depending on FSI’s goals. These include assessing interactional competence in more natural social exchanges, assessing listening proficiency across a typical range of speech, and assessing the ability to use language in tasks that are similar to those required on the job. Such improvements might strengthen the assessment of aspects of the language proficiency construct that are already partly reflected in the current test.
With respect to the first goal of broadening the construct measured by the test or the test’s coverage of that construct, several possible changes are discussed below: scoring for listening comprehension on the speaking test, adding writing as a response mode for some reading or listening tasks, adding paired or group oral tests, including listening tasks with a range of language varieties and unscripted texts, incorporating language supports, adding a scenario-based assessment, and incorporating portfolios of work samples.
The second goal, increasing the reliability of the test scores and fairness of their interpretation and use, is partly reflected in FSI’s request to the committee and is a plausible goal that might emerge from an internal evaluation of the current testing program. Some considerations that might lead to such a goal are discussed in Chapter 6, relating to such factors as the criteria used in the scoring process and the consequences of the decisions based on the test. High levels of variability in the test scores could give rise to concerns among stakeholders about reliability, and differences across individuals that reflect factors other than language proficiency may suggest concerns about fairness and possible bias.1 General approaches for increasing fairness and reliability in a testing program involve standardizing aspects of the test and its administration, scoring, and use; being transparent about the test and its administration, scoring, and use; and using multiple testing tasks, administrators, formats, and occasions.
It is important to note that there can be some tension between these two goals to the extent that the richness of more natural language interactions can be difficult to standardize. Obviously, it is not productive to standardize a language proficiency test in the service of the second goal in ways that prevent the test from assessing the aspects of language proficiency that are important in the target language use domain.
With respect to the second goal of increasing the fairness and reliability of test scores, the discussion below covers the following possible changes: adding short assessment tasks administered by computer, using automated assessment of speaking, providing transparent scoring criteria, using additional scorers, and providing more detailed score reports.
The structure of a test is often a powerful signal to its users about the competencies that are valued, which can have important effects on instruction. Within the field of language studies, these effects are sometimes referred to as washback. The effects can have both positive and negative aspects: positive when the test signals appropriate goals for instruction and negative when the test encourages narrow instructional goals that fall short of the language proficiency needed to accomplish tasks in the real world. Although the committee did not consider possible changes specifically to provide positive effects on instruction, a number of the changes considered as possible ways to meet the primary goals would also likely have a positive effect on instruction; these effects are noted below where this is the case.

1 One form of bias—“unconscious bias” or “implicit bias”—concerns evidence of unconscious preferences related to race, gender, and related characteristics (Greenwald et al., 1998). For example, participants might automatically associate men with science more strongly than they do women. One can easily imagine the possible problematic effects of such associations in a language testing context, such as unintended inferences on the part of interviewers, raters, test takers, or test designers. Despite concern over unconscious bias, reviews of hundreds of studies conducted over 20 years reveal two key findings: the correlation between implicit bias and discriminatory behavior appears weaker than previously thought, and there is little evidence that changes in implicit bias relate to actual changes in behavior (Forscher et al., 2016; Oswald et al., 2013). More research is needed to clarify these findings. Regardless of the degree to which unconscious or implicit bias affects real-world employment and testing conditions, the best practices in assuring test fairness, highlighted throughout this report, remain the best defense against such effects.
Finally, any changes to the test will raise practical considerations, including effects related to cost and operational feasibility. As with instructional effects, the committee did not specifically consider possible changes to the test with a goal of decreasing its cost or maximizing the ease of its administration. However, the discussion below notes the potential practical implications for the changes suggested to meet the two primary goals.
Although cost is an obvious practical consideration, in the FSI context it is important to highlight a perhaps more fundamental one: the wide range in the number of test takers across languages, from over 1,000 annually for Spanish to fewer than 10 annually for several dozen languages, such as Finnish, Mongolian, and Tagalog. For the many languages with few test takers, there are fewer speakers who can serve as testers and test content developers, fewer resources in the language to draw on, and fewer opportunities to try out any new testing techniques with test takers. For all these reasons, in addition to the more obvious one of the associated cost, the hurdle for implementation will be much higher for any possible change that involves developing new testing material for all of FSI’s tested languages. To the extent that it is necessary to keep the structure of the test identical across all tested languages, the practical feasibility of any possible changes for the languages with few test takers could be an important constraint. Whether test structure must be held constant across all languages is discussed further in Chapter 7; in the discussion below we simply note the possible changes that may raise particularly high practical difficulties for those languages with few test takers.
Table 4-1 summarizes the possible changes to the FSI test that are discussed in the rest of this chapter, with the potential goals the committee explored that might motivate consideration of these changes and their additional instructional or practical considerations. As discussed above, the potential goals for change would need to emerge from an overall review of the current test and its context, using a principled approach to analyze how the current test could be strengthened to support the key claims about its validity. The possible changes are listed in the table in the order they are discussed in the rest of the chapter.
TABLE 4-1 Possible Changes to the FSI Test to Meet Potential Goals
| Possible Change | Potential Test Construct, Reliability, and Fairness Considerations | Potential Instructional and Practical Considerations |
| --- | --- | --- |
| Using Multiple Measures | | |
| Scoring Listening on the Speaking Test | | |
| Adding Target-Language Writing as a Response Mode for Some Reading or Listening Tasks | | |
| Adding Paired or Group Oral Tests | | |
| Using Recorded Listening Tasks That Use a Range of Language Varieties and Unscripted Texts | | |
| Incorporating Language Supports (such as dictionary and translation apps) | | |
| Adding a Scenario-Based Assessment | | |
| Incorporating Portfolios of Work Samples | | |
| Adding Computer-Administered Tests Using Short Tasks in Reading and Listening | | |
| Using Automated Assessment of Speaking | | |
| Providing Transparent Scoring Criteria | | |
| Using Additional Scorers | | |
| Providing More Detailed Score Reports | | |
The fundamental idea behind the use of multiple measures is to make decisions on the basis of results from several different assessments. By using multiple measures, a testing program can expand coverage of the construct by combining information from different sources that assess different aspects of it. Reliance on multiple sources, such as several assessments that use different response modes, can help ameliorate the effects of any particular source of error. This can help increase overall reliability, generalizability, and fairness.
The value of using multiple measures in an assessment is strongly supported in the research on testing (Chester, 2005; Kane, 2006; Koretz and Hamilton, 2006; Messick, 1989). Several of the professional testing standards explicitly call for the use of multiple measures in testing programs that are used to support high-stakes decisions (e.g., American Educational Research Association et al., 2014; National Council on Measurement in Education, 1995; also see Chapter 6). These standards are reflected in current K–12 educational policy legislation, such as the Every Student Succeeds Act. Although there are some important measurement issues that need to be addressed when combining test scores (Douglas and Mislevy, 2010; In’nami and Koizumi, 2011), there are examples of practical implementations of decision systems using multiple measures (e.g., Barnett et al., 2018).
For FSI, additional measures might be added to complement the current speaking and reading tests in response to goals that are identified from a review of the test and its use. A number of the possible changes discussed below provide examples of additional measures that could be added to the current test to produce an overall testing program using multiple measures.
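The measurement issues involved in combining scores from multiple measures can be made concrete with a brief sketch contrasting two common decision rules: a compensatory rule, in which a weighted composite lets strength on one measure offset weakness on another, and a conjunctive rule, in which every measure must meet its own cutoff. The scores, weights, and cutoffs below are purely illustrative inventions for exposition, not FSI practice; the 0-to-5 scale is only loosely analogous to ILR levels.

```python
def compensatory_decision(scores, weights, cutoff):
    """Weighted composite: strength on one measure can offset weakness on another."""
    composite = sum(s * w for s, w in zip(scores, weights))
    return composite >= cutoff

def conjunctive_decision(scores, cutoffs):
    """Every measure must independently meet its own cutoff."""
    return all(s >= c for s, c in zip(scores, cutoffs))

# A hypothetical candidate, strong in speaking but weaker in reading
# (scores ordered as speaking, reading, listening):
scores = [3.5, 2.5, 3.0]

# Compensatory: composite = 3.5*0.5 + 2.5*0.25 + 3.0*0.25 = 3.125, which passes a 3.0 cutoff.
passes_compensatory = compensatory_decision(scores, weights=[0.5, 0.25, 0.25], cutoff=3.0)

# Conjunctive: the 2.5 reading score fails its 3.0 cutoff, so the candidate does not pass.
passes_conjunctive = conjunctive_decision(scores, cutoffs=[3.0, 3.0, 3.0])
```

The same candidate passes under one rule and fails under the other, which illustrates why the choice of combination rule is among the measurement issues that would need to be addressed (Douglas and Mislevy, 2010).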
The committee’s statement of task specifically asks about the possibility of explicit assessments of listening comprehension that could be part of the FSI assessment. In reviewing the options, the committee noted that FSI could augment the scoring for the speaking part of the test to make more use of the information related to listening that it already provides. Specifically, the three tasks in the speaking test could be explicitly scored in relation to listening proficiency, with reference to the Interagency Language Roundtable (ILR) skill-level descriptions (see Chapter 3) for listening and the development of a set of listening-related scoring criteria (Van Moere, 2013). This approach might add some additional complexity to the scoring process, but it would not necessarily require much change in the tasks themselves.
The FSI speaking test is notable in using three different speaking tasks that each involve a variety of language-related skills. In this sense, the current test reflects recent research on integrated skills, which recognizes that most language tasks in the real world require multiple skills (e.g., Cumming, 2013; Knoch and Sitajalabhorn, 2013; Plakans, 2009). According to this view, the best tasks to assess a target language use domain will be those that integrate multiple skills. For FSI, for example, a Foreign Service officer participating in a formal meeting might need to make an initial presentation and then respond to follow-up questions, a use of integrated skills echoed in the “work-related exchange” portion of the FSI speaking test.
One point emphasized by the integrated skills literature is the need to consider scoring all the skills that are of interest in the target language. Until recently, the trend had been to score only the spoken or written performance and not the receptive skills involved (listening or reading) (Plakans and Gebril, 2012). Without scoring all the language skills involved, reported scores on a task that appears to reflect an authentic integration of multiple skills may provide more limited information than they could. An oral interview integrates listening and speaking: a test taker needs to comprehend the questions by using listening skills in order to answer them orally. Although interview tasks have been used to assess speaking for many decades, until recently the necessary listening skills have usually played only a small role in defining the construct and scoring the performances for these tasks.
One approach is to score the appropriateness of the response to the question, that is, the degree to which the test taker has comprehended it. This approach is used in the Center for Applied Linguistics’ BEST Plus oral language test, which scores the appropriateness of the response to each question.2 In other cases, an oral assessment may contain tasks that appear to focus more on listening than on speaking. For example, in the interview section of the FSI test, a test taker needs to understand substantial spoken language in order to then report it in English to the interviewer. In this case, a reported score could perhaps also include the skill of listening, based on the completeness or accuracy of reporting in English what was heard in the target language. An example of a rubric that captures this sort of content-responsible listening/speaking can be found in the iBT TOEFL integrated speaking task scale.3
Writing appears to be part of the target language use domain for Foreign Service officers, although it is not included in the current FSI test. As noted in the discussion of the FSI context (see Chapter 2), writing appears to have become increasingly important in recent years, with short interpersonal written exchanges in text messages, email, or social media replacing verbal exchanges that used to take place by telephone or during in-person meetings. Further analysis (using one of the methods discussed in Chapter 3) would be needed to understand how Foreign Service officers currently need to use writing. Such a review might suggest adding a writing component to better reflect job-related language competencies.
There are a variety of ways that writing could be included in the FSI test. One example might be to develop tasks that involve writing email messages in response to reading and listening texts that are other emails or voicemail messages in the target language. Such tasks could be modeled on typical email correspondence by Foreign Service officers, perhaps involving responses to email requests, emails to request information about some topic described in a reading or listening text, or emails to briefly describe some aspect of U.S. culture or current policy in response to a written inquiry. This extension of the reading or listening tasks, with writing in the target language, could be considered an integrated skill assessment. For such an addition, FSI would need to consider how performances would be appropriately scored, as noted above in relation to assessing listening with speaking.
In recent years, researchers have explored the use of paired and group oral tests as a complement to one-on-one interview-style speaking tests. Paired and group orals were created to capture test-takers’ interactional competence (Ockey and Wagner, 2018a; Roever and Kasper, 2018), allowing raters to judge whether test takers are able to comprehend the other speaker and respond appropriately, are aware of other speakers’ roles in a conversation, and are able to manage conversational turn-taking, repair conversational breakdowns, and co-construct topics.
Paired oral tests resemble naturalistic conversation, mirror pair work that is common in communicatively oriented and task-based language classrooms, and can help in the measurement of interactional competence (Ducasse and Brown, 2009). Group orals generally involve three to four candidates, with the goal of eliciting group interaction. Groups are normally given 8 to 10 minutes to discuss a given topic, task, situation, or
scenario, and thus group orals are more often used with test takers who are already conversant in the language (Winke, 2013).
These interactional skills are likely part of the Foreign Service target language use domain. However, there could be practical challenges to coordinating opportunities for paired or group oral tests for FSI, particularly for languages with few test takers, and there are potential fairness concerns raised by the variability of the pairings.
Paired and group orals provide challenges related to interlocutor variability that are not present in one-on-one interviews because peer testing partners are not working off scripts and may come to the test with different language proficiencies, as well as variations in personality, motivation, and background (Ockey, 2009). Research has found that individuals who are assertive, disagree, or take a self-centered approach to the speaking task can influence how other speakers are able to participate in the conversation (Lazaraton and Davis, 2008). Raters may then struggle to determine a score that is fair to a candidate whom they perceive to have been disadvantaged by a particular pairing (May, 2009). It is important to note, however, that pairings of candidates at different proficiency levels might not necessarily influence the resulting scores (Davis, 2009).
Paired and group oral tests have been used in a variety of high-stakes testing programs. The first high-stakes paired oral assessment was introduced in 1991 by the Cambridge English for Speakers of Other Languages Certificate of Advanced English test (Lazaraton and Davis, 2008). Since then, the majority of Cambridge tests have had a paired format (Norton, 2005). The English placement test for nonnative speakers of English at Iowa State University also uses paired oral assessments as a complement to a one-on-one oral interview. The paired task involves listening to audio recordings of two speakers providing different positions on a topic, followed by an opportunity for the test takers to summarize, discuss, and defend one of the positions. The scoring criteria include specific attention to interactional competence: connecting one’s own ideas to a partner’s ideas, expanding on a partner’s ideas, making relevant comments, taking turns appropriately, asking appropriate questions, disagreeing politely, and answering questions in an appropriate amount of time.
The Spoken English Test of the National College English Test (the “CET-SET”) in China includes a high-stakes, standardized, group-oral assessment (Zhang and Elder, 2009).4 In the test, three to four students perform individual warm-ups with the examiner and then present monologues to the group. The students have two group discussions—one on the presentations they gave and one on a new topic—with the test examiner then posing additional questions. The scoring criteria consider whether the candidates contribute actively to the group discussion.

4 This test is an optional, additional component of the College English Test (CET) taken by a small number of the approximately 10 million examinees annually who have already passed the main CET (at levels 4 or 6) at universities throughout China.
A review of the current FSI test using a principled approach would consider the extent to which the results generalize from the test situation to real-world circumstances. The current speaking test includes listening in the target language that is spoken by the tester in a relatively structured exchange, but daily exchanges will likely include a much wider variety of types of speech. Two of the most salient differences in spoken language to consider are language varieties and the scriptedness of text.
With respect to varieties, language can vary due to many factors, such as geographical region, education level, and ethnicity. As noted in Chapter 3, recent research has heightened appreciation of the many varieties of language that are used in natural settings. This factor can be particularly important with respect to listening comprehension, since spoken language reflects a set of differences, including accent, that are often not present in written language. In many contexts, the dominant or national language might be a second language for many residents of that country or region, and thus accented varieties of the target language will be part of the target language use domain for Foreign Service officers in such contexts.
Research on accents shows that multiple exposures to and familiarity with a particular accent generally lead to increased comprehension of that accent (Gass and Varonis, 1984). A proficient listener in a language should be able to comprehend multiple variants of the target language and accommodate or adapt to unfamiliar accents (Canagarajah, 2006). The research is clear that a speaker can be comprehensible even with a perceived accent (Isaacs, 2008).
With respect to scriptedness, a Foreign Service officer’s language use probably typically includes both scripted language, such as political speeches, and unscripted spoken language, such as informal conversations and interviews. Research on scriptedness shows that unscripted spoken texts differ from scripted spoken texts in a number of ways; listeners vary in their ability to comprehend unscripted spoken language, based in large part on their exposure to it (Wagner, 2018).
A review of the current test might identify the importance of assessing language proficiency with respect to a range of language varieties and with both scripted and unscripted varieties. By using recorded speech, the FSI test could include listening tasks with a typical set of varieties of the language Foreign Service officers may be exposed to and a mix of scripted and
unscripted texts. When selecting listening texts, it is important to include whatever language varieties are prevalent in the target language use domain for that particular Foreign Service setting. For example, the listening test of the TOEFL iBT includes U.S., British, New Zealand, and Australian varieties of English. Such an expanded range of listening tasks would add time and expense to the testing process. However, in addition to providing better coverage of the target language use domain in some contexts, exposing test takers to the relevant range of language varieties would also likely have beneficial effects on instruction.
In many situations, real-world language proficiency often involves the use of language supports (Oh, 2019). Traditional language supports include translation dictionaries, spelling and grammar checks, online dictionaries, and translation and interpretation apps, such as Google Translate. It is likely that a full review of the target language use domain for Foreign Service officers will show a number of ways that they incorporate language supports. Some situations, such as composing an email, allow the use of a language support while performing the task itself; other situations, such as having a conversation or making a presentation, may allow the use of a language support only beforehand, while preparing for the task.
There is considerable research relating to the value of providing supports to test takers, including research on the use of computers for test administration (e.g., Schaeffer et al., 1998) and extensive research on providing accommodations to students with disabilities (e.g., Luke and Schwartz, 2007). A general finding in this research is that a supporting tool may reduce the demand for some particular knowledge or skill on the test (such as foreign language vocabulary), while also adding a demand for knowing when and how to use the tool. As a result, it is important that test takers be familiar with a particular support and how it can be used.
It would be possible to incorporate the use of language supports into the FSI test with small modifications to the current tasks. For example, in the work-related exchange task in the current speaking test, the test taker could be allowed to use a translation dictionary during the initial preparation time to look up key vocabulary, as one would likely do in the target language use domain. In the in-depth task on the reading test, the test taker could be allowed to use a translation dictionary or an app to look up vocabulary or phrases. In both of these examples, the test administration and scoring of the current test would be substantially unchanged, but the interpretation of the result would be subtly changed to include the ability to use language supports effectively. It would also be possible to develop new tasks that allow test takers to use translation apps in ways that resemble their typical use in the real world.
As noted above, the FSI test already uses tasks that draw on several language skills and that resemble aspects of common tasks that many Foreign Service officers need to perform—for example, the need to build understandings through interaction in the target language and report those understandings in English. Assessments that use richer “scenarios” could further broaden the way that the assessment tasks reflect real-world competencies, such as writing informational reports. Scenario-based approaches to assessment are motivated by domain analyses, which show that real-life tasks are very different from those used in traditional language assessments (e.g., Griffin et al., 2012; National Research Council, 2001b; Organisation for Economic Co-operation and Development, 2003, 2016, 2018). Many language assessments give test takers a series of unrelated tasks, each focusing primarily on a single language skill, such as reading or listening, which do not reflect the complexity of language use in real-world goal-oriented activities. Foreign Service officers, for instance, may need to work collaboratively in teams to gather information, discuss an issue, propose a consensus-based solution to a problem, and share their solution with other colleagues.
Such real-life scenarios can be used as the basis for richer assessment activities that reflect the language proficiency needed to collaboratively solve problems in another language. As described above (Chapter 3), the work-related task in the current FSI speaking test gives the test taker 5 minutes to prepare an initial statement on a selected topic, which is then followed by an exchange with the tester. This task could be enriched as a scenario in numerous ways. For example, rather than having the test taker invent a hypothetical statement, the task could provide several short readings to use as the basis for the statement, requiring the test taker to build an understanding of the issue through these documents and then present and discuss it in the target language. The task could be further extended by having the test taker write an email in the target language that summarizes the key points raised during the discussion. Depending on the specific scenarios that might be relevant to the Foreign Service context, the initial readings could be in the target language, in English, or a mix. Such an enriched task could provide a demonstration of language proficiency that relates more closely to the kinds of tasks often carried out by Foreign Service officers. Adding a scenario-based assessment activity would, however, require significant changes to the test administration.
There are examples of the use of scenario-based assessment in both education and employment settings. On the education side, significant research has been carried out at the Educational Testing Service to examine new ways of improving assessments in K–12 mathematics, science, and English language arts in the United States.5 Perhaps the most well-developed scenario-based assessments in education are the international PISA tests,6 which are designed to determine whether 15-year-old students can apply understandings from school to real-life situations. PISA test takers are not asked to recall factual information, but to use what they have learned to interpret texts, explain phenomena, and solve problems using reasoning skills similar to those needed in real life. In some PISA tests the problem solving is collaborative.
An example of scenario-based assessment specifically related to language proficiency is the placement test in English as a second language being developed for the Community Language Program at Teachers College in New York City (Purpura, 2019; Purpura and Turner, 2018). The overarching goal of the intermediate module is for test takers to make an oral pitch for an educational trip abroad to a selection committee on behalf of a virtual team, a goal that requires performing a coherent series of interrelated subtasks, involving multiple language skills, on the path toward scenario completion (“pitch the trip”).
On the employment side, scenario-based approaches to assessment are often used for tests related to hiring, placement, and promotion. These approaches range from situational judgment tests to more realistic work samples and exercises that place candidates in situations that reproduce key aspects of a work setting to gauge their competence related to interpersonal skills, communication skills, problem solving, or adaptability (Pulakos and Kantrowitz, 2016). Even testing formats that do not strive for such realism, such as structured interviews, can be designed to include the use of scenarios that ask candidates what they would do or did do in a particular situation (Levashina et al., 2014). One particularly relevant example of a scenario-based, high-stakes employment test is the State Department’s own Foreign Service Officer Test, which includes a situational judgment component.
5 See the CBAL® Initiative (Cognitively Based Assessments of, for, and as Learning) at https://www.ets.org/cbal. Also see the Reading for Understanding Initiative at https://www.ets.org/research/topics/reading_for_understanding/publications.
6 PISA, the Program for International Student Assessment, is a worldwide study conducted by the Organisation for Economic Co-operation and Development. It was first administered in 2000 and has been repeated every 3 years.
The committee’s statement of task (see Chapter 1) specifically asks about the possibility of using portfolios in FSI’s test. Portfolios are often discussed in the educational literature in the context of collections of student work that are assembled to provide evidence of competency to complement or replace a formal assessment.7 Portfolios are also sometimes discussed in the context of collections of employee work (Dorsey, 2005). For FSI, either of these uses might be considered, with evidence of language-related work assembled during language instruction or on the job.
Portfolios have the potential to provide information about a broader range of language performances than can be sampled during a short formal assessment. In the case of job-related work samples requiring use of a foreign language, such information would clearly relate to the target language use domain because the work samples would be drawn from the domain. For FSI, a portfolio could be used in addition to the current test, which would be an example of using multiple measures for decision making. Portfolios may help address concerns that some test takers may be able to pass the test but not actually be able to use the target language in their work, while others may be able to use the target language in their work but not pass the test.
The weaknesses of portfolios relate to the difficulty of interpreting the information they provide about what a test taker knows and can do: they can be hard to standardize and can be affected by factors that are difficult to control, such as the circumstances under which the included performances were obtained and the extent of assistance provided to the test taker (Brown and Hudson, 1998). The use of portfolios as the basis for making high-stakes decisions about test takers has raised questions about the legitimacy of a selected portfolio as an accurate reflection of a test-taker’s ability to respond independently, the reliability and generalizability of the scores, the comparability of portfolio tasks across administrations, and unintended effects on instruction when instructional activities are designated for inclusion in a portfolio (East, 2015, 2016; Herman et al., 1993; Koretz, 1998; National Research Council, 2008). Portfolios can also be time consuming to prepare and score.
Nonetheless, portfolios have been used in a variety of educational settings related to language instruction. An example is the Council of Europe’s European Language Portfolio, which was intended for individuals
7 In the context of education, student portfolios often also include other information in addition to work samples, such as students’ self-assessment, learning history, or learning goals (Cummins and Davesne, 2009). These are likely to be important for future educational decisions, but they are not considered here because they are not relevant to making high-stakes decisions about an individual’s level of competency.
to keep a record of their language learning achievements and experiences, both formal and informal (Little, 2011). In Canada, portfolios are used as the final test in the government’s language instruction courses for immigrants and refugees, which are a required step in the immigration process (Pettis, 2014). In New Zealand, portfolios can be used in place of an oral proficiency interview at the end of term for high school students in foreign language courses (East and Scott, 2011). One example of the use of portfolios in a work context is the National Board for Professional Teaching Standards in the United States, which uses structured portfolios related to a video of a lesson as part of the process for advanced certification for K–12 classroom teachers (National Research Council, 2008).
Computer-administered tests with large numbers of short assessment tasks are widely used (Luecht and Sireci, 2011). For example, a number of tests of English for nonnative speakers assess reading and listening using multiple-choice comprehension items, such as the TOEFL iBT,8 the International English Language Testing System,9 the PTE Academic,10 and the WIDA ACCESS for K–12 students.11 The Defense Language Proficiency Tests12 use a similar approach to assess reading and listening in foreign languages based on the ILR framework. In addition to multiple-choice questions, some of these tests use other selected- or constructed-response formats, such as matching, diagram labeling, fill-in-the-blanks, sentence completion, short answer, highlighting the correct summary, selecting a missing word, and highlighting incorrect words.
Such a test might be considered as a way to broaden the current test’s coverage of Foreign Service topics. For example, the current FSI test is intended to assess a test-taker’s ability to understand and use professional-level vocabulary, discourse, and concepts in relation to a range of political, economic, and social topics relevant to the Foreign Service; however, only two or three reading texts are used for in-depth reading. A test using short assessment tasks in reading or listening could sample from a greater range of discourse and topics than the limited number of reading passages in the current test. Expanding the
11 This test is most often used as a screening test to determine the language level of students entering a school system.
12 The Defense Language Proficiency Tests are foreign language tests produced by the Defense Language Institute–Foreign Language Center and used by the U.S. Department of Defense.
breadth of coverage also has the potential to yield more information about the extent to which a test taker can understand a broad range of professional vocabulary, discourse, and concepts in the FSI target language, thus improving the reliability and generalizability of scores. However, unlike the current reading test, a test using many short assessment tasks might achieve linguistic breadth at the expense of the communicative depth that reflects real-life language use in the domain. A computer-administered test using selected-response questions would also limit the ability to probe the test-taker’s responses, in contrast to the current reading test. In addition, initial development costs for computer-administered tests would need to be considered; high development costs could affect their practicality for low-frequency languages.
In recent years, there has been growing interest in computer-adaptive tests, which reduce the time spent on questions that are clearly too easy or too difficult for the test taker, focusing instead on questions that appear to be at the border of the test-taker’s ability (e.g., Luecht and Nungester, 1998; Van der Linden and Glas, 2000; Wainer et al., 2000). The FSI test already includes adaptation to the level of difficulty appropriate to the test taker. In the speaking test, this adaptation occurs as a result of training in the certification process, as the tester modulates the level of speech that is used. In the reading test, the adaptation occurs explicitly in the choice of longer and more linguistically complex reading passages at a particular level of difficulty after the test-taker’s performance on the shorter reading passages. A computer-adaptive test could potentially implement such adaptation in a more standardized way.
The responses on computer-adaptive tests are automatically scored in real time, which allows the scores on prior questions to guide the choice of the questions that follow from a pool of possible questions. Although reading and listening potentially lend themselves to computer-adaptive testing because of the frequent use of machine-scorable questions, the approach has not been widely embraced in language proficiency testing because of the cost involved in developing and calibrating the necessary pool of items. Because of this requirement, the approach is feasible only for large-scale tests. However, this extra expense can be limited by using a “multistage” adaptive approach in which short sets of questions are administered and scored in a series of stages. Performance on one set of questions is used to select the next to administer to a given examinee. Generally, this approach reduces the size of the item pool required (e.g., Luecht et al., 2006; Yamamoto et al., 2018; Yan et al., 2014; Zenisky et al., 2010). For FSI, the small numbers of test takers for many languages may still make the development of computer-adaptive approaches impractical and prohibitively expensive.
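To make the multistage idea concrete, the routing logic can be sketched in a few lines of Python. The stage names, item counts, and cut points below are invented for illustration and do not describe any operational test.

```python
from typing import Optional

# Minimal sketch of multistage adaptive routing: performance on one short
# set of questions (a "stage") selects the next set to administer.
# Stage names, item counts, and routing cut points are hypothetical
# illustrations, not values from any operational test.
STAGES = {
    "router": {"items": 10, "routes": [(0.7, "hard"), (0.4, "medium"), (0.0, "easy")]},
    "easy":   {"items": 15, "routes": []},
    "medium": {"items": 15, "routes": []},
    "hard":   {"items": 15, "routes": []},
}

def next_stage(current: str, proportion_correct: float) -> Optional[str]:
    """Return the next stage to administer, or None when the test ends."""
    for cut, stage in STAGES[current]["routes"]:
        if proportion_correct >= cut:
            return stage
    return None
```

In this sketch, a test taker who answers 80 percent of the router items correctly would be routed to the harder set, while one who answers 50 percent would receive the medium set; because whole sets rather than individual items are selected, the item pool can be much smaller than in a fully adaptive design.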
A simpler but conceptually related approach would make use of a “two-step” process. In this approach, a screener test would be used to estimate whether test takers are likely to be at or above a threshold level of proficiency
that would enable them to achieve the desired proficiency rating of 3/3 (or higher) on the full FSI test. Test takers below this threshold would not go on to take the full test, and the score on the screener would be their official score. For expedience and cost-effectiveness, the screener test could be computer administered and consist of questions with machine-scorable response formats. Moreover, the screener could contain a machine-scorable listening component that may predict oral language performance (i.e., speaking) on the full test.
Recognizing the substantial resources currently being devoted to developing artificial intelligence (AI) techniques related to language, the committee highlights a possible change that is explicitly forward looking: the use of automated scoring for the assessment of speaking. Unlike the other changes discussed, the possibility of adopting automated scoring (or some elements of it) depends on larger breakthroughs that are being pursued by computer science researchers and developers. The intent of including this possibility on the list is to highlight the potential value of new technologies that may become available in a decade or so and to sensitize FSI to these future possibilities.
Technology-based speaking tests are currently used routinely in some large testing programs to elicit test-takers’ speech in response to recorded prompts; the recorded responses are typically rated later by two raters. The TOEFL iBT is an example: it includes four speaking prompts that are recorded and later scored by human raters. The computerized version of the ACTFL Oral Proficiency Interview (see Isbell and Winke, 2019) and the now decommissioned Computerized Oral Proficiency Test from the Center for Applied Linguistics (see Malabonga et al., 2005) are two other examples of tests, used for a range of world languages, that collect responses to recorded prompts for later scoring by human raters. Although such computer-based tests can often provide more standardized assessment tasks than face-to-face interviews, they may elicit more limited aspects of language than face-to-face interactions (Kenyon and Malabonga, 2001; Quaid, 2018). In addition, computer-based test tasks do not address other features of oral communication, such as the construction of meaning, social sensitivity, the conveyance of empathy, and turn-taking. While face-to-face interviews and computer-mediated platforms might yield statistically comparable scores with respect to basic features of language use, the different modes of testing are likely tapping different skills (Qian, 2009).
One goal of automated scoring is to use scoring engines for the recorded speech from technology-based assessments (Wang et al., 2018).
For example, the automated score might take the place of one of two human raters, reducing costs; interrater reliability could then be calculated between the automated score and the human score, with a second human rater needed only when the two do not agree.
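A minimal sketch of how such a hybrid workflow might be wired together follows; the integer score scale and the exact-agreement rule are illustrative assumptions, not features of any operational scoring engine.

```python
# Sketch of hybrid human-machine scoring: an automated score stands in
# for one of two human raters, and a second human rater is requested
# only when human and machine disagree. The integer score scale and the
# exact-agreement rule are illustrative assumptions.
def adjudicate(human_score: int, machine_score: int) -> dict:
    """Return a final score, or flag the response for a second human rater."""
    if human_score == machine_score:
        # Human and machine agree: accept the score without further review.
        return {"final": human_score, "needs_second_human": False}
    # Disagreement: withhold a final score pending a second human rating.
    return {"final": None, "needs_second_human": True}
```

In practice an operational system would likely use a tolerance band rather than exact agreement and would log the human-machine differences to monitor the scoring engine over time, but the adjudication structure would be the same.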
Some operational tests already use limited AI to produce automated scores and do not involve human raters. Pearson’s Versant (formerly owned by Ordinate and called the PhonePass Test) takes 15 minutes and is automatically scored. The automated scores are based on elicited imitation, sentence repetition, and short spoken responses: tightly controlled speaking tasks that do not involve authentic communication (Chun, 2008). Pearson’s PTE Academic is automatically scored as well: test takers read text aloud, repeat sentences, describe images, and provide brief responses to questions. We note, however, that these types of automatically scored tests have been criticized as inauthentic, as underrepresenting the speaking construct, and as failing to assess real conversation (Chun, 2008).
Despite the limitations of technology-based speaking tests and automatically scored speaking tests, there is a growing body of research on human conversation with chatbots and virtual assistants that is helping to inform and advance a set of AI technologies related to conversation (Ciechanowski et al., 2019). Fully interactive computer-human conversation is the ultimate goal of this branch of AI-based assessment of speaking: it will be achieved when a computerized AI voice can ask questions and guide the direction of the conversation based on a test-taker’s responses in real time. Several testing companies are already researching or using AI to rate recorded speech samples and limited conversations from computer-based tests (Chen et al., 2018; Ramanarayanan et al., 2020). Such ratings could form one part of a technology-based conversational system, although AI techniques cannot yet reliably score important qualities of human interaction, such as pragmatic appropriateness, collegiality, and humor (Bernstein, 2013). Future AI breakthroughs could substantially improve the capabilities of such systems, with the potential of making technology-based oral testing more interactive.
As these technologies continue to be developed, they offer the possibility of greater standardization and reduced cost in the administration and scoring of speaking, while preserving more of the elements of human conversation that are missing from current technology-based speaking tests. Thus, at some future time, such systems could be attractive for use in FSI’s test.
Because of the subjective nature of scoring extended responses, such as those elicited by the FSI test, it is important that scorers be well trained to apply the criteria laid out in the scoring rubric and that the criteria clearly
reflect the knowledge, skills, and abilities that are assessed. Rubrics make the scoring less subjective, helping scorers to reliably and fairly transform a test-taker’s performance into a score by using agreed-upon criteria. The body of research on developing effective scoring rubrics for writing and speaking is sizable (for an overview, see Van Moere, 2013).
In addition to developing scoring rubrics, testing programs need to provide scorers with extensive and ongoing training to use the rubrics consistently. Initial training related to the meaning of the different criteria included in the rubric is important, but scorers also need regular norming and recalibration to correct for drift and to ensure that their scores are consistent with those given by other scorers. There is a considerable amount of guidance for rater training procedures in language assessment (e.g., Van Moere, 2013; Weigle, 1998). Scoring rubrics and rater training procedures need to give particular attention to the scoring of different varieties of the language, which can be particularly challenging when test takers and raters may come from a range of language backgrounds.
Scoring rubrics are generally publicly available as part of the documentation provided by a high-stakes testing program (see discussion of professional testing standards in Chapter 6). To help ensure the reliability, fairness, and transparency of the scoring process used in the FSI test—as well as the perception of that reliability and fairness—FSI should consider providing more information to the test takers and users about its scoring rubrics and procedures, as well as its scorer training processes. Transparent scoring rubrics can also improve performance by better aligning teaching and learning with valued outcomes (Jonsson, 2014; Tillema et al., 2011). Providing more transparent scoring criteria could be part of an overall effort to develop a shared understanding about language assessment across all stakeholders in the State Department.
One source of variability in the FSI test relates to the tester and the examiner who administer the test. These two individuals serve both as interlocutors—to prompt and collect the language performance on the test—and as scorers of that language performance. Without adding any complexity to the administration of the test, FSI could use the video recording of the test for a separate scoring by a third independent scorer. Such a review by a third scorer is currently used by FSI whenever scores are challenged by a test taker. However, if there are concerns about the reliability or fairness of the current test procedure, a rating by a third scorer could be added as a regular feature of the FSI test. This addition would reduce the effects of any one scorer on the outcome of the test, and it would have the additional benefit of providing regular information about the consistency in the ratings
across scorers. The value of additional scorers, whether routinely or for systematic samples, can be examined quantitatively with a research study before an implementation decision is made.
Another version of this possible change could involve changes to the scoring procedure so that the FSI tester and examiner provide their scores independently. The current scoring procedure starts with the tester and examiner reaching consensus about an overall holistic score before separately developing index scores that reflect their independent evaluations of the five factors. This scoring procedure could be altered so that the tester and examiner provide scores separately before reaching consensus. An examination of the ratings awarded independently would provide information about the range of difference between the two scorers, which could be monitored to provide additional information about the reliability of the scoring process.
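Monitoring independently awarded scores requires only simple summary statistics; a minimal sketch, with invented score pairs, might look like the following.

```python
# Sketch of a monitor for independently awarded tester and examiner
# scores: summarize exact agreement and the size of disagreements.
# Score pairs used with this function would come from operational data;
# any examples shown are invented for illustration.
def agreement_summary(pairs: list) -> dict:
    """Summarize (tester_score, examiner_score) pairs on a numeric scale."""
    diffs = [abs(tester - examiner) for tester, examiner in pairs]
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / len(diffs),
        "mean_abs_difference": sum(diffs) / len(diffs),
        "max_difference": max(diffs),
    }
```

Tracked over successive administrations, a falling exact-agreement rate or a rising mean absolute difference would signal rater drift and a need for the recalibration and norming discussed above.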
One aspect of a testing program that needs to be considered in evaluating its validity is the different ways the test results are interpreted (“meaning of scores”) and then used, and the resulting consequences of those uses on everyone involved in the program. Substantial recent research demonstrates the value of providing more meaningful score reports (e.g., Hambleton and Zenisky, 2013; Zapata-Rivera, 2018).
For FSI, if there is limited understanding on the part of test takers and score users about the criteria used for scoring test-takers’ performances, additional information could be provided. For example, providing more information than a single ILR level in the score report might be useful because it would allow a more comprehensive understanding of what the scores mean. Additional information in the score report could help test takers understand the criteria that are applied across all test takers during scoring. If the review of FSI’s testing program shows any potential concerns about its fairness, additional transparency about the reasons for individual scores can help address those concerns, as well as help identify aspects of the scoring process that may need to be improved. As with the comments above about transparent scoring, the provision of more detailed score reports could be part of an overall effort to develop a shared understanding about language assessment across all stakeholders in the State Department.