Committee Conclusion: The ability to use judgment to interpret, evaluate, and weigh alternate courses of action appropriately and effectively is relevant to a wide variety of situations within the military. Various streams of research, including new conceptual and measurement developments in assessing situational judgment, as well as evidence of consistent incremental validity of situational judgment measures over cognitive ability and personality measures for predicting performance in various work settings, lead the committee to conclude that measures of situational judgment merit inclusion in a program of basic research with the long-term goal of improving the Army’s enlisted accession system.
Situational judgment tests (SJTs) are psychological measures that present test takers with hypothetical situations, often reflecting constructs that are interpersonal (e.g., communication, teamwork), intrapersonal (e.g., emotional stability, adaptability), or intellectual (e.g., technical knowledge, continuous learning) in nature. A sample SJT question dealing with squad leadership follows (Hanson and Borman, 1995; for another military SJT measure, see Tucker et al., 2010):
You are a squad leader on a field exercise, and your squad is ready to bed down for the night. The tent has not been put up yet, and nobody in the squad wants to put up the tent. They all know that it would be the best place to sleep since it may rain, but they are tired and just want to go to bed. What should you do?
- Tell them that the first four men to volunteer to put up the tent will get light duty tomorrow.
- Make the squad sleep without tents.
- Tell them that they will all work together and put up the tent.
- Explain that you are sympathetic with their fatigue, but the tent must be put up before they bed down.
There are multiple ways to answer an SJT (e.g., pick the best/worst; respond to each option on a 1-5 scale of effectiveness), and furthermore, there is more than one way to score an SJT (two scoring options, among others, are agreement with subject matter expert responses and agreement with the consensus response). SJTs are historically and substantively related to tests of practical intelligence, tacit knowledge, and other tests that ask respondents about solving hypothetical problems one might face in the real world. In fact, the specific test items for measures of practical intelligence look a lot like SJT items (e.g., the Wagner and Sternberg, 1991, measure of the practical intelligence of managers).1 (For a useful overview of critical SJT characteristics, such as constructs assessed, situational content, scoring methods, instructions and testing medium, see Campion et al., 2014.)
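The two scoring options mentioned above can be sketched briefly in code. The following is a minimal, hypothetical illustration (not an operational scoring procedure); all data, names, and keys are invented for the example:

```python
import numpy as np

# Hypothetical data: rows = examinees, columns = SJT items,
# entries = the index of the response option each examinee chose.
responses = np.array([
    [0, 2, 1, 3],
    [0, 1, 1, 3],
    [2, 2, 1, 0],
])

# Scoring option 1: agreement with a subject-matter-expert (SME) key.
sme_key = np.array([0, 2, 1, 3])
sme_scores = (responses == sme_key).sum(axis=1)

# Scoring option 2: agreement with the consensus (modal) response per item.
def modal_key(resp):
    # The most frequently chosen option for each item (column).
    return np.array([np.bincount(col).argmax() for col in resp.T])

consensus_scores = (responses == modal_key(responses)).sum(axis=1)
```

In practice the two keys can disagree; in this toy example the modal responses happen to match the SME key, so the two methods yield identical scores.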
Since its development for use in personnel selection decades ago (e.g., Motowidlo et al., 1990), testing situational judgment has remained a viable method for assessing psychological constructs due to its consistent, criterion-related validity across a variety of work settings. But an SJT is a method of measurement, not a construct (see Arthur and Villado, 2008), and in fact, a variety of constructs have been (and can be) measured with an SJT. Existing SJTs predict task performance and other outcomes where cognitive ability is required (McDaniel et al., 2001); they also have shown validity for organizational citizenship behavior and personality-relevant outcomes where cognitive ability is not a strong requirement (Christian et al., 2010).
SJT items have successfully measured the constructs of job knowledge, interpersonal and team skills, leadership, and personality; meta-analyses of these types of SJTs have found high validities for a variety of types of performance that are technical and interpersonal in nature. Building on early SJT research that focused on constructs such as social and practical intelligence (see Whetzel and McDaniel, 2009), more modern SJTs frequently target interpersonal skills and competencies that are presumably difficult to measure with traditional tests of personality or attitudes. Examples of interpersonal constructs measured by SJTs include organization, maturity, and respectfulness (Weekley and Jones, 1997), work commitment and work quality (Chan and Schmitt, 1997), conflict management and resolution (Chan and Schmitt, 1997; Olson-Buchanan et al., 1998; Richman-Hirsch et al., 2000), leadership (Bergman et al., 2006), communication skills (Lievens and Sackett, 2006), integrity (Becker, 2005), and team orientation (Mumford et al., 2008; Weekley and Jones, 1997).

1 Of note, practical intelligence measures tend not to be useful if they are redundant with cognitive measures (Rumsey and Arabian, 2014).
Even though SJTs have a history of being developed around specific constructs intended to predict specific types of job performance, sometimes they predict unintended outcomes as well as or better than the intended outcomes, as was found in a recent meta-analysis (Christian et al., 2010). This meta-analysis found that an interpersonal skills SJT predicted the technical aspects of job performance just as well as the interpersonal (contextual) aspects of job performance (meta-analytic correlations of r = .25 and .21, respectively). This finding of equivalent levels of prediction for unintended outcomes might reflect a need to refine the outcome measures as much as, or more than, a need to refine the SJTs themselves.
There is a potentially more promising approach to designing SJTs such that they predict intended outcomes more strongly than unintended outcomes. This approach involves designing an SJT so that the possible responses to a given situation each reflect different constructs. Under this design, all of the SJT responses, across all situations, can be analyzed using a multitrait-multimethod framework (where traits = SJT responses, and methods = SJT situations/stems in which the responses are nested). A recent SJT designed in this manner successfully partitioned the variance of SJT items between situations and three constructs that measured the tendencies to approach new goals, to avoid new goals, or to treat new goals as achievements that others will evaluate (Westring et al., 2009). This innovative format could help SJTs measure other constructs better, both conceptually and psychometrically. It also improves understanding of the nature of the situation being tested by a particular SJT.
In a related innovative approach, existing Multidimensional Item Response Theory models can be applied to situational judgment measures that are designed in a similar manner. A concrete example, similar in design to the previous goal-orientation example, would be an SJT that asks examinees how they would respond to 12 performance situations, where each situation is followed by a set of four possible responses (items), with each response in a set reflecting one of four different constructs: (a) work commitment, (b) work quality, (c) conflict management, and (d) empathy (Chan and Schmitt, 1997). Examinees would pick the best and worst response or rank the responses from most to least effective. This SJT format is highly compatible with Item Response Theory (IRT) methods for investigating patterns of correlations between the four constructs (e.g., Brown and Maydeu-Olivares, 2011, 2013; de la Torre et al., 2012).
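Rankings of this kind are typically recoded into binary pairwise comparisons before being fit with forced-choice IRT models (e.g., Brown and Maydeu-Olivares, 2011). A minimal sketch of that recoding, with hypothetical construct labels and a single illustrative examinee:

```python
from itertools import combinations

# Hypothetical construct labels for the four response options in a block.
constructs = ["commitment", "quality", "conflict_mgmt", "empathy"]

# One illustrative examinee's ranking, most to least effective.
ranking = ["quality", "commitment", "conflict_mgmt", "empathy"]

def to_pairwise(ranking, constructs):
    # 1 if the first construct's option was ranked above the second's.
    rank = {c: ranking.index(c) for c in constructs}
    return {(a, b): int(rank[a] < rank[b])
            for a, b in combinations(constructs, 2)}

pairs = to_pairwise(ranking, constructs)  # 6 binary pseudo-items
```

Each of the six pairwise outcomes then serves as a binary pseudo-item in the IRT model, which is what allows the patterns of correlations among the four constructs to be investigated.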
Testing situational judgment has maintained its decades-long prominence in the research and practice of employment testing because, even though the test formats and the constructs they assess vary widely, SJTs have demonstrated persistent incremental validity above cognitive ability and personality measures when predicting job performance of either a technical or interpersonal nature. Thus, even though particular SJTs are known to be correlated with traditional measures of cognitive ability and personality, as previously discussed, SJTs assess compound knowledge, skills, and abilities that often do not fall clearly or cleanly in either category (e.g., time management skills, leadership behaviors). SJTs have been shown to predict above and beyond measures of cognitive ability and personality when administered in job applicant settings (Chan and Schmitt, 2002; Clevenger et al., 2001) and in academic settings relevant to college admissions (Oswald et al., 2004). These increments are often modest (Peterson et al., 1999), and they depend critically on the type of SJT, the nature of the criteria being predicted, and the other types of predictors being administered. Nonetheless, the findings of these three studies illustrate how carefully developed SJTs, when added to an existing test battery, can improve selection decisions in the aggregate, across large applicant pools and/or over time. These consistent findings in the literature suggest the potential for increased validity with additional investments in SJT research. Furthermore, the committee predicts that when SJTs and personality assessments “compete” for validity, the military or other organizations might decide in favor of administering SJTs over personality tests (due to, for example, their greater face validity).
“Would Do” versus “Should Do” Instructions
Meta-analytic research indicates that the type of instructions used by an SJT partially determines its relationships with measures of cognitive ability or personality (McDaniel et al., 2007). For instance, an SJT that requires respondents to indicate what one “should do” or to rank-order situational
responses by their effectiveness is more highly correlated with cognitive ability than with personality (r = .32 versus r = .10–.20; also see Lievens et al., 2009). Conversely, the responses to SJTs asking about the respondent’s behavioral tendencies or what one “would do” in a situation tend to be more correlated with the personality traits of agreeableness, emotional stability, and conscientiousness (also see the McDaniel and Nguyen, 2001, meta-analysis) and less correlated with cognitive ability (r = .30–.33 for the aforementioned constructs versus r = .17 for cognitive ability).
That said, there are at least two important qualifications to these general findings. First, whether one provides “would do” versus “should do” instructions for an SJT is somewhat governed by the constructs being measured (e.g., SJTs related to personality tend to ask about behavioral tendencies or what one “would do”), although research indicates that in those cases where the construct and item content were held constant across different instruction sets, validity patterns appeared to be similar (McDaniel et al., 2007). Second, an additional concern might be that in high-stakes operational settings, “would do” instruction sets might lead to inflated mean SJT scores compared with “should do” instruction sets; however, at least one large-sample study in a college admissions setting (Lievens et al., 2009) found no such mean differences for its SJT measuring interpersonal skills.
Video versus Written SJTs
Traditionally, SJTs have been administered in a written format, although there are some notable instances of using video and computerized formats of the SJT, in which the nontraditional format has served at least four important purposes. First, the reading level required for video SJTs is often lower, meaning that if verbal ability is irrelevant to the constructs of interest, then the video SJT can lead to more reliable measurement than its text-based counterpart. This appears to be true not only for samples of test takers that vary widely in verbal ability (Chan and Schmitt, 1997) but also in samples presumed to have higher levels of verbal ability (e.g., medical school applicants; Lievens and Sackett, 2006). Lower verbal-ability requirements also tend to mean lower potential for adverse minority impact (large subgroup differences).
Second, in addition to reducing the demands on verbal ability, video formats are more immersive and engaging testing experiences, hence the enthusiasm for multimedia testing since the ready availability of personal computers in the 1990s. Video SJTs have the potential to allow organizations to distinguish themselves from competitors, to send a signal about their innovation, to pique the interest of highly valued applicants, and to gather information about these applicants for making hiring decisions. Note
that video SJTs have equal potential for creating negative impressions about an organization if they are not carefully constructed and administered.
Third, by customizing video SJTs to particular types of work, organizations can provide standardized realistic job previews (Weekley and Jones, 1997) that allow applicants to draw conclusions about person-environment fit (e.g., Edwards and Cable, 2009), which has implications for greater job satisfaction, lower turnover, and longer-term commitment. To the extent these conclusions are accurate, this benefits both applicants and organizations alike.
Fourth, video SJTs offer hope for increasing job applicants’ perceptions of the fairness and validity of selection systems. Watching enactments of workplace situations, rather than reading about them, might increase examinee motivation and decrease general cognitive ability requirements, which in turn might serve to reduce the risk of adverse impact on minority candidates. Olson-Buchanan and colleagues (1998) developed video SJTs that used a branching algorithm to present different scenes depending on examinee responses. This interactivity reportedly increased realism because the scenes that ensued were logical consequences of the examinees’ choices. Although branching was not performed using an IRT adaptive testing algorithm, as was the case with the researchers’ verbal skills assessment, the logic was in fact quite similar to that of the IRT algorithm.
Regarding the nature of video SJTs—in general and when compared with their written counterparts—the body of evidence accumulated over the past decade is complex. Contrary to the findings of Chan and Schmitt (1997) and Richman-Hirsch and colleagues (2000), Lievens and Sackett (2006) found no statistically significant difference in face validity perceptions for video and written SJTs in their study with medical school applicants. Likewise, although Chan and Schmitt (1997) and Olson-Buchanan and colleagues (1998) found smaller ethnic-group differences with video SJTs than written SJTs, Weekley and Jones (1997) found that video SJTs still exhibited considerable subgroup mean differences (0.3 to 0.6 standard deviations) in two studies of hourly service workers.
Like the SJT itself, the video format is a vehicle for measuring a variety of psychological constructs, some of which might be more amenable to the format (e.g., teamwork) than others (e.g., computer programming knowledge). Coupled with potential benefits are potential challenges that come with large-scale administration of video SJTs; these latter challenges may well be mitigated by future testing and video technologies.
The single-item SJT is a novel and simpler SJT format that has recently emerged in the research literature (Krumm et al., 2014). It amounts
to asking test takers to rate the effectiveness of a representative set of critical incidents derived from a job analysis. Below are examples of two critical incidents generated by human factors professionals (HFPs) that have been rated on a Likert scale for effectiveness (Motowidlo et al., 2013, p. 1854):
Whenever an HFP would perform product testing with live participants, the HFP would invite the entire product management team to the lab to observe the testing. [This example is intended to represent an effective incident]
The team was discussing how to design an interface. With the exception of the HFP, the team was unanimous in their design. The HFP began raising his voice and telling the team about his educational credentials. After the decision was made to use the design the rest of the team developed, the HFP aggressively stormed out of the room. [This example is intended to represent an ineffective incident]
Single-item SJTs of this nature yield somewhat distinct factors for knowledge of effective versus ineffective situations, and both factors demonstrate validity for predicting performance-related outcomes across samples of job incumbents and undergraduates (Crook et al., 2011; Motowidlo et al., 2013). The validity might be surprising, given the seeming obviousness of the effectiveness or ineffectiveness of situations like the above examples. But validity might emerge because of this obviousness: when examinees cannot identify effective and ineffective critical incidents, this is predictive of important outcomes. In any case, all of these recent empirical findings indicate that there is potential for validity of single-item SJTs in job applicant samples.
In addition, it seems likely that a large number of single-item SJTs could be developed and refined in the same amount of time it typically takes to develop and refine a set of, say, 50 traditional SJT items. Having a large number of SJT items, in turn, increases the potential to generate different SJT forms with similar psychometric qualities. This benefit seems essential in large-scale applications where test forms need to be continuously refreshed to minimize blatant forms of cheating. Still, additional research on single-item SJTs is needed to assess whether validities can be preserved under conditions where job applicants are coached on the correct responses and/or have the motivation to fake.
A meta-analysis of subgroup differences by Whetzel and colleagues (2008) found that SJTs collectively exhibit smaller white-black effect size differences than traditional cognitive ability tests (0.38 versus 1.0 standard deviations), but the magnitudes of the differences depend on cognitive load, which some research has linked to the instructions accompanying SJT scenarios. More specifically, “should do” or knowledge instructions, which ask respondents to choose the best/worst or most/least effective option(s) from a series of alternatives, tend to increase correlations with cognitive ability measures and thus increase subgroup differences. In contrast, “would do” or behavioral tendency instructions, which ask respondents to choose the most/least likely option(s) from a series of alternatives, show reduced subgroup differences relative to “should do” instructions.
However, there is a potential tradeoff with “would do” instructions because even if subgroup differences are reduced, “would do” instructions have also been found to be more susceptible to faking (e.g., McDaniel et al., 2007; Nguyen et al., 2005; Peeters and Lievens, 2005; Ployhart and Ehrhart, 2003). Generally speaking, mean differences between subgroups by race and gender on SJTs tend to be low, but nonetheless, the cited research literature suggests that the magnitude of these differences (and thus the contribution of an SJT to adverse impact in a selection battery) is influenced by both the constructs and the instructions associated with a given SJT. Future research might also investigate subgroup differences in the SJT response process. Because this process involves comprehension, retrieval, judgment, and response selection (Ployhart, 2006), it is likely to be affected by cognitive influences (e.g., verbal complexity) and noncognitive influences (e.g., test-taking anxiety, test-taking motivation) on which subgroups are already known to differ.
Typically, psychological measures are developed to be unidimensional or “construct-pure,” meaning that the relevant content for all the items within a given scale should reflect a single construct. Alpha reliability coefficients also depend on this assumption to be informative (Cortina, 1993). Developing SJT measures is challenging from the perspectives of both test development and psychometrics because SJT items are a priori known to be complex and heterogeneous in terms of both the stem (situation) of each item and the items’ response options. The previous research notwithstanding, SJTs in practice are often quite heterogeneous and will often yield a single weak factor (e.g., McDaniel and Whetzel, 2005; Whetzel and McDaniel, 2009) that is sometimes tautologically labeled “situational judgment.” Directly as a function of finding a weak factor, coefficient alpha is notoriously low (Chan and Schmitt, 1997; Smiderle et al., 1994; Weekley and Jones, 1997). By contrast, test-retest reliabilities based on the same items control for item heterogeneity (practice effects notwithstanding) and are potentially higher, suggesting there is reliable variance in SJT items that is not part of an overall construct but instead tends to be unique to each item. For example, the 56 alpha coefficients located by Catano and colleagues (2012) yielded an average value of α = .46, and similarly low alphas were obtained in their two longitudinal SJT studies. However, their test-retest reliabilities were much higher: r = .82 in a student sample after a 2-week retest interval, and r = .66 in an HR sample after a 3-month retest interval (Catano et al., 2012).
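The pattern of low alpha alongside high test-retest reliability can be illustrated with a small simulation. This is a hedged sketch under an assumed data-generating model in which each item's reliable variance is item-specific rather than shared across items; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items = 200, 10

# Each item has its own stable, item-specific "true" component (little
# variance shared across items), plus occasion-specific noise.
stable = rng.normal(size=(n_examinees, n_items))
time1 = stable + rng.normal(scale=0.5, size=(n_examinees, n_items))
time2 = stable + rng.normal(scale=0.5, size=(n_examinees, n_items))

def cronbach_alpha(x):
    # Standard formula: k/(k-1) * (1 - sum of item variances / total variance).
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

alpha = cronbach_alpha(time1)  # near zero: items share little variance
retest_r = np.corrcoef(time1.sum(axis=1), time2.sum(axis=1))[0, 1]  # higher
```

Under this model, alpha hovers near zero while the test-retest correlation of total scores is substantial, mirroring the Catano et al. pattern described above.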
Thus, alpha reliability coefficients are inappropriate indices of reliability for SJTs because the situational stems and item responses reflect complex situations and thus are not internally consistent. Test-retest reliability coefficients are more appropriate to show stability in situational judgment across items, but equally important, if not more so, is the value and necessity in tying the content of SJTs to psychological theory and to the information provided by a job analysis. This, together with psychometric evidence for test-retest reliability, provides the converging pieces of evidence that help establish the quality and validity of an SJT measure. In today’s age of Big Data analytics and flexible predictive models, a temptation to be avoided in SJT development would be to select items that simply predict based on statistical properties, without theoretical basis or concern for reliability or construct validity. McDonald (1999, p. 243) also refers to this approach disparagingly, noting that it simply “uses the relations of the unique parts of the items to the criterion to maximize predictive ability” so that predictors are chosen that correlate highly with the criterion but correlate near zero with each other. In other words, using validity as a driver for item selection might lead to high levels of prediction by design, but the unfortunate result might be not knowing what is being measured.
Not surprisingly, scoring methods have been shown to influence the validity of SJTs for predicting various workplace criteria (Bergman et al., 2006; Weekley and Jones, 1997). As with personality measures, rational scoring keys determine the “correct” answers by theory, and therefore rational keys are the most straightforward to develop. However, rational scoring may increase susceptibility to faking because the “correct” answers may be too obvious.
The alternative approach of empirical keying, where SJT item responses that correlate highest (in magnitude and sign) with a criterion are deemed to be the correct answer, poses different challenges. Depending on the criterion and the group that is used for development, one particular SJT can have many empirical keys, which may include different numbers of items and/or differentially weighted items. For example, empirical keys developed
using the consensus judgments of novices (e.g., new employees), experts (e.g., supervisors, managers), customers, or examinees may be substantially different (Bergman et al., 2006). Furthermore, items that differentiate well among novices may be nondiscriminating among experts, and items may be more or less discriminating within a group, depending on the attribute that is being keyed. Because SJT scenarios and response options are typically multidimensional, the actions or events can be evaluated along multiple dimensions. With video SJTs, the inextricable nonverbal cues may lead to differences from written tests that were designed to be the content-equivalent of the video version (see, for example, Weekley and Jones, 1997). Ultimately, the potentially large number of empirical keys, the variations in item properties across calibration samples, and the loss of information when items are discarded due to low discrimination have implications for reliability, validity, and the generalizability of findings.
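The basic logic of empirical keying described above can be sketched as follows. The data, sample sizes, and function names are purely illustrative; an operational key would be built on real criterion data and cross-validated in a holdout sample:

```python
import numpy as np

rng = np.random.default_rng(1)
n_examinees, n_items, n_options = 300, 8, 4

# Illustrative development sample: which option each examinee chose per
# item, plus a criterion measure (e.g., a supervisor performance rating).
choices = rng.integers(0, n_options, size=(n_examinees, n_items))
criterion = rng.normal(size=n_examinees)

def build_empirical_key(choices, criterion, n_options):
    # Weight each option by the correlation between endorsing it and the
    # criterion; options that "work" empirically get larger (signed) weights.
    n_items = choices.shape[1]
    key = np.zeros((n_items, n_options))
    for item in range(n_items):
        for opt in range(n_options):
            endorsed = (choices[:, item] == opt).astype(float)
            key[item, opt] = np.corrcoef(endorsed, criterion)[0, 1]
    return key

key = build_empirical_key(choices, criterion, n_options)

def score_with_key(choices, key):
    # Sum the keyed weight of each chosen option across items.
    return key[np.arange(choices.shape[1]), choices].sum(axis=1)

scores = score_with_key(choices, key)
```

Note how the key depends entirely on the criterion and the development group chosen: a different criterion or sample yields a different key, which is precisely the source of the proliferation of keys the text describes.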
There have been recent attempts to create SJT forms that are reasonably parallel in their overall content and psychometric properties. As a step toward constructing parallel SJT forms, researchers have explored some intuitive approaches to assigning items to different SJT forms (Irvine and Kyllonen, 2002; Lievens and Sackett, 2007; Oswald et al., 2005; Whetzel and McDaniel, 2009). For example, under an incident isomorphism strategy, a large pool of critical incidents is generated; two items, for example, are written for each incident; and one item from each pair is assigned to each form. Under a random assignment strategy, a large pool of items is developed for each domain and the items are assigned randomly to different forms.
Extending the latter approach, Oswald and colleagues (2005) used stratified random assignment by assigning SJT items randomly within each of 12 dimensions within intellectual, interpersonal, and intrapersonal domains. This assignment method, in conjunction with traditional scale construction and evaluation practices, produced 144 forms of an SJT to predict different types of college student performance. More specifically, the SJT item means, standard deviations, item-total correlations, and item validities (correlations of item responses with first-year grade point average) were arrayed in a spreadsheet (items in rows, statistics in columns) and grouped by the 12 dimensions they represented. Next, a computer program generated 10,000 preliminary 36-item forms via stratified random sampling of 3 items from each of the 12 dimensions. Projected test form means, standard deviations, and criterion-related validities were computed using widely available formulas. Then, the number of forms was reduced by imposing statistical constraints: The individual test form means could
differ by no more than 0.05 standard deviations from the overall mean across all test forms; test form alpha reliabilities had to exceed .70; and test form criterion-related validities had to exceed .15. This process left 144 forms that exhibited only about 30 percent overlapping content. Note that sample sizes were large enough to suggest the item statistics were stable (i.e., N = 644 and 381 across two sets of items) and that the 144 forms selected were not unduly capitalizing on chance. But that said, future investigation of the stability of parallel form generation procedures for SJTs is generally recommended.
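The assembly procedure described above (stratified random sampling of items within dimensions, followed by filtering on statistical constraints) can be sketched as follows. This is a simplified illustration with simulated item statistics, fewer candidate forms, and only a subset of the reported constraints, not the authors' actual program:

```python
import numpy as np

rng = np.random.default_rng(2)
n_dims, pool_per_dim, per_form = 12, 10, 3
n_items = n_dims * pool_per_dim
dim = np.repeat(np.arange(n_dims), pool_per_dim)

# Simulated calibration statistics for each item in the pool.
item_mean = rng.normal(3.0, 0.3, n_items)
item_validity = rng.normal(0.08, 0.04, n_items)

def sample_form():
    # Stratified random sampling: draw 3 items from each of 12 dimensions.
    return np.concatenate([
        rng.choice(np.where(dim == d)[0], size=per_form, replace=False)
        for d in range(n_dims)
    ])

# Generate candidate 36-item forms (the study used 10,000; fewer here).
forms = [sample_form() for _ in range(2_000)]

# Filter on statistical constraints, analogous to the reported procedure:
# form means close to the grand mean, and adequate mean item validity.
grand_mean = item_mean.mean()
kept = [f for f in forms
        if abs(item_mean[f].mean() - grand_mean) < 0.05
        and item_validity[f].mean() > 0.03]
```

Adding further constraints (e.g., a minimum projected alpha or validity for each form, a cap on content overlap between forms) simply extends the filter, at the cost of retaining fewer candidate forms.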
At the very least, the aforementioned effort illustrates several important points or principles to be examined further in future research: (1) SJTs can be built to exhibit adequate reliabilities and validities by paying attention to statistical indices of homogeneity and criterion validity during test assembly, even while sampling items from conceptually heterogeneous dimensions in a deliberate manner. (2) Alternative SJT forms can be produced in large numbers by using computationally intensive methods that are similar in spirit to adaptive testing and automated test assembly algorithms (e.g., van der Linden, 1998; van der Linden and Glas, 2010), which select items to satisfy multiple constraints, such as those tied to content, information, and item exposure.
Additionally, new SJT formats might be more amenable to suitably fitting IRT models, such that parallel forms could be constructed by matching test response and information functions (Hulin et al., 1983), test forms could be equated using traditional linking methods (Kolen and Brennan, 2004), measurement invariance tests might be conducted across examinee subpopulations (Millsap and Yun-Tein, 2004), and adaptive testing could be used to improve measurement precision while holding SJT test length constant (Hulin et al., 1983; van der Linden and Glas, 2010).
Operational Use of Innovative SJTs
For testing of situational judgment to be useful operationally in personnel selection settings (usually the intended setting), there is a set of key desiderata that are typical for most selection tests (see, for example, the nine research recommendations of Whetzel and McDaniel, 2009): (a) appropriate scoring methods (e.g., empirical versus rational); (b) high reliability, validity, and incremental validity when considering supplementing or substituting measures in a selection battery; (c) low subgroup differences (and adverse impact) with respect to legally protected subgroups; (d) resistance to faking and coaching; (e) ability to create parallel forms; (f) consideration of the possibilities and implications of applicant retesting;
and (g) item and test security. To this end, a strong basic research agenda would be required to (1) examine the relationships of SJT scores with key individual-differences variables; (2) clarify the complex associations between SJT testing modalities, instruction sets, constructs, criterion-related validity, and subgroup differences; and (3) explore new technologies for capturing and scoring examinee responses.
Multimedia and SJTs
Regarding this latter point concerning technology and SJTs, the committee believes there is still a great deal to be learned about the benefits of video and, more generally, multimedia SJTs relative to lower-fidelity written alternatives. Technological advances will surely create new possibilities for SJT item and test development, test delivery, response capture, and scoring. For instance, technology has evolved to the extent that multimedia tests can now be administered on a variety of personal computing devices, including tablets and smartphones. This enables testing to take place in natural environments (e.g., at home and unproctored) as well as traditional ones (e.g., at a testing center and proctored). Furthermore, rather than limiting responses to simple mouse clicks and keystrokes to answer Likert scales or check boxes, technology can be employed to collect and analyze open-ended responses, such as by using dictation software or a video camera, with verbal responses analyzed with natural language processing technology (e.g., Jurafsky and Martin, 2008; Kumar, 2011). Accordingly, as new opportunities for testing situational judgment develop, future research will need to examine the advantages and disadvantages of new methodologies from both examinee and institutional perspectives.
Advances in SJT Item Development and Scoring
Given current SJT construction and scoring practices and the emergent nature of IRT methodologies, the committee believes the military testing process would benefit from future research and thinking on how psychometric technology, which has been used to improve precision and efficiency in large-scale testing programs, might be adapted for testing situational judgment. Additionally, the committee encourages applied researchers to think creatively about SJT item writing and answer formats; if changes can be made to increase the suitability of prevailing psychometric models without destroying the realism of SJT items, then an array of new possibilities for test construction and evaluation can be expected to follow.
Organizational researchers are clearly concerned about understanding the nature of testing for situational judgment, both in terms of constructs measured and methods employed in various SJTs. A program of basic SJT research committed to integrating these concerns would likely increase the effectiveness of personnel selection and classification systems.
The U.S. Army Research Institute for the Behavioral and Social Sciences should support research to understand constructs and assessment methods specific to the domain of situational judgment, including but not limited to the following lines of inquiry:
- Develop situational judgment tests with items reflecting constructs that are important, that show promise for validity, and that are difficult to assess with other tests (e.g., prosocial knowledge, team effectiveness).
- Consider innovative formats for presenting situations (e.g., ranging from simple text-based scenarios to dynamic and immersive computer-generated graphics), capturing examinee responses (e.g., open-ended, voice, gestures, facial expressions, eye movements, reaction times), and evaluating examinee responses (e.g., advanced natural language processing, automated reasoning, machine learning).
- Develop and explore psychometric models and methods that can accommodate the rich array of data that innovative assessment methods for situational judgment may yield, facilitating the development of psychometrically and practically equivalent assessments, and improving reliability and testing efficiency.
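The psychometric technology referenced above typically rests on item response theory (see Hulin et al., 1983). A minimal sketch of the two-parameter logistic (2PL) model, with hypothetical item parameters, shows why such models improve testing efficiency: an item's Fisher information peaks where its difficulty matches examinee ability, which is the principle adaptive tests exploit when selecting items.

```python
# Minimal 2PL IRT sketch; the item parameters below are hypothetical.
import math


def p_keyed(theta: float, a: float, b: float) -> float:
    """2PL model: probability of the keyed response at ability theta,
    given discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information the item contributes at ability theta."""
    p = p_keyed(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical item: moderate discrimination (a = 1.2), average
# difficulty (b = 0.0). Information is maximal at theta == b, so an
# adaptive test would administer this item to examinees whose current
# ability estimate is near zero.
```

Extending such models to the polytomous, multidimensional, and possibly ipsative data that SJTs generate is exactly the kind of work the recommendation above envisions (cf. Brown and Maydeu-Olivares, 2011, 2013).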
Arthur, W., Jr., and A.J. Villado. (2008). The importance of distinguishing between constructs and methods when comparing predictors in personnel selection research and practice. Journal of Applied Psychology, 93(2):435–442.
Becker, T.E. (2005). Development and validation of a situational judgment test of employee integrity. International Journal of Selection and Assessment, 13(3):225−232.
Bergman, M.E., F. Drasgow, M.A. Donovan, J.B. Henning, and S. Juraska. (2006). Scoring situational judgment tests: Once you get the data, your troubles begin. International Journal of Selection and Assessment, 14(3):223−235.
Brown, A., and A. Maydeu-Olivares. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3):460−502.
Brown, A., and A. Maydeu-Olivares. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18(1):36−52.
Campion, M.C., R.E. Ployhart, and W.I. MacKenzie. (2014). The state of research on situational judgment tests: A content analysis and directions for future research. Human Performance, 27:283–310.
Catano, V.M., A. Brochu, and C.D. Lamerson. (2012). Assessing the reliability of situational judgment tests used in high-stakes situations. International Journal of Selection and Assessment, 20(3):333−346.
Chan, D., and N. Schmitt. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82(1):143−159.
Chan, D., and N. Schmitt. (2002). Situational judgment and job performance. Human Performance, 15(3):233−254.
Christian, M.S., B.D. Edwards, and J.C. Bradley. (2010). Situational judgment tests: Constructs assessed and a meta-analysis of their criterion-related validities. Personnel Psychology, 63(1):83−117.
Clevenger, J., G.M. Pereira, D. Wiechmann, N. Schmitt, and V.S. Harvey. (2001). Incremental validity of situational judgment tests. Journal of Applied Psychology, 86:410−417.
Cortina, J.M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1):98−104.
Crook, A.E., M.E. Beier, C.B. Cox, H.J. Kell, A.R. Hanks, and S.J. Motowidlo. (2011). Measuring relationships between personality, knowledge, and performance using single-response situational judgment tests. International Journal of Selection and Assessment, 19(4):364−373.
de la Torre, J., V. Ponsoda, I. Leenen, and P. Hontangas. (2012). Examining the Viability of Recent Models for Forced-Choice Data. Presented at the Meeting of the American Educational Research Association. Vancouver, British Columbia, Canada, April. Abstract available: http://www.aera.net/Publications/OnlinePaperRepository/AERAOnlinePaperRepository/tabid/12720/Owner/289609/Default.aspx [February 2015].
Edwards, J.R., and D.M. Cable. (2009). The value of congruence. Journal of Applied Psychology, 94:654−677.
Hanson, M.A., and W.C. Borman. (1995). Development and Construct Validation of the Situational Judgment Test (ARI Research Note 95-34). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Hulin, C.L., F. Drasgow, and C.K. Parsons. (1983). Item Response Theory: Applications to Psychological Measurement. Homewood, IL: Dow Jones-Irwin.
Irvine, S.H., and P.C. Kyllonen, Eds. (2002). Item Generation and Test Development. Mahwah, NJ: Lawrence Erlbaum Associates.
Jurafsky, D., and J.H. Martin. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice Hall.
Kolen, M.J., and R.L. Brennan. (2004). Test Equating, Scaling, and Linking: Methods and Practices. New York: Springer.
Krumm, S., F. Lievens, J. Hüffmeier, A.A. Lipnevich, H. Bendels, and G. Hertel. (2014). How “situational” is judgment in situational judgment tests? Journal of Applied Psychology (Epub). Available: http://www.researchgate.net/publication/264676090_How_Situational_Is_Judgment_in_Situational_Judgment_Tests [February 2015].
Kumar, E. (2011). Natural Language Processing. New Delhi, India: I.K. International Publishing House.
Lievens, F., and P.R. Sackett. (2006). Video-based versus written situational judgment tests: A comparison in terms of predictive validity. Journal of Applied Psychology, 91(5):1181−1188.
Lievens, F., and P.R. Sackett. (2007). Situational judgment tests in high-stakes settings: Issues and strategies with generating alternate forms. Journal of Applied Psychology, 92(4):1043−1055.
Lievens, F., P.R. Sackett, and T. Buyse. (2009). The effects of response instructions on situational judgment test performance and validity in a high-stakes context. Journal of Applied Psychology, 94:1095–1101.
McDaniel, M.A., and N.T. Nguyen. (2001). Situational judgment tests: A review of practice and constructs assessed. International Journal of Selection and Assessment, 9:103−113.
McDaniel, M.A., and D.L. Whetzel. (2005). Situational judgment test research: Informing the debate on practical intelligence theory. Intelligence, 33(5):515−525.
McDaniel, M.A., F.P. Morgeson, E.B. Finnegan, M.A. Campion, and E.P. Braverman. (2001). Use of situational judgment tests to predict job performance: A clarification of the literature. Journal of Applied Psychology, 86:730−740.
McDaniel, M.A., N.S. Hartman, D.L. Whetzel, and W.L. Grubb, III. (2007). Situational judgment tests, response instructions, and validity: A meta-analysis. Personnel Psychology, 60(1):63−91.
McDonald, R.P. (1999). Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum Associates.
Millsap, R.E., and J. Yun-Tein. (2004). Assessing factorial invariance in ordered-categorical measures. Multivariate Behavioral Research, 39(3):479–515.
Motowidlo, S.J., M.D. Dunnette, and G.W. Carter. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75:640−647.
Motowidlo, S.J., M.P. Martin, and A.E. Crook. (2013). Relations between personality, knowledge, and behavior in professional service encounters. Journal of Applied Social Psychology, 43(9):1851−1861.
Mumford, T.V., F.P. Morgeson, C.H. Van Iddekinge, and M.A. Campion. (2008). The team role test: Development and validation of a team role knowledge situational judgment test. Journal of Applied Psychology, 93(2):250−267.
Nguyen, N.T., M.D. Biderman, and M.A. McDaniel. (2005). Effects of response instructions on faking a situational judgment test. International Journal of Selection and Assessment, 13(4):250−260.
Olson-Buchanan, J.B., F. Drasgow, P.J. Moberg, A.D. Mead, P.A. Keenan, and M.A. Donovan. (1998). An interactive video assessment of conflict resolution skills. Personnel Psychology, 51(1):1–24.
Oswald, F.L., N. Schmitt, B.H. Kim, L.J. Ramsay, and M.A. Gillespie. (2004). Developing a biodata measure and situational judgment inventory as predictors of college student performance. Journal of Applied Psychology, 89(2):187−207.
Oswald, F.L., A.J. Friede, N. Schmitt, B.K. Kim, and L.J. Ramsay. (2005). Extending a practical method for developing alternate test forms using independent sets of items. Organizational Research Methods, 8(2):149−164.
Peeters, H., and F. Lievens. (2005). Situational judgment tests and their predictiveness of college students’ success: The influence of faking. Educational and Psychological Measurement, 65(1):70−89.
Peterson, N.G., L.E. Anderson, J.L. Crafts, D.A. Smith, S.J. Motowidlo, R.L. Rosse, G.W. Waugh, R. McCloy, D.H. Reynolds, and M.R. Dela Rosa. (1999). Expanding the Concept of Quality in Personnel: Final Report (ARI Research Note 99-31). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Ployhart, R.E. (2006). The predictor response process model. In J.A. Weekley and R.E. Ployhart, Eds., Situational Judgment Tests: Theory, Measurement, and Application (pp. 83–105). Mahwah, NJ: Lawrence Erlbaum Associates.
Ployhart, R.E., and M.G. Ehrhart. (2003). Be careful what you ask for: Effects of response instructions on the construct validity and reliability of situational judgment tests. International Journal of Selection and Assessment, 11(1):1−16.
Richman-Hirsch, W.L., J.B. Olson-Buchanan, and F. Drasgow. (2000). Examining the impact of administration medium on examinee perceptions and attitudes. Journal of Applied Psychology, 85:880−887.
Rumsey, M.G., and J.M. Arabian. (2014). Military enlistment selection and classification: Moving forward. Military Psychology, 26(3):221–251.
Smiderle, D., B.A. Perry, and S.F. Cronshaw. (1994). Evaluation of video-based assessment in transit operator selection. Journal of Business and Psychology, 9(1):3−22.
Tucker, J.S., A.N. Gesselman, and V. Johnson. (2010). Assessing Leader Cognitive Skills with Situational Judgment Tests: Construct Validity Results (Research Product 2010-04). Fort Benning, GA: U.S. Army Research Institute for the Behavioral and Social Sciences.
van der Linden, W.J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22(3):195−211.
van der Linden, W.J., and C.A.W. Glas, Eds. (2010). Computerized Adaptive Testing: Theory and Practice. New York: Kluwer Academic.
Wagner, R.K., and R.J. Sternberg. (1991). Tacit Knowledge Inventory for Managers: User Manual. San Antonio, TX: The Psychological Corporation.
Weekley, J.A., and C. Jones. (1997). Video-based situational testing. Personnel Psychology, 50(1):25−49.
Westring, A.J.F., F.L. Oswald, N. Schmitt, S. Drzakowski, A. Imus, B. Kim, and S. Shivpuri. (2009). Trait and situational variance in a situational judgment measure of goal orientation. Human Performance, 22(1):44−63.
Whetzel, D.L., and M.A. McDaniel. (2009). Situational judgment tests: An overview of current research. Human Resource Management Review, 19(3):188−202.
Whetzel, D.L., M.A. McDaniel, and N.T. Nguyen. (2008). Subgroup differences in situational judgment test performance: A meta-analysis. Human Performance, 21(3):291−309.