Committee Conclusion: The military has long been at the forefront of modern operational adaptive testing. Recent research offers promise for improvements in measurement in a variety of areas, including the application and modeling of forced-choice measurement methods; the development of serious gaming; and the pursuit of Multidimensional Item Response Theory (MIRT), Big Data analytics, and other modern statistical tools for estimating applicant standing on attributes of interest with greater efficiency. Efficiency is a key issue, because the wide range of substantive topics recommended for research in this report may result in proposed additions to the current battery of measures administered for accession purposes. The committee concludes that such advances in measurement and statistical models merit inclusion in a program of basic research with the long-term goal of improving the Army’s enlisted accession system.
The U.S. armed forces’ historical commitment to develop and improve recruitment, selection, and job classification processes is reflected in a century of research initiatives since World War I to advance psychological measurement methodology and data analytics. The committee encourages the U.S. Army Research Institute for the Behavioral and Social Sciences (ARI) to continue to support this long tradition of research that aims to increase the precision, validity, efficiency, and security of existing assessments; to develop and evaluate new methods for measuring human capabilities; and to explore methods for analyzing the potentially vast amounts of data that
these assessments may generate (e.g., machine learning and other analytic methods inspired by the Big Data movement). Although previous chapters of this report describe psychological constructs of interest for recruit selection and assignment, this chapter focuses on research questions related to psychological-assessment measurement methods, emerging assessment technologies, and statistical analysis approaches that are applicable to many types of data. The committee anticipates that the use of modern psychometric and statistical approaches in assessment will yield payoffs in terms of reduced testing time, increased test security, and improved selection and classification, all of which reduce costs and improve human capital in military and organizational settings.
Psychometric research conducted and supported by the U.S. armed forces has influenced measurement and selection practices and yielded a wealth of information about human capabilities. In the domain of cognitive abilities and vocational interests, for example, the Army General Classification Test (Harrell, 1992) predicted performance in military training and deployments during World War II (Flanagan, 1947). In the domain of personality traits, later research by Tupes and Christal (1961) and Digman (1990) analyzed numerous empirical ratings of adjectives in the English lexicon to identify the Big Five taxonomy of personality traits (Goldberg, 1992). These large collective efforts to develop and validate assessments of individual differences sparked advances in the statistical analysis of measures using classical test theory, which in turn provided a foundation for IRT and other modern measurement methods that undergird tests used now for military personnel screening.
The most widely administered and well-known personnel screening test is the Armed Services Vocational Aptitude Battery (ASVAB; Maier, 1993). The original ASVAB was a battery of 10 paper-and-pencil tests of various cognitive abilities, skills, and knowledge that took approximately 4 hours to complete. These tests were constructed, scored, and equated using classical test theory methods, and the scores were combined using simple weighting or regression methods to create composites for selection and classification decisions. For an overview of studies of the factor structure of the ASVAB, see Box 8-1.
In 1992, building on nearly 30 years of basic psychometric research funded by the Department of Defense and the Office of Naval Research, a new IRT-based computerized adaptive testing (CAT) version of the ASVAB, the CAT-ASVAB, was launched (see Sands et al., 1999, for a historical review). Although the CAT-ASVAB measured the same constructs, with reliabilities and validities similar to the paper-and-pencil ASVAB, adaptive
item selection reduced test length and seat time by nearly 50 percent, allowing for quicker processing of military applicants or, alternatively, the measurement of additional knowledge, skills, abilities, and other attributes (KSAOs). Moreover, the transition to CAT offered benefits in terms of test security and data screening. For example, the declining number of applicants taking the paper-and-pencil test over time, coupled with the variability in content that adaptive item selection provided, reduced the risk of a sudden and serious test compromise, as did the implementation of methods for thwarting blatant cheating and other attempts to “game the system.” Collectively, these features made CAT-ASVAB one of the most psychometrically sound and sophisticated tests ever developed in either military or civilian settings, and it set a high standard for future high-volume personnel screening instruments in the domain of cognitive abilities and skills.
Concurrent with measurement research to prepare for the deployment and maintenance of CAT-ASVAB, ARI conducted a detailed review of military jobs and potential predictors of successful performance under what was called Project A (Campbell, 1990; Campbell and Knapp, 2001). Project A identified eight components of job performance, subsumed under three broad categories now referred to as task performance, citizenship performance, and counterproductive performance (Rotundo and Sackett, 2002). Project A, as well as other military studies (e.g., Motowidlo and Van Scotter, 1994), indicated that although cognitive ability tests such as the ASVAB are among the best predictors of task performance, they only weakly predict citizenship and counterproductive performance, whereas noncognitive variables—those variables that fall under the domains of personality, motivation, and attitudes—tend to exhibit the opposite pattern of predictive relationships.
Recognizing the potential complementary benefits of adding a noncognitive test to CAT-ASVAB for military screening, Army researchers developed and experimented with a personality questionnaire called the Assessment of Background and Life Experiences (ABLE; White et al., 2001), which measured six constructs using the traditional format of single statements, each asking for a response on a four-point Likert scale (Likert, 1932). ABLE scores predicted performance as expected in low-stakes (“research only”) settings, but in situations where examinees were motivated to fake “good,” substantial score increases and validity decreases were observed (Hough et al., 1990; White and Young, 1998; White et al., 2001). Consequently, researchers began exploring an alternative multidimensional forced-choice (MFC) format for administering items (whereby test takers choose between statements rather than rating a single statement on a scale), along with scoring methods that together might address the problem of faking, which tends to inflate test scores and reduce validity.
BOX 8-1
Factor Structure of the ASVAB

The current ASVAB is a test given to all recruits (Powers, 2013), and it measures nine constructs (or factors) fairly well: general science, arithmetic reasoning, word knowledge, paragraph comprehension, mathematics knowledge, electronics information, auto and shop information, mechanical comprehension, and assembling objects. These tests were designed to be essentially unidimensional (Stout, 1987, 1990), so that unidimensional IRT models could be applied for CAT-ASVAB development (see Drasgow and Parsons, 1983). As noted in Chapter 1, the current ASVAB has a number of good measurement characteristics, including the fact that each subscale measures its associated construct well.

The factor structure of the ASVAB was examined by Kass and colleagues (1983). This standardized battery, which tests multiple cognitive abilities, is the primary selection and classification instrument used by all the U.S. military services. The investigators compared their factor structure results with those found for previous ASVAB samples and for previous forms of the ASVAB. In particular, they examined whether the factor structure was similar for racial/ethnic and sex subgroups, to determine the extent of invariance of ASVAB factor structure across these groups. Using data from a sample of more than 98,000 male and female Army applicants, they conducted an exploratory factor analysis and found four factors that accounted for 93 percent of the total variance: verbal ability, speeded performance, quantitative ability, and technical knowledge. In general, the factor analyses for male, female, white, black, and Hispanic subgroups yielded similar results. The findings provided evidence that the ASVAB’s constructs for cognitive abilities were stable across diverse samples of candidates.

A subsequent reanalysis of the factor structure of the ASVAB compared it with similar aptitude tests (Wothke et al., 1991). In this factor analysis, 46 tests from the Kit of Factor Referenced Cognitive Tests (the Kit) and the 10 ASVAB subtests were administered to a sample of airmen. Because a total of 56 tests were investigated, not every examinee received every test. Instead, matrix sampling was used to pair tests. Matrix sampling requires special factor analytic methods (McArdle, 1994). After consideration of descriptive statistics and editing, the data were assembled into a correlation matrix for exploratory and confirmatory factor analysis, which indicated that three factors were required to explain the correlation structure among the ASVAB scores. These three factors were defined as school attainment (for the word knowledge, paragraph comprehension, general science, and mathematics knowledge constructs), speediness (for numerical operations and coding speed, and some of the arithmetic reasoning construct), and technical knowledge (for auto and shop information, mechanical comprehension, and electronics information). The Kit scores required six factors, and the factors used to explain the ASVAB correlations could largely be placed within the factor space of the Kit factors, indicating that the abilities measured by the ASVAB are a subset of the abilities measured by the Kit. These results suggest that future research to enhance selection and classification should focus on abilities not currently measured by the ASVAB, such as those described in Chapter 4 on Spatial Abilities.

NOTE: ASVAB = Armed Services Vocational Aptitude Battery, CAT-ASVAB = computerized adaptive test (IRT-based version of ASVAB), IRT = Item Response Theory.

The result was the Assessment of Individual Motivation (AIM) inventory (White and Young, 1998), which measures six personality constructs using an MFC format that requires examinees to make “most like me” and “least like me” choices among similarly desirable personality statements that are presented in blocks of four (tetrads). AIM tetrads are scored by assigning 0 to 2 points for each statement; test scores are then based on summing the points across the relevant statements for each construct. According to White and Young (1998), AIM scale scores are only partially ipsative (Hicks, 1970), because the number of statements representing each construct varies, respondents are required to endorse only two of four statements in each tetrad, and the nonendorsed statements are assigned intermediate scores. These features introduce enough variation into the AIM total scores to permit normative decision making. Importantly and perhaps as a consequence, the AIM personality measure proved much more resistant to faking than the ABLE personality measure in field research. Unfortunately, the complexity of the format precluded any near-term transition to CAT because there were no psychometric models, such as IRT, that were directly applicable to MFC tetrad responses.
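The tetrad scoring scheme described for the AIM can be sketched in a few lines of Python. The exact point keying (2 for “most like me,” 0 for “least like me,” 1 for each nonendorsed statement) and the construct labels are illustrative assumptions consistent with the description above, not the operational scoring key:

```python
def score_tetrad(statements, most_idx, least_idx):
    """Illustrative AIM-style tetrad scoring (the 2/1/0 keying is an
    assumption): the "most like me" statement earns 2 points, the
    "least like me" statement 0, and each nonendorsed statement an
    intermediate 1 point, credited to that statement's construct."""
    scores = {}
    for i, construct in enumerate(statements):
        pts = 2 if i == most_idx else 0 if i == least_idx else 1
        scores[construct] = scores.get(construct, 0) + pts
    return scores

# A hypothetical tetrad whose four statements tap three constructs:
tetrad = ["dependability", "achievement", "adjustment", "dependability"]
scores = score_tetrad(tetrad, most_idx=1, least_idx=2)
```

Because every tetrad awards the same four points in total, scores carry an ipsative component; as the text notes, variation in how many statements represent each construct and the intermediate credit for nonendorsed statements restore enough normative variance for decision making.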
For the next several years, ARI supported research to increase the validity of the AIM using methods that capitalize on patterns of relationships among item responses and test scores. For example, Drasgow and colleagues (2004) compared the efficacy of predicting attrition among non–high school diploma graduate recruits (see Stark et al., 2011; White et al., 2004) based on (a) logistic regression with AIM scale scores, (b) classification-and-regression-tree methods (Breiman et al., 1997), and (c) logistic regression using IRT odds-based scores derived from separately fitting a graded response model (Samejima, 1969) to data for each AIM subscale. There were two noteworthy findings: (1) Computationally intensive classification methodologies could improve the prediction of attrition relative to regression using ordinary scale scores. (2) In accordance with research by Chernyshenko and colleagues (2001), the graded response model generally did not fit the AIM personality data as effectively as the two- and three-parameter logistic IRT models (Birnbaum, 1968) did for cognitive-ability item responses.
This latter finding provided further support for research suggesting that so-called ideal-point models, which were developed and used for attitude measurement (for example, Andrich, 1996; Coombs, 1964; Roberts et al., 1999; 2000), should be similarly considered for personality measurement (Chernyshenko et al., 2001; 2007; Drasgow et al., 2010; Stark et al., 2006). In simple terms, ideal-point models assume that if personality statements are too negative or too positive, then respondents will tend to disagree with them. Note that personality statements that are uniquely appropriate to ideal-point models remain a challenge to write and scale (Dalal et al., 2014; Huang and Mead, 2014; Oswald and Schell, 2010). This is partly because most modern test construction and item evaluation practices are imbued with assumptions from traditional IRT models appropriate to cognitive ability (e.g., dominance models, which assume that a person who tends to answer hard items correctly should also be able to answer most easy items correctly). Thus, personality measurement will continue to benefit from research that continuously improves ideal-point IRT models and other modern psychometric tools and measure-development approaches. Consistently superior criterion-related validity over their traditional counterparts is the end goal of such developments.
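The contrast between dominance and ideal-point response processes can be made concrete with a short sketch. The single-peaked curve below is a stylized illustration, not any specific published ideal-point model, and the statement locations and trait levels are hypothetical:

```python
import math

def dominance_prob(theta, a, b):
    """Dominance (2PL-type) model: endorsement probability rises
    monotonically with trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ideal_point_prob(theta, delta):
    """Stylized single-peaked ideal-point response function: endorsement
    is most likely when the statement location delta matches the
    respondent's own trait level, and falls off in BOTH directions."""
    return math.exp(-0.5 * (theta - delta) ** 2)

# "I am extremely organized" (delta = 2.5) seen by different respondents:
moderate = ideal_point_prob(0.5, 2.5)   # statement too positive: disagree
matched  = ideal_point_prob(2.5, 2.5)   # statement fits: endorse
```

Under the dominance model a higher trait level always means a higher endorsement probability, whereas the ideal-point curve predicts that respondents disagree with statements that are either too positive or too negative relative to their own standing, which is the behavior described above.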
In 2005, ARI funded a proposal to develop a new personality assessment system that would integrate findings concerning ideal-point modeling, MFC testing, and CAT. The result was the Tailored Adaptive Personality Assessment System (TAPAS; Drasgow et al., 2012), which was developed to measure up to 21 narrow personality factors (dimensions) and military-specific constructs, using a CAT algorithm based on a “multi-unidimensional pairwise-preference” IRT model (Stark, 2002; Stark et al., 2005, 2012a). Respondents are presented with pairs of personality statements, which are similar in their levels of social desirability and extremity, but they are usually different in the constructs they measure; respondents are then asked to select the statement in each pair that is “more like you.”
Initial field research with a nonadaptive paper-and-pencil form of this test, known as TAPAS-95s, showed good validities for predicting citizenship and counterproductive performance outcomes with new soldiers (Knapp and Heffner, 2010). Subsequent simulation research investigating various multidimensional pairwise preference CAT designs (see Drasgow et al., 2012; Stark and Chernyshenko, 2007; Stark et al., 2012a, 2012b) affirmed previous findings with the three-parameter logistic model underlying the CAT-ASVAB that adaptive item selection could provide the same accuracy and precision as nonadaptive tests that were nearly twice as long (e.g., Sands et al., 1999). Starting in May 2009, a 13-dimension 108-item TAPAS CAT was administered to Army applicants in military entrance processing stations with a time limit of 30 minutes. This test was later replaced by various 15-dimension 120-item versions with enhanced capabilities to
detect rapid, patterned, and random responding in real time, to promote data integrity (Stark et al., 2012c).
Today the U.S. military tests hundreds of thousands of potential recruits annually in 65 military entrance processing stations around the country. The CAT-ASVAB is administered to approximately 400,000 of these applicants, and the scores on four of what are now nine cognitive subtests are used to determine eligibility for enlistment and various assignments. The TAPAS is administered to a subset of the CAT-ASVAB applicants, and scores on a subset of the 15 personality dimensions are used to compute composites of ability and personality based on knowledge (“can do” composite), attitudes (“will do” composite), and experience and willingness to change (adaptability composite; Stark et al., 2014), which are used to make selection decisions and for research on assignment to military occupational specialties (Drasgow et al., 2012; Nye et al., 2012).
The CAT-ASVAB and TAPAS assess ability and personality, respectively, as complementary KSAOs, but they have several psychometric features in common: (1) The IRT CAT algorithms assume that examinees are about “average” at the start of a test, and from then on, essentially tailor subsequent items to examinees’ estimated levels on a given ability or trait at a given point to improve measurement precision with fewer items than traditional nonadaptive tests. (2) Examinee trait scores are computed using Bayesian methods that augment the effectiveness of short tests whenever additional examinee data (informative priors) are available. (3) They incorporate technology that hinders and/or flags examinees who appear to be “gaming the test” by quitting early or responding in an inattentive manner.
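The shared adaptive logic, start near “average,” administer the most informative remaining item, and re-estimate the trait with a Bayesian method after each response, can be sketched as follows. The 2PL item pool and the simulated examinee are hypothetical and do not reflect either operational test:

```python
import math

def p2pl(theta, a, b):
    """Two-parameter logistic probability of a correct/positive response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def eap_estimate(responses, items, grid=None):
    """Expected a posteriori trait estimate with a standard normal prior,
    computed on a simple quadrature grid."""
    if grid is None:
        grid = [-4.0 + 0.1 * k for k in range(81)]
    post = []
    for theta in grid:
        like = math.exp(-0.5 * theta * theta)  # N(0,1) prior (unnormalized)
        for (a, b), u in zip(items, responses):
            p = p2pl(theta, a, b)
            like *= p if u == 1 else (1.0 - p)
        post.append(like)
    total = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / total

def run_cat(pool, answer_fn, n_items=5):
    """Administer n_items adaptively: start at theta = 0 ("average"),
    give the unused item with maximum information at the current
    estimate, and re-estimate after each response."""
    theta, administered, responses = 0.0, [], []
    remaining = list(pool)
    for _ in range(n_items):
        item = max(remaining, key=lambda ab: item_information(theta, *ab))
        remaining.remove(item)
        administered.append(item)
        responses.append(answer_fn(item))
        theta = eap_estimate(responses, administered)
    return theta

# Hypothetical pool of (a, b) items; examinee answers correctly when b < 1.0
pool = [(1.2, b / 2.0) for b in range(-4, 5)]
theta_hat = run_cat(pool, lambda ab: 1 if 1.0 > ab[1] else 0)
```

With an empty response vector the estimate is simply the prior mean (the “average” starting assumption in point 1), and informative priors from auxiliary data could replace the N(0,1) prior in `eap_estimate` (point 2).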
The CAT-ASVAB and TAPAS also have two primary differences that highlight needs and opportunities for basic research: First, CAT-ASVAB comprises nine cognitive subtests that are individually administered and scored based on the aforementioned unidimensional dominance model, which is a standard model for cognitive ability tests. Correlations among the subtest scores are sizable, as they tend to be between cognitive ability subtests (e.g., r = .4 to .7). By contrast, the TAPAS measures 13 to 15 personality dimensions based on a multidimensional pairwise preference format, which is based on the ideal-point model discussed previously. Trait scores for TAPAS dimensions are estimated simultaneously using a multidimensional Bayes modal method, and the trait score intercorrelations are substantially lower, as they tend to be for personality (e.g., r = .10 to .45).
The second difference is that CAT-ASVAB’s subtests contain 11 or 16 items each and approximately 2.5 hours total is allowed for completion.1 TAPAS tests are also adaptive and typically involve 120 or fewer multidimensional pairwise preference items with a 30-minute time limit (Nye et al., 2012).
1 See http://www.official-asvab.com/whattoexpect_app.htm [December 2014] for additional information on the test format.
The above discussion illustrates how advances in measurement technology have helped to increase the efficiency and precision of current assessments, which use structured multiple choice and forced-choice formats. Over the next 20 years, advances in computing capabilities will undoubtedly facilitate the development of more sophisticated psychometric models and better methods for combining data from structured assessments with auxiliary information gathered, for example, from personnel records, background questionnaires, social media, and even devices that can capture examinees’ physiological data during testing sessions (including, for example, the potential use of biomarkers as described in Chapter 10; biomarkers are discussed in more detail in Appendix C). The proliferation of mobile computing devices and Wi-Fi access will make it possible to test examinees in their natural environments but will also present new challenges for standardization—and thus challenges for the comparability of test scores used for personnel decisions. The emerging field of serious gaming offers potential for measuring examinee KSAOs with less-structured, highly engaging methods, which could yield vast amounts of streaming data that are best analyzed by methods currently used in physics or computer science, rather than methods used in psychology and education. The next sections of this chapter provide a snapshot of developments in psychometric modeling, gaming and simulation, and Big Data analytics, which the committee believes merit serious attention in the Army’s long-term research agenda.
IRT methods provide the mathematical foundation for many of today’s most sophisticated structured assessments. IRT models relate the properties of test items (e.g., difficulty/extremity) and examinee trait levels (e.g., KSAOs such as math, verbal, and spatial abilities [see Chapter 4], conscientiousness, emotional stability, and motivation) to the probability of correctly answering or endorsing items. For practical reasons, most large-scale tests have been constructed, scored, and/or evaluated using unidimensional IRT models, which assume that item responding is a function of just one ability or dimension. To obtain a profile of scores representing an examinee’s proficiency in several areas, a sequence of unidimensional tests is typically administered, with each being sufficiently long to achieve an acceptable level of reliability. The broader (more heterogeneous) the constructs measured, the more items that are needed.
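As an illustration of this basic relation, the three-parameter logistic (3PL) model widely used for multiple-choice ability items expresses the probability of a correct response as a function of the examinee’s trait level and the item’s discrimination, difficulty, and guessing parameters. All numbers below are hypothetical:

```python
import math

def prob_correct(theta, a, b, c=0.0):
    """3PL model: P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b))),
    where theta is the trait level, a the discrimination, b the
    difficulty, and c a lower asymptote reflecting guessing."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# An item of average difficulty (b = 0) with some guessing (c = 0.2):
low  = prob_correct(-2.0, a=1.5, b=0.0, c=0.2)  # low-ability examinee
high = prob_correct( 2.0, a=1.5, b=0.0, c=0.2)  # high-ability examinee
```

Setting c = 0 recovers the two-parameter logistic model, and fixing a as well yields the one-parameter (Rasch-type) special case, so this single expression covers the unidimensional dominance models discussed throughout the chapter.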
Research has shown that violating statistical assumptions by applying unidimensional models to tests that unwittingly or intentionally measure weak to moderate secondary dimensions (e.g., measuring mathematical reasoning with word problems that require language proficiency) does not greatly diminish the accuracy of IRT trait scores (Drasgow and Parsons, 1983) or their correlations with outcome variables (e.g., Drasgow, 1982). However, doing so can contribute to biases (also called “differential item and test functioning”) that disadvantage subpopulations of examinees who are lower in proficiency on the unaccounted-for secondary dimensions (e.g., Camilli, 1992; Shealy and Stout, 1993). Moreover, when abilities are highly correlated, administering a sequence of unidimensional tests is inefficient; Multidimensional Item Response Theory (MIRT) methods for scoring responses and for selecting items in CATs can reduce the overall number of items administered and increase measurement precision.
MIRT models conceptualize item responding as a function of multiple correlated dimensions (see, for example, Ackerman, 1989; 1991; Reckase, 2009; Reckase and McKinley, 1991; Reckase et al., 1988). Some items are viewed as factorially complex (i.e., they measure more than one dimension), whereas other items are factorially pure (they measure just one dimension). The probability of correct or positive item responses is portrayed as a function of examinee proficiency along multiple dimensions, overall item difficulty/extremity associated with the item content, and the degree to which items are sensitive to variance in proficiency along the dimensions they assess (as indicated by item discrimination coefficients in IRT; e.g., how well they measure, discriminate, or “load on” the intended factors).
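A minimal sketch of a compensatory multidimensional 2PL item makes the distinction between factorially complex and factorially pure items concrete; the item parameters below are hypothetical:

```python
import math

def mirt_prob(theta, a, d):
    """Compensatory multidimensional 2PL: the logit is a weighted sum of
    the examinee's standing on each dimension (weights a, the
    discrimination/loading vector) plus an intercept d, so high standing
    on one dimension can offset low standing on another."""
    logit = sum(ak * tk for ak, tk in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-logit))

# A factorially complex item loading on two dimensions (e.g., a math word
# problem requiring both quantitative and verbal proficiency):
complex_item = dict(a=[1.0, 0.6], d=-0.5)
# A factorially pure item measuring only the first dimension:
pure_item = dict(a=[1.2, 0.0], d=0.0)

p_complex = mirt_prob([1.0, -1.0], **complex_item)       # strengths offset
p_pure_hi_verbal = mirt_prob([0.0, 2.0], **pure_item)    # verbal irrelevant
p_pure_lo_verbal = mirt_prob([0.0, -2.0], **pure_item)
```

For the pure item the second entry of `a` is zero, so the response probability is unaffected by the second dimension, whereas the complex item’s probability reflects both proficiencies at once.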
MIRT models have value from a purely diagnostic standpoint. They may reveal characteristics that are obscured by unidimensional models and aid item generation and test revision. They may help explain examinee performance (e.g., Cheng, 2009), which is particularly helpful when a test exhibits adverse impact or differential item or test functioning across demographic groups. Their clearest practical benefit for personnel selection, however, lies in potentially improving test efficiency. MIRT scoring methods allow item responses from one ability or trait to serve as auxiliary or collateral information that informs the responses for other correlated abilities or traits, and this serves to increase overall measurement efficiency and precision (e.g., de la Torre, 2008, 2009; de la Torre and Patz, 2005). MIRT methods therefore not only get more information out of nonadaptive tests, they also reduce the number of items needed in CAT applications. As shown in simulation studies, CATs based on MIRT methods can attain measurement precision goals with even fewer items than unidimensional CATs (Segall, 1996, 2001a; Yao, 2013). Moreover, collateral information provided by data collected before a testing session (e.g., from application
blanks or personnel records) can further increase efficiency by providing better starting values for adaptive item selection, and methods that use response times, as well as examinee answers, to improve scoring are emerging (e.g., Ranger and Kuhn, 2012; van der Linden, 2008; van der Linden et al., 2010).
One particular class of MIRT models that is growing in popularity due to increased interest in personality and other noncognitive testing is MFC models. One of the earliest was the multi-unidimensional pairwise preference IRT model (Stark, 2002; Stark et al., 2005, 2012b), which de la Torre and colleagues (2012) recently generalized as the PICK and RANK models for preferential choice and rank responses among blocks of statements (e.g., pairs, triplets, tetrads). Another example is the Thurstonian model by Brown and Maydeu-Olivares (2011, 2012, 2013), which can be expressed using an IRT or common factor model parameterization.
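A sketch in the spirit of such pairwise-preference models follows. The single-statement curve is a stylized single-peaked (ideal-point) function rather than the GGUM-type curves used operationally, and the statement locations and trait levels are hypothetical, so the numbers are purely illustrative:

```python
import math

def endorse_prob(theta, delta):
    """Stylized single-peaked (ideal-point) probability of endorsing a
    single statement located at delta (illustrative only)."""
    return math.exp(-0.5 * (theta - delta) ** 2)

def prefer_s_over_t(theta_s, delta_s, theta_t, delta_t):
    """Pairwise-preference probability in the style of the
    multi-unidimensional pairwise preference approach: statement s is
    preferred when s would be endorsed and t would not, conditioning on
    exactly one of the two statements being endorsed."""
    ps = endorse_prob(theta_s, delta_s)
    pt = endorse_prob(theta_t, delta_t)
    return (ps * (1.0 - pt)) / (ps * (1.0 - pt) + (1.0 - ps) * pt)

# Examinee near the location of s on its dimension, far below t on its own:
high_on_s = prefer_s_over_t(1.8, 2.0, -1.0, 2.0)
# Identical standing and locations on both dimensions yields indifference:
indifferent = prefer_s_over_t(0.5, 1.0, 0.5, 1.0)
```

Note that each statement in the pair is evaluated on its own dimension (theta_s and theta_t), which is what makes the format multidimensional even though each single-statement curve is unidimensional.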
These models have been developed for constructing, calibrating, and scoring MFC measures that are intended to reduce response biases, such as socially desirable responding, that are especially prevalent in personnel selection and promotion environments (Hough et al., 1990; Stark et al., 2012a). In these models, examinees must choose or rank statements within each block, based on how well the statements describe, for example, their thoughts, feelings, or actions. However, statements within a block typically represent different constructs, and they are matched on perceived social desirability to make it more difficult for examinees to “fake good” (for example, sometimes examinees have to choose or rank a set of response options where no option is especially desirable). This could be particularly useful in assessments of constructs such as defensive reactivity and emotion regulation, as described in Chapter 6, Hot Cognition.
In contrast to classical test theory scoring methods that historically proved problematic, these model-based MFC methodologies have been shown to yield normative scores that are suitable for inter-individual as well as intra-individual comparisons. This research, however, is still in its early stages. Gaps remain in understanding the intricacies and implications of test construction practices; the capabilities of parameter estimation procedures with tests of different dimensionality, length, and sample size; how to efficiently calibrate item pools, select items, and control exposure of items with CAT; how to create parallel nonadaptive test forms; how to equate alternative test forms; how to test for measurement invariance; and how to judge the seriousness of violations of test-construction specifications or constraints. In short, all of the questions that have been explored for decades with unidimensional IRT models need to be answered for MFC, and more generally, MIRT models. In addition, although the benefits of unidimensional dominance and ideal-point (Coombs, 1964) IRT models for noncognitive testing have been discussed in many papers over the past
two decades (e.g., Andrich, 1988, 1996; Drasgow et al., 2010; Roberts et al., 1999; Stark et al., 2006; Tay et al., 2011), there is still much to learn about their use as a basis for MFC applications.
In addition to improving measurement through better models for item responding, test delivery, and scoring, there is a rapidly growing need for methods that can detect aberrant responding, which includes faking or careless responding, and methods for detecting potential item and test compromises that stem from overuse and sudden, outright security breaches. In the 1980s, many heuristic methods for detecting aberrance and item compromise (for reviews, see Hulin et al., 1983; Karabatsos, 2003; Meade and Craig, 2012; Meijer and Sijtsma, 1995) fell into disuse due to the advent of more effective IRT-based methods, which not only flag suspect response patterns but in some cases yield powerful test statistics that can serve as benchmarks for simpler methods under different testing conditions (Drasgow, 1982; Drasgow and Levine, 1986; Drasgow et al., 1987, 1991; Levine and Drasgow, 1988). Drasgow and colleagues’ (1985) standardized log likelihood statistic (lz) became one of the more popular early IRT indices because it was effective for detecting spuriously high- and low-ability scores on nonadaptive cognitive tests and because it could be used not only with dichotomous unidimensional IRT models but also with polytomous unidimensional models and multidimensional test batteries. Over time, researchers began exploring noncognitive applications with the goal of detecting faking, untraitedness or random responding, or unspecified person misfit (e.g., Ferrando and Chico, 2001; Reise, 1995; Reise and Flannery, 1996; Zickar and Drasgow, 1996). Researchers also began examining the efficacy of lz and newer aberrance detection methods with CAT (Egberink et al., 2010; Nering, 1997). By and large, these studies have shown that faking can be difficult to detect because response distortion that is consistent across items is confounded with trait scores.
Similarly, because CAT algorithms typically match item extremity to a respondent’s trait level, there are too few opportunities to observe inconsistencies between observed and predicted responses to yield adequate power for aberrance detection (Lee et al., 2014). Consequently, as noncognitive tests and CAT applications become more common, new methods for detecting aberrant response patterns will be needed, as will research that examines the tradeoffs of incorporating items into CATs that may reduce test efficiency for the sake of improving detection. (For more information about faking, detection, and potential solutions, see Ziegler et al., 2012.) In the future, it may be possible that potential uses of neuroscience-based measures marking psychological states (as described in Chapter 10) could be one tool for such detection.
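For reference, the lz index discussed above can be computed directly for dichotomous responses: the observed response log likelihood is standardized by its model-implied mean and variance. The response probabilities below are hypothetical:

```python
import math

def lz_statistic(responses, probs):
    """Standardized log-likelihood person-fit index (lz) for dichotomous
    responses. probs[i] is the model-implied probability of a positive
    response to item i at the examinee's estimated trait level. Large
    negative values flag aberrant (model-inconsistent) patterns."""
    l0 = sum(math.log(p) if u == 1 else math.log(1.0 - p)
             for u, p in zip(responses, probs))
    mean = sum(p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
               for p in probs)
    var = sum(p * (1.0 - p) * math.log(p / (1.0 - p)) ** 2 for p in probs)
    return (l0 - mean) / math.sqrt(var)

# The model expects this examinee to pass the easier items and miss the
# harder ones; a reversed pattern is strongly aberrant:
probs = [0.9, 0.8, 0.8, 0.7, 0.6, 0.6, 0.4, 0.3]
consistent = lz_statistic([1, 1, 1, 1, 1, 0, 0, 0], probs)
aberrant   = lz_statistic([0, 0, 0, 0, 0, 1, 1, 1], probs)
```

The sketch also illustrates the CAT limitation noted above: when adaptive item selection keeps every p near .5, the variance term shrinks toward its per-item maximum while expected and observed likelihoods converge, leaving little room for responses to look surprising.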
A closely related and perhaps even more important research area for the Army is the detection of aberrant responding and test compromise in connection with unproctored Internet testing (Bartram, 2008; International
Test Commission, 2006; Tippins et al., 2006). Although it is highly unlikely that proctored testing at military entrance processing stations and mobile enlistment testing sites will be entirely obviated in the foreseeable future, it may eventually prove advantageous to prescreen applicants or credential existing service members on personal computing devices, just as corporations are accepting unproctored Internet testing as a way of attracting and processing more applicants, credentialing boards are embracing online continuing education, and universities are expanding online course offerings even for advanced degree credits. With mobile computing device capabilities and sales so rapidly increasing, it may simply become a necessity, especially with an all-volunteer workforce, to make pre-enlistment testing as convenient as possible.
The implications are that, in the modern age of testing that includes CAT and unproctored Internet testing, it will be necessary to consider and conceivably adopt some or all of the following approaches:
- Vet individual scores using aberrance detection and verification testing approaches (Segall, 2001b; Tippins et al., 2006; Way, 1998).
- Protect test content by improving item selection and exposure control methods (Barrada et al., 2010; Hsu et al., 2013; Lee et al., 2008).
- Construct and replenish item pools quickly using automatic item generation methods (e.g., Gierl and Lai, 2012; Irvine and Kyllonen, 2002).
- Automatically assemble tests that meet detailed design specifications (e.g., van der Linden, 2005; van der Linden and Diao, 2011; Veldkamp and van der Linden, 2002).
- Monitor item and test properties to detect compromise (Cizek, 1999; McLeod et al., 2003; Segall, 2002; Yi et al., 2008).
- Actively scan Internet blogs, chat rooms, and websites that provide coaching tips, answer strategies, and realistic or actual items, any of which might point to individual or organized test compromise efforts (Bartram, 2009; Foster, 2009; Guo et al., 2009).
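To illustrate the exposure-control idea in the second of these approaches, the sketch below implements a probabilistic selection rule in the spirit of Sympson and Hetter's well-known method (a technique not cited above, named here for clarity): each candidate item, considered in order of information, is actually administered only with a preset probability, which caps its long-run exposure rate. The item pool, control parameters, and simulation are all hypothetical.

```python
import math
import random

def information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_item(theta, pool, k, rng):
    """Pick an item for administration under probabilistic exposure control."""
    ranked = sorted(pool, key=lambda i: -information(theta, *pool[i]))
    for i in ranked:
        if rng.random() <= k[i]:  # administer with probability k[i]
            return i
    return ranked[-1]  # if every candidate was passed over, give the last one

# Hypothetical 20-item pool; item 0 is by far the most informative at theta = 0.
pool = {0: (2.5, 0.0)}
pool.update({i: (1.0, (i - 10) / 5.0) for i in range(1, 20)})
k = {i: 0.3 for i in pool}  # exposure-control parameter for each item
rng = random.Random(7)
counts = {i: 0 for i in pool}
for _ in range(2000):  # 2,000 simulated examinees, one selection each
    counts[select_item(0.0, pool, k, rng)] += 1
# Without control, item 0 would be administered every time; with it,
# its observed exposure rate stays near 0.3.
print(counts[0] / 2000)
```

In operational use the k parameters are themselves tuned by simulation so that no item's exposure exceeds a target rate.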
Web-based and mobile CATs—sometimes referred to as eCATs—have been developed to provide efficient, possibly on-demand, screening of examinees in their natural environments and in settings that may not be conducive to traditional forms of test administration (an important consideration for assessments such as the situational judgment tests described in the following chapter). Using a Wi-Fi-enabled device, examinees can complete a CAT that runs on a remote server or via a mobile application downloaded to a tablet computer or smartphone. Such applications are growing rapidly in health care settings because they can be used to assess patients on a variety of physical and psychological well-being indicators just before consultations with health care practitioners, as well as to monitor symptoms and responses to treatments between office visits. (For a prominent example of web-based CAT in health care, readers may consult Cella and colleagues (2010) or Riley and colleagues, who discuss the Patient-Reported Outcomes Measurement Information System [PROMIS] initiative funded by the National Institutes of Health.)
Web-based and mobile CATs are also becoming common in workplace contexts. Example uses include screening job seekers for minimal skills before inviting them for an interview, measuring job knowledge or the effects of training, and developing intelligent tutoring systems (Chernyshenko and Stark, in press). Although less common in the military, web-based and mobile CATs have been developed to measure job-related KSAOs among incumbents (e.g., the Computer Adaptive Screening Test, or CAST; see Horgen et al., 2013; Knapp and Pliske, 1986; McBride and Cooper, 1999) and to develop soldier-centered training systems involving, for example, mobile, virtual classrooms and collaborative-scenario training environments (e.g., TRAIN II; Murphy et al., 2013).
In addition to standardization and fairness issues surrounding mobile assessments, which will take many years to explore, an immediate and more obvious concern is the exposure of items that will be used for decision making. However, CAT item selection algorithms are available that guard against the high item exposure rates that raise the probability of item content breaches.
Examining test overlap2 (Chang and Zhang, 2002; Way, 1998) provides a sense of item exposure; uniform exposure of items is characteristic of minimal test overlap (Chen et al., 2003). Wang and colleagues (2014) expanded on test overlap by examining the utility of its standard deviation. Their analyses showed that although two tests may have similar mean overlap, a smaller standard deviation indicates that the number of items shared between applicants is more uniform and that the advantage of retaking the test at a later time is minimized. In addition to optimizing measurement precision, CAT item selection approaches are also evaluated in terms of the security of the item bank. Barrada and colleagues (2011) found that, for nonstratified item banks, selecting items to minimize the distance between the respondent's trait level and the items' difficulty (Li and Schafer, 2005) offered greater test security than did selection based on the Fisher information function (Lord, 1980).
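Both overlap statistics can be computed directly from records of which items each examinee received; a minimal sketch, using hypothetical five-item tests:

```python
from itertools import combinations

def overlap_stats(tests):
    """Mean and standard deviation of between-test overlap.

    tests: one set of administered item IDs per examinee; overlap for a
    pair of examinees is the proportion of one test's items that also
    appear on the other (equal-length tests assumed, per the footnoted
    definition).
    """
    rates = [len(s & t) / len(s) for s, t in combinations(tests, 2)]
    mean = sum(rates) / len(rates)
    variance = sum((r - mean) ** 2 for r in rates) / len(rates)
    return mean, variance ** 0.5

# Hypothetical five-item CATs administered to four examinees.
tests = [{1, 2, 3, 4, 5}, {1, 2, 3, 6, 7}, {1, 8, 9, 10, 11}, {2, 3, 4, 5, 12}]
mean, sd = overlap_stats(tests)
print(mean, sd)  # mean overlap across the six possible pairs, and its spread
```

A large standard deviation here would indicate that some pairs of examinees share many items even when the mean overlap looks acceptable.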
2 Test overlap = mean of between-test overlap (proportion of items on one administration that appear on another) across all possible pairs of respondents.
Further investigation will be necessary to determine (a) how CAT item selection approaches might enhance or compromise test security, (b) what methods can verify and ensure the identity of eCAT test takers as well as the security of the testing setting in advance of beginning an eCAT, (c) what cybersecurity approaches can guard against hacking, and (d) how automatic item generation might enhance security by expanding item pools and continually replacing frequently administered items. A return-on-investment analysis (a comparison of the magnitude and timing of gains from investing in such testing with the magnitude and timing of investment costs) might be a good starting point for considering a web-based eCAT, followed by a more complex examination of eCAT security, building on customary approaches and new knowledge.
This section has provided a brief review of historical and recent developments in psychometrics that are directly applicable to tests involving structured item formats. The next section delves into technology developments that offer new opportunities for engaging examinees and perhaps for reducing the response biases associated with self-report measures. The interactive, dynamic nature of these assessments, however, presents challenges in addition to opportunities: however scores are computed, standardization will need to be addressed to ensure their comparability, and Big Data methods will probably be needed to parse the gigabytes of data that each assessment will generate.
In contrast with the deep and long-standing tradition of self-report measures and ratings from peers and supervisors, technology advances such as those enabling immersive, realistic simulations and serious gaming provide opportunities for examinees to demonstrate knowledge, emotions, and interactions through their behavior as it is expressed within rich and often realistic scenarios (National Research Council, 2011). Such assessment could be especially productive for constructs, such as those described in Sections 2–4 of this report (Chapters 2–7), that may not be effectively or efficiently assessed through standard or even computer adaptive testing. As Landers (2013) described, simulations and serious gaming are related but have some important distinctions.
Simulations, which may involve physical or computer-based re-creations of real-life environments, are constructed representations of situations in which a task must be performed (with potential utility in assessments such as those of spatial abilities, as described in Chapter 4). They typically involve freedom of choice as well as risk and reward. Simulations can involve systems created solely for the purpose of training, or they can use systems that replicate those used in actual practice; one well-known example of the latter is the flight simulator that replicates the actual instrument panel of a particular aircraft model. Simulations allow learners to apply their knowledge and to practice important job-related skills in conditions that involve lower risk and possibly lower cost than real-life situations. An important consideration in the design of simulations is the degree to which psychological and physical fidelity to real-life situations can be achieved.
Serious games are similar to simulations but often involve more narrative, and fidelity to real-life situations may be reduced in order to increase user engagement. The U.S. Army currently uses several simulations and serious games as recruitment and training tools (Landers, 2013), but over the long term, research would be needed to determine whether gaming experience unduly influences scores and validities of the assessments for high-stakes uses. Through the use of technology in assessments, it will be possible to collect vast amounts of data about critical behaviors: behaviors that predict organizational outcomes and may in fact closely resemble the outcome itself (e.g., job performance). Furthermore, serious games can confront examinees with unexpected phenomena that require adaptability (see Chapter 7 for a discussion of individual differences in adaptability) and the use of feedback to maintain or optimize performance.
Landers (2013) suggested that simulations and serious games offer potential for testing many skills of particular interest to the military: leadership, decision making, reasoning, spatial ability, persistence, creativity, and particular technical skills (many of which are discussed elsewhere in this report as recommended future research topics). Personality assessment may also be possible using serious games. Just as technology has opened the door to increasingly sophisticated item-administration and precise scoring algorithms (e.g., CAT applications), it is also changing methods of assessment through powerful advances in simulation and serious gaming. Sydell and colleagues (2013) asked what might be learned from an examination of keystrokes, mouse clicks, repetition of strategies, and responses to untimed tasks. As the authors suggested, some of this information may reflect novel predictors of employee outcomes such as performance, satisfaction, and turnover; however, it could also introduce contamination associated with environmental influences or irrelevant personal attributes. An integration of simulation and serious gaming with modern psychometric algorithms, based on some combination of IRT and Big Data methods, could be considered part of a learning analytics model that moves past a test of binary correct-versus-incorrect responses, capturing unique and rich sources of information relevant to performance and to the 21st-century skills that elude conventional assessment (Bennett, 2010; Redecker and Johannessen, 2013; see also this report's discussions of performance under stress [Chapter 6], adaptive behavior [Chapter 7], and teamwork behavior [Chapter 5]).
Use of these technologies for assessment is relatively new, but simulation and serious gaming have a longer history as instructional and learning supports. Kevin Corti of PIXEL Learning has been quoted as saying that serious games “will not grow as an industry unless the learning experience is definable, quantifiable, and measurable. Assessment is the future of serious games” (Bente and Breuer, 2009, p. 327). The military is already engaging in such assessments as tests of specific job/task performance in conjunction with training programs, such as performance after military medical training and after flood or fire emergency training on naval ships (Iseli et al., 2010; Koenig et al., 2010, 2013). The reports on these tests provide a framework in terms of scoring systems, performance assessment, and the incorporation of learning from mistakes.
Sydell and colleagues (2013) outlined an assessment approach for simulation and serious gaming that includes identifying what is to be assessed at different levels of the simulated scenario; developing a broad developmental rubric of the measured domain(s), their components, and their relationships with one another (in effect, a theoretical model); applying a Bayesian network (Levy and Mislevy, 2004) that empirically models the probabilistic and dynamic associations among measured variables; and developing, vetting, and using various types of score generation tools.
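The Bayesian-network step can be illustrated in radically simplified form with a single mastery variable updated by scored task outcomes; the conditional probabilities below are illustrative only, and a full network of the kind described by Levy and Mislevy (2004) links many such variables dynamically.

```python
def update_mastery(prior, outcome, p_master=0.8, p_nonmaster=0.2):
    """Bayes-rule update of P(mastery) after one scored task.

    p_master is the assumed probability that a master succeeds on the
    task; p_nonmaster is the probability that a non-master does. Both
    are illustrative values, not calibrated estimates.
    """
    like_m = p_master if outcome else 1.0 - p_master
    like_n = p_nonmaster if outcome else 1.0 - p_nonmaster
    return like_m * prior / (like_m * prior + like_n * (1.0 - prior))

belief = 0.5  # uninformative prior on mastery
for outcome in [1, 1, 0, 1]:  # hypothetical scored events within a scenario
    belief = update_mastery(belief, outcome)
print(belief)  # belief rises with successes and dips after the failure
```

Each update is one application of Bayes' rule; a network generalizes this by propagating such updates across many linked skill and evidence variables.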
Mislevy advocated the development of simulation-based assessment that is rooted in solid design and psychometrics, as opposed to applying a data mining approach after the simulation is built (Mislevy, 2013; Mislevy et al., 2012). Similar to the approach of Sydell and colleagues (2013), Mislevy's framework, referred to as Evidence-Centered Design, or ECD (Mislevy and Riconscente, 2006; Mislevy et al., 2003), identifies operational layers of the assessment process: domain analysis; domain modeling; specification of a conceptual assessment framework; assessment implementation; and assessment delivery (see Mislevy, 2013, for a description of the ECD approach). IRT is emphasized as critical to the ECD assessment implementation layer, as are CAT applications in the assessment delivery layer. (For a description of Epistemic Network Analysis, a method for assessing user performance based on ECD, see Shaffer et al., 2009.)
Emerging advances in IRT and constrained optimization procedures will undoubtedly be helpful in developing assessment approaches for simulation and serious gaming. Furthermore, IRT could be used to optimize serious gaming by calibrating scenarios and tasks, using information functions to optimally choose activities for examinees to complete, and using trait scores to route examinees through gaming levels ranging from novice to expert, much like traditional CAT applications (Batista et al., 2013).
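As a sketch of how such routing might work, the code below computes an expected a posteriori (EAP) trait estimate from calibrated scenario outcomes and routes the examinee to a gaming level; the scenario parameters, prior, and cutoffs are all illustrative assumptions rather than values from any cited system.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # quadrature grid over theta

def p_2pl(theta, a, b):
    """2PL probability of success on a calibrated scenario."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap(outcomes):
    """EAP trait estimate under a standard normal prior.

    outcomes: (a, b, score) triples, where a and b are a scenario's
    calibrated 2PL parameters and score is 0 or 1.
    """
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # N(0, 1) prior, up to a constant
        for a, b, u in outcomes:
            p = p_2pl(t, a, b)
            w *= p if u else 1.0 - p
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(GRID, weights)) / total

def next_level(theta_hat, cutoffs=(-0.5, 0.5)):
    """Route to a gaming level from the current trait estimate."""
    if theta_hat < cutoffs[0]:
        return "novice"
    if theta_hat < cutoffs[1]:
        return "intermediate"
    return "expert"

history = [(1.2, -0.5, 1), (1.0, 0.0, 1), (1.4, 0.8, 1)]  # three successes
theta_hat = eap(history)
print(theta_hat, next_level(theta_hat))
```

After each completed scenario, the estimate is refreshed and the routing rule reapplied, mirroring the administer-estimate-select loop of a traditional CAT.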
Applications for simulation and serious gaming are growing. Surgical procedures are practiced in simulated venues, as is the landing of commercial and military aircraft under challenging circumstances, the management of business and economic scenarios under unanticipated conditions, and other learning activities where practice builds skill. Integrated assessment within simulation and serious gaming is gaining traction. Simulation-based assessment focused on patient problems is currently part of a computer-based medical licensing exam in the United States (Dillon and Clauser, 2009), and the Army Research Laboratory uses simulation in its Generalized Intelligent Framework for Tutoring. Still in its infancy is the investigation of what can be learned about a simulation “player” not only from that player's performance and success in simulation and gaming outcomes but also from the player's keystrokes, mouse clicks, strategy selection (e.g., repetition versus innovation), and behavior on time-constrained versus untimed tasks (see the discussion of performance under stress in Chapter 6).
In addition to examining behavioral performance representations, sensors can be used to assess physiological responses and biomarkers, such as galvanic skin response, facial electromyography, electroencephalography, and cardiac activity (Nacke, 2009). (Appendix D describes many of the potentially relevant neuroscience measurement technologies; Chapter 10 has further discussion of potential assessments of individual differences using these technologies.) Examining the use of dynamic Bayesian networks in the contexts of decision making, situational judgment, communication approach, and management of uncertainty has potential value to the Army in making decisions on selection and assignment. In short, although feasibility considerations for large-scale screening would need to be resolved, the committee believes dynamic interactive assessments such as simulation and serious gaming provide two productive investigative areas to better understand what potential recruits actually do in simulated realistic settings as opposed to what they self-report they would do on traditional assessment questionnaires.
Modern measurement methods come with the promise of increasing precision, validity, efficiency, and security of current, emerging, and future forms of assessment. The U.S. Army Research Institute for the Behavioral and Social Sciences should continue to support developments to advance psychometric methods and data analytics.
- Potential topics of research on Item Response Theory (IRT) include the use of multidimensional IRT models, the application of rank and preference methods, and the estimation of applicant standing on the attributes of interest with greater efficiency (e.g., via automatic item generation, automated test assembly, detecting item pool compromise, multidimensional test equating, using background information in trait estimation).
- Ecological momentary assessments (e.g., experience sampling) and dynamic interactive assessments (e.g., team interaction, gaming, and simulation) yield vast amounts of examinee data, and future research should explore the new challenges and opportunities for innovation in psychometric and Big Data analytics.
- Big Data analytics also may play an increasingly important role as candidate data from multiple diverse sources become increasingly available. Big Data methods designed to find structure in datasets with many more columns (variables) than rows (candidates) might help identify robust variables, important new constructs, interactions between constructs, and nonlinear relationships between those constructs and candidate outcomes.
Ackerman, T.A. (1989). Unidimensional IRT calibration of compensatory and noncompensatory multidimensional items. Applied Psychological Measurement, 13(2):113–127.
Ackerman, T.A. (1991). The use of unidimensional parameter estimates of multidimensional items in adaptive testing. Applied Psychological Measurement, 15(1):13–24.
Andrich, D. (1988). The application of an unfolding model of the PIRT type to the measurement of attitude. Applied Psychological Measurement, 12(1):33–51.
Andrich, D. (1996). A hyperbolic cosine latent trait model for unfolding polytomous responses: Reconciling Thurstone and Likert methodologies. British Journal of Mathematical and Statistical Psychology, 49(2):347–365.
Barrada, J.R., J. Olea, V. Ponsoda, and J. Abad. (2010). A method for comparison of item selection rules in computerized adaptive testing. Applied Psychological Measurement, 34(6):438–452.
Barrada, J.R., J. Abad, and J. Olea. (2011). Varying the valuating function and the presentable bank in computerized adaptive testing. Spanish Journal of Psychology, 14(1):500–508.
Bartram, D. (2008). The advantages and disadvantages of on-line testing. In S. Cartwright and C.L. Cooper, Eds., The Oxford Handbook of Personnel Psychology (pp. 234–260). Oxford, UK: Oxford University Press.
Bartram, D. (2009). The International Test Commission guidelines on computer-based and internet-delivered testing. Industrial and Organizational Psychology, 2(1):11–13.
Batista, M.H.E., J.L.V. Barbosa, J.E. Tavares, and J.L. Hackenhaar. (2013). Using the item response theory (IRT) for educational evaluation through games. International Journal of Information and Communication Technology Education, 9(3):27–41.
Bennett, R.E. (2010). Technology for large-scale assessment. In P. Peterson, E. Baker, and B. McGaw, Eds., International Encyclopedia of Education (3rd ed., vol. 8, pp. 48–55). Oxford, UK: Elsevier.
Bente, G., and J. Breuer. (2009). Making the implicit explicit: Embedding measurement in serious games. In U. Ritterfeld, M. Cody, and P. Vorderer, Eds., Serious Games: Mechanisms and Effects (pp. 322–343). New York: Routledge.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F.M. Lord and M.R. Novick, Eds., Statistical Theories of Mental Test Scores (pp. 395–479). Reading, MA: Addison-Wesley.
Breiman, L., J. Friedman, R. Olshen, and C. Stone. (1997). CART (version 4.0) [Computer program and documentation]. San Diego, CA: Salford Systems.
Brown, A., and A. Maydeu-Olivares. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3):460–502.
Brown, A., and A. Maydeu-Olivares. (2012). Fitting a Thurstonian IRT model to forced-choice data using Mplus. Behavior Research Methods, 44(4):1,135–1,147.
Brown, A., and A. Maydeu-Olivares. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18(1):36–52.
Camilli, G. (1992). A conceptual analysis of differential item functioning in terms of a multidimensional item response model. Applied Psychological Measurement, 16(2):129–147.
Campbell, J.P. (1990). An overview of the Army selection and classification project (Project A). Personnel Psychology, 43(2):231–239.
Campbell, J.P., and D.J. Knapp, Eds. (2001). Exploring the Limits in Personnel Selection and Classification. Mahwah, NJ: Lawrence Erlbaum Associates.
Cella, D., N. Rothrock, S. Choi, J.S. Lai, S. Yount, and R. Gershon. (2010). PROMIS overview: Development of new tools for measuring health-related quality of life and related outcomes in patients with chronic diseases. Annals of Behavioral Medicine, 39(Annual Meeting Supplement 1):s47.
Chang, H.H., and J. Zhang. (2002). Hypergeometric family and item overlap rates in computerized adaptive testing. Psychometrika, 67(3):387–398.
Chen, S.Y., R.D. Ankenmann, and J.A. Spray. (2003). The relationship between item exposure and test overlap in computerized adaptive testing. Journal of Educational Measurement, 40(2):129–145.
Cheng, Y. (2009). When cognitive diagnosis meets computerized adaptive testing: CD-CAT. Psychometrika, 74(4):619–632.
Chernyshenko, O.S., and S. Stark (in press). Mobile psychological assessment. In F. Drasgow, Ed., Technology and Testing: Improving Educational and Psychological Measurement, Vol. 2 (NCME Book Series). Hoboken, NJ: Wiley-Blackwell. Available: http://ncme.org/publications/ncme-book-series/ [December 2014].
Chernyshenko, O.S., S. Stark, K.Y. Chan, F. Drasgow, and B.A. Williams. (2001). Fitting Item Response Theory models to two personality inventories: Issues and insights. Multivariate Behavioral Research, 36(4):523–562.
Chernyshenko, O.S., S. Stark, F. Drasgow, and B.W. Roberts. (2007). Constructing personality scales under the assumptions of an ideal point response process: Toward increasing the flexibility of personality measures. Psychological Assessment, 19(1):88–106.
Cizek, G.J. (1999). Cheating on Tests: How to Do It, Detect It, and Prevent It. Mahwah, NJ: Lawrence Erlbaum Associates.
Coombs, C.H. (1964). A Theory of Data. New York: Wiley & Sons.
Dalal, D.K., N.T. Carter, and C.J. Lake. (2014). Middle response scale options are inappropriate for ideal point scales. Journal of Business and Psychology, 29(3):463–478.
de la Torre, J. (2008). Multidimensional scoring of abilities: The ordered polytomous response case. Applied Psychological Measurement, 32(5):355–370.
de la Torre, J. (2009). Improving the quality of ability estimates through multidimensional scoring and incorporation of ancillary variables. Applied Psychological Measurement, 33(6):465–485.
de la Torre, J., and R.J. Patz. (2005). Making the most of what we have: A practical application of multidimensional Item Response Theory in test scoring. Journal of Educational and Behavioral Statistics, 30(3):295–311.
de la Torre, J., V. Ponsoda, I. Leenen, and P. Hontangas. (2012, April). Examining the Viability of Recent Models for Forced-Choice Data. Presented at the Meeting of the American Educational Research Association, Vancouver, British Columbia, Canada. Available: http://www.aera.net/tabid/13128/Default.aspx [February 2015].
Digman, J.M. (1990). Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41:417–440.
Dillon, G.F., and B.E. Clauser. (2009). Computer-delivered patient simulations in the United States Medical Licensing Examination (USMLE). Simulation in Healthcare, 4(1):30–34.
Drasgow, F. (1982). Choice of test models for appropriateness measurement. Applied Psychological Measurement, 6(3):297–308.
Drasgow, F., and C.K. Parsons. (1983). Application of unidimensional item response theory models to multidimensional data. Applied Psychological Measurement, 7(2):189–199.
Drasgow, F., and M.V. Levine. (1986). Optimal detection of certain forms of inappropriate test scores. Applied Psychological Measurement, 10(1):59–67.
Drasgow, F., M.V. Levine, and E.A. Williams. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1):67–86.
Drasgow, F., M.V. Levine, and M.E. McLaughlin. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11(1):59–79.
Drasgow, F., M.V. Levine, and M.E. McLaughlin. (1991). Appropriateness measurement for multidimensional test batteries. Applied Psychological Measurement, 15(2):171–191.
Drasgow, F., W.C. Lee, S. Stark, and O.S. Chernyshenko. (2004). Alternative methodologies for predicting attrition in the Army: The new AIM scales. In D.J. Knapp, E.D. Heggestad, and M.C. Young, Eds., Understanding and Improving the Assessment of Individual Motivation (AIM) in the Army’s GED Plus Program (pp. 7–1 to 7–16). Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Drasgow, F., O.S. Chernyshenko, and S. Stark. (2010). 75 years after Likert: Thurstone was right (focal article). Industrial and Organizational Psychology, 3(4):465–476.
Drasgow, F., S. Stark, O.S. Chernyshenko, C.D. Nye, C.L. Hulin, and L.A. White. (2012). Development of the Tailored Adaptive Personality Assessment System (TAPAS) to Support Army Selection and Classification Decisions (Technical Report 1311). Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Egberink, J.L., R.R. Meijer, B.P. Veldkamp, L. Schakel, and N.G. Smid. (2010). Detection of aberrant item score patterns in computerized adaptive testing: An empirical example using the CUSUM. Personality and Individual Differences, 48(8):921–925.
Ferrando, P.J., and E. Chico. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61(6):997–1,012.
Flanagan, J. (1947). Scientific development of the use of human resources: Progress in the Army Air Forces. Science, 105(2,716):57–60.
Foster, D. (2009). Secure, online, high-stakes testing: Science fiction or business reality? Industrial and Organizational Psychology: Perspectives on Science and Practice, 2(1):31–34.
Gierl, M.J., and H. Lai. (2012). The role of item models in automatic item generation. International Journal of Testing, 12(3):273–298.
Goldberg, L.R. (1992). The development of markers of the Big Five factor structure. Psychological Assessment, 4(1):26–42.
Guo, J., L. Tay, and F. Drasgow. (2009). Conspiracies and test compromise: An evaluation of the resistance of test systems to small-scale cheating. International Journal of Testing, 9(4):283–309.
Harrell, T.W. (1992). Some history of the Army General Classification Test. Journal of Applied Psychology, 77(6):875–878.
Hicks, L.E. (1970). Some properties of ipsative, normative, and forced-choice normative measures. Psychological Bulletin, 74(3):167–184.
Horgen, K.E., C.D. Nye, L.A. White, K.A. LaPort, R.R. Hoffman, F. Drasgow, O.S. Chernyshenko, S. Stark, and J.S. Conway. (2013). Validation of the Noncommissioned Officer Special Assignment Battery (Technical Report 1328). Ft. Belvoir, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Hough, L.M., N.K. Eaton, M.D. Dunnette, J.D. Kamp, and R.A. McCloy. (1990). Criterion-related validities of personality constructs and the effect of response distortion on those validities. Journal of Applied Psychology, 75(5):581–595.
Hsu, C.L., W.C. Wang, and S.Y. Chen. (2013). Variable length computerized adaptive testing based on cognitive diagnosis models. Applied Psychological Measurement, 37(7): 563–582.
Huang, J., and A.D. Mead. (2014). Effect of personality item writing on psychometric properties of ideal-point and Likert scales. Psychological Assessment. Epub July 7, available: http://www.ncbi.nlm.nih.gov/pubmed/24999752 [December 2014].
Hulin, C.L., F. Drasgow, and C.K. Parsons. (1983). Item Response Theory: Application to Psychological Testing. Homewood, IL: Dow Jones-Irwin.
International Test Commission. (2006). International guidelines on computer-based and Internet delivered testing. International Journal of Testing, 6(2):143–172.
Irvine, S.H., and P.C. Kyllonen, Eds. (2002). Item Generation for Test Development. Mahwah, NJ: Lawrence Erlbaum Associates.
Iseli, M.R., A.D. Koenig, J.J. Lee, and R. Wainess. (2010). Automatic Assessment of Complex Task Performance in Games and Simulations (CRESST Report 775). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16(4):277–298.
Kass, R.A., K.J. Mitchell, F.C. Grafton, and H. Wing. (1983). Factorial Validity of the Armed Services Vocational Aptitude Battery (ASVAB), Forms 8, 9 and 10: 1981 Army Applicant Sample. Educational and Psychological Measurement, 43(4):1,077–1,087.
Knapp, D.J., and R.M. Pliske. (1986). Preliminary Report on a National Cross-Validation of the Computerized Adaptive Screening Test (CAST) (Research Rep. No. 1430). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Knapp, D.J., and T.S. Heffner, Eds. (2010). Expanded Enlistment Eligibility Metrics (EEEM): Recommendations on a Non-Cognitive Screen for New Soldier Selection (Technical Report 1267). Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Koenig, A.D., J.J. Lee, M. Iseli, and R. Wainess. (2010). A Conceptual Framework for Assessing Performance in Games and Simulations (CRESST Report 771). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.
Koenig, A.D., M. Iseli, R. Wainess, and J.J. Lee. (2013). Assessment methodology for computer-based instructional simulations. Military Medicine, 178(10S):47–54.
Landers, R.N. (2013). Serious Games, Simulations, and Simulation Games: Potential for Use in Candidate Assessment. Presentation during a data gathering session of the Committee on Measuring Human Capabilities: Performance Potential of Individuals and Collectives, National Research Council. Washington, DC. September 6. Presentation available upon request from the project’s public access file.
Lee, Y.H., E.H. Ip, and C-D. Fuh. (2008). A strategy for controlling item exposure in multidimensional computerized adaptive testing. Educational and Psychological Measurement, 68(2):215–232.
Lee, P., S. Stark, and O.S. Chernyshenko. (2014). Detecting aberrant responding on unidimensional pairwise preference tests: An application of lz based on the Zinnes-Griggs ideal point IRT model. Applied Psychological Measurement, 38(5):391–403.
Levine, M.V., and F. Drasgow. (1988). Optimal appropriateness measurement. Psychometrika, 53(2):161–176.
Levy, R., and R.J. Mislevy. (2004). Specifying and refining a measurement model for a computer-based interactive assessment. International Journal of Testing, 4(4):333–369.
Li, Y.H., and W.D. Schafer. (2005). Increasing the homogeneity of CAT’s item-exposure rates by minimizing or maximizing varied target functions while assembling shadow tests. Journal of Educational Measurement, 42(3):245–269.
Likert, R. (1932). A technique for the measurement of attitudes. In R.S. Woodworth, Ed., Archives of Psychology (no. 140, pp. 5–55). New York: Columbia University.
Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. Mahwah, NJ: Lawrence Erlbaum Associates.
Maier, M. (1993). Military Aptitude Testing: The Past Fifty Years (DMDC No. 93-007). Monterey, CA: Defense Manpower Data Center.
McArdle, J.J. (1994). Structural factor analysis experiments with incomplete data. Multivariate Behavioral Research, 29(4):409–454.
McBride, J.R., and R.R. Cooper. (1999). Modification of the Computer Adaptive Screening Test (CAST) for Use by Recruiters in All Military Services (ARI Research Note 99-25). Alexandria, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
McLeod, L., C. Lewis, and D. Thissen. (2003). A Bayesian method for the detection of item preknowledge in computerized adaptive testing. Applied Psychological Measurement, 27(2):121–137.
Meade, A.W., and S.B. Craig. (2012). Identifying careless responses in survey data. Psychological Methods, 17(3):437–455.
Meijer, R.R., and K. Sijtsma. (1995). Detection of aberrant item score patterns: A review and new developments. Applied Measurement in Education, 8:261–272.
Mislevy, R.J. (2013). Evidence-centered design for simulation-based assessment. Military Medicine, 178:107–114.
Mislevy, R.J., and M.M. Riconscente. (2006). Evidence-Centered Assessment Design: Layers, Structures, and Terminology. Menlo Park, CA: SRI International.
Mislevy, R.J., L.S. Steinberg, and R. Almond. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1):3–62.
Mislevy, R.J., J.T. Behrens, K.E. Dicerbo, and R. Levy. (2012). Design and discovery in educational assessment: Evidence-centered design, psychometrics, and educational data mining. Journal of Educational Data Mining, 4(1):11–48.
Motowidlo, S.J., and J.R. Van Scotter. (1994). Evidence that task performance should be distinguished from contextual performance. Journal of Applied Psychology, 79(4):475–480.
Murphy, J., R. Mulvaney, S. Huang, and M.A. Lodato. (2013). Developing Technology-Based Training and Assessment to Support Soldier-Centered Learning. Presentation at the 28th Annual Conference of the Society for Industrial and Organizational Psychology, Houston, TX.
Nacke, L.E. (2009). Affective Ludology: Scientific Measurement of User Experience in Interactive Entertainment (Doctoral dissertation). Blekinge Institute of Technology, Karlskrona, Sweden. Available: http://hci.usask.ca/publications/view.php?id=178 [December 2014].
National Research Council. (2011). Learning Science Through Computer Games and Simulations. Committee on Science Learning: Computer Games, Simulations, and Education, M.A. Honey and M.L. Hilton, Eds. Board on Science Education, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.
Nering, M.L. (1997). The distribution of indexes of person fit within the computerized adaptive testing environment. Applied Psychological Measurement, 21(2):115–127.
Nye, C.D., F. Drasgow, O.S. Chernyshenko, S. Stark, U.C. Kubisiak, L.A. White, and I. Jose. (2012). Assessing the Tailored Adaptive Personality Assessment System (TAPAS) as an MOS Qualification Instrument (Technical Report 1312). Ft. Belvoir, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Oswald, F.L., and K.S. Schell. (2010). Developing and scaling personality measures: Thurstone was right—but so far, Likert was not wrong. Industrial and Organizational Psychology, 3(4):481–484.
Powers, R. (2013). ASVAB for Dummies: Premier PLUS. New York: Wiley & Sons.
Ranger, J., and J.T. Kuhn. (2012). Improving Item Response Theory model calibration by considering response times in psychological tests. Applied Psychological Measurement, 36(3):214–231.
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Chicago: University of Chicago Press.
Reckase, M.D. (2009). Multidimensional Item Response Theory. New York: Springer-Verlag.
Reckase, M.D., and R.L. McKinley. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15(4):361–373.
Reckase, M.D., T.A. Ackerman, and J.E. Carlson. (1988). Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25(3):193–203.
Redecker, C., and Ø. Johannessen. (2013). Changing assessment—Towards a new assessment paradigm using ICT. European Journal of Education, 48(1):79–96.
Reise, S.P. (1995). Scoring method and the detection of person misfit in a personality assessment context. Applied Psychological Measurement, 19(3):213–229.
Reise, S.P., and P. Flannery. (1996). Assessing person-fit on measures of typical performance. Applied Measurement in Education, 9(1):9–26.
Riley, W.T., P. Pilkonis, and D. Cella. (2011). Application of the National Institutes of Health Patient-reported Outcome Measurement Information System (PROMIS) to mental health research. The Journal of Mental Health Policy and Economics, 14(4):201–208.
Roberts, J.S., J.E. Laughlin, and D.H. Wedell. (1999). Validity issues in the Likert and Thurstone approaches to attitude measurement. Educational and Psychological Measurement, 59(2):211–233.
Roberts, J.S., J.R. Donoghue, and J.E. Laughlin. (2000). A general item response theory model for unfolding unidimensional polytomous responses. Applied Psychological Measurement, 24(1):3–32.
Rotundo, M., and P.R. Sackett. (2002). The relative importance of task, citizenship, and counterproductive performance to global ratings of job performance: A policy-capturing approach. Journal of Applied Psychology, 87(1):66–80.
Samejima, F. (1969). Estimation of a latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17. Available: https://www.psychometric-society.org/sites/default/files/pdf/MN17.pdf [December 2014].
Sands, W.A., B.K. Waters, and J.R. McBride. (1999). CATBOOK Computerized Adaptive Testing: From Inquiry to Operation (No. HUMRRO-FR-EADD-96-26). Alexandria, VA: Human Resources Research Organization.
Segall, D.O. (1996). Multidimensional adaptive testing. Psychometrika, 61(2):331–354.
Segall, D.O. (2001a). General ability measurement: An application of multidimensional item response theory. Psychometrika, 66(1):79–97.
Segall, D.O. (2001b). Detecting Test Compromise in High-Stakes Computerized Adaptive Testing: A Verification Testing Approach. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Seattle, WA. Available: http://ncme.org/default/assets/File/pdf/programPDF/NCMEProgram2001.pdf [February 2015].
Segall, D.O. (2002). An item response model for characterizing test compromise. Journal of Educational and Behavioral Statistics, 27(2):163–179.
Shaffer, D.W., D. Hatfield, G.N. Svarovsky, P. Nash, A. Nulty, E. Bagley, K. Frank, A.A. Rupp, and R. Mislevy. (2009). Epistemic network analysis: A prototype for 21st century assessment of learning. International Journal of Learning and Media, 1(2):33–53.
Shealy, R., and W. Stout. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects bias/DTF as well as item bias/DIF. Psychometrika, 58(2):159–194.
Stark, S. (2002). A New IRT Approach to Test Construction and Scoring Designed to Reduce the Effects of Faking in Personality Assessment (Doctoral dissertation). University of Illinois at Urbana-Champaign. Available: http://psychology.usf.edu/faculty/sestark/ [December 2014].
Stark, S., and O.S. Chernyshenko. (2007, October). Adaptive Testing with the Multi-Unidimensional Pairwise Preference Model. Paper presented at the 49th Annual Conference of the International Military Testing Association. Gold Coast, Australia. Available: http://www.imta.info/PastConferences/Presentations.aspx?Show=2007 [February 2015].
Stark, S., O.S. Chernyshenko, and F. Drasgow. (2005). An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: The Multi-Unidimensional Pairwise-Preference Model. Applied Psychological Measurement, 29(3):184–203.
Stark, S., O.S. Chernyshenko, F. Drasgow, and B.A. Williams. (2006). Examining assumptions about item responding in personality assessment: Should ideal-point methods be considered for scale development and scoring? Journal of Applied Psychology, 91(1):25–39.
Stark, S., O.S. Chernyshenko, W.C. Lee, F. Drasgow, L.A. White, and M.C. Young. (2011). Optimizing prediction of attrition with the U.S. Army’s Assessment of Individual Motivation (AIM). Military Psychology, 23(2):180–201.
Stark, S., O.S. Chernyshenko, and F. Drasgow. (2012a). Constructing fake-resistant personality tests using item response theory: High stakes personality testing with multidimensional pairwise preferences. In M. Ziegler, C. MacCann, and R.D. Roberts, Eds., New Perspectives on Faking in Personality Assessments (pp. 214–239). New York: Oxford University Press.
Stark, S., O.S. Chernyshenko, F. Drasgow, and L.A. White. (2012b). Adaptive testing with multidimensional pairwise preference items: Improving the efficiency of personality and other noncognitive assessments. Organizational Research Methods, 15:463–487.
Stark, S., O.S. Chernyshenko, C.D. Nye, F. Drasgow, and L.A. White. (2012c). Moderators of the Tailored Adaptive Personality Assessment System (TAPAS) Validity. Ft. Belvoir, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Stark, S., O.S. Chernyshenko, F. Drasgow, L.A. White, T. Heffner, C.D. Nye, and W.L. Farmer. (2014). From ABLE to TAPAS: A new generation of personality tests to support military selection and classification decisions. Military Psychology, 26(3):153–164.
Stout, W.F. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52:589–617.
Stout, W.F. (1990). A new Item Response Theory modelling approach and applications to unidimensionality assessment and ability estimation. Psychometrika, 55:293–325.
Sydell, E., J. Ferrell, J. Carpenter, C. Frost, and C.C. Brodbeck. (2013). Simulation scoring. In M. Fetzer and K. Tyzinski, Eds., Simulations for Personnel Selection (pp. 83–107). New York: Springer.
Tay, L., U.S. Ali, F. Drasgow, and B. Williams. (2011). Fitting IRT models to dichotomous and polytomous data: Assessing the relative model–data fit of ideal point and dominance models. Applied Psychological Measurement, 35(4):280–295.
Tippins, N.T., J. Beaty, F. Drasgow, W.M. Gibson, K. Pearlman, D.O. Segall, and W. Shepherd. (2006). Unproctored internet testing in employment settings. Personnel Psychology, 59(1):189–225.
Tupes, E.C., and R.E. Christal. (1961). Recurrent Personality Factors Based on Trait Ratings (Technical Report ASD-TR-61-97). Lackland Air Force Base, TX: Personnel Laboratory, Air Forces Systems Command.
van der Linden, W.J. (2005). Comparison of item-selection methods for adaptive tests with content constraints. Journal of Educational Measurement, 42(3):283–302.
van der Linden, W.J. (2008). Using response times for item selection in adaptive testing. Journal of Educational and Behavioral Statistics, 33(1):5–20.
van der Linden, W.J., and Q. Diao. (2011). Automated test-form generation. Journal of Educational Measurement, 48(2):206–222.
van der Linden, W.J., R.H.K. Entink, and J.P. Fox. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34(5):327–347.
Veldkamp, B.P., and W.J. van der Linden. (2002). Multidimensional adaptive testing with constraints on test content. Psychometrika, 67(4):575–588.
Wang, C., Y. Zheng, and H.H. Chang. (2014). Does standard deviation matter? Using “standard deviation” to quantify security of multistage testing. Psychometrika, 79(1):154–174.
Way, W.D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17(4):17–27.
White, L.A., and M.C. Young. (1998, August). Development and Validation of the Assessment of Individual Motivation (AIM). Paper presented at the Annual Meeting of the American Psychological Association, San Francisco, CA. Available: http://www.siop.org/tip/backissues/TIPJuly98/burke.aspx [February 2015].
White, L.A., M.C. Young, and M.G. Rumsey. (2001). Assessment of Background and Life Experiences (ABLE) implementation issues and related research. In J.P. Campbell and D.J. Knapp, Eds., Exploring the Limits in Personnel Selection and Classification (pp. 526–528). Mahwah, NJ: Lawrence Erlbaum Associates.
White, L.A., M.C. Young, E.D. Heggestad, S. Stark, F. Drasgow, and G. Piskator. (2004). Development of a Non–High School Diploma Graduate Pre-Enlistment Screening Model to Enhance the Future Force. Arlington, VA: U.S. Army Research Institute for the Behavioral and Social Sciences.
Wothke, W., L.T. Curran, J.W. Augustin, C. Guerrero Jr., R.D. Bock, B.A. Fairbank, and A.H. Gillett. (1991). Factor Analytic Examination of the Armed Services Vocational Aptitude Battery (ASVAB) and the Kit of Factor-Referenced Tests (AFHRL-TL-90-67). Brooks Air Force Base, TX: Air Force Human Resources Laboratory.
Yao, L. (2013). Comparing the performance of five multidimensional CAT selection procedures with different stopping rules. Applied Psychological Measurement, 37(1):3–23.
Yi, Q., J. Zhang, and H.H. Chang. (2008). Severity of organized item theft in computerized adaptive testing: A simulation study. Applied Psychological Measurement, 32(3):543–558.
Zickar, M.J., and F. Drasgow. (1996). Detecting faking on a personality instrument using appropriateness measurement. Applied Psychological Measurement, 20(1):71–87.
Ziegler, M., C. MacCann, and R.D. Roberts, Eds. (2012). New Perspectives on Faking in Personality Assessments. New York: Oxford University Press.