4
Contributions of Measurement and Statistical Modeling to Assessment

Over the past century, scientists have sought to bring objectivity, rigor, consistency, and efficiency to the process of assessment by developing a range of formal theories, models, practices, and statistical methods for deriving and interpreting test data. Considerable progress has been made in the field of measurement, traditionally referred to as “psychometrics.” The measurement models in use today include some very sophisticated options, but they have had surprisingly little impact on the everyday practice of educational assessment. The problem lies not so much with the range of measurement models available, but with the outdated conceptions of learning and observation that underlie most widely used assessments. Further, existing models and methods may appear to be more rigid than they actually are because they have long been associated with certain familiar kinds of test formats and with conceptions of student learning that emphasize general proficiency or ranking.

Findings from cognitive research suggest that new kinds of inferences are needed about students and how they acquire knowledge and skills if assessments are to be used to track and guide student learning. Advances in technology offer ways to capture, store, and communicate the multitude of things one can observe students say, do, and make. At issue is how to harness the relevant information to serve as evidence for the new kinds of inferences that cognitive research suggests are important for informing and improving learning. An important emphasis of this chapter is that currently available measurement methods could yield richer inferences about student knowledge if they were linked with contemporary theories of cognition and learning.1

1 This chapter draws, in part, on a paper commissioned by the committee and written by Brian Junker (1999) that describes some statistical models and computational methods that may be useful for cognitively informed assessment. Junker’s paper reviews some of the measurement models in more technical detail than is provided in this chapter and can be found at <http://www.stat.cmu.edu/~brian/nrc/cfa/> [March 2, 2001].




FORMAL MEASUREMENT MODELS AS A FORM OF REASONING FROM EVIDENCE

As discussed in Chapter 2, assessment is a process of drawing reasonable inferences about what students know on the basis of evidence derived from observations of what they say, do, or make in selected situations. To this end, the three elements of the assessment triangle—cognition, observation, and interpretation—must be well coordinated. In this chapter, the three elements are defined more specifically, using terminology from the field of measurement: the aspects of cognition and learning that are the targets for the assessment are referred to as the construct or construct variables, observation is referred to as the observation model, and interpretation is discussed in terms of formal statistical methods referred to as measurement models.

The methods and practices of standard test theory constitute a special type of reasoning from evidence. The field of psychometrics has focused on how best to gather, synthesize, and communicate evidence of student understanding in an explicit and formal way. As explained below, psychometric models are based on a probabilistic approach to reasoning. From this perspective, a statistical model is developed to characterize the patterns believed most likely to emerge in the data for students at varying levels of competence. When there are large masses of evidence to be interpreted and/or when the interpretations are complex, the complexity of these models can increase accordingly.

Humans have remarkable abilities to evaluate and summarize information, but remarkable limitations as well. Formal probability-based models for assessment were developed to overcome some of these limitations, especially for assessment purposes that (1) involve high stakes; (2) are not limited to a specific context, such as one classroom; or (3) do not require immediate information. Formal measurement models allow one to draw meaning from quantities of data far more vast than a person can grasp at once and to express the degree of uncertainty associated with one’s conclusions. In other words, a measurement model is a framework for communicating with others how the evidence in observations can be used to inform the inferences one wants to draw about learner characteristics that are embodied in the construct variables. Further, measurement models allow people to avoid reasoning errors that appear to be hard-wired into the human mind, such as biases associated with preconceptions or with the representativeness or recency of information (Kahneman, Slovic, and Tversky, 1982).

Reasoning Principles and Formal Measurement Models

Those involved in educational and psychological measurement must deal with a number of issues that arise when one assumes a probabilistic relationship between the observations made of a learner and the learner’s underlying cognitive constructs. The essential idea is that statistical models can be developed to predict the probability that people will behave in certain ways in assessment situations, and that evidence derived from observing these behaviors can be used to draw inferences about students’ knowledge, skills, and strategies (which are not directly observable).2

2 This idea dates back to Spearman’s (1904) early work and was extended by Wright’s (1934) path analyses, Lazarsfeld’s (1950) latent class models, item response theory (Lawley, 1943), and structural equation modeling with measurement error (e.g., Jöreskog and Sörbom, 1979).

In assessment, aspects of students’ knowledge, skills, and strategies that cannot be directly observed play the role of “that which is to be explained”—generally referred to as “cognition” in Chapter 2 and more specifically as the “construct” in this chapter. The constructs are called “latent” because they are not directly observable. The things students say and do constitute the evidence used in this explanation—the observation element of the assessment triangle. In broad terms, the construct is seen as “causing” the observations, although generally this causation is probabilistic in nature (that is, the constructs determine the probability of a certain response, not the response itself).

More technically, there are two elements of probability-based measurement models: (1) unobservable latent constructs and (2) observations or observable variables, which are, for instance, students’ scores on a test intended to measure the given construct. The nature of the construct variables depends partly on the structure and psychology of the subject domain and partly on the purpose of assessment. The nature of the observations is determined by the kinds of things students might say or do in various situations to provide evidence about their values with respect to the construct. Figure 4–1 shows how the construct is related to the observations. (In the figure, the latent construct is denoted θ [theta] and the observables x.) Note that although the latent construct causes the observations, one needs to go the other way when one draws inferences—back from the observations to their antecedents.

FIGURE 4–1 The student construct and the observations.

Other variables are also needed to specify the formal model of the observations; these are generally called item parameters. The central idea of probability models is that these unknown constructs and item parameters do not determine the specifics of what occurs, but they do determine the probability associated with various possible results. For example, a coin might be expected to land as heads and as tails an approximately equal number of times. That is, the probability of heads is the same as the probability of tails. However, this does not mean that in ten actual coin tosses these exact probabilities will be observed.
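To make the coin illustration concrete, the model fixes the probability of each outcome, yet an exactly even split in ten tosses is far from guaranteed. A quick calculation, added here only for illustration, shows why:

```latex
% Probability of an exactly even split in ten tosses of a fair coin
P(\text{5 heads in 10 tosses}) = \binom{10}{5}\left(\tfrac{1}{2}\right)^{10}
                               = \frac{252}{1024} \approx 0.246
```

Even when the model’s probabilities are exactly right, a perfect five-and-five split occurs only about a quarter of the time; the construct and item parameters constrain what is likely, not what must happen.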

The notion of “telling stories that match up with what we see” corresponds to the technical concept of conditional independence in formal probability-based reasoning. Conditional independence means that any systematic relationships among multiple observations are due entirely to the unobservable construct variables they tap. This is a property of mathematical probability models, not necessarily of any particular situation in the real world. Assessors choose where, in the real world, they wish to focus their attention. This includes what situation they want to explore and what properties of that situation are most important to manipulate. They then decide how to build a model or “approximation” that connects the construct variables to the specific observations. The level of unobservable constructs corresponds to “the story” people tell, and it is ultimately expressed in terms of important patterns and principles of knowledge in the cognitive domain under investigation. The level of observations represents the specifics from which evidence is derived about the unobservable level. Informally, conditional independence expresses the decision about what aspects of the situation are built into one’s story and what is ignored.
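In the θ and x notation of Figure 4–1, the conditional independence assumption just described can be written compactly; this rendering is a standard one and is added here for illustration:

```latex
% Conditional independence: given the construct theta, the joint probability
% of n observations factors into a product of observation-level probabilities
P(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{j=1}^{n} P(x_j \mid \theta)
```

The factorization is a property of the model the assessor chooses to build; whether it is a reasonable approximation of the real situation is part of the modeling decision described above.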

Psychometric models are particular instances of this kind of reasoning. The most familiar measurement models evolved to help in “constructing stories” that were useful in situations characterized by various psychological perspectives on learning, for particular educational purposes, with certain recurring forms of evidence. The following sections describe some of these models, explaining how these stories have grown and adapted to handle the increasingly complex demands of assessment. Knowing the history of these adaptations may help in dealing with new demands from more complex models of learning and the types of stories we would now like to be able to tell in many educational contexts.

The BEAR Assessment System

An example of the relationships among the conception of learning, the observations, and the interpretation model is provided by the Berkeley Evaluation and Assessment Research (BEAR) Center (Wilson and Sloane, 2000). The BEAR assessment system was designed to correspond to a middle school science curriculum called Issues, Evidence and You (IEY) (Science Education for Public Understanding Program, 1995). We use this assessment as a running example to illustrate various points throughout this chapter. The conception of cognition and learning underlying IEY is not based on a specific theory from cognitive research; rather, it is based on pedagogic content knowledge, that is, teachers’ knowledge of how students learn specific types of content. Nevertheless, the BEAR example illustrates many of the principles that the committee is setting forth, including the need to pay attention to all three vertices of the assessment triangle and how they fit together.

The IEY curriculum developers have conceptualized the learner as progressing along five progress variables that organize what students are to learn into five topic areas and a progression of concepts and skills (see Box 4–1). The BEAR assessment system is based on the same set of progress variables. A progress variable focuses on progression or growth. Learning is conceptualized not simply as a matter of acquiring more knowledge and skills, but as progressing toward higher levels of competence as new knowledge is linked to existing knowledge, and deeper understandings are developed from and take the place of earlier understandings. The concepts of ordered levels of understanding and direction are fundamental: in any given area, it is assumed that learning can be described and mapped as progress in the direction of qualitatively richer knowledge, higher-order skills, and deeper understandings.

Progress variables are derived in part from professional opinion about what constitutes higher and lower levels of performance or competence, but are also informed by empirical research on how students respond or perform in practice. They provide qualitatively interpreted frames of reference for particular areas of learning and permit students’ levels of achievement to be interpreted in terms of the kinds of knowledge, skills, and understandings typically associated with those levels. They also allow individual and group achievements to be interpreted with respect to the achievements of other learners. The order of the activities intended to take students through the progress variables is specified in a blueprint—a table showing an overview of all course activities, indicating where assessment tasks are located and to which variables they relate.

BOX 4–1 Progress Variables from the Issues, Evidence and You (IEY) Curriculum

Designing and Conducting Investigations—designing a scientific experiment, performing laboratory procedures to collect data, recording and organizing data, and analyzing and interpreting the results of an experiment.

Evidence and Trade-offs—identifying objective scientific evidence, as well as evaluating the advantages and disadvantages of different possible solutions to a problem on the basis of the available evidence.

Understanding Concepts—understanding scientific concepts (such as properties and interactions of materials, energy, or thresholds) in order to apply the relevant scientific concepts to the solution of problems.

Communicating Scientific Information—effectively, and free of technical errors, organizing and presenting results of an experiment or explaining the process of gathering evidence and weighing trade-offs in selecting a solution to a problem.

Group Interaction—developing skill in collaborating with teammates to complete a task (such as a laboratory experiment), sharing the work of the activity, and contributing ideas to generate solutions to a given problem.

SOURCE: Roberts, Wilson, and Draney (1997, p. 8). Used with permission of the authors.

During IEY instruction, students carry out laboratory exercises and investigations in structured quadruples, work on projects in pairs, and then create reports and respond to assessment questions on their own. Observations of student performance consist of assessment tasks (which are embedded in the instructional program, and each of which has direct links to the progress variables) and link tests (which are composed of short-answer items also linked to the progress variables). Recording of teacher judgments about students’ work is aided by scoring guides—criteria unique to each progress variable that are used for assessing levels of student performance and interpreting student work (an example is provided in Table 4–1 for the Evidence and Trade-offs variable). These are augmented with exemplars—samples of actual student work illustrating performance at each score level for all assessment tasks.

The interpretation of these judgments is carried out using progress maps—graphic displays used to record the progress of each student on particular progress variables over the course of the year. The statistical underpinning for these maps is a multidimensional item response model (explained later); the learning underpinning is the set of progress variables. An example of a BEAR progress map is shown in Box 4–2. Teacher and student involvement in the assessment system is motivated and structured through assessment moderation—a process by which groups of teachers and students reach consensus on standards of student performance and discuss the implications of assessment results for subsequent learning and instruction (Roberts, Sloane, and Wilson, 1996).

To summarize, the BEAR assessment system as applied in the IEY curriculum embodies the assessment triangle as follows. The conception of learning consists of the five progress variables mentioned above. Students are helped in improving along these variables by the IEY instructional materials, including the assessments. The observations are the scores teachers assign to student work on the embedded assessment tasks and the link tests. The interpretation model is formally a multidimensional item response model (discussed later in this chapter) that underlies the progress maps; however, its meaning is elaborated through the exemplars and through the teacher’s knowledge about the specific responses a student gave on various items.

STANDARD PSYCHOMETRIC MODELS

Currently, standard measurement models focus on a situation in which the observations are in the form of a number of items with discrete, ordered response categories (such as the categories from an IEY scoring guide illustrated in Table 4–1) and in which the construct is a single continuous variable (such as one of the IEY progress variables described in Box 4–1). For example, a standardized achievement test is typically composed of many (usually dichotomous3) items that are often all linked substantively in some way to a common construct variable, such as mathematics achievement. The construct is thought of as a continuous unobservable (latent) characteristic of the learner, representing relatively more or less of the competency that is common to the set of items and their responses. This can be summarized graphically as in Figure 4–2, where the latent construct variable θ (represented inside an oval shape in the figure to denote that it is unobservable) is thought of as potentially varying continuously from minus infinity to plus infinity.

3 That is, the items can be scored into just two categories, e.g., either right or wrong.
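As one concrete illustration of how a continuous construct and item parameters jointly determine the probability of a dichotomous response, consider the Rasch model, a standard option from the measurement literature. It is shown here only as an example; the chapter’s own discussion of this family of models appears later:

```latex
% Rasch model: probability that student i answers item j correctly,
% given proficiency theta_i and item difficulty beta_j
P(x_{ij} = 1 \mid \theta_i, \beta_j) = \frac{\exp(\theta_i - \beta_j)}{1 + \exp(\theta_i - \beta_j)}
```

Here θ_i plays the role of the continuous latent construct and β_j is an item parameter of the kind discussed earlier; related models add further parameters, for example for item discrimination or guessing.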

TABLE 4–1 Sample Scoring Guide for the BEAR Assessment: Evidence and Trade-offs (ET) Variable

Using Evidence: Response uses objective reason(s) based on relevant evidence to support choice.
Using Evidence to Make Trade-offs: Response recognizes multiple perspectives of issue and explains each perspective using objective reasons, supported by evidence, in order to make choice.

Score 4
Using Evidence: Response accomplishes Level 3 AND goes beyond in some significant way, such as questioning or justifying the source, validity, and/or quantity of evidence.
Using Evidence to Make Trade-offs: Response accomplishes Level 3 AND goes beyond in some significant way, such as suggesting additional evidence beyond the activity that would further influence choices in specific ways, OR questioning the source, validity, and/or quantity of evidence and explaining how it influences choice.

Score 3
Using Evidence: Response provides major objective reasons AND supports each with relevant and accurate evidence.
Using Evidence to Make Trade-offs: Response discusses at least two perspectives of issue AND provides objective reasons, supported by relevant and accurate evidence, for each perspective.

Score 2
Using Evidence: Response provides some objective reasons AND some supporting evidence, BUT at least one reason is missing and/or part of the evidence is incomplete.
Using Evidence to Make Trade-offs: Response states at least one perspective of issue AND provides some objective reasons using some relevant evidence, BUT reasons are incomplete and/or part of the evidence is missing; OR only one complete and accurate perspective has been provided.

Score 1
Using Evidence: Response provides only subjective reasons (opinions) for choice and/or uses inaccurate or irrelevant evidence from the activity.
Using Evidence to Make Trade-offs: Response states at least one perspective of issue BUT only provides subjective reasons and/or uses inaccurate or irrelevant evidence.

Score 0
Using Evidence: No response; illegible response; response offers no reasons AND no evidence to support choice made.
Using Evidence to Make Trade-offs: No response; illegible response; response lacks reasons AND offers no evidence to support decision made.

Score X
Student had no opportunity to respond.

SOURCE: Roberts, Wilson, and Draney (1997, p. 9). Used with permission of the authors.

FIGURE 4–2 Unidimensional-continuous constructs. Boxes indicate observable variables; oval indicates a latent variable.

The assessment items are shown in boxes (to denote that they are observed variables), and the arrows show that the construct “causes” the observations. Although not shown in the figure, each observed response consists of a component that statisticians generally call “error.” Note that error in this context means something quite different from its usual educational sense—it means merely that the component is not modeled (i.e., not attributable to the construct θ). The representation in Figure 4–2 corresponds to a class of measurement models called item response models, which are discussed below. First, however, some methods that emerged earlier in the evolution of measurement models are described.

BOX 4–2 Example of a BEAR Progress Map

Below is an example of one of the types of progress maps produced by the BEAR assessment program. This particular example is called a “conference map” and is created by the GradeMap software (Wilson, Draney, and Kennedy, 1999). This map shows the “current estimate” of where a student is on four of the IEY progress variables (the variable Group Interaction is not yet calibrated). The estimate is expressed in terms of a series of levels that are identified as segments of the continua (e.g., “Incorrect,” “Advanced”) and are specified in greater detail in the scoring guide for each progress variable. Additional examples of BEAR maps are provided later in this chapter.

SOURCE: Wilson, Draney, and Kennedy (2001). Used with permission of the authors.

Classical Test Theory

Early studies of student testing and retesting led to the conclusion that although no tests were perfectly consistent, some gave more consistent results than others. Classical test theory (CTT) was developed initially by Spearman (1904) as a way to explain certain of these variations in consistency (expressed most often in terms of the well-known reliability index). In CTT, the construct is represented as a single continuous variable, but certain simplifications were necessary to allow use of the statistical methods available at that time. The observation model is simplified to focus only on the sum of the responses, with the individual item responses being omitted (see Figure 4–3). For example, if a CTT measurement model were used in the BEAR example, it would take the sum of the student scores on a set of assessment tasks as the observed score. The measurement model, sometimes referred to as a “true-score model,” simply expresses that the observed score (x) arises from the true score (θ) plus error (e). The reliability is then the ratio of the variance of the true score to the variance of the observed score. This type of model may be sufficient when one is interested only in a single aspect of student achievement (the total score) and when tests are considered only as a whole. Scores obtained using CTT modeling are usually translated into percentiles for norm-referenced interpretation and for comparison with other tests.

FIGURE 4–3 Classical test theory model.

The simple assumptions of CTT have been used to develop a very large superstructure of concepts and measurement tools, including reliability indices, standard error estimation formulae, and test equating practices used to link scores on one test with those on another.
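In symbols, and under the usual textbook assumption that error is uncorrelated with the true score (an assumption stated here for completeness rather than drawn from the chapter itself), the true-score model and the reliability index can be written as:

```latex
% Classical test theory: observed score = true score + error
x = \theta + e, \qquad \operatorname{Var}(x) = \operatorname{Var}(\theta) + \operatorname{Var}(e)

% Reliability: the share of observed-score variance attributable to true scores
\rho = \frac{\operatorname{Var}(\theta)}{\operatorname{Var}(\theta) + \operatorname{Var}(e)}
```

A reliability near 1 indicates that differences in observed scores mostly reflect differences in true scores rather than error.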

CTT modeling does not allow the simultaneous assessment of multiple aspects of examinee competence and does not address problems that arise whenever separate parts of a test need to be studied or manipulated. Formally, CTT does not include components that allow interpretation of scores based on subsets of items in the test. Historically, CTT has been the principal tool of formal assessments, and in part because of its great simplicity, it has been applied to assessments of virtually every type. Because of serious practical limitations, however, other theories—such as generalizability theory, item response modeling, and factor analysis—were developed to enable study of aspects of items.

Generalizability Theory

The purpose of generalizability theory (often referred to as G-theory) is to make it possible to examine how different aspects of observations—such as using different raters, using different types of items, or testing on different occasions—can affect the dependability of scores (Brennan, 1983; Cronbach, Gleser, Nanda, and Rajaratnam, 1972). In G-theory, the construct is again characterized as a single continuous variable. However, the observation can include design choices, such as the number of types of tasks, the number of raters, and the uses of scores from different raters (see Figure 4–4). These are commonly called facets4 of measurement. Facets can be treated as fixed or random. When they are treated as random, the observed elements in the facet are considered to be a random sample from the universe of all possible elements in the facet. For instance, if the set of tasks included on a test were

4 The term “facets” used in this sense is not to be confused with the facets-based instruction and assessment program (Hunt and Minstrell, 1994; Minstrell, 2000) referred to in Chapters 3, 5, 6, and 7.
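The kind of analysis G-theory supports can be sketched for a fully crossed person × task × rater design with random facets; the notation below is generic rather than the chapter’s own:

```latex
% Decomposition of observed-score variance for persons (p), tasks (t), raters (r)
\sigma^2(X_{ptr}) = \sigma^2_p + \sigma^2_t + \sigma^2_r
                  + \sigma^2_{pt} + \sigma^2_{pr} + \sigma^2_{tr} + \sigma^2_{ptr,e}
```

Estimated variance components of this kind indicate which design choices (adding tasks, adding raters, or changing how their scores are combined) would most improve the dependability of the resulting scores.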

using Method B; a student who gets most of the first kind wrong but most of the second kind right is probably using Method A. This example could be extended in many ways with regard to both the nature of the observations and the nature of the student model. With the present student model, one might explore additional sources of evidence about strategy use, such as monitoring response times, tracing solution steps, or simply asking the students to describe their solutions. Each such extension involves trade-offs in terms of cost and the value of the evidence, and each could be sensible in some applications but not others.

An important extension of the student model would be to allow for strategy switching (Kyllonen, Lohman, and Snow, 1984). Although the students in Tatsuoka’s application were not yet operating at this level, adults often decide whether to use Method A or Method B for a given item only after gauging which strategy would be easier to apply. The variables in the more complex student model needed to account for this behavior would express the tendencies of a student to employ different strategies under different conditions. Students would then be mixed cases in and of themselves, with “always use Method A” and “always use Method B” as extremes. Situations involving such mixes pose notoriously difficult statistical problems, and carrying out inference in the context of this more ambitious student model would certainly require the richer information mentioned above.

Some intelligent tutoring systems of the type described in Chapter 3 make use of Bayes nets, explicitly in the case of VanLehn’s OLAE tutor (Martin and VanLehn, 1993, 1995) and implicitly in the case of John Anderson’s LISP and algebra tutors (Corbett and Anderson, 1992). These applications highlight again the interplay among cognitive theory, statistical modeling, and assessment purpose. Another example of this type, the HYDRIVE intelligent tutoring system for aircraft hydraulics, is provided in Annex 4–1 at the end of this chapter.

Potential Future Role of Bayes Nets in Assessment

Two implications are clear from this brief overview of the use of Bayes nets in educational assessment. First, this approach provides a framework for tackling one of the most challenging issues now faced: how to reason about complex student competencies from complex data when the standard models from educational measurement are not sufficient. It does so in a way that incorporates the accumulated wisdom residing within existing models and practices while providing a principled basis for its extension. One can expect further developments in this area in the coming years as computational methods improve, examples on which to build accumulate, and efforts to apply different kinds of models to different kinds of assessments succeed and fail.

Second, classroom teachers are not expected to build formal Bayes nets in their classrooms from scratch. This is so even though the intuitive, often subconscious, reasoning teachers carry out every day in their informal assessments and conversations with students shares key principles with formal networks. Explicitly disentangling the complex evidentiary relationships that characterize the classroom simply is not necessary. Nevertheless, a greater understanding of how one would go about doing this, should it be required, would undoubtedly improve everyday reasoning about assessment by policy makers, the public at large, and teachers. One can predict with confidence that the most ambitious uses of Bayes nets in assessments would not require teachers to work with the nuts and bolts of statistical distributions, evidence models, and Lauritzen-Spiegelhalter updating. Aside from research uses, one way these technical elements come into play is by being built into instructional tools. The computer in a microwave oven is an analogy, and some existing intelligent tutoring systems are an example. Neither students learning to troubleshoot the F-15 hydraulics nor their trainers know or care that a Bayes net helps parse their actions and trigger suggestions (see the HYDRIVE example presented in Annex 4–1). The difficult work is embodied in the device. More open systems than these will allow teachers or instructional designers to build tasks around recurring relationships between students’ understandings and their problem solving in a domain, and to link these tasks to programs that handle the technical details of probability-based reasoning.

The most important lesson learned thus far, however, is the need for coordination across specialties in the design of complex assessments. An assessment that simultaneously pushes the frontiers of psychology, technology, statistics, and a substantive domain cannot succeed unless all of these areas are incorporated into a coherent design from the outset. If one tries to develop an ambitious student model, create a complex simulation environment, and write challenging task scenarios—all before working through the relationships among the elements of the assessment triangle needed to make sense of the data—one will surely fail. The familiar practice of writing test items and handing them off to psychometricians to model the results cannot be sustained in complex assessments.

MODELING OF STRATEGY CHANGES10

In the preceding account, measurement models were discussed in order of increasing complexity with regard to how aspects of learning are modeled.

10 This section draws heavily on the commissioned paper by Brian Junker. For the paper, go to <http://www.stat.cmu.edu/~brian/nrc/cfa/>. [March 2, 2001].

Alternatively, one could organize the discussion in accordance with specific ideas from cognitive psychology. For example, one highly salient concept for cognitive psychologists is strategy. This section examines how the models described above might be used to investigate this issue.

It is not difficult to believe that different students bring different problem-solving strategies to an assessment setting; sufficiently different curricular backgrounds provide a prima facie argument that this must happen. Moreover, comparative studies of experts and novices (e.g., Chi, Glaser, and Farr, 1988) and theories of expertise (e.g., Glaser, 1991) suggest that the strategies one uses to solve problems change as one’s expertise grows. Kyllonen et al. (1984) show that strategies used by the same person also change from task to task, and evidence from research on intelligent tutoring systems suggests it is not unusual for students to change strategy within a task as well. Thus one can distinguish at least four cases for modeling of differential strategy use, listed in increasing order of difficulty for statistical modeling and analysis:

Case 0—no modeling of strategies.
Case 1—strategy changes from person to person.
Case 2—strategy changes from task to task for individuals.
Case 3—strategy changes within a task for individuals.

Which case is selected for a given application depends, as with all assessment modeling decisions, on trade-offs between capturing what students are actually doing and serving the purpose of the assessment. Consider, for example, the science assessment study of Baxter, Elder, and Glaser (1996), which examined how middle school students’ attempts to learn what electrical components were inside “mystery boxes” revealed their understanding of electrical circuits. One might decide to conduct an assessment specifically to identify which attributes of high competence a particular student has, so that the missing attributes can be addressed without regard to what low-competence attributes the student possesses; this could be an instance of Case 0. On the other hand, if the goal were to identify the competency level of the student—low, medium, or high—and remediate accordingly, a more complete person-to-person model, as in Case 1, would be appropriate. In addition, if the difficulty of the task depended strongly on the strategy used, one might be forced to apply Case 1 or one of the other cases to obtain an assessment model that fit the data well, even if the only valuable target of inference were the high-competence state.

Many models for Case 1 (that is, modeling strategy changes among students, but assuming that strategy is constant across assessment tasks) are variations on the latent class model of Figure 4–8. For example, the Haertel/Wiley latent class model, which defines latent classes in terms of the sets of attributes class members possess, maps directly onto Figure 4–8.

The Mislevy and Verhelst (1990) model used to account for strategy effects on item difficulty in IRM also models strategy use at the level of Case 1. This approach uses information about the difficulties of the tasks under different strategies to draw inferences about what strategy is being used. Wilson’s Saltus model (Wilson, 1989; Mislevy and Wilson, 1996) is quite similar, positing specific interactions on the θ scale between items of a certain type and developmental stages of examinees. The M2RCML model of Pirolli and Wilson (1998) allows not only for mixing of strategies that drive item difficulty (as in the Mislevy and Verhelst and the Saltus models), but also for mixing over combinations of student proficiency variables. All of these Case 1 approaches are likely to succeed if the theory for positing differences among task difficulties under different strategies produces some large differences in task difficulty across strategies.

Case 2, in which individual students change strategy from task to task, is more difficult. One example of a model intended to accommodate this case is the unified model of DiBello and colleagues (DiBello et al., 1995; DiBello et al., 1999). In fact, one can build a version of the Mislevy and Verhelst model that does much the same thing; one simply builds the latent class model within task instead of among tasks. It is not difficult to build the full model or to formulate estimating equations for it. However, it is very difficult to fit, because wrong/right or even polytomously scored responses do not contain much information about the choice of strategy. To make progress with Case 2, one must collect more data. Helpful additions include building response latency into computerized tests; requesting information about the performance of subtasks within a task (if informative about the strategy); asking students to answer strategy-related auxiliary questions, as did Baxter, Elder, and Glaser (1996); asking students to explain the reasoning behind their answers; or even asking them directly what strategy they are using. In the best case, gathering this kind of information reduces the assessment modeling problem to the case in which each student’s strategy is known with certainty.

Case 3, in which the student changes strategy within task, cannot be modeled successfully without rich within-task data. Some intelligent tutoring systems try to do this under the rubric of “model tracing” or “plan recognition.” The tutors of Anderson and colleagues (e.g., Anderson, Corbett, Koedinger, and Pelletier, 1995) generally do so by keeping students close to a modal solution path, but they have also experimented with asking students directly what strategy they are pursuing in ambiguous cases. Others keep track of other environmental variables to help reduce ambiguity about the choice of strategy within performance of a particular task (e.g., Hill and Johnson, 1995). Bayesian networks are commonly used for this purpose.
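The logic of these Case 1 approaches, which use strategy-dependent item difficulties to infer which strategy a student is using, can be sketched with a toy calculation. The probabilities and the two-strategy setup below are invented for illustration and are not taken from the studies cited:

```python
# Toy Case 1 inference: which strategy (A or B) is a student probably using,
# given right/wrong responses to items whose difficulty differs by strategy?
# All probabilities below are made up for illustration.

p_correct = {
    "A": [0.9, 0.8, 0.3, 0.2],   # items 3 and 4 are hard under Method A
    "B": [0.3, 0.2, 0.9, 0.8],   # items 1 and 2 are hard under Method B
}
prior = {"A": 0.5, "B": 0.5}     # no prior preference between strategies

def posterior(responses):
    """Posterior probability of each strategy given a right/wrong pattern."""
    weighted = {}
    for strategy, probs in p_correct.items():
        likelihood = 1.0
        for x, p in zip(responses, probs):
            likelihood *= p if x == 1 else (1 - p)
        weighted[strategy] = likelihood * prior[strategy]
    total = sum(weighted.values())
    return {s: v / total for s, v in weighted.items()}

# Student answers the first two items correctly and misses the last two:
print(posterior([1, 1, 0, 0]))   # strategy A comes out far more probable
```

A fuller model would also include a proficiency variable within each strategy class, as the mixture models described above do; the sketch shows only the direction of the inference, from response patterns back to strategy membership.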

The Andes tutor of mechanics problems in physics (e.g., Gertner, Conati, and VanLehn, 1998) employs a Bayesian network to do model tracing. The student attributes are production rules, the observed responses are problem-solving actions, and strategy-use variables mediate the relationships between attributes and responses. Various approaches have been proposed for controlling the potentially very large number of states as the number of possible strategies grows. Charniak and Goldman (1993), for example, build a network sequentially, adding nodes for new evidence with respect to plausible plans along the way.

CONCLUSIONS

Advances in methods of educational measurement include the development of formal measurement (psychometric) models, which represent a particular form of reasoning from evidence. These models provide explicit, formal rules for integrating the many pieces of information that may be relevant to specific inferences drawn from observation of assessment tasks. Certain kinds of assessment applications require the capabilities of formal statistical models for the interpretation element of the assessment triangle. These tend to be applications with one or more of the following features: high stakes, distant users (i.e., assessment interpreters without day-to-day interaction with the students), complex student models, and large volumes of data.

Measurement models currently available can support many of the kinds of inferences that cognitive science suggests are important to pursue. In particular, it is now possible to characterize students in terms of multiple aspects of proficiency, rather than a single score; chart students’ progress over time, instead of simply measuring performance at a particular point in time; deal with multiple paths or alternative methods of valued performance; model, monitor, and improve judgments on the basis of informed evaluations; and model performance not only at the level of students, but also at the levels of groups, classes, schools, and states.

Nonetheless, many of the newer models and methods are not widely used because they are not easily understood or packaged in accessible ways for those without a strong technical background. Technology offers the possibility of addressing this shortcoming. For instance, building statistical models into technology-based learning environments for use in their classrooms enables teachers to employ more complex tasks, capture and replay students’ performances, share exemplars of competent performance, and in the process gain critical information about student competence.

Much hard work remains to focus psychometric model building on the critical features of models of cognition and learning and on observations that reveal meaningful cognitive processes in a particular domain. If anything, the task has become more difficult because an additional step is now required—determining in tandem the inferences that must be drawn, the observations needed, the tasks that will provide them, and the statistical models that will express the necessary patterns most efficiently.

Therefore, having a broad array of models available does not mean that the measurement model problem has been solved. The long-standing tradition of leaving scientists, educators, task designers, and psychometricians each to their own realms represents perhaps the most serious barrier to progress.

ANNEX 4–1: AN APPLICATION OF BAYES NETS IN AN INTELLIGENT TUTORING SYSTEM

As described in Chapter 3, intelligent tutoring systems depend on some form of student modeling to guide tutor behavior. Inferences about what a student does and does not know affect the presentation and pacing of problems, the quality of feedback and instruction, and the determination of when a student has achieved tutorial objectives. The following example involves the HYDRIVE intelligent tutoring system, which, in the course of implementing principles of cognitive diagnosis, adapts concepts and tools of test theory to implement principles of probability-based reasoning (Mislevy and Gitomer, 1996).

HYDRIVE is an intelligent tutoring/assessment system designed to help trainees in aircraft mechanics develop troubleshooting skills for the F-15’s hydraulics systems. These systems are involved in the operation of the flight controls, landing gear, canopy, jet fuel starter, and aerial refueling. HYDRIVE simulates many of the important cognitive and contextual features of troubleshooting on the flightline. A problem begins with a video sequence in which a pilot who is about to take off or has just landed describes some aircraft malfunction to the hydraulics technician (for example, the rudders do not move during preflight checks). HYDRIVE’s interface allows the student to perform troubleshooting procedures by accessing video images of aircraft components and acting on those components; to review on-line technical support materials, including hierarchically organized schematic diagrams; and to make instructional selections at any time during troubleshooting, in addition to or in place of the instruction recommended by the system itself. HYDRIVE’s system model tracks the state of the aircraft system, including changes brought about by user actions.

Annex Figure 4–1 is a simplified version of portions of the Bayes net that supports inference in HYDRIVE. Four groups of variables can be distinguished (the last three of which comprise the student model). First, the rightmost nodes are the “observable variables”—actually the results of rule-driven analyses of a student’s actions in a given situation. Second, their immediate parents are knowledge and strategy requirements for two prototypical situations addressed in this simplified diagram; the potential values of these variables are combinations of system knowledge and troubleshooting strategies that are relevant in these situations.

ANNEX FIGURE 4–1 Simplified version of portions of the inference network through which the HYDRIVE student model is operationalized and updated.
NOTE: Bars represent probabilities, summing to one for all the possible values of a variable. A shaded bar extending the full width of a node represents certainty, due to having observed the value of that variable, i.e., a student’s actual responses to tasks.
SOURCE: Mislevy (1996, p. 407). Used with permission of the author.

Third, the long column of variables in the middle concerns aspects of subsystem and strategic knowledge, corresponding to instructional options. And fourth, to their left are summary characterizations of more generally construed proficiencies.

The structure of the network, the variables that capture the progression from novice to expert hydraulics troubleshooter, and the conditional probabilities implemented in the network are based on two primary sources of information: in-depth analyses of how experts and novices verbalize their problem-solving actions, and observations of trainees actually working through the problems in the HYDRIVE context.

Strictly speaking, the observation variables in the HYDRIVE Bayes net are not observable behaviors, but outcomes of analyses that characterize sequences of actions as “serial elimination,” “redundant action,” “irrelevant action,” “remove-and-replace,” or “space-splitting”—all interpreted in light of the current state of the system and results of the student’s previous actions. HYDRIVE employs a relatively small number of interpretation rules (about 25) to classify each troubleshooting action in these terms. The following is an example:

IF active path which includes failure has not been created
and the student creates an active path which does not include failure
and edges removed from the active problem area are of one power class,
THEN the student strategy is power path splitting.

Potential observable variables cannot be predetermined and uniquely defined in the manner of usual assessment items, since a student could follow countless paths through the problem. Rather than attempting to model all possible system states and specific possible actions within them, HYDRIVE posits equivalence classes of system-situation states, each of which could arise many times or not at all in a given student’s work. Members of these equivalence classes are treated as conditionally independent, given the status of the requisite skill and knowledge requirements. Two such classes are illustrated in Annex Figure 4–1: canopy situations, in which space-splitting11 is not possible, and landing gear situations, in which space-splitting is possible.

11 Space-splitting refers to a situation in which there is a chain of components that must all work for an event to happen (e.g., a car to start when the ignition key is turned), and a fault has occurred somewhere along that chain. The solution space includes the possibility that any of the components could have failed. Space-splitting means checking a point somewhere along the chain to see if things are working up to that point. If so, one can strip away the early portion of the solution space because all the components up to that point have worked; if not, one can strip away the latter part of the solution space and focus on the components up to that point.
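A schematic rendering of an interpretation rule of the kind quoted above is given below; the function and predicate names are hypothetical stand-ins for HYDRIVE’s internal checks of the system state, not its actual implementation:

```python
# Schematic version of one HYDRIVE-style interpretation rule (hypothetical code).
# It labels a troubleshooting action sequence as "power path splitting"
# when the conditions in the rule quoted above are met.

def classify_action(state):
    """Return a strategy label for the student's latest action, or None."""
    if (not state["active_path_with_failure_created"]
            and state["student_created_active_path_without_failure"]
            and state["edges_removed_same_power_class"]):
        return "power path splitting"
    return None

example_state = {
    "active_path_with_failure_created": False,
    "student_created_active_path_without_failure": True,
    "edges_removed_same_power_class": True,
}
print(classify_action(example_state))  # -> "power path splitting"
```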

Annex Figure 4–1 depicts how one changes belief after observing the following actions in three separate situations from the canopy/no-split class: one redundant and one irrelevant action (both ineffectual troubleshooting moves) and one remove-and-replace action (serviceable but inefficient). Serial elimination would have been the best strategy in such cases, and is most likely to be applied when the student has strong knowledge of this strategy and all relevant subsystems. Remove-and-replace is more likely when a student possesses some subsystem knowledge but lacks familiarity with serial elimination. Weak subsystem knowledge increases chances of irrelevant and redundant actions. It is possible to get any of these classes of actions from a trainee with any combination of values of student-model variables; sometimes students with good understanding carry out redundant tests, for example, and sometimes students who lack understanding unwittingly take the same action an expert would. These possibilities must be reflected in the conditional probabilities of actions, given the values of student-model variables.

The grain size and the nature of a student model in an intelligent tutoring system should be compatible with the instructional options available (Kieras, 1988). The subsystem and strategy student-model variables in HYDRIVE summarize patterns in troubleshooting solutions at the level addressed by the intelligent tutoring system’s instruction. As a result of the three aforementioned inexpert canopy actions, Annex Figure 4–1 shows belief shifted toward lower values for serial elimination and for all subsystem variables directly involved in the situation—mechanical, hydraulic, and canopy knowledge. Any or all of these variables could be a problem, since all are required for a high likelihood of expert action. Values for subsystem variables not directly involved in the situation are also lower because, to varying degrees, students familiar with one subsystem tend to be familiar with others, and, to a lesser extent, students familiar with subsystems tend to be familiar with troubleshooting strategies. These relationships are expressed by means of the more generalized system and strategy knowledge variables at the left of the figure. These variables take advantage of the indirect information about aspects of knowledge that a given problem does not address directly, and they summarize more broadly construed aspects of proficiency that are useful in evaluation and problem selection.
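The direction of the belief updating described here can be illustrated with a deliberately small fragment: a single binary knowledge variable and one class of evaluated actions. The structure and numbers are invented and are much simpler than the HYDRIVE network, which uses graded variables and Lauritzen-Spiegelhalter propagation over many nodes:

```python
# Toy Bayes-net update: belief about a single knowledge variable after
# observing several evaluated troubleshooting actions. All numbers invented.

prior_strong = 0.5  # initial belief that the relevant knowledge is strong

# P(action class | knowledge state); each row sums to 1 over action classes
p_action = {
    "strong": {"serial elimination": 0.7, "remove-and-replace": 0.2, "redundant/irrelevant": 0.1},
    "weak":   {"serial elimination": 0.1, "remove-and-replace": 0.3, "redundant/irrelevant": 0.6},
}

def update(prior, observations):
    """Posterior P(strong) after conditionally independent observations."""
    weight_strong, weight_weak = prior, 1.0 - prior
    for obs in observations:
        weight_strong *= p_action["strong"][obs]
        weight_weak *= p_action["weak"][obs]
    return weight_strong / (weight_strong + weight_weak)

# One redundant, one irrelevant, and one remove-and-replace action,
# as in the canopy example discussed above
obs = ["redundant/irrelevant", "redundant/irrelevant", "remove-and-replace"]
print(update(prior_strong, obs))  # belief that knowledge is strong drops sharply
```

After the two ineffectual actions and one remove-and-replace action, the posterior belief that the relevant knowledge is strong falls well below the prior, mirroring the shift in belief shown in Annex Figure 4–1.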

Part III
Assessment Design and Use: Principles, Practices, and Future Directions