Improving Indicators of the Quality of Science and Mathematics Education in Grades K-12 (1988)

Suggested citation: National Research Council. 1988. Improving Indicators of the Quality of Science and Mathematics Education in Grades K-12. Washington, DC: The National Academies Press. doi: 10.17226/988.

4 Indicators of Learning in Science and Mathematics

This chapter first appraises currently available multiple-choice tests of student achievement in order to judge their suitability as indicators of the quality of education in science and mathematics. This appraisal includes a discussion of various uses of these tests, a review of criticisms of current testing methods, and suggestions on some desirable features that should be retained. New methods of assessment are then described that would provide both quantitative and qualitative information about how students perform tasks requiring higher-order skills. The chapter continues with our recommendations regarding uses of current indicators of student learning and work needed to develop improved tests. Implications of these recommendations for state education agencies are presented, and we conclude with a discussion of possible approaches to assessing aspects of scientific literacy of the U.S. population.

AN APPRAISAL OF CURRENT TESTS OF STUDENT ACHIEVEMENT

The most direct indicators of the quality of science and mathematics education are the scores based on tests that measure what students have learned. Currently available indicators of student learning are typically obtained from standardized achievement tests made up of multiple-choice items. Before one accepts information based on such tests, it is necessary to make an appraisal of their suitability as indicators of the quality of education.

Purposes of Testing

In practice, tests are used for a wide variety of purposes. Some involve the evaluation of individual students for grading, student counseling, placement, promotion, awards, scholarships, and so on; these uses are important for educational purposes but not always well suited to the development of indicators. The use of tests most closely identified with assessing the condition of science and mathematics education is for the evaluation of learning achieved by populations of students; a related purpose that is of interest to the committee is the use of tests in improving the quality of instruction.

Evaluation of Student Learning

Measures of the outcomes of education for students are critical indicators in any educational monitoring system. Hence, the testing purpose of primary concern to the committee is evaluation of student learning, particularly at national, state, and regional levels. Indicators of learning that are satisfactory for this purpose would also be useful to school districts or individual schools as a means of monitoring change in levels of accomplishment over time.

A related use of tests is to provide criterion measures to validate less direct indicators of the quality of education, for example, teaching effectiveness or the quality of the curriculum. Tests are often used for this purpose, but such use is appropriate only when the tests being employed assess important dimensions of student learning in a satisfactory manner.

Improving Instruction

One reason for monitoring the condition of mathematics and science education is to be able to improve instruction. Several applications of tests can help do so: tests can contribute to raising the standards of schools as to the skills and competencies to be taught and acceptable levels of performance. They can provide diagnostic information that would enable teachers to understand the reasons for failures and provide appropriate remedial treatment. Diagnostic information would be useful in school assessment at local, district, state, or even national levels as well: better understanding of why students develop erroneous problem-solving algorithms or fail to modify childhood misconceptions of physical principles would make possible actions at higher administrative and supervisory levels aimed at improving instruction.

Tests can also be used as dependent variables in experimental studies involving educational treatments or methodologies developed to improve instruction.

Test questions also can be used for teaching, as practice exercises with feedback. Much practice is necessary to acquire the complex skills required for development of the automatic processing and pattern-perception skills that are essential for the performance of more advanced problem-solving tasks. Such exercises might also provide, as a by-product, information that would be useful for large-scale assessment of student learning. Still another instructional application is to improve the articulation of instruction at various transition points, for example, between elementary, middle, and high school or between introductory and advanced college courses. Tests can determine whether students actually possess the basic knowledge and skills necessary for successfully dealing with the more advanced concepts and procedures taught at the next educational level. Although such instructional uses of tests are not directly related to their use as indicators, they are as important and provide equally valid reasons for developing better tests.

Criticisms of Current Testing

In the early years of this century, the assessment of student achievement was generally based on teachers' judgments, which were in turn based on teacher-made tests, homework, and impressions of classroom performance. But after the demonstration of the efficiency of objective tests by the use of the Army Alpha tests in World War I, a revolution in testing methods began. The invention of the multiple-choice test item and the development of fast and efficient test-scoring machines (Lindquist, 1954) made possible the mass testing of students on a very large scale. Testing agencies and test publishers hastened to develop multiple-choice tests, teachers were trained to write multiple-choice items, and many colleges set up testing bureaus to assist the faculty in preparing and scoring multiple-choice examinations. Except for the teacher-made tests that many teachers still rely on for grading students, multiple-choice tests have driven out virtually all other types of examinations because of their objectivity, speed, and economy (N. Frederiksen, 1984a).

From the standpoint of assessing the quality of education in science and mathematics, it is important to know to what extent information based on tests in current use provides a sound basis for judgment. Standardized multiple-choice achievement tests have been widely criticized not only by educators but also by students, parents, and the media. Some of the criticisms most relevant to the development of indicators of science and mathematics education are discussed below.

Multiple-Choice Tests Penalize Creative Thinking

This is a well-taken criticism, since most multiple-choice items do not provide much opportunity to generate new ideas. Students responding to a typical multiple-choice item begin by reading the stem (the expository or question part of the item); then they read the first option and make a narrow directed search of their memory store to find a match to the option. If they find information that clearly matches the option, they may mark it and go to the next item. If not, they read the next option and again seek a match and mark it or consider the next option, and so on until they either choose and mark an option that matches information stored in memory or skip the item. Such a process would appear to require little creative thinking. Of course, some multiple-choice items require more complex processing of information, but a large majority of the items in a typical achievement test measure factual knowledge.

In spite of the controversy, there has been little research on the mental processes involved in taking a multiple-choice test. Several investigators, however, have compared multiple-choice tests with their free-response counterparts, which were constructed by substituting an answer space for the multiple-choice options for each item (Vernon, 1962; Traub and Fisher, 1977; Ward, 1982; Webb et al., 1986). As judged by correlations and other statistical analyses, the format of the test was found to make little difference. With a few minor exceptions, for tests that were originally constructed with multiple-choice questions, both formats appeared to measure the same ability.

However, use of the multiple-choice format may tend to exclude the writing of items that require more complex thinking processes. If so, different results might be found if one began with free-response problems intended to elicit productive (rather than reproductive) thinking and converted them to the multiple-choice format. Such a comparison was carried out using a test that required students to formulate hypotheses that might account for the findings of an experiment (N. Frederiksen and Ward, 1978).

Indeed, quite different results were obtained than in the conversion from multiple-choice to free-response formats. The correlations between the two formats were generally low, and the pattern of relationships to various cognitive abilities was different. The two formats were similar with regard to their relationships to verbal ability and reasoning, but only for the free-response version were there substantial relations to a factor called ideational fluency, which represents the skills involved in making broad searches of the memory store in order to retrieve information relevant to a situation (Ward et al., 1980). In at least one instance, converting a test intended to measure productive thinking to multiple-choice format eliminated the need to broadly search the memory store for ideas that might be relevant, evidence that the multiple-choice format is not conducive to measuring productive thinking.

Multiple-Choice Tests Are Not Representative of Real-Life Problem Situations

There are at least two aspects of representativeness. One has to do with the frequency with which real-life problems occur in multiple-choice form. Occasionally people encounter problems with a limited number of clearly defined options, such as deciding whether to go left, right, or straight ahead at an intersection, or whether to take the morning or the afternoon flight to Miami. But more often there are many options, and one does not know what they are and must think of them for oneself. Multiple-choice options are almost universal in educational testing but rare in real life.

The other aspect of representativeness has to do with the extent to which the problems posed by test items are similar to problems that occur in real life. Problems encountered in real life generally involve situations that are far more complex and richer in detail than are provided by the stem of a multiple-choice item. Furthermore, there seems to be a tendency for testers to use stereotyped sets of test problems in both science and mathematics, problems that, for example, involve weights on inclined planes, pulleys, boats going with or against the current, and the number of pencils Jane has. Generalization of learning would be facilitated by schoolroom experiences that resemble problems in the world outside the classroom with respect to the variety and complexity of problem situations and settings. Use of test problems that simulate such situations would encourage such instruction (Norman et al., 1985).

Multiple-Choice Tests Are Undesirably Coachable

Any test is coachable in some sense and to some degree. Some kinds of coaching involve training that has nothing to do with the subject matter of the test, such as teaching students that the longest multiple-choice option is most likely to be correct and to avoid highly technical options; in such cases coaching may improve test scores somewhat without improving the ability presumably measured by the test. Another kind of coaching attempts to improve the ability measured by the test; a review of fractions and percentages, for example, might improve both test scores and the student's underlying competence in arithmetic. Test makers should attempt to construct tests that are coachable only in the sense that coaching and teaching are indistinguishable. Tests that are coachable in the undesirable sense not only result in wasted time; they also tend to falsify the data.

It is difficult to estimate the size of gains that are attributable to coaching (Messick, 1980). Most coaching is probably done by teachers in school settings and generally consists of attempts to teach the kinds of knowledge and skills that are measured by the tests. Coaching schools are more likely to attempt to teach test-taking skills, with less attention to the content of the test; fantastic gains have been claimed for such coaching (Owen, 1985), but without much evidence. The studies of coaching for the Scholastic Aptitude Test (SAT) and similar tests that were reviewed by Messick show modest gains: on the average, less than 10 points on the SAT-verbal and about 15 points on the SAT-mathematics test, on a scale of 200 to 800.

The gains are difficult to interpret, however, because of variations in methods of assigning students to the coached and control groups (often the coached students are volunteers), the methods, length, and content of coaching, and methods of analyzing the data. Thus, it is usually difficult to judge whether gains are attributable to (a) differences in ability or motivation, (b) the nature and length of the coaching, or (c) the methods and variables used in attempting to control statistically for differences between the coached and control groups. Messick suggests that the smaller effects seem to be associated with short-term cramming and drill and the larger effects with longer-term programs involving skill development, especially in mathematics, for which there is likely to have been greater variability with regard to opportunities or motivation for students to learn. Such results suggest that coaching is not likely to produce major distortions in the distributions of scores obtained from current tests.

However, even small average gains could lead to mistaken conclusions when test scores are used to monitor change in student achievement.

Multiple-Choice Tests Exert Undesirable Influence on the Curriculum

There are many reasons to believe that the nature of the tests used influences what teachers teach and what students learn. Students want to get respectable grades, or at least pass the course, and teachers believe that they may be evaluated on the basis of their students' test scores. Tests that fail to match the intended curriculum may therefore have undesirable effects on learning.

Testing had relatively little impact on instruction in the 1950s and early 1960s, but the situation began to change in 1965 when the Elementary and Secondary Education Act (ESEA) was passed. The act required that certain teaching programs funded by ESEA be evaluated, and future funding of programs often depended on the outcomes of the evaluations (Popham, 1983). Pressure to improve test performance increased during the 1970s, when test data showed that attainment of knowledge and skills was declining (Womer, 1981), and the National Assessment of Educational Progress (1982) reported decrements in performance. Still more pressure to "teach for the test" resulted from the decision of a federal judge in 1979 that Florida's use of a competency test to satisfy graduation requirements was unconstitutional unless preparation for the test was provided.

Educators representing a majority of the school districts identified by the National Science Teachers Association as exemplary in the teaching of K-6 science (Penick, 1983) have expressed concern at the mismatch between currently available standardized tests and their curricula. These districts are teaching inquiry-based, hands-on science, which both the scientific and educational communities strongly support, but the skills acquired by their students are not measured by the tests. At a conference on elementary science education held by the National Science Resources Center at the National Academy of Sciences in 1986, participants representing school districts with innovative programs expressed concern "that standardized achievement tests do not do a good job of assessing what students learn in elementary school science. There is a need to develop improved tests and alternative evaluation techniques to assess student progress in science, with more emphasis on the development of process skills and attitudes" (National Science Resources Center, 1986:3). As more school districts are striving to introduce more effective science programs in grades 1-6, the issue of correspondence between tests and curricular goals becomes particularly critical at this level.

Bloom (1984) wrote that "teacher-made tests (and standardized tests) are largely tests of remembered information.... It is estimated that over 90 percent of test questions that U.S. public school students are now expected to answer deal with little more than information. Our instructional material, our classroom teaching methods, and our testing methods rarely rise above the lowest category of the [Bloom] taxonomy: knowledge" (p. 13). Resnick and Resnick (1985:15), in commenting on state testing programs, stated:

It is appropriate . . . to think of minimum competency programs as an effort to educationally enfranchise the least able segment of the school population.... However, by focusing only on minimal performance, the competency testing movement has severely limited its potential for upgrading education standards. Only recently have some states begun to include higher level skills in their competency testing programs. It would be difficult to stress too much the importance of this move beyond the minimum . . . for there is evidence that examinations focused solely on low level competencies restrict the range of what teachers attend to in instruction and thus lower the standard of education for all but the weakest students.

An examination of the results of state testing programs in mathematics provides further documentation: children score well on items dealing with computation but less well on items dealing with concepts and problem solving, because the learning of these higher-order skills is not stressed in classroom instruction (Suydam, 1984).

The National Assessment of Educational Progress (NAEP) report (1982) previously referred to showed similar results. Performance by comparable populations of students on test items measuring basic skills did not decline compared with earlier assessments, but there was a decrease on items reflecting more complex cognitive skills. In mathematics, about 90 percent of the 17-year-olds could handle simple addition and subtraction, but performance levels on problems requiring understanding of mathematical principles dropped during the preceding decade from 62 to 58 percent. In science, performance declined for both kinds of items, the decrease being twice as large for items requiring more advanced skills.

It seems a reasonable conjecture that the mandated use of minimum-competency tests and concurrent emphasis on basic skills was at least in part responsible for these declines. It is possible, however, to use the influence of tests on what is taught to improve learning by constructing tests that require the more advanced skills. Such tests would thus provide incentives for improving the quality of education in science and mathematics (N. Frederiksen, 1984a).

In Chapter 7, the committee recommends that basic curriculum frameworks be developed for nationwide use, frameworks that represent the best opinions of working scientists and mathematicians, as well as educators, as to what should be taught and tested: a core of essential factual knowledge and the algorithmic and procedural skills and higher-order competencies for doing real science and mathematics. Tests that match such frameworks would influence teaching and learning in desirable directions.

Multiple-Choice Tests Are Not Based on Theory

This criticism is not one that is frequently voiced by critics, but it deserves mention. In one sense, multiple-choice testing is indeed based on a theory, namely, a very extensive theory of the mathematical and statistical considerations having to do with test reliability, validity, error of measurement, methods of item analysis, item parameters, equating of tests, latent trait models, and so on (e.g., Gulliksen, 1950; Rasch, 1960; Lord and Novick, 1968; Lord, 1980). This test theory is largely based on the assumption that items are scored objectively as either right or wrong, and the test score is the number right. Item-response theory, a relatively new and very influential part of test theory, assumes a multiple-choice format by taking account of guessing. This body of work has been extremely useful and important in the development of assessment methods. But none of this test theory is concerned with the content of the test items.

Another kind of theory, one that grows out of work in cognitive psychology and artificial intelligence, does provide a potentially useful basis for the development of tests based both on content and the cognitive processes that are involved in doing science and mathematics. Some of the implications of this work are described later in this chapter.
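To make the flavor of this test theory concrete, consider the three-parameter logistic model of item-response theory, a standard formulation from the general measurement literature (it is offered here as an illustration; this particular formula is not given in the report). The probability that an examinee of ability \theta answers item i correctly is

    P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}

where a_i is the item's discrimination, b_i is its difficulty, and c_i is a pseudo-guessing parameter, the floor probability of answering correctly by guessing (roughly 1/k for an item with k options). The formula illustrates both the power of the theory and the limitation just noted: every parameter describes the statistical behavior of responses, and none describes the content of the item.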

Science Content in Multiple-Choice Achievement Tests Is Questionable

In order to obtain information on the quality of the science content in currently used achievement tests, the committee asked 12 scientists and science teachers from several science fields to evaluate the items from 9 commonly used multiple-choice achievement tests. (Two individuals did not review the items but wrote general comments.) This attempt to evaluate tests is described in more detail in Appendix B. Since differences in average ratings between the tests were relatively small compared with the variability between the reviewers, no quantitative conclusions concerning their relative merits could be justified from their evaluations. There was agreement, however, that the tests were poor at probing higher-order skills and that they contained a significant (5 to 10 percent) number of flawed items. The remaining items were judged to be quite variable in their quality, such that it was not obvious that a positive change in test score would in fact mirror improvement in the quality of student learning. The committee's experience with this experiment in assessing science tests reinforces concern about the quality of the subject-matter content of some of the tests in common use, even while it emphasizes some of the difficulties in obtaining reliable evidence on this important question.

Some Virtues of the Current Testing System

Despite the criticisms that have been leveled by the committee and others at the current system of educational testing, it has a number of virtues that should be acknowledged. First, the multiple-choice format for testing makes possible the economical measurement of factual knowledge. This format allows the rapid and reliable scoring of tests at a relatively low cost. Therefore, it seems sensible to retain the conventional test format for doing what it does best: measuring factual knowledge and the ability to use the simpler kinds of procedural knowledge, such as the algorithms used in arithmetic computations (to the extent that they continue to be taught).

Two other useful developments in current testing systems are matrix sampling and the application of statistical methods to make possible test comparisons over time. Neither of these is limited to tests in the multiple-choice format. The use of matrix sampling allows one to obtain information about large populations of students without concomitant increases in cost and testing burden. Matrix sampling is analogous to the methods used in public-opinion polling, in that it requires drawing random or representative samples of subjects. But in addition to drawing random samples of subjects, matrix sampling also involves independently drawing random samples of test items (Wilks, 1962; Lord and Novick, 1968); thus random subsamples of students are given different subsamples of items.
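To make the logic concrete, here is a minimal simulation of matrix sampling (the numbers are invented; this is an illustration of the general idea, not a description of NAEP's actual procedures). Each sampled student takes only a small random block of items, yet percent-correct statistics for every item in the pool can still be estimated, because each item is reached by some random subsample of students:

    import random

    random.seed(1)

    NUM_STUDENTS = 1000   # sampled students
    NUM_ITEMS = 60        # full item pool
    BLOCK_SIZE = 12       # items any one student actually takes

    # Hypothetical "true" probability that a randomly chosen student
    # answers each item correctly (unknown in practice).
    true_p = [random.uniform(0.3, 0.9) for _ in range(NUM_ITEMS)]

    # Each student is assigned a random block of items and answers them.
    attempts = [0] * NUM_ITEMS
    correct = [0] * NUM_ITEMS
    for _ in range(NUM_STUDENTS):
        block = random.sample(range(NUM_ITEMS), BLOCK_SIZE)
        for item in block:
            attempts[item] += 1
            if random.random() < true_p[item]:
                correct[item] += 1

    # Estimated percent correct per item, obtained with far less testing
    # time than giving all 60 items to every student would require.
    est_p = [c / a for c, a in zip(correct, attempts)]
    mean_true = sum(true_p) / NUM_ITEMS
    mean_est = sum(est_p) / NUM_ITEMS
    print(f"true mean percent correct:      {mean_true:.3f}")
    print(f"estimated mean percent correct: {mean_est:.3f}")

With 60 items and blocks of 12, each student bears only one-fifth of the testing burden, while the estimated item statistics converge on the population values as the student sample grows.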

An adaptation of the item-sampling procedure used by NAEP involves what is called a balanced incomplete block design (Messick et al., 1983). This procedure makes possible the calculation of close approximations to the means, standard deviations, intercorrelations of tests and test items, and so on, that would be obtained if the entire school population had been tested. This is an important feature of the methods currently employed by NAEP. When tests are created that are more costly to administer and score than conventional multiple-choice tests, the use of matrix sampling will be critical for keeping costs within bounds.

Another virtue to note is that current testing methodology makes possible comparisons over time. The collection of data on learning indicators is of limited value unless the measurement can be repeated, since the purpose of school evaluation is to detect change, to see if student performance is improving. Given that test-score scales are arbitrary, measures taken on a single occasion may be of limited value. The only way in which such measures would be interpretable would be for the scores to have intrinsic meaning apart from comparative interpretations.

School evaluation is concerned not only with measuring change in the same individuals over a period of time but also with comparing the performance of successive groups of students at a particular stage of instruction, such as the end of the eighth grade. The latter kind of comparison is of particular interest at state and national levels. Unfortunately, it poses a difficult problem of interpretation because of possible changes in the composition of the groups that have nothing to do with instruction. And there are many other problems of interpretation due to the use of fallible instruments, the possibility (if not likelihood) that a given test does not measure the same abilities before and after a period of training, the lack of random assignment of students, the lack of equal units on a score scale, the unreliability of difference scores, and so on (see Harris, 1963). But statistical test theory has provided workable answers to many of these problems, for example, in the development of methods of equating scores on different versions of a test (Angoff, 1984). The development of item-response theory (Lord, 1980) provides workable solutions to other problems. The extensive test theory that has been developed should be retained, but it needs to be adapted as necessary for use with new testing procedures.

NEW METHODS OF ASSESSMENT

The procedures suggested in this section, if properly developed, could provide remedies for the problems described in the use of multiple-choice tests. They could provide both quantitative and qualitative information descriptive of how students perform the most important higher-order science and mathematics tasks. The results could reflect such attributes of performance as speed of responding, use of inference in problem solving, pattern-recognition skills, students' internal models of problems, and use of strategies and heuristics in solving problems.

Two major kinds of assessment procedures are considered. One consists of what might be called global measures, since the performance to be elicited will be evaluated as a whole. The other set of procedures yields processing measures, since they are descriptive of the information-processing components that influence the development of conceptual knowledge and overt performance of the student.

Global Assessment

A frequently used alternative to a multiple-choice test is an essay test in which the items elicit fairly long written responses. Such tests have the virtue that students not only must think of the ideas for themselves but also must organize them in an appropriate sequence and state them clearly. Essay tests have been justifiably criticized, however, on the basis of the subjectivity and unreliability of scoring. Reliability can be improved by pooling the grades of two or more readers; in the case of essays written to test English-language proficiency, a holistic method of grading is used in large-scale testing in which two or more judges are asked to read each essay quickly and rate it impressionistically, and the ratings are pooled. The result is that grades are more reliable, but no one knows precisely what they mean.

Another approach that has been tried involves the use of tasks that impose more structure on the response than does the typical essay question, so that one can know more precisely what skill is being measured (N. Frederiksen and Ward, 1978; Ward et al., 1980). In science, for example, the test problems might simulate tasks that are frequently encountered by scientists, such as formulating hypotheses that might account for a set of research findings, making critical comments on a research proposal, or suggesting solutions to a methodological problem. For example, in one exercise students were asked to write down the hypotheses that they thought should be considered in trying to account for the findings (shown in a graph or table) of an experiment or field study.

Development of materials to aid in scoring this kind of test requires a protocol-analysis procedure that includes the following steps: (a) making a classification of the ideas written by a sample of students, (b) writing definitions of the categories, and (c) having experts make judgments about the quality (and other attributes) of each category, in light of the information that was available to the students. Coders are trained to match each of a student's responses to a category, and scores can be generated by a computer on the basis of quality and other values attached to the categories.
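The final scoring step lends itself to simple mechanization. A minimal sketch follows, with invented category names and quality weights standing in for the actual products of steps (a) through (c):

    # Hypothetical category definitions produced by steps (a)-(c):
    # each category of hypothesis carries a quality weight assigned
    # by expert judges.
    quality_weights = {
        "testable_causal_hypothesis": 3.0,
        "plausible_but_untestable": 1.5,
        "restates_the_findings": 0.5,
        "irrelevant": 0.0,
    }

    def score_protocol(coded_responses):
        """Sum the expert quality weights over a student's coded responses."""
        return sum(quality_weights[cat] for cat in coded_responses)

    # One student's responses, as matched to categories by a trained coder.
    student = ["testable_causal_hypothesis",
               "testable_causal_hypothesis",
               "plausible_but_untestable",
               "restates_the_findings"]
    print(score_protocol(student))  # -> 8.0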

Tests of this sort were found to be poorer than Graduate Record Examination (GRE) scores for predicting first-year grades in graduate school, but they were better than the GRE for predicting such student accomplishments as doing original research, designing and building laboratory equipment, and being author or coauthor of a research report. Thus, there is at least correlational evidence that tests of the kind described above measure something related to productive thinking that is not measured by conventional tests.

More sophisticated methods of analyzing free-response protocols are being developed, methods that do not require the imposition of such a high degree of structure. These methods are based on discourse analysis (C. H. Frederiksen, 1975, 1985; van Dijk and Kintsch, 1984; C. H. Frederiksen et al., 1985); they make it possible to investigate understanding by analyzing free-response productions of students. Flexible computer environments are being developed that permit students to generate text based on their retrieval, generation, and manipulation of declarative knowledge in a knowledge-rich domain. The use of syntactic and semantic parsers makes it possible to analyze a student's responses to a task and to make their grammatical structure explicit on the screen. Analysis of the structure is then possible in terms of the student's prior knowledge of the topic, the knowledge representations generated in performing the assigned task, and the operations performed in generating links to new information.

One task, for example, required students to interpret the results of an experiment involving photosynthesis in terms of their knowledge of the chemistry of photosynthesis. Their task involved (a) comprehending the experiment, (b) retrieving relevant information from memory, and (c) generating appropriate links between (a) and (b). Protocols from different students demonstrate differences in approaches to the problem, such as forward and backward reasoning processes. Another approach to assessing performance is to display a student's structure as an overlay on a structure that represents a consensus among experts as to what constitutes an "ideal" answer.

Subjects at different grade levels or different levels of competency have been shown by such methods to differ with regard to patterns of performance in comprehending texts of different kinds (C. H. Frederiksen, 1984), and qualitative differences between novice and expert physicians in case comprehension have been identified (Patel and Frederiksen, 1984). Several states are experimenting with analogous methods for analyzing samples of student writing in state assessment programs, even without using computers to analyze individual protocols. The procedures require human judgment and are not intrinsically dependent on the computer, but computerized assistance may make the method feasible for widespread use.

There are many other possible formats, including not only tests that require written responses but also tasks requiring hands-on operation of laboratory equipment. For example, students can be given the necessary materials and equipment and asked to design and carry out models of scientific investigations that demonstrate understanding of such scientific concepts as density, conductivity, and capillarity. Such tests are already in use on a limited scale by NAEP (Blumberg et al., 1986; National Assessment of Educational Progress, 1987), IEA in the Second International Science Study (Jacobson, 1985), the British Assessment of Performance Unit Series (1983-1985), and others (Hein, in press).

The availability of microprocessor-based computers in the classroom is growing at such a rate that it is not unreasonable to assume that in the near future every classroom, from kindergarten upward, will have access to computers. (According to Becker [1986], a national survey conducted in 1985 found that between 1983 and 1985 the number of computers in use for school instruction quadrupled from 250,000 to over 1 million.) Furthermore, while costs are decreasing, processing power, ability to produce graphics displays, and mass storage capabilities are at very high levels.

The classroom computer can play a powerful role not only in evaluating learning but also in helping students learn science processes and the higher-order thinking skills involved (Goldstein, 1980; Sleeman and Brown, 1982). While the committee's chief interest is in improving assessment of student learning, it is important to consider as well the improvement of learning, given that software can be developed to serve both purposes simultaneously.

Improvement in learning is made possible because of the capability of the computer, with appropriate software, to simulate real-world scientific investigations (Clancey, 1979). Ideally, such simulations should reflect hands-on science done inside or outside of the classroom. The computer can be used to provide simulated experiments that reinforce, review, and extend the hands-on studies. Simulations also make it possible to speed up or slow down the progress of time, enlarge or shrink distances, and modify or eliminate such factors as friction and gravitation.

If such simulations are integrated into appropriate host software systems, they can be powerful tools for assessment. The host software could remember the performance of each individual student on a mass storage device, such as a floppy disk; could provide the classroom teacher with appropriate summary information on the class as a whole; and could provide the option to examine in as much detail as desired the performance of individual students. A simulation might be structured with regard to levels of achievement and could grant scoring points for good performance, just as good game software does. In this way, the simulations could give students valuable feedback as they use them, as well as storing information for the use of teachers and for the assessment of schools or school districts. Thus, the same information can be used for instructional or student evaluation purposes by the teacher, for local monitoring purposes by the principal or school superintendent, and as part of a state or national data base on student learning.

As possible instruments for national assessment, simulations would provide a solution to the problem of testing for real skills in doing science. They can be the kind of tests that should be taught to, tests whose use would generate higher-quality science instruction. It appears entirely practical to use simulations for classroom learning and to draw on a subset of the same group of simulations for local, state, and national assessment. From the standpoint of efficient use of financial and intellectual resources, this seems desirable. Since high-quality simulations are difficult and costly to create, it is important to maximize their use once they are in place. It is also more likely that better testing methods will be developed if at the same time they can be used to improve instruction.
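As a sketch of the record keeping such a host system implies (the record fields, simulation names, and point values are invented for illustration, not drawn from any existing system):

    from dataclasses import dataclass, field
    from statistics import mean

    @dataclass
    class SimulationRecord:
        """One student's performance on one simulated investigation."""
        student_id: str
        simulation: str
        points: int                    # scoring points granted by the simulation
        steps_taken: list = field(default_factory=list)

    class HostSystem:
        def __init__(self):
            self.records = []

        def log(self, record):
            self.records.append(record)

        def class_summary(self, simulation):
            """Summary information on the class as a whole, for the teacher."""
            pts = [r.points for r in self.records if r.simulation == simulation]
            return {"n_students": len(pts), "mean_points": mean(pts)}

        def student_detail(self, student_id):
            """Full detail on one student, examined as closely as desired."""
            return [r for r in self.records if r.student_id == student_id]

    host = HostSystem()
    host.log(SimulationRecord("s01", "density_lab", 8, ["weigh", "submerge", "compute"]))
    host.log(SimulationRecord("s02", "density_lab", 5, ["weigh", "guess"]))
    print(host.class_summary("density_lab"))
    print(host.student_detail("s02"))

The same records could feed three audiences at different grains: per-step detail for the teacher, class summaries for the principal, and aggregated scores for a state or national data base.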

Assessment of Conceptual Knowledge and Processing Skills

Cognitive scientists, including both psychologists and computer scientists working in the area of artificial intelligence, are developing models of intellectual functioning that have relevance for assessment (Bransford et al., 1986). Cognitive scientists view students as information processors who possess a variety of capabilities that enable them to learn and function intelligently. These include the development of conceptual knowledge: organizing information according to structures or frameworks appropriate to the subject matter so as to give it meaning or, in mathematics and science specifically, imposing meaning on formal symbols and rules (Resnick, 1987). For example, in the sciences, the way and the extent to which scientific principles are used to organize perception, problem solving, and reasoning distinguishes the novice from the expert.

The development of conceptual knowledge is supported by specific processing skills that assist in the absorption of information and its organization and use; they include processing speed, memory capacity, memory organization, factual knowledge, and procedural knowledge (Kyllonen, 1986). Procedural knowledge includes not only knowledge of algorithms but also the ability to plan and use various heuristics and strategies. All these capacities function interactively in contributing to learning and intelligent behavior. An understanding of how they function should facilitate instruction (N. Frederiksen, 1984b), and an ability to assess these capabilities should be valuable not only to teachers and curriculum designers but also to educators at state and national levels.

This information-processing conception of learning and intellectual performance is too complex to describe here. What follows are brief descriptions of a number of possible assessment procedures aimed at certain cognitive abilities, ordered roughly according to the complexity of the ability and the difficulties involved in assessing it. The procedures suggested are generally based on experimental methods that have been devised by cognitive scientists for research purposes. Few of the procedures have been used for assessment, and much work will be needed before they can be used systematically in assessing proficiency in science and mathematics.

Speed of Processing

Processing speed is typically measured in terms of response latencies (reaction time) in performing acts that are relevant to an area of expertise.

For example, in learning to read, the beginner must learn how to translate letter combinations into speech sounds and to relate those sounds to words stored in memory. These may be difficult tasks for a young child, but for a skilled reader they are performed very quickly and without attention. It has been shown that differences in response latencies in word analysis, discourse analysis (e.g., identifying the antecedent of a pronoun), and integrative processes (e.g., generating extrapolations from the text) distinguish the proficient reader from a less skilled reader (J. R. Frederiksen, 1982). Speed is important as an indicator because it shows that a process can be carried out automatically, without attention, and therefore does not interfere with other more complex mental processes that are going on simultaneously (Schneider and Shiffrin, 1977; Shiffrin and Schneider, 1977). In the case of reading, ". . . automaticity of word-analysis skills essentially frees processing resources for the purpose of discourse analysis" (J. R. Frederiksen, 1982:172) and ". . . these skills are poorly represented in conventional tests of reading comprehension" (p. 173).

The need for automatic processing in elementary arithmetic is well known to teachers (although probably not by that term), and they try to increase automaticity by such means as drill with flash cards. Use of a computer would facilitate such training and would also make it possible to measure response latencies and, thus, identify those instances of finger counting or some other "short-cut" method that actually increases response time. In algebra, automatic processing could be assessed by having the student carry out simple transformations of equations and measuring the response latencies. Moreover, patterns of latencies have been used to distinguish what kinds of procedures children use for addition and subtraction, for example, and how students and experts break algebraic equations into meaningful units. Thus, speed measures are useful not only for assessing automaticity but also for monitoring procedural skills.
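A minimal sketch of how a computer drill could capture such latencies (the cutoff value and items are invented for illustration; a real assessment would calibrate any threshold empirically):

    import time

    # Hypothetical cutoff: a correct answer slower than this suggests the
    # fact is being computed (e.g., finger counting) rather than retrieved.
    AUTOMATIC_CUTOFF_SECONDS = 2.0

    def drill(items):
        """Present each fact, record correctness and response latency."""
        results = []
        for prompt, answer in items:
            start = time.monotonic()
            response = input(f"{prompt} = ")
            latency = time.monotonic() - start
            results.append((prompt, response.strip() == str(answer), latency))
        return results

    items = [("7 + 6", 13), ("9 + 4", 13), ("8 + 5", 13)]
    for prompt, correct, latency in drill(items):
        if correct and latency < AUTOMATIC_CUTOFF_SECONDS:
            status = "automatic"
        elif correct:
            status = "slow but correct"  # candidate for more practice
        else:
            status = "error"
        print(f"{prompt}: {latency:.2f}s, {status}")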

Pattern Recognition

Pattern recognition is a skill related to speed of processing. With much practice one can learn to recognize very quickly a complex stimulus that may be embedded in a still more complex background. This phenomenon was first observed by deGroot (1965) in comparing chess grand masters with ordinary chess players. He found that grand masters were able to reproduce correctly the positions on a board of 20 to 25 chess pieces in a midgame position after seeing them for a few seconds, while ordinary players could reproduce correctly only a half-dozen pieces. Apparently grand masters had learned after years of staring at chess boards to quickly perceive and use patterns in processing data. Simon and Chase (1973) and Simon (1974) later timed the placement of the pieces and found that the intervals between placements were relatively short for the pieces in a cluster and that longer intervals defined the boundaries between clusters. Similar pattern-recognition skills have been identified in recognizing functional elements (e.g., stages of amplification) in a schematic by electronics experts (Egan and Schwartz, 1979) and in identifying the important signs and symptoms of a disease by experienced physicians (Barrows et al., 1982). Pattern recognition is important in many activities, and measures of this skill might be an indicator of proficiency because, like automaticity, such skill reduces the load on working memory and makes its resources available for other, more complex activities. Measuring latencies in responding to relevant tasks would be an appropriate method for assessing a pattern-recognition skill.
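The timing analysis described by Simon and Chase can be expressed directly (a minimal sketch with invented timestamps): placements separated by short intervals are grouped into one perceived cluster, and a long pause is taken to mark a cluster boundary.

    # Seconds at which a player placed each recalled chess piece
    # (invented data for illustration).
    placement_times = [0.0, 0.4, 0.9, 1.2, 4.5, 4.9, 5.2, 9.8, 10.1]

    # A gap longer than this is treated as a boundary between clusters.
    BOUNDARY_GAP_SECONDS = 2.0

    clusters = [[placement_times[0]]]
    for prev, curr in zip(placement_times, placement_times[1:]):
        if curr - prev > BOUNDARY_GAP_SECONDS:
            clusters.append([])      # long pause: start a new cluster
        clusters[-1].append(curr)

    print(f"{len(clusters)} clusters of sizes {[len(c) for c in clusters]}")
    # -> 3 clusters of sizes [4, 3, 2]

On this (fabricated) record, the player recalled the position as three chunks of four, three, and two pieces; an expert would show larger and more numerous chunks than a novice.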

Organization of Knowledge

How knowledge is organized in long-term memory may be another useful indicator of an aspect of information processing. The elements in long-term memory are items of information and clusters of such items, which are interrelated in complex ways to form an extremely large system. The organization may involve temporal, spatial, hierarchical, causal, and other kinds of relationships. Presumably the organization depends on the number and kinds of experiences one has had with the elements, and retrieval would depend on the strength of their interrelationships (Hayes-Roth, 1977; Gentner and Gentner, 1983). Highly organized cognitive structures are formed as one acquires expertise in an area such as mechanics or forestry. Since accessibility of stored information depends on how it is organized, it would undoubtedly be useful to know how information is organized in the minds of students and how that organization changes with practice.

One cannot hope to discover how all the information in memory is organized, but methods are available for assessing the structure of knowledge in particular domains. One method is to ask students to recall items of information and to time the responses, a method analogous to that used to investigate the size and nature of clusters of chess pieces as perceived by grand masters. Sets of closely related items tend to occur with short latencies, while longer intervals tend to mark the boundaries between sets. Another method is merely to have students sort the elements into clusters.

A more sophisticated method makes use of judgments of similarity between pairs of words that represent the key concepts in a domain (e.g., in mechanics, such words as mass, force, velocity, acceleration, density, volume). A student's ratings of all the possible pairs are analyzed by multidimensional scaling, which produces a multidimensional representation of a structure. This structure then can be compared with that obtained from the judgments of experts (Shavelson, 1972, 1974; Meyer and Schvaneveldt, 1976; Preece, 1976; Diekhoff, 1983; Schvaneveldt et al., 1985). The structure based on the judgments of experts in physics was found to fit a structure based on physical theory, and student structures were found to improve with instruction in physics (Shavelson, 1985). Thus, it seems feasible to develop for a variety of subject-matter areas assessment methods that provide some information about the organization of information in memory for individuals or for groups of students.
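As a minimal sketch of the comparison step (with invented ratings, and a simple correlation standing in for the multidimensional scaling analyses used in the studies cited above), a student's pairwise similarity ratings can be compared with pooled expert ratings over the same concept pairs:

    from itertools import combinations
    from math import sqrt

    concepts = ["mass", "force", "velocity", "acceleration"]
    pairs = list(combinations(concepts, 2))

    # Similarity ratings (1 = unrelated, 9 = nearly identical) for each
    # concept pair; the expert values stand in for a pooled consensus.
    expert = {("mass", "force"): 6, ("mass", "velocity"): 2,
              ("mass", "acceleration"): 3, ("force", "velocity"): 4,
              ("force", "acceleration"): 8, ("velocity", "acceleration"): 7}
    student = {("mass", "force"): 4, ("mass", "velocity"): 3,
               ("mass", "acceleration"): 2, ("force", "velocity"): 5,
               ("force", "acceleration"): 6, ("velocity", "acceleration"): 8}

    def pearson(xs, ys):
        """Pearson correlation between two equal-length rating lists."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    xs = [expert[p] for p in pairs]
    ys = [student[p] for p in pairs]
    print(f"structure match with experts: r = {pearson(xs, ys):.2f}")

A rising correlation over a course of instruction would be one crude index that the student's concept structure is converging on the expert structure.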

Skill in Retrieving Information

The accessibility of information stored in memory has for many years been assessed by means of aptitude tests presumed to measure the fluency with which associations and ideas are generated. The ability is very general and is thought to be related to creativity. Analogous tests might be useful in certain specific domains of expertise to elicit responses related to particular topics in those domains. Students of botany, for example, might be asked such questions as "What might be the cause of the fruit dropping from an apple tree before the apples are ripe?", and the test might be scored in terms of the number and quality of the ideas.

Internal Representations of Problems

How students conceive of a problem has much to do with their success in solving it. A given student's representation, or mental model, might take the form of a set of verbal propositions, a spatial arrangement of the problem elements, a picture, a chart or diagram, an equation, or an algorithm (see Larkin, 1979; Larkin et al., 1980). If a crucial element is omitted or if the representation is inaccurate, solving the problem will be difficult or impossible. It would be useful to know what problem representations are used by students when they attempt to solve a certain type of problem.

The most commonly used method in studying problem-solving behavior is the "think-aloud" method of collecting protocols, in which students are instructed to report what they are thinking as they attempt to solve a problem (Newell and Simon, 1972; Ericsson and Simon, 1984). Once a protocol is obtained, it may be interpreted in terms of the cognitive processes that are involved. This type of analysis has been used with some success in mathematics; pairs of students have been videotaped as they discuss a problem on which they are working together (Schoenfeld, 1982). Methods using protocol analysis would be useful in investigating how a problem is represented internally and how that representation changes with training and practice.

Another method of studying problem representations involves asking experts and novices to sort a set of problems into categories. The results in physics, where the method has been applied, indicate that novices tend to sort the problems on the basis of superficial characteristics, such as the use of inclined planes or pulleys, while experts categorize the problems in terms of the physical principles that are involved (Chi et al., 1981). Asking students to sort problems is thus a possible way of discovering something important about the internal representations of problems that they use.

Research on the misconceptions that many students have regarding physical phenomena shows the importance of discovering student conceptions of problems (Stevens et al., 1979; McDermott, 1984). For example, it has been shown that some children believe that they are able to see an object because their vision goes from the eye to the object, rather than because light from the sun is reflected by the object to the eyes (Anderson and Smith, 1983; Anderson, 1985). And it is reported that an appreciable number of students, even those who have had a course in physics, believe that when an object is released from the rim of a spinning wheel it will follow a spiral trajectory in space. Such misconceptions have been shown to be so enduring that some students reinterpret statements of physical laws to make them consistent with the misconception. Misconceptions about physical phenomena often can be discovered by asking a student to draw or otherwise indicate what he or she thinks is happening or would happen under certain conditions.

Computers have been used to assess students' understanding of physical laws. One simulation depicts a Newtonian world without friction or gravitation in which objects obey the laws of motion. When given the task of moving objects from place to place by applying force, students are often surprised by the results, indicating inadequacies in their understanding of Newtonian physics (White, 1983). Such a simulation could be used both for assessment and for instruction.
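The heart of such a simulation is nothing more than Newton's laws applied step by step. The sketch below is our illustration, not White's actual program: it moves an object in one dimension, and it exposes the common misconception that motion dies out when the force stops, whereas in a frictionless world the velocity persists.

```python
# Sketch of a one-dimensional frictionless "Newtonian world": an impulse
# changes velocity, and the velocity then persists with no further force.

def simulate(impulses, mass=1.0, dt=1.0, steps=8):
    """impulses maps time step -> applied force; all values illustrative."""
    x, v = 0.0, 0.0
    for t in range(steps):
        f = impulses.get(t, 0.0)
        v += (f / mass) * dt       # F = ma, integrated over one time step
        x += v * dt
        print(f"t={t}  force={f:4.1f}  velocity={v:4.1f}  position={x:5.1f}")

# A single push at t = 0: the object keeps moving forever afterward, which
# surprises students who expect motion to die out once the force is removed.
simulate({0: 2.0})
```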

Procedural Knowledge

The term procedural knowledge includes not only knowledge of such routine procedures as the algorithms used in computation but also more complex skills. Complex skills may involve, for example, planning the steps to be taken in solving a problem and the use of strategies or such heuristics as means-end analysis, reformulating a problem, or thinking of analogies to a problem situation.

Computer programs have been developed that make it possible to discover the erroneous algorithms ("bugs") that some students use in attempting to solve arithmetic problems (Brown and Burton, 1978; Brown and VanLehn, 1980). One well-known bug, for example, involves subtracting the smaller digit in each column from the larger, regardless of which one is on top. Many other bugs have been found to exist that are unknown to most teachers.
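The "smaller-from-larger" bug is easy to state procedurally, which is what makes it detectable by machine: a diagnostic program can predict the answer each candidate bug would produce and match those predictions against a student's actual answers. The sketch below is ours, far simpler than the diagnostic systems cited above, and contrasts the buggy rule with correct subtraction.

```python
# Sketch: the "smaller-from-larger" subtraction bug, applied column by
# column, versus correct subtraction (which borrows when needed).

def buggy_subtract(top, bottom):
    """In each column subtract the smaller digit from the larger,
    ignoring which number is on top (so borrowing never happens)."""
    t = str(top).zfill(len(str(bottom)))
    b = str(bottom).zfill(len(str(top)))
    digits = [abs(int(td) - int(bd)) for td, bd in zip(t, b)]
    return int("".join(map(str, digits)))

for top, bottom in [(524, 298), (61, 19), (305, 27)]:
    correct, buggy = top - bottom, buggy_subtract(top, bottom)
    flag = "" if buggy == correct else "  <- bug would be detected here"
    print(f"{top} - {bottom}: correct {correct}, buggy {buggy}{flag}")
# e.g., 524 - 298: correct 226, buggy 374
```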

New computer programs provide detailed information about the sequence of steps (the solution path) taken by a student, and, from that information, the strategic errors committed because of inadequate mathematical understanding may be inferred. Other programs are intended to discover and assess the depth of a student's understanding of an area of expertise. For example, computerized algebra tools now being developed permit students to see and manipulate the array of possible steps that they could take as they attempt to solve an algebra problem. Knowing the path students take through this "search tree" reveals much more about their skills in algebra, including such metacognitive skills as choosing an appropriate strategy, profiting from errors, and monitoring one's own performance, than does the number of correct answers to the problems. Similar programs are now available in other areas of mathematics, including the Geometric Supposer (Schwartz and Yerushalmy, 1985) and the Semantic Calculator (Schwartz, 1983).

Computerized coaching systems are being developed that monitor a student's problem-solving performance. Based on diagnostic models that are integral parts of the system, computer programs can be designed that offer advice to the student and at the same time provide detailed assessments of his or her capabilities (e.g., Burton and Brown, 1979; Anderson et al., 1985). Computerized medical problem-solving programs have been developed that offer the physician not only advice but also explanations or reasons for the advice (Reggia et al., 1985). Such systems are now capable of assessing performance in very complex domains.

Another feature of the computer is that it can keep track of the collection of strategies that a student tries in solving a problem and then generate a summary of what he or she has tried and has neglected to try. Thus, the computer opens up several new possibilities for assessment. The interactive nature of the student-computer relationship allows the student's capabilities to be progressively disclosed; if the student is unable to deal with a problem, more information or hints can be given (Reiser et al., 1985). In this manner, a single problem can be used for both assessment and instructional purposes.

Not all the computerized assessment procedures described above can be administered with a microcomputer; some may require the use of a sophisticated work station. The costs of such work stations have been decreasing at a rapid pace and are likely to continue to do so. Within five years, such equipment will not be out of reach, at least for assessments on a four-year cycle. In the meantime, much can be done with small computers. As the cost of computers continues to decline, more assessments will become affordable.

A note of caution is in order. Too much reliance on computerized testing and teaching may create a tendency to substitute computer simulation for real-world experience, or to tilt testing methodology toward those exercises that are most easily computerized. Users and creators must be alert to minimize such tendencies, and innovative assessment devices that do not require a computer should also be developed and made available.

The Development and Use of New Methods

None of the assessment methods described in this section can compete with multiple-choice tests from the standpoint of economy and efficiency, although matrix sampling makes their use more feasible. However, investment in the development of the recommended new methods and the cost of using them are, in the committee's view, justifiable not only because these methods would provide information for a far more accurate and complete assessment of instruction and student learning, but also because they are likely to be useful in the instructional process itself (see, e.g., Linn, 1986). Exercises could be used not only for assessment but also for practice and to provide information for remediation, and assessments based on exercises designed to probe higher-order learning should raise educational standards by providing models of performance to be emulated by both students and teachers.

An organization is needed to encourage, conduct, and coordinate the development of the needed assessment materials. The development of new assessment materials is costly, in both money and intellectual resources; needless duplication of effort must be avoided. This implies that the areas most in need of research and development of assessment techniques must be defined, newly developed instruments must be evaluated for their quality, and facilities for the distribution of materials to schools and teachers must be created.

The problem of test validation is particularly important for any new generation of tests that may be developed to assess proficiency in science and mathematics. The approach that has typically been used for test validation, finding a variable that may be thought of as a criterion and computing a correlation, will probably not be feasible, since no reasonable criterion is likely to exist. Clearly, another method is needed. The most reasonable method for validating the kinds of tests that have been proposed is construct validation (Cronbach and Meehl, 1955). Messick (1975) defines construct validity as "the process of marshalling evidence in the form of theoretically relevant empirical relations to support the inference that [a test score] has a particular meaning" (p. 955). The implication is that a theory about the nature of the performance in question is necessary, and validation of a test involves a scientific investigation to see if the procedures and cognitive processes displayed in taking the test are consistent with the theory.

A study of the construct validity of free-response tests intended to measure skill in problem solving may be used as an illustration (N. Frederiksen, 1986). One test consisted of diagnostic problems that simulated a meeting of a doctor and a new patient, and the other test involved nonmedical problems, such as why there are relatively fewer old people in Iron City than in the rest of the state. Both tests used a format that required examinees to go through several cycles of writing hypotheses, asking for information to test their hypotheses, and revising the list of hypotheses until they arrived at a solution. The subjects were fourth-year medical students. The theory about cognitive processes assumed that such verbal skills as reading, various reasoning abilities, science knowledge, and cognitive flexibility (the ability to give up unpromising leads) would all be involved for both kinds of problems.

In addition, skill in retrieving relevant information from long-term memory would be important, and, in the case of the medical problems, medical knowledge would of course also be necessary. The salient findings from a correlational analysis of the data showed that, as expected, medical knowledge was clearly the most important resource in solving the medical problems, and of course it was of little or no help in dealing with the nonmedical problems. For nonmedical problems, ideational fluency, or skill in retrieving relevant information from memory, was by far the best predictor of performance, but it was of little or no value in solving medical problems. Thus the information-processing theory had to be revised.

Embretson (1983) reports a more elaborate study involving latent-trait modeling to identify the theoretical mechanisms that underlie performance on a task and to explore the network of relationships of test scores to relevant variables. Experimental methods for testing a theory about test performance are also feasible and probably are preferable to correlational methods.
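In code, the core of such a correlational check is small. The sketch below fabricates scores purely for illustration (no real data from the Frederiksen study are used) and computes the predictor-criterion correlations whose pattern, strong where theory predicts a relation and near zero where it does not, constitutes the construct-validity evidence.

```python
# Sketch of construct validation by correlation: simulate examinees whose
# medical-problem scores depend on medical knowledge and whose nonmedical
# scores depend on ideational fluency, then recover that pattern.
import numpy as np

rng = np.random.default_rng(0)
n = 200                                   # hypothetical examinees
med_knowledge = rng.normal(size=n)
fluency = rng.normal(size=n)

# Fabricated "true" score model, for illustration only.
medical_score = 0.8 * med_knowledge + 0.2 * rng.normal(size=n)
nonmedical_score = 0.8 * fluency + 0.2 * rng.normal(size=n)

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

print(f"medical knowledge vs medical problems:    {r(med_knowledge, medical_score):+.2f}")
print(f"medical knowledge vs nonmedical problems: {r(med_knowledge, nonmedical_score):+.2f}")
print(f"fluency vs medical problems:              {r(fluency, medical_score):+.2f}")
print(f"fluency vs nonmedical problems:           {r(fluency, nonmedical_score):+.2f}")
# The theory is supported if the theoretically predicted correlations are
# high and the others near zero; a mismatch forces revision of the theory.
```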

Summary

Currently available multiple-choice tests are adequate primarily for assessing student learning of the declarative knowledge of a subject. They are not adequate for assessing conceptual knowledge, most process skills, and the higher-order thinking that scientists, mathematicians, and educators consider most important. Since current efforts to improve curricula are beginning to concentrate on these skills, new tests and other assessment devices are needed to serve as national indicators of student learning in mathematics and science. The tests should include exercises that employ free-response techniques, not only pencil-and-paper problems but also hands-on science experiments and computer simulations. Tests for measuring the component skills involved in reasoning and problem solving should also be developed. The improvements in testing can be made feasible, despite higher costs, by the use of computer-based techniques, by matrix-sampling methods, and by the use in instruction of exercises developed for the tests.

Currently the area of greatest curricular change is in elementary school, grades K-5. A number of school systems are attempting to implement inquiry-based, hands-on instructional programs in science. These programs are considered exemplary by both scientists and science teachers, and they urgently need the support of assessment instruments that match the new emphasis on teaching for understanding and for more complex thinking skills. Prototypes of free-response techniques exist that could be adapted for use at the K-5 level in the near future.

Recommendations

Indicators of student learning at the national, state, and local levels should be based on scores derived from assessment methods that are consonant with a curriculum that includes all major curricular objectives, including the learning of factual and conceptual knowledge, process skills, and higher-order thinking in specific content areas. Such tests should exhibit a range of testing methodology, including use of free-response formats.

Research and Development: To provide the requisite tests for use as indicators of student learning, the committee recommends that a greatly accelerated program of research and development be undertaken aimed at the construction of free-response materials and techniques that measure skills not measured with multiple-choice tests. The committee urges that the development of science tests at the K-5 level receive immediate attention.

Techniques to be developed include problem-solving tasks, as exemplified by the College Board Advanced Placement Tests; pencil-and-paper tests of hypothesis formulation, experimental design, and other tasks requiring productive-thinking skills, as exemplified by questions in the British Assessment of Performance Unit series; hands-on experimental exercises, as exemplified by some test materials administered by the National Assessment of Educational Progress (NAEP) and the International Association for the Evaluation of Educational Achievement (IEA); and simulations of scientific phenomena with classroom microcomputers that give students opportunities for experimental manipulations and prediction of results.

The creation of new science tests for grades K-5 should be done by teams that include personnel from the school districts that have been developing hands-on curricula, to ensure that the new tests match the objectives of this type of instruction.

In addition to providing valid national indicators of learning in areas of great importance, such new assessment materials for science in grades K-5 will provide models of tests that state and local school officials may want to adopt and use.

Key Indicator: The committee recommends that assessment of student learning, using the best available tests and testing methods, continue to be pursued in order to provide periodic indicators of the quality of science and mathematics education.

Tests should be given to students in upper-elementary, middle, and senior high school (for example, in grades 4, 8, and 12). Because of the rapid changes taking place in science instruction in grades K-5, assessment at this level should be carried out every two years, using exercises developed according to the preceding recommendation. For higher levels, a four-year cycle is appropriate. The tests should be given to a national sample, using matrix-sampling techniques, in which each sampled student takes only a portion of the full pool of exercises (a sketch of such an assignment appears below). Test scores should be available for each test item or exercise and should be reported over time and by student subgroups (e.g., gender, race, ethnicity, rural/inner-city/suburban community). As in previous assessments, results should also be reported by geographic region; efforts now under way may make possible state-by-state comparisons in the future. Similar procedures are appropriate for indicators of state or district assessments of student learning.
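A minimal sketch of what matrix sampling means operationally, with invented numbers: the item pool is split into blocks, each sampled student is assigned only one block, and item-level statistics remain estimable because every item is still taken by a sizable subsample.

```python
# Sketch of matrix sampling: no student takes the whole pool, yet every
# item is administered to a random subsample of students.
import random

random.seed(0)
n_items, n_blocks, n_students = 60, 6, 3000   # illustrative sizes
items = list(range(n_items))
random.shuffle(items)
blocks = [items[i::n_blocks] for i in range(n_blocks)]   # 10 items each

# Assign each sampled student one block, cycling so blocks stay balanced.
assignment = {s: blocks[s % n_blocks] for s in range(n_students)}

print(f"student 0's block: {sorted(assignment[0])}")
print(f"each student answers {n_items // n_blocks} of {n_items} items;")
print(f"each item is still answered by {n_students // n_blocks} students.")
```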

Research and Development: The committee recommends that a research and development center be established to provide for the efficient production, evaluation, and distribution of assessment materials for use as indicators of student learning at district, state, and national levels and for use by teachers in instruction.

The center should function as a centralized resource and clearinghouse that would make it possible for school people to survey the available assessment materials and obtain those desired. It might be called the National Science and Mathematics Assessment Resource Center. It should be tied closely to efforts to improve the curriculum and be an active partner in the total system of educational reform.

The committee suggests that, as a beginning, a group of experts be convened to prepare a plan for the creation of the proposed center, including its management and operation, and that the plan serve as the basis for the founding of the center by a suitable educational establishment or a consortium of universities and educational research organizations.

IMPLICATIONS FOR STATE EDUCATION AGENCIES

The assessment of what students have learned and of their ability to apply that knowledge is a major task of accountability for state education agencies. Such assessments can function to assure the public and their elected representatives that both human and material resources are available and meet certain standards, that the resources are appropriately distributed to schools, and that the effects of all the human and monetary investments are reflected in student learning. Given that basic premise, the state has a vital stake in valid yet feasible ways to evaluate what students know about mathematics and science.

The state's role of leadership in assessment is quite important, and the committee is concerned that the complexities of assessing student learning be clearly understood and then attacked. If the state language-arts assessment is merely a multiple-choice grammar test, a direct message (intended or not) is sent to every teacher that the writing process itself is not important. Similarly, in the committee's view, if a state or school science assessment consists solely of a multiple-choice test, then the measurement is equally limited.

Representatives of state and local systems told the committee that the recommended assessment resource center, if it were to be implemented, would fill a major gap for schools, states, and the Assessment Center Project of the Council of Chief State School Officers (see Appendix D). The assessment approaches based on hands-on investigation and computer simulation that would evolve from the proposed resource center could serve two functions for states and local communities. On a sample basis, the results of assessments using such new techniques would themselves be an important indicator at the state and national levels of student learning, and simultaneously such an assessment approach would provide a model that the committee believes to be important. While states may be able to contribute to the assessment resource center, probably only a nongovernmental institution could muster sufficient resources to develop and evaluate the new approaches, as well as to create imaginative ways to improve traditional multiple-choice testing of factual knowledge and simpler kinds of procedural knowledge.

The curriculum frameworks discussed in Chapter 7 should guide the development at the proposed resource center of outcome measures, including measures not only of factual and conceptual knowledge, but also of the information-processing skills that are necessary for acquiring proficiency in science and mathematics.

ASSESSING ADULT SCIENTIFIC AND MATHEMATICAL LITERACY

There are several reasons why assessment of student learning should be extended to assess trends in the science and mathematics literacy of the entire population. First, one of the reasons to care about the quality of mathematics and science instruction in school is that it influences mathematics and science literacy throughout the population; trends in the mathematics and science literacy of adults will in time provide information about the long-term consequences of attempts to improve the science and mathematics education provided in the nation's schools. The issue of adult literacy also raises important questions as to whether schools should emphasize immediate knowledge retention or learning that is likely to be retained in adulthood. Second, children's interest in mathematics and science is influenced by the extent to which the adults in their lives know about and show an interest in these subjects. Consequently, changes in the science and mathematics literacy of adults may foster changes in the skills and attitudes about science and mathematics that students bring to school. Third, and most important, the science and mathematics literacy of adults is itself a major goal of science and mathematics education.

Results of previous efforts to assess scientific literacy in the United States have not been reassuring. For example, Miller (1986) reports on surveys of U.S. adults conducted in 1979 and 1985 that included questions on the meaning of scientific study, cognitive science knowledge, and attitudes toward organized science. On the basis of the survey responses, he classified 7 percent of the public as scientifically literate in 1979 and 5 percent in 1985. Young adults (ages 17-34) did slightly better (11 and 7 percent, respectively); also, the percentage increased with increasing education.

However, within the population who were high school graduates but who had not gone on to college, only 2 percent in 1979 and 3 percent in 1985 were deemed to be scientifically literate. Such results underscore the need for future study of the population's scientific literacy and of the long-term effects of science education.

Desired Attributes of Indicators

Any plan to generate indicators of scientific and mathematical literacy should try to estimate the degree to which a population possesses the kind of knowledge and intellectual skills outlined in Chapter 2. Assessment plans should be based on the following considerations:

• A single measure will not do, because science and mathematics literacy involves multiple dimensions of a complex set of characteristics. The indicators to be used should be matched to the models of literacy discussed in Chapter 2.
• The indicators should recognize that there is no single, absolute level of literacy and that various levels of attainment in different components of a community or population group are likely.
• Any measures used to generate indicators should be supplemented by research to validate what is actually being measured.
• Indicators may be expressed in terms of descriptive patterns of problem solving and in other nonnumerical ways.

At this stage, there is no particular reason to favor one method of data collection over another. Therefore, several techniques, such as surveys (see, e.g., Miller, 1983), interviews, and case studies, should be considered in deciding what information to collect in order to develop indicators. As with students, traditional methods may work reasonably well to assess knowledge, but indicators should also probe the population's understanding of the nature of science and its role in society. It is particularly important, and particularly difficult, to obtain reliable estimates of problem-solving skills. In the committee's view, their assessment must go beyond individual pencil-and-paper tests and should include observation and analysis of individual and group responses to carefully selected phenomena involving real objects and filmed sequences of events.

In some sense, the need for assessment of adult literacy is not as urgent as the need for assessing students; after all, fewer policy decisions will or can be driven by such assessments. Therefore, the next two or three years can be devoted to the interim development of pencil-and-paper tests and tests involving real objects.

These will provide a measure of adult literacy that can be correlated both with existing tests of learning (say, of 17-year-olds) and with the assessment techniques that the committee has proposed for in-school learning. Since an important aspect of science and mathematics literacy is continuing self-education, some of the assessment techniques suggested in the preceding sections may also be appropriate for adult literacy.

Target Populations for Assessment

The committee considers education policy makers for elementary and secondary schools at state, local, and also national levels to be prime users of indicators of the quality of science and mathematics education. This has implications, and raises interesting issues, for the design of a set of indicators to assess the scientific and mathematical literacy of adults. One issue, for example, is how the out-of-school population should be stratified in various ways for assessment purposes. One way is to divide it into the following groups:

• Parents and guardians of children enrolled in elementary or secondary schools, public or private; alternatively, those with school-age children.
• Individuals who work in mathematics- or science-related fields or use mathematics or science in their work.
• Individuals, stratified perhaps by age groups related to other national surveys, such as the National Assessment of Educational Progress and the longitudinal follow-up surveys of earlier high school classes sponsored by the Center for Education Statistics (National Center for Education Statistics, 1981, 1984).

Considering the first group: if an in-school science assessment includes a particular student, should the parents or guardians of that student be included in a science literacy assessment? If so, should the assessment include both parents, a randomly selected parent, the mother, the father, or some combination of these?

Data Collection Strategies

The following suggestions outline an initial program and illustrate one way in which a measurement effort might begin. The agency assigned responsibility for the measurement of scientific and mathematical literacy should be given responsibility for developing the details of the methodology.

An initial program would be devoted to:

• providing benchmark data for the country as a whole, using largely available material, and
• research to develop, validate, and field-test instruments to better measure people's understanding of the nature of science and to obtain reliable estimates of their problem-solving skills.

The projected interviews and administration of exercises probably would require personal visits to households by the interviewers, although some screening of households and some data collection might be done by telephone. The assessment might begin by providing benchmark data for all adults by gender and by broad age group and for parents and guardians of school-age children. This data base would later be expanded to provide measures for subgroups of the population, for example, by educational attainment and by race and ethnicity.

Although the program would be targeted to adults 17 years of age and older, it should be expanded to include children in elementary and secondary schools as in-school testing programs begin to include the measurement of scientific and mathematical literacy. The objective would be to provide links between school and household measures, as well as to provide a household-based unit of analysis for adults and children.

If the assessment is to serve as a reliable base for policy, it will need to be based on a probability sample for which estimates of statistical reliability can be provided. The goal should be a high rate of cooperation in the survey by the individuals selected in the sample; completion rates of 85 to 90 percent are a reasonable expectation. The sampling could be based on a multistage approach. At the first stage, a sample of perhaps 100 areas would be selected. These areas would be counties or school districts and, if spread proportionately across the country, would be distributed across about 40 states. The sample could, however, be designed to include all states. Within each area, a sample of no fewer than 50 adults would be drawn from randomly selected city blocks or corresponding small areas outside cities, with at least 5 households sampled per block and 1 adult interviewed per household. Households would be sampled for this purpose according to their number of adults in order to give each adult tested approximately the same weight. With an 80 percent cooperation rate, this plan would yield interviews or tests with no fewer than 4,000 adults. That number would provide an adequate data base for analysis.
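The arithmetic behind the 4,000 figure is straightforward; the sketch below simply makes the committee's numbers explicit.

```python
# The committee's sampling plan, restated as arithmetic.
areas = 100                    # first-stage sample of counties/districts
adults_per_area = 50           # at least 5 households/block, 1 adult each
cooperation = 0.80             # conservative; 85-90 percent is the goal

completed = areas * adults_per_area * cooperation
print(f"expected completed interviews: {completed:.0f}")   # 4000
```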

To monitor changes in the population, the basic survey should be repeated at four-year intervals. During intervening years, effort could be concentrated on developing and testing improved assessment methodology.

Assessing Grasp of Grand Conceptual Schemes

As with school students, it is important to find out to what extent the adult population is familiar with key scientific concepts and understands their applications. While such high-order knowledge may seem at first to resist assessment, it can be probed with the following kind of exchange, probably best administered in an interview:

• Listen to (or look at) this list of ideas: plate tectonics, evolution, gravitation, the periodic table. Is there one of them that you would be willing to talk about a bit more?
• Response (for example): plate tectonics.
• I'd like you to take a few minutes to think about plate tectonics. Please think about these two questions and answer them in whichever order you prefer. How would you briefly describe plate tectonics to someone who didn't know about it? What examples can you give me of things or events that plate tectonics causes or is involved in?
• Would you like to talk about another of these ideas?

Several aspects of this sample exchange are important. First, it is in the free-response format, which is needed to probe the active knowledge of the respondent and to permit flexibility in answering. Second, it evokes both a definition and specific applications of the selected "grand conceptual scheme." Since part of the power of these schemes is their ability to unify phenomena, being able to define the terms without appreciating any of the applications is to lose much of their force. Third, by including some example that virtually all adults have encountered, a minimum level of literacy can be assessed. Finally, because the questions are open-ended and recursive, they permit assessment of both breadth and depth. Although it may be difficult to do so, it would be important to establish to what extent people's responses are based on knowledge gained in school and to what extent they draw on knowledge gained from subsequent reading, television programs, museum visits, and so on, even given that school ought to teach one to continue to learn beyond one's formal education.

Recommendation

Key Indicator: The committee recommends that, starting in 1989, the scientific and mathematical literacy of a random sample of adults (including 17-year-olds) be assessed. The assessment should tap the several dimensions of literacy discussed in Chapter 2 and should be carried out every four years.

To make the desired types of assessment possible, effort should be devoted over the next two years to developing interim assessment tools that use some free-response and some problem-solving components; these assessment tools should be used until the more innovative assessment techniques described in this chapter are available. The data collected should be aggregated to provide measures of depth and breadth of knowledge and understanding. They should also be aggregated by age, gender, race, ethnicity, socioeconomic status, and geographic region so as to establish to what extent there are systematic inequities in the distribution of scientific and mathematical literacy.
