4  Indicators of Learning in Science and Mathematics

This chapter first appraises currently available multiple-choice tests of student achievement in order to judge their suitability as indicators of the quality of education in science and mathematics. This appraisal includes a discussion of various uses of these tests, a review of criticisms of current testing methods, and suggestions on some desirable features that should be retained. New methods of assessment are then described that would provide both quantitative and qualitative information about how students perform tasks requiring higher-order skills. The chapter continues with our recommendations regarding uses of current indicators of student learning and work needed to develop improved tests. Implications of these recommendations for state education agencies are presented, and we conclude with a discussion of possible approaches to assessing aspects of scientific literacy of the U.S. population.

AN APPRAISAL OF CURRENT TESTS OF STUDENT ACHIEVEMENT

The most direct indicators of the quality of science and mathematics education are the scores based on tests that measure what students have learned. Currently available indicators of student learning are typically obtained from standardized achievement tests made up of multiple-choice items.
Before one accepts information based on such tests, it is necessary to make an appraisal of their suitability as indicators of the quality of education.

Purposes of Testing

In practice, tests are used for a wide variety of purposes. Some involve the evaluation of individual students for grading, student counseling, placement, promotion, awards, scholarships, and so on: important for educational purposes but not always well suited to the development of indicators. The use of tests most closely identified with assessing the condition of science and mathematics education is for the evaluation of learning achieved by populations of students; a related purpose that is of interest to the committee is the use of tests in improving the quality of instruction.

Evaluation of Student Learning

Measures of the outcomes of education for students are critical indicators in any educational monitoring system. Hence, the testing purpose of primary concern to the committee is evaluation of student learning, particularly at national, state, and regional levels. Indicators of learning that are satisfactory for this purpose would also be useful to school districts or individual schools as a means of monitoring change in levels of accomplishment over time.

A related use of tests is to provide criterion measures to validate less direct indicators of the quality of education, for example, teaching effectiveness or the quality of the curriculum. Tests are often used for this purpose, but such use is appropriate only when the tests being employed assess important dimensions of student learning in a satisfactory manner.

Improving Instruction

One reason for monitoring the condition of mathematics and science education is to be able to improve instruction. Several applications of tests can help do so: tests can contribute to raising the standards of schools as to the skills and competencies to be taught and acceptable levels of performance. They can provide diagnostic information that would enable teachers to understand the reasons for failures and provide appropriate remedial treatment.
Diagnostic information would be useful in school assessment at local, district, state, or even national levels as well: better understanding of why students develop erroneous problem-solving algorithms or fail to modify childhood misconceptions of physical principles would make possible actions at higher administrative and supervisory levels aimed at improving instruction. Tests can also be used as dependent variables in experimental studies involving educational treatments or methodologies developed to improve instruction.

Test questions also can be used for teaching as practice exercises with feedback. Much practice is necessary to acquire the complex skills required for development of the automatic processing and pattern-perception skills that are essential for the performance of more advanced problem-solving tasks. Such exercises might also provide, as a by-product, information that would be useful for large-scale assessment of student learning. Still another instructional application is to improve the articulation of instruction at various transition points, for example, between elementary, middle, and high school or between introductory and advanced college courses. Tests can determine whether students actually possess the basic knowledge and skills necessary for successfully dealing with the more advanced concepts and procedures taught at the next educational level. Although such instructional uses of tests are not directly related to their use as indicators, they are as important and provide equally valid reasons for developing better tests.

Criticisms of Current Testing

In the early years of this century, the assessment of student achievement was generally based on teachers' judgments, which were in turn based on teacher-made tests, homework, and impressions of classroom performance. But after the demonstration of the efficiency of objective tests by the use of the Army Alpha tests in World War I, a revolution in testing methods began. The invention of the multiple-choice test item and the development of fast and efficient test-scoring machines (Lindquist, 1954) made possible the mass testing of students on a very large scale. Testing agencies and test publishers hastened to develop multiple-choice tests, teachers were trained to write multiple-choice items, and many colleges set up testing bureaus to assist the faculty in preparing and scoring multiple-choice examinations. Except for the teacher-made tests that many teachers still rely on for grading students, multiple-choice tests have driven out virtually all other types of examinations because of their objectivity, speed, and economy (N. Frederiksen, 1984a).

From the standpoint of assessing the quality of education in science and mathematics, it is important to know to what extent information based on tests in current use provides a sound basis for judgment. Standardized multiple-choice achievement tests have been widely criticized not only by educators but also by students, parents, and the media. Some of the criticisms most relevant to the development of indicators of science and mathematics education are discussed below.

Multiple-Choice Tests Penalize Creative Thinking

This is a well-taken criticism, since most multiple-choice items do not provide much opportunity to generate new ideas. Students responding to a typical multiple-choice item begin by reading the stem (the expository or question part of the item); then they read the first option and make a narrow directed search of their memory store to find a match to the option. If they find information that clearly matches the option, they may mark it and go to the next item. If not, they read the next option and again seek a match and mark it or consider the next option, and so on until they either choose and mark an option that matches information stored in memory or skip the item. Such a process would appear to require little creative thinking. Of course, some multiple-choice items require more complex processing of information, but a large majority of the items in a typical achievement test measure factual knowledge.

In spite of the controversy, there has been little research on the mental processes involved in taking a multiple-choice test. Several investigators, however, have compared multiple-choice tests with their free-response counterparts, which were constructed by substituting an answer space for the multiple-choice options for each item (Vernon, 1962; Traub and Fisher, 1977; Ward, 1982; Webb et al., 1986). As judged by correlations and other statistical analyses, the format of the test was found to make little difference. With a few minor exceptions, for tests that were originally constructed with multiple-choice questions, both formats appeared to measure the same ability.

However, use of the multiple-choice format may tend to exclude the writing of items that require more complex thinking processes. If so, different results might be found if one began with free-response problems intended to elicit productive (rather than reproductive) thinking and converted them to the multiple-choice format.
Such a comparison was carried out using a test that required students to formulate hypotheses that might account for the findings of an experiment (N. Frederiksen and Ward, 1978). Indeed, quite different results were obtained than in the conversion from multiple-choice to free-response formats. The correlations between the two formats were generally low, and the pattern of relationships to various cognitive abilities was different. The two formats were similar with regard to their relationships to verbal ability and reasoning, but only for the free-response version were there substantial relations to a factor called ideational fluency, which represents the skills involved in making broad searches of the memory store in order to retrieve information relevant to a situation (Ward et al., 1980). In at least one instance, converting a test intended to measure productive thinking to multiple-choice format eliminated the need to broadly search the memory store for ideas that might be relevant, evidence that the multiple-choice format is not conducive to measuring productive thinking.

Multiple-Choice Tests Are Not Representative of Real-Life Problem Situations

There are at least two aspects of representativeness. One has to do with the frequency with which real-life problems occur in multiple-choice form. Occasionally people encounter problems with a limited number of clearly defined options, such as deciding whether to go left, right, or straight ahead at an intersection, or whether to take the morning or the afternoon flight to Miami. But more often there are many options, and one does not know what they are and must think of them for oneself. Multiple-choice options are almost universal in educational testing but rare in real life.

The other aspect of representativeness has to do with the extent to which the problems posed by test items are similar to problems that occur in real life. Problems encountered in real life generally involve situations that are far more complex and richer in detail than are provided by the stem of a multiple-choice item. Furthermore, there seems to be a tendency for testers to use stereotyped sets of test problems in both science and mathematics, problems that, for example, involve weights on inclined planes, pulleys, boats going with or against the current, and the number of pencils Jane has. Generalization of learning would be facilitated by schoolroom experiences that resemble problems in the world outside the classroom with respect to the variety and complexity of problem situations and settings. Use of test problems that simulate such situations would encourage such instruction (Norman et al., 1985).

Multiple-Choice Tests Are Undesirably Coachable

Any test is coachable in some sense and to some degree. Some kinds of coaching involve training that has nothing to do with the subject matter of the test, such as teaching students that the longest multiple-choice option is most likely to be correct and to avoid highly technical options; in such cases coaching may improve test scores somewhat without improving the ability presumably measured by the test. Another kind of coaching attempts to improve the ability measured by the test; a review of fractions and percentages, for example, might improve both test scores and the student's underlying competence in arithmetic. Test makers should attempt to construct tests that are coachable only in the sense that coaching and teaching are indistinguishable. Tests that are coachable in the undesirable sense not only result in wasted time; they also tend to falsify the data.

It is difficult to estimate the size of gains that are attributable to coaching (Messick, 1980). Most coaching is probably done by teachers in school settings and generally consists of attempts to teach the kinds of knowledge and skills that are measured by the tests. Coaching schools are more likely to attempt to teach test-taking skills, with less attention to the content of the test; fantastic gains have been claimed for such coaching (Owen, 1985), but without much evidence. The studies of coaching for the Scholastic Aptitude Test (SAT) and similar tests that were reviewed by Messick show modest gains on the average: less than 10 points on the SAT-verbal and about 15 points on the SAT-mathematics test, on a scale of 200 to 800.

The gains are difficult to interpret, however, because of variations in methods of assigning students to the coached and control groups (often the coached students are volunteers), the methods, length, and content of coaching, and methods of analyzing the data. Thus, it is usually difficult to judge whether gains are attributable to (a) differences in ability or motivation, (b) the nature and length of the coaching, or (c) the methods and variables used in attempting to control statistically for differences between the coached and control groups. Messick suggests that the smaller effects seem to be associated with short-term cramming and drill and the larger effects with longer-term programs involving skill development, especially in mathematics, for which there is likely to have been greater variability with regard to opportunities or motivation for students to learn. Such results suggest that coaching is not likely to produce major distortions in the distributions of scores obtained from current tests.
However, even small average gains could lead to mistaken conclusions when test scores are used to monitor change in student achievement.

Multiple-Choice Tests Exert Undesirable Influence on the Curriculum

There are many reasons to believe that the nature of the tests used influences what teachers teach and what students learn. Students want to get respectable grades, or at least pass the course, and teachers believe that they may be evaluated on the basis of their students' test scores. Tests that fail to match the intended curriculum may therefore have undesirable effects on learning.

Testing had relatively little impact on instruction in the 1950s and early 1960s, but the situation began to change in 1965 when the Elementary and Secondary Education Act (ESEA) was passed. The act required that certain teaching programs funded by ESEA be evaluated, and future funding of programs often depended on the outcomes of the evaluations (Popham, 1983). Pressure to improve test performance increased during the 1970s, when test data showed that attainment of knowledge and skills was declining (Womer, 1981), and the National Assessment of Educational Progress (1982) reported decrements in performance. Still more pressure to "teach for the tests" resulted from the decision of a federal judge in 1979 that Florida's use of a competency test to satisfy graduation requirements was unconstitutional unless preparation for the test was provided.

Educators representing a majority of the school districts identified by the National Science Teachers Association as exemplary in the teaching of science (Penick, 1983) have expressed concern at the mismatch between currently available standardized tests and their curricula. These districts are teaching inquiry-based, hands-on science, which both the scientific and educational communities strongly support, but the skills acquired by their students are not measured by the tests. At a conference on elementary science education held by the National Science Resources Center at the National Academy of Sciences in 1986, participants representing school districts with innovative programs expressed concern "that standardized achievement tests do not do a good job of assessing what students learn in elementary school science. There is a need to develop improved tests and alternative evaluation techniques to assess student progress in science, with more emphasis on the development of process skills and attitudes" (National Science Resources Center, 1986:3).
As more school districts are striving to introduce more effective science programs in grades 1-6, the issue of correspondence between tests and curricular goals becomes particularly critical at this level.

Bloom (1984) wrote that "teacher-made tests (and standardized tests) are largely tests of remembered information. . . . It is estimated that over 90 percent of test questions the U.S. public school students are now expected to answer deal with little more than information. Our instructional material, our classroom teaching methods, and our testing methods rarely rise above the lowest category of the [Bloom] taxonomy-knowledge" (p. 13). Resnick and Resnick (1985:15), in commenting on state testing programs, stated:

    It is appropriate . . . to think of minimum competency programs as an effort to educationally enfranchise the least able segment of the school population. . . . However, by focusing only on minimal performance, the competency testing movement has severely limited its potential for upgrading education standards. Only recently have some states begun to include higher level skills in their competency testing programs. It would be difficult to stress too much the importance of this move beyond the minimum . . . for there is evidence that examinations focused solely on low level competencies restrict the range of what teachers attend to in instruction and thus lower the standard of education for all but the weakest students.

An examination of the results of state testing programs in mathematics provides further documentation: children score well on items dealing with computation but less well on items dealing with concepts and problem solving, because the learning of these higher-order skills is not stressed in classroom instruction (Suydam, 1984).

The National Assessment of Educational Progress (NAEP) report (1982) previously referred to showed similar results. Performance by comparable populations of students on test items measuring basic skills did not decline compared with earlier assessments, but there was a decrease on items reflecting more complex cognitive skills. In mathematics, about 90 percent of the 17-year-olds could handle simple addition and subtraction, but performance levels on problems requiring understanding of mathematical principles dropped during the preceding decade from 62 to 58 percent. In science, performance declined for both kinds of items, the decrease being twice as large for items requiring more advanced skills.

It seems a reasonable conjecture that the mandated use of minimum-competency tests and concurrent emphasis on basic skills was at least in part responsible for these declines.
It is possible, however, to use the influence of tests on what is taught to improve learning by constructing tests that require the more advanced skills. Such tests would thus provide incentives for improving the quality of education in science and mathematics (N. Frederiksen, 1984a). In Chapter 7, the committee recommends that basic curriculum frameworks be developed for nationwide use, frameworks that represent the best opinions of working scientists and mathematicians, as well as educators, as to what should be taught and tested: a core of essential factual knowledge and the algorithmic and procedural skills and higher-order competencies for doing real science and mathematics. Tests that match such frameworks would influence teaching and learning in desirable directions.

Multiple-Choice Tests Are Not Based on Theory

This criticism is not one that is frequently voiced by critics, but it deserves mention. In one sense, multiple-choice testing is indeed based on a theory, namely, a very extensive theory of the mathematical and statistical considerations having to do with test reliability, validity, error of measurement, methods of item analysis, item parameters, equating of tests, latent trait models, and so on (e.g., Gulliksen, 1950; Rasch, 1960; Lord and Novick, 1968; Lord, 1980). This test theory is largely based on the assumption that items are scored objectively as either right or wrong, and the test score is the number right. Item-response theory, a relatively new and very influential part of test theory, accommodates the multiple-choice format by taking account of guessing. This body of work has been extremely useful and important in the development of assessment methods. But none of this test theory is concerned with the content of the test items.

Another kind of theory, one that grows out of work in cognitive psychology and artificial intelligence, does provide a potentially useful basis for the development of tests based both on content and the cognitive processes that are involved in doing science and mathematics. Some of the implications of this work are described later in this chapter.
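To make the reference to item-response theory concrete, the sketch below shows the three-parameter logistic (3PL) model commonly associated with multiple-choice items, in which a lower-asymptote parameter represents guessing. It is offered only as an illustration of the kind of model alluded to above; the parameter values are invented for the example and do not come from this report.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter
    logistic (3PL) item-response model.

    theta : examinee ability
    a     : item discrimination
    b     : item difficulty
    c     : lower asymptote ("pseudo-guessing"), the chance that even a
            very low-ability examinee answers correctly, which is how the
            model takes account of guessing on multiple-choice items
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Illustrative item: moderate discrimination, average difficulty, and
# a 20 percent guessing floor (roughly a five-option item).
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct_3pl(theta, a=1.2, b=0.0, c=0.20), 3))
```

The point of the example is simply that the score model, not the item content, is what this branch of test theory describes.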

Science Content in Multiple-Choice Achievement Tests Is Questionable

In order to obtain information on the quality of the science content in currently used achievement tests, the committee asked 12 scientists and science teachers from several science fields to evaluate the items from 9 commonly used multiple-choice achievement tests. (Two individuals did not review the items but wrote general comments.) This attempt to evaluate tests is described in more detail in Appendix B. Since differences in average ratings between the tests were relatively small compared with the variability between the reviewers, no quantitative conclusions concerning their relative merits could be justified from their evaluations. There was agreement, however, that the tests were poor at probing higher-order skills and that they contained a significant (5 to 10 percent) number of flawed items. The remaining items were judged to be quite variable in their quality, such that it was not obvious that a positive change in test score would in fact mirror improvement in the quality of student learning. The committee's experience with this experiment in assessing science tests reinforces concern about the quality of the subject-matter content of some of the tests in common use, even while it emphasizes some of the difficulties in obtaining reliable evidence on this important question.

Some Virtues of the Current Testing System

Despite the criticisms that have been leveled by the committee and others at the current system of educational testing, it has a number of virtues that should be acknowledged. First, the multiple-choice format for testing makes possible the economical measurement of factual knowledge. This format allows the rapid and reliable scoring of tests at a relatively low cost. Therefore, it seems sensible to retain the conventional test format for doing what it does best: measuring factual knowledge and the ability to use the simpler kinds of procedural knowledge, such as the algorithms used in arithmetic computations (to the extent that they continue to be taught).

Two other useful developments in current testing systems are matrix sampling and the application of statistical methods to make possible test comparisons over time. Neither of these is limited to tests in the multiple-choice format. The use of matrix sampling allows one to obtain information about large populations of students without concomitant increases in cost and testing burden. Matrix sampling is analogous to the methods used in public-opinion polling, in that it requires drawing random or representative samples of subjects. But in addition to drawing random samples of subjects, matrix sampling also involves independently drawing random samples of test items (Wilks, 1962; Lord and Novick, 1968); thus random subsamples of students are given different subsamples of items. An adaptation of the item-sampling procedure used by NAEP involves what is called a balanced incomplete block design (Messick et al., 1983). This procedure makes possible the calculation of close approximations to the means, standard deviations, intercorrelations of tests and test items, and so on, that would be obtained if the entire school population had been tested. This is an important feature of the methods currently employed by NAEP. When tests are created that are more costly to administer and score than conventional multiple-choice tests, the use of matrix sampling will be critical for keeping costs within bounds.
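The simulation below is a minimal sketch of the matrix-sampling idea just described: each sampled student answers only a small random subset of items, yet item-level statistics for the whole item pool can still be estimated. The population size, form length, and item difficulties are invented solely for the illustration and are not drawn from NAEP or from this report.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 10,000 students, 60 items. The full response
# matrix is generated only so the matrix-sampled estimates can be checked.
n_students, n_items = 10_000, 60
item_p = rng.uniform(0.3, 0.9, size=n_items)             # true proportions correct
responses = rng.random((n_students, n_items)) < item_p   # True = correct answer

# Matrix sampling: draw 1,200 students and give each a random 12-item form.
sampled_students = rng.choice(n_students, size=1_200, replace=False)
sums = np.zeros(n_items)
counts = np.zeros(n_items)
for s in sampled_students:
    form = rng.choice(n_items, size=12, replace=False)    # that student's items
    sums[form] += responses[s, form]
    counts[form] += 1
estimated_p = sums / counts

# Each student saw only a fifth of the items, but the estimated item-level
# proportions correct track the full-population values closely.
error = np.abs(estimated_p - responses.mean(axis=0))
print("largest item-level estimation error:", round(error.max(), 3))
```

Operational designs such as NAEP's balanced incomplete blocks assign forms systematically rather than at random, but the cost-saving logic is the same.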

Another virtue to note is that current testing methodology makes possible comparisons over time. The collection of data on learning indicators is of limited value unless the measurement can be repeated, since the purpose of school evaluation is to detect change: to see if student performance is improving. Given that test-score scales are arbitrary, measures taken on a single occasion may be of limited value. The only way in which such measures would be interpretable would be for the scores to have intrinsic meaning apart from comparative interpretations.

School evaluation is concerned not only with measuring change in the same individuals over a period of time but also with comparing the performance of successive groups of students at a particular stage of instruction, such as the end of the eighth grade. The latter kind of comparison is of particular interest at state and national levels. Unfortunately, it poses a difficult problem of interpretation because of possible changes in the composition of the groups that have nothing to do with instruction. And there are many other problems of interpretation due to the use of fallible instruments, the possibility (if not likelihood) that a given test does not measure the same abilities before and after a period of training, the lack of random assignment of students, the lack of equal units on a score scale, the unreliability of difference scores, and so on (see Harris, 1963). But statistical test theory has provided workable answers to many of these problems, for example, in the development of methods of equating scores on different versions of a test (Angoff, 1984). The development of item-response theory (Lord, 1980) provides workable solutions to other problems. The extensive test theory that has been developed should be retained, but it needs to be adapted as necessary for use with new testing procedures.
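As one concrete illustration of the score-equating methods cited above (Angoff, 1984), the sketch below applies simple linear (mean-and-sigma) equating, which places scores from a new test form on the scale of an old form by matching means and standard deviations in equivalent groups. Operational equating designs (anchor items, equipercentile methods) are considerably more elaborate; the numbers here are hypothetical and serve only to show the idea.

```python
import numpy as np

def linear_equate(new_form_scores, old_mean, old_sd):
    """Map scores on a new test form onto the old form's scale by
    matching means and standard deviations (linear equating)."""
    scores = np.asarray(new_form_scores, dtype=float)
    new_mean, new_sd = scores.mean(), scores.std()
    return old_mean + (old_sd / new_sd) * (scores - new_mean)

# Hypothetical example: the new form ran somewhat harder (lower mean),
# so raw scores are adjusted upward onto the old form's reporting scale.
new_scores = np.array([18, 22, 25, 31, 36], dtype=float)
print(linear_equate(new_scores, old_mean=30.0, old_sd=8.0).round(1))
```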

[. . .] could be used not only for assessment but also for practice and to provide information for remediation, and assessments based on exercises designed to probe higher-order learning should raise educational standards by providing models of performance to be emulated by both students and teachers.

An organization is needed to encourage, conduct, and coordinate the development of the needed assessment materials. The development of new assessment materials is costly, in both money and intellectual resources; needless duplication of effort must be avoided. This implies that the areas most in need of research and development of assessment techniques must be defined, newly developed instruments must be evaluated for their quality, and facilities for the distribution of materials to schools and teachers must be created.

The problem of test validation is particularly important for any new generation of tests that may be developed to assess proficiency in science and mathematics. The approach that has typically been used for test validation, finding a variable that may be thought of as a criterion and computing a correlation, will probably not be feasible, since no reasonable criterion is likely to exist. Clearly, another method is needed. The most reasonable method for validating the kinds of tests that have been proposed is construct validation (Cronbach and Meehl, 1955). Messick (1975) defines construct validity as "the process of marshalling evidence in the form of theoretically relevant empirical relations to support the inference that [a test score] has a particular meaning" (p. 955). The implication is that a theory about the nature of the performance in question is necessary, and validation of a test involves a scientific investigation to see if the procedures and cognitive processes displayed in taking the test are consistent with the theory.

A study of construct validity of free-response tests intended to measure skill in problem solving may be used as an illustration (N. Frederiksen, 1986). One test consisted of diagnostic problems that simulated a meeting of a doctor and a new patient, and the other test involved nonmedical problems, such as why there are relatively fewer old people in Iron City than in the rest of the state. Both tests used a format that required examinees to go through several cycles of writing hypotheses, asking for information to test their hypotheses, and revising the list of hypotheses until they arrived at a solution. The subjects were fourth-year medical students.
The theory about cognitive processes assumed that such verbal skills as reading, various reasoning abilities, science knowledge, and cognitive flexibility (ability to give up unpromising leads) would all be involved for both kinds of problems. In addition, skill in retrieving relevant information from long-term memory would be important. In the case of the medical problems, medical knowledge would of course also be necessary.

The salient findings from a correlational analysis of the data showed that, as expected, medical knowledge was clearly the most important resource in solving the medical problems, and of course it was of little or no help in dealing with the nonmedical problems. For nonmedical problems, ideational fluency, or skill in retrieving relevant information from memory, was by far the best predictor of performance, but it was of little or no value in solving medical problems. Thus the information-processing theory had to be revised. Embretson (1983) reports a more elaborate study involving latent-trait modeling for the identification of the theoretical mechanisms that underlie performance on a task and exploring the network of relationships of test scores to relevant variables. Experimental methods for testing a theory about test performance are also feasible and probably are preferable to correlational methods.

Summary

Currently available multiple-choice tests are adequate primarily for assessing student learning of the declarative knowledge of a subject. They are not adequate for assessing conceptual knowledge, most process skills, and the higher-order thinking that scientists, mathematicians, and educators consider most important. Since current efforts to improve curricula are beginning to concentrate on these skills, new tests and other assessment devices are needed to serve as national indicators of student learning in mathematics and science. The tests should include exercises that employ free-response techniques: not only pencil-and-paper problems but also hands-on science experiments and computer simulations. Tests for measuring the component skills involved in reasoning and problem solving should also be developed. The improvements in testing can be made feasible, despite higher costs, by the use of computer-based techniques, by matrix-sampling methods, and by the use in instruction of exercises developed for the tests.

Currently the area of greatest curricular change is in elementary school, grades K-5.
A number of school systems are attempting to implement inquiry-based, hands-on instructional programs in science. These programs are considered exemplary by both scientists and science teachers, and they urgently need the support of assessment instruments that match the new emphasis on teaching for understanding and for more complex thinking skills. Prototypes of free-response techniques exist that could be adapted for use at the K-5 level in the near future.

Recommendations

Indicators of student learning at the national, state, and local levels should be based on scores derived from assessment methods that are consonant with a curriculum that includes all major curricular objectives, including the learning of factual and conceptual knowledge, process skills, and higher-order thinking in specific content areas. Such tests should exhibit a range of testing methodology, including use of free-response formats.

Research and Development: To provide the requisite tests for use as indicators of student learning, the committee recommends that a greatly accelerated program of research and development be undertaken aimed at the construction of free-response materials and techniques that measure skills not measured with multiple-choice tests. The committee urges that the development of science tests at the K-5 level receive immediate attention.

Techniques to be developed include problem-solving tasks, as exemplified by the College Board Advanced Placement Tests; pencil-and-paper tests of hypothesis formulation, experimental design, and other tasks requiring productive-thinking skills, as exemplified by questions in the British Assessment of Performance Unit series; hands-on experimental exercises, as exemplified by some test materials administered by the National Assessment of Educational Progress (NAEP) and the International Association for the Evaluation of Educational Achievement (IEA); and simulations of scientific phenomena with classroom microcomputers that give students opportunities for experimental manipulations and prediction of results.

The creation of new science tests for grades K-5 should be done by teams that include personnel from the school districts that have been developing hands-on curricula to ensure that the new tests match the objectives of this type of instruction. In addition to providing valid national indicators of learning in areas of great importance, such new assessment materials for science in grades K-5 will provide models of tests that state and local school officials may want to adopt and use.

Key Indicator: The committee recommends that assessment of student learning using the best available tests and testing methods continue to be pursued in order to provide periodic indicators of the quality of science and mathematics education.

Tests should be given to students in upper-elementary, middle, and senior high school (for example, in grades 4, 8, and 12). Because of the rapid changes taking place in science instruction in grades K-5, assessment at this level should be carried out every two years, using exercises developed according to the preceding recommendation. For higher levels, a four-year cycle is appropriate. The tests should be given to a national sample, using matrix-sampling techniques. Test scores should be available for each test item or exercise and should be reported over time and by student subgroups (e.g., gender, race, ethnicity, rural/inner city/suburban community). As in previous assessments, results should also be reported by geographic region; efforts now under way may make possible state-by-state comparisons in the future. Similar procedures are appropriate for indicators of state or district assessments of student learning.

Research and Development: The committee recommends that a research and development center be established to provide for the efficient production, evaluation, and distribution of assessment materials for use as indicators of student learning at district, state, and national levels and for use by teachers in instruction.

The center should function as a centralized resource and clearinghouse that would make it possible for school people to survey the available assessment materials and obtain those desired.
It might be called the National Science and Mathematics Assessment Resource Center. It should be tied closely to efforts to improve the curriculum and be an active partner in the total system of educational reform. The committee suggests that as a beginning a group of experts be convened to prepare a plan for the creation of the proposed center, including its management and operation, and that the plan serve as the basis for the founding of the center by a suitable educational establishment or a consortium of universities and educational research organizations.

IMPLICATIONS FOR STATE EDUCATION AGENCIES

The assessment of what students have learned and their ability to apply that knowledge is a major task of accountability for state education agencies. Such assessments can function to assure the public and their elected representatives that both human and material resources are available and meet certain standards, that the resources are appropriately distributed to schools, and that the effects of all the human and monetary investments are reflected in student learning. Using that basic premise, the state has a vital stake in valid yet feasible ways to evaluate what students know about mathematics and science.

The state's role of leadership in assessment is quite important, and the committee is concerned that the complexities of assessing student learning be clearly understood and then attacked. If the state language-arts assessment is merely a multiple-choice grammar test, a direct message (intended or not) is sent to every teacher that the writing process itself is not important. Similarly, in the committee's view, if a state or school science assessment consists solely of a multiple-choice test, then clearly the measurement is equally limited.

Representatives of state and local systems told the committee that the recommended assessment resource center, if it were to be implemented, would fill a major gap for schools, states, and the Assessment Center Project of the Council of Chief State School Officers (see Appendix D). The assessment approaches based on hands-on investigation and computer simulation that would evolve from the proposed resource center could serve two functions for states and local communities. On a sample basis, the results of assessments using such new techniques would themselves be an important indicator at the state and national levels of student learning, and simultaneously such an assessment approach would provide a model that the committee believes to be important.
While states may be able to contribute to the assessment resource center, probably only a nongovernmental institution could muster sufficient resources to develop and evaluate the new approaches, as well as to create imaginative ways to improve traditional multiple-choice testing of factual knowledge and simpler kinds of procedural knowledge. The curriculum frameworks discussed in Chapter 7 should guide the development at the proposed resource center of outcome measures, including measures not only of factual and conceptual knowledge, but also of the information-processing skills that are necessary for acquiring proficiency in science and mathematics.

ASSESSING ADULT SCIENTIFIC AND MATHEMATICAL LITERACY

There are several reasons why assessment of student learning should be extended to assess trends in the science and mathematics literacy of the entire population. First, one of the reasons to care about the quality of mathematics and science instruction in school is that it will influence mathematics and science literacy throughout the population; trends in the mathematics and science literacy of adults will in time provide information about the long-term consequences of attempts to improve the science and mathematics education provided in the nation's schools. The issue of adult literacy may raise important questions as to whether schools should emphasize immediate knowledge retention or learning that is likely to be retained in adulthood. Second, children's interest in mathematics and science is influenced by the extent to which the adults in their lives know about and show an interest in these subjects. Consequently, changes in the science and mathematics literacy of adults may foster changes in the skills and attitudes about science and mathematics that students bring to school. Third, and most important, the science and mathematics literacy of adults is a major goal of science and mathematics education.

Results of previous efforts to assess scientific literacy in the United States have not been reassuring. For example, Miller (1986) reports on surveys of U.S. adults conducted in 1979 and 1985 that included questions on the meaning of scientific study, cognitive science knowledge, and attitudes on organized science. On the basis of the survey responses, he classified 7 percent of the public to be scientifically literate in 1979 and 5 percent in 1985.
Young adults (ages 17-34) did slightly better (11 and 7 percent, respectively); also, the percentage increased with increasing education. However, within the population who were high school graduates but who had not gone on to college, only 2 percent in 1979 and 3 percent in 1985 were deemed to be scientifically literate. Such results increase the need for future study of the population's scientific literacy and the long-term effects of science education.

Desired Attributes of Indicators

Any plan to generate indicators of scientific and mathematical literacy should try to estimate the degree to which a population possesses the kind of knowledge and intellectual skills outlined in Chapter 2. Assessment plans should be based on the following considerations:

•  A single measure will not do, because science and mathematics literacy involves multiple dimensions of a complex set of characteristics. The indicators to be used should be matched to the models of literacy discussed in Chapter 2.
•  The indicators should recognize that there is no single, absolute level of literacy and that various levels of attainment in different components of a community or population group are likely.
•  Any measures used to generate indicators should be supplemented by research to validate what is actually being measured.
•  Indicators may be expressed in terms of descriptive patterns of problem solving and other nonnumerical ways.

At this stage, there is no particular reason to favor one method of data collection over another. Therefore, several techniques, such as conducting surveys (see, e.g., Miller, 1983), interviews, and case studies, should be considered in deciding what information to collect in order to develop indicators. As with students, traditional methods may work reasonably well to assess knowledge, but indicators should also probe the population's understanding of the nature of science and its role in society. It is particularly important and difficult to obtain reliable estimates of problem-solving skills. In the committee's view, their assessment must go beyond individual pencil-and-paper tests and should include observation and analysis of individual and group responses to carefully selected phenomena involving real objects and filmed sequences of events.

In some sense, the need for assessment of adult literacy is not as urgent as the need for assessing students; after all, fewer policy decisions will or can be driven by such assessments.
Therefore, the next two or three years can be devoted to the interim development of pencil-and-paper tests and tests involving real objects. These will provide a measure of adult literacy that can be correlated both to existing tests of learning (say, of 17-year-olds) and to the assessment techniques that the committee has proposed for in-school learning. Since an important aspect of science and mathematics literacy is continuing self-education, some of the assessment techniques suggested in the preceding sections may also be appropriate for adult literacy.

Target Populations for Assessment

The committee considers education policy makers for elementary and secondary schools at state, local, and also national levels to be prime users of indicators of the quality of science and mathematics education. This has implications and raises interesting issues for the design of a set of indicators to assess the scientific and mathematical literacy of adults. One issue, for example, is how the out-of-school population should be stratified in various ways for assessment purposes. One way is to divide it into the following groups:

•  Parents and guardians of children enrolled in elementary/secondary schools, public or private; alternatively, those with school-age children.
•  Individuals who work in mathematics- or science-related fields or use mathematics or science in their work.
•  Individuals, stratified perhaps by age groups related to other national surveys, such as the National Assessment of Educational Progress and the longitudinal follow-up surveys of earlier high school classes sponsored by the Center for Education Statistics (National Center for Education Statistics, 1981, 1984).

Considering the first group, if an in-school science assessment includes a particular student, should the parents or guardians of that student be included in a science literacy assessment? If so, should the assessment include both parents, a randomly selected parent, the mother, the father, or some combination of these?

Data Collection Strategies

The following suggestions outline an initial program and illustrate one way in which a measurement effort might begin.
The agency assigned responsibility for the measurement of scientific and mathematical literacy should be given responsibility for developing the details of methodology. An initial program would be devoted to:

•  providing benchmark data for the country as a whole, using largely available material, and
•  research to develop, validate, and field-test instruments to better measure people's understanding of the nature of science and to obtain reliable estimates of their problem-solving skills.

The projected interviews and administration of exercises probably would require personal visits to households by the interviewers, although some screening of households and some data collection might be done by telephone. The assessment might begin by providing benchmark data for all adults by gender and by broad age group and for parents and guardians of school-age children. This data base would later be expanded to provide measures for subgroups of the population, for example, by educational attainment and by race and ethnicity.

Although the program would be targeted to adults 17 years of age and older, it should be expanded to include children in elementary and secondary schools as in-school testing programs begin to include the measurement of scientific and mathematical literacy. The objective would be to provide links between school and household measures, as well as to provide a household-based unit of analysis for adults and children.

If the assessment is to serve as a reliable base for policy, it will need to be based on a probability sample for which estimates of statistical reliability can be provided. The goal should be a high rate of cooperation in the survey by individuals selected in the sample. Completion rates of from 85 to 90 percent are a reasonable expectation. The sampling could be based on a multistage approach. At the first stage, a sample of perhaps 100 areas would be selected. These areas would be counties or school districts and, if spread proportionately across the country, would be distributed across about 40 states. The sample could, however, be designed to include all states. Within each area, a sample of no fewer than 50 adults would be drawn from randomly selected city blocks or corresponding small areas outside cities, with at least 5 households sampled per block and 1 adult interviewed per household. Households would be sampled for this purpose according to their number of adults in order to give each adult tested approximately the same weight. With an 80 percent cooperation rate, this plan would yield interviews/tests with no fewer than 4,000 adults. That number would provide an adequate data base for analysis.
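The expected yield quoted above follows directly from the sampling parameters given in the text; the short sketch below simply restates that arithmetic and introduces no figures beyond those already stated.

```python
# Yield implied by the multistage sampling plan described above.
n_areas = 100              # first-stage sample of counties or school districts
adults_per_area = 50       # minimum number of adults drawn in each area
cooperation_rate = 0.80    # cooperation rate assumed in the text

completed_interviews = n_areas * adults_per_area * cooperation_rate
print(completed_interviews)   # 4000.0, i.e., "no fewer than 4,000 adults"
```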

To monitor changes in the population, the basic survey should be repeated at four-year intervals. During intervening years, effort could be concentrated on developing and testing improved assessment methodology.

Assessing Grasp of Grand Conceptual Schemes

As with school students, it is important to find out to what extent the adult population is familiar with key scientific concepts and understands their applications. While such high-order knowledge may seem at first to resist assessment, it can be probed with the following kind of exchange, probably best administered in an interview:

•  Listen to (or look at) this list of ideas: plate tectonics, evolution, gravitation, the periodic table. Is there one of them that you would be willing to talk about a bit more?
•  Response (for example): plate tectonics.
•  I'd like you to take a few minutes to think about plate tectonics. Please think about these two questions and answer them in whichever order you prefer. How would you briefly describe what plate tectonics are to someone who didn't know about them? What examples can you give me of things or events that plate tectonics cause or are involved in?
•  Would you like to talk about another of these ideas?

Several aspects of this sample exchange are important. First, it is in the free-response format, which is needed to probe the active knowledge of the respondent and to permit flexibility in answering. Second, it evokes both a definition and specific applications of the selected "grand conceptual scheme." Since part of the power of these schemes is their ability to unify phenomena, being able to define the terms without appreciating any of the applications is to lose much of their force. Third, by including some example that virtually all adults have encountered, a minimum level of literacy can be assessed. Finally, because the questions are open-ended and recursive, they permit assessment of both breadth and depth.
Although it may be difficult to do so, it would be important to establish to what extent people's responses are based on knowledge gained in school and to what extent they draw on knowledge gained from subsequent reading, television programs, museum visits, and so on, even given that school ought to teach one to continue to learn beyond one's formal education.

Recommendation

Key Indicator: The committee recommends that, starting in 1989, the scientific and mathematical literacy of a random sample of adults (including 17-year-olds) be assessed. The assessment should tap the several dimensions of literacy discussed in Chapter 2 and should be carried out every four years.

To make the desired types of assessment possible, effort should be devoted over the next two years to developing interim assessment tools that use some free-response and some problem-solving components; these assessment tools should be used until more innovative assessment techniques, described in this chapter, are available. The data collected should be aggregated to provide measures of depth and breadth of knowledge and understanding. They should also be aggregated by age, gender, race, ethnicity, socioeconomic status, and geographic region so as to establish to what extent there are systematic inequities in the distribution of scientific and mathematical literacy.