4
Indicators of Learning in
Science and Mathematics
This chapter first appraises currently available multiple-choice
tests of student achievement in order to judge their suitability as
indicators of the quality of education in science and mathematics.
This appraisal includes a discussion of various uses of these tests,
a review of criticisms of current testing methods, and suggestions
on some desirable features that should be retained. New methods
of assessment are then described that would provide both quantita-
tive and qualitative information about how students perform tasks
requiring higher-order skills. The chapter continues with our recom-
mendations regarding uses of current indicators of student learning
and work needed to develop improved tests. Implications of these
recommendations for state education agencies are presented, and we
conclude with a discussion of possible approaches to assessing aspects
of scientific literacy of the U.S. population.
AN APPRAISAL OF CURRENT TESTS OF
STUDENT ACHIEVEMENT
The most direct indicators of the quality of science and mathe-
matics education are the scores based on tests that measure what stu-
dents have learned. Currently available indicators of student learning
are typically obtained from standardized achievement tests made up
of multiple-choice items. Before one accepts information based on
such tests, it is necessary to make an appraisal of their suitability as
indicators of the quality of education.
Purposes of Testing
In practice, tests are used for a wide variety of purposes. Some
involve the evaluation of individual students for grading, student
counseling, placement, promotion, awards, scholarships, and so on;
these uses are important for educational purposes but not always well suited to the
development of indicators. The use of tests most closely identified
with assessing the condition of science and mathematics education is
for the evaluation of learning achieved by populations of students; a
related purpose that is of interest to the committee is the use of tests
in improving the quality of instruction.
Evaluation of Student Learning. Measures of the outcomes of
education for students are critical indicators in any educational mon-
itoring system. Hence, the testing purpose of primary concern to the
committee is evaluation of student learning, particularly at national,
state, and regional levels. Indicators of learning that are satisfactory
for this purpose would also be useful to school districts or individual
schools as a means of monitoring change in levels of accomplishment
over time.
A related use of tests is to provide criterion measures to vali-
date less direct indicators of the quality of education, for example,
teaching effectiveness or the quality of the curriculum. Tests are of-
ten used for this purpose, but such use is appropriate only when the
tests being employed assess important dimensions of student learning
in a satisfactory manner.
Improving Instruction. One reason for monitoring the condition
of mathematics and science education is to be able to improve instruc-
tion. Several applications of tests can help do so: tests can contribute
to raising the standards of schools as to the skills and competencies
to be taught and acceptable levels of performance. They can provide
diagnostic information that would enable teachers to understand the
reasons for failures and provide appropriate remedial treatment. Di-
agnostic information would be useful in school assessment at local,
district, state, or even national levels as well: better understand-
ing of why students develop erroneous problem-solving algorithms or
fail to modify childhood misconceptions of physical principles would
make possible actions at higher administrative and supervisory levels
aimed at improving instruction. Tests can also be used as dependent
variables in experimental studies involving educational treatments or
methodologies developed to improve instruction.
Test questions also can be used for teaching as practice ex-
ercises with feedback. Much practice is necessary to acquire the
complex skills required for development of the automatic processing
and pattern-perception skills that are essential for the performance of
more advanced problem-solving tasks. Such exercises might also pro-
vide, as a by-product, information that would be useful for large-scale
assessment of student learning. Still another instructional applica-
tion is to improve the articulation of instruction at various transition
points, for example, between elementary, middle, and high school or
between introductory and advanced college courses. Tests can de-
termine whether students actually possess the basic knowledge and
skills necessary for successfully dealing with the more advanced con-
cepts and procedures taught at the next educational level. Although
such instructional uses of tests are not directly related to their use as
indicators, they are as important and provide equally valid reasons
for developing better tests.
Criticisms of Current Testing
In the early years of this century, the assessment of student
achievement was generally based on teachers' judgments, which were
in turn based on teacher-made tests, homework, and impressions of
classroom performance. But after the demonstration of the efficiency
of objective tests by the use of the Army Alpha tests in World War I,
a revolution in testing methods began. The invention of the multiple-
choice test item and the development of fast and efficient test-scoring
machines (Lindquist, 1954) made possible the mass testing of students
on a very large scale. Testing agencies and test publishers
hastened to develop multiple-choice tests, teachers were trained to
write multiple-choice items, and many colleges set up testing bureaus
to assist the faculty in preparing and scoring multiple-choice exami-
nations. Except for the teacher-made tests that many teachers still
rely on for grading students, multiple-choice tests have driven out
virtually all other types of examinations because of their objectivity,
speed, and economy (N. Frederiksen, 1984a).
From the standpoint of assessing the quality of education in
science and mathematics, it is important to know to what extent
information based on tests in current use provides a sound basis
for judgment. Standardized multiple-choice achievement tests have
been widely criticized not only by educators but also by students,
parents, and the media. Some of the criticisms most relevant to the
development of indicators of science and mathematics education are
discussed below.
Multiple-Choice Tests Penalize Creative Thinking. This is a
well-taken criticism, since most multiple-choice items do not provide much
opportunity to generate new ideas. Students responding to a typical
multiple-choice item begin by reading the stem (the expository or
question part of the item); then they read the first option and make
a narrow directed search of their memory store to find a match to the
option. If they find information that clearly matches the option, they
may mark it and go to the next item. If not, they read the next option
and again seek a match and mark it or consider the next option,
and so on until they either choose and mark an option that matches
information stored in memory or skip the item. Such a process would
appear to require little creative thinking. Of course, some multiple-
choice items require more complex processing of information, but a
large majority of the items in a typical achievement test measure
factual knowledge.
In spite of the controversy, there has been little research on the
mental processes involved in taking a multiple-choice test. Several in-
vestigators, however, have compared multiple-choice tests with their
free-response counterparts, which were constructed by substituting
an answer space for the multiple-choice options for each item (Vernon,
1962; Traub and Fisher, 1977; Ward, 1982; Webb et al., 1986).
As judged by correlations and other statistical analyses, the format of
the test was found to make little difference. With a few minor excep-
tions, for tests that were originally constructed with multiple-choice
questions, both formats appeared to measure the same ability.
However, use of the multiple-choice format may tend to exclude
the writing of items that require more complex thinking processes.
If so, different results might be found if one began with free-response
problems intended to elicit productive (rather than reproductive)
thinking and converted them to the multiple-choice format. Such
a comparison was carried out using a test that required students
to formulate hypotheses that might account for the findings of an
experiment (N. Frederiksen and Ward, 1978). Indeed, quite different
results were obtained than in the conversion from multiple-choice
to free-response formats. The correlations between the two formats
were generally low, and the pattern of relationships to various cogni-
tive abilities was different. The two formats were similar with regard
to their relationships to verbal ability and reasoning, but only for
the free-response version were there substantial relations to a fac-
tor called ideational fluency, which represents the skills involved in
making broad searches of the memory store in order to retrieve
information relevant to a situation (Ward et al., 1980). In at least one
instance, converting a test intended to measure productive think-
ing to multiple-choice format eliminated the need to broadly search
the memory store for ideas that might be relevant, evidence that
the multiple-choice format is not conducive to measuring productive
thinking.
Multiple-Choice Tests Are Not Representative of Real-Life Problem
Situations. There are at least two aspects of representativeness.
One has to do with the frequency with which real-life problems oc-
cur in multiple-choice form. Occasionally people encounter problems
with a limited number of clearly defined options, such as deciding
whether to go left, right, or straight ahead at an intersection, or
whether to take the morning or the afternoon flight to Miami. But
more often there are many options, and one does not know what they
are and must think of them for oneself. Multiple-choice options are
almost universal in educational testing but rare in real life.
The other aspect of representativeness has to do with the extent
to which the problems posed by test items are similar to problems
that occur in real life. Problems encountered in real life generally
involve situations that are far more complex and richer in detail than
are provided by the stem of a multiple-choice item. Furthermore,
there seems to be a tendency for testers to use stereotyped sets of
test problems in both science and mathematics, problems that, for
example, involve weights on inclined planes, pulleys, boats going
with or against the current, and the number of pencils Jane has.
Generalization of learning would be facilitated by schoolroom expe-
riences that resemble problems in the world outside the classroom
with respect to the variety and complexity of problem situations and
settings. Use of test problems that simulate such situations would
encourage such instruction (Norman et al., 1985).
Multiple-Choice Tests Are Undesirably Coachable. Any test is
coachable in some sense and to some degree. Some kinds of coaching
involve training that has nothing to do with the subject matter of the
test, such as teaching students that the longest multiple-choice option
is most likely to be correct and to avoid highly technical options;
in such cases coaching may improve test scores somewhat without
improving the ability presumably measured by the test. Another kind
of coaching attempts to improve the ability measured by the test; a
review of fractions and percentages, for example, might improve both
test scores and the student's underlying competence in arithmetic.
Test makers should attempt to construct tests that are coachable
only in the sense that coaching and teaching are indistinguishable.
Tests that are coachable in the undesirable sense not only result in
wasted time; they also tend to falsify the data.
It is difficult to estimate the size of gains that are attributable to
coaching (Messick, 1980). Most coaching is probably done by teachers
in school settings and generally consists of attempts to teach the
kinds of knowledge and skills that are measured by the tests. Coach-
ing schools are more likely to attempt to teach test-taking skills,
with less attention to the content of the test; fantastic gains have
been claimed for such coaching (Owen, 1985), but without much
evidence. The studies of coaching for the Scholastic Aptitude Test
(SAT) and similar tests that were reviewed by Messick show modest
gains: on the average, less than 10 points on the SAT-verbal and
about 15 points on the SAT-mathematics test, on a scale of 200 to
800. The gains are difficult to interpret, however, because of vari-
ations in methods of assigning students to the coached and control
groups (often the coached students are volunteers), the methods,
length, and content of coaching, and methods of analyzing the data.
Thus, it is usually difficult to judge whether gains are attributable
to (a) differences in ability or motivation, (b) the nature and length
of the coaching, or (c) the methods and variables used in attempt-
ing to control statistically for differences between the coached and
control groups. Messick suggests that the smaller effects seem to be
associated with short-term cramming and drill and the larger effects
with longer-term programs involving skill development, especially
in mathematics, for which there is likely to have been greater vari-
ability with regard to opportunities or motivation for students to
learn.
Such results suggest that coaching is not likely to produce major
distortions in the distributions of scores obtained from current tests.
However, even small average gains could lead to mistaken conclusions
when test scores are used to monitor change in student achievement.
Multiple-Choice Tests Exert Undesirable Influence on the Curriculum.
There are many reasons to believe that the nature of the
tests used influences what teachers teach and what students learn.
Students want to get respectable grades, or at least pass the course,
and teachers believe that they may be evaluated on the basis of
their students' test scores. Tests that fail to match the intended
curriculum may therefore have undesirable effects on learning.
Testing had relatively little impact on instruction in the 1950s
and early 1960s, but the situation began to change in 1965 when
the Elementary and Secondary Education Act (ESEA) was passed.
The act required that certain teaching programs funded by ESEA
be evaluated, and future funding of programs often depended on the
outcomes of the evaluations (Popham, 1983). Pressure to improve
test performance increased during the 1970s, when test data showed
that attainment of knowledge and skills was declining (Womer, 1981),
and the National Assessment of Educational Progress (1982) reported
decrements in performance. Still more pressure to "teach for the
tests" resulted from the decision of a federal judge in 1979 that
Florida's use of a competency test to satisfy graduation requirements
was unconstitutional unless preparation for the test was provided.
Educators representing a majority of the school districts identi-
fied by the National Science Teachers Association as exemplary in the
teaching of Key science (Penick, 1983) have expressed concern at the
mismatch between currently available standardized tests and their
curricula. These districts are teaching inquiry-based, hands-on sci-
ence, which both the scientific and educational communities strongly
support, but the skills acquired by their students are not measured
by the tests. At a conference on elementary science education held
by the National Science Resources Center at the National Academy
of Sciences in 1986, participants representing school districts with
innovative programs expressed concern "that standardized achieve-
ment tests do not do a good job of assessing what students learn in
elementary school science. There is a need to develop improved tests
and alternative evaluation techniques to assess student progress in
science, with more emphasis on the development of process skills and
attitudes" (National Science Resources Center, 1986:3). As more
school districts are striving to introduce more effective science pro-
grams in grades 1-6, the issue of correspondence between tests and
curricular goals becomes particularly critical at this level.
Bloom (1984) wrote that "teacher-made tests (and standardized
tests) are largely tests of remembered information.... It is estimated
that over 90 percent of test questions the U.S. public school students
are now expected to answer deal with little more than information.
Our instructional material, our classroom teaching methods, and our
testing methods rarely rise above the lowest category of the [Bloom]
taxonomy-knowledge" (p. 13). Resnick and Resnick (1985:15), in
commenting on state testing programs, stated:
It is appropriate . . . to think of minimum competency programs as an
effort to educationally enfranchise the least able segment of the school
population.... However, by focusing only on minimal performance,
the competency testing movement has severely limited its potential for
upgrading education standards. Only recently have some states begun
to include higher level skills in their competency testing programs.
It would be difficult to stress too much the importance of this move
beyond the minimum . . . for there is evidence that examinations
focused solely on low level competencies restrict the range of what
teachers attend to in instruction and thus lower the standard of
education for all but the weakest students.
An examination of the results of state testing programs in mathe-
matics provides further documentation: children score well on items
dealing with computation but less well on items dealing with con-
cepts and problem solving, because the learning of these higher-order
skills is not stressed in classroom instruction (Suydam, 1984).
The National Assessment of Educational Progress (NAEP) re-
port (1982) previously referred to showed similar results. Perfor-
mance by comparable populations of students on test items measur-
ing basic skills did not decline compared with earlier assessments, but
there was a decrease on items reflecting more complex cognitive skills.
In mathematics, about 90 percent of the 17-year-olds could handle
simple addition and subtraction, but performance levels on problems
requiring understanding of mathematical principles dropped during
the preceding decade from 62 to 58 percent. In science, performance
declined for both kinds of items, the decrease being twice as large for
items requiring more advanced skills.
It seems a reasonable conjecture that the mandated use of
minimum-competency tests and concurrent emphasis on basic skills
was at least in part responsible for these declines. It is possible,
however, to use the influence of tests on what is taught to improve
learning by constructing tests that require the more advanced skills.
Such tests would thus provide incentives for improving the quality of
education in science and mathematics (N. Frederiksen, 1984a).
In Chapter 7, the committee recommends that basic curriculum
frameworks be developed for nationwide use, frameworks that repre-
sent the best opinions of working scientists and mathematicians, as
well as educators, as to what should be taught and tested: a core of
essential factual knowledge and the algorithmic and procedural skills
and higher-order competencies for doing real science and mathemat-
ics. Tests that match such frameworks would influence teaching and
learning in desirable directions.
Multiple-Choice Tests Are Not Based on Theory. This criticism
is not one that is frequently voiced by critics, but it deserves mention.
In one sense, multiple-choice testing is indeed based on a theory,
namely, a very extensive theory of the mathematical and statistical
considerations having to do with test reliability, validity, error of
measurement, methods of item analysis, item parameters, equating
of tests, latent trait models, and so on (e.g., Gulliksen, 1950; Rasch,
1960; Lord and Novick, 1968; Lord, 1980). This test theory is largely
based on the assumption that items are scored objectively as either
right or wrong, and the test score is the number right. Item-response
theory, a relatively new and very influential part of test theory,
assumes a multiple-choice format by taking account of guessing.
This body of work has been extremely useful and important in the
development of assessment methods. But none of this test theory is
concerned with the content of the test items.
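
The way item-response theory takes account of guessing can be illustrated with a minimal sketch. It assumes the three-parameter logistic form, which is one common model of this kind; the text above does not say which model is meant, and the function name and the parameter values below are hypothetical.

```python
import math


def three_pl_probability(theta, a, b, c):
    """Probability of a correct answer under a three-parameter logistic model.

    theta: examinee ability
    a: item discrimination
    b: item difficulty
    c: lower asymptote, i.e., the chance of success by guessing alone
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical four-option item, so the guessing floor is set to 0.25.
for theta in (-3.0, 0.0, 3.0):
    p = three_pl_probability(theta, a=1.0, b=0.0, c=0.25)
    print(theta, round(p, 3))
```

Note that every parameter describes the statistical behavior of an item; nothing in the model refers to what the item is about, which is the point made in the paragraph above.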
Another kind of theory, one that grows out of work in cognitive
psychology and artificial intelligence, does provide a potentially useful
basis for the development of tests based both on content and the
cognitive processes that are involved in doing science and mathemat-
ics. Some of the implications of this work are described later in this
chapter.
Science Content in Multiple-Choice Achievement Tests Is Questionable.
In order to obtain information on the quality of the science
content in currently used achievement tests, the committee asked 12
scientists and science teachers from several science fields to evaluate
the items from 9 commonly used multiple-choice achievement tests.
(Two individuals did not review the items but wrote general
comments.) This attempt to evaluate tests is described in more detail in
Appendix B. Since differences in average ratings between the tests
were relatively small compared with the variability between the re-
viewers, no quantitative conclusions concerning their relative merits
could be justified from their evaluations. There was agreement, how-
ever, that the tests were poor at probing higher-order skills and that
they contained a significant (5 to 10 percent) number of flawed items.
The remaining items were judged to be quite variable in their qual-
ity, such that it was not obvious that a positive change in test score
would in fact mirror improvement in the quality of student learning.
The committee's experience with this experiment in assessing sci-
ence tests reinforces concern about the quality of the subject-matter
content of some of the tests in common use, even while it empha-
sizes some of the difficulties in obtaining reliable evidence on this
important question.
Some Virtues of the Current Testing System
Despite the criticisms that have been leveled by the committee
and others at the current system of educational testing, it has a
number of virtues that should be acknowledged. First, the multiple-
choice format for testing makes possible the economical measurement
of factual knowledge. This format allows the rapid and reliable
scoring of tests at a relatively low cost. Therefore, it seems sensible
to retain the conventional test format for doing what it does best:
measuring factual knowledge and the ability to use the simpler kinds
of procedural knowledge, such as the algorithms used in arithmetic
computations (to the extent that they continue to be taught).
Two other useful developments in current testing systems are
matrix sampling and the application of statistical methods to make
possible test comparisons over time. Neither of these is limited to
tests in the multiple-choice format. The use of matrix sampling al-
lows one to obtain information about large populations of students
without concomitant increases in cost and testing burden. Matrix
sampling is analogous to the methods used in public-opinion polling,
in that it requires drawing random or representative samples of subjects.
But in addition to drawing random samples of subjects, matrix
sampling also involves independently drawing random samples of test
items (Wilks, 1962; Lord and Novick, 1968); thus random subsamples
of students are given different subsamples of items. An adaptation of
the item-sampling procedure used by NAEP involves what is called
a balanced incomplete block design (Messick et al., 1983). This procedure
makes possible the calculation of close approximations to the
means, standard deviations, intercorrelations of tests and test items,
and so on, that would be obtained if the entire school population had
been tested. This is an important feature of the methods currently
employed by NAEP. When tests are created that are more costly to
administer and score than conventional multiple-choice tests, the use
of matrix sampling will be critical for keeping costs within bounds.
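
The basic idea of matrix sampling can be sketched in a few lines. This is a simplified illustration on simulated data, not the NAEP procedure: each sampled student simply receives an independent random subsample of items, whereas a balanced incomplete block design assigns items in structured blocks. The function name, population size, and success rate are all assumptions made for the example.

```python
import random


def matrix_sample_mean(responses, n_students, n_items, rng):
    """Estimate the population mean proportion correct by matrix sampling.

    responses[s][i] is 1 if student s answers item i correctly, else 0.
    A random sample of students is drawn, and each sampled student is
    scored on only a random subsample of the items.
    """
    students = rng.sample(range(len(responses)), n_students)
    total = 0
    for s in students:
        items = rng.sample(range(len(responses[s])), n_items)
        total += sum(responses[s][i] for i in items)
    return total / (n_students * n_items)


# Simulated (hypothetical) population: 2,000 students, 60 items.
rng = random.Random(0)
population = [[1 if rng.random() < 0.6 else 0 for _ in range(60)]
              for _ in range(2000)]
true_mean = sum(map(sum, population)) / (2000 * 60)

# Each of 300 students answers 10 items: 3,000 responses instead of 120,000.
estimate = matrix_sample_mean(population, n_students=300, n_items=10, rng=rng)
print(round(true_mean, 3), round(estimate, 3))
```

In this sketch the estimate rests on one-fortieth of the responses that testing every student on every item would require, which is the source of the savings in cost and testing burden.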
Another virtue to note is that current testing methodology makes
possible comparisons over time. The collection of data on learning in-
dicators is of limited value unless the measurement can be repeated,
since the purpose of school evaluation is to detect change, that is, to see if
student performance is improving. Given that test-score scales are
arbitrary, measures taken on a single occasion may be of limited
value. The only way in which such measures would be interpretable
would be for the scores to have intrinsic meaning apart from com-
parative interpretations.
School evaluation is concerned not only with measuring change in
the same individuals over a period of time but also with comparing
the performance of successive groups of students at a particular
stage of instruction, such as the end of the eighth grade. The latter
kind of comparison is of particular interest at state and national
levels. Unfortunately, it poses a difficult problem of interpretation
because of possible changes in the composition of the groups that
have nothing to do with instruction. And there are many other
problems of interpretation due to the use of fallible instruments,
the possibility (if not likelihood) that a given test does not measure
the same abilities before and after a period of training, the lack of
random assignment of students, the lack of equal units on a score
scale, the unreliability of difference scores, and so on (see Harris,
1963). But statistical test theory has provided workable answers to
many of these problems, for example, in the development of methods
of equating scores on different versions of a test (Angoff, 1984). The
development of item-response theory (Lord, 1980) provides workable
solutions to other problems. The extensive test theory that has been
developed should be retained, but it needs to be adapted as necessary
for use with new testing procedures.
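
As an illustration of what equating scores on different versions of a test involves, the following minimal sketch uses linear equating, one classical approach in which a score on one form is mapped to the scale of another by matching means and standard deviations. It assumes the two forms were taken by equivalent groups; the function name and the score lists are hypothetical, and operational testing programs use more elaborate designs.

```python
from statistics import mean, pstdev


def linear_equate(x, form_x_scores, form_y_scores):
    """Map score x on form X to the form Y scale by matching means and SDs."""
    mean_x, mean_y = mean(form_x_scores), mean(form_y_scores)
    sd_x, sd_y = pstdev(form_x_scores), pstdev(form_y_scores)
    return mean_y + (sd_y / sd_x) * (x - mean_x)


# Hypothetical score distributions from two equivalent groups.
form_x = [10, 12, 14, 16, 18]
form_y = [20, 24, 28, 32, 36]

# A score at the form X mean maps to the form Y mean.
print(linear_equate(14, form_x, form_y))
```

A score that lies a given number of standard deviations above the mean on one form is thus placed the same number of standard deviations above the mean on the other.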
[...]
could be used not only for assessment but also for practice and
to provide information for remediation, and assessments based on
exercises designed to probe higher-order learning should raise educa-
tional standards by providing models of performance to be emulated
by both students and teachers.
An organization is needed to encourage, conduct, and coordinate
the development of the needed assessment materials. The development
of new assessment materials is costly, in both money and intellectual
resources; needless duplication of effort must be avoided. This
implies that the areas most in need of research and development of
assessment techniques must be defined, newly developed instruments
must be evaluated for their quality, and facilities for the distribution
of materials to schools and teachers must be created.
The problem of test validation is particularly important for any
new generation of tests that may be developed to assess proficiency
in science and mathematics. The approach that has typically been
used for test validation (finding a variable that may be thought
of as a criterion and computing a correlation) will probably not
be feasible, since no reasonable criterion is likely to exist. Clearly,
another method is needed.
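
For reference, the conventional approach just described amounts to a single computation, sketched below with a hand-written Pearson correlation. The test scores and criterion values are invented for illustration; the difficulty raised above is not the arithmetic but the absence of any credible criterion variable to supply as the second argument.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: test scores and an external criterion for six students.
test_scores = [52, 61, 47, 70, 66, 58]
criterion = [2.9, 3.4, 2.5, 3.8, 3.3, 3.1]
print(round(pearson_r(test_scores, criterion), 2))
```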
The most reasonable method for validating the kinds of tests that
have been proposed is construct validation (Cronbach and Meehl,
1955). Messick (1975) defines construct validity as "the process of
marshalling evidence in the form of theoretically relevant empirical
relations to support the inference that [a test score] has a particular
meaning" (p. 955). The implication is that a theory about the
nature of the performance in question is necessary, and validation of
a test involves a scientific investigation to see if the procedures and
cognitive processes displayed in taking the test are consistent with
the theory.
A study of construct validity of free-response tests intended to
measure skill in problem solving may be used as an illustration (N.
Frederiksen, 1986). One test consisted of diagnostic problems that
simulated a meeting of a doctor and a new patient, and the other
test involved nonmedical problems, such as why there are relatively
fewer old people in Iron City than in the rest of the state. Both
tests used a format that required examinees to go through several
cycles of writing hypotheses, asking for information to test their
hypotheses, and revising the list of hypotheses until they arrived at
a solution. The subjects were fourth-year medical students. The
theory about cognitive processes assumed that such verbal skills as
reading, various reasoning abilities, science knowledge, and cognitive
flexibility (ability to give up unpromising leads) would all be involved
for both kinds of problems. In addition, skill in retrieving relevant
information from long-term memory would be important. In the
case of the medical problems, medical knowledge would of course
also be necessary. The salient findings from a correlational analysis
of the data showed that, as expected, medical knowledge was clearly
the most important resource in solving the medical problems, and
of course it was of little or no help in dealing with the nonmedical
problems. For nonmedical problems, ideational fluency, or skill in
retrieving relevant information from memory, was by far the best
predictor of performance, but it was of little or no value in solving
medical problems. Thus the information-processing theory had to be
revised.
Embretson (1983) reports a more elaborate study involving
latent-trait modeling for the identification of the theoretical mecha-
nisms that underlie performance on a task and exploring the network
of relationships of test scores to relevant variables. Experimental
methods for testing a theory about test performance are also feasible
and probably are preferable to correlational methods.
Summary
Currently available multiple-choice tests are adequate primar-
ily for assessing student learning of the declarative knowledge of a
subject. They are not adequate for assessing conceptual knowledge,
most process skills, and the higher-order thinking that scientists,
mathematicians, and educators consider most important. Since cur-
rent efforts to improve curricula are beginning to concentrate on
these skills, new tests and other assessment devices are needed to
serve as national indicators of student learning in mathematics and
science. The tests should include exercises that employ free-response
techniques: not only pencil-and-paper problems but also hands-on
science experiments and computer simulations. Tests for measur-
ing the component skills involved in reasoning and problem solving
should also be developed. The improvements in testing can be made
feasible, despite higher costs, by the use of computer-based tech-
niques, by matrix-sampling methods, and by the use in instruction
of exercises developed for the tests.
Currently the area of greatest curricular change is in elementary
school, grades K-5. A number of school systems are attempting
to implement inquiry-based, hands-on instructional programs in sci-
ence. These programs are considered exemplary by both scientists
and science teachers, and they urgently need the support of as-
sessment instruments that match the new emphasis on teaching for
understanding and for more complex thinking skills. Prototypes of
free-response techniques exist that could be adapted for use at the
K-5 level in the near future.
Recommendations
Indicators of student learning at the national, state, and local
levels should be based on scores derived from assessment methods
that are consonant with a curriculum that includes all major cur-
ricular objectives, including the learning of factual and conceptual
knowledge, process skills, and higher-order thinking in specific con-
tent areas. Such tests should exhibit a range of testing methodology,
including use of free-response formats.
Research and Development: To provide the requisite
tests for use as indicators of student learning, the committee
recommends that a greatly accelerated program of research
and development be undertaken aimed at the construction
of free-response materials and techniques that measure skills
not measured with multiple-choice tests. The committee
urges that the development of science tests at the K-5 level
receive immediate attention.
Techniques to be developed include problem-solving tasks, as
exemplified by the College Board Advanced Placement Tests; pencil-
and-paper tests of hypothesis formulation, experimental design, and
other tasks requiring productive-thinking skills, as exemplified by
questions in the British Assessment of Performance Unit Series;
hands-on experimental exercises, as exemplified by some test materi-
als administered by the National Assessment of Educational Progress
(NAEP) and the International Association for the Evaluation of Edu-
cational Achievement (IEA), and simulations of scientific phenomena
with classroom microcomputers that give students opportunities for
experimental manipulations and prediction of results.
The creation of new science tests for grades K-5 should be done
by teams that include personnel from the school districts that have
been developing hands-on curricula to ensure that the new tests
match the objectives of this type of instruction. In addition to
providing valid national indicators of learning in areas of great im-
portance, such new assessment materials for science in grades K-5
will provide models of tests that state and local school officials may
want to adopt and use.
Key Indicator: The committee recommends that assessment
of student learning using the best available tests and
testing methods continue to be pursued in order to provide
periodic indicators of the quality of science and mathematics
education.
Tests should be given to students in upper-elementary, middle,
and senior high school (for example, in grades 4, 8, and 12). Because
of the rapid changes taking place in science instruction in grades K-5,
assessment at this level should be carried out every two years, using
exercises developed according to the preceding recommendation. For
higher levels, a four-year cycle is appropriate. The tests should be
given to a national sample, using matrix-sampling techniques. Test
scores should be available for each test item or exercise and should
be reported over time and by student subgroups (e.g., gender, race,
ethnicity, rural/inner city/suburban community). As in previous
assessments, results should also be reported by geographic region;
efforts now under way may make possible state-by-state comparisons
in the future. Similar procedures are appropriate for indicators of
state or district assessments of student learning.
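
As a rough illustration of the matrix-sampling idea mentioned above, the following Python sketch divides an exercise pool into several shorter booklets and gives each sampled student only one of them, so that the whole pool is covered across the sample while no student takes every exercise. The pool size (60 exercises), the number of booklets (6), the number of students (1,000), and the function name matrix_sample are all assumptions made for this example; the committee does not specify a particular design.

```python
import random


def matrix_sample(item_ids, student_ids, n_forms, seed=0):
    """Split an item pool into n_forms disjoint booklets and
    assign exactly one booklet to each student.

    All names and sizes here are hypothetical illustrations.
    """
    rng = random.Random(seed)

    # Shuffle the pool, then deal items round-robin so booklet
    # sizes differ by at most one item.
    items = list(item_ids)
    rng.shuffle(items)
    forms = [items[i::n_forms] for i in range(n_forms)]

    # Shuffle the students, then rotate through the booklets so
    # each booklet is taken by a near-equal share of the sample.
    students = list(student_ids)
    rng.shuffle(students)
    assignment = {s: i % n_forms for i, s in enumerate(students)}

    return forms, assignment


# Hypothetical sizes: 60 exercises, 1,000 students, 6 booklets.
forms, assignment = matrix_sample(range(60), range(1000), n_forms=6)
print(len(forms[0]))  # number of exercises any one student answers
```

Under these assumed sizes, each student answers a sixth of the pool, yet every exercise still receives responses from roughly a sixth of the sample, which is what makes item-level reporting possible at modest cost per student.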
Research and Development: The committee recommends
that a research and development center be established to
provide for the efficient production, evaluation, and distribu-
tion of assessment materials for use as indicators of student
learning at district, state, and national levels and for use by
teachers in instruction.
The center should function as a centralized resource and clear-
inghouse that would make it possible for school people to survey the
available assessment materials and obtain those desired. It might be
called the National Science and Mathematics Assessment Resource
Center. It should be tied closely to efforts to improve the curriculum
and be an active partner in the total system of educational reform.
The committee suggests that as a beginning a group of experts be
convened to prepare a plan for the creation of the proposed center,
including its management and operation, and that the plan serve as
the basis for the founding of the center by a suitable educational es-
tablishment or a consortium of universities and educational research
organizations.
IMPLICATIONS FOR STATE EDUCATION AGENCIES
The assessment of what students have learned and their ability
to apply that knowledge is a major task of accountability for state ed-
ucation agencies. Such assessments can function to assure the public
and their elected representatives that both human and material re-
sources are available and meet certain standards, that the resources
are appropriately distributed to schools, and that the effects of all the
human and monetary investments are reflected in student learning.
Using that basic premise, the state has a vital stake in valid yet
feasible ways to evaluate what students know about mathematics
and science. The state's role of leadership in assessment is quite
important, and the committee is concerned that the complexities of
assessing student learning be clearly understood and then attacked.
If the state language-arts assessment is merely a multiple-choice
grammar test, a direct message (intended or not) is sent to every
teacher that the writing process itself is not important. Similarly, in
the committee's view, if a state or school science assessment consists
solely of a multiple-choice test, then clearly the measurement is
equally limited.
Representatives of state and local systems told the committee
that the recommended assessment resource center, if it were to be
implemented, would fill a major gap for schools, states, and the As-
sessment Center Project of the Council of Chief State School Officers
(see Appendix D). The assessment approaches based on hands-on
investigation and computer simulation that would evolve from the
proposed resource center could serve two functions for states and
local communities. On a sample basis, the results of assessments us-
ing such new techniques would themselves be an important indicator
at the state and national levels of student learning, and simultaneously
such an assessment approach would provide a model that the
committee believes to be important. While states may be able to
contribute to the assessment resource center, probably only a non-
governmental institution could muster sufficient resources to develop
and evaluate the new approaches, as well as to create imaginative
ways to improve traditional multiple-choice testing of factual
knowledge and simpler kinds of procedural knowledge. The curriculum
frameworks discussed in Chapter 7 should guide the development at
the proposed resource center of outcome measures, including mea-
sures not only of factual and conceptual knowledge, but also of the
information-processing skills that are necessary for acquiring profi-
ciency in science and mathematics.
ASSESSING ADULT SCIENTIFIC AND
MATHEMATICAL LITERACY
There are several reasons why assessment of student learning
should be extended to assess trends in the science and mathematics
literacy of the entire population. First, one of the reasons to care
about the quality of mathematics and science instruction in school is
that it will influence mathematics and science literacy throughout the
population; trends in the mathematics and science literacy of adults
will in time provide information about the long-term consequences
of attempts to improve the science and mathematics education pro-
vided in the nation's schools. The issue of adult literacy may raise
important questions as to whether schools should emphasize imme-
diate knowledge retention or learning that is likely to be retained in
adulthood. Second, children's interest in mathematics and science is
influenced by the extent to which the adults in their lives know about
and show an interest in these subjects. Consequently, changes in the
science and mathematics literacy of adults may foster changes in the
skills and attitudes about science and mathematics that students
bring to school. Third, and most important, the science and mathematics
literacy of adults is a major goal of science and mathematics
education.
Results of previous efforts to assess scientific literacy in the
United States have not been reassuring. For example, Miller (1986)
reports on surveys of U.S. adults conducted in 1979 and 1985 that
included questions on the meaning of scientific study, cognitive sci-
ence knowledge, and attitudes on organized science. On the basis of
the survey responses, he classified 7 percent of the public as
scientifically literate in 1979 and 5 percent in 1985. Young adults (ages
17-34) did slightly better (11 and 7 percent, respectively); also, the
percentage increased with increasing education. However, within the
population who were high school graduates but who had not gone on
to college, only 2 percent in 1979 and 3 percent in 1985 were deemed
to be scientifically literate. Such results increase the need for future
study of the population's scientific literacy and the long-term effects
of science education.
Desired Attributes of Indicators
Any plan to generate indicators of scientific and mathematical
literacy should try to estimate the degree to which a population
possesses the kind of knowledge and intellectual skills outlined in
Chapter 2. Assessment plans should be based on the following con-
siderations:
• A single measure will not do, because science and mathematics
literacy involves multiple dimensions of a complex set of
characteristics. The indicators to be used should be matched to the
models of literacy discussed in Chapter 2.
• The indicators should recognize that there is no single, absolute
level of literacy and that various levels of attainment in different
components of a community or population group are likely.
• Any measures used to generate indicators should be supplemented
by research to validate what is actually being measured.
• Indicators may be expressed in terms of descriptive patterns
of problem solving and other nonnumerical ways.
At this stage, there is no particular reason to favor one method
of data collection over another. Therefore, several techniques, such
as conducting surveys (see, e.g., Miller, 1983), interviews, and case
studies, should be considered in deciding what information to collect
in order to develop indicators. As with students, traditional methods
may work reasonably well to assess knowledge, but indicators should
also probe the population's understanding of the nature of science
and its role in society. It is particularly important and difficult to ob-
tain reliable estimates of problem-solving skills. In the committee's
view, their assessment must go beyond individual pencil-and-paper
tests and should include observation and analysis of individual and
group responses to carefully selected phenomena involving real objects
and filmed sequences of events.
In some sense, the need for assessment of adult literacy is not
as urgent as the need for assessing students; after all, fewer policy
decisions will or can be driven by such assessments. Therefore, the
next two or three years can be devoted to the interim development
of pencil-and-paper tests and tests involving real objects. These will
provide a measure of adult literacy that can be correlated both to
existing tests of learning (say, of 17-year-olds) and to the assessment
techniques that the committee has proposed for in-school learning.
Since an important aspect of science and mathematics literacy is
continuing self-education, some of the assessment techniques suggested
in the preceding sections may also be appropriate for adult literacy.
Target Populations for Assessment
The committee considers education policy makers for elementary
and secondary schools at state, local, and also national levels to be
prime users of indicators of the quality of science and mathematics
education. This has implications and raises interesting issues for the
design of a set of indicators to assess the scientific and mathematical
literacy of adults. One issue, for example, is how the out-of-school
population should be stratified in various ways for assessment pur-
poses. One way is to divide it into the following groups:
• Parents and guardians of children enrolled in elementary/secondary
schools, public or private; alternatively, those with school-age
children.
• Individuals who work in mathematics- or science-related fields
or use mathematics or science in their work.
• Individuals, stratified perhaps by age groups related to other
national surveys, such as the National Assessment of Educational
Progress and the longitudinal follow-up surveys of earlier high school
classes sponsored by the Center for Education Statistics (National
Center for Education Statistics, 1981, 1984).
Considering the first group, if an in-school science assessment
includes a particular student, should the parents or guardians of that
student be included in a science literacy assessment? If so, should
the assessment include both parents, a randomly selected parent, the
mother, the father, or some combination of these?
Data Collection Strategies
The following suggestions outline an initial program and illus-
trate one way in which a measurement effort might begin. The
agency assigned responsibility for the measurement of scientific and
mathematical literacy should be given responsibility for developing
the details of methodology. An initial program would be devoted to:
• providing benchmark data for the country as a whole, using
largely available material, and
• research to develop, validate, and field-test instruments to
better measure people's understanding of the nature of science and
to obtain reliable estimates of their problem-solving skills.
The projected interviews and administration of exercises proba-
bly would require personal visits to households by the interviewers,
although some screening of households and some data collection
might be done by telephone. The assessment might begin by provid-
ing benchmark data for all adults by gender and by broad age group
and for parents and guardians of school-age children. This data base
would later be expanded to provide measures for subgroups of the
population, for example, by educational attainment and by race and
ethnicity.
Although the program would be targeted to adults 17 years of age
and older, it should be expanded to include children in elementary
and secondary schools as in-school testing programs begin to
include the measurement of scientific and mathematical literacy. The
objective would be to provide links between school and household
measures, as well as to provide a household-based unit of analysis for
adults and children.
If the assessment is to serve as a reliable base for policy, it will
need to be based on a probability sample for which estimates of sta-
tistical reliability can be provided. The goal should be a high rate
of cooperation in the survey by individuals selected in the sample.
Completion rates of 85 to 90 percent are a reasonable expectation.
The sampling could be based on a multistage approach. At
the first stage, a sample of perhaps 100 areas would be selected.
These areas would be counties or school districts and, if spread pro-
portionately across the country, would be distributed across about
40 states. The sample could, however, be designed to include all
states. Within each area, a sample of no fewer than 50 adults would
be drawn from randomly selected city blocks or corresponding small
areas outside cities, with at least 5 households sampled per block
and 1 adult interviewed per household. Households would be sam-
pled for this purpose according to their number of adults in order to
give each adult tested approximately the same weight. With an 80
percent cooperation rate, this plan would yield interviews/tests with
no fewer than 4,000 adults. That number would provide an adequate
data base for analysis.
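
The arithmetic behind the figure of 4,000 adults, and the reason for sampling households according to their number of adults, can be checked with a short Python sketch. It uses only the numbers given above (100 areas, at least 50 adults per area, an 80 percent cooperation rate, 1 adult interviewed per household). The function names and the base sampling rate of 1 in 1,000 are hypothetical, introduced only for illustration, and the equal-weight calculation assumes that the one adult interviewed is chosen at random within the household.

```python
from fractions import Fraction


def expected_interviews(n_areas, adults_per_area, cooperation_rate):
    """Completed interviews = adults drawn x share who cooperate."""
    return n_areas * adults_per_area * cooperation_rate


def adult_selection_probability(n_adults_in_household, base_rate):
    """Chance that a given adult is interviewed when a household is
    drawn with probability proportional to its number of adults and
    one adult is then picked at random inside it (an assumption)."""
    p_household = base_rate * n_adults_in_household
    p_within = Fraction(1, n_adults_in_household)
    return p_household * p_within


# Minimum design from the text: 100 areas x 50 adults, 80% cooperation.
print(expected_interviews(100, 50, Fraction(80, 100)))  # 4000

# Hypothetical base rate of 1 in 1,000; household size cancels out.
for size in (1, 2, 4):
    print(size, adult_selection_probability(size, Fraction(1, 1000)))
```

Because the household's selection probability grows with its number of adults while each adult's chance within the household shrinks by the same factor, the two effects cancel and every adult ends up with the same overall probability of selection. Since the text specifies "no fewer than" 50 adults per area, 4,000 is a lower bound on the expected yield.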
To monitor changes in the population, the basic survey should
be repeated at four-year intervals. During intervening years, effort
could be concentrated on developing and testing improved assessment
methodology.
Assessing Grasp of Grand Conceptual Schemes
As with school students, it is important to find out to what
extent the adult population is familiar with key scientific concepts
and understands their applications. While such high-order
knowledge may seem at first to resist assessment, it can be probed with
the following kind of exchange, probably best administered in an
interview:
• Listen to (or look at) this list of ideas: plate tectonics, evolution,
gravitation, the periodic table. Is there one of them that you
would be willing to talk about a bit more?
• Response (for example): plate tectonics.
• I'd like you to take a few minutes to think about plate tectonics.
Please think about these two questions and answer them in
whichever order you prefer. How would you briefly describe what
plate tectonics are to someone who didn't know about them? What
examples can you give me of things or events that plate tectonics
cause or are involved in?
• Would you like to talk about another of these ideas?
Several aspects of this sample exchange are important. First, it
is in the free-response format, which is needed to probe the active
knowledge of the respondent and to permit flexibility in answering.
Second, it evokes both a definition and specific applications of the
selected "grand conceptual scheme." Since part of the power of these
schemes is their ability to unify phenomena, being able to define the
terms without appreciating any of the applications is to lose much
of their force. Third, by including some example that virtually all
adults have encountered, a minimum level of literacy can be assessed.
Finally, because the questions are open-ended and recursive, they
permit assessment of both breadth and depth. Although it may
be difficult to do so, it would be important to establish to what
extent people's responses are based on knowledge gained in school
and to what extent they draw on knowledge gained from subsequent
reading, television programs, museum visits, and so on, even given
that school ought to teach one to continue to learn beyond one's
formal education.
Recommendation
Key Indicator: The committee recommends that, starting
in 1989, the scientific and mathematical literacy of a random
sample of adults (including 17-year-olds) be assessed.
The assessment should tap the several dimensions of literacy
discussed in Chapter 2 and should be carried out every four
years.
To make the desired types of assessment possible, effort should
be devoted over the next two years to developing interim assessment
tools that use some free-response and some problem-solving compo-
nents; these assessment tools should be used until more innovative
assessment techniques, described in this chapter, are available. The
data collected should be aggregated to provide measures of depth
and breadth of knowledge and understanding. They should also be
aggregated by age, gender, race, ethnicity, socioeconomic status, and
geographic region so as to establish to what extent there are sys-
tematic inequities in the distribution of scientific and mathematical
literacy.
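
A minimal sketch of the kind of subgroup aggregation recommended above is given below in Python. The record layout, the field names ("region", "literate", "weight"), and the four placeholder records are all hypothetical and carry no empirical meaning; they only show how a weighted share could be computed for each subgroup so that shares can be compared across groups.

```python
from collections import defaultdict


def literacy_rate_by_group(records, key):
    """Weighted share of respondents classified as literate,
    computed separately for each value of the grouping field."""
    totals = defaultdict(float)
    literate = defaultdict(float)
    for r in records:
        w = r.get("weight", 1.0)  # default to equal weights
        totals[r[key]] += w
        if r["literate"]:
            literate[r[key]] += w
    return {group: literate[group] / totals[group] for group in totals}


# Placeholder records, invented purely to exercise the function.
records = [
    {"region": "A", "literate": True},
    {"region": "A", "literate": False},
    {"region": "B", "literate": False},
    {"region": "B", "literate": False},
]
print(literacy_rate_by_group(records, "region"))
```

The same function can be called with any other grouping field (age band, gender, and so on) as long as each record carries that field.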