

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.



3
What Are Indicators?

DEFINING INDICATORS

Identifying the domains that need to be monitored is the first step in developing indicators of the quality of science and mathematics education. The next step is to define what indicators are and how they should be distinguished from such other data as simple descriptive statistics or various kinds of qualitative information. In its earlier report (Raizen and Jones, 1985:27-28), the committee defined an indicator as "a measure that conveys a general impression of the state or nature of the structure or system being examined. While it is not necessarily a precise statement, it gives sufficient indication of a condition concerning the system of interest to be of use in formulating policy." For a statistic or measure to be used as an indicator, it must have a reference point so that a judgment can be made whether the condition being described is getting better or worse (Oakes, 1986). The notion of judgment has been integral to the development of social indicators, as reflected in an early report by the U.S. Department of Health, Education, and Welfare (1969:97):

[An indicator is a] statistic of direct normative interest which facilitates concise, comprehensive and balanced judgement about the condition of major aspects of society. It is, in all cases, a direct measure of welfare and is subject to the interpretation that if it changes in the right direction, while other things remain equal, things have gotten better, or people are better off.

INDICATORS OF SCIENCE AND MATHEMATICS EDUCATION

The literature on indicators is huge (White, 1983), and so an exhaustive treatment here of distinctions between indicators and other types of information is impractical. But a recurring theme that runs through much of this literature is that indicators usually imply a causal theory or model of how some underlying process operates to generate a particular value of the indicator. This distinction is evident in the following definition (Carley, 1981:67-68):

Social indicators, virtually by definition, specify causal linkages or connections between observable aspects of social phenomena, which indicate, and other unobservable aspects or concepts, which are indicated. This can only be accomplished by postulating, implicitly or explicitly, some causal model or theory of social behavior which serves to relate formally the variables under consideration. All social indicator research represents, therefore, some social theory or model, however simplistic. Much research to date laying claim to the term "social indicators research" consists either of descriptive social statistics, which some have argued are not social indicators at all, or of implicit postulations of causal linkages.

To be sure, all indicators are in some sense statistics, although the reverse is not so clear. Figures on crime rates are obviously important social indicators, but are the "number of police officers per capita" social indicators as well? Yes and no. They may be indicators of the value a society places on security, they may indicate the presence of an oppressive regime, they may indicate the extent of patronage, and they may also indicate crime rates indirectly. The point is that the theory connecting "number of police officers" to some condition in society is considerably more tenuous and remote than that connecting "number of murders" or "number of property thefts" per capita. The same logic applies to changes in an indicator versus changes in a statistic.
There is virtually universal agreement on the right direction of a change in crime rates, but the right direction of a change in the number of police officers (or any other group, for that matter) is open to debate.

How should indicators be used in policy formulation? To answer this question requires knowledge about the goals of a society as well as a theory about the nexus of causal linkages and processes that combine to produce the indicator. An unfortunate limitation of all indicators is that, while they can inform about the state of their respective domains, they cannot tell how the observed changes have come about. They cannot tell what, precisely, to do about the situation. Once the choice has been made on what social condition to assess, indicators are neutral, summary snapshots of that condition. Their implications for policy and action derive not from some inherent property they possess, but rather from the theory that the policy maker has about the underlying processes. However, it is possible to increase the utility of indicators to policy makers by ensuring that, to the extent possible, they:

- consist of reliable and valid information that is as closely related to an important aspect of the educational system as possible,
- have reasonably direct policy implications,
- be small in number, and
- be easily understood by a broad audience.

In consideration of these criteria, the committee has grouped its recommendations on indicators into three categories: (1) key indicators that are or would be feasible given adequate investment in experimentation and development and that should be included in even the most parsimonious monitoring system, (2) supplementary indicators that are presently feasible or might be developed, and (3) research on hypothesized causal links among some important but poorly understood aspects of education in order to create and validate indicators related to these aspects.

INTERPRETING INDICATORS

Once a value has been established for a given indicator, there are essentially three possible interpretations, all of which involve comparisons of some sort. First, the value of the indicator might be compared with some absolute standard. For example, professional consensus might be used to establish a "minimum knowledge level" for a new K-5 teacher. An indicator of this could be scores on a pencil-and-paper test to measure the amount of knowledge attained by teachers. Interpretation would involve comparison of the teachers' scores with the absolute standard. (It should be noted in passing that absolute or ideal values for most indicators are difficult to establish.)
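The first kind of comparison can be sketched in a few lines of code. The cut score and the individual teacher scores below are entirely invented for illustration; the report itself proposes no particular standard.

```python
# Comparison against an absolute standard: a hypothetical cut score
# representing a consensus "minimum knowledge level" for new teachers.
# All numbers here are fabricated for illustration.
MIN_KNOWLEDGE_CUT = 70

scores = {"teacher_1": 84, "teacher_2": 66, "teacher_3": 73}

# The indicator: the share of teachers meeting the standard,
# plus a list of those falling below it.
below = [name for name, s in scores.items() if s < MIN_KNOWLEDGE_CUT]
share_meeting = 1 - len(below) / len(scores)
print(f"{share_meeting:.0%} meet the standard; below the cut: {below}")
```

Note that the entire interpretive weight rests on the cut score: move it and the indicator changes, which is exactly why the parenthetical caution above matters.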
A second interpretation involves comparison of a given indicator value with its value at some prior time. For example, the percentage of high school students who took a physics course in a given year might be compared with the percentage who took a physics course in some prior year.

Third, indicators can be presented as a basis for the comparison of instructional programs, demographic groups, states, regions, countries, and so on. The proper interpretation of such comparisons is limited because of differences in social, political, economic, cultural, and other characteristics. Nevertheless, when data are disaggregated on any basis and presented side by side on a page, the temptation to make evaluative comparisons, whether warranted or not, is overwhelming and nearly universally succumbed to.

Problems in interpreting educational indicators fall into three broad categories and are sufficiently pervasive to merit brief mention here, together with suggestions for avoiding or at least minimizing their adverse consequences. The problems are (1) choice of variables, (2) levels of aggregation, and (3) scale. These problems of interpretation have to be faced before data collection can begin.

Choice of Variables

Even after the key domains to be monitored have been identified (for our purpose: student learning, general scientific and mathematical literacy, student behavior, teaching quality, curriculum quality, and financial and leadership support), the number of possible variables from which to choose in constructing indicators of science and mathematics education remains large; a partial list could well number over 100. According to the committee's formulation, various teacher and student behaviors and the incentives and constraints that influence them are presumed to be causally related; for example, the quality of the curriculum and the use made of it by the teacher affect student competence in science and mathematics and student attitudes. To what extent will the conclusions one draws from one combination of variables be similar to the conclusions one would have drawn had a different set of variables of the same underlying condition been used to construct the indicator? The answer to this hypothetical question depends critically on the quality of the sets of variables and the manner in which they were combined.
One gets an entirely different picture of the educational health of the nation depending on whether one looks at high school dropout rates, results from the National Assessment of Educational Progress (NAEP), student career choices, or the amount of homework assigned per pupil. Each variable or combination of variables highlights a different aspect of the complex construct "educational health." The accuracy and appropriateness of interpretations and policy decisions are limited by the quality of the indicators themselves and the manner in which they are combined.
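The sensitivity of conclusions to variable choice can be made concrete with a toy example. The district names and figures below are invented, not drawn from any survey; each candidate variable, taken alone, implies a different ranking of "educational health."

```python
# Three hypothetical districts described by three candidate variables
# (all figures fabricated for illustration).
districts = {
    "District A": {"dropout_rate": 0.08, "test_score": 265, "homework_min": 30},
    "District B": {"dropout_rate": 0.12, "test_score": 280, "homework_min": 45},
    "District C": {"dropout_rate": 0.05, "test_score": 255, "homework_min": 60},
}

def rank_by(variable, higher_is_better):
    # Order districts from "healthiest" to "least healthy" on one variable.
    return sorted(districts, key=lambda d: districts[d][variable],
                  reverse=higher_is_better)

print(rank_by("dropout_rate", higher_is_better=False))  # C, A, B
print(rank_by("test_score", higher_is_better=True))     # B, A, C
print(rank_by("homework_min", higher_is_better=True))   # C, B, A
```

Three variables, three different orderings of the same three districts: any composite indicator built from them inherits whatever weighting the designer chooses.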

Problems of Aggregation

When data are aggregated from one level (e.g., students) to another (e.g., classrooms, schools, or districts), numerous interpretive difficulties arise. Data should be collected and aggregated according to a clear conception of schooling and with a view of who will use the information and for what purpose. Data aggregated to levels that are inappropriate to relevant policy decisions may be quite misleading; for example, statewide averages on teacher salaries may not be useful information for a particular school district. In general, data at the level of the individual student are most useful to that student's teacher, classroom-level data are of most interest to principals, school-level data are most useful to superintendents, and so on.

Aggregation Effects and the Ecological Fallacy

Levels of aggregation exert important effects on correlation coefficients. These effects help to explain why the results of educational research vary so much from study to study. What, for example, is the correlation between socioeconomic background and achievement test scores? Is it .3? .6? .9? All three are possible. The correlation depends on the unit of analysis, the population sampled, and the way the two constructs are measured. If one takes these three factors into account, the results are fairly consistent. Using national samples of high school students, family income correlates about .3 with achievement test results at the student level. Aggregating to the school level, the correlation is between .5 and .6 among school means nationally. If, however, one looks within large urban districts, the school-level relationship is between .8 and .9. The district-level relationship varies from state to state (.2 to .6), and at the state level the correlation between 1975 poverty rates and state achievement estimates is .63 (N = 50 states). Table 3-1 summarizes these results.
Other differences are found when looking at different grade levels, or when indicators other than poverty are used to represent home background. For example, in Project TALENT, an indicator of socioeconomic environment based on home variables that were hypothesized to exert a more direct effect on achievement (mother's education, books in the home, whether the child has his or her own desk, etc.) correlated .5 at the student level for high school students (Flanagan and Cooley, 1966).

TABLE 3-1  Socioeconomic Background and Achievement

Unit of Analysis   Population Sampled     SES Indicator      Correlation
Student            National               Income             .2 to .4
Student            National               Home environment   .5
School             Large urban district   Income             .8 to .9
School             National               Income             .5 to .6
District           Within state           Income             .2 to .6
State              National               Income             .6

Source: Cooley et al. (1981).

What is the appropriate unit of analysis? It depends, of course, on the question being asked. A scatterplot depicting the modest relationship between socioeconomic status (SES) and achievement at the student level is typically an oval-shaped swarm of points with few outliers. Given this fact, inferring from the within-district school-level correlation of .9 that most low-achieving students come from poor homes is an excellent example of what sociologists call the ecological fallacy: the error of using relationships at one level, such as school, to describe relationships at a lower level, such as student (Robinson, 1950).

Correlations at one level of analysis differ from correlations at another because of the grouping effect. This occurs when membership in the group (e.g., class or school) is related to either one or both of the variables being correlated. For example, the socioeconomic homogeneity of neighborhoods produces a relationship between SES and school, and that relationship produces the larger correlation between SES and achievement at the school level than at the student level.

Many statisticians would argue that the proper procedure is not to use correlations at all when, as in the case illustrated, regressions are appropriate (see, e.g., Cain and Watts, 1970). However, the use of correlations is so universal in analyzing and reporting educational data that we consider it important to warn against misinterpretations. Our brief discussion of problems in "ecological inference" merely scratches the surface.
A detailed and comprehensive (although not too technical) treatment is provided by Langbein and Lichtman (1978).
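The grouping effect is easy to reproduce in simulation. The sketch below uses arbitrarily chosen parameters, not estimates from any real survey: students are nested in schools whose neighborhoods are socioeconomically homogeneous, and the same data yield a modest student-level correlation but a much larger school-level one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools, n_per_school = 50, 100
n_students = n_schools * n_per_school

# Neighborhood homogeneity: each school has its own SES center,
# and students' SES scatters around it.
school_ses = rng.normal(0.0, 1.0, n_schools)
ses = np.repeat(school_ses, n_per_school) + rng.normal(0.0, 1.2, n_students)

# Achievement depends only modestly on SES at the student level.
achievement = 0.4 * ses + rng.normal(0.0, 2.0, n_students)

student_r = np.corrcoef(ses, achievement)[0, 1]

# Aggregating to school means averages away within-school noise,
# inflating the correlation (the grouping effect).
ses_means = ses.reshape(n_schools, n_per_school).mean(axis=1)
ach_means = achievement.reshape(n_schools, n_per_school).mean(axis=1)
school_r = np.corrcoef(ses_means, ach_means)[0, 1]

print(f"student-level r = {student_r:.2f}; school-level r = {school_r:.2f}")
```

With these parameters the student-level correlation comes out around .3 while the school-level correlation is close to .9, mirroring the pattern in Table 3-1. Reading the school-level figure as if it described individual students would be precisely the ecological fallacy discussed above.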

Inconsistent Aggregation and Self-Selection

Every student of elementary statistics is warned early in instruction that teasing out causal relations among any set of variables can be a tricky and often misleading endeavor. It is surprising how often unwarranted causal conclusions are drawn from summary indicators, whether or not the persons involved have had training in data interpretation. The temptation, for example, to judge the quality of education in a state by the mean Scholastic Aptitude Test (SAT) scores of its graduates, despite cautionary statements issued by the College Board (e.g., Hanford, 1986), is a case in point. This annual practice illustrates in a nutshell most of the pitfalls considered in this section.

Why are mean SAT scores inappropriate indicators of the comparative quality of instruction in the various states? First, consider the problem of sample representativeness. How representative are students who take the SAT of the typical high school graduate? In general, college-bound seniors (the SAT population) are better prepared academically than their noncollege-bound counterparts. Moreover, there is wide variation in the percentage of students by state who take the SAT. (Some state institutions of higher education require SAT scores for admission; some do not; others require scores on tests administered by the American College Testing Program.) For example, in 1984, the percentage of high school seniors by state who took the SAT ranged from a low of 3 percent to over 65 percent. For various reasons, including self-selection, the smaller the percentage of students taking the SAT, the higher their mean SAT scores. Thus, inconsistent aggregation leads to false and misleading comparisons.

Problems of Scale

The first interpretive problem in this category involves the actual scale itself. Should absolute values (number of science and mathematics teachers, number of students taking at least two years of mathematics, etc.) be used, or should various ratios (for example, science teachers per 100 or 1,000 pupils, or the ratio of science and mathematics teachers to all teachers) be used? Often, but not always, ratios and proportions are more informative reporting units. A simple example illustrates why this is so. An absolute increase in the number of unemployed persons who are actively seeking employment is generally agreed to be a move in the wrong direction. But such an increase, by itself, may be misleading. If the entire labor force has increased significantly, it is possible that an increase in the absolute number of unemployed actually represents a decrease in the unemployment rate, that is, the percentage of the labor force that is unemployed. A counterexample from education involves increasing the length of the elementary school day and introducing an additional subject, say, health and family education. Under these circumstances, the proportion of school time devoted to mathematics might decrease, an apparent move in the wrong direction, but the number of minutes per day given to mathematics might actually increase. In many situations, it is wise to collect information on and report both types of figures. For example, it may be important to know both the absolute number of minutes per day a student spends doing mathematics as well as the percentage this figure represents of the student's total time spent on school work.

Another scale issue that, surprisingly, often goes overlooked is the use of scale units that change over time or that have different meanings in different locations. The most commonly used units are those involving monetary values. Total school budgets, dollars spent on laboratory equipment, and teacher salaries are all examples of scale units that vary to the extent that the value of the dollar varies over time and over locations. Results not adjusted for this variation may seriously distort the picture. Thus, total school expenditure should be adjusted to total expenditure per pupil, with perhaps an additional adjustment for variations in the cost of living; teacher salary should similarly be adjusted to account for local cost of living, and so on.

INDICATORS FOR WHOM?

This chapter argues that an indicator is more than another layer on a mound of statistics; rather, it can be used in a systematic attempt to investigate the interaction among selected pieces of information. Federal, state, and local education bureaucracies are awash in numbers.
The challenge taken up by the committee in this report is to go beyond an endless parade of statistical tables and focus on the key questions and subsequent indicators that will be credible to policy makers in state and local education agencies, the major decision makers, since education in the United States is overwhelmingly a state and local legal and fiscal responsibility. The challenge for state and local policy makers is to adopt and use the indicators that, when combined, best represent a snapshot of what exists today in mathematics and science education as well as point to promising policy initiatives.

At all levels of the education system, there is recognition of the need for a reliable and valid evaluation of how well students know, understand, appreciate, and use information they have received in their K-12 mathematics and science experience. And, as with any evaluation, the initial temptation is to start to collect data before the key questions have been asked. Once the questions are specified, most of the data can probably be obtained without generating a new national information system that may fall under its own weight (see Appendix E). In this respect, a concern shared by the committee and state administrators alike is feasibility. By feasibility, we mean that collection, analysis, and reporting of valid data should be possible in a timely manner, given reasonable resources. The design decisions and availability of resources that affect the frequency of collecting data, as well as methodology, may well be driven by timetables that allow indicators to interact with and influence policy.

COLLECTING INFORMATION

Once decisions have been made on the type of indicator to be used (e.g., student test scores, teacher salaries, judgments of curricular quality), there arises the question of how to collect the pertinent information. This report argues that a wide range of data-collection methods is necessary. Some of the recommended methods have been used extensively in the past, such as surveys; others are less widely used, such as time-use studies. The key challenge is to tailor the proposed data-collection methods to the type of information that is needed.

Comparability Versus Depth of Information

There is a difficult tension in the choice of data-collection methods between collecting comparable data and being open to unexpected responses.
For example, closed-ended questionnaires produce standardized information comparable across space and time and are particularly suitable for collecting information on such matters as salaries and defined fringe benefits, for which comparability is critical and the nature of the desired information is relatively clear-cut.

Closed-ended questionnaires are poorly suited, however, to the collection of information dealing with such topics as how teachers and students spend their time outside school. The reason is that the range of possible responses is much broader than can be captured by a closed-ended questionnaire. Consequently, it is important to give up standardization in favor of capturing diversity. Thus, time-use studies are more appropriate for collecting this type of information.

A related issue arises in attempts to improve achievement tests, questionnaires, and the like so that responses mirror more faithfully and in greater depth, say, what students have learned and are able to do. Two problems arise: first, to the extent that items, examples, and questions are improved to capture more and better information, comparability to earlier assessments is lost. Second, assessments are likely to become more costly, and sample sizes may have to be reduced. This may create loss of generalizability (as in studies using classroom observation), although matrix sampling and other techniques may partially overcome this problem. These problems are not cited to argue against improving assessment instruments and questionnaires (we argue quite the contrary in the next chapter) but only to sensitize those using indicators to some of the difficulties involved in designing the requisite collection of data and information.

Timing

How often should information be collected? There is tension between the expense of collecting information often and the value of up-to-date information that permits rapid discernment of changes in trends. The choice of how often to collect data for a particular indicator should depend on the importance of the indicator for informing policy and on how rapidly changes are likely to occur in the distribution of the behavior, incentive, or outcome reflected in the indicator.
Consequently, we argue for the assessment of student learning at given grade levels every four years, except for science achievement in elementary school, for which the current improvement efforts warrant assessment every two years. No matter what the frequency, it is important that each wave of information be collected at the same time of the year so as to maintain consistency and provide comparable data.

Design of Expert Panels for Assessment

At various places in this report, the committee recommends the use of panels of experts as a method for assessing instructional materials and performance when no suitable outcome measure is yet available. Because the use of experts is an often-used mechanism, we discuss the problems inherent in its application in some detail. Based in part on our experience with difficulties encountered in the experiment on reviewing the science content of science achievement tests (see Appendix B), we consider it important to make some general comments about the use of expert panels as an assessment method. First, there should be a clear understanding among the panel members as to the intent and interpretation of the material to be judged or rated. Second, if the tests or other materials are to be used for various purposes, the panel members should understand, and the ratings should distinguish among, these purposes. Third, there should be agreement as to the rating criteria. Panels can meet these three conditions by using rater "training" exercises or discussing their procedures before the actual work begins. Discussion of the ratings by panelists after they have completed their work may further help to clarify whether the purposes of the materials and the rating criteria were unambiguous. (However, it is not desirable that the panel members change their ratings as a result of the post-rating discussion, at the risk of reducing the independence of the panelists' ratings.) Such techniques help to improve the rating process and to reduce the variability between raters.

Rater Variability

Variability between raters with regard to individual items is one source of variability in panel assessments. However, the scores of an individual rater on different items tend to be correlated. This correlation is one quantification of frequently heard comments, such as that one rater tends to give high scores and another low scores.
It is not generally recognized that, as a result, the impact of rater variability on the variability of average scores or percentiles can be substantially greater than indicated by the variability between raters item by item, perhaps by an order of magnitude. In the experimental review of science achievement tests, this was true not only of types of reviewers (teachers, scientists) but also of reviewers within type. It is not feasible to eliminate these sources of rater variability. Thus, panel studies should be designed to provide estimates of rater variability and correlated variability. Such information has the potential for improving the design of expert panels, for example, for deciding on the number of panel members needed to yield acceptably reliable estimates of averages, percentiles, or other statistics of interest. With a positive correlation between the ratings of an individual reviewer by item, the use of a given number of reviewers, each rating every item, will yield less reliable statistics than a larger number, each rating a randomly chosen subsample of the items. This may be potentially useful when there is a large number of items to be rated and the rating process is time-consuming. Appreciation of the sources of rater variability will also help ensure that standard errors of statistics derived from panel ratings are properly computed.

Validity and Reliability

The design of an expert panel should consider the problems of both accuracy (validity) and precision (reliability). The concept of accuracy implies that there is a "true" value to be estimated. The true value may have a theoretical definition or may be defined only operationally as that value resulting from a set of carefully specified empirical measurement steps. A panel whose assessments differ systematically, in either a positive or negative direction, from the true values is "biased." In experiments such as the science test review, the standards against which raters assign their scores are critical, since they affect the accuracy of the scores as measures of the relative value of alternative tests. Depending on their biases, reviewers may give a poor test relatively high ratings and a good test relatively low ratings, so that two tests that differ widely in their true value are judged, on the basis of average ratings, to be equally effective. Similarly, ratings of teacher performance based on classroom observation are likely to be strongly affected by the personal views of the observer regardless of the procedures established for the assessment. The steps outlined above will help to minimize biases due to misunderstandings on the part of panel members. They will also improve the interpretation of the ratings. It may be possible to design a questionnaire for potential panel members that would help ensure ratings free of personal preference or provide a basis for eliminating the ratings of particular individuals.

Coordination of Strategies for Collecting Data

In each of the chapters that follow, recommendations are made for data to be collected or observations to be carried out or both. Implementation of these recommendations will involve surveys and other data-collection strategies that should be coordinated. It is not the committee's intention that whole new data systems be set up to carry out its recommendations. Instead, several existing mechanisms currently undergoing review and reformulation should be used to implement the recommended data collections and analyses, including the redesigned elementary/secondary data collection of the Center for Education Statistics, the Assessment Center of the Council of Chief State School Officers, and the educational data improvement effort intended to lead to common data collection by the states. In Appendix E we discuss issues of coordination, pulling together recommendations from throughout the report that imply surveys, referring to ongoing efforts, and outlining suggestions for how desirable new survey efforts might be implemented. More intensive survey design planning, including issues of sample size, should be left to agencies (national, state, or local) that assume or are assigned responsibility for the indicators.