Suggested Citation:"2. Productivity." Institute of Medicine. 1989. Biomedical and Behavioral Research Scientists: Their Training and Supply: Volume III: Commissioned Papers. Washington, DC: The National Academies Press. doi: 10.17226/9915.


PRODUCTIVITY

Helen Hofer Gee*

INTRODUCTION

In 1986 NIH contracted with the Institute of Medicine to organize a conference on research training. A central, though not explicitly stated, purpose of the conference was to obtain guidance on how to continue to meet Congressionally mandated requirements for periodic reports on the role of and need for research training in the biomedical and behavioral sciences. Ostensibly, the conference was concerned with an examination of how successfully research training has been conducted, which program mechanisms produce the most suitable training, and what information is required to enable further assessments of national needs for researchers in the decade ahead.

[Listed first among three categories of issues requiring attention was] . . . measuring the productivity of scientists in their research programs and as reflections of their training. The issue in productivity is how to improve the measurement of it; simply gauging productivity by the current popular methods is inadequate for the task at hand. (Institute of Medicine, 1986)

Anyone who has ever been faced with the task of having to select among individuals--for employment, advancement, funding, awards--has dealt with the issue of assessing productivity and has, implicitly or explicitly, weighed available evidence of previous performance. The difficulty and complexity of these decisions may well underlie the malaise that is apparent in the committee report. The committee's more explicitly reported concerns with such measures as success in obtaining research grants, citation counts that ignore differences among and possibly within disciplines, and studies that fail to consider work environments suggest that the real problem lies not in the measures of productivity per se that have been used, but in how the measures have been used--that is, in the designs of assessments of training support programs.
* The opinions expressed in this paper are the author's and do not necessarily reflect those of either the Committee on Biomedical and Behavioral Research Personnel or the National Research Council.

Unfortunately for those who seek quick solutions, concepts relevant to the measurement of productivity are inextricable from those concerning almost all other domains within the social study of science. Dealing directly with the problems of productivity measurement therefore requires cognizance of the state of the entire science. Any study, for example, that ignores differences among or within disciplines ignores more than three decades of intensive study of the entire social structure of science, not just the study of "productivity" per se. In a critical, scholarly essay, Gilbert (1978) noted that

there is a reciprocal relationship between the theoretical framework which the social scientist brings to his work and the indicators which he will find most appropriate for his research . . . the adequacy of an indicator can only be assessed through a detailed study of the context in which the phenomena to be measured are embedded, and of the validity of the measurement theory on which it relies . . . this requirement is equivalent to the demand that we understand the functioning of the scientific community at a micro-level.

A small community of scientists (static in size in the United States since about 1980, but rapidly increasing in Western and Eastern Europe and Japan since the late 1970s) has been making significant progress in the direction Gilbert suggests (see Appendix and References). The most recent burst of research activity relevant to the assessment of productivity began when Martin and Irvine (1983) assessed basic research activity and programs (radio astronomy and physical sciences). Their papers specified "partial indicators" of scientific progress and investigated the extent to which these indicators "converge" to produce valid and reliable estimates of the productivity of designated groups of scientists. The work created a virtual storm of criticism, largely because it was so far-reaching (see Chubin, 1987).
The continuing discussion has instilled new vigor into the development of the field. The concept of multiple partial criteria was certainly not introduced by Irvine and Martin. Even Clark's study of the careers of psychologists (1957) incorporated the concept in a general sense. As noted by Jones et al. (1982), Weiss (1972) discussed them: At best each is a partial measure encompassing a fraction of the large concept . . . . Moreover, each measure [may contain] a load of irrelevant superfluities, "extra baggage" unrelated to the outcomes under study. By the use of a number of such measures, each contributing a different facet of information, we can limit the effect of irrelevancies

and develop a more rounded and truer picture of program outcomes. However, as Chubin (1987) concludes in his discussion of Irvine and Martin's work, it is . . . also politically astute, serving scholarly and policy communities. It explicitly anticipates criticism and sources of error, disarms skeptics, and gets an analytical foot in the right doors--those shielding the offices of policymakers who have come to rely on participant scientists and their own imprecise and self-serving devices for making decisions about who gets and who doesn't. Moravcsik (1986) hailed the extensive debate and critiques of the Martin and Irvine work as a welcome sign-- since it shows that the field has reached the state of maturity when its applications to concrete situations are sufficiently realistic to create a heated controversy, involving people from a variety of professional backgrounds. He commented further that neither critics nor Irvine and Martin, in their response to critics, offered any specific suggestions for improvement. Moravcsik then proposed that some suggestions can be made and conclusions drawn concerning the need for future activities by relating the debate to another effort in science assessment--namely, a project organized by the United Nations Center for Science and Technology for Development (UNCSTD) and centered on a paper that Moravcsik was commissioned to write. Moravcsik reported further that at a meeting held in Graz, Austria, in May 1984 the paper was discussed: The UNCSTD project did not result in a set recipe for assessing science and technology. On the contrary, the project concluded that there is no such universal recipe, and hence that the aim should be to devise a process which, in any particular case, yields a methodology for an assessment. 
SUGGESTED OUTLINE FOR PLANNING STUDIES OF PRODUCTIVITY OR QUALITY

The proposed UNCSTD process serves as a useful framework in which to present some thoughts about planning studies focusing on the assessment of productivity. The following list draws heavily and directly on Moravcsik's report:

1. Identify the goals of science that are to be taken into account. Moravcsik noted, . . . Science and technology have many different goals, aims, and justifications, and in any particular case it must be specified which one (or which ones) of these are taken into account, and with what weight.

Studies of National Institutes of Health research training programs have ostensibly aimed at assessing a common goal of such programs--to wit, the production of trained scientists who will contribute to the advancement of the biomedical sciences. Prior to the mid-1970s, this was interpreted by some Institutes as including support for the clinical training of physicians in areas where the supply of expertise was felt to be inadequate. After it became clear that the majority of these individuals simply entered private practice, however, those programs were for the most part discontinued. Such discrepancies must, therefore, be given careful attention in planning studies of program outcomes. Since the mid-1970s, NIH training programs in general have focused specifically and exclusively on research training. Assessments of the success of these research training programs have, however, interpreted the terms "contribute" and "advancement" quite narrowly. Teaching (either future researchers or practitioners), biomedical research administration, mentoring (i.e., guiding the graduate education of future researchers), and conducting research that does not seek external funding and research that cannot (because of the interests or concerns of the power structure within which it is conducted) be published have often been denied recognition as goal-relevant behavior. Consideration should be given to whether any of these professional activities should be explicitly recognized as contributing to the advancement of the biomedical sciences and, if so, studies should be designed to assess these kinds of productive endeavor.
2. Recognize the multidimensionality of goals, of potential pathways to them, and of methods of measuring outcomes: specify which dimensions and connections of the system are to be taken into account. Once goals have been specified (and it is recognized that achieving those goals can and is likely to be expressed in different ways), study designs must allow for the varieties of pathways and outcomes that may occur. Cole and Cole (1973) set the stage for this type of inquiry in their cross-sectional analyses. The work of Long, McGinnis, and Allison (1979, 1981, 1982) examined many of the same "connections" as the Coles but, by following a cohort longitudinally, revealed a different sequence of career development. The Long and McGinnis work has been particularly notable in its pursuit of the significance of context, the multidimensionality of career pathways, and the

changing significance of predictors in assessing productivity at stages in research career development. In another notable analysis of the NIH Research Career Development Program, Carter et al. (1987) examined both selection processes and outcomes using multivariate techniques to assess the significance of correlates and causal relations, as well as a sophisticated cohort selection procedure to control for disciplinary differences.

3. If, as is usually the case, it is not feasible to study all aspects of a system, specify which aspects are to be included and which will be omitted and indicate clearly the implications of these decisions for the assessment process. Moravcsik provided an apposite illustration of one perspective: If, of two cars, one has a higher top speed, and the other a lower gasoline consumption per mile, it is not possible to say which is the 'better' car without ascribing some value judgment to high speeds versus economy in the use of fuel. Two other examples come to mind: (1) if in planning a study of the effectiveness of a training program, it was decided that pursuit of a research career in the private sector was a favorable outcome but that assessing the performance of former trainees who followed that path was not feasible, they could be explicitly excluded from potential comparison cohorts; (2) if research administration is deemed a favorable outcome, those research administrators could be excluded from comparisons in which research publications were used as indicators and included where other measures of productivity, more suitable to their employment, were used. The guideline simply demands precise specification of the details of the design of an investigation.

4. Specify how the results of the assessment are to be used. A study intended to assist program managers in their decisionmaking will seldom have the same design requirements as a study intended to inform policy decisions.
If policy decisionmakers are to be informed, for example, the delineation of possible alternative indicators of productivity may be critical, whereas meeting program management needs may require more intensive analysis of only those that are the most direct manifestations of program goals. The key is to consider carefully the kinds of decisions that the study is intended to influence.

5. Select a set of indicators that will satisfy the requirements of each of the study design considerations. Recognize and specify the limitations of each of the indicators. To quote Moravcsik, There are many types of indicators: input versus output; quantitative versus qualitative; indicators of activity, productivity, or progress; indicators of quality, importance or impact; there are functional and instrumental indicators; there are micro- and macro-indicators; there are "data-based" and "perceptual" indicators; and so on. Some indicators are already 'on the shelf' and can be taken from it and used in new situations. More likely, however, the most appropriate indicators for a new situation need to be improvised for that particular situation. . . . Be reconciled to the fact that in any case, you will end up with a set of indicator measurements which, in general, cannot be reduced to a one-dimensional measure and hence to an unambiguous ranking.

It is apparent that the selection and/or development of indicators of productivity depend on the kinds of questions that are being asked and the perceived complexity of the system involved. An indicator that provides excellent explanatory data for one study may be useless in another context. Every measure, moreover, has limitations that may, under some conditions, obviate its utility and, in other circumstances, may be totally irrelevant. If a study plan is suitably mapped, it may not be feasible to use the same indicators of productivity for all individuals in a cohort. For example, if teaching undergraduate students is judged to be an acceptable outcome of research training, the productivity of an individual whose primary activity is teaching will not be appropriately assessed by counting that individual's production of research papers--but consideration might be given to using the production of review papers as one of several measures of performance in the educational domain.
However, for some outcomes regarded as suitable expressions of the goals of an enterprise, no suitable approach to assessment "measurement" is available to evaluators. In such cases the individuals should be removed from comparison groups that are to be analyzed statistically rather than, as is often the case, counted as "failures" according to indicators that appropriately measure the productivity of other members of the group.

MEASURES OF PRODUCTIVITY

The above overview should make it clear that any discussion of specific measures of productivity is necessarily superficial, simplistic, and incomplete because outside the context of the

design of a specific study, there is not a great deal to be said about any particular measure. In addition, since productivity in one sense or another is the focus of most of the studies of the social science of science, a thorough literature review would require a few years of effort. Nonetheless, various measures that might be used in studies of productivity are discussed below. The discussion is intended to draw attention to complexities, issues, and problems in the use of these measures, knowledge of which might aid in carrying out the kind of careful approach to study design outlined earlier.

Publication Counts

It is generally agreed that the principal, or most prevalent, immediate outcome of the active research investigator's efforts is the preparation of papers published in professional journals, and by 1982 a bibliography of nearly 2,600 items on publications analysis was available (Hjerppe, 1982). Nevertheless, counts of publications continue to be derogated. Because the analysis of publications plays a dominant role in social studies of science, a complex, highly sophisticated methodology has been developed. The intellectual leader of modern-day social studies of science was Derek J. de Solla Price (1961, 1963). The early development of computer-based analytic methods, which have stimulated much of the sophisticated analysis characteristic of social studies of science over the past two decades, resulted largely from the enterprise of two individuals: Eugene Garfield (1955) developed the Science Citation Index (SCI), on which most publications analysis work is dependent; Francis Narin and his associates at Computer Horizons took the lead in exploring and developing measures to maximize the utility of the wealth of information contained in the SCI. In 1969 (Narin, 1977) the area even acquired its own label--bibliometrics--to describe collectively quantitative, analytical studies of written communication.
In simplest terms, publication counts are no longer acceptable as a measure of productivity unless at least the following potential sources of error or misinterpretation are controlled or accounted for:

o differences among disciplines of cohort members,
o differences among journals in terms of measured influence (see the section on journals and subfields),
o differences in "quality" or "impact" as measured by citations or peer assessment (or journal influence),
o professional age of cohort members, and
o social context of cohort members.

Despite concerns about "loud noises from empty vessels," publication counts have been shown repeatedly to correlate positively with assessments of quality and to contribute useful independent variance to the assessment of productivity. Reported correlations between quantity and quality measures vary considerably among studies, between approximately r = .23 and r = .80; differences may relate to disciplines, characteristics of cohorts, or even to how quantity and quality of publications are measured. In a series of studies conducted in the late 1970s (see Narin, 1983), numbers of publications by faculty and staff in universities and hospitals were shown to be extremely highly correlated with NIH funding (r = .90 to .95); and there were no economies or diseconomies of scale in the funding of research grants.

Funding and publication relationships may appear to break down, however, when small aggregates of researchers or disciplines are assessed and especially when basic and clinical research publications are intermixed. Publication rates of basic scientists differ markedly from those of clinical scientists, who publish less frequently and whose research is usually very much more costly. When the funding and publication rates of small aggregates of subjects are investigated, the tendency is to ignore such disciplinary differences, thus ignoring an important moderating variable. With small aggregates other minor sources of error--such as idiosyncratic events that may affect the usual patterns of behavior of part of a group for a period of time--may also obscure an underlying relationship. When large aggregates and adequate time spans are employed, such obfuscating sources of error will usually cancel each other out, permitting stable, underlying relations to be revealed.
When a quick, inexpensive estimate of productivity is needed, large quantities of data are available, and the comparability of cohorts is established, a simple count of publications may well provide adequate information. Ordinarily, however, such a single measure is useful primarily as a means of setting the stage for a more comprehensive investigation of some aspect of science or scientific behavior.

Weighted Counts: The use of weighted counts of papers permits obtaining a preliminary estimate of quality without waiting for citations to become available; it is also an inexpensive means of obtaining an estimate of quality for large numbers of papers.

Each paper is weighted by an influence weight assigned to the journal in which the paper appears.1

Paper Counts in the "Best" Journals: Committees charged with evaluating group or individual scientific performance will sometimes request that publications be counted only in a selection of the "best" journals. Such a practice would be seriously inequitable, since scientists do not have equal access to journals. For example, those located in smaller institutions are more often published in less influential journals, as are younger, less well-established investigators; and regional differences abound in some disciplines. McAllister and Narin (1983) investigated these relationships in the publications of all U.S. medical schools, using average citation influence per paper measures: the average citation influence per paper increased with the total number of biomedical publications, even when institutional control (public and private), region, and areas of research emphasis were controlled. The positive relation between number of papers and citation influence was shown to hold within disciplines (biochemistry and internal medicine were analyzed in detail) and within research "level" (i.e., along basic and clinical research dimensions).

Data Bases: NIH-supported studies that have involved counts of published scientific papers have almost always depended on computerized data bases derived by CHI from Medline and the SCI. The source data bases require a great deal of preliminary massaging to consolidate information and correct inconsistencies; but once prepared, they make data available unobtrusively, make accessible several different quantitative measures of publication performance, avoid the increasingly restrictive problem of securing clearance from the Office of Management and Budget (involved, in studies of federal programs, in any attempt to go directly to the scientific community for information), and are more accurate than individual reports.
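The weighted counts described earlier in this section can be illustrated with a minimal sketch. It assumes that a table of journal influence weights is already available; the journal names, the weight values, and the function name `weighted_paper_count` are hypothetical, and the default weight given to unlisted journals is an assumption made for illustration rather than part of the CHI technique.

```python
# Hypothetical journal influence weights (illustrative values only,
# not taken from any published table).
JOURNAL_WEIGHTS = {"Journal A": 2.0, "Journal B": 0.5}


def weighted_paper_count(journals, weights, default=1.0):
    """Sum the influence weight of the journal of each paper.

    journals: one journal name per paper.
    weights:  mapping from journal name to influence weight.
    default:  weight used for journals missing from the table
              (an assumption made for this sketch).
    """
    return sum(weights.get(name, default) for name in journals)


# One cohort member with three papers.
papers = ["Journal A", "Journal B", "Journal B"]
simple_count = len(papers)
weighted_count = weighted_paper_count(papers, JOURNAL_WEIGHTS)
print(simple_count, weighted_count)
```

The simple count treats the three papers alike, whereas the weighted count lets a paper in a more influential journal contribute more, which is what allows a preliminary estimate of quality before citations accumulate.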
1 The technique developed by Computer Horizons, Inc. (CHI), determines journal influence weights by the weighted number of citations each journal receives over a given period of time. See F. Narin, Evaluative Bibliometrics: The Use of Publication and Citation Analysis in the Evaluation of Scientific Activity (Report to the National Science Foundation), 1976; and F. Narin, G. Pinski, and H. H. Gee, "Structure of the Biomedical Literature," Journal of the American Society for Information Science 27:25-45, 1976.

An interesting departure from the use exclusively of the comprehensive data base was reported by V. L. Simeon et al. (1986), who had studied a large research institution in Yugoslavia. In their study several forms of publication and communication were employed in addition to SCI journals (e.g., papers in other scientific journals and congress proceedings, books and monographs, technical articles in

encyclopedias and popularizations, and presentations at scientific meetings). A multivariate analysis revealed interesting patterns of change among the several variables over time. This rather preliminary study, which was focused on change in publication behavior following the introduction of minimal criteria for promotion, warranted no conclusions; but it suggested to this writer the possibility that some measures of these types might be useful in considering criteria suitable for assessing the productivity of individuals whose careers, though academic, are not directly focused on the production of original research.

Activity Indexes: In recent years the utility of a new approach to using publication counts, the "activity index," has been demonstrated, particularly in studies conducted by CHI for NIH. Activity indexes are ratios that make use of publication counts in a relational context, thus allowing comparisons to be made among groups while allowing each group to be described within its own context.2 Describing NIH Institutes' relative investment in the support of research in different disciplines is a case in point (see Gee and Narin, 1986). Journal papers are more readily and accurately assigned to disciplines than are dollars, and a ratio that describes an Institute's investment in a discipline relative to both the Institute's investment in all disciplines and the "size" of the discipline among all others in a data set provides a great deal of information for comparison among both disciplines and Institutes. Schubert and Braun (1986) suggest several additional types of indexes that might be useful for different purposes.

CITATIONS

Ever since Clark's study of psychologists (1957), citation counts have been a favored measure for the assessment of productivity. In most cases, citations alone or in combination with publication counts are more closely correlated with subjective estimates of productivity than are any other measures.
They are more universally applicable to the assessment of scientific research activity than are other measures because (1) publication is the most accessible means of expression available to all scientists, and (2) being published offers a broader audience to the scientist than any other medium.

2 The percent of an organization's papers that are published in a given discipline is divided by the percent of all papers in the data set that represent that discipline. An index of "1.0" indicates that the level of publication activity of this group in this discipline is consonant with the level that discipline represents among all disciplines.
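The activity index defined in footnote 2 can be written out as a short computation. The sketch below follows the footnote's definition directly; the function name `activity_index` and all of the paper counts are hypothetical and serve only to show the arithmetic.

```python
def activity_index(org_counts, all_counts, discipline):
    """Activity index of an organization in one discipline.

    org_counts: papers per discipline for the organization.
    all_counts: papers per discipline for the whole data set.
    Returns the organization's share of papers in the discipline
    divided by the discipline's share of all papers in the data set.
    """
    org_share = org_counts[discipline] / sum(org_counts.values())
    field_share = all_counts[discipline] / sum(all_counts.values())
    return org_share / field_share


# Hypothetical counts: half of the organization's papers are in
# biochemistry, while biochemistry is a quarter of the data set.
org = {"biochemistry": 50, "physiology": 50}
everyone = {"biochemistry": 250, "physiology": 750}

print(activity_index(org, everyone, "biochemistry"))
```

In this illustration the organization publishes in biochemistry at twice the rate that discipline represents in the data set, so its index is above 1.0; its index for physiology is correspondingly below 1.0.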

Rather than referring to citations as measures of "quality," as was common in the 1970s, the current practice is to refer to them as measures of "impact" or "utilization" or "influence." The implication is that before citations can be referred to as measures of the "quality" of research, the issue should be investigated in the given context of definition.

From an entirely different perspective, Moravcsik (1986), Chubin (1987), Cronin (1984), Vinkler (1987), and others have discussed and/or analyzed the functions and meaning of citations in terms of author motivation. Vinkler, whose contribution is most recent, has provided a concise review of the literature concerning definitions, classification, and roles that citations play in the scientific literature, concluding (in concert with Cronin) that the information carrier role is the most important. Vinkler distinguishes between "professional" (work is based on the cited work or uses part of it) and "connectional" (e.g., desire of an author to establish a connection with the cited author or work) reasons for citation. In Vinkler's study, a group of productive investigators rated each of the references they had listed in a selected recent paper, identifying which of eight professional and/or nine connectional reasons had motivated the decision to cite, and the strength of the motive. Most (81 percent) citations were made solely for professional reasons--that is, in a literature review for "completeness" or because the current work was based at least in part on the cited work, the cited work confirms or supports the work in the citing paper, or the cited work is criticized or refuted (at one of three levels). Citations made partially for professional and partially for connectional reasons accounted for 17 percent; only two percent were made solely for connectional reasons. It was also found that two to three times as many papers are reviewed as are actually cited.

Failure to cite was also investigated; the principal reason found was that a work was not considered important enough to the current effort to warrant citation. Second most important was the "obliteration" phenomenon--the origin so well known that citation was not needed. A citation threshold model has been developed, and data confirm that the threshold depends primarily on the professional relevance of the work potentially citable in a given paper.

Narin (1976) considered citations as an assessment mechanism:

Citation counts may be used directly as a measure of the utilization or influence of a single publication or of all the publications of an individual, a grant, contract, department, university, funding agency or country. Citation counts may be used to link individuals, institutions, and programs, since they show how one publication relates to another. . . In addition to these evaluative uses, citations also have

important bibliometric uses, since the references from one paper to another define the structure of the scientific literature.

Narin has presented both literature reviews and extensive analytical evidence that substantiate the utility of citations in the assessment of productivity and provide a most valuable guide to technical aspects of their use (e.g., with respect to the consideration of such issues as time, differences among fields and disciplines, and the use of indexes to make possible cohort comparisons). In 1982 Narin and McAllister prepared for NSF a complete set of counts of all U.S. papers listed in the SCI for the years 1973 through 1976 and distributions of the citation counts received in the first through the fourth years after publication (mean, median, and mode) by each of the 106 subfields listed in NSF's Science Indicators. The average (mean) number of citations received in the first four years by papers in biomedical subfields ranged from 1.8 for "Miscellaneous Clinical Medicine" papers to 15 for "Biochemistry and Molecular Biology" and "General Biomedical Research" (publications in journals such as Nature, Science, Federation Proceedings). The data illustrate dramatically the importance of taking disciplinary differences into account.

As with publication counts, the day has long since passed that simple citation counts would be regarded as acceptable measures of performance. Even average numbers of citations per paper are useful measures only when all of the precautions cited for paper counts are observed--that is, controls are exercised for sources of difference, such as discipline and time (both publication date and citation count period), that are not part of the question addressed.
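One way to exercise such controls can be sketched as follows. The example assumes that each paper is compared only with papers from the same subfield and publication year, counted over the same citation period, and that the comparison is expressed as a standard (z) score; this choice of standardization, the function name `standardized_citations`, and all of the data are assumptions made for illustration, not a description of any particular study's procedure.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardized_citations(papers):
    """Standardize citation counts within (subfield, year) groups.

    papers: list of (paper_id, subfield, year, citations) tuples,
            with citations counted over the same period for all papers.
    Returns a dict mapping paper_id to a z-score relative to the
    papers of the same subfield and publication year.
    """
    groups = defaultdict(list)
    for _, subfield, year, cites in papers:
        groups[(subfield, year)].append(cites)

    scores = {}
    for paper_id, subfield, year, cites in papers:
        reference = groups[(subfield, year)]
        mu = mean(reference)
        sigma = pstdev(reference)
        # A group with no spread carries no information about rank.
        scores[paper_id] = (cites - mu) / sigma if sigma > 0 else 0.0
    return scores


# Hypothetical data: raw counts differ greatly between subfields.
data = [
    ("b1", "biochemistry", 1973, 10),
    ("b2", "biochemistry", 1973, 20),
    ("c1", "clinical medicine", 1973, 1),
    ("c2", "clinical medicine", 1973, 3),
]
scores = standardized_citations(data)
print(scores["b2"], scores["c2"])
```

Paper "b2" has far more citations than paper "c2" in raw terms, yet each stands in the same position relative to its own subfield, which is the kind of comparison that raw counts or unadjusted averages cannot support.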
A longitudinal study of NIH-supported research 1973-1980 (Gee and Narin, 1986) employed citation counts per paper that had been scaled in relation to time and transformed into a standardized score in relation to all papers in a given research level, field and subfield. In a more recent study, Gee (1988) employed counts of papers in which each paper was weighted by the "influence weight" of the journal in which it appeared. For early examples of carefully considered treatment of citation data, see Carter's 1974 and 1978 reports on the NIH Peer Review system and on comparison of program project with individual (R01) grants.

Highly Cited Papers: The percentage of papers produced by a cohort whose members' papers are among the most highly cited 1, 5, or 10 percent of all papers in a discipline or specialty has particular appeal as an assessment of quality. With a computer the data are easily obtained, and the technique admits the inclusion of an outstanding paper that has been published in a less touted journal, is not directly influenced by author

institution, is free of the possible biases of peer ratings 7 and provides an appropriate means of comparing across many different group dimensions. It is disadvantageous only for the cohort so large that it virtually defines the distribution of citations received by a group of journals. (For example, papers supported by the National Cancer Institute and the citations that they receive virtually define the distribution of citations to "cancer" journals). Comparisons between citation averages and percent of papers among the most highly cited 10 percent offer added insight into the distribution of performance within individual cohorts. Although the overwhelming maj ority of studies that have compared subjective ratings with citation counts have yielded strong positive correlations, not all have. Where the correspondence is weak, the data often serve to reveal characteristics of the peer judgments rather than indicating deficiency in the citation evidence (see, for example, Anderson et al., 1978~. JOURNALS AND SUBFTELDS Some of the dimensions of difference among journals and subfields are described in Narin (1976~: using the 1973 SCI, tables were developed showing the distributions of differences in types of publications (articles, notes, letters, etc.), numbers of journals, publications, references, citations, and ratios of these counts. Dimensions of differences drawn from these tables include the following: o Discipline (or subfield differences in numbers of references, citations and publications: In the 1973 SCI (articles, notes and reviews), citations per paper ranged from 2.4 for operations research and management science to 36.2 for physiology. In general, there are distinctive differences among fields and subfields: for example, mathematical scientists use few references, receive small numbers of citations, and produce relatively few papers while many basic biomedical scientists publish frequently and receive large numbers of citations. 
o Variation in rates of growth and time distributions of citations: Rapidly growing subfields of science have higher fractions of references to recent papers than do slow-growing subfields. Rapidly growing subfields also tend to receive their modal numbers of citations earlier than do other subfields (e.g., in the second, instead of third, year after publication).
o Concentration and relative citedness: The concentration of publications and relative citedness varies both within and between fields of science. Some fields are characterized by relatively few, very large and influential journals while the literature of others is widely dispersed.

This kind of information was not widely discussed in the open literature until Moed et al. (1985) at the University of Leiden published their investigations that demonstrated the serious impact, especially on analyses or assessments within universities, of neglecting these characteristic differences among disciplines. The Moed study also revealed that operating on incomplete bibliometric data can have a serious impact on the outcome of an assessment.

The characteristic differences among fields, subfields, and specialties of science have led to the development of a growing number of indexes that are aimed at making comparisons between cohorts justifiable. The most widely used are impact factors, influence measures, and relative citation rate and publication impact:

Impact Factors: The Institute for Scientific Information, publisher of the Science Citation Index, also publishes Journal Impact Factors, based on a 2-year accumulation of citations received divided by the number of papers published in the target year. These measures, while correcting for journal size, do not correct for characteristic differences in referencing and citation practice and, therefore, reflect different dimensions of citation behavior in different disciplines. Noma (1986) states:

. . . there is no normalization for the different referencing characteristics of different segments of the literature: a citation received by a biochemistry journal, in a field noted for its large numbers of references and short citation times, may be quite different in value from a citation in astronomy, where the overall citation density is much lower and the citation time lag much longer.
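The arithmetic behind an impact factor can be sketched as follows. The sketch assumes the conventional ISI reading, in which citations received in a given year to a journal's papers from the two preceding years are divided by the number of those papers; the function name and the journal figures are hypothetical.

```python
def impact_factor(citations_to_prior_two_years, papers_in_prior_two_years):
    """Citations received this year to the journal's papers from the two
    preceding years, divided by the number of those papers.

    The two-year window follows the conventional ISI definition, which is
    an assumption of this sketch.
    """
    if papers_in_prior_two_years <= 0:
        raise ValueError("journal published no papers in the two-year window")
    return citations_to_prior_two_years / papers_in_prior_two_years


# Hypothetical journals in fields with very different citation densities.
biochemistry_journal = impact_factor(3000, 600)
astronomy_journal = impact_factor(450, 300)
```

The division corrects for journal size and nothing else. Whether the gap between the two hypothetical journals reflects merit or merely the referencing habits of their fields cannot be read from the ratio itself, which is the point of Noma's objection.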
In addition, journals that publish longer papers, such as review journals, tend to have the highest impact factors. Schubert and Glanzel (1983) have developed a method of estimating the reliability of mean citation rates per publication, computing a standard error based on the relative frequency of zero citations. They exemplify its use by computing a sample set of "corrected" ISI impact factors. They are, however, still left with the lack of comparability in other aspects of differences among disciplines.

Influence Measures: Pinski (1975), Anderson et al. (1978), and Noma (1986) developed three citation-based journal influence measures that offer greater breadth and precision of measurement than does the impact factor for capturing the information resident in the journals in which research is published. The three measures are:

o Influence weight, which is a size-independent measure of the weighted number of citations a journal received from other journals, normalized by the referencing practices of the field;
o Influence per paper, which is essentially the weighted number of times an average paper in a journal is cited, where the weight is based on the influence of the citing journal; and
o Total influence (of a journal), which is the influence per publication times the total number of papers published over a given period of time.

Of these three measures, the influence per paper measure has proved most useful in studies of group publication performance. Total influence scores are, however, the scores most highly correlated with subjective judgmental ratings of university program quality (see Anderson et al., 1978).

The influence measures offer a clear advantage over impact factors: the measures are determined within each of the fields of science, thus correcting for differences in citation practices and providing comparability across fields of science. In addition, citations from prestigious journals are weighted more than those from peripheral journals, thus introducing a quality concept, and the three different measures provide information from three different perspectives.

Despite their superiority, influence measures have not been widely adopted. The iterative matrix manipulations involved are costly and cumbersome; as a result, it is not economical to revise them frequently, and only one revision--using 1982 publications--has been made of the original measures based on 1973 publications (see Noma, 1986). The number of journals in the 1982 set increased to over 3,000 from the 2,300 on which the 1973 measures were based. Changes in computational techniques were also made, so average measures are not directly comparable.
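The iterative character of the influence measures can be conveyed by a small sketch. It assumes one common formulation, in which a journal's weight is the weighted sum of the citations it receives divided by the number of references it gives, solved by repeated substitution. The function names, the choice to keep journal self-citations, and the three-journal matrix are assumptions made for illustration, not the published procedure.

```python
def influence_weights(cites, iterations=200):
    """Estimate size-independent influence weights by repeated substitution.

    cites[k][i] is the number of citations from journal k to journal i.
    Assumed formulation: W_i = sum_k(W_k * cites[k][i]) / refs_given_i.
    """
    n = len(cites)
    refs_given = [sum(row) for row in cites]
    if any(total == 0 for total in refs_given):
        raise ValueError("every journal must give at least one reference")

    weights = [1.0] * n
    for _ in range(iterations):
        updated = [
            sum(weights[k] * cites[k][i] for k in range(n)) / refs_given[i]
            for i in range(n)
        ]
        # Rescale so the reference-weighted mean weight stays at 1.
        scale = sum(refs_given) / sum(r * w for r, w in zip(refs_given, updated))
        weights = [w * scale for w in updated]
    return weights


def influence_per_paper(cites, weights, papers):
    """Weighted citations received by each journal, per paper published."""
    n = len(cites)
    return [
        sum(weights[k] * cites[k][i] for k in range(n)) / papers[i]
        for i in range(n)
    ]


# Hypothetical three-journal literature: journal 0 is cited far more than it cites.
citation_matrix = [
    [10, 5, 5],
    [20, 10, 5],
    [20, 5, 10],
]
weights = influence_weights(citation_matrix)
per_paper = influence_per_paper(citation_matrix, weights, papers=[100, 200, 200])
```

Total influence, as defined above, would then be the per-paper figure multiplied by the number of papers. With thousands of journals the same substitution runs over a very large matrix, which suggests why frequent revision was considered uneconomical.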
Even so, Narin (1985) reports that most correlations within fields ranged between r = .85 and r = .95; highly influential journals retained their high influence, and journals with relatively low influence ratings (e.g., r = .4' to .60), tended to drift within that range. Depending, of course, on the aims of a study, it would appear that in many cases comparability of measures among disciplines should take precedence over recency of citation count.

Relative Citation Rate and Publication Impact: Schubert et al. (1983, 1986) and Vinkler (1986) have proposed two new "relative" citation indicators based on Garfield's SCI journal impact factors. Schubert's "relative citation rate" (RCR) compares actual citation counts with an expected citation rate based on the impact factors of the journals in which a set of papers has appeared, thus eliminating characteristic differences in the publication and citation practices among disciplines. Schubert and Braun (1986) and Schubert, Glanzel, and Braun (1983) have used these indicators in developing relational charts on which the relative merit of nations' publications are compared. Vinkler developed a "relative publication impact" (P~) measure--which includes with a relative citedness indicator, numbers of publications and a "cooperation" measure based on coauthorships--to arrive at an index that includes both quantity and quality in its ratings (of departments, in this instance). Vinkler reports "good agreement" with subjective peer evaluations that are carried out regularly. None of these measures have yet been subject to critical review, nor have they been used outside the context in which they were developed. Further exploration does appear to be warranted.

PEER ASSESSMENT

Peer assessment, as the concept was developed by NIH to overcome the potential biases of individuals in decisionmaking situations, is now in the anomalous circumstance of being defended against intrusion by some biased individuals who oppose statistically based measurement techniques while at the same time being attacked by apparently politically motivated individuals who accuse peer assessment in the management of science of the very biases it was designed to overcome. Compounding the anomaly is the fact that the defenders fail or refuse to recognize that the quantitative analysis "intrusion" that they reject can and has provided, by far, the most convincing evidence available to prove the case for peer review. Longitudinal studies of NIH-supported publications (see Gee and Narin, 1986), for example, are unequivocal in their demonstration of the effectiveness of the NIH dual peer-review system of determining which grants shall be funded.
It is also something of an anomaly that we treat peer assessment as one among several different types of criteria that might be used to assess productivity, when in fact, almost all likely criteria are, at bottom, different representations of peer judgment. For the most part the different measures represent collections of judgments that are separated in time, in focus, and in method of combination. When a peer group is assembled to assess the productivity of an individual, a group, or a program, the outcome is based on the combined perceptions of any assortment of behaviors or perspectives its individual members may implicitly or explicitly agree upon at the time the judgment is delivered. Other measures, such as those derived from counts of citations to an accumulation of publications, represent a long series of separate peer and peer-group assessments ranging from acceptance to college and graduate schools, through awards of degrees, positions, funding, selection for publication and decisions to cite. The difference is that each of these assessments has been specifically focused on a related concept of merit, and the outcome statistics derive from the judgment of many more and more disparate peers.

There is no question of which is the "better" type of measure; they represent different perspectives--sometimes only slightly and sometimes widely different. The quality and utility of either depends on the care taken to secure accuracy and to eliminate inappropriate or unfair considerations from the outcome and on the context in which the results are used. The significant point is not the superiority of one or the other measure but, rather, the extent to which either or both illuminate the questions at hand.

Those who are concerned with achieving a better understanding of how science functions seldom argue the point that peer review groups that can take into consideration not only an immediate product or situation, but also any extenuating circumstances that might alter the significance of any given piece of information are, as yet, better equipped to take into account all relevant factors in making a judgment about, for example, an individual grant application than any likely collection of statistics. (This is not to say that there do not exist experimental techniques that might well improve peer judgment procedures. It ignores also the issues of personal and group bias, which may seriously distort judgment).
When very large groups of individuals or products--such as programs or large sets of publications or whole journals--are at issue, however, and combinations of statistical measures correspond imperfectly with peer judgments, it may now, after three decades of very active research, be prudent to conclude that there are two different kinds of evidence, each of which potentially offers useful and valuable information, and each of which should figure in the assessment process (assuming, of course, equally careful data gathering and handling).

Citations and Peer Assessment: Most studies that have involved both peer judgment and bibliometrics have been aimed at validating the utility of the bibliometric measures. The wellspring Clark study (1957), however, which was conducted before the analysis of publications became an object of social scientific interest, simply noted that the multivariate combination of journal citations and offices held in a professional association together accounted for nearly 64 percent of the variance in numbers of votes received when active investigators were asked to identify significant contributors to the field. The single most important predictor was citations, which correlated r = .67 with the number of votes.
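As a rough worked reading of these two figures, and assuming an ordinary linear correlation, the share of variance associated with citations alone is the square of r, and the remainder of the combined figure is the increment from adding offices held:

```latex
r = 0.67 \;\Rightarrow\; r^{2} = 0.67^{2} \approx 0.45,
\qquad
R^{2} \approx 0.64,
\qquad
R^{2} - r^{2} \approx 0.64 - 0.45 = 0.19
```

On this reading citations alone correspond to about 45 percent of the variance in votes, and the second predictor adds roughly 19 percentage points.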

Carter's report (1974) was the first serious analytical investigation of the NIH research grant peer review process. Publication and citation data, dating from the late 1960s, were inadequate by present-day standards, but the kinds of considerations Carter brought to the problem still warrant the attention of anyone proposing to investigate these kinds of relations. Carter found a low correlation (r = .40) between initial and renewal priority scores, and that

. . . at least for grant applications from most of the larger basic science and clinical departments of medical schools, the judgments of the peer review process are significantly related to an objective measure of research output derived from citations to articles describing the results of the grant.

Two reports issued by the National Academy of Sciences in 1982 represent nearly opposite perspectives on the use of peer judgment and objective measures in assessment. One, in which opinions but no data or objective evidence of any kind were presented, concluded, essentially, that peer review was the only mechanism needed to assess quality or productivity in scientific research (COSEPUP, 1982). The other, in sharply contrasting peer group performance, applied 16 "measures"--4 based on peer ratings and 12 on records of program composition, support, and faculty publication performance--to 32 disciplines in 200 doctoral degree-granting institutions.3 In the four biological science areas of primary concern to biomedical research, total journal influence ratings of faculty publications accounted for 50-70 percent of the variation in subjective judgmental ratings of faculty scholarly quality and 40-60 percent of program educational effectiveness. Notably, no attempt was made to combine the different types of information; rather, each of the items was reported for each institution.
3 The Committee on an Assessment of Quality-Related Characteristics of Research-Doctorate Programs in the United States began its work in response to growing criticism within the academic and educational communities of existing subjective ratings of graduate programs in the United States. The reports of this study serve as models of planning and reporting in the development and application of program assessment methods. Evidence is presented at every stage of development of the study to support decisions and analytic methods.

A recent analysis by Lawani and Bayer (1986) of relations between peer and bibliometric assessment of quality is of interest because it compared peer and bibliometric assessments of cancer research papers. Papers abstracted in the Yearbook of Cancer, a selection made by large numbers of "peers," were classified as of high quality and compared with (1) papers listed but not abstracted and (2) a restricted random sample of papers listed in Biological Abstracts. While citation frequency increased significantly with peer rating, there were discrepancies in the distributions. Of the most highly cited 100 papers, 14 were from the random set; also, 16.8 percent of the highly rated papers received 4 or fewer citations in the five years following publication (2.3 percent received none). Whether some of these will turn out to be "late bloomers" or will represent poor or biased choices for inclusion in the yearbook abstracts is not known. Lawani also found that quantity and quality were highly correlated and that the larger the number of coauthors, the larger the proportion of papers included in the yearbook. Also, self-citations relative to total cites declined with increased quality but did not affect the level of agreement between peer assessment and citation count.

Porter, Chubin and Xiao-Yin-Jin (1986) recently compared Sloan Fellows' "most cited" papers with their own selection of which of their papers were their "best." For 1974 fellows, 35 percent of papers they perceived as "best" were also most highly cited; this percentage rose to 42 for 1984 fellows. Eighty-nine percent of papers were coauthored, but Fellows were much more likely than citations were to select as best those on which they were first author. The letters that had recommended the Fellows for appointment, and the citations to their publications tended to emphasize methodological contributions; fellows themselves tended to identify their theoretical and empirical papers as their best.

Research evidence suggests that, when possible, both peer and bibliometric data should be made available for consideration. If only one measure can be obtained, in any study involving large numbers of subjects and publications (e.g., >100 papers per subgroup), the investigator would be at least as well served with publication data as with peer judgments.
GRANTS AND GRANT APPLICATIONS

NIH and the Veterans' Administration are probably the only two federally supported agencies that maintain data bases suitable for the analyses of scientists' grant application and award behavior. It has been possible in recent years to obtain some information about individuals to whom the National Science Foundation and several private foundations have awarded funds, but no information about applications has been available. The availability of accurate and complete information about grant applications and awards makes possible the investigation of many management and policy issues. Clearly also, the receipt of an award is itself an indicator of achievement, especially in recent years when only small percentages of grants are actually awarded.

The availability of longitudinal data on those who have become involved with NIH programs is unique. As sources of information about NIH programs and policies, the grant information, in conjunction with the publication data available in NIH data bases, constitutes a treasury of resources for the social and cognitive study of science that is probably unmatched elsewhere in the world. It is therefore extremely unfortunate that little or no opportunity exists for exploiting these resources in the interest of further developing the theory and methodology needed to advance our understanding of how science functions.

One of the least valuable ends to which grant information can be put is as a single measure of the effectiveness of training support programs. The positive information that is yielded is, of course, directly informative, but when applications and approvals are compared with awards (with no control for disciplines or for type or location of appointment), the results can be misleading. Failure to apply for grants has been interpreted as a negative outcome, whereas non-application may be totally irrelevant to the productivity of both individuals who are pursuing research and those who are performing other services to science that can be regarded as successful. Some creative new attempts may be in order to design studies that will permit alternate patterns of successful outcome for such individuals as those engaged in research that is otherwise supported, as well as those whose administrative or teaching responsibilities preclude their applying for grants.

This is not to say that grant applications and awards are not an important source of information about, for example, the success of training programs. It only cautions against their exclusive use, without consideration of such limitations as disciplinary differences and the availability of funding.
ACADEMIC RANK, RATE OF ADVANCEMENT, SALARY

These three measures are possible alternative measures of one aspect of advancement or productivity for individuals in different settings or with different types of appointments. Each must usually be qualified in terms of years since doctorate or since completion of training, and it should be possible to scale each of them so that comparability among individual members of a cohort would be achievable. Data about rank and rate of advancement are generally more readily available than salary information: private universities often refuse to release salary information, but will yield rank; industry, on the other hand, which often balks at releasing information about its investment in research and development, is usually less unwilling to report the salary of an individual employee. Problems with these measures are likely to be related to institutional size, policy, and prestige, and caution is obviously needed to assure comparability. Even when all of these are taken into account, it is possible that constraints on salaries and promotions in different institutions may be such that the measures would be of marginal value.

HONORS AND AWARDS

It is intuitively desirable to be able to give "credit" for having won honors or awards. While information about them is generally available from the individuals who have received them, access to the individuals is rarely available in connection with federally supported studies because of the continuing effort of the Office of Management and Budget to restrict data gathering. As a result, obtaining the information is often a tedious exercise; and with the hundreds of different awards made among all of the sciences, the chances of missing some are not small. However, biographical sources carry this kind of information, and the awarding organizations are also receptive to inquiry.

The principal problems with awards and honors are the inequalities of opportunity and of significance to which they are prone. Again discipline is important, for access to awards varies widely among them. Location of employment may also militate against the opportunity to gain recognition, and the related gains of prestige and cumulative advantage. Some rather ingenious efforts have been made to overcome some of the disadvantages of these measures: The Coles (1973) had 300 physicists rate the visibility and prestige of 98 honorific awards and used their ratings to weight the available information. Others have set a mark--that is, some number of awards or honors may be set as a level of "success." As with grants, receiving awards may be a suitable indicator of achievement, but awards are very rare, and not receiving them is not an equitable indicator of the absence of achievement.
MENTORING

Mentoring has usually been studied as a predictor of the subsequent success of students (see, for example, Long et al., 1979), but as a measure of productivity it has generally been rejected because popularity as a mentor has been associated with the concept of operating "diploma mills." If suitable means of assessing the performance of individuals whose principal activities do not include active research participation or who devote only a small proportion of time to research are to be found, mentoring should be reconsidered. The mentor whose students become outstanding achievers may or may not deserve credit for having made an important contribution to science. Only studies that are able to obtain accurate assessments of ability, and to consider very long time spans are likely to be able to deal with such a measure (after completion of training and appointment to a university, it would probably require about 15 years' followup to obtain a minimum amount of useful information).

An alternate measure to be considered could be assessment of a mentor's success in placing students on completion of the doctorate. If such a measure could be partialed out from its inevitable association with prestige of the mentor's institution, it should prove to be an appropriate measure. Placements could be "scored" by referring to the National Academy of Sciences Assessment of Research Doctorate Programs (Jones et al., 1982) for the appropriate disciplines.

PATENTS

While it can be assumed that any participant in the scientific community is willing, if not eager, to receive an award, the pursuit of patents is a specialized goal of a restricted subgroup. With the easing of federal restrictions on the ownership of patents, interest in obtaining them has undoubtedly increased in the population of scientists that was formerly restricted. To what extent the changing mores of society, which strongly encourage the deliberate pursuit of material reward, may also affect scientists' problem-selection behavior is not, to this writer's knowledge, known. But the current "state of the science" of many areas of research in the biomedical sciences, which has produced rapidly expanding opportunities for producing scientific advances that have significant commercial potential, is surely not without effect. Among scientists employed in the commercial sector, patents and salary are probably the two best potential measures of productivity available. But how patents should be assessed in the academic sector is not so clear, and whether academics will change their publication behavior when nearing a patentable advance, as many in the commercial sector have been forced to do, is also not clear.
Patents have become an international focus of attention in assessing the productivity of nations. Narin's report to NSF (1988) of comparisons between U.S. and Japanese patent activity has drawn widespread journalistic as well as scientific interest. Together with several present and former staff members, Narin has established a new dimension in bibliometrics through studies of relations between publications and patents. CHI, of which Narin is president, has developed a "Patent Citation Indicators Database" and "Full Text Patent Citation Data." The first makes available all information on the first pages of close to a million U.S. patents issued since 1971; and the second provides extracts from the full texts of patents issued between 1975 and 1983, including citations to U.S. and foreign patents. These resources are as yet untapped for studies of NIH research programs, although the potential for investigation of a wide variety of questions at the interface between science and technology is great (see, e.g., Noma, 1986).

Keith Pavitt, a member of the prolific science studies group at the University of Sussex, has prepared an excellent review (1985). He notes early work in the 1960s, increased interest reflected in NSF's Science Indicators, and lists sources of data in Western Europe and the United States. The state of the field is discussed under the following headings:

o analytical approaches (Narin, various economists, and NSF's Science Indicators);
o types of activity measured (invention as distinct from innovation, relation to R&D expenditures, relative superior performance of small firms, protection against imitation, skewed distribution of monetary value);
o international comparisons (summaries of several empirical studies in such areas as relations between per capita capital expenditures on R&D and patent activity, problems of data-gathering in foreign countries, national differences in propensity to patent);
o comparisons among industrial sectors (technology gap theory, difficulties of allocating patent classes uniquely to product based industry classes, relation of this problem to accuracy of estimates in different fields);
o comparisons among technical fields (classification problems in attempting to relate patenting to rates of technical innovation, citation rates of "significant" patents, links between patents and scientific literature, technical profiles of industrial firms);
o comparisons among industrial firms (relations between R&D and patenting, skewed distribution of value of patents and propensity to patent, inverse relation between propensity to patent and size of R&D programs); and
o comparisons over time (increasing share of U.S. patents that are of foreign origin, possibility of increased concentration in relation to diminishing ratio of patents to R&D size).
Pavitt concludes with a list of areas in which systematic inquiry is needed. He contends that the elimination of sector and firm-specific biases will require more comprehensive and accurate information about the nature and determinants of patenting behavior within firms. Systematic sample survey data are required on the following subjects:

o the sources of the innovative activities that lead to patenting: in particular, the intersectoral variance in the relative importance of R&D, production engineering, small firms, and other sources;
o the time distribution of patenting activities over the life cycle of an innovation (in particular, does patenting typically reach a maximum at the time of commercial launch?);
o the propensity to patent the results of innovative activities: in particular, sector specific factors related to the effectiveness of patenting as a barrier to imitation, compared to alternatives; firm-specific factors related to perceptions of the costs and benefits of patenting; and country-specific factors relating to the costs and benefits of patenting; and
o the judgment of technological peers on the innovative performance of specific firms and countries, and on the relative rate of technological advance in specific fields: in particular, the degree to which these judgments are consistent with the patterns shown by patent statistics.

Finally, Pavitt calls for improved classification schemes, such that established patent classes can be matched more effectively, on the one hand to standard industrial and trade classifications and, on the other, to technically coherent fields of development.

SUMMARY

There are, simply, no easy, ready-made solutions to the problems of identifying measures that will be useful in the assessment of productivity. There is need for the development and application of creative approaches to improving the utility of the kinds of information that can be obtained: the development, for example, of indexes that may increase the equitability of some measures. And there is need as well, in many cases, for increased attention to detail in designing studies and analyzing data.

The two sources of information that have the broadest potential value in the assessment of academic scientific performance are peer assessment and the analysis of publications, though there are circumstances in which neither may be appropriate. (For analyses involving the commercial sector, patent analysis--when used as an extension of publication analysis--should probably be added.) From the perspective that they tend to be fairly highly correlated, each contributes somewhat to confidence in the other, and to the extent that they are not correlated the need for both kinds of information is greater in the given measurement situation.

Because peer assessment is so extremely costly, time consuming, and difficult to employ equitably, it may be necessary or worthwhile, especially in large-scale studies, to investigate whether there are records available about--for example, program operation, faculty activity, support, student outcomes, and resources (in addition to publication data)--that might be able to account for a large proportion of the variation in peer judgments of program quality.

On the other hand, the use of publication and citation measures as the sole consideration in the assessment of the individual scientist's productivity can be rejected on a purely rational basis. As a means of confirming a positive subjective judgment of individual performance, there is no problem, but the opposite does not hold because there are myriad alternative explanations for low numbers of publications and for few or no citations. One of the more significant misjudgments that can result is the case in which few or no citations are received by highly significant papers that either are ahead of their time or are published in obscure journals. No imperfect tool that may be used to the disadvantage of the single individual (including peer judgment) can be justified.
The caution warrants repeating (and appears fairly frequently in the bibliometric literature) that bibliometric measures are most appropriately employed in group comparisons in which aggregates of publications are large--just how large depends on how closely comparison groups can be matched. Correspondingly, peer assessments are most appropriately employed when peers are equally informed about all of the assessment targets and when self-serving competitive interests are absent. Perhaps the single most important factor in planning investigations of productivity is the need to employ multiple measures and to apply them selectively to the appropriate targets.

REFERENCES

Anderson, Richard C., Francis Narin, and Paul McAllister. 1978. Publication ratings versus peer ratings of universities. Journal of the American Society for Information Science 29(2):91-103.
Carter, Grace M. 1974. Peer Review, Citations, and Biomedical Research Policy: NIH Grants to Medical School Faculty (Rand Report R-1583-HEW). Washington, D.C.: Rand Corporation.
Carter, Grace M., Clara S. Lai, and Carolyn L. Lee. 1978. A Comparison of Large Grants and Research Project Grants Awarded by the National Institutes of Health (Rand Report R-2228-1-NIH). Washington, D.C.: Rand Corporation.
Carter, Grace M., John D. Winkler, and Andrea K. Biddle. 1987. An Evaluation of the NIH Research Career Development Award. Washington, D.C.: Rand Corporation.
Chubin, Daryl E. 1987. Research evaluation and the generation of big science policy. Knowledge 9(2):254-277.
Chubin, Daryl E., and Soumyo D. Moitra. 1975. Content analysis of references: Adjunct or alternative to citation counting? Social Studies of Science 5:423-441.
Clark, Kenneth E. 1957. America's Psychologists: A Survey of a Growing Profession. Washington, D.C.: American Psychological Association.
Cole, Jonathan R., and Stephen Cole. 1973. Social Stratification in Science. Chicago: University of Chicago Press.
Committee on Science, Engineering, and Public Policy (COSEPUP). 1982. The Quality of Research in Science: Methods for Postperformance Evaluation of Research in the National Science Foundation. Washington, D.C.: National Academy Press.
Cronin, B. 1984. The Citation Process. London: Taylor Graham.
Fox, Mary Frank. 1983. Scientists' publication productivity. Social Studies of Science 13(2):298-329.
Garfield, E. 1955. Citation indexes for science. Science 167.
Garfield, E. 1972. Citation analysis as a tool in journal evaluation. Science 178:471-479.

Gee, Helen Hofer. 1988. An Analysis of NIH Intramural Research Publications, 1973-1984 (Report to the Committee to Study Strategies to Strengthen the Scientific Excellence of the NIH Intramural Research Program). Washington, D.C.: National Academy Press.
Gee, Helen Hofer, and Francis Narin. 1986. An Analysis of Research Publications Supported by NIH 1973-76 and 1977-80 (Publication No. 86-2777). Washington, D.C.: NIH.
Gilbert, G. Nigel. 1988. Measuring the growth of science. Scientometrics 1(1):9-34.
Hjerppe, R. 1982. Supplement to a "Bibliography of bibliometrics and citation indexing & analysis." Scientometrics 4(3):241-273.
Institute for Scientific Information. 1963. Science Citation Index. Philadelphia, PA: ISI.
Jones, Lyle V., Gardner Lindzey, and Porter Coggeshall (eds.). 1982. An Assessment of Research-Doctorate Programs in the United States. Washington, D.C.: National Academy Press.
Lawani, Stephen M., and Alan E. Bayer. 1983. Validity of citation criteria for assessing the influence of scientific publications: New evidence with peer assessment. Journal of the American Society for Information Science 34(1):59-66.
Leydesdorff, Loet. 1987. Various methods for the mapping of science. Scientometrics 11(5-6):295-324.
Leydesdorff, Loet, and Peter van der Schaar. 1987. The use of scientometric methods for evaluating national research programs. Science and Technology Studies 5(1):22-31.
Long, J. Scott, Paul D. Allison, and Robert McGinnis. 1979. Entrance into the academic career. American Sociological Review 44(5):816-830.
Long, J. Scott, and Robert McGinnis. 1981. Organizational context and scientific productivity. American Sociological Review 46:422-442.
Martin, Ben R., and John Irvine. 1983. Assessing basic research. Research Policy 12:61-90.
McAllister, Paul R., and Francis Narin. 1983. Characterization of the research papers of U.S. medical schools. Journal of the American Society for Information Science 34(2):123-131.

McGinnis, Robert, and J. Scott Long. 1982. Postdoctoral training in bioscience: Allocation and outcomes. Social Forces 60(3):701-722.
Moed, H. F., J. M. Burger, J. G. Frankfort, and A. F. J. Van Raan. 1985. A comparative study of bibliometric past performance analysis and peer judgment. Scientometrics 8:3-4.
Moravcsik, Michael J. 1986. Assessing the methodology for finding a methodology for assessment. Social Studies of Science 16:534-39.
Narin, F. 1976. Evaluative Bibliometrics: The Use of Publication and Citation Analysis in the Evaluation of Scientific Activity (Report to the National Science Foundation). (Now available only through the National Technical Information Service, NTIS no. PB 252339/AS.)
Narin, F. 1983. Subjective vs. Bibliometric Assessment of Biomedical Research Publications (NIH Program Evaluation Report). (Unpublished report available from the NIH Office of Program Planning and Evaluation or from the author.)
Narin, F. 1985. Measuring the Research Productivity of Higher Education Institutions Using Bibliometric Techniques. Paper presented at a Workshop on Science and Technology Measures in the Higher Education Sector, OECD, Paris, France.
Narin, F. 1988. Indicators of Strength: Excellence and Linkage in Japanese Technology and Science. Paper presented at the National Science Foundation, June 21, 1988. (See also F. Narin and E. Noma, Is technology becoming science?, Scientometrics 7(3):369-381, 1985.)
Narin, F., and J. K. Moll. 1977. Bibliometrics. Annual Review of Information Science and Technology 12:32-58.
Narin, F., G. Pinski, and H. H. Gee. 1976. Structure of the biomedical literature. Journal of the American Society for Information Science 27:25-45.
National Science Foundation. 1974. Science Indicators. Washington, D.C.: U.S. Government Printing Office (this report is published annually).
National Science Foundation. 1982. Studies of Scientific Disciplines: An Annotated Bibliography. Washington, D.C.: U.S. Government Printing Office.
Noma, Elliot. 1986. Subject Classification and Influence Weights for 3,000 Journals. Haddon Heights, NJ: Computer Horizons, Inc.

Pavitt, K. 1985. Patent statistics as indicators of innovative activities: Possibilities and problems. Scientometrics 7(1-2):77-99. (Pavitt cites B. Basberg, Technological change in the Norwegian whaling industry: A case study of the use of patent statistics as a technology indicator, Research Policy 11(3):163-171, 1982.)
Pinski, Gabriel. 1975. Subject Classification and Influence Weights for 2300 Journals (NSF Final Task Report). Haddon Heights, NJ: Computer Horizons, Inc.
Porter, A. L., D. E. Chubin, and Xiao-Yin Jin. 1986. Citations and scientific progress: Comparing bibliometric measures with scientist judgments. Scientometrics 13(3-4):103-124.
Price, Derek de Solla. 1961. Science Since Babylon. New Haven, Conn.: Yale University Press.
Price, Derek de Solla. 1963. Little Science, Big Science. New York: Columbia University Press.
Reskin, Barbara. 1979. Review of the literature on the relationship between age and scientific productivity. In Committee on Continuity in Academic Research Performance, Research Excellence Through the Year 2000: The Importance of Maintaining a Flow of New Faculty into Academic Research (Appendix C:189-207). Washington, D.C.: National Academy of Sciences.
Schubert, A. 1985. Quantitative studies of science: A current bibliography. Scientometrics 9(5-6):293-304.
Schubert, A. 1986. Quantitative studies of science: A current bibliography. Scientometrics 8(1-2):137-140.
Schubert, A., and T. Braun. 1986. Relative indicators and relational charts for comparative assessment of publication output and citation impact. Scientometrics 9(5-6):281-291. (See also T. Braun, W. Glanzel, and A. Schubert, One more version of the facts and figures on publication output and relative citation impact of 107 countries, 1978-1980, Scientometrics 11(1-2):9-15, and (3-4):127-140.)
Schubert, A., and W. Glanzel. 1983. Statistical reliability of comparisons based on the citation impact of scientific publications. Scientometrics 5(1):59-74.
Schubert, A., W. Glanzel, and T. Braun. 1983. Relative citation rate: A new indicator for measuring the impact of publications. In D. Tomov and L. Dimitrova (eds.), Proceedings of the First National Conference with International Participation on Scientometrics and Linguistics of Scientific Text, Varna, pp. 80-81.
Simeon, V. L., et al. 1986. Analysis of the bibliographic output from a research institution in relation to the measures of scientific policy. Scientometrics 9(5-6):223-230.
Stowe, Robert C. 1986. Annotated Bibliography of Publications Dealing with Qualitative and Quantitative Indicators of the Quality of Science (A technical memorandum of the quality indicators project). Cambridge, MA: Harvard University.
Van Heeringen, A., and P. A. Dijkwel. 1987a. Age, mobility and productivity: I. Scientometrics 11:267-280.
Van Heeringen, A., and P. A. Dijkwel. 1987b. Age, mobility and productivity: II. Scientometrics 11:281-293.
Vinkler, P. 1986. Evaluation of some methods for the relative assessment of scientific publications. Scientometrics 10(3-4):157-177.
Vinkler, P. 1987. A quasi-quantitative citation model. Scientometrics 12(1-2):47-72. (See also B. Cronin, The Citation Process, London: Taylor Graham, 1984; D. Chubin and S. D. Moitra, Content analysis of references: Adjunct or alternative to citation counting? Social Studies of Science 5:423, 1975; and M. J. Moravcsik and P. Murugesan, Some results on the function and quality of citation, Social Studies of Science 5:86-92, 1975.)
Weiss, C. H. 1972. Evaluation Research: Methods of Assessing Program Effectiveness. Englewood Cliffs, NJ: Prentice-Hall Inc.

APPENDIX: SCIENCE STUDIES SOURCES

Nearly three-quarters of a century has passed since Cole and Eales in 1917 reported their international comparison of counts of books and papers in comparative anatomy published between 1543 and 1880 (Narin, 1977). In 1926 Lotka demonstrated that the distribution of publications in a discipline (physics) is highly skewed and that most scientific papers are published by a small minority of scientists (Fox, 1983). So began inquiries into the use of publication measures in the assessment of productivity and the closely related concept of eminence. Rapid advancement, however, became feasible only when computers became readily accessible and inexpensive in the 1950s.

In a landmark empirical study conducted between 1954 and 1957, a committee of the American Psychological Association conducted an extensive inquiry into the correlates of productivity of all doctorates granted in the field of psychology between 1930 and 1944 (Clark, 1957). The study was significant in employing publication and citation measures as correlates of peer assessments of productivity and in recognizing the importance of investigating differences among subdisciplines and of taking into account variations in background, social, and psychological characteristics as correlates and potential predictors of eventual professional accomplishment and status. The study was also noteworthy in its use of computer-implemented quantitative methods to describe and compare the most productive with other members of the profession. In this sense it marks the empirical beginning of what has become a worldwide effort on the part of both theoretical and empirical investigators to achieve a better understanding of how science and scientists function and thrive in the society of our time.
Comprehensive theoretical and methodological as well as empirical studies of the sociology, psychology, and economics of science and scientists did not begin to appear in large numbers until the 1960s. Derek de Solla Price (1963) is appropriately credited with sparking the present-day intellectual development of inquiry into the assessment of research quality and eminence. Since then studies have proliferated rapidly in depth, breadth, and complexity as well as in number. Hjerppe (1982) added 518 items to a bibliography of more than 2,000 items, "Bibliography of Bibliometrics and Citation Indexing & Analysis," published in Sweden in 1980. More directly relevant to the present inquiry are bibliographies that are being developed to assist groups of interested and involved scientists in their attempts to keep up with research aimed at achieving better understanding of how science and scientists function. Although it is not feasible to attempt a comprehensive review of all bibliographies that might be helpful to those concerned with the analysis of productivity and its essential correlates, a brief description of some publications that cover a great deal of the relevant research effort to about 1980 may be useful.

*** Jonathan and Stephen Cole, Social Stratification in Science (1973): The Coles conducted several different cross-sectional studies of academic physicists in their investigation of the social stratification system in science. The Coles staunchly defended the view that science functions as a meritocracy and concluded that physics is a universalistic and rational discipline in which quality of work (as measured by citations) is the chief determinant of ultimate status. (A recent personal communication indicates that J. Cole delivered a paper at American Sociological Association meetings that partially recants earlier views on universalism.) For more up-to-date, longitudinal analyses of scientists in biochemistry that result in a different conclusion, see Long et al. (1979), Long and McGinnis (1981), and McGinnis and Long (1982). The Coles examined multivariate interrelationships among departmental rank, number and assessed prestige of honorific awards, membership status in professional societies, geographical location, and number and "quality" (citation counts) of publications in exploring the development of professional visibility and eminence. The book also contains a brief historical account of the development of research in the social science of science.

*** Francis Narin, Evaluative Bibliometrics (1976): Narin cited 140 papers in providing a brief historical account of the development of techniques of measuring publications and citations, in reviewing a number of empirical investigations of the validity of bibliometric analyses, and in presenting details of the characteristics of and differences among scientific fields and subdisciplines. (The Annual Review of Information Science and Technology published a bibliography entitled "Bibliometrics" by Narin and Moll (1977), which contains many, but not all, of the same references that are in Evaluative Bibliometrics.) The book, prepared for the National Science Foundation, contains explicit details of how several indices of journal influence are calculated and how variations within a field of science differ from variations within a subdiscipline. Three different influence measures are provided for each of the 2,250 journals in the 1973 Science Citation Index. (New influence indices have since been calculated for some 3,000 journals in the 1982 SCI; see Noma, 1986.) Some two dozen studies are cited that deal with the correspondence between literature-based and other methods of assessing the quality of scientific output.

*** NSF Division of Planning and Policy, Social Studies of Scientific Disciplines (1982): This annotated bibliography "makes accessible to the managers and practitioners of science and engineering the findings from the social studies of science in a form that will be useful to them." The bibliography covers studies conducted up to the mid 1980s and reports on the work of nearly 300 authors, most with multiple entries. Although only one subsection is entitled "Productivity," it is not an exaggeration to estimate that at least 90 percent of entries in the work deal with material relevant to the measurement of this concept. A similar percentage describe investigations that employ publication measures in examining 23 identifiable but related subjects across 13 disciplines. A total of 285 studies yield nearly 500 entries in the bibliography, many studies having dealt with multiple disciplines. Subject categories in the bibliography include:

Attitudes and Values
Career Patterns
Competition
Development of Disciplines
Discipline Comparisons
Discipline Organization
Discovery Process
Education, Grad. Educ.
Funding of Research
Information Exchange
National Comparisons
Paradigm Characteristics
Performance of research
Productivity
Productivity - age
Professional Associations
Publication practices
Recognition and reward
Social stratification
Structure of the literature
Structure of literature--Specialty groups
Citation rates
Journal influence
University Ratings

*** Mary Frank Fox, "Scientists' Publication Productivity," Social Studies of Science (1983): In this critical review, Fox discusses publication productivity in relation to psychological characteristics of individuals such as motivation, ego strength, cognitive style, personality and interests, and IQ, noting the restricted range of ability among scientists and the correspondingly low correlation with measures of productivity, as well as the fact that creativity does not exist in a vacuum. Citing Pelz and Andrews, she states, "Rather, social factors so affect the translation of creative ability into innovative performance that measured creativity is virtually unrelated to either the innovativeness or the productiveness of scientists' output." The importance of environmental characteristics such as institutional prestige and organizational freedom is summarized, including the important findings of Long and McGinnis, whose longitudinal studies point to a stronger effect of location on productivity than of productivity on subsequent location, the reverse of what had been reported in earlier studies using cross-sectional designs. An interesting discussion of the closely entwined concepts of cumulative advantage and reinforcement is also included in this review of approximately a hundred different studies.

*** A. Schubert, "Quantitative Studies of Science: A Current Bibliography," Scientometrics (1985 and 1986): Close to 100 papers are listed in each year, and the list does not include those published in Scientometrics itself. The vast majority are empirical and methodological papers on bibliometric topics. While no country exceeds the United States in number of papers listed, the total number of foreign papers, not including Canada and the United Kingdom, was nearly twice the number of United States publications.

*** Robert C. Stowe, An Annotated Bibliography of Publications Dealing with Qualitative and Quantitative Indicators of the Quality of Science (Including a bibliography on the access of women to participation in scientific research) (1986): In addition to a list of core books, annotated entries are made under the following headings:

I. Bibliometric indicators of the quality of scientific research
   -Citations and publications as indicators of quality
   -Critiques of citation analysis
   -Citation Context Analysis
II. Qualitative approaches to and more general works on research evaluation
III. Works dealing specifically with "science indicators"
IV. Forecasting and research priorities
V. Peer review
VI. Quality and quantity in the history of science and philosophy
VII. Education
VIII. Issues involving quantity and quality in particular disciplines, including papers on social indicators
IX. Sociology of science
X. Methodological papers and bibliographies
XI. Access of women to participation in scientific research


