The Stakes in Research Assessment
This chapter endeavors to make explicit what is at stake in the comparative assessment of research fields, both retrospectively and prospectively. It takes as given that core decisions concerning the scope of any research agency’s mission, overall budget, and dominant mechanisms for selecting research proposals are largely determined at high political and administrative levels, such as the Office of the President, Congress, and politically appointed agency heads. Science managers in subagency divisions or programs, such as the Behavioral and Social Research (BSR) Program at the National Institute on Aging (NIA), work within these bounds to set priorities by identifying, supporting, and nurturing scientifically vital, mission-relevant areas of research, selectively promoting those areas judged most likely to advance the mission. Science managers may also recommend changes in larger organizational priorities and practices deemed necessary to permit specific programs to achieve their objectives more effectively in the context of the larger agency mission. Managers do not do these things on their own, however. Their actions are informed and influenced by outside constituencies, prominently including working scientists in the relevant fields and the potential beneficiaries of the science.
This chapter addresses two major, interrelated themes in science assessment: the actors (Who should be involved in assessing the science, and what should be the relative power and influence among these actors?) and the methods (How should assessment be done and decision making organized?). These questions and themes are being raised in the National Institutes of Health (NIH) and in many other federal government science agencies.
We begin by briefly describing the historical context of science priority setting in the U.S. federal government, which in recent years includes increasing pressure to move from traditional, peer review–based approaches for setting research priorities and assessing returns from research investments to approaches that rely more on quantifying returns from investment in science. We then discuss the limitations of both traditional expert judgment and quantitative approaches, recognizing the particular difficulties of comparing different kinds of fields and of assessing scientific progress in interdisciplinary or transdisciplinary fields. Finally, we discuss the ways in which debates about the best methods for priority setting, in the context of a movement for government accountability, raise deeper questions about the balance of influence and power among researchers, program managers, advisory councils, extramural scientists, and other interested parties.
These debates occur in the pursuit of two legitimate and important public policy goals: making public expenditures accountable to the taxpayers and ensuring rational priority setting among research expenditures, based on the best available information about the likely returns from future public investments. Accountable and rational priority setting implies a need for comparative assessment, especially when continuing and new claims on resources outpace the rate of increase in science budgets or when a perception emerges that some research fields are not advancing, despite continued support. Comparative assessment is inherently difficult, however, for several reasons. First, different fields of science may produce different kinds of benefits and may benefit different people. In a diverse society, it is unlikely that everyone will agree on the relative importance of different kinds of benefits and therefore on the overall benefit of any particular line of research (Bozeman and Sarewitz, 2005).
Second, some kinds of benefit are easier to see and measure than others. Bibliometric data, for example, make scientific publications and published citations to them highly visible, but they have serious limitations, as detailed in Chapter 5. These data primarily reflect communication patterns among scientists. They are not necessarily predictive of the absolute or relative value of scientific outputs as sources of information for or influence on policy makers.
Third, even if agreement can be reached on the relative benefits from different fields of science to date, the benefits of future investments would remain speculative and uncertain. The existence of compendia of erroneous predictions by experts about the progress of fields of science and technology is good reason to start from the premise that no science of predicting the future of science exists (Cerf and Navasky, 1984, Chapter 7; Thomas, 1999).
Fourth, different individuals have the expertise needed to evaluate research in different fields. Finding all the needed expertise for comparing different fields in the same individuals is the exception rather than the rule.
Finally, there is the question of whose judgment should be final in resolving disagreements. This is a political question about relationships between and among the principals and the agents engaged in setting research priorities and performing the funded research. The primary principals are congressional and executive decision makers. Agency officials serve in a dual, intermediate role: they are the agents of the elected and appointed public officials, but to those who turn to them for funding, they are principals. The research performers, in the main, are the agents, but their participation in advisory councils and review committees also provides them with some of the influence and decision-making power more conventionally ascribed to principals. The very complexity of these arrangements complicates answers to questions about the distribution of power and influence among those who might decide and the distribution of benefits to those who might benefit. We return to this issue later in the chapter.
BRIEF HISTORY OF FEDERAL SCIENCE PRIORITY SETTING
Since the end of World War II, the salience of the issues of priority setting and retrospective assessment in U.S. science policy has waxed and waned. Priority setting received a surge of analytical and policy attention in the early 1960s, a period of steady increases in federal government support for research and development in the defense and nondefense sectors (Smith, 1990). The conventional benchmark of science assessment in this era is Alvin M. Weinberg’s two articles entitled “Criteria for Scientific Choice” (Weinberg, 1963, 1964) and the surrounding exegesis (Toulmin, 1964; Smith, 1982). Weinberg defined a short list of generic criteria (see Table 3-1). Throughout the 1960s and 1970s, periodic efforts were undertaken to apply these and related criteria to priority setting in specific scientific fields (e.g., National Academy of Sciences, 1965; National Research Council, 1972). Priority setting, framed in terms of the long-term prospects for science, also drew episodic attention from the U.S. Congress, the National Science Foundation (NSF), and the National Academies (e.g., Committee on Science and Technology, 1982; Irvine and Martin, 1984). The topic has continued to surface intermittently in the larger discourse on the federal science budget, such as in the report Allocating Federal Funds for Science and Technology (National Research Council, 1995a) and in prescriptions by senior national science policy officials (e.g., Bromley, 2003). Priority setting is also implicit in the strategic planning undertakings of federal agencies, in which selected fields are chosen for emphasis, with explicit or implicit decisions made not to fund other areas or to alter relative distributions of support among areas.

TABLE 3-1 Generic Criteria (How well is the science done?)
- Is the field ready for exploitation?
- Are the scientists in the field really competent?

SOURCE: Weinberg (1963).
Declining Attention to Criteria for Priority Setting
Although attention to priority setting has remained a staple component of U.S. government science policy, systematic attention to the development of criteria for scientific choice waned after the 1970s, at least in the United States. In part, this was due to the difficulties of moving from general agreement about broad priority-setting principles to agreement about the specific methods and measures to be used to operationalize these principles. In particular, somewhat in contrast to developments in European countries, where bibliometrics is often used in assessing and formulating science policy, there has been little consensus in the United States about the reliability or validity of techniques to assess the relative importance of different fields (Hicks et al., 2004). The decline in interest in applying systematic criteria for setting research priorities also partly reflected the unwillingness or inability of many scientific communities to agree about the priorities in their fields, as in the case of divisions in the high-energy physics community surrounding decisions to construct the Superconducting Super Collider. In addition, the few formal entities that existed to conduct prospective and retrospective analysis and to support research in concept development and tool building had short life spans.1
General acceptance developed about relationships between the federal government as sponsor of basic and applied scientific research and the performers of this research. Congress and the administration, primarily through the legislative and budgetary processes, would set overarching national priorities for science and technology (although increasingly earmarking areas of research and performers); responsibility for converting these priorities into specific program areas and proposals devolved to intra-agency procedures and the peer or merit review system (Guston and Keniston, 1994; Guston, 2000). There was also general acceptance of the desirability of maintaining balance among fields in support for science, especially between
the life sciences and the natural sciences and engineering. However, few efforts were made at the deeper analysis that might provide justification for important policy choices, such as the relative apportionment of funds among disciplines and fields.2
For many years, far greater attention was paid to overall levels of federal government support of science than to questions of allocations among fields, reflecting a general level of satisfaction that, for all its possible inefficiencies in terms of the goal of maximizing rates of return to public (or program) investments, the system had led to the United States’ preeminent position in world science. As Bruce Smith noted in congressional testimony in 1982, commenting on the National Academy of Sciences (1965) Basic Research report, “A common theme sounded by most of the panelists … was that the system we have evolved to support science, whatever our understanding of its inner mechanism, has given the United States a pre-eminence in the scientific world. Drastic changes in the present system, therefore, should be viewed with suspicion. The quest should be for marginal adjustments in present policies to assure a continued United States leadership in basic science” (Smith, 1982:194). Similar analysis and recommendations pervade recent benchmarking assessments of the U.S. position in selected fields of science. As observed by the National Academies panel convened to assess the U.S. position in immunology research (Committee on Science, Engineering, and Public Policy, 1999b:52), the United States is the world leader in most major subcategories of immunology research, with this position being attributed to a system “that is largely an investigator-initiated, peer-reviewed, and merit-based system of awarding grants.”
Thus, analytical and policy attention and research shifted in recent years to questions relating to the measurement of the social and private rates of return from research in general. This research has been conducted mainly within an economic paradigm that draws on earlier analyses pioneered by Nelson (1959) and Arrow (1962), which focused on whether competitive market dynamics could be projected to produce the socially optimal level of private-sector investment in research, especially basic research (e.g., Hall, 1996). The analysis derived from this framework, coupled with several major empirical studies on private and social rates of return to research in agriculture, health, and technological innovations, provided empirical support for the conclusion that governmental support of fundamental research yielded net social benefits. As a by-product, this research produced findings about benefit-cost ratios of investments in different lines of research, such as on different agricultural commodities (Evenson et al., 1979) and on diseases (e.g., Institute of Medicine, 1998; Gross et al., 1999; Murphy and Topel, 1999) that could have served as guides to future budget allocations among different directions of scientific research. Little evidence exists, however, that these findings in fact affected congressional budgetary allocations (Olsen
and Levy, 2004). Moreover, the findings tended to be directed at comparing target uses of research, not at comparative lines of research, the question posed by BSR.
Increasing Pressures for Systematic Priority Setting
The question of priorities among fields of science resurfaced in the 1990s and has gained increasing salience since then, as a result of several factors. First, the size and continued growth of federal research and development expenditures, together with increased competition for federal budget dollars, began to call forth new demands for accountability and demonstrated accomplishments. The enactment in 1993 of the Government Performance and Results Act (GPRA) focused those demands: it required that all federal agencies develop multiyear strategic plans and evaluate and report annually on their activities in relation to the objectives stated in these plans. For research agencies, GPRA created pressure to implement systematic methods and bureaucratic routines for assessing the value of research investments (National Research Council, 1999, 2001c). The act shifted attention from statements of an agency’s needs and opportunities toward outputs and outcomes. Inextricably coupled with demands for accountability were new demands coming from both the administration and Congress for “evidence” and documentation of performance and results. The demands are manifest in GPRA and in its implementation via the Program Assessment Rating Tool (PART), created by the Office of Management and Budget (OMB). They continue to receive support from congressional and administration leaders.3
PART was introduced to cover selected federal agencies as part of the FY 2004 federal budget process and has since been applied to an increasing number of agencies and programs. As described on the OMB web site, “PART was developed to assess and improve program performance so that the Federal government can achieve better results. A PART review helps identify a program’s strengths and weaknesses to inform funding and management decisions aimed at making the program more effective. The PART therefore looks at all factors that affect and reflect program performance including program purpose and design; performance measurement, evaluations, and strategic planning; program management; and program results. Because the PART includes a consistent series of analytical questions, it allows programs to show improvements over time, and allows comparisons between similar programs” (available: http://www.whitehouse.gov/omb/part).
A second factor has been the significant advances that have been made in theoretical and especially empirical studies of scientific activities since the Weinberg articles of the 1960s. In particular, advances have been made in assembling and making more accessible quantitative data on several aspects of scientific activity that previously required labor-intensive effort and that were difficult to link together. For example, there has been continuing expansion and refinement of the NSF’s Science and Engineering Indicators biennial reports, leading to more readily accessible data on publications, patents, and patterns of collaboration among scientists.
Data on scientific activity, including numbers of publications by keyword, numbers of citations, and so forth, are now readily available online through such services as Thomson Scientific’s Web of Science® and Google Scholar. The National Bureau of Economic Research has compiled an extensive data set of U.S. patents that includes all citations to these patents and a broad match of these patents to financial data sets (Jaffe and Trajtenberg, 2002). At least two major handbooks have been published, distilling a much larger and diverse literature on quantitative methods in the use of publication and patent statistics in studies of science and technology systems (van Raan, 1988a; Moed et al., 2004). New methodological and empirical ferment is emerging in the use of network theory and career trajectories to explore patterns of collaboration, and thus leader-follower relationships and the diffusion of ideas and techniques, among scientists (Wagner and Leydesdorff, 2005). Advances in data mining and data visualization techniques facilitate processing large quantities of data, making it possible to identify or confirm relationships that may previously have been unrecognized (e.g., Boyack and Börner, 2003).
Advances in quantitative measurement and data analysis have also facilitated the development of new theories of the relationships among scientific activities and their effects in the larger society, and increasingly sophisticated theoretical and empirical models have been tested for examining these causal relationships. Thus, for example, researchers have examined linkages between bibliometric data (on scientific publications and citations) and patent data to advance conclusions about the productivity of federal government investments in some areas of basic research (Narin et al., 1997). Advances in data availability and analysis make the quantitative assessment of developments in a nation’s scientific enterprise increasingly feasible and attractive to public-sector officials. With all these advances, the prospect of using quantitative analysis systematically to channel public funds to their most productive scientific uses appears more attainable than before.
A third factor making priority setting more salient involves the dynamics of science itself, especially the widespread consensus that the greatest opportunities for advances in science now involve the crossing of traditional disciplinary boundaries and the creation of new fields. NIH’s Roadmap, for example, addresses what agency leaders describe as revolutionary and rapid changes in science and the need to overcome barriers created by the complexity of NIH as an institution comprised of many units; the compartmentalized structure of the NIH bureaucracy, with its division by organ, life stage, disease, and scientific discipline; and the rapid convergence of science. Related to this judgment are increasingly voiced concerns that the combination of existing organizational arrangements and procedures that science agencies, including NIH, use for setting priorities, selecting research proposals, and evaluating the outcomes of research, together with increased competition for funding, are leading to unduly conservative, risk-averse selection of research awards.4
Finally, the globalization of scientific activities, coupled with the widespread belief that scientific leadership is increasingly linked to international economic competitiveness, introduced another issue into science policy discussions. A report from the National Academies (National Research Council, 1995a) articulated the principle that a country’s position relative to its scientific competitors should be taken into account in deciding the distribution of resources among scientific fields: “The President and the Congress should ensure that the [federal science and technology] budget is sufficient to allow the United States to achieve preeminence in a select number of fields and to perform at a world-class level in the other major fields” (p. 14).5
All these forces have given impetus to recent efforts to formalize science policy decision-making processes. Almost reflexively, there have been increasing calls for quantification and for transforming the more extensive data on science and technology and improved techniques for the analysis of such data into science metrics (for inventories of widely used metrics, see Geisler, 2000; National Science Board, 2004).
DEBATE OVER PRIORITY SETTING AND ASSESSMENT MECHANISMS
As already noted, peer review has long been the dominant approach in federal research agencies for evaluating the past performance and future potential of research areas and for setting priorities. Peer review is essentially a clinical and deliberative process and one that relies heavily on the expertise of working researchers. It is used most commonly to evaluate proposals for research projects or programs coming from single investigators or research groups; less commonly, the same approach is used to advise research managers on broader matters, such as evaluating the past performance or future prospects of entire research programs or selecting priorities for the future development of these programs. Whatever the purpose, the general approach is similar. Research managers convene groups of experts in relevant research fields, typically constituted as peer review panels, visiting committees, or advisory boards, that deliberate on issues or choices presented by research managers. They are sometimes informed by
additional input, for example, reviews solicited from specialists in particular narrow areas being considered. In making their recommendations to higher level agency decision makers, research managers draw on the judgments of these deliberative groups and add in their own judgments to the extent their agency prescribes or allows. Devolution of decision-making authority, or in this case, recommendations, to peer review panels is the “special mechanism” by which the social contract for science “balances responsibilities between government and science” and thus fosters accountability (Guston and Keniston, 1994:8).6
Outside reviews of research agencies’ efforts to assess programs and set priorities have generally endorsed the clinical, deliberative methods of expert review as the best way to assess research fields. For example, a National Research Council study committee that reviewed agency experiences under GPRA concluded: “The most effective way to evaluate recent programs is by expert review. The most commonly used form of expert review of quality is peer review. This operates on the premise that the people best qualified to judge the quality of research are experts in the field of research. This premise prevails across the research spectrum, from basic research to applied research” (National Research Council, 1999:39).
Accountability Challenges to Peer Review
However well peer review as a method of research assessment may have served science agencies, the scientific community, and society in the past, this approach has recently come under challenge. A major challenge has come from the movement toward greater accountability and attention to performance management, as embodied in GPRA and PART. GPRA requires federal agency managers to establish strategic goals and to demonstrate that these goals have been met, on the basis of predefined measures of performance. Development of standardized outcome measures is also seen as a means by which science managers can compare the bang for the buck across different kinds of expenditures. PART, as already noted, requires an increasing proportion of federal programs to apply a consistent, evidence-based approach to performance measurement, extending from the specification of strategic goals (such as lives saved) to outcomes and outputs. Development of performance measures is encouraged because such measures are seen as the ultimate results for the public. But the PART guidelines also note that “the key to assessing program effectiveness is measuring the right things,” by which is meant, “measures that meaningfully reflect the mission of the program, not merely ones for which there are data” (p. 16).
Intended to be broadly applicable across all federal programs, the PART procedures also contain a specific set of criteria to assess the effectiveness of the federal investment in research and development. The three salient
criteria applicable to NIH-NIA programs are relevance, quality, and performance. For each criterion, the use or development of quantitative metrics is emphasized. Thus, in considering relevance, the PART document states that “OMB will work with some programs to identify quantitative metrics to estimate and compare potential benefits across programs with similar goals. Such comparisons may be within an agency or among agencies” (p. 56). GPRA and PART are similar in that both provide agencies with pressures or incentives to move toward more quantitative methods for setting priorities or assessing performance.7
Despite the skepticism that researchers have at times expressed about applying quantitative approaches to the assessment of their work—for example, concerns about the spawning of “LPUs” (least publishable units)—there are valid reasons for trying to use them. An important one is that many research agencies’ expenditures are justified not only in terms of advances in pure knowledge but also in terms of their potential benefits to society, some of which are eminently quantifiable. Among these, depending on the agency, are improved health or longevity, education, environmental quality, and public safety and security. Already, efforts are under way to develop measures related to the impacts of research on some societal goals, with current emphasis on improving the reliability, timeliness, and administrative feasibility of the measures (U.S. Department of Energy, Office of Science, 2004).
The prominence of the “health and well-being of older Americans” among the strategic goals of NIA makes it tempting to quantify at least those outcomes and to seek evidence for causal links between research and those societal benefits. Quantitative measures of many of the benefits, both realized and projected, as well as benefit-cost ratios, already exist. For example, healthy 70-year-olds live longer and spend less on lifetime health care than their less healthy peers. In one study, individuals with no functional limitations had a life expectancy of 14.3 years and expected cumulative health care expenditures of $136,000 in 1998 dollars, while those with one functional limitation had a shorter life expectancy (11.6 years), but could expect to spend more on health care ($145,000) (Lubitz et al., 2003). If health promotion efforts (e.g., exercise, smoking cessation) can improve the functioning of older Americans, these benefits can be predicted to follow. Similarly, observed benefits of cognitive and affective phenomena for health (e.g., Rosenkranz et al., 2003; Levy, 2003) might also be quantified in economic and life expectancy terms. The estimate that a 1 percent permanent reduction in mortality would be worth about $500 billion (Murphy and Topel, 1999) makes possible benefit-cost analysis of investments in mortality reduction.8
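The arithmetic behind such benefit-cost reasoning can be made concrete with a minimal sketch. The $500 billion value of a permanent 1 percent reduction in mortality is the Murphy and Topel (1999) estimate cited above; the program cost and the mortality effect attributed to research in the example are hypothetical placeholders, not estimates from any study.

```python
# Benefit-cost sketch for an investment in mortality reduction.
# The $500 billion value of a permanent 1 percent reduction in mortality
# is the Murphy and Topel (1999) figure cited in the text; the program
# cost and attributable mortality effect below are hypothetical.

VALUE_PER_PCT_MORTALITY_REDUCTION = 500e9  # dollars per 1% permanent reduction

def benefit_cost_ratio(mortality_reduction_pct: float, program_cost: float) -> float:
    """Imputed social benefit of a permanent mortality reduction,
    divided by the cost of the research program assumed to produce it."""
    benefit = mortality_reduction_pct * VALUE_PER_PCT_MORTALITY_REDUCTION
    return benefit / program_cost

# Hypothetical example: a $2 billion program assumed to yield a
# permanent 0.1 percent reduction in mortality.
print(benefit_cost_ratio(mortality_reduction_pct=0.1, program_cost=2e9))  # → 25.0
```

Any such ratio, of course, is only as credible as its weakest input: attributing a specific mortality reduction to a specific research program is precisely the causal link that is hardest to establish.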
Research agencies’ internal needs also create pressure for quantifying research progress. Agencies must compete successfully for research funds against other uses of federal funds that are justified in accounting terms, within a tightening discretionary budget for nondefense science and in the face of competition from nonresearch priorities. And they need to do so at a time when demand for research funds is increasing as a result of technological advances, methodological developments, and growing concern with complex systems and interdisciplinary problems that require more expensive, capital-intensive, and team-based research. Some new research areas require major investments in expensive technology.9 Public officials want a rational basis for making difficult choices, and quantitative measures are attractive because they can be defended as “objective.” Some research agencies have also identified internal reasons to seek quantitative measures of research performance. For example, such measures might be useful for justifying budgets against the claims of other units in the same government department and for resisting pressure to shift expenditures in ways that would benefit specific political constituencies at the expense of the scientific and practical benefits of research to the public (see, for example, National Research Council, 2005c).
Challenges to Quantification
Reliable and valid quantification of benefits from scientific research would obviously be desirable for assessing the value of past investments in research. Such output measures would provide research managers and higher level government officials with valuable yardsticks for evaluating past investments and a counterweight to inappropriate claims on research budgets from interested groups. However, developing reliable and valid performance measures that work across disparate fields has been very difficult. Questions continue to be raised by policy makers, research administrators, practicing scientists, and specialists in program evaluation about the reliability and validity of the basic data series; about errors in measurement; about the ability of actors in the scientific enterprise to manipulate or “game” several mainstream quantitative techniques, for example, by pooling citations; and about the applicability of techniques used to study the workings of the scientific enterprise to evaluation and priority setting (e.g., van Raan, 2005; Weingart, 2005; Monastersky, 2005).10 Particular quantitative methods have also been criticized. For example, benefit-cost analysis has frequently been criticized on the grounds that many of the costs and benefits, especially the latter, are not traded in markets and therefore require inferences and imputations before seemingly precise quantitative calculations of value can be made (e.g., Gramlich, 1981; Stiglitz, 1988). Similarly, knowledgeable observers have raised the concern that the use of performance scorecards as represented by PART-like mechanisms is becoming detrimental to the conduct of science (e.g., Perrin, 1998; Weingart, 2005).
Critics of the use of performance measurement in social and health policy further argue that these policy goals are not like market-oriented or industrial production, in which output measures, such as numbers of items produced or the value of sales, are appropriate and readily linked to production activities (see, e.g., Hatry, 1989). They believe it is much more difficult to measure how much schoolteaching contributes to student learning or how much scientific research programs contribute to new socially beneficial discoveries (see Cozzens, 1997).
Reservations have also been advanced regarding the construct validity of performance measures as applied to scientific endeavors. Many scientists and science managers generally reject the idea that scientific progress can be measured in terms of discrete, homogeneous outputs, analogous to the number of miles of road paved or the speed with which social security checks are processed. They contend that scientific advance is inherently an uncertain process that often takes, or even requires, an elongated, circuitous path. Important advances often appear unexpectedly and from unlikely sources; long time lags may occur between a scientific development and its application; findings are used in ways not conceived of by either researcher or sponsor; and findings deemed interesting but not significant take on new import when combined with newer findings or applied to newly emerging situations.11
Critics of quantification also challenge the value of retrospective assessments for research priority setting. Although past performance is often seen as the best predictor of future performance for individual researchers, the recent performance of a research field may or may not be a good predictor of whether additional investment in that field is likely to lead to great advances or to less productive elaboration of past work. Of special concern is that the predictive value of the past for the present may well decline at scientific turning points, when discontinuous leaps or falls occur in the scientific richness of a new or established field. As one scholar noted, “Although every scientist is aware of impending revolutions, no clear universal sign tells even the most astute observer the area of science in which the next revolution will occur or what form it will take. The most brilliant scientists are not able to predict exactly the kind of revolution they themselves will be making” (Cohen, 1985:21).
Critics also argue that valid measures are hard to devise and defend for scientific research within the output, outcome, and impact frameworks of GPRA and PART because research in a single area may yield several kinds of outputs and because each research product may produce several different kinds of value (outcomes). Also, the impacts may occur so far into the future and may require so many complementary activities that extend beyond the influence of either the scientist or program manager that considerable patience and carefully crafted analysis are necessary to establish or refute causal links from funding in specific fields to specific societal impacts (David, 1994; Radin, 2000).
Performance measurement applied at the agency, division, or program level is also criticized as setting up competition of the parts against the whole, with each budget unit seeking to claim credit for or internalize all the benefits associated with its activities, rather than participating in activities that may produce larger but more widely dispersed benefits. As phrased by David (1994:297-298), “we need not move in the direction of taking apart the very complicated system of science and technology research, which works in ways that not all of us fully understand, and making each of the bits of it compete with one another in the claims they make for the performance of the system as a whole. To point toward the larger outcome goals, which could be imputed at the systemic level, and to try to get people to lay claim to bits and pieces of that one, or their contribution to that other one, is probably the wrong direction in which to push the formation of science policy thinking.”
In addition to these points, many of the outcomes of research, such as satisfying curiosity about the universe or understanding human society, are intangible and hard to put on a scale that permits comparison with other kinds of returns to research or to nonresearch expenditures. To further complicate the measurement problem, the nature of the products and the kinds of value they may produce are often unknown in advance. For all the past and ongoing efforts invested in developing improved forecasting or foresight techniques, predictions about societal impacts of scientific and technological advance are viewed with good reason as highly speculative (for the emerging case of nanotechnology, see Roco and Bainbridge, 2003).
It is also argued that assessment criteria, even if valid for gross discriminations about the routine progress of science, are much less useful for discriminations in the tails of a distribution. Scientific innovation is heavily concentrated in the far upper tail of accomplishment in science: criteria that effectively discriminate reasonably good from reasonably bad normal science are therefore likely to have little predictive power, or even to be counterproductive, when applied to events, trends, or productivity in the upper tail of the relevant distribution, where breakthroughs occur.
Yet another reservation about current efforts to quantify the performance of research investments is that few agencies systematically treat the development of human capital as an output complementary to conventionally measured research outputs. In one view, however, “the most important contribution that is being made through basic research funding to national economic growth comes not through the transfer of research findings directly but through the transfer of knowledge and skills of trained personnel who move from the university laboratory into employment in the private sector” (David, 1994:297).
Quantitative assessment of science carries a substantial risk that researchers will behave like teachers “teaching to the test,” to borrow an analogy from responses to quantification in educational measurement. For example, if progress is measured by number of publications per dollar, measurement creates incentives to produce the smallest publishable units and disincentives for work with long gestation periods that may lead to major breakthroughs (Butler, 2004).
Perhaps the most seriously questionable use of quantitative measures of research output is for making direct comparisons of research productivity among research fields. Such comparisons, however, are central to the questions posed by BSR. Different fields produce different kinds of output, even at the level of readily measured products, such as journal counts. Lags between submission of manuscripts and publication vary across both journals and disciplines. Conventions concerning references to existing literatures also vary across fields, affecting the frequency with which specific articles may be cited by relevant scientific communities.
The problem of accurately accounting for output, productivity, and impact is compounded when differences in publication outlets are considered. For example, researchers in some fields mainly publish in journals indexed in major databases, while researchers in other fields mainly publish books, monographs, or technical reports that are not so indexed. This is particularly true in the social sciences, which Hicks (2004) has described as having four literatures—international journals, books, national journals, and the nonscholarly press. To the extent that different disciplines in a program manager’s portfolio have different publication patterns among these four literatures, quantitative measures based on bibliographic, journal-centered databases may be biased indicators for comparing scientific output.
Identifying these concerns is not equivalent to ruling out the utility of quantitative methods, including bibliometric ones. For several of the examples cited, systematic collection of data, say, on average delays in the submission-acceptance-publication process, would provide a ready means of adjusting data to move toward “unbiased” estimators. The more pressing concerns in those examples relate to (1) the need to be aware of the limitations of raw or unadjusted measures; (2) the considerable technical sophistication at times required to adjust or refine the measures; (3) the added time and costs involved in making the necessary adjustments so that the data are available and comprehensible in time frames consistent with agency priority and budget setting processes; and (4) the likelihood that the proper interpretation even of carefully adjusted measures may remain a matter for legitimate dispute.
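To make point (2) concrete, the following is a minimal sketch, with wholly invented data and a simplified, hypothetical normalization scheme, of the kind of field-level adjustment such bibliometric refinements involve: each article's citation count is divided by the mean for articles in the same field and publication year, so that cross-field comparisons are made against field-typical baselines rather than raw counts.

```python
# Hypothetical sketch of field-normalized citation scores.
# All data below are invented for illustration; a real analysis would
# draw on a bibliographic database and far more careful cohort
# definitions (document type, citation window, skewness corrections).

from statistics import mean

# (field, year, citations) for a handful of hypothetical articles
articles = [
    ("demography", 2001, 12),
    ("demography", 2001, 4),
    ("economics", 2001, 30),
    ("economics", 2001, 10),
]

def field_baselines(records):
    """Mean citations per (field, year) cohort."""
    cohorts = {}
    for field, year, cites in records:
        cohorts.setdefault((field, year), []).append(cites)
    return {key: mean(vals) for key, vals in cohorts.items()}

def normalized_scores(records):
    """Citations divided by the cohort mean, so 1.0 = field-typical impact."""
    baselines = field_baselines(records)
    return [cites / baselines[(field, year)]
            for field, year, cites in records]

print(normalized_scores(articles))  # → [1.5, 0.5, 1.5, 0.5]
```

A raw-count comparison would rank every hypothetical economics article above every demography article; the normalized scores remove that field effect. Even this toy example illustrates point (4): whether the cohort mean is the right baseline, and whether a score of 1.5 "means the same thing" in both fields, remains open to legitimate dispute.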
Beyond issues associated with aspects of performance measurement, comparison across fields of inquiry is complicated by the different kinds of questions or problems and the different patterns or levels of effort that are required for analyses in different research areas. Some areas present well-defined, delimited problems that call for a concentrated effort to obtain a solution, and then they go away. An example is research on the impact of Medicaid spend-down rules on the asset management and housing decisions of aging families (e.g., Adams et al., 1992); the results of that research could be incorporated into policy, and there was no need for further research. Other areas present core questions that require continuing attention. These include the reliability of self-reported health status in surveys for predicting health needs and outcomes and the development of data infrastructures (e.g., the Health and Retirement Survey) on which scientific investigations can build. Finally, there are areas in which the solution of a well-defined problem leads to new questions or methods for research.
The criteria used to judge scientific fields must be sufficiently nuanced to recognize these different paths of progress. For well-defined problems, when has scientific analysis definitively succeeded or failed, justifying termination? When has the end of a successful research effort opened new opportunities that deserve increased support? For continuing research issues, which scientific approaches are fresh and promising, and which are stale?
The situation is the same for other quantitative indicators. Research in some fields leads to patentable inventions, while in others it may lead to improved practices or new policies.12 Research in some fields leads to new drugs or medical procedures, whereas research in other fields leads to less readily quantifiable medical benefits, such as improved diagnostic categories or ways of interpreting diagnostic tests. Behavioral science research may lead to valuable advice to individuals about ways to change their behavior. However, when no organization has a strong incentive to publicize this advice, behavioral change may not be a fair test of the value of the science.
The challenge of commensurability—“the comparison of different entities according to a common metric” (Espeland and Stevens, 1998:313)—of outputs and outcomes across fields would appear to be particularly formidable in areas of social and behavioral science because the outcomes resulting from such research typically take the form of new knowledge that might be applied in practices and policies, rather than tangible objects. Whether or not new knowledge is used, and how it is used, depend on a variety of factors in addition to knowledge production itself (e.g., Weiss, 1979; Landry et al., 2003). Knowledge may lead in directly observable and traceable ways to changes in practice or policy, but the effects are more often indirect. For instance, knowledge may expand or alter the set and value of options considered by decision makers—an “enlightenment value” of knowledge (Weiss, 1979) that is important even when the specific action taken is unchanged. Thus, there is no straightforward link from knowledge production to its application, and an application or its absence cannot be attributed in any simple way to the actions of scientists or research managers. For many research allocation decisions, the question before program managers or advisory groups is not readily amenable to a cost-effectiveness framework that would allow for quantitative analysis, even with specified uncertainty bounds, of which line of inquiry holds the highest promise of yielding satisfactory answers. If different fields address different questions and produce answers that lead to different kinds of outputs, the first and continuing challenge is to agree on a standard unit of measurement.
These methodological critiques of quantification of scientific results complement the concern among many scientists that quantification à la the PART procedures is a threat to the traditional primacy of expert peer review. This concern is rooted in the above methodological concerns, in the idea that the judgments that emerge from expert review panels provide a more thoughtful and nuanced assessment of scientific progress than can come from any available quantitative methods, and in a concern that quantification entails a shift of power and influence over priority setting from working scientists to government officials following bureaucratic procedures. This last concern, to adapt Oscar Wilde’s comment, is about the possible ascendancy of nonscientists who know the price of everything and the value of nothing.
Alternatives to Quantification
Because of the limitations of quantitative indicators of scientific progress, some agencies have sought to justify nonquantitative methods as responsive to accountability needs. For instance, the NSF has received permission from OMB to employ qualitative methods for assessing performance against agency strategic goals (National Science Foundation Advisory Committee for GPRA Performance Assessment, 2004). The NSF performance reporting system employs expert judgment via panel and mail reviews to vet proposals, augmented by a system for periodic review of program-level activities by committees of visitors. The NSF approach to assessment has relied on internal documents, committee of visitors’ reports, and a database of accomplishments, among other sources of information.
Other research funding organizations, including NIA’s BSR Program, have also seen considerable merit in preparing similar collections of information as inputs to expert judgment. For example, BSR has prepared numerous narrative descriptions of “research highlights,” “science advances,” and “stories of discovery” to document the results from the research it funds (e.g., Behavioral and Social Research Program, 2004). These brief histories highlight the contributions of agency-supported research to improvements in health or well-being and demonstrate by example the value of the entire program. A historical approach also allows readers to appreciate the different kinds of value that result from different lines of research, even though it does not attempt quantitative comparisons. For example, one of these narratives describes the role of program-funded research in showing that increased longevity during the 20th century has not resulted in longer periods of disability. Another shows how funded research has provided better data for understanding the relationships between socioeconomic status and health. Yet another demonstrates that job control at work is a major risk factor for cardiovascular disease in men. BSR combines these narratives with a variety of outcome indicators, such as lists of peer-reviewed publications, honors received by funded researchers, and recognition of funded research in the specialist and popular press, to help inform expert judgments by its advisory board.
Case histories have not always proved useful, however. Earlier studies, such as the 1968 Technology in Retrospect and Critical Events in Science (TRACES) study (Illinois Institute of Technology, 1968), sponsored by the NSF, and Project Hindsight, sponsored by the Department of Defense (Sherwin and Isenson, 1966), which sought to relate advances in fundamental science to important technological advances, proved to be expensive and to have limited persuasive impact. Nettlesome methodological disagreements arose about the validity of the findings (Kreilkamp, 1971), and few such large-scale endeavors have been undertaken in recent years.
Differences of opinion also exist among program managers regarding the external validity and political impact of historical accounts or case studies. They can be subject to selection bias, especially when selected by program managers who have reasons to show a program’s best face. Thus, some argue that without large-scale comparative studies, historical accounts lack persuasiveness about the actual contributions of a program and are unlikely to convince higher levels of management. To others, however, one compelling case can be akin to the picture that is worth a thousand words. In addition, the case history technique in general fails to adequately satisfy OMB expectations that agencies install data-based management systems to monitor performance.
The current situation is thus characterized by both methodological and policy turmoil and disagreement. Some agencies seek to develop, validate, and apply quantitative measures of research output and its value. Others see available quantitative metrics as hopelessly inadequate for their assessment purposes and believe that expert judgment is the only valid and appropriate way to evaluate the past performance or future potential of research (see National Research Council, 1999). Nevertheless, the trend toward increased quantification is clear. A combination of forces—OMB mandates, improved sophistication in quantitative methods, more critical examination of the limits to which specific methods can legitimately be pushed, more modest claims on the part of advocates and practitioners of quantification, and what may best be termed resignation to the use of quantitative methods—has softened the edges of earlier either-or debates about the use of expert judgment/peer review procedures versus quantitative methods for priority setting and assessment.13 Consensus may be emerging about the need for informed expert opinion based on the “proper” use of quantitative methods by relevant experts.
Limits of Expert Judgment for Comparing Fields
When research managers must set priorities across research fields, many of the problems of comparison faced in quantitative assessment also apply to expert judgment. Panels of experts that may be able to give reliable advice in specific, long-established disciplines may have much greater difficulty advising across fields. In a well-defined scientific discipline or field, it is reasonable to presume that the appropriate standards of judgment are well understood by most of the experts who might serve on a peer review or advisory panel, even if the standards are not identical in all parts of the field. The members of such groups understand each other’s work, and it would be possible for other experts in the same field to assess a panel’s findings by applying the same standards. It is less safe to presume that such shared understanding and the attendant possibility for checking judgments exist when review panels are organized across more varied areas of expertise. The problem is likely to worsen as the breadth of the set of programs or units being compared increases.
Suppose an agency empanels a broadly multidisciplinary expert group to evaluate aspects of a multidisciplinary program. Group members must either judge outside their expertise or rely on their colleagues’ judgments, in which case they may fail to understand the standards that their colleagues are applying. Not being familiar with the content of the work being proposed or its potential for opening up new lines of research, experts sometimes use methodological rigor as a default evaluation criterion. The result may be, as has been increasingly asserted with regard to both NIH and NSF, that review panels are inherently too conservative, favoring well-crafted but incremental mainstream science over radically new or transformative research ideas.14 Strong criticism by one or two review panel members, particularly on specialized matters in those members’ areas of expertise, may be enough to defeat an idea. Similarly, panels have been criticized as favoring science in established disciplines over interdisciplinary proposals (National Research Council, 2005b:Chapter 6). It has been claimed that experts on a panel defer to the judgment of each member in his or her own field, with the result that fields perpetuate themselves even when their potential for generating important advances is weak and when a broader analysis of a research agency’s portfolio would justify reallocation of funds elsewhere.
The systemic logic behind these behaviors, according to Brenneis (1994:31), is that expert review panels employ a “fairness through apparent clarity” model of decision making. In this model, scholarly progress is seen as incremental, so that proposals are favored when they “are clearly linked to a sense of how ‘science works.’ Proposals that promise to break new conceptual ground or to challenge and refigure dominant paradigms are viewed not so much as ‘bad’ proposals but as difficult to evaluate and compare with other contenders” (see also Guetzkow et al., 2004).
Collective judgments by peer review panels may thus not have a clear meaning, especially when a panel is covering a multidisciplinary range of fields. The problems cannot be solved by combining judgments from different disciplinary groups because standards in different fields may not be comparable. Assume that there are established fields of science, characterized by the conventional attributes of disciplines; that is, journals, professional associations, academic standing as departments, and institutionalized legitimacy within a federal science agency, such as an established study panel or a directorate or division devoted to the support of each field.
Assume further that over time one field becomes insular; that is, it focuses on problems that engage specialists but do not look important from an outside perspective, either on scientific or practical grounds, and that prove unproductive in retrospect. The experts in such fields nevertheless continue to believe that they and their colleagues are engaged in exciting, productive, and societally relevant work. When asked to judge recent or proposed new work on such criteria as originality, they rate the studies as more original than they would appear to outsiders to the insular field. Researchers in such a field can point to a steady stream of output, say, in articles published in leading journals in the field and citations to these articles, albeit predominately by other researchers in the field. Without some external yardstick, it is not possible to know whether a particular field is such an insular and moribund field, one in which the collective judgment of its experts is untrustworthy. Research managers want to identify such fields sooner rather than later, but the judgment of experts from within the field may be misleading, and the judgments of multidisciplinary peer review groups may also fail to offer good guidance.
The problem of identifying fields that have passed their prime is at the core of the questions posed by NIA-BSR. Its concern is that it may be supporting some fields that are producing minor if technically well done advances in knowledge derived from long established but increasingly stale paradigms, while choosing not to fund other fields, theories, methods, and findings that promise (with uncertain likelihood) to yield significant new advances in fundamental knowledge that will illuminate not only the field from which they come, but also spill over to enrich other fields or even create new ones. Although peer review panels in many fields believe that there is much high-quality work in those fields, it is possible that the experts in some fields have an inflated view of the value of research in those fields. It is out of such concerns that research managers in BSR seek a trustworthy method or strategy of research assessment that would make it possible to evaluate the hypothesis that there are serious imbalances in the value of research across the fields funded by the program.
As noted throughout this report, these concerns have long antecedents. But despite various efforts at broad-scale retrospective and prospective judgments (e.g., Deutsch et al., 1971; Irvine and Martin, 1984; Inkeles, 1986; Abrams, 1991; Henkel, 1999), these questions are generally finessed. One reason is their extreme sensitivity in scientific circles, because any explicit effort to compare the value of research across fields entails making invidious distinctions, with possible damage done to the field(s) given lower evaluations in future funding cycles.15 In contemporary U.S. science policy discussions, tactful discourse centers on the concept of balance and may suggest that one or more areas of science are underfunded, but not that the questions or methods of other fields are in any manner flagging in vitality.
The current budgetary pressures and the additional impetus for accountability and quantitative measurement of research progress make it increasingly difficult to finesse the questions of comparative assessment. Despite the difficulties of comparing the research efforts of different fields on the same quantitative scale, demands for accountability create serious pressure to provide clearly stated rationales for recommendations and decisions about priorities among research fields.
RESEARCH ASSESSMENT AND THE ISSUE OF POWER
As already noted, a critically important but sometimes unacknowledged issue behind debates about how to assess science is that the choice of method may both reflect and affect who has power and influence. Particularly at issue here may be the relative power and influence among researchers, research managers, and nonscientists in government. Such issues have been present in science policy since the beginnings of sponsored research in the United States, when the sponsors were private foundations, such as the Carnegie and Rockefeller Foundation groups (see Box 3-1). The issue continues into the present. An increased emphasis on the use of quantitative indicators in science policy decisions, especially to the extent that indicators can be developed by technicians who are not researchers in the relevant fields, can easily weaken the influence of scientists vis-à-vis agency science managers, or of scientists in general vis-à-vis nonscientist decision makers in government.
Bibliometrics provides a good example of the issue of power, latent in many current discussions of the use of “objective” measures of research progress. Resistance to the use of bibliometric measures by academic researchers surfaced almost immediately upon their introduction. “The reaction was predictable,” according to Weingart (2005:118), “because first of all the very attempt to measure research performance by ‘outsiders,’ i.e., non-experts in the field under study, conflicted with the firmly established wisdom that only the experts themselves were in the position to judge the quality and relevance of research and the appropriate mechanism to achieve that, namely peer review, was functioning properly.” Adopting any method of assessing research potentially affects who has the ability to influence the setting of broad research priorities, the contours of specific programs with respect to subfields and methodologies, and decisions concerning which proposals to fund (and at what levels) and which to reject. To the extent that the critical information needed for making research portfolio decisions can be gained without reliance on the researchers themselves, the power to make those decisions can be shifted from researchers to research managers. In short, decisions about quantification of scientific progress have a power dimension, whether or not this is within the awareness of those involved, and a shift in power relations can have significant consequences for the directions of science.

BOX 3-1
Power Relations in Sponsored Research, 1900-1945

In the United States, the system of sponsored research grants evolved in private foundations, especially those of the Carnegie and Rockefeller groups. At first, the idea of programmatic, actively managed grants to individuals met resistance because it seemed to transgress entrenched individualistic values and the belief that scientific discoveries are acts of individual genius. In the 1920s, foundations evaded these problems by giving block grants to universities or research institutes with no strings attached: they were designed to develop entire communities, not advance particular individuals’ lines of work. But these programs proved unaffordable after 1929, and in the late 1930s, the Rockefeller Foundation pioneered a system of programmatic grants to scientists in a few fields that were deemed by program managers to be of strategic value.

A new social role of grants manager evolved. Foundation officers earmarked strategic fields for investment and, in the absence of a system of peer review, selected among applications for support. Grant managers became partners in science, in direct and unmediated relations with their grantees.

This active relationship profoundly upset existing power relations with foundation boards of trustees. Trustees, who were mostly practical men of business and who had previously made decisions on the basis of their ability to judge organizations, were now in a position of rubber-stamping decisions made on technical grounds by mid-level managers. They were effectively deskilled as experts in organizing productive labor. Senior scientists on boards took the same line, opposing “planning” in science, even though relations between grantees and program managers were remarkably untroubled. In practice, activist managers were helpful, not intrusive, for example, in fostering communication among scientists in different disciplines. Grantees quickly realized this.

Trustees distrusted program managers because they could not see what would prevent them from abusing their new power to set priorities and decide on individual proposals. Contention between managers and foundation boards eased only when managers, such as Warren Weaver at Rockefeller, showed their ability to devise and manage programs of individual grants in selected fields like genetics or molecular biology, a field that Weaver helped to define.

The system that evolved by the late 1930s had trustees appointing program officers, who were empowered to select strategic areas for investment and to actively manage systems of individual grants. Scientists accepted the active participation of program managers in directing research along selected lines, but they retained complete control of how the actual work would be done. Once grants were made, program officers never interfered and declined invitations to advise on the particulars. This division of labor worked without advisory committees, peer review panels, or formal procedures of reporting and accountability.

Communication was the main reason the system worked. Program officers worked constantly to be well informed of trends in the fields they sponsored and in the activities and reputations of leading figures in the fields. They did this by identifying a few trusted individuals who they learned would provide objective and disinterested advice and by continual traveling and conversing informally with grantees and potential grantees, including younger scientists. Program managers could become effective partners in science because they understood the personalities and intellectual politics, as well as the science, of their areas of interest almost as well as the insiders did themselves.

SOURCE: Kohler (1987, 1991).
Such a shift may be viewed as good for science—for example, if research managers have a better overview than scientists of opportunities across many fields, or a better appreciation of which research directions are most likely to meet societal needs or agency priorities. It may also be viewed as bad for science (for example, because of the possibility of political or ideological interference with scientific research agendas, as in the controversy over the claim that management at the National Endowment for the Humanities has overturned peer review endorsement of proposals because they address issues considered sensitive by political appointees; see Jaschik, 2006).
The issue of power is implicit in the implementation of GPRA and PART. To the extent that these tools emphasize routinized measurement of easily quantified attributes of research, they shift power away from the judgments of the scientific community and toward others, such as those who devise the indicators and those who can find ways to game the assessment system. Efforts to gain approval for assessment mechanisms that rely more on the judgments of scientists, apart from claims that they provide better quality assessments, are in part efforts to prevent a loss of influence by scientists over science priority setting.
It is important to recognize in this context that research managers at NIH are scientists as well as managers. They typically hold the job classification of Health Scientist Administrator, which is taken to mean that their first calling is as a scientist and that the administrator role is secondary. NIH program managers typically hold advanced degrees in science, not management. Once in their positions, however, research managers are expected to act as stewards of their scientific fields in the context of the mission of NIH and their institute or center. The challenge in such a position in any federal science agency is to stay abreast of one's scientific field: to know what the emerging opportunities and challenges are, who is doing outstanding work, who is coming up in the field, who is on the cutting edge, what demands the field has for technologies or models, and so on.
In a stylized manner, much of the research support provided by NIH, especially that occurring in the form of investigator-initiated (R01) projects, represents grassroots initiatives of independent individual researchers. In this model, the frontiers of science, both in terms of the questions (or puzzles) posed and the selection of projects to answer the questions, are determined mainly by the collective workings of the scientific community.
NIH program managers ideally function as part of this community, not only as scientists but also as advocates, stewards, and occasionally as entrepreneurs for research fields. A field may need a research tool that NIH can support and make available; better access to data; or a new way of organizing research. A prominent example from BSR is its support of the ongoing Health and Retirement Survey at the level of about $10 million per year. This survey provides data useful to a large number of individual research projects. As noted in Chapter 2, research managers in NIA/BSR have been proactive as entrepreneurs of research by using the management tools at their disposal to provide such collective goods for science. Initiatives
by research administrators can be instrumental in starting or accelerating the development of a field or line of research, and research managers may need to have an entrepreneurial spirit to go along with their understanding of the science and the needs of a field. They may be called on to advocate for specific projects, to urge the adoption of policies, or to work with colleagues inside (or outside) NIH to build support for programs, grow the funding, or help further the appreciation of the science. They need to know how their field relates to others so they can effectively and enthusiastically cooperate in new, high-priority interdisciplinary areas. They also need to be alert to the scientific human resource development issues critical to the programs they administer.
It is possible that a research administrator can see opportunities or challenges that are as yet not fully visible to most working scientists in a field, and thus come to a judgment about a field’s needs that differs from the consensus of those working in the field. Research managers’ judgments may differ from those of active researchers because of the greater value the former group places on the relevance of research to an agency’s mission. And administrators responsible for several fields often make judgments about priorities that differ from the consensus judgments in some of the fields. Differences in judgment may arise from differences between managers and working scientists in the weights assigned to different program objectives or because they use different methods of assessment.
Research managers acting as stewards and entrepreneurs may try to convince higher organizational levels, elected officials, and the scientific communities with whom they interact of the importance and relevance of ongoing or emerging fields of science. Research managers may be more or less entrepreneurial and more or less successful in this role, depending on personal disposition or agency structure, practice, and culture.
In the best case, working scientists and science managers bring valuable and complementary perspectives to the task of assessing science. In designing methods for assessment and priority setting, then, it makes sense to avoid framing either-or choices between mechanical, quantitative, bureaucratized decision making led by science managers and qualitatively informed, nuanced choices dominated by scientists. The proper questions to ask in guiding research assessment and priority setting do not concern whether to use quantitative measures, but rather what the appropriate roles of quantitative measures and of deliberative processes of peer review should be, and how the perspectives of scientists and science managers should be combined to provide wise guidance for science policy decisions.
The above observations on the role of program managers are mediated by the formal structures and informal practices and cultures of federal science agencies. The autonomy of program managers to set priorities across fields or to modify the decisions of external review groups can vary considerably across agencies. On this point, the experiences of members of this committee when serving on advisory councils, advisory committees, and review panels are remarkably consistent with one another, and also consistent with the views of the program managers we have interviewed, although we have been unable to identify any systematic study of the topic.
In part, these apparent differences reflect the different histories and purposes of these agencies. DARPA, for example, was formed in the 1950s deliberately as a small and flexible organization oriented toward revolutionary technology breakthroughs (Bonvillian and Sharp, 2001). Its use of advisory panels and review panels is flexible and ad hoc. By contrast, formalized peer review systems are core features of NSF and NIH, with a key distinction being that NSF program managers oversee both program development and the panel review process, whereas NIH separates responsibility for program development and operations from the review process. The peer review process at NIH operates primarily out of the Center for Scientific Review, which organizes review groups that often cut across programs and even institutes, and which generates ratings intended to evaluate proposals on a unitary scale that is the same across programs and institutes. This procedure makes it particularly difficult for a science manager at NIH to argue for overturning the results of the peer review process. As noted in Chapter 2, however, NIH science managers have greater discretion with funding instruments other than unsolicited proposals, which allow them to identify topics of interest and sometimes to set aside funds and create separate review processes for the solicited research.16
Proposals for quantifying the benefits of research, as well as proposals for increased discretion for science managers, should be understood in the context of these conditions of influence and power. Quantification is sometimes presumed to reduce the influence of extramural scientists. Even if it has this effect, however, it does not necessarily increase the influence of the science managers who are closest to the scientific research programs. That effect will depend on how quantification is implemented and where in an agency the responsibility is placed for quantifying and for interpreting the results. We return to these issues in Chapters 5 and 6, where we discuss in more detail the use of quantitative analytic methods and deliberative processes for informing research assessment and priority setting.