3
The Stakes in Research Assessment

This chapter endeavors to make explicit what is at stake in the comparative assessment of research fields, retrospectively and prospectively. It takes as given that core decisions concerning the scope of any research agency’s mission, overall budget, and dominant mechanisms for selecting research proposals are largely determined at high political and administrative levels, such as those of the Office of the President, Congress, and politically appointed agency heads. Science managers in subagency divisions or programs such as the Behavioral and Social Research (BSR) Program at the National Institute on Aging (NIA) work to set priorities within these bounds by identifying, supporting, maintaining, and nurturing scientifically vital, mission-relevant areas of research, including selectively promoting areas of research in the manager’s mission area on the basis of judgments of their prospects for advancing that mission. Science managers may also recommend changes in larger organizational priorities and practices deemed necessary to permit specific programs to more effectively achieve their objectives in the context of the larger agency mission. Managers do not do these things on their own, however. Their actions are informed and influenced by outside constituencies, prominently including working scientists in the relevant fields and the potential beneficiaries of the science.

This chapter addresses two major, interrelated themes in science assessment: the actors (Who should be involved in assessing the science, and what should be the relative power and influence among these actors?) and the methods (How should assessment be done and decision making organized?). These questions and themes are being raised in the National Institutes of Health (NIH) and in many other federal government science agencies.







A Strategy for Assessing Science: Behavioral and Social Research on Aging

We begin by briefly describing the historical context of science priority setting in the U.S. federal government, which in recent years includes increasing pressure to move from traditional, peer review–based approaches for setting research priorities and assessing returns from research investments to approaches that rely more on quantifying returns from investment in science. We discuss the limitations both of traditional expert judgment and of quantitative approaches, recognizing the particular difficulties of comparing different kinds of fields and of assessing scientific progress in interdisciplinary or transdisciplinary fields. Finally, we discuss the ways in which debates about the best methods for priority setting in the context of a movement for government accountability raise deeper questions about the balance of influence and power among researchers, program managers, advisory councils, extramural scientists, and other interested parties.

These debates occur in the pursuit of two legitimate and important public policy goals: making public expenditures accountable to the taxpayers and ensuring rational priority setting among research expenditures, based on the best available information about the likely returns from future public investments. Accountable and rational priority setting implies a need for comparative assessment, especially when continuing and new claims on resources outpace the rate of increase in science budgets or when a perception emerges that some research fields are not advancing, despite continued support.

Comparative assessment is inherently difficult, however, for several reasons. First, different fields of science may produce different kinds of benefits and may benefit different people.
In a diverse society, it is unlikely that everyone will agree on the relative importance of different kinds of benefits and therefore on the overall benefit of any particular line of research (Bozeman and Sarewitz, 2005). Second, some kinds of benefit are easier to see and measure than others. Bibliometric data, for example, make scientific publications and published citations to them highly visible, but they have serious limitations, as detailed in Chapter 5. These data primarily reflect communication patterns among scientists. They are not necessarily predictive of the absolute or relative value of scientific outputs as sources of information for or influence on policy makers. Third, even if agreement can be reached on the relative benefits from different fields of science to date, the benefits of future investments would remain speculative and uncertain. The existence of compendia of erroneous predictions by experts about the progress of fields of science and technology is good reason to start from the premise that no science of predicting the future of science exists (Cerf and Navasky, 1984, Chapter 7; Thomas, 1999). Fourth, different individuals have the expertise needed to evaluate research in different fields. Finding all the needed expertise for comparing different fields in the same individuals is the exception rather than the rule.

Finally, there is the question of whose judgment should be final in resolving disagreements. This is a political question about relationships between and among the principals and the agents engaged in setting research priorities and performing the funded research. The primary principals are congressional and executive decision makers. Agency officials serve in a dual, intermediate role: they are the agents of the elected and appointed public officials, but to those who turn to them for funding, they are principals. The research performers, in the main, are the agents, but their participation in advisory councils and review committees also provides them with some of the influence and decision-making power more conventionally ascribed to principals. The very complexity of these arrangements complicates answers to questions about the distribution of power and influence among those who might decide and the distribution of benefits to those who might benefit. We return to this issue later in the chapter.

BRIEF HISTORY OF FEDERAL SCIENCE PRIORITY SETTING

Since the end of World War II, the salience of the issues of priority setting and retrospective assessment in U.S. science policy has waxed and waned. Priority setting received a surge of analytical and policy attention in the early 1960s, a period of steady increases of federal government support for research and development in the defense and nondefense sectors (Smith, 1990). The conventional benchmark of science assessment in this era is Alvin M. Weinberg’s two articles entitled “Criteria for Scientific Choice” (Weinberg, 1963, 1964) and the surrounding exegesis (Toulmin, 1964; Smith, 1982). Weinberg defined a short list of generic criteria (see Table 3-1).

TABLE 3-1 Generic Criteria

Internal Criteria (How well is the science done?)
    Is the field ready for exploitation?
    Are the scientists in the field really competent?

External Criteria
    Technological Merit
    Scientific Merit
    Social Merit

SOURCE: Weinberg (1963).

Throughout the 1960s and 1970s, periodic efforts were undertaken to apply these and related criteria to priority setting in specific scientific fields (e.g., National Academy of Sciences, 1965; National Research Council, 1972). Priority setting, framed in terms of the long-term prospects for science, also drew episodic attention from the U.S. Congress, the National Science Foundation (NSF), and the National Academies (e.g., Committee on Science and Technology, 1982; Irvine and Martin, 1984). The topic has continued to surface intermittently in the larger discourse on the federal science budget, such as in the report, Allocating Federal Funds for Science and Technology (National Research Council, 1995a) and in prescriptions by senior national science policy officials (e.g., Bromley, 2003). Priority setting is also implicit in the strategic planning undertakings of federal agencies, in which selected fields are chosen for emphasis, with explicit or implicit decisions made not to fund other areas or to alter relative distributions of support among areas.

Declining Attention to Criteria for Priority Setting

Although attention to priority setting has remained a staple component of U.S. government science policy, systematic attention to the development of criteria for scientific choice waned after the 1970s, at least in the United States. In part, this was due to the difficulties of moving from general agreement about broad priority-setting principles to agreement about the specific methods and measures to be used to operationalize these principles. In particular, somewhat in contrast to developments in European countries, where bibliometrics is often used in assessing and formulating science policy, there has been little consensus in the United States about the reliability or validity of techniques to assess the relative importance of different fields (Hicks et al., 2004).
The decline in interest in applying systematic criteria for setting research priorities also partly reflected the unwillingness or inability of many scientific communities to agree about the priorities in their fields, as in the case of divisions in the high-energy physics community surrounding decisions to construct the Superconducting Super Collider. In addition, the few formal entities that existed to conduct prospective and retrospective analysis and to support research in concept development and tool building had short life spans.1

General acceptance developed about relationships between the federal government as sponsor of basic and applied scientific research and the performers of this research. Congress and the administration, primarily through the legislative and budgetary processes, would set overarching national priorities for science and technology (although increasingly earmarking areas of research and performers); responsibility for converting these priorities into specific program areas and proposals devolved to intra-agency procedures and the peer or merit review system (Guston and Keniston, 1994; Guston, 2000). There was also general acceptance of the desirability of maintaining balance among fields in support for science, especially between the life sciences and the natural sciences and engineering. However, few efforts were made for deeper analysis that might provide justification for important policy choices, such as the relative apportionment of funds among disciplines and fields.2

For many years, far greater attention was paid to overall levels of federal government support of science than to questions of allocations among fields, reflecting a general level of satisfaction that for all its possible inefficiencies in terms of the goal of maximizing rates of return to public (or program) investments, the system had led to the U.S.’s preeminent position in world science. As Bruce Smith noted in congressional testimony in 1982, commenting on the National Academy of Sciences (1965) Basic Research report, “A common theme sounded by most of the panelists … was that the system we have evolved to support science, whatever our understanding of its inner mechanism, has given the United States a pre-eminence in the scientific world. Drastic changes in the present system, therefore, should be viewed with suspicion. The quest should be for marginal adjustments in present policies to assure a continued United States leadership in basic science” (Smith, 1982:194).

Similar analysis and recommendations pervade recent benchmarking assessments of the U.S. position in selected fields of science. As observed by the National Academies panel convened to assess the U.S.
position in immunology research (Committee on Science, Engineering, and Public Policy, 1999b:52), the United States is the world leader in most major subcategories of immunology research, with this position being attributed to a system “that is largely an investigator-initiated, peer-reviewed, and merit-based system of awarding grants.”

Thus, analytical and policy attention and research shifted in recent years to questions relating to the measurement of the social and private rates of return from research in general. This research has been conducted mainly within an economic paradigm that draws on earlier analyses pioneered by Nelson (1959) and Arrow (1962), which focused on whether competitive market dynamics could be projected to produce the socially optimal level of private-sector investment in research, especially basic research (e.g., Hall, 1996). The analysis derived from this framework, coupled with several major empirical studies on private and social rates of return to research in agriculture, health, and technological innovations, provided empirical support for the conclusion that governmental support of fundamental research yielded net social benefits. As a by-product, this research produced findings about benefit-cost ratios of investments in different lines of research, such as on different agricultural commodities (Evenson et al., 1979) and on diseases (e.g., Institute of Medicine, 1998; Gross et al., 1999; Murphy and Topel, 1999) that could have served as guides to future budget allocations among different directions of scientific research. Little evidence exists, however, that these findings in fact affected congressional budgetary allocations (Olsen and Levy, 2004). Moreover, the findings tended to be directed at comparing target uses of research, not at comparative lines of research, the question posed by BSR.

Increasing Pressures for Systematic Priority Setting

The question of priorities among fields of science resurfaced in the 1990s and has gained increasing salience since then, as a result of several factors. First, the size and continued growth of federal research and development expenditures, together with increased competition for federal budget dollars, began to call forth new demands for accountability and demonstrated accomplishments. The enactment in 1993 of the Government Performance and Results Act (GPRA) focused those demands: it required that all federal agencies develop multiyear strategic plans and evaluate and report annually on their activities in relation to the objectives stated in these plans. For research agencies, GPRA created pressure to implement systematic methods and bureaucratic routines for assessing the value of research investments (National Research Council, 1999, 2001c). The act shifted attention from statements of an agency’s needs and opportunities toward outputs and outcomes.

Inexorably coupled with demands for accountability were new demands coming from both the administration and Congress for “evidence” and documentation of performance and results. The demands are manifest in GPRA and in its implementation via the Performance Assessment Rating Tool (PART), created by the Office of Management and Budget (OMB). They continue to receive support from congressional and administration leaders.3 PART was introduced to cover selected federal agencies as part of the FY 2004 federal budget process and has since been applied to an increasing number of agencies and programs.
As described on the OMB web site, “PART was developed to assess and improve program performance so that the Federal government can achieve better results. A PART review helps identify a program’s strengths and weaknesses to inform funding and management decisions aimed at making the program more effective. The PART therefore looks at all factors that affect and reflect program performance including program purpose and design; performance measurement, evaluations, and strategic planning; program management; and program results. Because the PART includes a consistent series of analytical questions, it allows programs to show improvements over time, and allows comparisons between similar programs” (available: http://www.whitehouse.gov/omb/part).

A second factor has been the significant advances that have been made in theoretical and especially empirical studies of scientific activities since the Weinberg articles of the 1960s. Significant advances have been made in assembling and making more accessible quantitative data on several aspects of scientific activity that previously required labor-intensive effort and that were difficult to link together. For example, there has been continuing expansion and refinement of the NSF’s Science and Engineering Indicators biennial reports, leading to more readily accessible data on publications, patents, and patterns of collaboration among scientists. Data on scientific activity, including numbers of publications by keyword, numbers of citations, and so forth, are now readily available online through such services as Thomson Scientific’s Web of Science® and Google’s Google Scholar. The National Bureau of Economic Research has compiled an extensive data set of U.S. patents that includes all citations to these patents and a broad match of these patents to financial data sets (Jaffe and Trajtenberg, 2002). At least two major handbooks have been published, distilling a much larger and more diverse literature on quantitative methods in the use of publication and patent statistics in studies of science and technology systems (van Raan, 1988a; Moed et al., 2004). New methodological and empirical ferment is emerging in the use of network theory and career trajectories to explore patterns of collaboration, and thus leader-follower relationships and the diffusion of ideas and techniques, among scientists (Wagner and Leydesdorff, 2005). Advances in data mining and data visualization techniques facilitate processing large quantities of data, making it possible to identify or confirm relationships that may previously have been unrecognized (e.g., Boyack and Börner, 2003).
Advances in quantitative measurement and data analysis have also facilitated the development of new theories of the relationships among scientific activities and their effects in the larger society, and increasingly sophisticated theoretical and empirical models have been tested for examining these causal relationships. Thus, for example, researchers have examined linkages between bibliometric data (on scientific publications and citations) and patent data to advance conclusions about the productivity of federal government investments in some areas of basic research (Narin et al., 1997). Advances in data availability and analysis make the quantitative assessment of developments in a nation’s scientific enterprise increasingly feasible and attractive to public-sector officials. With all these advances, the prospect of using quantitative analysis systematically to channel public funds to their most productive scientific uses appears more attainable than before.

A third factor making priority setting more salient involves the dynamics of science itself, especially the widespread consensus that the greatest opportunities for advances in science now involve the crossing of traditional disciplinary boundaries and the creation of new fields. NIH’s Roadmap, for example, addresses what agency leaders describe as revolutionary and rapid changes in science and the need to overcome barriers created by the complexity of NIH as an institution composed of many units; the compartmentalized structure of the NIH bureaucracy, with its division by organ, life stage, disease, and scientific discipline; and the rapid convergence of science. Related to this judgment are increasingly voiced concerns that the combination of existing organizational arrangements and procedures that science agencies, including NIH, use for setting priorities, selecting research proposals, and evaluating the outcomes of research, together with increased competition for funding, is leading to unduly conservative, risk-averse selection of research awards.4

Finally, the globalization of scientific activities, coupled with the widespread belief that scientific leadership is increasingly linked to international economic competitiveness, introduced another issue into science policy discussions. A report from the National Academies (National Research Council, 1995a) articulated the principle that a country’s position relative to its scientific competitors should be taken into account in deciding the distribution of resources among scientific fields: “The President and the Congress should ensure that the [federal science and technology] budget is sufficient to allow the United States to achieve preeminence in a select number of fields and to perform at a world-class level in the other major fields” (p. 14).5

All these forces have given impetus to recent efforts to formalize science policy decision-making processes. Almost reflexively, there have been increasing calls for quantification and for transforming the more extensive data on science and technology and improved techniques for the analysis of such data into science metrics (for inventories of widely used metrics, see Geisler, 2000; National Science Board, 2004).

DEBATE OVER PRIORITY SETTING AND ASSESSMENT MECHANISMS

As already noted, peer review has long been the dominant approach in federal research agencies for evaluating the past performance and future potential of research areas and for setting priorities. Peer review is essentially a clinical and deliberative process and one that relies heavily on the expertise of working researchers. It is used most commonly to evaluate proposals for research projects or programs coming from single investigators or research groups; less commonly, the same approach is used to advise research managers on broader matters, such as evaluating the past performance or future prospects of entire research programs or selecting priorities for the future development of these programs. Whatever the purpose, the general approach is similar. Research managers convene groups of experts in relevant research fields, typically constituted as peer review panels, visiting committees, or advisory boards, that deliberate on issues or choices presented by research managers. They are sometimes informed by additional input, for example, reviews solicited from specialists in particular narrow areas being considered. In making their recommendations to higher level agency decision makers, research managers draw on the judgments of these deliberative groups and add in their own judgments to the extent their agency prescribes or allows. Devolution of decision-making authority, or in this case, recommendations, to peer review panels is the “special mechanism” by which the social contract for science “balances responsibilities between government and science” and thus fosters accountability (Guston and Keniston, 1994:8).6

Outside reviews of research agencies’ efforts to assess programs and set priorities have generally endorsed the clinical, deliberative methods of expert review as the best way to assess research fields. For example, a National Research Council study committee that reviewed agency experiences under GPRA concluded: “The most effective way to evaluate recent programs is by expert review. The most commonly used form of expert review of quality is peer review. This operates on the premise that the people best qualified to judge the quality of research are experts in the field of research. This premise prevails across the research spectrum, from basic research to applied research” (National Research Council, 1999:39).

Accountability Challenges to Peer Review

However well peer review as a method of research assessment may have served science agencies, the scientific community, and society in the past, this approach has recently come under challenge. A major challenge has come from the movement toward greater accountability and attention to performance management, as embodied in GPRA and PART. GPRA requires federal agency managers to establish strategic goals and to demonstrate that these goals have been met, on the basis of predefined measures of performance.
Development of standardized outcome measures is also seen as a means by which science managers can compare the bang for the buck across different kinds of expenditures. PART, as already noted, requires an increasing proportion of federal programs to apply a consistent, evidence-based approach to performance measurement, extending from the specification of strategic goals (such as lives saved) to outcomes and outputs. Development of performance measures is encouraged because such measures are seen as the ultimate results for the public. But the PART guidelines also note that “the key to assessing program effectiveness is measuring the right things,” by which is meant, “measures that meaningfully reflect the mission of the program, not merely ones for which there are data” (p. 16).

Intended to be broadly applicable across all federal programs, the PART procedures also contain a specific set of criteria to assess the effectiveness of the federal investment in research and development. The three salient criteria applicable to NIH-NIA programs are relevance, quality, and performance. For each criterion, the use or development of quantitative metrics is emphasized. Thus, in considering relevance, the PART document states that “OMB will work with some programs to identify quantitative metrics to estimate and compare potential benefits across programs with similar goals. Such comparisons may be within an agency or among agencies” (p. 56). GPRA and PART are similar in that both provide agencies with pressures or incentives to move toward more quantitative methods for setting priorities or assessing performance.7

Despite the skepticism that researchers have at times expressed about applying quantitative approaches to assessment of their work (for example, concerns about the spawning of “LPUs,” or least publishable units), there are valid reasons for trying to use them. An important one is that many research agencies’ expenditures are justified not only in terms of advances in pure knowledge but also in terms of their potential benefits to society, some of which are eminently quantifiable. Among these, depending on the agency, are improved health or longevity, education, environmental quality, and public safety and security. Already, there are efforts under way to develop measures related to the impacts of research on some societal goals, with current emphasis being on improving the reliability, timeliness, and administrative feasibility of the measures (U.S. Department of Energy, Office of Science, 2004).

The prominence of the “health and well-being of older Americans” among the strategic goals of NIA makes it tempting to quantify at least those outcomes and to seek evidence for causal links between research and those societal benefits. Quantitative measures of many of the benefits, both realized and projected, as well as benefit-cost ratios, already exist.
For example, healthy 70-year-olds live longer and spend less on lifetime health care than their less healthy peers. In one study, individuals with no functional limitations had a life expectancy of 14.3 years and expected cumulative health care expenditures of $136,000 in 1998 dollars, while those with one functional limitation had a shorter life expectancy (11.6 years), but could expect to spend more on health care ($145,000) (Lubitz et al., 2003). If health promotion efforts (e.g., exercise, smoking cessation) can improve the functioning of older Americans, these benefits can be predicted to follow. Similarly, observed benefits of cognitive and affective phenomena for health (e.g., Rosenkranz et al., 2003; Levy, 2003) might also be quantified in economic and life expectancy terms. The estimate that a 1 percent permanent reduction in mortality would be worth about $500 billion (Murphy and Topel, 1999) makes possible benefit-cost analysis of investments in mortality reduction.8

Research agencies’ internal needs also create pressure for quantifying research progress. They need to compete successfully for research funds with other uses of federal funds that are justified in accounting terms in the tightening discretionary budget for nondefense science and in the face of competition from nonresearch priorities. And they need to do so at a time when demand for research funds is increasing as a result of technological advances, methodological developments, and growing concern with complex systems and interdisciplinary problems that require more expensive capital-intensive and team-based research. Some new research areas require major investments in expensive technology.9 Public officials want a rational basis for making difficult choices, and quantitative measures are attractive because they can be defended as “objective.”

Some research agencies have also identified internal reasons to seek quantitative measures of research performance. For example, such measures might be useful for justifying budgets against the claims of other units in the same government department and for resisting pressure to shift expenditures in ways that would benefit specific political constituencies at the expense of the scientific and practical benefits of research to the public (see, for example, National Research Council, 2005c).

Challenges to Quantification

Reliable and valid quantification of benefits from scientific research would obviously be desirable for assessing the value of past investments in research. Such output measures would provide research managers and higher level government officials with valuable yardsticks for evaluating past investments and a counterweight to inappropriate claims on research budgets from interested groups. However, developing reliable and valid performance measures that work across disparate fields has been very difficult.
Questions continue to be raised by policy makers, research administrators, practicing scientists, and specialists in program evaluation about the reliability and validity of the basic data series; about errors in measurement; about the ability of actors in the scientific enterprise to manipulate or “game” several mainstream quantitative techniques, for example, by pooling citations; and about the applicability of techniques used to study the workings of the scientific enterprise to evaluation and priority setting (e.g., van Raan, 2005; Weingart, 2005; Monastersky, 2005).10 Particular quantitative methods have also been criticized. For example, benefit-cost analysis has frequently been criticized on the grounds that many of the costs and benefits, especially the latter, are not traded in markets and therefore require inferences and imputations before seemingly precise quantitative calculations of value can be made (e.g., Gramlich, 1981; Stiglitz, 1988). Similarly, knowledgeable observers have raised the concern that the use of performance scorecards as represented by PART-like mechanisms is becoming detrimental to the conduct of science (e.g., Perrin, 1998; Weingart, 2005).
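
The benefit-cost reasoning behind the Murphy and Topel (1999) estimate cited in this chapter reduces to a simple expected-value comparison. The sketch below is only an illustration. It uses the roughly $500 billion value of a 1 percent mortality reduction and the $100 billion, 10-year research outlay reported in the chapter and its notes, and it assumes (these simplifications are ours, not the authors') an all-or-nothing outcome and no discounting. The function names are hypothetical.

```python
# Illustrative sketch only: an all-or-nothing research bet with no discounting.
# Dollar amounts are in billions and come from the figures quoted in the chapter.

def break_even_probability(cost, benefit_if_success):
    """Smallest success probability at which expected benefit covers cost."""
    if benefit_if_success <= 0:
        raise ValueError("benefit_if_success must be positive")
    return cost / benefit_if_success


def expected_net_benefit(cost, benefit_if_success, p_success):
    """Expected benefit minus cost when failure is assumed to yield nothing."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must lie between 0 and 1")
    return p_success * benefit_if_success - cost


RESEARCH_COST = 100.0     # $ billions over 10 years (note 8)
MORTALITY_BENEFIT = 500.0  # $ billions for a 1 percent reduction (as cited)

p_star = break_even_probability(RESEARCH_COST, MORTALITY_BENEFIT)
print(f"Break-even probability of success: {p_star:.2f}")
print(f"Expected net benefit at that probability: "
      f"{expected_net_benefit(RESEARCH_COST, MORTALITY_BENEFIT, p_star):.1f}")
```

Under these assumptions the break-even probability is simply cost divided by benefit, which is consistent with the one-in-five figure quoted from Murphy and Topel. The criticism raised in the text still applies: the benefit figure itself rests on imputed values that are not observed in markets, so the apparent precision of the calculation should not be overread.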

it does not attempt quantitative comparisons. For example, one of these narratives describes the role of program-funded research in showing that increased longevity during the 20th century has not resulted in longer periods of disability. Another shows how funded research has provided better data for understanding the relationships between socioeconomic status and health. Yet another demonstrates that low job control at work is a major risk factor for cardiovascular disease in men. BSR combines these narratives with a variety of outcome indicators, such as lists of peer-reviewed publications, honors received by funded researchers, and recognition of funded research in the specialist and popular press, to help inform expert judgments by its advisory board.

Case histories have not always proved useful, however. Earlier studies, such as the 1968 Technology in Retrospect and Critical Events in Science (TRACES) study (Illinois Institute of Technology, 1968), sponsored by the NSF, and Project Hindsight, sponsored by the Department of Defense (Sherwin and Isenson, 1966), which sought to relate advances in fundamental science to important technological advances, proved to be expensive and to have limited persuasive impact. Nettlesome methodological disagreements arose about the validity of the findings (Kreilkamp, 1971), and few such large-scale endeavors have been undertaken in recent years.

Differences of opinion also exist among program managers regarding the external validity and political impact of historical accounts or case studies. They can be subject to selection bias, especially when selected by program managers who have reasons to show a program’s best face. Thus, some argue that without large-scale comparative studies, historical accounts lack persuasiveness about the actual contributions of a program and are unlikely to convince higher levels of management.
To others, however, one compelling case can be akin to the picture that says a thousand words. In addition, the case history technique in general fails to adequately satisfy OMB expectations that agencies install data-based management systems to monitor performance.

The current situation is thus characterized by both methodological and policy turmoil and disagreement. Some agencies seek to develop, validate, and apply quantitative measures of research output and its value. Others see available quantitative metrics as hopelessly inadequate for their assessment purposes and believe that expert judgment is the only valid and appropriate way to evaluate the past performance or future potential of research (see National Research Council, 1999). Nevertheless, the trend toward increased quantification is clear. A combination of forces (OMB mandates, improved sophistication in quantitative methods, more critical examination of the limits to which specific methods can legitimately be pushed, more modest claims on the part of advocates and practitioners of quantification, and what may best be termed resignation to the use of quantitative methods) has softened the edges of earlier either-or debates about the use of expert judgment/peer review procedures versus quantitative methods for priority setting and assessment.13 Consensus may be emerging about the need for informed expert opinion based on the “proper” use of quantitative methods by relevant experts.

Limits of Expert Judgment for Comparing Fields

When research managers must set priorities across research fields, many of the problems of comparison faced in quantitative assessment also apply to expert judgment. Panels of experts that may be able to give reliable advice in specific, long-established disciplines may have much greater difficulty advising across fields. In a well-defined scientific discipline or field, it is reasonable to presume that the appropriate standards of judgment are well understood by most of the experts who might serve on a peer review or advisory panel, even if the standards are not identical in all parts of the field. The members of such groups understand each other’s work, and it would be possible for other experts in the same field to assess a panel’s findings by applying the same standards. It is less safe to presume that such shared understanding and the attendant possibility for checking judgments exist when review panels are organized across more varied areas of expertise. The problem is likely to worsen as the breadth of the set of programs or units being compared increases. Suppose an agency empanels a broadly multidisciplinary expert group to evaluate aspects of a multidisciplinary program. Group members must either judge outside their expertise or rely on their colleagues’ judgments, in which case they may fail to understand the standards that their colleagues are applying.
Not being familiar with the content of the work being proposed or its potential for opening up new lines of research, experts sometimes use methodological rigor as a default evaluation criterion. The result may be, as has been increasingly asserted with regard to both NIH and NSF, that review panels are inherently too conservative about supporting radically new or transformative research ideas over well-crafted mainstream but incremental science.14 Strong criticism by one or two review panel members, particularly on specialized matters in those members’ areas of expertise, may be enough to defeat an idea. Similarly, panels have been criticized as favoring science in established disciplines over interdisciplinary proposals (National Research Council, 2005b:Chapter 6). It has been claimed that experts on a panel defer to the judgment of each member in his or her own field, with the result that fields perpetuate themselves even when their potential for generating important advances is weak and when a broader analysis of a research agency’s portfolio would justify reallocation of funds elsewhere. The systemic logic behind these behaviors, according to Brenneis

(1994:31), is that expert review panels employ a “fairness through apparent clarity” model of decision making. In this model, scholarly progress is seen as incremental, so that proposals are favored when they “are clearly linked to a sense of how ‘science works.’ Proposals that promise to break new conceptual ground or to challenge and refigure dominant paradigms are viewed not so much as ‘bad’ proposals but as difficult to evaluate and compare with other contenders” (see also Guetzkow et al., 2004).

Collective judgments by peer review panels may thus not have a clear meaning, especially when a panel is covering a multidisciplinary range of fields. The problems cannot be solved by combining judgments from different disciplinary groups because standards in different fields may not be comparable. Assume that there are established fields of science, characterized by the conventional attributes of disciplines; that is, journals, professional associations, academic standing as departments, and institutionalized legitimacy within a federal science agency, such as an established study panel or a directorate or division devoted to the support of each field. Assume further that over time one field becomes insular; that is, it focuses on problems that engage specialists but do not look important from an outside perspective, either on scientific or practical grounds, and that prove unproductive in retrospect. The experts in such fields nevertheless continue to believe that they and their colleagues are engaged in exciting, productive, and societally relevant work. When asked to judge recent or proposed new work on such criteria as originality, they rate the studies as more original than they would appear to outsiders to the insular field.
Researchers in such a field can point to a steady stream of output, say, in articles published in leading journals in the field and citations to these articles, albeit predominately by other researchers in the field. Without some external yardstick, it is not possible to know whether or not a particular field is such an insular and moribund field in which the collective judgment of its experts is untrustworthy. Research managers want to identify such fields sooner rather than later, but the judgment of experts from within the field may be misleading, and the judgments of multidisciplinary peer review groups may also fail to offer good guidance. The problem of identifying fields that have passed their prime is at the core of the questions posed by NIA-BSR. Its concern is that it may be supporting some fields that are producing minor if technically well done advances in knowledge derived from long established but increasingly stale paradigms, while choosing not to fund other fields, theories, methods, and findings that promise (with uncertain likelihood) to yield significant new advances in fundamental knowledge that will illuminate not only the field from which they come, but also spill over to enrich other fields or even create new ones. Although peer review panels in many fields believe that there is much high-quality work in those fields, it is possible that the experts in

some fields have an inflated view of the value of research in those fields. It is out of such concerns that research managers in BSR seek a trustworthy method or strategy of research assessment that would make it possible to evaluate the hypothesis that there are serious imbalances in the value of research across the fields funded by the program.

As noted throughout this report, these concerns have long antecedents. But despite various efforts at broad-scale retrospective and prospective judgments (e.g., Deutsch et al., 1971; Irvine and Martin, 1984; Inkeles, 1986; Abrams, 1991; Henkel, 1999), these questions are generally finessed. One reason is their extreme sensitivity in scientific circles, because any explicit effort to compare the value of research across fields entails making invidious distinctions, with possible damage done to the field(s) given lower evaluations in future funding cycles.15 In contemporary U.S. science policy discussions, tactful discourse centers on the concept of balance and may suggest that one or more areas of science are underfunded, but not that the questions being addressed or methods being employed by other fields are in any manner experiencing flagging vitality.

The current budgetary pressures and the additional impetus for accountability and quantitative measurement of research progress make it increasingly difficult to finesse the questions of comparative assessment. Despite the difficulties of comparing the research efforts of different fields on the same quantitative scale, demands for accountability create serious pressure to provide clearly stated rationales for recommendations and decisions about priorities among research fields.
RESEARCH ASSESSMENT AND THE ISSUE OF POWER

As already noted, a critically important but sometimes unacknowledged issue behind debates about how to assess science is that the choice of method may both reflect and affect who has power and influence. Particularly at issue here may be the relative power and influence among researchers, research managers, and nonscientists in government. Such issues have been present in science policy since the beginnings of sponsored research in the United States, when the sponsors were private foundations, such as the Carnegie and Rockefeller Foundation groups (see Box 3-1). The issue continues into the present. An increased emphasis on the use of quantitative indicators in science policy decisions, especially to the extent that indicators can be developed by technicians who are not researchers in the relevant fields, can easily weaken the influence of scientists vis-à-vis agency science managers, or of scientists in general vis-à-vis nonscientist decision makers in government.

BOX 3-1
Power Relations in Sponsored Research, 1900-1945

In the United States, the system of sponsored research grants evolved in private foundations, especially those of the Carnegie and Rockefeller groups. At first, the idea of programmatic, actively managed grants to individuals met resistance because it seemed to transgress entrenched individualistic values and the belief that scientific discoveries are acts of individual genius. In the 1920s, foundations evaded these problems by giving block grants to universities or research institutes with no strings attached: they were designed to develop entire communities, not advance particular individuals’ lines of work. But these programs proved unaffordable after 1929, and in the late 1930s, the Rockefeller Foundation pioneered a system of programmatic grants to scientists in a few fields that were deemed by program managers to be of strategic value. A new social role of grants manager evolved. Foundation officers earmarked strategic fields for investment and, in the absence of a system of peer review, selected among applications for support.

Grant managers became partners in science, in direct and unmediated relations with their grantees. This active relationship profoundly upset existing power relations with foundation boards of trustees. Trustees, who were mostly practical men of business and who had previously made decisions on the basis of their ability to judge organizations, were now in a position of rubber-stamping decisions made on technical grounds by mid-level managers. They were effectively deskilled as experts in organizing productive labor. Senior scientists on boards took the same line, opposing “planning” in science, even though relations between grantees and program managers were remarkably untroubled. In practice, activist managers were helpful, not intrusive, for example, in fostering communication among scientists in different disciplines. Grantees quickly realized that. Trustees distrusted program managers because they could not see what would prevent them from abusing their new power to set priorities and decide on individual proposals. Contention between managers and foundation boards changed only when managers, such as Warren Weaver at Rockefeller, showed their ability to devise and manage programs of individual grants in selected fields like genetics or molecular biology, a field that Weaver helped to define.

The system that evolved by the late 1930s had trustees appointing program officers, who were empowered to select strategic areas for investment and to actively manage systems of individual grants. Scientists accepted the active participation of program managers in directing research along selected lines, but they retained complete control of how the actual work would be done. Once grants were made, program officers never interfered and declined invitations to advise on the particulars. This division of labor worked without advisory committees, peer review panels, or formal procedures of reporting and accountability.

Communication was the main reason the system worked. Program officers worked constantly to be well informed of trends in the fields they sponsored and in the activities and reputations of leading figures in the fields. They did this by identifying a few trusted individuals who they learned would provide objective and disinterested advice and by continual traveling and conversing informally with grantees and potential grantees, including younger scientists. Program managers could become effective partners in science because they understood the personalities and intellectual politics, as well as the science, of their areas of interest almost as well as the insiders did themselves.

SOURCE: Kohler (1987, 1991).

Bibliometrics provides a good example of the issue of power, latent in many current discussions of the use of “objective” measures of research progress. Resistance to the use of bibliometric measures by academic researchers surfaced almost immediately upon their introduction. “The reaction was predictable,” according to Weingart (2005:118), “because first of all the very attempt to measure research performance by ‘outsiders,’ i.e., non-experts in the field under study, conflicted with the firmly established wisdom that only the experts themselves were in the position to judge the quality and relevance of research and the appropriate mechanism to achieve that, namely peer review, was functioning properly.” Adopting any method of assessing research potentially affects who has the ability to influence the setting of broad research priorities, the contours of specific programs with respect to subfields and methodologies, and decisions concerning which proposals to fund (and at what levels) and which to reject. To the extent that the critical information needed for making research portfolio decisions can be gained without reliance on the researchers themselves, the power to make those decisions can be shifted from researchers to research managers. In short, decisions about quantification of scientific progress have a power dimension, whether or not this is within the awareness of those involved, and a shift in power relations can have significant consequences for the directions of science. Such a shift may be viewed as good for science, for example, if research managers have a better overview than scientists of opportunities across many fields, or a better appreciation of which research directions are most likely

to meet societal needs or agency priorities. It may also be viewed as bad for science (for example, because of the possibility of political or ideological interference with scientific research agendas, as in the controversy over the claim that management at the National Endowment for the Humanities has overturned peer review endorsement of proposals because they address issues considered sensitive by political appointees; see Jaschik, 2006).

The issue of power is implicit in the implementation of GPRA and PART. To the extent that these tools emphasize routinized measurement of easily quantified attributes of research, they shift power away from the judgments of the scientific community and toward others, such as those who devise the indicators and those who can find ways to game the assessment system. Efforts to gain approval for assessment mechanisms that rely more on the judgments of scientists, apart from claims that they provide better quality assessments, are in part efforts to prevent a loss of influence by scientists over science priority setting.

It is important to recognize in this context that research managers at NIH are scientists as well as managers. They are typically in the job classification of Health Scientist Administrator, which is taken to mean that their first calling is as a scientist and that the administrator role is secondary. NIH program managers typically hold advanced degrees in science, not management. Once in their positions, however, research managers are expected to act as stewards of their scientific fields in the context of the mission of NIH and their institute or center.
The challenge in such a position in any federal science agency is to stay abreast of one’s scientific field: to know what are the emerging opportunities and challenges, who is doing outstanding work, who is coming up in the field, who is on the cutting edge, what the demands are in the field for technologies or models, and so on. In a stylized manner, much of the research support provided by NIH, especially that occurring in the form of investigator-initiated (R01) projects, represents grassroots initiatives of independent individual researchers. In this model, the frontiers of science, both in terms of the questions (or puzzles) posed and the selection of projects to answer the questions, are determined mainly by the collective workings of the scientific community. NIH program managers ideally function as part of this community, not only as scientists but also as advocates, stewards, and occasionally as entrepreneurs for research fields. A field may need a research tool that NIH can support and make available; better access to data; or a new way of organizing research. A prominent example from BSR is its support of the ongoing Health and Retirement Survey at the level of about $10 million per year. This survey provides data useful to a large number of individual research projects. As noted in Chapter 2, research managers in NIA/BSR have been proactive as entrepreneurs of research by using the management tools at their disposal to provide such collective goods for science. Initiatives

by research administrators can be instrumental in starting or accelerating the development of a field or line of research, and research managers may need to have an entrepreneurial spirit to go along with their understanding of the science and the needs of a field. They may be called on to advocate for specific projects, to urge the adoption of policies, or to work with colleagues inside (or outside) NIH to build support for programs, grow the funding, or help further the appreciation of the science. They need to know how their field relates to others so they can effectively and enthusiastically cooperate in new, high-priority interdisciplinary areas. They also need to be alert to the scientific human resource development issues critical to the programs they administer.

It is possible that a research administrator can see opportunities or challenges that are as yet not fully visible to most working scientists in a field, and thus come to a judgment about a field’s needs that differs from the consensus of those working in the field. Research managers’ judgments may differ from those of active researchers because of the greater value the former group places on the relevance of research to an agency’s mission. And administrators responsible for several fields often make judgments about priorities that differ from the consensus judgments in some of the fields. Differences in judgment may arise from differences between managers and working scientists in the weights assigned to different program objectives or because they use different methods of assessment. Research managers acting as stewards and entrepreneurs may try to convince higher organizational levels, elected officials, and the scientific communities with whom they interact of the importance and relevance of ongoing or emerging fields of science.
Research managers may be more or less entrepreneurial and more or less successful in this role, depending on personal disposition or agency structure, practice, and culture.

In the best case, working scientists and science managers can bring valuable and complementary perspectives to the task of assessing science. In designing methods for assessment and priority setting, then, it makes sense to avoid framing either-or choices between mechanical, quantitative, and bureaucratized decision making led by science managers and qualitatively informed, nuanced choices dominated by scientists. The proper questions to ask in guiding research assessment and priority setting do not concern whether to use quantitative measures, but what should be the appropriate roles of quantitative measures and of deliberative processes of peer review and how should the perspectives of scientists and science managers be combined to provide wise guidance for science policy decisions.

The above observations on the role of program managers are mediated by the formal structures and informal practices and cultures of federal science agencies. The autonomy of program managers to set priorities across fields or modify the decisions of external review groups can vary considerably across agencies. The experiences of members of this committee when serving as members of advisory councils, advisory committees, and review panels are remarkably consistent, and also consistent with the views of program managers we have interviewed in this regard, although we have been unable to identify any systematic study on this topic. In part, these apparent differences reflect the different histories and purposes of these agencies. DARPA, for example, was formed in the 1950s purposively as a small and flexible organization oriented toward revolutionary technology breakthroughs (Bonvillian and Sharp, 2001). Its use of advisory panels and review panels is flexible and ad hoc. By way of contrast, formalized peer review systems are core features of NSF and NIH, with a key distinction being that NSF program managers oversee both program development and the panel review process, whereas NIH separates responsibility for program development and operations from the review process. The peer review process at NIH operates primarily out of the Center for Scientific Review, which organizes review groups that often cut across programs and even institutes, and which generates ratings that are intended to evaluate proposals on a unitary scale that is the same across programs and institutes. This procedure makes it particularly difficult for a science manager at NIH to argue for overturning the results of the peer review process.
As noted in Chapter 2, however, NIH science managers have greater discretion with funding instruments other than unsolicited proposals, which allow them to identify topics of interest and sometimes to set aside funds and create separate review processes for the solicited research.16 Proposals for quantifying of the benefits of research, as well as proposals for increased discretion for science managers, should be understood in the context of these conditions of influence and power. Quantification is sometimes presumed to reduce the influence of extramural scientists. If it has this effect, however, it does not necessarily increase the influence of the science managers who are closest to the scientific research programs. That effect will depend on how quantification is implemented and where in an agency the responsibility is placed for quantifying and for interpreting the results. We return to these issues in Chapters 5 and 6, where we discuss in more detail the use of quantitative analytic methods and deliberative processes for informing research assessment and priority setting. NOTES    1. NSF’s Division of Policy Research and Analysis became embroiled in a losing battle over the independence of its research findings, while ceasing to be a source of external research support (Greenberg, 2001); the congressional Office of Technology Assessment closed its doors in 1995.

OCR for page 40
A Strategy for Assessing Science: Behavioral and Social Research on Aging

2. This situation may be changing, as evidenced by recent statements by the president’s science adviser (Marburger, 2005) and the NSF request to include a “new research effort to address policy-relevant science metrics” in its FY 2007 budget.

3. For example, Senator William Frist was quoted as saying in 2000, “[A]n improved process is needed for establishing goals and research priorities based on scientific data and health analysis, including moving beyond input measures and anecdotal evidence to develop new metrics to measure scientific advances and their causal relationship to improved health outcomes. Such measures will never be precise and should not be used as an absolute guide to determine where and how much to invest. Translating research along the continuum of basic, clinical, and applied research and, ultimately, to patient care almost always involves long periods; the linkages between these stages are seldom straightforward. Still, more comprehensive and transparent measurement tools would provide policy makers, the public, the scientific community, and patients with a more complete understanding of the role of government-sponsored research and help inform federal policy” (quoted in Journal of the American Medical Association, 2002). The comments of John Marburger, the science adviser to President George W. Bush, have already been noted.

4. According to one recent statement of this view from an NIH official: “Competitive pressures have pushed researchers to submit more conservative applications, and we must find ways to encourage greater risk-taking and innovation and to ensure that our study sections are more receptive to innovative applications” (Scarpa, 2006).

5. This report also proposes a mechanism to be used to assess the U.S. position. “Every five years, panels are convened to evaluate the fields in each major areas of science and technology (e.g., physics, biology, electrical engineering), their standing in the world, and the resources needed to reach and maintain world-class position. Evaluation focuses on outputs, such as important discoveries, and also on certain benchmarks of best practices, such as number of scientists and engineers and their training, or the current state of the laboratories and research facilities” (National Research Council, 1995a:15).

6. “This balance takes into consideration both the values of accountability associated with representative government and those of autonomy associated with an independent professional community. Not only does the government ‘invest’ in a public good …, but it delegates to other institutions the actual conduct of the research. It is thus the scientific community, as established in universities and other research institutions, that has responsibility for ‘producing’ research, discoveries, and new technologies” (Guston and Keniston, 1994:8).

7. Despite their seeming similarities, GPRA and PART reflect the different perspectives, budget priorities, and decision-making processes of the legislative and executive branches. OMB’s initiation of PART, for example, has at times been described by OMB officials as representing dissatisfaction with the actual impact of GPRA on agency budgets. For its part, OMB’s PART recommendations must still run the gauntlet of congressional appropriations committees. PART’s impact on budget allocations remains problematical to date (Olsen and Levy, 2004).

8. In benefit-cost terms, their estimates indicate that an increase of $100 billion for cancer research spent over 10 years would be “worthwhile if it had only a one-in-five chance of producing a 1 percent reduction in cancer mortality, and a four-in-five chance of producing nothing” (Murphy and Topel, 1999:3).

9. The rare isotope accelerator being proposed by some physicists for funding by the U.S. Department of Energy is estimated to cost $1 billion. The National Aeronautics and Space Administration’s construction of the James Webb Space Telescope has an estimated cost of $4.5 billion.

10. Weingart (2005:120) observed, in discussing the difficulties of developing “correct” sets of bibliometric data, that the methodological and operational origins of the data can be “concealed from the end user who is not able to reflect upon the theoretical assumptions implied in their construction…. The healthy skepticism of years ago, albeit often for the wrong reasons, appears to have given way to an uncritical embrace of bibliometric measures and to an irresponsible use.”

11. A well-known instance of serendipity was the discovery that Viagra, initially developed to treat cardiac problems, had unanticipated effects on sexual performance.

12. Even for assessing “inventions” rather than “science discoveries,” patents have important limitations as performance measures. As noted by Jaffe and Trajtenberg (2002:3-4), “There are of course, important limitations to the use of patent data, the most glaring being the fact that not all inventions are patented. First, not all inventions meet the patentability criteria set by the USPTO, the United States Patent and Trademark Office (the invention has to be novel and nontrivial, and has to have commercial application). Second, the inventor has to make a strategic decision to patent as opposed to rely on secrecy or other means of appropriability.”

13. Roessner (2000:125) sees a choice as being imposed: “Posed as a choice, the question of quantitative versus qualitative methods or measures is a false one, at least to the professional evaluator residing in lofty isolation from the messy real world…. Legislators and other authoritative oversight bodies are increasingly asking public agencies for quantitative measures of research performance, and in so doing can generate all kinds of mischief.”

14. Erich Jarvis of Duke University, the 2002 recipient of the NSF’s Waterman Award, was quoted in Science (Mervis, 2004b:220) as saying “You learn the hard way not to send high-risk proposals to NSF or NIH, because they will get dinged by reviewers. Instead, you’re encouraged to tone down your proposal and request money for something you’re certain to be able to do.”

15. Researchers often seem to follow the dictum, “Thou shall not speak ill of a fellow [insert field]!” Consider, for example, the diplomatic response of Janez Potocnik, the European Union’s new commissioner for science and research on his appointment to the post: “when asked if any particular area of science has caught his interest since taking on the research job [Potocnik said that] (I)n practically all the areas you touch, you see interesting things going on” (Vogel, 2004).

16. Review processes also vary with funding instruments at NIH. The Center for Scientific Review is responsible for about two-thirds of peer review at NIH. Institute-based review groups do the bulk of the rest, such as reviewing proposals submitted in response to requests for applications.