KEY POINTS IN THIS CHAPTER
- Currently available metrics for research inputs and outputs are of some use in measuring aspects of the American research enterprise, but are not sufficient to answer broad questions about the enterprise on a national level.
- The impacts of scientific research can best be determined not by applying traditional metrics such as counts of publications and patents, but by cultivating an understanding of the complex system that is the U.S. research enterprise to determine how all of its component parts interrelate, a theme that is explored in detail in Chapter 6.
- Ongoing data collection efforts, including Science and Technology for America’s Reinvestment: Measuring the Effect of Research on Innovation, Competitiveness and Science (STAR METRICS), could potentially be of great value if these datasets could be linked with other data sources and made more accessible to researchers.
Metrics often are used to assess quantitatively how well a project or a research-performing institution measures up. Is it performing as well as it should? Is it producing the expected results? Is it a worthwhile investment?
Of particular interest to Congress for this study was whether metrics could be used to quantify one particular aspect of the U.S. research enterprise: the transfer of scientific discoveries at research universities and government laboratories into commercial products and services for societal benefit (a process discussed at length in Chapter 3). The committee found, however, that technology transfer is only one small piece of the picture (see Chapter 3 and Appendix B on the relationship between U.S. universities and industrial innovation). In fact, the very term “technology transfer” connotes specific institutionalized mechanisms for the movement of technical knowledge, whereas in reality knowledge moves through numerous informal channels and institutional frameworks—perhaps most importantly through people—and moves in many directions between universities, industry, and other laboratories, or between basic and applied research. It is also important to note the subtle difference between “measuring inputs to and outputs from the research enterprise” and “evaluating the impacts of the research enterprise”: the former focuses on the measurement of external factors that modulate the process of research and on the measurement of intermediate research outputs, such as publications and patents; the latter focuses on how research ultimately affects society.
This chapter reviews in turn existing measures; the uses and limitations of a commonly used input indicator and a commonly used output indicator; the challenges of data collection to inform measurement tools, with a focus on the STAR METRICS Program; the limitations of existing metrics; and the need to move beyond current indicators. Chapter 5 explores how the impacts of the research enterprise have been evaluated by various groups, including universities, private industry, private nonprofits, government agencies, and other nations.
Although no single measure can provide an accurate representation of the full picture of the returns on research investments, some currently available tools—particularly the metrics and indicators of research inputs and outputs described below—can help answer specific questions about aspects of the overall picture. Techniques and methods can provide value, as can metrics, which measure a system’s performance quantitatively, and indicators, which reveal trends or facts about a system.
A report by Guthrie and colleagues (2013) explores how various methods (e.g., data mining, visualization, site visits, economic analyses), metrics, and indicators can be used and describes key challenges related to each. The following are some examples:
- Bibliometrics are a quantitative measure of the quantity, dissemination, and content of research publications. They reveal the volume of outputs from the research system, and can shed light on pathways of knowledge transfer and the linkages among various scientific fields. However, the use of citations as a measure of quality or impact varies among fields and individual scientists, making this a difficult metric to apply across the research system.
- Case studies are useful in capturing the complex and varied inputs that influence a particular output. This method is a valuable way to reveal the context of a discovery, but case studies tend to provide illustrative examples rather than generalizable evidence that definitively links research to a particular output or outcome.
- Economic analysis can be used to understand the relationship between costs and benefits, compare possible outcomes among a range of alternative strategies, and reveal the cost-efficiency of an approach. This method is useful whenever it is possible and appropriate to assign a monetary value to both costs and benefits.
- Logic models provide a visual interpretation of the trajectory through which inputs contribute to a particular output, and can be useful for planning, monitoring, and evaluating research programs. The limitation of logic models is that a trajectory can change in unexpected ways. Moreover, these models tend to disregard the counterfactual, or the most likely scenario had the research program not existed.
- Peer review is a method based on the idea that experts in a field are best suited to determining the quality of work in that field. Some have criticized peer review for discouraging the funding of high-risk research or radically new research approaches. Moreover, as further discussed in Chapter 5, others have recently criticized how the National Institutes of Health (NIH) organizes, funds, and selects study sections, suggesting that it leads to the dilution of expertise in the review process. See, for example, Alberts et al. (2014).
- Statistical analysis is a valuable, albeit time-consuming way to identify patterns in existing datasets. This method depends greatly on access to and the quality of existing data.
The next two sections describe indicators of the broader systems of research and innovation from two different perspectives—research inputs and research outputs. These indicators are commonly used to assess the competitiveness of the American research enterprise.
INPUT INDICATOR: RESEARCH AND DEVELOPMENT (R&D) AS A PERCENTAGE OF GROSS DOMESTIC PRODUCT (GDP)
A frequently used indicator of a nation’s investment in science is the ratio of spending on R&D to GDP. It is a crude measure that allows for international comparisons of the levels of national investment in R&D, investments that are correlated with overall innovative performance. Nonetheless, like many widely used metrics for R&D investment, R&D/GDP ratios conceal a great deal of cross-national heterogeneity. In the United States and some other countries, for example, a large share of the national R&D investment is devoted to defense, and most of such spending is on the development and testing of military equipment.1 Differences in the proportions of defense and nondefense funding in the public R&D investments of each nation make comparisons of the United States’ R&D/GDP ratio with those of other nations potentially misleading. The mix of public and private R&D investment that is included in the numerator also varies considerably among nations. Moreover, of course, this ratio measures only inputs, and says nothing at all about the efficiency with which the investment in R&D is translated into basic knowledge and/or innovations.
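The arithmetic behind this indicator is simple. The sketch below (purely illustrative; the spending and GDP figures are made up, while the country shares are the 2011 values reported by NSF) shows the computation and a comparison of the kind discussed here:

```python
# Illustrative computation of the R&D/GDP input indicator.
# rd_spending and gdp are in the same currency units; the function
# returns the ratio as a percentage.
def rd_gdp_ratio(rd_spending: float, gdp: float) -> float:
    return 100.0 * rd_spending / gdp

# Hypothetical figures: $29 of R&D against $1,000 of GDP gives 2.9 percent,
# the U.S. share reported for 2011 (NSF, 2014a, Table 4-4).
print(f"{rd_gdp_ratio(29.0, 1000.0):.1f}%")

# 2011 shares cited in the text, sorted for comparison:
shares_2011 = {"United States": 2.9, "Japan": 3.4, "South Korea": 4.0, "China": 1.8}
for country, share in sorted(shares_2011.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {share:.1f}% of GDP")
```

The computation is trivial by design; as the text emphasizes, the difficulty lies not in the arithmetic but in what the numerator includes and what the ratio omits.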
Despite these limitations, the ratio and its numerator (i.e., the amount of R&D spending) are important indicators of where the United States may face future competition. As reported by the National Science Foundation (NSF) (2014a, Table 4-4), in 2011, the R&D share of GDP for the United States was 2.9 percent. For Japan, it was 3.4 percent, and for South Korea, it was 4.0 percent. For China, the share has increased consistently since the mid-1990s, reaching 1.8 percent in 2011. According to the most recent OECD data, the United States ranks 10th among nations on this indicator. Moreover, NSF notes (pp. 4-18):
Most of the growth over time in the U.S. R&D/GDP ratio can be attributed to increases in nonfederal R&D spending, financed primarily by business. This growth may also indicate an increasing eagerness by business to transform new knowledge into marketable goods. Nonfederally financed R&D increased from about 0.6 percent of GDP in 1953 to about 2.0 percent of GDP in 2011. This increase in the nonfederal R&D/GDP ratio reflects the growing role of business R&D in the national R&D system and, more broadly, the growing prominence of R&D-derived products and services in the national and global economies.
1In 2010, 60 percent of U.S. federal R&D spending was for defense, but only a very small fraction of that spending (about 2 percent) was for basic research (National Science Foundation, 2012, Tables 4-28 and 4-29).
As this statement suggests, much of the growth in industry-funded R&D during the post-1953 period reflects increased industry spending on applied research, as well as development, rather than on basic research.
The ratio of R&D to GDP does not account for how effectively each nation manages its investment, nor does it capture cross-national differences in the mix of public and private funding within the numerator or in the division of labor among different institutional performers (government, universities, and industry). Moreover, combining data on national investments in research and in development does not allow for cross-national comparisons of research investments alone, presenting a significant barrier to examining the effects of federal research investments. Therefore, further data and analysis are necessary to understand the components of R&D spending and enable a better understanding of how the United States compares with other countries in this regard.
NSF calculates a similar ratio for individual industries and sectors. In this case, the ratio is R&D divided by net sales.2 This measure, called “R&D intensity,” shows another aspect of the results of federal funding for basic and proof-of-concept research. The opportunity for a business to be successful in conducting R&D is influenced by the availability of a science base and platform technologies from proof-of-concept research. The better government does in supplying these inputs, the higher is the return to industry on its prospective investments in R&D, and the more industry invests in innovation. In other words, higher R&D intensities should be expected for industries supported in this way.
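In the same spirit, R&D intensity is simply R&D spending divided by net sales. A minimal sketch with hypothetical figures:

```python
# R&D intensity for an industry or firm: R&D spending divided by net sales.
def rd_intensity(rd_spending: float, net_sales: float) -> float:
    return rd_spending / net_sales

# Hypothetical industry: $12B of R&D against $150B in net sales.
print(f"R&D intensity: {rd_intensity(12.0, 150.0):.1%}")  # prints 8.0%
```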
From a metrics perspective, the output of government support for science and technology platforms is what economists call an “intermediate good,” in that industry builds upon these platforms and the general stock of scientific knowledge to create innovations. For the most part, there are no markets for these knowledge-based intermediate technology goods, which makes impact assessment difficult.3 Nonetheless, they are essential to the productivity of applied R&D.
OUTCOME INDICATOR: RETURN ON INVESTMENT
A popular indicator for assessing research outputs as a means of justifying further research investments is return on investment. This indicator has been used for medical research (Passell, 2000), manufacturing practices, information technology (IT) (Dehning and Richardson, 2002), and other elements of the research and innovation systems.4,5 Using this measure correctly, however, presents methodological challenges in terms of both alternative conceptual models and the requirement for quality data.
2A portion of an industry’s sales is its “value added.” The sum of the value added by all domestic industries is GDP.
3Other government policies, such as the R&D tax credit, apply to the entire R&D cycle. Further, mission agencies, notably the National Aeronautics and Space Administration (NASA), fund applied R&D as well. The discussion here applies to government investment in research in support of economic growth objectives.
Two main approaches have been described for measuring the private economic returns on R&D investment. The first relates current output (measured as sales or net revenue) to conventional inputs (labor, capital, purchased materials and services) and a measure of the stock of knowledge available to a firm. The second is more forward looking and incorporates future expectations, but relies on the efficiency of financial markets in evaluating the future prospects of a firm. It relates the stock market value of the financial claims on the firm’s assets to the underlying assets, again including a measure of the knowledge stock. This second approach is not suitable when the unit of observation is anything other than a publicly traded firm, so it is not useful for measuring the returns on federal R&D investments.
Constructing a measure of the knowledge stock available to a firm is challenging. The earliest work, by Griliches (1980), Mansfield (1965), and Terleckyj (1980), simply used research intensity (the R&D-to-sales ratio), relating it to the growth in output adjusted for input change (that is, total factor productivity, or TFP). This approach is appropriate when the depreciation rate for R&D is zero and the impact of R&D on output is immediate. Subsequent researchers, led by Griliches, have used a stock of R&D constructed by analogy to ordinary capital, with a depreciation rate arbitrarily chosen to be 15 percent. Work by Hall (2005) using the market value of private firms suggests that the appropriate private depreciation rate may be larger than 15 percent and will vary over time and sector, depending on competitive conditions. In a social sense, however, knowledge generated by private-firm R&D may depreciate more slowly than these rates suggest. The reason is that the technical knowledge base of an industry or entire economy expands over time, retaining and building on knowledge produced in earlier technology life cycles. That is, such knowledge may remain useful even if it is no longer possible for an individual firm to extract monetary value from it directly. Even with respect to new knowledge, firms benefit from R&D done by others because of its quasi-public good nature. Thus several researchers have included measures of the stock of potential spillover R&D in the production function to obtain measures of the social return on R&D investments.6
4See the survey in Hall (1996) on the private and social returns on R&D investments.
5A large body of literature from NIST’s Advanced Technology Program proposes and demonstrates multiple approaches to conducting studies that measure societal impact, including return on investment (http://www.atp.nist.gov/eao/eao_pubs.htm [August 2014]).
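The stock construction described above is conventionally implemented by the perpetual inventory method: each year’s knowledge stock is that year’s R&D outlay plus the undepreciated remainder of the previous year’s stock. A minimal sketch, using the 15 percent depreciation rate noted in the text and a made-up spending series:

```python
# Perpetual inventory construction of an R&D knowledge stock:
#     K_t = R_t + (1 - delta) * K_{t-1}
# where R_t is the year's R&D outlay and delta is the depreciation rate
# (0.15 is the conventional, admittedly arbitrary, choice).
def knowledge_stock(rd_flows, delta=0.15, initial_stock=0.0):
    stocks = []
    k = initial_stock
    for r in rd_flows:
        k = r + (1.0 - delta) * k
        stocks.append(k)
    return stocks

# Hypothetical constant outlays of $100M/year: with depreciation, the stock
# converges toward 100 / 0.15, about $667M, rather than growing without bound.
print([round(k, 2) for k in knowledge_stock([100.0] * 5)])
# → [100.0, 185.0, 257.25, 318.66, 370.86]
```

The choice of `delta` drives the result, which is why the sensitivity of depreciation rates discussed above (Hall, 2005) matters so much in practice.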
Hall and colleagues (2010) survey a large literature using the above production function approach to measure the returns on R&D investment. These authors also discuss in detail the many measurement issues that must be addressed when using this methodology. They then report on studies conducted at the firm, sector, and country levels, including those that incorporate measures of the spillover stock of R&D. They conclude that private rates of return generally have been positive and usually higher than those for ordinary capital. In addition, social rates of return often have been substantially greater than private rates, while returns on government R&D investments are lower, as one would expect for reasons given below.
Aside from the many measurement issues identified in the literature, interpreting results on rate of return in the R&D context poses a central problem: the outcome of individual R&D projects or indeed a collection of R&D projects is highly uncertain, and the projects’ revenue or sales success depends on a number of other factors that are difficult to control. Hence past results are not a certain guide to future success, although they may be informative. In other words, the “rate of return on R&D investment” is not a parameter or universal constant—it will vary over time, country, firm, or technology. One might naturally expect it to be positive at the firm level on average, since profit-maximizing firms choose to spend money on such investments, but there will be great variability. Indeed, at any given point in time, returns are so variable that one might not even expect the average returns across firms to be equal to the cost of R&D capital that the firms face. Economic theory says that in general, a firm will invest in R&D to the point where the expected returns equal its cost of capital, but there is no guarantee that this equality holds ex post.
It is tempting to try to transfer the methodology for computing private returns on R&D investment to the assessment of economic returns on federal research investments. But this is generally a mistake. Besides the interpretive drawback mentioned above, the central problem is that the computation of rate of return is appropriate when the entity making the investment is reaping its rewards and when the goal of the research is to maximize economic returns. This is not the case for most federal research, for a number of reasons.7 In addition, unlike the situation for private firms, little if any of the output of the agencies responsible for much federal R&D spending (e.g., national security, public health, environmental quality) is priced in conventional markets, making the measurement of output, a cornerstone of the production function approach discussed earlier, infeasible (Griliches, 1979, 1994).
6R&D spillovers are defined as the knowledge acquired from R&D done by others (including governments) that is not paid for. An example is increased understanding of scientific processes useful for one’s own product development, obtained from reading scientific publications or attending scientific meetings. In addition to spillovers from public R&D, firms and others frequently benefit from observing the introduction of new products and processes by their competitors. Although an actual product may be protected by one or more patents, some of the knowledge that accompanies its development is inevitably diffused to the rest of the industry. A number of researchers have explored ways of constructing stocks of knowledge relevant to a particular firm or sector by using spatial or technological distance measures to weight the R&D conducted by others. See Hall et al. (2010) for further development of these ideas.
The relevant output for most federal research is not revenue but a variety of public goods, some but not all of which will be reflected in economy-wide productivity growth but will not be directly traceable to any particular R&D spending.8 On the applied research side, these outputs include improvements to agricultural productivity, aeronautics, and energy production and efficiency. Such output may enhance the productivity of private firms, but it will be difficult to capture these impacts given their diffuse nature. For basic research, the problem is even more difficult because of long and variable time lags between the research and its impacts (see Chapter 3) and the fact that one cannot predict the areas impacted by particular fields of research very well (e.g., the role of mathematics and basic computer science research in genetic research).
The diffuse nature of the output of federal research has led some researchers to attempt measurement at the aggregate level by relating aggregate total factor productivity, or TFP, to various types of R&D spending across countries (Guellec and van Pottelsberghe de la Potterie, 2001; Westmore, 2013). The results are fairly encouraging and can reveal something about which policies and institutions appear to work better than others. However, these studies are somewhat fragile because of the great differences across countries and the increasingly international nature of R&D spillovers (which implies that one country may free-ride to some extent on the R&D spending of others). These studies also provide no specific guidance on the allocation of government R&D across fields. In principle, given enough data, it might be possible to estimate average returns across countries in various fields during the past, but it remains true that past performance is not necessarily indicative of future results. In any case, for most countries, it is simply impossible to construct the full input-output matrix from research in different scientific fields to downstream industry use and the accompanying feedbacks from industry to science.
7A report of the National Academy of Sciences (1995) identifies about half of federal R&D spending as being devoted to nonresearch programs, including end development and testing of aircraft and weapons by the U.S. Department of Defense, nuclear weapons development by the U.S. Department of Energy, and mission operations and evaluation at NASA. Excluding this spending, the balance of federal spending is arguably for basic and applied research.
8For more information about the economics of scientific research, see President’s Council of Advisors on Science and Technology (2012).
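Mechanically, the aggregate studies cited above regress a productivity measure on R&D variables across countries. The toy example below uses synthetic, deterministic data (not real country statistics) solely to show the least-squares mechanics, which recover an assumed linear relation exactly; real cross-country work must confront noise, heterogeneity, and the identification problems discussed in the text.

```python
import numpy as np

# Synthetic cross-country data: TFP growth is constructed as an exact
# linear function of the R&D/GDP share so the regression mechanics are
# easy to verify. Real data would add noise and many confounders.
rd_share = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])   # percent of GDP
tfp_growth = 0.2 + 0.4 * rd_share                      # assumed relation

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones_like(rd_share), rd_share])
coef, *_ = np.linalg.lstsq(X, tfp_growth, rcond=None)
print(f"intercept={coef[0]:.2f}, slope={coef[1]:.2f}")  # intercept=0.20, slope=0.40
```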
These issues have been discussed thoroughly in several previous National Research Council (NRC) and other publications listed in the annotated bibliography in Appendix C. Most authors have reached the conclusion that standard economic rate of return analysis is not suitable for evaluation of federal research investments, and that a variety of methods—such as bibliometrics (supplemented by peer review), international benchmarking, and expert review of applied research projects—are necessary for the evaluation of scientific output.9
In summary, the outputs of federal research are intermediate knowledge-based goods, which industry combines with its own investments to produce proprietary technologies (innovations). The productivity of federal research must therefore be measured in terms of its partial contribution to the eventual commercialization of proprietary technologies. Thus, whether the output of federal research investments is science, technology platforms, or infratechnologies, the nature of the impact is to leverage the productivity of industry-funded R&D.
CHALLENGES OF DATA COLLECTION TO INFORM MEASUREMENT TOOLS
Interest in measuring the impacts of government-funded research has increased around the world, and a number of data collection efforts to this end are under way in the United States and other nations. Table 4-1 describes major data programs of Australia, Canada, the United Kingdom, and the United States (Guthrie et al., 2013). A key challenge is to establish the most appropriate set of metrics for achieving the goal of the data collection effort. Programs currently under way include the Research Excellence Framework in the United Kingdom, which is intended to measure the performance of universities and determine funding allocation based on the wider nonacademic impacts of research; the Excellence in Research for Australia framework, which uses bibliometrics and other quantitative indicators to measure research performance for accountability and advocacy purposes, and potentially for allocation of funds; and the Canadian Academy of Health Science Payback Framework, which relies on several indicators of research impact and incorporates a logic model for health research translation in an effort to provide consistency and comparability among institutions in a research system with multiple regional funders.
9For an example of how bibliometric measures can be used to help evaluate research impacts, see Lichtenberg (2013).
TABLE 4-1 Research Impact Frameworks Used by the United States and Other Nations
| Framework | Origin and Rationale | Scope |
| --- | --- | --- |
| Research Excellence Framework, UK | Evolved from its predecessors, the RAE and the RQF. Intended to be low burden, but pressure from researchers led to changes. Includes wider societal impact. | Assessment at subject level on three elements: quality of research outputs, impact of research (nonacademic), and vitality of environment. |
| STAR METRICS, U.S. | Key aim to minimize burden on academics. Helps to meet U.S. federal accountability requirements. | Two levels: Level 1, number of jobs supported; Level 2, range of research funded, researcher interactions, and wider impacts. |
| Excellence in Research for Australia, Australia | Perceived need to include assessment of quality in block funding allocation (previously volume only). Advocacy purpose to demonstrate quality of Australian research. | Assesses quality, volume, application of research (impact), and measures of esteem for all Australian universities at disciplinary level. |
| Canadian Academy of Health Sciences Payback Framework, Canada | Draws on well-established ‘payback’ framework. Aims to improve comparability across a disparate health research system. Covers wide range of impacts. | Five categories: advancing knowledge; capacity building; informing policies and product development; health and health sector benefits; broader economic benefits. |
| Framework | Measurement | Application to Date | Analysis | Wider Applicability |
| --- | --- | --- | --- | --- |
| Research Excellence Framework, UK | Assessment by subject peer review panel of list of outputs, impact statement and case studies, and statement on research environment. | Piloted 2009. First round of assessment 2014; results will determine funding allocation. | Burden not reduced, but adds wider impact to evaluation. Originally metrics based, but this was dropped as too unpopular. | Suitable for similar cross-institutional assessment of performance. High burden on institutions, arguably expensive. Best for significant funding allocation uses. |
| STAR METRICS, U.S. | Data mining approach, automated. At present, only gathers jobs data. Methodologies for Level 2 still being developed. | Level 1 rolled out to 80 universities. Level 2 still under development. Voluntary participation, so full coverage unlikely. | Feedback generally positive, but feasibility of Level 2 not proven. | Potentially very wide, depending on success of Level 2. There has been international interest, e.g., from Japan, EC. |
| Excellence in Research for Australia, Australia | Indicator approach; uses those appropriate at disciplinary level. Dashboard provided for review by expert panel. | First round in 2010, broadly successful. Next round 2012, with minor changes. Intended for funding allocation, but not used for this as yet. | Broadly positive reception. Meets aims, and burden not too great. Limitation is the availability of appropriate indicators. | Should be widely applicable; criticism limited in Australian context. Implementation appears to have been fairly straightforward. |
| Canadian Academy of Health Sciences Payback Framework, Canada | Specific indicators for each category. Logic model has four research ‘pillars’: biomedical; clinical; health services; and social, cultural, environmental, and population health. | Used by public funders; predominantly CIHR (federal funder), but there has also been some uptake by regional organizations (e.g., Alberta Innovates). | Strengths: generalizable within health sector; can handle unexpected outcomes. But understanding needed at funder level, which may limit uptake. Early stages hard to assess. | Breadth, depth, and flexibility mean framework should be widely applicable. However, it only provides a guide and needs significant work to tailor to specific circumstances. |
| Framework | Origin and Rationale | Scope |
| --- | --- | --- |
| National Institute of Health Research Dashboard, UK | Aim is to develop a small but balanced set of indicators to support strategic decision making, with regular monitoring of performance. | Data collected quarterly at programme level on inputs, processes, outputs, and outcomes for three elements: financial, internal process, and user satisfaction. |
| Productive Interactions, Europe | Measures productive interactions, defined as interactions with stakeholders that lead to change. Eliminates time lag; easier to measure than impacts. Assessment against internal goals intended for learning. | Intended to work in a wide range of contexts; best applied at research group or department level, where goals are consistent. |
NOTES: CIHR = Canadian Institutes of Health Research; EC = European Commission; NIHR = National Institute for Health Research; RAE = Research Assessment Exercise; RQF = Research Quality Framework.
SOURCE: Reprinted with permission from Guthrie et al. (2013, Appendix A, p. 37).
One data collection program in the United States—STAR METRICS—is designed to collect a number of measures of the impacts of federally funded research.10 This program is a joint effort of multiple science agencies (NIH, NSF, the U.S. Department of Energy, the U.S. Environmental Protection Agency, and the White House Office of Science and Technology Policy) and research institutions. Its objective is to document the outcomes and public benefits of national investments in science and engineering research for employment, knowledge generation, and health.
| Framework | Measurement | Application to Date | Analysis | Wider Applicability |
| --- | --- | --- | --- | --- |
| National Institute of Health Research Dashboard, UK | Programme-specific data can be pooled to provide a system-level dashboard; 15 indicators selected, matching core aims, collected quarterly. | Launched July 2011 NIHR-wide, with data to be provided by the four coordinating centres, analyzed and aggregated centrally. | Designed to fit strategic objectives, so in that sense likely to be effective. However, only just launched, so detailed analysis premature. | Should be applicable to other national health research funders. Performance indicators selected can be tailored to assessment needs. |
| Productive Interactions, Europe | Three types of interaction: direct personal contacts; indirect contacts (e.g., via a publication); and financial. Engages users; findings assessed against internal goals. | Piloted across diverse disciplines and contexts in four European countries and at EC level. No plans to roll out more widely at present. | Tailored, so should help improve performance. No comparative ranking. Requires significant work from participants to generate their own set of goals and indicators. | Indicators developed to meet goals, so widely applicable, but does not produce comparison between institutions, so not appropriate for allocation, and could be challenging to use for accountability. |
The data collection program, which began in 2010, was to proceed in two phases using readily available information. In Phase I, the program identified workers supported by scientific funding, drawing on the internal administrative records (e.g., awards, grants, human resources, finance systems) of researchers’ (mainly academic) institutions. Phase II is currently gathering information on scientific activities from individual researchers, commercial publication databases, administrative data, and other sources. The information gathered by STAR METRICS will allow various calculations, such as the total number of individuals supported by research funding, along with the number of positions supported outside universities through vendor and subcontractor funding. The STAR METRICS Program is intended to help federal policy makers, agency officials, and research institutions document the immediate economic effects of federal investments in scientific research.
While the program is relatively new, it takes two interesting steps: (1) automating and aggregating standardized reporting of grant payment information from university administrative records, and (2) creating a dataset that can plausibly reorient the analysis of federal R&D investments away from a focus on grants and toward a focus on investigators by assessing the impact of federal R&D spending on job creation. Programs akin to STAR METRICS are beginning to gain traction in Japan, Australia, and the European Union nations, offering the eventual possibility of international comparisons.
The committee evaluated the STAR METRICS Program (see Appendix A) in an effort to determine its potential utility for assessing the value of research in achieving national goals. Although STAR METRICS represents a valuable step toward developing detailed, broadly accessible, and nationally representative data that would allow systematic and scientific analysis of the organization, productivity, and at least some of the economic effects of federally funded research, it is currently deficient in a number of respects. To fulfill its considerable promise, the program requires several changes and expansions.
First, as of this writing, STAR METRICS data are largely inaccessible for research use; the data could be used in more informative ways if steps were taken to ensure broad and open access. Second, data collection could usefully be expanded to include more universities and other performers of federally funded research, such as national laboratories and teaching hospitals. This expansion would enable better coverage of both federal expenditures on basic and applied research and key aspects of the scientific workforce. Finally, STAR METRICS data would be more useful if steps were taken to ensure that the data can be flexibly linked to other relevant data sources, including but not limited to those maintained by the federal statistical and science agencies, as well as proprietary data sources such as the Institute for Scientific Information’s Science Citation Index, recognizing that data emanating from such databases have very different meanings from field to field. Creating a robust and linkable dataset may require the addition of individual and organizational identifiers to the current STAR METRICS data.
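As a concrete illustration of the kind of linkage envisioned here, the sketch below joins a hypothetical award-level payroll record to a hypothetical publication record through a shared researcher identifier. The field names and values are invented and do not reflect the actual STAR METRICS schema; the point is only that a common identifier makes such joins possible.

```python
# Illustrative sketch (not the STAR METRICS schema): linking an award-level
# payroll record to an external publication record via a shared researcher
# identifier (e.g., an ORCID-style ID). All field names are hypothetical.

payroll = [
    {"researcher_id": "0000-0001", "award": "NSF-12345", "fte": 0.5},
    {"researcher_id": "0000-0002", "award": "NIH-67890", "fte": 1.0},
]

publications = [
    {"researcher_id": "0000-0001", "doi": "10.1000/xyz123"},
    {"researcher_id": "0000-0003", "doi": "10.1000/abc456"},
]

def link_records(left, right, key):
    """Inner join of two record lists on a shared identifier field."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [
        {**l, **r}
        for l in left
        for r in index.get(l[key], [])
    ]

linked = link_records(payroll, publications, "researcher_id")
# Only researcher 0000-0001 appears in both sources, so one linked record results.
```

Without a shared identifier, the same join would require error-prone matching on names or institutions, which is one reason the addition of individual and organizational identifiers matters.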
The ability to capture, store, and analyze massive amounts of data offers opportunities for the further development of indicators. A recent NRC report, Frontiers in Massive Data Analysis, outlines the challenges of using today’s massive data and suggests statistical approaches for addressing these challenges (National Research Council, 2013a, p. 70). Big data will require the use of advanced statistical methods and machine learning algorithms to optimize the data’s usefulness and understand the challenges involved. Big data cannot answer all the salient questions with the push of a button; there is a need to vet the algorithms used to analyze these data. Human input will still be needed to decide what data to use, how they might be sampled, and how to integrate them with complementary existing data sources and models.
In addition to STAR METRICS, NIH is engaged in other efforts to collect data on its research: the Research Portfolio Online Reporting Tools, the Scientific Publication Information Retrieval and Evaluation System, and the Electronic Scientific Portfolio Assistant. These programs, however, collect data on shorter-term outputs, such as citations and patents, for management purposes, and were not established to track longer-term outcomes. The data collected through these efforts could potentially prove valuable in the design of a more powerful database.
The 2013 NIH report Working Group on Approaches to Assess the Value of Biomedical Research Supported by NIH (National Institutes of Health, 2013) describes how three of the agency’s new administrative data collection efforts—the Research Performance Progress Report (RPPR), the SciENcv database, and the World RePORT database—can improve the quality of data collected on NIH-funded research projects and investigators. Indeed, much of the data needed to develop metrics about the research enterprise is housed in various administrative datasets across the federal government. The RPPR will collect information about research performance in a uniform format across all federal agencies, allowing greater integration across agencies. Moreover, it will link to other data-tracking systems, including the SciENcv database, which tracks researcher profiles. Finally, the World RePORT database will plot information about NIH-funded projects onto a geographic map to facilitate greater coordination among public and private funders.
LIMITATIONS OF EXISTING METRICS
The committee’s review of many current metrics for research inputs and outputs revealed them to be lacking. In Chapter 5, we point to ways of improving existing metrics and making their use more effective, in particular by using them to identify where improvements are needed. Ultimately, however, metrics used to assess any one aspect of the system of research in isolation without a strong understanding of the larger picture may prove misleading.
Many currently available metrics are used in an attempt to reveal the value of research through the measurement of research outputs. They look at individual pieces of the big picture, for example, by counting patents, licenses, and various other outputs. But a holistic understanding of the research system is needed if the goal is to increase the likelihood that innovations will emerge. Existing metrics give some indication of how well the system is performing, but the ultimate impacts, the emergent phenomena that truly matter to society—such as an abundant supply of natural gas enabled by fracking technology, communications and commerce enabled by Google and the Internet, and medical advances enabled by genomics—depend on a number of critical components, and the relationships among them, in the complex systems of research and innovation. These components often are intangible, including opportunities and relationships that are not captured by most data collection programs and cannot be measured by any method available today. The challenge, which has yet to be met, is to capture and articulate how these intangible factors enable the success of the research enterprise. A report by the National Bureau of Economic Research (Corrado et al., 2006) suggests that the challenges of accounting for intangible factors lead to the exclusion of nearly $3 trillion of business intangible capital stock from official measures and distort the measured patterns of U.S. economic growth.
Numerous approaches have been used to measure the impacts and quality of research programs. With a few notable exceptions, such as the cost-benefit studies conducted for the National Institute of Standards and Technology (NIST) (Polenske and Rockler, 2004), these approaches cannot depict the diffuse and interconnected pathways that lead from research to technologies and other innovations. We particularly agree with a common finding that metrics of research impacts must be viewed with considerable caution and that assessments, therefore, require both metrics and professional judgment. The Australian Group of Eight (Rymer, 2011) explores this issue in depth in the report Measuring the Impact of Research: The Context for Metric Development, issuing strong warnings about the limitations of metrics. The report describes how the impacts of scientific research can be grouped into eight broad categories: effective teaching; advances in knowledge; encouraging additional investment by other parties; financial returns; and economic, social, environmental, and intangible (e.g., national reputation) outcomes. The Group of Eight assessed impact in these categories using several measures, including bibliometrics, benchmarking, peer review, surveys, and counts of patents and spin-offs (Rymer, 2011, pp. 11-17). The authors emphasize that none of the current metrics can provide definitive results.
In addition, data on the outcomes of each of the many steps in a complex research project or technological innovation often are lacking, and the appropriate performance metrics may differ in different phases of a technology’s development. Appropriate metrics also may differ for each type of research and field of science.
Moreover, multiple areas of research often contribute to the development of a technology, so that different measures and data may be needed to evaluate outcomes in such areas as productivity, output, or overall societal benefit. Such measurement and analysis require expertise that is not evident in many federal agencies.11
Federal agencies must guard against the temptation to estimate outcomes, or impacts, directly from a specific federal research program. Each research program makes a direct contribution to an outcome that might be measured in terms of outputs, such as publications, patents, and trained scientists and engineers. But in the vast majority of cases, these outputs are inputs into further development. It is virtually impossible to extrapolate the impact of a single research program forward through multiple levels of development and commercialization because the resulting technology is combined with other technologies before it makes an eventual impact on economic growth or some societal goal. But for research toward a broad technology goal, such as clean energy, the relative contributions of research projects or programs can be assessed at a level sufficient to track progress toward the broad objective.
Because multiple technologies often need to be investigated in the early phases of an R&D project, support for a diverse portfolio of research is beneficial. Indeed, multiple technologies may eventually be combined into a final technology system. To manage diverse research studies at an early stage of technological development, prospective analysis is essential. That is, strategic planning is as important as retrospective impact analyses. Strategic planning studies that examine the entire technology base in question can also identify gaps in the existing technology platforms and infratechnologies.
Another crucial issue with the use of metrics to assess research quality and impacts, one stressed throughout this report, is that knowledge from basic research often underpins applied research. In this way, the benefits of basic research—the discoveries, the infrastructure, the networks, and the scientific workforce—enable applied research, with multiple feedback loops. The value of applied research is in part the value of the underlying scientific knowledge from basic research. This point is illustrated by the decades of basic research leading up to the discovery of an algorithm for Google’s search technology (see Box 4-1), which followed the emergence of a series of university-developed and government-funded Web search and browsing tools such as Lycos and Netscape. Knowledge from basic research allows for the continuous evolution of science. Today’s research is performed in dramatically different ways than it was 10 years ago, thanks to enhanced instrumentation, advances in high-throughput data, the evolution of business models, and the emergence of platforms such as the human genome database and open-access databases.

11One exception is NIST, which has conducted many prospective and retrospective economic studies, as well as studies of impact assessment methodology. More recently, the U.S. Department of Energy has invested in the development of an evaluation framework to guide future impact studies.

Box 4-1. Case Study: Google’s Page-Ranking Algorithm

On the morning of January 10, 1997, there were no festivities to mark the world’s transition into the age of Google. Only a U.S. provisional patent application filed by Stanford University Ph.D. student Lawrence Page marked the occasion. The patent had a somewhat obscure title (Method for Node Ranking in a Linked Database) that blended with those of the other technical applications filed that day. But 16 years later, the page-ranking algorithm underlying Google’s search technology has transformed people’s daily lives.

It is clear from Page’s original patent application that he did not invent the algorithm overnight. The invention drew heavily on multiple discoveries spanning nearly 45 years of social and information sciences research—discoveries made possible by research and development (R&D) funding from four federal science agencies and protected by a handful of seemingly unrelated patents awarded to a university (Carnegie Mellon), corporations (AT&T, Libertech, Lucent, Matsushita), and industrial laboratories (AT&T Bell Laboratories).

Much of the supporting research depended on federal research funds. The original patent application acknowledges support from a National Science Foundation (NSF) grant to the Stanford Digital Libraries project. That acknowledgment was eventually expanded to include three earlier NSF grants that extend back to 1974 and span fields of science as seemingly abstruse as centrality measures, analyses of prominence in international article citation networks, and methods for crawling and cataloguing websites. Twenty research articles cited by Page, covering highly abstract topics such as hypertext link structures, information retrieval, databases, bibliometrics (citation analysis), and social networks, were supported by federal funds from NSF, the National Library of Medicine, the National Institutes of Health, and the National Aeronautics and Space Administration.

The citations in Page’s patent application illustrate the timeless nature of scientific research. The underlying logic of Google’s page-ranking algorithm, for example, is analogous to the 1953 idea that people’s social status increases when they are acknowledged by others who are themselves of high status. In 1965, a researcher examined connections among people to identify flows of social influence and then used those measures to identify social cliques. In 1986, a group expanded this work to differentiate between social statuses that are reflected back through a relationship and those that are derived from a relationship. Unbeknownst to these early scientists, their research would one day form the underpinnings of one of the most transformative innovations in recent history.
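The status-based ranking idea described in Box 4-1—a page matters more when it is linked to by pages that themselves matter—can be sketched as a short power-iteration computation. The following is an illustrative sketch only, not Page’s patented implementation; the graph and parameter values are invented for illustration.

```python
# A simplified power-iteration sketch of the page-ranking idea described
# in Box 4-1: a page's score depends on the scores of the pages linking
# to it. Illustrative only; not the algorithm as patented.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outlinks in links.items():
            if not outlinks:  # dangling node: spread its rank evenly
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
            else:
                share = damping * rank[node] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# "C" receives links from both "A" and "B", so it ends with the highest rank.
```

The damping factor models a reader occasionally jumping to a random page; it also guarantees that the iteration converges.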
Maintaining the expertise of those who conduct world-class research also sustains the innovation system because technological problems frequently arise in the development of an innovation that must be solved through research. In this way, research and innovation are symbiotic, as illustrated by the case study in Box 4-2. Similarly, many aspects of manufacturing contribute to and draw on research (Pisano and Shih, 2012).
Could the scientists who laid the groundwork for Google or wireless communication, their peers, or for that matter any metrics available today have predicted the multimillion-dollar value of their original work? Is it possible to predict which projects undertaken today will lead to unfathomable transformations in the lives of future generations? Will metrics help protect seemingly obscure projects that could one day hold the key to these transformations, or will they encourage their dismissal? These are the kinds of questions raised by the case studies in Boxes 4-1 and 4-2.
Bibliometrics, for example, would not have flagged the supporting citations in the patent application for Page’s Google search algorithm (see Box 4-1) as particularly high impact during the years surrounding the initial appearance of those publications. Page’s discovery of the algorithm itself was first reported in Computer Networks, an archival journal with a relatively low impact factor (a measure of the average number of citations of articles published in the journal) of 1.2, as determined by the Institute for Scientific Information.
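For reference, the journal impact factor cited above is, in rough terms, simple arithmetic: citations received in a given year to items a journal published in the two preceding years, divided by the number of citable items published in those years. The sketch below uses invented numbers for illustration.

```python
# Simplified illustration of a journal impact factor: citations received
# in one year to articles published in the two preceding years, divided
# by the number of citable articles in those years. Numbers are invented.

def impact_factor(citations_to_prior_two_years, articles_in_prior_two_years):
    return citations_to_prior_two_years / articles_in_prior_two_years

# e.g., 180 citations this year to 150 articles published over the
# previous two years:
factor = impact_factor(180, 150)
# → 1.2, comparable to the figure quoted above for Computer Networks
```

The simplicity of the calculation is part of the point: an average over a journal says little about the eventual importance of any single article in it.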
What, then, about metrics for talent? Clearly, the importance of talent cannot be overstated. But it also cannot be fully captured by metrics available today, particularly by counts of academic degrees. Page, for example, is as talented as they come, but he never earned a Ph.D., leaving Stanford with a master’s degree before assuming the role of Google’s founding CEO.

Box 4-2. Radio Astronomy and Wireless Communication

This case study illustrates how an application of research—the processing of signals over telephone lines—led to basic research on more efficient processing methods, which led in turn to discoveries in radio astronomy and to many innovations, including wireless communication.

The fast Fourier transform, a computational technique, was developed by James W. Cooley and John W. Tukey at Bell Laboratories for the efficient analysis of sound waves to improve the transmission of conversations over telephone lines. The technique enabled the solution of signal processing problems in real time, at the rate at which the signal was received.

The technique was later used by radio astronomers to discern signals from background noise. John O’Sullivan developed a key patented refinement of the technique as the result of a failed experiment aimed at detecting exploding mini black holes. That refinement later enabled wireless transmission, whose development had been impeded by the interference of signals with their own reverberations off objects in their path.
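The fast Fourier transform described in the case study above can be illustrated with a minimal recursive Cooley-Tukey implementation for inputs whose length is a power of two. This is a sketch for illustration, not a production signal-processing routine.

```python
# A minimal recursive Cooley-Tukey fast Fourier transform, valid for
# input lengths that are powers of two. Illustrative sketch only.
import cmath
import math

def fft(signal):
    n = len(signal)
    if n == 1:
        return list(signal)
    even = fft(signal[0::2])   # transform of even-indexed samples
    odd = fft(signal[1::2])    # transform of odd-indexed samples
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result

# A pure one-cycle cosine sampled at 8 points concentrates its energy in
# frequency bins 1 and 7 (the positive and negative frequency components).
samples = [math.cos(2 * math.pi * t / 8) for t in range(8)]
spectrum = fft(samples)
```

The divide-and-conquer recursion is what reduces the cost from quadratic to n log n operations, which is what made real-time signal processing feasible.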
The committee found that no high-quality metrics for measuring societal impact currently exist that are adequate for evaluating the impacts of federally funded research on a national scale. We reviewed many metrics designed to measure the societal impacts of research, including those proposed and used by other countries, and found them to be useful for certain purposes but of limited utility for drawing broad conclusions about the American research enterprise as a whole. Each metric describes but a part of the larger picture, and even collectively, they fail to reveal the larger picture. Moreover, few if any metrics can accurately measure important intangibles, such as the knowledge generated by research and research training.
Furthermore, previous studies have shown that innovation and failure go hand in hand—another key point emphasized throughout this report—and that metrics can limit the possibility of transformative innovation by fostering an avoidance of failure designed to make the metrics look good. A study by Azoulay and colleagues (2010) compared 73 Howard Hughes Medical Institute (HHMI) investigators with a matched control group of similarly accomplished NIH-funded researchers. The authors initiated the study to test the hypothesis that NIH-funded researchers were deterred from taking risks by that agency’s rigid expectations of outcomes, its short review cycles of about 3 years, and grant renewal policies that discourage taking risks that could result in failure. By contrast, HHMI researchers have greater flexibility in their efforts; are encouraged to focus on long-term outcomes; and work in 5-year cycles, which are more tolerant of failure. The HHMI researchers also undergo a more engaging and informal first review after 5 years. Azoulay and colleagues standardized publication outputs from these two groups of researchers using statistical methods.12 They discovered that HHMI researchers produced 96 percent more high-impact papers and 35 percent more low-impact papers than NIH researchers. In addition, HHMI researchers were awarded six times as many grants and introduced more new keywords into their fields of science. These findings suggest that flexibility and stability in funding, along with a culture that tolerates failure, may inspire researchers to pursue riskier and more innovative research with a greater chance of failure but also a greater likelihood of transformative impact. More formal qualitative judgments about the relative risks assumed by different research programs or portfolios could potentially enable similar evaluations.

12Using a combination of propensity-score weighting and difference-in-differences estimation strategies.
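The estimation strategy named in footnote 12 can be illustrated in stylized form: a difference-in-differences comparison subtracts the change observed in a control group from the change observed in a treated group, so that shared trends cancel out. The sketch below uses invented publication counts, not the Azoulay et al. data.

```python
# A stylized difference-in-differences calculation of the kind referenced
# in footnote 12: compare the change in an outcome (e.g., yearly
# high-impact publications) for a "treated" group against the change for
# a control group over the same period. All numbers are invented.

def mean(values):
    return sum(values) / len(values)

def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Estimated effect = (treated change) - (control change)."""
    return (mean(treated_after) - mean(treated_before)) - (
        mean(control_after) - mean(control_before)
    )

# Hypothetical publication counts before and after a funding change:
effect = diff_in_diff(
    treated_before=[2, 3, 2, 3],   # mean 2.5
    treated_after=[4, 5, 4, 5],    # mean 4.5
    control_before=[2, 3, 2, 3],   # mean 2.5
    control_after=[3, 3, 3, 3],    # mean 3.0
)
# → 2.0 - 0.5 = 1.5 extra publications attributed to the treatment
```

In the actual study, propensity-score weighting was used first to make the two groups comparable; this sketch omits that step and shows only the differencing logic.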
Finally, metrics are only as good as the questions to be answered. They are most effective when their definitions and specific uses have been spelled out clearly in advance.
THE NEED TO MOVE BEYOND CURRENT INDICATORS
There are countless indicators of research performance. As described in Chapter 5, the success of research universities, for example, can be measured by examining university enrollment, NRC research rankings, and graduation statistics. The extent to which scientific knowledge is exchanged can be assessed through bibliometric and social network analyses, which rely on journal publications and citations. In the presence of clear university-defined goals, measuring the patents, licenses, and other products of university technology transfer offices can help identify areas for improvement when these measurements are compared among multiple universities or followed over time.
The real challenge, however, lies in assessing the value of knowledge itself. And the ultimate value of knowledge depends on the people using it and the ways in which it is used. While scientific impacts must be measured according to the final products of knowledge generation—the commercialization of research discoveries, for example—we suggest that these impacts might be further enhanced by focusing greater attention on the means to these ends. Achieving this focus requires more than counting publications, patents, and other traditional measures of research productivity: it requires cultivating a better understanding of the complex system that is the U.S. research enterprise to determine how all of its component parts interrelate, a theme that is explored in detail in Chapter 6.
In the next chapter, we describe some efforts to evaluate the impacts of research and innovation. We also outline the studies that need to be carried out to improve the ability to assess research impacts.