7
Conclusions and Next Steps
The committee was charged to identify quantitative performance measures and metrics to assess progress in three to five areas of climate change research. The committee began by selecting a representative set of Climate Change Science Program (CCSP) objectives and developing a long list of metrics for each. However, analysis of the measures specific to these objectives showed that a general set of metrics could be developed and used to assess the progress of any element of the CCSP (Chapter 6). This unexpected conclusion, combined with the principles (Chapter 3), led the committee to think about metrics not just as simple ways to gauge progress, but as a tool to guide strategic planning and to foster future progress. The committee believes that the general metrics have the potential to be far more useful to CCSP agencies than a few specific metrics in selected areas of climate change research. The answers to the charge given below are presented in the context of this major conclusion.
ANSWERS TO THE COMMITTEE CHARGE
1. Provide a general assessment of how well the CCSP objectives lend themselves to quantitative metrics.
Meaningful metrics can be developed for all of the CCSP objectives. Some of the metrics will be quantitative, especially those that measure inputs (e.g., amount of resources devoted to the program) and outputs (e.g.,
creation of new products or techniques). However, most will be qualitative, especially those that focus on the research and development process, the outcome of research, and its impact on society. These generally require peer review (e.g., to evaluate scientific quality) or stakeholder assessments.
CCSP objectives range from the general (e.g., overarching goals) to the specific (e.g., milestones, products, and payoffs). The more general the objective, the greater is the number of qualitative contributing factors and the less quantitative are the metrics. For example, improvements to databases of water cycle variables can generally be measured quantitatively, but the resulting improvement in drought prediction models and resource decisions that use those predictions will require increasingly subjective analysis and a greater emphasis on expert assessment.
2. Identify three to five areas of climate change and global change research that can and should be evaluated through quantitative performance measures.
Both the Office of Management and Budget (OMB) and the agencies participating in the CCSP are seeking a manageable number of quantitative performance measures to monitor the progress of the program. The metric cited most often is the reduction of uncertainty (Chapter 4). However, by itself, reduction of uncertainty is a poor metric because (1) uncertainty about future climate states may increase, decrease, or remain the same as more is understood about the governing elements, and (2) the data needed to calculate errors in the probability estimates are limited or nonexistent. The danger of using this metric is that increasing uncertainty might be interpreted as a failure of the program, when the reverse may well be true.
The committee agrees that a limited set of metrics should be chosen. It would be expensive to implement all possible measures, and the results may be difficult for individual agencies to use to manage their programs and demonstrate success to Congress, OMB, and the public. However, the CCSP strategic plan provides neither a sense of priorities nor a definition of success.1 Indeed, a National Research Council review of the CCSP strategic plan noted that “many of the objectives in the plan are too vaguely worded
to determine what will constitute success.”2 Such guidance is essential for narrowing down research areas for which metrics should be developed.
However, even if such guidance were available, focusing on metrics in a few areas of climate change and global change research might not be useful for managing the program and achieving successful outcomes. The key to promoting successful outcomes is to consider the program from end to end, starting with program processes and inputs and extending to outputs, outcomes, and long-term impacts. The general metrics developed by the committee (Box 6.1) provide a starting point for making this evaluation.
3. For these areas, recommend specific metrics for documenting progress, measuring future performance (such as skill scores, correspondence across models, correspondence with observations), and communicating levels of performance.
The list of general metrics can be used for any element of the CCSP. Quantitative measures in that list include the following:
- A multiyear plan that includes goals, a focused statement of task, implementation, discovery, applications, and integration.
- Sufficient commitment of resources (i.e., people, infrastructure, financial) directed specifically to allow the planned program to be carried out.
- Creation of synthesis and assessment products that incorporate new developments.
Generic quantitative metrics developed elsewhere for research and development (Appendix C) are also applicable to CCSP research elements. However, although these measures are useful for management, few scientific programs would wish to be judged on such terms. A mixture of qualitative and quantitative metrics would better capture the scope of CCSP objectives. A similar conclusion has been reached about measuring progress in other science programs.3
2. National Research Council, 2004, Implementing Climate and Global Change Research: A Review of the Final U.S. Climate Change Science Program Strategic Plan, The National Academies Press, Washington, D.C., p. 26.
3. National Research Council, 1996, World-Class Research and Development: Characteristics for an Army Research, Development, and Engineering Organization, National Academy Press, Washington, D.C., 72 pp.; National Science and Technology Council, 1996, Assessing Fundamental Science, <http://www.nsf.gov/sbe/srs/ostp/assess/start.htm>; Cozzens, S.E., 1997, The knowledge pool: Measurement challenges in evaluating fundamental research programs, Evaluation and Program Planning, 20, 77–89; National Research Council, 1999, Evaluating Federal Research Programs: Research and the Government Performance and Results Act,
Although worded generically, the metrics listed in Table 6.1 can be rephrased to be specific to the program element being evaluated. This task is best carried out by agency managers because they have more complete knowledge of the program than any outside group could have. Moreover, the process of refining the metrics will be as valuable to the agencies as the measures themselves. A process for narrowing down and rephrasing the committee’s list of general metrics is described in the following section.
4. Discuss possible limitations of quantitative performance measures for other areas of climate change and global change research.
Quantitative metrics can be developed for any CCSP objective. However, because quantitative metrics primarily (and only partially) measure inputs and outputs, they tell only a fraction of the story. The outcomes and impacts, which are the program results most visible to the public and usable to decision makers, are much more likely to be qualitative.
It may take years or even decades to assess the impact of CCSP programs, even though many are scheduled to produce results within two to four years. Answers to impact metrics will reflect the maturity of the programs as well as the complexity of the problems being analyzed. Consequently, many impact metrics developed for the CCSP will serve as a reminder of program goals, rather than a litmus test of achievement. Importantly, the CCSP will, with time, yield many unanticipated benefits because it supports discovery and innovation. General metrics that support successful outcomes, scenario planning, and other strategic improvements are more likely to reveal these unanticipated benefits than tightly specified, short-term objectives. A variety of such successful outcomes and impacts have already emerged from climate change programs that operated under the U.S. Global Change Research Program (USGCRP).
IMPLEMENTATION
Lessons from industry, academia, and federal agencies suggest that metrics are best used to support actions that allow programs to evolve toward successful outcomes. Implementation of metrics should therefore be strategic and evolutionary, rather than fixed and prescriptive.

3. (continued) National Academy Press, Washington, D.C., 80 pp.; Geisler, E., 2000, The Metrics of Science and Technology, Quorum Books, Westport, Conn., 380 pp.; National Research Council, 2001, Implementing the Government Performance and Results Act for Research: A Status Report, National Academy Press, Washington, D.C., 190 pp.; National Research Council, 2003, The Measure of STAR: Review of the U.S. Environmental Protection Agency’s Science to Achieve Results (STAR) Research Grants Program, The National Academies Press, Washington, D.C., 176 pp.
This report provides a guide to best practices (principles in Chapter 3) and a list of general metrics (Box 6.1) that the CCSP can use to evolve toward successful outcomes. The principles define prerequisites for assessing and enabling this evolution (e.g., leadership), as well as characteristics of useful metrics. The general metrics provide a way to think strategically about the program. Together, the principles and general metrics provide a framework for considering specific implementation issues. These range from how to evaluate the program, to ensuring that the metrics are reliable and valid, to factoring in the cost of evaluating and adjusting the measures on a regular basis.
Using the General Metrics
The way in which the general metrics are used depends both on the identity of the evaluators and on the granularity of the program or program elements being evaluated. For example, an agency manager might quickly provide rough answers or scores to all of the general metrics for his or her program, identifying leaders, availability of resources, opportunities for innovation, the functioning of the peer review process, evidence of new techniques, number of peer-reviewed publications, and so forth. This assessment would allow the manager to gauge the strengths and weaknesses of the program and then determine an appropriate course of action.
Expert panels might use the general metrics as a framework for deeper exploration of issues, such as the value of different measurement techniques, assessment of the capabilities of new models, or the results of process studies. For example, output metric 1 (the program produces peer-reviewed and broadly accessible results, such as new and applicable measurement techniques) might prompt a thorough analysis of the types of new measurement techniques that were developed, as well as their veracity, limitations, and acceptance as a tool by the broader community. Expert panels may view process and input metrics only in hindsight, for example, tracing gaps in knowledge to limitations in program planning. Finally, an evaluation by stakeholders would likely emphasize outcomes and impacts. Depending on the nature of the program, this evaluation could be highly technical (e.g., the performance of a new flood forecasting model) or highly subjective (e.g., whether global change research has had noticeable impact on international policies).
Individual programs could easily spawn a set of quite specific metrics. For example, if the program plan calls for a doubling of the density of ocean buoys in the tropical Pacific, a metric might be the fraction of the new system that has been deployed. This program-dependent measure falls
within the scope of input metric 4 (Box 6.1). Higher-level indicators (e.g., the degree to which the buoy array has improved seasonal to interannual forecasting or has outcomes that are recognized and utilized by different stakeholders) prompt equally specific output and outcome metrics and alternative modes of evaluation (e.g., expert review). The general metrics in Box 6.1 provide the categories to be evaluated, but they will have to be narrowed down and reworded in terms that are specific to the objective or program plan being evaluated (see examples in Table 6.1). In this manner, the general metrics serve as a template for evaluating programs and promoting progress.
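The program-dependent measure in the buoy example amounts to a simple ratio. The sketch below (a hypothetical illustration only; the buoy counts are invented and the CCSP prescribes no such calculation) shows how such a metric might be computed:

```python
# Hypothetical illustration: a program-specific quantitative metric
# (fraction of a planned ocean-buoy expansion actually deployed),
# of the kind that would fall under input metric 4 in Box 6.1.

def deployment_fraction(deployed: int, planned: int) -> float:
    """Fraction of the planned new observing system now in place."""
    if planned <= 0:
        raise ValueError("planned buoy count must be positive")
    return deployed / planned

# Suppose the plan calls for 70 new buoys (doubling a 70-buoy array)
# and 49 of the new buoys have been deployed so far.
print(f"deployment fraction: {deployment_fraction(49, 70):.0%}")
```

A higher-level indicator, such as the resulting improvement in seasonal forecasts, would not reduce to a single ratio in this way and would require expert review, as noted above.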
It is important to note that the development of an optimum set of metrics is an iterative process. No one gets it right the first time. However, the process itself will yield valuable information about the program and how to continuously improve it.
Refining the Metrics
Once the key metrics have been identified, they must be refined to ensure that biases are recognized and minimized. An evaluation system must also be developed. An overview of these issues is given below.
Bias, Reliability, and Repeatability
Any measure that contains subjective factors or relies on judgments introduces estimation errors, biases, or inaccurate perceptions.4 Yet subjective judgments are essential to evaluate both the scientific program elements of the CCSP and the usefulness of the resulting knowledge to users. Peer review, normally used to evaluate science quality, is subject to bias (e.g., against those who challenge conventional wisdom) and may not yield the same results from year to year.5 User evaluations, often used to gauge the importance of knowledge that results from a program, are biased toward high satisfaction with free services. Having an appropriate mix of expertise on the evaluation team will minimize the chances of different groups obtaining different results and thereby increase the reliability and repeatability of the results.6 The reliability of subjective measures could also be increased
by aggregating multiple judgments or requiring different assessment teams to arrive at a consensus.7
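One simple way to aggregate multiple judgments, as suggested above, is to average independent reviewer scores and report their spread as a rough check on repeatability. The sketch below is illustrative only; the 1-5 scale and the reviewer scores are invented:

```python
from statistics import mean, stdev

# Hypothetical illustration: five reviewers independently score one
# qualitative metric on a 1-5 scale. Averaging damps individual bias;
# a large spread flags low repeatability and suggests the panel should
# discuss the metric further and seek consensus.

scores = [4, 5, 3, 4, 4]  # invented scores from five reviewers

avg = mean(scores)
spread = stdev(scores)  # sample standard deviation across reviewers

print(f"aggregated score: {avg:.1f}")
print(f"spread (std dev): {spread:.2f}")
if spread > 1.0:
    print("low agreement -- reconvene the panel to seek consensus")
```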
Quantitative measures do not suffer from these limitations, although their objective nature can lend a false sense of credibility and validity.8 They also overlook time lags that might bias the measurements. For example, research and development departments that have the same profit-to-expenditure ratios might produce results on different time horizons.
Finally, the selection of metrics themselves introduces biases and may also influence behaviors.9 The values of the decision maker or evaluator are often mirrored in the selection and weighting of the measures. Also, once the metrics are known, they can bias behaviors to meet expectations built into the measures. The metric of number of papers published, for example, may lead scientists to publish a greater number of articles on the same research results. Being cognizant of these issues can sometimes minimize the influence of bias on metrics.
Aggregating Qualitative and Quantitative Measures
A suite of different kinds of metrics has been shown to be effective for science and technology programs. Such measures can be aggregated and compared using a variety of techniques. Formulas can be developed in which each class of measures is subjectively assigned a different weight. In the OMB Program Assessment and Rating Tool (PART) analysis, for example, the major classes of measures are weighted as follows:
- program purpose and design: 20 percent
- strategic planning: 10 percent
- program management: 20 percent
- program results and accountability: 50 percent (Appendix A, Box A.2)
Individual metrics can also be aggregated into more comprehensive measures that may include both quantitative and qualitative elements.10 The different measures are assigned a weight and the qualitative measures are converted to numbers (e.g., an ordinal score or 0-100 percent of the
reference value). Care must be taken, however, to aggregate only measures that are well correlated.11
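The weighted-sum approach can be sketched as follows, using the PART class weights cited above. The per-class scores are invented for illustration, and qualitative judgments are assumed to have already been converted onto a common 0-100 scale:

```python
# Hypothetical illustration of weighted aggregation, using the OMB PART
# class weights cited in the text. The class scores (0-100) are invented;
# in practice each would come from converting quantitative measures and
# qualitative judgments (e.g., ordinal scores) onto a common scale.

PART_WEIGHTS = {
    "program purpose and design": 0.20,
    "strategic planning": 0.10,
    "program management": 0.20,
    "program results and accountability": 0.50,
}

def composite_score(class_scores: dict) -> float:
    """Weighted sum of per-class scores, each on a 0-100 scale."""
    assert set(class_scores) == set(PART_WEIGHTS), "score every class"
    return sum(PART_WEIGHTS[c] * s for c, s in class_scores.items())

example = {
    "program purpose and design": 80,
    "strategic planning": 70,
    "program management": 90,
    "program results and accountability": 60,
}
print(f"composite score: {composite_score(example):.0f} / 100")
```

As the text cautions, such a composite number should be reported together with its context and the definitions of the underlying scores, not on its own.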
Finally, it is important to remember that any number describing a research activity or application depicts it imperfectly. Thus, agencies should not rely exclusively on the score of a particular metric or suite of metrics. The context, definition of scores, and commentary are at least as important as the specific answer or score and should be included in the formal evaluation.
Cost of Evaluating Metrics
The cost of developing and evaluating metrics must be balanced against the needs and resources of the program. Developing an effective combination of quantitative and qualitative measures, and adjusting them as experience reveals which are most useful, can take considerable time.12 Professional training may even be required to develop qualitative measures that have validity and reliability. Collecting information to evaluate the metrics and normalizing and interpreting the data often take significant time, although time costs decline with subsequent evaluations. The highest time costs are for peer review evaluations.13 Rather than peer review for every component of the CCSP, such investments should be targeted to improve management and performance of key program elements. All of these costs must be factored into determinations of how often the program should be evaluated to capture its impact over time.
The committee believes that a system of metrics, developed through an iterative process and evaluated in consultation with stakeholders, could be a valuable tool for managing the CCSP and further increasing its usefulness to society. For these metrics to be of real value, they must be implemented in a constructive fashion, following the guiding principles outlined in this report. That will require a great deal of thought by individual CCSP agencies as well as by the CCSP as a whole. Then, it will take time to determine whether these metrics help create a stronger and more successful CCSP. Thus, this report should be viewed as the first step and not as an end.