Principles for Developing Metrics
Metrics are tools for supporting actions that allow programs to evolve toward successful outcomes, promote continuous improvement, and enable strategic decision making. Based on the lessons learned from industry, academia, and federal agencies discussed in the previous chapter, the committee offers a set of general principles to guide the development and use of metrics. Although targeted to the Climate Change Science Program (CCSP), many of these general principles have also been proposed elsewhere.1 The principles are divided into three categories: (1) prerequisites for using metrics to promote successful outcomes, (2) characteristics of useful metrics, and (3) challenges in the application of metrics.
1. For example, variations on principles 2, 3, 4, 6, 10, and 11 appear in National Science and Technology Council, 1996, Assessing Fundamental Science, <http://www.nsf.gov/sbe/srs/ostp/assess/start.htm>; principle 7 appears in Creech, B., 1994, The Five Pillars of TQM: How to Make Total Quality Management Work for You, Truman Talley Books, New York, 530 pp.; principles 4 and 9 are captured in Geisler, E., 1999, The metrics of technology evaluation: Where we stand and where we should go from here, Presentation at the 24th Annual Technology Transfer Society Meeting, July 15–17, 1999, <http://www.stuart.iit.edu/faculty/workingpapers/technology/>; and the importance of leadership (principle 1) appears in National Research Council, 1999, Evaluating Federal Research Programs: Research and the Government Performance and Results Act, National Academy Press, Washington, D.C., 80 pp.
PREREQUISITES FOR USING METRICS TO PROMOTE SUCCESSFUL OUTCOMES
1. Good leadership is required if programs are to evolve toward successful outcomes.
Good leaders have several characteristics. They are committed to progress and are capable of articulating a vision, entraining strong participants, promoting partnerships, recognizing and enabling progress, and creating institutional and programmatic flexibility. Good leaders facilitate and encourage the success of others. They are vested with authority by their peers and institutions, through title, an ability to control resources, or other recognized mechanisms. Without leadership, programmatic resources and research efforts cannot be directed and then redirected to take advantage of new scientific, technological, or political opportunities. Metrics, no matter how good, will have limited use if resources cannot be directed to promote the program vision and objectives established by the leader.
2. A good strategic plan must precede the development of metrics.
Metrics gauge progress toward achieving a stated goal. Therefore, they are meaningless outside the context of a plan of action. The strategic plan must include the intellectual framework of the program, clear and realizable goals, a sense of priorities, and coherent and practical steps for implementation. The best metrics are designed to assess whether the effort and resources match the plan, whether actions are directed toward accomplishing the objectives of the plan, and whether the focus of effort should be altered because of new discoveries or new information. Metrics, no matter how good, will have limited use if the strategic plan is weak.
CHARACTERISTICS OF USEFUL METRICS
3. Good metrics should promote strategic analysis. Demands for higher levels of accuracy and specificity, more frequent reporting, and larger numbers of measures than are needed to improve performance can result in diminishing returns and escalating costs.
Preliminary data or results are often good enough to make strategic decisions; additional effort to make them scientifically rigorous might be wasted. Larger numbers of metrics may also promote inefficiencies. For example, if a substantial amount of signed paperwork is required to demonstrate that the federal Paperwork Reduction Act is working, then the metric clearly fails to meet its primary objective.
The frequency of assessment should reflect the needs and goals of the program. Very infrequent assessments are not likely to be useful for managing programs, and overly frequent assessments have the potential to promote micromanagement or to become burdensome. For example, the Intergovernmental Panel on Climate Change (IPCC) assessments are nearly continuous and require an enormous, sustained effort by a large segment of the climate science community.2 For short-term programs, such as the Tropical Ocean Global Atmosphere (TOGA) experiment, frequent scientific assessments would have been nearly useless, because a decade was required to clearly demonstrate some of the most important scientific outcomes.3 On the other hand, process metrics for evaluating progress on the creation and operation of the program would have had value on much shorter time scales.
4. Metrics should serve to advance scientific progress or inquiry, not the reverse.
A good metric will encourage actions that continuously improve the program, such as the introduction of new measurement techniques, cutting-edge research, or new applications or tools. On the other hand, a poor measure could encourage actions to achieve high scores (i.e., “teaching to the test”) and ultimately unbalance the research and development portfolio. The misapplication of metrics could lead to unintended consequences, as illustrated by the following examples:
The author citation index provides a measure of research productivity. If this metric were the only way to measure faculty performance, it could drive researchers to invest more in writing review articles that are cited frequently than in working on new discoveries.
The U.S. Global Change Research Program (USGCRP) has supported efforts to compare major climate models. Convergence of model results (e.g., similar temperature increases in response to a doubling of carbon dioxide) could be a measure of progress in climate modeling. The metric succeeds if it identifies differences in the way physical processes are incorporated in models, which then leads to research aimed at improving understanding of those processes and, eventually, to model improvements and the reduction of uncertainties in model predictions. The metric fails if it creates an unintended bias in researchers who adjust their models solely to bring them into better agreement with one another.
5. Metrics should be easily understood and broadly accepted by stakeholders. Acceptance is obtained more easily when metrics are derivable from existing sources or mechanisms for gathering information.
It is important to avoid creating requirements for measurements that are difficult to obtain or that will not be recognized as useful by stakeholders. The latter is especially difficult for innovative or multidisciplinary sciences that have yet to establish natural mechanisms of assessment. The following examples illustrate these points:
A metric for measuring change in forest cover is the fraction of land surface covered by forest canopy, which is detectable using remote sensing. An area is considered “forest” when 10 to 70 percent of the land surface is covered by canopy. However, the lower threshold would not be viewed as useful by stakeholders. A metric based on this threshold (essentially, forest or not forest) could mean that a severely logged and degraded area with only sparse remaining canopy would still be classified as forest.4 The metric becomes more useful when it is associated with information about land-cover types. For example, a 10 percent threshold might be appropriate for savannah areas, whereas higher thresholds would be required for ecosystems with more continuous canopy cover. More detailed measures of forest cover can also be developed, such as selective removal of specific tree types, changes in species composition, or changes in indices (e.g., seed production, primary productivity, leaf density). However, one can quickly reach a point at which the difficulty of measuring the quantities systematically becomes overwhelming, limiting their use as metrics.
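The contrast between a single fixed threshold and thresholds conditioned on land-cover type can be sketched in a few lines of code. This is a hypothetical illustration only: the function names, the land-cover categories, and the 60 percent rainforest threshold are invented for the sketch, not drawn from the report.

```python
# Hypothetical sketch: classifying "forest" from canopy-cover fraction.
# A single 10 percent threshold labels a severely logged plot (12 percent
# remaining cover) as forest; thresholds conditioned on land-cover type
# give a more defensible answer.

FIXED_THRESHOLD = 0.10

# Illustrative per-ecosystem thresholds (assumed values, not from the
# report): sparse savannah vs. closed-canopy rainforest.
THRESHOLDS_BY_COVER_TYPE = {
    "savannah": 0.10,
    "rainforest": 0.60,
}

def is_forest_fixed(canopy_fraction: float) -> bool:
    """Single-threshold metric: forest or not forest."""
    return canopy_fraction >= FIXED_THRESHOLD

def is_forest_by_type(canopy_fraction: float, cover_type: str) -> bool:
    """Threshold conditioned on land-cover type."""
    return canopy_fraction >= THRESHOLDS_BY_COVER_TYPE[cover_type]

# A heavily logged rainforest plot with 12 percent remaining canopy:
print(is_forest_fixed(0.12))                  # True  (still counted as forest)
print(is_forest_by_type(0.12, "rainforest"))  # False (flagged as degraded)
print(is_forest_by_type(0.12, "savannah"))    # True  (normal for savannah)
```

The point of the sketch is the last three lines: the same observation yields a misleading answer under the one-size-fits-all metric but a sensible one once the metric is associated with land-cover information.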
The number of users is commonly cited as a metric of the usefulness of holdings in data centers.5 However, it is difficult to gather reliable information to support this metric. With the shift to on-line access, most users find and retrieve data via the Internet. Since the actual number of users is not known, data centers count “hits” on their web sites, which are likely to be several orders of magnitude greater than the actual number of users, or “distinct hosts,” which overcount users accessing the site from several different computers.
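The gap between hits, distinct hosts, and actual users can be made concrete with a toy web log. The log entries, host addresses, and file paths below are invented for this sketch; it only illustrates why each count mis-states the number of users.

```python
# Hypothetical sketch of why web-log counts mis-state "number of users".
# Each entry is (requesting host, requested path); all values are made up.
log = [
    ("192.0.2.1", "/data/sst.nc"),
    ("192.0.2.1", "/data/sst.nc"),     # same user reloading the page
    ("192.0.2.1", "/data/style.css"),  # embedded resource, not a real "use"
    ("192.0.2.7", "/data/sst.nc"),
    ("198.51.100.4", "/data/sst.nc"),  # same person from a second machine
]

hits = len(log)                                  # counts every request
distinct_hosts = len({host for host, _ in log})  # counts machines, not people

print(hits)            # 5 hits, though perhaps only two people used the data
print(distinct_hosts)  # 3 hosts, overcounting the multi-machine user
```

Neither count recovers the true number of users: hits inflate by repeated requests and embedded resources, while distinct hosts overcount anyone who accesses the site from several computers, exactly as described above.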
6. Promoting quality should be a key objective for any set of metrics. Quality is best assessed by independent, transparent peer review.
The success of the scientific enterprise and confidence in its results depend strongly on the quality of the research. Although peer review has well-known limitations (e.g., results depend on the identity of the reviewers, there is a tendency to view research results conservatively), it is the generally accepted mechanism to assess research quality. Review occurs throughout the scientific enterprise in the form of peer review of proposals submitted for funding, peer review of manuscripts submitted for publication in journals, and internal and peer review of programs and program outcomes (Boxes 2.1 and 2.2). Peer review also provides the best mechanism for judging when to change research directions and, thus, make programs more evolutionary and responsive to new ideas.
7. Metrics should assess process as well as progress.
The success of any program depends on many factors, including process (e.g., level of planning, type of leadership, availability of resources, accessibility of information) and progress (e.g., addition of new observations, scientific discovery and innovation, transition of research to practical applications, demonstration of societal benefit). The assessment of process as well as progress is important for every program, but its value is particularly high for large, complex programs.
The sheer diversity and complexity of programs such as the USGCRP and the CCSP defies the application of a few simple metrics. Even the assessment of progress depends on the nature and maturity of the effort. Enhancing an existing data set is different from developing a new way to measure a specific variable. Process studies are different from model improvements. Mission-oriented science is different from discovery science. Metrics should reflect the diversity and complexity of the program and the level of maturity of the research. Comprehensive assessment of the program will include processes taken to achieve CCSP goals, as well as progress on all aspects of the research, from inputs to outputs, outcomes, and impacts.
8. A focus on a single measure of progress is often misguided.
The tendency to try to demonstrate progress with a single metric can create an overly simplistic and even erroneous sense of progress. Reliance on a single metric can also result in poor management decisions. These points are illustrated in the following examples:
The predicted increase in globally averaged temperature with a doubling of carbon dioxide has remained in the same range for more than 20 years (see Chapter 4). According to the metric of reducing uncertainty, climate models would seem to have advanced little over that period despite considerable investment of resources. In fact, however, the physics incorporated in climate models has changed dramatically. Incorporation of new processes, such as vegetation changes as a function of climate, is yielding previously unrecognized feedbacks that either amplify or dampen the response of the model to increased carbon dioxide. The result is often greater uncertainty in the range of predicted temperatures until the underlying processes are better understood. New discoveries can also indicate that certain elements of the weather and climate system are not as predictable as once thought. In such cases, significant scientific advance can result in an increase in uncertainty. Rather than relying solely on uncertainty reduction, it may be more appropriate to develop metrics for the three components of uncertainty: (1) success in identifying uncertainties, (2) success in understanding the nature of uncertainties, and (3) success in reducing uncertainties.
Change in biomass is commonly used as a metric to assess the health of marine fisheries. However, this metric fails to recognize the substitution of one species for another (an important indication of environmental change or degradation), interactions among species, and changes in other parts of the food web that result from fishing. Reliance on biomass alone could lead to the establishment of fishing targets that speed the decline of desirable fish stocks or adversely affect other desired species. For example, early management of Antarctic krill stocks strictly on a biomass basis did not account for two facts: (1) most harvesting was in regions that support feeding by large populations of krill-dependent predators such as penguins, whales, and seals, and (2) predator populations can be adversely affected by krill fishing, especially during their breeding seasons.6 A more complex metric or set of metrics that incorporates species composition (multispecies management), information about dependent species (ecosystem-based management), and species distribution and environmental structure (area-based management) would reflect the state of knowledge and lead to better resource management decisions. Combining a biomass-based metric with information from quota-based or fishing-effort-based management practices would provide an approach for sustaining fishery stocks at levels that are both economically and environmentally desired.
CHALLENGES IN THE APPLICATION OF METRICS
9. Considerable challenge should be expected in providing useful a priori outcome or impact metrics for discovery science.
The assignment of outcome metrics implies that we can anticipate specific results. This works well at the level of mission-oriented tasks such as increasing the accuracy of a thermometer. However, much of discovery science involves the unexpected and the outcome is simply unknown. For example, the measurement of atmospheric carbon dioxide concentrations by C.D. Keeling eventually revealed both an annual cycle and a decadal trend in atmospheric composition, neither of which was the original goal of the observation program.7 This remarkable achievement could have been defeated by the strenuous application of outcome metrics aimed at determining whether a reliable “baseline” CO2 level in the atmosphere had been established.
It is difficult to conceive of metrics for serendipity, yet serendipity has resulted in numerous discoveries—from X-rays to Post-it adhesives. Great care must be taken to avoid applying measures that stifle discovery and innovation. The most suitable metrics may be related to process (e.g., the level of investment in discovery, the extent to which serendipity is encouraged, the extent to which curiosity-driven research is supported). The National Science Foundation is highly regarded for its ability to promote discovery science, and its research performance measures focus on processes for developing a scientifically capable work force and tools to enable discovery, learning, and innovation (Table 2.4).
10. Metrics must evolve to keep pace with scientific progress and program objectives.
The development of metrics is a learning process. No one gets it right the first time, but practice and adjustments based on previous trials will eventually yield useful measures and show what information must be collected to evaluate them. Metrics must also evolve to keep pace with changes in program goals and objectives. Scientific enterprises experience considerable evolution as they move through various phases of exploration and understanding. Metrics for newly created science programs, which focus on data collection, analysis, and model development to increase understanding, will tend to focus on process and inputs. As the science matures and the resulting knowledge is applied to serve society, metrics will focus more on outputs and, finally, on outcomes and impacts. As science transitions from the discovery phase to the operational or mission-oriented phase, the types of metrics should also be expected to evolve.
11. The development and application of meaningful metrics will require significant human, financial, and computational resources.
The development and application of metrics, especially those that focus on quality, are far from a bookkeeping exercise. Efforts to assess programmatic plans, scientific progress, and outcomes require substantial resources, including the use of experts to carry out the reviews. Funding to support the logistics of the reviews is also required. The CCSP strategic plan includes a substantial number of assessments and a growing emphasis on measurable outcomes. As these are implemented, the choice of meaningful measures of progress must be deliberate. If the IPCC process is a representative example, the growing emphasis on assessments has the potential to divert increasing resources from research and discovery to assessment.