Proceedings of a Workshop
Methodologies for Evaluating and Grading Evidence
Considerations for Public Health Emergency Preparedness and Response
Proceedings of a Workshop—in Brief
Since 2001, there has been significant investment in research aimed at understanding the effectiveness of public health emergency preparedness and response (PHEPR) practices and how to improve them. However, to date there has been little effort to systematically compile, assess, and summarize existing PHEPR research and evaluations to enable the identification and implementation of evidence-based practices. Therefore, the Centers for Disease Control and Prevention (CDC) requested that the National Academies of Sciences, Engineering, and Medicine (the National Academies) convene an ad hoc committee to (1) develop the methodology for conducting a review and grading of the evidence base for PHEPR practices; (2) apply the evidence review methodology to a selection of PHEPR practices and make recommendations on practices communities and state, local, tribal, and territorial public health agencies should or should not adopt based on evidence of effectiveness; and (3) provide recommendations for future research needed to strengthen the evidence for specific PHEPR practices and to improve the overall quality of evidence in the field.
On July 26, 2018, the Committee on Evidence-Based Practices for Public Health Emergency Preparedness and Response held a 1-day workshop featuring two panel sessions to hear from experts in evidence synthesis and grading.1 Given the context-sensitive nature of PHEPR practices and their focus on systems and processes, the committee sought to explore evidence assessment methodologies from a variety of disciplines that face similar challenges in identifying evidence-based practices. The first panel of experts from the fields of public health, medicine, transportation, education, and labor was assembled to assist the committee in gathering information regarding what can be learned and potentially adapted from existing evidence-grading frameworks and criteria that can be used to assess the strength of evidence and inform practice recommendations. Following the first workshop session, the initial presenters were joined by a second panel with additional experts from the fields of international development, aerospace medicine, and aviation safety to help the committee further explore how evidence generated from sources other than traditional research studies, including mechanistic evidence, modeling, expert judgment, case studies, and after action reports (AARs),2 has been used to inform recommendations and decision making.
This Proceedings of a Workshop—in Brief highlights key points made by workshop participants during the presentations and discussions. It is not intended to provide a comprehensive summary of information shared during the workshop. The statements in this proceedings reflect the knowledge and opinions of individual workshop participants and have not been endorsed or verified by the National Academies or the committee; they should not be construed as reflecting the consensus of the workshop participants, the committee, or the National Academies. The committee’s consensus study report will be available in 2020.
1 The workshop agenda is available at http://nationalacademies.org/hmd/Activities/PublicHealth/PublicHealthPreparedness/2018-July-26.aspx (accessed June 3, 2019).
2 AARs are evaluations that public health agencies and other response partners generate following incidents and emergencies to document and disseminate past experiences and lessons learned in an effort to improve future performance.
EXPLORING FRAMEWORKS FOR EVALUATING THE CERTAINTY OF EVIDENCE
The workshop began with an exploration of existing frameworks used in different fields for synthesizing and assessing a body of evidence and developing recommendations. Holger Schünemann presented on Grading of Recommendations Assessment, Development and Evaluation (GRADE),3 which, he said, was developed initially to provide an explicit and transparent process for evidence grading and guideline development in medicine, but has since been adapted for use in other fields, including public health. GRADE has also evolved to accommodate qualitative evidence, as discussed by Jane Noyes,4 who presented on GRADE-CERQual (CERQual), a recently developed companion framework to GRADE. Randy Elder described the review methods of CDC’s Guide to Community Preventive Services (the Community Guide),5 which he said were specifically developed to address the diversity of interventions and audiences in public health.
Speakers from outside the health field included Jeffrey Valentine,6 who presented on the U.S. Department of Education’s What Works Clearinghouse (WWC), which conducts evidence reviews on education-related policies, programs, and products, and Demetra Nightingale,7 who described the U.S. Department of Labor’s (DOL’s) Clearinghouse for Labor Evaluation and Research (CLEAR), a repository of evaluations of studies on labor practices. The panel concluded with Kristie Johnson,8 who spoke with the committee about the National Highway Traffic Safety Administration’s (NHTSA’s) Countermeasures That Work guidance publication, which was designed to assist state highway officials with the selection of evidence-based countermeasures to reduce traffic safety problems. Presentations during the first panel session focused on describing each of the evaluation frameworks and the criteria (or domains) that are applied to rate the strength, or certainty, of evidence and to guide recommendations.9
Presenting on the GRADE framework, Schünemann began by describing the interrelated factors in GRADE that may reduce or strengthen the certainty of evidence—which, he said, can be understood as the confidence in the estimate of an effect or an association—and the judgment process by which a body of evidence is rated as having high, moderate, low, or very low certainty for a given outcome (Guyatt et al., 2011). In GRADE, the certainty of evidence begins as high but is downgraded based on risk of bias, inconsistency of the findings, indirectness of evidence to the research question, imprecision, and publication bias. Certainty may be upgraded for a large magnitude of effect, residual confounding that opposes the direction of effect, or a dose–response gradient. GRADE does not have an insufficient evidence category, he explained, because there is always some evidence that can inform an answer to the question, even if the certainty in that evidence is very low because of indirectness. It may be necessary to rely on evidence from different populations or settings and, citing past reviews using GRADE on interventions related to avian influenza and toxic chemical exposures (Schünemann et al., 2007), Schünemann suggested that indirectness will likely be a key feature in assessments of PHEPR evidence.
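The rating process Schünemann described can be sketched as a simple level-based scoring exercise: certainty starts at one of four levels and moves down one level for each serious concern (risk of bias, inconsistency, indirectness, imprecision, publication bias) and up one level for each upgrading factor (large magnitude of effect, opposing residual confounding, dose–response gradient). The sketch below is an illustrative toy, not an official GRADE tool; the function name and interface are invented for this example.

```python
# Illustrative sketch of GRADE's level-based certainty rating.
# This is NOT an official GRADE implementation; names are invented.

LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(start="high", downgrades=0, upgrades=0):
    """Map counts of downgrading and upgrading judgments to a GRADE level.

    Randomized evidence conventionally starts at "high"; each serious
    concern moves the rating down one level, each upgrading factor moves
    it up one level, and the result is clamped to the four GRADE levels.
    """
    idx = LEVELS.index(start) - downgrades + upgrades
    return LEVELS[max(0, min(idx, len(LEVELS) - 1))]

# Randomized trials with serious risk of bias and serious imprecision:
print(grade_certainty("high", downgrades=2))  # "low"
# Observational evidence showing a large magnitude of effect:
print(grade_certainty("low", upgrades=1))     # "moderate"
```

Note that, consistent with Schünemann's point that GRADE has no insufficient evidence category, the rating bottoms out at "very low" rather than dropping off the scale.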
Risk of bias is increased in the absence of randomization, so nonrandomized studies (e.g., observational studies) are automatically downgraded to having a low certainty of evidence, explained Schünemann. When asked whether it is fair to downgrade the certainty of evidence when it may not be possible to conduct randomized controlled trials, as is likely to often be the case in PHEPR, he maintained that the evaluation of certainty in the evidence should not change because the question does not lend itself to types of studies that provide greater confidence. However, he asserted, recommendations should be based on more than just evidence of effectiveness.
He described the Evidence to Decision (EtD) framework, which defines additional criteria that should be considered when making a recommendation, including values, resources, cost-effectiveness, equity, acceptability, and feasibility (Alonso-Coello et al., 2016). Schünemann explained that the EtD framework is an important part of the GRADE approach, which considers the strength of recommendations independent of, though informed by, the certainty of the evidence. He acknowledged that, in his experience, most decisions are made based on low or very low certainty of the evidence, but the EtD framework
3 Holger Schünemann is GRADE Co-Chair; Co-Director, World Health Organization Collaborating Centre for Evidence-Informed Policy; Professor and Chair, Department of Clinical Epidemiology and Biostatistics; and Professor, Department of Medicine, McMaster University.
4 Jane Noyes is Professor in Health and Social Services (Research and Child Health), Bangor University.
5 Randy Elder is a health scientist and former Scientific Director for Systematic Reviews, Community Guide Branch, Centers for Disease Control and Prevention.
6 Jeffrey Valentine is Professor and Program Coordinator, Educational Psychology, Measurement and Evaluation, Department of Counseling and Human Development, College of Education and Human Development, University of Louisville.
7 Demetra Nightingale is an Institute Fellow, Urban Institute.
8 Kristie Johnson is Research Psychologist, Office of Behavioral Safety Research, National Highway Traffic Safety Administration.
9 The terms strength of evidence and certainty of evidence are often used interchangeably.
enables the reviewer to still make the case for a strong recommendation.10 Recommendations supported by evidence rated as low or very low certainty may be accompanied by a request for reevaluation as more evidence becomes available. The EtD criteria should themselves be informed by research evidence and, in some cases, separate systematic reviews, he added, noting that it may be important to use a framework like CERQual to consider the certainty of the evidence for some EtD criteria, such as values and preferences.
Focusing on qualitative evidence, Noyes provided an overview of the process by which findings from a body of qualitative studies are assessed using CERQual, yielding confidence ratings that, as with GRADE, range from high to very low (Lewin et al., 2015, 2018). Qualitative evidence synthesis, she said, can answer questions about intervention heterogeneity, acceptability, feasibility, reach, and implementation, and it may also inform reviews of intervention effectiveness by, for example, helping to identify important outcomes. Qualitative evidence synthesis is increasingly being used in evaluations of complex health interventions and to examine system adaptivity,11 both of which, she suggested, would be relevant to the PHEPR context. In her experience, however, decision makers often need to be educated as to the value of qualitative evidence so that it is not dismissed.
The CERQual criteria for evaluating qualitative evidence align closely with those used in GRADE and include methodological limitations, coherence, adequacy, and relevance. A fifth criterion, dissemination bias, is under development and would be analogous to the GRADE publication bias domain. Noyes spoke about the importance of diversity of evidence in a qualitative review because “there’s no single truth in qualitative research.” What is important, she said, is to identify the different patterns in the results and understand why these differences are found. Capturing the needed diversity often necessitates the inclusion of studies with lower methodological rigor, she acknowledged, emphasizing that there are always trade-offs that have to be considered.
Continuing the theme of evaluating bodies of evidence and identifying relevant patterns, Elder described how the Community Guide uses a weight-of-evidence approach to provide evidence-based guidance on interventions to improve public health.12 In this approach, which he said is the cornerstone of the Community Guide methods, both the quality and the quantity of available evidence are considered (Briss et al., 2000). The Community Guide considers all study designs that are appropriate for effect estimation in its evidence reviews, because, as Elder explained, in public health it is often difficult to conduct randomized controlled experiments and consequently there are many observational studies. When extraneous factors cannot be controlled in the study design, reviewers must rely on triangulation and hypothesis testing to rule out alternative explanations for apparent effects. By comparing across independent data sets, he explained, triangulation enables the identification of patterns. Consistency across independent studies increases confidence in the intervention effect, as does consistency of results with an a priori expectation, he said, referencing the analytic frameworks that are developed in the Community Guide reviews, which show the hypothesized causal pathway between the intervention and the outcomes of interest.
The chain of evidence depicted in these analytic frameworks can be used to infer the intervention effect in the absence of direct evidence.13 However, Elder emphasized that there are limits to aggregating data. Pervasive threats to validity across the body of evidence result in downgrading the conclusion on strength of evidence (Briss et al., 2000). If there are only low-quality studies, the Community Guide will not consider the evidence to be strong no matter how many are included in the body of evidence, Elder said. Moreover, unlike GRADE, the Community Guide does employ an insufficient evidence rating when the reviewers conclude the evidence base does not support a recommendation for or against an intervention. The Community Guide is similar to GRADE, however, in that it explicitly considers features other than strength of evidence when making a recommendation. Recommendations also take into account applicability, equity, benefits, harms, barriers, and evidence gaps. Evidence of an important harm or health inequity, or an upgrading factor such as a large effect size, may justify strong conclusions and a recommendation despite limited evidence.
Shifting from health-related grading frameworks to other fields that use evidence evaluation methods, Valentine discussed the two major products produced by WWC14—Intervention Reports and WWC Practice Guides for Educators™. Both are based on systematic reviews of education interventions, but only the WWC Practice Guides include recommendations for practitioners. For each of the recommendations, he explained, the supporting evidence is rated as strong, moderate, or minimal. The evidence rating is guided by a rubric that addresses many of the same factors as other evidence-grading frameworks, including internal validity, applicability, and the strength and consistency of the intervention effects on relevant outcomes (WWC, 2017a,b).
10 In a GRADE Working Group paper, five “paradigmatic cases” are identified that provide reasons why evidence with a low certainty of effect could still result in strong recommendations, such as when an intervention has a low certainty of effect but inaction would result in a life-threatening consequence, or when there is low certainty of a very serious side effect of an intervention (Neumann et al., 2016).
11 System adaptivity, in this context, describes how a system changes in response to the implementation of an intervention.
13 For an example analytic framework, see the supporting materials available at https://www.thecommunityguide.org/findings/obesity-multicomponent-interventions-increase-availability-healthier-foods-and-beverages (accessed June 3, 2019).
WWC, like the Community Guide, sets quality thresholds to determine the primary studies that will be considered in the evidence synthesis. Valentine said that studies can be rated as meeting WWC standards without reservations, meeting WWC standards with reservations, or not meeting the standards. Studies that do not meet the WWC standards are not included in the evidence synthesis. Given the paucity of randomized controlled trials available when WWC was first created, the standards were designed to be methodologically inclusive, with different standards for particular study types (e.g., randomized and nonrandomized studies, regression discontinuity studies, and single case studies) but, he acknowledged, WWC does not evaluate evidence from study designs it considers noncredible in the education context, including simple pretest and posttest studies.
The WWC methods have been adapted for evidence synthesis and evaluation outside of the education field, as discussed by Nightingale, who described how they have been applied to the assessment of labor-related interventions included in DOL’s evidence-based clearinghouse, CLEAR.15 She went on to describe the broader federal evidence-based agenda and the proliferation of evidence-based clearinghouses,16 acknowledging the need for common evidence terminologies, standards, and guidelines across agencies. Speaking of DOL’s CLEAR, she emphasized that it was important to have a framework that would work across the full spectrum of DOL focus areas, which range from job training to mine safety and health. She explained how, in addition to rating individual study quality, CLEAR uses effectiveness ratings for causal impact studies that indicate whether an intervention was favorable, unfavorable, or had mixed results (CLEAR, 2014, 2015). Rating these aspects separately, she said, helps to determine whether a practice is promising when dealing with discordant results from different studies.
Evaluating diverse types of evidence for behavioral interventions was a focus of Johnson’s presentation, and she began by clarifying that Countermeasures That Work focuses on evidence-based countermeasures that seek to change behaviors associated with traffic safety problems (e.g., speeding, seatbelt use, impaired driving) (Richard et al., 2018).17 Thus, she emphasized, the guide covers a variety of different intervention types, including educational programs, enforcement mechanisms, and legislation. The publication uses a five-star rating system, which, she said, NHTSA has found to be a quick and easy method for conveying effectiveness information to state highway officials. She explained that randomized controlled trials, of which there may be few, are not necessary for a countermeasure to achieve a five-star effectiveness rating. To rate the effectiveness of countermeasures, the guidance team first considers results from meta-analyses and systematic reviews but also includes other kinds of studies and even program evaluations. Consistent results from good-quality nonrandomized studies may contribute to a five-star rating, said Johnson. She added that intermediate outcomes, such as changes in knowledge and behavior, may be used to supplement crash outcome data or to evaluate effectiveness in its absence.
CONSIDERING CONTEXT: HOW, WHERE, WHEN, AND WHY DOES IT WORK?
Recognizing that the effectiveness of interventions often depends on the conditions in which they are applied and the specifics of how they are implemented, panelists were asked to comment on the challenges associated with assessing external validity or generalizability—which describes the applicability of evidence to contexts other than the one in which it was generated. Elder noted that there are two levels at which questions about applicability arise. The first, he said, relates to the applicability of a given piece of evidence to a recommendation. The second, which he submitted may be a more important consideration, is the applicability of a recommended practice to a situation or context in which it may be implemented.
He went on to describe the formal process by which the Community Guide evaluates certainty in generalizability separately from certainty in evidence of effectiveness, both of which inform recommendations. First, subject-matter experts make a priori judgments on generalizability with respect to a contextual factor under consideration. The available evidence is then assessed for concordance with those expectations, and certainty is rated as a function of those two factors. Schünemann discussed the potential to use nonrandomized studies to examine potential subgroup effects. He added that the absence of direct evidence for the many different contexts in which a practice may be implemented will contribute to a low certainty of evidence, necessitating an acknowledgment of the uncertainty in which decisions will need to be made.
Noyes suggested that there are a number of tools and emerging research methods that may help reviewers better understand the effects of context on outcomes, including a technique called qualitative comparative analysis and the intervention complexity assessment tool for systematic reviews, which, she said, can help reviewers explore heterogeneity and inform subgroup analyses (Lewin et al., 2017). She emphasized that qualitative evidence, such as that generated from focus groups, interviews, and observations, is well suited to answering questions about for whom and in what context an intervention worked. Nightingale underscored the need for additional work in the field to develop methods that bring together internal and external validity.
16 Nightingale referenced the Foundations for Evidence-Based Policymaking Act, which was subsequently enacted in January 2019 (https://www.congress.gov/bill/115th-congress/house-bill/4174, accessed June 3, 2019).
The panel was asked to consider whether it is reasonable to assume that an intervention has some “true effect”18 or whether, instead, the effect may be so inextricably linked to contextual factors—only some of which may be known—that the concept of a true effect does not apply. In such cases, the intervention’s effect on the outcome of interest is specific to the context in which it was implemented, and the evidence synthesis process would focus not on describing and judging certainty in the effect but rather on determining whether the evidence supports or refutes a theory about how the intervention works. Elder and Schünemann acknowledged that their respective frameworks for grading evidence and developing evidence-based guidelines assume that there is a true effect. Elder conceded, however, that when there is substantial effect modification that is difficult to quantify, the measure of the true effect becomes nearly meaningless. An alternative approach, he offered, is to narrow the scope of the review on the assumption that there is a true effect for a given context. Where there is at least sufficient evidence of effectiveness in a certain context, the Community Guide makes a split recommendation, which, he explained, means that the practice is recommended in one or more contexts but not others. As an example, Elder referred to a Community Guide review on school dismissals to reduce pandemic influenza,19 noting that the recommendation for use of this PHEPR practice was contingent on the pandemic’s severity.
Considering the impact of context on evidence of effect and decision making regarding the implementation of interventions was a focus of Michael Woolcock’s20 presentation during the second panel, which addressed the challenges and approaches for assessing generalizability in evidence evaluations conducted in the international development field. Recognizing the significant progress made in the evaluation of internal validity, Woolcock asserted that very little is known about how to build evidence for external validity that can inform generalizability across time, settings, groups, and scale. When strong evidence indicates that an intervention or program was effective in one setting, it is often assumed that the intervention will be transferable to other contexts. But, he added, these assumptions have not always held up in development, where a number of factors, not the least of which is the differing implementation capability of systems, have contributed to disappointing results (Woolcock, 2013).
Woolcock explained that systematic review processes are built around the idea that we can discern what works by aggregating evidence on the basis of a hierarchy of research. He emphasized, however, that complex systems inherently generate large amounts of variation. He suggested that such variation can be a source of learning if we can work out how best to engage with it. For example, it can help us understand the many factors that may shape system dynamics (i.e., what happens) and the implications for others who are considering implementing something similar in a different context. The goal, he said, is not to reduce the variation in observed effects across contexts, but to help those implementing a program or intervention achieve outcomes on the upper end of the distribution. Doing so, he underscored, will require in situ feedback loops based on continuous monitoring and evaluation processes that are built into the system.
INCORPORATING EVIDENCE FROM SOURCES OTHER THAN TRADITIONAL RESEARCH STUDIES INTO EVIDENCE REVIEW AND GRADING PROCESSES
A recurring discussion throughout the workshop focused on how the various frameworks and methods for evidence synthesis and recommendation development would handle some types of evidence that may be important sources of information in PHEPR but are less commonly used in evidence reviews, including mechanistic evidence (i.e., a logical or conceptual rationale for why or how an intervention works),21 modeling, expert opinion, case studies, and AARs.
Mechanistic evidence is considered in the effectiveness ratings for Countermeasures That Work when direct outcome data are missing, said Johnson, citing as an example driving while impaired courts.22 She stated that although there was no direct evidence of reduction in crash outcomes, effects on recidivism were observed, and it is expected that if drinking and driving is reduced, the number of crashes should correspondingly decrease. Elder similarly suggested that mechanistic reasoning may be important to support decision making in the face of uncertainty given a lack or paucity of empirical evidence. Although the Community Guide requires empirical evidence to generate evidence ratings and recommendations on public health interventions, he acknowledged that mechanistic theories are considered when assessing generalizability in a process that relies on the judgments of subject-matter experts. Elder added that expert opinion and models are similar in that they are not evidence, per
18 “True” effect refers to the change in the outcome of interest (encompassing both magnitude and direction of effect) associated with the implementation of the intervention in a given population. Although the true effect cannot be known, intervention studies seek to estimate the true effect by measuring the outcome in a population sample in the presence and absence of the intervention. Measured effect sizes can differ from the true effect because of bias.
20 Michael Woolcock is Lead Social Scientist, Development Research Group, World Bank.
21 Mechanistic evidence in this context includes mechanisms of action based on physical laws or knowledge of biological pathways but, for social and behavioral interventions, it may also include logical reasoning regarding the causal chain or hypothesized mechanisms that lead from interventions to outcomes; the latter is also called a “program theory.”
se, but represent a fallible distillation of knowledge. Models, he said, have been used in past Community Guide reviews, but specific methods for integrating information from models have not yet been developed.
Schünemann noted that a GRADE working group is in the process of developing a framework for applying GRADE principles to assess the certainty of evidence in modeling studies. In discussing mechanistic evidence, he indicated that reviewers might consider the similarity of mechanism of action when looking for indirect evidence to inform recommendations on an intervention. However, certainty in such evidence would be reduced as compared to direct evidence. He referenced work being done on incorporating mechanistic evidence into reviews related to toxicology and a National Academies workshop on this topic.23
Valentine noted that the WWC Practice Guides enable a review panel of nationally recognized experts to make recommendations based on its own expert judgment when there is only low-quality evidence or even in the absence of empirical evidence if, for example, there are theoretical considerations. However, he added, bounds are placed on such recommendations by requiring that they be labeled as supported by minimal evidence.
In discussing the evidentiary value of case studies and AARs, Schünemann acknowledged that it is possible to use such sources to make decisions based on what others have done in the past, but the certainty in the evidence would be low. He contended that case reports are a limited form of evidence, and it generally is not worthwhile to try to draw distinctions between the certainty of evidence from 10 case reports versus 30. However, GRADE has been applied to answer questions regarding rare diseases and in that context was used to evaluate hundreds of case reports (Pai et al., 2015). This allowed decisions to be made when little direct research-based evidence was available but, he emphasized, it is still important to recognize the uncertainty associated with such evidence.
Nightingale emphasized the need for standard procedures or guidelines that could make AARs more amenable to content and qualitative analysis. When AARs serve a dual purpose of performance management and learning, she added that separate metrics could be developed to meet both needs. Elder cautioned that the effect of politics and liability concerns on reporting makes AARs a challenging source of evidence for systematic reviews in terms of their reliability. Valentine concurred and reiterated that it is essential to pay attention to reporting bias and how that affects what information is shared, especially for a source like AARs. Noyes underscored the importance of considering how case studies and AARs could contribute to the understanding of the question at hand. She suggested that information from both of these evidence sources could be used to populate a model or could be mapped onto empirical findings from research, which may increase confidence in the evidence if there is concordance. This latter approach, she said, was used in a recent World Health Organization guideline for emergency risk communication.24 For that report, method-specific tools were used to assess the quality of case studies and AARs when they used a recognized methodology, and she referred to other tools that can be used to assess the credibility of the information.
Woolcock described an effort being undertaken by the World Bank called the Global Delivery Initiative, which is using data analytics to extract generalizable principles and emergent lessons from hundreds of development case studies.25 He underscored that this work was intended as a complement to, not a substitute for, more traditional evidence assessment processes. He concluded by emphasizing that a broader array of tools is needed for dealing with the kinds of problems that are not amenable to the processes used in conventional evidence synthesis frameworks.
Three panelists who work in fields characterized by low-frequency, high-impact events expanded on the discussion regarding the use of mechanistic evidence, modeling, expert judgment, and reviews similar to AARs in recommendation development and decision making. J.D. Polk26 articulated ways that these different types of evidence are integrated to inform decisions on countermeasures for health risks to astronauts. Jennifer Bishop27 and Jeffrey Marcus28 discussed how such evidence guides recommendations to improve aviation safety.
To emphasize the evidence challenges associated with small sample sizes when dealing with health issues during space exploration, Polk quipped, “One astronaut is a case study, two is a series, and three astronauts are a prospective randomized trial.” With a limited body of experience to draw on, along with physical space and resource limitations inherent in space travel, NASA needed to be able to forecast medical needs to mitigate health threats astronauts may face in the course of their missions, he said. He described how information is gathered from astronauts’ own medical experiences (from postmission medical debriefs and longitudinal astronaut treatment data captured in a medical database) and those of other communities that operate in similarly constraining environments (e.g., submarines). Additional information is generated from virtual and live-action mission
23 See http://dels.nas.edu/Upcoming-Workshop/Strategies-Tools-Conducting-Systematic/AUTO-5-32-82-N?bname=best (accessed June 3, 2019).
26 J.D. Polk is Chief Health and Medical Officer for the National Aeronautics and Space Administration (NASA).
27 Jennifer Bishop is Chief, Writing and Editing Division, Office of Aviation Safety, National Transportation Safety Board.
28 Jeffrey Marcus is Aviation Safety Recommendation Specialist, Safety Recommendations Division, National Transportation Safety Board.
simulations—the latter in environments that are as similar as possible to space, such as Antarctica or the desert.
NASA also conducts some directed research and draws on the expert opinion of flight surgeons using a modified Delphi process. These different streams of information are pulled together using the Integrated Medical Model (Minard et al., 2011), which, he explained, uses Monte Carlo simulation to predict incidence rates for particular diseases or health conditions that might be encountered, based on the length of a mission and bounded using best- and worst-case scenarios. It also predicts the ability to treat the astronauts in situ. The Integrated Medical Model has been validated against observed cases from real International Space Station missions and, Polk shared, NASA found that its predictive value improved with more inputs. The results of the simulations are used to formulate the medical kits and other materials needed for space missions, and they therefore constitute a valuable body of evidence for spaceflight preparations.
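The general pattern Polk described, running many simulated missions against assumed incidence rates and reporting a median forecast bounded by best- and worst-case scenarios, can be illustrated with a minimal Monte Carlo sketch. This is not the actual Integrated Medical Model; the condition names and daily rates below are hypothetical placeholders, and the percentile bounds stand in for the best-/worst-case scenarios Polk mentioned.

```python
import random

# Hypothetical per-astronaut daily incidence rates (events per person-day).
# Illustrative values only; they are not actual IMM inputs.
DAILY_RATES = {
    "back pain": 1 / 300,
    "skin rash": 1 / 600,
    "urinary condition": 1 / 2000,
}


def simulate_mission(rates, crew_size, mission_days, rng):
    """One Monte Carlo trial: count events of each condition over the mission."""
    person_days = crew_size * mission_days
    # Bernoulli approximation of a Poisson process, one draw per person-day.
    return {
        condition: sum(1 for _ in range(person_days) if rng.random() < rate)
        for condition, rate in rates.items()
    }


def forecast(rates, crew_size, mission_days, trials=2000, seed=42):
    """Run many trials; report the median count and a best/worst-case band."""
    rng = random.Random(seed)
    results = {condition: [] for condition in rates}
    for _ in range(trials):
        for condition, count in simulate_mission(
            rates, crew_size, mission_days, rng
        ).items():
            results[condition].append(count)
    summary = {}
    for condition, samples in results.items():
        samples.sort()
        summary[condition] = {
            "best_case": samples[int(0.05 * trials)],   # 5th percentile
            "median": samples[trials // 2],
            "worst_case": samples[int(0.95 * trials)],  # 95th percentile
        }
    return summary


if __name__ == "__main__":
    # Sketch of a 4-person, 180-day mission forecast.
    for condition, band in forecast(DAILY_RATES, crew_size=4, mission_days=180).items():
        print(condition, band)
```

In this sketch the worst-case counts, not the medians, would drive what goes into the medical kit, mirroring how bounded simulation results feed resource decisions.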
Bishop and Marcus described the process by which evidence is gathered in the wake of a civil aviation accident and used to generate safety recommendations. Mechanistic evidence and logic have a strong role as investigators use a ruling-out and ruling-in process to come to a conclusion about the probable cause of the accident and, Marcus explained, develop recommendations needed to address it and other identified safety issues. Bishop added that the National Transportation Safety Board (NTSB) relies on expert opinion because there generally are not studies to explain why an accident occurred and how to prevent such events in the future. She indicated that a team is assembled at the start of an investigation and includes experts in a variety of areas, such as weather, operations, and human performance, and additional expert opinion is gathered from external stakeholders, including manufacturers and regulators.29 For some simple cases, she said, the knowledge and expertise of those subject-matter experts may be sufficient to inform recommendations, but for more complex investigations, they often rely on simulations and models, the results of which are accepted as definitive factual evidence.
Bishop acknowledged that the NTSB investigation and recommendation process is essentially an after action reporting process in that it involves gathering information from those involved in the incident to try to understand what happened and the opportunities for improvement. It is a great learning method, she said, especially when there are no quantitative data on which to rely. However, Bishop emphasized, the independence of the NTSB is critical to its ability to make credible recommendations based solely on safety, without consideration of the financial implications of their implementation, and free from the undue influence of stakeholders involved in the incident who may be reluctant to formally document the problems that occurred owing to potential ramifications.
GUIDING IMPLEMENTATION OF EVIDENCE-BASED PRACTICES
The panelists discussed how the body of evidence, in addition to informing recommendations, can also yield guidance on implementing interventions that may be useful to practitioners in the field. Nightingale emphasized that some practitioners are less interested in quality ratings than in practice-based experience and in knowing who else has implemented a given intervention. She explained that CLEAR considers implementation studies separately from causal impact studies, and although it does not yet have quality ratings for such studies, CLEAR provides guidelines for their evaluation (CLEAR, 2014). The WWC Practice Guides, Valentine noted, offer expert advice on implementation, identify potential obstacles, and suggest ways to overcome them. The implementation guidance is based on the expert panel’s knowledge and reading of the evidence, he said.
Elder emphasized that the Community Guide similarly offers implementation considerations. Johnson noted that use (i.e., the extent to which a countermeasure has been implemented by states and communities) and implementation time are given their own ratings in the Countermeasures That Work report. These ratings, along with cost and other information relevant to implementation, can help state highway officials make decisions about countermeasures that may be suited to their context. Noyes mentioned a series of guides that were created for decision makers to help them assess the applicability of review findings and recommendations to their own context and to consider local evidence in decisions on implementation.30 The NTSB goes a step further in that it actually works with the organization receiving the recommendation on implementation, said Marcus.
In closing the workshop, Ned Calonge, chair of the Committee on Evidence-Based Practices for Public Health Emergency Preparedness and Response, thanked all of the speakers, noting that the presentations and discussions on the different frameworks and approaches generated a great deal of thought and dialogue and will be invaluable as the committee works to develop its methodology going forward.♦♦♦
29 See https://www.ntsb.gov/investigations/process/Pages/default.aspx (accessed June 3, 2019).
30 See https://health-policy-systems.biomedcentral.com/articles/supplements/volume-7-supplement-1 (accessed June 3, 2019).
Alonso-Coello, P., H. J. Schünemann, J. Moberg, R. Brignardello-Petersen, E. A. Akl, M. Davoli, S. Treweek, R. A. Mustafa, G. Rada, S. Rosenbaum, A. Morelli, G. H. Guyatt, and A. D. Oxman. 2016. GRADE evidence to decision (EtD) frameworks: A systematic and transparent approach to making well informed healthcare choices. 1: Introduction. BMJ 353.
Briss, P. A., S. Zaza, M. Pappaioanou, J. Fielding, L. K. Wright-De Aguero, B. I. Truman, D. P. Hopkins, P. Dolan Mullen, R. S. Thompson, S. H. Woolf, V. G. Carande-Kulis, L. Anderson, A. R. Hinman, D. V. McQueen, S. M. Teutsch, J. R. Harris, and the Task Force on Community Preventive Services. 2000. Developing an evidence-based guide to community preventive services—methods. American Journal of Preventive Medicine 18(1S):35–43.
CLEAR (Clearinghouse for Labor Evaluation and Research). 2014. Operational guidelines for reviewing implementation studies. Washington, DC: U.S. Department of Labor.
CLEAR. 2015. CLEAR causal evidence guidelines, version 2.1. Washington, DC: U.S. Department of Labor.
Guyatt, G., A. D. Oxman, E. A. Akl, R. Kunz, G. Vist, J. Brozek, S. Norris, Y. Falck-Ytter, P. Glasziou, H. Debeer, R. Jaeschke, D. Rind, J. Meerpohl, P. Dahm, and H. J. Schünemann. 2011. GRADE guidelines: 1. Introduction - GRADE evidence profiles and summary of findings tables. Journal of Clinical Epidemiology 64(4):383–394.
Lewin, S., C. Glenton, H. Munthe-Kaas, B. Carlsen, C. J. Colvin, M. Gulmezoglu, J. Noyes, A. Booth, R. Garside, and A. Rashidian. 2015. Using qualitative evidence in decision making for health and social interventions: An approach to assess confidence in findings from qualitative evidence syntheses (GRADE-CERQual). PLoS Medicine 12(10):e1001895.
Lewin, S., M. Hendry, J. Chandler, A. D. Oxman, S. Michie, S. Shepperd, B. C. Reeves, P. Tugwell, K. Hannes, E. A. Rehfuess, V. Welch, J. E. McKenzie, B. Burford, J. Petkovic, L. M. Anderson, J. Harris, and J. Noyes. 2017. Assessing the complexity of interventions within systematic reviews: Development, content and use of a new tool (iCAT_SR). BMC Medical Research Methodology 17(1):76.
Lewin, S., A. Booth, C. Glenton, H. Munthe-Kaas, A. Rashidian, M. Wainwright, M. A. Bohren, Ö. Tunçalp, C. J. Colvin, R. Garside, B. Carlsen, E. V. Langlois, and J. Noyes. 2018. Applying GRADE-CERQual to qualitative evidence synthesis findings: Introduction to the series. Implementation Science 13(1):2.
Minard, C. G., M. F. de Carvalho, and M. S. Iyengar. 2011. Optimizing medical resources for spaceflight using the Integrated Medical Model. Aviation, Space, and Environmental Medicine 82(9):890–894.
Neumann, I., N. Santesso, E. A. Akl, D. M. Rind, P. O. Vandvik, P. Alonso-Coello, T. Agoritsas, R. A. Mustafa, P. E. Alexander, H. Schünemann, and G. H. Guyatt. 2016. A guide for health professionals to interpret and use recommendations in guidelines developed with the GRADE approach. Journal of Clinical Epidemiology 72:45–55.
Pai, M., A. Iorio, J. Meerpohl, D. Taruscio, P. Laricchiuta, P. Mincarone, C. Morciano, C. G. Leo, S. Sabina, E. Akl, S. Treweek, B. Djulbegovic, and H. Schünemann. 2015. Developing methodology for the creation of clinical practice guidelines for rare diseases: A report from RARE-Bestpractices. Rare Diseases 3(1).
Richard, C., K. Magee, P. Bacon-Abdelmoteleb, and J. Brown. 2018. Countermeasures That Work: A highway safety countermeasure guide for state highway safety offices. 9th ed. Washington, DC: National Highway Traffic Safety Administration.
Schünemann, H. J., S. R. Hill, M. Kakad, G. E. Vist, R. Bellamy, L. Stockman, T. F. Wisloff, C. Del Mar, F. Hayden, T. M. Uyeki, J. Farrar, Y. Yazdanpanah, H. Zucker, J. Beigel, T. Chotpitayasunondh, T. T. Hien, B. Ozbay, N. Sugaya, and A. D. Oxman. 2007. Transparent development of the WHO rapid advice guidelines. PLoS Medicine 4(5):e119.
Woolcock, M. 2013. Using case studies to explore the external validity of “complex” development interventions. Evaluation 19(3):229–248.
WWC (What Works Clearinghouse). 2017a. Procedures handbook. 4.0 ed. Washington, DC: U.S. Department of Education.
WWC. 2017b. Standards handbook. 4.0 ed. Washington, DC: U.S. Department of Education.
DISCLAIMER: This Proceedings of a Workshop—in Brief was prepared by Autumn Downey, Leah Rand, and Lisa Brown as a factual summary of what occurred at the workshop. The statements made are those of the rapporteurs or individual workshop participants and do not necessarily represent the views of all workshop participants; the committee; or the National Academies of Sciences, Engineering, and Medicine.
REVIEWERS: To ensure that it meets institutional standards for quality and objectivity, this Proceedings of a Workshop—in Brief was reviewed by Jennifer Bishop, National Transportation Safety Board and Jeffrey Valentine, University of Louisville. Lauren Shern, National Academies of Sciences, Engineering, and Medicine, served as the review coordinator.
SPONSORS: This workshop was supported by the Centers for Disease Control and Prevention.
For additional information regarding the workshop, visit www.nationalacademies.org/hmd/Activities/PublicHealth/PublicHealthPreparedness/2018-July-26.aspx.
Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2019. Methodologies for evaluating and grading evidence: Considerations for public health emergency preparedness and response: Proceedings of a workshop—in brief. Washington, DC: The National Academies Press. https://doi.org/10.17226/25510.
Health and Medicine Division
Copyright 2019 by the National Academy of Sciences. All rights reserved.