To establish a foundation for the workshop discussions, Harvey Fineberg, president of the Gordon and Betty Moore Foundation and workshop chair, elaborated on the findings and recommendations of the recently published National Academies of Sciences, Engineering, and Medicine report Reproducibility and Replicability in Science (NASEM, 2019). Marcia McNutt, president of the National Academy of Sciences (NAS) and former editor-in-chief of Science, delivered a keynote address to participants on signaling “indicators of trust” to the scientific community and the public.
Harvey Fineberg, President, Gordon and Betty Moore Foundation
The National Academies consensus study report Reproducibility and Replicability in Science was sponsored by the National Science Foundation.1 The committee assessed research and data reproducibility issues with a focus on topics that cross disciplines. More specifically, the committee was charged with defining the terms “reproducibility” and “replicability” (see Box 2-1), examining the extent and impact of the lack of reproducibility and replicability on the overall health of science and engineering, and reviewing current activities to improve reproducibility and replicability.

1 The full report and additional information are available at https://sites.nationalacademies.org/sites/reproducibility-in-science/index.htm (accessed November 20, 2019).
The main conclusion of the study, Fineberg summarized, is that there is no “crisis” with regard to replication and reproducibility of scientific findings, “but there is also no room for complacency.” “Reproducibility is critically important,” he continued, but is “not currently easy to attain.” The committee noted its concerns about the non-replicability of individual studies. Furthermore, the report states that neither reproducibility nor replicability alone can ensure the reliability of scientific knowledge.
A challenge for computational reproducibility across scientific disciplines is that many reports of studies do not include sufficient information to allow another researcher to reproduce the original computational results, Fineberg said. The committee found that fewer than half of the studies it reviewed provided the “full array of data, code, digital artifacts, and other elements required” to facilitate computational reproducibility.2 There is growing recognition of this problem, Fineberg acknowledged. He shared several examples of efforts to provide more complete data and information (e.g., providing Internet links to underlying datasets and code, marking articles with badges that indicate open data sharing).
The committee identified several obstacles to reproducibility, Fineberg reported, including inadequate recordkeeping, reporting that lacks critical elements, obsolete digital artifacts, errors in the attempts to reproduce the findings of others, and cultural barriers.3 The committee also noted that improving computational reproducibility is challenging because experiments are complex and involve multiple steps that must be systematically documented and reported. In some cases, full reproducibility is not possible as studies involve non-public or proprietary data and experimental components.
As defined by the committee, replicability is nuanced and complex, especially with regard to the implications of failure to replicate. “Replicability takes many forms,” Fineberg said, and some studies are inherently not replicable (e.g., studies of ephemeral phenomena such as an earthquake; long-term epidemiological studies). The committee also observed that many studies are replicated as part of the routine conduct of science and these replications are not reported because the intent is to reaffirm a previous finding in order to build on it.
The committee outlined several situations in which undertaking the replication of a particular study might be necessary or appropriate. Fineberg listed examples: if the results of the original study are to be used in making decisions of consequence (e.g., policy, clinical, or investment decisions); if the original study produced controversial or unexpected results; if the original study is flawed (e.g., design, methods, analysis); or if “the costs of replication are offset by the potential benefits for science and society.” Some studies are more replicable than others, and the committee’s report discusses the contributing roles of the complexity of the system under study and the degree of experimental control that is possible. The lower the complexity and the higher the controllability, the better the chance of replication, Fineberg explained.4
The committee concluded that “the occurrence of non-replicability is due to multiple sources, some of which impede and others of which promote progress in science” (NASEM, 2019, p. 85). The committee differentiated between sources of non-replicability that are “potentially helpful” and those that are “unhelpful” to the advancement of science. A helpful reason for non-replicability of a study, for example, could be the identification of a new natural source of variability or other new discovery. Unhelpful, and potentially avoidable, sources of non-replicability include simple mistakes, methodological errors, bias, and fraud. Another unhelpful source of non-replicability discussed by the committee was inappropriate statistical inference, and Fineberg noted the problem of “misunderstanding and misuse of the concepts of p-values and statistical significance.”5 Efforts are being made to address these unhelpful sources of non-replicability, he continued, such as the development of guidelines and checklists for researchers and other approaches to increase awareness and improve reporting and transparency.
Workshop Participant Comments on Replicability
Workshop participants shared several examples of how failure to replicate led to new insights. Thomas Curran, executive director and chief scientific officer of Children’s Mercy, Kansas City, observed that errors in replicability can lead to new discoveries. He shared an example in which the majority of published research describing a new cancer drug target was wrong due to the use of an inappropriate model, but further research revealed the applicability of the target to rare cancers and led to new drugs that might not otherwise have been developed. Another participant shared that variability in the replicability of a study he had published was ultimately associated with the amount of time the samples were exposed to oxygen during handling. It took years to determine that the samples were easily oxidized; those who handled them anaerobically could replicate the experiment, while others could not.
Participants also commented on the role of statistics in replicability. John Gardenier, a research ethicist, agreed with the concerns raised by the committee about statistical analysis and reliance on p-values. He noted that the use of p-values is considered “standard in science,” but that does not necessarily make it appropriate or ethical. Kay Lund, director of the Division of Biomedical Research Workforce at the National Institutes of Health, expressed concern about underpowered animal studies that are carried out in response to peer review or to meet criteria for publication.6 She noted that there is generally no mention in the publication that these add-on studies constitute preliminary data.
Public Trust in Science
The consensus committee reviewed data on public trust in science, Fineberg continued, and found that, from 1978 to 2018, the level of public confidence in the scientific community was consistently higher than public confidence in other institutions, including major companies, the press, and Congress. Only the military has garnered higher public trust in recent decades, he said.7
Recommendations from the Consensus Study Report Reproducibility and Replicability in Science
Following its assessment of reproducibility and replicability in science, the committee made numerous recommendations directed toward funders, policy makers, researchers, journal editors and publishers, conference organizers, educational institutions, professional societies, and journalists. Fineberg highlighted four of the committee’s recommendations as being particularly relevant for the workshop discussions (see Box 2-2). These address the need for researchers to report complete information, and the roles that various stakeholders play—including academic institutions, professional societies, researchers, funders, and journals—in increasing transparency in the reporting of science for the purpose of enhancing scientific reproducibility and replicability.
In closing, Fineberg summarized that data sharing and transparent reporting should be an expectation of the scientific community. Barriers to the persistent availability of the digital artifacts and reagents needed for reproducibility and replicability include costs, lack of infrastructure, the culture of science, and weak incentives, he said. “There is a great deal that has already been accomplished,” he said, and he encouraged workshop participants to consider how existing principles and practices can be endorsed, leveraged, or improved upon to further the progress toward transparency in science.
Marcia McNutt, President, National Academy of Sciences
McNutt opened her keynote address with a recent case example published in The Chronicle of Higher Education that illustrates the importance of research transparency.8 As summarized by McNutt, the case, from the field of criminology, involved an anonymous whistleblower who found statistical irregularities in five published papers and emailed his concerns to all of the co-authors of each of the papers. One co-author, a faculty member at Florida State University (FSU), was listed on all five papers and held all the data. All the other co-authors responded that they had not seen the data. A co-author who was an associate professor at a university in New York followed up on the whistleblower’s concerns because the FSU faculty member was his former mentor. The associate professor contacted his mentor to get the original data with the intention of helping to sort out any mistakes and protect his mentor’s reputation. However, McNutt said the mentor did not provide the complete data.
The data in question involved phone surveys to landline phone numbers. As the associate professor started to investigate, he observed that the response rate reported in the paper was more than 60 percent, which is well out of line with current response rates for such surveys. In addition, McNutt summarized from the article that the entity that conducted the survey was not identified, no source of funding was noted, and missing survey values had been filled in with imputed values.
The associate professor then sent a letter to the journal Criminology, which had published the paper on which he was a co-author. He also posted the letter online, which led to coverage of the case by Retraction Watch. The lead author of that paper then became involved and sought to retract or correct the paper. McNutt read quotes from The Chronicle Review article, which suggested that the editor-in-chief of Criminology, who happened to be a university colleague of the associate professor, was highly resistant to a retraction despite the concerns raised by two co-authors of the paper. The quotes suggested that the editor was primarily concerned about the potential legal ramifications, the impact on the mentor’s reputation, and the public relations fallout that would result from a retraction, rather than the underlying issues with research quality. Furthermore, the comments by the editor indicated that he was aware that other papers published in his journal had been of questionable quality, but that he believed it was sufficient to simply alert the field, not retract publications.
8 See https://www.chronicle.com/interactives/20190924-Criminology (accessed November 20, 2019).
Signaling Indicators of Trust
The case described in The Chronicle Review illustrates some of the key elements of transparency that are the focus of this workshop, including data availability, transparency of methods and statistics, and disclosure of funding sources. Another important aspect highlighted by this case, McNutt said, is the need to signal “indicators of trust” to the scientific community and the public. She said she found the response, or lack thereof, by the editor-in-chief of the journal to be particularly concerning. Not retracting the paper seemed acceptable to the editor because those in the field would know the work was problematic. The editor did not seem to care that other readers of the paper, including researchers outside that field of study, policy makers, and the general public, would not know the paper had been discredited.
McNutt and colleagues recently addressed the issue of indicators of trust in a paper titled “Signaling the Trustworthiness of Science” (Jamieson et al., 2019). Three qualities that foster trust in the scientific enterprise are “competence, integrity, and benevolence,” McNutt said. The norms of science promote these qualities, she continued, but scientists often do not “clearly signal” to others in the scientific community or to the public that these norms have been upheld, or when appropriate, that they have been violated.
A benefit of communicating the adherence to scientific norms is that it reinforces those norms in the community. McNutt mentioned the use of open science badges by journals as an example of signaling adherence to norms in published research. Badges, such as those developed by the Center for Open Science to indicate open data, open materials, and preregistration of research, are “self-reinforcing,” she said. In the future, badges might also be displayed by journals for studies that have been independently reproduced or replicated, that have had independent statistics review, or that have been screened for plagiarism or image manipulation, she suggested. She noted that digital publishing of journals facilitates the display of additional badges after publication as appropriate.
A survey referred to by McNutt and colleagues found that the public places value on indicators of the trustworthiness of science. For example, respondents said “they are more likely to trust a study if scientists make data and methods transparent, if they disclose who funded the study, and if it is published in a peer-reviewed journal,” she summarized (see Jamieson et al., 2019).
Improving Indicators of Trustworthiness
McNutt highlighted several areas in need of improvement that were discussed in Jamieson et al. (2019). One area is the need to improve the quality and transparency of the peer-review process. “Standards of peer review vary greatly among journals,” she said, and “there are very few signals of quality of peer review.” There are also cases in which journals have claimed they conduct peer review but do not, and cases of reviewer fraud.
Another area for attention is the language used for signaling the removal of a paper from the literature. McNutt pointed out that “retraction” is used to describe all withdrawals, whether due to honest errors or to falsification and fabrication, casting a negative light on all authors. The paper discusses the need for more descriptive terminology that signals the reason for removal of the paper from the literature. McNutt added that less punitive language could encourage more authors to withdraw their papers when errors are discovered.
As indicated by the survey, the public values information about potential biases. In this regard, McNutt said that “full disclosure of funding sources, outside obligations, and competing interests” is essential. Although journals generally require disclosures, she said more clarity is needed about what must be disclosed and for what time period. For example, for how long are past relationships relevant, or when do impending future relationships need to be disclosed? In addition, she said it is challenging for reviewers and editors to verify the accuracy of author disclosures.
In closing, McNutt expressed dismay that “members of the public are misled by long discredited studies,” and she cited an infamous example of a discredited study on vaccines and autism that took a decade to retract after it was widely known in the field that the work was fraudulent. She emphasized that the scientific community must take action to send “consistent and meaningful signals of which studies are honoring the norms that sustain trust.”
Following the keynote presentation, participants raised several issues for discussion with McNutt.
Coordinating Across Sectors
A participant asked how more cross-sector coordination could be encouraged. McNutt suggested that one possibility could be to develop an ongoing National Academies forum to share best practices for integrity, trust, and transparency and coordinate action across stakeholder groups at the research enterprise level. She noted that different stakeholder groups have created their own entities, such as the Committee on Publication Ethics (COPE) for publishers, but no function exists to help coordinate the activities across all phases of research “from funding to execution to publication.”
Dealing with Misconduct9
A participant noted frustration with the lack of consequences for misconduct. McNutt said she was initially concerned that imposing sanctions on researchers could lead to overreaction and backfire, with concerns going unreported for fear of ruining someone’s career. Now, however, she believes there is a need for action by the appropriate bodies and added that “no one should be untouchable.” The appropriate body would be the researcher’s employer, she said, but could also be a funder (e.g., in cases of fraud or misappropriation of funds) or a journal (when an author’s actions violate journal rules).
Harold Sox, program director of peer review at the Patient-Centered Outcomes Research Institute (PCORI), raised concerns about “the practice of spin” (e.g., when authors play up a positive secondary outcome when the primary outcome is negative). He said this is “a milder form of scientific misconduct.” He suggested that journal editors need to better address this and that standards are needed to help reviewers recognize hype that is not supported by data. McNutt agreed and said that most journal editors are volunteers and, in general, training for reviewers and editors is limited and may not sufficiently prepare them to address issues such as spin. She mentioned a recent survey carried out by a journal, which indicated that stakeholders would like more training opportunities for reviewers. She suggested that reviewer training could be discussed at scientific society meetings, and that students, early career researchers, and senior investigators could all provide feedback on what would be most helpful for such reviewer training.
Communicating Corrections in the Literature
Shai Silberberg, director for research quality at the National Institute of Neurological Disorders and Stroke (NINDS), noted his concern that researchers remain reluctant to alert editors to honest errors in their publications because of the potential negativity. He suggested finding a way to give credit to those who do come forward, perhaps on a “Correction Watch” blog akin to Retraction Watch. McNutt cautioned that some might game the system and publish with the intent of being the first to issue a correction. She added that retractions are relatively infrequent, making it difficult to pilot interventions and to understand their unintended consequences.

Steven Goodman, professor of medicine and health research and policy and co-director of the Meta-Research Innovation Center (METRICS) at Stanford University, referred to a recent publication from METRICS that proposes a new taxonomy for amendments to the published literature (Fanelli et al., 2018). Goodman and colleagues propose a list of terms that credit authors or journals, as appropriate, for making corrections. He added that comments on the proposal were gathered at a workshop attended by representatives from major journals, COPE, the National Library of Medicine, and others.

Deborah Sweet, vice president of editorial at Cell Press, noted the difficulty in getting authors to retract papers. She mentioned that Cell has introduced an “Editorial Note” in which editors can discuss an investigation associated with a paper that was ultimately not retracted or corrected. This communicates the fact that an investigation was done, she said, even though it was decided that no action was needed.

9 A participant observed that issues of sexual harassment and bullying are now being discussed in the context of scientific misconduct. Although not germane to the discussion of reproducibility and transparent reporting, the need for codes of conduct that might address this type of misconduct was discussed briefly by participants.