7 Meeting #6: Improving Reproducibility by Teaching Data Science as a Scientific Process
Pages 78-93

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.

From page 78...
... Welcoming roundtable participants, co-chair Eric Kolaczyk, Boston University, noted that although replicability is a fundamental aspect of the scientific process, many have suggested that a "crisis in reproducibility"1 currently exists. Recently published articles, such as "Why Most Published Research Findings Are False" (Ioannidis, 2005)
From page 79...
... Applying this expectation for practicing transparent science to the notion of teaching data science, Stodden commented that effective data science curricula would include training in theory as well as in computational methods, tools, and techniques. She suggested thinking about both tool and curriculum development in terms of the data life cycle (i.e., acquire, clean, use, reuse, publish, preserve, and destroy)
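Stodden's life-cycle framing lends itself to a simple coverage check: enumerate the stages and ask which ones a curriculum (or tool suite) actually addresses. The sketch below is an illustration only — the stage names come from the text, while the module descriptions are hypothetical examples, not courses discussed at the workshop.

```python
from enum import Enum

class DataLifeCycle(Enum):
    """The data life-cycle stages Stodden lists, in order."""
    ACQUIRE = 1
    CLEAN = 2
    USE = 3
    REUSE = 4
    PUBLISH = 5
    PRESERVE = 6
    DESTROY = 7

# Hypothetical curriculum: which stages does it cover?
curriculum_modules = {
    DataLifeCycle.ACQUIRE: "web scraping and public APIs",
    DataLifeCycle.CLEAN: "tidy data and validation",
    DataLifeCycle.PUBLISH: "data repositories and DOIs",
}

# Enum iteration follows definition order, so gaps appear in life-cycle order.
uncovered = [stage.name for stage in DataLifeCycle
             if stage not in curriculum_modules]
print(uncovered)  # → ['USE', 'REUSE', 'PRESERVE', 'DESTROY']
```

The same structure could tag tools or datasets by stage, making life-cycle gaps in a program visible at a glance.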
From page 80...
... Stodden replied that preregistration would not be needed if the right infrastructure for reproducibility were in place -- for example, allowing any statistical tests performed during an experiment to be tracked -- and she suggested the design of appropriate tools as an effective solution. Peter Norvig, Google, supported the notion of developing computing infrastructure to enable reproducible research and suggested disaggregating steps along the scientific life cycle.
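Stodden's point about infrastructure that tracks every statistical test performed during an experiment can be made concrete with a small sketch. Everything below is a hypothetical illustration, not a tool mentioned at the workshop: a decorator records each invocation of a registered test in an append-only log, which could later be published alongside the results so that the full testing history — not just the tests that "worked" — is visible.

```python
import functools
import time

TEST_LOG = []  # append-only record of every statistical test performed

def tracked(test_fn):
    """Wrap a statistical test so each invocation is logged automatically."""
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        result = test_fn(*args, **kwargs)
        TEST_LOG.append({
            "test": test_fn.__name__,
            "timestamp": time.time(),
            "result": repr(result),
        })
        return result
    return wrapper

@tracked
def mean_difference(a, b):
    # Hypothetical toy "test": difference of sample means.
    return sum(a) / len(a) - sum(b) / len(b)

# Both calls are recorded, whether or not the analyst reports them.
mean_difference([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
mean_difference([5.0, 5.0], [4.0, 4.0])
print([entry["test"] for entry in TEST_LOG])  # → ['mean_difference', 'mean_difference']
```

Because the log accumulates as a side effect of simply running the tests, it imposes no extra workflow on the researcher — the property that makes Stodden's alternative to preregistration plausible.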
From page 81...
... The course focuses on data access, computation, statistical analysis, and publication as a way to underscore that reproducibility is an essential tenet of modern computational research. The course introduces the social and scientific implications of a lack of reproducibility, and students learn that reproducibility is an everyday practice that requires the development of skills and habits.
From page 82...
... Perez noted that discussions are under way with the National Science Foundation's big data regional innovation hubs to address this issue. Stodden noticed that many of the tools Perez uses in his course come from outside the academic community and have been repurposed for scientific work.
From page 83...
... Implementing a standard process eliminates problems, motivates repetition, fosters communication, encourages collaboration, enhances security, and allows encapsulation of experiments. Woody described Microsoft's Team Data Science Process methodology that aims to improve team collaboration and learning:
• During the first phase of this process, business understanding, the team defines objectives and identifies data sources.
From page 84...
... OPEN DISCUSSION
Incentive and Reward Structures
Nicholas Horton, Amherst College, wondered how incentive structures in academia could be modified to encourage faculty to teach data science courses and to develop data science tools. Gardner described the fundamental difference between incentive structures in academia (e.g., publishing results and earning grants)
From page 85...
... Duncan Temple Lang, University of California, Davis, noted that software development that allows experimentation and brings in new ideas deserves to be rewarded but that not all software development fits in this category. He advocated for educating faculty on different types of software and redefining incentive structures.
From page 86...
... To be successful, researchers would need to connect the theory of reproducibility with practical skills and application. In other words, reproducible research emerges from the combination of a motivated researcher and relevant training.
From page 87...
... PERSPECTIVES ON ENHANCING RIGOR AND REPRODUCIBILITY IN BIOMEDICAL RESEARCH THROUGH TRAINING
Alison Gammie, National Institute of General Medical Sciences
Gammie explained that because issues of scientific rigor and transparency (especially in the field of biomedical research) are being discussed
10 The website for the Carpentries is https://carpentries.org/, accessed February 13, 2020.
From page 88...
... NIH also offers a predoctoral training grant program to ensure that rigor and transparency are threaded throughout the graduate curriculum and reinforced in the laboratory. The principal investigator and program faculty on these grants are required to have a record of doing rigorous and transparent science and to submit a specific plan for how the instruction will enhance reproducibility.
From page 89...
... Gammie encouraged data scientists who can demonstrate a robust training program that meets the basic science mission of NIGMS to continue to apply for training grants, as many fundamental skills cross disciplines. Hero suggested that it would be useful if predoctoral data science training programs had funding for and openness toward application areas.
From page 90...
... His final example of how data quality assurance and control can drive process improvement featured a company that reduced the relative error of its assays sixfold, which allowed it to reproducibly identify and build upon small incremental improvements that were otherwise lost in the noise. This doubled the rate of strain improvement, and Gardner described it as a paradigm for reproducible science -- if each individual can make an incremental improvement, society can make scientific discoveries much faster.
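Gardner's point — that lowering assay error makes small true gains detectable rather than lost in the noise — can be illustrated with a simulation. The numbers below are invented for illustration (Gaussian assay noise, a 2% true gain on a baseline of 100, and a simple 2-standard-error detection threshold); they are not the company's actual data.

```python
import random
import statistics

random.seed(42)

def gain_detected(true_gain, assay_sd, n=10):
    """Run n baseline and n post-change assays; report whether the observed
    mean difference clears a 2-standard-error noise threshold
    (assay_sd is treated as known, for simplicity)."""
    baseline = [random.gauss(100.0, assay_sd) for _ in range(n)]
    improved = [random.gauss(100.0 + true_gain, assay_sd) for _ in range(n)]
    diff = statistics.mean(improved) - statistics.mean(baseline)
    se_diff = (2 * assay_sd ** 2 / n) ** 0.5  # std. error of a difference of means
    return diff > 2 * se_diff

trials = 1000
# Same 2-unit true improvement, measured with a noisy assay (sd = 6)
# versus a sixfold-quieter one (sd = 1).
noisy_hits = sum(gain_detected(2.0, 6.0) for _ in range(trials))
quiet_hits = sum(gain_detected(2.0, 1.0) for _ in range(trials))
print(noisy_hits, quiet_hits)  # the quieter assay detects the gain far more often
```

With the noisy assay the 2% gain is found only occasionally; with the quieter assay it is found almost every time — the mechanism by which tighter quality control compounds into faster improvement.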
From page 91...
... Teal commended Riffyn for its work to improve data quality and observed that its incentive structure helps achieve that goal. She described a specific challenge in the genomics arena: because the data users are not data producers, they cannot easily impact data quality.
From page 92...
... as a prerequisite to a data science course;
• Eliminate introductory computer science courses and replace them with data literacy courses; and
• Develop a course that enables data literacy at the level of dialogue as opposed to a course that attempts to teach mastery.
The group also discussed the potential for institutions with large, established programs to provide packages that help institutions with limited staffing implement such courses and make data science more widely available.
From page 93...
... In which ways can data science education be modified to make the most impact? She noted that her group chose to discuss this question from the perspective of the entire data life cycle because reproducibility is truly a life cycle problem.

