tion to handle this responsibility. A complication, however, is intellectual property. Much of the intellectual property is generated during the first step, and recognition of this must be conveyed to the parties who operate and maintain the software. One of the most common approaches is to make the software open source. Open sourcing is actually advantageous for industry, because companies can avoid investing in the development step. It is attractive for the bioinformatics industry to pick up these open-source tools and then operationalize and validate them for users.

In microbial forensics analysis, absolute reproducibility may be an unachievable goal. The challenge then becomes deciding how reproducible an analysis should be. In theory, computers should make repeatability and reproducibility very easy. Many scientists who practice data analysis, however, lack training in the fundamental aspects of computing and software engineering that would enable them to undertake reproducible computing. A recent trend toward educating this community can be seen in the series of workshops called Software Carpentry.7 These workshops offer training in basic computing skills, such as version control and literate programming, so that scientists can generate workflows that are reproducible on different systems and engineered according to generally accepted software engineering standards. In addition, there are a number of efforts to build point-and-click visual systems that enable users to generate reproducible workflows. These include the Galaxy, Taverna, KNIME, and Kepler systems. Users can access a large number of small bioinformatics components, which they can connect and reconnect into arbitrarily different and unique workflows and then share with others. When users perform data analysis, the systems track and record every step applied to the data, and the users can share the resulting analysis metadata.
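
The idea of tracking and recording every step applied to data can be illustrated with a minimal sketch. It is not how Galaxy, Taverna, KNIME, or Kepler are implemented; the function names (`run_workflow`, `digest`) and the toy steps are hypothetical and chosen only for illustration. Each step is logged with its parameters and with checksums of its input and output, so that a second party can rerun the workflow and compare the records.

```python
import hashlib
import json


def digest(data: str) -> str:
    """Return the SHA-256 hex digest of a text payload."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def run_workflow(data, steps):
    """Apply each (name, function, params) step in order and log provenance."""
    log = []
    for name, func, params in steps:
        result = func(data, **params)
        # Record what was run, with which settings, on which input and output.
        log.append({
            "step": name,
            "params": params,
            "input_sha256": digest(data),
            "output_sha256": digest(result),
        })
        data = result
    return data, log


# Hypothetical toy steps standing in for real bioinformatics components.
def uppercase(seq):
    return seq.upper()


def trim(seq, length):
    return seq[:length]


STEPS = [("uppercase", uppercase, {}), ("trim", trim, {"length": 4})]

if __name__ == "__main__":
    final, provenance = run_workflow("acgtacgt", STEPS)
    print(final)
    print(json.dumps(provenance, indent=2))
```

Because the log contains only step names, parameters, and checksums, it can be shared as analysis metadata without sharing the data themselves; two runs that produce identical logs applied the same steps to the same inputs and obtained the same outputs.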

In terms of the operational aspects of repeatability and reproducibility, Darling believes that the ability to examine the software itself should be taken into account. With closed-source software, the best one can achieve is repeatability, by copying the closed-source software to another computer; one cannot know how or why it works, nor can the computation it performs be independently reproduced. With open-source software, it is possible to examine the nuts and bolts of why it does what it does. Often when software is developed and a manuscript is published about it, there are inconsistencies, and frequently large discrepancies, between what the software actually does and what the manuscript describes. This can be attributed to the fact that when software is engineered and implemented, the developer must make approximations


7 More information is available at http://software-carpentry.org/; accessed November 24, 2013.
