performed, and the results were published. This approach does not adapt well to streaming data sources and may not be viable going forward. Unfortunately, most of the existing algorithms were engineered with the assumption that all data would be available from the start. Bioinformaticians need to start developing algorithms that scale to arbitrarily large datasets.

Algorithms that fit those two criteria already exist in bioinformatics. They include multiple sequence alignment with profile hidden Markov models, phylogenetic placement on reference trees, and Bloom filters. The nature of the data has changed, and more methods like these must be developed. Once it is possible to analyze all the data, researchers will have some basis for deciding which data are useful and which can be deleted rather than archived for future analysis.
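As a concrete illustration of the streaming-friendly data structures mentioned above, the following is a minimal Bloom filter sketch in Python. The class name, filter size, and hashing scheme are illustrative choices, not taken from any particular bioinformatics tool:

```python
import hashlib

class BloomFilter:
    """Space-efficient set membership: false positives are possible,
    false negatives are not."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Stream k-mers through the filter without ever holding the full dataset.
bf = BloomFilter()
for kmer in ("ACGTACGT", "TTGGCCAA"):
    bf.add(kmer)
```

The appeal for streaming data is that the filter's memory footprint is fixed in advance regardless of how many items flow through it, and the false-positive rate can be tuned by choosing the bit-array size and number of hash functions.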

The computing infrastructure used to analyze data has also changed greatly. The approach is shifting from individuals running analyses on PCs or within an institutional computing cluster to performing analyses in cloud systems. The Amazon Web Services Elastic Compute Cloud is the best known of these systems, and there are others, such as OpenStack, a cloud infrastructure that anyone can download and set up on their own computer hardware. In Australia, the government set up a national research cloud named NeCTAR. Researchers at Australian research institutions are issued free allocations; when large computing resources and many computing hours are needed, users can make special requests.

Other paradigms for using cloud computing have emerged, including the HTCondor project, which essentially leverages idle computing time; it is a cycle-scavenging system. HTCondor can launch virtual machine images on idle computers. If an institution has a large number of computers that sit idle after everyone goes home at 5:00 p.m., that overnight time becomes a large computing resource.
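The scale of that scavenged overnight capacity is easy to estimate. A back-of-envelope sketch, using hypothetical numbers for an institution (none of these figures come from the text):

```python
# Assumed (illustrative) figures for one institution's desktop fleet.
machines = 500          # desktops that sit idle overnight
idle_hours = 14         # roughly 6 p.m. to 8 a.m.
cores_per_machine = 4

# Total compute harvested per night, in core-hours.
core_hours_per_night = machines * idle_hours * cores_per_machine
print(core_hours_per_night)  # 28000 core-hours every night
```

Under these assumptions the idle fleet yields tens of thousands of core-hours per night, which is why cycle scavenging can rival a dedicated mid-sized cluster at essentially no hardware cost.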

The cloud-computing model is attractive because it enables one to rent time on a large, commercial, professionally managed computing facility and to pay only for what is used. It lets a researcher quickly scale computing resources up or down. One does not have to worry about disposing of outdated computers in 5 years, because the cloud provider manages that. A challenge, however, is that data motion and storage in the cloud can become very expensive and will increase the funding researchers need from their sponsors. Researchers who wish to analyze large datasets in the cloud must factor in how to move and store the data most efficiently. One possible solution is to use third-party providers for data storage.
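To see why data motion and storage can dominate a cloud budget, consider a rough cost estimate. The per-gigabyte rates below are hypothetical placeholders, not quoted prices from any provider; check current pricing before budgeting:

```python
# Hypothetical rates (assumptions, not real provider prices).
egress_per_gb = 0.09            # USD per GB transferred out of the cloud
storage_per_gb_month = 0.023    # USD per GB-month of object storage

dataset_tb = 50                 # size of an assumed sequencing dataset
months = 12                     # how long it is kept

gb = dataset_tb * 1024
transfer_cost = gb * egress_per_gb                  # one full download
storage_cost = gb * storage_per_gb_month * months   # a year of storage
```

Even at these modest per-gigabyte rates, a single full download of a 50 TB dataset and a year of storage each run into the thousands of dollars, which is why minimizing data movement, and possibly delegating storage to a third party, matters as much as the compute bill itself.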

Hardware architectures have changed also, largely owing to the fact that central processing unit (CPU) clock speed reached its limits about

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.