
7

The Data Flood: Analysis of Massive and Complex Genomic Data Sets

One of the major themes brought out by the workshop was the interplay between theory and data, but the discussions in the preceding chapters do not convey just how much data must be dealt with. In fact, the data sets themselves are so massive that their analysis presents major challenges to statistical methodology.

As an example, Dan Roden, of Vanderbilt University, reported on research whose original goal was to use genetics to predict individual responses to drugs; the work quickly evolved into the challenge of navigating a massive data set. Pharmacologists are very interested in understanding why individuals respond differently to the same drugs and in how to predict those variations. Variability in drug response can correlate with a variety of factors, such as gender, age, disease type, concomitant drug therapies, and ethnicity.

Variability in drug response among different individuals may also be due to genetic factors. Each person has two strands of DNA in his or her genome, shown as two panels in Figure 7-1. At particular genome locations, the DNA sequences might differ between any two people. Such a difference, called a DNA polymorphism, might be associated with the occurrence of side effects in a given individual.

Mutation is one of the factors that cause DNA polymorphisms and can thereby contribute to disease onset. DNA polymorphisms may be due to the deletion, insertion, or substitution of a nucleotide; may occur in coding or noncoding regions of the DNA; and may or may not alter gene function. The occurrence of DNA polymorphisms makes it possible to associate a person’s response to drugs with particular DNA regions, for example, by correlating the occurrence of a polymorphism with the response. This is the basis of current pharmacogenetics, the study of the impact of individual genetic variants on drug response.
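
To make such a correlation concrete, here is a minimal sketch (not from the workshop; the counts are hypothetical) of testing whether carrying a given polymorphism is associated with an adverse drug response, using a 2×2 contingency table and a chi-square test.

```python
# A minimal sketch of a single-SNP association test; the counts are hypothetical.
from scipy.stats import chi2_contingency

# Rows: polymorphism carriers vs. non-carriers.
# Columns: patients with vs. without the adverse drug response.
table = [[30, 70],   # carriers:     30 with the adverse response, 70 without
         [12, 88]]   # non-carriers: 12 with the adverse response, 88 without

chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests association between the polymorphism and the
# response, but only for this single test; the multiple-testing problem
# discussed below changes the picture when 100,000 SNPs are examined at once.
```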

Roden’s research sought to evaluate the role of genetics in determining drug response in the case of a single nucleotide polymorphism (SNP) that is known to predispose individuals to drug-induced arrhythmias. He approached the problem with the following strategy:

  • Define the drug response (phenotype) of interest.

  • Test appropriate DNA samples, patients, or families.

  • Identify candidate genes that might explain significant response variations.

Suggested Citation:"7 The Data Flood: Analysis of Massive and Complex Genomic Data Sets." National Research Council. 2002. Making Sense of Complexity: Summary of the Workshop on Dynamical Modeling of Complex Biomedical Systems. Washington, DC: The National Academies Press. doi: 10.17226/10356.
×

FIGURE 7-1 Types of DNA variants: mutation and polymorphisms. Figure courtesy of Dan Roden.

  • Identify polymorphisms in candidate genes.

  • Relate the identified polymorphism to the phenotype.

Such an analysis would produce a graph like that in Figure 7-2, with a χ2 statistic calculated at each SNP. However, it would be infeasible for both statistical and economic reasons because of the flood of data. Suppose the study considers 100,000 SNPs in 1,000 patients (500 affected, 500 not affected). The statistical problem is that the data will yield 100,000 χ2 statistics. With such a multiplicity of tests, there will be many false positives. How, then, does one set a sensible cutoff point for statistical significance?
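
To illustrate the scale of the problem, the following sketch (purely illustrative; it simulates the extreme null case in which no SNP is truly associated with the phenotype and the tests are independent) counts how many of 100,000 tests would clear a naive p < 0.05 cutoff, and how many would survive a Bonferroni correction.

```python
# A minimal simulation of the multiple-testing problem described above.
# Assumptions (for illustration only): no SNP is truly associated with the
# phenotype, and the 100,000 tests are independent.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n_snps = 100_000

# Under the null hypothesis, each test statistic follows a chi-square
# distribution with 1 degree of freedom.
null_stats = rng.chisquare(df=1, size=n_snps)
p_values = chi2.sf(null_stats, df=1)

naive_hits = int(np.sum(p_values < 0.05))                 # expect roughly 5,000
bonferroni_hits = int(np.sum(p_values < 0.05 / n_snps))   # expect roughly 0

print(f"SNPs 'significant' at p < 0.05:      {naive_hits}")
print(f"SNPs significant after Bonferroni:   {bonferroni_hits}")
```

Even with no true signal anywhere, on the order of 5,000 SNPs clear the naive cutoff, which is why a per-test threshold of 0.05 cannot be used at this scale.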

Even if the statistical problem can be solved, basic economics makes this straightforward experiment infeasible because of the tremendous cost of recording 100,000 genotypes in each of a thousand people. (If the cost of determining a genotype were only 50 cents, the entire experiment would still cost $50 million.) Accordingly, there is a pressing need to solve the problem of handling the flood of bioinformatics data.

The data flood pointed out by Roden is only one example of the data handling challenges to be overcome. With the development of microarray experiments, the amount of data available today is enormous. At the April 2001 workshop, Terry Speed, of the University of California at Berkeley, gave an overview of microarray experiments, which provide a means of measuring expression levels of many genes in parallel.

In the so-called Stanford protocol, shown on the right side of Plate 3, genetic material from cells is apportioned into two samples, each of which is exposed to a different treatment. (One of the treatments might be the null treatment, in which case we are comparing a treated sample with a control.) The goal is to determine how the two samples differ in the way their genes are expressed—that is, how the genes cause proteins to be created in accordance with their embedded genetic information. One sample is labeled with a red dye and the other with a green dye. The two samples are distributed over a microarray slide (a “gene chip”), which typically has 5,000 to 6,000 different segments of complementary DNA (cDNA) arrayed on it. The two samples of red- and green-dye-tagged genetic material adhere to the slide in different patterns according to their chemical bonding to the cDNA. When the dyed genetic material is allowed to express proteins, the level of activity at each coordinate of the gene chip can be measured through the fluorescence of the dyes. From these measurements, one can develop an understanding of how the genetic material was affected by the treatment to which it was exposed. More complete background on this process, and a number of valuable other links, may be found at <http://www.stat.Berkeley.edu/users/terry/zarray/Html/index.html>.

FIGURE 7-2 Data from a hypothetical pharmacogenomic experiment. Figure courtesy of Dan Roden.

Many statistical issues arise in the analysis of microarray data, including issues of experimental design, data preprocessing, and arriving at ultimate conclusions. For example, the typical range of expression (on a log2 scale) is about ±5, and the amount of background noise in the data could be substantial. Thus, at present, it is usually possible to identify (with some certainty) only those genes that express at a very high or very low level.

Although there are problems with expression levels, and also with bias, a plot of M versus A, where

M = log2 (red expression) - log2 (green expression)

A = [log2 (red expression) + log2 (green expression)] / 2,

can be extremely useful, as in the following experiment described by Speed, which identified genes with altered expression between two physiological zones (zone 1 and zone 4) of the olfactory epithelium in mice. Plate 4 shows the log ratios plotted against the average of the logs (which gives a measure of absolute expression). It illustrates the noise level in much of the data. It also shows that a number of genes have very high expression levels, and that these genes show differential expression.
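
As a rough illustration of these quantities, the sketch below computes M and A from simulated red and green intensities (the data and the |M| > 2 cutoff are hypothetical, chosen only to show the shape of the calculation, not to reproduce Speed's analysis).

```python
# A minimal sketch of the M and A quantities defined above, computed from
# simulated (hypothetical) red and green fluorescence intensities.
import numpy as np

rng = np.random.default_rng(1)
n_genes = 5000  # a typical slide carries 5,000-6,000 cDNA segments

red = rng.lognormal(mean=8.0, sigma=1.5, size=n_genes)
green = rng.lognormal(mean=8.0, sigma=1.5, size=n_genes)

M = np.log2(red) - np.log2(green)        # log ratio: differential expression
A = (np.log2(red) + np.log2(green)) / 2  # average log intensity: absolute expression

# Only genes with extreme log ratios can usually be called with any certainty;
# the |M| > 2 cutoff here is arbitrary and purely illustrative.
candidates = np.flatnonzero(np.abs(M) > 2)
print(f"{candidates.size} of {n_genes} genes flagged as differentially expressed")
```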


Summarizing, Speed outlined some challenges to current research:

  • How to address the observed bias associated with whether a sample is treated with red or green dye (which suggests the need to run the complementary experiment of interchanging the red and green labels);

  • How to create better designs for microarray experiments, ones that go beyond merely comparing treatment with control;

  • How to carry out preprocessing of the experimental data so as to reduce the noise; and

  • How to deal with the fact that, because a large number of genes are tested in parallel in microarray experiments, the large number of statistical tests carried out greatly increases the chance of finding false positives. (One attempt to address this is exemplified in Tusher et al. (2001), which uses the false discovery rate method—an approach to the multiple comparisons problem that controls the expected proportion of false positives rather than attempting to minimize the absolute chance of any false positive—to set cutoff points for these errors; a generic version of such a procedure is sketched after this list.)
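
The sketch below shows a generic false discovery rate calculation using the Benjamini-Hochberg procedure; it is not the specific method of Tusher et al. (2001), and the p-values are simulated purely for illustration.

```python
# A minimal sketch of the Benjamini-Hochberg false discovery rate procedure.
# This is a generic FDR calculation, not the specific method of Tusher et al.
# (2001), and the p-values are simulated purely for illustration.
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return indices of tests declared significant at FDR level q."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k (1-indexed) with p_(k) <= (k / m) * q and reject the
    # k smallest p-values.  If no such k exists, nothing is declared significant.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    if not below.any():
        return np.array([], dtype=int)
    k = int(np.max(np.nonzero(below)[0]))
    return order[:k + 1]

rng = np.random.default_rng(2)
# 6,000 genes: most behave like nulls (uniform p-values), a few are truly
# differentially expressed (very small p-values).
p_vals = np.concatenate([rng.uniform(size=5950), rng.uniform(0.0, 1e-4, size=50)])
hits = benjamini_hochberg(p_vals, q=0.05)
print(f"{hits.size} genes declared significant at an FDR of 0.05")
```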

