8
Crosscutting Themes

The organization of this report around levels of biological organization reflects the committee’s view that the interplay between mathematics and biology during the 21st century will be driven by biological problems. Nonetheless, the committee also recognized that this view of the mathematics-biology interface risks the neglect of crosscutting themes—that is, mathematical ideas or areas of productive research activity that cut across levels of biological organization, emerging and re-emerging in diverse biological contexts. Accordingly, it concludes its report with a few examples of such themes, starting with some mathematical ideas that have assumed central importance at the interface between mathematics and biology.

THE “SMALL n, LARGE P” PROBLEM

Classical statistics largely arose in settings where typical problems involved estimating a small set of parameters (P) from large numbers of data points (n). Modern examples of such “small P, large n” problems include estimating the overall inflation rate—or, perhaps, a modest number of category-specific inflation rates—from longitudinal data on the prices of large numbers of specific items. Of course, similar problems often also arise in overtly biological contexts (e.g., analysis of life expectancies or the dose-response characteristics of pharmaceutical agents); however, these applications of classical statistics to living systems are typically far removed from considerations of underlying biological mechanism. In



small P, large n contexts, estimates of the parameter set are expected to improve as the number of data points increases, and much of the machinery of classical statistics addresses trade-offs between the reliability of parameter estimates and the number of samples analyzed.

In many biological research settings, the statistical challenge is quite different. Individual experiments (for example, a microarray-based measurement of the levels of thousands of messenger RNAs in a single RNA sample extracted from a particular tumor) are often information-rich. However, the number of independent measurements from which a biologist seeks to draw conclusions (e.g., the number of tumors analyzed) may be quite small. Similarly, geneticists are now contemplating measuring the genetic variants present at ~10^5 sites across the genomes of individual research subjects even though the number of individuals (for example, the number of cases and controls in a disease-susceptibility study) is under severe practical constraints. The challenge in these situations is analogous to attempting to reach conclusions from a moderate number of photographs of, for example, profitable and unprofitable restaurants. Although increasing the number of restaurants photographed would certainly improve the reliability of the study, presuming that the sampling strategy was well considered, success would depend even more heavily on strategies for representing and modeling the immense amount of information in each photograph.

Interest among biologists in problems with similar statistical properties has grown dramatically in recent years. The committee examines this phenomenon here as a prime example of a crosscutting mathematical theme on the interface between mathematics and contemporary biology. It illustrates both the progress that has been made during the past decade and the challenges that lie ahead.
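A small simulation can make the difficulty concrete. The sketch below, in Python, generates features that are pure noise and reports the largest correlation any of them shows with an arbitrary two-class labeling of only 10 samples. The function names, sample sizes, and seed are illustrative assumptions and are not drawn from the studies discussed in this chapter.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def max_chance_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between pure-noise features and fixed labels."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]  # arbitrary balanced classes
    best = 0.0
    for _ in range(n_features):
        # Each feature is Gaussian noise, unrelated to the labels by design.
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, labels)))
    return best


few = max_chance_correlation(n_samples=10, n_features=5)
many = max_chance_correlation(n_samples=10, n_features=5000)
print(f"best of 5 noise features:    {few:.2f}")
print(f"best of 5000 noise features: {many:.2f}")
```

With thousands of candidate features and only a handful of samples, some noise feature will typically appear strongly associated with the labels, which is one reason apparent patterns in such data call for careful validation.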
Finding Patterns in Gene-Expression Data

Although the small n, large P problem is encountered in many biological contexts, the challenges of interpreting gene-expression data provide a prototypical example that is of substantial current interest. The development of microarray technology, which can yield the transcription profiles of >10^4 genes in a single experiment, has enabled global approaches to understanding regulatory processes in normal or disease states. Substantial work has been done on the selection and analysis of differentially expressed genes for purposes ranging from the discovery of new gene functions and the classification of cell types to the prediction of clinically important biological phenotypes (Nature Genetics Supplement 21, 1999; Golub et al., 1999; Tamayo et al., 1999; Nature Genetics Supplement 32, 2002). The best of the markers that have emerged from this research have already shown promise for both diagnostic and prognostic clinical use.

Applications to tumor classification have attracted particular interest since it has been estimated that over 40,000 cancer cases per year in the United States present major classification challenges for existing clinical and histological approaches. Gene-expression microarrays for the first time offer the possibility of basing diagnoses on the global gene-expression profile of the tumor cells. Moreover, the discovery of gene-expression patterns that are significantly correlated with tumor phenotype can clarify molecular mechanisms of pathogenesis and potentially identify new strategies for treatment (Shipp et al., 2002). Similar opportunities exist for many other poorly understood diseases. For example, the recent discovery that a set of genes in the oxidative-phosphorylation pathway is more highly expressed in the muscle biopsies of normal controls than in those of patients with Type 2 diabetes has opened new avenues in diabetes research (Mootha et al., 2003). Closer study of the most highly correlated genes in this set led to the hypothesis that PGC-1α might regulate this subset of genes, a result that was then confirmed by further laboratory study. By this route, an aberration in PGC-1α expression has become a prime candidate for being a step in disease progression.

The pattern-recognition techniques required for analyzing gene-expression data and other large biological data sets are often called supervised and unsupervised learning. Machine-learning tools based on these techniques, designed in collaborations between bioscientists and mathematical scientists, have already come into widespread use. Pattern recognition via supervised and unsupervised learning is based on quantitative, stochastic descriptions of the data, sometimes referred to as associative models.
These models typically incorporate few or no assumptions about the mechanistic basis for the patterns that they seek to discover.

In unsupervised learning techniques, the structure in a data set is elucidated without using any a priori labeling of the data. Unsupervised learning can be useful during exploratory analysis. Supervised techniques create models for classifying data by training on labeled members of the classes that are to be distinguished (for example, invasive and noninvasive tumors). Supervised techniques have an advantage over unsupervised techniques because they are less subject to structure that is not directly relevant to the distinction of interest, such as the laboratory in which the data were collected. Unfortunately, training sets are not available in many biological situations.

General machine-learning algorithms that are potentially useful in this area of research stem from fields such as psychology and systematics. The algorithms have then been refined in such fields as financial modeling or market analysis, where the number of independent instances (that is, data points such as days of observation or individual-customer transactions) is substantially larger than the number of variables measured (i.e., the dimension of the problem; see Hastie et al., 2001). A rule of thumb is that the number of data points should be at least as large as the square of the number of dimensions (Friedman, 1994); for ~10^4 genes, that would mean on the order of 10^8 samples. This goal is often out of reach in biological-pattern-recognition studies: Typically, the data sets available comprise a small number of samples per biological class (generally fewer than 100, often only 10 or 20); however, this small number of samples contrasts with the large number of features that characterize each sample (for example, expression levels on the order of 10^4 genes). In many situations, increasing the number of samples is simply not possible. Hence, biological applications of machine learning often involve instances of the small n, large P problem discussed above.

The first generation of discovery and recognition tools suitable for the analysis of microarray data has been built, and these tools have established that expression data can be productively mined for purposes such as tumor classification (Golub et al., 1999; Bittner et al., 2000; Slonim et al., 2000; Tamayo et al., 1999; Perou et al., 1999; Perou et al., 2000), chemosensitivity of tumors (Staunton et al., 2001), and treatment outcome (Alizadeh et al., 2000). The classifiers used in these papers are still heavily employed today and are being refined to apply to cases where subtle signatures of phenotypes such as post-treatment outcome are the endpoints. Despite clear successes in applying machine learning to gene-expression data, most studies have oversimplified the problem by treating genes as independent variables.
Even when coregulation is taken into account (Cho et al., 1998; Eisen et al., 1998), existing methods still fail to capture the complex patterns of interaction that characterize all biological regulatory processes. They also address inadequately the diversity of biological mechanisms that can lead to indistinguishable phenotypes.

In the context of microarray data, although measurements may define a very high-dimensional space of >10,000 genes, the expression levels of these genes are dependent variables. In typical cases, a smaller set of variables (on the order of a few hundred metagenes) adequately captures the process being modeled. Thus, the problem can be approached by reducing the gene dimension in a principled way to reach the desired small P, moderate n level with which traditional statistical theory generally deals. However, major challenges remain in learning how to use a reduction process that actually reflects the biological mechanisms that lead to the highly correlated expression of the members of particular subsets of genes. These features of the problem highlight the need to develop statistical frameworks that will accommodate both the presence of many irrelevant variables and the high interdependence of those variables that are relevant. Next, the committee presents a brief description of the existing statistical framework.

Supervised Learning

In the supervised-learning setting, there exists a data set of samples that belong to two different phenotypes, Class A and Class B. The goal is to build a model that when presented with a new sample of unknown phenotype can identify its corresponding class with high accuracy. Mathematically, the challenge is to infer a function F that assigns a phenotype label A or B to a point G = (g_1, g_2, ..., g_n) in a high-dimensional space (the expression levels of ~10,000 genes in each sample) from a small number of (G, F(G)) pairs. In general, such systems are highly underdetermined. Moreover, as discussed above, the variables are not independent: Genes are expressed in response to the activation of biochemical pathways resulting from multiple gene products. While this interdependence of the variables is the only reason the problem is tractable, it severely limits the utility of traditional statistical approaches for testing the significance of observations.

Classifiers tend to overfit the data (that is, they have poor predictive power outside the training sample) because of the small number of samples, the large number of features (high dimensionality), and noise in the data. To address this problem, the number of variables must be reduced by selecting a subset of features that are most highly correlated with the phenotype distinction. From the perspective of machine learning and pattern recognition, the problem of optimal feature selection is intractable, and biologists must be content with empirical approximations that are tailored to the specific application (Duda et al., 2000).
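One simple empirical approximation of this kind is a univariate filter: score each feature separately by how well it separates the two classes, then keep the top few. The Python sketch below uses a signal-to-noise style score (difference of class means divided by the sum of the class standard deviations). The score, the function names, and the toy data are assumptions chosen for illustration; the sketch is not a reconstruction of the procedure used in any particular study cited here.

```python
import statistics


def separation_score(values_a, values_b):
    """Difference of class means scaled by the summed class spreads."""
    spread = statistics.stdev(values_a) + statistics.stdev(values_b)
    diff = abs(statistics.fmean(values_a) - statistics.fmean(values_b))
    return diff / (spread or 1e-12)  # guard against zero spread


def top_features(samples, labels, k):
    """Indices of the k features that best separate class 'A' from class 'B'."""
    scores = []
    for j in range(len(samples[0])):
        a = [s[j] for s, lab in zip(samples, labels) if lab == "A"]
        b = [s[j] for s, lab in zip(samples, labels) if lab == "B"]
        scores.append((separation_score(a, b), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


# Toy data: 6 samples x 4 hypothetical genes; only gene 2 tracks the class.
samples = [
    [1.0, 5.1, 9.0, 3.2],
    [1.2, 4.9, 9.4, 2.9],
    [0.9, 5.0, 8.8, 3.1],
    [1.1, 5.2, 2.1, 3.0],
    [1.0, 4.8, 1.9, 3.3],
    [1.3, 5.0, 2.4, 2.8],
]
labels = ["A", "A", "A", "B", "B", "B"]
print(top_features(samples, labels, k=1))
```

Because each feature is scored on its own, a filter like this ignores the gene-gene dependencies emphasized above, and it needs at least two samples per class to estimate a spread.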
Traditional methods for determining the statistical significance of the features (for example, the expression levels of particular genes) to be used as classifiers assume a known underlying distribution of values and independence of the features. However, a good parametric description for expression values has yet to be determined and may not exist. Gene-gene interactions are fundamental to biological processes, and thus gene-expression data are inherently incompatible with independence assumptions.

Some groups have used permutation-based methods to solve the gene-selection problem (Golub et al., 1999; Slonim et al., 2000; Tusher et al., 2001). In these methods, one compares the observed distribution of gene correlations with phenotype against a distribution obtained by randomly assigning class labels to samples. Permuting the class labels preserves the gene-gene dependencies within the data set. Other methods include using step-down adjusted p-values (Dudoit et al., 2002), generalized likelihood tests (Ideker et al., 2000), Bayes hierarchical models (Newton et al., 2001; Baldi and Long, 2001), and combined data from replicates to estimate posterior probabilities (Lee et al., 2000). So far, no systematic comparisons of the error rates and statistical power of these different methods have been published. Clearly this is an area that needs more research and a strong formal framework.

Even the question of how many samples might be needed to improve the accuracy of the original classifier or to provide a more rigorous statistical validation of the predictive power of classifiers is difficult. Traditional power calculations (Adcock, 1997) do not address the situation posed by gene-expression data: They estimate the confidence of an empirical error estimate based on a given data set, not how the error rate might decrease given more data. Attempts have been made to answer the latter question using nonparametric methods and permutation testing (Cortes et al., 1993; Cortes et al., 1995; Mukherjee et al., 2003), but formal analysis of this problem remains an open challenge.

One widely used approach to supervised learning involves the use of support vector machines (SVMs). SVMs are based on a variation of regularization techniques for regression (Vapnik, 1998; Evgeniou et al., 2000) and are related to a much older algorithm, the perceptron (Minsky and Papert, 1988; Rosenblatt, 1962). Perceptrons seek a hyperplane to separate positive and negative examples. The SVM goes further, seeking the separating hyperplane with the largest margin between the two classes. It is trained by solving a convex optimization problem, usually involving a large number of variables. The objective function involves a penalty, which has to be tuned to avoid overfitting. Performance of the SVM is reasonably measured by the proportion of misclassified cases in a new sample.
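The hyperplane idea and the error measure can be illustrated with the perceptron, the older relative of the SVM mentioned above. The Python sketch below is a minimal version under stated assumptions: the data are toy two-dimensional points, the names are hypothetical, and the code finds some separating hyperplane without the margin maximization or penalty term that distinguish an SVM.

```python
def train_perceptron(points, labels, epochs=100):
    """Classic perceptron rule; labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


def error_rate(w, b, points, labels):
    """Proportion of misclassified cases in the given sample."""
    wrong = sum(predict(w, b, x) != y for x, y in zip(points, labels))
    return wrong / len(points)


train_x = [[2.0, 1.0], [3.0, 2.5], [2.5, 2.0],
           [-1.0, -2.0], [-2.0, -1.5], [-3.0, -1.0]]
train_y = [1, 1, 1, -1, -1, -1]
new_x = [[2.2, 1.8], [-2.5, -2.0]]  # a small "new sample" held back from training
new_y = [1, -1]

w, b = train_perceptron(train_x, train_y)
print("error on new sample:", error_rate(w, b, new_x, new_y))
```

The error rate is computed on points that played no part in training, which is the quantity the text identifies as the reasonable measure of performance.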
Since such a sample is not usually available, methods involving splitting the original training sample (known as cross validation) are used. Other promising methods such as “boosting” are being developed in the statistics and machine-learning communities (Hastie et al., 2001). A basic limitation on all such methods is that, even when they identify reproducibly observable clusters, they may not provide insight into the biological mechanisms that underlie the process or phenotype being studied.

Unsupervised Learning

In unsupervised learning, the data are not labeled. The goal is to determine the underlying structure of a data set and to uncover relevant patterns and possible subtypes that can then provide the starting point for additional biological characterization. Many types of clustering algorithms have been applied to expression data, for example, hierarchical clustering (Cho et al., 1998; Eisen et al., 1998), self-organizing maps (SOM) (Tamayo et al., 1999), and k-means. These methods focus on the dominant structure present in a data set while potentially missing more subtle patterns that might be of equal or greater biological interest.

In contrast, there are a number of local, or bottom-up, unsupervised methods that seek to identify and analyze subpatterns in gene-expression data: the SPLASH algorithm (Califano, 2000), conserved X motifs (Murali and Kasif, 2003), the PLAID algorithm (Lazzeroni and Owen, 2002), the association rules of Becquet et al. (2002), or the frequent itemsets and modules of Tamayo et al. (2004) and Segal et al. (2004). Bottom-up approaches provide a comprehensive catalog of subpatterns and expose most or all of the potentially interesting structure. They tackle the small n, large P problem by attempting to directly extract and isolate the relevant signals. The challenge is the difficulty of dealing with the potentially large number of patterns discovered by these methods, many of which are typically false positives. The small n, large P problem remains in trying to find appropriate filters to separate real patterns from noise and finding ways to assemble the discovered patterns into a coherent representation of the data. Unfortunately, there is no theoretical foundation for evaluating the significance of extracted subpatterns purely on the basis of the data.

Classical approaches to reduce the noise and dimensionality use global decompositions or projections of the data that preserve the dominant structure. Examples of these methods include principal-component analysis (PCA) (Bittner et al., 2000; Pomeroy et al., 2002), singular-value decomposition (SVD) (Alter et al., 2000; Kluger et al., 2003) and PLAID (Lazzeroni and Owen, 2002).
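As a rough illustration of a global projection, the Python sketch below finds the first principal axis of a small data matrix by power iteration and projects each sample onto it. It is a sketch under assumptions: the function name and toy data are hypothetical, only the leading component is computed, and practical analyses would normally rely on a numerical library rather than hand-written loops.

```python
def first_principal_component(rows, iterations=200):
    """Leading principal axis of the rows, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0] * d  # arbitrary starting direction
    for _ in range(iterations):
        # Apply the (unnormalized) covariance X^T X to v without forming it.
        proj = [sum(ci * vi for ci, vi in zip(c, v)) for c in centered]
        v = [sum(p * c[j] for p, c in zip(proj, centered)) for j in range(d)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    # One coordinate per sample along the dominant direction.
    scores = [sum(ci * vi for ci, vi in zip(c, v)) for c in centered]
    return v, scores


# Toy data: 4 samples x 3 hypothetical genes; genes 0 and 1 vary together.
rows = [[1.0, 2.0, 0.1], [2.0, 4.0, 0.0], [3.0, 6.0, 0.1], [4.0, 8.0, 0.0]]
axis, scores = first_principal_component(rows)
print("axis:", [round(x, 3) for x in axis])
print("scores:", [round(s, 3) for s in scores])
```

The resulting axis weights the two co-varying genes heavily and the third hardly at all, so a single coordinate per sample captures most of the variation. Power iteration can fail if the starting vector happens to be orthogonal to the leading axis, and the sign of the axis is arbitrary.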
Unsupervised global, or top-down, approaches address the small n, large P problem by using appropriate projections from gene space to find a set of molecular coordinates that captures dominant signals. Once again, these methods often produce difficult-to-interpret, complex, or unwieldy representations of the data.

Projection algorithms such as nonnegative matrix factorization (NMF) (Lee and Seung, 1999; Kim and Tidor, 2003; Brunet et al., 2004) represent a new generation of methods that attempt to project the data into the space of a small number of metagenes, which provide representations that aid in biological interpretation and have the potential to guide follow-up experiments in the laboratory. NMF is based on a decomposition-by-parts approach, which was introduced by Lee and Seung (1999) to identify characteristic features of faces and semantic features of text. Despite its usefulness and practical success in clustering data, there are many open questions concerning the algorithm, its convergence properties, and the properties of the projected representation. Recent research on unsupervised-learning problems focuses on low-dimensional, nonlinear-manifold representations (Roweis and Saul, 2000; Tenenbaum et al., 2000), or other nonlinear sparse representations. This work is in its infancy and has not yet been systematically used for biological applications.

All the supervised and unsupervised approaches the committee describes here have associated questions that require further investigation. Some of these questions follow: Is it possible to develop a formal framework for evaluating the significance of features or subpatterns extracted in a small n, large P context? What is the best way to determine the correct number of clusters within a given data set? How does one validate clustering or decomposition results? How does one compare the correctness of two decompositions of a data set? None of these challenges is unique to biology, but biological applications bring them to the fore.

ANALYSIS OF ORDERED SYSTEMS

Systems or processes with strong spatial or temporal order are ubiquitous in biology. Examples involve the sequence of bases in the genome, the propagation of nerve impulses, and (at a higher level of biological organization) animal behavior. Mathematical techniques for analyzing ordered processes have been successfully imported into biology from other research areas. A particularly important example is the hidden Markov model (HMM). HMMs have been used in areas such as speech recognition since the 1970s. More recently, they have been applied with great success in many areas of biology. HMMs require more specific modeling of the structure within a data set than do the nonparametric methods discussed in the preceding section. When suitable models exist, this requirement is a strength: Indeed, it is sometimes possible to make valid inferences from a single instance of a biological entity such as a gene, that is, to analyze a small n, large P problem when n = 1.
This escape from the small n, large P problem is somewhat illusory since the HMM assumption enables us to use the large number of bases in the single gene to provide us with nearly identically distributed and independent proxy samples.

Applications of Hidden Markov Models to the Analysis of DNA, RNA, and Protein Sequences

An HMM describes a set of states connected by transitions between states. The transitions occur according to a Markov process. This means that the distribution of the mth state in the series, given the preceding m - 1 states, depends only on the (m - 1)st state; that is, P(s_m | s_1, ..., s_{m-1}) = P(s_m | s_{m-1}). However, the states themselves are not observed (they are hidden): They reveal themselves by emitting observable variables. In speech recognition, the observed variables might be phonemes. In DNA and protein applications, they would be the nucleotides or amino acids corresponding to specific sequences. All of the parameters of the HMMs governing the emissions of variables from specific states and the transitions between states are probabilities. There are many well-established algorithms for addressing important questions that arise during the use of HMMs. For example, given an HMM and a sequence, one can determine the probability that the sequence was generated by the HMM. Calculation of these probabilities allows one to find within a set of candidate models the HMM that is most likely to have generated a particular sequence. One can also find the most probable path of the sequence through the HMM. This capability allows one to parse the sequence into the most likely arrangement of hidden states. Note, however, that these are still associative rather than mechanistic models and are usually viewed simply as very crude approximations to reality.

Two specific applications of HMMs to biological sequences, profile HMMs for protein families and HMMs for predicting gene structures in DNA, are discussed below.

Profile HMMs

Profiles for protein families were introduced by Gribskov et al. (1987) as a method for representing the variability in protein sequences of the same family. Given an alignment of the sequences, the profile provides a score for all possible amino acids that might occur at each position and also a score associated with deletions and insertions at different positions. Profile HMMs were introduced by Krogh et al. (1994) to put the concept of a profile in a fully probabilistic framework. The hidden states, which are the positions in the protein-family model, are hidden because any individual sequence may have insertions and deletions relative to the model. Given a set of sequences known to be of the same protein family, expectation maximization (EM) can be used to determine the parameters of the HMM for the family.
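The two basic HMM computations described earlier, the probability that a model generated a sequence and the most likely hidden path, can be sketched in a few lines of Python. The example below uses a toy two-state nucleotide model rather than a profile HMM; the state names and all probabilities are invented for illustration, and the code works with raw probabilities, which is adequate only for short sequences (longer ones would need logarithms or scaling).

```python
STATES = ("AT_rich", "GC_rich")  # hypothetical hidden states
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def forward_probability(seq):
    """P(seq | model), summed over all hidden-state paths (forward algorithm)."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi_path(seq):
    """Most probable hidden-state path for seq (Viterbi algorithm)."""
    best = {s: (START[s] * EMIT[s][seq[0]], [s]) for s in STATES}
    for symbol in seq[1:]:
        step = {}
        for s in STATES:
            # Best way to arrive in state s from any previous state r.
            prob, path = max(
                ((best[r][0] * TRANS[r][s], best[r][1]) for r in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][symbol], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


sequence = "AATTGCGC"
print(forward_probability(sequence))
print(viterbi_path(sequence))
```

Comparing `forward_probability` across several candidate models is the model-selection step described in the text, and `viterbi_path` is the parsing step. A profile HMM applies the same two computations to a larger model whose states correspond to positions in a protein family.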
Given a profile HMM for a specific family and a protein sequence, one can determine the best alignment of the sequence to the family and the probability that the protein sequence would be generated by the HMM for the family. One can then classify proteins into different families by comparing those probabilities.

The emission probabilities at each position of the HMM can indicate important features of a protein family. For example, active-site residues in enzymes tend to be highly, if not completely, conserved among all members of a family. Positions that are all hydrophobic are likely to be in the interior of the protein or exposed to hydrophobic environments such as the interior of membranes. Given a set of HMMs for different protein families and at least one known structure for each family, HMM-based methods provide an effective means for predicting the approximate structure of a new protein from its sequence simply by determining the family to which the protein is most likely to belong. Of course, if the protein does not belong to any of the established families, this approach fails, and one must resort to ab initio methods. However, as increasing numbers of protein structures are determined and it becomes increasingly clear that most proteins (or at least domains of proteins) fall into a limited set of structural classes, HMM-based classification methods are providing more and more useful predictions of protein structure and function.

Despite past success, there is ample room for improvement in the development and application of HMMs to protein families. Two important areas for improvement deal with nonindependence in the data. Usually it is assumed that the protein sequences from which a profile HMM is built are independent samples from the set of sequences in the family. In actuality, members of the sample set are related to each other by a phylogenetic tree, and means of incorporating that information into profile HMMs should improve their performance. The other nonindependence issue involves limitations on the structures of the HMMs themselves. Profile HMMs assume that the positions are independent of one another or, at most, that there is a low-order Markov dependence among nearby positions. In reality, distant positions within the protein may be interacting with one another, and the amino acid frequencies at these interacting sites may be correlated. Such long-distance correlations occur frequently in RNA structures and are represented by higher-order models called stochastic context-free grammars. However, even stochastic context-free grammars are limited to correlated positions that are nested. This condition does not hold for typical protein interactions; indeed, it does not even apply to all intramolecular interactions within RNA molecules.
Finding efficient ways of taking such long-range interactions into account, while maintaining the advantages of probabilistic models, would provide an important improvement, especially for structure prediction.

HMMs in Gene Finding

Gary Churchill (1989) first applied HMMs to partition DNA sequences into domains with different characteristics. Early on, David Searls (1992) recognized the analogy between the parsing of sequences in linguistic analysis and the determination of functional domains in DNA sequences. By the early 1990s, David Haussler and colleagues had begun applying HMMs to the problem of identifying the protein-coding regions in genomic DNA sequences (see Krogh et al., 1994; Stormo and Haussler, 1994; Kulp et al., 1996). By that time, large-scale DNA-sequencing projects had begun, and there were many DNA sequences in the databases with no known associated genes or functions. Predicting what proteins might be encoded in these newly discovered DNA sequences was an important problem.

The basic structure of an HMM maps well to the gene-prediction problem. The hidden states are the functional domains of the DNA sequence: For example, some regions of the DNA code for protein sequence, other regions code for untranslated portions of genes, while still others are intergenic. Each class of regions has some statistical features that help to distinguish it from the other classes. For example, protein-coding exons must have an open reading frame and often use codons in a biased manner, so the base-emission probabilities characterizing that state will be different from those characterizing introns or other classes. There is also a clearly defined grammar for protein-coding regions: Introns must alternate with exons, and intergenic regions must surround these alternating exon-intron segments.

On the other hand, some aspects of gene structure are not captured by simple HMM architectures. For example, when introns are removed, the two joined exons must remain in-frame, so the HMM has to maintain a memory of the reading frame from the previous exon as it passes over the intron. Furthermore, exons and introns have different length distributions; neither is simply geometric, as would be modeled by a simple HMM. Finally, the boundaries between domains are often indicated by signals in the DNA sequence, that is, specific sequence motifs that are themselves modeled by the probability distributions of bases at different positions within the motifs. Gene-prediction accuracy can be improved by incorporating other evidence that is not derived from the DNA sequence alone, for example, similarities between the protein sequence inferred from the predicted gene structure and previously known protein sequences.
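The remark about length distributions follows from a short calculation. In a simple HMM, a state that returns to itself with probability p at each position is occupied for exactly k consecutive positions with probability p^(k-1) (1 - p), which is a geometric distribution with mean 1 / (1 - p). The Python sketch below evaluates these quantities; the function names and the value p = 0.99 are illustrative assumptions.

```python
def geometric_length_pmf(stay_prob, length):
    """P(a state is occupied for exactly `length` consecutive positions)."""
    return stay_prob ** (length - 1) * (1.0 - stay_prob)


def expected_length(stay_prob):
    """Mean of the geometric duration, 1 / (1 - stay_prob)."""
    return 1.0 / (1.0 - stay_prob)


stay = 0.99  # hypothetical self-transition probability for an "exon" state
for length in (1, 50, 100, 300):
    print(length, geometric_length_pmf(stay, length))
print("mean length:", expected_length(stay))
```

Whatever value of p is chosen, the probability decreases steadily with length, so the single most probable segment length is always 1. A length distribution that peaks at some intermediate value therefore cannot be represented by one self-looping state.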
To utilize all the different kinds of information that are useful for gene prediction and to capture the details of gene structures, HMMs have been extended to generalized HMMs (GHMMs) (Kulp et al., 1996; Burge and Karlin, 1997). These new models, which couple classical HMMs to machine-learning techniques, provide significantly better predictions than previous models. Recently, the methodology was extended to predict simultaneously gene structure in two homologous sequences (Korf et al., 2001; Meyer and Durbin, 2002; Alexandersson et al., 2003). Since corresponding (orthologous) genes in closely related organisms are expected to have similar structures, adding the constraint that the predicted structure be compatible with both sequences can significantly improve accuracy.

Despite these advances, there is still much room for improvement in gene prediction. Overall accuracy, even when using two species, is far from 100 percent. Increasingly, the failures of gene-prediction methods are due to the inherent biological complexity of the problem. Recent data

superexponentially large. The development of Monte Carlo schemes capable of handling this computation would be of great value in computational biology.

Sampling Protein Conformations

The protein-folding problem has been a grand challenge for computational molecular bioscientists for more than 30 years, since Anfinsen demonstrated that the sequences for some proteins determine their folded conformations (Sela et al., 1957). To formulate the computational problem, one sets up an energy function based on considerations of bonding geometry, as well as electrostatic and van der Waals forces. Possible conformations of the protein (i.e., the relative spatial positions of all its heavy atoms) can then be sampled either by integrating Newton’s second law (i.e., carrying out a molecular dynamics calculation) or by Monte Carlo sampling of the corresponding Boltzmann distribution (for a review of this, see Frenkel and Smit, 1996). This problem is attractive both because it is intrinsically important for understanding proteins and because computational results can be compared with experimentally solved structures. Hence, unlike in many other areas of predictive modeling in biology, there are easily applied, objective criteria for comparing the relative accuracy of alternative models.

At present, de novo computation of native protein structures is not feasible. Thus, the near-term focus of most research in this area is on gaining an improved understanding of the mechanism of protein folding (Hansmann et al., 1997; Hao and Scheraga, 1998). Monte Carlo methods are important in these investigations because they provide wider sampling of the conformation space than do conventional methods.
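Monte Carlo sampling of a Boltzmann distribution, as mentioned above, can be sketched with the Metropolis rule (Metropolis et al., 1953) applied to a single torsion angle. This is a toy illustration rather than a protein force field: the threefold energy function, the temperature `kT`, the step size, and all function names are assumptions made for the example.

```python
import math
import random


def torsion_energy(phi):
    """Hypothetical threefold torsional energy (arbitrary units) for one dihedral angle."""
    return 1.0 + math.cos(3.0 * phi)


def acceptance_probability(delta_e, kT):
    """Metropolis rule: accept downhill moves always, uphill moves with exp(-dE/kT)."""
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / kT)


def metropolis_sample(n_steps, kT=0.5, step_size=0.5, seed=0):
    """Sample torsion angles whose long-run distribution is proportional to exp(-E/kT)."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    phi = 0.0
    energy = torsion_energy(phi)
    samples = []
    for _ in range(n_steps):
        # Symmetric proposal: a uniform random displacement of the current angle.
        trial = phi + rng.uniform(-step_size, step_size)
        # Wrap the angle back onto the circle, so it stays between -pi and pi.
        trial = (trial + math.pi) % (2.0 * math.pi) - math.pi
        trial_energy = torsion_energy(trial)
        if rng.random() < acceptance_probability(trial_energy - energy, kT):
            phi, energy = trial, trial_energy
        # A rejected move repeats the current angle in the sample.
        samples.append(phi)
    return samples
```

A histogram of `metropolis_sample(100000)` should concentrate near the three energy minima of the toy potential. Real conformational sampling applies the same accept-or-reject step to moves in a space with thousands of coordinates, which is where the difficulty lies.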
The study of folding-energy landscapes is generally based on a simplified energy function (for example, effects of entropy in the solvent are incorporated into artificial hydrophobic terms in the energy function) and a greatly simplified conformation space. Even with such simplifications, Monte Carlo methods are often the only way to sample this space.

LESSONS FROM MATHEMATICAL THEMES OF CURRENT IMPORT

This discussion of flourishing applications of machine learning, hidden Markov models, and Monte Carlo sampling illustrates how particular mathematical themes can gain prominence in response to trends in biological research. The advent of high-throughput DNA sequencing and gene-expression microarrays brought to the forefront of biological research large amounts of data and many classes of problems that demanded the importation of broad, powerful mathematical formalisms. Continued reliance on ad hoc solutions to particular problems would have impeded the development of whole areas of biology. In the instances discussed, the biological problems that needed solution were sufficiently analogous to problems previously encountered in other fields that relevant mathematical formalisms were available. As these formalisms came into widespread use in the biosciences, particular limitations, associated in many instances with the general characteristics of the biological problems to which they were applied, became evident and stimulated new mathematical research on the methods themselves.

The committee expects this dynamic to recur as mathematical biology matures. Indeed, the committee attached more importance to the process than to its particular manifestations in the 1990s and early 2000s. While the techniques described here have broad importance at the moment, the committee does not expect them to dominate the biosciences over the long term.

Indeed, as it did in the Executive Summary and Chapter 1, “The Nature of the Field,” the committee once more cautions against drawing up a list of mathematical challenges that are not grounded in specific biological problems. Both the biosciences and mathematics have strange ways of surprising us. Mathematics can be useful in ways that are not predictable. For example, Art Winfree’s use of topology provided wonderful insights into the way many oscillatory biological processes work (Winfree, 1983). Similarly, De Witt Sumners’s use of topology to understand aspects of circular DNA (Sumners, 1995) and Gary Odell’s topological observations about the gene network behind segment polarity were quite unexpected (von Dassow et al., 2000).
Yet, even though topological arguments have provided biologists with powerful insights, the committee did not conclude that topology should be prioritized for further development because of its potential to contribute to biology. Instead, the committee expects that biological problems will continue to drive the importation and evolution of applicable mathematics. Then, as general principles emerge, they will be codified at the appropriate level of generality. For machine learning, HMMs, and Monte Carlo sampling, this process is well under way. Indeed, these powerful methods are now well established in the toolkits of most computational biologists and are routinely taught in introductory graduate-level courses covering computational biology. Other methods will follow, just as others went before. The greatest enabler of this process will be research programs and collaborations that confront mathematical scientists with specific problems drawn from across the whole landscape of modern biology.

PROCESSING OF LOW-LEVEL DATA

The purpose of the current chapter, “Crosscutting Themes,” is to call attention to issues that might have been neglected if the committee had relied entirely on levels of biological organization to structure this report. By discussing examples of mathematical themes that are important at many levels of biological organization, the committee accomplishes that purpose. Another quite different crosscutting theme is the importance of low-level data processing. Indeed, one could argue that the most indispensable applications of mathematics in biology have historically been in this area. Furthermore, the importance of low-level data processing in biology appears likely to grow. Rapid advances in technologies such as optics, digital electronics, sensors, and small-scale fabrication ensure that biologists will have access to ever more powerful instruments.

Nearly all the data that biologists obtain from these instruments has gone through extensive analog and digital transformations. Because these transformations improve signal-to-noise ratios, correlate signals with real-world landmarks, eliminate distortions, and otherwise add value to the physical output of the primary sensing devices, they are often the key to success during instrumentation development. The continued involvement of mathematicians, physicists, engineers, chemists, and bioscientists in instrumentation development has great potential to advance the biological sciences. Mathematical scientists are essential partners in these collaborations. Indeed, many of the challenges that arise in low-level data processing can only be met by applying powerful, abstract formalisms that are unfamiliar to most bioscientists. A few examples, discussed below, illustrate current research in this area.
In optical imaging, the development of two-photon (or, more generally, multiphoton) fluorescence microscopy is already having a significant impact on biology (So et al., 2000). This technique, in which molecular excitation takes place from the simultaneous absorption of two or more photons by a fluorophore, offers submicron resolution with relatively little damage to samples. The latter feature is of particular importance in biology since there is growing interest in observing living cells as they undergo complex developmental changes. The sensitivity of two-photon microscopy, in contrast to conventional fluorescence microscopy, is more dependent on peak illumination of the sample than on average illumination; hence, pulsed-laser light sources can be used to provide high instantaneous illumination while maintaining low average-power dissipation in the sample. Significant progress has been made in using two-photon methods to image cells, subcellular components, and macromolecules. Substantial improvements in sensitivity remain possible since, in current instruments, only a small fraction of emitted photons reaches the detector. This low sensitivity, among other problems, limits the time resolution of two-photon microscopy. Computation and simulation will play a key role in efforts to increase sensitivity by optimizing the light path and improving detectors. Discussing the potential of future improvements in the sensitivity of two-photon microscopy, Fraser (2003) observed that “with a combined improvement of only ten-fold, today’s impossible project can become tomorrow’s routine research project.” This rapid progression from the impossible to the routine is the story of much of modern experimental biology.

An entirely different class of imaging techniques, broadly referred to as near-field microscopy, has also made great strides in recent years. Steadily improving fiber-optic light sources and detectors have been the critical enabling technologies. Optical resolutions of 20 to 50 nm are achievable with ideal samples, dramatically breaching the wavelength limit on the resolution of traditional light microscopes. Nonetheless, near-field microscopy is difficult to apply in biology because of the irregular nature of biological materials. Despite these difficulties, Doyle et al. (2001) succeeded in imaging actin filaments in glial cells, and it is reasonable to expect further progress, based in part on improved computational techniques for extracting the desired signal from the noise in near-field data.

At still higher spatial resolution, many new techniques have been introduced for the structural analysis of biological macromolecules. Examples include high-field NMR, cryo-electron microscopy (cryo-EM) (Henderson, 2004; Carragher et al., 2004), time-resolved structural analysis based on physical and chemical trapping (Hajdu et al., 2000), small-angle scattering (Svergun et al., 2002), and total-internal-reflection fluorescence microscopy (Mashanov et al., 2003).
Cryo-EM has achieved 0.4-nm resolution for two-dimensional crystals and may soon achieve that capability for single particles. One problem with all imaging methods is the lack of rigorous validation methods for assessing the reliability of the structures they produce. Henderson (2004) emphasized this point, stating that the lack of such methods is “probably the greatest challenge facing cryo-EM.” The mathematical sciences have a clear role to play in addressing this challenge.

Hyperspectral imaging is the final example here of promising technologies that could be incorporated into many types of biological instrumentation. This technology involves measuring the optical response of a sample over an entire frequency range rather than at one, or a few, selected frequencies. In hyperspectral detectors, each pixel contains a spectrum with tens to thousands of measurements and allows for far more detailed characterization of a sample than could be obtained from data collected at a single frequency. Hyperspectral imaging is already being used for microscopy (Sinclair et al., 2004; Schultz et al., 2001), pathological studies (Davis et al., 2003), and microarray analysis (Sinclair et al., 2004; Schultz et al., 2001). Sinclair et al. (2004) recently developed a scanner with high spatial resolution that records an emission spectrum for each pixel over the range 490-900 nm at 3-nm intervals. These investigators used multivariate curve-resolution algorithms to distinguish between the emission spectra of the components of multiple samples. Further mathematical developments have the potential to enhance instrument design and performance for diverse applications.

Similar comments apply to many aspects of imaging technology. Indeed, the committee believes that one of the important goals of the next decade in instrumentation should be to improve the quantitation achievable in all forms of biological imaging. Nearly all applications of the mathematical sciences to biology will be promoted by improved instrumentation that lowers the cost of acquiring reliable quantitative data and increases the collection rates.

EPILOGUE

This brief discussion of the role of the mathematical sciences in the development of instrumentation is a suitable note on which to conclude this report since it emphasizes the primacy of data in the interplay between mathematics and biology. Mathematical scientists, and the funding agencies that support them, should be encouraged to take an interest in the full cycle of experimental design, data acquisition, data processing, and data interpretation through which bioscientists are expanding their understanding of the living world. Applications of the mathematical sciences to biology are not yet so specialized as to make this breadth of view impractical. An illustrative case is that of Phil Green, whose training before an early-career switch to genetics was in pure mathematics.
During the Human Genome Project, he made key contributions to problems at every level of genome analysis: the phred software package transformed large-scale DNA sequencing by attaching statistically valid quality measures to the raw base calls of automated sequencing instruments (Ewing and Green, 1998; Ewing et al., 1998); the phrap, consed, and autofinish software shepherded these base calls all the way to finished DNA sequence (Gordon et al., 1998; Gordon et al., 2001); then, in analyzing the sequence itself, Green contributed to problems as diverse as estimating the number of human genes (Ewing and Green, 2000), discovering the likely existence of a new DNA-repair process in germ cells (Green et al., 2003), and modeling sequence-context effects on mutation rates (Hwang and Green, 2004).

As this and many other stories emphasize, applications of the mathematical sciences to the biosciences span an immense conceptual range, even when one considers only one facet of the biological enterprise. No one scientist, mathematical or biological specialty, research program, or funding agency can span the entire range. Instead, the integration of diverse skills and perspectives must be the overriding goal. In this report, the committee seeks to encourage such integration by putting forward a set of broad principles that it regards as essential to the health of one of the most exciting and promising interdisciplinary frontiers in 21st century science.

REFERENCES

Adcock, C.J. 1997. Sample size determination: A review. Statistician 46(2): 261-283.
Alexandersson, M., S. Cawley, and L. Pachter. 2003. SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 13(3): 496-502.
Alizadeh, A.A., M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson Jr., L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, and L.M. Staudt. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403(6769): 503-511.
Alter, O., P.O. Brown, and D. Botstein. 2000. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. U.S.A. 97(18): 10101-10106.
Baldi, P., and A.D. Long. 2001. A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics 17(6): 509-519.
Becquet, C., S. Blachon, B. Jeudy, J.-F. Boulicaut, and O. Gandrillon. 2002. Strong-association-rule mining for large-scale gene-expression data analysis: A case study on human SAGE data. Genome Biol. 3(12): Research0067.
Bittner, M., P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, and V. Sondak. 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406(6795): 536-540.
Brunet, J.P., P. Tamayo, T.R. Golub, and J.P. Mesirov. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. U.S.A. 101(12): 4164-4169.
Burge, C., and S. Karlin. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268(1): 78-94.
Califano, A. 2000. SPLASH: Structural pattern localization analysis by sequential histograms. Bioinformatics 16(4): 341-357.
Carragher, B., D. Fellmann, F. Guerra, R.A. Milligan, F. Mouche, J. Pulokas, B. Sheehan, J. Quispe, C. Suloway, Y. Zhu, and C.S. Potter. 2004. Rapid, routine structure determination of macromolecular assemblies using electron microscopy: Current progress and further challenges. J. Synchrotron Rad. 11: 83-85.
Cho, R.J., M.J. Campbell, E.A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T.G. Wolfsberg, A.E. Gabrielian, D. Landsman, D.J. Lockhart, and R.W. Davis. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2: 65-73.
Churchill, G.A. 1989. Stochastic models for heterogeneous DNA sequences. Bull. Math. Bio. 51(1): 79-94.

Cortes, C., L.D. Jackel, and W.-P. Chiang. 1995. Limits on learning machine accuracy imposed by data quality. Pp. 57-62 in Proceedings of the First International Conference on Knowledge Discovery and Data Mining. U.M. Fayyad and R. Uthurusamy, eds. Montreal, Canada: AAAI Press.
Cortes, C., L.D. Jackel, S.A. Solla, V. Vapnik, and J.S. Denker. 1993. Learning curves: Asymptotic values and rate of convergence. Pp. 327-334 in Advances in Neural Information Processing Systems. NIPS’1993, Vol. 6. Denver, Colo.: Morgan Kaufmann.
Davis, G.L., M. Maggioni, R.R. Coifman, D.L. Rimm, and R.M. Levenson. 2003. Spectral/spatial analysis of colon carcinoma. Modern Pathol. 16(1): 320A-321A.
Doyle, R.T., M.J. Szulczewski, and P.G. Haydon. 2001. Extraction of near-field fluorescence from composite signals to provide high resolution images of glial cells. Biophys. J. 80: 2477-2482.
Duda, R.O., P.E. Hart, and D.G. Stork. 2000. Pattern Classification. New York, N.Y.: John Wiley & Sons Ltd.
Dudoit, S., Y.H. Yang, M.J. Callow, and T.P. Speed. 2002. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12(1): 111-139.
Eisen, M.B., P.T. Spellman, P. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95(25): 14863-14868.
Evgeniou, T., M. Pontil, and T. Poggio. 2000. Regularization networks and support vector machines. Adv. Comput. Math. 13: 1-50.
Ewing, B., and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8(3): 186-194.
Ewing, B., and P. Green. 2000. Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25(2): 232-234.
Ewing, B., L. Hillier, M.C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8(3): 175-185.
Fraser, S.E. 2003. Crystal gazing in optical microscopy. Nat. Biotechnol. 21(11): 1272-1273.
Frenkel, D., and B. Smit. 1996. Understanding Molecular Simulation: From Algorithms to Applications. San Diego, Calif.: Academic Press.
Friedman, J.H. 1994. An overview of computational learning and function approximation. Pp. 1-61 in From Statistics to Neural Networks. Theory and Pattern Recognition Applications. V. Cherkassky, J.H. Friedman, and H. Wechsler, eds. Berlin: Springer-Verlag.
Friedman, N., M. Linial, I. Nachman, and D. Pe’er. 2000. Using Bayesian networks to analyze expression data. J. Comput. Biol. 7: 601-620.
Gelfand, A.E., and A.F.M. Smith. 1990. Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 85: 398-409.
Geman, S., and D. Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE T. Pattern Anal. 6: 721-741.
Gilks, W.R., S. Richardson, and D.J. Spiegelhalter. 1996. Markov Chain Monte Carlo in Practice. London, England: Chapman and Hall.
Golub, T.R., D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
Gordon, D., C. Abajian, and P. Green. 1998. Consed: A graphical tool for sequence finishing. Genome Res. 8(3): 195-202.
Gordon, D., C. Desmarais, and P. Green. 2001. Automated finishing with autofinish. Genome Res. 11(4): 614-625.
Green, P., B. Ewing, W. Miller, P.J. Thomas, and E.D. Green. 2003. Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 33(4): 514-517.

Gribskov, M., A.D. McLachlan, and D. Eisenberg. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84(13): 4355-4358.
Hajdu, J., R. Neutze, T. Sjögren, K. Edman, A. Szöke, R.C. Wilmouth, and C.M. Wilmot. 2000. Analyzing protein functions in four dimensions. Nat. Struct. Biol. 7(11): 1006-1012.
Hansmann, U.H.E., M. Masuya, and Y. Okamoto. 1997. Characteristic temperatures of folding of a small peptide. Proc. Natl. Acad. Sci. U.S.A. 94: 10652-10656.
Hao, M.-H., and H.A. Scheraga. 1998. Molecular mechanisms of cooperative folding of proteins. J. Mol. Biol. 277: 973-983.
Hastie, T., R. Tibshirani, and J. Friedman. 2001. The Elements of Statistical Learning. New York, N.Y.: Springer.
Henderson, R. 2004. Realizing the potential of electron cryo-microscopy. Q. Rev. Biophys. 37(1): 3-13.
Hwang, D.G., and P. Green. 2004. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl. Acad. Sci. U.S.A. 101(39): 13994-14001.
Ideker, T., V. Thorsson, A.F. Siegel, and L.E. Hood. 2000. Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. J. Comput. Biol. 7(6): 805-817.
Kim, P.M., and B. Tidor. 2003. Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res. 13(7): 1706-1718.
Kluger, Y., R. Basri, J.T. Chang, and M. Gerstein. 2003. Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Res. 13(4): 703-716.
Korf, I., P. Flicek, D. Duan, and M.R. Brent. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics 17(Suppl 1): S140-S148.
Krogh, A., M. Brown, I.S. Mian, K. Sjolander, and D. Haussler. 1994. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235(5): 1501-1531.
Kulp, D., D. Haussler, M.G. Reese, and F.H. Eeckman. 1996. A generalized Hidden Markov Model for the recognition of human genes in DNA. Proc. Int. Conf. Intell. Syst. Mol. Biol. 4: 134-142.
Lauritzen, S.L., and D.J. Spiegelhalter. 1988. Local computations with probabilities on graphical structures and their application to expert systems. J. Roy. Stat. Soc. B 50: 157-224.
Lawrence, C.E., S.F. Altschul, M.S. Boguski, A.F. Neuwald, and J.C. Wootton. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262: 208-214.
Lazzeroni, L., and A.B. Owen. 2002. Plaid models for gene expression data. Stat. Sinica 12(1): 61-86.
Lee, D.D., and H.S. Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401(6755): 788-791.
Lee, M.L., F.C. Kuo, G.A. Whitmore, and J. Sklar. 2000. Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl. Acad. Sci. U.S.A. 97(18): 9834-9839.
Liu, J.S. 2001. Monte Carlo Strategies in Scientific Computing. New York, N.Y.: Springer-Verlag.
Mashanov, G.I., D. Tacon, A.E. Knight, M. Peckham, and J.E. Molloy. 2003. Visualizing single molecules inside living cells using total internal reflection fluorescence microscopy. Methods 29: 142-152.
Metropolis, N., A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21: 1087-1091.
Meyer, I.M., and R. Durbin. 2002. Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics (10): 1309-1318.

Minsky, M., and S. Papert. 1988. Perceptrons. An Introduction to Computational Geometry. Cambridge, Mass.: MIT Press.
Mootha, V.K., C.M. Lindgren, K.F. Eriksson, A. Subramanian, S. Sihag, J. Lehar, P. Puigserver, E. Carlsson, M. Ridderstrale, E. Laurila, N. Houstis, M.J. Daly, N. Patterson, J.P. Mesirov, T.R. Golub, P. Tamayo, B. Spiegelman, E.S. Lander, J.N. Hirschhorn, D. Altshuler, and L.C. Groop. 2003. PGC-1 alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 34(3): 267-273.
Mukherjee, S., P. Tamayo, S. Rogers, R. Rifkin, A. Engle, C. Campbell, T.R. Golub, and J.P. Mesirov. 2003. Estimating dataset size requirements for classifying DNA microarray data. J. Comput. Biol. 10(2): 119-142.
Murali, T.M., and S. Kasif. 2003. Extracting conserved gene expression motifs from gene expression data. Pp. 77-88 in Pacific Symposium on Biocomputing 2003. Singapore: World Scientific.
Nature Genetics Supplement 21. 1999.
Nature Genetics Supplement 32. 2002.
Newton, M.A., C.M. Kendziorski, C.S. Richmond, F.R. Blattner, and K.W. Tsui. 2001. On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Comput. Biol. 8(1): 37-52.
Perou, C.M., S.S. Jeffrey, M. van de Rijn, C.A. Rees, M.B. Eisen, D.T. Ross, A. Pergamenschikov, C.F. Williams, S.X. Zhu, J.C. Lee, D. Lashkari, D. Shalon, P.O. Brown, and D. Botstein. 1999. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. U.S.A. 96(16): 9212-9217.
Perou, C.M., T. Sorlie, M.B. Eisen, M. van de Rijn, S.S. Jeffrey, C.A. Rees, J.R. Pollack, D.T. Ross, H. Johnsen, L.A. Akslen, O. Fluge, A. Pergamenschikov, C. Williams, S.X. Zhu, P.E. Lonning, A.L. Borresen-Dale, P.O. Brown, and D. Botstein. 2000. Molecular portraits of human breast tumours. Nature 406(6797): 747-752.
Pomeroy, S.L., P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M.E. McLaughlin, J.Y. Kim, L.C. Goumnerova, P.M. Black, C. Lau, J.C. Allen, D. Zagzag, J.M. Olson, T. Curran, C. Wetmore, J.A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D.N. Louis, J.P. Mesirov, E.S. Lander, and T.R. Golub. 2002. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415(6870): 436-442.
Rosenblatt, F. 1962. Principles of Neurodynamics. New York, N.Y.: Spartan Books.
Roweis, S.T., and L.K. Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500): 2323-2326.
Schultz, R.A., T. Nielsen, J.R. Zavaleta, R. Ruch, R. Wyatt, and H.R. Garner. 2001. Hyperspectral imaging: A novel approach for microscopic analysis. Cytometry 43(4): 239-247.
Searls, D.B. 1992. The linguistics of DNA. Am. Sci. 80: 579-591.
Sela, M., F.H. White Jr., and C.B. Anfinsen. 1957. Reductive cleavage of disulfide bridges in ribonuclease. Science 125: 691-692.
Shipp, M.A., K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, T.S. Ray, M.A. Koval, K.W. Last, A. Norton, T.A. Lister, J. Mesirov, D.S. Neuberg, E.S. Lander, J.C. Aster, and T.R. Golub. 2002. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med. 8(1): 68-74.
Sinclair, M.B., J.A. Timlin, D.M. Haaland, and M. Werner-Washburne. 2004. Design, construction, characterization, and application of a hyperspectral microarray scanner. Appl. Optics 43(10): 2079-2088.

Slonim, D., P. Tamayo, J.P. Mesirov, T.R. Golub, and E.S. Lander. 2000. Class prediction and discovery using gene expression data. Pp. 263-272 in Proceedings of Fourth Annual International Conference on Computational Molecular Biology. New York, N.Y.: ACM Press.
So, P.T.C., C.Y. Dong, B.R. Masters, and K.M. Berland. 2000. Two-photon excitation fluorescence microscopy. Ann. Rev. Biomed. Eng. 2: 399-429.
Staunton, J.E., D.K. Slonim, H.A. Coller, P. Tamayo, M.J. Angelo, J. Park, U. Scherf, J.K. Lee, W.O. Reinhold, J.N. Weinstein, J.P. Mesirov, E.S. Lander, and T.R. Golub. 2001. Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. U.S.A. 98(19): 10787-10792.
Stormo, G.D., and D. Haussler. 1994. Optimally parsing a sequence into different classes based on multiple types of evidence. Pp. 369-375 in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. Vol. 2. R. Altman, D. Brutlag, P. Karp, R. Lathrop, and D. Searls, eds. Menlo Park, Calif.: AAAI Press.
Sumners, D. 1995. Lifting the curtain: Using topology to probe the hidden action of enzymes. Notices of the AMS 42: 528-537.
Svergun, D.I., and M.H.J. Koch. 2002. Advances in structure analysis using small-angle scattering in solution. Curr. Opin. Struct. Biol. 12: 654-660.
Tamayo, P., D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 96(6): 2907-2912.
Tanner, M.A. 1996. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions, 3rd ed. New York, N.Y.: Springer-Verlag.
Tanner, M.A., and W.H. Wong. 1987. The calculation of posterior distributions by data augmentation (with discussion). J. Am. Stat. Assoc. 82: 528-550.
Tenenbaum, J.B., V. de Silva, and J.C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290(5500): 2319-2323.
Tusher, V.G., R. Tibshirani, and G. Chu. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 98(9): 5116-5121.
Vapnik, V. 1998. Statistical Learning Theory. New York, N.Y.: John Wiley & Sons Ltd.
von Dassow, G., E. Meir, E.M. Munro, and G.M. Odell. 2000. The segment polarity network is a robust developmental module. Nature 406: 188-192.
Winfree, A.T. 1983. Sudden cardiac death, a problem in topology. Sci. Am. 248: 114-161.
