**Suggested Citation:**"8 Crosscutting Themes." National Research Council. 2005.

*Mathematics and 21st Century Biology*. Washington, DC: The National Academies Press. doi: 10.17226/11315.

## 8

Crosscutting Themes

The organization of this report around levels of biological organization reflects the committee’s view that the interplay between mathematics and biology during the 21st century will be driven by biological problems. Nonetheless, the committee also recognized that this view of the mathematics-biology interface risks the neglect of crosscutting themes—that is, mathematical ideas or areas of productive research activity that cut across levels of biological organization, emerging and re-emerging in diverse biological contexts. Accordingly, it concludes its report with a few examples of such themes, starting with some mathematical ideas that have assumed central importance at the interface between mathematics and biology.

**THE “SMALL *n*, LARGE *P*” PROBLEM**

Classical statistics largely arose in settings where typical problems involved estimating a small set of parameters (*P*) from large numbers of data points (*n*). Modern examples of such “small *P*, large *n*” problems include estimating the overall inflation rate—or, perhaps, a modest number of category-specific inflation rates—from longitudinal data on the prices of large numbers of specific items. Of course, similar problems often also arise in overtly biological contexts (e.g., analysis of life expectancies or the dose-response characteristics of pharmaceutical agents); however, these applications of classical statistics to living systems are typically far removed from considerations of underlying biological mechanism. In small *P*, large *n* contexts, estimates of the parameter set are expected to improve as the number of data points increases, and much of the machinery of classical statistics addresses trade-offs between the reliability of parameter estimates and the number of samples analyzed.

In many biological research settings, the statistical challenge is quite different. Individual experiments—for example, a microarray-based measurement of thousands of messenger-RNA levels in a single RNA sample extracted from a particular tumor—are often information-rich. However, the number of independent measurements from which a biologist seeks to draw conclusions (e.g., the number of tumors analyzed) may be quite small. Similarly, geneticists are now contemplating measuring the genetic variants present at ~10^{5} sites across the genomes of individual research subjects even though the number of individuals—for example, the number of cases and controls in a disease-susceptibility study—is under severe practical constraints. The challenge in these situations is analogous to attempting to reach conclusions from a moderate number of photographs of, for example, profitable and unprofitable restaurants. Although increasing the number of restaurants photographed would certainly improve the reliability of the study, presuming that the sampling strategy was well considered, success would depend even more heavily on strategies for representing and modeling the immense amount of information in each photograph. Interest among biologists in problems with similar statistical properties has grown dramatically in recent years. The committee examines this phenomenon here as a prime example of a crosscutting mathematical theme on the interface between mathematics and contemporary biology. It illustrates both the progress that has been made during the past decade and the challenges that lie ahead.

**Finding Patterns in Gene-Expression Data**

Although the small *n*, large *P* problem is encountered in many biological contexts, the challenges of interpreting gene-expression data provide a prototypical example that is of substantial current interest. The development of microarray technology, which can yield the transcription profiles of >10^{4} genes in a single experiment, has enabled global approaches to understanding regulatory processes in normal or disease states. Substantial work has been done on the selection and analysis of differentially expressed genes for purposes ranging from the discovery of new gene functions and the classification of cell types to the prediction of clinically important biological phenotypes (*Nature Genetics* Supplement 21, 1999; Golub et al., 1999; Tamayo et al., 1999; *Nature Genetics* Supplement 32, 2002). The best of the markers that have emerged from this research have already shown promise for both diagnostic and prognostic clinical use.

Applications to tumor classification have attracted particular interest since it has been estimated that over 40,000 cancer cases per year in the United States present major classification challenges for existing clinical and histological approaches. Gene-expression microarrays for the first time offer the possibility of basing diagnoses on the global-gene-expression profile of the tumor cells. Moreover, the discovery of gene-expression patterns that are significantly correlated with tumor phenotype can clarify molecular mechanisms of pathogenesis and potentially identify new strategies for treatment (Shipp et al., 2002). Similar opportunities exist for many other poorly understood diseases. For example, the recent discovery that a set of genes in the oxidative-phosphorylation pathway is more highly expressed in the muscle biopsies of normal controls than in those of patients with Type 2 diabetes has opened new avenues in diabetes research (Mootha et al., 2003). Closer study of the most highly correlated genes in this set led to the hypothesis that PGC-1α might regulate this subset of genes, a result that was then confirmed by further laboratory study. By this route, an aberration in PGC-1α expression has become a prime candidate for being a step in disease progression.

The pattern-recognition techniques required for analyzing gene-expression data and other large biological data sets are often called supervised and unsupervised learning. Machine-learning tools based on these techniques, designed in collaborations between bioscientists and mathematical scientists, have already come into widespread use. Pattern recognition via supervised and unsupervised learning is based on quantitative, stochastic descriptions of the data, sometimes referred to as associative models. These models typically incorporate few or no assumptions about the mechanistic basis for the patterns that they seek to discover.

In unsupervised learning techniques, the structure in a data set is elucidated without using any a priori labeling of the data. Unsupervised learning can be useful during exploratory analysis. Supervised techniques create models for classifying data by training on labeled members of the classes that are to be distinguished—for example, invasive and noninvasive tumors. Supervised techniques have an advantage over unsupervised techniques because they are less subject to structure that is not directly relevant to the distinction of interest, such as the laboratory in which the data were collected. Unfortunately, training sets are not available in many biological situations.

General machine-learning algorithms that are potentially useful in this area of research stem from fields such as psychology and systematics. The algorithms have then been refined in such fields as financial modeling or market analysis, where the number of independent instances—that is, data points such as days of observation or individual-customer transactions—is substantially larger than the number of variables measured (i.e., the dimension of the problem; see Hastie et al., 2001). A rule of thumb is that the number of data points should be at least as large as the square of the number of dimensions (Friedman, 1994). This goal is often out of reach in biological-pattern-recognition studies: Typically, the data sets available comprise a small number of samples per biological class (generally fewer than 100, often only 10 or 20); however, this small number of samples contrasts with the large number of features that characterize each sample—for example, expression levels for on the order of 10^{4} genes. In many situations, increasing the number of samples is simply not possible. Hence, biological applications of machine learning often involve instances of the small *n*, large *P* problem discussed above.
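To make the scale of the mismatch concrete, a back-of-the-envelope application of this rule of thumb to microarray-sized data (the specific numbers below are illustrative, not drawn from any particular study):

```python
# Rule of thumb (Friedman, 1994): data points >= (number of dimensions)^2.
n_genes = 10_000            # features measured per sample
required = n_genes ** 2     # samples the rule of thumb would demand
available = 50              # a generous size for an expression study
shortfall = required / available

# The rule asks for 10^8 samples; real studies have tens.
```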

The first generation of discovery and recognition tools suitable for the analysis of microarray data has been built, and these tools have established that expression data can be productively mined for purposes such as the classification of tumors (Golub et al., 1999; Bittner et al., 2000; Slonim et al., 2000; Tamayo et al., 1999; Perou et al., 1999; Perou et al., 2000), the prediction of tumor chemosensitivity (Staunton et al., 2001), and the prediction of treatment outcome (Alizadeh et al., 2000). The classifiers used in these papers are still heavily employed today and are being refined to apply to cases where subtle signatures of phenotypes such as post-treatment outcome are the endpoints. Despite clear successes in applying machine learning to gene-expression data, most studies have oversimplified the problem by treating genes as independent variables. Even when coregulation is taken into account (Cho et al., 1998; Eisen et al., 1998), existing methods still fail to capture the complex patterns of interaction that characterize all biological regulatory processes. They also inadequately address the diversity of biological mechanisms that can lead to indistinguishable phenotypes.

In the context of microarray data, although measurements may define a very high-dimensional space of >10,000 genes, the expression levels of these genes are dependent variables. In typical cases, a smaller set of variables—on the order of a few hundred metagenes—adequately captures the process being modeled. Thus, the problem can be approached by reducing the gene dimension in a principled way to reach the desired small *P*, moderate *n* level with which traditional statistical theory generally deals. However, major challenges remain in learning how to use a reduction process that actually reflects the biological mechanisms that lead to the highly correlated expression of the members of particular subsets of genes. These features of the problem highlight the need to develop statistical frameworks that will accommodate both the presence of many irrelevant variables and the high interdependence of those variables that are relevant. Next, the committee presents a brief description of the existing statistical framework.

**Supervised Learning**

In the supervised-learning setting, there exists a data set of samples that belong to two different phenotypes, Class *A* and Class *B*. The goal is to build a model that, when presented with a new sample of unknown phenotype, can identify its corresponding class with high accuracy. Mathematically, the challenge is to infer a function *F* that assigns a phenotype label *A* or *B* to a point *G = (g_{1}, g_{2}, … , g_{n})* in a high-dimensional space—the expression levels of ~10,000 genes in each sample—from a small number of *(G, F(G))* pairs. In general, such systems are highly underdetermined. Moreover, as discussed above, the variables are not independent: Genes are expressed in response to the activation of biochemical pathways resulting from multiple gene products. While this interdependence of the variables is the only reason the problem is tractable, it severely limits the utility of traditional statistical approaches for testing the significance of observations.

Classifiers tend to overfit the data—that is, they have poor predictive power outside the training sample because of the small number of samples, the large number of features (high dimensionality), and noise in the data. To address this problem, the number of variables must be reduced by selecting a subset of features that are most highly correlated with the phenotype distinction. From the perspective of machine learning and pattern recognition, the problem of optimal feature selection is intractable, and biologists must be content with empirical approximations that are tailored to the specific application (Duda et al., 2000). Traditional methods for determining the statistical significance of the features—for example, the expression levels of particular genes—to be used as classifiers assume a known underlying distribution of values and independence of the features. However, a good parametric description for expression values has yet to be determined and may not exist. Gene-gene interactions are fundamental to biological processes, and thus gene-expression data are inherently incompatible with independence assumptions.
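The kind of empirical, correlation-based feature selection described here can be sketched in a few lines. The signal-to-noise score below is one commonly used marker statistic; the simulated data, sample sizes, and cutoff are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated expression matrix: 20 samples (rows) by 500 genes (columns),
# a small caricature of the small-n, large-P regime.
n_a, n_b, n_genes = 10, 10, 500
X = rng.normal(0.0, 1.0, (n_a + n_b, n_genes))
labels = np.array([0] * n_a + [1] * n_b)
X[labels == 1, :5] += 3.0          # genes 0-4 carry a real class difference

# Signal-to-noise score per gene: (mean_A - mean_B) / (sd_A + sd_B).
mu_a, mu_b = X[labels == 0].mean(0), X[labels == 1].mean(0)
sd_a, sd_b = X[labels == 0].std(0), X[labels == 1].std(0)
s2n = (mu_a - mu_b) / (sd_a + sd_b)

# Keep the genes most strongly associated with the phenotype distinction.
top10 = np.argsort(-np.abs(s2n))[:10]
```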

Some groups have used permutation-based methods to solve the gene-selection problem (Golub et al., 1999; Slonim et al., 2000; Tusher et al., 2001). In these methods, one compares the observed distribution of gene correlations with phenotype against a distribution obtained by randomly assigning class labels to samples. Permuting the class labels preserves the gene-gene dependencies within the data set. Other methods include using step-down adjusted-*p* values (Dudoit et al., 2002), generalized likelihood tests (Ideker et al., 2000), Bayes hierarchical models (Newton et al., 2001; Baldi and Long, 2001), and combining data from replicates to estimate posterior probabilities (Lee et al., 2000). So far, no systematic comparisons of the error rates and statistical power of these different methods have been published. Clearly this is an area that needs more research and a strong formal framework.
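A minimal sketch of label-permutation testing, using a simple mean-difference statistic on simulated data (the statistic, sample sizes, and effect size are illustrative; published methods use more refined scores):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 8 + 8 samples, 200 genes; only gene 0 differs between classes.
X = rng.normal(size=(16, 200))
labels = np.array([0] * 8 + [1] * 8)
X[labels == 1, 0] += 3.5

def mean_diff(X, labels):
    """Per-gene mean difference between the two classes (illustrative score)."""
    return X[labels == 0].mean(0) - X[labels == 1].mean(0)

observed = mean_diff(X, labels)

# Null distribution: randomly reassign class labels. Because whole samples
# are permuted, the gene-gene dependencies within the data set are preserved.
n_perm = 500
null_max = np.empty(n_perm)
for i in range(n_perm):
    null_max[i] = np.abs(mean_diff(X, rng.permutation(labels))).max()

# Family-wise p-value for gene 0: fraction of permutations whose most
# extreme gene beats the observed score.
p_gene0 = (null_max >= np.abs(observed[0])).mean()
```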

Even the question of how many samples might be needed to improve the accuracy of the original classifier or to provide a more rigorous statistical validation of the predictive power of classifiers is difficult. Traditional power calculations (Adcock, 1997) do not address the situation posed by gene-expression data: They estimate the confidence of an empirical error estimate based on a given data set, not how the error rate might decrease given more data. Attempts have been made to answer the latter question using nonparametric methods and permutation testing (Cortes et al., 1993; Cortes et al., 1995; Mukherjee et al., 2003), but formal analysis of this problem remains an open challenge.

One widely used approach to supervised learning involves the use of support vector machines (SVMs). SVMs are based on a variation of regularization techniques for regression (Vapnik, 1998; Evgeniou et al., 2000) and are related to a much older algorithm, the perceptron (Minsky and Papert, 1988; Rosenblatt, 1962). Perceptrons seek a hyperplane to separate positive and negative examples. The SVM refines this by seeking the separating hyperplane that maximizes the margin between the two classes. It is trained by solving a convex optimization problem, usually involving a large number of variables. The objective function involves a penalty, which has to be tuned to avoid overfitting. Performance of the SVM is reasonably measured by the proportion of misclassified cases in a new sample. Since such a sample is not usually available, methods involving splitting the original training sample—known as cross-validation—are used. Other promising methods such as “boosting” are being developed in the statistics and machine-learning communities (Hastie et al., 2001). A basic limitation on all such methods is that, even when they identify reproducibly observable clusters, they may not provide insight into the biological mechanisms that underlie the process or phenotype being studied.
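The hyperplane-seeking idea is easiest to see in the perceptron itself; the sketch below trains one on separable toy data (an SVM would additionally maximize the margin via convex optimization, which is not attempted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: two Gaussian clouds ("Class A" vs "Class B").
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Perceptron: repeatedly nudge the hyperplane toward misclassified points.
w, b = np.zeros(2), 0.0
for _ in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:     # misclassified (or on the boundary)
            w += yi * xi
            b += yi
            errors += 1
    if errors == 0:                    # converged: all points separated
        break

train_accuracy = np.mean(np.sign(X @ w + b) == y)
```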

**Unsupervised Learning**

In unsupervised learning, the data are not labeled. The goal is to determine the underlying structure of a data set and to uncover relevant patterns and possible subtypes that can then provide the starting point for additional biological characterization. Many types of clustering algorithms have been applied to expression data—for example, hierarchical clustering (Cho et al., 1998; Eisen et al., 1998), self-organizing maps (SOM) (Tamayo et al., 1999), and *k*-means. These methods focus on the dominant structure present in a data set while potentially missing more subtle patterns that might be of equal or greater biological interest.
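Of these, *k*-means is the simplest to sketch; the version below alternates nearest-centroid assignment and centroid-update steps on simulated two-class profiles (the data and cluster count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "expression profiles": 60 samples from two well-separated classes.
X = np.vstack([rng.normal(0.0, 0.3, (30, 5)), rng.normal(2.0, 0.3, (30, 5))])

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    init = np.random.default_rng(seed).choice(len(X), k, replace=False)
    centers = X[init].copy()
    for _ in range(n_iter):
        # Squared distance from every sample to every centroid.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)                 # nearest centroid per sample
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return assign

assign = kmeans(X, k=2)
```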

In contrast, there are a number of local, or bottom-up, unsupervised methods that seek to identify and analyze subpatterns in gene expression data: the SPLASH algorithm (Califano, 2000), conserved X motifs (Murali and Kasif, 2003), the PLAID algorithm (Lazzeroni and Owen, 2002), the association rules of Becquet et al. (2002), or the frequent itemsets and modules of Tamayo et al. (2004) and Segal et al. (2004). Bottom-up approaches provide a comprehensive catalog of subpatterns and expose most or all of the potentially interesting structure. They tackle the small *n*, large *P* problem by attempting to directly extract and isolate the relevant signals. The challenge is the difficulty of dealing with the potentially large number of patterns discovered by these methods, many of which are typically false positives. The small *n*, large *P* problem remains in trying to find appropriate filters to separate real patterns from noise and finding ways to assemble the discovered patterns into a coherent representation of the data. Unfortunately, there is no theoretical foundation for evaluating the significance of extracted subpatterns purely on the basis of the data.

Classical approaches to reduce the noise and dimensionality use global decompositions or projections of the data that preserve the dominant structure. Examples of these methods include principal-component analysis (PCA) (Bittner et al., 2000; Pomeroy et al., 2002), singular-value decomposition (SVD) (Alter et al., 2000; Kluger et al., 2003) and PLAID (Lazzeroni and Owen, 2002). Unsupervised global, or top-down, approaches address the small *n*, large *P* problem by using appropriate projections from gene space to find a set of molecular coordinates that captures dominant signals. Once again, these methods often produce difficult-to-interpret, complex, or unwieldy representations of the data.
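The connection between PCA and the SVD can be made concrete in a few lines: the principal axes of the centered data matrix are its right singular vectors, and the sample coordinates along them play the role of the "molecular coordinates" referred to above. The planted one-factor data set below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)

# 30 samples x 1,000 "genes", with most variance along one hidden axis
# (a crude stand-in for a single dominant coregulated program).
t = rng.normal(size=(30, 1))             # hidden per-sample activity
loadings = rng.normal(size=(1, 1000))    # how each gene responds to it
X = t @ loadings + 0.1 * rng.normal(size=(30, 1000))

# PCA by SVD of the centered matrix: rows of Vt are the principal axes,
# and U * S gives the sample coordinates along them.
Xc = X - X.mean(0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S ** 2 / (S ** 2).sum()      # variance share per component
pc1 = U[:, 0] * S[0]                     # first principal coordinate
```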

Projection algorithms such as nonnegative matrix factorization (NMF) (Lee and Seung, 1999; Kim and Tidor, 2003; Brunet et al., 2004) represent a new generation of methods that attempt to project the data into the space of a small number of metagenes, which provide representations that aid in biological interpretation and have the potential to guide follow-up experiments in the laboratory. NMF is based on a decomposition-by-parts approach, which was introduced by Lee and Seung (1999) to identify characteristic features of faces and semantic features of text. Despite its usefulness and practical success in clustering data, there are many open questions concerning the algorithm, its convergence properties, and the properties of the projected representation. Recent research on unsupervised-learning problems focuses on low-dimensional, nonlinear-manifold representations (Roweis and Saul, 2000; Tenenbaum et al., 2000), or other nonlinear sparse representations. This work is in its infancy and has not yet been systematically used for biological applications.
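The multiplicative-update form of NMF introduced by Lee and Seung (1999) is short enough to sketch directly; the planted two-metagene matrix below is an illustrative assumption, and the update rule shown is the Frobenius-norm variant:

```python
import numpy as np

rng = np.random.default_rng(5)

# Nonnegative "expression" matrix with a planted rank-2 structure:
# two metagenes, each driving half of the 40 samples.
W_true = np.zeros((40, 2))
W_true[:20, 0] = rng.uniform(1, 2, 20)
W_true[20:, 1] = rng.uniform(1, 2, 20)
H_true = rng.uniform(0, 1, (2, 300))
V = W_true @ H_true + 0.01          # small offset keeps V strictly positive

# Lee-Seung multiplicative updates for V ~ W H under the Frobenius norm.
k = 2
W = rng.uniform(0.1, 1.0, (40, k))
H = rng.uniform(0.1, 1.0, (k, 300))
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
metagene = W.argmax(1)              # dominant metagene per sample (crude clustering)
```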

All the supervised and unsupervised approaches the committee describes here have associated questions that require further investigation. Some of these questions follow: Is it possible to develop a formal framework for evaluating the significance of features or subpatterns extracted in a small *n*, large *P* context? What is the best way to determine the correct number of clusters within a given data set? How does one validate clustering or decomposition results? How does one compare the correctness of two decompositions of a data set? None of these challenges is unique to biology, but biological applications bring them to the fore.

**ANALYSIS OF ORDERED SYSTEMS**

Systems or processes with strong spatial or temporal order are ubiquitous in biology. Examples include the sequence of bases in the genome, the propagation of nerve impulses, and—at a higher level of biological organization—animal behavior. Mathematical techniques for analyzing ordered processes have been successfully imported into biology from other research areas. A particularly important example is the hidden Markov model (HMM). HMMs have been used in areas such as speech recognition since the 1970s. More recently, they have been applied with great success in many areas of biology. HMMs require more specific modeling of the structure within a data set than do the nonparametric methods discussed in the preceding section. When suitable models exist, this requirement is a strength: Indeed, it is sometimes possible to make valid inferences from a single instance of a biological entity such as a gene—that is, to analyze a small *n*, large *P* problem when *n* = 1. This escape from the small *n*, large *P* problem is somewhat illusory since the HMM assumption enables us to use the large number of bases in the single gene to provide us with nearly independent and identically distributed proxy samples.

**Applications of Hidden Markov Models to the Analysis of DNA, RNA, and Protein Sequences**

An HMM describes a set of states connected by transitions between states. The transitions occur according to a Markov process. This means that the distribution of the *m*th state in the series, given the preceding *m* – 1 states, depends only on the (*m* – 1)st state. However, the states themselves are not observed (they are hidden): They reveal themselves by emitting observable variables. In speech recognition, the observed variables might be phonemes. In DNA and protein applications, they would be the nucleotides or amino acids corresponding to specific sequences. All of the parameters of the HMMs governing the emissions of variables from specific states and the transitions between states are probabilities. There are many well-established algorithms for addressing important questions that arise during the use of HMMs. For example, given an HMM and a sequence, one can determine the probability that the sequence was generated by the HMM. Calculation of these probabilities allows one to find within a set of candidate models the HMM that is most likely to have generated a particular sequence. One can also find the specific path of the sequence through the HMM. This capability allows one to parse the sequence into the most likely arrangement of hidden states. Note, however, that these are still associative rather than mechanistic models and are usually viewed simply as very crude approximations to reality. Two specific applications of HMMs to biological sequences, profile HMMs for protein families and HMMs for predicting gene structures in DNA, are discussed below.
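The first of these computations, the probability that a model generated a sequence, is the forward algorithm. A toy two-state DNA model (GC-rich versus AT-rich, with made-up parameters) makes the recursion concrete:

```python
import numpy as np

# Toy two-state DNA HMM: a GC-rich state and an AT-rich state.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],            # row: current state, column: next
                  [0.1, 0.9]])
emit = np.array([[0.1, 0.4, 0.4, 0.1],   # GC-rich emissions over (A, C, G, T)
                 [0.4, 0.1, 0.1, 0.4]])  # AT-rich emissions
base_index = {"A": 0, "C": 1, "G": 2, "T": 3}

def forward_log_prob(seq):
    """Forward algorithm: log P(seq | model), summing over all state paths."""
    f = start * emit[:, base_index[seq[0]]]
    log_p = np.log(f.sum())
    f = f / f.sum()                      # rescale to avoid underflow
    for ch in seq[1:]:
        f = (f @ trans) * emit[:, base_index[ch]]
        log_p += np.log(f.sum())
        f = f / f.sum()
    return log_p

# A GC-heavy sequence should score higher under this model than under
# a uniform independent-base background.
seq = "GCGCGGCCGC"
log_p_model = forward_log_prob(seq)
log_p_background = len(seq) * np.log(0.25)
```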

**Profile HMMs**

Profiles for protein families were introduced by Gribskov et al. (1987) as a method for representing the variability in protein sequences of the same family. Given an alignment of the sequences, the profile provides a score for all possible amino acids that might occur at each position and also a score associated with deletions and insertions at different positions. Profile HMMs were introduced by Krogh et al. (1994) to put the concept of a profile in a fully probabilistic framework. The hidden states, which are the positions in the protein-family model, are hidden because any individual sequence may have insertions and deletions relative to the model. Given a set of sequences known to be of the same protein family, expectation maximization (EM) can be used to determine the parameters of the HMM for the family. Given a profile HMM for a specific family and a protein sequence, one can determine the best alignment of the sequence to the family and the probability that the protein sequence would be generated by the HMM for the family. One can then classify proteins into different families by comparing those probabilities.

The emission probabilities at each position of the HMM can indicate important features of a protein family. For example, active-site residues in enzymes tend to be highly, if not completely, conserved among all members of a family. Positions that are all hydrophobic are likely to be in the interior of the protein or exposed to hydrophobic environments such as the interior of membranes. Given a set of HMMs for different protein families and at least one known structure for each family, HMM-based methods provide an effective means for predicting the approximate structure of a new protein from its sequence simply by determining the family to which the protein is most likely to belong. Of course, if the protein does not belong to any of the established families, this approach fails, and one must resort to ab initio methods. However, as increasing numbers of protein structures are determined and it becomes increasingly clear that most proteins—or at least domains of proteins—fall into a limited set of structural classes, HMM-based classification methods are providing more and more useful predictions of protein structure and function.

Despite past success, there is ample room for improvement in the development and application of HMMs to protein families. Two important areas for improvement deal with nonindependence in the data. Usually it is assumed that the protein sequences from which a profile HMM is built are independent samples from the set of sequences in the family. In actuality, members of the sample set are related to each other by a phylogenetic tree, and means of incorporating that information into profile HMMs should improve their performance. The other nonindependence issue involves limitations on the structures of the HMMs themselves. Profile HMMs assume that the positions are independent of one another or, at most, that there is a low-order Markov dependence among nearby positions. In reality, distant positions within the protein may be interacting with one another, and the amino acid frequencies at these interacting sites may be correlated. Such long-distance correlations occur frequently in RNA structures and are represented by higher-order models called stochastic context-free grammars. However, even stochastic context-free grammars are limited to correlated positions that are nested. This condition does not hold for typical protein interactions; indeed, it does not even apply to all intramolecular interactions within RNA molecules. Finding efficient ways of taking such long-range interactions into account, while maintaining the advantages of probabilistic models, would provide an important improvement, especially for structure prediction.

**HMMs in Gene Finding**

Gary Churchill (1989) first applied HMMs to partition DNA sequences into domains with different characteristics. Early on, David Searls (1992) recognized the analogy between the parsing of sequences in linguistic analysis and the determination of functional domains in DNA sequences. By the early 1990s, David Haussler and colleagues had begun applying HMMs to the problem of identifying the protein-coding regions in genomic DNA sequences (see Krogh et al., 1994; Stormo and Haussler, 1994; Kulp et al., 1996). By that time, large-scale DNA-sequencing projects had begun, and there were many DNA sequences in the databases with no known associated genes or functions. Predicting what proteins might be encoded in these newly discovered DNA sequences was an important problem.

The basic structure of an HMM maps well to the gene-prediction problem. The hidden states are the functional domains of the DNA sequence: For example, some regions of the DNA code for protein sequence, other regions code for untranslated portions of genes, while still others are intergenic. Each class of regions has some statistical features that help to distinguish it from the other classes. For example, protein-coding exons must have an open reading frame and often use codons in a biased manner, so the base-emission probabilities characterizing that state will be different from those characterizing introns or other classes. There is also a clearly defined grammar for protein-coding regions: Introns must alternate with exons, and intergenic regions must surround these alternating exon-intron segments.
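Parsing a sequence into its most likely arrangement of hidden states is done with the Viterbi algorithm. The toy model below, with invented exon-like and intron-like emission biases, illustrates the idea; real gene finders use far richer state structures:

```python
import numpy as np

# Toy parsing HMM: an exon-like (GC-rich) and an intron-like (AT-rich) state.
# Real gene finders track reading frame, splice signals, and more; this
# sketch only shows how Viterbi decoding segments a sequence.
start = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05],
                  [0.05, 0.95]])
emit = np.array([[0.15, 0.35, 0.35, 0.15],   # exon-like: favors C and G
                 [0.35, 0.15, 0.15, 0.35]])  # intron-like: favors A and T
base_index = {"A": 0, "C": 1, "G": 2, "T": 3}

def viterbi(seq):
    """Most probable hidden-state path, computed in log space."""
    n, k = len(seq), 2
    logv = np.log(start) + np.log(emit[:, base_index[seq[0]]])
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        scores = logv[:, None] + np.log(trans)   # scores[i, j]: move i -> j
        back[t] = scores.argmax(0)
        logv = scores.max(0) + np.log(emit[:, base_index[seq[t]]])
    path = [int(logv.argmax())]                  # trace back the best path
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# An exon-intron-exon toy sequence parses into three segments.
path = viterbi("GCGCCGGC" + "ATTATATA" + "CGGCGCGC")
```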

On the other hand, some aspects of gene structure are not captured by simple HMM architectures. For example, when introns are removed, the two joined exons must remain in-frame, so the HMM has to maintain a memory of the reading frame from the previous exon as it passes over the intron. Furthermore, exons and introns have different length distributions; neither is simply geometric, as would be modeled by a simple HMM. Finally, the boundaries between domains are often indicated by signals in the DNA sequence—that is, specific sequence motifs that are themselves modeled by the probability distributions of bases at different positions within the motifs.
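The geometric-length limitation is easy to verify: a state with self-transition probability *p* stays occupied for a geometrically distributed number of steps, with mean 1/(1 – *p*). The simulation below checks this for an illustrative *p*:

```python
import numpy as np

rng = np.random.default_rng(6)

# In a simple HMM, a state with self-transition probability p is occupied
# for a geometric number of steps: P(length = k) = p**(k - 1) * (1 - p).
p = 0.9
lengths = rng.geometric(1 - p, size=100_000)   # sample segment lengths
mean_length = lengths.mean()
expected_mean = 1 / (1 - p)                    # about 10 steps here
```

Real exon and intron length distributions are not geometric, which is one motivation for the generalized HMMs discussed next.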

Gene-prediction accuracy can be improved by incorporating other evidence that is not derived from the DNA sequence alone—for example, similarities between the protein sequence inferred from the predicted gene structure and previously known protein sequences. To utilize all the different kinds of information that are useful for gene prediction and to capture the details of gene structures, HMMs have been extended to generalized HMMs (GHMMs) (Kulp et al., 1996; Burge and Karlin, 1997). These new models, which couple classical HMMs to machine-learning techniques, provide significantly better predictions than previous models. Recently, the methodology was extended to predict gene structure simultaneously in two homologous sequences (Korf et al., 2001; Meyer and Durbin, 2002; Alexandersson et al., 2003). Since corresponding (orthologous) genes in closely related organisms are expected to have similar structures, adding the constraint that the predicted structure be compatible with both sequences can significantly improve accuracy.

Despite these advances, there is still much room for improvement in gene prediction. Overall accuracy, even when using two species, is far from 100 percent. Increasingly, the failures of gene-prediction methods are due to the inherent biological complexity of the problem. Recent data have emphasized that a region of DNA may code for multiple protein variants owing to alternative splicing. Indeed, it now appears that the majority of human genes are alternatively spliced to give two or more protein products. This biological reality means that the basic assumption of gene-prediction HMMs—that any particular base in the sequence derives from a unique hidden state rather than playing multiple functional roles—is incorrect. It may be possible to extend HMMs to deal with such situations by making explicit states that accommodate dual roles or by predicting alternative products from the optimal and suboptimal predictions of the HMMs.

Much remains to be learned about the various classes of DNA segments and the features that define them. In particular, regulatory regions pose major challenges. These regions are composed of sets of binding sites for regulatory proteins, organized into modules that control gene expression. More experimental information is needed to incorporate the properties of regulatory regions into gene-prediction models. However, eventually it may be possible not only to predict what proteins are encoded by a given DNA region but also to predict the conditions under which they are expressed.

**APPLICATIONS OF MONTE CARLO METHODS IN COMPUTATIONAL BIOLOGY**

The early development of dynamic Monte Carlo methods (Metropolis et al., 1953) was motivated by the study of liquids and other complex physical systems. Increasing computational power and theoretical advances subsequently expanded their application throughout many areas of science, technology, and statistics. The use of dynamic Monte Carlo methods in statistics began in the early 1980s, when Geman and Geman (1984) and others introduced them in the context of image analysis. It was quickly realized that these methods were also useful in more traditional applications of parametric statistical inference. Tanner and Wong (1987), as well as Gelfand and Smith (1990), pointed out that such standard statistical problems as latent-class models, hierarchical-linear models, and censored-data regression all have structures allowing the effective use of iterative sampling when estimating posterior and predictive distributions. Within the past decade, there has been an explosion of interest in the application of Monte Carlo methods to diverse statistical problems such as clustering, longitudinal studies, density estimation, model selection, and the analysis of graphical systems (for reviews, see Tanner (1996); Gilks et al. (1996); Liu (2001)). Concomitant with the spread of Monte Carlo methods in model-based analysis, there has been a general increase in reliance on computational inference in many areas of science and engineering.

Computational inference based on Bayesian or likelihood models often leads to large-scale Monte Carlo sampling as a global optimization strategy. In summary, in areas extending far beyond biology, Monte Carlo sampling has become an important tool in scientific computation, particularly when computational inference is based on statistical models. The committee describes below some uses of Monte Carlo methods in computational biology and discusses the limitations on current methods and possible directions for future research.
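As a reminder of how simple the core of these methods is, the sketch below applies the Metropolis acceptance rule (Metropolis et al., 1953) to a toy one-dimensional double-well energy; the energy function and all parameters are invented for illustration.

```python
import math
import random

random.seed(1)

def energy(x):
    """Toy double-well energy with minima near x = -1 and x = +1."""
    return (x**2 - 1.0) ** 2

def metropolis(n_steps, step=0.5, beta=3.0):
    """Sample exp(-beta * energy(x)) with the Metropolis rule."""
    x, samples = 0.0, []
    for _ in range(n_steps):
        proposal = x + random.uniform(-step, step)
        dE = energy(proposal) - energy(x)
        # Accept downhill moves always; uphill moves with Boltzmann probability.
        if dE <= 0 or random.random() < math.exp(-beta * dE):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(50_000)
# A well-mixed chain should spend time in both wells.
frac_right = sum(s > 0 for s in samples) / len(samples)
print(f"fraction of samples in the right-hand well: {frac_right:.2f}")
```

The same accept/reject rule, with vastly more elaborate state spaces and energy or likelihood functions, underlies the applications described below.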

**Gibbs Sampling in Motif Finding**

The identification of binding sites for transcription factors that regulate when and where a gene may be transcribed is a central problem in molecular biology. Beginning in the late 1980s, this problem was formulated as a statistical-inference problem by Gary Stormo, Charles Lawrence, and others. It was assumed that the upstream regions of a set of coregulated genes are enriched in binding sites that have nucleotide frequencies different from the background sequences. In general, neither the site-specific nucleotide frequencies (the motif model) nor the locations of the sites are known. Currently, the most successful algorithm for the simultaneous statistical inference of the motif model and the sites involves application of a version of the Monte Carlo algorithm called the Gibbs sampler (Lawrence et al., 1993). Computational biologists are presently working to extend this basic approach to incorporate cooperative interactions between bound transcription factors and to analyze sequences from multiple species that are evolutionarily related.
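The logic of such a site sampler can be conveyed by a deliberately simplified sketch. The version below assumes exactly one motif occurrence per sequence, plants a known motif in synthetic data, and alternates between estimating a position weight matrix from the current site assignments and resampling each held-out sequence's site; a final greedy sweep snaps each site to its modal position. Production implementations (e.g., Lawrence et al., 1993) are considerably more elaborate.

```python
import math
import random

random.seed(2)

BASES = "ACGT"

def make_data(n_seqs=10, length=50, motif="TATAAT"):
    """Random background sequences with one planted motif occurrence each."""
    seqs, sites = [], []
    for _ in range(n_seqs):
        s = [random.choice(BASES) for _ in range(length)]
        pos = random.randrange(length - len(motif) + 1)
        s[pos:pos + len(motif)] = motif
        seqs.append("".join(s))
        sites.append(pos)
    return seqs, sites

def pwm_from(seqs, sites, skip, w, pseudo=0.5):
    """Position weight matrix from current sites, holding out sequence `skip`."""
    counts = [{b: pseudo for b in BASES} for _ in range(w)]
    for i, (s, pos) in enumerate(zip(seqs, sites)):
        if i == skip:
            continue
        for j in range(w):
            counts[j][s[pos + j]] += 1
    total = len(seqs) - 1 + 4 * pseudo
    return [{b: c[b] / total for b in BASES} for c in counts]

def site_sampler(seqs, w=6, sweeps=200):
    """Simplified Gibbs site sampler: one motif occurrence per sequence."""
    sites = [random.randrange(len(s) - w + 1) for s in seqs]
    for sweep in range(sweeps + 1):
        greedy = sweep == sweeps  # final sweep: take the modal site
        for i, s in enumerate(seqs):
            pwm = pwm_from(seqs, sites, i, w)
            # Score every candidate start against the motif model
            # (uniform 0.25 background).
            weights = []
            for pos in range(len(s) - w + 1):
                logw = sum(math.log(pwm[j][s[pos + j]] / 0.25) for j in range(w))
                weights.append(math.exp(logw))
            if greedy:
                sites[i] = max(range(len(weights)), key=weights.__getitem__)
            else:
                sites[i] = random.choices(range(len(weights)), weights)[0]
    return sites

seqs, true_sites = make_data()
found = site_sampler(seqs)
consensus_pwm = pwm_from(seqs, found, skip=-1, w=6)
consensus = "".join(max(BASES, key=col.get) for col in consensus_pwm)
print("planted TATAAT; recovered consensus:", consensus)
```

Real motif finders must also handle sequences with zero or several sites, phase shifts, unknown motif widths, and nonuniform background models.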

**Inference of Regulatory Networks**

Probabilistic networks were developed independently in statistics (Lauritzen and Spiegelhalter, 1988) and computer science (Pearl, 1988). Directed-graph versions of probabilistic networks, known as Bayesian networks, have played an important role in the formulation of expert systems. Recently, Bayesian networks also proved to be useful as models of biological regulatory networks (Friedman et al., 2000). In these networks, the genes and proteins in a regulatory network are modeled as nodes in a directed graph, in which the directed edges indicate potential causal interactions—for example, gene *A* activates gene *B*. Given the network structure—that is, the graph structure specifying the set of directed edges—there are efficient algorithms for inferring the remaining parameters of the network. If the network structure is unknown, inferring it involves sampling from its posterior distribution, given the data. This computation is challenging, since the space of all possible network structures is superexponentially large. The development of Monte Carlo schemes capable of handling this computation would be of great value in computational biology.
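The phrase "superexponentially large" can be made concrete: the number a(n) of labeled directed acyclic graphs on n nodes satisfies Robinson's recurrence, a(n) = sum over k = 1..n of (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k), with a(0) = 1.

```python
from math import comb

def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes
    (Robinson's recurrence)."""
    a = [1]  # a[0] = 1
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            total += (-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
        a.append(total)
    return a[n]

for n in range(1, 8):
    print(n, count_dags(n))
```

The counts pass one billion already at n = 7 nodes, which is why posterior inference over structures must rely on sampling rather than enumeration.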

**Sampling Protein Conformations**

The protein-folding problem has been a grand challenge for computational molecular bioscientists for more than 30 years, since Anfinsen demonstrated that the sequences of some proteins determine their folded conformations (Sela et al., 1957). To formulate the computational problem, one sets up an energy function based on considerations of bonding geometry, as well as electrostatic and van der Waals forces. Possible conformations of the protein (i.e., the relative spatial positions of all its heavy atoms) can then be sampled either by integrating Newton’s second law (i.e., carrying out a molecular dynamics calculation) or by Monte Carlo sampling of the corresponding Boltzmann distribution (for a review, see Frenkel and Smit, 1996). This problem is attractive both because it is intrinsically important for understanding proteins and because computational results can be compared with experimentally solved structures. Hence, unlike in many other areas of predictive modeling in biology, there are easily applied, objective criteria for comparing the relative accuracy of alternative models. At present, de novo computation of native protein structures is not feasible. Thus, the near-term focus of most research in this area is on gaining an improved understanding of the mechanism of protein folding (Hansmann et al., 1997; Hao and Scheraga, 1998). Monte Carlo methods are important in these investigations because they provide wider sampling of the conformation space than do conventional methods. The study of folding-energy landscapes is generally based on a simplified energy function—for example, effects of entropy in the solvent are incorporated into artificial hydrophobic terms in the energy function—and a greatly simplified conformation space. Even with such simplifications, Monte Carlo methods are often the only way to sample this space.
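The flavor of such simplified-landscape calculations can be conveyed by a deliberately tiny cartoon: a two-dimensional chain of beads whose conformation is parameterized by bond angles, with an invented contact/clash energy, sampled by the Metropolis rule. This is an illustration of conformational sampling, not a realistic force field.

```python
import math
import random

random.seed(3)

N = 8  # beads in the toy chain

def positions(angles):
    """2D bead coordinates from successive bond angles (unit bond lengths)."""
    pts, theta = [(0.0, 0.0)], 0.0
    for a in angles:
        theta += a
        x, y = pts[-1]
        pts.append((x + math.cos(theta), y + math.sin(theta)))
    return pts

def energy(angles):
    """Toy folding energy: -1 per nonbonded contact, +10 per steric clash."""
    pts = positions(angles)
    E = 0.0
    for i in range(len(pts)):
        for j in range(i + 2, len(pts)):  # skip bonded neighbors
            d = math.dist(pts[i], pts[j])
            if d < 0.8:
                E += 10.0   # clash penalty
            elif d < 1.2:
                E -= 1.0    # favorable contact
    return E

def fold(steps=20_000, beta=5.0):
    """Metropolis sampling over bond angles; returns the best energy seen."""
    angles = [0.0] * (N - 1)  # start fully extended (zero contacts)
    E = energy(angles)
    best = E
    for _ in range(steps):
        i = random.randrange(N - 1)
        old = angles[i]
        angles[i] += random.gauss(0.0, 0.5)
        dE = energy(angles) - E
        if dE <= 0 or random.random() < math.exp(-beta * dE):
            E += dE
            best = min(best, E)
        else:
            angles[i] = old  # reject: restore the previous conformation
    return best

print("best energy found:", fold())
```

At a low temperature (large beta), the chain readily discovers compact, contact-rich conformations that a deterministic minimizer starting from the extended state might miss.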

**LESSONS FROM MATHEMATICAL THEMES OF CURRENT IMPORT**

This discussion of flourishing applications of machine learning, hidden Markov models, and Monte Carlo sampling illustrates how particular mathematical themes can gain prominence in response to trends in biological research. The advent of high-throughput DNA sequencing and gene-expression microarrays brought to the forefront of biological research large amounts of data and many classes of problems that demanded the importation of broad, powerful mathematical formalisms. Continued reliance on ad hoc solutions to particular problems would have impeded the development of whole areas of biology. In the instances discussed, the biological problems that needed solution were sufficiently analogous to problems previously encountered in other fields that relevant mathematical formalisms were available. As these formalisms came into widespread use in the biosciences, particular limitations, associated in many instances with the general characteristics of the biological problems to which they were applied, became evident and stimulated new mathematical research on the methods themselves.

The committee expects this dynamic to recur as mathematical biology matures. Indeed, the committee attached more importance to the process than to its particular manifestations in the 1990s and early 2000s. While the techniques described here have broad importance at the moment, the committee does not expect them to dominate the biosciences over the long term. Indeed, as it did in the Executive Summary and Chapter 1, “The Nature of the Field,” the committee once more cautions against drawing up a list of mathematical challenges that are not grounded in specific biological problems. Both the biosciences and mathematics have strange ways of surprising us. Mathematics can be useful in ways that are not predictable. For example, Art Winfree’s use of topology provided wonderful insights into the way many oscillatory biological processes work (Winfree, 1983). Similarly, De Witt Sumners’s use of topology to understand aspects of circular DNA (Sumners, 1995) and Gary Odell’s topological observations about the gene network behind segment polarity were quite unexpected (von Dassow et al., 2000). Yet, even though topological arguments have provided biologists with powerful insights, the committee did not conclude that topology should be prioritized for further development because of its potential to contribute to biology. Instead, the committee expects that biological problems will continue to drive the importation and evolution of applicable mathematics. Then, as general principles emerge, they will be codified at the appropriate level of generality. For machine learning, HMMs, and Monte Carlo sampling, this process is well under way. Indeed, these powerful methods are now well established in the toolkits of most computational biologists and are routinely taught in introductory graduate-level courses covering computational biology. Other methods will follow, just as others went before. 
The greatest enabler of this process will be research programs and collaborations that confront mathematical scientists with specific problems drawn from across the whole landscape of modern biology.

**PROCESSING OF LOW-LEVEL DATA**

The purpose of the current chapter, “Crosscutting Themes,” is to call attention to issues that might have been neglected if the committee had relied entirely on levels of biological organization to structure this report. By discussing examples of mathematical themes that are important at many levels of biological organization, the committee accomplishes that purpose. Another quite different crosscutting theme is the importance of low-level data processing. Indeed, one could argue that the most indispensable applications of mathematics in biology have historically been in this area. Furthermore, the importance of low-level data processing in biology appears likely to grow. Rapid advances in technologies such as optics, digital electronics, sensors, and small-scale fabrication ensure that biologists will have access to ever more powerful instruments.

Nearly all the data that biologists obtain from these instruments have gone through extensive analog and digital transformations. Because these transformations improve signal-to-noise ratios, correlate signals with real-world landmarks, eliminate distortions, and otherwise add value to the physical output of the primary sensing devices, they are often the key to success during instrumentation development. The continued involvement of mathematicians, physicists, engineers, chemists, and bioscientists in instrumentation development has great potential to advance the biological sciences. Mathematical scientists are essential partners in these collaborations. Indeed, many of the challenges that arise in low-level data processing can only be met by applying powerful, abstract formalisms that are unfamiliar to most bioscientists. A few examples, discussed below, illustrate current research in this area.
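A toy example conveys the flavor of such value-adding transformations: windowed averaging of a synthetic noisy trace, which trades a little resolution for a substantial reduction in noise. All numbers here are invented.

```python
import math
import random

random.seed(4)

def moving_average(x, w=9):
    """Simple windowed mean: one of the most basic low-level transforms."""
    half = w // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

# Synthetic "instrument trace": a smooth peak plus white noise.
n = 500
clean = [math.exp(-(((i - 250) / 40.0) ** 2)) for i in range(n)]
noisy = [c + random.gauss(0.0, 0.3) for c in clean]
smoothed = moving_average(noisy)

def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

print("RMS error before filtering:", round(rms_error(noisy, clean), 3))
print("RMS error after filtering: ", round(rms_error(smoothed, clean), 3))
```

Real instrument pipelines chain many such steps, each tuned to the physics of the sensor, and the design of those chains is where mathematical input matters most.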

In optical imaging, the development of two-photon (or, more generally, multiphoton) fluorescence microscopy is already having a significant impact on biology (So et al., 2000). This technique, in which molecular excitation takes place from the simultaneous absorption of two or more photons by a fluorophore, offers submicron resolution with relatively little damage to samples. The latter feature is of particular importance in biology since there is growing interest in observing living cells as they undergo complex developmental changes. The sensitivity of two-photon microscopy, in contrast to conventional fluorescence microscopy, is more dependent on peak illumination of the sample than on average illumination; hence, pulsed-laser light sources can be used to provide high instantaneous illumination while maintaining low average-power dissipation in the sample. Significant progress has been made in using two-photon methods to image cells, subcellular components, and macromolecules. Substantial improvements in sensitivity remain possible since in current instruments, only a small fraction of emitted photons reaches the detector. This low sensitivity, among other problems, limits the time resolution of two-photon microscopy. Computation and simulation will play a key role in efforts to increase sensitivity by optimizing the light path and improving detectors. Discussing the potential of future improvements in the sensitivity of two-photon microscopy, Fraser (2003) observed that “with a combined improvement of only ten-fold, today’s impossible project can become tomorrow’s routine research project.” This rapid progression from the impossible to the routine is the story of much of modern experimental biology.
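The advantage of pulsed sources follows from simple arithmetic. Because two-photon excitation scales with the square of the instantaneous intensity, the signal is proportional to the time average of the squared intensity, so at a fixed average power a source with duty cycle d delivers roughly a 1/d enhancement. The numbers below are purely illustrative.

```python
# Two-photon excitation scales with the square of instantaneous intensity,
# so the signal is proportional to the time average of I(t)**2.
# All values are illustrative, in arbitrary units.

def two_photon_signal(avg_power, duty_cycle):
    """Relative two-photon signal for a source at fixed average power.

    During a pulse the intensity is avg_power / duty_cycle; the source is
    on for a fraction duty_cycle of the time.
    """
    peak = avg_power / duty_cycle
    return duty_cycle * peak**2  # time average of intensity squared

cw = two_photon_signal(1.0, 1.0)        # continuous-wave source
pulsed = two_photon_signal(1.0, 1e-4)   # pulsed source, illustrative duty cycle
print(f"signal enhancement at fixed average power: {pulsed / cw:.0f}x")
```

This quadratic dependence is also what confines excitation to the focal volume, which accounts for the technique's optical sectioning and reduced photodamage.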

An entirely different class of imaging techniques, broadly referred to as near-field microscopy, has also made great strides in recent years. Steadily improving fiber-optic light sources and detectors have been the critical enabling technologies. Optical resolutions of 20 to 50 nm are achievable with ideal samples, dramatically breaching the wavelength limit on the resolution of traditional light microscopes. Nonetheless, near-field microscopy is difficult to apply in biology because of the irregular nature of biological materials. Despite these difficulties, Doyle et al. (2001) succeeded in imaging actin filaments in glial cells, and it is reasonable to expect further progress, based in part on improved computational techniques for extracting the desired signal from the noise in near-field data.

At still higher spatial resolution, many new techniques have been introduced for the structural analysis of biological macromolecules. Examples include high-field NMR, cryo-electron microscopy (cryo-EM) (Henderson, 2004; Carragher et al., 2004), time-resolved structural analysis based on physical and chemical trapping (Hajdu et al., 2000), small-angle scattering (Svergun et al., 2002), and total-internal-reflection fluorescence microscopy (Mashanov et al., 2003). Cryo-EM has achieved 0.4-nm resolution for two-dimensional crystals and may soon achieve that capability for single particles. One problem shared by all of these imaging methods is the lack of rigorous procedures for validating the reliability of the determined structures. Henderson (2004) emphasized this point, stating that the lack of such methods is “probably the greatest challenge facing cryo-EM.” The mathematical sciences have a clear role to play in addressing this challenge.

Hyperspectral imaging is the final example here of promising technologies that could be incorporated into many types of biological instrumentation. This technology involves measuring the optical response of a sample over an entire frequency range rather than at one, or a few, selected frequencies. In hyperspectral detectors, each pixel contains a spectrum with tens to thousands of measurements and allows for far more detailed characterization of a sample than could be obtained from data collected at a single frequency. Hyperspectral imaging is already being used for microscopy (Sinclair et al., 2004; Schultz et al., 2001), pathological studies (Davis et al., 2003), and microarray analysis (Sinclair et al., 2004; Schultz et al., 2001). Sinclair et al. (2004) recently developed a scanner with high spatial resolution that records an emission spectrum for each pixel over the range 490-900 nm at 3-nm intervals. These investigators used multivariate curve-resolution algorithms to distinguish between the emission spectra of the components of multiple samples. Further mathematical developments have the potential to enhance instrument design and performance for diverse applications. Similar comments apply to many aspects of imaging technology. Indeed, the committee believes that one of the important goals of the next decade in instrumentation should be to improve the quantitation achievable in all forms of biological imaging. Nearly all applications of the mathematical sciences to biology will be promoted by improved instrumentation that lowers the cost of acquiring reliable quantitative data and increases the collection rates.
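At the core of such curve-resolution methods is a per-pixel linear unmixing step. The sketch below, with two invented component spectra, solves the least-squares problem in closed form via the 2x2 normal equations; real algorithms handle many components, nonnegativity constraints, and noise.

```python
# Sketch of per-pixel spectral unmixing: model each measured spectrum as a
# mixture of two known component spectra and solve the 2x2 normal equations.
# All spectra here are invented for illustration.

def unmix(pixel, s1, s2):
    """Least-squares abundances (a1, a2) with pixel ~ a1*s1 + a2*s2."""
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    g11, g12, g22 = dot(s1, s1), dot(s1, s2), dot(s2, s2)
    b1, b2 = dot(pixel, s1), dot(pixel, s2)
    det = g11 * g22 - g12 * g12
    a1 = (b1 * g22 - b2 * g12) / det
    a2 = (b2 * g11 - b1 * g12) / det
    return a1, a2

# Two hypothetical emission spectra sampled at ten wavelengths.
green = [0.0, 0.1, 0.5, 1.0, 0.5, 0.1, 0.0, 0.0, 0.0, 0.0]
red   = [0.0, 0.0, 0.0, 0.0, 0.1, 0.4, 1.0, 0.6, 0.2, 0.0]

# A pixel containing 30% "green" dye and 70% "red" dye, noise-free.
pixel = [0.3 * g + 0.7 * r for g, r in zip(green, red)]
print(unmix(pixel, green, red))
```

Because the two spectra overlap, no single wavelength could separate the dyes; using the full spectrum at each pixel makes the decomposition well posed.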

**EPILOGUE**

This brief discussion of the role of the mathematical sciences in the development of instrumentation is a suitable note on which to conclude this report since it emphasizes the primacy of data in the interplay between mathematics and biology. Mathematical scientists, and the funding agencies that support them, should be encouraged to take an interest in the full cycle of experimental design, data acquisition, data processing, and data interpretation through which bioscientists are expanding their understanding of the living world. Applications of the mathematical sciences to biology are not yet so specialized as to make this breadth of view impractical. An illustrative case is that of Phil Green, whose training before an early-career switch to genetics was in pure mathematics. During the Human Genome Project, he made key contributions to problems at every level of genome analysis: the phred software package transformed large-scale DNA sequencing by attaching statistically valid quality measures to the raw base calls of automated sequencing instruments (Ewing and Green, 1998; Ewing et al., 1998); the phrap, consed, and autofinish software shepherded these base calls all the way to finished DNA sequence (Gordon et al., 1998; Gordon et al., 2001); then, in analyzing the sequence itself, Green contributed to problems as diverse as estimating the number of human genes (Ewing and Green, 2000), discovering the likely existence of a new DNA-repair process in germ cells (Green et al., 2003), and modeling sequence-context effects on mutation rates (Hwang and Green, 2004).
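The quality measures attached by phred follow the now-standard scale Q = -10 log10(p), where p is the estimated probability that a base call is wrong; the transform is simple enough to state exactly.

```python
import math

def phred(p_error):
    """Phred quality score for a base call with error probability p_error."""
    return -10 * math.log10(p_error)

def error_prob(q):
    """Inverse transform: error probability for quality q."""
    return 10 ** (-q / 10)

# A "Q20" base has a 1-in-100 chance of being wrong; Q30 is 1 in 1,000.
for p in (0.01, 0.001, 0.0001):
    print(f"P(error) = {p:<7} -> Q{phred(p):.0f}")
```

Because the scale is logarithmic, per-base quality values can be summed to give statistically meaningful error estimates for entire assembled sequences, which is what made phred scores so useful downstream.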

As this and many other stories emphasize, applications of the mathematical sciences to the biosciences span an immense conceptual range, even when one considers only one facet of the biological enterprise. No one scientist, mathematical or biological specialty, research program, or funding agency can span the entire range. Instead, the integration of diverse skills and perspectives must be the overriding goal. In this report, the committee seeks to encourage such integration by putting forward a set of broad principles that it regards as essential to the health of one of the most exciting and promising interdisciplinary frontiers in 21st century science.

**REFERENCES**

Adcock, C.J. 1997. Sample size determination: A review. *Statistician* 46(2): 261-283.

Alexandersson, M., S. Cawley, and L. Pachter. 2003. SLAM—Cross-species gene finding and alignment with a generalized pair hidden Markov model. *Genome Res.* 13(3): 496-502.

Alizadeh, A.A., M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson Jr., L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, and L.M. Staudt. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. *Nature* 403(6769): 503-511.

Alter, O., P.O. Brown, and D. Botstein. 2000. Singular value decomposition for genome-wide expression data processing and modeling. *Proc. Natl. Acad. Sci. U.S.A.* 97(18): 10101-10106.

Baldi, P., and A.D. Long. 2001. A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. *Bioinformatics* 17(6): 509-519.

Becquet, C., S. Blachon, B. Jeudy, J.-F. Boulicaut, and O. Gandrillon. 2002. Strong-association-rule mining for large-scale gene-expression data analysis: A case study on human SAGE data. *Genome Biol.* 3(12): Research0067.

Bittner, M., P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, and V. Sondak. 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. *Nature* 406(6795): 536-540.

Brunet, J.P., P. Tamayo, T.R. Golub, and J.P. Mesirov. 2004. Metagenes and molecular pattern discovery using matrix factorization. *Proc. Natl. Acad. Sci. U.S.A.* 101(12): 4164-4169.

Burge, C., and S. Karlin. 1997. Prediction of complete gene structures in human genomic DNA. *J. Mol. Biol.* 268(1): 78-94.

Califano, A. 2000. SPLASH: Structural pattern localization analysis by sequential histograms. *Bioinformatics* 16(4): 341-357.

Carragher, B., D. Fellmann, F. Guerra, R.A. Milligan, F. Mouche, J. Pulokas, B. Sheehan, J. Quispe, C. Suloway, Y. Zhu, and C.S. Potter. 2004. Rapid, routine structure determination of macromolecular assemblies using electron microscopy: Current progress and further challenges. *J. Synchrotron Rad.* 11: 83-85.

Cho, R.J., M.J. Campbell, E.A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T.G. Wolfsberg, A.E. Gabrielian, D. Landsman, D.J. Lockhart, and R.W. Davis. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. *Mol. Cell* 2: 65-73.

Churchill, G.A. 1989. Stochastic models for heterogeneous DNA sequences. *Bull. Math. Bio.* 51(1): 79-94.

Cortes, C., L.D. Jackel, and W.-P. Chiang. 1995. Limits on learning machine accuracy imposed by data quality. Pp. 57-62 in *Proceedings of the First International Conference on Knowledge Discovery and Data Mining*. U.M. Fayyad and R. Uthurusamy, eds. Montreal, Canada: AAAI Press.

Cortes, C., L.D. Jackel, S.A. Solla, V. Vapnik, and J.S. Denker. 1993. Learning curves: Asymptotic values and rate of convergence. Pp. 327-334 in *Advances in Neural Information Processing Systems*. NIPS’1993, Vol. 6. Denver, Colo.: Morgan Kaufmann.

Davis, G.L., M. Maggioni, R.R. Coifman, D.L. Rimm, and R.M. Levenson. 2003. Spectral/spatial analysis of colon carcinoma. *Modern Pathol.* 16(1): 320A-321A.

Doyle, R.T., M.J. Szulzcewski, and P.G. Haydon. 2001. Extraction of near-field fluorescence from composite signals to provide high resolution images of glial cells. *Biophys. J.* 80: 2477-2482.

Duda, R.O., P.E. Hart, and D.G. Stork. 2000. *Pattern Classification*. New York, N.Y.: John Wiley & Sons Ltd.

Dudoit, S., Y.H. Yang, M.J. Callow, and T.P. Speed. 2002. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. *Statistica Sinica* 12(1): 111-139.

Eisen, M.B., P.T. Spellman, P. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. *Proc. Natl. Acad. Sci. U.S.A.* 95(25): 14863-14868.

Evgeniou, T., M. Pontil, and T. Poggio. 2000. Regularization networks and support vector machines. *Adv. Comput. Math.* 13: 1-50.

Ewing, B., and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. *Genome Res.* 8(3): 186-194.

Ewing, B., and P. Green. 2000. Analysis of expressed sequence tags indicates 35,000 human genes. *Nat. Genet.* 25(2): 232-234.

Ewing, B., L. Hillier, M.C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. *Genome Res.* 8(3): 175-185.

Fraser, S.E. 2003. Crystal gazing in optical microscopy. *Nat. Biotechnol.* 21(11): 1272-1273.

Frenkel, D., and B. Smit. 1996. *Understanding Molecular Simulation: From Algorithms to Applications*. San Diego, Calif.: Academic Press.

Friedman, J.H. 1994. An overview of computational learning and function approximation. Pp. 1-61 in *From Statistics to Neural Networks. Theory and Pattern Recognition Applications*. V. Cherkassky, J.H. Friedman, and H. Wechsler, eds. Berlin: Springer-Verlag.

Friedman, N., M. Linial, I. Nachman, and D. Pe’er. 2000. Using Bayesian networks to analyze expression data. *J. Comput. Biol.* 7: 601-620.

Gelfand, A.E., and A.F.M. Smith. 1990. Sampling-based approaches to calculating marginal densities. *J. Am. Stat. Assoc.* 85: 398-409.

Geman, S., and D. Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. *IEEE T. Pattern Anal.* 6: 721-741.

Gilks, W.R., S. Richardson, and D.J. Spiegelhalter. 1996. *Markov Chain Monte Carlo in Practice*. London, England: Chapman and Hall.

Golub, T.R., D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. *Science* 286(5439): 531-537.

Gordon, D., C. Abajian, and P. Green. 1998. Consed: A graphical tool for sequence finishing. *Genome Res.* 8(3): 195-202.

Gordon, D., C. Desmarais, and P. Green. 2001. Automated finishing with autofinish. *Genome Res.* 11(4): 614-625.

Green, P., B. Ewing, W. Miller, P.J. Thomas, and E.D. Green. 2003. Transcription-associated mutational asymmetry in mammalian evolution. *Nat. Genet.* 33(4): 514-517.

Gribskov, M., A.D. McLachlan, and D. Eisenberg. 1987. Profile analysis: Detection of distantly related proteins. *Proc. Natl. Acad. Sci. U.S.A.* 84(13): 4355-4358.

Hajdu, J., R. Neutze, T. Sjögren, K. Edman, A. Szöke, R.C. Wilmouth, and C.M. Wilmot. 2000. Analyzing protein functions in four dimensions. *Nat. Struct. Biol.* 7(11): 1006-1012.

Hansmann, U.H.E., M. Masuya, and Y. Okamoto. 1997. Characteristic temperatures of folding of a small peptide. *Proc. Natl. Acad. Sci. U.S.A.* 94: 10652-10656.

Hao, M.-H., and H.A. Scheraga. 1998. Molecular mechanisms of cooperative folding of proteins. *J. Mol. Biol.* 277: 973-983.

Hastie, T., R. Tibshirani, and J. Friedman. 2001. *The Elements of Statistical Learning*. New York, N.Y.: Springer.

Henderson, R. 2004. Realizing the potential of electron cryo-microscopy. *Q. Rev. Biophys.* 37(1): 3-13.

Hwang, D.G., and P. Green. 2004. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. *Proc. Natl. Acad. Sci. U.S.A.* 101(39): 13994-14001.

Ideker, T., V. Thorsson, A.F. Siegel, and L.E. Hood. 2000. Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. *J. Comput. Biol.* 7(6): 805-817.

Kim, P.M., and B. Tidor. 2003. Subsystem identification through dimensionality reduction of large-scale gene expression data. *Genome Res.* 13(7): 1706-1718.

Kluger, Y., R. Basri, J.T. Chang, and M. Gerstein. 2003. Spectral biclustering of microarray data: Coclustering genes and conditions. *Genome Res.* 13(4): 703-716.

Korf, I., P. Flicek, D. Duan, and M.R. Brent. 2001. Integrating genomic homology into gene structure prediction. *Bioinformatics* 17(Suppl 1): S140-S148.

Krogh, A., M. Brown, I.S. Mian, K. Sjolander, and D. Haussler. 1994. Hidden Markov models in computational biology: Applications to protein modeling. *J. Mol. Biol.* 235(5): 1501-1531.

Kulp, D., D. Haussler, M.G. Reese, and F.H. Eeckman. 1996. A generalized Hidden Markov Model for the recognition of human genes in DNA. *Proc. Int. Conf. Intell. Syst. Mol. Biol.* 4: 134-142.

Lauritzen, S.L., and D.J. Spiegelhalter. 1988. Local computations with probabilities on graphical structures and their application to expert systems. *J. Roy. Stat. Soc. B* 50: 157-224.

Lawrence, C.E., S.F. Altschul, M.S. Boguski, A.F. Neuwald, and J.C. Wootton. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. *Science* 262: 208-214.

Lazzeroni, L., and A.B. Owen. 2002. Plaid models for gene expression data. *Stat. Sinica* 12(1): 61-86.

Lee, D.D., and H.S. Seung. 1999. Learning the parts of objects by non-negative matrix factorization. *Nature* 401(6755): 788-791.

Lee, M.L., F.C. Kuo, G.A. Whitmore, and J. Sklar. 2000. Importance of replication in microarray gene expression studies: Statistical methods and evidence from repetitive cDNA hybridizations. *Proc. Natl. Acad. Sci. U.S.A.* 97(18): 9834-9839.

Liu, J.S. 2001. *Monte Carlo Strategies in Scientific Computing*. New York, N.Y.: Springer-Verlag.

Mashanov, G.I., D. Tacon, A.E. Knight, M. Peckham, and J.E. Molloy. 2003. Visualizing single molecules inside living cells using total internal reflection fluorescence microscopy. *Methods* 29: 142-152.

Metropolis, N., A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. 1953. Equation of state calculations by fast computing machines. *J. Chem. Phys.* 21: 1087-1091.

Meyer, I.M., and R. Durbin. 2002. Comparative ab initio prediction of gene structures using pair HMMs. *Bioinformatics* 18(10): 1309-1318.

Minsky, M., and S. Papert. 1988. *Perceptrons: An Introduction to Computational Geometry*. Cambridge, Mass.: MIT Press.

Mootha, V.K., C.M. Lindgren, K.F. Eriksson, A. Subramanian, S. Sihag, J. Lehar, P. Puigserver, E. Carlsson, M. Ridderstrale, E. Laurila, N. Houstis, M.J. Daly, N. Patterson, J.P. Mesirov, T.R. Golub, P. Tamayo, B. Spiegelman, E.S. Lander, J.N. Hirschhorn, D. Altshuler, and L.C. Groop. 2003. PGC-1 alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. *Nat. Genet.* 34(3): 267-273.

Mukherjee, S., P. Tamayo, S. Rogers, R. Rifkin, A. Engle, C. Campbell, T.R. Golub, and J.P. Mesirov. 2003. Estimating dataset size requirements for classifying DNA microarray data. *J. Comput. Biol.* 10(2): 119-142.

Murali, T.M., and S. Kasif. 2003. Extracting conserved gene expression motifs from gene expression data. Pp. 77-88 in *Pacific Symposium on Biocomputing 2003*. Singapore: World Scientific.

*Nature Genetics Supplement* 21. 1999.

*Nature Genetics Supplement* 32. 2002.

Newton, M.A., C.M. Kendziorski, C.S. Richmond, F.R. Blattner, and K.W. Tsui. 2001. On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. *J. Comput. Biol.* 8(1): 37-52.

Perou, C.M., S.S. Jeffrey, M. van de Rijn, C.A. Rees, M.B. Eisen, D.T. Ross, A. Pergamenschikov, C.F. Williams, S.X. Zhu, J.C. Lee, D. Lashkari, D. Shalon, P.O. Brown, and D. Botstein. 1999. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. *Proc. Natl. Acad. Sci. U.S.A.* 96(16): 9212-9217.

Perou, C.M., T. Sorlie, M.B. Eisen, M. van de Rijn, S.S. Jeffrey, C.A. Rees, J.R. Pollack, D.T. Ross, H. Johnsen, L.A. Akslen, O. Fluge, A. Pergamenschikov, C. Williams, S.X. Zhu, P.E. Lonning, A.L. Borresen-Dale, P.O. Brown, and D. Botstein. 2000. Molecular portraits of human breast tumours. *Nature* 406(6797): 747-752.

Pomeroy, S.L., P. Tamayo, M. Gaasenbeek, L.M. Sturla, M. Angelo, M.E. McLaughlin, J.Y. Kim, L.C. Goumnerova, P.M. Black, C. Lau, J.C. Allen, D. Zagzag, J.M. Olson, T. Curran, C. Wetmore, J.A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D.N. Louis, J.P. Mesirov, E.S. Lander, and T.R. Golub. 2002. Prediction of central nervous system embryonal tumour outcome based on gene expression. *Nature* 415(6870): 436-442.

Rosenblatt, F. 1962. *Principles of Neurodynamics*. New York, N.Y.: Spartan Books.

Roweis, S.T., and L.K. Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. *Science* 290(5500): 2323-2326.

Schultz, R.A., T. Nielsen, J.R. Zavaleta, R. Ruch, R. Wyatt, and H.R. Garner. 2001. Hyperspectral imaging: A novel approach for microscopic analysis. *Cytometry* 43(4): 239-247.

Searls, D.B. 1992. The linguistics of DNA. *Am. Sci.* 80: 579-591.

Sela, M., F.H. White Jr., and C.B. Anfinsen. 1957. Reductive cleavage of disulfide bridges in ribonuclease. *Science* 125: 691-692.

Shipp, M.A., K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, T.S. Ray, M.A. Koval, K.W. Last, A. Norton, T.A. Lister, J. Mesirov, D.S. Neuberg, E.S. Lander, J.C. Aster, and T.R. Golub. 2002. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. *Nat. Med.* 8(1): 68-74.

Sinclair, M.B., J.A. Timlin, D.M. Haaland, and M. Werner-Washburne. 2004. Design, construction, characterization, and application of a hyperspectral microarray scanner. *Appl. Optics* 43 (10): 2079-2088.

Slonim, D., P. Tamayo, J.P. Mesirov, T.R. Golub, and E.S. Lander. 2000. Class prediction and discovery using gene expression data. Pp. 263-272 in *Proceedings of the Fourth Annual International Conference on Computational Molecular Biology*. New York, N.Y.: ACM Press.

So, P.T.C., C.Y. Dong, B.R. Masters, and K.M. Berland. 2000. Two-photon excitation fluorescence microscopy. *Ann. Rev. Biomed. Eng.* 2: 399-429.

Staunton, J.E., D.K. Slonim, H.A. Coller, P. Tamayo, M.J. Angelo, J. Park, U. Scherf, J.K. Lee, W.O. Reinhold, J.N. Weinstein, J.P. Mesirov, E.S. Lander, and T.R. Golub. 2001. Chemosensitivity prediction by transcriptional profiling. *Proc. Natl. Acad. Sci. U.S.A.* 98(19): 10787-10792.

Stormo, G.D., and D. Haussler. 1994. Optimally parsing a sequence into different classes based on multiple types of evidence. Pp. 369-375 in *Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology*. Vol. 2. R. Altman, D. Brutlag, P. Karp, R. Lathrop, and D. Searls, eds. Menlo Park, Calif.: AAAI Press.

Sumners, D. 1995. Lifting the curtain: Using topology to probe the hidden action of enzymes. *Notices of the AMS* 42: 528-537.

Svergun, D.I., and M.H.J. Koch. 2002. Advances in structure analysis using small-angle scattering in solution. *Curr. Opin. Struct. Biol.* 12: 654-660.

Tamayo, P., D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. *Proc. Natl. Acad. Sci. U.S.A.* 96(6): 2907-2912.

Tanner, M.A. 1996. *Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions*, 3rd ed. New York, N.Y.: Springer-Verlag.

Tanner, M.A., and W.H. Wong. 1987. The calculation of posterior distributions by data augmentation (with discussion). *J. Am. Stat. Assoc.* 82: 528-550.

Tenenbaum, J.B., V. de Silva, and J.C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. *Science* 290(5500): 2319-2323.

Tusher, V.G., R. Tibshirani, and G. Chu. 2001. Significance analysis of microarrays applied to the ionizing radiation response. *Proc. Natl. Acad. Sci. U.S.A.* 98(9): 5116-5121.

Vapnik, V. 1998. *Statistical Learning Theory*. New York, N.Y.: John Wiley & Sons Ltd.

von Dassow, G., E. Meir, E.M. Munro, and G.M. Odell. 2000. The segment polarity network is a robust developmental module. *Nature* 406: 188-192.

Winfree, A.T. 1983. Sudden cardiac death, a problem in topology. *Sci. Am.* 248: 114-161.