it is not necessary. ESTs have already proved useful in human and mouse studies, as in the serum study mentioned above (Iyer et al. 1999). If expression of a particular sequence, known only by its EST, is found to change greatly in the test condition, it might then qualify as interesting enough to deserve full-length cloning, sequencing, and further analysis of function.
Vast amounts of data accumulate in such comparisons (e.g., when 6,200 yeast genes [the entire yeast genome] or 8,900 human sequences are expressed differently under several conditions at several time points). The multidimensional data sets have challenged applied mathematicians to find ways of presenting them that are useful to biologists (e.g., see the methods of two-dimensional clustering analysis in Eisen et al. 1998; Alon et al. 1999; Tamayo et al. 1999). Still larger data sets loom on the horizon (e.g., the expression of perhaps 100,000 mouse or human genes at all times and places in development, not to mention under different toxicant exposures). As described below, the demand is great for managers and analysts of these data sets.
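To make the idea of two-dimensional clustering concrete, the following is a minimal sketch and not the procedure of any of the cited papers. It assumes a small matrix of expression values (rows are genes, columns are conditions), uses Pearson correlation as the similarity measure, and merges clusters by average linkage. The function names and the four-by-four data values are hypothetical and chosen only for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def cluster_order(profiles):
    """Average-linkage agglomerative clustering.

    Returns the profile indices in the order produced by repeatedly
    merging the two most similar clusters, so similar profiles end up
    next to each other.
    """
    clusters = [[i] for i in range(len(profiles))]

    def similarity(a, b):
        # Mean pairwise correlation between members of clusters a and b.
        total = sum(pearson(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = similarity(clusters[p], clusters[q])
                if best is None or s > best[0]:
                    best = (s, p, q)
        _, p, q = best
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters[0]

def two_way_order(matrix):
    """Cluster the rows (genes) and the columns (conditions) independently."""
    row_order = cluster_order(matrix)
    columns = [list(col) for col in zip(*matrix)]
    col_order = cluster_order(columns)
    return row_order, col_order

# Hypothetical expression values: 4 genes (rows) x 4 conditions (columns).
expression = [
    [2.0, 1.8, -1.0, -1.2],
    [1.9, 2.1, -0.8, -1.1],
    [-1.5, -1.7, 1.0, 1.3],
    [-1.4, -1.6, 1.2, 0.9],
]

row_order, col_order = two_way_order(expression)
print("gene order:", row_order)
print("condition order:", col_order)
```

Reordering the rows and columns of the matrix by the two returned orders places genes with similar profiles, and conditions with similar responses, side by side. This brute-force version recomputes every similarity at each merge, so it is suitable only for toy inputs; data sets of the size discussed here would call for an optimized library routine.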
Although the various microarray techniques promise to reveal exciting new information about where, when, and under what conditions the genes of the genome are transcribed, this approach will not provide information concerning the translation and post-translational modification of proteins encoded by these mRNAs—that is, information about when and where the proteins are present and active. Protein function is almost always the immediate cause of cell function. To provide such functional information is the goal of proteomics. Proteomics has been defined (Anderson and Anderson 1998) as “the use of quantitative protein-level measurements of gene expression to characterize biological processes (e.g., disease processes and drug effects) and decipher the mechanisms of gene expression control.” At the core of proteomics is the Human Protein Index—that is, the systematic identification of all human proteins (Anderson and Anderson 1982) using high-throughput, high-resolution, two-dimensional (2D)-gel electrophoresis to generate a gel with as many as 1,000 separate protein spots on it. A large amount of 2D-gel information is stored in the Proteomics database (see Appendix B for the Internet address). Plans have been made to identify every protein spot on a 2D gel, because the nanogram amount of protein in a spot is sufficient to determine a partial amino acid sequence by tandem mass spectrometry (Yates 1998). The partial sequence can then be looked up in the genome database and the protein identified. When proteins are modified by phosphorylation, acylation, glycosylation, farnesylation, limited proteolysis, or any of the other 30 or so covalent post-translational alterations, their migration on a 2D gel changes, thus allowing a correlation to be made with their activity or inactivity in the tissue. Such activity information cannot be gained from DNA microarray measurements of mRNA amounts.
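The lookup step described above, in which a partial amino acid sequence is matched against known sequences, can be illustrated with a deliberately simplified sketch. It assumes the database is just a dictionary of protein names and one-letter amino acid sequences and that the match is an exact substring search. Actual identification from tandem mass spectra is considerably more involved. The function name, protein names, and sequences below are all hypothetical.

```python
def find_proteins(peptide, database):
    """Return the names of all proteins whose sequence contains the peptide.

    Exact, case-insensitive substring matching only; this toy does not
    model mass tolerances, modifications, or translated genomic DNA.
    """
    query = peptide.upper()
    return sorted(
        name for name, sequence in database.items()
        if query in sequence.upper()
    )

# Hypothetical protein sequences, invented for this example.
toy_database = {
    "protein_A": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "protein_B": "MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGR",
    "protein_C": "MKVLAAGIVGLCAKQRQISFVLLA",
}

# A short tag may match more than one protein; a longer one narrows the hit.
print(find_proteins("KQRQISFV", toy_database))
print(find_proteins("KQRQISFVK", toy_database))
```

The two calls show why the length of the partial sequence matters: the shorter tag occurs in two of the toy proteins, whereas extending it by one residue leaves a single candidate.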
Finally, many proteins are required to associate with other proteins in order to achieve activity, and there are various nondenaturing gel-electrophoresis methods to detect such associations. Current and future efforts can be expected to