2
Toxicogenomic Technologies

All the information needed for life is encoded in an organism’s genome. In humans, approximately 25,000 genes encode protein products, which carry out diverse biologic functions. The DNA segments that compose genes are transcribed to create messenger RNAs (mRNAs), each of which in turn is translated into protein (Figure 2-1). Other DNA sequences encode short RNA molecules that regulate the expression of genes and the stability and processing of mRNA and proteins. The integration of genomics, transcriptomics, proteomics, and metabolomics provides unprecedented opportunities to understand biologic networks that control responses to environmental insults. Toxicogenomics uses these new technologies to analyze genes, genetic polymorphisms, mRNA transcripts, proteins, and metabolites. The foundation of this field is the rapid sequencing of the human genome and the genomes of dozens of other organisms, including animals used as models in toxicology studies. Whereas the human genome project was a multiyear effort of a consortium of laboratories, new rapid, high-density, and high-efficiency sequencing approaches allow a single laboratory to sequence genomes in days or weeks. Gene sequencing technologies have also enabled rapid analysis of sequence variation in individual genes, which underlies the diversity in responses to chemicals and other environmental factors.

Perhaps the most emblematic technology of the post-genomic era is the microarray, which makes it possible to simultaneously analyze many elements of a complex system. The integration of microarray methods with the polymerase chain reaction (PCR) has made the analysis of mRNAs the first and most technologically comprehensive of all the -omics technologies. Rapid advances in mass spectrometry (MS) and nuclear magnetic resonance (NMR) have driven the development of proteomics and metabolomics, which complement transcriptomics. In contrast to microarrays for transcriptomics, technologies for proteomics and metabolomics are limited by the huge range of analyte concentrations involved (at least six orders of magnitude), because all existing instrumentation favors the detection of more abundant species over those that are less abundant. This fundamental problem limits essentially all proteomic and metabolomic analyses to subsets of the complete collections of proteins and metabolites, respectively. Despite this limitation, proteomic and metabolomic approaches have fundamentally advanced the understanding of the mechanisms of toxicity and adaptation to stress and injury.

[Figure levels: DNA (genome: genes, intergenic sequences); RNA (transcriptome: mRNAs, regulatory RNAs); protein (proteome: proteins, modified proteins); metabolites (metabonome: endogenous compounds, exogenous compounds).]
FIGURE 2-1 Hierarchical relationships of DNA, RNA, proteins, and metabolites.

The core technologies for toxicogenomics are evolving rapidly, and this is making toxicogenomic approaches increasingly powerful and cost-effective. Several key aspects of technology development will drive toxicogenomics in the next decade:

1. New sequencing technologies offer the prospect of cost-effective individual whole-genome sequencing and comprehensive genotype analysis.
2. Array-based whole-genome scanning for variations in individual genes, known as single nucleotide polymorphisms (SNPs), will dramatically increase throughput for genotyping in population studies.
3. Advances in NMR and MS instrumentation will enable high-sensitivity analyses of complex collections of metabolites and proteins and quantitative metabolomics and proteomics.
4. New bioinformatic tools, database resources, and statistical methods will integrate data across technology platforms and link phenotypes and toxicogenomic data.

The interplay of technology development and application continues to drive the evolution of toxicogenomic technologies in parallel with advances in their applications to other fields of biology and medicine. This dynamic will extend the longstanding paradigm of toxicology as a means to understand both fundamental physiology and environmentally induced disease.

The basic types of technologies relevant to this report are described in this chapter. However, it is not feasible for the report to describe comprehensively the various tools and developments in this rapidly evolving field.

GENOMIC TECHNOLOGIES

The dramatic evolution of rapid DNA sequencing technologies during the 1980s and 1990s culminated in sequences of the human genome and the genomes of dozens of other organisms, including those used as animal models for chemical toxicity. Complete genome sequences provide catalogs of genes, their locations, and their chromosomal context. They serve as general reference maps but do not yet reflect individual variations, including SNPs and groups of SNP alleles (haplotypes), that account for the individual genetic variation underlying different responses to toxic chemical and environmental factors.

This section describes technologies for sequencing, for analyzing genotype variation, and for analyzing epigenetic modifications. As technology continues to advance, so will the analyses of genome variations and their roles in humans’ different responses to chemicals and other environmental factors.

Genome Sequencing Technologies

High-throughput gene sequencing began when Sanger and colleagues introduced dideoxynucleotide sequencing in 1977 (Sanger et al. 1977). Technical innovations over the next 18 years led to the development of current-generation automated instruments, which use fluorescent tagging and capillary electrophoresis to sequence up to 1.6 million base pairs (bp) per day (Chan 2005). These instruments have become the principal engines of genome sequencing projects. Despite the impressive advances in technology, the cost of genome sequencing remains a barrier to the application of whole-genome sequencing to study individual variation in toxic responses and susceptibility. Current costs for whole-genome sequencing in a modern genome sequencing center using Sanger sequencing technology were estimated at approximately $0.004 per bp, which translates to approximately $12M for an entire human genome (Chan 2005).

Fundamental technology limitations of automated Sanger sequencing have spawned the development of new technologies that promise significant cost and throughput improvements. A cycle extension approach using a highly parallel picoliter reactor system and a pyrosequencing protocol sequenced 25 million bp at 99% accuracy in a 4-h analysis (Margulies et al. 2005).
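The cost and throughput figures quoted above can be checked with simple arithmetic. The sketch below is only a back-of-envelope illustration: the genome size of about 3 billion bp is an assumption not stated in the text, the function and constant names are hypothetical, and the throughput comparison assumes that 4-h runs could be repeated back to back for a full day.

```python
# Back-of-envelope check of the sequencing figures quoted in the text.
# GENOME_SIZE_BP is an assumption (about 3 billion bp for a human genome);
# the remaining numbers are taken from the passage above.

GENOME_SIZE_BP = 3_000_000_000   # assumed human genome size in base pairs
SANGER_COST_PER_BP = 0.004       # US dollars per bp (Chan 2005)
SANGER_BP_PER_DAY = 1_600_000    # upper bound for capillary instruments
PYRO_BP_PER_RUN = 25_000_000     # Margulies et al. 2005
PYRO_RUN_HOURS = 4


def whole_genome_cost(genome_bp: int, cost_per_bp: float) -> float:
    """Cost of sequencing genome_bp bases at a flat per-base price."""
    return genome_bp * cost_per_bp


def bp_per_day(bp_per_run: int, run_hours: float) -> float:
    """Daily throughput if runs are repeated back to back for 24 hours."""
    return bp_per_run * 24 / run_hours


sanger_genome_cost = whole_genome_cost(GENOME_SIZE_BP, SANGER_COST_PER_BP)
pyro_bp_per_day = bp_per_day(PYRO_BP_PER_RUN, PYRO_RUN_HOURS)
throughput_ratio = pyro_bp_per_day / SANGER_BP_PER_DAY

print(f"Sanger whole-genome cost: ${sanger_genome_cost / 1e6:.0f}M")
print(f"Pyrosequencing throughput: {pyro_bp_per_day / 1e6:.0f} million bp/day")
print(f"Throughput ratio: about {throughput_ratio:.0f}x")
```

Under these assumptions the per-base price reproduces the quoted whole-genome cost, and the throughput ratio comes out close to two orders of magnitude.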

This pyrosequencing approach represents a 100-fold increase in throughput over the best available Sanger sequencing instrumentation and is now marketed commercially (454 Life Sciences 2006). Church and colleagues (Shendure et al. 2005) recently reported the use of a microscopy-based sequencing approach that employs “off-the-shelf” reagents and equipment and is approximately one-ninth as costly as high-speed Sanger sequencing. Nanotechnology approaches may yield even more efficient DNA sequencing instrumentation, but these approaches have not yet been implemented for high-throughput sequencing in a research laboratory setting (Chan et al. 2004; Chan 2005).

The dominant genome sequencing strategy to emerge from the human genome project is the whole-genome “shotgun” sequencing approach (Venter et al. 1996), which entails breaking chromosomal DNA into fragments of 500-1,000 DNA bases. The fragments are then subjected to automated, repetitive sequencing, and subsequent data analyses reassemble the fragment sequences into their original chromosomal context. With this technique, most DNA bases on a chromosome are sequenced three to seven times, resulting in cost- and time-efficient generation of high-fidelity sequence information.

Genotype Analysis

Analysis of genetic variation between individuals involves first the discovery of SNPs and then the analysis of these variations in populations. SNPs occur approximately every 2 kilobases in the human genome and have been commonly detected by automated Sanger sequencing as discussed above. The Human DNA Polymorphism Discovery Program of the National Institute of Environmental Health Sciences Environmental Genome Project (EGP) is one example of the application of automated DNA sequencing technologies to identify SNPs in human genes (Livingston et al. 2004). The EGP selected 293 “candidate genes” for sequencing based on their known or anticipated roles in the metabolism of xenobiotics. The major limitation of this candidate gene approach is that it enables discovery of polymorphisms only in those genes targeted for analysis.

Analysis of SNPs at the population level requires high-throughput, cost-effective tools to analyze specific SNPs. The analysis generally involves PCR-based amplification of a target gene and allele-specific detection. The dominant approaches in current use are homogeneous systems using TaqMan (Livak 1999) and molecular beacons (Marras et al. 1999), which offer high throughput in a multiwell format. However, these systems are not multiplexed (multiple SNPs cannot be analyzed simultaneously in the same sample), and this ultimately limits throughput. Other approaches may provide the ability to analyze multiple SNPs simultaneously. For example, matrix-assisted laser desorption ionization MS (MALDI-MS) has been used to simultaneously analyze multiple products of primer-extension reactions (Gut 2004). A more highly multiplexed approach is the combination of tag-based array technology with primer extension for genotyping (Phillips et al. 2003; Zhang et al. 2004).

The ultimate approach for SNP discovery is individual whole-genome sequencing. Although not yet feasible, rapidly advancing technology makes this approach the logical objective for SNP discovery in the future. In the meantime, rapid advances in microarray-based SNP analysis technology have redefined the scope of SNP discovery, mapping, and genotyping. New microarray-based genotyping technology enables “whole-genome association” analyses of SNPs between individuals or between strains of laboratory animals (Syvanen 2005). In contrast to the candidate gene approach described above, whole-genome association identifies hundreds to thousands of SNPs in multiple genes. Arrays used for these analyses can represent a million or more polymorphic sequences mapped across a genome (Gunderson et al. 2005; Hinds et al. 2005; Klein et al. 2005). This approach makes it possible to identify SNPs associated with disease and susceptibility to toxic insult. As more SNPs are identified and incorporated into microarrays, the approach samples a greater fraction of the genome and becomes increasingly powerful. The strength of this technology is that it puts a massive amount of easily measurable genetic variation in the hands of researchers in a cost-effective manner ($500-$1,000 per chip). Criteria for including the selected SNPs on the arrays are a critical consideration, as these affect inferences that can be drawn from the use of these platforms.

Epigenetics

Analysis of genes and genomes is not strictly confined to sequences, as modifications of DNA also influence the expression of genes. Epigenetics refers to the study of reversible heritable changes in gene function that occur without a change in the sequence of nuclear DNA. These changes are major regulators of gene expression.
Differences in gene expression due to epigenetic factors are increasingly recognized as an important basis for individual variation in susceptibility and disease (Scarano et al. 2005), as discussed in Chapter 6.

Epigenetic phenomena include DNA methylation, imprinting, and histone modifications. Of all the different types of epigenetic modifications, DNA methylation is the most easily measured and amenable to the efficient analysis characteristic of toxicogenomic technologies. DNA methylation refers to the addition, by the enzyme DNA methyltransferase, of a methyl group to the 5-carbon of cytosine in an area of DNA where there are many cytosines and guanines (a CpG island). Many different methods for analyzing DNA methylation have been developed and fall into two main types: global and gene-specific. Global methylation analysis methods measure the overall level of methyl cytosines in the genome using chromatographic methods (Ramsahoye 2002) or methyl accepting capacity assays. Gene-specific methylation analysis methods originally used methylation-sensitive restriction enzymes to digest DNA before it was analyzed by Southern blot analysis or PCR amplification. Sites that were methylated were identified by their resistance to the enzymes. Recently, methylation-sensitive primers or bisulfite conversion of unmethylated cytosine to uracil have been used in methods such as methylation-specific PCR and bisulfite genomic sequencing. These methods give a precise map of the pattern of DNA methylation in a particular genomic region or gene and are fast and cost-effective (e.g., Yang et al. 2006).

To identify unknown methylation hot-spots within a larger genomic context, techniques such as Restriction Landmark Genomic Scanning for Methylation (Ando and Hayashizaki 2006) and CpG island microarrays (e.g., Y. Wang et al. 2005a; Bibikova et al. 2006; Hayashi et al. 2007) have also been developed. Restriction landmark genomic scanning (RLGS) is a method to detect large numbers of methylated cytosine sites in a single experiment using direct end-labeling of the genomic DNA digested with a restriction enzyme and separated by high-resolution two-dimensional electrophoresis.

Several array-based methods have also been developed recently. Bibikova et al. (2006) developed a rapid method for analyzing the methylation status of hundreds of preselected genes simultaneously through an adaptation of a genotyping assay. For example, the methylation state of 1536 specific CpG sites in 371 genes was measured in a single reaction by multiplexed genotyping of bisulfite-treated genomic DNA. The efficient nature of this quantitative assay could be useful for DNA methylation studies in large epidemiologic samples. Hayashi et al. (2007) describe a method for analysis of DNA methylation using oligonucleotide microarrays that involves separating methylated DNA immunoprecipitated with anti-methylcytosine antibodies.
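The logic behind bisulfite-based methylation mapping can be illustrated with a small sketch. It is a toy model, not any of the published assays cited above: the sequences and function names are hypothetical, conversion is assumed to be complete, only one strand is considered, and the read is assumed to be error-free and already aligned to the reference.

```python
def bisulfite_convert(sequence: str, methylated_positions: set[int]) -> str:
    """Simulate bisulfite treatment followed by amplification of one strand.

    Unmethylated C is deaminated to U and read as T after amplification;
    methylated C is protected and is still read as C.
    """
    converted = []
    for index, base in enumerate(sequence):
        if base == "C" and index not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_cpg_methylation(reference: str, converted_read: str) -> dict[int, bool]:
    """Map the position of the C in each reference CpG to True if methylated."""
    if len(reference) != len(converted_read):
        raise ValueError("reference and read must be aligned and equal in length")
    calls = {}
    for index in range(len(reference) - 1):
        if reference[index:index + 2] == "CG":
            # A C that survived conversion must have been methylated.
            calls[index] = converted_read[index] == "C"
    return calls


reference = "ATCGGACGTTCGA"   # hypothetical sequence with CpGs at 2, 6, and 10
truly_methylated = {2, 10}    # hypothetical methylation state
read = bisulfite_convert(reference, truly_methylated)
calls = call_cpg_methylation(reference, read)
print(read)
print(calls)
```

Comparing the converted read with the untreated reference is what turns a chemical modification into a sequence difference that sequencing, PCR primers, or genotyping-style arrays can detect.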
Most of these gene-specific methods work consistently at various genomic locations (that is, they have minimal bias by genomic location) and will be useful for high-resolution analysis of the epigenetic modifications to the genome in toxicogenomic studies.

TRANSCRIPTOMIC TECHNOLOGIES

Transcriptomics describes the global measurement of mRNA transcripts in a biologic system. This collection of mRNA transcripts represents the transcription of all genes at a point in time. Technologies that allow the simultaneous analysis of thousands of transcripts have made it possible to analyze transcriptomes.

Technologic Approaches

Technologies for assaying gene, protein, and metabolic expression profiles are not new inventions. Measurements of gene expression have evolved from the single measures of steady-state mRNA using Northern blot analysis to the more global analysis of thousands of genes using DNA microarrays and serial analysis of gene expression (SAGE), the two dominant technologies (Figure 2-2). The advantage of global approaches is the ability of a single investigation to query the behavior of hundreds, thousands, or tens of thousands of biologic molecules in a single assay. For example, in profiling gene expression, one might use technologies such as Northern blot analysis to look at expression of a single gene. Quantitative real-time reverse transcriptase PCR (qRT-PCR), often used with subtractive cloning or differential display, can easily be used to study the expression of 10 or more genes.

Techniques such as SAGE allow the entire collection of transcripts to be catalogued without assumptions about what is actually expressed (unlike microarrays, where one needs to select probes from a catalogue of genes). SAGE is a technology based on sequencing strings of short expressed sequence tags representing both the identity and the frequency of occurrence of specific sequences within the transcriptome. SAGE is costly and relatively low throughput, because each sample to be analyzed requires a SAGE Tag library to be constructed and sequenced. Massively parallel signature sequencing speeds up the SAGE process with a bead-based approach that simultaneously sequences multiple tags, but it is costly.

DNA microarray technology can be used to generate large amounts of data at moderate cost but is limited to surveys of genes that are included in the microarray. In this technology (see Box 2-1 and Figure 2-3), a solid matrix surface supports thousands of different, surface-bound DNAs, which are hybridized against a pool of RNA to measure gene expression. A systematic comparison indicates that gene expression measured by oligonucleotide microarrays correlates well with SAGE in transcriptional profiling, particularly for genes expressed at high levels (Kim 2003).
Gene Expression Analysis Methods (listed in order of increasing data density, from 1 gene to 10^4 genes)

Method: Comments
Northern blot: Standard procedure; low throughput
Subtractive cloning: Not always comprehensive
Differential display: Follow-up full-length cloning required; potential to identify rare mRNAs
EST/SAGE: “Expensive” and requires a dedicated sequencing facility
Gridded filters: Cannot multiplex probes derived from two different tissue samples
High-density arrays: Identification of differentially expressed genes dependent on arrayed elements

FIGURE 2-2 Overview of commonly used methods and technologies for gene expression analysis.
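The idea that SAGE tags carry both the identity and the frequency of transcripts can be sketched as a simple counting exercise. The tag sequences, the tag-to-gene lookup, and the function name below are all hypothetical, and real SAGE analysis involves additional steps (tag extraction from concatenated clones, error handling, curated annotation) that are not shown.

```python
from collections import Counter


def tag_frequencies(tags: list[str]) -> dict[str, float]:
    """Convert raw tag observations to tags per million sequenced tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count * 1_000_000 / total for tag, count in counts.items()}


# Hypothetical tag-to-gene lookup; tags missing from it are still counted.
TAG_TO_GENE = {
    "AAACCTGGTA": "gene_A",
    "TTTGGCACGA": "gene_B",
}

# Hypothetical library of ten sequenced tags.
observed_tags = ["AAACCTGGTA"] * 6 + ["TTTGGCACGA"] * 3 + ["CCCAAAGTTC"]
frequencies = tag_frequencies(observed_tags)

for tag, per_million in sorted(frequencies.items(), key=lambda item: -item[1]):
    gene = TAG_TO_GENE.get(tag, "unannotated")
    print(f"{tag}\t{gene}\t{per_million:.0f} tags per million")
```

The unannotated tag is reported alongside the known ones, which mirrors the point made above: a sequencing-based catalogue does not depend on a predefined probe set, whereas a microarray can only report on the genes it contains.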

BOX 2-1 Experimental Details of Transcriptome Profiling with Microarrays

mRNA extracted from cell or tissue samples is prepared for microarray analysis by PCR-based amplification (Hardiman 2004). A fluorescent dye (or biotin for Affymetrix microarrays) is incorporated into the amplified RNA sequences. Two-color arrays involve fluorescently labeling paired samples (control versus experimental) with different dyes (see Figure 2-3). The amplified, labeled sequences, termed “targets,” are then hybridized to the microarrays.

After hybridization and washing, the arrays are imaged with a confocal laser scanner, and the relative fluorescence intensity (from the dye or from streptavidin-conjugated phycoerythrin) for each gene-specific probe represents the expression level for that gene. The actual value reported depends on the microarray technology platform used and the experimental design. For Affymetrix GeneChips, in which each sample is hybridized to an individual array, expression for each gene is measured as an “average difference” that represents an estimated expression level, less nonspecific background. For two-color arrays, assays typically compare paired samples and report expression as the logarithm of the ratio of the experimental sample to the control sample.

Regardless of the approach or technology, the fundamental data used in all subsequent analyses are the expression measures for each gene in each experiment. These expression data are typically represented as an “expression matrix” in which each row represents a particular gene and each column represents a specific biologic sample (Figure 2-3). In this representation, each row is a “gene expression vector,” where the individual entries are its expression levels in the samples assayed, and each column is a “sample expression vector” that records the expression of all genes in that sample.
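The expression matrix described above can be made concrete with a minimal sketch for two-color data. The intensities, gene names, and sample names are hypothetical, and the use of base-2 logarithms is an assumption (the text specifies only "the logarithm of the ratio").

```python
import math

# Hypothetical two-color intensities: (experimental, control) per sample.
raw_intensities = {
    "gene_A": [(800.0, 200.0), (400.0, 400.0), (100.0, 400.0)],
    "gene_B": [(150.0, 300.0), (600.0, 300.0), (300.0, 300.0)],
}
sample_names = ["sample_1", "sample_2", "sample_3"]


def log2_ratio(experimental: float, control: float) -> float:
    """Log ratio of experimental to control signal (base 2 is an assumption)."""
    return math.log2(experimental / control)


# Expression matrix: one row per gene, one column per biologic sample.
genes = sorted(raw_intensities)
expression_matrix = [
    [log2_ratio(exp, ctl) for exp, ctl in raw_intensities[gene]]
    for gene in genes
]


def gene_expression_vector(gene: str) -> list[float]:
    """Row of the matrix: one gene's expression across all samples."""
    return expression_matrix[genes.index(gene)]


def sample_expression_vector(sample: str) -> list[float]:
    """Column of the matrix: all genes' expression in one sample."""
    column = sample_names.index(sample)
    return [row[column] for row in expression_matrix]


print(gene_expression_vector("gene_A"))
print(sample_expression_vector("sample_1"))
```

With a log ratio, equal-sized increases and decreases relative to the control are symmetric around zero, which is one reason this representation is convenient for downstream analysis.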
The data are normalized to compensate for differences in labeling, hybridization, and detection efficiencies. Approaches to data normalization depend on the platform and the assumptions made about biases in the data (Brazma et al. 2001; Schadt et al. 2001; I.V. Yang et al. 2002; Y.H. Yang et al. 2002; Sidransky et al. 2003). Filtering transformations are often applied to the data by using statistical approaches that, for example, eliminate genes that have minimal variance across the collection of samples or those that fail to provide data in most of the experiments. These filtering transformations reduce dataset complexity by eliminating genes unlikely to contribute to the experimental goal.

The choice of normalization and filtering transformations can have a profound effect on the results (Hoffmann et al. 2002). Normalization adjusts the fluorescence intensities on each array and therefore can change the relative difference observed among samples. Normalization is generally necessary to compensate for systematic errors introduced during measurement, but overnormalizing can distort the data. Similarly, different methods of data filtering can produce very different results. All statistical tests that are applied rely on assumptions about the nature of the variance in the measurements. Different statistical tests applied to the very same dataset can often produce different (but generally overlapping) sets of significant genes. Dealing with these “high-dimensional” datasets in which there are often more measurements (genes) than samples is an area of active research and debate.
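To make the normalization and filtering steps concrete, the sketch below applies one simple choice of each to a toy expression matrix. Median centering of each array and a fixed variance cutoff are illustrative assumptions, not the methods of the studies cited above; the data, function names, and threshold are hypothetical.

```python
from statistics import median, pvariance


def median_center_columns(matrix: list[list[float]]) -> list[list[float]]:
    """Subtract each sample's (column's) median so arrays share a common center."""
    n_columns = len(matrix[0])
    medians = [median(row[c] for row in matrix) for c in range(n_columns)]
    return [[row[c] - medians[c] for c in range(n_columns)] for row in matrix]


def filter_low_variance(
    genes: list[str], matrix: list[list[float]], min_variance: float
) -> tuple[list[str], list[list[float]]]:
    """Keep only genes whose variance across samples is at least min_variance."""
    kept = [
        (gene, row)
        for gene, row in zip(genes, matrix)
        if pvariance(row) >= min_variance
    ]
    return [gene for gene, _ in kept], [row for _, row in kept]


# Hypothetical log-ratio matrix: rows are genes, columns are samples.
genes = ["gene_A", "gene_B", "gene_C"]
log_ratios = [
    [2.0, 1.0, -1.0],
    [-1.0, 2.0, 1.0],
    [0.0, 1.0, 0.0],
]

centered = median_center_columns(log_ratios)
kept_genes, kept_rows = filter_low_variance(genes, centered, min_variance=0.5)
print(centered)
print(kept_genes)
```

In this toy example gene_C varies across samples before centering but becomes flat afterward and is then removed by the filter, a small illustration of how normalization can change the relative differences observed among samples and how the two choices interact.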

Standardization of protocols for transcriptional profiling experiments has contributed to validation and verification strategies that ensure the quality of data. In large measure, progress was facilitated by creation of Minimum Information About a Microarray Experiment (MIAME) guidelines by the Microarray Gene Expression Data Society. MIAME was designed as a set of recommendations to address issues related to data quality and exchange (Brazma et al. 2001; Ball et al. 2002a,b, 2004a,b). The scientific community has endorsed the guidelines (MIAME 2005), and most scientific journals now require adherence to the MIAME recommendations for publishing toxicogenomic studies. MIAME guidelines encompass parameters such as degree of signal linearity, hybridization specificity, normalization strategy, and use of exogenous and internal controls.

In principle, it should be possible to mine datasets generated by multiple laboratories with different microarray platforms. There is tremendous value in making gene expression datasets publicly available and being able to mine the datasets. Besides serving as a source of independent data that can be used as a means of validating results, larger and more diverse sample populations can provide more robust datasets for “meta-analysis” designed to find patterns of gene expression that can be associated with specific biologic states and responses (Malek et al. 2002; Stuart et al. 2003). However, a number of published studies have failed to find concordance between microarray platforms designed to assay expression patterns, in part because of observed disparities between results obtained by different groups analyzing similar samples (calling the validity of microarray assays into question) (Kuo et al. 2002; Maitra et al. 2003; Rogojina et al. 2003; Mah et al. 2004; Park et al. 2004; Shippy et al. 2004; Ulrich et al. 2004; Yauk et al. 2004).
In many instances, it appears that this failure to find concordance is a failure not of the platform or the biologic system but of the metrics used to evaluate concordance. For example, other meta-analyses focused on lists of significant genes, neglecting the fact that in many instances these lists of genes are derived not only from different platforms but also from vastly different approaches to data analysis (Tan et al. 2003; Jarvinen et al. 2004; Mah et al. 2004). This effect can be seen even in looking at a single dataset generated on a single platform. When results from the same array platforms are compared, the results generally show good concordance among different laboratories (Kane et al. 2000; Hughes et al. 2001; Yuen et al. 2002; Barczak et al. 2003; Carter et al. 2003; H.T. Wang et al. 2003).

A series of papers that appeared in the May 2005 issue of Nature Methods systematically dealt with the problem of platform and laboratory comparison (Bammler et al. 2005; Irizarry et al. 2005; Larkin et al. 2005). Larkin et al. (2005) analyzed gene expression in a mouse model of hypertension and compared results obtained using spotted cDNA arrays and Affymetrix GeneChips. For the genes that could be compared, 88% showed expression patterns that appeared to be driven by the underlying biology rather than the platform, and these genes also correlated well with qRT-PCR. Surprisingly, the 12% of genes that showed platform-specific effects also correlated poorly with qRT-PCR. Comparison of these platform-discrepant genes with the platform-concordant genes showed that the discrepant genes were much more likely to map to poorly annotated regions of the genome and consequently were more likely to represent different forms of mRNA (different splice forms). Irizarry and colleagues (2005) compared gene expression measurements for a pair of defined RNA samples, using data generated by a number of laboratories on a variety of microarray platforms. This study showed that one can estimate the “lab effect,” which encompasses differences in sites, platforms, and protocols and, in doing so, arrive at estimates of gene expression that can be compared among laboratories. Finally, the Toxicogenomics Research Consortium (Bammler et al. 2005) reported that standardization of laboratory and data analysis protocols resulted in a dramatic increase in concordance among the results obtained by different laboratories. Independently, these three groups arrived at the general conclusion that, if experiments are done and analyzed carefully and systematically, the results are quite reproducible and provide insight into the underlying biology driving the systems being analyzed.

Although toxicogenomic studies typically rely on technologies that generate large amounts of data, results are often confirmed and replicated with lower throughput assays. For example, differential gene expression detected with more global approaches is often verified by qRT-PCR analysis. The utility of these lower throughput approaches goes beyond validation. A subset of genes analyzed by qRT-PCR may exhibit sensitivity and specificity comparable to global transcriptomic analyses with microarrays.
Relatively small sets of marker genes that represent more complex gene expression patterns may be of considerable value in toxicogenomics.

DNA Microarray Technology

Microarray technology (Figure 2-3) fundamentally advanced biology by enabling the simultaneous analysis of all transcripts in a system. This capability for simultaneous, global analysis is emblematic of the new biology in the genomic era and has become the standard against which other global analysis technologies are judged. DNA microarrays contain collections of oligonucleotide sequences located in precise locations in a high-density format. Two complementary DNA (cDNA) microarray formats have come to dominate the field. Spotted microarrays are prepared from synthesized cDNAs or oligonucleotide probes that are printed on a treated glass slide surface in a high-density format. These spotted arrays were the first widely used DNA microarrays (Schena et al. 1995, 1996) and were originally printed in individual investigators’ laboratories

FIGURE 2-3 An overview of DNA microarray analysis. (A) In two-color analysis approaches, RNA from patient and control samples is individually labeled with distinguishable fluorescent dyes and cohybridized to a single DNA microarray consisting of individual gene-specific probes. Relative gene expression levels in the two samples are estimated by measuring the fluorescence intensities for each arrayed probe; a sample expression vector summarizing the expression level of each gene in the patient sample (relative to the control) is reported. (B) Single-color analysis, such as that using the Affymetrix GeneChip, hybridizes labeled RNA from each biologic sample to a single array

increases in sensitivity, but this approach is not favored because of problems associated with the use of radioactivity and efficiency of analysis.

Affymetrix and other major commercial vendors (Agilent, GE Healthcare [formerly Amersham], and Applied Biosystems) currently offer several different microarrays corresponding to essentially all known genes and transcripts for human as well as similar microarray products for model organisms used in toxicity studies. In addition, Affymetrix also offers whole-genome microarrays for application to SNP mapping and detection (see above).

PROTEOMIC TECHNOLOGIES

Proteomics is the study of proteomes, which are collections of proteins in living systems. Because proteins carry out most functions encoded by genes, analysis of the protein complement of the genome provides insights into biology that cannot be drawn from studies of genes and genomes. MS, gene and protein sequence databases, protein and peptide separation techniques, and novel bioinformatics tools are integrated to provide the technology platform for proteomics (Yates 2000; Smith 2002; Aebersold and Mann 2003). In contrast to “the genome,” there is no single, static proteome in any organism; instead, there are dynamic collections of proteins in different cells and tissues that display moment-to-moment variations in response to diet, stress, disease processes, and chemical exposures. There is no technology analogous to PCR amplification of nucleic acids that can amplify proteins, so they must be analyzed at their native abundances, which span more than six orders of magnitude. Each protein may be present in multiple modified forms; indeed, variations in modification status may be more critical to function than absolute levels of the protein per se (Mann and Jensen 2003).
A related problem is the formation of protein adducts by reactive chemical intermediates generated from toxic chemicals and endogenous oxidative stress (Liebler et al. 2003). Protein damage by reactive chemical intermediates may also perturb endogenous regulatory protein modifications. All these characteristics add to the challenge of proteome analysis. In contrast to the microarray technologies applied to gene expression, most analytical proteomic methods represent elaborate serial analyses rather than truly parallel technologies.

Gel-Based Proteomics

Two major approaches used are gel-based proteomics and “shotgun” proteomics (see Figure 2-4). In the gel-based approach, proteins are resolved by electrophoresis or another separation method and protein features of interest are selected for analysis. This approach is best represented by the use of two-dimensional sodium dodecylsulfate polyacrylamide gel electrophoresis (2D-SDS-PAGE) to separate protein mixtures, followed by selection of spots, and identification of the proteins by digestion to peptides, MS analysis, and database

searching. Gel-based analyses generate an observable “map” of the proteome analyzed, although liquid separations and software can be coupled to achieve analogous results. Reproducibility of 2D gel separations has dramatically improved with the introduction of commercially available immobilized pH gradient strips and precast gel systems (Righetti and Bossi 1997). Comparative 2D-SDS-PAGE with differential fluorescent labeling (for example, differential gel electrophoresis, DIGE) offers powerful quantitative comparisons of proteomes (Tonge et al. 2001; Von Eggeling et al. 2001). Moreover, modified protein forms are often resolved from unmodified forms, which enables separate characterization and quantitative analysis of each. Although 2D gels have been applied most commonly to global analyses of complex proteomes, they have great potential for comparative analyses of smaller subproteomes (for example, multiprotein complexes).

Problems with gel-based analyses stem from the poor separation characteristics of proteins with extreme physical characteristics, such as hydrophobic membrane proteins. A major problem is the limited dynamic range for protein detection by staining (200- to 500-fold), whereas protein abundances vary more than a million-fold (Gygi et al. 2000). This means that abundant proteins tend to preclude the detection of less abundant proteins in complex mixtures. This problem is not unique to gel-based approaches.

FIGURE 2-4 Schematic representation of 2D gel-based proteome analysis (upper) and shotgun proteome analysis (lower). LC, liquid chromatography; TOF, time of flight; MS-MS, tandem mass spectrometry.
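
The mismatch between staining dynamic range and protein abundance range noted above can be put on a common scale. The following is a minimal sketch that only restates the figures quoted in the text (200- to 500-fold staining range, million-fold abundance range) as orders of magnitude; the variable and function names are illustrative and not part of any proteomics software.

```python
import math

def orders_of_magnitude(fold_range: float) -> float:
    """Convert a fold range (max/min) into orders of magnitude (log10)."""
    return math.log10(fold_range)

# Figures quoted in the text above
stain_low, stain_high = 200.0, 500.0   # dynamic range of detection by staining
abundance_range = 1e6                  # protein abundances vary "more than a million-fold"

stain_orders = (orders_of_magnitude(stain_low), orders_of_magnitude(stain_high))
abundance_orders = orders_of_magnitude(abundance_range)

# Fold range left uncovered even with the best-case staining range
uncovered_fold = abundance_range / stain_high

print(f"staining covers {stain_orders[0]:.1f} to {stain_orders[1]:.1f} orders of magnitude")
print(f"abundances span at least {abundance_orders:.0f} orders of magnitude")
print(f"at least {uncovered_fold:.0f}-fold of the abundance range is not covered")
```

Under these figures, staining covers fewer than three of the six or more orders of magnitude spanned by protein abundances, which is why abundant proteins dominate what is detected.
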

Shotgun Proteomics

Shotgun proteomic analysis is somewhat analogous to the genome sequencing strategy of the same name. Shotgun analyses begin with direct digestion of protein mixtures to complex mixtures of peptides, which then are analyzed by liquid-chromatography-coupled mass spectrometry (LC-MS) (Yates 1998). The resulting collection of peptide tandem mass spectrometry (MS-MS) spectra is searched against databases to identify corresponding peptide sequences and then the collection of sequences is reassembled using computer software to provide an inventory of the proteins in the original sample mixture. A key advantage of shotgun proteomics is its unsurpassed performance in the analysis of complex peptide mixtures (Wolters et al. 2001; Washburn et al. 2002).

Peptide MS-MS spectra are acquired by automated LC-MS-MS analyses in which ions corresponding to intact peptides are automatically selected for fragmentation to produce MS-MS spectra that encode peptide sequences (Stahl et al. 1996). This approach enables automated analyses of complex mixtures without user intervention. However, selection of peptide ions for MS-MS fragmentation is based on the intensity of the peptide ion signals, which favors acquisition of MS-MS spectra from the most abundant peptides in a mixture. Thus, detection of low-abundance peptides in complex mixtures is somewhat random. Application of multidimensional chromatographic separations (for example, ion exchange and then reverse-phase high-performance liquid chromatography) “spreads out” the peptide mixture and greatly increases the number of peptides for which MS-MS spectra are acquired (Link et al. 1999; Washburn et al. 2001; Wolters et al. 2001). This increases the detection of less abundant proteins and modified protein forms (MacCoss et al. 2002) (see below).
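
The intensity-based selection of peptide ions described above can be illustrated with a short sketch. It assumes a simple "top N" rule in which the N most intense ions in a survey scan are chosen for fragmentation; the function name, the value of N, and the ion list are hypothetical and are not taken from any instrument vendor's acquisition software.

```python
from typing import List, Tuple

Ion = Tuple[float, float]  # (mass-to-charge ratio, signal intensity)

def select_precursors(survey_scan: List[Ion], top_n: int) -> List[Ion]:
    """Return the top_n most intense ions from one survey scan."""
    ranked = sorted(survey_scan, key=lambda ion: ion[1], reverse=True)
    return ranked[:top_n]

# Hypothetical survey scan: two low-intensity ions sit among abundant ones
scan = [
    (445.12, 9.0e5),
    (512.30, 4.0e3),
    (623.84, 2.5e5),
    (701.41, 8.0e2),
    (850.45, 6.1e4),
]

chosen = select_precursors(scan, top_n=3)
chosen_mz = [mz for mz, _ in chosen]
print(chosen_mz)  # the ions at m/z 512.30 and 701.41 are never fragmented
```

In this toy scan the two weakest ions are never selected, which mirrors the bias toward abundant peptides. Separating the mixture into more fractions means fewer ions compete in each scan, so weaker ions have a better chance of being chosen.
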
New hybrid linear ion trap-tandem MS instruments offer more rapid acquisition of MS-MS spectra and more accurate identification of peptide ion mass-to-charge ratio values, which provides more identifications with greater reliability. Nevertheless, a continuing challenge of shotgun proteome analyses is the identification of less abundant proteins and modified protein forms.

Quantitative Proteomics

The application of quantitative analyses has become a critical element of proteome analyses. Quantitative methods have been developed for application to both gel-based and shotgun proteomic analyses. The most effective quantitative approach for gel-based analyses is DIGE (see above), which involves using amine- or thiol-reactive fluorescent dyes that tag protein samples with different fluorophores for analysis on the same gel. This approach eliminates gel-to-gel variations inherent in comparing spots from individual samples run on different gels. The use of a separate dye and mixed internal standards allows gel-to-gel comparisons of DIGE analyses for larger studies and enables reliable statistical comparisons (Alban et al. 2003). Quantitative shotgun proteome analyses have

been done with stable isotope tags, which are used to derivatize functional groups on proteins (Julka and Regnier 2004). Stable isotope tagging is usually used in paired experimental designs, in which the relative amounts of a protein or protein form are measured rather than the absolute amount in a sample. The first of these to be introduced were the thiol-reactive isotope-coded affinity tag reagents (Gygi et al. 1999), which have been further developed to incorporate solid-phase capture and labeling (Zhou et al. 2002). These reagents are available in “heavy” (for example, 2H- or 13C-labeled) and “light” (for example, 1H- or 12C-labeled) forms. Analysis of paired samples labeled with the light and heavy tags allows relative quantitation by comparing the signals for the corresponding light and heavy ions. Other tag chemistries that target peptide N and C termini have been developed and have been widely applied (Julka and Regnier 2004). An alternative approach to tagging proteins and peptides is to incorporate stable isotope labels through metabolic labeling of proteins during synthesis in cell culture (Ong et al. 2002). Quantitative proteomic approaches are applicable not only to comparing amounts of proteins in samples but also to kinetic studies of protein modifications and abundance changes as well as to identification of protein components of multiprotein complexes as a function of specific experimental variables (Ranish et al. 2003).

Major limitations of the isotope-tagging approaches described above include the requirement for chemical induction of changes in the samples (derivatization) or metabolic incorporation of isotope label and the need to perform quantitative analyses by pairwise comparison. Recent work has demonstrated that quantitative data from LC-MS full-scan analyses of intact peptides (W. Wang et al.
2003) and data from MS-MS spectra acquired from peptides are proportional to the peptide concentration in mixtures (Gao et al. 2003; Liu et al. 2004). This suggests that survey-level quantitative comparisons between any samples analyzed under similar conditions may be possible.

Finally, the use of stable isotope dilution LC-MS-MS analysis provides a method for absolute quantification of individual proteins in complex samples (Gerber et al. 2003). Stable-isotope-labeled standard peptides that uniquely correspond to proteins or protein forms of interest are spiked into proteolytic digests from complex mixtures, and the levels of the target protein are measured relative to the labeled standard. This approach holds great potential for targeted quantitative analysis of candidate biomarkers in biologic fluids (Anderson et al. 2004).

Bioinformatic Tools for Proteomics

A hierarchy of proteomic data is rooted in MS and MS-MS spectra (Figure 2-5) and includes identification and quantitation of proteins and peptides and their modified forms, including comparisons across multiple experiments, analyses, and datasets. A key element of MS-based proteomic platforms is the identification of peptide and protein sequences from MS data. This task is accomplished with a variety of algorithms and software (“bioinformatics” tools) that search protein and nucleotide sequence databases (Fenyo 2000; Sadygov et al. 2004; MacCoss 2005). Measured peptide masses from MALDI-MS spectra of tryptic peptide digests can be searched against databases to identify the corresponding proteins (Perkins et al. 1999). This peptide mass fingerprinting approach works best with relatively pure protein samples. The most widely used and most effective approach is to search uninterpreted MS-MS spectra against database sequences with algorithms and software, such as Sequest, Mascot, and X!Tandem (Eng et al. 1994; Perkins et al. 1999; Craig and Beavis 2004). These algorithms match all spectra to some sequence and provide scores or probability assessments of the quality of the matches. Nevertheless, the balance between sensitivity and specificity in these analyses amounts to a trade-off between missed identifications (low sensitivity) and false-positive identifications (poor specificity) (Nesvizhskii and Aebersold 2004). A second tier of bioinformatic tools evaluates outputs from database search algorithms and estimates probabilities of correct protein identifications (Keller et al. 2002; Nesvizhskii et al. 2003). Other software applications have been developed to enable the identification of modified peptide forms from MS-MS data, even when the chemical nature and amino acid specificity of the modification cannot be predicted (Hansen et al. 2001, 2005; Liebler et al. 2002, 2003).

A key issue in proteomics is the standardization of data analysis methods and data representation and reporting formats. A fundamental problem is the variety of MS instruments, data analysis algorithms, and software used in proteomics.
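
The peptide mass fingerprinting idea described in the preceding section (matching measured tryptic peptide masses against masses predicted from database sequences) can be sketched in a few lines. This is a deliberately simplified illustration and not the scoring scheme of Mascot, Sequest, or any other search engine: it assumes complete tryptic cleavage (after K or R, except before P), unmodified peptides, neutral monoisotopic masses, and a fixed mass tolerance, and it scores a protein by a plain count of matched masses. The two protein sequences and the "observed" masses are hypothetical.

```python
from typing import Dict, List

# Monoisotopic residue masses (Da) for the 20 standard amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(seq: str) -> List[str]:
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def fingerprint_score(observed: List[float], seq: str, tol_da: float) -> int:
    """Count observed masses that fall within tol_da of any predicted peptide mass."""
    predicted = [peptide_mass(p) for p in tryptic_digest(seq)]
    return sum(any(abs(m - t) <= tol_da for t in predicted) for m in observed)

# Hypothetical two-entry sequence database
database: Dict[str, str] = {
    "protA": "MKTAYIAKQRGLSPEK",
    "protB": "MSDNELLKWFGHR",
}

# Simulated measurement: protA peptide masses with a small 0.01 Da error
observed = [peptide_mass(p) + 0.01 for p in tryptic_digest(database["protA"])]

scores = {name: fingerprint_score(observed, seq, tol_da=0.05) for name, seq in database.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With a mixture of many proteins, peptides from different proteins would contribute overlapping mass lists and a simple count like this would become ambiguous, which is consistent with the statement that the approach works best with relatively pure samples.
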
These generate a variety of different data types that describe proteins and peptides and their modifications. Proposals for common representations of MS and proteomic data have been published recently (Taylor et al. 2003; Craig et al. 2004; Pedrioli et al. 2004). In addition, draft criteria for data reporting standards for publication in proteomic journals are under consideration (e.g., Bradshaw 2005a). Another useful development is the emerging collection of databases of matched peptide and protein sequences and corresponding spectral data that define them (Craig et al. 2004; Desiere et al. 2005, 2006).

Data Level   Data Type
4            Comparisons of proteins identified in groups of experiments
3            Summaries of identified/quantified proteins from an individual sample/analysis
2            Quantitation of peptides; peptide identities and scores; peptide modified forms and scores
1            Mass spectra; MS-MS spectra

FIGURE 2-5 A hierarchy of proteomic data is rooted in MS and MS-MS spectra (level 1) and includes outputs of database search analyses and related data reduction (level 2), integrated information about single proteins (level 3), and information about groups of proteins or proteomes across multiple experiments (level 4).

Another important, but unresolved, issue concerns the differences in protein and peptide identifications attributable to the use of different database search algorithms. Because different database search software packages are sold with different MS instruments (for example, Sequest is licensed with Thermo ion trap MS instruments), differences in performance of the algorithms are difficult to separate from differences in characteristics of the instruments. Another issue in comparing the performance of different database searching software is wide variation in identifications due to variation in criteria used to filter the search results (Peng et al. 2003; Elias et al. 2005). This situation will be improved somewhat by adopting standards for reporting false-positive identification rates (Bradshaw 2005b).

Although the efforts described above represent useful steps in information sharing and management, the diversity of instrumentation, analytical approaches, and available data analysis tools will make standardization of informatics an ongoing challenge.

Proteome Profiling

Another type of proteome analysis that has attracted widespread interest is proteome profiling, in which MALDI time-of-flight (MALDI-TOF) MS is used to acquire a spectral profile of a tissue or biofluid sample (for example, serum) (Chaurand et al. 1999; Petricoin et al. 2002a,b; Villanueva et al. 2004).
The signals in these spectra represent intact proteins or protein fragments and collectively reflect the biologic state of the system but, in profiling (compared with approaches described above), the overall pattern rather than identification of specific proteins or protein fragments is the focus. Analyses with high-performance MALDI-TOF instruments can generate spectral profiles containing hundreds to thousands of signals. These typically correspond to the most abundant, lower molecular weight (<25 kilodaltons) components of proteomes. Machine learning approaches have been used to identify sets of spectral features that can classify samples based on spectral patterns (Baggerly et al. 2004, 2005; Conrads et al. 2004). This approach has attracted considerable interest as a potential means of biomarker discovery for early detection of diseases, particularly cancers, as well as drug toxicity. Despite intense interest, proteome profiling studies have created considerable controversy due to problems with lab-to-lab reproducibility of the marker sets identified, a lack of identification of the proteins corresponding to the marker signals, and artifacts in data generation and

analysis (Diamandis 2004; Baggerly et al. 2005). In addition, studies that have identified some of the marker species have shown that they typically are proteolysis products of abundant blood proteins (Marshall et al. 2003), which raises questions about the biologic relationship of the markers to the disease processes under study. Biofluid proteome profiling remains an attractive, if unproven, approach to biomarker discovery. Nevertheless, methods of instrumental and data analysis are rapidly evolving in this field, and the applicability of this approach should be better substantiated within the next 2-3 years.

New MS Instrumentation and Related Technology for Proteomics

Despite impressive advances over the past 15 years, MS instrumentation for proteomics is limited in the numbers of peptides or proteins that can be identified and in the quality of the data generated. New hybrid tandem MS instruments that couple rapid-scanning linear ion trap analyzers with Fourier transform ion cyclotron resonance (FTICR), high-resolution ion trap, and TOF mass analyzers offer both high mass accuracy measurements of peptide ions and rapid scanning acquisition of MS-MS spectra (Syka et al. 2004a; Hu et al. 2005). This improves the fidelity of identification and the mapping of modifications (Wu et al. 2005). New methods for generating peptide sequence data, such as electron transfer dissociation (Syka et al. 2004b), can improve the mapping of posttranslational modifications and chemical adducts.

An important emerging application of FTICR instrumentation is the tandem MS analysis of intact proteins, which is referred to as “top-down” MS analysis (Ge et al. 2002; Kelleher 2004). This approach can generate near-comprehensive sequence analysis of individual protein molecular forms, thus enabling sequence-specific annotation of individual modification variants (Pesavento et al. 2004; Coon et al. 2005).
A limitation of the approach is the requirement for relatively purified proteins and larger amounts of samples than are used in shotgun analyses. However, rapid technology development will make top-down methods increasingly useful for targeted analyses of individual proteins and their modified forms.

Non-MS-Based Proteomic Approaches

Non-MS-based technologies have been applied to proteome analyses, but they have not proven to be as robust and versatile as MS-based methods. Microarray technology approaches include antibody microarrays, in which immobilized antibodies recognize proteins in complex mixtures (de Wildt et al. 2000; Miller et al. 2003; Olle et al. 2005). Although straightforward in principle, this approach has not proven robust and reliable for several reasons. Monospecific antibodies with high affinity for their targets are difficult to obtain and they often lose activity when immobilized. Because arrays must be probed under native conditions, antibodies may capture multiprotein complexes as well as individual proteins, which complicates interpretation. Short strands of chemically synthesized nucleic acid (aptamers) have been studied as potential monospecific recognition molecules for microarrays, and this technology may eventually overcome some of the problems with antibody arrays (Smith et al. 2003; Kirby et al. 2004).

“Reversed-phase” microarrays, which consist of multiple samples of protein mixtures (for example, tissues, cell lysates), are probed with individual antibodies (Paweletz et al. 2001; Janzi et al. 2005). This establishes the presence of the target protein in multiple samples rather than the presence of multiple proteins in any sample. As with antibody microarrays, the main limitations of this approach stem from the quality and availability of antibodies for the targets of interest.

Microarrays of expressed proteins or protein domain substructures have been probed with tagged proteins or small molecules to identify protein binding partners or small molecule ligand sites or to conduct surveys of substrates for enzymes (for example, kinases) (Zhu et al. 2001; Ramachandran et al. 2004). This approach is directed to functional analysis of known proteins as opposed to identification and analysis of the components of complex mixtures. A related technique directed at the study of protein-protein interactions is surface plasmon resonance (SPR) (Liedberg et al. 1995; Homola 2003; Yuk and Ha 2005). This technology allows real-time measurements of protein binding affinities and interactions. In common usage, SPR is used to study single pairs of interacting species. However, recent adaptations of SPR allow direct analysis of protein-protein interactions in microarray format (Yuk et al. 2004).

METABOLOMIC TECHNOLOGIES

Metabolomics¹ is the analysis of collections of small molecule intermediates and products of diverse biologic processes. Metabolic intermediates reflect the actions of proteins in biochemical pathways and thus represent biologic states in a way analogous to proteomes.
As with proteomes, metabolomes are dynamic and change in response to nutrition, stress, disease states, and even diurnal variations in metabolism. Unlike genomes, transcriptomes, and proteomes, metabolomes comprise a chemically diverse collection of compounds, which range from small peptide, lipid, and nucleic acid precursors and degradation products to chemical intermediates in biosynthesis and catabolism as well as metabolites of exogenous compounds derived from the diet, environmental exposures, and therapeutic interventions. A consequence of the chemical diversity of metabolome components is the difficulty of comprehensive analysis with any single analytical technology.

¹ Although some scientists attempt to distinctly define the terms metabolomics and metabonomics, the committee uses the term metabolomics throughout the report simply because it is used more frequently in the literature.

NMR-Based Metabolomics

The principal technology platforms for metabolomics are NMR spectroscopy and gas chromatography MS (GC-MS) or LC-MS. NMR has been the dominant technology for metabolomic studies of biofluids ex vivo. High-field (600 MHz) 1H-NMR spectra of urine contain thousands of signals representing hundreds to thousands of metabolites (Nicholson et al. 2002). Hundreds of analytes have been identified in such spectra and collectively represent a plurality of metabolic processes and networks from multiple organs and tissues. Although NMR has been most commonly applied to urine samples, similar analyses of intact solid tissues were accomplished with the use of magic angle spinning 1H-NMR (Waters et al. 2000; Nicholson et al. 2002; Y. Wang et al. 2003).

Although it is possible to establish the identity of many, but not all, of the peaks in NMR spectra of urine and biofluids, the value of the data has been in the analyses of collections of spectral signals. These pattern recognition approaches have been used to identify distinguishing characteristics of samples or sample sets. Unsupervised² analyses of the data, such as principal components analysis (PCA), have proven useful for grouping samples based on sets of similar features (Beckwith-Hall et al. 1998; Holmes et al. 1998). These similar features frequently reflect chemical similarity in metabolite composition and thus similar courses of response to toxicants. Supervised analyses allow the use of data from biochemically or toxicologically defined samples to establish models capable of classifying samples based on multiple features in the spectra (Stoyanova et al. 2004).

NMR-based metabolomics of urine measures global metabolic changes that have occurred throughout an organism. However, metabolite profiles in urine can also indicate tissue-specific toxicities.
PCA of urinary NMR data has shown that the development and resolution of chemically induced tissue injury can be followed by plotting trajectories of PCA-derived parameters (Azmi et al. 2002).

Although the patterns themselves provide a basis for analyses, some specific metabolites have also been identified (based on their resonances in the NMR spectra). Mapping these metabolites onto known metabolic pathways makes it possible to draw inferences about the biochemical and cellular consequences and mechanisms of injury (Griffin et al. 2004). An interesting and important consequence was the identification of endogenous bacterial metabolites as key elements of diagnostic metabonomic profiles (Nicholls et al. 2003; Wilson and Nicholson 2003; Robosky et al. 2005). Although the interplay of gut bacteria with drug and chemical metabolism had been known previously, recent NMR metabolomic studies indicate that interactions between host tissues and gut microbes have a much more pronounced effect on susceptibility to injury than had been appreciated previously (Nicholson et al. 2005).

² Unsupervised analysis methods look for patterns in the data without using previous knowledge about the data, such as information about treatment or classification; supervised methods use this knowledge. See Chapter 3 for more details.
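
To make the idea of unsupervised grouping by PCA concrete, the sketch below computes the first principal component of a tiny table of binned spectral intensities and projects each sample onto it. It is an illustration only: the four "spectra" are invented numbers, the sample labels are used solely to interpret the output and never enter the calculation, and the principal component is found by plain power iteration on the covariance matrix rather than by the software used in the cited studies.

```python
import math
from typing import List, Tuple

def first_principal_component(data: List[List[float]],
                              iterations: int = 200) -> Tuple[List[float], List[float]]:
    """Return (loading vector, per-sample scores) for the first principal component."""
    n, m = len(data), len(data[0])
    # Mean-center each spectral bin
    means = [sum(row[j] for row in data) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in data]
    # Sample covariance matrix between bins
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration; assumes the start vector is not orthogonal to the component
    v = [1.0] + [0.0] * (m - 1)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(m)) for r in centered]
    return v, scores

# Hypothetical binned intensities (rows: samples, columns: spectral bins)
labels = ["control 1", "control 2", "treated 1", "treated 2"]
spectra = [
    [10.0, 1.0, 5.0],
    [11.0, 1.2, 5.1],
    [4.0, 6.0, 5.0],
    [3.5, 6.5, 4.9],
]

pc1, scores = first_principal_component(spectra)
for label, score in zip(labels, scores):
    print(f"{label}: PC1 score {score:+.2f}")
```

In this toy data the control and treated samples fall on opposite sides of zero along the first component without the labels being used, which is the sense in which the method is unsupervised. Plotting such scores for samples collected over time would give the kind of trajectory described above.
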

A critical issue in the application of metabolomics is the standardization of methods, data analysis, and reporting across laboratories. A recent cooperative study by the Consortium for Metabonomic Toxicology indicated that NMR-based technology is robust and reproducible in laboratories that follow similar analytical protocols (Lindon et al. 2003). Investigators in the field recently have agreed on consensus standards for analytical standardization and data representation in metabonomic analyses (Lindon et al. 2005a).

MS-Based Metabolomics

MS-based analyses offer an important alternative approach to metabolomics. The greatest potential advantage of MS-based methods is sensitivity. MS analyses can detect molecules at levels up to 10,000-fold lower than NMR can (Brown et al. 2005; Wilson et al. 2005a). Both GC-MS and LC-MS approaches have been used, although limits of volatility of many metabolites reduce the range of compounds that can be analyzed successfully with GC-MS. LC-MS analyses are done with both positive and negative ion electrospray ionization and positive and negative chemical ionization. These four ionization methods provide complementary coverage of diverse small molecule chemistries. The principal mode of analysis is via “full scan” LC-MS, in which the mass range of the instrument is repeatedly scanned (Plumb et al. 2002; Wilson et al. 2005b). This analysis records the mass-to-charge ratios and retention times of metabolites. Because most small molecules produce singly charged ions, the analyses provide molecular weights of the metabolites. Analysis of standards in the same system and the use of MS-MS analysis can establish the identity of the components of interest. However, apparent molecular weight measurement alone is often insufficient to generate candidate metabolite identifications; frequently, hundreds or thousands of molecules are being analyzed.
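
The limits of identification by molecular weight alone can be illustrated with a small lookup. The sketch below computes monoisotopic masses from elemental formulas and lists every entry of a candidate table that lies within a given parts-per-million (ppm) tolerance of a measured mass. The five-compound table, the measured value, and the tolerances are illustrative assumptions; the measured value is treated as a neutral monoisotopic mass, so ion adduct corrections are ignored.

```python
from typing import Dict, List

# Monoisotopic element masses (Da)
ELEMENT_MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

# Small illustrative candidate table: name -> elemental composition
CANDIDATES: Dict[str, Dict[str, int]] = {
    "glucose": {"C": 6, "H": 12, "O": 6},
    "fructose": {"C": 6, "H": 12, "O": 6},
    "hippuric acid": {"C": 9, "H": 9, "N": 1, "O": 3},
    "citric acid": {"C": 6, "H": 8, "O": 7},
    "creatinine": {"C": 4, "H": 7, "N": 3, "O": 1},
}

def monoisotopic_mass(formula: Dict[str, int]) -> float:
    """Sum of monoisotopic element masses weighted by atom counts."""
    return sum(ELEMENT_MASS[el] * count for el, count in formula.items())

def candidates_within(measured: float, tol_ppm: float) -> List[str]:
    """Names of table entries whose mass is within tol_ppm of the measured mass."""
    hits = []
    for name, formula in CANDIDATES.items():
        mass = monoisotopic_mass(formula)
        if abs(measured - mass) / mass * 1e6 <= tol_ppm:
            hits.append(name)
    return sorted(hits)

measured_mass = 180.0634  # hypothetical measured neutral mass

print(candidates_within(measured_mass, tol_ppm=10000.0))  # loose tolerance
print(candidates_within(measured_mass, tol_ppm=5.0))      # tight tolerance
```

Tightening the tolerance removes hippuric acid from the candidate list, but glucose and fructose share an elemental formula and cannot be separated by mass at any accuracy. Resolving them takes additional evidence of the kind the text mentions, such as retention times of standards or MS-MS spectra.
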
Nevertheless, accurate information about molecular weight, where possible, is of great value in identification. For this reason LC-MS metabolomic analyses are most commonly done with higher mass accuracy MS instruments, such as LC TOF, quadrupole TOF, and FTICR MS instruments (Wilson et al. 2005a,b). NMR- and MS-based approaches provide complementary platforms for metabolomic studies and an integration of these platforms will be needed to provide capabilities that are most comprehensive. Clearly, either platform can detect metabolite profile differences sufficient to distinguish different toxicities. What is not yet clear is the degree to which either approach can resolve subtly different phenotypes.

TECHNOLOGY ADVANCEMENT AND ECONOMY OF SCALE

A major determinant of success in genome sequencing projects was achieving economy of scale for genome sequencing technologies (see above). The successful implementation of large-scale toxicogenomic initiatives will require advances in standardization and economy of scale. Most proteomic and

metabolomic analyses in academic and industry laboratories are done on a small scale to address specific research questions. The evolution of transcriptome profiling and proteomic and metabolomic technology platforms to increase standardization and reduce costs will be essential to maximize their impact.