Applications of Toxicogenomic Technologies to Predictive Toxicology and Risk Assessment
2 Toxicogenomic Technologies
All the information needed for life is encoded in an organism’s genome. In humans, approximately 25,000 genes encode protein products, which carry out diverse biologic functions. The DNA segments that compose genes are transcribed to create messenger RNAs (mRNAs), each of which is in turn translated into protein (Figure 2-1). Other DNA sequences encode short RNA molecules that regulate the expression of genes and the stability and processing of mRNA and proteins.
The integration of genomics, transcriptomics, proteomics, and metabolomics provides unprecedented opportunities to understand biologic networks that control responses to environmental insults. Toxicogenomics uses these new technologies to analyze genes, genetic polymorphisms, mRNA transcripts, proteins, and metabolites. The foundation of this field is the rapid sequencing of the human genome and the genomes of dozens of other organisms, including animals used as models in toxicology studies. Whereas the human genome project was a multiyear effort of a consortium of laboratories, new rapid, high-density, and high-efficiency sequencing approaches allow a single laboratory to sequence genomes in days or weeks. Gene sequencing technologies have also enabled rapid analysis of sequence variation in individual genes, which underlies the diversity in responses to chemicals and other environmental factors.
Perhaps the most emblematic technology of the post-genomic era is the microarray, which makes it possible to simultaneously analyze many elements of a complex system. The integration of microarray methods with the polymerase chain reaction (PCR) has made the analysis of mRNAs the first and most technologically comprehensive of all the -omics technologies.
Rapid advances in mass spectrometry (MS) and nuclear magnetic resonance (NMR) have driven the development of proteomics and metabolomics, which complement transcriptomics. In contrast to microarrays for transcriptomics, technologies for proteomics and metabolomics are limited by the huge range of analyte concentrations involved (at least six orders of magnitude), because all existing instrumentation favors the detection of more abundant species over those that are less abundant. This fundamental problem limits essentially all proteomic and metabolomic analyses to subsets of the complete collections of proteins and metabolites, respectively. Despite this limitation, proteomic and metabolomic approaches have fundamentally advanced the understanding of the mechanisms of toxicity and adaptation to stress and injury.
FIGURE 2-1 Hierarchical relationships of DNA, RNA, proteins, and metabolites.
The core technologies for toxicogenomics are evolving rapidly, and this is making toxicogenomic approaches increasingly powerful and cost-effective. Several key aspects of technology development will drive toxicogenomics in the next decade:
New sequencing technologies offer the prospect of cost-effective individual whole-genome sequencing and comprehensive genotype analysis.
Array-based whole-genome scanning for variations in individual genes, known as single nucleotide polymorphisms (SNPs), will dramatically increase throughput for genotyping in population studies.
Advances in NMR and MS instrumentation will enable high-sensitivity analyses of complex collections of metabolites and proteins and quantitative metabolomics and proteomics.
New bioinformatic tools, database resources, and statistical methods will integrate data across technology platforms and link phenotypes and toxicogenomic data.
The interplay of technology development and application continues to drive the evolution of toxicogenomic technologies in parallel with advances in their applications to other fields of biology and medicine. This dynamic will extend the longstanding paradigm of toxicology as a means to understand both fundamental physiology and environmentally induced disease. The basic types of technologies relevant to this report are described in this chapter. However, it is not feasible for the report to describe comprehensively the various tools and developments in this rapidly evolving field.
GENOMIC TECHNOLOGIES
The dramatic evolution of rapid DNA sequencing technologies during the 1980s and 1990s culminated in sequences of the human genome and the genomes of dozens of other organisms, including those used as animal models for chemical toxicity. Complete genome sequences provide catalogs of genes, their locations, and their chromosomal context. They serve as general reference maps but do not yet reflect individual variations, including SNPs and groups of SNP alleles (haplotypes), that account for the individual genetic variation underlying different responses to toxic chemical and environmental factors. This section describes technologies for sequencing, for analyzing genotype variation, and for analyzing epigenetic modifications. As technology continues to advance, so will the analyses of genome variations and their roles in humans’ different responses to chemicals and other environmental factors.
Genome Sequencing Technologies
High-throughput gene sequencing began when Sanger and colleagues introduced dideoxynucleotide sequencing in 1977 (Sanger et al. 1977).
Technical innovations over the next 18 years led to the development of current-generation automated instruments, which use fluorescent tagging and capillary electrophoresis to sequence up to 1.6 million base pairs (bp) per day (Chan 2005). These instruments have become the principal engines of genome sequencing projects. Despite the impressive advances in technology, the cost of genome sequencing remains a barrier to the application of whole-genome sequencing to study individual variation in toxic responses and susceptibility. Current costs for whole-genome sequencing in a modern genome sequencing center using Sanger sequencing technology were estimated at approximately $0.004 per bp, which translates to approximately $12M for an entire human genome (Chan 2005). Fundamental technology limitations of automated Sanger sequencing have spawned the development of new technologies that promise significant cost and throughput improvements. A cycle extension approach using a highly parallel picoliter reactor system and a pyrosequencing protocol sequenced 25 million bp at 99% accuracy in a 4-h analysis (Margulies et al. 2005).
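The cost and throughput figures quoted above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope illustration: the genome size of roughly 3 billion bp is an assumption not stated in the text, and the pyrosequencing rate assumes that 4-h runs could be performed back to back for a full day.

```python
# Back-of-the-envelope check of the sequencing figures quoted in the text.
# ASSUMPTION: a human genome of about 3 billion bp (not stated in the text).
GENOME_BP = 3_000_000_000

SANGER_COST_PER_BP = 0.004      # dollars per bp, as quoted from Chan 2005
SANGER_BP_PER_DAY = 1_600_000   # bp per day, as quoted from Chan 2005
PYRO_BP_PER_RUN = 25_000_000    # bp per run, as quoted from Margulies et al. 2005
PYRO_RUN_HOURS = 4              # duration of one run in hours

# Cost of reading every base of the genome at the quoted per-bp price.
genome_cost = SANGER_COST_PER_BP * GENOME_BP

# Days needed to read the genome once at the quoted Sanger rate.
sanger_days_single_pass = GENOME_BP / SANGER_BP_PER_DAY

# ASSUMPTION: runs are repeated back to back over 24 hours.
pyro_bp_per_day = PYRO_BP_PER_RUN * 24 / PYRO_RUN_HOURS
throughput_ratio = pyro_bp_per_day / SANGER_BP_PER_DAY

print(f"Sanger cost per genome: ${genome_cost / 1e6:.0f}M")
print(f"Days for a single pass at the Sanger rate: {sanger_days_single_pass:.0f}")
print(f"Pyrosequencing vs. Sanger throughput: {throughput_ratio:.0f}-fold")
```

Under these assumptions the per-genome cost comes out at about $12M, matching the quoted estimate, and the 4-h pyrosequencing run corresponds to roughly 94 times the daily Sanger rate.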
This represents a 100-fold increase in throughput over the best available Sanger sequencing instrumentation and is now marketed commercially (454 Life Sciences 2006). Church and colleagues (Shendure et al. 2005) recently reported the use of a microscopy-based sequencing approach that employs “off-the-shelf” reagents and equipment and is approximately one-ninth as costly as high-speed Sanger sequencing. Nanotechnology approaches may yield even more efficient DNA sequencing instrumentation, but these approaches have not yet been implemented for high-throughput sequencing in a research laboratory setting (Chan et al. 2004; Chan 2005).
The dominant genome sequencing strategy to emerge from the human genome project is the whole-genome “shotgun” sequencing approach (Venter et al. 1996), which entails breaking up chromosomal DNA into fragments of 500-1,000 DNA bases, which then are subjected to automated, repetitive sequencing and subsequent data analyses to reassemble the fragment sequences into their original chromosomal context. With this technique, most DNA bases on a chromosome are sequenced three to seven times, resulting in cost- and time-efficient generation of high-fidelity sequence information.
Genotype Analysis
Analysis of genetic variation between individuals involves first the discovery of SNPs and then the analysis of these variations in populations. SNPs occur approximately every 2 kilobases in the human genome and have been commonly detected by automated Sanger sequencing as discussed above. The Human DNA Polymorphism Discovery Program of the National Institute of Environmental Health Sciences Environmental Genome Project (EGP) is one example of the application of automated DNA sequencing technologies to identify SNPs in human genes (Livingston et al. 2004).
The EGP selected 293 “candidate genes” for sequencing based on their known or anticipated roles in the metabolism of xenobiotics. The major limitation of this candidate gene approach is that it enables discovery of polymorphisms only in those genes targeted for analysis. This analysis at the population level requires high-throughput, cost-effective tools to analyze specific SNPs. The analysis generally involves PCR-based amplification of a target gene and allele-specific detection. The dominant approaches in current use are homogeneous systems using TaqMan (Livak 1999) and molecular beacons (Marras et al. 1999), which offer high throughput in a multiwell format. However, these systems are not multiplexed (multiple SNPs cannot be analyzed simultaneously in the same sample), and this ultimately limits throughput. Other approaches may provide the ability to analyze multiple SNPs simultaneously. For example, matrix-assisted laser desorption ionization MS (MALDI-MS) has been used to simultaneously analyze multiple products of primer-extension reactions (Gut 2004). A more highly multiplexed
approach is the combination of tag-based array technology with primer extension for genotyping (Phillips et al. 2003; Zhang et al. 2004). The ultimate approach for SNP discovery is individual whole-genome sequencing. Although not yet feasible, rapidly advancing technology makes this approach the logical objective for SNP discovery in the future. In the meantime, rapid advances in microarray-based SNP analysis technology have redefined the scope of SNP discovery, mapping, and genotyping. New microarray-based genotyping technology enables “whole-genome association” analyses of SNPs between individuals or between strains of laboratory animals (Syvanen 2005). In contrast to the candidate gene approach described above, whole-genome association identifies hundreds to thousands of SNPs in multiple genes. Arrays used for these analyses can represent a million or more polymorphic sequences mapped across a genome (Gunderson et al. 2005; Hinds et al. 2005; Klein et al. 2005). This approach makes it possible to identify SNPs associated with disease and susceptibility to toxic insult. As more SNPs are identified and incorporated into microarrays, the approach samples a greater fraction of the genome and becomes increasingly powerful. The strength of this technology is that it puts a massive amount of easily measurable genetic variation in the hands of researchers in a cost-effective manner ($500-$1,000 per chip). Criteria for including the selected SNPs on the arrays are a critical consideration, as these affect inferences that can be drawn from the use of these platforms.
Epigenetics
Analysis of genes and genomes is not strictly confined to sequences, as modifications of DNA also influence the expression of genes. Epigenetics refers to the study of reversible heritable changes in gene function that occur without a change in the sequence of nuclear DNA.
These changes are major regulators of gene expression. Differences in gene expression due to epigenetic factors are increasingly recognized as an important basis for individual variation in susceptibility and disease (Scarano et al. 2005), as discussed in Chapter 6. Epigenetic phenomena include DNA methylation, imprinting, and histone modifications. Of all the different types of epigenetic modifications, DNA methylation is the most easily measured and amenable to the efficient analysis characteristic of toxicogenomic technologies. DNA methylation refers to addition of a methyl group to the 5-carbon of cytosine in an area of DNA where there are many cytosines and guanines (a CpG island) by the enzyme DNA methyltransferase. Many different methods for analyzing DNA methylation have been developed and fall into two main types: global and gene-specific. Global methylation analysis methods measure the overall level of methyl cytosines in the genome using chromatographic methods (Ramsahoye 2002) or methyl accepting capacity assays. Gene-specific methylation analysis methods originally used methylation-sensitive restriction enzymes to digest DNA before it was analyzed by Southern blot analysis or PCR amplification. Sites that were methylated were identified by their resistance to the enzymes. Recently, methylation-sensitive primers or bisulfite conversion of unmethylated cytosine to other bases have been used in methods such as methylation-specific PCR and bisulfite genomic sequencing. These methods give a precise map of the pattern of DNA methylation in a particular genomic region or gene and are fast and cost-effective (e.g., Yang et al. 2006). To identify unknown methylation hot-spots within a larger genomic context, techniques such as Restriction Landmark Genomic Scanning for Methylation (Ando and Hayashizaki 2006) and CpG island microarrays (e.g., Y. Wang et al. 2005a; Bibikova et al. 2006; Hayashi et al. 2007) have also been developed. Restriction landmark genomic scanning (RLGS) is a method to detect large numbers of methylated cytosine sites in a single experiment using direct end-labeling of the genomic DNA digested with a restriction enzyme and separated by high-resolution two-dimensional electrophoresis. Several array-based methods have also been developed recently. Bibikova et al. (2006) developed a rapid method for analyzing the methylation status of hundreds of preselected genes simultaneously through an adaptation of a genotyping assay. For example, the methylation state of 1536 specific CpG sites in 371 genes was measured in a single reaction by multiplexed genotyping of bisulfite-treated genomic DNA. The efficient nature of this quantitative assay could be useful for DNA methylation studies in large epidemiologic samples. Hayashi et al. (2007) describe a method for analysis of DNA methylation using oligonucleotide microarrays that involves separating methylated DNA immunoprecipitated with anti-methylcytosine antibodies.
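The bisulfite-based methods mentioned above depend on a simple read-out rule, which the following sketch illustrates. It relies on the standard chemistry, not spelled out in the text, that bisulfite converts unmethylated cytosine to uracil (read as T after amplification) while methylated cytosine is left intact. The sequence, the methylated positions, and the function names are hypothetical, and the sketch ignores strand handling and incomplete conversion.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus amplification of one DNA strand.

    Unmethylated C is read as T; methylated C remains C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted_read):
    """Compare a converted read with the reference at every reference C."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, converted_read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = "methylated"
        elif read_base == "T":
            calls[i] = "unmethylated"
        else:
            calls[i] = "ambiguous"
    return calls


# Hypothetical 10-base reference with cytosines at positions 1, 4, 7, and 8.
reference = "ACGTCGACCG"
read = bisulfite_convert(reference, methylated_positions=[1, 8])
calls = call_methylation(reference, read)
print(read)
print(calls)
```

Aligning the converted read back to the unconverted reference is what turns a chemical difference into a per-site methylation map; real pipelines must also deal with both strands and with conversion efficiency.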
Most of these gene-specific methods work consistently at various genomic locations (that is, they have minimal bias by genomic location) and will be useful for high-resolution analysis of the epigenetic modifications to the genome in toxicogenomic studies.
TRANSCRIPTOMIC TECHNOLOGIES
Transcriptomics describes the global measurement of mRNA transcripts in a biologic system. This collection of mRNA transcripts represents the transcription of all genes at a point in time. Technologies that allow the simultaneous analysis of thousands of transcripts have made it possible to analyze transcriptomes.
Technologic Approaches
Technologies for assaying gene, protein, and metabolic expression profiles are not new inventions. Measurements of gene expression have evolved from the single measures of steady-state mRNA using Northern blot analysis to the more global analysis of thousands of genes using DNA microarrays and serial analysis of gene expression (SAGE), the two dominant technologies (Figure 2-2). The
advantage of global approaches is the ability of a single investigation to query the behavior of hundreds, thousands, or tens of thousands of biologic molecules in a single assay. For example, in profiling gene expression, one might use technologies such as Northern blot analysis to look at expression of a single gene. Quantitative real-time reverse transcriptase PCR (qRT-PCR), often used with subtractive cloning or differential display, can easily be used to study the expression of 10 or more genes. Techniques such as SAGE allow the entire collection of transcripts to be catalogued without assumptions about what is actually expressed (unlike microarrays, where one needs to select probes from a catalogue of genes). SAGE is a technology based on sequencing strings of short expressed sequence tags representing both the identity and the frequency of occurrence of specific sequences within the transcriptome. SAGE is costly and relatively low throughput, because each sample to be analyzed requires a SAGE Tag library to be constructed and sequenced. Massively parallel signature sequencing speeds up the SAGE process with a bead-based approach that simultaneously sequences multiple tags, but it is costly. DNA microarray technology can be used to generate large amounts of data at moderate cost but is limited to surveys of genes that are included in the microarray. In this technology (see Box 2-1 and Figure 2-3), a solid matrix surface supports thousands of different, surface-bound DNAs, which are hybridized against a pool of RNA to measure gene expression. A systematic comparison indicates that gene expression measured by oligonucleotide microarrays correlates well with SAGE in transcriptional profiling, particularly for genes expressed at high levels (Kim 2003).
FIGURE 2-2 Overview of commonly used methods and technologies for gene expression analysis.
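The idea that SAGE tags carry both the identity and the frequency of transcripts can be illustrated with a simple tally. This is a minimal sketch under stated assumptions: the tag sequences, the tag-to-gene lookup, and the function name are all hypothetical, and a real SAGE analysis would also involve library construction, tag extraction from concatemers, and handling of sequencing errors.

```python
from collections import Counter


def tag_frequencies(tags):
    """Return {tag: (count, fraction of all tags)} for a list of sequenced tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: (n, n / total) for tag, n in counts.items()}


# Hypothetical 10-base tags as they might be read from one SAGE library.
sequenced_tags = [
    "GTGACCACGG", "TTCATACACC", "GTGACCACGG",
    "CCCATCGTCC", "GTGACCACGG", "TTCATACACC",
]

# Hypothetical lookup from tag sequence to gene name.
tag_to_gene = {
    "GTGACCACGG": "gene_1",
    "TTCATACACC": "gene_2",
    "CCCATCGTCC": "gene_3",
}

profile = tag_frequencies(sequenced_tags)
for tag, (count, fraction) in sorted(profile.items(), key=lambda kv: -kv[1][0]):
    print(f"{tag_to_gene[tag]}\t{tag}\t{count}\t{fraction:.2f}")
```

Because the tally is built from whatever tags are observed, no prior choice of probes is needed, which is the contrast with microarrays drawn in the text.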
BOX 2-1 Experimental Details of Transcriptome Profiling with Microarrays
mRNA extracted from cell or tissue samples is prepared for microarray analysis by PCR-based amplification (Hardiman 2004). A fluorescent dye (or biotin for Affymetrix microarrays) is incorporated into the amplified RNA sequences. Two-color arrays involve fluorescently labeling paired samples (control versus experimental) with different dyes (see Figure 2-3). The amplified, labeled sequences, termed “targets,” are then hybridized to the microarrays. After hybridization and washing, the arrays are imaged with a confocal laser scanner and the relative fluorescence intensity (or streptavidin-conjugated phycoerythrin) for each gene-specific probe represents the expression level for that gene. The actual value reported depends on the microarray technology platform used and the experimental design. For Affymetrix GeneChips, in which each sample is hybridized to an individual array, expression for each gene is measured as an “average difference” that represents an estimated expression level, less nonspecific background. For two-color arrays, assays typically compare paired samples and report expression as the logarithm of the ratio of the experimental sample to the control sample. Regardless of the approach or technology, the fundamental data used in all subsequent analyses are the expression measures for each gene in each experiment. These expression data are typically represented as an “expression matrix” in which each row represents a particular gene and each column represents a specific biologic sample (Figure 2-3). In this representation, each row is a “gene expression vector,” where the individual entries are its expression levels in the samples assayed and each column is a “sample expression vector” that records the expression of all genes in that sample.
The data are normalized to compensate for differences in labeling, hybridization, and detection efficiencies. Approaches to data normalization depend on the platform and the assumptions made about biases in the data (Brazma et al. 2001; Schadt et al. 2001; I.V. Yang et al. 2002; Y.H. Yang et al. 2002; Sidransky et al. 2003). Filtering transformations are often applied to the data by using statistical approaches that, for example, eliminate genes that have minimal variance across the collection of samples or those that fail to provide data in most of the experiments. These filtering transformations reduce dataset complexity by eliminating genes unlikely to contribute to the experimental goal. The choice of normalization and filtering transformations can have a profound effect on the results (Hoffmann et al. 2002). Normalization adjusts the fluorescence intensities on each array and therefore can change the relative difference observed among samples. Normalization is generally necessary to compensate for systematic errors introduced during measurement, but overnormalizing can distort the data. Similarly, different methods of data filtering can produce very different results. All statistical tests that are applied rely on assumptions about the nature of the variance in the measurements. Different statistical tests applied to the very same dataset can often produce different (but generally overlapping) sets of significant genes. Dealing with these “high-dimensional” datasets in which there are often more measurements (genes) than samples is an area of active research and debate.
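The steps described in this box (log ratios, an expression matrix, normalization, and variance filtering) can be sketched as follows. This is an illustration under stated assumptions rather than a recommended pipeline: the intensities are invented, per-sample median centering stands in for whichever normalization a platform actually requires, and the variance threshold is arbitrary.

```python
from math import log2
from statistics import median, pvariance

# Hypothetical two-color intensities: gene -> (experimental, control) per sample.
raw = {
    "gene_a": [(800, 200), (3200, 400), (200, 100)],
    "gene_b": [(300, 300), (1000, 500), (250, 250)],
    "gene_c": [(100, 400), (400, 400), (50, 400)],
}
genes = list(raw)

# Expression matrix of log2 ratios: rows are genes, columns are samples.
matrix = [[log2(exp / ctl) for exp, ctl in raw[g]] for g in genes]

# ASSUMPTION: median centering of each sample expression vector (column)
# as a simple stand-in for platform-specific normalization.
col_medians = [median(col) for col in zip(*matrix)]
normalized = [[v - m for v, m in zip(row, col_medians)] for row in matrix]

# Filtering: drop genes whose variance across samples is minimal.
MIN_VARIANCE = 0.1  # arbitrary illustrative threshold
kept = [g for g, row in zip(genes, normalized) if pvariance(row) >= MIN_VARIANCE]

for g, row in zip(genes, normalized):
    print(g, row)
print("kept after filtering:", kept)
```

In this toy dataset the second sample carries a uniform twofold bias toward the experimental channel; median centering removes it, and the filter then discards the gene whose normalized ratios do not vary. Changing either choice changes which genes survive, which is the sensitivity the box warns about.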
Standardization of protocols for transcriptional profiling experiments has contributed to validation and verification strategies that ensure the quality of data. In large measure, progress was facilitated by creation of Minimum Information About a Microarray Experiment (MIAME) guidelines by the Microarray Gene Expression Data Society. MIAME was designed as a set of recommendations to address issues related to data quality and exchange (Brazma et al. 2001; Ball et al. 2002a,b, 2004a,b). The scientific community has endorsed the guidelines (MIAME 2005), and most scientific journals now require adherence to the MIAME recommendations for publishing toxicogenomic studies. MIAME guidelines encompass parameters such as degree of signal linearity, hybridization specificity, normalization strategy, and use of exogenous and internal controls. In principle, it should be possible to mine datasets generated by multiple laboratories with different microarray platforms. There is tremendous value in making gene expression datasets publicly available and being able to mine the datasets. Besides serving as a source of independent data that can be used as a means of validating results, larger and more diverse sample populations can provide more robust datasets for “meta-analysis” designed to find patterns of gene expression that can be associated with specific biologic states and responses (Malek et al. 2002; Stuart et al. 2003). However, a number of published studies have failed to find concordance between microarray platforms designed to assay expression patterns, in part because of observed disparities between results obtained by different groups analyzing similar samples (calling the validity of microarray assays into question) (Kuo et al. 2002; Maitra et al. 2003; Rogojina et al. 2003; Mah et al. 2004; Park et al. 2004; Shippy et al. 2004; Ulrich et al. 2004; Yauk et al. 2004).
In many instances, it appears that this failure to find concordance is a failure not of the platform or the biologic system but of the metrics used to evaluate concordance. For example, other meta-analyses focused on lists of significant genes, neglecting the fact that in many instances these lists of genes are derived not only from different platforms but also from vastly different approaches to data analysis (Tan et al. 2003; Jarvinen et al. 2004; Mah et al. 2004). The data analysis effect can be seen even in looking at a single dataset generated on a single platform. When results from the same array platforms are compared, the results generally show good concordance among different laboratories (Kane et al. 2000; Hughes et al. 2001; Yuen et al. 2002; Barczak et al. 2003; Carter et al. 2003; H.T. Wang et al. 2003). A series of papers that appeared in the May 2005 issue of Nature Methods systematically dealt with the problem of platform and laboratory comparison (Bammler et al. 2005; Irizarry et al. 2005; Larkin et al. 2005). Larkin et al. (2005) analyzed gene expression in a mouse model of hypertension and compared results obtained using spotted cDNA arrays and Affymetrix GeneChips. For the genes that could be compared, 88% showed expression patterns that appeared to be driven by
the underlying biology rather than the platform, and these genes also correlated well with qRT-PCR. Surprisingly, the 12% of genes that showed platform-specific effects also correlated poorly with qRT-PCR. Comparison of these platform discrepant genes with the platform concordant genes showed that the discrepant genes were much more likely to map to poorly annotated regions of the genome and consequently were more likely to represent different forms of mRNA (different splice forms). Irizarry and colleagues (2005) compared gene expression using a pair of defined RNA samples, with data generated by a number of laboratories using a variety of microarray platforms. This study showed that one can estimate the “lab effect,” which encompasses differences in sites, platforms, and protocols and, in doing so, arrive at estimates of gene expression that can be compared among laboratories. Finally, the Toxicogenomics Research Consortium (Bammler et al. 2005) reported that standardization of laboratory and data analysis protocols resulted in a dramatic increase in concordance among the results different laboratories obtained. Independently, these three groups arrived at the general conclusion that, if experiments are done and analyzed carefully and systematically, the results are quite reproducible and provide insight into the underlying biology driving the systems being analyzed. Although toxicogenomic studies typically rely on technologies that generate large amounts of data, results are often confirmed and replicated with lower throughput assays. For example, differential gene expression detected with more global approaches is often verified by qRT-PCR analysis. The utility of these lower throughput approaches goes beyond validation.
A subset of genes analyzed by qRT-PCR may exhibit sensitivity and specificity comparable to global transcriptomic analyses with microarrays. Relatively small sets of marker genes that represent more complex gene expression patterns may be of considerable value in toxicogenomics.
DNA Microarray Technology
Microarray technology (Figure 2-3) fundamentally advanced biology by enabling the simultaneous analysis of all transcripts in a system. This capability for simultaneous, global analysis is emblematic of the new biology in the genomic era and has become the standard against which other global analysis technologies are judged. DNA microarrays contain collections of oligonucleotide sequences located in precise locations in a high-density format. Two complementary DNA (cDNA) microarray formats have come to dominate the field. Spotted microarrays are prepared from synthesized cDNAs or oligonucleotide probes that are printed on a treated glass slide surface in a high-density format. These spotted arrays were the first widely used DNA microarrays (Schena et al. 1995, 1996) and were originally printed in individual investigators’ laboratories
FIGURE 2-3 An overview of DNA microarray analysis. (A) In two-color analysis approaches, RNA samples from patient and control samples are individually labeled with distinguishable fluorescent dyes and cohybridized to a single DNA microarray consisting of individual gene-specific probes. Relative gene expression levels in the two samples are estimated by measuring the fluorescence intensities for each arrayed probe; a sample expression vector summarizing the expression level of each gene in the patient sample (relative to the control) is reported. (B) Single-color analysis, such as that using the Affymetrix GeneChip, hybridizes labeled RNA from each biologic sample to a single array in which a series of perfectly matched (PM) gene-specific probes are arrayed. Gene expression levels are estimated by measuring hybridization intensities for each probe, and background is measured by using a corresponding set of mismatched (MM) probes. For each array technology, gene expression levels are reported for each sample as a “sample expression vector” summarizing the difference between signal and background for each gene. (C) The data from each gene in each sample are collected and these “sample expression vectors” are assembled into a single “expression matrix.” Each column in the expression matrix represents an individual sample and its measured expression levels for each gene (the sample expression vector); each row represents a gene and its expression levels across all samples (a “gene expression vector”). The expression matrix is often visualized by presenting a colored matrix (typically red/green, although other combinations such as blue/yellow are now common). (D) An unordered dataset, subjected to average linkage hierarchical clustering or k-means clustering, reveals underlying patterns that can help identify classes in the dataset.
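Panel D of the figure refers to average linkage hierarchical clustering of the expression matrix. The sketch below shows one way the idea could be implemented for gene expression vectors; the data are hypothetical, Euclidean distance is an assumption (correlation-based distances are also widely used), and the function stops at a fixed number of clusters instead of building a full dendrogram.

```python
from itertools import combinations
from math import dist


def average_linkage(vectors, n_clusters):
    """Agglomerative clustering with average linkage and Euclidean distance.

    Returns a list of clusters, each a sorted list of row indices.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average linkage: mean distance over all cross-cluster pairs.
            pair_distances = [
                dist(vectors[i], vectors[j])
                for i in clusters[a]
                for j in clusters[b]
            ]
            avg = sum(pair_distances) / len(pair_distances)
            if best is None or avg < best[0]:
                best = (avg, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # safe because a < b
    return clusters


# Hypothetical gene expression vectors (log ratios across four samples).
expression_matrix = {
    "gene_a": [2.0, 1.8, -0.1, 0.0],
    "gene_b": [1.9, 2.1, 0.1, -0.2],
    "gene_c": [-1.5, -1.7, 0.2, 0.1],
    "gene_d": [-1.6, -1.4, 0.0, 0.3],
}
names = list(expression_matrix)
clusters = average_linkage([expression_matrix[n] for n in names], n_clusters=2)
named_clusters = [[names[i] for i in cluster] for cluster in clusters]
print(named_clusters)
```

Here the two genes induced in the first two samples group together, as do the two repressed genes. Passing the columns of the matrix instead of the rows would cluster samples by the same procedure.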
increases in sensitivity, but this approach is not favored because of problems associated with the use of radioactivity and efficiency of analysis. Affymetrix and other major commercial vendors (Agilent, GE Healthcare [formerly Amersham], and Applied Biosystems) currently offer several different microarrays corresponding to essentially all known genes and transcripts for human as well as similar microarray products for model organisms used in toxicity studies. In addition, Affymetrix also offers whole-genome microarrays for application to SNP mapping and detection (see above).
PROTEOMIC TECHNOLOGIES
Proteomics is the study of proteomes, which are collections of proteins in living systems. Because proteins carry out most functions encoded by genes, analysis of the protein complement of the genome provides insights into biology that cannot be drawn from studies of genes and genomes. MS, gene and protein sequence databases, protein and peptide separation techniques, and novel bioinformatics tools are integrated to provide the technology platform for proteomics (Yates 2000; Smith 2002; Aebersold and Mann 2003). In contrast to “the genome,” there is no single, static proteome in any organism; instead, there are dynamic collections of proteins in different cells and tissues that display moment-to-moment variations in response to diet, stress, disease processes, and chemical exposures. There is no technology analogous to PCR amplification of nucleic acids that can amplify proteins, so they must be analyzed at their native abundances, which span more than six orders of magnitude. Each protein may be present in multiple modified forms; indeed, variations in modification status may be more critical to function than absolute levels of the protein per se (Mann and Jensen 2003).
A related problem is the formation of protein adducts by reactive chemical intermediates generated from toxic chemicals and endogenous oxidative stress (Liebler et al. 2003). Protein damage by reactive chemical intermediates may also perturb endogenous regulatory protein modifications. All these characteristics add to the challenge of proteome analysis. In contrast to the microarray technologies applied to gene expression, most analytical proteomic methods represent elaborate serial analyses rather than truly parallel technologies.

Gel-Based Proteomics

Two major approaches used are gel-based proteomics and “shotgun” proteomics (see Figure 2-4). In the gel-based approach, proteins are resolved by electrophoresis or another separation method and protein features of interest are selected for analysis. This approach is best represented by the use of two-dimensional sodium dodecylsulfate polyacrylamide gel electrophoresis (2D-SDS-PAGE) to separate protein mixtures, followed by selection of spots, and identification of the proteins by digestion to peptides, MS analysis, and database searching. Gel-based analyses generate an observable “map” of the proteome analyzed, although liquid separations and software can be coupled to achieve analogous results. Reproducibility of 2D gel separations has dramatically improved with the introduction of commercially available immobilized pH gradient strips and precast gel systems (Righetti and Bossi 1997). Comparative 2D-SDS-PAGE with differential fluorescent labeling (for example, differential gel electrophoresis, DIGE) offers powerful quantitative comparisons of proteomes (Tonge et al. 2001; Von Eggeling et al. 2001). Moreover, modified protein forms are often resolved from unmodified forms, which enables separate characterization and quantitative analysis of each. Although 2D gels have been applied most commonly to global analyses of complex proteomes, they have great potential for comparative analyses of smaller subproteomes (for example, multiprotein complexes). Problems with gel-based analyses stem from the poor separation of proteins with extreme physical characteristics, such as hydrophobic membrane proteins. A major problem is the limited dynamic range for protein detection by staining (200- to 500-fold), whereas protein abundances vary more than a millionfold (Gygi et al. 2000). This means that abundant proteins tend to preclude the detection of less abundant proteins in complex mixtures. This problem is not unique to gel-based approaches.

FIGURE 2-4 Schematic representation of 2D gel-based proteome analysis and shotgun proteome analysis. LC, liquid chromatography; TOF, time of flight; MS-MS, tandem mass spectrometry.
Shotgun Proteomics

Shotgun proteomic analysis is somewhat analogous to the genome sequencing strategy of the same name. Shotgun analyses begin with direct digestion of protein mixtures to complex mixtures of peptides, which then are analyzed by liquid-chromatography-coupled mass spectrometry (LC-MS) (Yates 1998). The resulting collection of peptide tandem mass spectrometry (MS-MS) spectra is searched against databases to identify corresponding peptide sequences and then the collection of sequences is reassembled using computer software to provide an inventory of the proteins in the original sample mixture. A key advantage of shotgun proteomics is its unsurpassed performance in the analysis of complex peptide mixtures (Wolters et al. 2001; Washburn et al. 2002). Peptide MS-MS spectra are acquired by automated LC-MS-MS analyses in which ions corresponding to intact peptides are automatically selected for fragmentation to produce MS-MS spectra that encode peptide sequences (Stahl et al. 1996). This approach enables automated analyses of complex mixtures without user intervention. However, selection of peptide ions for MS-MS fragmentation is based on the intensity of the peptide ion signals, which favors acquisition of MS-MS spectra from the most abundant peptides in a mixture. Thus, detection of low-abundance peptides in complex mixtures is somewhat random. Application of multidimensional chromatographic separations (for example, ion exchange and then reverse-phase high-performance liquid chromatography) “spreads out” the peptide mixture and greatly increases the number of peptides for which MS-MS spectra are acquired (Link et al. 1999; Washburn et al. 2001; Wolters et al. 2001). This increases the detection of less abundant proteins and modified protein forms (MacCoss et al. 2002) (see below).
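The bias of intensity-based selection toward abundant peptides, and the benefit of spreading the mixture across fractions, can be illustrated with a minimal sketch. The m/z values, intensities, the `top_n` limit, and the way ions are divided into fractions are all hypothetical; real instruments apply additional selection rules that are not modeled here.

```python
def select_precursors(survey_scan, top_n):
    """Return the m/z values of the top_n most intense ions in one survey scan."""
    ranked = sorted(survey_scan, key=lambda ion: ion[1], reverse=True)
    return [mz for mz, _ in ranked[:top_n]]

# Hypothetical peptide ions as (m/z, intensity) pairs
mixture = [(445.12, 9.0e6), (512.77, 4.0e4), (623.30, 2.5e6),
           (701.45, 8.0e3), (815.92, 6.0e5)]

# Unfractionated: only the two most intense ions are fragmented
selected_whole = select_precursors(mixture, top_n=2)

# Hypothetical result of an upstream separation that splits the ions into two fractions
fractions = [
    [(445.12, 9.0e6), (512.77, 4.0e4), (701.45, 8.0e3)],
    [(623.30, 2.5e6), (815.92, 6.0e5)],
]
selected_fractionated = sorted(
    mz for fraction in fractions for mz in select_precursors(fraction, top_n=2)
)

print(selected_whole)
print(selected_fractionated)
```

In this toy case the same per-scan limit yields MS-MS selection of four ions after fractionation instead of two, although the least intense ion is still missed.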
New hybrid linear ion trap-tandem MS instruments offer more rapid acquisition of MS-MS spectra and more accurate measurement of peptide ion mass-to-charge ratio values, which provides more identifications with greater reliability. Nevertheless, a continuing challenge of shotgun proteome analyses is the identification of less abundant proteins and modified protein forms.

Quantitative Proteomics

The application of quantitative analyses has become a critical element of proteome analyses. Quantitative methods have been developed for application to both gel-based and shotgun proteomic analyses. The most effective quantitative approach for gel-based analyses is DIGE (see above), which involves using amine- or thiol-reactive fluorescent dyes that tag protein samples with different fluorophores for analysis on the same gel. This approach eliminates gel-to-gel variations inherent in comparing spots from individual samples run on different gels. The use of a separate dye and mixed internal standards allows gel-to-gel comparisons of DIGE analyses for larger studies and enables reliable statistical comparisons (Alban et al. 2003). Quantitative shotgun proteome analyses have been done with stable isotope tags, which are used to derivatize functional groups on proteins (Julka and Regnier 2004). Stable isotope tagging is usually used in paired experimental designs, in which the relative amounts of a protein or protein form are measured rather than the absolute amount in a sample. The first of these to be introduced were the thiol-reactive isotope-coded affinity tag reagents (Gygi et al. 1999), which have been further developed to incorporate solid-phase capture and labeling (Zhou et al. 2002). These reagents are available in “heavy” (for example, 2H- or 13C-labeled) and “light” (for example, 1H- or 12C-labeled) forms. Analysis of paired samples labeled with the light and heavy tags allows relative quantitation by comparing the signals for the corresponding light and heavy ions. Other tag chemistries that target peptide N and C termini have been developed and have been widely applied (Julka and Regnier 2004). An alternative approach to tagging proteins and peptides is to incorporate stable isotope labels through metabolic labeling of proteins during synthesis in cell culture (Ong et al. 2002). Quantitative proteomic approaches are applicable not only to comparing amounts of proteins in samples but also to kinetic studies of protein modifications and abundance changes as well as to identification of protein components of multiprotein complexes as a function of specific experimental variables (Ranish et al. 2003). Major limitations of the isotope-tagging approaches described above include the requirement for chemical modification of the samples (derivatization) or metabolic incorporation of isotope label and the need to perform quantitative analyses by pairwise comparison. Recent work has demonstrated that quantitative data from LC-MS full-scan analyses of intact peptides (W. Wang et al.
2003) and data from MS-MS spectra acquired from peptides are proportional to the peptide concentration in mixtures (Gao et al. 2003; Liu et al. 2004). This suggests that survey-level quantitative comparisons between any samples analyzed under similar conditions may be possible. Finally, the use of stable isotope dilution LC-MS-MS analysis provides a method for absolute quantification of individual proteins in complex samples (Gerber et al. 2003). Stable-isotope-labeled standard peptides that uniquely correspond to proteins or protein forms of interest are spiked into proteolytic digests from complex mixtures, and the levels of the target protein are measured relative to the labeled standard. This approach holds great potential for targeted quantitative analysis of candidate biomarkers in biologic fluids (Anderson et al. 2004).

Bioinformatic Tools for Proteomics

A hierarchy of proteomic data is rooted in MS and MS-MS spectra (Figure 2-5) and includes identification and quantitation of proteins and peptides and their modified forms, including comparisons across multiple experiments, analyses, and datasets. A key element of MS-based proteomic platforms is the identification of peptide and protein sequences from MS data. This task is accomplished with a variety of algorithms and software (“bioinformatics” tools) that search protein and nucleotide sequence databases (Fenyo 2000; Sadygov et al. 2004; MacCoss 2005). Measured peptide masses from MALDI-MS spectra of tryptic peptide digests can be searched against databases to identify the corresponding proteins (Perkins et al. 1999). This peptide mass fingerprinting approach works best with relatively pure protein samples. The most widely used and most effective approach is to search uninterpreted MS-MS spectra against database sequences with algorithms and software, such as Sequest, Mascot, and X!Tandem (Eng et al. 1994; Perkins et al. 1999; Craig and Beavis 2004). These algorithms match all spectra to some sequence and provide scores or probability assessments of the quality of the matches. Nevertheless, choosing which matches to accept involves a trade-off between missed identifications (low sensitivity) and false-positive identifications (poor specificity) (Nesvizhskii and Aebersold 2004). A second tier of bioinformatic tools evaluates outputs from database search algorithms and estimates probabilities of correct protein identifications (Keller et al. 2002; Nesvizhskii et al. 2003). Other software applications have been developed to enable the identification of modified peptide forms from MS-MS data, even when the chemical nature and amino acid specificity of the modification cannot be predicted (Hansen et al. 2001, 2005; Liebler et al. 2002, 2003). A key issue in proteomics is the standardization of data analysis methods and data representation and reporting formats. A fundamental problem is the variety of MS instruments, data analysis algorithms, and software used in proteomics.
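The peptide mass fingerprinting idea described in this section can be sketched in a few lines. This is a simplified illustration, not the algorithm used by any of the named search tools: the two-protein database and its sequences are hypothetical, the cleavage rule (after K or R, not before P, no missed cleavages) is a common simplification, neutral monoisotopic masses are compared rather than ion m/z values, and the "measured" masses are simulated by adding a small offset to the theoretical values.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565

def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P; no missed cleavages."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER

def fingerprint_score(measured, sequence, tolerance=0.01):
    """Count measured masses that match any theoretical tryptic peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in measured)

# Hypothetical database and simulated measurements (protein_X digest + 0.002 Da error)
database = {"protein_X": "MKTAYIAKQR", "protein_Y": "GASPVKLLER"}
measured = [peptide_mass(p) + 0.002 for p in tryptic_peptides(database["protein_X"])]

scores = {name: fingerprint_score(measured, seq) for name, seq in database.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The tolerance parameter makes the sensitivity and specificity trade-off concrete: widening it recovers more true matches from noisy data but also admits more chance matches.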
This variety of instruments and software generates many different data types that describe proteins and peptides and their modifications. Proposals for common representations of MS and proteomic data have been published recently (Taylor et al. 2003; Craig et al. 2004; Pedrioli et al. 2004). In addition, draft criteria for data reporting standards for publication in proteomic journals are under consideration (e.g., Bradshaw 2005a). Another useful development is the emerging collection of databases of matched peptide and protein sequences and corresponding spectral data that define them (Craig et al. 2004; Desiere et al. 2005, 2006). Another important, but unresolved, issue concerns the differences in protein and peptide identifications attributable to the use of different database search algorithms. Because different database search software packages are sold with different MS instruments (for example, Sequest is licensed with Thermo ion trap MS instruments), differences in performance of the algorithms are difficult to separate from differences in characteristics of the instruments. Another issue in comparing the performance of different database searching software is wide variation in identifications due to variation in criteria used to filter the search results (Peng et al. 2003; Elias et al. 2005). This situation will be improved somewhat by adopting standards for reporting false-positive identification rates (Bradshaw 2005b). Although the efforts described above represent useful steps in information sharing and management, the diversity of instrumentation, analytical approaches, and available data analysis tools will make standardization of informatics an ongoing challenge.

FIGURE 2-5 A hierarchy of proteomic data is rooted in MS and MS-MS spectra (level 1) and includes outputs of database search analyses and related data reduction (level 2), integrated information about single proteins (level 3), and information about groups of proteins or proteomes across multiple experiments (level 4).

Proteome Profiling

Another type of proteome analysis that has attracted widespread interest is proteome profiling, in which MALDI time-of-flight (MALDI-TOF) MS is used to acquire a spectral profile of a tissue or biofluid sample (for example, serum) (Chaurand et al. 1999; Petricoin et al.
2002a,b; Villanueva et al. 2004). The signals in these spectra represent intact proteins or protein fragments and collectively reflect the biologic state of the system but, in profiling (compared with approaches described above), the overall pattern rather than identification of specific proteins or protein fragments is the focus. Analyses with high-performance MALDI-TOF instruments can generate spectral profiles containing hundreds to thousands of signals. These typically correspond to the most abundant, lower molecular weight (<25 kilodaltons) components of proteomes. Machine learning approaches have been used to identify sets of spectral features that can classify samples based on spectral patterns (Baggerly et al. 2004, 2005; Conrads et al. 2004). This approach has attracted considerable interest as a potential means of biomarker discovery for early detection of diseases, particularly cancers, as well as drug toxicity. Despite intense interest, proteome profiling studies have created considerable controversy due to problems with lab-to-lab reproducibility of the marker sets identified, a lack of identification of the proteins corresponding to the marker signals, and artifacts in data generation and analysis (Diamandis 2004; Baggerly et al. 2005). In addition, studies that have identified some of the marker species have shown that they typically are proteolysis products of abundant blood proteins (Marshall et al. 2003), which raises questions about the biologic relationship of the markers to the disease processes under study. Biofluid proteome profiling remains an attractive, if unproven, approach to biomarker discovery. Nevertheless, methods of instrumental and data analysis are rapidly evolving in this field, and the applicability of this approach should be better substantiated within the next 2-3 years.

New MS Instrumentation and Related Technology for Proteomics

Despite impressive advances over the past 15 years, MS instrumentation for proteomics is limited in the numbers of peptides or proteins that can be identified and in the quality of the data generated. New hybrid tandem MS instruments that couple rapid-scanning linear ion trap analyzers with Fourier transform ion cyclotron resonance (FTICR), high-resolution ion trap, and TOF mass analyzers offer both high mass accuracy measurements of peptide ions and rapid scanning acquisition of MS-MS spectra (Syka et al. 2004a; Hu et al. 2005). This improves the fidelity of identification and the mapping of modifications (Wu et al. 2005). New methods for generating peptide sequence data, such as electron transfer dissociation (Syka et al. 2004b), can improve the mapping of posttranslational modifications and chemical adducts. An important emerging application of FTICR instrumentation is the tandem MS analysis of intact proteins, which is referred to as “top-down” MS analysis (Ge et al. 2002; Kelleher 2004). This approach can generate near-comprehensive sequence analysis of individual protein molecular forms, thus enabling sequence-specific annotation of individual modification variants (Pesavento et al.
2004; Coon et al. 2005). A limitation of the approach is the requirement for relatively purified proteins and larger amounts of samples than are used in shotgun analyses. However, rapid technology development will make top-down methods increasingly useful for targeted analyses of individual proteins and their modified forms.

Non-MS-Based Proteomic Approaches

Non-MS-based technologies have been applied to proteome analyses, but they have not proven to be as robust and versatile as MS-based methods. Microarray technology approaches include antibody microarrays, in which immobilized antibodies recognize proteins in complex mixtures (de Wildt et al. 2000; Miller et al. 2003; Olle et al. 2005). Although straightforward in principle, this approach has not proven robust and reliable for several reasons. Monospecific antibodies with high affinity for their targets are difficult to obtain and they often lose activity when immobilized. Because arrays must be probed under native conditions, antibodies may capture multiprotein complexes as well as individual proteins, which complicates interpretation. Short strands of chemically synthesized nucleic acid (aptamers) have been studied as potential monospecific recognition molecules for microarrays, and this technology may eventually overcome some of the problems with antibody arrays (Smith et al. 2003; Kirby et al. 2004). “Reversed-phase” microarrays, which consist of multiple samples of protein mixtures (for example, tissues, cell lysates), are probed with individual antibodies (Paweletz et al. 2001; Janzi et al. 2005). This establishes the presence of the target protein in multiple samples rather than the presence of multiple proteins in any sample. As with antibody microarrays, the main limitations of this approach stem from the quality and availability of antibodies for the targets of interest. Microarrays of expressed proteins or protein domain substructures have been probed with tagged proteins or small molecules to identify protein binding partners or small molecule ligand sites or to conduct surveys of substrates for enzymes (for example, kinases) (Zhu et al. 2001; Ramachandran et al. 2004). This approach is directed to functional analysis of known proteins as opposed to identification and analysis of the components of complex mixtures. A related technique directed at the study of protein-protein interactions is surface plasmon resonance (SPR) (Liedberg et al. 1995; Homola 2003; Yuk and Ha 2005). This technology allows real-time measurements of protein binding affinities and interactions. In common usage, SPR is used to study single pairs of interacting species. However, recent adaptations of SPR allow direct analysis of protein-protein interactions in microarray format (Yuk et al. 2004).

METABOLOMIC TECHNOLOGIES

Metabolomics¹ is the analysis of collections of small molecule intermediates and products of diverse biologic processes.
Metabolic intermediates reflect the actions of proteins in biochemical pathways and thus represent biologic states in a way analogous to proteomes. As with proteomes, metabolomes are dynamic and change in response to nutrition, stress, disease states, and even diurnal variations in metabolism. Unlike genomes, transcriptomes, and proteomes, metabolomes comprise a chemically diverse collection of compounds, which range from small peptide, lipid, and nucleic acid precursors and degradation products to chemical intermediates in biosynthesis and catabolism as well as metabolites of exogenous compounds derived from the diet, environmental exposures, and therapeutic interventions. A consequence of the chemical diversity of metabolome components is the difficulty of comprehensive analysis with any single analytical technology.

¹ Although some scientists attempt to distinctly define the terms metabolomics and metabonomics, the committee uses the term metabolomics throughout the report simply because it is used more frequently in the literature.
NMR-Based Metabolomics

The principal technology platforms for metabolomics are NMR spectroscopy and gas chromatography MS (GC-MS) or LC-MS. NMR has been the dominant technology for metabolomic studies of biofluids ex vivo. High-field (600 MHz) 1H-NMR spectra of urine contain thousands of signals representing hundreds to thousands of metabolites (Nicholson et al. 2002). Hundreds of analytes have been identified in such spectra and collectively represent a plurality of metabolic processes and networks from multiple organs and tissues. Although NMR has been most commonly applied to urine samples, similar analyses of intact solid tissues were accomplished with the use of magic angle spinning 1H-NMR (Waters et al. 2000; Nicholson et al. 2002; Y. Wang et al. 2003). Although it is possible to establish the identity of many, but not all, of the peaks in NMR spectra of urine and biofluids, the value of the data has been in the analyses of collections of spectral signals. These pattern recognition approaches have been used to identify distinguishing characteristics of samples or sample sets. Unsupervised² analyses of the data, such as principal components analysis (PCA), have proven useful for grouping samples based on sets of similar features (Beckwith-Hall et al. 1998; Holmes et al. 1998). These similar features frequently reflect chemical similarity in metabolite composition and thus similar courses of response to toxicants. Supervised analyses allow the use of data from biochemically or toxicologically defined samples to establish models capable of classifying samples based on multiple features in the spectra (Stoyanova et al. 2004). NMR-based metabolomics of urine measures global metabolic changes that have occurred throughout an organism. However, metabolite profiles in urine can also indicate tissue-specific toxicities.
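The use of PCA to group samples by spectral features can be sketched as follows. This is a minimal illustration under stated assumptions: the six samples and their four binned spectral intensities are hypothetical, only the first principal component is computed, and a simple power iteration on the covariance matrix stands in for the full decomposition that dedicated statistical software would perform.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (loadings, scores) of PC1 for mean-centered data via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the spectral bins (p x p)
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Non-uniform start vector to avoid starting orthogonal to PC1
    v = [float(j + 1) for j in range(p)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Hypothetical binned NMR intensities: three control and three treated samples
spectra = [
    [10.0, 5.0, 3.0, 1.0],
    [10.2, 5.1, 2.9, 1.0],
    [9.8, 4.9, 3.1, 1.1],
    [4.0, 5.0, 7.0, 1.0],
    [4.2, 5.2, 7.1, 0.9],
    [3.9, 4.8, 6.8, 1.0],
]

loadings, scores = first_principal_component(spectra)
for label, score in zip(["control"] * 3 + ["treated"] * 3, scores):
    print(label, round(score, 2))
```

No group labels are used in the calculation, yet the control and treated samples fall on opposite sides of the first component. The sign of the component is arbitrary, so only the separation, not the direction, is meaningful.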
PCA of urinary NMR data has shown that the development and resolution of chemically induced tissue injury can be followed by plotting trajectories of PCA-derived parameters (Azmi et al. 2002). Although the patterns themselves provide a basis for analyses, some specific metabolites have also been identified (based on their resonances in the NMR spectra). Mapping these metabolites onto known metabolic pathways makes it possible to draw inferences about the biochemical and cellular consequences and mechanisms of injury (Griffin et al. 2004). An interesting and important consequence was the identification of endogenous bacterial metabolites as key elements of diagnostic metabolomic profiles (Nicholls et al. 2003; Wilson and Nicholson 2003; Robosky et al. 2005). Although the interplay of gut bacteria with drug and chemical metabolism had been known previously, recent NMR metabolomic studies indicate that interactions between host tissues and gut microbes have a much more pronounced effect on susceptibility to injury than had been appreciated previously (Nicholson et al. 2005).

² Unsupervised analysis methods look for patterns in the data without using previous knowledge about the data, such as information about treatment or classification, whereas supervised methods use this knowledge. See Chapter 3 for more details.
A critical issue in the application of metabolomics is the standardization of methods, data analysis, and reporting across laboratories. A recent cooperative study by the Consortium for Metabonomic Toxicology indicated that NMR-based technology is robust and reproducible in laboratories that follow similar analytical protocols (Lindon et al. 2003). Investigators in the field recently have agreed on consensus standards for analytical standardization and data representation in metabonomic analyses (Lindon et al. 2005a).

MS-Based Metabolomics

MS-based analyses offer an important alternative approach to metabolomics. The greatest potential advantage of MS-based methods is sensitivity. MS analyses can detect molecules at levels up to 10,000-fold lower than does NMR (Brown et al. 2005; Wilson et al. 2005a). Both GC-MS and LC-MS approaches have been used, although limits of volatility of many metabolites reduce the range of compounds that can be analyzed successfully with GC-MS. LC-MS analyses are done with both positive and negative ion electrospray ionization and positive and negative chemical ionization. These four ionization methods provide complementary coverage of diverse small molecule chemistries. The principal mode of analysis is via “full scan” LC-MS, in which the mass range of the instrument is repeatedly scanned (Plumb et al. 2002; Wilson et al. 2005b). This analysis records the mass-to-charge ratios and retention times of metabolites. Because most small molecules produce singly charged ions, the analyses provide molecular weights of the metabolites. Analysis of standards in the same system and the use of MS-MS analysis can establish the identity of the components of interest. However, apparent molecular weight measurement alone is often insufficient to generate candidate metabolite identifications; frequently, hundreds or thousands of molecules are being analyzed.
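The relationship between a measured mass-to-charge ratio and candidate metabolite identities can be sketched as follows. The assumptions here are illustrative: ions are taken to be singly protonated in positive mode and singly deprotonated in negative mode, the five-compound library is hypothetical and far smaller than any real metabolite database, and the 10 ppm tolerance and observed m/z values are chosen for the example.

```python
import re

# Monoisotopic masses (Da) of the elements used in this small example
ELEMENT_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727646

def monoisotopic_mass(formula):
    """Sum element masses for a simple formula such as 'C6H12O6'."""
    return sum(ELEMENT_MASS[element] * int(count or 1)
               for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula))

def neutral_mass(mz, mode):
    """Assume a singly charged [M+H]+ (positive) or [M-H]- (negative) ion."""
    return mz - PROTON if mode == "positive" else mz + PROTON

def candidates(mz, mode, library, ppm=10.0):
    """Names of library compounds whose mass lies within the ppm tolerance."""
    mass = neutral_mass(mz, mode)
    return sorted(name for name, formula in library.items()
                  if abs(monoisotopic_mass(formula) - mass) / mass * 1e6 <= ppm)

# Hypothetical library of metabolite formulas
library = {
    "glucose": "C6H12O6",
    "creatinine": "C4H7N3O",
    "hippuric acid": "C9H9NO3",
    "leucine": "C6H13NO2",
    "isoleucine": "C6H13NO2",
}

print(candidates(114.0662, "positive", library))
print(candidates(132.1019, "positive", library))
```

The first query returns a single candidate, whereas the second returns both leucine and isoleucine, which share an elemental formula. Cases like the second are where retention time, standards, or MS-MS data are needed to settle the identification.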
Nevertheless, accurate information about molecular weight, where possible, is of great value in identification. For this reason LC-MS metabolomic analyses are most commonly done with higher mass accuracy MS instruments, such as LC TOF, quadrupole TOF, and FTICR MS instruments (Wilson et al. 2005a,b). NMR- and MS-based approaches provide complementary platforms for metabolomic studies and an integration of these platforms will be needed to provide capabilities that are most comprehensive. Clearly, either platform can detect metabolite profile differences sufficient to distinguish different toxicities. What is not yet clear is the degree to which either approach can resolve subtly different phenotypes.

TECHNOLOGY ADVANCEMENT AND ECONOMY OF SCALE

A major determinant of success in genome sequencing projects was achieving economy of scale for genome sequencing technologies (see above). The successful implementation of large-scale toxicogenomic initiatives will require advances in standardization and economy of scale. Most proteomic and metabolomic analyses in academic and industry laboratories are done on a small scale to address specific research questions. The evolution of transcriptome profiling and proteomic and metabolomic technology platforms to increase standardization and reduce costs will be essential to maximize their impact.