10 Sample and Data Collection and Analysis

As discussed in previous chapters, toxicogenomic studies face significant challenges in the areas of validation, data management, and data analysis. Foremost among these challenges is the ability to compile high-quality data in a format that can be freely accessed and reanalyzed. These challenges can best be addressed within the context of consortia that generate high-quality, standardized, and appropriately annotated compendia of data. This chapter discusses issues related to the role of private and public consortia in (1) sample acquisition, annotation, and storage; (2) data generation, annotation, and storage; (3) repositories for data standardization and curation; (4) integration of data from different toxicogenomic technologies with clinical, genetic, and exposure data; and (5) issues associated with data transparency and sharing. Although this chapter touches on how these issues relate to risk assessment and the associated ethical, legal, and social implications, other chapters of this report discuss these aspects of toxicogenomics in greater detail.

LEVERAGING SAMPLES AND INFORMATION IN POPULATION-BASED STUDIES AND CLINICAL TRIALS FOR TOXICOGENOMIC STUDIES

Because the potential of toxicogenomic studies is to improve human risk assessment, it is imperative that susceptible human populations be studied with toxicogenomics. Given the hundreds of millions of dollars already spent on clinical trials, environmental cohort studies, and measurements of human exposures to environmental chemicals, it is imperative that toxicogenomic studies make the best possible use of the human population and clinical resources already in hand.
This prospect raises many questions. How does the research community ensure that appropriate cohorts and sample repositories are available once the technologies have matured to the point of routine application? What are the limitations of using existing cohorts for toxicogenomic studies? Given the limitations of existing cohorts with respect to the informed consent, sample collection, and data formats, how should future population-based studies be designed? What is the ideal structure of a consortium that would assemble such a cohort, collect the necessary samples and data, and maintain repositories? Consideration of these questions begins with an examination of the current state of affairs.

Current Practices and Studies

Central to all population-based studies are identifying suitable cohorts, recruiting participants, obtaining informed consent, and collecting the appropriate biologic samples and associated clinical and epidemiologic data. Collecting, curating, annotating, processing, and storing biologic samples (for example, blood, DNA, urine, buccal swabs, histologic sections) and epidemiologic data for large population-based studies is a labor-intensive, expensive undertaking that requires a significant investment in future research by the initial study investigators. As a result, study investigators have reasonable concerns about making biologic samples available for general use by others in the scientific community. With the notable exception of immortalized lymphocyte cell lines that provide an inexhaustible resource from the blood of study participants, biologic specimens represent a limited resource and researchers often jealously guard unused samples for future use once the "ideal" assays become available.
It is not unusual for institutions with sizeable epidemiology programs to have hundreds of thousands of blood, serum, lymphocyte, and tumor samples in storage that are connected to clinical data as well as information on demographics, lifestyle, diet, behavior, and occupational and environmental exposures. In addition to cohorts and samples collected as part of investigator-initiated studies and programs, sample repositories have also been accrued by large consortia and cooperative groups sponsored by the public and private sectors.

An example of publicly sponsored initiatives in the area of cancer research and national and international sample and data repositories is the National Cancer Institute (NCI)-sponsored Southwest Oncology Group (SWOG), one of the largest adult cancer clinical trial organizations in the world. The SWOG membership and network consist of almost 4,000 physicians at 283 institutions throughout the United States and Canada. Since its inception in 1956, SWOG has enrolled and collected samples from more than 150,000 patients in clinical trials.

Examples in the area of cancer prevention are the NCI-sponsored intervention trials such as the CARET (Beta-Carotene and Retinol Efficacy Trial) study for prevention of lung cancer, and the Shanghai Breast Self Exam study conducted in factory workers in China. CARET was initiated in 1983 to test the hypothesis
that antioxidants, by preventing DNA damage from free radicals present in tobacco smoke, might reduce the risk of lung cancer. CARET involved more than 18,000 current and former male and female smokers, as well as males exposed to asbestos, assigned to placebo and treatment groups with a combination of beta-carotene and vitamin A. Serum samples were collected before and during the intervention, and tumor samples continue to be accrued. The randomized trial to assess the efficacy of breast self-exam in Shanghai accrued 267,040 current and retired female employees associated with 520 factories in the Shanghai Textile Industry Bureau. Women were randomly assigned on the basis of factory to a self-examination instruction group (133,375 women) or a control group that received instruction in preventing back injury (133,665 women) (Thomas et al. 1997). A large number of familial cancer registries have also been established for breast, ovarian, and colon cancer. There are comparable cooperative groups to study noncancer disease end points, such as heart disease, stroke, and depression.

Other examples include several large prospective, population-based studies designed to assess multiple outcomes. Participants provide baseline samples and epidemiologic data and are then followed with periodic resampling and updating of information. The prototype for this type of study is the Nurses' Health Study (NHS), established in 1976 with funding from the National Institutes of Health (NIH), with a second phase in 1989. The primary motivation for the NHS was to investigate the long-term consequences of the use of oral contraceptives. Nurses were selected because their education increased the accuracy of responses to technically worded questionnaires and because they were likely to be motivated to participate in a long-term study. More than 120,000 registered nurses were followed prospectively.
Every 2 to 4 years, cohort members receive a follow-up questionnaire requesting information about diseases and health-related topics including smoking, hormone use and menopausal status, diet and nutrition, and quality of life. Biologic samples collected included toenails, blood samples, and urine.

Another example of an ongoing trial to assess multiple disease end points is the NIH-funded Women's Health Initiative. Established in 1991, this study was designed to address the most common causes of death, disability, and poor quality of life in postmenopausal women, including cardiovascular disease, cancer, and osteoporosis. The study included a set of clinical trials and an observational study, which together involved 161,808 generally healthy postmenopausal women. The clinical trials were designed to test the effects of postmenopausal hormone therapy, diet modification, and calcium and vitamin D supplements as well as ovarian hormone therapies. The observational study had several goals, including estimating the extent to which known risk factors predict heart disease, cancers, and fractures; identifying risk factors for these and other diseases; and creating a future resource to identify biologic indicators of disease, especially substances found in blood and urine. The observational study enlisted 93,676 postmenopausal women whose health was tracked over an average of 8 years. Women who joined this study filled out periodic health forms and visited the research clinic 3 years after enrollment.

Sample/data repositories are also being assembled in the private sector, where pharmaceutical companies are increasingly eager to form cooperative groups or collaborations with academia and health care providers that provide access to patient cohorts for clinical trials and other population-based studies. In some cases, companies have assembled their own cohorts with samples purchased directly from medical providers. Large prospective cohorts have also been assembled, sampled, and observed by departments of health in countries with socialized medicine (for example, Sweden, Finland).

There are multiple examples of established cohort studies with samples that could be used in studies that evaluate the impact of environmental exposures. For example, the Occupational and Environmental Epidemiology Branch of the NCI (NCI 2006a) conducts studies in the United States and abroad to identify and evaluate environmental and workplace exposures that may be associated with cancer risk. Another example is the Agricultural Health Study (AHS 2006), a prospective study of nearly 90,000 farmers and their spouses in Iowa and North Carolina, carried out in collaboration with the National Institute of Environmental Health Sciences and the Environmental Protection Agency (EPA). These cohort studies involve sophisticated exposure assessments and mechanistic evaluations and include intensive collaborations among epidemiologists, industrial hygienists, and molecular biologists. These studies often involve collecting biologic samples to assess biologic effects from exposure to agricultural (for example, pesticides and herbicides), industrial, and occupational chemicals.
Another valuable resource is the National Health and Nutrition Examination Survey (NHANES) conducted by the Environmental Health Laboratory of the Centers for Disease Control and Prevention (CDC) at the National Center for Environmental Health. The goal of NHANES is to identify environmental chemicals in participants, quantify their levels, and determine how these amounts relate to health outcomes. The survey design calls for collecting blood, urine, and extensive epidemiologic data (demographic, occupational, lifestyle, dietary, and medical information) from people of all ages and from all areas of the country. As such, NHANES provides representative exposure data for the entire U.S. population. Rather than estimating exposures, NHANES measures the amounts of hundreds of chemicals or their metabolites in participants' blood and urine, using the most sensitive, state-of-the-art analytical techniques. The number of people sampled varies among compounds but is typically several hundred to thousands, which is sufficient for determining the range of exposures in the population; determining racial, age, and sex differences in exposure; detecting trends in exposures over time; and analyzing the efficacy of interventions designed to reduce exposures. In addition to the measurements of exposures, blood and urine are also used to measure biomarkers of nutrition and indicators of general health status. A report of the findings is published every 2 years.

In summary, academia, numerous government agencies, health care providers, and companies in the private sector have already invested tremendous
effort and resources into accruing cohorts for population-based studies to design better drugs, to predict responses to drug therapy, to assess the efficacy of drugs, to improve disease classification, to find genetic markers of disease susceptibility, to understand gene-environment interactions, and to assess the health effects of human exposures to environmental chemicals. These resources present an opportunity for developing partnerships to apply toxicogenomic technologies to ongoing and future collaborative projects.

Limitations and Barriers

The most expensive and arduous components of any population-based study are the collection of epidemiologic data and samples in a form that can be used to answer the largest number of questions in a meaningful way. Ideally, the same cohorts and samples should be available for multiple studies and reanalyses. However, for a wide variety of reasons, ranging from the types of samples collected to issues of informed consent, this is rarely possible given the structure of existing cohorts.

Structure of Cohorts

Given the cost and logistics of assembling, obtaining consent from, and following large cohorts, many studies are designed to address only a few specific questions. Typically, case-control or association studies are used to test a new hypothesis before testing is expanded and validated with robust and expensive population-based cross-sectional designs or prospective studies. As a result, many studies lack the statistical power to address additional hypotheses. For example, a case-control study designed to investigate the contribution of single nucleotide polymorphisms (SNPs) in a gene encoding a specific xenobiotic metabolizing enzyme in cancer will typically be underpowered to investigate the interactions of these SNPs with SNPs in other genes. Using larger cohorts that allow for a broader application would clearly be advantageous to future toxicogenomic studies.
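The loss of power described above can be illustrated with a rough calculation. The following is a minimal sketch, assuming a simple two-sided z-test that compares how often a variant is carried by cases and by controls; the carrier frequencies (20% and 26%) and group sizes are invented for illustration and do not come from any study cited in this chapter. The function name `two_proportion_power` is likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_power(p_controls, p_cases, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions.

    Uses the usual normal approximation and ignores the negligible
    probability of rejecting in the wrong direction.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    p_bar = (p_controls + p_cases) / 2
    # Standard error under the null hypothesis (pooled proportion).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    # Standard error under the alternative (separate proportions).
    se_alt = sqrt(
        (p_controls * (1 - p_controls) + p_cases * (1 - p_cases)) / n_per_group
    )
    effect = abs(p_cases - p_controls)
    return std_normal.cdf((effect - z_crit * se_null) / se_alt)


if __name__ == "__main__":
    # Illustrative assumption: the full study has 1,000 cases and 1,000
    # controls, but only 10% of each group falls in the subgroup needed
    # to examine a second gene.
    for label, n in [("full study", 1000), ("10% subgroup", 100)]:
        power = two_proportion_power(0.20, 0.26, n)
        print(f"{label}: n per group = {n}, approximate power = {power:.2f}")
```

Under these assumed numbers the full study is well powered for the primary comparison, whereas the same effect examined within a subgroup one-tenth the size would usually be missed. A formal interaction analysis would need a different model; the sketch only shows how quickly power falls as the usable sample shrinks.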
Heterogeneity of Disease or Response Classification

Another important aspect of any population-based study is accurate genotyping and precise phenotyping of diseases and responses. Whether a study is looking at genetic linkage, disease association, susceptibility, or responsiveness to environmental agents, any phenotypic or genotypic misclassification will weaken, or even obscure, the true association between genotype and phenotype. Some diseases, such as clinical depression, migraine headaches, and schizophrenia, are inherently difficult to diagnose and accurately classify at the clinical level. Likewise, an increasing number of molecular analyses of tumors indicate that significant heterogeneity exists even
among tumors with a similar histopathology and that these molecular differences can influence the clinical course of the disease and the response to therapy or chemoprevention (Potter 2004). The resulting inaccuracies in genotypic and phenotypic stratification of disease in cases can limit the utility of cohorts and their associated samples. However, increased stratification of disease based on genotype and molecular phenotype can also have an adverse effect by reducing the statistical power of a cohort to detect disease association or linkage. As the capacity to define homogeneous genotypes and phenotypes increases, the size of cohorts will need to be increased well above those used in present studies.

Sample Collection

A major impediment to the use of existing cohorts is that many studies have not collected appropriate types of specimens, or the available specimens are in a format that is not amenable to toxicogenomic studies. For example, the NCI funded the Cancer Genetic Network to identify and collect data on families at elevated risk for different forms of cancer. No provisions were made for collecting biologic specimens within the funded studies, and the samples that are collected are often inappropriate for genomic analysis. Traditionally, cancer cohort studies have collected formalin-fixed, paraffin-embedded samples of tumors and adjacent normal tissue. Although DNA for SNP-based genotyping can be extracted from such fixed specimens, albeit with some difficulty (e.g., Schubert et al. 2002), these samples usually do not yield representative mRNA appropriate for gene expression profiling.

Clinical trials and epidemiologic studies have not usually collected or analyzed DNA samples (Forand et al. 2001). On the other hand, most molecular epidemiology studies diligently collected blood or serum samples for biomarker analyses, but multiple challenges remain.
The way samples are collected, handled, shipped, and preserved varies greatly among studies. Many samples are flash frozen and hence are adequate for DNA extraction and genotyping, but such samples are usually limited in size and this may not allow for comprehensive genotyping. To deal with this limitation, some studies have developed transformed cell lines from human lymphocytes as a way to create inexhaustible supplies of DNA for genomic studies (e.g., Shukla and Dolan 2005). These cell lines could provide valuable experimental material to determine the impact of interindividual genetic variation on the response to compounds in in vitro studies. However, whether the results from studies with cell lines can be translated to human health effects remains to be determined.

In the case of the primary blood lymphocytes from study participants, unless steps were taken to preserve mRNA at the time of collection, the samples may have limited utility in gene expression profiling, although serum samples obviously could be used for proteomic and metabonomic studies. The NCI is currently funding a large initiative in the application of serum proteomics for early detection of disease (NCI 2006b). However, whether the methods of preservation currently used will allow for accurate analyses after many years of storage and to what extent different methods of sample preparation and storage affect the applicability of samples to proteomic and metabonomic analyses remain to be determined. In summary, there are numerous impediments to using existing samples available through cohort studies. There is both a need and an opportunity to standardize methodologies in ongoing studies and to design future studies to accommodate the requirements of new toxicogenomic platforms.

Data Uniformity

Another impediment to using existing cohorts for toxicogenomic applications is the lack of uniformity in data collection standards among population-based studies. Although there will always be a need to collect data specific to a particular study or application, much can be done to standardize questionnaires that collect demographic, dietary, occupational, and lifestyle data common to all studies. Moreover, efforts should be launched to develop a standardized vocabulary for all data types, including clinical data, which can be recorded in a digitized electronic format that can be searched with text-mining algorithms. An example of such a standardized approach is the attempt to standardize the vocabulary for histopathology within the Mouse Models of Cancer Consortium (NCI 2006c). The NCI launched a related effort, the cancer Biomedical Informatics Grid, to develop standardized tools, ontologies, and representations of data (NCI 2006d).

Sharing and Distributing Data

The biologic samples and data collected by clinical trials, epidemiologic studies, and human exposure studies funded by agencies such as NIH, the CDC, and the EPA represent a significant public scientific resource.
The full value of these studies for basic research and validation studies cannot be realized until appropriate data-sharing and data-distribution policies are formulated and supported by government, academia, and the private sector. Several NIH institutes such as the NCI and the National Heart, Lung, and Blood Institute (NHLBI) have drafted initial policies to address this need. These policies have several key features that promote optimal use of the data resources, while emphasizing the protection of data on human subjects. For example, under the NHLBI policy, study coordinators retain information such as identifiers linked to personal information (date of birth, study sites, physical exam dates). However, "limited access data," in which personal data have been removed or transformed into more generic forms (for example, age, instead of date of birth), can be distributed to the broader scientific community. Importantly, study participants who did not consent to their data being shared beyond the initial study investigators are not included in the dataset. In some cases, consent forms provide the option for participants to specify whether their data can
be used for commercial purposes. In these cases, it is important to discriminate between commercial purpose datasets and non-commercial purpose datasets to protect the rights of the human subjects.

Because it may be possible to combine limited access data with other publicly available data to identify particular study participants, those who want to obtain these datasets must agree in advance to adhere to established data-distribution agreements through the Institute, with the understanding that violation of the agreement may lead to legal action by study participants, their families, or the U.S. government. These investigators must also have approval of the institutional review board before they distribute the data.

Under these policies, it is the responsibility of the initial study investigators to prepare the datasets in such a way as to enable new investigators who are not familiar with the data to fully use them. In addition, documentation of the study must include data collection forms, study procedures and protocols, data dictionaries (codebooks), descriptions of all variable recoding, and a list of major study publications. Currently, NHLBI requests that datasets be provided in Statistical Analysis Software (SAS) format, together with program code for installing a SAS file from the SAS export data file. The NCI cancer Biomedical Informatics Grid (caBIG) initiative is exploring similar requirements.

Timing the release of limited access data is another major issue. Currently, policies differ across institutes and across study types. For large epidemiologic studies that require years of data to accumulate before analysis can begin, a period of 2 to 3 years after the completion of the final patient visit is typically allowed before release of the data. The interval is intended to strike a balance between the rights of the original study investigators and the wider scientific community.

Commercialization of repository data is another issue that could impede or enhance future studies. In Iceland, DNA samples from the cohort composed of all residents who agree to participate are offered as a commercial product for use in genetic studies. Based in Reykjavik, deCODE's product is genetic information linked, anonymously, to medical records for the country's 275,000 inhabitants. Iceland's population is geographically isolated and many residents share the same ancestors. In addition, family and medical records have been thoroughly recorded since the inception of the National Health Service in 1915. Icelanders also provide a relatively simple genetic background to investigate the genetics of disease. The population resource has proved its worth in studies of conditions caused by single defective genes, such as rare hereditary conditions including forms of dwarfism, epilepsy, and eye disorders. deCODE has also initiated projects in 25 common diseases including multiple sclerosis, psoriasis, pre-eclampsia, inflammatory bowel disease, aortic aneurysm, alcoholism, and obesity. Clearly, future toxicogenomic studies will have to take into account possible use of such valuable, albeit commercial, resources.
Informed Consent

Perhaps the greatest barrier to using existing sample repositories and data collections for toxicogenomic studies is the issue of informed consent and its ethical, legal, and social implications (see Chapter 11). Government regulations increasingly preclude the use of patient data and samples in studies for which the patient did not specifically provide consent. Consequently, many potentially valuable resources cannot be used for applications in emerging fields such as toxicogenomics. Legislation to protect the public, such as the Health Insurance Portability and Accountability Act (HIPAA), which was enacted to protect patients against possible denial of health insurance as a result of intentional or unintentional distribution of patient health or health risk data, has unwittingly impaired population-based studies (DHHS 2006). Researchers now face tightening restrictions on the use of any data that can be connected with patient identity.

Despite the significant barriers imposed by patient confidentiality and issues of informed consent, mechanisms exist for assembling large cohorts and collecting samples that can be used for multiple studies and applications to benefit public health. For example, the NHANES program under the aegis of the CDC collects blood samples for measurement of chemicals, nutritional biomarkers, and genotypes from large numbers of individuals using a consent mechanism that expressly authorizes multiple uses of biologic samples. However, because NHANES functions under the aegis of the CDC, it is exempt from many of the requirements of HIPAA. Nonetheless, solutions can be found that meet research needs while protecting the rights of individuals.
An example of a successful approach is the BioBank in the United Kingdom (UK Biobank Limited 2007), a long-term project aimed at building a comprehensive resource for medical researchers by gathering information on the health and lifestyle of 500,000 volunteers between 40 and 69 years old. After providing consent, each participant donates blood and urine samples, has standard clinical diagnostic measurements (such as blood pressure) taken, and completes a confidential lifestyle questionnaire. Fully approved researchers can then use these resources to study the progression of illnesses such as cancer, heart disease, diabetes, and Alzheimer's disease in these participants over the next 20 to 30 years to develop new and better ways to prevent, diagnose, and treat such problems. The BioBank ensures that data and samples are used for ethically and scientifically approved research. Issues such as informed consent, confidentiality, and security of the data are guided by an Ethics and Governance Framework overseen by an independent council.

In summary, the current state of sample repositories and their associated data is less than ideal and there are numerous limitations and barriers to their immediate use in the emerging field of toxicogenomics. Future studies require that issues in the design of cohort studies, phenotype and genotype classification, standardization of sample and data collection, database structure and sharing, and informed patient consent be addressed and mitigated. However, even when all these issues have been addressed, toxicogenomics faces additional and
significant challenges related to the complexity of data collected by the various toxicogenomic technologies.

EXISTING TOXICOGENOMIC DATA REPOSITORIES: STANDARDIZATION, AVAILABILITY, AND TRANSPARENCY

The following section provides an overview of the current standards for toxicogenomic data and a brief overview of existing databases and repositories. The remaining needs that must be met to move forward are described in the Conclusions section of this chapter.

Standards for Toxicogenomic Data

Although each individual study conducted using toxicogenomic approaches may provide insight into particular responses produced by compounds, the real value of the data goes beyond the individual assays. The real value of these data will be realized only when patterns of gene, protein, or metabolite expression are assembled and linked to a variety of other data resources. Consequently, there is a need for well-designed and well-maintained repositories for toxicogenomic data that facilitate toxicogenomic data use and allow further independent analysis. This is a daunting task. Genome sequencing projects have generated large quantities of data, but the ancillary information associated with those data is much less complex than the information necessary to interpret toxicogenomic data. Genome sequencing does not vary significantly with the tissue analyzed, the age of the organism, its diet, the time of day, hormonal status, or exposure to a particular agent. Analyses of expression patterns, however, are dramatically affected by these and a host of other variables that are essential for proper interpretation of the data.

In 1999, a group representing the major DNA sequence databases, large-scale practitioners of microarray analysis, and a number of companies developing microarray hardware, reagents, and software tools began discussing these issues.
These discussions resulted in creation of the Microarray Gene Expression Data Society (MGED). MGED took on the task of developing standards for describing microarray experiments with one simple question in mind: What is the minimum information necessary for an independent scientist to perform an independent analysis of the data? Based on feedback received through numerous discussions and a series of public meetings and workshops, a new set of guidelines called the MIAME standard (Minimal Information About a Microarray Experiment) was developed (MIAME 2002). The publication of this new standard was met with general enthusiasm from the scientific community and the standards have evolved with continued input from scientists actively conducting microarray studies. To facilitate usage of this new standard, brief guidelines and a "MIAME checklist" were developed and provided to scientific journals for
their use when considering publication of manuscripts presenting microarray data (Brazma et al. 2001, MGED 2005).

Whereas MIAME can be thought of as a set of guidelines for describing an experiment, it is clear that these guidelines must be translated into protocols enabling the electronic exchange of data in a standard format. To meet that challenge, a collaborative effort by members of MGED and a group at Rosetta Inpharmatics led to the development of the microarray gene expression (MAGE) object model as well as its implementation as an XML-based markup language, MAGE-ML (Spellman et al. 2002). The adoption of MAGE by the Object Management Group has promoted MAGE to the status of an "official" industry standard. MAGE is now being built into a wide range of freely available microarray software, including BASE (Saal et al. 2002), BioConductor (Dudoit et al. 2003), and TM4 (Saeed et al. 2003). An increasing number of companies are also adopting MIAME and MAGE as essential components of their products. The guidelines in MIAME have continued to evolve, and both MIAME and MAGE are being prepared for a second release.

Efforts are also under way to develop an extended version of the MIAME data standard that allows for integration of other data specific to toxicologic experiments (for example, dose, exposure time, species, and toxicity end points). Called MIAME/Tox, this standard would form the basis for entry of standardized data into toxicogenomic databases (MIAME 2003). Similar efforts have produced microarray standards for a variety of related applications, including MIAME/Env for environmental exposures (Sansone et al. 2004) and MIAME/Nut for nutrigenomics (Garosi et al. 2005). Work is also ongoing to develop standards for proteomics (the Minimal Information About a Proteomics Experiment) (Orchard et al. 2004) and for metabolomic profiling (the Standard Metabolic Reporting Structure) (Lindon et al. 2005a).
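The idea behind a minimum-information standard extended with toxicology annotations (dose, exposure time, species, toxicity end points) can be pictured as a structured record plus a completeness check run before a dataset is accepted. The sketch below is only an illustration of that idea: the class `ToxExperimentRecord`, its field names, and the sample values are hypothetical and do not reproduce the actual MIAME/Tox checklist or the MAGE object model.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ToxExperimentRecord:
    """Hypothetical annotation record for one expression experiment."""
    compound: str
    species: str
    tissue: str
    dose: float            # 0.0 is a valid value (vehicle control)
    dose_unit: str
    exposure_hours: float
    platform: str
    toxicity_endpoints: list = field(default_factory=list)


def missing_annotations(record):
    """Return the names of fields that were left empty.

    Empty strings and empty lists count as missing; numeric zero does not,
    because an untreated control legitimately has a dose of zero.
    """
    missing = []
    for name, value in asdict(record).items():
        if value is None or value == "" or value == []:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Illustrative values only; not drawn from any real submission.
    record = ToxExperimentRecord(
        compound="acetaminophen",
        species="rat",
        tissue="",                 # annotation left blank by the submitter
        dose=0.0,
        dose_unit="mg/kg",
        exposure_hours=24.0,
        platform="two-color cDNA microarray",
    )
    gaps = missing_annotations(record)
    if gaps:
        print("Incomplete annotation, missing:", ", ".join(gaps))
    else:
        print("Record is complete")
```

A repository that curates submissions might run a check of this kind on entry, so that an independent scientist later has the context needed to reanalyze the data.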
Public Data Repositories

The ArrayExpress database of the European Bioinformatics Institute (EBI) (Brazma et al. 2003), as well as the National Center for Biotechnology Information's Gene Expression Omnibus (Edgar et al. 2002) and the Center for Information Biology Gene Expression database at the DNA Data Bank of Japan, have adopted and supported the MIAME standard. Following the same model used for sequence data, data exchange protocols are being developed to link expression data found in each of these three major repositories. Other large public databases such as the Stanford Microarray Database (SMD 2006) and caBIG (NCI 2006d) have been developed in accordance with the MIAME guidelines. However, these are essentially passive databases that apply standards to data structure but provide little or no curation for data quality, quality assurance, or annotation. Other small-scale, publicly available tools and resources that have been developed for sharing, storing, and mining toxicogenomic data include dbZach
(MSU 2007) and EDGE2, the Environment, Drugs and Gene Expression database (UW-Madison 2006).

Private and Public Toxicogenomic Consortia

Databases developed within private or public consortia are often much more proactive in curating data on the basis of quality and annotation. The pharmaceutical industry, for example, has generated very large compendia of toxicogenomic data on both proprietary and public compounds. These data are high quality but in general are not accessible to the public. Datasets and data repositories developed for toxicogenomics within the public sector, or in partnership with the private sector, are also actively curated and annotated but are made available for public access after publication of findings. Examples of such cooperative ventures are described below.

The Environmental Genome Project

The Environmental Genome Project (EGP) is an initiative sponsored by the National Institute of Environmental Health Sciences (NIEHS) that was launched in 1998. Inextricably linked to the Human Genome Project, the underlying premise for the EGP was that a subset of genes exists that have a greater than average influence on human susceptibility to environmental agents. Therefore, it was reasoned that the identification of these environmentally responsive genes, and characterization of their SNPs, would lead to enhanced understanding of human susceptibility to diseases with an environmental etiology. The EGP provided funding for extramural research projects in multiple areas, including bioinformatics/statistics; DNA sequencing; functional analysis; population-based epidemiology; ethical, legal, and social issues; technology development; and mouse models of disease.
Research support was provided through a variety of mechanisms, including centers such as the Comparative Mouse Genomics Centers Consortium, the Toxicogenomics Research Consortium, the SNPs Program, and the Chemical Effects in Biological Systems database, which are described below.

The NIEHS scientific community selected the list of environmentally responsive genes to be analyzed, which currently numbers 554 genes, although this list is not inclusive and is subject to ongoing modification. The environmentally responsive genes fall into eight categories or ontologies: cell cycle, DNA repair, cell division, cell signaling, cell structure, gene expression, apoptosis, and metabolism. The goals of the EGP included resequencing these genes in 100 to 200 individuals representing the range of human diversity. Polymorphisms are then assessed for their impact on gene function. However, the small size of the program precluded disease-association studies. The goal of the program was to provide a database of SNPs that could then be used in larger epidemiologic studies.
The Pharmacogenetics Research Network

The NIH-sponsored Pharmacogenetics Research Network is a nationwide collaborative research network that seeks to enhance understanding of how genetic variation among individuals contributes to differences in responses to drugs. Part of this network is the Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB), a publicly available Internet research tool that curates genetic, genomic, and cellular phenotypes as well as clinical information obtained from participants in pharmacogenomic studies. The database includes, but is not limited to, information obtained from clinical, pharmacokinetic, and pharmacogenomic research on drugs targeting the cardiovascular system, the pulmonary system, and cancer as well as studies on biochemical pathways, metabolism, and transporter domains. PharmGKB encourages submission of primary data from any study on how genes and genetic variation affect drug responses and disease phenotypes.

The NIEHS Toxicogenomics Research Consortium

As a complement to its ongoing participation with the International Life Sciences Institute's Health and Environmental Sciences Institute (ILSI-HESI) Consortium, the NIEHS, in conjunction with the National Toxicology Program (NTP), established the Toxicogenomics Research Consortium (TRC). The goals were to enhance research in the area of environmental stress responses using transcriptome profiling by (1) developing standards and "best practices" by evaluating and defining sources of variation across labs and platforms, and (2) contributing to the development of a robust relational database that combines toxicologic end points with changes in the expression patterns of genes, proteins, and metabolites.

The TRC included both consortium-wide collaborative studies and independent research projects at each consortium member site. On the whole, the TRC was successful and achieved many of its goals.
The TRC published a landmark paper that was the first not only to define the various sources of error and variability in microarray experiments but also to quantify the relative contributions of each source (Bammler et al. 2005). The study confirmed the finding of the ILSI-HESI Consortium (see page 165), indicating that, despite the lack of concordance across platforms in the expression of individual genes, concordance was high when considering affected biochemical pathways. The TRC results were also concordant with work by other groups who arrived at similar conclusions (Bammler et al. 2005, Dobbin et al. 2005, Irizarry et al. 2005, Larkin et al. 2005). Although the amount of funding limited the scope of independent research projects funded within each center, this component was highly successful and has led to numerous outstanding publications in the field (e.g., TRC 2003).
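The difference between gene-level and pathway-level concordance can be made concrete with a toy calculation. The sketch below is not the analysis used by the TRC or the ILSI-HESI Consortium; the gene names, the gene-to-pathway map, and the choice of the Jaccard index as the overlap measure are all assumptions made for illustration. It shows how two platforms that report mostly different genes can still agree completely on the pathways affected.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical gene-to-pathway annotation (placeholder names).
PATHWAYS = {
    "geneA": "oxidative stress", "geneB": "oxidative stress",
    "geneC": "apoptosis", "geneD": "apoptosis",
    "geneE": "DNA repair", "geneF": "DNA repair",
}

# Hypothetical lists of genes called "changed" on two platforms.
platform_1 = ["geneA", "geneC", "geneE"]
platform_2 = ["geneA", "geneD", "geneF"]


def pathways_hit(genes):
    """Map a gene list onto the set of pathways it touches."""
    return {PATHWAYS[g] for g in genes if g in PATHWAYS}


gene_level = jaccard(platform_1, platform_2)
pathway_level = jaccard(pathways_hit(platform_1), pathways_hit(platform_2))

print(f"gene-level concordance:    {gene_level:.2f}")
print(f"pathway-level concordance: {pathway_level:.2f}")
```

In this constructed example only one of five distinct genes is shared, yet both platforms implicate the same three pathways. Real comparisons would depend on the annotation source and on how "changed" genes are called on each platform.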
The Chemical Effects in Biological Systems Knowledge Base

The Chemical Effects in Biological Systems (CEBS) (Waters et al. 2003b) knowledge base developed at the National Center for Toxicogenomics at the NIEHS represents a first attempt to create a database integrating various aspects of toxicogenomics with classic information on toxicology. The goal was to create a database containing fully documented experimental protocols searchable by compound, structure, toxicity end point, pathology end point, gene, gene group, SNP, pathway, and network as a function of dose, time, and the phenotype of the target tissue. These goals have not been realized.

ILSI-HESI Consortium

The ILSI-HESI Consortium on the Application of Genomics to Mechanism Based Risk Assessment is an example of a highly successful, international collaboration between the private and public sectors, involving 30 laboratories from industry, academia, and national laboratories. The ILSI-HESI Toxicogenomics Committee developed the MIAME implementation for the toxicogenomic community in collaboration with the European Bioinformatics Institute (a draft MIAME/Tox standard is included in Appendix B). In addition to helping develop data standards, the consortium was also the first large-scale experimental program to offer practical insights into genomic data exchange issues.

Using well-studied compounds with well-characterized organ-specific toxicities (for example, acetaminophen), the consortium conducted more than 1,000 microarray hybridizations on Affymetrix and cDNA platforms to determine whether known mechanisms of toxicity can be associated with characteristic gene expression profiles.
These studies validated the utility of gene expression profiling in toxicology, led to the adoption of data standards, evaluated methodologies and standard operating procedures, performed cross-platform comparisons, and investigated dose and temporal responses on a large scale. The studies also demonstrated that, despite differences in gene-specific data among platforms, expression profiling on any platform revealed common pathways affected by a given exposure. Multiple landmark papers published the research findings of the consortium (Amin et al. 2004; Baker et al. 2004; Chu et al. 2004; Kramer et al. 2004; Mattes 2004; Mattes et al. 2004; Pennie et al. 2004; Thompson et al. 2004; Ulrich et al. 2004; Waring et al. 2004).

The Consortium for Metabonomic Toxicology

Another example of a highly successful consortium, the Consortium for Metabonomic Toxicology (COMET) was a collaborative venture between University College London, six major pharmaceutical companies, and a nuclear magnetic resonance (NMR) instrumentation company. The main objectives of the consortium were similar to those of the ILSI-HESI Consortium but with proton NMR-based metabonomic profiling (Lindon et al. 2003). The group developed a database of 1H NMR spectra from rats and mice dosed with model toxins and an associated database of meta-data on the studies and samples. The group also developed powerful informatic tools to detect toxic effects based on NMR spectral profiles.

Since its inception, COMET has defined sources of variation in data, established baseline levels and ranges of normal variation, and completed studies on approximately 150 model toxins (Lindon et al. 2005b). In addition, the analytical tools developed have been validated by demonstrating a high degree of accuracy in predicting the toxicity of toxins that were not included in the training dataset.

Commercial Databases: Iconix Pharmaceuticals and Gene Logic

Databases designed for use in predictive toxicology have also been developed in the private sector for commercial applications. An example is the DrugMatrix database developed at Iconix in 2001. Working with Incyte, MDS Pharma Services, and Amersham Biosciences, Iconix selected more than 1,500 compounds, all of which were used to generate in vitro expression profiles of primary rat hepatocytes and in vivo gene expression profiles of heart, kidney, and liver tissue from exposed rats. They then used these data to populate a database that included publicly available data on biologic pathways, toxicity associated with each compound, and pathology and in vitro pharmacology data. The database was then overlaid with software tools for rapid classification of new compounds by comparisons with the expression patterns of compounds in the database.

A similar database for predictive, mechanistic, and investigational toxicology, called Tox-Express, was developed at Gene Logic. These databases are large by current standards and hence these companies occupy a market niche in predictive toxicology.
The information in these databases far exceeds that available through publicly available sources.

CONCLUSIONS

Leveraging Existing Studies

Given the potential of toxicogenomic studies to improve human risk assessment, it is imperative to conduct toxicogenomic studies of human populations. Academic institutions, government agencies, health care providers, and the private sector have already invested tremendous effort and resources into accruing cohorts for population-based studies to predict responses to drug therapy, to improve disease classification, to find genetic markers of disease susceptibility, to understand gene-environment interactions, and to assess effects of human exposures to environmental chemicals. Given the hundreds of millions of
dollars already spent on clinical trials, environmental cohort studies, and measurements of human exposures to environmental chemicals, it is imperative that toxicogenomic studies make the best possible use of the resources at hand. Ensuring access to the samples and data produced by existing studies will require consortia and other cooperative interactions involving health care providers, national laboratories, academia, and the private sector. Sharing of data and samples also raises important ethical, legal, and social issues, such as informed consent and confidentiality, that must be addressed.

Even though existing cohorts and ongoing studies should be leveraged when possible, the current state of sample repositories and their associated data is less than ideal, and there are numerous limitations and barriers to their immediate use in toxicogenomics. For example, there is little uniformity in data collection standards among population-based studies, and these studies for the most part were not designed with genomic research in mind. This lack of uniformity and the fact that samples and data are seldom collected in a format that can be assessed by genomic technologies present a major impediment to using existing cohorts for toxicogenomic applications. Although there will always be a need to collect data specific to a particular study or application, much can be done to standardize collection of samples and of demographic, dietary, occupational, and lifestyle data common to all studies.

Databases and Building a Toxicogenomic Database

Toxicogenomic technologies generate enormous amounts of data, on a scale even larger than sequencing efforts like the Human Genome Project.
Current public databases are inadequate to manage the types or volumes of data expected to be generated by large-scale applications of toxicogenomic technologies or to facilitate the mining and interpretation of the data, which are just as important as their storage. In addition, because the predictive power of databases increases iteratively with the addition of new compounds, significant benefit could be realized from incorporating greater volumes of data.

A large, publicly accessible database of quality data would strengthen the utility of toxicogenomic technologies, enabling more accurate prediction of health risks associated with existing and newly developed compounds, providing context to toxicogenomic data generated by drug and chemical manufacturers, informing risk assessments, and generally improving the understanding of toxicologic effects.

The type of database needed is similar in concept to the vision for the CEBS database (Waters et al. 2003b), which was envisioned to be more than a data repository and, more importantly, to provide tools for analyzing and using data. The vision also included elements of integration with other databases and multidimensionality in that the database would accommodate data from various toxicogenomic technologies beyond gene expression data. Although CEBS (CEBS 2007) is not well populated with data, adding data alone will not solve its shortcomings, because the original goals for the database were not implemented.
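As an illustration of what "multidimensional" storage might mean in practice, the sketch below builds a very small relational layout in which measurements from more than one technology share the same sample record, and can be retrieved by compound and tissue along with dose and time. This is not the CEBS schema or any existing database design; the table names, column names, and example rows are hypothetical, and the in-memory SQLite database from the Python standard library is used only to keep the example self-contained.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE sample (
    id                 INTEGER PRIMARY KEY,
    compound_id        INTEGER NOT NULL REFERENCES compound(id),
    dose               REAL NOT NULL,
    dose_unit          TEXT NOT NULL,
    time_h             REAL NOT NULL,
    tissue             TEXT NOT NULL,
    toxicity_end_point TEXT
);
CREATE TABLE measurement (
    sample_id  INTEGER NOT NULL REFERENCES sample(id),
    technology TEXT NOT NULL,   -- e.g. transcriptomics, metabolomics
    analyte    TEXT NOT NULL,   -- gene, protein, or metabolite identifier
    value      REAL NOT NULL
);
""")

# Made-up example rows.
conn.execute("INSERT INTO compound (id, name) VALUES (1, 'compound_X')")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, 1, 10.0, "mg/kg", 24.0, "liver", "necrosis"),
     (2, 1, 10.0, "mg/kg", 24.0, "kidney", None)],
)
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?, ?)",
    [(1, "transcriptomics", "geneA", 2.5),
     (1, "metabolomics", "metaboliteM", 0.8),
     (2, "transcriptomics", "geneA", 1.1)],
)


def measurements_for(conn, compound, tissue):
    """Return (dose, time_h, technology, analyte, value) rows for one compound and tissue."""
    return conn.execute(
        """SELECT s.dose, s.time_h, m.technology, m.analyte, m.value
           FROM measurement m
           JOIN sample s ON s.id = m.sample_id
           JOIN compound c ON c.id = s.compound_id
           WHERE c.name = ? AND s.tissue = ?
           ORDER BY s.dose, s.time_h, m.technology, m.analyte""",
        (compound, tissue),
    ).fetchall()


rows = measurements_for(conn, "compound_X", "liver")
for row in rows:
    print(row)
```

Keeping the technology as a column rather than a separate table per platform is one simple way to let new data types be added without restructuring; a production system would need controlled vocabularies and far richer experimental metadata than this sketch carries.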
The lack of a sufficient database in the public sector represents a serious obstacle to progress for the academic research community. Development of the needed database could be approached by either critically analyzing and improving the structure of CEBS or starting a new database. Creating an effective database will require the close collaboration of diverse toxicogenomic communities (for example, toxicologists and bioinformatic scientists) who can work together to create "use cases" that help specify how the data will be collected and used, which will dictate the structure of the database that will ultimately be created. These are the first and second of the three steps required to create a database.

1. Create the database, including not only the measurements that will be made in a particular toxicogenomic assay but also how to describe the treatments and other information that will assist in data interpretation.
2. Create software tools to facilitate data entry, retrieval and presentation, and analysis of relationships.
3. Develop a strategy to populate the database with data, including minimum quality standards for data.

Where will the data for such a database come from? Ideally, this should be organized as an international project involving partnership among government, academia, and industrial organizations to generate the appropriate data; an example of such coordinated efforts is the SNP Consortium. However, developing such a database is important enough that it needs to be actively pursued and not delayed for an extensive time.

One potential source of toxicogenomic data is repositories maintained by companies, both commercial toxicogenomic database companies and drug or chemical companies that have developed their own datasets. Although industry groups have been leaders in publishing "demonstration" toxicogenomic studies, the data (e.g., Daston 2004; Moggs et al. 2004; Ulrich et al.
2004; Naciff et al. 2005a,b) published to date are thought to represent a small fraction of the data companies maintain in internal databases that often focus on proprietary compounds. It is unlikely that much of these data will be available in the future without appropriate incentives and resolution of complex issues involving the economic and legal ramifications of releasing safety data on compounds studied in the private sector. Furthermore, extensive data collections even more comprehensive and systematically collected than those maintained by companies are necessary if the field is to advance.

The NTP conducts analyses of exposure, particularly chronic exposures, that would be difficult to replicate elsewhere and could serve as a ready source of biologic material for analysis. One possibility is to build the database on the existing infrastructure of the NTP at the NIEHS; at the least, the NTP should play a role in populating any database.

Although creating and populating a relevant database is a long-term project, work could begin immediately on constructing and populating an initial
database to serve as the basis of future development. A first-generation dataset could be organized following the outline in Box 10-1.

This dataset could be analyzed for various applications, including classifying new compounds, identifying new classes of compounds, and searching for mechanistic insight. The preliminary database would also drive second-generation experiments, including a further study in populations larger than those used in the initial studies to assess the generalizability of any results obtained and second-generation experiments examining different doses, different exposure durations, different species and strains, and relevant human exposure, if possible.

A practical challenge of a toxicogenomic data project would be that much of the database development and data mining work is not hypothesis-driven science and generally is not compatible with promotion and career advancement for those in the academic sector. This is despite the fact that creating useful databases, software tools, and proven analysis approaches is essential to the success of any project. If the ability to analyze the data is limited to a small community of computational biologists able to manipulate abstract data formats or run command line programs, then the overall impact of such a project will be minimized. Thus, ways must be found to stimulate the participation of the academic sector in this important work.

Finally, although this may seem like an ambitious project, it should be emphasized that it would only be a start. Moving toxicogenomics from the research laboratory to more widespread use requires establishing useful data and standards that can be used both for validating the approaches and as a reference for future studies.

RECOMMENDATIONS

Short Term

1. Develop a public database to collect and facilitate the analysis of data from different toxicogenomic technologies and associated toxicity data.
2.
Identify existing cohorts and ongoing population-based studies that are already collecting samples and data that could be used for toxicogenomic studies (for example, the CDC NHANES effort), and work with the relevant organizations to gain access to samples and data and address ethical and legal issues.1
3. Develop standard operating procedures and formats for collecting samples and epidemiologic data that will be used for toxicogenomic studies. Specifically, there should be a standard approach to collecting and storing demographic, lifestyle, occupation, dietary, and clinical (including disease classification) data that can be included and queried with a public database.

1 Ethical and legal issues regarding human subject research are described in Chapter 11.
BOX 10-1 Possible Steps for Developing an Initial Database

1. Select two or more classes of compounds targeting different organ systems. These compounds should be well-characterized compounds with well-understood modes of action.
2. Define appropriate sample populations (for example, a particular mouse strain) in which the effects of the compounds will be measured.
3. Design experiments, including an appropriate range of doses, time course, and level of biologic replication, to ensure that useful data are captured.
4. Select technology (for example, gene expression microarray or proteomics) to be studied. Consider using more than one platform for a particular technology (for example, Affymetrix GeneChips and Agilent arrays) or more than one technology (for example, gene expression microarray and proteomics). Using more than one technology would enable development of methods for comparing and integrating different types of data (for example, gene expression changes and protein data).
5. Develop and implement standards for data collection and reporting, working with communities that specialize in standard development and using private sector, national laboratory, academic, and government agency participation.
The raw data, resulting phenotypes, and ancillary information about samples and experimental design should be described with a precise controlled vocabulary. While descriptions of information should be along the lines of what has been outlined in MIAME microarray standards, compliance with MIAME microarray standards is insufficient to describe toxicology experiments.
Adequate description of ancillary information is often overlooked, but it is essential because toxicogenomic technologies are sensitive to a wide range of perturbations, and analysis requires paying careful attention to the complete information about the samples and experiments.
6.
Develop a database and its software systems that can capture the relevant information and make it freely available to the scientific community. This requires a significant effort because the quality and ease of any subsequent analysis will largely depend on the design of the database and the information it captures. It is expected that new software tools will need to be developed, and they should be freely available and open source to facilitate their distribution, use, and continued assessment and development.
7. Develop tools to analyze the results. Although there are many tools for data analysis, at present none fully serves the needs of the toxicogenomic community. Such development must be carried out as a close partnership between laboratory and computational scientists to ensure that the tools developed are both useful and accessible.

Intermediate

4. Develop approaches to populate a database with high-quality toxicogenomic data.
a. Incorporate existing datasets from animal and human toxicogenomic studies, from both the private and public sectors, into the public data repository, provided that they meet the established criteria.
b. Communicate with the private sector, government, and academia to formulate mutually acceptable and beneficial approaches and policies for appropriate data sharing and data distribution that encourage including data from the private and public sectors.
5. Develop additional analytical tools for integrating epidemiologic and clinical data with toxicogenomic data.
6. NIEHS, in cooperation with other institutions, should create a centralized national biorepository for human clinical and epidemiologic samples, building on existing efforts.

Long Term

7. Work with public and private agencies to develop large, well-designed epidemiologic toxicogenomic studies of potentially informative populations: studies that include collecting samples over time and carefully assessing environmental exposures.