

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




OCR for page 1
SUMMARY

Research in proteomics is the next logical step after genomics in understanding life processes at the molecular level. In the largest sense, proteomics encompasses knowledge of the structure, function, and expression of all proteins in the biochemical or biological contexts of all organisms. Since that goal cannot be achieved, at least in our lifetimes, it is appropriate to set more realistic goals for the field. Up to now, primarily for reasons of feasibility, scientists have tended to concentrate on accumulating information about the nature of proteins and their absolute and relative levels of expression in cells (the primary tools for this have been 2D gel electrophoresis and mass spectrometry). Although these data have been useful and will continue to be so, the information inherent in the broader definition of proteomics must also be obtained if the true promise of the growing field is to be realized. Acquiring this knowledge is the challenge for researchers in proteomics, and the means to support these endeavors need to be provided.

This report attempts to present the major issues confronting the field of proteomics, and two clear messages come through. The first is that the mandate of proteomics is and should be much broader than is frequently recognized. The second is that proteomics is much more complicated than sequencing genomes. This will require new technologies, but it is highly likely that many of these will be developed. Looking back 10 to 20 years from now, the question is: Will we have done the job wisely or wastefully?

Introduction

Due to the rising interest in proteomics research worldwide, a symposium entitled "Defining the Mandate of Proteomics in the Post-Genomics Era" was held at the National Academy of Sciences on February 25, 2002, in Washington, D.C.
Most of the attendees were invited because of their strong interest in proteomics, proteins, or drug discovery. They came from industry, both large and small, academia, and government. Most were from the United States, but an effort was made to invite people from other countries: four of the 10 speakers came from outside the United States, and six young scientists from around the world received travel fellowships to attend the meeting. The attendees heard about recent advances in the field that will greatly accelerate the process of accumulating and interpreting much of this additional needed data and information. The planning committee selected speakers (see Box 1) and designed the symposium in the hope that one of the outcomes of the meeting would be helping to set the field on as wise a path as possible for the future. After the presentations, attendees took part in individual breakout sessions on a variety of topics, including protein separation and identification; protein structure and function; metabolic pathways and post-translational modifications; implementation: necessary policy and infrastructure conditions for collaboration; platforms: emerging technologies; computational methods and bioinformatics; and clinical proteomics.

The thoughts and ideas of the speakers and those expressed in the breakout sessions were captured by recorders to assist in the preparation of this report. While other organizations and meetings have addressed many of the issues facing proteomics, we hope that participants and readers of this report will look back on this meeting as the field progresses and find that it was of some help in defining the current efforts and applications, as well as providing direction to the advancing state of the art.

[Box 1. Symposium speakers. Contents illegible in the source text.]

Proteomics

Now that the DNA sequences of the human genome and the genomes of dozens of other organisms are essentially known, the biomedical and biological communities are placing increased emphasis on proteomics, the study of the proteins that are the gene products. Proteomics, a word derived from "protein" and "genomics," needs further definition, as do proteomics initiatives, especially since many in the scientific community are asking for a human proteome project. Historically, one can point back to meetings and articles over 20 years ago, when scientists began to think about mapping the entire set of human proteins (see, for example, B.F.C. Clark, "Towards a Total Human Protein Map").[1] Indeed, Congress was considering a project called the "Human Protein Index" long before the Human Genome Project had been conceived. The Human Protein Index project was developed in the late 1970s by Norman G. Anderson and N. Leigh Anderson at the Department of Energy's Argonne National Laboratory.[2]
Its objective was to enumerate the human proteins (what would now be called the human proteome) by separation on 2D gels and thus define their genes from the protein end, the only approach possible in those days before large-scale DNA sequencing. But this effort was perhaps ahead of its time given the lack of suitable technologies and shifting political sands. Instead, the rise of genomics took center stage. An Australian postdoctoral student, Marc Wilkins, is often credited with coining the term "proteomics" in 1994, at a time when only one proteomics company existed (Large Scale Biology Corporation).

[1] Clark, B.F.C. (1981) Towards a Total Human Protein Map. Nature 292 (5823): 491-492.
[2] Anderson, N.G. and Anderson, N.L. (1979) Behring Inst. Mitt. 63: 169-210.

Today many proteomics initiatives are underway in industry and elsewhere, such as the Human Proteomics Initiative (HPI), an effort begun in 2000 by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. The goal of the HPI is to annotate each known protein, providing information that includes the description of protein function, domain structure, subcellular location, post-translational modifications, splice variants, and similarities to other mammalian proteins.[4] Another major proteomics effort is led by the Human Proteome Organization (HUPO), a worldwide organization that engages in scientific and educational activities to encourage the spread of proteomics technologies and to disseminate knowledge pertaining to the human proteome and that of model organisms.

On which goals should these national and international efforts focus? Should they be limited to human proteomics or, like the Human Genome Project, include key model organisms? Perhaps the proteomes of human pathogens should be included as well (e.g., the malaria parasite and other infectious microorganisms), and if so, in what order of priority? Should development of more efficient instrumentation (e.g., mass spectrometers, X-ray diffractometers, nuclear magnetic resonance spectrometers) and improved computational methodologies (e.g., high-speed computers and software useful in bioinformatics) be emphasized? What should be the role of major federal funding agencies (e.g., the National Institutes of Health, the National Science Foundation, the U.S. Environmental Protection Agency, and the U.S. Department of Agriculture)?
What should be the role of academic laboratories? Should projects be supported mostly by individual research grants or by program project (group effort) grants? What should be the role of the private sector, particularly those companies, large and small, that have a major stake in exploiting the results of the various genome projects and proteomics initiatives? How can all of these stakeholders cooperate most effectively while still maintaining proprietary information where appropriate? Should the overall goal be to understand the structure and function of all known proteins, or should only those known to be involved in diseases be emphasized? After all, one must first understand function if one is to fully understand dysfunction. Is enough emphasis being given to the functional aspects of proteomics? Are studies on post-translational modifications of proteins and subsequent functional aspects included in "proteomics"? These questions prompted the one-day symposium reported herein.

Discussion of General Topics Covered at Symposium

The symposium began with the definition of the term "proteomics." Marvin Cassman, former director of the National Institute of General Medical Sciences and now at the University of California, San Francisco, and the Institute for Quantitative Biomedical Research, was one of many speakers expressing an opinion on this subject, and it was clear that proteomics means many (or at least different) things to different people. Some definitions include high-throughput and some do not. Obviously, proteomics is not merely protein chemistry. Symposium chair and Dean of the University of Michigan College of Pharmacy, George Kenyon, commented, "Proteomics is not just a mass spectrum of a spot on a gel." Perhaps the most useful definition of proteomics for our purposes is the broadest: Proteomics represents the effort to establish the identities, quantities, structures, and biochemical and cellular functions of all proteins in an organism, organ, or organelle, and how these properties vary in space, time, or physiological state.

Somewhat limited operational definitions of proteomics were offered by some of the speakers. For instance, "In one sense it makes no difference at all. Why should you call something proteomics or call it something else?" Dr. Cassman continued, "What we call things often conditions how we organize our thinking and our efforts." He explained that genome-driven target selection coupled to high-throughput technologies is what he believes structural genomics means. "It means you are using the genomes as the primary source for target selection." However, structural proteomics uses these features "plus the additive feature of full coverage of protein space, that is, completeness," stated Dr. Cassman. The goal of completeness is not meant to suggest, however, that smaller scale experiments, including high-throughput analysis of specific tissues or subsets of proteins, would not be considered part of proteomics.

Of course there are many "-omics" along with proteomics, including genomics, metabolomics, transcriptomics, interactomics, and so on, which are collectively involved in the mandate of defining proteomics. However, we will restrain ourselves from commenting on other "-omics." Functional genomics and functional proteomics (which can encompass other "-omics" as mentioned) are closely juxtaposed on a continuum along the path of discovering the detailed secrets of life and life processes. The general topics covered at the symposium included: perspectives (including genomics perspective, relationship of proteome to genome); source of proteins (including organism, sample storage); protein separation (including purification if subcellular); protein identification (largely mass spectrometry); protein function (including localization, protein:protein interactions, structure determination, structure-function, post-translational modifications); applications (including drug discovery, diagnostics); informatics (including homology modeling, databases, analysis software, standardization); and other topics (including international collaboration, ethical considerations, collaboratories [6]).

[6] Collaboratories are distributed research centers in which scientists in two or more locations are able to work together with the assistance of various forms of communication and collaborative technologies.

Dr. Cassman defined proteomics as a set of related options: "the analysis of complete complements of proteins present in defined cell or tissue environments (i.e., context-dependent) and their variation in space and time" (with credit given to Stan Fields for his contributions to this definition). One example of a proteomic effort is the Protein Structure Initiative of the NIGMS, which has as a goal the generation of a complete complement of protein structures in nature through the combination of direct structure determination and homology modeling. Although it requires high-throughput technology and genomic data for target determination, the goal of "completeness" is what distinguishes the effort as proteomics, according to Dr. Cassman. The second part of his definition is exemplified by the use of microarrays to identify characteristic markers for cancer progression in specific tissue samples. These studies involve image and pattern recognition tools, which yield large-scale visualization of specific cell-dependent, context-dependent proteomic outputs. The third part of the definition involves examining proteomic outputs in time and space. This requires not only the application of bioinformatics tools but also computational biology, that is, the use of modeling and simulation. Complex systems analysis could be considered an important element in the larger picture of defining a proteome, and such analysis will require theoretical modeling of systems. Several examples of NIGMS initiatives that focus on mathematical modeling of complex biological systems were provided.

While we may be far from defining a complete human proteome, approaching proteomics on an organellar basis provides goals that are perhaps achievable in our lifetimes. Remember that the first DNA genomes sequenced were those of bacteriophages in the 1970s, followed in 1981 by the sequencing of a human mitochondrial genome.
Consider also that the mitochondrion, which is estimated to be composed of about 2,000 proteins, presents a considerably more manageable problem and a microcosm of whole cell proteomics. With this in mind, Nobel laureate Sir John Walker, head of the Dunn Medical Research Council Unit in Cambridge, UK, discussed his proteomic studies of mitochondria directed to resolving specific biological issues. Dr. Walker's work includes the definition of the protein complement assembled in the respiratory enzyme known as complex I, the identification of the biochemical functions of a family of transport proteins found only in mitochondria, and the discovery of phosphorylation-dephosphorylation pathways in mitochondria. These studies rely not only on mass spectrometric and bioinformatics tools but also on biochemistry and genetics. In Dr. Walker's view, such an integrated approach is proving more rewarding than attempts to analyze the global complement of proteins in the organelle, in terms of both understanding the biology of mitochondria and the technical development of new methods. It is also possible to focus on subcompartments of mitochondria, such as the inner mitochondrial membrane of so much interest to bioenergeticists.

In this report we have tried to avoid being constrained by a narrow definition of proteomics (e.g., merely quantitating protein levels) and have used the broad definition given earlier to allow a wide-ranging discussion of goals, techniques, opportunities, and challenges.

Lessons Learned from the Human Genome Project

Francis Collins, director of the National Human Genome Research Institute, spoke about lessons learned from the Human Genome Project that might be applicable to the discussion of a public large-scale proteomics initiative (see Box 2). He began his presentation by taking issue with the term "post-genomics era." He queried whether this means that from the beginning of the universe until 2001 we were in the "pre-genome era," and then suddenly, "bang," we moved into the post-genome era (leading one to wonder what happened to the genome era). He suggested that it was presumptuous to say that the Human Genome Project is already behind us. He pointed out that proteomics is a subset of genomics, and genomics is more than sequencing genomes, which will be ongoing for decades to come. His comments are especially relevant given that the human genome was still only about 69 percent complete at the time of the meeting.

BOX 2. Lessons Learned from the Human Genome Project: Comments from Francis Collins

[...] did not start until six years into the project and was initiated first in pilot projects.
Public availability of data and resources is absolutely critical if the benefits [...] scientific community are going to be [...]. The rapid release of pre-publication data was a [...] the success of the Human Genome Project.
Interdisciplinary research needs to be [...], including the participation of experts in computation, chemistry, and [...].
International participation and coordination are essential components to bring the best minds to the problem, to avoid duplication, and for cost sharing.
Centralized databases that allow for [...] and visualization of the [...] are [...] of those who want to [...].
[...] especially for the nucleotide [...]. [...] successful public-private partnerships include a compelling scientific opportunity, pre-competitive data sets, [...].

Dr. Collins concurred with other participants in delivering the sobering message that a large-scale proteomics effort is orders of magnitude more complicated and difficult than the sequencing of the human genome. (As if 100 trillion cells making up an organism and billions of base pairs in genomes are not enough complexity already!) The concept of a complete dataset of all human proteins is therefore very difficult to imagine. There are many challenges, as stated below: wide dynamic range of expression; protein modifications; physical handling of proteins is more difficult than working with nucleic acids; need for multiple technologies, many of which are not optimized or even invented; unlike DNA data, protein data are more analog than digital, making data integration and analysis very challenging; and intellectual property rights and claims.

Dr. Collins said that the most important area for investment in proteomics right now is technology development, so that these methods can be moved toward tackling a mammalian proteome without facing enormous costs and problems with data quality. A number of resources for genomics research continue to be generated that may help inform a proteomics effort, including multiple coverage of certain genomes. More specifically: Multiple genomic sequences from mouse (6x coverage), rat (3x coverage), puffer fish, zebrafish, a sea squirt, and close relatives of C. elegans (10x coverage) and D. melanogaster will be forthcoming. Comparative genomics will be helpful in understanding gene models and gene function. Full-length human cDNA sequencing efforts are ongoing in Germany and Japan. Full-length cDNAs for human and mouse are being generated through the National Institutes of Health (NIH) Mammalian Gene Collection. Multiple NIH institutes plan to support a central database of protein sequence and function through a new initiative.

Dr. Collins referred to one publication: "Global Analysis of Protein Activities Using Proteome Chips."[9] He finished his presentation with a particular recommendation, not from a scientist but from a famous athlete (hockey star Wayne Gretzky). When asked how it occurred that he was so good at playing hockey, and why it was that he always seemed to score the key goals, Gretzky said, "It is very simple.
You have got to skate where the puck is going to be." In the field of proteomics, Dr. Collins said, he was not sure exactly where the puck was going to be, but there were a lot of "Wayne Gretzkys" at the meeting, and he was glad to get a chance to listen to them.

[9] Zhu, H., et al. 2001. Science 293 (5537): 2101-2105.

Sources of Proteins

By definition any proteomics effort aims at "completeness" of information. This part of the symposium addressed primarily the comprehensiveness or completeness of any assembled library of proteins and the quality of the materials. It was noted that protein expression in a given cell varies from none to abundant. Historically, for practical reasons, the abundant proteins have been investigated most extensively; however, some of the rarely expressed proteins and proteins that appear only in disease states may be among the more interesting. Joshua LaBaer, Harvard Medical School, noted that the function of all proteins can be studied regardless of in vivo levels once a copy of the gene and adequate expression vectors are available. Ideally, there would be a repository or library containing one clone for every splice variant in the proteome. The size of that library will not be known for some time, but an intermediate realizable objective would be a repository consisting of one clone for every gene. These clones should be "expression ready"; that is, they should contain only the cDNA from the initiation site to the stop codon. It seems likely that we should have "some idea of all the different cDNAs" in the genome in the near future, stated Dr. LaBaer. The expressed proteins could be studied functionally and often identified by mass spectrometry. In general it is fairly easy to produce large quantities of proteins in insect cells or bacteria, but in certain cases it may be necessary to express them in their native cells in order to address such problems as localization or post-translational modifications. Dr. LaBaer compared the complexities of studying mammalian systems with those in yeast. There are approximately 6,000 genes in yeast compared to a much larger number in humans.
Moreover, the genome in yeast is relatively simple; for example, there are only about 220 intron-containing genes in yeast, whereas a much larger fraction of mammalian genes contain introns, and alternative splicing substantially increases the number of expressed proteins. To this end Dr. LaBaer described the FLEX Gene repository, which is currently being assembled by a consortium of about 20 different public and private research laboratories. "FLEX" stands for Full Length Expression ready. This repository will enable scientists to move several genes simultaneously from the master vector to any expression vector, which will allow researchers to screen for function by high-throughput experimentation. The consortium intends to make this collection of all human genes broadly available without restrictions on its use. The four self-defined objectives of the consortium are (1) identification of the genes, (2) assembly of clones, (3) sequence validation, and (4) distribution to the scientific community. One early success of this effort was the identification of two new genes that are likely involved in the migration of breast cancer cells through a membrane. The collaboration of public and private research groups raises certain legal issues, which include consideration of antitrust law.

Recombination-based cloning was presented as a high-throughput technology to enable the ready transfer of cDNAs from the supplied vector to one's own preferred expression vector. Dr. LaBaer described a protein purification scheme that was developed by a graduate student in his laboratory, Pascal Braun. "In the case of human proteins," Dr. LaBaer explained, "where it is not easy to produce these proteins in human cells, [the availability of large numbers of purified proteins] will require the use of heterologous [expression] systems such as bacteria." "To develop these methods," continued Dr. LaBaer, "Braun transferred a collection of 30 cancer genes into four different expression vectors, each one adding a different epitope tag. [Braun] then developed a two-hour automated protocol for purifying 96 proteins in parallel [and] has now purified over 330 different proteins using this approach." Braun and Yanhui Hu of the lab created a database that correlates the success of purification with various features of the proteins such as pI, GO annotation, subcellular localization, and domain structure. Dr. LaBaer said they found that the presence of certain domains, such as SH2 or SH3 domains, can predict success in purification.

Dr. LaBaer concluded with a description of a database derived from a computer program that searches the primary literature for abstracts that mention both a gene and a disease. The assumption is that a significant number of such co-occurrences may identify groups of genes associated with a given disease. This effort was presented as a task in progress, and interested scientists were invited to experiment with the database.

Brian T. Chait from Rockefeller University described a proteomics approach to understanding cellular function. His group is interested in the mechanisms by which materials enter and exit the nucleus, the isolation of multiprotein complexes, and the determination of their cellular localization. The basic concept is to attach an affinity tag to one of the proteins at its natural location in the chromosome, which is done by replacing the endogenous gene with a gene that codes for a protein carrying a tag or, as he termed it, "a piece of molecular Velcro." So long as the multiprotein complex is stable, the tag allows isolation of the associated interacting proteins. An application to the nuclear pore complex, a group of proteins involved in nuclear trafficking, was described extensively. The complex as isolated has a molecular mass of 50 million daltons. Interestingly, in the initial purification experiments it contained about 180 interacting proteins, but upon further fractionation only around 50 were found to comprise the complex.
The individual proteins are identified by mass spectrometry, which has the power to provide additional information about phosphorylation sites. Preliminary experiments using this approach to follow proteins at different points in the cell cycle and in the regulation of chromatin were mentioned briefly. The genomic tagging and mapping approach can be used to gain analogous information about a number of other systems. Most importantly, this approach can show where the protein is localized within the cell, how much is present, when the protein is present and for how long, with what it is interacting, and even something about the topology of the protein complexes.

Protein Separation

After more than a decade of effort in gene sequencing, the number of human genes is still a matter of disagreement, speculation, and debate. From the point of view of proteomics, just the detection or enumeration of the expressed proteins defies prediction based on our current understanding of human cell-type protein composition and its modulation by myriad undefined post-translational modifications. Their actual identification or annotation of function remains a challenge. This entire situation is not significantly better for yeast. It is thus not surprising that a key problem in proteomics at a practical level is the simplification of protein mixtures to a state in which their characterization by physicochemical methods is experimentally tractable. There are no documented, reliable, or reproducible strategies for separating classes of proteins, or even individual proteins, from the very complex mixtures typically obtained in biological samples such as cell lysates. Clearly, one wishes to know not only which specific proteins are in a given sample but, ideally, also whether specific proteins are part of a particular biologically significant compartment, complex, or subcomplex.

Denis Hochstrasser from the University of Geneva, a founder of GeneProt Inc., GeneBio SA, and the Swiss Institute of Bioinformatics, and one of the pioneers in the identification of proteins in 2D gels, took the lead in dealing with the topic of protein separation. He stated at the outset that he wanted to play the role of "devil's advocate": to describe some of the excitement in proteomics but also to describe some of the difficulties. He outlined the scale of potential proteins one can look for in the millimolar (10^-3), micromolar (10^-6), nanomolar (10^-9), picomolar (10^-12), femtomolar (10^-15), attomolar (10^-18), zeptomolar (10^-21), and yoctomolar (10^-24) (which is less than one molecule per liter) ranges. When one considers human blood, for example, Hochstrasser noted, "typically you only see albumin, immunoglobulin, and transferrin," whereas cardiac markers such as troponin are present at nanomolar concentrations, and insulin-like growth factor or insulin are in the picomolar range. Parathyroid hormone is in the low picomolar range, and tumor necrosis factor is found in the femtomolar range (see Figure 1).

[Figure 1: plot of concentration (from mM, 10^-3, down to yM, 10^-24) against number of plasma proteins (1 to 100,000), with labeled examples including albumin, immunoglobulin, transferrin, leptin, alkaline phosphatase, troponin, parathormone, and TNF.]

FIGURE 1. Potential plasma proteins observable at various concentration ranges: millimolar (10^-3), micromolar (10^-6), nanomolar (10^-9), picomolar (10^-12), femtomolar (10^-15), attomolar (10^-18), zeptomolar (10^-21), and yoctomolar (10^-24). SOURCE: Courtesy of Denis Hochstrasser, GeneProt Inc.

Hochstrasser speculated that there is "a linear logarithmic relationship between the concentration in blood and the number of proteins." He suggested that if there are about 300,000 proteins in the human body, or five to six times the number of genes, "you probably could find any protein you have in the body, maybe one in the total blood volume, which would be just below Avogadro's number (1 protein/L of plasma), because we have 6 or 7 liters of blood which makes about 4 liters of plasma, and if you have one in 4 liters, it is about at the yoctomolar (10^-24 M) level." For experimental studies, a considerable amount of starting material, such as blood, is needed so that low-abundance proteins reach levels detectable by today's methods. Since a 2D gel has a dynamic range of only 10^4, Hochstrasser stated, "if anyone used [a] 2D gel from crude plasma, you never go below the micromolar range." Hochstrasser noted, for example, that starting with 1 mL of sample leads to roughly a nanomolar limit of detection. He further explained that starting with a much larger volume (e.g., 5-10 liters of plasma) is necessary to achieve detectability in the lower picomolar range. Clearly, prefractionation of proteins, individually or as a subgroup, is essential to reach the dynamic range of detectability required for cell and tissue lysates and for plasma.

In subsequent discussion it became clear that even the best large-format 2D gels are inadequate for studies of the global range of expression, perhaps still by a factor of 10; therefore at least a 10-fold fractionation prior to large-format 2D gel separation would be required. Unfortunately, many membrane proteins do not enter 2D gels effectively. This presents a formidable challenge for the field.
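The arithmetic behind Hochstrasser's estimates can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the 1 mM value for the most abundant plasma protein is an assumed round number drawn from the millimolar range quoted for albumin, not a measured concentration.

```python
# Rough check of the concentration estimates quoted in the text.
# Assumptions (not from the report): function names, and a 1 mM
# top-of-range species standing in for albumin.

AVOGADRO = 6.02214076e23  # molecules per mole (SI defined value)


def molar_concentration(molecules: float, volume_liters: float) -> float:
    """Concentration in mol/L for a number of molecules in a given volume."""
    return molecules / (AVOGADRO * volume_liters)


def detection_floor(most_abundant_molar: float, dynamic_range: float) -> float:
    """Lowest visible concentration when the most abundant species
    sets the top of the method's dynamic range."""
    return most_abundant_molar / dynamic_range


# One protein molecule in about 4 liters of plasma.
single_molecule = molar_concentration(1, 4.0)

# A 2D gel with a dynamic range of 10^4, with an assumed 1 mM top species.
floor_2d_gel = detection_floor(1e-3, 1e4)

print(f"one molecule in 4 L: {single_molecule:.2e} M")
print(f"2D gel floor:        {floor_2d_gel:.2e} M")
```

Under these assumptions, one molecule in 4 liters comes to roughly 0.4 yoctomolar, and a 10^4 dynamic range below a 1 mM species bottoms out near 0.1 micromolar, broadly in line with the statement that a 2D gel of crude plasma does not reach much below the micromolar range.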
In his presentation, Julio Celis from the Institute of Cancer Biology and the Danish Center for Human Genome Research in Aarhus, Denmark, also spoke about methods and challenges in the area of protein separation. He stated that "for the study of tissue biopsies the use of high-resolution 2D electrophoresis is the method of choice [for separations] as non-gel high-throughput technologies based on chromatography-mass spectrometry are not yet ready for the study of tissue samples." He stated that 2D gel technology in combination with mass spectrometry can be used to establish comprehensive databases of protein information that can be useful in the clinical setting. He also made the important point that data in a given cell type can be valuable to the study of other cell types since 80-90 percent of the proteins are believed to be shared by all cell types. While many structural and metabolic gene products may be the same between all cells, as one reviewer pointed out, cell-specific proteins will be important for understanding function and disease.

An afternoon breakout session, devoted to the topic of "protein separation and identification," was led by Julio Celis; Alain Van Dorsselaer, Louis Pasteur University, CNRS; and A. L. Burlingame of the University of California, San Francisco. Most of the 16 discussants were experts in mass spectrometry. The discussants concluded that the issue of sample preparation and purification has been sadly neglected at most meetings dealing with proteomics. There was the impression among some of the discussants that protein biochemists were developing and using methods to purify proteins that were not being adequately defined compositionally by mass spectrometrists interested in proteins. They envisioned setting up "core centers of excellence" in proteomics where innovation, mobility of people and ideas, and training can all occur. These core centers might also lead to spin-offs for the development of new instrumentation.
Resources required to support a broad proteomic effort could be in the form of sample collections, standardization of data across platforms, and ligands that allow assaying of individual proteins, to name just a few. These centers would complement the work of scientists

comparisons of datasets of profiles of protein expression, usually determined by mass spectrometry. Sequence comparison can be powerful, especially if families of related sequences are identified. However, it is becoming apparent that not only can function diverge markedly when two sequences differ by 50 percent or more, but in some instances sequences that are more than 90 percent identical code for proteins that operate on completely different substrates and have no cross-reactivity. Assignment of biochemical function from sequence data alone should always be regarded as tentative without confirmatory experimental evidence. Most functional annotation errors in genomics databases probably arise this way.

Structural Proteomics

Among the possible experimental ways of approaching the problem of function determination on a large scale, the one that has received the most emphasis thus far is the use of structural information. Predicated on the assumption that the three-dimensional structure of a protein will often provide information about its biochemical and cellular functions, the structural approach is being applied on a genome-wide scale in a number of independent initiatives. Although in many instances at least the chemical function of an enzyme can be guessed from its overall fold, even that deduction is often problematic, and assignment of higher levels of function is practically impossible without additional information. This problem is exacerbated when membrane-associated proteins are considered. Between 25 and 40 percent of the proteins in the cell are estimated to be membrane associated (depending on the organism). The database of membrane protein structures is very small, and the methods for determining those structures are very difficult and uncertain.

Cheryl Arrowsmith, a structural biologist from the Ontario Center for Structural Proteomics at the University of Toronto, discussed her group's research on structural proteomics.
She distinguished structural proteomics from structural genomics on the grounds that her group works on proteins, not genes. The focus of her proteomics research is to use X-ray crystallography and NMR spectroscopy to determine the three-dimensional structures of proteins on a genome-wide scale. She is particularly interested in examining the extent to which protein structure can reveal protein function. The model system used is Methanobacterium thermoautotrophicum, whose sequence was completed at the time the project was initiated in 1998. Since that time, her laboratory has evaluated thousands of proteins by subcloning into bacterial expression systems [text illegible in source] performing either NMR studies or X-ray diffraction on soluble and relatively clean purified protein. They have also evaluated hundreds of proteins from a number of different bacterial, viral, and yeast genomes. However, the number of proteins that gave structural samples was low. "There is a huge attrition rate in going from cloned genes to those that can be readily expressed in bacteria, are soluble in bacteria, can be purified, give good crystals or promising NMR spectra, and these would be very good in terms of getting a structure." The attrition rate overall is about 85-95 percent of genes that are tried; in other words, approximately 5-15 percent of bacterial or archaebacterial genes can be processed straight through to three-dimensional structures using a single protocol (e.g., single expression conditions, single purification procedure), according to Dr. Arrowsmith.(13) The numbers are worse for eukaryotic systems.

13 Christendat, D., et al. (2000). Nat. Struct. Biol., 7(10): 903-909.

21 "Clearly one needs to try multiple procedures for protein expression, purification, and crystallization in order to improve the success rate for structures, " said Dr. Arrowsmith. She has confirmed these difficulties in a number of other species and systems, and she reported that many of the other National Institutes of Health centers participating in the project are seeing these sorts of statistics as well. Only in a few cases have they had the opportunity and actually gone on to do functional studies of these proteins. Even with proteins of known function, such as spermidine synthase, the determination of structure can be useful in proposing an atomic model and thus a better understanding of the mechanism of enzymatic function. Dr. Arrowsmith's group was among the first to solve the structure of this protein. There are thousands of clones and proteins that have been prepared in the Ontario Center for Structural Proteomics and in many of the other centers; and these clones are available for further functional analysis. "l think this is a huge resource that is being generated, and it should be exploited through projects that emphasize [biochemical] functional analysis of proteins," said Dr. Arrow smith . Cellular Function Protein location can be determined by such genome-wide techniques as green fluorescent protein (GFP) tagging, and protein:protein interactions can be determined by affinity chromatography, immunoprecipitation, and yeast two-hybrid experiments. Databases resulting Dom these methods are beginning to emerge, but they are of uncertain accuracy. Recent comparisons of independently obtained databases for yeast proteins suggest that location determination is fairly robust but protein:protein interactions are at best determined with less than 50 percent overall accuracy. Clearly more reliable methods are needed, and efforts to create protein chips for profiling of interactions with proteins and small molecules appear promising. 
One useful addition to the available arsenal of function-finding tools would be a database of three-dimensional motifs of biochemical function. Such a database would contain those structural elements that participate in ligand binding and catalysis for proteins of known function. This database could be searched in a manner similar to sequence database searches whenever a new protein structure is determined. Another useful tool would be, for each protein family, a database of mutations with functional characterization. Essentially this database would provide a link between a mutation at a particular site, a genetic lesion, a metabolic lesion and even a phenotype such as a disease.

Once again it was stressed that proteomics should be considered as a much broader field than would be apparent from early efforts, which have focused on cataloging levels of protein expression. Ideally it should encompass efforts to obtain complete functional descriptions for the gene products in a cell or organism. Because of the complexity of functional description, clearly more than one technique is required and no one existing technique should be emphasized in preference to any others. This goal may be beyond the reach of existing technologies, even for small numbers of proteins, but it is the direction in which the field must go.

Applications

The application of proteomic technologies to clinical research and public health in general is an immediate goal of proteomics. A distantly related goal is the eventual application of proteomics to environmental, agricultural, and veterinary research, research areas that are far less developed than clinical applications. Thus, essentially all the applications discussed in the formal lectures and breakout sessions centered on clinical applications. Clinical proteomics aims to discover proteins with medical relevance, said Alan Sachs, a director of R&D at Merck. Such discoveries can be defined broadly as those that identify a potential target for pharmaceutical development, a marker(s) for disease diagnosis or staging, and risk assessment, both for medical and environmental studies. Alan Sachs and Denis Hochstrasser co-chaired the "Clinical Aspects" breakout session and covered a wide range of issues: consent, samples, platforms, phases of diagnostic development, data analysis, and definition. (Note that there is a difference between developing biological insight and identifying clinically important diagnostic and prognostic protein-based assays, as one reviewer of this report has suggested: "By studying protein interactions, or splice forms, or abundance, one might be able to effectively distinguish between healthy and diseased tissues. One of the great promises of genomics, and one that has captured the imagination of the public, is the idea that we might move toward personalized medicine through broad genomic or proteomic surveys, what is often called 'pharmacogenomics.'")

Samples

Julio Celis illustrated the potential of proteomics for the study of disease in his talk on his research on bladder cancer. He stated, "one must take into consideration the set of samples you are going to use." "[text illegible in source] Biopsies, [...] cells, and other types of samples can be [...] heterogeneous in terms of cell type, stage of pathology, etc., and this presents a challenge for proteomic analysis that must be faced." Experimental research also must consider the use of various types of cell lines, primary tissues, body fluids, and various animal models. Each of these may impose considerations on the types of techniques used for proteomic analysis. As several discussion participants made apparent, it is currently quite difficult to identify the best procedures for obtaining and storing samples for proteomic analysis: because the techniques used to analyze the samples are constantly changing, it is hard to arrive at a consensus protocol for sample preparation that would be best for a particular analysis method. Thus, Dr. Sachs and others agreed that various strategies for handling samples, standard operating procedures, and long-term storage will need to be co-developed along with evolving protein detection methods.

Dr. Hochstrasser raised the point that "we don't know how to store the samples if we don't know how we plan to use them later." This is important, especially considering that most proteins stored in the freezer at -20 °C are useless for specific types of clinical research after a few months, according to Hochstrasser. The question of storage remains a problem because the technology for measurement in the clinics has not evolved, said Hochstrasser, "yet we need to start worrying about sample storage now." Related to this is the nature of the samples. "Defining 'normal' is a major problem," stated Dr. Celis. As many researchers know, the pathology of samples can be open to interpretation,

and robust parameters must be delineated and adhered to when defining normal versus various stages of pathology. Consideration of the various proteomic methods under development suggests that the size of samples required will be dictated largely by the constantly changing technology. As with all research, the nature of the study will dictate the size of the sample available. Dr. Celis noted, "tissue biopsies will impose the most severe restraints, both in terms of size as well as the available clinical data to support the experimental work." Tissue epidemiological studies may provide blood or some other easily obtainable tissue that is not the target tissue of interest, whereas cancer epidemiology likely will provide tumors of different grades of differentiation. Each of these types of studies imposes complexities and limitations on sample size, number, and method of analysis. The proteome itself has a large dynamic range, depending on the cells being analyzed, and the location of cells within a tissue could influence its size and nature. Dr. Celis estimated that the dynamic range (i.e., the concentrations of proteins) spanned 12-13 orders of magnitude. Given the limits of sensitivity of detection and the availability of a suitable amount of starting material, Hochstrasser stated, "I strongly believe that a combination of bioinformatics (dry lab) and chemistry (wet lab) is crucial to finding new diagnostic markers and therapeutic agents." Several participants expressed their belief that no single technology would be sufficient for proteomic analysis and that multiple approaches will be required, at least in the near future.

Ethical Considerations

In addition to the issues surrounding samples being obtained and stored properly, certain consent requirements and sample limitations permit clinical samples to be used only once after patient consent has been obtained.
In this case consent means both a clear description to the patient regarding how the samples will be used and a disclosure of who will have access to the samples. "Some samples will be anonymous, others will be 'de-identified', and yet others will have restrictions placed on their use," noted Dr. Sachs. For example, samples may have a limitation placed on the type of disease studied or the facility or institution at which the analysis may be performed. Thus, it is important that sample-tracking procedures are in place to ensure that only samples with appropriate consent from subjects are distributed to a specific site for a particular type of investigation.

Development of Diagnostics

Participants of the "Clinical Aspects" breakout session on diagnostics discussed the fact that although the experimental platform used in clinical settings to detect protein markers will change rapidly in the coming years, the underlying principles regarding the stages of going from the discovery of protein markers to their use as diagnostic tools in a community setting will remain reasonably constant. Consequently, the criteria used to judge the quality of a marker or markers as diagnostics in a clinical setting are different from those used to evaluate the quality of a marker in the basic science setting. Discussion centered on the fact that the basic researchers developing protein markers, as well as reviewers evaluating such work, must consider the technical aspects of the application and development of such markers so that statistically underpowered or misinterpreted studies using such markers are not initiated or reported. Another reviewer pointed out an important variable to consider in clinical applications, which is

the impact of population or sample variability due to the heterologous nature of individuals. This point corresponds again to the idea of pharmacogenomics or personalized medicine.

Although data analysis (informatics) is addressed elsewhere in this report, several speakers noted that special consideration should be given to adequate data analysis when reporting something as significant as the association of protein markers with a disease. Participant Thea Kalebic from the National Cancer Institute (NCI) recommended publication criteria for reporting the use of marker and clinical samples. Criteria should be specified for the use and analysis of a particular method to avoid incorrect application of a technique or inadequate or wrong interpretation of the results, stated Dr. Sachs. Participant Izet Kapetanovic, NCI, further suggested that a paper be written for the lay audience to describe how algorithmic clustering methodologies are being used to do disease association studies. "I think a lot of physicians, as well as clinical researchers, are not bioinformatics or statistics people, and they would benefit from such a review," stated Dr. Sachs.

Clinical researchers will also need to consider the types of proteins that might be most relevant, noted Dr. Celis. "Because every modification has a functional meaning, [one must also consider] a protein-protein or protein-macromolecule interaction [as well as] cellular distribution, movement, or migration," added Dr. Celis. Regarding techniques, Dr. Celis believes the only available technique that provides a global picture of the cell proteome is high-resolution 2D gel electrophoresis, despite its obvious limitations in terms of the numbers of proteins resolved and the sensitivity of detection. The non-gel approaches based on chromatography and mass spectrometry allow for high throughput, Dr. Celis noted, but he stated they are not yet ready for the study of complex tissue samples.
Scott Patterson, vice president of proteomics at Celera, prefers the high-throughput approaches to clinical applications of proteomics research. "In our search for markers of disease or drug efficacy, and targets for small molecules, therapeutic antibodies, and cellular immunotherapeutics (vaccines), we employ a broad-based discovery approach," stated Dr. Patterson. His team uses chromatography and mass spectrometry as the basic tools in searching for protein diagnostic markers and therapeutic targets in specific diseases. "Most of you will know Celera for sequencing genomes," commented Dr. Patterson. But as the company decided to embark upon drug discovery based upon its valuable genomics business, the first platform to be built was a proteomics platform. The proteomics component of that strategy is to discover diagnostic markers of disease and targets for therapeutic intervention, said Dr. Patterson. They are specifically focused on proteins that are differentially expressed in disease tissue compared with normal tissue. In contrast to Dr. Celis's approach of performing a high-resolution protein separation at the beginning of the analysis (as is the case for 2D gel electrophoresis), a very high-resolution peptide analysis is performed at the end of the process using chromatography and mass spectrometry. "In its simplest description," said Dr. Patterson, "protein-level analysis is accomplished through targeted capture of classes of proteins (or the depletion of abundant proteins) prior to proteolytic digestion, yielding peptides that are quantitated and identified by MS/MS using one of a variety of platforms [e.g., a MALDI-TOF-TOF-MS or the Voyager 4700 Proteomics Analyzer]." The MS/MS spectra are identified using search algorithms for spectrum-to-sequence matching (using characterized protein sequence databases or a translation of the Celera human genome sequence). Automated identification can be achieved through spectral matching or spectrum-to-spectrum matching.
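The idea of spectrum-to-sequence matching described above can be sketched in a few lines. This is a deliberately simplified toy, not the search algorithm used by Celera or by any particular software: it computes singly charged b and y fragment ions from monoisotopic residue masses and counts how many fall within a tolerance of the observed peaks. The candidate peptides, the simulated spectrum, and the 0.02 Da tolerance are all assumptions made for illustration, and real search algorithms use far more elaborate scoring.

```python
# Monoisotopic residue masses (Da) for the residues used in this example only.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion (N-terminal)
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion (C-terminal)
    return ions


def count_matches(observed, peptide, tol=0.02):
    """Count theoretical fragments lying within tol Da of an observed peak."""
    return sum(
        any(abs(peak - ion) <= tol for peak in observed)
        for ion in fragment_ions(peptide)
    )


def best_candidate(observed, candidates, tol=0.02):
    """Return the candidate sequence explaining the most observed peaks."""
    return max(candidates, key=lambda pep: count_matches(observed, pep, tol))


# Simulated spectrum: fragments of "GASVR" shifted by 0.01 Da, plus a noise peak.
observed_peaks = [mz + 0.01 for mz in fragment_ions("GASVR")] + [300.0]
candidates = ["PEPTK", "GASVR", "LEAKR"]
print(best_candidate(observed_peaks, candidates))
```

In practice the candidate list would come from digesting a protein sequence database in silico and filtering by precursor mass; that step is omitted here to keep the sketch short.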
This overall approach of peptide-level analysis can be employed with isotope dilution strategies

(such as ICAT for quantitation of the relative abundance of peptides and proteins from pairs of samples) and without them if the fractionation of a series of samples is sufficiently reproducible.

Identification of early markers of disease is important for development of a reliable assay for tissue samples so as to help diagnose disease, provide insight into prognosis and identify risk for disease. These markers are especially important in identifying tumor stages, as in Dr. Celis's work. A therapeutic that derives from this information is also desirable. Proteomic data, in combination with microarray (gene expression) data, pathology, immunohistochemistry, etc., have the potential to identify novel markers for early detection, diagnostics, prognostics, and response to treatment, concluded Dr. Celis. Drug discovery and improvement in public health and environmental research will require a combination of all these and other technologies.

Salvatore Sechi, National Institute of Diabetes and Digestive and Kidney Diseases, emphasized that although it is clear that the emphasis in the clinical community is on marker discovery, the technology needed for clinical assays and high-throughput proteomics has not evolved yet. It is important to recognize, however, that developing clinically relevant diagnostic and prognostic tests is something separate from developing biologically relevant insight into disease. While these two goals are not mutually exclusive, they are not necessarily overlapping, notes one reviewer. It may be a relatively simple matter to identify patterns of gene expression that can be correlated with a clinical outcome but that do not provide an immediate insight into the underlying mechanism of the associated disease state. This does not mean that such a prognostic protein expression fingerprint is not useful. Any tool that can help improve and influence treatment has a great potential to affect patients' lives. This, it seems, defines a mandate for proteomics in the twenty-first century.

Computational Methods and Bioinformatics

Computation has become an essential component of biological research. The great quantity and diversity of the data being generated by different technologies is daunting and impossible to organize or oversee without computational assistance. In functional genomics, a great deal of effort has been devoted to developing community-based standards for reporting gene expression data to allow others to replicate experiments. The same will need to be done for proteomics to validate across the different technologies.

Perhaps never before has a bioinformatics problem of this magnitude been approached. No one person can integrate and organize all the relevant information for even a single protein being studied without access to computational tools. Sequence, structure, expression profiles, functional assays, protein-protein interaction from yeast two-hybrid experiments or protein chip experiments, and other data all provide information on different aspects of proteins whose functions and roles we are only beginning to understand. Without effective and integrated databases to store and retrieve these data, and advanced computational methods such as pattern recognition and other machine learning approaches to analyze and interpret them, the full implications of these data will not be realized.

A few years ago the typical biologist may have had little reason to turn to a computer for insights or information. Today the story is very different. To paraphrase an old adage, "No protein is an island," and researchers who are unable (or unwilling) to use all available data do so at their own peril. Computation can provide powerful tools to enable the detection of subtle relationships between data and suggest hypotheses for

experimental validation. In addition to the traditional hypothesis-driven research to which we are all accustomed, computational methods provide a new paradigm: 'computationally assisted hypothesis generation'. Far from supplanting the biologist's intuition, understanding, and experimentation, computation can provide an added dimension enabling additional insight and understanding. Our ability to take advantage of the technological advances in genomics and proteomics will hinge greatly on our ability to integrate computation into the research and discovery process.

For instance, experiments performed on one protein often have relevance for other proteins, not only within the context of the organism from which that protein was derived (i.e., paralogues) but also within the context of other organisms containing orthologous proteins. Researchers investigating an individual protein, say, one involved in disease resistance in potatoes, would miss a wealth of information and experimental data if they were not aware of work being done on related plants, such as Arabidopsis, tomatoes, or rice. In fact, many disease-resistance proteins in plants have orthologues in insects and animals, and experiments on one will shed light on the function of another. Entire pathways in one organism have analogues in others. Genetic experiments in one organism will have implications for related organisms. Three-dimensional structures solved for one protein can be used to predict the structure of other proteins whose sequences are similar. Residues shown to be catalytic in one protein are likely to play a similar role for related proteins.

Taken alone, proteomics data being generated (microarray, protein chip, structural, yeast two-hybrid, mass spectrometry) can provide important insights. Taken in concert and integrated, these data provide a context for understanding the complex interactions and roles of these biological molecules.
To take full advantage of the information contained in these data, computational development of two basic types is necessary: (1) database infrastructure enabling efficient and biologically intuitive storage and retrieval, and interface design to enable different databases to communicate with each other, and to allow investigators with disparate backgrounds to access the information in these databases; and (2) intelligent systems, agents, and software tools to discern relationships between data, and to generate hypotheses that can be tested experimentally. Such bioinformatics development may also be used to help answer fundamental questions in biology that have never been posed. In addition, training the next generation of scientists to make the kinds of contributions that will be critical to discovery in this new century of proteomics must not be ignored. We discuss each of these issues separately. The breakout group "Computational Methods and Bioinformatics," led by Kimmen Sjolander, University of California, Berkeley, and Dagmar Ringe, Brandeis University, discussed database issues.

Database infrastructure and interface design

For historical reasons most biological databases have been produced primarily by the biological community, while most computational tools have been produced by the mathematical and computational communities. This has resulted in databases that are often not easily amenable to automated data-mining methods, unintelligible to some computers, and computational tools that are often non-intuitive to biologists. Biological databases have inherent complications stemming from the nature of the information they contain and the dependence of computational methods on these data. Most biological data are not digital, making machine-readability of the data (for automated data-mining) impossible. In addition, the lack of standardized nomenclature

and ontology, the use of protein aliases (leading to ambiguity), the lack of interoperability across databases, and the presence of errors in database annotations have hindered and complicated the use of computational methods. While the computational biology community has begun dialogue in this area, a great deal of work needs to be done before access to information becomes routine and accessible to the computational non-expert.

Development of new methods

Computational methods are based on models, whether mathematical or biological. As biologists achieve new insights based on new data being generated by experiment, existing models will be reevaluated and changed accordingly. New methods will need to be developed and existing methods refined. In all cases development of benchmark test sets is critical for the assessment of method accuracy and reliability. As more information becomes available in databases, more robust tools and more intuitive methods for finding relationships within these data will be needed.

Training

Biology is being changed by computation. And computation, in turn, is being changed by biology. Researchers working at the interface of computation and biology are increasingly in demand. New degree-granting programs and departments are springing up around the world to train the next generation of scientists in this interdisciplinary field. To be effective in this new age of computationally assisted biological discovery, biologists must receive training in statistics, mathematics, and computation, and become expert in the use and interpretation of the results of computational methods. It was suggested that life scientists learn at least some simple scripting language such as Perl, and a database language such as SQL. For computer scientists working in this area, training in life sciences is necessary. Both groups must learn to speak a common language.
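The alias ambiguity described above, and the suggestion that life scientists learn a scripting language and a database language, can be made concrete with a small example. The following is a minimal sketch using Python's standard `sqlite3` module with an in-memory database; the table layout, the protein names (`PROT_A`, `PROT_B`), and the aliases are all hypothetical and are not drawn from any real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE protein (
        id INTEGER PRIMARY KEY,
        canonical_name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE alias (
        alias TEXT NOT NULL,
        protein_id INTEGER NOT NULL REFERENCES protein(id)
    );
    """
)

# Hypothetical records: the alias "x21" is shared by two different proteins.
conn.executemany("INSERT INTO protein VALUES (?, ?)", [(1, "PROT_A"), (2, "PROT_B")])
conn.executemany(
    "INSERT INTO alias VALUES (?, ?)",
    [("alpha-1", 1), ("x21", 1), ("x21", 2), ("beta-7", 2)],
)


def resolve(connection, name):
    """Return every canonical name that a (case-insensitive) alias maps to."""
    rows = connection.execute(
        """
        SELECT p.canonical_name
        FROM alias AS a
        JOIN protein AS p ON p.id = a.protein_id
        WHERE a.alias = ? COLLATE NOCASE
        ORDER BY p.canonical_name
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


print(resolve(conn, "alpha-1"))  # unambiguous alias
print(resolve(conn, "X21"))      # ambiguous alias: more than one protein
```

Returning a list rather than a single name is a deliberate choice in this sketch: an automated pipeline can then detect an ambiguous alias (more than one result) instead of silently picking one protein.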
The information explosion has presented an opportunity and a challenge to the biological and computational communities. The wealth of data being generated needs to be integrated in order to define a system from multiple viewpoints and to understand a system from different sets of empirical data. Such integration is possible only with computational tools that can find relationships within the data and use these relationships to create testable models. Such tools must also be user friendly.

Proteomics: A Coordinated International Effort

"It was a report by a National Academy of Sciences panel [in 1988] chaired by Bruce Alberts [President of The National Academies] that basically laid out the blueprint that became the Human Genome Project, and a wise blueprint, indeed," said Francis Collins. Dr. Collins stressed the success and importance of having a large international consortium of laboratories involved in the Human Genome Project. "Forming large teams and international teams [was critical] and this is the group that carried out the large-scale sequencing effort or at least the leaders of many of the labs that were involved in that six-country enterprise," he stated. Dr.

Collins hopes the same will be true for proteomics research. "I think it was very helpful that all of the groups had the capacity for large-scale effort, had an open door to come and join in and that this was an international enterprise, also something that I had hoped would happen for proteomics in the public sector because after all science is an international enterprise. That is one of the joys of the whole thing."

Protein Structural Initiative

In September 2000 the NIGMS initiated the support of seven centers to begin work on developing an approach to structural genomics in order to reap the benefits of the multiple genome projects being undertaken worldwide. Two more centers were subsequently added in September 2001, forming what is now known as the Protein Structural Initiative. The idea for the consortium resulted from a planning meeting in April 2000 jointly sponsored by the NIGMS and the Wellcome Trust, and there was wide representation from several countries. It was essentially a policy meeting to come up with ideas about how to consider structural genomics in a worldwide setting, said Marvin Cassman. He defined structural genomics as "the discovery, analysis, and dissemination of three-dimensional structures of proteins, RNA, and other biological macromolecules." However, the focus is primarily proteins.

Currently the NIGMS is helping to fund nine pilot projects to determine the best strategies for a large-scale production process. Each project is required to include all the components for the effort based on genome-driven target selection. These components include protein production, crystallization, structure determination, theoretical analyses for homology modeling, target selection testing approaches for full coverage of protein families, development of high-throughput methods, and best management practices. The consortium consists of industrial and international collaborators with 66 investigators and 24 institutions, according to Dr. Cassman.
They all involve the development of technology.

Proteomics is generating novel requirements for scientific collaboration. Several drivers and barriers to collaboration were discussed by members of the breakout called "Implementation: Necessary Policy and Infrastructure Conditions for Collaboration," led by James Myers, of the Pacific Northwest National Laboratory, and Richard Morris, National Institutes of Health, Division of Allergy, Immunology and Transplantation. "As has been the case with genomics research, the promise of medical benefits is a major driver of proteomics research," stated Dr. Myers. Participants of the breakout session discussed how collaboration may create needed economies of scale in research, for example, by making it possible for groups of scientists to strive for completeness in critical description and annotation of proteins, an effort that would surpass the capacity of any individual scientist. It was suggested that success in this field would be based on the extent to which it selected and strived to characterize specific organs, pathways, and systems with completeness.

Research Collaboration

Proteomics research also poses unique challenges to collaboration. Due to its close tie with application, the field imposes barriers in intellectual property and authorship and other issues of attribution. As with genomics, tensions between collaboration and competition (as well as between government and industry) are also heightened in proteomics research. A global distribution of resources, expertise, and potential targets is necessary for collaboration and success in the field. In terms of international focus, the same problems arise here that arose in genomics. The proteomics techniques and data collection and expertise occur in one locale, but in developing countries there are important diseases and other health problems that are affecting millions of lives and should be studied. However, one has to address some of the differences in policies such as informed consent between countries if one is actually going to get some work to happen there.

Policy, organizational, and technology solutions were discussed as interdependent variables, and as essential future enablers of proteomics research. For example, the need to construct and use diverse data sets was examined. Proteomics poses unprecedented demands for integrating biological samples, electronic data, outputs from instrumentation, and expertise from multiple disciplines. Interactions across disciplinary boundaries that were crucial for the genome project may be even more so for proteomics due to the variety of expertise needed. The group examined how this might be coordinated and made accessible through shared user facilities. Such collaboration may also help overcome shortages of trained experts who can contribute to proteomics research through particular fields including mathematics, statistics, physics, chemistry, computer science, and engineering. The potential drawbacks of such shared facilities were identified, including the requirement to travel away from one's home institution and the overhead costs of dealing with multiple facilities. Such costs could be mitigated by the creation of virtual facilities providing aggregated capabilities over the Internet. Dr. Myers suggested that Internet-based collaboratories, or laboratories without walls, could permit researchers to easily share data, instruments, and expertise.
In addition, the group discussed the need to provide intellectual credit to developers of shared and reference resources (e.g., samples, instruments, software). The recognition by (and pressure for) scientific publication in academia leaves little room for scientists to work on reference resources and database construction. Since the best makers of such tools and databases are those who actually use them, scientists should be supported not only for the development of the tool (which often is hard to get through conventional funding means), but also for applications of the tool toward a biological problem. A scientist should be able to get funding for making a tool that will help with an important problem; clever grant writing should reflect this contribution toward their own work and toward the benefit of the scientific community. This problem could also be solved by adjustments to institutional tenure policies, whereby credit is given to those who take time away from the bench to develop critical databases, websites, and software for general use by the scientific community. The participants recommended that tenure committees and granting agencies should be able to recognize those kinds of contributions. Information technology could also play a role by assisting managers to develop broad metrics of authorship or enabling them to track the pedigree or provenance of intellectual contributions at a finer grain than publication credit.

There are very few people trained in the multiple areas necessary for proteomics research, from computation to experimentation across disciplines of biology and chemistry. The need to train new researchers and to encourage practicing researchers to broaden their expertise could be met through support for additional fellowships and sabbaticals, which could be made more effective and less disruptive through remote collaboration capabilities.
Difficulties in sharing and comparing data from different techniques and disciplines could be reduced through the promotion of standardization efforts and open, extensible data format standards. Overall it was agreed that the pursuit of proteomic research needs to progress on an international scale with broad support from governments and industries alike. In turn, this creates the need for international professional organizations, as well as software applications and standards to enable international collaboration. "There is the sense that just like the promise of the genomics revolution, there are many, many things that can be done that are of practical importance," stated Dr. Myers. "Collaboration to speed up that process is really a big driver here and driving it more than just the basic research kind of ideas."

Conclusion

The most useful definition of proteomics is likely to be the broadest: proteomics represents the effort to establish the identities, quantities, structures, and biochemical and cellular functions of all proteins in an organism, organ, or organelle, and how these properties vary in space, time, and physiological state. Proteomics is thus a huge, long-term task, much more involved than sequencing the genome. At the time the Human Genome Project was begun the basic methodology for sequencing DNA, Sanger's dideoxy chain termination, had already been in place for five years and the task, while challenging, was essentially one of efficient scale-up. One of the main lessons from this symposium is that proteomics has not yet reached that stage. There is much work to be done in the technology sector. Perhaps the most important area for investment right now is in platform technology development for high-throughput systems. Other areas where emphasis might be placed in the short term include protein markers and clinical assays of disease, as well as the use of less complex model systems. Quality controls and annotations are needed at all levels. There are also several barriers that remain to translate proteomics results into clinical applications, but progress is being made as described in this report. There is room for both big and small science, stated George Kenyon.
No one group, company, or government entity is going to solve these problems; there is a great need for interdisciplinary collaboration, locally, nationally, and globally.