

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




OCR for page 23
3 Primary Structure of Proteins and Nucleic Acids

PROTEIN SEQUENCES AND DATA BASES

As of mid-1987, more than 5,000 protein amino acid sequences had been reported, most of which were inferred from the DNA sequences that encode them. Although the collection is redundant (same protein from different species) and definitely biased (many human and few plant sequences, for example), several patterns nevertheless stand out. Foremost among these is that the number of different types of protein is finite. It is becoming increasingly clear that most of the proteins determined thus far belong to identifiable families that are easily recognized by amino acid sequences alone. Indeed, the chances are now better than even that a newly determined amino acid sequence from a eukaryotic organism will be found to resemble a previously entered sequence.

Some of these families were anticipated (Table 3-1) on the basis of similarities in function and size. Thus, globins were all known to bind heme and to have very similar properties. We knew about large numbers of kinases, serine proteases, and thiol proteases, for example, and scores of protease inhibitors. It is not surprising, either, that many dehydrogenases and reductases have related sequences or that all the ATPases belong to a homologous set.

TABLE 3-1. Some well-established protein families (a)

Enzymes:
  Serine proteases          Dehydrogenases
  Thiol proteases           Reductases
  Acid proteases            Kinases
  ATPases                   Carboxypeptidases
  Transcarbamoylases        Phosphorylases

Non-enzymes:
  Globins                   Collagens
  Immunoglobulins           Cytochromes
  Keratins                  Polypeptide hormones
  Histones                  Crystallins
  Glycopeptide hormones     Protease inhibitors
  Lipid-binding proteins    Interferons
  Toxins                    Transferrins
  T cell receptors          MHC antigens

(a) In each group some members are known to have similar amino acid sequences.

What is surprising is that the same kinds of protein structures are appearing in proteins in quite different settings. Polypeptide hormone precursors have been found that are related to protease inhibitors, for example, and structural proteins of the lens of the eye have been found to be related to regulatory agents called "heat shock" proteins (Ingolia and Craig, 1982). Sometimes the connections seem astonishing at first, but upon reflection they are very reasonable. Thus, the recently determined sequences of the β-adrenergic receptor, which binds adrenaline and its derivatives, were found to be similar to that of rhodopsin, the eye pigment protein that responds to stimulation by light. This was extraordinary (Dixon et al., 1986), but not inexplicable, since the activating signals (adrenaline in the one case and light in the other) both can provoke excitatory actions. Subsequently, Kubo et al. (1986) found that a third protein, the muscarinic acetylcholine receptor, also belongs to this family.

Many proteins, then, came into being through a process of "duplicate and modify." Gene duplications (partial or complete,
as will be discussed further) lead to extra gene copies that suffer base substitutions in the usual way; the substitutions, in turn, lead to modified proteins. For most proteins, these replacements are established sufficiently slowly that, even after a billion years or more of evolution along diverging lines of descent, it is possible to recognize common origins.

More to the point for this report, an abundance of data show that three-dimensional structures of proteins are better conserved during the course of evolution than are their amino acid sequences. In this regard, a detailed crystallographic study of a series of serine proteases led to the conclusion that recognizably similar three-dimensional structures may endure as much as 10 times longer than do distinguishably similar amino acid sequences, a natural consequence of the diverse ways in which amino acids may be arranged to yield equivalent structures (James et al., 1978). As a result, it is reasonable to assume that similar amino acid sequences give rise to similar three-dimensional structures. This is obviously an important point because it is considerably easier to determine amino acid sequences (albeit using DNA sequencing) than to determine crystal structures.

The Current Crystal Structure Census

In the next decade, the sequences of 50,000 proteins are likely to be determined. In the majority of cases, it should be possible to assign them to existing families. The question then arises: For what fraction of those known families do we have crystal structures?

The Brookhaven Protein Data Bank, which keeps data for all known protein structures, lists about 300 entries. Like the sequence data banks, however, the Brookhaven collection is heavily redundant and biased. The entries include many variations of the same proteins crystallized in different settings (14 entries for egg-white lysozyme alone) and from different species. Actually, only about 100 different protein structures have been determined, and of these, many belong to the same families, as do the dehydrogenases or the serine proteases. Equally important, many known protein families (the interferons, for example) have not yet had a single crystal structure determined.

The situation is changing rapidly, however, and the prospects appear very good for determining a truly representative set of
crystal structures. Innovations in recombinant DNA technology now allow the production of proteins in quantities sufficient for crystallization; previously, many of these proteins were available only in trace amounts. Beyond that, new techniques for crystallizing membrane proteins have opened an entirely new dimension (Michel, 1982). In addition, modern techniques for rapid data collection have revolutionized the entire field and hastened the process immensely (Xuong et al., 1978). Finally, improved techniques for structure solution and refinement have also advanced this process.

Modeling on the Basis of Sequence Alone

At present, it is not possible to generate, with any hope of accuracy, a three-dimensional structure of a protein using only the amino acid sequence. Opinions differ widely as to whether a general solution to the "folding problem" is near, and recent developments in the field are discussed elsewhere in this report. Certainly Cohen et al. (1986b) have shown that in special situations, much can be predicted about a protein on the basis of its sequence, but the routine application of an all-inclusive procedure is not yet in sight.

It is possible, of course, to make computer-assisted predictions about secondary structure (Chou and Fasman, 1974), and although these methods have a limited accuracy (Kabsch and Sander, 1983; Nishikawa, 1983), they can nevertheless provide useful information when applied judiciously. Predictions about protein structure can also be made with computer programs that assess hydropathy (Kyte and Doolittle, 1982). These have been especially successful in predicting the membrane-spanning segments of membrane-associated proteins.
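As an illustration of the hydropathy assessment mentioned above, the following is a minimal sketch of a sliding-window hydropathy profile. The residue values are those of the Kyte and Doolittle (1982) scale; the function names, the default window of 19 residues, and the threshold of 1.6 are choices made here for illustration, and a real analysis would tune them to the proteins in question.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return the mean hydropathy of every full window along seq.

    Raises KeyError for characters outside the 20 standard residues.
    """
    values = [KD[res] for res in seq.upper()]
    if window < 1 or window > len(values):
        return []
    total = sum(values[:window])
    profile = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        profile.append(total / window)
    return profile


def candidate_membrane_segments(seq, window=19, threshold=1.6):
    """Return 0-based start positions of windows at or above threshold.

    The threshold is an illustrative assumption, not a fixed rule.
    """
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h >= threshold]


if __name__ == "__main__":
    toy = "L" * 19 + "D" * 19  # hydrophobic stretch followed by acidic stretch
    print(candidate_membrane_segments(toy))
```

Windows that score above the threshold are only candidates for membrane-spanning segments; as the text notes for secondary structure prediction, such output is useful when applied judiciously rather than taken as a structural assignment.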
Existing Data Bases

The major data bases for sequences are the Protein Identification Resource (PIR) at the National Biomedical Research Foundation, Washington, D.C.; GenBank, operated by Bolt, Beranek, and Newman in Cambridge, Massachusetts; and the EMBL Data Bank in Heidelberg, Germany (Table 3-2). GenBank and EMBL store only DNA sequences, although recently they have begun to make available derived amino acid sequences. GenBank and EMBL exchange data frequently as a way to enhance the data sets. Currently, these two data bases together contain more than 15 million bases of nucleic acid sequences.

TABLE 3-2. Some sequence data banks and searching facilities

  Protein Identification Resource
  Georgetown University Medical Center
  National Biomedical Research Foundation
  3900 Reservoir Road, N.W.
  Washington, D.C. 20007 U.S.A.

  GenBank: Genetic Sequence Database (a)
  Computer & Information Science Div.
  BBN Laboratories, Inc.
  10 Moulton Street
  Cambridge, MA 02238 U.S.A.

  EMBL Data Library (a)
  Graham Cameron, Data Library Manager
  Postfach 10 2209
  Meyerhofstrasse 1
  6900 Heidelberg, Germany

  University of Wisconsin Genetics Computer Group (UWGCG)
  John Devereux
  University of Wisconsin Biotechnology Center
  1710 University Avenue
  Madison, WI 53705 U.S.A.

  Unité d'Informatique Scientifique
  Jean-Michel Claverie
  Institut Pasteur
  Paris, France

  Bionet: National Computer Resource for Molecular Biology
  Intelligenetics, Inc.
  124 University Avenue
  Palo Alto, CA 94301 U.S.A.

  PRF Amino Acid Sequence Collection
  Yasuhiko Seto
  Peptide Institute, Protein Res. Found.
  476 Ina Minoh, Osaka 562 Japan

(a) Although GenBank and the EMBL Data Library primarily store DNA sequences, translated versions of the data are available.

All of the three major banks currently have backlogs of data awaiting entry. In the case of the PIR, for example, most of the protein sequence data are still typed in from the published literature. GenBank and EMBL are trying to make arrangements
with some key journals (Nucleic Acids Research is one) that will allow data to be submitted to the data bases in various computer modes: diskettes, tapes, and direct transmission. The logistic and policy problems associated with such data bases are enormous, and many committees and societies are trying to establish an acceptable policy that will speed things up. At the same time, all indications are that the generation of sequence data will increase exponentially in the next few years.

Clearly, we need a new, permanent, centralized data repository. Ideally, this should be international; practically, it may be more readily attained at the national level. This should be an institution at least as large as the National Library of Medicine. It could be located anywhere, although certainly consideration should be given to Los Alamos, which has already been heavily involved in sequence banking with GenBank. Models for this center, which must be a constantly updated base and not merely a repository, include the National Bureau of Standards or the Coast and Geodetic Survey.

In addition to these major sequence collections, some smaller enterprises are operating, both in the United States and elsewhere (Table 3-2). However, all of these appear to rely heavily on the PIR-GenBank-EMBL collections as their data cores. Finally, some beginning efforts are underway to create a carbohydrate structure bank.

PATTERN BASED COMPARISONS

Methods for determining sequence alignments and homologies represent a powerful set of tools for relating the structures and, potentially, the functions of two or more biopolymers. However, these homologies alone do not take full advantage of the information content of primary structures of biopolymers. For example, different nucleotide patterns may code for the same protein sequence; different protein sequences may share a very similar function.
Often the differences at the level of primary sequence are substantial, and similarity of function among sequences is lost when the sequences are compared by homology. Yet, if we can somehow relate the sequence of a new protein to that of another protein whose structure and function are known, we can begin to determine the structure of the new protein. How might we take advantage of a more general view of homology, and view sequences as patterns in order to extract more structural information from them?

FIGURE 3-1 Hierarchical relationship among patterns in biological sequences. [Diagram: a sequence at the bottom with boxed partial sequences P1-P4, grouped into P5 and P6, and then into P7 at the level of higher order structure.]

If we consider the concept of a sequence as a string of characters, we can view the concept of a pattern as a string (or collection) of partial sequences. This view of primary amino acid sequences derives from our belief in a hierarchy of structural descriptions of proteins (see Figure 2-1).

If the only structural information we have about a protein is its sequence or primary structure, we seek patterns in its sequence that will guide us to a better understanding of the protein. Patterns can be derived by examining the secondary and tertiary structures of known proteins, and relating spatial information back to patterns in the primary structure. In essence, we attempt to map what is known about spatial configurations into the one-dimensional world of sequences. This mapping itself can be done hierarchically, closely related to the hierarchy shown above, as shown from the bottom up in Figure 3-1.

The scheme symbolized in Figure 3-1 represents the fact that a sequence can be analyzed to determine patterns of partial sequences, illustrated by the boxed elements. The individual patterns P1-P4 (for example, secondary structural elements) may themselves be part of larger patterns, such as P5 and P6, which
may only be recognized after the abstraction to P1-P4 is accomplished. P5 and P6 (for example, structural domains) may then be recognized as part of an even larger pattern, P7 (for example, the tertiary structure of a protein). The hierarchy of Figure 3-1 also represents a computational model for deriving secondary and tertiary structural information from the patterns themselves. This computational model is described below.

Methodology

The concept of patterns is useful for expanding our view of the primary structure of proteins only if we find some method for labeling the individual amino acids or partial sequences with properties other than identity. For example, we can choose labelings related to physical, chemical, or functional properties of the amino acid. If we choose properties related to secondary or higher order structure, we can encode higher order properties in the sequence itself. Labelings can be assigned to individual amino acid residues or to groups of residues based on calculations over a partial sequence. Typical properties included are:

  charges: plus, minus, or neutral
  pK: acidic, basic, neutral; or specific value
  hydropathicity: a hydrophobicity or hydrophilicity value for one residue or calculated over a set of residues
  chemical similarity: several possible definitions
  tendency for replacement over evolutionary time
  secondary structure calculated over a group of residues (see below)

Some labelings are valuable for examining sequence homologies. For example, one can examine similarities in the chemical properties of a sequence by performing the homology search among sequences labeled with characters or flags that represent assignment to various chemical classes. If the evaluation function for rating the degree of homology is adjusted appropriately based on the variety of the labels, or the alphabet, it is possible to find meaningful homologies among sequences with substantially different amino acid sequences.
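The labeling idea described above can be sketched in a few lines. The four chemical classes below are a hypothetical grouping chosen only for illustration (the text notes that several definitions of chemical similarity are possible), and the function names are likewise assumptions rather than part of any established package.

```python
# Hypothetical grouping of the 20 standard residues into chemical classes.
# This is one of several possible definitions, used here only as an example.
CLASSES = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar or small
    "+": "KRH",       # basic
    "-": "DE",        # acidic
}

# Invert the grouping into a residue -> class-label lookup table.
LABEL = {res: tag for tag, group in CLASSES.items() for res in group}


def relabel(seq):
    """Rewrite an amino acid sequence in the reduced class alphabet."""
    return "".join(LABEL[res] for res in seq.upper())


def fraction_identical(a, b):
    """Fraction of aligned positions that match in two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    s1, s2 = "KLVDES", "RIAEDT"
    print(fraction_identical(s1, s2))                    # residue identity
    print(fraction_identical(relabel(s1), relabel(s2)))  # class identity
```

In this toy pair no residue is identical, yet every position falls in the same chemical class. Because the reduced alphabet has only four labels, chance matches become far more frequent, which is why the evaluation function has to be adjusted to the size of the alphabet before such matches are treated as meaningful.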
One example of the success of this approach is a method for finding protein sequence homologies based on the tendency for replacement over evolutionary time of
one amino acid by another (Dayhoff, 1978; Lipman and Pearson, 1985).

Software and Hardware Considerations

The computer software and hardware needed to carry out the pattern-matching operations on sequences contrast in interesting ways with those required for the intense numerical calculations described in other sections of this report. First, the analysis of patterns in sequences is a problem in symbolic, not numeric, computation. For efficient program development and application, languages that support symbol manipulation and pattern-matching primitives are desirable; C and LISP are often the languages of choice. Second, the algorithms used are the result of years of development of dynamic programming techniques, recursion, and other methods that allow flexible pattern detection and matching. Mismatches, density matches, gaps, and other irregularities are all characteristics of "fuzzy" patterns that must be accommodated. Third, many of the techniques used to infer higher order structural information in patterns, such as secondary structural analysis and the heuristic techniques described below, are highly empirical. These methods often have a theoretical foundation or justification, but the actual inferences may be based on statistics over data bases of known structures or on judgmental rules based on experience and analysis of patterns in known systems.

The computer power needed for these symbolic computations is widely available on several general purpose machines. However, as is true for numerical computations, progress in symbol manipulation and pattern matching could be hampered by insufficient hardware. It is becoming more and more difficult to explore complex patterns in large data bases because of the computer time required.
Thus, just as array processors and high performance graphic devices are available to support numerical calculations and present their results, special purpose pattern-matching hardware has been developed to emulate the necessary symbolic computations. This hardware contains the algorithms required for symbol processing. It is often several orders of magnitude faster than the corresponding software.
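To make the reference to dynamic programming concrete, the following is a minimal sketch of one well-known formulation, a global alignment score in the style of Needleman and Wunsch, which tolerates both mismatches and gaps. It is not claimed to be the method used by any program named in this report; the function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming table, so memory grows with len(b) rather than
    len(a) * len(b). Scoring values are illustrative assumptions.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b (all gaps).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The running time is proportional to the product of the two sequence lengths, which helps explain why searching a large data base with such methods becomes costly in computer time.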