FRONTIERS OF BIOINFORMATICS: UNSOLVED PROBLEMS AND CHALLENGES

October 15-17, 2004

POSTER ABSTRACTS



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





Analyze HIV-1 Mutation Evolution as a Conditional Selection Pressure Network
L. Chen, C.J. Lee
Department of Chemistry & Biochemistry, University of California, Los Angeles

Antiretroviral therapy of HIV-1 frequently results in the emergence of drug resistant variants from the viral quasispecies. The development and maintenance of drug resistance usually requires the accumulation of 2 or more mutations. Many drug resistant mutation patterns have been reported, but this information is limited and static. It is useful to obtain a global picture of all the ways the viral population could respond to the current drug treatment. To address this problem, we have developed a conditional selection pressure (Ka/Ks) approach that measures how mutation at one site alters the selection pressure at another site. The conditional Ka/Ks analysis shows that different evolutionary paths to the same final genotype can have very different rates, so some paths are preferred whereas the others are not favored. We have generated a directed diagram to represent the mutation network of HIV-1 protease based on the conditional Ka/Ks analysis. This diagram shows the speeds of all possible paths of evolution the viral population will follow under the pressure of current drug treatment. This evolution network also reveals kinetic traps, individual sites (or groups of sites) that accumulate mutations rapidly, but which lack fast paths to drug resistance mutations (i.e. the rate constants are much slower than from wildtype). We can combine the kinetic map with existing information about specific drug resistance relationships and this may reveal general strategies for slowing the evolution of drug resistance.
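The idea that different orderings of the same mutations can proceed at very different speeds can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the conditional values are invented, the site names "A" and "B" are placeholders, and scoring a path by its slowest step is only one plausible way to compare paths, not necessarily the authors' procedure.

```python
from itertools import permutations

# Hypothetical conditional Ka/Ks values (invented for illustration).
# rate[(background, site)] is the value for mutating `site` when the
# mutations in `background` are already present.
rate = {
    (frozenset(), "A"): 4.0,
    (frozenset(), "B"): 0.5,
    (frozenset({"A"}), "B"): 3.0,
    (frozenset({"B"}), "A"): 6.0,
}

def path_steps(order):
    """Return the conditional value of each step along one mutation order."""
    background = frozenset()
    steps = []
    for site in order:
        steps.append(rate[(background, site)])
        background = background | {site}
    return steps

def rank_paths(sites):
    """Rank all mutation orders by their slowest step, fastest path first."""
    scored = [(min(path_steps(order)), order) for order in permutations(sites)]
    return sorted(scored, reverse=True)

ranking = rank_paths(["A", "B"])
for bottleneck, order in ranking:
    print(" -> ".join(order), "slowest step:", bottleneck)
```

With these invented numbers the order A then B has a much faster slowest step than B then A, which is the kind of path preference the abstract describes.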

Molecular Dynameomics
Ryan Day, David A. C. Beck, Stephen Edwards, Daigo Inoyama, R. Dustin Schaeffer, Robert E. Steward, George W. N. White, and Valerie Daggett

Molecular dynamics simulations allow effective characterization of the dynamics of proteins in water. They have been useful both in characterizing the native state dynamics of proteins and the unfolding process. Simulations have also led to models of the partially unfolded disease states of certain proteins. Molecular dynamics simulations have generally only been considered in the context of the particular system under study, however, limiting their applicability to broader questions of protein folding and dynamics. We have begun an effort to simulate a large number of proteins with different topologies under native and denaturing conditions in order to address this shortcoming. These simulations will be analyzed in a broader context in order to determine general dynamic properties of amino acids and proteins in water. We are calling this effort molecular dynameomics. The thirty most populated folds in the PDB represent 50% of the structures deposited. We have begun our work with simulations of thirty target proteins from these thirty folds. Here we present preliminary results from this set.

Analysis of human alternative splices predicted from exon junction arrays
Katherina Kechris, Jean Yee Hwa Yang, Ru-Fang Yeh
University of California, San Francisco

Following transcription, alternative splicing of exons in a pre-mRNA transcript can create multiple different protein products. This process is an important mechanism which contributes to the protein complexity found in humans. Sequence elements in exons and adjacent introns are critical for regulating alternative splicing. To discover these elements genome-wide, we use the experimental results from Rosetta’s exon-junction array (Johnson et al., 2003) to first identify alternatively and constitutively spliced exons. By applying a variation of their linear model to the data, we specify a score that measures the occurrence of alternative splicing in each gene. Based on a ranking using this score, we identify genes, with their corresponding exons, that are either alternatively or constitutively spliced. Then, by comparing the word counts between these two sets of exons, and their neighboring introns, we find motifs that are associated with alternative splicing and are potential regulators. In particular, we find that adjacent introns of alternatively spliced exons are A/T rich, while those from constitutively spliced exons are G/C rich. For alternatively spliced exons, the motifs discovered tend to be A/G rich and are similar to known exonic sequence enhancers that are naturally occurring or discovered by SELEX.
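The word-count comparison between the two exon sets can be sketched as follows. This is an illustrative assumption about how such a comparison might be set up, not the authors' scoring scheme: the sequences are toy strings, and the log-ratio with a pseudocount is a generic enrichment measure chosen here for simplicity.

```python
from collections import Counter
from math import log2

def kmer_counts(seqs, k):
    """Count every overlapping word of length k across a set of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def log_enrichment(foreground, background, k, pseudo=1.0):
    """log2 ratio of smoothed word frequencies, foreground over background."""
    f, b = kmer_counts(foreground, k), kmer_counts(background, k)
    words = set(f) | set(b)
    f_total = sum(f.values()) + pseudo * len(words)
    b_total = sum(b.values()) + pseudo * len(words)
    return {
        w: log2(((f[w] + pseudo) / f_total) / ((b[w] + pseudo) / b_total))
        for w in words
    }

# Toy stand-ins for alternatively and constitutively spliced exons.
alternative = ["GAAGAAGA", "AGAAGAAG"]
constitutive = ["CCGCGCGC", "GCGCCGCG"]

scores = log_enrichment(alternative, constitutive, k=3)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 2))
```

Words with a positive score are more frequent in the first set; a real analysis would also need a significance test and a correction for testing many words.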

Integrating Sequence and Structure Data for Annotating Functional Sites on Protein Structures
Mike Liang
Stanford University

Structural genomics initiatives are developing high-throughput methods for large-scale determination of all protein structures. The biological roles for many of these proteins are still unknown, and high-throughput computational methods for determining their function are necessary. Understanding the function of these proteins will have a profound impact on drug development and protein engineering. Current methods for functional annotation of these protein structures are based on sequence-only or structure-only analysis. Although sequence-based methods have been quite powerful, they often have limited use when sequence similarity is low. Structure-based methods are less sensitive to sequence similarity, but many of them require manual creation of models and are thus limited by the number of available functional models. Methods for function annotation at a structural-genomics scale will require both greater sensitivity than sequence-only methods provide and more functional models than the structure-based methods offer.

To address these requirements, we have developed a method, SeqFEATURE, for automatically constructing a three-dimensional (3D) model of the functional site by integrating sequence and structure data. The 3D models describe the physicochemical environment around sequence motifs and identify the significant properties that are statistically conserved or absent in the functional site. These 3D models have better sensitivity than one-dimensional (1D) sequence motifs in function annotation. By automatically creating these 3D models from sequence motifs, we have developed a method for building a library of models that can be used in the context of a structural genomics pipeline for functional annotation of protein structures. SeqFEATURE is available on the web at http://feature.stanford.edu/webfeature. Biologists can rapidly annotate their structure with the currently available library of functional models.

Processes and Functions Potentially Regulated by Alternative Splicing Uncovered Through Study of Protein Domains
Shuo Liu
Stanford University

Alternative splicing plays an important role in processes such as development, differentiation, and cancer. With the recent increase in the estimated fraction of human genes that undergo alternative splicing from 5% to 74%, it becomes critical to develop a better understanding of its functional consequences and regulatory mechanisms. We conducted a large-scale study of the distribution of protein domains in a curated data set of several thousand genes and identified protein domains disproportionately distributed among alternatively spliced genes. We also identified a number of protein domains that tend to be spliced out. Both the proteins having the disproportionately distributed domains and those with spliced-out domains are predominantly involved in the processes of cell communication, signaling, development and apoptosis. These proteins function mostly as enzymes, signal transducers, and receptors.

Real Space Protein Model Completion: an Inverse Kinematics Approach
Henry van den Bedem*, Itay Lotan†, Jean-Claude Latombe† and Ashley Deacon*
* Joint Center for Structural Genomics, Stanford Synchrotron Radiation Laboratory, SLAC; † Department of Computer Science, Stanford University

Rapid protein structure determination relies greatly on software that can automatically build a protein model into an experimental electron density map. In favorable circumstances, various software systems are capable of building over 90% of the final model. However, completeness falls off rapidly with the resolution of the diffraction data. Manual completion of these partial models is usually feasible, but is time-consuming and prone to subjective interpretation. Except for the N- and C-termini of the chain, the end points of each missing fragment are known from the initial model. Hence, fitting fragments reduces to an inverse kinematics problem. We have combined fast inverse kinematics algorithms with a real space, torsion angle refinement procedure in a two-stage approach to fit missing main-chain fragments into the electron density between two anchor points. The first stage samples a large number of closing conformations, guided by the electron density. These candidates are ranked according to density fit. In a subsequent refinement stage, optimization steps are projected onto a carefully chosen subspace of conformation space to preserve rigid geometry and closure.

In a test set of 103 structurally diverse fragments within one protein, the algorithm closed gaps of 12 residues in length to within, on average, 0.52Å all-atom Root Mean Square Deviation (aaRMSD) from the final, refined structure at a resolution of 2.8Å. The algorithm has also been tested and used to aid protein model completion in areas of weak or ambiguous experimental electron density, where an initial model was built using ARP/wARP (Perrakis et al., 1999) or RESOLVE (Terwilliger, 2002). At a resolution of 2.4Å, it closed a 10-residue gap to within 0.43Å aaRMSD of the final, refined structure. In another case, a 14-residue gap in a 51%-complete model built at 2.6Å was closed to within 0.9Å aaRMSD. Our method was furthermore used to correctly identify and build multiple, alternative main-chain conformations at a resolution of 1.8Å.
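The aaRMSD figures quoted above are root mean square deviations over matched atoms. A minimal sketch of that calculation is given here, assuming the built fragment and the refined reference are already in the same coordinate frame (so no superposition step is applied) and that atoms are listed in the same order; the coordinates are invented.

```python
from math import sqrt

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered atom lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return sqrt(total / len(coords_a))

# Invented coordinates (in Å): every atom is displaced by 1 Å along z.
built = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
refined = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(built, refined)
print(deviation)
```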

Genomic Comparisons among γ-proteobacteria
Jan Mrázek1, Alfred M. Spormann2 and Samuel Karlin1
1 Department of Mathematics and 2 Departments of Civil and Environmental Engineering, of Biological Sciences, and of Geological and Environmental Sciences, Stanford University

Highly expressed genes in most unicellular and some multicellular organisms exhibit characteristic codon usage biases that distinguish them from the bulk of genes in a genome. We have developed a method to identify predicted highly expressed (PHX) genes in complete genomes. PHX genes are compared for sixteen γ-proteobacteria, and their similarities and differences are interpreted with respect to known or predicted physiological characteristics of the organisms. PHX genes often reflect the organism’s lifestyle, habitat, nutrition sources and metabolic propensities. This technique allows prediction of the predominant metabolic activities of microorganisms operating in their natural habitats.

Among the most striking findings is an unusually high number of PHX enzymes acting in cell wall biosynthesis, amino acid biosynthesis and replication in the ant endosymbiont Blochmannia floridanus. We ascribe the abundance of these PHX genes to specific aspects of the relationship between the bacterium and its host. Xanthomonas campestris is also unique, with a very high number of PHX genes acting in flagellum biosynthesis, which may play a special role in its pathogenicity. Shewanella oneidensis possesses three protein complexes that can all function as complex I in the respiratory chain, but only the Na+-transporting NADH:ubiquinone oxidoreductase nqr-2 operon is PHX. The PHX genes of Vibrio parahaemolyticus are consistent with the microorganism’s adaptation for extremely fast growth rates.

Comparative analysis of PHX genes from complex environmental genomic sequences as well as from uncultured pathogenic microbes can provide a novel, useful tool to predict global flux of matter and key intermediates, as well as specific targets of antimicrobial agents.

MotifCut: Motif Finding and Spectral Graph Theory
Brian Naughton

Computationally identifying conserved motifs in DNA sequences is important for understanding gene regulation. For example, it is used to find conserved DNA motifs in the upstream sequences of co-expressed genes from microarray and chromatin immunoprecipitation (ChIP) experiments. We have approached the motif-finding problem in a novel way, taking inspiration from an image analysis algorithm used to identify the foreground of an image and separate it from the background. We build a graph where all of the words k bases long (k-mers) in the sequence are represented as nodes. K-mers that are similar to each other are then connected with an edge in the graph. Usually, motif-finding algorithms search the DNA sequence for a set of k-mers that are as similar to each other as possible. More advanced algorithms may also ensure that the motif is different from a model of the background sequences. We identify groups of k-mers that are similar to each other (the “foreground”) but also as different as possible from all of the other k-mers in the graph (the “background”). This problem is computationally difficult (NP-complete), so we use an eigendecomposition as an approximation to the exact solution. MotifCut shows improved results over other motif-finding methods under many conditions.
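The graph construction and the spectral approximation can be illustrated with a deliberately simplified sketch. It is not MotifCut's actual objective or implementation: here the edge weight is the fraction of matching bases (an assumption), edges are kept only for k-mers differing at no more than one position (an assumption), and the leading eigenvector of the weighted adjacency matrix, found by power iteration, is used as a stand-in for the foreground/background cut. The sequences are invented, with TTGACA planted in each.

```python
def kmer_nodes(seqs, k):
    """One node per k-mer occurrence: (sequence index, position, word)."""
    return [(i, p, s[p:p + k]) for i, s in enumerate(seqs)
            for p in range(len(s) - k + 1)]

def build_graph(nodes, k, max_mismatch=1):
    """Weighted adjacency matrix; weight is the fraction of matching bases."""
    n = len(nodes)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        w[i][i] = 1.0  # self-loop so power iteration cannot oscillate
        for j in range(i + 1, n):
            matches = sum(a == b for a, b in zip(nodes[i][2], nodes[j][2]))
            if matches >= k - max_mismatch:
                w[i][j] = w[j][i] = matches / k
    return w

def leading_eigenvector(w, iters=200):
    """Power iteration from a uniform start vector."""
    n = len(w)
    v = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        v = [x / norm for x in nxt]
    return v

seqs = ["ACGTTGACAGT", "CCTTGACATCG", "GATTGACAAGC"]  # toy input
k = 6
nodes = kmer_nodes(seqs, k)
scores = leading_eigenvector(build_graph(nodes, k))
top = sorted(range(len(nodes)), key=lambda i: scores[i], reverse=True)[:3]
for i in top:
    print(nodes[i], round(scores[i], 3))
```

The highest-scoring nodes form the most tightly connected group of k-mers, which in this toy input is the planted word. A fuller treatment would also penalize similarity to the background, as the abstract describes.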

Whole Proteome Functional Annotation via Automated Detection of Ligand 3-D Binding Site Motifs: Application to ATP- and GTP-Binding Sites in Unannotated Proteins of Dictyostelium discoideum
Vicente M. Reyes
University of California, San Diego

The first few years of the new millennium have seen an avalanche of genome sequence data, a trend that is predicted to continue for the next 10 to 15 years. Assignment of function (“annotation”) to these mostly novel sequences is fast becoming a top priority, for obvious basic and medical reasons. However, due to the sheer number of sequences to be annotated, conventional functional assay techniques involving gene cloning, protein expression and purification, etc., become extremely impractical even if implemented in high-throughput fashion including robotics. The solution lies in tapping the ever-increasing power of computers and the versatility of “smart” databases to predict the function of novel sequences, especially proteins, on the basis of their primary sequences alone. The present study, a part of the grand-challenge initiative “The Encyclopedia of Life” at the San Diego Supercomputer Center, illustrates whole proteome functional annotation using computational techniques.

Unlike DNA, which generally functions at the level of primary sequence, and RNA, which generally functions at the level of secondary structure, proteins function at the level of tertiary structure. The first step in our study is therefore the prediction of the 3-D structure of all the ORFs in the genome of our test organism, Dictyostelium discoideum, via a combination of homology modeling and threading algorithms, and a final step of model completion using the program Modeller6v2. These predicted 3-D structures are screened in a later step for various ligand 3-D binding site motifs.

The second step is the construction of the 3-D binding site motif of a given ligand, using as a “training set” several (typically 8-12, but see below) experimentally solved protein structures with the ligand of interest bound.

The third and final step is the screening of the predicted 3-D structures of the D. discoideum proteins (from step 1) for the 3-D binding site motif (from step 2) using a set of Fortran 77 and 90 programs we developed. The program set treats the motif as a tree structure with a root, nodes, branches and edges, and searches for such a “tree” in the protein structures. Since the latter are predicted and as such carry significant degrees of uncertainty, we have tried to minimize the effects of such uncertainties by incorporating fuzzy logic into the screening process. This was done by using two types of reduced representation of the proteins in the test set, namely: (a) representing the protein as a collection of the centroids of its constituent amino acids, and (b) representing the protein as the aggregate of its backbone atoms and the centroids of the side chains of its constituent amino acids. Another quality assurance step is the use of the program FADE, which detects crevice and pocket residues in protein 3-D structures. In building the binding site motif, we make sure that at least one (preferably two or more) of its constituent residues is located in a crevice or pocket in an effort to reduce false positives, the rationale being that binding site residues are almost always found in such locations.

We chose ATP and GTP as the pilot ligands for this study. As the ATP-binding protein family is quite heterogeneous, we first narrowed down our study to (1) ser/thr protein kinases (PKs), (2) cAMP-dependent PKs, and (3) ABC transporters. The training set for (1) contained sufficient members, but those of (2) and (3) contained only one member each, as there was only one experimentally solved structure each of a cAMP-dependent PK and an ABC transporter, with bound ATP, on deposit in the PDB. Nevertheless, binding site motifs for all three families were successfully built.

Control screens were then performed. Positive controls composed of ser/thr PKs with available experimental 3-D structures, and negative controls composed of proteins known not to bind ATP, also with available experimental 3-D structures, were subjected to the screening procedure. Both yielded the expected results, albeit with a small proportion of false negatives and an even smaller proportion of false positives. The 400 previously unknown and unannotated proteins of D. discoideum were then chosen as the pilot test set. Two binding site motifs were deduced from the ser/thr PK training set, one with the two ribose hydroxyls “bound” to a single protein residue centroid (type 1), and a second where the two hydroxyls are separately “bound” (type 2). Screening the test set for the ser/thr ATP binding site motif type 1, our program detected 32 putative ATP-binding proteins, of which closer human inspection indicated only 15 to be true positives. Screening for type 2, the program picked up 25 putative proteins, of which closer inspection indicated that all but 1 may be false positives. It remains to be confirmed by laboratory experiments whether the 16 true positives (15 type 1 and 1 type 2) picked up by our program indeed bind ATP.

Interestingly, our program did not detect any putative cAMP-dependent PK or ABC transporter in the test set. This may be because the training sets for these two families each contained only one member (as those were the only ones available from the PDB), making the screen overly specific and therefore quite insensitive. Similar work on GTP is underway. We are also currently trying to incorporate the programs for all the steps in the procedure into one main calling script in order to streamline the screening procedure and make it less dependent on human intervention, and therefore more amenable to complete automation and, in turn, more suited to large-scale, whole-proteome screening.
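The two reduced representations used in the screening step, (a) one centroid per amino acid and (b) backbone atoms plus one side-chain centroid per amino acid, can be sketched as below. The original programs were written in Fortran; this Python version is only an illustration. The atom records are invented, the centroids are unweighted geometric means (an assumption; the abstract does not say whether masses were used), and the backbone is taken to be the standard PDB atom names N, CA, C and O.

```python
from collections import defaultdict

BACKBONE = {"N", "CA", "C", "O"}  # standard PDB backbone atom names

def centroid(points):
    """Unweighted geometric centre of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def all_atom_centroids(atoms):
    """Representation (a): one centroid per residue, over all of its atoms."""
    by_residue = defaultdict(list)
    for res_id, _name, xyz in atoms:
        by_residue[res_id].append(xyz)
    return {res: centroid(pts) for res, pts in by_residue.items()}

def backbone_plus_sidechain(atoms):
    """Representation (b): backbone atoms kept, side chains reduced to centroids."""
    backbone, side = [], defaultdict(list)
    for res_id, name, xyz in atoms:
        if name in BACKBONE:
            backbone.append((res_id, name, xyz))
        else:
            side[res_id].append(xyz)
    # Residues without side-chain atoms (e.g. glycine) get no centroid.
    return backbone, {res: centroid(pts) for res, pts in side.items()}

# Invented atom records: (residue id, atom name, coordinates in Å).
atoms = [
    ("SER1", "N", (0.0, 0.0, 0.0)),
    ("SER1", "CA", (2.0, 0.0, 0.0)),
    ("SER1", "CB", (2.0, 2.0, 0.0)),
    ("SER1", "OG", (4.0, 2.0, 0.0)),
    ("GLY2", "CA", (6.0, 0.0, 0.0)),
]

rep_a = all_atom_centroids(atoms)
rep_b_backbone, rep_b_side = backbone_plus_sidechain(atoms)
print(rep_a)
print(rep_b_side)
```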

SampleScan: A Sampling Approach to Motif Discovery in Nucleotide Sequences
Serge Saxonov, Serafim Batzoglou, Douglas L Brutlag
Stanford University

When looking for DNA motifs on a genomic scale one is faced with two main challenges. One is that most motifs are likely to be present in only a small number of sequences, making it harder for conventional motif finders to spot them. The second is that the sheer size of the data often precludes the application of many well-established algorithms. In this work we present a sampling-based method geared specifically toward discovery of motifs in large sequence sets. In outline, the method works by constructing many small subsets of sequences by sampling from the whole set, followed by an application of a motif finder to each of the subsets. The motivation behind the approach is that a motif that is present in too small a fraction of the sequences to be discovered by a conventional motif finder will be enriched in some of the subsets, allowing it to be discovered. Each of the candidate motifs is tested and refined by scanning the entire sequence set. To help with this approach we constructed a new motif finder that outperforms other programs when run on small data sets. As a test case we have applied the SampleScan approach to the set of yeast upstream regions. We have shown that the method can recover a substantial fraction of known yeast motifs. In addition, we have used comparative genomics information and location bias to validate the motifs.
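The sample-then-scan outline can be sketched in a few lines. This is an assumption-laden toy, not the SampleScan implementation: the function names are hypothetical, the "motif finder" is a trivial stand-in that reports exact k-mers shared by at least two sequences of a subset, and the sequences are invented with TTGACA planted in 3 of 12.

```python
import random

def kmers_of(seq, k):
    """Set of distinct words of length k in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def toy_motif_finder(subset, k, min_support=2):
    """Stand-in motif finder: k-mers shared by at least `min_support` sequences."""
    support = {}
    for seq in subset:
        for word in kmers_of(seq, k):
            support[word] = support.get(word, 0) + 1
    return {word for word, count in support.items() if count >= min_support}

def sample_scan(seqs, k, subset_size, n_subsets, seed=0):
    """Collect candidates from random subsets, then scan the whole set."""
    rng = random.Random(seed)
    candidates = set()
    for _ in range(n_subsets):
        candidates |= toy_motif_finder(rng.sample(seqs, subset_size), k)
    # Scan step: number of sequences in the full set containing each candidate.
    return {word: sum(word in s for s in seqs) for word in candidates}

seqs = [
    "ACACTTGACACA", "GTGTTGACAGTG", "AGATTGACAGAG",  # contain TTGACA
    "ACACACACACAC", "GTGTGTGTGTGT", "AGAGAGAGAGAG",
    "CTCTCTCTCTCT", "ATATATATATAT", "CGCGCGCGCGCG",
    "AACCAACCAACC", "GGTTGGTTGGTT", "CCGGCCGGCCGG",
]

result = sample_scan(seqs, k=6, subset_size=4, n_subsets=200)
for word, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(word, support)
```

A motif present in a quarter of the sequences is a minority in the full set, but a subset of four will often contain it twice, which is what lets the small-set finder pick it up.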

Assignment of structural domains in proteins: why is it so difficult?
Stella Veretnik1, Ilya N. Shindyalov1, Nickolai N. Alexandrov and Phillip E. Bourne1,2
1 San Diego Supercomputer Center, University of California, San Diego; 2 Department of Pharmacology, University of California, San Diego

Structural domains are often considered to be basic units of protein structure. Assignment of structural domains from atomic coordinates is crucial for understanding protein evolution and function. Currently there is no good agreement among different assignment methods on what constitutes the basic structural unit, underscoring the complexity of structural domain assignment. This work discusses tendencies of individual methods and highlights the problematic areas in assignment of structural domains by experts as well as by fully automated methods.

Domain assignments were analyzed for three automatic methods (DALI[1], DomainParser[2], PDP[3]) and three expert methods (AUTHORS[4], CATH[5], SCOP[6]), using a 467-chain dataset assigned by all 6 methods. The following features were investigated: agreement on the number of assigned domains, agreement on domain boundaries, distribution of domain sizes and tendency toward assignment of discontinuous domains. Consensuses among automatic, expert and all methods were defined and used during comparison to tease out the behaviors specific to individual assignment methods or groups of methods. We observe that unambiguous domain assignments (when all methods agree on domain assignment) are confined predominantly to one-domain chains. Agreements among all methods in multi-domain chains are infrequent; in all such cases the domains are compact and clearly spatially separated. For the majority of multi-domain proteins, there is no agreement on domain assignment among all methods.

From the consensus analysis we observe that the majority of the difficulties of fully automated methods stem from overwhelming reliance on structural cues (compactness/contact density) during domain assignment and the lack of functional/evolutionary information. Thus the cases in which domains are positioned close together are difficult or impossible for automatic methods to resolve. On the other hand, the differences in expert methods arise from different philosophical approaches underlying the specific methods. Authors of the structures (AUTHORS method) tend to define domains based on functionality, which may produce small and structurally not clearly defined domains. The creators of SCOP, on the other hand, often look for the largest common structure (fold) as a domain, which often consists of several distinctive structural units. The CATH method appears to strike a balance between sometimes contradictory structural, functional and evolutionary information. The inconsistencies in expert assignments are well reflected in the propensities of different fully automated methods, as those are trained and validated using a specific expert method, thus reflecting its philosophical biases.

Detailed analysis of structures for which the assignment methods reach no consensus on the number of assigned domains indicates the following problematic areas: (1) assignment of small domains, (2) discontinuous domains and unassigned regions in the structure, (3) splitting of the secondary structure elements between domains (if required), (4) convoluted domain interfaces and complicated architectures. Comprehensive domain re-definition that takes the above issues into account is overdue and will be a great step toward improving domain definitions in multi-domain proteins, which represent (by one estimate [7]) 66-75% of the sequence database. Also, the intensive growth of 3D protein data demands fully automated approaches to maintain currency and uniformity of domain information relative to the PDB.

REFERENCES
[1] Holm, L. & Sander, C. 1996. Mapping the protein universe. Science 273, 595-602.
[2] Guo, J.-T., Xu, D., Kim, D. & Xu, Y. 2003. Improving the performance of DomainParser for structural domain partition using neural network. Nucleic Acids Res. 31(3), 944-952.
[3] Alexandrov, N. & Shindyalov, I. 2003. PDP: protein domain parser. Bioinformatics 19, 429-430.
[4] Islam, S. A., Luo, J. & Sternberg, M. J. 1995. Identification and analysis of domains in proteins. Protein Eng. 8, 513-25.
[5] Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. 1997. CATH – a hierarchic classification of protein domain structures. Structure 5, 1093-108.
[6] Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-40.
[7] Chothia, C., Gough, J., Vogel, C. & Teichmann, S. A. 2003. Evolution of the protein repertoire. Science 300, 1701-03.

Detecting Tissue-Specific Regulation of Alternative Splicing as a Qualitative Change in Microarray Data
Qi Wang
University of California, Los Angeles

Alternative splicing has recently emerged as a major mechanism of regulation in the human genome, occurring in perhaps 40-60% of human genes. Thus microarray studies of functional regulation should in principle be extended to detect not only changes in the overall expression of a gene, but also changes in its splicing pattern between different tissues. However, since changes in the total expression of a gene and changes in its alternative splicing can be mixed in complex ways among a set of samples, separating these effects can be difficult, and is essential for their accurate assessment. We present a simple and general approach for distinguishing changes in alternative splicing from changes in expression, based on detecting systematic anti-correlation between two different samples’ log ratios versus a pool containing both samples. We have tested this analysis method on microarray data for five human tissues, generated using a standard microarray platform and experimental protocols previously shown to be sensitive to alternative splicing. Our automatic analysis was able to detect a wide variety of tissue-specific alternative splicing events such as exon skipping, mutually exclusive exons, alternative 3’ and alternative 5’ splicing, alternative initiation and alternative termination, all of which were validated by independent reverse-transcriptase PCR experiments, with validation rates of 70-85%. Our analysis method also enables hierarchical clustering of genes and samples by the level of similarity of their alternative splicing patterns, revealing patterns of tissue-specific regulation that are distinct from those obtained by hierarchical clustering of gene expression from the same microarray data.
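One way to read the anti-correlation test is sketched below. It is an interpretation, not the author's published procedure: for each gene, the log ratios of two tissues against their shared pool are compared across that gene's probes with a Pearson correlation. Because the correlation subtracts each tissue's mean, a uniform expression shift drops out, while a probe that goes up in one tissue and down in the other drives the correlation negative. The probe values and the cutoff of -0.8 are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def looks_like_splicing_change(log_a, log_b, cutoff=-0.8):
    """Flag a gene whose per-probe log ratios are strongly anti-correlated."""
    return pearson(log_a, log_b) <= cutoff  # cutoff is a hypothetical choice

# Invented per-probe log ratios (tissue vs. pool) for two genes.
# Gene 1: the third probe is up in tissue A and down in tissue B.
splice_a = [0.1, 0.0, 1.0, 0.1]
splice_b = [0.0, 0.1, -0.9, 0.0]
# Gene 2: every probe shifts together (expression change only).
expr_a = [0.5, 0.6, 0.4, 0.5]
expr_b = [-0.5, -0.4, -0.6, -0.5]

print(pearson(splice_a, splice_b), looks_like_splicing_change(splice_a, splice_b))
print(pearson(expr_a, expr_b), looks_like_splicing_change(expr_a, expr_b))
```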

Alternative Splicing Opens Neutral Paths for Genome Evolution
Yi Xing
University of California, Los Angeles

The role of alternative splicing in evolution has become a subject of many recent investigations. It has been proposed that alternative splicing can reduce negative selection pressure within selected portions of a gene, accelerating its rate of evolution. Here we discuss several lines of evidence that support this hypothesis. Alternative splicing is frequently associated with recent creation and loss of exons in mammals. Alternative splicing relieves negative selection pressure against premature protein truncations, to an extent similar to that produced by diploidy. Alternatively spliced regions undergo relaxed purifying selection pressure compared to other portions of the gene. Our data suggest that alternative splicing is able to open neutral paths for evolution, a principle that can add significantly to the current theory of molecular evolution.

Identifying functional importance of NCS conserved across multiple species
Na Xu
University of California, Berkeley

One of the goals in computational biology is to identify the functional elements in genomes. Highly conserved non-coding sequences (NCS) across multiple species are good candidates for functional regulatory regions. We examined 2094 NCS conserved among human, mouse and Fugu, explored their features, and attempted to pick out true functional regions. We gathered Gene Ontology (GO) and gene expression data for the neighbor genes of the NCS. With high statistical significance, we found that NCS neighbor genes are over-represented in several GO categories: transcription regulator activity, development and binding. Further analysis using expression data showed that the NCS genes are more likely to be over-expressed in nerve and brain tissues. We related the two analyses and combined these with other factors such as NCS density. Simple significance tests and clustering methods have provided us with an initial filtering of NCS based on the functional annotation and experimental information. These results provide a promising first set of NCS examples for further exploration.
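Over-representation of a GO category among a gene set is commonly assessed with a hypergeometric tail probability. The abstract does not say which test was used, so the sketch below is an assumed, generic choice, and the counts in the example are invented.

```python
from math import comb

def hypergeom_tail(k, population, annotated, drawn):
    """P(X >= k): chance of seeing at least k annotated genes among `drawn`
    genes sampled without replacement from `population` genes, of which
    `annotated` carry the GO term."""
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, drawn)

# Invented example: 20 genes in total, 5 annotated with the term,
# 4 genes neighbouring conserved elements, 3 of them annotated.
p_value = hypergeom_tail(3, population=20, annotated=5, drawn=4)
print(round(p_value, 4))
```

In a genome-wide analysis the same calculation would be repeated for every GO category, so the resulting p-values would need a multiple-testing correction.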

A Cellular Architecture Ontology for Analyzing Protein-Protein Interactions Based on Subcellular Localizations
Iwei Yeh and Russ B. Altman
Stanford University

We introduce a cellular architecture ontology that encodes knowledge about cellular components, including membranes, spaces, and membrane-bound compartments, and their spatial relationships to each other. This ontology facilitates computational reasoning based on protein localization on a large scale and in a systematic fashion. To demonstrate the usefulness of our ontology in automatic reasoning, we developed rules to define the accessibility of cellular components and to define the accessibility and location of proteins. Using these rules, we automatically evaluated the physical accessibility of Saccharomyces cerevisiae proteins based on localizations provided by Saccharomyces Genome Database (SGD) and the accuracy of protein-protein interactions from the Database of Interacting Proteins (DIP). We found areas of inconsistency between these data resources and proposed refinement of protein localizations based on localizations of interacting proteins. In some cases our ontology allowed us to propose novel localizations using simple logical rules. Our cellular architecture ontology contains links to the Gene Ontology (GO), but provides a much richer framework for supporting computational inference.
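Rule-based reasoning over compartments can be illustrated with a small sketch. The ontology fragment, the accessibility rule and the protein names below are all hypothetical and far simpler than the ontology described: a membrane is recorded with the two spaces it separates, and two localizations are treated as mutually accessible if they are identical or if one is a membrane bordering the other.

```python
# Hypothetical toy ontology: each membrane lists the two spaces it separates.
MEMBRANES = {
    "plasma membrane": ("extracellular space", "cytosol"),
    "nuclear envelope": ("cytosol", "nucleoplasm"),
    "mitochondrial outer membrane": ("cytosol", "mitochondrial intermembrane space"),
    "mitochondrial inner membrane": ("mitochondrial intermembrane space", "mitochondrial matrix"),
}

def accessible(loc_a, loc_b):
    """Same location, or a membrane and one of the spaces it borders."""
    if loc_a == loc_b:
        return True
    return loc_b in MEMBRANES.get(loc_a, ()) or loc_a in MEMBRANES.get(loc_b, ())

def inconsistent_interactions(localization, interactions):
    """Interactions whose partners share no mutually accessible localization."""
    flagged = []
    for a, b in interactions:
        if not any(accessible(x, y) for x in localization[a] for y in localization[b]):
            flagged.append((a, b))
    return flagged

# Invented proteins and reported interactions.
localization = {
    "P1": {"cytosol"},
    "P2": {"plasma membrane"},
    "P3": {"mitochondrial matrix"},
}
interactions = [("P1", "P2"), ("P1", "P3")]

flagged = inconsistent_interactions(localization, interactions)
print(flagged)
```

A flagged pair points either to a doubtful interaction or to an incomplete localization record, which matches the two kinds of refinement the abstract mentions.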

Genome-wide association mapping of flowering time in the model plant Arabidopsis thaliana
Keyan Zhao1, María-José Aranzana1, Sung Kim1, John Molitor2, Paul Marjoram2, Fengzhu Sun1, Magnus Nordborg1
1 Molecular and Computational Biology Program; 2 Department of Preventive Medicine, University of Southern California, Los Angeles

A genome-wide association mapping study was conducted to search for genes controlling flowering time in Arabidopsis thaliana. We have applied several algorithms based on haplotype sharing. Using 906 fragments of sequenced polymorphism data from 95 accessions, we found some strong peaks in genes known to control flowering time in A. thaliana. Increasing the sample size to 192 still gave strong signals in FRI and other genes. The clustering algorithms successfully detected 2 known functional alleles in the FRI gene. We have also found some other interesting regions associated with flowering time using our algorithm, although they may need further validation through experimental crosses. Population structure remains a challenging issue, producing spurious associations between genotype and phenotype. This study demonstrates the promise of using LD mapping to study the genetic basis of complex traits and potentially genome-wide disease association.