
Adrian-Horia Dediu
Carlos Martín-Vide
Bianca Truthe (Eds.)
LNBI 8542

Algorithms for
Computational Biology
First International Conference, AlCoB 2014
Tarragona, Spain, July 1–3, 2014
Proceedings

Lecture Notes in Bioinformatics 8542

Subseries of Lecture Notes in Computer Science

LNBI Series Editors

Sorin Istrail
Brown University, Providence, RI, USA
Pavel Pevzner
University of California, San Diego, CA, USA
Michael Waterman
University of Southern California, Los Angeles, CA, USA

LNBI Editorial Board

Alberto Apostolico
Georgia Institute of Technology, Atlanta, GA, USA
Søren Brunak
Technical University of Denmark, Kongens Lyngby, Denmark
Mikhail S. Gelfand
IITP, Research and Training Center on Bioinformatics, Moscow, Russia
Thomas Lengauer
Max Planck Institute for Informatics, Saarbrücken, Germany
Satoru Miyano
University of Tokyo, Japan
Eugene Myers
Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Marie-France Sagot
Université Lyon 1, Villeurbanne, France
David Sankoff
University of Ottawa, Canada
Ron Shamir
Tel Aviv University, Ramat Aviv, Tel Aviv, Israel
Terry Speed
Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia
Martin Vingron
Max Planck Institute for Molecular Genetics, Berlin, Germany
W. Eric Wong
University of Texas at Dallas, Richardson, TX, USA
Volume Editors
Adrian-Horia Dediu
Rovira i Virgili University, Research Group on Mathematical Linguistics
Avinguda Catalunya, 35, 43002 Tarragona, Spain

Carlos Martín-Vide
Rovira i Virgili University, Research Group on Mathematical Linguistics
Avinguda Catalunya, 35, 43002 Tarragona, Spain

Bianca Truthe
Justus-Liebig-Universität, Fachbereich 07, Institut für Informatik
Arndtstraße 2, 35392 Gießen, Germany

ISSN 0302-9743 e-ISSN 1611-3349

ISBN 978-3-319-07952-3 e-ISBN 978-3-319-07953-0
DOI 10.1007/978-3-319-07953-0
Springer Cham Heidelberg New York Dordrecht London

Library of Congress Control Number: 2014940380

LNCS Sublibrary: SL 8 – Bioinformatics

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication
or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location,
in its current version, and permission for use must always be obtained from Springer. Permissions for use
may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution
under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

These proceedings contain the papers that were presented at the First International
Conference on Algorithms for Computational Biology (AlCoB 2014), held in Tarragona, Spain,
during July 1–3, 2014.
The scope of AlCoB includes topics of either theoretical or applied interest,
namely:
– Exact sequence analysis
– Approximate sequence analysis
– Pairwise sequence alignment
– Multiple sequence alignment
– Sequence assembly
– Genome rearrangement
– Regulatory motif finding
– Phylogeny reconstruction
– Phylogeny comparison
– Structure prediction
– Proteomics: molecular pathways, interaction networks, etc.
– Transcriptomics: splicing variants, isoform inference and quantification, differential
analysis, etc.
– Next-generation sequencing: population genomics, metagenomics, metatranscriptomics, etc.
– Microbiome analysis
– Systems biology
AlCoB 2014 received 39 submissions. Most papers were reviewed by three and some by two
Program Committee members. Several external referees were also consulted; we acknowledge
all the reviewers in the next section. After a thorough and lively discussion phase, the
committee decided to accept 20 papers (which represents an acceptance rate of 51.28%). The
conference program also included two invited talks and one invited tutorial. Part of the
success in the management of this number of submissions is due to the excellent facilities
provided by the EasyChair conference management system.
We would like to thank all invited speakers and authors for their contributions, the Program
Committee and the reviewers for their cooperation, and Springer for its very professional
publishing work.

April 2014
Adrian-Horia Dediu
Carlos Martín-Vide
Bianca Truthe
Organization

AlCoB 2014 was organized by the Research Group on Mathematical Linguistics – GRLMC,
from Rovira i Virgili University, Tarragona.

Program Committee
Tatsuya Akutsu              Kyoto University, Japan
Amihood Amir                Bar-Ilan University, Ramat-Gan, Israel
Alberto Apostolico          Georgia Institute of Technology, Atlanta, USA
Joel Bader                  Johns Hopkins University, Baltimore, USA
Pierre Baldi                University of California, Irvine, USA
Serafim Batzoglou           Stanford University, USA
Bonnie Berger               Massachusetts Institute of Technology, Cambridge, USA
Francis Y.L. Chin           University of Hong Kong, Hong Kong
Benny Chor                  Tel Aviv University, Israel
Keith A. Crandall           George Washington University, Washington DC, USA
Bhaskar DasGupta            University of Illinois, Chicago, USA
Joaquín Dopazo              Príncipe Felipe Research Center, Valencia, Spain
Liliana Florea              Johns Hopkins University, Baltimore, USA
Olivier Gascuel             LIRMM-CNRS, Montpellier, France
David Gilbert               Brunel University, Uxbridge, UK
Gaston H. Gonnet            ETH Zürich, Switzerland
Roderic Guigó               Center for Genomic Regulation, Barcelona, Spain
Dan Gusfield                University of California, Davis, USA
Vasant Honavar              Pennsylvania State University, University Park, USA
Sorin Istrail               Brown University, Providence, USA
Tao Jiang                   University of California, Riverside, USA
Inge Jonassen               University of Bergen, Norway
Anders Krogh                University of Copenhagen, Denmark
Giovanni Manzini            University of Eastern Piedmont, Alessandria, Italy
Carlos Martín-Vide (Chair)  Rovira i Virgili University, Tarragona, Spain
Satoru Miyano               University of Tokyo, Japan
Burkhard Morgenstern        University of Göttingen, Germany
Shinichi Morishita          University of Tokyo, Japan
Cédric Notredame            Center for Genomic Regulation, Barcelona, Spain
Graziano Pesole             National Research Council, Bari, Italy
Mark Ragan                  University of Queensland, Brisbane, Australia
Timothy Ravasi              King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Allen G. Rodrigo            Duke University, Durham, USA
Steven Salzberg             Johns Hopkins University, Baltimore, USA
David Sankoff               University of Ottawa, Canada
Thomas Schiex               INRA Toulouse, France
João Carlos Setubal         University of São Paulo, Brazil
Steven Skiena               Stony Brook University, USA
Peter F. Stadler            University of Leipzig, Germany
Wing-Kin Sung               National University of Singapore, Singapore
Alfonso Valencia            Spanish National Cancer Research Centre, Madrid, Spain
Jacques van Helden          University of Aix-Marseille, France
Arndt von Haeseler          Center for Integrative Bioinformatics Vienna, Austria
Lusheng Wang                City University of Hong Kong, Hong Kong
Limsoon Wong                National University of Singapore, Singapore
Xiaohui Xie                 University of California, Irvine, USA
Dong Xu                     University of Missouri, Columbia, USA
Zohar Yakhini               Agilent Laboratories, Santa Clara, USA
Alex Zelikovsky             Georgia State University, Atlanta, USA
Michael Q. Zhang            University of Texas, Dallas, USA

External Reviewers

Artyomenko, Alexander
Chateau, Annie
De Givry, Simon
Doi, Koichiro
Ehsani, Sepehr
Gonnet, Pedro
Hamelryck, Thomas
Hermelin, Danny
Katsirelos, George
Kifer, Ilona
Kim, Daehwan
Kurowski, Krzysztof
Leibovich, Limor
Leung, Henry
Mandric, Igor
Park, Hee-Won
Puglisi, Simon J.
Sheridan, Paul
Srihari, Sriganesh
Swenson, Krister M.
Wang, Yi
Wood, Derrick
Zagrovic, Bojan
Zheng, Chunfang

Organizing Committee
Adrian-Horia Dediu, Tarragona
Carlos Martín-Vide, Tarragona (Chair)
Bianca Truthe, Gießen
Lilica Voicu, Tarragona
Table of Contents

Invited Talks

Comparative Genomics Approaches to Identifying Functionally Related Genes
    Michael Y. Galperin and Eugene V. Koonin

Regular Papers

A Greedy Algorithm for Hierarchical Complete Linkage Clustering
    Ernst Althaus, Andreas Hildebrandt, and Anna Katharina Hildebrandt

Vester’s Sensitivity Model for Genetic Networks with Time-Discrete Dynamics
    Liana Amaya Moreno, Ozlem Defterli, Armin Fügenschuh, and Gerhard-Wilhelm Weber

Complexity and Polynomial-Time Approximation Algorithms around the Scaffolding Problem
    Annie Chateau and Rodolphe Giroudeau

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations
    Thiago da Silva Arruda, Ulisses Dias, and Zanoni Dias

On Low Treewidth Graphs and Supertrees
    Alexander Grigoriev, Steven Kelk, and Nela Lekić

On Optimal Read Trimming in Next Generation Sequencing and Its Complexity
    Ivo Hedtke, Ioana Lemnian, Matthias Müller-Hannemann, and Ivo Grosse

On the Implementation of Quantitative Model Refinement
    Bogdan Iancu, Diana-Elena Gratie, Sepinoud Azimi, and Ion Petre

HapMonster: A Statistically Unified Approach for Variant Calling and Haplotyping Based on
Phase-Informative Reads
    Kaname Kojima, Naoki Nariai, Takahiro Mimori, Yumi Yamaguchi-Kabata, Yukuto Sato,
    Yosuke Kawai, and Masao Nagasaki

Mapping-Free and Assembly-Free Discovery of Inversion Breakpoints from Raw NGS Reads
    Claire Lemaitre, Liviu Ciortuz, and Pierre Peterlongo

Modeling the Geometry of the Endoplasmic Reticulum Network
    Laurent Lemarchand, Reinhardt Euler, Congping Lin, and Imogen Sparkes

On Sorting of Signed Permutations by Prefix and Suffix Reversals and Transpositions
    Carla Negri Lintzmayer and Zanoni Dias

On the Diameter of Rearrangement Problems
    Carla Negri Lintzmayer and Zanoni Dias

Efficiently Enumerating All Connected Induced Subgraphs of a Large Molecular Network
    Sean Maxwell, Mark R. Chance, and Mehmet Koyutürk

On Algorithmic Complexity of Biomolecular Sequence Assembly Problem
    Giuseppe Narzisi, Bud Mishra, and Michael C. Schatz

A Closed-Form Solution for Transcription Factor Activity Estimation Using Network
Component Analysis
    Amina Noor, Aitzaz Ahmad, Bilal Wajid, Erchin Serpedin, Mohamed Nounou, and
    Hazem Nounou

SVEM: A Structural Variant Estimation Method Using Multi-mapped Reads on Breakpoints
    Tomohiko Ohtsuki, Naoki Nariai, Kaname Kojima, Takahiro Mimori, Yukuto Sato,
    Yosuke Kawai, Yumi Yamaguchi-Kabata, Testuo Shibuya, and Masao Nagasaki

Analysis and Classification of Constrained DNA Elements with N-gram Graphs and Genomic
Signatures
    Dimitris Polychronopoulos, Anastasia Krithara, Christoforos Nikolaou, Giorgos Paliouras,
    Yannis Almirantis, and George Giannakopoulos

Inference of Boolean Networks from Gene Interaction Graphs Using a SAT Solver
    David A. Rosenblueth, Stalin Muñoz, Miguel Carrillo, and Eugenio Azpeitia

RRCA: Ultra-Fast Multiple In-species Genome Alignments
    Sebastian Wandelt and Ulf Leser

Exact Protein Structure Classification Using the Maximum Contact Map Overlap Metric
    Inken Wohlers, Mathilde Le Boudic-Jamin, Hristo Djidjev, Gunnar W. Klau, and
    Rumen Andonov

Author Index

Comparative Genomics Approaches to Identifying
Functionally Related Genes*

Michael Y. Galperin and Eugene V. Koonin

National Center for Biotechnology Information, National Library of Medicine,
National Institutes of Health, Bethesda, Maryland, USA
{galperin,koonin}@ncbi.nlm.nih.gov

Abstract. The rapid progress in genome sequencing makes it possible to address fundamental
problems of biology and achieve critical insights into the functioning of the live cells and
entire organisms. However, the widening gap between the rapidly accumulating sequence data
and our ability to properly annotate these data constitutes a major problem that slows down
the progress of genome biology. This paper discusses the notion of “function” as it relates
to computational biology, lists the most common ways of assigning function to the new genes,
particularly those that specifically rely on comparative genome analysis, and briefly reviews
the drawbacks of the current algorithms for semi-automated high-throughput functional
annotation of genomes.

Keywords: genome annotation, genomic context, gene neighborhood, operon, functional
genomics, orthology databases.

1 Introduction

Next year will mark the 20th anniversary of the sequencing of the first complete genome of a
cellular organism, the bacterium Haemophilus influenzae [1]. Many bacterial and eukaryotic
genomes followed shortly after that, including the first human genome in 2001 [2]. These
events led to a revolution in the genome sequencing technologies, which sharply decreased the
sequencing costs and dramatically changed the way we do science. It is now often cheaper to
isolate the DNA from some obscure environmental sample and do the sequencing than to perform
a standard biochemical or biophysical experiment.
The rapid progress in technology has led to a largely unexpected conundrum where the
sequencing data are being accumulated at such a fast pace that the ability of the biologists
to perform any sensible data analysis inevitably falls behind. As a result, most published
research typically addresses only a relatively small number of specific problems that
prompted generation of the respective data set, and most sequence data remain underutilized
by the researchers. The growing schism between data generation and the use of these data
makes post-genomic sequence analysis a particularly promising avenue of research, offering
computational biologists ample amounts of raw sequence data that could be used to answer a
variety of important questions. The onus therefore shifts to the researcher’s ability to ask
the right questions and to extract from the databases the right data sets to answer these
questions.

* The article is a work of the United States Government; Title 17 U.S.C 105 provides that
copyright protection is not available for any work of the United States government in the
United States.

A.-H. Dediu, C. Martín-Vide, and B. Truthe (Eds.): AlCoB 2014, LNBI 8542, pp. 1–24, 2014.
© Springer International Publishing Switzerland 2014

One of the most common stumbling blocks in converting the raw sequence data to scientific -
or biotechnological - findings is the insufficient understanding of the functions of
numerous genes even in the best-studied genomes, such as the bacteria Escherichia coli and
Bacillus subtilis, or the yeast Saccharomyces cerevisiae. Even for Escherichia coli K-12,
the workhorse of molecular biology and arguably the best-studied organism in the world, the
EcoGene database¹ shows that 1336 genes out of the current list of 4141 still have the ‘y’
designation, indicating that their functions remain uncharacterized [3]. Further, for
products of many other genes, only a general function (e.g., ‘cell division protein’,
‘stress-induced protein’) is known at this time. For less-studied organisms, the fraction of
uncharacterized genes can be much higher, with virtually all of their genes being assigned
their functions solely based on the sequence similarity to the genes in other organisms.
Thus, comparing different genomes and transferring functional annotation of genes (proteins)
from better-studied organisms to their orthologs from less-studied organisms has become the
key process in the efforts to provide functional annotation of newly sequenced genomes and
use this information to achieve a better understanding of the physiology of the respective
organisms.
The goal of this presentation is to a) define the notion of “biological function” as it
relates to computational biology, b) describe the most popular ways of assigning function to
predicted genes (open reading frames), particularly those that specifically rely on
comparative genome analysis, and c) discuss the challenges and drawbacks of the current
algorithms for semi-automated high-throughput functional annotation of genomes.

2 What Is the Gene “Function”?

While it is only natural to think of the live cell as a perfectly designed system where
every part has its own well-defined role (the “function”), in reality, cell components
participate in a complex network of interactions and often have more than one role. Most
enzymes can work with a group of related substrates instead of a single one (have group
specificity) and catalyze various side reactions. The function of the gene is typically
defined as the role that its protein product plays in situ, i.e. in the live cell. As a
result, a protein that hydrolyzes a natural substrate, e.g. a phosphorylated sugar into sugar
and phosphate moieties, will usually be called a phosphatase, even if this protein is more
active with a non-natural artificial substrate, such as a sugar phosphonate. Sometimes,
however, the name is derived from an easily measurable side activity whereas the genuine
native function might not even be known. Thus, an enzyme that catalyzes the reduction of
certain dyes - and whose activity could be easily measured by changes in color - was
referred to as diaphorase for more than 20 years before its activity as NAD(P)H:acceptor
oxidoreductase was established and it became clear that there exists a whole family of such
enzymes.

¹ http://www.ecogene.org/

In biology, gene (protein) function is usually defined historically, based on the first
description of the properties of the respective mutant or the biochemical activity of the
purified protein. For essential genes, where mutations are lethal or conditionally lethal,
the function can be defined as something that the gene product needs to do to sustain cell
growth. Operationally, for lethal mutations, the cause of cell death is assumed to be the
“function” of the gene in question. For non-essential genes, mutation phenotypes can be
quite complicated and, accordingly, the descriptions of “function” may be quite long and
fuzzy, and not necessarily physiologically relevant, i.e. reflecting their core functions.
For example, studies of the sporulation process in the hay bacterium Bacillus subtilis, a
popular model organism, have been used to define functions for hundreds of genes. As a
result, certain bacterial genes are referred to as “sporulation” genes, even though the
respective organisms, e.g., cyanobacteria, are unable to sporulate [4,5].
This problem becomes particularly severe for high-throughput enzyme assays, which can be
used to define general biochemical activities of the products of previously uncharacterized
genes, but are often unable to identify the natural substrates for the respective enzymes or
the biochemical pathway involving these enzymes [6,7]. A proper definition of the protein
function should probably combine characterization of its biochemical activity, if any (i.e.
the nature of the catalyzed reaction and the range of utilized substrates and products) with
the description of the biological process (e.g. a metabolic or signaling pathway) that
involves this protein. For poorly studied organisms, such information is obviously
unavailable and every overly specific assignment should be taken with a grain of salt. We
have previously discussed certain functional assignments that, despite being supported by
reasonably high similarity scores, do not pass even a cursory “sanity check”. Examples
include bacterial and archaeal “head morphogenesis protein”, “mitochondrial benzodiazepine
receptor”, “centromere protein”, and many others [8,9].
In the course of evolution, homologous genes may adopt new functions, sometimes quite
distinct from their ‘original’ ones. There are several excellent databases that collect such
data. The FunShift database² at Stockholm University [10] documents functional shifts
between different subfamilies within a single protein domain family. The PANTHER³ database
at SRI International in Menlo Park, California, shows such functional shifts on the
phylogenetic trees [11], whereas the Structure-Function Linkage Database⁴ at the University
of California, San Francisco, analyzes structural and functional details for functionally
diverse enzymes that belong to the same superfamilies [12].
A further complication is the phenomenon of so-called “moonlighting proteins” that perform
one function in one environment, such as cytoplasm, and an entirely different function in a
different environment, such as, for example, when secreted outside the cell [13]. Some such
cases are captured in MultitaskProtDB, a database of multitasking proteins⁵ at the
Universitat Autònoma de Barcelona in Barcelona, Spain [14]. While the number of such
moonlighting proteins appears to be relatively small, that might be due to the fact that
such cases are not easy to recognize.

² http://funshift.sbc.su.se/
³ http://www.pantherdb.org/
⁴ http://sfld.rbvi.ucsf.edu/

To summarize, the biological notion of ‘function’ is rather fuzzy, which usually leaves
sufficient wiggle room for functional annotations to be reasonably close to the reality.
However, finding the proper balance between overly generic (non-specific) and overly
specific functional annotation is a complex task that does not have easy algorithmic
solutions. Simply copying the functional annotation of the closest homolog in the database
or the closest characterized homolog is hardly an appropriate solution, as it leads to
numerous problems, from propagation of errors to generation of annotations that cannot pass
the sanity check.

3 Homology-Based Functional Assignments

3.1 Annotation by Similarity

The simplest and most straightforward way to assign function to a newly sequenced gene
(protein) is to find a similar gene (protein) with an experimentally characterized function.
Every day, numerous researchers use the BLAST program on the NCBI web site to perform
sequence comparisons and use them to annotate new genes (proteins) based on the functional
information from previously characterized genes. There are also other sequence comparison
algorithms; some of them will be mentioned below.
It is important to remember, however, that BLAST and other sequence comparison algorithms
measure the degree of sequence similarity, not functional similarity. In other words, such
algorithms evaluate the probability that the given sequences are similar solely by chance,
i.e. the probability that the given sequences are evolutionarily unrelated. When that value
is sufficiently low, e.g. less than one per million, this result can be interpreted as
evidence of an evolutionary relationship of those sequences, i.e. their common descent from
the same ancestral gene. However, at lower similarity levels, i.e. higher E (expectation)
values, the probability that the respective proteins have the same function and, therefore,
that transfer of functional information from already known genes (proteins) to the new one
is justified, becomes progressively lower. Furthermore, because of the intrinsic diversity
of biological sequences, there can be no a priori estimate as to which E-value still allows
transfer of functional information and which E-value does not.
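
To make the relation between alignment score and E-value concrete, the following minimal
Python sketch uses the standard Karlin-Altschul relation for normalized (bit) scores,
E = m * n * 2^(-S'), where m is the query length and n the database length, and converts an
E-value into the probability of seeing at least one such hit by chance, P = 1 - exp(-E).
The function names and the example lengths are illustrative assumptions, not part of BLAST
itself, and the sketch ignores the effective-length corrections that real implementations
apply.

```python
import math


def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), with m = query length, n = database length
    and S' = normalized (bit) score. Edge-effect corrections are ignored.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def chance_probability(e_value: float) -> float:
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def min_bit_score(e_threshold: float, query_len: int, db_len: int) -> float:
    """Smallest bit score whose E-value does not exceed `e_threshold`."""
    return math.log2(query_len * db_len / e_threshold)


if __name__ == "__main__":
    # Hypothetical search: a 300-residue query against 10**9 database residues.
    needed = min_bit_score(1e-6, 300, 10**9)
    print(f"bit score needed for E <= 1e-6: {needed:.1f}")
    print(f"E-value at 60 bits: {expect_value(60, 300, 10**9):.2e}")
```

Under these assumptions, the "one per million" level mentioned above corresponds to roughly
58 bits. The sketch says nothing about functional similarity, which is the point of the
preceding paragraph: the score threshold bounds chance similarity only.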
A potential way out of this conundrum lies in the development of databases of or-
thologous proteins or, more precisely, orthologous groups of proteins [15]. In its orig-
inal implementation in the COG database, the algorithm for identification of orthologs
across diverse bacteria and archaea relied on the triangles of genome-specific bidirec-
tional best hits with no cut-off by E-value [15]. Subsequent algorithms preserved the
need for bidirectional best hits but included certain cut-offs to eliminate spurious hits.

5 http://wallace.uab.es/multitask/
Comparative Genomics Approaches to Identifying Functionally Related Genes 5

There is now a wide variety of ortholog databases that use various tools to infer orthology
and are geared towards various uses, including functional annotation of genomes
[15-23].
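
The triangle criterion described above can be illustrated with a minimal sketch. It assumes that similarity search results are already available as (query, subject, score) tuples and that gene names follow a hypothetical 'genome|gene' convention; neither the input format nor the function names come from the COG implementation itself.

```python
from itertools import combinations

def genome_of(gene):
    """Extract the genome label from a 'genome|gene' name (assumed convention)."""
    return gene.split("|", 1)[0]

def best_hits(hits):
    """Map (query gene, target genome) to the highest-scoring subject gene."""
    best = {}
    for query, subject, score in hits:
        if genome_of(query) == genome_of(subject):
            continue  # only hits between different genomes are considered
        key = (query, genome_of(subject))
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: subject for key, (subject, _) in best.items()}

def bidirectional_best_hits(hits):
    """Return unordered gene pairs that are each other's best hit."""
    best = best_hits(hits)
    pairs = set()
    for (query, _genome), subject in best.items():
        if best.get((subject, genome_of(query))) == query:
            pairs.add(frozenset((query, subject)))
    return pairs

def bbh_triangles(pairs):
    """Return gene triples in which every pair is a bidirectional best hit."""
    genes = sorted({gene for pair in pairs for gene in pair})
    return [
        trio for trio in combinations(genes, 3)
        if all(frozenset(edge) in pairs for edge in combinations(trio, 2))
    ]
```

No score or E-value cut-off is applied here, in line with the original approach; a later variant would simply filter the input tuples before calling best_hits. The exhaustive search over triples is meant for illustration only and would be too slow for real genome sets.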

3.2 Family/Superfamily Annotation

Despite the best efforts in sequence analysis, a substantial fraction of proteins show
only a limited similarity to their experimentally characterized counterparts. In many
cases, the similarity is limited to the common sequence motifs and/or to the predicted
structural features. In such cases, direct transfer of functional information from the
characterized proteins is hardly justified. Instead, a much more productive approach
would be to replace a specific -
and most likely inaccurate - annotation of the new protein with a family-based annota-
tion, stressing the general conserved features of the family members but avoiding
unnecessary specifics (or, rather, leaving them for the future). We have previously
discussed the inherent fuzziness of the functional annotation for the members of the
ATP-grasp, alkaline phosphatase, all-alpha NTP-PPase, and other superfamilies
[24-27], as well as for transcriptional regulators and membrane transporters [8].
Finally, there are numerous protein families whose functions remain totally
enigmatic. Such proteins have been referred to as “hypothetical”, “conserved
hypothetical”, “uncharacterized” or even “putative uncharacterized” [28]. Families of
such proteins include Domains of Unknown Function (DUFs) in Pfam, and
Uncharacterized Protein Families (UPFs) in UniProt [28,29]. These lists are quite
valuable for genome annotation, because clarification of the functions of any of their
members immediately allows functional assignments for all other members of that
family. From the computational standpoint, the software should allow sufficient
flexibility in protein names, so that an amended functional assignment could be
quickly propagated to the members of a given protein family without the need for any
major revamp of the system. In fact, the continuing process of biological research
means that changes in gene (protein) functional annotation are bound to be a constant
factor in genomic databases for the foreseeable future.

4 Using Genome Comparisons for Predicting Protein Functions

While sequence similarity searches remain by far the most popular tool for identifying
the functions of unknown proteins and RNA, in many cases such searches do not
yield satisfactory functional annotation, as no functional assignment can be made with
any degree of confidence. For such cases, there are several computational approaches
that go beyond sequence comparison. Instead, such methods rely on “genomic con-
text”, i.e. common properties that are shared by unrelated (non-homologous) proteins
that perform the same or related functions. Examples of such proteins include differ-
ent subunits of the same complex enzyme, components of the same signaling path-
way, alternative enzymes that catalyze the same biochemical reaction, and many
others. In order for such non-homologous but functionally related protein pairs to
work in concert, they need to be present in the same organism at the same time; they
might also physically interact. Accordingly, identification of functionally associated
6 M.Y. Galperin and E.V. Koonin

pairs of proteins relies on their joint presence and absence in a certain set of genomes
(phylogenetic co-occurrence) and their co-expression, as judged by the presence of
common regulatory sites, conservation of their location next to each other in multiple
genomes, and/or gene fusions [8,30,31].
These approaches have two important traits: they take advantage of the availability
of multiple complete genomes and they treat them as genomes rather than just sets of
individual genes. Accordingly, these approaches rely on the same basic premise - that
organization of the genetic information in each particular genome is meaningful, in
the sense that it reflects a long history of mutations, gene duplications, gene re-
arrangements, gene function divergence, gene acquisition and loss that has produced
organisms that are uniquely adapted to their environment and are capable of regulat-
ing their metabolism in accordance with the environmental conditions. Further, some
of these approaches, such as the analysis of gene co-expression, gene neighborhoods and
protein domain fusions, do not require knowledge of complete genome sequences and
therefore can benefit from the enormous amount of sequence data available in the
unfinished genomes and metagenomes. This dramatically increases sensitivity and
robustness of these approaches, making them indispensable tools in the functional
analysis of uncharacterized genes.
The principles and methods of genome context-based functional annotation have
been described in detail in numerous publications [8,30-43]. Here we briefly describe
the general principles of these approaches and discuss their principal caveats. We also
discuss the limitations of applying these tools to infer sensible functional association.
It is important to note that all these approaches critically depend on the number of
available genome sequences and their diversity. Therefore, recent progress in genome
sequencing that leads to the constantly growing number of available genomes, even if
incomplete, gradually increases the specificity of all these methods, effectively im-
proving the signal-to-noise ratio. In addition, functional links can be deduced from the
results of several high-throughput experimental techniques, such as gene co-
expression obtained using microarrays or deep RNA sequencing and various protein-
protein interaction data. All this makes genomic context-based methods increasingly
powerful in providing valuable clues to inferring gene (protein) function.

4.1 Phylogenetic Profiling

General Approach. The number of genes that are encoded in all known genomes is
extremely small, less than a hundred, and functions of all of them are already known.
Most of these genes encode ribosomal proteins or subunits of several key enzymes of
DNA replication, tRNA aminoacylation, and central metabolism [44]. All other genes
are present in some genomes and absent in the others. When comparing the distribution
of two genes across multiple genomes, one can observe the following general
patterns. First, the genes typically co-occur, i.e. certain genomes carry both these
genes while other genomes do not have either of them. In such cases, functional asso-
ciation of the two genes becomes very likely, which makes this method a potentially
powerful tool for inferring protein function [15,34,38,45]. However, as mentioned
above, this functional association is quite fuzzy in biological terms and may be used
only for a very general functional annotation. In other cases, the genes are rarely
found together: most genomes carry either one or the other, resulting in complementary
phylogenetic patterns. Such cases may arise from a specific kind of functional
association, the one where the respective genes actually have the same (or closely
related) functions, such that the organism needs only one of them. Such cases, referred
to as non-orthologous gene displacement [46], are not very common but, when
found, could be used for very specific functional annotation [31].

Algorithmic Aspects. The overall approach is quite straightforward: compile a matrix
of presence (1) or absence (0) of the given genes in as many genomes as possible and
calculate the numbers of (1,1), (1,0), (0,1) and (0,0) combinations. Then compare the
fraction of (1,1) cases [as well as the combined fraction of (1,1) and (0,0) cases] with
the fraction of the other two and evaluate the probability that the difference, if any, arises
simply by chance. If that probability is sufficiently low, the pair can be marked as
likely to have a functional interaction. For non-orthologous gene displacement, vice
versa, the (0,1) and (1,0) cases should be far more common than (1,1) ones.
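
The counting step can be sketched as follows. The text does not prescribe a particular statistical test; the sketch below assumes a one-sided Fisher's exact test (a hypergeometric tail) as one common choice, and all function names are illustrative.

```python
from math import comb

PATTERNS = [(1, 1), (1, 0), (0, 1), (0, 0)]

def pattern_counts(profile_a, profile_b):
    """Count the (1,1), (1,0), (0,1) and (0,0) combinations across genomes."""
    counts = dict.fromkeys(PATTERNS, 0)
    for a, b in zip(profile_a, profile_b):
        counts[(a, b)] += 1
    return counts

def _pmf(k, total, with_a, with_b):
    """Hypergeometric probability of exactly k genomes carrying both genes."""
    return comb(with_a, k) * comb(total - with_a, with_b - k) / comb(total, with_b)

def _margins(counts):
    n11, n10, n01, n00 = (counts[p] for p in PATTERNS)
    return n11, n11 + n10 + n01 + n00, n11 + n10, n11 + n01

def cooccurrence_p(counts):
    """Chance of at least the observed number of (1,1) cases (co-occurrence)."""
    n11, total, with_a, with_b = _margins(counts)
    upper = min(with_a, with_b)
    return sum(_pmf(k, total, with_a, with_b) for k in range(n11, upper + 1))

def displacement_p(counts):
    """Chance of at most the observed number of (1,1) cases (complementarity)."""
    n11, total, with_a, with_b = _margins(counts)
    return sum(_pmf(k, total, with_a, with_b) for k in range(0, n11 + 1))
```

For two genes that are both present in the same four genomes out of eight, cooccurrence_p gives 1/70, i.e. about 0.014. The test treats genomes as independent observations, which closely related genomes are not.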
Unfortunately, this approach has several important caveats. First of all, it relies on
recognition of the “same gene” in many distinct genomes, i.e. runs into all the problems
described above. Different genes evolve at different rates, and even functionally
related genes may accumulate mutations, insertions and deletions at dramatically
different paces. As a result, two homologous genes in two different genomes might be
very similar (e.g. with an E-value of 1×10^-10), whereas their partners in the same genomes
would show only borderline similarity (e.g. an E-value of 1×10^-3). Selecting an
overly strict cut-off for similarity scores would throw away distant homologs of the
given gene and might artificially inflate the fraction of (1,0) cases. On the other hand,
selecting an overly permissive cut-off would result in an inflated fraction of (1,1)
cases, which would decrease the specificity of the method, highlighting spurious gene
pairs as functionally related. To avoid this conundrum, one could specifically look for
pairs of orthologs in diverse genomes, which would alleviate most of the problems
arising from differences in evolutionary rates. However, this would mean either
adding an entirely new layer of computation or relying on the external sources of
orthology data, which might have their own problems. For example, some orthology
databases, like the OMA browser6, emphasize one-to-one correspondence between orthologous
genes and therefore might be sensitive to lineage-specific gene duplication
events [17]. We believe that by defining orthologous groups, as opposed to single
orthologs, the COG approach offers the best balance of specificity and sensitivity.
However, the COG database covers only 63 genomes and has not been updated since
2003.
Another potential problem of phylogenetic profiling is taxonomic depth. With
hundreds of Escherichia coli genomes already in the database, most E. coli gene pairs
are already found in hundreds of genomes and are missing in numerous other

6 http://omabrowser.org

genomes. While this ensures predominance of (1,1)+(0,0) cases, that does not mean
that such genes necessarily interact. Thus gene pairs that are found in phylogenetical-
ly distant organisms (e.g. in members of different phyla) should score much higher
than those found only at very short phylogenetic distances. It also makes sense to
ignore closely related genomes, e.g. by collapsing at the level of genus or even a
family. On the other hand, horizontal gene transfer between organisms that inhabit the
same environment can result in groups of unrelated genes being co-transferred across
large phylogenetic distances, e.g. from hyperthermophilic bacteria to hyperthermo-
philic archaea or vice versa. As a result, assigning too much value to the rare sightings
of the same genes in phylogenetically distinct organisms might be dangerous and
counterproductive.
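
Collapsing closely related genomes before counting can be sketched as follows. The rule used here (a gene counts as present in a genus if any member genome carries it) is an assumption made for illustration; a majority rule would be an equally plausible choice. The genome-to-genus mapping is supplied by the user.

```python
def collapse_by_genus(presence, genus_of):
    """Reduce a genome-level presence/absence profile to one entry per genus.

    presence: dict mapping genome name to 1 (gene present) or 0 (absent).
    genus_of: dict mapping genome name to its genus (user-supplied).
    """
    collapsed = {}
    for genome, present in presence.items():
        genus = genus_of[genome]
        # Assumed rule: present in the genus if present in any member genome.
        collapsed[genus] = max(collapsed.get(genus, 0), present)
    return collapsed
```

With this reduction, hundreds of E. coli genomes contribute a single (1,1) or (0,0) observation rather than hundreds of them.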
For the rare genes that are found in relatively few genomes, the above factors combine
to make phylogenetic profiling particularly unreliable. Thus, when the number of
(1,1) is small and the number of (0,0) is large, there is a decent chance that the (1,1)
cases are not indicative of a functional relationship.
One more potential problem of phylogenetic profiling is the reliance of the method
on the correct identification of all the ORFs in the genome. In practice, automatically
annotated genomes often miss short ORFs, those with less than 70-80 codons, and
sometimes even longer ones [45]. In addition, ORFs with frameshifts typically get
omitted from the protein set, even when these frameshifts result from sequencing
errors, so that the genome encodes a fully functional protein. In some cases, supposed
frameshifts create stop codons between separate protein domains and therefore do not
result in the loss of function but such proteins still get removed from the respective
proteomes. We have previously described how deviations from conserved phyloge-
netic patterns could be used for improving genome annotation [45], but that required
manual intervention. When used semi-automatically on a genome scale, phylogenetic
profiling, particularly for short ORFs, could be very sensitive to annotation errors.

Practical Aspects. At this time, there is no universally accepted way to score the
results of phylogenetic profiling. As a result, this approach is still widely used but
mostly on an ad hoc basis: biologists
typically on an ad hoc basis: biologists typically use co-occurrence of certain genes as
additional evidence of their involvement in the same process or a pathway. There are
databases that could be used to extract phylogenetic profiles from the genome data,
the best and most widely used being the STRING database7, maintained by Peer Bork
and coworkers at the European Molecular Biology Laboratory in Heidelberg, Germany
[43]. STRING allows the user to select a gene from a variety of complete genomes
and search for genes with the same or similar phylogenetic profiles. This tool is very
useful for genome annotation, particularly if combined with other options offered
by the same database (see below). The FunCoup database8 at Stockholm University
specifically targets eukaryotic genes and, like STRING, presents various kinds of
functional coupling information, including phylogenetic profiles [47].

7 http://string.embl.de
8 http://funcoup.sbc.su.se/

4.2 Genomic Neighborhood

General Approach. Co-expression of proteins belonging to the same metabolic or
signaling pathway is typically achieved through co-regulation of the transcription of
the respective genes by the same transcriptional regulators. This could be detected by
identifying common regulatory sites, although the specificity of such predictions is
typically limited and they need to be verified by direct experimentation. In bacteria,
co-expressed genes are often located next to each other, forming operons that are
transcribed as a single multigenic mRNA. On the other hand, due to the constant
events of gene translocation within the genome, as well as gene acquisition through
horizontal gene transfer and gene loss, the overall gene order is not conserved even
among relatively close relatives that belong to the same genus, and is typically wiped
out at the level of the bacterial family. Thus, conserved gene neighborhoods in phylogenetically
distinct organisms are relatively rare [48] and their analysis may provide
important functional clues [36,37]. Therefore, bacterial genome analysis offers an easy
way of inferring functional connections by simply looking at the genes that are consis-
tently adjacent to the studied gene in multiple genomes. This approach could even be
used for analyzing eukaryotic genes through finding bacterial orthologs of the given
eukaryotic gene, followed by an analysis of their genome neighborhoods [49].

Algorithmic Aspects. The general approach to the identification of functionally
linked genes through the analysis of their genomic context includes the following
steps. First, for a given gene from the given organism, one needs to identify the ‘same
gene’ (or, more precisely, orthologs of this gene) in all available genomes and, at the
next step, define other genes that belong to the same operons and therefore are co-
expressed. However, genetic studies have revealed co-regulated divergent operons
(running in both directions from a common regulatory site), as well as convergent
ones. That is why, in practice, the direction of the genes is usually ignored and the
algorithm simply selects a certain number of their neighbors (just the nearest neigh-
bors or two, three, or more adjacent genes) on one or both sides in all these genomes.
These neighboring genes then need to be classified into conserved groups of the same
function and ranked by the frequency of their occurrence in these neighborhoods. The
genes that show a statistically significant association with the orthologs of the given
gene may be expected to have a functional connection to this gene.
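
The steps above can be sketched in a few lines, under the assumption that every gene has already been assigned to an orthologous group (here called a family) and that each genome is given as an ordered list of family labels. The sketch ignores gene orientation, as described above, treats each chromosome as linear, and only counts occurrences; it does not assess statistical significance.

```python
from collections import Counter

def neighbor_families(genomes, query_family, window=2):
    """Count the genomes in which each family lies near the query family.

    genomes: dict mapping genome name to a list of family labels in
             chromosomal order (None for genes without a family).
    window:  number of genes inspected on each side of the query gene.
    """
    counts = Counter()
    for order in genomes.values():
        seen = set()
        for i, family in enumerate(order):
            if family != query_family:
                continue
            seen.update(order[max(0, i - window):i])   # upstream neighbors
            seen.update(order[i + 1:i + 1 + window])   # downstream neighbors
        seen.discard(query_family)
        seen.discard(None)
        counts.update(seen)  # each family is counted once per genome
    return counts
```

Calling counts.most_common() on the result ranks candidate partners by the number of genomes in which they appear next to the query. Widening the window raises the chance of finding non-trivial associations but also the number of families that must be tracked.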
Obviously, this approach is subject to the same caveats as phylogenetic profiling,
and also additional ones. First, again, the definition of the ‘same gene’ in various
genomes has to rely on sequence comparisons and is subject to all the limitations
discussed above. The availability of predefined clusters of orthologs helps but, again,
means either extra computation or reliance on an external source of information that
the user cannot control. This method, however, requires identification of orthologs not
just for the initial query gene but also for the genes that abut its orthologs in all
studied genomes. This calls for a far more complex computation and/or far more
extensive use of orthology databases.
The other two problems of phylogenetic profiling, the taxonomic depth and the po-
tential effect of horizontal gene transfer, also apply to the analysis of the genomic
neighborhood. The high incidence of the same genome neighborhood in numerous
closely related genomes is likely to make it difficult to find relatively rare cases where
the neighbors might be different. On the other hand, such rare associations could re-
flect cases of horizontal gene transfer and assigning too much weight to them might
be misleading.
One more problem complicating the analysis of the genomic neighborhoods is a
rapid increase in the amount of the necessary computation with the expansion of the
search field. The chance of finding non-trivial gene associations obviously increases
when one looks not just at the nearest neighbor(s) but, say, at three, four or five genes
on each side from the analyzed one. However, the need to keep track of the identified
neighbors and all their orthologs makes the task increasingly complex.

Practical Aspects. There are several different tools for analyzing conserved gene
neighborhoods. A popular tool included in the SEED database9 [50] tags the selected
gene and displays conserved genes found in the vicinity of its orthologs (‘pinned
CDSs’), scoring them by the E-value of the BLAST hit. The user is given the option
of choosing the size of the analyzed region (in kilobases), the number of genomes to
display, and E-values for selecting the genes to show and to color the same way. This
tool is most convenient for analyzing gene neighborhoods among closely related ge-
nomes; expanding it to the members of different phyla may be complicated. Another
tool is available in the KEGG database, part of the KEGG Orthology10 system [23].
Instead of BLAST E-values, as in SEED, this tool relies on the precomputed lists of
orthologs and displays the members of KEGG orthologous groups located in the ge-
nome in the vicinity of the given gene. The most popular tool for studying gene
neighborhoods is probably the one at the STRING11 database [43]. It also relies on
precomputed lists of orthologs and displays them over the entire phylogenetic tree.
Thus, each tool has its own advantages, and by combining two or more of them, it
becomes possible to analyze the gene neighborhoods in much detail and over large
phylogenetic distances. Future progress in developing such tools would require creating
more comprehensive ortholog databases and improving the phylogenetic
profiling methods that would allow investigating genome neighborhoods in selected
parts of the tree of life.

4.3 Gene Coexpression

General Approach. Strictly speaking, gene colocalization does not always imply
coexpression. In fact, adjacent but divergently oriented genes could be part of an ‘ei-
ther one or another’ regulatory system. The availability of genome sequences gave
rise to genomic microarrays, which allowed simultaneous identification of all genes
that are coexpressed in response to a specific environmental signal or in such condi-
tions as nutritional or osmotic stress. Such data have been very useful for the specific

9 http://theseed.org/
10 http://www.kegg.jp/kegg/ko.html
11 http://string.embl.de

conditions that they studied but microarray experiments were generally costly and
narrowly targeted. Obviously, it would be very attractive to deduce gene coexpression
straight from the DNA sequence, by identifying conserved transcriptional regulatory
sites in front of the genes that might not even be located in the same genome neigh-
borhood. There have been numerous attempts to predict transcription regulatory sites
ab initio on the genome scale. Unfortunately, this task is quite complex and the sig-
nal-to-noise ratio is usually pretty low. A much more successful approach has been
based on utilizing information about known - experimentally determined - transcrip-
tional regulatory sites and scanning the genomes for additional instances of the same
or similar sites. In the past, the sequences of regulatory sites had to be determined
experimentally by DNA fingerprinting. More recently, such information has started
pouring in from deep sequencing data. As a result, transcriptional profiling with prob-
abilistic models of the likely regulatory sites has become a very promising approach
to look for coexpressed genes.

Algorithmic Aspects. The typical approach includes the following steps: compiling a
list of known coexpressed genes, creating a multiple alignment of the upstream regu-
latory sites, converting this alignment into either a frequency profile or a hidden Mar-
kov model, and using this profile or HMM to look for (additional) highly-scoring
sites, preferably in the intergenic regions. In a large series of papers from Gelfand and
colleagues, this approach has been used in combination with the information derived
from protein sequences, such as the presence of orthologs in several different
genomes [51-57], see [58,59] for review.
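
The profile-based search can be illustrated with a minimal sketch. It assumes that the known sites are already aligned and of equal length, uses a uniform background of 0.25 per base and a pseudocount of 1 (both arbitrary choices), and scans only the forward strand of an A/C/G/T sequence. It is a plain frequency profile, not an HMM, and is not the specific procedure used in the cited studies.

```python
import math

def log_odds_profile(sites, pseudocount=1.0, background=0.25):
    """Build a position-specific log-odds matrix from aligned sites."""
    profile = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        profile.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return profile

def scan(sequence, profile, threshold):
    """Return (offset, score) for each window scoring at or above threshold."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(profile[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits
```

In practice the scan would be restricted to intergenic regions, and the threshold would be calibrated against the scores of the known sites.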

Practical Aspects. At this time, there are several tools for gene coexpression profil-
ing, including Gibbs Motif Sampler [60,61] and RegPredict [62]. The first one, Gibbs
Motif Sampler, runs on servers at the Wadsworth Center in Albany, New
York12, and at Brown University in Providence, Rhode Island13 [63,64]. In addition,
several versions of this software are available for downloading14. RegPredict15 is a
web service of the Lawrence Berkeley National Laboratory in Berkeley, California. It
is closely associated with RegPrecise16 and RegTransBase17, two manually curated
databases of transcriptional regulation in prokaryotes [65,66].

4.4 Protein Domain Fusions

General Approach. In some cases, adjacent genes are not just coexpressed; they may
lose the stop codon that terminates the first polypeptide chain. Such cases (as well as
certain gene recombination events) lead to the formation of fused genes, where a sin-

12 http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=DNA
13 http://ccmbweb.ccv.brown.edu/gibbs/gibbs.html
14 http://mcmc-jags.sourceforge.net/
15 http://regpredict.lbl.gov/
16 http://regprecise.lbl.gov
17 http://regtransbase.lbl.gov

gle protein consists of two or more different domains. While each domain has its own
function, the fusion would be viable - and maintained in the course of evolution - only
when its components are functionally linked, e.g. by participating in the same path-
way or a common regulatory mechanism. Therefore, identification of fused genes
offers a convenient way to deduce functional association, which is why it has been
referred to as the “Rosetta stone” approach [32,67]. Obviously, protein domain fu-
sions are only helpful when they combine a previously uncharacterized domain with a
domain of known function [68]. Fusions of already characterized domains are being
studied by numerous researchers for a variety of purposes but not for functional as-
signments, whereas fusions of uncharacterized domains are interesting but hardly ever
contribute to functional analysis.

Algorithmic Aspects. Detection of gene fusions is usually performed at the protein
level, through the analysis of multidomain proteins that combine on a single polypeptide
chain two or more protein domains that are usually found separately (widespread domain
fusions, e.g. of pyrimidine biosynthesis enzymes in eukaryotes, are trivial and
rarely yield new insights). The search algorithm would largely depend on whether the
analyzed gene product contains an already known protein domain. If so, the analysis
could be performed using the established databases of protein domains, such as
Pfam18 at the Wellcome Trust Sanger Institute or InterPro19 at the European Bioinformatics
Institute, both in Hinxton, UK, or the NCBI’s Conserved Domain Database20
[29,69,70]. Each of these databases allows listing all domain architectures
that involve the given domain.
If, however, the analyzed gene product does not contain any protein domains that
are listed in public domain databases, the only applicable way seems to be using
BLAST (or PSI-BLAST, or HMMer) to find all instances of the new domain, sort the
search output by length looking for the longest database hits, and then analyze those
hits one-by-one to see if they contain any - known or new - conserved domains.
Analysis of meaningful protein fusions is relatively robust and is subject to few ca-
veats. The most important of those is the existence of so-called “promiscuous” do-
mains that associate with a wide variety of distinct proteins and do not allow any
functional inferences. Another potential issue is limiting the depth of the similarity
search. Tell-tale fusions of the given protein are often found only after several itera-
tions of PSI-BLAST or JackHMMer, and the degree of sequence conservation might
be fairly low. Then there is no guarantee that such domains retain the same or even
marginally similar functions, particularly when fused to different partners. Thus,
finding protein fusions among distant homologs makes it difficult to draw any
unequivocal conclusions.
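
For the first scenario, where domain architectures are already available, the search for informative fusion partners can be sketched as follows. The input format (protein name mapped to an ordered list of domain names) and the list of promiscuous domains are assumptions supplied by the user; they do not correspond to the export format of any particular database.

```python
def fusion_partners(architectures, query_domain, promiscuous=frozenset()):
    """List the domains found fused to the query domain, with supporting proteins.

    architectures: dict mapping protein name to its list of domain names.
    promiscuous:   domains to ignore because they fuse with many partners.
    """
    partners = {}
    for protein, domains in architectures.items():
        if query_domain not in domains:
            continue
        for domain in set(domains) - {query_domain} - set(promiscuous):
            partners.setdefault(domain, []).append(protein)
    return {domain: sorted(proteins) for domain, proteins in partners.items()}
```

A partner supported by several proteins from distant lineages is a stronger lead than one seen in a single protein, but, as noted above, fusions detected only among distant homologs call for caution.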

18 http://pfam.sanger.ac.uk
19 http://www.ebi.ac.uk/interpro/
20 http://www.ncbi.nlm.nih.gov/cdd

Practical Aspects. The information on protein domain fusions is available in several
databases, including FusionDB21 at the Institut de Microbiologie de la Méditerranée in
Marseille, France [71]. Still, it appears that in bacteria, a significant fraction of fused
genes are fusions with the signal-transducing phosphoacceptor REC domain, DNA-
binding helix-turn-helix domain, and other promiscuous domains. While it is interest-
ing to see the variety of known protein domains that are fused with REC and therefore
fall under the control of the two-component signal transduction [72] or can be found
in transcriptional regulators (helix-turn-helix domain fusions), such cases do not ad-
vance the cause of functional annotation. Likewise, in eukaryotes, many domain fu-
sions involve SH2, SH3, and other regulatory domains [73], giving no clue as to what
specific activity is being regulated. On the other hand, domain fusion maps are al-
ready available for numerous domains of unknown function, DUFs in Pfam [29].
Thus, even a minor advance in understanding the function of a previously uncharacte-
rized domain - or, say, availability of its 3D structure - can be quickly propagated to
all proteins that contain this domain.

4.5 Protein-Protein Interactions

General Approach. Obviously, protein domain fusions capture only a relatively
small fraction of protein-protein interactions. Some additional information on such
interactions can be extracted from protein crystal structures that sometimes contain
distinct protein domains and show their mutual orientation and the mode(s) of domain
interactions. Such data are stored in a variety of public databases, including iPfam22,
3did23, DIMA24, DOMINE25 [74-77], and many others. However, most information on
protein-protein interactions comes from experimental data. These data are being col-
lected - and often ranked by reliability - in several aggregator databases, such as Bio-
GRID26, BindingMOAD27, DIP28, HitPredict29, IntAct30, MINT31 [78-84], and many
others. A selected list of such databases can be found in the Nucleic Acids Research
online Molecular Biology Database Collection web site32 [85]. Unfortunately, all
experimental methods for detecting protein-protein interactions are known to produce a
substantial number of false positives. The situation has become so bad that there is
even a database of known non-interacting proteins, Negatome33 [86], designed to
21 http://igs-server.cnrs-mrs.fr/FusionDB/
22 http://ipfam.sanger.ac.uk/
23 http://3did.irbbarcelona.org
24 http://webclu.bio.wzw.tum.de/dima
25 http://domine.utdallas.edu/
26 http://www.thebiogrid.org/
27 http://www.BindingMOAD.org
28 http://dip.doe-mbi.ucla.edu/
29 http://hintdb.hgc.jp/htp/
30 http://www.ebi.ac.uk/intact/
31 http://mint.bio.uniroma2.it/mint/
32 http://www.oxfordjournals.org/nar/database/subcat/6/26
33 http://mips.helmholtz-muenchen.de/proj/ppi/negatome

serve as a tool for estimating false-positive rates in protein-protein interaction experiments
and tools. Accordingly, scanning the available databases for the information
on protein-protein interactions is a good way to get potential clues on the function(s)
of the given protein but the reliability of such clues is typically pretty low.

Practical Aspects. It generally makes sense to query the available databases not just
for protein-protein interactions of the given protein but also its orthologs from other,
related genomes. Some protein-protein interaction databases rank the results by reliability;
incorporating these scores is generally a good idea. However, it should be
noted that all those databases feed on a relatively limited number of original studies.
Therefore, merely finding a certain interaction in several different databases should not
be used as evidence of a high-confidence interaction.
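
One simple way to respect this caveat is to count independent original studies rather than database entries. The sketch below assumes that each record carries an identifier of the original study (for example a literature reference); the record layout and the identifiers are hypothetical.

```python
def independent_support(records):
    """Group interaction records by protein pair and collect distinct studies.

    records: iterable of (database, protein_a, protein_b, study_id) tuples.
    Returns a dict mapping an unordered protein pair to its set of study ids.
    """
    support = {}
    for _database, protein_a, protein_b, study_id in records:
        pair = frozenset((protein_a, protein_b))
        support.setdefault(pair, set()).add(study_id)
    return support
```

An interaction reported by three databases that all cite the same study thus counts once, not three times.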

5 Combining Disparate Data into a Single Annotation

With the exception of a relatively small number of well-known and straightforward
cases, functional annotations of new genes (proteins) are inherently fuzzy. One of the
reasons for that is that these gene annotations are expected to be as specific and as
reliable as possible. These two demands are somewhat contradictory: a very general
but mostly useless annotation (e.g. a “metal-binding protein”) could be made with a
high degree of confidence, whereas a more specific - and more useful - annotation
might not be that well-grounded and totally reliable.
The International Nucleotide Sequence Database Collaboration34, which includes
NCBI’s GenBank35, the EBI’s European Nucleotide Archive36, and the DNA Data
Bank of Japan37, uses a simple schema with two evidence qualifiers, /experiment and
/inference38, which replaced the previously used qualifiers, ‘experimental’ and
‘non-experimental’. These two qualifiers come with controlled vocabularies39 that
specify, respectively, the experimental or non-experimental evidence that supports the
feature assignment38. These evidence codes are increasingly being used to justify
functional assignments of the open reading frames in newly sequenced genomes. As a
result, it becomes much easier for an outside user to trace the origin of a specific
annotation and decide whether it is trustworthy.
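
As a minimal sketch of how such evidence could be traced programmatically, the snippet below splits an /inference value of the kind used in INSDC records (for example "similar to DNA sequence:INSD:AY411252.1") into the inference type and the database reference. The helper name and the simplified "type:database:accession" layout are assumptions made for illustration; the controlled vocabulary defined by INSDC is richer than what this sketch handles.

```python
from typing import NamedTuple, Optional


class Inference(NamedTuple):
    inference_type: str        # e.g. "similar to DNA sequence"
    database: Optional[str]    # e.g. "INSD"
    accession: Optional[str]   # e.g. "AY411252.1"


def parse_inference(value: str) -> Inference:
    """Split a simplified 'type:database:accession' value (hypothetical helper)."""
    # Drop surrounding whitespace and quotes, then split on the colon separators.
    parts = [p.strip() for p in value.strip().strip('"').split(":")]
    inference_type = parts[0]
    database = parts[1] if len(parts) > 1 else None
    accession = ":".join(parts[2:]) if len(parts) > 2 else None
    return Inference(inference_type, database, accession)


parsed = parse_inference('"similar to DNA sequence:INSD:AY411252.1"')
print(parsed.inference_type, "|", parsed.database, "|", parsed.accession)
```

Parsing the qualifier only recovers which evidence is cited; it says nothing about how strong the underlying similarity was.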
It is important to note, however, that while the INSDC guidelines require the
annotator to specify the evidence in the /inference="similar to DNA
sequence:INSD:AY411252.1" format38, they impose no limits on the degree of
similarity that is acceptable in that annotation. As a result, certain technically
acceptable annotations may be based on extremely low similarity levels or even on
previous annotations that themselves were non-experimental and highly unreliable.
There have been several

34
http://www.insdc.org/
35
http://www.ncbi.nlm.nih.gov/genbank/
36
http://www.ebi.ac.uk/ena
37
http://www.ddbj.nig.ac.jp/
38
http://www.ncbi.nlm.nih.gov/genbank/evidence
39
http://www.insdc.org/documents

attempts to develop a common set of standard operating procedures for genome
annotation [87]; one such list is available online40, although most links there are no
longer functional. The NCBI maintains its own Prokaryotic Genome Annotation
Pipeline41 and Eukaryotic Genome Annotation Pipeline42 projects, which include
certain annotation standards43,44 designed to improve the annotation quality.
Still, there is a clear need for new computationally sound pipelines that would
comb through all sorts of disparate clues discussed in the previous sections in order to
a) provide the best possible annotations and b) not just list the annotation sources but
also evaluate the reliability of these annotations.
For protein annotation, the UniProt web site45 contains a variety of useful
documents, including a constantly updated list of protein naming guidelines46. The key
question is, of course, “Annotation propagation: when to cut, copy and paste?” as
formulated in [88]. Several years ago, we proposed an annotation schema that
included the following seven categories [89]:

1. Exact biochemical function, based on high similarity to an experimentally
characterized, closely related homolog
2. Well defined biochemical function, unknown specificity
3. General biochemical function, based on family/superfamily assignment and/or a
conserved sequence motif
4. General biological function derived from the domain organization, genome
context (e.g., operons), experimental (e.g., protein-protein interactions), and/or
structural genomics data (e.g., similarities to proteins with known 3D structures)
5. Certain functional insights derived from the above data
6. Widely conserved protein, expressed under certain growth condition(s)
7. Organism- or genus-specific protein, expressed under certain growth
condition(s).
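
In an annotation pipeline, the seven categories could be carried along as an ordered type, so that each annotation records how specific its evidence is. The sketch below is one hypothetical encoding: the member names are invented here, treating a lower number as "more specific" is an assumption based on the order of the list, and the cutoff in `needs_human_review` reflects only the suggestion that the first two categories can follow curated rules.

```python
from enum import IntEnum


class AnnotationCategory(IntEnum):
    """The seven categories of the schema, numbered as in the list above."""
    EXACT_BIOCHEMICAL_FUNCTION = 1
    DEFINED_FUNCTION_UNKNOWN_SPECIFICITY = 2
    GENERAL_BIOCHEMICAL_FUNCTION = 3
    GENERAL_BIOLOGICAL_FUNCTION = 4
    FUNCTIONAL_INSIGHTS_ONLY = 5
    CONSERVED_EXPRESSED = 6
    LINEAGE_SPECIFIC_EXPRESSED = 7


def most_specific(categories):
    """Return the lowest-numbered (assumed most specific) supported category."""
    return min(categories)


def needs_human_review(category,
                       rule_based_max=AnnotationCategory.DEFINED_FUNCTION_UNKNOWN_SPECIFICITY):
    """Assumed policy: categories beyond `rule_based_max` go to a human annotator."""
    return category > rule_based_max


supported = [AnnotationCategory.GENERAL_BIOLOGICAL_FUNCTION,
             AnnotationCategory.GENERAL_BIOCHEMICAL_FUNCTION]
best = most_specific(supported)
print(best.name, "needs human review:", needs_human_review(best))
```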

For the first two of the above categories, the best guidance can be found on the
web site of the HAMAP project47, which includes a set of manually created
annotation rules48 that specify the proper annotations for specific family members [90].
For the third, and particularly for the remaining categories, the decision should
probably be made by a human annotator. Therefore, it is extremely important to provide
that human annotator with proper tools that simplify the work. In practical terms, that
would mean bringing together the results of all the analyses discussed above and
ranking them by their relevance and predictive value. The resulting report would
probably be rather long and confusing. As an example, the
40
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196215/table/T1/
41
http://www.ncbi.nlm.nih.gov/genome/annotation_prok
42
http://www.ncbi.nlm.nih.gov/books/NBK169439/
43
http://www.ncbi.nlm.nih.gov/genome/annotation_prok/standards
44
http://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/
45
http://www.uniprot.org/docs/
46
http://www.uniprot.org/docs/proknameprot
47
http://hamap.expasy.org/
48
http://hamap.expasy.org/rules.html

report for the Vibrio cholerae protein VC2772 (RefSeq entry NP_232398, UniProt
accession number Q9KNG7) would probably look like the following:

1. TIGRFAM04285, Nucleoid occlusion protein. Query coverage: 199/293 aa; target
coverage 198/255 aa; bit score: 223.5; E-value: 1.3e-71. Family description:
Nucleoid occlusion protein, a close homolog to ParB chromosome partitioning
proteins including Spo0J in Bacillus subtilis. Confidence: High
2. SwissProt BLAST hit P26497|SP0J_BACSU, Stage 0 sporulation protein J;
Query coverage: 288/293 aa; target coverage 275/282 aa; identities: 106/292;
positives: 168/292; gaps: 21/292; bit score: 171; E-value: 3e-55; Confidence: High
3. PDB BLAST hit 1VZ0, Chromosome Segregation Protein Spo0j From Thermus
Thermophilus. Query coverage: 231/293 aa; target coverage 211/230 aa;
identities: 98/232; positives: 149/232; gaps: 22/232; bit score: 170; E-value: 1e-55;
Confidence: High
4. TIGRFAM00180, ParB/RepB/Spo0J family partition protein. Query coverage:
179/293 aa; target coverage 186/187 aa; bit score: 177.5; E-value: 1.2e-54; Family
description: Chromosomal and plasmid partition proteins related to ParB,
including Spo0J, RepB, and SopB. Confidence: High
5. COG1475, Spo0J. Query coverage: 230/293 aa; target coverage 229/240 aa; bit
score: 156.2; E-value: 1.1e-45; Family description: Stage 0 sporulation protein J
(antagonist of Soj) containing ParB-like nuclease domain. Confidence: High
6. SUPERFAMILY SSF109709, KorB DNA-binding domain-like. Query coverage:
109/293 aa; Region: 122-230; E-value: 1.3e-33. Confidence: High
7. Pfam PF02195, ParBc. Query coverage: 89/293 aa; target coverage 88/90 aa; bit
score: 109; E-value: 1.3e-29; Family description: ParB-like nuclease domain.
Confidence: High
8. SUPERFAMILY SSF110849, ParB/Sulfiredoxin. Query coverage: 92/293 aa;
Region: 41-132; E-value: 3.4e-28. Confidence: High
9. SwissProt BLAST hit P77174|YBDM_ECOLI, Uncharacterized protein YbdM.
Query coverage: 136/293 aa; target coverage 140/209 aa; bit score: 39.3; E-value:
2e-8; Confidence: Medium
10. SwissProt BLAST hit P76068|YNAK_ECOLI, Uncharacterized protein YnaK.
Query coverage: 63/293 aa; target coverage 69/87 aa; bit score: 30.4; E-value:
2e-6; Confidence: Medium
11. PDB: 1VZ0, chromosome segregation protein Spo0J from Thermus thermophilus.
12. PubMed: 15228524, Leonard,T.A., Butler,P.J. and Lowe,J. Structural analysis of
the chromosome segregation protein Spo0J from Thermus thermophilus. Mol.
Microbiol. 53 (2), 419-432 (2004)
13. STRING Genome neighbors: VC_2773, ParA family protein (257 aa), score:
0.995; VC_2061, ParA family protein (258 aa), score: 0.932; gidA, tRNA uridine
5-carboxymethylaminomethyl modification enzyme GidA (631 aa), score: 0.877;
gidB, 16S rRNA methyltransferase GidB; specifically methylates the N7 position
of guanosine (210 aa), score: 0.862; ftsK, putative cell division protein FtsK;
DNA motor protein (960 aa), score: 0.823; VC_A1115, ParA family protein
(407 aa), score: 0.764.
14. STRING Domain fusions: None
15. STRING Coexpression data: atpB, F0F1 ATP synthase subunit A, key component
of the proton channel
16. Protein-protein interactions: ParA, a Walker-type ATPase with non-specific
DNA-binding activity.
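
A report of this kind could be assembled by collecting the hits in a common record, computing query coverage, and sorting by E-value. The sketch below does this for four of the hits listed above. The `Hit` record, the choice of E-value as the sort key, and the cutoffs used to assign confidence labels are assumptions chosen for illustration; the cutoffs are merely consistent with the labels in the example report and are not taken from any published standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str          # database or tool that produced the hit
    description: str
    aligned_query: int   # number of query residues covered by the hit
    query_length: int
    e_value: float

    @property
    def query_coverage(self) -> float:
        """Fraction of the query covered by the hit."""
        return self.aligned_query / self.query_length


def confidence(e_value, high=1e-20, medium=1e-5):
    """Assign a label from hypothetical E-value cutoffs (illustrative only)."""
    if e_value <= high:
        return "High"
    if e_value <= medium:
        return "Medium"
    return "Low"


# Values copied from the example report for VC2772 (query length 293 aa).
hits = [
    Hit("SwissProt P77174", "Uncharacterized protein YbdM", 136, 293, 2e-8),
    Hit("Pfam PF02195", "ParB-like nuclease domain", 89, 293, 1.3e-29),
    Hit("TIGRFAM04285", "Nucleoid occlusion protein", 199, 293, 1.3e-71),
    Hit("COG1475", "Spo0J", 230, 293, 1.1e-45),
]

# Smallest E-value first.
ranked = sorted(hits, key=lambda h: h.e_value)
for rank, h in enumerate(ranked, start=1):
    print(f"{rank}. {h.source}, {h.description}. "
          f"Query coverage: {h.query_coverage:.0%}; "
          f"E-value: {h.e_value:.1e}; Confidence: {confidence(h.e_value)}")
```

E-values obtained against different databases are not strictly comparable, because they depend on database size, so a real pipeline would need a more careful relevance score than this single sort key.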

Looking at all these data, the annotator would realize that VC2772 is a
DNA-binding protein that also interacts with the ParA protein and participates in
chromosome partitioning during cell division. Based on that, the tentative annotation
would probably be as follows: Chromosome segregation protein Spo0J, contains
ParB-like nuclease domain. Please note that automatic transfer of the annotation of
the best database hit, Stage 0 sporulation protein Spo0J, would be an unforgivable
mistake because, unlike B. subtilis, Vibrio cholerae does not form spores. This
example shows some of the caveats in annotating new proteins, even those with
reasonably well characterized homologs. However, one can hope that in the future it
will be possible to create a comprehensive set of rules (expanding those already
available in HAMAP48) that would allow a largely automated assignment of functions
to the great majority of proteins encoded in any bacterial or eukaryotic genome.
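
One rule suggested by the VC2772 example is to check a transferred protein name against what is known about the target organism before accepting it. The sketch below flags names that mention a process the organism lacks. The keyword table, the trait table, and the function name are hypothetical and would have to be curated in practice; only the sporulation entries reflect the example discussed in the text.

```python
# Hypothetical table: keyword in a protein name -> trait the organism must have.
PROCESS_KEYWORDS = {
    "sporulation": "forms_spores",
}

# Hypothetical trait table for target organisms.
ORGANISM_TRAITS = {
    "Bacillus subtilis": {"forms_spores"},
    "Vibrio cholerae": set(),   # does not form spores
}


def check_name_transfer(name: str, organism: str) -> list:
    """Return the keywords in `name` that conflict with the organism's known traits.

    Organisms missing from the table are treated as lacking every trait,
    so their transferred names are flagged for review rather than accepted.
    """
    traits = ORGANISM_TRAITS.get(organism, set())
    lowered = name.lower()
    return [keyword for keyword, trait in PROCESS_KEYWORDS.items()
            if keyword in lowered and trait not in traits]


conflicts = check_name_transfer("Stage 0 sporulation protein J", "Vibrio cholerae")
print(conflicts)  # ['sporulation']
```

A non-empty result would send the name to a human annotator instead of copying it from the best database hit.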

6 Conclusions

In conclusion, improved functional annotation is the only feasible way to extract
information from genomic sequences and gain a better understanding of the
processes in the living cell. For numerous uncultured organisms, as well as for
metagenomes, computational analysis is the only way to go. For the most part, improved
functional assignments would depend on the experimental characterization of the
remaining unknown genes. Several recent discoveries, including the CRISPR-Cas system
and the c-di-GMP-, c-di-AMP-, and c-di-GAMP-mediated cellular signaling in bacteria
and eukaryotes, show that there could still be major gaps in our understanding of the
key processes even in relatively well-studied cells.
That said, improved algorithms for functional annotation would play a major role
in generating viable hypotheses and guiding experimental research. For many
widespread uncharacterized proteins with sufficiently wide phylogenetic representation,
simultaneous application of all the tools described above can be expected to
generate a number of leads that would either point to the likely function or at least
suggest specific experiments that would eventually make it possible to identify it. That
would indeed be an invaluable contribution of comparative genomics to genome
biology and biology as a whole. Exactly this approach lies at the heart of the
COMputational BRidge to EXperiments (COMBREX49) project, which aims at obtaining the
best possible computational predictions and subjecting them to experimental
verification [91,92]. This and other similar projects have a bright future, as only through
the combined efforts of computational, structural, and experimental biologists will it
be possible to achieve a better understanding of gene function on the genome scale.

49
http://combrex.bu.edu/

Acknowledgements. This study was supported by the Intramural Research Program
of the National Library of Medicine at the U.S. National Institutes of Health.

References
1. Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage,
A.R., Bult, C.J., Tomb, J.-F., Dougherty, B.A., Merrick, J.M., McKenney, K., Sutton,
G.G., FitzHugh, W., Fields, C., Gocayne, J.D., Scott, J., Shirley, R., Liu, L.-I., Glodek, A.,
Kelley, J.M., Weidman, J.F., Phillips, C.A., Spriggs, T., Hedblom, E., Cotton, M.D.,
Utterback, T.R., Hanna, M.C., Nguyen, D., Saudek, D.M., Brandon, R.C., Fine, L.D.,
Frichtman, J.L., Fuhrmann, J.L., Geoghagen, N.S.M., Gnehm, C.L., McDonald, L.A.,
Small, K.V., Fraser, C.M., Smith, H.O., Venter, J.C.: Whole-genome random sequencing
and assembly of Haemophilus influenzae Rd. Science 269, 496–512 (1995)
2. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K.,
Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A.,
Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J.,
Mesirov, J.P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R.,
Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A.,
Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee,
C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham, I., Durbin,
R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M.,
Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J.C., Mungall, A.,
Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R.H., Wilson, R.K., Hillier,
L.W., McPherson, J.D., Marra, M.A., Mardis, E.R., Fulton, L.A., Chinwalla, A.T.,
Pepin, K.H., Gish, W.R., Chissoe, S.L., Wendl, M.C., Delehaunty, K.D., Miner, T.L.,
Delehaunty, A., Kramer, J.B., Cook, L.L., Fulton, R.S., Johnson, D.L., Minx, P.J., Clifton,
S.W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T.,
Doggett, N., Cheng, J.F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M.,
Gibbs, R.A., Muzny, D.M., Scherer, S.E., Bouck, J.B., Sodergren, E.J., Worley, K.C.,
Rives, C.M., Gorrell, J.H., Metzker, M.L., Naylor, S.L., Kucherlapati, R.S., Nelson, D.L.,
Weinstock, G.M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T.,
Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weissenbach, J., Heilig, R., Saurin,
W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Smith,
D.R., Doucette-Stamm, L., Rubenfield, M., Weinstock, K., Lee, H.M., Dubois, J.,
Rosenthal, A., Platzer, M., Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang,
J., Huang, G., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R.W., Federspiel,
N.A., Abola, A.P., Proctor, M.J., Myers, R.M., Schmutz, J., Dickson, M., Grimwood, J.,
Cox, D.R., Olson, M.V., Kaul, R., Shimizu, N., Kawasaki, K., Minoshima, S., Evans,
G.A., Athanasiou, M., Schultz, R., Roe, B.A., Chen, F., Pan, H., Ramser, J., Lehrach, H.,
Reinhardt, R., McCombie, W.R., de la Bastide, M., Dedhia, N., Blocker, H., Hornischer,
K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J.A., Bateman, A., Batzoglou, S.,
Birney, E., Bork, P., Brown, D.G., Burge, C.B., Cerutti, L., Chen, H.C., Church, D., Clamp,
M., Copley, R.R., Doerks, T., Eddy, S.R., Eichler, E.E., Furey, T.S., Galagan, J., Gilbert,
J.G., Harmon, C., Hayashizaki, Y., Haussler, D., Hermjakob, H., Hokamp, K., Jang, W.,
Johnson, L.S., Jones, T.A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent, W.J., Kitts, P.,
Koonin, E.V., Korf, I., Kulp, D., Lancet, D., Lowe, T.M., McLysaght, A., Mikkelsen, T.,
Moran, J.V., Mulder, N., Pollara, V.J., Ponting, C.P., Schuler, G., Schultz, J., Slater, G.,
Smit, A.F., Stupka, E., Szustakowski, J., Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L.,