Phylogenomics An Introduction Full Download

The document is an introduction to phylogenomics, detailing the evolution and methodologies used to reconstruct the tree of life through genetic data. It covers key topics such as sequencing techniques, data assembly, phylogenetic analyses, and sources of error in phylogenomic studies. The book is aimed at biology students and researchers, providing a concise overview of the field and its advancements.
Phylogenomics An Introduction

Christoph Bleidorn
Museo Nacional de Ciencias Naturales
Spanish National Research Council (CSIC)
Madrid
Spain

ISBN 978-3-319-54062-7    ISBN 978-3-319-54064-1 (eBook)


DOI 10.1007/978-3-319-54064-1

Library of Congress Control Number: 2017942964

© Springer International Publishing AG 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recita-
tion, broadcasting, reproduction on microfilms or in any other physical way, and transmission or infor-
mation storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publica-
tion does not imply, even in the absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein
or for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

All life on earth shares a common ancestor, and the aim of phylogenetic systematics is
to reconstruct the tree or network of life. Shortly after the availability of the first pro-
tein sequences, molecular phylogenetic approaches were developed to understand the
evolutionary relationships between proteins (or genes). It became clear that gene trees
would also help to unravel the phylogeny of species. The introduction of Sanger
sequencing and the polymerase chain reaction (PCR) made genetic approaches available
across the scientific community and contributed to the rise of molecular
phylogenetics. At the end of the 1990s, results from single-gene studies challenged
the century-old textbook view of evolutionary relationships of many groups (e.g. ani-
mals, plants). Fierce discussions regarding the validity of these results led to important
methodological advances, and, nowadays, molecular phylogenies are broadly accepted
to represent organismal relationships in textbooks. In the mid-2000s, sequencing
was revolutionized, leading to a huge drop in its costs, and unprecedented amounts
of sequence data became affordable for every type of study, including studies of
non-model organisms. This development transformed the field of molecular
phylogenetics into phylogenomics, where genome-scale data (genomes, transcriptomes)
can be exploited. The term phylogenomics was coined as early as 1998 by Jonathan Eisen
(also known under his Twitter handle @phylogenomics), who outlined the importance
of phylogenetic methods for the annotation of genes without relying on direct
(time-consuming) functional studies. This underlines how deeply embedded phylogenetic
methods are in the field of genomics. The theoretical background for reconstructing
gene trees (functional annotations) and species trees (reconstruction of the tree of life)
is broadly overlapping. In this book I will introduce the major steps of phylogenomic
analyses in general. The first two chapters briefly introduce the field of genomics
(Chap. 1, «Genomes») and the evolution and peculiarities of organellar genomes
(Chap. 2, «Organellar Genomes and Endosymbionts»). In Chap. 3 («Sequencing
Techniques»), I review the most widely used sequencing platforms, which is difficult
in a print format, as the field advances so fast that many numbers describing the output
of these machines might already be out of date when you read this chapter. Chapter 4
(«Sequencing Strategies») gives an overview of different strategies to sequence
complete or partial genomes and transcriptomes. The outputs of every sequencing
platform are sequences which are considerably shorter than chromosomes and, in the
case of short-read sequencing, also shorter than most genes. In Chap. 5 («Assembly and
Data Quality»), ways to puzzle these small pieces into more complete representations
of genomes and genes (called assembly) are introduced. Fundamental steps for every
phylogenomic study are alignments, read mapping and finding homologous genes,
which are explained in Chaps. 6 («Alignment and Mapping») and 7 («Finding
Genes»). Based on a sequence alignment, it is possible to reconstruct phylogenetic
trees, and the methods are briefly reviewed in Chap. 8 («Phylogenetic Analyses»). I
deliberately kept this chapter rather brief, as many excellent textbooks describing these
methods (and their underlying algorithms) in detail are available (see references in
Chap. 8). Moreover, the basic theory underlying these methods has not changed much
in the last decade. Surprisingly, even with this vast amount of data, many phylogenetic
questions still remain difficult to resolve. Some problems of phylogenetic
reconstruction are even amplified when using hundreds or thousands of genes due to the
presence of systematic error. Chapter 9 («Sources of Error and Incongruence in
Phylogenomic Analyses») gives an overview of possible sources of error, as well as
recommendations on how to deal with them. Moreover, the differences in analysing
gene trees and species trees and possible sources of incongruence between those are
outlined. Finally, in Chap. 10 («Rare Genomic Changes»), I introduce further
phylogenetic markers apart from plain sequence data (e.g. integrations of mobile elements,
gene order) and give an overview on how these rare genomic changes are utilized for
phylogenetic systematics.

During my time at German universities, I was heavily involved in teaching bachelor and
master level students. This included lectures, seminars and practical courses. As the
field of molecular phylogenetics changed on moving into the postgenomic era, so did
my courses. Besides the introduction of phylogenetic methods (e.g. maximum
parsimony, maximum likelihood), I realized that more and more background knowledge
became of major importance for carrying out phylogenetic analyses. This includes
knowledge about genomics, sequencing techniques as well as bioinformatic approaches to
handle sequence data before the actual phylogenetic analysis starts. With this book I
want to give a concise overview of all major steps of a phylogenomic analysis, as well as
some insights into recent advances in the field of genomics. This book is mainly
addressed to undergraduate and graduate biology students, but postdocs newly
moving to the field of phylogenomics might also use it as a first overview. The chapters are
written in a concise way and focus more on explaining the idea behind methods, instead
of digging deeply into the algorithmic or technical background. However, I always tried
to refer to the appropriate specific literature to get deeper insights into any method (or
study) of interest. Furthermore, I specified widely used and important software for every
step of the phylogenetic analysis. When possible, I mention several alternatives. The
name of software or scripts is always written in all caps, irrespective of the original way a
name is written. This book does not include instructions on how to use this software, as
in most cases detailed descriptions are available in the manual. As already noted, this
book is mainly addressed to biology students. Working in the field of phylogenomics
needs good to excellent (bio)informatic skills. Unfortunately, in the curriculum of many
bachelor and master programmes, bioinformatics is not taught. However, several
international courses teaching programming skills for (evolutionary) biologists take place
regularly (e.g. Cold Spring Harbor Course «Programming for Biology»; Programming
for Evolutionary Biology in Leipzig), and many excellent online tutorials are available.
As such I can only strongly suggest that any student interested in this field get used to
working with Linux/Unix command lines and acquire at least basic knowledge of
(scripting) languages like Python, Perl or R.

I would like to thank several colleagues who commented on earlier versions of the
chapters published here. In alphabetical order, they are Maite Aguado, Marie-Theres
Gansauge, Michael Gerth, Iker Irisarri, Lars Podsiadlowski and Alexander Suh. I am
grateful that Eva Nowack provided a picture of the enigmatic Paulinella. Moreover, I
want to thank Lars Vogt, Christoph Held and Andreas Schmidt-Rhaesa for introducing
me to the theoretical and practical world of molecular phylogenetics. The
above-mentioned university courses, which helped me to develop the outline and content of
this book, were taught at the Free University of Berlin, University of Potsdam and
University of Leipzig (in collaboration with Matthias Meyer from the Max Planck
Institute for Evolutionary Anthropology). I would like to thank the department heads
Thomas Bartolomaeus, Ralph Tiedemann and Martin Schlegel who gave me complete
freedom in filling these courses with life.

Christoph Bleidorn
Madrid, Spain, January 2017

Contents

1 Genomes
1.1 The Ring of Life
1.2 Genome Structure
1.3 Genome Size
1.4 The Genomes of Modern and Archaic Humans
References

2 Organelle Genomes and Endosymbionts
2.1 Mitochondria
2.1.1 Origin and Evolution of Mitochondria
2.1.2 Animal Mitochondrial Genomes
2.1.3 Mitochondrial Genomes of Plants and Algae
2.1.4 Mitochondrial Genomes of «Other» Eukaryotes
2.2 Plastids
2.2.1 Origin and Evolution of Plastids
2.2.2 Plastid Genomes
2.2.3 Plastids in the Amoeba Paulinella chromatophora
2.3 Heritable Bacterial Endosymbionts
2.3.1 Primary Endosymbionts
2.3.2 Secondary Endosymbionts
2.4 DNA Barcoding
References

3 Sequencing Techniques
3.1 Sanger Sequencing
3.2 454 Pyrosequencing
3.3 Reversible Terminator Sequencing (Illumina)
3.4 Ion Semiconductor Sequencing (Ion Torrent)
3.5 Single-Molecule Real-Time (SMRT) Sequencing (PacBio)
3.6 Nanopore Sequencing
3.7 Comparison of Sequencing Platforms
References

4 Sequencing Strategies
4.1 Shotgun Sequencing
4.2 RADseq
4.3 Hybrid Enrichment
4.4 Expressed Sequence Tags and RNA-Seq
4.5 Single-Cell Genomics and Transcriptomics
References

5 Assembly and Data Quality
5.1 Data Quality and Filtering
5.2 Assembly Strategies
5.2.1 Greedy Assemblies
5.2.2 Overlap-Layout-Consensus (OLC) Assemblies
5.2.3 K-mer Assemblies Using de Bruijn Graphs
5.3 Comparing Assemblies
5.4 De Novo Assembly of Genomes
5.4.1 Scaffolding
5.4.2 Hybrid Assemblies
5.5 De Novo Assembly of Transcriptomes and Metagenomes
References

6 Alignment and Mapping
6.1 Pairwise Alignment
6.2 Local Alignment and BLAST Searches
6.3 Multiple Sequence Alignment
6.4 Alignment Masking
6.5 Mapping Sequence Reads
6.6 Whole-Genome Alignments
References

7 Finding Genes
7.1 What Is a Gene?
7.2 Gene Gain and Loss
7.3 Homology of Genes
7.4 Inferring Orthology
7.5 Hidden Markov Profiles
7.6 Gene Ontology and the Ortholog Conjecture
7.7 Whole-Genome Duplications
References

8 Phylogenetic Analyses
8.1 Trees
8.2 Models of Nucleotide Substitution
8.3 Models of Amino Acid Substitutions
8.4 Model Selection and Data Partitions
8.4.1 Model Selection
8.4.2 Partition Finding
8.5 Inferring Phylogenies
8.5.1 Neighbour Joining
8.5.2 Maximum Parsimony
8.5.3 Maximum Likelihood
8.5.4 Heuristic Methods and Genetic Algorithms
8.5.5 Bayesian Inference
8.6 Support Measures
8.7 Molecular Clocks
References

9 Sources of Error and Incongruence in Phylogenomic Analyses
9.1 Incongruence in Phylogenomic Analyses
9.2 Systematic Errors
9.3 Missing Data, Phylogenetic Information Content and Taxon Sampling
9.3.1 Missing Data
9.3.2 More Genes or More Taxa?
9.3.3 Taxon Sampling
9.3.4 Gene Sampling
9.4 Incongruence Between Gene Trees and Species Trees
References

10 Rare Genomic Changes
10.1 The Perfect Phylogenetic Marker
10.2 Mobile Elements
10.3 MicroRNAs
10.4 Introns
10.5 Gene Order
10.6 Changes in the Genetic Code
References

Service Part
Glossary
Index

Abbreviations

μm Micrometre
A Adenine
AIC Akaike information criterion
ATP Adenosine triphosphate
BAC Bacterial artificial chromosome
BI Bayesian inference
BIC Bayesian information criterion
BLAST Basic Local Alignment Search Tool
bp Base pairs
C Cytosine
cDNA Complementary DNA
CI Cytoplasmatic incompatibility
CMOS Complementary metal-oxide semiconductor
CNV Copy number variation
CRISPR Clustered regularly interspaced short palindromic repeat
ddNTP Dideoxynucleoside triphosphate
DNA Deoxyribonucleic acid
dNTP Deoxynucleoside triphosphate
DUI Doubly uniparental inheritance
G Guanine
Gb Giga base pairs
GBS Genotyping by sequencing
GTR General time-reversible model
GWAS Genome-wide association study
HGT Horizontal gene transfer
ICE Integrative conjugative element
ILS Incomplete lineage sorting
ISFET Ion-sensitive field-effective transistor
Kb Kilo base pairs
LBA Long-branch attraction
LD Linkage disequilibrium
LINE Long interspersed element
LRT Likelihood ratio test
LTR Long terminal repeat
Mb Mega base pairs
MCMC Markov chain Monte Carlo method
MCMCMC Metropolis-coupled Markov chain Monte Carlo method
MITE Miniature inverted-repeat transposable element
ML Maximum likelihood
MP Maximum parsimony
mRNA Messenger RNA
mya Million years ago
NCBI National Center for Biotechnology Information
NGS Next-generation sequencing
NIP Near intron pair
NJ Neighbour joining
NNI Nearest neighbour interchange
OTU Operational taxonomic unit
PAM Point accepted mutations
PCR Polymerase chain reaction
PE Paired-end sequencing
pH Power of hydrogen
QTL Quantitative trait loci
RNA Ribonucleic acid
SINE Short interspersed element
SMRT Single-molecule real-time
SNP Single nucleotide polymorphism
SPR Subtree pruning and regrafting
T Thymine
Tb Tera base pairs
TBR Tree bisection and reconnection
TE Transposable element
TPRT Target-primed reverse transcription
tRNA Transfer RNA
UCE Ultraconserved element
wgs Whole-genome shotgun
ZMW Zero-mode waveguide
1 Genomes

1.1 The Ring of Life
1.2 Genome Structure
1.3 Genome Size
1.4 The Genomes of Modern and Archaic Humans
References

© Springer International Publishing AG 2017


C. Bleidorn, Phylogenomics, DOI 10.1007/978-3-319-54064-1_1

• Life on earth can be largely classified into Bacteria, Archaea and Eukaryota.
• Eukaryotes likely arose by symbiogenic origin due to the fusion of an archaean with
  a bacterium.
• Bacteria and Archaea have compact genomes with uninterrupted genes, contained
  in a single, circular DNA molecule, located in the nucleoid.
• Eukaryote genomes are linearly organized into separate chromosomes, located
  in the nucleus, and contain genes interrupted by introns.
• Eukaryotes bear substantially larger genomes than archaeans and bacteria, but
  within eukaryotes there is no correlation between complexity and genome size.
• The human genome is around 3.3 Gb in size, but protein-coding genes and other
  functional DNA only make up a small proportion (<10%), whereas transposable
  elements dominate (>44%).
• High-throughput sequencing of ancient human DNA allowed the reconstruction
  of archaic human genomes and led to the discovery of a hitherto unknown
  lineage, called Denisovan.
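
To put the human genome figures from the list above into absolute terms, the percentages can be converted into megabases. The following is only a back-of-the-envelope sketch: the three constants are the rounded bounds quoted in the text, and the function name is a hypothetical helper introduced for illustration.

```python
# Rough arithmetic on the figures quoted above (rounded bounds, not measurements).
GENOME_SIZE_BP = 3.3e9          # ~3.3 Gb, as stated in the text
FUNCTIONAL_FRACTION_MAX = 0.10  # "<10%": upper bound for genes and other functional DNA
TE_FRACTION_MIN = 0.44          # ">44%": lower bound for transposable elements


def fraction_to_megabases(genome_size_bp: float, fraction: float) -> float:
    """Convert a fraction of the genome into megabases (1 Mb = 1e6 bp)."""
    return genome_size_bp * fraction / 1e6


functional_mb_max = fraction_to_megabases(GENOME_SIZE_BP, FUNCTIONAL_FRACTION_MAX)
te_mb_min = fraction_to_megabases(GENOME_SIZE_BP, TE_FRACTION_MIN)

print(f"Functional DNA: at most about {functional_mb_max:.0f} Mb")
print(f"Transposable elements: at least about {te_mb_min:.0f} Mb")
```

Under these rounded figures, transposable elements account for more than four times as much sequence as all protein-coding genes and other functional DNA combined.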

1.1 The Ring of Life

Life on earth was for a long time classified into two major groups, prokaryotes and
eukaryotes (Stanier and van Niel 1962; Cavalier-Smith 2010). Prokaryotic cells are
characterized by the lack of a true nucleus, the absence of cell organelles and a genome
that is (usually) organized as a circular DNA molecule. Prokaryotic cells are usually
small (<10 μm) and mostly unicellular, even though some photosynthetic bacteria form
true multicellular chains (Flores and Herrero 2010). Besides being characterized by the
absence of these features, only prokaryotes show a coupling of translation and
transcription: the translation of an mRNA starts before its transcription has finished (Martin
and Koonin 2006). In contrast, eukaryotic cells have their DNA organized on
chromosomes located in a membrane-bound nucleus. With the exception of a few secondary
losses, eukaryotes harbour (at least) mitochondria as cell organelles. Cell division is
achieved through mitosis, and meiosis, the prerequisite for sexual reproduction, was likely
already present in the last common ancestor of eukaryotes (Ramesh et al. 2005).
Eukaryotic cells are usually considerably bigger (>10 μm) than prokaryotic ones, and
multicellularity evolved convergently in several major eukaryotic taxa. A strong increase
in the number of investigated organisms revealed many exceptions to the features
mentioned here, blurring a clear distinction between «prokaryote-like» and «eukaryote-like»
properties (Gregory and DeSalle 2005).
The division of life into two major groups was challenged by a series of publications
from the group of the American evolutionary microbiologist Carl Woese. Investigating
ribosomal sequence data, they found profound distances between two prokaryote groups,
now usually referred to as Bacteria and Archaea (Woese and Fox 1977; Fox et al. 1977;
Balch et al. 1977). Although first discovered predominantly in extreme environments,
Archaea have since been found in virtually all environments and seem to be
dominant in some forms of marine plankton. Moreover, they are the only organisms capable of
methanogenesis (Gribaldo and Brochier-Armanet 2006). Fundamental differences
between Bacteria and Archaea were confirmed in subsequent studies, leading to a new
classification of life into three domains, where Eukaryota represent the third one (Woese
et al. 1990).
One of the defining features of eukaryotes is the possession of mitochondria. The pri-
mary function of these organelles is ATP synthesis through the oxidative electron trans-
port chain, but also other functions are described (e.g. intracellular signalling). Similarities
in the physiology and biochemistry of mitochondria with bacterial cells led to the endo-
symbiotic theory. According to this theory, mitochondria are of bacterial origin, an idea
that dates back to a proposal from Ivan E. Wallin (1927). This hypothesis was later strongly
advocated by Lynn Margulis (1970). Mitochondria still bear their own, circular genome,
but massive transfer of mitochondrial genes to the host genome led to a strong size reduc-
tion. Phylogenetic analyses of mitochondrial genes recovered a close relationship with
Alphaproteobacteria, thereby strongly supporting the endosymbiotic theory. The initial
role of mitochondria in the symbiosis with their host, and the environmental circumstances
of this symbiosis, remain debated (Martin and Muller 1998; Wang and Wu 2014).
The three-domain hypothesis suggests the respective monophyly of Bacteria, Archaea
and Eukaryota. In this case, each of these groups should include all descendant lineages of
a common ancestor and only these. Phylogenomic analyses were used to investigate this
question, and analyses based on a small set of core genes, which are present in all three
groups and which are regarded as not having been transferred horizontally between groups,
recovered the three-domain tree (Ciccarelli et al. 2006). However, eukaryotic genomes
contain genes with different origins (Williams et al. 2013). Analyses of gene families
group eukaryotic genes either with Cyanobacteria, Alphaproteobacteria or within
Archaea (Pisani et al. 2007). These results reflect the symbiotic origin of plastids from
Cyanobacteria and the origin of mitochondria from Alphaproteobacteria and further
suggest an origin of eukaryotes from an archaeal ancestor. A large-scale phylogenomic
analysis including a newly discovered taxon called Lokiarchaeota provides further strong
support for the hypothesis that the eukaryotic ancestor evolved from an archaeon (Spang
et al. 2015). A subsequent study discovered several so far undescribed archaeans (named
Asgard archaea), which group with eukaryotes (Zaremba-Niedzwiedzka et al. 2017).
Furthermore, these archaeans bear several proteins that had been regarded as
eukaryote-specific, suggesting that the archaeal host contained many key components
important for the control of eukaryotic cellular complexity. Considering emerging evidence
from molecular phylogenetics, physiology, cell biology and palaeontology, a symbiogenic
origin from the merger of an archaean and an alphaproteobacterium becomes apparent
(McInerney et al. 2014). Phylogenetic analyses of eukaryote gene families support the
symbiogenic origin of eukaryotes (Rochette et al. 2014). Lane and Martin (2012) sug-
gested that mitochondria are a prerequisite for the evolution of complexity as seen in
eukaryote cells. Finally, the fossil record suggests a much older age for bacterial (or
archaeal) lineages, at 3.4 billion years (Wacey et al. 2011), than for eukaryotes. The first
fossilized eukaryotic cell dates to 1.7–1.8 billion years ago (Rasmussen et al. 2008), which
sets a possible time horizon for the merging event (McInerney et al. 2014). The symbiogenic
origin of eukaryotes renders two of the domains paraphyletic. Instead of being
strictly bifurcating, the early tree of life seems to be better represented by a network or a
ring (Fig. 1.1).
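
The notion of monophyly used above (a group contains all descendants of a common ancestor and only these) can be illustrated with a minimal Python sketch. The tree topology, the taxon names and the helper functions `leaves` and `is_monophyletic` are assumptions made for illustration only; the tree is a deliberately simplified rooted tree in which a eukaryote lineage branches from within Archaea, and it does not represent the merger with an alphaproteobacterium.

```python
# Toy rooted tree written as nested tuples; leaves are strings.
# The topology is a hypothetical simplification: the eukaryote lineage
# is placed inside Archaea, as suggested for the archaeal host.
TREE = (
    ("bacterium_A", "bacterium_B"),
    ("archaeon_A", ("archaeon_B", "eukaryote_A")),
)

def leaves(node):
    """Return the set of leaf names found below a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result

def is_monophyletic(tree, group):
    """True if some node of the rooted tree has exactly `group` as its leaves."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)

bacteria_ok = is_monophyletic(TREE, {"bacterium_A", "bacterium_B"})
archaea_only = is_monophyletic(TREE, {"archaeon_A", "archaeon_B"})
archaea_plus_euk = is_monophyletic(
    TREE, {"archaeon_A", "archaeon_B", "eukaryote_A"}
)
print(bacteria_ok, archaea_only, archaea_plus_euk)
```

In this toy tree the two archaeal leaves alone do not form a clade, because their common ancestor also gave rise to the eukaryote lineage; the group only becomes monophyletic once the eukaryote is included.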

Fig. 1.1 The ring of life hypothesis (Reprinted by permission from Macmillan Publishers Ltd: Nature
(McInerney et al. 2014), Copyright 2014). Labels in the figure: Eubacteria, Eukaryota,
Archaebacteria; chloroplast origin; eukaryogenesis; axis: time.

Sequencing of bacterial, archaeal and eukaryote genomes enabled the discovery of
many important insights into the evolution, ecology and physiology of these organisms
(Fraser et al. 2000; Galagan et al. 2005). However, there is a bias in available genome
sequences in these groups. Whereas many taxa including model organisms, pathogens or
organisms with economic importance are well investigated, other taxa are completely
neglected. Consequently, a phylogeny-driven approach to cover genome sequencing
across the whole tree of life has been proposed to fill these gaps (Wu et al. 2009; del Campo
et al. 2014). Currently, major initiatives organize collaborative efforts in taxon-specific
genome sequencing projects. Especially for animals, large-scale sequencing projects aim
to sequence hundreds to thousands of nematode, arthropod, invertebrate and vertebrate
genomes (Robinson et al. 2011; Genome 10K Community of Scientists 2009; Kumar et al.
2012; GIGA Community of Scientists 2014). Phylogenetic analyses of whole-genome or
transcriptome data greatly improved our understanding of bacterial, archaeal and eukary-
otic relationships. Backbone trees of bacterial and archaeal phylogenies are available and
have been used to study the influence of horizontal gene transfer on the evolution of these
groups (Nelson-Sathi et al. 2015; Lang et al. 2013; Wu et al. 2009; Groussin et al. 2016).
Phylogenomic analyses of eukaryotes recover five major clades comprising their vast
diversity (Fig. 1.2): (I) Archaeplastida (plants and green algae, red algae, glaucophytes);
(II) the SAR clade representing stramenopiles, alveolates and Rhizaria; (III) Excavata;
(IV) Amoebozoa; and (V) Opisthokonta, which unites fungi, choanoflagellates and
animals (Katz and Grant 2014).

Fig. 1.2 Phylogenetic relationships of eukaryotes based on the phylogenomic analyses of
Katz and Grant (2014). Tip labels in the figure: brown algae, diatoms, Foraminifera,
oomycetes, Apicomplexa, Dinoflagellata, ciliates, Cercozoa, Rhodophyta, Viridiplantae,
glaucocystophytes, haptophytes, cryptomonads, Heterolobosea, Euglenozoa, jakobids,
Fornicata, Amoebozoa, Choanoflagellata, Metazoa, Ichthyosporea, Fungi. Clade labels:
SAR clade, Archaeplastida, Excavata, Opisthokonta.

1.2 Genome Structure

There are profound differences between prokaryotes and eukaryotes in the structure
and organization of their genomes, which in turn strongly influence how they are
handled in phylogenomic studies. Generally, prokaryote genomes are smaller and
more compact than those of eukaryotes, clearly reducing the effort of sequencing and
assembling them. However, due to the endosymbiotic origin of eukaryotes, many of the
features discussed below show a mosaic-like distribution. Most
genomes of bacteria and archaeans are contained in a single, circular DNA molecule,
located in the nucleoid. For packaging, the double-stranded DNA molecule is super-
coiled, which is facilitated by DNA-binding proteins. Whereas in bacteria the supercoil-
ing is achieved by proteins like DNA gyrase, DNA topoisomerase I and HU proteins,
archaeans have proteins for packaging that are similar to the histones of eukaryotes
(White and Bell 2002). Exceptions to these general patterns exist; for example, some
members of the bacterial taxa spirochaetes and actinomycetes have linearly organized
genomes (Hinnebusch and Tilly 1993). Multipartite genomes are not unusual among
prokaryotes either (Harrison et al. 2010). Eukaryote genomes are linearly organized
into separate chromosomes. Within chromosomes the DNA is packaged into nucleosomes
through association with histone proteins. Furthermore, chromosomes bear centromeres
and telomeres. Centromeres are characterized by a special set of proteins which
form the attachment point for microtubules during cell division. Telomeres are the caps
of the chromosome ends and are characterized by the presence of repetitive DNA motifs
(Brown 2007).
Prokaryotes often have a high potential for horizontal gene transfer (HGT) by mobile
genetic elements. Movement of DNA can be facilitated by transformation, conjugation
or transduction. In the case of transformation, free extracellular DNA is taken up by the
recipient cell with the help of special proteins. Conjugation is gene transfer mediated by
plasmids or so-called integrative conjugative elements (ICEs) via contact between donor
and recipient cells. Finally, transduction is gene transfer by bacteriophages (Frost et al. 2005). The
presence of extrachromosomal elements such as plasmids, which usually carry accessory
(but not essential) genes, and the frequent occurrence of HGT mean that gene content
often differs substantially between strains of the same prokaryotic species. This
led to the formulation of the pan-genome concept. A pan-genome is composed of two
parts: a «core genome», containing the genes present in all strains of a prokaryotic spe-
cies, and the «dispensable genome» summarizing the genes which occur in a subset of
strains or only one (Medini et al. 2005). Most archaeal and many bacterial genomes bear
clustered regularly interspaced short palindromic repeats (CRISPRs). Together with
CRISPR-associated (Cas) proteins, these repeats constitute an adaptive immune system that
can target invading bacteriophages or conjugative plasmids (Horvath and Barrangou 2010;
Burstein et al. 2016). Plasmids also occur in some eukaryotes, e.g. in yeast and other fungi
(Hausner 2003).
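
The pan-genome concept described above maps naturally onto set operations. The following is a minimal Python sketch under stated assumptions: the strain names, the gene names and the function `pan_genome` are hypothetical placeholders, and real analyses would first have to cluster genes into orthologous families before any such comparison.

```python
# Hypothetical gene content of three strains of one prokaryotic species.
# Strain and gene names are placeholders, not real annotations.
strains = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD"},
    "strain_3": {"geneA", "geneB", "geneC", "geneE"},
}

def pan_genome(strain_genes):
    """Split the gene repertoire into pan, core and dispensable genome.

    Assumes at least one strain is given.
    """
    gene_sets = list(strain_genes.values())
    pan = set().union(*gene_sets)           # genes found in any strain
    core = set.intersection(*gene_sets)     # genes found in every strain
    dispensable = pan - core                # genes found in only a subset
    return pan, core, dispensable

pan, core, dispensable = pan_genome(strains)
print(sorted(core))
print(sorted(dispensable))
```

Here the core genome consists of the two genes shared by all three strains, while the remaining three genes, each present in only one or two strains, make up the dispensable genome.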
Prokaryotic genomes are usually compactly organized, with a small proportion of
non-coding intergenic DNA. Consequently, prokaryotic genomes are relatively small,
rarely exceeding 10 Mb. The smallest known genomes are reported for endosymbiotic
bacteria, with the betaproteobacterium Candidatus Tremblaya princeps as record
holder with a genome of only 139 kb. Bacteria with extremely reduced genomes are
dependent on genes from their host or from other co-occurring endosymbionts (Husnik et al.
2013; McCutcheon and Moran 2012). Genome sizes of eukaryotes are more variable and
can exceed several hundred Gb (see 1.3 for more details). Not only are the genomes of
prokaryotes smaller than those of eukaryotes but also their genes. The mean protein
length is 40–60% higher in eukaryotes than in prokaryotes, and this holds true across dif-
ferent functional classes of proteins (Zhang 2000; Brocchieri and Karlin 2005). Moreover,
prokaryote genes are not interrupted by spliceosomal introns, which are typical for
eukaryote nuclear genomes (Roy and Gilbert 2006). For example, human genes are
interrupted on average by nine introns, and intronic sequences make up a substantial part
of the complete genome (Venter et al. 2001). Spliceosomal introns exhibit special sequence
motifs and are removed from the primary transcript, before translation, by the
spliceosome, which is formed by five small RNAs and over 200 proteins (Irimia and Roy
2014). However, other types of introns can be found in prokaryotes. Group II introns are
self-splicing introns that have been reported in ~25% of all sequenced bacterial genomes,
but always at low frequency. They are also found in eukaryote organelle genomes, but are
known from only a few archaeal genomes, where they likely originate from horizontal
transfer from bacteria (Lambowitz and Zimmerly 2011). Other types of introns are rarer
and often restricted to certain types of genes (e.g. tRNAs), but can also be found across all
organisms (Irimia and Roy 2014).
Eukaryote genomes often carry a huge proportion of interspersed elements and tan-
dem repeats. Both types are usually rare or completely absent in prokaryotic genomes.
Tandemly repeated DNA, which is sometimes called satellite DNA, can be found around
centromeres or randomly scattered across chromosomes. Tandem repeats with short repet-
itive motifs are known as mini- and microsatellites (Brown 2007). Interspersed elements
have the ability to integrate into new sites of the genome of their origin, often in a random
pattern, even though many transposons show a preference for a specific target site. These
transposable elements are historically classified according to their mode of transposition
into retrotransposons (class I) and DNA transposons (class II) (Finnegan 1989). Such ele-
ments altogether often contribute massively to the genome size of eukaryotes (Kazazian
2004). DNA transposons are mobile elements transposed by a cut-and-paste mechanism,
where they are excised from one genomic site and integrated into a new one. These ele-
ments usually encode a transposase and bear terminal inverted repeats. Ten different
superfamilies of eukaryotic cut-and-paste DNA transposons are currently distinguished,
which show an enormous variation in their distribution across taxa (Wicker et al. 2007).
Two further groups of DNA transposons (Helitrons, Mavericks) likely use copy-and-paste
mechanisms for their spread across genomes (Feschotte and Pritham 2007). In contrast to
DNA transposons, retrotransposons are transcribed into RNA and subsequently reverse
transcribed and copied into the genome (copy and paste), leading to a duplication of the
element. Some autonomous retrotransposons bear long terminal repeats (LTRs) at their
ends. These LTR retrotransposons encode several specific proteins, including a reverse
transcriptase and an integrase, and they are generally similar to retroviruses, with which
they share their replication mechanism (Kazazian 2004). There is no sharp distinction
between LTR retrotransposons and retroviruses, as exogenous retroviruses
can easily become endogenous by losing their env gene, which encodes the protein
on the surface of the viral particle that is responsible for cell entry (Magiorkinis et al.
2012). Other autonomous retrotransposons lack the LTRs and use a different copy-and-
paste mechanism than LTR retrotransposons, namely, target-primed reverse transcription
(TPRT) (Luan et al. 1993). Autonomous non-LTR retrotransposons, which are also called
LINEs (long interspersed elements), such as L1 elements, constitute a high proportion of
the human genome (see below). In contrast, nonautonomous non-LTR retrotransposons
lack coding capacity for genes needed for their retrotransposition. These elements are
commonly referred to as SINEs (short interspersed elements) and mostly range in length
between 100 and 500 bp. SINEs are transcribed by RNA polymerase III, for which they
contain a promoter in their sequence. For reverse transcription, they have to be bound
by the reverse transcriptase of a LINE, and they are subsequently integrated into a new
genomic location via TPRT (Kramerov and Vassetzky 2011). SINEs classified as Alu ele-
ments show the highest copy number of all transposable elements in humans (Batzer and
Deininger 2002). DNA transposons are common in both eukaryotes and prokaryotes
and are frequently transferred horizontally (Gilbert et al. 2010). Retrotransposons are
usually restricted to eukaryotes, and their horizontal transfer is less frequent, except for
the RTE superfamily of LINEs (Suh et al. 2016).
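
The classification of transposable elements outlined in this section can be summarized as a small decision procedure. The sketch below is only an illustration of the text's categories: the `Element` record, its three boolean fields and the `classify` function are hypothetical names, and the scheme ignores superfamilies as well as element types not discussed here (e.g. nonautonomous LTR elements).

```python
from dataclasses import dataclass

@dataclass
class Element:
    rna_intermediate: bool  # transposes via an RNA copy that is reverse transcribed
    has_ltr: bool           # bears long terminal repeats (LTRs) at its ends
    autonomous: bool        # encodes the proteins needed for its own transposition

def classify(element):
    """Assign an element to one of the broad groups named in the text."""
    if not element.rna_intermediate:
        # Includes cut-and-paste elements as well as Helitrons and Mavericks.
        return "DNA transposon (class II)"
    if element.has_ltr:
        return "LTR retrotransposon (class I)"
    if element.autonomous:
        return "LINE (class I, autonomous non-LTR)"
    return "SINE (class I, nonautonomous non-LTR)"

# Example inputs modelled on elements mentioned in the text.
l1_like = Element(rna_intermediate=True, has_ltr=False, autonomous=True)
alu_like = Element(rna_intermediate=True, has_ltr=False, autonomous=False)
print(classify(l1_like))
print(classify(alu_like))
```

The first question separates the two classes by mode of transposition, following Finnegan (1989) as cited above; only within the retrotransposons do LTRs and coding capacity matter.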

1.3 Genome Size

The genome size of an organism can be measured by the c-value, which describes the mass
of the DNA content of a haploid cell in picograms (pg). A c-value of 1 pg equals ~978 Mb
(Dolezel et al. 2003). Bacterial and archaeal genomes are usually rather small, but within
eukaryotes genome size shows huge variation, with differences that can exceed 10,000-
to 100,000-fold in pairwise comparisons (Fig. 1.3). However, it seems that there is no
relation between the complexity of an organism (e.g. defined by the number of different cell
types) and its genome size, a conundrum which is known as the «c-value paradox»
(Thomas 1971; Gregory 2001). For example, the canopy plant Paris japonica has a c-value
of ~133 pg, more than 35× larger than that of humans (~3.5 pg) (Pellicer et al. 2010). As
genome sequencing projects have shown, eukaryotic genomes often contain only
small amounts of coding or functional DNA, and the large genome size in eukaryotes is
usually due to huge amounts of mobile elements (Lynch 2007).
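
The conversion between c-values and base pairs given above is simple arithmetic and can be sketched in a few lines of Python. The constant and the two c-values are the approximate figures quoted in the text; the function name `c_value_to_mb` is an illustrative assumption.

```python
# Approximate conversion factor quoted in the text (Dolezel et al. 2003).
MB_PER_PG = 978

def c_value_to_mb(c_value_pg):
    """Convert a c-value in picograms to an approximate genome size in Mb."""
    return c_value_pg * MB_PER_PG

# Approximate c-values as given in the text.
paris_japonica_mb = c_value_to_mb(133)
human_mb = c_value_to_mb(3.5)
fold_difference = paris_japonica_mb / human_mb

print(round(paris_japonica_mb / 1000), "Gb")  # Paris japonica, roughly 130 Gb
print(round(human_mb / 1000, 1), "Gb")        # human, roughly 3.4 Gb
print(round(fold_difference))                 # roughly 38-fold
```

With these rounded inputs the ratio comes out at about 38, consistent with the statement that the Paris japonica genome is more than 35× larger than the human genome.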
