Ghosh and Mallik
Sequence analysis, gene prediction and phylogenetic analysis
7
Gene Prediction: Principles
and Challenges
“The universe is full of magical things, patiently waiting for our
wits to grow sharper.”
Eden Phillpotts
INTRODUCTION
The vast amount of raw sequence data generated because of advancement in
sequencing technology needs biological interpretation. This is known as annotation.
Biological tools are essential for annotation of these raw sequences obtained from
various sequencing projects, in particular, for finding genes and determining their
functions. These genes can be of different types, starting from protein-coding
genes to non-coding genes such as ribosomal RNA (rRNA), transfer RNA (tRNA),
microRNA (miRNA), and many more. This chapter focuses on the annotation of
protein-coding genes.
Although the Human Genome Project was completed in April 2003, the exact
number of genes encoded by the human genome is still unknown. Hence, genome
annotation is a necessity and a multi-step process in itself. The steps involved in
genome annotation can be grouped into three categories: nucleotide-level (gene
identification or prediction), protein-level (structure determination of proteins), and
process-level annotation (mechanism of biochemical reactions). Among these three
categories, nucleotide-level annotation is the most significant, as it primarily deals
with gene annotation, a fundamental step in molecular biology. In this respect, it is
essential to mention that the accuracy with which genes can be predicted is still far
from satisfactory. Although 80% of genes are accurately predicted at the nucleotide
level in human genome, only 45% are predicted at the exon level and only ~20% at
the whole-gene level. This is the reason that the estimates of the number of genes in the
human genome are still indefinite. At present, the annotation of most human genes is
based on cDNA sequence data. Systematic ‘full-length’ cDNA-sequencing programs,
such as the Mammalian Gene Collection in the US and at RIKEN (The Institute of
Physical and Chemical Research) in Japan, are generating vitally important
experimental data towards defining complete gene sets for the human and mouse genomes.
Hence, it is clear that further improvements to gene prediction are
necessary. Even if all human genes are experimentally determined, it would still be
imperative to understand how the structures of genes are organized and defined, and
how they can be recognized. The ability to predict a gene structure is both an
analytical and a practical challenge.
BIOLOGICAL OVERVIEW
Before progressing further into the intricate details of gene annotation, we must know what
this ‘gene’ actually means. Being so small, how does it create such a huge impact in
biology?
Gene
A gene is defined as a segment of DNA that contains the necessary information to
produce a functional product, usually a protein. It is a long strand of DNA (RNA
in some viruses) that contains a promoter, which controls the activity of the gene, and a
coding sequence, which determines what the gene produces.
While defining a gene or describing its structure, we need to be familiar with a
few biological terms such as promoter, CDS, and ORF.
Promoter
Promoter is the regulatory region of DNA located upstream (towards the 5’ region) of
a gene. This provides a control point for regulated gene transcription. It contains
specific DNA sequences that are recognized by proteins known as transcription
factors. These factors bind to the promoter sequences, recruiting RNA polymerase,
the enzyme that synthesizes RNA from the coding region of the gene. The promoter
elements are of two types — core promoter and proximal promoter.
Core promoter: It is the minimal portion of the promoter required to initiate
transcription properly. It is approximately 34 nucleotides in length. Core promoter
serves as a binding site for RNA polymerase and general transcription factors.
Proximal promoter: The proximal sequence upstream of the gene that tends to contain
primary regulatory elements is known as proximal promoter. It is approximately 250
nucleotides in length and serves as a binding site for specific transcription factors.
Open Reading Frame (ORF)
An ORF is a sequence of DNA that starts with a start codon ‘ATG’ (though not
always) and ends with any of the three termination codons (TAA, TAG, or TGA).
Depending on the starting point, there are six possible ways (three on the forward
strand and three on the complementary strand) of translating any nucleotide sequence
into an amino acid sequence, according to the genetic code. These are called reading
frames. Gene finding in organisms, particularly prokaryotes, starts from searching for
an ORF.
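The six reading frames described above can be scanned with a few lines of code. The following is a minimal sketch, not a production gene finder: the function names and the `min_codons` threshold are illustrative assumptions, only ATG is treated as a start codon, and coordinates for the "-" strand refer to the reverse-complemented sequence.

```python
# Illustrative six-frame ORF scan; names and threshold are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end, orf) for each ATG...stop run.

    start/end are 0-based, end-exclusive positions on the scanned strand;
    for strand "-" they index the reverse complement, not the input.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3  # codons before the stop
                    if n_codons >= min_codons:
                        results.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return results
```

For example, `find_orfs(sequence, min_codons=100)` would keep only candidates of at least 100 codons; the appropriate threshold depends on the genome and is a choice left to the user.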
Coding Sequence
Coding sequence is abbreviated as CDS. The CDS is the actual region of DNA that is
translated to form proteins. While the ORF may contain introns as well, the CDS
refers to those nucleotides (concatenated exons) that can be divided into codons
which are actually translated into amino acids by the ribosomal translation
machinery. In prokaryotes, the ORF and the CDS are the same.
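To make the idea of a CDS as concatenated exons concrete, here is a small sketch that joins exon intervals and translates the result with the standard genetic code. The exon coordinates (0-based, end-exclusive) and the helper names are assumptions made for illustration only.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def build_cds(genomic, exons):
    """Concatenate exon slices; exons are (start, end) pairs, end-exclusive."""
    return "".join(genomic[start:end] for start, end in exons)


def translate(cds):
    """Translate a CDS codon by codon; '*' marks a stop, 'X' an unknown codon."""
    cds = cds.upper()
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )
```

With a toy sequence of two exons separated by a six-base intron, `translate(build_cds("ATGGCGTAAGTCTAA", [(0, 5), (11, 15)]))` joins the exons into `ATGGCCTAA` before translation.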
WHAT IS GENE PREDICTION?
The characterization of genomic features using computational and experimental
methods is called gene prediction or annotation. Given an uncharacterized DNA
sequence, what are the things we would want to do with this sequence to know its
functional aspects? The answers to the following queries can be determined by gene
finding/gene annotation methods:
• Which region codes for a protein?
• Which DNA strand is used to encode the gene?
• Where does the gene start and end?
• Where are the exon-intron boundaries in eukaryotes?
• Where (optionally) are the regulatory sequences for that gene?
COMPUTATIONAL METHODS OF GENE PREDICTION
Computational gene prediction is relatively simple for the prokaryotes, where all the
genes are converted into the corresponding mRNA, and then into proteins. On the
other hand, eukaryotic gene finding is altogether a different task because eukaryotic
genes are not continuous and are interrupted by intervening non-coding sequences
called ‘introns’ (Figure 7.1). Moreover, the organization of genetic information in
eukaryotes and prokaryotes is different. However, genome annotation and genome
sequencing do not have an equal pace. Experimental genome annotation is slow and time
consuming. There is therefore an increasing demand to develop computational tools for
gene prediction, which will help in answering the following questions:
• Given a DNA sequence, what part of it codes for a protein and what part of it is
junk DNA?
• Classify the junk DNA as intron, untranslated region, transposons, dead genes, and so on.
• Divide a newly sequenced genome into the genes (coding) and the non-coding
regions.
Figure 7.1 The structure of a eukaryotic gene (DNA is transcribed into mRNA, which is translated into the protein sequence)
In small prokaryotic genomes, gene finding is largely a matter of identifying long
ORFs. However, even in this case, ambiguities arise if long ORFs overlap on
opposite strands, and the true genes must be sorted out. As genomes get
larger, gene finding becomes increasingly complicated. The main issue is the signal-to-
noise ratio. In a prokaryotic genome, such as Haemophilus influenzae, 85% of its
1.8 Mb genome is in coding regions. The corresponding number in yeast is not much
lower at 70%. For these genomes, ‘calling genes’ is an exercise in running a
computer program that carries out a six-frame translation and identifies all ORFs that
are longer than a chosen threshold. However, even in these small genomes, finding
genes has not been entirely effortless. The number of predicted yeast genes, e.g., took
several years to settle down, and there are still several short ORFs that have an
uncertain status as bona fide genes. The process of finding genes is further complicated
by the presence of splicing and alternative splicing. In the human genome, a typical
exon is 150 bp and a typical intron is several kilobases, and there is no clear
delineation between the intergenic regions that separate adjacent genes and the
intragenic regions that separate exons. Defining the precise start and stop position of a gene
and the splicing pattern of its exons among all the non-coding sequence is like finding
a very small and indistinct needle in a very large and distracting haystack.
Bioinformatics combines the expertise and knowledge of people from biological and
computation areas, and sets a common stage for people from these backgrounds to
work together to solve practical challenges like gene annotations.
In short, computational gene finding is a process of the following:
• Identifying common phenomena in known genes.
• Building a computational framework/model that can accurately describe the
common phenomena.
• Using the model to scan an uncharacterized sequence to identify regions that
match the model, which become putative genes.
• Testing and validating the predictions.
Information within a Genomic Sequence
There are different types of functional sites in genomic DNA that researchers have
sought to recognize. These are splice sites, start and stop codons, branch points,
promoters and terminators of transcription, polyadenylation sites, ribosomal-binding
sites, topoisomerase-II binding sites, topoisomerase-I cleavage sites, and various
transcription factor-binding sites. This information needs to be harnessed for gene
identification.
METHODS OF GENE PREDICTION
There are three approaches for computational gene annotation, as shown in Figure 7.2.
Figure 7.2 Three computational methods of gene prediction (extrinsic/homology method, intrinsic/ab initio method, and comparative method)
Extrinsic or Homology Method
This is a straightforward method for identifying protein-coding gene(s). It is based on
sequence similarity of query sequence with annotated genes present in databases,
Given a database of sequences of other organisms, we search for query sequence in
this database and identify database sequences (known genes) that resemble the query
sequence. If the identified sequences are genes, the query sequence is probably
(putatively) a gene. This approach is able to identify biologically relevant genes.
However, it cannot identify genes that code for proteins not present in the
database. Basic local alignment search tool (BLAST) is a well-known search tool in
this category (see Table 7.1). It is known that only approximately half of the genes can
be found by homology to other known genes or proteins (although this percentage is
of course increasing as more genomes get sequenced). In order to determine the
remaining 50% of the genes, the only solution is to use other prediction methods such
as the intrinsic method or a combination of both intrinsic as well as extrinsic methods.
The following are the principles of the homology method:
1. Coding regions evolve slower than non-coding regions, i.e., local sequence
similarity can be used as a gene finder.
2. Homologous sequences reflect a common evolutionary origin and possibly a
common gene structure, i.e., gene structure can be solved by homology
(mRNAs, ESTs, proteins, and protein domains).
3. Standard pair-wise comparison methods can be used (BLAST or
Smith-Waterman).
4. Include ‘gene syntax’ information (start/stop codons, etc.).
5. Homology methods are also useful to confirm predictions inferred by other
methods.
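The pair-wise comparison named in principle 3 can be illustrated with a compact Smith-Waterman scorer. This is a sketch under simplifying assumptions: the match, mismatch, and linear gap values are arbitrary illustrative choices, and only the best local score is returned (no traceback). Tools such as BLAST use heuristics and more elaborate scoring.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    Scoring parameters are illustrative assumptions, not recommended values.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

A query region whose score against a known gene is well above what unrelated sequences produce would be a candidate (putative) gene; choosing that cut-off is a statistical question not covered by this sketch.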
Table 7.1 Software based on extrinsic method for eukaryotic gene identification
Software | Method | Organisms | Web interface
AAT | BLASTX/BLASTN | Primates, rodents | http://www.genome.cs.mtu.edu/aat.html
EbEST | BLASTN, EST clustering | Human | http://www.…mcw.edu/EBEST/
GeneSeqer | Spliced alignment, similarity with EST | Arabidopsis, maize, generic plant | http://www.maizegdb.org/geneseqer.php
ORFgene2 | BlastP, WAM for splice sites, identity score on frequencies of dipeptides | Human, mouse, Drosophila, Aspergillus, Arabidopsis, Caenorhabditis | http://www.125.itba.mi.cnr.it/webgene/wwworfgene2.htm
PROCRUSTES | Protein/protein alignments scored with PAM 120 | Vertebrates | http://www-hto.usc.edu/software/procrustes/
SIM4 | BLAST similarity with cDNA/DNA | All eukaryotes | http://…univ-lyon1.fr/sim4.php
Spidey | Two BLASTs: high stringency and low stringency | Vertebrates, Drosophila, C. elegans, plant | http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/index.html
TAP | WU-BLASTN, SIM4 | Human, mouse, Drosophila | http://www.sapiens.wustl.edu/zkan/TAP/
GeneWise | DP | Human | http://www.sanger.ac.uk/Software/Wise2/genewiseform.shtml
SYNCOD | Silent/replacement ratio, Monte Carlo simulations | Human, mouse, Drosophila, Arabidopsis, Aspergillus, Caenorhabditis | http://www.125.itba.mi.cnr.it/~webgene/wwwsyncod.html
Intrinsic or Ab Initio Method
This method attempts to predict genes based on the statistical properties of the given
DNA sequence. It attempts to extract information regarding gene locations
using statistical patterns inside and outside of the gene regions as well as typical
patterns at their boundaries. The ab initio gene identification method searches for certain
signals of protein-coding genes. This method is applicable for both prokaryotes and
eukaryotes. Gene prediction in prokaryotes is relatively easy since prokaryote
protein-coding genes have specific signals such as transcription factor-binding site and Pribnow
box that are easy to identify. The protein-coding sequence is a contiguous ORF,
starting with a start codon (ATG) and ending with a stop codon (TAG/TGA/TAA).
Moreover, prokaryotes have small genomes and show a high coding density
(>90%). There are no introns in prokaryotes. But the ab initio gene prediction method
is more difficult in eukaryotes for the following reasons:
1. Protein-coding genes are separated by large intergenic regions.
2. The genes are not contiguous. These genes are divided into exons and introns, which
are processed by the splicing mechanisms in eukaryotic cells. The split genes make it
difficult to define ORFs.
3. The signals (e.g., promoters) are more difficult to identify than those in
prokaryotes, since these signals are more complex and less specific. Two such
signals are CpG islands and binding sites for a Poly-A tail.
Features for Gene Prediction in Prokaryotes
The features of gene structure in prokaryotes act as the key criteria to identify
protein-coding genes in them. The gene structures of prokaryotes, which are used for gene
prediction, particularly in the ab initio method, are discussed here (see Figure 7.3).
Figure 7.3 Gene structure of a prokaryote (start codon, ORF (open reading frame), stop codon; reading frames 1 to 3)
Promoter elements: The sequence in DNA which defines the start of a gene is called
the promoter. There are two regions that are important in the definition of a
prokaryotic promoter (the −35 region and the −10 region, also known as the Pribnow box).
−35 Region: This sequence is centred about 35 bp before the start (upstream) of a
bacterial gene (prokaryote). It functions in the initial recognition of a gene by
RNA polymerase. The consensus sequence is TTGACA.
−10 Region (Pribnow box): This has the consensus sequence TATAAT and is
centred about 10 bp before the start of a bacterial gene, i.e., there are about 17 bp
between the −35 region and the Pribnow box. Since three hydrogen bonds hold a G-C
base pair together while only two are present between A and T, the strands of an
AT-rich region are separated more easily than in a GC-rich region. It is thought that
the enzyme DNA-dependent RNA polymerase initially makes contact with the −35
region, and then moves along the DNA until it finds an AT-rich region (the Pribnow
box). At this point the enzyme separates the two DNA strands, and the RNA
polymerase initiates RNA synthesis approximately 7 bp further along the DNA
(downstream of the TATAAT site). It then transcribes the DNA using the template
strand to produce an RNA molecule, which is, by definition, the sense strand.
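A simple signal search for the two promoter elements described above can be sketched as follows. Everything configurable here is an assumption for illustration: the default consensus strings are the commonly cited hexamers, the spacer window and the allowed number of mismatches are tunable guesses, and real promoters often deviate from the consensus, so hits are candidates only.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_candidates(seq, minus35="TTGACA", minus10="TATAAT",
                             spacer_range=(15, 19), max_mismatch=1):
    """Return (pos_35, pos_10, spacer) for paired near-consensus elements.

    Positions are 0-based starts of each element on the given strand.
    All default parameter values are illustrative assumptions.
    """
    seq = seq.upper()
    n35, n10 = len(minus35), len(minus10)
    hits = []
    for i in range(len(seq) - n35 + 1):
        if mismatches(seq[i:i + n35], minus35) > max_mismatch:
            continue
        for spacer in range(spacer_range[0], spacer_range[1] + 1):
            j = i + n35 + spacer  # start of the putative -10 element
            if j + n10 > len(seq):
                break
            if mismatches(seq[j:j + n10], minus10) <= max_mismatch:
                hits.append((i, j, spacer))
    return hits
```

Passing different `minus35`, `minus10`, or `spacer_range` values lets the same routine be reused with whichever consensus and spacing a given reference reports.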
Translation start site: Protein-coding genes generally start with the codon ‘ATG’, which
is known as the start codon, and it codes for methionine. In prokaryotes, in some cases,
the start codon is TTG or GTG. Any series of non-stop codons can be translated
computationally into an amino acid sequence. Such a region is called an ORF starting
with ATG, TTG, or GTG.
ORFs: Since stop codons are found in uninformative nucleotide sequences
approximately once every 21 codons (3 out of 64), a run of 30 or more triplet codons that
does not include a stop codon can be a probable gene. The presence of a set of
sequences around which ribosomes assemble (during translation) at the 5′ end of each
ORF serves as a hallmark for identification of protein-coding genes in prokaryotes.
This set of sequences is often found immediately downstream of transcriptional start sites
and just upstream of the first start codon. This sequence patch is known as the
ribosome-loading site (called the Shine-Dalgarno sequence) and has the consensus sequence
5′-AGGAGGU-3′.
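The stop-codon arithmetic in the ORF paragraph can be checked numerically. The sketch below assumes the simplest possible background model, in which all 64 codons are equally likely and independent; real genomes deviate from this, so the numbers are only indicative. The function names are hypothetical.

```python
def mean_codons_between_stops(p_stop=3 / 64):
    """Expected spacing between stop codons under a uniform codon model."""
    return 1 / p_stop


def p_no_stop(n_codons, p_stop=3 / 64):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - p_stop) ** n_codons
```

Under these assumptions `mean_codons_between_stops()` gives about 21.3, matching the "once every 21 codons" figure, and `p_no_stop(n)` falls geometrically with `n`, which is why longer open runs are increasingly unlikely to arise by chance and why the length threshold is a tunable choice.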
Translation stop site: This includes TAA, TAG, and TGA.
Termination sequences: The vast majority of prokaryotic protein-coding gene operons
contain specific signals for the termination of transcription called intrinsic
terminators. Intrinsic terminators have two prominent structural features: (i) a sequence of
nucleotides that includes an inverted repeat and (ii) a run of roughly six uracils
immediately following the inverted repeat.
Some of the gene prediction tools for prokaryotes are listed in Table 7.2.
Table 7.2 List of gene prediction tools for prokaryotes
Prediction tools | Web interfaces
AMIGene (annotation of microbial genes) | http://www.genoscope.cns.fr/agc/tools/amigene/
EasyGene | http://www.cbs.dtu.dk/services/EasyGene
GeneMark.hmm-P | http://www.exon.g…gmhmm2_prok.…
Glimmer | http://www.cbcb.umd.edu/softw…
GS-Finder | http://www.tubic.tju.edu.cn/…
MED-Start | …cn/main/SheGroup/…
REGANOR | …www.cebitec.uni-bielefeld.de/groups/brf/software/reganor/
TICO (translation initiation site correction) | http://www.tico.gobics.de/
Zcurve | http://www.tubic.tju.edu.cn/Zcurve_B/
Features for Gene Prediction in Eukaryotes
There is certain information present within the DNA sequence such as splice sites,
start and stop codons, branch points, promoters and terminators of transcription,
polyadenylation sites, ribosomal-binding sites, topoisomerase-II binding sites, topo-
isomerase-I cleavage sites, various transcription factor-binding sites, and so on, that
are called signals. Methods for detecting them are called signal sensors. Genomic
DNA signals can be contrasted with extended and variable length regions such as
exons and introns, which are recognized by different methods that may be called
content sensors.
Content sensors: These classify a DNA region into different types, e.g., coding versus
non-coding. Similarity-based approaches are often called extrinsic, in opposition to
others that try to capture some of the intrinsic properties of the coding/non-coding
sequences (compositional bias, codon usage, etc.).
Extrinsic content sensor: These sensors basically perform similarity searching between
a genomic sequence region and a protein or DNA sequence present in a database. Basic
tools needed for similarity searching between sequences are local alignment methods
like Smith-Waterman algorithm, fast heuristic approaches such as FASTA and
BLAST. Almost 50% of the genes can be identified by this procedure. Databases like
SwissProt or PIR serve as the source for the most widely used protein sequences for
such a purpose. Again, a good similarity score will not enable exact identification of the
gene structure, since homologous proteins do not share all their domains. Further,
UTR regions cannot be delimited by such a procedure. It is well assumed that coding
sequences are more conserved than non-coding ones. Under such an assumption,
similarity with genomic DNA can also serve as a valuable source of information on
intron/exon location. There are two approaches for such a purpose: intra-genomic
comparisons provide useful information regarding multigenic families, representing a
huge percentage of the existing genes. Intergenomic or cross-species comparisons
identify orthologous genes, without a preliminary knowledge about them.
Advantages: Similarity-based approaches depend on accumulated pre-existing
biological data. Hence, the predictions should be biologically relevant.
Disadvantages:
• If the database is devoid of sufficiently similar sequence, no result will be
obtained.
• Even if a good similarity is found, the limits of the regions of similarity are not
always precise and are unable to identify the gene accurately.
• Small exons are also missed out in the process.
Intrinsic content sensor: Originally, these were defined for prokaryotes. In
prokaryotes it is still common to locate genes by just looking for long ORFs. This is certainly
not adequate for higher eukaryotes. To discriminate coding from non-coding regions
in eukaryotes, content sensors use statistical models of the nucleotide frequencies and
dependencies present in codon structure. The most commonly used statistical models
are known as Markov models, popularized for gene finding. Neural networks are used
to combine several coding measures along with signal sensors for the flanking splice
sites. Other content sensors include sensors for CpG islands, which are regions that
often mark the beginning of genes where the frequency of the dinucleotide CG is not
as low as it is in the rest of the genome, and sensors for repetitive DNA, such as
human ALU sequences.
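A Markov-model content sensor of the kind mentioned above can be sketched as a log-odds score between a "coding" and a "non-coding" model. This is a deliberately simplified illustration: it uses a first-order, frame-independent chain with pseudocounts, whereas practical gene finders typically use higher-order, codon-position-aware models. The training sets, function names, and pseudocount are assumptions.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        s = s.upper()
        for x, y in zip(s, s[1:]):
            if x in counts and y in counts[x]:  # skip ambiguous bases such as N
                counts[x][y] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over adjacent base pairs."""
    seq = seq.upper()
    score = 0.0
    for x, y in zip(seq, seq[1:]):
        if x in coding and y in coding[x]:
            score += math.log(coding[x][y] / noncoding[x][y])
    return score
```

A positive score means the window looks more like the coding training set than the non-coding one; sliding this score along a genome gives a crude coding-potential profile.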
Signal sensors: Signal sensors are measures that try to detect the presence of the
functional sites specific to a gene. The basic signal sensor is a simple consensus
sequence or an expression that describes a consensus sequence along with allowable
variations. More sensitive sensors can be designed using weight matrices in place of
the consensus, in which each position in the pattern allows a match to any residue.
Different costs are associated with matching each residue in each position. The score
returned by a weight matrix sensor for a candidate site is the sum of the costs of
the individual residue matches over that site. If this score exceeds a given threshold,
the candidate site is predicted to be a true site. Such sensors have a natural
probabilistic interpretation in which the score returned is a log likelihood ratio under a
simple statistical model in which each position in the site is characterized by an
independent and distinct distribution over possible residues. More sophisticated types
of signal sensors, such as neural networks, are extensively used.
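The weight matrix sensor described above can be written down directly. In this sketch each column stores log-likelihood ratios against a uniform background, the site score is the sum over positions, and a threshold decides which windows are reported. The uniform background, the pseudocount, and the function names are assumptions; the input is assumed to contain only uppercase A, C, G, T.

```python
import math


def build_pwm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds weight matrix from equal-length example sites."""
    alphabet = "ACGT"
    background = background or {b: 0.25 for b in alphabet}
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in alphabet}
        )
    return pwm


def score_site(pwm, candidate):
    """Sum the per-position log-odds for one candidate site."""
    return sum(column[base] for column, base in zip(pwm, candidate))


def scan(pwm, seq, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        s = score_site(pwm, seq[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits
```

Because each position contributes independently, this corresponds to the simple statistical model in the text with an independent distribution per position; dependencies between positions would need a richer model.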
Ab initio methods in eukaryotes use these two statistical properties of coding region
for identification of protein-coding genes. In this context, it is essential to have an idea
of eukaryotic promoters.
Eukaryotic promoter: Eukaryotic promoters are extremely diverse and are difficult to
characterize. They typically lie upstream of the gene and can have regulatory elements
several kilobases away from the transcriptional start site. In eukaryotes, the
transcriptional complex can cause the DNA to bend back on itself, which allows for the
placement of regulatory sequences far from the actual site of transcription.
Many eukaryotic promoters contain a TATA box (sequence TATAAA), which in turn
binds a TATA-binding protein that assists in the formation of the RNA polymerase
transcriptional complex. The TATA box typically lies very close to the transcriptional
start site (often within 50 bases).
Some of the methods used for identification of genes in eukaryotes are listed in
Table 7.3.
Table 7.3 List of software based on ab initio method for eukaryotes
Software | Algorithm | Organisms | Web interfaces
FGENESH | HMM | Human, mouse, Drosophila, rice | http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=gfind
GeneID | DP, MM | Vertebrates, plants | http://www1.imim.es/geneid.html
GeneMark.hmm | GHMM | Human, mouse, Drosophila | http://www.opal.biology.gatech.edu/GeneMark/eukhmm.cgi
GeneParser | DP | Vertebrates | http://www.beagle.colorado.edu/cesnyder/GeneParser.html
GeneWise | DP | Human | http://www.ebi.ac.uk/Wise2/index.html
GENIE | GHMM | Drosophila, human | http://www.fruitfly.org/seq_tools/genie.html
GenLang | Grammar rule | Vertebrates, Drosophila, dicotyledonous plants | http://www.cbil.upenn.edu/~sdong/genlang_home.html
GENSCAN | GHMM | Vertebrates, Arabidopsis, maize | http://www.genes.mit.edu/GENSCAN.html
GENVIEW2 | DP | Human, mouse, diptera | http://www.125.itba.mi.cnr.it/webgene/wwwgene.html
GLIMMERM | DP, IMM | Small eukaryotes, Arabidopsis, rice | http://www.tigr.org/tdb/glimmerm/glmr_form.html
GRAIL | DP, NN | Human, mouse, Arabidopsis, rice | http://www.compbio.ornl.gov/Grail-1.3
HMMgene | CHMM | Vertebrates, C. elegans | http://www.cbs.dtu.dk/services/HMMgene
MORGAN | DP | Vertebrates | http://www.cs.jhu.edu/labs/compbio/morgan.html
MZEF | Quadratic discriminant analysis | Human, mouse, Arabidopsis, fission yeast | http://www.argon.cshl.org/genefinder/
SLAM | DP | Human, mouse | http://www.baboon.math.berkeley.edu/~syntenic/slam.html
TWINSCAN | GHMM | Mouse, human | http://www.genes.cs.wustl.edu/
VEIL | DP, HMM | Vertebrates | http://www.tigr.org/~salzberg/veil.html
Xpound | HMM | Human | http://www.bioweb.pasteur.fr/seqanal/interfaces/xpound-simple.html
Comparative Method
The given DNA string is compared with a similar DNA string from a different species
at the appropriate evolutionary distance and genes are simultaneously predicted in
both sequences based on the assumption that exons will be well conserved, whereas
introns will not. Examples of software adopting this comparative method are
conserved exon method (CEM) and Twinscan.
COMBINATION OF TWO METHODS
Gene prediction methods based on the extrinsic approach are able to detect only a limited
number of genes (low sensitivity), due to the lack of known mRNAs in the database,
whereas ab initio gene prediction tools rely on intrinsic gene measures, including
coding potentials and splice signals. Even this method is not as efficient as it sounds
because some exceptional genes lack the intrinsic gene measures detected by the so-called
sensors (signal or content). For example, some housekeeping genes as well as some
oncogenes and growth factor genes have no TATA box, which makes it difficult
to identify them with signal sensors in the ab initio method. Therefore, different types
of gene prediction based on different methods can be combined to devise a more
efficient gene prediction tool. Examples of software combining two or more different
types of gene prediction software are given in Table 7.4. Non-protein-coding gene
(tRNA, rRNA, and microRNA) prediction tools are provided in Table 7.5.
Table 7.4 Example of two combined software for protein-coding gene prediction
Software/website | Combined tools | Websites | Description
DIGIT (… identification tools), http://www.digit.gsc.riken.go.jp | FGENESH | http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=gfind | HMM-based method for human, mouse, Drosophila, rice
 | GENSCAN | http://www.genes.mit.edu/GENSCAN.html | GHMM-based method for vertebrates, Arabidopsis, maize
 | HMMgene | http://www.cbs.dtu.dk/services/HMMgene | GHMM-based method for vertebrates, C. elegans
EuGene, http://www.inra.fr/mia/T/EuGene/ | NetStart | http://www.cbs.dtu.dk/services/NetStart/ | Predictions of translation start in vertebrate and Arabidopsis thaliana nucleotide sequences
 | NetGene2 | http://www.cbs.dtu.dk/services/NetGene2/ | Predictions of splice sites in human, C. elegans, and Arabidopsis thaliana
 | SplicePredictor | http://www.deepc2.psi.iastate.edu/cgi-bin/sp.cgi | A method to identify potential splice sites in (plant) pre-mRNA by sequence inspection using Bayesian statistical models
Table 7.5 Non-protein-coding gene prediction tools
Tool | Web interfaces
tRNA
tRNAscan-SE | http://www.lowelab.ucsc.edu/tRNAscan-SE/
FAStRNA | http://www.bioweb.pasteur.fr/seqanal/interfaces/fastrna.html
ARAGORN | https://www.pembioekol-bioinf2.mbioekol.lu.se/ARAGORN1.1/HTML/aragorn1.2.html
SPLITS | http://www.splits.iab.keio.ac.jp/
rRNA
SSU rRNA | http://www.soe.ucsc.edu/research/compbio/ssurrna.html
CARNAC (Computer Alignment of RNA by Cofolding) | http://www2.lifl.fr/~perrique/rna/OLD_index.html
myRDP | https://www.rdp.cme.msu.edu/login/myrdp/overview.spr
RNAmmer | http://www.cbs.dtu.dk/services/RNAmmer/
MicroRNA
SIRNA | http://www.bioweb.pasteur.fr/seqanal/interfaces/sirna.html
MiRFinder | http://www.bioinformatics.org/mirfinder/
MiRscan | http://www.genes.mit.edu/mirscan/
ProMiR II | http://www.cbit.snu.ac.kr/~ProMiR2/
RNAmicro | http://www.bioinf.uni-leipzig.de/~jana/software/index.html
WHY IS GENE PREDICTION DIFFICULT?
Although we have practically all information about gene structure (in both prokaryotes
and eukaryotes), it is still difficult to predict genes accurately due to the following
reasons:
• DNA sequence signals have low information content (degenerate and highly
unspecific).
• It is difficult to discriminate real signals.
• Sequences contain sequencing errors.
Apart from these, we can specifically list the reasons separately for prokaryotes and
eukaryotes.
More specifically, although prokaryotes have high gene density and a simple gene
structure, they pose the following difficulties:
• Short genes are found in prokaryotes that carry little information.
• The presence of overlapping genes in prokaryotes makes the detection of genes
difficult.
The following features of eukaryotes make gene detection complicated:
• Low gene density and complex gene structure are the main complicating factors in gene
prediction in eukaryotes.
• The presence of alternative splicing in eukaryotic genes makes their
detection difficult.
• The presence of pseudogenes.
SUMMARY
‘Gene prediction’ research was conceived as a result of the need for complete automated
gene finding systems for long, non-annotated sequences now being produced at a very high
volume, and it is important to distinguish the two different goals in such gene finding
research. The first goal is to provide computational methods to aid in the annotation of
the large volume of genomic data that is produced by the genome-sequencing efforts. The
second goal is to provide a computational model to help elucidate the mechanisms involved
in transcription, splicing, polyadenylation, and other critical processes in the pathway
from genome to proteome. The other key issue that will influence future research in both of
the aforementioned computational gene finding paradigms is alternative splicing.
There are no currently available programs/algorithms that can handle alternative splicing
in an efficient manner. Intimately linked with this issue is that of gene regulation:
annotation is not complete until the abundant regulatory signals flanking genes or
appearing in introns or exons are properly identified. Further, the cellular conditions
that give rise to the differing expression levels for different transcripts need to be
worked out.
Finally, as a word of caution, it should be mentioned that the results produced by such
methods are predictions, and should be taken carefully. These are very useful for speeding
up gene discovery and knowledge mining thereof, but biological expertise remains necessary
in order to confirm the existence of a virtual protein and to find or prove its biological
function and its condition of expression in the organism. All these facts imply that future
gene finding research will greatly depend on the experimental data relating to the
differential expression, along with the other types of data that we have already discussed.
REVIEW QUESTIONS
1. What is the Human Genome Project? How is it significant to bioinformatics?
2. What is gene annotation? Explain the need for gene annotation.
3. What is an Open Reading Frame?
4. Describe the three methods of computational gene prediction.
5. Why is it difficult to annotate protein-coding genes in eukaryotes using ab initio
gene prediction methods?
6. Mention the differences between a content sensor and a signal sensor.
7. Mention a few software tools used for annotating protein-coding genes in prokaryotes
and eukaryotes.
8. How do prokaryotic promoters help in the detection of CDS?
8
Molecular Phylogeny
“Nothing in biology makes sense except in the light of evolution.”
THEODOSIUS DOBZHANSKY
INTRODUCTION
The word ‘phylogeny’ is derived from two Greek words, phylon and genesis. Phylon means
‘stem’ and genesis means ‘origin’. In other words, phylogeny gives an idea about the
evolution or origin of an organism (see Figure 8.1). Phylogeny is illustrated as a tree.
For a long time after people began trying to classify organisms in a systematic fashion,
they used a variety of definitions of relationship. One definition was that things that
look alike are more closely related to each other than things that look different. This
is perhaps logical, but it fails for things that resemble each other only superficially.
For example, African euphorbias and American cacti resemble each other, but are not
closely related at all. Conversely, things that appear to be quite different can turn out
to be related. Zimmermann, in 1931, defined relationship as the sharing of a recent
common ancestor. For example, apples are more closely related to magnolias than they are
to ginkgos because apples and magnolias share a more recent common ancestor with each
other than either does with ginkgos.
Figure 8.1 Phylogenetic tree of human beings and their ancestors
Apart from clarifying the evolutionary relationships among the different groups of
organisms, phylogeny also helps to:
• Understand the evolutionary history of organisms
• Map pathogen strain diversity for vaccines
• Assist in the epidemiology of infectious diseases and genetic defects
• Aid in predicting the functions of novel genes
• Carry out biodiversity studies
• Understand microbial ecologies
PHENOTYPIC PHYLOGENY AND MOLECULAR PHYLOGENY
There are two types of phylogeny methods, namely, phenotypic phylogeny and molecular
phylogeny. Phenotypic phylogeny is considered the traditional method of phylogeny, as it
is based upon phenotypic observations from the group of organisms. In due course of time,
scientists found that with this method it was difficult to classify micro-organisms
because phenotypic resemblance or dissimilarity may be superficial. All this paved the
way for the novel concept of molecular phylogeny. Linus Pauling was the first to make the
observation that genetic sequences could be used for phylogeny, and this method is known
as molecular phylogeny. In the molecular phylogeny approach, the relationships among
organisms or genes are studied by comparing homologous DNA or protein sequences. Thus,
molecular phylogeny can be defined as the study of relationships among organisms using
molecular markers such as DNA or protein sequences. Molecular phylogeny based on
nucleotide or amino acid sequence comparison has become a widespread tool for general
taxonomy and evolutionary analysis. It seems that molecular phylogeny is the only means
to establish a natural classification of micro-organisms, since their phenotypic traits
are not always consistent with the genealogy or family pedigree.
These two methods of phylogeny are related, since the genome strongly contributes to the
phenotype of the organisms. In general, organisms with more similar genes are more
closely related. However, phenotype-based phylogeny, or morphological phylogeny, has many
disadvantages compared with molecular phylogeny. A few of those are listed here:
• There may be similar phenotypes in distantly related organisms due to the process
called convergent evolution.
• Phenotypic features cannot be studied for many organisms, e.g., bacteria.
• It is difficult to compare phenotypic traits between distantly related organisms,
e.g., when comparing bacteria and mammals.
Molecular phylogeny methods are often free of such problems and make possible the study
of genes without a morphological expression. That is why this method is preferred for
classifying organisms in any evolutionary situation.
Mechanism of Molecular Phylogeny
The primary mechanism of evolution at the molecular level is nucleotide substitution
during the process of DNA replication. All other outward evidence of evolution (the
phenotype) is the result of changes in the DNA sequences within an organism. Mutation in
the germ line occurs through several inter-related mechanisms such as base substitution
and exon shuffling.
1. Base substitutions: The most common genetic change is simply a change in the
nucleotide present in the parental genome to a different one in the progeny genome.
Sequence change also occurs through several other mechanisms such as transposition,
insertion, and deletion. These are defined as follows:
a. Transposition: Transposition is the movement of an entire gene from one location on
a DNA molecule to another.
b. Insertions: Insertions are extra nucleotides not present in the parental template
DNA. Insertions range from a single base to kilobases in length. For example, insertion
is what occurs on the receiving end of transposition.
c. Deletions: A deletion is the loss of nucleotide(s) in the progeny DNA, e.g., at the
location from which a gene or gene fragment was removed to be inserted into another
location, or its complete elimination from the genome.
Instances of insertions and deletions can be easily obtained by using the Pileup program
of GCG (Genetics Computer Group) to compare multiple related sequences. All the gaps
represent either an insertion or a deletion. Dotplot also provides similar information
for pairs of sequences. Dotplot provides a graphical picture of the differences and shows
insertions, deletions, and replications particularly well.
2. Exon shuffling: Exon shuffling is the process by which either exons are duplicated in
the same DNA molecule, or structural or functional domains are exchanged between the
genes encoding proteins in multiple exons.
Phylogenetic Marker: Choosing the Right Molecule for the
Problem at Hand
Molecules are like radioisotopes in that they change at different rates. The genomes of
RNA viruses such as HIV change so quickly that very soon every infected person would
carry an identifiably different strain. Mitochondrial DNA, which is haploid, has a
relatively fast substitution rate. It evolves rapidly enough to be useful for the
comparison of lineages that diverged recently, but it can also be used to establish
relationships among groups that are several million years old.
To obtain good molecular information on events that occurred in deep time, highly
conserved genes, i.e., genes that change very slowly, are needed, such as the DNA that
codes for the small subunits of ribosomal RNA. Such genes contain useful information
about events that occurred about 500-1,500 million years ago.
Chromosomal DNA, being the most sensitive to evolutionary change, has traditionally been
used for the analysis of phylogenetic relationships among the prokaryotes and archaea. In
eukaryotes, chromosomal DNA is subjected to various gene repair mechanisms and also to
crossing-over and recombination during sexual reproduction. It has become apparent that
in clearly different lineages of organisms, almost any sufficiently large fragment of the
genome will provide the same phylogenetic tree as any other. With more closely related
organisms, this is not always the case. To solve these conundrums, relationships among
the mammalian species have also been studied by the analysis of extra-chromosomal DNA,
such as mitochondrial DNA, which is not subject to some of these processes because it is
maternally inherited.
Apart from these, the sequences most widely used as phylogenetic markers are listed here
along with their respective features and utilities:
• DNA: very sensitive, non-uniform mutation rates
• …: useful for more remote homologies
• Protein sequences: useful for the most remote homologies and deep phylogenies, more
uniform mutation rates, more character states
16S ribosomal RNA sequences are widely used as phylogenetic markers because these
sequences exist in all organisms, can be compared across kingdoms, and are long sequences
that have been widely sequenced. The sequencing of these sequences is easy. Hence, they
are suitable for broad and very deep phylogeny studies.
REPRESENTATION OF PHYLOGENY
The most convenient way of visually presenting the evolutionary relationships among a
group of organisms is through illustrations called phylogenetic trees. A tree is a
mathematical structure which is used to model the actual evolutionary history of a group
of sequences or organisms. A phylogenetic tree is composed of nodes, each representing a
taxonomic unit (species, populations, individuals), and branches, which define the
relationship between the taxonomic units in terms of descent and ancestry. Only one
branch can connect any two adjacent nodes. The branching pattern of the tree is called
the topology, and the branch length usually represents the number of changes that have
occurred in the branch. Other terms often used with phylogenetic trees are the following:
• Root: The common ancestor of all taxa.
• Operational Taxonomic Unit (OTU): An OTU is any group of organisms, populations, or
sequences considered to be sufficiently distinct from the others to be treated as a
separate unit. OTUs are also termed terminal (external) nodes or leaves (see Figure 8.2).
• Distance scale: A scale that represents the number of differences between the
organisms or sequences. For example, 0.1 means 10% differences between two sequences.
• Internal branch: A branch located between two internal nodes.
• External branch: A branch located between a node and a leaf.
• Monophyletic: Refers to two or more DNA sequences that are derived from a single
common ancestral DNA sequence.
• Clade: A group of monophyletic DNA sequences that comprises all the sequences in the
analysis that are descended from a particular common ancestral sequence.
• Horizontal branch length: This is proportional to the evolutionary distance between
the sequences and their ancestors (unit = substitutions/site).
The different parts of a phylogenetic tree are shown in Figure 8.2.
Figure 8.2 Different parts of a phylogenetic tree
Different types of trees can be used to depict the different aspects of evolutionary
history. The most basic tree is the cladogram, which simply shows the relative recentness
of common ancestry. It is a branching diagram depicting the hierarchy of taxa defined by
cladistic methods. Phylograms (additive trees) depict the amount of evolutionary change
that has occurred along the different branches. A phylogram can otherwise be explained as
a phylogenetic tree that indicates the relationships between the taxa and also conveys a
sense of time or rate of evolution. Dendrograms (ultrametric trees) depict the times of
divergence. They are branching diagrams in the form of a tree used to depict the degrees
of relationship or resemblance (see Figure 8.3).
Figure 8.3 Three different types of representation of a phylogenetic tree: (a)
Cladogram; (b) Phylogram or Additive tree; (c) Dendrogram or Ultrametric tree
Cladograms and phylograms can be either unrooted or rooted in molecular systematics.
• Most phylogenetic methods produce unrooted trees (Figure 8.4 (a)). This is because
they detect the differences between the sequences, but have no means to orient the
residue changes with respect to time. An unrooted tree shows only the evolutionary
relationships between the organisms in the tree, and does not infer the placement of a
common ancestor in the structure or the evolutionary path that led to the current
relationships. The direction of the evolutionary process is not given.
• In rooted trees (Figure 8.4 (b)), there is a particular node, called the root,
representing a common ancestor, from which a unique path leads to any other node. A
rooted tree infers the existence of an actual common ancestor and defines the
evolutionary paths leading to the development of each organism. It provides an indication
of the direction of the evolutionary process, defining the ancestral and the derived
characters or species.
Figure 8.4 Rooted and unrooted trees: (a) Unrooted tree; (b) Rooted tree
MOLECULAR CLOCKS
The molecular clock rests on the assumption that mutations occur at some regular, more or
less constant rate. This hypothesis postulates that for any given macromolecule (a
protein or DNA sequence), the rate of evolution (measured as the mean number of amino
acid or nucleotide sequence changes per site per year) is approximately constant over
time in all lineages. Thus, if a certain amount of time has passed, a certain number of
mutations can be expected to have occurred. This hypothesis has stimulated much interest
in the use of macromolecules in evolutionary studies for two reasons:
1. Sequences can be used as molecular markers to date evolutionary events.
2. The degree of rate change among sequences and lineages can provide insights into the
mechanisms of molecular evolution. For example, a large increase in the rate of evolution
of a protein in a particular lineage may indicate adaptive evolution.
Some of the phylogenetic tree construction methods, e.g., UPGMA (Unweighted Pair Group
Method with Arithmetic mean), use this notion in their construction of the tree. Whether
or not the assumption is true can only be answered indirectly, and it is probably not
true except over very long periods of time (hundreds of millions of years). The fossil
record indicates that there have been periods in the earth's history when mass
extinctions of 50% or more of known organisms occurred, and that diversification was
unusually rapid in the periods immediately succeeding the mass extinctions. In
particular, extremely high rates of species diversification are known to have occurred
in the early Cambrian (just after 590 million years ago), in the early Triassic (just
after 248 million years ago), and in the early Cenozoic (just after 65 million years
ago). In addition to these issues, where DNA sequences readily mutate at a fair rate
without changing the resulting amino acid sequence, or are in non-coding regions,
multiple mutations at a single site may have occurred. Comparing the predecessor sequence
with the current sequence thus sometimes gives an inaccurate value for the mutation rate:
some of the prior mutations have become invisible due to subsequent mutations at the same
site. Problems notwithstanding, the measurement of cumulative mutations (genetic
distance) may provide an estimate of the time period required for genome evolution from
one organism to the other, or the time period since the two organisms had a common
ancestor.
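The relationship between observed differences, hidden multiple substitutions, and elapsed time can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the two sequences and the substitution rate are invented, and the correction used is the standard Jukes-Cantor distance formula, d = -(3/4) ln(1 - 4p/3), which is one of several possible corrections.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor_distance(p):
    """Correct an observed proportion of differences p for multiple hits.

    Valid only for p < 0.75 under the Jukes-Cantor assumptions.
    """
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def divergence_time(d, rate_per_site_per_year):
    """Time since the common ancestor, assuming a strict molecular clock.

    The distance d accumulates along both lineages, hence the factor 2.
    """
    return d / (2.0 * rate_per_site_per_year)

# Hypothetical aligned sequences (two differences over 20 sites).
seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "ACGTACGAACGTACTTACGT"

p = p_distance(seq1, seq2)
d = jukes_cantor_distance(p)
t = divergence_time(d, 1e-9)  # 1e-9 substitutions/site/year is an assumed rate
print(p, d, t)
```

The corrected distance is always slightly larger than the observed proportion of differences, which reflects the prior mutations that have become invisible.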
Basic Steps of Phylogenetic Tree Construction
a. Choice of Data: In phylogenies, different genes, combinations of genes, or DNA regions
may be used to infer phylogenetic trees while addressing groups of organisms. These are
called phylogenetic markers. Depending on whether the organisms are expected to be
closely or distantly related, regions ranging from extremely variable (e.g., ITS2 or
introns) to conserved (e.g., ribosomal LSU rDNA, protein-coding genes) can be used. If
the chosen DNA region proves to be too conserved or too variable, switching to a
different DNA region may increase the resolution.
b. Sources of Sequences: The sequence of the desired phylogenetic marker is the input for
the various phylogenetic packages. Phylogenetic markers can be obtained from two sources:
(i) by sequencing in the lab and (ii) from the available databases.
c. Alignment: Since descendants inherit traits from their ancestors through genes, the
history of descent is recorded in the changes within the DNA sequences. The molecular
data on sequences in the genes are a simple form of character data: the characters are
the positions in the sequence, and the character states are the nucleotides at those
positions. This sounds simple but assumes that the positions compared are homologous.
This idea is captured in the concept of multiple alignment.
Alignments may be done automatically using Clustal X/W, but if the sequences are not
highly conserved and/or contain insertions and deletions, alignment errors are quite
likely to occur. Aligned sequences offer an opportunity for a final proofreading step by
searching for deviating nucleotides in the conserved regions and cross-checking them
with the assembled sequence (basic steps are given in Figure 8.5).
Figure 8.5 Basic steps in molecular phylogeny: choice of molecular marker(s) and taxon
sampling; amplification/sequencing; alignment; choice of evolutionary model;
phylogenetic analyses; most likely tree(s) and topology testing
d. Choice of Evolutionary Model: Evolutionary models for DNA sequences have become quite
elaborate over the years. They include several parameters such as base frequencies,
substitution rate matrix, gamma distribution, proportion of invariable sites, and
covarion/covariotide evolution. In simple evolutionary models, base frequencies are
assumed to be equal, i.e., the proportion is set to 25% for each type of nucleotide. But
base frequencies may also be estimated from the data set, in which case they may differ
among data sets and for each nucleotide. In simple evolutionary models, the substitution
rate is assumed to be equal for each type of point mutation in both directions. The
number of substitution types in these models is set to one. In evolutionary models with
two substitution types, transitions and transversions are usually assigned different
substitution rates. The most complex reversible substitution rate matrices consist of six
different substitution rates, one for each type of point mutation (A↔C, A↔G, A↔T, C↔G,
C↔T, G↔T; i.e., the general time reversible model, GTR). Substitution rate matrices for
non-reversible models consist of 12 different substitution rates, but are not
implemented in most standard molecular phylogeny programs.
e. Choice of methods to construct phylogenetic trees: Several methods for constructing
phylogenetic trees are known. They include Maximum Parsimony, Maximum Likelihood, and
Distance Methods, and these will be elaborated in this chapter.
f. Evaluation of the obtained phylogenetic tree: This is the last step. The tree which is
generated by the methods mentioned earlier needs to be tested or evaluated statistically.
The most well-known methods are the Bootstrap and Jackknife methods. These methods will
also be discussed in this chapter.
METHODS OF PHYLOGENY
Two extensive groups of analysis exist to examine the phylogenetic relationships:
Cladistic and Phenetic methods.
• The cladistic method assumes that the members of a group share a common evolutionary
history and are more closely related to members of the same group than to any other
organisms. The shared derived characteristics are called synapomorphies.
• Phenetic methods, or numerical taxonomy, use overall similarity for the ranking of
species. They can use any number or type of characters, but the data has to be converted
into a numerical value. The organisms are compared to each other for all the characters
and then the similarities are calculated. After this, the organisms are clustered based
on their similarities. These clusters are called phenograms. They do not necessarily
reflect the evolutionary relatedness.
There are three main classes of phylogenetic methods for constructing phylogenies from
the sequence data:
1. Maximum Parsimony (cladistic method)
2. Maximum Likelihood (cladistic method)
3. Distance Methods (UPGMA, NJ, Fitch and Margoliash Method, and Minimum Evolution)
(phenetic method)
The first two methods are directly based on the sequences and the third is indirectly
based on the sequences.
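To make concrete what "indirectly based on the sequences" means, the following Python sketch builds a tree from a pairwise distance matrix alone, using UPGMA-style average linkage. It is a minimal illustration, not the implementation used by any particular package: the function name `upgma`, the nested-tuple output, and the distance values are all assumptions made for this example, and branch lengths are not computed.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by size-weighted average linkage (UPGMA).

    names: list of taxon labels.
    dist:  dict mapping a pair (a, b) of labels to their distance.
    Returns the tree topology as nested tuples.
    """
    trees = {n: n for n in names}   # cluster label -> subtree
    sizes = {n: 1 for n in names}   # cluster label -> number of taxa
    d = {frozenset(k): v for k, v in dist.items()}
    while len(trees) > 1:
        # Join the two clusters separated by the smallest distance.
        a, b = min(combinations(sorted(trees), 2),
                   key=lambda pair: d[frozenset(pair)])
        new = f"({a},{b})"
        # Distance from the new cluster to every other cluster is the
        # size-weighted mean of the distances from its two members.
        for c in trees:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        trees[new] = (trees.pop(a), trees.pop(b))
        sizes[new] = sizes.pop(a) + sizes.pop(b)
    return next(iter(trees.values()))

# Hypothetical distances between four taxa.
example = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
tree = upgma(["A", "B", "C", "D"], example)
print(tree)
```

Because the sequences enter only through the distances, any information that the distance measure discards is unavailable to the method; this is the sense in which it is indirect.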
Another way of classifying the tree-building methods is by the way the trees are
constructed. Cluster methods follow a set of steps (an algorithm) and arrive at a tree.
For example, if we have five sequences we might start with three of them (remember that
there is only one possible unrooted tree for three sequences) and decide where to place
the fourth sequence. Given the resulting tree for four sequences, we then decide where to
add the fifth and last sequence to our tree.
The tree-building methods in the second class use optimality criteria (see Figure 8.6) to
make a selection from among the set of all possible trees. The criterion is used to
assign each tree a ‘score’ or rank which is a function of the relationship between the
tree and the data (examples include maximum parsimony and maximum likelihood).
Figure 8.6 Cluster method and optimality criterion method of phylogeny, arranged by type
of data (distances or nucleotide sites)
How to Choose a Phylogenetic Method
After choosing an appropriate phylogenetic marker (a set of related sequences), multiple
sequence alignments are obtained and analysed. Based on the similarity among the group of
aligned sequences, one appropriate method is chosen for tree construction. The steps are
illustrated in Figure 8.7.
Figure 8.7 Steps to choose an appropriate molecular phylogeny method for the desired
group of sequences based upon their sequence homology: choose a set of related sequences
(DNA/protein); obtain a multiple alignment; strong similarity leads to maximum parsimony
and very weak similarity to maximum likelihood; check the validity of the results
How Many Trees/Topologies are There?
The number of rooted trees ($N_R$) for $n$ OTUs (Operational Taxonomic Units) is given
by:
$$N_R = \frac{(2n - 3)!}{2^{\,n-2}\,(n - 2)!}$$
The number of unrooted trees ($N_U$) for $n$ OTUs is given by:
$$N_U = \frac{(2n - 5)!}{2^{\,n-3}\,(n - 3)!}$$
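The two formulas can be checked numerically with a short Python sketch. The function names are chosen here for illustration only; the arithmetic follows the expressions for the numbers of rooted and unrooted trees directly.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating trees for n OTUs (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n OTUs (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (3, 4, 5, 6, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

The number of unrooted trees for n OTUs equals the number of rooted trees for n - 1 OTUs, and both grow so quickly that exhaustive enumeration is feasible only for small n.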
The possible number of rooted and unrooted trees for different numbers of OTUs is given
in Table 8.1.
Table 8.1 Possible number of rooted and unrooted trees for different numbers of OTUs
OTUs   Rooted trees     Unrooted trees
2      1                1
3      3                1
4      15               3
5      105              15
6      945              105
7      10395            945
8      135135           10395
9      2027025          135135
10     34459425         2027025
11     >654 × 10^6      >34 × 10^6
15     >213 × 10^12     >7 × 10^12
20     >8 × 10^21       >2 × 10^20
50     >2 × 10^76       >2 × 10^74
Evolutionary Models
During an infinitesimal time, $\Delta t$, the same nucleotide cannot undergo two
substitutions. Hence, we can estimate $P(x \mid y, \Delta t)$, the probability of going
from $x$ to $y$ in time $\Delta t$, for $x, y \in \{A, C, G, T\}$. The substitution model
is expressed as follows:
$$S(\Delta t) = \begin{pmatrix} P(A \mid A, \Delta t) & \cdots & P(A \mid T, \Delta t) \\ \vdots & \ddots & \vdots \\ P(T \mid A, \Delta t) & \cdots & P(T \mid T, \Delta t) \end{pmatrix}$$
One can reasonably assume that evolution is a stationary Markov process. Hence, the
matrix is multiplicative, and the substitution matrix at time $t + \Delta t$ is
$$S(t + \Delta t) = S(t)\,S(\Delta t)$$
That implies
$$P(x \mid y, t + \Delta t) = \sum_{z} P(x \mid z, t)\,P(z \mid y, \Delta t)$$
The simplest among the evolutionary models is the Jukes-Cantor model of evolution.
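Jukes-Cantor one-parameter model
Here the basic assumption is that the rate of evolution is constant. The substitution
rate of a nucleotide by a different nucleotide is $\alpha$. The substitution model is
shown in Figure 8.8. The substitution probability of, say, nucleotide A by each of G, C,
or T is $\alpha$, i.e., $3\alpha$ in total. Since the total probability is 1, the
probability of A remaining A is $1 - 3\alpha$. The same holds for the other nucleotides.
Figure 8.8 Substitution model
Hence, for a short time $\varepsilon$,
$$S(\varepsilon) = \begin{pmatrix} 1 - 3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\ \alpha\varepsilon & 1 - 3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\ \alpha\varepsilon & \alpha\varepsilon & 1 - 3\alpha\varepsilon & \alpha\varepsilon \\ \alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & 1 - 3\alpha\varepsilon \end{pmatrix}$$
For a longer time $t$ this reduces to
$$S(t) = \begin{pmatrix} r(t) & s(t) & s(t) & s(t) \\ s(t) & r(t) & s(t) & s(t) \\ s(t) & s(t) & r(t) & s(t) \\ s(t) & s(t) & s(t) & r(t) \end{pmatrix}$$
where
$$r(t) = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right), \qquad s(t) = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right)$$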
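As a numerical check on the Jukes-Cantor expressions for r(t) and s(t), the Python sketch below builds S(t) and verifies two properties stated in the text: each row sums to 1, and the matrix is multiplicative, S(t1 + t2) = S(t1) S(t2). The function names and parameter values are illustrative assumptions.

```python
import math

def jc_matrix(alpha, t):
    """Jukes-Cantor substitution matrix S(t) in the order A, C, G, T."""
    r = 0.25 * (1.0 + 3.0 * math.exp(-4.0 * alpha * t))
    s = 0.25 * (1.0 - math.exp(-4.0 * alpha * t))
    return [[r if i == j else s for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Product of two 4 x 4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

alpha = 0.01           # assumed substitution rate per unit time
s1 = jc_matrix(alpha, 5.0)
s2 = jc_matrix(alpha, 7.0)
s12 = jc_matrix(alpha, 12.0)
product = matmul(s1, s2)

print([round(sum(row), 12) for row in s12])   # each row sums to 1
print(max(abs(product[i][j] - s12[i][j])
          for i in range(4) for j in range(4)))  # close to 0
```

As t grows, every entry approaches 1/4, so very old divergences carry almost no information about the ancestral nucleotide.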
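Kimura two-parameter model
Another evolutionary model is the Kimura model. The Jukes-Cantor model does not take into
account that the transition rates (between purines, A↔G, and between pyrimidines, C↔T)
are different from the transversion rates (A↔C, A↔T, C↔G, and G↔T). The Kimura model
takes the rate of transition to be $\alpha$ and the rate of transversion to be $\beta$.
Here the substitution matrix, with rows and columns in the order A, C, G, T, is as
follows:
$$\text{Subst. Prob.} = \begin{pmatrix} 1 - 2\beta - \alpha & \beta & \alpha & \beta \\ \beta & 1 - 2\beta - \alpha & \beta & \alpha \\ \alpha & \beta & 1 - 2\beta - \alpha & \beta \\ \beta & \alpha & \beta & 1 - 2\beta - \alpha \end{pmatrix}$$
Thus,
$$S(t) = \begin{pmatrix} r(t) & s(t) & u(t) & s(t) \\ s(t) & r(t) & s(t) & u(t) \\ u(t) & s(t) & r(t) & s(t) \\ s(t) & u(t) & s(t) & r(t) \end{pmatrix}$$
where
$$s(t) = \tfrac{1}{4}\left(1 - e^{-4\beta t}\right)$$
$$u(t) = \tfrac{1}{4}\left(1 + e^{-4\beta t} - 2e^{-2(\alpha + \beta)t}\right)$$
$$r(t) = 1 - 2s(t) - u(t)$$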
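The Kimura expressions can be checked in the same way. The Python sketch below builds S(t) from s(t), u(t), and r(t) and confirms that each row sums to 1 and that the model collapses to the one-parameter case when the two rates are equal. The function name and the parameter values are assumptions made for illustration.

```python
import math

def k2p_matrix(alpha, beta, t):
    """Kimura two-parameter matrix S(t) in the order A, C, G, T.

    alpha is the transition rate (A<->G, C<->T) and beta the
    transversion rate, as in the text.
    """
    s = 0.25 * (1.0 - math.exp(-4.0 * beta * t))
    u = 0.25 * (1.0 + math.exp(-4.0 * beta * t)
                - 2.0 * math.exp(-2.0 * (alpha + beta) * t))
    r = 1.0 - 2.0 * s - u
    return [
        [r, s, u, s],   # from A: transition to G, transversions to C and T
        [s, r, s, u],   # from C: transition to T
        [u, s, r, s],   # from G: transition to A
        [s, u, s, r],   # from T: transition to C
    ]

m = k2p_matrix(alpha=0.02, beta=0.005, t=10.0)   # assumed rates
print([round(sum(row), 12) for row in m])

# With alpha == beta, transitions and transversions become equally likely.
equal = k2p_matrix(alpha=0.01, beta=0.01, t=10.0)
print(equal[0][1], equal[0][2])
```

With a transition rate higher than the transversion rate, u(t) exceeds s(t) at short times, which reproduces the commonly observed excess of transitions between closely related sequences.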
Maximum Parsimony Method
Maximum parsimony (MP) is a simple but popular technique used in cladistics to infer a
phylogenetic tree for a set of taxa (commonly a set of species or reproductively isolated
populations of a single species) on the basis of some observed data on the similarities
and differences among taxa.
In other words, the principle of MP searches for a tree that requires the smallest number
of evolutionary changes to explain the differences observed among OTUs.
Input Data: The input data used in a maximum parsimony analysis is in the form of
‘characters’ for a range of taxa. A character is a partitioning of the taxa into distinct
character states with respect to some feature. A character could be binary, e.g., for
the presence or absence of a feature (e.g., tail), or it could be multi-state, e.g., the
protein or nucleic acid residue at a particular site in the organism's genome.
Differences in the character state are explained by the evolutionary changes. Among the
three kinds of sites (invariant, informative, and singleton), only the informative sites
help to infer phylogeny (see Figure 8.9).
Figure 8.9 Alignment sites in parsimony
• Invariant sites are not used in parsimony (they yield no information on character state
changes).
• Informative sites (at least two different kinds of residues, each present at least two
times) are used by parsimony because they discriminate between topologies, i.e.,
different topologies require different numbers of changes between residues.
• Singleton sites cannot be used to discriminate between topologies (they require one
change for all topologies).
Parsimony methods operate by selecting trees that minimize the total tree length: the
number of evolutionary steps (transformations of one character state to another) required
to explain a given set of data.
In mathematical terms: from the set of possible trees, find all trees $\tau$ such that
$L(\tau)$ is minimal, where
$$L(\tau) = \sum_{k=1}^{B} \sum_{j=1}^{N} w_j \cdot \mathrm{diff}\left(x_{k'j}, x_{k''j}\right)$$
Here $L(\tau)$ is the length of the tree $\tau$, $B$ is the number of branches, $N$ is the
number of characters, $k'$ and $k''$ are the two nodes incident to each branch $k$,
$x_{k'j}$ and $x_{k''j}$ represent either elements of the input data matrix or optimal
character-state assignments made to internal nodes, and $\mathrm{diff}(y, z)$ is a
function specifying the cost of a transformation from state $y$ to state $z$ along any
branch. The coefficient $w_j$ assigns a weight to each character. Note also that
$\mathrm{diff}(y, z)$ need not be equal to $\mathrm{diff}(z, y)$.
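The distinction between invariant, informative, and other variable sites can be made concrete with a short Python sketch. The four aligned sequences are hypothetical and the function name is chosen for illustration; the rule applied is the one stated for Figure 8.9, namely that an informative site has at least two different residues, each present at least twice.

```python
from collections import Counter

def classify_site(column):
    """Classify one alignment column for parsimony analysis."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # Informative: at least two states that each occur at least twice.
    if sum(1 for c in counts.values() if c >= 2) >= 2:
        return "informative"
    # Variable, but every topology needs the same number of changes
    # (this covers the singleton sites of Figure 8.9).
    return "uninformative"

# Hypothetical alignment of four sequences, four sites each.
alignment = ["ACAG",
             "ACAG",
             "ATAA",
             "ATCA"]

site_classes = [classify_site(col) for col in zip(*alignment)]
print(site_classes)
```

Only the columns labelled informative would contribute to choosing between topologies; the others add the same length to every tree.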
The trees used in maximum parsimony analysis are, in a general way, unrooted trees (there is no indication of time in the tree, only the relations between taxa). All the taxa used in the analysis are placed at the leaves (called tips or terminal taxa) in the tree. Internal nodes are inserted into the tree to represent the inferred ancestral species. Each internal node has at least three edges into it. The transitions between character states are associated with the edges on the tree.
Given here is an example showing how to calculate the length of a particular tree topology for a specific site, i.e., the informative site of a sequence, using two weighting schemes: Fitch (a) and transversion parsimony (b).
The weighting schemes Fitch ('f') and transversion parsimony ('tp') specify the cost of a change from one base to another. A change to the same state has a cost of zero, whereas any change between different bases (e.g., G↔C, G↔A and the reverse) costs one according to scheme (a). The costs differ in scheme (b): four for a puRine↔pYrimidine change (a transversion), in either direction, and one for a puRine↔puRine or pYrimidine↔pYrimidine change (a transition).
Example: From the input data shown in Figure 8.10 we can calculate the length of a particular tree topology ((W, Y), (X, Z)) for a specific site of a sequence, i.e., the site marked by an arrow in the figure. For this specific site the weight matrices are as shown in Figure 8.11.
SeqW: AGA GAT
SeqX: ACAC GCT
SeqY: GTAAGGT
SeqZ: GCAC GAC
(the chosen informative site is marked by an arrow in the original figure)
Figure 8.10 A set of test sequences to choose informative sites needed for maximum parsimony
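The site categories described in Figure 8.9 can be checked mechanically. The sketch below is illustrative only: the four short sequences are a hypothetical toy alignment (they are not the sequences of Figure 8.10), and the function names are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def classify_site(column: Sequence[str]) -> str:
    """Classify one alignment column for parsimony purposes."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # Informative: at least two residue kinds, each present at least twice.
    kinds_seen_twice = sum(1 for c in counts.values() if c >= 2)
    if kinds_seen_twice >= 2:
        return "informative"
    # Variable but not informative under equal costs (includes singleton sites).
    return "variable-uninformative"


def classify_alignment(seqs: Sequence[str]) -> List[str]:
    """Classify every column of an alignment of equal-length sequences."""
    return [classify_site(col) for col in zip(*seqs)]


# Hypothetical toy alignment for taxa W, X, Y, Z.
toy = ["AAAA",   # W
       "AAAC",   # X
       "AGAG",   # Y
       "AGCC"]   # Z
print(classify_alignment(toy))
# ['invariant', 'informative', 'variable-uninformative', 'variable-uninformative']
```

The last toy column (A, C, G, C) has the same pattern as the site analysed in the worked example: it is uninformative under equal costs.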
In Figure 8.12, we calculate the length (i.e., number of steps) of a particular tree topology ((W, Y), (X, Z)).
The deeply marked branches indicate substitution of bases.
Figure 8.11 The weight matrix for (a) Fitch and (b) Transversion Parsimony method (for the mentioned set of sequences)
Figure 8.12 (a) The generalized unrooted tree ((W, Y), (X, Z)) for the earlier mentioned example; (b) The possible trees generated by Fitch (f) and Transversion parsimony (tp)
The three boxed trees are the best according to the 'f' scheme, where the minimum is two steps, achieved in three ways, the internal nodes being 'A-C', 'C-C' and 'G-C'. According to the 'tp' scheme (Figure 8.13), the minimum length is five steps, achieved by two reconstructions (internal nodes 'A-C' and 'G-C'). Therefore, under unequal costs, the character becomes informative. The use of unequal costs may provide more information for phylogenetic reconstruction.

Exercise: The alternative trees ((W, X), (Y, Z)) and ((W, Z), (Y, X)) also have two steps (according to the 'f' scheme) and a minimum of eight steps according to the 'tp' scheme. We leave it as an exercise to the readers to work out the minimum length of the alternative trees in the same way.
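The step counts quoted in the example and the exercise can be checked with a brute-force search over the two internal-node states. This is a minimal sketch, assuming the chosen site has the states W = A, X = C, Y = G, Z = C (the leaf states implied by the step counts in the figures); the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
PURINES = {"A", "G"}


def fitch_cost(a: str, b: str) -> int:
    """Scheme 'f': every change costs one."""
    return 0 if a == b else 1


def tp_cost(a: str, b: str) -> int:
    """Scheme 'tp': transitions cost one, transversions cost four."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)
    return 1 if same_class else 4


def min_length(pair1: str, pair2: str, site: dict, cost) -> int:
    """Minimum length of the unrooted tree (pair1, pair2) over all internal states."""
    p, q = (site[t] for t in pair1)
    r, s = (site[t] for t in pair2)
    return min(
        cost(u, p) + cost(u, q) + cost(u, v) + cost(v, r) + cost(v, s)
        for u, v in product(BASES, repeat=2)
    )


site = {"W": "A", "X": "C", "Y": "G", "Z": "C"}  # assumed states at the chosen site
topologies = {"((W,Y),(X,Z))": ("WY", "XZ"),
              "((W,X),(Y,Z))": ("WX", "YZ"),
              "((W,Z),(Y,X))": ("WZ", "YX")}

for name, (left, right) in topologies.items():
    f = min_length(left, right, site, fitch_cost)
    tp = min_length(left, right, site, tp_cost)
    print(f"{name}: f = {f}, tp = {tp}")
```

Under these assumptions the search gives two steps for all three topologies with the 'f' scheme, and five, eight and eight steps with the 'tp' scheme, so only the unequal-cost scheme separates ((W, Y), (X, Z)) from the alternatives.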
Figure 8.13 The best trees according to Transversion Parsimony method

Advantages
• Based on a logically coherent and biologically plausible model of evolution.
• Free from assumptions used in distance estimations.
• Better than distance methods when the extent of sequence divergence is low (10%), the rate of substitution is constant, and the number of residues is large.
• Very useful for certain types of molecular data, e.g., insertions and deletions.
• Provides several ways to evaluate the support for the topologies produced, e.g., measures of homoplasy (independent derivation of a character state in two lineages).
• Ranks the list of trees based on length.
Disadvantages
• Gives incorrect topologies when backward substitutions are present (common with nucleotides) and when the number of sites is fairly small.
• Gives incorrect topologies when the rate of substitution varies substantially across the lineages.
• Long branch attraction: long branches (and, likewise, short branches) tend to group together on the reconstructed tree.
• Difficult to treat the results in a statistical framework.
Maximum Likelihood Method
Maximum Likelihood (ML) methods create all the possible trees containing the set of organisms considered, and then use statistics to evaluate the most likely tree. For a small number of organisms, this is possible. For a large number of organisms, the task cannot be accomplished as the number of generated trees is very large. Therefore, heuristics (methods which produce an answer in a computable length of time, but for which the answer may not be optimal) are used to select a subset of all the possible trees to create. In this method, the bases (nucleotides or amino acids) of all the sequences at each site are considered separately (as independent), and the log-likelihood of having these bases is computed for a given topology by using a particular probability model. This log-likelihood is added for all the sites, and the sum of the log-likelihood is maximized to estimate the branch length of the tree. This procedure is repeated for all the possible topologies, and the topology that shows the highest likelihood is chosen as the final tree. Likelihood methods for phylogenies were first introduced by Edwards and Cavalli-Sforza (1964) for gene frequency data. Neyman (1971) applied likelihood to molecular sequences and this work was extended by Kashyap and Subas (1974). Felsenstein (1973, 1981) brought the maximum likelihood framework to nucleotide-based phylogenetic inference.
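To see why exhaustive evaluation quickly becomes impossible, the number of distinct unrooted bifurcating trees for n taxa can be counted with the standard double-factorial formula (2n - 5)!!. The formula is not stated in the text; it is supplied here as a well-known result, and the function name is hypothetical.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees: (2n - 5)!! = 3 * 5 * ... * (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed for an unrooted tree")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

Four taxa give 3 trees (the three topologies used in the parsimony example), 10 taxa already give 2,027,025, and the count for 20 taxa has 21 digits.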
A few main features of ML methods are the following:
• Statistical (probabilistic) method for inferring the phylogenies:
1. Substitution model is chosen for the sequence data (alignment).
2. Likelihood of observing the sequence data given the substitution model is obtained for each topology evaluated (parameter fitting on branch lengths).
3. Topology that gives the highest likelihood is chosen as the best tree.
• Extremely slow method, so heuristic methods almost always have to be employed to search for the best tree.
• This method is very dependent on the model of substitution used.
• This method estimates the branch lengths, not the topology, so it may give the wrong topology.
The ML method of inference is available for both nucleic acid and protein data.
The following programs are freely available from the web:
• DNAML (only DNA data; in the PHYLIP package)
• FastDNAML (only DNA data; a faster algorithm applied to DNAML)
• ProtML (both DNA and protein data)
• Puzzle (both DNA and protein data). This program is much faster than ProtML.
Example: We take up a probabilistic model for nucleotide substitution (e.g., Jukes-Cantor). By this method we have to pick the tree that has the highest probability of generating the observed data.
Let D represent the data, M the Jukes-Cantor one-parameter model and T the tree.
Given D and M, we have to find T such that $\Pr(D \mid T, M)$ is maximized.
From the model we get the values of $p_{ij}(t)$, the probability of substitution of the ith nucleotide by the jth one in time t. Here two assumptions are made:
a. Different sites evolve independently.
b. Diverged sequences (or species) evolve independently after diverging.
Now if $D_i$ is the data for the ith site,
\[ \Pr(D \mid T, M) = \prod_i \Pr(D_i \mid T, M) \]
Calculation of $\Pr(D_i \mid T, M)$: The tree is given here.
$p_{xy}(t)$ is the probability of a change from x to y in time t.
\[ \Pr(i, j, k, l \mid T, M) = \sum_x \sum_y \sum_z \Pr(x)\, p_{xl}(t_1 + t_2 + t_3)\, p_{xy}(t_1)\, p_{yk}(t_2 + t_3)\, p_{yz}(t_2)\, p_{zi}(t_3)\, p_{zj}(t_3) \]
With the tree topology and branch lengths given, one can efficiently calculate $\Pr(D \mid T, M)$.
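The site likelihood can be evaluated directly by summing over the unobserved states x, y and z. This is a sketch, not the book's implementation, under several assumptions: x is the root with leaf l attached by a branch of length t1 + t2 + t3; y descends from x (branch t1) and carries leaf k (branch t2 + t3); z descends from y (branch t2) and carries leaves i and j (branches t3). Pr(x) is taken as 1/4, the Jukes-Cantor equilibrium frequency, and `alpha` is an assumed rate for each specific substitution.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x: str, y: str, t: float, alpha: float = 1.0) -> float:
    """Jukes-Cantor probability p_xy(t) of observing y after time t, starting from x."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def site_likelihood(i, j, k, l, t1, t2, t3, alpha: float = 1.0) -> float:
    """Pr(i, j, k, l | T, M): sum over internal states x (root), y and z."""
    total = 0.0
    for x, y, z in product(BASES, repeat=3):
        total += (0.25                                   # Pr(x), uniform under JC
                  * jc_prob(x, l, t1 + t2 + t3, alpha)
                  * jc_prob(x, y, t1, alpha)
                  * jc_prob(y, k, t2 + t3, alpha)
                  * jc_prob(y, z, t2, alpha)
                  * jc_prob(z, i, t3, alpha)
                  * jc_prob(z, j, t3, alpha))
    return total


def log_likelihood(seqs, t1, t2, t3, alpha: float = 1.0) -> float:
    """Sum of per-site log-likelihoods; seqs are the aligned sequences for i, j, k, l."""
    return sum(math.log(site_likelihood(*column, t1, t2, t3, alpha))
               for column in zip(*seqs))


# Hypothetical three-site alignment for leaves i, j, k, l.
print(log_likelihood(["ACG", "ACG", "ACT", "GCT"], 0.05, 0.05, 0.05))
```

Because sites are treated as independent, the per-site likelihoods multiply, which is why the log-likelihoods are summed over the columns.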
Advantages
• Estimates the branch lengths of the final tree.
• Methods are usually consistent.
• It is extended to allow differences between the rate of transition and transversion.
• Evaluates different tree topologies.
• Uses all the sequence information.
Disadvantages
• ML is very CPU intensive and thus extremely slow.
• Needs long computation time to construct a tree.
• The result depends on the model of evolution used.
Distance Methods
Distance methods are a major family of phylogenetic methods that try to fit a tree to a matrix of pairwise distances. The distance method creates a matrix of all distances (difference scores) between all the organisms for which the tree is to be constructed. Having calculated the matrix, the pair of organisms with the smallest distance score is connected with a root in between them. The average of the distances from each member of the pair to a third node is used for the next iteration of the distance matrix. The process is repeated until all the organisms have been placed in the tree. This always results in a rooted tree. The assumption that all the organisms mutate at the same rate is the basis of this method.
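The clustering procedure just described can be sketched as follows. The code uses the UPGMA averaging rule, in which the average is weighted by cluster size; that choice, the function name `upgma`, and the toy distance matrix are assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by repeatedly joining the closest pair.

    names: list of leaf labels; dist: {(a, b): distance} for every pair of leaves.
    Returns the rooted tree as nested tuples and the list of (cluster, height) merges.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    size = {name: 1 for name in names}          # current clusters and their sizes
    merges = []
    while len(size) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        new = (a, b)
        merges.append((new, d[frozenset((a, b))] / 2))   # node height = half the distance
        # Distance from the new cluster to every other cluster (size-weighted mean).
        for c in size:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
    (root,) = size
    return root, merges


# Hypothetical distances that satisfy the equal-rate assumption.
toy_dist = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
            ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
tree, merges = upgma(["A", "B", "C", "D"], toy_dist)
print(tree)                        # ('D', ('C', ('A', 'B')))
print([h for _, h in merges])      # [1.0, 3.0, 5.0]
```

On data that violate the equal-rate assumption the same procedure still returns a rooted tree, but the topology may be wrong.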
The strategies involved in distance methods are the following:
1. Calculates the total number of changes, scored according to type, between every pair of sequences in the alignment.
2. Represents the minimum number of changes required to convert one sequence to another.
3. Results written to the distance matrix are used to generate trees in several possible ways; branch lengths visually represent the amount of change.
4. Removing the ambiguous sections will influence the branch length estimates.
Few Facts About Distance Methods
a. The best way to think about distance matrix methods is to consider the distances as estimates of the branch length separating that pair of species.
b. Branch lengths are not simply a function of time; they reflect the expected amounts of evolution in different branches of the tree.
c. Two branches may reflect the same elapsed time (sister taxa), but they can have different expected amounts of evolution.
d. The main distance-based tree-building methods are cluster analysis, least squares, and minimum evolution. Cluster analysis methods include UPGMA and NJ. The resulting phylogeny estimates the distance for each pair as the sum of branch lengths on the path from one sequence to the other through the tree.
e. They rely on different assumptions, and their success or failure in retrieving the correct phylogenetic tree depends on how well any particular data set meets such assumptions.
Simplest Distance Measure
Consider every pair of sequences in the multiple alignment and count the number of differences.
Degree of divergence = Hamming distance (D)
D = n/N
where N = alignment length; n = number of sites with differences
Example
AGGCTTTTCA
AGCCTTCTCA
D = 2/10
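The count above is easy to reproduce in code. A minimal sketch, with a hypothetical function name, that assumes the two sequences are already aligned and of equal length:

```python
def hamming_fraction(seq_a: str, seq_b: str) -> float:
    """Return D = n / N: the fraction of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n_diff = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return n_diff / len(seq_a)


print(hamming_fraction("AGGCTTTTCA", "AGCCTTCTCA"))  # 0.2
```

The two example sequences differ at positions 3 and 7, giving D = 2/10 = 0.2.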
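Problem with distance measure:
As the distance between the two sequences increases, the probability that more than one mutation has occurred at any one site increases.

time point    0     1     2
scenario 1    A  →  A  →  A
scenario 2    A  →  A  →  G
scenario 3    A  →  G  →  C
scenario 4    A  →  G  →  A

Only the states at the first and last time points are observed, so scenario 3 is counted as a single change and scenario 4 as no change, although two substitutions occurred in each.
Therefore, methods have been developed to compensate for this.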
Corrected distances are calculated by the following methods (discussed earlier):
1. Jukes-Cantor
\[ d_{AB} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right) \]
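The Jukes-Cantor correction can be applied to an observed fraction of differing sites as sketched below. The function name is hypothetical, and the formula used is the standard Jukes-Cantor distance, d = -(3/4) ln(1 - (4/3) D), which is defined only for D < 0.75.

```python
import math


def jukes_cantor_distance(d_obs: float) -> float:
    """Correct an observed fraction of differing sites for multiple hits (JC model)."""
    if not 0.0 <= d_obs < 0.75:
        raise ValueError("the Jukes-Cantor correction needs 0 <= D < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * d_obs / 3.0)


# Observed D = 0.2 from the Hamming-distance example.
print(round(jukes_cantor_distance(0.2), 4))
```

The corrected value is always at least as large as the observed one, and the gap widens as D approaches 0.75, where the sequences are effectively saturated with substitutions.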
2. Kimura two-parameter model
Here the rate of transitions is different from the rate of transversions (see Figure
8.14),