Ghosh and Mallik

sequence analysis, gene prediction and phylogenetic analysis

Uploaded by

Priyanshu Panda
Copyright
© All Rights Reserved
7 Gene Prediction: Principles and Challenges

"The universe is full of magical things, patiently waiting for our wits to grow sharper."
Eden Phillpotts

INTRODUCTION

The vast amount of raw sequence data generated by advances in sequencing technology needs biological interpretation. This is known as annotation. Bioinformatics tools are essential for annotating the raw sequences obtained from various sequencing projects, in particular for finding genes and determining their functions. These genes can be of different types, from protein-coding genes to non-coding genes such as ribosomal RNA (rRNA), transfer RNA (tRNA), microRNA (miRNA), and many more. This chapter focuses on the annotation of protein-coding genes.

Although the Human Genome Project was completed in April 2003, the exact number of genes encoded by the human genome is still unknown. Hence, genome annotation is a necessity and a multi-step process in itself. The steps involved in genome annotation can be grouped into three categories: nucleotide-level (gene identification or prediction), protein-level (structure determination of proteins), and process-level annotation (mechanism of biochemical reactions). Among these three categories, nucleotide-level annotation is the most significant, as it primarily deals with gene annotation, a fundamental step in molecular biology.

In this respect, it is essential to mention that the accuracy with which genes can be predicted is still far from satisfactory. Although 80% of genes are accurately predicted at the nucleotide level in the human genome, only 45% are predicted at the exon level and only ~20% at the whole-gene level. This is why estimates of the number of genes in the human genome are still indefinite. At present, the annotation of most human genes is based on cDNA sequence data.
Systematic 'full-length' cDNA-sequencing programs, such as the Mammalian Gene Collection in the US and the program at RIKEN (The Institute of Physical and Chemical Research) in Japan, are generating vitally important experimental data towards defining complete gene sets for the human and mouse genomes. Hence, it is clear that further improvements to gene prediction are necessary. Even if all human genes were experimentally determined, it would still be imperative to understand how the structures of genes are organized and defined, and how they can be recognized. The ability to predict a gene structure is both an analytical and a practical challenge.

BIOLOGICAL OVERVIEW

Before proceeding further into the intricate details of gene annotation, we must know what a 'gene' actually means. Being so small, how does it create such a huge impact in biology?

The Gene

A gene is defined as a segment of DNA that contains the necessary information to produce a functional product, usually a protein. It is a long strand of DNA (RNA in some viruses) that contains a promoter, which controls the activity of the gene, and a coding sequence, which determines what the gene produces. While defining a gene or describing its structure, we need to be familiar with a few biological terms such as promoter, CDS, and ORF.

Promoter

A promoter is the regulatory region of DNA located upstream (towards the 5' region) of a gene. It provides a control point for regulated gene transcription. It contains specific DNA sequences that are recognized by proteins known as transcription factors. These factors bind to the promoter sequences, recruiting RNA polymerase, the enzyme that synthesizes RNA from the coding region of the gene. The promoter elements are of two types: core promoter and proximal promoter.

Core promoter: It is the minimal portion of the promoter required to initiate transcription properly.
It is approximately 34 nucleotides in length. The core promoter serves as a binding site for RNA polymerase and general transcription factors.

Proximal promoter: The proximal sequence upstream of the gene that tends to contain primary regulatory elements is known as the proximal promoter. It is approximately 250 nucleotides in length and serves as a binding site for specific transcription factors.

Open Reading Frame (ORF)

An ORF is a sequence of DNA that starts with a start codon, 'ATG' (though not always), and ends with any of the three termination codons (TAA, TAG, or TGA). Depending on the starting point, there are six possible ways (three on the forward strand and three on the complementary strand) of translating any nucleotide sequence into an amino acid sequence according to the genetic code. These are called reading frames. Gene finding in organisms, particularly prokaryotes, starts from searching for an ORF.

Coding Sequence (CDS)

The coding sequence is abbreviated as CDS. The CDS is the actual region of DNA that is translated to form proteins. While the ORF may contain introns as well, the CDS refers to those nucleotides (concatenated exons) that can be divided into codons which are actually translated into amino acids by the ribosomal translation machinery. In prokaryotes, the ORF and the CDS are the same.

WHAT IS GENE PREDICTION?

The characterization of genomic features using computational and experimental methods is called gene prediction or annotation. Given an uncharacterized DNA sequence, what would we want to find out about it in order to know its functional aspects? The answers to the following queries can be determined by gene finding/gene annotation methods:
• Which region codes for a protein?
• Which DNA strand is used to encode the gene?
• Where does the gene start and end?
• Where are the exon-intron boundaries in eukaryotes?
• Where (optionally) are the regulatory sequences for that gene?
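The ORF definition and the six reading frames described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production gene finder: the function names, the toy sequences, and the `min_codons` cutoff are hypothetical, only ATG is treated as a start codon, and each ORF runs from the first ATG in a frame to the first in-frame stop codon.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30) -> List[Tuple[str, int, str]]:
    """Scan all six reading frames for ATG...stop stretches.

    Returns (strand, frame, orf_sequence) tuples. min_codons counts every
    codon in the ORF, including the start and the stop codon; the default
    of 30 is an illustrative assumption, not a recommended value.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the currently open ATG, if any
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orf = s[start:i + 3]
                    if len(orf) // 3 >= min_codons:
                        hits.append((strand, frame, orf))
                    start = None  # look for the next ORF in this frame
    return hits


if __name__ == "__main__":
    # Toy sequence with a single short ORF on the forward strand.
    for strand, frame, orf in find_orfs("CCATGAAATTTTAGGG", min_codons=3):
        print(strand, frame, orf)
```

Scanning the reverse complement covers the three frames on the complementary strand, so the same inner loop serves all six frames.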
COMPUTATIONAL METHODS OF GENE PREDICTION

Computational gene prediction is relatively simple for prokaryotes, where all the genes are converted into the corresponding mRNA and then into proteins. Eukaryotic gene finding, on the other hand, is an altogether different task because eukaryotic genes are not continuous; they are interrupted by intervening non-coding sequences called 'introns' (Figure 7.1). Moreover, the organization of genetic information in eukaryotes and prokaryotes is different. However, genome annotation has not kept pace with genome sequencing. Experimental genome annotation is slow and time consuming.

[Figure 7.1 The structure of a eukaryotic gene: DNA is transcribed into mRNA, which is translated into the protein sequence]

There is an increasing demand to develop computational tools for gene prediction, which will help in answering the following questions:
• Given a DNA sequence, what part of it codes for a protein and what part of it is junk DNA?
• Classify the junk DNA as intron, untranslated region, transposons, dead genes, etc.
• Divide a newly sequenced genome into the genes (coding) and the non-coding regions.

In small prokaryotic genomes, gene finding is largely a matter of identifying long ORFs. However, even in this case, ambiguities arise if long ORFs overlap on opposite strands, and the true reading frame must be sorted out. As genomes get larger, gene finding becomes increasingly complicated. The main issue is the signal-to-noise ratio. In a prokaryotic genome such as Haemophilus influenzae, 85% of the genome is in coding regions. The corresponding number in yeast is not much lower, at 70%.
For these genomes, 'calling genes' is an exercise in running a computer program that carries out a six-frame translation and identifies all ORFs that are longer than a chosen threshold. However, even in these small genomes, gene finding has not been entirely effortless. The number of predicted yeast genes, for example, took several years to settle down, and there are still several short ORFs that have an uncertain status as bona fide genes.

The process of finding genes is further complicated by the presence of splicing and alternative splicing. In the human genome, a typical exon is 150 bp and a typical intron is several kilobases, and there is no clear delineation between the intergenic regions that separate adjacent genes and the intragenic regions that separate exons. Defining the precise start and stop position of a gene and the splicing pattern of its exons among all the non-coding sequence is like finding a very small and indistinct needle in a very large and distracting haystack.

Bioinformatics combines the expertise and knowledge of people from biological and computational areas, and sets a common stage for people from these backgrounds to work together to solve practical challenges like gene annotation. In short, computational gene finding is a process of the following:
• Identifying common phenomena in known genes.
• Building a computational framework/model that can accurately describe the common phenomena.
• Using the model to scan an uncharacterized sequence to identify regions that match the model, which become putative genes.
• Testing and validating the predictions.

Information within a Genomic Sequence

There are different types of functional sites in genomic DNA that researchers have sought to recognize. These are splice sites, start and stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosomal-binding sites, topoisomerase-II binding sites, topoisomerase-I cleavage sites, and various transcription factor-binding sites.
This information needs to be harnessed for gene identification.

METHODS OF GENE PREDICTION

There are three approaches for computational gene annotation, as shown in Figure 7.2.

[Figure 7.2 Three computational methods of gene prediction: extrinsic/homology method, intrinsic/ab initio method, and comparative method]

Extrinsic or Homology Method

This is a straightforward method for identifying protein-coding gene(s). It is based on sequence similarity between the query sequence and annotated genes present in databases. Given a database of sequences from other organisms, we search for the query sequence in this database and identify database sequences (known genes) that resemble it. If the identified sequences are genes, the query sequence is probably (putatively) a gene. This approach is able to identify biologically relevant genes. However, it cannot identify genes that code for proteins not present in the database. Basic local alignment search tool (BLAST) is a well-known search tool in this category (see Table 7.1).

It is known that only approximately half of the genes can be found by homology to other known genes or proteins (although this percentage is of course increasing as more genomes get sequenced). In order to determine the remaining 50% of the genes, the only solution is to use other prediction methods, such as the intrinsic method or a combination of intrinsic and extrinsic methods.

The following are the principles of the homology method:
1. Coding regions evolve slower than non-coding regions, i.e., local sequence similarity can be used as a gene finder.
2. Homologous sequences reflect a common evolutionary origin and possibly a common gene structure, i.e., gene structure can be solved by homology (mRNAs, ESTs, proteins, and protein domains).
3. Standard pair-wise comparison methods can be used (BLAST or Smith-Waterman).
4. Include 'gene syntax' information (start/stop codons, etc.).
5. Homology methods are also useful to confirm predictions inferred by other methods.

Intrinsic or Ab Initio Method

This method attempts to predict genes based on the statistical properties of the given DNA sequence. It extracts information regarding gene locations using statistical patterns inside and outside of the gene regions as well as typical patterns at their boundaries. The ab initio gene identification method searches for certain signals of protein-coding genes. This method is applicable for both prokaryotes and eukaryotes.

Table 7.1 Software based on the extrinsic method for eukaryotic gene identification
Software | Method | Organisms | Web interface
AAT | BLASTX/BLASTN | Primates, rodents | http://www.genome.cs.mtu.edu/aat.html
EbEST | BLASTN, EST clustering | Human | URL partly illegible in source (path ends in /EBEST/)
GeneSeqer | Spliced alignment, similarity with EST/protein | Arabidopsis, maize, generic plant | http://www.maizegdb.org/geneseqer.php
ORFgene2 | BLASTP, WAM for splice sites, identity score on frequencies of dipeptides | Human, mouse, Drosophila, Aspergillus, Arabidopsis, Caenorhabditis | http://www.125.itba.mi.cnr.it/~webgene/wwworfgene2.htm
PROCRUSTES | Protein/protein alignments scored with PAM120 | Vertebrates | http://www-hto.usc.edu/software/procrustes/
SIM4 | BLAST-based similarity with cDNA/DNA | All eukaryotes | host partly illegible in source (…univ-lyon1.fr/sim4.php)
Spidey | Two BLASTs: high stringency and low stringency | Vertebrates, Drosophila, C. elegans, plant | http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/index.html
TAP | WU-BLASTN, SIM4 | Human, mouse, Drosophila | http://www.sapiens.wustl.edu/zkan/TAP/
GeneWise | DP | Human | http://www.sanger.ac.uk/Software/Wise2/genewiseform.shtml
SYNCOD | Silent/replacement ratio, Monte Carlo simulations | Human, mouse, Drosophila, Arabidopsis, Aspergillus, Caenorhabditis | http://www.125.itba.mi.cnr.it/~webgene/wwwsyncod.html

Gene prediction in prokaryotes is relatively easy since prokaryotic protein-coding genes have specific signals, such as transcription factor-binding sites and the Pribnow box, that are easy to identify. The protein-coding sequence is a contiguous ORF, starting with a start codon (ATG) and ending with a stop codon (TAG/TGA/TAA). Moreover, prokaryotic genomes are small and show a high coding density (>90%). There are no introns in prokaryotes. Ab initio gene prediction is more difficult in eukaryotes for the following reasons:
1. Protein-coding genes are separated by large intergenic regions.
2. The genes are not contiguous. They are divided into exons and introns, and the introns are removed by the splicing mechanisms in eukaryotic cells. The split genes make it difficult to define ORFs.
3. The signals (e.g., promoters) are more difficult to identify than those in prokaryotes, since these signals are more complex and unspecific. Two such signals are CpG islands and binding sites for a poly-A tail.
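To make the point about split genes concrete, the sketch below translates a toy eukaryotic-style gene twice: once directly from the genomic sequence and once after the exons are concatenated into a CDS. Everything here is a hypothetical illustration: the 24-base sequence, the exon coordinates, and the helper names `splice` and `translate` are invented for the example, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate from position 0 and stop at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino = CODON_TABLE[seq[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


def splice(genomic: str, exons) -> str:
    """Concatenate exons given as 0-based, half-open (start, end) pairs."""
    return "".join(genomic[start:end] for start, end in exons)


# Hypothetical gene: exon 1, an intron (GT...AG), exon 2.
EXON1, INTRON, EXON2 = "ATGGCC", "GTAAGTTAACAG", "AAATGA"
GENOMIC = EXON1 + INTRON + EXON2
EXONS = [(0, 6), (18, 24)]

if __name__ == "__main__":
    # Reading straight through the intron hits a stop codon inside it.
    print("genomic:", translate(GENOMIC))
    # The spliced CDS gives the intended (toy) peptide.
    print("spliced:", translate(splice(GENOMIC, EXONS)))
```

In this toy case the unspliced reading frame runs into a stop codon inside the intron, so an ORF scan on genomic DNA alone would report a truncated and partly wrong product.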
Features for Gene Prediction in Prokaryotes

The features of gene structure in prokaryotes act as the key criteria for identifying protein-coding genes in them. The gene structures of prokaryotes that are used in gene prediction, particularly in the ab initio method, are discussed here (see Figure 7.3).

[Figure 7.3 Gene structure of a prokaryote: promoter, start codon, ORF (open reading frame) shown in the reading frames, and stop codon]

Promoter elements: The sequence in DNA which defines the start of a gene is called the promoter. Two regions are important in the definition of a prokaryotic promoter: the -35 region and the -10 region, also known as the Pribnow box.

-35 Region: This sequence is centred about 35 bp before the start (upstream) of a bacterial (prokaryotic) gene. It functions in the initial recognition of a gene by RNA polymerase. The consensus sequence is TTGACA.

-10 Region (Pribnow box): This has the consensus sequence TATAAT and is centred about 10 bp before the start of a bacterial gene, i.e., there are about 17 bp between the -35 region and the Pribnow box. Since three hydrogen bonds hold a G-C base pair together while only two are present between A and T, the strands of an AT-rich region are separated more easily than those of a GC-rich region. It is thought that the enzyme DNA-dependent RNA polymerase initially makes contact with the -35 region, and then moves along the DNA until it finds an AT-rich region (the Pribnow box). At this point the enzyme separates the two DNA strands, and the RNA polymerase initiates RNA synthesis approximately 7 bp further along the DNA (downstream of the TATAAT site).
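A simple consensus search for the two promoter elements can be sketched as follows. This is a hedged illustration only: it assumes the commonly cited hexamers TTGACA (-35) and TATAAT (-10), a spacer window of 15 to 19 bp between them, and a tolerance of one mismatch per element. The function names and the test sequence are hypothetical, and real promoter prediction needs a far more careful statistical model.

```python
MINUS35 = "TTGACA"   # assumed -35 consensus hexamer
MINUS10 = "TATAAT"   # Pribnow box consensus


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(seq: str, max_mismatch: int = 1, spacer=(15, 19)):
    """Return (pos_minus35, pos_minus10, spacer_length) candidates.

    A candidate is a -35-like hexamer followed, after a gap of
    spacer[0]..spacer[1] bases, by a -10-like hexamer. Both the mismatch
    tolerance and the spacer window are illustrative assumptions.
    """
    seq = seq.upper()
    width = len(MINUS35)
    hits = []
    for i in range(len(seq) - width + 1):
        if mismatches(seq[i:i + width], MINUS35) > max_mismatch:
            continue
        for gap in range(spacer[0], spacer[1] + 1):
            j = i + width + gap
            box = seq[j:j + len(MINUS10)]
            if len(box) == len(MINUS10) and mismatches(box, MINUS10) <= max_mismatch:
                hits.append((i, j, gap))
    return hits


if __name__ == "__main__":
    # Hypothetical sequence: -35 box, 17-base spacer, Pribnow box.
    toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GGGGGGG"
    print(find_promoters(toy))
```

Allowing mismatches matters because natural promoters rarely match both consensus hexamers exactly; the price is more false positives, which is the low-information-content problem the chapter returns to later.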
The polymerase then transcribes the DNA using the template strand to produce an RNA molecule whose sequence, by definition, matches the sense strand.

Translation start site: Protein-coding genes generally start with the codon 'ATG', which is known as the start codon and codes for methionine. In prokaryotes, in some cases, the start codon is TTG or GTG. Any series of non-stop codons can be translated computationally into an amino acid sequence. Such a region, starting with ATG, TTG, or GTG, is called an ORF.

ORFs: Since stop codons are found in uninformative nucleotide sequences approximately once every 21 codons (3 out of 64), a run of 30 or more triplet codons that does not include a stop codon can be a probable gene. The presence of a set of sequences around which ribosomes assemble (during translation) at the 5' end of each ORF serves as a hallmark for identification of protein-coding genes in prokaryotes. This set of sequences is often found immediately downstream of transcriptional start sites and just upstream of the first start codon. This sequence patch is known as the ribosome-loading site (also called the Shine-Dalgarno sequence) and has the consensus sequence 5'-AGGAGGU-3'.

Translation stop site: This includes TAA, TAG, and TGA.

Termination sequences: The vast majority of prokaryotic protein-coding gene operons contain specific signals for the termination of transcription called intrinsic terminators. Intrinsic terminators have two prominent structural features: (i) a sequence of nucleotides that includes an inverted repeat and (ii) a run of roughly six uracils immediately following the inverted repeat.

Some of the gene prediction tools for prokaryotes are listed in Table 7.2.

Table 7.2 List of gene prediction tools for prokaryotes
Prediction tools | Web interfaces
AMIGene (annotation of microbial genes) | http://www.genoscope.cns.fr/agc/tools/amigene/
EasyGene | http://www.cbs.dtu.dk/services/EasyGene
GeneMark.hmm-P | URL partly illegible in source (… gmhmm2_prok …)
GLIMMER | http://www.cbcb.umd.edu/softw… (truncated in source)
GS-Finder | http://www.tubic.tju.edu.cn/… (truncated in source)
MED-Start | URL partly illegible in source (… cn/main/SheGroup/ …)
REGANOR | http://www.cebitec.uni-bielefeld.de/groups/brf/software/reganor/
TICO (translation initiation site correction) | http://www.tico.gobics.de/
ZCURVE | http://www.tubic.tju.edu.cn/Zcurve_B/

Features for Gene Prediction in Eukaryotes

Certain information present within the DNA sequence, such as splice sites, start and stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosomal-binding sites, topoisomerase-II binding sites, topoisomerase-I cleavage sites, various transcription factor-binding sites, and so on, is called signals. Methods for detecting them are called signal sensors. Genomic DNA signals can be contrasted with extended and variable-length regions such as exons and introns, which are recognized by different methods that may be called content sensors.

Content sensors: These classify a DNA region into different types, e.g., coding versus non-coding. Similarity-based approaches are often called extrinsic, in opposition to others that try to capture some of the intrinsic properties of the coding/non-coding sequences (compositional bias, codon usage, etc.).

Extrinsic content sensor: These sensors basically perform similarity searching between a genomic sequence region and a protein or DNA sequence present in a database. Basic tools needed for similarity searching between sequences are local alignment methods like the Smith-Waterman algorithm and fast heuristic approaches such as FASTA and BLAST. Almost 50% of the genes can be identified by this procedure. Databases like SwissProt or PIR serve as the source for the most widely used protein sequences for such a purpose.
Again, a good similarity score will not enable exact identification of the gene structure, since homologous proteins do not share all their domains. Further, UTR regions cannot be delimited by such a procedure. It is generally assumed that coding sequences are more conserved than non-coding ones. Under such an assumption, similarity with genomic DNA can also serve as a valuable source of information on intron/exon location. There are two approaches for such a purpose: intra-genomic comparisons provide useful information regarding multigenic families, which represent a huge percentage of the existing genes. Intergenomic or cross-species comparisons identify orthologous genes, without preliminary knowledge about them.

Advantages: Similarity-based approaches depend on accumulated pre-existing biological data. Hence, the predictions should be biologically relevant.

Disadvantages:
• If the database is devoid of sufficiently similar sequences, no result will be obtained.
• Even if a good similarity is found, the limits of the regions of similarity are not always precise, so the gene cannot be identified accurately.
• Small exons are also missed out in the process.

Intrinsic content sensor: Originally, these were defined for prokaryotes. In prokaryotes it is still common to locate genes by just looking for long ORFs. This is certainly not adequate for higher eukaryotes. To discriminate coding from non-coding regions in eukaryotes, content sensors use statistical models of the nucleotide frequencies and dependencies present in codon structure. The most commonly used statistical models are known as Markov models, popularized for gene finding. Neural networks are used to combine several coding measures along with signal sensors for the flanking splice sites. Other content sensors include sensors for CpG islands, which are regions that often mark the beginning of genes where the frequency of the dinucleotide CG is not as low as it is in the rest of the genome, and sensors for repetitive DNA, such as human Alu sequences.

Signal sensors: Signal sensors are measures that try to detect the presence of the functional sites specific to a gene. The basic signal sensor is a simple consensus sequence or an expression that describes a consensus sequence along with allowable variations. More sensitive sensors can be designed using weight matrices in place of the consensus, in which each position in the pattern allows a match to any residue. Different costs are associated with matching each residue in each position. The score returned by a weight matrix sensor for a candidate site is the sum of the costs of the individual residue matches over that site. If this score exceeds a given threshold, the candidate site is predicted to be a true site. Such sensors have a natural probabilistic interpretation in which the score returned is a log likelihood ratio under a simple statistical model in which each position in the site is characterized by an independent and distinct distribution over possible residues. More sophisticated types of signal sensors, such as neural networks, are extensively used.

Ab initio methods in eukaryotes use these two kinds of sensors, content and signal, for identification of protein-coding genes. In this context, it is essential to have an idea of eukaryotic promoters.

Eukaryotic promoter: Eukaryotic promoters are extremely diverse and are difficult to characterize. They typically lie upstream of the gene and can have regulatory elements several kilobases away from the transcriptional start site.
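The weight-matrix signal sensor described above (score as a sum of per-position costs, interpreted as a log likelihood ratio, compared against a threshold) can be sketched as follows. This is an illustrative sketch under stated assumptions: the four training sites are made up, the background is uniform, a pseudocount of 1 is added to every count, and the function names are hypothetical.

```python
import math

BASES = "ACGT"


def build_log_odds(sites, background=None, pseudocount=1.0):
    """Build a position weight matrix of log2 likelihood ratios.

    sites: equal-length example sites (assumed training data).
    Each column holds log2(P(base | site position) / P(base | background)).
    """
    background = background or {b: 0.25 for b in BASES}
    total = len(sites) + len(BASES) * pseudocount
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        matrix.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return matrix


def score(matrix, candidate):
    """Sum of the per-position log-odds for one candidate site."""
    return sum(col[base] for col, base in zip(matrix, candidate))


def scan(matrix, seq, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        s = score(matrix, seq[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits


if __name__ == "__main__":
    # Hypothetical aligned examples of a TATA-like signal.
    training_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
    pwm = build_log_odds(training_sites)
    print(scan(pwm, "GGTATAATGG", threshold=5.0))
```

Because the columns are treated as independent, the total score is just a sum, which is exactly the "independent and distinct distribution" assumption in the text; dependencies between positions need richer models such as the neural networks mentioned above.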
In eukaryotes, the transcriptional complex can cause the DNA to bend back on itself, which allows for the placement of regulatory sequences far from the actual site of transcription. Many eukaryotic promoters contain a TATA box (sequence TATAAA), which in turn binds a TATA-binding protein that assists in the formation of the RNA polymerase transcriptional complex. The TATA box typically lies very close to the transcriptional start site (often within 50 bases).

Some of the methods used for identification of genes in eukaryotes are listed in Table 7.3.

Table 7.3 List of software based on the ab initio method for eukaryotes
Software | Algorithm | Organisms | Web interface
FGENESH | HMM | Human, mouse, Drosophila, rice | http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=gfind
GeneID | DP, MM | Vertebrates, plants | http://www1.imim.es/geneid.html
GeneMark.hmm | GHMM | Human, mouse, Drosophila | http://www.opal.biology.gatech.edu/GeneMark/eukhmm.cgi
GeneParser | DP | Vertebrates | http://www.beagle.colorado.edu/~eesnyder/GeneParser.html
GeneWise | DP | Human | http://www.ebi.ac.uk/Wise2/index.html
GENIE | GHMM | Drosophila, human | http://www.fruitfly.org/seq_tools/genie.html
GenLang | Grammar rule | Vertebrates, Drosophila, dicotyledonous plants | http://www.cbil.upenn.edu/~sdong/genlang_home.html
GENSCAN | GHMM | Vertebrates, Arabidopsis, maize | http://www.genes.mit.edu/GENSCAN.html
GENVIEW2 | DP | Human, mouse, diptera | http://www.125.itba.mi.cnr.it/~webgene/wwwgene.html
GLIMMERM | DP, IMM | Small eukaryotes, Arabidopsis, rice | http://www.tigr.org/tdb/glimmerm/glmr_form.html
GRAIL | DP, NN | Human, mouse, Arabidopsis, rice | http://www.compbio.ornl.gov/Grail-1.3
HMMgene | CHMM | Vertebrates, C. elegans | http://www.cbs.dtu.dk/services/HMMgene
MORGAN | DP | Vertebrates | http://www.cs.jhu.edu/labs/compbio/morgan.html
MZEF | Quadratic discriminant analysis | Human, mouse, Arabidopsis, fission yeast | http://www.argon.cshl.org/genefinder/
SLAM | DP | Human, mouse | http://www.baboon.math.berkeley.edu/~syntenic/slam.html
TWINSCAN | GHMM | Mouse, human | http://www.genes.cs.wustl.edu/
VEIL | DP, HMM | Vertebrates | http://www.tigr.org/~salzberg/veil.html
Xpound | HMM | Human | http://www.bioweb.pasteur.fr/seqanal/interfaces/xpound-simple.html

Comparative Method

The given DNA string is compared with a similar DNA string from a different species at an appropriate evolutionary distance, and genes are simultaneously predicted in both sequences based on the assumption that exons will be well conserved, whereas introns will not. Examples of software adopting this comparative method are the conserved exon method (CEM) and Twinscan.

COMBINATION OF TWO METHODS

Gene prediction methods based on the extrinsic approach are able to detect only a limited number of genes (low sensitivity), due to the lack of known mRNAs in the database, whereas ab initio gene prediction tools rely on intrinsic gene measures, including coding potentials and splice signals. Even the ab initio method is not as efficient as it sounds, because some exceptional genes lack the intrinsic gene measures, or so-called sensors (signal or content). For example, some housekeeping genes as well as some oncogenes and growth factor genes have no TATA box, which makes it difficult to identify them by signal sensors in the ab initio method. Therefore, gene prediction tools based on different methods can be combined to devise a more efficient gene prediction tool. Examples of software combining two or more different types of gene prediction software are given in Table 7.4. Prediction tools for non-protein-coding genes (tRNA, rRNA, and microRNA) are provided in Table 7.5.
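The conservation assumption behind the comparative method (exons well conserved, introns not) can be illustrated with a sliding-window identity scan over two already aligned sequences. This is a deliberately simplified sketch, not how CEM or Twinscan work internally: the input is assumed to be a precomputed pairwise alignment with '-' for gaps, and the window size, identity cutoff, function name, and toy sequences are all hypothetical.

```python
def conserved_windows(aln_a: str, aln_b: str, window: int = 9,
                      min_identity: float = 0.8):
    """Return merged (start, end) column intervals that look conserved.

    aln_a, aln_b: two rows of a pairwise alignment (equal length).
    A window counts as conserved when the fraction of identical,
    non-gap columns is at least min_identity. Overlapping conserved
    windows are merged into one half-open interval.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    regions = []
    for i in range(len(aln_a) - window + 1):
        same = sum(
            x == y and x != "-"
            for x, y in zip(aln_a[i:i + window], aln_b[i:i + window])
        )
        if same / window >= min_identity:
            if regions and i <= regions[-1][1]:
                # Extend the previous region instead of starting a new one.
                regions[-1] = (regions[-1][0], i + window)
            else:
                regions.append((i, i + window))
    return regions


if __name__ == "__main__":
    # Toy alignment: two identical "exons" flanking a diverged "intron".
    species_a = "ATGGCCAAA" + "GTAAGTTTTT" + "CCCGGGTGA"
    species_b = "ATGGCCAAA" + "CACCTGGAGA" + "CCCGGGTGA"
    print(conserved_windows(species_a, species_b, window=9, min_identity=1.0))
```

Conserved blocks found this way are only candidate exons; conserved regulatory elements and other constrained non-coding regions would also pass such a filter, which is one reason comparative tools combine conservation with the signal and content sensors discussed earlier.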
Table 7.4 Examples of two combined software tools for protein-coding gene prediction
Software/website | Combined tool | Website of combined tool | Description
DIGIT (http://www.digit.gsc.riken.go.jp) | FGENESH | http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=gfind | HMM-based method for human, mouse, Drosophila, rice
DIGIT | GENSCAN | http://www.genes.mit.edu/GENSCAN.html | GHMM-based method for vertebrates, Arabidopsis, maize
DIGIT | HMMgene | http://www.cbs.dtu.dk/services/HMMgene | GHMM-based method for vertebrates, C. elegans
EuGene (http://www.inra.fr/mia/T/EuGene/) | NetStart | http://www.cbs.dtu.dk/services/NetStart/ | Predictions of translation start in vertebrate and Arabidopsis thaliana nucleotide sequences
EuGene | NetGene2 | http://www.cbs.dtu.dk/services/NetGene2/ | Predictions of splice sites in human, C. elegans, and Arabidopsis thaliana
EuGene | SplicePredictor | http://www.deepc2.psi.iastate.edu/cgi-bin/sp.cgi | A method to identify potential splice sites in (plant) pre-mRNA by sequence inspection using Bayesian statistical models

Table 7.5 Non-protein-coding gene prediction tools
Type | Tool | Web interface
tRNA | tRNAscan-SE | http://www.lowelab.ucsc.edu/tRNAscan-SE/
tRNA | FAStRNA | http://www.bioweb.pasteur.fr/seqanal/interfaces/fastrna.html
tRNA | ARAGORN | host partly illegible in source (… mbioekol.lu.se/ARAGORN1.1/HTML/aragorn1.2.html)
tRNA | SPLITS | http://www.splits.iab.keio.ac.jp/
rRNA | SSU rRNA | http://www.soe.ucsc.edu/research/compbio/ssurrna.html
rRNA | CARNAC (Computer Alignment of RNA by Cofolding) | http://www2.lifl.fr/~perrique/rna/OLD_index.html
rRNA | myRDP | https://www.rdp.cme.msu.edu/login/myrdp/overview.spr
rRNA | RNAmmer | http://www.cbs.dtu.dk/services/RNAmmer/
MicroRNA | siRNA | http://www.bioweb.pasteur.fr/seqanal/interfaces/sirna.html
MicroRNA | MiRFinder | http://www.bioinformatics.org/mirfinder/
MicroRNA | MiRscan | http://www.genes.mit.edu/mirscan/
MicroRNA | ProMiR II | http://www.cbit.snu.ac.kr/~ProMiR2/
MicroRNA | RNAmicro | http://www.bioinf.uni-leipzig.de/~jana/software/index.html

WHY IS GENE PREDICTION DIFFICULT?

Although we have practically all the information about gene structure (in both prokaryotes and eukaryotes), it is still difficult to predict genes accurately for the following reasons:
• DNA sequence signals have low information content (they are degenerate and highly unspecific).
• It is difficult to discriminate real signals.
• Sequences contain sequencing errors.

Apart from these, we can list reasons separately for the two domains of life, prokaryotes and eukaryotes. Prokaryotes have high gene density and a simple gene structure, yet they pose the following difficulties:
• Short genes are found in prokaryotes, and these carry little information.
• The presence of overlapping genes in prokaryotes makes the detection of genes difficult.

The following features of eukaryotes make gene detection complicated:
• Low gene density and complex gene structure are the main factors in gene prediction in eukaryotes.
• The presence of alternative splicing in eukaryotic genes makes their detection difficult.
• The presence of pseudogenes.

SUMMARY

'Gene prediction' research was conceived as a result of the need for completely automated gene finding systems for the long, non-annotated sequences now being produced at a very high volume, and it is important to distinguish the two different goals in such gene finding research. The first goal is to provide computational methods to aid in the annotation of the large volume of genomic data that is produced by the genome-sequencing efforts.
The second goal is to provide a computational model to help elucidate the mechanisms involved in transcription, splicing, polyadenylation, and other critical processes in the pathway from genome to proteome.

A further key issue that will influence future research in both of these computational gene-finding paradigms is alternative splicing. No currently available program or algorithm can handle alternative splicing in an efficient manner. Intimately linked with this issue is that of gene regulation: annotation is not complete until the abundant regulatory signals flanking genes or appearing in introns or exons are properly identified. Further, the cellular conditions that give rise to the differing expression levels of different transcripts need to be worked out.

Finally, as a word of caution, it should be mentioned that the results produced by such methods are predictions and should be treated carefully. They are very useful for speeding up gene discovery and the knowledge mining that follows from it, but biological expertise remains necessary in order to confirm the existence of a virtual protein and to find or prove its biological function and its condition of expression in the organism. All these facts imply that future gene-finding research will greatly depend on experimental data relating to differential expression, along with the other types of data that we have already discussed.

REVIEW QUESTIONS

1. What is the Human Genome Project? How is it significant to bioinformatics?
2. What is gene annotation? Explain the need for gene annotation.
3. What is an Open Reading Frame?
4. Write the three methods of computational gene prediction.
5. Why is it difficult to annotate protein-coding genes in eukaryotes with the ab initio gene prediction method?
6. Mention the differences between a content sensor and a signal sensor.
7. Mention a few software tools used for annotating protein-coding genes in prokaryotes and eukaryotes.
8. How do prokaryotic promoters help in the detection of CDS?

8 Molecular Phylogeny

"Nothing in biology makes sense except in the light of evolution."
Theodosius Dobzhansky

INTRODUCTION

The word 'phylogeny' has been derived from two Greek words, phylon and genesis. Phylon means 'stem' and genesis means 'origin'. In other words, phylogeny gives an idea about the evolution or origin of an organism (see Figure 8.1). Phylogeny is illustrated as a tree.

For a long time after people began trying to classify organisms in a systematic fashion, they used a variety of definitions of relationship. One definition was that things that look alike are more closely related to each other than things that look different. This is perhaps logical, but it is wrong for things that resemble each other only superficially. For example, African euphorbias and American cacti resemble each other, but are not closely related at all. On the contrary, things that appear to be quite different turn out to be related.
Zimmermann, in 1931, defined relationship as the sharing of a recent common ancestor. For example, apples are more closely related to magnolias than they are to ginkgos because apples and magnolias share a more recent common ancestor with each other than either does with ginkgos.

Figure 8.1 Phylogenetic tree of human beings and their ancestors

Apart from clarifying the evolutionary relationships among different groups of organisms, phylogeny also helps to:
• Understand the evolutionary history of organisms
• Map pathogen strain diversity for vaccines
• Assist in the epidemiology of infectious diseases and genetic defects
• Aid in predicting the functions of novel genes
• Support biodiversity studies
• Understand microbial ecologies

PHENOTYPIC PHYLOGENY AND MOLECULAR PHYLOGENY

There are two types of phylogeny methods, namely, phenotypic phylogeny and molecular phylogeny. Phenotypic phylogeny is considered the traditional method of phylogeny, as it is based upon phenotypic observations from the group of organisms. In due course, scientists found that this method makes it difficult to classify micro-organisms because phenotypic resemblance or dissimilarity may be superficial. This paved the way for the novel concept of molecular phylogeny. Linus Pauling was the first to observe that genetic sequences could be used for phylogeny, and this method is known as molecular phylogeny. In the molecular phylogeny approach, the relationships among organisms or genes are studied by comparing homologous DNA or protein sequences. Thus, molecular phylogeny can be defined as the study of relationships among organisms using molecular markers such as DNA or protein sequences.
Molecular phylogeny based on nucleotide or amino acid sequence comparison has become a widespread tool for general taxonomy and evolutionary analysis. It seems that molecular phylogeny is the only means to establish a natural classification of micro-organisms, since their phenotypic traits are not always consistent with the genealogy or family pedigree.

These two methods of phylogeny are related, since the genome strongly contributes to the phenotype of the organisms. In general, organisms with more similar genes are more closely related. However, phenotype-based (morphological) phylogeny has many disadvantages compared with molecular phylogeny. A few of those are listed here:
• Similar phenotypes may arise in distantly related organisms through the process called convergent evolution.
• Phenotypic features cannot be studied for many organisms, e.g., bacteria.
• It is difficult to compare phenotypic traits between distantly related organisms, e.g., when comparing bacteria and mammals.

Molecular phylogeny methods are often free of such problems and make possible the study of genes without a morphological expression. That is why this method is preferred for classifying organisms in any evolutionary situation.

Mechanism of Molecular Phylogeny

The primary mechanism of evolution at the molecular level is nucleotide substitution during DNA replication. All other outward evidence of evolution (the phenotype) is the result of changes in the DNA sequences within an organism. Mutation in the germ-line occurs through several inter-related mechanisms such as base substitution and exon shuffling.

1. Base substitutions: The most common genetic change is simply a change of the nucleotide present in the parental genome to a different one in the progeny genome. Changes of this kind occur through several mechanisms such as transposition, insertion, and deletion. These are defined as follows:
a. Transposition: Transposition is the movement of an entire gene from one location on a DNA molecule to another.
b. Insertion: Insertions are extra nucleotides not present in the parental template DNA. Insertions range from a single base to kilobases in length. For example, insertion is what occurs on the receiving end of a transposition.
c. Deletion: A deletion is the loss of nucleotide(s) in the progeny DNA, e.g., at the location from which a gene or gene fragment was removed to be inserted into another location, or its complete elimination from the genome.

Instances of insertions and deletions can be easily obtained by using the Pileup program of GCG (Genetics Computer Group) to compare multiple related sequences. All the gaps represent either an insertion or a deletion. Dotplot provides similar information for pairs of sequences: it gives a graphical picture of the differences and shows insertions, deletions, and replications particularly well.

2. Exon shuffling: Exon shuffling is the process by which either exons are duplicated in the same DNA molecule, or structural or functional domains are exchanged between genes that encode proteins in multiple exons.

Phylogenetic Marker: Choosing the Right Molecule for the Problem at Hand

Molecules are like radioisotopes in that they change at different rates. The genomes of RNA viruses such as HIV change so quickly that very soon every infected person would carry an identifiably different strain. Mitochondrial DNA, which is haploid, has a relatively fast substitution rate. It evolves rapidly enough to be useful for comparisons of lineages that diverged recently, but it can also be used to establish relationships among groups that are several million years old.
To obtain good molecular information on events that occurred in deep time, highly conserved genes, i.e., genes that change very slowly, are needed, such as the DNA that codes for the small subunit of ribosomal RNA. Such genes contain useful information about events that occurred about 500-1,500 million years ago.

Chromosomal DNA, being the most sensitive to evolutionary change, has traditionally been used for the analysis of phylogenetic relationships among prokaryotes and archaea. In eukaryotes, chromosomal DNA is subject to various gene repair mechanisms and also to crossing-over and recombination during sexual reproduction. It has become apparent that in clearly different lineages of organisms, almost any sufficiently large fragment of the genome will provide the same phylogenetic tree as any other. With more closely related organisms, this is not always the case. To solve these conundrums, relationships among mammalian species have also been studied by the analysis of extra-chromosomal DNA, such as mitochondrial DNA, which is not subject to some of these processes because it is maternally inherited.

Apart from these, the sequences most widely used as phylogenetic markers are listed here along with their respective features and utilities:
• DNA: very sensitive, non-uniform mutation rates.
• cDNA/RNA: useful for more remote homologies.
• Protein sequences: useful for the most remote homologies and deep phylogenies; more uniform mutation rates; more character states.

Among these phylogenetic markers, 16S ribosomal RNA sequences are widely used because they exist in all organisms, are conserved across kingdoms, and are long sequences that are easy to sequence and have been widely sequenced. Hence, they are suitable for broad and very deep phylogeny studies.

REPRESENTATION OF PHYLOGENY

The most convenient way of visually presenting the evolutionary relationships among a group of organisms is through illustrations called phylogenetic trees.
A tree is a mathematical structure which is used to model the actual evolutionary history of a group of sequences or organisms. A phylogenetic tree is composed of nodes, each representing a taxonomic unit (species, populations, individuals), and branches, which define the relationship between the taxonomic units in terms of descent and ancestry. Only one branch can connect any two adjacent nodes. The branching pattern of the tree is called the topology, and the branch length usually represents the number of changes that have occurred in the branch. Other terms often used for phylogenetic trees are the following:
• Operational Taxonomic Unit (OTU): An OTU is any group of organisms, populations, or sequences considered to be sufficiently distinct from each other to be treated as a separate unit. OTUs are also termed terminal (external) nodes, taxa, or leaves (see Figure 8.2).
• Root: The common ancestor of all taxa.
• Distance scale: A scale that represents the number of differences between the organisms or sequences. For example, 0.1 means 10% differences between two sequences.
• Internal branch: A branch located between two nodes.
• External branch: A branch located between a node and a leaf.
• Monophyletic: Refers to two or more DNA sequences that are derived from a single common ancestral DNA sequence.
• Clade: A group of monophyletic DNA sequences that make up all the sequences included in the analysis that are descended from a particular common ancestral sequence.
• Horizontal branch length: This is proportional to the evolutionary distances between the sequences and their ancestors (unit = substitutions/site).

Figure 8.2 Different parts of a phylogenetic tree (root, internal nodes, external nodes)

The different parts of a phylogenetic tree are shown in Figure 8.2.
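The terminology above can be made concrete with a small sketch. The nested-tuple representation, the taxon names A to D, and the helper names `leaves`, `clades`, and `count_branches` are all illustrative assumptions, not part of the text; the sketch only shows how leaves (OTUs), clades, and branches relate in a rooted tree.

```python
# Hypothetical rooted four-taxon tree written as nested tuples (an assumed,
# illustrative representation): a leaf is a string naming an OTU, and an
# internal node is a tuple of its child nodes.
tree = ((("A", "B"), "C"), "D")


def leaves(node):
    """Return the OTUs (terminal nodes) found below a node, left to right."""
    if isinstance(node, str):
        return [node]
    found = []
    for child in node:
        found.extend(leaves(child))
    return found


def clades(node):
    """Return the leaf set below every internal node.

    Each set is a clade: all sampled descendants of one common ancestor.
    """
    if isinstance(node, str):
        return []
    groups = [frozenset(leaves(node))]
    for child in node:
        groups.extend(clades(child))
    return groups


def count_branches(node):
    """Count branches: one branch joins each internal node to each child."""
    if isinstance(node, str):
        return 0
    return len(node) + sum(count_branches(child) for child in node)


print(leaves(tree))          # the four OTUs
print(clades(tree))          # leaf sets of the root and the two inner nodes
print(count_branches(tree))  # 6 branches for this rooted four-taxon tree
```

In this toy tree, A and B form a clade, while A and C alone do not, because their most recent common ancestor also has B as a descendant.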
Different types of trees

Different types of trees can be used to depict different aspects of evolutionary history. The most basic tree is the cladogram, which simply shows the relative recentness of common ancestry. It is a branching diagram depicting the hierarchy of taxa defined by cladistic methods. Phylograms (additive trees) depict the amount of evolutionary change that has occurred along the different branches. A phylogram can otherwise be explained as a phylogenetic tree that indicates the relationships between the taxa and also conveys a sense of time or rate of evolution. Dendrograms (ultrametric trees) depict the times of divergence. They are branching diagrams in the form of a tree used to depict the degrees of relationship or resemblance (see Figure 8.3).

Figure 8.3 Three different types of representation of a phylogenetic tree: (a) cladogram; (b) phylogram or additive tree; (c) dendrogram or ultrametric tree

Cladograms and phylograms can be either unrooted or rooted in molecular systematics.
• Most phylogenetic methods produce unrooted trees (Figure 8.4(a)). This is because they detect the differences between the sequences, but have no means to orient the residue changes with respect to time. An unrooted tree shows only the evolutionary relationships between the organisms in the tree; it does not infer the placement of a common ancestor in the structure or the evolutionary path that led to the current relationships. The direction of the evolutionary process is not given.
• In rooted trees (Figure 8.4(b)), there is a particular node, called the root, representing a common ancestor, from which a unique path leads to any other node. A rooted tree infers the existence of an actual common ancestor and defines the evolutionary paths leading to the development of each organism.
It provides an indication of the direction of the evolutionary process, defining the ancestral and the derived characters or species.

Figure 8.4 Rooted and unrooted trees: (a) unrooted tree; (b) rooted tree

MOLECULAR CLOCKS

The molecular clock hypothesis rests on the assumption that mutations occur at some regular, more or less uniform rate. The hypothesis postulates that for any given macromolecule (a protein or DNA sequence), the rate of evolution (measured as the mean number of amino acid or nucleotide sequence changes per site per year) is approximately constant over time in all lineages. Thus, if a certain amount of time has passed, a certain number of mutations can be expected to have occurred. This hypothesis has stimulated much interest in the use of macromolecules in evolutionary studies for two reasons:
1. Sequences can be used as molecular markers to date evolutionary events.
2. The degree of rate change among sequences and lineages can provide insights into the mechanisms of molecular evolution. For example, a large increase in the rate of evolution of a protein in a particular lineage may indicate adaptive evolution.

Some phylogenetic tree construction methods, such as UPGMA (Unweighted Pair Group Method with Arithmetic Mean), use this notion in their construction of the tree. Whether or not the assumption is true can only be answered indirectly, and it is probably not true except over long periods of time (hundreds of millions of years). The fossil record indicates that there have been periods in the earth's history when mass extinctions of 50% or more of known organisms occurred, and that species diversified unusually rapidly in the periods immediately succeeding the mass extinctions.
In particular, extremely high rates of species diversification are known to have occurred in early Cambrian time (just after 590 million years ago), in early Triassic time (just after 248 million years ago), and in early Cenozoic time (just after 65 million years ago).

In addition to these issues, where DNA sequences readily mutate at a fair rate without changing the resulting amino acid sequence, or lie in non-coding regions, multiple mutations at a single site may have occurred. Comparing the predecessor sequence with the current sequence thus sometimes gives an inaccurate value for the mutation rate, because some of the prior mutations have become invisible due to subsequent mutations at the same site.

Problems notwithstanding, the measurement of cumulative mutations (genetic distance) may provide an estimate of the time required for the genome to evolve from one organism to the other, or of the time since the two organisms had a common ancestor.

Basic Steps of Phylogenetic Tree Construction

a. Choice of Data: In phylogenies, different genes, combinations of genes, or DNA regions may be used to infer phylogenetic trees for the groups of organisms being addressed. These are called phylogenetic markers. Depending on whether the organisms are expected to be closely or distantly related, regions ranging from extremely variable (e.g., ITS2 or introns) to conserved (e.g., ribosomal LSU rDNA, protein-coding genes) can be used. If the chosen DNA region proves to be too conserved or too variable, switching to a different DNA region may increase the resolution.

b. Sources of Sequences: The sequence of the desired phylogenetic marker is the input for the various phylogenetic packages. Phylogenetic markers can be obtained from two sources: (i) by sequencing in the lab and (ii) from the available databases.

c. Alignment: Since descendants inherit traits from their ancestors through genes, the history of descent is recorded in the changes within the DNA sequences. The molecular data on sequences in the genes are a simple form of character data: the characters are the positions in the sequence, and the character states are the nucleotides at those positions. This sounds simple but assumes that the positions compared are homologous. This idea is put into practice through multiple alignment.
Alignments may be done automatically using Clustal X/W, but if the sequences are not highly conserved and/or contain insertions and deletions, alignment errors are quite likely to occur. Aligned sequences offer an opportunity for a final proofreading step: search for deviating nucleotides in the conserved regions and cross-check them against the assembled sequence (the basic steps are given in Figure 8.5).

d. Choice of Evolutionary Model: Evolutionary models for DNA sequences have become quite elaborate over the years. They include several parameters such as base frequencies, substitution rate matrix, gamma distribution, proportion of invariable sites, and covarion/covariotide evolution. In simple evolutionary models, base frequencies are assumed to be equal, i.e., the proportion of each type of nucleotide is set to 25%. Base frequencies may also be estimated from the data set, in which case they may differ among data sets and for each nucleotide. In simple evolutionary models, the substitution rate is assumed to be equal for each type of point mutation in both directions; the number of substitution types in these models is set to one. In evolutionary models with two substitution types, transitions and transversions are usually assigned different substitution rates. The most complex substitution rate matrices for reversible models consist of six different substitution rates, one for each type of point mutation (A↔C, A↔G, A↔T, C↔G, C↔T, and G↔T; i.e., the general time reversible model, GTR).
Substitution rate matrices for non-reversible models consist of 12 different substitution rates, but are not implemented in most standard molecular phylogeny programs.

Figure 8.5 Basic steps in molecular phylogeny: choice of molecular marker(s) and taxon sampling; amplification/sequencing; alignment; choice of evolutionary model; phylogenetic analyses; tree(s) and topology testing

e. Choice of methods to construct phylogenetic trees: Several methods for constructing phylogenetic trees are known. They include Maximum Parsimony, Maximum Likelihood, and Distance Methods, and these will be elaborated in this chapter.

f. Evaluation of the obtained phylogenetic tree: The last step is the evaluation of the phylogenetic tree. The tree generated by the methods mentioned earlier needs to be tested or evaluated statistically. The most well-known methods are the Bootstrap and Jackknife methods. These methods will also be discussed in this chapter.

METHODS OF PHYLOGENY

Two extensive groups of analysis exist to examine phylogenetic relationships: cladistic and phenetic methods.
• The cladistic method assumes that the members of a group share a common evolutionary history and are more closely related to members of the same group than to any other organisms. The shared derived characteristics are called synapomorphies.
• Phenetic methods, or numerical taxonomy, use overall similarity for the ranking of species. They can use any number of characters, but the data has to be converted into numerical values. The organisms are compared to each other for all the characters and then the similarities are calculated. After this, the organisms are clustered based on their similarities. The resulting cluster diagrams are called phenograms. They do not necessarily reflect evolutionary relatedness.
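The phenetic procedure described above (code the characters numerically, compare every pair of organisms, then group by similarity) can be sketched as follows. The character table, the taxon names, and the choice of the simple matching coefficient as the similarity measure are assumptions made for illustration; the text does not prescribe a particular coefficient.

```python
from itertools import combinations

# Hypothetical binary character table (1 = character present, 0 = absent).
# The taxa and their scores are invented for illustration only.
characters = {
    "taxon1": [1, 1, 0, 1, 0],
    "taxon2": [1, 1, 0, 0, 0],
    "taxon3": [0, 0, 1, 1, 1],
}


def simple_matching(a, b):
    """Fraction of characters for which two taxa show the same state."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Compare every pair of taxa over all characters.
similarity = {
    (p, q): simple_matching(characters[p], characters[q])
    for p, q in combinations(sorted(characters), 2)
}

# A clustering method would join the most similar pair first.
closest_pair = max(similarity, key=similarity.get)

for pair, value in similarity.items():
    print(pair, value)
print("first cluster:", closest_pair)
```

Here taxon1 and taxon2 agree on four of five characters and would be clustered first, regardless of whether that similarity reflects shared ancestry.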
There are three main classes of phylogenetic methods for constructing phylogenies from sequence data:
1. Maximum Parsimony (cladistic method)
2. Maximum Likelihood (cladistic method)
3. Distance Methods (UPGMA, NJ, Fitch and Margoliash Method, and Minimum Evolution) (phenetic method)

The first two methods are directly based on the sequences and the third is indirectly based on the sequences.

Another way of classifying the tree-building methods is by the way the trees are constructed. Cluster methods follow a set of steps (an algorithm) and arrive at a tree. For example, if we have five sequences we might start with three of them (remember that there is only one possible unrooted tree for three sequences) and decide where to place the fourth sequence. Given the resulting tree for four sequences, we then decide where to add the fifth and last sequence to our tree.

The tree-building methods in the second class use optimality criteria (see Figure 8.6) to make a selection from among the set of all possible trees. The criterion is used to assign each tree a 'score' or rank which is a function of the relationship between the tree and the data (examples include maximum parsimony and maximum likelihood).

Figure 8.6 Cluster method and optimality criterion method of phylogeny, arranged by type of data (distances or nucleotide sites)

How to Choose a Phylogenetic Method

After choosing an appropriate phylogenetic marker (a set of related DNA or protein sequences), a multiple sequence alignment is obtained and analysed. Based on the similarity among the group of aligned sequences, an appropriate method is chosen for tree construction. The steps are illustrated in Figure 8.7.

Figure 8.7 Steps to choose an appropriate molecular phylogeny method for the desired group of sequences based upon their sequence homology (strong similarity: maximum parsimony; very weak similarity: maximum likelihood; followed by a check of the validity of the results)

How Many Trees/Topologies are There?

The number of rooted trees (N_R) for n OTUs (Operational Taxonomic Units) is given by:

$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$

The number of unrooted trees (N_U) for n OTUs is given by:

$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$

The possible number of rooted and unrooted trees for different numbers of OTUs is given in Table 8.1.

Table 8.1 Possible number of rooted and unrooted trees for different numbers of OTUs

OTUs    Rooted trees      Unrooted trees
2       1                 1
3       3                 1
4       15                3
5       105               15
6       945               105
7       10395             945
8       135135            10395
9       2027025           135135
10      34459425          2027025
11      > 654 × 10^6      > 34 × 10^6
15      > 213 × 10^12     > 7 × 10^12
20      > 8 × 10^21       > 2 × 10^20
50      > 2 × 10^76       > 2 × 10^74

Evolutionary Models

During an infinitesimal time Δt, the same nucleotide cannot undergo two substitutions. Hence, we can estimate P(x | y, Δt), the probability that nucleotide y is replaced by x in time Δt, for x, y ∈ {A, C, G, T}. The substitution model is expressed as follows:

$$S(\Delta t) = \begin{pmatrix} P(A \mid A, \Delta t) & \cdots & P(A \mid T, \Delta t) \\ \vdots & \ddots & \vdots \\ P(T \mid A, \Delta t) & \cdots & P(T \mid T, \Delta t) \end{pmatrix}$$

One can reasonably assume that evolution is a stationary Markov process. Hence, the matrix is multiplicative: the substitution matrix at time t + t' is

$$S(t + t') = S(t)\,S(t')$$

That implies

$$P(x \mid y, t + t') = \sum_{z} P(x \mid z, t)\,P(z \mid y, t')$$

The simplest among the evolutionary models is the Jukes-Cantor model of evolution.

Jukes-Cantor one-parameter model

Here the basic assumption is that the rate of evolution is constant.
The substitution rate of a nucleotide by a different nucleotide is α. The substitution model is shown in Figure 8.8. The substitution probability of, say, nucleotide A by G, C, or T is α each. Since the total probability is 1, the probability of A remaining A is 1 − 3α. The same holds for the other nucleotides.

Figure 8.8 Substitution model

Hence, for a short time ε,

$$S(\varepsilon) = \begin{pmatrix} 1-3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\ \alpha\varepsilon & 1-3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\ \alpha\varepsilon & \alpha\varepsilon & 1-3\alpha\varepsilon & \alpha\varepsilon \\ \alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & 1-3\alpha\varepsilon \end{pmatrix}$$

For a longer time t this reduces to

$$S(t) = \begin{pmatrix} r_t & s_t & s_t & s_t \\ s_t & r_t & s_t & s_t \\ s_t & s_t & r_t & s_t \\ s_t & s_t & s_t & r_t \end{pmatrix}$$

where

$$r_t = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right), \qquad s_t = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right)$$

Kimura two-parameter model

Another evolutionary model is the Kimura model. The Jukes-Cantor model does not take into account that the transition rates (between purines, A↔G, and between pyrimidines, C↔T) are different from the transversion rates (A↔C, A↔T, C↔G, and G↔T). The Kimura model considers the rate of transition to be α and the rate of transversion to be β. Here the substitution matrix, with rows and columns in the order A, C, G, T, is as follows:

$$\begin{pmatrix} 1-2\beta-\alpha & \beta & \alpha & \beta \\ \beta & 1-2\beta-\alpha & \beta & \alpha \\ \alpha & \beta & 1-2\beta-\alpha & \beta \\ \beta & \alpha & \beta & 1-2\beta-\alpha \end{pmatrix}$$

Thus,

$$S(t) = \begin{pmatrix} r_t & s_t & u_t & s_t \\ s_t & r_t & s_t & u_t \\ u_t & s_t & r_t & s_t \\ s_t & u_t & s_t & r_t \end{pmatrix}$$

where

$$s_t = \tfrac{1}{4}\left(1 - e^{-4\beta t}\right), \qquad u_t = \tfrac{1}{4}\left(1 + e^{-4\beta t} - 2e^{-2(\alpha+\beta)t}\right), \qquad r_t = 1 - 2s_t - u_t$$

Maximum Parsimony Method

Maximum parsimony (MP) is a simple but popular technique used in cladistics to infer a phylogenetic tree for a set of taxa (commonly a set of species or reproductively isolated populations of a single species) on the basis of observed data on the similarities and differences among the taxa. In other words, MP searches for the tree that requires the smallest number of evolutionary changes to explain the differences observed among the OTUs.

Input Data: The input data used in a maximum parsimony analysis is in the form of 'characters' for a range of taxa.
A character is a partitioning of the taxa into distinct character states with respect to some feature. A character could be binary, e.g., the presence or absence of a feature (such as a tail), or it could be multi-state, e.g., the protein or nucleic acid residue at a particular site in the organism's genome. Differences in the character state are explained by evolutionary changes. Among the three kinds of alignment sites, only the informative sites help to infer phylogeny (see Figure 8.9).

Figure 8.9 Alignment sites in parsimony
• Invariant sites are not used in parsimony (they yield no information on character state changes).
• Informative sites (at least two different kinds of residues, each present at least two times) are used by parsimony because they discriminate between topologies, i.e., different topologies require different numbers of changes between residues.
• Singleton sites cannot be used to discriminate between topologies (they require one change for all topologies).

Parsimony methods operate by selecting the tree that minimizes the number of evolutionary steps (transformations of one character state to another) required to explain a given set of data. In mathematical terms: from the set of possible trees, find all trees τ that minimize the total tree length

$$L(\tau) = \sum_{k=1}^{B} \sum_{j=1}^{N} w_j \, \mathrm{diff}\left(x_{k'j},\, x_{k''j}\right)$$

where L(τ) is the length of the tree, B is the number of branches, N is the number of characters, k' and k'' are the two nodes incident to each branch k, x_{k'j} and x_{k''j} represent either elements of the input data matrix or optimal character-state assignments made to internal nodes, and diff(y, z) is a function specifying the cost of a transformation from state y to state z along any branch. Note that diff(y, z) need not be equal to diff(z, y). The coefficient w_j assigns a weight to each character.
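The distinction between invariant, informative, and other variable sites can be illustrated with a short sketch. The four-sequence alignment below is invented for the purpose (the alignment printed in Figure 8.9 is not reproduced here), and the function names `classify_site` and `classify_alignment` are hypothetical. Every variable site that is not informative is lumped under one label, which includes the singleton sites described in the figure.

```python
from collections import Counter

# Hypothetical aligned sequences, one per taxon (invented for illustration).
alignment = ["AAGCT", "AAGTT", "AGCCA", "AGCTC"]


def classify_site(column):
    """Classify one alignment column for a parsimony analysis."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # Informative: at least two different residues, each present at least twice.
    repeated_states = sum(1 for n in counts.values() if n >= 2)
    if repeated_states >= 2:
        return "informative"
    # Variable, but unable to favour one topology over another.
    return "uninformative"


def classify_alignment(sequences):
    """Classify every column of an alignment of equal-length sequences."""
    return [classify_site(column) for column in zip(*sequences)]


for position, label in enumerate(classify_alignment(alignment), start=1):
    print(position, label)
```

Only the columns labelled informative can change the ranking of candidate topologies under equal-cost parsimony, which is why the other columns are usually skipped when tree lengths are compared.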
The trees used in maximum parsimony analysis are, in general, unrooted trees (there is no indication of time in the tree, only the relations between taxa). All the taxa used in the analysis are placed at the leaves (called tips or terminal taxa) of the tree. Internal nodes are inserted into the tree to represent the inferred ancestral species. Each internal node has at least three edges into it. The transitions between character states are associated with the edges of the tree.

Given here is an example showing how to calculate the length of a particular tree topology for a specific site of a sequence, using two weighting schemes: (a) Fitch and (b) transversion parsimony.

The weighting schemes Fitch ('f') and transversion parsimony ('tp') specify the cost of a change from one base to another. Under scheme (a), remaining in the same state has a cost of zero, whereas any change (e.g., G↔C or G↔A, in either direction) costs one. The costs differ in scheme (b): four for a purine↔pyrimidine change in either direction, and one for a purine↔purine or pyrimidine↔pyrimidine change.

Example: From the input data shown in Figure 8.10 we can calculate the length of a particular tree topology ((W, Y), (X, Z)) for a specific site of the sequences, i.e., the site marked by an arrow in the figure. For this specific site the weight matrices are as shown in Figure 8.11.

Figure 8.10 A set of test sequences (SeqW, SeqX, SeqY, SeqZ) to choose informative sites needed for maximum parsimony

In Figure 8.12, we calculate the length (i.e., number of steps) of a particular tree topology ((W, Y), (X, Z)). The deeply marked branches indicate substitution of bases.
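The length calculation for a four-taxon unrooted tree can be sketched by brute force: try every pair of states for the two internal nodes and sum the five branch costs under each weighting scheme. The site pattern used below (A and G for W and Y, C for both X and Z) is an assumption, and which of W and Y carries A does not affect the minimum lengths. The function names are likewise hypothetical.

```python
from itertools import product

BASES = "ACGT"
PURINES = {"A", "G"}


def fitch_cost(x, y):
    """Scheme (a): every change costs one step."""
    return 0 if x == y else 1


def tp_cost(x, y):
    """Scheme (b): purine<->pyrimidine costs 4, changes within a class cost 1."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return 1 if same_class else 4


def min_length(pair1, pair2, cost):
    """Minimum length at one site for the unrooted tree ((a, b), (c, d)).

    Both internal nodes are tried with every base; the five branches are
    the four leaf branches plus the internal branch joining the two nodes.
    """
    best = None
    for n1, n2 in product(BASES, repeat=2):
        steps = (cost(n1, pair1[0]) + cost(n1, pair1[1]) + cost(n1, n2)
                 + cost(n2, pair2[0]) + cost(n2, pair2[1]))
        if best is None or steps < best:
            best = steps
    return best


# Assumed states at the chosen site (not read from the figure).
site = {"W": "A", "Y": "G", "X": "C", "Z": "C"}

wy_xz = ((site["W"], site["Y"]), (site["X"], site["Z"]))
wx_yz = ((site["W"], site["X"]), (site["Y"], site["Z"]))

print(min_length(*wy_xz, fitch_cost), min_length(*wy_xz, tp_cost))  # 2 5
print(min_length(*wx_yz, fitch_cost), min_length(*wx_yz, tp_cost))  # 2 8
```

Under equal costs both topologies need two steps, so the site cannot separate them; under the transversion-weighted costs the ((W, Y), (X, Z)) topology is shorter. For larger trees, exhaustive enumeration of internal states is replaced by dynamic programming over the tree, which this sketch does not attempt.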
[Figure 8.11 The weight matrix for (a) Fitch and (b) Transversion Parsimony method (for the mentioned set of sequences)]

[Figure 8.12 (a) The generalized unrooted tree ((W, Y), (X, Z)) for the earlier mentioned example; (b) The possible trees generated by Fitch (f) and Transversion parsimony (tp)]

The three boxed trees are the best according to the 'f' scheme, where the minimum is two steps, achieved in three ways, the internal nodes being 'A-C', 'C-C' and 'G-C'. According to the 'tp' scheme (Figure 8.13), the minimum length is five steps, achieved by two reconstructions (internal nodes 'A-C' and 'G-C'). Therefore, under unequal costs, the character becomes informative. The use of unequal costs may provide more information for phylogenetic reconstruction.

Exercise: The alternative trees ((W, X), (Y, Z)) and ((W, Z), (Y, X)) also have two steps (according to the 'f' scheme) and a minimum of eight steps according to the 'tp' scheme. We leave it as an exercise to the readers to work out the minimum length of the alternative trees in the same way.

[Figure 8.13 The best trees according to the Transversion Parsimony method]

Advantages
• Based on a logically coherent and biologically plausible model of evolution.
• Free from assumptions used in distance estimations.
• Better than distance methods when the extent of sequence divergence is low (10%), the rate of substitution is constant, and the number of residues is large.
• Very useful for certain types of molecular data, e.g., insertions and deletions.
• Provides several ways to evaluate the support for the topologies produced, e.g., measures of homoplasy (independent derivation of a character state in two lineages).
• Ranks the list of trees based on length.

Disadvantages
• Gives incorrect topologies when backward substitutions are present (common with nucleotides) and when the number of sites is fairly small.
• Gives incorrect topologies when the rate of substitution varies substantially across the lineages.
• Long branch attraction: long branches (and short branches) tend to group together on the reconstructed tree.
• Difficult to treat the results in a statistical framework.

Maximum Likelihood Method
Maximum Likelihood (ML) methods create all the possible trees containing the set of organisms considered, and then use statistics to evaluate the most likely tree. For a small number of organisms, this is possible. For a large number of organisms, the task cannot be accomplished as the number of generated trees is very large. Therefore, heuristics (methods which produce an answer in a computable length of time, but for which the answer may not be optimal) are used to select a subset of all the possible trees to create. In this method, the bases (nucleotides or amino acids) of all the sequences at each site are considered separately (as independent), and the log-likelihood of having these bases is computed for a given topology by using a particular probability model.
This log-likelihood is added for all the sites, and the sum of the log-likelihood is maximized to estimate the branch lengths of the tree. This procedure is repeated for all the possible topologies, and the topology that shows the highest likelihood is chosen as the final tree. Likelihood methods for phylogenies were first introduced by Edwards and Cavalli-Sforza (1964) for gene frequency data. Neyman (1971) applied likelihood to molecular sequences and this work was extended by Kashyap and Subas (1974). Felsenstein (1973, 1981) brought the maximum likelihood framework to nucleotide-based phylogenetic inference.

A few main features of ML methods are the following:
• Statistical (probabilistic) method for inferring phylogenies:
  1. A substitution model is chosen for the sequence data (alignment).
  2. The likelihood of observing the sequence data given the substitution model is obtained for each topology evaluated (parameter fitting on branch lengths).
  3. The topology that gives the highest likelihood is chosen as the best tree.
• Extremely slow method, so heuristic methods almost always have to be employed to search for the best tree.
• This method is very dependent on the model of substitution used.
• This method estimates the branch lengths, not the topology, so it may give the wrong topology.

The ML method of inference is available for both nucleic acid and protein data. The following programs are freely available from the web:
• DNAML (only DNA data; in the PHYLIP package)
• FastDNAML (only DNA data; a faster algorithm applied to DNAML)
• ProtML (both DNA and protein data)
• Puzzle (both DNA and protein data). This program is much faster than ProtML.

Example: We take up a probabilistic model for nucleotide substitution (e.g., Jukes-Cantor). By this method we have to pick the tree which has the highest probability of generating the observed data. Let D represent the data, M the Jukes-Cantor one-parameter model and T the tree. Given D and M we have to find T such that Pr(D|T,M) is maximized.
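To make the quantity Pr(D|T,M) concrete, the following sketch evaluates it for a fixed tree under the Jukes-Cantor model, summing over the unknown internal states and adding the per-site log-likelihoods. It is a minimal illustration under stated assumptions, not the method of any program named above: the tree encoding (nested tuples of (child, branch length) pairs), the rate parameter `alpha` (rate of change to each specific other base), the branch lengths and the toy alignment are all hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(x, y, t, alpha=1.0):
    """Jukes-Cantor probability that base x is replaced by base y in time t."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial(node, site):
    """Conditional likelihoods {x: Pr(observed leaves below node | node state x)}."""
    if isinstance(node, str):                      # leaf: its base is observed
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:                          # independent lineages multiply
        below = partial(child, site)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result

def site_likelihood(tree, site):
    """Pr(D_i | T, M) for one site, with uniform base frequencies at the root."""
    root = partial(tree, site)
    return sum(0.25 * root[x] for x in BASES)

def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods (sites are assumed independent)."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    return sum(
        math.log(site_likelihood(tree, {n: alignment[n][i] for n in names}))
        for i in range(n_sites)
    )

# Two hypothetical rooted topologies with arbitrary branch lengths.
tree_wy = (((("W", 0.1), ("Y", 0.1)), 0.05), ((("X", 0.1), ("Z", 0.1)), 0.05))
tree_wx = (((("W", 0.1), ("X", 0.1)), 0.05), ((("Y", 0.1), ("Z", 0.1)), 0.05))

# Hypothetical data in which W resembles Y and X resembles Z.
toy = {"W": "AAGA", "Y": "AAGA", "X": "CCGA", "Z": "CCGA"}
print(log_likelihood(tree_wy, toy), log_likelihood(tree_wx, toy))
```

A full ML search would additionally optimize the branch lengths for each topology and compare the maximized values; here the branch lengths are simply fixed for illustration.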
From the model we get the values of $p_{ij}(t)$, the probability of substitution of the ith nucleotide by the jth one in time t. Here two assumptions are made:
a. Different sites evolve independently.
b. Diverged sequences (or species) evolve independently after diverging.

Now if $D_i$ is the data for the ith site,

$$\Pr(D \mid T, M) = \prod_i \Pr(D_i \mid T, M)$$

Calculation of $\Pr(D_i \mid T, M)$: The tree is given here. $p_{xy}(t)$ is the probability of change from x to y in time t.

$$\Pr(i, j, k, l \mid T, M) = \sum_x \sum_y \sum_z \Pr(x)\, p_{xl}(t_1 + t_2 + t_3)\, p_{xy}(t_1)\, p_{yk}(t_2 + t_3)\, p_{yz}(t_2)\, p_{zi}(t_3)\, p_{zj}(t_3)$$

With the tree topology and branch lengths given, one can efficiently calculate Pr(D|T,M).

Advantages
• Estimates the branch lengths of the final tree.
• Methods are usually consistent.
• It is extended to allow differences between the rate of transition and transversion.
• Evaluates different tree topologies.
• Uses all the sequence information.

Disadvantages
• ML is very CPU intensive and thus extremely slow.
• Needs long computation time to construct a tree.
• The result depends on the model of evolution used.

Distance Methods
Distance methods are a major family of phylogenetic methods that try to fit a tree to a matrix of pairwise distances. The distance method creates a matrix of all distances (difference scores) between all the organisms for which the tree is to be constructed. Having calculated the matrix, the pair of organisms which have the smallest distance score are connected with a root in between them. The average of the distances from each member of the pair to a third node is used for the next iteration of the distance matrix. The process is repeated until all the organisms have been placed in the tree. This always results in a rooted tree. The assumption that all the organisms mutate at the same rate is the basis of this method. The strategies involved in distance methods are the following:
1. Calculates the total number of changes, scored according to type, between every pair of sequences in the alignment.
2. Represents the minimum number of changes required to convert one sequence to another.
3. Results written to the distance matrix are used to generate trees in several possible ways; branch lengths visually represent the amount of change.
4. Removing the ambiguous sections will influence the branch length estimates.

Few Facts About Distance Methods
a. The best way to think about distance matrix methods is to consider the distances as estimates of the branch length separating that pair of species.
b. Branch lengths are not simply a function of time; they reflect the expected amounts of evolution in different branches of the tree.
c. Two branches may reflect the same elapsed time (sister taxa), but they can have different expected amounts of evolution.
d. The main distance-based tree-building methods are cluster analysis, least squares and minimum evolution. Cluster analysis methods include UPGMA and NJ. This phylogeny makes an estimation of the distance for each pair as the sum of branch lengths in the path from one sequence to another through the tree.
e. They rely on different assumptions, and their success or failure in retrieving the correct phylogenetic tree depends on how well any particular data set meets such assumptions.

Simplest Distance Measure
Consider every pair of sequences in the multiple alignment and count the number of differences.

Degree of divergence = Hamming distance (D) = n/N, where N = alignment length and n = number of sites with differences.

Example:
AGGCTTTTCA
AGCCTTCTCA
D = 2/10

Problem with distance measure: As the distance between the two sequences increases, the probability that more than one mutation has occurred at any one site increases.

time point    0    1    2
scenario 1    A → A → A
scenario 2    A → A → G
scenario 3    A → G → C
scenario 4    A → G → A

Therefore, methods have been developed to compensate for this. Corrected distances are calculated by the following methods (discussed earlier):
1. Jukes-Cantor: $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$
2. Kimura two-parameter model: Here the rate of transitions is different from the rate of transversions (see Figure 8.14).
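The distance measures in this section can be sketched as follows, using the two example sequences from the text. This is an illustrative sketch: the function names are hypothetical, the inputs are assumed to be equal-length, gap-free, upper-case DNA strings, and the Kimura two-parameter expression used is the standard textbook form with P and Q the observed proportions of transitions and transversions, since the chapter's own formula is not legible in this copy.

```python
import math

PURINES = {"A", "G"}

def hamming_fraction(s1, s2):
    """Uncorrected distance D = n / N (fraction of sites that differ)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def jukes_cantor(s1, s2):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4D/3)."""
    d = hamming_fraction(s1, s2)
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)

def kimura_2p(s1, s2):
    """Kimura two-parameter distance:
    d = -1/2 * ln(1 - 2P - Q) - 1/4 * ln(1 - 2Q)."""
    n_sites = len(s1)
    pairs = [(a, b) for a, b in zip(s1, s2) if a != b]
    # A transition stays within purines or within pyrimidines.
    transitions = sum((a in PURINES) == (b in PURINES) for a, b in pairs)
    transversions = len(pairs) - transitions
    p = transitions / n_sites
    q = transversions / n_sites
    return -0.5 * math.log(1.0 - 2.0 * p - q) - 0.25 * math.log(1.0 - 2.0 * q)

seq1 = "AGGCTTTTCA"
seq2 = "AGCCTTCTCA"
d_raw = hamming_fraction(seq1, seq2)   # 2 differences out of 10 sites
d_jc = jukes_cantor(seq1, seq2)
d_k2p = kimura_2p(seq1, seq2)
print(d_raw, d_jc, d_k2p)
```

Both corrected values come out slightly larger than the raw fraction, which reflects the compensation for multiple substitutions at the same site described above.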
