
Sequence Analysis - Alignment

Here are the BLAST results for the given sequences: KR093978 - BLASTn search shows this sequence has 100% identity to Bacillus cereus strain BGSC 6E1 chromosome, complete genome. KR093979 - BLASTn search shows this sequence has 100% identity to Bacillus cereus strain BGSC 6E1 chromosome, complete genome. KR093980 - BLASTn search shows this sequence has 100% identity to Bacillus cereus strain BGSC 6E1 chromosome, complete genome. KR093981 - BLASTn search shows this sequence has 100% identity to Bacillus cereus strain BGSC 6E1 chromosome, complete genome. KR093982

Uploaded by

filson.riyadi
Copyright
© All Rights Reserved

Chapter - 3

Biological Sequence Analysis & Alignment


3.1. Sequence comparison, similarity alignment
3.2. Database Similarity Searching: Database Searching
Tools & Formats (BLAST, FASTA, etc)
3.3. Sequence alignment methods (local & global) and
algorithms
3.4. Sequence alignment techniques: Pairwise & Multiple
Sequence Alignment
3.5. Tools for Sequence alignment: ClustalW, T-Coffee, etc
3.6. Alignment Interpretation and Scoring methods
Biological Sequence Analysis
• Biological Sequence
 is a single, continuous molecule of nucleic acid or protein.
 the methodologies implemented under sequence analysis
include:
- sequence alignment (pairwise & multiple sequence
alignment),
- phylogenetic analysis,
- motif & domain search/prediction,
- identification of novel genes as candidate drug targets.
• Sequence alignment
 is a way of arranging DNA, RNA or amino acid (AA)
sequences to identify regions of similarity that may be a
consequence of functional, structural or evolutionary
relationships.
 the goal of alignment is to find the conserved regions (if
present) between two or more sequences;
 these conserved regions are presumed to be functionally
important regions (domains or motifs) of the sequences.
The dilemma: DNA or protein?
Search by similarity: using the nucleotide sequence or using
the amino acid sequence?

By translating into an amino acid sequence, are we losing
information? Yes! The genetic code is degenerate (two
or more codons can represent the same amino acid).
 Very different DNA sequences may code for similar
protein sequences → we certainly do not want to miss those
cases!
• Conclusion:
It is almost always better to compare coding sequences in
their amino acid form, especially if they are very divergent.
Only for very highly similar sequences may a nucleotide
comparison give better results.
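The degeneracy argument above can be checked with a small sketch. It uses the standard genetic code (NCBI translation table 1); the two DNA sequences and the helper names `translate` and `percent_identity` are made up for illustration.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence, read in frame 1, into amino acids."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percent of matching positions in two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical coding sequences that differ at most codon positions.
dna_1 = "ATGCTTAGCCGT"
dna_2 = "ATGTTATCAAGA"

print(percent_identity(dna_1, dna_2))                        # nucleotide level
print(percent_identity(translate(dna_1), translate(dna_2)))  # protein level
```

Both sequences translate to the same peptide, so the protein comparison reports full identity while the nucleotide comparison reports less than half.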
Biological Sequence Similarity
• it tells us about:
1. Homologous genes (homologs)
- are genes that derive from a common ancestral gene.
- homology is an evolutionary relationship that either
exists or does not.
- high similarity is evidence for homology. Similar
sequences may be orthologs or paralogs.
2. Orthologous genes
- are homologous genes in different organisms that arose
by speciation; they usually share function.
3. Paralogous genes
- are homologous genes in one organism that derive from
gene duplication → often have divergent function.
Orthologs and paralogs are often viewed in a single tree
Homologous and Paralogous
Causes for sequence (dis)similarity
• Mutation (substitution): a nucleotide at a certain location
is replaced by another nucleotide (e.g.: ATA → AGA)
 Transition mutations: change from a purine to a purine or
a pyrimidine to a pyrimidine. E.g.: A to G; G to A; C
to T; T to C
 Transversion mutations: change from a purine to a
pyrimidine or vice versa.
 Synonymous & non-synonymous mutations
• Insertion: at a certain location one new nucleotide is
inserted in between two existing nucleotides
(e.g.: AA → AGA)
• Deletion: at a certain location one existing nucleotide
is deleted (e.g.: ACTG → AC-G)
 Indel: an insertion or a deletion
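The transition/transversion rule above is easy to express in code. The following is a minimal sketch for DNA only; the function names are illustrative and not taken from any library.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-nucleotide change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if ref == alt:
        return "no change"
    # Both purines or both pyrimidines: the chemical class is preserved.
    if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_substitutions(seq_a, seq_b):
    """Count transitions and transversions between two ungapped sequences."""
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(seq_a, seq_b):
        kind = classify_substitution(x, y)
        if kind in counts:
            counts[kind] += 1
    return counts


# The slide's example ATA -> AGA replaces T (pyrimidine) with G (purine).
print(count_substitutions("ATA", "AGA"))
```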
Classification of sequence alignment algorithms
 two main classes of sequence alignment methods:
- global alignments and local alignments.
 in global alignments the entire sequences are aligned, in
contrast to local alignments, where only portions of the
sequences are aligned.
 Global alignments are useful for aligning closely related
sequences, whereas local alignments are more suitable
when comparing distantly related sequences.
 Pairwise & multiple alignments are the basic tools to
compare sequences.
 An alignment is called global when closely related
sequences of similar length are aligned together over
their full length;
 the alignment is carried out from the start to the end of
the sequences while searching for the best possible
alignment.
→ Needleman-Wunsch algorithm
 Local alignment is mainly used for sequences which
differ in length or share only a region of similarity.
→ this method finds local matches within the sequence
stretch instead of looking at the entire sequence.
→ Smith-Waterman algorithm
→ BLAST (Basic Local Alignment Search Tool) is the most
commonly used tool for local sequence alignment &
similarity search.
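As a rough illustration of local alignment, here is a minimal Smith-Waterman sketch. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for the example; real tools use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: every cell is the best score of a local alignment ending there;
    # the floor of 0 lets an alignment restart anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until the score drops to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Only the shared core is reported; the unrelated flanks are ignored.
print(smith_waterman("CCCCGATTACACCCC", "TTGATTACATT"))
```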
 gaps are used to show that a residue (amino acid or
nucleotide) is without a match in the other sequence; in an
evolutionary context the gaps represent insertions or
deletions.
 once an alignment is constructed, the identity & similarity
can be quantified.
• Identity is the number (usually given as a percentage) of
aligned positions at which the residues match exactly.
• Similarity extends this comparison: it also counts aligned
positions where the residues differ but are of a similar
type, and it is affected by how gaps are treated.
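A short sketch of how identity and similarity could be computed from an already aligned pair. The residue groups below are a simplified assumption made for this example (real programs derive similarity from a substitution matrix such as BLOSUM62), and using the alignment length as the denominator is one common convention among several.

```python
# Illustrative grouping of amino acids by side-chain chemistry (assumption).
SIMILAR_GROUPS = ["AVLIM", "FYW", "ST", "NQ", "DE", "KRH"]


def identity_and_similarity(aligned_a, aligned_b):
    """Return percent identity, similarity and gaps over the alignment length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = gaps = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            identical += 1
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    length = len(aligned_a)
    return {
        "identity": 100 * identical / length,
        # Similarity counts identical plus similar positions.
        "similarity": 100 * (identical + similar) / length,
        "gaps": 100 * gaps / length,
    }


# Hypothetical aligned peptides: 4 identical, 2 similar, 1 gap column.
print(identity_and_similarity("MKV-LDE", "MRVALEE"))
```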
• Figure: global alignment (top, 15% identity) includes
matches ignored by local alignment (bottom, 30% identity)
Sequence Similarity & Scoring Methods
1. Dot-Matrix Method
 is done by putting one sequence along the y-axis on the
left side & the other sequence along the x-axis on top.
 this method generates a simple matrix in which each cell
is a measure of the similarity of the two residues at that
row & column (typically a dot where they match); regions
of similarity show up as diagonal runs of dots.
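The dot-matrix idea can be sketched in a few lines. This version marks exact matches only, with no window or threshold filtering; the function names and example sequences are illustrative.

```python
def dot_matrix(seq_x, seq_y):
    """Rows follow seq_y (left side), columns follow seq_x (top)."""
    return [[res_x == res_y for res_x in seq_x] for res_y in seq_y]


def render(seq_x, seq_y):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    lines = ["  " + " ".join(seq_x)]
    for res_y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(res_y + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# The main diagonal is broken at the single mismatch (T versus C).
print(render("GATTACA", "GATCACA"))
```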
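2. Dynamic Programming
 this general method is also used in computer science,
mathematics, management science & economics.
 for alignment, it builds the optimal alignment of two
sequences from the optimal alignments of their prefixes
(e.g. Needleman-Wunsch, Smith-Waterman).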
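To make the dynamic programming idea concrete, here is a minimal Needleman-Wunsch sketch for global alignment. The scores (match +1, mismatch -1, linear gap -1) are assumptions for the example, and when several alignments tie, the one returned depends on the order of checks in the traceback.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap

    # Each cell reuses the best scores of the three smaller subproblems.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Traceback from the bottom-right corner to the origin.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(top)
print(bottom)
```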
Multiple Sequence Alignment
• EBI ClustalW Server
Preparing Multiple Sequence
 “*” indicates that the residues or nucleotides in that
column are identical in all sequences in the alignment.
 “:” indicates that conserved substitutions have been observed.
 “.” indicates that semi-conserved substitutions are observed.
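The "*" row of such a consensus line can be reproduced with a small sketch. It covers only fully identical columns; the ":" and "." symbols depend on ClustalW's residue groups, which are not reproduced here. The three aligned sequences are invented for the example.

```python
def identity_line(aligned_seqs):
    """Return '*' under every column that is identical in all sequences."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    marks = []
    for column in zip(*aligned_seqs):
        residues = set(column)
        # A column with a gap is never marked as conserved.
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


alignment = ["MKTAYIAK", "MKTA-IAR", "MKSAYIAK"]
for seq in alignment:
    print(seq)
print(identity_line(alignment))
```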
MAFFT (Multiple Alignment using Fast Fourier Transform)

MUSCLE (MUltiple Sequence Comparison by Log-Expectation)

T-Coffee (Tree-based Consistency Objective Function For alignment Evaluation)

Exercise - 1
1. Pairwise alignment – online + CLC Genomics
Workbench
2. Multiple alignment – online + CLC Genomics
Workbench
3. Local alignment – online
4. Global alignment – online
Database searching tips
 use the latest database version.
 use BLAST (Basic Local Alignment Search Tool) first.
 search both strands when using FASTA.
 translate sequences where relevant.
 E < 0.05 is statistically significant and usually
biologically interesting.
 if the query has repeated segments, delete (or mask) them
& repeat the search.
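The "search both strands" tip amounts to also testing the reverse complement of the query. A minimal sketch using exact matching only (real search tools allow mismatches and gaps); the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_on_both_strands(query, subject):
    """List (strand, start) for every exact occurrence of the query."""
    hits = []
    for strand, seq in (("+", query), ("-", reverse_complement(query))):
        start = subject.find(seq)
        while start != -1:
            hits.append((strand, start))
            start = subject.find(seq, start + 1)
    return hits


# The query is absent from the forward strand but present on the reverse one.
print(find_on_both_strands("ATGC", "AAGCATTT"))
```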
 BLAST is the most used algorithm in bioinformatics; "to
blast" has even become a verb.
 BLAST allows rapid comparison of a query sequence
against a database.
 the BLAST algorithm is fast, accurate, & accessible both
via the web & the command line.
 it is popular because it offers a good balance of
sensitivity & speed and is reliable & flexible.
BLAST
 the BLAST tool is fast & can be used to analyse thousands
of sequences & even to compare two genomes.
 BLAST is freely available to everyone.
 BLAST is straightforward to use and produces very
informative output.
 BLAST is a word-search heuristic method: it first looks
for short matching words and extends only from those,
which eliminates irrelevant sequences early & saves search
time.
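The word-search step can be sketched as a k-mer lookup. This is a simplified illustration of the seeding idea only, assuming exact word matches with word size `k`; it omits the neighbourhood words, hit extension and statistics that the actual BLAST programs use.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every word of length k to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def find_seeds(query, subject, k=4):
    """Return (query_pos, subject_pos) for every shared word of length k."""
    index = index_words(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, s_pos))
    return seeds


# A subject without any shared word yields no seeds and can be skipped
# without running a full alignment.
print(find_seeds("GATTACA", "CCGATTCC"))
print(find_seeds("GATTACA", "CCCCCCCC"))
```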
 BLAST has several subprograms:
 BLASTn - aligns a nucleotide query sequence against a
nucleotide database.
 BLASTp - aligns a protein query sequence against a
protein database.
 BLASTx - aligns a nucleotide query against a protein
database by comparing the six-frame conceptual
translation of the query.
 tBLASTx - aligns the six-frame translation of a
nucleotide query against the six-frame translation of a
nucleotide database.
 tBLASTn - aligns a protein query sequence against a
nucleotide database translated in all six frames.
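The six-frame conceptual translation used by the translated programs can be sketched as follows. It uses the standard genetic code (NCBI translation table 1); the frame labels "+1" to "-3", the function names and the use of "X" for unrecognised codons are conventions chosen for this example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three frames on each strand of a DNA sequence."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translation("ATGGCC"))
```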
Screenshots of the BLAST web interface (blastn & blastp):
BLASTn: search set, program selection, result, graphic
summary, description, alignment & tree view
BLASTp (PDB): search set, graphic summary, description &
tree view
A practical example of sequence alignment
http://www.ncbi.nlm.nih.gov

BLAST results
 E = 0.0 means the E value is too small to be displayed
(≤ 10^-1000).
 E value: the expectation value, i.e. the number of hits
with a score at least this good that you would expect to
find by chance in a database of this size. The lower the E,
the more significant the hit.
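For orientation, the relation between a bit score and the E value can be written as E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below applies this relation with plain lengths; BLAST itself uses corrected "effective" lengths, so its reported values differ somewhat. The numbers are invented for illustration.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2^(-S'): chance hits expected at this bit score or better."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one such chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search: a 300-residue query against 10^9 database residues.
for bits in (30, 50, 100):
    e = expect_value(bits, 300, 1_000_000_000)
    print(bits, e, p_value(e))
```

For small E the two quantities are nearly equal, which is why the E value is often loosely read as a probability.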
Exercise - 2
• BLAST the following sequences (accession numbers):
• KR093978
• KR093979
• KR093980
• KR093981
• KR093982
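The web BLAST form accepts accession numbers directly. If the sequences themselves are wanted first, one possible route (an assumption, not part of the exercise) is the NCBI E-utilities `efetch` endpoint. The sketch below only builds the request URL; `fetch_fasta` needs network access and is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
ACCESSIONS = ["KR093978", "KR093979", "KR093980", "KR093981", "KR093982"]


def efetch_url(accessions):
    """Build an efetch URL returning the records as FASTA text."""
    params = {
        "db": "nuccore",               # nucleotide database
        "id": ",".join(accessions),    # comma-separated accession list
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accessions):
    """Download the FASTA records (requires an internet connection)."""
    with urlopen(efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


print(efetch_url(ACCESSIONS))
```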
