Bioinformatics I

This document provides an overview of pairwise sequence alignment. It defines key terms like homology, orthologs, and paralogs. It also describes different types of alignment like global and local, and scoring matrices used like BLOSUM and PAM. Methods for assessing the statistical significance of alignments are discussed, such as relative entropy. Pairwise alignment is a fundamental bioinformatics tool for determining the relationship between sequences and hypothesizing homology. The biological significance of any alignment must still be assessed.
Bioinformatics I

Marineil C. Gomez
School of Chemical, Biological and Materials Engineering and Sciences
Mapua University
Pairwise Sequence Alignment
Outline
⚫ Definition of homology, orthologs, and paralogs
⚫ Global and Local Alignment
⚫ Scoring Matrices
⚫ Nucleotide Models
⚫ Protein Models
Why do Alignment?
⚫ Is the gene/protein related to any other gene/protein?
⚫ Relatedness:
⚫ Sequence level = homologous
⚫ Common functions

⚫ DNA vs protein alignment
⚫ 4 nucleotides vs 20 amino acids
⚫ “Wobble” at the third codon position
⚫ Amino acids with similar properties
Definitions
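⚫ Homology
− In general usage: the state of having the same or similar relation, relative position, or structure.
− In sequence analysis: descent from a common ancestor. It is a qualitative statement: two sequences either are homologous or are not.
⚫ Similarity and Identity
− Quantitative descriptions of the relatedness of sequences.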
Definitions
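⚫ Orthologs – homologous sequences in different species that arose from a single ancestral gene through speciation; they usually retain the same or a similar function
⚫ e.g., human myoglobin and rat myoglobin
⚫ Paralogs – homologous sequences within a species that arose through gene duplication; they may have distinct but related functions
⚫ e.g., alpha 1 and alpha 2 globin; myoglobin and hemoglobin
⚫ Similar – (amino acids) different residues that have similar biochemical properties
⚫ Identical – the same amino acid residue or nucleotide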
Homology in protein structure

These proteins are homologous
(descended from a common
ancestor) and share very similar
three-dimensional structures.
However, pairwise alignment of the
amino acid sequences of these
proteins reveals that the proteins
share very limited amino acid
identity.
Paralogous human globins
Relatedness of genes and proteins

Can be assessed by pairwise
alignment
Pairwise alignment of human beta globin (the “query”) and myoglobin (the
“subject”).
Pairwise Alignment
• Query and subject: the two sequences being compared
• Identical residues: exactly the same amino acid or nucleotide
• Similar amino acids: different identity but similar
biochemistry due to conservative substitutions
• Gaps: insertion or deletion mutations
Pairwise Alignment
The percent similarity of two protein sequences is the sum
of both identical and similar matches.
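
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the residue groups in `SIMILAR_GROUPS` are a hypothetical, simplified grouping (alignment programs typically call a pair "similar" when its substitution-matrix score is positive), and percentages are taken over the full alignment length, gap columns included.

```python
# Hypothetical grouping of amino acids with broadly similar biochemistry.
# Assumption for illustration only; real tools usually treat a pair as
# "similar" when its substitution-matrix score is positive.
SIMILAR_GROUPS = [
    set("ILVM"),   # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("KRH"),    # basic
    set("DE"),     # acidic
    set("STNQ"),   # polar, uncharged
]


def is_similar(a: str, b: str) -> bool:
    """True if a and b are different residues from the same group."""
    return a != b and any(a in g and b in g for g in SIMILAR_GROUPS)


def percent_identity_similarity(query: str, subject: str, gap: str = "-"):
    """Return (% identity, % similarity) over all alignment columns."""
    if len(query) != len(subject) or not query:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    identical = similar = 0
    for a, b in zip(query, subject):
        if gap in (a, b):
            continue  # gap columns count toward the length only
        if a == b:
            identical += 1
        elif is_similar(a, b):
            similar += 1
    n = len(query)
    # Percent similarity is the sum of identical and similar matches.
    return 100.0 * identical / n, 100.0 * (identical + similar) / n


identity, similarity = percent_identity_similarity("KVLE-A", "RVIEGA")
print(f"identity = {identity:.1f}%, similarity = {similarity:.1f}%")
```

Neither number says anything about homology by itself; both only quantify how alike the two aligned sequences are.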

֎ The purpose of a pairwise alignment is to assess the
degree of similarity and the possibility of homology
between two molecules.

֎ It is never correct to say that two proteins share a
certain percent homology, because they are either
homologous or not.

֎ It is not appropriate to describe two sequences as
“highly homologous”; instead, it can be said that they
share a high degree of similarity.
Dynamic Programming Approach to Pairwise Alignment
Pairwise Alignment with Dotplots

A dotplot is a graphical method for comparing two sequences.
One protein or nucleic acid sequence is placed
along the x axis and the other is placed along the y
axis.

Positions of identity are scored with a dot. A region
of identity between two sequences results in the
formation of a diagonal line.
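
The dotplot construction described above can be sketched as a small Python program. It is a minimal illustration, not a production tool: the function names `dotplot` and `render` are made up for this example, and no window or threshold filtering is applied, so every single identical position gets a dot.

```python
def dotplot(seq_x: str, seq_y: str) -> list[list[bool]]:
    """Row i, column j is True when seq_y[i] is identical to seq_x[j]."""
    return [[y == x for x in seq_x] for y in seq_y]


def render(seq_x: str, seq_y: str) -> str:
    """Draw the dotplot as text: '*' marks identity, '.' marks no match."""
    lines = ["  " + seq_x]  # seq_x runs along the x axis (top row)
    for y, row in zip(seq_y, dotplot(seq_x, seq_y)):
        lines.append(y + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Comparing a sequence with itself puts a run of dots on the main diagonal.
print(render("GATTACA", "GATTACA"))
```

Dots away from the main diagonal in a self-comparison come from residues that recur within the sequence, which is why repeats show up as additional diagonals.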
Dotplot showing multiple repeats
Dotplot with Inversions
Polymorphisms and mutations in
dotplots
When two sequences are
aligned, what scores should
they be assigned?
Scoring Matrices (Proteins)
⚫ Dayhoff Model – Substitution Model
⚫ Margaret Dayhoff (1978)
− provides the basis of a quantitative scoring system for
pairwise alignments between any proteins, whether they are
closely or distantly related
⚫ BLOSUM
⚫ Steven Henikoff and Jorja G. Henikoff
⚫ JTT
Nucleotide Substitution Models
Illustration of multiple substitutions at the
same site or multiple hits. An ancestral
sequence diverged into two sequences
and has since accumulated nucleotide
substitutions independently along the
two lineages.
Only two differences are observed
between the two present-day
sequences, so that the proportion of
different sites is p̂ = 2/8 = 0.25, while in
fact as many as 10 substitutions (seven
on the left lineage and three on the right
lineage) occurred so that the true
distance is 10/8 = 1.25 substitutions per
site.
Markov Models of Substitution
Relative substitution rates between
nucleotides under three Markov-chain
models of nucleotide substitution: JC69
(Jukes and Cantor 1969), K80 (Kimura
1980), and HKY85 (Hasegawa et al.
1985).

The thickness of the lines represents the
substitution rates, while the sizes of the
circles represent the steady-state
distribution.
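
Substitution models like these are what allow the observed proportion of different sites to be corrected for multiple hits. As a sketch, the standard distance formulas for JC69 and K80 are shown below in Python; the function and variable names are chosen for this example, `p` is the observed proportion of different sites, and `s` and `v` are the observed proportions of transitional and transversional differences.

```python
import math


def jc69_distance(p: float) -> float:
    """JC69 distance (substitutions per site) from the proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k80_distance(s: float, v: float) -> float:
    """K80 distance from the transition (s) and transversion (v) proportions."""
    a = 1.0 - 2.0 * s - v
    b = 1.0 - 2.0 * v
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences are too divergent for the K80 formula")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# The multiple-hits example: 2 observed differences in 8 sites.
p_hat = 2 / 8
print(round(jc69_distance(p_hat), 3))
```

For the eight-site example the corrected estimate is larger than the raw proportion of 0.25, yet it remains well below the 1.25 substitutions per site that actually occurred in that illustration; with so few sites, any such estimate is very noisy.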
Amino Acid Substitution Models
⚫ Empirical Models
⚫ Attempt to describe the relative rates of substitution between amino
acids without considering explicitly factors that influence the evolutionary
process. They are often constructed by analysing large quantities of
sequence data, as compiled from databases
⚫ PAM (Dayhoff), JTT
⚫ Mechanistic Models
⚫ Consider the biological process involved in amino acid substitution, such
as mutational biases in the DNA, translation of the codons into amino
acids, and acceptance or rejection of the resulting amino acid after
filtering by natural selection.
⚫ Mechanistic models have more interpretative power and are particularly
useful for studying the forces and mechanisms of gene sequence
evolution
PAM Substitution Matrix
⚫ The first empirical amino acid
substitution matrix constructed by
Dayhoff and colleagues (1978).
⚫ compiled and analysed protein
sequences available at the time,
using a parsimony argument to
reconstruct ancestral protein
sequences and tabulating amino
acid changes along branches on
the phylogeny.
⚫ To reduce the impact of multiple
hits, the authors used only similar
sequences that were different from
one another at < 15% of sites.
Inferred changes were merged
across all branches without regard
for their different lengths.
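
Because the matrix was built from closely related sequences, matrices for larger evolutionary distances are obtained by extrapolation: PAM250, for example, is the PAM1 transition-probability matrix multiplied by itself 250 times. The sketch below shows that step on a hypothetical three-state matrix; `TOY_PAM1` and its values are invented for illustration and are not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Return m raised to the n-th power (n >= 1) by repeated squaring."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = None
    base = m
    while n > 0:
        if n & 1:
            result = base if result is None else mat_mul(result, base)
        n >>= 1
        if n:
            base = mat_mul(base, base)
    return result


# Hypothetical one-step transition probabilities over three residue types.
# Each row sums to 1; the diagonal is the probability of no change.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

The diagonal entries shrink as the matrix is raised to higher powers, reflecting the growing chance that a residue has been replaced over a longer evolutionary distance.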
JTT Matrix
The Dayhoff matrix was
updated by Jones et al.
(1992), who analysed a
much larger collection of
protein sequences using
the same approach as
Dayhoff et al. (1978).
BLOSUM (BLOcks SUbstitution Matrix)
⚫ introduced by Steven Henikoff and Jorja Henikoff
⚫ used to score alignments between evolutionarily divergent protein sequences.
⚫ based on local alignments
⚫ They scanned the BLOCKS database (>500 groups of local multiple
alignments of distantly related protein sequences) for very conserved regions
of protein families (that do not have gaps in the sequence alignment) and then
counted the relative frequencies of amino acids and their substitution
probabilities.
⚫ Then, they calculated a log-odds score for each of the 190 possible
substitution pairs of the 20 standard amino acids.
⚫ All BLOSUM matrices are based on observed alignments; they are not
extrapolated from comparisons of closely related proteins like the PAM
Matrices.
⚫ BLOSUM62 is the default matrix used in protein BLAST searches
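
The log-odds calculation mentioned above can be sketched as follows. The frequencies are hypothetical values chosen for easy arithmetic, not counts from the BLOCKS database; `q_ij` is taken here to be the observed (target) frequency of the aligned pair and `p_i`, `p_j` the background frequencies of the two residues, and the scale factor of 2 expresses scores in half-bit units.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, scale: float = 2.0) -> int:
    """Rounded log-odds score: scale * log2(observed / expected by chance)."""
    return round(scale * math.log2(q_ij / (p_i * p_j)))


# Number of distinct substitution pairs among 20 amino acids.
print(math.comb(20, 2))

# Hypothetical frequencies: background 0.1 for both residues, so a pair is
# expected by chance at 0.1 * 0.1 = 0.01.
print(log_odds(0.02, 0.1, 0.1))   # observed twice as often as expected
print(log_odds(0.01, 0.1, 0.1))   # observed exactly as often as expected
print(log_odds(0.005, 0.1, 0.1))  # observed half as often as expected
```

Positive scores mark pairs seen more often than chance predicts, zero marks chance level, and negative scores mark pairs that are under-represented in the conserved blocks.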
How can we decide
whether the alignment
of two sequences is
statistically significant?
For Global Alignments

Q: Does the score occur by chance?

Method 1: Compare with a set of non-homologous sequences
Method 2: Compare with randomly generated sequences
Method 3: Randomly scramble one of the sequences being compared

Hypothesis testing with z-test

Cons: biological data do not follow a Gaussian distribution
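
Method 3 combined with a z-score can be sketched as below. This is a toy illustration under several assumptions: the global score uses a simple Needleman-Wunsch recurrence with made-up parameters (match +1, mismatch -1, linear gap -2) rather than a real substitution matrix, and the function names, the number of shuffles, and the seed are arbitrary choices for the example.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_z_score(a, b, n_shuffles=200, seed=0):
    """Z-score of the real score against scores for scrambled copies of b."""
    rng = random.Random(seed)
    real = global_score(a, b)
    null_scores = []
    for _ in range(n_shuffles):
        letters = list(b)
        rng.shuffle(letters)  # keeps composition and length, destroys order
        null_scores.append(global_score(a, "".join(letters)))
    sd = statistics.stdev(null_scores)
    if sd == 0:
        raise ValueError("shuffled scores do not vary; z-score is undefined")
    return (real - statistics.mean(null_scores)) / sd


print(shuffle_z_score("GATTACAGATTACA", "GATTACAGATTACA"))
```

The z-score expresses how many standard deviations the real score lies above the mean of the scrambled scores; as noted above, interpreting it with Gaussian tail probabilities is questionable because such scores are not normally distributed.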

For Local Alignments
By Percent Identity:
Cons: where does the threshold lie?
26% vs 30%;
40% vs 60%
Differences in the length of proteins
20 aa vs 150 aa
For Local Alignments
By Relative Entropy:
• The relative entropy (H) of the target and background distributions
measures the information that is available per aligned amino acid
position that, on average, distinguishes a true alignment from a chance
alignment
• For each substitution matrix with its unique target frequencies q_ij and
background frequencies p_i p_j, it is possible to derive the relative entropy
H as follows:

$$H = \sum_{i,j} q_{ij}\, s_{ij} = \sum_{i,j} q_{ij} \log_2 \frac{q_{ij}}{p_i p_j}$$

• where s_ij is the log-odds score for aligning residues i and j, and H
corresponds to the information content of the target and
background distributions associated with a particular scoring matrix
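
As a worked illustration of relative entropy, the sketch below computes H in bits for a hypothetical two-letter alphabet. The target frequencies `q_related` and `q_chance` and the background `p` are invented for the example, and the sum runs over ordered residue pairs so that each target distribution sums to 1.

```python
import math


def relative_entropy(q, p):
    """H = sum over (i, j) of q[i][j] * log2(q[i][j] / (p[i] * p[j])), in bits."""
    h = 0.0
    for i, row in q.items():
        for j, q_ij in row.items():
            if q_ij > 0.0:  # a zero-frequency pair contributes nothing
                h += q_ij * math.log2(q_ij / (p[i] * p[j]))
    return h


# Hypothetical background and target frequencies for letters "A" and "B".
p = {"A": 0.5, "B": 0.5}
q_related = {"A": {"A": 0.4, "B": 0.1}, "B": {"A": 0.1, "B": 0.4}}
q_chance = {"A": {"A": 0.25, "B": 0.25}, "B": {"A": 0.25, "B": 0.25}}

print(round(relative_entropy(q_related, p), 3))
print(relative_entropy(q_chance, p))
```

When the target frequencies equal the chance expectation, H is zero and an alignment position carries no information; the more the target frequencies favour identities, the higher H becomes and the shorter an alignment needs to be to stand out from chance.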
Relative Entropy and PAM distance
Perspectives and Pitfalls
The pairwise alignment of DNA or protein
sequences is one of the most fundamental
operations of bioinformatics.
Pairwise alignment allows the relationship
between any two sequences to be assessed,
and the degree of relatedness that is observed
helps in forming a hypothesis about whether
they are homologous.
Any two sequences can be aligned, even if they
are unrelated. It is always important to assess the
biological significance of a sequence alignment.
End of Lecture II
