BLAST and Sequence Alignment

UKZN RDNA202 (Lectures 21-22)

Lecture 21 & 22:
BLAST & Sequence Alignment

RDNA202

Cassie
Biotechnology and Genomics

•Lecture 20: Introduction to Bioinformatics

•Lecture 21 & 22: BLAST and Sequence alignment


BLAST
• Basic Local Alignment Search Tool
• One of the most commonly used tools for:
• Comparing sequence information
• Retrieving sequences from databases
BLAST - Types
• Used to analyse:

• DNA via nucleotide comparisons

• Proteins via amino acid comparisons

• DNA and proteins via translation comparisons


Blastp
• p for protein
• blastp – compares a protein query against a protein sequence database
• So protein compared with protein
tBlastn
• t for translated
• n for nucleotide
• tblastn – compares a protein query against all six reading frames of a translated nucleotide sequence database
• So protein compared with translated nucleotide
Blastn
• n for nucleotide
• blastn – compares a nucleotide query against a nucleotide sequence database
• So nucleotide compared with nucleotide


Blastx
• blastx – compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
• So translated nucleotide compared with protein
tBlastx
• t for translated
• tblastx – compares a nucleotide query sequence against a translated nucleotide sequence database
• Translates both to 6-frame amino acid sequences and compares them at the amino acid level
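The six-frame translation used by blastx, tblastn and tblastx can be sketched in a few lines of Python. This is only an illustration, not how BLAST itself is implemented: it assumes the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frame_translation`) are made up for this example.

```python
# Standard genetic code, laid out in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; stops become '*', unknown codons 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to conceptual protein sequences."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence, not taken from any database.
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

For the toy sequence ATGGCCTAA, frame +1 reads ATG GCC TAA and gives `MA*` (methionine, alanine, stop). The three minus frames come from the reverse complement, which is why both strands are covered.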
What does it look like?
BLAST algorithm
BLAST algorithm
• BLAST is a heuristic program, meaning it relies on smart shortcuts to perform searches faster (it trades a guarantee of finding the best possible alignment for speed)

• Different Parameters:
1. General Parameters:
• E-value
• Word size
2. Scoring Parameters
3. Filter and Masking
BLAST algorithm
1. General Parameters:
• E-value
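• Gives an indication of the statistical significance of a given pairwise alignment
• The lower the E-value, the more significant the hit
• It is the number of hits with a score at least this good that would be expected by chance in a database of this size
• An E-value of 0.05 means about 0.05 such chance hits are expected, i.e. roughly a 5 in 100 (1 in 20) chance that the hit arose by chance
• An E-value greater than 1 means more than one equally good hit is expected by chance, so the alignment is unlikely to be significant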

• Word size
• The length of the seed that initiates an alignment
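The two general parameters can be illustrated with a short Python sketch. The E-value function follows the Karlin-Altschul formula E = K·m·n·e^(-λS), where m is the query length, n the database length and S the alignment score. K and λ are statistical constants that depend on the scoring system, so they are passed in as arguments here, and any values used with this sketch are placeholders. The seed finder shows the role of word size using exact word matches only (protein searches also accept similar, high-scoring words, which is not shown). All function names are made up for this example.

```python
import math


def expected_hits(score, query_len, db_len, lam, k):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


def words(sequence, word_size):
    """Yield (position, word) for every overlapping word of the given size."""
    for i in range(len(sequence) - word_size + 1):
        yield i, sequence[i:i + word_size]


def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    index = {}
    for pos, word in words(subject, word_size):
        index.setdefault(word, []).append(pos)
    return [
        (q_pos, s_pos)
        for q_pos, word in words(query, word_size)
        for s_pos in index.get(word, [])
    ]


if __name__ == "__main__":
    # Toy sequences; the query word ACGT also occurs at position 2 of the subject.
    print(find_seeds("ACGTAC", "TTACGTT", word_size=4))
```

Two things follow directly from the formula: a higher score gives a lower E-value, and searching a database twice as large doubles the E-value for the same score. A larger word size gives fewer seeds, which makes the search faster but less sensitive.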
BLAST algorithm
2. Scoring Parameters:

• Reward and penalty for matching and mismatching bases

• Cost to create and extend a gap in an alignment
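As a rough illustration of how these scoring parameters combine, the Python sketch below scores an alignment that has already been made. It assumes one common convention, in which a gap of length k costs the opening cost once plus the extension cost for each of its k columns. The default values are examples only, and `score_alignment` is a made-up name.

```python
def score_alignment(row_a, row_b, reward=1, penalty=-2, gap_open=-5, gap_extend=-2):
    """Score two aligned rows of equal length that use '-' for gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_in = None  # which row the current run of gaps is in, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            row = "a" if a == "-" else "b"
            # Opening a gap costs gap_open once; every gap column costs gap_extend.
            if gap_in != row:
                score += gap_open
            score += gap_extend
            gap_in = row
        else:
            score += reward if a == b else penalty
            gap_in = None
    return score


if __name__ == "__main__":
    print(score_alignment("ACGT", "ACGT"))  # four matches
    print(score_alignment("ACGT", "AC-T"))  # three matches and one gap
```

With these example values, four matches score 4, while three matches plus a single-base gap score 3 - 7 = -4. Gaps are made expensive because insertions and deletions are rarer than substitutions.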


BLAST algorithm
3. Filter and Masking:

• Mask regions of low compositional complexity that may cause spurious or misleading results
• Mask query while producing seeds used to scan database – but not for extensions
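The idea of masking low-complexity regions can be sketched as follows. This is a simplified stand-in based on Shannon entropy in a sliding window, not the filtering programs BLAST actually uses. The window size and entropy threshold are arbitrary values chosen for illustration, and the function names are made up. Masked positions are written in lowercase, a common way of showing soft masking.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the character composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=1.0):
    """Lowercase every position covered by a window whose entropy is too low."""
    sequence = sequence.upper()
    masked = [False] * len(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = True
    return "".join(c.lower() if m else c for c, m in zip(sequence, masked))


if __name__ == "__main__":
    # Toy sequences: a mixed-composition stretch and a run of a single base.
    print(mask_low_complexity("ACGTACGTACGT"))
    print(mask_low_complexity("AAAAAAAAAAAA"))
```

A window containing all four bases in equal amounts has an entropy of 2 bits and is left alone. A run of a single base has an entropy of 0 and is masked, so it cannot seed large numbers of meaningless hits.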
BLAST Results
• The topmost hit = the best match to the query sequence
Why is BLAST popular?
1. The flexibility of the search algorithm
2. Reliable statistical reports
3. Continual software development
4. The speed attained by the heuristic search methods
Sequence Alignment
Alignment algorithms
• Sequence alignment – most essential step in comparing biological sequences
• Identifies regions of similarity between sequences
• Two commonly used sequence alignment algorithms:

1. Global alignment
• Compares two sequences – by aligning the entire length of the sequences
• Best suited to sequences of similar length that are related along their whole length

2. Local alignment
◦ Does not align the entire sequence lengths
◦ Aligns regions with the highest density of matches
◦ Useful in identifying short conserved regions in nucleotide or protein sequences
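The difference between the two approaches shows up clearly in a small dynamic-programming sketch. The classic algorithms are Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The Python version below only returns the best score, uses a simple linear gap cost, and its scoring values are arbitrary choices for illustration. `alignment_score` is a made-up name.

```python
def alignment_score(seq_a, seq_b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming with a linear gap cost.

    local=False: Needleman-Wunsch (global, end to end).
    local=True:  Smith-Waterman (local, best-scoring sub-region).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the ends of the sequences.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(
                table[i - 1][j - 1] + pair,  # align the two residues
                table[i - 1][j] + gap,       # gap in seq_b
                table[i][j - 1] + gap,       # gap in seq_a
            )
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


if __name__ == "__main__":
    # Toy sequences: a short motif and a longer sequence that contains it.
    print(alignment_score("ATG", "CCCATGCCC", local=False))
    print(alignment_score("ATG", "CCCATGCCC", local=True))
```

For the motif ATG inside CCCATGCCC, the local score is 3 (the motif matches perfectly), while the global score is -9 because six bases of the longer sequence must be paired with gaps. This is why local alignment is preferred for finding short conserved regions.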
Sequence Alignment
• Process of comparing two (pairwise alignment) or more (multiple sequence alignment) DNA or protein sequences
• Done by searching for a series of individual characters/residues or character patterns that are in the same order in the sequences
Alignment types
1. Pairwise alignment
• Is the most fundamental operation of bioinformatics:
• Involves aligning two sequences together
• Main goal – obtain highest possible score
• Indicates degree of similarity between two sequences
• Uses:
• In genome analysis
• To decide if two proteins/genes are structurally or functionally related
• To identify domains or motifs shared between proteins
• Is the basis of BLAST searching
Pairwise Sequence Alignment of the best hit
Alignment types
2. Multiple Sequence Alignment
• Involves aligning multiple (3 or more) biological sequences to achieve optimal
sequence matching

• Used to:
• Identify conserved sequence regions
• Construct phylogenetic trees

• Helps us understand functional and evolutionary relationships between different species or groups of organisms
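Once a multiple sequence alignment exists, finding conserved regions is a matter of reading it column by column. The Python sketch below assumes the rows are already aligned (equal length, with '-' for gaps). The example rows are toy data and `conserved_columns` is a made-up name.

```python
def conserved_columns(aligned_rows):
    """Return indices of columns where every row has the same non-gap residue."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    conserved = []
    for index, column in enumerate(zip(*aligned_rows)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append(index)
    return conserved


if __name__ == "__main__":
    # Toy alignment of three short sequences.
    rows = ["AATGATCC", "AATGATTC", "ATTGATTC"]
    print(conserved_columns(rows))
```

For the three toy rows, columns 0, 2, 3, 4, 5 and 7 are identical in every sequence, while columns 1 and 6 vary. Runs of conserved columns point to regions that may matter for function.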
Multiple Sequence Alignment
Why Compare Sequences?
• DNA and Proteins are products of evolution
• Molecular sequences undergo random changes over time
• Substitutions, insertions, deletions
• Some of these are selected for during evolution
Why Compare Sequences?
• Detected similarities indicate homology
• Similarities between sequences are detected by sequence alignment
• This allows us to infer the roles and functions of newly isolated sequences
• Using well-known, already characterised sequences
Homology
• Homology – term used when two sequences share a common ancestor that is recent enough that it is still detectable in their sequence
• Simply – we must compare the same gene or region (the same nucleotide sequence) in all organisms in our comparison
Orthologs
• Orthologs
• Genes related by vertical descent from a common ancestor
• Genes in different species that evolved from a common ancestral gene through a speciation event
• Usually encode proteins with the same function in different species


Paralogs
• Paralogs
• Genes that have evolved within the same species by gene duplication events
• When a gene is duplicated – the two copies can evolve independently – leading to development of paralogs
• Code for proteins with similar – but not necessarily identical – functions
Orthologs vs Paralogs
Feature        Orthologs                           Paralogs
Origin         Result from speciation events       Result from gene duplication events
Species        Found in different species          Found within the same species
Functionality  Typically retain similar functions  Functions may diverge over time
Homology
• Homology
• When two sequences share a common ancestor
• Recent enough that it is still detectable in their sequence
• Must compare the same nucleotide sequence in all organisms in our comparison

• Similarity
• Any 2 sequences can be compared and similarity calculated (% nt or aa identity)
• BUT this is meaningless unless they are homologous
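Percent identity itself is simple to calculate once two sequences are aligned. The Python sketch below uses one of several possible definitions: identical positions divided by the number of columns where neither sequence has a gap. Tools differ in what they use as the denominator, so treat this as an assumption. `percent_identity` is a made-up name.

```python
def percent_identity(row_a, row_b):
    """Percent of aligned (non-gap) columns where both rows hold the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = matches = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are skipped in this definition
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Two 12-base sequences that differ at a single position.
    print(round(percent_identity("AATGATCCGATT", "AATGATCCGAGT"), 1))
```

The two 12-base sequences differ at one position, so they share 11 of 12 bases, about 91.7% identity. The number says nothing about whether the sequences are homologous; that judgment comes from the biology.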
Alignments – Positional Homology
AATGATCCGATT
ATGATCCGATT
AATGATTCTTCT
ATTGATTCGATTCTA
How do you compare these sequences? Which are most similar?

Align them

• An alignment involves creating Positional Homology
• Nucleotides at equivalent positions are placed under each other
• This allows comparison and identification of mutations

A good alignment is essential for a good analysis


Alignments – Positional Homology
• An algorithm is used to create an alignment – E.g. Clustal W
AATGATCCGATT
ATGATCCGAGT
AATGATTCGTCT
ATTGATTCGAGTCTA
• Questions:
• Are these sequences an alignment?
• What is the best way to align them?
• Sometimes it is necessary to add gaps to the sequence to get a better alignment
AATGATCCGATT
AATGATCCGAGT
AATGATTCAAGTCT
ATTGATTCGAGTCTA
Alignments – Positional Homology
• TRIM
• Ensure sequences are all the same length
AATGATCCGATT
AATGATCCGAGT
AATGATTCAAGTCT
ATTGATTCGAGTCTA
• Analyse
• The quality of the analysis depends on the quality of the alignment
AATGATCCGATT
AATGATCCGAGT
AATGATTC - - GTCAT
ATTGATTCGAGTCTA
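The trimming step can be sketched very simply in Python. This version assumes the rows already line up from the left and just cuts every row at the length of the shortest one. Real trimming is usually done in an alignment editor and may also remove poorly aligned columns at the start. `trim_to_common_length` is a made-up name.

```python
def trim_to_common_length(rows):
    """Cut every row at the length of the shortest row."""
    shortest = min(len(row) for row in rows)
    return [row[:shortest] for row in rows]


if __name__ == "__main__":
    # The four untrimmed sequences from the slide above.
    rows = ["AATGATCCGATT", "AATGATCCGAGT", "AATGATTCAAGTCT", "ATTGATTCGAGTCTA"]
    for row in trim_to_common_length(rows):
        print(row)
```

After trimming, all four rows are 12 bases long, so every column contains a base from every sequence and the columns can be compared directly.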
Importance of Sequence Alignments
• BLAST finds matches
• CLUSTAL aligns matches
• It is easy to make comparisons when sequences are aligned
• E.g. examine how gene sequence varies among people with and without a disease
• E.g. cystic fibrosis – with the most common mutation, the affected person's CFTR gene is missing a three-base DNA sequence
