BIF401 Midterm Short Notes

Uploaded by

talhajatt127

BIF401 Bioinformatics-1

Grand Quiz Preparation(Part 1)

Prepared by Waqas Ejaz ( Youtube Channel : VU SUPREME TALEEM)


Chapter 1- Introduction to Bioinformatics

 BACKGROUND

1. Bioinformatics is an interdisciplinary science at the crossroads of biology, mathematics, computer science, chemistry and physics.
2. The human genome has roughly 20,000 to 25,000 protein-coding genes.
3. These genes may produce more than 250,000 different proteins.

 EXPERIMENTS IN BIOLOGY
Modern biological experiments rely on high-throughput instruments. These include:
1. Next Generation Sequencers (NGS) for whole genome sequencing
2. High Resolution Mass Spectrometry for whole proteome profiling
3. Nuclear Magnetic Resonance Spectroscopy for structural studies

 DIGITALIZATION OF BIOLOGY
The data produced is stored on computer disks. It may include text, numbers, symbols or images.

 SPEED OF DATA GROWTH


1. Data is being accumulated at exponentially increasing rates
2. Doubling every few years

MOTIVATION
Bioinformatics is an interdisciplinary field because it covers digital biological information from humans, plants, animals and microorganisms.

It demands very low-cost infrastructure and hardly any lab equipment.

SCOPE OF BIOINFORMATICS
Bioinformatics primarily deals with digitalized biological information

ACTIVITIES IN BIOINFORMATICS
Developing algorithms, writing software, statistical evaluation of data, etc.

NEED FOR BIOINFORMATICS

The need for bioinformatics is on a rapid rise as biological data is rapidly increasing and
becoming available online, free of any cost.

Informatics and bioscience are the umbrella terms for the sets of allied disciplines which make up the field.

Quote
Biology easily has 500 years of exciting problems to work on.
Donald Knuth, Professor, Stanford University

The “tree of life” consists of three domains:
➢ Bacteria
➢ Archaea
➢ Eucarya (Eukaryota)

• From gene sequences to protein sequences, bioinformatics is the way forward
• Protein structure, protein-protein interactions and systems biology are other research areas in bioinformatics

APPLICATIONS OF BIOINFORMATICS

GENOMICS
Bioinformatics can help in assembling and analyzing DNA sequencing data, including:
 Gene Finding
 Genome Assembly
 Variation in Genomes
 Transcription Data
 Databases
EVOLUTIONARY STUDIES

 Evolutionary relationships
 Evolutionary distances
 Phylogenetics
 Tree of life
PROTEOMICS

Bioinformatics can help us in decoding protein sequences.


APPLICATIONS OF BIOINFORMATICS

• Genomics
• Transcriptomics
• Proteomics
• Metabolomics
• Structural Proteomics
• Drug Design
• Systems Biology
• Personalized Medicine

FRONTIERS (boundaries) IN BIOINFORMATICS

Frontiers in bioinformatics include:

 Next generation genomics
 Transcriptomics
 Proteomics

FRONTIER IN GENOMICS

The goal is to sequence whole genomes using next generation sequencing (NGS) and to analyze the resulting data with bioinformatics tools.

• Next generation sequencing of whole genomes
• Massive amounts of data (terabyte files)

FRONTIER IN TRANSCRIPTOMICS
• Identification of transcripts that are still unknown or under discussion
• Identification of hitherto (until now) unknown types and roles of RNA

FRONTIER IN PROTEOMICS
• Identification of low abundance proteins in patient tissue samples
• Large scale protein identification
• Identification of known as well as novel post-translational modifications

FRONTIER IN PROTEIN STRUCTURE

To understand how proteins fold and how they are processed.

FRONTIER IN SYSTEMS BIOLOGY

To understand how cells function as a whole.

FRONTIER IN PERSONALIZED MEDICINE

• Personalize the medicine for a precise cure of a disease.
• Some medicines have side effects in certain people.
 Some medicines have side effects on certain person

Chapter 2 - Sequence Analysis

Gene, mRNA and Protein Sequences

Central Dogma
The transfer of information from DNA to RNA and then to protein is termed the “Central Dogma”.

• DNA codes for RNA
• RNA in turn helps produce proteins
• Proteins, along with some other molecules, form cells, including their organelles and membranes

DNA -> RNA -> Protein

DNA passes its information to the cell via mRNA. The mRNA sequence determines the order of the amino acids, which are joined to form a protein, and proteins in turn build the cell.

DNA
A DNA molecule is a double helix made of two strands of nucleotides. Each nucleotide consists of a sugar, a phosphate group and a base. The bases on opposite strands are bound to each other by hydrogen bonds and form base pairs.

• DNA is constituted by four bases
• These are Adenine (A), Cytosine (C), Thymine (T), and Guanine (G)
• A DNA molecule may contain thousands of A, C, T and G
• A pairs only with T and C pairs only with G, with the help of hydrogen bonding

RNA
• RNA is constituted by four bases
• These are Adenine (A), Cytosine (C), Uracil (U), and Guanine (G)
• An RNA molecule may contain thousands of A, C, U and G
• A pairs only with U and C pairs only with G

DNA and RNA share the same bases except one: RNA has U (Uracil) where DNA has T (Thymine).

DNA is usually double stranded while RNA is usually single stranded. The two also differ in sugar composition: RNA has ribose sugar and DNA has deoxyribose sugar.

Adenine and Guanine are collectively called purines, while Cytosine, Uracil and Thymine are called pyrimidines.

TRANSCRIPTION
Transcription is the process by which the information in a strand of DNA is copied into a new
molecule of messenger RNA (mRNA).

NUCLEOTIDES
DNA and RNA molecules are each composed of four kinds of smaller molecules, which are called nucleotides.

TRANSLATION
Cells are built largely of proteins and carbohydrates. Proteins are made by decoding an RNA molecule, and this process is called translation.
• RNAs code for proteins
• Codons of three nucleotides in the RNA molecule each code for one amino acid
• This process takes place at the ribosomes and is called translation
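
The DNA -> RNA -> Protein flow described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the input is assumed to be the coding strand of DNA, and the codon table holds just the handful of codons needed for the example rather than all 64.

```python
# Partial codon table (assumption: only the codons used in this example).
# "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M", "GCU": "A", "UGG": "W", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(dna_coding_strand):
    """Copy a DNA coding strand into mRNA by replacing T with U."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases (one codon) at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X = codon not in the table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCTTGGAAATAA")
protein = translate(mrna)
print(mrna, protein)
```

Each codon maps to one single-letter amino acid code, and translation stops at the first stop codon.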

AMINO ACIDS
In all, 20 different amino acids can be coded by such codons during translation. The amino acids polymerize to form proteins.
Each amino acid contains nitrogen, hydrogen, oxygen and two backbone carbon atoms, plus a variable group R.
• When polymerization takes place, water is released
• Amino acids polymerize through peptide bonds
Amino acids are joined to each other by peptide bonds, and the resulting chain folds into a 3D form that makes up the protein structure.

STORAGE OF BIOLOGICAL SEQUENCE INFORMATION


• DNA sequences may contain hundreds of thousands of bases
• RNA sequences are usually shorter than DNA sequences, but they come in a much larger variety

SOLUTION: DATABASES
• For DNA and RNA, the public database is GenBank (by NIH).
• For proteins, the public database is UniProt (by the UniProt Consortium).
• Both GenBank and UniProt are online databases. Their DNA, RNA and protein sequences are available online to the public and to researchers.
USING GENBANK
GenBank can be searched by:
• Sequence
• ID
• Name
• Species etc.
Other information includes:
• Locus
• Accession number
• Authors
• Journal etc

USING UNIPROT
UniProt is a public database used to search protein sequences.

The home page has a box named “Swiss-Prot”, which contains human-curated protein information.

UniProt can be searched by:
• Amino Acid Sequence
• ID
• Name
• Species etc.

COMPARING SEQUENCES
By comparing DNA, RNA and protein sequences we can find:
• Similarity
• Specific differences due to some disease or mutation
• Evolutionary relationships

SIMILARITIES & DIFFERENCES IN SEQUENCES

If some sequences are exactly the same, it may indicate:
• Regular expression in the cell or system
• Specific presence of a protein or gene
• Similar nucleotides
Exact and Inexact Matching
• Exact matching requires a complete match in terms of residues and their order
• Inexact matching allows a partial match in terms of residues and order, with room for variations in both

Pairwise Sequence Alignment


Alignment
The process of inexact matching while keeping in view the conserved residues is called sequence “alignment”.
Pairwise sequence alignment is therefore the alignment of a pair of sequences.
Pairwise sequence alignment compares two amino acid or nucleotide sequences.
Matches are colored and missing nucleotides are denoted by

There are two types of pairwise alignment:

1. Global
2. Local

• Global alignment - maximizes the number of matches between the query and source sequences along the entire length of both sequences.
• Local alignment - gives the highest scoring local match between the query and source sequences.
Optimal alignment - the one that exhibits the most correspondences between the query and the source sequences. It is the alignment with the highest score.

Pairwise Sequence Alignment is used to identify regions of similarity that may indicate
functional, structural and/or evolutionary relationships between two biological sequences
(protein or nucleic acid)

PAIR WISE SEQUENCE ALIGNMENT

• Insertions or deletions ("indels") result in gaps in alignments
• Substitutions result in mismatches

DOT PLOTS
• Dot plots employ a dot matrix representation for pairwise alignment and comparison
• The two sequences are written along the top and the left side of the dot matrix grid
• Matches are represented by dots
• Dots on diagonals are connected and represent alignments
A dot plot gives us a global view of the similarity between the two sequences.
A dot plot helps us find the threshold difference between two sequences.
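
As a rough illustration of the points above, a dot plot can be built as a grid of true/false values. The function names `dot_plot` and `render` are hypothetical, and the two short sequences are made up for the example.

```python
def dot_plot(top, left):
    """Return a grid where grid[i][j] is True when left[i] equals top[j]."""
    return [[a == b for b in top] for a in left]


def render(top, left):
    """Draw the grid as text: '*' for a match (a dot), '.' for no match."""
    grid = dot_plot(top, left)
    lines = ["  " + " ".join(top)]
    for residue, row in zip(left, grid):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


grid = dot_plot("GATTACA", "GATCACA")
# Count the dots on the main diagonal, where an ungapped alignment would lie.
main_diagonal_hits = sum(grid[i][i] for i in range(7))
print(render("GATTACA", "GATCACA"))
```

A long run of stars along a diagonal corresponds to an aligned region. The single break in the main diagonal here is the T/C substitution at the fourth position.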

EXAMPLE OF DOT PLOTS


A dot plot of human cytochrome against tuna fish cytochrome shows the alignment of the two sequences along the diagonal.

IDENTITY VS SIMILARITY
There are two concepts in sequence analysis:
1. Identity
2. Similarity

Sequence Identity

Identity is the count of nucleotides or amino acids which match exactly when two biological sequences are aligned.
Formula for Identity:
Identity = (No. of Matches / length of the shorter sequence) × 100

• Gaps are not counted
• Identity is measured relative to the shorter of the two sequences
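
The identity formula above can be checked with a small sketch. It assumes the two sequences are already aligned, are of equal length, and use "-" for gaps. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Identity = matches / length of the shorter ungapped sequence * 100."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as a match only if both residues are equal and not a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    return 100.0 * matches / shorter


# 5 matching columns, shorter sequence has 6 residues.
identity = percent_identity("GATTACA", "GA-TGCA")
print(round(identity, 2))
```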

Similarity is a measure of how closely two different sequences resemble each other, calculated by an alignment approach.
Sequences vary due to insertions, deletions and substitutions.

APPROACHES TO ALIGN THE SEQUENCE

1. Global Alignment
In global alignment we compare both sequences completely, from end to end.
Global alignment creates an end-to-end matching between two sequences.

2. Local Alignment
In local alignment we compare a portion of one sequence with a portion of the other.
Local alignment focuses on the highly matching subregions within the sequences.

WHY LOCAL ALIGNMENT


• Local alignments have the power to detect small regions of high similarity between two sequences
• Such matches may be “domains” or “motifs” in case of proteins
DOMAIN SHUFFLING
The aligned portions of sequences can occur in varying orders, and this is called domain shuffling.

ALIGNING, INSERTION & DELETION
Insertion means the addition of amino acids to a protein sequence or of nucleotides to a DNA or RNA sequence.
Deletion means the removal of amino acids from a protein sequence or of nucleotides from a DNA or RNA sequence.
Inserting a gap carries a negative score, called a penalty.

Additions and removals of amino acids in proteins and of nucleotides in DNA or RNA are represented with gaps and are named indels.

ALIGNING MUTATION IN SEQUENCES

A mutation (substitution) is different from an indel: one amino acid or nucleotide is replaced by another, and no gaps are used.

CONCLUSION
For indels we use gaps, and for mutations we use substitution penalties. The penalty depends upon the substitution.

To find matching nucleotides or amino acids in two sequences we use the dot plot method.
However, a dot plot cannot capture the insertions, deletions and gaps in the sequences.
Modification in Dot Plot

We score matching residues as +1, while gaps, substitutions, insertions and other mutations are scored as -1 in the dot plot.

INTRODUCTION TO DYNAMIC PROGRAMMING


Dynamic programming is an algorithmic technique commonly used in sequence analysis.
Dynamic programming is used when recursion could be used but would be inefficient because it would repeatedly solve the same subproblems.

Need for an algorithm

Comparing two sequences residue by residue takes time and is computationally expensive. That is why we need an efficient algorithm.

If we compare two sequences of length n, the number of comparisons is n^2, so the order is O(n^2).

DYNAMIC PROGRAMMING METHODOLOGY


Dynamic programming (DP) helps reduce the large computational cost associated with sequence
comparisons
DP uses a “scoring function” to deal with matches, mismatches and gaps

Conclusion
Alignments follow diagonal paths in the dot plot matrix. The total score is calculated along these paths, and the best-scoring alignment is selected.

Needleman Wunsch Algorithm

We start with a zero in the second row, second column.
Move through the cells row by row, calculating the score for each cell.
Moving left to right and top to bottom, the best option (the one with the highest score) is selected for each cell.
The matrix is computed progressively until the bottom right element is reached.

BACKTRACKING ALIGNMENTS
To find an optimal alignment in the Needleman Wunsch algorithm we use the traceback method.

After the matrix is completely calculated, we apply traceback to find the optimal alignment. Traceback starts from the bottom right cell, which holds the score of the best global alignment, and moves toward the top left.
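
The matrix fill and traceback described above can be sketched as follows. This is a minimal illustration, not the course's reference implementation: the function name is hypothetical, and the scoring values (+1 match, -1 mismatch, -1 gap) are assumptions taken from the simple +1/-1 scheme mentioned earlier in these notes.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill row by row; each cell keeps the best of diagonal, up and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom right cell back to the top left cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, nw_a, nw_b = needleman_wunsch("GATTACA", "GATACA")
print(nw_score)
print(nw_a)
print(nw_b)
```

With these scores the best global alignment has six matches and one gap. Several equally good placements of the gap exist, so the exact aligned strings depend on the tie-breaking order in the traceback.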

REVISITING LOCAL AND GLOBAL ALIGNMENTS


The traceback strategy allows us to differentiate between a local and a global alignment.
“Traceback” is the technique by which we retrace the alignment path starting from a chosen end cell of the matrix.

• A dot plot helps us find the matching residues of two sequences, while Needleman Wunsch helps us find global alignments.
• If the sequences have end regions of nucleotides that do not match each other, we can still prefer a global alignment over a local one, but in a form that does not penalize the leading or trailing ends.
MOVING FROM GLOBAL TO LOCAL ALIGNMENT
DNA has coding and noncoding regions.
Coding regions are called “exons”. They are expressed as protein and remain more conserved because of their role in making functional proteins.
Noncoding regions within genes are called “introns”. They are more likely to accumulate mutations than coding regions.
This means a high degree of alignment can be found between two exons.

Local alignments can identify exons which are present in both sequences.
Exons in DNA tend to be more conserved as compared with introns.
SMITH WATERMAN ALGORITHM


In global alignment we compare the sequences from end to end, but in local alignment we compare the sequences in segments.

The Smith Waterman algorithm differs from Needleman Wunsch in the following ways:

• The top row and first column are set to zero.
• The alignment can end anywhere.
• Traceback starts from the highest score.
Local alignments can identify coding portions in a DNA sequence.

EXAMPLE OF SMITH WATERMAN ALGORITHM

The only difference between the Needleman Wunsch and Smith Waterman recurrences is that a zero “0” is placed in the recurrence relationship as an extra option, so no cell score falls below zero.

• Traceback can start from any position in the scoring matrix
• Local alignments can be extracted by starting from a high score and tracing back until reaching ‘0’
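
A minimal sketch of the local alignment idea above is given here. The function name and the scoring values (+1 match, -1 mismatch, -1 gap) are assumptions for illustration. The two differences from global alignment are visible in the code: the extra 0 in the recurrence, and a traceback that starts at the highest score and stops at 0.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (best_score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # Top row and first column stay at zero.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,  # the extra zero: a local alignment may restart here
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the highest scoring cell until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


sw_score, sw_a, sw_b = smith_waterman("CCCGATTACACCC", "TTGATTACATT")
print(sw_score, sw_a, sw_b)
```

The flanking C and T residues do not match each other, so only the shared core is reported.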

REPEATED ALIGNMENTS
By changing the traceback strategy we can find repeated sequences.
Two modifications:
• Use a score threshold ‘T’ above which the matches will be considered. This helps avoid low scoring local alignments.
• Traceback should find multiple aligned regions through multiple traceback steps.

REPEATED ALIGNMENTS STEPS


• Continue to traceback until you reach the top row again
• Repeat the previous two steps
• Stop the trace back if you reach ‘0’ at (0,0)
Traceback Strategies:
We have seen how biological sequences can be searched and compared using various recurrence
relations and traceback strategies

INTRODUCTION TO SCORING ALIGNMENTS


There are two types of scored alignments:
• Optimal alignments
• Best alignment

MEASURING ALIGNMENT SCORES
The matrix holds both positive and negative scores, so matches and mismatches are all taken into account along the diagonal path.

SCORING MATRICES
How to build scoring matrices?
• We analyze the observed frequency with which each amino acid / nucleotide is substituted by another one in similar proteins/genes

Scores in the Scoring Matrices

• Scoring matrices may contain +ve and -ve values as well as 0.
• +ve values are for more frequent matches / mismatches
• -ve values are for unlikely ones

There are two types of scoring matrices.

➢ PAM
➢ BLOSUM

• Protein sequences have a choice of PAM and BLOSUM matrices.
• Nucleotide sequences have choices for a pair of match/mismatch costs.

PAM MATRICES
PAM means “Point Accepted Mutation”.
A point accepted mutation is the substitution of one amino acid in a sequence with another such that the protein function remains conserved.

PAM UNIT
One PAM unit is the evolutionary time over which 1% of the amino acids undergo accepted mutations.
Important: If two sequences diverge by 100 PAM units, it does not imply that they differ at every position, because the same position can mutate more than once.

STEPS TO COMPUTE PAM MATRICES

1. Align the protein sequences which have diverged by 1 PAM unit.
2. Let Ai,j be the number of times Ai is substituted by Aj.
3. Compute the frequency fi of amino acid Ai.

Then, PAM1=pii=
PAM n = (PAM1)^n

• PAM1, PAM2 … PAM250 can be computed
• PAM120 is considered an optimal scoring matrix
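
The relation PAM n = (PAM1)^n can be illustrated with a toy example. Everything below is an assumption for illustration only: a real PAM1 matrix is 20 x 20 and is later converted to scores, whereas `TOY_PAM1` is a made-up mutation probability matrix over a two-letter alphabet in which each residue changes with probability 1% per PAM unit.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(m, n):
    """Raise matrix m to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 1-PAM probabilities: stay the same 99%, change 1%.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam100 = mat_power(TOY_PAM1, 100)
stay_same = toy_pam100[0][0]
print(round(stay_same, 3))
```

Even after 100 PAM units more than half of the positions in this toy model are still unchanged, which matches the remark that 100 PAM units do not mean the sequences differ at every position.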

BLOSUM MATRICES
BLOSUM matrices can be used to align protein sequences. BLOSUM matrices were first proposed in 1992 by Henikoff and Henikoff.
BLOSUM stands for Blocks Substitution Matrix. The matrices are derived from blocks, which are aligned regions without any gaps, although they may contain mismatches.

There are three steps to compute the BLOSUM matrices.

Step 1: Eliminate sequences that are identical in at least x% of positions
Step 2: Compute the observed frequency f i,j of the aligned pair Ai to Aj. Hence, f i,j becomes the probability of aligning Ai and Aj in the selected blocks.
Step 3: Compute fi, which is the frequency of observing Ai in the entire block

Typically used matrices are BLOSUM62 and PAM120. For PAMx, a larger x detects more divergent sequences.
BLOSUM62 and PAM120 are considered optimal for use in protein sequence alignment.
For BLOSUMn, a lower n detects more divergent sequences.

MULTIPLE SEQUENCE ALIGNMENT (MSA)


Multiple sequence alignment involves the comparison of three or more sequences.
In multiple sequence alignment we compare multiple protein or DNA sequences to identify the matches and mismatches.

MSA vs Pairwise
For pairwise alignments, we used Dynamic Programming
For MSA, Dynamic Programming may get very expensive, so we use Progressive Alignment
• Pairwise alignment was either local or global
• Multiple alignments are mostly global

APPLICATION OF MSA
• Predict secondary and tertiary structures of new protein sequences
• Evaluate the evolutionary order of species, or “phylogeny”
METHODOLOGY
Pairwise alignment is the alignment of two sequences.
MSA can be performed by repeated application of pairwise alignment.
Progressive alignment can help perform MSA. Sequences with more than 80% similarity need to be removed first.
PROGRESSIVE ALIGNMENT FOR MSA
• Progressive alignments are used in aligning multiple sequences
• Iterative approaches can help refine results from progressive alignments
SHORTCOMING OF THIS APPROACH
➢ Dependence upon initial alignments
➢ If sequences are dissimilar, errors in alignment are propagated
➢ Solution: Begin by using an initial alignment, and refine it repeatedly

MSA-EXAMPLE

MSA can be better performed using clustering strategies followed by alignment of the
alignments later. CLUSTAL is a free online tool that does all of this for us!

CLUSTALW
Developed by European Molecular Biology Laboratory & European Bioinformatics Institute
Performs alignment in two modes:
• slow/accurate
• fast/approximate
SCOPE
• Create multiple alignments
• Optimize existing alignments
• Profile analysis
• Create phylogenetic trees

Using CLUSTALW
• CLUSTALW can use multiple file formats including: EMBL/SwissProt, Pearson (Fasta) etc

CLUSTAL Omega is now also available which includes several upgrades!

INTRODUCTION TO BLAST-I
BLAST stands for “Basic Local Alignment Search Tool”.
It was developed in 1990 at the National Center for Biotechnology Information (NCBI), USA.
It searches databases for query protein and nucleotide sequences.
It also searches translational products etc.
Online availability: www.blast.ncbi.nlm.nih.gov/Blast.cgi

INTRODUCTION TO BLAST-II
Smith Waterman can align complete sequences. BLAST does it in an approximate way. Hence,
BLAST is faster BUT does not ensure optimal alignment.

Use Cases of BLAST


BLAST provides for approximate sequence matching. Input to BLAST is a FASTA formatted
sequence and a set of search parameters
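
Since the input is a FASTA formatted sequence, a small sketch of reading that format may help. In FASTA text, a header line starts with ">" and the sequence follows on one or more lines. The function name and the two example records are hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence lines may be wrapped
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 hypothetical example record
GATTACA
GATTACA
>seq2
MAWK
"""
records = parse_fasta(example)
print(records)
```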

OUTPUT OF BLAST
Results are shown in HTML, plain text, and XML formats

BLAST ALGORITHM
BLAST can search sequence databases and identify unknown sequences by comparing them to
the known sequences. This can help identify the parent organism, function and evolutionary
history. BLAST performs quick alignments on sequences.

There are two main types of BLAST.


Nucleotides
• Blastn: Compares a nucleotide query sequence against a nucleotide database.
Proteins
• Blastp: Compares an amino acid query sequence against a protein database.

There are also other types of BLAST:

blastx:
Compares a nucleotide query sequence against a protein sequence database.
Helps find potential translation products of unknown nucleotide sequences.
tblastn:
Compares a protein query sequence against a nucleotide sequence database.
The nucleotide sequences are dynamically translated into all reading frames.
tblastx:
Compares the six-frame translated proteins of a nucleotide query sequence against the six-frame translated proteins of a nucleotide sequence database.

SUMMARY OF BLAST
• BLAST performs quick alignments on biological sequences
Step 1: Obtain a query sequence
For known sequences: use NCBI, UCSC etc.
For unknown sequences: use NGS or mass spectrometry

INTRODUCTION TO FASTA
To compare two sequences we use pairwise sequence alignment, and to compare many sequences we use multiple sequence alignment. For local alignment we use the Smith-Waterman algorithm, and for global alignment we use the Needleman-Wunsch algorithm.
Both local and global alignment are dynamic programming approaches.

Comparing a query against many sequences takes time, so we use BLAST, which is an approximate local alignment search tool. BLAST compares a large number of sequences quickly.

FASTA took a similar approach. Developed in 1988, it performs fast alignment and searches databases for query protein and nucleotide sequences. It was later improved upon in BLAST.

Online: http://www.ebi.ac.uk/Tools/sss/fasta/

FASTA – Fast Alignment Algorithm

Classical global and local alignment algorithms are time consuming. FASTA achieves alignment by using short stretches of exact matches.
USES OF FASTA
FASTA relies on aligning subsequences of absolute identity. Input to a FASTA search can be in FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP or UniProt formats.

OUTPUT OF FASTA
Results are output in a visual format along with functional prediction. A table lists the sequence hits found along with their scores.

FASTA ALGORITHM
STEP1: Local regions of identity are found

STEP2: Rescore the local regions using PAM or BLOSUM matrix

STEP3: Eliminate short diagonals below a cutoff score

STEP4: Create a gapped alignment in a narrow band and then perform Smith Waterman alignment within it
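
Step 1 above, finding local regions of identity, can be sketched by looking up short exact words (k-tuples) and counting how many fall on each diagonal. This is a simplified illustration, not the actual FASTA program. The function name and the word length k=2 are assumptions.

```python
from collections import defaultdict


def ktuple_diagonals(query, subject, k=2):
    """Count exact k-letter word matches on each diagonal (subject pos - query pos)."""
    # Index every k-letter word of the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    # Scan the subject and record the diagonal of every word hit.
    diagonals = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], []):
            diagonals[j - i] += 1
    return dict(diagonals)


hits = ktuple_diagonals("GATTACA", "CCGATTACA", k=2)
best_diagonal = max(hits, key=hits.get)
print(hits, best_diagonal)
```

A diagonal with many word hits marks a candidate region that the later steps rescore and extend.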

TYPES OF FASTA
There are six types of FASTA programs:
• fasts35
Compare unordered peptides to a protein sequence database
• fastm35
Compare ordered peptides (or short DNA sequences) to a protein (DNA) sequence database
• fasta35
Scan a protein or DNA sequence library for similar sequences
• fastx35
Compare a translated DNA sequence (6 ORFs) to a protein sequence database
• tfastx35
Compare a protein sequence to a DNA sequence database (6 ORFs)
• fasty35
Compare a DNA sequence (6 ORFs) to a protein sequence database

SUMMARY OF FASTA
Given a query sequence, FASTA can quickly search sequence databases.
Multiple types of FASTA exist which assist in aligning DNA/RNA/protein sequences.

http://fasta.bioch.virginia.edu/fasta_docs/fasta35.shtml

BIOLOGICAL DATABASE AND ONLINE TOOLS


Sequences are obtained from genome sequencing and mass spectrometry
Structures are obtained from X-Ray Crystallography, Atomic Force Microscopy & Nuclear Magnetic
Resonance Spectroscopy

This sequence and structure data is very useful for researchers


Databases are formed to store and share this data to make biological data available to scientists in
computer-readable form.

EXPASY
Expasy provides access to a variety of online databases and tools.
It is developed by the Swiss Institute of Bioinformatics (SIB). Resources on proteomics, genomics, phylogeny, systems biology, population genetics, transcriptomics etc. can be searched.
http://www.expasy.org/

UNIPROT AND SWISSPROT


Both UniProt and Swiss-Prot are online databases for proteins.

Swiss-Prot contains human-curated protein information, including:

➢ Accession number, unique identifier
➢ The sequence
➢ Molecular mass
➢ Observed and predicted modifications

Protein sequences from various species and organisms can be found in UniProt.
Swiss-Prot is the manually annotated section of the UniProt database.

PROTEIN DATA BANK


The Protein Data Bank is the premier resource of protein structures. These structures have been determined using experimental techniques.
It is open and free. The Protein Data Bank provides the Cartesian coordinates of each atom in a protein structure. Over 50,000 protein structures are reported and present in this database.

REVIEW OF SEQUENCE ALIGNMENT

We use next generation sequencing and whole genome sequencing to obtain the genetic
information.
For protein sequencing we use Mass Spectrometry and Edman Degradation.
STORAGE:
 Sequence information is stored digitally
 Databases are designed to store sequence data
SHARING AND ACCESS:
 Sequence databases are shared via online websites
 Access to several such websites is free
USAGE OF DATA:
Sequence data can be used to obtain:
• Similarity of sequences
• Evolutionary history
• Predicted functions of molecules

GENBANK
• All known nucleotide sequences can be obtained from GenBank
• Thousands of genomes are available online on GenBank, free of charge and searchable
• GenBank: public database of nucleotide sequences for over 200,000 organisms
• Growing exponentially
Data in GenBank
• Entries can be a contiguous stretch of DNA or RNA sequence and its annotations
• A new release is made every two months
• Maintained by the National Center for Biotechnology Information (NCBI)
http://www.ncbi.nlm.nih.gov/genbank/

ENSEMBL

Ensembl is a genome search engine which is used to search the genomes of recorded species.
http://asia.ensembl.org/index.html

Chapter 3 - Molecular Evolution
Molecular Evolution & Phylogeny
Molecular evolution is the process of change in the sequence composition of cellular molecules
such as DNA, RNA, and proteins across generations. All molecules have an evolutionary history.
Phylogenetics
Phylogenetics is the study of extracting evolutionary relationships between species.
Phylogenetics has led to the creation of relationship trees between various species of Bacteria,
Archaea, and Eukaryota.

Types of Phylogenetic Trees


Scaled Trees
• Branch lengths are proportional to the amount of change between nodes
Unscaled Trees
• Branches only represent the relationships between sequences

Evolution of Sequences
DNA acts as the memory unit of the cell, and proteins are the translated products of the information coded in DNA. Evolution is very important for survival in different types of environments.

Method of Change
DNA gets modified by:
 Mutation & Substitution
 Insertion
 Deletion

The evolutionary events and their combinations impart relationships between sequences. These relationships are explored in phylogenetics. Several algorithms exist for finding such relationships.

Concepts and Terminologies - I


Phylogenetics involves processing sequence information from different species to find
evolutionary relationships. Output from such studies includes Phylogenetic Trees.

In the figure, point A stands for the ancestor. With the passage of time evolution occurred and the genome sequences of the organisms changed.

All the trees have the same meaning.

Root node is the ancestor of all other nodes. The direction of evolution is from ancestor to the
terminal nodes.
Conclusion
Phylogenetics specifies evolutionary relationship with the help of trees. Trees can be rooted or
unrooted. Rooted trees can show temporal evolutionary direction.

Concepts and Terminologies - II


Rooted and Unrooted trees can be used to show phylogenetic relationships between
sequences.

Rooted trees are computationally expensive.

Algorithms and Techniques


Several types of algorithms exist which are divided into two classes. There are many methods
for constructing evolutionary trees.

UPGMA
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a simple agglomerative
(bottom-up) hierarchical clustering method. The method is generally attributed to Sokal and
Michener.
In this method, the two sequences with the shortest evolutionary distance between them are considered first. These sequences are assumed to be the last to diverge and are represented by the most recent internal node.

Least Squares Distance Method


Branch lengths represent the “observed” distances between sequences (i and j).
Find branch lengths X, Y and Z such that the distances D(i, j) are conserved.

Conclusion
Several methods exist for constructing phylogenetic trees.
Broadly, they belong to objective methods or clustering methods.
We will study UPGMA and Distance Methods.

Introduction to UPGMA
UPGMA: Unweighted Pair Group Method using Arithmetic Averages
Calculating distance between two clusters:

Cluster X + Cluster Y = Cluster Z


Calculate the distance of a cluster (e.g. W) to the new cluster Z

Nx is the number of sequences in cluster x
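The cluster-to-cluster distance used by UPGMA is a size-weighted average. A minimal sketch, assuming the standard UPGMA update rule d(Z, W) = (Nx * d(X, W) + Ny * d(Y, W)) / (Nx + Ny); the function name and the numbers are illustrative only:

```python
def upgma_cluster_distance(d_xw, d_yw, n_x, n_y):
    """Distance from cluster W to the merged cluster Z = X + Y.

    d_xw, d_yw: distances from W to X and to Y
    n_x, n_y:   number of sequences in X and in Y
    """
    return (n_x * d_xw + n_y * d_yw) / (n_x + n_y)


# X holds 1 sequence, Y holds 3 sequences (hypothetical values).
print(upgma_cluster_distance(6.0, 10.0, 1, 3))  # 9.0
```

Because the average is weighted by cluster size, every original sequence contributes equally to the new distance.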

Calculating distance between two trees:


Assume we have N sequences
Cluster X has NX sequences, cluster Y has NY sequences
dXY : the evolutionary distance between X and Y

Methods for constructing trees

The distance matrix is obtained using pairwise sequence alignment.

A and D are merged into a new cluster, say V. We then have to update the distance matrix. What are the
distances between:

V and B (Calculate),


V and C,
V and E,
V and F.

UPGMA is a clustering algorithm which can help us compute phylogenetic trees.
UPGMA has two components to it. These include distance calculations between two clusters
and between two trees.
UPGMA-I
UPGMA starts by clustering the sequences that are closest to each other. Next, the distance is
computed between the new cluster and the remaining sequences. The process is repeated until
all sequences have been clustered.

UPGMA-II
Once a cluster is selected and its distance is computed with all other sequences, we update the
distance matrix. Next, we select the shortest distance from the new matrix and repeat the
process.
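The select, merge and update cycle can be sketched as a short loop. This is an illustrative sketch only: the `upgma` function, the dictionary layout and the distance values are assumptions, and branch lengths are left out to keep it short.

```python
from itertools import combinations


def upgma(names, pair_distances):
    """Cluster labels with UPGMA and return the merge order as nested tuples.

    pair_distances maps a 2-tuple of labels to their distance.
    """
    dist = {frozenset(pair): d for pair, d in pair_distances.items()}
    sizes = {name: 1 for name in names}  # cluster label -> number of sequences

    while len(sizes) > 1:
        # Select the pair of clusters with the shortest distance.
        x, y = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        n_x, n_y = sizes.pop(x), sizes.pop(y)
        z = (x, y)  # the merged cluster

        # Update the distance matrix with size-weighted averages.
        for w in sizes:
            dist[frozenset((z, w))] = (
                n_x * dist[frozenset((x, w))] + n_y * dist[frozenset((y, w))]
            ) / (n_x + n_y)
        sizes[z] = n_x + n_y

    return next(iter(sizes))


# Hypothetical distances, chosen so that A and D are the closest pair.
example = {
    ("A", "B"): 6, ("A", "C"): 8, ("A", "D"): 2,
    ("B", "C"): 4, ("B", "D"): 6, ("C", "D"): 8,
}
tree = upgma(["A", "B", "C", "D"], example)
print(tree)  # (('A', 'D'), ('B', 'C'))
```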

Chapter 4 - RNA Secondary Structure Prediction

DNA TO RNA SEQUENCES
MOTIVATION
1. In earlier days, RNA was considered only a passive intermediary between DNA and proteins.
2. Now we know that multiple types of RNA exist, e.g. mRNA, tRNA, miRNA and siRNA.
RNA viruses
Many viruses have genomes made of RNA. They are therefore called RNA viruses.
Examples include Human Immunodeficiency Virus and Hepatitis C Virus.
RNA has a short life span because its ribose sugar carries an extra hydroxyl (OH) group, which makes the molecule chemically less stable than DNA.

TYPES OF RNA & THEIR FUNCTIONS


There are two categories of RNA:
Coding RNA
Non-Coding RNA
Coding RNA carries the coded information used in protein synthesis, while non-coding RNA helps in
the translation process.

TYPES OF RNA
There are many types of RNA according to their functions, such as:
Messenger RNA (mRNA)
Transfer RNA (tRNA)
Ribosomal RNA (rRNA)
Micro RNAs (miRNA)
Small Interfering RNA (siRNA)

MESSENGER RNA
1. mRNA makes up only 5-10% of the RNA present in a cell.
2. The 5’ end of messenger RNA is capped with 7-Methyl Guanosine Triphosphate, which helps the
ribosomes identify the mRNA.
3. The 3’ end of the mRNA carries a poly A tail (around 30-200 adenylate residues), which helps shield
it against 3’ exonucleases.

SIGNIFICANCE OF RNA STRUCTURE


RNA can form 3D structures, and these structural properties help the RNA molecule perform different
functions.
RNA is composed of sugars, phosphates and nitrogenous bases, and these bases have the ability to form
hydrogen bonds.

 ‘A’ can make hydrogen bonds with ‘U’
 ‘G’ makes hydrogen bonds with ‘C’
 ‘G’ can also make hydrogen bonds with ‘U’ (Wobble Pair)
FUNCTIONS
Because of this bonding ability, RNA forms many structures, and this variety of structures lets RNA
perform many functions in the cell, such as:
 DNA information transfer
 Regulatory roles
 Catalytic roles
 Defense & immune response
 Structure-based special roles

RNA FOLDING AND ENERGY OF FOLDING


RNA molecules form many structures for stability and different functions.
A review of free energy
“Gibbs Free Energy” (LANGRIDGE and KOLLMAN 1987) is the free energy available to the RNA
molecule for reactions. RNA structure formation favors lower free energy.
Energy of RNA structures
o RNA structures have the lowest (or close to the lowest) quantity of free energy
o If an RNA has two possible structures, we can select the one with the lowest energy state.

We can calculate the overall energy of RNA structures by summing up energies given out
during the process of folding
Calculating energies of structures
• The stabilizing energy associated with stacking base pairs in a double-stranded region (-ve)
• The destabilizing influence of unpaired regions (+ve)
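A minimal sketch of this bookkeeping, with made-up energy values (real calculations use experimentally derived energy tables): each candidate structure gets the sum of its negative stacking terms and positive loop terms, and the structure with the lowest total is selected. The function and structure names are hypothetical.

```python
def total_free_energy(stacking_energies, loop_penalties):
    """Sum stabilizing (negative) and destabilizing (positive) terms, in kcal/mol."""
    return sum(stacking_energies) + sum(loop_penalties)


# Hypothetical candidate structures for the same RNA sequence.
candidates = {
    "hairpin": total_free_energy([-3.0, -2.5, -3.5], [4.5]),
    "bulged_helix": total_free_energy([-3.0, -2.5], [3.0]),
}

most_stable = min(candidates, key=candidates.get)
print(most_stable, candidates[most_stable])  # hairpin -4.5
```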

CALCULATING ENERGIES OF FOLDING-AN EXAMPLE


RNA is composed of four nucleotides (A, U, C and G), which are attached to the ribose sugars of
the backbone. These nucleotides form hydrogen bonds with each other: G bonds with C and A
bonds with U, and energy is released when the bonds form. That is why the RNA molecule
becomes more stable.

Five nucleotides formed H-bonds. This bond formation released energy (-12.0 kcal/mol).
The RNA molecule took up a 2’ structure and hence became more stable.

TYPES OF RNA SECONDARY STRUCTURES


All the complementary bases of RNA combine to form RNA secondary structures.
A simple nucleotide sequence of RNA is called the primary structure and is denoted by 1’.
When these nucleotides fold together and form a complex structure, it is called the secondary
structure and is denoted by 2’.

The preferred structure of RNA is 2’ which has many structural patterns like Helices, Loops,
Bulges and Junctions.

The first 2’ RNA structure is called helix. Unlike the DNA helix, the RNA helix is formed when the RNA
folds onto itself.

The second 2’ structure is the hairpin loop.

The loop of the hairpin must be at least four bases long to avoid steric hindrance with base-pairing
in the stem part of the structure.
Note: The hairpin reverses the chemical direction of the RNA molecule.

TYPES OF RNA SECONDARY STRUCTURES-II


The RNA 1’ structure folds between its 5’ and 3’ ends to make 2’ structures such as the helix and
the hairpin.
The third type of 2’ structure is bulge loop.

Bulges are formed when a double-stranded region cannot form base pairs perfectly. Bulges can
be asymmetric, with a varying number of unpaired bases on one side of the loop. Bulge loops are
commonly found in helical segments of cellular RNAs and are used to measure the helical twist of
RNA in solution (Tang and Draper 1990).

The fourth type of 2’ RNA structure is the interior loop.

Interior loops are formed by an asymmetric number of unpaired bases on each side of the
loop.

RNA can fold to form helices, bulge loops, and interior loops.

The fifth type of 2’ structure is the Junction or Intersection

Junctions include two or more double-stranded regions converging to form a closed structure.
The unpaired bases appear as a bulge

RNA TERTIARY STRUCTURES
RNA tertiary structures are formed when unpaired bases in the 2’ structure bond with one another.
The 2’ RNA structure is formed by the folding of nucleotides within the RNA molecule, but after folding
some nucleotides remain open for interaction and can form hydrogen bonds with each other.

These unpaired nucleotides of the 2’ structure interact with other unpaired nucleotides and form a third
level of structure called the tertiary (3’) structure. For example, the 4 unpaired nucleotides in a hairpin loop can interact in this way.

The above figure:


1. Indicate how these 2’ structures come together
2. Indicate the difference between internal loop and multi loop
3. Indicate the yet unpaired bases

Unpaired bases in the 3’ structure can become paired through an unusual kind of folding called a pseudoknot;
otherwise they remain available for pairing.

• How can we detect pseudoknots in RNA structures?


CIRCULAR REPRESENTATION OF STRUCTURES
The tertiary (3’) structure of RNA may form pseudoknots. To detect pseudoknots in an RNA
structure we can use a “circular plot”, which is a graphical approach.

Intersecting arcs in the circular plot indicate a pseudoknot.
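The same check can be done without drawing the plot. A minimal sketch, assuming the structure is given as a list of base-pair positions (a representation chosen here for illustration): two arcs (i, j) and (k, l) intersect exactly when i < k < j < l.

```python
from itertools import combinations


def has_pseudoknot(base_pairs):
    """Return True if any two base pairs cross each other.

    base_pairs: iterable of (i, j) sequence positions that are paired.
    """
    # Normalize so that i < j in every pair, then order pairs by start position.
    ordered = sorted(tuple(sorted(pair)) for pair in base_pairs)
    for (i, j), (k, l) in combinations(ordered, 2):
        if i < k < j < l:  # arcs intersect in the circular plot
            return True
    return False


print(has_pseudoknot([(0, 10), (2, 8)]))  # nested pairs -> False
print(has_pseudoknot([(0, 5), (3, 8)]))   # crossing pairs -> True
```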

EXPERIMENTAL METHODS TO DETERMINE RNA STRUCTURES


RNA has 1’, 2’ and 3’ structures. The 1’ structure is the simple nucleotide sequence, the 2’ structure
is the folding of the nucleotides, and the 3’ structure has knots. How can we obtain the 2’ structure?
Measuring RNA Structures:
 X-Ray Crystallography
Principle: Crystallized RNA diffracts X-rays which helps estimate atomic positions
• Requires RNA crystallization
 Atomic Force Microscopy
• Principle: A laser connected to an Si3N4 piezoelectric probe scans an RNA sample
• Works well in air or liquid environment
For measuring the RNA structure we use X-ray crystallography (Smyth and Martin 2000), which
works according to the principle of diffraction. Crystallized RNA diffracts X-rays which helps
estimate atomic positions
All isotopes that contain an odd number of protons and/or of neutrons (see Isotope) have an
intrinsic magnetic moment and angular momentum, in other words a nonzero spin, while all
nuclides with even numbers of both have a total spin of zero. The most commonly studied
nuclei are 1H and 13C, al

Another method for measuring RNA structure is Atomic Force Microscopy. In this
technique, a laser connected to a Si3N4 piezoelectric probe scans an RNA sample. It works well in air and
in liquid environments.

The third method for measuring RNA structure is Nuclear Magnetic Resonance Imaging. In this
method, hydrogen atoms in RNA resonate upon placement in a high magnetic field. It works well
without crystallizing the RNA.
Measuring RNA Structures:
 Nuclear Magnetic Resonance Imaging
• Principle: Hydrogen atoms in RNA resonate upon placement in a high magnetic
field
• Works well without crystallizing RNA
STORAGE OF STRUCTURES
Reported structures are stored in online databases. Examples include RNA Bricks and RMDB.
RNA Bricks is a database of RNA 3D structure motifs and their contacts, both with themselves
and with proteins. Stanford University’s RNA Mapping Database (RMDB) is an archive that contains
results of diverse structural mapping experiments performed on ribonucleic acids.

Strategies for RNA Structure Prediction


RNA 2’ and 3’ structures can be measured experimentally, but RNA molecules readily degrade
due to their short shelf life.
How can we predict 2’ and 3’ structures?
The 2’ structure arises from a given 1’ structure because the simple nucleotide sequence folds on
itself. On the basis of this folding we can predict the stability of the RNA molecule.
Maximizing the number of paired nucleotides gives more structure, and we have to select the structure
according to its stability.
Food for thought:
• Does maximizing nucleotide pairs provide the optimal 2’ structure?
• If 2 or more optimal structures are possible, which one should we select?
• Can instability be factored in as well?

Dot Plots for RNA 2' Structure Prediction


Structure measurement through experiments is slow and costly, and there is a high chance
that more than one structure exists.
The dot plot method for RNA structure prediction is easy. Draw a square and partition it by
drawing gridlines. Put the RNA sequence on the top and left sides of the square. Put a “DOT”
wherever the nucleotides are complementary. For example:

In longer RNA sequences, the gaps between complementary nucleotides become the bulges and loops of
the structure.
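The dot plot construction can be sketched in a few lines. The sketch assumes that A-U, G-C and the G-U wobble pair count as complementary; the function names and the short test sequence are hypothetical. Runs of dots along an anti-diagonal correspond to stretches that can form a helix.

```python
# Pairs treated as complementary (Watson-Crick pairs plus the G-U wobble pair).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def dot_plot(seq):
    """Return a square grid: grid[row][col] is True where the bases can pair."""
    return [[(a, b) in PAIRS for b in seq] for a in seq]


def render(seq):
    """Draw the grid as text, with the sequence on the top and left sides."""
    lines = ["  " + " ".join(seq)]
    for base, row in zip(seq, dot_plot(seq)):
        lines.append(base + " " + " ".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


print(render("GGAAACC"))
```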

1. ENERGY BASED METHODS


Experimental determination of RNA structure is slow and costly; that is why only a few 2’ RNA structures
have been reported experimentally.
During prediction we get many possible 2’ structures of an RNA, and to select the optimal structure
we calculate their overall stability.
• STABILIZING ENERGY
An energy table helps us find the optimal predicted structure, because energy is released when
complementary nucleotides form bonds.
• DESTABILIZING ENERGY
The remaining unpaired nucleotides destabilize the RNA structure, for example in a hairpin or bulge.
Example: the eye of the hairpin and the bulge.

• SUM OF ENERGIES
Sum of stabilizing and destabilizing energies can help determine the quality of a 2’ RNA
structure.
2’ structure with longest coupled sequences vs. one with lowest energy

a. Zuker’s Algorithm
• Compute the stabilizing energies (-ve values)
• Compute the destabilizing energies (+ve values)
• Compute sum of +ve and –ve energies
Energy based methods involve evaluating the free energy of structures. To predict the optimal
2’ structure for an RNA sequence we use Zuker’s Algorithm.
It computes the energies of all possible 2’ structures, generates combinations of all computed 2’
structures and selects the one with the lowest energy.

Zuker’s Algorithm EXAMPLE


Zuker’s Algorithm involves computing stabilizing and destabilizing energies of a 2’ structure. All
possible 2’ structures are generated. The best 2’ structure is selected!

Zuker’s Algorithm – A Flow Chart


Zuker’s Algorithm involves computing stabilizing and destabilizing energies of 2’ structure. And
it also computes the overall energy by summing up the positive and negative energies.

The two diagonals (‘D’) given above include:


1. A/U, C/G, G/C, U/G
2. G/U, U/G

The flow chart for energies
Martinez Algorithm
Important Points:
• Since each 2’ structure is weighed by its stability, the optimal structure is quickly
shortlisted
• Monte Carlo (MC) simulations do not provide a definitive solution!!
In the Martinez algorithm, each 2’ structure is weighed by its stability and the optimal one is
shortlisted. Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational
algorithms that rely on repeated random sampling to obtain numerical results. They are often
used in physical and mathematical problems and are most useful when it is difficult or
impossible to use other mathematical methods.
Monte Carlo methods do not provide a definitive solution.
2. Dynamic Programming Approaches
A 2’ RNA structure may involve a long nucleotide sequence with a large number of possible
combinations, so it is hard to find the optimal one. For this prediction we use Dynamic
Programming (DP), which breaks a larger problem into smaller ones.
PRINCIPLE OF DYNAMIC PROGRAMMING
To select the optimal combination of structures with Dynamic Programming (DP), we take the
RNA nucleotide sequence and list all the possible complementary positions for the nucleotides in the
complete sequence.
Dynamic Programming then recombines these combinations in a process called “Traceback” to
ensure that the 2’ structure with the most coupling is reported.

a. Nussinov: An Overview
The Nussinov-Jacobson (NJ) Algorithm is a Dynamic Programming (DP) strategy for predicting optimal
RNA 2’ structures, proposed in 1980. It computes the 2’ structure with the most nucleotide coupling.
HOW IT WORKS
➢ Create a matrix with RNA sequences on top and right
➢ Set diagonal & lower tri-diagonal to zero
➢ Start filling each empty position in matrix by choosing the maximum of 4 scores

Nussinov EXAMPLE
The main points to be focused in N-J Algorithm are:
➢ Scoring Matrix
➢ Matrix Initialization

➢ Scoring method
➢ The 4 different positions to be considered for calculating matrix
Each matrix position is filled by considering four different positions: the Left, Bottom, Diagonal and
Left/Bottom elements. In this way the coupling of all complementary nucleotides is catered for.

Score Calculations & Traceback


A score contribution is calculated from each of the four positions, and the maximum score is
selected.
• There can be multiple tracebacks
• Each traceback can be used to construct an RNA secondary structure
• Select the traceback with the highest number of coupled nucleotides
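The matrix fill and traceback can be sketched as follows. This is a simplified version of the Nussinov-Jacobson recursion, not a reference implementation: it assumes A-U, G-C and G-U pairs each score 1, ignores the minimum hairpin loop length, and reports only one of the possible tracebacks. Function names are hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fill_matrix(seq):
    """Fill the upper triangle with the maximum number of base pairs."""
    n = len(seq)
    score = [[0] * n for _ in range(n)]  # diagonal and lower triangle stay 0
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            options = [
                score[i + 1][j],  # bottom: base i left unpaired
                score[i][j - 1],  # left: base j left unpaired
                score[i + 1][j - 1] + ((seq[i], seq[j]) in PAIRS),  # diagonal
            ]
            # left/bottom: split the interval into two independent parts
            options += [score[i][k] + score[k + 1][j] for k in range(i + 1, j)]
            score[i][j] = max(options)
    return score


def traceback(seq, score, i, j, pairs):
    """Recover one optimal set of base pairs from the filled matrix."""
    if i >= j:
        return
    if score[i][j] == score[i + 1][j]:
        traceback(seq, score, i + 1, j, pairs)
    elif score[i][j] == score[i][j - 1]:
        traceback(seq, score, i, j - 1, pairs)
    elif (seq[i], seq[j]) in PAIRS and score[i][j] == score[i + 1][j - 1] + 1:
        pairs.append((i, j))
        traceback(seq, score, i + 1, j - 1, pairs)
    else:
        for k in range(i + 1, j):
            if score[i][j] == score[i][k] + score[k + 1][j]:
                traceback(seq, score, i, k, pairs)
                traceback(seq, score, k + 1, j, pairs)
                break


def fold(seq):
    """Return (number of pairs, dot-bracket structure) for a non-empty sequence."""
    score = fill_matrix(seq)
    pairs = []
    traceback(seq, score, 0, len(seq) - 1, pairs)
    structure = ["."] * len(seq)
    for i, j in pairs:
        structure[i], structure[j] = "(", ")"
    return score[0][len(seq) - 1], "".join(structure)


print(fold("GGAAACC"))  # (2, '((...))')
```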
Comparison of Algorithms
RNA has three different structures: 1’, 2’ and 3’. There are many algorithms for predicting these
structures.
But across all algorithms there are two main strategies:
1. Nucleotides stacking
2. Energy minimization
• NUCLEOTIDES STACKING ALGORITHM.
The NJ Algorithm comes under this category. It involves maximizing nucleotide pairing. Traceback
helps to find the best 2’ structure.
It predicts 2’ structures with about 75% accuracy, because there may be two or more equal scores, as each score is
calculated from four different positions.
• ENERGY BASED ALGORITHM.
Zuker’s Algorithm involves energy minimization. It is an updated, improved approach that incorporates
phylogenetic information and accommodates pseudoknots. This algorithm helps to predict the
structures of RNA from its nucleotides.

WEB RESOURCES-I
The mfold web server is one of the oldest web servers in computational molecular biology.
Mfold implements an upgraded version of Zuker’s algorithm.
MFOLD is computationally expensive and can give 2’ structure results for 1’ sequences of
fewer than 8000 nucleotides.

Conclusion:
• Given 1’ RNA structures, RNA 2’ structures can be predicted
• Several algorithms exist for predicting 2’ structures
• These include Zuker, Martinez and NJ etc.
Online tools

• MFOLD – Implements an upgraded version of Zuker’s Algorithm


• MFOLD outputs a set of possible structures within a given energy range
• It also indicates their reliability

Chapter 5 - Protein Sequences


FROM DNA/RNA SEQUENCES TO PROTEIN
A set of three nucleotides, called a codon, codes the information for a specific amino acid in protein
synthesis.
Codons select the amino acids, and ribosomes make the protein by a polymerization process; the
resulting amino acid chain coils to form a 3D structure.

CODING OF AMINO ACIDS


Nucleotides (A, G, C, and T) form sets of three, called codons, which select amino acids in protein
synthesis. More than one codon can code for the same amino acid, as there are 64 possible codons
but only 20 amino acids involved in protein synthesis.
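A short sketch of the coding scheme, using the standard genetic code written as a compact 64-letter string (codons ordered with bases T, C, A, G; `*` marks a stop codon). The `translate` helper is a hypothetical name introduced for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon (trailing partial codons are ignored)."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - len(dna) % 3, 3)
    )


print(translate("ATGAAATGGTAA"))  # MKW*

# Several codons map to the same amino acid, e.g. leucine (L):
leucine_codons = [codon for codon, aa in CODON_TABLE.items() if aa == "L"]
print(len(leucine_codons))  # 6
```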

OPEN READING FRAMES


Codons code the information for amino acids; there are three stop codons and one start codon.
Among candidate open reading frames, the longest sequence is taken as the valid one.

In molecular genetics, an open reading frame (ORF) is the part of a reading frame that has the
potential to code for a protein or peptide. An ORF is a continuous stretch of codons that does not contain
a stop codon (usually UAA, UAG or UGA).

Six reading frames exist in any DNA sequence (three on each strand). The longest ORF is marked, and the
first stop codon marks the end of the protein.
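ORF extraction over the six frames can be sketched as below. The sketch assumes DNA input (T instead of U), ATG as the start codon and TAA/TAG/TGA as stop codons; the function names and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse strand read in the 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame):
    """Yield every ORF (start codon through first stop codon) in one frame."""
    start = None
    for pos in range(frame, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if start is None and codon == "ATG":
            start = pos
        elif start is not None and codon in STOP_CODONS:
            yield dna[start:pos + 3]
            start = None


def longest_orf(dna):
    """Scan three frames on each strand and return the longest ORF found."""
    candidates = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            candidates.extend(orfs_in_frame(strand, frame))
    return max(candidates, key=len, default="")


print(longest_orf("CCATGAAATGGTAACC"))  # ATGAAATGGTAA
```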

ORF Extraction – A Flowchart


Selection of an ORF is based on its length: if it is the longest one among the candidates, it is
considered suitable for protein synthesis.

Both the reverse and forward RNA sequences are considered. They may contain many ORFs, and selection
is based on which gives the longest protein sequence.

SEQUENCING PROTEINS
Edman degradation, developed by Pehr Edman, is a method of sequencing amino acids in a
peptide.

In this method, the amino-terminal residue is labeled and cleaved from the peptide without
disrupting the peptide bonds between other amino acid residues. It starts from the N-terminal and
removes one amino acid at a time.
Peptides are degraded cyclically by Phenyl-iso-thio-cyanate (PhNCS). PhNCS attaches to the free
amino group at the N-terminal residue, and one amino acid is removed as a PhNCS derivative.


DRAWBACKS
It is restricted to chains of about 60 residues.
It is a very time consuming process (40-50 amino acids per day).

The modern technique for this is Tandem mass spectrometry.

Application of Mass Spectrometry in Protein Sequencing
Proteins can be charged with electrons or protons, and if moving charges are placed in a
magnetic field they get deflected. Their deflection depends on their momentum.
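The symbols defined next belong to the force equation for a charged particle. Assuming it is the standard combination of Newton's second law and the Lorentz force, it can be written as:

```latex
\vec{F} = m\,\vec{a} = Q\left(\vec{E} + \vec{v} \times \vec{B}\right)
```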

Where:
F is the force applied to the ion 
m is the mass of the particle
a is the acceleration 
Q is the electric charge,
E is the electric field
v × B is the cross product of the ion's velocity and the magnetic flux density.
Components of MS
Sample Injection
Ionization Source
Mass Analyzer
Ion Detector
Spectra search using computational tools

CONCLUSION
Charged proteins can be set into motion within a magnetic field. Their deflections accurately
correspond to their molecular mass. Deflections can be measured (hence protein’s mass)

Techniques for MS Proteomics


MS proteomics works on the principle of ionizing proteins and placing them in a very high magnetic
field. Each protein is deflected in proportion to its molecular weight, and in this way its
molecular mass is measured.
The mass of an unknown protein is compared with the masses of proteins in a database and the
matching one is selected.
Examples of protein sequence databases are UniProt and Swiss-Prot.

Types of MS-based proteomics


Proteins can be sequenced by Edman degradation and by mass spectrometry. MS based proteomics
helps us sequence larger proteins more quickly.
Following steps are involved in MS:
Separation
Ionization
Mass analysis
Detection

Two methodologies are involved


1. Bottom up proteomics
2. Top down proteomics

Bottom up proteomics measures the masses of the peptides produced after enzymatic digestion of the protein.
Top down proteomics measures the intact proteins, followed by the peptides produced after fragmentation.
1. BOTTOM UP PROTEOMICS
In this methodology the protein complex is treated with site specific enzymes which cleave (chop)
the proteins into peptides, and the masses of the resulting peptides are measured. One peptide is
selected at a time for processing, and when all have been processed a protein search engine is used
to find matches.
2. TOP DOWN PROTEOMICS
In this methodology proteins are ionized and their masses are measured. One protein is mass
selected at a time for fragmentation, and the masses of the resulting peptide fragments are measured.
We can say that bottom up proteomics deals with peptides while top down proteomics can handle
the whole protein.
PROTOCOL
1. A sample containing a mixture of proteins from cells and tissues is obtained.
2. An enzyme such as trypsin is used to cleave the proteins.
3. The enzyme cleaves the protein at specific amino acid sites.
4. Several peptides are formed when the protein is cleaved.
5. The number of peptides depends upon the number of sites where the enzyme cleaves the protein. For
example, trypsin cleaves the protein at lysine (K).
6. Each peptide is measured for its mass (MS1).
7. One peptide is selected at a time.
8. A different enzyme is used to cleave the protein at different sites.
9. Each resulting peptide is measured for its mass (MS3).
10. This process keeps going until all possible peptides have been formed.
11. Each peptide is searched in protein sequence databases for matches.
Conclusion
• Bottom up proteomics measures the mass of peptides
• Peptides result after enzymatic digestion of precursor proteins
• Peptides are searched in protein databases
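The enzymatic digestion step can be imitated in software. A minimal sketch, assuming the commonly used trypsin rule of cleaving after lysine (K) or arginine (R) unless the next residue is proline (P); the function name and the example protein sequence are hypothetical.

```python
def trypsin_digest(protein):
    """Split a protein sequence into tryptic peptides.

    Cleaves after K or R, except when the following residue is P.
    """
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


print(trypsin_digest("MKWVTFISLLLLFSSAYSRGVFRR"))
# ['MK', 'WVTFISLLLLFSSAYSR', 'GVFR', 'R']
```

The number of peptides follows from the number of cleavage sites, so a different enzyme rule gives a different peptide set for the same protein.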

Two Approaches for Bottom up Proteomics (BUP)
There are two approaches for bottom up proteomics.
1. Peptide Mass Fingerprinting.
2. Shotgun Proteomics

Shotgun Proteomics digests the whole protein mixture first, and the resulting peptides are compared with the database.
Peptide mass fingerprinting involves protein separation followed by analysis of a single protein’s
peptides.

PROTEIN SEQUENCE IDENTIFICATION


Mass spectrometry helps us measure the molecular weight of proteins and peptides, but several
proteins can have the same mass. To identify them we follow a flow chart of the following techniques.

PROTEIN IONIZATION TECHNIQUES


Protein ionization is used in mass spectrometry based proteomics protocols. Ionization involves
loading a proton onto the protein or removing a proton from it. Ionization can therefore increase or
decrease the mass of the protein or peptide.

SALIENT IONIZATION TECHNIQUES
These include Matrix Assisted Laser Desorption Ionization (MALDI) and Electrospray
Ionization (ESI).

For example:

MALDI
In this technique one proton is added to the protein or peptide, so the molecular weight increases
by one and mass spectrometry reports the molecule at +1.
ESI

ESI adds many protons to the protein or peptide, and the molecular weight increases by the number of
protons added. It is difficult in ESI to find the molecule with +1.

MS data from MALDI ionization is easier to handle, as the product ion masses are mostly at
“1+mass”. ESI is more difficult to use, as it does not easily give away the +1 charged ion.
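The effect of added protons on the reported value can be sketched with the usual mass-to-charge relation m/z = (M + z * 1.007276) / z, where M is the neutral mass, z the number of protons added and 1.007276 Da the approximate proton mass. Function names and the example mass are hypothetical.

```python
PROTON_MASS = 1.007276  # approximate mass of a proton in daltons


def mass_to_charge(neutral_mass, charge):
    """m/z observed for a molecule carrying `charge` extra protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def to_singly_charged(mz_value, charge):
    """Convert a multiply charged peak to the equivalent +1 peak."""
    neutral_mass = mz_value * charge - charge * PROTON_MASS
    return neutral_mass + PROTON_MASS


peptide_mass = 1000.0  # hypothetical neutral mass
for z in (1, 2, 3):    # MALDI mostly gives z = 1, ESI gives several z values
    print(z, round(mass_to_charge(peptide_mass, z), 4))

# An ESI peak at charge 2 maps back to the same +1 value that MALDI would report.
print(round(to_singly_charged(mass_to_charge(peptide_mass, 2), 2), 4))
```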
MS1 & INTACT PROTEIN MASS
When we ionize a protein, it can be deflected by a magnetic field in proportion to its mass, and
the mass of the protein can be measured by spectrometry.
The mass/charge ratio helps us calculate the mass of the protein. “Mass Select” can help select a specific MS1 peak for
further analysis.
MS1 gives the intact masses of the peptides.
SCORING INTACT PROTEIN MASS

MS1 helps us obtain the intact masses of precursor molecules, which depend upon the
proteomics protocol applied. Protein masses reported by MS1 are matched against a protein
database, but before matching, the masses of all molecules are converted to their +1 charge state.
 SCORING
We score each protein so that the best matches get the maximum score and low quality matches
get low scores.
After filtering out the multiple charges we keep only the peaks with charge 1, and after this filter
we compare them with the protein database.

Conclusions:
• Experimental mass reported from MS1 is matched with theoretical mass of proteins in the database
• Score is awarded on the basis of the closeness between experimental and theoretical mass
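One simple way to turn closeness into a score is sketched below. The linear scoring function, the tolerance and the three-entry database are all hypothetical, chosen only to illustrate ranking candidates by how close their theoretical mass is to the experimental one.

```python
def mass_score(experimental, theoretical, tolerance=1.0):
    """Score from 1.0 (exact match) down to 0.0 (error at or beyond tolerance)."""
    error = abs(experimental - theoretical)
    return max(0.0, 1.0 - error / tolerance)


def rank_candidates(experimental, database, tolerance=1.0):
    """Return (name, score) tuples sorted from best to worst match."""
    scored = [
        (name, mass_score(experimental, mass, tolerance))
        for name, mass in database.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical theoretical masses (in daltons).
database = {"P1": 1000.0, "P2": 1000.5, "P3": 1200.0}
ranked = rank_candidates(1000.1, database)
print(ranked[0][0])  # P1
```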
PROTEIN FRAGMENTATION TECHNIQUES
We compare the experimental mass with the theoretical masses of proteins in the database, and rank or
score each candidate on the basis of closeness.
If several proteins have the same score, selection is done using another technique: protein
fragmentation. We fragment the protein or peptide and ionize the fragments, which lets us measure the
fragment masses in the same way as their precursor.
There are different techniques for protein fragmentation.
Electron Capture Dissociation (ECD)
Electron Transfer Dissociation (ETD)
Collision Induced Dissociation (CID)

Each fragmentation technique gives result of specific type of fragments.


ECD gives out ‘C’ and ‘Z’ ions. CID gives out ‘B’ and ‘Y’ ions, etc.
If we measure the masses of the fragments using MS and calculate the theoretical masses of the fragments,
we can award a score on the basis of the similarity between experimental and theoretical masses.
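Theoretical fragment masses can be sketched for the ‘B’ and ‘Y’ ions produced by CID. The sketch assumes singly charged fragments and commonly tabulated monoisotopic residue masses (approximate values); the function names, the peptide and the peak list are hypothetical.

```python
# Approximate monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return singly charged (b_ions, y_ions) m/z lists for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    # b ions hold the N-terminal residues; y ions hold the C-terminal residues.
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b_ions, y_ions


def count_matches(experimental_peaks, theoretical_peaks, tolerance=0.02):
    """Number of experimental peaks within `tolerance` of a theoretical fragment."""
    return sum(
        any(abs(peak - theo) <= tolerance for theo in theoretical_peaks)
        for peak in experimental_peaks
    )


b_ions, y_ions = fragment_ions("GASK")
print([round(m, 4) for m in b_ions])
print([round(m, 4) for m in y_ions])
print(count_matches([58.03, 147.11, 300.0], b_ions + y_ions))  # 2
```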
TANDEM MS
Tandem MS (or MS/MS, MSn) is a technique to break down selected ions (precursor ions) into
fragments (product ions).
Tandem MS can be extended to the fragments of the intact fragment. All you need is the MS
instrument’s capability to (i) select the fragment’s mass range and (ii) fragment the precursor fragment.
Tandem MS helps us measure the masses of fragments, which makes scoring and protein
identification easier.

MEASURING EXPERIMENTAL FRAGMENT’S MASS


In MS1, the molecular weight of the intact sample molecule is measured. The intact molecule is then
fragmented in two, and these two fragments are measured by MS2.

FRAGMENTATION TECHNIQUES AND MOLECULAR WEIGHT
Fragmentation techniques include ECD, CID etc.
Intact molecule fragmentation splits the molecule into two parts.

FRAGMENT MASS
The fragment masses produced by MS2 depend upon the technique, because each technique
splits the protein or peptide at a different location.
Experimental mass reported from MS2 is matched with theoretical peptides of candidate proteins
(from DB). Score is awarded on the basis of the closeness between experimental and theoretical
masses.
