Retrieval of Data

Database searching involves aligning nucleotide or protein sequences with database sequences to identify similarities, with primary databases containing experimentally derived data. Methods like BLAST and FASTA are used for efficient searching, providing local alignments that are more informative than global alignments. Secondary databases offer additional insights into sequence relationships and motifs, enhancing the analysis of biological sequences.

Uploaded by

tassera9

What is database searching?
• Database searching is the matching of a query nucleotide or protein
sequence against database sequences. To do this, we align the query
sequence with the database sequences to find similarities between them.
Database searching applies the knowledge gained from previous
biological experiments to the gene discovery problem.
• A DNA sequence comprises only 4 nucleotides, while a protein
sequence is made up of 20 amino acids. Because the larger alphabet
makes chance matches less likely, it is easier to search for similar
patterns in proteins than in DNA. For a database search with
protein-coding DNA, it is therefore better to translate the DNA into
its protein sequence for a more reliable result.
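The translation step can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and reading frame 1 only; the function name `translate` and the example sequence are made up for this sketch and are not part of any search tool.

```python
# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence (reading frame 1) into protein.

    Translation stops at the first stop codon; a trailing partial codon
    is ignored. Unrecognised codons (e.g. containing N) become 'X'.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


query_dna = "ATGGCCAAGTAA"   # made-up example sequence
print(translate(query_dna))  # MAK
```

The resulting protein string, rather than the DNA, would then be used as the query in a protein database search.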
Types of database searching
• Primary database searching: BLAST, FASTA
• Secondary database searching: motif search, pattern search
Primary database searching
• A primary sequence is one that has been experimentally
determined, and a primary database is one that contains
experimentally derived data.
• Searching a database in an efficient manner is a matter of prime
importance.
• Methods that can run on small databases may not be effective
with larger databases in terms of time and space efficiency.
• Because databases hold a large amount of information, it is difficult
to perform a database search using dynamic programming: it is
computationally intensive and therefore too slow.
• To save time and space, heuristic methods (BLAST and FASTA)
are used for database searches.
• These programs are local similarity search programs that
return short matches.
• These short matches, called local alignments (segments of the
sequences showing similarity), are often more informative than global
alignments (similarity across the complete sequences).
• Local alignments can recover highly conserved regions
even in two sequences that do not produce
any reasonable alignment when aligned globally.
• Dynamic programming algorithms can also return local alignments and
are guaranteed to find the best alignment, but their mathematical
rigour makes them very slow.
• BLAST and FASTA are the major primary database search methods.
They are very fast, though less sensitive than dynamic programming.
FASTA
• FASTA was developed by Pearson and Lipman in the 1980s (its
predecessor FASTP appeared in 1985, FASTA itself in 1988) and was so
named because it was faster than other methods used for sequence
alignment at that time.
• FASTA uses the Pearson and Lipman algorithm for similarity searches
between a query sequence and a database sequence.
• Given a query sequence, FASTA searches for local alignments with the
sequences in the database.
• The original program, FASTP, was designed for protein sequence
similarity.
• It is a rapid alignment program for protein and DNA sequence
pairs.
• It does not compare the sequences residue by residue; matching
short words instead saves time.
• The input sequence must be in FASTA format for alignment.
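As an illustration of the FASTA input format, the sketch below reads FASTA-formatted text into a dictionary. In this format, each record starts with a header line beginning with `>`, followed by one or more lines of sequence. The function name `parse_fasta` and the example records are assumptions made for this sketch.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # identifier = first word after '>'
            records[name] = ""
        elif name is not None:
            records[name] += line               # sequence may span several lines
    return records


# Made-up example input with two records.
example = """>query1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>query2
ACGTACGT
"""

for identifier, sequence in parse_fasta(example).items():
    print(identifier, len(sequence))
```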
Basic FASTA programs
Program name      Query sequence         Database sequence      Algorithm used
Nucleotide FASTA  Nucleotide             Nucleotide             DNA/RNA FASTA, FASTM, FASTS
Protein FASTA     Protein                Protein                FASTA, SSEARCH, FASTS, FASTF
FASTX/FASTY       Translated nucleotide  Protein                -
TFASTX/TFASTY     Protein                Translated DNA         -
TFASTS            Peptides               Translated DNA         -
BLAST (Basic Local Alignment Search Tool)
BLAST is a local similarity search program that compares nucleotide or
protein sequences to sequence databases and calculates the statistical
significance of the matches. BLAST searches are used to infer functional
and evolutionary relationships between sequences and to identify
members of gene families.
It is a heuristic simplification of the Smith-Waterman algorithm and is
faster than FASTA.
It is the algorithm most commonly used for database searching and
sequence alignment. The original BLAST looks for similar regions in two
sequences without allowing gaps, though gapped versions (such as
WU-BLAST) are now available.
It is more selective and less sensitive.
FASTA is more sensitive than BLAST for nucleotide sequences.
BLAST (word size 3) is more sensitive for protein
sequences than FASTA (word size 2).
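The role of the word size can be illustrated with a simplified seeding step: index every word of a given length in the query, then look up each word of the database sequence. This is only a sketch of the general idea, assuming exact word matches; the real programs also score the hits and extend them, and the function name `word_hits` and both sequences are made up for the example.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, word_size: int) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where identical words start."""
    # Index the start positions of every word in the query.
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)

    # Scan the subject once and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


query = "MKTAYIAK"    # made-up sequences
subject = "GGTAYIAA"
print(word_hits(query, subject, 3))  # [(2, 2), (3, 3), (4, 4)]
print(word_hits(query, subject, 2))  # [(2, 2), (3, 3), (4, 4), (5, 5)]
```

A smaller word size yields more seed hits for the same pair of sequences, which means more candidate regions to examine at a higher computational cost.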
Basic BLAST programs
Program name      Query sequence         Database sequence      Algorithm used
Nucleotide BLAST  Nucleotide             Nucleotide             BLASTN, MEGABLAST
Protein BLAST     Protein                Protein                BLASTP, PHI-BLAST, PSI-BLAST
BLASTX            Translated nucleotide  Protein                BLASTP
TBLASTN           Protein                Translated nucleotide  BLASTP
TBLASTX           Translated nucleotide  Translated nucleotide  BLASTP
Secondary database searching
• Primary database searching does not always provide a satisfactory
answer to the questions of sequence analysis.
• The presence of highly repetitive and low-complexity sequences can
result in irrelevant matches and may even complicate the interpretation.
• Secondary databases provide information about the relationship of a
given sequence to other sequences within a multiple alignment, along
with further information (family, domain and motif), depending on the
method used.
• These databases contain the results of primary sequence analysis.
• Some important secondary database searches are motif or pattern
searches and profile searches.
• PROSITE is a database and a tool consisting of documentation entries
describing protein domains, families and functional sites, as well as
the associated patterns and profiles used to identify them.
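A pattern search can be sketched by converting a PROSITE-style pattern into a regular expression. The sketch assumes the commonly described pattern syntax: elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repetition, and `<` / `>` for anchoring to the sequence ends. The function name `prosite_to_regex` and the test sequence are made up, and the converter covers only these elements.

```python
import re

# One pattern element: a residue spec with an optional repeat count.
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern string into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    prefix = "^" if pattern.startswith("<") else ""
    suffix = "$" if pattern.endswith(">") else ""
    pattern = pattern.lstrip("<").rstrip(">")

    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            residue = "."                         # x = any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # {..} = any residue except these
        if low and high:
            residue += "{" + low + "," + high + "}"
        elif low:
            residue += "{" + low + "}"
        parts.append(residue)
    return prefix + "".join(parts) + suffix


# The widely cited N-glycosylation site pattern, used here as an example.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)  # N[^P][ST][^P]

sequence = "MKNGSAAA"  # made-up sequence
for hit in re.finditer(regex, sequence):
    print(hit.start(), hit.group())  # 2 NGSA
```

Note that `re.finditer` reports non-overlapping matches only; a fuller implementation would also need to handle overlapping sites.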
Motif search
• Motifs are specific geometric arrangements of protein
secondary structure elements (alpha helices, beta strands and loops).
Some motifs are associated with a particular function and
some are part of other structural and functional
arrangements. Simple motifs combine to form
complex motifs. They are biologically conserved regions
of protein sequences.
Types of motif
• The Greek key motif
• Hairpin beta motif
• Beta-alpha-beta motif
• Helix-loop-helix motif
Similarity and identity
These terms describe the relationship between two proteins.
• A residue position at which both sequences being compared
have the same residue is called an identical residue.
• A residue position at which the two sequences being
compared have amino acids with similar properties is called a
similar residue.
• Similarity is the likeness (resemblance) between two
sequences being compared, while identity is the number of
characters that match exactly between the two
sequences.
For example:
A F N T T   (Seq1)
:   | | :
L N N T S   (Seq2)

A/L and T/S are similar residues. Similar residues are
marked with a colon (:). The residues N and T are
identical in this example, which is marked
with a solid line (|).
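The example can be reproduced with a short script that marks each position and reports percent identity and similarity. The residue groups in `SIMILAR_GROUPS` are a hypothetical grouping chosen for illustration (real tools derive similarity from a substitution matrix), and similarity is counted here as identical plus similar positions, which is one common convention.

```python
# Hypothetical grouping of amino acids with similar properties
# (an assumption for illustration, not a standard definition).
SIMILAR_GROUPS = [set("AVLIM"), set("ST"), set("FYW"), set("DE"), set("KRH"), set("NQ")]


def compare_aligned(seq1: str, seq2: str) -> tuple[str, float, float]:
    """Return (match line, % identity, % similarity) for two ungapped,
    equal-length aligned sequences."""
    if not seq1 or len(seq1) != len(seq2):
        raise ValueError("sequences must be non-empty and of equal length")
    marks = []
    for a, b in zip(seq1, seq2):
        if a == b:
            marks.append("|")   # identical residues
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            marks.append(":")   # similar residues
        else:
            marks.append(" ")   # neither identical nor similar
    line = "".join(marks)
    identity = 100 * line.count("|") / len(line)
    similarity = 100 * (line.count("|") + line.count(":")) / len(line)
    return line, identity, similarity


line, identity, similarity = compare_aligned("AFNTT", "LNNTS")
print("AFNTT")
print(line)
print("LNNTS")
print(f"identity {identity:.0f}%, similarity {similarity:.0f}%")  # identity 40%, similarity 80%
```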
Sequence alignment
• In bioinformatics, a sequence alignment is a way of
arranging the sequences of DNA, RNA, or protein to
identify regions of similarity that may be a consequence
of functional, structural, or evolutionary relationships
between the sequences.
Pairwise sequence alignment
• Pairwise alignment is used to identify regions of similarity that
may indicate functional, structural or evolutionary
relationships between two biological sequences.
• EMBOSS, LAGAN, Bl2seq, Dotlet and Dotter are
common tools for pairwise sequence alignment.
• It is of two types: local alignment and global
alignment.
Local alignment
• If the two given sequences are not very similar and it is
difficult to align them across their full length,
local alignment can be used instead.
• Local alignment provides information about conserved
regions or domains. From these conserved regions it is
possible to get an idea of the evolutionary history.
• Local alignment is often more meaningful than global alignment, as
it can achieve some alignment even with sequences that are
not very similar. It can also be used to align sequences of
unequal length, or when only a conserved domain is shared by the
two sequences.
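Local alignment by dynamic programming can be sketched with the Smith-Waterman recurrence, in which a cell score is never allowed to drop below zero and the traceback starts at the highest-scoring cell. The scoring values (match +2, mismatch -1, linear gap -2) and the two sequences are arbitrary choices for this illustration, and the function returns one optimal local alignment.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never goes below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Made-up sequences that share only the segment GATTACA.
print(smith_waterman("CCCCGATTACA", "GATTACATTTT"))  # (14, 'GATTACA', 'GATTACA')
```

The shared segment is recovered even though the flanking regions have nothing in common, which is the situation where a full-length alignment would be uninformative.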
Global alignment
• Global alignment is done across the entire length of the
sequences, including matched characters, gaps and
mismatches.
• Choosing different mismatch and gap penalties may
produce different alignments for the same sequences.
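The effect of the gap penalty can be seen in a small global alignment sketch based on the Needleman-Wunsch recurrence. The scoring values and the two four-letter sequences are arbitrary choices for illustration, and the traceback returns one optimal alignment when several exist.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap          # leading gaps in a

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("AGCT", "ACGT", gap=-1))  # (1, 'A-GCT', 'ACG-T')
print(needleman_wunsch("AGCT", "ACGT", gap=-3))  # (0, 'AGCT', 'ACGT')
```

With a cheap gap, the alignment opens two gaps to gain a third match; with an expensive gap, it accepts two mismatches instead.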
Multiple sequence alignment
• Multiple sequence alignment requires more than two sequences.
• A database search usually reveals many homologous sequences. The
residues of the homologous sequences are aligned together in columns
for multiple sequence alignment.
• While aligning, wherever a sequence does not possess an amino acid
at a particular position, the position is denoted by a dash.
• Highly identical sequences are used to give meaningful results.
These multiple sequence alignments can be used to establish
phylogenetic relationships.
• ClustalW, T-Coffee, Multalin, DCA, HMMER and DIALIGN are tools
for multiple sequence alignment.
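The column layout of a multiple sequence alignment can be illustrated by reading a finished alignment column by column and reporting the most common residue at each position. The aligned sequences below are made up, the function name `column_consensus` is hypothetical, and the sketch assumes the alignment has already been produced by one of the tools above.

```python
from collections import Counter


def column_consensus(alignment: list[str]) -> str:
    """Return the most common residue in each column, ignoring gaps ('-')."""
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*alignment):           # one tuple of residues per column
        residues = [c for c in column if c != "-"]
        if residues:
            consensus.append(Counter(residues).most_common(1)[0][0])
        else:
            consensus.append("-")            # column made up of gaps only
    return "".join(consensus)


# Made-up alignment; dashes mark positions where a sequence has no residue.
aligned = [
    "MKT-AY",
    "MKTLAY",
    "MRT-AF",
    "MKSLAY",
]
print(column_consensus(aligned))  # MKTLAY
```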
Homologous gene
• A homologous gene is a gene inherited in two species from a
common ancestor. Homologous genes can be
similar in sequence, but similar sequences are not necessarily
homologous.
Orthologous and paralogous gene sequences
• Both orthologs and paralogs are types of homologs.
• Orthologs are homologous genes that diverged
after a speciation event, while the gene and its main
function are conserved.
• If a gene is duplicated within a species, the resulting
duplicated genes are paralogs of each other, even though
over time they may diverge in sequence
composition and function.
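[Figure: globin gene example. A gene duplication of the ancestral globin
gene gives the alpha chain gene and the beta chain gene. A later
speciation event gives frog and mouse copies of each. The frog and mouse
copies of the same chain are orthologs, the alpha and beta chain genes
are paralogs, and all of them are homologs.]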
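The globin example can be expressed as a small gene tree in which every internal node is labelled with the event that produced it. In this sketch, two genes are called orthologs if their last common ancestor is a speciation node and paralogs if it is a duplication node. The tree encoding, the gene names and the function names are all assumptions made for illustration.

```python
# Internal nodes are (event, left, right); leaves are gene names.
GLOBIN_TREE = (
    "duplication",
    ("speciation", "frog_alpha", "mouse_alpha"),
    ("speciation", "frog_beta", "mouse_beta"),
)


def leaves(node) -> set[str]:
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def relationship(node, gene_a: str, gene_b: str) -> str:
    """Classify two distinct genes by the event at their last common ancestor."""
    if isinstance(node, str):
        raise ValueError("genes must be two distinct leaves of the tree")
    event, left, right = node
    # Descend while both genes sit in the same subtree.
    for child in (left, right):
        child_leaves = leaves(child)
        if gene_a in child_leaves and gene_b in child_leaves:
            return relationship(child, gene_a, gene_b)
    all_leaves = leaves(node)
    if gene_a not in all_leaves or gene_b not in all_leaves:
        raise ValueError("gene not found in tree")
    return "orthologs" if event == "speciation" else "paralogs"


print(relationship(GLOBIN_TREE, "frog_alpha", "mouse_alpha"))  # orthologs
print(relationship(GLOBIN_TREE, "frog_alpha", "frog_beta"))    # paralogs
```

Every pair of genes in the tree is homologous, since all of them descend from the same ancestral globin gene.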
Thank you
