Understanding Protein Structure and Motifs

- Protein structure and function are closely related, with structure determining function. Determining protein structure experimentally is difficult, so computational methods aim to predict structure from sequence.
- The document discusses predicting structure from sequence using motifs, domains, and tertiary structure modeling. It also covers predicting secondary structure, transmembrane regions, localization, and function. Computational genomics allows analyzing whole genomes.

Uploaded by

peace lover

Introduction to protein structure

• Sequence → structure → function was the “first dogma” of bioinformatics
• Want to avoid determining structure
– Expensive
– Difficult
– Sometimes impossible?
• Bioinfo dream: structure from sequence!
• Questions like ”How does the protein fold?” would be answered directly from sequence data
• Folding from sequence has no longer been out of reach since 2007
What to do in silico
• Compromise and use what you’ve got.
– ”Recycle” structures
• Find and understand protein building blocks:
– Motifs and domains.
• Identify certain protein types:
– Transmembrane proteins
Understanding motifs
• Motifs are short subsequences, DNA or AA, 5–20 positions long.
• Their foremost application is serving as binding sites
• Motifs are grouped into families.
• Combinations of motifs that occur together are termed fingerprints
Motif search and presentation
• Multialignment
• Pattern notation, eg: [LMV]-[RSTV]-W-[DSN]-...
• Profiles and PSSM, PWM
• Visualize with sequence logos
– Stack height indicates conservation
– Symbol height within a stack is proportional to frequency
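The pattern notation above maps almost directly onto regular expressions. The following is a minimal sketch, assuming PROSITE-style syntax (`x` for any residue, `{...}` for excluded residues, `(n,m)` for repeat counts); the function name `prosite_to_regex` and the test sequence are made up for illustration, and only the four pattern positions shown on the slide are used because the rest is elided.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("{", "[^").replace("}", "]")  # {P} = anything but P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) = repeat count
        element = element.replace("<", "^").replace(">", "$")   # sequence termini
        parts.append(element)
    return "".join(parts)

# The first four positions of the slide's pattern.
regex = prosite_to_regex("[LMV]-[RSTV]-W-[DSN]")   # "[LMV][RSTV]W[DSN]"

# Hypothetical sequence, chosen so that it contains one hit.
match = re.search(regex, "AAMSWDKK")
print(regex, match.group(), match.start())
```

A regular expression only answers yes or no. Profiles and PSSMs exist because they can also score how well a position fits.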
PSI-BLAST
• Position-Specific Iterative Basic Local Alignment Search Tool
• One of the BLAST variants; suited to motif searches.
• Motifs are small, therefore easy and fast to search with.
• Uses a score statistic, the PSSM, similar to BLAST
• PSSM = Position-Specific Scoring Matrix
• Gives a log-odds score for each amino acid at each position.
• E-value theory is the same as that of BLAST because of the log-odds score
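The log-odds idea can be sketched in a few lines. This is an illustration under stated assumptions, not PSI-BLAST's actual procedure: the alignment is a made-up, gap-free toy, the background is uniform over the 20 amino acids, and a simple additive pseudocount stands in for the more careful weighting real tools use. The names `build_pssm` and `score_segment` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, background, pseudocount=1.0):
    """One dict per alignment column: residue -> log2(observed / background)."""
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(background)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in background
        })
    return pssm

def score_segment(pssm, segment):
    """Sum of per-position log-odds scores for a segment of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Toy gap-free alignment (hypothetical) and a uniform background (assumption).
alignment = ["LSWD", "MTWS", "VRWN", "LSWD"]
background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

pssm = build_pssm(alignment, background)
print(round(score_segment(pssm, "LSWD"), 2), round(score_segment(pssm, "AAAA"), 2))
```

Residues seen more often than the background predicts get positive scores, and the rest get negative scores. Because the total is a sum of log-odds terms, the same kind of score statistics as for BLAST can be applied.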
Motif databases
• PROSITE: Important binding sites
”What motifs does my protein have?”
– Profiles
– Regular expressions
– Careful documentation
• BLOCKS: The origin of BLOSUM.
– Presents multialignments!
– Assembled from the most conserved parts of domains.
• PRINTS: ”What motif combinations does my
protein have?”
Understanding Domains
Different groups define a domain differently:
• SCOP: ”A structural unit”. Remove a domain from a protein and it will still fold into the same substructure.
– A protein is the union of its domains.
• Pfam: ”Independently evolving unit”. Domains are conserved subsequences, with no reference to structure.
– Domains form a protein subset.
SCOP classification of proteins
SCOP = Structural Classification Of Proteins
• Hierarchical classification
1. Class: α, β, α / β, α + β
2. Fold: similar structure
3. Superfamily: homologous proteins
4. Family: “clear homology”
• Semi-manual curation
Pfam
• Not a hierarchical classification system
• Provides mathematical models of domains
– Hidden Markov Models (HMMs)
• Has two complementary parts:
• Pfam-A: Highly curated, well annotated. Based
on Pfam-B.
• Pfam-B: Fully automated. Based on ProDom.

Pfam-B = ProDom – Pfam-A


ProDom
• Fully automated
• Interesting procedure, based on PSI-BLAST

InterPro
• Joint project
• Interface to Pfam, ProDom, SMART, and more
Hidden Markov Model (HMM)
• Checks each sequence for a match against a domain model
• Same scoring statistics as BLAST.
• Also known as a linear (profile) HMM
State types:
• Match: A typical residue for this domain position. (□)
• Delete: When the sequence is ”lacking” a domain position. (○)
• Insert: When the sequence has more residues than the domain model. (◊)
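How the three state types interact can be sketched with a small Viterbi search. This is a deliberately simplified toy, not the model Pfam uses: real profile HMMs carry position-specific emission and transition probabilities, whereas here every position shares the same gap penalties, insert emissions score zero, and direct insert/delete switches are left out. The model `MODEL`, the penalties, and the function name `viterbi_profile` are all assumptions made for illustration.

```python
import math

NEG_INF = -math.inf

def viterbi_profile(seq, match_scores, gap_open=-2.0, gap_extend=-1.0, mismatch=-1.0):
    """Best path of `seq` through a toy profile HMM.

    match_scores[j] maps residues to log-odds scores at model position j.
    Returns (score, path); path letters are M (match), I (insert), D (delete).
    """
    n, m = len(seq), len(match_scores)
    score = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in "MID"}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in "MID"}
    score["M"][0][0] = 0.0  # silent begin state

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # match: consume a residue and a model position
                emit = match_scores[j - 1].get(seq[i - 1], mismatch)
                prev = max("MID", key=lambda s: score[s][i - 1][j - 1])
                score["M"][i][j] = score[prev][i - 1][j - 1] + emit
                back["M"][i][j] = prev
            if i > 0:  # insert: consume a residue, stay at the same model position
                options = {"M": score["M"][i - 1][j] + gap_open,
                           "I": score["I"][i - 1][j] + gap_extend}
                prev = max(options, key=options.get)
                score["I"][i][j], back["I"][i][j] = options[prev], prev
            if j > 0:  # delete: skip a model position without consuming a residue
                options = {"M": score["M"][i][j - 1] + gap_open,
                           "D": score["D"][i][j - 1] + gap_extend}
                prev = max(options, key=options.get)
                score["D"][i][j], back["D"][i][j] = options[prev], prev

    # Trace back from the best final state to the begin state.
    state = max("MID", key=lambda s: score[s][n][m])
    best, path, i, j = score[state][n][m], [], n, m
    while (i, j) != (0, 0):
        path.append(state)
        prev = back[state][i][j]
        if state == "M":
            i, j = i - 1, j - 1
        elif state == "I":
            i -= 1
        else:
            j -= 1
        state = prev
    return best, "".join(reversed(path))

# Hypothetical four-position domain model with made-up log-odds scores.
MODEL = [{"L": 2, "M": 2, "V": 2}, {"S": 2, "T": 2, "R": 2}, {"W": 3}, {"D": 2, "S": 2, "N": 2}]

for sequence in ("LSWD", "LSAWD", "LWD"):
    print(sequence, viterbi_profile(sequence, MODEL))
```

With this toy model, the sequence with an extra residue is routed through an insert state, and the one missing a residue is routed through a delete state.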
HMM
A simple example

In the real case,


Secondary structure prediction
Definition of terms
• Alpha helix: The classic spiral
• Beta strand: strands form ”sheets”
• Turn, bend: ”Sudden change”
• Coil, loop: ”Everything else”
• Assigned by principles.
• Coded in DSSP as H; B, E; T, S; and C, L
Prediction software
PHD – predictprotein webpage, http://www.predictprotein.org/
PsiPred – http://bioinf.cs.ucl.ac.uk
Example – secondary structure
Secondary structure prediction
• Principle: Structure affects amino acid distribution.
• Bad news: No good explicit model for determining secondary structure.
• Good news: Artificial Neural Networks give a decent implicit model.
To determine the secondary structure of residue i, look at a window around i:
$R_{i-7} R_{i-6} \cdots R_{i-1} R_i R_{i+1} \cdots R_{i+6} R_{i+7}$
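The window above is what would be fed to a predictor, one window per residue. The following is a minimal sketch of that preprocessing step only, assuming a pad symbol `-` for positions beyond the sequence ends and a plain one-hot encoding; the function names are hypothetical and the network itself is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_windows(sequence, half_width=7, pad="-"):
    """Return one window of 2*half_width + 1 symbols centred on each residue."""
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]

def one_hot(window, pad="-"):
    """Flatten a window into 0/1 features: 21 slots (20 residues + pad) per position."""
    symbols = AMINO_ACIDS + pad
    features = []
    for residue in window:
        features.extend(1 if residue == symbol else 0 for symbol in symbols)
    return features

# Small half_width so the windows are easy to read; the slide uses 7.
windows = residue_windows("ACDEFGHIK", half_width=2)
print(windows[0], windows[4], len(one_hot(windows[4])))
```

Symbols outside the 20 standard residues and the pad character would encode as all zeros in this sketch.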
Strategy for prediction
• Use homologs!
1. Collect very similar sequences
2. Build profile
3. Use a predictor for profiles
• Good effect in sec. str. prediction
• General trick for various prediction problems.
Transmembrane proteins (TM)
• 20-30% of proteins in any organism are TM.
• 70% of drug targets are TM proteins
(Pestourie et al, 2006)
• Bad news: Hard to determine structure for
TM-proteins.
– Less than 1% of the structures in PDB are TM structures.
• Good news: Regular and clear structure,
perfect for HMMs!
Properties of TMP
• Transmembrane helices are hydrophobic
• TM regions are 15-30 aa
• Loops on the cytoplasmic side are positively charged
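The first two properties suggest a simple sliding-window scan for hydrophobic stretches. The sketch below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a cutoff of 1.6; these are assumptions chosen for illustration and not necessarily what any particular predictor uses. The function names and the test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window; entry k covers seq[k:k + window]."""
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in seq[k:k + window]) / window
        for k in range(len(seq) - window + 1)
    ]

def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Merge consecutive windows above the threshold into (start, end) spans.

    Spans are 0-based and half-open, covering every residue of every merged window.
    """
    segments, start = [], None
    profile = hydropathy_profile(seq, window)
    for k, value in enumerate(profile):
        if value >= threshold and start is None:
            start = k
        elif value < threshold and start is not None:
            segments.append((start, k - 1 + window))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1 + window))
    return segments

# Made-up sequence: 20 leucines flanked by charged residues.
toy = "DEKR" * 5 + "L" * 20 + "DEKR" * 5
print(candidate_tm_segments(toy))
```

A scan like this finds hydrophobic stretches but says nothing about topology. Deciding which side faces the cytoplasm needs the charge information from the third property.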
TopPred
• Identifies hydrophobic regions

• ”Good prediction quality”


• Generally correct when more than 3 TM regions
Common problems:
• Lose a TM region
• Flip in-out topology
• Problem in discerning signal peptides
Signal peptides
• Short (15-30 aa) peptide ”addressing” a protein to organelles
• About 16% of the human proteome has an SP
• Some SPs are cleaved from their host protein
• One hydrophobic TM-segment, 7-15 aa
• Special predictor for SP: SignalP
• Common problem for TM predictors

PHOBIUS can predict both TM regions and signal peptides
Tertiary structure
• Ab initio folding makes it possible
• Method 1: homology modelling
– Find homologs with known structure: serve as
”templates”
– Align
– Construct atomic model, using alignment as proxy
– Evaluate? If bad, try other template.
• Limiting factor: Structure library
• You need sensitive search methods
Tertiary structure
• Method 2: threading
– Fold models specify local and long-range
interactions
– Align sequence to models
– The fold with the ”best” alignment is chosen.
– Note: Harder than regular sequence alignment
Threading
Function Prediction
Functions to predict
• Chemical reactions?
• Interactions?
• Pathway activity?
• Cell localization?
• Activity details?
Enzymes
• Classification present since 1961!
• Hierarchical classification of enzymes
• Specifies reactions
– Examples from Wikipedia:
• EC 3 enzymes are hydrolases
• EC 3.4 are hydrolases that act on peptide bonds
• EC 3.4.11 are those hydrolases that cleave off the amino-
terminal amino acid from a polypeptide
• EC 3.4.11.4 are those that cleave off the amino-terminal
end from a tripeptide
• Too limited information for Bioinformatics
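The hierarchy in the EC example can be read straight off the number: each prefix names a broader class. A minimal sketch follows, with hypothetical helper names; the second EC number in the usage line is picked only to show where two numbers diverge.

```python
def ec_levels(ec_number):
    """Split 'EC 3.4.11.4' into its nested classes, broadest first."""
    digits = ec_number.replace("EC", "").strip().split(".")
    return ["EC " + ".".join(digits[:k]) for k in range(1, len(digits) + 1)]

def shared_class(ec_a, ec_b):
    """Most specific class that two EC numbers have in common, or None."""
    common = None
    for level_a, level_b in zip(ec_levels(ec_a), ec_levels(ec_b)):
        if level_a != level_b:
            break
        common = level_a
    return common

print(ec_levels("EC 3.4.11.4"))
print(shared_class("EC 3.4.11.4", "EC 3.4.21.5"))
```

Two enzymes that share only "EC 3.4" are both hydrolases acting on peptide bonds but differ in the finer levels.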
Gene Ontology (GO)
• Controlled vocabulary for function annotation
• Not a strict hierarchy: a term may have several parents
• ”is a” and ”part of” relationships between
terms
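Because terms are linked by ”is a” and ”part of” relationships, annotating a protein with one term implies all of that term's ancestors. The sketch below walks such a graph; the small graph `TOY_GO` is a hand-made illustration, not copied from the Gene Ontology, and the function name is hypothetical.

```python
# term -> list of (relationship, parent term); a hand-made toy graph.
TOY_GO = {
    "mitochondrial membrane": [("part_of", "mitochondrion"), ("is_a", "organelle membrane")],
    "mitochondrion": [("is_a", "organelle"), ("part_of", "cytoplasm")],
    "organelle membrane": [("part_of", "organelle")],
    "organelle": [("is_a", "cellular component")],
    "cytoplasm": [("part_of", "cell")],
    "cell": [("is_a", "cellular component")],
}

def ancestors(term, graph):
    """All terms reachable from `term` by following is_a / part_of links upward."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("mitochondrial membrane", TOY_GO)))
```

In this toy graph ”organelle” is reached along two different paths, which is why a tree is not enough to represent the vocabulary.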
GO
Prediction of function
• Given a gene/protein, can we predict a GO
term?
• Approach: Expert systems
– Collect homologs
– Domain and motif analysis
– Study other features
– Finally: Make an ”educated guess”
Example: ProtFun
(http://www.cbs.dtu.dk/services/ProtFun/)
Prediction of localization
• Modest goal! Is the target...
– mitochondria?
– peroxisome?
– endoplasmic reticulum?
– golgi?
– Study signal peptide
(http://www.cbs.dtu.dk/services/TargetP/)
Genomics
• Study of the whole genome – a top-down approach
• “Look at the whole to understand the parts”
– Haemophilus influenzae, 1.8 Mb
– Saccharomyces cerevisiae, 12.5 Mb
– C. elegans, 100 Mb, & D. melanogaster, 123 Mb
– Homo sapiens, 2.9 Gb
– Mouse, chimp, several fishes, cow, opossum, sea squirt, frog, chicken, horse, elephant, hedgehog, macaque, rabbit, armadillo, . . .
– 599 complete microbial genomes as of 2007
Bentley & Parkhill
Whole genome sequencing
• Whole Genome Shotgun technique
– Read random parts of the genome, then
assemble.
– Advantage: Easy to automate, only one type of data
– Disadvantage: Hard to assemble the pieces!
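The ”read random parts, then assemble” idea can be sketched with a greedy overlap merger. This is a toy under strong assumptions: reads are error-free, all come from the same strand, and the made-up genome has no repeats. Real assemblers cannot rely on any of that, which is where the difficulty comes from. The function names and the reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining reads stay separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Reads sampled from the made-up genome ATGCCGTAAGCTTGACA, in shuffled order.
reads = ["AGCTTGAC", "ATGCCGTA", "TTGACA", "CGTAAGCT"]
print(greedy_assemble(reads))
```

With repeats longer than the reads, several merges look equally good and a greedy choice can join the wrong pieces.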
Whole genome sequencing
• Compartmental shotgun:
– Break the genome into chunks, put each chunk in
”BAC” (Bacterial Artificial Chromosome), then
WGS on BAC.
– Advantage: Easier to assemble. Manages
duplications better.
– Disadvantage: More steps, more data: need a
physical map.
Genome assembly
Applications of (Meta)genomics
• Gene finding/prediction – ab initio, or using expressed sequence tags (ESTs)
• Gene structure - transcription start sites
• Gene regulation - promoter & enhancer sites,
regulatory elements…
• Synteny – comparative genomics
Comparative genomics reveals elements of genome architecture

1. Large-scale gene order is often poorly conserved beyond the operon scale, even between closely related organisms (lack of synteny)
2. Genomes have large regions of genes that are conserved between close relatives, punctuated by hypervariable regions named “genomic islands”.
[Figure 12.18: chromosomal islands, plasmids, transposon, pathogenicity island, integrated phage DNA]
Pathogenicity islands in Escherichia coli
Brock Biology of Microorganisms Figure 12.17

E. coli strain    Genome (bp)
K-12              4,639,221
536               4,938,875
073               5,231,428

[Figure label: Prophage]
There is a huge amount of genotypic diversity in microbial populations; it manifests largely as variation in gene content
Computational tools for metagenomics
