Lec (3) - Protein_databases
Databases
Protein Classification
Concepts
• Classification methods group proteins based on:
- Sequence similarity
- Structural similarity
• Proteins can be classified into different groups based on:
- The families to which they belong
- The domains they contain
- The sequence features they possess
Protein Classification
• Subfamily: a small group of closely related proteins
• Family: a group of evolutionarily related proteins that share one or more domains/repeats
• Superfamily: a large group of distantly related proteins
Protein Domains
• Domain
- A discrete structural unit that is assumed to fold independently of the rest of the protein and to have its own function.
- Similar domains can be found in proteins with different functions.
Protein Sequence Features
• Motifs
- Short conserved regions, frequently the most conserved regions of a domain. Motifs are critical for the domain to function: in enzymes, for example, they contain the active sites.
Protein Sequence Features
• Repeat
- A stretch of amino acid sequence that is repeated a number of times along the length of the sequence. Many domains are built from repeats.
- Repeats may contain binding sites and contribute to the structural properties of the protein.
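To make the idea of a repeat concrete, here is a minimal sketch that looks for exact tandem (back-to-back) copies of a fixed-length unit in a sequence. It is an illustration only: real repeats are usually imperfect and are detected with alignment- or profile-based methods. The function name and the toy sequence are hypothetical.

```python
def find_tandem_repeats(seq, unit_len, min_copies=2):
    """Return (start, unit, copies) for exact tandem repeats of length unit_len.

    start is a 0-based index into seq. Assumes unit_len >= 1.
    """
    hits = []
    i = 0
    while i + unit_len <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        # Count how many times the same unit follows immediately.
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            hits.append((i, unit, copies))
            i = j  # continue after the repeated region
        else:
            i += 1
    return hits


# Toy sequence (hypothetical) containing three back-to-back copies of "GAP".
toy_sequence = "MKGAPGAPGAPLLS"
repeats = find_tandem_repeats(toy_sequence, unit_len=3)
```

Each hit reports where the repeated region starts, the repeated unit, and how many consecutive copies were found.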
Protein Sequence Features
• Consensus site/post-translational modification (PTM) site
- One or more conserved positions among homologous sequences that can, in theory, be modified, for example by phosphorylation or glycosylation.
- Example: an asparagine followed by any amino acid (in practice, any residue except proline), followed by serine or threonine (N-X-S/T), is a consensus site for N-linked glycosylation.
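The N-linked glycosylation consensus can be written as a short pattern, N-X-[S/T]. The sketch below scans a sequence for it with a regular expression. It assumes that X excludes proline, as in the PROSITE pattern PS00001, and it only reports candidate sites: a match does not mean the residue is actually glycosylated. The function name and test sequence are hypothetical.

```python
import re

# N, then any residue except proline (assumption), then S or T.
# The lookahead (?=...) lets overlapping sites be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_n_glyc_sites(seq):
    """Return (1-based position of the asparagine, matched tripeptide) pairs."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq.upper())]


# Hypothetical sequence: "NPS" is skipped because X is proline.
sites = find_n_glyc_sites("MNGTANPSNNST")
```

Positions are 1-based to match the residue numbering used in protein database entries.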
Protein Signatures
• Protein signatures are computational models used to classify protein properties:
- Protein families
- Domains
- Conserved sites
- Protein sequence features
Protein Resources
• A variety of protein resources are available online
• Several websites/resources are dedicated to providing a single interface to multiple resources
Protein Databases
• Sequence and information databases
- NCBI Protein Database: contains protein sequences from GenBank and RefSeq, as well as records from Swiss-Prot, PIR, PRF, and PDB.
- EBI UniProtKB: the "Protein knowledgebase", a comprehensive set of protein sequences. It provides functional information on proteins with accurate, consistent, and rich annotation, including the amino acid sequence, protein name or description, taxonomic data, and citation information.
- UniProtKB is divided into two parts: Swiss-Prot (manually annotated and reviewed) and TrEMBL (automatically annotated, not reviewed).
Protein Databases
Protein resources:
Pfam
• Collection of protein families and domains
• Each family is represented by
- Multiple sequence alignments
- Hidden Markov Models (HMMs)
• Two components to Pfam:
- Pfam-A entries: high-quality, manually curated families
- Pfam-B entries: automatically generated families
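Pfam builds its HMMs from multiple sequence alignments. As a much simpler stand-in (this is not an HMM), the sketch below counts the residues in each column of a toy alignment and takes the most frequent residue as the consensus. The alignment and function names are hypothetical.

```python
from collections import Counter


def column_profile(alignment):
    """Return one Counter of residue counts per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(seq[i] for seq in alignment) for i in range(length)]


def consensus(alignment):
    """Most frequent residue per column (ties go to the first one seen)."""
    return "".join(col.most_common(1)[0][0] for col in column_profile(alignment))


# Toy alignment (hypothetical): three aligned sequences of equal length.
toy_alignment = ["GAPLS", "GAPIS", "GSPLS"]
profile = column_profile(toy_alignment)
toy_consensus = consensus(toy_alignment)
```

A profile HMM goes further than these counts: it turns them into position-specific probabilities and also models insertions and deletions.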
SMART
• Simple Modular Architecture Research Tool
- Identification and annotation of protein domains
- Analysis of protein domain architectures
- Manually curated models for the prediction of protein domains
- http://smart.embl-heidelberg.de
Expasy (https://www.expasy.org/)
• Expasy (Swiss Institute of Bioinformatics)
- Gives access to UniProt, PROSITE, and many other tools covering protein sequences and identification, mass spectrometry and 2-DE data, protein characterisation and function, families, patterns and profiles, post-translational modification, protein structure, protein-protein interaction, similarity search/alignment, homology modelling, docking, drug design, and molecular modelling
Protein Information Resource
• PIR
- Protein Ontology (PRO)
- iProClass: reports for UniProtKB
- iProLINK: literature, text mining
- http://pir.georgetown.edu/
InterPro
• Designed to integrate signature databases
- Protein families, domains, and functional sites
- http://www.ebi.ac.uk/interpro/
UniProt – Example: SGLT1 protein
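A UniProtKB entry such as SGLT1 can also be retrieved programmatically. The sketch below downloads one entry in FASTA format and splits it into header and sequence. It assumes the UniProt REST endpoint rest.uniprot.org/uniprotkb/<accession>.fasta and takes P13866 as the accession for human SGLT1; both should be checked against the UniProt website before use. The function names are hypothetical.

```python
from urllib.request import urlopen

# Assumed URL template for the UniProt REST service.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    return lines[0][1:], "".join(lines[1:])


def fetch_uniprot_fasta(accession, timeout=30):
    """Download one UniProtKB entry as FASTA (requires network access)."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))


# Example call, left commented out because it needs a network connection.
# P13866 is assumed here to be the accession of human SGLT1.
# header, sequence = fetch_uniprot_fasta("P13866")
```

The header line carries the database (sp for Swiss-Prot, tr for TrEMBL), the accession, the entry name, and the protein description.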