
Lec2 Databases

This document provides an overview of different types of biological sequence databases. It discusses 10 categories of databases: 1) nucleotide sequences, 2) protein sequences, 3) genomics, 4) mutation/polymorphism, 5) protein domain/family, 6) proteomics, 7) 3D structure, 8) metabolic, 9) bibliographic, and 10) other specialized databases. Within each category, it provides examples of major databases and brief descriptions of what type of data they contain. The document serves to introduce the wide variety of biological sequence databases and how they are organized to store and allow retrieval of different types of biological sequence data.

Uploaded by

Deepali Singh

Introduction to Sequence Databases

1. DNA & RNA
2. Proteins

What are Databases?
• A database is a structured collection of
information.
• A database consists of basic units called
records or entries.
• Each record consists of fields, which hold predefined
data related to the record.
• For example, a protein database would have
protein sequences as records and protein
properties as fields (e.g., name of protein,
length, amino-acid sequence, …)

What is a Database?
• A database can be defined as "a collection of
data arranged for ease and speed of search and
retrieval."
• A DNA database contains individual records or
data entries of the DNA sequences as well as
information about the sequences.
• A DNA database often contains flat-files. These
are relatively simple database systems in which
each database is contained in a single table.
• In contrast, relational database systems can use
multiple tables to store information, and each
table can have a different record format.
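
The contrast between a single-table layout and a relational layout can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names and the accession X00001 are made up for the example and do not come from any real sequence database.

```python
import sqlite3

# In-memory database; all table/column names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sequence (
    accession   TEXT PRIMARY KEY,
    organism_id INTEGER REFERENCES organism(id),
    residues    TEXT
);
""")
conn.execute("INSERT INTO organism VALUES (1, 'Homo sapiens')")
conn.execute("INSERT INTO sequence VALUES ('X00001', 1, 'ATGGCC')")


def organism_of(accession):
    """Join the two tables to find the organism recorded for one accession."""
    row = conn.execute(
        "SELECT o.name FROM sequence s "
        "JOIN organism o ON o.id = s.organism_id "
        "WHERE s.accession = ?",
        (accession,),
    ).fetchone()
    return row[0] if row else None
```

In a flat file the organism name would simply be repeated inside every record; here it is stored once and linked by a key, which is the main design difference between the two approaches.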
GenBank as a Database
• GenBank is the National Institutes of Health
(NIH) genetic sequence database, an
annotated collection of all publicly available
DNA sequences.
• It is maintained by the National Center for
Biotechnology Information (NCBI) within the
National Institutes of Health (NIH).
History of Sequence Databases
• The first bioinformatics databases were constructed a
few years after the first protein sequences began to
become available.
• The first protein sequence reported was that of bovine
insulin in 1956, consisting of 51 residues.
• Nearly a decade later, the first nucleic acid sequence
was reported, that of yeast alanine tRNA with 77 bases.
• Just a year later, Dayhoff gathered all the available
sequence data to create the first bioinformatic database.
• The Protein Data Bank followed in 1972 with a collection
of ten X-ray crystallographic protein structures, and the
SWISS-PROT protein sequence database began in 1986.
Different classifications of databases
• Type of data
– nucleotide sequences
– protein sequences
– protein sequence patterns or motifs
– macromolecular 3D structure
– gene expression data
– metabolic pathways

• A database can be thought of as a large
table, where the rows represent records and
the columns represent fields.

Accession  Name                  Length  Sequence  Enzyme
QA001      MTGA                  243     MYQWI…    yes
QA002      Ribosomal protein L9  267     MAAPV…    no
QA003      Flagellin             374     GSSIL…    no
QA004      GDPMH                 157     MFLRQ…    yes

Each row is a record; each column is a field.
Accession Numbers: Unique identifiers of the database records.
Ideal minimal content of an entry in a sequence database
• Sequence
• Accession number (AC)
• Taxonomic data
• References
• Annotation/Curation
• Keywords
• Cross-references
• Documentation

Sources of data:
- research groups (direct submission)
- literature supplementary information
- genome sequencing institutes
- patents
Within a database, the format needs to be
kept consistent.
A Swiss-Prot entry in FASTA format:

>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
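
Because the format is consistent, the header line of the entry above can be split into fields mechanically. The sketch below assumes the header always has the shape ">db|accession|entry_name description"; the function name and the dictionary keys are chosen for this example only.

```python
def parse_fasta_header(line):
    """Split a '>db|accession|entry_name description' header into fields."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    # Everything up to the first space is the identifier block.
    ident, _, description = line[1:].partition(" ")
    db, accession, entry_name = ident.split("|")
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description.strip(),
    }


header = ">sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human)."
fields = parse_fasta_header(header)
```

Headers from other databases use different layouts, so a real parser would need one rule per source.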
Why Databases?
• The purpose of databases is not merely to collect and
organize data, but to allow intelligent data retrieval.
• A query is a method to retrieve information from the
database.
• The organization of each record into predetermined
fields allows us to use queries on fields.
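
A query on fields can be sketched in a few lines of Python, using the four example records from the table shown earlier. The dictionary layout and the query() helper are assumptions made for illustration; real databases expose their own query languages.

```python
# The four example records from the table, one dict per record.
records = [
    {"accession": "QA001", "name": "MTGA", "length": 243, "enzyme": True},
    {"accession": "QA002", "name": "Ribosomal protein L9", "length": 267, "enzyme": False},
    {"accession": "QA003", "name": "Flagellin", "length": 374, "enzyme": False},
    {"accession": "QA004", "name": "GDPMH", "length": 157, "enzyme": True},
]


def query(records, **conditions):
    """Return the records for which every per-field test passes."""
    return [
        record
        for record in records
        if all(test(record[field]) for field, test in conditions.items())
    ]


# "Enzymes shorter than 200 residues": two conditions on two different fields.
short_enzymes = query(records, enzyme=lambda v: v, length=lambda v: v < 200)
```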

Databases on the Internet
• Biological databases often have web
interfaces, which allow users to send queries to
the databases.
• Some databases can be accessed by different
web servers, each offering a different interface.

[Figure: the user sends a request to the web server, which sends a query to the database server; the result is returned to the user as a web page.]
Database download

• Nearly all biological databases are available for
download as simple text (flat) files.
• A local version of the database allows one
greater freedom in processing the data.
• Processing data in files requires some
computer-programming skills. Perl is an easy-to-learn
programming language that can be used for
extraction and analysis of data from files.
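
As a rough idea of what such processing looks like, here is a minimal sketch in Python (used here in place of Perl). It assumes a flat-file layout in which each field keyword starts in the first column and continuation lines are indented, loosely modeled on GenBank-style files. The sample record is hypothetical and exists only to exercise the function.

```python
def read_top_level_fields(flat_text):
    """Collect keyword -> value for lines whose keyword starts in column 1."""
    fields = {}
    current = None
    for line in flat_text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        if line[0] != " ":                # a new keyword line
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None:         # indented continuation line
            fields[current] += " " + line.strip()
    return fields


# Hypothetical record, not taken from any real database.
sample = (
    "LOCUS       EXAMPLE1  12 bp  DNA\n"
    "DEFINITION  Hypothetical record used only to\n"
    "            illustrate the parsing.\n"
    "ACCESSION   X00000\n"
)
fields = read_top_level_fields(sample)
```

For real work, an established parsing library is usually a safer choice than a hand-written reader.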

Database 1: nucleotide sequences
• The 3 main nucleic acid sequence databases are
EMBL (Europe) / GenBank (USA) / DDBJ (Japan)
• EMBL: since 1982
• Specialized databases for the different types of RNAs (e.g., tRNA,
rRNA, tmRNA, uRNA, etc.)
• 3D structure (DNA and RNA) - PDB
• Others: Aberrant splicing db; Eukaryotic promoter db (EPD); RNA
editing sites; Multimedia Telomere Resource; …
Database 2: protein sequences
• SWISS-PROT: created in 1986 (A. Bairoch) http://www.expasy.org/sprot/
• TrEMBL: created in 1996; complement to SWISS-PROT; derived from
EMBL CDS translations (« proteomic » version of EMBL)
• PIR-PSD: Protein Information Resource http://pir.georgetown.edu/
• GenPept: « proteomic » version of GenBank
• Many specialized protein databases for specific families or groups of
proteins.
– Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM
receptors), IMGT (immune system), YPD (yeast), etc.
Databases 3: ‘genomics’
• Contain information on gene chromosomal
location (mapping) and nomenclature, and
provide links to sequence databases; usually
contain no sequence data;
• Exist for most organisms important in life science
research; usually species-specific.
• Examples: MIM, GDB (human), MGD (mouse),
FlyBase (Drosophila), SGD (yeast), MaizeDB
(maize), SubtiList (B. subtilis), etc.;
• Generally relational db (Oracle, Sybase or
AceDB).
Databases 4: mutation/polymorphism
• Contain information on sequence variations, whether or not they are linked to genetic
diseases;
• Mainly human, with exceptions such as OMIA - Online Mendelian Inheritance in Animals
• General db:
– OMIM
– HGMD - Human Gene Mutation db
– SVD - Sequence variation db
– HGBASE - Human Genic Bi-Allelic Sequences db
– dbSNP - Human single nucleotide polymorphism (SNP) db
• Disease-specific db: most of these databases are either linked to a single
gene or to a single disease;
– p53 mutation db
– ADB - Albinism db (Mutations in human genes causing albinism)
– Asthma and Allergy gene db
– ….
Database 5: protein domain/family
• Contain biologically significant patterns, profiles and HMMs,
formulated in such a way that, with appropriate
computational tools, they can rapidly and reliably determine to
which known family of proteins (if any) a new sequence
belongs
• Used as a tool to identify the function of uncharacterized
proteins translated from genomic or cDNA sequences
(« functional diagnostic »)
• Either manually curated (e.g., PROSITE, Pfam, etc.) or
automatically generated (e.g., ProDom, DOMO)
Databases 6: proteomics
• Contain information obtained by 2D-PAGE: images of
master gels and description of identified proteins
• Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE,
Sub2D, Cyano2DBase, etc.
• Composed of image and text files
• There is currently no protein Mass Spectrometry (MS)
database (not for long…)
Databases 7: 3D structure
• Contain the spatial coordinates of macromolecules whose 3D
structure has been obtained by X-ray or NMR studies
• Proteins represent more than 90% of available structures (others are
DNA, RNA, sugars, viruses, protein/DNA complexes…)
• PDB (Protein Data Bank), SCOP (structural classification of proteins
(according to the secondary structures)), BMRB (BioMagResBank;
NMR results)
• DSSP: Database of Secondary Structure Assignments.
HSSP: Homology-derived secondary structure of proteins.
FSSP: Fold classification based on Structure-Structure alignment of Proteins.
• Future: Homology-derived 3D structure db.
Databases 8: metabolic
• Contain information that describes enzymes, biochemical
reactions and metabolic pathways;
• ENZYME and BRENDA: nomenclature databases that
store information on enzyme names and reactions;
• Metabolic databases: EcoCyc (specialized on Escherichia
coli), KEGG, EMP/WIT;
Usually these databases are tightly coupled with query
software that allows the user to visualise reaction
schemes.
Databases 9: bibliographic
• Bibliographic reference databases contain
citations and abstracts of
published life science articles;
• Examples: Medline and PubMed
• Other more specialized databases also
exist (example: Agricola).
Databases 10: others
• There are many databases that cannot be
classified in the categories listed previously;
• Examples: REBASE (restriction enzymes),
TRANSFAC (transcription factors), CarbBank,
GlycoSuiteDB (linked sugars), protein-protein
interaction db (DIP, ProNet, IntAct, BIND),
protease db (MEROPS), biotechnology patents
db, etc.;
• As well as many other resources concerning
any and new aspects of macromolecules and
molecular biology (Ex: Microarrays).
Nucleotide and associated-topic databases (Amos’ links)

EMBL - EMBL Nucleotide sequence db (EBI)
GenBank - GenBank Nucleotide Sequence db (NCBI)
DDBJ - DNA Data Bank of Japan
dbEST - dbEST (Expressed Sequence Tags) db (NCBI)
dbSTS - dbSTS (Sequence Tagged Sites) db (NCBI)

NDB - Nucleic Acid Databank (3D structures)
BNASDB - Nucleic acid structure db from University of Pune

AsDb - Aberrant Splicing db
ACUTS - Ancient conserved untranslated DNA sequences db
Codon Usage Db
EPD - Eukaryotic Promoter db
HOVERGEN - Homologous Vertebrate Genes db
IMGT - ImMunoGeneTics db [Mirror at EBI]
ISIS - Intron Sequence and Information System
RDP - Ribosomal db Project
gRNAs db - Guide RNA db
PLACE - Plant cis-acting regulatory DNA elements db
PlantCARE - Plant cis-acting regulatory DNA elements db
sRNA db - Small RNA db
ssu rRNA - Small ribosomal subunit db
lsu rRNA - Large ribosomal subunit db
5S rRNA - 5S ribosomal RNA db
tmRNA Website
tmRDB - tmRNA dB
tRNA - tRNA compilation from the University of Bayreuth
uRNADB - uRNA db
RNA editing - RNA editing site
RNAmod db - RNA modification db
SOS-DGBD - Db of Drosophila DNA sequences annotated with regulatory binding sites
TelDB - Multimedia Telomere Resource
TRADAT - TRAnscription Databases and Analysis Tools
Subviral RNA db - Small circular RNAs db (viroid and viroid-like)

MPDB - Molecular probe db


OPD - Oligonucleotide probe db
VectorDB - Vector sequence db (seems dead!)
There are approximately 286,730,369,256 sequence records in the
traditional GenBank divisions as of 2011.
(Benson et al. (2011) Nucleic Acids Res 39:D32-7)
The “perfect” database
1. Comprehensive, but easy to search.

2. Annotated, but not “too annotated”.

3. A simple, easy to understand structure.

4. Cross-referenced.

5. Minimum redundancy.

6. Easy retrieval of data.


Problems with General Sequence Databases
• Databases that strive for encyclopedic
completeness are now so huge as to be close
to unmanageable.

1. Redundancy (nothing ever goes out).


2. Inadequate sequences.
– old sequences
– partially annotated sequences
– inconsistent & outdated annotations (submitter annotation)
– error sequences, low-quality sequences
– contaminations
– anonymous sequence

Release 57.5 of 07-Jul-09 of UniProtKB/Swiss-Prot contains 471,472 sequence
entries, comprising 167,326,533 amino acids abstracted from 181,042
references.
The RefSeq Accession number format and molecule types

Accession   Molecule type
NC_xxxxxx   Complete genomic molecule
NG_xxxxxx   Genomic region
NM_xxxxxx   mRNA
NP_xxxxxx   Protein
NR_xxxxxx   RNA
NT_xxxxxx   computed Genomic contig
XM_xxxxxx   computed mRNA
XP_xxxxxx   computed Protein
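
Since the prefix encodes the molecule type, the table can be turned into a simple lookup. The sketch below covers only the eight prefixes listed in the table; the function name is chosen for this example.

```python
# Prefix -> molecule type, copied from the table above.
REFSEQ_PREFIXES = {
    "NC": "Complete genomic molecule",
    "NG": "Genomic region",
    "NM": "mRNA",
    "NP": "Protein",
    "NR": "RNA",
    "NT": "computed Genomic contig",
    "XM": "computed mRNA",
    "XP": "computed Protein",
}


def refseq_molecule_type(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not in the RefSeq format
    return REFSEQ_PREFIXES.get(prefix.upper())
```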
Using Biological Databases
• Which databases should I use?
• What kind of information can I expect
to find in this database?
• Are the data in the database of interest
to me?
• How reliable is it?
Practical Session: Outline
• Integrated systems: e.g., NCBI (Protein,
Nucleotide, Gene, OMIM, etc.)
• Protein Databases: e.g., ExPASy
(SwissProt + TrEMBL)
• Protein structures: e.g., PDB and PDBsum
• Pathway databases: e.g., KEGG (Kyoto
Encyclopedia of Genes and Genomes)

Tips for the Practical Session
• We will go over several databases in a very
short time. Don’t expect to remember all the
small details. They are not important. All
respectable databases have a “HELP”
component.
• Try to:
– Learn the common features of biological databases.
– Understand the main features of every database.
– Learn how to use the online HELP.
– Judge and compare databases.
EBI/NCBI/DDBJ
• These 3 databases contain essentially the same information,
synchronized within 2-3 days (with a few differences in format and syntax)
• Serve as archives containing all sequences (single genes,
ESTs, complete genomes, etc.) derived from:
– Genome projects
– Sequencing centers
– Individual scientists
– Literature
– Patent offices
• Non-confidential data exchanged daily
• The database triples approximately every 12 months.
EBI/NCBI/DDBJ
• Heterogeneous: sequence length, genomes, variants,
fragments, …
• Minimum sequence size: 10 bp
• Archive: nothing goes out -> highly redundant!
• full of errors: in sequences, in annotations, in CDS
attribution….
• no consistency of annotations; most annotations are
done by the submitters; heterogeneity of the quality and
the completion and updating of the information
EBI/NCBI/DDBJ
• Unexpected information you can find:
• ACCESSION Z71230
FT source 1..124
FT /db_xref="taxon:4097"
FT /organelle="plastid:chloroplast"
FT /organism="Nicotiana tabacum"
FT /isolate="Cuban Cahibo cigar, gift from President Fidel
FT Castro"
• ACCESSION NC_001610
FT source 1..17084
FT /chromosome="complete mitochondrial genome"
FT /db_xref="taxon:9267"
FT /organelle="mitochondrion"
FT /organism="Didelphis virginiana"
FT /dev_stage="adult"
FT /isolate="fresh road killed individual"
FT /tissue_type="liver"
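
Qualifiers such as /organism and /isolate follow a regular /key="value" pattern, so they can be pulled out with a short script. This is a minimal sketch that assumes every qualifier value is enclosed in double quotes, as in the entries above; it is not a full feature-table parser.

```python
import re


def parse_qualifiers(ft_lines):
    """Extract /key="value" qualifiers from 'FT' lines into a dict."""
    # Drop the leading 'FT' tag and join the lines so wrapped values are rejoined.
    text = " ".join(line[2:].strip() for line in ft_lines if line.startswith("FT"))
    return dict(re.findall(r'/(\w+)="([^"]*)"', text))


lines = [
    'FT source 1..17084',
    'FT /db_xref="taxon:9267"',
    'FT /organelle="mitochondrion"',
    'FT /organism="Didelphis virginiana"',
    'FT /isolate="fresh road killed individual"',
    'FT /tissue_type="liver"',
]
qualifiers = parse_qualifiers(lines)
```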
EMBL Nucleotide Sequence
Database
• An annotated collection of all publicly available
nucleotide and protein sequences

• Created in 1980 at the European Molecular
Biology Laboratory in Heidelberg.
• Maintained since 1994 by EBI, Cambridge.
• http://www.ebi.ac.uk/embl.html
DDBJ–DNA Data Bank of Japan
• An annotated collection of all publicly available
nucleotide and protein sequences

• Started in 1984 at the National Institute of
Genetics (NIG) in Mishima.
• Still maintained at this institute by a team led by
Takashi Gojobori.
• http://www.ddbj.nig.ac.jp
• There are 126,551,501,141 bases in
135,440,924 sequence records in the traditional
GenBank divisions and 191,401,393,188 bases
in 62,715,288 sequence records in the WGS
division as of April 2011.

• Most biocomputing sites update their copy of
GenBank every day over the internet.
• Scientists access GenBank directly over the
Web.
Annotation
• These billions of Gs, As, Ts, and Cs would be
useless without the "annotation" in each sequence
record.
Sequences

1 The LOCUS field consists of five different subfields:

1a Locus Name (HSHFE) - The locus name is a tag for grouping
similar sequences. The first two or three letters usually designate
the organism. In this case, HS stands for Homo sapiens. The last
several characters are associated with another group designation,
such as gene product. In this example, the last three characters
represent the gene symbol, HFE. Currently, the only requirement for
assigning a locus name to a record is that it is unique.

1b Sequence Length (12146 bp) - The total number of nucleotide
base pairs (or amino acid residues) in the sequence record.
2 DEFINITION - Brief description of the sequence. The description
may include source organism name, gene or protein name, or
designation as untranscribed or untranslated sequences (e.g., a
promoter region). For sequences containing a coding region (CDS),
the definition field may also contain a “completeness” qualifier such
as "complete CDS" or "exon 1."

3 ACCESSION (Z92910) - Unique identifier assigned to a complete
sequence record. This number never changes, even if the record is
modified. An accession number is a combination of letters and
numbers that are usually in the format of one letter followed by five
digits (e.g., M12345) or two letters followed by six digits (e.g.,
AC123456).
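
The two usual formats can be checked with a regular expression. This sketch covers only the two patterns named above (one letter plus five digits, or two letters plus six digits); other valid accession formats exist and would be rejected by it.

```python
import re

# One letter + five digits, or two letters + six digits.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def looks_like_accession(text):
    """True if text matches one of the two usual accession formats."""
    return bool(ACCESSION_RE.match(text))
```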

4 VERSION (Z92910.1) - Identification number assigned to a single,
specific sequence in the database. This number is in the format
“accession.version.” If any changes are made to the sequence data,
the version part of the number will increase by one. For example
U12345.1 becomes U12345.2. A version number of Z92910.1 for this
HFE sequence indicates that the sequence data has not been
altered since its original submission.
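
The "accession.version" format is easy to take apart in code. The sketch below splits the identifier at the last period and treats any version above 1 as a sign that the sequence data were revised; the function names are chosen for this example.

```python
def split_version(identifier):
    """Split 'accession.version' into (accession, version number)."""
    accession, separator, version = identifier.rpartition(".")
    if not separator or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def sequence_was_revised(identifier):
    """True if the version part shows at least one change to the sequence."""
    return split_version(identifier)[1] > 1
```

For the HFE record, split_version("Z92910.1") gives ("Z92910", 1), so sequence_was_revised reports no change.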
5 GI (1890179) - Also a sequence identification number. Whenever a
sequence is changed, the version number is increased and a new GI
is assigned. If a nucleotide sequence record contains a protein
translation of the sequence, the translation will have its own GI
number.
6 KEYWORDS (haemochromatosis; HFE gene) - A keyword can be
any word or phrase used to describe the sequence. Keywords are
not taken from a controlled vocabulary. Notice that in this record the
keyword, "haemochromatosis," employs British spelling, rather than
the American "hemochromatosis." Many records have no keywords.
A period is placed in this field for records without keywords.

7 SOURCE (human) - Usually contains an abbreviated or common
name of the source organism.

8 ORGANISM (Homo sapiens) - The scientific name (usually genus
and species) and phylogenetic lineage. See the NCBI Taxonomy
Homepage for more information about the classification scheme
used to construct taxonomic lineages.
9 REFERENCE - Citations of publications by sequence authors that
support information presented in the sequence record. Several
references may be included in one record. References are
automatically sorted from the oldest to the newest. Cited publications
are searchable by author, article or publication title, journal title, or
MEDLINE unique identifier (UID). The UID links the sequence record
to the MEDLINE record.
1c Molecule Type (DNA) - Type of molecule that was sequenced. All
sequence data in an entry must be of the same type.

1d GenBank Division (PRI) - There are different GenBank divisions.
In this example, PRI stands for primate sequences. Some other
divisions include ROD (rodent sequences), MAM (other mammal
sequences), PLN (plant, fungal, and algal sequences), and BCT
(bacterial sequences).

1e Modification Date (23-July-1999) - Date of most recent
modification made to the record. The date of first public release is not
available in the sequence record. This information can be obtained
only by contacting NCBI at [email protected].
9 REFERENCE - If the REFERENCE TITLE contains the words
"Direct Submission," contact information for the submitter(s) is
provided.

The FEATURES table

A feature is simply an annotation that describes a portion of
the sequence.
• Each feature includes a location (sequence location or
interval) and one or several qualifiers.
• Clicking on the feature name will open a record for the
sequence interval identified in the feature location.
A list of features can be found at
http://www.ncbi.nlm.nih.gov/collab/FT/
source - An obligatory feature. The source gives the length of
the entire sequence, the scientific name of the source
organism, and the Taxon ID number.
Other types of information that the submitter may include in
this field are chromosome number, map location, clone, and
strain identification.
gene - Sequence portion that delineates the beginning and
end of a gene.

exon - Sequence segment that contains an exon. Exons may
contain portions of 5' and 3' UTRs (untranslated regions). The
name of the gene to which the exon belongs and exon number
are provided.
CDS - Sequence of nucleotides that code for amino acids of the
protein product (coding sequence).
The CDS begins with the first nucleotide of the start codon and
ends with the third nucleotide of the stop codon.
This feature includes the translation into amino acids and may
also contain gene name, gene product function, link to protein
sequence record, and cross-references to other database
entries.
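
The definition of a CDS suggests a quick sanity check on any extracted coding sequence. The sketch below assumes the standard genetic code (stop codons TAA, TAG, TGA) and ATG as the only start codon; both are assumptions for the example, since some organisms and organelles use other codes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)


def check_cds(cds, start_codons=("ATG",)):
    """Report whether a CDS has whole codons, a start codon and a stop codon."""
    seq = cds.upper()
    return {
        "whole_codons": len(seq) % 3 == 0,
        "starts_with_start": seq[:3] in start_codons,
        "ends_with_stop": seq[-3:] in STOP_CODONS,
    }
```

For example, check_cds("ATGGCCTAA") passes all three checks, whereas a sequence truncated by one base fails the first and the last.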

intron - Transcribed but spliced-out parts. Intron number is
shown.

polyA_signal - Identifies the sequence portion required for
endonuclease cleavage of an mRNA transcript. Consensus
sequence for the polyA signal is AATAAA.
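
Locating the consensus signal in a sequence is a plain substring search. This sketch reports 1-based start positions, to match the way feature locations are written, and looks only for exact matches to AATAAA; variant signals are not detected.

```python
def find_polya_signals(sequence, signal="AATAAA"):
    """Return the 1-based start position of every exact match to the signal."""
    seq = sequence.upper()
    positions = []
    start = seq.find(signal)
    while start != -1:
        positions.append(start + 1)          # convert 0-based index to 1-based
        start = seq.find(signal, start + 1)  # keep searching after this hit
    return positions
```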

BASE COUNT & ORIGIN
BASE COUNT - Base Count gives the total number of adenine
(A), cytosine (C), guanine (G), and thymine (T) bases in the
sequence.
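
The base count can be recomputed from the sequence itself, which is a simple way to check that a record was read correctly. A minimal sketch follows; characters other than A, C, G and T (for example N) are ignored here.

```python
from collections import Counter


def base_count(sequence):
    """Count A, C, G and T in a nucleotide sequence (case-insensitive)."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}
```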

ORIGIN - Origin contains the sequence data, which begins on
the line immediately below the field title.
Molecule-specific and topic-specific databases
AsDb - Aberrant Splicing db
ACUTS - Ancient conserved untranslated DNA sequences db
Codon Usage Db
EPD - Eukaryotic Promoter db
HOVERGEN - Homologous Vertebrate Genes db
IMGT - ImMunoGeneTics db [Mirror at EBI]
ISIS - Intron Sequence and Information System
RDP - Ribosomal db Project
gRNAs db - Guide RNA db
PLACE - Plant cis-acting regulatory DNA elements db
PlantCARE - Plant cis-acting regulatory DNA elements db
sRNA db - Small RNA db
ssu rRNA - Small ribosomal subunit db
lsu rRNA - Large ribosomal subunit db
5S rRNA - 5S ribosomal RNA db
tmRNA Website
tmRDB - tmRNA dB
tRNA - tRNA compilation from the University of Bayreuth
uRNADB - uRNA db
RNA editing - RNA editing site
RNAmod db - RNA modification db
SOS-DGBD - Db of Drosophila DNA sequences annotated with regulatory binding sites
TelDB - Multimedia Telomere Resource
TRADAT - TRAnscription Databases and Analysis Tools
Subviral RNA db - Small circular RNAs db (viroid and viroid-like)
MPDB - Molecular probe db
OPD - Oligonucleotide probe db
VectorDB - Vector sequence db (seems dead!)
Organism specific databases:

FlyBase (Drosophila)
SGD (yeast)
MaizeDB (maize)
SubtiList (B. subtilis).

Entrez: the search and retrieval system
that integrates information from
the National Center for
Biotechnology Information (NCBI) databases.

These databases include
nucleotide sequences, protein
sequences, macromolecular
structures, whole genomes, and
MEDLINE, through PubMed.
Input your search keywords or the Boolean expression

Databases: protein sequences
• SWISS-PROT: created in 1986 (Amos Bairoch) http://www.expasy.org/sprot/
• TrEMBL: created in 1996; complement to SWISS-PROT; derived from
EMBL CDS translations (« proteomic » version of EMBL)
• PIR-PSD: Protein Information Resource http://pir.georgetown.edu/
• GenPept: « proteomic » version of GenBank
• Many specialized protein databases for specific families or groups of
proteins.
– Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM
receptors), IMGT (immune system), YPD (yeast), etc.
SWISS-PROT
• Collaboration between the SIB (CH) and
EMBL/EBI (UK)
• Manually annotated: non-redundant,
cross-referenced, fully documented.
• Weekly releases; available from about 50
servers across the world, the main source
being ExPASy in Geneva
SWISS-PROT - 07/28/09
• 495,880 sequences
• 174,780,353 amino acid residues
• 11,891 species
• 2,000 journals
• 276,903 authors
SWISS-PROT - 07/27/11
• 531,473 sequences
• 188,463,640 amino acid residues
• 12,564 species
• 2,154 journals
• 306,144 authors
TrEMBL (Translation of EMBL)
• It is impossible to cope with the quantity of newly
generated data AND to maintain the high quality of
SWISS-PROT -> TrEMBL, created in 1996.
• TrEMBL is automatically generated (from annotated EMBL
coding sequences (CDS)) and annotated using software
tools.
• Contains all that is not in SWISS-PROT.
SWISS-PROT + TrEMBL = all known protein sequences.
The simplified story of a SWISS-PROT entry

Some data are not submitted to the public databases (cDNAs, genomes, …: delayed or cancelled).

[Figure: EMBLnew -> EMBL -> CDS -> TrEMBLnew -> TrEMBL -> SWISS-PROT]

« Automated » step (CDS to TrEMBL):
• Redundancy check (merge)
• Family attribution (InterPro)
• Annotation (computer)

« Manual » step (TrEMBL to SWISS-PROT):
• Redundancy (merge, conflicts)
• Annotation (manual)
• SWISS-PROT tools (macros…)
• SWISS-PROT documentation
• Medline
• Databases (MIM, MGD….)
• Brain storming

Once in SWISS-PROT, the entry is no longer in TrEMBL, but it remains in EMBL (archive).
CDS: proposed and submitted to EMBL by authors or by genome projects (can be experimentally
proven or derived from gene prediction programs). TrEMBL neither translates DNA sequences nor
uses gene prediction programs: it only takes the CDS proposed by the submitting authors in the EMBL entry.
SWISS-PROT and the cross-references (X-ref)
• SWISS-PROT was the 1st database with X-ref.;
• Explicitly X-referenced to 36 databases;
X-ref to DNA (EMBL/GenBank/DDBJ), 3D-structure (PDB),
literature (Medline), genomic (MIM, MGD, FlyBase, SGD, SubtiList,
etc.), 2D-gel (SWISS-2DPAGE), specialized db (PROSITE,
TRANSFAC);
• Implicitly X-referenced to 17 additional db added by the ExPASy
servers on the WWW (e.g., GeneCards, PRODOM, HUGE, etc.)

Gasteiger et al., Curr. Issues Mol. Biol. (2001), 3(3): 47-55


[Figure: databases cross-referenced from SWISS-PROT, grouped by type]
• Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, SMART, Mendel-GFDb
• Human diseases: MIM
• Protein-specific dbs: GCRDb, MEROPS, REBASE, TRANSFAC
• 2D and 3D structural dbs: HSSP, PDB
• Organism-specific dbs: DictyDb, EcoGene, FlyBase, HIV, MaizeDB, MGD, SGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, Zebrafish
• PTM: CarbBank, GlycoSuiteDB
• 2D-gel protein databases: SWISS-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE
• Nucleotide sequence db: EMBL, GenBank, DDBJ
http://www.rcsb.org/pdb/
NCBI - RefSeq
• Main features of the RefSeq collection include:
1. Non-redundancy.
2. Explicitly linked nucleotide and protein sequences.
3. Data validation and format consistency.
4. Distinct accession series.
5. Ongoing curation by NCBI staff and collaborators, with
review status indicated on each record.
Text-based searching
• Terminology: query, hit, fields, logical/Boolean operator.
• General principles:
1. All main databases provide a convenient tool for text-based
searching.
2. We can search for query words in specific fields.
3. We can search more than one database at a time.
4. We can pose additional limits, such as modification date.
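
The terminology above (query, hit, field, Boolean operator) can be made concrete with a toy in-memory search. Everything here is a sketch: the three entries reuse accessions, organisms and keywords quoted elsewhere in these slides, the empty keyword fields are placeholders, and the helper names has, AND, OR and NOT are invented for the example and do not belong to any database's real query syntax.

```python
# Toy index; empty keyword strings are placeholders, not real annotation.
entries = [
    {"accession": "Z92910", "organism": "Homo sapiens",
     "keywords": "haemochromatosis; HFE gene"},
    {"accession": "Z71230", "organism": "Nicotiana tabacum", "keywords": ""},
    {"accession": "NC_001610", "organism": "Didelphis virginiana", "keywords": ""},
]


def has(field, word):
    """Build a test: does word occur (case-insensitively) in the given field?"""
    return lambda entry: word.lower() in entry[field].lower()


def AND(*tests):
    return lambda entry: all(test(entry) for test in tests)


def OR(*tests):
    return lambda entry: any(test(entry) for test in tests)


def NOT(test):
    return lambda entry: not test(entry)


def search(entries, test):
    """Return the accession numbers of all hits."""
    return [entry["accession"] for entry in entries if test(entry)]


hits = search(entries, AND(has("organism", "homo"), has("keywords", "HFE")))
```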

EMBL: The Genome divisions
http://www.ebi.ac.uk/genomes/

Schizosaccharomyces pombe strain 972h- complete genome
Sequence formats: GenBank format
Sequence formats: FASTA format

>sequence name
[sequence]…
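
Reading this format takes only a few lines. The sketch below is a minimal reader that assumes each record is a ">" line followed by one or more sequence lines; it does no validation of the sequence characters.

```python
def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:          # emit the previous record
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:            # sequence line of the current record
            chunks.append(line)
    if name is not None:                  # emit the last record
        yield name, "".join(chunks)
```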
Protein structure database
http://genome.ucsc.edu/
Summary
• What is the best db for sequence analysis?
• Which contains the highest-quality data?
• Which is the most comprehensive?
• Which is the most up-to-date?
• Which is the least redundant?
• Which is the best indexed (allows complex queries)?
• Which Web server responds most quickly?
Presents new databases and updates of existing databases
Biological Information
Cancer as an example:

Genes:
Growth genes
Tumor suppressor genes

Proteins:
Growth factors
Enzymes
Receptors

Pathways:
Cell death

Systems:
Immune system
Blood supply

Function:
Role of proteins
Molecular interactions

Essential Bioinformatics and Biocomputing (LSM2104), NUS
Biological Information
Nucleic acids:
• DNA sequence, genes, gene products (proteins), mutation,
gene coding, distribution patterns, motifs
• Genomics: genome, gene structure and expression, genetic
map, genetic disorder
• RNA sequence, secondary structure, 3D structure,
interactions

Proteins:
• Protein sequence, corresponding gene, secondary structure,
3D structure, function, motifs, homology, interactions
• Proteomics: expression profile, proteins in disease processes
etc.
• Ligands and drugs (inhibitors, activators, substrates,
metabolites)
Biological Information
Pathways:
• Molecular networks, biological chain events,
regulation, feedback, kinetic data

Function:
• Binding sites, interactions, molecular action
(binding, chemical reaction, etc.)
• Biological effect (signaling, transport, feedback,
regulation, modification, etc.)
• Functional relationship, protein families, motifs, and
homologs
SWISS-PROT entry P00709

Biological databases:
Protein structure database: PDB (http://www.pdb.org)
1. More than 18,000 macromolecular structures of proteins, peptides,
viruses, protein/nucleic acid complexes, nucleic acids, and carbohydrates.
2. Among the oldest databases – the first structure was deposited in 1972.
3. The number of newly deposited structures has been growing steadily (3298 in 2001, and
1486 from Jan 1 to June 5, 2002).
4. Structures are determined mainly by X-ray diffraction and NMR.
5. It contains tools for keyword search, comprehensive visualization, and
information extraction, such as sequence, geometry, and details of structural
neighbors.
Biological databases: PDB web-page
http://www.rcsb.org/pdb/

Biological databases: A PDB entry
http://www.rcsb.org/pdb/
Different classifications of databases….

• Primary or derived databases


– Primary databases: experimental results
directly into database
– Secondary databases: results of analysis of
primary databases
– Aggregate of many databases
• Links to other data items
• Combination of data
• Consolidation of data
Different classifications of databases….

• Technical design
– Flat-files
– Relational database (SQL)
– Exchange/publication technologies (FTP,
HTML, CORBA, XML,...)

Different classifications of databases….

• Availability
– Publicly available, no restrictions
– Available, but with copyright
– Accessible, but not downloadable
– Academic, but not freely available
– Proprietary, commercial; possibly free for
academics
http://www3.ebi.ac.uk/Services/DBStats/

Other NCBI nucleic acids DBs
• EST database: A collection of expressed sequence tags, or short, single-pass sequence
reads from mRNA (cDNA).
• GSS database: A database of genome survey sequences, or short, single-pass genomic
sequences.
• HomoloGene: A gene homology tool that compares nucleotide sequences between pairs
of organisms in order to identify putative orthologs.
• HTG database: A collection of high-throughput genome sequences from large-scale
genome sequencing centers, including unfinished and finished sequences.
• SNP database: A central repository for both single-base nucleotide substitutions and
short deletion and insertion polymorphisms.
• RefSeq: A database of non-redundant reference sequence standards, including genomic
DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within
NCBI and with external groups, support data-gathering efforts.
• STS database: A database of sequence tagged sites, or short sequences that are
operationally unique in the genome.
• UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
• UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters,
each representing a unique known or putative human gene annotated with mapping and
expression information and cross-references to other sources.
Sequence submission
• Data come mainly from direct submissions by
the authors.
• Submissions through the Internet:
– Web forms.
– Email.
• Sequences shared/exchanged between
the 3 centers on a daily basis:
– The sequence content of the banks is
identical.
Derived databases
• CUTG: Codon usage tabulated from GenBank
http://www.kazusa.or.jp/codon/
• Genetic Codes: Deviations from the standard genetic code in various
organisms and organelles
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
• TIGR Gene Indices: Organism-specific databases of EST and gene
sequences
http://www.tigr.org/tdb/tgi.shtml
• UniGene: Unified clusters of ESTs and full-length mRNA sequences
http://www.ncbi.nlm.nih.gov/UniGene/
• ASAP: Alternatively spliced isoforms
http://www.bioinformatics.ucla.edu/ASAP
• Intronerator: Introns and alternative splicing in C. elegans and
C. briggsae
http://www.cse.ucsc.edu/~kent/intronerator/
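
As an illustration of what "derived" means, the sketch below tabulates codon usage from a coding sequence, the kind of summary a codon-usage database derives from primary sequence records. It is only a sketch under stated assumptions: the input sequence is a made-up toy example, and the function name codon_usage is hypothetical; it is not the procedure used by CUTG itself.

```python
from collections import Counter


def codon_usage(cds):
    """Count in-frame codons of a coding sequence, reading from position 1."""
    seq = cds.upper().replace(" ", "")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


# Toy coding sequence (hypothetical): ATG GAA GAG GAA TAA
usage = codon_usage("ATGGAAGAGGAATAA")
for codon, count in sorted(usage.items()):
    print(codon, count)
```

Summing such counts over all coding sequences of one organism in a primary database gives a per-organism codon usage table.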
Nucleic acid structure databases
• NDB: Nucleic acid-containing structures
http://ndbserver.rutgers.edu/
• NTDB: Thermodynamic data for nucleic acids
http://ntdb.chem.cuhk.edu.hk/
• RNABase: RNA-containing structures from PDB and NDB
http://www.rnabase.org/
• SCOR: Structural classification of RNA: RNA motifs by
structure, function and tertiary interactions
http://scor.lbl.gov/
Molecular Databases
• Primary Databases
– Original submissions by experimentalists
– Database staff organize the data but don’t add
additional information
• Examples: GenBank, SNP, GEO
• Derivative Databases
– Human curated
• Compilation and correction of data
• Examples: SWISS-PROT, NCBI RefSeq mRNA
– Computationally Derived
• Example: UniGene
– Combinations
What, the scientists submit their
own DNA sequences?
• Who checks for errors?
• Who makes people actually send their data to
the database so all can share it?
• Learn from the successes and failures of GenBank/EMBL:
extensive, publicly shared bio-data
• Carrot/stick approach. Granting agencies and
journals began requiring scientists to publish
sequence data. Patented sequences must be
entered in the databases too.
• However, public databanks contain significant
errors because the data are owned by the submitting
scientists, who have no inducement to update or
correct their entries
Primary vs. Derivative Databases
[Figure: diagram of scattered sequence fragments with the labels Labs, Sequencing Centers, GenBank, Curators, Genome Assembly, RefSeq, Algorithms, and UniGene]
Why use Bioinformatics Databases?
• Speed of information retrieval
• Increasing size of data sets
• Amount of information available
• Save time and money by simulating
experiments prior to the actual experiment
(a.k.a. in silico)
How do you access Databases?

• Search engines
– Programs that allow you to search the database
• Links from other sites to the search engines
• Programs that directly link to the search
engines
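
As one example of a program that links directly to a search engine, the sketch below builds a request URL for NCBI's E-utilities efetch service. E-utilities is not mentioned in these slides and is brought in here only as an illustration; the helper name build_efetch_url is hypothetical, and the accession X02775 is taken from the accession-number slide. The code only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Return a URL asking efetch for one record in GenBank flat-file text."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_EFETCH}?{query}"


url = build_efetch_url("X02775")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the record as plain text; a script can do this for many accessions, subject to the service's usage limits.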
What is GenBank?
NCBI’s Primary Sequence Database
• Nucleotide-only sequence database
• Archival in nature
• GenBank Data
– Direct submissions of individual records (BankIt,
Sequin)
– Batch submissions via email (EST, GSS, STS)
– FTP accounts for sequencing centers
• Data shared among three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory
Database (EMBL) at EBI
The International Sequence
Database Collaboration
[Figure: GenBank (NCBI, NIH; retrieval via Entrez), EMBL (EBI; retrieval via SRS), and DDBJ (CIB, NIG; retrieval via getentry) each receive submissions and updates and exchange them with one another]
EMBL/GenBank/DDBJ
• These three databases contain essentially the same information
within 2-3 days (with a few differences in format and
syntax)
• Contribution: EMBL 10 %; GenBank 73 %; DDBJ 17 %
• Serve as archives containing all sequences (single
genes, ESTs, complete genomes, etc.) derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)
• Non-confidential data are exchanged daily
• Currently: 18 × 10^6 sequences, ~30 × 10^9 bp
• Sequences from > 50,000 different species
What is an Accession Number?
• An accession number is a label used to
identify a sequence in the various databases. It is
a string of letters and/or numbers that
corresponds to a molecular sequence.
• Examples (all for retinol-binding protein, RBP4):
– X02775: GenBank genomic DNA sequence
– NT_030059: Genomic contig
– rs7079946: dbSNP (single nucleotide polymorphism)
– N91759.1: An expressed sequence tag (1 of 170)
– NM_006744: RefSeq DNA sequence (from a transcript)
– NP_007635: RefSeq protein
– AAC02945: GenBank protein
– Q28369: SwissProt protein
– 1KT7: Protein Data Bank structure record
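
The examples above follow recognizable patterns, which can be sketched as a small classifier. This is a heuristic illustration only: the regular expressions are simplified assumptions that cover the accession shapes listed on this slide, not the complete official format rules of each database, and the function name classify_accession is hypothetical.

```python
import re

# Heuristic patterns, checked in order; the first match wins.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+$")),               # NM_, NP_, NT_ ...
    ("dbSNP", re.compile(r"^rs\d+$", re.IGNORECASE)),         # rs + digits
    ("PDB", re.compile(r"^\d[A-Za-z0-9]{3}$")),               # digit + 3 characters
    ("SwissProt", re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$")),     # e.g. Q28369
    ("GenBank protein", re.compile(r"^[A-Z]{3}\d{5}$")),      # 3 letters + 5 digits
    ("GenBank nucleotide", re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})$")),
]


def classify_accession(accession):
    """Guess the source database of an accession from its shape alone."""
    base = accession.strip().split(".", 1)[0]  # drop a version suffix like ".1"
    for name, pattern in PATTERNS:
        if pattern.match(base):
            return name
    return "unknown"


for acc in ["X02775", "NT_030059", "rs7079946", "N91759.1", "NM_006744",
            "NP_007635", "AAC02945", "Q28369", "1KT7"]:
    print(acc, "->", classify_accession(acc))
```

The order of the patterns matters: Q28369 also has the one-letter-plus-five-digits shape of a GenBank nucleotide accession, so the SwissProt pattern is tried first.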
GenBank Record:
Feature Table
(Slide callout on /protein_id and /db_xref: GenPept protein IDs)
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
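
The numbers in this record can be cross-checked with a few lines of code. The sketch below is a minimal illustration, assuming only simple start..end locations (no join() or complement()); the helper names are hypothetical. It confirms that the BASE COUNT line adds up to the length of the source feature and that the CDS length is a whole number of codons.

```python
def span_length(location):
    """Length in bases of a simple 'start..end' feature location."""
    start, end = (int(x) for x in location.split(".."))
    return end - start + 1


def parse_base_count(line):
    """Turn 'BASE COUNT 1201 a 689 c ...' into {'a': 1201, 'c': 689, ...}."""
    tokens = line.split()[2:]  # drop the words BASE and COUNT
    return {base: int(count) for count, base in zip(tokens[0::2], tokens[1::2])}


counts = parse_base_count("BASE COUNT 1201 a 689 c 782 g 1136 t")
source_len = span_length("1..3808")
cds_len = span_length("258..3302")

print(sum(counts.values()), source_len)  # both should be the sequence length
print(cds_len, cds_len % 3)              # CDS length and remainder modulo 3
```

With /codon_start=1, a CDS of 3045 bases corresponds to 1015 codons, which for a complete coding region would be 1014 amino acids plus a stop codon.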
Some databases in the field of molecular biology…

AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract,
ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase,
FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT,
Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI,
MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB,
NRSub, O-GLYCBASE, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase,
PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP,
RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB,
STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository,
SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB,
WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.!
