Lec2 Databases
Databases
What are Databases?
• A database is a structured collection of
information.
• A database consists of basic units called
records or entries.
• Each record consists of fields, which hold predefined
data related to the record.
• For example, a protein database would have
protein sequences as records and protein
properties as fields (e.g., name of protein,
length, amino-acid sequence, …)
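As a rough illustration of the record/field idea, here is a minimal Python sketch. The class name `ProteinRecord`, its fields, and the toy sequence are assumptions made up for this example; they do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One database record; each attribute plays the role of a field."""
    name: str
    sequence: str  # amino-acid sequence in one-letter codes (toy data)

    @property
    def length(self) -> int:
        # Derived field: the number of residues in the sequence.
        return len(self.sequence)


# A toy "database": simply a collection of records.
database = [ProteinRecord(name="example protein", sequence="MKTAYIAK")]

for record in database:
    print(record.name, record.length)
```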
What is a Database?
• A database can be defined as "a collection of
data arranged for ease and speed of search and
retrieval."
• A DNA database contains individual records or
data entries of the DNA sequences as well as
information about the sequences.
• A DNA database often contains flat-files. These
are relatively simple database systems in which
each database is contained in a single table.
• In contrast, relational database systems can use
multiple tables to store information, and each
table can have a different record format.
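The relational idea can be sketched with Python's built-in `sqlite3` module. The table and column names below are hypothetical; the single row reuses the values of one example record from this lecture (accession Z71230, taxon 4097, Nicotiana tabacum, 124 bases).

```python
import sqlite3

# In-memory database with two tables, each with its own record format.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organism (taxon_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sequence (
        accession TEXT PRIMARY KEY,
        taxon_id  INTEGER REFERENCES organism(taxon_id),
        length    INTEGER
    );
""")
conn.execute("INSERT INTO organism VALUES (4097, 'Nicotiana tabacum')")
conn.execute("INSERT INTO sequence VALUES ('Z71230', 4097, 124)")

# A join combines information stored in the two tables.
rows = conn.execute("""
    SELECT s.accession, o.name, s.length
    FROM sequence AS s
    JOIN organism AS o ON s.taxon_id = o.taxon_id
""").fetchall()
print(rows)
```

A flat file would instead repeat the organism name inside every sequence record; the relational layout stores it once and links to it by `taxon_id`.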
GenBank as a Database
• GenBank is the National Institutes of Health
(NIH) genetic sequence database, an
annotated collection of all publicly available
DNA sequences.
• It is maintained by the National Center for
Biotechnology Information (NCBI) within the
NIH.
History of Sequence Databases
• The first bioinformatics databases were constructed a
few years after the first protein sequences began to
become available.
• The first protein sequence reported was that of bovine
insulin in 1956, consisting of 51 residues.
• Nearly a decade later, the first nucleic acid sequence
was reported, that of yeast alanine tRNA with 77 bases.
• Just a year later, Dayhoff gathered all the available
sequence data to create the first bioinformatic database.
• The Protein Data Bank followed in 1972 with a collection
of ten X-ray crystallographic protein structures, and the
SWISS-PROT protein sequence database began in 1986.
Different classifications of databases
• Type of data
– nucleotide sequences
– protein sequences
– protein sequence patterns or motifs
– macromolecular 3D structure
– gene expression data
– metabolic pathways
02/22/24 07:46
• A database can be thought of as a large
table, where the rows represent records and
the columns represent fields.
Databases on the Internet
• Biological databases often have web
interfaces, which allow users to send queries to
the databases.
• Some databases can be accessed by different
web servers, each offering a different interface.
Database 1: nucleotide sequences
• The 3 main nucleic acid sequence databases are
EMBL (Europe)/GenBank (USA) /DDBJ (Japan)
4. Cross-referenced.
5. Minimum redundancy.
Tips for the Practical Session
• We will go over several databases in a very
short time. Don’t expect to remember all the
small details. They are not important. All
respectable databases have a “HELP”
component.
• Try to:
– Learn the common features of biological databases.
– Understand the main features of every database.
– Learn how to use the online HELP.
– Judge and compare databases.
EBI/NCBI/DDBJ
• These 3 databases contain mainly the same information
within 2-3 days (few differences in format and syntax)
• Serve as archives containing all sequences (single genes,
ESTs, complete genomes, etc.) derived from:
– Genome projects
– Sequencing centers
– Individual scientists
– Literature
– Patent offices
• Non-confidential data exchanged daily
• The database has doubled in size approximately every 18 months.
EBI/NCBI/DDBJ
• Heterogeneous: sequence length, genomes, variants,
fragments, …
• Minimum sequence size: 10 bp
• Archive: nothing goes out -> highly redundant!
• Full of errors: in sequences, in annotations, in CDS
attribution, …
• No consistency of annotations: most annotations are
done by the submitters, so the quality, completeness, and
currency of the information are heterogeneous
EBI/NCBI/DDBJ
• Unexpected information you can find:
• ACCESSION Z71230
FT source 1..124
FT /db_xref="taxon:4097"
FT /organelle="plastid:chloroplast"
FT /organism="Nicotiana tabacum"
FT /isolate="Cuban Cahibo cigar, gift from President Fidel
FT Castro"
• ACCESSION NC_001610
FT source 1..17084
FT /chromosome="complete mitochondrial genome"
FT /db_xref="taxon:9267"
FT /organelle="mitochondrion"
FT /organism="Didelphis virginiana"
FT /dev_stage="adult"
FT /isolate="fresh road killed individual"
FT /tissue_type="liver"
EMBL Nucleotide Sequence Database
• An annotated collection of all publicly available
nucleotide and protein sequences
• http://www.ebi.ac.uk/embl.html
DDBJ–DNA Data Bank of Japan
• An annotated collection of all publicly available
nucleotide and protein sequences
• http://www.ddbj.nig.ac.jp
• There are 126,551,501,141 bases in
135,440,924 sequence records in the traditional
GenBank divisions and 191,401,393,188 bases
in 62,715,288 sequence records in the WGS
division as of April 2011.
Annotation
• These billions of Gs, As, Ts, and Cs would be
useless without the "annotation" in each sequence
record.
Sequences
1 The LOCUS field consists of five different subfields:
3 ACCESSION (Z92910) - Unique identifier assigned to a complete
sequence record. This number never changes, even if the record is
modified. An accession number is a combination of letters and
numbers that are usually in the format of one letter followed by five
digits (e.g., M12345) or two letters followed by six digits (e.g.,
AC123456).
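The two usual accession formats described above can be checked with a regular expression. This is only a sketch: the name `looks_like_accession` is made up for illustration, and the pattern covers just the two formats named in the text, so other valid accession styles will be rejected by it.

```python
import re

# One letter + five digits (e.g. M12345) or two letters + six digits
# (e.g. AC123456): the two formats named in the text, nothing more.
ACCESSION_RE = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")


def looks_like_accession(text: str) -> bool:
    """Return True if text matches one of the two usual formats."""
    return ACCESSION_RE.fullmatch(text) is not None


print(looks_like_accession("Z92910"))    # accession of the example record
print(looks_like_accession("Z92910.1"))  # a VERSION string, not a bare accession
```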
4 VERSION (Z92910.1) - Identification number assigned to a single,
specific sequence in the database. This number is in the format
“accession.version.” If any changes are made to the sequence data,
the version part of the number will increase by one. For example
U12345.1 becomes U12345.2. A version number of Z92910.1 for this
HFE sequence indicates that the sequence data has not been
altered since its original submission.
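The "accession.version" convention can be illustrated with a short Python sketch. The helper names `split_version` and `next_version` are hypothetical; the real version bump is done by the database, not by the user.

```python
def split_version(identifier: str) -> tuple[str, int]:
    """Split 'accession.version' into its two parts."""
    accession, _, version = identifier.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def next_version(identifier: str) -> str:
    """Identifier the record would carry after one change to the sequence data."""
    accession, version = split_version(identifier)
    return f"{accession}.{version + 1}"


print(split_version("Z92910.1"))
print(next_version("U12345.1"))
```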
5 GI (1890179) - Also a sequence identification number. Whenever a
sequence is changed, the version number is increased and a new GI
is assigned. If a nucleotide sequence record contains a protein
translation of the sequence, the translation will have its own GI
number.
6 KEYWORDS (haemochromatosis; HFE gene) - A keyword can be
any word or phrase used to describe the sequence. Keywords are
not taken from a controlled vocabulary. Notice that in this record the
keyword, "haemochromatosis," employs British spelling, rather than
the American "hemochromatosis." Many records have no keywords.
A period is placed in this field for records without keywords.
7 SOURCE (human) - Usually contains an abbreviated or common
name of the source organism.
The FEATURES table
A feature is simply an annotation that describes a portion of
the sequence.
source - An obligatory feature. The source gives the length of
the entire sequence, the scientific name of the source
organism, and the Taxon ID number.
gene - Sequence portion that delineates the beginning and
end of a gene.
exon - Sequence segment that contains an exon. Exons may
contain portions of 5' and 3' UTRs (untranslated regions). The
name of the gene to which the exon belongs and exon number
are provided.
CDS - Sequence of nucleotides that code for amino acids of the
protein product (coding sequence).
The CDS begins with the first nucleotide of the start codon and
ends with the third nucleotide of the stop codon.
This feature includes the translation into amino acids and may
also contain gene name, gene product function, link to protein
sequence record, and cross-references to other database
entries.
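The CDS definition above suggests a simple sanity check on a coding sequence. The sketch below assumes the standard genetic code (start codon ATG; stop codons TAA, TAG, TGA) and a complete CDS; partial CDS features or alternative start codons would be flagged even though they can be legitimate. The function name `check_cds` is made up for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def check_cds(cds: str) -> list[str]:
    """Return a list of problems found in a complete coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("does not start with the ATG start codon")
    if cds[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    return problems


print(check_cds("ATGAAATAA"))  # toy CDS: start codon, one codon, stop codon
```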
intron - Transcribed but spliced-out parts. Intron number is
shown.
polyA_signal - Identifies the sequence portion required for
endonuclease cleavage of an mRNA transcript. Consensus
sequence for the polyA signal is AATAAA.
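Locating candidate polyA signals amounts to searching for the consensus motif. A minimal sketch follows; the function name `find_polya_signals` is hypothetical, and exact matching of AATAAA ignores the variant signals that occur in real transcripts.

```python
def find_polya_signals(sequence: str, motif: str = "AATAAA") -> list[int]:
    """Return 1-based start positions of every exact motif occurrence."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)  # feature tables count from 1
        start = sequence.find(motif, start + 1)
    return positions


print(find_polya_signals("ccAATAAAgg"))
```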
BASE COUNT & ORIGIN
BASE COUNT - Base Count gives the total number of adenine
(A), cytosine (C), guanine (G), and thymine (T) bases in the
sequence.
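The BASE COUNT line can be reproduced from the sequence itself. The sketch below uses Python's `collections.Counter`; the function name `base_count` and the toy sequence are assumptions for illustration, and ambiguity codes such as `n` are simply ignored.

```python
from collections import Counter


def base_count(sequence: str) -> dict[str, int]:
    """Count a, c, g and t in a nucleotide sequence (case-insensitive)."""
    counts = Counter(sequence.lower())
    return {base: counts.get(base, 0) for base in "acgt"}


print(base_count("acgtacgtaa"))  # toy sequence, not from a real record
```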
Molecule-specific and topic-specific databases
AsDb - Aberrant Splicing db
ACUTS - Ancient conserved untranslated DNA sequences db
Codon Usage Db
EPD - Eukaryotic Promoter db
HOVERGEN - Homologous Vertebrate Genes db
IMGT - ImMunoGeneTics db [Mirror at EBI]
ISIS - Intron Sequence and Information System
RDP - Ribosomal db Project
gRNAs db - Guide RNA db
PLACE - Plant cis-acting regulatory DNA elements db
PlantCARE - Plant cis-acting regulatory DNA elements db
sRNA db - Small RNA db
ssu rRNA - Small ribosomal subunit db
lsu rRNA - Large ribosomal subunit db
5S rRNA - 5S ribosomal RNA db
tmRNA Website
tmRDB - tmRNA dB
tRNA - tRNA compilation from the University of Bayreuth
uRNADB - uRNA db
RNA editing - RNA editing site
RNAmod db - RNA modification db
SOS-DGBD - Db of Drosophila DNA sequences annotated with regulatory binding sites
TelDB - Multimedia Telomere Resource
TRADAT - TRAnscription Databases and Analysis Tools
Subviral RNA db - Small circular RNAs db (viroid and viroid-like)
MPDB - Molecular probe db
OPD - Oligonucleotide probe db
VectorDB - Vector sequence db (seems dead!)
Organism specific databases:
FlyBase (Drosophila)
SGD (yeast)
MaizeDB (maize)
SubtiList (B. subtilis).
Entrez: the search and retrieval system
that integrates information from
the National Center for
Biotechnology Information (NCBI) databases.
Databases: protein sequences
• SWISS-PROT: created in 1986 (Amos Bairoch) http://www.expasy.org/sprot/
[Figure: CDS from EMBLnew and EMBL (cDNAs, genomes, …; some delayed or cancelled) pass through "automated" steps: redundancy check (merge), family attribution (InterPro), and annotation (computer).]
• Once an entry is in SWISS-PROT, it is no longer in TrEMBL, but it remains in EMBL (archive).
• CDS: proposed and submitted to EMBL by authors or by genome projects (can be experimentally
proven or derived from gene prediction programs). TrEMBL neither translates DNA sequences nor
uses gene prediction programs: it only takes the CDS proposed by the submitting authors in the EMBL entry.
SWISS-PROT and the cross-references (X-ref)
• SWISS-PROT was the first database with cross-references (X-ref).
1. Non-redundancy.
EMBL: The Genome divisions
http://www.ebi.ac.uk/genomes/
>sequence name
[sequence]…
Protein structure database
http://genome.ucsc.edu/
Summary
Genes:
Growth Genes
Tumor suppressor genes
Proteins:
Growth Factors
Enzymes
Receptors
Pathways:
Cell death
Systems:
Immune system
Blood supply
Function:
Role of proteins
Molecular interactions
Essential Bioinformatics and Biocomputing (LSM2104), NUS
Biological Information
Nucleic acids:
• DNA sequence, genes, gene products (proteins), mutation,
gene coding, distribution patterns, motifs
• Genomics: genome, gene structure and expression, genetic
map, genetic disorder
• RNA sequence, secondary structure, 3D structure,
interactions
Proteins:
• Protein sequence, corresponding gene, secondary structure,
3D structure, function, motifs, homology, interactions
• Proteomics: expression profile, proteins in disease processes
etc.
• Ligands and drugs (inhibitors, activators, substrates,
metabolites)
Biological Information
Pathways:
• Molecular networks, biological chain events,
regulation, feedback, kinetic data
Function:
• Binding sites, interactions, molecular action
(binding, chemical reaction, etc.)
• Biological effect (signaling, transport, feedback,
regulation, modification, etc.)
• Functional relationship, protein families, motifs, and
homologs
SWISS-PROT entry P00709
2. Among the oldest databases – the first structure was deposited in 1972.
3. The number of newly deposited structures has been growing steadily (3298 in 2001, and
1486 from January 1 to June 5, 2002).
• Technical design
– Flat-files
– Relational database (SQL)
– Exchange/publication technologies (FTP,
HTML, CORBA, XML,...)
Different classifications of databases….
• Availability
– Publicly available, no restrictions
– Available, but with copyright
– Accessible, but not downloadable
– Academic, but not freely available
– Proprietary, commercial; possibly free for
academics
http://www3.ebi.ac.uk/Services/DBStats/
Other NCBI nucleic acids DBs
• EST database: A collection of expressed sequence tags, or short, single-pass sequence
reads from mRNA (cDNA).
• GSS database: A database of genome survey sequences, or short, single-pass genomic
sequences.
• HomoloGene: A gene homology tool that compares nucleotide sequences between pairs
of organisms in order to identify putative orthologs.
• HTG database: A collection of high-throughput genome sequences from large-scale
genome sequencing centers, including unfinished and finished sequences.
• SNPs database: A central repository for both single-base nucleotide substitutions and
short deletion and insertion polymorphisms.
• RefSeq: A database of non-redundant reference sequence standards, including genomic
DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within
NCBI and with external groups, support data-gathering efforts.
• STS database: A database of sequence tagged sites, or short sequences that are
operationally unique in the genome.
• UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
• UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters,
each representing a unique known or putative human gene annotated with mapping and
expression information and cross-references to other sources.
Sequence submission
• Data come mainly from direct submissions by the
authors.
• Submissions through the Internet:
– Web forms.
– Email.
• Sequences shared/exchanged between
the 3 centers on a daily basis:
– The sequence content of the banks is
identical.
Derived databases
• CUTG - Codon usage tabulated from GenBank
http://www.kazusa.or.jp/codon/
• Genetic Codes - Deviations from the standard genetic code in various
organisms and organelles
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
• TIGR Gene Indices - Organism-specific databases of EST and gene
sequences
http://www.tigr.org/tdb/tgi.shtml
• UniGene - Unified clusters of ESTs and full-length mRNA sequences
http://www.ncbi.nlm.nih.gov/UniGene/
• ASAP - Alternative spliced isoforms
http://www.bioinformatics.ucla.edu/ASAP
• Intronerator - Introns and alternative splicing in C. elegans and
C. briggsae
http://www.cse.ucsc.edu/~kent/intronerator/
Nucleic acid structure databases
• NDB - Nucleic acid-containing structures
http://ndbserver.rutgers.edu/
[Figure: scattered sequence fragments; labels: Labs, Genome Assembly, GenBank, UniGene, Algorithms]
Why use Bioinformatics Databases?
• Speed of information retrieval
• Search engines
– Programs that allow you to search the database
[Figure: data flow diagram; labels: NIG, EMBL, Submissions, Updates, SRS, getentry]
EMBL/GenBank/DDBJ
• These 3 databases contain mainly the same information
within 2-3 days (few differences in format and
syntax)
• Contribution: EMBL 10 %; GenBank 73 %; DDBJ 17 %
• Serve as archives containing all sequences (single
genes, ESTs, complete genomes, etc.) derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers
– Individual scientists (15 % of entries)
– Patent offices (e.g., European Patent Office, EPO)
• Non-confidential data are exchanged daily
• Currently: 18 × 10^6 sequences, ~30 × 10^9 bp
• Sequences from > 50,000 different species
What is an Accession Number?
• An accession number is a label used to
identify a sequence in the various databases. It is
a string of letters and/or numbers that
corresponds to a molecular sequence.
• Examples (all for retinol-binding protein, RBP4):
– X02775 GenBank genomic DNA sequence
– NT_030059 Genomic contig
– rs7079946 dbSNP (single nucleotide polymorphism)
– N91759.1 An expressed sequence tag (1 of 170)
– NM_006744 RefSeq DNA sequence (from a
transcript)
– NP_007635 RefSeq protein
– AAC02945 GenBank protein
– Q28369 SwissProt protein
– 1KT7 Protein Data Bank structure record
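Some of the identifiers in the list above can be told apart by prefix alone. The sketch below is a rough heuristic built only from the examples on this slide; the names `PREFIX_LABELS` and `describe_accession` are made up, and identifiers without one of these prefixes (GenBank, SwissProt, PDB) are deliberately left unclassified because their formats overlap.

```python
# Prefix -> description, taken from the examples on the slide.
PREFIX_LABELS = {
    "NM_": "RefSeq DNA sequence (from a transcript)",
    "NP_": "RefSeq protein",
    "NT_": "genomic contig",
    "rs": "dbSNP (single nucleotide polymorphism)",
}


def describe_accession(accession: str) -> str:
    """Describe an identifier by its prefix (case-insensitive heuristic)."""
    lowered = accession.lower()
    for prefix, label in PREFIX_LABELS.items():
        if lowered.startswith(prefix.lower()):
            return label
    return "no recognizable prefix"


for example in ["NM_006744", "NP_007635", "NT_030059", "rs7079946", "X02775"]:
    print(example, "->", describe_accession(example))
```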
GenBank Record: Feature Table
(Slide callout "GenPept Protein IDs" points to the /protein_id and /db_xref="GI:…" qualifiers of the CDS.)
FEATURES Location/Qualifiers
source 1..3808
/organism="Limulus polyphemus"
/db_xref="taxon:6850"
/tissue_type="lateral eye"
CDS 258..3302
/note="N-terminal protein kinase domain; C-terminal myosin
heavy chain head; substrate for PKA"
/codon_start=1
/product="myosin III"
/protein_id="AAC16332.2"
/db_xref="GI:7144485"
/translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT 1201 a 689 c 782 g 1136 t
ORIGIN
1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
3781 aagatacagt aactagggaa aaaaaaaa
//
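The numbers in this record can be cross-checked against each other. The sketch below copies the source span, the CDS span, and the BASE COUNT values from the record above; the variable names are arbitrary, and the final step assumes the CDS ends with a stop codon, as described earlier for the CDS feature.

```python
# Values copied from the record above.
base_counts = {"a": 1201, "c": 689, "g": 782, "t": 1136}
source_start, source_end = 1, 3808
cds_start, cds_end = 258, 3302

# Locations are 1-based and inclusive, hence the "+ 1".
sequence_length = source_end - source_start + 1
cds_length = cds_end - cds_start + 1

# The base counts should add up to the full sequence length.
print(sum(base_counts.values()) == sequence_length)

# A complete CDS should be a whole number of codons.
print(cds_length, cds_length % 3 == 0)

# The last codon is the stop codon, so it encodes no amino acid.
amino_acids = cds_length // 3 - 1
print(amino_acids)
```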
Some databases in the field of molecular biology…
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP,
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase,
OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD,
PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE,
SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB,
TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD,
YPM, etc.!