Databases 2025
SRV 1
• Genome sequencing and many other large-scale research projects have generated an explosive
growth in biological data
• A biological database is a store of biological information
• It is a collection of structured, searchable and up-to-date data
• Each record deposited in a database is assigned a unique identifier (accession number) that can be cited in
publications
Classification
• Primary, secondary and composite databases
• Depending on source of information
• Sequence and structure databases
• Depending upon type of information
• Public and private databases
• Depending upon free or paid accessibility
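The three classification axes above are independent of one another, so a database can be described by one value on each axis. A minimal Python sketch of that idea follows. The `DatabaseEntry` record type, the `select` helper and the small catalogue are hypothetical, built only from examples named elsewhere in these notes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseEntry:
    # Hypothetical record type, for illustration only
    name: str
    source: str   # "primary", "secondary" or "composite"
    content: str  # "sequence" or "structure"
    access: str   # "public" or "private"


# Small hand-made catalogue using examples from these notes
CATALOGUE = [
    DatabaseEntry("GenBank", "primary", "sequence", "public"),
    DatabaseEntry("PDB", "primary", "structure", "public"),
    DatabaseEntry("Pfam", "secondary", "sequence", "public"),
    DatabaseEntry("OWL", "composite", "sequence", "public"),
]


def select(catalogue, **criteria):
    """Return the names of entries whose attributes match every criterion."""
    return [
        entry.name
        for entry in catalogue
        if all(getattr(entry, key) == value for key, value in criteria.items())
    ]


primary_names = select(CATALOGUE, source="primary")
sequence_names = select(CATALOGUE, content="sequence")
print(primary_names)
print(sequence_names)
```

Criteria can be combined, e.g. `select(CATALOGUE, source="primary", content="structure")`, because each axis is stored as a separate attribute.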
• Primary database:
o Consists of data derived experimentally
o Grown tremendously over the years
o Contains information of the sequence or structure alone and associated annotation information
o Includes:
• Sequence databases
• Structure databases
o Examples,
• Nucleotide / DNA databases: EMBL, GenBank, DDBJ
• Protein databases: SwissProt, TrEMBL, PIR, PSD, PDB
• Secondary database:
o A secondary sequence database contains information derived from a primary database, such as
conserved sequences and active-site residues of protein families, obtained by
multiple sequence alignment of a set of related proteins
o A secondary structure database organizes entries of the Protein Data Bank (PDB), e.g. by
classifying all PDB entries according to structural elements such as α-helices or β-sheets, and also holds
information on conserved secondary structure motifs of particular proteins
o Examples,
• Sequence related information: ProSite, Pfam, REBase, Enzyme Nomenclature Database
• Genome related information: Online Mendelian Inheritance in Man (OMIM), TransFac
• Structure related information: Databases of Secondary Structure Assignments (DSSA),
Homology-Derived Secondary Structures of Proteins (HSSP), Dali
• Pathway information: KEGG
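The derivation of conserved residues from a multiple sequence alignment, mentioned above for secondary sequence databases, can be illustrated with a small sketch. It assumes the alignment is already computed and given as equal-length strings with `-` for gaps. The function name and the toy sequences are made up for illustration, and real resources use more elaborate statistics (profiles, HMMs) than strict identity:

```python
def conserved_columns(alignment):
    """Return (position, residue) for every column where all sequences agree.

    Columns containing a gap character are never reported as conserved.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")

    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}  # distinct symbols in column i
        if len(column) == 1 and "-" not in column:
            conserved.append((i, alignment[0][i]))
    return conserved


# Made-up aligned sequences, not taken from any real database
toy_alignment = [
    "GDSGGP",
    "GDSGAP",
    "GDSG-P",
]
conserved = conserved_columns(toy_alignment)
print(conserved)
```

Positions are 0-based; column 4 is skipped because the sequences differ there and one of them has a gap.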
• Composite database:
o Joins a variety of different primary database sources
o Obviates the need to search multiple resources
o Examples:
• NRDB (Non-Redundant Database): a non-redundant composite of PDB sequences, SWISS-PROT,
SWISS-PROTupdate, PIR, GenPept and GenPeptupdate
• OWL: a non-redundant composite of protein sequence databases, e.g. SWISS-PROT, PIR,
GenBank (translation), PSD and NRL-3D
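The idea of a non-redundant composite can be sketched as a merge keyed on the sequence itself. This is only an illustration under simple assumptions: redundancy is defined as exact sequence identity, and the source names, entry identifiers and sequences are invented. Real composite databases may apply source priorities or other rules not shown here:

```python
def build_composite(sources):
    """Merge {db_name: {entry_id: sequence}} into one non-redundant mapping.

    Each distinct sequence is stored once; every (db_name, entry_id) pair
    that carried it is kept as a cross-reference.
    """
    composite = {}
    for db_name, entries in sources.items():
        for entry_id, sequence in entries.items():
            composite.setdefault(sequence, []).append((db_name, entry_id))
    return composite


# Invented identifiers and sequences, for illustration only
toy_sources = {
    "SWISS-PROT": {"SP_0001": "MKTAYIAK", "SP_0002": "MSLLTEVE"},
    "PIR": {"PIR_A": "MKTAYIAK", "PIR_B": "MGDVEKGK"},
}
composite = build_composite(toy_sources)
for sequence, cross_refs in composite.items():
    print(sequence, cross_refs)
```

Four input entries collapse to three stored sequences, and a single search of the composite covers both sources.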
• Public database:
• Most of the databases are public
• These are freely accessible to everyone, anywhere in the world
• Example, NCBI
• Private database:
• Private companies sequence genomes of commercially or scientifically interesting organisms
• The data are not available to the public free of charge
• Academic researchers usually cannot afford the fees required to access these databases
• Mainly used by the pharmaceutical and biotech industries
• Examples, Genome databases of Saccharomyces cerevisiae and Caenorhabditis elegans
Nucleotide sequence databases
• The International Nucleotide Sequence Database Collaboration (INSDC) consists of the following databases:
– GenBank (National Center for Biotechnology Information; NCBI; USA)
– EMBL (European Molecular Biology Laboratory; Europe)
– DDBJ (DNA Data Bank of Japan; Japan)
• GenBank
• Developed and maintained by the National Center for Biotechnology Information (NCBI) at the
National Institutes of Health (NIH)
• Primary nucleic acid public database
• Contains publicly available nucleotide sequences and their protein translations, with supporting bibliographic and biological
annotation
• EMBL
• The Laboratory operates from six sites: the main laboratory in Heidelberg (Germany), and sites in
Hinxton (the European Bioinformatics Institute; EBI; England), Grenoble (France), Hamburg
(Germany), Monterotondo (near Rome, Italy) and Barcelona (Spain)
• Primary nucleic acid public database
• DDBJ
• Located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan
• Primary nucleic acid public database
Protein Sequence Databases
• Protein sequence databanks collect additional information about proteins, like ligands, subunit
association, disulfide bridges, catalytic activity, family, etc.
• Most of the information is collected from literature
• Many entries in these databases are derived by translating nucleic acid sequences
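Translation of a nucleic acid sequence into a protein sequence, as mentioned above, can be sketched in a few lines. The example assumes the standard genetic code, a coding DNA strand read in the first frame, and a made-up input sequence. The names `CODON_TABLE` and `translate` are chosen here for illustration:

```python
# Standard genetic code, written in the conventional T, C, A, G ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate coding DNA in frame 1, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unrecognized codon
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate("ATGGCCAAATAA")  # made-up sequence: ATG GCC AAA TAA
print(peptide)
```

Codons containing ambiguous bases such as `N` are reported as `X` rather than raising an error, and trailing bases that do not fill a complete codon are ignored.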
PIR International:
• It was the very first sequence database, set up at the National Biomedical Research Foundation
(Georgetown University, Washington DC, USA)
• In 1988 the PIR joined with two other groups: the Munich Information Center for Protein Sequences
(MIPS) in Germany and the Japan International Protein Information Database (Tsukuba)
• The PIR maintains several databases about proteins:
• PIR-PSD: about protein sequence
• iProClass: classification of protein according to structure and function
• ASDB: annotation and similarity database
• PIR-NREF: a database of sequences and annotations of proteins of known structure deposited in
the PDB
• RESID: a database of covalent structure modifications (e.g. S-S bridges)
SwissProt:
• It is an annotated protein sequence database established in 1986 and maintained collaboratively since
1987 by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data
Library (now the EBI)
• Entries describe the function of the protein, post-translational modifications, domains and
sites, secondary structure, quaternary structure, similarities to other proteins, diseases associated with
deficiencies in the protein, variants and more
• PROSITE - Database of Protein Families and Domains
• Pfam - Protein families database of alignments and HMMs - (Sanger Institute)
• Database of Interacting Proteins - Univ. of California
• InterPro - Classifies proteins into families and predicts the presence of domains and sites
• UniProt Universal Protein Resource - EBI, Swiss Institute of Bioinformatics, PIR
• Swiss-Prot Protein Knowledgebase - Swiss Institute of Bioinformatics
Structure Databases
• Structure databases archive, annotate and distribute sets of atomic coordinates to visualize three
dimensional structures
• Contain specific information about stereochemical analysis, like bond lengths and angles, X-ray
crystal structures and NMR spectroscopic data
• The best established database for biological macromolecular structures is the Protein Data Bank
(PDB)
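Stereochemical quantities such as bond lengths and angles follow directly from the atomic coordinates that structure databases store. The sketch below shows the arithmetic only. The coordinates are invented round numbers rather than values from a real PDB entry, and the function names are chosen for illustration:

```python
import math


def bond_length(a, b):
    """Euclidean distance between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, measured at the central atom b."""
    v1 = [p - q for p, q in zip(a, b)]  # vector from b to a
    v2 = [p - q for p, q in zip(c, b)]  # vector from b to c
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] to guard against floating-point rounding
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


# Invented coordinates in angstroms, for illustration only
atom_1 = (0.0, 0.0, 0.0)
atom_2 = (1.5, 0.0, 0.0)
atom_3 = (1.5, 1.5, 0.0)

length_12 = bond_length(atom_1, atom_2)
angle_123 = bond_angle(atom_1, atom_2, atom_3)
print(length_12, angle_123)
```

The same two functions can be applied to coordinates read from a structure file once the x, y and z values of the atoms of interest have been extracted.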
Protein Data Bank (PDB):
• Primary database
• It is an American database started in 1971 by the late Walter Hamilton at Brookhaven National
Laboratory on Long Island, New York
• It is now managed by the Research Collaboratory for Structural Bioinformatics (RCSB), whose founding
members were Rutgers University in New Jersey, the San Diego Supercomputer Center in California and the National
Institute of Standards and Technology in Maryland
• It contains 3-D structures of proteins, nucleic acids and some carbohydrates
• Most of the data in the PDB are generated by X-ray crystallography and NMR
• Includes:
• Protein Data Bank in Europe (PDBe)
• Secondary databases:
• SCOP - Structural Classification of Proteins
• CATH - Protein structure classification (Class, Architecture, Topology, Homologous superfamily) of domains derived from PDB entries
• PDBsum - A pictorial database that provides an at-a-glance overview of the contents of each 3D
structure deposited in the Protein Data Bank
Pathway Databases
• These are databases that describe biochemical pathways, reactions and enzymes
• When modeling and simulating a biological pathway, it is useful to select suitable information from public
pathway databases such as KEGG and BioCyc
KEGG:
BioCyc:
• BioCyc Database Collection - Includes EcoCyc and MetaCyc
• Small Molecule Pathway Database (SMPDB)
• KEGG PATHWAY Database - Univ. of Kyoto
• MANET database - University of Illinois
• Metabolights - Metabolomics experiments and derived information: metabolite structures, reference
spectra, biological roles, locations and concentrations (European Bioinformatics Institute)
• Reactome - Navigable map of human biological pathways, ranging from metabolic processes to
hormonal signalling (Cold Spring Harbor Laboratory, European Bioinformatics Institute, Gene Ontology
Consortium)
• Microarrays allow snapshots to be made of expression levels (and hence abundance) for thousands
of genes in a single experiment
• The amount of information generated by a microarray-based experiment is so large that no
single study can be expected to mine every piece of scientific information it contains
• The number of completed microarray experiments is growing rapidly, so massive
amounts of valuable functional genomics data have already been generated
• The Microarray Informatics Team at the EBI was established in May 2000 to address the problem of
managing and analyzing these data
• The team found that dedicated systems were needed for the management and storage of microarray data
• ArrayExpress - A public database for microarray-based gene expression data, set up by the EBI;
exchanges information with the NCBI and DDBJ microarray databases every week
• Gene Expression Omnibus - NCBI
• Stanford Microarray Database (SMD) - Stanford University
• Genevestigator - Expression Search Engine (Nebion AG)
• GPX - Scottish Centre for Genomic Technology and Informatics
• Mutant databases contain all kinds of information about naturally occurring and laboratory-made mutants
• Amongst plants, most of the mutant databases cover Arabidopsis, but databases for crop plants have
also been created
Arabidopsis thaliana Insertion Database (ATIDB):
• It is a collaboration between Cold Spring Harbor Laboratory (USA) and the John Innes Centre
(UK)
• The ATIDB was designed as a public tool for genome researchers and other biologists to find
breeding lines of Arabidopsis created with insertional mutagenesis and to facilitate the study of their
distribution on the World Wide Web
• Literature databases contain scientific articles or their abstracts
• Search results usually list the authors' names, title, publication and date (citation information)
• There are several high-quality databases, but the most widely used is PubMed
PubMed:
• Meta-databases (databases of databases) collect data from different sources and make them available in a new
and more convenient form, or with an emphasis on a particular disease or organism
• Some examples:
1. ConsensusPathDB - A molecular functional interaction database, integrating information from 12
other databases
2. Entrez - National Center for Biotechnology Information
3. Enzyme Portal - Integrates enzyme information such as small-molecule chemistry, biochemical
pathways and drug compounds (European Bioinformatics Institute)
4. MetaBase (KOBIC) - A user contributed database of biological databases
5. mGen - Contains four of the world's biggest databases (GenBank, RefSeq, EMBL and DDBJ) and offers easy,
program-friendly gene extraction
6. PathogenPortal - A repository linking to the Bioinformatics Resource Centers (BRCs) sponsored by
the National Institute of Allergy and Infectious Diseases (NIAID)
7. SOURCE - (Stanford University) encapsulates the genetics and molecular biology of genes from the
genomes of Homo sapiens, Mus musculus, and Rattus norvegicus into easy to navigate
GeneReports
• Genome databases collect organism genome sequences, annotate and analyze them, and provide public
access
• Some add curation of experimental literature to improve computed annotations
• These databases may hold the genomes of many species, or a single model organism genome
• CAMERA - Resource for microbial genomics and metagenomics
• Corn - the Maize Genetics and Genomics Database
• EcoCyc - A database that describes the genome and the biochemical machinery of the model
organism E. coli K-12
• Ensembl Genomes - Provides genome-scale data for bacteria, protists, fungi, plants and
invertebrate metazoa, through a unified set of interactive and programmatic interfaces (using the
Ensembl software platform)
• FlyBase - Genome of Drosophila melanogaster
• National Microbial Pathogen Data Resource - A manually curated database of annotated genome data
for the pathogens Campylobacter, Chlamydia, Chlamydophila, Haemophilus, Listeria, Mycoplasma,
Neisseria, Staphylococcus, Streptococcus, Treponema, Ureaplasma and Vibrio
• Saccharomyces Genome Database - Genome of yeast
• The SEED platform - For microbial genome analysis; includes all complete microbial genomes and
most partial genomes. The platform is used to annotate microbial genomes using subsystems
• WormBase - Genome of Caenorhabditis elegans
• The Arabidopsis Information Resource (TAIR)
• Rat Genome Database (RGD) - Genomic and phenotype data for Rattus norvegicus
Proteomics databases
• Proteomics Identifications Database (PRIDE)
• A public repository for proteomics data, containing protein and peptide identifications and their
associated supporting evidence as well as details of post-translational modifications (EBI)
• MitoMiner
• A mitochondrial proteomics database integrating large-scale experimental datasets from mass
spectrometry and GFP studies for various species (Medical Research Council Mitochondrial
Biology Unit)
RNA databases
• Rfam – A database of RNA families
• miRBase - The microRNA database
• snoRNAdb - A database of snoRNAs
• lncRNAdb - A database of lncRNAs
• GtRNAdb - A database of genomic tRNAs
• SILVA - A database of rRNAs
• RDP - The Ribosomal Database Project
Carbohydrate structure databases
• EuroCarbDB - A repository for both carbohydrate sequences / structures and experimental data
Protein-protein interactions
• BIND - Biomolecular Interaction Network Database
• DIP - Database of Interacting Proteins
• STRING - A database of known and predicted protein-protein interactions (EMBL)
• MINT - Molecular INTeraction database
• The Cell Collective - A web-based platform that enables laboratory scientists from across the globe to
collaboratively build large-scale models of various biological processes, and simulate/analyze them in
real time
Signal transduction pathway databases
PCR and quantitative PCR primer databases
• PathoOligoDB - A free qPCR oligo database for pathogens
• RTPrimerDB - A public primers and probes database for real-time PCR reactions
Taxonomic databases
• Catalogue of Life source databases
• Encyclopedia of Life
• Integrated Taxonomic Information System
• EzTaxon-e - Database for the identification of prokaryotes based on 16S ribosomal RNA gene
sequences
• Antibody Central - Antibody information database and search resource
• Barcode of Life Data Systems - A database of DNA barcodes
• Connectivity map - Transcriptional expression data and correlation tools for drugs
• CTD - The Comparative Toxicogenomics Database describes chemical-gene-disease interactions
• Drug2Gene - Provides integrated information for identified and reported relations between
genes/proteins and drugs/compounds
• GreenPhylDB - A phylogenomic database for plant comparative genomics
• HGMD - Human Gene Mutation Database (disease-causing mutations)
• HvrBase++ - Human and primate mitochondrial DNA
• Oncogenomic databases - A compilation of databases that serve for cancer research
• OMIM - Online Mendelian Inheritance in Man (inherited diseases)
• p53 - The p53 knowledgebase
• PHI-base - Pathogen-host interaction database
• TRANSFAC - A database about eukaryotic transcription factors, their genomic binding sites and DNA-binding profiles
• TreeBASE - An open-access database of phylogenetic trees and the data behind them
• TreeFam - Tree families database; a database of phylogenetic trees of animal genes