
Unit-2 (2)

Unit 2 discusses biological databases and data retrieval, focusing on their classification based on source and data nature, including primary and secondary databases, nucleotide, protein, and structural databases. It highlights the importance of these databases for storing and retrieving information related to biomolecules, as well as tools for data access and visualization. Additionally, the unit covers small molecular databases like PubChem and Drug Bank, emphasizing their significance in chemical and drug research.

Uploaded by

Priyanshi Singh
Copyright
© © All Rights Reserved

Unit 2 Biological Databases and Data Retrieval

UNIT 2
BIOLOGICAL DATABASES AND
DATA RETRIEVAL

Structure
2.1 Introduction
    Expected Learning Outcomes
2.2 Classification of Biological Databases
    Classification Based on Source
    Nucleotide Databases
    Protein Database
    Structural Databases
    CATH
2.3 Classification of Biological Databases Based on Nature of Data
2.4 Small Molecular Databases
    PubChem
    Drugbank
    ZINC Database
    Cambridge Structure Database (CSD)
2.5 Structure Viewing Tools and File Formats
2.6 Summary
2.7 Terminal Questions
2.8 Answers
2.1 INTRODUCTION
In the previous unit, you learned about the basics of computers and their
applications. In this unit, you will be studying biological databases. In these
databases, information related to DNA, RNA, proteins, and other biomolecules
is stored in a systematic way on servers called data servers. Scientists,
academicians, and researchers across the globe can retrieve this biological
data whenever they need it for analysis.

In the BBCCT-101 course (Molecules of Life), you studied various
biomolecules and their significance. You are also aware that these molecules
interact with each other through various metabolic reactions.
BBCS-185 Bioinformatics Skill Enhancement Course

Some important characteristic features of biological databases include:

1. They store data in electronic form in various formats.
2. Each entry in the database is assigned a unique number or ID, which
cannot be repeated for any other entry. In other words, entries are
non-redundant.
3. Data sharing: the data can be downloaded from various websites related
to biological databases or through FTP (File Transfer Protocol).
4. They are well structured and searchable, and the information is updated
periodically in line with publications and innovations in the scientific world.
5. Research publications and books also refer to entries by their unique IDs.
This is called a cross-reference.
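
The features listed above (unique IDs, non-redundancy, searchability, and cross-references) can be illustrated with a toy in-memory database. This is only a minimal sketch: the class name ToyDatabase, the IDs such as TOY0001, and the reference label PUB:0001 are hypothetical and do not correspond to any real biological database.

```python
class ToyDatabase:
    """A toy in-memory database keyed by a unique entry ID (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def add(self, entry_id, sequence, cross_refs=()):
        # Non-redundancy: an ID that already exists cannot be reused.
        if entry_id in self._entries:
            raise ValueError(f"duplicate ID: {entry_id}")
        self._entries[entry_id] = {
            "sequence": sequence,
            "cross_refs": list(cross_refs),  # e.g. IDs cited in publications
        }

    def get(self, entry_id):
        # Retrieval by unique ID.
        return self._entries[entry_id]

    def search(self, motif):
        # Searchability: return the IDs of entries whose sequence contains the motif.
        return sorted(
            entry_id
            for entry_id, entry in self._entries.items()
            if motif in entry["sequence"]
        )


db = ToyDatabase()
db.add("TOY0001", "ATGGCCATTGTAATG", cross_refs=["PUB:0001"])
db.add("TOY0002", "ATGAAACCCGGGTTT")
print(db.search("CCC"))
print(db.get("TOY0001")["cross_refs"])
```

Real databases enforce the same idea with accession numbers that act as primary keys.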

Expected Learning Outcomes


After studying this unit, you should be able to:

define biological databases;

classify biological databases;

list applications of biological databases in research and data retrieval
through web links;

describe chemical, biological and structural databases; and

explain the availability of online and offline visualization tools.

2.2 CLASSIFICATION OF BIOLOGICAL DATABASES
In Unit 11 of the BBCCT-105 (Proteins) course, you came across some basic
concepts of biological databases. You are advised to recall those concepts
before proceeding further in this unit. The classification of biological
databases is simple and is based on the source and the nature of the data
collected.

2.2.1 Classification Based on Source


1. Primary databases: These databases are constructed from data collected
in laboratory experiments. After the experiments, the data are validated
and analyzed before being uploaded to the database, which is a crucial
step in data collection. Primary databases are classified by the type of
biological molecule: nucleic acid databases (GenBank, EMBL, DDBJ, NDB),
protein databases (PIR, Swiss-Prot, TrEMBL, PDB), metabolic pathway
databases (KEGG, EcoCyc, and MetaCyc), and small molecule databases
(PubChem, DrugBank, ZINC, CSD).

2. Secondary databases: These databases are constructed from primary
biological databases with additional information. Secondary databases
comprise data derived from the results of analyzing the primary data
available in primary databases. They are often referred to as curated
databases, but this is a bit of a misnomer because primary databases are
also curated to ensure that the data in them is consistent and accurate.

Secondary databases often draw upon information from numerous sources,
including other databases (primary and secondary), controlled
vocabularies, and the scientific literature. They are highly curated and
often use a complex combination of computational algorithms and manual
analysis and interpretation to derive new knowledge from published data.

Secondary databases have become the molecular biologist's reference
library over the past decade or so, providing a wealth of (often daunting)
information on just about any gene or gene product that has been
investigated by the research community. The potential for mining this
information to make new discoveries is vast.

Table 2.1: Differences between primary and secondary databases
(Source: https://www.ebi.ac.uk/)

                Primary database                  Secondary database

Synonyms        Archival database                 Curated database; knowledgebase

Source of data  Direct submission of              Results of analysis, literature
                experimentally-derived data       research and interpretation,
                from researchers                  often of data in primary
                                                  databases

Examples        ENA, GenBank and DDBJ             InterPro (protein families,
                (nucleotide sequence);            motifs and domains);
                ArrayExpress and GEO              UniProt Knowledgebase (sequence
                (functional genomics data);       and functional information on
                Protein Data Bank (PDB;           proteins);
                coordinates of three-             Ensembl (variation, function,
                dimensional macromolecular        regulation and more layered
                structures)                       onto whole genome sequences)

So far, you have studied the types of biological databases based on source.
Now let us learn about nucleotide databases.

2.2.2 Nucleotide Databases


It is well known that DNA and RNA are the major nucleic acids. You have studied
the structure of these nucleic acids in Units 13 and 14 of BBCCT-101. Each
protein or enzyme is coded by a specific gene or genes, and each gene is in
turn a DNA sequence. A database that stores such gene-coding sequences is
called a nucleotide database. These databases are repositories used to store
and retrieve data, in terms of nucleotides, from various genomes (a genome is
the complete set of chromosomes of an organism).

Let us see some of the examples of these nucleotide databases:

GenBank: It is an integral part of the main biological database resource, NCBI
(National Center for Biotechnology Information). It has a tool called Entrez,
which helps to retrieve data from GenBank.

EMBL: The European Molecular Biology Laboratory database is available at the
European Bioinformatics Institute (EBI). SRS (Sequence Retrieval System) is a
tool for retrieving desired protein, DNA, or gene sequences from this database.

DDBJ: The DNA Data Bank of Japan is another nucleotide database.

All three nucleotide databases above are interconnected with each other
through data sharing and allow access to the data through web links and data
servers (Fig. 2.1).

Swiss-Prot: This database is owned by EMBL and maintained by SIB.

TrEMBL: It contains the largest number of translated sequences.

Fig. 2.1: Graphical representation of Nucleotide databases.
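
Retrieval from GenBank through Entrez, as described above, can also be done programmatically through NCBI's E-utilities web service. The sketch below only builds the request URL for the efetch endpoint and does not contact the network. The helper name efetch_url and the accession NM_000207 are illustrative choices, not part of this unit.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an efetch URL that returns one record as plain text.

    `efetch_url` is a hypothetical helper written for this example.
    """
    params = {
        "db": db,            # Entrez database to query (nucleotide records)
        "id": accession,     # unique ID (accession number) of the entry
        "rettype": rettype,  # requested format, here FASTA
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


url = efetch_url("NM_000207")  # example accession, chosen for illustration
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the sequence in FASTA format, provided the accession exists.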

2.2.3 Protein Database


PIR: The Protein Information Resource is located at NBRF (National Biomedical
Research Foundation). It holds complete protein information such as the
source and the protein crystal structures available in the Protein Data Bank
(with ID). PIR entries are classified into four types:

PIR1: This database is fully classified and annotated.

PIR2: This is a basic database with preliminary protein information.

PIR3: This database has unverified entries.

PIR4: This database holds genetically engineered sequences. It helps to
understand the possibilities of engineering proteins for research activities.

SAQ 1
Fill in the blanks:

i) ____________ tool is used to retrieve data from GenBank.

ii) DDBJ is a ____________ database.

iii) EMBL database is maintained by ____________.

iv) The full form of NCBI is ____________.

v) ____________ database is related to classification of protein families.

2.2.4 Structural Databases


The Protein Data Bank (PDB) comprises various databases.

PDB is a part of the Worldwide Protein Data Bank, which collects, organizes,
and disseminates data on biological macromolecular structures like proteins,
enzymes, and DNA/RNA.

PDBj (Protein Data Bank Japan) maintains a centralized PDB archive of
macromolecular structures and provides integrated tools, in collaboration with
the Research Collaboratory for Structural Bioinformatics (RCSB) and
the Biological Magnetic Resonance Data Bank (BMRB) in the USA, and the
PDBe in the EU.

RCSB: Research Collaboratory for Structural Bioinformatics (RCSB-PDB).
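
Each PDB entry is identified by a short ID (for example 1GCN or 1CRN, both used later in this unit), and the coordinate file can be downloaded by that ID. The sketch below builds a download address on the RCSB file server without contacting the network. The helper name and the ID pattern (a digit followed by three letters or digits, as in the classic four-character IDs) are assumptions made for this example.

```python
import re

# Assumption: classic four-character PDB IDs, a digit 1-9 followed by
# three letters or digits (for example 1GCN).
PDB_ID_PATTERN = re.compile(r"[1-9][A-Za-z0-9]{3}")


def rcsb_download_url(pdb_id):
    """Return the RCSB download address of a PDB-format file (hypothetical helper)."""
    if not PDB_ID_PATTERN.fullmatch(pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


print(rcsb_download_url("1gcn"))
```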

Secondary structural Databases

Let us learn about secondary structural databases. They include:

1. SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/, Fig. 2.2)

2. CATH

3. PDBSUM

1. SCOP (Structural Classification of Proteins) is a database started by the
Laboratory of Molecular Biology, MRC, Cambridge, UK. The aim of this
database is to classify protein 3D structures according to a hierarchical
scheme. SCOP organizes structures into the levels of species, proteins,
families, superfamilies, folds, and classes. Each class is further divided
according to structural organization. The five structural classes are
(Fig. 2.3):

a. All alpha

b. All beta

c. Alpha/beta (α/β)

d. Alpha+beta (α+β)

e. Multi-domain folds

Fig. 2.2: Screenshot showing SCOP home page.

Fig. 2.3: SCOP hierarchy chart showing structure-based and evolution-based
classification.

2.2.5 CATH
The CATH database (http://www.cathdb.info/) is a free, publicly available
online resource that provides information on the evolutionary relationships of
protein domains. It was created in the mid-1990s by Professor Christine
Orengo and colleagues, and continues to be developed by the Orengo group
at University College London.

Experimentally determined protein structures from the databank are split
into their consecutive polypeptide chains, where applicable. Protein domains
are identified within these chains using a mixture of automatic methods and
manual curation that are available to the scientific community. The domains
are then classified within the CATH structural hierarchy:

Class (C) level: domains are assigned according to their secondary structure
content, i.e. all alpha, all beta, a mixture of alpha and beta, or little
secondary structure.

Architecture (A) level: information on the arrangement of secondary
structures in three-dimensional space is used.

Topology/fold (T) level: information on how the secondary structure elements
are connected and arranged is used.

Homologous superfamily (H) level: assignments are made if there is sufficient
evidence that the domains are related by evolution, i.e. they are homologous.

To browse the classification hierarchy, visit the CATH hierarchy web page
(Fig. 2.4).
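
The four levels described above are reflected in CATH identifiers, which are written as four dot-separated numbers, one per level. The sketch below splits such a code into its C, A, T and H parts. The function names are hypothetical, the example code 3.40.50.720 is used only for illustration, and the numbering of classes 1 to 4 is an assumption based on the order in which the classes are listed above.

```python
from typing import NamedTuple

# Assumed numbering of the Class (C) level, in the order listed in the text.
CATH_CLASSES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "mixed alpha and beta",
    4: "little secondary structure",
}


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code):
    """Split a dotted code such as '3.40.50.720' into its four levels."""
    parts = code.split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def lineage(code):
    """Return the chain of parent nodes from Class down to the superfamily."""
    parts = [str(level) for level in parse_cath_code(code)]
    return [".".join(parts[: depth + 1]) for depth in range(4)]


example = parse_cath_code("3.40.50.720")
print(CATH_CLASSES[example.class_])
print(lineage("3.40.50.720"))
```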

Additional sequence data for domains with no experimentally determined
structures are provided by sister resources like Gene3D and are used to
populate the homologous superfamilies. Protein sequences from
UniProtKB and Ensembl are scanned against CATH HMMs (hidden Markov
models) to predict domain sequence boundaries and make homologous
superfamily assignments.

Fig. 2.4: CATH home page.

Learners can explore more about proteins, enzymes, and receptors by using
various sub-search methods like 3D structure, protein evolution, protein
function, and conserved sites. You can also access the latest information
about the development and updating of the database from the Learn more tab
of the CATH home page. You may download the complete database from
the download link. This will help learners to understand protein classification
through the structural organization of proteins.

SAQ 2
Define the following terms:

i) PDB

ii) RCSB

iii) SCOP

2.3 CLASSIFICATION OF BIOLOGICAL DATABASES BASED ON NATURE OF
DATA
Up to now, you have learned about biological databases and their
classification based on source. Now you will learn how to classify biological
databases based on the nature of their data. There are currently five types
of databases:

1. Sequence databases: These databases consist of DNA, RNA, and protein
sequences. You can access a gene or protein sequence by searching the
respective database with its name or unique ID. Examples: EMBL (European
Molecular Biology Laboratory), NCBI (National Center for Biotechnology
Information).

2. Structural databases: These are specialized databases of protein and DNA
structures derived from X-ray or NMR experiments or from theoretical
models. Some structural databases hold the crystal structures of
chemicals. Examples: Protein Data Bank (PDB), Cambridge Crystallographic
Data Centre (CCDC).

3. Literature databases: These databases are very important for the
development and advancement of science and technology as well as other
disciplines. They help researchers, academicians, and scientists to search
for information and download it in electronic form, such as HTML (HyperText
Markup Language), PDF (Portable Document Format), JPEG (Joint
Photographic Experts Group), and other formats. Examples: PubMed,
Medline, National Digital Library.

4. Gene expression databases: These databases help in understanding gene
function, such as the up-regulation or down-regulation of cellular
activities. The Gene Expression Database (GXD) is a community resource for
gene expression information from the laboratory mouse. GXD stores and
integrates different types of expression data and makes these data freely
available in formats appropriate for comprehensive analysis. This database
helps to relate the expression and control of a gene to other genes through
various systems biology software. Nowadays, scientists are working on gene
molecular networks.

5. Metabolic pathway databases: These are curated databases of
experimentally elucidated metabolic pathways from all domains of life, and
they are well maintained and regularly updated. As you know, metabolic
pathways such as carbohydrate metabolism (BBCCT-109) involve many
enzymatic reactions to reach the final product. The main characteristics of
metabolic pathway databases are as follows:

1) Serve as an online encyclopedia of metabolism

2) Predict metabolic pathways in sequenced genomes

3) Support metabolic engineering via an enzyme database

4) Aid metabolomics research via a metabolite database

Examples: KEGG (Kyoto Encyclopedia of Genes and Genomes), MetaCyc, BioCyc

2.4 SMALL MOLECULAR DATABASES
In the previous sections, you have studied various biological databases and
resources such as NCBI, GenBank, Swiss-Prot, EMBL, EBI, DDBJ, CATH, and
SCOP. All of these relate to proteins and nucleic acids, which are molecules
of high molecular weight, so they are called macromolecular databases. Low
molecular weight molecules are stored in and retrieved from databases known
as small molecular databases. Most of these molecules are organic molecules
like drugs, antibiotics, vaccines, peptides, elements, compounds, etc. We will
now discuss small molecular databases in detail.

2.4.1 PubChem
The PubChem database is a primary source of information on chemicals, drugs,
and their derivatives. It is one of the largest freely accessible chemical
information resources in the world. We can search for chemicals by molecular
formula, name, structure, and other identifiers. Further, one can find
chemical and physical properties, safety and toxicity information, biological
activities, literature citations, patents, and more. New chemicals and
substances are added regularly as new information becomes available from the
literature or from experimental results. The database is very useful for
finding chemicals available from vendors as well as new chemicals. Many
scientists screen molecules from PubChem for the treatment of various
diseases such as atherosclerosis and cardiovascular diseases. Anyone can
submit scientific data to this database and become a data provider. The
database is important and useful for scientists, students, and the general
public. Each month, the database and its programmatic services provide data
about compounds to several million users worldwide.
(https://pubchemdocs.ncbi.nlm.nih.gov/)
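
The programmatic services mentioned above include a REST interface (PUG REST) in which a search by name is expressed as a URL. The sketch below only builds such a URL and does not contact the network. The helper name pubchem_property_url and the compound "aspirin" are illustrative choices for this example.

```python
from urllib.parse import quote

# Base address of the PubChem PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(name, properties=("MolecularFormula", "MolecularWeight")):
    """Build a PUG REST URL that asks for properties of a compound by name.

    `pubchem_property_url` is a hypothetical helper written for this example.
    """
    encoded_name = quote(name, safe="")  # escape spaces and other special characters
    property_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/property/{property_list}/JSON"


aspirin_url = pubchem_property_url("aspirin")
print(aspirin_url)
```

Fetching the printed URL with a browser or an HTTP client should return the requested properties as JSON, assuming the name is known to PubChem.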

2.4.2 Drug Bank


DrugBank is a crucial database of existing drugs that are approved by the
Food and Drug Administration (FDA) to treat various diseases. It is also one
of the largest drug databases in the world. DrugBank is a curated
pharmaceutical knowledge base, with products commercially available for
precision medicine, telehealth, and drug discovery. It also provides important
drug information in a structured, unified resource. DrugBank Online is a
free-to-access website that provides highly detailed information across
multiple topics including pharmacology, chemical structures, targets,
metabolism, and toxicology. Because the data are integrated, you can search
by text, gene sequence, chemical structure, and more. A comprehensive
dataset can be downloaded free of charge for academic and non-commercial
research. (Visit https://www.drugbank.com/ to know more about DrugBank.)

2.4.3 ZINC Database


ZINC is a free database of commercially available compounds for virtual
screening. It contains over 230 million purchasable compounds in
ready-to-dock, 3D formats. ZINC also contains over 750 million purchasable
compounds that can be searched for analogs in a short span of time. ZINC is
maintained by the Irwin and Shoichet Laboratories in the Department of
Pharmaceutical Chemistry at the University of California, San Francisco
(UCSF). This database is used by researchers, trainers, scientists, biotech
companies, research organizations, and university scholars for drug
discovery. (Visit https://zinc.docking.org/ to know more about the ZINC
database.)

2.4.4 Cambridge Structure Database (CSD)


The CSD is the world's repository for small-molecule organic and
metal-organic crystal structures. Containing over one million structures from
X-ray and neutron diffraction analyses, this unique database of accurate 3D
structures has become an essential resource to scientists around the world.

Newly added entries are checked automatically, and the chemical information
is then further verified by in-house scientific editors before the entries are
released online to the public. Each chemical structure is enhanced for good
quality visualization, downloading, and understanding of physical properties.
This knowledge has been applied across academia and industry in pursuit of
new drugs, novel materials and a greater understanding of chemical and
crystallographic phenomena. (Visit
https://www.ccdc.cam.ac.uk/solutions/csd-core/components/csd/ to know more
about CSD.) You will learn more about how to retrieve data from these
databases while performing exercise number 3 of this course.

SAQ 3 Do as directed
i) Write a short note on chemical databases used in drug design and drug
discovery.

ii) Define the term curated database. List a few chemical databases
developed using the curation method.

2.5 STRUCTURE VIEWING TOOLS AND FILE FORMATS
Molecular structures, whether organic molecules or biological molecules like
proteins, DNA, RNA, lipids and carbohydrates, can be visualized through
specific software. Most molecular visualization software is used not only for
visualization but also for various modifications and for calculating bond
lengths, angles, energy, rotatable bonds, molecular weight, and other
parameters. These parameters are very important in molecular visualization.
In addition, a few more advanced software packages help us to calculate the
binding energy between a drug and its receptor, and molecular stability in
various solvents and at various pH values, temperatures, etc.

Here, we are going to discuss basic software used for the visualization of
biomolecules. There are various file formats available to view these molecules
in three-dimensional space. This means that each atomic position should be
defined from the origin with respect to the X, Y, and Z axes. The format most
widely used for viewing in basic software is the pdb (Protein Data Bank)
format. It has a standard layout, shown in the example below. While
performing exercises number 7 and 8 you will learn more about these tools and
file formats.

Examples of PDB Format

Glucagon is a small protein of 29 amino acids in a single chain. The first
residue is the amino-terminal amino acid, histidine, which is followed by a
serine residue and then glutamine. The coordinate information (entry 1gcn)
starts with:

ATOM      1  N   HIS A   1      49.668  24.248  10.436  1.00 25.00           N
ATOM      2  CA  HIS A   1      50.197  25.578  10.784  1.00 16.00           C
ATOM      3  C   HIS A   1      49.169  26.701  10.917  1.00 16.00           C
ATOM      4  O   HIS A   1      48.241  26.524  11.749  1.00 16.00           O
ATOM      5  CB  HIS A   1      51.312  26.048   9.843  1.00 16.00           C
ATOM      6  CG  HIS A   1      50.958  26.068   8.340  1.00 16.00           C
ATOM      7  ND1 HIS A   1      49.636  26.144   7.860  1.00 16.00           N
ATOM      8  CD2 HIS A   1      51.797  26.043   7.286  1.00 16.00           C
ATOM      9  CE1 HIS A   1      49.691  26.152   6.454  1.00 17.00           C
ATOM     10  NE2 HIS A   1      51.046  26.090   6.098  1.00 17.00           N
ATOM     11  N   SER A   2      49.788  27.850  10.784  1.00 16.00           N
ATOM     12  CA  SER A   2      49.138  29.147  10.620  1.00 15.00           C
ATOM     13  C   SER A   2      47.713  29.006  10.110  1.00 15.00           C
ATOM     14  O   SER A   2      46.740  29.251  10.864  1.00 15.00           O
ATOM     15  CB  SER A   2      49.875  29.930   9.569  1.00 16.00           C
ATOM     16  OG  SER A   2      49.145  31.057   9.176  1.00 19.00           O
ATOM     17  N   GLN A   3      47.620  28.367   8.973  1.00 15.00           N
ATOM     18  CA  GLN A   3      46.287  28.193   8.308  1.00 14.00           C
ATOM     19  C   GLN A   3      45.406  27.172   8.963  1.00 14.00           C

Notice that each line or record begins with the record type ATOM. The atom
serial number is the next item in each record.
(Source:
https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html)
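
The ATOM records shown above can be read by a short program. The sketch below is a minimal illustration, not a full PDB parser: it assumes the fields are separated by whitespace, as in the first three records of the example. Real PDB files use fixed column positions, so slicing each line by column is more robust. The names parse_atom and bond_length are hypothetical.

```python
import math

# First three ATOM records of entry 1gcn, as listed in the example above.
RECORDS = """\
ATOM 1 N HIS A 1 49.668 24.248 10.436 1.00 25.00 N
ATOM 2 CA HIS A 1 50.197 25.578 10.784 1.00 16.00 C
ATOM 3 C HIS A 1 49.169 26.701 10.917 1.00 16.00 C
"""


def parse_atom(line):
    """Turn one whitespace-separated ATOM record into a dictionary."""
    fields = line.split()
    if len(fields) != 12 or fields[0] != "ATOM":
        raise ValueError(f"not a 12-field ATOM record: {line!r}")
    return {
        "serial": int(fields[1]),     # atom serial number
        "name": fields[2],            # atom name, e.g. CA for alpha carbon
        "residue": fields[3],         # residue name, e.g. HIS
        "chain": fields[4],           # chain identifier
        "residue_number": int(fields[5]),
        "xyz": tuple(float(value) for value in fields[6:9]),  # X, Y, Z coordinates
        "occupancy": float(fields[9]),
        "temperature_factor": float(fields[10]),
        "element": fields[11],
    }


atoms = [parse_atom(line) for line in RECORDS.splitlines()]

# Distance between the backbone N and CA atoms of the first residue.
bond_length = math.dist(atoms[0]["xyz"], atoms[1]["xyz"])
print(f"N-CA distance: {bond_length:.3f}")
```

Distances like this are what visualization tools compute when they report bond lengths from the X, Y, and Z coordinates.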

There are plenty of tools to show structures in space, from small molecules
to higher-level molecular structures. A few of them are listed below.

1. RasMol

Most protein structure database tools available today are well equipped with
graphical visualization tools. A commonly used tool for academic and
research purposes is RasMol. It is a molecular graphics program intended
to visualize proteins, nucleic acids and small molecules that are available
in a 3-D structure format. In order to display a molecule, RasMol requires
an atomic coordinate file that specifies the position of every atom in the
molecule through its 3-D Cartesian coordinates (Fig. 2.5). RasMol accepts
this coordinate file in a variety of formats, including the Protein Data
Bank (PDB) format. The tool gives the user a choice of color schemes and
molecular representations: wireframe, cylinder (Dreiding) stick bonds,
alpha-carbon trace, space-filling (CPK) spheres, macromolecular ribbons
(either smooth-shaded solid ribbons or parallel strands), hydrogen bonding
and dot surface. Additional features such as text labeling for selected
atoms, different color schemes for different parts of the molecule, zoom,
rotation, etc. have made it one of the most popular visualization tools.

This standalone software can be downloaded from the RasMol website.

Website: http://www.openrasmol.org/

Fig. 2.5: RasMol software with Crystals of Crambin with PDB ID: 1CRN.

2. Chime

Chime and Protein Explorer are derivatives of RasMol that allow visualization
of structures inside web browsers, while RasMol runs independently outside a
web browser. Hence, Chime should be used only online, when connected to
the Internet. Another feature of Chime is that only certain molecules that are
allowed by the company can be seen, unlike RasMol, where any protein
molecule with atomic coordinates can be seen.

You can access Chime at: www.umass.edu/microbio/chime

Now-
also used widely. This can be downloaded on personal computers to view
molecules like proteins, DNA and RNA.

3. MolMol

MolMol stands for Molecule analysis and Molecule display. This is also free
software with a lot of features that are not found in RasMol and Chime. MolMol
is a molecular graphics program for display, analysis and manipulation of
three-dimensional structures of biological macromolecules, with special
emphasis on nuclear magnetic resonance (NMR) solution structures of
proteins and nucleic acids. MolMol can be reached at:
www.mol.biol.ethz.ch/wuthrich/software/molmol

4. PyMOL

PyMOL is a user-friendly and popular molecular visualization tool built on
an open-source foundation, and it is maintained and distributed by
Schrödinger. It is widely used in structural bioinformatics, biophysics,
computer-aided drug design, and other fields of biology. It offers an
advanced graphical user interface for molecular visualization and is used to
view a protein binding with a ligand or drug in 3D space. The software allows
the viewer to label atoms, bonds, distances, angles, residues, residues with
numbers, chains, and types of bond interactions (Fig. 2.6). It has many
features. For example, one can model proteins and DNA with secondary
structural information. The user can view proteins in different forms like
balls and sticks, wires, and molecular surfaces with atomic energy
distribution. This tool is freely available to students and academic
institutions with a legal agreement or registration. The tutorial and
software download are available at https://pymol.org/2/

Fig. 2.6: Visualization of a ligand as sticks within a cavity of a protein
using PyMOL software.

5. SPDBV

Swiss-PdbViewer is an application that provides a user-friendly interface to
analyze several proteins at the same time. The proteins can be superimposed
in order to deduce structural alignments and compare their active sites or any
other relevant parts. Amino acid mutations, H-bonds, angles and distances
between atoms can be viewed easily. This tool works through an intuitive
graphic and menu interface (Fig. 2.7).

Swiss-PdbViewer was developed in 1994 by Nicolas Guex. Swiss-PdbViewer
is closely linked to SWISS-MODEL, an automated homology modeling server
developed within the Swiss Institute of Bioinformatics (SIB) at the Structural
Bioinformatics Group associated with the Biozentrum in Basel.

Working with the SWISS-MODEL and Swiss-PdbViewer programs greatly
reduces the amount of time required to generate models. It is possible to
thread a protein primary sequence onto a 3D template and get immediate
feedback on how well the threaded protein will be accepted by the reference
structure before submitting a request to build missing loops and refine
side-chain packing.

Fig. 2.7: Protein structure visualization with Spdbv software.

2.6 SUMMARY
Biological databases are used to store experimental data in various formats
that can be accessed through the internet.

In biological databases, to ensure non-redundancy of the data, a unique
number or ID is assigned to each entry as its primary key.

All biological databases are well structured so that data can be
retrieved across the globe with ease in a short span of time.

There are two types of biological databases based on source: 1. Primary
databases, which hold data collected from the laboratory: GenBank, EMBL,
DDBJ, NDB, TrEMBL, PIR, Swiss-Prot and PDB. 2. Secondary databases, which
are derived from the results of analyzing the primary data in primary
databases: InterPro (protein families, motifs, and domains), UniProt,
Ensembl, and BRENDA.

Macromolecules like proteins, DNA and RNA come under separate
databases. Similarly, small molecules have their own databases.
Examples of small molecular databases are PubChem, DrugBank, ZINC,
and CSD (Cambridge Structure Database).

SCOP, CATH and PDBSUM serve as Secondary structural databases.

SCOP has five structural classes: 1. All alpha, 2. All beta, 3. Alpha/beta
(α/β), 4. Alpha+beta (α+β), 5. Multi-domain folds.

Based on the nature of the data in biological databases, there are five types
of databases:

1. Sequence databases: EMBL, NCBI, GenBank

2. Structural databases: RCSB-PDB and CCDC

3. Literature databases: PubMed, Medline, NDL

4. Gene expression databases: GXD and Gene Expression Omnibus (GEO).
GEO is a database repository of high-throughput gene expression data
and hybridization arrays, chips, and microarrays.

5. Metabolic pathway databases: KEGG (Kyoto Encyclopedia of Genes and
Genomes), MetaCyc, BioCyc

To view a protein or any other molecule in virtual space, 3D coordinates
(X, Y, and Z) along the axes with respect to the origin are required.

RasMol is a basic and free molecular visualization software.

PyMOL has a well-developed graphical user interface with good rendering
options and advanced features.

2.7 TERMINAL QUESTIONS


1. What are the basic principles and characteristics of an ideal biological
database?

2. Write a note on primary databases and secondary databases with
special emphasis on biological databases.

3. Describe nucleotide databases.

4. Explain secondary structural databases.

5. Write in detail about the classification of biological databases based on
the nature of data.

2.8 ANSWERS
Self Assessment Questions
1. i) Entrez
ii) DNA
iii) European Bioinformatics Institute
iv) National Center for Biotechnology Information
v) PIR

2. i) Protein Data Bank
ii) Research Collaboratory for Structural Bioinformatics
iii) Structural Classification of Proteins
3. i) Refer to Sections 2.4.1 to 2.4.3.
ii) Refer to Metabolic Pathway Databases under Section 2.3.

Terminal Questions
1. Biological databases have the following principles or characteristics:

i) Biological data can be stored in electronic form in various formats.

ii) As it is a database, each entry is assigned a unique number or ID,
which cannot be repeated. In other words, entries are non-redundant.

iii) Biological data sharing: the data can be downloaded from various
websites related to biological databases or through FTP (File Transfer
Protocol).

iv) Biological databases are well structured and searchable, and the
information is updated periodically in line with publications and
innovations in the scientific world.

v) Research publications and books also refer to entries by their unique
IDs. This is called a cross-reference.

2. Primary database versus secondary database

Synonyms
Primary database: Archival database
Secondary database: Curated database; knowledgebase

Source of data
Primary database: Direct submission of experimentally-derived data from
researchers
Secondary database: Results of analysis, literature research and
interpretation, often of data in primary databases

Examples
Primary database: ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress
and GEO (functional genomics data); Protein Data Bank (PDB; coordinates of
three-dimensional macromolecular structures)
Secondary database: InterPro (protein families, motifs and domains); UniProt
Knowledgebase (sequence and functional information on proteins); Ensembl
(variation, function, regulation and more layered onto whole genome
sequences)


3. Refer section (2.2.2)

i) Importance of nucleotide databases

ii) Enlist nucleotide databases and describe in detail.

4. Refer section (2.2.4)

i) Write about SCOP, CATH and PDBsum in detail.

ii) Classification of SCOP

iii) Few points on CATH databases.

5. Refer section (2.3)


Exercise 4
DATABASES: NCBI, PDB, SCOP,
PUBMED, GENBANK, UNIPROT

Structure
4.1 Introduction
    Expected Learning Outcomes
4.2 Databases and Retrieval
4.3 Summary
4.4 Lab Exercises

4.1 INTRODUCTION
In this exercise, you will learn about biological databases that are widely used
in the field of bioinformatics.
Databases are systematic collections of related data, and software packages
are used to define and manage them. Because biological data are growing
exponentially, publicly accessible databases now hold a large amount of
information about biomolecules. Data are often no longer published in the
conventional way but are instead submitted directly to databases.
Generally, biological databases can be classified into sequence databases,
structural databases, genome databases, proteome databases, specialized
databases, etc.

Expected Learning Outcomes


After performing this exercise you shall be able to:

browse the required information from the databases;

explain the importance of databases;

differentiate between different biological databases; and

enlist the applications of various databases for academic and scientific
research.

4.2 DATABASES AND RETRIEVAL


In this section, we are going to learn about the various databases that you
studied in Unit 2 of this course. However, the main focus will be on how to
access the information available in these databases using existing online
tools. You are aware that there are different databases available for individual
biomolecules like proteins, nucleic acids, and small molecules. Let us explore
them one by one.

National Centre for Biotechnology Information (NCBI)

The National Centre for Biotechnology Information (NCBI) is a source of public
biomedical databases and of tools for analysing molecular and genomic data,
and it conducts research in computational biology. NCBI maintains over 40
integrated databases for the medical and scientific sectors, as well as the
general public. The GenBank nucleotide database is maintained by the NCBI,
which is part of the National Institutes of Health (NIH), a federal agency of
the US government. Access the NCBI website using the following web link
(https://www.ncbi.nlm.nih.gov/) and use its popular resources and tools to
retrieve information (Fig. 4.1).

Fig. 4.1: Screenshot showing NCBI database.

Browse the NCBI home page to become familiar with its popular resources. You
will learn how to use NCBI and GenBank to access nucleotide and protein
sequences in Exercises 5 and 7.

PUBMED

PubMed is a bibliographic database and one of the popular NCBI resources. It
is a free resource supporting the search and retrieval of biomedical and life
sciences literature with the aim of improving health, both globally and
personally. As of 2021, PubMed comprises more than 32 million citations and
abstracts of biomedical literature from MEDLINE, life science journals, and
online books. Citations do not include full-text journal articles but may
include links to full-text content from PubMed Central (PMC) and publisher
web sites. PubMed is developed and maintained by the National Center for
Biotechnology Information (NCBI), at the U.S. National Library of Medicine
(NLM), located at the National Institutes of Health (NIH).

Procedure

Step 1: Open PubMed in a browser using the following URL:
http://www.ncbi.nlm.nih.gov/pubmed/ (Fig. 4.2).

Fig. 4.2: Screenshot showing PubMed home page.

Step 2: Type your text query in the search panel (for example, corona virus).

Step 3: Select the appropriate abstract from the PubMed summary web page
(Fig. 4.3).

Fig. 4.3: Screenshot showing PubMed search results.

Step 4: Copy and save the relevant bibliographic search results for further use.
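
The keyword search in the steps above can also be scripted. The following is a minimal sketch that is not part of the original exercise: it assumes NCBI's E-utilities ESearch service and only builds the request URL for the example query. The function name build_pubmed_search_url is a hypothetical helper introduced for illustration.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities ESearch endpoint is used for scripted searches.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(query, max_results=20):
    """Return an ESearch URL that lists PubMed IDs (PMIDs) matching a text query.

    Hypothetical helper for illustration; it builds the URL but does not send it.
    """
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": query,           # same text you would type in the search panel
        "retmax": max_results,   # maximum number of PMIDs to return
        "retmode": "json",       # ask for a JSON response
    }
    return ESEARCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # Paste the printed URL into a browser to see the matching PMIDs.
    print(build_pubmed_search_url("corona virus", max_results=5))
```

Opening the printed URL in a browser should return a list of PMIDs, each of which can then be looked up on the PubMed website as in Step 3.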

GenBank

The GenBank nucleotide database is maintained by the NCBI. GenBank is
part of the International Nucleotide Sequence Database Collaboration, which
comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide
Archive (ENA), and GenBank at NCBI. These three organizations exchange
data on a daily basis. The GenBank database is intended to give the
scientific community access to the most up-to-date and complete DNA sequence
information (https://www.ncbi.nlm.nih.gov/genbank/) (Fig. 4.4).

Fig. 4.4: Screenshot showing GenBank.

So far, you have learned about the NCBI, PubMed and GenBank databases and how
to access citations/abstracts from PubMed. To become more familiar with
the procedure, repeat the exercise with different text searches, such as an
author name or keywords like antioxidants, curcumin, cholesterol, etc. In the
next subsection you will learn about the Protein Data Bank, which is widely
used for 3-D protein structure-related information. Further, you will learn
how to use GenBank to access nucleotide sequences in Exercises 5 and 7.

PDB

The Protein Data Bank (PDB) is a repository for the 3-D structural data of
large biological molecules, such as proteins and nucleic acids. The data,
typically obtained by X-ray crystallography or NMR spectroscopy and
submitted by biologists and biochemists from around the world, are freely
accessible on the Internet via the websites of its member organisations,
such as the Research Collaboratory for Structural Bioinformatics (RCSB). The
PDB database is intended to provide access to 3-D structural information. To
access the PDB database, follow the web link https://www.rcsb.org/ and
retrieve structural information from PDB (Fig. 4.5).

You can browse the PDB to understand how a structural database is organised.
You will learn how to use the PDB to access and download 3-D structures of
proteins and DNA in Exercise 6.


Fig. 4.5: Screenshot showing PDB homepage.

SCOP

Structural Classification of Proteins (SCOP) is a database maintained at the
MRC (Medical Research Council) Laboratory of Molecular Biology and the Centre
for Protein Engineering. It describes the structural and evolutionary
relationships between proteins and classifies them in a hierarchical fashion:

Family: proteins are clustered into families with clear evolutionary
relationships.

Super Family: proteins whose structural and functional characteristics
suggest a common evolutionary origin.

Fold: proteins have a common fold if they have the same major secondary
structures in the same arrangement.

Procedure:

Step 1: Open SCOP from the following URL: https://scop.mrc-lmb.cam.ac.uk/
(Fig. 4.6).

Fig. 4.6: Screenshot showing SCOP homepage.

Step 2: Type the protein name or another relevant keyword in the search text
box (Fig. 4.7).

Fig. 4.7: Screenshot showing Keyword in search engine of SCOP.

Step 3: On pressing the search button, the result (summary) page is
displayed. To learn more about a specific protein, click on it (Fig. 4.8).

Fig. 4.8: Screenshot showing SCOP search results.

Step 4: Choose the appropriate link to display the functional information
(Fig. 4.9).

Fig. 4.9: Appropriate links showing family and super family have been encircled.

UNIPROT

The Universal Protein Resource (UniProt) is a comprehensive resource for
protein sequence and annotation data. The UniProt databases comprise the
UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef),
and the UniProt Archive (UniParc). The UniProt consortium and its host
institutions, the European Bioinformatics Institute (EBI), the Swiss
Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR),
are committed to the long-term preservation of the UniProt databases.

Procedure:

Step 1: Open the UniProt website from the following URL:
https://www.uniprot.org/ (Fig. 4.10).

Fig. 4.10: Screenshot showing UniProt homepage.

Step 2: Type the protein name or another relevant keyword in the search text
box (Fig. 4.11).

Fig. 4.11: Screenshot showing UniProt search column.

Step 3: On pressing the search button, the result (summary) page is
displayed (Fig. 4.12).

Fig. 4.12: Screenshot showing search results on UniProt.

Step 4: Choose the first sequence by double-clicking its accession number,
then go to the display button and select FASTA format to retrieve the
sequence (Fig. 4.13).

Fig. 4.13: Screenshot showing display options for searched protein.

Step 5: Copy and save the protein sequence for further analysis (Fig. 4.14).

Fig. 4.14: Screenshot showing FASTA sequence of the protein.
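
Once a sequence has been saved in FASTA format, it is usually read by a program in the next analysis step. The sketch below is an illustration rather than part of the exercise: it shows one simple way to split FASTA text into a header (the line starting with ">") and the sequence that follows it. The function name parse_fasta and the demo record are hypothetical; the demo record is invented and is not a real UniProt entry.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples.

    Hypothetical helper for illustration. A header line starts with ">";
    all following lines up to the next header are joined into one sequence.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Store the last record in the file.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented demo record (not a real database entry), wrapped over two lines.
DEMO_FASTA = ">sp|X00000|DEMO_HUMAN Hypothetical demo protein\nMKTAYIAK\nQRQISFVK\n"

if __name__ == "__main__":
    for header, sequence in parse_fasta(DEMO_FASTA):
        print(header, len(sequence))
```

The header line typically carries the accession number and protein name, which are the details asked for in the lab exercises.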

4.3 SUMMARY
Databases are systematic collections of related data. Generally, biological
databases can be classified into sequence databases, structural databases,
genome databases, proteome databases, specialized databases, etc.

Biological databases are created and maintained so that users can gain
knowledge from them and use them appropriately for relevant purposes.

NCBI is a source of public biomedical databases. It maintains over 40
integrated databases for the medical and scientific sectors, as well as the
general public.

The GenBank nucleotide database is maintained by the NCBI and holds
up-to-date and complete DNA sequence information.

PubMed is a bibliographic database and one of the popular NCBI resources,
comprising more than 32 million citations and abstracts of biomedical
literature.

PDB is a repository for the 3-D structural data of large biological
molecules, such as proteins and nucleic acids.

SCOP describes structural and evolutionary relationships between proteins
and provides a hierarchical classification into families, superfamilies and
folds.

UniProt is a comprehensive resource for protein sequence and annotation data.

4.4 LAB EXERCISES


1. Retrieve the abstract with PMID: 32643536 from the PubMed bibliographic
database and write the title of the abstract and the authors' names.

2. Access the nucleotide sequence FJ436056 from the GenBank database in
GenBank format and FASTA format, and write the title, molecule type and
number of base pairs.

3. Download the 3-D protein structure with PDB ID 7JMO from the PDB (RCSB)
database and view the structure in a viewing tool.

4. Open the SCOP database, run any keyword or text search, and write the
functional aspect, name of the protein, family, class and domain.

5. Access a globulin protein sequence from the UniProt database in FASTA
format and write the title, UniProtKB ID and organism name.

Exercise 5
RETRIEVAL OF PROTEIN AND GENE
SEQUENCES FROM NCBI

Structure
5.1 Introduction
    Expected Learning Outcomes
5.2 Procedure
5.3 Summary
5.4 Lab Exercises

5.1 INTRODUCTION
In the previous exercise, you learned about different databases. You studied
the NCBI database theoretically in Unit 2 and learned about NCBI resources
such as GenBank in Exercise 4. In this exercise, you will retrieve protein
and gene sequences from the GenPept and GenBank resources of NCBI; these
sequences will be used in various sequence analysis techniques.

Protein sequences are the fundamental determinants of biological structure
and function. The NCBI protein database is a collection of protein sequences
from different sources like GenPept, including translations from annotated
coding regions in GenBank, Reference Sequences (RefSeq) and Third Party
Annotation (TPA), as well as records from Swiss-Prot, Protein Information
Resource (PIR), Protein Research Foundation (PRF) and Protein Data Bank
(PDB). The nucleotide database is a collection of gene sequences from
different sources, which include GenBank, RefSeq, TPA, and PDB. Genome, gene,
and transcript sequence data provide the foundation for biomedical research
and discovery. The Gene database can be searched by entering a term,
preferably the gene name or the disease name, in the query box, which
displays the list of genes associated with the search. Users can also search
records by Gene ID, which is a unique identifier issued by NCBI.

Expected Learning Outcomes


After performing this exercise you shall be able to:

use the NCBI GenPept database to retrieve protein sequences;

explore and retrieve gene information from the NCBI Gene database; and

explain the importance of NCBI in sequence retrieval.

5.2 PROCEDURE
1. Access the home page of NCBI from the following web link:
https://www.ncbi.nlm.nih.gov/ (Fig. 5.1).

Fig. 5.1: Screenshot showing NCBI home page.

2. Click on the dropdown menu and select the required database, for example
Gene (Fig. 5.2).

Fig. 5.2: Screenshot showing dropdown menu on NCBI.

3. Type the relevant text or keyword in the search box (for example, gene
name, species name, etc.) (Fig. 5.3).

Fig. 5.3: Screenshot showing search box on NCBI.

4. On pressing the search button, the result (summary) page is displayed
(Fig. 5.4).

Fig. 5.4: Screenshot showing summary page (results) on NCBI.

5. Choose the desired gene sequence by double-clicking its name or ID, or by
ticking the checkbox next to it. Then go to the display button, scroll down
and select the required file format (GenBank or FASTA) to retrieve the
sequence (Fig. 5.5).

Fig. 5.5: Screenshot showing how to obtain FASTA or GenBank sequence.

6. Copy and save the required gene sequence for further analysis (Fig.
5.6).

Fig. 5.6: Screenshot showing FASTA sequence.
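
The choice between GenBank and FASTA format in step 5 also exists when sequences are requested by script. The sketch below is an illustration added here, not part of the original procedure: it assumes NCBI's E-utilities EFetch service and only builds the request URL. The function name build_efetch_url is a hypothetical helper, and the accession FJ436056 is the one used in the Exercise 4 lab exercises.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities EFetch endpoint is used for scripted retrieval.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Return types assumed per database: "gb" (GenBank flat file) for nucleotide
# records, "gp" (GenPept flat file) for protein records, "fasta" for both.
ALLOWED_FORMATS = {
    "nuccore": ("fasta", "gb"),
    "protein": ("fasta", "gp"),
}


def build_efetch_url(accession, database="nuccore", record_format="fasta"):
    """Return an EFetch URL for one sequence record.

    Hypothetical helper for illustration; it builds the URL but does not send it.
    """
    if database not in ALLOWED_FORMATS:
        raise ValueError("database must be 'nuccore' or 'protein'")
    if record_format not in ALLOWED_FORMATS[database]:
        raise ValueError("format not available for this database")
    params = {
        "db": database,            # nucleotide or protein database
        "id": accession,           # accession number of the record
        "rettype": record_format,  # FASTA or flat-file format
        "retmode": "text",         # plain text output
    }
    return EFETCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_efetch_url("FJ436056", record_format="fasta"))
    print(build_efetch_url("FJ436056", record_format="gb"))
```

Opening either printed URL in a browser should show the same record in FASTA or GenBank format, which can then be copied and saved as in step 6.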

5.3 SUMMARY
NCBI hosts a systematic collection of related biological data in sequence
databases, genome databases, specialized databases, etc.

NCBI is a source of public biomedical databases. The GenBank nucleotide
database, maintained by the NCBI, provides complete DNA sequence
information. The GenPept protein sequence database is also maintained by
NCBI.

Both gene and protein sequences can be retrieved from the NCBI database
for further analysis.

5.4 LAB EXERCISES


1. Access the nucleocapsid phosphoprotein gene sequence from the NCBI
database in GenBank format and write the title, ID and organism name.

2. Access the envelope protein sequence from the NCBI database in GenBank
format and write the title, ID and organism name.

3. Access a Covid-19 protein sequence from the NCBI database in FASTA
format and write the title, ID and organism name.

Exercise 6
ACCESSING PROTEIN
STRUCTURE FROM PDB

Structure
6.1 Introduction
    Expected Learning Outcomes
6.2 Procedure
6.3 Summary
6.4 Lab Exercises

6.1 INTRODUCTION
In the previous exercise, you learned how to retrieve protein and gene
sequences from the NCBI database. In this exercise, you shall explore the
steps involved in downloading protein structures from the PDB database. 3-D
structures of proteins from the Protein Data Bank (PDB) are used to
understand structural information such as the binding site of a protein or
DNA, the active site of enzymes, DNA-protein interactions and
protein-protein interactions, and this has applications in drug design. You
learned about the PDB database in Unit 2 and accessed the PDB website in
Exercise 4.

Knowing a protein's structure helps us to understand how the protein works,
and that information can be used to inhibit, regulate, or modify protein
function and to predict which molecules bind to that protein. It also helps
us to understand various biological interactions, assist drug discovery, or
even design novel proteins as therapeutic molecules. Similarly, in order to
understand the biological function of DNA, we need to study its molecular
structure. The PDB is a repository for the 3-D structural data of large
biological molecules, such as proteins and nucleic acids. In this exercise,
we shall learn how to download a protein structure from the PDB.

Expected Learning Outcomes


After performing this exercise you shall be able to:

describe how to access structures of proteins from PDB;

describe how to access structural data of a protein using the PDB database;

explain the PDB file format; and

download and save the 3-D structure of a protein in PDB format.

6.2 PROCEDURE
1. Open the PDB from the following URL: https://www.rcsb.org/ (Fig. 6.1).

Fig. 6.1: Screenshot showing PDB home page.

2. Enter a query (PDB ID, molecule name or author name) in the text box
provided and click on the search button (Fig. 6.2).

Fig. 6.2: Screenshot showing search column on PDB homepage.


3. From the summary page click on PDB ID 7LYJ and download the
macromolecular 3D structure in PDB format (Fig. 6.3 and 6.4).

Fig. 6.3: Screenshot showing target protein (7LYJ).


Fig. 6.4: Screenshot showing how to download PDB format.

4. Open the structure file in any one of the visualizing tools (PyMOL, RasMol
or Swiss-PdbViewer) to visualize it. You will learn about these tools in
Exercise 8 of this course.
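
The download in step 3 and the coordinates that a viewer reads from the file can be sketched in code. This example is an illustration added here, not part of the original procedure. It assumes the RCSB file-download address pattern shown in the code and the fixed-column layout of ATOM records in the PDB file format. The function names and the sample line are hypothetical; the sample values are invented for demonstration and are not taken from entry 7LYJ.

```python
def build_pdb_download_url(pdb_id):
    """Return the assumed RCSB download URL for an entry in PDB format.

    Hypothetical helper for illustration; it builds the URL but does not send it.
    """
    return "https://files.rcsb.org/download/" + pdb_id.upper() + ".pdb"


def parse_atom_line(line):
    """Extract atom name, residue, chain and X, Y, Z from one ATOM/HETATM record.

    PDB files use fixed columns, so fields are cut out by position
    rather than by splitting on spaces.
    """
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "atom": line[12:16].strip(),         # columns 13-16: atom name
        "residue": line[17:20].strip(),      # columns 18-20: residue name
        "chain": line[21],                   # column 22: chain identifier
        "residue_number": int(line[22:26]),  # columns 23-26: residue number
        "x": float(line[30:38]),             # columns 31-38: X coordinate
        "y": float(line[38:46]),             # columns 39-46: Y coordinate
        "z": float(line[46:54]),             # columns 47-54: Z coordinate
    }


# Invented sample record for demonstration only (not from a real entry).
SAMPLE_LINE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"

if __name__ == "__main__":
    print(build_pdb_download_url("7LYJ"))
    print(parse_atom_line(SAMPLE_LINE))
```

Each ATOM line in a downloaded file describes one atom, and visualizing tools place it in 3-D space using the X, Y and Z values read in this way.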

6.3 SUMMARY
PDB is the database from which 3-D structures of proteins can be accessed,
through the RCSB website.

In this exercise you have practised downloading a protein structure in
PDB format.

You have acquired the skills to access PDB pages and learned how to
search for the desired protein.

Structure files in PDB format can be visualised using visualising tools.

6.4 LAB EXERCISES


1. Access the 7LMF protein structure from the PDB database, download it in
PDB format, also save it in PDB flat file (text) format, and comment on a
few points.

2. Download S. cerevisiae CMG-Pol epsilon-DNA in PDB format; give its
PDB ID and source, and comment on a few points.

3. Download the crystal structure of yeast phenylalanine tRNA in PDB format;
give its PDB ID and source, and comment on a few points.
