UNIT 2
BIOLOGICAL DATABASES AND
DATA RETRIEVAL
Structure
2.1 Introduction 2.4 Small Molecular Databases
2.1 INTRODUCTION
In the previous unit, you have learned about the basics of computers and their
because primary databases are also curated to ensure that the data in
them is consistent and accurate.
library over the past decade or so, providing a wealth of (often daunting)
information on just about any gene or gene product that has been
investigated by the research community. The potential for mining this
information to make new discoveries is vast.
Synonyms
Primary database: Archival database
Secondary database: Curated database; knowledgebase
Source of data
Primary database: Direct submission of experimentally-derived data from researchers
Secondary database: Results of analysis, literature research and interpretation, often of
data in primary databases
Up to now, you have studied the types of biological databases based on
their sources. Now let us learn about nucleotide databases.
GenBank: It is an integral part of NCBI (National Center for Biotechnology
Information), one of the main biological database resources. NCBI has a tool
called Entrez, which helps to retrieve data from GenBank.
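Entrez retrieval can also be done from a program through the NCBI E-utilities web service. The following is a minimal sketch, not an official client: it only builds the address of an "efetch" request for one nucleotide record. The function name genbank_fetch_url and the accession used in the usage note are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "efetch" service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fetch_url(accession, rettype="fasta"):
    """Return the efetch URL for one nucleotide record.

    rettype "fasta" asks for FASTA text; "gb" asks for the GenBank flat file.
    The function name is hypothetical and used only in this sketch.
    """
    if rettype not in ("fasta", "gb"):
        raise ValueError("rettype must be 'fasta' or 'gb'")
    params = {
        "db": "nuccore",      # the nucleotide database behind GenBank records
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)
```

For example, genbank_fetch_url("NM_000518") gives an address that can be opened in a web browser to see the record as FASTA text; any valid accession number can be used in its place.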
Swiss-Prot: This database is owned by EMBL and maintained by SIB.
TrEMBL: It contains the maximum number of translated sequences.
SAQ 1
Fill in the blanks:
PDB is a part of the Worldwide Protein Data Bank which collects, organizes,
and disseminates data on biological macromolecular structures like proteins,
enzymes, and DNA/RNA.
2. CATH
3. PDBSUM
a. All alpha
b. All beta
c. Alpha or beta
e. Multi-domain folds
BBCS-185 Bioinformatics Skill Enhancement Course
2.2.5 CATH
The CATH database (http://www.cathdb.info/) is a free, publicly available
online resource that provides information on the evolutionary relationships of
protein domains. It was created in the mid-1990s by Professor Christine
Orengo and colleagues, and continues to be developed by the Orengo group
at University College London.
all alpha, all beta, a mixture of alpha and beta, or little secondary structure; at
the Architecture (A) level, information on the secondary structure arrangement
in three-dimensional space is used; at the Topology/fold (T) level, information
on how the secondary structure elements are connected and arranged is
used; assignments are made to the Homologous superfamily (H) level if
there is sufficient evidence that the domains are related by evolution, i.e. they
are homologous. To explore and browse the classification hierarchy, visit the
CATH hierarchy web page (Fig. 2.4).
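The four levels can be read directly from a CATH code, which is written as up to four numbers separated by dots (one number per level, in the order C, A, T, H). The following is a small sketch of how such a code could be split into its levels. The function name describe_cath_code is hypothetical, and only the four classes described in the text are included.

```python
# The four CATH classes (C level) described in the text.
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}

LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def describe_cath_code(code):
    """Split a dotted CATH code into (level name, node id) pairs.

    Hypothetical helper for illustration; it accepts one to four numbers.
    """
    parts = code.split(".")
    if not 1 <= len(parts) <= 4 or not all(p.isdigit() for p in parts):
        raise ValueError("expected one to four dot-separated numbers")
    if int(parts[0]) not in CATH_CLASSES:
        raise ValueError("unknown class number: " + parts[0])
    result = []
    for depth in range(len(parts)):
        # The node id at each level is the code truncated to that depth.
        result.append((LEVELS[depth], ".".join(parts[: depth + 1])))
    return result
```

For the code 3.40.50.300, the function returns the class node 3, the architecture node 3.40, the topology node 3.40.50 and the homologous superfamily node 3.40.50.300, and CATH_CLASSES[3] gives the class name "Alpha Beta".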
SAQ 2
Define the following terms:
i) PDB
ii) RCSB
iii) SCOP
2.4.1 PubChem
The PubChem database is a primary source of information on chemicals, drugs,
and their derivatives. It is one of the largest freely accessible chemical
information resources in the world. We can search for chemicals by molecular
formula, name, structure, and other identifiers. Further, one can find chemical
and physical properties, safety and toxicity information, biological activities,
literature citations, patents, and more. New chemicals/substances are added
regularly as and when new information becomes available from the literature or
from experimental results. It is very useful for finding chemicals available
from vendors as well as new chemicals. Many scientists screen molecules from
the PubChem database for the treatment of various diseases such as
atherosclerosis and cardiovascular diseases. Anyone can submit scientific data
to this database and become a data provider. This database is important and
useful for scientists, students, and the general public. Each month, the
database and its programmatic services provide data about compounds to several
million users worldwide.
(https://pubchemdocs.ncbi.nlm.nih.gov/)
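The programmatic services mentioned above include a web interface called PUG REST, in which a search is written as a web address. The following is a minimal sketch that builds such an address for a search by compound name. The function name compound_property_url and the default list of properties are assumptions chosen for illustration.

```python
from urllib.parse import quote

# Base address of the PubChem PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(name, properties=("MolecularFormula", "MolecularWeight")):
    """Return a PUG REST URL asking for properties of a compound by name.

    Hypothetical helper: it only builds the address and does not contact
    PubChem. The result is requested in JSON format.
    """
    if not name.strip():
        raise ValueError("compound name must not be empty")
    # Encode spaces and other special characters in the compound name.
    encoded_name = quote(name.strip(), safe="")
    return "{}/compound/name/{}/property/{}/JSON".format(
        PUG_REST, encoded_name, ",".join(properties)
    )
```

Opening the address returned by compound_property_url("aspirin") in a web browser should show the requested properties for the matching compound.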
SAQ 3 Do as directed
i) Write a short note on chemical databases used in drug design and drug
discovery.
ii) Define the term curated database. List a few chemical databases
developed using the curation method.
Now we are going to discuss basic software used for the visualization of
biomolecules. Various file formats are available for viewing these molecules
in three-dimensional space. This means that each atomic position must be
defined relative to the origin along the X, Y, and Z axes. The format most
commonly used for viewing molecules in basic software is the PDB (Protein Data
Bank) format. It has a standard format, as follows. While performing exercises
7 and 8, you will learn more about these tools and file formats.
Notice that each line or record begins with the record type ATOM. The atom
serial number is the next item in each record.
(Source:
https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html)
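Because the PDB format is column-based, each item of an ATOM record can be read from fixed character positions on the line. The following is a minimal sketch of this idea. The helper make_atom_line is hypothetical and exists only to build an example line with every field in its column; the atom values used in the usage note are illustrative. Fields after the coordinates (occupancy, temperature factor, element) are left out.

```python
def make_atom_line(serial, name, res_name, chain_id, res_seq, x, y, z):
    """Build a minimal ATOM record with each field in its fixed columns.

    Hypothetical helper for this example only.
    """
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain_id}"
        f"{res_seq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_atom_record(line):
    """Read the main fields of one ATOM record using column positions."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),         # columns 7-11: atom serial number
        "name": line[12:16].strip(),       # columns 13-16: atom name
        "res_name": line[17:20].strip(),   # columns 18-20: residue name
        "chain_id": line[21],              # column 22: chain identifier
        "res_seq": int(line[22:26]),       # columns 23-26: residue number
        "x": float(line[30:38]),           # columns 31-38: X coordinate
        "y": float(line[38:46]),           # columns 39-46: Y coordinate
        "z": float(line[46:54]),           # columns 47-54: Z coordinate
    }
```

For example, parse_atom_record(make_atom_line(1, "N", "THR", "A", 1, 17.047, 14.099, 3.625)) returns a dictionary holding the serial number, the atom and residue names, the chain, the residue number and the three coordinates.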
1. RasMol
Most protein structure database tools available today are well equipped with
graphical visualization tools. A commonly used tool for academic and research
purposes is RasMol. This is a molecular graphics program intended for the
visualization of proteins, nucleic acids and small molecules whose 3-D
structures are available. In order to display a molecule, RasMol requires an
atomic coordinate file that specifies the position of every atom in the
molecule through its 3-D Cartesian coordinates (Fig. 2.5). RasMol accepts this
coordinate file in a variety of formats, including the Protein Data Bank (PDB)
format. The visualization tool gives the user a choice of color schemes and
molecular representations: wireframe, cylinder (Dreiding) stick bonds,
alpha-carbon trace, space-filling (CPK) spheres, macromolecular ribbons (either
smooth-shaded solid ribbons or parallel strands), hydrogen bonding and dot
surface. Additional features such as text labeling for selected atoms,
different color schemes for different parts of the molecule, zoom, rotation,
etc. have made this one of the most popular visualization tools.
Website: http://www.openrasmol.org/
Fig. 2.5: RasMol software displaying the crystal structure of crambin (PDB ID: 1CRN).
2. Chime
Chime and Protein Explorer are derivatives of RasMol that allow visualization
of structures inside web browsers, whereas RasMol runs independently outside a
web browser. Hence, Chime is used online, when connected to the Internet.
Another feature of Chime is that only certain molecules that are allowed by
the company can be viewed, unlike RasMol, where any protein molecule with
atomic coordinates can be viewed.
Unit 2 Biological Databases and Data Retrieval
Now-
also used widely. This can be downloaded on personal computers to view
molecules like proteins, DNA and RNA.
3. MolMol
MolMol stands for Molecule analysis and Molecule display. This is also free
software with a lot of features that are not found in RasMol and Chime. MolMol
is a molecular graphics program for display, analysis and manipulation of
three-dimensional structures of biological macromolecules, with special
emphasis on nuclear magnetic resonance (NMR) solution structures of
proteins and nucleic acids. MolMol can be reached at:
www.mol.biol.ethz.ch/wuthrich/software/molmol
4. PyMOL
5. SPDBV
2.6 SUMMARY
Biological databases are used to store experimental data in various formats
that can be accessed through the internet.
SCOP has five sub-classes: 1. All alpha, 2. All beta, 3. Alpha or beta,
4. Alpha and beta, 5. Multi-domain fold.
2.8 ANSWERS
Self Assessment Questions
1. i) Nucleotide Database at NCBI
ii) DNA
iii) European Bioinformatics Institute
iv) National Center for Biotechnology Information
v) PIR
2. i) Protein Data Bank
ii) Research Collaboratory for Structural Bioinformatics
iii) Structural Classification of Proteins
3. i) Refer to Sections 2.4.1 to 2.4.3.
ii) Refer to Metabolic Pathway Databases under Section 2.3.
Terminal Questions
1. There are various principles for the biological databases or
characteristics.
Synonyms
Primary database: Archival database
Secondary database: Curated database; knowledgebase
Source of data
Primary database: Direct submission of experimentally-derived data from researchers
Secondary database: Results of analysis, literature research and interpretation, often of
data in primary databases
Examples
Primary database: ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress and GEO
(functional genomics data); Protein Data Bank (PDB; coordinates of three-dimensional
macromolecular structures)
Secondary database: InterPro (protein families, motifs and domains); UniProt
Knowledgebase (sequence and functional information on proteins); Ensembl (variation,
function, regulation and more layered onto whole genome sequences)
Exercise 4
DATABASES: NCBI, PDB, SCOP, PUBMED, GENBANK, UNIPROT
Structure
4.1 Introduction 4.2 Databases and Retrieval
4.1 INTRODUCTION
In this exercise, you will learn about biological databases that are widely used
in the field of bioinformatics.
Databases are systematic collections of theoretically related data. Software
packages are used for defining and managing databases. Owing to the
exponential growth in biological data, publicly accessible databases now hold
a large amount of information about biomolecules. Data is no longer published
in the conventional way but is instead submitted directly to databases.
Generally, biological databases can be classified into sequence databases,
structural databases, genome databases, proteome databases, specialized
databases, etc.
You can access NCBI to learn about its popular resources. Further, you will
learn how to use NCBI and GenBank to access nucleotide and protein sequences
in exercises 5 and 7.
PUBMED
than 32 million citations (abstracts) for biomedical literature from MEDLINE, life
science journals, and online books. Citations do not include full-text journal
articles but may include links to full-text content available from other sources,
such as PubMed Central (PMC) and publisher websites. PubMed is developed
and maintained by the National Center for Biotechnology Information (NCBI),
at the U.S. National Library of Medicine (NLM), located at the National
Institutes of Health (NIH).
Procedure
Step 2: Type your text query in the search panel (for example, corona virus).
Step 3: Select the appropriate abstract from the PubMed summary web page
(Fig. 4.3).
Exercise 4 Data Bases: NCBI, PDB, SCOP, PubMed, GenBank, UniProt
Step 4: Copy and save the relevant bibliography for further use.
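The same kind of search can be written as a web address for the NCBI E-utilities "esearch" service, which returns the identifiers of matching PubMed records. The following is a minimal sketch under stated assumptions: the function names build_query and pubmed_search_url are hypothetical, and the code only builds the address without contacting PubMed.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(keywords, author=None):
    """Join a list of keywords (and an optional author name) with AND.

    The [Author] tag restricts that term to the author field in PubMed.
    """
    terms = list(keywords)
    if author:
        terms.append(author + "[Author]")
    if not terms:
        raise ValueError("give at least one keyword or an author")
    return " AND ".join(terms)


def pubmed_search_url(query, max_results=20):
    """Return the esearch URL for a PubMed text query."""
    params = {"db": "pubmed", "term": query, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)
```

For example, pubmed_search_url(build_query(["curcumin", "cholesterol"])) gives an address for a search on both keywords together. Changing the keywords or adding an author name repeats the search in the same way as typing a new query in the search panel.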
GenBank
So far, you have learned about the NCBI, PubMed and GenBank databases and how
to access citations/abstracts from PubMed. To become more familiar with the
procedure, repeat the exercise with different text searches, such as author
names and keywords like antioxidants, curcumin, cholesterol, etc. In the next
subsection you will learn about the Protein Data Bank, which is widely used
for 3-D protein structure-related information. Further, you will learn how to
use GenBank to access nucleotide sequences in exercises 5 and 7.
PDB
The Protein Data Bank (PDB) is a repository for the 3-D structural data of
large biological molecules, such as proteins and nucleic acids. The data,
typically obtained by X-ray crystallography or NMR spectroscopy and
submitted by biologists and biochemists from around the world, are freely
accessible on the Internet via the websites of its member organisations,
such as the Research Collaboratory for Structural Bioinformatics (RCSB). The
PDB database is intended to provide access to 3-D structural information. To
access the PDB database, follow the web link https://www.rcsb.org/ and
retrieve structural information from PDB (Fig. 4.5).
You can access the PDB to understand structural databases. Further, you will
learn how to use the PDB to access and download 3-D structures of proteins
and DNA in exercise 6.
SCOP
Procedure:
Step 2: Type the protein name or relevant text in the text box, or enter a
keyword (Fig. 4.7).
Step 4: Choose the appropriate link to display the functional information (Fig. 4.9).
Fig. 4.9: Appropriate links showing family and super family have been encircled.
UNIPROT
Procedure:
Step 2: Type the protein name or relevant text in the text box, or enter a
keyword (Fig. 4.11).
Step 4: Choose the first sequence by double-clicking the accession number, then
go to the display button and select FASTA format to retrieve the sequence (Fig. 4.13).
Step 5: Copy and save the protein sequence for further analysis (Fig. 4.14).
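A sequence saved in FASTA format from UniProt begins with a header line that carries the database code, the accession number and the entry name, separated by vertical bars. The following is a small sketch of how this header could be read in a program, together with a helper that builds the address of the FASTA file for a given accession. Both function names are hypothetical, and the header shown in the usage note is given only as an illustration.

```python
def uniprot_fasta_url(accession):
    """Return the UniProt web address of the FASTA file for one accession."""
    return "https://rest.uniprot.org/uniprotkb/{}.fasta".format(accession)


def parse_uniprot_header(header):
    """Split a UniProt FASTA header into its parts.

    Expected shape: '>db|ACCESSION|ENTRY_NAME description', where db is
    'sp' for Swiss-Prot entries and 'tr' for TrEMBL entries.
    """
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    identifier, _, description = header[1:].partition(" ")
    fields = identifier.split("|")
    if len(fields) != 3:
        raise ValueError("expected db|accession|entry_name")
    database, accession, entry_name = fields
    return {
        "database": database,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }
```

For a header such as ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2", the function returns the database code sp, the accession P69905 and the entry name HBA_HUMAN, with the rest of the line as the description.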
4.3 SUMMARY
Databases are systematic collections of theoretically related data.
Generally, biological databases can be classified into sequence
databases, structural databases, genome databases, proteome databases,
and specialized databases, etc.
4. Open the SCOP database, run a keyword or text search, and write down the
functional aspect, name of the protein, family, class and domain.
Exercise 5 Retrieval of Protein and Gene Sequences from NCBI
Exercise 5
RETRIEVAL OF GENE
SEQUENCES FROM NCBI
Structure
5.1 Introduction 5.2 Procedure
5.1 INTRODUCTION
In the previous exercise, you learned about different databases. In this
exercise, you will study the retrieval of protein and gene sequences
from NCBI.
You studied the NCBI database theoretically in Unit 2 and learned about
different NCBI resources such as GenBank and GenPept in Exercise 4. In this
section, we shall access sequences from GenBank and GenPept of NCBI, which
will be used in various sequence analysis techniques.
explore and retrieve gene information from NCBI Gene database; and
5.2 PROCEDURE
Step 1: Access the home page of NCBI from the following web link
https://www.ncbi.nlm.nih.gov/ (Fig. 5.1).
3. Type the relevant text in the search box or enter a keyword (Example:
Gene name, Species name, etc.) (Fig. 5.3).
Scroll down and click on the required file format (FASTA or GenBank format).
6. Copy and save the required gene sequence for further analysis (Fig.
5.6).
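Once a sequence has been saved in FASTA format, it can be read into a program for further analysis. The following is a minimal sketch, assuming the saved text follows the usual FASTA layout of a header line beginning with ">" followed by one or more lines of sequence. The function names read_fasta and gc_content are hypothetical.

```python
def read_fasta(text):
    """Return a dictionary mapping each FASTA header to its sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def gc_content(sequence):
    """Return the fraction of G and C bases in a nucleotide sequence."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)
```

The sequence lines of each record are joined into one string, so a gene sequence copied from NCBI across many lines becomes a single sequence that can be passed to gc_content or to other analysis steps.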
5.3 SUMMARY
NCBI is a systematic collection of theoretically related biological data
such as sequence databases, genome databases, and specialized
databases, etc.
Both gene and protein sequences can be retrieved from NCBI database
for further analysis.
Exercise 6
ACCESSING PROTEIN
STRUCTURE FROM PDB
Structure
6.1 Introduction 6.3 Summary
6.2 Procedure
6.1 INTRODUCTION
In the previous exercise, you learned how to retrieve protein and gene
sequences from the NCBI database. Now, in this exercise, you shall be
exploring the steps involved in downloading protein structures from the PDB
database. 3-D structures of proteins from the Protein Data Bank (PDB) are used
to understand structural information such as the binding site of a protein or
DNA, the active site of enzymes, DNA-protein interactions, and
protein-protein interactions, and this has applications in drug design. You
learned about the PDB database in Unit 2 and accessed the PDB website in
Exercise 4.
Protein structure is useful for understanding how a protein works, and that
information can be used to inhibit, regulate, or modify protein function, and
to predict which molecules bind to that protein. It also helps us to
understand various biological interactions, assist drug discovery, or even
design novel proteins as therapeutic molecules. In order to understand the
biological function of DNA, we need to study its molecular structure. The PDB
is a repository for the 3-D structural data of large biological molecules,
such as proteins and nucleic acids. You have learned about PDB in Exercise 4;
now, in this exercise, we shall learn how to download a protein structure
from PDB.
6.2 PROCEDURE
Step 1:
1. Open the PDB from the following URL: https://www.rcsb.org/ (Fig. 6.1).
2. Enter the query in the textbox provided by entering PDB ID, molecule
name or author name. Click on the search button (Fig. 6.2).
Exercise 6 Download Protein Structure from PDB
3. From the summary page click on PDB ID 7LYJ and download the
macromolecular 3D structure in PDB format (Fig. 6.3 and 6.4).
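The download step can also be carried out from a program, because the RCSB file server provides each entry at an address built from its PDB ID. The following is a minimal sketch, assuming the classic four-character form of PDB IDs. The function names pdb_download_url and download_structure are hypothetical, and download_structure needs an Internet connection when it is called.

```python
import re
from urllib.request import urlopen


def pdb_download_url(pdb_id, file_format="pdb"):
    """Return the RCSB file-server URL for one entry.

    Assumes a classic four-character PDB ID (a digit followed by three
    letters or digits); file_format is "pdb" or "cif".
    """
    pdb_id = pdb_id.strip().upper()
    if not re.fullmatch(r"[1-9][A-Z0-9]{3}", pdb_id):
        raise ValueError("expected a four-character PDB ID such as 7LYJ")
    if file_format not in ("pdb", "cif"):
        raise ValueError("file_format must be 'pdb' or 'cif'")
    return "https://files.rcsb.org/download/{}.{}".format(pdb_id, file_format)


def download_structure(pdb_id, out_path):
    """Save the PDB-format file of one entry to out_path."""
    with urlopen(pdb_download_url(pdb_id)) as response:
        data = response.read()
    with open(out_path, "wb") as handle:
        handle.write(data)
```

For example, download_structure("7LYJ", "7LYJ.pdb") would save the same file that is obtained by clicking the download option on the summary page, and the saved file can then be opened in a visualization tool.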
6.3 SUMMARY
PDB is the database, accessed through the RCSB website, from which we can
access protein 3-D structures.
You have acquired the skills to access PDB pages and learned how to
search for the desired protein.