02-A-Introduction to Biological Databases

The document provides an overview of biological databases, discussing their types, design, and architecture, as well as the challenges of data storage and retrieval in genomic research. It categorizes databases into primary, secondary, and specialized types, highlighting examples such as GenBank and SWISS-PROT. Additionally, it addresses the interconnection between databases and common pitfalls, including redundancy and erroneous annotations.

Uploaded by

wasilicharles

Introduction to biological databases
Wilson Nandolo
+265993375505
Overview
• Genomic research generates
enormous amounts of raw sequence
data.
• The primary challenge with these
data is storage and handling.
Overview
• We will discuss the following aspects of
biological databases:
⚫ Type
⚫ Design
⚫ Architecture
Definition of a database
• A database is a computerized archive used
to store and organize data in such a way
that information can be retrieved easily via
a variety of search criteria.
• The chief objective of the development of
a database is to organize data in a set of
structured records to enable easy retrieval
of information.
Definition of a database
• Each record, also called an entry, should contain a
number of fields that hold the actual data items
• For example: fields for
⚫ Names
⚫ Phone numbers
⚫ Addresses
⚫ Dates
Definition of a database
• To retrieve a particular record from the database,
a user can specify a particular piece of
information, called value, to be found in a
particular field and expect the computer to
retrieve the whole data record.
• This process is called making a query.
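The record, field, value, and query vocabulary above can be sketched in a few lines of Python. This is only an illustration: the records, field names, and the `query` helper are hypothetical and do not come from any particular database system.

```python
# Hypothetical records: each dict is one entry, each key is a field.
records = [
    {"name": "Alice Banda", "phone": "555-0101", "address": "12 Lake Rd", "date": "2021-03-04"},
    {"name": "Brian Phiri", "phone": "555-0102", "address": "7 Hill St", "date": "2022-11-19"},
    {"name": "Chikondi Moyo", "phone": "555-0103", "address": "3 River Ave", "date": "2021-03-04"},
]

def query(entries, field, value):
    """Return every whole record whose `field` holds exactly `value`."""
    return [entry for entry in entries if entry.get(field) == value]

# Making a query: specify a value in one field, get the whole record back.
matches = query(records, "name", "Alice Banda")
same_day = query(records, "date", "2021-03-04")
```

Note that the query returns complete records, not just the field that was searched.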
Databases
• Although data retrieval is the main purpose of all
databases, biological databases often have a higher-level
requirement, known as knowledge discovery.
• Knowledge discovery refers to the identification of
connections between pieces of information that were
not known when the information was first entered.
• For example, databases containing raw sequence
information can perform extra computational tasks to
identify sequence homology or conserved motifs.
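As a rough illustration of such an extra computational task, the sketch below scans stored protein sequences for a short conserved motif. It assumes the PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P} written as a regular expression; the sequences and their identifiers are invented, and real databases use far more elaborate methods.

```python
import re

# N-{P}-[ST]-{P} as a regex; the lookahead also reports overlapping hits.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")

# Hypothetical protein sequences; the identifiers are made up.
sequences = {
    "seq1": "MKNGTAPLNPSW",
    "seq2": "MALWTRLLPL",
    "seq3": "AANISQNCTL",
}

def find_motif(sequence):
    """Return the 0-based start position of every motif occurrence."""
    return [match.start() for match in MOTIF.finditer(sequence)]

# Connections between entries that were not recorded at submission time.
hits = {name: find_motif(seq) for name, seq in sequences.items()}
entries_sharing_motif = [name for name, positions in hits.items() if positions]
```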
Databases
• These features facilitate the discovery of new biological
insights from raw data.
• Originally, databases all used a flat file format, which is a
long text file that contains many entries separated by a
delimiter, a special character such as a vertical bar (|).
• Within each entry may be many fields separated by tabs
or commas.
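A minimal sketch of reading such a flat file is shown below. It assumes a vertical bar between entries and commas between fields; the field names and the sample data are hypothetical.

```python
# Hypothetical flat file: entries separated by "|", fields by ",".
flat_file = "S001,ATGGCC,human|S002,ATGAAA,mouse|S003,TTGACA,yeast"

# Assumed field order for this example only.
FIELD_NAMES = ("id", "sequence", "organism")

def parse_flat_file(text, entry_sep="|", field_sep=","):
    """Split the text into entries, then each entry into named fields."""
    entries = []
    for raw_entry in text.split(entry_sep):
        raw_entry = raw_entry.strip()
        if not raw_entry:
            continue  # skip empty chunks, e.g. after a trailing delimiter
        values = raw_entry.split(field_sep)
        entries.append(dict(zip(FIELD_NAMES, values)))
    return entries

entries = parse_flat_file(flat_file)
```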
Types of databases
• Relational
• Object-oriented
Relational databases
• A set of tables is used to organize data originally in a flat
file
• The tables are related to each other via some unique
attribute
• To execute a query, the system selects linked data items
from different tables and combines the information into
one report.
• Commonly used language: structured query language
(SQL)
Relational databases
• For example, suppose one asks: which courses are
students from Texas taking?
• The database first finds the “State” field in Table A
and looks up Texas.
• This returns students 1 and 5.
Relational databases
• The student numbers are co-listed in Table B, in which
students 1 and 5 correspond to Biol 689 and Math 172,
respectively.
• The course names listed by course numbers are found in
Table C.
Relational databases

• By going to Table C, the exact course names corresponding to
the course numbers can be retrieved.
• A final report is then given showing that the Texans are
taking the courses Bioinformatics and Calculus.
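The three-table lookup can be sketched with Python's built-in sqlite3 module. Only students 1 and 5, the state Texas, and the two courses come from the example; the table and column names, the student names, and all other rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students   (student_no INTEGER PRIMARY KEY, name TEXT, state TEXT);  -- Table A
CREATE TABLE enrollment (student_no INTEGER, course_no TEXT);                     -- Table B
CREATE TABLE courses    (course_no TEXT PRIMARY KEY, course_name TEXT);           -- Table C
""")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", [
    (1, "Ann", "Texas"), (2, "Ben", "Ohio"), (3, "Cara", "Utah"),
    (4, "Dan", "Iowa"), (5, "Eve", "Texas"),
])
conn.executemany("INSERT INTO enrollment VALUES (?, ?)", [
    (1, "Biol 689"), (2, "Chem 101"), (5, "Math 172"),
])
conn.executemany("INSERT INTO courses VALUES (?, ?)", [
    ("Biol 689", "Bioinformatics"), ("Math 172", "Calculus"),
    ("Chem 101", "General Chemistry"),
])

# One SQL query links the three tables through their shared attributes.
rows = conn.execute("""
    SELECT s.student_no, c.course_name
    FROM students AS s
    JOIN enrollment AS e ON e.student_no = s.student_no
    JOIN courses AS c ON c.course_no = e.course_no
    WHERE s.state = ?
    ORDER BY s.student_no
""", ("Texas",)).fetchall()
conn.close()
```

The student number links Table A to Table B, and the course number links Table B to Table C, which is what "related to each other via some unique attribute" means in practice.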
Relational databases-advantage
• Executing the same query through the flat file
requires the computer to:
⚫ read through the entire text file word by word
⚫ store the information in a temporary memory space
⚫ mark up the data records containing the word “Texas”.
• This can easily be done for a small database.
• To perform queries in a large database using flat
files obviously becomes computationally
intensive.
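The cost difference can be sketched as follows. The flat-file search has to read every line on every query, whereas a lookup structure built once answers later queries directly. The dictionary below is only a stand-in for the indexing a real database system performs, and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical tab-separated flat file: student number, name, state.
flat_lines = [
    "1\tAnn\tTexas",
    "2\tBen\tOhio",
    "3\tCara\tUtah",
    "4\tDan\tIowa",
    "5\tEve\tTexas",
]

def scan_flat_file(lines, word):
    """Read every line and keep those that contain `word` as a field."""
    return [line for line in lines if word in line.split("\t")]

def build_state_index(lines):
    """Build a state -> student numbers lookup in a single pass."""
    index = defaultdict(list)
    for line in lines:
        student_no, _name, state = line.split("\t")
        index[state].append(int(student_no))
    return index

texans_by_scan = scan_flat_file(flat_lines, "Texas")   # touches all lines
state_index = build_state_index(flat_lines)            # built once
texans_by_index = state_index["Texas"]                 # direct lookup
```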
Relational databases-pitfalls
• A limitation of relational databases is
that the tables do not readily describe
complex hierarchical relationships
between data items.
• To overcome this problem, object-oriented
databases have been developed.
Object-oriented databases
• An object can be considered as a unit that
combines data and mathematical routines
that act on the data.
• The database is structured such that the
objects are linked by a set of pointers
defining predetermined relationships
between the objects.
Object-oriented databases
• Searching the database involves navigating
through the objects with the aid of the
pointers linking different objects.
• Programming languages like C++ are used to
create object-oriented databases.
Object-oriented databases
• Using the previous example, three different
objects can be designed:
⚫ student object
⚫ course object
⚫ state object
• Their interrelations are indicated by lines
with arrows.
Object-oriented databases
• To answer the same question (which
courses are students from Texas
taking?), one simply needs to start
from Texas in the state object, which
has pointers that lead to students
1 and 5 in the student object.
Object-oriented databases
• Further pointers in
the student object
point to the course
each of the two
students is taking.
• Therefore, a simple
navigation through
the linked objects
provides a final
report.
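The pointer navigation described above can be sketched with plain Python objects, where object references play the role of pointers. The class names and attributes are assumptions for illustration; this is not how any specific object-oriented database is implemented.

```python
from dataclasses import dataclass, field

@dataclass
class Course:
    number: str
    name: str

@dataclass
class Student:
    student_no: int
    courses: list = field(default_factory=list)   # pointers to Course objects

@dataclass
class State:
    name: str
    students: list = field(default_factory=list)  # pointers to Student objects

bioinformatics = Course("Biol 689", "Bioinformatics")
calculus = Course("Math 172", "Calculus")
student1 = Student(1, [bioinformatics])
student5 = Student(5, [calculus])
texas = State("Texas", [student1, student5])

def courses_for_state(state):
    """Follow state -> student -> course pointers and collect a report."""
    return [(student.student_no, course.name)
            for student in state.students
            for course in student.courses]

report = courses_for_state(texas)
```

No table lookup or join is needed here: the answer is reached simply by following references from one object to the next.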
Object-oriented databases
• The object-oriented database system is
more flexible; data can be structured based
on hierarchical relationships.
• By doing so, programming tasks can be
simplified for data that are known to have
complex relationships, such as multimedia
data.
Object-oriented databases
• However, this type of database system lacks the
rigorous mathematical foundation of the
relational databases.
• There is also a risk that some of the relationships
between objects may be misrepresented.
• Some current databases have therefore
incorporated features of both types of database
programming, creating the object–relational
database management system.
Biological databases
• Based on their contents, biological
databases can be roughly divided into three
categories:
⚫ primary databases
⚫ secondary databases
⚫ specialized databases
Primary databases
• Contain original biological data.
• Archive raw sequence or structural data
submitted by the scientific community.
Primary databases
• The main primary databases are:
⚫ GenBank (within NCBI, https://www.ncbi.nlm.nih.gov/refseq/)
⚫ DNA Data Bank of Japan (DDBJ, https://www.ddbj.nig.ac.jp/index-e.html)
⚫ European Molecular Biology Laboratory database (EMBL, https://www.ebi.ac.uk/ena/browser/home)
Primary databases
• The three constitute the International
Nucleotide Sequence Database Collaboration.
• Sequence submission to GenBank, EMBL, or
DDBJ is a precondition for publication in most
scientific journals, which ensures that the
fundamental molecular data are made freely
available.
Primary databases
• For the three-dimensional structures of biological
macromolecules, there is only one centralized
database, the PDB (https://www.rcsb.org/).
• This database archives atomic coordinates of
macromolecules (both proteins and nucleic acids)
determined by X-ray crystallography and NMR.
Primary databases
• It uses a flat file format to represent protein
name, authors, experimental details,
secondary structure, cofactors, and atomic
coordinates.
• The web interface of PDB also provides
viewing tools for simple image manipulation.
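As an illustration of how atomic coordinates sit in a flat file, the sketch below reads one ATOM record, assuming the fixed-column layout of the legacy PDB text format. The sample record is assembled column by column so the layout stays visible; its values are hypothetical and do not come from a real PDB entry.

```python
# One ATOM record assembled column by column (1-based column numbers).
sample_line = (
    "ATOM  "      # 1-6   record type
    "    1"       # 7-11  atom serial number
    " "           # 12    blank
    " N  "        # 13-16 atom name
    " "           # 17    alternate location indicator
    "MET"         # 18-20 residue name
    " "           # 21    blank
    "A"           # 22    chain identifier
    "   1"        # 23-26 residue sequence number
    "    "        # 27-30 insertion code and blanks
    "  27.340"    # 31-38 x coordinate
    "  24.430"    # 39-46 y coordinate
    "   2.614"    # 47-54 z coordinate
)

def parse_atom_record(line):
    """Slice the fixed-width columns of an ATOM record into fields."""
    return {
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

atom = parse_atom_record(sample_line)
```

Unlike the comma- or tab-separated case, the fields here are identified by their column positions, so the spacing within each line matters.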
Secondary databases
• Contain computationally processed or
manually curated information, based on
original information from primary
databases.
• Translated protein sequence databases
⚫ SWISS-PROT (https://www.expasy.org/resources/uniprotkb-swiss-prot)
⚫ Protein Information Resource (PIR, https://proteininformationresource.org/)
Secondary databases
• Some are simple archives of translated
sequence data from identified open
reading frames in DNA.
⚫ TrEMBL (UniProt)
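What "translated sequence data from open reading frames" means can be sketched as below. The codon table is a small subset of the standard genetic code, just enough for the example, and the DNA string is invented; this is not a description of the TrEMBL pipeline.

```python
# Subset of the standard genetic code, sufficient for this example only.
CODON_TABLE = {
    "ATG": "M", "GCC": "A", "AAA": "K", "TGG": "W",
    "TAA": "*", "TAG": "*", "TGA": "*",   # stop codons
}

def translate_orf(dna):
    """Translate from the first ATG up to the first in-frame stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon, so no open reading frame
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = codon not in subset
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

protein = translate_orf("CCATGGCCAAATGGTAAGG")
```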
Secondary databases-SWISS-PROT
• This provides detailed sequence annotation that includes
structure, function, and protein family assignment.
• The protein annotation includes
⚫ function
⚫ domain structure
⚫ catalytic sites
⚫ cofactor binding
⚫ posttranslational modification
⚫ metabolic pathway information
⚫ disease association
⚫ similarity with other sequences.
Secondary databases-SWISS-PROT

• Much of this information is obtained from
scientific literature and entered by
database curators.
• The data record also provides
cross-referencing links to other online
resources of interest.
Secondary databases-SWISS-PROT

• Very low redundancy and a high level of
integration with other primary and
secondary databases make SWISS-PROT
very popular among biologists.
cs/swiss-prot_guideline.html
Secondary databases
• A recent effort to combine SWISS-PROT,
TrEMBL, and PIR led to the creation of the
UniProt database (https://www.uniprot.org/),
which has larger coverage than any one of the
three databases while maintaining the original
SWISS-PROT features of low redundancy,
cross-references, and a high quality of
annotation.
• There are also secondary databases that
relate to protein family classification
according to functions or structures.
Specialized databases
• These cater to a particular research interest.
⚫ FlyBase (https://flybase.org/)
⚫ HIV sequence database (https://www.hsls.pitt.edu/obrc/index.php?page=URL1101837165)
Specialized databases
• Specialized databases normally serve a
specific research community or focus on
a particular organism.
• The content of these databases may be
sequences or other types of information.
Specialized databases
• The sequences in these databases may overlap with a
primary database but may also have new data
submitted directly by authors.
• Because they are often curated by experts in the
field, they may have unique organizations and
additional annotations associated with the
sequences.
• Many genome databases that are specific to a taxonomic
group fall within this category.
Interconnection between biological databases
• As mentioned, primary databases are central repositories
and distributors of raw sequence and structure
information.
• They support nearly all other types of biological databases
in a way akin to the Malawi News Agency (MANA) providing
news feeds to local news media, which then tailor the news
to suit their own particular needs.
• Therefore, in the biological community, there is a frequent
need for the secondary and specialized databases to
connect to the primary databases and to keep uploading
sequence information.
Interconnection between biological databases
• In addition, a user often needs to get information from
both primary and secondary databases to complete a task
because the information in a single database is often
insufficient.
• Instead of having users visit multiple databases, it is
convenient for entries in a database to be cross-referenced
and linked to related entries in other databases that
contain additional information.
• All these create a demand for linking different databases.
Interconnection between biological databases
• The main barrier to linking different biological
databases is format incompatibility among
⚫ flat files
⚫ relational databases
⚫ object-oriented databases
• The heterogeneous database structures limit
communication between databases.
Interconnection between biological databases
• One solution to networking the databases is to use a
specification language called Common Object Request
Broker Architecture (CORBA)
⚫ This allows database programs at different locations to
communicate in a network through an “interface
broker” without having to understand each other’s
database structure.
⚫ It works in a way similar to HyperText Markup Language
(HTML) for web pages, labeling database entries using a
set of common tags.
Interconnection between biological databases
• A related format, eXtensible Markup Language
(XML), also helps in bridging databases.
⚫ In this format, each biological record is broken down
into small, basic components that are labeled with a
hierarchical nesting of tags.
⚫ This database structure significantly improves the
distribution and exchange of complex sequence
annotations between databases.
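A minimal sketch of a record broken into hierarchically nested tags is shown below, read back with Python's standard xml.etree.ElementTree module. The tag and attribute names are invented for illustration and do not follow any real bioinformatics XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: tag names are assumptions, not a real schema.
xml_record = """
<entry id="SEQ001">
  <organism>Homo sapiens</organism>
  <sequence type="DNA">ATGGCCAAATGGTAA</sequence>
  <annotation>
    <feature kind="CDS">
      <start>1</start>
      <end>15</end>
    </feature>
  </annotation>
</entry>
"""

root = ET.fromstring(xml_record.strip())

# Each component is reached by its tag path, independent of line layout.
entry_id = root.get("id")
organism = root.findtext("organism")
feature = root.find("annotation/feature")
cds_span = (int(feature.findtext("start")), int(feature.findtext("end")))
```

Because every component carries its own label, a receiving database can pick out the pieces it understands without knowing how the sending database stores them internally.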
Interconnection between biological databases
• Recently, a specialized protocol for bioinformatics data
exchange has been developed.
• It is the Distributed Annotation System (DAS), which allows
one computer to contact multiple servers and retrieve
dispersed sequence annotation information related to a
particular sequence and integrate the results into a single
combined report.
• Example
⚫ The Database for Annotation, Visualization and Integrated
Discovery (DAVID): https://david.ncifcrf.gov/home.jsp
Pitfalls of biological databases
• One of the problems associated with biological databases is
over-reliance on sequence information and related
annotations, without understanding the reliability of the
information.
• Redundancy is another major problem affecting primary
databases.
⚫ The National Center for Biotechnology Information (NCBI) has
now created a non-redundant database, called RefSeq, in
which identical sequences from the same organism and
associated sequence fragments are merged into a single entry.
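The idea of merging redundant entries can be sketched as follows. This simplified version only merges identical sequences from the same organism and does not handle sequence fragments; the accession numbers and sequences are hypothetical, and this is not NCBI's actual RefSeq procedure.

```python
# Hypothetical primary-database entries with made-up accession numbers.
entries = [
    {"accession": "ACC001", "organism": "Homo sapiens", "sequence": "ATGGCCAAA"},
    {"accession": "ACC002", "organism": "Homo sapiens", "sequence": "atggccaaa"},
    {"accession": "ACC003", "organism": "Mus musculus", "sequence": "ATGGCCAAA"},
    {"accession": "ACC004", "organism": "Homo sapiens", "sequence": "ATGTTTGGG"},
]

def merge_redundant(records):
    """Group records that share both organism and (case-insensitive) sequence."""
    merged = {}
    for record in records:
        key = (record["organism"], record["sequence"].upper())
        merged.setdefault(key, []).append(record["accession"])
    return [
        {"organism": organism, "sequence": sequence, "accessions": accessions}
        for (organism, sequence), accessions in merged.items()
    ]

non_redundant = merge_redundant(entries)
```

The same sequence from a different organism (ACC003) is kept as a separate entry, matching the "from the same organism" condition.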
Pitfalls of biological databases
• The other common problem is erroneous annotations.
• Often, the same gene sequence is found under different names,
resulting in multiple entries and confusion about the data.
• Conversely, unrelated genes bearing the same name are
found in the databases.
• Re-annotating genes and proteins with a common, controlled
vocabulary alleviates these naming problems; a prominent
example of such a system is Gene Ontology.
• Functional annotation databases such as DAVID have the
capacity to deal with these problems.
Pitfalls of biological databases

• Some of the inconsistencies in annotation could be
caused by genuine disagreement between
researchers in the field.
⚫ Others may result from imprudent assignment of
protein functions by sequence submitters.
Major biological databases available on the world-wide web
System | Brief summary of content | URL
AceDB | Genome database for Caenorhabditis elegans | www.acedb.org
DDBJ | Primary nucleotide sequence database in Japan | www.ddbj.nig.ac.jp
EMBL | Primary nucleotide sequence database in Europe | www.ebi.ac.uk/embl/index.html
Entrez | NCBI portal for a variety of biological databases | www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
ExPASy | Proteomics database | http://us.expasy.org/
FlyBase | A database of the Drosophila genome | http://flybase.bio.indiana.edu/
FSSP | Protein secondary structures | www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
GenBank | Primary nucleotide sequence database in NCBI | www.ncbi.nlm.nih.gov/Genbank
HIV databases | HIV sequence data and related immunologic information | www.hiv.lanl.gov/content/index
Microarray gene expression database | DNA microarray data and analysis tools | www.ebi.ac.uk/microarray
OMIM | Genetic information of human diseases | www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
PIR | Annotated protein sequences | http://www/pir.georgetown.edu/pirpirhome3.shtml
PubMed | Biomedical literature information | www.ncbi.nlm.nih.gov/PubMed
Ribosomal database project | Ribosomal RNA sequences and phylogenetic trees derived from the sequences | http://rdp.cme.msu.edu/html
SRS | General sequence retrieval system | http://srs6.ebi.ac.uk
SWISS-PROT | Curated protein sequence database | www.ebi.ac.uk/swissprot/access.html
TAIR | Arabidopsis information database | www.arabidopsis.org
End of presentation
