02-A-Introduction to Biological Databases
Wilson Nandolo
[email protected]
+265993375505
Overview
• Genomic research generates
enormous amounts of raw sequence
data.
• The primary challenge with these
data is storage and handling.
Overview
• We will discuss the following aspects of
biological databases
⚫ Type
⚫ Design
⚫ Architecture
Definition of a database
• A database is a computerized archive used
to store and organize data in such a way
that information can be retrieved easily via
a variety of search criteria.
• The chief objective of the development of
a database is to organize data in a set of
structured records to enable easy retrieval
of information.
Definition of a database
• Each record, also called an entry, should contain a
number of fields that hold the actual data items
• For example: fields for
⚫ Names
⚫ Phone numbers
⚫ Addresses
⚫ Dates
Definition of a database
• To retrieve a particular record from the database,
a user specifies a particular piece of
information, called a value, to be found in a
particular field and expects the computer to
retrieve the whole data record.
• This process is called making a query.
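The idea of records, fields, values and queries can be sketched in a few lines of Python. This is only a minimal illustration, not how a real database system is implemented; the records, the field names and the `query` helper are all invented for the example.

```python
# Toy records: each dictionary is one entry, each key is a field.
# All names and values are made up for illustration.
records = [
    {"name": "Person A", "phone": "555-0101", "address": "12 Hill Road", "date": "2024-01-15"},
    {"name": "Person B", "phone": "555-0102", "address": "7 Lake Street", "date": "2024-02-03"},
    {"name": "Person C", "phone": "555-0103", "address": "12 Hill Road", "date": "2024-02-20"},
]


def query(records, field, value):
    """Return every whole record whose `field` holds exactly `value`."""
    return [record for record in records if record.get(field) == value]


# Making a query: ask for a value in one field, get whole records back.
matches = query(records, "address", "12 Hill Road")
for record in matches:
    print(record)
```

The query names only one field and one value, yet each match comes back as a complete record with all of its fields.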
Databases
• Although data retrieval is the main purpose of all
databases, biological databases often have a higher-level
requirement, known as knowledge discovery.
• Knowledge discovery refers to the identification of
connections between pieces of information that were
not known when the information was first entered.
• For example, databases containing raw sequence
information can perform extra computational tasks to
identify sequence homology or conserved motifs.
Databases
• These features facilitate the discovery of new biological
insights from raw data.
• Originally, databases all used a flat file format, which is a
long text file that contains many entries separated by a
delimiter, a special character such as a vertical bar (|).
• Within each entry may be many fields separated by tabs
or commas.
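A flat file of this kind can be read with plain string splitting. The sketch below assumes a vertical bar between entries and commas between fields; the file content, the field order (student number, name, state) and the `parse_flat_file` helper are hypothetical.

```python
# A tiny flat file held in a string: entries separated by "|",
# fields within an entry separated by ",". Content is invented.
FLAT_FILE = "1,Student A,Texas|2,Student B,Ohio|3,Student C,Texas"


def parse_flat_file(text, entry_delimiter="|", field_delimiter=","):
    """Split the text into entries, then split each entry into fields."""
    entries = []
    for raw_entry in text.split(entry_delimiter):
        raw_entry = raw_entry.strip()
        if raw_entry:  # skip empty pieces, e.g. after a trailing delimiter
            entries.append(raw_entry.split(field_delimiter))
    return entries


entries = parse_flat_file(FLAT_FILE)

# Searching a flat file means reading every entry in turn.
texas_entries = [fields for fields in entries if "Texas" in fields]
print(texas_entries)
```

Note that the search has to touch every entry, because the file itself carries no index or table structure.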
Types of databases
• Relational
• Object-oriented
Relational databases
• A set of tables is used to organize data that would
otherwise be held in a single flat file
• The tables are related to each other via some unique
attribute
• To execute a query, the system selects linked data items
from different tables and combines the information into
one report.
• Commonly used language: structured query language
(SQL)
Relational databases
• For example, if one is
to ask the question,
which courses are
students from Texas
taking?
• The database will
first find the field for
“State” in Table A
and look up
Texas.
• This returns students
1 and 5.
Relational databases
• The student numbers are co-listed in Table B, in which
students 1 and 5 correspond to Biol 689 and Math 172,
respectively.
• The course names listed by course numbers are found in
Table C.
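The Texas query can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table and column names are invented, the rows for student 2 and course Chem 101 are filler, and the course names are placeholders because the slides do not give them. Only the facts that students 1 and 5 are from Texas and take Biol 689 and Math 172 come from the example above.

```python
import sqlite3

# In-memory database with three linked tables (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students  (student_no INTEGER PRIMARY KEY, state TEXT);    -- Table A
    CREATE TABLE enrolment (student_no INTEGER, course_no TEXT);            -- Table B
    CREATE TABLE courses   (course_no TEXT PRIMARY KEY, course_name TEXT);  -- Table C

    INSERT INTO students  VALUES (1, 'Texas'), (2, 'California'), (5, 'Texas');
    INSERT INTO enrolment VALUES (1, 'Biol 689'), (2, 'Chem 101'), (5, 'Math 172');
    INSERT INTO courses   VALUES ('Biol 689', 'placeholder name 1'),
                                 ('Chem 101', 'placeholder name 2'),
                                 ('Math 172', 'placeholder name 3');
    """
)

# Which courses are students from Texas taking?
# The shared student_no and course_no columns link the tables.
QUERY = """
    SELECT s.student_no, e.course_no, c.course_name
    FROM students AS s
    JOIN enrolment AS e ON e.student_no = s.student_no
    JOIN courses   AS c ON c.course_no  = e.course_no
    WHERE s.state = ?
    ORDER BY s.student_no
"""
rows = conn.execute(QUERY, ("Texas",)).fetchall()
for row in rows:
    print(row)
conn.close()
```

The SQL statement names the linking attributes once, and the system combines the matching rows from all three tables into one report.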
Relational databases
• With a flat file, the same question can only be answered by reading through every entry to
⚫ mark up the data records containing the word
“Texas”.
• This can easily be done for a small database.
• Performing queries on a large database stored as flat
files obviously becomes computationally
intensive.
Relational databases-pitfalls
• Another problem with relational
databases is that the tables do not
describe complex hierarchical
relationships between data items
• To overcome this problem, object-oriented
databases have been developed.
Object-oriented databases
• An object can be considered as a unit that
combines data and mathematical routines
that act on the data.
• The database is structured such that the
objects are linked by a set of pointers
defining predetermined relationships
between the objects.
Object-oriented databases
• Searching the database involves navigating
through the objects with the aid of the
pointers linking different objects.
• Programming languages like C++ are used to
create object-oriented databases.
Object-oriented databases
• Using the previous example, three different
objects can be designed
⚫ student object
⚫ course object
⚫ state object.
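The three objects can be sketched in Python, where ordinary object references play the role of the pointers. This is an illustration only: the class names, attributes and the `course_numbers` method are assumptions, and a real object-oriented database would also handle storage and persistence.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    name: str


@dataclass
class Course:
    number: str


@dataclass
class Student:
    number: int
    state: State                                 # pointer to a state object
    courses: list = field(default_factory=list)  # pointers to course objects

    def course_numbers(self):
        """A routine that acts on the object's own data."""
        return [course.number for course in self.courses]


# Build a few linked objects (student 2 and Chem 101 are filler).
texas = State("Texas")
california = State("California")
biol, chem, math = Course("Biol 689"), Course("Chem 101"), Course("Math 172")
students = [
    Student(1, texas, [biol]),
    Student(2, california, [chem]),
    Student(5, texas, [math]),
]

# Searching means navigating the links: student -> state, student -> courses.
result = {s.number: s.course_numbers() for s in students if s.state is texas}
print(result)
```

No join is needed here, because each student object already points directly at its state and course objects.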
Categories of biological databases
⚫ primary databases
⚫ secondary databases
⚫ specialized databases
Primary databases
• Contain original biological data.
• Archive raw sequence or structural data
submitted by the scientific community.
Primary databases
• The main primary databases
⚫ GenBank (within NCBI,
https://www.ncbi.nlm.nih.gov/refseq/)
⚫ DNA Data Bank of Japan (DDBJ,
https://www.ddbj.nig.ac.jp/index-e.html)
⚫ European Molecular Biology Laboratory (EMBL)
Secondary databases
• Secondary databases add curated annotation to the primary data, such as
⚫ domain structure
⚫ catalytic sites
⚫ cofactor binding
⚫ posttranslational modification
⚫ disease association
(https://www.hsls.pitt.edu/obrc/index.php?page=URL1101837165)
Specialized databases
• Specialized databases normally serve a
specific research community or focus on
a particular organism.
• The content of these databases may be
sequences or other types of information.
Specialized databases
• The sequences in these databases may overlap with a
primary database but may also have new data
submitted directly by authors.
• Because they are often curated by experts in the
field, they may have unique organizations and
additional annotations associated with the
sequences.
• Many genome databases that are taxon-specific
fall within this category.
Interconnection between
biological databases
• As mentioned, primary databases are central repositories
and distributors of raw sequence and structure
information.
• They support nearly all other types of biological databases
in a way akin to the MANA providing news feeds to local
news media, which then tailor the news to suit their own
particular needs.
• Therefore, in the biological community, there is a frequent
need for the secondary and specialized databases to
connect to the primary databases and to keep uploading
sequence information.
Interconnection between biological
databases
• In addition, a user often needs to get information from
both primary and secondary databases to complete a task
because the information in a single database is often
insufficient.
• Instead of making users visit multiple databases, it is
convenient for entries in a database to be cross-referenced
and linked to related entries in other databases that
contain additional information.
• All of this creates a demand for linking different databases.
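Cross-referencing can be pictured as each entry carrying the identifiers of related entries in other databases. The sketch below is a toy model only: the database names, entry identifiers, sequence, annotation text and the `follow_xrefs` helper are all hypothetical and do not reflect the record format of any real database.

```python
# Two toy "databases" held as dictionaries keyed by entry identifier.
primary_db = {
    "P001": {
        "sequence": "ATGGCCATTG",
        # Cross-references: name of the other database -> entry id there.
        "xrefs": {"secondary_db": "S100"},
    },
}
secondary_db = {
    "S100": {"annotation": "made-up annotation for entry P001"},
}
databases = {"primary_db": primary_db, "secondary_db": secondary_db}


def follow_xrefs(db_name, entry_id):
    """Collect the entries that a given entry links to in other databases."""
    entry = databases[db_name][entry_id]
    linked = {}
    for target_db, target_id in entry.get("xrefs", {}).items():
        linked[target_db] = databases[target_db][target_id]
    return linked


linked_entries = follow_xrefs("primary_db", "P001")
print(linked_entries)
```

With the link stored in the entry itself, the user starts from one record and reaches the related information without searching the second database separately.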
Interconnection between
Biological databases
• The main barrier to linking different biological
databases is format incompatibility
⚫ flat files
⚫ relational
⚫ object oriented.
• Database for Annotation, Visualization and Integrated Discovery (DAVID): https://david.ncifcrf.gov/home.jsp
Pitfalls of biological databases
• One of the problems associated with biological databases is
over-reliance on sequence information and related
annotations, without understanding the reliability of the
information.
• Redundancy is another major problem affecting primary
databases.
⚫ The National Center for Biotechnology Information (NCBI) has
created a nonredundant database, called RefSeq, in
which identical sequences from the same organism and
associated sequence fragments are merged into a single entry.
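The idea of merging redundant entries can be illustrated with a small grouping step. This is a simplified sketch, not a description of how RefSeq is actually built: it merges only exact duplicates from the same organism and does not attempt to merge sequence fragments. The accessions, organism labels, sequences and the `merge_identical` helper are invented.

```python
from collections import defaultdict

# (accession, organism, sequence) tuples; all values are made up.
entries = [
    ("ACC_0001", "organism A", "ATGGCCATTG"),
    ("ACC_0002", "organism A", "atggccattg"),  # same sequence, different case
    ("ACC_0003", "organism B", "ATGGCCATTG"),  # same sequence, other organism
    ("ACC_0004", "organism A", "ATGCCCGGGA"),
]


def merge_identical(entries):
    """Group accessions that share both organism and exact sequence."""
    merged = defaultdict(list)
    for accession, organism, sequence in entries:
        key = (organism, sequence.upper())  # ignore letter case
        merged[key].append(accession)
    return dict(merged)


merged = merge_identical(entries)
for (organism, sequence), accessions in merged.items():
    print(organism, sequence, accessions)
```

Four submitted entries collapse into three, because the first two describe the same sequence from the same organism.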
Pitfalls of biological databases
• The other common problem is erroneous annotations.
• Often, the same gene sequence is found under different names
resulting in multiple entries and confusion about the data.
• Or conversely, unrelated genes bearing the same name are
found in the databases.
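• Controlled vocabularies for naming genes and describing their functions
help to resolve such confusion; a prominent example of such systems is Gene Ontology.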
• Functional annotation databases such as DAVID have the
capacity to deal with these problems.