Introduction To Biological Databases

The document discusses different types of databases used in biological research. It describes flat file, relational, and object-oriented databases. It also discusses biological databases which can contain raw sequence data as well as computationally processed data designed for knowledge discovery.

Uploaded by

Nickson Onchoka

2/13/2012

INTRODUCTION TO BIOLOGICAL DATABASES

INTRODUCTION
• One of the hallmarks of modern genomic research is the generation of enormous amounts of raw sequence data.
• As the volume of genomic data grows, sophisticated computational methodologies are required to manage the data deluge.
• Thus, the very first challenge in the genomics era is to store and handle the staggering volume of information through the establishment and use of computer databases.
• The development of databases to handle the vast amount of molecular biological data is thus a fundamental task of bioinformatics.

WHAT IS A DATABASE?
• A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria.
• Databases are composed of computer hardware and software for data management.
• The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information.
  – Each record, also called an entry, should contain a number of fields that hold the actual data items, for example, fields for names, phone numbers, addresses, dates.
  – To retrieve a particular record from the database, a user can specify a particular piece of information, called a value, to be found in a particular field and expect the computer to retrieve the whole data record. This process is called making a query.

• Although data retrieval is the main purpose of all databases, biological databases often have a higher level of requirement, known as knowledge discovery, which refers to the identification of connections between pieces of information that were not known when the information was first entered.
  – For example, databases containing raw sequence information can perform extra computational tasks to identify sequence homology or conserved motifs.
• These features facilitate the discovery of new biological insights from raw data.
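The record, field, value, and query vocabulary above can be sketched in a few lines of Python. This is only an illustration: the names, phone numbers, and cities are made up, and the `query` helper is a hypothetical function, not part of any database product.

```python
# Hypothetical records: each dict is one entry, each key is a field.
records = [
    {"name": "Ada", "phone": "555-0101", "city": "Nairobi"},
    {"name": "Ben", "phone": "555-0102", "city": "Kisumu"},
    {"name": "Cy", "phone": "555-0103", "city": "Nairobi"},
]


def query(entries, field, value):
    """Return every whole record whose `field` holds exactly `value`."""
    return [entry for entry in entries if entry.get(field) == value]


# Making a query: ask for the value "Nairobi" in the field "city".
matches = query(records, "city", "Nairobi")
for entry in matches:
    print(entry)
```

The query names one field and one value, yet the whole record comes back, which is the behaviour the slide describes.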

TYPES OF DATABASES
• Originally, databases all used a flat file format, which is a long text file that contains many entries separated by a delimiter, a special character such as a vertical bar (|).
  – Within each entry are a number of fields separated by tabs or commas.
• To facilitate the access and retrieval of data, sophisticated computer software programs for organizing, searching, and accessing data have been developed.
• These are called database management systems. These systems contain not only raw data records but also operational instructions to help identify hidden connections among data records.

DATABASE MANAGEMENT SYSTEMS
• Depending on the types of data structures, these database management systems can be classified into two types:
  – Relational database management systems.
  – Object-oriented database management systems.
• Consequently, databases employing these management systems are known as relational databases or object-oriented databases, respectively.
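As a rough sketch of the flat file format described above, the following Python reads a text in which entries are separated by a vertical bar and fields by commas. The sample text and the `parse_flat_file` helper are assumptions made for illustration only.

```python
# Hypothetical flat file: entries separated by "|", fields by commas.
FLAT_FILE = "Ada,Biology,A|Ben,Chemistry,B|Cy,Biology,C"


def parse_flat_file(text, entry_delimiter="|", field_delimiter=","):
    """Split the text into entries, then each entry into its fields."""
    entries = []
    for raw_entry in text.split(entry_delimiter):
        raw_entry = raw_entry.strip()
        if raw_entry:  # skip empty chunks, e.g. after a trailing delimiter
            entries.append(raw_entry.split(field_delimiter))
    return entries


entries = parse_flat_file(FLAT_FILE)

# Searching a flat file means scanning every entry in turn.
biology_students = [fields[0] for fields in entries if fields[1] == "Biology"]
print(biology_students)
```

Every search walks the whole file, which is one reason database management systems were developed.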


RELATIONAL DATABASES
• Instead of using a single table as in a flat file database, relational databases use a set of tables to organize data.
  – Each table, also called a relation, is made up of columns and rows.
  – Columns represent individual fields. Rows represent values in the fields of records.
  – The columns in a table are indexed according to a common feature called an attribute, so they can be cross-referenced in other tables.
  – To execute a query in a relational database, the system selects linked data items from different tables and combines the information into one report.
    • Therefore, specific information can be found more quickly from a relational database than from a flat file database.
• Relational databases can be created using a special programming language called structured query language (SQL).
  – The creation of this type of database can take a great deal of planning during the design phase.

[Figure: Example of constructing a relational database for five students’ course information originally expressed in a flat file. By creating three different tables linked by common fields, data can be easily accessed and reassembled.]
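The idea of several tables linked by common fields can be tried with Python's built-in `sqlite3` module. The sketch below is not the example from the slide's figure: the table names, column names, and the two students are assumptions chosen to keep it short.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (course_id  INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE enrollments (
        student_id INTEGER REFERENCES students(student_id),
        course_id  INTEGER REFERENCES courses(course_id),
        grade      TEXT
    );
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Ben');
    INSERT INTO courses  VALUES (10, 'Biology'), (20, 'Chemistry');
    INSERT INTO enrollments VALUES (1, 10, 'A'), (1, 20, 'B'), (2, 10, 'C');
""")

# The query selects linked items from all three tables and combines
# them into one report, using the shared ID columns as the links.
report = conn.execute("""
    SELECT s.name, c.title, e.grade
    FROM enrollments AS e
    JOIN students AS s ON s.student_id = e.student_id
    JOIN courses  AS c ON c.course_id  = e.course_id
    WHERE c.title = 'Biology'
    ORDER BY s.name
""").fetchall()
conn.close()

for row in report:
    print(row)
```

Each student name and course title is stored once and referenced by ID elsewhere, which is the planning effort the design phase has to cover.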

OBJECT-ORIENTED DATABASES
• In an object-oriented programming language, an object can be considered as a unit that combines data and mathematical routines that act on the data.
  – The database is structured such that the objects are linked by a set of pointers defining predetermined relationships between the objects.
  – Searching the database involves navigating through the objects with the aid of the pointers linking different objects.
  – Programming languages like C++ are used to create object-oriented databases.
• The object-oriented database system is more flexible;
  – Data can be structured based on hierarchical relationships.
  – By doing so, programming tasks can be simplified for data that are known to have complex relationships, such as multimedia data.

[Figure: Example of construction and query of an object-oriented database. Three objects are constructed and are linked by pointers shown as arrows. Finding specific information relies on navigating through the objects by way of pointers.]
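Navigation by pointers can be imitated with ordinary Python object references. This is a minimal sketch, not an object-oriented database: the `Student`, `Course`, and `Instructor` classes and their contents are hypothetical, and Python is used here in place of the C++ mentioned on the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instructor:
    name: str


@dataclass
class Course:
    title: str
    instructor: Instructor  # reference ("pointer") to an Instructor object


@dataclass
class Student:
    name: str
    courses: List[Course] = field(default_factory=list)  # references to Course objects

    def instructors(self):
        """Navigate Student -> Course -> Instructor by following references."""
        return [course.instructor.name for course in self.courses]


smith = Instructor("Dr. Smith")
biology = Course("Biology", smith)
ada = Student("Ada", [biology])

print(ada.instructors())
```

The relationships are fixed when the objects are built, so a search is a walk along predetermined links instead of a join computed at query time.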

BIOLOGICAL DATABASES
• Current biological databases use all three types of database structures:
  – Flat files, relational, and object oriented.
• Based on their contents, biological databases can be roughly divided into three categories:
  – Primary databases,
  – Secondary databases,
  – Specialized databases.
• Primary databases contain original biological data.
  – They are archives of raw sequence or structural data submitted by the scientific community.
  – GenBank and Protein Data Bank (PDB) are examples of primary databases.

• Secondary databases contain computationally processed or manually curated information, based on original information from primary databases.
  – Translated protein sequence databases containing functional annotation belong to this category.
  – Examples are SWISS-PROT and Protein Information Resource (PIR) (successor of Margaret Dayhoff’s Atlas of Protein Sequence and Structure).
• Specialized databases are those that cater to a particular research interest.
  – For example, FlyBase, HIV sequence database, and Ribosomal Database Project are databases that specialize in a particular organism or a particular type of data.


PRIMARY DATABASES
• There are three major public sequence databases that store raw nucleic acid sequence data produced and submitted by researchers worldwide:
  – GenBank,
  – The European Molecular Biology Laboratory (EMBL) database,
  – The DNA Data Bank of Japan (DDBJ), [all freely available on the Internet.]
• Most of the data in the databases are contributed directly by authors with a minimal level of annotation.
  – These three public databases closely collaborate and exchange new data daily.
  – They together constitute the International Nucleotide Sequence Database Collaboration.
    • This means that by connecting to any one of the three databases, one should have access to the same nucleotide sequence data.
  – Although the three databases all contain the same sets of raw data, each of the individual databases has a slightly different kind of format to represent the data.

SECONDARY DATABASES
• To turn the raw sequence information into more sophisticated biological knowledge, much post-processing of the sequence information is needed.
• The amount of computational processing work varies greatly among the secondary databases;
  – Some are simple archives of translated sequence data from identified open reading frames in DNA, whereas others provide additional annotation and information related to higher levels of information regarding structure and functions.
• An example of secondary databases is SWISS-PROT, which provides detailed sequence annotation that includes structure, function, and protein family assignment.
  – The sequence data are mainly derived from TrEMBL, a database of translated nucleic acid sequences stored in the EMBL database.
  – The annotation of each entry is carefully curated by human experts and thus is of good quality.
  – The protein annotation includes function, domain structure, catalytic sites, cofactor binding, posttranslational modification, metabolic pathway information, disease association, and similarity with other sequences.
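The simplest kind of post-processing mentioned above, translating an open reading frame into protein, can be sketched as follows. The code assumes the standard genetic code and an input that already starts at the first codon of the reading frame; the function name and the test sequence are made up, and real pipelines such as the one behind TrEMBL do considerably more.

```python
# Standard genetic code, written in the usual compact form: codons are
# ordered by first, second, and third base, each running through T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_orf(dna):
    """Translate from the first base, stopping at the first stop codon.

    Assumes the sequence contains only A, C, G, and T; any other
    character raises KeyError.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate_orf("ATGGCCTGGTAA")
print(peptide)
```

A translation like this carries no annotation at all; curated entries add the functional information listed on the slide.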

• The data record also provides cross-referencing links to other online resources of interest.
• Other features such as very low redundancy and high level of integration with other primary and secondary databases make SWISS-PROT very popular among biologists.
• A recent effort to combine SWISS-PROT, TrEMBL, and PIR led to the creation of the UniProt database, which has larger coverage than any one of the three databases while at the same time maintaining the original SWISS-PROT feature of low redundancy, cross-references, and a high quality of annotation.

SPECIALIZED DATABASES
• Specialized databases normally serve a specific research community or focus on a particular organism.
  – The content of these databases may be sequences or other types of information.
  – The sequences in these databases may overlap with a primary database, but may also have new data submitted directly by authors.
• Because they are often curated by experts in the field, they may have unique organizations and additional annotations associated with the sequences.
  – Many genome databases that are taxonomic specific fall within this category.
  – Examples include FlyBase, WormBase, AceDB, and TAIR.
• In addition, there are also specialized databases that contain original data derived from functional analysis.
  – For example, GenBank EST database and Microarray Gene Expression Database at the European Bioinformatics Institute (EBI) are some of the gene expression databases available.

INTERCONNECTION BETWEEN BIOLOGICAL DATABASES
• Primary databases are central repositories and distributors of raw sequence and structure information.
  – They support nearly all other types of biological database needs.
  – Therefore, in the biological community, there is a frequent need for the secondary and specialized databases to connect to the primary databases and to keep uploading sequence information.
  – Instead of having users visit multiple databases, it is convenient for entries in a database to be cross-referenced and linked to related entries in other databases that contain additional information.
  – All these create a demand for linking different databases.
• The main barrier to linking different biological databases is format incompatibility.
  – Current biological databases utilize all three types of database structures: flat files, relational, and object oriented.
  – The heterogeneous database structures limit communication between databases.

• One solution to networking the databases is to use a specification language called Common Object Request Broker Architecture (CORBA).
  – CORBA allows database programs at different locations to communicate in a network through an “interface broker” without having to understand each other’s database structure.
  – It works in a way similar to HyperText Markup Language (HTML) for web pages, labeling database entries using a set of common tags.
• A similar protocol called eXtensible Markup Language (XML) also helps in bridging databases.
  – In this format, each biological record is broken down into small, basic components that are labeled with a hierarchical nesting of tags.
  – This database structure significantly improves the distribution and exchange of complex sequence annotations between databases.
• Recently, a specialized protocol for bioinformatics data exchange has been developed.
  – It is the distributed annotation system, which allows one computer to contact multiple servers and retrieve dispersed sequence annotation information related to a particular sequence and integrate the results into a single combined report.
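The hierarchical nesting of tags used by XML can be illustrated with Python's standard `xml.etree.ElementTree` module. The tag names, the identifier, and the gene name below are hypothetical and do not follow the schema of any real database.

```python
import xml.etree.ElementTree as ET

# Build one record as a tree of tagged components.
# Tag names are made up for illustration only.
record = ET.Element("entry", attrib={"id": "EX0001"})
ET.SubElement(record, "organism").text = "Drosophila melanogaster"
gene = ET.SubElement(record, "gene")
ET.SubElement(gene, "name").text = "exampleA"
ET.SubElement(gene, "sequence").text = "ATGGCC"

# Serialize the record to text, as it might be sent to another database.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# The receiver needs only the tags, not the sender's internal layout.
parsed = ET.fromstring(xml_text)
gene_name = parsed.findtext("gene/name")
print(gene_name)
```

Because every component is labeled, a flat file database and a relational database could both read or write such a record without sharing an internal structure.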


PITFALLS OF BIOLOGICAL DATABASES
• One of the problems associated with biological databases is overreliance on sequence information and related annotations, without understanding the reliability of the information.
  – There are also high levels of redundancy in the primary sequence databases.
  – Annotations of genes can also occasionally be false or incomplete.
  – All these types of errors can be passed on to other databases, causing propagation of errors.
• Most errors in nucleotide sequences are caused by sequencing errors.
  – Some errors cause frame shifts that make whole gene identification difficult or protein translation impossible.
  – Sometimes, gene sequences are contaminated with sequences from cloning vectors.
• Redundancy is another major problem affecting primary databases.
  – The causes include repeated submission of identical or overlapping sequences by the same or different authors, revision of annotations, dumping of expressed sequence tag (EST) data, and poor database management that fails to detect the redundancy.

STEPS TO REDUCE REDUNDANCY
• The National Center for Biotechnology Information (NCBI) has now created a non-redundant database, called RefSeq.
  – Identical sequences from the same organism and associated sequence fragments are merged into a single entry. Protein sequences derived from the same DNA sequences are explicitly linked as related entries.
• The SWISS-PROT database also has minimal redundancy for protein sequences compared to most other databases.
• Another way to address the redundancy problem is to create sequence-cluster databases such as UniGene that coalesce EST sequences that are derived from the same gene.
• Errors in annotation can be particularly damaging.
  – The large majority of new sequences are assigned functions based on similarity with sequences in the databases that are already annotated.
  – Therefore, a wrong annotation can be easily transferred to all similar genes in the entire database.
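Merging identical sequences from the same organism into a single entry can be sketched with a dictionary keyed on organism and sequence. This is a deliberate simplification and not the procedure RefSeq actually uses: it catches only exact duplicates, not fragments or overlaps, and the accessions and sequences are invented.

```python
from collections import defaultdict

# Hypothetical entries: (accession, organism, sequence).
entries = [
    ("X0001", "E. coli", "ATGGCC"),
    ("X0002", "E. coli", "atggcc"),        # same sequence, resubmitted in lowercase
    ("X0003", "E. coli", "ATGTTT"),
    ("X0004", "S. cerevisiae", "ATGGCC"),  # same sequence, different organism
]


def merge_identical(records):
    """Group accessions that share organism and (case-insensitive) sequence."""
    merged = defaultdict(list)
    for accession, organism, sequence in records:
        merged[(organism, sequence.upper())].append(accession)
    return dict(merged)


non_redundant = merge_identical(entries)
for (organism, sequence), accessions in non_redundant.items():
    print(organism, sequence, accessions)
```

The fourth entry stays separate because the merge is restricted to sequences from the same organism, as the slide states.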

INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES

INTRODUCTION
• A major goal in developing databases is to provide efficient and user-friendly access to the data stored.
• There are a number of retrieval systems for biological data. The most popular retrieval systems for biological databases are Entrez and Sequence Retrieval System (SRS), which provide access to multiple databases for retrieval of integrated search results.
• Performing complex queries in a database often requires the use of Boolean operators.
  – This means joining a series of keywords using logical terms such as AND, OR, and NOT to indicate relationships between the keywords used in a search.
    • AND means that the search result must contain both words; OR means to search for results containing either word or both; NOT excludes results containing the word that follows it.
    • In addition, one can use parentheses ( ) to define a concept if multiple words and relationships are involved, so that the computer knows which part of the search to execute first. Items contained within parentheses are executed first.
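Boolean operators map directly onto set operations, which makes the effect of parentheses easy to see. The sketch below uses a made-up keyword index; the record IDs, keywords, and the `having` helper are assumptions for illustration and do not reflect how Entrez or SRS are implemented.

```python
# Hypothetical mini-collection: record ID -> set of indexed keywords.
index = {
    "rec1": {"human", "kinase", "cancer"},
    "rec2": {"mouse", "kinase"},
    "rec3": {"human", "phosphatase"},
    "rec4": {"mouse", "kinase", "cancer"},
}


def having(keyword):
    """Return the set of record IDs indexed under `keyword`."""
    return {rec_id for rec_id, words in index.items() if keyword in words}


# (human OR mouse) AND kinase NOT cancer
# OR is set union (|), AND is intersection (&), NOT is difference (-).
result = ((having("human") | having("mouse")) & having("kinase")) - having("cancer")

# Moving the parentheses changes the meaning: human OR (mouse AND kinase)
regrouped = having("human") | (having("mouse") & having("kinase"))

print(sorted(result))
print(sorted(regrouped))
```

The same keywords give different hits depending on which part is grouped first, which is why the parentheses matter.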

ENTREZ
• The NCBI developed and maintains Entrez, a biological database retrieval system.
• It is a gateway that allows text-based searches for a wide variety of data, including:
  – Annotated genetic sequence information, structural information, as well as citations and abstracts, full papers, and taxonomic data.
• The key feature of Entrez is its ability to integrate information, which comes from cross-referencing between NCBI databases based on preexisting and logical relationships between individual entries.
  – Users do not have to visit multiple databases located in disparate places.
  – For example, in a nucleotide sequence page, one may find cross-referencing links to the translated protein sequence, genome mapping data, or to the related PubMed literature information, and to protein structures if available.
• Effective use of Entrez requires an understanding of the main features of the search engine.

PUBMED
• One of the databases accessible from Entrez is a biomedical literature database known as PubMed, which contains abstracts and in some cases the full text articles from nearly 4,000 journals.
• An important feature of PubMed is the retrieval of information based on medical subject headings (MeSH) terms.
  – The MeSH system consists of a collection of more than 20,000 controlled and standardized vocabulary terms used for indexing articles.
• Another way to broaden the retrieval is by using the “Related Articles” option.
  – PubMed uses a word weight algorithm to identify related articles with similar words in the titles, abstracts, and MeSH.
  – By using this feature, articles on the same topic that were missed in the original search can be retrieved.


OMIM
• Another unique database accessible from Entrez is Online Mendelian Inheritance in Man (OMIM).
  – It is a non-sequence-based database of human disease genes and human genetic disorders.
  – Each entry in OMIM contains summary information about a particular disease as well as genes related to the disease.
  – The text contains numerous hyperlinks to literature citations, primary sequence records, as well as chromosome loci of the disease genes.
• The database can serve as an excellent starting point to study genes related to a disease.

GENBANK
• GenBank is the most complete collection of annotated nucleic acid sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs, high throughput raw sequence data, and sequence polymorphisms.
• There is also a GenPept database for protein sequences, the majority of which are conceptual translations from DNA sequences, although a small number of the amino acid sequences are derived using peptide sequencing techniques.
• There are two ways to search for sequences in GenBank. One is using text-based keywords similar to a PubMed search. The other is using molecular sequences to search by sequence similarity using BLAST.

GENBANK SEQUENCE FORMAT
• GenBank is a relational database. However, the search output for sequence files is produced as flat files for easy reading.
  – The resulting flat files contain three sections: Header, Features, and Sequence entry.
• The Header section describes the origin of the sequence, identification of the organism, and unique identifiers associated with the record.
  – The top line of the Header section is the Locus, which contains a unique database identifier for a sequence location in the database (not a chromosome locus). The identifier is followed by sequence length and molecule type (e.g., DNA or RNA). This is followed by a three-letter code for GenBank divisions.
• The “Features” section includes annotation information about the gene and gene product, as well as regions of biological significance reported in the sequence, with identifiers and qualifiers.
• The third section of the flat file is the sequence itself, starting with the label “ORIGIN.”
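The three-section layout can be sketched by splitting a flat-file record at the FEATURES and ORIGIN keywords. The record below is abbreviated and entirely made up: it only imitates the layout of a GenBank entry, and the `split_sections` helper is a hypothetical function. A production parser would need to handle many more line types.

```python
# Abbreviated, made-up record that only imitates the flat-file layout.
RECORD = """\
LOCUS       EX000001     12 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical example sequence.
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="Example organism"
ORIGIN
        1 atggcctggt aa
//
"""


def split_sections(text):
    """Split a record into header, features, and sequence by its keywords."""
    header, features, sequence = [], [], []
    current = header
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = features
        elif line.startswith("ORIGIN"):
            current = sequence
            continue  # the ORIGIN label itself carries no sequence
        elif line.startswith("//"):
            break  # end-of-record marker
        current.append(line)
    return header, features, sequence


header, features, sequence_lines = split_sections(RECORD)

# Locus line: identifier, then length, "bp", and molecule type.
locus_fields = header[0].split()
locus_name = locus_fields[1]
length = int(locus_fields[2])
molecule = locus_fields[4]

# Drop the position numbers and spaces to recover the bare sequence.
sequence = "".join(ch for line in sequence_lines for ch in line if ch.isalpha())

print(locus_name, length, molecule, sequence)
```

Comparing the length declared on the Locus line with the length of the recovered sequence is a quick consistency check on a parsed record.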
