Introduction To Biological Databases
INTRODUCTION
• One of the hallmarks of modern genomic research is the
generation of enormous amounts of raw sequence data.
• Thus, the very first challenge in the genomics era is to store and
handle the staggering volume of information through the
establishment and use of computer databases.
WHAT IS A DATABASE?
• A database is a computerized archive used to store and organize
data in such a way that information can be retrieved easily via a
variety of search criteria.
• Databases are composed of computer hardware and software for
data management.
• The chief objective of the development of a database is to organize
data in a set of structured records to enable easy retrieval of
information.
– Each record, also called an entry, should contain a number of fields that hold
the actual data items, for example, fields for names, phone numbers,
addresses, dates.
– To retrieve a particular record from the database, a user can specify a
particular piece of information, called a value, to be found in a particular field
and expect the computer to retrieve the whole data record. This process is
called making a query.
• Although data retrieval is the main purpose of all databases,
biological databases often have a higher level of requirement, known
as knowledge discovery, which refers to the identification of
connections between pieces of information that were not known
when the information was first entered.
– For example, databases containing raw sequence information can perform
extra computational tasks to identify sequence homology or conserved motifs.
• These features facilitate the discovery of new biological insights
from raw data.
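The record, field, value, and query vocabulary can be illustrated with a minimal Python sketch. The records, the field names, and the `query` helper are hypothetical and exist only for illustration; a real database engine would use indexes rather than a linear scan.

```python
# Hypothetical records: each dict is one entry, each key is a field.
records = [
    {"name": "Ada Park", "phone": "555-0101", "address": "12 Elm St", "date": "2012-02-13"},
    {"name": "Ben Ortiz", "phone": "555-0102", "address": "9 Oak Ave", "date": "2012-01-30"},
    {"name": "Cy Lee", "phone": "555-0103", "address": "12 Elm St", "date": "2011-11-02"},
]


def query(entries, field, value):
    """Return every whole record whose `field` holds exactly `value`."""
    return [entry for entry in entries if entry.get(field) == value]


# Making a query: look for the value "12 Elm St" in the field "address".
for entry in query(records, "address", "12 Elm St"):
    print(entry["name"], entry["phone"])
```

The query names one field and one value, yet the whole record comes back, which is the behavior the slide describes.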
2/13/2012
RELATIONAL DATABASES
• Instead of using a single table as in a flat file database, relational
databases use a set of tables to organize data.
– Each table, also called a relation, is made up of columns and rows.
– Columns represent individual fields. Rows represent values in the fields of
records.
– The columns in a table are indexed according to a common feature called an
attribute, so they can be cross-referenced in other tables.
– To execute a query in a relational database, the system selects linked data
items from different tables and combines the information into one report.
• Therefore, specific information can be found more quickly from a relational database than
from a flat file database.
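A query that combines linked items from several tables can be sketched with Python's built-in `sqlite3` module. The table names, column names, and rows below are assumptions chosen for illustration, with `organism_id` playing the role of the shared attribute that cross-references the two tables.

```python
import sqlite3

# In-memory database with two hypothetical tables linked by organism_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organisms (organism_id INTEGER PRIMARY KEY, species TEXT);
CREATE TABLE genes (gene_id INTEGER PRIMARY KEY, symbol TEXT, organism_id INTEGER);
INSERT INTO organisms VALUES (1, 'Drosophila melanogaster'), (2, 'Homo sapiens');
INSERT INTO genes VALUES (10, 'white', 1), (11, 'TP53', 2), (12, 'BRCA1', 2);
""")


def genes_for_species(connection, species):
    """Join the two tables on the shared attribute and return gene symbols."""
    rows = connection.execute(
        "SELECT g.symbol FROM genes AS g "
        "JOIN organisms AS o ON g.organism_id = o.organism_id "
        "WHERE o.species = ? ORDER BY g.symbol",
        (species,),
    ).fetchall()
    return [symbol for (symbol,) in rows]


print(genes_for_species(conn, "Homo sapiens"))  # ['BRCA1', 'TP53']
```

The species name is stored once in `organisms` and referenced by key from `genes`, so the query only touches the rows that match the shared attribute.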
OBJECT-ORIENTED DATABASES
• In an object-oriented programming language, an object can be
considered as a unit that combines data and mathematical routines
that act on the data.
– The database is structured such that the objects are linked by a set of
pointers defining predetermined relationships between the objects.
– Searching the database involves navigating through the objects with the aid
of the pointers linking different objects.
– Programming languages like C++ are used to create object-oriented
databases.
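The idea of objects linked by pointers can be sketched in a few lines of Python (the slide mentions C++; Python is used here only for brevity). The classes `Organism`, `Gene`, and `Protein` and all object names are hypothetical, and this is an in-memory illustration, not a real object-oriented database system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Protein:
    name: str


@dataclass
class Gene:
    symbol: str
    product: Optional[Protein] = None  # pointer to the encoded protein, if any


@dataclass
class Organism:
    species: str
    genes: List[Gene] = field(default_factory=list)  # pointers to gene objects

    def protein_names(self):
        """Routine acting on the object's data: follow organism -> gene -> protein."""
        return [g.product.name for g in self.genes if g.product is not None]


# Hypothetical objects wired together with predetermined relationships.
organism = Organism("Example species")
organism.genes.append(Gene("gene_a", Protein("Protein A")))
organism.genes.append(Gene("gene_b"))  # no protein product linked

print(organism.protein_names())  # ['Protein A']
```

Searching here means walking references from one object to the next instead of matching values across tables.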
BIOLOGICAL DATABASES
• Current biological databases use all three types of database
structures:
– Flat files, relational, and object oriented.
• Based on their contents, biological databases can be roughly
divided into three categories:
– Primary databases,
– Secondary databases,
– Specialized databases.
• Primary databases contain original biological data.
– They are archives of raw sequence or structural data submitted by the
scientific community.
– GenBank and Protein Data Bank (PDB) are examples of primary databases.
• Secondary databases contain computationally processed or
manually curated information, based on original information
from primary databases.
– Translated protein sequence databases containing functional annotation
belong to this category.
– Examples are SWISS-PROT and Protein Information Resource (PIR)
(successor of Margaret Dayhoff’s Atlas of Protein Sequence and
Structure).
• Specialized databases are those that cater to a particular
research interest.
– For example, FlyBase, HIV sequence database, and Ribosomal Database
Project are databases that specialize in a particular organism or a
particular type of data.
• The SWISS-PROT data record also provides cross-referencing links to other
online resources of interest.
• Other features such as very low redundancy and high level of
integration with other primary and secondary databases make
SWISS-PROT very popular among biologists.
• A recent effort to combine SWISS-PROT, TrEMBL, and PIR led
to the creation of the UniProt database, which has larger coverage
than any one of the three databases while at the same time
maintaining the original SWISS-PROT features of low redundancy,
cross-references, and a high quality of annotation.
SPECIALIZED DATABASES
• Specialized databases normally serve a specific research
community or focus on a particular organism.
– The content of these databases may be sequences or other types of information.
– The sequences in these databases may overlap with a primary database, but may also have new
data submitted directly by authors.
• Because they are often curated by experts in the field, they may
have unique organizations and additional annotations associated
with the sequences.
– Many genome databases that are taxonomic specific fall within this category.
– Examples include FlyBase, WormBase, AceDB, and TAIR.
• In addition, there are also specialized databases that contain
original data derived from functional analysis.
– For example, GenBank EST database and Microarray Gene Expression Database at the
European Bioinformatics Institute (EBI) are some of the gene expression databases available.
• Redundancy is another major problem affecting primary databases.
– The causes include repeated submission of identical or overlapping sequences by the same or
different authors, revision of annotations, dumping of expressed sequence tags (EST) data, and
poor database management that fails to detect the redundancy.
• Errors in annotation can be particularly damaging.
– The large majority of new sequences are assigned functions based on similarity with sequences in
the databases that are already annotated.
– Therefore, a wrong annotation can be easily transferred to all similar genes in the entire
database.
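One cause of redundancy named above, repeated submission of identical sequences, can be detected by grouping records on their sequence content. The sketch below is a simplified illustration with hypothetical record IDs and a hypothetical `find_redundant` helper. It catches only exact duplicates; overlapping sequences would need a sequence comparison step that is not shown.

```python
from collections import defaultdict


def find_redundant(records):
    """Group record IDs that share an identical sequence (case-insensitive)."""
    groups = defaultdict(list)
    for record_id, sequence in records:
        groups[sequence.upper()].append(record_id)
    # Keep only the sequences that were submitted more than once.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical submissions: REC001 and REC003 carry the same sequence.
submissions = [
    ("REC001", "ATGGCCATTG"),
    ("REC002", "ATGGCCTTTG"),
    ("REC003", "atggccattg"),
]

print(find_redundant(submissions))  # [['REC001', 'REC003']]
```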
INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES
INTRODUCTION
• A major goal in developing databases is to provide efficient and
user-friendly access to the data stored.
• There are a number of retrieval systems for biological data. The
most popular retrieval systems for biological databases are Entrez
and Sequence Retrieval System (SRS), which provide access to
multiple databases for retrieval of integrated search results.
ENTREZ
• The NCBI developed and maintains Entrez, a biological database
retrieval system.
• It is a gateway that allows text-based searches for a wide variety of
data, including:
– Annotated genetic sequence information, structural information, as well as citations and abstracts,
full papers, and taxonomic data.
• The key feature of Entrez is its ability to integrate information,
which comes from cross-referencing between NCBI databases based
on preexisting and logical relationships between individual entries.
– Users do not have to visit multiple databases located in disparate places.
– For example, in a nucleotide sequence page, one may find cross-referencing links to the
translated protein sequence, genome mapping data, or to the related PubMed literature
information, and to protein structures if available.
• Effective use of Entrez requires an understanding of the main
features of the search engine.
PUBMED
• One of the databases accessible from Entrez is a biomedical
literature database known as PubMed, which contains abstracts
and in some cases the full text articles from nearly 4,000 journals.
• An important feature of PubMed is the retrieval of information
based on medical subject headings (MeSH) terms.
– The MeSH system consists of a collection of more than 20,000 controlled and standardized
vocabulary terms used for indexing articles.
• Another way to broaden the retrieval is by using the “Related
Articles” option.
– PubMed uses a word weight algorithm to identify related articles with similar words in the
titles, abstracts, and MeSH.
– By using this feature, articles on the same topic that were missed in the original search can be
retrieved.
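A text-based Entrez search restricted to a MeSH term can also be issued programmatically. The slides do not cover this; the sketch below assumes NCBI's public E-utilities `esearch` endpoint and the `[MeSH Terms]` field tag, and it only builds the request URL without sending it. The helper name and the example term are illustrative.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; not mentioned in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(mesh_term, max_results=20):
    """Build a PubMed search URL that limits the term to the MeSH field."""
    params = {
        "db": "pubmed",                       # Entrez database to search
        "term": f"{mesh_term}[MeSH Terms]",   # field tag restricts the match to MeSH
        "retmax": max_results,                # maximum number of IDs to return
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(build_pubmed_search_url("Databases, Genetic"))
```

Fetching the URL, for example with `urllib.request.urlopen`, would return an XML list of matching PubMed IDs; that step is omitted so the sketch runs without network access.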
OMIM
• Another unique database accessible from Entrez is Online
Mendelian Inheritance in Man (OMIM).
– It is a non-sequence-based database of human disease genes and human genetic disorders.
– Each entry in OMIM contains summary information about a particular disease as well as genes
related to the disease.
– The text contains numerous hyperlinks to literature citations, primary sequence records, as
well as chromosome loci of the disease genes.
• The database can serve as an excellent starting point to study
genes related to a disease.
GENBANK
• GenBank is the most complete collection of annotated nucleic acid
sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs, high
throughput raw sequence data, and sequence polymorphisms.
• There is also a GenPept database for protein sequences, the
majority of which are conceptual translations from DNA sequences,
although a small number of the amino acid sequences are derived
using peptide sequencing techniques.
• The third section of the flat file is the sequence itself starting with
the label “ORIGIN.”
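Reading the sequence section of a GenBank flat file can be sketched as follows. The parser assumes the usual layout in which the sequence lines follow the ORIGIN label, carry position numbers and space-separated blocks, and end at a line starting with "//". The record below is a made-up, heavily shortened example, and `extract_origin_sequence` is a hypothetical helper, not part of any library.

```python
def extract_origin_sequence(flat_file_text):
    """Return the sequence found between the ORIGIN label and the '//' terminator."""
    sequence_parts = []
    in_origin = False
    for line in flat_file_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if in_origin:
            if line.startswith("//"):
                break
            # Drop position numbers and spaces; keep only the residue letters.
            sequence_parts.extend(ch for ch in line if ch.isalpha())
    return "".join(sequence_parts)


# Made-up, abbreviated record used only to exercise the parser.
example_record = """LOCUS       HYPOTHETICAL1   24 bp    DNA
ORIGIN
        1 gatcctccat atacaacggt atct
//
"""

print(extract_origin_sequence(example_record))  # gatcctccatatacaacggtatct
```

For real work, an established parser such as the one in Biopython is a safer choice than hand-written string handling.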