
Bioinformatics

Bioinformatics is an interdisciplinary field that merges computer science and biological science, focusing on the analysis of biological macromolecules like DNA, RNA, and proteins. It has evolved significantly over the past few decades, driven by advancements in molecular biology and computer science, and aims to enhance our understanding of cellular functions through computational tools. The field encompasses various applications, including drug design, forensic analysis, and agricultural biotechnology, and relies on the development of computational tools for sequence, structural, and functional analysis.

Introduction to Bioinformatics

What is Bioinformatics?
• Interdisciplinary research area
• Interface between computer science and biological science
• A variety of definitions:
• Luscombe et al
Bioinformatics is a union of biology and informatics
(Bioinformatics involves the technology that uses computers for storage,
retrieval, manipulation, and distribution of information related to
biological macromolecules such as DNA, RNA, and proteins)
• The emphasis here is on the use of computers because most of the
tasks in genomic data analysis are highly repetitive or mathematically
complex
• The use of computers is absolutely indispensable in mining genomes
for information gathering and knowledge building
• Bioinformatics differs from a related field - computational biology
• Bioinformatics is limited to
o Sequence
o Structural
o Functional analysis - of genes and genomes and their
corresponding products (proteins)
• Often considered computational molecular biology
• Computational biology covers all biological areas that involve
computation
• For example
o Mathematical modeling of ecosystems
o Population dynamics
o Application of the game theory in behavioral studies
o Phylogenetic construction using fossil records
• These all employ computational tools
• But do not necessarily involve biological macromolecules
• Besides this distinction, it is worth noting that there are other views
of how the two terms relate
• For example
o One version defines bioinformatics as the development and
application of computational tools in managing all kinds of
biological data
o Whereas computational biology is more confined to the
theoretical development of algorithms used for bioinformatics
• The confusion at present over definition may partly reflect the nature
of this vibrant and quickly evolving new field
History of Bioinformatics
• Bioinformatics - more clearly defined as
The discipline of quantitative analysis of information relating to
biological macromolecules with the aid of computers
• The development of bioinformatics as a field is the result of advances
in both molecular biology and computer science over the past 30–
40 years
• A succinct chronological summary of the landmark events that have
had major impacts on the development of bioinformatics is
presented here:
1. The earliest bioinformatics efforts can be traced back to the 1960s,
although the word bioinformatics did not exist then
• Probably the first major bioinformatics project was undertaken by
Margaret Dayhoff in 1965, who developed the first protein sequence
database, called the Atlas of Protein Sequence and Structure
2. In the early 1970s, the Brookhaven National Laboratory established
the Protein Data Bank for archiving three-dimensional (3D) protein
structures
• At its onset, the database stored fewer than a dozen protein structures,
compared to more than 30,000 structures at the time this material was written
3. The 1st sequence alignment algorithm was developed by Needleman
and Wunsch in 1970
• This was a fundamental step in the development of the field of
bioinformatics, which paved the way for the routine sequence
comparisons and database searching practiced by modern biologists
4. The 1st protein structure prediction algorithm was developed by Chou
and Fasman in 1974
• Though it is rather simple by today’s standard, it pioneered a series of
developments in protein structure prediction
5. The 1980s saw the establishment of GenBank and the development
of fast database searching algorithms such as:
FASTA by William Pearson
BLAST by Stephen Altschul and coworkers
6. The start of the human genome project in the late 1980s provided a
major boost for the development of bioinformatics
7. The development and the increasingly widespread use of the Internet
in the 1990s made instant access to, and exchange and distribution
of biological data possible
• These are only the major milestones in the establishment of this
new field
• The fundamental reason that bioinformatics gained fame as a
discipline was the advancement of genome studies that produced
unprecedented amounts of biological data
• The explosion of genomic sequence information generated a
sudden demand for efficient computational tools to manage and
analyze the data
• The development of these computational tools depended on
knowledge generated from a wide range of disciplines, including:
o Mathematics
o Statistics
o Computer science
o Information technology
o Molecular biology
• The merger of these disciplines created an information oriented
field in biology, which is now known as bioinformatics
Goals of Bioinformatics
• Ultimate goal - To better understand a living cell and how it
functions at the molecular level
• By analyzing raw molecular sequence and structural data,
bioinformatics research can generate new insights and provide a
“global” perspective of the cell
• The reason that the functions of a cell can be better understood by
analyzing sequence data is ultimately the “central dogma” of molecular
biology: DNA is transcribed into RNA, which is translated into protein
• Cellular functions are mainly performed by proteins whose
capabilities are ultimately determined by their sequences
• Therefore, solving functional problems using sequence and
sometimes structural approaches has proved to be a fruitful effort
Scope of Bioinformatics
• Bioinformatics consists of two subfields:
1. Development of computational tools and databases
2. Application of these tools and databases in generating
biological knowledge to better understand living systems
• These two subfields are complementary to each other
• Development includes writing software for:
o Sequence analysis
o Structural analysis
o Functional analysis
as well as the construction and curating of biological databases
• Three areas of genomic and molecular biological research:
1. Molecular sequence analysis
2. Molecular structural analysis
3. Molecular functional analysis
• The analyses of biological data often generate new problems and
challenges, which in turn spur the development of new and better
computational tools
• The areas of sequence analysis include:
o Sequence alignment
o Sequence database searching
o Motif and pattern discovery
o Gene and promoter finding
o Reconstruction of evolutionary relationships
o Genome assembly and comparison
• Structural analyses include:
o Protein and nucleic acid structure analysis
o Comparison
o Classification
o Prediction
• The functional analyses include:
o Gene expression profiling
o Protein–protein interaction prediction
o Protein subcellular localization prediction
o Metabolic pathway reconstruction
o Simulation
• These three aspects of bioinformatics analysis are not isolated but often
interact to produce integrated results
• For example:
o Protein structure prediction depends on sequence alignment
o Clustering of gene expression profiles requires the use of
phylogenetic tree construction methods derived in sequence analysis
o Sequence-based promoter prediction is related to functional analysis
of co-expressed genes
o Gene annotation involves a number of activities:
  - Distinction between coding and noncoding sequences
  - Identification of translated protein sequences
  - Determination of the gene’s evolutionary relationship with
    other known genes
o Prediction of a gene’s cellular functions employs tools from all three
groups of analyses
Figure: Overview of various subfields of bioinformatics. Biocomputing tool
development is at the foundation of all bioinformatics analysis. Applications
of the tools fall into three areas: sequence analysis, structure analysis, and
function analysis. There are intrinsic connections between the different areas
of analysis, represented by bars between boxes.
Applications of Bioinformatics
• Bioinformatics is essential for basic genomic and molecular biology
research
• It is also having a major impact on many areas of biotechnology and
biomedical sciences
• It has applications, for example, in
o Knowledge-based drug design
o Forensic DNA analysis
o Agricultural biotechnology
• Computational studies of protein–ligand interactions provide a
rational basis for the rapid identification of novel leads for synthetic
drugs
• Knowledge of 3D structures of proteins allows molecules to be
designed that are capable of binding to the receptor site of a target
protein with great affinity and specificity
• This informatics-based approach significantly reduces the time and
cost necessary to develop drugs with higher potency, fewer side
effects, and less toxicity than using the traditional trial-and-error
approach
• In forensics, results from molecular phylogenetic analysis have been
accepted as evidence in criminal courts
• It is worth mentioning that genomics and bioinformatics are now
poised to revolutionize our healthcare system by developing
personalized and customized medicines
• High speed genomic sequencing coupled with sophisticated
informatics technology will allow a doctor in a clinic to quickly
sequence a patient’s genome
• That will make it easy to detect potentially harmful mutations and to
engage in early diagnosis and effective treatment of diseases
• Bioinformatics tools are being used in agriculture as well
• Plant genome databases and gene expression profile analyses have
played an important role in the development of new crop varieties
• These new varieties have higher productivity and more resistance to
different diseases
Internet Basics
• Network of networks
• Composed of interconnected local and regional networks in over 100
countries
• Work on remote communications began in the early 1960s
• True origins of the Internet lie with a research project on networking
at the Advanced Research Projects Agency (ARPA) of the US
Department of Defense in 1969 named ARPANET
• Immediate goal of ARPANET - transmit information on defense-
related research between laboratories
• In 1981, BITNET was introduced - providing point-to-point
connections between universities to transfer electronic mails & files
• In 1982, ARPA introduced the Transmission Control Protocol (TCP)
and the Internet Protocol (IP)
• TCP/IP allowed different networks to be connected to and
communicate with one another, creating the system in place today
• When machines on a network get connected to one another - a clear
way is needed to specify a single computer
• So that messages and files actually find their planned recipient
• For this, all machines directly connected to the Internet have an IP
number
• IP addresses are unique, identifying one and only one machine
• IP address is made up of four numbers separated by periods
• For example, the IP address for the main file server at NCBI at NIH is
130.14.25.1
• The numbers themselves represent, from left to right:
130.14 - Domain (for NIH)
.25 - Subnet (for NLM at NIH)
.1 - Machine itself
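As a rough illustration of the breakdown above, the following Python sketch splits a dotted address into the parts the slide names. The address 130.14.25.1 is assembled from the three components listed above. The two-number / one-number / one-number split follows this slide's description of that one address and is an assumption here, not a general rule for every IP address. The function name split_address is hypothetical.

```python
import ipaddress

def split_address(dotted):
    """Split a dotted IPv4 address into the parts named on the slide."""
    # IPv4Address validates the string: four numbers, each in 0-255.
    addr = ipaddress.IPv4Address(dotted)
    octets = str(addr).split(".")
    return {
        "domain": ".".join(octets[:2]),  # 130.14 in the NIH example
        "subnet": octets[2],             # 25 (NLM at NIH)
        "machine": octets[3],            # 1 (the machine itself)
    }

parts = split_address("130.14.25.1")
print(parts)
```

Passing a malformed string (for example, one with five numbers) raises ipaddress.AddressValueError, which is one simple way to check that an address is well formed before using it.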
• IP addresses often have associated with them a fully qualified
domain name (FQDN)
• It is dynamically translated in the background by domain name
servers
• Going back to the NCBI example:
• Rather than use the numeric address 130.14.25.1 to access the NCBI
computer, a user could instead use the machine's FQDN and achieve the same result
• Reading from left to right
• The IP address goes from least to most specific
• Whereas the FQDN equivalent goes from most specific to least
• Top-Level Domain Names
.com Commercial site
.edu Educational site
.gov Government site
.mil Military site
.net Gateway or network host
.org Private (usually not-for-profit) organizations
• Examples of Top-Level Domain Names Used Outside the United
States:
.ca Canadian site
.ac.uk Academic site in the United Kingdom
.co.uk Commercial site in the United Kingdom
Connecting to Internet
• Traditionally, users attempting to connect to Internet away from
office had one and only one option—a modem
• Modem uses existing copper twisted-pair cables carrying telephone
signals to transmit data
• Data transfer rates using modems – 28.8 to 56 kilobits per second
(kbps)
• Problem with using conventional copper wire to transmit data lies
not in copper wire itself but in switches
• Switches are found along the way that route information to their
intended destinations
• These switches were designed for efficient and effective transfer of
voice data
• These were never intended to handle high-speed transmission of
data
• The first of the newer, faster alternatives to the modem was the
integrated services digital network, or ISDN
• Advent of ISDN was originally heralded as the way to bring Internet
into the home in a speed-efficient manner
• However, it required that special wiring be brought into the home
• It also required that users be within a fixed distance from a central
office (on the order of 20,000 feet or less)
• Cost of running this special, dedicated wiring, along with a per-
minute pricing structure, effectively placed ISDN out of reach for
most individuals
• Although ISDN is still available in many areas, this type of service is
quickly being replaced by more cost-effective alternatives
• Looking for alternatives that did not require new wiring, cable
television providers began to explore ways in which the coaxial cable
already running into a substantial number of households could also be
used to transmit data
• Cable companies are able to use bandwidth that is not being used to
transmit television signals (effectively, unused channels) to push data
into the home at very high speeds - up to 4 megabits per second
(Mbps)
• Actual computer is connected to this network through a cable
modem, which uses an Ethernet connection to the computer and a
coaxial cable to the wall
• Homes in a given area all share a single cable, in a wiring scheme
very similar to how individual computers are connected via Ethernet
in an office or laboratory setting
• Although this branching arrangement can serve to connect a large
number of locations, there is one major disadvantage: as more and more
homes connect through their cable modems, service effectively slows
down as more signals attempt to pass through any given node
• The local telephone companies were primary ISDN providers
• They quickly turned their attention to ways that existing,
conventional copper wire already in the home could be used to
transmit data at high speed
• Solution here is the digital subscriber line or DSL
• By using new, dedicated switches that are designed for rapid data
transfer, DSL providers can avoid old voice switches that slowed
down transfer speeds
• Depending on user’s distance from central office and whether a
particular neighborhood has been wired for DSL service, speeds are
on the order of 0.8 to 7.1 Mbps
• Data transfers do not interfere with voice signals, and users can use
telephone while connected to Internet
• Signals are “split” by a special modem that passes data signals to
computer and a microfilter that passes voice signals to handset
• There is a special type of DSL called asymmetric DSL or ADSL
• This is the variety of DSL service that is becoming more and more
dominant
• Most home users download much more information than they send
out
• Therefore, systems are engineered to provide super-fast transmission
in ‘‘in’’ direction, with transmissions in ‘‘out’’ direction being 5–10
times slower
• Using this approach maximizes the amount of bandwidth that can be
used without necessitating new wiring
• One of the advantages of ADSL over cable is that ADSL subscribers
effectively have a direct line to central office
• Meaning that they do not have to compete with their neighbors for
bandwidth
• This, of course, comes at a price; at the time of this writing, ADSL
connectivity options were on the order of twice as expensive as cable
Internet, but this will vary from region to region
• Some of the newer technologies involve wireless connections to
Internet
• These include using one’s own cell phone or a special cell phone
service to upload and download information
• These cellular providers can provide speeds on the order of 28.8–128
kbps
World Wide Web (www)
• Development of a number of distributed document delivery systems
(DDDS)
• Interactive client-server applications that allowed information to be
viewed without having to perform a download
• 1st generation of DDDS development led to programs like Gopher
• It allowed plain text to be viewed directly through a client-server
application
• From this evolved the most widely known and widely used DDDS,
namely, World Wide Web
• The Web is an outgrowth of research performed at the European
Organization for Nuclear Research (CERN) in 1989
• That was aimed at sharing research data between several locations
• That work led to a medium through which text, images, sounds, and
videos could be delivered to users on demand, anywhere in the
world
• Navigation on Web does not require advance knowledge of location
of information being required
• Instead, users can navigate by clicking on specific text, buttons, or
pictures
• These clickable items are collectively known as hyperlinks
• Once one of these hyperlinks is clicked, user is taken to another Web
location, which could be at same site or halfway around the world
• Each document displayed on Web is called a Web page, and all of
related Web pages on a particular server are collectively called a Web
site
• Navigation strictly through the use of hyperlinks has been nicknamed
Web surfing
• Each Web page has a standard-form address, known as a uniform resource
locator, or URL
• It takes the general form protocol://[Link]
• Where protocol specifies the type of site and [Link]
specifies the location
• The http used for protocol in World Wide Web URLs stands for
hypertext transfer protocol
• http is the method used in transferring Web files from host computer
to client
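To make the protocol/location split concrete, here is a small sketch using Python's standard urllib.parse module. The URL is a made-up placeholder chosen only for illustration and does not come from the source.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to show the parts; not from the source text.
url = "http://www.example.org/index.html"

pieces = urlsplit(url)
print(pieces.scheme)  # the protocol part of the URL
print(pieces.netloc)  # the location (host) part
print(pieces.path)    # the document requested from that host
```

The same call works for other protocols (for example, ftp), since urlsplit only separates the components and does not contact the host.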
• It is also possible to directly search the Web by using search engines
• A search engine is simply a specialized program that can perform
full-text or keyword searches on databases that catalog Web content
• The result of a search is a hyperlinked list of Web sites fitting search
criteria from which user can visit any or all of the found sites
• However, search engines use slightly different methods in compiling
their databases
• One variation is the attempt to capture most or all of the text of
every Web page that search engine is able to find and catalog - Web
crawling
• Another technique is to catalog only the title of each Web page
rather than its entire text
• A third is to consider words that must appear next to each other or
only relatively close to one another
• Also keep in mind that, depending on indexing scheme that search
engine is using, the found pages may actually no longer exist, leading
user to dreaded 404 Not Found error
• One way of finding interesting & relevant Web sites is to consult
virtual libraries (curated lists of Web resources arranged by subject)
• Virtual libraries of special interest to biologists include the WWW Virtual
Library, maintained by Keith Robison at Harvard,
and the EBI BioCatalog, based at the European Bioinformatics Institute
Browsers
• Browsers are the client-server applications used to look at Web pages
• They connect to a remote site, download the requested information from
that site, display it on the user’s monitor, and then disconnect
from the remote host
• Information retrieved from remote host is in a platform-independent
format named hypertext markup language (HTML)
• HTML code is strictly text-based, and any associated graphics or
sounds for that document exist as separate files in a common format
• For example, images may be stored and transferred in GIF format, a
proprietary format developed by CompuServe for quick and efficient
transfer of graphics
• Other formats, such as JPEG and BMP, may also be used
• Because of this, a browser can display any Web page on any type of
computer, whether it be a Macintosh, IBM compatible, or UNIX
machine
• Text is usually displayed first, with remaining elements being placed
on page as they are downloaded
• With minor exception, a given Web page will look same when same
browser is used on any of above platforms
• Two major players in area of browser software are Netscape, with
their Communicator product, and Microsoft, with Internet Explorer
• As with many other areas where multiple software products are
available, choice between Netscape and Internet Explorer comes
down to one of personal preference
Important Glossary of Bioinformatics
Accession number
An identifier supplied by the curators of the major biological databases
upon submission of a novel entry that uniquely identifies that sequence
(or other) entry.
Algorithm
A series of steps defining a procedure or formula for solving a problem
that can be coded into a programming language and executed.
Bioinformatics algorithms typically are used to process, store, analyze,
visualize and make predictions from biological data.
Alignment
The result of a comparison of two or more gene or protein sequences in
order to determine their degree of base or amino acid similarity.
Sequence alignments are used to determine the similarity, homology,
function or other degree of relatedness between two or more genes or
gene products.
Alignment score
It is calculated by totaling the scores for each matched pair of residues
at each position in the alignment, plus unmatched residues are given
the gap open penalty or the gap extension penalty, if appropriate in the
alignment.
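A minimal sketch of the scoring just described, assuming an already aligned pair of rows that use "-" for gaps. The numeric values (match +1, mismatch -1, gap open -5, gap extension -1) are illustrative assumptions, not values from the source, and the convention used here is that the first position of a gap pays the gap open penalty while each further position pays the gap extension penalty. Other conventions exist.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score two equal-length aligned rows that use '-' for gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = None  # which row the current run of gaps is in, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            # First gap position pays gap_open; later ones pay gap_extend.
            score += gap_extend if in_gap == gap_row else gap_open
            in_gap = gap_row
        else:
            score += match if a == b else mismatch
            in_gap = None
    return score

print(alignment_score("GATTACA", "GA--ACA"))
```

In this example five matched pairs contribute +5 and the two-position gap contributes -5 and -1, so the total is -1.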
Annotation
DNA annotation or genome annotation is the process of identifying the
locations of genes and all of the coding regions in a genome and
determining what those genes do.
Base pair
A pair of nitrogenous bases (a purine and a pyrimidine), held together
by hydrogen bonds, that form the core of DNA and RNA, i.e., the A:T, G:C,
and A:U interactions.
Binding site
A place on cellular DNA to which a protein (such as a transcription
factor) can bind. Typically, binding sites might be found in vicinity of
genes, and would be involved in activating transcription of that gene
(promoter elements), in enhancing transcription of that gene (enhancer
elements), or in reducing transcription of that gene (silencers).
Bioinformatics
The field of endeavor that relates to the collection, organization and
analysis of large amounts of biological data using networks of
computers and databases.
BLAST
BLAST stands for Basic Local Alignment Search Tool. The emphasis of
this tool is to find regions of sequence similarity, which will yield
functional and evolutionary clues about the structure and function of
your novel sequence.
Blastn program
Blastn will search a DNA sequence against a DNA databank.
Blastp program
Blastp will compare a protein sequence against the protein database of
your choice.
Blastx program
Blastx will translate a nucleic acid sequence in all six reading frames and
compare all these against the protein database of your choice.
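The six-frame translation step that Blastx performs before searching can be sketched as follows. This is an illustration of the idea only, not BLAST's actual implementation. It assumes the standard genetic code and plain A/C/G/T input; the function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Translate three forward and three reverse-complement frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```

Stop codons are written as "*". Each of the six protein strings would then be compared against the protein database.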
CDS
The coding sequence or the portion of a nucleotide sequence that
makes up the triplet codons that actually code for amino acids.
Clade
Group of taxa on a phylogenetic tree that are descended from a single
common ancestor.
Coding sequence
The portion of a gene or an mRNA which actually codes for a protein. Introns
are not coding sequences; nor are the 5' or 3' untranslated regions (UTRs) (or
the flanking regions, for that matter - they are not even transcribed into
mRNA). The coding sequence in a cDNA or mature mRNA includes everything
from the AUG (or ATG) initiation codon through to the stop codon, inclusive.
Codon
In an mRNA, a codon is a sequence of three nucleotides which codes for the
incorporation of a specific amino acid into the growing protein.
Comparative genomics
Subarea of genomics that focuses on comparison of whole genomes from
different organisms.
Consensus sequence
It is the calculated order of most frequent residues, either nucleotide or amino
acid, found at each position in a sequence alignment.
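As a small illustration of this definition, the sketch below takes the most frequent residue in each column of an alignment. The input sequences are invented for the example, and ties are resolved in favor of the residue seen first in the column, which is an arbitrary choice made here.

```python
from collections import Counter

def consensus(aligned):
    """Most frequent residue in each column of equal-length aligned rows."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("need at least one row, all of the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        # most_common(1) returns the top (residue, count) pair.
        result.append(counts.most_common(1)[0][0])
    return "".join(result)

print(consensus(["ACGT", "ACGA", "TCGA"]))
```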
Conserved sequence
A base sequence in a DNA molecule (or an amino acid sequence in a protein)
that has remained essentially unchanged throughout evolution
Database
Computerized collection used for storage and organization of data in such a
way that information can be retrieved easily via a variety of search criteria
Domain
A region of special biological interest within a single protein sequence
FASTA sequence format
o This format contains a one line header followed by lines of sequence data
o Sequences in FASTA-formatted files are led by a line starting with a ">"
symbol
o 1st word on this line is name of sequence
o Rest of the line is a description of sequence
o Remaining lines contain sequence itself
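The layout described above can be read with a few lines of Python. This is a minimal sketch assuming well-formed input; the function name parse_fasta and the two example records are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name = description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence lines are joined together
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records

example = """>seq1 hypothetical example sequence
ATGGCC
TAA
>seq2
GATTACA
"""
for record in parse_fasta(example):
    print(record)
```

The first word after ">" becomes the name, the rest of the header line becomes the description, and the remaining lines are concatenated into the sequence.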
Functional genomics
Study of gene functions at whole-genome level using high throughput
approaches
Gap
Gap is a space that is introduced in a sequence during multiple sequence
alignment to increase alignment score
Gap extension penalty
Gap extension penalty is added to standard gap open penalty for each base or
residue in gap
Gene mapping
Determination of relative positions of genes on a DNA molecule (chromosome
or plasmid) and of the distance, in linkage units or physical units, between
them
Genomics
Study of genomes characterized by simultaneous analysis of all genes in a
genome
Genetic code
Mapping of all possible codons into 20 amino acids including start & stop codons
Genome
Total DNA contained in each cell of an organism
Global alignment
Sequence alignment strategy that matches up two or more sequences over their
entire lengths
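One way to picture a global alignment is the dynamic-programming approach associated with Needleman and Wunsch, mentioned in the history section. The sketch below is a simplified version that computes only the best score, not the alignment itself. It uses a single linear gap penalty rather than separate open and extension penalties, and the scoring values are assumptions made for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]

print(global_alignment_score("GATTACA", "GATTACA"))
```

Because every residue of both sequences must be accounted for, the score is read from the bottom-right cell of the table, in contrast to a local alignment, which looks for the best-scoring segment anywhere in the table.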
Homology
Strict – two or more biological species, systems or molecules that share a
common evolutionary ancestor
General – Two or more gene or protein sequences that share a significant
degree of similarity
Intron
o Introns are portions of genomic DNA which are transcribed (and thus present
in primary transcript) but which are later spliced out
o They are not present in mature mRNA
Ligand
Any small molecule that binds to a protein or receptor; the related partner of
many cellular proteins, enzymes, and receptors
Local alignment
o An alignment that searches for segments of two sequences that match well
o There is no attempt to force entire sequences into an alignment, just those
parts that appear to have good similarity
MEDLINE
MEDLINE is a bibliographic database covering fields of medicine, nursing,
dentistry, veterinary medicine, health care system, and pre-clinical sciences
Motif
A conserved element of a protein sequence alignment that usually correlates
with a particular function
Pairwise Alignment
o Aligns your query sequence and database matches in pairs
o Matches are connected with a "|" symbol
o Mismatches are opposed with a space
o Gaps are introduced with a "–" symbol
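The display convention above can be reproduced with a short sketch. It assumes the two rows are already aligned and uses the ASCII hyphen "-" as the gap symbol; the function names are hypothetical.

```python
def match_line(row_a, row_b):
    """Build the middle line: '|' under matches, a space elsewhere."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return "".join("|" if a == b and a != "-" else " "
                   for a, b in zip(row_a, row_b))

def show_pairwise(row_a, row_b):
    """Return the three-line pairwise display as one string."""
    return "\n".join([row_a, match_line(row_a, row_b), row_b])

print(show_pairwise("GATTACA", "GA--ACA"))
```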
PDB
Protein Data Bank, originally established at Brookhaven National Laboratory;
the single worldwide source for processing and distribution of 3D biological
macromolecular structure data
Phylogeny
Study of evolutionary relationships among organisms using tree like diagrams
Phylogram
Phylogram is a branching diagram (tree) assumed to be an estimate of a
phylogeny, branch lengths are proportional to amount of inferred evolutionary
change
Proteome
Complete set of proteins expressed in a cell
Proteomics
o Study of proteome
o Typically, cataloging of all expressed proteins in a particular cell or tissue
type, obtained by identifying proteins from cell extracts using a combination
of 2D gel electrophoresis and mass spectrometry
Query
Specific value used to retrieve a particular record from a database
Secondary structure
Organization of peptide backbone of a protein that occurs as a result of
hydrogen bonds e.g. alpha helix, beta pleated sheet
Taxon
Each species or sequence represented at tip of each branch of a phylogenetic
tree
Transcriptome
Complete set of mRNA molecules produced by a cell under a given condition
What is a database?
• A database is a computerized library used to store and organize data in such
a way that information can be retrieved easily via a variety of search criteria
• Databases are composed of computer hardware and software for data
management
• Chief objective of development of a database is to organize data in a set of
structured records to enable easy retrieval of information
• Each record, also called an entry, should contain a number of fields that
hold actual data items, for example, fields for names, phone numbers,
addresses, dates
• To retrieve a particular record from database, a user can specify a particular
piece of information, called value, to be found in a particular field and
expect computer to retrieve whole data record
• This process is called making a query
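The record, field, value, and query vocabulary above maps onto a very small sketch. The records and the helper name query are invented for illustration; a real database would use an index rather than scanning every record.

```python
# Each record (entry) is a set of named fields holding data items.
records = [
    {"name": "Ada", "phone": "555-0100", "city": "Lagos"},
    {"name": "Ben", "phone": "555-0101", "city": "Lima"},
    {"name": "Chi", "phone": "555-0102", "city": "Lagos"},
]

def query(records, field, value):
    """Return every whole record whose given field holds the given value."""
    return [record for record in records if record.get(field) == value]

for hit in query(records, "city", "Lagos"):
    print(hit)
```

The user supplies only one field and one value, and gets back the complete records that match.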
• Although data retrieval is the main purpose of all databases, biological
databases often have a higher-level requirement, known as knowledge
discovery, which refers to the identification of connections between pieces
of information that were not known when the information was first entered
• For example, databases containing raw sequence information can perform
extra computational tasks to identify sequence homology or conserved
motifs
• These features facilitate discovery of new biological insights from raw data
Biological databases
• Based on their contents, biological databases can be roughly divided into 3
categories:
1. Primary databases
2. Secondary databases
3. Specialized databases
• Primary databases contain original biological data
• They are archives (collection) of raw sequence or structural data submitted
by scientific community
• GenBank and Protein Data Bank (PDB) are examples of primary databases
• Secondary databases contain computationally processed or manually
curated information, based on original information from primary databases
• Translated protein sequence databases containing functional annotation
belong to this category
• Examples are SWISS-PROT and Protein Information Resource (PIR)
• Specialized databases are those that cater to a particular research interest
• For example, Flybase, HIV sequence database, and Ribosomal Database
Project are databases that specialize in a particular organism or a particular
type of data
• A list of some frequently used databases is provided below:
Major Biological Databases Available Via the World Wide Web
Database         Brief Summary of Content
DDBJ             Primary nucleotide sequence database in Japan
EMBL             Primary nucleotide sequence database in Europe
Entrez           NCBI portal for a variety of biological databases
ExPASy           Proteomics database
FSSP             Protein structure classification based on structural alignment
GenBank          Primary nucleotide sequence database in NCBI
HIV databases    HIV sequence data and related immunologic information
OMIM             Genetic information of human diseases
PIR              Annotated protein sequences
PubMed           Biomedical literature information
SWISS-PROT       Curated protein sequence database
Primary Databases
• There are 3 major public sequence databases that store raw nucleic acid
sequence data produced and submitted by researchers worldwide:
1. GenBank
2. European Molecular Biology Laboratory (EMBL)
3. DNA Data Bank of Japan (DDBJ)
• These are all freely available on Internet
• Most of the data in databases are contributed directly by authors with a
minimal level of annotation
• A small number of sequences, especially those published in 1980s, were
entered manually from published literature by database management staff
• Presently, sequence submission to GenBank, EMBL or DDBJ is a
precondition for publication in most scientific journals to ensure
fundamental molecular data to be made freely available
• These 3 public databases closely collaborate and exchange new data daily
• They together constitute International Nucleotide Sequence Database
Collaboration (INSDC)
• This means that by connecting to any one of three databases, one should
have access to same nucleotide sequence data
• Although 3 databases all contain same sets of raw data, each of individual
databases has a slightly different kind of format to represent data
• Fortunately, for 3D structures of biological macromolecules, there is only
one centralized database, PDB
• This database archives atomic coordinates of macromolecules (both
proteins and nucleic acids) determined by X-ray crystallography and NMR
spectroscopy
• It uses a flat file format to represent protein name, authors, experimental
details, secondary structure, cofactors, and atomic coordinates
• Web interface of PDB also provides viewing tools for simple image
manipulation
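As a rough sketch of what reading such a flat file involves, the Python below pulls the coordinates out of a single ATOM line using the fixed column positions of the PDB format. The coordinate and temperature-factor values are invented for illustration, and `parse_atom` is a hypothetical helper that reads only a few of the fields; it is not part of any PDB toolkit.

```python
# One ATOM line assembled field by field so the fixed columns are explicit.
# The numeric values are invented for illustration.
atom_line = (
    "ATOM  "      # columns 1-6:   record name
    "    1"       # columns 7-11:  atom serial number
    " "           # column 12:     blank
    " N  "        # columns 13-16: atom name
    " "           # column 17:     alternate location indicator
    "MET"         # columns 18-20: residue name
    " "           # column 21:     blank
    "A"           # column 22:     chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and blanks
    "  27.340"    # columns 31-38: x coordinate (angstroms)
    "  24.430"    # columns 39-46: y coordinate
    "   2.614"    # columns 47-54: z coordinate
    "  1.00"      # columns 55-60: occupancy
    "  9.67"      # columns 61-66: temperature factor
    "          "  # columns 67-76: blank
    " N"          # columns 77-78: element symbol
)


def parse_atom(line):
    """Slice a PDB ATOM line by column position (0-based Python slices)."""
    return {
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


atom = parse_atom(atom_line)
print(atom["residue"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in a fixed-width file are not guaranteed to be separated by spaces.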
Secondary Databases
• Sequence annotation information in primary database is often minimal
• To turn raw sequence information into more sophisticated biological
knowledge, much post processing of sequence information is needed
• This creates a need for secondary databases, which contain computationally
processed sequence information derived from primary databases
• Amount of computational processing work varies greatly among secondary
databases
• Some are simple archives of translated sequence data from identified open
reading frames in DNA, whereas others provide additional annotation and
information related to higher levels of information regarding structure and
functions
• A prominent example of secondary databases is SWISS-PROT, which
provides detailed sequence annotation that includes structure, function,
and protein family assignment
• Sequence data are mainly derived from TrEMBL, a database of Translated
nucleic acid sequences stored in EMBL database
• Annotation of each entry is carefully curated by human experts and thus is
of good quality
• Protein annotation includes function, domain structure, catalytic sites,
cofactor binding, posttranslational modification, metabolic pathway
information, disease association, and similarity with other sequences
• Much of this information is obtained from scientific literature and entered
by database curators
• Annotation provides significant added value to each original sequence
record
• Data record also provides cross referencing links to other online resources
of interest
• An effort to combine SWISS-PROT, TrEMBL, and PIR (annotated protein
sequences) led to creation of UniProt database, which has larger coverage
than any one of the 3 databases while maintaining original SWISS-PROT
features of low redundancy, cross-references, and a high quality of annotation
• There are also secondary databases that relate to protein family
classification according to functions or structures
• Pfam and Blocks databases contain aligned protein sequence information as
well as derived motifs and patterns, which can be used for classification of
protein families and inference of protein functions
• DALI database is a database of pairwise comparisons of protein
three-dimensional structures that is vital for protein structure classification
and threading analysis to identify distant evolutionary relationships among
proteins
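The simplest kind of secondary processing mentioned in this section, translating an identified open reading frame into a protein sequence, can be sketched as follows. The sketch assumes the standard genetic code and an ORF that is already in frame; the function name `translate_orf` is illustrative, and real pipelines also handle alternative genetic codes and ambiguous bases.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG order:
# the first base changes slowest and the third base changes fastest.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_orf(dna):
    """Translate an in-frame DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons with unexpected characters (e.g. N) are reported as X.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_orf("ATGGCCAAATAA"))  # prints MAK
```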
Specialized Databases
• Specialized databases normally serve a specific research community or
focus on a particular organism
• Content of these databases may be sequences or other types of information
• Sequences in these databases may overlap with a primary database, but
may also have new data submitted directly by authors
• Because they are often curated by experts in field, they may have unique
organizations and additional annotations associated with sequences
• Many genome databases that are taxonomic specific fall within this
category
• Examples include HIV database and TAIR (The Arabidopsis Information Resource)
• In addition, there are also specialized databases that contain original data
derived from functional analysis
• For example, GenBank EST database and Microarray Gene Expression
Database at European Bioinformatics Institute (EBI) are some of gene
expression databases available
Pitfalls of Biological Databases
• One of the problems associated with biological databases is over-reliance
on sequence information and related annotations, without understanding
reliability of information
• What is often ignored is the fact that there are many errors in sequence
databases
• There are also high levels of redundancy in primary sequence databases
• Annotations of genes can also occasionally be false or incomplete
• All these types of errors can be passed on to other databases, causing
propagation of errors
• Most errors in nucleotide sequences are caused by sequencing errors
• Some of these errors cause frame shifts that make whole gene
identification difficult or protein translation impossible
• Sometimes, gene sequences are contaminated with sequences from cloning
vectors
• Generally speaking, errors are more common for sequences produced
before 1990s; sequence quality has been greatly improved since
• Therefore, exceptional care should be taken when dealing with more dated
sequences
• Redundancy is another major problem affecting primary databases
• There is tremendous duplication of information in databases, for various
reasons
• Causes of redundancy include:
i. Repeated submission of identical or overlapping sequences by same
or different authors
ii. Revision of annotations
iii. Bulk deposition of expressed sequence tag (EST) data
iv. Poor database management that fails to detect redundancy
• This makes some primary databases excessively large and heavy for
information retrieval
• Steps have been taken to reduce redundancy:
i. National Center for Biotechnology Information (NCBI) has now created
a non-redundant database, called RefSeq, in which identical
sequences from same organism and associated sequence fragments
are merged into a single entry
ii. Protein sequences derived from same DNA sequences are clearly
linked as related entries
iii. Sequence variants from same organism with very minor differences,
which may well be caused by sequencing errors, are treated as
distinctly related entries
• Because it is carefully curated, RefSeq can be considered a secondary
database
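The idea of merging identical sequences from the same organism into a single entry can be sketched as below. This is not how RefSeq is actually built; the accession labels, organisms, and sequences are made up, and `merge_identical` is a hypothetical helper that handles only exact duplicates, not overlapping fragments or near-identical variants.

```python
# Hypothetical submissions: (accession label, organism, sequence).
entries = [
    ("SUB001", "Homo sapiens", "ATGGCCAAA"),
    ("SUB002", "Homo sapiens", "atggccaaa"),   # same sequence, resubmitted
    ("SUB003", "Mus musculus", "ATGGCCAAA"),   # same sequence, other organism
]


def merge_identical(submissions):
    """Group accessions that share both organism and sequence (case-insensitive)."""
    merged = {}
    for accession, organism, sequence in submissions:
        key = (organism, sequence.upper())
        merged.setdefault(key, []).append(accession)
    return merged


merged = merge_identical(entries)
for (organism, sequence), accessions in merged.items():
    print(organism, sequence, accessions)
```

Keeping the original accessions attached to the merged entry preserves the link back to each submission, which is one way to reduce redundancy without discarding provenance.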
• Another common problem is erroneous annotations
• Often, same gene sequence is found under different names resulting in
multiple entries and confusion about data
• Or conversely, unrelated genes bearing same name are found in databases
• To solve problem of naming genes, re-annotation of genes and proteins
using a set of common, controlled vocabulary to describe a gene or protein
is necessary
• Goal is to provide a consistent and clear naming system for all genes and
proteins
• A prominent example of such systems is Gene Ontology (GO)
Information retrieval from biological databases
• A major goal in developing databases is to provide efficient and user
friendly access to the data stored
• There are a number of retrieval systems for biological data
• The most popular retrieval system for biological databases is Entrez that
provides access to multiple databases for retrieval of integrated search
results
• To perform complex queries in a database often requires use of Boolean
operators
• This is to join a series of keywords using logical terms such as AND, OR, and
NOT to indicate relationships between keywords used in a search
i. AND means that search result must contain both words
ii. OR means to search for results containing either word or both
iii. NOT excludes results containing the word that follows it
• In addition, one can use parentheses ( ) to define a concept if multiple
words and relationships are involved, so that the computer knows which part
of the search to execute first
• Items contained within parentheses are executed first
• Quotes can be used to specify a phrase
• Most search engines of public biological databases use some form of this
Boolean logic
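The Boolean operators described above map directly onto set operations, which the following Python sketch illustrates. The three "documents" and the `matching` helper are made up for illustration; real search engines use indexes and far more elaborate matching.

```python
# Toy collection: record id -> text. Contents are invented for illustration.
docs = {
    1: "human insulin gene",
    2: "mouse insulin receptor",
    3: "human growth hormone",
}


def matching(collection, word):
    """Return the ids of all records whose text contains `word`."""
    return {i for i, text in collection.items() if word in text.lower().split()}


human = matching(docs, "human")      # {1, 3}
mouse = matching(docs, "mouse")      # {2}
insulin = matching(docs, "insulin")  # {1, 2}

and_hits = human & insulin           # human AND insulin  -> {1}
or_hits = human | insulin            # human OR insulin   -> {1, 2, 3}
not_hits = insulin - human           # insulin NOT human  -> {2}

# Parentheses are evaluated first: (human OR mouse) AND insulin -> {1, 2}
nested_hits = (human | mouse) & insulin

print(and_hits, or_hits, not_hits, nested_hits)
```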
Entrez
• Entrez is NCBI’s primary text search and retrieval system that integrates
PubMed database of biomedical literature with 38 other literature and
molecular databases, including DNA and protein sequence, structure, gene,
genome, genetic variation, and gene expression
• NCBI developed and maintains Entrez as a gateway that allows text-based
searches for a wide variety of data, including annotated genetic sequence
information, structural information, citations and abstracts, full-length
papers, and taxonomic data
• Key feature of Entrez is its ability to integrate information, which comes
from cross-referencing between NCBI databases based on preexisting and
logical relationships between individual entries
• This is highly convenient: users do not have to visit multiple databases
located in disparate places
• For example, in a nucleotide sequence page, one may find cross-referencing
links to translated protein sequence, genome mapping data, or to related
PubMed literature information, and to protein structures if available
• Effective use of Entrez requires an understanding of main features of search
engine
• There are several options common to all NCBI databases that help to
narrow the search
• One option is “Limits,” which helps to restrict search to a subset of a
particular database
• It can also be set to restrict a search to a particular field of a database
(e.g., author or publication date) or a particular type of data (e.g.,
chloroplast DNA/RNA)
• Another option is “Preview/Index,” which connects different searches with
Boolean operators and uses a string of logically connected keywords to
perform a new search
• Search can also be limited to a particular search field (e.g., gene name or
accession number)
• The “History” option provides a record of previous searches so that user
can review, revise, or combine results of earlier searches
• There is also a “Clipboard” that stores search results for later viewing for a
limited time
• To store information in Clipboard, “Send to Clipboard” function should be
used
• One of the databases accessible from Entrez is a biomedical literature
database known as PubMed, which contains abstracts and in some cases
full text articles from nearly 4,000 journals
• An important feature of PubMed is retrieval of information based on
medical subject headings (MeSH) terms
• MeSH system consists of a collection of more than 20,000 controlled and
standardized vocabulary terms used for indexing articles
• In other words, it is a thesaurus that helps convert search keywords into
standardized terms to describe a concept
• By doing so, it allows “smart” searches in which a group of accepted
synonyms are employed so that user not only gets exact matches, but also
related matches on same topic that otherwise might have been missed
• Another way to broaden the retrieval is by using the “Related Articles”
option
• PubMed uses a word weight algorithm to identify related articles with
similar words in titles, abstracts, and MeSH
• By using this feature, articles on same topic that were missed in original
search can be retrieved
• Another unique database accessible from Entrez is Online Mendelian
Inheritance in Man (OMIM), which is a non-sequence-based database of
human disease genes and human genetic disorders
• Each entry in OMIM contains summary information about a particular
disease as well as genes related to the disease
• Text contains numerous hyperlinks to literature citations, primary sequence
records, as well as chromosome loci of disease genes
• OMIM can serve as an excellent starting point to study genes related to a
disease
• NCBI also maintains a taxonomy database that contains names and
taxonomic positions of over 100,000 organisms with at least one nucleotide
or protein sequence represented in GenBank database
• Taxonomy database has a hierarchical classification scheme
• Root level is Archaea, Eubacteria, and Eukaryota
• Database allows taxonomic tree for a particular organism to be displayed
• Tree is based on molecular phylogenetic data, namely, small ribosomal RNA
data
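A hierarchical classification scheme of this kind can be represented as a table of parent pointers, from which the lineage of any organism is recovered by walking up to the root. The sketch below is heavily abridged: most taxonomic ranks are omitted, and the `PARENT` table and `lineage` function are illustrative, not the NCBI data model.

```python
# Child -> parent links. Heavily abridged: most intermediate ranks are omitted.
PARENT = {
    "Homo sapiens": "Primates",
    "Primates": "Mammalia",
    "Mammalia": "Eukaryota",
    "Escherichia coli": "Proteobacteria",
    "Proteobacteria": "Bacteria",
    "Eukaryota": "root",
    "Bacteria": "root",
    "Archaea": "root",
}


def lineage(name):
    """Return the path from the root down to `name`."""
    path = [name]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


print(" > ".join(lineage("Homo sapiens")))
print(" > ".join(lineage("Escherichia coli")))
```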
The Entrez Databases
• Entrez system comprises 41 molecular and literature databases
• New databases are added as biomedical science advances and new kinds of
data become available
• Some examples of Entrez databases are given below:
Gene
• Gene is a searchable database of genes, focusing on genomes that have
been completely sequenced
• Information in Gene records includes nomenclature, chromosomal
localization, gene products and their attributes (e.g., protein interactions),
and phenotypes
Genome
• Genome database contains sequence and map data from whole genomes of
over 1000 species or strains
• Genomes represent both completely sequenced genomes and those with
sequencing in-progress
• All 3 main domains of life (bacteria, archaea, eukaryota) are represented
Nucleotide
• Nucleotide database contains all sequence data from GenBank, EMBL, and
DDBJ, members of INSDC
• Nucleotide also includes NCBI-curated Reference Sequences (RefSeqs) and
nucleotide sequences extracted from structure records from PDB
OMIM
• OMIM (Online Mendelian Inheritance in Man) database allows searches of
OMIM articles about human genes, genetic disorders, and other inherited
traits
• OMIM articles provide links to associated literature references, sequence
records, maps, and related databases
• OMIM records are hosted and served by independent OMIM site
([Link])
• NCBI service provides searching capabilities
Protein
• Protein database contains amino acid sequences created from translations
of coding regions provided on nucleotide records in GenBank, EMBL, and
DDBJ, members of INSDC
• Protein records are also imported from UniProtKB/Swiss-Prot and Protein
Research Foundation (PRF)
• Protein sequences are also extracted from structure records from PDB
PubMed
• PubMed is a database of citations and abstracts for biomedical literature
from MEDLINE and additional life science journals
• Links are provided when full text versions of articles are available through
PubMed Central or other websites
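Entrez databases such as PubMed can also be searched programmatically. The slides do not cover this, so the following is only a sketch under assumptions: it uses NCBI's E-utilities `esearch` endpoint and the `[Title]` field tag, and it merely builds the request URL without sending it. The helper name `esearch_url` and the search term are illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address of the E-utilities search service (assumed; check NCBI docs).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an esearch request for an Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Boolean operators and field restriction combined in one query string.
term = "insulin[Title] AND diabetes[Title] NOT mouse[Title]"
url = esearch_url("pubmed", term)

# Decode the URL again to confirm the query survived encoding unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["db"][0], "|", decoded["term"][0])
```

Sending the request (for example with `urllib.request.urlopen`) returns a list of matching record identifiers; NCBI's usage guidelines on request rates should be consulted before doing so in bulk.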