Prepared By :- Ankur Gajendra Meshram
Genomics And Proteomics
Course No. :- BT-3515
Credits :- 03 (3+0)
Course Title :- Genomics And Proteomics
Course Teacher :-
Student’s Name :-
College Name :- M. G. College of Agri. Biotechnology, Pokharni, Nanded
COURSE SYLLABUS OUTLINE
Lec. No. Topics Weightage (%)
Unit : 1
1. Introduction to Genome and Genomics, terminology involved and history 02
2. Central dogma; branches of genomics, i.e. Functional Genomics, Structural Genomics and Comparative Genomics 03
3. Techniques in genome analysis - DNA microarray, Nanopore technology, high-throughput sequencing, Southern hybridization, expressed sequence tags, DNA sequencing 05
4. cDNA library construction and development of BAC and YAC libraries, PCR amplification 03
5. Gene sequencing; principles and types of sequencing (Next-generation sequencing in detail) 04
6. Genome mapping and different methods of mapping; physical mapping of genomes and the techniques involved in physical mapping 04
7. Gene sequence analysis and annotation using annotation models 04
8. Case study - Genome Projects : E. coli 03
9. Case studies - Genome Projects : Arabidopsis, Bovine 03
10. Case study - Genome Projects : Human Genome Project 04
11. Brief overview of Comparative Genomics, the techniques involved in it, and synteny 03
12. Orthologous and paralogous sequences 02
13. Gene order, phylogenetic footprinting 02
Unit : 2
14. Introduction to Functional Genomics 02
15. Analysis of gene expression and the techniques involved 03
16. Principles and procedures of ESTs and cDNA-AFLP 04
17. Principle and types of microarray and its applications in functional genomics 04
18. Functional analysis of the genome by differential display techniques like SAGE, RNA-seq and real-time PCR 05
19. Principle, procedure and applications of SAGE and RNA-seq 03
20. Principle, procedure and applications of real-time PCR 02
Unit : 3
21. Introduction to proteome and proteomics - terminology and history; protein synthesis, i.e. translation 02
22. Protein isolation techniques, i.e. chromatographic techniques : HPLC, GC 03
23. Protein purification techniques from crude extract 02
24. Protein separation by Native PAGE, SDS-PAGE and 2D PAGE - principles and procedure 04
25. Protein staining techniques, i.e. silver staining, Coomassie blue staining, SYPRO Ruby staining 02
26. Techniques of protein digestion, i.e. Edman degradation, and peptide purification 02
27. Protein analysis by mass spectrometry : MALDI-TOF, LC-MS, electrospray ionization (ESI) 04
28. Principles and procedure of MALDI-TOF 03
29. Peptide fingerprint analysis
30. Mass spectrometric identification of proteins - mapping 03
31. Protein identification : peptide mass fingerprinting, tandem mass spectrometry (MS/MS) 04
32. Types of post-translational modifications : methylation of cytosine residues in DNA; modifications of histones and CpG methylation 03
33. Applications of genomics and proteomics in crop development 04
Total : 100
UNIT : 1
1.1 INTRODUCTION TO GENOMICS
From the Greek 'gen', meaning “become, create, creation, birth”, and subsequent variants : genealogy,
genesis, genetics, genic, genomere, genotype, genus etc. While the word genome (from the German
‘Genom’, attributed to Hans Winkler) was in use in English as early as 1926, the term genomics was
coined by Thomas Roderick, a geneticist at the Jackson Laboratory (Bar Harbor, Maine), over beer at a
meeting held in Maryland on the mapping of the human genome in 1986.
In the fields of molecular biology and genetics, a genome is all genetic material of an organism. It
consists of DNA (or RNA in RNA viruses). The genome includes both the genes (the coding regions) and
the noncoding DNA, as well as mitochondrial DNA and chloroplast DNA. The study of the genome is
called genomics.
Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping,
and editing of genomes. A genome is an organism’s complete set of DNA, including all of its genes. In
contrast to genetics, which refers to the study of individual genes and their roles in inheritance,
genomics aims at the collective characterization and quantification of all of an organism’s genes, their
interrelations and influence on the organism.
The field also includes studies of intragenomic (within the genome) phenomena such as epistasis (effect
of one gene on another), pleiotropy (one gene affecting more than one trait), heterosis (hybrid vigour),
and other interactions between loci and alleles within the genome.
Scope And Importance of Genomics :-
1) Target identification and validation : integrative analysis of molecular data at scale, coupling
genetic, epigenetic, gene expression profiling, proteomic, metabolomic, phenotypic trait
measurements to disease diagnosis and clinical outcome data to generate hypotheses on
molecular etiology of diseases in service of identification or validation of novel therapeutic
targets.
2) Biomarker discovery : utilizing genetic and genomic data derived from cell lines, animal models,
human disease tissues and PBMC to develop preclinical or clinical biomarkers for target
engagement, pharmacodynamics, drug response, prognosis, and patient stratification; applying
genomic profiling in clinical trials to identify early response markers to predict clinical end
points.
3) Pharmacogenomics : identify associations between germline SNPs, somatic mutations, gene
expression and other molecular alterations and drug responses.
4) Toxicogenomics : integrative analysis of genomic, histopathology, and clinical chemistry data to
develop predictive toxicology biomarkers in preclinical 4-day, 14-day and 30-day studies and
clinical studies.
5) Understanding drug mechanisms of action (MoA) : applying genomic profiling to de-convolute
targets and delineate MoA of non-selective drugs or drugs from phenotypic screening.
6) Characterization of mechanisms of acquired resistance : analysis of genetic and genomic data
derived from preclinical isogenic models or clinical patient samples to study the mechanisms of
acquired resistance.
7) Selection of disease-relevant experimental models : comparative analysis of genetic and
genomic data to assess and select cell line and animal models in drug discovery that best
represent the disease indications.
8) Developing drug combination strategies : analysis of genetic and genomic data to identify
synthetic lethality genes as drug combination targets; computational analysis to understand
gene regulatory networks to develop combination strategies that target parallel pathways or
reverse drug resistance.
9) Drug repurposing : applying in silico approaches to identify new disease indications for existing
drugs.
APPLICATIONS OF GENOMICS IN CROP IMPROVEMENT :-
1) Molecular Breeding or Marker Assisted Selection (MAS) :
Crop improvement relies on the identification of desirable genes and of superior genotypes possessing such genes. Selection of such genes and genotypes is facilitated by MAS, which refers to the process of indirect selection of desirable genes or traits through direct selection for morphological, biochemical or DNA-based/molecular markers linked to them. Breeders screen new varieties directly for the presence of the markers and thereby indirectly select for the desired genes; a minimal sketch of this logic is given below. In this way, MAS obviates the need for phenotypic selection, which can be difficult, time-consuming, costly and influenced by the environment.
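The selection logic of MAS can be illustrated with a short Python sketch. The marker name, genotype calls and their linkage to the desired gene are all hypothetical, invented for the example; this is not any specific breeding pipeline:

# Minimal sketch of marker-assisted selection : keep only plants whose
# genotype at a linked marker predicts presence of the desired allele.
# The marker name ("RM21") and genotype calls are hypothetical.

plants = [
    {"id": "P1", "RM21": "AA"},   # homozygous for the linked marker allele
    {"id": "P2", "RM21": "Aa"},   # heterozygous carrier
    {"id": "P3", "RM21": "aa"},   # lacks the marker allele
]

def carries_marker(plant, marker="RM21", allele="A"):
    """True if the plant carries at least one copy of the marker allele."""
    return allele in plant[marker]

selected = [p["id"] for p in plants if carries_marker(p)]
print(selected)  # ['P1', 'P2'] -- selected without any phenotyping

The point of the sketch is that selection happens on the genotype call alone, so the difficult, environment-dependent phenotyping step is bypassed.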
Table 1 : Traits for which MAS is being applied in different crops
2) DNA / Molecular markers :
DNA markers can be defined as differences in nucleotide sequence of different individuals that can be
used to mark a genomic region, tag a gene or distinguish individuals from each other. Many different
classes of molecular markers are available today. These have been elaborately described in many good
reviews.
3) Mapping and tagging of genes :
A large number of monogenic and polygenic genomic loci for various traits have been identified in many
plants and are currently being exploited by breeders. Tagging of useful genes responsible for conferring
resistance to plant pathogens, synthesis of plant hormones, drought tolerance and a variety of other
important developmental pathway genes is a major target. Molecular markers have facilitated
construction of physical and genetic maps of genes/QTL on the genome. Table 1 gives a summary of the
traits for which MAS is being applied in different crops.
4) Identification/DNA Fingerprinting :
The ability to identify individual plants is at the core of many applications. The use of DNA markers in
cultivar identification is particularly important for the protection of proprietary germplasm. In India, the
National Research Centre on DNA Fingerprinting, rechristened the Division of Genomic Resources, National Bureau of Plant Genetic Resources, Indian Council of Agricultural Research, has been entrusted with the responsibility of fingerprinting released varieties and important germplasm in crops of national importance. A total of 2,146 accessions of different crops have been fingerprinted at the centre using different molecular marker systems, and a wide array of markers has been used for fingerprinting in different crops. The markers chosen for fingerprinting depend on factors like their availability, genomic coverage, cost effectiveness and reproducibility. DNA markers have also been used to confirm purity in hybrid cultivars, where the maintenance of high levels of genetic purity is essential. Marker-based identification has also been used to check adulteration of commercial medicinal plants. Sex identification in some dioecious plants is likewise facilitated by the use of molecular markers.
5) Diversity analysis of germplasm and heterosis breeding :
Heterosis refers to the presence of superior phenotypes in hybrids relative to their inbred parents with
respect to traits such as growth rate, reproductive success and yield. The theory of quantitative genetics
postulates a positive correlation between parental genetic distance and degree of heterosis.
Conventionally, the selection of such parents was based on a combination of phenotypic assessments, pedigree information and breeding records. Now, molecular markers are also used for this purpose. A
genome wide assessment of genetic diversity using molecular markers makes parental selection more
efficient. Efforts have been made to construct haplotype blocks on the basis of molecular markers which
have been successfully used to predict hybrid performance.
6) Introgression/Backcrossing and gene pyramiding :
Commercial elite cultivars can be improved for a desirable trait (like resistance to a specific disease) which exists in a distantly related (wild) genotype but is lacking in the commercially grown variety. This is achieved by gene introgression, which involves crossing the elite cultivar with the donor plant, followed by repeated backcrossing of the progeny with the recipient line while selecting simultaneously for the desirable allele in each generation (foreground selection). This takes about six or more generations, but the use of DNA markers can effectively shorten this duration by reducing the number of backcrosses required. MAS allows recovery of the maximum proportion of recurrent-parent genomic regions at the non-target loci (background selection) and thus helps minimize linkage drag. This method is termed marker-assisted backcrossing (MABC). MAS has also been widely utilized for gene pyramiding.
Pyramiding is the accumulation of several desired alleles into a single line or cultivar (background). This
is often cited as one of the major applications of MAS, since gene pyramiding through conventional
plant breeding is difficult.
1.2 CENTRAL DOGMA
In molecular biology, central dogma illustrates the flow of genetic information from DNA to RNA to
protein. It is defined as a process in which the information in DNA is converted into a functional product.
It is suggested that the information present in DNA is essential to make all proteins, and that RNA acts as a messenger that carries the information to the ribosomes.
“Central dogma is the process in which the genetic information flows from DNA to RNA, to make a
functional product protein.“
Central Dogma Steps :-
The central dogma takes place in two different steps :
1) Transcription :
Transcription is the process by which information is transferred from one strand of the DNA to RNA by the enzyme RNA polymerase. The DNA stretch that undergoes this process consists of three parts, namely a promoter, the structural gene and a terminator.
The DNA strand that serves as the template for RNA synthesis is called the template strand, and the other strand is called the coding strand. The DNA-dependent RNA polymerase binds to the promoter and moves along the template strand in its 3′ to 5′ direction, synthesizing the RNA in the 5′ to 3′ direction.
On reaching the terminator sequence, the polymerase stops and releases the newly synthesized RNA strand.
The newly released RNA strand further undergoes post-transcriptional modifications.
2) Translation :
Translation is the process by which the RNA codes for specific proteins. It is an active process which
requires energy. This energy is provided by the charged tRNA molecules.
Ribosomes carry out the translation process. A ribosome consists of a larger subunit and a smaller subunit. The larger subunit contains two tRNA-binding sites positioned close enough together that a peptide bond can be formed between the amino acids carried by the two tRNAs.
The mRNA binds to the smaller subunit and is read codon by codon; tRNA molecules bearing the complementary anticodons occupy the two sites in the larger subunit. Thus, two adjacent codons are held by two tRNA molecules placed close to each other, and a peptide bond is formed between their amino acids. As this process repeats, long polypeptide chains of amino acids are synthesized.
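As a worked illustration of the dogma's final step, the sketch below translates an mRNA fragment codon by codon. The codon table is deliberately truncated to just the codons used in this example, not the full genetic code:

# Minimal sketch : translate an mRNA fragment codon by codon.
# Only the codons needed for this example are included in the table.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "AAA": "Lys", "UAA": "STOP",
}

def translate(mrna):
    protein = []
    for i in range(0, len(mrna) - 2, 3):   # step through one codon at a time
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "STOP":                   # terminator codon ends translation
            break
        protein.append(aa)
    return "-".join(protein)

print(translate("AUGUUUGGCAAAUAA"))  # Met-Phe-Gly-Lys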
1.2.1 STRUCTURAL GENOMICS :-
Structural genomics seeks to describe the 3-dimensional structure of every protein encoded by a given
genome. This genome-based approach allows for a high-throughput method of structure determination
by a combination of experimental and modeling approaches. The principal difference between
structural genomics and traditional structural prediction is that structural genomics attempts to
determine the structure of every protein encoded by the genome, rather than focusing on one
particular protein. With full-genome sequences available, structure prediction can be done more quickly
through a combination of experimental and modeling approaches, especially because the availability of
a large number of sequenced genomes and previously solved protein structures allows scientists to model
protein structure on the structures of previously solved homologs.
Because protein structure is closely linked with protein function, structural genomics has the
potential to inform knowledge of protein function. In addition to elucidating protein functions,
structural genomics can be used to identify novel protein folds and potential targets for drug discovery.
Structural genomics involves taking a large number of approaches to structure determination, including
experimental methods using genomic sequences or modeling-based approaches based on sequence or
structural homology to a protein of known structure or based on chemical and physical principles for a
protein with no homology to any known structure.
As opposed to traditional structural biology, the determination of a protein structure through a
structural genomics effort often (but not always) comes before anything is known regarding the protein
function. This raises new challenges in structural bioinformatics, i.e. determining protein function from
its 3D structure.
Structural genomics emphasizes high throughput determination of protein structures. This is performed
in dedicated centers of structural genomics.
While most structural biologists pursue structures of individual proteins or protein groups, specialists in
structural genomics pursue structures of proteins on a genome wide scale. This implies large-scale
cloning, expression and purification. One main advantage of this approach is economy of scale. On the
other hand, the scientific value of some resultant structures is at times questioned. A Science article
from January 2006 analyzes the structural genomics field.
Goals of Structural Genomics :-
One goal of structural genomics is to identify novel protein folds. Experimental methods of protein
structure determination require proteins that express and/or crystallize well, which may inherently bias the kinds of protein folds that these experimental data elucidate. A genomic, modeling-based approach such as ab initio modeling may be better able to identify novel protein folds than the experimental approaches because it is not limited by experimental constraints.
Protein function depends on 3-D structure and these 3-D structures are more highly conserved than
sequences. Thus, the high-throughput structure determination methods of structural genomics have the
potential to inform our understanding of protein functions. This also has potential implications for drug
discovery and protein engineering. Furthermore, every protein that is added to the structural database
increases the likelihood that the database will include homologous sequences of other unknown
proteins. The Protein Structure Initiative (PSI) is a multifaceted effort funded by the National Institutes
of Health with various academic and industrial partners that aims to increase knowledge of protein
structure using a structural genomics approach and to improve structure-determination methodology.
Methods of Structural Genomics :-
Structural genomics takes advantage of completed genome sequences in several ways in order to
determine protein structures. The gene sequence of the target protein can also be compared to a
known sequence and structural information can then be inferred from the known protein’s structure.
Structural genomics can be used to predict novel protein folds based on other structural data. Structural
genomics can also take modeling-based approach that relies on homology between the unknown
protein and a solved protein structure.
1) De novo methods :
Completed genome sequences allow every open reading frame (ORF), the part of a gene that is likely to
contain the sequence for the messenger RNA and protein, to be cloned and expressed as protein. These
proteins are then purified and crystallized, and then subjected to one of two types of structure
determination : X-ray crystallography and nuclear magnetic resonance (NMR). The whole genome
sequence allows for the design of every primer required in order to amplify all of the ORFs, clone them
into bacteria, and then express them. By using a whole-genome approach to this traditional method of
protein structure determination, all of the proteins encoded by the genome can be expressed at once.
This approach allows for the structural determination of every protein that is encoded by the genome.
2) Modelling-based methods :
a) Ab initio modeling :
This approach uses protein sequence data and the chemical and physical interactions of the encoded
amino acids to predict the 3-D structures of proteins with no homology to solved protein structures. One
highly successful method for ab initio modeling is the Rosetta program, which divides the protein into short segments and arranges each segment into a low-energy local conformation. Rosetta is available for commercial use and for non-commercial use through its public program, Robetta.
b) Sequence-based modeling :
This modeling technique compares the gene sequence of an unknown protein with sequences of
proteins with known structures. Depending on the degree of similarity between the sequences, the
structure of the known protein can be used as a model for solving the structure of the unknown protein.
Highly accurate modeling is considered to require at least 50% amino acid sequence identity between
the unknown protein and the solved structure. Sequence identity of 30-50% gives a model of intermediate accuracy, and sequence identity below 30% gives low-accuracy models. It has been predicted that at
least 16,000 protein structures will need to be determined in order for all structural motifs to be
represented at least once and thus allowing the structure of any unknown protein to be solved
accurately through modeling. One disadvantage of this method, however, is that structure is more
conserved than sequence and thus sequence-based modeling may not be the most accurate way to
predict protein structures.
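These identity thresholds amount to a simple decision rule. The sketch below assumes the two sequences have already been aligned to equal length (gap handling is done elsewhere); it computes percent identity and bins the expected model quality into the tiers quoted above:

# Minimal sketch : percent identity between two pre-aligned, equal-length
# sequences, binned into the accuracy tiers quoted in the text.
def percent_identity(seq_a, seq_b):
    assert len(seq_a) == len(seq_b), "sequences must be aligned to equal length"
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

def expected_model_quality(identity):
    if identity >= 50:
        return "high accuracy"
    if identity >= 30:
        return "intermediate accuracy"
    return "low accuracy"

pid = percent_identity("MKTAYIAKQR", "MKSAYIGKQR")   # toy 10-residue alignment
print(pid, expected_model_quality(pid))              # 80.0 high accuracy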
c) Threading :
Threading bases structural modeling on fold similarities rather than sequence identity. This method may
help identify distantly related proteins and can be used to infer molecular functions.
1.2.2 FUNCTIONAL GENOMICS :-
Functional genomics is a field of molecular biology that attempts to describe gene (and protein)
functions and interactions. Functional genomics makes use of the vast data generated by genomic and
transcriptomic projects (such as genome sequencing projects and RNA sequencing). Functional genomics
focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression
and protein–protein interactions, as opposed to the static aspects of the genomic information such as
DNA sequence or structures. A key characteristic of functional genomics studies is their genome-wide
approach to these questions, generally involving high-throughput methods rather than a more
traditional “gene-by-gene” approach.
Goals of Functional Genomics :-
The goal of functional genomics is to understand the function of genes or proteins, eventually all
components of a genome. The term functional genomics is often used to refer to the many technical
approaches to study an organism’s genes and proteins, including the “biochemical, cellular, and/or
physiological properties of each and every gene product” while some authors include the study of
nongenic elements in their definition. Functional genomics may also include studies of natural genetic
variation over time (such as an organism’s development) or space (such as its body regions), as well as
functional disruptions such as mutations.
The promise of functional genomics is to generate and synthesize genomic and proteomic knowledge
into an understanding of the dynamic properties of an organism. This could potentially provide a more
complete picture of how the genome specifies function compared to studies of single genes. Integration
of functional genomics data is often a part of systems biology approaches.
Techniques in Functional Genomics :-
Functional genomics includes function-related aspects of the genome itself such as mutation and
polymorphism (such as single nucleotide polymorphism (SNP) analysis), as well as the measurement of
molecular activities. The latter comprise a number of “-omics” such as transcriptomics (gene
expression), proteomics (protein production), and metabolomics. Functional genomics uses mostly
multiplex techniques to measure the abundance of many or all gene products such as mRNAs or
proteins within a biological sample. A more focused functional genomics approach might test the
function of all variants of one gene and quantify the effects of mutants by using sequencing as a readout
of activity. Together these measurement modalities endeavor to quantitate the various biological
processes and improve our understanding of gene and protein functions and interactions.
A) At the DNA level :-
1) Genetic interaction mapping :
Systematic pairwise deletion of genes or inhibition of gene expression can be used to identify genes with
related function, even if they do not interact physically. Epistasis refers to the fact that effects for two
different gene knockouts may not be additive; that is, the phenotype that results when two genes are
inhibited may be different from the sum of the effects of single knockouts.
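One common way to quantify such an interaction is a multiplicative null model (one of several used in the literature): compare the observed double-knockout fitness against the product of the single-knockout fitnesses. A minimal sketch with invented fitness values:

# Minimal sketch : epistasis under a multiplicative null model.
# Fitness values (wild type = 1.0) are invented for illustration.
fitness_a = 0.8          # single knockout of gene A
fitness_b = 0.7          # single knockout of gene B
fitness_ab = 0.3         # observed double knockout

expected_ab = fitness_a * fitness_b        # 0.56 if the effects were independent
epsilon = fitness_ab - expected_ab         # negative => aggravating interaction

print(f"expected {expected_ab:.2f}, observed {fitness_ab:.2f}, epsilon {epsilon:+.2f}")
# expected 0.56, observed 0.30, epsilon -0.26 (synthetic-sick-like interaction)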
2) DNA/Protein interactions :
Proteins formed by translation of mRNA (messenger RNA, which carries coded information from DNA for protein synthesis) play a major role in regulating gene expression. To understand how they regulate
gene expression it is necessary to identify DNA sequences that they interact with. Techniques have been
developed to identify sites of DNA-protein interactions. These include ChIP-sequencing, CUT & RUN
sequencing and Calling Cards.
3) DNA accessibility assays :
Assays have been developed to identify regions of the genome that are accessible. These regions of
open chromatin are candidate regulatory regions. These assays include ATAC-Seq, DNase-Seq and FAIRE-
Seq.
B) At the RNA level :-
1) Microarrays :
Microarrays measure the amount of mRNA in a sample that corresponds to a given gene or probe DNA
sequence. Probe sequences are immobilized on a solid surface and allowed to hybridize with
fluorescently labeled “target” mRNA. The intensity of fluorescence of a spot is proportional to the
amount of target sequence that has hybridized to that spot, and therefore to the abundance of that
mRNA sequence in the sample. Microarrays allow for identification of candidate genes involved in a
given process based on variation between transcript levels for different conditions and shared
expression patterns with genes of known function.
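For a two-channel design, the per-spot analysis reduces to a log ratio of the two intensities. A minimal sketch with made-up, background-subtracted intensities (not data from any real array):

# Minimal sketch : log2 fold change per spot on a two-channel microarray.
# Intensities are arbitrary made-up values after background subtraction.
import math

spots = {"geneA": (1500.0, 300.0),   # (treated, control) intensities
         "geneB": (420.0, 400.0),
         "geneC": (90.0, 800.0)}

for gene, (treated, control) in spots.items():
    log2_ratio = math.log2(treated / control)
    print(f"{gene}: log2 ratio = {log2_ratio:+.2f}")
# geneA +2.32 (up), geneB +0.07 (unchanged), geneC -3.15 (down)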
2) SAGE :
Serial analysis of gene expression (SAGE) is an alternate method of analysis based on RNA sequencing
rather than hybridization. SAGE relies on the sequencing of 10–17 base pair tags which are unique to
each gene. These tags are produced from poly-A mRNA and ligated end-to-end before sequencing. SAGE
gives an unbiased measurement of the number of transcripts per cell, since it does not depend on prior
knowledge of what transcripts to study (as microarrays do).
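Computationally, SAGE reduces to counting how often each tag occurs and mapping tags back to genes. A minimal sketch, with invented tag sequences and a hypothetical tag-to-gene table:

# Minimal sketch : count SAGE tags and map them to genes.
# Tag sequences and the tag->gene table are invented for illustration.
from collections import Counter

observed_tags = ["GATCACGTAC", "GATCACGTAC", "TTAGGCATCA",
                 "GATCACGTAC", "TTAGGCATCA", "CCGTAAGTTA"]

tag_to_gene = {"GATCACGTAC": "geneA", "TTAGGCATCA": "geneB",
               "CCGTAAGTTA": "geneC"}

counts = Counter(observed_tags)
for tag, n in counts.most_common():
    print(tag_to_gene.get(tag, "unknown"), n)
# geneA 3, geneB 2, geneC 1 -- transcript abundance read directly as tag counts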
3) RNA sequencing :
RNA sequencing has overtaken microarrays and SAGE in recent years (as noted in 2016) and has become the most efficient way to study transcription and gene expression. This is typically done by next-generation sequencing.
A subset of sequenced RNAs are small RNAs, a class of non-coding RNA molecules that are key
regulators of transcriptional and post-transcriptional gene silencing, or RNA silencing. Next generation
sequencing is the gold standard tool for non-coding RNA discovery, profiling and expression analysis.
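Downstream of any of these sequencing readouts, a basic quantification step is normalizing raw read counts for gene length and sequencing depth, for example as transcripts per million (TPM). A minimal sketch with invented counts and gene lengths:

# Minimal sketch : transcripts-per-million (TPM) from raw RNA-seq counts.
# Counts and gene lengths (in kb) are invented for illustration.
genes = {"geneA": (900, 1.5),    # (read count, gene length in kb)
         "geneB": (400, 4.0),
         "geneC": (100, 0.5)}

rpk = {g: count / length_kb for g, (count, length_kb) in genes.items()}
scale = sum(rpk.values()) / 1e6            # per-million scaling factor

for gene, value in rpk.items():
    print(f"{gene}: TPM = {value / scale:.1f}")
# length-normalized first, then depth-normalized, so TPMs sum to one million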
4) Massively Parallel Reporter Assays (MPRAs) :
Massively parallel reporter assays (MPRAs) are a technology for testing the cis-regulatory activity of DNA sequences. MPRAs use a plasmid with a synthetic cis-regulatory element upstream of a promoter driving a synthetic gene such as green fluorescent protein. A library of cis-regulatory elements, which can contain from hundreds to thousands of elements, is usually tested using MPRAs. The cis-regulatory
activity of the elements is assayed by using the downstream reporter activity. The activity of all the
library members is assayed in parallel using barcodes for each cis-regulatory element. One limitation of
MPRAs is that the activity is assayed on a plasmid and may not capture all aspects of gene regulation
observed in the genome.
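Analytically, each element's activity is derived from its barcode counts in the RNA readout relative to the plasmid (DNA) input, aggregated across that element's barcodes. A minimal sketch with invented counts; real analyses add replicates and statistical testing:

# Minimal sketch : aggregate MPRA barcodes per cis-regulatory element.
# Each element carries several barcodes; activity = median RNA/DNA ratio.
# All counts are invented for illustration.
from statistics import median

# (rna_count, dna_count) pairs, one per barcode
barcodes = {"enh1": [(500, 100), (450, 90), (600, 110)],
            "neg":  [(50, 120), (40, 100), (60, 130)]}

for element, pairs in barcodes.items():
    ratios = [rna / dna for rna, dna in pairs]
    print(element, round(median(ratios), 2))
# enh1 5.0, neg 0.42 -- the median damps barcode-specific noise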
5) STARR-Seq :
STARR-Seq is a technique similar to MPRAs to assay enhancer activity of randomly sheared genomic
fragments. In the original publication, randomly sheared fragments of the Drosophila genome were
placed downstream of a minimal promoter. Candidate enhancers amongst the randomly sheared
fragments will transcribe themselves using the minimal promoter. By using sequencing as a readout and controlling for the input amount of each sequence, the strength of putative enhancers is assayed by this method.
6) Perturb-Seq : Perturb-Seq couples CRISPR mediated gene knockdowns with single-cell gene
expression. Linear models are used to calculate the effect of the knockdown of a single gene on the
expression of multiple genes.
C) At the protein level :-
1) Yeast two-hybrid system :
A yeast two-hybrid screening (Y2H) tests a “bait” protein against many potential interacting proteins
(“prey”) to identify physical protein–protein interactions. This system is based on a transcription factor,
originally GAL4, whose separate DNA-binding and transcription activation domains are both required in
order for the protein to cause transcription of a reporter gene. In a Y2H screen, the “bait” protein is
fused to the binding domain of GAL4, and a library of potential “prey” (interacting) proteins is
recombinantly expressed in a vector with the activation domain. In vivo interaction of bait and prey
proteins in a yeast cell brings the activation and binding domains of GAL4 close enough together to
result in expression of a reporter gene. It is also possible to systematically test a library of bait proteins
against a library of prey proteins to identify all possible interactions in a cell.
2) AP/MS :
Affinity purification and mass spectrometry (AP/MS) is able to identify proteins that interact with one
another in complexes. Complexes of proteins are allowed to form around a particular “bait” protein. The
bait protein is identified using an antibody or a recombinant tag which allows it to be extracted along
with any proteins that have formed a complex with it. The proteins are then digested into short peptide
fragments and mass spectrometry is used to identify the proteins based on the mass-to-charge ratios of
those fragments.
3) Deep mutational scanning :
In deep mutational scanning every possible amino acid change in a given protein is first synthesized. The
activity of each of these protein variants is assayed in parallel using barcodes for each variant. By
comparing the activity to the wild-type protein, the effect of each mutation is identified. While it is possible to assay every possible single amino-acid change, combinatorics makes two or more concurrent mutations hard to test. Deep mutational scanning experiments have also been used to infer protein
structure and protein-protein interactions.
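The core computation is an enrichment score per variant: how the variant's frequency changes from the input library to the post-selection pool. A minimal sketch with invented counts; real pipelines add replicates, error models and normalization to wild type:

# Minimal sketch : per-variant enrichment score in deep mutational scanning.
# Barcode counts before/after selection are invented for illustration.
import math

counts_pre  = {"WT": 10000, "A12V": 9000, "G45D": 8000}
counts_post = {"WT": 12000, "A12V": 2000, "G45D": 9500}

tot_pre, tot_post = sum(counts_pre.values()), sum(counts_post.values())

for variant in counts_pre:
    freq_pre = counts_pre[variant] / tot_pre      # frequency in input library
    freq_post = counts_post[variant] / tot_post   # frequency after selection
    score = math.log2(freq_post / freq_pre)
    print(f"{variant}: enrichment = {score:+.2f}")
# a strongly negative score (here A12V) marks a deleterious variant;
# scores are usually re-expressed relative to WT to read as mutational effects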
Difference Between Structural And Functional Genomics :-
Definition :
• Structural genomics is the study that attempts to sequence the whole genome and map it.
• Functional genomics is the study that attempts to determine the function of all gene products encoded by the genome of an organism.
Concern :
• Structural genomics is concerned with the sequencing and mapping of the genome.
• Functional genomics is concerned with studying the expression and function of the genome.
What it determines :
• Structural genomics determines the physical nature of the genome.
• Functional genomics determines the expression and function of all genes.
Main goal :
• The main goal of structural genomics is to identify novel protein folds.
• The main goal of functional genomics is to understand the function of genes or proteins, eventually all components of the genome.
Steps :
• Structural genomics : 1) construction of high-resolution genetic and physical maps, 2) sequencing of the genome, 3) determination of the complete set of proteins in an organism, and 4) determination of the three-dimensional structures of the proteins.
• Functional genomics : 1) determination of when and where particular genes are expressed, 2) determination of the functions of specific genes by selectively mutating the desired genes, and 3) finding the interactions that take place among proteins and between proteins and other molecules.
Techniques :
• Structural genomics : de novo methods (X-ray crystallography, nuclear magnetic resonance (NMR)) and modeling-based methods (ab initio modeling, sequence-based modeling, threading).
• Functional genomics : at the DNA level (genetic interaction mapping, DNA/protein interactions, DNA accessibility assays); at the RNA level (microarrays, SAGE, RNA sequencing, massively parallel reporter assays, STARR-Seq, Perturb-Seq); at the protein level (yeast two-hybrid system, AP/MS, deep mutational scanning).
1.2.3 COMPARATIVE GENOMICS :-
Comparative genomics is a field of biological research in which the genomic features of different
organisms are compared. The genomic features may include the DNA sequence, genes, gene order,
regulatory sequences, and other genomic structural landmarks. In this branch of genomics, whole or
large parts of genomes resulting from genome projects are compared to study basic biological
similarities and differences as well as evolutionary relationships between organisms. The major principle
of comparative genomics is that common features of two organisms will often be encoded within the
DNA that is evolutionarily conserved between them. Therefore, comparative genomic approaches start
with making some form of alignment of genome sequences and looking for orthologous sequences
(sequences that share a common ancestry) in the aligned genomes and checking to what extent those
sequences are conserved. Based on these, genome and molecular evolution are inferred and this may in
turn be put in the context of, for example, phenotypic evolution or population genetics.
Having begun virtually as soon as the whole genomes of two organisms became available (those of the bacteria Haemophilus influenzae and Mycoplasma genitalium) in 1995, comparative genomics is now a standard component of the analysis of every new genome sequence. With the explosion in the
number of genome projects due to the advancements in DNA sequencing technologies, particularly the
next-generation sequencing methods in the late 2000s, this field has become more sophisticated, making it
possible to deal with many genomes in a single study. Comparative genomics has revealed high levels of
similarity between closely related organisms, such as humans and chimpanzees, and, more surprisingly,
similarity between seemingly distantly related organisms, such as humans and the yeast Saccharomyces
cerevisiae. It has also shown the extreme diversity of the gene composition in different evolutionary
lineages.
Evolutionary Principles of Comparative Genomics :-
Evolution is a defining feature of biology. Evolutionary theory is the theoretical foundation of comparative genomics, and at the same time the results of comparative genomics have enriched and extended the theory of evolution to an unprecedented degree. When two or more genome sequences are compared, one can deduce the evolutionary relationships of the sequences in a phylogenetic tree. Based on a variety of biological genome data and the study of vertical and horizontal evolutionary processes, one can understand vital parts of gene structure and its regulatory function.
Similarity of related genomes is the basis of comparative genomics. If two creatures have a recent common ancestor, the differences between the two species' genomes evolved from the ancestral genome. The closer the relationship between two organisms, the higher the similarity between their genomes. If the relationship is close, their genomes will display colinear behaviour (synteny), namely that some or all of the genetic sequences are conserved in the same order. Thus, genome sequences can
be used to identify gene function, by analyzing their homology (sequence similarity) to genes of known
function.
Orthologous sequences are related sequences in different species: when a species splits into two, the copies of a gene in the two descendant species are orthologous to the gene in the original species. Paralogous sequences are related sequences separated by gene duplication: if a particular gene in the genome is duplicated, the two copies are paralogous to each other. A pair of orthologous sequences is called an orthologous pair (orthologs), and a pair of paralogous sequences is called a paralogous pair (paralogs). Orthologous pairs usually have the same or similar function, which is not necessarily the case for paralogous pairs; paralogous sequences tend to evolve toward different functions. A sketch of a standard computational heuristic for identifying orthologs follows below.
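A standard heuristic for calling orthologous pairs between two genomes is the reciprocal best hit: gene a in species A and gene b in species B are called orthologs if each is the other's best-scoring match. A minimal sketch, assuming pairwise similarity scores have already been computed by an aligner; the gene names and scores here are invented:

# Minimal sketch : reciprocal-best-hit (RBH) ortholog calling.
# 'scores' holds precomputed similarity scores between genes of
# species A (first element) and species B (second); values are invented.
scores = {
    ("a1", "b1"): 95, ("a1", "b2"): 40,
    ("a2", "b1"): 35, ("a2", "b2"): 88,
    ("a3", "b1"): 50, ("a3", "b2"): 52,
}

def best_hit(gene, side):
    """Best-scoring partner of `gene`; side 0 = species A, side 1 = species B."""
    pairs = [(p, s) for p, s in scores.items() if p[side] == gene]
    return max(pairs, key=lambda x: x[1])[0][1 - side]

genes_a = {a for a, _ in scores}
orthologs = [(a, best_hit(a, 0)) for a in genes_a
             if best_hit(best_hit(a, 0), 1) == a]   # reciprocity check
print(sorted(orthologs))  # [('a1', 'b1'), ('a2', 'b2')] -- a3 fails reciprocity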
Comparative genomics exploits both similarities and differences in the proteins, RNA, and regulatory
regions of different organisms to infer how selection has acted upon these elements. Those elements
that are responsible for similarities between different species should be conserved through time
(stabilizing selection), while those elements responsible for differences among species should be
divergent (positive selection). Finally, those elements that are unimportant to the evolutionary success
of the organism will be unconserved (selection is neutral).
One of the important goals of the field is the identification of the mechanisms of eukaryotic genome
evolution. It is however often complicated by the multiplicity of events that have taken place
throughout the history of individual lineages, leaving only distorted and superimposed traces in the
genome of each living organism. For this reason comparative genomics studies of small model organisms
(for example the model Caenorhabditis elegans and closely related Caenorhabditis briggsae) are of great
importance to advance our understanding of general mechanisms of evolution.
Tools used in Comparative Genomics :-
Computational tools for analyzing sequences and complete genomes are developing quickly thanks to the availability of large amounts of genomic data, and comparative analysis tools have progressed and improved in parallel. A key challenge in these analyses is visualizing the comparative results.
Visualizing sequence conservation is a difficult task in comparative sequence analysis, since it is highly inefficient to examine alignments of long genomic regions manually. Internet-based genome browsers provide many useful tools for investigating genomic sequences by integrating all sequence-based biological information on genomic regions. When extracting large amounts of relevant biological data, they are easy to use and save time.
1) UCSC Browser : This site contains the reference sequence and working draft assemblies for a
large collection of genomes.
2) Ensembl : The Ensembl project produces genome databases for vertebrates and other
eukaryotic species, and makes this information freely available online.
3) MapView : The Map Viewer provides a wide variety of genome mapping and sequencing data.
4) VISTA is a comprehensive suite of programs and databases for comparative analysis of genomic
sequences. It was built to visualize the results of comparative analysis based on DNA alignments.
The presentation of comparative data generated by VISTA can easily suit both small and large
scale of data.
5) BlueJay Genome Browser : a stand-alone visualization tool for the multi-scale viewing of
annotated genomes and other genomic elements.
An advantage of using online tools is that these websites are developed and updated constantly, so new settings and content become available online, improving efficiency.
Applications of Comparative Genomics :-
1) Agriculture :
Agriculture is a field that reaps the benefits of comparative genomics. Identifying the loci of
advantageous genes is a key step in breeding crops that are optimized for greater yield, cost-efficiency,
quality, and disease resistance. For example, one genome wide association study conducted on 517 rice
landraces revealed 80 loci associated with several categories of agronomic performance, such as grain
weight, amylose content, and drought tolerance. Many of the loci were previously uncharacterized. Not
only is this methodology powerful, it is also quick. Previous methods of identifying loci associated with
agronomic performance required several generations of carefully monitored breeding of parent strains,
a time-consuming effort that is unnecessary for comparative genomic studies.
2) Medicine :
The medical field also benefits from the study of comparative genomics. Vaccinology in particular has
experienced useful advances in technology due to genomic approaches to problems. In an approach
known as reverse vaccinology, researchers can discover candidate antigens for vaccine development by
analyzing the genome of a pathogen or a family of pathogens. Applying a comparative genomics
approach by analyzing the genomes of several related pathogens can lead to the development of
vaccines that are multiprotective. A team of researchers employed such an approach to create a
universal vaccine for Group B Streptococcus, a group of bacteria responsible for severe neonatal
infection. Comparative genomics can also be used to generate specificity for vaccines against pathogens
that are closely related to commensal microorganisms. For example, researchers used comparative
genomic analysis of commensal and pathogenic strains of E. coli to identify pathogen specific genes as a
basis for finding antigens that result in immune response against pathogenic strains but not commensal
ones. In May 2019, using the Global Genome Set, a team in the UK and Australia sequenced thousands of globally collected isolates of Group A Streptococcus, providing potential targets for developing a vaccine against the pathogen, also known as S. pyogenes.
3) Research :
Comparative genomics also opens up new avenues in other areas of research. As DNA sequencing
technology has become more accessible, the number of sequenced genomes has grown. With the
increasing reservoir of available genomic data, the potency of comparative genomic inference has
grown as well. A notable case of this increased potency is found in recent primate research.
Comparative genomic methods have allowed researchers to gather information about genetic variation,
differential gene expression, and evolutionary dynamics in primates that were indiscernible using
previous data and methods. The Great Ape Genome Project used comparative genomic methods to
investigate genetic variation with reference to the six great ape species, finding healthy levels of
variation in their gene pool despite shrinking population size. Another study showed that patterns of
DNA methylation, which are a known regulation mechanism for gene expression, differ in the prefrontal
cortex of humans versus chimps, and implicated this difference in the evolutionary divergence of the
two species.
1.3 TECHNIQUES IN GENOME ANALYSIS
Genome analysis entails the prediction of genes in uncharacterized genomic sequences. The 21st century
has seen the announcement of the draft version of the human genome sequence. Model organisms
have been sequenced in both the plant and animal kingdoms.
However, the pace of genome annotation is not matching the pace of genome sequencing. Experimental genome annotation is slow and time-consuming, so there is a strong demand for computational tools for gene prediction.
Computational gene prediction is relatively simple for prokaryotes, where genes are transcribed into the corresponding mRNA and translated into protein without interruption. The process is more complex for eukaryotic cells, where the coding DNA sequence is interrupted by intervening noncoding sequences called introns.
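For the prokaryotic case, a toy gene finder can be written directly: scan each forward reading frame for an ATG followed by an in-frame stop codon. The sketch below ignores the reverse strand, codon-usage statistics and translation signals, so it is illustrative rather than a real gene predictor:

# Minimal sketch : find forward-strand ORFs (ATG ... in-frame stop) in a
# prokaryote-style sequence. Real gene finders also score codon usage,
# scan the reverse strand, and model translation signals.
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                         # open a candidate ORF
            elif codon in STOPS and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))   # half-open genomic span
                start = None
        # an ATG with no in-frame stop before the sequence ends is dropped
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACCC"))  # [(2, 17)] : ATG AAA TTT GGG TAA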
Some of the questions which biologists want to answer today are :
• Given a DNA sequence, what part of it codes for a protein and what part of it is junk DNA?
• Classify the junk DNA as introns, untranslated regions, transposons, dead genes, regulatory elements etc.
• Divide a newly sequenced genome into the genes (coding) and the non-coding regions.
The importance of genome analysis can be understood by comparing the human and chimpanzee genomes, which differ by an average of only about 2% of their sequence. A complete analysis of the two genomes would give strong insight into the various mechanisms responsible for the differences between the two species.
Given below is a table listing the estimated sizes of certain genomes and the number of genes in them.
DIFFERENT TECHNIQUES IN GENOME ANALYSIS :-
1) DNA Microarray :-
• DNA microarrays are solid supports, usually of glass or silicon, upon which DNA is attached in an
organized pre-determined grid fashion.
• Each spot of DNA, called a probe, represents a single gene.
• DNA microarrays can analyze the expression of tens of thousands of genes simultaneously.
• There are several synonyms of DNA microarrays such as DNA chips, gene chips, DNA arrays,
gene arrays, and biochips.
2) Nanopore Sequencing Technology :-
Oxford Nanopore sequencing (ONT) was developed as a technique to determine the order of nucleotides in a DNA sequence. In 2014, Oxford Nanopore Technologies released the MinION, a device that promises to generate longer reads, ensuring better resolution of structural genomic variants and repeat content. It is a mobile single-molecule nanopore sequencer that measures four inches in length and is connected through a USB 3.0 port of a laptop computer. The device was released for testing by a community of users as part of the MinION Access Programme (MAP) to examine the performance of the MinION sequencer.
In this sequencing technology, the first strand of a DNA molecule is linked by a hairpin to its
complementary strand. The DNA fragment is passed through a protein nanopore (a nanopore is a
nanoscale hole made of proteins or synthetic materials). When the DNA fragment is translocated through the pore by the action of a motor protein attached to the pore, it generates a variation in ionic current caused by the differences between the nucleotides occupying the pore (Figure 7A). This variation in ionic current is recorded progressively and then interpreted to identify the sequence (Figure 7B). Sequencing is performed on the direct strand, generating the "template read"; then the hairpin structure is read, followed by the inverse strand, generating the "complement read". These individual reads are called "1D" reads. If the "template" and "complement" reads are combined, the resulting consensus sequence is called a "two-direction read" or "2D" read.
3) Southern Hybridization :-
• The analytical technique that involves the transfer of DNA, RNA or protein separated on a gel to a carrier membrane, for detection or identification, is termed blotting.
• The process of transfer of the denatured fragments out of the gel and onto a carrier membrane
makes it accessible for analysis using a probe or antibody.
• Depending upon the substance to be separated, blotting techniques may be Southern blot, Northern blot or Western blot, which separate DNA, RNA and proteins respectively.
• Southern blotting is the analytical technique used in molecular biology, immunogenetics and other molecular methods to detect or identify a DNA of interest from a mixture of DNA samples, or a specific base sequence within a strand of DNA.
• The technique was developed by the molecular biologist E. M. Southern in 1975 for analysing related genes in DNA restriction fragments, and is thus named Southern blotting in his honour.
4) Expressed Sequence Tags :-
An expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. ESTs may be used to
identify gene transcripts, and are instrumental in gene discovery and in gene-sequence determination.
The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs now available in
public databases (e.g. GenBank 1 January 2013, all species).
An EST results from one-shot sequencing of a cloned cDNA. The cDNAs used for EST generation are
typically individual clones from a cDNA library. The resulting sequence is a relatively low-quality
fragment whose length is limited by current technology to approximately 500 to 800 nucleotides.
Because these clones consist of DNA that is complementary to mRNA, the ESTs represent portions of
expressed genes. They may be represented in databases as either cDNA/mRNA sequence or as the
reverse complement of the mRNA, the template strand.
5) DNA Sequencing :-
DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in
DNA. It includes any method or technology that is used to determine the order of the four bases :
adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly
accelerated biological and medical research and discovery.
DNA sequencing may be used to determine the sequence of individual genes, larger genetic regions (i.e.
clusters of genes or operons), full chromosomes, or entire genomes of any organism. DNA sequencing is
also the most efficient way to indirectly sequence RNA or proteins (via their open reading frames). In
fact, DNA sequencing has become a key technology in many areas of biology and other sciences such as
medicine, forensics, and anthropology.
1.4 cDNA LIBRARY CONSTRUCTION
A cDNA library is defined as a collection of cDNA fragments, each of which has been cloned into a
separate vector molecule.
Principle of cDNA Library :-
In the case of cDNA libraries we produce DNA copies of the RNA sequences (usually the mRNA) of an
organism and clone them. It is called a cDNA library because all the DNA in this library is complementary
to mRNA and is produced by reverse transcription of the latter.
Much of eukaryotic DNA consists of repetitive sequences that are not transcribed into mRNA, and these sequences are therefore not represented in a cDNA library. Note that prokaryotes and lower eukaryotes do not contain introns, so preparation of cDNA is generally unnecessary for those organisms. Hence, cDNA libraries are produced only from higher eukaryotes; a minimal sketch of the mRNA-to-cDNA relationship is given below.
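The mRNA-to-cDNA relationship itself can be stated in a few lines: the first-strand cDNA is the reverse complement of the mRNA written in DNA bases (T in place of U), which is what reverse transcriptase accomplishes enzymatically. A minimal in silico sketch:

# Minimal sketch : the first-strand cDNA is the reverse complement of the
# mRNA, written in DNA bases (T instead of U).
PAIRING = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    # complement each base, then reverse so the result reads 5' -> 3'
    return "".join(PAIRING[base] for base in reversed(mrna))

print(first_strand_cdna("AUGGCCUUU"))  # 'AAAGGCCAT'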
Vectors used in the Construction of cDNA Library :-
Procedure in the Construction of cDNA Library :-
The steps involved in the construction of a cDNA library are as follows :
1) Extraction of mRNA from the eukaryotic Cell :
Firstly, the mRNA is obtained and purified from the rest of the RNAs. Several methods exist for purifying
RNA such as trizol extraction and column purification. Column purification is done by using oligomeric dT
nucleotide coated resins where only the mRNA having the poly-A tail will bind.
The rest of the RNAs are eluted out. The mRNA is eluted by using eluting buffer and some heat to
separate the mRNA strands from oligo-dT.
Fig 1. Extraction of mRNA from Eukaryotic Cells
2) Construction of cDNA from the Extracted mRNA :
There are different strategies for the construction of a cDNA. These are discussed as follows :
a) The RNase H Method :
The principle of this method is that a complementary DNA strand is synthesized using reverse transcriptase to make an RNA : DNA duplex; the RNA strand is then nicked and replaced by DNA. The first step is to anneal a chemically synthesized oligo-dT primer to the 3′ poly-A tail of the mRNA.
The primer is typically 10-15 residues long, and it primes (by providing a free 3′ end) the synthesis of the first DNA strand in the presence of reverse transcriptase and deoxyribonucleotides. This leaves an RNA : DNA duplex.
The next step is to replace the RNA strand with a DNA strand. This is done using the enzyme RNase H, which nicks and degrades the RNA strand of the RNA : DNA duplex. The remaining RNA fragments prime, and the first DNA strand templates, the synthesis of the second DNA strand by DNA polymerase I.
b) The Self-Priming method :
This involves the use of an oligo-dT primer annealing at the polyadenylate tail of the mRNA to prime
first DNA strand synthesis against the mRNA. This cDNA thus formed has the tendency to transiently fold
back on itself, forming a hairpin loop. This results in the self-priming of the second strand.
After the synthesis of the second DNA strand, this loop must be cleaved with a single-strand-specific nuclease, e.g. S1 nuclease, to allow insertion into the cloning vector. This method has a serious disadvantage: the cleavage with S1 nuclease results in the loss of a certain amount of sequence at the 5′ end of the clone.
Fig 2. The RNase H Method of cDNA Synthesis
Fig 3. Self-priming method of cDNA synthesis
c) Land et al. Strategy :
After first-strand synthesis, which is primed with an oligo-dT primer as usual, the cDNA is tailed with a
string of cytidine residues using the enzyme terminal transferase. This artificial oligo-dC tail is then used
as an annealing site for a synthetic oligo-dG primer, allowing synthesis of the second strand.
d) Homopolymer Tailing :
This approach uses the enzyme terminal transferase, which can polymerize nucleotides onto the 3′-
hydroxyl of both DNA and RNA molecules. We carry out the synthesis of the first DNA strand essentially
as before, to produce an RNA : DNA hybrid.
We then use terminal transferase and a single deoxyribonucleotide to add tails of that nucleotide to the 3′ ends of both the RNA and DNA strands. The result is that the DNA strand now has a known sequence at its 3′ end. Typically, dCTP or dATP is used.
Fig 4. Land et al. strategy
Fig 5. Homopolymer tailing
e) Rapid Amplification of cDNA Ends (RACE) :
It is sometimes the case that we wish to clone a particular cDNA for which we already have some
sequence data, but with particular emphasis on the integrity of the 5′ or 3′ ends. RACE techniques (Rapid
Amplification of cDNA Ends) are available for this. The RACE methods are divided into 3’RACE and
5’RACE, according to which end of the cDNA we are interested in.
i) 3’RACE :
In this type of RACE, reverse transcriptase synthesis of a first DNA strand is carried out using a modified
oligo-dT primer. This primer comprises a stretch of unique adaptor sequence followed by an oligo-dT
stretch. The first strand synthesis is followed by a second strand synthesis using a primer internal to the
coding sequence of interest.
This is followed by PCR using :
• The same internal primer, and
• A primer matching the adaptor sequence (i.e., omitting the oligo-dT).
Although in theory it should be possible to use a simple oligo-dT primer throughout instead of the adaptor-oligo-dT and adaptor combination, the low melting temperature of an oligo-dT primer may interfere with the subsequent rounds of PCR.
ii) 5’RACE :
In this type of RACE, the first cDNA strand is synthesized with reverse transcriptase and a primer from within
the coding sequence. Unincorporated primer is removed and the cDNA strands are tailed with oligo-dA.
A second cDNA strand is then synthesized with an adaptor-oligo-dT primer.
The resulting double-stranded molecules are then subjected to PCR using :
• A primer nested within the coding region, and
• A primer matching the adaptor sequence.
A nested primer is used in the final PCR to improve specificity. The adaptor sequence is used in the PCR because of the low melting temperature of a simple oligo-dT primer, as in 3’RACE above. A number of kits for RACE are commercially available.
Fig 6. a) 3’RACE and b) 5’RACE
3) Cloning the cDNA :
a) Linkers :
The RNase H and homopolymer tailing methods ultimately generate a collection of double-stranded, blunt-ended cDNA molecules. They must now be attached to the vector molecules. This could be done by blunt-ended ligation, or by the addition of linkers, digestion with the relevant enzyme and ligation into the vector.
b) Incorporation of Restriction Sites : It is possible to adapt the homopolymer tailing method by using primers that are modified to incorporate restriction sites. In the diagram below (Fig 7), the oligo-dT primer is modified to contain a restriction site (a SalI site, GTCGAC).
The 3′ end of the newly synthesized first cDNA strand is tailed with C's. An oligo-dG primer, again preceded by a SalI site within a short double-stranded region of the oligonucleotide, is then used for
second-strand synthesis.
Note that this method requires the use of an oligonucleotide containing a double-stranded region. Such
oligonucleotides are made by synthesizing the two strands separately and then allowing them to anneal
to one another.
Fig 7. Modification of homopolymer tailing, incorporating restriction sites
c) Homopolymer Tailing of cDNA :
Another option is to use terminal transferase again. Treatment of the blunt-ended double-stranded
cDNA with terminal transferase and dCTP leads to the polymerization of several C residues (typically 20
or so) to the 3′ hydroxyl at each end.
Treatment of the vector with terminal transferase and dGTP leads to the incorporation of several G
residues onto the ends of the vector. (Alternatively, dATP and dTTP can be used.) The vector and cDNA
can now anneal, and the base-paired region is often so extensive that treatment with DNA ligase is
unnecessary.
In fact, there may be gaps rather than nicks at the vector insert boundaries, but these are repaired by
physiological processes once the recombinant molecules have been introduced into a host.
Advantages of cDNA Library :-
A cDNA library has two additional advantages. First, it is enriched with fragments from actively
transcribed genes. Second, introns do not interrupt the cloned sequences; introns would pose a problem
when the goal is to produce a eukaryotic protein in bacteria, because most bacteria have no means of
removing the introns.
Disadvantages of cDNA Library :-
The disadvantage of a cDNA library is that it contains only sequences that are present in mature mRNA.
Introns and any other sequences that are altered after transcription are not present; sequences, such as
promoters and enhancers, that are not transcribed into RNA also are not present in a cDNA library.
It is also important to note that the cDNA library represents only those gene sequences expressed in
the tissue from which the RNA was isolated. Furthermore, the frequency of a particular DNA sequence
in a cDNA library depends on the abundance of the corresponding mRNA in the given tissue. In contrast,
almost all genes are present at the same frequency in a genomic DNA library.
Applications of cDNA Library :-
Following are the applications of cDNA libraries :
❖ Discovery of novel genes.
❖ Cloning of full-length cDNA molecules for in vitro study of gene function.
❖ Study of the repertoire of mRNAs expressed in different cells or tissues.
❖ Study of alternative splicing in different cells or tissues.
1.5 GENE SEQUENCING
DNA sequencing is the process of determining the sequence of nucleotides within a DNA molecule.
Every organism’s DNA consists of a unique sequence of nucleotides. Determining the sequence can help
scientists compare DNA between organisms, which can help show how the organisms are related.
Sanger and Maxam-Gilbert sequencing technologies were the most common sequencing technologies
used by biologists until the emergence of a new era of sequencing technologies that opened new perspectives for genome exploration and analysis. These technologies first appeared with Roche’s 454 platform in 2005 and were commercialized as technologies capable of producing sequences with very high throughput and at much lower cost than the first sequencing technologies.
These new sequencing technologies are generally known under the name of “Next Generation
Sequencing (NGS) Technologies” or “High Throughput Sequencing Technologies”.
NGS technologies perform massively parallel, high-throughput analysis of multiple samples at much reduced cost. NGS platforms can sequence millions to billions of reads in parallel in a single run, and the time required to generate gigabase-sized outputs is only a few days or hours, making them far faster than first-generation methods such as Sanger sequencing. The human genome, for example, consists of 3 billion bp, made up of DNA macromolecules with lengths varying from ~33 to ~247 million bp, distributed among the 23 pairs of chromosomes located in each human cell nucleus. Sequencing the human genome using Sanger sequencing took almost 15 years, required the cooperation of many laboratories around the world and cost approximately 100 million US dollars, whereas sequencing by NGS sequencers using the 454 Genome Sequencer FLX took two months and approximately one hundredth of the cost. However, NGS platforms are incapable of reading the complete DNA sequence of a genome in one pass; they are limited to sequencing small DNA fragments and generate millions of reads. This limitation remains a negative point, especially for genome assembly projects, because assembling so many reads requires substantial computing resources. The required read counts follow directly from the coverage arithmetic sketched below.
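The throughput contrast can be made concrete with the Lander-Waterman relation C = L·N/G (coverage C, read length L, read count N, genome size G). The figures below are illustrative only.

    def reads_required(genome_size, read_length, coverage):
        """Number of reads N needed so that C = L*N/G reaches the target coverage."""
        return int(coverage * genome_size / read_length)

    G = 3_000_000_000    # human genome, ~3 billion bp
    print(reads_required(G, 150, 30))     # ~600 million short Illumina-style reads
    print(reads_required(G, 10_000, 30))  # ~9 million long PacBio-style reads
    print(reads_required(G, 800, 8))      # ~30 million Sanger-style reads

This is why NGS instruments are engineered around massive parallelism: hundreds of millions of short reads are needed before a human-sized genome is adequately covered.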
A) The First Generation of Sequencing :-
Sanger and Maxam-Gilbert sequencing technologies are classified as the first generation of sequencing technologies; they initiated the field of DNA sequencing with their publications in 1977.
1) Sanger sequencing :
Sanger Sequencing is known as the chain termination method or the dideoxynucleotide method or the
sequencing by synthesis method. It consists in using one strand of the double stranded DNA as template
to be sequenced. This sequencing is made using chemically modified nucleotides called dideoxy-
nucleotides (dNTPs). These dNTPs are marked for each DNA bases by ddG, ddA, ddT, and ddC. The
dideoxy-nucleotides are used dNTPs are used for elongation of nucleotide, once incorporated into the
DNA strand they prevent the further elongation and the elongation is complete. Then, we obtain DNA
fragments ended by a dNTP with different sizes. The fragments are separated according to their size
using gel slab where the resultant bands corresponding to DNA fragments can be visualized by an
imaging system (X-ray or UV light). Figure 1 details the Sanger sequencing technology.
Fig 1. Sanger sequencing technology. (a) The sequencing reaction is performed in the presence of denatured DNA template, radioactively labeled primer, DNA polymerase, dNTPs and ddNTPs. The DNA polymerase incorporates nucleotides into the elongating DNA strand. Each of the four ddNTPs is run in a separate reaction, so that polymerization can randomly terminate at each position of the corresponding base. The end result of each reaction is a population of DNA fragments of different lengths, the length of each fragment depending on where a ddNTP was incorporated. (b) Separation of these DNA fragments in a denaturing gel by electrophoresis. The radioactive label on the primer enables visualization of the fragments as bands on the gel; the bands represent the respective fragments shown to the right. The complement of the original template (read from bottom to top) is given on the left margin of the sequencing gel.
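The chain-termination logic of Figure 1 can be mimicked in a few lines of Python: each ddNTP "lane" is the set of fragment lengths ending at that base, and reading the lanes in order of fragment length (the gel read from bottom to top) reconstructs the strand. The template below is invented; this illustrates the principle, not the real chemistry.

    template = "ATGCCGTAAGCTTACG"   # strand being synthesized (invented)

    def termination_lengths(strand, base):
        """Fragment lengths at which the ddNTP for `base` terminates synthesis."""
        return {i + 1 for i, b in enumerate(strand) if b == base}

    lanes = {b: termination_lengths(template, b) for b in "ACGT"}

    # Read the "gel" from the shortest fragment up: each length appears in
    # exactly one lane, and that lane names the base at that position.
    read = "".join(next(b for b in "ACGT" if n in lanes[b])
                   for n in range(1, len(template) + 1))
    assert read == template
    print(read)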
The first genomes sequenced by Sanger sequencing were the phiX174 genome, 5,374 bp in size, and, in 1980, the bacteriophage λ genome, 48,501 bp in length. After years of improvement, Applied Biosystems became the first company to automate Sanger sequencing: in 1995 it built an automatic sequencing machine, the capillary-electrophoresis-based ABI PRISM 310, allowing fast and accurate sequencing. Sanger sequencing was used in several sequencing projects on different plant species, such as Arabidopsis, rice and soybean, and the most emblematic achievement of this sequencing technology is the decoding of the first human genome.
Sanger sequencing was widely used for three decades and is still used today for single-sample or low-throughput DNA sequencing. However, it is difficult to improve its speed of analysis any further, which rules out the sequencing of complex genomes such as those of plant species, and the sequencing remained extremely expensive and time consuming.
2) Maxam-Gilbert sequencing :
Maxam-Gilbert sequencing is another technology belonging to the first generation of sequencing, known as the chemical degradation method. It relies on the cleavage of nucleotides by chemicals and is most effective with small nucleotide polymers. Chemical treatment generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (C, T+C, G, A+G). Each reaction yields a series of labeled fragments that can be separated according to their size by electrophoresis.
The sequencing here is performed without DNA cloning. However, the development and improvement of the Sanger sequencing method favored it over the Maxam-Gilbert method, which is also considered hazardous because it uses toxic and radioactive chemicals.
B) The Second Generation of Sequencing :-
The first generation of sequencing, especially Sanger sequencing, was dominant for three decades; however, its cost and speed were major stumbling blocks. The year 2005 and subsequent years marked the emergence of a new generation of sequencers that broke through the limitations of the first generation. The basic characteristics of second-generation sequencing (SGS) technology are :
1) the generation of many millions of short reads in parallel,
2) a speed-up of the sequencing process compared to the first generation,
3) a low cost of sequencing, and
4) direct detection of the sequencing output, without the need for electrophoresis.
Short-read sequencing approaches fall under two broad categories : sequencing by ligation (SBL) and sequencing by synthesis (SBS). They are mainly represented by three major sequencing platforms : Roche/454, launched in 2005; Illumina/Solexa, in 2006; and ABI/SOLiD, in 2007. We will briefly describe these commonly utilized sequencing platforms.
1) Roche/454 sequencing :
Roche/454 sequencing appeared on the market in 2005, using the pyrosequencing technique, which is based on the detection of the pyrophosphate released after each nucleotide incorporation into the newly synthesized DNA strand (http://www.454.com). The pyrosequencing technique is a sequencing-by-synthesis approach.
DNA samples are randomly fragmented and each fragment is attached to a bead whose surface carries primers with oligonucleotides complementary to the DNA fragments, so that each bead is associated with a single fragment (Figure 2A). Each bead is then isolated and amplified by emulsion PCR, which produces about one million copies of each DNA fragment on the surface of its bead (Figure 2B). The beads are then transferred to a plate containing many wells, called a picotiter plate (PTP), and the pyrosequencing technique is applied, which consists of activating a series of downstream reactions that produce light at each incorporation of a nucleotide. By detecting the light emission after each incorporation, the sequence of the DNA fragment is deduced (Figure 2C). The use of the picotiter plate allows hundreds of thousands of reactions to occur in parallel, considerably increasing the sequencing throughput. The latest instrument launched by Roche/454, the GS FLX+, generates reads with lengths of up to 1,000 bp and can produce ~1 million reads per run (http://454.com/products/gs-flx-system/index.asp).
Fig 2. Roche/454 sequencing technology
The Roche/454 platform is able to generate relatively long reads, which are easier to map to a reference genome. The main sequencing errors detected are insertions and deletions, due to the presence of homopolymer regions: the size of a homopolymer must be determined from the intensity of the light emitted during pyrosequencing, and signals with too high or too low an intensity lead to over- or underestimation of the number of nucleotides, which causes errors in nucleotide identification.
2) Ion torrent sequencing :
Life Technologies commercialized the Ion Torrent semiconductor sequencing technology in 2010
(https://www.thermofisher.com/us/en/home/brands/ion-torrent.html). It is similar to 454 pyrosequencing technology, but it does not use fluorescently labeled nucleotides as other second-generation technologies do; instead, it is based on the detection of the hydrogen ions released during the sequencing process.
Specifically, Ion Torrent uses a chip that contains a set of microwells, each holding a bead covered with several identical DNA fragments. Each time a nucleotide is incorporated into a fragment on the bead, a hydrogen ion is released, which changes the pH of the solution. This change is detected by a sensor attached to the bottom of the microwell and converted into a voltage signal that is proportional to the number of nucleotides incorporated (Figure 3).
Ion Torrent sequencers are capable of producing read lengths of 200 bp, 400 bp and 600 bp, with a throughput that can reach 10 Gb for the Ion Proton sequencer. The major advantages of this sequencing technology are read lengths longer than those of other SGS sequencers and a fast sequencing time of between 2 and 8 hours. The major disadvantage is the difficulty of interpreting homopolymer sequences (more than 6 bp), which causes insertion and deletion (indel) errors at a rate of about 1%, as the sketch below illustrates.
Fig 3. Ion torrent sequencing technology
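The homopolymer problem shared by pyrosequencing and Ion Torrent can be illustrated with a toy model: a run of n identical bases yields one pulse whose intensity is ideally proportional to n, so measurement noise on long runs gets rounded into insertion/deletion errors. The noise model and all numbers below are invented for illustration.

    import random

    random.seed(1)

    def called_run_length(true_length, noise_sd=0.2):
        """One pulse per homopolymer: intensity ~ run length, rounded to an integer."""
        signal = true_length + random.gauss(0, noise_sd) * true_length ** 0.5
        return max(1, round(signal))

    for n in (1, 2, 4, 8):
        calls = [called_run_length(n) for _ in range(10_000)]
        indel_rate = sum(c != n for c in calls) / len(calls)
        print(f"homopolymer length {n}: indel rate ~{indel_rate:.1%}")
    # Longer runs give noisier pulses and therefore more indel errors.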
3) Illumina/Solexa sequencing :
The Solexa company has developed a new method of sequencing. Illumina company
(http://www.illumina.com) purchased Solexa that started to commercialize the sequencer
BT-3515 (Genomics And Proteomics) 28
Illumine/Solexa Genome Analyzer (GA). Illumina technology is sequencing by synthesis approach and is
currently the most used technology in the NGS market.
The sequencing process is shown in Figure 4. During the first step, the DNA sample is randomly fragmented and adapters are ligated to both ends of each fragment. These adapters then hybridize to complementary adapters anchored on a slide, whose solid surface carries many copies of the complementary adapters (Figure 4A). During the second step, each sequence attached to the solid surface is amplified by “bridge PCR” amplification, which creates several identical copies of each sequence; a set of sequences made from the same original sequence is called a cluster. Each cluster contains approximately one million copies of the same original sequence (Figure 4B). The last step is to determine each nucleotide in the sequences: Illumina uses a sequencing-by-synthesis approach employing reversible terminators, in which the four modified nucleotides, sequencing primers and DNA polymerases are added as a mix and the primers hybridize to the sequences. Polymerases then extend the primers using the modified nucleotides. Each type of nucleotide is labeled with its own specific fluorophore so that each base is uniquely identifiable, and the nucleotides carry a blocked 3’-hydroxyl group, which ensures that only one nucleotide is incorporated per cycle. Clusters are excited by a laser to emit a light signal specific to each nucleotide, which is detected by a charge-coupled device (CCD) camera, and computer programs translate these signals into a nucleotide sequence (Figure 4C). The process continues with the elimination of the terminator and its fluorescent label and the start of a new cycle with a new incorporation; a cycle-by-cycle sketch follows Figure 4.
Fig 4. Illumina sequencing technology.
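A minimal simulation of the reversible-terminator cycle described above: in each cycle every cluster incorporates exactly one labeled nucleotide, the whole flow cell is imaged at once, and the label plus 3'-block are removed before the next cycle. The cluster sequences and color assignments below are invented.

    clusters = ["ATGC", "GGAT", "CTTA"]      # one short template per cluster (made up)
    COLOR = {"A": "red", "C": "green", "G": "blue", "T": "yellow"}  # invented labels
    BASE_OF = {c: b for b, c in COLOR.items()}

    reads = ["" for _ in clusters]
    for cycle in range(len(clusters[0])):
        # Image the whole flow cell: one fluorescent color per cluster this cycle.
        signals = [COLOR[template[cycle]] for template in clusters]
        for i, color in enumerate(signals):
            reads[i] += BASE_OF[color]       # base call; then cleave label, unblock 3'-OH
    print(reads)                             # ['ATGC', 'GGAT', 'CTTA'], one base per cycle

The strict one-base-per-cycle behavior is what the blocked 3'-hydroxyl group buys: unlike pyrosequencing, homopolymers pose no special problem, since each base of a run is read in its own cycle.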
The first Illumina/Solexa GA sequencers produced very short reads of ~35 bp, but they had the advantage of producing paired-end (PE) short reads, in which the sequence at both ends of each DNA cluster is recorded. The output of the latest Illumina sequencers is currently higher than 600 Gb per run, with short-read lengths of about 125 bp.
One of the main drawbacks of the Illumina/Solexa platform is the high requirement for sample-loading control, because overloading can result in overlapping clusters and poor sequencing quality. The overall error rate of this sequencing technology is about 1%. Substitutions of nucleotides are the most common type of error in this technology, the main source being misidentification of the incorporated nucleotide.
4) ABI/SOLiD sequencing :
Supported Oligonucleotide Ligation and Detection (SOLiD) is a NGS sequencer Marketed by Life
Technologies (http://www.lifetechnologies.com). In 2007, Applied Biosystems (ABI) has acquired SOLiD
and developed ABI/SOLID sequencing technology that adopts by ligation (SBL) approach.
Fig 5. ABI/SOLiD sequencing technology.
The ABI/SOLiD process consists of multiple sequencing rounds. It starts by attaching adapters to the DNA fragments, which are fixed on beads and clonally amplified by emulsion PCR. These beads are then placed on a glass slide, 8-mer probes carrying a fluorescent label at the end are sequentially ligated to the DNA fragments, and the color emitted by the label is recorded (Figure 5A). The output format is “color space”, an encoded form of the nucleotide sequence in which four fluorescent colors are used to represent the 16 possible combinations of two bases. The sequencer repeats this ligation cycle; after each cycle the complementary strand is removed and a new sequencing cycle starts at position n-1 of the template. The cycle is repeated until each base has been sequenced twice (Figure 5B). The data recovered from color space can be translated into the letters of the DNA bases, and the sequence of the DNA fragment can be deduced (a decoding sketch follows below).
ABI/SOLiD launched its first sequencer producing short reads 35 bp in length with an output of 3 Gb/run, and continued to improve the chemistry, which increased the read length to 75 bp with an output of up to 30 Gb/run. The strength of the ABI/SOLiD platform is its high accuracy, because each base is read twice, while the drawbacks are the relatively short reads and long run times. Sequencing errors in this technology are due to noise during the ligation cycle, which causes misidentification of bases; the main type of error is substitution.
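Color-space decoding can be sketched as follows, assuming the standard SOLiD two-base encoding in which A, C, G, T map to the 2-bit codes 0-3 and each color is the XOR of two adjacent base codes; given the known first base, each color call then yields the next base. The sketch also shows why each base is effectively interrogated twice: it contributes to two consecutive colors.

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    BASE = {v: k for k, v in CODE.items()}

    def encode_colors(seq):
        """Dinucleotide -> color: 16 two-base combinations collapse onto 4 colors via XOR."""
        return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

    def decode_colors(first_base, colors):
        bases = [CODE[first_base]]
        for c in colors:
            bases.append(bases[-1] ^ c)   # next base = previous base XOR color
        return "".join(BASE[x] for x in bases)

    read = "TACGGCTA"
    colors = encode_colors(read)
    assert decode_colors(read[0], colors) == read
    print(colors)                         # the color-space representation of the read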
C) The Third Generation of Sequencing :-
The second-generation sequencing technologies discussed above have revolutionized the analysis of DNA and have been much more widely used than the first generation of sequencing technologies. However, SGS technologies generally require a PCR amplification step, which lengthens the procedure and adds to the sequencing cost. It also became clear that genomes are very complex, with many repetitive regions that SGS technologies are incapable of resolving, and that the relatively short reads make genome assembly more difficult. To remedy the problems of SGS technologies, scientists developed a new generation of sequencing called “third-generation sequencing” (TGS). These technologies offer a low sequencing cost and easy sample preparation, without the need for PCR amplification, in an execution time significantly faster than SGS technologies. In addition, TGS platforms are able to produce long reads exceeding several kilobases, helping to resolve the assembly problem and the repetitive regions of complex genomes.
Two main approaches characterize TGS : the single-molecule real-time (SMRT) sequencing approach, developed in the Quake laboratory, and the synthetic approach, which relies on existing short-read technologies, used by Illumina (Moleculo) and 10x Genomics (https://www.10xgenomics.com), to construct long reads. The most widely used TGS approach is SMRT, and the sequencers that use it are the Pacific Biosciences instruments and the Oxford Nanopore sequencers (specifically the MinION).
In the following, we present the two most widely used TGS platforms, namely Pacific Biosciences and MinION sequencing from Oxford Nanopore Technology.
1) Pacific Biosciences SMRT sequencing :
Pacific Biosciences (http://www.pacificbiosciences.com/) developed the first genomic sequencer using the SMRT approach, and it is the most widely used third-generation sequencing technology.
Pacific Biosciences uses the same fluorescent labelling as the other technologies, but instead of executing cycles of amplification and detection, it detects the signals in real time, as they are emitted when the incorporations occur. It uses a structure composed of many SMRT cells; each cell contains microfabricated nanostructures called zero-mode waveguides (ZMWs), which are wells tens of nanometers in diameter, microfabricated in a metal film that is in turn deposited onto a glass substrate. These ZMWs exploit the properties of light passing through openings with a diameter less than its wavelength, so that the light cannot propagate: the light intensity decreases along the well, and only the bottom of the well is illuminated (Figure 6A). Each ZMW contains a DNA polymerase attached to its bottom, together with the target DNA fragment for sequencing. During the sequencing reaction, the DNA polymerase incorporates fluorescently labeled nucleotides (with a different color for each base) as it copies the DNA fragment. Whenever a nucleotide is incorporated, it emits a light signal that is recorded by sensors (Figure 6B). The detection of the labeled nucleotides makes it possible to determine the DNA sequence.
Compared to SGS, Pacific Biosciences technology has several advantages. The preparation of the sample is very fast, taking 4 to 6 hours instead of days. In addition, read lengths currently average ~10 kbp, and individual very long reads can be as long as 60 kbp, longer than those of any SGS technology. However, Pacific Biosciences sequencing platforms have a high error rate of about 13%, dominated by insertion and deletion errors; these errors are randomly distributed along the long reads.
Fig 6. Pacific biosciences sequencing technology.
2) Oxford nanopore sequencing :
A nanopore is simply a small hole with an internal diameter of about 1 nanometer. Many porous transmembrane proteins act as nanopores, and nanopores have also been made by etching a somewhat larger hole in a piece of silicon and then gradually filling it in using ion-beam sculpting methods, which results in a much smaller hole: the nanopore. Graphene is also being explored as a synthetic substrate for solid-state nanopores. DNA can be passed through the nanopore in various ways: for example, electrophoresis might attract the DNA towards the nanopore, through which it eventually passes, or enzymes attached to the nanopore might guide DNA towards it. The scale of the nanopore means that the DNA is forced through the hole as a long string, one base at a time, rather like thread through the eye of a needle. As it passes, each nucleotide on the DNA molecule obstructs the nanopore to a different, characteristic degree. The amount of current that can pass through the nanopore at any given moment therefore varies depending on whether the nanopore is blocked by an A, a C, a G or a T. The change in the current through the nanopore as the DNA molecule passes through represents a direct reading of the DNA sequence. Alternatively, a nanopore might be used to identify individual DNA bases as they pass through it in the correct order; this approach has been demonstrated by Oxford Nanopore Technologies.
Fig 2. Nanopore
Principle :-
When a nanopore is immersed in a conducting fluid and a voltage is applied across it, an electric current due to conduction of ions through the nanopore can be observed. The amount of current is very sensitive to the size and shape of the nanopore. If single bases, strands of DNA or other molecules pass through the nanopore, they create a characteristic change in the magnitude of the current through the nanopore.
Fig 3. Illustrates what the results of nanopore sequencing look like
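A toy decoder for the principle above: if each base blocked the pore to its own characteristic degree, the measured current would hop between four levels and base calling would reduce to nearest-level matching. The current levels and trace below are invented; real nanopore signals depend on several bases (k-mers) occupying the pore at once and require far more sophisticated base callers.

    LEVELS = {"A": 80.0, "C": 65.0, "G": 50.0, "T": 35.0}   # pA, invented values

    def call_base(current_pa):
        """Pick the base whose characteristic current level is nearest the measurement."""
        return min(LEVELS, key=lambda b: abs(LEVELS[b] - current_pa))

    trace = [78.9, 36.2, 64.1, 51.7, 79.5]        # noisy current samples, one per base
    print("".join(call_base(x) for x in trace))   # -> ATCGA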
Types of Nanopores :-
Nanopores are categorized as biological and solid-state nanopores.
A) Biological Nanopores :-
1) Alpha hemolysin :
Alpha hemolysin (αHL) is a nanopore-forming protein from bacteria that causes the lysis of red blood cells. It has been studied for over 15 years, and studies have shown that all four bases can be identified using the ionic current measured through the αHL pore. The structure of αHL is advantageous for identifying specific bases moving through the pore. The αHL pore is ~10 nm long, with two distinct 5 nm sections: the upper section consists of a larger, vestibule-like structure, while the lower section contains three possible recognition sites (R1, R2, R3) and is able to differentiate between the bases.
Sequencing using αHL :
Sequencing using αHL has been developed through basic study and structural mutation, moving towards the sequencing of very long reads. Protein mutation of αHL has improved the detection abilities of the pore. The next proposed step is to bind an exonuclease onto the αHL pore; the enzyme would periodically cleave single bases, enabling the pore to identify successive bases. Coupling an exonuclease to the biological pore would also slow the translocation of the DNA through the pore and increase the accuracy of data acquisition.
A recent study has pointed to the ability of αHL to detect nucleotides at two separate sites in the lower half of the pore. The R1 and R2 sites enable each base to be monitored twice as it moves through the pore, creating 16 different measurable ionic current values instead of 4. This improves on a single read through the nanopore by doubling the number of sites at which the sequence is read per pore.
Fig 4. Nanopore DNA Sequencer
2) Mycobacterium smegmatis porin A (MspA) :
Mycobacterium smegmatis porin A (MspA) is the second biological nanopore currently being investigated for DNA sequencing. The MspA pore has been identified as a potential improvement over αHL due to its more favorable structure. The pore is described as a goblet with a thick rim and a diameter of 1.2 nm at its bottom. Natural MspA, while favorable for DNA sequencing in shape and diameter, has a negatively charged core that prohibits single-stranded DNA (ssDNA) translocation; the natural nanopore was therefore modified to improve translocation by replacing three negatively charged aspartic acids with neutral asparagines.
Fig 5. Mycobacterium smegmatis porin A (MspA)
MspA is more suitable for sequencing than alpha hemolysin :
Electric-current detection of nucleotides across the membrane has been shown to be tenfold more specific with MspA than with αHL for identifying bases. Utilizing this improved specificity, a group at the University of Washington has proposed using double-stranded DNA (dsDNA) between each single-stranded molecule to hold the base in the reading section of the pore; the dsDNA places the base in the correct section of the pore and enables identification of the nucleotide. A recent grant has been awarded to a collaboration between UC Santa Cruz, the University of Washington and Northeastern University to improve the base recognition of MspA using the phi29 polymerase in conjunction with the pore.
B) Solid State Nanopores :-
Solid state nanopore sequencing approaches, unlike biological nanopore sequencing, do not incorporate
proteins into their systems. Instead, solid state nanopore technology uses various metal or metal alloy
substrates with nanometer sized pores that allow DNA or RNA to pass through. These substrates most
often serve integral roles in the sequence recognition of nucleic acids as they translocate through the
channels along the substrates.
Fig 6. Solid State Nanopore Sequencing
1) Electron tunneling :
Measurement of electron tunneling through bases as ssDNA translocates through the nanopore is an
improved solid state Nanopore sequencing method. Most research has focused on proving bases could
be determined using electron tunneling. These studies were conducted using a scanning probe
microscope as the sensing electrode, and have proved that bases can be identified by specific tunneling
currents. After the proof of principle research, a functional system must be created to couple the solid
state pore and sensing devices.
Fig 7. Electron Tunneling
Researchers at the Harvard Nanopore group have engineered solid-state pores with single-walled carbon nanotubes across the diameter of the pore. Arrays of pores are created, and chemical vapor deposition is used to grow nanotubes across the array. Once a nanotube has grown across a pore, the diameter of the pore is adjusted to the desired size. Successful creation of a nanotube coupled with a pore is an important step towards identifying bases as ssDNA translocates through the solid-state pore. Another method is the use of nanoelectrodes on the two sides of a pore. The electrodes are specifically fabricated so that a solid-state nanopore can form between them. This technology could be used not only to sense the bases but also to help control the speed and orientation of base translocation.
2) Fluorescence :
An effective technique to determine a DNA sequence has been developed using solid-state nanopores and fluorescence. This fluorescence sequencing method converts each base into a characteristic representation of multiple nucleotides, which bind to a fluorescent probe strand, forming dsDNA. In the proposed two-color system, each base is identified by two distinct fluorescence signals and is therefore converted into two specific sequences. Probes consist of a fluorophore and a quencher at the start and end of each sequence, respectively; each fluorophore is extinguished by the quencher at the end of the preceding sequence. When the dsDNA translocates through a solid-state nanopore, the probe strand is stripped off and the upstream fluorophore fluoresces.
Capacity : This sequencing method has a capacity of 50-250 bases per second per pore; with a four-color fluorophore system (each base converted to one sequence instead of two), it could sequence more than 500 bases per second.
Advantages of fluorescence sequencing method :
The advantage of this method is the clear sequencing readout obtained using a camera, instead of noisy current measurements. However, the method requires sample preparation to convert each base into an expanded binary code before sequencing: instead of one base being identified as it translocates through the pore, ~12 bases are required to determine the sequence of one base.
Epigenetic DNA modifications and Nanopore DNA sequencing :-
Sequencing reveals genetic variations, which determine each person's risk for many diseases, as well as which drugs will work best for each individual. Cancer centers are already sequencing tumors in search of variations that make some resistant to chemotherapy, and global sequencing studies seek to find the genetic contributors to a variety of conditions, from autism to diabetes. Nanopore technology can also be used to identify subtle DNA modifications that accumulate over the lifetime of an individual. Such modifications, referred to as epigenetic DNA modifications, take place as chemical reactions on the DNA within cells and tell the cells how to interpret their DNA. While essential for proper cellular functioning, epigenetic modifications can also be the underlying causes of various undesired conditions. Epigenetic modifications are important in diseases such as cancer, and the ability to directly identify epigenetic changes during sequencing is one of the attractions of the nanopore method.
Nanopore sequencing apparatus :-
The development of solid-state nanopores and the studies of DNA translocation through these nanopores suggest how a nanopore could be the core of an instrument capable of inexpensive de novo sequencing. The goal is to investigate and develop the basic science and technology required to build a nanopore-based instrument that is able to sequence a mammalian genome for <$1,000 and that meets the following requirements :
a) High-speed sequential identification of the DNA nucleotides directly on the basis of their distinct physical or electrical properties.
b) Very long, indefinite-length reads. Analysis and assembly are a bottleneck in de novo sequencing, and limit re-sequencing when copy-number polymorphisms or variable indels are to be identified in heterozygous genomes.
c) The requisite sequence coverage (7.7-fold coverage, 6.5-fold coverage in Q20 bases) using genomic DNA from <10^6 cells with no amplification and minimal preparative steps; otherwise, amplification or other preparatory steps become limiting.
Such an instrument would have several unique capabilities, among them four well-demonstrated features :
1) A nanoscale device that translocates polymer molecules in sequential monomer order through a very small volume of space, a small pore in an electrically biased membrane.
2) A single-molecule detector that is also a very high-throughput device : a nanopore can probe thousands of different molecules, or thousands of identical molecules, in a few minutes.
3) A detector that directly converts characteristic features of the translocating polymer into an electrical signal; transduction and recognition occur in real time, on a molecule-by-molecule basis.
4) A device that can probe very long lengths of DNA. While practical considerations may limit the length of DNA that can be analyzed as it translocates through a nanopore, there are no known theoretical limits.
Advantages of nanopore sequencing technology :- The potential is that a single molecule of DNA can be sequenced directly using a nanopore, without the need for an intervening PCR amplification step, a chemical labelling step, or optical instrumentation to identify the chemical label.
Drawbacks of nanopore sequencing technique :-
As of July 2010, publicly available information indicated that nanopore sequencing was still in the development stage, with some laboratory-based data to back up the different components of the sequencing method. Despite these advancements, nanopore sequencing was not yet commercially available or in routine use, and not cost-effective enough to compete with next-generation sequencing methods. Nanopore-based DNA analysis techniques are being industrially developed by Oxford Nanopore Technologies (developing direct exonuclease sequencing and strand sequencing using protein nanopores, and solid-state sequencing through internal R&D and collaborations with academic institutions), NabSys (using a library of DNA probes and using nanopores to detect where these probes have hybridized to single-stranded DNA) and NobleGen (using nanopores in combination with fluorescent labels). IBM has reported research projects on computer simulations of the translocation of a DNA strand through a solid-state nanopore, but no projects on identifying the DNA bases on that strand.
D) Next Generation Sequencing :-
❖ Next Generation Sequencing (NGS) is a powerful platform that has enabled the sequencing of
thousands to millions of DNA molecules simultaneously.
❖ Next-generation sequencing (NGS), also known as high-throughput sequencing, is the catch-all
term used to describe a number of different modern sequencing technologies.
❖ The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies, which produce thousands or millions of sequences at once.
❖ They are intended to lower the cost of DNA sequencing beyond what is possible with standard
dye-terminator methods.
❖ Thus, these recent technologies allow us to sequence DNA and RNA much more quickly and
cheaply than the previously used Sanger sequencing, and as such have revolutionized the study
of genomics and molecular biology.
1) Lynx Therapeutics’ massively parallel signature sequencing (MPSS) :
• It is considered as the first of the “next-generation” sequencing technologies.
• MPSS was developed in the 1990s at Lynx Therapeutics, a company founded in 1992 by Sydney
Brenner and Sam Eletr.
• MPSS is an ultra high throughput sequencing technology.
• When applied to expression profiling, it reveals almost every transcript in the sample and provides its accurate expression level.
• MPSS was a bead-based method that used a complex approach of adapter ligation followed by
adapter decoding, reading the sequence in increments of four nucleotides; this method made it
susceptible to sequence-specific bias or loss of specific sequences.
• However, the essential properties of the MPSS output were typical of later “next-gen” data
types, including hundreds of thousands of short DNA sequences.
• In the case of MPSS, these were typically used for sequencing cDNA for measurements of gene
expression levels.
Fig 8. Massively Parallel Signature Sequencing (MPSS)
2) Polony sequencing :
➢ It is an inexpensive but highly accurate multiplex sequencing technique that can be used to read millions of immobilized DNA sequences in parallel.
➢ The technique was first developed in Dr. George Church’s laboratory at Harvard Medical School.
➢ It combined an in vitro paired-tag library with emulsion PCR, an automated microscope, and ligation-based sequencing chemistry to sequence an E. coli genome at an accuracy of >99.9999% and at approximately one tenth the cost of Sanger sequencing.
Fig 9. Polony Sequencing
APPLICATIONS OF DNA SEQUENCING :-
DNA sequencing may be used to determine the sequence of individual genes, larger genetic regions (i.e.
clusters of genes or operons), full chromosomes, or entire genomes of any organism. DNA sequencing is
also the most efficient way to indirectly sequence RNA or proteins (via their open reading frames). In
fact, DNA sequencing has become a key technology in many areas of biology and other sciences such as
medicine, forensics, and anthropology.
1) Molecular biology :
Sequencing is used in molecular biology to study genomes and the proteins they encode. Information
obtained using sequencing allows researchers to identify changes in genes, associations with diseases
and phenotypes, and identify potential drug targets.
2) Metagenomics :
The field of metagenomics involves identification of organisms present in a body of water, sewage, dirt,
debris filtered from the air, or swab samples from organisms. Knowing which organisms are present in a
particular environment is critical to research in ecology, epidemiology, microbiology, and other fields.
Sequencing enables researchers to determine which types of microbes may be present in a microbiome,
for example.
3) Virology :
As most viruses are too small to be seen by a light microscope, sequencing is one of the main tools in
virology to identify and study viruses. Viral genomes can be based on DNA or RNA. RNA viruses are more time-sensitive for genome sequencing, as they degrade faster in clinical samples. Traditional
Sanger sequencing and next-generation sequencing are used to sequence viruses in basic and clinical
research, as well as for the diagnosis of emerging viral infections, molecular epidemiology of viral
pathogens, and drug-resistance testing. There are more than 2.3 million unique viral sequences in
GenBank. Recently, NGS has surpassed traditional Sanger as the most popular approach for generating
viral genomes.
4) Medicine :
Medical technicians may sequence genes (or, theoretically, full genomes) from patients to determine whether there is a risk of genetic disease. This is a form of genetic testing, though some genetic tests may not involve DNA sequencing. DNA sequencing may also be useful for identifying a specific bacterium, allowing more precise antibiotic treatment and thereby reducing the risk of creating antimicrobial resistance in bacterial populations.
5) Forensics :
DNA sequencing may be used along with DNA profiling methods for forensic identification and paternity testing. DNA testing has evolved tremendously in the last few decades, to the point where a DNA print can be linked to the sample under investigation. The DNA patterns in fingerprints, saliva, hair follicles, etc. uniquely distinguish each living organism from another. DNA testing can detect specific genome regions in a DNA sample to produce a unique and individualized pattern.
1.6 GENOME MAPPING
❖ Genome mapping begins with the creation of a genetic map assigning DNA fragments to chromosomes; this provides the first level of genome mapping.
❖ When a genome is first investigated, this map is nonexistent. The map improves with scientific progress and is complete when the genomic DNA of the species has been fully sequenced.
❖ During this process, and for the investigation of differences between strains, the fragments are identified by small tags. These may be genetic markers or the unique sequence-dependent pattern of DNA-cutting enzymes.
❖ The term mapping is used in two different contexts, but the two are inter-related.
❖ Genetic mapping uses classical genetic techniques (e.g. pedigree analysis or breeding
experiments) to determine sequence features within a genome.
❖ Using modern molecular biology techniques for the same purpose is usually referred to as
physical mapping.
There are two types of genome mapping,
1) Genetic Mapping :
• A genetic map is a representation of the distance between two DNA elements based upon the frequency at which recombination occurs between them; a worked example follows below.
• The first genetic map of a chromosome was constructed by Alfred Sturtevant, using data from Drosophila mating crosses collected by Thomas Morgan.
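A worked example of the underlying arithmetic, with invented cross counts: the recombination frequency between two markers estimates their separation, 1% recombinants being defined as 1 centimorgan (cM). For widely separated markers the frequency saturates near 50%, so the simple proportionality holds only over short distances.

    def map_distance_cM(recombinant, total):
        """Map distance as recombination frequency, in centimorgans."""
        return 100.0 * recombinant / total

    # e.g. 170 recombinant offspring out of 1,000 scored in a test cross:
    print(map_distance_cM(170, 1000))   # 17.0 cM between the two markers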
2) Physical Mapping :
• The physical map of a genome is a map of genetic markers made by analyzing a genomic DNA
sequence directly.
• Along with genetic maps, physical maps for each chromosome within the genome can be
constructed.
• A variety of different techniques have been used to construct physical maps in the absence of
complete DNA sequence.
Techniques involved in creating Physical Maps :-
a) Restriction maps
b) Radiation hybrid maps
c) STS maps
1) Restriction Map :
• A restriction map is a map of the known restriction sites within a sequence of DNA. Restriction mapping requires the use of restriction enzymes.
• In molecular biology, restriction maps are used as a reference to engineer plasmids or other
relatively short pieces of DNA, and sometimes for longer genomic DNA.
• There are other ways of mapping features on DNA for longer length DNA molecules, such as
mapping by transduction.
Method of making these restriction maps :
• The experimental procedure first requires an aliquot of purified plasmid DNA for each digest to
be run. Digestion is then performed with each enzyme chosen. The resulting samples are
subsequently run on an electrophoresis gel, typically on agarose gel.
• The first step following the completion of electrophoresis is to add up the sizes of the fragments in each lane. The sum of the individual fragments should equal the size of the original fragment, and the fragments of each digest should also sum to the same total as those of every other digest.
• If fragment sizes do not properly add up, there are two likely problems.
a) In one case, some of the smaller fragments may have run off the end of the gel. This frequently occurs if the gel is run too long.
b) In the second case, the gel was not dense enough and therefore was unable to resolve fragments close in size, leading to a lack of separation between them.
• If all of the digests produce fragments whose sizes add up correctly, one may infer the positions of the REN (restriction endonuclease) sites by placing them at positions on the original DNA fragment that satisfy the fragment sizes produced by all three digests, e.g. (a consistency-check sketch follows this list) :
➢ EcoRI
➢ HindIII
➢ EcoRI + HindIII
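The bookkeeping described above can be automated. The sketch below checks the size sums for a hypothetical 10 kb circular plasmid whose (invented) single and double digests are consistent with EcoRI sites at positions 0 and 6000 and HindIII sites at 2000 and 9000.

    digests = {
        "EcoRI":          [6000, 4000],
        "HindIII":        [7000, 3000],
        "EcoRI+HindIII":  [4000, 3000, 2000, 1000],
    }
    plasmid_size = 10_000

    for enzyme, fragments in digests.items():
        ok = sum(fragments) == plasmid_size
        print(f"{enzyme:15s} sum={sum(fragments):6d}  consistent={ok}")
    # A circular map satisfying all three digests: EcoRI at 0 and 6000,
    # HindIII at 2000 and 9000, giving double-digest fragments
    # 2000 / 4000 / 3000 / 1000 bp.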
2) Radiation Hybrid Maps :
• Radiation hybrid mapping (also known as RH mapping) is a technique for mapping mammalian chromosomes.
• A radiation hybrid is a cell that carries small DNA fragments from the genome of another organism, e.g. human.
• Irradiating human cells with X-rays causes random breaks within the DNA.
• The size of the resulting fragments decreases as the radiation dose increases.
• These radiation levels are sufficient to kill the human cells, but fusion of the irradiated cells with hamster cells in vitro rescues the fragments.
• The hybrid cells are then analyzed for the human genetic markers they carry.
Fig. Radiation Hybrid Construction
3) Sequence Tagged Sites (STS) :
• An STS (sequence-tagged site) is a DNA fragment of 100-200 bp in length generated by PCR using primers based on an already known sequence.
• The genomic site is tagged by its ability to hybridize with that sequence.
• STSs can be generated from previously cloned genes or from other random non-gene sequences.
• An STS map of the human genome has been constructed using a series of radiation hybrids.
A) Low Resolution Physical Mapping :
Low-resolution physical mapping is typically capable of resolving DNA ranging from one base pair to several megabases. In this category, most mapping methods involve generating a somatic cell hybrid panel, which makes it possible to map any human DNA sequence of interest to a specific chromosome in a background of animal cells, such as those of mice and hamsters. The hybrid cell panel is produced by collecting hybrid cell lines containing human chromosomes, identified by polymerase chain reaction (PCR) screening with primers specific to the human sequence of interest. The human chromosome carrying that sequence will be present in all of the positive cell lines.
There are different approaches to producing low-resolution physical maps, including chromosome-mediated gene transfer and irradiation fusion gene transfer, both of which generate the hybrid cell panel. Chromosome-mediated gene transfer is a process that coprecipitates human chromosome fragments with calcium phosphate onto the recipient cell line, leading to stably transformed recipients retaining human chromosome fragments ranging in size from 1 to 50 megabase pairs. Irradiation fusion gene transfer produces radiation hybrids, which contain the human sequence of interest and a random set of other human chromosome fragments. Markers from the human chromosome fragments in radiation hybrids give cross-reactivity patterns, which are further analyzed to generate a radiation hybrid map by ordering the markers and breakpoints. This provides evidence on whether markers are located on the same human chromosome fragment, and hence on the order of the gene sequences.
B) High Resolution Physical Mapping :
High-resolution physical mapping can resolve DNA from hundreds of kilobases down to a single nucleotide. A major technique for mapping such large DNA regions is high-resolution FISH mapping, which can be achieved by hybridizing probes to extended interphase chromosomes or to artificially extended chromatin. Since their hierarchical structure is less condensed compared to prometaphase and metaphase chromosomes, the standard in situ hybridization targets, a higher resolution of physical mapping can be produced.
FISH mapping using interphase chromosomes is a conventional in situ method for mapping DNA sequences from 50 to 500 kilobases, mainly syntenic DNA clones. However, naturally extended chromosomes may fold back and produce alternative physical map orders; as a result, statistical analysis is necessary to generate an accurate map order from interphase chromosomes.
If artificially stretched chromatin is used instead, mapping resolutions can extend over 700 kilobases. To produce extended chromosomes on a slide, direct visual hybridization (DIRVISH) is often carried out, in which cells are lysed with detergent so that the DNA released into the solution flows to the other end of the slide. An example of high-resolution FISH mapping using stretched chromatin is extended chromatin fiber (ECF) FISH. This method infers the order of the desired regions on the DNA sequence by analyzing the partial overlaps and gaps between yeast artificial chromosomes (YACs); eventually, the linear sequence of the DNA regions of interest can be determined. Note that if metaphase chromosomes are used in FISH mapping, the resulting resolution is very poor, and such mapping is classified as low-resolution rather than high-resolution mapping.
Applications of Physical Mapping :-
Physical mapping is a technique to complete the sequencing of a genome. Ongoing projects that
determine DNA base pair sequences, namely the Human Genome Project, give knowledge on the order
of nucleotide and allow further investigation to answer genetic questions, particularly the association
between the target sequence and the development of traits. From the individual DNA sequence isolated
and mapped in physical mapping, it could provide information on the transcription and translation
process during development of organisms, hence identifying the specific function of the gene and
associated traits produced. As a result of understanding the expression and regulation of the genes,
potential new treatments can be developed to alter protein expression patterns in specific tissues.
Moreover, if the location and sequence of disease genes are identified, medical advice can be given to potential patients who are carriers of the disease gene, with reference to the knowledge of the gene's function and products.
Advantages of these physical mapping techniques :-
➢ Provide approximate distances and order between DNA sequences.
➢ Yield a framework onto which sequence information can be applied.
➢ Restriction mapping provides highly reliable fragment ordering and distance estimation.
➢ RH mapping allows the relative likelihoods of alternative marker orders to be determined.
➢ The RH procedure was used to map 14 DNA probes from a region of human chromosome 21 spanning 20 Mbp.
➢ The STS technique has been used to order inserts from individual human chromosomes in a YAC library.
➢ That ordering was later questioned, however, because some YACs contained DNA from more than one region of the human genome.
➢ The physical maps proved useful in producing ordered library clones.
1.7 GENE SEQUENCE ANALYSIS AND ANNOTATION
In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any
of a wide range of analytical methods to understand its features, function, structure, or evolution.
Methodologies used include sequence alignment, searches against biological databases, and others.
Since the development of methods of high-throughput production of gene and protein sequences, the
rate of addition of new sequences to the databases increased exponentially. Such a collection of
sequences does not, by itself, increase the scientist’s understanding of the biology of organisms.
However, comparing these new sequences to those with known functions is a key way of understanding
the biology of an organism from which the new sequence comes. Thus, sequence analysis can be used to
assign function to genes and proteins by the study of the similarities between the compared sequences.
Nowadays, there are many tools and techniques that provide the sequence comparisons (sequence
alignment) and analyze the alignment product to understand its biology.
Sequence analysis in molecular biology includes a very wide range of relevant topics :
1) The comparison of sequences in order to find similarity, often to infer if they are related
(homologous)
2) Identification of intrinsic features of the sequence such as active sites, post translational
modification sites, gene-structures, reading frames, distributions of introns and exons and
regulatory elements
3) Identification of sequence differences and variations, such as point mutations and single nucleotide polymorphisms (SNPs), in order to obtain genetic markers.
4) Revealing the evolution and genetic diversity of sequences and organisms
5) Identification of molecular structure from sequence alone
DNA Annotation :-
DNA annotation or genome annotation is the process of identifying the locations of genes and all of the
coding regions in a genome and determining what those genes do. Once a genome is sequenced, it
needs to be annotated to make sense of it.
For DNA annotation, a previously unknown sequence representation of genetic material is enriched with
information relating genomic position to intron-exon boundaries, regulatory sequences, repeats, gene
names and protein products. This annotation is stored in genomic databases such as Mouse Genome
Informatics, FlyBase, and WormBase. Educational materials on some aspects of biological annotation
from the 2006 Gene Ontology annotation camp and similar events are available at the Gene Ontology
website.
The National Center for Biomedical Ontology (www.bioontology.org) develops tools for automated
annotation of database records based on the textual descriptions of those records.
As a general method, dcGO has an automated procedure for statistically inferring associations between
ontology terms and protein domains or combinations of domains from the existing gene/protein-level
annotations.
Process of Genome Annotation :
Genome annotation consists of three main steps :
1) Identifying portions of the genome that do not code for proteins
2) Identifying elements on the genome, a process called gene prediction
3) Attaching biological information to these elements
Automatic annotation tools attempt to perform these steps via computer analysis, as opposed to
manual annotation (a.k.a. curation) which involves human expertise. Ideally, these approaches co-exist
and complement each other in the same annotation pipeline.
A simple method of gene annotation relies on homology-based search tools, such as BLAST, to search for homologous genes in specific databases; the resulting information is then used to annotate genes and genomes. However, as information is added to the annotation platform, manual annotators become
capable of deconvoluting discrepancies between genes that are given the same annotation. Some
databases use genome context information, similarity scores, experimental data, and integrations of
other resources to provide genome annotations through their Subsystems approach. Other databases
(e.g. Ensembl) rely on curated data sources as well as a range of different software tools in their
automated genome annotation pipeline.
Structural annotation consists of the identification of genomic elements; a minimal ORF-scanning sketch follows the list below.
• ORFs and their localization
• Gene structure
• Coding regions
• Location of regulatory motifs
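As a minimal example of the first structural-annotation task in the list above, the sketch below scans all six reading frames of a toy sequence for open reading frames, defined here simply as ATG ... stop codon above a length cutoff. Real gene predictors layer codon-usage statistics, splice-site models and homology evidence on top of this kind of scan; the sequence used is invented.

    COMPLEMENT = str.maketrans("ACGT", "TGCA")
    STOPS = {"TAA", "TAG", "TGA"}

    def find_orfs(seq, min_len=30):
        """Yield (strand, frame, start, end) of ORFs; coordinates on the scanned strand."""
        for strand, s in ((+1, seq), (-1, seq.translate(COMPLEMENT)[::-1])):
            for frame in range(3):
                start = None
                for i in range(frame, len(s) - 2, 3):
                    codon = s[i:i + 3]
                    if codon == "ATG" and start is None:
                        start = i
                    elif codon in STOPS and start is not None:
                        if i + 3 - start >= min_len:
                            yield (strand, frame, start, i + 3)
                        start = None

    dna = "TTATGAAACCCGGGTTTAAATAGCC"   # toy sequence with one short ORF
    print(list(find_orfs(dna, min_len=12)))   # -> [(1, 2, 2, 23)]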
Functional annotation consists of attaching biological information to genomic elements.
• Biochemical function
• Biological function
• Involved regulation and interactions
• Expression
These steps may involve both biological experiments and in silico analysis. Proteogenomics based
approaches utilize information from expressed proteins, often derived from mass spectrometry, to
improve genomics annotations.
A variety of software tools have been developed to permit scientists to view and share genome
annotations; for example, MAKER.
Genome annotation remains a major challenge for scientists investigating the human genome, now that
the genome sequences of more than a thousand human individuals (The 100,000 Genomes Project, UK)
and several model organisms are largely complete. Identifying the locations of genes and other genetic
control elements is often described as defining the biological “parts list” for the assembly and normal
operation of an organism. Scientists are still at an early stage in the process of delineating this parts list
and in understanding how all the parts “fit together”.
Genome annotation is an active area of investigation and involves a number of different organizations in
the life science community which publish the results of their efforts in publicly available biological
databases accessible via the web and other electronic means. Here is an alphabetical listing of on-going
projects relevant to genome annotation :
➢ Encyclopedia of DNA elements (ENCODE)
➢ Entrez Gene
➢ Ensembl
➢ GENCODE
➢ Gene Ontology Consortium
➢ GeneRIF
➢ RefSeq
➢ Uniprot
➢ Vertebrate and Genome Annotation Project (Vega)
1.8 GENOME PROJECTS
Genome projects are scientific endeavours that ultimately aim to determine the complete genome
sequence of an organism (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist or a
virus) and to annotate protein-coding genes and other important genome-encoded features. The
genome sequence of an organism includes the collective DNA sequences of each chromosome in the
organism. For a bacterium containing a single chromosome, a genome project will aim to map the
sequence of that chromosome. For the human species, whose genome includes 22 pairs of autosomes
and 2 sex chromosomes, a complete genome sequence will involve 46 separate chromosome
sequences.
1.8.1 GENOME PROJECT : E. COLI :-
In September 1997, the complete genome sequence of Escherichia coli was published. E. coli bacteria
live in the lower intestinal tract of animals. It is one of the many bacteria that reside in our bodies,
normally causing no harm. Biochemists and geneticists had long used E. coli to study the basic chemical
reactions of life and to obtain some of the first clues about how gene action is regulated. The complete
sequence of the E. coli genome was expected to help scientists learn even more about a bacterium they
had studied for many years.
The strain of E. coli used for the sequencing project is not a pathogen (that is, it does not cause disease).
However, some strains of E. coli can cause illness, such as food poisoning. Comparing the normal strain
with pathogenic strains is expected to help suggest treatments for these illnesses and strategies to
prevent infection.
Escherichia coli K-12 Genome :-
Since the genome of Escherichia coli K-12 was initially annotated in 1997, additional functional
information based on biological characterization and functions of sequence-similar proteins has become
available. On the basis of this new information, an updated version of the annotated chromosome has
been generated.
Number of genes in the E. coli K-12 genome :
For the initial annotation of the E. coli K-12 genome, 4,404 genes were identified with Blattner numbers
(Bnums). Among the genes, 4,288 were believed to encode proteins and 116 to encode RNAs. Since then
six Bnums have been retired : b0322, b0395, b0663, b0667, b0669 and b0671. In addition, three new
genes have been identified and assigned Bnums. These include the protein-coding b4406 (yaeP,
SWISS-PROT P52099) and b4407 (thiS, SWISS-PROT O32583) and the RNA-encoding b4408. The current
number of E. coli genes is 4,401, with 4,285 encoding proteins and 116 encoding RNAs.
Fig 1. E. coli Genome Map
Proteins as modular entities :
Some of the proteins encoded in the E. coli genome have arisen through fusion of two or more genes.
Examples of such gene fusions are the multifunctional enzymes Aas (2-acylglycerophospho-
ethanolamine acyl transferase and acyl-acyl carrier protein synthetase) and GlmU (N-acetyl
glucosamine-1-phosphate uridyltransferase and glucosamine-1-phosphate acetyl transferase). We have
chosen to deal with proteins as modular entities where a module is defined as a protein element that
has at least 100 amino-acid residues, carries a biological function and is presumed to have an
independent evolutionary history. Most modules in E. coli are individual proteins. They can, however,
also be part of a protein where multiple modules have been joined by gene fusion, as is the case for Aas
and GlmU. Other protein types in E. coli such as transporters and regulators also involve gene fusion
events. The current modular assignments are based on analysis of protein sequences within E. coli K-12
(P. Liang and M. Riley, unpublished data).
Functional annotation of E. coli K-12 gene products :-
The functional assignments of the E. coli gene products in the November 97 GenBank U00096 deposit
represented an accumulation of information retrieved from the literature (collected in the GenProtEC
and EcoCyc databases) as well as imputed functions based on similarity of a known protein to the
translated sequences. Since the deposit to GenBank was made, our database GenProtEC has continually
been updated with knowledge on E. coli gene products appearing in the literature. Information on
transcriptional regulators has been incorporated from the work of J. Collado-Vides, and transport
protein information has been adapted from the work of M.H. Saier and I.T. Paulsen. GenProtEC also
contains imputed function assignments based on sequence similarity to orthologous or paralogous
proteins, on gene (operon) location and on phenotypes of mutants.
Gene products whose functions were known were not considered further for the functional update. The
remaining 2,294 CDSs whose gene products had a putative or unknown function assignment were
analyzed using BLAST and DARWIN. BLAST analyses were carried out for both the Bnum- and the
Magnum-derived protein sequences. The results for the Bnum-derived protein sequences and the
automatic functions predicted by MAGPIE or HERON (human-emulated reasoning for objective
notations) were manually evaluated and imputed functions were assigned. Although the manual
annotation step could not compete with the speed of the automatic annotation process of HERON, it
provided us with more useful function descriptions. A comparison of the manually assigned putative
functions with the HERON-predicted functions showed that, leaving aside issues of specificity, a nearly
equivalent function was predicted in 46% of cases, whereas in 52% of cases HERON yielded less
information.
A) Automated annotation :-
1) MAGPIE ORF prediction :
A three-step approach to ORF prediction was taken to prepare the MAGPIE project for E. coli. GLIMMER
2.0 with a minimum ORF length of 80 nucleotides was initially used to create the base set of predictions.
GLIMMER 2.0 was run with all default parameters, as recommended in the documentation, and trained on
the annotated set of ORFs from the Blattner et al. release of 1997. Because GLIMMER selectively
identifies ORFs that match a statistical model of a gene for the organism, GLIMMER may miss genes that
were laterally transferred or acquired more recently from other genomes. We therefore chose to
combine the GLIMMER predictions with those of a syntactic tool encoded within MAGPIE. This tool
identifies stop codons and then ‘backtracks’ to the farthest upstream acceptable in-frame start codon
and defines this as the ORF. A non-redundant set of all GLIMMER ORFs plus syntactic ORFs between
GLIMMER ORFs was generated. Finally, ORFs annotated by Blattner et al. that were not present in the
non-redundant set were added to the MAGPIE project.
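The "syntactic" rule described above can be illustrated with a toy Python sketch (a simplification on the
forward strand only; the real MAGPIE tool is more elaborate) :

STOPS = {"TAA", "TAG", "TGA"}

def syntactic_orfs(seq, min_len=80):
    # For each reading frame, remember the farthest upstream in-frame ATG
    # since the last stop codon; each stop codon then closes an ORF.
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS:
                if start is not None and (i + 3 - start) >= min_len:
                    orfs.append((start, i + 3))
                start = None          # reset after every stop codon
    return orfs

# e.g. syntactic_orfs("ATG" + "GCT" * 30 + "TAA") -> [(0, 96)]

The 80-nucleotide minimum mirrors the GLIMMER setting mentioned above; combining such syntactic
calls with model-based predictions catches genes the statistical model misses.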
2) BLAST analysis :
The CDSs were compared to the NCBI nucleotide (nt) and non-redundant protein (nr) databases using
gapped BLAST. Protein-sequence motifs were identified by PROSITE. A search against the MAGPIE-
predicted proteins of over 40 completed genomes, including the previously annotated E. coli set, was
also performed.
Functional annotation :
Automated function annotation was provided using HERON. Description lines with low information
content (for example, descriptions containing words such as “hypothetical” or “putative”) were filtered
out. HERON then calculated word frequencies in the remaining descriptions, identified the top three
most common words, and selected the description of the highest-scoring sequence match (for
homology comparisons) with one or more high-frequency words. The selected description became the
automated annotation for the coding region.
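The word-frequency idea can be sketched roughly as follows (a simplification written for illustration,
not the actual HERON code; hits is an assumed list of (bit score, description) pairs for one coding
region) :

from collections import Counter

LOW_INFO = {"hypothetical", "putative", "unknown", "probable"}

def heron_like_annotation(hits):
    # 1. Filter out descriptions with low information content.
    kept = [(s, d) for s, d in hits if not (LOW_INFO & set(d.lower().split()))]
    if not kept:
        return None
    # 2. Word frequencies over the remaining descriptions; take the top three words.
    freq = Counter(w for _, d in kept for w in d.lower().split())
    top_words = {w for w, _ in freq.most_common(3)}
    # 3. Highest-scoring hit whose description uses a high-frequency word.
    for score, desc in sorted(kept, reverse=True):
        if top_words & set(desc.lower().split()):
            return desc
    return None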
B) Manual annotation :-
1) BLAST analysis :
The protein sequences collected from GenBank Accession U00096 were compared to the nr database
using gapped BLAST.
2) DARWIN analysis :
DARWIN (version 2.0) was used to detect sequence-similar proteins within E. coli K-12 and in 20
additional microbial genomes (P. Liang and M. Riley, unpublished data). In addition to orthologous
matches, groups of paralogous proteins of E. coli K-12 were generated on the basis of the DARWIN
results. In our hands, DARWIN is particularly successful in identifying distant sequence similarities, a
consequence no doubt of the application of multiple substitution matrices optimized for the organism
and for each sequence pair.
Functional annotation :
Functions were assigned to gene products on the basis of a manual evaluation of the results from the
BLAST and DARWIN analyses. The automatic function prediction was also taken into account. In addition
to incorporating recent experimental information, a substantial amount of human judgment was
brought to bear.
1.8.2 GENOME PROJECT : ARABIDOPSIS :-
Thale cress, Arabidopsis thaliana, is a member of one of the largest families of flowering plants, the
Brassicaceae, to which mustards, radishes and cabbages also belong. A. thaliana is thought to have
originated in Central Asia and spread from there throughout Eurasia. During the last glaciation, A.
thaliana was confined to the southern limit of its range, and after the ice retreated, much of Europe was
recolonized by different populations, resulting in complex admixture patterns. Today, A. thaliana occurs
throughout the Northern Hemisphere, mostly in temperate regions, from the mountains of North Africa
to the Arctic Circle. Like many other European plants, it has also invaded North America, most probably
during historic times.
The ascendancy of A. thaliana to become one of the most popular species in basic plant research,
despite its lack of economic value, is due to the favorable genetics of this plant. It has a diploid genome
of only about 125 to 150 Mb distributed over five chromosomes, with fewer than 30,000 protein-coding
genes. The ease with which it can be stably transformed is unsurpassed by any other multicellular
organism. Moreover, as flowering plants only appeared about 100 million years ago, they are all
relatively closely related. Indeed, key aspects of plant physiology such as flowering are highly conserved
between economically important grasses such as rice and A. thaliana.
A. thaliana was the first plant species for which a genome sequence became available. This initial
sequence was from a single inbred strain (accession), and was of very high quality, with each
chromosome represented by merely two contigs, one for each arm. In addition to functional analyses,
the 120 Mb reference sequence of the Columbia (Col-0) accession proved to be a boon for evolutionary
and ecological genetics. A particular advantage in this respect is that the species is mostly self-fertilizing,
and most strains collected from the wild are homozygous throughout the genome. This distinguishes A.
thaliana from other model organisms such as the mouse or the fruit fly. In these systems, inbred strains
have been derived, but they do not represent any individuals actually found in nature.
A first-generation haplotype map (HapMap) for A. thaliana :-
From this first set of 96 strains, 20 maximally diverse strains were chosen for much denser
polymorphism discovery using array-based resequencing. This led to the identification of about one
single nucleotide polymorphism (SNP) for every 200 bp of the genome, constituting one quarter or so of
all SNPs estimated to be present. In addition, regions that are missing or highly divergent in at least one
accession encompass about a quarter of the reference genome.
The progress made with genome-wide association (GWA) mapping in humans during the past three
years has been nothing short of phenomenal, and bodes well for applying association mapping to A.
thaliana. As in humans, linkage disequilibrium (LD), which is the basis for GWA studies, decays over
about 10 kb, the equivalent of two average genes. That the average LD in Arabidopsis is not so different
from that in humans might seem surprising, given the selfing nature of A. thaliana, but it reflects the
fact that outcrossing is not that rare, and that this species apparently has a large effective population
size. A 250 k SNP chip (containing 250,000 probes), corresponding to approximately one SNP every 480
bp, has been produced, and should predict some 90% of all non-singleton SNPs. A collection of over
6,000 A. thaliana accessions, both from stock centers and from recent collections, has been assembled, and a
subset of 1,200 genetically diverse strains will be interrogated with the 250 k SNP chip, providing a
fantastic resource for GWA studies in this species.
Fig 2. Arabidopsis thaliana Genome Map
The A. thaliana 1001 Genomes project :-
Together with partners from around the world, we have initiated a project with the goal of describing
the whole-genome sequence variation in 1,001 accessions of A. thaliana. The current technological
revolution in sequencing means that it is now feasible and inexpensive to sequence large numbers of
genomes. Indeed, a 1000 Genomes Project for humans was announced in January 2008, and the first
results of this initiative are very encouraging. It builds, in a manner similar to the A. thaliana project, on
previous HapMap information, but because of the greater complexity and repetitiveness of human
genomes, much of the initial effort for the human project will go towards comparing the feasibility of
different approaches. In contrast, even short reads of the A. thaliana sequence, such as those produced
by the first generation of Illumina’s Genome Analyzer instrument, have already been shown to support
not only the discovery of SNPs but also that of short to medium-sized indels, including the detection of
sequences not present in the reference genome.
We are proposing a hierarchical strategy to sequence the genome of A. thaliana species-wide. The first
aspect of this approach is to make use of different technologies and different depths of sequencing
coverage. A small number of genome sequences that approach the quality of the original Col-0
reference will be generated by exploiting mostly technologies such as Roche’s 454 platform, which
generates longer reads, in combination with libraries of different insert sizes, allowing long-range
assembly. A much larger number of genomes will be sequenced with a less expensive technology such
as Illumina’s Genome Analyzer or Applied Biosystems’ SOLiD and with only a single type of clone library.
For this set of accessions, local haplotype similarity will be exploited in combination with information
from the reference genomes to deduce the complete sequence, using methods similar to those employed
in inbred strains of mice. The power of this approach is in the large number of accessions that can be
sequenced. For example, even if a particular haplotype is only present at 1% frequency, and each of the
1,001 strains is only sequenced at 8× coverage, there would still be on average 80 reads for each site in
this haplotype.
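The arithmetic behind this claim is easy to check :

# Back-of-envelope check : a haplotype at 1% frequency, 1,001 strains,
# 8x sequencing coverage per strain.
carriers = 0.01 * 1001            # ~10 strains carry the haplotype
reads_per_site = carriers * 8     # each carrier contributes ~8 reads per site
print(round(reads_per_site))      # -> 80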
The second aspect of the hierarchical approach will be the sampling of ten individuals from ten
populations each in ten geographic regions throughout Eurasia, plus at least one North African accession
(10 × 10 × 10 + 1). We expect individuals from the same region to show more extensive haplotype
sharing than is observed in worldwide samples, which will be advantageous for the imputation strategy
discussed above. An argument that might be raised against this approach is the strong population
structure it entails, but we note that it is probably impossible to sample accessions in a manner that
avoids population structure completely, and that our strategy will allow us to address questions of local
adaptation, which are of great interest to evolutionary scientists. The output of the 1001 Genomes
project will be a generalized genome sequence that encompasses every A. thaliana accession analysed
as a special case. It will comprise a mosaic of variable haplotypes such that every genome can be aligned
completely against it.
The main motivation for the 1001 Genomes project is, however, to enable GWA studies in this species.
The seeds from the 1,001 accessions will be freely available from the Arabidopsis stock centers, and
each accession can be grown and phenotyped by scientists from all over the world, in as many
environments as desired. Importantly, because an unlimited supply of genetically identical individuals
will be available for each accession, even subtle phenotypes and ones that are highly sensitive to the
microenvironment, which is often difficult to control, can be measured with high confidence. The
phenotypes will include morphological analyses, such as plant stature, growth and flowering;
investigations of plant content, such as metabolites and ions; responses to the abiotic environment,
such as resistance to drought or salt stress; or resistance to disease caused by a host of prokaryotic and
eukaryotic pathogens, from microbes to insects and nematodes. In the last case, a particularly exciting
prospect is the ability to identify plant genes that mediate the effects of individual pathogen proteins,
which are normally delivered as a complex mix to the plant, as is being done in the Effectoromics
project, which has the aim of “understanding host plant susceptibility and resistance by indexing and
deploying obligate pathogen effectors”. The value of being able to correlate many different phenotypes,
including genome-wide phenotypes, has already been beautifully demonstrated for the Drosophila
Genetic Reference Panel, and we expect similar dividends for the A. thaliana project.
1.8.3 GENOME PROJECT : BOVINE :-
The genome of a female Hereford cow was published in 2009. It was sequenced by the Bovine Genome
Sequencing and Analysis Consortium, a team of researchers led by the National Institutes of Health and
the U.S. Department of Agriculture. It was part of an effort to improve livestock breeding and at the
time was one of the largest genomes ever sequenced.
The Bovine Genome Sequencing and Analysis Consortium worked to sequence the genome over a
six-year period and included 300 scientists from 25 countries.
The Bovine Genome :-
The size of the bovine genome is 3 Gb (3 billion base pairs). It contains approximately 22,000 genes of
which 14,000 are common to all mammalian species. Bovines share 80 percent of their genes with
humans; cows are less similar to humans than rodents are (humans and rodents belong to the clade of
Supraprimates). They also have about 1,000 genes shared with dogs and rodents but not identified in
humans.
The charting of key DNA differences, also known as haplotypes, between several varieties of cattle could
allow scientists to understand the role of certain genes coding for products of economic value
(milk, meat, leather). It opens new perspectives for enhancing selective breeding and changing certain
cattle characteristics for the benefit of farmers.
Bovine Genome Map :-
A bovine BAC map was constructed with HindIII restriction digest fragments of 290,797 BAC clones from
animals of three different breeds. Comparative mapping of 422,522 BAC end sequences assisted with
BAC map ordering and assembly. Genotypes and pedigree from two genetic maps and marker scores
from three whole-genome RH panels were consolidated on a 17,254-marker composite map. Sequence
similarity allowed integrating the BAC and composite maps with the bovine draft assembly (Btau3.1),
establishing a comprehensive resource describing the bovine genome. Agreement between the marker
and BAC maps and the draft assembly is high, although discrepancies exist. The composite and BAC
maps are more similar than either is to the draft assembly.
The Bovine Genome Database :-
The Bovine Genome Database supports the efforts of bovine genomics researchers by providing data
mining, genome navigation and annotation tools for the bovine reference genome based on the
Hereford cow, L1 Dominette 01449.
BGD provides tools for data mining (BovineMine), sequence database searching (BLAST), genome
browsing (JBrowse) and annotation (Apollo). The bovine reference genome assembly has been revised
several times. The newest release of BovineMine (BovineMine v1.6) includes both the ARS-UCD1.2 and
UMD3.1 genome assemblies, and the last release with only UMD3.1.1 (BovineMine v1.4) is still
available. JBrowse is also available for both ARS-UCD1.2 and UMD3.1.
BovineMine integrates the genome assemblies with a variety of data sources, including genes, proteins,
orthologs, pathways, gene ontology, gene expression, interactions, variants, QTL and publications. The
goal of BovineMine is to accelerate genomics analysis by enabling researchers without scripting skills to
create and export customized annotation datasets merged with their own research data for use in
downstream analyses. BovineMine allows researchers to leverage the curated gene pathways of model
organisms (e.g. human, mouse and rat) based on orthology, and is especially useful for GO and pathway
analyses in conjunction with GWAS and QTL studies. BovineMine also includes reference genomes of
sheep and goat so researchers can leverage information across ruminants.
1.8.4 HUMAN GENOME PROJECT :-
The Human Genome Project (HGP) was an international scientific research project that was successfully
completed in the year 2003, having sequenced the entire human genome of about 3.3 billion base pairs.
The HGP spurred the growth of bioinformatics, which is now a vast field of research. The successful
sequencing of the human genome has helped solve the mystery of many human disorders and given us
ways to cope with them.
Goals of the human genome project :
❖ Optimization of the data analysis.
❖ Sequencing the entire genome.
❖ Identification of all the genes in the human genome.
❖ Creating genome sequence databases to store the data.
❖ Taking care of the legal, ethical and social issues that the project may pose.
Methods of the human genome project :
In this project, two different and significant methods were used.
1) Expressed sequence tags (ESTs), wherein genes were identified through their expressed RNAs
(sequenced as cDNA), distinguishing the expressed parts of the genome from the rest.
2) Sequence annotation, wherein the entire genome was first sequenced and the functional tags
were assigned later.
The process of the human genome project :
• The complete gene set was isolated from a cell.
• It was then split into small fragments.
• These fragments were then cloned and amplified with the help of vectors, most commonly BACs
(bacterial artificial chromosomes) and YACs (yeast artificial chromosomes).
• The smaller fragments were then sequenced using DNA sequencers.
• On the basis of overlapping regions, the sequences were then arranged (a toy sketch of this
overlap-based assembly step follows the list below).
• All the information of this genome sequence was then stored in a computer-based program.
• This way the entire genome was sequenced and stored as a genome database in computers.
Genome mapping was the next goal, which was achieved with the help of microsatellites
(repetitive DNA sequences).
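As noted in the list above, a toy illustration of the overlap-based assembly step is the greedy merge
below, which repeatedly joins the pair of fragments with the longest exact suffix-prefix overlap. This is
only a sketch of the idea; real assemblers use graph structures and tolerate sequencing errors.

def overlap(a, b, min_len=3):
    # Length of the longest suffix of a that equals a prefix of b.
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(frags):
    frags = list(frags)
    while len(frags) > 1:
        best = (0, 0, 1)                       # (overlap length, i, j)
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j:
                    n = overlap(frags[i], frags[j])
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break                              # no overlaps left to merge
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

# e.g. greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG"])
# -> ["ATTAGACCTGCCGGAA"]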
Features :
Features of the Human genome project include :
➢ Our entire genome is made up of 3164.7 million base pairs.
➢ On average, a gene is made up of 3000 nucleotides.
➢ The function of more than 50 percent of the genes is yet to be discovered.
➢ Proteins are coded by less than 2 percent of the genome.
➢ Most of the genome is made up of repetitive sequences with no specific coding purpose, but such
sequences may help us better understand the genetic development of humanity through the
ages.
Applications of Human Genome Project :
Scientists estimate that the genomes of individual humans differ at about 0.1% of nucleotide positions.
Understanding these differences could lead to the discovery of genes behind heritable diseases, as well
as of other traits common to humans. Information gained from the HGP has already fueled many positive discoveries in
health care. Well-publicized successes include the cloning of genes responsible for Duchenne muscular
dystrophy, retinoblastoma, cystic fibrosis, and neurofibromatosis. Increasingly detailed genomic maps
have also aided researchers seeking genes associated with fragile X syndrome, types of inherited colon
cancer, Alzheimer’s disease, and familial breast cancer.
If other disease-related genes are isolated, scientists can begin to understand the structure and
pathology of other disorders such as heart disease, cancer, and diabetes. This knowledge would lead to
better medical management of these diseases and pharmaceutical discovery.
Current and potential applications of genome research will address national needs in molecular
medicine, waste control and environmental cleanup, biotechnology, energy sources, and risk
assessment.
1) Molecular Medicine :
Through genetic research, medicine will look more into the fundamental causes of diseases rather than
concentrating on treating symptoms. Genetic screening will enable rapid and specific diagnostic tests
making it possible to treat countless maladies. DNA-based tests clarify diagnosis quickly and enable
geneticists to detect carriers within families. Genomic information can indicate the future likelihood of
some diseases. As an example, if the gene responsible for Huntington’s disease is present, it may be
certain that symptoms will eventually occur, although predicting the exact time may not be possible.
Other diseases where susceptibility may be determined include heart disease, cancer, and diabetes.
Medical researchers will be able to create therapeutic products based on new classes of drugs,
immunotherapy techniques, and possible augmentation or replacement of defective genes through
gene therapy.
2) Waste Control and Environmental Cleanup :
In 1994, through advances gained by the HGP, the DOE formulated the Microbial Genome Initiative to
sequence the genomes of bacteria useful in the areas of energy production, environmental remediation,
toxic waste reduction, and industrial processing. Resulting from that project, six microbes that live under
extreme temperature and pressure conditions have been sequenced. By learning the unique protein
structure of these microbes, researchers may be able to use the organisms and their enzymes for such
practical purposes as waste control and environmental cleanup.
3) Biotechnology :
The potential for commercial development presents U.S. industry with a wealth of opportunities. Sales
of biotechnology products are projected to exceed $20 billion by the year 2000. The HGP has stimulated
significant investment by large corporations and promoted the development of new biotechnology
companies hoping to capitalize on the implications of HGP research.
4) Energy Sources :
Biotechnology, strengthened by the HGP, will be important in improving the use of fossil-based
resources. Increased energy demands require strategies to circumvent the many problems with today’s
dominant energy technologies. Biotechnology will help address these needs by providing a cleaner
means for the bioconversion of raw materials to refined products. Additionally, there is the possibility of
developing entirely new biomass-based energy sources. Having the genomic sequence of the methane-
producing microorganism Methanococcus jannaschii, for example, will allow researchers to explore the
process of methanogenesis in more detail and could lead to cheaper production of fuel-grade methane.
5) Risk Assessment :
Understanding the human genome will have an enormous impact on the ability to assess risks posed to
individuals by environmental exposure to toxic agents. Scientists know that genetic differences cause
some people to be more susceptible than others to such agents. More work must be done to determine
the genetic basis of such variability, but this knowledge will directly address the DOE’s long-term mission
to understand the effects of low-level exposures to radiation and other energy-related agents, especially
in terms of cancer risk. Additional positive spin-offs from this research include a better understanding of
biology, increased taxonomic understanding, and the development of pest-resistant and more productive
crops and livestock and of other commercially useful microorganisms.
1.9 SYNTENY : CONCEPT AND ITS DETECTION
In classical genetics, synteny describes the physical co-localization of genetic loci on the same
chromosome within an individual or species. Today, however, biologists usually refer to synteny as the
conservation of blocks of order within two sets of chromosomes that are being compared with each
other. This concept can also be referred to as shared synteny.
The classical concept is related to genetic linkage : Linkage between two loci is established by the
observation of lower-than-expected recombination frequencies between them. In contrast, any loci on
the same chromosome are by definition syntenic, even if their recombination frequency cannot be
distinguished from unlinked loci by practical experiments. Thus, in theory, all linked loci are syntenic, but
not all syntenic loci are necessarily linked. Similarly, in genomics, the genetic loci on a chromosome are
syntenic regardless of whether this relationship can be established by experimental methods such as
DNA sequencing/assembly, genome walking, physical localization or haplotype mapping.
Students of genetics employ the term synteny to describe the situation in which two genetic loci have
been assigned to the same chromosome but still may be separated by a large enough distance in map
units that genetic linkage has not been demonstrated.
Shared Synteny :-
Shared synteny (also known as conserved synteny) describes preserved co-localization of genes on
chromosomes of different species. During evolution, rearrangements to the genome such as
chromosome translocations may separate two loci, resulting in the loss of synteny between them.
Conversely, translocations can also join two previously separate pieces of chromosomes together,
resulting in a gain of synteny between loci. Stronger-than-expected shared synteny can reflect selection
for functional relationships between syntenic genes, such as combinations of alleles that are
advantageous when inherited together, or shared regulatory mechanisms.
The term is sometimes also used to describe preservation of the precise order of genes on a
chromosome passed down from a common ancestor, although many geneticists reject this use of the
term.
The analysis of synteny in the gene order sense has several applications in genomics. Shared synteny is
one of the most reliable criteria for establishing the orthology of genomic regions in different species.
Additionally, exceptional conservation of synteny can reflect important functional relationships between
genes. For example, the order of genes in the “Hox cluster”, which are key determinants of the animal
body plan and which interact with each other in critical ways, is essentially preserved throughout the
animal kingdom.
Synteny is widely used in studying complex genomes, as comparative genomics allows the presence,
and possibly the function, of genes in a simpler model organism to be used to infer those in a more complex one. For
example, wheat has a very large, complex genome which is difficult to study. In 1994 research from the
John Innes Centre in England and the National Institute of Agrobiological Research in Japan
demonstrated that the much smaller rice genome had a similar structure and gene order to that of
wheat. Further study found that many cereals are syntenic and thus plants such as rice or the grass
Brachypodium could be used as a model to find genes or genetic markers of interest which could be
used in wheat breeding and research. In this context, synteny was also essential in identifying a highly
important region in wheat, the Ph1 locus involved in genome stability and fertility, which was located
using information from syntenic regions in rice and Brachypodium.
Synteny is also widely used in microbial genomics. In Rhizobiales and Enterobacteriales, syntenic genes
encode a large number of essential cell functions and represent a high level of functional relationships.
Patterns of shared synteny or synteny breaks can also be used as characters to infer the phylogenetic
relationships among several species, and even to infer the genome organization of extinct ancestral
species. A qualitative distinction is sometimes drawn between macrosynteny, preservation of synteny in
large portions of a chromosome, and microsynteny, preservation of synteny for only a few genes at a
time.
Computational Detection :-
Shared synteny between different species can be inferred from their genomic sequences. This is
typically done using a version of the MCScan algorithm, which finds syntenic blocks between species by
comparing their homologous genes and looking for common patterns of collinearity on a chromosomal
or contig scale. Homologies are usually determined on the basis of high bit score BLAST hits that occur
between multiple genomes. From here, dynamic programming is used to select the best scoring path of
shared homologous genes between species, taking into account potential gene loss and gain which may
have occurred in the species’ evolutionary histories.
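The chaining step can be sketched with a small dynamic program (a toy version in the spirit of MCScan,
not the published algorithm; anchors are homologous gene pairs given as gene-index positions in the
two genomes, and only same-orientation chains are handled here) :

def collinear_chain(anchors, max_gap=5):
    # Longest chain of anchors increasing in both genomes, allowing gaps of
    # at most max_gap genes (a stand-in for gene loss and gain).
    anchors = sorted(anchors)
    n = len(anchors)
    score = [1] * n                 # best chain ending at anchor i
    prev = [-1] * n
    for i in range(n):
        for j in range(i):
            ai, bi = anchors[i]
            aj, bj = anchors[j]
            if aj < ai <= aj + max_gap and bj < bi <= bj + max_gap:
                if score[j] + 1 > score[i]:
                    score[i], prev[i] = score[j] + 1, j
    i = max(range(n), key=score.__getitem__)
    chain = []                      # trace back the best-scoring chain
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]

# e.g. collinear_chain([(1, 10), (2, 11), (3, 12), (4, 3), (5, 13)])
# -> [(1, 10), (2, 11), (3, 12), (5, 13)] : one syntenic block, (4, 3) excluded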
SyntenyTracker :-
As an input, SyntenyTracker uses a tab-delimited file containing information that includes chromosome
assignment of orthologous markers in two genomes, position of each marker in the chromosomes of
both genomes, and marker identifiers. The markers in the input table are sorted on the basis of their
chromosome assignments and positions in one of the two genomes. This genome is termed the
"reference genome"; the second genome is called the "target genome". Coordinates are provided in
base pairs or map units, thus making SyntenyTracker suitable for building HSBs from any comparative
map. A description of the SyntenyTracker algorithm is presented in Figure 1 (a toy sketch of reading
such an input table is given after the figure below). For the pseudocode implementation of the
algorithm see additional file 1. SyntenyTracker provides output as two text files.
The first file contains the original input with HSB identifiers added to each line. The second file contains
information on the chromosome assignment, start and end chromosome coordinates and relative
orientation of each HSB in the genomes compared.
Fig 1. Schematic representation of the algorithm for identification of HSBs
with SyntenyTracker.
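The toy Python reader below parses such an input table (the column layout — marker id, reference
chromosome, reference position, target chromosome, target position — is assumed here purely for
illustration) and groups markers into crude blocks; the real tool applies far finer rules for inversions
and out-of-place markers.

import csv

def crude_blocks(path):
    with open(path) as fh:
        rows = sorted(csv.reader(fh, delimiter="\t"),
                      key=lambda r: (r[1], int(r[2])))   # sort by reference position
    blocks, current = [], []
    for row in rows:
        # Start a new block whenever the reference/target chromosome pair changes.
        if current and (row[1], row[3]) != (current[-1][1], current[-1][3]):
            blocks.append(current)
            current = []
        current.append(row)
    if current:
        blocks.append(current)
    return blocks    # each block : consecutive markers sharing a chromosome pair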
The SyntenyTracker tool is freely available online at
http://www-app.igb.uiuc.edu/labs/lewin/donthu/Synteny_assign/html/. The user can select from two
modes, "Radiation Hybrid" mode and "Orthologous Gene" mode. The major difference between these
modes is that in the "Radiation Hybrid" mode definition of the HSB orientation takes into consideration
possible "flips" of adjacent markers on a comparative map.
Testing :-
We tested SyntenyTracker with several datasets, including a cattle-human RH comparative map
comprising 3,204 markers, and a dataset containing 14,380 orthologous gene pairs with one-to-one
relationships between the human and mouse genomes (Ensembl release 42). For the cattle-human
comparative map, among the 196 HSBs defined by SyntenyTracker, 189 HSBs completely match HSBs
manually defined by Everts-van der Wind and coworkers using the same set of rules that we
implemented in SyntenyTracker (see additional file 2 for SyntenyTracker output compared to manually
defined HSBs). On BTA16, SyntenyTracker combined two HSBs defined by Everts-van der Wind and
coworkers. In this case, singleton markers interrupting the HSBs were ignored by SyntenyTracker
according to predefined settings. In two cases, on BTA25 and BTA26, blocks of "out-of-place" markers
defined as HSBs by Everts-van der Wind et al. were ignored by SyntenyTracker. In Everts-van der Wind
et al. two HSBs were defined in region 0–832 map units on BTA3 because of two closely linked "out-of-
place" markers that mapped to BTA16. SyntenyTracker combined these two HSBs ignoring the "out-of-
place" markers according to the rule that does not allow "out-of-place" markers to break other HSBs.
Similarly, two HSBs on BTA5 and another two on BTA15 were merged. On BTAX, SyntenyTracker
detected a missing inversion defined by three consecutive markers (CC553554, BZ931493, X03098) and
identified three HSBs whereas Everts-van der Wind and coworkers found one (Figure 2). Thus,
SyntenyTracker is useful for identifying errors made by manual assignment using predefined rules.
To verify the quality of HSB definition by SyntenyTracker we selected another tool that was designed to
work with radiation hybrid comparative maps for a detailed comparison. Among many synteny-defining
tools we found that only AutoGRAPH was designed to work with RH comparative maps. Other popular
tools, e.g. GRIMM-Synteny were made to work with sequenced genomes and were not suitable for the
comparison.
To define "conserved segments ordered" (CSO), the equivalent of HSBs AutoGRAPH first assigns a
numerical integer to the markers in both reference and tested genomes and calculates adjacency
penalties between consecutive markers on tested genomes. AutoGRAPH breaks a CSO if the adjacency
penalty exceeds the penalty chosen by the user. The main difference with SyntenyTracker definition of
HSBs is that SyntenyTracker checks the size of inversions and compares them to the threshold selected
by the user. The number of markers in an inversion cannot be less than 3. A change in the order of
markers caused by a single marker is ignored by SyntenyTracker, unlike AutoGRAPH. In addition,
SyntenyTracker checks whether there are any markers on other reference chromosomes that could
interrupt the order of the markers in an HSB.
Fig 2. An inversion of CC553554, BZ931493, and X03098 markers (3.2 human-
Mb) on BTAX was identified by SyntenyTracker but not by manual analysis.
Using the same comparative map dataset, we compared the HSB definitions made by SyntenyTracker
with the analogous CSOs defined by AutoGRAPH [6]. Using AutoGRAPH with default
parameters, we were not able to define HSBs for the whole-genome dataset due to limitations in the
web application. To run the comparison with AutoGRAPH we had to break our dataset into smaller
datasets, each corresponding to an individual reference genome chromosome. Therefore, to define
HSBs on the cattle-human RH comparative map, AutoGRAPH was run 30 times for each of the 30
reference chromosomes. This resulted in definition of 180 HSBs. In 10 cases, AutoGRAPH combined two
HSBs defined by SyntenyTracker because of an interrupting HSB that was located on another reference
chromosome. On BTA2 and BTAX, inversions defined by more than three consecutive markers were
unaccounted for by AutoGRAPH. In four cases AutoGRAPH did not account for markers mapped to the
same positions on the RH map, resulting in deletion of four HSBs. In one case (on BTA26) AutoGRAPH
defined two "out-of-place" markers as an HSB. In two additional cases HSBs were broken because of the
presence of singleton markers.
We then compared AutoGRAPH and SyntenyTracker definitions of HSBs on the same set of orthologous
genes in the human and mouse genomes. The set of 14,380 orthologous markers was analyzed in one
run with SyntenyTracker and in 23 runs by AutoGRAPH. SyntenyTracker defined 313 HSBs, whereas
AutoGRAPH defined 358. The majority of discrepant cases can be grouped into 3 categories. The first
category includes cases when AutoGRAPH broke HSBs defined by SyntenyTracker because of a single
marker from other regions of the same or other orthologous chromosomes positioned within these
regions. The second category includes cases when SyntenyTracker ignores small inversions because of
default parameters that require 3 consecutive markers ≥300 Kb apart to define orientation of the block.
The first two categories of discrepancies in HSB definitions between SyntenyTracker and AutoGRAPH
can be explained by small differences in the rules used by these two tools and can be avoided by
adjusting HSB definition parameters in either of the tools. The last category includes cases when
AutoGRAPH joined HSBs defined by SyntenyTracker. For such cases an interrupting HSB was located on
another reference chromosome, and therefore genuine chromosomal rearrangements were likely
missed using AutoGRAPH. This can only be fixed by changing the algorithm of the tool to work with the
whole-genome set rather than with an individual chromosome.
We investigated all 15 cases of singleton gene markers that caused AutoGRAPH to break the HSBs. The
same markers were ignored by SyntenyTracker. For these 15 genes we examined consistency of
orthologous relationships in different builds of the human and mouse genomes. For 10 of 15 genes we
found inconsistency in the definition of human-mouse orthology pairs in different genome builds or
annotation sources, indicating a problem in defining 1 to 1 orthology between these human genes and
their mouse counterparts.
To verify that the SyntenyTracker algorithm is robust and that the results obtained from its use are not
affected by modifications to the input file that do not change the comparative map, we have done the
following tests :
a) the order of target genome chromosomes in the original human-mouse orthologous gene input
file was changed;
b) the order of markers within both reference and target chromosomes was inverted.
HSBs defined using such modified input files were compared to the HSBs defined using the original input
file. All HSBs from the modified input completely matched original HSBs.
To examine how SyntenyTracker accounts for the uncertainties in the order of markers on the RH
comparative map when several markers are mapped to exactly the same position on the RH map, we
changed the order of such markers on the human-cattle comparative map, defined HSBs, and compared
them to the HSBs defined from the original map. No differences in HSB definition were detected.
1.10 SEQUENCE HOMOLOGY
Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in
terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared
ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event
(paralogs), or else a horizontal (or lateral) gene transfer event (xenologs).
Homology among DNA, RNA, or proteins is typically inferred from their nucleotide or amino acid
sequence similarity. Significant similarity is strong evidence that two sequences are related by
evolutionary changes from a common ancestral sequence. Alignments of multiple sequences are used to
indicate which regions of each sequence are homologous.
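As a small illustration, pairwise similarity between two sequences can be scored with Biopython
(assuming Biopython is installed; the sequences and scoring values below are invented for
illustration) :

from Bio.Align import PairwiseAligner

aligner = PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -2
aligner.extend_gap_score = -0.5

alignment = aligner.align("ATGGCGTGCA", "ATGAGCGTGA")[0]
print(alignment.score)
print(alignment)   # high similarity suggests, but does not prove, homology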
1) Orthologous Sequences :-
Homologous sequences are orthologous if they are inferred to be descended from the same ancestral
sequence separated by a speciation event: when a species diverges into two separate species, the
copies of a single gene in the two resulting species are said to be orthologous. Orthologs, or orthologous
genes, are genes in different species that originated by vertical descent from a single gene of the last
common ancestor. The term “ortholog” was coined in 1970 by the molecular evolutionist Walter Fitch.
For instance, the plant Flu regulatory protein is present both in Arabidopsis (multicellular higher plant)
and Chlamydomonas (single cell green algae). The Chlamydomonas version is more complex: it crosses
the membrane twice rather than once, contains additional domains and undergoes alternative splicing.
However, it can fully substitute for the much simpler Arabidopsis protein if transferred from the algal to
the plant genome by means of genetic engineering. Significant sequence similarity and shared functional domains
indicate that these two genes are orthologous genes, inherited from the shared ancestor.
Orthology is strictly defined in terms of ancestry. Given that the exact ancestry of genes in different
organisms is difficult to ascertain due to gene duplication and genome rearrangement events, the
strongest evidence that two similar genes are orthologous is usually found by carrying out phylogenetic
analysis of the gene lineage. Orthologs often, but not always, have the same function.
Fig 1. Top : An ancestral gene duplicates to produce two paralogs (Genes A and
B). A speciation event produces orthologs in the two daughter species. Bottom :
in a separate species, an unrelated gene has a similar function (Gene C) but has
a separate evolutionary origin and so is an analog.
Orthologous sequences provide useful information in taxonomic classification and phylogenetic studies
of organisms. The pattern of genetic divergence can be used to trace the relatedness of organisms. Two
organisms that are very closely related are likely to display very similar DNA sequences between two
orthologs. Conversely, an organism that is further removed evolutionarily from another organism is
likely to display a greater divergence in the sequence of the orthologs being studied.
Databases of orthologous genes :
Given their tremendous importance for biology and bioinformatics, orthologous genes have been
organized in several specialized databases that provide tools to identify and analyze orthologous gene
sequences. These resources employ approaches that can be generally classified into those that use
heuristic analysis of all pairwise sequence comparisons, and those that use phylogenetic methods.
Sequence comparison methods were pioneered in the COGs database in 1997. These methods have
been extended and automated in the following databases :
1) eggNOG
2) GreenPhylDB for plants
3) InParanoid focuses on pairwise ortholog relationships
4) OHNOLOGS is a repository of the genes retained from whole genome duplications in the
vertebrate genomes including human and mouse.
5) OMA
6) OrthoDB appreciates that the orthology concept is relative to different speciation points by
providing a hierarchy of orthologs along the species tree.
7) OrthoInspector is a repository of orthologous genes for 4753 organisms covering the three
domains of life
8) OrthologID
9) OrthoMaM for mammals
10) OrthoMCL
11) Roundup
Tree-based phylogenetic approaches aim to distinguish speciation from gene duplication events by
comparing gene trees with species trees, as implemented in databases and software tools such as :
1) LOFT
2) TreeFam
3) OrthoFinder
A third category of hybrid approaches uses both heuristic and phylogenetic methods to construct
clusters and determine trees, for example :
1) EnsemblCompara GeneTrees
2) HomoloGene
3) Ortholuge
2) Paralogous Sequences :-
Paralogous genes are genes that are related via duplication events in the last common ancestor (LCA) of
the species being compared. They result from the mutation of duplicated genes during separate
speciation events. When descendants from the LCA share mutated homologs of the original duplicated
genes, those genes are considered paralogs.
As an example, in the LCA, one gene (gene A) may get duplicated to make a separate, similar gene (gene
B); those two genes will continue to be passed to subsequent generations. During speciation, one
environment will favor a mutation in gene A (gene A1), producing a new species with genes A1 and B.
Then in a separate speciation event, one environment will favor a mutation in gene B (gene B1) giving
rise to a new species with genes A and B1. The descendants’ genes A1 and B1 are paralogous to each
other because they are homologs that are related via a duplication event in the last common ancestor of
the two species.
Additional classifications of paralogs include alloparalogs (out-paralogs) and symparalogs (in-paralogs).
Alloparalogs are paralogs that evolved from gene duplications that preceded the given speciation event.
In other words, alloparalogs are paralogs that evolved from duplication events that happened in the LCA
of the organisms being compared. The example above is an example of alloparalogy. Symparalogs are
paralogs that evolved from gene duplication of paralogous genes in subsequent speciation events. From
the example above, if the descendant with genes A1 and B underwent another speciation event where
gene A1 duplicated, the new species would have genes B, A1a, and A1b. In this example, genes A1a and
A1b are symparalogs.
Paralogous genes can shape the structure of whole genomes and thus explain genome evolution to a
large extent. Examples include the Homeobox (Hox) genes in animals. These genes not only underwent
gene duplications within chromosomes but also whole genome duplications. As a result, Hox genes in
most vertebrates are clustered across multiple chromosomes with the HoxA-D clusters being the best
studied.
Another example is the globin genes which encode myoglobin and hemoglobin and are considered to
be ancient paralogs. Similarly, the four known classes of hemoglobins (hemoglobin A, hemoglobin A2,
hemoglobin B, and hemoglobin F) are paralogs of each other. While each of these proteins serves the
same basic function of oxygen transport, they have already diverged slightly in function : fetal
hemoglobin (hemoglobin F) has a higher affinity for oxygen than adult hemoglobin. Function is not
always conserved, however. Human angiogenin diverged from ribonuclease, for example, and while the
two paralogs remain similar in tertiary structure, their functions within the cell are now quite different.
Regulation :-
Paralogs are often regulated differently, e.g. by having different tissue-specific expression patterns (see
Hox genes). However, they can also be regulated differently on the protein level. For instance, Bacillus
subtilis encodes two paralogues of glutamate dehydrogenase : GudB is constitutively transcribed
whereas RocG is tightly regulated. In their active, oligomeric states, both enzymes show similar
enzymatic rates. However, swaps of enzymes and promoters cause severe fitness losses, thus indicating
promoter–enzyme coevolution. Characterization of the proteins shows that, compared to RocG, GudB’s
enzymatic activity is highly dependent on glutamate and pH.
Paralogous chromosomal regions :-
Sometimes, large regions of chromosomes share gene content similar to other chromosomal regions
within the same genome. They are well characterised in the human genome, where they have been
used as evidence to support the 2R hypothesis. Sets of duplicated, triplicated and quadruplicated genes,
with the related genes on different chromosomes, are deduced to be remnants from genome or
chromosomal duplications. A set of paralogy regions is together called a paralogon. Well-studied sets of
paralogy regions include regions of human chromosomes 2, 7, 12 and 17 containing Hox gene clusters,
collagen genes, keratin genes and other duplicated genes, regions of human chromosomes 4, 5, 8 and
10 containing neuropeptide receptor genes, NK class homeobox genes and many more gene families,
and parts of human chromosomes 13, 4, 5 and X containing the ParaHox genes and their neighbors. The
Major histocompatibility complex (MHC) on human chromosome 6 has paralogy regions on
chromosomes 1, 9 and 19. Much of the human genome seems to be assignable to paralogy regions.
3) Ohnology :-
Ohnologous genes are paralogous genes that have originated by a process of 2R whole-genome
duplication. The name was first given in honour of Susumu Ohno by Ken Wolfe. Ohnologues are useful
for evolutionary analysis because all ohnologues in a genome have been diverging for the same length
of time (since their common origin in the whole genome duplication). Ohnologues are also known to
show greater association with cancers, dominant genetic disorders, and pathogenic copy number
variations.
4) Xenology :-
Homologs resulting from horizontal gene transfer between two organisms are termed xenologs.
Xenologs can have different functions if the new environment is vastly different for the horizontally
moving gene. In general, though, xenologs typically have similar function in both organisms. The term
was coined by Walter Fitch.
5) Homoeology :-
Homoeologous (also spelled homeologous) chromosomes or parts of chromosomes are those brought
together following inter-species hybridization and allopolyploidization to form a hybrid genome, and
whose relationship was completely homologous in an ancestral species. In allopolyploids, the
homologous chromosomes within each parental sub-genome should pair faithfully during meiosis,
leading to disomic inheritance; however in some allopolyploids, the homoeologous chromosomes of the
parental genomes may be nearly as similar to one another as the homologous chromosomes, leading to
tetrasomic inheritance (four chromosomes pairing at meiosis), intergenomic recombination, and
reduced fertility.
6) Gametology :-
Gametology denotes the relationship between homologous genes on non-recombining, opposite sex
chromosomes. The term was coined by García-Moreno and Mindell (2000). Gametologs result from the
origination of genetic sex determination and barriers to recombination between sex chromosomes.
Examples of gametologs include CHDW and CHDZ in birds.
1.11 GENE ORDER AND PHYLOGENETIC FOOTPRINTING
Gene order is the arrangement (permutation) of genes on a genome. A fair amount of research has been
done to determine whether gene orders evolve according to a molecular clock (the molecular clock
hypothesis) or in jumps (punctuated equilibrium).
Some research on gene orders in animal mitochondrial genomes suggests that the rate at which gene
order changes is not constant.
Determination of gene orders :-
Random isolates from a 3-point cross : When markers have been scored in a cross where the gene order
is not known, genotypes and numbers are conveniently listed using a 3-point data sheet, such as those
shown in standard guides on using genetic methods for detecting linkage (the worked example table,
based on numbers from an actual experiment, is not reproduced here).
Initially, the genes can be listed arbitrarily in any of the three possible orders a b c, b a c, or a c b.
Completion of the tally should reveal which two complementary classes are the most frequent and
which two are the least frequent. Two times out of three, the initial tabulation will show the genes in an
incorrect order. The data can then be retabulated showing the genes in the correct order and with
progeny genotypes correctly identified as parentals, singles, or doubles (a small computational sketch of
this reasoning is given below). Organizing the data in this way facilitates calculating crossover
frequencies. (Both
single and double crossovers must, of course, be used in deriving the value for each interval.) One
hundred progeny is usually a reasonable number to isolate initially. If an allele at one locus is lethal or
cannot be scored, the scorable member of each complementary class will provide the needed
information.
(What is ‘left’ and what is ‘right’ in each linkage group is based on convention.)
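The reasoning above can also be sketched computationally : the rarest (double-crossover) class differs
from its matching parental class at exactly one locus, and that locus is the one in the middle. The
progeny counts below are invented for illustration.

# Haploid genotypes for loci (a, b, c), listed in an arbitrary initial order.
counts = {
    ("A", "B", "C"): 40, ("a", "b", "c"): 38,   # parentals (most frequent)
    ("A", "B", "c"): 3,  ("a", "b", "C"): 4,    # single crossovers
    ("A", "b", "c"): 5,  ("a", "B", "C"): 6,    # single crossovers
    ("A", "b", "C"): 1,  ("a", "B", "c"): 1,    # double crossovers (rarest)
}
parentals = sorted(counts, key=counts.get, reverse=True)[:2]
double_co = min(counts, key=counts.get)
middle_locus = None
for p in parentals:
    diff = [i for i in range(3) if p[i] != double_co[i]]
    if len(diff) == 1:                  # matching parental differs at one locus
        middle_locus = "abc"[diff[0]]
print(middle_locus)                     # -> 'b' : locus b lies in the middle

Here the arbitrary listing a b c happens to be correct; had the middle locus come out as a or c, the data
would be retabulated in the corrected order before calculating crossover frequencies.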
Gene order from duplication coverage : Enough progeny are needed to distinguish whether alleles at a
duplication-linked marker locus show a dominant : recessive ratio of 2:1 or 1:2.
PHYLOGENETIC FOOTPRINTING :-
Phylogenetic footprinting is a technique used to identify transcription factor binding sites (TFBS) within a
non-coding region of DNA of interest by comparing it to the orthologous sequence in different species.
When this technique is used with a large number of closely related species, this is called phylogenetic
shadowing.
Researchers have found that non-coding pieces of DNA contain binding sites for regulatory proteins that
govern the spatiotemporal expression of genes. These transcription factor binding sites (TFBS), or
regulatory motifs, have proven hard to identify, primarily because they are short in length, and can show
sequence variation. The importance of understanding transcriptional regulation to many fields of
biology has led researchers to develop strategies for predicting the presence of TFBS, many of which
have led to publicly available databases. One such technique is Phylogenetic Footprinting.
Phylogenetic footprinting relies upon two major concepts :
❖ The function and DNA binding preferences of transcription factors are well-conserved between
diverse species.
❖ Important non-coding DNA sequences that are essential for regulating gene expression will
show differential selective pressure. A slower rate of change occurs in TFBS than in other, less
critical, parts of the non-coding genome.
History :-
Phylogenetic footprinting was first used and published by Tagle et al. in 1988; it allowed researchers to
predict evolutionarily conserved cis-regulatory elements responsible for embryonic ε and γ globin gene
expression in primates.
Before phylogenetic footprinting, DNase footprinting was used, in which protein bound to DNA at
transcription factor binding sites (TFBS) protects it from DNase digestion. One of the problems with
this technique was the amount of time and labor it required. Unlike DNase footprinting, phylogenetic
footprinting relies on evolutionary constraints within the genome, with the “important” parts of the
sequence being conserved among the different species.
Protocol of Phylogenetic Footprinting :-
It is important when using this technique to decide which genome your sequence should be aligned to.
More divergent species will have less sequence similarity between orthologous genes. Therefore, the
key is to pick species that are related enough for homology to be detected, but divergent enough that
non-functional sequence has diverged and does not align as conserved “noise”. A step-wise approach to
phylogenetic footprinting consists of :
1) One should decide on the gene of interest.
2) Carefully choose species with orthologous genes.
3) Decide on the length of the upstream or maybe downstream region to be looked at.
4) Align the sequences.
5) Look for conserved regions and analyse them (see the sketch after the figure).
Fig. Steps of Phylogenetic Footprinting
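As a toy illustration of step 5, the following Python sketch scans a pair of pre-aligned orthologous
promoter sequences (equal length, gaps as '-') for short windows of high identity; the window size and
identity cutoff are illustrative assumptions, not values prescribed by the method.

    def conserved_windows(seq_a, seq_b, window=12, min_identity=0.9):
        """Report (start, subsequence) for alignment windows at or above the identity cutoff."""
        assert len(seq_a) == len(seq_b), "sequences must be aligned to equal length"
        hits = []
        for i in range(len(seq_a) - window + 1):
            a, b = seq_a[i:i + window], seq_b[i:i + window]
            matches = sum(x == y and x != "-" for x, y in zip(a, b))
            if matches / window >= min_identity:
                hits.append((i, a))
        return hits

    # A conserved block ("TTTGACGTCAA") stands out against the diverged background.
    human_promoter = "ATCGATTTGACGTCAAGGTACCAT"
    mouse_promoter = "AGCTATTTGACGTCAATGAACGTT"
    print(conserved_windows(human_promoter, mouse_promoter))

In practice, a putative footprint would then be checked against known TFBS databases and tested for a
mutation rate lower than that of the surrounding sequence, as discussed in the next section.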
Accuracy of Phylogenetic Footprinting :-
It is important to keep in mind that not all conserved sequences are under selection pressure. To
eliminate false positives, statistical analysis must be performed to show that the reported motifs have
a mutation rate meaningfully lower than that of the surrounding non-functional sequence.
Moreover, results can be more accurate if prior knowledge about the sequence is considered. For
example, some regulatory elements are repeated many times in a promoter region (e.g., some
metallothionein promoters have up to 15 metal response elements (MREs)). Thus, to eliminate false
motifs with inconsistent order across species, the orientation and order of regulatory elements in a
promoter region should be the same in all species. This type of information can help identify
regulatory elements that are not strongly conserved but occur in several copies in the input
sequence.
UNIT : 2
2.1 FUNCTIONAL GENOMICS
Large-scale genome projects produce a large amount of data that are used to build models to
understand how biological systems manage information. In contrast to the static representation of
structural genomics, functional genomics focuses on the dynamic aspects of genome expression.
Functional genomics emerged as a field of molecular biology that involves high-throughput methods in
genome-wide investigations. In a broad sense, functional genomics is dedicated to the understanding of
the relationship between the genome of an organism and its phenotype. The fundamental goal of
functional genomics is to understand how biological functions arise from the information encoded in a
genome.
Briefly, the biological information necessary to manage a biological system is stored in the genome and
transcribed into the transcriptome to be finally translated into the proteome. The proteome is the sum
of signaling and metabolic networks. The interactome represents the signaling networks because the
response of the cell to a specific situation is obtained by the activation of gene expression; gene
expression depends on the activation of a signaling pathway, and the signaling pathway is activated
through the interaction of a receptor with a specific signaling molecule. By contrast, the subsection of
the proteome that is dedicated to enzymatic activities describes metabolic pathways and interactions
with the metabolome. The metabolome encompasses not only molecules that are substrates and
products of enzymatic reactions but also cofactors and non-peptidic organic signaling molecules.
The ionome, the full complement of inorganic ions, gives information on the state of the system under specific environmental
conditions. The genome, transcriptome, proteome, and metabolome are considered the four pillars of
functional genomics (Salt et al. 2008). The dynamic response and the interaction of these four pillars
define how a living system operates, which is currently one of the greatest challenges in science.
Among numerous notable websites, the following are highlighted.
1) PlantGDB is a database of molecular sequence data for all plant species with significant
sequencing investments. The database organizes EST sequences into contigs that represent
tentative unique genes. Genome sequence fragments are assembled on the basis of their
similarity. Contigs are annotated and, whenever possible, linked to their respective genomic
DNA. The goal of the PlantGDB website is to establish the basis for identifying sets of genes
common to all plants or specific to particular species. To achieve this goal, PlantGDB integrates a
number of bioinformatic tools that facilitate gene predictions and cross comparisons in species
with large-scale genome sequencing investments.
2) Open Sputnik is a database for the comparison of plant genomes. This website uses sequence
resources to fill information gaps in non-sequenced plant genomes and provides a foundation
for in silico comparative plant genomics (Rudd 2005).
3) Mercator is a web server for the genome scale functional annotation of plant sequence data
(Lohse et al. 2014).
4) PlantTFDB 3.0 is a website for the functional and evolutionary study of plant transcription factors
(Jin et al. 2014). For additional database and bioinformatics tools, see Table 1.
Functional genomics data can be subdivided into sequence and experimental datasets. The sequences
are used to perform fundamental genetic analyses such as a homology search, nucleotide substitutions,
indel searches, SNP detection, nucleotide composition analyses, gene expression measurement and gene
structure description. These processes are commonly performed genome-wide on sequence datasets
using automated algorithms running in silico. For example, the principle of conserved operons
(Overbeek et al. 1999) can be used to predict the function and functional interactions of unknown open
reading frames (ORFs).
Table 1. Databases for functional genomics of plant resources
2.2 GENOMICS ANALOGY MODEL FOR EDUCATORS (GAME)
In order for students, and particularly learners with special needs, to comprehend topics such as
genomics and genetics, the foundation must be laid for the understanding of rudimentary concepts in
molecular biology. The Genomics Analogy Model for Educators (GAME) approach is intended to enable
learning in the area of molecular biology by using everyday concepts and materials, such as a town, a
library, Lego® blocks, and factories to represent scientific terminology and relationships. The intent of
the GAME approach is to introduce the various concepts of genomics in simple analogies prior to
teaching students the technical terms associated with this area of biology. One of the modules in the
GAME approach is the Lego® Analogy Model (LAM), which uses common Lego® blocks to explain how
genes are sequenced (Kirkpatrick et al., 2002). Classroom testing of this approach has demonstrated
that this analogy model increases student understanding of sequencing (Rothhaar, Pittendrigh, & Orvis,
2006). This strategy relies on the colors of the Lego® blocks in order to explain sequencing, making it a
good approach for the fully sighted; however, this strategy would be inappropriate for completely blind
students. We have recently adapted the LAM for visually impaired students, by adding distinct textures
to each colored Lego® block so that the students can learn sequencing through both the feel and color
of the blocks (Butler, Bello, York, Orvis, & Pittendrigh, in press).
The next step in this GAME teaching process is to introduce the concepts behind DNA microarrays.
Briefly, DNA microarrays involve placing numerous genes from the tissue of a whole organism on a
“chip” (or array) and examining the resulting presence of cDNA to determine the expression levels of
many genes at once. These genes might originate from whole organisms, individual cells, or cell
cultures. Such an approach could be used to teach the concepts of microarrays to high-school-level
students. The
concepts in microarrays can be easily adapted to the GAME approach, providing instructors with an
opportunity to use readily understandable concepts and inexpensive items that can be purchased at
most department stores.
Here we present the concept that VELCRO® can be used in the GAME approach for enabling the learning
of DNA microarrays for both fully sighted and potentially visually impaired students. There are several
aspects to the VELCRO® array model (VAM) that should make it useful across a variety of classroom
environments : (i) VELCRO® is inexpensive and is easily accessible in most department stores; (ii) the
“arrays” for the classroom are easy to put together; and (iii) it provides a hands-on teaching approach
that allows students to both look at and manipulate the arrays.
Briefly, the rough side of the VELCRO® is cut into different shapes and these are affixed onto a solid
surface. This constitutes the “VELCRO® chip,” which is analogous to a DNA chip. The fuzzy-side VELCRO®
shapes represent the different genes (see the downloadable document explaining the details of cDNA)
that will be tested for expression patterns. Different numbers of fuzzy VELCRO® shapes can be made to
represent the cDNA that will be hybridized to the “VELCRO® chips” in order to determine the expression
levels of the “genes” in question. To the authors’ knowledge, this represents the first publication on
enabling learning of “DNA arrays” for the visually impaired and blind. Of course, for visually enabled
learners, one can simply use a piece of paper with the shapes drawn on it for the array and cut-out
pieces of paper (in the respective shapes) can be used as the cDNA. A downloadable lesson plan for this
latter approach is available at Purdue University (2008).
VELCRO® Analogy Model :-
Most of the cells in our body contain the same basic genetic materials (the genome) (notable exceptions
include red blood cells and gametes), but different cells express different sets of genes at different
levels, depending on the cell’s purpose. Each cell is controlled (“instructed”) by a different combination
of genes to maintain and, in many cases, to replace itself. However, there are many different types of
cells in our bodies, from our hair to our fingernails, skin, and eyes. As a result, while each cell may have
the same genome, the genes it uses to become a skin cell are different than the genes it expresses to be
a hair cell. One way of examining the differences in the genes being expressed in different cells is to use
DNA microarrays. The microarray allows us to look at the genome of any cell and to see what genes the
cell is using and how often, and, in some cases, what genes are not being used. Not only could we learn
what genes are responsible for the difference in a skin cell and a hair cell, we could also learn more
about diseases such as cancer, ALS, lupus, and other auto-immune diseases. Cancerous cells, which are
functioning “incorrectly,” could be compared with non-cancerous cells in the body that are functioning
“properly.” Discovering which genes are involved in making the cells cancerous would allow researchers
or clinicians early and accurate diagnoses of cancer in patients and possibly provide target sites for the
development of compounds to control or cure the cancerous cells.
DNA chips are being used in a variety of scientific fields including molecular genetics, biochemistry,
agronomy, entomology, animal science, evolutionary biology, medicine, and a variety of other fields of
biology. These chips typically contain copies of many genes (or all the genes), typically from a single
species, placed on a glass or plastic slide. Some of the copies of these genes are built on the base
material (the chip) through a lithography approach and representative bases for each gene occur at a
specific spot. In other cases, substantial portions of each of the genes (reverse-transcribed RNA that is
turned into the more stable cDNA or copy DNA) are placed on the chip. These are called cDNA
microarrays and are often placed on glass slides.
The VELCRO® Analogy Model (VAM) is intended to introduce students to the basic concepts behind
cDNA microarrays that are fundamental to many of the scientific discoveries being made. This approach
can be used for both sighted and visually impaired students. In contrast to actually performing
oligoarray or cDNA microarray experiments, the VAM will not require expensive equipment or
chemicals. In addition, another innate problem with actually performing an array experiment is that it is
“visual” in nature and it would be virtually impossible to adapt this process to the needs of blind
students. Instead, the same principles can be taught with the VELCRO® Analogy Model, where each
VELCRO® shape represents a gene that has been placed on the “chip.”
The VAM is intended to clarify the process of how scientists determine the expression levels of mRNA in
an organism, tissue, or cell using “DNA arrays.” The VAM can be used to clarify several concepts involved
in DNA arrays. These include (i) reverse transcription of the mRNA to cDNA for the sample material that
is being tested, (ii) complementation between the DNA on the array and cDNA that is hybridized to the
array, and (iii) how the arrays are used to determine differential expression of genes between the
different treatments.
Reverse Transcription and Complementation :-
When messenger RNA (mRNA) is extracted from the cells of the organism that is the subject of an
experiment, copies of the mRNA must be made. While the number of mRNA present in a cell is used to
determine which genes are being used, mRNA is highly unstable, and it cannot be manipulated on a DNA
array (Boyer, 1999). Instead, a stable complementary copy is made, which is called cDNA. Because the
copy, or cDNA, is complementary, each base in the mRNA strand is copied into its matching base in a
cDNA strand. The bases C (cytosine) and G (guanine) pair together, as do the bases A (adenine) and T
(thymine), where T is used instead of U (uracil) in the cDNA strand. Consequently, if the RNA has a C
base in the strand, the cDNA will have a G at that same position in the cDNA strand. If the RNA has a G
at the spot then the cDNA will have a C. If the mRNA has an A at a spot, then the cDNA will have a T. If
the mRNA has the RNA equivalent of T (called U), then the cDNA will have an A at that spot. One can
demonstrate why mRNA is unstable and must be copied into cDNA by using rough VELCRO® that is not
attached to the “array” board. First, one can explain to the students that the mRNA is equivalent to the
rough side of the VELCRO® with a piece of felt attached to the back of it. This extra component added on
the back of the rough VELCRO® represents the one extra OH (oxygen and hydrogen) group found on the
backbone of mRNA. This extra OH group is one of the reasons mRNA is much more unstable than DNA;
RNA can actually use this OH group to tear itself apart.
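The per-position pairing rules spelled out above can be captured in a few lines of Python; this is a
minimal sketch of the analogy only, since a real first-strand cDNA is synthesized antiparallel to the
mRNA.

    PAIRING = {"A": "T", "U": "A", "G": "C", "C": "G"}  # mRNA base -> cDNA base

    def reverse_transcribe(mrna):
        """Copy each mRNA base into its matching cDNA base, position by position."""
        return "".join(PAIRING[base] for base in mrna.upper())

    print(reverse_transcribe("AUGGCU"))  # -> TACCGA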
Thus, in order to measure the expression of a gene, RNA must be turned into a more stable material,
such as DNA. We call this cDNA, where the c stands for copy. In our analogy, the rough side of VELCRO®
can be cut out in a circle, star, square, or triangle (Figure 1, left side). Each different shape represents a
different “gene” (or more precisely, the mRNA transcript from the gene). One can have the students
imagine that this first rough-sided VELCRO® represents mRNA and is inherently unstable. Consequently,
one must make stable, complementary copies out of fuzzy-sided VELCRO® pieces in the exact same shape
as the “mRNA rough” VELCRO®. This is the cDNA copy of the mRNA for a given gene. These cDNA fuzzy-
sided VELCRO® pieces will be used to interact with the rough-sided DNA that will be on the “DNA
VELCRO® arrays.” For every mRNA of a given gene, a single copy of cDNA will be made (Figure 2).
Fig 1. Reverse transcription of mRNA to cDNA analogy. Hooked VELCRO®
with velvet on the back of the material represents mRNA. Each shape
represents mRNA coded for by four separate genes. A “copy” is made of
each mRNA (shape of VELCRO®) with the fuzzy side of VELCRO® and this
“stable copy” is cDNA. Thus, a copy DNA or cDNA is made from the
mRNA using a reverse transcriptase enzyme.
Fig 2. For every mRNA there is a cDNA made during reverse transcription. Thus, if
there were two “star gene” mRNAs, then there would be two star cDNAs reverse
transcribed. If there were four “square gene” mRNAs, then there would be four
square cDNAs reverse transcribed. If there was one “circle gene” mRNA, then there
would be one circle cDNA reverse transcribed. Also, if there were no “triangle gene”
mRNAs, then there would be no triangle cDNAs reverse transcribed.
The actual “VELCRO® arrays” can be created with two pieces of wood or plastic to explain the concept
behind DNA arrays. Both “VELCRO® arrays” will have identical copies of the rough side of the VELCRO®
shapes (Figure 3). For simplicity, both boards could have five copies of each of the four shapes, each set
of shapes located in each of the four corners of the “array.”
Fig 3. VELCRO® analogy model (VAM) for explaining the concept underlying cDNA
arrays (“chips”). Hook sided VELCRO® shapes are attached to a solid square base
(e.g., a wood board or plastic sheet). The solid square base represents the material
that the cDNAs are “printed on,” or bound to, in order to create the array. For each
shape, five copies of the given shape are placed on a respective corner of the
wooden board or plastic sheet. They are attached to the solid base such that their
hook side faces away from the base.
One of the two arrays will be used to bind the cDNA from the “control” organism (Figure 4, left side).
The second array will be used to bind the cDNA from the “treatment” organism (Figure 4, right side). In
the example given in Figure 5, we see that the control and treatment organisms expressed equal
numbers of stars. Thus, the “star genes” showed no difference in expression (they are constitutively
expressed). The organisms in the treated group expressed fewer squares than those in the control
group, meaning the “square genes” are under-transcribed. The organisms in the treated group
expressed more triangles than those in the control group; consequently, the “triangle genes” were
over-transcribed. Finally, the organisms in the treated group expressed circles, which were entirely
absent in the control group. Thus, the “circle genes” were switched on by the treatment, being absent
in the control group and present in the treatment group. Over-transcribing a gene may allow an
organism to do things that it normally cannot.
For example, some insects over-transcribe certain genes in order to become resistant to insecticides
(Pedra et al., 2004).
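The comparison in Figure 5 can be mimicked by simply counting shapes per group; the counts below
are hypothetical values chosen to match the description above, not data from the original lesson.

    from collections import Counter

    control = Counter({"star": 3, "square": 4, "triangle": 1, "circle": 0})
    treated = Counter({"star": 3, "square": 2, "triangle": 4, "circle": 2})

    for shape in sorted(set(control) | set(treated)):
        c, t = control[shape], treated[shape]
        if c == t:
            call = "constitutively expressed"
        elif c == 0:
            call = "induced (absent in control)"
        elif t < c:
            call = "under-transcribed in treatment"
        else:
            call = "over-transcribed in treatment"
        print(shape, c, t, call)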
Fig 4. One can now compare the expression levels between a control group (untreated
organisms, left side) and the treated group (right side). The students can actually perform an
experiment to determine expression level differences between the control and treated
organisms, to determine the impact of the treatment on expression levels of the four
different “genes.” The fuzzy VELCRO® representing the cDNA from the control organism is
placed beside the array (left side). The fuzzy VELCRO® representing the cDNA from the
treated organism is placed beside the second array (right side).
A-T and G-C Content and its Influence on how Tightly Complementary Strands Will Bind :-
In a separate lesson, VELCRO® can also be used to explain how the GC and AT content of the DNA will
influence how tightly the complementary strands interact. The two bases G and C share three hydrogen
bonds, whereas A and T share only two. This means it takes more energy to pull apart G and C than it
does to pull apart A and T, because more bonds must be broken in a G-C pair than in an A-T pair.
VELCRO® can be used to explain this concept in terms of the energy required to pull apart A-T or G-C
rich strands of DNA. VELCRO® comes in several forms; some forms have weak interactions so that the
two complements are easy to pull apart (VELCRO® “Soft and Flexible”) and other forms are very
difficult to pull apart (VELCRO® “Industrial Strength”). The A-T rich DNA is analogous to the Soft and
Flexible VELCRO® and the G-C rich DNA is analogous to the Industrial Strength VELCRO®. Thus, it takes
more energy to pull apart strands of complementary DNA that are G-C rich than strands that are A-T
rich. In this example, the students can be told that Soft and Flexible VELCRO® represents A-T
interactions while they are pulling the VELCRO® strands apart. The same can also be done for Industrial
Strength VELCRO®, explaining that G-C interactions bind the complementary strands together more
tightly.
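The bond-counting idea above also underlies a classic rule of thumb for short oligonucleotides, the
Wallace rule (Tm ≈ 2 °C per A-T pair plus 4 °C per G-C pair); the sketch below applies it along with a
GC-content calculation. The rule itself is a standard approximation for oligos of roughly 14 bases or
fewer, not something stated in the text.

    def gc_content(seq):
        """Fraction of G and C bases in a DNA sequence."""
        seq = seq.upper()
        return (seq.count("G") + seq.count("C")) / len(seq)

    def wallace_tm(seq):
        """Wallace rule : Tm (deg C) = 2*(A+T) + 4*(G+C), for short oligos only."""
        seq = seq.upper()
        at = seq.count("A") + seq.count("T")
        gc = seq.count("G") + seq.count("C")
        return 2 * at + 4 * gc

    print(gc_content("ATGCGC"), wallace_tm("ATGCGC"))  # 0.666..., 20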
Fig 5. The students attach (“hybridize”) the fuzzy VELCRO® “cDNA” to the respective arrays. They
can then compare the numbers of “cDNAs” that have been bound to each array. From this they
can compare expression level differences of “mRNA” between the control (left side) and treatment
(right side) groups.
2.3 EXPRESSED SEQUENCE TAGS (ESTs)
Functional genomic approaches may provide powerful tools for identifying expressed genes. The
discovery of novel genes and their possible utilization in modern plant breeding continue to engage the
attention of most plant biologists. ESTs are short DNA molecules (300–500 bp) reverse-transcribed from
a cellular mRNA population. They are generated by large-scale single-pass sequencing of randomly
picked cDNA clones and have proven to be an efficient and rapid means to identify novel genes. ESTs thus
represent an informative source of expressed genes and provide a sequence resource that can be exploited
for large-scale gene discovery.
By using comparative genomic approaches, the putative functions for some of these new cDNA clones
may be found and thereby constitute an important tool for a better understanding of plant genome
structure, gene expression and function. A large number of ESTs have been generated from various
plant species : from mosses and cycads, from model and crop plants like A. thaliana, rice, wheat and
maize, and from other species such as gymnosperms. For example, EST clones produced when Glycine
soja was subjected to saline conditions, with the objective of mining salt-tolerance genes, were
assembled into 375 contigs and 696 clusters. ESTs have also been generated to study genes involved in
stress adaptation in the mangrove plant Acanthus ebracteatus Vahl and to study the genome of Panax
ginseng C.A. Meyer. In addition, an EST database was developed and characterised for quinoa
(Chenopodium quinoa Willd.), demonstrating the usefulness of EST libraries as a starting point for
detecting DNA sequence polymorphisms (SNPs). The quinoa cDNA sequences were compared with
sequences in the TIGR A. thaliana and GenBank protein databases : 67% of the quinoa proteins showed
homology to Arabidopsis proteins with putative function, 18% had no significant matches, 9% had
significant homology to Arabidopsis proteins with no known function, and 6% shared significant
homology with plant proteins from species other than Arabidopsis. According to the dbEST release
(September, 2007),
there are currently over 46 million ESTs belonging to both plants and animals. Many of these EST
databases have their own websites where they can be accessed (Table 1). Although there is no real substitute for a
complete genome sequence, EST sequencing certainly avoids the biggest problems associated with
genome size and the accompanying retrotransposon repetitiveness.
Table 1. Some specific plant EST databases with their websites.
How are they made ?
Expressed Sequence Tags are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that
are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits
of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and
use these “tags” to fish a gene out of a portion of chromosomal DNA by matching base pairs. The
challenge associated with identifying genes from genomic sequences varies among organisms and is
dependent upon genome size as well as the presence or absence of introns—the intervening DNA
sequences interrupting the protein coding sequence of a gene.
Separating the Wheat from the Chaff : Using mRNA to Generate cDNA :-
Gene identification is very difficult in humans, as most of our genome is comprised of non-coding DNA
interspersed with relatively few coding sequences, or genes. These genes are expressed as
proteins, a complex process comprising two main steps. First, each gene (DNA) must be
converted, or transcribed, into messenger RNA (mRNA)—RNA that serves as a template for protein
synthesis. The resulting mRNA then guides the synthesis of a protein through a process called
translation. Interestingly, mRNAs in a cell do not contain sequences from the regions between genes,
nor from the non-coding introns that are present within many genes. Therefore, isolating mRNA is key
to finding expressed genes in the vast expanse of the human genome.
Fig 1. An overview of the process of protein synthesis.
The problem, however, is that mRNA is very unstable outside of a cell, so scientists use special enzymes
to convert it to cDNA, or complementary DNA. cDNA is a much more stable molecule and, importantly,
because it was generated from an mRNA in which the introns had been removed, cDNA represents only
expressed DNA sequence.
From cDNAs to ESTs :-
Once cDNA representing an expressed gene has been isolated, scientists can then sequence a few
hundred nucleotides from either end of the molecule to create two different kinds of ESTs. Sequencing
only the beginning portion of the cDNA produces what is called a 5’ EST. A 5’ EST is obtained from
the portion of a transcript that usually codes for a protein. These regions tend to be conserved across
species and do not change much within a gene family. Sequencing the ending portion of the cDNA
molecule produces what is called a 3’ EST. As these ESTs are generated from the 3’ end of a transcript,
they are likely to fall within non-coding, or untranslated regions (UTR), and therefore tend to exhibit
less cross-species conservation than do coding sequences.
Fig 2. An overview of how Expressed Sequence Tags are generated.
ESTs are generated by sequencing cDNA, which itself is synthesized from the mRNA molecules in a cell.
The mRNAs in a cell are copies of the genes that are being expressed. mRNA does not contain sequences
from the regions between genes, nor from the noncoding introns that are present within many genes.
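A minimal sketch of how the two kinds of tags are taken from a cDNA clone; the read length and the
reverse-complement step for the 3’ read are illustrative assumptions about single-pass sequencing,
not details given in the text.

    def revcomp(seq):
        """Reverse complement of a DNA strand."""
        return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

    def make_ests(cdna, read_len=400):
        five_prime_est = cdna[:read_len]            # single-pass read from the 5' end
        three_prime_est = revcomp(cdna)[:read_len]  # read off the opposite strand, from the 3' end
        return five_prime_est, three_prime_est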
ESTs : Tools for Gene Mapping and Discovery :-
1) ESTs as Genome Landmarks :
Just as a person driving a car may need a map to find a destination, scientists searching for genes also
need genome maps to help them to navigate through the billions of nucleotides that make up the
human genome. For a map to make navigational sense, it must include reliable landmarks or “markers”.
Currently, the most powerful mapping technique, and one that has been used to generate many
genome maps, relies on STS mapping. A Sequence Tagged Site (STS) is a short DNA sequence that is
easily recognizable and occurs only once in a genome (or chromosome). The 3’ ESTs serve as a common
source of STSs due to their likelihood of being unique to a particular species, and provide the additional
feature of pointing directly to an expressed gene.
2) ESTs as Gene Discovery Resources :
Because ESTs represent a copy of just the interesting part of a genome (that which is expressed), they
have proven themselves again and again as powerful tools in the hunt for genes involved in hereditary
diseases. ESTs also have a number of practical advantages in that their sequences can be generated
rapidly and inexpensively; only one sequencing experiment is needed per cDNA generated; and
they do not have to be checked for sequencing errors, as mistakes do not prevent identification of the
gene from which the EST was derived.
To find a disease gene using this approach, scientists first use observable biological clues to identify ESTs
that may correspond to disease gene candidates. Scientists then examine the DNA of disease patients
for mutations in one or more of these candidate genes to confirm gene identity. Using this method,
scientists have already isolated genes involved in Alzheimer’s disease, colon cancer and many other
diseases. So it is easy to see why ESTs will pave the way to new horizons in genetic research.
ESTs and NCBI :-
Due to their utility, speed with which they may be generated, and the low cost associated with this
technology, many individual scientists as well as large genome sequencing centers have been generating
hundreds of thousands of ESTs for public use. As ESTs were generated, scientists submitted
their tags to GenBank, the NIH sequence database operated by the NCBI. With the rapid submission of
so many ESTs, it became difficult to identify a sequence that had already been deposited in the
database. It was becoming increasingly apparent to NCBI investigators that if ESTs were to be easily
accessed and useful as gene discovery tools, they needed to be organized in a searchable database that
also provided access to other genome data. Therefore, in 1992, scientists at the NCBI developed a new
database designed to serve as a collection point for ESTs. Once an EST that was submitted to GenBank
had been screened and annotated, it was then deposited in this new database, called dbEST.
dbEST : a descriptive catalog of ESTs :-
Scientists at NCBI created dbEST to organize, store, and provide access to the great mass of public EST
data that has already accumulated, and that continues to grow daily. Using dbEST, a scientist can access
not only data on human ESTs, but information on ESTs from over 300 other organisms as well.
Whenever possible, NCBI scientists annotate the EST record with any known information. For example, if
an EST matches a DNA sequence that codes for a known gene with a known function, that gene’s name
and function is placed on the EST record. Annotating EST records allows public scientists to use dbEST as
an avenue for gene discovery. By employing a database search tool, such as NCBI’s BLAST, any
interested party can conduct sequence similarity searches against dbEST.
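A similarity search of the kind described can be scripted with Biopython; this is a hedged sketch
assuming Biopython is installed and that NCBI still serves an EST BLAST database under the historical
name "est" (service and database names change over time).

    from Bio.Blast import NCBIWWW, NCBIXML

    def search_est_database(sequence):
        """Submit a nucleotide query to NCBI BLAST and print the top hits."""
        handle = NCBIWWW.qblast("blastn", "est", sequence)  # network call to NCBI
        record = NCBIXML.read(handle)
        for alignment in record.alignments[:5]:
            print(alignment.title, alignment.hsps[0].expect)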
UniGene : a non-redundant set of gene-oriented clusters :-
Because a gene can be expressed as mRNA many, many times, ESTs ultimately derived from this mRNA
may be redundant. That is, there may be many identical, or similar, copies of the same EST. Such
redundancy and overlap means that when someone searches dbEST for a particular EST, they may
retrieve a long list of tags, many of which may represent the same gene. Searching through all these
identical ESTs can be very time consuming. To resolve the redundancy and overlap problem, NCBI
investigators developed the UniGene database.
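The redundancy problem can be illustrated by grouping identical tags; real UniGene clustering also
merges overlapping and near-identical ESTs into gene-oriented clusters, which this toy sketch does
not attempt.

    from collections import defaultdict

    def cluster_identical_ests(ests):
        """Group EST identifiers by their (identical) sequence."""
        clusters = defaultdict(list)
        for est_id, seq in ests:
            clusters[seq].append(est_id)
        return clusters

    tags = [("EST1", "ATGGCA"), ("EST2", "ATGGCA"), ("EST3", "TTAACG")]
    print(cluster_identical_ests(tags))  # two clusters instead of three entries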
2.4 cDNA – AMPLIFIED FRAGMENT LENGTH POLYMORPHISM (cDNA-AFLP)
cDNA-AFLP is a gel-based transcript profiling method to generate quantitative gene expression level
data for any organism on a genome-wide scale. The method has found widespread use as one of the
most robust, sensitive and attractive technologies for gene discovery on the basis of fragment detection.
cDNA-AFLP has also been applied for temporal quantitative gene expression analysis and for generating
quantitative gene expression phenotypes for expression quantitative trait loci mapping particularly in
organisms that lack the gen(om)e sequences necessary for development of transcript profiling DNA
chips or microarrays.
The most advantageous feature of cDNA-AFLP is that no prior sequence information is required. This
feature characterizes the technology as an open system compared with closed expression systems
relying on prior availability of gene sequences, such as DNA chips. Although hybridization to DNA chips is
currently a very attractive method for high-throughput gene expression analysis, and the throughput of
data production that can be reached with this technology is difficult to match with any other currently
used transcript analysis method, their use is restricted to species for which the genomic sequence or
extensive expressed sequence tag (EST) libraries are available. Other significant advantages of cDNA-
AFLP over DNA chips and microarrays are the relatively low startup costs and its high specificity, which
enables expression profiling of highly homologous genes such as members of gene families. Cross-
hybridization of highly homologous genes may pose a problem using DNA chips, despite the use of
highly discriminative oligonucleotide probes.
Another category of transcript analysis technologies is represented by serial analysis of gene expression
technology and massive parallel signature sequencing technology. These technologies generate small
(10–30 bp) sequence tags for the majority of transcripts in a particular cell or tissue type. Comparing the
abundance of these tags among multiple samples provides relative expression level data digitally.
Although they do not suffer from cross-hybridization, one limitation of these methods is that the small
size of the sequence tags requires availability of EST or genome sequences for gene annotation and/or
for subsequent independent validation of expression level variation using sequence-based detection
methods such as quantitative real-time PCR.
cDNA-AFLP also has a major advantage over Differential Display, another gel-based method to analyze
mRNA populations, which is the systematic display of cDNA fragments, as each selective primer
combination (PC) displays a different subset of cDNAs. The selective PCR-amplification step of cDNA-
AFLP using PCs with variable numbers of selective nucleotides yields reproducible, sharp and discrete
banding patterns and offers the flexibility to perform transcript profiling at variable detection
sensitivities. These features of the method allow levels of rare transcripts to be measured with great
accuracy and enable the construction of comprehensive EST databases by sequencing transcript-derived
fragments (TDFs).
The cDNA-AFLP technique also has a number of limitations. Identification of interesting differentially
expressed genes requires purifying resulting TDFs from gels followed by amplification and subsequent
(cloning and) sequencing. This procedure is time consuming, labor intensive and not very amenable to
automation. In addition, when the tags are of insufficient length to characterize the interesting
transcript functionally, identification of the corresponding full-length cDNAs might be required.
Fig 1. Outline of the ‘one-gene–one-tag’ complementary DNA-amplified fragment length
polymorphism (cDNA-AFLP) procedure using the BstYI/MseI restriction enzyme
combination (EC).
The ‘one-gene–one-tag’ cDNA-AFLP protocol described here involves the following steps, which are all
illustrated in Figure 1 (except the isolation of total RNA and post-detection purification of TDFs for
sequencing) :
1) synthesis of a library of double-stranded cDNA fragments from mRNA template, using a
biotinylated oligo-dT primer;
2) a first digestion of the cDNA fragments with the restriction enzyme BstYI;
selection of a single TDF per transcript by recovering the 3′-terminus of each cDNA through
biotin binding to streptavidin beads;
4) a second digestion with MseI restriction enzyme;
5) ligation of adapters to the BstYI and MseI fragment ends to generate PCR templates;
6) a first reduction of the template mixture complexity by pre-amplification of specific subsets of
TDFs with either BstYI+T or a BstYI+C primer in combination with an MseI primer with no
selective nucleotides;
7) selective amplification of all TDF fractions consecutively, to provide a genome-wide screen for
differentially expressed genes (as an example, BstYI+T and BstYI+C primers carrying one further
selective nucleotide, in combination with MseI primers with two selective nucleotides, result in
2 × 4 × 4² = 128 TDF fractions to be amplified, as computed in the sketch below; usually, the BstYI
primers are labeled to allow detection of
the resulting TDFs); and
8) electrophoretic analysis of the amplification products on standard denaturing polyacrylamide
gels.
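The fraction count in step 7 is simple combinatorics: each selective base beyond a fixed choice
multiplies the number of primer combinations by 4. A minimal sketch (the parameter names are mine,
not from the protocol):

    def tdf_fractions(bstyi_fixed_choices=2, bstyi_extra_bases=1, msei_selective_bases=2):
        """Number of TDF fractions (selective primer combinations) to amplify.

        bstyi_fixed_choices  : fixed first selective base on the BstYI primer (T or C -> 2)
        bstyi_extra_bases    : further selective bases on the BstYI primer (4 choices each)
        msei_selective_bases : selective bases on the MseI primer (4 choices each)
        """
        return bstyi_fixed_choices * 4 ** bstyi_extra_bases * 4 ** msei_selective_bases

    print(tdf_fractions())  # 2 * 4 * 4**2 = 128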
The ‘one-gene–one-tag’ variant is identical to the original cDNA-AFLP procedure throughout the entire
protocol, except step (iii) (above) which is not undertaken in the original method. TDF detection is
described using either conventional gel electrophoresis, radiolabeled primers and autoradiography or
using LI-COR automated DNA sequencers and infrared dye (IRD) detection technology. Both procedures
require the same preparation of template fragments, except in the final amplification step, where IRD-
labeled primers can be substituted for radioactively labeled primers as appropriate.
Experimental design :-
1) Preparation and quality assessment of total RNA. Isolation of intact RNA as well as comparable
quantities of total RNA preparations between samples is essential for cDNA-AFLP analysis. A minimum of
2 µg of total RNA is recommended for template preparation. The most common method to assess the
integrity of total RNA is to run an aliquot of the RNA sample on a 1% denaturing agarose gel with
ethidium bromide (EtBr). Intact total RNA will have sharp, clear 28S and 18S rRNA bands (eukaryotic
samples). The 28S rRNA band should be approximately twice as intense as the 18S rRNA band. This 2:1
ratio is a good indication that the RNA is intact. Completely degraded RNA will appear as a very low
molecular weight smear. Inclusion of RNA size markers on the gel allows RNA bands or low molecular
weight smears to be sized and serves as a good control to ensure the gel was run properly. A drawback
of using agarose gels to assess the integrity of the RNA is the amount of RNA required for visualization.
Generally, at least 200 ng of RNA must be loaded on an agarose gel to be visualized with EtBr.
Alternative nucleic acid stains, such as SYBR Safe DNA stain, offer a significant increase in sensitivity
compared with the traditional EtBr stain in agarose gels.
The Agilent 2100 Bioanalyzer (Agilent Technologies) offers an alternative to traditional gel-based
analysis of RNA samples that integrates quantification and integrity and purity assessment in one quick
and simple assay. When used in combination with the RNA 6000 LabChip, as little as 1 µl of 10 ng µl–1
RNA template is required per analysis. The concentration of an RNA sample also can be checked by the
use of UV spectrophotometry. RNA absorbs UV light and has an absorption maximum at approximately
260 nm. Using a 1-cm light path, the absorbance at 260 nm (A260) in a 1-cm quartz cuvette of a 40 µg
ml–1 solution of single-stranded RNA is equal to 1. Hence, the concentration of RNA in the sample can be
calculated as follows : RNA concentration (µg ml–1) = A260 × dilution factor × 40 µg RNA per ml. In
contrast to nucleic acids, proteins have a UV absorption maximum of 280 nm. The absorbance of a RNA
sample at 280 nm (A280) gives an estimate of the protein contamination of the sample. Thus, the
A260/A280 ratio is a measure of the purity of a RNA sample; it should be between 1.85 and 2.00. The
absorbance of an RNA sample at 230 nm (A230) gives an estimate of the remnants of Tris, EDTA and other
buffer salts in the sample; the A260/A230 ratio should preferably be higher than 2. The Nanodrop ND-
1000 UV-Vis Spectrophotometer offers many benefits over other traditional spectrophotometers. It is
designed for small samples (1–2 µl), the need for dilutions is eliminated (up to 3,700 µg ml–1 without
dilution) and the measurement is made in less than 10 s.
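The spectrophotometric rules above translate directly into a small calculator; the cut-off values in the
comments are the ones given in the text.

    def rna_concentration_ug_per_ml(a260, dilution_factor=1.0):
        """A260 of 1.0 (1-cm path) corresponds to ~40 µg/ml of single-stranded RNA."""
        return a260 * dilution_factor * 40.0

    def purity_ratios(a260, a280, a230):
        """A260/A280 should be ~1.85-2.00; A260/A230 should preferably exceed 2."""
        return a260 / a280, a260 / a230

    print(rna_concentration_ug_per_ml(0.25, dilution_factor=50))  # 500.0 µg/ml
    print(purity_ratios(0.25, 0.13, 0.11))                        # (~1.92, ~2.27)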
2) Choice of restriction enzymes. If no sequence data are available, it is advisable to perform cDNA-AFLP
analysis using either a common EC such as BstYI/MseI or TaqI/MseI, or an EC successfully used in cDNA-
AFLP analyses in related species. If genomic or cDNA sequence data are available, the selection of
appropriate restriction enzymes and their combinations for cDNA-AFLP analysis can be optimized by
performing in silico analysis so that as many transcripts as possible can be profiled with a given amount
of resources. The selection of the appropriate ECs is determined by the following considerations :
❖ The distribution of the lengths of the TDFs generated : TDFs should be of sufficient length to fit
within the length range detectable by gel electrophoresis (typically between 100 and 500 bp).
❖ The proportion of coding sequence tagged : TDFs should be derived, at least partially, from the
coding region, to facilitate the functional characterization of the transcripts once the TDFs are
purified and sequenced.
❖ The average redundancy : this defines the number of restriction fragments that fit within the
indicated length range and that are derived from the same transcript. An acceptable redundancy
enables on average two to three fragments per transcript. In case of the ‘one-gene–one-tag’
approach, the redundancy equals 1.
❖ The percentage of the transcriptome covered : this is the major determinant in the selection of
the most appropriate EC(s) for cDNA-AFLP analysis. Restriction enzymes that cut frequently in
the cDNA are recommended as these enzymes target a large subset of the mRNAs. Enzymes
with 4-base recognition sites often provide the highest cDNA coverage but produce relatively
short tags. In contrast, 5- and 6-base cutters generate more informative tags, but often less than
half of all cDNAs are covered.
The enzyme TaqI in combination with MseI or AseI is often found to be the most appropriate
combination for cDNA-AFLP analysis in yeast and plants. For plant species, however, other ECs than the
commonly used TaqI/MseI (such as NlaIII/Csp6I in A. thaliana; BstYI/MseI in tobacco, A. thaliana and
Brassica juncea; Sau3AI/NcoI in barley; and ApoI/MseI in tomato) are found to cover a substantial
amount of cDNAs and to produce informative tags. In cDNA-AFLP transcript profiling experiments in
animal species such as horse and dog, the restriction enzyme MseI was combined with EcoRI and BstYI,
respectively. In bacteria (Azospirillum brasilense, Cuprivavidus metallidurans and Xanthomonas
campestris) the EC of PstI/TaqI is often used.
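The in silico selection of ECs described above can be prototyped with a simple double digest; the
recognition sites below (BstYI = RGATCY, MseI = TTAA) are standard, but the cut offsets and the
100–500 bp window are illustrative assumptions.

    import re

    ENZYMES = {
        "BstYI": (re.compile(r"[AG]GATC[CT]"), 1),  # cuts R^GATCY
        "MseI":  (re.compile(r"TTAA"), 1),          # cuts T^TAA
    }

    def fragment_lengths(cdna):
        """Fragment lengths from an in silico BstYI/MseI double digest."""
        cuts = sorted({m.start() + offset
                       for pattern, offset in ENZYMES.values()
                       for m in pattern.finditer(cdna)})
        bounds = [0] + cuts + [len(cdna)]
        return [b - a for a, b in zip(bounds, bounds[1:])]

    def detectable_fraction(cdnas, lo=100, hi=500):
        """Share of transcripts yielding at least one TDF in the gel-detectable range."""
        ok = sum(any(lo <= f <= hi for f in fragment_lengths(s)) for s in cdnas)
        return ok / len(cdnas)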
3) Primer design and preparation. The primer design and preparation for cDNA-AFLP analysis is identical
to that for AFLP analysis. In brief, for selective PCR amplification of subsets of TDFs, primers are used
that correspond to the core and the enzyme-specific sequence of the adapter, and to the remnant
sequence of the restriction site (see Fig. 2). In addition, they have one or a number of additional bases at
the 3′-end extending into the TDFs, called the selective nucleotides. AFLP primers are named ‘+0’ when
they have no selective bases (only the core and enzyme-specific sequence), ‘+1’ when they have a single
selective base, ‘+2’ when they have two selective bases, and so on. The optimal number of selective
nucleotides and, hence, the number of PCRs required to screen the majority of expressed genes is
determined by the cutter frequency of the restriction enzymes and the aim of the transcript profiling
experiment. Specifically, screening for scarcely expressed genes is facilitated by increasing the level of
fractionation by the use of AFLP amplification primers with more selective nucleotides. However, a
higher level of fractionation also increases the total number of PCRs to be carried out and, hence, the
workload to cover the transcriptome fully. In contrast, if the aim of the experiment is to cover as many
expressed transcripts as possible with a given amount of resources, the level of fractionation can be
lowered. Again, if no gene sequence is available, it is advisable to perform cDNA-AFLP analysis using PCs
successfully used in cDNA-AFLP analyses in related species or to determine selective bases based on the
genomic GC content of the species. In contrast, if gene sequence information is available, selection of
appropriate selective PCs can be optimized by performing in silico simulations.
Fig 2. Schematic for adapter and primer design for the rare cutter,
BstYI, and the two frequent cutters MseI and TaqI.
4) Choice of radioactive versus fluorescent detection systems. Detection of TDFs using IRD or
fluorescent dye detection technology offers several advantages over conventional detection using
radiolabeled primers and autoradiography: the use of radioactivity is eliminated, the cost of dye-labeled
primers is less than the cost of corresponding amounts of radionucleotides for radiolabeling primers,
and images are obtained in several hours rather than 1–3 d. Alternatively, when no autoradiography or
fluorescent dye detection technology is available, silver staining of the DNA in the polyacrylamide gels
can be used to visualize the cDNA-AFLP amplification products.
5) Quantification of band intensities. Although differentially expressed genes can be readily recognized
by visual inspection of the band intensities, gel image analysis software such as AFLP QuantarPro
(Keygene products N.V.) can be used to quantify band intensities in different samples. Subsequently,
relative quantitative differences in gene expression can be calculated from the band intensities. These
data can be analyzed further in much the same manner as microarray data—for example, by analysis of
variance and/or clustering the transcripts on the basis of their expression profiles. In the case of silver
staining, differentially expressed genes are directly visualized, but the band resolution might be too low
to measure the relative quantitative differences accurately.
6) Gene identification. Finally, identification of interesting differentially expressed genes can be
accomplished by purifying resulting TDFs from gels followed by PCR amplification and subsequent
(cloning and) sequencing. cDNA-AFLP analysis with LI-COR automated DNA sequencers and IRD
detection technology, however, uses a separate IR detection system such as the Odyssey InfraRed
Imaging System to scan the gel, which allows accurate band purification from the gels for subsequent
sequencing. If sequence tags are of insufficient length to characterize the transcript of interest
functionally, identification of the corresponding full-length cDNAs might also be required.
2.5 DNA MICROARRAY
A microarray is a multiplex lab-on-a-chip. It is a two-dimensional array on a solid substrate—usually a
glass slide or silicon thin-film cell—that assays (tests) large amounts of biological material using high-
throughput screening methods with miniaturized, multiplexed and parallel processing and detection. The
concept and methodology of microarrays was first introduced and illustrated in antibody microarrays
(also referred to as antibody matrix) by Tse Wen Chang in 1983 in a scientific publication and a series of
patents. The “gene chip” industry started to grow significantly after the 1995 Science Magazine article
by the Ron Davis and Pat Brown labs at Stanford University. With the establishment of companies, such
as Affymetrix, Agilent, Applied Microarrays, Arrayjet, Illumina, and others, the technology of DNA
microarrays has become the most sophisticated and the most widely used, while the use of protein,
peptide and carbohydrate microarrays is expanding.
➢ Also termed as DNA chips, gene chips, DNA arrays, gene arrays and biochips.
➢ Biochips are the latest generation of biosensors, developed by the use of DNA probes.
➢ DNA microarray is a molecular detection technique based on a collection of microscopic spots
of material (commonly DNA) affixed to a solid surface.
➢ DNA microarrays are solid supports usually made up of glass or silicon upon which DNA is
attached in an organized pre-arranged grid design.
➢ Each spot of DNA, termed as probe, signifies a single gene.
➢ DNA microarrays can examine the expression of tens of thousands of genes concurrently.
➢ There are 2 types of DNA microarray i.e. cDNA based microarray and oligonucleotide based
microarray.
Principle of DNA microarray :-
• DNA microarray technology originated from Southern blotting, in which fragmented DNA is
attached to a substrate and then probed with a known DNA sequence.
• DNA microarray is based on principle of hybridization between the nucleic acid strands.
• Complementary nucleic acid sequences have the characteristic to specifically pair to each other
by the formation of hydrogen bonds between complementary nucleotide base pairs.
• The unknown DNA sequence is termed the sample or target, and the known DNA sequence is
called the probe.
• Fluorescent dyes are used for labelling the samples and at least 2 samples are hybridized to the
chip.
• A larger number of complementary base pairs between two nucleotide sequences indicates
tighter non-covalent bonding between the two strands.
• Following the washing off of non-specific bonding sequences, only strongly paired strands will
stay hybridized.
• Thus, a fluorescently labeled target sequence that pairs with a probe releases a signal whose
strength depends on the degree of hybridization, determined by the number of paired bases, the
hybridization conditions, and the washing after hybridization.
• DNA microarrays employ relative quantitation, in which the same feature is compared under
two different conditions and is identified by its position on the array.
• After completion of the hybridization, the surface of the chip can be examined both qualitatively
and quantitatively by autoradiography, laser scanning, a fluorescence detection device, or an
enzyme detection system.
• The presence of one genomic or cDNA sequence in 100,000 or more can be screened in a single
hybridization by using a DNA microarray.
Types of DNA Microarray :-
This technology works on the principle of binding (also called hybridization) between strands of DNA
with complementary sequences. Microarrays have two broad classifications, based on the mode of
preparation and on the types of probes used.
A) Classification of microarray based on the mode of preparation :
Based on the mode of preparation of the array, microarrays are divided into three types :
1) The spotted array on glass : spotted arrays are arrays made on poly-lysine coated glass
microscope slides. This provides binding of high-density DNA by using slotted pins. It allows
fluorescent labeling of the sample.
2) Self-assembled arrays : these are fiber optic arrays made by the deposition of DNA synthesized
on small polystyrene beads. The beads are deposited on the etched ends of the array. Different
DNA can be synthesized on different beads and applying a mixture of beads to the fiber optic
cable will make a randomly assembled array.
3) In-situ synthesized arrays : these arrays are made by chemical synthesis on a solid substrate. In
the chemical synthesis, photolabile protecting groups are combined with photolithography to
perform the action. These arrays are used in expression analysis, genotyping, and sequencing.
B) Classification of microarray based on the types of probes used :
Based on the types of probes used, microarrays are of twelve different types :
1) DNA microarrays : DNA microarray is also known as gene chip, DNA chip, or biochip. It either
measures DNA or uses DNA as a part of its detection system. There are four different types of
DNA microarrays: cDNA microarrays, oligo DNA microarrays, BAC microarrays and SNP
microarrays.
2) MMChips : MMchip allows the integrative analysis of cross-platform and between-laboratory
data. It studies interactions between DNA and protein. ChIP-chip (Chromatin
immunoprecipitation (ChIP) followed by array hybridization) and ChIP-Seq (ChIP followed by
massively parallel sequencing) are the two techniques used.
3) Protein microarrays : it acts as a platform for characterization of hundreds of thousands of
proteins in a highly parallel way. Protein microarray is of three types, and these are analytical
protein microarrays, functional protein microarrays and reverse-phase protein microarrays.
4) Peptide microarrays : these types of arrays are used for the detailed analyses or optimization of
protein–protein interactions. It helps in antibody recognition by screening proteomes.
5) Tissue microarrays : tissue microarrays are paraffin blocks formed by extracting cylindrical
tissue cores from various donor blocks and embedding them into a single recipient (microarray)
block. This is mainly used in pathology.
6) Cellular microarrays : they are also called transfection microarrays or living-cell-microarrays,
and are used for screening large-scale chemical and genomic libraries and systematically
investigating the local cellular microenvironment.
7) Chemical compound microarrays : this is used for drug screening and drug discovery. This
microarray has the capacity to identify and evaluate small molecules and so it is more useful
than the other technologies used in the pharmaceutical industry.
8) Antibody microarrays : they are also referred to as antibody array or antibody chip. These are
protein-specific microarrays that contain a collection of capture antibodies placed inside a
microscope slide. They are used for detecting antigens.
9) Carbohydrate arrays : they are also called glycoarrays. Carbohydrate arrays are used in
screening proteomes for carbohydrate-binding proteins. They can also be utilized in calculating
protein binding affinities and in the automation of solid-support synthesis of glycans.
10) Phenotype microarrays : phenotype microarrays or PMs are mainly used in drug development.
They quantitatively measure thousands of cellular phenotypes all at once. It is also used in
functional genomics and toxicological testing.
11) Reverse phase protein microarrays : they are microarrays of lysates or serum. Mostly used in
clinical trials, especially in the field of cancer, they also have pharmaceutical uses. In some cases,
they can also be used in the study of biomarkers.
12) Interferometric reflectance imaging sensor or IRIS : IRIS is a biosensor that is used to analyze
protein–protein, protein–DNA, and DNA–DNA interactions. It does not make use of fluorescent
labels. It is made of Si/SiO2 substrates prepared by robotic spotting.
C) Commonly used DNA Microarray Techniques :-
1) cDNA based microarrays :
• cDNA is used for the preparation of chips.
• cDNAs are amplified by PCR.
• It is a high throughput technique.
• It is a highly parallel RNA expression assay technique that allows quantitative analysis of RNAs
transcribed from both known and unknown genes.
2) Oligonucleotide based microarrays :
• In this type, the spotted probes consist of short, chemically synthesized sequences, 20–25-
mers per gene.
• Shorter probe lengths allow fewer errors during probe synthesis and enable the interrogation of
small genomic regions, plus polymorphisms.
• Despite being easier to produce than dsDNA probes, oligonucleotide probes need to be carefully
designed so that all probes acquire similar melting temperatures (within 5 °C) and avoid
palindromic sequences.
• The probes are attached to the glass slides by covalent linkage, as electrostatic immobilization
and cross-linking can result in significant loss of probes during wash steps due to their small
size.
• The coupling of probes to the microarray surface takes place via modified 5′ or 3′ ends on
coated slides that provide functional groups (epoxy or aldehyde).
Requirements of DNA microarray :-
❖ DNA chip
❖ Fluorescent dyes
❖ Fluorescent labelled target/sample
❖ Probes
❖ Scanner
Steps involved in cDNA based microarray :-
➢ Sample collection
➢ Isolation of mRNA
➢ Creation of labeled cDNA
➢ Hybridization
➢ Collection and analysis
Fig 1. Steps of cDNA Based Microarrays
1) Sample collection :
• A sample can be any cell/tissue that we desire to conduct our study on.
• Generally, 2 types of samples are collected, i.e. healthy and infected cells, for comparing and
obtaining the results.
2) Isolation of mRNA :
• The extraction of RNA from a sample is performed by using a column or solvent like phenol-
chloroform.
• mRNA is isolated from the extracted RNA leaving behind rRNA and tRNA.
• As mRNA has a poly-A tail, column beads with poly-T tails are employed to bind mRNA.
• Following the extraction, an elution buffer is used to release the mRNA from the
beads.
3) Creation of labeled cDNA :
• Reverse transcription of mRNA yields cDNA.
• The two samples are then labeled with different fluorescent dyes (commonly Cy3 and Cy5),
producing fluorescent cDNA strands whose sample of origin can be distinguished.
4) Hybridization :
• The labeled cDNAs from both the samples are placed on the DNA microarray which permits the
hybridization of each cDNA to its complementary strand.
• The array is then washed thoroughly to remove unhybridized sequences.
5) Collection and analysis :
• Microarray scanner is used to collect the data.
• The scanner contains a laser, a computer and a camera. The laser excites the
fluorescence of the cDNA, generating signals.
• The camera records the images produced as the laser scans the array.
• The computer then stores the data and yields results instantly; the data are then analyzed.
• The relative intensity of the two colors at each spot reflects the expression of the gene at that
particular spot.
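A minimal sketch of this final analysis step, assuming hypothetical, background-corrected spot intensities for the two dye channels (gene names and values are illustrative only) : the log2 ratio of the two channels classifies each gene as up- or down-regulated in the test sample relative to the reference.

import math

# Hedged sketch: classify genes from two-channel microarray spot intensities.
# 'cy5' = test sample, 'cy3' = reference sample (hypothetical values).
spots = {
    "geneA": {"cy5": 5200.0, "cy3": 1300.0},
    "geneB": {"cy5": 800.0,  "cy3": 790.0},
    "geneC": {"cy5": 300.0,  "cy3": 2400.0},
}

for gene, s in spots.items():
    ratio = math.log2(s["cy5"] / s["cy3"])   # log2 fold change
    if ratio >= 1:
        call = "up-regulated"        # at least 2-fold higher in test sample
    elif ratio <= -1:
        call = "down-regulated"      # at least 2-fold lower
    else:
        call = "unchanged"
    print(f"{gene}: log2 ratio = {ratio:+.2f} -> {call}")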
Applications of DNA microarray technique :-
1) Drug discovery
2) Study of functional genomics
3) DNA sequencing
4) Gene expression profiling
5) Study of proteomics
6) Diagnostics and genetic engineering
7) Toxicological research
8) Pharmacogenomics and theranostics
2.6 FUNCTIONAL ANALYSIS OF GENOME
Genomic analysis is the identification, measurement or comparison of genomic features such as DNA
sequence, structural variation, gene expression, or regulatory and functional element annotation at a
genomic scale.
The process of determining the function of a studied gene is only possible after obtaining the sequence
and identifying where it is expressed. It is performed by means of computer analyses or experimental
studies, associating a gene with a related phenotype. Various strategies and research techniques, such as
homology-based prediction, increased expression, or gene inactivation, have been used for many years.
Increasingly, however, modern microarrays and DNA chips are used, and they have become the
basic technique for analysing gene function.
Functional Analysis Program of Genomes :-
Completely sequencing an organism’s genome is just the beginning of our understanding of that
organism’s biology. All of the genes still need to be identified; the function of those genes’ expressed
products (functional RNAs and proteins) must be elucidated; and the non-coding regulatory sequences
need to be understood. The Functional Analysis of the Genome program manages and supports research
that will lead to improved techniques and strategies for efficient identification and functional analysis of
genes, coding regions and other functional elements of entire genomes on a high throughput basis.
The main emphasis of this program is technology development. These technologies must be efficient,
robust and have the potential to be applied in a large-scale yet cost-effective manner. The program also
supports the large-scale application of high-throughput and efficient technologies on a limited basis,
primarily in model organisms. (The application of these technologies to specific, highly focused biological
or medical problems is not supported under this program).
Functional Analysis Program Research Objectives :-
1) Identification and mapping of all functional elements (both coding and non-coding) in a genome.
2) Generation of high-quality, full-length and representative cDNA libraries.
3) Analysis of steady-state RNA and protein expression levels in a given cell type; of comparative
levels of gene products in different cell types; and of temporal or induced changes in RNA and
protein expression.
4) Analysis of naturally occurring or induced mutations that alter RNA and/or protein expression.
5) Analysis of cellular localization of proteins and of protein-protein, or protein-nucleic acid,
interactions.
6) Comparative analysis of protein sequences.
7) Analysis of genome organization and its effect on cellular functions.
DIFFERENTIAL DISPLAY TECHNIQUES :-
Differential display (also referred to as DDRT-PCR or DD-PCR) is a laboratory technique that allows a
researcher to compare and identify changes in gene expression at the mRNA level between two or more
eukaryotic cell samples. It was the most commonly used method to compare expression profiles of two
eukaryotic cell samples in the 1990s. By 2000, differential display was superseded by DNA microarray
approaches.
In differential display, all the RNA in each sample is first reverse transcribed using a set of 3′ “anchored
primers” (oligo-dT primers with a short run of deoxy-thymidine nucleotides plus an anchoring base at the
end) to create a cDNA library for each sample. This is followed by PCR amplification using short arbitrary
primers in combination with the same anchored primers used to create the library; about forty arbitrary
primers is considered the optimal number to display almost all of the mRNAs. The resulting transcripts
are then separated by electrophoresis and visualized, so that they can be compared. The method was
prone to error because different mRNAs could migrate as a single band, less abundant mRNAs were
drowned out by more abundant ones, the results were sensitive to small changes in cell culture
conditions, 3′ fragments rather than full-length mRNAs tended to be amplified, and about 300 primer
combinations were needed to capture all the mRNAs. The method was first published in Science
in 1992.
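As an illustration of the anchored-primer idea (a sketch assuming the common one-base-anchored oligo-dT design, e.g. T11M with M = A, C or G; actual designs vary between protocols), the snippet below enumerates such a primer set and picks the primer that would select a given transcript :

# Hedged sketch: enumerate one-base-anchored oligo-dT primers for
# differential display (T11M design; real protocols may use two anchor bases).
ANCHORS = "ACG"   # the anchor base must not be T, so the primer sits
                  # exactly at the junction between poly(A) tail and mRNA body
primers = ["T" * 11 + m for m in ANCHORS]
print(primers)    # ['TTTTTTTTTTTA', 'TTTTTTTTTTTC', 'TTTTTTTTTTTG']

def matching_primer(cdna_3prime_end: str) -> str:
    """Pick the anchored primer that selects this transcript.

    The anchor base pairs with the last non-A base before the poly(A) tail
    (written here in cDNA/DNA letters; complement rule A<->T, C<->G)."""
    last_base = cdna_3prime_end.rstrip("A")[-1]
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "T" * 11 + complement[last_base]

print(matching_primer("GCTAGAAAAAAA"))   # hypothetical 3' end -> TTTTTTTTTTTC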
The basis of DD is as follows :
a) simplified pools of cDNA fragments (called subsets) are produced from the RNA samples being
compared, their content being strictly defined by the intrinsic features of the protocol;
b) the analogous pools obtained from the compared RNA samples are resolved side by side on a
polyacrylamide gel (the produced band patterns are called fingerprints);
c) fragments (defined by length) that are present in only one sample (or are much more abundant in
one than in the other) are excised from the gel and investigated (it is supposed that such
fragments originate from differentially expressed transcripts, DETs);
d) by changing the parameters of the pool generation protocol, other pools are produced and
studied in a similar way to compare more mRNAs.
The advantage of such an approach compared with differential screening seems obvious : instead of
going through a separate hybridization round with each of the chosen clones, the abundance of several
dozen randomly picked cDNA fragments can be checked simultaneously. Another very important feature
of DD is the ability to compare more than two RNA samples at once. This feature is nearly impossible in
differential screening if subtractive hybridization is involved. In the following years, DD was cited
extensively in the literature and resulted in several hundred publications reporting its successful
application. However, the DD technique was found to contain many more stumbling blocks than could
be guessed from the original publications. A number of improvements were proposed to overcome
drawbacks inherent in the DD method. Additionally, several promising techniques for displaying cDNA
fragments were recently developed that are substantially different from the original DD method.
In conclusion, there is a wide choice of methods available to search for differentially expressed
transcripts (DETs). It seems that there is no technique which could be considered the best for any task.
Instead, the choice of method should depend upon particular features of the experiment and project
requirements.
General Features of Differential Display Approach :-
When choosing between subtraction-based differential screening and DD, the investigator should be
guided by two considerations favouring the latter : first, the need to compare more than two samples
simultaneously; and second, situations where the differences in transcript concentrations between the
samples (this parameter will be called ‘distribution’ from now on) are unlikely to exceed 10–15-fold. The latter situation
usually arises when cell separation or microsurgical procedures are used to prepare the biological
material, which may lead to significant cross-contamination of the resulting samples. Existing techniques
of subtractive hybridization do not provide effective enrichment for such ‘weakly distributed’
transcripts, especially when transcripts of more pronounced distribution (>100:1) are also present. In
the latter case the fraction of weakly distributed transcripts will be remarkably under-represented or
even totally lost as a result of subtractive hybridization. DD, in contrast, shows even the most subtle
differences, as little as several-fold. On the other hand, in DD it may be difficult to select a really highly
distributed sequence (like one required for use as a spatial molecular marker, with a distribution of
500:1 or higher), because differences in fragment abundance of 50:1 and 1000:1 are likely to have the
same appearance on a polyacrylamide gel.
Another attractive feature of the DD approach is the ability to provide a quick assessment of the DETs
fraction in a particular experiment. For some tasks (like comparing temporal stages of a process or
studying the early effects of various treatments on an object) one may wish to redesign the experiment
if there is too much difference (>5% of mRNAs), as well as when no difference is observed on the first
couple of gels. It is highly advisable, especially for experiments where little can be predicted on the basis
of theoretical considerations, to obtain a few DD patterns at first to evaluate the difference, even if
subtractive hybridization was planned as a basic search tool.
The possibility to perform an effective search starting with very small amounts of RNA (several
micrograms of total RNA for the whole analysis) is often cited as one of the advantages of DD over
subtractive hybridization. Meanwhile, reliable techniques for obtaining representative PCR-amplified
cDNA samples from as little as 50 ng of total RNA were recently developed. It was demonstrated that
such samples can be successfully used for subtracted library construction. In our opinion, this abrogates
DD’s advantage of using small amounts of RNA. Moreover, as will be discussed later, a decrease in the
amount of RNA in DD reactions is likely to be accompanied by an increase in ‘noise level’, i.e. the
fraction of false positives. However, the potential of using amplified cDNA in the search for DETs is not
yet fully realized, and we expect both DD and subtractive hybridization to benefit from this approach.
A similar situation exists as far as the sensitivity of subtraction and DD is concerned. It is usually
considered that a rare DET is more likely to be found by means of DD rather than subtractive
hybridization. However, some contemporary techniques of subtractive hybridization largely overcome
the problem of bias toward abundant mRNAs, while the representation of a rare DETs fraction during
DD remains uncertain.
2.6.1 SERIAL ANALYSIS OF GENE EXPRESSION (SAGE) :-
Serial Analysis of Gene Expression (SAGE) is a transcriptomic technique used by molecular biologists to
produce a snapshot of the messenger RNA population in a sample of interest in the form of small tags
that correspond to fragments of those transcripts. Several variants have been developed since, most
notably a more robust version, LongSAGE, RL-SAGE and the most recent SuperSAGE. Many of these have
improved the technique with the capture of longer tags, enabling more confident identification of a
source gene.
History of SAGE :-
In 1979 teams at Harvard and Caltech extended the basic idea of making DNA copies of mRNAs in vitro
to amplifying a library of such copies in bacterial plasmids. In 1982–1983, the idea of selecting random or semi-
random clones from such a cDNA library for sequencing was explored by Greg Sutcliffe and coworkers,
and by Putney et al., who sequenced 178 clones from a rabbit muscle cDNA library. In 1991 Adams and co-
workers coined the term expressed sequence tag (EST) and initiated more systematic sequencing of
cDNAs as a project (starting with 600 brain cDNAs). The identification of ESTs proceeded rapidly, and millions
of ESTs are now available in public databases (e.g. GenBank).
In 1995, the idea of reducing the tag length from the 100–800 bp of ESTs down to 10–22 bp helped
reduce the cost of mRNA surveys. In the same year, the original SAGE protocol was published by Victor
Velculescu at the Oncology Center of Johns Hopkins University.
for use in cancer studies, it has been successfully used to describe the transcriptome of other diseases
and in a wide variety of organisms.
Steps involved in SAGE Experiment :-
Briefly, SAGE experiments proceed as follows :
1) The mRNA of an input sample (e.g. a tumour) is isolated and a reverse transcriptase and
biotinylated primers are used to synthesize cDNA from mRNA.
2) The cDNA is bound to Streptavidin beads via interaction with the biotin attached to the primers,
and is then cleaved using a restriction endonuclease called an anchoring enzyme (AE). The
location of the cleavage site and thus the length of the remaining cDNA bound to the bead will
vary for each individual cDNA (mRNA).
3) The cleaved cDNA downstream from the cleavage site is then discarded, and the remaining
immobile cDNA fragments upstream from cleavage sites are divided in half and exposed to one
of two adaptor oligonucleotides (A or B) containing several components in the following order
upstream from the attachment site :
➢ Sticky ends with the AE cut site to allow for attachment to cleaved cDNA;
➢ A recognition site for a restriction endonuclease known as the tagging enzyme (TE),
which cuts about 15 nucleotides downstream of its recognition site (within the original
cDNA/mRNA sequence);
➢ A short primer sequence unique to either adaptor A or B, which will later be used for
further amplification via PCR.
4) After adaptor ligation, cDNAs are cleaved using the TE to remove them from the beads, leaving only a
short “tag” of about 11 nucleotides of original cDNA (15 nucleotides minus the 4 corresponding
to the AE recognition site).
5) The cleaved cDNA tags are then repaired with DNA polymerase to produce blunt end cDNA
fragments.
6) These cDNA tag fragments (with adaptor primers and AE and TE recognition sites attached) are
ligated, sandwiching the two tag sequences together, and flanking adaptors A and B at either
end. These new constructs, called ditags, are then PCR amplified using anchor A and B specific
primers.
7) The ditags are then cleaved using the original AE, and allowed to link together with other ditags,
which will be ligated to create a cDNA concatemer with each ditag being separated by the AE
recognition site.
8) These concatemers are then transformed into bacteria for amplification through bacterial
replication.
9) The cDNA concatemers can then be isolated and sequenced using modern high-throughput DNA
sequencers, and these sequences can be analysed with computer programs which quantify the
recurrence of individual tags.
Fig 1. Summary of Serial Analysis of Gene Expression (SAGE)
Analysis of SAGE :-
The output of SAGE is a list of short sequence tags and the number of times each tag is observed. Using
sequence databases, a researcher can usually determine, with some confidence, from which original
mRNA (and therefore which gene) the tag was extracted.
Statistical methods can be applied to tag and count lists from different samples in order to determine
which genes are more highly expressed. For example, a normal tissue sample can be compared against a
corresponding tumor to determine which genes tend to be more (or less) active.
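A minimal sketch of this tag-counting step, assuming NlaIII (recognition site CATG) as the anchoring enzyme, as in the original protocol, 11-nt tags, and short hypothetical concatemer reads; ditag orientation handling and error correction are omitted :

from collections import Counter

TAG_LEN = 11          # nucleotides of original cDNA per tag (short SAGE)
AE_SITE = "CATG"      # NlaIII anchoring-enzyme site separating ditags

def count_tags(concatemer: str) -> Counter:
    """Split a sequenced concatemer at AE sites and tally individual tags.

    Each piece between AE sites is a ditag (two tags joined end to end),
    so it is split in the middle into two tags (simplified)."""
    tags = Counter()
    for ditag in concatemer.split(AE_SITE):
        if len(ditag) >= 2 * TAG_LEN:        # ignore partial/edge fragments
            tags[ditag[:TAG_LEN]] += 1
            tags[ditag[-TAG_LEN:]] += 1
    return tags

# Hypothetical concatemer reads from a normal and a tumour sample.
normal = "CATG" + "AAAAAAAAAAA" + "TTTTTTTTTTT" + "CATG" + "AAAAAAAAAAA" + "GGGGGGGGGGG" + "CATG"
tumour = "CATG" + "GGGGGGGGGGG" + "GGGGGGGGGGG" + "CATG" + "AAAAAAAAAAA" + "GGGGGGGGGGG" + "CATG"

n, t = count_tags(normal), count_tags(tumour)
for tag in sorted(set(n) | set(t)):
    print(tag, "normal:", n[tag], "tumour:", t[tag])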
Applications of SAGE :-
❖ Analysis of yeast transcriptome.
❖ Gene Expression Profiles in Normal and Cancer Cell.
❖ Insights into p53-mediated apoptosis.
❖ Identification and classification of p53-regulated genes.
❖ Analysis of human transcriptomes.
❖ Serial microanalysis of renal transcriptomes.
2.6.2 RNA SEQUENCING (RNAseq) :-
RNA-Seq (named as an abbreviation of “RNA sequencing”) is a sequencing technique which uses next-
generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a
given moment, analyzing the continuously changing cellular transcriptome.
Specifically, RNA-Seq facilitates the ability to look at alternative gene spliced transcripts, post-
transcriptional modifications, gene fusion, mutations/SNPs and changes in gene expression over time, or
differences in gene expression in different groups or treatments. In addition to mRNA transcripts, RNA-
Seq can examine other RNA populations, including total RNA and small RNAs such as miRNA and tRNA,
and can be used for ribosome profiling. RNA-Seq can also be used to determine exon/intron boundaries
and verify or amend previously annotated 5’ and 3’ gene boundaries. Recent advances in RNA-Seq
include single-cell sequencing and in situ sequencing of fixed tissue.
Prior to RNA-Seq, gene expression studies were done with hybridization-based microarrays. Issues with
microarrays include cross-hybridization artifacts, poor quantification of lowly and highly expressed
genes, and needing to know the sequence a priori. Because of these technical issues, transcriptomics
transitioned to sequencing-based methods. These progressed from Sanger sequencing of Expressed
Sequence Tag libraries, to chemical tag-based methods (e.g., serial analysis of gene expression), and
finally to the current technology, next-gen sequencing of cDNA (notably RNA-Seq).
Table 1. Different types of RNA and their functions
Principle of RNAseq :-
❖ RNA sequencing is a next-generation, high-throughput RNA sequencing and quantification
method used for studying the transcriptome and gene expression.
❖ cDNA is constructed from the mRNA through the process of reverse transcription and then
fragmented; adaptor ligation and library preparation are carried out before
sequencing.
❖ The sequencer reads and quantifies the cDNA complementary to the mRNA.
Steps in RNAseq :-
1) RNA isolation
2) cDNA synthesis
3) Adaptor ligation
4) Library preparation
5) DNA fragmentation
6) Sequencing
7) Downstream applications
cDNA Library Preparation for RNAseq :-
The general steps to prepare a complementary DNA (cDNA) library for sequencing are described below,
but often vary between platforms.
1) RNA Isolation : RNA is isolated from tissue and mixed with deoxyribonuclease (DNase). DNase
reduces the amount of genomic DNA. The amount of RNA degradation is checked with gel and capillary
electrophoresis and is used to assign an RNA integrity number to the sample. This RNA quality and the
total amount of starting RNA are taken into consideration during the subsequent library preparation,
sequencing, and analysis steps.
2) RNA selection/depletion : To analyze signals of interest, the isolated RNA can either be kept as is,
depleted of ribosomal RNA (rRNA), filtered for RNA with 3’ polyadenylated (poly(A)) tails to include only
mRNA, and/or filtered for RNA that binds specific sequences (RNA selection and depletion methods
table, below). In eukaryotes, the RNA with 3’ poly(A) tails are mature, processed, coding sequences.
Poly(A) selection is performed by mixing the eukaryotic RNA with poly(T) oligomers covalently attached
to a substrate, typically magnetic beads. Poly(A) selection ignores noncoding RNA and introduces 3’ bias,
which is avoided with the ribosomal depletion strategy. The rRNA is removed because it represents over
90% of the RNA in a cell, which if kept would drown out other data in the transcriptome.
Table 2. RNA Selection and Depletion Methods
3) cDNA synthesis : RNA is reverse transcribed to cDNA because DNA is more stable, allows for
amplification (which uses DNA polymerases), and leverages more mature DNA sequencing technology.
Amplification subsequent to reverse transcription results in loss of strandedness, which can be avoided
with chemical labeling or single-molecule sequencing. Fragmentation and size selection are performed
to purify sequences of the appropriate length for the sequencing machine. The RNA, cDNA, or
both are fragmented with enzymes, sonication, or nebulizers. Fragmentation of the RNA reduces the 5’ bias
of randomly primed reverse transcription and the influence of primer binding sites, with the downside
that the 5’ and 3’ ends are converted to DNA less efficiently. Fragmentation is followed by size selection,
where either small sequences are removed or a tight range of sequence lengths is selected. Because
small RNAs like miRNAs are lost, these are analyzed independently. The cDNA for each experiment can
be indexed with a hexamer or octamer barcode, so that different experiments can be pooled into a single
lane for multiplexed sequencing.
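Downstream of sequencing, expression is usually summarized from read counts per gene; a common normalization is transcripts per million (TPM). The sketch below, with hypothetical gene lengths and counts, shows the calculation : divide each gene's count by its length, then scale so the values sum to one million.

# Hedged sketch: transcripts-per-million (TPM) normalization of RNA-Seq counts.
# Gene lengths (in kilobases) and read counts are hypothetical.
genes = {            # gene: (length_kb, mapped_read_count)
    "geneA": (2.0, 4000),
    "geneB": (0.5, 1000),
    "geneC": (4.0, 4000),
}

# Step 1: reads per kilobase (RPK) corrects for transcript length.
rpk = {g: count / length_kb for g, (length_kb, count) in genes.items()}

# Step 2: scale so all TPM values sum to 1,000,000 (comparable across samples).
scale = sum(rpk.values()) / 1_000_000
tpm = {g: v / scale for g, v in rpk.items()}

for g in genes:
    print(f"{g}: TPM = {tpm[g]:,.0f}")
# geneA and geneB have equal reads-per-kb (2000), so they get equal TPM despite
# different raw counts; geneC has lower per-kb coverage (1000 RPK).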
Fig 2. Summary of RNA Sequencing
Advantages of RNA Sequencing :-
➢ More accurate and sensitive gene expression study can be done.
➢ Broader dynamic range.
➢ Captures both known and novel alterations of the transcript, even if no prior sequence
information is available.
➢ RNA sequencing can be applied to any species, even when no reference sequence
is available.
Applications to medicine :-
RNA-Seq has the potential to identify new disease biology, profile biomarkers for clinical indications,
infer druggable pathways, and make genetic diagnoses. These results could be further personalized for
subgroups or even individual patients, potentially highlighting more effective prevention, diagnostics,
and therapy. The feasibility of this approach is in part dictated by costs in money and time; a related
limitation is the required team of specialists (bioinformaticians, physicians/clinicians, basic researchers,
technicians) to fully interpret the huge amount of data generated by this analysis.
2.6.3 REAL TIME PCR :-
• It is a technique used to monitor the progress of a PCR reaction in real-time.
• At the same time, a relatively small amount of PCR product (DNA, cDNA or RNA) can be
quantified.
• It is based on the detection of the fluorescence produced by a reporter molecule, which
increases as the reaction proceeds.
• It is also known as a quantitative polymerase chain reaction (qPCR), which is a laboratory
technique of molecular biology based on the polymerase chain reaction (PCR).
• qPCR is a powerful technique that allows exponential amplification of DNA sequences.
• A PCR reaction needs a pair of primers that are complementary to the sequence of interest.
Primers are extended by the DNA polymerase.
• The copies produced after the extension, so-called amplicons, are re-amplified with the same
primers leading thus to exponential amplification of the DNA molecules.
• In conventional PCR, gel electrophoresis is used to analyze the amplified products only after
amplification is complete, which makes it time-consuming, since the reaction must finish before
proceeding with the post-PCR analysis. Real-Time PCR overcomes this problem.
• The term “real-time” denotes that it can monitor the progress of the amplification when the
process is going on in contrast to the conventional PCR method where analysis is possible only
after the process is completed.
Principle of Real Time PCR :-
The same principle of PCR amplification is employed in real-time PCR. But instead of looking at bands
on a gel at the end of the reaction, the process is monitored in “real time” : the reaction is placed into a
real-time PCR machine that watches the reaction occur with a camera or detector.
Although many different techniques are used to monitor the progress of a PCR reaction, all have one
thing in common. They all link the amplification of DNA to the generation of fluorescence which can
simply be detected with a camera during each PCR cycle. Hence, as the number of gene copies increases
during the reaction, so does the fluorescence, indicating the progress of the reaction.
Steps of Real Time PCR (Protocol) :-
The working procedure can be divided into two steps :
A) Amplification :-
1) Denaturation : High-temperature incubation is used to “melt” double-stranded DNA into single
strands and loosen secondary structure in single-stranded DNA. The highest temperature that
the DNA polymerase can withstand is typically used (usually 95°C). The denaturation time can be
increased if the template GC content is high.
2) Annealing : During annealing, complementary sequences have an opportunity to hybridize, so
an appropriate temperature is used, based on the calculated melting temperature (Tm) of
the primers (typically 5°C below the Tm of the primer).
3) Extension : At 70-72°C, the activity of the DNA polymerase is optimal, and primer extension
occurs at rates of up to 100 bases per second. When an amplicon in real-time PCR is small, this
step is often combined with the annealing step using 60°C as the temperature.
B) Detection :-
➢ The detection is based on fluorescence technology.
➢ The specimen is first placed in an appropriate well and subjected to thermal cycling as in conventional PCR.
➢ In real-time PCR, however, the machine uses a tungsten or halogen light source to excite the
fluorescent marker added to the sample, and the signal grows as the copy number of the
target DNA is amplified.
➢ The emitted signal is detected by a detector, converted into a digital signal, and sent to a
computer, which displays it on screen.
➢ The signal becomes detectable once it rises above the threshold level (the lowest level the
detector can distinguish from background); the cycle at which this happens is called the threshold cycle (Ct).
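Relative quantification from threshold cycles is commonly done with the 2^(−ΔΔCt) (Livak) method, which assumes roughly 100% amplification efficiency. The sketch below uses hypothetical Ct values for a target gene and a reference (housekeeping) gene in treated versus control samples :

# Hedged sketch: relative gene expression by the 2^(-ddCt) (Livak) method.
# Assumes ~100% PCR efficiency; the Ct values below are hypothetical.
ct = {
    "control": {"target": 24.0, "reference": 18.0},
    "treated": {"target": 21.5, "reference": 18.2},
}

def fold_change(ct, sample, calibrator="control"):
    # dCt normalizes the target to the reference gene within each sample.
    d_sample = ct[sample]["target"] - ct[sample]["reference"]
    d_calib = ct[calibrator]["target"] - ct[calibrator]["reference"]
    ddct = d_sample - d_calib
    return 2 ** (-ddct)

print(f"treated vs control: {fold_change(ct, 'treated'):.1f}-fold")
# dCt(control)=6.0, dCt(treated)=3.3, ddCt=-2.7 -> about 6.5-fold up-regulation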
Fluorescence Markers used in Real Time PCR :-
There are many different markers used in Real Time PCR but the most common of them include :
• Taqman probe.
• SYBR Green.
1) Taqman Probe :
• It is a hydrolysis probe which bears a reporter dye, often fluorescein (FAM), at its 5’ end and a
quencher, tetramethylrhodamine (TAMRA), attached to the 3’ end of the oligonucleotide.
• Under normal conditions, the probe remains coiled on itself, bringing the fluorescent dye near
the quencher, which quenches the fluorescent signal of the dye.
• The probe oligonucleotide is complementary to a region of the target gene, so
when the target sequence is present in the mixture, the probe binds to the sample DNA.
• As the Taq polymerase starts to synthesize the new DNA strand in the extension stage, its
5’→3’ exonuclease activity degrades the probe, separating the fluorescein from the
quencher, as a result of which a fluorescence signal is generated.
• As this procedure continues, the number of signal molecules increases with each cycle, causing an
increase in fluorescence that is positively related to the amplification of the target.
2) SYBR Green :
• This is a dye that emits a prominent fluorescent signal when it binds, nonspecifically, to the minor
groove of double-stranded DNA.
• Other fluorescent dyes like ethidium bromide or acridine orange can also be used, but SYBR
Green is preferred for its higher signal intensity.
• SYBR Green is often preferred over the Taqman probe as it provides information about each
cycle of amplification as well as a melting-temperature (melt curve) analysis, which is not obtained from the
Taqman probe.
• However, its disadvantage is a lack of specificity compared with the Taqman probe.
Fig 3. Commonly used Fluorescence Markers in Real Time PCR
Advantages :-
❖ It gives a view into the reaction that helps decide which reactions have worked well and
which have failed.
❖ The efficiency of the reaction can be precisely calculated.
❖ There is no need to run the PCR product out on a gel after the reaction, as melt curve
analysis serves the purpose.
❖ The real-time PCR data can be used to perform truly quantitative analysis of gene expression; in
comparison, conventional PCR was only ever semi-quantitative at best.
❖ Faster than normal PCR.
❖ Simpler quantification of the sample.
Applications :-
1) Gene expression analysis
➢ Cancer research
➢ Drug research
2) Disease diagnosis and management
➢ Viral quantification
3) Food testing
➢ GMO food
4) Animal and plant breeding
➢ Gene copy number
Difference Between SAGE and Microarray :-
Difference Between RNAseq and Microarray :-
UNIT : 3
3.1 INTRODUCTION TO PROTEOMICS
The word “proteome” represents the complete protein pool of an organism encoded by its genome,
while proteomics is the large-scale study of this total protein content of a cell or organism.
Proteomics helps in understanding alterations in protein expression during different stages of the life cycle
or under stress conditions. Likewise, proteomics helps in understanding the structure and function of
different proteins as well as the protein-protein interactions of an organism. A minor defect in protein
structure or function, or an alteration in its expression pattern, can be readily detected using proteomics
studies. This is important with regard to drug development and understanding various biological
processes, as proteins are the most favorable targets for various drugs. Proteomics on the whole can be
divided into three kinds as described below (Fig. 1) :
Fig 1. Types of Proteomics : Functional, structural and differential
proteomics.
History :-
The first studies of proteins that could be regarded as proteomics began in 1975, after the introduction
of the two-dimensional gel and mapping of the proteins from the bacterium Escherichia coli.
The word proteome is a blend of the words “protein” and “genome”, and was coined by Marc Wilkins in
1994 while he was a Ph.D. student at Macquarie University. Macquarie University also founded the first
dedicated proteomics laboratory in 1995.
Techniques Involved in Proteomics Study :-
Some of the very basic analytical techniques are used as major proteomic tools for studying the
proteome of an organism. We shall study most of these techniques as we progress in the course. The
initial step in all proteomic studies is the separation of the protein mixture. This can be carried out
using the two-dimensional gel electrophoresis technique, in which proteins are first separated based
on their charge (isoelectric point) in the first dimension. The gel is then turned 90 degrees from its initial position
and the proteins are separated based on differences in their size. Because this separation occurs in a
second dimension, the technique is called 2D electrophoresis. The spots obtained in 2D electrophoresis
are excised and further subjected to mass spectrometric analysis of each protein present in the mixture.
Apart from charge and size of proteins there are number of other intrinsic properties of proteins that
can be employed for their separation and detection. One of these techniques is Field Flow Fractionation
(FFF) which separates proteins based on their mobility in presence of applied field. The difference in
mobility may be attributed to different size and mass of proteins. The applied field can be of many types
such as electrical, gravitational, centrifugal etc. This technique helps in determining different
components in a protein mixture, different conformations of protein, their interaction with other
proteins as well as some organic molecules such as drugs (Fig. 2).
Fig 2. Techniques involved in protein identification during proteomic analysis.
Steps in Proteomic Analysis :-
The following steps are involved in analysis of proteome of an organism as shown in Fig. 3 :
1) Purification of proteins : This step involves extraction of protein samples from whole cell, tissue
or sub cellular organelles followed by purification using density gradient centrifugation,
chromatographic techniques (exclusion, affinity etc.)
2) Separation of proteins : 2D gel electrophoresis is applied for separation of proteins on the basis
of their isoelectric points in one dimension and molecular weight on the other. Spots are
detected using fluorescent dyes or radioactive probes.
3) Identification of proteins : The separated protein spots on gel are excised and digested in gel by
a protease (e.g. trypsin). The eluted peptides are identified using mass spectrometry.
Analysis of protein molecules is usually carried out by MALDI-TOF (Matrix Assisted Laser Desorption
Ionization-Time of Flight) based peptide mass fingerprinting.
Determined amino acid sequence is finally compared with available database to validate the proteins.
Several online tools are available for proteomic analysis such as Mascot, Aldente, Popitam, Quickmod,
Peptide cutter etc.
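To illustrate the idea behind peptide mass fingerprinting, the sketch below performs a naive in-silico trypsin digestion (cleaving after K or R, but not before P) and computes average peptide masses from residue masses; the protein sequence is hypothetical, and real tools such as Mascot use far more refined digestion rules and scoring.

import re

# Average residue masses (Da) for a few amino acids; a full table would
# cover all 20. Peptide mass = sum(residue masses) + one water (18.02 Da).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "L": 113.16, "N": 114.10, "D": 115.09, "K": 128.17,
    "E": 129.12, "R": 156.19, "F": 147.18,
}
WATER = 18.02

def tryptic_peptides(seq: str):
    """Naive in-silico trypsin digest: cleave after K or R, not before P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", seq) if p]

def peptide_mass(pep: str) -> float:
    return sum(RESIDUE_MASS[aa] for aa in pep) + WATER

protein = "AGSKPLENRDFTK"          # hypothetical sequence
for pep in tryptic_peptides(protein):
    print(f"{pep}: {peptide_mass(pep):.2f} Da")
# These masses would be compared against theoretical digests of database
# proteins to identify the best-matching entry.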
Fig 3. Overview of steps involved in proteomic analysis
Applications of Proteomics :-
1) Post-Translational Modifications :
Proteomics studies involve certain unique capabilities, such as the ability to analyze post-translational
modifications of proteins. These modifications can be phosphorylation, glycosylation and sulphation, as
well as other modifications involved in the maintenance of the structure of a protein.
These modifications are very important for the activity, solubility and localization of proteins in the cell.
Determining protein modifications is much more difficult than identifying proteins.
For identification, only a few peptides from protease cleavage are required, followed by
database matching of known peptide sequences. But for determining a modification in a
protein, much more material is needed, because all the peptides that do not have the expected molecular mass
need to be analyzed further.
For example, during protein phosphorylation events, phosphopeptides are 80 Da heavier than their
unmodified counterparts. The phosphate gives rise to a specific fragment (PO3−, mass 79) in mass
spectrometry, binds to metal resins, is recognized by specific antibodies, and can later be removed by
phosphatases (Clauser et al. 1999; Colledge and Scott, 1999). So a protein of interest (a post-translationally
modified protein) can be detected by Western blotting with antibodies that recognize only the active
state of the molecule, or by 32P-labelling. Later, these spots can be identified by mass spectrometry.
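A toy illustration of this ~80 Da shift (the peptide masses below are hypothetical; real phosphopeptide assignment also examines fragmentation spectra) : pairs of observed masses differing by the mass of HPO3 (~79.97 Da) flag candidate phosphopeptides.

# Hedged sketch: flag candidate phosphopeptides by the ~80 Da mass shift
# (HPO3 = 79.97 Da). Observed peptide masses below are hypothetical.
PHOSPHO_SHIFT = 79.97
TOLERANCE = 0.05          # Da; depends on instrument accuracy

observed = [971.09, 1051.06, 509.56, 1234.60]

for light in observed:
    for heavy in observed:
        if abs((heavy - light) - PHOSPHO_SHIFT) <= TOLERANCE:
            print(f"{heavy:.2f} Da may be the phosphorylated form of {light:.2f} Da")
# -> 1051.06 Da may be the phosphorylated form of 971.09 Da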
2) Protein-Protein Interactions :
The major contribution of proteomics, the development of protein interaction maps of a cell, is of
immense value for understanding the biology of a cell. Knowledge about the time of expression of a
particular protein, its level of expression, and, finally, its interaction with another protein to form an
intermediate that performs a specific biological function is now becoming available.
These intermediates can be exploited for therapeutic purposes also. An attractive way to study the
protein-protein interactions is to purify the entire multi-protein complex by affinity based methods
using GST-fusion proteins, antibodies, peptides etc.
The yeast two-hybrid system has emerged as a powerful tool to study protein-protein interactions
(Haynes and Yates, 2000). According to Pandey and Mann (2000), it is a genetic method based on the
modular structure of transcription factors : bringing a DNA-binding domain into close proximity with an
activation domain induces increased transcription of a set of reporter genes.
The yeast two-hybrid system uses ORFs fused to the DNA-binding or activation domain of GAL4, such that
increased transcription of a reporter gene results when the proteins encoded by two ORFs interact in
the nucleus of the yeast cell. One of the main consequences of this is that once a positive interaction is
detected, simply sequencing the relevant clones identifies the ORF. For this reason it is a generic
method that is simple and amenable to high-throughput screening of protein-protein interactions.
Phage display is a method where bacteriophage particles are made to express either a peptide or
protein of interest fused to a capsid or coat protein. It can be used to screen for peptide epitopes,
peptide ligands, enzyme substrate or single chain antibody fragments.
Another important method to detect protein-protein interactions involves the use of fluorescence
resonance energy transfer (FRET) between fluorescent tags on interacting proteins. FRET is a non-
radioactive process whereby energy from an excited donor fluorophore is transferred to an acceptor
fluorophore. After excitation of the first fluorophore, FRET is detected either by emission from the
second fluorophore using appropriate filters or by alteration of the fluorescence lifetime of the donor.
A proteomics strategy of increasing importance involves the localization of proteins in cells as a
necessary first step towards understanding protein function in complex cellular networks. The discovery
of GFP (green fluorescent protein) and the development of its spectral variants has opened the door to
analysis of proteins in living cells by use of the light microscope.
Large-scale approaches of localizing GFP-tagged proteins in cells have been performed in the genetically
amenable yeast S. pombe (Ding et al. 2000) and in Drosophila (Morin et al. 2001). To localize proteins in
mammalian cells, a strategy was developed that enables the systematic GFP tagging of ORFs from novel
full-length cDNAs that are identified in genome projects.
3) Protein Expression Profiling :
The largest application of proteomics continues to be protein expression profiling. The expression levels
of a protein sample could be measured by 2-DE or other novel technique such as isotope coded affinity
tag (ICAT). Using these approaches the varying levels of expression of two different protein samples can
also be analyzed.
This application of proteomics would be helpful in identifying the signaling mechanisms as well as
disease specific proteins. With the help of 2-DE several proteins have been identified that are
responsible for heart diseases and cancer (Celis et al. 1999). Proteomics helps distinguish cancer
cells from non-cancerous cells through the presence of differentially expressed proteins.
The technique of Isotope Coded Affinity Tag has developed new horizons in the field of proteomics. This
involves the labeling of two different proteins from two different sources with two chemically identical
reagents that differ in their masses due to isotope composition (Gygi et al. 1999). The biggest advantage
of this technique is that it eliminates the need for protein quantitation by 2-DE; therefore, a high amount of protein
sample can be used to enrich low-abundance proteins.
Different methods have been used to probe genomic sets of proteins for biochemical activity. One
method is called a biochemical genomics approach, which uses parallel biochemical analysis of a
proteome comprised of pools of purified proteins in order to identify proteins and the corresponding
ORFs responsible for a biochemical activity.
The second approach for analyzing genomic sets of proteins is the use of functional protein microarrays,
in which individually purified proteins are separately spotted on a surface such as a glass slide and then
analyzed for activity. This approach has huge potential for rapid high-throughput analysis of proteomes
and other large collections of proteins, and promises to transform the field of biochemical analysis.
4) Molecular Medicine :
With the help of the information available through clinical proteomics, several drugs have been
designed. This aims to discover the proteins with medical relevance to identify a potential target for
pharmaceutical development, a marker(s) for disease diagnosis or staging, and risk assessment—both
for medical and environmental studies. Proteomic technologies will play an important role in drug
discovery, diagnostics and molecular medicine because of the link between genes, proteins and disease.
As researchers study defective proteins that cause particular diseases, their findings will help develop
new drugs that either alter the shape of a defective protein or mimic a missing one. Already, many of
the best-selling drugs today either act by targeting proteins or are proteins themselves. Advances in
proteomics may help scientists eventually create medications that are “personalized” for different
individuals to be more effective and have fewer side effects. Current research is looking at protein
families linked to disease including cancer, diabetes and heart disease.
TRANSLATION (PROTEIN SYNTHESIS) :-
Translation is the process of synthesizing proteins as chains of amino acids known as polypeptides. It
is the second part of the central dogma in genetics.
➢ It takes place in the ribosomes found in the cytosol or those attached to the rough endoplasmic
reticulum.
➢ The ribosome reads the sequence of the codons in the mRNA, while the tRNA molecules
transport the amino acids to the ribosome in the correct sequence. Other
molecules, such as various enzymatic factors, are also involved in the process of translation.
➢ The translation process involves reading the genetic code in mRNA to make proteins.
➢ The entire translation process can be summarized into three phases: Initiation, elongation, and
termination.
Translation (Protein Synthesis) machinery :-
The translation process is aided by two major components : a translator, the molecular machinery that conducts
the translation, and a substrate, the mRNA that is translated into a new protein (the translator’s “desk”).
The translation process is guided by machinery composed of :
1) Ribosomes :
• Ribosomes are made of ribosomal RNA (rRNA) and proteins; they are also called
ribozymes because the rRNA has enzymatic activity. The rRNA has the peptidyl transferase
activity that bonds the amino acids.
• The ribosome has two subunits of rRNA and proteins, with three active sites
(E, P, A) which are critical for its catalytic activity.
2) Transfer RNA (tRNA) :
• Each tRNA has an anticodon complementary to the codon of the amino acid it carries.
For example, lysine is coded by AAG, so the anticodon carried by its tRNA
will be UUC (read in the pairing orientation); when the codon AAG appears, the UUC anticodon of the tRNA
binds to it temporarily.
• When tRNA is bound to mRNA, the tRNA then releases its amino acid. rRNA then helps to form
bonds between the amino acids as they are transported to the ribosomes one by one, thus
creating a polypeptide chain. The polypeptide chain keeps growing until it reaches a stop codon.
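The codon-by-codon logic described above can be sketched computationally. The snippet below builds the standard genetic code table and translates a short hypothetical mRNA from the first AUG to the first stop codon :

# Hedged sketch: translate an mRNA reading frame using the standard genetic code.
# The codon table is generated from the canonical UCAG ordering; '*' marks stop.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: aa
    for (b1, b2, b3), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES), AMINO_ACIDS
    )
}

def translate(mrna: str) -> str:
    """Translate from the first AUG to the first stop codon (one-letter code)."""
    start = mrna.find("AUG")
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i : i + 3]]
        if aa == "*":                 # UAA/UAG/UGA: release factors end synthesis
            break
        protein.append(aa)
    return "".join(protein)

print(translate("GGAUGAAGUUUGGCUAAGC"))  # hypothetical mRNA -> "MKFG"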
Translation (Protein Synthesis) enzymes and functions :-
• Peptidyl transferase is the main enzyme used in Translation. It is found in the ribosomes with an
enzymatic activity that catalyzes the formation of a covalent peptide bond between the
adjacent amino acids.
• The enzyme’s activity is to form peptide bonds between adjacent amino acids using tRNAs
during translation.
• The enzyme’s activity uses two substrates of which one has the growing peptide chain and the
other bears the amino acid that is added to the chain.
• It is located in the large subunit of the ribosomes and therefore, the primary function of peptidyl
transferase is to catalyze the addition of amino acid residues allowing the polypeptide chain to
grow.
• The peptidyl transferase enzyme is entirely made up of RNA and its mechanism is mediated by
ribosomal RNA (rRNA), which is a ribozyme, made up of ribonucleotides.
• In prokaryotes, the 23S rRNA of the large subunit contains the peptidyl transferase centre, located
between the A-site and P-site tRNAs, while in eukaryotes it is found in the 28S rRNA.
Translation (Protein Synthesis) Steps /Process in Detail :-
1) Initiation :
• Protein synthesis initiation requires several components : the initiation factors IF1, IF2, and
IF3, together with mRNA, ribosomes, and the initiator tRNA.
• The small subunit binds upstream at the 5′ end of the mRNA. The ribosome scans
the mRNA in the 5′ to 3′ direction until it encounters the start codon (AUG, or occasionally GUG or UUG).
The start codon is recognized by the initiator fMet-tRNA (N-formylMet-tRNA), which carries the
modified methionine and binds to the P site on the ribosome.
• The first amino acid of the polypeptide is thus N-formylmethionine. The initiator
fMet-tRNA has a normal methionine anticodon but inserts N-formylmethionine, so
(formyl)methionine is the first amino acid that is added and appears in the chain.
• Generally, there are three steps in the initiation process of translation;
a) The mRNA binds to the small (30S) ribosomal subunit, a step stimulated by
initiation factor IF3, which also keeps the two ribosomal subunits dissociated.
b) Initiation factor IF2 then binds guanosine triphosphate (GTP) and delivers the initiator
fMet-tRNA to the P site of the ribosome.
c) A ribosomal protein splits the GTP bound to IF2, helping to drive the
assembly of the two ribosomal subunits; IF3 and IF2 are then released.
Fig. Steps of Translation (Protein Synthesis)
2) Elongation :
• The elongation of protein synthesis is aided by three protein factors i.e. EF-Tu, EF-Ts, and EF-G.
• The ribosome shifts along the mRNA one codon at a time, catalyzing the processes that take
place at its three sites.
• For every step, a charged tRNA enters the ribosomal complex and the polypeptide
becomes one amino acid longer, while an uncharged tRNA departs. In prokaryotes, an amino acid
is added about every 0.05 seconds, so a polypeptide of about 200 amino acids is
translated in roughly 10 seconds.
• The energy for each elongation step is derived from guanosine triphosphate (GTP),
which is similar to adenosine triphosphate (ATP).
• The three sites (A, P, E) all participate in the translation process, and the ribosome itself
interacts with all the RNA types involved in translation.
• Therefore, three distinct steps are involved in translation, and these are;
a) The mediation of elongation Factor-Tu (EF-Tu) in the entry of amino-acyl-tRNAs to the A
site. This entails the binding of EF-Tu to GTP, which activates the EF-Tu-GTP complex to
bind to tRNA. The GTP then hydrolyses to GDP releasing an energy-giving phosphate
molecule, thus driving the binding of aminoacyl-tRNA to the A site. At this point the EF-
Tu is released, leaving the tRNA in the A-site.
b) Elongation factor EF-Ts then mediates the releasing of EF-Tu-GDP complex from the
ribosomes and the formation of the EF-Tu-GTP.
c) The polypeptide chain on the peptidyl-tRNA in the P site is
transferred to the aminoacyl-tRNA in the A site in a reaction that is catalyzed by
peptidyl transferase. The ribosome then moves one codon further along the mRNA in
the 5′ to 3′ direction (translocation), mediated by the elongation factor EF-G. This step draws its energy
from the splitting of GTP to GDP. The uncharged tRNA is released from the P site, and the
newly formed peptidyl-tRNA moves from the A site to the P site.
3) Termination :
• Termination of the translation process is triggered by an encounter of any of the three stop
codons (UAA, UAG, UGA). These triplet stop codons, however, are not recognized by the tRNA
but by protein factors known as the release factors, (RF1 and RF2) found in the ribosomes.
• The RF1 recognizes the triplet UAA and UAG while RF2 recognizes UAA and UGA. A third factor
also assists in catalyzing the termination process and it’s known as Release factor 3 (RF3).
• When a stop codon enters the A site after the elongation steps, the corresponding release
factor binds to it. This releases the polypeptide from the peptidyl-tRNA in the P site and, using
energy derived from GTP, allows the ribosome to dissociate into its two subunits and leave the mRNA.
• After many ribosomes have completed the translation process, the mRNA is degraded allowing
its nucleotides to be reused in other transcription reactions.
3.2 PROTEIN ISOLATION & PURIFICATION
Protein purification is a series of processes intended to isolate one or a few proteins from a complex
mixture, usually cells, tissues or whole organisms. Protein purification is vital for the specification of the
function, structure and interactions of the protein of interest. The purification process may separate the
protein and non-protein parts of the mixture, and finally separate the desired protein from all other
proteins. Separation of one protein from all others is typically the most laborious aspect of protein
purification. Separation steps usually exploit differences in protein size, physico-chemical properties,
binding affinity and biological activity. The pure result may be termed protein isolate.
Techniques of Protein Isolation and Purification :-
The methods used in protein purification can roughly be divided into analytical and preparative
methods.
The distinction is not exact, but the deciding factor is the amount of protein, that can practically be
purified with that method. Analytical methods aim to detect and identify a protein in a mixture, whereas
preparative methods aim to produce large quantities of the protein for other purposes, such as
structural biology or industrial use. In general, the preparative methods can be used in analytical
applications, but not the other way around.
1) Extraction :
Depending on the source, the protein has to be brought into solution by breaking the tissue or cells
containing it. There are several methods to achieve this : repeated freezing and thawing, sonication,
homogenization by high pressure, or permeabilization by organic solvents. The method of choice
depends on how fragile the protein is and how sturdy the cells are.
After this extraction process soluble protein will be in the solvent, and can be separated from cell
membranes, DNA, etc. by centrifugation. The extraction process also extracts proteases, which will start
digesting the proteins in the solution. If the protein is sensitive to proteolysis, it is usually desirable to
proceed quickly, and keep the extract cooled, to slow down proteolysis.
2) Precipitation and Differential Solubilisation :
In bulk protein purification, a common first step to isolate proteins is precipitation with ammonium
sulphate (NH4)2SO4. This is performed by adding increasing amounts of ammonium sulphate and
collecting the different fractions of precipitated protein. One advantage of this method is that it can be
performed inexpensively with very large volumes.
The first proteins to be purified are water-soluble proteins. Purification of integral membrane proteins
requires disruption of the cell membrane in order to isolate any one particular protein from others that
are in the same membrane compartment. Sometimes a particular membrane fraction can be isolated
first, such as isolating mitochondria from cells before purifying a protein located in a mitochondrial
membrane.
A detergent such as sodium dodecyl sulphate (SDS) can be used to dissolve cell membranes and keep
membrane proteins in solution during purification; however, because SDS causes denaturation, milder
detergents such as Triton X-100 or CHAPS can be used to retain the protein’s native conformation during
purification.
3) Ultracentrifugation :
Centrifugation is a process that uses centrifugal force to separate mixtures of particles of varying masses
or densities suspended in a liquid. When a vessel (typically a tube or bottle) containing a mixture of
proteins or other particulate matter, such as bacterial cells, is rotated at high speeds, the rotation
yields an outward (centrifugal) force on each particle that is proportional to its mass.
The tendency of a given particle to move through the liquid because of this force is offset by the
resistance the liquid exerts on the particle. The net effect of “spinning” the sample in a centrifuge is that
massive, small, and dense particles move outward faster than less massive particles or particles with
more “drag” in the liquid.
When suspensions of particles are “spun” in a centrifuge, a “pellet” may form at the bottom of the
vessel that is enriched for the most massive particles with low drag in the liquid. The remaining, non-
compacted particles still remaining mostly in the liquid are called the “supernatant” and can be removed
from the vessel to separate the supernatant from the pellet.
The rate of centrifugation is specified by the angular acceleration applied to the sample, typically
measured as a multiple of g (the relative centrifugal force, RCF). If samples are centrifuged long enough, the particles in the vessel will
reach equilibrium wherein the particles accumulate specifically at a point in the vessel where their
buoyant density is balanced with centrifugal force. Such an “equilibrium” centrifugation can allow
extensive purification of a given particle.
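The conversion between rotor speed and relative centrifugal force follows the standard formula RCF = 1.118 × 10^-5 × r × rpm^2, with the rotor radius r in centimetres. A small sketch (the rotor radius is hypothetical) :

# Hedged sketch: convert rotor speed (rpm) to relative centrifugal force (x g).
# Standard formula: RCF = 1.118e-5 * r * rpm^2, with rotor radius r in cm.
def rcf(radius_cm: float, rpm: float) -> float:
    return 1.118e-5 * radius_cm * rpm ** 2

# Hypothetical rotor with an 8 cm radius at bench-top and ultracentrifuge speeds.
for rpm in (5_000, 15_000, 50_000):
    print(f"{rpm:>6} rpm -> {rcf(8.0, rpm):>10,.0f} x g")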
Sucrose gradient centrifugation :
A linear concentration gradient of sugar (typically sucrose, glycerol, or Percoll) is generated in a tube
such that the highest concentration is on the bottom and lowest on top. A protein sample is then
layered on top of the gradient and spun at high speeds in an ultracentrifuge. This causes heavy
macromolecules to migrate towards the bottom of the tube faster than lighter material.
During centrifugation in the absence of sucrose, as particles move farther and farther from the centre of
rotation, they experience more and more centrifugal force (the further they move, the faster they
move). The problem with this is that the useful separation range within the vessel is restricted to a small
observable window.
Spinning a sample twice as long does not mean the particle of interest will go twice as far; in fact, it will
go significantly farther. However when the proteins are moving through a sucrose gradient, they
encounter liquid of increasing density and viscosity.
A properly designed sucrose gradient will counteract the increasing centrifugal force, so the particles
move in close proportion to the time they have been in the centrifugal field. Separations performed with
such gradients are referred to as “rate-zonal” centrifugations. After separating the proteins/particles, the
gradient is then fractionated and collected.
4) Chromatographic Methods :
Usually a protein purification protocol contains one or more chromatographic steps. The basic
procedure in chromatography is to flow the solution containing the protein through a column packed
with various materials. Different proteins interact differently with the column material, and can thus be
separated by the time required to pass the column, or the conditions required to elute the protein from
the column. Usually proteins are detected as they are coming off the column by their absorbance at 280
nm.
CHROMATOGRAPHY :-
• Chromatography is an important biophysical technique that enables the separation,
identification, and purification of the components of a mixture for qualitative and quantitative
analysis.
• The Russian botanist Mikhail Tswett coined the term chromatography in 1906.
• The first analytical use of chromatography, described by James and Martin in 1952, was gas
chromatography for the analysis of fatty acid mixtures.
• A wide range of chromatographic procedures makes use of differences in size, binding affinities,
charge, and other properties to separate materials.
• It is a powerful separation tool that is used in all branches of science and is often the only means
of separating components from complex mixtures.
Principle of Chromatography (how does chromatography work) :-
Chromatography is based on the principle that molecules in a mixture are applied onto a surface or into
a solid or fluid stationary phase (the stable phase) and are separated from each other while moving with
the aid of a mobile phase.
The factors effective in this separation process include molecular characteristics related to adsorption
(liquid-solid), partition (liquid-liquid), and affinity, or differences among molecular weights.
Because of these differences, some components of the mixture stay longer in the stationary phase and
move slowly through the chromatography system, while others pass rapidly into the mobile phase and
leave the system faster.
Three components thus form the basis of the chromatography technique.
Fig 1. Principle of Chromatography
1) Stationary phase : This phase is always composed of a “solid” phase or “a layer of a liquid
adsorbed on the surface of a solid support”.
2) Mobile phase : This phase is always composed of “liquid” or a “gaseous component.”
3) Separated molecules
The type of interaction between the stationary phase, mobile phase, and substances contained in the
mixture is the basic component effective on the separation of molecules from each other.
Types of Chromatography :-
➢ Substances can be separated on the basis of a variety of methods and the presence of
characteristics such as size and shape, total charge, hydrophobic groups present on the surface,
and binding capacity with the stationary phase.
➢ This leads to different types of chromatography techniques, each with their own
instrumentation and working principle.
➢ For instance, four separation techniques based on molecular characteristics and interaction type
use mechanisms of ion exchange, surface adsorption, partition, and size exclusion.
➢ Other chromatography techniques are based on the stationary bed, including column, thin layer,
and paper chromatography.
Commonly employed chromatography techniques include :
1) Column chromatography
2) Ion-exchange chromatography
3) Gel-permeation (molecular sieve) chromatography
4) Affinity chromatography
5) Paper chromatography
6) Thin-layer chromatography
7) Gas chromatography (GC)
8) Dye-ligand chromatography
9) Hydrophobic interaction chromatography
10) Pseudoaffinity chromatography
11) High-pressure liquid chromatography (HPLC)
Applications of Chromatography :-
1) Pharmaceutical sector :
• To identify and analyze samples for the presence of trace elements or chemicals.
• Separation of compounds based on their molecular weight and element composition.
• To detect unknown compounds and determine the purity of a mixture.
• In drug development.
2) Chemical industry :
• In testing water samples and checking air quality.
• HPLC and GC are widely used for detecting various contaminants such as polychlorinated
biphenyls (PCBs) in pesticides and oils.
• In various life sciences applications
3) Food Industry :
• In food spoilage and additive detection
• Determining the nutritional quality of food
4) Forensic Science :
• In forensic pathology and crime-scene testing, such as analyzing blood and hair samples from
the crime scene.
5) Molecular Biology Studies :
• Various hyphenated techniques in chromatography such as EC-LC-MS are applied in the study of
metabolomics and proteomics along with nucleic acid research.
• HPLC is used in Protein Separation like Insulin Purification, Plasma Fractionation, and Enzyme
Purification and also in various departments like Fuel Industry, biotechnology, and biochemical
processes.
DIFFERENT CHROMATOGRAPHIC TECHNIQUES :-
1) AFFINITY CHROMATOGRAPHY :-
Affinity chromatography is a separation technique where the components of a mixture are separated
based on their affinity towards the stationary phase of the system.
Principle of Affinity chromatography :
❖ This chromatography technique is based on the principle that components of a mixture are
separated when the element having an affinity towards the stationary phase binds to the
stationary phase. In contrast, other components are eluted with the mobile phase.
❖ The substrate/ ligand is bound to the stationary phase so that the reactive sites for the binding
of components are exposed.
❖ Now, the mixture is passed through the mobile phase where the components with binding sites
for the substrate bind to the substrate on the stationary phase while the rest of the components
are eluted out with the mobile phase.
❖ The components attached to the stationary phase are then eluted by changing the pH, ionic
strength, or other conditions.
Fig 2. Affinity Chromatography
Steps of Affinity chromatography :
➢ The column is prepared by loading it with solid support like agarose or cellulose, onto which the
substrate/ ligand with the spacer arm, is attached.
➢ The mobile phase containing the mixture is poured into the column at a constant rate.
➢ Once the process is complete, the ligand-molecule complex is eluted from the stationary phase
by changing the conditions that favor the separation of ligand and components of the mixture.
Uses of Affinity chromatography :
• Affinity chromatography is used as a staple technique for the separation of enzymes and other
proteins.
• This principle is also applied in the in vitro antigen-antibody reactions.
• This technique is used for the separation of components as well as the removal of impurities
from a mixture.
• Affinity chromatography can be used in the detection of mutation and nucleotide
polymorphisms in nucleic acids.
Examples of Affinity chromatography :
1) The purification of E. coli β-galactosidase from a mixture of proteins using p-aminophenyl-1-
thio-β-D-galactopyranosyl agarose as the affinity matrix.
2) The removal of excess albumin and α2-macroglobulin from serum.
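The bind-wash-elute logic described above can be condensed into a small illustrative sketch (Python).
The component names and binding flags are hypothetical, chosen only to mirror the β-galactosidase
example; a real separation is governed by binding chemistry, not a simple flag :

# Toy model of affinity capture: ligand-binding components are retained,
# everything else washes through; changing pH/ionic strength releases the bound ones.
sample = [
    {"name": "beta-galactosidase", "binds_ligand": True},
    {"name": "serum albumin",      "binds_ligand": False},
    {"name": "lysozyme",           "binds_ligand": False},
]

flow_through = [c["name"] for c in sample if not c["binds_ligand"]]  # washed out
eluate       = [c["name"] for c in sample if c["binds_ligand"]]      # recovered on elution

print("Flow-through:", flow_through)
print("Eluate:", eluate)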
2) ANION EXCHANGE CHROMATOGRAPHY :-
Anion exchange chromatography is the separation technique for negatively charged molecules by their
interaction with the positively charged stationary phase in the form of ion-exchange resin.
Principle of Anion exchange chromatography :
❖ This technique is based on the principle of attraction of positively charged resin and the
negatively charged analyte. Here the exchange of positively charged ions takes place to remove
the negatively charged molecules.
❖ The stationary phase is first coated with positive charges where the components of the mixture
with negative charges will bind.
❖ An anion exchange resin with a higher affinity to the negatively charged components then binds
the components, displacing the positively charged resin.
❖ The anion exchange resin-component complex then is removed by using different buffers.
Fig 3. Anion Exchange Chromatography
Steps of Anion exchange chromatography :
➢ A column packed with positively charged resin is taken as the stationary phase.
➢ The mixture with the charged particles is then passed down the column where the negatively
charged molecules bind to the positively charged resins.
➢ The anion exchange resin is then passed through the column where the negatively charged
molecules now bind to the anion exchange resin displacing the positively charged resin.
➢ Now an appropriate buffer is applied to the column to separate the complex of anion exchange
resins and the charged molecules.
Uses of Anion exchange chromatography :
• Anion exchange chromatography is used to separate proteins and amino acids from their
mixtures.
• Negatively charged nucleic acids can be separated, which helps in further analysis of the nucleic
acids.
• This method can also be used for water purification where the anions are exchanged for
hydroxyl ions.
• Anion exchange resins can be used for the separation of metals as they usually have negatively
charged complexes that are bound to the anion exchangers.
Examples of Anion exchange chromatography :
1) The separation of nucleic acids from a mixture obtained after cell destruction.
2) The separation of proteins from the crude mixture obtained from the blood serum.
3) CATION EXCHANGE CHROMATOGRAPHY :-
Cation exchange chromatography is the separation technique for positively charged molecules by their
interaction with a negatively charged stationary phase in the form of an ion-exchange resin.
Principle of Cation exchange chromatography :
❖ This technique is based on the principle of attraction of negatively charged resin and the
positively charged analyte. Here the exchange of negatively charged ions takes place to remove
the positively charged molecules.
❖ The stationary phase is first coated with negative charges where the components of the mixture
with positive charges will bind.
❖ A cation exchange resin with a higher affinity to the positively charged components then binds
the components, displacing the negatively charged resin.
❖ The cation exchange resin-component complex then is removed by using different buffers.
Steps of Cation exchange chromatography :
➢ A column packed with negatively charged resin is taken as the stationary phase.
➢ The mixture with the charged particles is then passed down the column where the positively
charged molecules bind to the negatively charged resins.
➢ The cation exchange resin is then passed through the column where the positively charged
molecules now bind to the cation exchange resin displacing the negatively charged resin.
➢ Now an appropriate buffer is applied to the column to separate the complex of cation exchange
resins and the charged molecules.
Uses of Cation exchange chromatography :
• Cation exchange chromatography is used for the analysis of the products obtained after the
hydrolysis of nucleic acids.
• This can also be used for the separation of metals where the metal ions themselves bind to the
negatively charged resins to remove the negatively charged complexes.
• Cation exchange chromatography helps in purification of water by exchanging the positively
charged ion by the hydrogen ions.
• It is also used to analyze the rocks and other inorganic molecules.
Examples of Cation exchange chromatography :
1) The separation of positively charged lanthanoid ions obtained from the earth’s crust.
2) The determination of total dissolved salts in natural waters by analyzing the presence of calcium
ions.
4) COLUMN CHROMATOGRAPHY :-
Column chromatography is the separation technique where the components in a mixture are separated
on the basis of their differential adsorption with the stationary phase, resulting in them moving at
different speeds when passed through a column.
It is a solid-liquid chromatography technique in which the stationary phase is a solid & mobile phase is a
liquid or gas.
Principle of Column chromatography :
• This technique is based on the principle of differential adsorption, where different molecules in
a mixture have different affinities for the adsorbent present on the stationary phase.
• The molecules having higher affinity remain adsorbed for a longer time decreasing their speed
of movement through the column.
• However, the molecules with lower affinity move with a faster movement, thus allowing the
molecules to be separated in different fractions.
• Here, the stationary phase in column chromatography, also termed the adsorbent, is a solid
(mostly silica), and the mobile phase is a liquid that allows the molecules to move through the
column smoothly.
Steps of Column chromatography :
➢ The column is prepared by taking a glass tube that is dried and coated with a thin, uniform layer
of stationary phase (cellulose, silica).
➢ Then the sample is prepared by adding the mixture to the mobile phase. The sample is
introduced into the column from the top and allowed to pass through under the influence of
gravity.
➢ The molecules bound to the column are separated by elution, where either a solvent of
constant polarity is used (isocratic technique) or solvents of different polarities are used
(gradient technique).
➢ The separated molecules can further be analyzed for various purposes.
Fig 4. Column Chromatography
Uses of Column chromatography :
• Column chromatography is routinely used for the separation of impurities and purification of
various biological mixtures.
• This technique can also be used for the isolation of active molecules and metabolites from
various samples.
• Column chromatography is increasingly used for the detection of drugs in crude extracts.
Examples of Column chromatography :
1) Extraction of pesticides from solid food samples of animal origin containing lipids, waxes, and
pigments.
2) Synthesis of Pramlintide, an analog of the peptide hormone amylin, for treating type 1 and
type 2 diabetes.
3) Purification of bioactive glycolipids, showing antiviral activity towards HSV-1 (Herpes Virus).
5) FLASH CHROMATOGRAPHY :-
Flash chromatography is a separation technique where smaller sizes of gel particles are used as
stationary phase, and pressurized gas is used to drive the solvent through the column.
Principle of Flash chromatography :
❖ The principle of flash chromatography is similar to that of column chromatography, where the
components are separated on the basis of their differential adsorption to the stationary phase.
❖ The sample applied is passed by using a pressurized gas that makes the process faster and more
efficient.
❖ Molecules bind to the stationary phase on the basis of their affinity, while the rest of the solvent
is eluted out by applying the pressurized gas, which quickens the process.
❖ Here, the stationary phase is solid, the mobile phase and the elution solution are liquid, and an
additional pressurized gas is used.
Fig 5. Flash Chromatography
Steps of Flash chromatography :
➢ The column is prepared by taking a glass tube that is dried and coated with a thin, uniform layer
of stationary phase (cellulose, silica). The bottom and top of the column are packed with cotton
wool to prevent the gel from escaping.
➢ Then the sample is prepared by adding the mixture to the mobile phase. The sample is
introduced into the column from the top, and pressurized gas is used to pass the sample at a
constant rate.
➢ The molecules bound to the column are separated by an elution solution, where either a solvent
of constant polarity is used (isocratic technique) or solvents of different polarities are used
(gradient technique).
➢ The elution solvent is applied with a constant minimum pressure required to move the solute
down the column.
➢ The separated molecules can further be analyzed for various purposes.
Uses of Flash chromatography :
• Flash chromatography is used as a rapid and more efficient method of separation of
components of different mixtures.
• It is used for the removal of impurities from crude extracts of natural and synthetic mixtures.
6) GAS CHROMATOGRAPHY :-
Gas chromatography is a separation technique in which the molecules are separated on the basis of
their retention time depending on the affinity of the molecules to the stationary phase.
The sample is either a liquid or a gas; liquid samples are vaporized at the injection point.
Principle of Gas chromatography :
❖ Gas chromatography is based on the principle that components having a higher affinity to the
stationary phase have a higher retention time as they take a longer time to come out of the
column.
❖ Conversely, the components having a lower affinity to the stationary phase have a shorter
retention time as they move along with the mobile phase.
❖ The mobile phase is a gas, mostly helium, that carries the sample through the column.
❖ The sample, once injected, is converted into the vapour phase, carried through the column, and
then passed through a detector to determine the retention time.
❖ The components are collected separately as they come out of the stationary phase at different
times.
Steps of Gas chromatography :
➢ The sample is injected into the column where it is vaporized into a gaseous state. The
vaporized components then mix with the mobile phase to be carried through the rest of the
column.
➢ The column is set with the stationary phase where the molecules are separated on the basis of
their affinity to the stationary phase.
➢ The components of the mixture reach the detector at different times due to differences in the
time they are retained in the column.
Uses of Gas chromatography :
• This technique is used to calculate the concentration of different chemicals in various samples.
• This is used in the analysis of air pollutants, oil spills, and other samples.
• Gas chromatography can also be used in forensic science to identify and quantify various
biological samples found in the crime scene.
Examples of Gas chromatography
1) The identification of performance-enhancing drugs in an athlete’s urine.
2) The separation and quantification of drugs in soil and water samples.
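Because components are distinguished by when their peaks emerge, the quality of a separation is
usually quantified with the standard resolution formula Rs = 2(tR2 − tR1)/(w1 + w2). A minimal Python
sketch with hypothetical peak values :

def resolution(t_r1, t_r2, w1, w2):
    # Rs from retention times and baseline peak widths (same time units).
    # Rs >= 1.5 is conventionally taken as baseline separation.
    return 2.0 * (t_r2 - t_r1) / (w1 + w2)

# Hypothetical peaks: retention times 4.2 and 5.0 min, widths 0.4 and 0.5 min
print(resolution(4.2, 5.0, 0.4, 0.5))  # ~1.78, baseline separated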
Fig 6. Gas Chromatography
7) GEL FILTRATION CHROMATOGRAPHY/ GEL PERMEATION CHROMATOGRAPHY/ SIZE EXCLUSION
CHROMATOGRAPHY/ MOLECULAR SIEVE CHROMATOGRAPHY :-
Gel-filtration chromatography is a form of partition chromatography used to separate molecules of
different molecular sizes.
This technique has also frequently been referred to by various other names, including gel-permeation,
gel-exclusion, size- exclusion, and molecular- sieve chromatography.
Principle :
❖ Molecules are partitioned between a mobile phase and a stationary phase as a function of their
relative sizes.
❖ The stationary phase is a matrix of porous polymer which has pores of specific sizes.
❖ When the sample is injected with the mobile phase, the mobile phase occupies the pores of the
stationary phase.
❖ If the molecules are small enough to enter the pores, they remain in the pores partly or
wholly.
❖ However, molecules of larger size are excluded from entering the pores, causing them to be
moved with the mobile phase, out of the column.
❖ If the mobile phase used is an aqueous solution, the process is termed gel filtration
chromatography.
❖ If the mobile phase used is an organic solvent, it is termed as gel permeation chromatography.
Fig 7. Gel Filtration Chromatography
Steps :
➢ The column is filled with semi-permeable, porous polymer gel beads with a well-defined range
of pore sizes.
➢ The sample, mixed with the mobile phase, is then injected into the column from the top of the
column.
➢ The molecules bound to the column are separated by an elution solution, where either a solvent
of constant polarity is used (isocratic technique) or solvents of different polarities are used
(gradient technique).
➢ Elution conditions (pH, essential ions, cofactors, protease inhibitors, etc.) can be selected, which
will complement the requirements of the molecule of interest.
Uses :
• One of the principal advantages of gel-filtration chromatography is that separation can be
performed under conditions specifically designed to maintain the stability and activity of the
molecule of interest without compromising resolution.
• The absence of a molecule-matrix binding step also prevents unnecessary damage to fragile
molecules, ensuring that gel-filtration separations generally give high recoveries of activity.
• Because of its unique mode of separation, gel-filtration chromatography has been used
successfully in the purification of proteins and peptides from various sources.
• Gel-filtration chromatography has been used to separate various nucleic acid species such as
DNA, RNA, and tRNA as well as their constituent bases, adenine, guanine, thymine, cytosine, and
uracil.
Examples :
1) The separation of recombinant human granulocyte colony-stimulating factor (rhG-CSF) from
inclusion bodies in high yield by urea-gradient size-exclusion chromatography.
2) The separation of hen egg lysozyme using both acrylamide- and dextran-based gel columns.
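The partitioning described above is commonly quantified by Kav = (Ve − V0)/(Vt − V0), where Ve is the
elution volume, V0 the void volume and Vt the total column volume; log(molecular weight) is
approximately linear in Kav over a column’s working range. A Python sketch, with calibration standards
and column volumes that are illustrative only :

import math

def kav(v_elute, v_void, v_total):
    # Partition coefficient: 0 for fully excluded, approaching 1 for fully included.
    return (v_elute - v_void) / (v_total - v_void)

# Hypothetical two-point calibration: log10(MW) assumed linear in Kav.
kav1, mw1 = 0.20, 150_000
kav2, mw2 = 0.60, 20_000
slope = (math.log10(mw2) - math.log10(mw1)) / (kav2 - kav1)

def estimate_mw(k):
    return 10 ** (math.log10(mw1) + slope * (k - kav1))

# Unknown protein eluting at 14 mL on a column with V0 = 8 mL, Vt = 24 mL
print(round(estimate_mw(kav(14.0, 8.0, 24.0))))  # Kav = 0.375 -> ~62,000 Da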
8) HIGH-PERFORMANCE LIQUID CHROMATOGRAPHY (HPLC) :-
High-performance liquid chromatography is a modified form of column chromatography where the
components of a mixture are separated on the basis of their affinity with the stationary phase.
Principle of HPLC :
❖ This technique is based on the principle of differential adsorption, where different molecules in
a mixture have a varying degree of interaction with the adsorbent present on the stationary
phase.
❖ The molecules having higher affinity remain adsorbed for a longer time decreasing their speed
of movement through the column.
❖ However, the molecules with lower affinity move with a faster movement, thus allowing the
molecules to be separated in different fractions.
❖ This process differs slightly from column chromatography in that the solvent is forced through
under high pressures of up to 400 atmospheres instead of being allowed to drip down under
gravity.
Fig 8. High-performance liquid chromatography
Steps of HPLC :
➢ The column is prepared by packing a tube (typically stainless steel for HPLC, to withstand
the high pressure) with a fine, uniform layer of stationary phase (e.g., silica).
➢ Then the sample is prepared by adding the mixture to the mobile phase. The sample is
introduced into the column from the top, and a high-pressure pump is used to pass the sample
at a constant rate.
➢ The mobile phase then moves down to a detector that detects molecules at a certain
absorbance wavelength.
➢ The separated molecules can further be analyzed for various purposes.
Uses of HPLC :
• High-performance liquid chromatography is used in the analysis of pollutants present in
environmental samples.
• It is performed to maintain product purity and quality control of various industrial productions.
• This technique can also be used to separate different biological molecules like proteins and
nucleic acids.
• The high operating pressure makes the process faster and more effective.
Example of HPLC :
High-performance liquid chromatography has been performed to test the efficiency of different
antibodies against diseases like Ebola.
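Column efficiency in HPLC is routinely reported as a theoretical plate count using the half-height
method, N = 5.54·(tR/w1/2)². A minimal Python sketch with hypothetical peak values :

def plate_count(t_r, w_half):
    # N from retention time and peak width at half height (same time units);
    # higher N means sharper peaks and better separating power.
    return 5.54 * (t_r / w_half) ** 2

# Hypothetical peak: tR = 6.0 min, half-height width = 0.12 min
print(round(plate_count(6.0, 0.12)))  # ~13,850 plates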
9) HYDROPHOBIC INTERACTION CHROMATOGRAPHY :-
Hydrophobic interaction chromatography is the separation technique that separates molecules on the
basis of their degree of hydrophobicity.
Principle of Hydrophobic interaction chromatography :
❖ The principle of hydrophobic interaction chromatography is based on the interaction between
two molecules with hydrophobic groups.
❖ Here, the stationary phase is a solid support carrying both hydrophobic and hydrophilic
groups.
❖ Solute molecules containing hydrophobic regions interact with the hydrophobic groups,
thus separating them from the molecules with hydrophilic groups.
❖ The interaction is then reversed by applying an elution solution with decreasing salt gradient,
which causes the molecules with hydrophobic groups to be separated from the stationary
phase.
Fig 9. Hydrophobic Interaction Chromatography
Steps of Hydrophobic interaction chromatography :
➢ The column is prepared with a glass tube packed with a solid support such as silica gel, onto
which hydrophobic groups such as phenyl, octyl, or butyl are attached.
➢ The sample is prepared by adding the mixture to the mobile phase.
➢ The sample is then injected into the column from the top of the column.
➢ The molecules with hydrophobic groups form an interaction with the hydrophobic groups of the
stationary phase. In contrast, the molecules without such groups move out of the column with
the mobile phase.
➢ Then a particular elution solution with decreasing salt gradient is then passed into the column
that removes the bound molecules from the stationary phase.
Uses of Hydrophobic interaction chromatography :
• Hydrophobic interaction chromatography is extremely important for the separation of proteins
with hydrophobic groups.
• This technique is more appropriate than other methods, as it causes minimal denaturation
of the separated proteins.
• Similarly, this method can also be applied to the separation of other organic compounds with
hydrophobic groups.
• This allows the separation of hydrophilic and hydrophobic biological molecules from each other.
Example of Hydrophobic interaction chromatography :
The separation of plant proteins from the crude extracts.
10) ION EXCHANGE CHROMATOGRAPHY :-
Ion exchange chromatography is the separation technique for charged molecules by their interaction
with the oppositely charged stationary phase in the form of ion-exchange resin.
Principle of Ion exchange chromatography :
❖ This technique is based on the principle of attraction of charged resin and the oppositely
charged analyte. Here the exchange of negatively/ positively charged ions takes place to remove
the charged molecules.
❖ The stationary phase is first coated with particular charges where the components of the
mixture with opposite charges will bind.
❖ A cation or anion exchange resin with a higher affinity to the charged components then binds
the components, displacing the oppositely charged resin.
❖ The cation or anion exchange resin-component complex then is removed by using different
buffers.
Steps of Ion exchange chromatography :
➢ A column packed with charged resin that can either be positively charged or negatively charged
is taken as the stationary phase.
➢ The mixture with the charged particles is then passed down the column where the charged
molecules bind to the oppositely charged resins.
➢ If a cation exchange resin is used, the positively charged molecules now bind to the cation
exchange resin displacing the negatively charged resin.
➢ Similarly, if an anion exchange resin is used, the negatively charged molecules bind to the anion
exchange resin displacing the positively charged resin.
➢ Now an appropriate buffer is applied to the column to separate the complex of charged
exchange resins and the charged molecules.
Fig 10. Ion Exchange Chromatography
Uses of Ion exchange chromatography :
• Ion exchange chromatography is used in the purification of water where the positively charged
ions are replaced by hydrogen ions, and the negatively charged ions are replaced by hydroxyl
ions.
• This method also works as an effective method for the analysis of the products formed after
hydrolysis of nucleic acids.
• The separation of metals and other inorganic compounds is also facilitated by the ion-exchange
chromatography.
Examples of Ion exchange chromatography :
1) The separation of positively charged lanthanoid ions obtained from the earth’s crust.
2) The separation of proteins from the crude mixture obtained from the blood serum.
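In practice, the choice between anion and cation exchange follows from the protein’s isoelectric point
(pI) relative to the working buffer pH : above its pI a protein carries a net negative charge and binds an
anion exchanger; below its pI it is net positive and binds a cation exchanger. A small Python sketch of
this rule of thumb (the pI and pH values are illustrative) :

def choose_exchanger(protein_pi, buffer_ph):
    # Rule of thumb for selecting the ion-exchange mode.
    if buffer_ph > protein_pi:
        return "anion exchanger (protein is net negative)"
    if buffer_ph < protein_pi:
        return "cation exchanger (protein is net positive)"
    return "no net charge at this pH -- the protein will not bind"

print(choose_exchanger(protein_pi=5.2, buffer_ph=7.4))  # anion exchanger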
11) LIQUID CHROMATOGRAPHY :-
Liquid chromatography is a separation technique where the mobile phase used is liquid, and the
separation can take place either in a column or a plain surface.
Principle of Liquid chromatography :
❖ The process of liquid chromatography is based on the affinity of the molecules for the mobile
phase.
❖ If the components to be separated have a higher affinity to the mobile phase, the molecules
move along with the mobile phase and come out of the column faster.
❖ However, if the components have a lower degree of interaction with the mobile phase, the
molecules move slowly and thus come out of the column later.
❖ Thus, if two molecules in a mixture have different polarities and the mobile phase is of a distinct
polarity, the two molecules will move at different speeds through the stationary phase.
Fig 11. Liquid Chromatography
Steps of Liquid chromatography :
➢ The column or paper is prepared where the stationary phase (cellulose or silica) is applied on
the solid support.
➢ The sample is added to the liquid mobile phase, which is then injected into the chromatographic
system.
➢ The mobile phase moves through the stationary phase before coming out of the column or the
edge of the paper.
➢ An elution solution is applied to the system to separate the molecules from the stationary
phase.
Uses of Liquid chromatography :
• Liquid chromatography is an effective method for the separation of colored mixtures, as the
components form separate bands after separation.
• This method can also be used over other techniques as it is quite simple and less expensive.
• It can be used for the separation of solid molecules that are insoluble in water.
Example of Liquid chromatography : High-performance liquid chromatography is a modified form of
liquid chromatography that is used in the research regarding biological molecules.
12) PAPER CHROMATOGRAPHY :-
Paper chromatography is a separation technique where the separation is performed on a specialized
paper.
Principle of Paper chromatography :
❖ Paper chromatography is of two types based on two different principles.
❖ The first is the paper adsorption chromatography that is based on the varying degree of
interaction between the molecules and the stationary phase.
❖ The molecules having higher affinity remain adsorbed for a longer time, decreasing their speed
of movement through the paper.
❖ However, the molecules with lower affinity move with a faster movement, thus allowing the
molecules to be separated in different fractions.
❖ The second type of paper chromatography is the paper partition chromatography. It is based on
the principle that the moisture on the cellulose paper acts as a stationary phase for the
molecules moving with the mobile phase.
❖ The separation of the molecules is thus based on how strongly they adsorb onto the stationary
phase.
❖ An additional concept of ‘retention factor’ is applied during the separation of molecules in the
paper chromatography.
❖ The retention value for a molecule is determined as a ratio of distance traveled by the molecule
to the distance traveled by the mobile phase.
❖ The retention value of different molecules can be used to differentiate those molecules.
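The retention-factor calculation reduces to a simple ratio, as this Python sketch shows (the distances
are hypothetical) :

def retention_factor(spot_distance_cm, solvent_front_cm):
    # Rf = distance moved by the compound / distance moved by the mobile phase;
    # always between 0 and 1, and characteristic of a compound in a given system.
    return spot_distance_cm / solvent_front_cm

# Hypothetical run: dye spot moved 3.4 cm while the solvent front moved 8.0 cm
print(retention_factor(3.4, 8.0))  # Rf = 0.425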
Fig 12. Paper Chromatography
Steps of Paper chromatography :
➢ The stationary phase is selected as a fine quality cellulosic paper.
➢ Different combinations of organic and inorganic solvents are taken as the mobile phase.
➢ About 2-200 µl of the sample solution is spotted at the baseline of the paper and allowed
to air dry.
➢ The sample-loaded paper is then carefully dipped into the mobile phase to a depth of no more
than 1 cm.
➢ After the mobile phase reaches near the edge of the paper, the paper is taken out.
➢ The retention factor is calculated, and the separated components are detected by different
techniques.
Uses of Paper chromatography :
• Paper chromatography is performed to detect the purity of various pharmaceutical products.
• It can also be employed to detect contamination in various samples, like food and beverages.
• This method can also be used for the separation of impurities from various industrial products.
• The analysis of the reaction mixtures in chemical labs is also conducted via paper
chromatography.
Examples of Paper chromatography :
Paper chromatography is used in the separation of mixtures such as inks and other colored substances.
13) REVERSE-PHASE CHROMATOGRAPHY :-
Reverse-phase chromatography is a liquid chromatography technique where the separation of
molecules is achieved through hydrophobic interaction between the liquid mobile phase and the
stationary phase.
Principle of Reverse-phase chromatography :
❖ The principle of reverse phase chromatography is based on the interaction between two
molecules with hydrophobic groups.
❖ Here, the stationary phase is a solid support carrying both hydrophobic and hydrophilic
groups.
❖ Solute molecules containing hydrophobic regions interact with the hydrophobic groups,
thus separating them from the molecules with hydrophilic groups.
❖ The interaction is then reversed by applying an elution solution with decreasing salt gradient,
which causes the molecules with hydrophobic groups to be separated from the stationary
phase.
Steps of Reverse-phase chromatography :
➢ The column is prepared with a glass tube packed with a solid support such as silica gel, onto
which hydrophobic groups such as phenyl, octyl, or butyl are attached.
➢ The sample is prepared by adding the mixture to the mobile phase of organic and inorganic
solvents.
➢ The sample is then injected into the column from the top of the column.
➢ The molecules with hydrophobic groups form an interaction with the hydrophobic groups of the
stationary phase. In contrast, the molecules without such groups move out of the column with
the mobile phase.
➢ Then a particular elution solution with decreasing salt gradient is then passed into the column
that removes the bound molecules from the stationary phase.
Fig 13. Reverse-phase Chromatography
Uses of Reverse-phase chromatography :
• Reverse-phase chromatography, in combination with high-performance liquid chromatography,
is increasingly used for the separation of biomolecules.
• This is also used in the study of the analysis of drugs, metabolites, and active molecules.
• It can also be used to remove impurities from various environmental samples.
Examples of Reverse-phase chromatography :
Hydrophobic interaction chromatography is an example of reverse phase chromatography where this
technique is used to separate proteins from their mixtures.
14) THIN-LAYER CHROMATOGRAPHY (TLC) :-
Thin-layer chromatography is a separation technique where the stationary phase is applied as a thin
layer on a solid support plate with a liquid mobile phase.
Principle of Thin-layer chromatography (TLC) :
❖ This chromatography technique is based on the principle that components of a mixture are
separated according to their relative affinities for the thin stationary layer and the mobile
phase.
❖ As the mobile phase moves up the plate by capillary action, it carries the components of the
mixture with it.
❖ Components with a higher affinity for the stationary phase move slowly, while those with a
lower affinity travel further up the plate.
❖ After separation, the molecules are seen as spots at different locations across the
stationary phase.
❖ The detection of molecules is performed by various techniques.
Fig 14. Thin-Layer Chromatography
Steps of Thin-layer chromatography (TLC) :
➢ The stationary phase is uniformly applied on the solid support (a glass plate, plastic sheet, or
aluminum foil) and dried.
➢ The sample is injected as spots on the stationary phase about 1 cm above the edge of the plate.
➢ The sample-loaded plate is then carefully dipped into the mobile phase to a depth of no more
than 1 cm.
➢ After the mobile phase reaches near the edge of the plate, the plate is taken out.
➢ The retention factor is calculated as in paper chromatography, and the separated components
are detected by different techniques.
Uses of Thin-layer chromatography (TLC) :
• Thin-layer chromatography is routinely performed in laboratories to identify different
substances present in a mixture.
• This technique helps in the analysis of fibers in forensics.
• TLC also allows the assay of various pharmaceutical products.
• It aids in the identification of medicinal plants and their composition.
3.3 PROTEIN SEPARATION TECHNIQUES
Protein separation techniques have traditionally been used to isolate and to purify specific proteins in
order to facilitate studies of their enzymatic, physical, chemical and structural properties. These kinds of
studies are necessary in order to elucidate the biological role of individual proteins in the cell and to
understand the mechanism by which the activity of specific enzymes is controlled.
Because protein separation techniques are based on the chemical, physical and enzymatic properties of
proteins, the behavior of a specific protein during a separation protocol can reveal a great deal about
that protein. For example, ion exchange chromatography can give an indication of the relative net
charge on the protein at a given pH; gel permeation chromatography can be used to determine
molecular size (Stokes radius); affinity chromatography can be used to analyze the interaction of a
protein with specific substrates, inhibitors, activators or antibodies; and hydrophobic interaction
chromatography can be used to examine the hydrophobicity of specific proteins.
Protein purification techniques can also be used to monitor the interactions of specific purified proteins.
Studies of in vitro interactions between two or more proteins from the same tissue or cell often yield
information about how the proteins interact in vivo. In food systems, the interaction between specific
proteins can affect their functionality in the system. Consequently, it is often of great interest to
ascertain whether two purified proteins interact and how this interaction affects their properties.
Protein Fractionation :-
Protein fractionation generally refers to the process of isolating, identifying and characterizing various
proteins present in a sample. However, the analysis of proteomes is usually hindered by the vast
number of proteins, especially since larger, more abundant proteins tend to mask the signal of
lower-abundance proteins. Incidentally, the lower-abundance proteins are usually the more interesting
proteins in the group.
Protein fractionation is routinely used in proteomic research to :
• Reduce the size of the protein pool to be analyzed
• Remove highly expressed proteins
• Bring low-abundance proteins into dynamic range
• Simplify analysis and interpretation
Proteins differ in their molecular size and charge. They can be separated on the basis of their following
properties :
1) molecular size
2) solubility
3) electrical charge
4) adsorption properties
5) specific bioaffinity.
Many techniques for protein purification exist, but the emphasis here is on some of the most popular
procedures and the principles involved in their use. Protein fractionation is required to separate and
characterize a protein in detail.
Most Commonly Used Protein Separation Techniques :-
1) Differential Centrifugation :
A typical crude broken cell preparation contains disrupted cell membranes, cellular organelles and a
large number of soluble proteins all dispersed in an aqueous buffered solution. The membranes and the
organelles can be separated from the soluble proteins by differential centrifugation.
This type of centrifugation involves the use of different speeds and different durations. For example, if
the protein of interest is in the mitochondrial fraction, the crude cell lysate is first centrifuged at 1,000 g
to remove nuclei, debris, etc. The supernatant contains, among other components, the mitochondria,
which can be pelleted at 3,300 g for 10 minutes.
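The g-forces quoted above relate to rotor speed through the standard conversion RCF = 1.118 × 10⁻⁵ ×
r(cm) × rpm². A small Python sketch; the 8 cm rotor radius is an assumption for illustration :

def rcf(rpm, radius_cm):
    # Relative centrifugal force (x g) from rotor speed and radius.
    return 1.118e-5 * radius_cm * rpm ** 2

def rpm_for_rcf(target_g, radius_cm):
    # Invert the formula to find the speed needed for a target g-force.
    return (target_g / (1.118e-5 * radius_cm)) ** 0.5

# Speed needed on a hypothetical 8 cm rotor for the 3,300 g mitochondrial spin above
print(round(rpm_for_rcf(3300, 8.0)))  # ~6,074 rpm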
2) Salt Fractionation :
Proteins show a variation in solubility that depends on the concentration of salts in the solution. This
method is frequently used to separate serum proteins into albumins and globulins. Albumin is soluble in
water whereas globulins are not. Globulins are soluble in weak salt solutions, going into solution at salt
concentrations of 0.1 mol/L. This phenomenon is called “salting in”. This is thought to be due to
electrostatic attraction between salt ions and the charged groups on the protein, which decreases the
intermolecular electrostatic attraction of proteins & increases the interactions of protein molecules with
water, a polar solvent, thus making them soluble. Salts with divalent ions are more effective than those
with monovalent ions.
As the salt concentration is increased, however, salt ions compete for the water molecules of hydration
of the hydrated groups of proteins, resulting in the decreased solubility and precipitation of protein out
of the solution. The larger proteins are usually precipitated first. This phenomenon is referred to as
“salting out” of protein. Ammonium sulphate is commonly used for salting out proteins. Globulins are
the proteins precipitated at half saturation of a solution with ammonium sulphate, while albumin is
precipitated on fully saturating the solution.
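A commonly quoted formula gives the grams of solid ammonium sulphate to add per litre to raise the
saturation from S1% to S2% at about 20 °C : g = 533(S2 − S1)/(100 − 0.3·S2). A Python sketch applying it
to the half- and full-saturation cuts described above (treat the outputs as approximations; published
saturation tables should be used for actual work) :

def ammonium_sulfate_g_per_l(s1_percent, s2_percent):
    # Grams of solid ammonium sulphate per litre to move from S1% to S2% saturation.
    return 533.0 * (s2_percent - s1_percent) / (100.0 - 0.3 * s2_percent)

print(round(ammonium_sulfate_g_per_l(0, 50)))    # 0 -> 50%: ~314 g/L (globulin cut)
print(round(ammonium_sulfate_g_per_l(50, 100)))  # 50 -> 100%: ~381 g/L (albumin cut)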
3) Electrophoresis :
Electrophoresis is the movement of charged particles through an electrolyte in an electric field. The
positively charged particles move towards the cathode and the negative ions to the anode. The rate of
migration of particles of like charge will depend among other things on the number of charges it carries.
Different rates of migration separate a complex mixture such as plasma proteins into a number of
fractions according to mobility. Electrophoresis is not used to purify proteins because it causes some
alteration in protein structure and ultimately function. It is instead used as an analytical method : it
permits estimation of the number of proteins in a mixture and is also useful for determining the
isoelectric point and approximate molecular weight.
4) Chromatography :
This technique was originally used to separate chlorophyll from plant extracts on silica, hence the name
chromatography, which means separation of colored compounds. It is the name given to any technique
in which the members of a group of similar substances are separated by a continuous redistribution
between two phases. One is the stationary phase, which may be solid, liquid, gel or solid/liquid mixture
which is immobilised. The second mobile phase may be liquid or gaseous and flows over or through the
stationary phase. The choice of stationary or mobile phases is made so that the compounds to be
separated have different distribution coefficients.
3.3.1 NATIVE PAGE :-
Native gels, also known as non-denaturing gels, analyze proteins that are still in their folded state. Thus,
the electrophoretic mobility depends not only on the charge-to-mass ratio, but also on the physical
shape and size of the protein.
Blue native PAGE :
BN-PAGE is a native PAGE technique, where the Coomassie Brilliant Blue dye provides the necessary
charges to the protein complexes for the electrophoretic separation. The disadvantage of Coomassie is
that in binding to proteins it can act like a detergent causing complexes to dissociate. Another drawback
is the potential quenching of chemiluminescence (e.g. in subsequent western blot detection or activity
assays) or fluorescence of proteins with prosthetic groups (e.g. heme or chlorophyll) or labelled with
fluorescent dyes.
Clear native PAGE :
CN-PAGE (commonly referred to as Native PAGE) separates acidic water-soluble and membrane proteins
in a polyacrylamide gradient gel. It uses no charged dye so the electrophoretic mobility of proteins in
CN-PAGE (in contrast to the charge shift technique BN-PAGE) is related to the intrinsic charge of the
proteins. The migration distance depends on the protein charge, its size and the pore size of the gel. In
many cases this method has lower resolution than BN-PAGE, but CN-PAGE offers advantages whenever
Coomassie dye would interfere with further analytical techniques, for example it has been described as
a very efficient microscale separation technique for FRET analyses. Also CN-PAGE is milder than BN-
PAGE so it can retain labile supramolecular assemblies of membrane protein complexes that are
dissociated under the conditions of BN-PAGE.
Quantitative native PAGE :
The folded protein complexes of interest separate cleanly and predictably due to the specific properties
of the polyacrylamide gel. The separated proteins are continuously eluted into a physiological eluent
and transported to a fraction collector. In each of four to five PAGE fractions, the metal cofactors can be
identified and absolutely quantified by high-resolution ICP-MS. The respective structures of the isolated
metalloproteins can be determined by solution NMR spectroscopy.
3.3.2 SDS-PAGE :-
SDS PAGE or Sodium Dodecyl Sulphate-Polyacrylamide Gel Electrophoresis is a technique used for the
separation of proteins based on their molecular weight. It is a technique widely used in forensics,
genetics, biotechnology and molecular biology to separate the protein molecules based on their
electrophoretic mobility.
Principle of SDS-PAGE :
The principle of SDS-PAGE states that a charged molecule migrates to the electrode with the opposite
sign when placed in an electric field. The separation of the charged molecules depends upon the relative
mobility of charged species.
The smaller molecules migrate faster due to less resistance during electrophoresis. The structure and
the charge of the proteins also influence the rate of migration. Sodium dodecyl sulphate eliminates the
influence of structure and charge, so that in the polyacrylamide gel the proteins are
separated based on the length of the polypeptide chain.
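Because log(molecular weight) is approximately linear in relative mobility (Rf) on an SDS gel, a band’s
size can be estimated by interpolating against the protein ladder. The Python sketch below assumes a
hypothetical ladder; real calibration depends on the gel percentage and the markers used :

import math

# Hypothetical ladder: (relative mobility Rf, molecular weight in kDa)
ladder = [(0.15, 116.0), (0.30, 66.2), (0.48, 45.0), (0.65, 25.0), (0.85, 14.4)]

def estimate_kda(rf):
    # Linear interpolation of log10(MW) versus Rf between flanking markers,
    # the standard calibration assumption for SDS-PAGE.
    for (rf1, mw1), (rf2, mw2) in zip(ladder, ladder[1:]):
        if rf1 <= rf <= rf2:
            frac = (rf - rf1) / (rf2 - rf1)
            return 10 ** (math.log10(mw1) + frac * (math.log10(mw2) - math.log10(mw1)))
    raise ValueError("Rf outside the calibrated range of the ladder")

print(round(estimate_kda(0.55), 1))  # ~35 kDa, between the 45 and 25 kDa markers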
Role of SDS in SDS-PAGE :
SDS is a detergent present in the SDS-PAGE sample buffer. SDS denatures proteins and coats them with
a uniform negative charge, while added reducing agents break the disulphide bonds, disrupting the
tertiary structure of the proteins.
Materials Required :
1) Power Supply : It is used to convert AC current to the DC current that drives electrophoresis.
2) Gels : These are either prepared in the laboratory or precast gels are purchased from the market
3) Electrophoresis Chambers : The chambers that can fit the SDS-PAGE gels should be used.
4) Protein Samples : The protein is diluted using SDS-PAGE sample buffer and boiled for 10
minutes. A reducing agent such as dithiothreitol or 2-mercaptoethanol is also added to reduce
the disulfide linkages and prevent any tertiary protein folding.
5) Running Buffer : The protein samples loaded on the gel are run in SDS-PAGE running buffer.
6) Staining and Destaining Buffer : The gel is stained with Coomassie Stain Solution and then
destained with the destaining solution. Protein bands are then visible to the naked eye.
7) Protein Ladder : A reference protein ladder is used to determine the location of the protein of
interest, based on the molecular size.
Fig 1. Schematic diagram of SDS-PAGE
Protocol of SDS-PAGE :
1) Preparation of the Gel
• All the reagents are combined, except TEMED, for the preparation of gel.
• When the gel is ready to be poured, add TEMED.
• The separating gel is poured in the casting chamber.
• Overlay with butanol before polymerization to remove any unwanted air bubbles present.
• The comb is inserted in the spaces between the glass plate.
• The polymerized gel is known as the “gel cassette”.
2) Sample Preparation
• Boil some water in a beaker.
• Add 2-mercaptoethanol to the sample buffer.
• Place the buffer solution in microcentrifuge tubes and add protein sample to it.
• Take MW markers in separate tubes.
• Boil the samples for less than 5 minutes to completely denature the proteins.
3) Electrophoresis
• The gel cassette is removed from the casting stand and placed in the electrode assembly.
• The electrode assembly is fixed in the clamp stand.
• 1X electrophoresis buffer is poured in the opening of the casting frame to fill the wells of the gel.
• Pipette about 30 µl of the denatured sample into each well.
• The tank is then covered with a lid and the unit is connected to a power supply.
• The sample is allowed to run at 30 mA for about 1 hour.
• The gel is then stained and the protein bands visualized.
Applications of SDS-PAGE :
The applications of SDS-PAGE are as follows :
➢ It is used to measure the molecular weight of the molecules.
➢ It is used to estimate the size of the protein.
➢ Used in peptide mapping
➢ It is used to compare the polypeptide composition of different structures.
➢ It is used to estimate the purity of the proteins.
➢ It is used in Western Blotting and protein ubiquitination.
➢ It is used in HIV test to separate the HIV proteins.
➢ Analyzing the size and number of polypeptide subunits.
➢ To analyze post-translational modifications.
3.3.3 2D PAGE :-
2D-PAGE is a form of gel electrophoresis in which separation and identification of proteins in a sample
are done by displacement in two dimensions oriented at right angles to one another (orthogonal). This
technique is also used to compare two or more samples to find differences in their protein expression.
In this technique proteins are separated by two different physicochemical properties. In the first
dimension proteins or polypeptides are separated on the basis of their net charges by isoelectric
focusing and in the second dimension they are separated on the basis of their molecular masses by
electrophoresis. Because it is unlikely that two molecules will be similar in both properties, molecules
are more effectively separated in 2-D electrophoresis than in 1-D electrophoresis.
Isoelectric Focusing (IEF) :-
In IEF, proteins are separated by electrophoresis in a pH gradient based on their isoelectric point (pI). A
pH gradient is generated in the gel and an electric potential is applied across the gel. At all pHs other
than their isoelectric point, proteins will be charged. If they are positively charged, they will move
towards the more negative end of the gel and if they are negatively charged they will move towards the
more positive end of the gel. At its isoelectric point, since the protein molecule carries no net charge, it
accumulates, or focuses, into a sharp band.
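The focusing position can be predicted from a protein’s sequence by finding the pH at which the
Henderson-Hasselbalch net charge crosses zero. A Python sketch using one common textbook pKa set
(published tables differ, so treat the result as an estimate only) :

# pKa values: one common textbook set; different sources give slightly different numbers.
POS_PKA = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
NEG_PKA = {"Cterm": 2.0, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}

def net_charge(seq, ph):
    charge = 1.0 / (1 + 10 ** (ph - POS_PKA["Nterm"]))   # free N-terminus
    charge -= 1.0 / (1 + 10 ** (NEG_PKA["Cterm"] - ph))  # free C-terminus
    for aa in seq:
        if aa in POS_PKA:
            charge += 1.0 / (1 + 10 ** (ph - POS_PKA[aa]))
        elif aa in NEG_PKA:
            charge -= 1.0 / (1 + 10 ** (NEG_PKA[aa] - ph))
    return charge

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=0.001):
    # Bisection: net charge falls monotonically with pH, so find its zero --
    # the pH at which the protein focuses in an IEF gradient.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid   # still net positive: the pI lies at higher pH
        else:
            hi = mid
    return round((lo + hi) / 2, 2)

print(isoelectric_point("ACDEFGHIKLMNPQRSTVWY"))  # one of each residue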
Immobilized pH Gradient (IPG) and IEF run :
Immobilized pH gradients are used for IEF because the fixed pH gradients remain stable over extended
run times at very high voltages. The pH gradients of IPGs are generated by means of buffering
compounds that are covalently bound into polyacrylamide gels. IPGs are cast strips with plastic backing
sheets and are commercially available in different pH ranges and lengths. They offer high resolution,
great reproducibility, and allow high protein loads. Isoelectric focusing is run in the same solutions that
are used to extract or solubilize the proteins. The IPG strips must be rehydrated in the
rehydration/sample buffer, during which the protein samples are loaded into the strips. Rehydration
can be active or passive. To load larger proteins, active rehydration under a small voltage is applied. After the
run in IEF cell, the proteins focus as bands on the strip according to their isoelectric points. The focused
strips can be frozen for storage.
Sample Preparation :-
The goal of sample preparation is to solubilize the maximum number of proteins and maintain their
solubility throughout the process. The materials for sample should be carefully collected, snap frozen
and ground under liquid nitrogen in the presence of protease inhibitors. After extracting proteins from
source material they are then solubilized and denatured by means of chaotropes, detergents, and
reducing agents. Hydrogen bonds in the sample proteins are disrupted by the chaotropes urea and thiourea.
Uncharged detergents are used to disrupt hydrophobic interactions. Detergents such as CHAPS, Triton X-
100, sulfobetaine SB3-10, and amidosulfobetaine are IEF-compatible additives. Disulfide bonds are
reduced to sulfhydryls by the reducing agents dithiothreitol (DTT), dithioerythritol, and tributyl phosphine
(TBP).
Sequential extraction is done to categorize proteins based on their solubility. This is an example of pre-
fractionation to enrich low abundance proteins. Proteins are sequentially extracted into
chaotrope/detergent solutions of increasing solubilization power. First, proteins are treated with an
aqueous buffer, the insoluble proteins remaining from this extraction are treated with urea/CHAPS/TBP,
and the insoluble proteins remaining from this step are then treated with urea/thiourea/CHAPS/SB 3-
10/TBP. 2-D PAGE is then done to separate the proteins in each of the supernatants. The remaining
insoluble material from the final extraction can be taken up in SDS-PAGE sample solution and run in a
one-dimensional gel.
Fig 2. Schematic representation of 2D PAGE
Principle of 2D PAGE :-
The principle applied was very simple : proteins were resolved on a gel using isoelectric focusing (IEF),
which separates proteins in the first dimension according to their isoelectric point, followed by
electrophoresis in a second dimension in the presence of sodium dodecyl sulfate (SDS), which separates
proteins according to their molecular mass.
Visualization :-
After electrophoresis the gel is stained to visualize the separated proteins. Commonly used stains are
Coomassie Brilliant Blue, SYPRO Ruby, or silver stain. Different proteins will appear as distinct spots
within the gel. Coomassie Brilliant Blue and SYPRO Ruby are compatible with mass spectrometry.
Coomassie Brilliant Blue has a detection limit of about 10 ng of protein per spot, and gel images of the
spots can be captured by a scanning densitometer operating in visible light. SYPRO Ruby can detect 1 ng
of protein per spot and, since it is fluorescent, the spots are visualized by a fluorescence imager. Silver
stain can detect spots containing less than 1 ng of protein and is the most sensitive non-radioactive
protein visualization method. Laser devices for image capturing are useful for fluorescently stained gels.
Analysis :-
The images can be further analyzed using image-analysis software. These programs quantify protein
spots, match images and compare corresponding spot intensities across related gels, prepare gel data
reports, remove background patterns, and integrate image information with databases. Alternatively,
the separated proteins can be recovered from the gel and analyzed by MS for protein identification. 2D-
PAGE can be used to study differential protein expression by comparing images from 2D-PAGE gels of
samples labeled with stable isotopes or fluorescent dyes.
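At its core, such a differential comparison is a fold-change calculation between matched spot
intensities, as the toy Python sketch below illustrates (spot IDs and intensities are invented; real
packages also normalize backgrounds and match spot positions between images) :

# Matched spot volumes from two gels, e.g., control vs treated sample
control = {"spot_01": 1200.0, "spot_02": 850.0, "spot_03": 300.0}
treated = {"spot_01": 1150.0, "spot_02": 90.0,  "spot_03": 960.0}

for spot in sorted(control):
    fold = treated[spot] / control[spot]
    if fold >= 2.0 or fold <= 0.5:  # a common 2-fold change cut-off
        direction = "up" if fold > 1 else "down"
        print(f"{spot}: {fold:.2f}-fold ({direction}-regulated)")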
Advantages of 2D PAGE :-
• Using 2D-PAGE, hundreds to thousands of polypeptides can be analyzed in a single run.
• The proteins can be separated in pure form from the spots.
• The spots can be quantified and also analyzed by MS.
• In this method polypeptides can be probed with antibodies and also they can be tested for post-
translational modifications.
Disadvantages of 2D PAGE :-
• Requires a large amount of sample handling.
• Limited reproducibility.
• Difficulty in separating low-abundance proteins.
• Difficulty with strongly acidic and basic proteins.
• Difficulty with very large, very small, and hydrophobic proteins.
• Not automated for high-throughput analysis.
• 2D-PAGE has a limited dynamic range.
3.4 PROTEIN STAINING TECHNIQUES
Protein separation and identification is critical for proteome analysis and requires high resolution and
powerful protein characterization after gel electrophoresis. To identify the proteins in the gel,
colorimetric and fluorescence staining techniques are used. The stains include anionic dyes (Coomassie
brilliant blue), metal cations (imidazole-zinc), silver stain, fluorescent dyes, and radioactive probes.
Specific staining methods could also be used to detect post-translational modifications such as
glycosylation or phosphorylation. The protein stain should exhibit high sensitivity (i.e., low detection
limit), quantitative accuracy, reproducibility, ease of use, and compatibility with downstream protein
analysis techniques like mass spectrometry (MS).
A) Post-electrophoretic protein stains :-
1) Coomassie brilliant blue stain (CBB) :-
Coomassie brilliant blue stain is a disulfonated triphenylmethane dye that is used to stain protein bands
bright blue. The stain binds with the protonated basic amino acids (lysine, arginine, and histidine) by
electrostatic interactions and with the aromatic residues by hydrophobic interactions. The CBB has a low
affinity for the polyacrylamide, but it does penetrate the gel matrix, necessitating a destaining step.
Coomassie stains are noncovalent, reversible, and do not interfere with downstream analysis techniques
such as mass spectrometry of excised protein bands.
Staining method :
1) Dissolve 0.025–0.10% (w/v) dye in an acidic alcohol formulation [30–50% (v/v) methanol
(or, less frequently, ethanol) with 7–10% (v/v) acetic acid].
2) Filter the solution with Whatman #1 paper for a stable formulation.
3) Perform the fixation and staining in 10 gel volumes of solution.
4) Destain with 7% acetic acid.
Strengths and limitations :
Coomassie brilliant blue stain is easily available and is the most commonly used protein stain. However,
staining and destaining require more time and reagents.
2) Silver staining :-
The silver stain is the alternative colorimetric stain for increased detection sensitivity as compared to
the Coomassie staining. It is generally considered as the standard for other “ultrasensitive” staining
methods. Silver staining protocols are categorized on the basis of the silvering agents and development
conditions. Silver diamine complex in an alkaline environment is used in alkaline methods; acidic silver
nitrate is used in acidic methods.
Staining method :
1) After separation, fix the gel in fixation solution (50% methanol, 12% acetic acid, and 0.05%
formalin) for 2 hours or overnight, followed by three washes with 35% ethanol for 20 minutes
each.
2) Sensitize gel for 2 minutes, followed by three washes in water for 5 minutes each.
3) Stain the gel in silver nitrate solution (0.2% silver nitrate and 0.076% formalin) for 20 minutes,
followed by two washes in water.
4) Add the developer (6% sodium carbonate, 0.05% formalin, and 0.0004% sodium thiosulfate); once the desired band intensity is reached, stop development by leaving the gel for 5 minutes in stop solution (50% methanol and 12% acetic acid).
Strengths and limitations :
Silver staining is one of the most sensitive colorimetric methods used for the detection of total protein.
The formulations and protocols are optimized and standardized, helping to minimize the effects of
minor differences in day-to-day use.
3) Zinc stain :-
Zinc staining is a negative stain: it stains the polyacrylamide gel everywhere except at the protein bands, enabling their detection. Zinc ions coupled with imidazole precipitate in the gel matrix. The method exploits the ability of proteins and protein-SDS complexes to bind and sequester Zn2+ in the gel, so that zinc imidazolate precipitates as an opaque background contrasting with the transparent protein-SDS-Zn2+ zones. The technique is compatible with mass spectrometry and microsequencing methods for downstream protein analysis and characterization.
Staining method :
1) After electrophoresis, incubate the gel in 0.2 M imidazole with 0.1% sodium dodecyl sulfate (SDS) for 15 minutes.
2) Discard the imidazole solution and incubate the gel in 0.3 M zinc sulfate for 30-45 seconds.
3) Discard the developer and wash the gel several times with water, 1 minute per wash.
4) Store the gel in 0.5% (w/v) sodium carbonate.
Strengths and limitations :
The zinc stain is sensitive, detecting as little as 0.25 ng of protein per band in a mini gel. The entire method takes about 15 minutes. Proteins can be easily recovered from the gel after staining.
Overdevelopment of the stain can be problematic but can be rectified by incubating in 100 mM glycine
to dissolve excess zinc imidazolate.
B) Fluorescent total protein stains :-
1) SYPRO ruby :-
SYPRO Ruby is a luminescent ruthenium metal chelate that interacts with the basic amino acids in proteins. It provides sensitivity close to that of silver staining together with the convenient handling of classical organic stains such as Coomassie blue. The staining does not cause any irreversible modification of amino acids, so good mass spectrometry compatibility is expected, with stable sequence coverage and reliable spot identification.
Staining method :
1) Fix the gel in 50% (v/v) methanol, 10% (v/v) acetic acid, and leave it for 30 minutes or overnight.
2) Stain the gel with SYPRO ruby stain.
3) Briefly destain the gel with 10% (v/v) methanol and 7% (v/v) acetic acid.
4) Visualize under UV light or with blue light sources; the protein bands appear orange-red to the
eye.
Strengths and limitations :
The stain can be used profitably for large-scale proteomic analysis: the staining procedure is simple and allows high-throughput applications. However, the need for a fluorescent scanner makes it expensive.
2) Nile red stain :-
Nile red is a phenoxazone dye that presents strong fluorescence upon transition from aqueous to
hydrophobic environments like SDS micelles or protein–SDS complexes. Nile red does not bind the SDS
monomers; this property makes it a rapid, non-fixative total protein staining method for SDS gels.
Staining method :
1) Dilute the Nile red from a stock solution (0.4 mg/mL in dimethyl sulfoxide, DMSO) 200-fold into water, giving a 2 µg/mL staining solution.
2) Add a tenfold volume excess (e.g., 50 mL staining solution for a 5 mL mini gel) to a gel and
thoroughly agitate.
3) Briefly wash the gel with water after staining.
4) The gels can be viewed under UV light or with green light sources; the protein bands appear
pale red.
Strengths and limitations :
Detection sensitivity of the stain is similar to that of Coomassie stains. Nile red-stained gels could be
subsequently electroblotted with excellent transfer efficiency. A high fluorescence background and poor photostability are the limitations of this staining technique.
3) Epicocconone Stain :-
The epicocconone stain is a fluorescent stain isolated from the fungus Epicoccum nigrum. It is an azaphilone that reacts with primary amines and NH3 to produce red fluorescent compounds, enabling detection of proteins in gels.
Staining method :
1) Fix the SDS–PAGE gel in 7.5% (v/v) acetic acid for 60 minutes.
2) Wash the gel with water (2×30 min).
3) Stain the gel for 1 hour in an aqueous solution into which an aliquot of the fluorophore stock
solution is diluted.
4) Incubate the gel in 0.05% (v/v) ammonia (3×10 min).
5) Acquire the images with UV or blue or green visible light.
Strengths and limitations :
The stain is a rapid and sensitive fluorescent total protein stain suitable for both protein gels and blots.
It is highly compatible with mass spectrometry, Edman sequencing, and Ettan DIGE system. It is also
environmentally friendly and does not contain any heavy metals, allowing safe disposal after use.
The staining method for protein identification in gel should be chosen based on its sensitivity and
downstream analysis techniques. While selecting the stain, the composition of the proteins of interest
(presence or absence of particular amino acids), sample availability, and the post-staining application of the
gel/protein should be considered.
3.5 TECHNIQUES OF PROTEIN DIGESTION
Proteins found in nature vary in size from 5 kDa to greater than 400 kDa. Protein digestion is the process of cutting proteins into shorter fragments, known as peptides. It allows proteins to be identified and characterized according to their properties. Protein digestion is a crucial step prior to mass spectrometry (MS) analysis of peptides for successful protein identification and characterization, biomarker discovery, and systems biology. Although it is possible for MS to study intact proteins, smaller peptides facilitate protein identification and improve the coverage of proteins that would otherwise be reduced by limited solubility and heterogeneity. Therefore, the most common proteomic approaches utilize site-specific digestion to generate smaller fragments. Peptides are also easier to separate and characterize by high performance liquid chromatography (HPLC) and HPLC-coupled MS.
1) Edman Degradation :-
Edman degradation, developed by Pehr Edman, is a method of sequencing amino acids in a peptide. In
this method, the amino-terminal residue is labeled and cleaved from the peptide without disrupting the
peptide bonds between other amino acid residues.
Mechanism :
Phenyl isothiocyanate is reacted with an uncharged N-terminal amino group, under mildly alkaline
conditions, to form a cyclical phenylthiocarbamoyl derivative. Then, under acidic conditions, this
derivative of the terminal amino acid is cleaved as a thiazolinone derivative. The thiazolinone amino acid
is then selectively extracted into an organic solvent and treated with acid to form the more stable
phenylthiohydantoin (PTH)- amino acid derivative that can be identified by using chromatography or
electrophoresis. This procedure can then be repeated again to identify the next amino acid. A major
drawback to this technique is that the peptides being sequenced in this manner cannot have more than
50 to 60 residues (and in practice, under 30). The peptide length is limited due to the cyclical
derivatization not always going to completion. The derivatization problem can be resolved by cleaving
large peptides into smaller peptides before proceeding with the reaction. Modern machines, capable of over 99% efficiency per cycle, can accurately sequence up to about 30 amino acids. An advantage of the Edman degradation is that it requires only 10–100 picomoles of peptide for the sequencing process. The Edman degradation reaction was automated in 1967 by Edman and Beggs to speed up the process, and 100 automated devices were in use worldwide by 1973.
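The read-length limit follows directly from this per-cycle efficiency: only chains that react completely in every cycle stay "in phase". A minimal Python sketch (the 98% repetitive yield is an illustrative value, not from the text):

# Illustration of why Edman read length is limited: with a repetitive
# yield of 98% per cycle, the fraction of chains still in phase after
# n cycles is 0.98**n, so the signal decays with each residue read.

def in_phase_fraction(efficiency: float, cycles: int) -> float:
    """Fraction of peptide chains still correctly in phase after `cycles` rounds."""
    return efficiency ** cycles

for n in (10, 30, 50, 60):
    print(f"{n:2d} cycles at 98% yield -> {in_phase_fraction(0.98, n):.1%} in phase")
# 10 -> 81.7%, 30 -> 54.5%, 50 -> 36.4%, 60 -> 29.8%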
Coupled Analysis :
Following 2D SDS PAGE the proteins can be transferred to a polyvinylidene difluoride (PVDF) blotting
membrane for further analysis. Edman degradations can be performed directly from a PVDF membrane.
N-terminal residue sequencing of five to ten amino acids may be sufficient to identify a Protein of Interest (POI).
Advantages :
Edman degradation is very useful because it does not damage the protein, allowing sequencing to be done in less time. Edman sequencing works best if the amino acid composition of the peptide is known. To determine the composition, the peptide must first be hydrolyzed; this can be done by denaturing the protein and heating it with HCl for a prolonged period. The liberated amino acids are then separated by ion exchange chromatography, derivatized with ninhydrin, and quantified by optical absorbance. In this way the composition, but not the sequence, can be determined.
Limitations :
Because the Edman degradation proceeds from the N-terminus of the protein, it will not work if the N-
terminus has been chemically modified (e.g. by acetylation or formation of pyroglutamic acid).
Sequencing will stop if a non-α-amino acid is encountered (e.g. isoaspartic acid), since the favored five-
membered ring intermediate is unable to be formed. Edman degradation is generally not useful to
determine the positions of disulfide bridges. It also requires peptide amounts of 1 picomole or above for
discernible results.
Sequencing Larger Proteins :
Larger proteins cannot be sequenced directly by the Edman method because of its less-than-perfect efficiency. Instead, a strategy called divide and conquer cleaves the larger protein into smaller, more manageable peptides, using a chemical or enzyme that cleaves the protein at specific amino acid residues. The separated peptides can be isolated by chromatography and then, because of their smaller size, sequenced by the Edman method.
In order to put together all the sequences of the different peptides, a method of overlapping peptides is
used. The strategy of divide and conquer followed by Edman sequencing is applied a second time, but using a different enzyme or chemical that cleaves the protein at different residues. This yields two different sets of peptide sequences from the same protein, broken at different points. By comparing the two sets and examining the overlaps between them, the sequence of the original protein can be deduced.
For example, trypsin can be used on the initial protein to cleave it at the carboxyl side of arginine and lysine residues. Sequencing the resulting peptides individually by Edman degradation yields the sequence of each fragment, but their order within the protein remains scrambled. Chymotrypsin, which cleaves on the carboxyl side of aromatic and other bulky nonpolar residues, can then be used; the sequences of these segments overlap with those of the trypsin fragments, and the overlaps can be used to deduce the original sequence of the protein (see the sketch below). This method is, however, limited for larger proteins (more than about 100 amino acids), and only the linear (primary) sequence can be determined; secondary interactions such as hydrogen bonding and other weak intermolecular forces, e.g. hydrophobic interactions, cannot be predicted in this way.
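The overlap logic can be made concrete with a toy sketch. Assuming hypothetical fragment sets from a short peptide (the sequence below is used purely for illustration), a greedy merge of the fragments with the largest overlaps recovers the original order:

# Toy sketch of the overlapping-peptides idea (sequences are illustrative):
# fragments from two different digests are merged by their overlaps.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(fragments: list[str]) -> str:
    """Greedy merge: repeatedly join the pair with the largest overlap."""
    frags = fragments[:]
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, left index, right index)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j and overlap(a, b) > best[0]:
                    best = (overlap(a, b), i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags[0]

# Tryptic fragments (cut after K/R) plus two bridging chymotryptic fragments:
tryptic = ["AGCK", "NFFWK", "TFTSC"]
bridging = ["AGCKNF", "KTF"]
print(assemble(tryptic + bridging))   # -> AGCKNFFWKTFTSC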
2) Peptide Purification :-
For purification of peptides, it is often difficult to use methods similar to those applied in the purification
of other organic compounds, mainly due to their complexity. Purification of organic molecules often
uses methods based on crystallization to isolate the desired molecule. As efficiency and high yields are
of vital importance for optimal economy of any industrial manufacturing process, methods other than
those based on crystallization have been explored for purification of peptides and peptide-like
molecules. These methods usually utilize various principles of chromatography such as ion exchange
chromatography, gel permeation chromatography and medium- or high-pressure reversed phase
chromatography. The examples mentioned are the most commonly used in peptide purification today
but other methods have been used in the past, most noteworthy being counter current distribution and
partition chromatography.
Synthesis-related impurities :
The ultimate goal for any purification process is to obtain a preparation that meets the quality
requirements set for the compound to be purified. In the manufacture of active pharmaceutical
ingredients, it is recommended that no single unknown impurity is more than 0.1% (e.g. as determined
by high performance liquid chromatography, HPLC, and given as relative area percent) in the final
substance. To this end it is valuable if the nature of potential and actual impurities is known prior to the
design and development of the purification procedure.
In peptide synthesis, the chemistry is well known and many different side reactions have been reported
and described in the literature. Examples of impurities that may be generated are diastereomers,
hydrolysis products of labile amide bonds, deletion sequences formed predominantly in solid-phase
peptide synthesis and insertion peptides and by-products formed during removal of protection groups in
the final step of the synthesis. Polymeric forms of the desired peptide are also known. These are often
by-products associated with formation of cyclic peptides containing disulphide bonds.
Methods of Peptide Purification :
1) Reversed phase chromatography (RPC) : The most powerful method for peptide purification is
without doubt reversed phase chromatography, which utilizes hydrophobic interactions as the main separation principle. It is characterized by the use of a hydrophobic stationary phase and an aqueous mobile phase containing an organic solvent such as acetonitrile or an alcohol. Various chromatography media have
been used for large-scale purifications of peptides on reversed phase resins. Among the most popular
are those based on C-4, C-8 and C-18 alkyl chains attached to a silica surface. Phases based on synthetic
polymers are also used and have recently received increased attention due to the chemical stability of
these materials.
For industrial scale purifications on reversed phase resin columns, important considerations are shape
and size of the particles of the stationary phase. Columns packed with spherical particles are preferred
to those packed with irregularly shaped particles; the latter are highly likely to result in clogging of frits. Such a risk increases with extensive use of the columns and when the column is operated under dynamic
axial compression. Particle size is another important characteristic of bonded silica phases that strongly
influences column efficiency.
For large-scale applications, a particle size of 10-16 microns normally yields satisfactory separations on
reversed phase columns.
2) Ion exchange chromatography (IEC) : In this technique separation is dependent on the ionic
interaction between the support surface and charged groups of the peptide. Both cation and anion
exchangers have been used with success for peptide purifications. For large-scale purifications, high
flow-rates and efficiency are desirable characteristics. Such high performance ion exchangers of
different chemical compositions are now commercially available. Examples of such materials are
primarily based on agarose. Synthetic polymers withstanding high concentrations of acid and base have
also been reported.
High mechanical strength is a desirable property of the ion exchange material as it allows large-scale
purifications in columns under dynamic axial compression. Mechanical strength is often associated with
a hydrophobic character of the stationary phase. This may lead to poor recoveries and may also have a
negative impact on the separation efficiency. However, a possibility to circumvent this problem is to use
an organic modifier in the mobile phase.
3) Gel permeation chromatography (GPC) : This method separates molecules primarily on the basis of
size exclusion. The technique is highly efficient for separation of polymeric forms of peptides and for
desalting of peptide solutions. Sephadex is a well-known example of a commercially available gel
permeation material and has been used successfully in purifications of various molecules. Disadvantages
with gel permeation chromatography are the low capacity and the relatively low flow-rates that can be
applied for optimal separation on such columns.
3.6 PROTEIN ANALYSIS BY MASS SPECTROMETRY
Protein mass spectrometry refers to the application of mass spectrometry to the study of proteins. Mass
spectrometry is an important method for the accurate mass determination and characterization of
proteins, and a variety of methods and instrumentations have been developed for its many uses. Its
applications include the identification of proteins and their post-translational modifications, the
elucidation of protein complexes, their subunits and functional interactions, as well as the global
measurement of proteins in proteomics. It can also be used to localize proteins to the various
organelles, and determine the interactions between different proteins as well as with membrane lipids.
The two primary methods used for the ionization of protein in mass spectrometry are electrospray
ionization (ESI) and matrix-assisted laser desorption/ionization (MALDI). These ionization techniques
are used in conjunction with mass analyzers such as tandem mass spectrometers. In general, proteins are analyzed either in a “top-down” approach, in which proteins are analyzed intact, or a “bottom-up” approach, in which proteins are first digested into fragments. An intermediate “middle-down” approach, in which larger peptide fragments are analyzed, may also sometimes be used.
History :-
The application of mass spectrometry to study proteins became popularized in the 1980s after the
development of MALDI and ESI. These ionization techniques have played a significant role in the
characterization of proteins. The term matrix-assisted laser desorption ionization (MALDI) was coined in the late 1980s by Franz Hillenkamp and Michael Karas. Hillenkamp, Karas and their fellow researchers were able to ionize the amino acid alanine by mixing it with the amino acid tryptophan and irradiating it with a pulsed 266 nm laser. Though important, the breakthrough did not come until 1987, when Koichi Tanaka used the “ultra fine metal plus liquid matrix method” to ionize biomolecules as large as the 34,472 Da protein carboxypeptidase-A.
In 1968, Malcolm Dole reported the first use of electrospray ionization with mass spectrometry. Around
the same time MALDI became popularized, John Bennett Fenn was cited for the development of
electrospray ionization. Koichi Tanaka received the 2002 Nobel Prize in Chemistry alongside John Fenn,
and Kurt Wüthrich “for the development of methods for identification and structure analyses of
biological macromolecules.” These ionization methods have greatly facilitated the study of proteins by
mass spectrometry. Consequently, protein mass spectrometry now plays a leading role in protein
characterization.
3.6.1 MALDI-TOF MASS SPECTROMETRY :-
MALDI-TOF mass spectrometry is a versatile analytical technique to detect and characterize mixtures of
organic molecules. In Microbiology, it is being used as a rapid, accurate and cost-effective method for
the identification of microorganisms (bacteria, fungi and viruses). A typical experiment consists of growth
of the organism (e.g. bacteria), colony selection and placement on a target, addition of matrix, and
analysis with MALDI-TOF MS.
MALDI stands for Matrix-Assisted Laser Desorption Ionization. In this ionization method samples are
fixed in a crystalline matrix and are bombarded by a laser. The sample molecules vaporize into the
vacuum while being ionized at the same time without fragmenting or decomposing.
TOF stands for Time of Flight, a mass spectrometry method that separates ions by their mass to charge
ratio and determines that mass to charge ratio by the time it takes for the ions to reach a detector.
This technology generates characteristic mass spectral fingerprints, which are compared with a large library of mass spectra. Because the spectral fingerprints are unique signatures for each microorganism, accurate microbial identification at the genus and species levels can be achieved using bioinformatic pattern profiling.
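As an intuition for the library comparison, here is a minimal Python sketch (all peak lists and m/z values are invented for illustration; commercial systems such as the MALDI Biotyper use more elaborate weighted scoring):

# Minimal sketch of fingerprint matching (all peak lists are invented):
# score an unknown spectrum against reference fingerprints by counting
# reference peaks matched within a mass tolerance.

def match_score(observed, reference, tol=2.0):
    """Fraction of reference peaks (m/z) matched by the observed spectrum."""
    hits = sum(any(abs(o - r) <= tol for o in observed) for r in reference)
    return hits / len(reference)

library = {
    "Organism A": [4365, 5096, 6255, 7158, 9226, 9742],
    "Organism B": [3444, 4305, 5524, 6888, 9621],
}
unknown = [4364, 5097, 6254, 7159, 9225, 9740]

for organism, peaks in library.items():
    print(f"{organism}: score = {match_score(unknown, peaks):.2f}")
# The highest-scoring library entry suggests the identification.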
MALDI : Principle and Methodology :-
1) The sample to be analyzed is mixed or coated with a solution of an energy-absorbent organic compound called the matrix; the MALDI matrix functions as an energy absorber upon laser irradiation. Commonly used matrices are α-cyano-4-hydroxycinnamic acid, 2,5-dihydroxybenzoic acid, 3,5-dimethoxy-4-hydroxycinnamic acid, and 2,6-dihydroxyacetophenone.
2) The matrix, together with the analyte, is deposited on a target plate made of a conducting metal. A laser beam, typically ultraviolet (UV) or infrared (IR), then irradiates the spot, where the matrix becomes vibrationally excited.
3) The matrix molecules energetically desorb and ionize under the laser pulse, carrying the analyte molecules into the gas phase.
4) The protonated ions are accelerated and separated on the basis of their mass-to-charge ratio; lighter ions travel faster than heavier ions of the same charge.
5) These protonated ions are then detected and measured by a time-of-flight (TOF) analyzer.
Fig 1. Principle of MALDI
Time of flight (TOF) analyzer : Principle and methodology :-
Fig 2. Principle and Working of TOF Analyzer
1) The basic principle of the TOF analyzer is that it measures the time ions of different mass and charge take to travel the length of the flight tube (a numerical sketch is given after this list).
2) All ions start traveling at the same time; smaller, lighter ions reach the detector faster than larger, heavier ones.
3) With a linear TOF analyzer alone, ions of the same mass do not all receive the initial pulse with equal intensity, which broadens the peaks; therefore a reflector is applied. The reflector contains a series of ring electrodes at high voltage and is set at a slightly displaced angle, turning the ions back down the flight tube.
4) Ions that travel faster penetrate deeper and spend more time within the reflector, so the detector receives ions of similar mass at the same time.
5) The TOF analyzer offers higher resolution than many other analyzers and is widely applied for microbiological identification.
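The flight-time relation behind this can be written t = L·sqrt(m / (2zeU)) for an ion of mass m and charge z accelerated through voltage U into a field-free tube of length L. A minimal Python sketch (the voltage and tube length are illustrative values, not from the text):

# Sketch of the time-of-flight relation t = L * sqrt(m / (2 z e U)):
# ions accelerated through U volts drift down a field-free tube of
# length L. Voltage and tube length below are illustrative values.
import math

E_CHARGE = 1.602e-19   # elementary charge, C
DA_TO_KG = 1.661e-27   # 1 Da in kg

def flight_time(mass_da: float, charge: int = 1,
                voltage: float = 20_000.0, length_m: float = 1.0) -> float:
    """Flight time (s) of an ion of given mass (Da) and charge."""
    mass_kg = mass_da * DA_TO_KG
    return length_m * math.sqrt(mass_kg / (2 * charge * E_CHARGE * voltage))

for m in (1_000, 10_000, 100_000):   # peptide- to protein-sized ions
    print(f"{m:>7} Da -> {flight_time(m) * 1e6:6.1f} microseconds")
# Lighter ions arrive first; flight time grows with the square root of mass.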
Working Principle of MALDI-TOF Mass Spectrometry :-
The MALDI TOF process is a two-phase procedure;
• Ionization Phase
• Time of Flight Phase
1) Ionization Phase :
Initially, the samples are fixed in a crystalline matrix in a target plate and are bombarded by a laser. The
sample molecules vaporize into the vacuum while being ionized at the same time. High voltage is then
applied to accelerate the charged particles.
2) Time of Flight Phase :
a) In the linear mode, particles impinge upon the linear detector within microseconds after ionization; higher-mass molecules arrive later than lighter ones. Flight-time measurements make it possible to determine molecular masses directly. Each peak in the spectrum corresponds to the specific mass of a particle along the time axis, starting from the ionization moment.
b) In the reflector mode, the particles are diverted so that they fly towards a second detector. In
addition to extending the flight distance, the reflector also focuses the masses. The combination
of these two effects makes for higher resolution than in the linear mode.
The net result is a generation of a mass spectrum which is compared with those of well-characterized
organisms available in the reference library database to identify the isolate.
Procedure of MALDI-TOF MS :-
1) Pick a bacterial colony and smear it onto a target plate.
2) Add 1–2 µl of a matrix consisting of α-cyano-4-hydroxycinnamic acid (CHCA) dissolved in 50% acetonitrile with 2.5% trifluoroacetic acid, and let the spot dry on the target plate (in room air).
3) Place the target plate into the plating chamber of the mass spectrometer, close it and perform
the analysis.
Fig 3. Proteomic Fingerprints of Microorganisms
Advantages of MALDI-TOF Mass Spectrometry over Conventional Technology :-
• Significantly decreases the turnaround time; processing time is similar to that of rapid biochemical tests.
• The sample preparation is simple and the sample requirement is minimal; a single colony is sufficient to generate spectra of adequate quality.
• Cost effective-low consumable costs
• Automated, robust, interlaboratory reproducibility
• Broad applicability (all types of bacteria including anaerobes, fungi)
• Adaptable-open system, expandable by user
Limitations :-
• Identification of new isolates is possible only if the spectral database contains peptide mass
fingerprints of the type strains of specific genera/species/subspecies/strains
• No susceptibility information is provided
• Not useful for direct testing of clinical specimens (except urine)
• Some organisms require repeat analysis and additional processing (extraction)
• The acceptable score cutoffs vary between studies and some closely related organisms are not
differentiated.
• Some organisms currently cannot be reliably identified by this method, such as Shigella spp and
Streptococcus pneumoniae.
Applications of MALDI-TOF MS :-
1) Biochemistry :
MALDI-TOF MS is used in the field of biochemistry to identify and characterize proteins, since the intact molecular weight of a protein, and the fragments it yields upon ionization, are characteristic of that protein.
2) Peptide mass fingerprinting (PMF) :
Peptide mass fingerprinting is used for the identification of proteins from simple mixtures, where MALDI-TOF plays an important role owing to its simple operation, high resolution, and good mass accuracy. Peptides are generated by enzymes such as trypsin, analyzed by MALDI-TOF MS, and the obtained peptide masses are compared with a database.
3) Clinical and Environmental Bacteriology :
For rapid, reliable, and cost-effective diagnosis from clinical and environmental samples, MALDI-TOF MS is appropriate for the early identification of bacterial cultures. MALDI-TOF can identify organisms from blood cultures, stool and urine samples, cerebrospinal fluid, and respiratory tract specimens.
It can also identify food and waterborne pathogens and environmental samples.
4) Detection of viruses :
Besides immunological methods, researchers have shown MALDI-TOF MS to be an advanced and effective tool for diagnosing infectious viruses such as influenza viruses, herpes viruses, and hepatitis viruses.
First, the viral genetic materials are amplified by PCR and later identified by MALDI – TOF MS which
gives rapid and reliable results in a short period.
5) Organic chemistry :
Some synthetic macromolecules with a high molecular weight such as catenanes, rotaxanes, and
hyperbranched polymers can be analyzed rapidly with effective results.
6) Medicine :
MALDI-TOF MS may even serve as an early detection technique for various types of cancer; for example, it has been used to identify a membrane protein associated with pancreatic cancer.
It can also be used to detect drug-resistant bacteria, which could help physicians decide on appropriate therapy before prescribing medicine.
3.6.2 ELECTROSPRAY IONIZATION (ESI) :-
Electrospray ionization is a soft ionization technique that is typically used to determine the molecular
weights of proteins, peptides, and other biological macromolecules. Soft ionization is a useful technique for biological molecules of large molecular mass, such as those just mentioned, because the process does not fragment the macromolecules into smaller charged particles; instead, the solution containing them is dispersed into small charged droplets. These droplets are then desolvated into progressively smaller droplets, yielding protonated molecules. These protonated, desolvated molecular ions are then passed through the mass analyzer to the detector, and the mass
of the sample can be determined.
The electrospray ionization technique was first reported by Masamichi Yamashita and John Fenn in
1984. The development of electrospray ionization for the analysis of biological macromolecules was
rewarded with the attribution of the Nobel Prize in Chemistry to John Bennett Fenn in 2002. One of the
original instruments used by Dr. Fenn is on display at the Science History Institute in Philadelphia,
Pennsylvania.
ESI-MS : Introduction :-
Electrospray ionization mass spectrometry is a desorption ionization method. Desorption ionization methods can be performed on solid or liquid samples and allow the sample to be nonvolatile or thermally unstable; this means that samples such as proteins, peptides, oligopeptides, and some inorganic molecules can be ionized. Although the mass analyzer itself has a limited m/z range, the multiple charging produced by electrospray brings even large molecules into that range, so the mass of an unknown injected sample can be determined. This analysis is done by considering the mass-to-charge ratios of the various peaks in the spectrum (Figure 1). The spectrum is shown with the mass-to-charge (m/z) ratio on the x-axis and the relative intensity (%) of each peak on the y-axis. Calculations to determine the unknown mass, Mr, from the spectral data can then be performed using

z1 = (p2 - 1.008) / (p2 - p1)    and    Mr = z1 × (p1 - 1.008)

where p1 and p2 are adjacent peaks; peak p1 comes before peak p2 in the spectrum and has a lower m/z value, and z1 is the charge of peak p1 (1.008 Da being the approximate mass of a proton). It should be noted that as the m/z value increases, the number of protons attached to the molecular ion decreases.
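As a worked sketch of these relations (the peak values below are invented for illustration):

# Worked sketch of the deconvolution relations above (peak values are
# illustrative): z1 = (p2 - mH)/(p2 - p1), Mr = z1 * (p1 - mH).

M_H = 1.008  # approximate mass of a proton, Da

def deconvolute(p1: float, p2: float):
    """Charge of peak p1 and neutral mass Mr from two adjacent peaks (p1 < p2)."""
    z1 = round((p2 - M_H) / (p2 - p1))
    mr = z1 * (p1 - M_H)
    return z1, mr

p1, p2 = 893.1, 992.2            # m/z of two neighbouring charge states
z1, mr = deconvolute(p1, p2)
print(f"charge of p1 = {z1}+, Mr = {mr:.0f} Da")   # 10+, ~8921 Da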
Fig 1. Mass Spectrum Example
Figure 1 above illustrates these concepts. Electrospray ionization mass spectrometry research was pioneered by the analytical chemistry professor John Bennett Fenn, who shared the Nobel Prize in Chemistry with Koichi Tanaka in 2002 for his work on the subject.
Sample Preparation :-
Samples for injection into the electrospray ionization mass spectrometer work best if they are first purified. Purity is important because this technique does not work well when mixtures are used as the analyte. For this reason, a means of purification is often employed so that a homogeneous sample is injected into the capillary needle. High performance liquid chromatography, Capillary
Electrophoresis, and Liquid-Solid Column Chromatography are methods of choice for this purpose. The
chosen purification method is then attached to the capillary needle, and the sample can be introduced
directly.
Apparatus :-
Fig 2. Apparatus of Electrospray Ionization Mass Spectrometry
1) Capillary Needle :
The capillary needle is the inlet into the apparatus for the liquid sample. Once in the capillary needle,
the liquid sample is nebulized and charged. There is a large amount of pressure being applied to the
capillary needle, which in effect nebulizes the liquid sample, forming a mist. The stainless steel capillary needle is also surrounded by an electrode that holds a steady voltage of around 4000 volts. This applied voltage places a charge on the droplets; therefore, the mist ejected from the needle is composed of charged molecular ions.
2) Desolvating Capillary :
The molecular ions are oxidized upon entering the desolvating capillary, and a continual voltage is
applied to the gas chamber in which this capillary is located. Here the desolvation process begins,
through the use of a dry gas or heat, and the desolvation process continues through various pumping
stages as the molecular ion travels towards the mass analyzer. An example of a dry gas would be an N2
gas that has been dehydrated. The gas or heat then provides means of evaporation, or desolvation, for
the ionized droplets. As the droplets become smaller, their surface charge density becomes more concentrated. The like charges repel one another, and the point at which this repulsion can no longer be supported by the droplet’s surface tension is known as the Rayleigh limit. At this point, the droplet divides into smaller droplets of either positive or negative charge. This process is referred to either as a coulombic explosion, or the ions are described as exiting the droplet through the “Taylor cone”. Once the molecular ions have reached the entrance to the mass analyzer, they have been effectively desolvated and remain as protonated molecular ions.
3) Mass Analyzer :
Mass Analyzers (Mass Spectrometry) are used to determine the mass-to-charge ratio (m/z), this ratio is
used to differentiate between molecular ions that were formed in the desolvating capillary. In order for
a mass-to-charge ratio to be determined, the mass analyzer must be able to separate even the smallest
masses. The ability of the analyzer to resolve the mass peaks can be defined with the following equation:

R = m / Δm

This equation represents the mass of the first peak (m) divided by the difference in mass between neighboring peaks (Δm). The better the resolution, the more useful the data. The mass analyzer must
also be able to measure the ion currents produced by the multiply charged particles that are created in
this process.
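As a quick worked example of this definition (numbers invented for illustration):

# Worked example of R = m / delta_m: the resolution needed to separate
# two hypothetical peaks at m/z 1000.0 and 1000.5.
m, delta_m = 1000.0, 0.5
print(f"required resolution R = {m / delta_m:.0f}")   # R = 2000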
Mass analyzers use electrostatic lenses to direct the beam of molecular ions to the analyzer. A vacuum
system is used to maintain a low pressure environment in order to prevent unwanted interactions
between the molecular ions and any components that may be present in the atmosphere. These
atmospheric components can affect the determined mass-to-charge ratio, so it is best to keep them to a
minimum. The mass-to-charge ratio is then used to determine quantitative and qualitative properties of
the liquid sample.
The mass analyzer used for electrospray ionization is typically a quadrupole mass spectrometer. A quadrupole mass analyzer consists of four parallel rods carrying superimposed DC and radio-frequency AC voltages; the rods work in opposite pairs, one pair connected to the positive terminal of the DC voltage and the other pair to the negative terminal. The molecular ions are driven through the chamber between these pairs of oppositely charged rods by a potential difference. To maintain their charge, and ultimately be readable by the detector, the molecular ions must travel through the quadrupole chamber without touching any of the four charged rods. If a molecular ion does run into one of the rods, it is neutralized and becomes undetectable.
4) Detector :
The molecular ions pass through the mass analyzer to the detector. The detector most commonly used
in conjunction with the quadrupole mass analyzer is the high energy dynode (HED), which is an electron multiplier with some slight variations. In an HED detector, the electrons are passed through the system at a high voltage and are measured at the end of the funnel-shaped apparatus, otherwise known as the anode. An HED detector differs from a standard electron multiplier in that it operates at a much higher sensitivity for samples of large mass.
analog signal of the mass-to-charge ratio is recorded, it is then converted to a digital signal and a
spectrum representing the data run can be analyzed.
Advantages and Disadvantages :-
There are some clear advantages to using electrospray ionization mass spectrometry as an analytical
method. One advantage is its ability to handle samples that have large masses. Another advantage is
that this ionization method is one of the softest ionization methods available, therefore it has the ability
to analyze biological samples that are defined by non-covalent interactions. A quadrupole mass analyzer
can also be used for this method, which means that a sample’s structure can be determined fairly easily.
The m/z ratio range of the quadrupole instrument is fairly small, which means that the mass of the sample can be determined with high accuracy. Finally, the sensitivity of this instrument is impressive, making it useful for accurate quantitative and qualitative measurements.
Some disadvantages to electrospray ionization mass spectrometry are present as well. A major
disadvantage is that this technique cannot analyze mixtures very well, and when forced to do so, the
results are unreliable. The apparatus is also very difficult to clean and has a tendency to become overly
contaminated with residues from previous experiments. Finally, the multiple charges that are attached
to the molecular ions can make for confusing spectral data. This confusion is further fueled by use of a
mixed sample, which is yet another reason why mixtures should be avoided when using an electrospray
ionization mass spectrometer.
Applications of Electrospray Ionization Mass Spectrometry :-
1) Electrospray is used to study protein folding.
2) Liquid chromatography–mass spectrometry :
Electrospray ionization is the ion source of choice to couple liquid chromatography with mass
spectrometry (LC-MS). The analysis can be performed online, by feeding the liquid eluting from the LC
column directly to an electrospray, or offline, by collecting fractions to be later analyzed in a classical
nanoelectrospray-mass spectrometry setup. Among the numerous operating parameters in ESI-MS, for
proteins, the electrospray voltage has been identified as an important parameter to consider in ESI
LC/MS gradient elution. The effects of various solvent compositions (such as TFA or ammonium acetate, supercharging reagents, or derivatizing groups) and of spraying conditions on electrospray LC-MS and/or nanoESI-MS spectra have been studied.
3) Capillary electrophoresis-mass spectrometry (CE-MS) :
Capillary electrophoresis-mass spectrometry was enabled by an ESI interface that was developed and
patented by Richard D. Smith and coworkers at Pacific Northwest National Laboratory, and shown to
have broad utility for the analysis of very small biological and chemical compound mixtures, and even
extending to a single biological cell.
4) Noncovalent gas phase interactions :
Electrospray ionization is also utilized in studying noncovalent gas phase interactions. The electrospray
process is thought to be capable of transferring liquid-phase noncovalent complexes into the gas phase
without disrupting the noncovalent interaction. Problems such as non-specific interactions have been
identified when studying ligand substrate complexes by ESI-MS or nanoESI-MS. An interesting example
of this is studying the interactions between enzymes and drugs which are inhibitors of the enzyme.
Competition studies between STAT6 and inhibitors have used ESI as a way to screen for potential new
drug candidates.
3.6.3 LIQUID CHROMATOGRAPHY MASS SPECTROMETRY (LC-MS) :-
Liquid chromatography–mass spectrometry (LC–MS) is an analytical chemistry technique that combines
the physical separation capabilities of liquid chromatography (or HPLC) with the mass analysis
capabilities of mass spectrometry (MS). Coupled chromatography – MS systems are popular in chemical
analysis because the individual capabilities of each technique are enhanced synergistically. While liquid
chromatography separates mixtures with multiple components, mass spectrometry provides structural
identity of the individual components with high molecular specificity and detection sensitivity. This
tandem technique can be used to analyze biochemical, organic, and inorganic compounds commonly
found in complex samples of environmental and biological origin. Therefore, LC-MS may be applied in a
wide range of sectors including biotechnology, environmental monitoring, food processing, and
pharmaceutical, agrochemical, and cosmetic industries.
Principle :-
• LC/MS is a technique that combines physical separation capabilities of liquid chromatography
with mass analysis capability of Mass spectrometry.
• It is a method that combines separation power of HPLC with detection power of Mass
spectrometry.
• In LC-MS, the LC detector is removed and the column outlet is connected to the interface of the MS.
• In most cases, the interface used in LC-MS is the ionization source.
Theory of LC/MS :-
• HPLC is a method for separating a complex mixture into its components.
• The high sensitivity of mass spectrometry provides the information needed for identification or structural elucidation of compounds.
• Combination of these two techniques is LC-MS.
• As the metabolites elute from the end of the column, they enter the mass detector, where the solvent is removed and the metabolites are ionized.
LC-MS System Components :-
Mass spectrometers work by ionizing molecules and then sorting and identifying the ions according to
their mass-to-charge (m/z) ratios.
Fig 1. Components of LC-MS Technique
Problems in combining HPLC and MS :-
HPLC                                          | MS
• Liquid-phase operation                      | • Vacuum operation
• 25–50 °C                                    | • 200–300 °C
• No mass range limitations                   | • Up to 4000 Da for quadrupole MS
• Inorganic buffers                           | • Requires volatile buffers
• 1 mL/min eluent flow (equivalent to         | • Accepts ~10 mL/min gas flow
  ~500 mL/min of gas)                         |
Mobile Phase :-
The mobile phase is the solvent that moves the solute through the column.
General requirements :
1) Low cost, UV transparency, high purity.
2) Low viscosity, low toxicity, non-flammability.
3) Non-corrosive to LC system components.
Column :-
• Di-functional or tri-functional silanes are used to create bonded groups with two or three attachment points, leading to phases with higher stability at low or high pH and lower bleed for LC-MS.
• The most widely used columns for LC-MS are :
a) Fast LC columns : short columns (15–50 mm).
b) Micro LC columns : longer columns (20–150 mm).
Sample preparation :-
• Sample preparation generally consists of concentrating the analyte and removing compounds
that can cause background ion or suppress ionization.
• Examples of sample preparation include :
a) On Column concentration -to increase analyte concentration.
b) Desalting – to reduce the sodium and potassium adduct formation that commonly occurs in electrospray.
c) Filtration- to separate a low molecular-weight drug from proteins in plasma, milk, or
tissue.
Interfaces :-
• LC-MS systems include a device for introducing samples (such as an HPLC), an interface for connecting that device, an ion source that ionizes samples, an electrostatic lens that efficiently introduces the generated ions, a mass analyzer unit that separates ions based on their mass-to-charge (m/z) ratio, and a detector unit that detects the separated ions.
• In an LC-MS system, however, if the LC unit is simply connected directly to the MS unit, the
liquid mobile phase would vaporize, resulting in large amounts of gas being introduced into the
MS unit.
• This would decrease the vacuum level and prevent the target ions from reaching the detector; therefore, suitable interfaces must be used.
• It is difficult to interface liquid chromatography to a mass spectrometer because of the necessity of removing the solvent.
• The commonly used interfaces are :
1) Electrospray ionization (ESI)
2) Thermospray ionization (TSI)
3) Atmospheric pressure chemical ionization (APCI)
4) Atmospheric pressure photoionization (APPI)
1) Electro Spray Ionization (ESI) :
• ESI draws sample solutions to the tip of a capillary tube, where a high voltage of about 3 to 5 kV is applied.
• A nebulizer gas flows from outside the capillary to spray the sample. This creates a fine mist of
charged droplets with the same polarity as the applied voltage.
• As these charged particles move, the solvent continues to evaporate, thereby increasing the
electric field on the droplet surface. When the mutual repulsive force of the charges exceeds the
liquid surface tension, then fission occurs.
• As this evaporation and fission cycle is repeated, the droplets eventually become small enough
that the sample ions are liberated into the gas phase.
• ESI is the softest ionization method available, which means it can be used for highly polar, nonvolatile, or thermally unstable compounds.
2) Atmospheric pressure chemical ionization (APCI) :
• APCI vaporizes solvent and sample molecules by spraying the sample solution into a heater (heated to about 400 °C) using a gas such as N2.
• Solvent molecules are ionized by corona discharge to generate stable reagent ions.
3) Thermospray ionization (TSI) :
TSI interfaces are of two types :
➢ Real-TSP ionization
➢ External ionization using a discharge electrode and a repeller electrode
4) Atmospheric pressure photoionization (APPI) :
• The LC eluent is vaporized using a heater at atmospheric pressure. The resulting gas is made to
pass through a beam of photons generated by a discharge lamp (UV lamp) which ionizes the gas
molecules.
Mass Analyser :-
• They deflect ions down a curved tube in a magnetic field based on their kinetic energy, which is determined by the mass, charge, and velocity of the ions.
• The magnetic field is scanned to measure different ions.
• Types of mass analyzer :
a) Quadrupole mass filter
b) Time of flight
c) Ion trap
d) Fourier transform ion cyclotron resonance (FT-ICR)
1) Quadrupole Mass Analyzer :
• A Quadrupole mass filter consists of four parallel metal rods with different charges
• Two opposite rods have an applied + potential and the other two rods have a – potential
• The applied voltages affect the trajectory of ions traveling down the flight path
• For given DC and AC voltages, only ions of a certain mass-to-charge ratio pass through the
quadrupole filter and all other ions are thrown out of their original path
2) TOF (Time of Flight) Mass Analyzer :
• TOF analyzers separate ions by flight time: after an initial acceleration, the ions drift through a field-free region with no further electric or magnetic field applied.
• In a crude sense, TOF is similar to chromatography, except there is no stationary/mobile phase; instead, the separation is based on the kinetic energy and velocity of the ions.
3) Ion Trap Mass Analyzer :
• It uses an electric field for the separation of the ions by mass to charge ratios.
• The electric field in the cavity due to the electrodes causes the ions of certain m/z values to orbit
in the space.
4) Fourier transform ion cyclotron resonance (FT-ICR) :
• Uses a magnetic field to trap ions in an orbit inside the analyzer cell.
• No physical separation occurs in this analyzer; rather, all the ions of a particular range are trapped inside, and an applied external electric field helps to generate a signal.
Applications of LC-MS :-
1) Pharmaceutical Applications :
• Rapid chromatography of benzodiazepines
• Identification of bile acid metabolite
2) Biochemical Applications :
• Rapid protein identification using capillary LC/MS/MS and database searching.
3) Clinical Applications :
• High-sensitivity detection of trimipramine and thioridazine
4) Food Applications :
• Identification of aflatoxins in food
• Determination of vitamin D3 in poultry feed supplements
5) Environmental Applications :
• Detection of phenylurea herbicides
• Detection of low levels of carbaryl in food
6) Forensic Applications :
• Illegal substances and toxic agents
• Explosives
• Drugs of abuse
3.6.4 PEPTIDE MASS FINGERPRINTING :-
Peptide mass fingerprinting (PMF) (also known as protein fingerprinting) is an analytical technique for
protein identification in which the unknown protein of interest is first cleaved into smaller peptides,
whose absolute masses can be accurately measured with a mass spectrometer such as MALDI-TOF or
ESI-TOF. The method was developed in 1993 by several groups independently. The peptide masses are
compared to either a database containing known protein sequences or even the genome. This is
achieved by using computer programs that translate the known genome of the organism into proteins,
then theoretically cut the proteins into peptides, and calculate the absolute masses of the peptides from
each protein. They then compare the masses of the peptides of the unknown protein to the theoretical
peptide masses of each protein encoded in the genome. The results are statistically analyzed to find the
best match.
Fig 1. A typical workflow of a peptide mass fingerprinting experiment.
The advantage of this method is that only the masses of the peptides have to be known. Time-
consuming de novo peptide sequencing is then unnecessary. A disadvantage is that the protein
sequence has to be present in the database of interest. Additionally most PMF algorithms assume that
the peptides come from a single protein. The presence of a mixture can significantly complicate the
analysis and potentially compromise the results. PMF-based protein identification typically requires an isolated protein; mixtures of more than 2–3 proteins generally require the additional use of MS/MS-based protein identification to achieve sufficient specificity of identification.
Therefore, the typical PMF samples are isolated proteins from two-dimensional gel electrophoresis (2D
gels) or isolated SDS-PAGE bands. Additional analyses by MS/MS can either be direct, e.g., MALDI-
TOF/TOF analysis or downstream nanoLC-ESI-MS/MS analysis of gel spot eluates.
Sample Preparation :-
Protein samples can be derived from SDS-PAGE or reversed phase HPLC and are then subjected to some chemical modifications. Disulfide bridges in the proteins are reduced, and cysteine residues are carbamidomethylated chemically or acrylamidated during the gel electrophoresis.
Then the proteins are cut into several fragments using proteolytic enzymes such as trypsin,
chymotrypsin or Glu-C. A typical sample : protease ratio is 50:1. The proteolysis is typically carried out
overnight and the resulting peptides are extracted with acetonitrile and dried under vacuum. The
peptides are then dissolved in a small amount of distilled water or further concentrated and purified and
are ready for mass spectrometric analysis.
Mass Spectrometric Analysis :-
The digested protein can be analyzed with different types of mass spectrometers such as ESI-TOF or
MALDI-TOF. MALDI-TOF is often the preferred instrument because it allows a high sample throughput
and several proteins can be analyzed in a single experiment, if complemented by MS/MS analysis.
LC/ESI-MS and CE/ESI-MS are also great techniques for peptide mass fingerprinting.
A small fraction of the peptide solution (usually 1 microliter or less) is pipetted onto a MALDI target, and a chemical called a matrix is added to the peptide mix. Common matrices are sinapinic acid, α-cyano-4-hydroxycinnamic acid, and 2,5-dihydroxybenzoic acid. The matrix molecules are required for the
desorption of the peptide molecules. Matrix and peptide molecules co-crystallize on the MALDI target
and are ready to be analyzed. The predominant MALDI-MS sample preparation technique is the dried-droplet method. The target is inserted into the vacuum chamber of the mass
spectrometer and the desorption and ionization of the polypeptide fragments is initiated by a pulsed
laser beam which transfers high amounts of energy into the matrix molecules. The energy transfer is
sufficient to promote the ionization and transition of matrix molecules and peptides from the solid
phase into the gas phase. The ions are accelerated in the electric field of the mass spectrometer and fly
towards an ion detector where their arrival is detected as an electric signal. Their mass-to-charge ratio is proportional to the square of their time of flight (TOF) in the drift tube and can be calculated accordingly.
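In practice the spectrometer is calibrated so that m/z = a·t² + b, with the constants a and b determined from calibrant ions of known mass. A minimal Python sketch (all times and masses invented for illustration):

# Sketch of TOF calibration: m/z is proportional to the square of the
# flight time, so m/z = a * t**2 + b. The constants a and b are found
# from two calibrant peaks of known mass; values here are illustrative.

def calibrate(t1, mz1, t2, mz2):
    """Solve m/z = a*t^2 + b from two known (time, mass) pairs."""
    a = (mz2 - mz1) / (t2**2 - t1**2)
    b = mz1 - a * t1**2
    return a, b

# Hypothetical calibrants: 1000 Da at 16.1 us, 4000 Da at 32.2 us
a, b = calibrate(16.1e-6, 1000.0, 32.2e-6, 4000.0)
unknown_t = 22.8e-6                          # measured flight time, s
print(f"m/z ~ {a * unknown_t**2 + b:.0f}")   # ~2006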
Coupling ESI with capillary LC can separate peptides from protein digests, while obtaining their
molecular masses at the same time. Capillary electrophoresis coupled with ESI-MS is another technique;
however, it works best when analyzing small amounts of proteins.
Computational Analysis :-
The mass spectrometric analysis produces a list of molecular weights of the fragments which is often
called a peak list. The peptide masses are compared to protein databases such as SwissProt, which
contain protein sequence information. Software performs in silico digests on proteins in the database
with the same enzyme (e.g. trypsin) used in the chemical cleavage reaction. The mass of these peptide
fragments is then calculated and compared to the peak list of measured peptide masses. The results are
statistically analyzed and possible matches are returned in a results table.
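To make the matching step concrete, here is a minimal Python sketch of an in-silico digest and peak-count comparison (the mass table, candidate sequence, and tolerance are illustrative; real search engines such as Mascot also handle modifications and use statistical scoring rather than simple counting):

# Minimal sketch of the in-silico digest and matching step of PMF.
import re

# Average residue masses (Da); peptide mass = sum of residues + water (18.02)
RES = {'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13,
       'T': 101.10, 'C': 103.14, 'L': 113.16, 'I': 113.16, 'N': 114.10,
       'D': 115.09, 'Q': 128.13, 'K': 128.17, 'E': 129.12, 'M': 131.19,
       'H': 137.14, 'F': 147.18, 'R': 156.19, 'Y': 163.18, 'W': 186.21}

def tryptic_digest(seq: str) -> list[str]:
    """Cleave after K or R, except when followed by P (the trypsin rule)."""
    return [p for p in re.split(r'(?<=[KR])(?!P)', seq) if p]

def peptide_mass(pep: str) -> float:
    return sum(RES[aa] for aa in pep) + 18.02

def count_matches(measured: list[float], seq: str, tol: float = 0.5) -> int:
    """How many measured masses are explained by the candidate sequence."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(seq)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in measured)

candidate = "MKWVTFISLLFLFSSAYSRGVFRR"          # hypothetical database entry
peaks = [peptide_mass(p) for p in tryptic_digest(candidate)][:3]  # mock peak list
print(tryptic_digest(candidate))    # ['MK', 'WVTFISLLFLFSSAYSR', 'GVFR', 'R']
print(f"matched {count_matches(peaks, candidate)} of {len(peaks)} peaks")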
3.6.5 TANDEM MASS SPECTROMETRY (MS/MS) :-
Tandem mass spectrometry, also known as MS/MS or MS2, is a technique in instrumental analysis where
two or more mass analyzers are coupled together using an additional reaction step to increase their
abilities to analyse chemical samples. A common use of tandem MS is the analysis of biomolecules, such
as proteins and peptides.
The molecules of a given sample are ionized and the first spectrometer (designated MS1) separates
these ions by their mass-to-charge ratio (often given as m/z or m/Q). Ions of a particular m/z-ratio
coming from MS1 are selected and then made to split into smaller fragment ions, e.g. by collision-
induced dissociation, ion-molecule reaction, or photodissociation. These fragments are then introduced
into the second mass spectrometer (MS2), which in turn separates the fragments by their m/z-ratio and
detects them. The fragmentation step makes it possible to identify and separate ions that have very
similar m/z-ratios in regular mass spectrometers.
Structures :-
Tandem mass spectrometers include the triple quadrupole mass spectrometer (QqQ), the quadrupole time-of-flight (Q-TOF), and hybrid mass spectrometers.
1) Triple quadrupole mass spectrometer :
Triple quadrupole mass spectrometers use the first and third quadrupoles as mass filters. As analytes pass through the second quadrupole, fragmentation proceeds through collision with gas. These instruments are widely used in the pharmaceutical industry.
2) Quadrupole time of flight (Q-TOF) :
The Q-TOF mass spectrometer combines TOF and quadrupole instruments, giving high mass accuracy for product ions, accurate quantitation capability, and suitability for fragmentation experiments. In this method of mass spectrometry, the fragment-ion mass-to-charge (m/z) ratio is determined through a time-of-flight measurement.
In the QTOF, precursor ions are selected in the Quadrupole and sent to the Collision Cell for
fragmentation. The generated product ions are detected by time-of-flight (TOF) mass spectrometry.
3) Hybrid mass spectrometer :
A hybrid mass spectrometer consists of more than two mass analyzers.
Instrumentation of Tandem Mass Spectrometry :-
Multiple stages of mass analysis separation can be accomplished with individual mass spectrometer
elements separated in space or using a single mass spectrometer with the MS steps separated in time.
For tandem mass spectrometry in space, the different elements are often noted in a shorthand, giving
the type of mass selector used.
1) Tandem in space :
In tandem mass spectrometry in space, the separation elements are physically separated and distinct,
although there is a physical connection between the elements to maintain high vacuum. These elements
can be sectors, transmission quadrupole, or time-of-flight. When using multiple quadrupoles, they can
act as both mass analyzers and collision chambers.
Common notation for mass analyzers is Q – quadrupole mass analyzer; q – radio frequency collision
quadrupole; TOF – time-of-flight mass analyzer; B – magnetic sector, and E – electric sector. The
notation can be combined to indicate various hybrid instruments, for example QqQ’ – triple quadrupole
mass spectrometer; QTOF – quadrupole time-of-flight mass spectrometer (also QqTOF); and BEBE –
four-sector (reverse geometry) mass spectrometer.
2) Tandem in time :
By doing tandem mass spectrometry in time, the separation is accomplished with ions trapped in the
same place, with multiple separation steps taking place over time. A quadrupole ion trap or Fourier
transform ion cyclotron resonance (FTICR) instrument can be used for such an analysis. Trapping
instruments can perform multiple steps of analysis, which is sometimes referred to as MSn (MS to the
n). Often the number of steps, n, is not indicated, but occasionally the value is specified; for example
MS3 indicates three stages of separation. Tandem-in-time MS instruments do not use the modes described next, but typically collect all of the information from a precursor ion scan and a product ion scan of the entire spectrum. Each instrumental configuration utilizes a unique mode of mass identification.
3) Tandem in space MS/MS modes :
When tandem MS is performed with an in space design, the instrument must operate in one of a variety
of modes. There are a number of different tandem MS/MS experimental setups and each mode has its
own applications and provides different information. Tandem MS in space uses the coupling of two
instrument components which measure the same mass spectrum range but with a controlled
fractionation between them in space, while tandem MS in time involves the use of an ion trap.
There are four main scan experiments possible using MS/MS : precursor ion scan, product ion scan, neutral loss scan, and selected reaction monitoring (a toy sketch of all four follows the list).
1) For a precursor ion scan, the product ion is selected in the second mass analyzer, and the
precursor masses are scanned in the first mass analyzer. Note that precursor ion is synonymous
with parent ion and product ion with daughter ion; however the use of these anthropomorphic
terms is discouraged.
2) In a product ion scan, a precursor ion is selected in the first stage, allowed to fragment and then
all resultant masses are scanned in the second mass analyzer and detected in the detector that
is positioned after the second mass analyzer. This experiment is commonly performed to
identify transitions used for quantification by tandem MS.
3) In a neutral loss scan, the first mass analyzer scans all the masses while the second mass analyzer also scans, but at a set offset from the first. This offset corresponds to a neutral loss commonly observed for the class of compounds of interest, so all precursors that undergo the loss of the specified common neutral are monitored. Both mass analyzers are scanned simultaneously, with a mass offset equal to the mass of the specified neutral. Like the precursor-ion scan, this technique is useful for the selective identification of closely related classes of compounds in a mixture.
4) In selected reaction monitoring, both mass analyzers are set to a single selected mass. This mode is analogous to selected ion monitoring in single-stage MS experiments; it is a selective analysis mode that can increase sensitivity.
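The four modes can be mimicked as filters over a table of hypothetical (precursor m/z, product m/z) pairs. A rough sketch, with invented transitions :

    transitions = [                 # invented (precursor m/z, product m/z) pairs
        (500.3, 184.1), (500.3, 353.2),
        (524.4, 184.1), (760.6, 577.5),
    ]

    def product_ion_scan(data, precursor):          # MS1 fixed, MS2 scans
        return [p for q1, p in data if q1 == precursor]

    def precursor_ion_scan(data, product):          # MS2 fixed, MS1 scans
        return [q1 for q1, p in data if p == product]

    def neutral_loss_scan(data, loss, tol=0.05):    # both scan at a fixed offset
        return [(q1, p) for q1, p in data if abs((q1 - p) - loss) <= tol]

    def selected_reaction_monitoring(data, precursor, product):   # both fixed
        return [(q1, p) for q1, p in data if (q1, p) == (precursor, product)]

    print(product_ion_scan(transitions, 500.3))       # [184.1, 353.2]
    print(precursor_ion_scan(transitions, 184.1))     # [500.3, 524.4]
    print(neutral_loss_scan(transitions, 183.1))      # [(760.6, 577.5)]
    print(selected_reaction_monitoring(transitions, 500.3, 184.1))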
Fig. Instrumentation of MS/MS Technique
Fragmentation Techniques :-
Fragmentation of gas-phase ions is essential to tandem mass spectrometry and occurs between
different stages of mass analysis. There are many methods used to fragment the ions and these can
result in different types of fragmentation and thus different information about the structure and
composition of the molecule.
Precursor ions can be activated (with increased internal energy) in many different ways. Fragmentation
patterns depend on how energy is transferred to the precursor ion, the amount of energy transferred,
and how the transferred energy is internally distributed. Collision-induced dissociation and infrared
multiphoton dissociation are “slow-heating” techniques that increase the Boltzmann temperature of the
ion and thus preferentially cleave the weakest bonds to produce mainly b and y ions. These techniques
are quite efficient for peptides, lipids and other relatively small chemical compounds, but may also
remove protein post-translational modifications (e.g., phosphates and sugars). Electron capture
dissociation and electron transfer dissociation mainly produce c and z ions while preserving post-
translational modifications (PTMs). Thus, ECD and ETD are widely applied to proteins and peptides with
labile PTMs. For oligosaccharides (including glycolipids), ECD/ETD can also generate cross-ring cleaved a and x ions, which are crucial for localizing glycosidic linkages.
Fragment ion notation :-
Peptides and oligosaccharides (including glycolipids) follow different systems of nomenclature for their
fragment ions. Other classes of compounds, e.g. phospholipids, do not yet have established nomenclature systems.
1) Peptides :
Fragments containing the N-terminus are labeled a, b, or c, depending on the site of the cleavage, whereas fragments containing the C-terminus are labeled x, y, or z. The numbers indicate the number of amino acid residues in the fragment ion.
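This numbering is easy to make concrete. The sketch below computes singly charged b and y ion m/z values for a peptide; the monoisotopic residue masses and the proton/water constants are standard reference values, not taken from these notes :

    RESIDUE = {   # monoisotopic residue masses (Da)
        "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
    }
    PROTON, WATER = 1.00728, 18.01056

    def b_y_ions(peptide):
        """Singly charged b and y ions; the subscript counts residues
        from the N-terminus (b) or the C-terminus (y)."""
        ions = {}
        for i in range(1, len(peptide)):           # full-length ion omitted
            ions[f"b{i}"] = sum(RESIDUE[aa] for aa in peptide[:i]) + PROTON
            ions[f"y{i}"] = sum(RESIDUE[aa] for aa in peptide[-i:]) + WATER + PROTON
        return ions

    for name, mz in sorted(b_y_ions("PEPTIDE").items()):
        print(f"{name}: {mz:.3f}")     # e.g. b2 = 227.103, y1 = 148.060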
2) Oligosaccharides (including glycolipids) :
For oligosaccharides, fragments containing the reducing end are labeled x, y, or z, depending on the site of the cleavage, whereas fragments containing the non-reducing end are labeled a, b, or c. The numbers indicate the site of the sugar residue : y, z, b, and c ions arise from glycosidic cleavages (cutting the glycosidic bonds holding two adjacent sugar residues), whereas a and x ions result from cross-ring cleavage.
Applications of Tandem Mass Spectrometry :-
1) Peptide :
Tandem mass spectrometry can be used for protein sequencing. When intact proteins are introduced to
a mass analyzer, this is called “top-down proteomics” and when proteins are digested into smaller
peptides and subsequently introduced into the mass spectrometer, this is called “bottom-up
proteomics”. Shotgun proteomics is a variant of bottom-up proteomics in which proteins in a mixture are digested prior to separation and tandem mass spectrometry.
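The digestion step is often simulated in software before database searching. Below is a minimal in-silico digest, assuming trypsin (which cleaves after K or R but not before P) as the protease; the notes do not name a particular enzyme, so treat this as a sketch :

    import re

    def tryptic_digest(protein, missed_cleavages=0):
        """Cut after K or R (not before P); optionally rejoin adjacent
        pieces to model missed cleavage sites."""
        pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
        peptides = []
        for n in range(missed_cleavages + 1):
            peptides += ["".join(pieces[i:i + n + 1])
                         for i in range(len(pieces) - n)]
        return peptides

    print(tryptic_digest("MKWVTFISLLLLFSSAYSRGVFRR"))
    # ['MK', 'WVTFISLLLLFSSAYSR', 'GVFR', 'R']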
Tandem mass spectrometry can produce a peptide sequence tag that can be used to identify a peptide
in a protein database. A notation has been developed for indicating peptide fragments that arise from a
tandem mass spectrum. Peptide fragment ions are indicated by a, b, or c if the charge is retained on the
N-terminus and by x, y or z if the charge is maintained on the C-terminus. The subscript indicates the
number of amino acid residues in the fragment. Superscripts are sometimes used to indicate neutral losses in addition to the backbone fragmentation : * for loss of ammonia and ° for loss of water. Although peptide backbone cleavage is the most useful for sequencing and peptide identification, other fragment ions may be observed under high-energy dissociation conditions. These include the side-chain loss ions d, v, and w; immonium ions; and additional sequence-specific fragment ions associated with particular amino acid residues.
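Database identification can be caricatured as counting how many observed peaks each candidate peptide's theoretical fragments explain. The peak list, tolerance, and toy "database" below are invented for illustration; the theoretical values correspond to b/y masses of the kind computed in the earlier sketch :

    def count_matches(observed, theoretical, tol=0.02):
        """Observed peaks lying within `tol` of any theoretical fragment."""
        return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)

    candidates = {                     # a toy 'protein database'
        "PEPTIDE":  [98.060, 227.103, 324.156, 148.060, 263.087],
        "PEPTIDES": [98.060, 227.103, 324.156, 235.092, 350.119],
    }
    spectrum = [98.06, 148.06, 263.09]          # hypothetical MS/MS peak list

    best = max(candidates, key=lambda pep: count_matches(spectrum, candidates[pep]))
    print(best)    # 'PEPTIDE' explains 3 peaks; 'PEPTIDES' only 1

Real search engines score far more carefully (charge states, noise, decoy databases), but the core idea of matching observed against theoretical fragments is the same.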
2) Oligosaccharides :
Oligosaccharides may be sequenced using tandem mass spectrometry in a similar manner to peptide
sequencing. Fragmentation generally occurs on either side of the glycosidic bond (b, c, y and z ions) but
also under more energetic conditions through the sugar ring structure in a cross-ring cleavage (x ions).
Again, trailing subscripts are used to indicate the position of the cleavage along the chain. For cross-ring cleavage ions, the nature of the cross-ring cleavage is indicated by preceding superscripts.
3) Oligonucleotides :
Tandem mass spectrometry has been applied to DNA and RNA sequencing. A notation for gas-phase
fragmentation of oligonucleotide ions has been proposed.
4) Newborn screening :
Newborn screening is the process of testing newborn babies for treatable genetic, endocrinologic,
metabolic and hematologic diseases. The development of tandem mass spectrometry screening in the
early 1990s led to a large expansion of potentially detectable congenital metabolic diseases that affect
blood levels of organic acids.
3.7 POST TRANSLATIONAL MODIFICATIONS
Post-translational modification (PTM) refers to the covalent and generally enzymatic modification of
proteins following protein biosynthesis. Proteins are synthesized by ribosomes translating mRNA into
polypeptide chains, which may then undergo PTM to form the mature protein product. PTMs are
important components in cell signaling, as for example when prohormones are converted to hormones.
There are many types of protein modification, mostly catalyzed by enzymes that recognize specific target sequences in proteins. These modifications regulate protein folding, targeting to specific subcellular compartments, interaction with ligands or other proteins, and functional state, including catalytic activity and signaling. The most common PTMs are :
1) Based on the addition of chemical groups :
• Phosphorylation
• Acetylation
• Hydroxylation
• Methylation
2) Based on the addition of complex groups :
• Glycosylation
• AMPylation
• Lipidation
3) Based on the addition of polypeptides :
• Ubiquitination
4) Based on the cleavage of proteins :
• Proteolysis
5) Based on the amino acid modification :
• Deamidation
A) Chemical groups :-
1) Phosphorylation : Reversible phosphorylation of proteins involves addition of a phosphate group to serine, threonine, or tyrosine residues and is one of the most important and extensively studied PTMs in both prokaryotes and eukaryotes.
Several enzymes or signaling proteins are switched ‘on’ or ‘off’ by phosphorylation or
dephosphorylation. Phosphorylation is performed by enzymes called ‘kinases’, while dephosphorylation
is performed by ‘phosphatases’.
Addition of a phosphate group can convert a previously uncharged pocket of a protein into a negatively charged and hydrophilic one, thereby inducing conformational changes in the protein. Phosphorylation has implications in several cellular processes, including the cell cycle, growth, apoptosis, and signal transduction pathways. One example is the activation of p53, a tumor suppressor protein targeted in cancer therapeutics; p53 is activated by phosphorylation of its N-terminal region by several kinases.
2) Acetylation : Acetylation refers to the addition of an acetyl group to a protein. It is involved in several biological functions, including protein stability, localization and synthesis, apoptosis, cancer, and DNA stability. Acetylation and deacetylation of histones form a critical part of gene regulation.
Acetylation of histones reduces their positive charge, weakening their interaction with the negatively charged phosphate groups of DNA; the DNA is then less tightly wound around the histones and more accessible to gene transcription. Acetylation of p53, a tumor suppressor protein, is crucial for its growth-suppressing properties.
3) Hydroxylation : This process adds a hydroxyl group (-OH) to proteins. It is catalyzed by enzymes termed 'hydroxylases' and aids in converting hydrophobic or lipophilic compounds into hydrophilic ones.
4) Methylation : Methylation refers to the addition of a methyl group to a lysine or arginine residue of a protein. Arginine can be methylated once or twice, while lysine can be methylated once, twice, or thrice.
Methylation is achieved by enzymes called methyltransferases. Methylation has been widely studied in
histones wherein histone methylation can lead to gene activation or repression based on the residue
that is methylated.
B) Complex groups :-
1) Glycosylation : Glycosylation involves addition of an oligosaccharide, termed a 'glycan', to either a nitrogen atom (N-linked glycosylation) or an oxygen atom (O-linked glycosylation). N-linked glycosylation occurs on the amide nitrogen of asparagine, while O-linked glycosylation occurs on the oxygen atom of serine or threonine.
Carbohydrates in the form of N-linked or O-linked oligosaccharides are present on the surface of cells and on secreted proteins. They have critical roles in protein sorting, immune recognition, receptor binding, inflammation, and pathogenicity. For example, the N-linked glycans on an immune cell can dictate how it migrates to specific sites, and can likewise determine how a cell recognizes 'self' and 'non-self'.
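Candidate N-linked sites are commonly found by scanning a protein sequence for the classical N-X-S/T sequon (X = any residue except proline). The sequon rule is standard background rather than something stated in these notes, so the following is only a sketch :

    import re

    def n_glyc_sequons(protein):
        """(1-based position, triplet) for each N-X-S/T sequon, X != P."""
        return [(m.start() + 1, protein[m.start():m.start() + 3])
                for m in re.finditer(r"N(?=[^P][ST])", protein)]

    print(n_glyc_sequons("MNASAANPTQNGTK"))   # hypothetical sequence
    # [(2, 'NAS'), (11, 'NGT')] -- the NPT at position 7 is skipped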
2) AMPylation : AMPylation refers to reversible addition of AMP to a protein. It involves formation of a
phosphodiester bond between the hydroxyl group of the protein and the phosphate group of AMP.
3) Lipidation : The covalent binding of a lipid group to a protein is called lipidation. Lipidation can be
further subdivided into prenylation, N-myristoylation, palmitoylation, and glycosylphosphatidylinositol
(GPI)-anchor addition.
Prenylation involves the addition of an isoprenoid moiety to a cysteine residue of a substrate protein. It is critical in controlling the localization and activity of several proteins that have crucial functions in biological regulation.
Myristoylation involves the addition of a myristoyl group to a glycine residue via an amide bond. It has functions in membrane association and apoptosis. In palmitoylation, a palmitoyl group is added to a cysteine residue of a protein.
In GPI-anchor addition, the carboxyl-terminal signal peptide of the protein is cleaved and replaced by a GPI anchor. Recent research in human genetics has revealed that GPI anchors are important for human health : defects in the assembly, attachment, or remodeling of GPI anchors lead to genetic diseases known as inherited GPI deficiencies.
C) Polypeptides :-
1) Ubiquitination : Ubiquitination involves addition of a protein found ubiquitously, termed ‘ubiquitin’,
to the lysine residue of a substrate. Either a single ubiquitin molecule (monoubiquitination) or a chain of
several ubiquitin molecules may be attached (polyubiquitination).
Polyubiquitinated proteins are recognized by the 26S proteasome and are subsequently targeted for proteolytic degradation. Monoubiquitinated proteins may instead influence cell trafficking and endocytosis.
D) Protein cleavage :-
1) Proteolysis : Proteolysis refers to the breakdown of proteins into smaller polypeptides or amino acids. For example, removal of the N-terminal methionine or of a signal peptide after translation can convert an inactive or non-functional protein into an active one.
E) Amino acid modification :-
1) Deamidation : Deamidation is the conversion of an asparagine or glutamine residue to another functional group. Asparagine is converted to aspartic acid or isoaspartic acid, while glutamine is converted to glutamic acid or pyroglutamic acid. This modification can change the protein's structure, stability, and function.
3.7.1 DNA METHYLATION :-
DNA methylation is a biochemical process that is important for normal development in higher organisms. It involves the addition of a methyl group to the 5 position of the cytosine pyrimidine ring or to the number 6 nitrogen of the adenine purine ring (cytosine and adenine are two of the four bases of DNA). This modification can be inherited through cell division.
DNA methylation is a crucial part of normal organismal development and cellular differentiation in
higher organisms. DNA methylation stably alters the gene expression pattern in cells such that cells can
"remember where they have been" or decrease gene expression; for example, cells programmed to be
pancreatic islets during embryonic development remain pancreatic islets throughout the life of the
organism without continuing signals telling them that they need to remain islets. DNA methylation is
typically removed during zygote formation and re-established through successive cell divisions during
development. However, recent research shows that in the zygote the methyl group is hydroxylated rather than completely removed. Some methylation modifications that regulate gene expression are heritable and are referred to as epigenetic regulation.
In addition, DNA methylation suppresses the expression of viral genes and other deleterious elements
that have been incorporated into the genome of the host over time. DNA methylation also forms the
basis of chromatin structure, which enables cells to form the myriad characteristics necessary for multicellular life from a single immutable sequence of DNA. Methylation also plays a crucial role in the development of nearly all types of cancer. DNA methylation at the 5 position of cytosine has the specific effect of reducing gene expression and has been found in every vertebrate examined.
Mechanism of DNA methylation :-
In mammalian DNA, 5-methylcytosine is found in approximately 4% of genomic DNA, primarily at
cytosine-guanosine dinucleotides (CpGs). Such CpG sites occur at lower than expected frequencies
throughout the human genome but are found more frequently at small stretches of DNA called CpG
islands. These islands are typically found in or near promoter regions of genes, where transcription is
initiated. In contrast to the bulk of genomic DNA, in which most CpG sites are heavily methylated, CpG
islands in germ-line tissue and promoters of normal somatic cells remain un-methylated, allowing gene
expression to occur.
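A rough computational version of the CpG-island idea uses the classic thresholds (GC content > 50 % and an observed/expected CpG ratio > 0.6; the full definition also requires a minimum length, omitted here). These thresholds are standard background values, not taken from these notes :

    def cpg_stats(seq):
        """GC content and observed/expected CpG ratio of a DNA string."""
        seq = seq.upper()
        c, g = seq.count("C"), seq.count("G")
        cpg = seq.count("CG")                       # observed CpG dinucleotides
        gc_content = (c + g) / len(seq)
        expected = (c * g) / len(seq)               # CpGs expected by chance
        return gc_content, (cpg / expected if expected else 0.0)

    def looks_like_cpg_island(seq):
        gc, ratio = cpg_stats(seq)
        return gc > 0.5 and ratio > 0.6

    fragment = "CGCGGCGCTTCGCGAT"                   # hypothetical promoter DNA
    print(cpg_stats(fragment))                      # (0.75, ~2.2)
    print(looks_like_cpg_island(fragment))          # True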
Fig 1. Reaction of DNA Methylation
DNA methylation helps to maintain transcriptional silence in non-expressed or non-coding regions of the genome. For example, pericentromeric heterochromatin, which is condensed and transcriptionally inactive, is heavily methylated. Hypermethylation thus ensures that this DNA is late-replicating and transcriptionally quiescent, and suppresses the expression of any potentially harmful viral sequences or transposons that may have integrated into such highly repetitive regions. By contrast, CpG sites are generally un-methylated in promoter regions of euchromatin, regardless of the
transcriptional state of the gene. Exceptions to this rule, however, can be found in mammalian cells
where these regions are methylated to maintain transcriptional inactivation. Thus, CpG islands in
promoters of genes located on the inactivated X chromosome of females are methylated, as are certain
imprinted genes in which only the maternal or paternal allele is expressed.
Fig 2. Mechanism of DNA Methylation
Epigenetic effects such as hypermethylation can also induce heritable alterations in gene expression; methylation of the DNA repair genes MLH1 and MGMT, for example, can lead to their inactivation. Methylation may affect the transcription of genes in two ways. First, the methylation of DNA itself may physically impede the binding of transcriptional proteins to the gene. Second, and likely more important, methylated DNA may be bound by proteins known as methyl-CpG-binding domain proteins (MBDs). MBD proteins then recruit additional proteins to the locus, such as histone deacetylases and other chromatin remodeling proteins that can modify histones, thereby forming compact, inactive chromatin, termed heterochromatin. This link between DNA methylation and chromatin structure is very important. In particular, loss of methyl-CpG-binding protein 2 (MeCP2) has been implicated in Rett syndrome, and methyl-CpG-binding domain protein 2 (MBD2) mediates the transcriptional silencing of hypermethylated genes in cancer. Research has suggested that long-term memory storage in humans may be regulated by DNA methylation.
Regulation of DNA methylation :-
DNA methylation is controlled at several different levels in normal and tumor cells. The addition of
methyl groups is carried out by a family of enzymes, DNA methyltransferases (DNMTs). Chromatin
structure in the vicinity of gene promoters also affects DNA methylation and transcriptional activity.
These are in turn regulated by various factors, such as nucleosome spacing and histone acetylases, which affect access by transcription factors.
DNA methyltransferases (DNMT) :-
In mammalian cells, DNA methylation occurs mainly at the C5 position of CpG dinucleotides and is carried out by two general classes of enzymatic activity : maintenance methylation and de novo methylation.
Maintenance methylation activity is necessary to preserve DNA methylation after every cellular DNA
replication cycle. Without the DNA methyltransferase (DNMT), the replication machinery itself would
produce daughter strands that are un-methylated and, over time, would lead to passive demethylation.
DNMT1 is the proposed maintenance methyltransferase that is responsible for copying DNA methylation
patterns to the daughter strands during DNA replication. Mouse models with both copies of DNMT1
deleted are embryonic lethal at approximately day 9, due to the requirement of DNMT1 activity for
development in mammalian cells.
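Passive demethylation is easy to simulate. In the toy model below, each methylated CpG is copied onto the daughter duplex with some probability per division (1.0 approximating faithful DNMT1 maintenance, lower values approximating impaired maintenance); all positions and probabilities are invented :

    import random

    def replicate(marks, copy_prob):
        """One cell division : each methylated CpG survives on the daughter
        duplex with probability copy_prob."""
        return {pos for pos in marks if random.random() < copy_prob}

    random.seed(0)
    marks = set(range(0, 100, 5))        # 20 hypothetical methylated CpG sites
    for division in range(1, 5):
        marks = replicate(marks, copy_prob=0.7)    # imperfect maintenance
        print(f"after division {division}: {len(marks)} sites still methylated")

With copy_prob below 1.0 the marks steadily erode over successive divisions, which is exactly the passive demethylation described above.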
It is thought that DNMT3a and DNMT3b are the de novo methyltransferases that set up DNA
methylation patterns early in development. DNMT3L is a protein that is homologous to the other
DNMT3s but has no catalytic activity. Instead, DNMT3L assists the de novo methyltransferases by
increasing their ability to bind to DNA and stimulating their activity. Finally, DNMT2 (TRDMT1) has been
identified as a DNA methyl-transferase homolog, containing all 10 sequence motifs common to all DNA
methyltransferases; however, DNMT2 (TRDMT1) does not methylate DNA but instead methylates
cytosine-38 in the anticodon loop of aspartic acid transfer RNA.
3.7.2 HISTONE MODIFICATIONS :-
Chromatin architecture, nucleosomal positioning, and ultimately access to DNA for gene transcription are largely controlled by histone proteins. Each nucleosome is made of two identical subunits, each of which
contains four histones : H2A, H2B, H3, and H4. Meanwhile, the H1 protein acts as the linker histone to
stabilize internucleosomal DNA and does not form part of the nucleosome itself.
Histone proteins undergo post-translational modification (PTM) in different ways, which impacts their
interactions with DNA. Some modifications disrupt histone-DNA interactions, causing nucleosomes to
unwind. In this open chromatin conformation, called euchromatin, DNA is accessible to binding of
transcriptional machinery and subsequent gene activation. In contrast, modifications that strengthen
histone-DNA interactions create a tightly packed chromatin structure called heterochromatin. In this
compact form, transcriptional machinery cannot access DNA, resulting in gene silencing. In this way,
modification of histones by chromatin remodeling complexes changes chromatin architecture and gene
activation.
At least nine different types of histone modifications have been discovered. Acetylation, methylation, phosphorylation, and ubiquitylation are the best understood, while GlcNAcylation, citrullination, crotonylation, and isomerization are more recent discoveries that have yet to be thoroughly investigated. Each of these modifications is added to or removed from histone amino acid residues by a specific set of enzymes.
Fig 3. The most common histone modifications
Together, these histone modifications make up what is known as the histone code, which dictates the
transcriptional state of the local genomic region. Examining histone modifications at a particular region,
or across the genome, can reveal gene activation states, locations of promoters, enhancers, and other
gene regulatory elements.
Types of Histone Modifications :-
1) Acetylation :-
Acetylation is one of the most widely studied histone modifications, since it was one of the first discovered to influence transcriptional regulation. Acetylation neutralizes the positive charge on lysine residues of the N-terminal histone tails that extend out from the nucleosome, weakening the electrostatic attraction between the histones and the negatively charged DNA and resulting in a relaxed chromatin structure. The open chromatin conformation allows transcription factor binding and significantly increases gene expression (Roth et al., 2001).
Histone acetylation is involved in cell cycle regulation, cell proliferation, and apoptosis and may play a
vital role in regulating many other cellular processes, including cellular differentiation, DNA replication
and repair, nuclear import and neuronal repression. An imbalance in the equilibrium of histone
acetylation is associated with tumorigenesis and cancer progression.
Enzymatic regulation :
Acetyl groups are added to lysine residues of histones H3 and H4 by histone acetyltransferases (HATs) and removed by histone deacetylases (HDACs). Histone acetylation is largely targeted to promoter regions, known as promoter-localized acetylation. For example, acetylation of K9 and K27 on histone H3 (H3K9ac and H3K27ac) is usually associated with enhancers and promoters of active genes. Low levels of global acetylation are also found throughout transcribed genes; the function of this acetylation remains unclear.
2) Methylation :-
Methylation is added to the lysine or arginine residues of histones H3 and H4, with different impacts on
transcription. Arginine methylation promotes transcriptional activation (Greer et al., 2012), while lysine methylation is implicated in both transcriptional activation and repression, depending on the methylation site. This flexibility may be explained by the fact that methylation, unlike acetylation, does not alter histone charge or directly impact histone-DNA interactions.
Lysines can be mono-, di-, or tri-methylated, providing further functional diversity to each site of
methylation. For example, both mono- and tri-methylation on K4 of histone H3 (H3K4me1 and H3K4me3) are activation markers, but with unique nuances : H3K4me1 typically marks transcriptional
enhancers, while H3K4me3 marks gene promoters. Meanwhile, tri-methylation of K36 (H3K36me3) is an
activation marker associated with transcribed regions in gene bodies.
In contrast, tri-methylation on K9 and K27 of histone H3 (H3K9me3 and H3K27me3) are repressive
signals with unique functions : H3K27me3 is a temporary signal at promoter regions that controls
development regulators in embryonic stem cells, including Hox and Sox genes. Meanwhile, H3K9me3 is
a permanent signal for heterochromatin formation in gene-poor chromosomal regions with tandem
repeat structures, such as satellite repeats, telomeres, and pericentromeres. It also marks
retrotransposons and specific families of zinc finger genes (KRAB-ZFPs). Both marks are found on the
inactive chromosome X, with H3K27me3 at intergenic and silenced coding regions and H3K9me3
predominantly in coding regions of active genes.
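The shorthand used above ('H3K4me3', 'H3K27me3', and so on) is regular enough to parse mechanically. In the sketch below, the interpretation table simply restates the examples from the preceding paragraphs; it is illustrative, not exhaustive :

    import re

    MARKS = {
        "H3K4me1":  "activating -- typically marks enhancers",
        "H3K4me3":  "activating -- marks gene promoters",
        "H3K36me3": "activating -- transcribed gene bodies",
        "H3K9me3":  "repressive -- permanent heterochromatin signal",
        "H3K27me3": "repressive -- temporary signal at developmental promoters",
    }

    def parse_mark(mark):
        """Split e.g. 'H3K27me3' into (histone, residue+position, methyl state)."""
        m = re.fullmatch(r"(H\d[AB]?)([KR]\d+)(me[123])", mark)
        if not m:
            raise ValueError(f"unrecognized mark : {mark}")
        return m.groups()

    for mark, meaning in MARKS.items():
        histone, site, state = parse_mark(mark)
        print(f"{histone} {site} {state}: {meaning}")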
Enzymatic regulation :
Histone methylation is a stable mark propagated through multiple cell divisions, and for many years was
thought to be irreversible. However, it was recently discovered to be an actively regulated and
reversible process.
a) Methylation : histone methyltransferases (HMTs) :
❖ Lysine
➢ SET domain-containing (histone tails)
➢ Non-SET domain-containing (histone cores)
❖ Arginine
➢ PRMT (protein arginine methyltransferases) family
b) Demethylation : histone demethylases :
❖ Lysine
➢ KDM1/LSD1 (lysine-specific demethylase 1)
➢ JmjC (Jumonji domain-containing)
❖ Arginine
➢ PAD4/PADI4
3) Phosphorylation :-
Histone phosphorylation is a critical intermediate step in chromosome condensation during cell division,
transcriptional regulation, and DNA damage repair (Rossetto et al., 2012, Kschonsak et al., 2015). Unlike
acetylation and methylation, histone phosphorylation establishes interactions between other histone
modifications and serves as a platform for effector proteins, which leads to a downstream cascade of
events.
Phosphorylation occurs on all core histones, with differential effects on each. Phosphorylation of histone
H3 at serine 10 and 28, and histone H2A on T120, are involved in chromatin compaction and the
regulation of chromatin structure and function during mitosis. These are important markers of cell cycle
and cell growth that are conserved throughout eukaryotes. Phosphorylation of H2AX at S139 (resulting
in γH2AX) serves as a recruiting point for DNA damage repair proteins (Lowndes et al., 2005, Pinto et al.,
2010) and is one of the earliest events to occur after DNA double-strand breaks. H2B phosphorylation is
not as well studied but has been found to facilitate apoptosis-related chromatin condensation, DNA fragmentation, and cell death (Füllgrabe et al., 2010).
4) Ubiquitylation :-
All histone core proteins can be ubiquitylated, but H2A and H2B are the most commonly modified and are two of the most highly ubiquitylated proteins in the nucleus (Cao et al., 2012). Histone ubiquitylation plays a central role in the DNA damage response.
central role in the DNA damage response.
Monoubiquitylation of histones H2A, H2B, and H2AX is found at sites of DNA double-strand breaks. The
most common forms are monoubiquitylated H2A on K119 and H2B on K123 (yeast)/K120 (vertebrates).
Monoubiquitylated H2A is also associated with gene silencing, whereas monoubiquitylated H2B is also associated with transcriptional activation.
Poly-ubiquitylation is less common but is also important in DNA repair—polyubiquitylation of H2A and
H2AX on K63 provides a recognition site for DNA repair proteins, like RAP80.
Enzymatic regulation :
Like other histone modifications, monoubiquitylation of H2A and H2B is reversible and is tightly
regulated by histone ubiquitin ligases and deubiquitylating enzymes.
a) Monoubiquitylation
➢ H2A : polycomb group proteins
➢ H2B : Bre1 (yeast) and its homologs RNF20/RNF40 (mammals)
b) Polyubiquitylation
➢ H2A/H2AX K63 : RNF8/RNF168
Comparison between Histone Modifications :-
3.7.3 EPIGENETIC CODE :-
• The epigenetic code is a defining code in every eukaryotic cell consisting of the specific
epigenetic modification in each cell.
• It consists of histone modifications defined by the histone code and additional epigenetic
modifications such as DNA methylation.
• The basis for the epigenetic code is a system above the genetic code of a single cell.
• While in one individual the genetic code in each cell is the same, the epigenetic code is tissue
and cell specific.
Histone Modifications + DNA Methylation = Epigenetic Code