THE GENETIC CODE
The genetic code is a triplet code
During translation, the sequence of an mRNA molecule is read from its 5’ end by ribosomes which then synthesize an
appropriate polypeptide. Both in prokaryotes and eukaryotes, the DNA sequence of a single gene is collinear with
the amino acid sequence of the polypeptide it encodes. In other words, the nucleotide sequence of the coding DNA
strand, 5’ to 3’, specifies in exactly the same order the amino acid sequence of the encoded polypeptide, N-terminal to
C-terminal. The relationship between the nucleotide sequence of the mRNA and the amino acid sequence of the
polypeptide is called the genetic code. The sequence of the mRNA is read in groups of three nucleotides called
codons, with each codon specifying a particular amino acid (Fig. 1). However, three codons, UAG, UGA and UAA, do not
encode an amino acid. Whenever one of these codons is encountered by a ribosome, it leads to termination of
protein synthesis. Therefore these three codons are called termination codons or stop codons. The codon AUG
codes for methionine. Although methionine is also found at internal positions in polypeptide chains, all eukaryotic polypeptides
start with methionine and all prokaryotic polypeptides start with a modified methionine (N-formyl methionine).
Therefore the first AUG codon that is read by the ribosome in an mRNA is called the initiation codon or start
codon.
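The reading rules described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of the ribosome: it assumes the mRNA is given as a 5’ to 3’ string of A, C, G and U, uses one-letter amino acid symbols, and simply treats the first AUG in the string as the start codon. The names CODON_TABLE and translate are hypothetical helpers introduced here.

```python
from itertools import product

# Standard genetic code, with codons ordered U, C, A, G at each position.
# One-letter amino acid symbols; "*" marks the stop codons UAA, UAG and UGA.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")  # simplification: first AUG is the start codon
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # termination codon: stop synthesis
            break
        peptide.append(amino_acid)
    return "".join(peptide)


# AUG UUU GUG UGG UAA -> Met-Phe-Val-Trp, then stop
print(translate("GGAUGUUUGUGUGGUAAGC"))  # MFVW
```

The output is written N-terminal to C-terminal, in the same order as the codons appear 5’ to 3’ in the mRNA.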
The genetic code is degenerate
Since RNA is composed of four types of nucleotides, there are 4³ = 64 possible codons, that is, 64 possible triplets of
nucleotides with different sequences. However, only 20 amino acids are commonly found in proteins so that, in most
cases, a single amino acid is coded for by several different codons (see Fig. 1). The genetic code is therefore said to
be degenerate. In fact, only methionine and tryptophan are represented by a single codon. As a result of the genetic code’s
degeneracy, a mutation that changes only a single nucleotide in DNA (point mutation), and hence changes only a
single nucleotide in the corresponding mRNA, often has no effect on the amino acid sequence of the encoded
polypeptide.
Fig. 1. The genetic code.
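The degeneracy of the code can be checked by grouping the 64 codons by the amino acid they specify. The sketch below assumes the standard code and one-letter amino acid symbols; synonyms and is_silent are hypothetical names used only for this illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code ("*" = stop), codons ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Group codons by the amino acid (or stop signal) they specify.
synonyms = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    synonyms[amino_acid].append(codon)

single_codon = sorted(aa for aa, codons in synonyms.items() if len(codons) == 1)
print(len(CODON_TABLE))  # 64 codons in total
print(single_codon)      # ['M', 'W']: only Met and Trp have a single codon
print(synonyms["V"])     # ['GUU', 'GUC', 'GUA', 'GUG']


def is_silent(codon, position, new_base):
    """True if changing one base (position 0, 1 or 2) leaves the amino acid unchanged."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    return CODON_TABLE[mutant] == CODON_TABLE[codon]


print(is_silent("GUU", 2, "C"))  # True: GUU and GUC both code for valine
print(is_silent("GUU", 1, "C"))  # False: GCU codes for alanine
```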
Codons that specify the same amino acid are called synonyms. Most synonyms differ only in the third base of the
codon; for example GUU, GUC, GUA and GUG all code for valine. During protein synthesis, each codon is
recognized by a triplet of bases, called an anticodon, in a specific tRNA molecule. Each base in the codon base
pairs with its complementary base in the anticodon. However, the pairing of the third base of a codon is less stringent
than for the first two bases (i.e. there is some ‘wobble base-pairing’) so that in some cases a single tRNA may base-
pair with more than one codon. For example, phenylalanine tRNA, which has the anticodon GAA, recognizes both of
the codons UUU and UUC. The third position of the codon is therefore also called the wobble position.
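The phenylalanine example can be reproduced with a small sketch. It assumes that both anticodon and codon are written 5’ to 3’, so that they pair antiparallel: the first (5’) base of the anticodon pairs with the third (wobble) base of the codon. The WOBBLE table is an assumption that goes beyond the text above: it encodes the simplified pairing rules usually attributed to Crick's wobble hypothesis for unmodified bases only, and recognized_codons is a hypothetical helper.

```python
# Strict Watson-Crick pairing, used for the first two codon positions.
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Assumed simplified wobble rules: 5' anticodon base -> codon third bases it can pair with.
WOBBLE = {"G": "CU", "U": "AG", "C": "G", "A": "U"}


def recognized_codons(anticodon):
    """Return the codons (5' to 3') that an anticodon (5' to 3') could pair with."""
    first, second, third = anticodon
    # Antiparallel pairing: anticodon base 3 pairs with codon base 1, and so on.
    stem = WATSON_CRICK[third] + WATSON_CRICK[second]
    return sorted(stem + wobble_base for wobble_base in WOBBLE[first])


print(recognized_codons("GAA"))  # ['UUC', 'UUU']
```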
Universality of the genetic code
For many years it was thought that the genetic code is ‘universal’, namely that all living organisms used the same code.
Now we know that the genetic code is almost the same in all organisms but there are a few differences.
Mitochondria contain DNA, as double-stranded DNA circles, and the mitochondrial genome codes for about 10–20 proteins.
Surprisingly, in mitochondrial mRNAs, some codons have different meanings from their counterparts in mRNA in the
cytosol. A few examples are given below (N denotes any of the four nucleotides A, G, C or U):
Mitochondria                 AUA = Met not Ile
Mitochondria                 UGA = Trp not Stop
Some animal mitochondria     AGA and AGG = Stop not Arg
Plant mitochondria           CGG = Trp not Arg
Yeast mitochondria           CUN = Thr not Leu
Some unicellular organisms are also now known to use a variant genetic code.
For example:
Some ciliated protozoa       UAA and UAG = Gln not Stop
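One way to picture a variant code is as the standard table with a few entries overridden. The sketch below applies only the two general mitochondrial differences listed above (AUA = Met, UGA = Trp). The names MITO_OVERRIDES and translate are hypothetical, and for simplicity the message is read from its first nucleotide.

```python
from itertools import product

# Standard genetic code ("*" = stop), codons ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Differences taken from the list above; other variants would be added the same way.
MITO_OVERRIDES = {"AUA": "M", "UGA": "W"}
MITOCHONDRIAL = {**STANDARD, **MITO_OVERRIDES}


def translate(mrna, table):
    """Read codons from the first base until a stop codon or the end of the message."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = table[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


message = "AUGAUAUGAUAA"  # AUG AUA UGA UAA
print(translate(message, STANDARD))       # MI  (UGA read as Stop)
print(translate(message, MITOCHONDRIAL))  # MMW (AUA = Met, UGA = Trp)
```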
Reading frames
Since the sequence of an mRNA molecule is read in groups of three nucleotides (codons) from the 5’ end, it can be
read in three possible reading frames, depending on which nucleotide is used as the first base of the first codon
(Fig. 2). Usually, only one reading frame (reading frame 3 in Fig. 2) will produce a functional protein since the other two
reading frames will include several termination (Stop) codons. The correct reading frame is set in vivo by recognition by
the ribosome of the initiation codon, AUG, at the start of the coding sequence. Usually one sequence of bases
encodes only a single protein. However, in some bacteriophage DNAs, several genes overlap, with each gene being in a
different reading frame. This organization of overlapping genes generally occurs when the genome size is smaller
than can accommodate the genes necessary for phage structure and assembly using only one reading frame.
Fig. 2. Three potential reading frames for any given mRNA sequence depending on which nucleotide is ‘read’ first.
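The three reading frames of a sequence can be listed by starting the codon grouping at the first, second or third nucleotide. The following sketch uses an arbitrary example sequence (not the one in Fig. 2) and a hypothetical helper, reading_frames.

```python
def reading_frames(mrna):
    """Split an mRNA (5' to 3') into codons for each of the three reading frames."""
    return [
        [mrna[i:i + 3] for i in range(offset, len(mrna) - 2, 3)]
        for offset in range(3)
    ]


for number, codons in enumerate(reading_frames("CAUGGCAUAG"), start=1):
    print(number, codons)
# 1 ['CAU', 'GGC', 'AUA']
# 2 ['AUG', 'GCA', 'UAG']
# 3 ['UGG', 'CAU']
```

In this example only the second frame begins with AUG and ends with a termination codon; incomplete codons at the 3’ end are dropped.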
Open Reading Frames
In many cases these days, the protein encoded by a particular gene is deduced by cloning and then sequencing the
corresponding DNA. The DNA sequence is then scanned using a computer program to identify runs of
codons that start with ATG and end with TGA, TAA or TAG. These runs of codons are called open reading
frames (ORFs) and identify potential coding regions. Because genes carry out important cellular functions, the
sequence of coding DNA (and of important regulatory sequences) is more strongly conserved in evolution than that
of noncoding DNA. In particular, mutations that lead to the creation of termination codons within the coding region,
and hence premature termination during translation, are selected against. This means that the coding regions of genes
often contain comparatively long ORFs whereas in noncoding DNA, triplets corresponding to termination codons are
not selected against and ORFs are comparatively short. Thus, when analyzing the ORFs displayed for a particular cloned
DNA, it is usually true that a long ORF is likely to be coding DNA whereas short ORFs may not be. Nevertheless, one must
be aware that some exons can be short and so some short ORFs may also be coding DNA. Computer analysis may be
able to detect these by screening for the conserved sequences at exon/intron boundaries and the splice branchpoint
sequence. Finally, by referring to the genetic code, computer analysis can predict the protein sequence encoded by each
ORF. This is the deduced protein sequence.
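A computer scan of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any particular program: it searches only the three forward frames of the given strand, ignores introns, and reports each run from an ATG to the next in-frame stop codon. The function find_orfs and its min_codons parameter are hypothetical.

```python
from itertools import product

# Standard genetic code written for DNA (T in place of U); "*" = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOP_CODONS = {"TGA", "TAA", "TAG"}


def find_orfs(dna, min_codons=1):
    """Return (start, end, deduced protein) for each ATG...stop run in the forward frames."""
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    protein = "".join(CODON_TABLE[dna[j:j + 3]] for j in range(start, i, 3))
                    orfs.append((start, i + 3, protein))
                start = None
    return orfs


dna = "CCATGGCTTGGTAAGATGTGA"
print(find_orfs(dna))                # [(15, 21, 'M'), (2, 14, 'MAW')]
print(find_orfs(dna, min_codons=2))  # [(2, 14, 'MAW')]
```

Raising min_codons mirrors the reasoning above: longer ORFs are more likely to be coding DNA, although a short ORF cannot be ruled out on length alone.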