Huffman Coding
The idea behind Huffman coding is to find a way to compress
the storage of data using variable-length codes. Our standard
model of storing data uses fixed-length codes: for example,
each character in a text file is stored using 8 bits. There are
certain advantages to this system. When reading a file, we
know to ALWAYS read 8 bits at a time to read a single
character. But as you might imagine, this coding scheme is
inefficient, because some characters are used more frequently
than others. Suppose the character 'e' is used 10 times more
frequently than the character 'q'. It would then be advantageous
to use, say, a 7-bit code for 'e' and a 9-bit code for 'q',
because that could shorten our overall message length.
Huffman coding finds the optimal way to take advantage of
varying character frequencies in a particular file. On typical
files, Huffman coding can shrink storage anywhere from 10% to
30%, depending on the character distribution. (The more skewed
the distribution, the better Huffman coding will do.)
The idea behind the coding is to give less frequent characters
and groups of characters longer codes. Also, the code is
constructed in such a way that no codeword is a prefix of any
other. This prefix-free property is crucial for deciphering the
code easily: a decoder can emit a character the moment its
codeword is complete, with no ambiguity about where one
codeword ends and the next begins.
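To see why the prefix-free property makes decoding so easy, here
is a minimal Python sketch; the three-character code table is
hypothetical, chosen only to illustrate the idea.

# A hypothetical prefix-free code: no codeword is a prefix of another.
CODES = {'a': '0', 'b': '10', 'c': '11'}
DECODE = {code: ch for ch, code in CODES.items()}

def decode(bits):
    # Scan left to right, accumulating bits until the buffer matches a
    # codeword. Prefix-freeness guarantees the first match is the only
    # possible one, so the character can be emitted immediately.
    out, buffer = [], ''
    for bit in bits:
        buffer += bit
        if buffer in DECODE:
            out.append(DECODE[buffer])
            buffer = ''
    return ''.join(out)

print(decode('01011'))  # -> 'abc'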
Building a Huffman Tree
The easiest way to see how this algorithm works is to work
through an example. Let's assume that after scanning a file we
find the following character frequencies:
Character Frequency
'a' 12
'b' 2
'c' 7
'd' 13
'e' 14
'f' 85
Now, create a binary tree for each character that also stores
the frequency with which it occurs.
The algorithm is as follows: find the two binary trees in the list
that store the minimum frequencies at their roots. Join these
two trees under a newly created common node that will store NO
character but will store the sum of the frequencies of all the
nodes connected below it. So our picture looks as follows:
      9      12 'a'    13 'd'    14 'e'    85 'f'
     / \
 2 'b'  7 'c'
Now, repeat this process until only one tree is left:
        21
       /  \
      9    12 'a'     13 'd'    14 'e'    85 'f'
     / \
 2 'b'  7 'c'

        21                27
       /  \              /  \
      9    12 'a'   13 'd'  14 'e'       85 'f'
     / \
 2 'b'  7 'c'

             48
           /    \
        21        27
       /  \      /  \
      9  12 'a' 13 'd' 14 'e'            85 'f'
     / \
 2 'b'  7 'c'

                133
               /   \
             48     85 'f'
           /    \
        21        27
       /  \      /  \
      9  12 'a' 13 'd' 14 'e'
     / \
 2 'b'  7 'c'
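Here is a minimal Python sketch of this construction, assuming
the example frequencies above. It keeps the trees in a min-heap,
repeatedly pulls out the two with the lowest frequencies, and
represents an internal node as a (left, right) pair.

import heapq

freqs = {'a': 12, 'b': 2, 'c': 7, 'd': 13, 'e': 14, 'f': 85}

# Each heap entry is (frequency, tiebreaker, tree); the tiebreaker is a
# counter so heapq never has to compare two trees directly.
heap = [(f, i, ch) for i, (ch, f) in enumerate(freqs.items())]
heapq.heapify(heap)

counter = len(heap)
while len(heap) > 1:
    # Pull the two trees with the minimum frequencies...
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    # ...and join them under a new node that stores no character, only
    # the sum of the frequencies below it.
    heapq.heappush(heap, (f1 + f2, counter, (left, right)))
    counter += 1

total, _, tree = heap[0]
print(total)  # 133, the total number of characters in the file
print(tree)   # nested (left, right) pairs; leaves are characters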
Once the tree is built, each leaf node corresponds to a letter
with a code. To determine the code for a particular character,
walk the path from the root down to the leaf in question. For
each step to the left, append a 0 to the code, and for each step
to the right, append a 1. Thus for the tree above we get
the following codes:
Letter Code
'a' 001
'b' 0000
'c' 0001
'd' 010
'e' 011
'f' 1
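Here is a minimal sketch of reading the codes off the finished
tree, using the same (left, right)-pair representation as the
construction sketch above. The hard-coded tree mirrors the
example; in general the exact bit patterns depend on how
frequency ties are broken, but the code lengths come out the same.

# The finished tree from the example: leaves are characters, internal
# nodes are (left, right) pairs.
tree = (((('b', 'c'), 'a'), ('d', 'e')), 'f')

def assign_codes(node, prefix='', table=None):
    # Append '0' on each left step and '1' on each right step, and
    # record the accumulated bits on reaching a leaf.
    if table is None:
        table = {}
    if isinstance(node, str):    # leaf: a character
        table[node] = prefix
    else:                        # internal node: recurse into both sides
        assign_codes(node[0], prefix + '0', table)
        assign_codes(node[1], prefix + '1', table)
    return table

print(assign_codes(tree))
# {'b': '0000', 'c': '0001', 'a': '001', 'd': '010', 'e': '011', 'f': '1'}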
Why are we guaranteed that one code is NOT the prefix of
another?
Find a set of valid Huffman codes for a file with the given
character frequencies:
Character Frequency
'a' 15
'b' 7
'c' 5
'd' 23
'e' 17
'f' 19
Calculating Bits Saved
All we need to do for this calculation is figure out how many
bits are originally used to store the data and subtract from that
how many bits are used to store the data using the Huffman
code.
In the first example given, since we have six characters, let's
assume each is stored with a three-bit code. Since there are 133
characters in all, the total number of bits used is 3 * 133 = 399.
Now, using the Huffman coding frequencies we can calculate
the new total number of bits used:
Letter Code Frequency Total Bits
'a' 001 12 36
'b' 0000 2 8
'c' 0001 7 28
'd' 010 13 39
'e' 011 14 42
'f' 1 85 85
Total 238
Thus, we saved 399 - 238 = 161 bits, or roughly 40% of the
storage space. Of course there is a small detail we haven't taken
into account here. What is that?
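As a quick sanity check on the arithmetic above, the two totals
can be recomputed directly:

# Recompute both totals for the example file.
freqs = {'a': 12, 'b': 2, 'c': 7, 'd': 13, 'e': 14, 'f': 85}
code_lengths = {'a': 3, 'b': 4, 'c': 4, 'd': 3, 'e': 3, 'f': 1}

fixed = 3 * sum(freqs.values())                           # 3 * 133 = 399
huffman = sum(code_lengths[c] * freqs[c] for c in freqs)  # 238
print(fixed, huffman, fixed - huffman)                    # 399 238 161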
Huffman Coding is an Optimal Prefix Code
Of all prefix codes for a file, Huffman coding produces an
optimal one. In all of our examples from class on Monday, we
found that Huffman coding saved us a fair percentage of
storage space. But, we can show that no other prefix code can
do better than Huffman coding.
First, we will show the following:
Let x and y be the two characters with the least frequencies in a
file. Then there exists an optimal prefix code for that file in
which the codewords for x and y have the same length and differ
only in the last bit.
Here is how we will prove this:
Assume that a tree T stores an optimal prefix code. Let
characters a and b be sibling nodes stored at the maximum
depth of the tree. We will show that we can create T' with x
and y as siblings at the maximum depth of the tree such that the
number of bits used for the coding with T' is the same as with
T. (Let f(a) denote the frequency of the character a. Without
loss of generality, assume f(x) ≤ f(y) and f(a) ≤ f(b). Since x
and y have the least frequencies of all, it also follows that
f(x) ≤ f(a) and f(y) ≤ f(b). Let h be the height of the tree T.
Let x have a depth of dx in T and y have a depth of dy in T.)
Create T' as follows: swap the nodes storing a and x, and then
swap the nodes storing b and y. Now, we have that the depth of
x and y in T' is h, the depth of a is dx and the depth of b is dy in
T'.
Now, let's calculate the change in the number of bits used for
the coding with tree T' compared with the coding with tree T.
(Note: since all other codes remain unchanged, we only need to
analyze the total number of bits it takes to code a, b, x, and y.)
# bits for tree T (for a, b, x, and y) = hf(a) + hf(b) + dxf(x) + dyf(y)
# bits for tree T' (for a, b, x, and y) = dxf(a) + dyf(b) + hf(x) + hf(y)
Difference =
hf(a) + hf(b) + dxf(x) + dyf(y) - (dxf(a) + dyf(b) + hf(x) + hf(y)) =
hf(a) + hf(b) + dxf(x) + dyf(y) - dxf(a) - dyf(b) - hf(x) - hf(y) =
h(f(a) - f(x)) + h(f(b) - f(y)) + dx(f(x) - f(a)) + dy(f(y) - f(b)) =
h(f(a) - f(x)) + h(f(b) - f(y)) - dx(f(a) - f(x)) - dy(f(b) - f(y)) =
(h - dx)(f(a) - f(x)) + (h - dy)(f(b) - f(y))
Notice that all four factors above must be non-negative, since
we know that h ≥ dx, h ≥ dy, f(a) ≥ f(x), and f(b) ≥ f(y). Thus
the difference is ≥ 0, which means T' uses no more bits than T.
But T was assumed to be optimal, so the difference cannot be
positive either; it must be exactly 0. Thus, a code in which x
and y (the two characters with the lowest frequencies) are
siblings at the maximum depth of the coding tree uses the same
number of bits as T, and is therefore also optimal.
In layman's terms: give me what you think is an optimal coding
tree, and I can create a new one from it, using no more bits,
with the two lowest-frequency characters as siblings at the
bottom of the tree.
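As a concrete check on the algebra, here is a tiny numeric
instance; the depths and frequencies are hypothetical, chosen
only to satisfy the assumed inequalities.

# Hypothetical values with h >= dx, h >= dy, f(a) >= f(x), f(b) >= f(y).
h, dx, dy = 5, 3, 2
fa, fb, fx, fy = 9, 8, 4, 2

bits_T  = h*fa + h*fb + dx*fx + dy*fy  # cost of a, b, x, y in T
bits_T2 = dx*fa + dy*fb + h*fx + h*fy  # cost of a, b, x, y in T'
diff    = (h - dx)*(fa - fx) + (h - dy)*(fb - fy)

print(bits_T - bits_T2 == diff)  # True: the expansion checks out
print(diff >= 0)                 # True: T' never uses more bits than T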
To complete the proof, you'll notice that by construction,
Huffman coding ALWAYS makes sure that the nodes with the
lowest frequencies are at the bottom of the coding tree, all the
way through the construction. (You can't find any pair of
nodes for which this isn't true.) Technically, to carry out the
proof, you'd use induction, but we'll skip that for now...
