Huffman Coding
Huffman Coding
Huffman coding is a lossless data compression algorithm. The idea is to assign variable-length
codes to input characters, lengths of the assigned codes are based on the frequencies of
corresponding characters. The most frequent character gets the smallest code and the least
frequent character gets the largest code.
The variable-length codes assigned to input characters are Prefix Codes, means the codes (bit
sequences) are assigned in such a way that the code assigned to one character is not the prefix of
code assigned to any other character. This is how Huffman Coding makes sure that there is no
ambiguity when decoding the generated bitstream.
Let us understand prefix codes with a counter example. Let there be four characters a, b, c and d,
and their corresponding variable length codes be 00, 01, 0 and 1. This coding leads to ambiguity
because code assigned to c is the prefix of codes assigned to a and b. If the compressed bit
stream is 0001, the de-compressed output may be "cccd" or "ccb" or "acd" or "ab".
1. Create a leaf node for each unique character and build a min heap of all leaf nodes (Min
Heap is used as a priority queue. The value of frequency field is used to compare two
nodes in min heap. Initially, the least frequent character is at root)
2. Extract two nodes with the minimum frequency from the min heap.
3. Create a new internal node with a frequency equal to the sum of the two nodes
frequencies. Make the first extracted node as its left child and the other extracted node as
its right child. Add this node to the min heap.
4. Repeat steps#2 and #3 until the heap contains only one node. The remaining node is the
root node and the tree is complete.
Let us understand the algorithm with an example:
character Frequency
a 5
b 9
c 12
d 13
e 16
f 45
Step 1. Build a min heap that contains 6 nodes where each node represents root of a tree with
single node.
Step 2 Extract two minimum frequency nodes from min heap. Add a new internal node with
frequency 5 + 9 = 14.
Now min heap contains 5 nodes where 4 nodes are roots of trees with single element each, and
one heap node is root of tree with 3 elements
character Frequency
c 12
d 13
Internal Node 14
e 16
f 45
Step 3: Extract two minimum frequency nodes from heap. Add a new internal node with
frequency 12 + 13 = 25
Now min heap contains 4 nodes where 2 nodes are roots of trees with single element each, and
two heap nodes are root of tree with more than one nodes
character Frequency
Internal Node 14
e 16
Internal Node 25
f 45
Step 4: Extract two minimum frequency nodes. Add a new internal node with frequency 14 + 16
= 30
character Frequency
Internal Node 25
Internal Node 30
f 45
Step 5: Extract two minimum frequency nodes. Add a new internal node with frequency 25 + 30
= 55
Now min heap contains 2 nodes.
character Frequency
f 45
Internal Node 55
Step 6: Extract two minimum frequency nodes. Add a new internal node with frequency 45 + 55
= 100
character Frequency
Internal Node 100
Since the heap contains only one node, the algorithm stops here.
character code-word
f 0
c 100
d 101
a 1100
b 1101
e 111
Example
Input: ch[] = { ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’ }, freq[] = { 5, 9, 12, 13, 16, 45 }
Output:
f0
c 100
d 101
a 1100
b 1101
e 111
Approach:
1. Push all the characters in ch[] mapped to corresponding
frequency freq[] in priority queue.
2. To create Huffman Tree, pop two nodes from priority queue.
3. Assign two popped node from priority queue as left and right child of new
node.
4. Push the new node formed in priority queue.
5. Repeat all above steps until size of priority queue becomes 1.
6. Traverse the Huffman Tree (whose root is the only node left in the priority
queue) to store the Huffman Code
7. Print all the stored Huffman Code for every character in ch[].
Below is the implementation of the above approach:
class HuffmanTreeNode {
public:
// Stores character
char data;
int freq;
HuffmanTreeNode* left;
HuffmanTreeNode* right;
// Node which has least frequency and Remove node from Priority Queue
// Node which has least frequency and Remove node from Priority Queue
// Driver Code
int main()
{
char data[] = { 'a', 'b', 'c', 'd', 'e', 'f' };
int freq[] = { 5, 9, 12, 13, 16, 45 };
int size = sizeof(data) / sizeof(data[0]);
Output:
f 0
c 100
d 101
a 1100
b 1101
e 111
Time Complexity: O(n*logn) where n is the number of unique characters
Auxiliary Space: O(n)