Text Compression
We will now look at techniques for text compression.
These are the techniques used by general purpose compressors such as zip,
gzip, bzip2, 7zip, etc.
Compression Models
A static model is a fixed model that is known by both the compressor and
the decompressor and does not depend on the data that is being
compressed. For example, the frequencies of symbols in the English language
computed from a large corpus of English texts could be used as the model.
A semiadaptive or semistatic model is a fixed model that is constructed
from the data to be compressed. For example, the symbol frequencies
computed from the text to be compressed can be used as the model. The
model has to be included as a part of the compressed data.
An adaptive model changes during the compression. At a given point in
compression, the model is a function of the previously compressed part of
the data. Since that part of the data is available to the decompressor at the
corresponding point in decompression, there is no need to store the model.
For example, we could start compressing using a uniform distribution of
symbols but then adjust that distribution towards the symbol frequencies in
the already processed part of the text.
• Not having to store a model saves space, but this saving can be lost to
a poor compression rate in the beginning of the compression. There is
no clear advantage either way. As the compression progresses, the
adaptive model improves and approaches optimal. In this way, the
model automatically adapts to the size of the data.
• The data is not always uniform. An optimal model for one part may be
a poor model for another part. An adaptive model can adapt to such
local differences by forgetting the far past.
• The disadvantage of adaptive computation is the time needed to
maintain the model. Decompression, in particular, can be slow
compared with semiadaptive compression.
Zeroth order text compression
Let T = t0 t1 . . . tn−1 be a text of length n over an alphabet Σ of size σ. For
any symbol s ∈ Σ, let ns be the number of occurrences of s in T . Let
fs = ns /n denote the frequency of s in T .
Definition 2.1: The zeroth order empirical entropy of the text T is

H0 (T ) = − ∑_{s∈Σ} fs log fs .
The quantity nH0 (T ) represents a type of lower bound for the compressed
size of the text T .
• If we encode the text with arithmetic coding using some probability
distribution P on Σ, the length of the encoding is about
− ∑_{s∈Σ} ns log P (s) = −n ∑_{s∈Σ} fs log P (s).

This is minimized by choosing P (s) = fs , which gives the length nH0 (T ).
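As a concrete illustration, here is a minimal Python sketch of Definition 2.1
(the name entropy0 is just illustrative):

import math
from collections import Counter

def entropy0(T):
    # zeroth order empirical entropy H0(T) in bits per symbol
    n = len(T)
    return -sum((c / n) * math.log2(c / n) for c in Counter(T).values())

T = "badada"
print(entropy0(T))           # ~1.459 bits per symbol
print(len(T) * entropy0(T))  # ~8.75 bits, the bound nH0(T) for this text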
A simple semiadaptive encoding of T consists of:
• the symbol counts ns encoded with γ-coding, for example,
• the text symbols encoded with Huffman or arithmetic coding using the
symbol frequencies fs as probabilities.
The size of the first part can be reduced slightly by encoding codeword
lengths (Huffman coding) or rounded frequencies (low precision arithmetic
coding) instead of the symbol counts.
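As an illustration of the first part, here is a small sketch assuming that
γ-coding refers to the Elias γ code; since that code is defined only for
positive integers, ns + 1 is encoded so that zero counts are also
representable:

from collections import Counter

def gamma_code(x):
    # Elias gamma code of x >= 1: floor(log2 x) zeros followed by x in binary
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def encode_counts(text, alphabet):
    counts = Counter(text)
    return "".join(gamma_code(counts[s] + 1) for s in alphabet)

print(encode_counts("badada", "abcd"))  # '00100' + '010' + '1' + '011'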
The size of the second part is between nH0 (T ) and nH0 (T ) + n bits.
Note that nH0 (T ) bits is not a lower bound for the second part but for the
whole encoding. The size of the second part can be reduced by updating the
counts during the encoding, as the optimized semi-adaptive model in
Example 2.2 below illustrates.
Example 2.2: Let T = badada and Σ = {a, b, c, d}. The frequencies assigned
to the symbols are:
                             b         a         d         a         d         a
semi-adaptive               1/6       3/6       2/6       3/6       2/6       3/6
optimized semi-adaptive     1/6       3/5       2/4       2/3       1/2       1/1
adaptive (count+1)          1/4       1/5       1/6       2/7       2/8       3/9
adaptive (escape)         1/1 · 1/4  1/2 · 1/3  1/3 · 1/2   1/4       1/5       2/6
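The optimized semi-adaptive and adaptive (count+1) rows of the table can be
reproduced with the following sketch (purely illustrative):

from collections import Counter

T, alphabet = "badada", "abcd"

# adaptive (count+1): every count starts at 1 and is updated after each symbol
counts, total = {s: 1 for s in alphabet}, len(alphabet)
for t in T:
    print(f"{counts[t]}/{total}", end=" ")   # 1/4 1/5 1/6 2/7 2/8 3/9
    counts[t] += 1
    total += 1
print()

# optimized semi-adaptive: start from the true counts and remove each symbol
# after it has been encoded, so the probabilities describe only the remaining text
counts, total = Counter(T), len(T)
for t in T:
    print(f"{counts[t]}/{total}", end=" ")   # 1/6 3/5 2/4 2/3 1/2 1/1
    counts[t] -= 1
    total -= 1
print()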
When the symbol counts grow large, they can be rescaled by dividing by
some constant. This has the effect of partially forgetting the past, since the
symbols encountered after the rescaling have a bigger effect on the counts
than those preceding the rescaling. This can make the model adapt better
to local differences.
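A sketch of such rescaling, with halving as the (arbitrary) choice of constant
and an arbitrary rescaling threshold:

MAX_TOTAL = 1 << 16   # rescale once the counts sum to more than this (arbitrary)

def update_count(counts, symbol):
    counts[symbol] += 1
    if sum(counts.values()) > MAX_TOTAL:
        for s in counts:
            # halve the counts, keeping each at least 1 so no probability becomes 0
            counts[s] = (counts[s] + 1) // 2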
Adaptive entropy coders
The constantly changing frequencies of adaptive models require some
modifications to the entropy coders. Assume that the model maintains
symbol counts ns , s ∈ Σ, as well as their sum n = ∑_{s∈Σ} ns .
For arithmetic coding, the changing probabilities are not a problem as such,
apart from details such as maintaining the cumulative counts needed for
computing the symbol ranges.
For Huffman coding, the code tree itself has to be updated as the counts
change. This can be done as follows:
All nodes are kept on a list in the order of their counts. Ties are broken so
that siblings (children of the same parent) are next to each other.
Whenever an increment causes a node u to be in a wrong place on this list,
it is swapped with another node v:
• v is chosen so that after swapping u and v on the list, the list is again
in the correct order.
• The subtrees rooted at u and v are swapped in the tree.
The increments start at a leaf and are propagated upwards only after a
possible swap.
Example 2.3: Let T = abb . . . and Σ = {a, b, c, d}. Starting with all counts
being one and a balanced Huffman tree, where the root R has the internal
nodes U and V as children, U is the parent of the leaves a and b, and V is
the parent of the leaves c and d, processing the first two symbols ab updates
the counts but does not change the tree. The counts are then
R = 6, U = 4, V = 2, a = 2, b = 2, c = 1, d = 1, and the list in count order
is R, U, V, a, b, c, d.
It can be shown that this algorithm keeps the tree a Huffman tree. The
swaps can be implemented in constant time. The total update time is
proportional to the codeword length.
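A simplified sketch of the update step described above (names such as Node
and increment are illustrative). The nodes are kept on the list order in
decreasing order of count with the root first; the swap partner is found here
by a linear scan, whereas the constant-time version mentioned above would
keep extra bookkeeping for the blocks of equal counts:

class Node:
    def __init__(self, count, symbol=None):
        self.count, self.symbol = count, symbol
        self.parent = self.left = self.right = None
        self.pos = 0                     # position of the node on the order list

def swap_nodes(order, u, v):
    # swap the subtrees rooted at u and v, and their positions on the list
    up, vp = u.parent, v.parent
    if up.left is u: up.left = v
    else:            up.right = v
    if vp.left is v: vp.left = u
    else:            vp.right = u
    u.parent, v.parent = vp, up
    order[u.pos], order[v.pos] = v, u
    u.pos, v.pos = v.pos, u.pos

def increment(order, u):
    # process one occurrence of the symbol at leaf u
    while u is not None:
        # v = the node closest to the front of the list with the same count as u
        v, i = u, u.pos - 1
        while i >= 0 and order[i].count == u.count:
            v, i = order[i], i - 1
        if v is not u and v is not u.parent:   # the parent case cannot occur
            swap_nodes(order, u, v)            # when all counts are positive
        u.count += 1                           # increment only after the swap
        u = u.parent

# the balanced starting tree of Example 2.3: R is the root, U the parent of
# the leaves a and b, V the parent of the leaves c and d, all leaf counts 1
leaves = {s: Node(1, s) for s in "abcd"}
U, V, R = Node(2), Node(2), Node(4)
U.left, U.right, V.left, V.right, R.left, R.right = (
    leaves["a"], leaves["b"], leaves["c"], leaves["d"], U, V)
for parent in (U, V, R):
    parent.left.parent = parent.right.parent = parent
order = [R, U, V] + [leaves[s] for s in "abcd"]
for pos, node in enumerate(order):
    node.pos = pos

for t in "ab":                               # process the first two symbols
    increment(order, leaves[t])
print([node.count for node in order])        # [6, 4, 2, 2, 2, 1, 1]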
Higher order models
The kth order context of a symbol ti in T is the string T [i − k..i − 1]. For
any string w ∈ Σ∗ , let nw be the number of occurrences of w in T (including
those that cyclically overlap the boundary).
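For instance, the counts nw including the cyclic occurrences can be computed
as follows (a straightforward sketch):

def cyclic_count(T, w):
    # number of occurrences of w in T, counting those that wrap around the end
    n, k = len(T), len(w)
    return sum(all(T[(i + j) % n] == w[j] for j in range(k)) for i in range(n))

print(cyclic_count("badadabada", "ab"))  # 2: one at position 5, one wrapping from 9 to 0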
A simple semiadaptive kth order encoding of T consists of:
• the symbol counts nw for all w ∈ Σ^{k+1} encoded with γ-coding, for
example,
• the first k symbols of T using some simple encoding
• the rest of the text symbols encoded with Huffman or arithmetic coding
using as probabilities the symbol frequencies in their kth order context.
The size of the last part of the encoding is between nHk (T ) and nHk (T ) + n
bits, where Hk (T ) denotes the kth order empirical entropy of T : the average
of the zeroth order entropies of the symbols occurring in each context
w ∈ Σ^k , weighted by the context frequencies nw /n.
An adaptive kth order model maintains σ^k separate zeroth order models, one
for each w ∈ Σ^k . Each symbol ti is encoded using the model for the context
T [i − k..i − 1] and then the model is updated just as in zeroth order
compression.
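A minimal sketch of the bookkeeping for an adaptive kth order model, here
with count+1 probabilities and with the per-context models created only when
a context is first seen (which gives the same probabilities as keeping all
σ^k models from the start):

from collections import defaultdict

def adaptive_kth_order(T, alphabet, k):
    # one zeroth order (count+1) model per context, created on demand
    models = defaultdict(lambda: {s: 1 for s in alphabet})
    for i in range(k, len(T)):
        context, t = T[i - k:i], T[i]
        counts = models[context]
        total = sum(counts.values())
        print(f"{context}->{t}: {counts[t]}/{total}")
        counts[t] += 1          # update only the model of this context
    # the first k symbols would be encoded separately with some simple encoding

adaptive_kth_order("badadabada", "abcd", k=1)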
Example 2.6: Let T = badadabada and Σ = {a, b, c, d}. The frequencies of
symbols in first order contexts are (counting occurrences cyclically):

context a:  d 3/5,  b 2/5
context b:  a 2/2
context d:  a 3/3
(the context c does not occur in T )
A problem with kth order models is the choice of the context length k. If k
is too small, the compression rate is not as good as it could be. If k is too
large, the overhead components start to dominate:
• the storage for the counts in semiadaptive compression
• the poor compression in the beginning for adaptive compression.
Furthermore, there may not exist a single right context length and a
different k should be used at different places.
One approach to address this problem is an adaptive method called
prediction by partial matching (PPM):
• Use multiple context lengths k ∈ {0, . . . , kmax }. To encode a symbol ti ,
find its longest context T [i − k..i − 1] that has occurred before.
• If ti has not occurred in that context, encode the escape symbol
similarly to what we saw with zeroth order compression. Then reduce
the context length by one, and try again. Keep reducing the context
length until ti can be encoded in that context. Each reduction encodes
a new escape symbol. A sketch of this mechanism is given below.
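A rough sketch of the escape mechanism (exclusions and the many possible
escape probability estimates are ignored; here each context simply reserves
one extra count for the escape, and a uniform order −1 model is used as the
final fallback):

def ppm_probabilities(models, history, symbol, alphabet, k_max):
    # models[k][context] is a dict of symbol counts for that k-symbol context;
    # returns the (count, total) pairs that would be entropy coded: escapes
    # first, then the pair for the symbol itself
    out = []
    for k in range(min(k_max, len(history)), -1, -1):
        context = history[len(history) - k:]
        counts = models[k].get(context)
        if counts is None:                      # context has not occurred before
            continue
        total = sum(counts.values()) + 1        # + 1 for the escape symbol
        if symbol in counts:
            out.append((counts[symbol], total)) # symbol found in this context
            return out
        out.append((1, total))                  # escape, then shorten the context
    out.append((1, len(alphabet)))              # order -1: uniform over the alphabet
    return out

# hypothetical model state after reading some text over {a, b, c, d}
models = {2: {}, 1: {"a": {"d": 3, "b": 1}}, 0: {"": {"a": 5, "b": 2, "d": 3}}}
print(ppm_probabilities(models, "ba", "c", "abcd", 2))  # [(1, 5), (1, 11), (1, 4)]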
The most advanced technique is context mixing. The idea is to use multiple
models and combine their predictions. The variations are endless with
respect to what models are used and how the predictions are combined.
Context mixing compressors are at the top in many compression
benchmarks.
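As a toy illustration of combining predictions, the sketch below just averages
two predicted distributions with a fixed weight; real context mixing
compressors (PAQ and its descendants, for example) combine many models and
learn the weights, typically with logistic mixing:

def mix(p1, p2, w=0.5):
    # weighted average of two predicted distributions (dicts symbol -> probability)
    mixed = {s: (1 - w) * p1.get(s, 0.0) + w * p2.get(s, 0.0)
             for s in set(p1) | set(p2)}
    total = sum(mixed.values())
    return {s: p / total for s, p in mixed.items()}

# e.g. an order-0 model and an order-1 model giving different predictions
print(mix({"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}))  # a: 0.7, b: 0.3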