
Mathematical Preliminaries for Lossless Compression

C.M. Liu
Perceptual Lab, College of Computer Science
National Chiao-Tung University
http://www.csie.nctu.edu.tw/~cmliu/Courses/Compression/

Office: EC538
(03)5731877
[email protected]
Outline

- Introduction
- Information Theory
- Models
- Coding

Achieving Data Compression

Most data has natural redundancy
- I.e., a 'straightforward' encoding contains more data than the actual information in the data
- E.g., audio sampling

Achieving Data Compression (2)

Compression == 'squeezing out' the inefficiencies of the information representation
- Note #1: in lossy compression we also throw out less important/imperceptible information
- Note #2: we must be able to reverse the process to make the data usable again

Q1: What data can be compressed?
Q2: By how much?
Q3: How close are we to optimal compression?

Information Theory: a mathematical description of information and its properties

Representing Data

Analog (continuous) data
- Represented by real numbers
- Note: cannot be represented exactly by computers

Digital (discrete) data
- Given a finite set of symbols {a1, a2, ..., an}
- All data is represented as symbol sequences (strings) over the symbol set
- E.g.: {a, b, c, d, r} => abc, car, bar, abracadabra, ...
- We use digital data to approximate analog data

Common Symbol Sets

- Roman alphabet plus punctuation
- ASCII: 128 symbols (256 in the extended 8-bit variants)
- Braille, Morse
- Binary: {0, 1}
  - 0 and 1 are called bits
  - All digital data can be represented efficiently in binary
  - E.g., a fixed-length binary representation of {a, b, c, d} (2 bits/symbol):

    Symbol  a   b   c   d
    Binary  00  01  10  11

Information

- First formally developed by Claude Shannon at Bell Labs in the 1940s/50s
- Explains limits on coding/communication using probability theory

Self-information
- Given an event A with probability P(A):

  i(A) = log_b (1 / P(A)) = -log_b P(A)

Self-Information

Observations
- Low P(A) => high i(A)
- High P(A) => low i(A)
  (figure: plot of -log2 x for x in [0, 1])

Rationale
- Low-probability (surprising) events carry more information;
  think "man bites dog" vs. "dog bites man"

If A and B are independent, then i(AB) = i(A) + i(B):

  i(AB) = log_b (1 / P(AB)) = log_b (1 / (P(A) P(B)))
        = log_b (1 / P(A)) + log_b (1 / P(B)) = i(A) + i(B)

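A minimal Python sketch of these definitions (the function name and the example probabilities are illustrative, not from the slides):

    import math

    def self_information(p: float, base: float = 2.0) -> float:
        """i(A) = log_b(1 / P(A)) = -log_b P(A); in bits when base = 2."""
        return -math.log(p, base)

    # An event with probability 1/2 carries exactly 1 bit of information.
    print(self_information(0.5))                        # 1.0

    # Additivity for independent events: i(AB) = i(A) + i(B).
    pA, pB = 0.25, 0.1                                  # arbitrary illustrative values
    print(self_information(pA * pB))                    # 5.3219...
    print(self_information(pA) + self_information(pB))  # 5.3219...
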
Coin Flip Example

Fair coin
- Let H & T be the outcomes
- If P(H) = P(T) = 1/2, then
  i(H) = i(T) = -log2(1/2) = 1 bit

Unfair coin
- Let P(H) = 1/8, P(T) = 7/8
- i(H) = log2(8) = 3 bits
- i(T) = log2(8/7) ≈ 0.193 bits

Note that P(H) + P(T) = 1

(First-Order) Entropy

Let
- A1, ..., An be all the independent possible outcomes of an experiment,
- with probabilities P(A1), ..., P(An). Then

  H = ∑_{i=1..n} P(Ai) i(Ai) = -∑_{i=1..n} P(Ai) log_b P(Ai)

- If the experiment generates symbols, then (for b = 2) H is the average number of binary symbols needed to code the symbols.
- Shannon: no lossless compression algorithm can do better.
- Note: the general expression for H is more complex, but reduces to the above for iid sources.

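A small Python sketch of this formula (the function name and sample distributions are illustrative):

    import math

    def entropy(probs, base: float = 2.0) -> float:
        """First-order entropy H = -sum_i P(A_i) log_b P(A_i)."""
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))                  # 1.0 bit: a fair coin
    print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits
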
Entropy Example #1

Consider the sequence:
- 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10
- Assume it correctly describes the probabilities generated by the source; then
  - P(1) = P(6) = P(7) = P(10) = 1/16
  - P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = 2/16
- Assuming the sequence is iid:

  H = -∑_{i=1..10} P(i) log2 P(i)
    = -4 (1/16) log2(1/16) - 6 (2/16) log2(2/16) = 3.25 bits

Entropy Example #2

Assume sample-to-sample correlation
- Instead of coding samples, code the differences:
  1 1 1 -1 1 1 1 -1 1 1 1 1 1 -1 1 1
- Now P(1) = 13/16, P(-1) = 3/16
- H = 0.70 bits (per symbol)
- The model also needs to be coded

Knowing something about the source can help us 'reduce' the entropy
- Note that we cannot actually reduce the entropy of the source, as long as our coding is lossless
- Instead, we are reducing our estimate of the entropy

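A sketch of this example in Python, estimating entropy from relative frequencies (the helper name is illustrative):

    import math
    from collections import Counter

    def estimated_entropy(seq, base=2.0):
        """Estimate first-order entropy from relative symbol frequencies."""
        counts = Counter(seq)
        n = len(seq)
        return -sum(c / n * math.log(c / n, base) for c in counts.values())

    samples = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
    # Keep the first sample, then code sample-to-sample differences.
    diffs = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

    print(estimated_entropy(samples))   # ~3.25 bits/symbol
    print(estimated_entropy(diffs))     # ~0.70 bits/symbol
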
Entropy Example #3

Consider the sequence:
- 1 2 1 2 3 3 3 3 1 2 3 3 3 3 1 2 3 3 1 2
- P(1) = P(2) = 1/4, P(3) = 1/2, so H = 1.5 bits/symbol
- Total bits: 20 x 1.5 = 30

Reconsider the sequence in pairs:
- (1 2) (1 2) (3 3) (3 3) (1 2) (3 3) (3 3) (1 2) (3 3) (1 2)
- P(1 2) = 1/2, P(3 3) = 1/2
- H = 1 bit/pair x 10 pairs = 10 bits

In theory, structure can eventually be extracted by taking larger samples.
In reality, we need an accurate model, as it is often impractical to observe a source for long.

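The same pairing trick in Python (a sketch; `estimated_entropy` is the same illustrative helper as in the previous example):

    import math
    from collections import Counter

    def estimated_entropy(seq, base=2.0):
        counts = Counter(seq)
        n = len(seq)
        return -sum(c / n * math.log(c / n, base) for c in counts.values())

    s = "12123333123333123312"
    print(len(s) * estimated_entropy(s))            # 30.0 bits coding single symbols

    pairs = [s[i:i + 2] for i in range(0, len(s), 2)]   # "12", "12", "33", ...
    print(len(pairs) * estimated_entropy(pairs))    # 10.0 bits coding pairs
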
Models

Physical models
- Based on an understanding of the process generating the data
  - E.g., speech
- A good model leads to good compression
- Usually impractical
  - Use empirical data instead
  - Statistical methods can help take a proper sample

Probability Models

Ignorance model
1. Assume each letter is generated independently from the rest
2. Assume all letters are generated with equal probability
- Examples? ASCII, RGB, CDDA, ...

Improvement: drop assumption 2
- A = {a1, a2, ..., an}, P = {P(a1), P(a2), ..., P(an)}
- Very efficient coding schemes already exist

Note
- If assumption 1 does not hold, a better solution likely exists

Markov Models

Assume that each output symbol depends on the previous k symbols. Formally:
- Let {x_n} be a sequence of observations
- We call {x_n} a kth-order discrete Markov chain (DMC) if

  P(x_n | x_{n-1}, ..., x_{n-k}) = P(x_n | x_{n-1}, ..., x_{n-k}, ...)

Usually, we use a first-order DMC:

  P(x_n | x_{n-1}) = P(x_n | x_{n-1}, x_{n-2}, ...)

Linear dependency model
- x_n = ρ x_{n-1} + ε_n, where ε_n is white noise

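As an illustration of a first-order model, here is a small Python sketch (the helper name and the sample string are illustrative) that estimates the conditional probabilities P(x_n | x_{n-1}) by counting transitions in an observed sequence:

    from collections import Counter, defaultdict

    def first_order_model(seq):
        """Estimate P(x_n | x_{n-1}) from an observed sequence by counting transitions."""
        counts = defaultdict(Counter)
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
        return {prev: {cur: c / sum(cnt.values()) for cur, c in cnt.items()}
                for prev, cnt in counts.items()}

    print(first_order_model("abracadabra"))
    # {'a': {'b': 0.5, 'c': 0.25, 'd': 0.25}, 'b': {'r': 1.0},
    #  'r': {'a': 1.0}, 'c': {'a': 1.0}, 'd': {'a': 1.0}}
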
Non-linear Markov Models

Consider a BW image as a string of black & white pixels (e.g., row by row)
- Define two states, Sw & Sb, for the current pixel
  (figure: two-state diagram with transitions P(w|w), P(b|w), P(w|b), P(b|b))
- Define probabilities:
  - P(Sb) = probability of being in Sb
  - P(Sw) = probability of being in Sw
- Transition probabilities P(b|b), P(b|w), P(w|b), P(w|w), with
  P(w|w) = 1 - P(b|w),  P(b|b) = 1 - P(w|b)
- Entropy of each state:
  H(Sw) = -P(b|w) log P(b|w) - P(w|w) log P(w|w)
  H(Sb) = -P(w|b) log P(w|b) - P(b|b) log P(b|b)
- Overall entropy:
  H = P(Sb) H(Sb) + P(Sw) H(Sw)

Markov Model (MM) Example

Assume
  P(Sw) = 30/31, P(Sb) = 1/31
  P(w|w) = 0.99, P(b|w) = 0.01, P(b|b) = 0.7, P(w|b) = 0.3

For the iid model:
  H_iid = -(30/31) log2(30/31) - (1/31) log2(1/31) ≈ 0.206 bits

For the Markov model:
  H(Sb) = -0.3 log2 0.3 - 0.7 log2 0.7 = 0.881 bits
  H(Sw) = -0.01 log2 0.01 - 0.99 log2 0.99 = 0.081 bits
  H_Markov = (30/31)(0.081) + (1/31)(0.881) ≈ 0.107 bits

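A quick check of these numbers in Python (a sketch; `h2` is just a binary-entropy helper, not from the slides):

    import math

    def h2(p: float) -> float:
        """Binary entropy in bits: -p log2 p - (1-p) log2 (1-p)."""
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    P_Sw, P_Sb = 30 / 31, 1 / 31

    H_iid = h2(P_Sb)                    # iid model ignores the previous pixel
    H_Sw, H_Sb = h2(0.01), h2(0.3)      # P(b|w) = 0.01, P(w|b) = 0.3
    H_markov = P_Sw * H_Sw + P_Sb * H_Sb

    print(round(H_iid, 3), round(H_Sw, 3), round(H_Sb, 3), round(H_markov, 3))
    # 0.206 0.081 0.881 0.107
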
Markov Models in Text Compression

In written English, the probability of the next letter is heavily influenced by the previous ones
- E.g., u after q

Shannon's work
- 2nd-order MM, 26 letters + space: H = 3.1 bits/letter
- Word-based model: H = 2.4 bits/letter
- Human prediction based on 100 previous letters: 0.6 <= H <= 1.3 bits/letter

Longer context => better prediction

Practical concerns:
- Context model storage (e.g., 4th order with 95 characters = 95^4 contexts)
- Zero-frequency problem

Composite Source Model

Many sources cannot be adequately described by a single model
- E.g., an executable contains code and resources (text, images, ...)

Solution: a composite model that switches between several sub-models, each describing part of the data


Coding

Alphabet
- A collection of symbols called letters

Code
- A set of binary sequences called codewords

Coding
- The process of mapping letters to codewords
- Fixed- vs. variable-length coding

Example: letter 'A'
- ASCII: 01000001
- Morse: •—

Code rate
- Average number of bits per symbol

Uniquely Decodable Codes

Example
- Alphabet = {a1, a2, a3, a4}
- P(a1) = 1/2, P(a2) = 1/4, P(a3) = P(a4) = 1/8
- H = 1.75 bits
- n(ai) = length(codeword(ai)), i = 1..4
- Average length l = ∑_{i=1..4} P(ai) n(ai)

Possible codes:

  Letter  Probability  Code 1  Code 2  Code 3  Code 4
  a1      0.500        0       0       0       0
  a2      0.250        0       1       10      01
  a3      0.125        1       00      110     011
  a4      0.125        10      11      111     0111
  l                    1.125   1.250   1.750   1.875

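A few lines of Python reproducing the average lengths and comparing them to H (a sketch with illustrative names):

    import math

    probs = [0.5, 0.25, 0.125, 0.125]
    codes = {
        "Code 1": ["0", "0", "1", "10"],
        "Code 2": ["0", "1", "00", "11"],
        "Code 3": ["0", "10", "110", "111"],
        "Code 4": ["0", "01", "011", "0111"],
    }

    H = -sum(p * math.log2(p) for p in probs)
    print("H =", H)                                   # 1.75 bits

    for name, cws in codes.items():
        l = sum(p * len(cw) for p, cw in zip(probs, cws))
        print(name, "average length =", l)            # 1.125, 1.25, 1.75, 1.875
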
Uniquely Decodable Codes (2)

  Letter  Probability  Code 1  Code 2  Code 3  Code 4
  a1      0.500        0       0       0       0
  a2      0.250        0       1       10      01
  a3      0.125        1       00      110     011
  a4      0.125        10      11      111     0111
  l                    1.125   1.250   1.750   1.875

Code 1
- Identical codewords for a1 & a2 => decode('00') = ???
Code 2
- Distinct codewords but still ambiguous: decode('00') / decode('11') = ???
Code 3
- Uniquely decodable, instantaneous
Code 4
- Uniquely decodable, 'near-instantaneous'

Uniquely Decodable Codes (3)

Unique decodability:
- Given any sequence of codewords, there is a unique decoding of it.

Unique != instantaneous
- E.g.:
  a1 <-> 0
  a2 <-> 01
  a3 <-> 11
- decode(0111111111) = a1 a3 ... or a2 a3 ... ?
  - We don't know until the end of the string:
    0111111111 -> 01111111 a3 -> 011111 a3a3 -> 0111 a3a3a3 -> 01 a3a3a3a3 -> a2 a3a3a3a3

Unique Decodability Test

Prefix & dangling suffix
- Let a = a1...ak and b = b1...bn be binary codewords with k < n
- If a1...ak = b1...bk, then a is a prefix of b, and
- bk+1...bn is the dangling suffix: ds(a, b)

Algorithm
- Let C be the set of all codewords, and L a working list initialized to C

  repeat
      for all pairs (ci, cj) in L such that ci is a prefix of cj:
          if ds(ci, cj) is a codeword in C:
              return NOT_UNIQUE        // a dangling suffix equals a codeword
          else if ds(ci, cj) is not already in L:
              add ds(ci, cj) to L
  until no new dangling suffixes are added
  return UNIQUE

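A direct Python sketch of this test (function names are illustrative; it follows the procedure above, growing a working set of dangling suffixes until none are new or one of them is a codeword):

    def dangling_suffix(a: str, b: str):
        """Return the dangling suffix of b if a is a proper prefix of b, else None."""
        if len(a) < len(b) and b.startswith(a):
            return b[len(a):]
        return None

    def is_uniquely_decodable(codewords) -> bool:
        codewords = list(codewords)
        code = set(codewords)
        if len(code) != len(codewords):      # duplicate codewords are never UD
            return False
        work = set(code)                     # codewords plus dangling suffixes found so far
        while True:
            new = set()
            for a in work:
                for b in work:
                    ds = dangling_suffix(a, b)
                    if ds is None:
                        continue
                    if ds in code:           # a dangling suffix is itself a codeword
                        return False
                    if ds not in work:
                        new.add(ds)
            if not new:                      # no new dangling suffixes => uniquely decodable
                return True
            work |= new

    print(is_uniquely_decodable(["0", "1", "00", "11"]))      # False (Code 2)
    print(is_uniquely_decodable(["0", "10", "110", "111"]))   # True  (Code 3)
    print(is_uniquely_decodable(["0", "01", "011", "0111"]))  # True  (Code 4)
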
Prefix Codes

Prefix code:
- No codeword is a prefix of another.
- Prefix codes are also known as prefix-free codes, prefix condition codes, comma-free codes (although this is incorrect), and instantaneous codes.

Binary trees as prefix decoders:

  Example code: a -> 00, b -> 01, c -> 1
  (figure: decoding tree; from the root, bit 1 leads to leaf c, bit 0 leads to an internal node whose 0/1 children are the leaves a and b)

  repeat
      curr = root
      repeat
          if get_bit(input) = 1
              curr = curr.right
          else
              curr = curr.left
      until is_leaf(curr)
      output curr.symbol
  until eof(input)

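A runnable Python version of this pseudocode (a sketch; the Node class, the build_tree helper, and the bit-string input are assumptions, not part of the slides):

    class Node:
        def __init__(self):
            self.left = None       # child taken on bit 0
            self.right = None      # child taken on bit 1
            self.symbol = None     # set only at leaves

    def build_tree(code):
        """Build a decoding tree from a {symbol: codeword} table of a prefix code."""
        root = Node()
        for symbol, bits in code.items():
            curr = root
            for b in bits:
                if b == "1":
                    curr.right = curr.right or Node()
                    curr = curr.right
                else:
                    curr.left = curr.left or Node()
                    curr = curr.left
            curr.symbol = symbol
        return root

    def decode(bits, root):
        """Walk the tree bit by bit; emit a symbol and restart at the root at each leaf."""
        out, curr = [], root
        for b in bits:
            curr = curr.right if b == "1" else curr.left
            if curr.symbol is not None:        # reached a leaf
                out.append(curr.symbol)
                curr = root
        return "".join(out)

    tree = build_tree({"a": "00", "b": "01", "c": "1"})
    print(decode("0001100", tree))             # abca
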
Decoding Prefix Codes: Example

  Symbol  Code
  a       0
  b       10
  c       110
  d       1110
  r       1111

(figure: the corresponding decoding tree; each 0-branch ends in a leaf a, b, c, d, and the all-ones path 1111 ends in r)

abracadabra = 0 10 1111 0 110 0 1110 0 10 1111 0 = 010111101100111001011110

Decoding Example

Input = 010111101100111001011110

Walk the tree bit by bit, emitting a symbol at each leaf:

  Bits read   Leaf reached   Output so far
  0           a              a
  10          b              ab
  1111        r              abr
  0           a              abra
  110         c              abrac
  0           a              abraca
  ... and so on, until the remaining bits 1110 0 10 1111 0 yield abracadabra.

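The same decoding in a few lines of Python (a sketch; greedy matching works here precisely because no codeword is a prefix of another):

    def decode_prefix(bits: str, code: dict) -> str:
        """Accumulate bits until they match a codeword, emit its symbol, repeat."""
        inverse = {cw: sym for sym, cw in code.items()}
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in inverse:
                out.append(inverse[buf])
                buf = ""
        return "".join(out)

    code = {"a": "0", "b": "10", "c": "110", "d": "1110", "r": "1111"}
    print(decode_prefix("010111101100111001011110", code))   # abracadabra
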
Summary

Basic definitions of Information Theory
- Information
- Entropy
- Models
- Codes
  - Unique decodability
  - Prefix codes

Homework (pp. 38-39): 3, 4, 7
Program (p. 39): 5

