Data Compression
CS 147
Minh Nguyen
Why Data Compression?
Make
optimal use of limited
storage space
Save
time and help to optimize
resources
If compression and decompression are done in I/O processor,
less time is required to move data to or from storage
subsystem, freeing I/O bus for other work
In sending data over communication line: less time to transmit
and less storage to host
Data CompressionEntropy
Entropy is the measure of information
content in a message.
Messages with higher entropy carry more information
than messages with lower entropy.
How
to determine the entropy
Find the probability p(x) of symbol x in the message
The entropy H(x) of the symbol x is:
H(x) = - p(x) log2p(x)
The
average entropy over the entire
message is the sum of the entropy of all
n symbols in the message
Data Compression
Methods
Data compression is about storing and
sending a smaller number of bits.
Therere two major categories for
methods to compress data: lossless
and lossy methods
Lossless Compression
Methods
In lossless methods, original data and the
data after compression and
decompression are exactly the same.
Redundant
data is removed in
compression and added during
decompression.
Lossless
methods are used when we
cant afford to lose any data: legal and
medical documents, computer programs.
Run-length encoding
Simplest method of compression.
How: replace consecutive repeating occurrences of a
symbol by 1 occurrence of the symbol itself, then
followed by the number of occurrences.
The method can be more efficient if the data uses
only 2 symbols (0s and 1s) in bit patterns and 1
symbol is more frequent than another.
Huffman Coding
Assign
fewer bits to symbols that occur more
frequently and more bits to symbols appear less
often.
Theres no unique Huffman code and every
Huffman code has the same average code length.
Algorithm:
Make a leaf node for each code symbol
Add the generation probability of each symbol to the leaf
node
Take the two leaf nodes with the smallest probability and
connect them into a new node
Add 1 or 0 to each of the two branches
The probability of the new node is the sum of the
probabilities of the two connecting nodes
If there is only one node left, the code construction is
completed. If not, go back to (2)
Huffman Coding
Example
Huffman Coding
Encoding
Decoding
Lempel Ziv Encoding
It
is dictionary-based encoding
Basic
idea:
Create a dictionary(a table) of strings
used during communication.
If both sender and receiver have a copy
of the dictionary, then previouslyencountered strings can be substituted
by their index in the dictionary.
Lempel Ziv Compression
Have
2 phases:
Building an indexed dictionary
Compressing a string of symbols
Algorithm:
Extract the smallest substring that cannot be
found in the remaining uncompressed string.
Store that substring in the dictionary as a
new entry and assign it an index value
Substring is replaced with the index found in
the dictionary
Insert the index and the last character of the
substring into the compressed string
Lempel Ziv Compression
Compression
example:
Audio Encoding
Predictive
encoding
Only the differences
Lempel Ziv
Decompression
Its just the inverse
of compression process
Lossy Compression
Methods
Used for compressing images and
video files (our eyes cannot
distinguish subtle changes, so lossy
data is acceptable).
These methods are cheaper, less time
and space.
Several
methods:
JPEG: compress pictures and graphics
MPEG: compress video
MP3: compress audio
JPEG Encoding
Used
to compress pictures and
graphics.
In JPEG, a grayscale picture is divided
into 8x8 pixel blocks to decrease the
number of calculations.
Basic idea:
Change the picture into a linear (vector) sets of
numbers that reveals the redundancies.
The redundancies is then removed by one of
lossless compression methods.
JPEG Encoding- DCT
DCT:
Discrete Concise Transform
DCT transforms the 64 values in 8x8 pixel
block in a way that the relative relationships
between pixels are kept but the
redundancies are revealed.
Example:
A gradient grayscale
Quantization &
Compression
Quantization:
After T table is created, the values are quantized
to reduce the number of bits needed for encoding.
Quantization divides the number of bits by a
constant, then drops the fraction. This is done to
optimize the number of bits and the number of 0s
for each particular application.
Compression:
Quantized values are read from the table and
redundant 0s are removed.
To cluster the 0s together, the table is read
diagonally in an zigzag fashion. The reason is if
the table doesnt have fine changes, the bottom
right corner of the table is all 0s.
JPEG usually uses lossless run-length encoding at
the compression phase.
JPEG Encoding
MPEG Encoding
Used
to compress video.
Basic idea:
Each video is a rapid sequence of a set of
frames. Each frame is a spatial
combination of pixels, or a picture.
Compressing video =
spatially compressing each frame
+
temporally compressing a set of
frames.
MPEG Encoding
Spatial
Compression
Each frame is spatially compressed by JPEG.
Temporal Compression
Redundant frames are removed.
For example, in a static scene in which someone
is talking, most frames are the same except for
the segment around the speakers lips, which
changes from one frame to the next.
Audio Compression
Used
for speech or music
Speech: compress a 64 kHz digitized
signal
Music: compress a 1.411 MHz signal
Two categories of techniques:
Predictive encoding
Perceptual encoding
Audio Encoding
Predictive
Encoding
Only the differences between samples are
encoded, not the whole sample values.
Several standards: GSM (13 kbps), G.729 (8
kbps), and G.723.3 (6.4 or 5.3 kbps)
Perceptual Encoding: MP3
CD-quality audio needs at least 1.411
Mbps and cannot be sent over the
Internet without compression.
MP3 (MPEG audio layer 3) uses perceptual
encoding technique to compress audio.
References
http://www.csie.kuas.edu.tw/course/cs/
english/ch-15.ppt
CS157B-Lecture
19 by Professor
Lee
http://cs.sjsu.edu/~lee/cs157b/cs157b
.html
The
essentials of computer
organization and architecture by
Linda Null and Julia Nobur.
Data Compression
QUESTION?