
Arithmetic Coding in Parallel

Jan Šupol and Bořivoj Melichar


Department of Computer Science & Engineering
Faculty of Electrical Engineering
Czech Technical University
Karlovo nám. 13, 121 35 Prague 2

e-mail: {supolj,melichar}@fel.cvut.cz

Abstract. We present a cost optimal parallel algorithm for the computation of arithmetic coding. We solve the problem in O(log n) time using n/log n processors on EREW PRAM. This leads to O(n) total cost.

Keywords: arithmetic coding, NC algorithm, EREW PRAM, PPS, parallel text compression.

1 Introduction
The need for data coding persists. The growing demand for network communication and for the storage of data signals from space are only two examples of coding needs. Many algorithms have been developed for text compression.
One of these is arithmetic coding [Mo98, Wi87], which is more efficient than the widely known Huffman algorithm [Hu52]. The latter rarely produces the best variable-size code; arithmetic coding overcomes this problem. The arithmetic code can be generated in O(n) time sequentially, and we present a well scalable NC parallel algorithm that generates the code in O(log n) time on EREW PRAM with n/log n processors. This leads to O(n) total cost and a cost optimal algorithm.
Despite the large number of papers on the parallel Huffman algorithm (the most recent known to us [Lb99] is work optimal), there are only a few papers on parallel arithmetic coding. Most of these are based on quasi-arithmetic coding [Ho92]. We know of only two exceptions. The first [Yo98] is based on an N-processor hypercube and is not cost optimal. The second [Ji94] is mainly focused on the hardware implementation: its authors expected the processing speed of their tree-based parallel structure to be eight times that of a sequential coder, which is still O(n) parallel time.
This paper is organized as follows. Section 2 provides a description of the sequential arithmetic coding algorithm. Section 3 presents some basic definitions. Section 4 describes the parallel prefix computation needed by our algorithm. Section 5 presents our parallel arithmetic coding algorithm. Section 6 describes the time complexity of our algorithm. Section 7 contains our conclusion. Note that this paper does not cover the decoding process.


2 Sequential Arithmetic Coding


First we review the sequential algorithm. Let A = [a_0, a_1, . . . , a_{m−1}] be the source alphabet containing m symbols, and let F = [f_0, f_1, . . . , f_{m−1}] be the associated set of frequencies giving the number of occurrences of each symbol. Next we compute the array of probabilities R = [r_0, r_1, . . . , r_{m−1}] such that r_i = f_i / T, where T = Σ_{i=0}^{m−1} f_i; the array of high ranges H = [h_0, h_1, . . . , h_{m−1}] such that h_i = Σ_{x=0}^{i} r_x; and the array of low ranges L = [l_0, l_1, . . . , l_{m−1}] such that l_0 = 0 and l_i = h_{i−1} for i > 0. Table 1 shows an example.

A  F  R          L    H
S  5  5/10 = 0.5  0.5  1.0
W  1  1/10 = 0.1  0.4  0.5
I  2  2/10 = 0.2  0.2  0.4
M  1  1/10 = 0.1  0.1  0.2
␣  1  1/10 = 0.1  0.0  0.1

Table 1: Frequencies, probabilities and ranges of five symbols.
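
To make the construction concrete, here is a small Python sketch (ours, not part of the original paper; the function name make_ranges and the use of plain floats are illustrative choices) that derives the low and high ranges from the frequencies of Table 1:

# Sketch: derive probabilities and low/high ranges from frequencies (Table 1).
def make_ranges(freqs):
    """freqs: dict mapping symbol -> occurrence count, in cumulative order."""
    total = sum(freqs.values())        # T = sum of all f_i
    low, high = {}, {}
    cum = 0.0
    for sym, f in freqs.items():
        low[sym] = cum                 # l_i = h_{i-1}, with l_0 = 0
        cum += f / total               # add r_i = f_i / T
        high[sym] = cum                # h_i = l_i + r_i
    return low, high

# Frequencies of Table 1; insertion order matches the table read bottom-up.
low, high = make_ranges({" ": 1, "M": 1, "I": 2, "W": 1, "S": 5})
print(low["S"], high["S"])             # 0.5 1.0 (up to float rounding)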

The string of symbols S = [s_0, s_1, . . . , s_{n−1}] is encoded as follows. The first character s_0 can be encoded by a number within the interval [l_y, h_y) associated with the character y = s_0, y ∈ A. The notation [a, b) means the range of real numbers from a to b, not including b. Let us define these two bounds as LowRange and HighRange.
As more symbols are input and processed, LowRange and HighRange are updated
according to

LowRange_j = LowRange_{j−1} + (HighRange_{j−1} − LowRange_{j−1}) × l_x,

HighRange_j = LowRange_{j−1} + (HighRange_{j−1} − LowRange_{j−1}) × h_x,

where l_x and h_x are the low and high ranges of the new character x ∈ A, LowRange_{−1} = 0, HighRange_{−1} = 1. Table 2 indicates an example for the word “SWISS”.

A  L/H  Calculation of the low and high ranges
S  L    0.0 + (1.0 − 0.0) × 0.5 = 0.5
   H    0.0 + (1.0 − 0.0) × 1.0 = 1.0
W  L    0.5 + (1.0 − 0.5) × 0.4 = 0.70
   H    0.5 + (1.0 − 0.5) × 0.5 = 0.75
I  L    0.7 + (0.75 − 0.7) × 0.2 = 0.71
   H    0.7 + (0.75 − 0.7) × 0.4 = 0.72
S  L    0.71 + (0.72 − 0.71) × 0.5 = 0.715
   H    0.71 + (0.72 − 0.71) × 1.0 = 0.720
S  L    0.715 + (0.72 − 0.715) × 0.5 = 0.7175
   H    0.715 + (0.72 − 0.715) × 1.0 = 0.7200

Table 2: The process of arithmetic encoding.
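
The whole sequential coder is just this update loop. The following sketch (ours; it reuses low and high from the previous snippet and plain floating point, so it is illustrative rather than a production coder, which would need renormalization) reproduces Table 2:

# Sketch: sequential arithmetic encoding of a string.
def encode(text, low, high):
    lo, hi = 0.0, 1.0                  # LowRange_{-1} = 0, HighRange_{-1} = 1
    for ch in text:
        width = hi - lo
        hi = lo + width * high[ch]     # HighRange_j
        lo = lo + width * low[ch]      # LowRange_j
    return lo, hi                      # any number in [lo, hi) encodes text

print(encode("SWISS", low, high))      # (0.7175, 0.72), up to float rounding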


3 Definitions
Our parallel algorithm is designed to run on the Parallel Random Access Machine (PRAM), which is a very simple synchronous model of the SIMD computer [Le92, Qu94, Tv94]. PRAM includes many submodels of parallel machines that differ from each other in the conditions of access to the shared memory. Our algorithm works on the Exclusive Read Exclusive Write (EREW) PRAM model, which means that no two processors can access the same cell of the shared memory at the same time.
We define the sequential time SU(n) as the worst-case time of the best known sequential algorithm, where n is the size of the input data. The parallel time T(n, p) is the time elapsed from the beginning of a p-processor parallel algorithm solving a problem instance of size n until the last (slowest) processor finishes the execution.
Consider a synchronous p-processor algorithm A with τ = T(n, p) parallel steps. Let p_i be the number of processors active (working) at step i ∈ {1, 2, . . . , τ} of A. Then the synchronous parallel work of A is

W(n, p) = p_1 + p_2 + · · · + p_τ.

Parallel cost (also called the processor-time product) is defined as

C(n, p) = p × T(n, p).

It is obvious that

SU(n) ≤ W(n, p) ≤ C(n, p).

If SU(n) = W(n, p) then the algorithm is work optimal. If SU(n) = C(n, p) then the algorithm is cost optimal.
The efficiency of the parallel algorithm is defined as

E(n, p) = SU(n) / C(n, p).

Let E_0 be a constant such that 0 < E_0 < 1. Then the isoefficiency function ψ_1(p) is the asymptotically minimum function such that

∀ n_p = Ω(ψ_1(p)) : E(n_p, p) ≥ E_0.

Hence, ψ_1(p) gives asymptotically the lower bound on the instance size of a problem that can be solved by p processors with efficiency at least E_0.
Scalability is the ability of an algorithm to adapt to a changing number of processors or to a changing size of the input data. Good scalability means that if we want to use more processors, we only have to increase the size of our problem a little. Fast growth of the function ψ_1 indicates poor scalability.
We say that the class NC (Nick's class) is the set of problems that can be computed in at most polylogarithmic time with at most a polynomial number of processors. These algorithms provide a high level of parallelization.


4 Parallel Prefix Computation

Since our parallel algorithm is based on the parallel prefix algorithm, we first show how it works. The problem is defined as follows [La80]. Let S = [s_0, s_1, . . . , s_{n−1}] be an array of numbers. The prefix problem is to compute all the prefixes of the product

s_0 ⊗ s_1 ⊗ · · · ⊗ s_{n−1},

where ⊗ is an associative operation.
Fig. 1 shows the algorithm, which assumes n processors p_0, p_1, . . . , p_{n−1} and an array M = [m_0, m_1, . . . , m_{n−1}] of numbers stored in the shared memory. Every processor p_i also has a register y_i. From now on we will use EREW PRAM under similar conditions.
for i := 0, 1, . . . , n − 1 do in parallel
    y_i := M[i];
for j := 0, 1, . . . , log n − 1 do sequentially
begin
    for i := 2^j, 2^j + 1, . . . , n − 1 do in parallel
        y_i := y_i ⊗ M[i − 2^j];
    for i := 2^j, 2^j + 1, . . . , n − 1 do in parallel
        M[i] := y_i;
end

Figure 1: Parallel prefix algorithm.
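
For readers who prefer executable code, here is our own simulated rendering of Fig. 1 in Python; the "in parallel" loops run sequentially over a snapshot of the array, which preserves the EREW read and write phases:

import math

# Simulated parallel prefix (Fig. 1): after ceil(log2 n) rounds,
# M[i] holds s_0 ⊗ s_1 ⊗ ... ⊗ s_i for an associative operation op.
def parallel_prefix(M, op):
    n = len(M)
    M = list(M)
    for j in range(math.ceil(math.log2(n))):
        y = list(M)                    # read phase: each p_i loads register y_i
        for i in range(2 ** j, n):     # "do in parallel" over processors
            # operand order is immaterial for the commutative ops used here
            y[i] = op(y[i], M[i - 2 ** j])
        M = y                          # write phase
    return M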

Fig. 2 indicates the parallel prefix algorithm computing an array of 7 numbers with the associative operation of addition. This is then called the parallel prefix sum.
Here we derive the parallel time T(n, p) of the parallel prefix computation on EREW PRAM. First we suppose that p < n. Each processor simulates n/p processors and sequentially sums its block of n/p numbers. This takes at most 4n/p steps (read the first number, read the second number, sum, and write the result). After that the processors run the parallel prefix algorithm in time O(log p). So the parallel time, cost, efficiency and function ψ_1 are

T(n, p) = O(n/p + log p),
C(n, p) = O(n + p log p),
E(n, p) = O(n / (n + p log p)),
ψ_1(p) = O(p log p).

By the definitions in Section 3, the parallel prefix algorithm is therefore a well scalable NC algorithm. If p = n then

T(n, n) = O(n/n + log n) = O(log n),
C(n, n) = O(n + n log n) = O(n log n).

However, when p = n/log n then

T(n, n/log n) = O((n log n)/n + log n − log log n) = O(log n),
C(n, n/log n) = O(n + (n/log n)(log n − log log n)) = O(n).

Hence, we have obtained a parallel cost optimal algorithm.


Input:         3   2   4   7   1   5   2
After step 1:  3   5   6  11   8   6   7
After step 2:  3   5   9  16  14  17  15
After step 3:  3   5   9  16  17  22  24

Figure 2: Parallel prefix sum example.
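
Running the sketch from Section 4 on the data of Fig. 2 reproduces its rows:

from operator import add

print(parallel_prefix([3, 2, 4, 7, 1, 5, 2], add))
# -> [3, 5, 9, 16, 17, 22, 24]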

5 Parallel Arithmetic Coding


Recall that we use the array A = [a_0, a_1, . . . , a_{m−1}] of the source alphabet containing m symbols, the associated set of frequencies F = [f_0, f_1, . . . , f_{m−1}], the associated set of probabilities R = [r_0, r_1, . . . , r_{m−1}] such that r_i = f_i / T where T = Σ_{i=0}^{m−1} f_i, the array of low ranges L = [l_0, l_1, . . . , l_{m−1}], and the array of high ranges H = [h_0, h_1, . . . , h_{m−1}] such that l_0 = 0, l_i = h_{i−1} for i > 0, and h_i = l_i + r_i.
Our idea of parallelism is as follows: we have a string S = [s_0, s_1, . . . , s_{n−1}] of n characters to encode, and each processor p_j is associated with a character s_j and computes the variables LowRange and HighRange for that character.

5.1 Preliminaries
We suppose that we have an array Range = [range_0, range_1, . . . , range_{n−1}] for our algorithm. Each range_j is initialized with the probability r_y such that a_y = s_j, where j is the index of the j-th character of the input string S. We also suppose that we have an array Low = [low_0, low_1, . . . , low_{n−1}]. Each low_j is initialized with the value l_y such that a_y = s_j. We need at least one variable High, initialized with the value h_y such that a_y = s_{n−1}.
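
In terms of our running Python sketch, this initialization amounts to the following (low and high as computed earlier; the array names follow the paper):

# Sketch: initialize Range, Low and High for the input string (Section 5.1).
text = "SWISS"
Range = [high[c] - low[c] for c in text]   # range_j = r_y where a_y = s_j
Low = [low[c] for c in text]               # low_j = l_y where a_y = s_j
High = high[text[-1]]                      # h_y for the last character s_{n-1}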


5.2 Changes in Sequential Algorithm


Let us return to sequential arithmetic coding and try to change the algorithm a bit so that it can be parallelized. Recall the bounds computation

LowRange_j = LowRange_{j−1} + (HighRange_{j−1} − LowRange_{j−1}) × l_x,

HighRange_j = LowRange_{j−1} + (HighRange_{j−1} − LowRange_{j−1}) × h_x,

where l_x and h_x are the low and high ranges of the new character x ∈ A, LowRange_{−1} = 0, HighRange_{−1} = 1, and denote the cumulative lower and higher bounds

LR_j = (HighRange_{j−1} − LowRange_{j−1}) × l_x,

HR_j = (HighRange_{j−1} − LowRange_{j−1}) × h_x.

So the values LowRange and HighRange are updated as

LowRange_j = LowRange_{j−1} + LR_j,

HighRange_j = LowRange_{j−1} + HR_j,

and we now focus only on the variables LR and HR.

LR_j = (HighRange_{j−1} − LowRange_{j−1}) × l_x
     = (LowRange_{j−2} + HR_{j−1} − LowRange_{j−2} − LR_{j−1}) × l_x
     = (HR_{j−1} − LR_{j−1}) × l_x,

HR_j = (HighRange_{j−1} − LowRange_{j−1}) × h_x
     = (LowRange_{j−2} + HR_{j−1} − LowRange_{j−2} − LR_{j−1}) × h_x
     = (HR_{j−1} − LR_{j−1}) × h_x.

Moreover, LowRange_j can be computed as

LowRange_j = LR_j + LowRange_{j−1} = LR_j + LR_{j−1} + LowRange_{j−2} = · · ·
           = LR_j + LR_{j−1} + · · · + LR_0 + LowRange_{−1}
           = Σ_{x=0}^{j} LR_x + LowRange_{−1} = Σ_{x=0}^{j} LR_x,

because LowRange_{−1} = 0.
The change in our algorithm is that we first compute the cumulative lower and higher bounds, and then we simply compute the sum of these cumulative bounds to obtain the final bounds LowRange and HighRange.
Let us see how the variables LR and HR can be computed for the word “SWISS”. We declare that LR_0 is the LR variable for the first character s_0 = “S”, and that l_x, h_x, r_x are the lower range, higher range and probability of the character x ∈ A. LR_{−1} and HR_{−1} are the initial cumulative bounds for the number that represents the encoded text S. For arithmetic coding this number lies by definition in the interval [0, 1). That is why LR_{−1} = LowRange_{−1} = 0 and HR_{−1} = HighRange_{−1} = 1.


LR_{−1} = 0
HR_{−1} = 1
LR_0 = (HR_{−1} − LR_{−1}) × l_S = 1.0 × 0.5 = 0.5
HR_0 = (HR_{−1} − LR_{−1}) × h_S = 1.0 × 1.0 = 1.0
LR_1 = (HR_0 − LR_0) × l_W = (h_S − l_S) × l_W = r_S × l_W = 0.5 × 0.4 = 0.2
HR_1 = (HR_0 − LR_0) × h_W = (h_S − l_S) × h_W = r_S × h_W = 0.5 × 0.5 = 0.25
LR_2 = (HR_1 − LR_1) × l_I = (r_S × h_W − r_S × l_W) × l_I = r_S × r_W × l_I = 0.5 × 0.1 × 0.2 = 0.01
HR_2 = (HR_1 − LR_1) × h_I = (r_S × h_W − r_S × l_W) × h_I = r_S × r_W × h_I = 0.5 × 0.1 × 0.4 = 0.02
LR_3 = (HR_2 − LR_2) × l_S = (r_S × r_W × h_I − r_S × r_W × l_I) × l_S = r_S × r_W × r_I × l_S = 0.005
HR_3 = (HR_2 − LR_2) × h_S = (r_S × r_W × h_I − r_S × r_W × l_I) × h_S = r_S × r_W × r_I × h_S = 0.01
. . .
So it is obvious that the lower bound of the j-th character, LR_j, and the higher bound of the j-th character, HR_j, can be computed as

LR_j = (Π_{x=0}^{j−1} r_x) × l_j,  j > 0,

HR_j = (Π_{x=0}^{j−1} r_x) × h_j,  j > 0,

where the indices now refer to positions in the string, i.e., r_x is the probability of s_x and l_j, h_j are the ranges of s_j.
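
A quick numeric check of these closed forms against the values derived above (our sketch, reusing low and high from Section 2):

# Sketch: verify LR_j = (prod_{x<j} r_x) * l_j and the analogous HR_j.
prod = 1.0                                 # empty product for j = 0
for j, ch in enumerate("SWIS"):
    print(j, prod * low[ch], prod * high[ch])   # LR_j, HR_j for j = 0..3
    prod *= high[ch] - low[ch]             # extend the product by r_{s_j}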

5.3 Parallel Prefix Production

for i := 0, 1, . . . , n − 1 do in parallel
    y_i := Range[i];
for j := 0, 1, . . . , log n − 1 do sequentially
begin
    for i := 2^j, 2^j + 1, . . . , n − 1 do in parallel
        y_i := y_i × Range[i − 2^j];
    for i := 2^j, 2^j + 1, . . . , n − 1 do in parallel
        Range[i] := y_i;
end

Figure 3: Parallel prefix production algorithm.



These new LR and HR variables are exactly what we need, because j−1 rx
j x=0
can be computed in parallel as we immediately show. Computation of x=0 r x =
j x
x=0 range can be done by the parallel prefix production algorithm explained in
Section 4, as shown in Fig. 3. Table 3 indicates the parallel prefix algorithm in our
example for the word “SWISS”.


        S    W     I     S      S
Input   0.5  0.1   0.2   0.5    0.5
Step 1  0.5  0.05  0.02  0.1    0.25
Step 2  0.5  0.05  0.01  0.005  0.005
Step 3  0.5  0.05  0.01  0.005  0.0025

Table 3: Parallel prefix production example for the word “SWISS”.

5.4 Cumulative Bounds Computation

Once we have computed Π_{x=0}^{j−1} r_x, we can obtain the variables LR_j and HR_j simply as the products

LR_j = (Π_{x=0}^{j−1} r_x) × l_j  and  HR_j = (Π_{x=0}^{j−1} r_x) × h_j.

The parallel algorithm computing the variables LR and the variable HR_{n−1} is shown in Fig. 4. The variables HR are not actually needed, except for the last one, HR_{n−1}; if they are required, they can be computed in a similar way. The value HR_{n−1}, which is the cumulative high range, is computed after the parallel prefix production as

HR_{n−1} = (Π_{x=0}^{n−2} r_x) × h_{n−1}.

Table 4 shows this computation in our example for the word “SWISS”. Note that the results correspond to the cumulative bounds in our sequential example.

do sequentially
begin
    y_{n−1} := High;
    y_{n−1} := y_{n−1} × Range[n − 2];
    High := y_{n−1};
    y_0 := 1;
end
for i := 1, 2, . . . , n − 1 do in parallel
    y_i := Range[i − 1];
for i := 0, 1, . . . , n − 1 do in parallel
begin
    y_i := y_i × Low[i];
    Low[i] := y_i;
end

Figure 4: Parallel computation of the variables LR and HR_{n−1}.
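
In our simulated Python setting the same two phases look as follows (a sketch reusing parallel_prefix and the arrays initialized in Section 5.1; plain loops stand in for the parallel steps, and the printed values hold up to float rounding):

from operator import mul

# Sketch: prefix products of Range, then the cumulative bounds LR and HR_{n-1}.
P = parallel_prefix(Range, mul)            # P[j] = r_0 * r_1 * ... * r_j
HR_last = P[-2] * High                     # HR_{n-1} = (prod of first n-1 r) * h_{n-1}
LR = [Low[0]] + [P[j - 1] * Low[j] for j in range(1, len(Low))]
print(LR, HR_last)                         # [0.5, 0.2, 0.01, 0.005, 0.0025] 0.005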

Now we have computed the cumulative high and low ranges: the array Low contains the LR values and the variable High contains the value HR_{n−1}. Next we have to compute the sum of these cumulative ranges LR so that we obtain the required bounds LowRange and HighRange for the arithmetic compression of the string S.


L/H  S    W     I     S      S
LR   0.5  0.2   0.01  0.005  0.0025
HR   1    0.25  0.02  0.01   0.005

Table 4: Cumulative low and high ranges.

5.5 Computation of Low and High Ranges

In Section 5.4 we computed the cumulative bounds LR and HR. Here we show how to obtain the bounds declared earlier as LowRange and HighRange for the compressed text. As shown in Section 5.2, these values can be computed as

LowRange_j = (Σ_{x=0}^{j−1} LR_x) + LR_j,

HighRange_j = (Σ_{x=0}^{j−1} LR_x) + HR_j.

To compute the sum we can use the parallel prefix algorithm once more, namely the parallel prefix sum shown earlier. Finally, after computing the sum, the variable HighRange_{n−1} is obtained as

HighRange_{n−1} = LowRange_{n−2} + HR_{n−1}.

This algorithm is shown in Fig. 5. Afterwards, the array Low contains the values LowRange and the variable High contains the value HighRange_{n−1}. Our example for the word “SWISS” is shown in Table 5.

for i := 0, 1, . . . , n − 1 do in parallel
    y_i := Low[i];
for j := 0, 1, . . . , log n − 1 do sequentially
begin
    for i := 2^j, 2^j + 1, . . . , n − 1 do in parallel
        y_i := y_i + Low[i − 2^j];
    for i := 2^j, 2^j + 1, . . . , n − 1 do in parallel
        Low[i] := y_i;
end
do sequentially
begin
    y_{n−1} := High;
    y_{n−1} := y_{n−1} + Low[n − 2];
    High := y_{n−1};
end

Figure 5: LowRange and HighRange_{n−1} computation algorithm.


        S    W    I     S      S
Input   0.5  0.2  0.01  0.005  0.0025
Step 1  0.5  0.7  0.21  0.015  0.0075
Step 2  0.5  0.7  0.71  0.715  0.2175
Step 3  0.5  0.7  0.71  0.715  0.7175

Table 5: Parallel prefix sum example.
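
The last phase completes our simulated pipeline (again a sketch building on the previous snippets); the result agrees with the sequential computation of Table 2:

from operator import add

# Sketch: prefix-sum the LR values, then derive the final high bound.
LowRange = parallel_prefix(LR, add)        # [0.5, 0.7, 0.71, 0.715, 0.7175]
HighRange_last = LowRange[-2] + HR_last    # 0.715 + 0.005 = 0.72
print(LowRange[-1], HighRange_last)        # 0.7175 0.72, up to float rounding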

6 Time and Cost Complexities

Our algorithm does not say how to set the arrays Range, Low and the variable High in a preliminary phase. However, once the arrays A, R, L and H are set, this can be done in O(1) time on CREW PRAM with a good hash function that returns the index in the array A of an input character from the input string S.
Our EREW PRAM algorithm consists of three phases. In the first phase, the parallel prefix production is computed. As shown in Section 4, this can be done in time O(n/p + log p), where p is the number of processors used and n is the size of the input. In the second phase, shown in Fig. 4, we compute the cumulative bounds LR and HR in time O(n/p). The third phase, the parallel prefix sum shown in Fig. 5, also takes O(n/p + log p) time. The computation of HighRange_{n−1} takes only O(1) time in either phase. So the time and cost of our algorithm are

T(n, p) = O(n/p + log p),

C(n, p) = O(n + p log p).

If p = n/log n then the total time is O(log n) and the cost is O(n).
Because our algorithm consists mainly of parallel prefix computations, it inherits their best properties. Our algorithm is therefore a well scalable NC algorithm, and it can be implemented as a cost optimal algorithm.

7 Conclusions
We have presented a parallel NC algorithm for the computation of arithmetic coding. We have solved the problem in O(log n) time using n/log n processors on EREW PRAM. Our algorithm leads to O(n) total cost and is cost optimal.
The preliminary phase is a weakness of our algorithm. However, if we were able to construct a good adaptive parallel arithmetic coding scheme based on our algorithm, it could solve this problem.
Another open question is how to construct a good parallel arithmetic decoding algorithm.

References

[Ho92] Howard, Paul G. and Jeffrey Scott Vitter (1992): Parallel Lossless Image Compression Using Huffman and Arithmetic Coding. Proceedings of the IEEE Data Compression Conference, 299-308.

[Hu52] Huffman, David (1952): A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the Institute of Radio Engineers, 40:1098-1101.

[Ji94] Jiang, J. and S. Jones (1994): Parallel design of arithmetic coding. IEE Proceedings-Computers and Digital Techniques, 141(6):327-333, November.

[La80] Ladner, Richard E. and Michael J. Fischer (1980): Parallel Prefix Computation. Journal of the ACM, 27(4):831-838, October.

[Lb99] Laber, Eduardo Sany, Ruy Luiz Milidiú and Artur Alves Pessoa (1999): A Work Efficient Parallel Algorithm for Constructing Huffman Codes. Proceedings of the IEEE Data Compression Conference DCC'99.

[Le92] Lewis, T. G. and H. El-Rewini (1992): Introduction to Parallel Computing. Prentice Hall.

[Mo98] Moffat, Alistair, Radford Neal and Ian H. Witten (1998): Arithmetic Coding Revisited. ACM Transactions on Information Systems, 16(3):256-294, July.

[Qu94] Quinn, M. J. (1994): Parallel Computing: Theory and Practice. McGraw-Hill.

[Tv94] Casavant, T. L., P. Tvrdík and F. Plášil, editors (1994): Parallel Computers: Architectures, Languages, and Algorithms. IEEE CS Press.

[Wi87] Witten, Ian H., Radford Neal and John G. Cleary (1987): Arithmetic Coding for Data Compression. Communications of the ACM, 30(6):520-540.

[Yo98] Youssef, A. (1998): Parallel Algorithms for Entropy Coding Techniques. Proceedings of European Parallel and Distributed Systems. ACTA Press.
