Probabilistic Counting Algorithms for Data Base Applications — Flajolet & Martin, 1985
Journal of Computer and System Sciences 31, 182–209 (1985)

Probabilistic Counting Algorithms for Data Base Applications

PHILIPPE FLAJOLET
INRIA, Rocquencourt, 78153 Le Chesnay, France

AND

G. NIGEL MARTIN
IBM Development Laboratory, Hursley Park, Winchester, Hampshire SO21 2JN, United Kingdom

Received June 13, 1984; revised April 3, 1985

This paper introduces a class of probabilistic counting algorithms with which one can estimate the number of distinct elements in a large collection of data (typically a large file stored on disk) in a single pass, using only a small additional storage (typically less than a hundred binary words) and only a few operations per element scanned. The algorithms are based on statistical observations made on bits of hashed values of records. They are by construction totally insensitive to the replicative structure of elements in the file; they can be used in the context of distributed systems without any degradation of performances, and prove especially useful in the context of data base query optimisation. © 1985 Academic Press, Inc.

1. INTRODUCTION

As data base systems allow the user to specify more and more complex queries, the need arises for efficient processing methods. A complex query can, however, generally be evaluated in a number of different manners, and the overall performance of a data base system depends rather crucially on the selection of appropriate decomposition strategies in each particular case.

Even a problem as trivial as computing the intersection of two collections of data A and B lends itself to a number of different treatments (see, e.g., [7]):

1. sort A, search each element of B in A, and retain it if it appears in A;
2. sort A, sort B, then perform a merge-like operation to determine the intersection;
3. eliminate duplicates in A and/or B using hashing or hash filters, then perform Algorithm 1 or 2.

Each of these evaluation strategies has a cost essentially determined by the number of records a, b in A and B, and the number of distinct elements α, β in A and B; for typical sorting methods, the costs are:

    for strategy 1: O(a log a + b log a);
    for strategy 2: O(a log a + b log b + a + b).

In a number of similar situations it thus appears that, apart from the sizes of the files on which one operates (i.e., the number of records), a major determinant of efficiency is the cardinalities of the underlying sets, i.e., the number of distinct elements they comprise.

The situation gets much more complex when operations like projections, selections, and multiple joins in combination with various boolean operations appear in queries. As an example, the relational system System R has a sophisticated query optimiser. In order to perform its task, that programme keeps several statistics on relations of the data base. The most important ones are the sizes of relations, as well as the number of different elements of some key fields [8]. This information is used to determine the selectivity of attributes at any given time, in order to decide the choice of keys and the choice of the appropriate algorithms to be employed when computing relational operators. The choices are made so as to minimise a certain cost function that depends on specific CPU and disk access costs, as well as on sizes and cardinalities of relations or fields. In System R, this information is periodically recomputed and kept in catalogues that are companions to the data base records and indexes.
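Strategy 2, for instance, admits a compact transcription; the sketch below is our illustration in Python (the paper gives no code for these strategies), returning the distinct common elements of A and B:

```python
def intersection_sort_merge(A, B):
    """Strategy 2: sort both collections, then advance two cursors in a merge-like pass."""
    A, B = sorted(A), sorted(B)
    i = j = 0
    out = []
    while i < len(A) and j < len(B):
        if A[i] < B[j]:
            i += 1
        elif A[i] > B[j]:
            j += 1
        else:
            # common value: record it once, then skip its replications on both sides
            v = A[i]
            out.append(v)
            while i < len(A) and A[i] == v:
                i += 1
            while j < len(B) and B[j] == v:
                j += 1
    return out
```

The dominant cost is the two sorts, in agreement with the O(a log a + b log b + a + b) estimate above.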
In this paper, we propose efficient algorithms to estimate the cardinalities of multisets of data as are commonly encountered in data base practice. A trivial method consists in determining card(M) by building a list of all elements of M without replication; this method has the advantage of being exact, but it has a cost in number of disk accesses and auxiliary storage (at least O(a), or O(a log a) if sorting is used) that might be higher than the possible gains which one can obtain using that information.

The method we propose here is probabilistic in nature, since its result depends on the particular hashing function used and on the particular data on which it operates. It uses minimal extra storage in core and provides practically useful estimates on cardinalities of large collections of data. The accuracy is inversely related to the storage: using 64 binary words of typically 32 bits, we attain a typical accuracy of 10%; using 256 words, the accuracy improves to about 5%. The performances do not degrade as files get large: with 32 bit words, one can safely count cardinalities well over 100 million. The only assumption made is that records can be hashed in a suitably pseudo-uniform manner. This does not, however, appear to be a severe limitation, since empirical studies on large industrial files [5] reveal that careful implementations of standard hashing techniques do achieve practical uniformity of hashed values. Furthermore, by design, our algorithms are totally insensitive to the replication structure of files: as opposed to sampling techniques,¹ the result will be the same whether elements appear a million times or just a few times.

¹ The simplest sampling algorithm is: take a sample of size N₀ of a file of size N; estimate the cardinality v of the sample using any direct algorithm, and return v(N/N₀) as an estimate of the cardinality of the whole file.

From a more theoretical standpoint, these techniques constitute yet another illustration of the gains that may be achieved in many situations through the use of probabilistic methods. We mention here Morris' approximate counting algorithm [6], which maintains approximate counters with an expected constant relative accuracy, using only log₂ log₂ n + O(1) bits in order to count up to n. Morris' algorithm (see [2] for a detailed analysis that has analogies to the present paper) may be used to reduce by a factor of 2 the memory size necessary to store large statistics on a large number of events in computer systems.

The structure of the paper is as follows: in Section 2, we describe a basic counting procedure called COUNT that forms the basis of our algorithms. It may be worth noting that non-trivial analytic techniques enter the justification, and actually the design, of the algorithms; these techniques are also developed in Section 2. Section 3 presents the actual counting algorithms based on this COUNT procedure and on the probabilistic tools of Section 2. Finally, Section 4 concludes with several indications on contexts in which the methods may be used: most notably, they can be employed on the fly, as well as in the context of distributed processing, with minimal exchanges of information between processors and without any degradation of performances. Preliminary results about this work have been reported in [3].

2. A PROBABILISTIC COUNTING PROCEDURE AND ITS ANALYSIS

The Basic Counting Procedure

We assume here that we have at our disposal a hashing function hash of the type

    function hash(x: records): scalar range [0 .. 2^L − 1];

that transforms records into integers sufficiently uniformly distributed over the scalar range, or equivalently over the set of binary strings of length L.
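Morris' approximate counting scheme mentioned above can be sketched as follows (a simplified illustration of ours, not the paper's code: the register c holds an exponent, is incremented with probability 2^{−c}, and 2^c − 1 is an unbiased estimate of the number of increments):

```python
import random

def morris_increment(c, rng=random):
    """One counting event: bump the exponent c with probability 2^(-c)."""
    if rng.random() < 2.0 ** (-c):
        c += 1
    return c

def morris_estimate(c):
    """Unbiased estimate of the number of increments seen so far: 2^c - 1."""
    return 2 ** c - 1
```

Counting up to n only requires c ≈ log₂ n, hence a register of about log₂ log₂ n bits.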
For y any non-negative integer, we define bit(y, k) to be the k-th bit in the binary representation of y, so that

    y = Σ_{k≥0} bit(y, k) 2^k.

We also introduce the function ρ(y) that represents the position of the least significant 1-bit in the binary representation of y, with a suitable convention for ρ(0):

    ρ(y) = min{ k ≥ 0 | bit(y, k) ≠ 0 }   if y > 0;
    ρ(y) = L                              if y = 0.

(Thus ranks are numbered starting from zero.) We observe that if the values hash(x) are uniformly distributed, the pattern 0^k 1 appears with probability 2^{−k−1}. The idea consists in recording observations on the occurrence of such patterns in a vector BITMAP[0 .. L−1]. If M is the multiset whose cardinality is sought, we perform the following operations:

    for i := 0 to L−1 do BITMAP[i] := 0;
    for all x in M do
      begin
        index := ρ(hash(x));
        if BITMAP[index] = 0 then BITMAP[index] := 1
      end;

Thus BITMAP[i] is equal to 1 iff after execution a pattern of the form 0^i 1 has appeared amongst hashed values of records in M. Notice that, by construction, vector BITMAP only depends on the set of hashed values, and not on the particular frequency with which such values may repeat themselves.

From the remarks concerning pattern probabilities, we should therefore expect, if n is the number of distinct elements in M, that BITMAP[0] is accessed approximately n/2 times, BITMAP[1] approximately n/4 times, .... Thus, at the end of an execution, BITMAP[i] will almost certainly be zero if i ≫ log₂ n and one if i ≪ log₂ n, with a fringe of zeros and ones for i ≈ log₂ n.

As an example, we took as M the on-line documentation corresponding to Volume 1 of the manual of the Unix system on one of our installations. M consists here of 26692 lines, of which 16405 were distinct.
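The ρ function and the loop above translate directly; the following is our Python transcription (the helper names are ours, and any pseudo-uniform L-bit hash may play the role of hash):

```python
L = 24  # number of bits of each hashed value (the manual-file example uses 24)

def rho(y, L=L):
    """Position of the least significant 1-bit of y (ranks start at 0); L for y = 0."""
    if y == 0:
        return L
    k = 0
    while (y >> k) & 1 == 0:
        k += 1
    return k

def count_bitmap(hashed_values, L=L):
    """The COUNT procedure: BITMAP[i] = 1 iff some hashed value ends in 0^i 1."""
    bitmap = [0] * L
    for h in hashed_values:
        bitmap[rho(h, L)] = 1   # idempotent, so replications of a value change nothing
    return bitmap
```

Because the update is a pure "set bit" operation, the resulting vector depends only on the set of hashed values, exactly as stated above.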
Considering these lines as records and hashing them through standard multiplicative hashing over 24 bits (L = 24), we found a BITMAP vector consisting of an initial run of ones followed by a fringe and then zeros: the leftmost value zero appears in position 12 and the rightmost value one in position 15, while 2^14 = 16384.

We propose to use the position of the leftmost zero in BITMAP (ranks start at 0) as an indicator of log₂ n. Let R be this quantity; we shall see that, under the assumption that hashed values are uniformly distributed, the expected value of R is close to

    E(R) = log₂(φn),   φ = 0.77351...,        (1)

so that our intuition is justified. In fact, the "correction factor" φ plays quite an important role in the design of the final algorithms we propose here. We shall also prove that, under the same probabilistic assumptions, the standard deviation of R is close to

    σ(R) ≈ 1.12,        (2)

so that an estimate based on (1) will typically be one binary order of magnitude off the exact result, a fact that calls for the more elaborate algorithms developed in Section 3.

Probability Distributions

We now proceed to justify rigorously the above claims (1) and (2) concerning the distribution of the value of parameter R in the basic counting procedure.

PROBABILISTIC MODEL. We let B denote the set of infinite binary strings. The model assumes that bits of elements of B are uniformly and independently distributed. Equivalently, strings can be considered as real numbers over the interval [0; 1], and the model assumes that the numbers are uniformly distributed over the interval. Functions bit and ρ are extended to B trivially. We denote by R_n the random variable defined over B^n (assuming independence) that is the analogue of the parameter R above:

    R_n(x^(1), ..., x^(n)) = k  iff  (i) for all j < k, there is an i with ρ(x^(i)) = j, and (ii) for all i, ρ(x^(i)) ≠ k.

For k ≥ 0, we define the following events (i.e., subsets of B):

    E_k = { x | ρ(x) = k };    K_k = { x | ρ(x) ≥ k }.

Thus, for each k, the events E_0, E_1, ..., E_{k−1}, K_k form a disjoint and complete set of events.
When n elements are drawn from B, the formal polynomial

    P_k^(n) = (E_0 + E_1 + ... + E_{k−1} + K_k)^n        (3)

represents the set of all possible events in the following sense: if we expand (3), taken as a non-commutative polynomial in its indeterminates, interpreting the sums as (disjoint) unions of events and the products as successions of events (each monomial has degree n), we obtain a complete and disjoint representation of B^n.

We are interested in obtaining from P_k^(n) an expression for the polynomial Q_k^(n) that represents, in a similar fashion, the union of all events corresponding to R_n ≥ k. Polynomial Q_k^(n) is formed by the subset of the non-commutative monomials appearing in P_k^(n) in which each of E_0, ..., E_{k−1} occurs at least once. Let us start with a few examples. If k = 0, we have

    P_0^(n) = (K_0)^n   and   Q_0^(n) = P_0^(n).

If k = 1,

    P_1^(n) = (E_0 + K_1)^n,   Q_1^(n) = (E_0 + K_1)^n − (K_1)^n,

since Q is obtained from P in this case by taking out from P the monomial (K_1)^n corresponding to the situation where all strings drawn have a ρ-value at least 1. For k = 2 now, we have

    Q_2^(n) = (E_0 + E_1 + K_2)^n − (E_1 + K_2)^n − (E_0 + K_2)^n + (K_2)^n,

since we have to take out from P the cases where either ρ-value 0 or ρ-value 1 does not appear; but in so doing, we have eliminated the case where all ρ-values are at least 2 (i.e., (K_2)^n) twice.

In general, for P a polynomial in the indeterminates E_0, E_1, ..., the polynomial Q formed with the monomials of degree at least 1 in each of the indeterminates E_j can be obtained from P by the inclusion–exclusion type formula

    Q = P − Σ_i P[E_i → 0] + Σ_{i<j} P[E_i, E_j → 0] − Σ_{i<j<l} P[E_i, E_j, E_l → 0] + ...,        (4)

where the notation P[x, y → 0] means the replacement of x and y by 0 in P. Thus Q_k^(n) can in general be obtained by applying (4) to the expression of P_k^(n) given by (3).

To evaluate the probabilities q_{n,k} = Pr(R_n ≥ k), all we have to do is to take the measures μ of the events described by polynomial Q_k^(n), using the rules

    μ(E_j) = 2^{−j−1},   μ(K_k) = 2^{−k},

together with the additivity of μ over disjoint unions of events and the relation μ(A · B) = μ(A) μ(B), since trials in B^n are assumed to be independent. On our previous examples, we find in this way:

    q_{n,0} = 1;   q_{n,1} = 1 − (1/2)^n;   q_{n,2} = 1 − (3/4)^n − (1/2)^n + (1/4)^n.

THEOREM 1. The distribution of R_n is determined by

    q_{n,k} = Pr(R_n ≥ k) = Σ_{j=0}^{2^k − 1} (−1)^{ν(j)} (1 − j/2^k)^n,        (6)

where ν(j) denotes the number of ones in the binary representation of j.

Proof. Taking measures in formula (4) applied to (3) gives, in general,

    q_{n,k} = Σ_{S ⊆ {0,...,k−1}} (−1)^{|S|} ( 2^{−k} + Σ_{j∉S} 2^{−j−1} )^n.        (5)

Notice that 2^{−k} + Σ_{j∉S} 2^{−j−1} = 1 − Σ_{j∈S} 2^{−j−1} = (2^k − Σ_{j∈S} 2^{k−1−j}) / 2^k. Changing the summation indices to l_j = k − 1 − j, the sums Σ 2^{l_j}, with the l_j distinct integers over the interval [0 .. k−1], enumerate exactly the integers j of the interval [0 .. 2^k − 1], with |S| = ν(j). Using this correspondence inside (5) completes the proof of the theorem. ∎

We now turn to the derivation of asymptotic forms for these probabilities.

THEOREM 2. The distribution of R_n satisfies the following estimates:

(i) if k ≤ log₂ n − 2 log₂ log₂ n, then 1 − q_{n,k} = O(n e^{−log² n});
(ii) if log₂ n − 2 log₂ log₂ n ≤ k ≤ (3/2) log₂ n, then

    q_{n,k} = Σ_{j≥0} (−1)^{ν(j)} e^{−jn/2^k} + O( log² n / √n );

(iii) if k ≥ (3/2) log₂ n + δ, with δ ≥ 0, the tail of the distribution is exponential: q_{n,k} = O(4^{−δ}/n).

Proof. The main device here consists in using repeatedly the exponential approximation

    (1 − a)^n = e^{−na} (1 + O(na²))

inside the terms that form the expression (6) of q_{n,k}, splitting the sum at j of the order of 2^k log² n / n: the terms beyond that point are exponentially small, and there are at most 2^k = O(n^{3/2}) terms in all. Carrying out these estimates in each of the three ranges of k yields (i), (ii) and (iii). ∎

For the sequel, we introduce the real function

    Ψ(x) = Π_{j≥0} (1 − e^{−2^j x}) = Σ_{j≥0} (−1)^{ν(j)} e^{−jx}.        (13)

Thus, Theorem 2 essentially expresses the existence of a sort of limiting distribution for R_n as n gets large:

    q_{n,k} ≈ Ψ(n/2^k);   Pr(R_n = k) ≈ Ψ(n/2^k) − Ψ(n/2^{k+1}).        (14)

Table I describes the values of the exact probabilities compared to the approximations given by (14). It shows excellent agreement between the q_{n,k} and their approximations.
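The exact distribution derived above and its exponential approximation are easy to confront numerically; a small sketch of ours (ν(j) is computed by counting 1-bits):

```python
import math

def nu(j):
    """Number of ones in the binary representation of j."""
    return bin(j).count("1")

def q(n, k):
    """Exact Pr(R_n >= k): sum over j < 2^k of (-1)^nu(j) * (1 - j/2^k)^n."""
    return sum((-1) ** nu(j) * (1 - j / 2 ** k) ** n for j in range(2 ** k))

def psi(x, depth=64):
    """Limiting function: product over j >= 0 of (1 - exp(-2^j x))."""
    p = 1.0
    for j in range(depth):
        t = 2.0 ** j * x
        if t > 50:       # remaining factors are 1.0 to double precision
            break
        p *= 1.0 - math.exp(-t)
    return p
```

For instance, q(100, 6) and psi(100 / 2**6) agree to about two decimal places, in line with the central estimate of Theorem 2.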
It also reveals that the tail decreases sharply (actually, a decrease faster than that of Theorem 2 may be established).

    TABLE I
    Values of the Exact Probabilities q_{n,k} and of the Approximations (14), for Three Increasing Values of n

Asymptotic Analysis

From Theorem 2 follows:

LEMMA 1. The expectation R̄_n of R_n satisfies

    R̄_n = F(n) + o(1),        (15)

where

    F(x) = Σ_{k≥1} Ψ(x/2^k).        (16)

Thus the problem of estimating R̄_n asymptotically reduces to that of estimating the function F(x) of (16) for large x. To that purpose, we appeal to Mellin transform techniques, whose introduction in the context of analysis of algorithms is due to De Bruijn (see [4, pp. 131 et seq.]). The Mellin transform of a function f(x) defined for x > 0, x real, is by definition the complex function f*(s) given by

    f*(s) = M[f(x); s] = ∫₀^∞ f(x) x^{s−1} dx.        (17)

We succinctly recall the salient properties of the Mellin transform, referring the reader to [1] for precise statements. The Mellin transform of a function f is defined in a strip of the complex plane that is determined by the asymptotic behaviours of f at 0 and ∞. It satisfies the important functional property

    M[f(ax); s] = a^{−s} f*(s).        (18)

Finally, there is a complex inversion formula,

    f(x) = (1/2iπ) ∫_{c−i∞}^{c+i∞} f*(s) x^{−s} ds,        (19)

where c is chosen in the strip where the integral in (17) is absolutely convergent. The interest of the inversion formula is that, in many cases, it can be evaluated by means of the residue theorem, each residue corresponding to a term in the asymptotic expansion of f.

LEMMA 2. The Mellin transform of F(x) is, for −1 < Re(s) < 0,

    F*(s) = Γ(s) N(s) 2^s / (1 − 2^s),   where   N(s) = Σ_{j≥1} (−1)^{ν(j)} j^{−s}.

Proof. Let ψ₁(x) = Ψ(x) − 1. The transform of ψ₁ is, for Re(s) > 0,

    ψ₁*(s) = N(s) Γ(s),        (20)

as follows from the basic functional property (18) and the fact that the transform of exp(−x) is the Gamma function Γ(s). Similarly, for w(x) = Ψ(x) − Ψ(x/2) and Re(s) > 0, we get

    w*(s) = M[Ψ(x) − Ψ(x/2); s] = ψ₁*(s) (1 − 2^s).        (21)

Since Ψ(x) − Ψ(x/2) is exponentially small both at 0 and ∞, the transform w* is actually analytic for all complex s; since

    w*(s) / (1 − 2^s) = Γ(s) N(s),        (22)

we find that N(s) can be continued analytically for all s, except possibly at the points s_k = 2ikπ/log 2 where the denominator of (22) vanishes. However, the direct calculations of Lemma 3 below show that N(s) is analytic for Re(s) > −1, so that no singularity arises there. Now, using again the basic functional property,

    F*(s) = ψ₁*(s) Σ_{k≥1} 2^{ks} = ψ₁*(s) 2^s / (1 − 2^s),        (23)

where (23) is valid for Re(s) < 0. Putting together (20), (21), (22), (23) establishes the claim of the lemma. ∎

We now need to establish some more constructive properties of N(s) for Re(s) < 0, establishing in passing the analytic continuation property of N(s) used in the proof of Lemma 2.

LEMMA 3. The function N(s) is analytic in Re(s) > −1; it satisfies N(0) = −1 and, for s = σ + it with σ > −0.99,

    N(s) = O(1 + |s|²).

Proof. Terms in the definition of N(s) may be grouped 4 by 4: using the properties ν(4j) = ν(j), ν(4j+1) = ν(4j+2) = ν(j) + 1, ν(4j+3) = ν(j) + 2, we find

    N(s) = −1 − 2^{−s} + 3^{−s} + Σ_{j≥1} (−1)^{ν(j)} (4j)^{−s} [ 1 − (1 + 1/4j)^{−s} − (1 + 2/4j)^{−s} + (1 + 3/4j)^{−s} ].        (24)

The bracketed factor in the general term of the above sum is O(|s|² j^{−2}) as j gets large, so that the general term is O(|s|² j^{−σ−2}). This confirms that N(s) is defined and analytic when σ > −1. To obtain the bounds on N(s), we split the sum in (24): the terms with j ≤ |s| each contribute O((4j)^{−σ}) and together O(1 + |s|²); the remaining terms contribute O(|s|²) by the preceding estimate, uniformly in s when σ > −0.99, say. Finally, substituting s = 0 in (24) gives N(0) = −1. ∎

We can now come back to the asymptotic study of F(x), and hence of R̄_n, using the inversion formula (19).

THEOREM 3.A.
The average value of parameter R_n satisfies

    R̄_n = log₂(φn) + P(log₂ n) + o(1),

where φ = 0.77351... is a constant whose closed form comes out of the residue computation below (it involves e^γ, with γ Euler's constant, and the value N′(0)), and P(u) is a periodic and continuous function of u with period 1 and amplitude bounded by 10^{−5}.

Proof. By Lemma 1, the problem reduces to obtaining an asymptotic expansion of F(x) as x → ∞, up to o(1) terms. The principle consists in evaluating the complex integral of the form (19) by residues. From the inversion theorem for Mellin transforms, we have

    F(x) = (1/2iπ) ∫_{−1/2−i∞}^{−1/2+i∞} F*(s) x^{−s} ds.        (25)

We consider, for k a positive integer, the rectangular contour Γ_k defined by its corner points (and traversed in that order)

    −1/2 − i(2k+1)π/log 2;   −1/2 + i(2k+1)π/log 2;   1 + i(2k+1)π/log 2;   1 − i(2k+1)π/log 2.

By Cauchy's residue theorem, we have

    (1/2iπ) ∮_{Γ_k} F*(s) x^{−s} ds = − Σ Res( F*(s) x^{−s} ),

the sum being extended to the poles inside Γ_k. For fixed x, as k gets large, the integral along the segment [−1/2 − i(2k+1)π/log 2; −1/2 + i(2k+1)π/log 2] tends to F(x) by (25). From Lemma 3 and the exponential decrease of Γ(s) towards ±i∞, the integrals along the two horizontal segments tend to zero. As to the integral along [1 − i(2k+1)π/log 2; 1 + i(2k+1)π/log 2], it stays bounded in absolute value by

    (k₀/x) ∫_{−∞}^{+∞} | F*(1 + it) | dt

for some constant k₀. (Again, the exponential decrease of Γ(s) guarantees the convergence of the above integral.) We have thus found that, letting k → ∞,

    F(x) = − Σ Res( F*(s) x^{−s} ) + O(1/x),        (26)

the sum of residues being absolutely convergent because of the decrease of Γ(s) towards ±i∞. It only remains to evaluate the residues in (26). F*(s) x^{−s} has a double pole at s = 0 and simple poles at each s_k = 2ikπ/log 2, with k an integer different from 0, and we find easily

    − Res( F*(s) x^{−s}; 0 ) = log₂ x + log₂ φ,

which we may rewrite as log₂(φx), and

    − Res( F*(s) x^{−s}; s_k ) = −(1/log 2) Γ(s_k) N(s_k) x^{−s_k},

which is of the form p_k e^{2ikπ log₂ x}. Thus, summing the residues and using (26), we find the announced asymptotic form for F(x) (and hence for R̄_n), with P(u) given by

    P(u) = Σ_{k≠0} p_k e^{2ikπu}.

The gory details of the bound on the amplitude of P(u) are left for the Appendix. ∎

We can evaluate the standard deviation of R_n in a similar fashion. Let S_n be the second moment of R_n: S_n = E(R_n²). As before, S_n is approximated by the function G(n), where

    G(x) = Σ_{k≥1} (2k − 1) Ψ(x/2^k),

whose transform is found to be, for Re(s) < 0,

    G*(s) = Γ(s) N(s) 2^s (1 + 2^s) / (1 − 2^s)²,

which now has a triple pole at s = 0. Computing G(x) is done from G*(s) via the inversion theorem, followed by residue calculations, and one finds:

THEOREM 3.B. The standard deviation of R_n satisfies

    σ²(R_n) = σ∞² + Q(log₂ n) + o(1),

where σ∞ = 1.12127... and Q(u) is a periodic function with mean value 0 and period 1. We can mention in passing a "closed form" expression for σ∞²: it combines the terms π²/(6 log² 2) and 1/12 with the values N′(0) and N″(0) and with the sum Σ_{k≠0} |p_k|², the p_k being the Fourier coefficients of P(u) defined above.

3. PROBABILISTIC COUNTING ALGORITHMS

We have seen in the previous section that the result R of the COUNT procedure has an average close to log₂(φn), with a standard deviation close to 1.12. Actually, the values of

    A(n) = (1/φ) 2^{R̄_n}

are amazingly close to n, as the following instances show:

    A(10) = 10.502;   A(100) = 100.4997;   A(1000) = 1000.502.

This observation justifies the hope of obtaining very good estimates of n from the observation of parameter R, using the correction factor φ. However, the dispersion of results corresponds to a typical error of one binary order of magnitude, which is certainly too high for many applications.
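The function Ψ and the sum F(x) = Σ_{k≥1} Ψ(x/2^k), which approximates E(R_n), are easy to evaluate; the sketch below (our code, with truncation depths chosen for double precision) reproduces both E(R_n) ≈ log₂(φn) and the near-identity A(n) ≈ n noted above:

```python
import math

PHI = 0.77351  # the correction factor of Section 2

def psi(x, depth=64):
    """Psi(x) = product over j >= 0 of (1 - exp(-2^j x))."""
    p = 1.0
    for j in range(depth):
        t = 2.0 ** j * x
        if t > 50:   # remaining factors equal 1.0 to double precision
            break
        p *= 1.0 - math.exp(-t)
    return p

def F(x, kmax=80):
    """Asymptotic approximation to the expectation of R_n at n = x."""
    return sum(psi(x / 2.0 ** k) for k in range(1, kmax))

def A(n):
    """Estimate reconstructed from the mean: (1/phi) * 2^F(n)."""
    return 2.0 ** F(n) / PHI
```

Numerically, F(100) ≈ 6.273, which is log₂(0.77351 × 100) up to the tiny periodic fluctuation, so that A(n) stays within a fraction of a percent of n.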
The simplest idea to remedy this situation consists in using a set H of m hashing functions, where m is a design parameter, and computing m different BITMAP vectors. In this way, we obtain m estimates R^(1), R^(2), ..., R^(m). One then considers the average

    A = ( R^(1) + R^(2) + ... + R^(m) ) / m.        (27)

When n distinct elements are present in the file, the random variable A has expectation and standard deviation that satisfy

    E(A) ≈ log₂(φn);   σ(A) ≈ σ∞ / √m.

Thus we may expect (1/φ) 2^A to provide an estimate of n with a typical error (measured by the standard deviation of the estimates) of relative value proportional to 1/√m. Such an algorithm, using direct averaging, has indeed provably good performances (with an expected relative error of about 10% if m = 64), but it has the disadvantage of requiring the calculation of a number of hashing functions, so that the CPU cost per element scanned gets essentially multiplied by a factor of m.

It turns out that an effect very similar to straight averaging may be achieved by a device that we call stochastic averaging. The idea consists in using the hashing function in order to distribute each record into one of m lots, computing α = h(x) mod m. We update only the corresponding BITMAP vector of address α, with the rest of the information contained in h(x), namely h(x) div m = ⌊h(x)/m⌋. At the end, we determine as before the R^(j)'s and compute their average A by (27). Hoping for the distribution of records into lots to be even enough, we may thus expect that about n/m elements fall into each lot, so that (1/φ) 2^A should be a reasonable approximation of n/m.

The corresponding algorithm is called Probabilistic Counting with Stochastic Averaging, or PCSA for short. It is described in Fig. 1. We claim that its cost per element scanned is hardly distinguishable from that of the COUNT procedure, and that its relative accuracy improves with m roughly as 0.78/√m. In the sequel, we shall call standard error the quotient of the standard deviation of an estimate of n by the value of n; this quantity is thus a precise indication of the expected relative accuracy of an algorithm estimating n. Neglecting periodic fluctuations of extremely small amplitude (less than 10^{−5}), we shall call the bias of an algorithm the ratio between the estimate of n and the exact value of n, for large n.

    const nmap = 64;        {with nmap = 64, accuracy is typically 10%}
                            {nmap corresponds to variable m in the analysis}
          phi = 0.77351;    {the magic constant}
          maxlength = 32;   {with maxlength = 32, one can count up to 10^8}
    var BITMAP: array [0 .. nmap−1, 0 .. maxlength−1] of integer;
    function getelement(var x: records);   {reads an element x of type records from the file}
    function hash(x: records): integer;    {hashes a record x into an integer over the scalar range}
    function rho(y: integer): integer;     {returns the position of the first 1-bit of y; ranks start at 0}
    begin
      for i := 0 to nmap−1 do for j := 0 to maxlength−1 do BITMAP[i, j] := 0;
      while not eof(file) do
        begin
          getelement(x); hashedx := hash(x);
          alpha := hashedx mod nmap;
          index := rho(hashedx div nmap);
          if BITMAP[alpha, index] = 0 then BITMAP[alpha, index] := 1
        end;
      S := 0;
      for i := 0 to nmap−1 do
        begin
          R := 0;
          while (BITMAP[i, R] = 1) and (R < maxlength) do R := R + 1;
          S := S + R
        end;
      n := trunc(nmap/phi * 2**(S/nmap))   {result of the PCSA programme that estimates n}
    end.

    FIG. 1. Probabilistic counting with stochastic averaging (PCSA).

Standard error and bias of algorithm PCSA, for various values of the design parameter m, are displayed in Table II.

    TABLE II
    Bias and Standard Error of PCSA for Several Values of m, the Number of BITMAP Vectors Used

    m       Bias      % Standard error
    2       1.162     —
    4       1.078     40.9
    8       1.0386    28.2
    16      1.019     19.6
    32      1.0097    13.8
    64      1.0049    9.7
    128     1.0023    6.8
    256     1.0011    4.8
    512     1.0005    3.4
    1024    1.0003    2.4

In the remainder of this section, we are going to justify these claims rigorously, and in particular show how the estimates of Table II are deduced.
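A runnable transcription of the PCSA programme of Fig. 1 (Python stands in for the paper's Pascal-like pseudocode, and SHA-256 stands in for multiplicative hashing, purely as a convenient pseudo-uniform hash):

```python
import hashlib

PHI = 0.77351
L = 32          # bits kept per hashed value ("maxlength")

def h(record, L=L):
    """Pseudo-uniform L-bit hash of a record (SHA-256 based; an arbitrary choice)."""
    d = hashlib.sha256(str(record).encode()).digest()
    return int.from_bytes(d[:4], 'big') & ((1 << L) - 1)

def rho(y, L=L):
    """Position of the least significant 1-bit; L if y = 0."""
    return (y & -y).bit_length() - 1 if y else L

def pcsa(records, nmap=64):
    """Estimate the number of distinct records in a single pass."""
    bitmaps = [0] * nmap            # one integer per lot; bit i plays BITMAP[alpha, i]
    for x in records:
        hx = h(x)
        alpha = hx % nmap           # lot selected by the low-order part of the hash
        bitmaps[alpha] |= 1 << rho(hx // nmap)
    S = 0
    for b in bitmaps:
        R = 0
        while (b >> R) & 1:         # R = position of the leftmost zero of this bitmap
            R += 1
        S += R
    return int(nmap / PHI * 2 ** (S / nmap))
```

With nmap = 64 the standard error is about 0.78/√64 ≈ 10%, and the result is unchanged when records repeat, since the bitmaps only accumulate OR-bits.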
We let Ξ denote the random variable computed by PCSA with m BITMAPs, and let Ξ_n denote this random variable when n distinct elements are present in the file; we denote by E[Ξ_n] the average value of Ξ_n and by σ(Ξ_n) its standard deviation. We propose to establish:

THEOREM 4. The estimate of algorithm PCSA has an average value satisfying²

    E[Ξ_n] = n (1 + ε(m)) (1 + P_m(log₂ n)) + o(n),

and a second moment satisfying

    E[Ξ_n²] = n² (1 + η̃(m)) (1 + Q_m(log₂ n)) + o(n²),

where the bias factors 1 + ε(m) and 1 + η̃(m) admit closed forms built out of Γ(−1/m) and Γ(−2/m), the values N(−1/m) and N(−2/m) of the Dirichlet series of Section 2, and the quantities (1 − 2^{−1/m})/log 2 and (1 − 2^{−2/m})/log 2, each raised to the power m. In the above expressions, P_m and Q_m represent periodic functions of period 1, mean value 0, and amplitude bounded by 10^{−5}.

THEOREM 5. Using the notation u(n) ≈ v(n) to express the property

    ∃ n₀ ∀ n > n₀ : | u(n)/v(n) − 1 | < 10^{−5},

one has the following characterisations of the bias and standard error of algorithm PCSA:

    E[Ξ_n]/n ≈ 1 + ε(m);   σ(Ξ_n)/n ≈ η(m),

where the quantities ε(m) and η(m) satisfy, as m gets large, ε(m) ~ 0.31/m and η(m) ~ 0.78/√m.

² The error terms in Theorem 4 and the ≈ in Theorem 5 are not uniform in m.

The Analysis of Algorithm PCSA

We now proceed with the proof of Theorem 4. We start with an estimate of E[2^{R_n/m}], the basic ingredient, since Ξ_n is proportional to a product of such quantities.

First, the contribution of the tail of the distribution of R_n is negligible: consider Pr(R_n > k) for k = (3/2) log₂ n + δ, with δ > 0. When R_n > k, positions (k−1) and (k−2) of BITMAP are set to 1, an event that has probability O(n²/4^k) = O(4^{−δ}/n). Thus the tail contribution satisfies

    Σ_{k > (3/2) log₂ n} 2^{k/m} Pr(R_n > k) = O( n^{3/(2m) − 1} ),        (28)

and the same bound applies if the exact probabilities are replaced by their asymptotic equivalents in this sum. We next consider the error that comes from replacing the probabilities Pr(R_n = k) by their asymptotic equivalents (14) for "small" k. From the bounds of Theorem 2, one finds

    Σ_{k ≤ (3/2) log₂ n} 2^{k/m} | Pr(R_n = k) − [Ψ(n/2^k) − Ψ(n/2^{k+1})] | = O( n^{3/(2m)} log² n / √n ),

a quantity which is o(n^{1/m}). A Mellin analysis entirely similar to that of Section 2, applied to the sum Σ_k 2^{k/m} [Ψ(x/2^k) − Ψ(x/2^{k+1})] (whose transform has a simple pole at s = −1/m), then yields

    E[2^{R_n/m}] = C_m n^{1/m} (1 + P̂_m(log₂ n)) + o(n^{1/m}),        (29)

where the constant C_m involves Γ(−1/m), N(−1/m) and (1 − 2^{−1/m})/log 2, and P̂_m is a periodic function of small amplitude.

LEMMA 5. Let N_i be the number of distinct elements that fall into lot i. Then N_i obeys a binomial distribution with parameters n and p = 1/m, and, with δ = √n log n,

    Pr( | N_i − n/m | > δ ) = O( e^{−h log² n} )

for some constant h > 0.

Proof. Set p = 1/m, q = 1 − 1/m. The standard exponential bounds for the binomial distribution give Pr(N_i = pn + δ) = O( exp( −δ²/(2pqn) + O(δ³/n²) ) ), which is exponentially small for δ = √n log n. We conclude the proof by observing that the binomial distribution is unimodal, so that the whole tail is dominated by n times its largest term. ∎

We can now conclude the proof of the first part of Theorem 4. Let Σ denote the sum R^(1) + R^(2) + ... + R^(m). Conditioning on the sizes of the lots,

    E[2^{Σ/m}] = Σ_{n₁+...+n_m = n} Pr( N₁ = n₁, ..., N_m = n_m ) Π_{j=1}^m E[2^{R_{n_j}/m}].        (32)

Call E the quantity (32), and E_c the sum of the terms in (32) such that, for all j, 1 ≤ j ≤ m, |n_j − n/m| ≤ √n log n. From Lemma 5 and the estimate (29), E − E_c is exponentially small. As to the central contribution E_c, it is bounded from below and from above by the quantities (E[2^{R_ν/m}])^m taken at ν = n/m ∓ √n log n, so that finally

    E[2^{Σ/m}] = ( E[2^{R_{n/m}/m}] )^m (1 + o(1)).        (34)

Equation (34), combined with (29), is sufficient to establish the estimate of E[Ξ_n] in Theorem 4, provided we check that the amplitudes of the periodic fluctuations do not grow with m, a fact that can be proved using the methods described in the Appendix. Estimates on the second moment of Ξ_n are derived in exactly the same way, through the equality

    E[2^{2Σ/m}] = ( E[2^{2R_{n/m}/m}] )^m (1 + o(1)).        (35)

Dependence of Results on the Number of BITMAPs

We finally conclude with an indication of the (easy) proof of Theorem 5. From Theorem 4, all we need is to determine the asymptotic behaviour, as m gets large, of the quantities ε(m) and

    η(m) = ( (1 + η̃(m)) − (1 + ε(m))² )^{1/2},        (38)

since we neglect the effect of the small periodic fluctuations. This is achieved by performing standard (but tedious) asymptotic expansions of the closed forms of Theorem 4 for large m. (This task has been carried out with the help of the MACSYMA system for symbolic computations.) We find that the bias and standard error are, for all values of m, closely approximated by the formulae

    bias: 1 + 0.31/m;        (39)
    standard error: 0.78/√m.        (40)

4. IMPLEMENTATION ISSUES

There are three factors to be taken into account when applying algorithm PCSA:

(i) The choice of the hashing function.
(ii) The choice of the length L of the BITMAP vectors.
(iii) The number, nmap, of BITMAPs used (corresponding to quantity m in our analysis).

Also, corrections of two types may be introduced:

(iv) corrections to the systematic bias of Table II;
(v) corrections for initial nonlinearities of the algorithm.

We briefly proceed to discuss these issues here.

1. Hashing functions. Simulations on textual files (see below), ranging in size from a few kilobytes to about 1 megabyte, indicate that standard multiplicative hashing leads to performances that do not depart in any detectable way from those predicted by the uniform model of Sections 2, 3. There, a record x = (x₀, x₁, ..., x_{p−1}) formed of ASCII characters is hashed into

    h(x) = ( M · Σ_{j=0}^{p−1} ord(x_j) 128^j ) mod 2^N,        (41)

with ord(x_j) denoting the (standard ASCII) rank of character x_j, M a multiplier constant, and N the number of bits kept (N = 24 in the experiments of Section 2). This good agreement between theoretically predicted and practically observed performances is in accordance with the empirical studies concerning standard hashing techniques conducted on large industrial files by Lum et al. [5].

2. Length of the BITMAP vectors. Since the probability distribution of the R-parameter has a very steep tail, it suffices to select L in such a way that

    L ≥ log₂(n/nmap) + 4.        (42)

Thus, as already pointed out, with nmap = 64, taking L = 16 makes it possible to safely count cardinalities of files up to n ≈ 10^5, and L = 24 can be used for cardinalities well beyond 10^7. The probabilities of obtaining underestimates because of such truncations (the probabilistic model assumes L to be infinite) can be computed from our previous results; when (42) is satisfied, the error introduced is negligible.

3. Number of BITMAPs. The expected relative accuracy of the algorithm, or standard error, is by Theorems 4, 5 inversely proportional to √m, being closely approximated by 0.78/√nmap. Thus nmap = 64 leads to a standard error of about 10%, and with nmap = 256 this error decreases to about 5% (see Table II).
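The hashing scheme (41) and the bias-corrected final instruction of point 4 below transcribe as follows (our sketch; the multiplier value M below is an arbitrary stand-in, the paper's own constants not being reproduced here):

```python
PHI = 0.77351
M = 31415821      # hypothetical multiplier; the paper's own constant is not specified here
N_BITS = 24       # hash width in bits, as in the experiments of Section 2

def hash_record(x, mult=M, nbits=N_BITS):
    """h(x) = (M * sum over j of ord(x_j) * 128^j) mod 2^nbits, for a string record x."""
    mod = 1 << nbits
    s = 0
    power = 1
    for ch in x:
        s = (s + ord(ch) * power) % mod
        power = (power * 128) % mod
    return (mult * s) % mod

def corrected_estimate(S, nmap=64):
    """Final PCSA estimate including the 1 + 0.31/nmap bias correction."""
    return int(nmap / (PHI * (1 + 0.31 / nmap)) * 2 ** (S / nmap))
```

The correction shrinks the raw estimate by the factor 1 + 0.31/nmap, which matters mainly for small values of nmap.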
4, Bias, ‘The bias of algorithm PCSA as presented in Table II is negligible com- pared to the standard error as soon as nmap exceeds 32. If smaller values of nmap are to be used, it can be corrected using the results of Theorems 4, 5. For a practical use of the algorithm, it suffices to use the estimates of Theorem 5, which one achieves by changing the last instruction of the programme to trunc(amap/(@*(1-+0.31/nmap))*2**(S/nmap)). In so doing, we obtain an algorithm which apart from the small periodic fluc~ twations of amplitude less than 10-* is an asymptotically unbiased estimator of car- dinalitiesn, 5. Initial non-linearities. The asymptotic estimates which form the basis of the algorithm are extremely close to the actual average values as soon as n/amap ‘exceeds 10-20, If very small cardinalities were to be estimated, then based on the waracterisation of probability distributions, corrections could be computed and introduced in the algorithm, (These corrections would be based on calculation of exact average values from our formulae instead of using the asymptotic estimates). Simulations We have conducted fairly extensive simulations of algorithm PCSA applied to textual data, The files called man,, man.,.., many correspond to chapters of the on- line documantation available on one of our systems, and the versions man, w, ‘man,w,.. correspond to the files obtained from the preceding ones by segmentation into 5 character blocks. Standard multiplicative hashing was used as described by Eq, (41). We counted in each case the number of different records and compared with corresponding values estimated by algorithm PCSA (here, a record is a line of text for man,,.. and a $ letter block for man, W,..). Some sample runs are reported in Table II], and they show good agreement between our estimates and a values, The files are mixtures of text in English, names of commands and typesetting commands. 
TABLE III

Sample Executions of Algorithm PCSA on 6 Files with the Same Multiplicative Hashing Function

Note. The table displays, for each of the files man1, man1w, man2, man2w, man3, man3w, the exact cardinality, the estimated cardinality for nmap = 8, 16, 32, 64, 128, 256, and the ratio of estimated cardinality to exact cardinality (in italics).

We have also taken these 16 files and have subjected them to algorithm PCSA, varying the constants M and N in (41). This provides empirical values of the bias and standard error of PCSA (averaging over 10 simulations × 16 files) that again appear to be in amazingly good agreement with the theoretical predictions. Such results are reported in Table IV and should be compared with Table II. (The correction for small values of nmap described above has been inserted into the algorithm PCSA of Fig. 1.)

TABLE IV

Empirical Values of Bias and Standard Error Based on 160 Simulations

    m      Bias      % Standard error
    8      1.0169    31.92
    16     1.0108    19.63
    32     0.9978    12.98
    64     0.9961     9.60
    128    1.0035     6.68
    256    1.0073     4.65

Note. Ten different hashing functions applied to the 16 files man1, …, man8w.

Applications to Distributed Computing

Assume a file F is partitioned into subfiles F_1, F_2, …, F_s, where the F_i and F_j need not be disjoint. Such a situation occurs routinely in the context of distributed data bases. Then the global cardinality of file F may be determined as follows: process separately each of the s subfiles by algorithm PCSA; this gives rise to s BITMAP vectors, BITMAP^(1), …, BITMAP^(s); each of the s processors sends its result to a central processor that computes the logical or of the s BITMAPs.
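The merging step admits a one-line realisation. The sketch below (illustrative Python, with each BITMAP held as an integer bit-mask) shows the component-wise or of the site bitmaps, which is exactly the bitmap that a single PCSA pass over the union of the subfiles would have produced:

```python
def merge_bitmaps(per_site):
    """Component-wise logical or of the BITMAP vectors of the s sites.

    per_site: list of s bitmap vectors, each a list of nmap integers
    (bit r of vector j is set iff some record hashed to cell j with rank r).
    """
    nmap = len(per_site[0])
    merged = [0] * nmap
    for site in per_site:
        for j in range(nmap):
            merged[j] |= site[j]
    return merged
```

Because or is idempotent, records duplicated across subfiles contribute nothing extra, which is why the accuracy is unaffected by the partitioning.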
The resulting BITMAP vector is then used to construct the estimate of n. It is rather remarkable that the accuracy of the estimate is, by construction, not affected at all by the way records are spread amongst subfiles. The number of messages exchanged is small (being O(s)), and the algorithm results in a net speed-up by a factor of s.

Scrolling

The matrix of BITMAP vectors has a rather specific form: it starts with columns of ones, followed by a fringe of columns consisting of mixed zeros and ones, followed by columns of all zeros. This naturally suggests a more compact encoding of the bitmap matrix that may be quite useful for distributed applications, since it then minimises the sizes of messages exchanged by processors. The idea is to indicate the left boundary of the fringe, followed by a standard encoding of the fringe itself. For instance, if the BITMAP matrix is

    1111 | 10100 | 0000
    1111 | 11000 | 0000
    1111 | 01011 | 0000
    1111 | 11010 | 0000

then one only needs to represent the leftmost boundary of the fringe (here 4), and the binary words 10100, 11000, 01011, 11010.

This technique amounts to keeping only a small window of the BITMAP matrix and scrolling it as necessary. For practical purposes, a window of size 8 should suffice, so that the storage requirement of this version of PCSA becomes close to ⌈log_2 n⌉ + nmap bytes.

Deletions

If, instead of keeping only bits to record the occurrences of patterns of the form 0^k·1, one also keeps the counts of such occurrences, one obtains an algorithm that can maintain running estimates of cardinalities of files subjected to arbitrary sequences of insertions and deletions. The price to be paid is, however, a somewhat increased storage cost.

5. CONCLUSION

Probabilistic counting techniques presented here are particular algorithmic solutions to the problem of estimating the cardinality of a multiset.
It is quite clear that other observable regularities on hashed values of records could have been used, in conjunction with direct or stochastic averaging. We mention in passing:

—the rank of the rightmost one in BITMAP: this parameter has a flatter distribution that results in an appreciably less accurate algorithm (in terms of standard error);

—the binary logarithm of the minimal hashed value encountered (hashed values being considered as real numbers in [0; 1]): this provides an approximation to log_2(1/n), but the resulting algorithm appears to be slightly less accurate than PCSA.

The common feature of all such algorithms is to estimate the cardinality of a multiset in real time, using auxiliary storage O(m log_2 n), with a relative accuracy of the form c/√m. It might be of interest to determine whether appreciably better storage/accuracy trade-offs can be achieved (or to prove that this is not possible from an information-theoretic standpoint).

For practical purposes, algorithm PCSA is quite satisfactory. It consumes only a few operations per element scanned (maybe 20 or 30 assembly language instructions), has good accuracy described at length in the previous sections, and may be used to gather statistics on files on the fly (therefore eliminating the additional cost of disk accesses). On a VAX 11/780 running Berkeley UNIX, a non-optimised version in Pascal used for our tests is already typically twice as fast as the standard system sorting routine. A version of the algorithm has been implemented at IBM San Jose in the context of the System R* project.

APPENDIX: THE AMPLITUDE OF PERIODIC FLUCTUATIONS

The purpose of this Appendix is to show how the fluctuations, in the form of Fourier series, that appear in Theorems 3, 4, 5 can be precisely bounded.
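As a numerical illustration of why these Fourier terms are negligible, the modulus of the gamma function on the imaginary axis can be evaluated from the classical identity |Γ(it)|² = π/(t sinh πt); already at the first harmonic t = 2π/log 2 it is below 10⁻⁶:

```python
import math

def gamma_modulus_imag(t):
    """|Gamma(it)| via the identity |Gamma(it)|^2 = pi / (t * sinh(pi*t))."""
    return math.sqrt(math.pi / (t * math.sinh(math.pi * t)))

t1 = 2 * math.pi / math.log(2)    # imaginary part of the first harmonic chi_1
print(gamma_modulus_imag(t1))     # about 5.45e-7
```

Each further harmonic shrinks the value by another factor of roughly e^{−π²/log 2}, which is the decay exploited below.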
Notice that the problem reduces to showing that the Fourier coefficients have sufficiently small values. All these Fourier coefficients are values of functions of the form

    Γ(s) N(s) ω(s),

with ω(s) a "well-behaved" function, taken at the points χ_k = σ + 2ikπ/log 2, k a non-zero integer. The quantity σ depends on the particular problem considered: σ = 0 in Theorem 3, σ = −1/m in Theorems 4, 5. We shall only give a proof in the case of Theorem 3.A, the other proofs being entirely similar. We thus need to find bounds for the Fourier coefficients

    (1/log 2) Γ(χ_k) N(χ_k),    χ_k = 2ikπ/log 2.

The behaviour of the gamma function along the imaginary axis is known:

    |Γ(it)| = (π / (t sinh πt))^{1/2},

so that it decreases very fast when going away from the real axis. For instance, one finds, with χ_k = 2ikπ/log 2:

    |Γ(χ_1)| = 5.45249 · 10^{−7};    |Γ(χ_2)| = 2.52468 · 10^{−13}.

Thus all that is required is effective bounds on |N(χ_k)|. These follow easily by refining the approach taken in the proof of Lemma 3: define, for x and t real, the auxiliary function f(x, t) built from Eq. (24).

LEMMA. For t ≥ 1 and x ≤ 3/(2t), one has

    |f(x, t)| ≤ 16 x² t².

Proof. The proof depends on the following easy observations: for y ≥ 0,

    log(1 + y) ≤ y,

and, for |u| ≤ 1/2,

    |e^u − 1| ≤ 2|u|.

Combining these inequalities with the definition of f, we obtain the stated bound |f(x, t)| ≤ 16 x² t².

The above lemma can be used for two purposes: (1) bounding the values of |N(χ_k)| for large k; (2) bounding the truncation errors when estimating N(χ_k) from the sum of its first few terms.

COROLLARY. For all t ≥ 1, N(it) admits effective bounds which, combined with the decay of the gamma function, show that the fluctuation terms of index |k| ≥ 2 are much smaller and exponentially decreasing, with the basis of the exponential equal to e^{−π²/log 2} ≈ 0.6584 · 10^{−6}.

ACKNOWLEDGMENTS

The first author would like to express his gratitude to IBM France and the IBM San Jose Research Laboratory for an invited visit during which his work on the subject was done for a large part. Thanks are due to M. Schkolnick, Kyu-Young Whang (who implemented the method), and R. Fagin for their support and many stimulating discussions.

Note added in proof. The sequence ν(n) that occurred repeatedly here is the classical Morse–Thue sequence. Using the Dirichlet generating function N(s), Allouche et al. (Automates finis et séries de Dirichlet, J. Inform. Math., Publ. Math. Université de Caen, 1985) have obtained several interesting properties of that sequence, including a proof of a curious identity of Shallit (compare with our Theorem 3A):

    Π_{n≥0} ((2n + 1)/(2n + 2))^{(−1)^{ν(n)}} = 2^{−1/2}.

REFERENCES

1. G. Doetsch, "Handbuch der Laplace-Transformation," Birkhäuser, Basel, 1950.
2. P. Flajolet, Approximate counting: A detailed analysis, BIT 25 (1985), 113–134.
3. P. Flajolet and G. N. Martin, Probabilistic counting, in "Proc. 24th IEEE Sympos. Foundations of Computer Science, Nov. 1983," pp. 76–82.
4. D. E. Knuth, "The Art of Computer Programming: Sorting and Searching," Addison–Wesley, Reading, Mass., 1973.
5. V. Y. Lum, P. S. T. Yuen, and M. Dodd, Key to address transformations: A fundamental study based on large existing formatted files, Comm. ACM 14 (1971), 228–238.
6. R. Morris, Counting large numbers of events in small registers, Comm. ACM 21 (1978), 840–842.
7. I. Munro and P. Spira, Sorting and searching in multisets, SIAM J. Comput. 5, No. 1 (1976), 1–8.
8. P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, "Access Path Selection in a Relational Database Management System," Report RJ-2429, IBM San Jose Res. Lab., Aug. 1979.
