Performance Comparison of Extendible Hashing and Linear Hashing Techniques
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1990 ACM 089791-347-7/90/0003/0178 $1.50

The linear hashing scheme, referred to as LINHASH hereafter, is a directory-less scheme which allows a smooth growth of the hash table [Ram82]. The following example is due to Larson [Lar88]. Consider a hash table consisting of N buckets with addresses 0, 1, ..., N-1. LINHASH splits the buckets in a predetermined order: first bucket 0, then bucket 1, and so on, up to and including bucket N-1. In figure 1(a), the table size N is 3 and the next bucket to be split is bucket 0. Pointer p always indicates the bucket to be split next. Figure 1(b) shows the status after bucket 0 has been split. Notice that pointer p has moved to bucket 1. Next, bucket 1 is split into bucket 1 and bucket 4. The current expansion is considered complete when the last bucket of the table is split. After the split, pointer p is reset to bucket 0. In our example, the expansion will be complete when bucket 3 is split, as shown in figure 1(d).

The extendible hashing technique, referred to as EXHASH hereafter, was developed by Fagin et al. [Fag79]. This scheme uses the leading (or trailing) bits, denoted by d, of the key to index into the directory. Global depth, d, and local depth, d', denote the depth of the directory and the depth of a bucket, respectively. There are 2**d directory entries. Often more than one directory entry points to the same bucket. Figure 2 expands this discussion. Upon expansion of the table, the local depth of the 2 buckets involved is increased by 1. If d' of any bucket is greater than d, then the directory size is doubled (shown in figure 3), and the global depth is increased by 1.
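The directory mechanics of EXHASH (2**d directory entries, local depth d' raised on a split, directory doubled when d' would exceed d) can be sketched roughly in Python as follows. This is an illustrative sketch, not the simulation code used in this paper: the class names `Bucket` and `ExtendibleHash` are hypothetical, keys are assumed to be non-negative integers, trailing bits are used for indexing because that makes doubling a plain list copy, and overflow buckets are not modeled (a full bucket is simply split again until the key fits).

```python
class Bucket:
    """A bucket with a fixed capacity and its own local depth d'."""

    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.keys = []


class ExtendibleHash:
    """Directory of 2**d references to buckets (d = global depth)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.global_depth = 0
        self.directory = [Bucket(0, capacity)]

    def _index(self, key):
        # Assumption: index with the d trailing bits of an integer key.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.keys:
                return
            if len(bucket.keys) < bucket.capacity:
                bucket.keys.append(key)
                return
            self._split(bucket)  # retry after the split

    def _split(self, old):
        if old.local_depth == self.global_depth:
            # d' would exceed d: double the directory, increase d by 1.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        old.local_depth += 1
        new = Bucket(old.local_depth, self.capacity)
        bit = 1 << (old.local_depth - 1)
        # Entries that pointed to the old bucket and have the new
        # distinguishing bit set are redirected to the new bucket.
        for i, b in enumerate(self.directory):
            if b is old and i & bit:
                self.directory[i] = new
        keys, old.keys = old.keys, []
        for k in keys:
            (new if k & bit else old).keys.append(k)
```

Because several directory entries may reference the same `Bucket` object, a split that does not raise the global depth only rewires half of those entries, which is the case where the directory size stays unchanged.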
Figure 3: Hash Table Doubled After Splitting Bucket 2 of Figure 2(b)
Figure 1: Expansion Process in Linear Hashing
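The LINHASH expansion process, with its split pointer p moving through the buckets in address order, can be sketched in Python along the following lines. This is only a sketch under stated assumptions: the class name `LinearHash` and its attributes are hypothetical, keys are non-negative integers hashed by simple division, a split is triggered as soon as any bucket overflows, and each Python list stands for a primary bucket together with its overflow chain.

```python
class LinearHash:
    """Directory-less table that splits buckets in address order."""

    def __init__(self, n_initial=3, capacity=2):
        self.n0 = n_initial      # N, the initial number of buckets
        self.level = 0           # number of completed expansions
        self.p = 0               # pointer to the next bucket to split
        self.capacity = capacity
        self.buckets = [[] for _ in range(n_initial)]

    def _address(self, key):
        size = self.n0 * (1 << self.level)
        addr = key % size
        if addr < self.p:
            # Bucket already split in this expansion: use the next
            # hash function, which covers twice as many addresses.
            addr = key % (2 * size)
        return addr

    def search(self, key):
        return key in self.buckets[self._address(key)]

    def insert(self, key):
        bucket = self.buckets[self._address(key)]
        bucket.append(key)  # records beyond capacity model the overflow chain
        if len(bucket) > self.capacity:
            self._split_next()

    def _split_next(self):
        # The bucket at p is split, which is usually not the bucket
        # that overflowed.
        size = self.n0 * (1 << self.level)
        old_keys = self.buckets[self.p]
        self.buckets[self.p] = []
        self.buckets.append([])  # new bucket gets address p + size
        for k in old_keys:
            self.buckets[k % (2 * size)].append(k)
        self.p += 1
        if self.p == size:
            # Last bucket of the table was split: expansion complete.
            self.level += 1
            self.p = 0
```

With N = 3, the first split sends the records of bucket 0 to buckets 0 and 3 and moves p to bucket 1, and the next split divides bucket 1 into buckets 1 and 4, as in the example above.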
III. SIMULATION PREPARATION

The performance comparison factors for simulation are based on the following assumptions.

(1) The keys are distributed uniformly, so each key has equal access probability.
(2) Records are of fixed length.
(3) The bucket capacity is fixed in terms of the number of records that it can hold.
(4) Expansion takes place as soon as a bucket overflows.
(5) Enough main memory is available to handle the expansion.
(6) EXHASH: (a) The most significant bits are extracted from the key to find the directory entry. (b) The overflow bucket is split at most once. In other words, a second split is not attempted even though the first split may fail to release the overflow bucket. (c) Main memory can hold a maximum of 1024 directory entries. The rest of the directory must reside on the secondary storage.
(7) LINHASH: A simple division method with modulo arithmetic is used to find the relevant bucket.

According to assumption (1), we use a random function that broadly satisfies the properties of a minimal random function. Given a minimal random function "f(z) = az mod m", the value of "a" should pass the three tests as defined in [Par88] such that f(z) should (i) be a full period generating function; (ii) be random for all the sequences generated; and (iii) be implemented efficiently with 32-bit arithmetic. Further, the hash functions used in the simulation also satisfy the basic properties listed by Carter and Knuth in [Car79] [Knu73].

2. Comparison Factors

B : The number of buckets in the current hash table
b : Bucket capacity
bs : The number of buckets accessed for successful search, + 1 if the directory entry is not in main memory (only in EXHASH)
bu : The number of buckets accessed for unsuccessful search, + 1 if the directory entry is not found in main memory (only in EXHASH)
s : Number of successful searches
u : Number of unsuccessful searches
* : Arithmetic multiplication symbol
/ : Arithmetic division symbol

The following factors have been considered to analyze the relative performance of LINHASH and EXHASH:

(1) Storage utilization: N / (B*b)
(2) Average unsuccessful search cost: bu / u
(3) Average successful search cost: bs / s
(4) Split cost (expansion cost): In LINHASH, a split bucket is usually different from the bucket where insertion took place. Hence additional accesses are needed to read the split bucket chain.

    LINHASH: 1 access to read the primary bucket
             + k accesses to read k overflow buckets
             + 1 access to write old bucket
             + extra accesses to write the overflow buckets attached to old and new buckets
    EXHASH:  1 access to write old bucket
             + 1 access to write new bucket
             + extra accesses to write the overflow buckets attached to old and new buckets
             + accesses needed to update the directory pointers if the directory resides on the secondary storage

(5) Insertion cost: Unsuccessful search cost + Split cost
(6) Number of overflow buckets

The above factors have been simulated with bucket sizes of 10, 20, and 50 for both EXHASH and LINHASH. In order to observe their average behavior, the simulation uses 50,000 randomly generated keys.

For all bucket sizes, EXHASH produces consistently better storage utilization than LINHASH. LINHASH gives cyclic storage utilization since the buckets are split linearly regardless of their load. In both EXHASH and LINHASH, the storage utilization fluctuates more as the bucket size rises (see figures 4, 5, 6). EXHASH has an advantage of approximately 5% over LINHASH in storage utilization. This difference is wholly attributable to the way the buckets are split under the two schemes. The corollary is that LINHASH requires more buckets than EXHASH to hold the same number of records.
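The comparison factors can be written out as small Python helpers, which may make the accounting easier to follow. The function and parameter names below are hypothetical and are not taken from the simulation itself. N in the storage utilization formula is assumed here to mean the number of records stored, and the split costs count exactly the accesses listed in this section, nothing more.

```python
def storage_utilization(n_records, n_buckets, bucket_capacity):
    """N / (B*b), assuming N is the number of records stored."""
    return n_records / (n_buckets * bucket_capacity)


def average_search_cost(buckets_accessed, n_searches):
    """bs / s for successful searches, bu / u for unsuccessful ones."""
    return buckets_accessed / n_searches


def linhash_split_cost(k_overflow_reads, extra_overflow_writes=0):
    """Read primary bucket + read k overflow buckets + write old bucket
    + extra writes for overflow buckets of the old and new buckets."""
    return 1 + k_overflow_reads + 1 + extra_overflow_writes


def exhash_split_cost(extra_overflow_writes=0, directory_updates=0):
    """Write old bucket + write new bucket + extra overflow writes
    + directory updates when the directory is on secondary storage."""
    return 1 + 1 + extra_overflow_writes + directory_updates


def insertion_cost(unsuccessful_search_cost, split_cost):
    """Insertion cost = unsuccessful search cost + split cost."""
    return unsuccessful_search_cost + split_cost
```

For instance, 50,000 records held in 6,250 buckets of capacity 10 would correspond to a storage utilization of 0.8; these numbers are made up for illustration and are not results of the simulation.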
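A generator of the form f(z) = az mod m, as described under the simulation assumptions, can be sketched in Python as below. The constants a = 16807 and m = 2**31 - 1 are an assumption: they are the widely used "minimal standard" values and are not stated in this paper. The function names are hypothetical. The decomposition into q and r (Schrage's method) keeps every intermediate product within 32-bit range, which is one way to satisfy the 32-bit arithmetic requirement; Python integers do not overflow, so it is shown here only to illustrate the technique.

```python
A = 16807          # assumed multiplier a
M = 2**31 - 1      # assumed modulus m
Q = M // A         # 127773
R = M % A          # 2836


def next_z(z):
    """Return (A * z) mod M without forming a product larger than M."""
    hi, lo = divmod(z, Q)
    t = A * lo - R * hi
    return t if t > 0 else t + M


def generate_keys(seed, count):
    """Produce `count` pseudo-random keys starting from a seed in 1..M-1."""
    keys = []
    z = seed
    for _ in range(count):
        z = next_z(z)
        keys.append(z)
    return keys
```

A key set comparable in size to the one used in the simulation could then be drawn with `generate_keys(1, 50000)`, and a bucket address obtained by the division method as `key % number_of_buckets`.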
Figure 15: Split Cost vs. Number of Records (legend: EXTEND solid, LINEAR dashed; plot data not recoverable)
Figure 18: Insertion Cost vs. Number of Records
Figure 21: Overflow Buckets vs. Number of Records
LINHASH performs better for all the bucket sizes, and an unsuccessful search becomes less costly as the bucket size rises. On average, the cost of an unsuccessful search stays close to 1 for all the bucket sizes in LINHASH (see figures 7, 8, 9). Similar observations hold true for the cost of a successful search (see figures 10, 11, 12). The successful search and the unsuccessful search are equally costly in EXHASH. This is because overflow buckets are almost non-existent in EXHASH, whereas they are mandatory in LINHASH. In EXHASH, the search cost can be kept to 1 regardless of the bucket size when the entire directory can be kept in main memory.

The splitting of a bucket is costlier in LINHASH. This is because an extra read access is needed to read the bucket to be split (see figures 13, 14, 15). The insertion cost is slightly higher in EXHASH for the bucket sizes 10 and 20. However, for the bucket size 50, this cost is slightly less in EXHASH (see figures 16, 17, 18).

As expected, LINHASH performed poorly with respect to the number of overflow buckets. The number of overflow buckets decreases as the bucket size increases. The simulation shows that a maximum of 10% of the total space should be marked as an overflow area in LINHASH. Overflow buckets are almost non-existent in EXHASH (see figures 19, 20, 21).

[Bra86] Bradley, J. "Use of Mean Distance between Overflow Records to Compute Average Search Lengths in Hash Files with Open Addressing." Computer J., 29, 2 (1986), pp. 167-170.

[Chu89] Chun, S.H., Hedrick, G.E., Lu, H., Fisher, D.D. "A Partitioning Method for Grid File Directories." Will appear in Proc. of the IEEE Computer Society's 13th Annual International Computer Software and Applications Conference, Sept. 18-22, 1989, Orlando.

[Ell82] Ellis, C.S. "Extendible Hashing for Concurrent Operations and Distributed Data." ACM SIGMOD, 1982.

[Fag79] Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H.R. "Extendible Hashing - A Fast Access Method for Dynamic Files." ACM Transactions on Database Systems, 4, 3 (Sept. 1979), pp. 315-344.

[Fre87] Freeston, M. "The BANG file." Proc. of ACM SIGMOD Conf., 16, 3 (Dec. 1987), pp. 260-269.

[Knu73] Knuth, D. The Art of Computer Programming, vol. III: Sorting and Searching. Reading, MA: Addison-Wesley, 1973.

[Lar78] Larson, P. "Dynamic Hashing." BIT, 18 (1978), pp. 184-201.

[Lar82] Larson, P. "Performance Analysis of Linear Hashing with Partial Expansions." ACM Transactions on Database Systems, 7, 4 (1982), pp. 566-587.

[Nie84] Nievergelt, J., Hinterberger, H., and Sevcik, K.C. "The Grid File: An Adaptable, Symmetric Multikey File Structure." ACM Transactions on Database Systems, 9, 1 (1984), pp. 38-71.
Communications of the ACM, 31, 10 (October 1988), pp.
1192-1201.