Performance Comparison of Extendible Hashing and Linear Hashing Techniques


PERFORMANCE COMPARISON OF EXTENDIBLE

HASHING AND LINEAR HASHING TECHNIQUES

Ashok Rathi, Huizhu Lu, G.E. Hedrick

Department of Computing and Information Sciences


Oklahoma State University
Stillwater, Oklahoma 74078
Phone: (405) 744-5668

ABSTRACT

Based on seven assumptions, the following comparison factors are used to compare the performance of linear hashing with extendible hashing: 1. storage utilization; 2. average unsuccessful search cost; 3. average successful search cost; 4. split cost; 5. insertion cost; 6. number of overflow buckets. The simulation is conducted with bucket sizes of 10, 20, and 50 for both hashing techniques. In order to observe their average behavior, the simulation uses 50,000 keys which have been generated randomly.

According to our simulation results, extendible hashing has an advantage of 5% over linear hashing in terms of storage utilization. Successful search, unsuccessful search, and insertions are less costly in linear hashing. However, linear hashing requires a large overflow space to handle the overflow records. Simulation shows that approximately 10% of the space should be marked as overflow space in linear hashing.

Directory size is a serious bottleneck in extendible hashing. Based on the simulation results, the authors recommend linear hashing when main memory is at a premium.

I. INTRODUCTION

A number of file structures and access methods, e.g. B+ tree [Knu73], inverted file [Knu73], heap [Mar77], grid file [Nie84] [Chu89], BANG file [Fre87] [Lia89], AVL data structure with persistent technique [Ver87], and hashing are widely used in current database design. Among those techniques, hashing is a well-known technique for organizing direct access files. The method is simple, and retrieval, insertion, and deletion of records are very fast. In traditional hashing, the size of the file must be estimated in advance, and storage space must be allocated for the entire file. To overcome these drawbacks, several dynamic hashing schemes were developed in the late seventies and early eighties.

The dynamic hashing scheme [Lar78] and the dynamic hashing scheme with deferred splitting [Sch81] both keep an index in main memory. In these schemes, the random access cost is high. A spiral storage scheme [Mul85] seeks to provide a uniform performance regardless of the file size. This scheme involves a very complex address computation to determine the appropriate buckets. Also, the expansion process is both slow and complex.

To overcome the shortcomings of the spiral storage scheme, W. Litwin [Lit80] and Fagin et al. [Fag79] presented hashing schemes called linear hashing and extendible hashing respectively. Later, Ellis applied concurrent operations to extendible hashing in a distributed database environment [Ell82]. The address computation and expansion processes in both linear hashing and extendible hashing are easy and efficient [Lar82] [Lar85] [Bra86].

Both Litwin [Lit80] and Fagin et al. [Fag79] claimed their respective hashing techniques to be efficient. However, no comparison results of the two techniques were reported. Hence, the objective of this paper is to compare linear hashing and extendible hashing.

Section II of this paper briefly reviews linear hashing and extendible hashing. Section III discusses the simulation setup for comparison and section IV presents the simulation results and conclusions. (Mathematical derivations have been shown regarding search costs, insertion cost etc. They are omitted here due to space limitations.)
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1990 ACM 089791-347-7/90/0003/0178 $1.50

II. LINEAR HASHING AND EXTENDIBLE HASHING

The linear hashing scheme, referred to as LINHASH hereafter, is a directory-less scheme which allows a smooth growth of the hash table [Ram82]. The following example is due to Larson [Lar88]. Consider a hash table consisting of N buckets with addresses 0, 1, ..., N-1. LINHASH splits the buckets in a predetermined order: first bucket 0, then bucket 1, and so on, up to and including bucket N-1. In figure 1(a), the table size N is 3 and the next bucket to be split is bucket 0. Pointer p always indicates the bucket to be split next. Figure 1(b) shows the status after bucket 0 has been split. Notice that pointer p has moved to bucket 1. Next, bucket 1 is split into bucket 1 and bucket 4. The current expansion is considered complete when the last bucket of the table is split. After the split, pointer p is reset to bucket 0. In our example, the expansion will be complete when bucket 3 is split, as shown in figure 1(d).

The extendible hashing technique, referred to as EXHASH hereafter, was developed by Fagin et al. [Fag79]. This scheme uses d leading (or trailing) bits of the key to index into the directory. Global depth, d, and local depth, d', denote the depth of the directory and the depth of a bucket, respectively. There are 2**d directory entries. Often more than one directory entry points to the same bucket. Figure 2 expands this discussion. Upon expansion of the table, the local depth of the 2 buckets involved is increased by 1. If d' of any bucket is greater than d, then the directory size is doubled (shown in figure 3), and the global depth is increased by 1.
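The directory mechanics described for EXHASH can be illustrated with a minimal Python sketch. It is not the authors' simulator: the class and method names are hypothetical, keys are assumed to be distinct non-negative integers, the trailing d bits are used for indexing (the text allows leading or trailing bits), and a full bucket is split repeatedly until the new key fits instead of falling back to an overflow bucket.

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth   # d' in the text
        self.capacity = capacity
        self.keys = []


class ExtendibleHash:
    """Illustrative sketch only; assumes distinct non-negative integer keys."""

    def __init__(self, capacity=2):
        self.global_depth = 1            # d in the text
        self.capacity = capacity
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        # Assumption: the trailing d bits of the key select the directory entry.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < bucket.capacity:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            # A split would make d' exceed d, so the directory is doubled
            # and the global depth grows by 1.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split: both resulting buckets end up with local depth d' + 1.
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        for i, entry in enumerate(self.directory):
            if entry is bucket and i & high_bit:
                self.directory[i] = sibling
        pending = bucket.keys + [key]
        bucket.keys = []
        for k in pending:
            self.insert(k)               # may trigger a further split
```

With a bucket capacity of 2, inserting the keys 0 through 7 in order doubles the directory once, leaving a global depth of 2 and four directory entries, each pointing to its own bucket.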
Figure 1: Expansion Process in Linear Hashing

Figure 2: An Example of Extendible Hash Table. (a) Extendible Hash Table Before Split; (b) After Splitting Bucket 3 of Figure 2(a)

Figure 3: Hash Table Doubled After Splitting Bucket 2 of Figure 2(b)
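The LINHASH expansion order described in Section II can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names are hypothetical, keys are non-negative integers hashed by plain modulo arithmetic, each bucket and its overflow chain are modeled as a single list, and a split is triggered whenever any chain grows beyond the bucket capacity.

```python
class LinearHash:
    """Illustrative sketch only; one list stands for a bucket plus its overflow chain."""

    def __init__(self, n0=3, capacity=2):
        self.n0 = n0                  # N: table size before any expansion
        self.level = 0                # number of completed expansions (doublings)
        self.p = 0                    # pointer p: the next bucket to be split
        self.capacity = capacity
        self.buckets = [[] for _ in range(n0)]

    def _address(self, key):
        size = self.n0 * (1 << self.level)
        addr = key % size
        if addr < self.p:             # bucket already split in this expansion
            addr = key % (2 * size)
        return addr

    def search(self, key):
        return key in self.buckets[self._address(key)]

    def insert(self, key):
        chain = self.buckets[self._address(key)]
        chain.append(key)
        if len(chain) > self.capacity:
            # The bucket that is split is the one at p, which is usually
            # not the bucket that just overflowed.
            self._split()

    def _split(self):
        size = self.n0 * (1 << self.level)
        old = self.buckets[self.p]
        self.buckets.append([])       # the new bucket has address p + size
        self.buckets[self.p] = [k for k in old if k % (2 * size) == self.p]
        self.buckets[self.p + size] = [k for k in old if k % (2 * size) != self.p]
        self.p += 1
        if self.p == size:            # every bucket of this round has been split
            self.p = 0
            self.level += 1
```

Starting from N = 3, the first split sends part of bucket 0 to the new bucket 3 and moves p to bucket 1; the second split divides bucket 1 between bucket 1 and bucket 4, as in the example above.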
III. SIMULATION PREPARATION

The performance comparison factors for simulation are based on the following assumptions.

1. Assumptions

(1) The keys are distributed uniformly, so each key has equal access probability.
(2) Records are of fixed length.
(3) The bucket capacity is fixed in terms of the number of records that it can hold.
(4) Expansion takes place as soon as a bucket overflows.
(5) Enough main memory is available to handle the expansion.
(6) EXHASH: (a) The most significant bits are extracted from the key to find the directory entry. (b) The overflow bucket is split at most once. In other words, a second split is not attempted even though the first split may fail to release the overflow bucket. (c) Main memory can hold a maximum of 1024 directory entries. The rest of the directory must reside on the secondary storage.
(7) LINHASH: A simple division method with modulo arithmetic is used to find the relevant bucket.

According to assumption (1), we use a random function that broadly satisfies the properties of a minimal random function. Given a minimal random function "f(z) = az mod m", the value of "a" should pass the three tests as defined in [Par88] such that f(z) should (i) be a full period generating function; (ii) be random for all the sequences generated; and (iii) be implemented efficiently with 32-bit arithmetic. Further, the hash functions used in the simulation also satisfy the basic properties listed by Carter and Knuth in [Car79] [Knu73].

2. Comparison Factors

The following notations are used to define the comparison factors:

N : The number of records in the current hash table
B : The number of buckets in the current hash table
b : Bucket capacity
bs : The number of buckets accessed for successful search, + 1 if the directory entry is not in main memory (only in EXHASH)
bu : The number of buckets accessed for unsuccessful search, + 1 if the directory entry is not found in main memory (only in EXHASH)
s : Number of successful searches
u : Number of unsuccessful searches
* : Arithmetic multiplication symbol
/ : Arithmetic division symbol

The following factors have been considered to analyze the relative performance of LINHASH and EXHASH:

(1) Storage utilization: N / (B*b)

(2) Average unsuccessful search cost: bu / u

(3) Average successful search cost: bs / s

(4) Split cost (expansion cost): In LINHASH, a split bucket is usually different from the bucket where insertion took place. Hence additional accesses are needed to read the split bucket chain.

LINHASH: 1 access to read the primary bucket
         + k accesses to read k overflow buckets
         + 1 access to write old bucket
         + extra accesses to write the overflow buckets attached to old and new buckets
EXHASH:  1 access to write old bucket
         + 1 access to write new bucket
         + extra accesses to write the overflow buckets attached to old and new buckets
         + accesses needed to update the directory pointers if the directory resides on the secondary storage

(5) Insertion cost: Unsuccessful search cost + Split cost

(6) Number of overflow buckets

The above factors have been simulated with the bucket sizes of 10, 20, and 50 for both EXHASH and LINHASH. In order to observe their average behavior, the simulation uses 50,000 keys which have been generated randomly.

IV. RESULTS & CONCLUSION

1. Simulation Results

For all bucket sizes, EXHASH produces consistently better storage utilization than LINHASH. LINHASH gives cyclic storage utilization since the buckets are split linearly regardless of their load. In both EXHASH and LINHASH, as the bucket size rises, the storage utilization fluctuates more (see figures 4,5,6). EXHASH has an advantage of approximately 5% over LINHASH in storage utilization. Such a performance is wholly attributable to the way the buckets are split under the two schemes. The corollary is that LINHASH requires more buckets than EXHASH does to hold the same number of records.
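The comparison factors of Section III reduce to simple arithmetic on access counts. The Python sketch below restates them as functions; the function and parameter names are hypothetical, and the split-cost terms simply follow the lists as printed in the text. The numbers in the usage note are made-up inputs, not results from the simulation.

```python
def storage_utilization(n_records, n_buckets, capacity):
    """Factor (1): N / (B * b)."""
    return n_records / (n_buckets * capacity)


def average_search_cost(buckets_accessed, n_searches):
    """Factors (2) and (3): bu / u or bs / s."""
    return buckets_accessed / n_searches


def linhash_split_cost(overflow_reads, overflow_writes):
    """Factor (4), LINHASH terms as listed in the text:
    1 read of the primary bucket + k reads of overflow buckets
    + 1 write of the old bucket + extra writes of overflow buckets."""
    return 1 + overflow_reads + 1 + overflow_writes


def exhash_split_cost(overflow_writes, directory_accesses=0):
    """Factor (4), EXHASH terms as listed in the text:
    1 write of the old bucket + 1 write of the new bucket
    + extra writes of overflow buckets + directory updates on secondary storage."""
    return 1 + 1 + overflow_writes + directory_accesses


def insertion_cost(unsuccessful_search_cost, split_cost):
    """Factor (5): unsuccessful search cost + split cost."""
    return unsuccessful_search_cost + split_cost
```

For instance, 40 records held in 4 buckets of capacity 20 give a storage utilization of 0.5, and a LINHASH split that reads 2 overflow buckets and writes 1 costs 5 accesses under these definitions.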
[Plots not reproduced. Each figure plots the named quantity against the number of records (0 to 50,000); legend: EXTEND (solid line), LINEAR (dashed line).]

Figure 11. Successful search cost vs. number of records
Figure 13. Split cost vs. number of records
Figure 14. Split cost vs. number of records
Figure 15. Split cost vs. number of records
Figure 17. Insertion cost vs. number of records
Figure 18. Insertion cost vs. number of records
Figure 21. Overflow buckets vs. number of records

LINHASH performs better in unsuccessful search for all the bucket sizes, and an unsuccessful search becomes less costly as the bucket size rises. On average, the unsuccessful search cost stays close to 1 for all the bucket sizes in LINHASH (see figures 7,8,9). Similar observations hold true for the cost of a successful search (see figures 10,11,12). The successful search and the unsuccessful search are equally costly in EXHASH. This is due to the fact that the overflow buckets are almost non-existent in EXHASH. Overflow buckets are mandatory in LINHASH. In EXHASH, the search cost can be kept to 1 regardless of the bucket size when the entire directory can be kept in the main memory.

The splitting of a bucket is costlier in LINHASH. This is due to the fact that an extra read access is needed to read the bucket to be split (see figures 13,14,15). The insertion cost is slightly higher in EXHASH for the bucket sizes 10 and 20. However, for the bucket size 50, this cost is slightly less in EXHASH (see figures 16,17,18).

As expected, LINHASH performed poorly with respect to the number of overflow buckets. The number of overflow buckets decreases as the bucket size increases. The simulation shows that a maximum of 10% of the total space should be marked as an overflow area in LINHASH. Overflow buckets are almost non-existent in EXHASH (see figures 19,20,21).

2. Conclusion

Based on simulation results, the linear hashing technique is recommended when main storage is at a premium since it requires no directory. This scheme is particularly useful in a small computer environment. However, this scheme is not devoid of its pitfalls. Since there is no control over the length of an overflow chain, the search cost may become high. However, the simulation has shown that the maximum search cost is 2 for all the bucket sizes in linear hashing. Extendible hashing could be useful if sufficient main memory is available to hold the directory. Doubling and halving the directory size is expensive. In both the schemes, the bucket size does not affect the performance significantly. However, a bucket size of 20 seems to be a good choice since it gives fairly reasonable storage utilization and search times.

REFERENCES

[Bra86] Bradley, J. "Use of Mean Distance between Overflow Records to Compute Average Search Lengths in Hash Files with Open Addressing." Computer J., 29, 2(1986), pp. 167-170.

[Car79] Carter, J.L. and Wegman M. "Universal Class of Hash Functions." J. of Comp. & Sys. Sci., 18, 1(1979), pp. 143-154.

[Chu89] Chun, S.H., Hedrick, G.E., Lu, H., Fisher, D.D. "A Partitioning Method for Grid File Directories." Will appear in Proc. of the IEEE Computer Society's 13th Annual International Computer Software and Applications Conference, Sept. 18-22, 1989, Orlando.

[Ell82] Ellis, C.S. "Extendible Hashing for Concurrent Operations and Distributed Data." ACM SIGMOD, 1982.

[Fag79] Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H.R. "Extendible Hashing - A Fast Access Method for Dynamic Files." ACM Transactions on Database Systems, 4, 3(Sept. 1979), pp. 315-344.

[Fre87] Freeston, M. "The BANG file." Proc. of ACM SIGMOD Conf., 16, 3(Dec. 1987), pp. 260-269.

[Knu73] Knuth, D. The Art of Computer Programming, vol. III: Sorting and Searching. Reading, MA: Addison-Wesley, 1973.

[Lar78] Larson, P. "Dynamic Hashing." BIT, 18(1978), pp. 184-201.

[Lar82] Larson, P. "Performance Analysis of Linear Hashing with Partial Expansions." ACM Transactions on Database Systems, 7, 4(1982), pp. 566-587.

[Lar85] Larson, P. "Performance Analysis of a Single-file Version of Linear Hashing." Computer J., 28, 3(1985), pp. 319-326.

[Lar88] Larson, P. "Dynamic Hash Tables." Communications of the ACM, 31, 4(April 1988), pp. 446-457.

[Lia89] Lian, T., Fisher, D., Lu, H. "Implementation and Evaluation of Grid and Bang (Balanced and Nested Grid) File Structures." IEEE Proc. of Workshop on Applied Computing 1989, pp. 80-85.

[Lit80] Litwin, W. "Linear Hashing: A New Tool for File and Table Addressing." Proc. of the 6th Conference on Very Large Databases, 1980, pp. 212-223.

[Mar77] Martin, J. Computer Data-Base Organization (2nd edition). Prentice-Hall, 1977.

[Mul85] Mullin, J.R. "Spiral Storage: An Efficient Dynamic Hashing with Constant Performance." Computer J., 28, 3(1985), pp. 330-334.

[Nie84] Nievergelt, J., Hinterberger, H., & Sevcik, K.C. "The Grid File: An Adaptable, Symmetric Multikey File Structure." ACM Transactions on Database Systems, 9, 1(1984), pp. 38-71.

[Par88] Park, S.K. and Miller, K.W. "Random Number Generators: Good Ones Are Hard to Find." Communications of the ACM, 31, 10(October 1988), pp. 1192-1201.

[Ram82] Rammohanrao, K. and Lloyd, J.K. "Dynamic Hashing Schemes." Computer J., 25, 4(1982), pp. 478-485.

[Sch81] Scholl, M. "New File Organizations Based on Dynamic Hashing." ACM Transactions on Database Systems, 6, 1(Mar. 1981), pp. 194-211.

[Ver87] Verma, V., & Lu, H. "A New Approach to Version Management for Databases." Proceedings of National Computer Conference (AFIP Conference), vol. 56, 1987, pp. 645-651.