Hashing

Hashing is a technique that maps large data items to smaller tables using a hashing function, allowing for constant time data retrieval and updates. There are two types of hashing: static and dynamic, with various hash functions like division, mid-square, and digit folding used to compute hash values. Collision resolution techniques include separate chaining and open addressing, each with its advantages and disadvantages.

Uploaded by

tahensy0909

GLEC, Hyd PC301CS - DATA STRUCTURES AND ALGORITHMS

Unit-III
HASHING

• Hashing is the process of mapping a large set of data items to a smaller table with the help of a hash function.
• The function that performs this mapping is called a hash function, hashing algorithm, or (in cryptography) message digest function.
• It is a technique to convert a range of key values into a range of indexes of an array.
• It enables faster searching than linear search or binary search.
• Hashing allows any data entry to be updated or retrieved in constant time, O(1), on average.
• Constant time O(1) means the cost of the operation does not depend on the size of the data.
• Hashing is used with databases so that items can be retrieved more quickly.
• It is also used in digital signatures, where a hash (digest) of the message is what gets signed and verified.
• Hashing uses few key comparisons: a search takes O(1) time in the average case and O(n) time in the worst case.
• The hash function maps the keys into a table, which is called a hash table.

Types of Hashing

o There are two types of hashing:

o Static hashing: The hash function maps search-key values to a fixed set of locations.
o Dynamic hashing: The hash table can grow to handle more items, and the associated hash function must change as the table grows.

o The load factor of a hash table is the ratio of the number of keys stored in the table to the size of the hash table.

o Load Factor = number of keys stored / number of slots in the table

o Note: The higher the load factor, the more collisions occur and the slower retrieval becomes.

1) Hash table
A hash table is a data structure used for storing and accessing data very quickly. Data is inserted according to a key value, so every entry in the hash table is associated with some key. Using this key, an entry can be found with only a few key comparisons, and the search time depends on how full the table is rather than on the total number of entries.
• A hash table (or hash map) is a data structure used to store key-value pairs.
• It is a collection of items stored in a way that makes them easy to find later.
• It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
• With chaining, it is an array of lists, where each list is known as a bucket.
• Each value is stored and looked up by its key.
• In Java, the Hashtable class implements the Map interface and extends the Dictionary class.
• Java's Hashtable is synchronized and contains only unique keys.

• Consider a hash table of size n = 10. Each position of the hash table is called a slot. The table has n slots, named {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}: slot 0, slot 1, slot 2 and so on. Initially the hash table contains no items, so every slot is empty.
• The mapping between an item and the slot where that item belongs in the hash table is called the hash function. The hash function takes any item in the collection and returns an integer in the range of slot names, 0 to n-1.
• Suppose we have the integer items {26, 70, 18, 31, 54, 93}. One common method of determining a hash key is the division method of hashing, and the formula is:

Hash Key = Key Value % Number of Slots in the Table

• The division method (or remainder method) divides an item by the table size and returns the remainder as its hash value.

Data Item    Value % No. of Slots    Hash Value
26           26 % 10 = 6             6
70           70 % 10 = 0             0
18           18 % 10 = 8             8
31           31 % 10 = 1             1
54           54 % 10 = 4             4
93           93 % 10 = 3             3

• After computing the hash values, we insert each item into the hash table at its designated slot. Now 6 of the 10 slots are occupied. This fraction is referred to as the load factor and is denoted by λ = No. of items / table size. Here, λ = 6/10 = 0.6.
• Searching for an item is easy: the hash function computes the slot name for the item, and we then check that slot of the hash table to see whether the item is present.
• Computing the hash value and indexing the hash table at that location takes a constant amount of time, O(1).

2) Hash function
A hash function is a function that is applied to a key to produce an integer, which can be used as an address in the hash table. The same hash function is therefore used both to store data in the hash table and to access it later. More precisely, a hash function employs some algorithm to compute, for every data element in the set U, a key K of fixed size. The same key K is then used to map the data to the hash table, so that all operations such as insertion, deletion and searching are possible. The values returned by a hash function are also referred to as hash values, hash codes, hash sums, or hashes.

Types of hash function


There are various types of hash function which are used to place the data in a hash table.
1. Division method
The basic idea behind hashing is to take a field in a record, known as the key, and convert it through some fixed process to a numeric value, known as the hash key, which represents the position at which to store or find an item in the table. The numeric value will be in the range 0 to n-1, where n is the maximum number of slots (or buckets) in the table.
The fixed process that converts a key to a hash key is known as a hash function. This function is used whenever access to the table is needed.
One common method of determining a hash key is the division method of hashing. The formula is:
hash key = key % number of slots in the table

The division method is generally a reasonable strategy, unless the keys happen to have some undesirable properties. For example, if the table size is 10 and all of the keys end in zero, every key hashes to slot 0. In such cases, the choice of hash function and table size needs to be carefully considered. The best table sizes are prime numbers. Another problem is that keys are not always numeric; in fact, it is common for them to be strings. One possible solution is to add up the ASCII values of the characters in the string to get a numeric value and then apply the division method.
In this method the hash function depends on the remainder of a division. For example, suppose the records 52, 68, 99 and 84 are to be placed in a hash table of size 10. Then:
h(key) = record % table size
52 % 10 = 2
68 % 10 = 8
99 % 10 = 9
84 % 10 = 4

Pros:
1. This method works with any table size, although prime table sizes tend to distribute keys best.
2. The division method is very fast since it requires only a single division operation.
Cons:
1. This method can lead to poor performance since consecutive keys map to consecutive hash values in the hash table.
2. Extra care must be taken when choosing the table size.

2. Mid Square Method:


The mid-square method is a very good hashing method. It involves two steps to compute the hash value:
1. Square the value of the key k, i.e. compute k².
2. Extract the middle r digits of k² as the hash value.
Formula:
h(k) = middle r digits of (k x k)
Here,
k is the key value.
The value of r is decided by the size of the table. If the size of the table is 100, then r can be 2 digits, as the index values range from 0 to 99. If it is 1000, then r can be 3 digits, as the index ranges from 0 to 999.

Example:
Suppose the hash table has 100 memory locations. So r = 2, because two digits are required to map the key to a memory location.
k = 60
k x k = 60 x 60
= 3600
h(60) = 60 (the middle two digits of 3600)
The hash value obtained is 60.

• The key k is squared. Then the hash function H is defined as

H(k) = l
• Here l is obtained by deleting digits from both ends of k².
• The same digit positions must be used for all the keys.
• Example:
k:     3205       7148       2345
k²:    10272025   51093904   5499025
H(k):  72         93         99
The 4th and 5th digits, counted from the right, have been selected.
In this method the key is first squared and then the middle part of the result is taken as the index. For example, suppose we want to place the record 3101 in a table of size 1000. Then 3101 * 3101 = 9616201, so h(3101) = 162 (the middle 3 digits).

Pros:
1. The performance of this method is good because most or all digits of the key value contribute to the middle digits of the squared result.
2. The result is not dominated by the distribution of the top digit or bottom digit of the original key value.
Cons:
1. The size of the key is a limitation of this method: if the key is large, its square has about twice as many digits, which can overflow fixed-size integers.
2. Collisions can still occur, although they can be reduced.

3. Digit Folding Method:


This method involves two steps:
1. Divide the key value k into a number of parts k1, k2, k3, ..., kn, where each part has the same number of digits except for the last part, which can have fewer digits than the other parts.
2. Add the individual parts. The hash value is obtained by ignoring the last carry, if any.
Formula:
k = k1, k2, k3, k4, ..., kn
s = k1 + k2 + k3 + k4 + ... + kn
h(k) = s
Here,
s is obtained by adding the parts of the key k.

Example:
k = 12345
k1 = 12, k2 = 34, k3 = 5
s = k1 + k2 + k3
= 12 + 34 + 5
= 51
h(k) = 51

Note: The number of digits in each part depends on the size of the hash table. For example, if the size of the hash table is 100, then each part must have two digits, except for the last part, which can have fewer digits.

Collision Resolution Techniques:-


What is a collision? Since a hash function returns a small number for a key that may be a large integer or a string, two keys can produce the same value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called a collision, and it must be handled using some collision handling technique.
There are mainly two methods to handle collisions:
• Open Hashing / Separate Chaining
• Closed Hashing / Open Addressing

Separate Chaining:
• The idea behind separate chaining is to make each slot of the array the head of a linked list called a chain. Separate chaining is one of the most popular and commonly used techniques for handling collisions.
• When multiple elements hash to the same slot index, these elements are inserted into a singly linked list, which is known as a chain.
• To search for a key K, we hash K to find its slot and then traverse that slot's linked list linearly. If the key of any entry is equal to K, we have found our entry. If we reach the end of the linked list without finding it, the entry does not exist. In short, in separate chaining, if two different elements have the same hash value, we store both elements in the same linked list, one after the other.
• Example: Let us consider a simple hash function "key mod 7" and the sequence of keys 50, 700, 76, 85, 92, 73, 101. The keys hash to slots 1, 0, 6, 1, 1, 3 and 3 respectively, so slot 1 holds the chain 50, 85, 92 and slot 3 holds the chain 73, 101.

Advantages:
• Simple to implement.
• The hash table never fills up; we can always add more elements to a chain.
• Less sensitive to the hash function and the load factor.
• It is mostly used when it is unknown how many keys will be inserted or deleted, and how frequently.
Disadvantages:
• The cache performance of chaining is poor because keys are stored in linked lists. Open addressing provides better cache performance because everything is stored in the same table.
• Wastage of space (some slots of the hash table are never used).
• If a chain becomes long, search time can become O(n) in the worst case.
• Uses extra space for links.

Open Addressing: Like separate chaining, open addressing is a method for handling collisions. In open addressing, all elements are stored in the hash table itself. So at any point, the size of the table must be greater than or equal to the total number of keys (note that we can increase the table size by copying the old data to a larger table if needed). This approach is also known as closed hashing.
Different ways of Open Addressing:
1) Linear Probing
2) Quadratic Probing
3) Double Hashing
1. Linear Probing:
In linear probing, the hash table is searched sequentially, starting from the original hash location. If that location is already occupied, we check the next location. The function used for rehashing is: rehash(n) = (n + 1) % table-size, where n is the slot that was just found to be occupied.
The gap between two successive probes is 1, as seen below:
Let hash(x) be the slot index computed using a hash function and S be the table size.
If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
and so on. As an example, consider the simple hash function "key mod 7" and the sequence of keys 50, 700, 76, 85, 92, 73, 101. Keys 50, 700 and 76 go directly to slots 1, 0 and 6. Key 85 collides at slot 1 and moves to slot 2; 92 collides at slots 1 and 2 and moves to slot 3; 73 collides at slot 3 and moves to slot 4; 101 collides at slots 3 and 4 and moves to slot 5.

Advantages:
1) No extra space is required.

Disadvantages:
1) Search time is O(n) in the worst case.
2) Deletion is difficult, because emptying a slot can break the probe sequence of keys stored after it.
3) Primary clustering is present: many consecutive occupied slots form groups, and it starts taking longer to find a free slot or to search for an element.
4) Secondary clustering is also present, though it is less severe: two records have the same collision chain (probe sequence) only if their initial position is the same.

Example: Let us consider a simple hash function "key mod 5" and the sequence of keys to be inserted: 50, 70, 76, 93.
• Step 1: Draw the empty hash table, which has slots 0 to 4, the possible range of hash values for the given hash function.
• Step 2: Insert the keys into the hash table one by one. The first key is 50. It maps to slot number 0 because 50 % 5 = 0, so insert it into slot number 0.
• Step 3: The next key is 70. It maps to slot number 0 because 70 % 5 = 0, but 50 is already at slot number 0, so search for the next empty slot and insert it there: 70 goes into slot number 1.
• Step 4: The next key is 76. It maps to slot number 1 because 76 % 5 = 1, but 70 is already at slot number 1, so search for the next empty slot and insert it there: 76 goes into slot number 2.
• Step 5: The next key is 93. It maps to slot number 3 because 93 % 5 = 3. Slot number 3 is empty, so insert it there.
The final hash table holds 50, 70, 76 and 93 in slots 0, 1, 2 and 3, and slot 4 is empty.



2. Quadratic Probing
In quadratic probing, the interval between probes grows quadratically with the number of probes made so far. Quadratic probing is a method that reduces the primary clustering problem of linear probing. In this method, we look for the (i²)th slot after the original location in the ith iteration. We always start from the original hash location, and only if that location is occupied do we check the other slots. Let hash(x) be the slot index computed using the hash function and S be the table size.
If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
and so on.

Example: Let us consider table size = 7, hash function Hash(x) = x % 7 and collision resolution strategy f(i) = i². Insert 22, 30 and 50.
• Step 1: Create a table of size 7.
• Step 2: Insert 22 and 30.
  o Hash(22) = 22 % 7 = 1. Since the cell at index 1 is empty, we can insert 22 at slot 1.
  o Hash(30) = 30 % 7 = 2. Since the cell at index 2 is empty, we can insert 30 at slot 2.
• Step 3: Insert 50.
  o Hash(50) = 50 % 7 = 1.
  o Slot 1 is already occupied, so we try slot 1 + 1², i.e. 1 + 1 = 2.
  o Slot 2 is also occupied, so we try slot 1 + 2², i.e. 1 + 4 = 5.
  o Slot 5 is not occupied, so we place 50 in slot 5.

Advantages of Quadratic probing:


1) No extra memory space is required.
2) The primary clustering problem is resolved.
Disadvantages:
1) Search is O(n) in the worst case.
2) Secondary clustering is present: two keys have the same probe sequence when they hash to the same location.
3) There is no guarantee of finding a free slot, even when the table is not full.

3. Double Hashing:-
Double hashing is a collision resolution technique for open addressing hash tables. A collision occurs when two different keys are hashed to the same index in a hash table. Collisions are a problem because every slot in the hash table is supposed to store a single element.
Generally, a hashing technique consists of a hash function that takes a key and produces the hash table index for that key. The double hashing technique uses two hash functions, which is why it is called double hashing. The second hash function provides an offset value that is used when the first hash function produces a collision.

Double Hash Function


The first hash function determines the initial location at which to place the key, and the second hash function determines the size of the jumps in the probe sequence. The following function is an example of double hashing:
h(key, i) = (firstHashFunction(key) + i * secondHashFunction(key)) % tableSize
Here i starts at 0 and keeps incrementing until an empty slot is found.
firstHashFunction(key) = key % tableSize
secondHashFunction(key) = PRIME - (key % PRIME), where PRIME is a prime smaller than tableSize.
Double hashing works well if the table size is prime.
If the slot computed by h(key, i) is already occupied by another object, there is a collision and the next value of i is tried. A good second hash function must have the following properties:
o Quick to evaluate.
o Differs from the first hash function.
o Never evaluates to 0.

Advantages of Double Hashing


o The technique does not yield clusters: the primary and secondary clustering problems are resolved.
o It usually finds the next free slot in the hash table with fewer probes than linear probing.
o It produces a uniform distribution of records throughout the hash table.
o No extra space is required.
Disadvantages:
1) Search is O(n) in the worst case.

Comparison between Linear, Quadratic and Double hashing:


• Linear probing has the best cache performance but suffers from clustering. Another advantage of linear probing is that it is easy to compute.
• Quadratic probing lies between the two in terms of cache performance and clustering.
• Double hashing has poor cache performance but no clustering. Double hashing requires more computation time, as two hash functions need to be computed.

S.No.  Separate Chaining                              Open Addressing
1.     Chaining is simpler to implement.              Open addressing requires more computation.
2.     In chaining, the hash table never fills up;    In open addressing, the table may become full.
       we can always add more elements to a chain.
3.     Chaining is less sensitive to the hash         Open addressing requires extra care to avoid
       function and the load factor.                  clustering and a high load factor.
4.     Chaining is mostly used when it is unknown     Open addressing is used when the frequency and
       how many and how frequently keys may be        number of keys are known.
       inserted or deleted.
5.     Cache performance of chaining is poor, as      Open addressing provides better cache performance,
       keys are stored using linked lists.            as everything is stored in the same table.
6.     Wastage of space (some parts of the hash       In open addressing, a slot can be used even if
       table in chaining are never used).             no input maps to it.
7.     Chaining uses extra space for links.           Open addressing uses no links.

// Linear probing
#include <stdio.h>
#include <stdlib.h>

#define MAX 10      /* number of slots in the hash table */
#define EMPTY -1    /* marks a free slot, so -1 itself cannot be stored */

void display(int a[MAX]);

/* Division method: map num to a slot index in the range 0..MAX-1. */
int create(int num)
{
    int key = num % MAX;
    if (key < 0)            /* % can yield a negative result for negative num */
        key += MAX;
    return key;
}

/* Insert num at slot key, or at the next free slot found by linear probing. */
void hash(int a[MAX], int key, int num)
{
    int i, pos;
    for (i = 0; i < MAX; i++)
    {
        pos = (key + i) % MAX;      /* wrap around at the end of the table */
        if (a[pos] == EMPTY)
        {
            a[pos] = num;
            return;
        }
    }
    /* All MAX slots were probed and none was free. */
    printf("\n hashtable is full");
    display(a);
    exit(1);
}

/* Print every slot index together with its content. */
void display(int a[MAX])
{
    int i;
    printf("\nThe hash table is ......... \n");
    for (i = 0; i < MAX; i++)
        printf("\n%d %d", i, a[i]);
}

int main()
{
    int a[MAX], num, key, i;
    char ch;
    printf("\nhash table\n");
    for (i = 0; i < MAX; i++)
        a[i] = EMPTY;
    do
    {
        printf("\n enter the number:");
        if (scanf("%d", &num) != 1)     /* stop on invalid input or end of file */
            break;
        key = create(num);
        hash(a, key, num);
        printf("\ndo you want to continue(y/n):");
        if (scanf(" %c", &ch) != 1)
            break;
    } while (ch == 'y' || ch == 'Y');
    display(a);
    return 0;
}

// Separate Chaining
#include <stdio.h>
#include <stdlib.h>

#define size 7      /* number of chains (slots) in the table */

struct node
{
    int data;
    struct node *next;
};

struct node *chain[size];

/* Set every chain to empty. */
void init()
{
    int i;
    for (i = 0; i < size; i++)
        chain[i] = NULL;
}

void insert(int value)
{
    // create a new node holding value
    struct node *newnode = malloc(sizeof(struct node));
    if (newnode == NULL)
    {
        printf("out of memory\n");
        exit(1);
    }
    newnode->data = value;
    newnode->next = NULL;
    // calculate hash key (assumes value is non-negative)
    int key = value % size;
    // check if chain[key] is empty
    if (chain[key] == NULL)
        chain[key] = newnode;
    // collision
    else
    {
        // add the node at the end of chain[key]
        struct node *temp = chain[key];
        while (temp->next)
        {
            temp = temp->next;
        }
        temp->next = newnode;
    }
}

/* Print every chain from its head to the terminating null. */
void print()
{
    int i;
    for (i = 0; i < size; i++)
    {
        struct node *temp = chain[i];
        printf("chain[%d]-->", i);
        while (temp)
        {
            printf("%d -->", temp->data);
            temp = temp->next;
        }
        printf("null\n");
    }
}

int main()
{
    // init array of lists to null
    init();
    insert(7);
    insert(0);
    insert(3);
    insert(10);
    insert(4);
    insert(5);
    print();
    return 0;
}
