Hashing
Unit-III
HASHING
Hashing is the process of mapping a large number of data items to a smaller table with the help of
a hash function.
A hash function is also known as a hashing algorithm or message digest function.
It is a technique to convert a range of key values into a range of indexes of an array.
It provides a faster way of searching than linear or binary search.
Hashing allows any data entry to be updated or retrieved in constant time, O(1), on average.
Constant time O(1) means the cost of the operation does not depend on the size of the data.
Hashing is used with databases to enable items to be retrieved more quickly.
Cryptographic hash functions are also used in creating and verifying digital signatures.
Hashing needs few key comparisons: a search takes O(1) time in the average case and O(n) time
in the worst case.
This method uses a hash function to map the keys into a table, which is called a hash table.
Types of Hashing
o Static hashing: In static hashing, the hash function maps search-key values to a fixed set
of locations.
o Dynamic hashing: In dynamic hashing a hash table can grow to handle more items. The
associated hash function must change as the table grows.
o The load factor of a hash table is the ratio of the number of keys in the table to the size of
the hash table.
o Note: The higher the load factor, the slower the retrieval.
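The idea of a table that grows, with a hash function that changes as it grows, can be sketched as follows. This is only a minimal illustration, not a standard dynamic hashing scheme: the names `struct table`, `place`, `table_create`, `table_grow` and `table_add`, the doubling policy and the 3/4 load factor threshold are all assumptions made for the example. On a collision the sketch simply tries the next slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define EMPTY (-1)   /* assumed marker for an unused slot; keys must be >= 0 */

struct table {
    int *slots;   /* array of `size` slots */
    int size;     /* current number of slots */
    int count;    /* number of keys stored */
};

/* Put key into slots[] using h(key) = key % size; on a collision,
   step to the next slot. The caller guarantees a free slot exists. */
static void place(int *slots, int size, int key)
{
    int i = key % size;
    while (slots[i] != EMPTY)
        i = (i + 1) % size;
    slots[i] = key;
}

/* Allocate an empty table with `size` slots. */
bool table_create(struct table *t, int size)
{
    assert(size > 0);
    t->slots = malloc((size_t)size * sizeof *t->slots);
    if (t->slots == NULL)
        return false;
    for (int i = 0; i < size; i++)
        t->slots[i] = EMPTY;
    t->size = size;
    t->count = 0;
    return true;
}

/* Double the table and re-hash every key with the new table size. */
static bool table_grow(struct table *t)
{
    struct table bigger;
    if (!table_create(&bigger, t->size * 2))
        return false;
    for (int i = 0; i < t->size; i++)
        if (t->slots[i] != EMPTY)
            place(bigger.slots, bigger.size, t->slots[i]);
    bigger.count = t->count;
    free(t->slots);
    *t = bigger;
    return true;
}

/* Add key, growing first if the load factor would exceed 3/4. */
bool table_add(struct table *t, int key)
{
    assert(key >= 0);
    if ((t->count + 1) * 4 > t->size * 3 && !table_grow(t))
        return false;
    place(t->slots, t->size, key);
    t->count++;
    return true;
}
```

Because the hash function is key % size, a key such as 5 sits at slot 1 while the table has 4 slots and moves to slot 5 once the table has 8 slots. This is why every key has to be re-hashed when the table grows.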
1) Hash table
A hash table is a data structure used for storing and accessing data very quickly.
Insertion of data in the table is based on a key value, so every entry in the hash table is
associated with some key. Using this key, data can be found in the hash table with only a few key
comparisons, so the search time depends on how full the table is rather than on the total number
of items.
A hash table, or hash map, stores key-value pairs.
It is a collection of items stored in a way that makes them easy to find later.
It uses a hash function to compute an index into an array of buckets or slots from which the
desired value can be found.
GLEC, Hyd PC301CS - DATA STRUCTURES AND ALGORITHMS
The above figure shows a hash table of size n = 10. Each position of the hash table is
called a slot. The table has n slots, named {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}: slot 0, slot 1,
slot 2 and so on. The hash table contains no items yet, so every slot is empty.
The mapping between an item and the slot where the item belongs in the hash table is
called the hash function. The hash function takes any item in the collection and returns an
integer in the range of slot names, 0 to n-1.
Suppose we have the integer items {26, 70, 18, 31, 54, 93}. One common method of determining a
hash value is the division method of hashing, and the formula is:
h(item) = item % table size
Item    Hash computation    Hash value
26      26 % 10 = 6         6
70      70 % 10 = 0         0
18      18 % 10 = 8         8
31      31 % 10 = 1         1
54      54 % 10 = 4         4
93      93 % 10 = 3         3
After computing the hash values, we can insert each item into the hash table at the designated
position as shown in the above figure. In the hash table, 6 of the 10 slots are occupied. This
fraction is referred to as the load factor and is denoted by λ = No. of items / table size.
For example, λ = 6/10.
It is easy to search for an item: the hash function computes the slot name for the item, and we
then check that slot of the hash table to see if the item is present.
A constant amount of time, O(1), is required to compute the hash value and index the hash table
at that location.
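The insert, search and load factor steps described above can be sketched in C as follows. This is a minimal sketch under stated assumptions: the names `hash_slot`, `table_init`, `table_put`, `table_contains` and `load_factor` are made up for illustration, -1 is assumed to mark an empty slot (so items must be non-negative), and collisions are reported but not resolved.

```c
#include <assert.h>
#include <stdbool.h>

#define TABLE_SIZE 10   /* n = 10 slots, as in the example above */
#define EMPTY (-1)      /* assumed marker for an unused slot */

/* Division method: slot = item % n (non-negative items only). */
int hash_slot(int item)
{
    assert(item >= 0);
    return item % TABLE_SIZE;
}

/* Mark every slot as empty. */
void table_init(int table[TABLE_SIZE])
{
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Store item in its slot; returns false if the slot is already taken
   (collision handling is not part of this sketch). */
bool table_put(int table[TABLE_SIZE], int item)
{
    int slot = hash_slot(item);
    if (table[slot] != EMPTY)
        return false;
    table[slot] = item;
    return true;
}

/* Search: compute the slot and look only there. */
bool table_contains(const int table[TABLE_SIZE], int item)
{
    return table[hash_slot(item)] == item;
}

/* Load factor = occupied slots / table size. */
double load_factor(const int table[TABLE_SIZE])
{
    int used = 0;
    for (int i = 0; i < TABLE_SIZE; i++)
        if (table[i] != EMPTY)
            used++;
    return (double)used / TABLE_SIZE;
}
```

After calling table_init and then table_put for 26, 70, 18, 31, 54 and 93, the items sit in slots 6, 0, 8, 1, 4 and 3, and load_factor returns 6/10 = 0.6.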
2) Hash function
A hash function is a function which, when applied to a key, produces an integer that can be used
as an address in the hash table. The same hash function is therefore used both for storing data
in the hash table and for accessing it. More generally, a hash function employs some algorithm to
compute a fixed-size value K for every data element in the set U. The same value K is used to map
the data to a slot of the hash table, and operations like insertion, deletion and searching all
rely on it. The values returned by a hash function are also referred to as
hash values, hash codes, hash sums, or hashes.
The division method is generally a reasonable strategy, unless the keys happen to have some
undesirable properties. For example, if the table size is 10 and all of the keys end in zero,
every key maps to slot 0. In this case, the choice of hash function and table size needs to be
carefully considered. The best table sizes are prime numbers. One problem though is that keys are
not always numeric. In fact, it's common for them to be strings. One possible solution: add up
the ASCII values of the characters in the string to get a numeric value and then perform the
division method.
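The string approach just described can be sketched as a small C function. The name `string_hash` and its parameters are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the character codes of key, then apply the division method. */
unsigned string_hash(const char *key, unsigned table_size)
{
    assert(key != NULL && table_size > 0);
    unsigned sum = 0;
    for (size_t i = 0; key[i] != '\0'; i++)
        sum += (unsigned char)key[i];
    return sum % table_size;
}
```

For example, "cat" has the character codes 99 + 97 + 116 = 312, so with a table of size 11 it hashes to 312 % 11 = 4. Note that "act" contains the same characters and therefore hashes to the same slot, which is one weakness of a plain sum.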
In the division method, the hash function depends on the remainder of a division. For example,
if the records 52, 68, 99, 84 are to be placed in a hash table of size 10, then:
h(key) = key % table size
h(52) = 52 % 10 = 2
h(68) = 68 % 10 = 8
h(99) = 99 % 10 = 9
h(84) = 84 % 10 = 4
Pros:
1. This method is quite good for any table size.
2. The division method is very fast since it requires only a single division operation.
Cons:
1. This method leads to poor performance since consecutive keys map to consecutive hash
values in the hash table.
2. Sometimes extra care should be taken to choose the table size.
In the mid-square method, the key k is squared and the middle r digits of the result are taken as
the hash value. The value of r is decided based on the size of the table. If the size of the
table is 100, then r is 2, as the index values range from 0 to 99. If it is 1000, then r is 3, as
the index ranges from 0 to 999.
Example:
Suppose the hash table has 100 memory locations. So r = 2 because two digits are required to
map the key to the memory location.
k = 60
k x k = 60 x 60
= 3600
h(60) = 60 (the middle two digits of 3600)
The hash value obtained is 60
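The calculation above can be sketched in C as follows. The function name `mid_square` is an assumption made for illustration. When the squared value does not have a symmetric middle (for example 7 digits with r = 2), this sketch leaves the extra digit on the left, which is one possible convention and not the only one.

```c
#include <assert.h>

/* Mid-square hash: square k and return the middle r decimal digits. */
unsigned long long mid_square(unsigned int k, int r)
{
    assert(r > 0 && r <= 9);
    unsigned long long sq = (unsigned long long)k * k;

    /* Count the decimal digits of the square. */
    int digits = 1;
    for (unsigned long long t = sq; t >= 10; t /= 10)
        digits++;

    /* 10^r, used to keep exactly r digits. */
    unsigned long long modulus = 1;
    for (int i = 0; i < r; i++)
        modulus *= 10;

    /* A square with at most r digits is used as it is. */
    if (digits <= r)
        return sq;

    /* Drop the low-order digits that lie to the right of the middle. */
    int drop = (digits - r) / 2;
    for (int i = 0; i < drop; i++)
        sq /= 10;
    return sq % modulus;
}
```

With k = 60 and r = 2, the square is 3600, one digit is dropped from the right to give 360, and 360 % 100 = 60, which matches the example.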
Pros:
1. The performance of this method is good because most or all digits of the key contribute to
the middle digits of the squared result.
2. The result is not dominated by the distribution of the top digit or bottom digit of the original
key value.
Cons:
1. The size of the key is one of the limitations of this method: if the key is large, its
square has about twice as many digits.
2. Collisions can still occur, although we can try to reduce them.
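In the folding method, the key k is divided into parts, and the parts are added together to
give the hash value.
Formula:
k = k1, k2, k3, k4, ..., kn
s = k1 + k2 + k3 + k4 + ... + kn
h(K) = s
Here,
s is obtained by adding the parts of the key k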
Example:
k = 12345
k1 = 12, k2 = 34, k3 = 5
s = k1 + k2 + k3
= 12 + 34 + 5
= 51
h(K) = 51
Note: The number of digits in each part depends on the size of the hash table.
For example, if the size of the hash table is 100, then each part must have two digits,
except for the last part, which can have fewer digits.
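The formula and example above can be sketched in C as follows. The function name `fold_hash` and the parameter `part_len` (digits per part) are assumptions made for illustration. The key is split from left to right, as in the example.

```c
#include <assert.h>
#include <stdio.h>

/* Folding hash: split the decimal digits of k, left to right, into
   parts of part_len digits (the last part may be shorter) and add them. */
unsigned long fold_hash(unsigned long k, int part_len)
{
    assert(part_len > 0);
    char digits[32];
    int n = snprintf(digits, sizeof digits, "%lu", k);

    unsigned long sum = 0;
    for (int i = 0; i < n; i += part_len) {
        unsigned long part = 0;
        for (int j = i; j < i + part_len && j < n; j++)
            part = part * 10 + (unsigned long)(digits[j] - '0');
        sum += part;
    }
    return sum;
}
```

fold_hash(12345, 2) adds 12 + 34 + 5 and returns 51. The sum can be larger than the biggest index of the table (for example 99 + 99 + 99 = 297 with a table of size 100), so in practice a final % table size step is often applied. That step is not part of the formula given above.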
Separate Chaining:
The idea behind separate chaining is to make each slot of the array the head of a linked list,
called a chain. Separate chaining is one of the most popular and commonly used techniques for
handling collisions.
When multiple elements are hashed into the same slot index, these elements are inserted into a
singly-linked list, which is known as a chain.
To search for a key K, we hash K to find its slot and then traverse that slot's linked list
linearly. If the key of any entry is equal to K, we have found our entry. If we reach the end of
the linked list without finding it, the entry does not exist. In short, in separate chaining, if
two different elements have the same hash value, we store both elements in the same linked list,
one after the other.
Example: Let us consider a simple hash function as “key mod 7” and a sequence of keys as
50, 700, 76, 85, 92, 73, 101
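For these keys the hash values are 50 % 7 = 1, 700 % 7 = 0, 76 % 7 = 6, 85 % 7 = 1, 92 % 7 = 1, 73 % 7 = 3 and 101 % 7 = 3, so slot 1 holds the chain 50, 85, 92 and slot 3 holds the chain 73, 101. A minimal sketch of insertion and search follows. The names `struct entry`, `chain_insert` and `chain_search` are assumptions made for illustration, and keys are assumed to be non-negative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SLOTS 7   /* hash function: key mod 7 */

struct entry {
    int key;
    struct entry *next;
};

/* Append key to the end of the chain for slot key % SLOTS. */
bool chain_insert(struct entry *table[SLOTS], int key)
{
    assert(key >= 0);
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return false;
    e->key = key;
    e->next = NULL;

    struct entry **link = &table[key % SLOTS];
    while (*link != NULL)          /* walk to the end of the chain */
        link = &(*link)->next;
    *link = e;
    return true;
}

/* Traverse one chain linearly; true if key is present. */
bool chain_search(struct entry *table[SLOTS], int key)
{
    assert(key >= 0);
    for (struct entry *e = table[key % SLOTS]; e != NULL; e = e->next)
        if (e->key == key)
            return true;
    return false;
}
```

The table must start with every slot set to NULL, for example `struct entry *table[SLOTS] = {0};`. A search only walks the chain of one slot, so looking up 92 visits 50, 85 and 92 and never touches the other slots.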
Advantages:
Simple to implement.
Hash table never fills up, we can always add more elements to the chain.
Less sensitive to the hash function or load factors.
It is mostly used when it is unknown how many and how frequently keys may be inserted or
deleted.
Disadvantages:
The cache performance of chaining is not good as keys are stored using a linked list. Open
addressing provides better cache performance as everything is stored in the same table.
Wastage of Space (Some Parts of the hash table are never used)
If the chain becomes long, then search time can become O(n) in the worst case
Uses extra space for links
Open Addressing: Like separate chaining, open addressing is a method for handling
collisions. In open addressing, all elements are stored in the hash table itself. So at any point,
the size of the table must be greater than or equal to the total number of keys (note that we can
increase the table size by copying the old data if needed). This approach is also known as closed
hashing.
Different ways of Open Addressing:
1) Linear Probing
2) Quadratic Probing
3) Double Hashing
1. Linear Probing:
In linear probing, the hash table is searched sequentially, starting from the original hash
location. If the location that we get is already occupied, then we check the next location.
The function used for rehashing is as follows: rehash(n) = (n+1) % table-size, where n is the
location that was just probed.
The typical gap between two probes is 1, as seen in the example below:
Let hash(x) be the slot index computed using a hash function and S be the table size.
If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
and so on. Let us consider a simple hash function as “key mod 7” and a sequence of keys as
50, 700, 76, 85, 92, 73, 101.
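With these keys, 50 goes to slot 1, 700 to slot 0 and 76 to slot 6. Then 85 hashes to slot 1, which is full, so it moves to slot 2; 92 ends up in slot 3, 73 in slot 4 and 101 in slot 5. The probing rule can be sketched as follows. The names `lp_insert` and `lp_search` are assumptions made for illustration, -1 is assumed to mark an empty slot, and deletion is not handled.

```c
#include <assert.h>

#define TABLE_SIZE 7   /* S in the text; hash function: key mod 7 */
#define EMPTY (-1)     /* assumed marker for an unused slot */

/* Insert key with linear probing.
   Returns the slot used, or -1 if the table is full. */
int lp_insert(int table[TABLE_SIZE], int key)
{
    assert(key >= 0);
    for (int i = 0; i < TABLE_SIZE; i++) {
        int slot = (key % TABLE_SIZE + i) % TABLE_SIZE;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}

/* Follow the same probe sequence as insertion.
   Returns the slot holding key, or -1 if key is absent. */
int lp_search(const int table[TABLE_SIZE], int key)
{
    assert(key >= 0);
    for (int i = 0; i < TABLE_SIZE; i++) {
        int slot = (key % TABLE_SIZE + i) % TABLE_SIZE;
        if (table[slot] == EMPTY)
            return -1;             /* an empty slot ends the search */
        if (table[slot] == key)
            return slot;
    }
    return -1;
}
```

Every slot must be set to EMPTY before the first insertion. After the seven keys are inserted the table is full, so a further call to lp_insert returns -1.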
Advantages :-
1) No extra space is required
Disadvantages:-
Example: Let us consider a simple hash function as “key mod 5” and a sequence of keys to be
inserted as 50, 70, 76, 93.
Step 1: First draw the empty hash table, which will have a possible range of hash values from
0 to 4 according to the hash function provided.
Step 2: Now insert all the keys in the hash table one by one. The first key is 50. It will map
to slot number 0 because 50%5=0. So insert it into slot number 0.
Step 3: The next key is 70. It will map to slot number 0 because 70%5=0, but 50 is already at
slot number 0, so search for the next empty slot and insert it at slot number 1.
Step 4: The next key is 76. It will map to slot number 1 because 76%5=1, but 70 is already at
slot number 1, so search for the next empty slot and insert it at slot number 2.
Step 5: The last key is 93. It will map to slot number 3 because 93%5=3. Slot number 3 is
empty, so insert it there.
2. Quadratic Probing:
In quadratic probing, we look for the i^2-th slot after the original hash location in the i-th
iteration. We always start from the original hash location, and we check the other slots only if
that location is occupied. Let hash(x) be the slot index computed using the hash function and S
be the table size.
If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
Example: Let us consider table size = 7, hash function Hash(x) = x % 7 and collision
resolution strategy f(i) = i^2. Insert 22, 30, and 50.
Step 1: Create a table of size 7.
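The remaining steps of this example can be traced with a short sketch. The name `qp_insert` is an assumption made for illustration, and -1 is assumed to mark an empty slot. Key 22 goes to slot 22 % 7 = 1 and key 30 goes to slot 30 % 7 = 2. Key 50 also hashes to slot 1, which is full; the first probe (1 + 1*1) % 7 = 2 is full as well, so the second probe (1 + 2*2) % 7 = 5 is used.

```c
#include <assert.h>

#define TABLE_SIZE 7   /* S in the text; Hash(x) = x % 7 */
#define EMPTY (-1)     /* assumed marker for an unused slot */

/* Insert key with quadratic probing, f(i) = i * i.
   Returns the slot used, or -1 if no free slot was found in
   TABLE_SIZE probes. */
int qp_insert(int table[TABLE_SIZE], int key)
{
    assert(key >= 0);
    for (int i = 0; i < TABLE_SIZE; i++) {
        int slot = (key % TABLE_SIZE + i * i) % TABLE_SIZE;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

Unlike linear probing, the quadratic probe sequence does not visit every slot, so an insertion can fail even when the table still has empty slots.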
3. Double Hashing:-
Double hashing is a collision resolution technique used in open addressing hash tables. A
collision occurs when two different keys are hashed to the same index in a hash table. This is a
problem because every slot in the hash table is supposed to store a single element.
Generally, a hashing technique consists of a hash function that takes a key and produces the
hash table index for that key. The double hashing technique uses two hash functions, hence the
name. The second hash function provides an offset value if the first hash function produces a
collision.
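A minimal sketch of double hashing follows. The two hash functions are assumptions chosen for illustration, since the text does not fix them: h1(key) = key % 7 and h2(key) = 5 - (key % 5). The i-th probe looks at slot (h1(key) + i * h2(key)) % 7. The name `dh_insert` is also made up, and -1 is assumed to mark an empty slot.

```c
#include <assert.h>

#define TABLE_SIZE 7   /* assumed prime table size */
#define PRIME 5        /* assumed prime smaller than TABLE_SIZE */
#define EMPTY (-1)     /* assumed marker for an unused slot */

/* First hash function: the home slot of the key. */
static int h1(int key) { return key % TABLE_SIZE; }

/* Second hash function: the step between probes, from 1 to PRIME,
   so it is never 0. */
static int h2(int key) { return PRIME - (key % PRIME); }

/* Insert key with double hashing.
   Returns the slot used, or -1 if no free slot was found. */
int dh_insert(int table[TABLE_SIZE], int key)
{
    assert(key >= 0);
    for (int i = 0; i < TABLE_SIZE; i++) {
        int slot = (h1(key) + i * h2(key)) % TABLE_SIZE;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

The keys 10, 17 and 24 all have the home slot 3. Key 10 takes slot 3; key 17 has the step h2(17) = 3 and lands in slot 6; key 24 has the step h2(24) = 1 and lands in slot 4. Keys that collide at the same home slot therefore follow different probe sequences.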
Comparison of separate chaining and open addressing:
2. Chaining: The hash table never fills up; we can always add more elements to a chain.
   Open addressing: The table may become full.
3. Chaining: Less sensitive to the hash function or load factors.
   Open addressing: Requires extra care to avoid clustering and a high load factor.
4. Chaining: Mostly used when it is unknown how many and how frequently keys may be
   inserted or deleted.
   Open addressing: Used when the frequency and number of keys is known.
6. Chaining: Wastage of space (some parts of the hash table are never used).
   Open addressing: A slot can be used even if an input doesn't map to it.
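//Linear probing
#include<stdio.h>
#include<stdlib.h>
#define MAX 10

//hash function (division method); -1 marks an empty slot, so num must be >= 0
int create(int num)
{
    return num%MAX;
}

//print every slot of the table with its index
void display(int a[MAX])
{
    int i;
    for(i=0;i<MAX;i++)
        printf("\n %d\t%d",i,a[i]);
    printf("\n");
}

//store num at slot key, or at the next free slot if slot key is occupied
void hash(int a[MAX],int key,int num)
{
    int i,flag=0,count=0;
    if(a[key]==-1)
        a[key]=num;
    else
    {
        //count the occupied slots to detect a full table
        i=0;
        while(i<MAX)
        {
            if(a[i]!= -1)
                count++;
            i++;
        }
        if(count==MAX)
        {
            printf("\n hashtable is full");
            display(a);
            exit(1);
        }
        //probe from key+1 to the end of the table
        for(i=key+1;i<MAX;i++)
            if(a[i]==-1)
            {
                a[i]=num;
                flag=1;
                break;
            }
        //wrap around and probe from slot 0 up to key-1
        for(i=0;i<key && flag==0;i++)
            if(a[i]==-1)
            {
                a[i]=num;
                flag=1;
                break;
            }
    }
}

int main()
{
    int a[MAX],num,key,i;
    char ch;
    printf("\nhash table\n");
    //mark every slot as empty
    for(i=0;i<MAX;i++)
        a[i]=-1;
    do
    {
        printf("\n enter the number:");
        //stop on invalid input; negative numbers would give a negative index
        if(scanf("%d",&num)!=1 || num<0)
            break;
        key=create(num);
        hash(a,key,num);
        printf("\ndo u want to continue(y/n):");
        if(scanf(" %c",&ch)!=1)
            break;
    }while(ch=='y'|| ch=='Y');
    display(a);
    return 0;
}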
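//Separate Chaining
#include<stdio.h>
#include<stdlib.h>
#define size 7
struct node
{
    int data;
    struct node *next;
};
//one chain (linked list) per slot of the hash table
struct node *chain[size];

//set every chain to empty
void init()
{
    int i;
    for(i = 0; i < size; i++)
        chain[i] = NULL;
}

//insert value into the chain selected by value % size (value must be >= 0)
void insert(int value)
{
    //create a newnode with value
    struct node *newnode = malloc(sizeof(struct node));
    if(newnode == NULL)
    {
        printf("out of memory\n");
        exit(1);
    }
    newnode->data = value;
    newnode->next = NULL;
    //calculate hash key
    int key = value % size;
    //check if chain[key] is empty
    if(chain[key] == NULL)
        chain[key] = newnode;
    //collision
    else
    {
        //add the node at the end of chain[key].
        struct node *temp = chain[key];
        while(temp->next)
        {
            temp = temp->next;
        }
        temp->next = newnode;
    }
}
//print every chain, one slot per line
void print()
{
    int i;
    for(i = 0; i < size; i++)
    {
        struct node *temp = chain[i];
        printf("chain[%d]-->",i);
        while(temp)
        {
            printf("%d -->",temp->data);
            temp = temp->next;
        }
        printf("null\n");
    }
}
int main()
{
    //init array of list to null
    init();
    insert(7);
    insert(0);
    insert(3);
    insert(10);
    insert(4);
    insert(5);
    print();
    return 0;
}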