
3. Data Storage and Querying

Overview of Physical Storage Media


Physical storage media are used for storing (writing/recording/saving) and retrieving
(reading/opening) data. A storage medium (media is the plural) is the physical material on which
data is stored. A storage device is the computer hardware that records and retrieves data to and
from a storage medium.
Advantages of Physical Storage:

1. No dependence on an internet connection: physical storage devices do not require an internet
connection to store or access data, so the data can be accessed at any time, regardless of connectivity.

The main storage media, ordered roughly from fastest to slowest, are described below.

Cache:- The cache is the fastest and most costly form of storage. Cache memory is small; its use
is managed by the computer system hardware. We shall not be concerned with managing cache
storage in the database system.

Main memory:-The storage medium used for data that are available to be operated on is main
memory. The general-purpose machine instructions operate on main memory. Although main
memory may contain many megabytes of data, or even gigabytes of data in large server systems,
it is generally too small (or too expensive) for storing the entire database. The contents of main
memory are usually lost if a power failure or system crash occurs.

Flash memory:-Also known as electrically erasable programmable read-only memory


(EEPROM), flash memory differs from main memory in that data survive power failure.

Flash memory has found popularity as a replacement for magnetic disks for storing small
volumes of data (5 to 10 megabytes) in low-cost computer systems, such as computer systems
that are embedded in other devices, in hand-held computers, and in other digital electronic
devices such as digital cameras.

Magnetic-disk storage:-The primary medium for the long-term on-line storage of data is the
magnetic disk. Usually, the entire database is stored on magnetic disk. The system must move
the data from disk to main memory so that they can be accessed. After the system has performed
the designated operations, the data that have been modified must be written to disk.

Tape storage:- Tape storage is used primarily for backup and archival data. Although magnetic
tape is much cheaper than disks, access to data is much slower, because the tape must be
accessed sequentially from the beginning.

For this reason, tape storage is referred to as sequential-access storage. In contrast, disk storage is
referred to as direct-access storage because it is possible to read data from any location on disk.
Fig: Storage Hierarchy

RAID (Redundant Arrays of Independent Disks)


RAID is a technique that makes use of a combination of multiple disks instead of using a
single disk for increased performance, data redundancy, or both. The term was coined by
David Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in
1987.
Why Data Redundancy?
Data redundancy, although it takes up extra space, adds to disk reliability. This means that, in case
of a disk failure, if the same data is also backed up onto another disk, we can retrieve the data
and carry on with the operation. On the other hand, if the data is simply spread across multiple disks
without the RAID technique, the loss of a single disk can affect all of the data.
Key Evaluation Points for a RAID System
 Reliability: How many disk faults can the system tolerate?
 Availability: What fraction of the total session time is a system in uptime mode, i.e. how
available is the system for actual use?
 Performance: How good is the response time? How high is the throughput (rate of
processing work)? Note that performance involves many parameters, not just these two.
 Capacity: Given a set of N disks each with B blocks, how much useful capacity is
available to the user?
RAID is very transparent to the underlying system. This means, to the host system, it appears
as a single big disk presenting itself as a linear array of blocks. This allows older technologies
to be replaced by RAID without making too many changes to the existing code.
Types of RAID
The various types of RAID are explained below −
RAID-0
RAID Level-0 is not redundant. Since no redundant information is stored, performance is very
good, but the failure of any disk in the array results in data loss. A single record is divided into
strips (typically 512 bytes) and stored across all disks. The record can be accessed quickly by
reading all disks at the same time; this is called striping.
Example:

Striping with one block per strip:

Disk 0   Disk 1   Disk 2   Disk 3
15       16       17       18
19       20       21       22
23       24       25       26
27       28       29       30

Striping with two blocks per strip:

Disk 0   Disk 1   Disk 2   Disk 3
15       17       19       21
16       18       20       22
23       25       27       29
24       26       28       30

In the first table, blocks 15, 16, 17 and 18 form a stripe. Instead of placing just one block on a
disk at a time, we can also place two (or more) consecutive blocks on a disk before moving on to
the next one, as the second table shows. Since there is no mirroring of data at this level, the data
is lost in case of a disk failure.
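To make the mapping concrete, below is a minimal Python sketch (not part of the original material; the function name and the renumbering of block 15 as logical block 0 are assumptions) that computes which disk and which position a logical block lands on, for both layouts shown above.

```python
# A minimal sketch of how a RAID-0 controller could map a logical block number
# to a physical (disk, strip) location. The strip size in blocks is a parameter:
# 1 reproduces the first table, 2 the second.

def raid0_location(block: int, num_disks: int, blocks_per_strip: int = 1):
    """Return (disk index, block position on that disk) for a logical block."""
    strip = block // blocks_per_strip          # which strip the block falls in
    disk = strip % num_disks                   # strips go round-robin over the disks
    strip_on_disk = strip // num_disks         # how many strips this disk already holds
    offset = strip_on_disk * blocks_per_strip + block % blocks_per_strip
    return disk, offset

# Example: logical blocks 15..30 over 4 disks, as in the tables above
# (block 15 is the first block written, so it is renumbered as logical block 0).
for logical, label in enumerate(range(15, 31)):
    disk, off = raid0_location(logical, num_disks=4, blocks_per_strip=1)
    print(f"block {label} -> disk {disk}, position {off}")
```

Running the loop with blocks_per_strip=2 instead reproduces the second table.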
RAID 1
This level uses what is called ‘mirroring of data’. This means it uses its multiple disks to store
exactly the same content through duplication of data. You could also say it uses half of the
available space to store an exact copy of the data.
Example:
Disk 0   Disk 1   Disk 2   Disk 3
S        S        T        T
U        U        V        V
W        W        X        X
Y        Y        Z        Z

In the above table, there is duplication of data. Hence only half of the space is utilized to store
data.
Pros
 It provides 100% redundancy thus making the system fault tolerant.
 In case of a single drive failure, you have your data safely stored in the other drive.
 This also means the RAID drive group will function even in case of a single drive failure.
Cons
 Because of mirroring (duplication of data), the stored data occupies double the space, so only
half of the storage capacity can actually be used.
 Since any data requires twice the amount of storage, it turns out to be expensive.
RAID 2

RAID 2 is rarely used. This level stripes data at the bit level, and each bit is stored on a separate
drive. It requires dedicated disks for storing the ECC (error-correcting code) of the data. The level
uses the Hamming code for error correction, which makes it complex and expensive.

Pros
 It uses dedicated drives, so data is stored uniformly across them.
 It detects errors through the Hamming code.
 It can be a good answer to data-security problems.
Cons
 It uses extra drives for error detection.
 The need for the Hamming code makes it inconvenient for commercial use.

RAID 3
The RAID 3 level stripes data at the byte level. It requires a separate parity disk which stores the
parity information for each byte. When a disk fails, data can be recovered with the help of the
corresponding parity bytes and then written to a new disk. It also has a high read speed.
Example:

Disk 0   Disk 1   Disk 2   Disk 3
M        N        O        P (M, N, O)
R        S        T        P (R, S, T)
U        V        W        P (U, V, W)
X        Y        Z        P (X, Y, Z)

Pros
 It enables high-speed transmission of data.
 In case of a disk failure, data can be reconstructed using the corresponding parity bytes.
 Data can be accessed in parallel.
 It is suited to situations where a few users access large files.
Cons
 It needs an extra disk to store the parity bytes.
 Its performance is poor for small files.
 It is not a reliable or cheap solution to every storage problem.
RAID 4
RAID 4 is quite a popular level. It is similar to RAID 0 and RAID 3 in a few ways. Like RAID 0,
it uses block-level data striping; like RAID 3, it uses a dedicated parity disk. Combining both
features, RAID 4 stripes data at the block level and stores the corresponding parity blocks on the
parity disk. In case of a single disk failure, the lost data can be recovered using the parity disk and
the remaining disks.
Example:

Disk 0   Disk 1   Disk 2   Disk 3
M        N        O        P0
R        S        T        P1
U        V        W        P2
X        Y        Z        P3

Here Disk 3 is the dedicated parity disk and P0 to P3 are the parity blocks. Using the above table,
parity can be calculated with the XOR function, as shown below.


T1   T2   T3   T4   Parity
0    1    0    0    1
0    0    1    1    0

If T4 is lost, it can be recovered from the parity bit and the remaining columns.
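As an illustration, the following short Python sketch (hypothetical, not from the text) computes the XOR parity for the two rows of the table above and shows how a lost T4 column is rebuilt from the parity and the surviving columns.

```python
# A minimal sketch showing how RAID-4 parity is computed with XOR and how a
# lost column is rebuilt from the parity and the remaining columns.

def xor_parity(*values: int) -> int:
    """Parity of the data values: the bitwise XOR of all of them."""
    parity = 0
    for v in values:
        parity ^= v
    return parity

# The two rows of the table above (T1, T2, T3, T4):
rows = [(0, 1, 0, 0), (0, 0, 1, 1)]
for t1, t2, t3, t4 in rows:
    p = xor_parity(t1, t2, t3, t4)
    # If T4 is lost, XOR-ing the parity with the surviving columns recovers it.
    recovered_t4 = xor_parity(p, t1, t2, t3)
    print(f"parity={p}, recovered T4={recovered_t4}, original T4={t4}")
```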
Pros
 In case of a single disk failure, the lost data is recovered from the parity disk.
 It can be useful for large files.
Cons
 It does not solve the problem of more than one disk failing at a time.
 The level needs at least 3 disks as well as hardware support for the parity calculations.
 It can be slow for small files, since every write must also update the parity disk.

RAID 5
RAID 5 is one of the most popular levels, especially for systems with three or more drives. It has
some similarity with RAID 4: it too uses parity, but in a distributed way. It stripes data at the
block level across all drives, with the parity blocks rotating from drive to drive.
Example:
Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
10       11       12       13       P0
15       16       17       P1       14
20       21       P2       18       19
25       P3       22       23       24
P4       26       27       28       29

The table above shows the rotation of the parity blocks.
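The rotation can also be expressed programmatically. The Python sketch below is an assumption based on the table above: it uses the left-symmetric placement that the table implies, putting the parity block of stripe i on disk (N - 1 - i) mod N and wrapping the data blocks around it.

```python
# A minimal sketch of the parity rotation shown in the table above.

def raid5_stripe(stripe: int, num_disks: int, first_block: int):
    """Return a list of length num_disks: data block numbers plus one 'P<i>' entry."""
    parity_disk = (num_disks - 1 - stripe) % num_disks
    layout = [None] * num_disks
    layout[parity_disk] = f"P{stripe}"
    block = first_block
    for step in range(1, num_disks):               # data starts just after the parity disk
        disk = (parity_disk + step) % num_disks    # and wraps around the array
        layout[disk] = str(block)
        block += 1
    return layout

# Reproduce the 5-disk table above, starting at data block 10.
first = 10
for stripe in range(5):
    print(raid5_stripe(stripe, num_disks=5, first_block=first))
    first += 4   # four data blocks per stripe
```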


Pros
 This level is known for distributed parity among the various disks in the group.
 It shows good performance without being expensive.
 It uses the equivalent of only one disk's capacity for parity (one-fifth in the five-disk example
above), leaving the rest of the capacity for storing data.
Cons
 The recovery of data takes longer because the parity is distributed among all the disks.
 It cannot recover the data if more than one disk fails.

RAID 6
RAID 6 is also known as double-parity RAID. It is an enhanced version of RAID 5. It stripes
data at the block level and stores two corresponding parity blocks, distributed across all of the
disks. In other words, it works like RAID 5 but adds a second, independent parity block to each
stripe, so it can survive the failure of any two disks.
Example:
Disk 1   Disk 2   Disk 3   Disk 4
J0       K0       Q0       P0
J1       Q1       P1       M1
Q2       P2       L2       M2
P3       K3       L3       Q3

Here P and Q are the two parity blocks of each stripe.

Pros
 It can help you in case of 2 simultaneous disk failures.
 It requires a minimum of 4 drives.
Cons
 The equivalent of two drives' capacity is used for parity, so with the minimum of four drives
only half of the capacity stores data.
 It needs to write two parity blocks for every write and hence is slower than RAID 5.
 It has inadequate adaptability.

Conclusion
There are a number of other RAID levels too, such as RAID 10, RAID 5EE, RAID 50 and
RAID 60. Each RAID level combines different qualities and can be judged on the basis of
redundancy, read performance, write performance, the minimum number of disks required and
how the disk capacity is used. The best RAID level for you depends on the storage space,
performance and reliability you are looking for. Generally, a larger number of drives results in
better performance. Each RAID level has its own set of advantages and disadvantages, so you
have to decide whether you are looking for safety, speed or storage space.
Summary
Given below is the summary of all the types of RAID −
Levels   Summary

RAID-0   It is the fastest and most efficient array type but offers no fault tolerance.

RAID-1   It is the array of choice for a critical, fault-tolerant environment.

RAID-2   It is seldom used today because ECC is embedded in almost all modern disk drives.

RAID-3   It is used in single-user environments which access long sequential records, to speed up
         data transfer.

RAID-4   It offers no advantages over RAID-5 and does not support multiple simultaneous write
         operations.

RAID-5   It is the best choice in a multi-user environment. However, at least three drives are
         required for a RAID-5 array.

File Organization
 A file is a collection of records. Using the primary key, we can access the records. The
type and frequency of access are determined by the type of file organization used for a
given set of records.
 File organization is a logical relationship among the various records. It defines
how file records are mapped onto disk blocks.
 File organization describes the way in which the records are stored in terms of
blocks, and the way the blocks are placed on the storage medium.

Objective of file organization

 Optimal selection of records, i.e., records can be selected as quickly as possible.
 Insert, delete and update operations on the records should be quick and easy.
 Duplicate records should not be introduced as a result of insert, update or delete operations.
 Records should be stored efficiently, so that the cost of storage is minimal.
Types of file organization:

File organization can be done by various methods. Each method has its pros and cons with
respect to access and selection, and the programmer chooses the method best suited to the
requirements.

Types of file organization are as follows:

Sequential File Organization


This method is the easiest method for file organization. In this method, files are stored
sequentially. This method can be implemented in two ways:

1. Pile File Method:

 It is quite a simple method. In this method, we store the records in a sequence, i.e., one after
another, in the order in which they are inserted into the table.
 To update or delete a record, the record is first searched for in the memory blocks. When it is
found, it is marked for deletion and the new record is inserted.

Insertion of the new record:


Suppose we have records R1, R3 and so on up to R9 and R8 stored in a sequence (a record here is
simply a row in a table). If we want to insert a new record R2 into the sequence, it is placed at the
end of the file.

2. Sorted File Method:


 In this method, the new record is always inserted at the end of the file, and the sequence is
then sorted in ascending or descending order. Sorting of the records is based on the primary key
or some other key.
 When a record is modified, it is updated and the file is sorted again, so that the updated record
ends up in the right place.

Insertion of the new record:


Suppose there is a pre-existing sorted sequence of records R1, R3 and so on up to R6 and R7. If a
new record R2 has to be inserted into the sequence, it is first inserted at the end of the file, and
the sequence is then sorted.
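A tiny Python sketch of this idea (the record names and the numeric key extracted from them are assumptions for illustration): the new record is appended at the end and the sequence is then sorted on its key.

```python
# Sorted file method: append the new record, then re-sort the sequence on the key.
records = ["R1", "R3", "R4", "R5", "R6", "R7"]   # existing sorted sequence
records.append("R2")                              # the new record goes to the end first
records.sort(key=lambda r: int(r[1:]))            # then the sequence is sorted on the key
print(records)   # ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7']
```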

Heap file organization


 It is the simplest and most basic type of organization. It works with data blocks. In heap file
organization, the records are inserted at the end of the file. When records are inserted, no sorting
or ordering of the records is required.
 When a data block is full, the new record is stored in some other block. This new data block
need not be the very next data block; the DBMS can select any data block in memory to store the
new record. The heap file is also known as an unordered file.
 In the file, every record has a unique id, and every page in the file is of the same size. It is the
DBMS's responsibility to store and manage the new records.

Insertion of a new record


Suppose we have five records R1, R3, R6, R4 and R5 in a heap, and we want to insert a new
record R2 into the heap. If data block 3 is full, the record is inserted into any other data block
selected by the DBMS, say data block 1.
If we want to search, update or delete data in heap file organization, we need to traverse the data
from the start of the file until we get the requested record.

If the database is very large, then searching for, updating or deleting a record is time-consuming,
because there is no sorting or ordering of the records. In the heap file organization, we need to
check all the data until we get the requested record.
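The following minimal Python sketch (with an assumed two-block heap layout) illustrates this linear traversal: every block is scanned from the start of the file until the requested record is found.

```python
# A minimal sketch of searching a heap file: records are unordered, so every
# block is scanned from the start of the file until the requested record appears.

heap_blocks = [          # each inner list stands for one data block
    ["R1", "R3", "R6"],
    ["R4", "R5", "R2"],  # R2 was placed wherever the DBMS found free space
]

def heap_search(blocks, wanted):
    for block_no, block in enumerate(blocks):      # traverse from the start of the file
        for record in block:
            if record == wanted:
                return block_no, record
    return None                                    # reached end of file without a match

print(heap_search(heap_blocks, "R2"))   # (1, 'R2')
```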

Hash File Organization


Hash file organization uses the computation of a hash function on some fields of the records. The
hash function's output determines the location of the disk block where the record is to be placed.

When a record has to be retrieved using the hash key columns, the address is generated and the
whole record is retrieved using that address. In the same way, when a new record has to be
inserted, the address is generated using the hash key and the record is directly inserted there. The
same process is applied in the case of delete and update.

In this method, there is no need to search or sort the entire file, and the records are stored at
effectively random locations in memory.
B+ File Organization
 B+ tree file organization is an advanced form of the indexed sequential access method. It uses
a tree-like structure to store the records in the file.
 It uses the same concept of a key index, where the primary key is used to sort the records. For
each primary key, an index value is generated and mapped to the record.
 The B+ tree is similar to a binary search tree (BST), but a node can have more than two
children. In this method, all the records are stored only at the leaf nodes. Intermediate nodes act
as pointers to the leaf nodes and do not contain any records.

The above B+ tree shows that:

 There is one root node of the tree, i.e., 25.
 There is an intermediary layer of nodes. These nodes do not store the actual records; they have
only pointers to the leaf nodes.
 The nodes to the left of the root node contain values smaller than the root, and the nodes to the
right contain values greater than the root, i.e., 15 and 30 respectively.
 The leaf nodes contain only the data values, i.e., 10, 12, 17, 20, 24, 27 and 29.
 Searching for any record is easier as all the leaf nodes are balanced.
 In this method, any record can be reached by traversing a single path and accessed easily.

Pros of B+ tree file organization

 In this method, searching becomes very easy as all the records are stored only in the leaf nodes
and the leaf nodes are kept sorted in a sequential linked list.
 Traversing the tree structure is easier and faster.
 The size of the B+ tree has no restriction, so the number of records can increase or decrease
and the B+ tree structure can grow or shrink accordingly.
 It is a balanced tree structure, and inserts, updates and deletes do not affect the performance of
the tree.
Cons of B+ tree file organization
 This method is inefficient for static data that rarely changes.

Indexed sequential access method (ISAM)


The ISAM method is an advanced form of sequential file organization. In this method, records
are stored in the file using the primary key. An index value is generated for each primary key and
mapped to the record. The index contains the address of the record in the file.

If any record has to be retrieved based on its index value, then the address of the data block is
fetched and the record is retrieved from the memory.

Cluster file organization


 When records of two or more tables are stored in the same file, it is known as a cluster. These
files have two or more tables in the same data block, and the key attributes which are used to map
these tables together are stored only once.
 This method reduces the cost of searching for related records in different files.
 The cluster file organization is used when there is a frequent need to join tables on the same
condition, and these joins return only a few records from both tables. For example, we may be
retrieving the records for only particular departments; this method is not suitable for retrieving
the records of all departments at once.
In this method, we can directly insert, update or delete any record. Data is sorted based on the
key with which searching is done. The cluster key is the key on which the joining of the tables is
performed.

Types of Cluster file organization:

Cluster file organization is of two types:

1. Indexed Clusters:

In an indexed cluster, records are grouped based on the cluster key and stored together. The
EMPLOYEE and DEPARTMENT relationship above is an example of an indexed cluster, in
which all the records are grouped based on the cluster key DEP_ID.

2. Hash Clusters:

It is similar to the indexed cluster. In a hash cluster, instead of storing the records based on the
cluster key directly, we generate a hash value of the cluster key and store together the records
whose cluster keys have the same hash value.

Indexing in DBMS
 Indexing is used to optimize the performance of a database by minimizing the number of disk
accesses required when a query is processed.
 The index is a type of data structure. It is used to locate and access the data in a database table
quickly.

Index structure:

Indexes can be created using some columns of the database.

 The first column of the index is the search key. It contains a copy of the primary key or a
candidate key of the table. These values are stored in sorted order so that the corresponding data
can be accessed easily.
 The second column of the index is the data reference. It contains a set of pointers holding the
address of the disk block where the value of the particular key can be found.
Indexing Methods

Ordered indices

The indices are usually sorted to make searching faster. The indices which are sorted are known
as ordered indices.

Example: Suppose we have an employee table with thousands of records, each of which is 10
bytes long. The IDs start with 1, 2, 3 and so on, and we have to search for the employee with ID 543.

 In the case of a database with no index, we have to scan the disk blocks from the start until we
reach ID 543. The DBMS will have read 543*10 = 5430 bytes before it finds the record.
 In the case of an index, assuming each index entry is 2 bytes, the DBMS finds the record after
reading only 542*2 = 1084 bytes, which is far less than in the previous case.

Primary Index

 If the index is created on the basis of the primary key of the table, it is known as primary
indexing. Primary keys are unique to each record, so there is a 1:1 relation between index entries
and records.
 As primary keys are stored in sorted order, the performance of the searching operation is quite
efficient.
 The primary index can be classified into two types: Dense index and Sparse index.

Dense index

 The dense index contains an index record for every search key value in the data file. This makes
searching faster.
 The number of records in the index table is the same as the number of records in the main
table.
 It needs more space to store the index records themselves. Each index record holds the search
key value and a pointer to the actual record on the disk.
Sparse index
 In a sparse index, index records appear for only some of the search key values. Each index
record points to a block.
 Instead of pointing to each record in the main table, the index points to records in the main
table at intervals (gaps).
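To contrast the two, here is a small Python sketch (the data values, block layout and helper names are assumptions): the dense index answers directly from its per-key entries, while the sparse index first finds the largest indexed key less than or equal to the search key and then scans that block.

```python
# A minimal sketch contrasting a dense and a sparse index over a sorted data file.
import bisect

blocks = [[100, 101, 102], [103, 104, 105], [106, 107, 108]]   # sorted data file
dense_index = {key: b for b, block in enumerate(blocks) for key in block}
sparse_index = [(block[0], b) for b, block in enumerate(blocks)]  # first key of each block

def sparse_lookup(key):
    keys = [k for k, _ in sparse_index]
    pos = bisect.bisect_right(keys, key) - 1       # largest indexed key <= search key
    if pos < 0:
        return None
    block = blocks[sparse_index[pos][1]]
    return key if key in block else None           # scan inside the block

print(dense_index[104])        # dense: direct answer, the key lives in block 1
print(sparse_lookup(104))      # sparse: finds block via entry 103, then scans -> 104
```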

Clustering Index

 A clustered index can be defined as an ordered data file. Sometimes the index is created on
non-primary-key columns, which may not be unique for each record.
 In this case, to identify the records faster, we group two or more columns to get a unique value
and create an index out of them. This method is called a clustering index.
 The records which have similar characteristics are grouped, and indexes are created for these
groups.

Example: suppose a company has several employees in each department. If we use a clustering
index, all employees who belong to the same Dept_ID are considered to be within a single
cluster, and the index pointers point to the cluster as a whole. Here Dept_ID is a non-unique key.
The previous scheme is a little confusing because one disk block may be shared by records which
belong to different clusters. Using a separate disk block for each cluster is considered a better
technique.

Secondary Index

In sparse indexing, as the size of the table grows, the size of the mapping also grows. These
mappings are usually kept in primary memory so that address fetches are fast; the secondary
memory is then searched for the actual data based on the address obtained from the mapping. If
the mapping size grows, fetching the address itself becomes slower, and the sparse index is no
longer efficient. To overcome this problem, secondary indexing is introduced.

In secondary indexing, to reduce the size of the mapping, another level of indexing is introduced.
In this method, a huge range of column values is selected initially, so that the mapping size of the
first level stays small. Each range is then further divided into smaller ranges. The mapping of the
first level is stored in primary memory, so that address fetches are faster. The mapping of the
second level and the actual data are stored in secondary memory (hard disk).
For example:

 If you want to find the record of roll 111 in the diagram, the search first looks for the highest
entry that is smaller than or equal to 111 in the first-level index, which gives 100.
 Then, in the second-level index, it again finds the highest entry smaller than or equal to 111 and
gets 110. Using the address for 110, it goes to the data block and scans each record until it finds
111.
 This is how a search is performed in this method. Inserting, updating or deleting is done in the
same manner.

B+ Tree
o The B+ tree is a balanced search tree in which a node can have more than two children. It
follows a multi-level index format.
o In the B+ tree, leaf nodes denote actual data pointers. The B+ tree ensures that all leaf nodes
remain at the same height.
o In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can support
random access as well as sequential access.

Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of
the order n where n is fixed for every B+ tree.
o It contains an internal node and leaf node.

Internal node
o An internal node of the B+ tree can contain at least n/2 child pointers, except the root node.
o At most, an internal node of the tree contains n pointers.
Leaf node
o A leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key values.
o At most, a leaf node contains n record pointers and n key values.
o Every leaf node of the B+ tree contains one block pointer P that points to the next leaf node.

Searching a record in B+ Tree


Suppose we have to search for 55 in the B+ tree structure below. First, we go to the intermediary
node, which directs us to the leaf node that may contain a record for 55.

In the intermediary node, we follow the branch between 50 and 75. At the end, we are redirected
to the third leaf node, where the DBMS performs a sequential search to find 55.
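A minimal Python sketch of this search is shown below. The node layout (keys 50 and 75 in the intermediary node, and the particular leaf contents) is assumed for illustration and is not the exact figure from the text.

```python
# A minimal sketch of searching a two-level B+ tree: the intermediary node with
# keys (50, 75) routes the search, and the matching leaf is scanned sequentially.

intermediary = {
    "keys": [50, 75],
    "children": [            # leaf nodes, already sorted and linked in order
        [30, 35, 40, 45],    # keys < 50
        [50, 55, 65, 70],    # 50 <= keys < 75
        [75, 80, 85, 90],    # keys >= 75
    ],
}

def bplus_search(node, key):
    # Choose the child branch: stop at the first routing key greater than the search key.
    child = 0
    while child < len(node["keys"]) and key >= node["keys"][child]:
        child += 1
    leaf = node["children"][child]
    return key in leaf        # sequential search within the leaf

print(bplus_search(intermediary, 55))   # True: branch between 50 and 75, then scan the leaf
```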

B+ Tree Insertion
Suppose we want to insert a record 60 in the structure below. It will go to the 3rd leaf node, after
55. Since this is a balanced tree and that leaf node is already full, we cannot insert 60 there.

In this case, we have to split the leaf node, so that 60 can be inserted into the tree without
affecting the fill factor, balance and order.
The 3rd leaf node would then hold the values (50, 55, 60, 65, 70), and its current root (parent) key
is 50. We split the leaf node of the tree in the middle so that its balance is not altered. So we can
group (50, 55) and (60, 65, 70) into two leaf nodes.

If these two have to be leaf nodes, the intermediate node cannot branch from 50 alone. It should
have 60 added to it, and then we can have a pointer to the new leaf node.

This is how we can insert an entry when there is overflow. In a normal scenario, it is very
easy to find the node where it fits and then place it in that leaf node.

B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from
the intermediate node as well as from the 4th leaf node. If we simply remove it from the
intermediate node, the tree will no longer satisfy the rules of the B+ tree, so we need to modify it
to keep a balanced tree.

After deleting node 60 from the above B+ tree and re-arranging the nodes, the resulting tree is
again a valid B+ tree.
Hashing in DBMS
In a huge database structure, it is very inefficient to search through all the index values to reach
the desired data. The hashing technique is used to calculate the direct location of a data record on
the disk without using an index structure.

In this technique, data is stored in the data blocks whose addresses are generated by using the
hashing function. The memory location where these records are stored is known as a data bucket
or data block.

In this technique, the hash function can use any column value to generate the address. Most of
the time, the hash function uses the primary key to generate the address of the data block. The
hash function can be anything from a simple mathematical function to a complex one. We can
even use the primary key itself as the address of the data block, which means each row is stored
in the data block whose address equals its primary key.
The diagram above shows data block addresses that are the same as the primary key values. The
hash function can also be a simple mathematical function like exponential, mod, cos, sin, etc.
Suppose we use the mod(5) hash function to determine the addresses of the data blocks. In this
case, applying mod(5) to the primary keys generates the addresses 3, 3, 1, 4 and 2 respectively,
and the records are stored at those data block addresses.
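As a small illustration, the Python sketch below applies the mod(5) hash function to a set of assumed EMP_ID values (chosen so that they produce the addresses 3, 3, 1, 4 and 2 mentioned above) and places each record in the corresponding data bucket.

```python
# A minimal sketch of hashing with mod(5): each record goes to the data bucket
# whose address is primary_key % 5.

def hash_address(primary_key: int, buckets: int = 5) -> int:
    return primary_key % buckets

primary_keys = [103, 108, 106, 104, 107]   # assumed EMP_IDs for illustration
data_buckets = {b: [] for b in range(5)}

for key in primary_keys:
    data_buckets[hash_address(key)].append(key)

print({b: recs for b, recs in data_buckets.items() if recs})
# 103 % 5 = 3, 108 % 5 = 3, 106 % 5 = 1, 104 % 5 = 4, 107 % 5 = 2
```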
Types of Hashing:

Static Hashing
In static hashing, the resultant data bucket address is always the same. That means that if we
generate an address for EMP_ID = 103 using the hash function mod(5), it will always result in
the same bucket address, 3. Here, there is no change in the bucket address.

Hence in this static hashing, the number of data buckets in memory remains constant
throughout. In this example, we will have five data buckets in the memory used to store
the data.
Operations of Static Hashing
o Searching a record

When a record needs to be searched, then the same hash function retrieves the address
of the bucket where the data is stored.

o Insert a Record

When a new record is inserted into the table, we generate an address for the new record based on
the hash key, and the record is stored at that location.

o Delete a Record

To delete a record, we will first fetch the record which is supposed to be deleted. Then
we will delete the records for that address in memory.

o Update a Record

To update a record, we will first search it using a hash function, and then the data record
is updated.

If we want to insert a new record into the file but the data bucket address generated by the hash
function is not empty (data already exists at that address), the situation is known as bucket
overflow. This is a critical situation in this method.

To overcome this situation, there are various methods. Some commonly used methods
are as follows:

1. Open Hashing
When the hash function generates an address at which data is already stored, the next available
bucket is allocated to the new record. This mechanism is called linear probing.

For example: suppose R3 is a new record which needs to be inserted, and the hash function
generates address 112 for R3. But the bucket at that address is already full, so the system
searches for the next available data bucket, 113, and assigns R3 to it.
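A minimal sketch of linear probing in Python (the bucket numbers are taken from the example; the capacity limit and function name are assumptions):

```python
# Open hashing / linear probing: if the generated address is occupied,
# probe forward until a free data bucket is found.

buckets = {110: "R1", 111: "R2", 112: "R4"}   # bucket 112 is already full

def insert_linear_probing(buckets, address, record, max_address=200):
    while address <= max_address and address in buckets:
        address += 1                  # probe the next bucket
    buckets[address] = record
    return address

print(insert_linear_probing(buckets, 112, "R3"))   # R3 ends up in bucket 113
```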

2. Close Hashing
When a bucket is full, a new data bucket is allocated for the same hash result and is linked after
the previous one. This mechanism is known as overflow chaining.

For example: suppose R3 is a new record which needs to be inserted into the table, and the hash
function generates address 110 for it. But that bucket is too full to store the new data. In this
case, a new bucket is created at the end of the chain for address 110 and is linked to it.
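A minimal sketch of overflow chaining in Python (the bucket capacity of 2 and the record names are assumptions):

```python
# Close hashing / overflow chaining: every hash address keeps a chain of buckets,
# and a new bucket is linked after a full one.

BUCKET_CAPACITY = 2
table = {110: [["R1", "R2"]]}        # bucket 110 is already full

def insert_chaining(table, address, record):
    chain = table.setdefault(address, [[]])
    if len(chain[-1]) >= BUCKET_CAPACITY:
        chain.append([])             # allocate a new overflow bucket and link it
    chain[-1].append(record)

insert_chaining(table, 110, "R3")
print(table[110])                    # [['R1', 'R2'], ['R3']]
```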
Dynamic Hashing
o The dynamic hashing method is used to overcome the problems of static hashing, such as
bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases or decreases.
This method is also known as the extendable (extendible) hashing method.
o This method makes hashing dynamic, i.e., it allows insertions and deletions without resulting
in poor performance.

How to search a key


o First, calculate the hash address of the key.
o Check how many bits are currently used in the directory; this number of bits is called i.
o Take the least significant i bits of the hash address. This gives an index into the directory.
o Now, using the index, go to the directory and find the address of the bucket where the record
might be.
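The lookup can be sketched in a few lines of Python. The directory and bucket contents below mirror the example given further down; the function and variable names are assumptions.

```python
# Extendible hashing lookup: use the last i bits of the hash address as the
# directory index, then follow the pointer to the bucket.

i = 2                                        # number of directory bits
directory = {                                # index (last i bits) -> bucket name
    0b00: "B0",   # keys 2, 4
    0b01: "B1",   # keys 5, 6
    0b10: "B2",   # keys 1, 3
    0b11: "B3",   # key 7
}
buckets = {"B0": [2, 4], "B1": [5, 6], "B2": [1, 3], "B3": [7]}

def lookup(hash_address: int, key: int):
    index = hash_address & ((1 << i) - 1)    # least significant i bits
    bucket = directory[index]
    return key if key in buckets[bucket] else None

# Key 9 has hash address 10001; its last two bits are 01, so bucket B1 is checked.
print(lookup(0b10001, 9))                    # None: 9 is not there yet (B1 would split on insert)
```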

How to insert a new record


o Firstly, you have to follow the same procedure for retrieval, ending up in some bucket.
o If there is still space in that bucket, then place the record in it.
o If the bucket is full, then we will split the bucket and redistribute the records.
For example:
Consider the following grouping of keys into buckets, depending on the last bits of their hash
address:

The last two bits of the hash addresses of keys 2 and 4 are 00, so they go into bucket B0. The last
two bits for 5 and 6 are 01, so they go into bucket B1. The last two bits for 1 and 3 are 10, so they
go into bucket B2. The last two bits for 7 are 11, so it goes into B3.

Insert key 9 with hash address 10001 into the above structure:

o Since key 9 has hash address 10001, whose last two bits are 01, it must go into bucket B1. But
bucket B1 is full, so it will get split.
o The splitting separates 5 and 9 from 6: the last three bits of the hash addresses of 5 and 9 are
001, so they go into bucket B1, while the last three bits for 6 are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100 directory
entries, because the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110 directory
entries, because the last two bits of both entries are 10.
o Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 directory entries, because
the last two bits of both entries are 11.

Advantages of dynamic hashing


o In this method, the performance does not decrease as the data in the system grows. The
structure simply increases its memory size to accommodate the data.
o In this method, memory is well utilized, as it grows and shrinks with the data. There will not be
any memory lying unused.
o This method is good for dynamic databases where data grows and shrinks frequently.
Disadvantages of dynamic hashing
o In this method, if the data size increases, the number of buckets also increases, and the
addresses of the data have to be maintained in the bucket address table. Because the data
addresses keep changing as buckets grow and shrink, maintaining the bucket address table
becomes tedious if there is a huge increase in data.
o The bucket overflow situation can also occur here, although it generally takes longer to reach
this situation than in static hashing.
