UNIT-5
Storage System in DBMS
A database system presents users with a high-level view of the stored data. Physically,
however, the data is stored as bits and bytes on different storage devices.
In this section, we will take an overview of various types of storage devices that are used
for accessing and storing data.
Types of Data Storage
Several types of storage are available for storing data. They differ from one another in
speed and accessibility. The following types of storage devices are used for storing
data:
o Primary Storage
o Secondary Storage
o Tertiary Storage
Primary Storage
It is the storage area that offers the quickest access to the stored data. Primary storage
is also known as volatile storage because this type of memory does not store data
permanently. As soon as the system suffers a power cut or a crash, the data is lost. Main
memory and cache are the types of primary storage.
o Main Memory: It is the memory that holds the data the computer is currently
operating on. Every instruction of the machine works on data in main memory.
This type of memory can store gigabytes of data on a system, but it is usually
too small to hold an entire database. Finally, the main memory loses its whole
content if the system shuts down because of power failure or other reasons.
o Cache: It is one of the costliest storage media, but it is also the fastest
one. A cache is a small storage medium that is usually managed by the computer
hardware. When designing algorithms and data structures for query processing,
designers take cache effects into account.
Secondary Storage
Secondary storage is also called online storage. It is the storage area that allows the
user to save and store data permanently. This type of memory does not lose data
because of a power failure or system crash, which is why it is also called non-volatile
storage. The following secondary storage media are available in almost every type of
computer system:
o Flash Memory: Flash memory is used, for example, in USB (Universal Serial Bus)
keys that plug into the USB slots of a computer system. These USB keys help
transfer data to a computer system and come in different capacities. Unlike
main memory, flash memory retains the stored data after a power cut or other
failure. This type of storage is also commonly used in server systems for
caching frequently used data. This improves performance, and flash memory can
hold larger amounts of data than the main memory.
o Magnetic Disk Storage: This type of storage medium is also known as online
storage. A magnetic disk is used for storing data for a long time and is
capable of storing an entire database. The computer system must move data from
the disk to the main memory before the data can be accessed. Likewise, if the
system performs any operation that modifies the data, the modified data must be
written back to the disk. The great strength of a magnetic disk is that its
data survives a system crash or power failure; a failure of the disk itself,
however, can destroy the stored data.
Tertiary Storage
It is the storage type that is external to the computer system. It has the slowest speed,
but it is capable of storing a large amount of data. It is also known as offline storage.
Tertiary storage is generally used for data backup. The following tertiary storage
devices are available:
o Optical Storage: Optical storage can hold megabytes or gigabytes of data. A
Compact Disk (CD) can store 700 megabytes of data, with a playtime of around 80
minutes. A Digital Video Disk (DVD) can store 4.7 or 8.5 gigabytes of data on
each side of the disk.
o Tape Storage: Tape is a cheaper storage medium than disk. Generally, tapes are
used for archiving or backing up data. Tape provides slow access to data because
it must be read sequentially from the start. Thus, tape storage is also known as
sequential-access storage. Disk storage, in contrast, is known as direct-access
storage because data can be read directly from any location on the disk.
Storage Hierarchy
Besides the above, various other storage devices reside in the computer system. These
storage media are classified by data access speed, by the cost per unit of data of the
medium, and by the medium's reliability. Thus, we can arrange the storage media in a
hierarchy on the basis of cost and speed.
Arranging the storage media described above according to speed and cost gives the
hierarchy shown in the image below:
In the image, the higher levels are expensive but fast. On moving down, the cost per bit
decreases and the access time increases. Also, the storage media from main memory
upward are volatile, and all media below main memory are non-volatile.
SECONDARY STORAGE DEVICES:
Redundant Array of Independent Disks
RAID, or Redundant Array of Independent Disks, is a technology that
connects multiple secondary storage devices and uses them as a
single storage medium.
RAID consists of an array of disks in which multiple disks are
connected together to achieve different goals. RAID levels define
how the disk array is used.
RAID 0
In this level, a striped array of disks is implemented. The data is
broken down into blocks and the blocks are distributed among the
disks. Each disk receives a block of data to write or read in
parallel, which enhances the speed and performance of the storage
device. There is no parity and no redundancy in level 0, so the
failure of any one disk loses data.
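The block distribution described above can be sketched in Python. This is only an illustration under an assumed round-robin layout; the function name `stripe_location` is hypothetical and does not describe any particular RAID controller.

```python
def stripe_location(block_no, num_disks):
    """Map a logical block number to (disk index, block offset on that disk).

    Assumes simple round-robin striping, one block per disk per stripe.
    """
    return block_no % num_disks, block_no // num_disks


# Logical blocks 0..5 spread over a 3-disk array
layout = [stripe_location(block, 3) for block in range(6)]
print(layout)
```

Consecutive blocks land on different disks, which is what allows them to be read or written in parallel.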
RAID 1
RAID 1 uses mirroring. When data is sent to a RAID controller, it
sends a copy of the data to all the disks in the array. RAID level 1
is therefore also called mirroring, and it provides 100% redundancy
in case of a failure.
RAID 2
RAID 2 records an Error Correction Code (ECC), based on the Hamming
code, for its data, which is striped across different disks. As in
level 0 the data is striped, but here each data bit in a word is
recorded on a separate disk, and the ECC codes of the data words are
stored on a different set of disks. Due to its complex structure and
high cost, RAID 2 is not commercially available.
RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated
for each data word is stored on a separate, dedicated disk. This
technique allows the array to survive the failure of a single disk.
RAID 4
In this level, an entire block of data is written onto data disks and
then the parity is generated and stored on a different disk. Note
that level 3 uses byte-level striping, whereas level 4 uses block-
level striping. Both level 3 and level 4 require at least three disks to
implement RAID.
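How a dedicated parity disk allows recovery from a single disk failure can be sketched as follows. The sketch assumes the parity is the bytewise XOR of the data blocks in a stripe; the helper name `xor_blocks` and the sample block contents are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks of one stripe (hypothetical contents)
data_blocks = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = xor_blocks(data_blocks)  # stored on the parity disk

# Suppose the disk holding data_blocks[1] fails: rebuild its block
# from the surviving data blocks and the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, XOR-ing the surviving blocks with the parity yields the missing block.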
RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity
generated for each stripe of data blocks is distributed among all
the disks rather than being stored on a dedicated parity disk.
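The idea of distributed parity can be illustrated with a small sketch. The rotation rule below is an assumption chosen for illustration (real implementations use several different layouts), and `parity_disk` is a hypothetical name.

```python
def parity_disk(stripe_no, num_disks):
    """Disk that holds the parity of a stripe, under an assumed rotation:
    the parity moves one disk to the left with every stripe."""
    return (num_disks - 1 - stripe_no) % num_disks


# Parity placement for the first five stripes of a 4-disk array
placement = [parity_disk(stripe, 4) for stripe in range(5)]
print(placement)
```

Since the parity position changes from stripe to stripe, no single disk receives all parity writes.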
RAID 6
RAID 6 is an extension of level 5. In this level, two independent
parities are generated and stored in distributed fashion among
multiple disks. Two parities provide additional fault tolerance. This
level requires at least four disk drives to implement RAID.
Sorted file method
As the name suggests, the file in this method is kept in sorted order at all times. The file
is re-sorted after every delete, insert, and update operation on the basis of the primary key
or another reference key. A new record is inserted by appending it at the end of the file,
after which the file is sorted in ascending or descending order as required. The sort gives
the insertion operation a time complexity of O(n log n). The image below illustrates the
insertion of a new record:
Advantages:
1. Sequential file organization in DBMS is the simplest of all file organization methods.
2. In the case of large volumes of data, this method is efficient in terms of speed as access to
the data is relatively fast in this method.
3. It is also cost-effective as it can be implemented using storage devices like magnetic
tapes, which are relatively cheap.
4. Since this method is fast in terms of accessing the records, it is used in cases where most
of the records of the file are accessed at the same time, for example, calculating
the attendance of the students, generating the payslips for the employees.
5. This technique is also used for the statistical computation process.
Disadvantages:
1. The main disadvantage of this technique is the traversal cost. To access a particular
record in the pile file method, we have to traverse the file linearly, which takes O(n)
time. In the sorted file method the search cost is lower, O(log n), but keeping the file
sorted takes additional time.
2. The sorted file method is costlier because the file has to be sorted after
every delete, update, and insert operation.
3. The main requirement of this method is that the records must be of same size, which is
difficult to implement in most real-world database systems.
4. The data redundancy is high in the sequential file organization in DBMS.
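The insertion into a sorted file described earlier (append, then sort) can be sketched in Python, with records reduced to their keys. The second function is an alternative that is not part of the text: it uses the standard `bisect` module to find the position in O(log n) comparisons, although shifting the later records still costs O(n). Both function names are hypothetical.

```python
import bisect


def insert_by_resorting(records, new_record):
    """Append the record at the end, then sort the whole file: O(n log n)."""
    records.append(new_record)
    records.sort()


def insert_in_place(records, new_record):
    """Alternative sketch: binary-search the position and insert there."""
    bisect.insort(records, new_record)


sorted_file = [10, 20, 40]
insert_by_resorting(sorted_file, 30)
insert_in_place(sorted_file, 25)
print(sorted_file)
```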
Heap file
Heap file organization is one of the simplest methods of file organization in DBMS.
The data block is chosen arbitrarily; the next record need not be placed in the next
data block.
It is inefficient for searching, deleting, and updating.
In this method, records are inserted in the order in which they arrive, but unlike the
sequential file method, the data blocks are not allocated sequentially: the DBMS can choose
any data block for the record to be inserted. There is no ordering of records in heap file
organization. Once a data block is full, the next record is stored in a new data block,
which might not be the adjacent data block, as shown in the image below:
Let's look at the insertion, deletion, and update operations in the heap file organization
method. To insert a new record, it is simply added at the end of the file, and the DBMS can
allocate any data block to this new record, as shown below:
To delete a record in this type of file, one has to traverse the entire file to reach the
desired record because there is no order in the file; hence update and deletion are costly
in this method. Let's understand the advantages and disadvantages of heap file organization.
Advantages:
1. It is one of the simplest file organization methods in DBMS in terms of its data
structure and operations like insertion, deletion, and updation.
2. For small databases, this method is preferred over the sequential file method
because accessing the records is relatively fast.
3. Because insertion is fast, this method is well suited to cases where a large amount of
data is loaded at one time.
Disadvantages:
1. Since accessing a record requires a linear traversal, this method is not well
suited for large databases; it is mainly used for small databases.
2. Records map to arbitrary blocks of memory rather than being allocated in sequence as in
sequential file organization. This can waste memory blocks, which is the main
disadvantage of this method: after one part or bucket address of a block has been
mapped to a record, the DBMS is not required to use the next bucket address of that
block and may instead choose any new block for the next record, leaving space unused.
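The heap file insert and delete operations described above can be sketched as follows. This is a simplified in-memory model, not a real storage engine: blocks are Python lists, records are hypothetical `(key, payload)` tuples, and a new block is simply appended when the last one is full, whereas a DBMS may pick any free block.

```python
class HeapFile:
    """Unordered file: records are stored in arrival order, block by block."""

    def __init__(self, block_capacity):
        self.block_capacity = block_capacity
        self.blocks = [[]]

    def insert(self, record):
        # No ordering to maintain: add at the end of the file.
        if len(self.blocks[-1]) >= self.block_capacity:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def delete(self, key):
        # No ordering to exploit: scan every block until the key is found.
        for block in self.blocks:
            for i, record in enumerate(block):
                if record[0] == key:
                    del block[i]
                    return True
        return False


heap = HeapFile(block_capacity=2)
for rec in [(5, "a"), (1, "b"), (9, "c")]:
    heap.insert(rec)
print(heap.blocks)
```

Insertion touches only the last block, while deletion may have to visit every block, which matches the costs discussed above.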
Indexing in DBMS
o Indexing is used to optimize the performance of a database by minimizing the
number of disk accesses required when a query is processed.
o The index is a type of data structure. It is used to locate and access the data in a
database table quickly.
Index structure:
Indexes can be created using some database columns.
o The first column of the index is the search key, which contains a copy of the
primary key or candidate key of the table. The key values are stored in sorted
order so that the corresponding data can be accessed easily.
o The second column of the index is the data reference. It contains a set of
pointers holding the address of the disk block where the value of the particular
key can be found.
Indexing Methods
Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are
known as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of
which is 10 bytes long. The IDs start with 1, 2, 3, and so on, and we have to search for
the employee with ID 543.
o In the case of a database with no index, we have to scan the disk block from
the start until we reach record 543. The DBMS will reach the record after
reading 543*10 = 5430 bytes.
o In the case of an index, assuming each index entry is 2 bytes long, the DBMS
will reach the record after reading 542*2 = 1084 bytes, which is far less than in
the previous case.
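The arithmetic of this example can be checked with a few lines of Python. The 2-byte index entry size is an assumption read off the factor 2 in the text, and the function names are hypothetical.

```python
RECORD_SIZE = 10       # bytes per record, from the example
INDEX_ENTRY_SIZE = 2   # bytes per index entry (assumed from the factor 2)


def bytes_read_without_index(target_id):
    """Scan the records from the start up to and including the target."""
    return target_id * RECORD_SIZE


def bytes_read_with_index(target_id):
    """Skip over the index entries that precede the target's entry."""
    return (target_id - 1) * INDEX_ENTRY_SIZE


print(bytes_read_without_index(543), bytes_read_with_index(543))
```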
Primary Index
o If the index is created on the basis of the primary key of the table, then it is
known as primary indexing. The primary key values are unique to each record, so
there is a 1:1 relation between index entries and records.
o As primary keys are stored in sorted order, the performance of the searching
operation is quite efficient.
o The primary index can be classified into two types: Dense index and Sparse index.
Dense index
o The dense index contains an index record for every search key value in the data
file. It makes searching faster.
o Here, the number of records in the index table is the same as the number of
records in the main table.
o It needs more space to store the index records themselves. Each index record
holds the search key and a pointer to the actual record on the disk.
Sparse index
o An index record appears only for some of the items in the data file. Each index
record points to a block.
o Here, instead of pointing to every record in the main table, the index points to
records in the main table at intervals.
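The difference between the two index types can be sketched in Python. The data file, the keys, and the function name `sparse_lookup` are made up for illustration; the sparse index here keeps the first key of each block, which is one common choice and an assumption of this sketch.

```python
import bisect

# Data file: records (reduced to keys) stored in sorted order, 3 per block
blocks = [[1, 3, 5], [7, 9, 11], [13, 15, 17]]

# Dense index: one entry per record -> (block number, slot in block)
dense_index = {
    key: (b, s) for b, block in enumerate(blocks) for s, key in enumerate(block)
}

# Sparse index: one entry per block, holding the first key of that block
sparse_keys = [block[0] for block in blocks]


def sparse_lookup(key):
    """Return the block number holding key, or None if the key is absent."""
    pos = bisect.bisect_right(sparse_keys, key) - 1  # last entry <= key
    if pos < 0:
        return None
    return pos if key in blocks[pos] else None  # scan inside the block


print(dense_index[9], sparse_lookup(9))
```

The dense index has nine entries and finds the record directly; the sparse index has only three entries but must scan the block it points to.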
Clustering Index
o A clustering index is defined on an ordered data file. Sometimes the index is
created on non-primary key columns, which may not be unique for each record.
o In this case, to identify the records faster, we group two or more columns to
get a unique value and create an index out of them. This method is called a
clustering index.
o Records that have similar characteristics are grouped, and indexes are
created for these groups.
Example: suppose a company contains several employees in each department. Suppose
we use a clustering index, where all employees which belong to the same Dept_ID are
considered within a single cluster, and index pointers point to the cluster as a whole.
Here Dept_Id is a non-unique key.
The previous scheme is a little confusing because one disk block may be shared by records
that belong to different clusters. Using a separate disk block for each cluster is
considered the better technique.
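The department example can be sketched as follows. The employee rows and the function name `build_clustering_index` are hypothetical; each cluster is modelled as its own list, standing in for the separate disk block per cluster mentioned above.

```python
from collections import defaultdict

# (employee id, Dept_ID) rows; Dept_ID is a non-unique key
employees = [("E1", 10), ("E2", 20), ("E3", 10), ("E4", 30), ("E5", 20)]


def build_clustering_index(rows):
    """Group rows by Dept_ID; each index entry points to a whole cluster."""
    clusters = defaultdict(list)
    for emp_id, dept_id in sorted(rows, key=lambda row: row[1]):
        clusters[dept_id].append(emp_id)
    return dict(clusters)


index = build_clustering_index(employees)
print(index[20])
```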
Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also grows.
These mappings are usually kept in the primary memory so that the address fetch is
fast. The actual data is then read from the secondary memory using the address obtained
from the mapping. If the mapping grows too large, fetching the address itself becomes
slower, and the sparse index is no longer efficient. To overcome this problem,
secondary indexing is introduced.
In secondary indexing, another level of indexing is introduced to reduce the size of the
mapping. In this method, large ranges of the column values are selected initially so that
the mapping size of the first level stays small. Then each range is further divided
into smaller ranges. The mapping of the first level is stored in the primary memory, so
that the address fetch is fast. The mapping of the second level and the actual data are
stored in the secondary memory (hard disk).
For example:
o To find the record with roll 111 in the diagram, the search looks in the
first-level index for the highest entry that is smaller than or equal to 111. It
gets 100 at this level.
o Then, in the second-level index, it again looks for the highest entry that is
smaller than or equal to 111 and gets 110. Using the address for 110, it goes to
the data block and searches each record until it finds 111.
o This is how a search is performed in this method. Inserting, updating, or deleting
is done in the same manner.
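The two-level search for roll 111 can be sketched as follows. The index contents are invented to match the example (first-level ranges of 100, second-level ranges of 10), and the names `floor_entry` and `find` are hypothetical.

```python
import bisect


def floor_entry(sorted_keys, key):
    """Largest entry that is <= key, or None if there is none."""
    pos = bisect.bisect_right(sorted_keys, key) - 1
    return sorted_keys[pos] if pos >= 0 else None


# Assumed layout: the first level is kept in primary memory,
# the second level and the data blocks live on disk.
first_level = [100, 200, 300]
second_level = {top: list(range(top, top + 100, 10)) for top in first_level}
data_blocks = {
    start: list(range(start, start + 10))
    for starts in second_level.values()
    for start in starts
}


def find(roll):
    """Return (first-level entry, second-level entry) used to reach roll."""
    top = floor_entry(first_level, roll)
    if top is None:
        return None
    start = floor_entry(second_level[top], roll)
    # Linear search inside the data block the second level points to
    return (top, start) if roll in data_blocks[start] else None


print(find(111))
```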
Bitmap Indexing
It is a special type of indexing built on a single key, but it is designed to answer
queries on multiple keys quickly. The records of the relation must be numbered
sequentially before bitmap indexing can be applied. This numbering makes it simple to fetch
a particular record from a block and to allocate records to the blocks of a file.
Bitmap Index Structure
The word 'bitmap' is made up of 'bit' and 'map'. A bit is the smallest unit of data in a
computer system, and a map is a way of organizing things. Thus, a bitmap is simply an
array of bits. In a relation, a bitmap index on an attribute has one bitmap for each value
of that attribute. Each bitmap has one bit for every numbered record.
For example, consider a relation Student_record where we wish to find the female
or male students whose score in English is greater than 40. The bitmaps for gender are
given in the image below.
If the value of record i is Mr, the i-th bit of the bitmap for Mr is set to 1. The bits
for all records whose value is not Mr are set to 0 in that bitmap. The same holds for
female students: if the value of record j is Ms, the j-th bit of the bitmap for Ms is set
to 1, and all other bits of the bitmap for Ms are set to 0.
Now, if a user wishes to retrieve a single record or all records for female students or
male students, i.e., those with value Mr or Ms, it is enough to read all the records of
the relation and then select the required records for Mr or Ms.
For such a selection on a single key, bitmap indexing does not make the selection much
faster. It does, however, enable the user to read and choose only the required records,
as in the above example, where the user selected only the records for female students or
for male students.
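Building the gender bitmaps and combining them with a second condition can be sketched in Python. Since the image is not reproduced here, the five sample records are invented; `build_bitmaps` is a hypothetical helper, and bitmaps are modelled as plain lists of 0 and 1 for readability.

```python
def build_bitmaps(values):
    """Return {value: bitmap}; bit i is 1 when record i holds that value."""
    bitmaps = {}
    for i, value in enumerate(values):
        bitmaps.setdefault(value, [0] * len(values))[i] = 1
    return bitmaps


# Hypothetical Student_record columns; records are numbered 0..4
genders = ["Mr", "Ms", "Ms", "Mr", "Ms"]
english = [35, 72, 41, 90, 28]

gender_bitmaps = build_bitmaps(genders)
high_score = [1 if score > 40 else 0 for score in english]

# Female students with an English score above 40: AND the two bitmaps
matches = [
    i for i, (g, h) in enumerate(zip(gender_bitmaps["Ms"], high_score)) if g & h
]
print(gender_bitmaps, matches)
```

Only the records whose numbers appear in `matches` need to be fetched from the relation.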
Uses of Bitmap Indexing
Including the one from the above example, bitmap indexing has the following uses:
o It enables the user to read and select only the required records or data from a
relation.
o It is useful for making selections on multiple keys.
o Bitmap indexing helps in counting the number of records falling under the
selection requirement. It means it makes it easy to count those tuples which are
under the selection criteria of the user.
o Bitmap indexing is useful as well as necessary in performing queries for data
analysis.
o A single bitmap is small compared with the relation. For example, if a record
in a relation is 100 bytes (800 bits) long, a single bitmap, which holds one bit
per record, occupies 1/8 of 1 percent of the space occupied by the relation.
Deletion and Insertion of Records
o Deleting a record from the relation disturbs the sequence of records. The record
which gets deleted creates space or gaps in between other records.
o Filling such gaps is possible by shifting other records, but it is an expensive task.
o An existence bitmap is a solution to this problem. By storing an existence
bitmap, we can recognize the deleted records.
o In the existence bitmap, if a record does not exist, its bit value is 0;
otherwise it is 1.
o Insertion of records is cost-effective because the insertion of a record does
not affect the sequence of other records.
o A user can insert a record either by reusing the slot of a deleted record or by
appending the record at the EOF (end of file).
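The role of the existence bitmap in deletion and insertion can be sketched as follows. The class name `RecordFile` and its methods are hypothetical, and the file is modelled as an in-memory list.

```python
class RecordFile:
    """Records keep their numbers; an existence bitmap marks deleted slots."""

    def __init__(self):
        self.records = []
        self.exists = []  # existence bitmap: 1 = record exists, 0 = deleted

    def delete(self, record_no):
        # No shifting of other records: just clear the existence bit.
        self.exists[record_no] = 0

    def insert(self, record):
        # Reuse the slot of a deleted record if there is one ...
        for record_no, alive in enumerate(self.exists):
            if not alive:
                self.records[record_no] = record
                self.exists[record_no] = 1
                return record_no
        # ... otherwise append at the end of the file.
        self.records.append(record)
        self.exists.append(1)
        return len(self.records) - 1


f = RecordFile()
slots = [f.insert(name) for name in ["a", "b", "c"]]
f.delete(1)
after_delete = list(f.exists)
reused = f.insert("d")
print(after_delete, reused, f.records)
```

When a slot is reused, the bits for that record number in the value bitmaps would also have to be updated; the sketch leaves that out.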
Implementation of Bitmap Operations
Bitmap operations should be implemented efficiently.
o Intersection Operation
The intersection of two bitmaps can be computed with a for loop in which the i-th
iteration performs the AND operation on the i-th bits of both bitmaps. To speed up the
intersection operation, use bitwise AND instructions. A word consists of either 32 or 64
bits, depending on the computer architecture, so a single bitwise AND instruction
performs the intersection of 32 or 64 bits at once.
o Union Operation
Bitmap union is used for calculating the OR of two bitmaps. For performing union
operation, we use bitwise OR instructions on 32 or 64-bits at once.
o Negation Operation
The negation operation is performed by complementing each bit of the bitmap. However, if
some records have been deleted, the complement of the bitmap alone is not sufficient.
In the original bitmap, the bits corresponding to such non-existing records are 0, but
they become 1 in the complement. To correct this, intersect the complement bitmap with
the existence bitmap so that the bits corresponding to the deleted records are turned off
to 0. The same problem occurs with null values, so also intersect the complement bitmap
with the complement of the bitmap for the null value.
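The three operations can be sketched with Python integers used as bitmaps, where record i corresponds to bit i. Python integers are not fixed 32-bit or 64-bit words, so this only illustrates the word-at-a-time idea; the helper names and the sample bitmaps are made up.

```python
def bitmap_from_bits(bits):
    """Pack a list of 0/1 values into an integer; record i becomes bit i."""
    value = 0
    for i, bit in enumerate(bits):
        value |= bit << i
    return value


def bits_from_bitmap(value, n):
    """Unpack the lowest n bits of an integer into a list of 0/1 values."""
    return [(value >> i) & 1 for i in range(n)]


n = 5  # number of record slots
mr = bitmap_from_bits([1, 0, 0, 1, 0])
high = bitmap_from_bits([0, 1, 0, 1, 0])
exists = bitmap_from_bits([1, 1, 0, 1, 1])       # record 2 has been deleted
null_gender = bitmap_from_bits([0, 0, 0, 0, 1])  # record 4 has a null value

intersection = mr & high  # one AND covers all bits at once
union = mr | high
all_slots = (1 << n) - 1  # keep only the n valid bit positions
# Negation: complement, then mask out deleted records and null values
not_mr = ~mr & all_slots & exists & ~null_gender

print(bits_from_bitmap(intersection, n), bits_from_bitmap(not_mr, n))
```

Counting the selected records is then a matter of counting the 1 bits, for example with `bin(intersection).count("1")`.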