UNIT-IV: STORAGE AND FILE ORGANIZATION
Databases are stored in files, which contain records. At the physical level, the actual data is stored in
electromagnetic format on a device capable of retaining it for a long period of time. These storage
devices can be broadly categorized into three types:
Primary Storage − This category contains the memory that is directly
accessible to the CPU: the CPU's internal memory (registers), fast memory
(cache), and main memory (RAM). These are directly accessible because they
are placed on the CPU chip or on the motherboard. This storage is usually
very small, ultra-fast, and volatile. Primary storage requires a continuous
power supply to retain its contents; all the data is lost in the event of a
power failure.
Secondary Storage − Secondary storage devices are used to store data for
future use or as backup. Secondary storage covers devices that are not part
of the CPU chipset or motherboard, such as magnetic disks, optical disks
(DVDs, CDs, etc.), disk drives, flash drives, and magnetic tapes.
Tertiary Storage − Tertiary storage is used to store immense volumes of data.
These devices are the slowest because they are external to the
computer system. For the most part, they are used to back up
an entire system. Optical disks and magnetic tapes are commonly used as
tertiary storage.
Redundant Array of Independent Disks (RAID): In RAID technology, two or
more secondary storage devices are connected so that they operate as a
single storage medium. A RAID array consists of several disks linked
together for a variety of purposes. Disk arrays are categorized by their
RAID levels.
RAID 0: At this level, disks are organized in a striped array. Data is
divided into blocks, and the blocks are distributed across the disks. Each
disk reads and writes its blocks in parallel, which improves performance
and speed. Level 0 stores no parity and no backup copy.
RAID 1: Mirroring is used in RAID 1. When data is sent to the RAID
controller, it writes a copy of the data to every disk in the array. RAID
level 1 therefore provides 100% redundancy in case of a disk failure.
RAID 2: The data in RAID 2 is striped across different disks, and an Error
Correction Code (ECC) is recorded using Hamming codes. As in level 0, the
data is striped, but here each bit within a word is stored on a separate
disk, and the ECC codes for the data words are saved on a separate set of
disks. Because of its complex structure and high cost, RAID 2 is not used
commercially.
RAID 3: Data is striped across multiple disks in RAID 3. A parity bit is
generated for each data word and stored on a separate disk. This allows the
array to survive the failure of a single disk.
RAID 4: At this level, an entire block of data is written onto the data
disks, and the parity is then generated and stored on a separate disk.
Level 3 uses byte-level striping, while level 4 uses block-level striping.
Both levels 3 and 4 require a minimum of three disks.
RAID 5: The data blocks in RAID 5 are written to different disks, but the
parity blocks are distributed across all the disks rather than being
stored on a dedicated disk.
RAID 6: RAID 6 extends the level 5 concept. Two independent parities are
generated and stored in a distributed fashion across multiple disks. This
level requires at least four disk drives.
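The striping and parity ideas described for the RAID levels can be
illustrated with a few lines of Python. This is only a minimal sketch, not
how a real RAID controller works: the function names, block size, and block
contents are made up for illustration, and parity is computed as a byte-wise
XOR of the data blocks.

```python
from functools import reduce

def stripe(data, n_disks, block_size):
    """Split data into blocks and deal them round-robin over n_disks (striping)."""
    disks = [[] for _ in range(n_disks)]
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for i, block in enumerate(blocks):
        disks[i % n_disks].append(block)
    return disks

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks; used here as the parity function."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Striping: consecutive blocks land on different (hypothetical) disks.
disks = stripe(b"ABCDEFGH", n_disks=2, block_size=2)

# Parity: one parity block protects three data blocks.
b0, b1, b2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([b0, b1, b2])

# If the disk holding b1 fails, XOR of the survivors and the parity rebuilds it.
rebuilt_b1 = xor_blocks([b0, b2, parity])
```

Because XOR is its own inverse, any single missing block can be recomputed
from the remaining blocks and the parity, which is why one parity block is
enough to survive one disk failure.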
Tertiary Storage
It is the type of storage that is external to the computer system. It has the slowest
speed, but it can store a large amount of data. It is also known as offline
storage. Tertiary storage is generally used for data backup. The following tertiary
storage devices are available:
Optical Storage: An optical storage device can store megabytes or gigabytes of data. A
Compact Disc (CD) can store 700 megabytes of data, with a playtime of around 80
minutes. A Digital Video Disc (DVD), on the other hand, can store 4.7 gigabytes (single
layer) or 8.5 gigabytes (dual layer) of data on each side of the disc.
Tape Storage: Tape is a cheaper storage medium than disk. Generally, tapes are used
for archiving or backing up data. Access is slow because data must be read
sequentially from the start. Tape storage is therefore known as sequential-access
storage, whereas disk storage is known as direct-access storage because data can be
accessed directly from any location on the disk.
Access methods in DBMS
The main goal of a DBMS is to return the data requested by the user. In
an RDBMS this may be a record or a set of records; in an object-oriented
database it may be an object or a set of objects.
Indexes – Access Method
An index is a small table with two columns. The first column contains the
primary key of the table, and the second column contains a set of pointers
holding the disk addresses where the record with that particular key value
can be found.
Indexes are very useful for speeding up search operations in the DBMS.
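The two-column idea can be sketched in Python as follows. This is an
illustrative sketch only: the "disk" is a dictionary, and the block
addresses, record contents, and function names are all made up.

```python
from bisect import bisect_left

# Hypothetical "disk": block address -> record (addresses are invented).
disk = {
    4096: {"id": 103, "name": "Asha"},
    8192: {"id": 101, "name": "Ravi"},
    12288: {"id": 102, "name": "Meena"},
}

def build_index(disk, key):
    """Build the two-column table (key value, disk address), sorted by key."""
    return sorted((record[key], address) for address, record in disk.items())

def lookup(index, disk, key_value):
    """Binary-search the index for the key, then follow the pointer to the record."""
    i = bisect_left(index, (key_value,))
    if i < len(index) and index[i][0] == key_value:
        return disk[index[i][1]]
    return None

index = build_index(disk, "id")
record = lookup(index, disk, 102)
```

The search touches only the small sorted index and then one disk address,
instead of scanning every record.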
The types of indexes are discussed as follows:
1. Function-based indexes
A function-based index computes the value of an expression over one or
more columns of the table and stores the result in the index.
The expression can be an arithmetic expression or an SQL function.
A function-based index cannot contain null values.
Example:
CREATE INDEX Sample_idx ON table_ (a + c * (b + d));
2. Bitmap indexes
A bitmap index works well for low-cardinality columns, i.e., columns with
few unique values.
For example: boolean data, which has only two values, true or false.
Bitmap indexes are very useful in data warehouse applications for joining
large fact tables.
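A bitmap index can be sketched in Python by keeping one integer per
distinct value and using bit i to mean "row i has this value". This is a
simplified illustration, not the storage format of any particular DBMS; the
column names and data are made up.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i has that value."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def rows_of(bitmap):
    """Return the row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two hypothetical low-cardinality columns of the same five-row table.
is_active = [True, False, True, True, False]
gender = ["F", "M", "F", "M", "M"]

active_idx = build_bitmap_index(is_active)
gender_idx = build_bitmap_index(gender)

# "active AND female" becomes a single bitwise AND of two bitmaps.
active_females = rows_of(active_idx[True] & gender_idx["F"])
```

Combining conditions with bitwise operations is what makes this structure
attractive when columns have only a handful of distinct values.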
3. Domain indexes
A domain index is an application-specific index, created as an instance of an
index type schema object. It is used for indexing data in an application-specific domain.
4. Clusters
Clustering in a DBMS is designed for high availability of data. Clustering is
applied to tables that are used repeatedly by the user.
For example: when there are many employees in a department, we can
create an index on a non-unique key, such as Dept-id. With this, all employees
belonging to the same department are considered to be within the same cluster.
Clustering can improve the performance of the system.
5. Indexed sequential access method (ISAM)
ISAM was developed by IBM for mainframe computers, but the term is now
used for several related concepts.
In a DBMS, ISAM is used to access data either sequentially (in the sequence
in which the data was entered) or randomly (with an index).
File Organization
A file is a collection of records. Using the primary key, we can
access the records. The type and frequency of access are
determined by the type of file organization used for a given
set of records.
File organization is a logical relationship among the various records. It
defines how file records are mapped onto disk blocks.
File organization describes the way in which the records are
stored in terms of blocks, and how the blocks are placed on the storage
medium.
The first approach to mapping the database to files is to use several
files and store records of only one fixed length in any given file. An
alternative approach is to structure the files so that they can hold
records of multiple lengths.
Files of fixed-length records are easier to implement than files of
variable-length records.
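One reason fixed-length records are easier to implement is that the
position of the n-th record can be computed directly. The sketch below uses
Python's standard struct module; the record layout (an integer id, a 20-byte
name, and a salary) is an assumption chosen only for illustration.

```python
import struct

# Assumed layout: 4-byte int id, 20-byte name, 8-byte float salary (32 bytes).
RECORD = struct.Struct("<i20sd")

def pack_record(emp_id, name, salary):
    """Encode one record; struct pads the name with zero bytes up to 20 bytes."""
    return RECORD.pack(emp_id, name.encode("utf-8"), salary)

def read_record(file_bytes, n):
    """Read the n-th record: its offset is simply n * record size."""
    emp_id, raw_name, salary = RECORD.unpack_from(file_bytes, n * RECORD.size)
    return emp_id, raw_name.rstrip(b"\x00").decode("utf-8"), salary

# A tiny in-memory "file" holding three records back to back.
file_bytes = b"".join([
    pack_record(1, "Ravi", 48000.0),
    pack_record(2, "Meena", 52000.0),
    pack_record(3, "Asha", 50000.0),
])
second = read_record(file_bytes, 1)
```

With variable-length records this offset arithmetic no longer works, so
extra bookkeeping (such as length fields or a slot directory) is needed.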
Objective of file organization
Records should be selected optimally, i.e., as fast as possible.
Insert, delete, and update transactions on the records should
be quick and easy.
Duplicate records should not be introduced as a result of an insert,
update, or delete.
Records should be stored efficiently, so that the cost of storage is minimal.
Types of file organization:
There are various methods of file organization. Each method has
pros and cons with respect to access or selection. The programmer
chooses the best-suited file organization method according to the
requirements.
Types of file organization are as follows:
Sequential File Organization
This is the easiest method of file organization. In this method, records
are stored sequentially in the file. It can be implemented in two ways:
1. Pile File Method:
It is quite a simple method. In this method, we store the records in a
sequence, i.e., one after another, in the order in which they are
inserted into the table.
When a record is updated or deleted, it is first searched for in the
memory blocks. When it is found, it is marked for deletion, and in the
case of an update the new record is inserted.
Insertion of the new record:
Suppose we have a sequence of records R1, R3, and so on up to R9 and R8.
Here, a record is nothing but a row in a table. Suppose we want to insert a
new record R2 into the sequence; it will be placed at the end of the file.
2. Sorted File Method:
In this method, the new record is always inserted at the end of the file,
and the sequence is then sorted in ascending or descending order.
Sorting of records is based on the primary key or any other key.
When a record is modified, the record is updated and the file is then
sorted again, so that the updated record ends up in the right
place.
Insertion of the new record:
Suppose there is a preexisting sorted sequence of records R1, R3, and so
on up to R6 and R7. Suppose a new record R2 has to be inserted into the
sequence; it will be inserted at the end of the file, and the sequence
will then be sorted.
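The difference between the two insertion methods can be sketched in Python.
This is an illustration only: the file is modelled as a list, the record
names follow the examples above, and the numeric part of the name is
assumed to be the sort key.

```python
def record_key(record):
    """Assumed sort key: the number after the leading 'R', e.g. 'R7' -> 7."""
    return int(record[1:])

def pile_insert(records, new_record):
    """Pile file: the new record simply goes to the end of the file."""
    records.append(new_record)
    return records

def sorted_insert(records, new_record):
    """Sorted file: append at the end, then re-sort the whole sequence."""
    records.append(new_record)
    records.sort(key=record_key)
    return records

pile = pile_insert(["R1", "R3", "R9", "R8"], "R2")
ordered = sorted_insert(["R1", "R3", "R6", "R7"], "R2")
```

The append-then-sort form mirrors the description in the text. Python's
bisect.insort could place the record directly at its sorted position
instead of re-sorting the whole list.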
Pros of sequential file organization
It is a fast and efficient method for handling huge amounts of data.
Files can easily be stored on cheaper storage
media such as magnetic tapes.
It is simple in design and requires little effort to store the data.
This method is used when most of the records have to be accessed,
e.g., calculating students' grades or generating salary slips.
This method is used for report generation or statistical calculations.
Cons of sequential file organization
It wastes time because we cannot jump directly to a particular record;
we have to move through the file sequentially.
The sorted file method takes more time and space for sorting the records.
Heap file organization
It is the simplest and most basic type of organization. It works with
data blocks. In heap file organization, records are inserted at the
end of the file. Inserting records requires no sorting or
ordering.
When a data block is full, the new record is stored in some other
block. This new data block need not be the very next data block;
the DBMS can select any data block in memory to store new records.
The heap file is also known as an unordered file.
In the file, every record has a unique id, and every page in the file is of
the same size. It is the DBMS's responsibility to store and manage the
new records.
Insertion of a new record
Suppose we have five records R1, R3, R6, R4 and R5 in a heap, and suppose
we want to insert a new record R2. If data block 3 is full, then R2
will be inserted into any data block selected by the DBMS, let's say data
block 1.
If we want to search, update, or delete data in heap file organization, then
we need to traverse the data from the start of the file until we get the requested
record.
If the database is very large, then searching, updating, or deleting a record will
be time-consuming because there is no sorting or ordering of records: we may
need to check all the data before we get the requested record.
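A heap file can be sketched in Python as a list of fixed-capacity blocks.
This is a simplified illustration: the class name, the block capacity of two
records, and the "first block with free space" placement policy are all
assumptions, since the text leaves the choice of block to the DBMS.

```python
BLOCK_CAPACITY = 2  # assumed number of records per data block

class HeapFile:
    def __init__(self, n_blocks):
        self.blocks = [[] for _ in range(n_blocks)]

    def insert(self, record):
        """Put the record in any block with free space (here: the first one found)."""
        for number, block in enumerate(self.blocks):
            if len(block) < BLOCK_CAPACITY:
                block.append(record)
                return number
        self.blocks.append([record])  # every block is full: allocate a new one
        return len(self.blocks) - 1

    def search(self, wanted):
        """Linear scan from the start; also report how many records were examined."""
        examined = 0
        for block in self.blocks:
            for record in block:
                examined += 1
                if record == wanted:
                    return record, examined
        return None, examined

heap = HeapFile(n_blocks=3)
placements = [heap.insert(r) for r in ["R1", "R3", "R6", "R4", "R5"]]
r2_block = heap.insert("R2")
found, examined = heap.search("R2")
```

Insertion needs no ordering work, but the search has to examine records one
by one, which is the cost described above for large databases.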
Pros of Heap file organization
It is a very good method of file organization for bulk insertion. If
a large amount of data needs to be loaded into the database at one
time, this method is best suited.
In the case of a small database, fetching and retrieving records is
faster than with sequential file organization.
Cons of Heap file organization
This method is inefficient for large databases because it takes time to
search for or modify a record.
Hash File Organization
Hash file organization computes a hash function on some fields of
the records. The output of the hash function determines the location of the
disk block where the record is to be placed.
When a record has to be retrieved using the hash key columns, the
address is generated and the whole record is retrieved using that address. In
the same way, when a new record has to be inserted, the address is
generated using the hash key and the record is inserted directly. The same process
is applied in the case of delete and update.
In this method, there is no need to search or sort the entire file. Each
record is stored at a seemingly random location in memory, determined by
the hash function.
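The following Python sketch shows how a hash function can turn a key into a
block address for insert, retrieve, and delete. It is an illustration under
stated assumptions: the number of buckets, the modulo hash function, and the
class name are invented, and each bucket stands in for one disk block.

```python
N_BUCKETS = 4  # assumed number of disk blocks (buckets)

def bucket_of(key):
    """Assumed hash function: key modulo the number of buckets."""
    return key % N_BUCKETS

class HashFile:
    def __init__(self):
        self.buckets = [[] for _ in range(N_BUCKETS)]

    def insert(self, key, record):
        """Compute the address from the key and store the record there."""
        self.buckets[bucket_of(key)].append((key, record))

    def get(self, key):
        """Only the one bucket named by the hash is examined."""
        for k, record in self.buckets[bucket_of(key)]:
            if k == key:
                return record
        return None

    def delete(self, key):
        bucket = self.buckets[bucket_of(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False

hash_file = HashFile()
hash_file.insert(101, "Ravi")
hash_file.insert(105, "Asha")   # 105 hashes to the same bucket as 101
hash_file.insert(102, "Meena")
asha = hash_file.get(105)
```

Two keys can hash to the same bucket, as 101 and 105 do here, so a lookup
still compares keys inside the bucket it lands in.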
B+ Tree File Organization
B+ tree file organization is an advanced form of the indexed
sequential access method. It uses a tree-like structure to store
records in a file.
It uses the same key-index concept, where the primary key is used
to sort the records. For each primary key, an index value is
generated and mapped to the record.
The B+ tree is similar to a binary search tree (BST), but a node can have
more than two children. In this method, all the records are stored
only at the leaf nodes. Intermediate nodes act as pointers to the leaf
nodes; they do not contain any records.
The above B+ tree shows that:
There is one root node of the tree, i.e., 25.
There is an intermediary layer of nodes. They do not store the
actual records; they hold only pointers to the leaf nodes.
The intermediate node to the left of the root holds a value smaller than
the root, and the node to the right holds a larger value, i.e., 15 and
30 respectively.
There is a single leaf level, whose nodes hold only the values, i.e., 10, 12,
17, 20, 24, 27 and 29.
Searching for any record is easier because all the leaf nodes are at the
same depth, i.e., the tree is balanced.
In this method, any record can be reached by traversing a single path
from the root.
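The search path through such a tree can be sketched in Python using the
values from the example (root 25, intermediate nodes 15 and 30). This is a
hand-built, read-only illustration: the class names are invented, the way
the values are split across leaf nodes is an assumption, and insertion with
node splitting is not shown.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys      # records (here just key values) live only in leaves
        self.next = None      # link to the next leaf, forming a sorted list

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # separator values only, no records
        self.children = children  # len(children) == len(keys) + 1

def find_leaf(node, key):
    """Follow a single root-to-leaf path, choosing one child per level."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node

def search(root, key):
    return key in find_leaf(root, key).keys

def range_scan(root, low, high):
    """Find the leaf for 'low', then walk the linked leaves until past 'high'."""
    node, result = find_leaf(root, low), []
    while node is not None:
        for k in node.keys:
            if k > high:
                return result
            if k >= low:
                result.append(k)
        node = node.next
    return result

# Assumed leaf layout; the last leaf is empty because no value is >= 30.
leaves = [Leaf([10, 12]), Leaf([17, 20, 24]), Leaf([27, 29]), Leaf([])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal([25], [Internal([15], leaves[0:2]), Internal([30], leaves[2:4])])
```

A point lookup such as search(root, 20) visits one node per level, and a
range query reads neighbouring leaves through the links without going back
up the tree.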
Pros of B+ tree file organization
In this method, searching becomes very easy because all the records are
stored only in the leaf nodes, which are sorted and linked as a sequential list.
Traversing the tree structure is easier and faster.
The size of the B+ tree has no restrictions, so the number of records
can increase or decrease and the B+ tree structure can grow or
shrink accordingly.
It is a balanced tree structure, so inserts, updates, and deletes do not
degrade the performance of the tree.
Cons of B+ tree file organization
This method is inefficient for static tables, i.e., tables whose data rarely changes.
Indexed sequential access method (ISAM)
ISAM is an advanced form of sequential file organization. In this method,
records are stored in the file in order of the primary key. An index value is
generated for each primary key and mapped to the record. This index
contains the address of the record in the file.
If a record has to be retrieved based on its index value, the address of
its data block is fetched from the index and the record is retrieved from memory.
Pros of ISAM:
Because the index holds the address of each record's data block,
searching for a record in a huge database is quick and easy.
This method supports range retrieval and partial retrieval of records.
Since the index is based on the primary key values, we can retrieve
the data for a given range of values. In the same way, a partial
value can also be searched easily, e.g., student names starting with
'JA'.
Cons of ISAM
This method requires extra disk space to store the index values.
When new records are inserted, the files have to be
reorganized to maintain the sequence.
When a record is deleted, the space it used needs to be
released. Otherwise, the performance of the database will degrade.
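The exact, range, and partial retrievals described for ISAM can be sketched
in Python with a sorted index and binary search. This is an illustration
only: the class and method names are invented, the keys are student names,
and the addresses are made-up numbers standing in for data block addresses.

```python
from bisect import bisect_left, bisect_right

class IsamIndex:
    def __init__(self, entries):
        pairs = sorted(entries)                  # index kept in key order
        self.keys = [k for k, _ in pairs]
        self.addresses = [a for _, a in pairs]

    def lookup(self, key):
        """Exact match: binary search, then return the stored address."""
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.addresses[i]
        return None

    def range_scan(self, low, high):
        """All entries with low <= key <= high, read off the sorted index."""
        lo, hi = bisect_left(self.keys, low), bisect_right(self.keys, high)
        return list(zip(self.keys[lo:hi], self.addresses[lo:hi]))

    def prefix(self, text):
        """Partial retrieval: keys starting with 'text' are adjacent in the index."""
        result = []
        start = bisect_left(self.keys, text)
        for k, a in zip(self.keys[start:], self.addresses[start:]):
            if not k.startswith(text):
                break
            result.append((k, a))
        return result

students = IsamIndex([("JOHN", 300), ("JACK", 100), ("JANE", 200),
                      ("MARY", 400), ("ADAM", 500)])
ja_students = students.prefix("JA")
```

Range and prefix queries work because the index is sorted, which is also why
inserts that disturb the order are costly for this organization.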
Cluster file organization
When records of two or more tables are stored in the same file, it is known
as a cluster. These files hold two or more tables in the same data
block, and the key attributes used to map these tables together
are stored only once.
This method reduces the cost of searching for related records in
different files.
Cluster file organization is used when there is a frequent need to
join the tables with the same condition. These joins return only a
few records from both tables. In the given example, we are retrieving
the records for only particular departments. This method can't be used
to retrieve the records for the entire department.
In this method, we can directly insert, update, or delete any record. Data is
sorted based on the key with which searching is done. The cluster key is the
key on which the tables are joined.
Types of Cluster file organization:
Cluster file organization is of two types:
1. Indexed Clusters:
In an indexed cluster, records are grouped based on the cluster key and stored
together. The above EMPLOYEE and DEPARTMENT relationship is an example of
an indexed cluster. Here, all the records are grouped based on the cluster key
DEP_ID.
2. Hash Clusters:
It is similar to the indexed cluster. In a hash cluster, instead of storing the
records based on the cluster key itself, we generate a hash value for
the cluster key and store together the records with the same hash value.
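Both kinds of cluster can be sketched in Python for EMPLOYEE and DEPARTMENT
rows that share the cluster key DEP_ID. This is only an illustration: the
sample rows, the function names, and the modulo hash are invented, and a
dictionary entry stands in for a data block.

```python
from collections import defaultdict

# Hypothetical rows: DEPARTMENT(DEP_ID, name) and EMPLOYEE(id, name, DEP_ID).
departments = [(10, "Sales"), (20, "HR")]
employees = [(1, "Ravi", 10), (2, "Asha", 20), (3, "Meena", 10)]

def build_indexed_cluster(departments, employees):
    """Group rows of both tables under the cluster key; DEP_ID is stored once."""
    clusters = {}
    for dep_id, dep_name in departments:
        clusters[dep_id] = {"department": dep_name, "employees": []}
    for emp_id, emp_name, dep_id in employees:
        clusters[dep_id]["employees"].append((emp_id, emp_name))
    return clusters

def build_hash_cluster(clusters, n_buckets):
    """Place each cluster in the bucket given by a hash of its cluster key."""
    buckets = defaultdict(dict)
    for dep_id, cluster in clusters.items():
        buckets[dep_id % n_buckets][dep_id] = cluster  # assumed hash: modulo
    return buckets

clusters = build_indexed_cluster(departments, employees)
buckets = build_hash_cluster(clusters, n_buckets=4)

# Joining on DEP_ID = 10 reads a single cluster instead of two separate files.
sales = clusters[10]
```

Because the employee rows already sit next to their department row, a join
on the cluster key becomes a single lookup.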
Pros of Cluster file organization
Cluster file organization is useful when there are frequent requests
to join the tables with the same joining condition.
It gives efficient results when there is a 1:M mapping between
the tables.
Cons of Cluster file organization
This method performs poorly for very large databases.
If the joining condition changes, this method cannot be used:
traversing the file with a different joining condition
takes a lot of time.
This method is not suitable for tables with a 1:1 mapping.
Data Dictionary Storage
So far, we have learned about relations and their representation. A
relational database system maintains all the information about a relation or
table, from its schema to the constraints applied to it. All of this metadata
is stored. In general, metadata refers to data about data. The structure that
stores the relational schemas and other metadata about the relations is known
as the Data Dictionary or System Catalog.
A data dictionary is like the A-Z dictionary of the relational database system,
holding all the information about each relation in the database.
The types of information a system must store are:
Names of the relations
Names of the attributes of each relation
Lengths and domains of the attributes
Names and definitions of the views defined on the database
Various integrity constraints
Along with this, the system also keeps the following data about the users of
the system:
Names of authorized users
Accounting and authorization information about users.
The authentication information for users, such as passwords or other
related information.
In addition to this, the system may also store some statistical and
descriptive data about the relations, such as:
Number of tuples in each relation
Method of storage for each relation, such as clustered or non-clustered.
A system may also store the storage organization, whether sequential,
hash, or heap. It also notes the location where each relation is stored:
If relations are stored in operating system files, the data
dictionary notes and stores the names of the files.
If the database stores all the relations in a single file, the data
dictionary notes and stores the blocks containing the records of each
relation in a data structure similar to a linked list.
Lastly, it also stores information about each index on all the
relations:
Name of the index.
Name of the relation being indexed.
Attributes on which the index is defined.
The type of index formed.
All of the above information, or metadata, is stored in the data dictionary.
The data dictionary is also updated whenever changes occur in the
relations. Such metadata constitutes a miniature database. Some systems
store the metadata in the form of relations in the database itself. The system
designers decide how the data dictionary is represented. Also, a data
dictionary stores its data in a non-normalized manner: it does not use any
normal form, so that the data stored in the dictionary can be accessed quickly.
For example, in the data dictionary, an underline below a value
indicates that the field contains a primary key.
So, whenever the database system needs to fetch records from a relation, it
first looks up the location and storage organization of the relation in the
data dictionary. After confirming the details, it retrieves the
required records from the database.
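SQLite is one system that stores its catalog as a relation inside the
database itself, in a table named sqlite_master. The sketch below uses
Python's standard sqlite3 module to read that catalog. The table name
student, its columns, and the index name are hypothetical and exist only
for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical relation and index, created only so the catalog has entries.
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE INDEX student_name_idx ON student (name)")

def catalog_entries(conn):
    """Names of relations and indexes, read from SQLite's own catalog table."""
    query = "SELECT type, name, tbl_name FROM sqlite_master ORDER BY name"
    return conn.execute(query).fetchall()

def attributes(conn, table):
    """Attribute name, declared type, and primary-key flag for one relation."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(row[1], row[2], bool(row[5])) for row in rows]

entries = catalog_entries(conn)
student_attributes = attributes(conn, "student")
```

The catalog is queried with ordinary SQL, which matches the point above that
the metadata itself forms a miniature database.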