UNIT 4 Updated - 121124

Unit IV covers various aspects of Database Management Systems, focusing on data storage, indexing methods, and file organization. Key topics include the use of external storage, the role of buffer and disk space managers, and different indexing techniques such as primary, secondary, and hash-based indexing. Additionally, it discusses tree-based indexing structures like B-Trees and their efficiency in data retrieval.

Uploaded by

k.nikhil1701

UNIT IV

Database Management System


UNIT 4

 Indexes on Sequential Files


 Secondary Indexes

 B-Trees

 Hash Tables
DATA ON EXTERNAL STORAGE
 A DBMS stores vast quantities of data, and the data must
persist across program executions.
 Therefore, data is stored on external storage devices such as
disks and tapes, and fetched into main memory as needed
for processing.
 The unit of information read from or written to disk is a page
 The size of a page is a DBMS parameter and typical values
are 4KB and 8KB.
 The cost of page I/O (input from disk to main memory and
output from memory to disk) dominates the cost of typical
database operations, and database systems are carefully
optimized to minimize this cost.
 Disks are the most important external storage
devices. They allow us to retrieve any page at a (more
or less) fixed cost per page. However, if we read
several pages in the order that they are stored
physically, the cost can be much less than the cost of
reading the same pages in a random order.
 Tapes are sequential access devices and force us to
read one page after the other. They are mostly used
to archive data that is not needed on a regular basis.
 Each record in a file has a unique identifier called a
record id, (or rid). An rid has the property that we can
identify the disk address of the page containing the
record by using the rid.
 Data is read into memory for processing, and written to
disk for persistent storage, by a layer of software called
the buffer manager.
 When the files and access methods layer needs to
process a page, it asks the buffer manager to fetch the
page, specifying the page’s rid.
 The buffer manager fetches the page from disk if it is not
already in memory.
 Space on disk is managed by the disk space manager,
according to the DBMS software architecture.
 When the files and access methods layer needs
additional space to hold new records in a file, it asks
the disk space manager to allocate an additional
disk page for the file. It also informs the disk space
manager when it no longer needs one of its disk pages.
 The disk space manager keeps track of the pages in use by
the file layer. If a page is freed by the file layer, the space
manager tracks this and reuses the space if the file layer
requests a new page later on.
 The layers involved in accessing data on external storage are as follows (top to bottom):
File and Access method

Buffer Manager

Disk space Manager

DISK
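
The interaction between the file layer, the buffer manager and the disk described above can be illustrated with a minimal Python sketch. The class name `BufferManager`, the dictionary standing in for the disk, and the absence of any page replacement policy are all assumptions made for illustration; a real DBMS buffer manager is considerably more involved.

```python
class BufferManager:
    """Toy buffer manager: keeps fetched pages in memory and counts disk reads."""

    def __init__(self, disk):
        self.disk = disk        # hypothetical stand-in for the disk: page id -> contents
        self.pool = {}          # pages currently held in main memory
        self.disk_reads = 0     # number of page I/Os performed so far

    def fetch(self, page_id):
        # Go to disk only if the page is not already in memory.
        if page_id not in self.pool:
            self.pool[page_id] = self.disk[page_id]
            self.disk_reads += 1
        return self.pool[page_id]


disk = {0: "page-0 records", 1: "page-1 records"}
bm = BufferManager(disk)
bm.fetch(0)
bm.fetch(0)   # already in memory, so no extra disk read
bm.fetch(1)
print(bm.disk_reads)  # 2
```

Three fetch requests cost only two page I/Os here, because the second request for page 0 is served from memory.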
FILE ORGANIZATION
 A database (DB) is a collection of FILES.
 A file is a sequence of RECORDS.
 A record is a sequence of FIELDS.

In a database, the files are stored in sequential blocks using contiguous allocation.


THE PROCESS OF GETTING THE RECORD INTO THE CPU FOR PROCESSING (BUFFER MANAGER)

CPU <-> RAM (main memory, MM) <-> DISK (secondary memory)

The hard disk transfers a complete block to the MM.

 File organization is
 how data is organized on the hard disk (HD),
 how we search for a record on the HD,
 how records are deleted from and inserted into the HD
in an efficient manner.
WHAT IS INDEXING AND WHY IS INDEXING USED
 Student details, i.e. Student records, are stored permanently on the hard disk (HD).
 Query: Select * from Student where RNO = 501
 The query should be processed by the CPU.
 Data path: CPU (MIPS) <-> RAM / MM <-> HD (slow)
 The HD is divided into logical blocks / pages: B0, B1, B2, B3, B4, B5, B6, ..., Bn
I/O COST IS REDUCED BY INDEXING
 Example: book indexing (the index at the back of a book).
 Query: Select * from Student where RNO = 501
 If each block holds 100 records and there are 10000 records in the student db,
then the no. of blocks required is 10000/100 = 100.
 Data will be stored in one of two ways on the HD: sorted (ordered) or
unsorted (unordered). Let us take as an example that our db is stored in an
unordered way on the HD.
 Our aim is reducing the I/O cost, i.e. bringing fewer blocks into MM.
 HD: Block 0, Block 1, Block 2, Block 3, ..., Block n  ->  RAM  ->  CPU
 If the CPU wants to search for RNo. 501 on the HD, each block is brought
into MM in turn; if the record is not found, the block is sent back and the
next block is brought in for the search.
 The number of blocks we bring into MM is called the I/O cost.
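
The block arithmetic and the cost of scanning an unordered file can be sketched in Python as follows. The roll numbers 1 to 10000, the fixed shuffle seed and the helper name `scan_cost` are assumptions made for illustration only.

```python
import random

RECORDS_PER_BLOCK = 100
roll_numbers = list(range(1, 10001))        # 10000 hypothetical roll numbers
random.Random(0).shuffle(roll_numbers)      # unordered file on the HD

# Split the file into blocks of 100 records each.
blocks = [roll_numbers[i:i + RECORDS_PER_BLOCK]
          for i in range(0, len(roll_numbers), RECORDS_PER_BLOCK)]

def scan_cost(blocks, target):
    """Return how many blocks are brought into MM before target is found.

    If the target is absent, every block has been read.
    """
    for reads, block in enumerate(blocks, start=1):
        if target in block:
            return reads
    return len(blocks)

print(len(blocks))             # 100 blocks, i.e. 10000 / 100
print(scan_cost(blocks, 501))  # somewhere between 1 and 100 block reads
```

Without an index, the worst case reads all 100 blocks. The purpose of indexing is to bring that number down.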
IMPLEMENTATION OF INDEXING
If our db is stored in ordered form, then the number of entries in the index
file (IF) is either equal to the number of blocks in the HD (one entry per
block) or equal to the number of records in the HD (one entry per record).

Index File (SK, BP) -> data blocks on the HD:
B0: 501, 502
B1: 503, 504
B2: 505, 506

SK - Search Key, may be R.No
BP - Block Pointer
                 Key               Non key
Ordered file     Primary Index     Clustered Index
Unordered file   Secondary Index   Secondary Index

Main types of Indexing:


1. Primary Indexing
2. Clustered Indexing
3. Secondary Indexing
PRIMARY INDEX

 If the index is created on the basis of the primary key of the
table, then it is known as primary indexing. Primary keys
are unique to each record, so there is a 1:1 relation between
index keys and records.
 As primary keys are stored in sorted order, the
searching operation is quite efficient.
 The primary index can be classified into two types: Dense index
and Sparse index.
Dense index
 The dense index contains an index record for every search
key value in the data file. It makes searching faster.
 In this, the number of records in the index table is same as
the number of records in the main table.
 It needs more space to store the index records themselves. The index
records have the search key and a pointer to the actual
record on the disk.
Sparse index
 In the data file, an index record appears only for a few items.
Each item points to a block.
 In this, instead of pointing to each record in the main table,
the index points to records in the main table at intervals (gaps).
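
The difference between the two kinds of index can be sketched in Python. The block layout (two roll numbers per block) follows the index-file figure used in these slides; the names `dense`, `sparse_keys`, `dense_lookup` and `sparse_lookup` are hypothetical.

```python
import bisect

# Ordered data file: each block holds two roll numbers.
blocks = [[501, 502], [503, 504], [505, 506]]

# Dense index: one entry per record (search key -> block number).
dense = {key: b for b, block in enumerate(blocks) for key in block}

# Sparse index: one entry per block (the first key stored in that block).
sparse_keys = [block[0] for block in blocks]

def dense_lookup(key):
    return dense.get(key)          # block number, or None if absent

def sparse_lookup(key):
    # Find the last block whose first key is <= key, then scan that block.
    b = bisect.bisect_right(sparse_keys, key) - 1
    if b >= 0 and key in blocks[b]:
        return b
    return None

print(len(dense), len(sparse_keys))            # 6 3
print(dense_lookup(504), sparse_lookup(504))   # 1 1
```

The dense index needs six entries and the sparse index only three, at the price of an extra scan inside the block that the sparse entry points to.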
SECONDARY INDEX
In sparse indexing, as the size of the table grows, the size of the mapping also
grows. These mappings are usually kept in primary memory so that
address fetches are fast. The actual data is then looked up in secondary memory
using the address obtained from the mapping. If the mapping
grows too large, fetching the address itself becomes slower, and the
sparse index is no longer efficient. To overcome this problem, secondary
indexing is introduced.

In secondary indexing, another level of indexing is introduced to reduce the
size of the mapping. In this method, large ranges of the column values are
selected initially so that the mapping at the first level stays small.
Then each range is further divided into smaller ranges. The first-level
mapping is stored in primary memory, so that address fetches are fast. The
second-level mapping and the actual data are stored in secondary
memory (hard disk).
A secondary index is usually dense (i.e., it has an entry for every record)
because it must point directly to each record that contains a specific
value in the indexed column. Each entry in the secondary index contains:
 The value of the indexed attribute (e.g., department name, city).
 A pointer to the location (or address) of the record in the sequential file.
If we want to find the record of roll 211 in the diagram, the search looks in the
first-level index for the highest entry that is smaller than or equal to 211. It
gets 200 at this level.
Then, in the second-level index, it again looks for the highest entry that is
smaller than or equal to 211 and gets 210. Now, using the address for 210, it goes
to the data block and searches each record in turn until it reaches 211.
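
The lookup of roll 211 can be traced with a short Python sketch. The concrete layout (roll numbers 100 to 299, ten records per data block, first-level ranges of 100) is an assumption chosen so that the entries 200 and 210 from the walkthrough appear; the original diagram may use different values.

```python
import bisect

# Hypothetical layout: roll numbers 100..299, ten records per data block.
data_blocks = {start: list(range(start, start + 10))
               for start in range(100, 300, 10)}
# Second-level index: for each first-level entry, the start keys of its blocks.
second_level = {100: list(range(100, 200, 10)),
                200: list(range(200, 300, 10))}
# First-level index, small enough to be kept in primary memory.
first_level = [100, 200]

def floor_entry(sorted_keys, key):
    """Return the largest entry that is <= key, or None if there is none."""
    i = bisect.bisect_right(sorted_keys, key) - 1
    return sorted_keys[i] if i >= 0 else None

def find(roll):
    top = floor_entry(first_level, roll)
    if top is None:
        return None
    block_start = floor_entry(second_level[top], roll)
    for record in data_blocks[block_start]:   # linear scan inside the block
        if record == roll:
            return top, block_start, record
    return None

print(find(211))  # (200, 210, 211)
```

The result lists the first-level entry, the second-level entry and the record reached, matching the steps 200, 210, 211 described above.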
 Dense Indexing - In a dense index, there is an index
record for every search key value in the database. This
makes searching faster but requires more space to store
the index records themselves. Index records contain the search key
value and a pointer to the actual record on the disk.
 Sparse Indexing - In a sparse index, index records are not
created for every search key. An index record here
contains a search key and a pointer to the data
on the disk.
SOME OF THE SEARCHING TECHNIQUES

 Linear Search O(n)
Values: 9  6  7 10  3 12  5  8 15  1
Index:  0  1  2  3  4  5  6  7  8  9

 Binary Search O(log n)
Values: 1  3  5  6  7  8  9 10 12 15
Index:  0  1  2  3  4  5  6  7  8  9

We have moved from order n to order log n, but what we actually want is
order 1 (constant time).
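
The two techniques can be compared on the arrays shown above by counting how many elements each one inspects. This is a minimal Python sketch; the function names and the choice of 15 as the search target are illustrative.

```python
def linear_search(items, target):
    """Scan left to right; return (index, number of elements inspected)."""
    for i, value in enumerate(items):
        if value == target:
            return i, i + 1
    return -1, len(items)

def binary_search(sorted_items, target):
    """Halve the search range each step; return (index, number of steps)."""
    lo, hi, steps = 0, len(sorted_items) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps

unsorted = [9, 6, 7, 10, 3, 12, 5, 8, 15, 1]
ordered = sorted(unsorted)   # [1, 3, 5, 6, 7, 8, 9, 10, 12, 15]

print(linear_search(unsorted, 15))  # (8, 9): nine elements inspected
print(binary_search(ordered, 15))   # (9, 4): four steps
```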
Idea behind creating hash indexing

 Keys: 9,6,7,12,15,22…
 Each key is stored at the index equal to its value:
Index: 0  1  2  3  4  5  6  7  8  9 10 11 12
Key:   -  -  -  -  -  -  6  7  -  9  -  - 12

If we need to search for the key element 6, we go directly to index 6.


HASH BASED INDEXING
 We can organize records using a technique called hashing to
quickly find records that have a given search key value.
 In this technique, the records in a file are grouped in buckets.
 A bucket consists of a primary page and possibly additional
pages linked in a chain.
 The bucket to which a record belongs can be determined by
applying a special function, called a hash function, to the
search key.
 Given a bucket number, a hash-based index structure allows
us to retrieve the primary page for the bucket in one or
two disk I/Os.
HASH BASED INDEXING
 Index entries are partitioned into buckets in accordance
with a hash function, h(v), where v ranges over search
key values.
 Each bucket is identified by an address ‘a’.
 The bucket at address ‘a’ contains all index entries with search
key ‘v’ such that h(v) = a.
 Each bucket is stored in a page (with a possible overflow chain).
 If index entries contain rows, the set of buckets forms an
integrated storage structure; else the set of buckets forms an
(unclustered) secondary index.
HASH FUNCTIONS
1 DIVISION
2 MID SQUARE
3 DIGIT FOLDING
4 MULTIPLICATION: H(K) = FLOOR(TS * ((K * A) MOD 1)), WHERE TS IS THE TABLE SIZE
A = 0.6180
K = 23
TS = 10
H(23) = FLOOR(10 * …
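
The four hash functions named above can be sketched in Python. The formulas used here are the common textbook forms and are assumptions, since the slide gives only the method names and a partly cut-off multiplication example. In particular, the multiplication method is taken to be floor(TS * frac(K * A)); the function names, the two-digit grouping for folding, and the rule for picking the middle digits are illustrative choices.

```python
import math

def division_hash(key, table_size):
    # Division method: remainder of the key divided by the table size.
    return key % table_size

def mid_square_hash(key, table_size):
    # Mid-square method: square the key and take its middle digit(s).
    squared = str(key * key)
    width = len(str(table_size - 1))            # digits needed for an address
    start = max(0, (len(squared) - width) // 2)
    return int(squared[start:start + width]) % table_size

def folding_hash(key, table_size, group=2):
    # Digit folding: split the key into digit groups and add them up.
    digits = str(key)
    parts = [int(digits[i:i + group]) for i in range(0, len(digits), group)]
    return sum(parts) % table_size

def multiplication_hash(key, table_size, a=0.6180):
    # Multiplication method: scale the fractional part of key * a.
    fraction = (key * a) % 1.0
    return math.floor(table_size * fraction)

print(division_hash(23, 10))        # 3
print(mid_square_hash(23, 10))      # 2, the middle digit of 23 * 23 = 529
print(folding_hash(123456, 100))    # 2, since 12 + 34 + 56 = 102
print(multiplication_hash(23, 10))  # 2, since 23 * 0.6180 = 14.214
```

Under this assumed form, the values K = 23, A = 0.6180 and TS = 10 from the slide give bucket 2.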
WHAT IS COLLISION?
A hash collision occurs when the hashes of two or more items in the
data set map to the same place in the hash table.
How to deal with Hashing Collision?
There are two techniques which you can use to resolve a hash
collision:
Rehashing: This method invokes a secondary hash function,
which is applied repeatedly until an empty slot is found,
where the record is then placed.
Chaining: The chaining method builds a linked list of items
whose keys hash to the same value. This method requires
an extra link field in each table position.
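
Both techniques can be sketched in Python. The table size 7, the hash functions `h1` and `h2`, and the keys 10, 17 and 24 (which all hash to slot 3) are assumptions chosen for illustration. A Python list stands in for the linked list used by chaining.

```python
TABLE_SIZE = 7   # prime, so every probe step below visits all slots

def h1(key):
    return key % TABLE_SIZE

def h2(key):
    # Secondary hash function; never 0, so probing always moves on.
    return 1 + (key % (TABLE_SIZE - 1))

def insert_chaining(table, key):
    # Chaining: all keys with the same hash share one list.
    table[h1(key)].append(key)

def insert_rehashing(table, key):
    # Rehashing: apply the secondary hash repeatedly until a slot is free.
    slot = h1(key)
    for _ in range(TABLE_SIZE):
        if table[slot] is None:
            table[slot] = key
            return slot
        slot = (slot + h2(key)) % TABLE_SIZE
    raise RuntimeError("table is full")

chained = [[] for _ in range(TABLE_SIZE)]
for k in (10, 17, 24):
    insert_chaining(chained, k)

probed = [None] * TABLE_SIZE
slots = [insert_rehashing(probed, k) for k in (10, 17, 24)]

print(chained[3])  # [10, 17, 24]
print(slots)       # [3, 2, 4]
```

With chaining the three colliding keys end up in one chain at slot 3. With rehashing each key gets its own slot, found by stepping through the table.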
Given a search key value v:
1) Evaluate h(v)
2) Fetch bucket at h(v)
3) Search bucket

Cost is the number of pages in the bucket (cheaper than a B+ tree if there are
no overflow chains).
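
The three search steps and their cost in pages can be sketched in Python. The page capacity of two entries, the four buckets and the hash function `h(v) = v mod 4` are assumptions made for illustration.

```python
PAGE_CAPACITY = 2   # hypothetical number of index entries per page
NUM_BUCKETS = 4

def h(v):
    return v % NUM_BUCKETS

# Each bucket is a list of pages: the primary page first, then overflow pages.
buckets = [[[]] for _ in range(NUM_BUCKETS)]

def insert(v):
    pages = buckets[h(v)]
    if len(pages[-1]) == PAGE_CAPACITY:
        pages.append([])        # chain a new overflow page
    pages[-1].append(v)

def search(v):
    """Return (found, pages_read) for search key value v."""
    pages = buckets[h(v)]                       # steps 1 and 2
    for pages_read, page in enumerate(pages, start=1):
        if v in page:                           # step 3
            return True, pages_read
    return False, len(pages)

for v in (4, 8, 12, 5):
    insert(v)

print(search(8))    # (True, 1): found in the primary page
print(search(12))   # (True, 2): found in the overflow page
print(search(16))   # (False, 2): whole chain read, not found
```

A search costs one page read as long as the bucket has no overflow chain; every overflow page adds one more.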
INDEXING IS OF TWO TYPES –
1. SINGLE LEVEL INDEXING
2. MULTILEVEL INDEXING

 1. Single level indexing

Index File (SK, BP) -> data blocks on the HD:
B0: 501, 502
B1: 503, 504
B2: 505, 506

SK - Search Key, may be R.No
BP - Block Pointer
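2. MULTILEVEL INDEXING

Each index entry holds a data value and a pointer.
Index levels: Root node -> Internal node -> Leaf node -> data blocks

Index File (SK, BP) -> data blocks on the HD:
B0: 501, 502
B1: 503, 504
B2: 505, 506

SK - Search Key, may be R.No
BP - Block Pointer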
TREE-BASED INDEXING
 1. An alternative to hash-based indexing is to organize records using a tree-like
structure.

 2. The data entries are arranged in sorted order by search key value, and a
hierarchical search data structure is maintained that directs searches to the
correct page of data entries.

 3. The lowest level of the tree, called the leaf level, contains the data entries.

 4. This structure allows us to efficiently locate all the data entries with search
key values in a desired range.

 5. All searches begin at the topmost node, called the root, and the contents of
pages in non-leaf levels direct searches to the correct leaf page.

 6. Non-leaf pages contain node pointers separated by search key values.


 7. The node pointer to the left of a key value k points to a
subtree that contains only data entries less than k.
 8. The node pointer to the right of a key value k points to a
subtree that contains only data entries greater than or equal to k.
 9. The number of I/Os incurred during a search is equal to the
length of a path from the root to a leaf, plus the number of leaf
pages with qualifying data entries.
 10. The height of a balanced tree is the length of a path from
root to leaf.
 11. The average number of children for a non-leaf node is
called the fan-out of the tree.
 12. If every non-leaf node has ‘n’ children, a tree of height ‘h’
has n^h leaf pages.
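
The relation between fan-out, height and number of leaf pages can be checked with a short Python sketch. The function names and the sample numbers (fan-out 100, one million leaf pages) are illustrative, and the fan-out is assumed to be at least 2.

```python
def leaf_pages(fan_out, height):
    """Leaf pages of a tree in which every non-leaf node has fan_out children."""
    return fan_out ** height

def height_needed(fan_out, num_leaf_pages):
    """Smallest height h such that fan_out ** h >= num_leaf_pages."""
    height, capacity = 0, 1
    while capacity < num_leaf_pages:
        capacity *= fan_out
        height += 1
    return height

print(leaf_pages(100, 3))              # 1000000
print(height_needed(100, 1_000_000))   # 3
print(height_needed(4, 100))           # 4, since 4**3 = 64 < 100 <= 4**4 = 256
```

With a fan-out of 100, a path of only three node-to-node steps from the root reaches any one of a million leaf pages, which is why tree searches need so few I/Os.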
EXAMPLE FOR B-TREE

 Create a B-tree of order (degree) m = 4
 Max no. of children = m = 4
 Max no. of keys = m-1 = 4-1 = 3
 Min no. of children = m/2 = 2
 Min no. of keys = m/2 – 1 = 1
 Keys = 10, 20, 40, 50, 60, 70, 80, 33, 35, 5, 15
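
 S1: Insert 10, 20, 40
[10 20 40]

 S2: Insert 50 (the node overflows and splits; 40 moves up)
      [40]
[10 20]  [50]

 S3: Insert 60
      [40]
[10 20]  [50 60]

 S4: Insert 70
      [40]
[10 20]  [50 60 70]

 S5: Insert 80 (the leaf overflows and splits; 70 moves up)
        [40 70]
[10 20]  [50 60]  [80]

 S6: Insert 33
          [40 70]
[10 20 33]  [50 60]  [80]

 S7: Insert 35 (the leaf overflows and splits; 33 moves up)
           [33 40 70]
[10 20]  [35]  [50 60]  [80]

 S8: Insert 5
             [33 40 70]
[5 10 20]  [35]  [50 60]  [80]

 S9: Insert 15 (the leaf overflows and splits; 15 moves up)
            [15 33 40 70]
[5 10]  [20]  [35]  [50 60]  [80]

 The root now holds 4 keys and overflows, so it splits and 40 moves up into a new root:
                [40]
      [15 33]          [70]
[5 10]  [20]  [35]  [50 60]  [80]

 Each node entry consists of a key value, a block pointer (to a child node) and
a record pointer (to the data record).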
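
The insertion sequence above can be reproduced with a minimal Python sketch. It follows the convention used in this example: a node is allowed to overflow to 4 keys and is then split, with the third key moving up. Other presentations of order-4 B-trees promote the second key instead and produce a different but equally valid tree. The names `Node`, `insert` and `levels` are hypothetical; duplicate keys and deletion are not handled.

```python
import bisect

ORDER = 4   # max children per node; max keys per node = ORDER - 1

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list for a leaf

def _insert(node, key):
    """Insert key below node; return (promoted key, new right node) on a split."""
    i = bisect.bisect_left(node.keys, key)
    if not node.children:                       # leaf: place the key here
        node.keys.insert(i, key)
    else:                                       # internal: descend into child i
        split = _insert(node.children[i], key)
        if split:
            promoted, right = split
            node.keys.insert(i, promoted)
            node.children.insert(i + 1, right)
    if len(node.keys) > ORDER - 1:              # overflow: split this node
        mid = ORDER // 2                        # third key moves up
        promoted = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return promoted, right
    return None

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split:
        promoted, right = split
        return Node([promoted], [root, right])
    return root

def levels(root):
    """Return the keys of every node, level by level from the root down."""
    out, level = [], [root]
    while level:
        out.append([n.keys for n in level])
        level = [c for n in level for c in n.children]
    return out

root = Node()
for k in [10, 20, 40, 50, 60, 70, 80, 33, 35, 5, 15]:
    root = insert(root, k)

for level in levels(root):
    print(level)
# [[40]]
# [[15, 33], [70]]
# [[5, 10], [20], [35], [50, 60], [80]]
```

The printed levels match the final tree of the worked example: root 40, internal nodes 15 33 and 70, and five leaves.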
CONSTRUCT B-TREE OF ORDER 4
 Keys = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
