UNIT 4 Updated - 121124
B-Trees
Hash Tables
DATA ON EXTERNAL STORAGE
A DBMS stores vast quantities of data, and the data must
persist across program executions.
Therefore, data is stored on external storage devices such as
disks and tapes, and fetched into main memory as needed
for processing.
The unit of information read from or written to disk is a page.
The size of a page is a DBMS parameter and typical values
are 4KB and 8KB.
The cost of page I/O (input from disk to main memory and
output from memory to disk) dominates the cost of typical
database operations, and database systems are carefully
optimized to minimize this cost.
Disks are the most important external storage
devices. They allow us to retrieve any page at a (more
or less) fixed cost per page. However, if we read
several pages in the order that they are stored
physically, the cost can be much less than the cost of
reading the same pages in random order.
Tapes are sequential access devices and force us to
read one page after the other. They are mostly used
to archive data that is not needed on a regular basis.
Each record in a file has a unique identifier called a
record id (or rid). An rid has the property that we can
identify the disk address of the page containing the
record by using the rid.
Data is read into memory for processing, and written to
disk for persistent storage, by a layer of software called
the buffer manager.
When the files and access methods layer needs to
process a page, it asks the buffer manager to fetch the
page, specifying the page’s rid.
The buffer manager fetches the page from disk if it is not
already in memory.
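The fetch-on-miss behaviour described above can be sketched as follows. This is only a minimal illustration, assuming the disk is modelled as a dictionary from page id to page contents; the names `BufferManager`, `fetch_page`, and `disk_reads` are hypothetical, and no page replacement policy is modelled.

```python
class BufferManager:
    """Toy buffer manager: keeps fetched pages in memory (no eviction modelled)."""

    def __init__(self, disk):
        self.disk = disk        # assumption: disk modelled as {page_id: contents}
        self.pool = {}          # pages currently held in main memory
        self.disk_reads = 0     # number of page reads performed

    def fetch_page(self, page_id):
        # Read from disk only if the page is not already in memory.
        if page_id not in self.pool:
            self.pool[page_id] = self.disk[page_id]
            self.disk_reads += 1
        return self.pool[page_id]


disk = {0: "page zero", 1: "page one"}   # hypothetical page contents
bm = BufferManager(disk)
bm.fetch_page(0)
bm.fetch_page(0)    # already in memory, so no second disk read
bm.fetch_page(1)
print(bm.disk_reads)
```

Three requests cause only two disk reads, because the second request for page 0 is served from memory.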
Space on disk is managed by the disk space manager,
according to the DBMS software architecture.
When the files and access methods layer needs
additional space to hold new records in a file, it asks
the disk space manager to allocate an additional
disk page for the file. It also informs the disk space
manager when it no longer needs one of its disk pages.
The disk space manager keeps track of the pages in use by
the file layer. If a page is freed by the file layer, the space
manager tracks this and reuses the space if the file layer
requests a new page later on.
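The allocate, free, and reuse cycle can be illustrated with a small sketch. It assumes pages are identified by integers and that freed pages are kept on a simple free list; the class and method names (`DiskSpaceManager`, `allocate_page`, `free_page`) are hypothetical and chosen only for this example.

```python
class DiskSpaceManager:
    """Toy disk space manager that reuses freed pages before handing out new ones."""

    def __init__(self):
        self.next_page = 0      # next never-used page number
        self.free_pages = []    # pages released by the file layer
        self.in_use = set()     # pages currently owned by the file layer

    def allocate_page(self):
        # Reuse a freed page if one exists; otherwise hand out a fresh page.
        if self.free_pages:
            page = self.free_pages.pop()
        else:
            page = self.next_page
            self.next_page += 1
        self.in_use.add(page)
        return page

    def free_page(self, page):
        # The file layer reports that it no longer needs this page.
        self.in_use.remove(page)
        self.free_pages.append(page)


dsm = DiskSpaceManager()
a = dsm.allocate_page()
b = dsm.allocate_page()
dsm.free_page(a)
c = dsm.allocate_page()   # reuses the page that was just freed
print(a, b, c)
```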
Data on external storage is accessed through the following layers:
File and Access Methods
Buffer Manager
Disk
FILE ORGANIZATION
A database (DB) is a collection of files, a file is a sequence of records, and a record is a sequence of fields.
Figure: CPU, main memory (RAM), and secondary memory (disk) divided into blocks B0, B1, B2, ..., Bn.
If the CPU wants to search for RNo. 501 on the hard disk (HD), each
block is brought into main memory (MM) in turn. If the record is not found
in a block, that block is sent back and the next block is brought in for the search.
The number of blocks brought into MM is called the I/O cost.
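The block-by-block search and its I/O cost can be sketched as below. This is a minimal illustration, assuming each block is modelled as a list of record numbers; the function name `search_blocks` and the block contents are hypothetical.

```python
def search_blocks(blocks, rno):
    """Scan blocks in order; return (block_index, io_cost), or (None, io_cost)."""
    io_cost = 0
    for i, block in enumerate(blocks):
        io_cost += 1            # one block brought into main memory
        if rno in block:
            return i, io_cost
    return None, io_cost


# Hypothetical block contents, for illustration only.
blocks = [[101, 102], [301, 302], [501, 502], [701, 702]]
print(search_blocks(blocks, 501))   # found in the third block
print(search_blocks(blocks, 999))   # not found: every block was read
```

Without an index, an unsuccessful search reads every block, which is the cost that indexing aims to reduce.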
IMPLEMENTATION OF INDEXING
If the database is stored in sorted order, the number of
entries in the index file (IF) is either the number of blocks on the HD
or the number of records on the HD.
The primary index can be classified into two types: dense index
and sparse index.
Dense index
The dense index contains an index record for every search
key value in the data file, which makes searching faster.
The number of records in the index table is the same as
the number of records in the main table.
It needs more space to store the index records themselves. Each index
record holds the search key and a pointer to the actual
record on the disk.
Sparse index
An index record appears for only a few items in the data file.
Each index record points to a block.
Instead of pointing to each record in the main table,
the index points to records in the main table at intervals.
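The difference between the two index types can be sketched as follows. The example assumes three data blocks of two records each and a sparse index that stores the first key of every block; both the data layout and the names `dense_lookup` and `sparse_lookup` are assumptions made for illustration.

```python
from bisect import bisect_right

# Assumed data layout: sorted keys spread over three blocks.
blocks = {"B0": [501, 502], "B1": [503, 504], "B2": [505, 506]}

# Dense index: one entry for every search key value.
dense_index = {key: name for name, keys in blocks.items() for key in keys}

# Sparse index (assumption): one entry per block, keyed by the block's first key.
sparse_keys = [501, 503, 505]
sparse_ptrs = ["B0", "B1", "B2"]


def dense_lookup(key):
    return dense_index.get(key)


def sparse_lookup(key):
    # Find the last index entry whose key is <= the search key, then scan that block.
    pos = bisect_right(sparse_keys, key) - 1
    if pos < 0:
        return None
    name = sparse_ptrs[pos]
    return name if key in blocks[name] else None


print(len(dense_index), len(sparse_keys))
print(dense_lookup(504), sparse_lookup(504))
```

The dense index has as many entries as there are records, while the sparse index has one entry per block and needs an extra scan inside the block it points to.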
SECONDARY INDEX
In sparse indexing, as the size of the table grows, the size of the mapping also
grows. These mappings are usually kept in primary memory so that
address fetches are fast. The actual data is then read from secondary memory
using the address obtained from the mapping. If the mapping
grows too large, fetching the address itself becomes slower, and the
sparse index is no longer efficient. To overcome this problem, secondary
indexing is introduced.
Figure: array slots 0 to 9, then 0 to 12. Keys: 9, 6, 7, 12, 15, 22…
Figure: index file with columns SK (search key) and BP (block pointer) over the data blocks:
B0: 501, 502
B1: 503, 504
B2: 505, 506
2. The data entries are arranged in sorted order by search key value, and a
hierarchical search data structure is maintained that directs searches to the
correct page of data entries.
3. The lowest level of the tree, called the leaf level, contains the data entries.
4. This structure allows us to efficiently locate all the data entries with search
key values in a desired range.
5. All searches begin at the topmost node, called the root, and the contents of
pages in non-leaf levels direct searches to the correct leaf page.
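The root-to-leaf search described in the points above can be sketched as follows. This is a minimal illustration on a hypothetical two-level tree; the `Node` class, the `find_leaf` function, and the rule that keys equal to a separator go to the right child are assumptions for this example.

```python
from bisect import bisect_right


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys            # sorted search key values
        self.children = children    # None for a leaf page


def find_leaf(root, key):
    """Start at the root and follow non-leaf pages down to the correct leaf."""
    node, pages_read = root, 1
    while node.children is not None:
        # Assumption: keys >= a separator are stored in the child to its right.
        node = node.children[bisect_right(node.keys, key)]
        pages_read += 1
    return node, pages_read


# Hypothetical tree: the root separates three leaf pages of data entries.
leaves = [Node([5, 10, 15]), Node([20, 33, 35]), Node([40, 50, 60])]
root = Node([20, 40], leaves)

leaf, pages_read = find_leaf(root, 33)
print(leaf.keys, pages_read)
```

The number of pages read equals the height of the tree, not the number of leaf pages.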
Keys = 10, 20, 40, 50, 60, 70, 80, 33, 35, 5, 15

S1: Insert 10, 20, 40
    [10 20 40]

S2: Insert 50
    [40]
    [10 20] [50]

S3: Insert 60
    [40]
    [10 20] [50 60]

S4: Insert 70
    [40]
    [10 20] [50 60 70]

S5: Insert 80
    [40 70]
    [10 20] [50 60] [80]

S6: Insert 33
    [40 70]
    [10 20 33] [50 60] [80]

S7: Insert 35
    [33 40 70]
    [10 20] [35] [50 60] [80]

S8: Insert 5
    [33 40 70]
    [5 10 20] [35] [50 60] [80]

S9: Insert 15
    [15 33 40 70]
    [5 10] [20] [35] [50 60] [80]
    The root overflows and splits:
    [40]
    [15 33] [70]
    [5 10] [20] [35] [50 60] [80]

Labels in the figure: key value, block pointer, record pointer.
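The insertion steps traced above can be reproduced with a short sketch. It assumes a node holds at most 3 keys and that, when a node overflows with 4 keys, the third key moves up to the parent; both assumptions are inferred from the trace rather than stated in it. The names `BTreeNode`, `insert`, and `levels` are hypothetical, and keys are assumed to be distinct.

```python
from bisect import bisect_left, insort

MAX_KEYS = 3   # assumption inferred from the trace: at most 3 keys per node


class BTreeNode:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list for a leaf


def _insert(node, key):
    """Insert key below node; return (median, right_node) if node split, else None."""
    if not node.children:
        insort(node.keys, key)                   # leaf: insert in sorted position
    else:
        i = bisect_left(node.keys, key)          # child that should hold the key
        split = _insert(node.children[i], key)
        if split:
            median, right = split
            node.keys.insert(i, median)
            node.children.insert(i + 1, right)
    if len(node.keys) <= MAX_KEYS:
        return None
    mid = len(node.keys) // 2                    # with 4 keys, the third key moves up
    median = node.keys[mid]
    right = BTreeNode(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return median, right


def insert(root, key):
    split = _insert(root, key)
    if split:                                    # root split: the tree grows by one level
        median, right = split
        return BTreeNode([median], [root, right])
    return root


def levels(root):
    """Return the keys of every node, level by level."""
    out, level = [], [root]
    while level:
        out.append([n.keys for n in level])
        level = [c for n in level for c in n.children]
    return out


root = BTreeNode()
for k in [10, 20, 40, 50, 60, 70, 80, 33, 35, 5, 15]:
    root = insert(root, k)
for row in levels(root):
    print(row)
```

Printing `levels(root)` after each insertion, instead of only at the end, shows the intermediate trees for comparison with steps S1 to S9.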
CONSTRUCT B-TREE OF ORDER 4
Keys = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10