Batch processing of k sequentially stored records
read the transaction file:
lra = k where k = number of transaction records
sba = ceil(k/tfbf) where tfbf is the transaction file blocking factor
read the master file:
lra = n
sba = ceil(n/bf) where bf is the master file blocking factor
write a new master file:
lra = n + adds - deletes
sba = ceil((n+adds-deletes)/bf)
where adds is the number of records added or inserted,
and deletes is the number of records deleted.
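The batch formulas above can be checked with a short calculation. This is a minimal sketch only: the function name batch_costs and the sample figures are hypothetical and are not taken from the notes.

```python
from math import ceil

def batch_costs(k, n, tfbf, bf, adds=0, deletes=0):
    """Return (lra, sba) for each phase of a sequential batch update.

    k      = number of transaction records
    n      = number of master file records
    tfbf   = transaction file blocking factor
    bf     = master file blocking factor
    """
    new_n = n + adds - deletes  # size of the new master file
    return {
        "read_transactions": (k, ceil(k / tfbf)),
        "read_master": (n, ceil(n / bf)),
        "write_master": (new_n, ceil(new_n / bf)),
    }

# Hypothetical workload, chosen only for illustration.
costs = batch_costs(k=1000, n=50000, tfbf=20, bf=5, adds=300, deletes=100)
for phase, (lra, sba) in costs.items():
    print(f"{phase}: lra={lra}, sba={sba}")
```

Every phase is a full sequential pass, so the block counts depend only on the file sizes and blocking factors, not on which records the transactions touch.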
Random Access Methods
Hashing
Basic mechanism – transformation of a primary key directly to a physical address,
called a bucket (or indirectly via a logical address)
Collisions – handled by variations of chained overflow techniques
random access to a hashed file
lra = 1 + overflow(avg)
rba = 1 + overflow(avg)
insertion into a hashed file
lra = 1 + overflow(avg) + rewrite
rba = 1 + overflow(avg) for the search
    + 1 rba for the rewrite
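As a rough sketch, the hashed-file costs can be written as two small functions. The function names are hypothetical, and the code assumes that the rewrite counts as one logical access and one random block access added to the search accesses.

```python
def hashed_read_cost(overflow_avg):
    """(lra, rba) for a random read: the home bucket plus the average overflow chain."""
    lra = 1 + overflow_avg
    rba = 1 + overflow_avg
    return lra, rba

def hashed_insert_cost(overflow_avg, rewrite=1):
    """(lra, rba) for an insertion: search as for a read, then rewrite the block.

    Assumption: the rewrite is one logical access and one random block access.
    """
    lra = 1 + overflow_avg + rewrite
    rba = 1 + overflow_avg + 1  # the final 1 rba is the rewrite
    return lra, rba

# Hypothetical average overflow of 0.5 accesses per lookup.
print(hashed_read_cost(0.5))
print(hashed_insert_cost(0.5))
```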
Extendible Hashing
* the number of buckets grows or contracts
* bucket splits when it becomes full
* collisions are resolved immediately, no long overflow chains
* primary key transformed to an entry in the Bucket Address Table
(BAT), typically in RAM
* BAT has pointers to disk buckets that hold the actual data
* Retrieve a single record = 1 rba (access the bucket in one step)
* Cost (service time) of I/O for updates, inserts, and deletes is the same as for B+-trees
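The mechanism in the list above can be illustrated with a small in-memory sketch. It is not a definitive implementation: the class names, the bucket capacity, the use of the low-order bits of Python's hash() as the BAT index, and the use of dictionaries in place of disk buckets are all assumptions made for illustration. Only growth (splitting) is shown, not contraction.

```python
class Bucket:
    """Stands in for one disk bucket holding up to `capacity` records."""
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.records = {}

class ExtendibleHash:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.global_depth = 1
        # Bucket Address Table (BAT): 2**global_depth pointers, kept in RAM.
        self.bat = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        # Assumption: the BAT entry is chosen by the low-order hash bits.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        # One bucket access: BAT lookup in RAM, then a single bucket read.
        return self.bat[self._index(key)].records.get(key)

    def put(self, key, value):
        bucket = self.bat[self._index(key)]
        if key in bucket.records or len(bucket.records) < bucket.capacity:
            bucket.records[key] = value
            return
        self._split(bucket)    # bucket is full: split it immediately
        self.put(key, value)   # retry; may split again if still full

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            self.bat = self.bat + self.bat  # double the BAT
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Redirect the BAT entries whose new distinguishing bit is set.
        for i, b in enumerate(self.bat):
            if b is bucket and i & high_bit:
                self.bat[i] = sibling
        # Redistribute the old records between the bucket and its sibling.
        old, bucket.records = bucket.records, {}
        for k, v in old.items():
            self.bat[self._index(k)].records[k] = v

table = ExtendibleHash(capacity=2)
for key in range(16):
    table.put(key, f"rec{key}")
print(table.get(11), table.global_depth)
```

Because a full bucket is split at insertion time, a lookup never follows an overflow chain; it costs one BAT lookup in memory and one bucket access.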
B-trees and B+-trees
B-tree index basic characteristics
* each node contains p pointers and p-1 records
* each pointer at level i is for a data and pointer block at level i+1
* i=1 denotes the root level (single node or block)
* can be inefficient for searching because of the overhead in each search level
B+-tree index basic characteristics
* eliminates data pointers from all nodes except the leaf nodes
* each non-leaf index node has p pointers and p-1 key values
* each pointer at level i is for an index block (of key/pointer pairs) at level i+1
* each leaf index has a key value/pointer pair to point to the actual data
block (and record) containing that primary key value
* leaf index nodes can be logically connected via pointers for ordered sequence search
* hybrid method for efficient random access and sequential search
Example: B+-tree
To determine the order of a B+-tree, let us assume that the database has 500,000
records of 200 bytes each, the search key is 15 bytes, the tree and data pointers are
5 bytes, and the index node (and data block) size is 1024 bytes. For this
configuration, the non-leaf index node size is
1024 bytes = p*5 + (p-1)*15 bytes
p = floor((1024+15)/20) = floor(51.95) = 51
number of search key values in the leaf nodes = floor((1024-5)/(15+5)) = floor(50.95) = 50
h = height of the B+-tree (number of index levels, including the leaf index nodes)
n = number of records in the database (or file); all must be pointed at from the next to last level, h-1
p^(h-1) * (p-1) > n
(h-1) log p + log(p-1) > log n
(h-1) log p > log n - log(p-1)
h > 1 + (log n - log(p-1)) / log p
h > 1 + (log 500,000 - log 50) / log 51 = 3.34, so h = 4 (nearest higher integer)
A good approximation can be made by assuming that the leaf index nodes are
implemented with p pointers and p key values:
p^h > n
h log p > log n
h > log n / log p
In this case, the result above becomes h > log 500,000 / log 51 = 3.34, or h = 4.
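The arithmetic of this example can be reproduced with a few lines of code. This is a sketch under the example's own parameters; the function names are hypothetical.

```python
from math import floor, log

def bplus_order(block_size, key_size, pointer_size):
    """Largest p with p pointers and p-1 keys fitting in one index node."""
    return (block_size + key_size) // (key_size + pointer_size)

def leaf_capacity(block_size, key_size, pointer_size):
    """Key/pointer pairs per leaf, after reserving one pointer in the node."""
    return (block_size - pointer_size) // (key_size + pointer_size)

def height_bound(n, p):
    """Lower bound on h from p^(h-1) * (p-1) > n."""
    return 1 + (log(n) - log(p - 1)) / log(p)

def height_bound_approx(n, p):
    """Lower bound on h from the approximation p^h > n."""
    return log(n) / log(p)

n = 500_000
p = bplus_order(1024, 15, 5)
leaf = leaf_capacity(1024, 15, 5)
bound = height_bound(n, p)
approx = height_bound_approx(n, p)
h = floor(bound) + 1  # smallest integer strictly greater than the bound

print(f"p={p}, leaf keys={leaf}, h > {bound:.2f}, approx h > {approx:.2f}, h={h}")
```

Both bounds fall between 3 and 4, so either form gives a height of 4 for this configuration.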
B+-tree performance
read a single record (B+-tree) = h+1 rba
update a single record (B+-tree) = search cost + rewrite data block
= (h+1) rba + 1 rba
general update cost for insertion (B+-tree)
= search cost (i.e., h+1 reads)
+ simple rewrite of data block and leaf index node pointing to the
data block (i.e., 2 rewrites)
+ nos*(write of new split index node
+ rewrite of the index node pointer to the new index node)
+ nosb*(write of new split data block)
= (h+1) rba + 2 rba + nos*(2 rba) + nosb*(1 rba)
where nos is the number of index split node operations required and nosb is the
number of data split block operations required
general update cost for deletion (B+-tree)
= search cost (i.e., h+1 reads)
+ simple rewrite of data block and leaf index node pointing to the
data block (i.e., 2 rewrites)
+ noc*(rewrite of the node pointer to the remaining node)
= (h+1) rba + 2 rba + noc*(1 rba)
where noc is the number of consolidations of index nodes required.
As an example, consider the insertion of a node (with key value 77) into the
B+-tree shown in Fig. 6.6. This insertion requires a search (query) phase and an
insertion phase with one split node. The total insertion cost for height 3 is
insertion cost = (3 + 1) rba search cost + (2 rba) rewrite cost
+ 1 split *(2 rba rewrite cost)
= 8 rba
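The cost formulas of this section translate directly into code. The following is a minimal sketch with hypothetical function names; all costs are in random block accesses (rba).

```python
def read_cost(h):
    """Read one record: h index levels plus the data block."""
    return h + 1

def update_cost(h):
    """Search, then rewrite the data block."""
    return read_cost(h) + 1

def insert_cost(h, nos=0, nosb=0):
    """Search + 2 rewrites + 2 rba per index node split + 1 rba per data block split."""
    return read_cost(h) + 2 + nos * 2 + nosb * 1

def delete_cost(h, noc=0):
    """Search + 2 rewrites + 1 rba per index node consolidation."""
    return read_cost(h) + 2 + noc * 1

# The worked example: height 3, one index node split, no data block split.
print(insert_cost(3, nos=1))
```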
Secondary Indexes
Basic characteristics of secondary indexes
* based on Boolean search criteria (AND, OR, NOT) of attributes that are
not the primary key
* attribute type index is level 1 (usually in RAM)
* attribute value index is level 2 (usually in RAM)
* accession list is level 3 (ordered list of pointers to blocks containing
records with the given attribute value)
* one accession list per attribute value; pointers have block address and
record offset typically
* accession lists can be merged to satisfy the intersection (AND) of
records that satisfy more than one condition
Boolean query cost (secondary index)
= search attribute type index + search attribute value index
+ search and merge m accession lists + access t target records
= (0 + 0 + sum of m accession list accesses) rba + t rba
= (sum of m accession list cost) rba + t rba
where m is the number of accession lists to be merged and t is the number
of target records to be accessed after the merge operation.
accession list cost (for accession list j) = ceil(pj/bfac) rba
where pj is the number of pointer entries in the jth accession list and bfac is
the blocking factor for all accession lists
bfac = floor(block_size/pointer_size)
* assume all accesses to the accession list are random due to dynamic re-allocation
of disk blocks
use the 1% rule
(any variable affecting the result by less than 1% is ignored)
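The Boolean query cost can be estimated with the sketch below. The function names and the sample sizes are hypothetical; the code assumes the two index levels are in RAM (0 rba) and that bfac is the integer part of block_size/pointer_size.

```python
def accession_list_cost(pointers, block_size, pointer_size):
    """rba to read one accession list of `pointers` entries."""
    bfac = block_size // pointer_size   # pointers per block (assumed rounded down)
    return -(-pointers // bfac)         # integer ceiling of pointers / bfac

def boolean_query_cost(list_sizes, targets, block_size, pointer_size):
    """rba for merging the accession lists plus fetching the target records."""
    merge = sum(accession_list_cost(pj, block_size, pointer_size) for pj in list_sizes)
    return merge + targets

# Hypothetical query: two accession lists ANDed together, 40 target records.
cost = boolean_query_cost([1000, 5000], targets=40, block_size=1024, pointer_size=5)
print(cost, "rba")
```

Each target record is charged one random block access, which matches the assumption that accesses after the merge are random.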