Unit-4

Unit IV discusses various implementation techniques in database management systems (DBMS), focusing on RAID for improved performance and fault tolerance, file organization methods, and data dictionary storage. It covers indexing and hashing techniques, including B+ trees and their advantages for efficient data retrieval. Additionally, the document explains the structure and operations of B-trees and B+ trees, highlighting their role in ordered indexing and query processing.

Unit IV IMPLEMENTATION TECHNIQUES

RAID – File Organization – Organization of Records in Files – Data dictionary Storage –


Column Oriented Storage– Indexing and Hashing –Ordered Indices – B+ tree Index Files
– B tree Index Files – Static Hashing – Dynamic Hashing – Query Processing Overview –
Algorithms for Selection, Sorting and join operations – Query optimization using
Heuristics – Cost Estimation.

4.1. RAID (Redundant Array of Independent Disks)


Definition: RAID is a data storage virtualization technology that combines multiple
physical disk drives into one logical unit to improve performance, increase storage
capacity, and provide fault tolerance.
RAID is especially useful in DBMS because it enhances the reliability and speed of data
storage and retrieval—crucial for maintaining large databases.
RAID Architecture
Core Components of RAID Architecture:
1. RAID Controller:
o Manages the disk drives and presents them as a single logical unit.
o Can be hardware-based (dedicated RAID controller) or software-based
(managed by the OS or DBMS).
2. Disk Array:
o Multiple physical hard drives or SSDs connected together.
o These can be organized into different RAID levels depending on the
performance and redundancy required.
3. Cache Memory (optional but common in hardware RAID):
o Temporary memory that holds data during read/write operations for faster
access.

4.1.1 RAID Levels (Types)

Each RAID level offers a different balance of performance, redundancy, and storage efficiency.
| RAID Level | Description | Pros | Cons |
|---|---|---|---|
| RAID 0 | Striping (data split across disks) | High speed | No redundancy (if one disk fails, all data is lost) |
| RAID 1 | Mirroring (exact copies on two disks) | High reliability | Storage efficiency is 50% |
| RAID 5 | Block-level striping with distributed parity | Good balance of speed and fault tolerance | Slow write performance |
| RAID 6 | Like RAID 5 but with two parity blocks | Can survive 2 disk failures | Higher overhead |
| RAID 10 (1+0) | Mirrored pairs of striped disks | High performance and reliability | Expensive (needs at least 4 disks) |

How RAID Supports DBMS


 Improved Read/Write Speed: Through striping and parallel access.
 Fault Tolerance: Ensures DBMS uptime even when a disk fails.
 Data Redundancy: Keeps a backup for recovery without losing data.
 Load Balancing: Distributes I/O load, enhancing multi-user performance.
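The striping behind RAID 0's speed can be sketched in Python. The round-robin block layout below is a simplified model for illustration, not any specific controller's algorithm:

```python
def locate_block(logical_block: int, num_disks: int):
    """Map a logical block number to a (disk, stripe) pair under
    round-robin striping, as in RAID 0."""
    disk = logical_block % num_disks     # blocks rotate across disks
    stripe = logical_block // num_disks  # position of the block on that disk
    return disk, stripe

# With 4 disks, consecutive blocks land on different disks,
# so they can be read in parallel.
print([locate_block(b, 4) for b in range(6)])
# → [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1)]
```

Because consecutive logical blocks sit on different physical disks, a sequential scan can keep all disks busy at once, which is where the throughput gain comes from.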

4.2. File Organization in DBMS

Definition:

 File organization refers to the method of arranging data records in a file on storage
(like a hard disk), so they can be efficiently stored, retrieved, updated, and
managed by a Database Management System (DBMS).
Types of File Organization
There are several common file organization methods, each suitable for different
access and update needs:

1. Heap (Unordered) File Organization


 Records are stored in no specific order.
 New records are inserted at the end of the file.
 Searches are slow, but insertion is fast.
Use Case: Good for small files or when frequent inserts are expected.

2. Sequential (Ordered) File Organization


 Records are stored in sorted order (usually based on a key field).
 Allows efficient binary search.
 Insertion and deletion are expensive (requires maintaining order).
Use Case: Ideal for applications needing sorted data (like reports or batch
processing).

3. Hashing File Organization


 A hash function determines where a record is stored based on a key (e.g., Employee
ID).
 Fast for exact-match queries.
 Not suitable for range queries.
Use Case: Useful when frequent access to individual records by a key is required.

4. Clustered File Organization


 Physically groups related records from different tables (e.g., orders and customers)
that are often accessed together.
 Improves performance for joins and related queries.
Use Case: Used in advanced systems to reduce I/O for related data access.

4.3 Organization of Records in Files


Within a file, records can be organized in the following ways:
1. Fixed-Length Records
 Each record has the same size and structure.
 Simple to manage and fast to access.
 Wastes space if some fields are not fully used.
2. Variable-Length Records
 Records have different lengths, often due to fields like names or descriptions.
 Requires record delimiters or length indicators.
 More efficient in space usage, but harder to manage.

Record Placement Strategies:


 Spanned Records: A record can span across multiple blocks.
 Unspanned Records: Each record must be stored fully within one block.

4.4. Data Dictionary Storage in DBMS


Definition:
A Data Dictionary is a centralized repository in a Database Management System
(DBMS) that stores metadata — that is, data about the data.
It contains detailed information about:
 Tables
 Columns
 Data types
 Indexes
 Constraints
 Users and permissions
 Relationships among database objects
Purpose of Data Dictionary:
 Ensures consistency and integrity of the database schema.
 Helps DBMS manage, optimize, and secure database operations.
 Provides information to developers and DBAs (Database Administrators) about
structure and organization.

4.4.2 Data Dictionary Storage Components


 A data dictionary is stored as a set of system tables within the database itself. These
are typically hidden from regular users but can be queried by administrators.
 Key Contents Stored in the Data Dictionary:

| Metadata Type | Description |
|---|---|
| Table Definitions | Table names, column names, data types, default values |
| Column Constraints | Primary keys, foreign keys, unique constraints, not null |
| Index Info | Index names, associated tables/columns |
| User Accounts | Usernames, roles, privileges |
| Views & Triggers | Definitions and dependencies |

4.5. Column-Oriented Storage in DBMS


Definition:
Column-oriented storage (or columnar storage) is a method of storing data in a database
table column-by-column instead of row-by-row.
In contrast to row-oriented storage (used by traditional relational databases), where all
columns of a single row are stored together, column-oriented storage stores each column’s
values together.
How It Works:
Consider this table:

| ID | Name | Age |
|---|---|---|
| 1 | John | 25 |
| 2 | Alice | 30 |
| 3 | Bob | 28 |
In Row-Oriented Storage:
 Data is stored like:
[1, John, 25], [2, Alice, 30], [3, Bob, 28]
In Column-Oriented Storage:
 Data is stored like:
ID: [1, 2, 3]
Name: [John, Alice, Bob]
Age: [25, 30, 28]
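The two layouts can be contrasted with a small Python sketch, using the table above and an illustrative aggregate that touches only one column:

```python
# Row-oriented: all columns of each record are stored together.
rows = [(1, "John", 25), (2, "Alice", 30), (3, "Bob", 28)]

# Column-oriented: each column's values are stored together.
columns = {
    "ID":   [r[0] for r in rows],
    "Name": [r[1] for r in rows],
    "Age":  [r[2] for r in rows],
}

# An aggregate such as AVG(Age) only needs the Age column,
# instead of reading every full row.
avg_age = sum(columns["Age"]) / len(columns["Age"])
print(round(avg_age, 2))  # → 27.67
```

In a real column store the per-column arrays are also compressed (RLE, dictionary encoding), which is effective precisely because values of one column tend to be similar.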

4.5.1 Benefits of Column-Oriented Storage

| Benefit | Explanation |
|---|---|
| Faster analytics | Ideal for read-heavy operations like aggregates, filtering, and reporting. Only relevant columns are read. |
| Better compression | Column data often has similar types or repeated values, allowing more efficient compression (e.g., RLE, dictionary encoding). |
| Efficient use of memory | Reduces I/O by reading only required columns, not full rows. |
| Optimized for OLAP | Great for Online Analytical Processing (data warehousing, BI), where queries often target few columns across many rows. |

Drawbacks

| Drawback | Explanation |
|---|---|
| Slower writes and updates | Inserts and updates are more complex since data is split across columns. |
| Not ideal for transactional (OLTP) workloads | Column stores are inefficient for operations that need full rows (e.g., frequent inserts or updates). |
| Join performance | May need special optimization for joins due to column disaggregation. |

4.6. Indexing and Hashing in DBMS


Both indexing and hashing are techniques used in Database Management Systems
(DBMS) to speed up data retrieval. They are especially important for handling large
databases efficiently.

A. Indexing in DBMS
Indexing is a data structure technique used to quickly locate and access data in a
database without scanning the entire table.
An index is similar to the index of a book: it lets you jump directly to the needed
information.
Types of Indexes:
| Type | Description |
|---|---|
| Primary Index | Built on the primary key of a table. Usually unique and automatically created. |
| Secondary Index | Built on non-primary key attributes. Can have duplicates. |
| Clustered Index | Alters the physical order of data to match the index. Only one per table. |
| Non-clustered Index | Stores a pointer to the actual data. Multiple can exist per table. |

Indexing Data Structures:


1. B+ Tree Indexes (most common in DBMS):
o Balanced tree structure.
o Allows fast range queries and sorted traversal.
o Used in MySQL, Oracle, PostgreSQL.
2. Bitmap Indexes:
o Uses bitmaps for each distinct value.
o Efficient for columns with low cardinality (e.g., gender: M/F).

B. Hashing in DBMS
Hashing is a technique where a hash function is used to map search keys to specific storage
locations (called buckets), enabling direct access to the data.
How Hashing Works:
 A hash function (e.g., h(key) = key mod n) converts a search key into a bucket
number.
 Data is stored in buckets based on this hash value.
 Great for exact match lookups, but not suitable for range queries.

Types of Hashing:

| Type | Description |
|---|---|
| Static Hashing | Number of buckets is fixed. Simple but inflexible. |
| Dynamic Hashing | Buckets grow or shrink as needed. Examples: Extendible Hashing, Linear Hashing. More scalable. |

Advantages of Hashing:
 Fast access for exact-match queries.
 Ideal for key-based lookups.
Disadvantages:
 Not suitable for range searches or ordering.
 Potential for collisions, requiring handling methods like chaining or open addressing.

4.7. Ordered Indices in DBMS


Definition:
An ordered index (also called a sequential index or sorted index) is a type of index
where the index entries are stored in sorted order based on the key field of the
records.
This allows for:
 Efficient searching
 Faster range queries
 Sequential data access
Ordered indices are typically implemented using B+ Trees in modern databases.
Types of Ordered Indices
| Type | Description |
|---|---|
| Primary Ordered Index | Built on the primary key of a table, where the data is stored in the same sorted order. |
| Secondary Ordered Index | Built on non-primary attributes, with index entries sorted by that field. Data may not be physically sorted. |

Example Table:
Let’s say we have a table of students:
| Student_ID | Name | Age |
|---|---|---|
| 102 | Alice | 22 |
| 104 | John | 23 |
| 106 | Bob | 21 |
| 108 | Emma | 22 |
| 110 | Zara | 20 |

            [104]
           /     \
      [102]       [104, 106, 108, 110]
        |           |    |    |    |
      Alice        John Bob  Emma  Zara

 Keys in the internal nodes direct the search.


 Leaf nodes contain the actual index entries and point to the records.
This structure supports binary search, fast range queries (e.g., find students with
IDs between 104 and 110), and sequential access.

Advantages of Ordered Indices:


 Great for range-based queries.
 Easy to traverse records in sorted order.
 Efficient search, insert, and delete operations when using tree structures like B+
Trees.
Limitations:
 Requires maintenance overhead when inserting or deleting records (to keep order).
 Needs extra storage for the index.

4.8. B-Tree Index File in DBMS


Definition:
A B-Tree (Balanced Tree) is a self-balancing tree data structure that maintains sorted
data and allows searches, insertions, deletions, and sequential access in logarithmic time.
In DBMS, B-Trees are used to implement ordered indexing, especially for primary and
secondary indices.
Properties of B-Trees:
1. Balanced: All leaf nodes are at the same depth.
2. Sorted: Keys in each node are stored in increasing order.
3. Efficient: Minimizes disk I/O by keeping nodes with many keys (high branching
factor).
4. M-Way Tree: A B-tree of order m has at most m children and m-1 keys per node.
Structure of a B-Tree Node:
A B-tree node contains:
 A list of keys (used to guide the search)
 A list of pointers to child nodes (or data records)
 Internal nodes act like decision points
 Leaf nodes contain actual data pointers
A. Insertion in B-Tree
Steps:
1. Start from the root and find the correct leaf.
2. Insert the key in sorted order.
3. If the node is full, split it:
o Middle key moves up to the parent.
o Two new child nodes are created.
Insertion Algorithm:

procedure B-Tree-Insert(Node x, Key k)
    find the smallest i such that x.keys[i] > k, or set i = numkeys(x)
    if x is a leaf then
        insert k into x.keys at position i
    else
        if x.child[i] is full then
            split x.child[i]
            if k > x.keys[i] then
                i = i + 1
            end if
        end if
        B-Tree-Insert(x.child[i], k)
    end if
end procedure
Example: Insert 10, 20, 5, 6, 12 into B-tree of order 3.
[10]
/ \
[5, 6] [12, 20]
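The "find i" step of the insertion algorithm is a sorted-position search within one node. In Python it can be sketched with the standard bisect module (the node keys below are illustrative):

```python
import bisect

def find_child_index(keys, k):
    """Return the smallest i such that keys[i] > k, or len(keys) if
    no such key exists. This picks the child subtree to descend into."""
    return bisect.bisect_right(keys, k)

node_keys = [10, 20, 40]
print(find_child_index(node_keys, 25))  # → 2 (descend between 20 and 40)
print(find_child_index(node_keys, 50))  # → 3 (rightmost child)
```

Because keys within a node are kept sorted, this per-node search is logarithmic in the node size, and the overall search cost is proportional to the tree height.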

B. Deletion in B-Tree
Steps:

1. Find and delete the key from the leaf or internal node.
2. If underflow (node has fewer keys than allowed), fix it by:
o Redistribution (borrow from sibling)
o Merge with sibling
3. If the root has no keys and only one child, make child the new root.
Example: Delete 6 from the B-Tree above.
[10]
/ \
[5] [12, 20]



4.9. B+ Tree Index in DBMS
Definition:
A B+ Tree is an enhanced form of a B-Tree used in database indexing. It is a balanced
tree structure where:
 All records (data pointers) are stored only at the leaf nodes.
 Internal nodes only store keys, not actual data.
 Leaf nodes are linked together for efficient range queries.
This structure makes B+ Trees ideal for database indexing, especially for range queries
and ordered data retrieval.
Structure of a B+ Tree
 Internal nodes: Contain keys to guide the search; no actual data.
 Leaf nodes: Contain all the actual data (or pointers to records).
 Linked leaf nodes: Allow fast sequential access.
Example: Simple B+ Tree of Order 3

 Max 2 keys per node, 3 child pointers (order = 3)


 Leaf nodes have keys and pointers to data records
 Internal nodes guide search with only keys
A. Insertion in B+ Tree
Steps:
1. Find the correct leaf node for the key.
2. Insert the key in sorted order.
3. If the leaf overflows, split it:
o Push the middle key up to the parent.
o Split the leaf node into two.
4. If the parent overflows, recursively split up.
Example: Insert 5, 10, 15, 20, 25 into B+ Tree (Order 3)
1. Insert 5, 10 → [5, 10] in one leaf node
2. Insert 15 → [5, 10, 15] → Overflows → Split into:
o [5] and [10, 15]
o Promote 10 to parent
[10]
/ \
[5] [10, 15]
Insert 20
[10, 15]
/ | \
[5] [10] [15, 20]
 All data is in leaf nodes.
 Leaf nodes are linked for sequential access.
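The linked leaves are what make B+ tree range queries cheap. A minimal Python sketch, using the leaf contents from the example above (the Leaf class is a simplification, not a full B+ tree):

```python
class Leaf:
    """A B+ tree leaf node: sorted keys plus a link to the right sibling."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None

# Leaves from the example above: [5], [10], [15, 20]
a, b, c = Leaf([5]), Leaf([10]), Leaf([15, 20])
a.next, b.next = b, c

def range_scan(start_leaf, lo, hi):
    """Collect keys in [lo, hi] by following leaf links only,
    without going back up through internal nodes."""
    out, leaf = [], start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out        # keys are sorted, so we can stop early
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

print(range_scan(a, 10, 20))  # → [10, 15, 20]
```

In a real B+ tree the search would first descend from the root to the leaf containing `lo`; from there the scan proceeds purely along sibling links, which is why range queries and sorted output are a B+ tree strength.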
B. Deletion in B+ Tree
Steps:
1. Delete the key from the leaf node.
2. If the node underflows (fewer keys than minimum):
o Borrow from a sibling (left or right), or
o Merge with sibling and adjust parent.
3. Propagate changes upward if needed.
Example: Delete 20 from the tree:
[15]
/ \
[10] [20]
/ \ / \
[5] [10] [15] [20, 25]

After deletion,
[15]
/ \
[10] [25]
/ \ / \
[5] [10] [15] [25]
4.10 Static Hashing in DBMS
Definition: Static Hashing is a technique used in database systems for fast data retrieval
where the number of primary buckets is fixed when the file is created and does not
change as new data is inserted.
 A hash function h(k) is used to compute the address (or bucket number) of the disk
block where the record with key k will be stored.
 For example, if there are 10 buckets, the hash function might be:
h(k) = k mod 10
 Each bucket typically corresponds to a disk block.
Features:
 Fixed number of buckets: Once defined, the number of buckets remains constant.
 Simple and fast: Efficient for retrieval when data size is known and stable.
 Overflow handling: When a bucket is full, overflow occurs. Overflow buckets are
used to handle collisions (usually via chaining or open addressing).
Disadvantages:
 Poor scalability: Not suitable for dynamic datasets where the number of records
increases or decreases frequently.
 Clustering and overflow: As more records are added, more collisions and overflows
may occur, degrading performance.
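A minimal Python sketch of static hashing with chained overflow; the bucket count and keys are illustrative:

```python
NUM_BUCKETS = 10  # fixed when the file is created; never changes

def h(key):
    """Hash function mapping a key to a bucket number."""
    return key % NUM_BUCKETS

# Each bucket is a list; appending beyond the bucket's block capacity
# models overflow chaining in a real static-hashed file.
buckets = [[] for _ in range(NUM_BUCKETS)]

for key in [15, 25, 35, 7]:
    buckets[h(key)].append(key)

print(buckets[5])  # → [15, 25, 35]  (15, 25, 35 all collide in bucket 5)
print(buckets[7])  # → [7]
```

The collision in bucket 5 illustrates the scalability problem: as more keys hash to the same fixed bucket, lookups degrade toward a linear scan of the overflow chain.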

4.11 Dynamic Hashing in DBMS

 Definition: Dynamic Hashing is a flexible hashing technique used to overcome the


limitations of static hashing by dynamically adjusting the number of buckets as the
data grows or shrinks.
 Instead of a fixed number of buckets, dynamic hashing uses a directory that grows
and shrinks as needed.
 A hash function h(k) produces a large number of bits, and only the first d bits (where
d is the global depth) are used to index into the directory.
 Each directory entry points to a bucket, which stores records.
 If a bucket overflows, it is split, and the directory may double in size (increasing d)
to accommodate the new buckets.
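The directory lookup can be sketched as taking the top d bits of the hash value. Python's built-in hash is used below as a stand-in for the real hash function, and hash_bits is an assumed fixed width:

```python
def directory_index(key: int, global_depth: int, hash_bits: int = 32) -> int:
    """Index into the extendible-hashing directory using the first
    `global_depth` bits of the hash value."""
    h = hash(key) & ((1 << hash_bits) - 1)   # hash value, hash_bits wide
    return h >> (hash_bits - global_depth)   # top global_depth bits

# When the directory doubles (global_depth d -> d+1), each old entry
# simply splits into two adjacent entries; no full rehash is needed.
```

This is why bucket splits are cheap: increasing the global depth by one refines every existing directory slot into a pair of slots that still point at the same buckets until one of them actually splits.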

Advantages:
 Scalable and efficient for large or growing datasets.
 Minimizes overflow, as the system adapts dynamically.
 Efficient space utilization, compared to static hashing.

Disadvantages:
 More complex than static hashing.
 Directory maintenance may cause overhead, especially when doubling the size.
4.12 Query Processing - Overview
Query Processing refers to the series of steps a DBMS takes to interpret and execute a
SQL query efficiently and correctly. The goal is to retrieve the requested data in the least
amount of time and using the least resources (like CPU, memory, and disk I/O).

Main Phases of Query Processing:

1. Parsing and Translation


o Input: SQL query from the user.
o Action:
 The query is checked for syntax and semantic errors.
 It is then translated into an internal form like a parse tree or logical
query plan (relational algebra).
o Example: SELECT * FROM students WHERE age > 20 is translated into a
tree with SELECT as the root and age > 20 as a filter.
2. Query Optimization

 Action:
o Multiple equivalent query plans are generated (different ways to execute the
same query).
o The optimizer selects the most efficient plan based on cost estimates (e.g.,
how many rows, I/O, CPU).
 Types:
o Heuristic optimization: Uses rules of thumb (e.g., push selections down).
o Cost-based optimization: Uses statistics and estimates the cost of each plan.

3. Query Execution Plan Generation

 Action:
o The selected plan is converted into a physical query plan, specifying actual
algorithms and access paths (e.g., table scan, index scan, join type).

4. Query Execution

 Action:
o The DBMS engine executes the physical plan using available resources.
o It fetches data from storage, processes it, and returns the result to the user.

Key Components Involved:

 Parser: Checks query syntax and builds parse tree.


 Optimizer: Chooses the most efficient execution strategy.
 Executor: Runs the physical query plan.
 Catalog Manager: Provides metadata (like indexes, schema info).
 Buffer Manager: Handles in-memory data caching.
 Disk Manager: Reads/writes data to/from disk.

4.13 Algorithms for Selection, Sorting and join operations


Selection Algorithms (σ)

Selection retrieves rows from a table that satisfy a given condition.

Common Selection Algorithms:

a) Linear Search (Brute Force)

 How it works: Scans every record and checks if it satisfies the condition.
 Use Case: No index on the selection field.
 Cost: O(n) I/Os (where n = number of records)

b) Binary Search

 How it works: Used if the relation is sorted on the selection field.


 Use Case: Selection condition on a sorted attribute.
 Cost: O(log2 n) I/Os to find the start, then scan for matches.

c) Index-Based Search

 How it works: Uses a B+ tree or hash index to quickly locate matching records.
 Use Case: Index exists on the selection attribute.
 Cost: Low (depends on index depth)

2. Sorting Algorithms

Sorting arranges the tuples in a specified order (e.g., by a column).

Common Sorting Algorithms:

a) External Merge Sort

 Why: In databases, relations are usually too large to fit in memory.


 Steps:
1. Pass 0: Read chunks (runs) that fit in memory, sort them using quicksort or
heapsort, write them back.
2. Merge Passes: Merge sorted runs until only one run remains.
 Cost: 2N · log_M N I/Os (N = number of pages, M = number of buffer pages)
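The two phases can be sketched in Python, with heapq.merge standing in for the k-way merge pass. Here "pages" are single values for simplicity, and buffer_pages plays the role of M:

```python
import heapq

def external_merge_sort(pages, buffer_pages):
    """Sketch of two-phase external merge sort.
    Pass 0: sort runs of `buffer_pages` items in memory.
    Merge pass: k-way merge of the sorted runs (heapq.merge streams them
    without loading everything at once)."""
    runs = [sorted(pages[i:i + buffer_pages])
            for i in range(0, len(pages), buffer_pages)]
    return list(heapq.merge(*runs))

data = [9, 4, 7, 1, 8, 2, 6, 3]
print(external_merge_sort(data, buffer_pages=3))  # → [1, 2, 3, 4, 6, 7, 8, 9]
```

A real DBMS writes each sorted run back to disk in pass 0 and may need several merge passes when the number of runs exceeds the available buffers, which is where the log_M N factor in the cost comes from.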

3. Join Algorithms

Joins combine tuples from two relations based on a common attribute.

Common Join Algorithms:

a) Nested Loop Join

 How it works: For each tuple in R, scan the entire S for matching tuples.
 Variants:
o Naive Nested Loop
o Block Nested Loop (more efficient)
 Cost: M + (M · N), where M and N are the number of pages in R and S

b) Sort-Merge Join

 How it works:
1. Sort both R and S on the join attribute.
2. Merge them like in merge sort.
 Cost: Cost of sorting + merging.

c) Hash Join

 How it works:
1. Partition both R and S using a hash function.
2. Join matching partitions (in memory if possible).
 Use Case: Efficient for equality joins.
 Cost: O(M+N)
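The build-and-probe structure of a hash join can be sketched in Python. The relations and key names below are illustrative; a real DBMS partitions both inputs to disk first when the build side does not fit in memory:

```python
from collections import defaultdict

def hash_join(R, S, r_key, s_key):
    """Equality join: build a hash table on R, then probe it with S."""
    table = defaultdict(list)
    for r in R:                          # build phase
        table[r[r_key]].append(r)
    out = []
    for s in S:                          # probe phase
        for r in table.get(s[s_key], []):
            out.append((r, s))
    return out

R = [{"id": 1, "name": "John"}, {"id": 2, "name": "Alice"}]
S = [{"id": 1, "dept": "CS"}, {"id": 3, "dept": "EE"}]
print(hash_join(R, S, "id", "id"))
# → [({'id': 1, 'name': 'John'}, {'id': 1, 'dept': 'CS'})]
```

Each input is read once, which matches the O(M + N) cost; the limitation is that hashing only supports equality predicates, not range joins.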

4.14 Query optimization using Heuristics

Heuristic optimization is a rule-based approach to query optimization in which a set of


predefined transformation rules (heuristics) are applied to a query’s logical
representation (usually relational algebra) to improve efficiency.

 It does not calculate costs like cost-based optimization.


 Instead, it relies on proven strategies that generally result in better
performance.

Perform Selection Operations Early (Push Selections Down)

 Apply selection as close to the base relation as possible.


 Why: Reduces the number of tuples that need to be processed in later
operations.

SELECT name FROM Students WHERE age > 20;

Apply σ(age > 20) before projecting name.

Perform Projections Early

 Eliminate unnecessary attributes as early as possible.


 Why: Reduces the size of intermediate relations.

Combine Cascading Selections

 Combine multiple selection conditions into one.


 Why: Reduces the number of operations.
Example:
σ(age>20)(σ(dept='CS')(Students)) ⇒ σ(age>20 ∧ dept='CS')(Students)

Combine Cascading Projections

 Combine multiple projections into one.

Example:

πname(πname,age(Students)) ⇒ πname(Students)

Replace Cartesian Product + Selection with a Join

 Convert cross product followed by selection into a join.

Example:

σ(A.id=B.id)(A × B) ⇒ A ⋈ B

Use Associative and Commutative Properties of Joins

 Rearrange joins to minimize the size of intermediate results.


 Why: Smaller intermediate results = faster performance.

Example:

(A ⋈ B) ⋈ C ⇒ A ⋈ (B ⋈ C)

4.15 Cost Estimation

Cost Estimation is the process of predicting the resources (like I/O, CPU time, memory)
needed to execute different query execution plans. It’s used in cost-based query optimization
to choose the most efficient plan for a given SQL query.

What Costs Are Estimated?

1. Disk I/O Cost


o Reading/writing pages from/to disk
o Usually the dominant cost in large databases
2. CPU Cost
o Time to process records (comparisons, computations, hashing)
3. Memory Usage
o Buffer space needed to hold intermediate results
4. Communication Cost
o In distributed DBMS: data transfer between nodes

What Does the Optimizer Use to Estimate Costs?

1. Statistics from the Database Catalog:


o Number of tuples in a relation (cardinality)
o Number of pages (blocks)
o Number of distinct values in a column
o Index presence and type
2. Selectivity Estimation:
o Selectivity: Fraction of rows that satisfy a condition.
o Used to estimate the number of output tuples.

Estimated Output = Input Tuples × Selectivity

Access Paths:

 Full table scan vs. index scan vs. hash lookup.

How Cost Estimation Works in Optimization:

Example: Join between two tables R and S

The optimizer may consider:

 Nested loop join: Cost = size(R) × size(S)


 Sort-merge join: Cost = sort(R) + sort(S) + merge(R, S)
 Hash join: Cost = partition + matching

It estimates the cost of each and chooses the lowest.
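A toy sketch of this comparison in Python. The cost formulas follow the simplified page-count estimates above; 3 · (M + N) for hash join is a common textbook approximation, assumed here rather than taken from any specific optimizer:

```python
def nested_loop_cost(M, N):
    """Pages read: R once, plus a full scan of S per page of R."""
    return M + M * N

def hash_join_cost(M, N):
    """Approximate Grace hash join: partition (read + write both
    relations) plus one matching pass over the partitions."""
    return 3 * (M + N)

M, N = 1000, 500  # assumed catalog statistics: pages in R and S
costs = {
    "nested loop": nested_loop_cost(M, N),
    "hash join": hash_join_cost(M, N),
}
print(min(costs, key=costs.get))  # → hash join
```

With these (illustrative) statistics the nested loop plan costs 501,000 page I/Os versus 4,500 for the hash join, so a cost-based optimizer would pick the hash join; with very small inputs the ranking can reverse, which is exactly why the optimizer estimates rather than applies a fixed rule.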
