Unit-4
Definition:
File organization refers to the method of arranging data records in a file on storage
(like a hard disk), so they can be efficiently stored, retrieved, updated, and
managed by a Database Management System (DBMS).
Types of File Organization
There are several common file organization methods, each suitable for different
access and update needs:
Consider the following example table:
ID Name Age
1 John 25
2 Alice 30
3 Bob 28
In Row-Oriented Storage:
Data is stored like:
[1, John, 25], [2, Alice, 30], [3, Bob, 28]
In Column-Oriented Storage:
Data is stored like:
ID: [1, 2, 3]
Name: [John, Alice, Bob]
Age: [25, 30, 28]
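The row/column contrast above can be sketched in a few lines of Python (an illustration of the two layouts, not of actual DBMS storage internals):

```python
# The same three student records in the two layouts.

# Row-oriented: each record's fields are stored together.
rows = [(1, "John", 25), (2, "Alice", 30), (3, "Bob", 28)]

# Column-oriented: each attribute's values are stored together.
columns = {
    "ID":   [1, 2, 3],
    "Name": ["John", "Alice", "Bob"],
    "Age":  [25, 30, 28],
}

# An aggregate such as AVG(Age) must touch every full record in the
# row layout, but only one contiguous array in the column layout.
avg_row = sum(r[2] for r in rows) / len(rows)
avg_col = sum(columns["Age"]) / len(columns["Age"])
assert avg_row == avg_col  # same answer, different amount of data read
```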
Benefits
Faster analytics: Ideal for read-heavy operations like aggregates, filtering, and reporting; only the relevant columns are read.
Better compression: Column data often has similar types or repeated values, allowing more efficient compression (e.g., RLE, dictionary encoding).
Efficient use of memory: Reduces I/O by reading only the required columns, not full rows.
Optimized for OLAP: Great for Online Analytical Processing (data warehousing, BI), where queries often target few columns across many rows.
Drawbacks
Slower writes and updates: Inserts and updates are more complex since data is split across columns.
Not ideal for transactional (OLTP) workloads: Column stores are inefficient for operations that need full rows (e.g., frequent inserts or updates).
Join performance: May need special optimization for joins due to column disaggregation.
Indexing is a data structure technique used to quickly locate and access data in a
database without scanning the entire table.
An index is similar to the index of a book—it lets you jump directly to the needed
information.
Types of Indexes:
Non-clustered Index: Stores a pointer to the actual data. Multiple non-clustered indexes can exist per table.
B. Hashing in DBMS
Hashing is a technique where a hash function is used to map search keys to specific storage
locations (called buckets), enabling direct access to the data.
How Hashing Works:
A hash function (e.g., h(key) = key mod n) converts a search key into a bucket
number.
Data is stored in buckets based on this hash value.
Great for exact match lookups, but not suitable for range queries.
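As a quick sketch of the mapping step (the bucket count n = 4 is an arbitrary choice for the demo):

```python
n = 4                             # number of buckets (assumed for the demo)
buckets = [[] for _ in range(n)]

def h(key):
    return key % n                # hash function: h(key) = key mod n

for key in [10, 15, 22, 7]:
    buckets[h(key)].append(key)   # store each key in its bucket

# Exact-match lookup jumps straight to one bucket:
assert 22 in buckets[h(22)]
# Nearby keys (e.g., 10 and 15) land in different buckets, which is
# why hashing gives no help for range queries.
```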
Types of Hashing:
Static Hashing: The number of buckets is fixed. Simple but inflexible.
Dynamic Hashing: Buckets grow or shrink as needed. Examples: Extendible Hashing, Linear Hashing. More scalable.
Advantages of Hashing:
Fast access for exact-match queries.
Ideal for key-based lookups.
Disadvantages:
Not suitable for range searches or ordering.
Potential for collisions, requiring handling methods like chaining or open addressing.
Primary (Ordered) Index: Built on the primary key of a table, where the data is stored in the same sorted order.
Example Table:
Let’s say we have a table of students:
Student_ID Name Age
102 Alice 22
104 John 23
106 Bob 21
108 Emma 22
110 Zara 20
                 [104]
                /     \
          [102]        [106, 108, 110]
          /   \        /    |    |    \
      Alice  John    Bob  Emma  Zara  (null)
B. Deletion in B-Tree
Steps:
1. Find and delete the key from the leaf or internal node.
2. If underflow (node has fewer keys than allowed), fix it by:
o Redistribution (borrow from sibling)
o Merge with sibling
3. If the root has no keys and only one child, make child the new root.
Example: Delete 6 from the following B-Tree.
Before deletion:
         [10]
        /    \
   [5, 6]   [12, 20]
After deletion, 6 is simply removed from its leaf. No underflow occurs, since [5] still holds the minimum number of keys:
         [10]
        /    \
     [5]    [12, 20]
4.10 Static Hashing in DBMS
Definition: Static Hashing is a technique used in database systems for fast data retrieval
where the number of primary buckets is fixed when the file is created and does not
change as new data is inserted.
A hash function h(k) is used to compute the address (or bucket number) of the disk
block where the record with key k will be stored.
For example, if there are 10 buckets, the hash function might be:
h(k) = k mod 10
Each bucket typically corresponds to a disk block.
Features:
Fixed number of buckets: Once defined, the number of buckets remains constant.
Simple and fast: Efficient for retrieval when data size is known and stable.
Overflow handling: When a bucket is full, overflow occurs. Overflow buckets are
used to handle collisions (usually via chaining or open addressing).
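A minimal sketch of static hashing with overflow chaining (a bucket capacity of 2 is assumed only to force an overflow quickly):

```python
NUM_BUCKETS = 10   # fixed when the file is created; never changes
CAPACITY = 2       # records per primary bucket (kept tiny for the demo)

primary = [[] for _ in range(NUM_BUCKETS)]
overflow = [[] for _ in range(NUM_BUCKETS)]   # one overflow chain per bucket

def insert(key):
    b = key % NUM_BUCKETS          # h(k) = k mod 10
    if len(primary[b]) < CAPACITY:
        primary[b].append(key)
    else:
        overflow[b].append(key)    # bucket full: record goes to the chain

def lookup(key):
    b = key % NUM_BUCKETS
    return key in primary[b] or key in overflow[b]

for k in [12, 22, 32, 7]:
    insert(k)
# 12 and 22 fill bucket 2; 32 collides and lands in its overflow chain.
```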
Disadvantages:
Poor scalability: Not suitable for dynamic datasets where the number of records
increases or decreases frequently.
Clustering and overflow: As more records are added, more collisions and overflows
may occur, degrading performance.
4.11 Dynamic Hashing
Advantages:
Scalable and efficient for large or growing datasets.
Minimizes overflow, as the system adapts dynamically.
Efficient space utilization, compared to static hashing.
Disadvantages:
More complex than static hashing.
Directory maintenance may cause overhead, especially when doubling the size.
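The directory-doubling behaviour can be sketched roughly as follows (a simplified extendible-hashing toy, not a production layout; the bucket size and bit arithmetic are illustrative):

```python
BUCKET_SIZE = 2            # records per bucket (small to force splits)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

global_depth = 1
directory = [Bucket(1), Bucket(1)]   # indexed by the low global_depth bits

def insert(key):
    global global_depth, directory
    b = directory[key % (2 ** global_depth)]
    if len(b.keys) < BUCKET_SIZE:
        b.keys.append(key)
        return
    # Full bucket already at maximum depth: double the directory first.
    if b.local_depth == global_depth:
        directory = directory + directory
        global_depth += 1
    # Split the bucket: half of its directory entries move to a sibling.
    b.local_depth += 1
    sibling = Bucket(b.local_depth)
    bit = 1 << (b.local_depth - 1)
    for i in range(len(directory)):
        if directory[i] is b and (i & bit):
            directory[i] = sibling
    # Redistribute the old keys plus the new one.
    old, b.keys = b.keys, []
    for k in old + [key]:
        insert(k)

for k in [0, 2, 4, 6, 8]:
    insert(k)              # triggers two splits; global_depth grows to 3
```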
4.12 Query Processing - Overview
Query Processing refers to the series of steps a DBMS takes to interpret and execute a
SQL query efficiently and correctly. The goal is to retrieve the requested data in the least
amount of time and using the least resources (like CPU, memory, and disk I/O).
2. Query Optimization
Action:
o Multiple equivalent query plans are generated (different ways to execute the
same query).
o The optimizer selects the most efficient plan based on cost estimates (e.g.,
how many rows, I/O, CPU).
Types:
o Heuristic optimization: Uses rules of thumb (e.g., push selections down).
o Cost-based optimization: Uses statistics and estimates the cost of each plan.
3. Query Plan Generation
Action:
o The selected plan is converted into a physical query plan, specifying actual
algorithms and access paths (e.g., table scan, index scan, join type).
4. Query Execution
Action:
o The DBMS engine executes the physical plan using available resources.
o It fetches data from storage, processes it, and returns the result to the user.
a) Linear Search
How it works: Scans every record and checks if it satisfies the condition.
Use Case: No index on the selection field.
Cost: O(n) I/Os (where n = number of records)
b) Binary Search
How it works: Applicable when the file is sorted on the selection attribute; the search range is repeatedly halved.
Cost: O(log n) I/Os
c) Index-Based Search
How it works: Uses a B+ tree or hash index to quickly locate matching records.
Use Case: Index exists on the selection attribute.
Cost: Low (depends on index depth)
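The difference between (a) and (c) can be seen with a toy table, using a plain dict as a stand-in for a hash index (names and values are made up for the demo):

```python
records = [(102, "Alice"), (104, "John"), (106, "Bob"), (108, "Emma")]

# a) Linear search: every record is examined -- O(n).
def linear_select(key):
    return [r for r in records if r[0] == key]

# c) Index-based search: build the index once, then each lookup is a
# single probe (a dict here plays the role of a hash index).
index = {r[0]: r for r in records}

def index_select(key):
    return index.get(key)   # None when no record matches

assert linear_select(106) == [(106, "Bob")]
assert index_select(106) == (106, "Bob")
```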
2. Sorting Algorithms
Sorting is needed for ORDER BY, duplicate elimination, and as a building block for sort-merge join; files larger than memory are sorted with external merge sort.
3. Join Algorithms
a) Nested Loop Join
How it works: For each tuple in R, scan the entire S for matching tuples.
Variants:
o Naive Nested Loop
o Block Nested Loop (more efficient)
Cost: M + (M ⋅ N), where M and N are the sizes of R and S
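The naive variant can be written directly from its definition (relation contents are made up for the demo):

```python
R = [(1, "a"), (2, "b"), (3, "c")]   # (join key, payload)
S = [(2, "x"), (3, "y"), (3, "z")]

def nested_loop_join(R, S):
    out = []
    for r in R:              # outer relation: M tuples
        for s in S:          # inner relation rescanned for each outer tuple
            if r[0] == s[0]:
                out.append((r[0], r[1], s[1]))
    return out

print(nested_loop_join(R, S))
# → [(2, 'b', 'x'), (3, 'c', 'y'), (3, 'c', 'z')]
```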
b) Sort-Merge Join
How it works:
1. Sort both R and S on the join attribute.
2. Merge them like in merge sort.
Cost: Cost of sorting + merging.
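A compact version of the two steps might look like this; duplicate keys on the inner side are handled by scanning a run of equal keys:

```python
def sort_merge_join(R, S):
    R, S = sorted(R), sorted(S)   # step 1: sort both on the join attribute
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][0] < S[j][0]:
            i += 1                # advance the side with the smaller key
        elif R[i][0] > S[j][0]:
            j += 1
        else:
            k = j                 # step 2: join the run of equal S keys
            while k < len(S) and S[k][0] == R[i][0]:
                out.append((R[i][0], R[i][1], S[k][1]))
                k += 1
            i += 1                # j stays put so duplicate R keys re-scan the run
    return out

print(sort_merge_join([(1, "a"), (2, "b"), (2, "c")], [(2, "x"), (3, "y")]))
# → [(2, 'b', 'x'), (2, 'c', 'x')]
```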
c) Hash Join
How it works:
1. Partition both R and S using a hash function.
2. Join matching partitions (in memory if possible).
Use Case: Efficient for equality joins.
Cost: O(M+N)
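The two phases can be sketched as an in-memory hash join (assuming R is the smaller, build-side relation):

```python
def hash_join(R, S):
    # Build phase: hash R on its join key.
    table = {}
    for rid, rval in R:
        table.setdefault(rid, []).append(rval)
    # Probe phase: each S tuple probes the hash table once.
    out = []
    for sid, sval in S:
        for rval in table.get(sid, []):
            out.append((sid, rval, sval))
    return out

print(hash_join([(1, "a"), (2, "b")], [(2, "x"), (2, "y"), (3, "z")]))
# → [(2, 'b', 'x'), (2, 'b', 'y')]
```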
Example (cascade of projections):
πname(πname,age(Students)) ⇒ πname(Students)
Example (selection over a product becomes a join):
σA.id=B.id(A × B) ⇒ A ⋈ B
Example (join associativity):
(A ⋈ B) ⋈ C ⇒ A ⋈ (B ⋈ C)
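The second rule can be checked on tiny made-up relations: filtering the full Cartesian product on equal ids returns exactly the tuples a direct join produces, while the join avoids materializing all |A|·|B| pairs:

```python
A = [(1, "p"), (2, "q")]
B = [(2, "u"), (3, "v")]

# σ A.id = B.id (A × B): build all pairs, then filter.
product = [(a, b) for a in A for b in B]                 # 4 pairs
selected = [(a, b) for (a, b) in product if a[0] == b[0]]

# A ⋈ B computed directly (a real DBMS would use one of the
# join algorithms above instead of this comprehension).
joined = [(a, b) for a in A for b in B if a[0] == b[0]]

assert selected == joined   # the two expressions are equivalent
```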
Cost Estimation is the process of predicting the resources (like I/O, CPU time, memory)
needed to execute different query execution plans. It’s used in cost-based query optimization
to choose the most efficient plan for a given SQL query.
Access Paths: