DBMS Unit 3
Introduction To Normalization:
Normalization is the process of organizing data in a database. This includes creating tables and
establishing relationships between those tables according to rules designed both to protect the data
and to make the database more flexible by eliminating redundancy and inconsistent dependency.
Redundant data wastes disk space and creates maintenance problems. If data that exists in more
than one place must be changed, the data must be changed in exactly the same way in all locations.
A customer address change is much easier to implement if that data is stored only in the Customers
table and nowhere else in the database.
What is an "inconsistent dependency"? While it is intuitive for a user to look in the Customers table
for the address of a particular customer, it may not make sense to look there for the salary of the
employee who calls on that customer. The employee's salary is related to, or dependent on, the
employee and thus should be moved to the Employees table. Inconsistent dependencies can make
data difficult to access because the path to find the data may be missing or broken.
There are a few rules for database normalization. Each rule is called a "normal form." If the first
rule is observed, the database is said to be in "first normal form." If the first three rules are
observed, the database is considered to be in "third normal form."
Each column in a table represents a characteristic of the data, while each row in a table represents a set
of related data, and every row in the table has the same structure. A row is sometimes referred to as a
tuple in DBMS.
Have a look at the Employee table below. It contains attributes as column values, namely
1. Employee_Id
2. Employee_Name
3. Employee_Department
4. Salary
Employee Table
Using this table, let us understand what a functional dependency is.
A relation consisting of functional dependencies always follows a set of rules called RAT rules.
Functional dependency helps in maintaining the quality of data in the database, and it is one of the core
concepts behind database normalization. A functional dependency between two sets of attributes A and B
is represented by A → B, meaning that the values of A uniquely determine the values of B.
Consider a relation with four attributes A, B, C and D,
R (ABCD)
1. A → BCD
2. B → CD
For the first functional dependency A → BCD, attributes B, C and D are functionally dependent
on attribute A.
For the second functional dependency B → CD, attributes C and D are functionally dependent on
attribute B.
Everything on the left side of a functional dependency is also referred to as the determinant set.
The arrow points to the dependent attribute, and the origin of the arrow marks the
determinant set.
Types of Functional Dependencies in DBMS
1. Trivial Functional Dependency: A functional dependency is called trivial if the attributes on the right side are a subset of the attributes on the left side, for example AB → A.
2. Non-Trivial Functional Dependency: A functional dependency X → Y, where X is a set of attributes and Y is also a set of attributes but not a subset of X, is called non-trivial.
3. Multivalued Functional Dependency: In a multivalued functional dependency, the attributes in the dependent set are not dependent on each other. For example, in a relation Employee(Employee_Id, Name, Age), the dependency Employee_Id → {Name, Age} is multivalued because the dependent attributes Name and Age are not functionally dependent on each other (i.e. neither Name → Age nor Age → Name holds).
4. Transitive Functional Dependency: If A → B and B → C hold, then C depends on A only indirectly through B; such an indirect dependency is called a transitive dependency.
William Armstrong in 1974 suggested a few rules (axioms) related to functional dependency. They are:
1. Reflexivity: If B is a subset of A, then the functional dependency A → B holds true.
2. Augmentation: If a functional dependency A → B holds true, then appending any number of
attributes to both sides of the dependency doesn't affect the dependency. It remains true (for example, AC → BC).
3. Transitivity: If A → B and B → C hold true, then A → C also holds true.
Functional dependency also removes data redundancy, since the same values should not be repeated
at multiple places in the database. The process of normalization starts with identifying the candidate keys in the relation. Without
functional dependency, it's impossible to find candidate keys and normalize the database.
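The role of functional dependencies in finding candidate keys can be sketched with a small attribute-closure routine. The sketch below is a minimal illustration, not a full key-finding algorithm; it reuses the R(A, B, C, D) example with A → BCD and B → CD from above.

# Minimal sketch: attribute closure under a set of functional dependencies,
# used to check whether an attribute set is a key of R(A, B, C, D).
def closure(attrs, fds):
    """Return the closure of a set of attributes under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is already determined, add the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

relation = {"A", "B", "C", "D"}
fds = [({"A"}, {"B", "C", "D"}),   # A -> BCD
       ({"B"}, {"C", "D"})]        # B -> CD

print(sorted(closure({"A"}, fds)))        # ['A', 'B', 'C', 'D'] -> A is a candidate key
print(closure({"B"}, fds) == relation)    # False -> B alone is not a key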
Conclusion
Functional dependency defines how the attributes of a relation are related to each other. It helps
in maintaining the quality of data in the database. Armstrong in 1974 suggested a few axioms or rules related to functional dependency. They are:
Rule of Reflexivity
Rule of Augmentation
Rule of Transitivity
Functional dependencies have many advantages, keeping the database design clean, defining
the meaning and constraints of the databases, and removing data redundancy are a few of them.
An unnormalized table design leads to data redundancy and more concerning issues, or rather anomalies, with respect to insertion, deletion, and updating of
data.
1. Insertion Anomaly
It is caused by having to insert the same set of repeated information again and again. This becomes a
problem as the number of entries in the table increases with time.
Example: for the table in Img1, if a new employee must be added to the table, then the corresponding
information of the manager must be repeated, leading to the insertion
anomaly, which will grow with the increase in the entries to the Employee table.
2. Deletion Anomaly
It causes loss of data from the database when a row is removed, because other related data stored in the same row is removed along with it.
Example: for the table in Img1, if the information of manager Mr. X is deleted, then the
information corresponding to the employees associated with Mr. X is also deleted, leading to loss of
employee information along with the deleted manager details.
3. Updating Anomaly
In case of an update, it’s very crucial to make sure that the given update happens for all the rows
associated with the change. Even if a single row gets missed out it will lead to inconsistency of data.
Example: for the table in Img1, if the manager Mr.X’s name has to be updated, the update operation
must be applied to all the rows that Mr.X is associated with. Missing out even a single row causes
inconsistency of data within the database.
The above-mentioned anomalies occur because inadvertently we are storing two or more pieces of
information in every row of a table. To avoid this, Data Normalization comes to the rescue. Data
Normalization ensures data dependency makes sense.
For the normalization process to happen it is important to make sure that the data type of each data
throughout an attribute is the same and there is no mix up within the data types. For example, an attribute
‘Date-of-Birth’ must contain data only with the ‘date’ data type. Let’s dive into the most common
types of Normal Forms.
1.First Normal Form (1NF): This is the most basic level of normalization. In 1NF, each
table cell should contain only a single value, and each column should have a unique
name. The first normal form helps to eliminate duplicate data and simplify queries.
2.Second Normal Form (2NF): 2NF eliminates redundant data by requiring that each
non-key attribute be dependent on the primary key. This means that each column should
be directly related to the primary key, and not to other columns.
3.Third Normal Form (3NF): 3NF builds on 2NF by requiring that all non-key attributes
are independent of each other. This means that each column should be directly related to
the primary key, and not to any other columns in the same table.
4.Boyce-Codd Normal Form (BCNF): BCNF is a stricter form of 3NF that ensures that
each determinant in a table is a candidate key. In other words, BCNF ensures that each
non-key attribute is dependent only on the candidate key.
Normal forms help to reduce data redundancy, increase data consistency, and improve
database performance. However, higher levels of normalization can lead to more
complex database designs and queries. It is important to strike a balance between
normalization and practicality when designing a database.
ID Name Courses
------------------
1 A c1, c2
2 E c3
3 M c2, c3
In the above table Course is a multi-valued attribute so it is not in 1NF. Below Table is
in 1NF as there is no multi-valued attribute
ID Name Course
------------------
1 A c1
1 A c2
2 E c3
3 M c2
3 M c3
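The conversion shown above, splitting the multi-valued Courses attribute into atomic rows, can also be sketched programmatically. This is a minimal illustration using in-memory tuples; the column values follow the table above.

# Minimal sketch: flattening a multi-valued attribute (Courses) into 1NF rows.
unnormalized = [
    (1, "A", "c1, c2"),
    (2, "E", "c3"),
    (3, "M", "c2, c3"),
]

first_normal_form = [
    (emp_id, name, course.strip())
    for emp_id, name, courses in unnormalized
    for course in courses.split(",")
]

for row in first_normal_form:
    print(row)   # (1, 'A', 'c1'), (1, 'A', 'c2'), (2, 'E', 'c3'), ...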
Third Normal Form (3NF)
Third normal form is used to reduce data duplication and achieve data integrity in DBMS. A relation is
in third normal form when it is in 2NF and no non-prime attribute is transitively dependent on a
candidate key. A transitive functional dependency X -> Z arises when both of the following
dependencies hold:
X -> Y
Y -> Z
For a relational table to be in third normal form, it must satisfy the following rules:
1. The table should be in second normal form.
2. No non-prime attribute should be transitively dependent on a candidate key.
3. For each non-trivial functional dependency X -> Z, at least one of the following conditions holds: X is a super key of the table, or Z is a prime attribute (part of some candidate key).
If a transitive dependency exists, we can divide the table to remove the transitively dependent
attributes and place them in a new table along with a copy of the determinant.
Let us take an example of the following <EmployeeDetail> table to understand what is transitive
dependency and how to normalize the table to the third normal form:
<EmployeeDetail>
The above table contains a transitive dependency because Employee Code -> Employee Zipcode and
Employee Zipcode -> Employee City hold, so Employee City depends on Employee Code only
transitively, through Employee Zipcode.
Also, Employee Zipcode is not a super key and Employee City is not a prime attribute.
To remove transitive dependency from this table and normalize it into the third normal form, we can
decompose it into two tables, <EmployeeDetail> and <EmployeeLocation>, moving the transitively
dependent attribute Employee City together with its determinant Employee Zipcode into
<EmployeeLocation>. These tables are now in third normal form, as they are in 2NF and they don’t have any
transitive dependency.
The 2NF and 3NF impose some extra conditions on dependencies on candidate keys and remove
redundancy caused by that. However, there may still exist some dependencies that cause
redundancy in the database. These redundancies are removed by a more strict normal form known
as BCNF.
Boyce-Codd normal form (BCNF) is stricter and removes more redundancy compared to 3NF.
For a relational table to be in Boyce-Codd normal form, it must satisfy the following rules:
1. The table should be in third normal form.
2. For every non-trivial functional dependency X -> Y, X is a super key of the table. That means X
cannot be a non-prime attribute if Y is a prime attribute.
A superkey is a set of one or more attributes that can uniquely identify a row in a database table.
Let us take an example of the following <EmployeeProjectLead> table to understand how to normalize
<EmployeeProjectLead>
The candidate key of the above table is {Employee Code, Project ID}. For the non-trivial functional
dependency Project Leader -> Project ID, Project ID is a prime attribute but Project Leader is a non-
prime attribute and not a super key, so the table violates BCNF.
To convert the given table into BCNF, we decompose it into three tables:
<EmployeeProject>
Employee Code Project ID
101 P03
101 P01
102 P04
103 P02
<ProjectLead>
Conclusion
Normal forms are a mechanism to remove redundancy and optimize database storage.
Problem:
Let a relation R (A, B, C, D ) and functional dependency {AB –> C, C –> D, D –> A}.
Relation R is decomposed into R1( A, B, C) and R2(C, D). Check whether decomposition is
dependency preserving or not.
R1(A, B, C) and R2(C, D)
First, project the functional dependencies onto R1 by taking closures of combinations of its attributes:
closure(AB) = {A, B, C, D}, restricted to R1
            = {A, B, C}
AB --> C   // Removing A, B from the right side as these are trivial attributes
closure(BC) = {B, C, D, A}, restricted to R1
            = {A, B, C}
BC --> A   // Removing B, C from the right side as these are trivial attributes
closure(C) = {C, D, A}, restricted to R1
           = {A, C}
C --> A    // Removing C from the right side as it is a trivial attribute
closure(AC) = {A, C, D}, restricted to R1
            = {A, C}   // only trivial attributes remain, so no new dependency
So F1 = {AB --> C, BC --> A, C --> A}.
Similarly, for R2: closure(C) = {C, D, A}, restricted to R2 = {C, D}, giving C --> D, so F2 = {C --> D}.
Now compare F1 ∪ F2 with the original dependency set: AB --> C is present in F1 and C --> D is
present in F2, but D --> A cannot be derived from F1 ∪ F2 (the closure of D under F1 ∪ F2 is just {D}).
Hence the given decomposition is not dependency preserving.
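The closure computation used in this problem can be automated. The sketch below assumes the same R, R1, R2 and FD set as above; it projects the dependencies onto each decomposed relation and checks whether every original dependency is still implied. It is an illustration of the procedure, not a production algorithm.

# Sketch: checking dependency preservation for R(A,B,C,D) with
# FDs {AB -> C, C -> D, D -> A} decomposed into R1(A,B,C) and R2(C,D).
from itertools import combinations

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def projected_fds(sub_attrs, fds):
    """FDs X -> (closure(X) ∩ sub_attrs) for each proper subset X of the sub-relation."""
    projected = []
    for size in range(1, len(sub_attrs)):
        for lhs in combinations(sorted(sub_attrs), size):
            rhs = closure(set(lhs), fds) & sub_attrs - set(lhs)
            if rhs:
                projected.append((set(lhs), rhs))
    return projected

fds = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"D"}, {"A"})]
r1, r2 = {"A", "B", "C"}, {"C", "D"}

union = projected_fds(r1, fds) + projected_fds(r2, fds)
for lhs, rhs in fds:
    preserved = rhs <= closure(lhs, union)
    print(sorted(lhs), "->", sorted(rhs), "preserved:", preserved)
# AB -> C and C -> D are preserved, but D -> A is not,
# so the decomposition is not dependency preserving.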
The original relation and the relation reconstructed by joining the decomposed relations must
contain the same set of tuples; if the number of tuples is increased or decreased, then it is a lossy join
decomposition.
Lossless join decomposition ensures that we never get a situation where spurious tuples are
generated in the relation; for every value of the join attributes there will be a unique tuple in
one of the relations.
Lossless join decomposition is a decomposition of a relation R into relations R1, R2 such
that if we perform a natural join of relation R1 and R2, it will return the original relation R.
This is effective in removing redundancy from databases while preserving the original data.
In other words, with lossless decomposition it becomes feasible to reconstruct the relation R
from the decomposed tables R1 and R2 by using joins.
Only 1NF, 2NF, 3NF and BCNF are valid for lossless join decomposition.
In Lossless Decomposition, we select the common attribute and the criteria for selecting a
common attribute is that the common attribute must be a candidate key or super key in
either relation R1, R2, or both.
Decomposition of a relation R into R1 and R2 is a lossless-join decomposition if at least one
of the following functional dependencies is in F+ (the closure of the set of functional dependencies):
R1 ∩ R2 → R1
OR
R1 ∩ R2 → R2
Example:
- Employee (Employee_Id, Ename, Salary, Department_Id, Dname)
- can be decomposed using lossless decomposition as
- Employee_desc (Employee_Id, Ename, Salary, Department_Id)
- Department_desc (Department_Id, Dname)
- Alternatively, the following decomposition would be lossy, as these tables share no common
attribute on which to join, so it is not possible to get back the original data:
- Employee_desc (Employee_Id, Ename, Salary)
- Department_desc (Department_Id, Dname)
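The lossless/lossy distinction above can be demonstrated with a tiny in-memory example. The sketch below uses made-up sample rows: the lossless decomposition is reconstructed exactly by a natural join on Department_Id, while the lossy decomposition has no common attribute, so the "join" degenerates into a cross product with spurious tuples.

# Sketch (hypothetical sample data): lossless vs lossy decomposition of
# Employee(Employee_Id, Ename, Salary, Department_Id, Dname).
employee = [
    (1, "Asha", 50000, "D1", "Sales"),
    (2, "Ravi", 60000, "D2", "HR"),
]

# Lossless decomposition: the common attribute Department_Id is a key of Department_desc.
employee_desc = [(e_id, name, sal, dept) for e_id, name, sal, dept, _ in employee]
department_desc = sorted({(dept, dname) for *_, dept, dname in employee})

# Natural join on Department_Id reconstructs the original relation exactly.
rejoined = [
    (e_id, name, sal, dept, dname)
    for e_id, name, sal, dept in employee_desc
    for d, dname in department_desc
    if d == dept
]
print(sorted(rejoined) == sorted(employee))   # True -> lossless

# Lossy decomposition: no common attribute, so the "join" is a cross product
# and produces spurious tuples that were never in the original relation.
employee_lossy = [(e_id, name, sal) for e_id, name, sal, *_ in employee]
cross = [e + d for e in employee_lossy for d in department_desc]
print(len(cross))   # 4 rows instead of 2 -> spurious tuples, original data lost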
NOTE – revise all the problems regarding this topic which we solved in the lecture.
We must carefully consider the problems associated with NULLs when designing a
relational database schema. There is no fully satisfactory relational design theory as
yet that includes NULL values. One problem occurs when some tuples
have NULL values for attributes that will be used to join individual relations in the
decomposition. To illustrate this, consider the database shown in Figure 16.2(a),
where two relations EMPLOYEE and DEPARTMENT are shown. The last two
employee tuples— ‘Berger’ and ‘Benitez’—represent newly hired employees who
have not yet been assigned to a department (assume that this does not violate any
integrity constraints). Now suppose that we want to retrieve a list of (Ename, Dname)
values for all the employees. If we apply the NATURAL JOIN operation
on EMPLOYEE and DEPARTMENT (Figure 16.2(b)), the two aforementioned tuples
will not appear in the result. The OUTER JOIN operation, discussed in Chapter 6, can
deal with this problem. Recall that if we take the LEFT OUTER
JOIN of EMPLOYEE with DEPARTMENT, tuples in EMPLOYEE that
have NULL for the join attribute will still appear in the result, joined with
an imaginary tuple in DEPARTMENT that has NULLs for all its attribute values.
Figure 16.2(c) shows the result.
A related problem concerns dangling tuples. Suppose EMPLOYEE is decomposed into EMPLOYEE_1
(the employee attributes) and EMPLOYEE_3 (which keeps only the tuples whose join attribute is not
NULL). When we apply a NATURAL JOIN to EMPLOYEE_1 and
EMPLOYEE_3, the tuples for Berger and Benitez will not appear in the result;
these are called dangling tuples in EMPLOYEE_1 because they are represented in
only one of the two relations that represent employees, and hence are lost if we apply
an (INNER) JOIN operation.
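The effect described above can be reproduced with any SQL engine. The sketch below uses Python's built-in sqlite3 module with made-up EMPLOYEE and DEPARTMENT rows (the 'Berger' and 'Benitez' tuples have a NULL department): the inner join drops them, while the LEFT OUTER JOIN keeps them paired with NULLs.

# Sketch (hypothetical data): tuples with NULL join attributes disappear from an
# inner join but survive a LEFT OUTER JOIN.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee (ename TEXT, dnum INTEGER);
    CREATE TABLE department (dnum INTEGER PRIMARY KEY, dname TEXT);
    INSERT INTO department VALUES (1, 'Research'), (2, 'Admin');
    INSERT INTO employee VALUES
        ('Smith', 1), ('Wong', 2),
        ('Berger', NULL), ('Benitez', NULL);   -- newly hired, no department yet
""")

inner = con.execute(
    "SELECT e.ename, d.dname FROM employee e JOIN department d ON e.dnum = d.dnum"
).fetchall()
outer = con.execute(
    "SELECT e.ename, d.dname FROM employee e LEFT OUTER JOIN department d ON e.dnum = d.dnum"
).fetchall()

print(inner)   # Berger and Benitez are missing from the inner join
print(outer)   # Berger and Benitez appear, paired with NULL (None) for dname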
Multivalued Dependency
Multivalued dependency occurs when two attributes in a table are independent of each other
but, both depend on a third attribute.
A multivalued dependency consists of at least two attributes that are dependent on a third
attribute that's why it always requires at least three attributes.
Example: Suppose there is a bike manufacturer company which produces two colors(white and
black) of each model every year.
Here columns COLOR and MANUF_YEAR are dependent on BIKE_MODEL and independent of
each other.
In this case, these two columns are said to be multivalued dependent on BIKE_MODEL. The
representation of these dependencies is shown below:
1. BIKE_MODEL → → MANUF_YEAR
2. BIKE_MODEL → → COLOR
This can be read as "BIKE_MODEL multidetermined MANUF_YEAR" and "BIKE_MODEL
multidetermined COLOR".
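A multivalued dependency such as BIKE_MODEL →→ COLOR can be checked mechanically on a sample of rows: for every bike model, every combination of its colours and manufacturing years must be present. The sketch below uses made-up rows purely for illustration.

# Sketch: verifying BIKE_MODEL ->-> COLOR (and, symmetrically, MANUF_YEAR) on sample data.
# The MVD holds if, per model, the stored rows equal the full colour x year combination.
from itertools import product

rows = [
    ("M1000", "Black", 2019), ("M1000", "White", 2019),
    ("M1000", "Black", 2020), ("M1000", "White", 2020),
    ("M2000", "Black", 2021), ("M2000", "White", 2021),
]

def mvd_holds(rows):
    by_model = {}
    for model, colour, year in rows:
        entry = by_model.setdefault(model, (set(), set()))
        entry[0].add(colour)
        entry[1].add(year)
    for model, (colours, years) in by_model.items():
        expected = {(model, c, y) for c, y in product(colours, years)}
        if expected != {r for r in rows if r[0] == model}:
            return False
    return True

print(mvd_holds(rows))                              # True
print(mvd_holds(rows + [("M2000", "Red", 2021)]))   # True: Red made only in 2021, all combinations present
print(mvd_holds(rows[:3] + rows[4:]))               # False: a colour/year combination is missing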
Overview
The query optimizer (also known as the optimizer) is database software that identifies the most efficient
way for a SQL statement to access the requested data.
The process of selecting an efficient execution plan for processing a query is known as query
optimization.
After query parsing, the parsed query is delivered to the query optimizer. For a given query, the
optimizer works out the different ways in which the query can run, generates various execution plans
for the parsed query, and selects the plan with the lowest estimated cost. The catalog manager assists
the optimizer in selecting the optimum plan to perform the query by generating the cost of each plan.
Query optimization is used to access and modify the database in the most efficient way possible. It is the
art of obtaining necessary information in a predictable, reliable, and timely manner. Query
optimization is formally described as the process of transforming a query into an equivalent form that
may be evaluated more efficiently. The goal of query optimization is to find an execution plan that
reduces the time required to process a query. We must complete two major tasks to attain this
optimization target.
The first is to determine the optimal plan to access the database, and the second is to reduce the time
required to execute the chosen plan.
The optimizer tries to come up with the best execution plan possible for a SQL statement.
Among all the candidate plans reviewed, the optimizer chooses the plan with the lowest cost. The
optimizer computes costs based on available facts. The cost computation takes into account query
execution factors such as I/O, CPU, and communication for a certain query in a given context.
For example, consider a query that asks for all students who hold a position such as being a class
representative. If the optimizer statistics show that 50% of students are in positions of
leadership, the optimizer may decide that a full table scan is the most efficient. However, if data show
that just a small number of students are in positions of leadership, reading an index followed by access
to the matching table rows may be more efficient than a full table scan.
Because the database has so many internal statistics and tools at its disposal, the optimizer is frequently
in a better position than the user to decide the best way to execute a statement. As a result, the database
relies on the optimizer to choose the execution plan for SQL statements.
Optimizer Components
The optimizer is made up of three parts: the query transformer, the estimator, and the plan generator.
Query Transformer
The query transformer determines whether it is advantageous to rewrite the original SQL
statement into a semantically equivalent SQL statement at a lower cost for some statements.
When a plausible alternative exists, the database compares the costs of each alternative and chooses the
one with the lowest cost. The query transformer shown in the query below can be taken as an example of
how query optimization is done by transforming an OR-based input query into a UNION ALL-based
output query.
SELECT *
FROM sales
WHERE promo_id=12
OR prod_id=125;
SELECT *
FROM sales
WHERE prod_id=125
UNION ALL
SELECT *
FROM sales
WHERE promo_id=12;
Estimator
The estimator is the optimizer component that calculates the total cost of a given execution plan.
Selectivity: The query picks a percentage of the rows in a row set, with 0 indicating no
rows and 1 indicating all rows. The row set can be a base table, a view, or the result of a join.
Selectivity is determined by a query predicate, such as WHERE last_name LIKE 'X%', or by a mix
of predicates. As the selectivity value approaches zero, a predicate becomes more selective, and as
the value nears one, it becomes less selective.
Cardinality: The cardinality of an execution plan is the number of rows returned by each
action. This input is shared by all cost functions and is essential for determining the best
strategy. Cardinality in DBMS can be calculated using DBMS STATS table statistics or after
taking into account the impact of predicates (filter, join, and so on), DISTINCT or GROUP
BY operations, and so on. In an execution plan, the Rows column displays the estimated
cardinality.
For example, if the optimizer estimates that a full table scan will yield 100 rows, then the
cardinality estimate for this operation is 100, and this estimate appears in the Rows column of the
execution plan.
Cost: This metric represents the number of units of work or resources used. The query
optimizer uses disk I/O, CPU utilization, and memory usage as units of work. For example,
if the plan for query A has a lower cost than the plan for query B, then the following
outcomes are possible: A executes faster than B, A executes slower than B, or A executes in
about the same amount of time as B. (A small numeric sketch of these estimates follows after this list.)
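A toy version of the estimator's arithmetic, with made-up statistics and a hypothetical cost model, shows how the three measures relate: selectivity is a fraction, cardinality is selectivity times the row count, and cost combines units of work.

# Toy estimator sketch with made-up statistics; real optimizers use far richer models.
table_rows = 100_000          # rows in the table (from optimizer statistics)
distinct_values = 1_000       # distinct values in the filtered column

# Selectivity of an equality predicate is often estimated as 1 / distinct values.
selectivity = 1 / distinct_values            # 0.001
cardinality = selectivity * table_rows       # estimated rows returned = 100

# Hypothetical cost model: a full scan reads every block; an index scan reads a few
# index blocks plus roughly one table block per matching row.
blocks_in_table = 2_000
full_scan_cost = blocks_in_table             # 2000 units
index_scan_cost = 3 + cardinality            # 103 units

print(f"selectivity={selectivity}, estimated cardinality={cardinality:.0f}")
print(f"full scan cost={full_scan_cost}, index scan cost={index_scan_cost:.0f}")
# The optimizer would pick the index plan here; with a much less selective
# predicate the full table scan would win instead.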
Plan Generator
The plan generator investigates multiple plans for a query block by experimenting with various
access paths, join methods, and join orders.
Because of the different combinations that the database can utilize to generate the same
outcome, many plans are available. The plan with the lowest cost is chosen by the
optimizer.
Methods of Query Optimization in DBMS
There are two methods of query optimization. They are as follows.
Cost-Based Optimization
Query optimization is the process of selecting the most efficient way to execute a SQL statement.
Because SQL is a nonprocedural language, the optimizer can merge, restructure, and process data in
any sequence.
The Optimizer allocates a cost in numerical form for each step of a feasible plan for a given query
and environment, and then combines these values to get a cost estimate for the plan or
possible strategy. The Optimizer aims to find the plan with the lowest cost estimate after evaluating
the costs of all feasible plans. As a result, the Optimizer is sometimes known as the Cost-Based
Optimizer.
Execution Plans:
The plan describes the steps taken by Oracle Database to execute a SQL statement. Each step
physically retrieves or prepares rows of data from the database for the statement's user.
An execution plan shows the total cost of the plan, which is stated on line 0, as well as the cost of
each individual operation. A cost is an internal unit that appears solely in the execution plan to allow
for plan comparisons. As a result, the cost value cannot be fine-tuned or adjusted.
Query Blocks
The optimizer receives a parsed representation of a SQL statement as input.
Each SELECT block in the original SQL statement is internally represented by a query
block. A query block can be a statement at the top level, a subquery, or an unmerged view.
Let’s take an example where the SQL statement that follows is made up of two query
blocks. The inner query block is the subquery in parentheses. The outer query block, which is the
remainder of the SQL statement, obtains the names of employees in the departments whose
IDs were supplied by the subquery. The query form specifies how query blocks are
connected.
SELECT first_name, last_name
FROM hr.employees
WHERE department_id
IN (SELECT department_id
    FROM hr.departments
    WHERE location_id = 1800);
From the bottom up, the database optimizes query blocks separately. As a result, the database
optimizes the innermost query block first, generating a sub-plan for it, before generating the outer
query block, which represents the entire query. The number of possible plans for a query block
grows with the number of objects in the FROM clause; as the number of objects rises, this number
climbs exponentially. The possibilities for a join of five tables, for example, are far higher than
those for a join of two tables.
A biker wishes to find the most efficient bicycle path from point A to point B. A query is analogous
to the phrase "I need the quickest route from point A to point B" or "I need the quickest route from
point A to point B via point C". To choose the most efficient route, the trip advisor employs an
internal algorithm that takes into account factors such as speed and difficulty. The biker can sway
the trip advisor's judgment by saying things like "I want to arrive as quickly as possible" or "I want
the most scenic route possible".
In this example, an execution plan is a possible path generated by the travel advisor. Internally, the
advisor may divide the overall route into multiple subroutes (sub plans) and compute the efficiency
of each subroute separately. For example, the trip advisor may estimate one subroute to
take 15 minutes and be of medium difficulty, another subroute to take 22 minutes and be of low
difficulty, and so on.
Based on the user-specified goals and accessible facts about roads and traffic conditions, the advisor
selects the most efficient (lowest cost) overall route. The better the guidance, the more accurate the
statistics. For example, if the advisor is not kept up to date on traffic delays, road closures, and poor
road conditions, the proposed route may prove inefficient (high cost).
Heuristic Optimization
Cost-based optimization is expensive. Heuristics are used to reduce the number of choices that must
be made in a cost-based approach.
Rules
Heuristic optimization transforms the expression-tree by using a set of rules which improve the
performance. These rules are as follows:
Perform the SELECTION operations as early as possible in the query. This should be the first action
for any SQL table. By doing so, we can decrease the number of records carried through the query,
rather than passing all the rows of the tables through the query.
Perform all the projections as early as possible in the query. Much like a selection,
this method helps in decreasing the number of columns in the query.
Perform the most restrictive joins and selection operations. What this means is that select
only those sets of tables and/or views which will result in a relatively lesser number of records
and are extremely necessary in the query. Obviously any query will execute better when tables
with few records are joined.
Some systems use only heuristics and the others combine heuristics with partial cost-based
optimization.
As an example, consider a query tree that combines the PROJECT, DEPARTMENT and EMPLOYEE
tables with two cross product operations before any selection is applied. Cross products require lots of
space and time (nested loops) to build. After the two cross products, we have a temporary table with
144 records (6 projects * 3 departments * 8 employees).
An overall rule for heuristic query optimization is to perform as many select and project
operations as possible before doing any joins.
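The benefit of pushing selections below joins can be seen by counting intermediate tuples. The sketch below uses made-up EMPLOYEE, DEPARTMENT and PROJECT rows (8, 3 and 6 rows, matching the counts above): filtering first shrinks the data before it is combined, instead of building the 144-row cross product and filtering afterwards.

# Sketch (made-up rows): applying a selection before the join keeps intermediate
# results small, which is the core heuristic-optimization rule stated above.
employees = [(i, f"E{i}", i % 3 + 1) for i in range(1, 9)]     # (emp_id, name, dept_id), 8 rows
departments = [(1, "Sales"), (2, "HR"), (3, "IT")]             # (dept_id, dname), 3 rows
projects = [(j, f"P{j}", j % 3 + 1) for j in range(1, 7)]      # (proj_id, pname, dept_id), 6 rows

# Plan 1: cross products first, selection last.
cross = [(e, d, p) for e in employees for d in departments for p in projects]
result1 = [(e, d, p) for e, d, p in cross if e[2] == d[0] and p[2] == d[0] and d[1] == "IT"]
print("intermediate tuples (cross product first):", len(cross))       # 8 * 3 * 6 = 144

# Plan 2: select the IT department first, then join the much smaller row sets.
it_depts = [d for d in departments if d[1] == "IT"]
it_emps = [e for e in employees if any(e[2] == d[0] for d in it_depts)]
it_projs = [p for p in projects if any(p[2] == d[0] for d in it_depts)]
result2 = [(e, d, p) for d in it_depts for e in it_emps if e[2] == d[0]
           for p in it_projs if p[2] == d[0]]
print("intermediate tuples (selection first):",
      len(it_depts) + len(it_emps) + len(it_projs))                   # far fewer than 144
print(sorted(result1) == sorted(result2))                             # True: same answer either way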
There are a number of transformation rules that can be used to transform a query:
1.Cascading selections. A list of conjunctive conditions can be broken up into separate individual
conditions.
2.Commutativity of the selection operation.
3.Cascading projections. All but the last projection can be ignored.
4.Commuting selection and projection. If a selection condition only involves attributes contained in
a projection clause, the two can be commuted.
5.Commutativity of Join and Cross Product.
6.Commuting selection with Join.
7.Commuting projection with Join.
8.Commutativity of set operations. Union and Intersection are commutative.
9.Associativity of Union, Intersection, Join and Cross Product.
10.Commuting selection with set operations.
11.Commuting projection with set operations.
12.Logical transformation of selection conditions. For example, using DeMorgan’s law, etc.
13.Combine Selection and Cartesian product to form Joins.
These transformations can be used in various combinations to optimize queries. Some general
steps follow:
1.Using rule 1, break up conjunctive selection conditions and chain them together.
2.Using the commutativity rules, move the selection operations as far down the tree as possible.
3.Using the associativity rules, rearrange the leaf nodes so that the most restrictive selection
conditions are executed first. For example, an equality condition is likely more restrictive than an
inequality condition (range query).
4.Combine cartesian product operations with associated selection conditions to form a single Join
operation.
5.Using the commutativity of Projection rules, move the projection operations down the tree to
reduce the sizes of intermediate result sets.
6.Finally, identify subtrees that can be executed using a single efficient access method.