Dbms Unit 3 New

UNIT-3

Introduction To Normalization:

Normalization is the process of organizing data in a database. This includes creating tables and
establishing relationships between those tables according to rules designed both to protect the data
and to make the database more flexible by eliminating redundancy and inconsistent dependency.
Redundant data wastes disk space and creates maintenance problems. If data that exists in more
than one place must be changed, the data must be changed in exactly the same way in all locations.
A customer address change is much easier to implement if that data is stored only in the Customers
table and nowhere else in the database.
What is an "inconsistent dependency"? While it is intuitive for a user to look in the Customers table
for the address of a particular customer, it may not make sense to look there for the salary of the
employee who calls on that customer. The employee's salary is related to, or dependent on, the
employee and thus should be moved to the Employees table. Inconsistent dependencies can make
data difficult to access because the path to find the data may be missing or broken.
There are a few rules for database normalization. Each rule is called a "normal form." If the first
rule is observed, the database is said to be in "first normal form." If the first three rules are
observed, the database is considered to be in "third normal form."

What is Functional Dependency in DBMS?


A relational database is a collection of data stored in rows and columns. Columns represent the characteristics of the data, while each row in a table represents a set of related data, and every row in the table has the same structure. A row is sometimes referred to as a tuple in DBMS.

Have a look at the Employee table below. It contains attributes as column values, namely

1. Employee_Id

2. Employee_Name

3. Employee_Department
4. Salary

Employee Table

Employee_Id Employee_Name Employee_Department Salary


1 Ryan Mechanical $5000
2 Justin Biotechnology $5000
3 Andrew Computer Science $8000
4 Felix Human Resource $10000
Now that we are clear with the jargon related to functional dependency, let's discuss what functional

dependency is.

 Functional Dependency in DBMS is, as the name suggests, a relationship between the attributes (characteristics) of a table.

 A relation's functional dependencies always follow a set of inference rules called the RAT rules (Reflexivity, Augmentation, Transitivity), proposed by William Armstrong in 1974.

 It helps in maintaining the quality of data in the database, and the core concepts behind database normalization are based on functional dependencies.

How to Denote a Functional Dependency in DBMS?

A functional dependency is denoted by an arrow “→”. The functional dependency of B on A is represented by A → B: A is the determinant and B is the dependent.
Consider a relation with four attributes A, B, C and D,

R (ABCD)

1. A → BCD

2. B → CD

 For the first functional dependency A → BCD, attributes B, C and D are functionally dependent

on attribute A.

 For the functional dependency B → CD, attributes C and D are functionally dependent upon attribute B.

Everything on the left side of a functional dependency is referred to as the determinant set, while everything on the right side is referred to as the dependent attributes.

A functional dependency can also be represented diagrammatically: the arrow points to the dependent attributes, and the origin of the arrow marks the determinant set.
Types of Functional Dependencies in DBMS

1. Trivial functional dependency

2. Non-Trivial functional dependency

3. Multivalued functional dependency

4. Transitive functional dependency

Trivial Functional Dependency in DBMS

 In Trivial functional dependency, a dependent is always a subset of the determinant. In other

words, a functional dependency is called trivial if the attributes on the right side are the subset of

the attributes on the left side of the functional dependency.

 X → Y is called a trivial functional dependency if Y is the subset of X.

 For example, consider the Employee table below.

Employee_Id Name Age


1 Zayn 24
2 Phobe 34
3 Hikki 26
4 David 29
 Here, { Employee_Id, Name } → { Name } is a Trivial functional dependency, since the

dependent Name is the subset of determinant { Employee_Id, Name }.


 { Employee_Id } → { Employee_Id }, { Name } → { Name } and { Age } → { Age } are also

Trivial.

Non-Trivial Functional Dependency in DBMS

 It is the opposite of Trivial functional dependency. Formally speaking, in a Non-Trivial functional dependency, the dependent is not a subset of the determinant.

 X → Y is called a Non-trivial functional dependency if Y is not a subset of X. So, a functional

dependency X → Y where X is a set of attributes and Y is also a set of the attribute but not a

subset of X, then it is called Non-trivial functional dependency.

 For example, consider the Employee table below.

Employee_Id Name Age


1 Zayn 24
2 Phobe 34
3 Hikki 26
4 David 29
 Here, { Employee_Id } → { Name } is a non-trivial functional dependency

because Name(dependent) is not a subset of Employee_Id(determinant).

 Similarly, { Employee_Id, Name } → { Age } is also a non-trivial functional dependency.


Multivalued Functional Dependency in DBMS

 In Multivalued functional dependency, attributes in the dependent set are not dependent on each

other.

 For example, in X → { Y, Z }, if there exists no functional dependency between Y and Z, then it is called a Multivalued functional dependency.

 For example, consider the Employee table below.

Employee_Id Name Age


1 Zayn 24
2 Phobe 34
3 Hikki 26
4 David 29
5 Phobe 24
 Here, { Employee_Id } → { Name, Age } is a Multivalued functional dependency, since the

dependent attributes Name, Age are not functionally dependent(i.e. Name → Age or Age →

Name doesn’t exist !).

Transitive Functional Dependency in DBMS

 Consider two functional dependencies A → B and B → C then according to the transitivity

axiom A → C must also exist. This is called a transitive functional dependency.

 In other words, dependent is indirectly dependent on determinant in Transitive functional

dependency.
 For example, consider the Employee table below.

Employee_Id Name Department Street Number


1 Zayn CD 11
2 Phobe AB 24
3 Hikki CD 11
4 David PQ 71
5 Phobe LM 21
 Here, { Employee_Id → Department } and { Department → Street Number } holds true. Hence,

according to the axiom of transitivity, { Employee_Id → Street Number } is a valid functional

dependency.

Armstrong’s Axioms/Properties of Functional Dependency in DBMS

William Armstrong in 1974 suggested a few rules related to functional dependency. They are

called RAT rules.

1. Reflexivity: If A is a set of attributes and B is a subset of A, then the functional dependency A →

B holds true.

 For example, { Employee_Id, Name } → Name is valid.

2. Augmentation: If a functional dependency A → B holds true, then appending any attributes to both sides of the dependency doesn't affect the dependency; it remains true.

 For example, X → Y holds true then, ZX → ZY also holds true.


 For example, if { Employee_Id, Name } → { Name } holds true then, { Employee_Id,

Name, Age } → { Name, Age }

3. Transitivity: If two functional dependencies X → Y and Y → Z hold true, then X → Z also

holds true by the rule of Transitivity.

 For example, if { Employee_Id } → { Name } holds true and { Name } →

{ Department } holds true, then { Employee_Id } → { Department } also holds true.
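The three axioms together let us compute the closure of an attribute set, i.e. every attribute it determines. A small sketch of the standard fixpoint algorithm, assuming functional dependencies are given as (lhs, rhs) pairs of attribute sets:

```python
def attribute_closure(attrs, fds):
    """Compute the closure of attrs under the functional dependencies fds.
    fds is a list of (lhs, rhs) pairs of frozensets; reflexivity,
    augmentation and transitivity are all captured by this fixpoint loop."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already determined, absorb the right side.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

# FDs from the running example: Employee_Id -> Name, Name -> Department.
fds = [(frozenset({"Employee_Id"}), frozenset({"Name"})),
       (frozenset({"Name"}), frozenset({"Department"}))]

# Transitivity falls out automatically: Employee_Id determines Department.
print(attribute_closure({"Employee_Id"}, fds))
```

The closure of { Employee_Id } comes out as { Employee_Id, Name, Department }, matching the transitivity example above.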

Advantages of Functional Dependency in DBMS

Let's discuss some of the advantages of Functional dependency,

1. It is used to maintain the quality of data in the database.

2. It expresses the facts about the database design.

3. It helps in clearly defining the meanings and constraints of databases.

4. It helps to identify bad designs.

5. Functional dependencies help remove data redundancy, so that the same values are not repeated at multiple locations in the same database table.

6. The process of Normalization starts with identifying the candidate keys in the relation. Without

functional dependency, it's impossible to find candidate keys and normalize the database.
Conclusion

 Functional dependency defines how the attributes of a relation are related to each other. It helps

in maintaining the quality of data in the database. It is denoted by an arrow “→”.

 A → B means that B is functionally dependent on A. William Armstrong in 1974 suggested a few axioms or rules related to functional dependency. They are
 Rule of Reflexivity

 Rule of Augmentation

 Rule of Transitivity

 There are four types of functional dependency in DBMS - Trivial, Non-Trivial, Multivalued and Transitive functional dependency.

 Functional dependencies have many advantages, keeping the database design clean, defining

the meaning and constraints of the databases, and removing data redundancy are a few of them.

Insertion, Deletion And Update Anomalies


Now that the definition of Functional Dependency is covered, let’s look at the drawbacks of data redundancy and, more concerning, the anomalies that arise with respect to Insertion, Deletion, and Updating of data.

1. Insertion Anomaly
It occurs when inserting a new row forces us to repeat information that is already stored elsewhere in the table. This becomes a bigger problem as the number of entries in the table increases over time.

Example: for the table in Img1, if a new employee must be added to the table, then the corresponding manager’s information must be repeated, leading to the insertion anomaly, which worsens with every new entry in the Employee table.

2. Deletion Anomaly
It causes loss of data within the database due to its removal in some other related data set.

Example: for the table in Img1, if the information of Manager, Mr.X is deleted, then this leads to the
deletion of the information corresponding to the employees associated with Mr.X leading to loss of
employee information for the deleted employee details.

3. Updating Anomaly
In case of an update, it’s very crucial to make sure that the given update happens for all the rows
associated with the change. Even if a single row gets missed out it will lead to inconsistency of data.

Example: for the table in Img1, if the manager Mr.X’s name has to be updated, the update operation
must be applied to all the rows that Mr.X is associated with. Missing out even a single row causes
inconsistency of data within the database
The above-mentioned anomalies occur because inadvertently we are storing two or more pieces of
information in every row of a table. To avoid this, Data Normalization comes to the rescue. Data
Normalization ensures data dependency makes sense.

For the normalization process to happen, it is important to make sure that the data type of each value within an attribute is the same, with no mix-up of data types. For example, an attribute ‘Date-of-Birth’ must contain data only of the ‘date’ data type. Let’s dive into the fundamental types of Normal Forms.

Introduction To Normal Forms:

In database management systems (DBMS),


normal forms are a series of guidelines that help to ensure that the design of a database is
efficient, organized, and free from data anomalies. There are several levels of normalization,
each with its own set of guidelines, known as normal forms.

Here are the important points regarding normal forms in DBMS:

1.First Normal Form (1NF): This is the most basic level of normalization. In 1NF, each
table cell should contain only a single value, and each column should have a unique
name. The first normal form helps to eliminate duplicate data and simplify queries.
2.Second Normal Form (2NF): 2NF eliminates redundant data by requiring that each
non-key attribute be dependent on the primary key. This means that each column should
be directly related to the primary key, and not to other columns.
3.Third Normal Form (3NF): 3NF builds on 2NF by requiring that all non-key attributes
are independent of each other. This means that each column should be directly related to
the primary key, and not to any other columns in the same table.
4.Boyce-Codd Normal Form (BCNF): BCNF is a stricter form of 3NF that ensures that
each determinant in a table is a candidate key. In other words, BCNF ensures that each
non-key attribute is dependent only on the candidate key.
Normal forms help to reduce data redundancy, increase data consistency, and improve
database performance. However, higher levels of normalization can lead to more
complex database designs and queries. It is important to strike a balance between
normalization and practicality when designing a database.

The advantages of using normal forms in DBMS include:


Reduced data redundancy: Normalization helps to eliminate duplicate data in tables,
reducing the amount of storage space needed and improving database efficiency.
Improved data consistency: Normalization ensures that data is stored in a consistent and
organized manner, reducing the risk of data inconsistencies and errors.
Simplified database design: Normalization provides guidelines for organizing tables
and data relationships, making it easier to design and maintain a database.
Improved query performance: Normalized tables are typically easier to search and
retrieve data from, resulting in faster query performance.
Easier database maintenance: Normalization reduces the complexity of a database by
breaking it down into smaller, more manageable tables, making it easier to add, modify,
and delete data.
Overall, using normal forms in DBMS helps to improve data quality, increase database
efficiency, and simplify database design and maintenance.

1. First Normal Form –


If a relation contains a composite or multi-valued attribute, it violates first normal form; conversely, a relation is in first normal form if it does not contain any composite or multi-valued attribute. A relation is in first normal form if every attribute in that relation is a single-valued attribute.
Example 1 – Relation STUDENT in table 1 is not in 1NF because of the multi-valued
attribute STUD_PHONE. Its decomposition into 1NF has been shown in table 2.
Example 2 –

ID Name Courses
------------------
1 A c1, c2
2 E c3
3 M C2, c3
In the above table Course is a multi-valued attribute so it is not in 1NF. Below Table is
in 1NF as there is no multi-valued attribute
ID Name Course
------------------
1 A c1
1 A c2
2 E c3
3 M c2
3 M c3
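The decomposition into 1NF shown above can be produced mechanically by splitting the multi-valued Courses attribute into one row per atomic value. A sketch (column names follow the example; the data is the same three rows):

```python
rows = [
    {"ID": 1, "Name": "A", "Courses": "c1, c2"},
    {"ID": 2, "Name": "E", "Courses": "c3"},
    {"ID": 3, "Name": "M", "Courses": "c2, c3"},
]

# Split the comma-separated Courses value into atomic rows (1NF):
# each output row holds exactly one Course value.
flat = [
    {"ID": r["ID"], "Name": r["Name"], "Course": c.strip()}
    for r in rows
    for c in r["Courses"].split(",")
]

for r in flat:
    print(r["ID"], r["Name"], r["Course"])
```

The five output rows match the 1NF table above: the key of the flattened relation becomes (ID, Course).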

2. Second Normal Form –


To be in second normal form, a relation must be in first normal form and relation must not
contain any partial dependency. A relation is in 2NF if it has No Partial Dependency, i.e., no
non-prime attribute (attributes which are not part of any candidate key) is dependent on any
proper subset of any candidate key of the table.
Partial Dependency – If the proper subset of candidate key determines non-prime attribute,
it is called partial dependency.
OR
When prime attributes -> non prime attribute
table will not be in 2nd normal form.
OR
when a part of candidate key determines any non-prime attribute--- it is called partial
dependency.
And when there is partial dependency, the table/relation is not in 2nd normal form.

NOTE: - subsets of {A,B} are { ∅, {A}, {B}, {A,B} }

proper subsets of {A,B} are { ∅, {A}, {B} }
In 2nd normal form, we take proper subsets of the candidate key.

Example: In a relation R(ABCDE), If AB is a candidate key of any relation then,


prime attributes – {A,B}
non-prime attributes – {C,D,E}

Example 1 – Consider table-3 as following below.


STUD_NO COURSE_NO COURSE_FEE
1 C1 1000
2 C2 1500
1 C4 2000
4 C3 1000
4 C1 1000
2 C5 2000
{Note that there are many courses having the same course fee.}

Here, COURSE_FEE alone cannot decide the value of COURSE_NO or STUD_NO; COURSE_FEE together with STUD_NO cannot decide the value of COURSE_NO; and COURSE_FEE together with COURSE_NO cannot decide the value of STUD_NO. Hence, COURSE_FEE is a non-prime attribute, as it does not belong to the only candidate key {STUD_NO, COURSE_NO}. But COURSE_NO -> COURSE_FEE, i.e., COURSE_FEE is dependent on COURSE_NO, which is a proper subset of the candidate key. A non-prime attribute depending on a proper subset of the candidate key is a partial dependency, so this relation is not in 2NF. To convert the above relation to 2NF, we need to split the table into two tables:

Table 1: (STUD_NO, COURSE_NO)
STUD_NO COURSE_NO
1 C1
2 C2
1 C4
4 C3
4 C1
2 C5

Table 2: (COURSE_NO, COURSE_FEE)
COURSE_NO COURSE_FEE
C1 1000
C2 1500
C3 1000
C4 2000
C5 2000

NOTE: 2NF reduces the redundant data stored in memory. For instance, if there are 100 students taking course C1, we don’t need to store its fee as 1000 in all 100 records; instead, we store it once in the second table as the course fee for C1.
Example 2 – Consider following functional dependencies in relation R (A, B , C, D )
AB -> C [A and B together determine C]
BC -> D [B and C together determine D]
In the above relation, AB is the only candidate key and there is no partial dependency,
i.e., any proper subset of AB doesn’t determine any non-prime attribute.
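A partial dependency can be detected mechanically: for each proper subset of the candidate key, compute its attribute closure (the standard fixpoint algorithm) and see whether it determines any non-prime attribute. A sketch for the STUD_NO/COURSE_NO/COURSE_FEE example; the helper names are illustrative:

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure: repeatedly absorb the right side of any FD
    whose left side is already determined."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def partial_dependencies(candidate_key, non_prime, fds):
    """Return (subset, attr) pairs where a proper subset of the candidate
    key determines a non-prime attribute -- a 2NF violation."""
    found = []
    for size in range(1, len(candidate_key)):       # proper subsets only
        for subset in combinations(sorted(candidate_key), size):
            determined = closure(set(subset), fds) & non_prime
            for attr in sorted(determined):
                found.append((subset, attr))
    return found

fds = [(frozenset({"COURSE_NO"}), frozenset({"COURSE_FEE"}))]
violations = partial_dependencies({"STUD_NO", "COURSE_NO"}, {"COURSE_FEE"}, fds)
print(violations)  # COURSE_NO alone determines COURSE_FEE -> not in 2NF
```

The single violation found is exactly the partial dependency COURSE_NO -> COURSE_FEE discussed in Example 1.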

Third Normal Form (3NF)


The normalization of 2NF relations to 3NF involves the elimination of transitive dependencies in

DBMS.

A functional dependency X -> Z is said to be transitive if the following three functional dependencies hold:

X -> Y

Y -> X does not hold

Y -> Z
For a relational table to be in third normal form, it must satisfy the following rules:

1.The table must be in the second normal form.

2.No non-prime attribute is transitively dependent on the primary key.

3.For each functional dependency X -> Z at least one of the following conditions hold:

X is a super key of the table.

Z is a prime attribute of the table.

If a transitive dependency exists, we can divide the table to remove the transitively dependent attributes and place them in a new table along with a copy of the determinant.

Let us take an example of the following <EmployeeDetail> table to understand what is transitive

dependency and how to normalize the table to the third normal form:

<EmployeeDetail>

Employee Code Employee Name Employee Zipcode Employee City


101 John 110033 Model Town
101 John 110044 Badarpur
102 Ryan 110028 Naraina
103 Stephanie 110064 Hari Nagar
The above table is not in 3NF because it has Employee Code -> Employee City transitive

dependency because:

Employee Code -> Employee Zipcode

Employee Zipcode -> Employee City

Also, Employee Zipcode is not a super key and Employee City is not a prime attribute.
To remove transitive dependency from this table and normalize it into the third normal form, we can

decompose the <EmployeeDetail> table into the following two tables:

<EmployeeDetail>

Employee Code Employee Name Employee Zipcode


101 John 110033
101 John 110044
102 Ryan 110028
103 Stephanie 110064
<EmployeeLocation>

Employee Zipcode Employee City


110033 Model Town
110044 Badarpur
110028 Naraina
110064 Hari Nagar
Thus, we’ve converted the <EmployeeDetail> table into 3NF by decomposing it into

<EmployeeDetail> and <EmployeeLocation> tables as they are in 2NF and they don’t have any

transitive dependency.
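The decomposition can be verified with a quick join: joining the two 3NF tables on the zipcode reproduces every row of the original <EmployeeDetail> table. A sketch using sqlite3 (table and column names are shortened for SQL and are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp_detail (code INT, name TEXT, zipcode TEXT)")
con.execute("CREATE TABLE emp_location (zipcode TEXT PRIMARY KEY, city TEXT)")
con.executemany("INSERT INTO emp_detail VALUES (?,?,?)", [
    (101, "John", "110033"), (101, "John", "110044"),
    (102, "Ryan", "110028"), (103, "Stephanie", "110064"),
])
con.executemany("INSERT INTO emp_location VALUES (?,?)", [
    ("110033", "Model Town"), ("110044", "Badarpur"),
    ("110028", "Naraina"), ("110064", "Hari Nagar"),
])

# Joining on zipcode reconstructs the original table, row for row.
rows = con.execute("""
    SELECT d.code, d.name, d.zipcode, l.city
    FROM emp_detail d JOIN emp_location l USING (zipcode)
    ORDER BY d.code, d.zipcode
""").fetchall()
for r in rows:
    print(r)
```

All four original rows come back, and the city for each zipcode is now stored exactly once.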

The 2NF and 3NF impose some extra conditions on dependencies on candidate keys and remove

redundancy caused by that. However, there may still exist some dependencies that cause

redundancy in the database. These redundancies are removed by a more strict normal form known

as BCNF.

Boyce-Codd Normal Form (BCNF)


Boyce-Codd Normal Form(BCNF) is an advanced version of 3NF as it contains additional constraints

compared to 3NF.
For a relational table to be in Boyce-Codd normal form, it must satisfy the following rules:

1. The table must be in the third normal form.

2. For every non-trivial functional dependency X -> Y, X is the superkey of the table. That means X

cannot be a non-prime attribute if Y is a prime attribute.

A superkey is a set of one or more attributes that can uniquely identify a row in a database table.

Let us take an example of the following <EmployeeProjectLead> table to understand how to normalize

the table to the BCNF:

<EmployeeProjectLead>

Employee Code Project ID Project Leader


101 P03 Grey
101 P01 Christian
102 P04 Hudson
103 P02 Petro
The above table satisfies all the normal forms till 3NF, but it violates the rules of BCNF because the

candidate key of the above table is {Employee Code, Project ID}. For the non-trivial functional

dependency Project Leader -> Project ID, Project ID is a prime attribute but Project Leader is a non-prime attribute. This is not allowed in BCNF.

To convert the given table into BCNF, we decompose it into two tables:

<EmployeeProject>
Employee Code Project ID
101 P03
101 P01
102 P04
103 P02
<ProjectLead>

Project Leader Project ID


Grey P03
Christian P01
Hudson P04
Petro P02
Thus, we’ve converted the <EmployeeProjectLead> table into BCNF by decomposing it into

<EmployeeProject> and <ProjectLead> tables.

Conclusion

 Normal forms are a mechanism to remove redundancy and optimize database storage.

 In 1NF, we check for atomicity of the attributes of a relation.

 In 2NF, we check for partial dependencies in a relation.

 In 3NF, we check for transitive dependencies in a relation.

 In BCNF, we check for the superkeys in LHS of all functional dependencies.


Dependency Preservation In DBMS

Dependency Preservation: A Decomposition D = { R1, R2, R3…Rn } of R is dependency

preserving wrt a set F of Functional dependency if

(F1 ∪ F2 ∪ … ∪ Fm)+ = F+.


Consider a relation R
R ---> F{...with some functional dependency(FD)....}

R is decomposed or divided into R1 with FD { f1 } and R2 with { f2 }, then


there can be three cases:

f1 U f2 = F -----> Decomposition is dependency preserving.


f1 U f2 is a subset of F -----> Not Dependency preserving.
f1 U f2 is a super set of F -----> This case is not possible.

Problem:
Let a relation R (A, B, C, D ) and functional dependency {AB –> C, C –> D, D –> A}.
Relation R is decomposed into R1( A, B, C) and R2(C, D). Check whether decomposition is
dependency preserving or not.
R1(A, B, C) and R2(C, D)

Let us find closure of F1 and F2


(F1 – functional dependencies that belong to relation R1,
F2 – functional dependencies that belong to relation R2)
To find the closure of F1, consider all combinations of
ABC, i.e., find the closure of A, B, C, AB, BC and AC.
Note: ABC is not considered, as its closure is trivially all of ABC.
closure(A) = { A } // Trivial
closure(B) = { B } // Trivial
closure(C) = {C, A, D}, but D can't be in the closure as D is not present in R1.
= {C, A}
C --> A // C itself is dropped from the right side as it is trivial

closure(AB) = {A, B, C, D}
= {A, B, C}
AB --> C // A and B are dropped from the right side as they are trivial

closure(BC) = {B, C, D, A}
= {A, B, C}
BC --> A // B and C are dropped from the right side as they are trivial

closure(AC) = {A, C, D}, but D is not present in R1.
= {A, C}
// AC determines nothing beyond itself in R1, so it contributes no FD

F1 {C--> A, AB --> C, BC --> A}.


Similarly F2 { C--> D }

In the original Relation Dependency { AB --> C , C --> D , D --> A}.


AB --> C is present in F1.
C --> D is present in F2.
D --> A is not preserved.

F1 U F2 is a subset of F. So given decomposition is not dependency preserving.
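The whole procedure can be automated: project F onto each decomposed relation by taking closures of attribute subsets, then check whether every original FD is implied by the union of the projections. A sketch (exponential in relation size, which is fine for small examples like this one):

```python
from itertools import combinations

def closure(attrs, fds):
    """Standard attribute-closure fixpoint."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def project(fds, schema):
    """Project fds onto schema: for every non-empty subset X of schema,
    keep X -> (closure(X) ∩ schema) minus X itself."""
    projected = []
    attrs = sorted(schema)
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            x = set(subset)
            rhs = (closure(x, fds) & schema) - x
            if rhs:
                projected.append((frozenset(x), frozenset(rhs)))
    return projected

F = [(frozenset("AB"), frozenset("C")),
     (frozenset("C"), frozenset("D")),
     (frozenset("D"), frozenset("A"))]
G = project(F, {"A", "B", "C"}) + project(F, {"C", "D"})

# The decomposition preserves F iff every FD in F is implied by G.
for lhs, rhs in F:
    preserved = rhs <= closure(set(lhs), G)
    print(set(lhs), "->", set(rhs), "preserved:", preserved)
```

Running this reproduces the conclusion above: AB --> C and C --> D are preserved, but D --> A is not, so the decomposition is not dependency preserving.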


Lossless Decomposition in DBMS

The original relation and the relation reconstructed by joining the decomposed relations must contain the same number of tuples; if the number is increased or decreased, it is a lossy join decomposition.

Lossless join decomposition ensures that we never get the situation where spurious tuples are generated in a relation: for every value on the join attributes, there will be a unique tuple in one of the relations.
Lossless join decomposition is a decomposition of a relation R into relations R1, R2 such that if we perform a natural join of relations R1 and R2, it will return the original relation R. This is effective in removing redundancy from databases while preserving the original data. In other words, by lossless decomposition it becomes feasible to reconstruct the relation R from the decomposed tables R1 and R2 by using joins.

The decompositions performed for 1NF, 2NF, 3NF and BCNF should always be lossless-join decompositions.

In Lossless Decomposition, we select the common attribute and the criteria for selecting a
common attribute is that the common attribute must be a candidate key or super key in
either relation R1, R2, or both.
Decomposition of a relation R into R1 and R2 is a lossless-join decomposition if at least one of the following functional dependencies is in F+ (the closure of the functional dependencies):

R1 ∩ R2 → R1
OR
R1 ∩ R2 → R2

Example:
— Employee (Employee_Id, Ename, Salary, Department_Id, Dname)
— can be decomposed using lossless decomposition as:
— Employee_desc (Employee_Id, Ename, Salary, Department_Id)
— Department_desc (Department_Id, Dname)
– Alternatively, the following decomposition is lossy: the tables share no common attribute, so joining them is not possible and the original data cannot be recovered.
– Employee_desc (Employee_Id, Ename, Salary)
– Department_desc (Department_Id, Dname)
NOTE – revise all the problems regarding this topic which we solved in the lecture.
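The lossless-join condition can be checked in code: compute the closure of the common attributes and test whether it covers all of R1 or all of R2. A minimal sketch, with FDs as pairs of frozensets (names follow the Employee/Department example):

```python
def closure(attrs, fds):
    """Standard attribute-closure fixpoint."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def lossless(r1, r2, fds):
    """Binary lossless-join test: R1 ∩ R2 must determine R1 or R2."""
    common = r1 & r2
    c = closure(common, fds)
    return r1 <= c or r2 <= c

fds = [(frozenset({"Department_Id"}), frozenset({"Dname"})),
       (frozenset({"Employee_Id"}),
        frozenset({"Ename", "Salary", "Department_Id"}))]

emp = {"Employee_Id", "Ename", "Salary", "Department_Id"}
dept = {"Department_Id", "Dname"}
print(lossless(emp, dept, fds))                      # True: Department_Id -> Dname
print(lossless(emp - {"Department_Id"}, dept, fds))  # False: no common attribute
```

The first decomposition passes because the common attribute Department_Id is a key of Department_desc; the second fails because the tables share nothing to join on.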

Problems with NULL Values and Dangling Tuples

We must carefully consider the problems associated with NULLs when designing a
relational database schema. There is no fully satisfactory relational design theory as
yet that includes NULL values. One problem occurs when some tuples
have NULL values for attributes that will be used to join individual relations in the
decomposition. To illustrate this, consider the database shown in Figure 16.2(a),
where two relations EMPLOYEE and DEPARTMENT are shown. The last two
employee tuples— ‘Berger’ and ‘Benitez’—represent newly hired employees who
have not yet been assigned to a department (assume that this does not violate any
integrity constraints). Now suppose that we want to retrieve a list of (Ename, Dname)
values for all the employees. If we apply the NATURAL JOIN operation
on EMPLOYEE and DEPARTMENT (Figure 16.2(b)), the two aforementioned tuples
will not appear in the result. The OUTER JOIN operation, discussed in Chapter 6, can
deal with this problem. Recall that if we take the LEFT OUTER
JOIN of EMPLOYEE with DEPARTMENT, tuples in EMPLOYEE that
have NULL for the join attribute will still appear in the result, joined with
an imaginary tuple in DEPARTMENT that has NULLs for all its attribute values.
Figure 16.2(c) shows the result.

In general, whenever a relational database schema is designed in which two or more


relations are interrelated via foreign keys, particular care must be devoted to watching for potential NULL values in foreign keys. This can cause unexpected loss of
information in queries that involve joins on that foreign key. Moreover, if NULLs
occur in other attributes, such as Salary, their effect on built-in functions such
as SUM and AVERAGE must be carefully evaluated.

A related problem is that of dangling tuples, which may occur if we carry a


decomposition too far. Suppose that we decompose the EMPLOYEE relation in
Figure 16.2(a) further into EMPLOYEE_1 and EMPLOYEE_2, shown in Figure
16.3(a) and 16.3(b). If we apply the NATURAL JOIN operation
to EMPLOYEE_1 and EMPLOYEE_2, we get the original EMPLOYEE relation.
However, we may use the alternative representation, shown in Figure 16.3(c), where
we do not include a tuple
in EMPLOYEE_3 if the employee has not been assigned a department (instead of
including a tuple with NULL for Dnum as in EMPLOYEE_2). If we
use EMPLOYEE_3 instead of EMPLOYEE_2 and apply a NATURAL
JOIN on EMPLOYEE_1 and

EMPLOYEE_3, the tuples for Berger and Benitez will not appear in the result;
these are called dangling tuples in EMPLOYEE_1 because they are represented in
only one of the two relations that represent employees, and hence are lost if we apply
an (INNER) JOIN operation.

Multivalued Dependency

Multivalued dependency occurs when two attributes in a table are independent of each other
but, both depend on a third attribute.

A multivalued dependency consists of at least two attributes that are dependent on a third
attribute that's why it always requires at least three attributes.
Example: Suppose there is a bike manufacturer company which produces two colors(white and
black) of each model every year.

BIKE_MODEL MANUF_YEAR COLOR


M2001 2008 White

M2001 2008 Black

M3001 2013 White

M3001 2013 Black

M4006 2017 White

M4006 2017 Black

Here columns COLOR and MANUF_YEAR are dependent on BIKE_MODEL and independent of
each other.
In this case, these two columns can be called multivalued dependent on BIKE_MODEL. The
representation of these dependencies is shown below:
1. BIKE_MODEL → → MANUF_YEAR
2. BIKE_MODEL → → COLOR
This can be read as "BIKE_MODEL multidetermined MANUF_YEAR" and "BIKE_MODEL
multidetermined COLOR".
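A multivalued dependency X →→ Y (in a table over attributes X, Y, Z) holds exactly when, for each X value, every combination of its observed Y values and Z values appears as a row. A sketch for the bike table (helper name illustrative):

```python
from collections import defaultdict

def mvd_holds(rows, x, y, z):
    """Check X ->-> Y in a relation over attributes x, y, z:
    for each x value, the observed (y, z) pairs must be the full
    cross product of the observed y values and z values."""
    ys, zs, pairs = defaultdict(set), defaultdict(set), defaultdict(set)
    for r in rows:
        ys[r[x]].add(r[y])
        zs[r[x]].add(r[z])
        pairs[r[x]].add((r[y], r[z]))
    return all(len(pairs[k]) == len(ys[k]) * len(zs[k]) for k in pairs)

bikes = [
    {"model": "M2001", "year": 2008, "color": "White"},
    {"model": "M2001", "year": 2008, "color": "Black"},
    {"model": "M3001", "year": 2013, "color": "White"},
    {"model": "M3001", "year": 2013, "color": "Black"},
]
print(mvd_holds(bikes, "model", "year", "color"))  # True: model ->-> year

# Adding one more year for M2001 without both colors breaks the MVD:
broken = bikes + [{"model": "M2001", "year": 2009, "color": "White"}]
print(mvd_holds(broken, "model", "year", "color"))  # False: (2009, Black) missing
```

By symmetry of the definition, the same check validates BIKE_MODEL →→ COLOR.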

Query Optimization in DBMS

Overview

The query optimizer (also known as the optimizer) is database software that identifies the most efficient way (for example, by reducing time) for a SQL statement to access data.

The process of selecting an efficient execution plan for processing a query is known as query optimization.

After query parsing, which determines the different ways in which a given query can be run, the parsed query is delivered to the query optimizer. The optimizer generates various execution plans for the parsed query and selects the plan with the lowest estimated cost. The catalog manager assists the optimizer in selecting the optimum plan by generating the cost of each plan.

Query optimization is used to access and modify the database in the most efficient way possible. It is the

art of obtaining necessary information in a predictable, reliable, and timely manner. Query
optimization is formally described as the process of transforming a query into an equivalent form that

may be evaluated more efficiently. The goal of query optimization is to find an execution plan that

reduces the time required to process a query. We must complete two major tasks to attain this

optimization target.

The first is to determine the optimal plan to access the database, and the second is to reduce the time

required to execute the query plan.

Purpose of the Query Optimizer in DBMS

The optimizer tries to come up with the best execution plan possible for a SQL statement.
Among all the candidate plans reviewed, the optimizer chooses the plan with the lowest cost. The

optimizer computes costs based on available facts. The cost computation takes into account query

execution factors such as I/O, CPU, and communication for a certain query in a given context.

Sr. No Class Name Role


01 10 Shreya CR
02 10 Ritik
For example, there is a query that requests information about students who are in leadership roles, such

as being a class representative. If the optimizer statistics show that 50% of students are in positions of

leadership, the optimizer may decide that a full table search is the most efficient. However, if data show

that just a small number of students are in positions of leadership, reading an index followed by table

access by row id may be more efficient than a full table scan.

Because the database has so many internal statistics and tools at its disposal, the optimizer is frequently

in a better position than the user to decide the best way to execute a statement. As a result, the optimizer

is used by all SQL statements.

Optimizer Components

The optimizer is made up of three parts: the transformer, the estimator, and the plan generator. The figure

below depicts those components.


Query

Transformer The query transformer determines whether it is advantageous to rewrite the original SQL

statement into a semantically equivalent SQL statement at a lower cost for some statements.

When a plausible alternative exists, the database compares the costs of each alternative and chooses the

one with the lowest cost. The query transformer shown in the query below can be taken as an example of

how query optimization is done by transforming an OR-based input query into a UNION ALL-based

output query.

SELECT *

FROM sales
WHERE promo_id=12

OR prod_id=125;

The query transformer rewrites the given query as follows:

SELECT *
FROM sales
WHERE prod_id=125
UNION ALL
SELECT *
FROM sales
WHERE promo_id=12
AND LNNVL(prod_id=125); /* LNNVL provides a concise way to evaluate a condition when one or both operands of the condition may be null. */
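The equivalence of the OR query and its UNION ALL rewrite can be checked directly. SQLite (used here because it ships with Python) has no LNNVL, so the sketch below emulates LNNVL(prod_id=125) with an explicit null-safe negation; the table and its rows are invented for the demonstration.

```python
# Verify that the OR query and its UNION ALL rewrite return the same rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (promo_id INTEGER, prod_id INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(12, 125), (12, 99), (7, 125), (7, 99), (12, None), (7, None)])

or_rows = conn.execute(
    "SELECT * FROM sales WHERE promo_id=12 OR prod_id=125").fetchall()

union_rows = conn.execute("""
    SELECT * FROM sales WHERE prod_id=125
    UNION ALL
    SELECT * FROM sales WHERE promo_id=12
      AND (prod_id IS NULL OR prod_id <> 125)  -- emulates LNNVL(prod_id=125)
""").fetchall()

# Same multiset of rows, so the rewrite is semantically equivalent:
print(sorted(or_rows, key=repr) == sorted(union_rows, key=repr))  # True
```

Without the LNNVL-style guard, rows matching both predicates would appear twice in the UNION ALL branch, which is exactly the duplication the guard prevents.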

Estimator

The estimator is the optimizer component that calculates the total cost of a given execution plan.

To determine the cost, the estimator employs three different methods:

Selectivity: The fraction of rows in the row set that the query picks, with 0 indicating no rows and 1 indicating all rows. Selectivity is determined by a query predicate, such as WHERE last_name LIKE 'X%', or by a mix of predicates. As the selectivity value approaches zero, a predicate becomes more selective, and as the value nears one, it becomes less selective (or more unselective).

For example, the row set can be a base table, a view, or the result of a join. The selectivity is tied to a query predicate, such as last_name = 'Prakash', or a combination of predicates, such as last_name = 'Prakash' AND job_id = 'SDE'.

Cardinality: The cardinality of an execution plan is the number of rows returned by each

action. This input is shared by all cost functions and is essential for determining the best

strategy. Cardinality in DBMS can be calculated using DBMS STATS table statistics or after

taking into account the impact of predicates (filter, join, and so on), DISTINCT or GROUP

BY operations, and so on. In an execution plan, the Rows column displays the estimated

cardinality.

For example, if the optimizer estimates that a full table scan will yield 100 rows, then the

cardinality estimate for this operation is 100. The cardinality estimate appears in the execution

plan's Rows column.

Cost: This metric represents the number of units of work or resources used. The query optimizer uses disk I/O, CPU utilization, and memory usage as units of work. Note that cost is only an estimate: if the plan for query A has a lower cost than the plan for query B, then any of the following outcomes is still possible: A executes faster than B, A executes slower than B, or A executes in the same amount of time as B.
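The three estimator measures fit together: cardinality is selectivity applied to the input row count, and cost is then derived from cardinality. The numbers and the cost formula below are illustrative assumptions, not a real estimator's arithmetic.

```python
# Minimal sketch of how selectivity, cardinality, and cost relate.
# The cost formula and constants are invented for illustration.

def estimate(total_rows, selectivity, cost_per_row=0.01, fixed_io=10):
    cardinality = round(total_rows * selectivity)  # expected rows returned
    cost = fixed_io + cardinality * cost_per_row   # work units for this step
    return cardinality, cost

# A predicate matching 1% of a 10,000-row table:
card, cost = estimate(total_rows=10_000, selectivity=0.01)
print(card, cost)  # cardinality 100, as in the full-table-scan example above
```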


Plan Generator

The plan generator investigates multiple plans for a query block by experimenting with various

access paths, join methods, and join orders.

Because of the different combinations that the database can utilize to generate the same
outcome, many plans are available. The plan with the lowest cost is chosen by the
optimizer.
Methods of Query Optimization in DBMS
There are two methods of query optimization. They are as follows.

Cost-Based Query Optimization in DBMS

Query optimization is the process of selecting the most efficient way to execute a SQL statement.

Because SQL is a nonprocedural language, the optimizer can merge, restructure, and process data in

any sequence.

The Optimizer assigns a numerical cost to each step of every feasible plan for a given query and environment, and then adds these values together to obtain a cost estimate for the plan or possible strategy. After evaluating the costs of all feasible plans, the Optimizer aims to find the plan with the lowest cost estimate. As a result, the Optimizer is sometimes known as the Cost-Based Optimizer.

Execution Plans:

An execution plan specifies the best way to execute a SQL statement.

The plan describes the steps taken by Oracle Database to execute a SQL statement. Each step

physically retrieves or prepares rows of data from the database for the statement's user.

An execution plan shows the total cost of the plan, which is stated on line 0, as well as the cost of

each individual operation. A cost is an internal unit that appears solely in the execution plan to allow

for plan comparisons. As a result, the cost value cannot be fine-tuned or adjusted.
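The section above describes Oracle execution plans; as a hands-on stand-in, SQLite (bundled with Python) exposes its chosen plan through EXPLAIN QUERY PLAN. The schema and index below are invented for the demonstration; the exact wording of the plan lines varies between SQLite versions.

```python
# Inspect the execution plan SQLite chooses for an indexed equality lookup.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, role TEXT)")
conn.execute("CREATE INDEX idx_role ON students(role)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM students WHERE role = 'CR'").fetchall()
for row in plan:
    print(row[-1])  # the detail column, e.g. a line naming the index used
```

The equality predicate on the indexed role column should make the plan mention idx_role rather than a full scan of students.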
Query Blocks

The optimizer receives a parsed representation of a SQL statement as input.

Each SELECT block in the original SQL statement is internally represented by a query

block. A query block can be a statement at the top level, a subquery, or an unmerged view.

Let’s take an example where the SQL statement that follows is made up of two query blocks. The inner query block is the subquery in parentheses. The outer query block, which is the remainder of the SQL statement, obtains the names of employees in the departments whose IDs were supplied by the subquery. The query form specifies how query blocks are connected.

SELECT first_name, last_name

FROM hr.employees

WHERE department_id

IN (SELECT department_id

FROM hr.departments

WHERE location_id = 1800);
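One transformation the optimizer can apply here is merging the two query blocks: the IN subquery is semantically equivalent to a join against departments. The sketch below checks that equivalence in SQLite; the schema and rows are small invented stand-ins for Oracle's hr sample schema.

```python
# Show that the two-block IN query and its join rewrite agree.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (department_id INTEGER PRIMARY KEY, location_id INTEGER);
CREATE TABLE employees (first_name TEXT, last_name TEXT, department_id INTEGER);
INSERT INTO departments VALUES (10, 1800), (20, 1900);
INSERT INTO employees VALUES ('Asha','Rao',10), ('Ravi','Shah',20), ('Meena','Iyer',10);
""")

subquery = conn.execute("""
SELECT first_name, last_name FROM employees
WHERE department_id IN (SELECT department_id FROM departments
                        WHERE location_id = 1800)
""").fetchall()

joined = conn.execute("""
SELECT e.first_name, e.last_name
FROM employees e JOIN departments d ON e.department_id = d.department_id
WHERE d.location_id = 1800
""").fetchall()

print(sorted(subquery) == sorted(joined))  # True
```

(The join form is only equivalent here because department_id is the departments primary key, so the join cannot duplicate employee rows.)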

Query Sub Plans

The optimizer creates a query sub-plan for each query block.

From the bottom up, the database optimizes query blocks separately. As a result, the database optimizes the innermost query block first, generating a sub-plan for it, before generating the plan for the outer query block, which represents the full query.


The number of query block plans is proportional to the number of objects in the FROM clause. As the number of objects rises, this number climbs exponentially. The possibilities for a join of five tables, for example, are far greater than those for a join of two tables.

Analogy for the Optimizer

An online trip counselor is one analogy for the optimizer.

A biker wishes to find the most efficient bicycle path from point A to point B. A query is analogous

to the phrase "I need the quickest route from point A to point B" or "I need the quickest route from

point A to point B via point C". To choose the most efficient route, the trip advisor employs an

internal algorithm that takes into account factors such as speed and difficulty. The biker can sway

the trip advisor's judgment by saying things like "I want to arrive as quickly as possible" or "I want

the simplest route possible."

In this example, an execution plan is a possible path generated by the travel advisor. Internally, the

advisor may divide the overall route into multiple subroutes (sub plans) and compute the efficiency

of each subroute separately. For example, the trip advisor may estimate one subroute to

take 15 minutes and be of medium difficulty, another subroute to take 22 minutes and be of low

difficulty, and so on.

Based on the user-specified goals and accessible facts about roads and traffic conditions, the advisor

selects the most efficient (lowest cost) overall route. The better the guidance, the more accurate the
statistics. For example, if the advisor is not kept up to date on traffic delays, road closures, and poor

road conditions, the proposed route may prove inefficient (high cost).

Heuristic Optimization

Cost-based optimization is expensive. Heuristics are used to reduce the number of choices that must
be made in a cost-based approach.

Rules
Heuristic optimization transforms the expression-tree by using a set of rules which improve the
performance. These rules are as follows −

Perform the SELECTION operations first in the query. This should be the first action on any SQL table; by doing so, we decrease the number of records the rest of the query must process, rather than carrying all the rows of every table through the query.
Perform all the projections as early as possible in the query. Somewhat like selection, this method helps in decreasing the number of columns in the query.
Perform the most restrictive joins and selection operations first. This means selecting only those tables and/or views that result in a relatively small number of records and are strictly necessary to the query. Obviously, any query will execute better when tables with few records are joined.
Some systems use only heuristics and the others combine heuristics with partial cost-based
optimization.

Steps in heuristic optimization


Let’s see the steps involved in heuristic optimization, which are explained below −

Deconstruct the conjunctive selections into a sequence of single selection operations.
Move the selection operations down the query tree for the earliest possible execution.
Execute first those selection and join operations which will produce the smallest relations.
Replace the Cartesian product operation followed by a selection operation with a join operation.
Deconstruct and move the projection operations down the tree as far as possible, creating new projections where needed.
Identify those subtrees whose operations can be pipelined.
Which of the following query trees is more efficient?

1. (Query tree figure omitted; this tree is evaluated in steps.)

2. (Query tree figure omitted; this tree is evaluated in steps.)

Note the two cross product operations. These require lots of space and time (nested loops) to build.
After the two cross products, we have a temporary table with 144 records (6 projects * 3
departments * 8 employees).
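The 144-record figure comes straight from the cross-product sizes, which a quick check confirms (the three relations are stand-ins with the stated cardinalities):

```python
# Cross product size: 6 projects x 3 departments x 8 employees = 144 rows.
from itertools import product

projects = range(6)
departments = range(3)
employees = range(8)

cross = list(product(projects, departments, employees))
print(len(cross))  # 144
```

This is why the heuristic pushes selections and projections below the cross products: every row of this intermediate table must otherwise be materialized and scanned.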
An overall rule for heuristic query optimization is to perform as many select and project
operations as possible before doing any joins.
There are a number of transformation rules that can be used to transform a query:
1.Cascading selections. A list of conjunctive conditions can be broken up into separate individual
conditions.
2.Commutativity of the selection operation.
3.Cascading projections. All but the last projection can be ignored.
4.Commuting selection and projection. If a selection condition only involves attributes contained in
a projection clause, the two can be commuted.
5.Commutativity of Join and Cross Product.
6.Commuting selection with Join.
7.Commuting projection with Join.
8.Commutativity of set operations. Union and Intersection are commutative.
9.Associativity of Union, Intersection, Join and Cross Product.
10.Commuting selection with set operations.
11.Commuting projection with set operations.
12.Logical transformation of selection conditions. For example, using DeMorgan’s law, etc.
13.Combine Selection and Cartesian product to form Joins.
These transformations can be used in various combinations to optimize queries. Some general
steps follow:
1.Using rule 1, break up conjunctive selection conditions and chain them together.
2.Using the commutativity rules, move the selection operations as far down the tree as possible.
3.Using the associativity rules, rearrange the leaf nodes so that the most restrictive selection
conditions are executed first. For example, an equality condition is likely more restrictive than an
inequality condition (range query).
4.Combine cartesian product operations with associated selection conditions to form a single Join
operation.
5.Using the commutativity of Projection rules, move the projection operations down the tree to
reduce the sizes of intermediate result sets.
6.Finally, identify subtrees that can be executed using a single efficient access method.

Example of Heuristic Query Optimization


The transformation proceeds through the following stages (the accompanying query tree figures are not reproduced here):

1. Original query tree.
2. Use Rule 1 to break up cascading selections.
3. Commute selection with cross product.
4. Combine cross product and selection to form joins.
