Normalization
Definition
Database normalization is the process of organizing the
fields and tables of a relational database to minimize
redundancy and dependency.
Normalization usually involves dividing large tables into
smaller (and less redundant) tables and defining relationships
between them.
The objective is to isolate data so that additions, deletions,
and modifications of a field can be made in just one table and
then propagated through the rest of the database via the
defined relationships.
Levels of redundancy
Level          Handled by
File level     OS
Tuple level    Primary key / candidate key
Column level   Normalization
How can you define normalization?
It is used to eliminate redundancy from the database tables.
Normalization is not meant for:
Eliminating file-level redundancy (two copies of the same file, e.g. F1.txt and F1.txt).
Eliminating tuple-level redundancy (the primary key takes care of this):
SID  SNAME  CID  CNAME
S1   A      C1   C++
S1   A      C1   C++
Normalization is meant for:
Removing attribute-level redundancy.
Attribute-level redundancy
SID  SNAME  CID  CNAME  FNAME  FID
S1   A      C1   C++    X      F1
S1   A      C2   Java   Y      F2
S2   B      C2   Java   Y      F2
S3   C      C1   C++    X      F1
S3   C      C2   Java   Y      F2
S4   D      C2   Java   Y      F2
S5   E      C2   Java   Y      F2
S6   F      C1   C++    X      F1
Suppose the faculty teaching Java is changed: then every row for Java has to be updated.
Solution
Decompose the table into more than one table.
SID  SNAME  CID
S1   A      C1
S1   A      C2
S2   B      C2
S3   C      C1
S3   C      C2
S4   D      C2
S5   E      C2
S6   F      C1
CID  CNAME  FNAME  FID
C1   C++    X      F1
C2   Java   Y      F2
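The effect of this decomposition can be sketched with Python's built-in sqlite3 module. The table names enrollment and course, and the replacement faculty values 'Z' and 'F3', are assumptions made for illustration; they do not come from the slides.

```python
import sqlite3

# In-memory database holding the two decomposed tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollment (sid TEXT, sname TEXT, cid TEXT);
CREATE TABLE course (cid TEXT PRIMARY KEY, cname TEXT, fname TEXT, fid TEXT);
INSERT INTO enrollment VALUES
  ('S1','A','C1'),('S1','A','C2'),('S2','B','C2'),('S3','C','C1'),
  ('S3','C','C2'),('S4','D','C2'),('S5','E','C2'),('S6','F','C1');
INSERT INTO course VALUES ('C1','C++','X','F1'),('C2','Java','Y','F2');
""")

# The faculty teaching Java changes. 'Z' and 'F3' are hypothetical values.
cur = conn.execute("UPDATE course SET fname = 'Z', fid = 'F3' WHERE cid = 'C2'")
rows_touched = cur.rowcount  # one row, instead of one row per Java enrollment

# Every Java enrollment now sees the same faculty through the join.
java_faculty = conn.execute("""
    SELECT DISTINCT c.fname
    FROM enrollment e JOIN course c ON e.cid = c.cid
    WHERE c.cname = 'Java'
""").fetchall()
print(rows_touched, java_faculty)
```

In the single-table design the same change would have to touch all five Java rows.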
What is an anomaly?
Database anomalies are problems in relations that
occur due to redundancy in the relations.
These anomalies affect the process of inserting,
deleting and modifying data in the relations.
Important data may be lost when a relation that
suffers from these anomalies is updated.
It is important to remove these anomalies so that
the relations can be processed without any problem.
Types of Anomalies.
Redundancy
Repeat info unnecessarily in several tuples/columns
Update anomalies:
Change info in one tuple but not in another.
Deletion anomalies:
Delete some values & lose other values too.
Insert anomalies:
Inserting row means having to insert other, separate info.
Data Redundancy
Data redundancy is the duplication of data: the same data is
stored in more than one location in a database. Duplicated data
can also make it harder to extract the correct data from a file.
It increases storage requirements and decreases performance.
It makes data changes more difficult to maintain.
Update Anomalies
An update anomaly occurs when we have a lot of
redundancy in our data. Due to redundancy, data
updating becomes cumbersome.
If we have to update one attribute value, which is
occurring a number of times, we have to search for
every occurrence of that value and then change it.
Stu_No  Stu_Name  Address     Course_ID
1001    Amit      Jalandhar   Cap302
1002    Vikash    Chandigarh  Cap301
1003    Sumit     Ludhiana    Cap303
1004    Rahul     Jammu       Cap301
1005    Vijay     Chandigarh  Cap303
Course_ID  Course_Name           Instructor
Cap301     Data_base             Mr. Sartaj Singh
Cap302     Operating_System      Mrs. Jasleen
Cap303     Financial_Management  Mrs. Manpreet Kaur
Combined table before the update:
Stu_No  Stu_Name  Address     Course_ID  Course_Name           Instructor
1001    Amit      Jalandhar   Cap302     Operating_System      Mrs. Jasleen
1002    Vikash    Chandigarh  Cap301     Data_base             Mr. Sartaj Singh
1003    Sumit     Ludhiana    Cap303     Financial_Management  Mrs. Manpreet Kaur
1004    Rahul     Jammu       Cap301     Data_base             Mr. Sartaj Singh
1005    Vijay     Chandigarh  Cap303     Financial_Management  Mrs. Manpreet Kaur
Combined table after an incomplete update (only row 1004 was changed):
Stu_No  Stu_Name  Address     Course_ID  Course_Name           Instructor
1001    Amit      Jalandhar   Cap302     Operating_System      Mrs. Jasleen
1002    Vikash    Chandigarh  Cap301     Data_base             Mr. Sartaj Singh
1003    Sumit     Ludhiana    Cap303     Financial_Management  Mrs. Manpreet Kaur
1004    Rahul     Jammu       Cap301     Data_base             Mr. Navdeep Kumar
1005    Vijay     Chandigarh  Cap303     Financial_Management  Mrs. Manpreet Kaur
• An update anomaly exists when one or more instances of duplicated
data are updated, but not all; i.e. information is changed in one tuple
but not in another.
• In STU_DETAIL, changing the name of the instructor of Course_ID
Cap301 requires updating every tuple for that course. If for some
reason not all of those tuples are updated, the database gives two
instructor names for Cap301.
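The inconsistent state described above can be detected mechanically. The following is a minimal sketch in plain Python, using the rows of the partially updated table (Address and Course_Name omitted for brevity); the function name conflicting_instructors is an assumption chosen for illustration.

```python
from collections import defaultdict

# (Stu_No, Stu_Name, Course_ID, Instructor) after the incomplete update.
stu_detail = [
    (1001, "Amit", "Cap302", "Mrs. Jasleen"),
    (1002, "Vikash", "Cap301", "Mr. Sartaj Singh"),
    (1003, "Sumit", "Cap303", "Mrs. Manpreet Kaur"),
    (1004, "Rahul", "Cap301", "Mr. Navdeep Kumar"),
    (1005, "Vijay", "Cap303", "Mrs. Manpreet Kaur"),
]

def conflicting_instructors(rows):
    """Return the courses that are recorded with more than one instructor."""
    seen = defaultdict(set)
    for _stu_no, _name, course_id, instructor in rows:
        seen[course_id].add(instructor)
    return {c: sorted(names) for c, names in seen.items() if len(names) > 1}

conflicts = conflicting_instructors(stu_detail)
print(conflicts)
```

A check like this only reports the symptom; storing the instructor once per course removes the possibility of the conflict.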
Insert Anomalies
An insertion anomaly occurs when we are unable to insert a
tuple into a table. Such a situation can arise when the value of
primary key is not known.
As per the entity integrity rule, the primary key cannot have null
value. Therefore, the value/s corresponding to primary key
attribute/s of the tuple must be assigned before inserting the
tuple.
If these values are unknown, the tuple cannot be inserted into
the table.
Delete Anomalies
In case of a deletion anomaly, the deletion of a tuple
causes problems in the database.
This can happen when we delete a tuple, which contains
an important piece of information, and the tuple being the
last one in the table containing the information.
With the deletion of the tuple the important piece of
information also gets removed from the database.
Which anomaly is this?
Dependency
A dependency refers to relationship amongst attributes.
These attributes may belong to the same relation or different
relations. Dependencies can be of various types viz.,
functional dependencies, transitive dependencies,
multivalued dependencies, join dependencies, etc. We shall
briefly examine some of these dependencies.
Types of Dependency
Functional Dependency
Fully Functional Dependency
Partial Functional Dependency
Transitive Dependency
Multivalued Dependency
Join Dependency
Functional Dependency
Functional Dependency (FD): a functional
dependency represents a semantic association
between attributes. If the value of an attribute A
determines the value of another attribute B, we say
B is functionally dependent on A. This is denoted by
A => B and read as “A determines B”; A is
called the determinant.
Course_ID Course_Name Instructor
Cap301 Data_base Mr.Sartaj Singh
Cap302 Operating_System Mrs. Jasleen
Cap303 Financial_Management Mrs. Manpreet Kaur
Course_Name and Instructor are
functionally dependent on Course_ID.
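Whether a functional dependency is violated by the rows of a table can be tested directly. The sketch below is one possible check, assuming rows are represented as Python dictionaries; the function name holds is hypothetical. Note that passing the check only shows that the current rows do not contradict the dependency, not that the dependency is guaranteed by the design.

```python
def holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key.
        if seen.setdefault(key, val) != val:
            return False
    return True

course = [
    {"Course_ID": "Cap301", "Course_Name": "Data_base", "Instructor": "Mr. Sartaj Singh"},
    {"Course_ID": "Cap302", "Course_Name": "Operating_System", "Instructor": "Mrs. Jasleen"},
    {"Course_ID": "Cap303", "Course_Name": "Financial_Management", "Instructor": "Mrs. Manpreet Kaur"},
]
students = [
    {"Stu_No": 1002, "Stu_Name": "Vikash", "Address": "Chandigarh"},
    {"Stu_No": 1005, "Stu_Name": "Vijay", "Address": "Chandigarh"},
]

# Course_ID determines Course_Name and Instructor.
print(holds(course, ["Course_ID"], ["Course_Name", "Instructor"]))
# Address does not determine Stu_Name: two students share an address.
print(holds(students, ["Address"], ["Stu_Name"]))
```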
Fully Functional Dependency
If A and B are attributes (columns) of a table, B is fully
functionally dependent on A if B is functionally dependent on A
but not on any proper subset of A.
E.g. Staff_ID ----> Domain
For example, “{SCN, age} -> name” is a functional
dependency, but it is not a full functional dependency, because
age can be removed from the left side without affecting the
dependency.
Partial Functional Dependency
If A and B are attributes of a table, B is partially dependent on A
if some attribute can be removed from A and the dependency still
holds.
For example, consider the following functional dependency that
exists in the STUDENT table:
Reg_no, Name -------> Section_No
Section_No is functionally dependent on a proper subset of
(Reg_no, Name), namely Reg_no.
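A partial dependency can be looked for by testing every proper subset of the determinant. The sketch below does this against sample rows; the STUDENT rows and the function names are hypothetical and serve only to illustrate the definition.

```python
from itertools import combinations

def holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def partial_determinants(rows, lhs, rhs):
    """Proper, non-empty subsets of lhs that still determine rhs."""
    found = []
    for size in range(1, len(lhs)):
        for subset in combinations(lhs, size):
            if holds(rows, subset, rhs):
                found.append(subset)
    return found

# Hypothetical STUDENT rows: two students share a name.
student = [
    {"Reg_no": 1, "Name": "Asha", "Section_No": "A"},
    {"Reg_no": 2, "Name": "Ravi", "Section_No": "B"},
    {"Reg_no": 3, "Name": "Asha", "Section_No": "B"},
]

subsets = partial_determinants(student, ("Reg_no", "Name"), ("Section_No",))
print(subsets)
```

A non-empty result means the dependency on (Reg_no, Name) is partial; an empty result means no smaller determinant was found in these rows.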
Transitive Dependency
A transitive dependency is a form of intermediate
dependency. For example, if we have attributes or
groups of attributes A, B and C such that A determines B
and B determines C, i.e.
A => B
B => C
then we say there is a transitive dependency, written
A => B => C.
C is transitively dependent on A through B.
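Transitive dependencies can be followed mechanically by computing the closure of a set of attributes, i.e. everything it determines directly or indirectly. This is a minimal sketch of the standard attribute-closure algorithm; representing each dependency as a (left side, right side) pair of sets is an assumption of the example.

```python
def closure(attrs, fds):
    """Return all attributes determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD when its left side is already covered.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# A => B and B => C, as in the text.
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]

print(closure({"A"}, fds))  # contains C: A determines C through B
print(closure({"B"}, fds))
```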
Multi-valued Dependency
Multi-valued Dependency refers to m:n (many-to-many)
relationships.
We say a multi-valued dependency exists between two data
items when one value of the first data item gives a collection
of values of the second data item, i.e. it multi-determines the
second data item.
For example, imagine a car company that manufactures
many models of car, but always makes both red and blue
colors of each model. If you have a table that contains the
model name, color and year of each car the company
manufactures, there is a multivalued dependency in that
table. If there is a row for a certain model name and year in
blue, there must also be a similar row corresponding to the
red version of that same car.
Join Dependency
The join dependency JD *(R1, R2, R3, ..., Rm) holds in R if and only if
R = join(R1, R2, R3, ..., Rm), where each Ri is a projection of R.
Decomposition
Splitting a relation into two or more sub-relations.
Decomposition can be of two types:
Lossless-join decomposition.
Dependency-preserving decomposition.
Lossy join decomposition
R (A, B, C)
A  B  C
1  1  2
2  1  1
3  2  2
R1 (A, B)
A  B
1  1
2  1
3  2
R2 (B, C)
B  C
1  2
1  1
2  2
R1 JOIN R2
A  B  C
1  1  2
1  1  1
2  1  2
2  1  1
3  2  2
R1 JOIN R2 ⊃ R: the join contains tuples that are not in R, so the decomposition is lossy.
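Lossless join decomposition
R (A, B, C)
A  B  C
1  1  2
2  1  1
3  2  2
R1 (A, B)
A  B
1  1
2  1
3  2
R2 (A, C)
A  C
1  2
2  1
3  2
R1 JOIN R2
A  B  C
1  1  2
2  1  1
3  2  2
R1 JOIN R2 = R: the join reproduces R exactly, so the decomposition is lossless.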
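Both decompositions of R can be reproduced with a small projection and natural-join sketch in plain Python. Representing each tuple as a frozenset of (attribute, value) pairs, and the helper names relation, project and natural_join, are choices made for this example only.

```python
def relation(attrs, tuples):
    """Build a relation as a set of frozensets of (attribute, value) pairs."""
    return {frozenset(zip(attrs, t)) for t in tuples}

def project(rows, attrs):
    """Projection onto attrs, with duplicate elimination."""
    return {frozenset((a, dict(row)[a]) for a in attrs) for row in rows}

def natural_join(left, right):
    """Combine tuples that agree on every attribute they share."""
    joined = set()
    for l in left:
        for r in right:
            l_d, r_d = dict(l), dict(r)
            if all(l_d[a] == r_d[a] for a in l_d.keys() & r_d.keys()):
                joined.add(frozenset({**l_d, **r_d}.items()))
    return joined

R = relation("ABC", [(1, 1, 2), (2, 1, 1), (3, 2, 2)])

# Decomposing on the shared attribute B adds spurious tuples.
lossy = natural_join(project(R, "AB"), project(R, "BC"))
# Decomposing on the shared attribute A reproduces R exactly.
lossless = natural_join(project(R, "AB"), project(R, "AC"))

print(len(R), len(lossy), lossless == R)
```

The lossy join is a strict superset of R: no original tuple disappears, but the extra tuples make it impossible to tell which combinations were really stored.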
Normal Forms
The normal forms (abbrev. NF) of relational database
theory provide criteria for determining a table's degree of
vulnerability to logical inconsistencies and anomalies.
The higher the normal form applicable to a table, the less
vulnerable it is to inconsistencies and anomalies.
Each table has a "highest normal form" (HNF): by
definition, a table always meets the requirements of its HNF
and of all normal forms lower than its HNF.
1NF
A relation is said to be in First Normal Form (1NF) if and only if
every entry of the relation (the intersection of a tuple and a
column) holds at most a single value.
In other words, “a relation is in First Normal Form if and only if
all underlying domains contain atomic (single) values only.”
Example
Suppose a designer wishes to record the names and telephone numbers of customers. He defines a
customer table which looks like this:
Customer
Customer ID  First Name  Surname    Telephone Number
123          Robert      Ingram     555-861-2025
456          Jane        Wright     555-403-1659
789          Maria       Fernandez  555-808-9633
The designer then becomes aware of a requirement to record multiple telephone numbers for
some customers. He reasons that the simplest way of doing this is to allow the "Telephone
Number" field in any given record to contain more than one value:
Customer
Customer ID  First Name  Surname    Telephone Number
123          Robert      Ingram     555-861-2025
456          Jane        Wright     555-403-1659
                                    555-776-4100
789          Maria       Fernandez  555-808-9633
Another example
Dependencies in given table
First Approach: Flattening the table
The first approach known as “flattening the table” removes
repeating groups by filling in the “missing” entries of each
“incomplete row” of the table with copies of their
corresponding non-repeating attributes.
Second Approach: Decomposition of the table
The second approach for normalizing a table requires
that the table be decomposed into two new tables that
will replace the original table.
However, before decomposing the original table it is
necessary to identify an attribute or a set of its attributes
that can be used as table identifiers
Rules for decomposition
One of the two tables contains the table identifier
of the original table and all the non-repeating
attributes.
The other table contains a copy of the table
identifier and all the repeating attributes.
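Applied to the Customer example, these two rules can be sketched as follows. The dictionary keys and the function name decompose are assumptions made for illustration; the rows are those of the customer table with multiple telephone numbers.

```python
customers = [
    {"customer_id": 123, "first_name": "Robert", "surname": "Ingram",
     "phones": ["555-861-2025"]},
    {"customer_id": 456, "first_name": "Jane", "surname": "Wright",
     "phones": ["555-403-1659", "555-776-4100"]},
    {"customer_id": 789, "first_name": "Maria", "surname": "Fernandez",
     "phones": ["555-808-9633"]},
]

def decompose(rows, identifier, repeating):
    """Split rows into (identifier + non-repeating) and (identifier + repeating)."""
    base, detail = [], []
    for row in rows:
        # First table: everything except the repeating attribute.
        base.append({k: v for k, v in row.items() if k != repeating})
        # Second table: one row per value of the repeating attribute.
        for value in row[repeating]:
            detail.append({identifier: row[identifier], repeating: value})
    return base, detail

customer, customer_phone = decompose(customers, "customer_id", "phones")
print(len(customer), len(customer_phone))
```

Every entry of both resulting tables holds a single value, so both are in 1NF.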
Anomalies in 1NF
Insert anomaly:
We cannot insert information about a student until he/she joins
a course.
We cannot insert information about a course until a student
enrolls in that course.
Delete anomaly:
When we delete the last tuple of a particular student,
other information may be deleted along with it.
Update anomaly:
A value repeated in many rows must be changed in all of them; it is
easy to forget some.
2NF
A relation is in Second Normal Form (2NF) if it is in first normal
form and every non-key attribute is fully functionally dependent
on the primary key.
A table is said to be in 2NF if
it is in 1NF and
there are no partial dependencies, i.e. every non-primary-key
attribute of the table is fully functionally
dependent on the primary key.
Rules for converting 1 NF into 2NF
The non-key attributes Name, System_Used and
Hourly_Rate are not fully dependent on the primary key
(Course_Code, Rollno).
Name, System_Used and Hourly_Rate are functionally
dependent on Rollno alone, and Rollno is a proper subset of
the primary key, so full functional dependence does not
hold.
Data Anomalies in 2NF Relations
Insert anomaly:
Suppose we want to set the rate of a system in advance. We cannot
insert it until a student is assigned to that system.
Yet the rate charged for a particular system is clearly
independent of whether or not any student uses that system.
Delete anomaly
Update anomaly:
If several students work on the same type of system and we
want to change its hourly rate, we have to make the change
in every row where it appears.
Third NF (3NF)
A relation R is in Third Normal Form (3NF) if and only if the
following conditions are satisfied simultaneously:
(1) R is already in 2NF
(2) No nonprime attribute is transitively dependent on the key.
Another way of expressing the conditions for Third Normal
Form is as follows:
(1) R is already in 2NF
(2) No nonprime attribute functionally determines any other
nonprime attribute.
These two sets of conditions are equivalent.
Rule for converting a relation from 2NF into 3NF
BCNF (Boyce Codd NF)
BCNF is a stronger form than 3NF.
The definition of BCNF makes no explicit reference to first or
second normal form, nor to the concepts of full and transitive
dependence.
BCNF states that
A relation R is in BCNF if and only if every determinant is a
candidate key.
We will consider three cases :
Case 1: Table is not in 3NF , also not in BCNF
Case 2: Table is in 3NF as well as in BCNF
Case 3: Table is in 3NF, but not in BCNF.
After this , we will discuss how we can convert a table from
3NF to BCNF
Similarities between 3NF and BCNF
Relation student_system_charge: in 2NF, not in 3NF, and not in BCNF
In student_system_charge:
Rollno -> Name, System_used, Hourly_rate
System_used -> Hourly_rate
In this relation, there is a transitive dependency.
So it is not in 3NF.
System_used is a determinant but is not a key.
So the relation is not in BCNF.
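The BCNF test for student_system_charge can be sketched with attribute closure: a determinant violates BCNF when its closure does not cover every attribute of the relation. The function names below are hypothetical, and the attribute names are lower-cased for the example.

```python
def closure(attrs, fds):
    """Return all attributes determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(attributes, fds):
    """Non-trivial FDs whose determinant does not determine every attribute."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and closure(lhs, fds) != set(attributes)
    ]

attributes = {"rollno", "name", "system_used", "hourly_rate"}
fds = [
    (("rollno",), ("name", "system_used", "hourly_rate")),
    (("system_used",), ("hourly_rate",)),
]

violations = bcnf_violations(attributes, fds)
print(violations)
```

Only system_used is reported: rollno determines every attribute and is therefore a key, while system_used is a determinant that is not a key.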
Consider the 3NF relations
Here, in each relation, every determinant is unique (i.e. a key)
in its corresponding relation.
So these relations are also in BCNF.
When a relation has only a single candidate key, it is in BCNF whenever it is in 3NF.
4NF
Fourth Normal Form (4NF) is a stronger normal
form than BCNF: it prevents tables from containing
multi-valued dependencies (MVDs), and hence the data
redundancy they cause.
Normalizing a BCNF table to 4NF involves removing the
MVDs from the table by placing the multi-valued attribute(s)
in a new table along with a copy of the determinant(s).
4th Normal Form (4NF)
Stronger than 3NF and BCNF
A relation R is in Fourth Normal Form (4NF) if and only
if the following conditions are satisfied simultaneously:
R is already in 3NF or BCNF.
R contains no multi-valued dependencies.
MVDs (Multi valued dependency)
MVD is the dependency where one attribute value is
potentially a 'multi-valued fact' about another.
Consider the table
Customer_name  Address
Raj            New Delhi, Amritsar
Address        Customer_name
Amritsar       Raj, Suneet
Customer_name : Address is 1 : N
Address : Customer_name is 1 : N
This means an N : N relationship.
Fourth Normal Form (4NF) - MVD
An MVD is a dependency between attributes (for example, A, B,
and C) in a relation, such that for each value of A
there is a set of values for B and a set of values for C.
An MVD between attributes A, B, and C in a relation is
written using the following notation:
A --> --> B
A --> --> C
Deepak Gour, Faculty – DBMS, School of Engineering,
SPSU
This table will be in the fourth normal form if B
and C are dependent on each other.
However, if B and C are independent of each
other, then R is not in fourth normal form, because
an MVD exists.
In order for a table to contain MVD, it must have
three or more attributes.
Rule to convert 3NF or BCNF into 4NF
A relation R having A, B, and C as attributes can be non-loss
decomposed into two projections R1(A,B) and R2(A,C) if and
only if the MVD A --> --> B|C holds in R.
Looking again at COURSE_STUDENT_BOOK table, it contains a
multi-valued dependency as shown below:
Course ---> --> Student_name
Course ---> --> Text_book
To put it into 4NF, two separate tables are formed as shown
below:
COURSE_STUDENT (Course, Student_name)
COURSE_BOOK (Course, text_book)
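The decomposition into COURSE_STUDENT and COURSE_BOOK can be checked on sample data. The course, student and book values below are hypothetical; the sketch only assumes, as the MVDs state, that students and text books of a course are independent of each other.

```python
from itertools import product

# Hypothetical data: students and text books per course, independent of each other.
students = {"DBMS": ["Amit", "Rahul"], "OS": ["Sumit"]}
books = {"DBMS": ["Book1", "Book2"], "OS": ["Book3"]}

# COURSE_STUDENT_BOOK must hold every student/book combination per course.
course_student_book = {
    (c, s, b) for c in students for s, b in product(students[c], books[c])
}

# The two projections that form the 4NF design.
course_student = {(c, s) for c, s, _ in course_student_book}
course_book = {(c, b) for c, _, b in course_student_book}

# Joining the projections on Course gives back the original table.
rejoined = {
    (c, s, b) for c, s in course_student for c2, b in course_book if c == c2
}
print(len(course_student_book), len(course_student), len(course_book))
```

The single table needs one row per student and book pair, while the two projections need one row per student and one per book, and the join still reproduces the original rows.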
5th Normal Form
Fifth Normal Form (5NF), also called Project-Join
Normal Form (PJNF), specifies that a 5NF table
has no join dependency.
Formally, a table is said to be in 5NF if and only if:
The table is already in 4NF.
It cannot be further non-loss decomposed.
Lossless and lossy decomposition
A decomposition {R1, R2, ..., Rn} of a relation R is called
lossless if the natural join of R1, R2, ..., Rn produces
exactly the same relation R.
A decomposition {R1, R2, ..., Rn} of a relation R is called
lossy if the natural join of R1, R2, ..., Rn does not
produce exactly the same relation R.
Consider the table
AGENT_COMPANY_PRODUCT (Agent, Company,
Product _Name)
Consider another decomposition
P3
Company  Product_name
ABC      Nut
ABC      Screw
ABC      Bolt
CDE      Bolt
Join of P1, P2 and P3
Agent   Company  Product_name
Suneet  ABC      Nut
Suneet  ABC      Screw
Suneet  ABC      Bolt*
Suneet  CDE      Bolt
Raj     ABC      Bolt
Consider another table
Join of P1 and P2
Join of P1, P2 and P3
Pitfalls in Relational Database Design
Relational database design is prone to many possible errors.
Creating an effective design for a relational database is a key element in
building a reliable system. There is no one "correct" relational database
design for any particular project, and developers must make choices to
create a design that will work efficiently. There are a few common design
pitfalls that can harm a database system. Watching out for these errors at
the design stage can help to avoid problems later on.
1. Careless Naming Practices
Choosing names is an aspect of database design that is often neglected but
can have a considerable impact on usability and future development. To
avoid this, both table and column names should be chosen to be meaningful
and to conform to the established conventions, ensuring that consistency is
maintained throughout a system. A number of conventions can be used in
relational database names, including the following two examples for a
record storing a client name: "client_name" and "clientName."
2. Lack of Documentation
Creating documentation for a relational database can be a vital step in
safeguarding future development. There are different levels of documentation
that can be created for databases, and some database management systems are
able to generate the documentation automatically. For projects where formal
documentation is not considered necessary, simply including comments
within the SQL code can be helpful.
3. Failure to Normalize
Normalization is a technique for analyzing, and improving on, an initial
database design. A variety of techniques are involved, including identifying
features of a database design that may compromise data integrity, for example
items of data that are stored in more than one place. Normalization identifies
anomalies in a database design, and can preempt design features that will
cause problems when data is queried, inserted or updated.
4. Lack of Testing
Failure to test a database design with a sample of real, or realistic,
data can cause serious problems in a database system. Generally,
relational database design is started from an abstract level, using
modeling techniques to arrive at a design. The drawback to this
process is that the design sometimes will not relate accurately to the
actual data, which is why testing is so important.
5. Failure to Exploit SQL Facilities
SQL has many capabilities that can improve the usability and success
of a database system. Facilities such as stored procedures and integrity
checks are often not used in cases where they could greatly enhance
the stability of a system. Developers often choose not to carry out
these processes during the design stages of a project as they are not a
necessity, but they can help to avoid problems at a later stage.
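As an illustration of the integrity checks mentioned above, the following sketch uses Python's built-in sqlite3 module to declare a foreign key, so that the database itself rejects a student row pointing at a course that does not exist. The table layout, the student "Neha" and the course "Cap999" are hypothetical. SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, and it has no stored procedures, so only the integrity check is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
conn.executescript("""
CREATE TABLE course (
    course_id   TEXT PRIMARY KEY,
    course_name TEXT NOT NULL
);
CREATE TABLE student (
    stu_no    INTEGER PRIMARY KEY,
    stu_name  TEXT NOT NULL,
    course_id TEXT NOT NULL REFERENCES course(course_id)
);
INSERT INTO course VALUES ('Cap301', 'Data_base');
""")

def try_insert(params):
    """Insert a student row; report whether the database accepted it."""
    try:
        conn.execute("INSERT INTO student VALUES (?, ?, ?)", params)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

accepted = try_insert((1002, "Vikash", "Cap301"))
rejected = try_insert((1006, "Neha", "Cap999"))  # no such course
print(accepted, rejected)
```

Declaring the constraint in the schema means the rule is enforced for every application that writes to the database, not only the ones that remember to check.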