
Data Warehousing and Data Mining Reference Note

Unit-3
Data Preprocessing

Introduction
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format.
Raw data (real-world data) is often incomplete, inconsistent, and/or noisy, which increases the chance of errors and misinterpretation.
 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. E.g., occupation = “ ”
 Noisy: containing errors or outliers. E.g., Salary = “-10”
 Inconsistent: containing discrepancies in codes or names. E.g., Age = “42” but Birthday = “03/07/1997”
Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares
raw data for further processing.

Why is data dirty?


 Incomplete data may come from
- “Not applicable” data value when collected
- Different considerations between the time when the data was collected and when it
is analyzed.
- Human/hardware/software problems
 Noisy data (incorrect values) may come from
- Faulty data collection instruments
- Human or computer error at data entry
- Errors in data transmission
 Inconsistent data may come from
- Different data sources
- Functional dependency violation (e.g., linked data modified in one place but not another)
 Duplicate records also need data cleaning

Why do we need to preprocess data?


By preprocessing data, we:
 Make our database more accurate: We eliminate incorrect or missing values that are there as a result of human error or bugs.
 Boost consistency: We resolve inconsistencies and duplicates in the data, which would otherwise affect the accuracy of the results.
 Make the database more complete: We can fill in the attributes that are missing, if needed.
 Smooth the data: This makes the data easier to use and interpret.


Steps involved in data preprocessing:

Fig: Data Preprocessing Steps

Data Cleaning
The data can have many irrelevant and missing parts. Data cleaning is done to handle these problems. It involves handling of missing data, noisy data, etc.
a) Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways; some of them are listed below (a short code sketch follows this list):
 Ignore the tuples: This approach is suitable only when the dataset we have is quite
large and multiple values are missing within a tuple.
 Fill in the missing value manually.
 Use a global constant to fill in the missing value. E.g. “unknown”, a new class.
 Use the attribute mean to fill in the missing value.
 Use the attribute mean for all samples belonging to the same class as the given tuple.
 Use the most probable value to fill in the missing value.
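A minimal sketch of a few of these strategies using pandas (the column names and values below are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data containing missing values (None/NaN).
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],
    "salary": [52000.0, 48000.0, None, 51000.0],
})

# Ignore the tuples: drop rows that contain missing values.
dropped = df.dropna()

# Use a global constant: fill missing occupations with a new class "unknown".
df["occupation"] = df["occupation"].fillna("unknown")

# Use the attribute mean: fill missing salaries with the column mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

print(dropped)
print(df)
```

Filling with the class-conditional mean or the most probable value follows the same pattern, e.g. grouping by a class attribute before computing the mean.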
b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. E.g., Salary = “-10”. It can be handled in the following ways:
 Binning method: This method smooths noisy data. First, the data is sorted, and then the sorted values are distributed into segments of equal size and stored in the form of bins. There are three methods for smoothing the data in a bin.
- Smoothing by bin means: In this method, the values in the bin are replaced by the mean value of the bin.
- Smoothing by bin medians: In this method, the values in the bin are replaced by the median value of the bin.
- Smoothing by bin boundaries: In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.


Example:
Unsorted data for price in dollars
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First of all, sort the data.
After sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Partition the data into equal-frequency bins:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
Smoothing by bin means (a code sketch of this example appears after this list):
For Bin 1: (8 + 9 + 15 + 16) / 4 = 12, so Bin 1 = 12, 12, 12, 12
For Bin 2: (21 + 21 + 24 + 26) / 4 = 23, so Bin 2 = 23, 23, 23, 23
For Bin 3: (27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30, so Bin 3 = 30, 30, 30, 30

 Regression: Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple (having
multiple independent variables).
 Clustering: This approach groups similar data into clusters. Values that fall outside of the clusters can be treated as outliers (otherwise they may go undetected).
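The worked binning example above can be reproduced with a short plain-Python sketch (the bin size of 4 matches the example; it is not a fixed rule):

```python
# Equal-frequency binning with smoothing by bin means,
# reproducing the price example above.
prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]

sorted_prices = sorted(prices)   # 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
bin_size = 4                     # equal-frequency bins of 4 values each

# Partition the sorted values into consecutive bins of equal size.
bins = [sorted_prices[i:i + bin_size]
        for i in range(0, len(sorted_prices), bin_size)]

# Smooth each bin by replacing its values with the (rounded) bin mean.
smoothed = [[round(sum(b) / len(b))] * len(b) for b in bins]

print(bins)      # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(smoothed)  # [[12, 12, 12, 12], [23, 23, 23, 23], [30, 30, 30, 30]]
```

Smoothing by bin medians or bin boundaries only changes the replacement value computed for each bin.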

Data Integration
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.

Fig: Data Integration


There are two major approaches for data integration (see the sketch after this list):
 Tight Coupling: In tight coupling, data is combined from different sources into a single physical location through the process of ETL - Extraction, Transformation and Loading.
 Loose Coupling: In loose coupling, data remains only in the actual source databases. In this approach, an interface is provided that takes a query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result.
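A minimal sketch of the loose-coupling idea, assuming two hypothetical source connectors and a trivial query translation (all names and data below are invented for illustration):

```python
# Loose coupling: data stays in the source systems; a mediator translates
# the user's query for each source and merges the returned results.

def query_sales_db(customer_key):
    # Hypothetical connector for source 1 (e.g. a relational database).
    return [{"customer_id": customer_key, "total_purchases": 120.0}]

def query_crm_export(customer_key):
    # Hypothetical connector for source 2 (e.g. a flat-file export).
    return [{"customer_number": customer_key, "segment": "retail"}]

def unified_query(customer_key):
    """Translate one logical query into source-specific calls and merge."""
    results = []
    results.extend(query_sales_db(customer_key))    # this source uses customer_id
    results.extend(query_crm_export(customer_key))  # this source uses customer_number
    return results

print(unified_query("C-1001"))
```

Tight coupling, by contrast, would run an ETL job that copies and transforms both sources into a single physical store ahead of time.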


Issues in Data Integration


 Entity Identification Problem: Since the data is unified from heterogeneous sources, how can we match the real-world entities across the data? For example, suppose we have customer data from two different data sources. An entity from one data source has customer_id and the entity from the other data source has customer_number. How can the data analyst or the system tell that these two attributes refer to the same real-world entity?
 Redundancy: An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes. Inconsistencies in attribute naming can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis (a short sketch follows this list).
 Data Conflict Detection and Resolution: Data conflict means the data merged from different sources do not match; for example, attribute values may differ across data sets. The difference may arise because the values are represented differently in the different data sets. For instance, the price of a hotel room may be represented in different currencies in different cities. Such issues are detected and resolved during data integration.
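As a small illustration of detecting a redundant numeric attribute by correlation analysis (toy data, assuming NumPy is available; the values are invented):

```python
import numpy as np

# Toy data: the two price attributes are (almost) linearly related,
# so one of them is redundant.
price_usd = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
price_npr = np.array([1330.0, 2660.0, 3995.0, 5320.0, 6655.0])

# Pearson correlation coefficient between the two attributes.
r = np.corrcoef(price_usd, price_npr)[0, 1]
print(f"correlation = {r:.4f}")  # close to 1.0 -> strong linear redundancy
```

For categorical attributes, a chi-square test plays the analogous role.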

Data Transformation

Data transformation is the process of transforming data into forms that are appropriate for mining.
Some Data Transformation Strategies:
 Smoothing: It is used to remove the noise from data. Such techniques include binning,
clustering, and regression.
 Aggregation: Here summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for analysis of the data at
multiple granularities.
 Generalization: Here low level data are replaced by higher level concepts through the use
of concept hierarchies. For example, categorical attributes, like street, can be generalized
to higher level concepts, like city or country.
 Attribute construction: Here new attributes are constructed and added from the given set
of attributes to help the mining process.
 Normalization: Here the attribute data are scaled so as to fall within a small specified range,
such as -1 to +1, or 0 to 1. Techniques that are used for normalization are:
- Min-Max Normalization: It performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute, 𝐴. Min-max normalization maps a value, 𝑣, of 𝐴 to 𝑛𝑣 in the range [new_min, new_max] using the following formula:
𝑛𝑣 = ((𝑣 − min_A) / (max_A − min_A)) × (new_max − new_min) + new_min


- Z-score Normalization: In z-score normalization (or zero-mean normalization), the values for an attribute, 𝐴, are normalized based on the mean and standard deviation of 𝐴. The value, 𝑣, of 𝐴 is normalized to 𝑛𝑣 as below. It is also called standard normalization.
𝑛𝑣 = (𝑣 − 𝜇) / 𝜎
where 𝜇 is the mean of 𝐴, 𝜎 = √((1/𝑛) Σᵢ (𝑣ᵢ − 𝜇)²) is its standard deviation, and 𝑛 is the number of data points.
(A code sketch for both normalization techniques follows this list.)
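A minimal sketch of both normalization formulas using NumPy (the target range [0, 1] and the salary values are assumptions for the example):

```python
import numpy as np

salary = np.array([30000.0, 45000.0, 60000.0, 90000.0])

# Min-max normalization into [new_min, new_max] = [0, 1].
new_min, new_max = 0.0, 1.0
min_a, max_a = salary.min(), salary.max()
minmax = (salary - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Z-score (zero-mean) normalization: (v - mean) / standard deviation.
zscore = (salary - salary.mean()) / salary.std()

print(minmax)  # values scaled into [0, 1]
print(zscore)  # values with mean 0 and standard deviation 1
```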

Data Reduction
A database or data warehouse may store terabytes of data, so data analysis and mining on such huge amounts of data can take a very long time. Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data Reduction Techniques:
 Dimensionality Reduction: Dimensionality reduction is the process of reducing the
number of random variables or attributes under consideration. Dimensionality reduction
methods include wavelet transforms and principal components analysis, which transform
or project the original data onto a smaller space. Attribute subset selection is a method of
dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed. For example,
Name Mobile No. Mobile Network
Jayanta 9843xxxxxx NTC
Kushal 9801xxxxxx NCELL
Fig: Before Dimension Reduction
If we know the Mobile No., then we can derive the Mobile Network, so one dimension can be reduced by dropping the Mobile Network attribute (a PCA code sketch for dimensionality reduction appears after this list).
Name Mobile No.
Jayanta 9843xxxxxx
Kushal 9801xxxxxx
Fig: After Dimension Reduction
 Numerosity Reduction: Numerosity reduction techniques replace the original data volume
by alternative, smaller forms of data representation. These techniques may be parametric
or nonparametric. For parametric methods, a model is used to estimate the data, so that
typically only the data parameters need to be stored, instead of the actual data (Outliers
may also be stored.) Regression and log-linear models are examples. Nonparametric
methods for storing reduced representations of the data include histograms, clustering,
sampling, and data cube aggregation.
 Data Compression: In data compression, transformations are applied so as to obtain a
reduced or “compressed” representation of the original data. If the original data can be
reconstructed from the compressed data without any information loss, the data reduction is
called lossless. If, instead, we can reconstruct only an approximation of the original data,
then the data reduction is called lossy.
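A minimal sketch of dimensionality reduction by principal components analysis, assuming scikit-learn is available (the toy matrix and the choice of 2 components are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 6 tuples described by 4 numeric attributes.
X = np.array([
    [2.5, 2.4, 0.5, 1.0],
    [0.5, 0.7, 2.2, 2.9],
    [2.2, 2.9, 1.9, 2.2],
    [1.9, 2.2, 3.1, 3.0],
    [3.1, 3.0, 2.3, 2.7],
    [2.3, 2.7, 2.0, 1.6],
])

# Project the 4-dimensional data onto a smaller 2-dimensional space.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2): same tuples, fewer dimensions
print(pca.explained_variance_ratio_)  # variance retained by each component
```

Numerosity reduction by sampling can be sketched similarly, e.g. keeping a random subset of the rows.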


Data Discretization and Concept Hierarchy Generation


Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Discretization can be categorized into the following two types:
 Top-down discretization: If we first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of attribute values, and then repeat this recursively on the resulting intervals, the process is known as top-down discretization, also known as splitting.
 Bottom-up discretization: If we first consider all of the continuous values as potential split points and then discard some of them by merging neighboring values into intervals, the process is called bottom-up discretization, also known as merging.
Concept Hierarchies reduce the data by collecting and replacing low level concepts (such as
city) by higher level concepts (such as province or country).

Fig: Concept Hierarchy


Data Discretization and Concept Hierarchy Generation can be performed using binning,
histogram analysis or decision tree induction approaches.
 Discretization and Concept Hierarchy Generation by Binning: Binning is a top-down
splitting technique based on a specified number of bins. Binning methods for data
smoothing can also be used as discretization methods for data reduction and concept
hierarchy generation. For example, attribute values can be discretized by applying binning,
and then replacing each bin value by the bin mean or median. These techniques can be
applied recursively to the resulting partitions to generate concept hierarchies (a code sketch of discretization by binning appears after this list).
 Discretization and Concept Hierarchy Generation by Histogram Analysis: Histograms
use binning to approximate data distributions and are a popular form of data reduction. A
histogram for an attribute, 𝐴, partitions the data distribution of 𝐴 into disjoint subsets,
referred to as buckets or bins. The histogram analysis algorithm can be applied recursively
to each partition in order to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a pre-specified number of concept levels has been reached.


 Discretization and Concept Hierarchy Generation by Clustering: A clustering algorithm


can be applied to discretize a numeric attribute, 𝐴, by partitioning the values of 𝐴 into
clusters or groups. Clustering takes the distribution of 𝐴 into consideration, as well as the
closeness of data points, and therefore is able to produce high quality discretization results.
Clustering can be used to generate a concept hierarchy for 𝐴 by following either a top-down
splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the
concept hierarchy.
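A minimal sketch of discretization by binning with pandas (pd.cut produces equal-width intervals in a top-down fashion, pd.qcut equal-frequency intervals; the age values and labels are illustrative):

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33,
                  35, 36, 40, 45, 46, 52, 70])

# Equal-width discretization into 3 intervals, replacing values by
# interval labels (low-level concepts of a simple hierarchy).
equal_width = pd.cut(ages, bins=3, labels=["youth", "middle_aged", "senior"])

# Equal-frequency discretization: each interval holds roughly the same
# number of values.
equal_freq = pd.qcut(ages, q=3)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

Applying such binning recursively to each interval would yield additional, finer levels of a concept hierarchy.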

Data Mining Task Primitives


We can specify a data mining task in the form of a data mining query. This query is input to
the system.
A data mining query is defined in terms of data mining task primitives. These primitives allow
us to communicate in an interactive manner with the data mining system.
The data mining task primitives are:
1. Task-relevant data: This specifies the portions of the database or the set of data in which
the user is interested. This includes the database attributes or data warehouse dimensions
of interest (referred to as the relevant attributes or dimensions).
2. The Kind of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
3. Background Knowledge: Background knowledge is information about the domain to be
mined that can be useful in the discovery process and for evaluating the patterns found.
Concept hierarchies are a popular form of background knowledge, which allow data to be
mined at multiple levels of abstraction.
4. Interestingness measures: These functions are used to separate uninteresting patterns
from knowledge. They may be used to guide the mining process, or after discovery, to
evaluate the discovered patterns. Different kinds of knowledge may have different
interestingness measures.
5. Presentation and visualization of discovered patterns: This refers to the form in
which discovered patterns are to be displayed, which may include rules, tables, charts,
graphs, decision trees, and cubes.

Collegenote Prepared By: Jayanta Poudel
