
Data Warehousing and Data Mining Reference Note

Unit-3
Data Preprocessing

Introduction
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format.
Raw data (real-world data) is often incomplete, inconsistent, and/or noisy, which increases the chance of errors and misinterpretation.
 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. E.g., occupation = “ ”
 Noisy: containing errors or outliers. E.g., Salary = “-10”
 Inconsistent: containing discrepancies in codes or names. E.g., Age = “42” but Birthday = “03/07/1997”
Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares
raw data for further processing.

Why is data dirty?


 Incomplete data may come from
- “Not applicable” data value when collected
- Different considerations between the time when the data was collected and when it
is analyzed.
- Human/hardware/software problems
 Noisy data (incorrect values) may come from
- Faulty data collection instruments
- Human or computer error at data entry
- Errors in data transmission
 Inconsistent data may come from
- Different data sources
- Functional dependency violation (e.g., linked data modified in one place but not another)
 Duplicate records also need data cleaning

Why do we need to preprocess data?


By preprocessing data, we:
 Make our database more accurate: We eliminate incorrect or missing values that are there as a result of human error or bugs.
 Boost consistency: We resolve inconsistencies and duplicates in the data, which would otherwise affect the accuracy of the results.
 Make the database more complete: We can fill in the attributes that are missing, if needed.
 Smooth the data: This makes the data easier to use and interpret.


Steps involved in data preprocessing:

Fig: Data Preprocessing Steps

Data Cleaning
The data can have many irrelevant and missing parts. Data cleaning is done to handle these problems. It involves handling of missing data, noisy data, etc.
a) Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways; some of them are listed below (a short code sketch follows this list):
 Ignore the tuples: This approach is suitable only when the dataset we have is quite
large and multiple values are missing within a tuple.
 Fill in the missing value manually.
 Use a global constant to fill in the missing value. E.g. “unknown”, a new class.
 Use the attribute mean to fill in the missing value.
 Use the attribute mean for all samples belonging to the same class as the given tuple.
 Use the most probable value to fill in the missing value.
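A minimal sketch of a few of these strategies using pandas (the column names and values below are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data containing missing values (None/NaN).
df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher", None],
    "salary": [52000.0, 48000.0, None, 51000.0],
})

# Ignore the tuples: drop rows that contain missing values.
dropped = df.dropna()

# Use a global constant: fill missing occupations with a new class "unknown".
df["occupation"] = df["occupation"].fillna("unknown")

# Use the attribute mean: fill missing salaries with the column mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

print(dropped)
print(df)
```

Filling with the class-conditional mean or the most probable value follows the same pattern, e.g. grouping by a class attribute before computing the mean.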
b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. E.g., Salary = “-10”. It can be handled in the following ways:
 Binning method: This method smooths noisy data. First, the data is sorted, and then the sorted values are distributed into segments of equal size and stored in the form of bins. There are three methods for smoothing the data in a bin.
- Smoothing by bin means: In this method, the values in the bin are replaced by the mean value of the bin.
- Smoothing by bin medians: In this method, the values in the bin are replaced by the median value of the bin.
- Smoothing by bin boundaries: In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.


Example:
Unsorted data for price in dollars
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First of all, sort the data.
After sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Partition the data into equal-frequency bins:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
Smoothing by bin means (a code sketch of this example appears after this list):
For Bin 1: (8 + 9 + 15 + 16) / 4 = 12, so Bin 1 = 12, 12, 12, 12
For Bin 2: (21 + 21 + 24 + 26) / 4 = 23, so Bin 2 = 23, 23, 23, 23
For Bin 3: (27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30, so Bin 3 = 30, 30, 30, 30

 Regression: Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple (having
multiple independent variables).
 Clustering: This approach groups similar data into clusters. Values that fall outside of the clusters can be treated as outliers (otherwise they may go undetected).
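The worked binning example above can be reproduced with a short plain-Python sketch (the bin size of 4 matches the example; it is not a fixed rule):

```python
# Equal-frequency binning with smoothing by bin means,
# reproducing the price example above.
prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]

sorted_prices = sorted(prices)   # 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
bin_size = 4                     # equal-frequency bins of 4 values each

# Partition the sorted values into consecutive bins of equal size.
bins = [sorted_prices[i:i + bin_size]
        for i in range(0, len(sorted_prices), bin_size)]

# Smooth each bin by replacing its values with the (rounded) bin mean.
smoothed = [[round(sum(b) / len(b))] * len(b) for b in bins]

print(bins)      # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(smoothed)  # [[12, 12, 12, 12], [23, 23, 23, 23], [30, 30, 30, 30]]
```

Smoothing by bin medians or bin boundaries only changes the replacement value computed for each bin.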

Data Integration
Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.

Fig: Data Integration


There are two major approaches for data integration (see the sketch after this list):
 Tight Coupling: In tight coupling, data is combined from different sources into a single physical location through the process of ETL - Extraction, Transformation and Loading.
 Loose Coupling: In loose coupling, data remains only in the actual source databases. In this approach, an interface is provided that takes a query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result.
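A minimal sketch of the loose-coupling idea, assuming two hypothetical source connectors and a trivial query translation (all names and data below are invented for illustration):

```python
# Loose coupling: data stays in the source systems; a mediator translates
# the user's query for each source and merges the returned results.

def query_sales_db(customer_key):
    # Hypothetical connector for source 1 (e.g. a relational database).
    return [{"customer_id": customer_key, "total_purchases": 120.0}]

def query_crm_export(customer_key):
    # Hypothetical connector for source 2 (e.g. a flat-file export).
    return [{"customer_number": customer_key, "segment": "retail"}]

def unified_query(customer_key):
    """Translate one logical query into source-specific calls and merge."""
    results = []
    results.extend(query_sales_db(customer_key))    # this source uses customer_id
    results.extend(query_crm_export(customer_key))  # this source uses customer_number
    return results

print(unified_query("C-1001"))
```

Tight coupling, by contrast, would run an ETL job that copies and transforms both sources into a single physical store ahead of time.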


Issues in Data Integration


 Entity Identification Problem: Since the data is unified from heterogeneous sources, how can we match the real-world entities across the data? For example, suppose we have customer data from two different data sources. An entity from one data source has customer_id and the entity from the other data source has customer_number. How can the data analyst or the system tell that these two attributes refer to the same real-world entity?
 Redundancy: An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes. Inconsistencies in attribute naming can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis (a short sketch follows this list).
 Data Conflict Detection and Resolution: Data conflict means the data merged from different sources do not match; for example, attribute values may differ across data sets. The difference may arise because the values are represented differently in the different data sets. For instance, the price of a hotel room may be represented in different currencies in different cities. Such issues are detected and resolved during data integration.
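As a small illustration of detecting a redundant numeric attribute by correlation analysis (toy data, assuming NumPy is available; the values are invented):

```python
import numpy as np

# Toy data: the two price attributes are (almost) linearly related,
# so one of them is redundant.
price_usd = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
price_npr = np.array([1330.0, 2660.0, 3995.0, 5320.0, 6655.0])

# Pearson correlation coefficient between the two attributes.
r = np.corrcoef(price_usd, price_npr)[0, 1]
print(f"correlation = {r:.4f}")  # close to 1.0 -> strong linear redundancy
```

For categorical attributes, a chi-square test plays the analogous role.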

Data Transformation

Data transformation is the process of transforming data into forms that are appropriate for mining.
Some Data Transformation Strategies:
 Smoothing: It is used to remove the noise from data. Such techniques include binning,
clustering, and regression.
 Aggregation: Here summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for analysis of the data at
multiple granularities.
 Generalization: Here low level data are replaced by higher level concepts through the use
of concept hierarchies. For example, categorical attributes, like street, can be generalized
to higher level concepts, like city or country.
 Attribute construction: Here new attributes are constructed and added from the given set
of attributes to help the mining process.
 Normalization: Here the attribute data are scaled so as to fall within a small specified range,
such as -1 to +1, or 0 to 1. Techniques that are used for normalization are:
- Min-Max Normalization: It performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute, 𝐴. Min-max normalization maps a value, 𝑣, of 𝐴 to 𝑛𝑣 in the range [new_min, new_max] using the following formula:
𝑛𝑣 = ((𝑣 − min_A) / (max_A − min_A)) × (new_max − new_min) + new_min


- Z-score Normalization: In z-score normalization (or zero-mean normalization), the values for an attribute, 𝐴, are normalized based on the mean and standard deviation of 𝐴. The value, 𝑣, of 𝐴 is normalized to 𝑛𝑣 as below. It is also called standard normalization.
𝑛𝑣 = (𝑣 − 𝜇) / 𝜎
where 𝜇 is the mean of 𝐴, 𝜎 = √((1/𝑛) Σᵢ (𝑣ᵢ − 𝜇)²) is its standard deviation, and 𝑛 is the number of data points.
(A code sketch for both normalization techniques follows this list.)
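A minimal sketch of both normalization formulas using NumPy (the target range [0, 1] and the salary values are assumptions for the example):

```python
import numpy as np

salary = np.array([30000.0, 45000.0, 60000.0, 90000.0])

# Min-max normalization into [new_min, new_max] = [0, 1].
new_min, new_max = 0.0, 1.0
min_a, max_a = salary.min(), salary.max()
minmax = (salary - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Z-score (zero-mean) normalization: (v - mean) / standard deviation.
zscore = (salary - salary.mean()) / salary.std()

print(minmax)  # values scaled into [0, 1]
print(zscore)  # values with mean 0 and standard deviation 1
```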

Data Reduction
A database or data warehouse may store terabytes of data, so data analysis and mining on such huge amounts of data can take a very long time. Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data Reduction Techniques:
 Dimensionality Reduction: Dimensionality reduction is the process of reducing the
number of random variables or attributes under consideration. Dimensionality reduction
methods include wavelet transforms and principal components analysis, which transform
or project the original data onto a smaller space. Attribute subset selection is a method of
dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed. For example,
Name Mobile No. Mobile Network
Jayanta 9843xxxxxx NTC
Kushal 9801xxxxxx NCELL
Fig: Before Dimension Reduction
If we know the Mobile No., then we can derive the Mobile Network, so one dimension can be reduced by dropping the Mobile Network attribute (a PCA code sketch for dimensionality reduction appears after this list).
Name Mobile No.
Jayanta 9843xxxxxx
Kushal 9801xxxxxx
Fig: After Dimension Reduction
 Numerosity Reduction: Numerosity reduction techniques replace the original data volume
by alternative, smaller forms of data representation. These techniques may be parametric
or nonparametric. For parametric methods, a model is used to estimate the data, so that
typically only the data parameters need to be stored, instead of the actual data (Outliers
may also be stored.) Regression and log-linear models are examples. Nonparametric
methods for storing reduced representations of the data include histograms, clustering,
sampling, and data cube aggregation.
 Data Compression: In data compression, transformations are applied so as to obtain a
reduced or “compressed” representation of the original data. If the original data can be
reconstructed from the compressed data without any information loss, the data reduction is
called lossless. If, instead, we can reconstruct only an approximation of the original data,
then the data reduction is called lossy.
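A minimal sketch of dimensionality reduction by principal components analysis, assuming scikit-learn is available (the toy matrix and the choice of 2 components are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 6 tuples described by 4 numeric attributes.
X = np.array([
    [2.5, 2.4, 0.5, 1.0],
    [0.5, 0.7, 2.2, 2.9],
    [2.2, 2.9, 1.9, 2.2],
    [1.9, 2.2, 3.1, 3.0],
    [3.1, 3.0, 2.3, 2.7],
    [2.3, 2.7, 2.0, 1.6],
])

# Project the 4-dimensional data onto a smaller 2-dimensional space.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2): same tuples, fewer dimensions
print(pca.explained_variance_ratio_)  # variance retained by each component
```

Numerosity reduction by sampling can be sketched similarly, e.g. keeping a random subset of the rows.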


Data Discretization and Concept Hierarchy Generation


Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Discretization can be categorized into the following two types:
 Top-down discretization: If we first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of attribute values, and then repeat this recursively on the resulting intervals, the process is known as top-down discretization, also known as splitting.
 Bottom-up discretization: If we first consider all of the continuous values as potential split points and then discard some of them by merging neighboring values into intervals, the process is called bottom-up discretization, also known as merging.
Concept Hierarchies reduce the data by collecting and replacing low level concepts (such as
city) by higher level concepts (such as province or country).

Fig: Concept Hierarchy


Data Discretization and Concept Hierarchy Generation can be performed using binning,
histogram analysis or decision tree induction approaches.
 Discretization and Concept Hierarchy Generation by Binning: Binning is a top-down
splitting technique based on a specified number of bins. Binning methods for data
smoothing can also be used as discretization methods for data reduction and concept
hierarchy generation. For example, attribute values can be discretized by applying binning,
and then replacing each bin value by the bin mean or median. These techniques can be
applied recursively to the resulting partitions to generate concept hierarchies (a code sketch of discretization by binning appears after this list).
 Discretization and Concept Hierarchy Generation by Histogram Analysis: Histograms
use binning to approximate data distributions and are a popular form of data reduction. A
histogram for an attribute, 𝐴, partitions the data distribution of 𝐴 into disjoint subsets,
referred to as buckets or bins. The histogram analysis algorithm can be applied recursively
to each partition in order to automatically generate a multilevel concept hierarchy, with the
procedure terminating once a pre-specified number of concept levels has been reached.


 Discretization and Concept Hierarchy Generation by Clustering: A clustering algorithm


can be applied to discretize a numeric attribute, 𝐴, by partitioning the values of 𝐴 into
clusters or groups. Clustering takes the distribution of 𝐴 into consideration, as well as the
closeness of data points, and therefore is able to produce high quality discretization results.
Clustering can be used to generate a concept hierarchy for 𝐴 by following either a top-down
splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the
concept hierarchy.
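A minimal sketch of discretization by binning with pandas (pd.cut produces equal-width intervals in a top-down fashion, pd.qcut equal-frequency intervals; the age values and labels are illustrative):

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33,
                  35, 36, 40, 45, 46, 52, 70])

# Equal-width discretization into 3 intervals, replacing values by
# interval labels (low-level concepts of a simple hierarchy).
equal_width = pd.cut(ages, bins=3, labels=["youth", "middle_aged", "senior"])

# Equal-frequency discretization: each interval holds roughly the same
# number of values.
equal_freq = pd.qcut(ages, q=3)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

Applying such binning recursively to each interval would yield additional, finer levels of a concept hierarchy.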

Data Mining Task Primitives


We can specify a data mining task in the form of a data mining query. This query is input to
the system.
A data mining query is defined in terms of data mining task primitives. These primitives allow
us to communicate in an interactive manner with the data mining system.
The data mining task primitives are:
1. Task-relevant data: This specifies the portions of the database or the set of data in which
the user is interested. This includes the database attributes or data warehouse dimensions
of interest (referred to as the relevant attributes or dimensions).
2. The Kind of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
3. Background Knowledge: Background knowledge is information about the domain to be
mined that can be useful in the discovery process and for evaluating the patterns found.
Concept hierarchies are a popular form of background knowledge, which allow data to be
mined at multiple levels of abstraction.
4. Interestingness measures: These functions are used to separate uninteresting patterns
from knowledge. They may be used to guide the mining process, or after discovery, to
evaluate the discovered patterns. Different kinds of knowledge may have different
interestingness measures.
5. Presentation and visualization of discovered patterns: This refers to the form in
which discovered patterns are to be displayed, which may include rules, tables, charts,
graphs, decision trees, and cubes.

Collegenote Prepared By: Jayanta Poudel
