Unit-3
Data Preprocessing
Introduction
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format.
Raw data (real-world data) is often incomplete, inconsistent, and/or noisy, which increases
the chances of error and misinterpretation.
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data. E.g., occupation = “ ”
Noisy: containing errors or outliers. E.g. Salary = “-10”
Inconsistent: containing discrepancies in codes or names. E.g. Age=“42”
Birthday=“03/07/1997”
Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares
raw data for further processing.
Data Cleaning
Real-world data can have many irrelevant and missing parts. Data cleaning is done to handle
these problems; it involves handling missing data, noisy data, etc.
a) Missing Data: This situation arises when some values are missing from the data. It can be
handled in various ways, some of which are listed below (a short code sketch follows this list):
Ignore the tuples: This approach is suitable only when the dataset we have is quite
large and multiple values are missing within a tuple.
Fill in the missing value manually.
Use a global constant to fill in the missing value. E.g. “unknown”, a new class.
Use the attribute mean to fill in the missing value.
Use the attribute mean for all samples belonging to the same class as the given tuple.
Use the most probable value to fill in the missing value.
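For illustration, here is a minimal pandas sketch of some of these strategies (the column
names and values below are invented):
```python
import pandas as pd

# Hypothetical data with missing values (None becomes NaN)
df = pd.DataFrame({
    "occupation": ["teacher", None, "engineer", None],
    "salary": [30000, 45000, None, 52000],
})

# Ignore the tuples: simply drop rows that contain missing values
dropped = df.dropna()

# Fill a categorical attribute with a global constant ("unknown" as a new class)
df["occupation"] = df["occupation"].fillna("unknown")

# Fill a numeric attribute with the attribute mean
df["salary"] = df["salary"].fillna(df["salary"].mean())

print(df)
```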
b) Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It
can be generated by faulty data collection, data entry errors, etc. E.g. Salary = “-10”.
It can be handled in the following ways:
Binning method: This method smooths sorted data values. First, the data is sorted; then
the sorted values are distributed into segments (bins) of equal size. There are three
methods for smoothing the data in a bin (a short code sketch follows the worked example below).
- Smoothing by bin mean method: In this method, the values in the bin are replaced
by the mean value of the bin;
- Smoothing by bin median: In this method, the values in the bin are replaced by
the median value;
- Smoothing by bin boundary: In this method, the minimum and maximum values of the bin
are taken as the bin boundaries, and each value in the bin is replaced by the closest
boundary value.
Example:
Unsorted data for price in dollars
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First of all, sort the data
After Sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smoothing the data by equal frequency bins:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
Smoothing by bin means:
For Bin 1: (8 + 9 + 15 + 16) / 4 = 12, so Bin 1 = 12, 12, 12, 12
For Bin 2: (21 + 21 + 24 + 26) / 4 = 23, so Bin 2 = 23, 23, 23, 23
For Bin 3: (27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30, so Bin 3 = 30, 30, 30, 30
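The same computation can be sketched in a few lines of numpy (this reproduces the worked
example above; 30.25 is rounded to 30 as in the text):
```python
import numpy as np

# Sorted price data from the example above, split into equal-frequency bins of 4
prices = np.array([8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34])
bins = prices.reshape(-1, 4)

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean
by_means = np.repeat(np.round(bins.mean(axis=1)).astype(int), 4)
print(by_means)  # [12 12 12 12 23 23 23 23 30 30 30 30]

# Smoothing by bin boundaries: each value moves to the nearer of its bin's min/max
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)
print(by_bounds.ravel())  # [ 8  8 16 16 21 21 26 26 27 27 27 34]
```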
Regression: Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple (having
multiple independent variables).
Clustering: This approach organizes similar values into groups (clusters). Values that
fall outside the set of clusters may be considered outliers.
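A minimal scikit-learn sketch of both ideas (the small arrays are invented for illustration,
and the outlier rule shown is just one possible heuristic):
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Regression smoothing: replace noisy y values with the values predicted by a
# linear regression fitted on x
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
smoothed = LinearRegression().fit(x, y).predict(x)
print(smoothed)

# Clustering: similar values are organised into groups; a value that ends up
# isolated in a tiny cluster of its own can be treated as a potential outlier
values = np.array([[8], [9], [15], [16], [21], [24], [95]])  # 95 looks suspicious
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
sizes = np.bincount(km.labels_)
print(values[sizes[km.labels_] == 1].ravel())  # expected: [95]
```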
Data Integration
Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of
the data. These sources may include multiple data cubes, databases, or flat files.
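As a tiny illustration, two hypothetical sources can be combined into one view with pandas
(the tables and the key columns are invented):
```python
import pandas as pd

# Two heterogeneous sources describing the same customers
crm = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Anu", "Bibek", "Cira"]})
sales = pd.DataFrame({"customer_id": [1, 2, 4], "total": [120.0, 75.5, 99.0]})

# Resolve the schema mismatch (cust_id vs customer_id) and merge into a unified view
unified = crm.merge(sales, left_on="cust_id", right_on="customer_id", how="outer")
print(unified)
```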
Data Transformation
Data transformation is the process of transforming data into the form that is appropriate for
mining.
Some Data Transformation Strategies:
Smoothing: It is used to remove the noise from data. Such techniques include binning,
clustering, and regression.
Aggregation: Here summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for analysis of the data at
multiple granularities.
Generalization: Here low level data are replaced by higher level concepts through the use
of concept hierarchies. For example, categorical attributes, like street, can be generalized
to higher level concepts, like city or country.
Attribute construction: Here new attributes are constructed and added from the given set
of attributes to help the mining process.
Normalization: Here the attribute data are scaled so as to fall within a small specified range,
such as -1 to +1, or 0 to 1. Techniques that are used for normalization are:
- Min-Max Normalization: It performs a linear transformation on the original data.
Suppose that min_A and max_A are the minimum and maximum values of an attribute A.
Min-max normalization maps a value v of A to nv in the range [new_min, new_max]
using the following formula:
    nv = ((v - min_A) / (max_A - min_A)) * (new_max - new_min) + new_min
- Z-score Normalization: Here a value v of A is normalized to nv using the mean and
standard deviation of A:
    nv = (v - μ) / σ
where μ = (v1 + v2 + ... + vn) / n is the mean, σ is the standard deviation, and n is
the number of data points.
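Both formulas can be checked with a short numpy sketch (the sample values are arbitrary):
```python
import numpy as np

a = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization into [new_min, new_max] = [0, 1]
new_min, new_max = 0.0, 1.0
minmax = (a - a.min()) / (a.max() - a.min()) * (new_max - new_min) + new_min
print(minmax)  # [0.    0.125 0.25  0.5   1.   ]

# Z-score normalization using the mean and standard deviation of the attribute
zscore = (a - a.mean()) / a.std()
print(zscore)
```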
Data Reduction
A database or data warehouse may store terabytes of data, so performing data analysis and
mining on such huge amounts of data may take a very long time. Data reduction obtains a
reduced representation of the data set that is much smaller in volume yet produces the same
(or almost the same) analytical results.
Data Reduction Techniques:
Dimensionality Reduction: Dimensionality reduction is the process of reducing the
number of random variables or attributes under consideration. Dimensionality reduction
methods include wavelet transforms and principal components analysis, which transform
or project the original data onto a smaller space. Attribute subset selection is a method of
dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed. For example,
Name     Mobile No.    Mobile Network
Jayanta  9843xxxxxx    NTC
Kushal   9801xxxxxx    NCELL
Fig: Before Dimension Reduction
If we know the Mobile Number, then we can determine the Mobile Network, so the Mobile
Network attribute is redundant and can be removed (see the short sketch after the tables).
Name     Mobile No.
Jayanta  9843xxxxxx
Kushal   9801xxxxxx
Fig: After Dimension Reduction
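A brief sketch of both flavours of dimensionality reduction (the DataFrame mirrors the
example above; the numeric matrix for PCA is invented):
```python
import pandas as pd
from sklearn.decomposition import PCA

# Attribute subset selection: drop the redundant attribute from the example above
df = pd.DataFrame({
    "Name": ["Jayanta", "Kushal"],
    "Mobile No.": ["9843xxxxxx", "9801xxxxxx"],
    "Mobile Network": ["NTC", "NCELL"],
})
reduced = df.drop(columns=["Mobile Network"])
print(reduced)

# Principal components analysis: project numeric data onto a smaller space
X = [[2.5, 2.4, 0.5], [0.5, 0.7, 1.2], [2.2, 2.9, 0.3], [1.9, 2.2, 0.4]]
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)  # (4, 2): three attributes reduced to two components
```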
Numerosity Reduction: Numerosity reduction techniques replace the original data volume
by alternative, smaller forms of data representation. These techniques may be parametric
or nonparametric. For parametric methods, a model is used to estimate the data, so that
typically only the data parameters need to be stored, instead of the actual data (Outliers
may also be stored.) Regression and log-linear models are examples. Nonparametric
methods for storing reduced representations of the data include histograms, clustering,
sampling, and data cube aggregation (a short sampling and histogram sketch follows).
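A small numpy sketch of two nonparametric ideas, sampling and histograms (the synthetic
data is invented purely for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100_000)  # stand-in for the full data set

# Sampling: keep a small random sample instead of the full data
sample = rng.choice(data, size=1_000, replace=False)
print(sample.mean(), data.mean())  # the sample approximates the full data

# Histogram: store only 20 bin counts and boundaries as a reduced representation
counts, edges = np.histogram(data, bins=20)
print(counts.sum())  # 100000 values summarised by 20 bins
```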
Data Compression: In data compression, transformations are applied so as to obtain a
reduced or “compressed” representation of the original data. If the original data can be
reconstructed from the compressed data without any information loss, the data reduction is
called lossless. If, instead, we can reconstruct only an approximation of the original data,
then the data reduction is called lossy.
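A tiny standard-library example of lossless compression, where the original data is
recovered exactly (the byte string is arbitrary):
```python
import zlib

original = b"price,quantity,total\n" * 1_000  # arbitrary repetitive data

# Lossless compression: the original bytes are fully recoverable
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed))  # much smaller representation
print(restored == original)                  # True: no information loss
```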