Data Preprocessing Steps 2

Data preprocessing is essential for transforming raw data into a usable format, ensuring its accuracy, completeness, consistency, timeliness, believability, and interpretability. Key tasks include data cleaning, integration, reduction, and transformation, each with specific methods to handle issues like missing values, noise, and dimensionality. Effective data preprocessing enhances the quality of data analysis and decision-making in data mining.


Data Preprocessing

By Shital R. Bedse
Data Preprocessing
● Data preprocessing is the process of transforming raw data into an
understandable format. It is also an important step in data mining as we
cannot work with raw data.
● The quality of the data should be checked before applying machine
learning or data mining algorithms.
Why is Data Preprocessing Important?
● Accuracy: whether the recorded data values are correct.
● Completeness: whether all required data is available and recorded.
● Consistency: whether the same data matches across all the places it is stored.
● Timeliness: whether the data is kept up to date.
● Believability: whether the data can be trusted.
● Interpretability: how easily the data can be understood.
Major Tasks in Data Preprocessing
1.1 Data cleaning
○ Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
1.2 Data integration
○ Integration of multiple databases, data cubes, or files
1.3 Data reduction
○ Dimensionality reduction
○ Numerosity reduction
1.4 Data transformation and data discretization
○ Normalization
○ Concept hierarchy generation
1.1 Data Cleaning
● Data in the real world is dirty: there is lots of potentially incorrect data, e.g., from faulty
instruments, human or computer error, or transmission errors
○ incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
■ e.g., Occupation=“ ” (missing data)
○ noisy: containing noise, errors, or outliers
■ e.g., Salary=“−10” (an error)
○ inconsistent: containing discrepancies in codes or names, e.g.,
■ Age=“42”, Birthday=“03/07/2010”
■ Was rating “1, 2, 3”, now rating “A, B, C”
○ Intentional (e.g., disguised missing data)
■ Jan. 1 as everyone’s birthday?
Data Issues
● Lack of validation - email IDs, phone numbers
● Data from different sources
● Personal names
● Locations - US / U.S.A / United States of America / United States; Mumbai / Bombay / Navi Mumbai (see the sketch after this list)
● Dates - 1/4/2020, 1 April 2020, 1st April 2020
● Numbers - different thousands/decimal conventions, e.g., German 2.034,56 vs American 2,034.56 vs 2 034.56
● Currencies - $, Rs, USD; 1 Thai Baht (1 THB) ≈ 2.54 Indian rupees
● Languages - English, Urdu, Marathi, Hindi
● Other issues - spelling mistakes, uppercase/lowercase inconsistencies
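For illustration, a minimal sketch of standardizing a few of these inconsistencies with pandas; the column names and mapping values are hypothetical:

```python
import pandas as pd

# Hypothetical customer records with inconsistent location spellings and letter case
df = pd.DataFrame({
    "country": ["US", "U.S.A", "United State of America", "United States"],
    "city": ["Mumbai", "Bombay", "MUMBAI", "mumbai"],
})

# Map known variants of the same country to one canonical spelling
country_map = {"US": "United States", "U.S.A": "United States",
               "United State of America": "United States"}
df["country"] = df["country"].replace(country_map)

# Normalize case, then map city aliases to the current name
df["city"] = df["city"].str.strip().str.title().replace({"Bombay": "Mumbai"})

print(df.drop_duplicates())
```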

How to Handle Noisy Data?
● Binning
○ first sort the data and partition it into (equal-frequency) bins
○ then smooth by bin means, bin medians, or bin boundaries, etc. (see the sketch after this list)
● Regression
○ smooth by fitting the data into regression functions
● Clustering
○ detect and remove outliers
● Combined computer and human inspection
○ detect suspicious values and have a human check them (e.g., to deal
with possible outliers)
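As referenced above, a minimal sketch of equal-frequency binning with smoothing by bin means; pandas is assumed and the price values are made up:

```python
import pandas as pd

# Toy price data to be smoothed (already sorted)
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition into 3 equal-frequency (equal-depth) bins
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())  # each bin's values collapse to that bin's mean
```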
Data Cleaning Methods
● Remove duplicates: Find and remove identical observations across all variables.
Duplicate data can skew analysis results.
● Handle missing data: Replace missing values through imputation, or remove the affected records.
Imputation works best when only a small percentage of the data is missing (see the sketch after this list).
● Manual data cleaning: Identify and correct issues like errors, inconsistencies, and
duplicates.
● Use data cleansing tools: Tools like OpenRefine can help clean, transform, and extend
data.
● Convert data types: Change the type of data in a column.
● Remove unnecessary values: Remove values that aren't needed for analysis.
● Use a clear format: Make sure the data is organized in a clear and consistent way.
● Translate language: Translate the data into a different language if needed.
● Remove unwanted outliers: Remove data points that are significantly different from the
rest of the data
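As referenced above, a minimal sketch of duplicate removal and simple imputation with pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical records containing duplicates and missing values
df = pd.DataFrame({
    "age":    [25, 25, None, 40, 31],
    "salary": [50000, 50000, 62000, None, 58000],
    "city":   ["Pune", "Pune", "Mumbai", "Nashik", None],
})

# Remove identical observations across all variables
df = df.drop_duplicates()

# Impute numeric attributes with the column mean, categorical ones with the mode
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```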
1.2 Data Integration
● Data integration in data mining refers to the
process of combining data from multiple
sources into a single, unified view. This can
involve cleaning and transforming the data, as
well as resolving any inconsistencies or
conflicts that may exist between the different
sources.
● The goal of data integration is to make the
data more useful and meaningful for the
purposes of analysis and decision making.
Techniques used in data integration include
data warehousing, ETL (extract, transform,
load) processes, and data federation.
Tight Coupling
● In this coupling, data is combined from different sources into a single physical
location through the process of ETL – Extraction, Transformation, and Loading (see the sketch below)
● Data is integrated in a tightly coupled manner, meaning that the data is
integrated at a high level, such as at the level of the entire dataset or schema.
This approach is also known as data warehousing, and it enables data
consistency and integrity, but it can be inflexible and difficult to change or update.
● Here, a data warehouse is treated as an information retrieval component.
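A minimal ETL-style sketch of this tightly coupled approach, in which data from two hypothetical source files is extracted, transformed to a common schema, and loaded into one table; the file names, column names, and exchange rate are assumptions:

```python
import pandas as pd

# Extract: read the same kind of data from two hypothetical sources
sales_eu = pd.read_csv("sales_eu.csv")   # assumed columns: cust_id, amount_eur
sales_us = pd.read_csv("sales_us.csv")   # assumed columns: customer_id, amount_usd

# Transform: resolve the naming conflict and unify the currency
sales_eu = sales_eu.rename(columns={"cust_id": "customer_id"})
sales_eu["amount_usd"] = sales_eu.pop("amount_eur") * 1.08  # hypothetical EUR->USD rate
sales_us = sales_us[["customer_id", "amount_usd"]]

# Load: store the unified view in a single warehouse table
warehouse = pd.concat([sales_eu, sales_us], ignore_index=True)
warehouse.to_csv("warehouse_sales.csv", index=False)
```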
Loose Coupling
● This approach involves integrating data at the lowest level, such as at the level
of individual data elements or records.
● Here, an interface is provided that takes the query from the user, transforms it
in a way the source database can understand, and then sends the query
directly to the source databases to obtain the result.
● And the data only remains in the actual source databases.
1.3 Data Reduction
● Data reduction is a technique used in data mining to reduce the
size of a dataset while still preserving the most important
information.
● This can be beneficial in situations where the dataset is too large
to be processed efficiently, or where the dataset contains a large
amount of irrelevant or redundant information.
Data Reduction Methods
● Dimensionality Reduction
○ Step-wise forward selection
○ Step-wise backward selection
○ Combination of forward and backward selection
● Numerosity Reduction
● Data Cube Aggregation
● Data Compression
○ Lossless compression
○ Lossy compression
● Discretization
1.3.1 Data Cube Aggregation
This technique is used to aggregate data in a simpler form.

For example, imagine that the information you gathered for your analysis for the years
2018 to 2020 includes your company's revenue for every quarter. If the analysis only needs
annual sales rather than quarterly figures, you can aggregate the data so that the result
summarizes total sales per year instead of per quarter, as shown in the sketch below.
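A minimal sketch of this quarterly-to-annual aggregation with pandas; the revenue figures are made up:

```python
import pandas as pd

# Hypothetical quarterly revenue for 2018-2020
quarterly = pd.DataFrame({
    "year":    [2018] * 4 + [2019] * 4 + [2020] * 4,
    "quarter": [1, 2, 3, 4] * 3,
    "revenue": [120, 135, 150, 160, 170, 165, 180, 190, 150, 140, 155, 175],
})

# Aggregate along the time dimension: total sales per year instead of per quarter
annual = quarterly.groupby("year", as_index=False)["revenue"].sum()
print(annual)
```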
1.3.2 Dimension Reduction
This technique involves reducing the number of features in the dataset,
either by removing features that are not relevant or by combining
multiple features into a single feature.
Step-Wise Forward Selection
● The selection begins with an empty set of attributes. At each step, the best of the remaining
original attributes is added to the set, based on a measure of relevance such as the p-value
known from statistics.
Suppose there are the following attributes in the data set, of which a few are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


Step-Wise Backward Selection
This selection starts with the complete set of attributes in the original data and, at each step, eliminates the
worst remaining attribute from the set.
Suppose there are the following attributes in the data set, of which a few are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


Combination of Forward and Backward Selection
Combining both approaches allows us to remove the worst and select the best attributes at each
step, saving time and making the process faster (see the sketch below).
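A minimal sketch of step-wise feature selection using scikit-learn's SequentialFeatureSelector; the dataset, model, and number of attributes to keep are assumptions, and direction="backward" gives step-wise backward elimination:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical example dataset; scale the attributes so the model converges cleanly
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Start from an empty set and greedily add the most useful attributes
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",   # use "backward" for step-wise backward elimination
)
selector.fit(X, y)
print("Selected attribute indices:", selector.get_support(indices=True))
```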
1.3.4 Data Compression
● Data compression employs
modification, encoding, or
converting the structure of data in a
way that consumes less space.
● Data compression involves building a compact representation of information by removing
redundancy and representing the data in binary form. Compression from which the original
data can be restored exactly is called lossless compression.
This technique reduces the size of the files using different encoding mechanisms, such as
Huffman Encoding and run-length Encoding. We can divide it into two types based on their
compression techniques.
1. Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple and
minimal data size reduction. Lossless data compression uses algorithms to restore the
precise original data from the compressed data.
2. Lossy Compression: In lossy-data compression, the decompressed data may differ from the
original data but are useful enough to retrieve information from them. For example, the
JPEG image format is a lossy compression, but we can find the meaning equivalent to the
original image. Methods such as the Discrete Wavelet Transform and PCA (principal
component analysis) are examples of this kind of compression.
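A small sketch contrasting the two kinds of compression: zlib gives lossless (exactly restorable) compression, while keeping only one principal component of some toy data is lossy, since the reconstruction is only approximate; the data here is made up:

```python
import zlib
import numpy as np

# Lossless: the original bytes are restored exactly from the compressed form
text = b"AAAABBBCCDAA" * 100
packed = zlib.compress(text)
assert zlib.decompress(packed) == text  # exact restoration

# Lossy: keep only the first principal component of some toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_approx = (U[:, :1] * s[:1]) @ Vt[:1, :] + X.mean(axis=0)
print("Reconstruction error:", np.linalg.norm(X - X_approx))  # > 0: some information is lost
```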
1.3.5 Data Discretization
The data discretization technique is used to divide attributes of a continuous nature into
data with intervals. We replace the many continuous values of the attribute with labels of
small intervals, so that mining results can be presented in a concise and easily
understandable way.

1. Top-down discretization: If you first consider one or a couple of points (so-called
breakpoints or split points) to divide the whole range of values and then repeat this
method on the resulting intervals, the process is known as top-down discretization,
also called splitting.
2. Bottom-up discretization: If you first consider all the continuous values as potential
split points and then discard some of them by merging neighbouring values into
intervals, the process is called bottom-up discretization, also called merging (see the
sketch below).
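A minimal sketch of top-down (splitting) discretization of a continuous age attribute at two hypothetical split points, using pandas; the ages and labels are made up:

```python
import pandas as pd

# Hypothetical continuous ages
ages = pd.Series([5, 13, 18, 22, 29, 35, 41, 47, 56, 63, 72, 80])

# Split the continuous range at the chosen split points 20 and 40,
# replacing raw values with conceptual interval labels
labels = pd.cut(ages, bins=[0, 20, 40, 100], labels=["youth", "adult", "senior"])
print(labels.value_counts())
```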
Benefits of Data Reduction
The main benefit of data reduction is simple: the more data you can fit into a terabyte
of disk space, the less capacity you will need to purchase. Other benefits of data
reduction include:

○ Data reduction can save energy.
○ Data reduction can reduce your physical storage costs.
○ Data reduction can decrease your data center footprint.

Data reduction greatly increases the efficiency of a storage system and directly
impacts your total spending on capacity.
1.4 Data Transformation
Data transformation is the process of converting data from one format or structure into
another. It is a key component of data management and is used to prepare data for
analysis, reporting, and storage.

Data transformation in data mining refers to the process of converting raw data into a
format that is suitable for analysis and modeling. The goal of data transformation is to
prepare the data for data mining so that it can be used to extract useful insights and
knowledge.
Data Transformation Techniques
1. Data Smoothing -> Binning, regression and clustering
2. Attribute Construction -> New attributes constructed from the given set of attributes
3. Normalization -> Attribute values scaled to fall within a range such as -1.0 to 1.0 or 0.0 to 1.0
4. Data Aggregation -> Combining data at different levels by summing or averaging, to create new features or attributes
5. Data Discretization -> e.g. interval labels (0 to 20, 21 to 40, etc.), conceptual labels (e.g. youth, adult, senior)
6. Concept Hierarchy Generation for nominal data -> e.g. street < city < country
Normalization
● Min-max normalization: maps a value v of attribute A to the range [new_min_A, new_max_A]:
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
○ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) = 0.709
● Z-score normalization (μ: mean, σ: standard deviation): v' = (v − μ) / σ
○ Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 − 54,000) / 16,000 = 1.188
● Normalization by decimal scaling: v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1
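A minimal sketch of the three normalization methods on the income example above; the values in the array are hypothetical apart from the $73,000 used in the slides:

```python
import numpy as np

income = np.array([12000.0, 47000.0, 73000.0, 98000.0])

# Min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization using the slide's mean and standard deviation
z_score = (income - 54000.0) / 16000.0

# Decimal scaling: divide by 10^j, with j the smallest integer so that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / 10 ** j

print(min_max[2], z_score[2], decimal_scaled[2])  # ~0.709, 1.1875, 0.73
```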
