Preprocessing in Data Mining
Edgar Acuña 1
Professor, Department of Mathematical Sciences
University of Puerto Rico at Mayaguez, Puerto Rico
1. INTRODUCTION
In the process of data mining many patterns are found in the data. Patterns that
are interesting for the miner are those that are easily understood, valid, potentially
useful, and novel. These patterns should validate the hypothesis that the user seeks
to confirm. The quality of the patterns obtained depends on the quality of the
analyzed data. It is common practice to prepare the data before applying
traditional data mining techniques such as regression, association rules,
clustering, and supervised classification. Section two of this article provides
a more precise justification for the use of data preprocessing techniques. This
is followed by a description in section three of some of the data preprocessing
techniques currently in use.

1 Dr. Edgar Acuña is a Professor in the Department of Mathematical Sciences at the
University of Puerto Rico at Mayaguez. He also leads the group in Computational and
Statistical Learning from Databases at the University of Puerto Rico. He has authored
and co-authored more than 20 papers, mainly on data preprocessing, and is the author
of the book (in Spanish) Analisis Estadistico de Datos usando Minitab (John Wiley &
Sons, 2002). In 2003 he was honored with the Power Hitter Award in Business and
Technology, and in 2008 he was selected as a Fulbright Visiting Scholar. Currently,
he is an Associate Editor of the Revista Colombiana de Estadistica.

2. THE NEED FOR DATA PREPROCESSING
Pyle (1999) suggests that about 60% of the total time required to complete a
data mining project should be spent on data preparation since it is one of the most
important contributors to the success of the project. Transforming the data at hand
into a format appropriate for knowledge extraction has a significant influence on
the final models generated, as well as on the amount and quality of the knowledge
discovered during the process. At the same time, changes made to a dataset during
preprocessing can either facilitate or further complicate the knowledge discovery
process, so they must be selected with care.
Today’s real-world datasets are highly susceptible to noise, missing values, and
inconsistent data due to human errors, mechanical failures, and their typically
large size. Data affected in this manner is known as "dirty". Over the past
decades, a number of techniques have been developed to preprocess data gathered
from real-world applications before the data is processed further.
Cases where data mining techniques are applied directly to raw data without
any kind of data preprocessing are still frequent; yet, data preprocessing has been
recommended as an obligatory step. Data preprocessing techniques should never
be applied blindly to a dataset, however. Prior to any data preprocessing effort,
the dataset should be explored and characterized. Two methods for exploring the
data prior to preprocessing are data characterization and data visualization.
2.1 Data Characterization
Data characterization describes data in ways that are useful to the miner and
begins the process of understanding what is in the data. Engels (1998) describes
the following characteristics as standard for a given dataset: the number of classes,
the number of observations, the number of attributes, the number of features with
numeric data type and the number of features with symbolic data type. These
characteristics can provide a first indication of the complexity of the problem being
studied.
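As a minimal sketch of this characterization step, the following Python snippet computes those standard characteristics for a small hypothetical dataset; the records, attribute layout, and class labels are invented for illustration and are not taken from the article:

```python
# Hypothetical dataset: each record is (age, income, occupation, class label).
records = [
    (25, 41000.0, "clerk", "low"),
    (47, 72000.0, "engineer", "high"),
    (33, 55000.0, "teacher", "low"),
    (51, 98000.0, "engineer", "high"),
]

def characterize(rows, class_index):
    """Report the standard dataset characteristics described by Engels (1998)."""
    n_obs = len(rows)
    n_attr = len(rows[0]) - 1              # exclude the class label
    classes = {r[class_index] for r in rows}
    numeric = symbolic = 0
    for j in range(len(rows[0])):
        if j == class_index:
            continue
        # An attribute is numeric if every observed value is a number.
        if all(isinstance(r[j], (int, float)) for r in rows):
            numeric += 1
        else:
            symbolic += 1
    return {"observations": n_obs, "attributes": n_attr,
            "classes": len(classes), "numeric": numeric, "symbolic": symbolic}

print(characterize(records, class_index=3))
```

The same counts give a first impression of problem complexity before any preprocessing decision is made.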
In addition to the above-mentioned characteristics, location and dispersion
parameters can be calculated as one-dimensional measurements that describe the
dataset. Location parameters include the minimum, maximum, arithmetic mean,
median, and empirical quartiles. Dispersion parameters, on the other hand, such
as the range, standard deviation, and quartile deviation, measure the spread of
a feature's values.
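These location and dispersion parameters can be computed directly with Python's standard statistics module; the feature values below are hypothetical:

```python
import statistics

# Hypothetical values of a single numeric feature.
values = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 13.0]

# Location parameters: where the values lie.
location = {
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
}

# Empirical quartiles (the "inclusive" method interpolates within the sample).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Dispersion parameters: how spread out the values are.
dispersion = {
    "range": max(values) - min(values),
    "stdev": statistics.stdev(values),
    "quartile_deviation": (q3 - q1) / 2,  # half the interquartile range
}

print(location)
print(dispersion)
```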
Location and dispersion parameters can be divided into two classes: those that
can deal with extreme values and those that are sensitive to them. A parameter
that can deal well with extreme values is called robust. Some statistical software
packages compute robust parameters in addition to the traditional non-robust
parameters. Comparing robust and non-robust parameter values can provide insight
into the existence of outliers during the data characterization phase.
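This comparison can be illustrated in a few lines of Python; the measurements are hypothetical:

```python
import statistics

# Hypothetical measurements, then the same data with one extreme value added.
clean = [10.0, 11.0, 12.0, 13.0, 14.0]
dirty = clean + [1000.0]

# The mean is non-robust: a single outlier drags it from 12.0 to roughly 176.7.
# The median is robust: it only moves from 12.0 to 12.5.
for data in (clean, dirty):
    print("mean:", statistics.mean(data), "median:", statistics.median(data))
```

A large gap between the robust and non-robust values of the same parameter is a simple warning sign that outliers may be present.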
2.2 Data Visualization
Visualization techniques can also assist during this exploration and
characterization phase. Visualizing the data before preprocessing it can improve
the understanding of the data, thereby increasing the likelihood that new and
useful information will be gained from it. Visualization techniques can be used
to identify missing values and outliers, as well as relationships among
attributes. In effect, these techniques can help in gauging the "impurity" of the
data and in selecting the most appropriate data preprocessing technique to apply.
Applying the correct data preprocessing techniques can improve the quality of
the data, thereby helping to improve the accuracy and efficiency of the subsequent
mining process. Lu et al. (1996), Pyle (1999), and Azzopardi (2002) present
descriptions of common techniques for preparing data for analysis. The techniques
described by these authors can be summarized as follows:
discretization are also methods of data reduction since they reduce the number of
distinct values per attribute. Clustering methods can also be used to remove noise
by detecting outliers.
3.2 Data Integration
Some studies require the integration of multiple databases or files; this process
is known as data integration. Since attributes representing a given concept may
have different names in different databases, care must be taken to avoid
causing inconsistencies and redundancies in the data. Inconsistencies are observa-
tions that have the same values for each of the attributes but that are assigned to
different classes. Redundant observations are observations that contain the same
information.
Attributes that have been derived or inferred from others may create redundancy
problems. Again, having a large amount of redundant and inconsistent data may
slow down the knowledge discovery process for a given dataset.
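A rough sketch of how such observations might be flagged after integration, assuming each record is stored as a tuple of attribute values plus a class label (the data itself is invented for illustration):

```python
# Hypothetical merged records: (attribute tuple, class label).
merged = [
    ((25, "clerk"), "low"),
    ((25, "clerk"), "low"),     # redundant: same attributes, same class
    ((47, "engineer"), "high"),
    ((47, "engineer"), "low"),  # inconsistent: same attributes, different class
]

seen = {}                       # attribute tuple -> set of class labels seen
redundant, inconsistent = [], []
for attrs, label in merged:
    if attrs in seen:
        # Same attributes again: redundant if the class matches, inconsistent if not.
        (redundant if label in seen[attrs] else inconsistent).append(attrs)
        seen[attrs].add(label)
    else:
        seen[attrs] = {label}

print("redundant:", redundant)
print("inconsistent:", inconsistent)
```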
3.3 Data Transformation
Many data mining algorithms provide better results if the data has been
normalized or scaled to a specific range before the algorithms are applied.
Normalization is crucial when distance-based algorithms are used, because
distances computed over attributes that take on a wide range of values will
generally outweigh distances computed over attributes with a narrower range.
Other methods of data transformation include data aggregation and generalization
techniques. These methods create new attributes from existing information by
applying summary operations to the data or by replacing raw data with
higher-level concepts. For example, monthly sales data may be aggregated to
compute annual sales.
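Both transformations can be sketched briefly; the scaling function implements standard min-max normalization, and the sales figures are hypothetical:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map values linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [30000.0, 45000.0, 60000.0, 90000.0]  # hypothetical attribute values
print(min_max_scale(incomes))  # [0.0, 0.25, 0.5, 1.0]

# Aggregation: roll monthly sales up to annual sales (hypothetical figures).
monthly = {("2008", month): 100.0 + month for month in range(1, 13)}
annual = {}
for (year, _), amount in monthly.items():
    annual[year] = annual.get(year, 0.0) + amount
print(annual)  # {'2008': 1278.0}
```

After scaling, every attribute contributes on the same [0, 1] footing to a Euclidean or Manhattan distance.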
3.4 Data Reduction
The increased size of current real-world datasets has led to the development
of techniques that can reduce the size of the dataset without jeopardizing the data
mining results. The process known as data reduction obtains a reduced
representation of the dataset that is much smaller in volume, yet maintains the
integrity
of the original data. This means that data mining on the reduced dataset should
be more efficient yet produce similar analytical results. Han and Kamber (2006)
mention the following strategies for data reduction:
I) Dimension reduction, where algorithms are applied to remove irrelevant, weakly
relevant or redundant attributes.
II) Data compression, where encoding mechanisms are used to obtain a reduced
or compressed representation of the original data. Two common types of data
compression are wavelet transforms and principal component analysis.
III) Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which
store only the model parameters instead of the actual data) or nonparametric
methods such as clustering and the use of histograms.
IV) Discretization and concept hierarchy generation, where raw data values for
attributes are replaced by ranges or higher conceptual levels. For example,
concept hierarchies can be used to replace a low-level concept, such as a numeric
age, with higher-level concepts such as young, middle-aged, or senior. Some detail
may be lost by such data generalizations.
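Item IV can be illustrated with a simple range-based discretization of age; the cut points below are assumptions chosen for illustration, not values from the article:

```python
def age_concept(age):
    """Map a raw age to a higher-level concept (hypothetical cut points)."""
    if age < 35:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

ages = [22, 41, 67, 34, 58]
print([age_concept(a) for a in ages])
# ['young', 'middle-aged', 'senior', 'young', 'middle-aged']
```

Note the information loss the article warns about: after this mapping, a 22-year-old and a 34-year-old become indistinguishable.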
4. FINAL REMARKS
Acuña (2009) has developed dprep, an R package for data preprocessing and
visualization. dprep performs most of the data preprocessing techniques mentioned
in this article. Currently, research is being done on applying preprocessing
methods to data streams; see Aggarwal (2007) for more details.
References

[1] Acuña, E. (2009). Dprep: Data Preprocessing and Visualization Functions for
Classification. (http://cran.r-project.org/package=dprep). R package version 2.1.
[2] Aggarwal, C. C. (ed.) (2007). Data Streams: Models and Algorithms. Springer.
[3] Azzopardi, L. (2002). "Am I Right?" asked the Classifier: Preprocessing Data
in the Classification Process. Computing and Information Systems, 9, 37–44.
[4] Engels, R., C. Theusinger (1998). Using a Data Metric for Preprocessing Advice
for Data Mining Applications. Proceedings of the 13th European Conference on
Artificial Intelligence, 430–434.
[5] Fayyad, U. M., G. Piatetsky-Shapiro, P. Smyth (1996). From Data Mining to
Knowledge Discovery: An Overview. In: Advances in Knowledge Discovery and Data
Mining, AAAI Press and the MIT Press, Chapter 1, 1–34.
[6] Han, J., M. Kamber (2006). Data Mining: Concepts and Techniques. 2nd edition.
Morgan Kaufmann Publishers.
[7] Lu, H., S. Sun, Y. Lu (1996). On Preprocessing Data for Effective
Classification. ACM SIGMOD'96 Workshop on Research Issues on Data Mining and
Knowledge Discovery, Montreal, Canada.
[8] Pyle, D. (1999). Data Preparation for Data Mining. Morgan Kaufmann, San
Francisco.