Preprocessing in Data Mining
Edgar Acuña 1
Professor, Department of Mathematical Sciences
University of Puerto Rico at Mayaguez, Puerto Rico
1. INTRODUCTION
In the process of data mining many patterns are found in the data. Patterns that
are interesting for the miner are those that are easily understood, valid, potentially
useful, and novel. These patterns should validate the hypothesis that the user seeks
to confirm. The quality of the patterns obtained depends on the quality of the
analyzed data. It is common practice to prepare the data before applying
traditional data mining techniques such as regression, association rules,
clustering, and supervised classification. Section two of this article provides
a more precise justification for the use of data preprocessing techniques. This
is followed by a description in section three of some of the data preprocessing
techniques currently in use.

1 Dr. Edgar Acuña is a Professor in the Department of Mathematical Sciences at the
University of Puerto Rico at Mayaguez. He also leads the group in Computational and
Statistical Learning from Databases at the University of Puerto Rico. He has authored
and co-authored more than 20 papers, mainly on data preprocessing, and is the author
of the book (in Spanish) Analisis Estadistico de Datos usando Minitab (John Wiley &
Sons, 2002). In 2003 he was honored with the Power Hitter Award in Business and
Technology, and in 2008 he was selected as a Fulbright Visiting Scholar. Currently,
he is an Associate Editor of the Revista Colombiana de Estadistica.

2. THE NEED FOR DATA PREPROCESSING
Pyle (1999) suggests that about 60% of the total time required to complete a
data mining project should be spent on data preparation since it is one of the most
important contributors to the success of the project. Transforming the data at hand
into a format appropriate for knowledge extraction has a significant influence on
the final models generated, as well as on the amount and quality of the knowledge
discovered during the process. At the same time, changes made to a dataset during
preprocessing can either facilitate or further complicate the knowledge discovery
process, so they must be selected with care.
Today’s real-world datasets are highly susceptible to noise, missing values, and
inconsistent data due to human errors, mechanical failures, and their typically
large size. Data affected in this manner is known as "dirty". Over the past
decades, a number of techniques have been developed to preprocess data gathered
from real-world applications before the data is processed further.
Cases where data mining techniques are applied directly to raw data without
any kind of data preprocessing are still frequent; yet, data preprocessing has been
recommended as an obligatory step. Data preprocessing techniques should never
be applied blindly to a dataset, however. Prior to any data preprocessing effort,
the dataset should be explored and characterized. Two methods for exploring the
data prior to preprocessing are data characterization and data visualization.
2.1 Data Characterization
Data characterization describes data in ways that are useful to the miner and
begins the process of understanding what is in the data. Engels (1998) describes
the following characteristics as standard for a given dataset: the number of classes,
the number of observations, the number of attributes, the number of features with
numeric data type and the number of features with symbolic data type. These
characteristics can provide a first indication of the complexity of the problem being
studied.
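As a minimal sketch of this characterization step, the following Python snippet computes those standard characteristics for a small hypothetical dataset; the records, attribute layout, and class labels are invented for illustration and are not taken from the article:

```python
# Hypothetical dataset: each record is (age, income, occupation, class label).
records = [
    (25, 41000.0, "clerk", "low"),
    (47, 72000.0, "engineer", "high"),
    (33, 55000.0, "teacher", "low"),
    (51, 98000.0, "engineer", "high"),
]

def characterize(rows, class_index):
    """Report the standard dataset characteristics described by Engels (1998)."""
    n_obs = len(rows)
    n_attr = len(rows[0]) - 1              # exclude the class label
    classes = {r[class_index] for r in rows}
    numeric = symbolic = 0
    for j in range(len(rows[0])):
        if j == class_index:
            continue
        # An attribute is numeric if every observed value is a number.
        if all(isinstance(r[j], (int, float)) for r in rows):
            numeric += 1
        else:
            symbolic += 1
    return {"observations": n_obs, "attributes": n_attr,
            "classes": len(classes), "numeric": numeric, "symbolic": symbolic}

print(characterize(records, class_index=3))
```

The same counts give a first impression of problem complexity before any preprocessing decision is made.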
In addition to the above-mentioned characteristics, location and dispersion
parameters can be calculated as one-dimensional measurements that describe the
dataset. Location parameters include the minimum, maximum, arithmetic mean,
median, and empirical quartiles. Dispersion parameters, on the other hand, such
as the range, standard deviation, and quartile deviation, measure the spread of
a feature's values.
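These location and dispersion parameters can be computed directly with Python's standard statistics module; the feature values below are hypothetical:

```python
import statistics

# Hypothetical values of a single numeric feature.
values = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 13.0]

# Location parameters: where the values lie.
location = {
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
}

# Empirical quartiles (the "inclusive" method interpolates within the sample).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Dispersion parameters: how spread out the values are.
dispersion = {
    "range": max(values) - min(values),
    "stdev": statistics.stdev(values),
    "quartile_deviation": (q3 - q1) / 2,  # half the interquartile range
}

print(location)
print(dispersion)
```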
Location and dispersion parameters can be divided into two classes: those that
can deal with extreme values and those that are sensitive to them. A parameter
that can deal well with extreme values is called robust. Some statistical software
packages compute robust parameters in addition to the traditional non-robust
parameters. Comparing robust and non-robust parameter values can provide insight
into the existence of outliers during the data characterization phase.
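This comparison can be illustrated in a few lines of Python; the measurements are hypothetical:

```python
import statistics

# Hypothetical measurements, then the same data with one extreme value added.
clean = [10.0, 11.0, 12.0, 13.0, 14.0]
dirty = clean + [1000.0]

# The mean is non-robust: a single outlier drags it from 12.0 to roughly 176.7.
# The median is robust: it only moves from 12.0 to 12.5.
for data in (clean, dirty):
    print("mean:", statistics.mean(data), "median:", statistics.median(data))
```

A large gap between the robust and non-robust values of the same parameter is a simple warning sign that outliers may be present.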
2.2 Data Visualization
Visualization techniques can also assist during this exploration and
characterization phase. Visualizing the data before preprocessing it can improve
the understanding of the data, thereby increasing the likelihood that new and
useful information will be gained from it. Visualization techniques can be used
to identify missing values and outliers, as well as relationships among
attributes. In effect, these techniques can help in gauging the "impurity" of the
data and in selecting the most appropriate data preprocessing technique to apply.
Applying the correct data preprocessing techniques can improve the quality of
the data, thereby helping to improve the accuracy and efficiency of the subsequent
mining process. Lu et al. (1996), Pyle (1999), and Azzopardi (2002) present
descriptions of common techniques for preparing data for analysis. The techniques
described by these authors can be summarized as follows:
discretization are also methods of data reduction since they reduce the number of
distinct values per attribute. Clustering methods can also be used to remove noise
by detecting outliers.
3.2 Data Integration
Some studies require the integration of multiple databases or files; this process
is known as data integration. Since attributes representing a given concept may
have different names in different databases, care must be taken to avoid
causing inconsistencies and redundancies in the data. Inconsistencies are observa-
tions that have the same values for each of the attributes but that are assigned to
different classes. Redundant observations are observations that contain the same
information.
Attributes that have been derived or inferred from others may create redundancy
problems. Again, having a large amount of redundant and inconsistent data may
slow down the knowledge discovery process for a given dataset.
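A rough sketch of how such observations might be flagged after integration, assuming each record is stored as a tuple of attribute values plus a class label (the data itself is invented for illustration):

```python
# Hypothetical merged records: (attribute tuple, class label).
merged = [
    ((25, "clerk"), "low"),
    ((25, "clerk"), "low"),     # redundant: same attributes, same class
    ((47, "engineer"), "high"),
    ((47, "engineer"), "low"),  # inconsistent: same attributes, different class
]

seen = {}                       # attribute tuple -> set of class labels seen
redundant, inconsistent = [], []
for attrs, label in merged:
    if attrs in seen:
        # Same attributes again: redundant if the class matches, inconsistent if not.
        (redundant if label in seen[attrs] else inconsistent).append(attrs)
        seen[attrs].add(label)
    else:
        seen[attrs] = {label}

print("redundant:", redundant)
print("inconsistent:", inconsistent)
```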
3.3 Data Transformation
Many data mining algorithms provide better results if the data has been
normalized or scaled to a specific range before the algorithms are applied.
Normalization is crucial when distance-based algorithms are used, because
distances computed over attributes that take on a wide range of values will
generally outweigh distances computed over attributes with a narrower range.
Other methods of data transformation include data aggregation and generalization
techniques. These methods create new attributes from existing information by
applying summary operations to the data or by replacing raw data with
higher-level concepts. For example, monthly sales data may be aggregated to
compute annual sales.
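Both transformations can be sketched briefly; the scaling function implements standard min-max normalization, and the sales figures are hypothetical:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map values linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [30000.0, 45000.0, 60000.0, 90000.0]  # hypothetical attribute values
print(min_max_scale(incomes))  # [0.0, 0.25, 0.5, 1.0]

# Aggregation: roll monthly sales up to annual sales (hypothetical figures).
monthly = {("2008", month): 100.0 + month for month in range(1, 13)}
annual = {}
for (year, _), amount in monthly.items():
    annual[year] = annual.get(year, 0.0) + amount
print(annual)  # {'2008': 1278.0}
```

After scaling, every attribute contributes on the same [0, 1] footing to a Euclidean or Manhattan distance.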
3.4 Data Reduction
The increased size of current real-world datasets has led to the development
of techniques that can reduce the size of the dataset without jeopardizing the data
mining results. The process known as data reduction obtains a reduced
representation of the dataset that is much smaller in volume, yet maintains the
integrity
of the original data. This means that data mining on the reduced dataset should
be more efficient yet produce similar analytical results. Han and Kamber (2006)
mention the following strategies for data reduction:
I) Dimension reduction, where algorithms are applied to remove irrelevant, weakly
relevant or redundant attributes.
II) Data compression, where encoding mechanisms are used to obtain a reduced
or compressed representation of the original data. Two common types of data
compression are wavelet transforms and principal component analysis.
III) Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which
store only the model parameters instead of the actual data) or nonparametric
methods such as clustering and the use of histograms.
IV) Discretization and concept hierarchy generation, where raw data values for
attributes are replaced by ranges or higher conceptual levels. For example,
concept hierarchies can be used to replace a low-level concept, such as a numeric
age, with higher-level concepts such as young, middle-aged, or senior. Some detail
may be lost by such data generalizations.
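Item IV can be illustrated with a simple range-based discretization of age; the cut points below are assumptions chosen for illustration, not values from the article:

```python
def age_concept(age):
    """Map a raw age to a higher-level concept (hypothetical cut points)."""
    if age < 35:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

ages = [22, 41, 67, 34, 58]
print([age_concept(a) for a in ages])
# ['young', 'middle-aged', 'senior', 'young', 'middle-aged']
```

Note the information loss the article warns about: after this mapping, a 22-year-old and a 34-year-old become indistinguishable.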
4. FINAL REMARKS
Acuña (2009) has developed dprep, an R package for data preprocessing and
visualization. dprep performs most of the data preprocessing techniques mentioned
in this article. Currently, research is being done on applying preprocessing
methods to data streams; see Aggarwal (2007) for more details.
References

[1] Acuña, E. (2009). Dprep: Data Preprocessing and Visualization Functions for
Classification. (http://cran.r-project.org/package=dprep). R package version 2.1.
[2] Aggarwal, C. C. (ed.) (2007). Data Streams: Models and Algorithms. Springer.
[3] Azzopardi, L. (2002). "Am I Right?" asked the Classifier: Preprocessing Data
in the Classification Process. Computing and Information Systems, 9, 37–44.
[4] Engels, R., C. Theusinger (1998). Using a Data Metric for Preprocessing Advice
for Data Mining Applications. Proceedings of the 13th European Conference on
Artificial Intelligence, 430–434.
[5] Fayyad, U. M., G. Piatetsky-Shapiro, P. Smyth (1996). From Data Mining to
Knowledge Discovery: An Overview. In: Advances in Knowledge Discovery and Data
Mining, AAAI Press and the MIT Press, Chapter 1, 1–34.
[6] Han, J., M. Kamber (2006). Data Mining: Concepts and Techniques. 2nd edition.
Morgan Kaufmann Publishers.
[7] Lu, H., S. Sun, Y. Lu (1996). On Preprocessing Data for Effective
Classification. ACM SIGMOD'96 Workshop on Research Issues on Data Mining and
Knowledge Discovery, Montreal, Canada.
[8] Pyle, D. (1999). Data Preparation for Data Mining. Morgan Kaufmann, San
Francisco.