Clustering Categorical Data Using the K-Means Algorithm and the Attribute's Relative Frequency
Abstract—Clustering is a well-known data mining technique used in pattern recognition and information retrieval. The initial dataset to be clustered can contain either categorical or numeric data, and each type of data has its own specific clustering algorithm. In this context, two algorithms are commonly used: the k-means for clustering numeric datasets and the k-modes for categorical datasets. A problem frequently encountered in data mining applications is the clustering of categorical datasets, which are highly relevant in practice. One way to achieve the clustering of categorical values is to transform the categorical attributes into numeric measures and to apply the k-means algorithm directly instead of the k-modes. In this paper, it is proposed to experiment with an approach based on this idea by transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previous method based on transforming the categorical datasets into binary values. The scalability and accuracy of the two methods are evaluated experimentally. The obtained results show that our proposed method outperforms the binary method in all cases.

Keywords—Clustering, k-means, categorical datasets, pattern recognition, unsupervised learning, knowledge discovery.

… description for each obtained cluster to extract the corresponding properties and knowledge.

The k-means is a well-known clustering algorithm proposed for numeric datasets (containing numeric values), which makes it not adapted for clustering categorical datasets. This is a strong restriction that limits the usefulness of the algorithm since, in many data mining applications, most of the considered datasets contain categorical values. To deal with categorical datasets, the k-means was extended into the k-modes algorithm, which is detailed in the next section. However, another interesting option is to convert the categorical data into numeric values and apply the k-means algorithm directly, and this is the option explored here.

This paper is organized as follows: in the second section, we present previous approaches towards clustering categorical data with their limits and provide a detailed description of the k-means algorithm adopted in this study. In the third section, our proposed approach is detailed. Experimental results and discussion are provided in the fourth section, and the last section is devoted to the conclusion and perspectives.
… patterns using the concept of links, i.e., the similarity between any two categorical patterns depends on the number of their common neighbors. The aim of this algorithm is thus to merge into one group the patterns that share a relatively large number of links.

The notion of relative frequency was used in [13] to define a new dissimilarity coefficient for the k-modes algorithm, in which the frequency of the categorical values in the current cluster is taken into account when calculating the dissimilarity between a data point and a cluster mode, since the simple matching distance is not a good measure and results in poor intra-cluster similarity.

Although the k-modes based algorithms have shown their efficiency in clustering large categorical datasets, they still have, like the k-means type algorithms, two major limitations: (1) they cannot capture the global structure effectively, i.e., the provided solutions are only locally optimal and a global solution is not easy to find [14]; (2) the accuracy of the obtained results is sensitive to the number and shape of the initial centroids. Besides, the modes are more difficult to move in iterative optimization processes because the attribute values of categorical data are not continuous. The mode retains only the most frequent element of the considered attribute, which means that if two modalities have close frequencies, only one is kept and the other is dismissed, resulting in information loss.

In this paper, two approaches are discussed and experimented for clustering categorical datasets with the k-means algorithm: in the first method, a binary data representation is used to convert the initial categorical dataset into numeric values; in the second method, the relative frequency of the modalities in the attributes is used to perform the transformation.

B. The k-Means Algorithm

The k-means algorithm is a widely used clustering technique in which an initial training set S_N = {x^(i), i = 1, …, N}, composed of N elements x^(i) ∈ R^d described by d attributes, is divided into K clusters C_1, C_2, …, C_K. The clustering process is based on measuring the distance between the initially randomly selected centroids z_j and the observations. The algorithm is described as follows:
1. Initialize the centroids z_1, …, z_K;
2. Repeat until there is no further change in the cost function:
   a. ∀ j = 1, …, K: C_j = {x^(i) : z_j is the closest centroid to x^(i)};
   b. ∀ j = 1, …, K: z_j = (1/|C_j|) Σ_{x ∈ C_j} x (cluster mean).

The k-means aims to minimize a criterion known as the within-cluster sum of squares, defined as follows:

J = Σ_{j=1}^{K} Σ_{x ∈ C_j} ||x − z_j||²

The resulting clusters are described by the means of the samples in each cluster, called "centroids", which may not be points from the dataset, although they live in the same space.

The k-means clustering algorithm is well known for its efficiency in clustering large datasets, yet few previous proposals aimed to use its original version for clustering categorical data. On the other hand, some attempts were made to cluster categorical datasets with hierarchical algorithms, but these do not constitute an interesting alternative because their quadratic time complexity hinders their usage.

The main motivation behind this approach is to benefit from the k-means algorithm in terms of complexity: it is well known for its low computational cost O(KNTd), which is linear in the number of clusters K, the number of observations N, the number of iterations T and the number of attributes d. In [15], the author proposed an approach using the k-means algorithm to cluster categorical data. The approach is based on converting each multi-category attribute into binary attributes, using 1 and 0 to indicate whether a category is present or absent, and treating the binary attributes as numeric data. However, once used in data mining applications, this approach has to handle a huge number of parameters, since the number of attributes grows with the number of modalities. This fact increases both the computational and space costs of the k-means algorithm. Besides, according to the algorithm's process, the computed cluster means representing the centroids are contained between 0 and 1 and do not indicate the real characteristics of the clusters.

III. PROPOSED APPROACH FOR CLUSTERING CATEGORICAL DATASETS

In this paper, it is proposed to experiment with a method to cluster qualitative data using the original version of the k-means algorithm. The considered dataset is assumed to be stored in a table, where each row (tuple) represents an observation described by the attributes arranged in columns. The objects encountered in many data mining applications are often described by categorical information systems.

Definition 1. Formally, a categorical information system is represented by the quadruple CIS = (U, A, V, f), where:
U is a non-empty set of objects (the universe);
A is a non-empty set of attributes;
V is a finite unordered set representing the union of all the attribute domains;
f is a mapping information function.

Although the initial version of the k-means is not adapted for categorical data, which represents its main limitation, in this paper we propose a new efficient approach to cluster categorical datasets based on the k-means. To make this possible, our proposed solution consists in transforming the initial dataset into numeric values by considering the relative frequency of the modalities in each attribute.

Definition 2. The relative frequency f_{k,j} is the number of occurrences of the k-th category C_{k,j} in attribute A_j divided by the number of observations N in the dataset, and is defined as follows:

f_{k,j} = |C_{k,j}| / N
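To make Definition 2 concrete, the following is a minimal sketch of how the relative frequency of one categorical attribute can be computed and substituted for its modalities. The toy column, the class name and all identifiers are illustrative; this is not the paper's original Java implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class RelativeFrequency {
    public static void main(String[] args) {
        // Illustrative categorical column, e.g. a "Work" attribute with
        // modalities S (student), E (employee), J (jobless).
        String[] work = {"S", "E", "J", "E"};
        int n = work.length;

        // Count the occurrences |C_{k,j}| of each modality k in the attribute.
        Map<String, Integer> counts = new HashMap<>();
        for (String v : work) {
            counts.merge(v, 1, Integer::sum);
        }

        // Relative frequency f_{k,j} = |C_{k,j}| / N (Definition 2).
        Map<String, Double> frequency = new HashMap<>();
        counts.forEach((modality, c) -> frequency.put(modality, c / (double) n));

        // Replace each categorical value by the relative frequency of its modality.
        double[] numeric = new double[n];
        for (int i = 0; i < n; i++) {
            numeric[i] = frequency.get(work[i]);
        }
        System.out.println(java.util.Arrays.toString(numeric)); // [0.25, 0.5, 0.25, 0.5]
    }
}
```

Note that, with this transformation, an attribute keeps a single numeric column regardless of its number of modalities.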
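The proposed procedure described just below combines this transformation with the plain k-means loop of subsection B above. For reference, here is a minimal, self-contained sketch of that loop on numeric vectors; the squared Euclidean distance, the first-K-points initialisation and the iteration cap are illustrative assumptions and are not claimed to match the paper's implementation.

```java
import java.util.Arrays;

public class SimpleKMeans {
    // One k-means run: returns the cluster index of each observation.
    static int[] cluster(double[][] x, int k, int maxIter) {
        int n = x.length, d = x[0].length;
        double[][] z = new double[k][];                  // centroids z_1..z_K
        for (int j = 0; j < k; j++) z[j] = x[j].clone(); // simple initialisation: first K points (requires k <= n)
        int[] assign = new int[n];

        for (int it = 0; it < maxIter; it++) {
            // Assignment step: each x(i) goes to the closest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = dist(x[i], z[0]);
                for (int j = 1; j < k; j++) {
                    double dj = dist(x[i], z[j]);
                    if (dj < bestDist) { bestDist = dj; best = j; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed) break;                         // converged: no further change

            // Update step: z_j = mean of the points assigned to cluster j.
            double[][] sum = new double[k][d];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                count[assign[i]]++;
                for (int a = 0; a < d; a++) sum[assign[i]][a] += x[i][a];
            }
            for (int j = 0; j < k; j++)
                if (count[j] > 0)
                    for (int a = 0; a < d; a++) z[j][a] = sum[j][a] / count[j];
        }
        return assign;
    }

    // Squared Euclidean distance, the quantity summed in the within-cluster sum of squares.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    public static void main(String[] args) {
        // Illustrative numeric data (e.g. a frequency-transformed dataset).
        double[][] x = {{0.75, 0.25}, {0.75, 0.50}, {0.25, 0.25}, {0.75, 0.50}};
        System.out.println(Arrays.toString(cluster(x, 2, 100)));
    }
}
```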
The different steps of the proposed approach are described as follows:

STEP 1: Transform the initial categorical dataset into a numeric one by replacing each categorical value with the relative frequency of its modality, f_{k,j} = |C_{k,j}| / N;
STEP 2: Randomly select K initial centroids (objects) from the dataset for the clusters, Z = {z_1, z_2, …, z_K};
WHILE the centroids (z_1, …, z_K) keep changing DO
  FOR each cluster C_j DO
    FOR each individual x_i ∈ U DO
      Compute d(x_i, z_j);
  Assign each x_i to the nearest centroid: C_j = {x_i : d(x_i, z_j) ≤ d(x_i, z_l), l = 1, …, K};
  Re-compute each new cluster centroid using the mean: z_j = (1/|C_j|) Σ_{x ∈ C_j} x;
END WHILE

The following example gives an idea of how to implement the two methods.

TABLE I
EXAMPLE OF THE INITIAL CONSIDERED CATEGORICAL DATASET

Obsi   Sex (M/F)   Work   Criminal Records
Obs1   M           S      Y
Obs2   M           E      Y
Obs3   F           J      N
Obs4   M           E      Y
*S: Student, E: Employee, J: Jobless; **Y: Yes, N: No

Table I provides an example of a categorical dataset containing four observations described by three categorical attributes. The first attribute (Sex) has two modalities (Male/Female), the second attribute (Work) has three modalities (Student, Employee, Jobless), and the third attribute (Criminal Records) has two modalities (Yes/No). When considering the first transformation method to obtain a binary dataset, the result is as follows.

TABLE II.A
BINARY TRANSFORMATION OF THE INITIAL DATASET

       Sex           Work                        Criminal Record
Obsi   Male  Female  Student  Employee  Jobless  Yes  No
Obs1   1     0       1        0         0        1    0
Obs2   1     0       0        1         0        1    0
Obs3   0     1       0        0         1        0    1
Obs4   1     0       0        1         0        1    0

When considering the second transformation method, the dataset obtained with the proposed frequency-based approach will be as follows in Table II.C.

IV. EXPERIMENTAL ANALYSIS RESULTS

In this section, the experimental environment and the initial dataset are described. The efficiency is evaluated using the accuracy. Besides, the contribution to scalability is also tested, first for different values of the number of clusters K and then with 50 runs for each of four values of K with different initial centroids.

The complexity of the k-means algorithm depends on the number of iterations T, of attributes (dimensions) d, of observations N and of clusters K. In the experiments, N and K are equal for the two methods; however, the experimental results show that T for the binary method is higher than for the frequency-based method. Besides, the resulting datasets to be experimented have different numbers of dimensions: for the binary transformation, this parameter is higher than for the frequency-based method. These facts show that our new proposed technique permits reducing the complexity of the k-means once executed. Some proposals were made to reduce the dimensionality [16], [17] and can be considered if the issue of reducing the dimensions of the resulting binary transformation is to be experimented.

A. Experimental Environment and Evaluation Criterion

The algorithm was coded in the JAVA language and run on an Intel Core i3-2.1 GHz machine with 4 GB of RAM under the Windows 7 operating system. To evaluate the efficiency of the k-means in clustering categorical datasets transformed using the relative frequency of the attributes, the accuracy is considered as an evaluation criterion; the higher this metric, the better the obtained clusters. The accuracy is defined as follows:

Definition 3. The accuracy AC of a clustering is an external evaluation criterion that permits comparing the effectiveness of two clusterings, as follows:

AC = (1/N) Σ_{i=1}^{K} |a_i|

where K is the number of predefined classes, |a_i| is the number of objects correctly assigned to class i, and N is the number of observations in the dataset.
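As a complement to Definition 3, the sketch below computes the accuracy AC of a clustering against predefined class labels. The paper does not spell out how the a_i counts are obtained; mapping each cluster to its majority class, as done here, is a common convention and should be read as an assumption, as are all identifiers.

```java
import java.util.HashMap;
import java.util.Map;

public class ClusteringAccuracy {
    // AC = (1/N) * sum_i |a_i|, where |a_i| is here taken as the number of
    // objects of the majority class inside cluster i (objects considered correctly clustered).
    static double accuracy(int[] clusters, int[] labels, int k) {
        int n = labels.length;
        int correct = 0;
        for (int c = 0; c < k; c++) {
            Map<Integer, Integer> byClass = new HashMap<>();
            for (int i = 0; i < n; i++)
                if (clusters[i] == c) byClass.merge(labels[i], 1, Integer::sum);
            int a = 0;
            for (int count : byClass.values()) a = Math.max(a, count);
            correct += a;                                // |a_i| for this cluster
        }
        return correct / (double) n;
    }

    public static void main(String[] args) {
        // Illustrative result: 2 clusters, 6 labelled observations.
        int[] clusters = {0, 0, 0, 1, 1, 1};
        int[] labels   = {0, 0, 1, 1, 1, 1};
        System.out.println(accuracy(clusters, labels, 2)); // 0.8333...
    }
}
```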
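To connect the worked example above with the dimensionality argument made in this section, the following sketch applies both transformations to the dataset of Table I: the binary method of [15] yields seven 0/1 attributes, whereas the relative-frequency transformation keeps the three original attributes as numeric columns. The class name and output format are illustrative, and the code reproduces only the toy example, not the paper's experimental datasets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class TransformTableI {
    public static void main(String[] args) {
        // Categorical dataset of Table I: Sex, Work, Criminal Records.
        String[][] data = {
            {"M", "S", "Y"},
            {"M", "E", "Y"},
            {"F", "J", "N"},
            {"M", "E", "Y"}
        };
        int n = data.length, d = data[0].length;

        // Relative-frequency transformation (Definition 2):
        // one numeric column per attribute, whatever its number of modalities.
        double[][] freq = new double[n][d];
        for (int j = 0; j < d; j++) {
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i < n; i++) counts.merge(data[i][j], 1, Integer::sum);
            for (int i = 0; i < n; i++) freq[i][j] = counts.get(data[i][j]) / (double) n;
        }

        // Binary transformation ([15]): one 0/1 column per modality of each attribute.
        List<String> columns = new ArrayList<>();       // "attributeIndex=modality" column names
        for (int j = 0; j < d; j++) {
            LinkedHashSet<String> modalities = new LinkedHashSet<>();
            for (int i = 0; i < n; i++) modalities.add(data[i][j]);
            for (String m : modalities) columns.add(j + "=" + m);
        }
        int[][] binary = new int[n][columns.size()];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++)
                binary[i][columns.indexOf(j + "=" + data[i][j])] = 1;

        // The frequency-based dataset keeps d = 3 dimensions; the binary one has 7.
        System.out.println("Frequency columns: " + d + ", binary columns: " + columns.size());
        for (double[] row : freq) System.out.println(java.util.Arrays.toString(row));
        for (int[] row : binary) System.out.println(java.util.Arrays.toString(row));
    }
}
```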
Fig. 1 Execution Time comparison using the two methods for the considered datasets over the number of clusters
Fig. 2 Number of iterations comparison using the two methods for the considered datasets over the number of clusters
The two previous figures represent the scalability of the k-means clustering algorithm when considering the two different initial datasets obtained according to our approach. The scalability is characterized by two parameters: the run time and the number of iterations required by the algorithm to converge. According to these two factors, the run time required by the relative-frequency transformation is lower than the run time required by the binary method. This fact highlights the convenience of the proposed approach and the value of our contribution. The difference in execution time is very significant and makes our proposed approach better adapted to data mining applications dealing with huge datasets.

In the previous experiments, the scalability of the algorithm and its performance in clustering a categorical dataset transformed into numeric values were evaluated for different values of K. To experiment further, it is proposed to test the scalability for four values of K with 50 runs each and to compute the average, minimum and maximum values of the run time and of the number of iterations. The obtained results are summarized in Tables III.A-C.

TABLE III.A
AVERAGE RUN TIME AND NUMBER OF ITERATIONS REQUIRED BY THE TWO APPROACHES FOR VARIOUS VALUES OF K

       Binary dataset          Dataset with relative frequency
K      Run time   Iterations   Run time   Iterations
3      860.6      16.6         369.13     7.16
4      1228.12    25.4         433.28     10.22
5      1792       28.68        511.08     11.64
6      2022.98    37.78        551.9      13.66
TABLE III (CONTINUED)

       Binary dataset          Dataset with relative frequency
K      Run time   Iterations   Run time   Iterations
5      2886       54           1108       30
6      4664       79           1170       29

The accuracy of the two methods was then computed for different values of the number of clusters K (2→20). The obtained results are shown in the following figure.
According to the previous results, it is obvious that the accuracy computed using the proposed approach is, in most cases, superior to the accuracy computed with Ralambondrainy's approach [15]. The proposed approach is not only effective in terms of run time and number of iterations, but its clustering efficiency is also enhanced. The evaluation of clustering efficiency can be considered more important than scalability, since it permits characterizing and identifying more pertinent profiles, which is the main aim of the clustering process.

As with the scalability experiments, it is proposed to compute the accuracy over 50 runs of the algorithm for the two datasets. The same values of K (3, 4, 5, 6) are considered. In Table IV, the average of the accuracy values computed in each case is provided.

TABLE IV
ACCURACY COMPUTED FOR 50 RUNS OF THE K-MEANS FOR THE TWO METHODS

K                    3      4      5      6
Ralambondrainy       0.5    0.57   0.51   0.65
Proposed approach    0.675  0.748  0.686  0.712

The provided results confirm again that the proposed approach is more effective in clustering categorical data when the relative frequency of the modalities in the attributes is used to transform the categorical data into numeric values. The obtained accuracies are higher for the proposed approach than for Ralambondrainy's technique.

V. CONCLUSION

Clustering categorical data is a heavy and complex task, and specific clustering algorithms should be designed for it. In this paper, the relative frequency of each modality in its attribute is used to transform the categorical measures into numeric values. The k-means algorithm is then applied to the resulting dataset. The experimental results show that our proposed approach improves three parameters: (i) the scalability, i.e., the run time and the number of iterations; (ii) the efficiency, evaluated using the accuracy; and (iii) the complexity, due to the reduction of the number of iterations and of the dimensions of the original dataset. These findings show the considerable contribution resulting from the use of the relative frequency, which is considered the most appropriate statistical parameter to convert categorical data into numeric values.
REFERENCES
[1] Jiawei Han, Jian Pei, Micheline Kamber, "Data Mining: Concepts and Techniques", Elsevier, 3rd edition, 2011, 744 p.
[2] Charu C. Aggarwal, "Data Mining: The Textbook", Springer, 2015, 734 p.
[3] Guojun Gan, Chaoqun Ma, Jianhong Wu, "Data Clustering: Theory, Algorithms, and Applications", ASA-SIAM Series on Statistics and Applied Probability, 2007.
[4] Zhexue Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", Data Mining and Knowledge Discovery 2, pp. 283-304, 1998.
[5] Fuyuan Cao, Jiye Liang, Deyu Li, Liang Bai, Chuangyin Dang, "A dissimilarity measure for the k-modes clustering algorithm", Knowledge-Based Systems 26, Elsevier, pp. 120-127, 2012.
[6] Z. He, X. Xu, S. Deng, "Squeezer: an efficient algorithm for clustering categorical data", Journal of Computer Science and Technology 17 (5), pp. 611-624, 2002.
[7] Z. He, X. Xu, S. Deng, "Scalable algorithms for clustering large datasets with mixed type attributes", International Journal of Intelligent Systems 20 (10), pp. 1077-1089, 2005.
[8] Z. X. Huang, M. K. Ng, "A fuzzy k-modes algorithm for clustering categorical data", IEEE Transactions on Fuzzy Systems 7 (4), pp. 446-452, 1999.
[9] D. W. Kim, K. H. Lee, D. Lee, "Fuzzy clustering of categorical data using fuzzy centroids", Pattern Recognition Letters 25, pp. 1263-1271, 2004.
[10] M. K. Ng, M. J. Li, Z. X. Huang, Z. Y. He, "On the impact of dissimilarity measure in k-modes clustering algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (3), pp. 503-507, 2007.
[11] D. Gibson, J. Kleinberg, P. Raghavan, "Clustering categorical data: an approach based on dynamical systems", Proceedings of the 24th VLDB Conference, New York, 1998, pp. 311-322.
[12] S. Guha, R. Rastogi, K. Shim, "ROCK: a robust clustering algorithm for categorical attributes", Proceedings of the IEEE International Conference on Data Engineering, Sydney, Australia, 1999, pp. 512-521.
[13] M. K. Ng, M. J. Li, J. H. Huang, Z. He, "On the impact of dissimilarity measure in k-modes clustering algorithm", IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (3), pp. 503-507, 2007.
[14] A. Chaturvedi, P. E. Green, J. D. Carroll, "K-modes clustering", Journal of Classification, Vol. 18, No. 1, pp. 35-55, 2001.
[15] H. Ralambondrainy, "A conceptual version of the k-means algorithm", Pattern Recognition Letters 16, pp. 1147-1157, 1995.
[16] Semeh Ben Salem, Sami Naouali, "Reducing the multidimensionality of OLAP cubes with Genetic Algorithms and Multiple Correspondence Analysis", International Conference on Advanced Wireless, Information, and Communication Technologies (AWICT 2015), Tunisia.
[17] Semeh Ben Salem, Sami Naouali, "Towards Reducing the Multidimensionality of OLAP Cubes using the Evolutionary Algorithms and Factor Analysis Method", International Journal of Data Mining and Knowledge Management Process (IJDKP), 2016.
[18] Semeh Ben Salem, Sami Naouali, "Pattern Recognition Approach in Multidimensional Databases: Application to the Global Terrorism Database", International Journal of Advanced Computer Science and Applications (IJACSA), 7 (8), 2016.