

World Academy of Science, Engineering and Technology
International Journal of Computer, Electrical, Automation, Control and Information Engineering Vol:11, No:6, 2017

Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency

Semeh Ben Salem, Sami Naouali, Moetez Sallami

Semeh Ben Salem, Sami Naouali and Moetez Sallami are with the Virtual Reality and Information Technology laboratory (VRIT), Military Academy of Fandouk Jedid, Tunisia (e-mail: [email protected], [email protected], [email protected]).

Abstract—Clustering is a well known data mining technique used in pattern recognition and information retrieval. The initial dataset to be clustered can either contain categorical or numeric data, and each type of data has its own specific clustering algorithm. In this context, two algorithms are proposed: the k-means for clustering numeric datasets and the k-modes for categorical datasets. A main problem encountered in data mining applications is the clustering of categorical datasets, which are highly relevant in practice. One main way to achieve the clustering of categorical values is to transform the categorical attributes into numeric measures and directly apply the k-means algorithm instead of the k-modes. In this paper, it is proposed to experiment an approach based on this idea by transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previous method based on transforming the categorical datasets into binary values. The scalability and accuracy of the two methods are experimented. The obtained results show that our proposed method outperforms the binary method in all cases.

Keywords—Clustering, k-means, categorical datasets, pattern recognition, unsupervised learning, knowledge discovery.

I. INTRODUCTION

THE considerable increase in the manufacturing of information technology devices and the advances in scientific data collection methods lead to the creation of growing data repositories. Besides, traditional exploratory methods have shown their inefficiency in dealing with such data quantities to discover new findings. Thus, recently developed knowledge-discovery systems should implement innovative and appropriate machine learning algorithms to explore these huge structures and to identify initially hidden patterns [1], [2].

In data mining, clustering [3] is the most commonly encountered knowledge-discovery technique applied in information retrieval and pattern recognition. It refers to unsupervised learning aiming to partition a dataset composed of N individuals embedded in a d-dimensional space into K distinct clusters without any prior knowledge about the distribution of the resulting clusters. The resulting data points in the same cluster are more similar to each other than to data points in other clusters. Three sub-problems are addressed by this process: (i) the similarity measure (distance) used to compare the data points, (ii) the iterative process of the designed algorithm to discover the clusters in an unsupervised way and guarantee efficiency, and (iii) deriving a significant description for each obtained cluster to extract the corresponding properties and knowledge.

k-means is a well known clustering algorithm proposed for numeric datasets (containing numeric values), which makes it not adapted for clustering categorical datasets. This fact is a great restriction and limits the performance of this algorithm since, in many data mining applications, most considered datasets may contain categorical values. To deal with categorical datasets, the k-means was extended to obtain the k-modes algorithm, which will be detailed in the next section. However, one other interesting issue is to convert the categorical data into numeric values and directly apply the k-means algorithm, which is also interesting to explore.

This paper is organized as follows: in the second section, we present previous approaches towards clustering categorical data with their limits and provide a detailed description of the k-means that will be adopted in this study. In the third section, our proposed approach is detailed. Experimental results and discussion are provided in the fourth section, and the last section is devoted to the conclusion and perspectives.

II. LITERATURE REVIEW IN CLUSTERING CATEGORICAL DATASETS

A. Categorical Clustering Algorithms

Although several proposals were made in the context of clustering categorical datasets, the most popular developed algorithm is the k-modes [4] and its variants [5]-[7]. It is an extension of the k-means algorithm where the Euclidean distance is replaced by the simple matching dissimilarity function, more suitable for categorical values, and the means by the modes, to identify the most representative element in a cluster (centroid). Besides, the modes are based on a frequency based method used in each iteration to update the centroids. The k-prototype algorithm [4] permits clustering mixed datasets with categorical and numeric values. Numerous variants were also proposed: the fuzzy k-modes algorithm [8] and the fuzzy k-modes algorithm with fuzzy centroids [9]. However, the main limitation when using the simple matching dissimilarity distance is that it does not provide efficient results, since the simple matching often results in clusters with weak intra-similarity [10].

In [11], the authors showed that the similarity between two categorical values can also be defined as their co-occurrence according to a common value or a set of values, which represents the second family of techniques for clustering categorical data, considering the co-occurrence of the attributes. The most popular algorithm that falls into this category is ROCK [12]. It measures the similarity between the categorical


patterns using the concept of links, i.e. the similarity between any two categorical patterns depends on the number of their common neighbors. Thus, the aim of this algorithm is to merge into one group the patterns that have a relatively large number of links.

The notion of relative frequency was used in [13] to define a new dissimilarity coefficient for the k-modes algorithm, in which the frequency of the categorical values in the current cluster is considered to calculate the dissimilarity between a data point and a cluster mode, since the simple matching distance metric is not a good measure as it results in poor intra-cluster similarity.

Although the k-modes based algorithms have shown their efficiency in clustering large categorical datasets, like the k-means type algorithms they still have two major limitations: (1) they cannot cover the global information effectively, i.e. the provided solutions are only locally optimal and a global solution is not easy to find [14]; (2) the accuracy of the obtained results is sensitive to the number and shape of the initial centroids. Besides, the modes are more difficult to move in iterative optimization processes because the attribute values of categorical data are not continuous. The mode represents the most frequent modality in the considered attribute, which means that if two modalities have close frequencies, only one will be retained and the other one will be dismissed, which results in information loss.

In this paper, two approaches are discussed and experimented for clustering categorical datasets using the k-means algorithm: in the first method, we use a binary data representation to convert the initial categorical dataset into numeric values; in the second method, the relative frequency of the modalities in the attributes is used to execute the transformation.

B. The k-Means Algorithm

The k-means algorithm is a widely used clustering technique where an initial training set S_N = {x^(i), i = 1,...,N}, composed of N elements x^(i) ∈ R^d described by d attributes, is divided into K clusters C_1, C_2, ..., C_K. The clustering process is based on measuring the distance between the initially randomly selected centroids z_j and the observations. The algorithm is described as follows:

1. Initialize the centroids z_1, ..., z_K;
2. Repeat until there is no further change in the cost function:
   a. ∀ j = 1,...,K: C_j = {x^(i) : z_j is the closest centroid to x^(i)};
   b. ∀ j = 1,...,K: z_j = (1/|C_j|) Σ_{x ∈ C_j} x (cluster mean).

The k-means aims to minimize a criterion known as the within-cluster sum of squares. This function is defined as follows:

J = Σ_{j=1}^{K} Σ_{x ∈ C_j} ||x − z_j||²

The resulting clusters are described by the mean of the samples in the cluster, called “centroids”, which may not be points from the dataset, although they live in the same space.
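To make the procedure concrete, the following sketch (our own illustration in Java; the class and variable names are ours, not from the paper) implements the two steps above on numeric vectors, using randomly chosen objects as initial centroids and stopping when the assignments no longer change:

    import java.util.Arrays;
    import java.util.Random;

    // Minimal k-means sketch: pick K random objects as initial centroids, assign
    // each observation to its closest centroid (Euclidean distance), update each
    // centroid as the mean of its cluster, and stop when assignments settle.
    public class KMeansSketch {

        static double squaredDistance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return sum;
        }

        // Returns the cluster index assigned to each observation.
        static int[] cluster(double[][] data, int k, long seed, int maxIterations) {
            int n = data.length, d = data[0].length;
            Random random = new Random(seed);
            double[][] centroids = new double[k][];
            for (int j = 0; j < k; j++) {
                centroids[j] = data[random.nextInt(n)].clone();   // initial centroids are objects
            }
            int[] assignment = new int[n];
            Arrays.fill(assignment, -1);
            for (int it = 0; it < maxIterations; it++) {
                boolean changed = false;
                // Step (a): assign every observation to the closest centroid.
                for (int i = 0; i < n; i++) {
                    int best = 0;
                    for (int j = 1; j < k; j++) {
                        if (squaredDistance(data[i], centroids[j]) < squaredDistance(data[i], centroids[best])) {
                            best = j;
                        }
                    }
                    if (assignment[i] != best) {
                        assignment[i] = best;
                        changed = true;
                    }
                }
                if (!changed) break;   // the within-cluster sum of squares no longer decreases
                // Step (b): recompute every centroid as the mean of its cluster.
                double[][] sums = new double[k][d];
                int[] counts = new int[k];
                for (int i = 0; i < n; i++) {
                    counts[assignment[i]]++;
                    for (int a = 0; a < d; a++) sums[assignment[i]][a] += data[i][a];
                }
                for (int j = 0; j < k; j++) {
                    if (counts[j] > 0) {
                        for (int a = 0; a < d; a++) centroids[j][a] = sums[j][a] / counts[j];
                    }
                }
            }
            return assignment;
        }

        public static void main(String[] args) {
            double[][] toy = { {0.10, 0.20}, {0.15, 0.22}, {0.90, 0.80}, {0.88, 0.85} };
            // Prints the cluster index of each toy point, e.g. [0, 0, 1, 1].
            System.out.println(Arrays.toString(cluster(toy, 2, 42L, 100)));
        }
    }

The random choice of initial objects as centroids mirrors the initialization step used later in the proposed algorithm of Section III.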
The k-means clustering algorithm is well known for its efficiency in clustering large datasets, yet few previous proposals aimed to use its original version in clustering categorical data. On the other hand, some attempts were made to cluster categorical datasets using hierarchical algorithms, but these do not present an interesting option due to their quadratic time complexity, which hinders their usage.

The main motivation behind this approach is to take benefit from the k-means algorithm in terms of complexity: it is well known for its low computational cost O(KNTd), which is linear in the number of clusters K, the number of observations N, the number of iterations T and the number of attributes d. In [15], the author proposed an approach to using the k-means algorithm to cluster categorical data. The approach is based on converting multiple category attributes into binary values, using either 1 or 0 to represent whether the category is absent or present, and considering the binary attributes as numeric data. However, once used in data mining applications, this approach needs to handle a huge number of parameters and an increasing number of attributes corresponding to a huge number of modalities. This fact increases both the computational and space costs of the k-means algorithm. Besides, according to the algorithm's process, the computed cluster means representing the centroids will be contained between 0 and 1, which does not indicate the real characteristics of the clusters.

III. PROPOSED APPROACH FOR CLUSTERING CATEGORICAL DATASETS

In this paper, it is proposed to experiment a method to cluster qualitative data using the original version of the k-means algorithm. The considered dataset is assumed to be stored in a table, where each row (tuple) represents an observation described by the attributes arranged in columns. Objects encountered in many data mining applications are often described by categorical information systems.

Definition 1. Formally, a categorical information system is represented by the quadruple CIS = (U, A, V, f), where:
 - U is a non empty set of objects (universe);
 - A is a non empty set of attributes;
 - V is a finite unordered set representing the union of all the attributes' domains;
 - f is a mapping information function.

Although the initial version of the k-means is not adapted for categorical data, which represents its main limitation, in this paper we propose a new efficient approach to cluster categorical datasets based on k-means. To make it possible, our proposed solution consists of transforming the initial dataset into numeric values by considering the relative frequency of the modalities in each attribute.

Definition 2. The relative frequency f_{k,j} of the kth category C_{k,j} in attribute A_j is its number of occurrences divided by the number of observations N in the dataset, and is defined as follows:

f_{k,j} = n_{k,j} / N

where n_{k,j} is the number of occurrences of the category C_{k,j}.
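As an illustration of Definition 2, the sketch below (our own, in Java, with hypothetical names) computes the relative frequency of every modality and replaces each categorical value by the frequency of its modality within its attribute:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the relative frequency transformation (Definition 2): every
    // categorical value is replaced by n_{k,j} / N, the relative frequency of its
    // modality within its attribute.
    public class RelativeFrequencyEncoder {

        static double[][] encode(String[][] table) {
            int n = table.length;        // number of observations N
            int d = table[0].length;     // number of attributes
            // Count the occurrences of each modality, attribute by attribute.
            List<Map<String, Integer>> counts = new ArrayList<>();
            for (int j = 0; j < d; j++) {
                Map<String, Integer> attributeCounts = new HashMap<>();
                for (int i = 0; i < n; i++) {
                    attributeCounts.merge(table[i][j], 1, Integer::sum);
                }
                counts.add(attributeCounts);
            }
            // Replace every value by the relative frequency of its modality.
            double[][] numeric = new double[n][d];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < d; j++) {
                    numeric[i][j] = counts.get(j).get(table[i][j]) / (double) n;
                }
            }
            return numeric;
        }

        public static void main(String[] args) {
            // The four observations of the worked example below (Sex, Work, Criminal Records).
            String[][] example = {
                {"M", "S", "Y"},
                {"M", "E", "Y"},
                {"F", "J", "N"},
                {"M", "E", "Y"}
            };
            for (double[] row : encode(example)) {
                System.out.println(Arrays.toString(row));   // first row: [0.75, 0.25, 0.75]
            }
        }
    }

Running it on the four observations of the worked example of the next section reproduces the frequencies of Table II.B (e.g. 0.75 for Male) and the encoded rows of Table II.C.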


This proposed approach will be compared to Ralambondrainy's method [15]. The corresponding clustering algorithm proposed in this context is described as follows:

Inputs:
  X = {x_1, x_2, ..., x_N}, a set of N individuals;
  K (≪ N), the number of desired clusters;
  d: X × X → R+, the Euclidean distance;
Outputs:
  a set of K clusters {C_1, C_2, ..., C_K}

Data transformation (qualitative → quantitative)
STEP 1: FOR each observation obs_i from X DO
          FOR each attribute A_j DO
            Compute the number of occurrences n_{k,j} of the kth category C_{k,j} in A_j
            f_{k,j} = n_{k,j} / N

STEP 2: Randomly select K initial centroids (objects) z_1, z_2, ..., z_K from X for the clusters;
        WHILE the centroids (z_1, ..., z_K) keep changing DO
          FOR each cluster C_j DO
            FOR each individual x_i ∈ X DO
              Compute d(x_i, z_j)
            Assign each x_i to the nearest centroid: argmin_j d(x_i, z_j)
          Re-compute each cluster centroid using the mean: z_j = (1/|C_j|) Σ_{x ∈ C_j} x
The following example gives an idea of how to implement the two methods.

TABLE I
EXAMPLE OF THE INITIAL CONSIDERED CATEGORICAL DATASET
Obsi  Sex (M/F)  Work  Criminal Records
Obs1  M          S     Y
Obs2  M          E     Y
Obs3  F          J     N
Obs4  M          E     Y
*S: Student, E: Employee, J: Jobless; **Y: Yes, N: No

Table I provides an example of a categorical dataset containing four observations described by three categorical attributes. The first attribute (Sex) has two modalities (Male/Female), the second attribute (Work) has three modalities (Student, Employee, Jobless), and the third attribute (Criminal Records) has two modalities (Yes/No). When considering the first transformation method, which produces a binary dataset, the obtained result is given in Table II.A.

TABLE II.A
BINARY TRANSFORMATION OF THE INITIAL DATASET
      Sex           Work                        Criminal Records
Obsi  Male  Female  Student  Employee  Jobless  Yes  No
Obs1  1     0       1        0         0        1    0
Obs2  1     0       0        1         0        1    0
Obs3  0     1       0        0         1        0    1
Obs4  1     0       0        1         0        1    0
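For comparison with the proposed encoding, the following sketch (our own illustration in Java, with hypothetical names; the paper provides no code) produces the binary presence/absence representation of Table II.A in the spirit of the transformation used in [15]:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashSet;
    import java.util.List;

    // Sketch of the binary transformation: every modality of every attribute
    // becomes one 0/1 column indicating whether the modality is present.
    public class BinaryEncoder {

        static double[][] encode(String[][] table) {
            int n = table.length, d = table[0].length;
            // Collect the modalities of each attribute, keeping first-seen order.
            List<List<String>> modalities = new ArrayList<>();
            for (int j = 0; j < d; j++) {
                LinkedHashSet<String> seen = new LinkedHashSet<>();
                for (int i = 0; i < n; i++) seen.add(table[i][j]);
                modalities.add(new ArrayList<>(seen));
            }
            int width = modalities.stream().mapToInt(List::size).sum();
            double[][] binary = new double[n][width];
            for (int i = 0; i < n; i++) {
                int col = 0;
                for (int j = 0; j < d; j++) {
                    List<String> mods = modalities.get(j);
                    binary[i][col + mods.indexOf(table[i][j])] = 1.0;   // mark the present modality
                    col += mods.size();
                }
            }
            return binary;
        }

        public static void main(String[] args) {
            String[][] tableI = { {"M","S","Y"}, {"M","E","Y"}, {"F","J","N"}, {"M","E","Y"} };
            for (double[] row : encode(tableI)) System.out.println(Arrays.toString(row));
            // Obs1 becomes [1, 0, 1, 0, 0, 1, 0], matching the first row of Table II.A.
        }
    }

Note how the number of columns grows with the total number of modalities (seven here for three attributes), which is precisely the dimensionality issue discussed in Section II.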
The relative frequency of each modality in the example is provided in Table II.B.

TABLE II.B
RELATIVE FREQUENCY OF EACH MODALITY IN EACH ATTRIBUTE
Modality   Male  Female  Student  Employee  Jobless  Yes   No
Frequency  0.75  0.25    0.25     0.5       0.25     0.75  0.25

The result obtained when considering the proposed approach is given in Table II.C.

TABLE II.C
NUMERIC TRANSFORMATION OF THE INITIAL DATASET CONSIDERING THE RELATIVE FREQUENCY
Obsi  Sex (M/F)  Work  Criminal Records
Obs1  0.75       0.25  0.75
Obs2  0.75       0.5   0.75
Obs3  0.25       0.25  0.25
Obs4  0.75       0.5   0.75

IV. EXPERIMENTAL ANALYSIS RESULTS

In this section, the experimental environment and the initial dataset are described. The efficiency is evaluated using the accuracy. Besides, the contribution on scalability is also tested, considering different values of the number of clusters K in a first step, and then 50 runs for four different values of K with different initial centroids.

The complexity of the k-means algorithm depends on the number of iterations T, attributes (dimensions) d, observations N and clusters K. In the experiments, N and K are equal for the two methods; however, the experimental results show that T for the binary method is higher than for the frequency based method. Besides, the resulting datasets to be experimented have different numbers of dimensions: for the binary transformation, this parameter is higher than for the frequency based method. These facts show that our new proposed technique permits reducing the complexity of the k-means once executed. Some proposals were made to reduce the dimensionality [16], [17] and can be considered if it is proposed to experiment the issue of reducing the dimensions of the resulting binary transformation.

A. Experimental Environment and Evaluation Criterion

The algorithm was coded in JAVA and experimented on an Intel Core i3 2.1 GHz machine with 4 GB of RAM running the Windows 7 operating system. To evaluate the efficiency of the k-means in clustering categorical datasets using the relative frequency transformation of the attributes, the accuracy is considered as the evaluation criterion: the higher this metric, the better the obtained clusters. The accuracy is defined as follows:

Definition 3. The accuracy AC of a clustering is an external evaluation criterion that permits comparing the effectiveness of two clusterings, as follows:

AC = (1/N) Σ_{i=1}^{K} a_i

where K is the number of predefined classes and a_i is the number of correctly assigned objects in class i.
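A possible implementation of this criterion (our own reading of Definition 3, in Java, with hypothetical names): each cluster is credited with the objects of its dominating predefined class, and AC is the fraction of such objects over N.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the clustering accuracy AC = (1/N) * sum_i a_i, where a_i is the
    // number of objects of cluster i that belong to the class dominating that
    // cluster (one common way of counting "correctly assigned" objects).
    public class ClusteringAccuracy {

        static double accuracy(int[] clusterOf, String[] classOf) {
            int n = clusterOf.length;
            // Count how many objects of each class fall into each cluster.
            Map<Integer, Map<String, Integer>> contingency = new HashMap<>();
            for (int i = 0; i < n; i++) {
                contingency.computeIfAbsent(clusterOf[i], c -> new HashMap<>())
                           .merge(classOf[i], 1, Integer::sum);
            }
            // a_i: size of the dominating class inside cluster i.
            int correctlyAssigned = 0;
            for (Map<String, Integer> classCounts : contingency.values()) {
                correctlyAssigned += classCounts.values().stream()
                        .mapToInt(Integer::intValue).max().orElse(0);
            }
            return correctlyAssigned / (double) n;
        }

        public static void main(String[] args) {
            int[] clusters = {0, 0, 1, 1, 1};
            String[] labels = {"bombing", "bombing", "assault", "assault", "bombing"};
            System.out.println(accuracy(clusters, labels));   // 0.8: four of five objects match their cluster's dominant class
        }
    }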


In the experiments, the dataset contains a list of 50 terrorist attacks that occurred in several countries worldwide, extracted from the Global Terrorism Database (GTD) [18]. Each record is described by eleven qualitative attributes: year, month, day, country, region, city, type, target, target nationality, group, and mean. Although the first three attributes are numeric, they were considered as qualitative measures since they have fixed ranges and would provide more significance. Two datasets are then generated: the first one contains binary values (58 KB) and the other contains numeric values computed using the relative frequency of the modalities (26 KB).

B. Evaluation on Scalability

In this subsection, the scalability of the k-means applied to the two datasets is evaluated. This process is based on estimating two factors: the required execution time (run time) and the number of iterations necessary for convergence. Besides, since the final results of the clustering depend on the initial centroids, and to avoid the influence of their casual selection, we performed additional experiments to better assess our proposed approach when fixing the number of clusters: for each experiment, the number of clusters K is fixed (K = 3, 4, 5, 6) and the initial centroids are modified to run the algorithm 50 times. The average over the 50 runs is therefore also provided to better illustrate the contribution on improving the scalability and effectiveness of our proposed technique.
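This protocol can be sketched as a small timing harness (ours; the clustering call below is only a placeholder standing in for the k-means of Section II): for a fixed K it repeats the run 50 times with different random initial centroids and reports the average, minimum and maximum run time and iteration count, as summarized later in Tables III.A-C.

    import java.util.Random;

    // Sketch of the 50-run scalability protocol: for a fixed K, repeat the
    // clustering with different random initial centroids and summarize the run
    // time and the number of iterations needed to converge.
    public class ScalabilityHarness {

        // Stand-in for the clustering routine; a real experiment would plug in the
        // k-means sketch of Section II and return its iteration count.
        interface ClusteringRun {
            int run(double[][] data, int numClusters, long seed);
        }

        static void evaluate(double[][] data, int numClusters, int runs, ClusteringRun clustering) {
            double timeSum = 0, timeMin = Double.MAX_VALUE, timeMax = 0;
            double iterSum = 0, iterMin = Double.MAX_VALUE, iterMax = 0;
            for (int r = 0; r < runs; r++) {
                long start = System.nanoTime();
                int iterations = clustering.run(data, numClusters, r);   // new seed => new initial centroids
                double millis = (System.nanoTime() - start) / 1e6;
                timeSum += millis; timeMin = Math.min(timeMin, millis); timeMax = Math.max(timeMax, millis);
                iterSum += iterations; iterMin = Math.min(iterMin, iterations); iterMax = Math.max(iterMax, iterations);
            }
            System.out.printf("K=%d  run time avg/min/max = %.2f/%.2f/%.2f ms  iterations avg/min/max = %.2f/%.0f/%.0f%n",
                    numClusters, timeSum / runs, timeMin, timeMax, iterSum / runs, iterMin, iterMax);
        }

        public static void main(String[] args) {
            // Toy numeric matrix standing in for the frequency-encoded dataset.
            Random random = new Random(1);
            double[][] data = new double[200][11];
            for (double[] row : data) {
                for (int j = 0; j < row.length; j++) row[j] = random.nextDouble();
            }
            // Placeholder clustering call; substitute the real k-means to reproduce the protocol.
            ClusteringRun placeholder = (matrix, clusters, seed) -> 5 + new Random(seed).nextInt(10);
            for (int k = 3; k <= 6; k++) {
                evaluate(data, k, 50, placeholder);
            }
        }
    }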

Fig. 1 Execution Time comparison using the two methods for the considered datasets over the number of clusters

Fig. 2 Number of iterations comparison using the two methods for the considered datasets over the number of clusters

The two previous figures represent the scalability of the k-means clustering algorithm when considering the two different initial datasets obtained according to our approach. The scalability is defined by two parameters: the run time and the number of iterations required by the algorithm to converge. According to these two factors, the run time required by the relative frequency transformation is lower than the run time required by the binary method. This fact highlights the convenience of the proposed approach and the value of our contribution. The difference in execution time is very significant and makes our proposed approach more adapted for data mining applications when dealing with huge datasets.

In the previous experiments, the scalability of the algorithm and its performance in clustering a categorical dataset transformed into numeric values were experimented according to different values of K. To experiment it further, it is proposed to test the scalability for four values of K with 50 runs and compute the average, the minimum and the maximum values of the run time and the number of iterations. The obtained results are summarized in Tables III.A-C.

TABLE III.A
AVERAGE OF THE RUN TIME AND NUMBER OF ITERATIONS REQUIRED BY THE TWO APPROACHES FOR VARIOUS VALUES OF K
     Binary dataset         Dataset with relative frequency
K    Run time  Iterations   Run time  Iterations
3    860.6     16.6         369.13    7.16
4    1228.12   25.4         433.28    10.22
5    1792      28.68        511.08    11.64
6    2022.98   37.78        551.9     13.66


TABLE III.B
MINIMUM VALUES OF THE RUN TIME AND NUMBER OF ITERATIONS COMPUTED FOR THE TWO APPROACHES
     Binary dataset         Dataset with relative frequency
K    Run time  Iterations   Run time  Iterations
3    608       10           312       6
4    780       8            312       6
5    811       11           331       7
6    952       14           390       8

TABLE III.C
MAXIMUM VALUES OF THE RUN TIME AND NUMBER OF ITERATIONS COMPUTED FOR THE TWO APPROACHES
     Binary dataset         Dataset with relative frequency
K    Run time  Iterations   Run time  Iterations
3    1545      29           468       9
4    3261      77           1435      41
5    2886      54           1108      30
6    4664      79           1170      29

According to the previous results, it is obvious that the proposed approach, consisting of transforming the categorical data into numeric values using the relative frequency of each modality in the attributes, is more scalable than Ralambondrainy's technique: the average execution time and number of iterations calculated over 50 runs of the algorithm on the relative frequency dataset are lower than on the binary dataset. This fact highlights the importance of our proposed approach and its adaptability to huge datasets.

C. Evaluation on Clustering Efficiency

In this subsection, the clustering efficiency is experimented using the accuracy presented in Section IV. Good clustering corresponds to higher values of the accuracy, which represents the proportion of well clustered elements in their corresponding classes. In the first step of the experiments, the accuracy is computed for different values of the number of clusters K (2→20). The obtained results are shown in Fig. 3.

Fig. 3 Accuracy computed for various values of K

According to the previous results, it is obvious that the accuracy computed using the proposed approach is superior in most cases to the accuracy computed when considering Ralambondrainy's approach [15]. The proposed approach is not only effective in terms of the run time and the number of iterations, but the efficiency is also enhanced with the new proposal. The evaluation on clustering efficiency can be considered as more important than the scalability, since it permits characterizing and identifying more pertinent profiles, which is the main aim of the clustering process.

As done for the scalability experiments, it is proposed to consider the accuracy computed over 50 runs of the algorithm for the two datasets. The same previous values of K (3, 4, 5, 6) are considered. In Table IV, the average of the accuracy values computed in each case is provided.

TABLE IV
ACCURACY COMPUTED FOR 50 RUNS OF THE K-MEANS FOR THE TWO METHODS
K                   3      4      5      6
Ralambondrainy      0.5    0.57   0.51   0.65
Proposed approach   0.675  0.748  0.686  0.712

The provided results confirm again that the proposed approach is more effective in clustering categorical data when the relative frequency of the modalities in the attributes is used to transform the categorical data into numeric values. The obtained accuracies are higher for the proposed approach than the results provided by Ralambondrainy's technique.

V. CONCLUSION

Clustering categorical data is a heavy and complex task, and specific clustering algorithms should be designed for it. In this paper, the relative frequency of each modality in its attribute is used to transform the categorical measures into numeric values. The k-means algorithm is then applied to the resulting dataset. The experimental results show that our proposed approach permitted enhancing three parameters: (i) the scalability, i.e. the run time and number of iterations; (ii) the efficiency, experimented using the accuracy; and (iii) the complexity, due to the reduction of the number of iterations and dimensions of the original dataset. These findings show the considerable contribution resulting from the use of the relative frequency. This criterion is considered as the most appropriate statistical parameter to convert categorical into


numeric measures. However, additional experiments should be conducted to evaluate the effectiveness of our proposed approach: in our future work, we propose to compare the approach experimented in this paper with other, more advanced techniques proposed for clustering categorical datasets.

REFERENCES
[1] Jiawei Han, Jian Pei, Micheline Kamber, “Data Mining: Concepts and
Techniques”, Elsevier, 3rd edition, 2011, 744 p.
[2] Charu C. Aggarwal, “Data Mining: the textbook”, Springer 2015, 734
pages.
[3] Guojun Gan, Chaoqun Ma, Jianhong Wu, “Data Clustering: Theory,
Algorithms, and Applications”, ASA-SIAM Series on Statistics and
Applied Probability, 2007.
[4] Zhexue Huang, “Extension to the k-means algorithm for clustering large
data sets with categorical values.” Data Mining and Knowledge
Discovery 2, 283-304 (1998).
[5] Fuyuan Cao, Jiye Liang, Deyu Li, Liang Bai, Chuangyin Dang, “A
dissimilarity measure for the k-modes clustering algorithm”, Knowledge
Based Systems 26 (2012), Elsevier, pp 120-127.
[6] Z. He, X. Xu, S. Deng, ”Squeezer: an efficient algorithm for clustering
categorical data” Journal of Computational Science and Technology 17
(5) (2002) 611-624.
[7] Z. He, X. Xu, S. Deng, “Scalable algorithms for clustering large datasets
with mixed type attributes”, International Journal of Intelligent Systems
20 (10) (2005) 1077-1089.
[8] Z. X. Huang, M. K. Ng, “A fuzzy k-modes algorithm for clustering
categorical data”, IEEE transactions on Fuzzy systems 7(4) (1999) 446-
452.
[9] D. W Kim, K. H Lee, D. Lee, “Fuzzy clustering of categorical data using
fuzzy centroids”, Pattern recognition letters 25 (2004) 1263-1271.
[10] M. K. Ng, M. J. Li, Z. X. Huang, Z. Y. He, “On the impact of dissimilarity
measure in k-modes clustering algorithm.” IEEE transactions on Pattern
Analysis and Machine Intelligence 29 (3) (2007) 503-507.
[11] D. Gibson, J. Kleinberg, P. Raghavan, “Clustering categorical data: an
approach based on dynamical systems”, Proceedings of the 24th VLDB
Conference, New York, 1998, pp 311-322.
[12] S. Guha, R. Rastogi, K. Shim, “ROCK: a robust clustering algorithm for
categorical attributes”, Proceedings of the IEEE International Conference
on Data Engineering, Sydney, Australia 1999, pp 512-521.
[13] Ng M. K., Li M. J, Huang J. H, He Z, “On the impact of dissimilarity
measure in k-modes clustering algorithm.” IEEE transactions on Pattern
Analysis and Machine Intelligence 29 (3); 503-507, 2007.
[14] A. Chaturvedi, Paul E. Green and J. D. Carroll, “K-modes clustering,”
Journal of classification, Vol.18, No 1, pp 35-55, 2001.
[15] Ralambondrainy, H, “A conceptual version of the k-means algorithm.”
Pattern recognition Letters 16, 1147-1157, 1995.
[16] Semeh Ben Salem, Sami Naouali, “Reducing the multidimensionality of
OLAP cubes with Genetic Algorithms and Multiple Correspondence
Analysis”, international conference on Advanced Wireless, Information,
and Communication Technologies (AWICT 2015), Tunisia.
[17] Semeh Ben Salem, Sami Naouali, “Towards Reducing the
multidimensionality of OLAP cubes using the Evolutionary Algorithms
and Factor Analysis Method”, International Journal of Data Mining and
Knowledge Management Process (IJDKM 2016).
[18] Semeh Ben Salem and Sami Naouali, “Pattern Recognition Approach in
Multidimensional Databases: Application to the Global Terrorism
Database” International Journal of Advanced Computer Science and
Applications (IJACSA), 7(8), 2016.

