A New Feature Subset Selection Using Bottom-Up Clustering
https://doi.org/10.1007/s10044-016-0565-8
THEORETICAL ADVANCES
Received: 6 December 2015 / Accepted: 10 June 2016 / Published online: 18 June 2016
© Springer-Verlag London 2016
method tries to divide data into clusters according to data similarity, such that the between-cluster similarity is minimized and the within-cluster similarity is maximized [10]. K-means [11] and hierarchical algorithms [12] are popular clustering methods. Hierarchical clustering is performed in a top-down (divisive) or bottom-up (agglomerative) manner [12]. In hierarchical clustering, a tree structure called a dendrogram is constructed [13]. The root of the tree is a cluster containing all data, and the leaves are clusters of single data items. By cutting the tree at a given level, the clusters are obtained [14]. Many papers have tried to solve the feature selection problem by clustering [15–18]. In these methods, similar features are grouped into clusters according to their similarity or distance, and then representative features are selected from each cluster.

In [16], an unsupervised feature selection based on clustering and the k-nearest neighbor (kNN) [19] algorithm was proposed. This method reaches a feature subspace by selecting the features with minimum distance and then removing their k neighbors until all features are either selected or removed. Removing the neighbors, however, does not seem reasonable; gathering the neighbors into a group and then analyzing the interactions among them could achieve better results. As a supervised method, a feature selection based on feature clustering was proposed in [17]. This algorithm uses an agglomerative clustering approach based on Ward's linkage method [20] and a new combination of conditional mutual information as the distance measure between features. In this method, the agglomerative clustering algorithm is run until k + 1 clusters are obtained. The cluster with the lowest mutual information is eliminated in order to remove the irrelevant features. The representative feature of each cluster is the feature which has maximum mutual information with the target. Though this algorithm tries to reduce redundancy, it does not determine the number of required clusters. Instead, it repeats the process with different values of k and uses the average accuracy to find the optimal number of clusters.

In [18], a feature selection algorithm was presented that works in two steps. In the first step, irrelevant and redundant features are eliminated and the remaining features are divided into some clusters. For this purpose, a graph-theoretic clustering method is used to create the minimum spanning tree (MST) of a weighted complete graph and then partition the MST into a forest, each of whose trees represents a cluster [18]. In the second step, the representative feature of each cluster is chosen as the feature having maximum mutual information with the target. As a semi-supervised approach, a feature selection based on divisive clustering and the k-means algorithm was also proposed [15]. This algorithm initializes clusters by likelihood estimation and then iteratively finds the optimal clusters until their objective function does not change. The k-means algorithm iteratively bisects the feature space and tries to form clusters which maximize the information gain.

In this paper, an extension of agglomerative hierarchical clustering in feature space is presented. In this clustering-based feature subset selection (CFSS) method, the dissimilarity measure of two clusters is the distance between two representative members of those clusters, instead of their nearest members (e.g., single linkage) or farthest members (e.g., complete linkage) [21]. The representative feature of each cluster is the feature with the maximum mutual information against the other features in that cluster. CFSS works as a filter method, but among the similar features within a cluster. That is, the selected feature of each cluster represents that cluster when measuring its distance to other clusters. As an advantage of hierarchical clustering, it does not need the number of clusters to be determined. In CFSS, the clustering process is repeated until all features are distributed into clusters with at least two features. However, to spread the features over a reasonable number of clusters, we have used the method of GACH (a grid-based algorithm for hierarchical clustering of high-dimensional data) [22] to obtain a suitable level for cutting the clustering tree.

The rest of this paper is organized as follows. In Sect. 2, our proposed CFSS method is explained. In Sect. 3, the experimental results are presented. Section 4 concludes the paper.

2 Proposed method

In feature selection, finding the nature of the features and selecting a few features among similar ones is an important issue. In this regard, clustering methods can solve this problem, at the cost of more computation than filter approaches. The reason for the greater simplicity and lower complexity of filter methods is that they evaluate features individually and do not capture the structure of the features [23]. By combining clustering and filter methods, we can gain their advantages while discarding their deficiencies. Our proposed CFSS uses a clustering method to discover the features' structure and applies a filter method to rank the features in each cluster. In this manner, the individual feature evaluation of the filter method is bypassed in the clustering phase and is needed only for selecting the representative features. Selecting the best feature among the similar ones in a cluster leads to redundancy reduction. To attain this goal, an effective approach for selecting the representative feature of each cluster is also proposed.

An evaluation criterion for feature similarity is mutual information (MI). This is an efficient strategy to measure the relevancy between two features or the dependency between a feature and its target.
The MI between features fi and fj is defined, in terms of entropy, as [24]:

I(f_i; f_j) = \sum_{f_i} \sum_{f_j} p(f_i, f_j) \log \frac{p(f_i, f_j)}{p(f_i) p(f_j)}    (1)

where p(f) is the probability distribution of feature f and p(fi, fj) is the joint probability distribution of features fi and fj. According to (1), a higher value of MI between two features fi and fj means that the similarity (correlation) between them is greater (i.e., the distance between them is smaller), as illustrated in Fig. 1, where H(f) denotes the entropy of feature f.

Fig. 1 Mutual information between two features fi and fj
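For illustration, the MI of Eq. (1) can be estimated from two feature vectors by simple histogram binning, as in the Python sketch below (the binning scheme, the function name and the use of NumPy are our own assumptions, not part of the original method):

import numpy as np

def mutual_information(fi, fj, bins=10):
    """Estimate I(fi; fj) of Eq. (1) by discretizing both features into a 2-D histogram."""
    joint, _, _ = np.histogram2d(fi, fj, bins=bins)   # joint counts of (fi, fj)
    p_ij = joint / joint.sum()                        # joint distribution p(fi, fj)
    p_i = p_ij.sum(axis=1, keepdims=True)             # marginal p(fi)
    p_j = p_ij.sum(axis=0, keepdims=True)             # marginal p(fj)
    nz = p_ij > 0                                     # skip empty cells (0 log 0 = 0)
    return float((p_ij[nz] * np.log(p_ij[nz] / (p_i * p_j)[nz])).sum())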
CFSS tries to discover the structure of the data by hierarchically collecting similar features into the same groups. At first, each feature is assumed to be an individual cluster. These clusters (features) are placed at the leaves (lowest level) of the hierarchical clustering tree and are denoted L = (C1, ..., CN), as shown in Fig. 2a.

Using (1), the two nearest clusters Ci and Cj (those with the highest MI) in level 1 (indeed, features fi and fj) are selected and grouped into a new cluster Cl in level 2 (see Fig. 2b). Similarly, cluster Ck is constructed. When introducing the representative of these clusters, only one of their features is selected, because using highly similar features is not reasonable and might increase the redundancy.

In the next step of the algorithm, the two clusters being merged might have one or two features each. In this case, the similarity between two clusters, or between one cluster and one feature, should be measured. There are several methods for measuring the distance between clusters, such as single linkage [25] and complete linkage [26]. These methods are suitable for measuring the distance between clusters of data instances, where the difference between two data instances is computed along their features.

Consider a dataset with M data instances {x1, ..., xM} and N features {f1, ..., fN}, whose labels {t1, ..., tM} come from T classes. In order to find the distance between instances xk and xl, it is required to compute xk − xl = (xk1 − xl1, ..., xkN − xlN). For example, if fi is weight and fj is height, the difference xki − xli is meaningful, since both xki and xli represent weights. Similarly, xkj − xlj has meaning, as both xkj and xlj are heights. However, in feature clustering the notion of distance does not seem reasonable: to measure the distance between features fi and fj, we would have to compute fi − fj = (x1i − x1j, ..., xMi − xMj), while the difference between a height and a weight is meaningless. Thus, in feature clustering, similarity is more suitable than distance, and using similarity-based criteria instead of single linkage and complete linkage is more reasonable.

The similarity method presented here measures the similarity between the representative features of clusters. The representative feature of each cluster is the feature that has maximum dependency on the target (class label). Since this feature carries the maximum information, it is the best candidate to represent the similar features in its cluster. Using this criterion, the similarity between the representative feature of cluster Cl and the other features (see Fig. 2b) is measured, and the maximum similarity determines the next feature to merge. In the new level 3 (Fig. 2c), cluster Cl is merged with another feature and a new cluster Cp is created. Continuing this process, all features are merged into clusters until no feature remains in a leaf (i.e., all clusters of level 1 have been merged). At this point, CFSS terminates and the constructed clusters with their members are reported.

One of the advantages of hierarchical clustering is that it does not need to know the number of clusters. Similarly, in this work the number of clusters is not set initially. Instead, the process terminates when all features are distributed into clusters with at least two features. After that, the representative features of these clusters should be found.

The number of features selected from each cluster differs among methods. In some methods, for redundancy reduction, only one feature is selected from each cluster.
Because of the high similarity between the features in a cluster, this single feature might be a sufficiently good representative. On the other hand, some authors believe that selecting only top-ranked features is not enough, since a combination of good and bad features might be more appropriate. In other methods, no feature is selected from some clusters at all. For example, in [17] the cluster with the lowest mutual information is eliminated in order to remove the irrelevant features.

For selecting the features in CFSS, the clusters are ranked according to their importance (i.e., a ranking order). In this regard, possibly more than one feature is selected from some clusters and fewer, or even no, features from others. In order to determine how many features are sufficient to represent a given dataset, the approach proposed in GACH [22] is used here. In GACH, as a hierarchical clustering method, the clusters are merged in a bottom-up manner until a terminating condition occurs. This condition helps GACH determine the optimal number of clusters. Though GACH clusters the data instances (not the features), its stopping scheme is customized here to end the merging process of the feature clusters. Accordingly, all features are first considered as individual clusters. Then, in a merging loop, the two nearest clusters Ci and Cj are selected and merged into cluster Cl according to this criterion:

P(C_i, C_j) = \frac{\sum_{f \in C_l} d(f, c_l) - \left\{ \sum_{f \in C_i} d(f, c_i) + \sum_{f \in C_j} d(f, c_j) \right\}}{\sum_{k=1}^{N} d(f_k, c) - \sum_{k=1}^{J} \sum_{f \in C_k} d(f, c_k)}    (2)

where c is the representative (centroid) of all N features, and ci and cj are the representatives of the feature clusters Ci and Cj, respectively. Also, cl is the centroid of the new cluster Cl, and J is the number of feature clusters. The function d(·,·) is any dissimilarity (distance) metric between a pair of features.
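To make the criterion concrete, the following sketch (an illustration with our own function and variable names; each cluster is assumed to be stored as a matrix whose rows are feature vectors) computes P(Ci, Cj) of Eq. (2) for a candidate merge:

import numpy as np

def merge_criterion(clusters, i, j, d):
    """P(Ci, Cj) of Eq. (2): extra scatter caused by merging clusters i and j, normalized by
    the total scatter of all N features minus the within-cluster scatter of the J clusters."""
    all_feats = np.vstack(clusters)                   # all N feature vectors
    c_global = all_feats.mean(axis=0)                 # centroid c of all N features
    merged = np.vstack([clusters[i], clusters[j]])    # candidate cluster Cl
    c_l = merged.mean(axis=0)                         # centroid cl of Cl

    def scatter(feats, centroid):
        return sum(d(f, centroid) for f in feats)

    numerator = scatter(merged, c_l) - (scatter(clusters[i], clusters[i].mean(axis=0))
                                        + scatter(clusters[j], clusters[j].mean(axis=0)))
    denominator = scatter(all_feats, c_global) - sum(scatter(C, C.mean(axis=0)) for C in clusters)
    return numerator / denominator

# e.g., with Euclidean dissimilarity:
# p = merge_criterion(clusters, 0, 1, lambda a, b: np.linalg.norm(a - b))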
In addition to the Euclidean distance, the distance proposed in [27] is extended here (called DistFR) in order to measure the distance between two features. In [27], each feature is evaluated individually using this measure:

D(f) = \frac{(\mu - \mu')^2}{\sigma^2 + \sigma'^2}    (3)

where \mu and \sigma are the mean and standard deviation of feature f in the target class (\mu' and \sigma' belong to the non-target class). This measure prefers feature fi to feature fj if D(fi) > D(fj). To measure the distance between two features fi and fj, the criterion in (3) is rewritten as:

d(f_i, f_j) = \max_{t=1,\ldots,T} \frac{(\mu_{it} - \mu'_{it})^2}{\sigma_{it}^2 + \sigma_{it}'^2} - \max_{t=1,\ldots,T} \frac{(\mu_{jt} - \mu'_{jt})^2}{\sigma_{jt}^2 + \sigma_{jt}'^2}    (4)

where \mu_{it} and \sigma_{it} are the mean and standard deviation of feature fi in the t-th class (\mu'_{it} and \sigma'_{it} belong to all classes except t). Criterion (4) is used for the similarity comparison in (2) between two features.
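A possible reading of the DistFR distance of Eq. (4) in code is sketched below (fi, fj and labels are assumed to be NumPy arrays over the M instances; the small constant guarding against a zero denominator is our own addition):

import numpy as np

def dist_fr(fi, fj, labels):
    """DistFR of Eq. (4): difference of the best per-class separation scores of fi and fj."""
    def best_separation(f):
        scores = []
        for t in np.unique(labels):
            in_t, out_t = f[labels == t], f[labels != t]      # class t versus the rest
            scores.append((in_t.mean() - out_t.mean()) ** 2
                          / (in_t.var() + out_t.var() + 1e-12))
        return max(scores)                                    # max over t = 1, ..., T, as in Eq. (3)
    return best_separation(fi) - best_separation(fj)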
Fig. 3 P(·,·) values during the merging of the Sensor dataset's feature clusters. a Euclidean distance, b DistFR distance

Using (2) to merge the clusters of features and employing the GACH approach to end the merging process, the number of clusters is predicted. The value of P lies in (0, 1) and grows in regular, ascending steps. At some stages of merging, however, P(·,·) shows jumps, which are good candidates for stopping the merging process and thus for determining the optimal number of features [22]. Figure 3 shows these stopping positions in the P(·,·) values for the Sensor dataset (with 24 features), when the Euclidean and DistFR distances are used.

According to Fig. 3, the P(·,·) values increase steadily, but at the 21st step there is a sudden jump. This point is a good position to stop the merging process. Consequently, 3 (= 24 − 21) features are optimal for the Sensor dataset.
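The stopping step itself can be located as the largest jump in the sequence of P values, for instance as in the simplified sketch below (this argmax rule is our own simplification of GACH's stopping scheme, shown only to mirror the 3 = 24 − 21 arithmetic above):

import numpy as np

def optimal_num_features(p_values, n_features):
    """Return N minus the merge step at which P rises most sharply."""
    jumps = np.diff(np.asarray(p_values, dtype=float))    # rise of P at each merge step
    jump_step = int(np.argmax(jumps)) + 1                 # 1-based step with the biggest rise
    return n_features - jump_step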
On the other hand, the features of the Sensor dataset are grouped into 7 clusters by the GACH approach. This means that we need only 3 out of the 7 clusters in order to select one feature from each as a representative. For this purpose, the clusters should be ranked in some manner. As a heuristic criterion, the number of features in each cluster (its crowd) is considered here. Since some of the best features of each cluster should be selected as its representatives, another feature-ranking criterion is also required; in this regard, the mutual information is used.

According to these explanations, the algorithm of CFSS is given below.

Algorithm: CFSS (clustering-based feature subset selection)
Inputs: M data instances {x1, ..., xM} with features {f1, ..., fN} and labels {t1, ..., tM}
Output: the best features in F′
1. Compute the similarity between each pair of features fi and fj using (1): sij = I(fi, fj), which forms S = [sij]N×N
2. Compute the similarity (relevancy) of each feature fi to the class labels using (1): ri = I(fi, t), which forms R = [ri]N×1
3. Consider each feature fi as an individual cluster Ci, that is Ci = {fi}
4. Gather all clusters Ci in L, which gives L = {C1, ..., CN}
5. Let each feature fi be the representative feature of cluster Ci
6. Gather all representative features fi in F, which gives F = {f1, ..., fN}
7. Repeat
   7.1. Find the two most similar clusters Ci and Cj in L (according to S, via their representative features fi and fj in F)
   7.2. Construct a new cluster Cl by merging Ci and Cj, that is Cl = Ci ∪ Cj
   7.3. Remove clusters Ci and Cj from L, that is L = L − {Ci, Cj}
   7.4. Include cluster Cl in L, that is L = L ∪ {Cl}
   7.5. Find fl as the most relevant feature in cluster Cl (according to R)
   7.6. Introduce fl as the representative feature of cluster Cl
   7.7. Remove features fi and fj from F, that is F = F − {fi, fj}
   7.8. Include feature fl in F, that is F = F ∪ {fl}
8. Until there is no single-feature cluster, that is ∄ Ci ∈ L : |Ci| = 1
9. Determine the optimum number of features, n, using the GACH approach
10. Rank the clusters in L in descending order of their cardinalities
11. If n < |L|
       F′ = {representative features of the n top-ranked clusters in L}
    Else
       F′ = {representative features of all clusters in L} ∪ {n − |L| features from the ranked clusters in L}
    End if
12. Return F′ as the set of best features
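The following Python sketch is one possible reading of this pseudocode (an illustration, not the authors' reference implementation; the MI estimator is assumed to be the one sketched after Eq. (1), and the number n of features is assumed to be supplied, e.g., by the GACH-based estimate):

import numpy as np

def cfss(X, y, n, mutual_information):
    """Clustering-based feature subset selection (sketch of the CFSS pseudocode).
    X: (M, N) data matrix; y: (M,) array of class labels; n: number of features to return."""
    M, N = X.shape
    # Steps 1-2: pairwise feature similarities S and feature-to-label relevancies R
    S = np.array([[mutual_information(X[:, i], X[:, j]) for j in range(N)] for i in range(N)])
    R = np.array([mutual_information(X[:, i], y.astype(float)) for i in range(N)])
    # Steps 3-6: every feature starts as its own cluster and as its own representative
    clusters = [[i] for i in range(N)]
    reps = list(range(N))
    # Steps 7-8: merge until no single-feature cluster remains
    while any(len(c) == 1 for c in clusters) and len(clusters) > 1:
        best, (a, b) = -np.inf, (0, 1)
        for p in range(len(clusters)):                # 7.1: most similar pair of clusters,
            for q in range(p + 1, len(clusters)):     #      judged via their representatives
                if S[reps[p], reps[q]] > best:
                    best, (a, b) = S[reps[p], reps[q]], (p, q)
        merged = clusters[a] + clusters[b]            # 7.2: Cl = Ci U Cj
        rep = max(merged, key=lambda f: R[f])         # 7.5-7.6: most label-relevant member
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        reps = [r for k, r in enumerate(reps) if k not in (a, b)] + [rep]
    # Steps 10-11: rank clusters by cardinality and take n features
    order = np.argsort([-len(c) for c in clusters])
    selected = [reps[k] for k in order[:n]]
    for k in order:                                   # top up when n exceeds the number of clusters
        if len(selected) >= n:
            break
        extras = sorted((f for f in clusters[k] if f not in selected), key=lambda f: -R[f])
        selected.extend(extras[:n - len(selected)])
    return selected[:n]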
3 Experimental results

In this work, eleven datasets from the UCI ML repository [28] are used to compare CFSS against some common and recent methods. Table 1 summarizes their statistics.

Table 1 Datasets used in the experiments
Dataset               No. of features   No. of instances   No. of classes
Wine                  13                178                3
Vehicle               18                946                4
Sensor readings       24                5456               4
WDBC                  30                569                2
Prognostic D cancer   30                569                2
Prognostic B cancer   33                198                2
Ionosphere            34                351                2
Spect heart           44                267                2
Lung cancer           56                32                 3
Spam base             57                4601               2
Sonar                 60                208                2

Firstly, the capability of CFSS in clustering features is compared against GACH, the most similar method. For this purpose, the hierarchical view of the clusters obtained by CFSS for the Wine dataset is depicted in Fig. 4. As shown, the clustering phase of CFSS terminates when four clusters with 4, 3, 2 and 4 features are established.

Fig. 4 Hierarchical clustering of Wine's features by CFSS

Figure 5 shows the dendrogram of features obtained by GACH. In this case, three clusters with 5, 5 and 3 features are obtained. Comparing the two big clusters of CFSS and GACH shows that about 3/4 of the features are in common.

Since the dendrogram of features for higher-dimensional data is very large, the clusters of three datasets (Vehicle, Prognostic B cancer and Ionosphere) are gathered in Table 2. In this table, the clusters of features obtained by CFSS and GACH are included, where the similarity of
Fig. 5 Hierarchical clustering of Wine's features by GACH

Table 2 Clusters of features obtained by CFSS and GACH for three datasets
Vehicle
  CFSS: (1, 3, 4, 7, 8, 9, 11, 12, 16), (2, 10, 13, 15), (5, 6, 14, 17, 18)
  GACH: (1, 3, 4, 7, 9, 11, 12, 14), (2, 10, 13, 16), (5, 6, 15, 17, 18)
Prognostic B cancer
  CFSS: (1, 2, 4, 5), (3, 13, 23), (6, 11, 26), (7, 8, 9, 29), (10, 30, 32, 33), (12, 14, 15), (16, 17, 20, 21), (18, 19), (22, 24, 25), (27, 28, 31)
  GACH: (1, 2, 4, 5, 22, 24, 25), (3, 13, 23), (6, 11, 26), (7, 8, 9, 29), (10, 30, 32, 33), (12, 14, 15), (16, 17, 18, 19, 20, 21), (27, 28, 31)
Ionosphere
  CFSS: (1, 2, 3, 5, 7, 9), (4, 6, 8, 10, 12, 14, 16), (11, 15, 17), (13, 25, 27, 29), (18, 20), (19, 21, 23), (22, 24), (26, 28, 30), (31, 33), (32, 34)
  GACH: (1, 2, 3, 5, 7), (4, 6, 16), (8, 10, 12, 14, 34), (9, 11, 13, 27), (15, 21, 17, 19, 23), (18, 24, 25, 26, 30, 32), (20, 22, 28, 29), (31, 33)

Table 3 Effect of mutual information versus Euclidean distance on CFSS
Dataset   Classification accuracy (Mutual information)   Classification accuracy (Euclidean distance)

Table 4 Effect of our heuristic criterion versus CC1 on CFSS
Dataset               No. of features (by GACH)   Heuristic   CC1
Wine                  4                            95.35       95.35
Vehicle               6                            68.49       66.15
Sensor readings       3                            93.67       85.42
WDBC                  5                            93.93       95.36
Prognostic D cancer   6                            93.86       95.98
Prognostic B cancer   4                            71.63       74.47
Ionosphere            5                            89.23       87.43
Spect heart           3                            75.27       78.92
Lung cancer           5                            62.33       50.33
Spam base             9                            87.77       81.00
Sonar                 5                            83.35       76.15
Average                                            83.17       80.6
select those instances in a cluster which can decrease the distances within that cluster. It uses the CC1 criterion, which is customized here for use by CFSS:

CC1 = \frac{\sum_{f \in C_j} d(f, c_j)}{|C_j|}    (5)

where cj is the centroid of the feature cluster Cj, as before. According to (5), the cluster with the smallest CC1 is the best.
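In code, Eq. (5) amounts to the short sketch below (the cluster is assumed to be stored as a matrix whose rows are the feature vectors of Cj, and Euclidean dissimilarity is used by default as our own assumption):

import numpy as np

def cc1(cluster, d=lambda a, b: np.linalg.norm(a - b)):
    """CC1 of Eq. (5): mean dissimilarity of the features in Cj to the cluster centroid cj."""
    centroid = cluster.mean(axis=0)
    return sum(d(f, centroid) for f in cluster) / len(cluster)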
Table 4 shows the results of our heuristic versus the CC1 criterion, where the higher accuracy for each dataset is shown in boldface. In this table, the optimum number of clusters reported by GACH is also included. These results show that choosing the final features heuristically from appropriate clusters is more efficient than the CC1 criterion.

To assess this visually, Fig. 6 compares the distribution of the Ionosphere dataset with respect to its two best features when the CC1 criterion and our criterion are used. The data instances are clearly less correlated when they are viewed through the two best features returned by CFSS using the heuristic method.

Fig. 6 Distribution of the Ionosphere dataset with respect to its two best features. a CFSS using the heuristic criterion, b CFSS using the CC1 criterion

In this part, the positive cooperation between GACH, in estimating the number of features, and CFSS, in selecting good features, is confirmed via experiments. In this regard, the number of features estimated by GACH is used by CFSS for feature selection. The optimal number of features is used as well (to find the optimum, CFSS is run several times with different numbers of features and the best one is obtained by trial and error). The performance of these features for each dataset is shown in Table 5.
Obviously, the cooperative performance of GACH and CFSS is reasonable, as the classification accuracies are near optimal. This is achieved by using, on average, only one-third of the features.

In order to examine the robustness of CFSS to outliers, some remote instances of each dataset are treated as outliers and removed temporarily. For this purpose, the centroid of the instances in each class is computed and some percentage of the instances farthest from the centroid are set aside. Table 6 reports the performance of the features selected by CFSS when the outlier instances are filtered out. In addition to outliers, some random noisy data with uniform distribution are also added to each dataset and then their best features are extracted. The results of CFSS on these noisy datasets are also included in Table 6. For each dataset, the best performance is highlighted in bold. Clearly, CFSS is robust to outliers and noise, as the classification accuracies confirm.

In this part, our CFSS is compared against two common methods, mRMR [8] and ReliefF [31], and a newer method, L1-LSMI [32]. mRMR is an information-theoretic method which tries to select features with high dependency on the class labels and low relevancy to the other features. ReliefF is a filter method which selects data instances at random and then updates the weights of the relevant (nearest) features. L1-LSMI is a least-squares feature selection by L1-penalized squared-loss mutual information. The comparison results are given in Table 7, where the performance of CFSS is the best on some datasets and also on average (as shown in bold).

CFSS is assessed in terms of time complexity as well. For this purpose, the CPU time needed to run CFSS, ReliefF, mRMR and L1-LSMI for feature selection is computed and reported in Table 8, where the best method is shown in bold.
Table 8 CPU time (in seconds) of CFSS against three other methods
Dataset               ReliefF   mRMR    L1-LSMI   CFSS
Wine                  0.53      1.04    19.14     0.73
SpectF heart          0.59      0.89    5.57      1.75
Vehicle               1.00      2.82    23.00     0.80
Sensor readings       11.11     11.25   50        3.96
Prognostic D cancer   0.85      13.7    53.36     1.44
Prognostic B cancer   0.46      5.03    23.68     1.25
Ionosphere            0.59      14.37   34.18     1.42
Spam base             18.09     20.00   53.00     11.48
Sonar                 0.52      12.80   11.99     3.72
WDBC                  0.84      58.62   29.67     1.46
Lung cancer           0.30      0.70    5.44      0.90
Average               3.17      12.84   28.09     2.63

Based on the averages, our CFSS is the fastest, though for most datasets ReliefF has the least computational complexity. This might be because of the filter nature of ReliefF.

4 Conclusion

In this work, we presented a new feature subset selection method based on hierarchical clustering. In each level of agglomeration, it uses a similarity measure among features, instead of their distance, to merge the two most similar feature clusters. Gathering similar features into clusters and then using a filter method among the similar features leads to redundancy reduction. Our method does not need the number of clusters to be determined in advance. Instead of choosing features from all clusters, only the more important clusters are used. To estimate an appropriate number of features for each dataset, the method of GACH is used.

By applying the CFSS algorithm to extract the best features of some UCI datasets and then using them with a kNN classifier, we assessed our proposed method in comparison with some feature selection methods. Via experimental results, we showed that CFSS is reasonably efficient, since it tries to merge the similar clusters and then selects some good representatives from each cluster of features. Moreover, it is noticeably fast, since it works in a filter manner to choose the representative features.

The stopping condition for merging the clusters of features, together with the appropriate number of representative features per cluster, remain two important open issues in our method. Future extensions of CFSS should concentrate on these two drawbacks.

References

1. Roweis S, Saul L (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326
2. Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1–2):273–324
3. Pudil P, Novovicova J, Kittler J (1994) Floating search methods in feature selection. Pattern Recognit Lett 15:1119–1125
4. Reunanen J (2003) Overfitting in making comparisons between variable selection methods. J Mach Learn Res 3:1371–1382
5. Goldberg D (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading
6. Kennedy J, Eberhart RC (1995) Particle swarm optimization. IEEE Int Conf Neural Netw 4:1942–1948
7. Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40(1):16–28
8. Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
9. Dubes R, Jain AK (1980) Clustering methodologies in exploratory data analysis. In: Yovits MC (ed) Advances in computers. Academic Press Inc., New York, pp 113–125
10. Kasim S, Deris S, Othman RM (2013) Multi-stage filtering for improving confidence level and determining dominant clusters in clustering algorithms of gene expression data. Comput Biol Med 43:1120–1133
11. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, pp 281–297
12. Rokach L, Maimon O (2005) Clustering methods. In: Data mining and knowledge discovery handbook. Springer, New York, pp 321–352
13. Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge
14. Rafsanjani MK, Varzaneh ZA, Chukanlo NE (2012) A survey of hierarchical clustering algorithms. J Math Comput Sci 5(3):229–240
15. Yu-chieh WU (2014) A top-down information theoretic word clustering algorithm for phrase recognition. Inf Sci 275:213–225
16. Mitra P, Murthy C, Pal SK (2002) Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intell 24(3):301–312
17. Sotoca JM, Pla F (2010) Supervised feature selection by clustering using conditional mutual information based distances. Pattern Recogn 43(6):325–343
18. Song Q, Ni J, Wang G (2013) A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Trans Knowl Data Eng 25(1):1–14
19. Altman NS (1992) An introduction to kernel and nearest neighbor nonparametric regression. Am Stat 46(3):175–185
20. Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244
21. Song Y, Jin S, Shen J (2011) A unique property of single-link distance and its application in data clustering. Data Knowl Eng 70:984–1003
22. Mansoori EG (2014) GACH: a grid-based algorithm for hierarchical clustering of high-dimensional data. Soft Comput 18(5):905–922
23. Khedkar SA, Bainwad AM, Chitnis PO (2014) A survey on clustered feature selection algorithms for high dimensional data. Int J Comput Sci Inf Technol (IJCSIT) 5(3):3274–3280
24. Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York
25. Sibson R (1973) SLINK: an optimally efficient algorithm for the single-link cluster method. Comput J (Br Comput Soc) 16(1):30–34
26. Defays D (1977) An efficient algorithm for a complete link method. Comput J (Br Comput Soc) 20(4):364–366
27. Mansoori EG (2013) Using statistical measures for feature ranking. Int J Pattern Recognit Artif Intell 27(1):1–14
28. Asuncion A, Newman DJ (2007) UCI machine learning repository. Department of Information and Computer Science, University of California, Irvine, CA. http://www.ics.uci.edu/mlearn/MLRepository.html
29. McLachlan GJ, Do KA, Ambroise C (2004) Analyzing microarray gene expression data. Wiley, New York
30. Raskutti B, Leckie C (1999) An evaluation of criteria for measuring the quality of clusters. In: Proceedings of the international joint conference on artificial intelligence, pp 905–910
31. Robnik-Šikonja M, Kononenko I (1997) An adaptation of Relief for attribute estimation in regression. In: Machine learning: proceedings of the fourteenth international conference (ICML), pp 296–304
32. Jitkrittum W, Hachiya H, Sugiyama M (2013) Feature selection via L1-penalized squared loss mutual information. IEICE Trans Inf Syst 96(7):1513–1524