Improvised Method of FAST Clustering Based Feature Selection Technique Algorithm For High Dimensional Data
ABSTRACT
High dimensional data consists of thousands of attributes or features and is now common in scientific and research applications. Because so many features are present in the data, we need to select those that are non-redundant and most relevant in order to reduce dimensionality and runtime and to improve the accuracy of the results. In this paper we propose the FAST feature subset selection algorithm together with an improved version of it, and we evaluate the efficiency and accuracy of the results through an empirical study. The presented clustering-based feature subset selection algorithm for high dimensional data involves (i) removing irrelevant features, (ii) constructing a minimum spanning tree from the relevant ones, and (iii) partitioning the MST and selecting representative features. In the proposed algorithm, a cluster consists of features; each cluster is treated as a single feature, so dimensionality is greatly reduced. The proposed system implements the FAST algorithm using the Dice coefficient measure to remove irrelevant and redundant features.
Keywords: FAST, Feature Subset Selection, Graph Based Clustering, Minimum Spanning Tree.
1. INTRODUCTION
With the goal of choosing a subset of good features with respect to the target classes, feature subset selection is a proper way of reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility [01][02]. There are basically four methods for feature selection, i.e., wrapper, filter, embedded, and hybrid. Among the filter feature selection methods, the application of cluster analysis has been demonstrated to be more effective than traditional feature selection algorithms [04]. The filter approach uses intrinsic properties of the data for feature selection. It is an unsupervised feature selection approach, performing the selection without using induction algorithms, as shown in the figure. This method is used for the transformation of the variable space, which is required so that all the features can be collated and computed before dimension reduction is achieved [05][06]. In particular, we adopt minimum spanning tree based clustering algorithms, because they do not assume that data points are clustered around centers or separated by a regular geometric curve, and they have been used extensively in practice [02]. In cluster analysis, graph-theoretic methods have been well studied and used in many applications. The features selected from the generated clusters are those that remove the redundant and irrelevant attributes [03][13]; this method is used for selecting the interesting features from the clusters. Clustering is an unsupervised learning problem, which tries to group a set of points into clusters such that points in the same cluster are more similar to each other than to points in different clusters, under a particular similarity metric [11][12]. Feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. This is because (1) irrelevant features do not contribute to predictive accuracy, and (2) redundant features do not help in building a better predictor, since they mostly provide information that is already present in other feature(s) [07][09][10].
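As a small illustration of the filter idea described above, the following Python sketch removes features whose relevance to the target class falls below a threshold. The mutual information score and the threshold value are illustrative assumptions only; they are not the measures adopted later in this paper.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def remove_irrelevant(X, y, threshold=0.01):
    """Keep only features whose relevance to the target exceeds a threshold.

    Mutual information is used here purely as an illustrative relevance
    score; the threshold value is an arbitrary assumption.
    """
    relevance = mutual_info_classif(X, y)      # relevance of each feature to the class
    keep = np.where(relevance > threshold)[0]  # indices of features deemed relevant
    return X[:, keep], keep

# Example on a toy data set with 100 samples and 50 features
X = np.random.rand(100, 50)
y = np.random.randint(0, 2, size=100)
X_relevant, kept = remove_irrelevant(X, y)
print(len(kept), "features kept out of", X.shape[1])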
Using this method we can obtain high-quality feature attributes. In this paper we focus on using the best similarity measure to calculate the relevance between features, and we compare the results with traditional feature selection algorithms such as CFS and FAST. The goal of this paper is to use the best algorithm, i.e., the improved FAST, for feature subset selection so that effective accuracy is obtained [08].
2. LITERATURE SURVEY
In [02], Qinbao Song et al. proposed a new FAST algorithm that achieves higher accuracy and lower time complexity than traditional feature selection algorithms such as FCBF, Relief, CFS, FOCUS-SF, and Consist, and they also compared the classification accuracy obtained with prominent classifiers. Graph-theoretic clustering and an MST based approach are used to ensure the efficiency of the feature selection process. Classifiers play a vital role in feature selection, since the accuracy of the selected features is measured using classifiers [06]. The following classifiers are utilized to classify the data sets [2], [3]. Naive Bayes works under Bayes' theorem, is based on a probabilistic approach, and still offers first-rate classification output. C4.5 is the successor of ID3 [4] and supports the decision tree induction method; gain ratio, Gini index, and information gain are the measures used for attribute selection. The simplest algorithm is IB1 (instance based) [5], which performs classification based on distance vectors. RIPPER [6] is a rule based technique that builds a set of rules for classifying the data sets. The classifier is one of the evaluation parameters for measuring the accuracy of the process.
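To make the role of these classifiers concrete, the sketch below cross-validates a candidate feature subset with scikit-learn stand-ins for the classifiers named above. This is an assumed evaluation harness, not the exact experimental setup of [02]: a decision tree is used in place of C4.5, 1-nearest-neighbour in place of IB1, and RIPPER has no direct scikit-learn equivalent.

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def evaluate_subset(X_selected, y):
    """Report 10-fold cross-validated accuracy of a feature subset for the
    classifiers used as evaluation parameters in the survey."""
    classifiers = {
        "Naive Bayes": GaussianNB(),
        "C4.5-like decision tree": DecisionTreeClassifier(),
        "IB1 (1-NN)": KNeighborsClassifier(n_neighbors=1),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X_selected, y, cv=10)
        print(f"{name}: mean accuracy {scores.mean():.3f}")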
Author: Kononenko
Description: Relief is ineffective at removing redundant features, as two predictive but highly correlated features are likely both to be highly weighted. Relief was extended to work with noisy and incomplete data sets and to deal with multi-class problems [2].

Author: Yu L. and Liu H.
Description: FCBF is a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis [01].

Author: Fleuret F.
Description: CMIM iteratively picks features which maximize their mutual information with the class to predict, conditionally on the response of any feature already picked [14].

Author: Krier C. and Francois D.
Description: This paper presented a methodology combining hierarchical constrained clustering of spectral variables and selection of clusters by mutual information [15].

Author: Qinbao Song et al.
Description: The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form the final subset of features. Features in different clusters are relatively independent, so the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features [02].
3. PROPOSED SYSTEM
In the proposed system, i.e., the improved FAST, we use the relevance between features and the relevance between each feature and the target concept. We use the Dice coefficient to calculate the relevance between features, and we extract the best representative features from each cluster using the relevance between features and the relevance between each feature and the target class. Many existing feature selection techniques aim at reducing unnecessary features to shrink the dataset, but some of them fail to remove redundant features after removing the irrelevant ones. The proposed system focuses on removing both irrelevant and redundant features. The features are first divided into clusters, and from each cluster the features that are most representative of the target class are selected. The system uses the minimum spanning tree (MST) method, with which we propose a fast clustering-based feature selection algorithm (FAST). The proposed system is an implementation of the FAST algorithm using the Dice coefficient measure to remove irrelevant and redundant features.
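The paper does not spell out how the Dice coefficient is applied to feature vectors, so the sketch below assumes features have first been binarized (for example, by thresholding each feature at its median). The formula itself is the standard Dice coefficient 2|A ∩ B| / (|A| + |B|).

import numpy as np

def dice_coefficient(a, b):
    """Dice coefficient between two binary feature vectors: 2|A ∩ B| / (|A| + |B|).
    Returns a value in [0, 1], with 1 meaning identical supports."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 0.0
    return 2.0 * np.logical_and(a, b).sum() / denom

# Illustrative pair of binarized features (value above its median -> 1)
f1 = np.array([1, 0, 1, 1, 0, 1])
f2 = np.array([1, 0, 0, 1, 0, 1])
print(dice_coefficient(f1, f2))   # 2*3 / (4+3) ≈ 0.857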
Feature selection is also useful as part of the data analysis process, as it shows which features are important for prediction and how these features are related. Clustering is mainly used to group the data that are similar to the user's search. Data that are irrelevant can easily be eliminated, and redundant features inside the datasets are also removed, so that the clustering finally produces the selected data. The clustering uses an MST for selecting the related data and, finally, the relevant data.
A minimum spanning tree (MST), or minimum weight spanning tree, is a spanning tree with weight less than or equal to the weight of every other spanning tree. More generally, any undirected graph (not necessarily connected) has a minimum spanning forest, which is a union of minimum spanning trees for its connected components. High dimensional clustering is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and the clustering of text documents, where, if a word frequency vector is used, the number of dimensions equals the size of the dictionary.
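The following sketch ties these pieces together: it builds a feature graph, computes its minimum spanning tree with SciPy, cuts long edges to obtain feature clusters, and keeps one representative feature per cluster. The dissimilarity measure (1 - |Pearson correlation|), the edge threshold, and the representative criterion are all illustrative assumptions rather than the exact definitions of the proposed improved FAST.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def fast_like_selection(X, y, edge_threshold=0.7):
    """Sketch of an MST-based, FAST-style selection:
    (i) build a feature graph weighted by dissimilarity,
    (ii) compute its minimum spanning tree,
    (iii) cut long MST edges to form feature clusters, and
    (iv) keep the feature most correlated with the target from each cluster."""
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-to-feature similarity
    dissimilarity = 1.0 - corr                    # MST edge weights (assumed measure)
    mst = minimum_spanning_tree(dissimilarity).toarray()
    mst[mst > edge_threshold] = 0                 # partition the MST: drop weak links
    n_clusters, labels = connected_components(mst, directed=False)

    # Relevance of each feature to the target (assumed: absolute correlation with y)
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    selected = [int(np.argmax(np.where(labels == c, relevance, -1)))
                for c in range(n_clusters)]       # one representative per cluster
    return selected

The returned indices identify one representative feature per cluster and can then be fed to the classifier evaluation harness sketched earlier to estimate accuracy.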
4. RESULTS
The performance of CFS, FAST, and the improved FAST algorithm was compared on four data sets in terms of the number of selected features, the runtime, and the classification accuracy, as summarised in the following tables.

Number of selected features
Algorithm        Dataset 1   Dataset 2   Dataset 3   Dataset 4
CFS              50          40          28          41
FAST             29          30          8           11
Improved FAST    19          25          6           10
Runtime (ms)
Algorithm        Dataset 1   Dataset 2   Dataset 3   Dataset 4
CFS              17382       13236       40486       44486
FAST             4416        3456        2418        2918
Improved FAST    3775        2800        1846        2246
Classification accuracy (%)
Algorithm        Dataset 1   Dataset 2   Dataset 3   Dataset 4
CFS              99.14       92.34       94.71       98.71
FAST             99.75       94.45       96.65       99.65
Improved FAST    99.81       99.10       98.69       99.69
5. CONCLUSION
We have presented an improved clustering-based feature subset selection algorithm for high dimensional data. The algorithm involves (i) deleting irrelevant features, (ii) developing a minimum spanning tree from the relevant ones, and (iii) dividing the MST and selecting representative features. In the proposed algorithm, a cluster consists of features; each cluster is treated as a single feature, so dimensionality is greatly reduced. The performance of the proposed algorithm is analysed against feature selection algorithms such as CFS and FAST on different datasets. The proposed algorithm obtained the best proportion of selected features, the best runtime, and the best classification accuracy for Naive Bayes, C4.5, and RIPPER, and the second best classification accuracy for IB1. FAST is the best algorithm among the available algorithms for all kinds of datasets, and its efficiency can be increased further by using different similarity measures such as the Dice coefficient.
REFERENCES
[1] Avinash Godase and Poonam Gupta, A Survey on Clustering Based Feature Selection Technique Algorithm for High Dimensional Data, International Journal of Emerging Trends & Technology in Computer Science, Volume 4, Issue 1, January-February 2015, ISSN 2278-6856.
[2] Qinbao Song et al., A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 1, 2013.
[3] Kira K. and Rendell L.A., The feature selection problem: Traditional methods and a new algorithm, In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 129-134, 1992.
[4] Yu L. and Liu H., Feature selection for high-dimensional data: a fast correlation-based filter solution, In Proceedings of the 20th International Conference on Machine Learning, 20(2), pp. 856-863, 2003.
[5] Butterworth R., Piatetsky-Shapiro G. and Simovici D.A., On Feature Selection through Clustering, In Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 581-584, 2005.
[6] Yu L. and Liu H., Efficient feature selection via analysis of relevance and redundancy, Journal of Machine Learning Research, 5, pp. 1205-1224, 2004.
[7] Van Dijk G. and Van Hulle M.M., Speeding Up the Wrapper Feature Subset Selection in Regression by Mutual Information Relevance and Redundancy Analysis, International Conference on Artificial Neural Networks, 2006.
[8] Krier C., Francois D., Rossi F. and Verleysen M., Feature clustering and mutual information for the selection of variables in spectral data, In Proceedings of the European Symposium on Artificial Neural Networks Advances in Computational Intelligence and Learning, pp. 157-162, 2007.
[9] Zheng Zhao and Huan Liu, Searching for Interacting Features, In Proceedings of IJCAI-07, 2007.
[10] Soucy P. and Mineau G.W., A simple feature selection method for text classification, In Proceedings of IJCAI-01, Seattle, WA, pp. 897-903, 2001.
[11] Kohavi R. and John G.H., Wrappers for feature subset selection, Artificial Intelligence, 97(1-2), pp. 273-324, 1997.
[12] Chanda P., Cho Y., Zhang A. and Ramanathan M., Mining of Attribute Interactions Using Information Theoretic Metrics, In Proceedings of the IEEE International Conference on Data Mining Workshops, pp. 350-355, 2009.
[13] Forman G., An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research, 3, pp. 1289-1305, 2003.
[14] Fleuret F., Fast binary feature selection with conditional mutual information, Journal of Machine Learning Research, 5, pp. 1531-1555, 2004.
[15] Krier C., Francois D., Rossi F. and Verleysen M., Feature Clustering and Mutual Information for the Selection of Variables in Spectral Data, Proc. European Symposium on Artificial Neural Networks Advances in Computational Intelligence and Learning, pp. 157-162, 2007.