Neighbor Consistency in Clustering
CLUSTERING
Contents:
Introduction
Abstract
Existing System
Proposed System
Modules
Modules description
Designs
Results (Screenshots)
Conclusion
Future Scope
Software & Hardware Requirements
References
Introduction
• With the continuous expansion of data availability in many areas of engineering and
science, identifying patterns in vast amounts of data and assigning objects to a
predefined class, a task called classification, have become critical. Classification is
therefore a fundamental problem, especially in pattern recognition and data mining.
• k-Nearest Neighbor (KNN) classifiers are basic classifiers that assign a query object
to the same category as its nearest examples.
• The efficiency of NN classification heavily depends on the type of distance measure,
especially in a large-scale and high-dimensional database. In some applications, the
data structure is so complicated that the corresponding distance measure is
computationally expensive.
• Traditional KNN adopts a fixed k for all query samples regardless of their geometric
location and local characteristics.
• To overcome the above limitations of classification algorithms, we propose a new
method called the Natural Neighborhood-Based Classification Algorithm (NNBCA),
built on the previously proposed Natural Neighbor (NaN) method.
Abstract
• The K-Nearest Neighbors algorithm can be used for classification: an object is classified by a
majority vote of its neighbors, and it is assigned to the class most common among its k nearest
neighbors. It can also be used for regression, in which case the predicted value is the average
of the values of its k nearest neighbors (a small numeric illustration of this averaging step
follows this list).
• Various kinds of K-Nearest Neighbor based classification methods are the bases
of many well established and high-performance pattern recognition techniques. However,
such methods are sensitive to the choice of the parameter k. Essentially, the challenge is to
detect the neighborhood of various data sets with no prior knowledge of the data characteristics.
• In this work we introduce a new supervised classification method, the natural neighbor
algorithm, and show that it provides a better classification result without choosing the
neighborhood parameter artificially. Unlike the original K-Nearest Neighbors based method,
which needs a prior k, the natural neighbor algorithm predicts a different k at different stages.
It is therefore able to learn more from flexible neighbor information in both the training and
testing stages and provide a better classification result.
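For the regression case mentioned in the first point above, the prediction step is simply an average of the neighbors' target values. A minimal numeric illustration in Python (the neighbor values are made up for this example):

    from statistics import mean

    # Hypothetical target values of a query's k = 3 nearest neighbors.
    neighbor_values = [4.2, 3.9, 4.5]

    # KNN regression: the prediction is the average of those values.
    prediction = mean(neighbor_values)
    print(prediction)  # 4.2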
Existing System
K-Nearest-Neighbors (KNN)
The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for more
complex classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines
(SVM). Despite its simplicity, KNN can outperform more powerful classifiers and is used in a
variety of applications such as economic forecasting, data compression and genetics. For
example, KNN was leveraged in a 2006 study of functional genomics for the assignment of genes
based on their expression profiles.
KNN falls in the supervised learning family of algorithms. Informally, this means that we are
given a labeled dataset consisting of training observations (x, y) and would like to capture
the relationship between x and y. More formally, our goal is to learn a function h : X → Y
so that, given an unseen observation x, h(x) can confidently predict the corresponding output
y. The KNN classifier is also a non-parametric, instance-based learning algorithm.
Existing System Contd.
How does KNN work?
In the classification setting, the K-nearest neighbor algorithm essentially boils down to
forming a majority vote between the K most similar instances to a given “unseen”
observation. Similarity is defined according to a distance metric between two data points.
A popular choice is the Euclidean distance given by
d(x, x′) = √((x₁ − x′₁)² + (x₂ − x′₂)² + … + (xₙ − x′ₙ)²).
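To make the distance computation and the majority vote concrete, the following is a minimal sketch of a KNN classifier in plain Python; the function names and the tiny data set are illustrative assumptions, not the project's actual code.

    import math
    from statistics import mode

    def euclidean_distance(a, b):
        """Euclidean distance between two equal-length feature vectors."""
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def knn_classify(train_points, train_labels, query, k):
        """Classify `query` by a majority vote among its k nearest training points."""
        # Pair every training point with its distance to the query.
        distances = [(euclidean_distance(p, query), label)
                     for p, label in zip(train_points, train_labels)]
        # Keep the labels of the k closest points.
        nearest_labels = [label for _, label in sorted(distances)[:k]]
        # Majority vote; in Python 3.7, statistics.mode raises StatisticsError on ties.
        return mode(nearest_labels)

    # Tiny illustrative data set (hypothetical values).
    X = [(1.0, 1.1), (1.2, 0.9), (5.0, 5.2), (5.1, 4.8)]
    y = ["A", "A", "B", "B"]
    print(knn_classify(X, y, (1.1, 1.0), k=3))  # expected output: A

Note how the result depends directly on the hand-picked k; this sensitivity is exactly what the proposed system aims to remove.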
Proposed System
Various kinds of k-Nearest Neighbor (KNN) based classification methods are the bases of
many well established and high-performance pattern recognition techniques. However,
such methods are vulnerable to the choice of the parameter k. Essentially, the challenge is to
detect the neighborhood of various datasets without prior knowledge of the data characteristics.
This work introduces a new supervised classification algorithm, the Natural Neighborhood Based
Classification Algorithm (NNBCA).
Findings indicate that this new algorithm provides a good classification result without
artificially selecting the neighborhood parameter. Unlike the original KNN-based method,
which needs a prior k, NNBCA predicts a different k for different samples.
Therefore, NNBCA is able to learn more from flexible neighbor information in both the
training and testing stages. Thus, NNBCA provides a better classification result than other
methods.
Proposed System Contd.
Advantages of Proposed System
The NaN method can create a suitable neighborhood graph based on the local
characteristics of various data sets. This neighborhood graph can identify the
basic clusters in the data set, especially manifold clusters, as well as the noise points.
The method provides a numeric result named the NaN Eigenvalue (NaNE) to
replace the parameter k of the traditional KNN method, and the value of NaNE
is determined dynamically for each data set.
The number of natural neighbors of each point is flexible, ranging dynamically
from 0 to NaNE: points near the center of a cluster have more neighbors, while
the neighbor number of a noise point is 0 (a simplified sketch of this search is given below).
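As a rough illustration of how such a natural-neighbor search is commonly described (not the project's actual implementation), the sketch below grows the search range r until every point has a mutual neighbor or the number of neighbor-less points stops shrinking; the final r then stands in for NaNE, and the per-point counts correspond to the flexible neighbor numbers described above.

    import math

    def euclidean(a, b):
        """Euclidean distance between two equal-length feature vectors."""
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def natural_neighbor_search(points):
        """Simplified natural-neighbor (NaN) search.

        Grows the search range r until every point has a mutual neighbor, or the
        number of neighbor-less points stops shrinking (those are treated as noise).
        Returns the final r (standing in for NaNE) and the per-point counts.
        """
        n = len(points)
        # For every point, the indices of all other points sorted by distance.
        order = [sorted((j for j in range(n) if j != i),
                        key=lambda j: euclidean(points[i], points[j]))
                 for i in range(n)]
        prev_zero = n + 1
        for r in range(1, n):
            knn = [set(order[i][:r]) for i in range(n)]           # r nearest neighbors
            counts = [sum(1 for j in knn[i] if i in knn[j])       # mutual neighbors only
                      for i in range(n)]
            zero = sum(1 for c in counts if c == 0)
            if zero == 0 or zero == prev_zero:
                return r, counts
            prev_zero = zero
        return n - 1, counts

    # Tiny made-up data set: two compact clusters plus one isolated point.
    pts = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (9.0, 9.0)]
    nane, counts = natural_neighbor_search(pts)
    print("NaNE:", nane, "natural-neighbor counts:", counts)

On a clean data set this search terminates at a small r, and that r is the value the proposed system would use in place of a hand-picked k.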
Modules :-
Libraries
• Tkinter
• Matplotlib
• Statistics
Reading the training data
Reading the testing data
Finding k
Testing new data
Modules description:-
Tkinter library
Tkinter is the standard GUI library for Python. Python when combined
with Tkinter provides a fast and easy way to create GUI applications. Tkinter provides a
powerful object-oriented interface to the Tk GUI toolkit.
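As an illustration only (the project's actual GUI layout is not shown in this document), a minimal Tkinter window with a button that would trigger classification could look like the sketch below; the window title and the handler body are hypothetical.

    import tkinter as tk
    from tkinter import messagebox

    def on_classify():
        # Placeholder handler: the real application would read the data sets
        # and run the NNBCA classification here.
        messagebox.showinfo("Result", "Classification would run here.")

    root = tk.Tk()
    root.title("Neighbor Consistency in Clustering")
    tk.Label(root, text="Natural Neighborhood Based Classification").pack(padx=10, pady=5)
    tk.Button(root, text="Classify", command=on_classify).pack(padx=10, pady=10)
    root.mainloop()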
Matplotlib
Matplotlib is a widely used visualization library in Python for 2D plots of
arrays. It is a multi-platform data visualization library built on NumPy arrays.
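For example, a scatter plot of two hypothetical 2-D classes could be drawn as follows (the coordinates are made up for illustration):

    import matplotlib.pyplot as plt

    # Hypothetical 2-D training points from two classes.
    class_a = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0)]
    class_b = [(5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]

    plt.scatter(*zip(*class_a), label="Class A")
    plt.scatter(*zip(*class_b), label="Class B")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.title("Training data")
    plt.legend()
    plt.show()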
Statistics
The mode() function of the statistics module returns the most common value, i.e., the central tendency, of numeric or nominal data.
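A small usage example; note that in Python 3.7 statistics.mode raises StatisticsError when there is no single most common value:

    from statistics import mode

    labels = ["A", "A", "B"]   # e.g., labels of a query's nearest neighbors
    print(mode(labels))        # A  (the majority class)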
Modules description:-
Reading the training data
It reads the given, already available training data sets.
Finding k
It predicts the k value using the training data and the testing data (a sketch of this flow is given below).
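Putting the reading and prediction modules together, a hedged sketch of the overall flow might look like the following; the CSV file names, the column layout, and the fixed placeholder k are assumptions for illustration, since in the proposed system k would come from the NaNE value found by the natural-neighbor search.

    import csv
    import math
    from statistics import mode

    def read_dataset(path):
        """Read a CSV whose rows are feature values followed by a class label.
        (The file layout is an assumption for illustration.)"""
        points, labels = [], []
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if row:
                    points.append(tuple(float(v) for v in row[:-1]))
                    labels.append(row[-1])
        return points, labels

    def classify(train_X, train_y, query, k):
        """Majority vote among the k nearest training points (Euclidean distance)."""
        dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        nearest = sorted(zip(train_X, train_y), key=lambda pl: dist(pl[0], query))[:k]
        return mode([label for _, label in nearest])

    # Hypothetical file names; the project "reads the already available data sets".
    train_X, train_y = read_dataset("training.csv")
    test_X, test_y = read_dataset("testing.csv")

    # Placeholder only: in the proposed system this would be the NaNE value.
    k = 3
    predictions = [classify(train_X, train_y, q, k) for q in test_X]
    accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
    print(f"Test accuracy: {accuracy:.2%}")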
Conclusion:
The NaN method is a new take on the nearest neighbor concept, and it has been applied to
clustering, outlier detection, and classification. The improved algorithms obtained excellent
results by using the toolkits of the NaN method. In the proposed algorithm, we use the NaNG
(natural neighbor graph) toolkit to achieve improved classification accuracy; further work
should address reducing the complexity of the proposed algorithm and extending the NaN
method to further application areas.
Future Scope:
The outcomes of this research are based on results that involve only sample datasets.
Additional datasets should be considered for the evaluation of different classification
problems, because information growth in recent technology is extending to heights beyond
assumptions: the field keeps growing and data are dynamic by nature.
Hence, whenever the results of the old process become obsolete, classification of the
entire system currently has to be re-implemented from scratch. The scope of future work
can therefore deal with incremental learning, which stores the existing model and processes
newly incoming data more efficiently. More specifically, models with incremental learning
can be used in the categorization process.
Software & Hardware Requirements :-
Software :-
• Python 3.7
• Spyder IDE (or any Python IDE)
Hardware :-
• Processor (i3 or above)
• RAM (4 GB or above)
• Storage (250 GB or above)
References
1. Big Data Mining and Analytics, ISSN 2096-0654, vol. 1, no. 4, pp. 257–265,
December 2018. DOI: 10.26599/BDMA.2018.9020017.
2. Z. H. Zhou, N. V. Chawla, Y. C. Jin, and G. J. Williams, Big data opportunities and
challenges: Discussions from data analytics perspectives, IEEE Comput. Intell. Mag.,
vol. 9, no. 4, pp. 62–74, 2014.
3. Y. T. Zhai, Y. S. Ong, and I. W. Tsang, The emerging “big dimensionality”, IEEE
Comput. Intell. Mag., vol. 9, no. 3, pp. 14–26, 2014.
4. T. Cover and P. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theory,
vol. 13, no. 1, pp. 21–27, 1967.
5. H. Zhang, A. C. Berg, M. Maire, and J. Malik, SVM-KNN: Discriminative nearest
neighbor classification for visual category recognition, in Proc. 2006 IEEE Computer
Society Conf. Computer Vision and Pattern Recognition, New York, NY, USA, 2006, pp.
2126–2136.
6. D. Lunga and O. Ersoy, Spherical nearest neighbor classification: Application to
hyperspectral data, in Machine Learning and Data Mining in Pattern Recognition, P.
Perner, ed. Springer, 2011.
Thank you