Abstract—The K-nearest neighbor (KNN) algorithm is a non-parametric classification method often used in pattern classification for handling binary and multiclass problems in different domains. Although the KNN algorithm is a simple data mining approach, it may give inaccurate results when training datasets are imbalanced or incomplete. Different variants of the KNN algorithm have been developed to handle the resulting discrepancies and inaccuracies. To further improve efficiency and accuracy, preprocessing can be done to handle imbalanced and incomplete datasets. This paper presents the concept of the KNN algorithm and a brief analysis of variants of the KNN algorithm based on their preprocessing techniques.

In Section IV, we discuss the requirements of preprocessing; in Section V, we present a literature survey of several variants of KNN; in Section VI, we show a comparative analysis of the related work; finally, we conclude this paper in Section VII.
A. Advantages of the KNN Algorithm

- The KNN algorithm is one of the simplest approaches among all machine learning algorithms.
- The KNN algorithm is highly adaptive to local information.
- The KNN algorithm uses the nearest objects for estimation, so it takes advantage of local information and forms highly nonlinear decision boundaries for each object.
- The KNN algorithm can be used to classify objects of any type (for example, datasets may have numerical, categorical or mixed types of data).
- The KNN algorithm can be easily implemented in parallel [13].

B. Basic KNN Algorithm

Let X = {x1, x2, ..., xn} be a dataset of n objects where the xi are training objects, y be an unknown object to be classified, K be the number of nearest neighbors, and D be a distance metric. The basic steps of the KNN algorithm (inspired from [4]) can be described as follows:
1: Calculate the distance D between y and every xi in X.
2: Select the K objects from the dataset X which are closest to the test object y.
3: Assign to the object y membership of the class which has the majority vote among the selected neighbors.

There are many distance metric functions used in KNN and other machine learning algorithms. We have considered only the frequently used distance metrics, such as the Euclidean distance. Let x = {x1, x2, ..., xn} be an object of the training set, y be an unknown object to be classified, n be the number of feature attributes, K be the number of nearest neighbors, and D be a distance metric.

1. The Euclidean distance metric is expressed as
   $D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^2 \right)^{1/2}$
2. The Minkowski distance metric is expressed as
   $D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
3. The Chebyshev distance metric is expressed as
   $D(x, y) = \max_i |x_i - y_i|$
4. The Manhattan distance metric is expressed as
   $D(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
5. The Hamming distance metric is expressed as
   $D(x, y) = \sum_{i=1}^{n} \mathbf{1}_{x_i \neq y_i}$
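To make the procedure above concrete, the following is a minimal sketch of the basic KNN steps in Python with NumPy, using the Euclidean and Manhattan metrics defined above. The function and variable names (knn_classify, euclidean, and the toy data) are illustrative choices made here, not taken from any of the surveyed papers.

```python
import numpy as np
from collections import Counter

def euclidean(x, y):
    # D(x, y) = (sum_i |x_i - y_i|^2)^(1/2)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # D(x, y) = sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

def knn_classify(X, labels, y, K=3, metric=euclidean):
    """Classify the unknown object y against the training set X.

    Step 1: compute the distance between y and every x_i in X.
    Step 2: select the K objects closest to y.
    Step 3: assign y to the class with the majority vote.
    """
    distances = np.array([metric(x_i, y) for x_i in X])
    nearest = np.argsort(distances)[:K]          # indices of the K nearest neighbors
    votes = Counter(labels[i] for i in nearest)  # count class labels among the neighbors
    return votes.most_common(1)[0][0]

# Toy usage: two numeric features, two classes.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
print(knn_classify(X, labels, np.array([0.9, 1.1]), K=3))  # expected: 0
```

Swapping the metric argument is all that is needed to experiment with the other distance functions listed above.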
1. The choice of the value of K: refers to the selection of a value for K, which is very important and can be done by inspecting the data. If the value of K is too small, the result may be sensitive to noise points. If the value of K is too large, the neighborhood may contain many objects from other classes, although a large K may reduce the overall noise. Cross validation is a good approach to determine the value of K.

2. The choice of distance metric: refers to the selection of a distance metric based on the dataset. The Euclidean distance is the most common metric used for the KNN algorithm. It gives equal importance to all attributes, which may not be suitable for high dimensional datasets, so results may not be effective and accurate. For categorical datasets, the Hamming distance metric can be used.

3. Incomplete dataset: refers to the case where some attribute values are missing in the dataset. Objects holding incomplete data may be discarded in the classification, but the results may then be inaccurate and biased.

4. Imbalanced datasets: refers to datasets where one class may have a very small number of objects whereas another class may have the majority of objects. Based on the number of objects, a class can be termed a minority class or a majority class. Appropriate pre-processing of the minority class data may improve the accuracy of the algorithm.

5. Missing values: refers to the non-availability of certain attribute values in a dataset. For example, map services may have missing map information.

6. Uncertain values: refers to data having values derived or modelled by some probability distribution function such as a Gaussian distribution. For example, a missing value of an attribute can be replaced by the mean or median of the attribute values of the other objects of the dataset; in this case, the value is not accurate.

7. Attribute reduction: refers to the reduction of irrelevant attributes from a high dimensional dataset.

8. Sparse dataset: refers to a high dimensional dataset which has sparse data. Attribute reduction is very important in this case.

V. LITERATURE SURVEY

To resolve these issues, several approaches are given in research papers [1-18]. We have analyzed the papers based on their pre-processing methods and compared their work based on problem domain and proposed techniques.

Cao Xiao et al. [1] proposed a decomposed KNN approach in which the KNN algorithm has a learning process to train the local-optima-free distance feature by providing a solution to a convex optimization problem. This approach is useful in the selection of relevant features and also helps to remove irrelevant features by using L1 regularization. The proposed approach can be extended to multiclass classification problems. The authors have shown the theoretical relation of the proposed method with the classical KNN algorithm, the SVM algorithm and the LDA algorithm. Experimental results show that it provides accurate results as compared to other classification algorithms.

Jian Hou et al. [2] proposed a new weighted average combination approach using the KNN algorithm. The existing approaches of feature combination with average combination and weighted average combination are investigated in this paper. The authors propose that using a sample of the more powerful features in average combination gives better results. A few conclusions made in the paper are: first, the KNN algorithm along with weighted average combination gives a basis for the sparsity assumption normally used in multiple kernel learning (SVM with multiple kernels is used in the paper); second, the proposed approach shows that in weighted average combination, even if a few kernel weights may be set close to zero, the best solution is not sparse; third, the proposed approach points towards new research directions in feature combination, non-uniform weighting and negative weighting.

Li-Yu Hu et al. [3] investigated the performance of the KNN algorithm based on distance metrics. Experiments are performed on different medical datasets consisting of three types of data: categorical, numerical and mixed types. Experimental results are based on four different types of distance metrics including Euclidean, Cosine, Chi square, and Minkowsky. The performance of the Chi square distance metric is the best as compared to the three other distance metrics, whereas the Euclidean distance metric performs well over datasets containing numerical and categorical data.

Keller et al. [4] developed a fuzzy version of the KNN algorithm. The concept of fuzzy sets is applied to the KNN algorithm to produce better experimental results in this paper. A problem associated with the classical KNN algorithm is that every labeled pattern has equal importance in the class membership decision irrespective of its desirable qualities. To solve this problem, three methods of assigning fuzzy memberships to the labeled samples are proposed, together with two new approaches, a fuzzy K-nearest neighbor algorithm and a fuzzy nearest prototype algorithm, and experimental results are compared with the crisp KNN algorithm. The proposed algorithm performs better than the existing one.
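As an illustration of the idea behind [4], the sketch below assigns class memberships to an unknown object by weighting the memberships of its K nearest neighbors with inverse distances, which is the commonly cited form of the fuzzy-KNN decision rule with fuzzifier m. This is a minimal sketch rather than a reproduction of the exact formulation in [4]; the names fuzzy_knn_memberships, U, and the toy data are illustrative.

```python
import numpy as np

def fuzzy_knn_memberships(X, U, y, K=3, m=2.0):
    """Fuzzy-KNN class memberships for an unknown object y.

    X : (n_samples, n_features) training objects
    U : (n_samples, n_classes) fuzzy memberships of the training objects
    Returns a vector of class memberships for y; the predicted class
    is the one with the highest membership.
    """
    dists = np.linalg.norm(X - y, axis=1)
    nearest = np.argsort(dists)[:K]
    # Inverse-distance weights, 1 / d^(2/(m-1)); guard against zero distance.
    w = 1.0 / np.maximum(dists[nearest], 1e-12) ** (2.0 / (m - 1.0))
    memberships = (U[nearest] * w[:, None]).sum(axis=0) / w.sum()
    return memberships

# Toy usage: crisp (one-hot) memberships for two classes.
X = np.array([[1.0, 1.0], [1.1, 0.9], [4.0, 4.2], [4.1, 3.9]])
U = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
u_y = fuzzy_knn_memberships(X, U, np.array([1.2, 1.1]), K=3)
print(u_y, u_y.argmax())  # higher membership in class 0
```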
Yu et al. [5] proposed a hybrid KNN classifier (HBKNN) in their paper. To handle the imbalance problem, the proposed HBKNN algorithm uses the fuzzy relative transform method. In this approach, local and global information is used. To handle datasets having noisy attributes, a random space ensemble framework (RS-HBKNN) algorithm is introduced. Experimental results show the proposed algorithms perform better than the classical KNN algorithm.

Shichao Zhang et al. [6] developed a new version of the KNN algorithm, the kTree algorithm, to calculate the optimal K value for every test object by introducing a training stage into the classical KNN method. Experimental results show the proposed algorithm has higher accuracy than the classical KNN although both have a similar running cost. Further, an improved kTree algorithm, the k*Tree algorithm, is proposed to speed up the testing phase by storing extra information about the training objects in the leaf nodes of the kTree.

Alberto Palacios Pawlovsky et al. [7] proposed an approach to choose a good setting of parameters for the existing KNN algorithm. The proposed method applies the KNN algorithm to a breast cancer prognosis dataset. A few settings, such as the number of neighbors (K), the amount of data used, and the method of preprocessing, are changed when using the KNN algorithm. The data can be standardized or can be used without any normalization or standardization. Using this approach with the Wisconsin breast cancer prognosis data, the KNN classification algorithm provides better accuracy according to the results given in the paper.

Zhang Li et al. [8] developed an approach known as the Weighted KNN algorithm, used for the reduction of irrelevant attributes from a dataset. In this approach, a weight is associated with every attribute, which reduces the influence of irrelevant attributes, as each attribute in a dataset does not have equal importance in the classification. The weight of every attribute is calculated by a sensitivity method. The experimental results show that the Weighted KNN algorithm reduces the impact of irrelevant attributes and improves performance on the datasets used in the paper.

Xindong Wu et al. [9] described the top ten approaches used in data mining: the C4.5 algorithm, k-means algorithm, Support Vector Machine algorithm, Apriori algorithm, Expectation-Maximization (EM) algorithm, PageRank algorithm, AdaBoost algorithm, KNN algorithm, Naive Bayes algorithm, and CART algorithm. These algorithms are the most popular data mining approaches used in different research areas. For every approach, the authors have explained the algorithm, described its influence, and presented the current and future research scope of the algorithm.

Bo Zang et al. [10] proposed an improved version of the existing KNN algorithm for imbalanced datasets which focuses more on the minority class samples than on the majority class samples. In general, equal importance is given to all the samples, whereas in this approach the weight of a minority class sample is increased based on the local features of the minority class sample distribution. The proposed algorithm is compared with the existing weighted distance KNN algorithm (WDKNN). Experimental results show the proposed algorithm performs better than WDKNN in the case of imbalanced datasets.

Wang Xueli et al. [11] proposed a modified version of the KNN algorithm, the stepwise KNN algorithm (SWKNN), with feature reduction and a kernel method, which can handle redundant attributes and uneven distribution of data. SWKNN reduces irrelevant attributes stepwise using the concept of a kernel-based KNN algorithm. The experimental results show that the proposed algorithm performs better than the classical KNN algorithm and effectively reduces high dimensional features.

Aiman Moldagulova et al. [12] developed a machine learning system in R which can classify textual documents using the KNN algorithm. Selection of a proper value of K, which represents the number of neighbors, is a challenging issue. To find a proper value of K, experiments are performed on textual documents. The experimental results show the best results when the value of K ranges from 1 to 50; if the value of K is larger than 50, the accuracy of the results is reduced.

M. Nirmala Devi et al. [13] developed an approach named the Amalgam KNN model for classification of the PIMA diabetes dataset. The k-means algorithm is combined with the classical KNN algorithm in the proposed method, which involves multi-step pre-processing. The missing values can be replaced by means and medians, a good K can be chosen by the tenfold cross validation (CV) method, bigger values of K can be selected to reduce the effect of noise on the classification, grouping can be performed by k-means clustering, and classification can be done by the KNN algorithm to improve the accuracy and efficiency of the Amalgam method. Experimental results show the proposed Amalgam KNN performs better than the classical KNN.

Nicolás García-Pedrajas et al. [14] proposed an approach for setting a local value of K for the KNN algorithm which performs better than the classical KNN algorithm. This approach associates a local value of K with every prototype and obtains the best value of K by setting certain conditions on the local and global impacts of the different K values in the neighborhood of the prototype. The experimental results show that the proposed approach performs better than the classical KNN algorithm for standard and imbalanced datasets.

Huahua Xie et al. [15] proposed a new improved KNN algorithm with better performance than the classical KNN algorithm, aimed at reducing its time complexity. In this approach, a pre-classification is conducted before application of the proposed algorithm, and the training set is partitioned into many parts using a threshold. The experimental results demonstrate that the proposed KNN algorithm with pre-classification performs better than the classical KNN and has reduced time complexity.
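Several of the variants above pair the classical KNN rule with preprocessing; for example, [13] fills missing values with means or medians and chooses K by tenfold cross validation. The sketch below assembles that kind of pipeline with scikit-learn; it illustrates the ingredients named in [13] but is not a reimplementation of the Amalgam KNN method, and the synthetic data, the candidate K values and the median imputation strategy are assumptions made here for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def choose_k_by_cv(X, y, candidate_ks=range(1, 32, 2), folds=10):
    """Pick K by tenfold cross validation over a set of candidate values."""
    best_k, best_score = None, -np.inf
    for k in candidate_ks:
        model = make_pipeline(
            SimpleImputer(strategy="median"),    # fill missing attribute values
            StandardScaler(),                    # give attributes comparable scales
            KNeighborsClassifier(n_neighbors=k), # classical KNN on the preprocessed data
        )
        score = cross_val_score(model, X, y, cv=folds).mean()
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Example with synthetic data containing missing values (np.nan).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan        # inject roughly 5% missing entries
print(choose_k_by_cv(X, y))
```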
Ratna Astuti Nugrahaeni et al. [16] compared the KNN, SVM, and Random Forests algorithms for facial expression classification. The highest accuracy reported for the KNN algorithm was 98.85%, the highest accuracy for the SVM algorithm was 90%, and the highest accuracy for the Random Forests algorithm was 98.85%.
Table 1. Comparative analysis of related work.

| Author(s) | Proposed technique | Datasets | Problem domain | Distance metric | Classifier(s) | Result |
|---|---|---|---|---|---|---|
| Cao Xiao et al. [1] | Decomposed KNN | Ionosphere, Sonar, Bupa, Wine, etc. | Optimization | Proposed metric | KNN, SVM, LDA | Better result than baseline methods |
| Jian Hou et al. [2] | KNN-based weighted average combination method | Event-8, Scene-15, Flower-17, Caltech-101 | Optimization | Proposed metric | MKL | Better results than MKL |
| Li-Yu Hu et al. [3] | KNN using different distance functions | Breast cancer, Ecoli, Pima, Blood | Analysis of distance functions | Euclidean, Cosine, Minkowsky, Chi square | KNN | Chi square performs better than other distance metrics |
| J. M. Keller et al. [4] | Fuzzy KNN | IRIS, IRIS23, TWOCLASS | Classification | Proposed metric | KNN | Better result than traditional KNN |
| Z. Yu et al. [5] | Hybrid KNN | Chess, Optdigits, Texture dataset | Imbalanced, Sparse, Noise problem | Proposed metric | KNN | HBKNN handles the problem domain |
| Shichao Zhang et al. [6] | Efficient KNN with different numbers of k | Abalone, Balance, Blood, Car, etc. | Optimization | Proposed metric | KNN | Reduced running cost |
| Alberto Palacios Pawlovsky et al. [7] | Euclidean distance for normalized data | Breast cancer | Normalization | Euclidean | KNN | Better result than traditional KNN |
| Zhang Li et al. [8] | Weighted KNN | Wine | Optimization | Proposed | KNN | Better result than traditional KNN |
| Bo Zang et al. [10] | Weighted Distance KNN | Yeast3, Ecoli3, Vowel0, etc. | Imbalanced dataset | Euclidean | KNN | Better result than traditional KNN |
| Wang Xueli et al. [11] | Stepwise KNN algorithm based on kernel methods (SWKNN) | Glass, Wine, Diabetes, Wine quality | Attribute reduction | Proposed | KNN | Better result than traditional KNN |
| Aiman Moldagulova et al. [12] | KNN method for the classification of textual documents | Governmental news stream, E-Government news stream | Document classification | Euclidean | KNN | Best accuracy when the value of k ranges from 1 to 50 |
| M. Nirmala Devi et al. [13] | k-means with KNN | Pima | Missing values | Euclidean | KNN | Better result for large values of k |
| García-Pedrajas et al. [14] | Local k value for KNN | Medical datasets | Optimization | Independent from metric | KNN | Better result than traditional KNN |
| Huahua Xie et al. [15] | Pre-classification based KNN algorithm | Customer complaint, terminal marketing | Preprocessing | Euclidean | KNN | Better result than traditional KNN |
| Ratna Astuti Nugrahaeni et al. [16] | KNN, SVM, and Random Forests algorithms | Extended Cohn-Kanade (CK+) database | Classification | Not specified | KNN, SVM, Random Forest | For large datasets, better result for KNN |
| Zhou et al. [17] | Clustering-based KNN algorithm | Arts, History | Uneven distribution of training samples | Proposed metric | KNN | Better result than traditional KNN |
| UshaRani et al. [18] | Method for fixing missing values | Medical datasets | Missing values, Dimensionality reduction | Proposed metric | KNN | Better result than traditional KNN |
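The imbalanced-dataset rows of Table 1 ([5], [10]) modify how neighbors contribute to the decision so that minority-class samples are not drowned out by the majority class. The sketch below shows one simple way to do this, scaling each neighbor's vote by the inverse frequency of its class; this inverse-frequency weighting is an illustrative simplification and not the local-distribution-based weighting proposed in [10].

```python
import numpy as np
from collections import Counter, defaultdict

def class_weighted_knn(X, labels, y, K=3):
    """KNN for imbalanced data: each neighbor's vote is scaled by the
    inverse frequency of its class, so minority-class neighbors count more.
    (Illustrative only; [10] derives per-sample weights from the local
    distribution of the minority class, which is not reproduced here.)
    """
    freq = Counter(labels)                          # class frequencies in the training set
    dists = np.linalg.norm(X - y, axis=1)
    nearest = np.argsort(dists)[:K]
    votes = defaultdict(float)
    for i in nearest:
        votes[labels[i]] += 1.0 / freq[labels[i]]   # inverse-frequency weighted vote
    return max(votes, key=votes.get)

# Toy usage: class 1 is the minority class, yet it wins the weighted vote.
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.4, 0.5], [0.3, 0.3], [1.0, 1.0]])
labels = np.array([0, 0, 0, 0, 1])
print(class_weighted_knn(X, labels, np.array([0.8, 0.8]), K=3))  # expected: 1
```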
UshaRani et al. [18] developed an approach for handling missing attribute values for both categorical and numerical attributes in medical datasets. This is done by extracting hidden knowledge from the medical datasets. In this paper, the authors proposed a novel imputation approach for fixing missing values by clustering the medical records which are free from missing values. The proposed method for handling missing values also achieves dimensionality reduction of the records. This approach may be extended to other research areas.

VI. ANALYSIS OF WORK OF RELATED PAPERS

As can be seen from Table 1, the research papers improve the existing KNN algorithm in order to handle the issues discussed in Section IV. The datasets used in the papers are mostly from the UCI repository, and a few are from other sources. Hybrid approaches are proposed to increase the efficiency of KNN [1], [2], [8], [10]. Methods for selection of the parameter K are discussed in papers [6], [14]. Methods to handle incomplete datasets are given in paper [18]. Imbalanced dataset handling methods are discussed in papers [5], [10]. Sparse datasets are handled in paper [5]. Methods to handle uncertain values are discussed in paper [18]. Methods to handle missing values are discussed in papers [13], [18]. Methods to handle noisy datasets are given in papers [5], [18]. Approaches to reduce irrelevant attributes are described in papers [11], [18]. In a few papers, comparisons of KNN and other classification techniques are made [1], [16].

VII. CONCLUSION

In this paper, we discussed the KNN algorithm and presented a brief review of variants of the classical KNN algorithm proposed in different research papers. We have also compared them according to their techniques, datasets, problem domains and performance. In the future, we will implement the KNN algorithm with different combinations of pre-processing methods on datasets and will compare the results with other baseline methods.

REFERENCES

[1] Xiao, Cao, and Wanpracha Art Chaovalitwongse. "Optimization models for feature selection of decomposed nearest neighbor." IEEE Transactions on Systems, Man, and Cybernetics: Systems 46, no. 2 (2016): 177-184.
[2] Hou, Jian, Huijun Gao, Qi Xia, and Naiming Qi. "Feature combination and the kNN framework in object classification." IEEE Transactions on Neural Networks and Learning Systems 27, no. 6 (2016): 1368-1378.
[3] Hu, Li-Yu, Min-Wei Huang, Shih-Wen Ke, and Chih-Fong Tsai. "The distance function effect on k-nearest neighbor classification for medical datasets." SpringerPlus 5, no. 1 (2016): 1304.
[4] Keller, James M., Michael R. Gray, and James A. Givens. "A fuzzy k-nearest neighbor algorithm." IEEE Transactions on Systems, Man, and Cybernetics 4 (1985): 580-585.
[5] Yu, Zhiwen, Hantao Chen, Jiming Liu, Jane You, Hareton Leung, and Guoqiang Han. "Hybrid k-Nearest Neighbor Classifier." IEEE Transactions on Cybernetics 46, no. 6 (2016): 1263-1275.
[6] Zhang, Shichao, Xuelong Li, Ming Zong, Xiaofeng Zhu, and Ruili Wang. "Efficient kNN classification with different numbers of nearest neighbors." IEEE Transactions on Neural Networks and Learning Systems (2017).
[7] Pawlovsky, Alberto Palacios, and Mai Nagahashi. "A method to select a good setting for the kNN algorithm when using it for breast cancer prognosis." In Biomedical and Health Informatics (BHI), 2014 IEEE-EMBS International Conference on, pp. 189-192. IEEE, 2014.
[8] Li, Zhang, Zhang Chengjin, Xu Qingyang, and Liu Chunfa. "Weigted-KNN and its application on UCI." In Information and Automation, 2015 IEEE International Conference on, pp. 1748-1750. IEEE, 2015.
[9] Wu, Xindong, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan et al. "Top 10 algorithms in data mining." Knowledge and Information Systems 14, no. 1 (2008): 1-37.
[10] Zang, Bo, Ruochen Huang, Lei Wang, Jianxin Chen, Feng Tian, and Xin Wei. "An Improved KNN Algorithm Based on Minority Class Distribution for Imbalanced Dataset." In Computer Symposium (ICS), 2016 International, pp. 696-700. IEEE, 2016.
[11] Xueli, Wang, Jiang Zhiyong, and Yu Dahai. "An Improved KNN Algorithm Based on Kernel Methods and Attribute Reduction." In Instrumentation and Measurement, Computer, Communication and Control (IMCCC), 2015 Fifth International Conference on, pp. 567-570. IEEE, 2015.
[12] Moldagulova, Aiman, and Rosnafisah Bte Sulaiman. "Using KNN algorithm for classification of textual documents." In Information Technology (ICIT), 2017 8th International Conference on, pp. 665-671. IEEE, 2017.
[13] NirmalaDevi, M., Subramanian Appavu, and U. V. Swathi. "An amalgam KNN to predict diabetes mellitus." In Emerging Trends in Computing, Communication and Nanotechnology (ICE-CCN), 2013 International Conference on, pp. 691-695. IEEE, 2013.
[14] García-Pedrajas, Nicolás, Juan A. Romero del Castillo, and Gonzalo Cerruela-García. "A Proposal for Local k Values for k-Nearest Neighbor Rule." IEEE Transactions on Neural Networks and Learning Systems 28, no. 2 (2017): 470-475.
[15] Xie, Huahua, Dong Liang, Zhaojing Zhang, Hao Jin, Chen Lu, and Yi Lin. "A Novel Pre-Classification Based kNN Algorithm." In Data Mining Workshops (ICDMW), 2016 IEEE 16th International Conference on, pp. 1269-1275. IEEE, 2016.
[16] Nugrahaeni, Ratna Astuti, and Kusprasapta Mutijarsa. "Comparative analysis of machine learning KNN, SVM, and random forests algorithm for facial expression classification." In Technology of Information and Communication (ISemantic), International Seminar on Application for, pp. 163-168. IEEE, 2016.
[17] Zhou, Lijuan, Linshuang Wang, Xuebin Ge, and Qian Shi. "A clustering-based KNN improved algorithm CLKNN for text classification." In Informatics in Control, Automation and Robotics (CAR), 2010 2nd International Asia Conference on, vol. 3, pp. 212-215. IEEE, 2010.
[18] UshaRani, Yelipe, and P. Sammulal. "A novel approach for imputation of missing values for mining medical datasets." In Computational Intelligence and Computing Research (ICCIC), 2015 IEEE International Conference on, pp. 1-8. IEEE, 2015.