Earlier Prediction of Heart Disease Using Locality Sensitive Hashing
Introduction
In day-to-day life, many factors affect the human heart. Heart problems are occurring at a
rapid pace, and new heart diseases are continually being identified. The heart is an essential
organ that pumps blood through the body, and in today’s stressful world its health must be
preserved for healthy living. Heart failure is one outcome of heart disease, and breathlessness
can occur when the heart becomes too weak to circulate blood. Some heart conditions occur with
no symptoms at all, especially in older adults and individuals with diabetes. The term
'congenital heart disease' covers a range of conditions, but the general symptoms include
sweating, high levels of fatigue, fast heartbeat and breathing, breathlessness and chest pain.
However, these symptoms might not develop until a person is older than 13 years. In such
cases, diagnosis becomes an intricate task requiring great experience and high skill.
Data mining is the task of extracting vital decision-making information from a collection
of past records for future analysis or prediction. Medical data mining makes it possible to
integrate classification techniques and provide computerised training on a dataset, which in
turn leads to exploring the hidden patterns in medical data sets that are used to predict a
patient’s future state.
In this research work, supervised machine learning is used to make the predictions. A
comparative analysis is carried out on three data mining classification algorithms, namely
k-NN, SVM and LSH. The analysis considers model building time, correctly classified instances,
incorrectly classified instances and accuracy %. The StatLog dataset from the UCI machine
learning repository is used for making heart disease predictions in this research work. The
predictions are made using a Locality Sensitive Hashing classification model that is built
when the heart disease dataset is used for training. This final model can be used for
prediction of any type of heart disease.
Literature Survey
Data mining has played an important role in intelligent medical systems. It plays a vital
role in the healthcare industry, helping health systems to use data and analytics effectively
to recognize inefficiencies and identify best practices that reduce costs and improve care.
The main obstacles to implementing data mining techniques and analysis strategies effectively
are the adoption of technology and the complexity of healthcare. Also, the data generated by
healthcare activities are so complex and voluminous that non-automated analysis is
impractical. For identifying a disease and following it with effective treatment, data mining
techniques are therefore important for patients and stakeholders alike.
The applications of machine learning techniques have been examined by various researchers
previously. However, most of the studies focus on the specific impact of those machine
learning techniques rather than on optimizing them. Hybrid methods were also proposed to
enhance the optimization.
Dangare et al. [1] used feature selection for predicting heart disease using a neural
network. Jabbar et al. used feature selection methods such as symmetrical uncertainty,
information gain and genetic algorithm. Their proposed method uses feature subset selection
and associative classification to compute a risk score for the disease [2]. Krishnaiah V [3]
and Kumar [4] proposed a method that uses fuzzy logic along with k-NN for the diagnosis of
heart disease. The performance of their algorithms was improved by discretization and
filtering techniques. Syed et al. predicted heart disease using genetic neural networks [5].
In [6], the authors proposed prediction of heart disease using genetic neural networks.
Experiments were done on the American Heart Association data set. Their approach recorded an
accuracy of 96.2%. Masethe et al. [7] proposed a model using a decision tree for heart disease
prediction. The following are some of the data mining techniques and their drawbacks when
applied to the early prediction of diseases.
Palaniappan et al. [8] carried out research work and built a model known as the
Intelligent Heart Disease Prediction System (IHDPS) using several data mining techniques
such as Decision Trees, Naïve Bayes and Neural Networks.
Shantakumar et al. [9] developed an intelligent and effective heart attack prediction
system using a Multi-Layer Perceptron with Back-Propagation. The frequent patterns of heart
disease are mined with the MAFIA algorithm based on the data extracted.
Yanwei et al. [10] built a classification method based on multi-parametric features
obtained by assessing HRV (Heart Rate Variability) from ECG; the data is pre-processed and a
heart disease prediction model is built that classifies the heart disease of a patient.
Decision Tree
Some decision trees can only deal with binary-valued target classes. Others are able to
assign records to an arbitrary number of classes, but are error-prone when the number of
training examples per class gets small. This can happen rather quickly in a tree with many
levels and/or many branches per node.
The process of growing a decision tree is computationally expensive. At each node, each
candidate splitting field must be examined before its best split can be found.
Association Rule
Algorithms for discovering frequent sets are not directly suitable when the underlying
database is incremented intermittently.
The rules discovered can be poorly understandable.
Naïve Bayes
The main disadvantage is that it cannot learn interactions between features.
In a classification task, a big data set is needed in order to make reliable estimates of
the probability of each class.
The Naïve Bayes classification algorithm can be used with a small data set, but precision
and recall will remain very low.
Support Vector Machines
The problem needs to be formulated as a 2-class classification.
The learned function (weights) is difficult to understand.
Learning takes a long time (QP optimization).
Neural Network
Neural networks cannot easily be retrained. If data is added later, it is almost impossible
to add it to an existing network.
Handling of time series data in neural networks is a very complicated topic.
Proposed System
One of the major drawbacks of these works is that the main focus has been on the
application of classification techniques for heart disease prediction, rather than studying various
data cleaning and pruning techniques that prepare and make a dataset suitable for mining. It has
been observed that a properly cleaned and pruned dataset provides much better accuracy than an
unclean one with missing values. Selection of suitable techniques for data cleaning along with
proper classification algorithms will lead to the development of prediction systems that give
enhanced accuracy.
So in our proposed work, we plan to implement the Locality Sensitive Hashing (LSH)
technique for better classification. The problem LSH solves is that finding nearest neighbors
is very expensive, both in time and space, when operating in large feature spaces. It hashes
input vectors (e.g. bag-of-words vectors) in such a way that similar vectors are likely to
have the same hashes. Because of this property, lookup of near neighbors becomes a very
efficient operation. The most important applications of LSH are usually in high-dimensional
spaces. The other applications of the proposed method are
Input : Dataset
Preprocessing
Splitting
Output: Classification of the dataset into patients with heart disease and normal patients
Step 1: Input HD
Step 4: Hash all n points from the data set S into each of the L hash tables.
Step 6: Based on the query q, the algorithm iterates over the L hash functions g.
Step 7: Finally, classify the queried data as normal or abnormal.
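The hashing and query steps listed above can be illustrated with a minimal sketch. The paper does not state which hash family is used, so the sketch assumes random-hyperplane (sign of dot product) hashing and a majority vote over colliding points. The class name HyperplaneLSH and the parameters num_tables (L) and num_bits (k) are hypothetical names chosen for illustration, and the input features are assumed to be centred beforehand, since the hyperplanes pass through the origin.

```python
import random
from collections import Counter, defaultdict


class HyperplaneLSH:
    """Illustrative random-hyperplane LSH index with L tables of k bits each."""

    def __init__(self, dim, num_tables=8, num_bits=6, seed=0):
        rng = random.Random(seed)
        # One hash function g per table; each g consists of k random hyperplanes.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    @staticmethod
    def _key(planes, x):
        # Each bit records on which side of one hyperplane the point lies.
        return tuple(sum(w * v for w, v in zip(p, x)) >= 0.0 for p in planes)

    def add(self, x, label):
        # Step 4: hash the point into each of the L hash tables.
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, x)].append(label)

    def classify(self, q, default=None):
        # Step 6: iterate over the L hash functions and collect colliding labels.
        votes = Counter()
        for planes, table in zip(self.planes, self.tables):
            votes.update(table.get(self._key(planes, q), []))
        # Step 7: majority vote among the candidates (default if no collision).
        return votes.most_common(1)[0][0] if votes else default


# Toy usage with two made-up, centred 2-D points (not taken from the dataset).
index = HyperplaneLSH(dim=2, num_tables=4, num_bits=3, seed=1)
index.add([1.0, 1.0], "normal")
index.add([-1.0, -1.0], "abnormal")
print(index.classify([2.0, 2.0]))
```

Increasing num_bits makes buckets more selective, while increasing num_tables raises the chance that a true neighbour collides with the query in at least one table.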
The algorithm takes the heart disease dataset and classifies whether a person's heart is in
a normal or an abnormal condition. The proposed algorithm works in two phases. In the first
phase, preprocessing is done to fill in the missing values, followed by feature reduction,
and the dataset is then given as input to the proposed method. In the second phase, LSH finds
the similar records for the given query.
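The missing-value step of the first phase could look like the sketch below. The paper does not say how missing values are filled, so column-mean imputation is assumed here purely for illustration, with missing entries represented as None. The function name impute_missing is hypothetical, and feature reduction is not shown because the method used is not specified.

```python
def impute_missing(rows):
    """Replace None entries with the mean of the observed values in that column."""
    num_cols = len(rows[0])
    means = []
    for j in range(num_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 for a column with no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if r[j] is None else r[j] for j in range(num_cols)]
        for r in rows
    ]


# Toy records with made-up values; the second column has one missing entry.
records = [[63.0, 233.0], [41.0, None], [56.0, 250.0]]
print(impute_missing(records))
```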
Experimental Results
In this paper, we used a Locality Sensitive Hashing classifier for predicting heart disease.
The main goal of this paper is to compare our results with those of different classification
models. For that, we have compared our results with k-Nearest Neighbor and Support Vector
Machine. To implement our proposed algorithm, the Heart Disease dataset is taken from the UCI
repository. It consists of 270 instances and 14 features. This is shown in Table 1.
Table 2 shows the experimental results. Experiments are carried out to evaluate the usefulness
and the performance of the different classification algorithms for predicting heart disease.
Evaluation Criteria                 k-NN     SVM      LSH
Model building time (sec)           0.25     0.9      0.6
Correctly classified instances      243      247      262
Incorrectly classified instances    27       23       8
Accuracy (%)                        90       91.48    97.03
The following chart shows the performance of LSH compared with the k-NN and SVM models in
terms of model building time, correctly classified instances, incorrectly classified instances
and accuracy %.
Figure 2. a) Comparison chart of model building time b) Correctly classified instances
c) Incorrectly classified instances d) Accuracy %
From the above figure, it is clear that the model building time of LSH (0.6 sec) is lower than
that of SVM (0.9 sec), although higher than that of k-NN (0.25 sec). Similarly, out of 270
instances, our proposed method correctly classifies 262 instances, an accuracy of 97.03%
compared with 90% and 91.48% for k-NN and SVM respectively.
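The accuracy figures follow directly from the instance counts, as the short check below shows. It uses only the counts reported in this section; the function name accuracy_percent is chosen for illustration. Note that 262/270 is 97.037...%, which rounds to 97.04%, so the reported 97.03% appears to be truncated rather than rounded.

```python
TOTAL_INSTANCES = 270

# Correctly classified instances as reported for each classifier.
correct = {"k-NN": 243, "SVM": 247, "LSH": 262}


def accuracy_percent(n_correct, total):
    """Accuracy as a percentage of correctly classified instances."""
    return 100.0 * n_correct / total


for name, n_correct in correct.items():
    acc = accuracy_percent(n_correct, TOTAL_INSTANCES)
    print(f"{name}: {acc:.2f}% correct, {TOTAL_INSTANCES - n_correct} misclassified")
```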
Conclusion
In this paper, different classifiers are studied and experiments are conducted to find the
best classifier for predicting heart disease in patients. We proposed an approach to predict
heart diseases using machine learning techniques. Three techniques, k-NN, SVM and LSH, are
compared. The results show that the proposed method, LSH, outperforms the other two
classifiers in accuracy. Unlike conventional computer hashes that are designed to return exact
matches in O(1) time, an LSH algorithm uses dot products with random vectors to quickly find
nearest neighbors. LSH provides a probabilistic guarantee that it will return the correct
answer. In systems that have other sources of error (perhaps due to mislabeled data), one can
reduce the LSH error below the error due to other sources, while significantly improving the
computational performance. This makes LSH in particular, and randomized algorithms in general,
important in today’s world of Internet-sized databases. This study can be extended by
improving feature reduction and by using optimization techniques.
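The probabilistic guarantee mentioned above can be made concrete under one assumption: that the random-vector hashes are sign-of-dot-product (random hyperplane) hashes, which the paper does not state explicitly. In that case a single random hyperplane places two vectors on the same side with probability 1 - angle/pi, and with L tables of k bits each the two vectors share at least one bucket with probability 1 - (1 - p^k)^L. The function names below are hypothetical.

```python
import math


def bit_collision_probability(u, v):
    """Probability that one random hyperplane puts u and v on the same side."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return 1.0 - math.acos(cosine) / math.pi


def retrieval_probability(p, num_bits, num_tables):
    """Probability of sharing a bucket in at least one of L tables of k bits."""
    return 1.0 - (1.0 - p ** num_bits) ** num_tables


# Two made-up vectors at a small angle; more tables raise the retrieval chance.
p = bit_collision_probability([1.0, 0.2], [1.0, 0.0])
print(retrieval_probability(p, num_bits=6, num_tables=1))
print(retrieval_probability(p, num_bits=6, num_tables=8))
```

Choosing more tables therefore trades memory and query time for a lower chance of missing a true neighbour, which is how the LSH error can be pushed below other sources of error.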
References
8. Sellappan Palaniappan and Rafiah Awang, “Intelligent Heart Disease Prediction System
using Data Mining Techniques”, International Journal of Computer Science and Network
Security, Vol. 8, No. 8, pp. 1-6, 2008.
9. Shantakumar B. Patil and Y.S. Kumaraswamy, “Intelligent and Effective Heart Attack
Prediction System using Data Mining and Artificial Neural Network”, European Journal
of Scientific Research, Vol. 31, No. 4, pp. 642-656, 2009.
10. X. Yanwei et al., “Combination Data Mining Models with New Medical Data to Predict
Outcome of Coronary Heart Disease”, Proceedings of International Conference on
Convergence Information Technology, pp. 868-872, 2007.