
Ensemble of Heterogeneous Classifiers For Diagnosis and Prediction of Coronary Artery Disease With Red

This study presents a novel heterogeneous ensemble method for diagnosing coronary artery disease (CAD) using machine learning algorithms, specifically combining K-Nearest Neighbour, Random Forest, and Support Vector Machine. The proposed method employs feature selection techniques and achieves high classification accuracy, with the weighted-average voting algorithm (WAVEn) reaching 100% accuracy on a balanced dataset. The results indicate that this approach can significantly aid in the early diagnosis of CAD, potentially reducing mortality rates associated with the disease.


Computer Methods and Programs in Biomedicine 198 (2021) 105770

Contents lists available at ScienceDirect

Computer Methods and Programs in Biomedicine


journal homepage: www.elsevier.com/locate/cmpb

Ensemble of heterogeneous classifiers for diagnosis and prediction of coronary artery disease with reduced feature subset

Durgadevi Velusamy a,∗, Karthikeyan Ramasamy b
a Department of Computer Science and Engineering, M.Kumarasamy College of Engineering, Karur, Tamilnadu, 639 113, India
b Department of Electrical and Electronics Engineering, M.Kumarasamy College of Engineering, Karur, Tamilnadu, 639 113, India

Article info

Article history:
Received 9 June 2020
Accepted 19 September 2020

Keywords:
Cardiovascular disease
Coronary artery disease
Machine learning algorithms
Ensemble methods
Feature selection
Classification

Abstract

Background and Objective: Coronary artery disease (CAD) is considered one of the most prominent health issues causing high mortality in the world population. Hence, earlier diagnosis and prediction of CAD is essential for the proper medication of patients. The objective of this study is to develop a machine learning algorithm that will help in accurate diagnosis of CAD.
Methods: In this paper, we have proposed a novel heterogeneous ensemble method combining three base classifiers viz., K-Nearest Neighbour, Random Forest, and Support Vector Machine for effective diagnosis of CAD. The results of base classifiers are combined using ensemble voting technique based on average-voting (AVEn), majority-voting (MVEn), and weighted-average voting (WAVEn) for prediction of CAD. The random forest-based Boruta wrapper feature selection algorithm and feature importance of SVM are used for relevant feature selection based on attribute importance and rank.
Results: The proposed ensemble algorithm is developed using 5 features selected based on the feature importance and the performance of the algorithm is evaluated using the Z-Alizadeh Sani dataset. Further, the dataset is balanced using Synthetic Minority Over-sampling Technique and its performance is evaluated. The result analysis shows that the WAVEn algorithm achieves better classification accuracy, sensitivity, specificity and precision of 98.97%, 100%, 96.3% and 98.3% respectively for the original dataset. The WAVEn algorithm applied on the balanced dataset achieves 100% accuracy, sensitivity, specificity and precision in diagnosing CAD. To the best of author’s knowledge, the accuracy achieved by WAVEn is the highest accuracy when compared with the state-of-the-art algorithms in the literature for both original and balanced dataset.
Conclusions: The statistical results prove the robustness of the WAVEn algorithm in reliably discriminating the CAD patients from healthy ones with high precision, and therefore it can be used for developing a decision support system for diagnosing CAD at an early stage.
© 2020 Elsevier B.V. All rights reserved.

∗ Corresponding author.
E-mail addresses: [email protected] (D. Velusamy), [email protected] (K. Ramasamy).
https://doi.org/10.1016/j.cmpb.2020.105770
0169-2607/© 2020 Elsevier B.V. All rights reserved.

1. Introduction

Cardiovascular disease (CVD), also known as heart disease, is one of the clinical disorders that occurs due to the abnormal functioning of the heart [23]. Heart disease can be commonly classified as coronary artery disease (CAD) and heart failure (HF). According to the statistics of the World Health Organization (WHO) [63], it has been estimated that 26 million of the world's adult population are suffering from heart diseases. Coronary Artery Disease (CAD) occurs due to the accumulation of plaques inside the coronary arteries. Atherosclerosis is a condition where plaque builds up, narrowing the artery lumen and limiting the flow of oxygenated blood to the heart. This decreased supply of oxygenated blood is inadequate for the heart muscles and causes pain in the chest, neck, arms, and shoulder, called angina. The complete blockage of oxygenated blood leads to a heart attack.
The various risk factors identified as the cause for CAD are hypertension, stress, diabetes, smoking, unhealthy food intake, physical inactivity, high cholesterol, the genetic history of a person, and so on [45]. The clinical diagnosis of CAD is regarded as a great challenge especially in hugely populated countries like India as it requires lots of medical experts. The death rate due to heart attack is high due to a lack of health awareness and knowledge among the patients, insufficient diagnostic devices, and medical experts
[46]. In such a case, an earlier diagnosis of CAD along with proper medication will drastically decrease the overall deaths in the country.
Coronary angiography, the gold standard, is used for the diagnosis of CAD [8]. However, this approach is expensive and time-consuming. Therefore, a suitable health-care application for automatic CAD diagnosis using intelligent machine learning techniques will assist a cardiologist in an earlier diagnosis of the disease. The performance of any diagnostic system greatly depends on the algorithm and the number of feature variables used for disease diagnosis and prediction. The feature selection algorithms can be broadly classified into three categories, namely filter methods, wrapper methods, and embedded methods. Various attribute selection algorithms like weights by Support Vector Machine (SVM) [20], correlation [30], Gini index [31], information gain [31], principal component analysis [30], Boruta Wrapper Feature Selection [35], recursive feature elimination [40], and so on are used for identifying important features of CAD. In general, a relevant feature subset selection has the potential to substantially improve the testing performance of any machine learning algorithm on unseen data samples in terms of accuracy and learning ability.
In this study, we have employed the Random Forest based Boruta wrapper feature selection algorithm and the SVM variable importance measure to select the significant attributes associated with the CAD dataset. To the best of our knowledge, no prior works in the literature have applied the Boruta-based attribute selection algorithm on the Z-Alizadeh Sani dataset [55] for the prediction of CAD. Ensemble classification has been proved to be an effective way of improving disease diagnosis accuracy over single classifier models. The main aim of this work is to develop an ensemble of heterogeneous classifiers that will help to detect and predict CAD at an earlier stage. An ensemble of heterogeneous classifiers combining the K-Nearest Neighbour [31], Random Forest Ensemble [21], and Support Vector Machine as the base classifiers is used for developing an ensemble model. Further, the class probabilities of the three base classifiers are combined to develop a voting-ensemble based on the average of the posterior probability (AVEn), a majority vote of the class probability (MVEn) and a weighted average of the posterior probability (WAVEn) to obtain the final classification accuracy of the ensemble method in predicting CAD.
In the proposed work, we have evaluated the performance of the base classifiers and ensemble methods with five significant features on the Z-Alizadeh Sani dataset separated into training and testing datasets. Ten-fold cross-validation is applied to the training set and the testing set is used to test the performance of the classifier on the unseen data. Besides, we have balanced the dataset using Synthetic Minority Over-sampling Technique (SMOTE) and applied the same procedure as that of the original dataset to evaluate model performance on the balanced dataset. Experiments are conducted to evaluate the performance of these algorithms with reduced feature subset in terms of accuracy, sensitivity, specificity, Area under the Curve (AUC), and Matthews Correlation Coefficient (MCC) for binary classification. The experimental result shows that the proposed WAVEn classification technique has achieved the highest average accuracy of 98.97% with the original dataset and an accuracy of 100% for the balanced dataset in the detection of CAD. To the best of our knowledge, these are the highest classification accuracies achieved so far when compared with other state-of-the-art algorithms in the literature.
The significant contributions of this paper are as follows:

(1) We have proposed a heterogeneous ensemble method by combining Random Forest, K-NN and SVM and the result of base classifiers is combined using a voting technique for earlier and effective diagnosis of CAD.
(2) An attribute selection measure is used for selecting relevant features based on their relative importance to minimize classification time and improve accuracy in CAD prediction.
(3) To improve the disease diagnosis accuracy and reduce the false predictions, the weights in the weighted-average voting technique are assigned based on the predictive performance of the base classifier.
(4) The performance of the classifier models is evaluated using the original and SMOTE balanced datasets to analyze the predictive accuracy of each model in identifying the CAD patients as patients and healthy persons as healthy. The model performance is validated using metrics like accuracy, sensitivity, specificity, precision, F measure, MCC, kappa and Area under the Curve.

The rest of the paper is organized as follows. Section 2 describes the related works in the literature, especially for detection of CAD. Section 3 provides details about feature selection algorithms and the proposed heterogeneous ensemble classifiers for prediction of CAD. The experimental results of the classification algorithms for the original and balanced dataset are presented in Section 4. The discussion about the performance of the proposed algorithm in classification of CAD is given in Section 5. Finally, Section 6 summarizes the conclusion with future works.

2. Background

2.1. A review of previous research works

Data mining and machine learning algorithms have gained great attention from many researchers in several domains like communication security [59], optimization [60], predictive analytics [22,37], smart grid [61], automatic disease diagnosis and detection and so on. Numerous studies exploiting machine learning algorithms have been reported in the literature for automatic disease diagnosis and classification, namely coronary artery disease [1,5–7,19,62], obstructive sleep apnea [57], cancer diagnosis [16,27], Alzheimer diseases [33], chronic kidney disease [40] and so on. In this section, we review some of the significant works reported in the literature for CAD diagnosis using computational intelligence algorithms.
There are several studies reported in the literature for the diagnosis and classification of CAD. Generally, most of the studies reported in [29,34,43,48–52,56,58] have either used Electrocardiogram (ECG) signals or analyzed the heart rate signals for diagnosis of CAD. Nonetheless, data mining and machine learning algorithms are widely used for prediction and classification of CAD. A fuzzy expert system is developed in [42] for CAD prediction using the clinical parameters and the algorithm is able to achieve a sensitivity of 95.85% and specificity of 83.33% with 84.20% classification accuracy. The following are some of the works in the literature that have applied prediction techniques particularly on the Z-Alizadeh Sani CAD dataset [55] published in 2013. The authors in [5] applied algorithms like Artificial Neural Network (ANN), Naive Bayes (NB), Bagging Algorithm, and Sequential Minimal Optimization (SMO) with feature selection and creation algorithms for classification of CAD. The result shows that the SMO achieved a classification accuracy of 94.08% in the prediction of CAD. The disease diagnostic system developed in [6] utilized information gain and Support Vector Machine (SVM) for feature selection and diagnosis of CAD. The algorithm achieved an accuracy of 86.14% for stenosis diagnosis of left anterior descending (LAD), 83.17% for left circumflex (LCX), and 83.50% for right coronary artery (RCA) respectively.
The non-invasive detection of CAD in high-risk patients is proposed in [7] for CAD prediction using SVM based on the stenosis of three coronary arteries, namely LCX, LAD, and RCA. This machine learning algorithm has achieved an accuracy of
96.40%, sensitivity of 100% and specificity of 88.1% in the detection of CAD. The authors in [9] proposed a machine learning algorithm based on SVM with different kernels for diagnosing the stenosis of each individual artery. The PCA and Gini index algorithms are used for feature selection. The proposed algorithm achieved a classification accuracy of 86.43%, 83.67%, and 82.67%, in diagnosing LCA, LCX and RCA stenosis respectively. A hybrid PSO algorithm is proposed in [64] for rule discovery and diagnosing CAD. This study achieved an accuracy of 84.25% using 13 feature attributes in prediction of CAD. An SVM algorithm is presented in [10] for automatic diagnosis of stenosis of LAD, LCX and RCA arteries. The hyper-parameters of SVM kernels are tuned using the genetic algorithm. The SVM with RBF kernel function is utilized for diagnosing stenosis of the individual arteries and achieved a classification accuracy of 86.64%, 83.47% and 82.85% in diagnosing stenosis of LAD, LCX and RCA arteries respectively. The study in [11] has utilized C4.5, Naive Bayes, and K-NN for diagnosing the stenosis of each individual coronary artery. The C4.5 algorithm achieved an accuracy of 74.20% for LAD, 63.76% for LCX and 68.33% for RCA coronary arteries respectively.
The authors in [12,13] employed cost-sensitive algorithms along with base classifiers like KNN, SMO, C4.5, SVM and Naive Bayes for classification of CAD with high sensitivity. The SMO algorithm achieved a sensitivity of 97.22% and an accuracy of 92.09%. Utilizing the Bagging and C4.5 classification algorithms, the authors in [14] diagnosed the stenosis of coronary arteries with a classification accuracy of 79.54%, 61.46% and 68.96% in diagnosing stenosis of LAD, LCX and RCA arteries respectively. The authors in [15] utilized the SMO, Naive Bayes and Ensemble algorithm with 16 features for diagnosing CAD. The ensemble method achieved an accuracy of 88.25% in predicting CAD. A heterogeneous hybrid feature selection algorithm is employed in [39] for classification of CAD. The authors employed Decision tree, Random Forest, Gaussian Naive Bayes, and XGBoost classifiers on the dataset balanced using SMOTE and Adaptive synthetic (ADASYN) sampling techniques. The XGBoost classifier with SMOTE technique has achieved a classification accuracy of 92.58% in diagnosing CAD. A machine learning algorithm for diagnosing CAD presented in [2] employs genetic algorithm and particle swarm optimization for feature selection and SVM for identification of CAD. It achieves an accuracy of 93.08% and F1 score of 91.51% in predicting CAD. A hybrid method combining Multi-layer Perceptron (MLP) Neural Network and Genetic Algorithm (GA) is proposed in [19] for the classification of CAD that used 90% of data for training and 10% for testing. The authors have used the weights by SVM feature selection method and achieved a classification accuracy of 93.85%, a sensitivity of 97%, and a specificity of 92% in predicting CAD. A decision tree-based CAD diagnosis using classification and regression tree (CART) developed in [28] has utilized a feature importance measure for relevant feature selection. The algorithm has achieved an accuracy of 100% for 18 and 10 features and an accuracy of 92.41% for five features using the CART model.
Most of the research works reported in the literature have utilized a single classifier model for CAD diagnosis like Decision Tree (CART) [28], K-Nearest Neighbour (KNN) [5], Artificial Neural Networks (ANN) [5,62], Naive Bayes [5], Support Vector Machine (SVM) [7] and so on. However, a single classifier model cannot assure a high level of accuracy in all contexts, as each classifier model has its own merits and demerits. To the best of the author’s knowledge, no prior works except the studies presented in [1,15,47] have used ensemble classification for CAD diagnosis using the Z-Alizadeh Sani dataset [8]. A nested ensemble algorithm (NE-Nu-SVC) with feature selection and multi-step balancing [1] has achieved a classification accuracy of 94.66% on the balanced dataset. An ensemble algorithm for integrating multiple criteria feature selection algorithm is presented in [47] that achieved an accuracy of 93.7% in prediction of CAD. Although several algorithms have been developed for diagnosis and prediction of CAD, only a few works have used random forest ensemble for CAD diagnosis even though it performed well [1]. Moreover, there are several studies in the literature that have applied ensemble methods for prediction tasks in various fields like agriculture [24], cancer disease diagnosis [16,27,38], attack detection [41], and so on. This motivated us to develop an ensemble classifier for identification and prediction of CAD with a reduced feature set for earlier diagnosis, prediction, and treatment of CAD.
Ensemble techniques can be categorized into two types, namely homogeneous and heterogeneous [27]. When an ensemble method combines a base method with two or more configurations or variants, then it is homogeneous, whereas a heterogeneous ensemble method combines one or more base methods with a meta-ensemble method like Bagging, Boosting or random subspace. Meanwhile, a heterogeneous ensemble can have a combination of two different base classifiers. An ensemble technique having the capability of accomplishing a favorable trade-off between heterogeneity among base classifiers and diversity in the same training dataset can achieve better accuracy and ability to handle different training errors of base classifiers. In general, an ensemble learning technique enhances the performance and improves the overall accuracy of prediction [24,27,41].
This paper investigates the performance of an ensemble of heterogeneous classifiers with reduced feature subset built with three base classifiers, namely Random Forest ensemble, KNN, and SVM for earlier diagnosis and prediction of CAD. Initially, the Random Forest (RF) ensemble classifier along with two single classifiers viz., KNN and SVM are trained as base classifier models to predict the probability of a particular class and perform binary classification. Then, the posterior probabilities of the individual base classifier algorithms are combined to compute the classification result of the voting ensemble based on average-voting (AVEn), majority-voting (MVEn) and weighted-average voting (WAVEn) techniques.

2.2. Medical dataset used

The Z-Alizadeh Sani medical dataset contains information about 303 patients as clinical records with 56 feature attributes freely available in the University of California, Irvine machine learning repository [55]. The dataset consists of 216 records of CAD patients and 87 records of healthy persons with 55 independent feature attributes or predictors and one output or response variable. The medical dataset is based on the angiography procedure conducted by Alizadeh Sani to measure the stenosis of each artery. The response variable has two values based on the angiographic disease status, namely (i) value 0 for the absence of CAD and (ii) value 1 for the presence of CAD. The target class is set to value 0 when the diameter narrowing of an artery is less than 50%, in which case the person is a non-CAD patient (healthy or normal). Otherwise, it is set to value 1: when the arteries have ≥ 50% diameter narrowing, the particular patient record is categorized as a CAD patient (a person having CAD).
The dataset is highly imbalanced with 71.29% of the patient records belonging to CAD patients, and the remaining 28.71% of the records belonging to normal or healthy persons. The dataset features can be categorized into four groups: demographic features, electrocardiography features, symptoms and physical examination features, and laboratory and echocardiography features. The demographic group consists of 16 parameters, 7 features are related to electrocardiography, the symptoms and physical examination results consist of 14 clinical parameters, and 17 feature attributes are extracted from laboratory tests and echocardiography results. The details about the features of the Z-Alizadeh Sani dataset along with the range or type of value are given in Table 1.


Table 1
Range value of various features of Z-Alizadeh Sani dataset.

Category | Feature name | Range or type
Demographic Features | Age (yrs) | 30 - 86
 | Sex | Male, Female
 | Weight (kg) | 48 - 120
 | Body Mass Index (BMI, kg/m2) | 18 - 41
 | Diabetes Mellitus (DM) | Yes, No
 | Hyper Tension (HTN) | Yes, No
 | Current Smoker | Yes, No
 | Ex-smoker | Yes, No
 | Family History (FH) | Yes, No
 | Obesity | Yes (BMI > 25), else No
 | Chronic Renal Failure (CRF) | Yes, No
 | Cerebrovascular Accident (CVA) | Yes, No
 | Thyroid disease | Yes, No
 | Airway disease | Yes, No
 | Congestive Heart Failure (CHF) | Yes, No
 | Dyslipidemia (DLP) | Yes, No
Electrocardiography | Rhythm | Sin, AF
 | ST elevation | Yes, No
 | ST depression | Yes, No
 | Q-wave | Yes, No
 | T inversion | Yes, No
 | Left Ventricular Hypertrophy (LVH) | Yes, No
 | Poor R-wave progression | Yes, No
Symptoms and Physical examination | Blood Pressure (BP, mmHg) | 90 - 190
 | Pulse Rate (PR) | 50 - 110
 | Edema | Yes, No
 | Weak peripheral pulse | Yes, No
 | Lungs rales | Yes, No
 | Systolic murmur | Yes, No
 | Diastolic murmur (DM) | Yes, No
 | Typical Chest Pain | Yes, No
 | Dyspnea | Yes, No
 | Function Class | 1, 2, 3, 4
 | Atypical | Yes, No
 | Nonanginal Chest Pain | Yes, No
 | Exertional Chest Pain | Yes, No
 | Low Threshold Angina (LowTH Ang) | Yes, No
Laboratory Tests and Echocardiography | Fasting Blood Sugar (FBS, mg/dL) | 62 - 400
 | Creatine (Cr, mg/dL) | 0.5 - 2.2
 | Triglyceride (TG, mg/dL) | 37 - 1050
 | Low Density Lipoprotein (LDL, mg/dL) | 18 - 232
 | High Density Lipoprotein (HDL, mg/dL) | 15 - 111
 | Blood Urea Nitrogen (BUN, mg/dL) | 6 - 52
 | Erythrocyte Sedimentation Rate (ESR, mm/h) | 1 - 90
 | Hemoglobin (HB, g/dL) | 8.9 - 17.6
 | Potassium (K, mEq/lit) | 3.0 - 6.6
 | Sodium (Na, mEq/lit) | 128 - 156
 | White Blood Cell (WBC, cells/mL) | 3700 - 18000
 | Lymphocyte (Lymph %) | 7 - 60
 | Neutrophil (Neut %) | 32 - 89
 | Platelet (PLT, 1000/mL) | 25 - 742
 | Ejection Fraction (EF-TTE %) | 15 - 60
 | Region with Regional wall motion abnormality (Region.RWMA) | 0, 1, 2, 3, 4
 | Valvular Heart Disease (VHD) | Normal, Mild, Moderate, Severe
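
The labelling rule and the class proportions stated in Section 2.2 can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `cad_label` and `class_distribution` are hypothetical, and the label list is rebuilt from the reported class counts (216 CAD, 87 normal) rather than read from the dataset file.

```python
from collections import Counter


def cad_label(max_stenosis_percent: float) -> int:
    """Return 1 (CAD) when diameter narrowing is >= 50%, otherwise 0 (normal)."""
    return 1 if max_stenosis_percent >= 50 else 0


def class_distribution(labels):
    """Return {label: percentage of records}, rounded to two decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Labels rebuilt from the reported counts: 216 CAD records and 87 normal records.
labels = [1] * 216 + [0] * 87
distribution = class_distribution(labels)
print(distribution)
```

The resulting percentages agree with the 71.29% and 28.71% split quoted in the text, which is the imbalance that SMOTE is later used to correct.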

3. Methods

In this section, we discuss the Boruta wrapper feature selection algorithm and the embedded feature selection algorithm based on SVM used for selecting relevant feature attributes. After attribute selection, the Random Forest ensemble, K-Nearest Neighbour and Support Vector Machine classifiers are trained as base classifier models using the selected features. The proposed ensemble method is a heterogeneous ensemble voting technique that combines the posterior probabilities of the base classifiers. The results of the base classifiers are combined with average voting, majority voting and weighted-average voting ensemble techniques to find the final classification result for diagnosing CAD.

3.1. Feature selection algorithms

In this study, we have used a wrapper-based Boruta feature selection algorithm and an embedded algorithm based on SVM feature importance for selecting relevant features from the dataset. We have applied a 10-fold cross-validation technique on the training data for selecting the significant features of importance. The features with higher weights are ranked to find and select a good combination of features. The attribute selection measure helps to identify the subset of attributes that are most relevant, significant, and less redundant. We have selected only those features that have been selected by more than one algorithm based on the feature significance and relevance measure to predict CAD.
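
The rule of keeping only features chosen by more than one selection algorithm can be sketched as an ordered intersection. In the sketch below, `svm_ranked` follows the top-12 SVM ranking listed in Section 3.1.2, while `boruta_confirmed` is a hypothetical stand-in (the Boruta result is reported only graphically in Fig. 1); the helper name `common_features` is likewise an assumption made for illustration.

```python
def common_features(ranked, other_selected, top_n):
    """Keep features from `ranked` (best first) that also appear in `other_selected`."""
    shared = [name for name in ranked if name in other_selected]
    return shared[:top_n]


# Ranking taken from the SVM variable importance list in Section 3.1.2.
svm_ranked = [
    "Typical.Chest.Pain", "Atypical", "Age", "Region.RWMA", "EF.TTE",
    "Nonanginal", "HyperTension", "FBS", "Tinversion", "BP", "Diabetes", "TG",
]

# Hypothetical placeholder for the Boruta "confirmed" set (not listed in the text).
boruta_confirmed = {
    "Typical.Chest.Pain", "Atypical", "Age", "Region.RWMA",
    "EF.TTE", "Nonanginal", "HyperTension",
}

top_five = common_features(svm_ranked, boruta_confirmed, top_n=5)
print(top_five)
```

Under these assumed inputs the five retained names coincide with the five features used for model development in Section 3.1.3, although the actual selection in the study was made by trial and error.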


Fig. 1. Feature selection using Boruta wrapper feature selection algorithm.

3.1.1. Boruta feature selection algorithm


Boruta is a random-forest based feature selection algorithm that follows the all-relevant feature selection approach [35]. It works by introducing randomness into the dataset by making shuffled copies of all features (shadow features). Then, the algorithm trains a random forest classifier to evaluate the feature importance based on the Mean Decrease of Accuracy or Mean Decrease of Impurity measure as shown in Fig. 1.
During each iteration, the algorithm tries to find significant features by comparing the Z-score of each feature with that of the randomly shuffled copies of features (shadow features). An attribute is regarded as important when its Z-score is higher than the maximum Z-score of the shadow features, with the comparison assessed using a binomial distribution. Finally, all the attributes are categorized as confirmed, tentative, or rejected based on their importance scores. The box plot representation shown in Fig. 1 illustrates that the algorithm has selected 15 features as confirmed (shown in green color), 2 features as tentative (shown in yellow color), and the remaining features are rejected (shown in red color) based on their score.
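
The shadow-feature comparison at the heart of Boruta can be illustrated with a deliberately simplified, single-round sketch. Two assumptions should be kept in mind: the importance score here is the absolute Pearson correlation with the label, used as a stand-in for the random forest Z-score, and the repeated iterations with a binomial test are omitted. The function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def abs_correlation(xs, ys):
    """Stand-in importance score: absolute Pearson correlation with the label."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_round(columns, labels, rng):
    """One Boruta-style round: return the features that beat the best shadow."""
    shadow_scores = []
    for values in columns.values():
        shadow = values[:]       # copy the column ...
        rng.shuffle(shadow)      # ... and break its link to the label
        shadow_scores.append(abs_correlation(shadow, labels))
    threshold = max(shadow_scores)
    return {
        name for name, values in columns.items()
        if abs_correlation(values, labels) > threshold
    }


rng = random.Random(0)
labels = [i % 2 for i in range(200)]
columns = {
    "informative": [y + rng.gauss(0, 0.1) for y in labels],  # tracks the label
    "noise": [rng.gauss(0, 1) for _ in labels],              # unrelated to the label
}
hits = shadow_round(columns, labels, rng)
print(hits)
```

In the full algorithm this comparison is repeated many times, and a feature is confirmed or rejected only after its count of wins over the best shadow feature is statistically significant.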
Fig. 2. Feature selection using variable importance measure of SVM algorithm.
3.1.2. Feature Selection based on SVM
The feature selection based on the variable importance function of SVM is shown in Fig. 2. It is based on the model information, where the model is trained to incorporate the relation between the predictors for computing the variable importance [20]. In variable importance computation, the feature attributes selected by SVM have the same importance for both the classes and the importance measure ranges from 0 to 100. The main advantage of using the in-built method for model evaluation is that the performance of the model is closely related to it. Fig. 2 shows the top 20 feature variables selected based on their importance using the in-built variable importance measure of SVM. However, we have selected only the top 12 features viz., Typical.Chest.Pain, Atypical, Age, Region.RWMA, EF.TTE, Nonanginal, HyperTension, FBS, Tinversion,
BP, Diabetes, and TG having importance measure greater than 20 for our evaluation.

3.1.3. Features selected for model evaluation
This section discusses the number of features selected for model evaluation for predicting CAD using the base classifier and ensemble models. In this study, we have employed a trial and error procedure to find the number of significant features that can predict CAD more accurately than previous methods reported in the literature. From Figs. 1 and 2, it can be observed that the significant features selected by both the algorithms are the same with some changes in the ranking order. Hence, we initially started with the top seven features that are commonly selected by the two feature selection algorithms, namely Typical Chest Pain, Atypical, Age, Region.RWMA, Nonanginal, EF.TTE and Hyper Tension (HTN) for model evaluation. Then, we selected the top five features for model evaluation and diagnosis of CAD. In the present study, the proposed model is developed and evaluated using the five features, namely Typical Chest Pain, Atypical, Age, Region.RWMA, and EF.TTE for diagnosis of CAD.

3.2. Data partitioning

The Z-Alizadeh Sani dataset is randomly divided into a training subset and a testing subset with the selected features for the development of the three base classifiers. The dataset is split into two parts with 90% of data for training and 10% for testing respectively. In the training process, 10-fold cross-validation with 3 repetitions is utilized with a suitable measure to tune each base classifier according to the training errors, and the average of the results is computed. The data is more efficiently used by taking k − 1 subsets, i.e., 9 subsets, for training and one for testing to determine the parameters of the model by repeating the iteration k times. The parameter tuning is done by exploiting the exhaustive grid search algorithm for predicting CAD using the training subset. Thus, the cross-validation procedure makes available different subsets of samples for training the model, making the model more robust against errors. Once the model is trained and built using ten-fold cross-validation, the testing subset is used to assess the predictive ability of the base classifiers on the unseen test samples during the prediction (testing) process.

3.3. Proposed heterogeneous ensemble classification for CAD prediction

Let us consider, for example, that the dataset D consists of patient records with n feature variables and m classes, and that there are N patient records in the dataset. This is a classification problem in which sample X is associated with one of the possible classes out of m classes $(C_1, C_2, C_3, \ldots, C_m)$ in the dataset. In our study, the CAD dataset has two classes, namely normal or healthy person and abnormal or CAD patient. Let us call them class $C_1$ for normal, represented as value 0, and class $C_2$ for CAD, represented as value 1 in the dataset. Select k most relevant features from D based on fea-

classifiers are tuned to improve the predictive accuracy by reducing training errors. Then, the trained model is tested using the unseen testing samples and a probability score for each class is computed to find the class outcome of sample X. The class with the highest probability score is the final winning class chosen by each classifier. The probability predicted by each classifier is called the posterior probability that is used by the ensemble voting method for prediction of CAD. In final ensemble-based classification, the testing data are classified using the average voting, majority voting and weighted-average voting ensemble techniques.
Algorithm 1 gives a detailed procedure about the flow of the proposed heterogeneous ensemble classification algorithm for the prediction of CAD.

3.3.1. Average voting ensemble (AVEn)
The average voting ensemble is simply summing up the posterior probabilities obtain for each sample from the M classification

Algorithm 1 Proposed ensemble voting algorithm for CAD prediction.
1: procedure Heterogeneous Ensemble Classification
   Input: $D = f_1, f_2, \ldots, f_n$ Training dataset consists of n feature variables, m classes, N number of patient records in the dataset, M Base classifier algorithms $CA_1, CA_2, \ldots, CA_M$.
   Output: Prediction results of Ensemble Classifiers.
2: Select k most relevant features from D based on feature significance rank and importance measure R, $R(f_1) > R(f_2) > \ldots > R(f_k)$
3: Partition the dataset into two sets: Training subset ($DTR_k$) with P samples and Testing subset ($DTE_k$) with Q samples.
4: for i = 1 to M do
5:    Apply 10-fold cross validation on $DTR_k$ using $CA_i$ with the k selected features.
6:    Predict the result of $CA_i$ for all samples in $DTR_k$
7:    Tune the hyper-parameter of $CA_i$ to improve prediction accuracy and goto step 5.
8: end for
9: for i = 1 to M do
10:   Evaluate the performance of $CA_i$ on $DTE_k$ with the k selected features.
11:   Compute the class probability and predict the result of $CA_i$ for all samples in $DTE_k$.
12: end for
13: Apply average-voting $y = \frac{1}{M}\left(\Pr(CA_1(X)) + \Pr(CA_2(X)) + \ldots + \Pr(CA_M(X))\right)$
14: Apply majority-voting $y = \mathrm{mode}(CA_1(X), CA_2(X), \ldots, CA_M(X))$
15: Apply weighted-average voting $y = W_1 \times \Pr(CA_1(X)) + \ldots + W_M \times \Pr(CA_M(X))$
16: Compute the Prediction results of Ensemble Voting classifiers.
17: end procedure
ture significance rank and importance measure R, R( f 1 ) > R f2 ) > algorithms (C A1 , C A2 , . . . , C AM ) and then assigns the sample with a
, . . . , > R( fk ). The schematic diagram of the proposed heteroge- particular class based on the average value of the posterior proba-
neous ensemble classification algorithm for diagnosis and predic- bilities obtained from the base classifiers is computed using Eq. (1).
tion of CAD is shown in Fig. 3.
Split the dataset D into two subsets DTR and DTE with k se- 1 

y= P r (CA1 (X )) + P r (CA2 (X )) + . . . + P r (CAM (X )) (1)
lected features having P training samples and Q testing samples M

such that D = DT R DT E. The set DTR is used as a training set and if 
y > 0.5 then sample X ∈ CAD, otherwise Normal.
DTE is used as a testing set. The dataset is divided at a ratio of 90%
training and 10% testing respectively. Initially, the base classifiers 3.3.2. Majority voting ensemble (MVEn)
RF, K-NN, SVM are trained on the training samples using the 10- The majority voting or hard voting classifies a sample based
fold cross-validation technique. The hyper-parameters of the base on the votes received for a particular class from the M different

D. Velusamy and K. Ramasamy Computer Methods and Programs in Biomedicine 198 (2021) 105770

Fig. 3. A typical framework for prediction of CAD.

base classifiers ($CA_1, CA_2, \ldots, CA_M$). The class that receives the majority of the votes, based on the posterior probabilities, is chosen as the final class of the sample, as given in Eq. (2).

$$\hat{y} = \operatorname{mode}\{CA_1(X), CA_2(X), \ldots, CA_M(X)\} \tag{2}$$

3.3.3. Weighted average voting ensemble (WAVEn)

The weighted average voting ensemble assigns a weight to each base classifier based on its performance. The weights are positive values between 0 and 1, and the sum of all weights is equal to 1. The weights are assigned based on the predictive performance of the classifier model on the unseen test data; each weight indicates the degree of trust placed in a particular classifier based on its performance. The final classification result is computed using Eq. (3).

$$\hat{y} = W_1 \times \Pr(CA_1(X)) + W_2 \times \Pr(CA_2(X)) + \ldots + W_M \times \Pr(CA_M(X)) \tag{3}$$

such that $W_1 + W_2 + \ldots + W_M = 1$. If $\hat{y} > 0.5$, then sample X ∈ CAD, otherwise Normal.
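The three voting rules of Eqs. (1)-(3) can be illustrated with a minimal Python sketch. This is not the authors' R implementation; the function names, the probabilities, and the weights below are hypothetical values chosen only to show that the three rules can disagree on the same sample.

```python
from collections import Counter


def average_vote(probs):
    """Eq. (1): mean of the CAD probabilities from the M base classifiers."""
    return sum(probs) / len(probs)


def majority_vote(labels):
    """Eq. (2): most frequent class label (no ties for binary labels and odd M)."""
    return Counter(labels).most_common(1)[0][0]


def weighted_average_vote(probs, weights):
    """Eq. (3): weighted sum of CAD probabilities; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probs))


def to_label(score, threshold=0.5):
    """Decision rule from the text: score > 0.5 means CAD (1), otherwise Normal (0)."""
    return 1 if score > threshold else 0


# Hypothetical CAD probabilities for one sample from RF, K-NN and SVM.
probs = [0.9, 0.3, 0.4]
labels = [to_label(p) for p in probs]      # hard votes of the base classifiers
weights = [0.6, 0.1, 0.3]                  # hypothetical trust in each classifier

aven = to_label(average_vote(probs))
mven = majority_vote(labels)
waven = to_label(weighted_average_vote(probs, weights))
print(aven, mven, waven)
```

In this made-up case, two of the three classifiers vote Normal, so majority voting predicts Normal, while the average and weighted-average rules predict CAD because the single confident classifier dominates the probabilities.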


4. Experimental results

In this section, we discuss the experiments carried out to evaluate the performance of the proposed heterogeneous ensemble technique. The experiments are conducted on a 2.5–4.0 GHz Intel dual-core i7 processor with 16 GB RAM running the Mac-10.13.2 operating system. R version 3.6.1 and RStudio 1.2.1335 are used to develop the algorithms for classification of CAD. An extensive set of simulations in R is carried out on the original dataset to evaluate the performance of the base classifiers, namely the Random Forest ensemble classifier, K-Nearest Neighbour, and Support Vector Machine with a radial basis kernel. The ensemble classification approach, based on average voting, majority voting and weighted-average voting, is implemented by combining the results of the base classifiers.

Meanwhile, the classes in the dataset are balanced using the Synthetic Minority Over-sampling Technique (SMOTE), and the performance of the base classifiers and ensemble classifiers is evaluated on the balanced dataset. From the experimental results, it is observed that balancing the data improved the overall performance, with a considerable improvement in the classification accuracy, sensitivity, specificity and F measure of the base classifiers as well as the ensemble classifiers, specifically due to the increase in the number of training samples in the balanced dataset.

In this study, the experiments are executed for 10 independent trials with the same set of parameters to avoid random errors. Each performance evaluation measure is the average value taken from the 10 independent simulations with different random seed values. These results are reported for the base classifiers and ensemble methods.

4.1. Performance evaluation measures

This section presents the performance evaluation of the base classifiers and the ensemble of heterogeneous classifiers in diagnosing and predicting CAD. The performance of a classification algorithm is commonly measured in terms of accuracy. However, relying only on classification accuracy can be misleading, especially for an imbalanced medical dataset. Hence, in addition to accuracy, the evaluation metrics sensitivity, specificity, precision, F measure, Kappa, MCC and area under the curve (AUC) are computed to assess the performance of the classifier models. The effectiveness of a classifier is evaluated using the quantities obtained from the confusion matrix, namely True Positive (TP), i.e., CAD predicted as CAD; True Negative (TN), i.e., Normal predicted as Normal; False Positive (FP), i.e., Normal predicted as CAD; and False Negative (FN), i.e., CAD predicted as Normal.

Accuracy is the fraction of correctly classified samples out of the total samples in the testing set. It is computed using Eq. (4).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

Sensitivity, or True Positive Rate (TPR), and Specificity, or True Negative Rate (TNR), of a classifier model are computed using Eqs. (5) and (6) respectively.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{5}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{6}$$

Precision and F measure are computed using Eqs. (7) and (8) respectively.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{7}$$

$$F1 = 2 \times \frac{\text{precision} \times \text{sensitivity}}{\text{precision} + \text{sensitivity}} \tag{8}$$

Matthew's Correlation Coefficient (MCC) is a measure of the quality of a binary classifier, here for diagnosing CAD in a patient, and is computed using Eq. (9).

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{9}$$

The kappa coefficient takes into account the accuracy of a random classifier when evaluating the achieved classification accuracy, as given in Eq. (10).

$$\kappa = \frac{acc - rand}{1 - rand} \tag{10}$$

where acc is the accuracy achieved by an algorithm and rand is the accuracy achieved by the algorithm with random outputs.

4.2. Classification results

In this section, the classification performance of the proposed heterogeneous ensemble classifier models and the three base classifiers, using the five features Typical Chest Pain, Age, Atypical, Region.RWMA and EF-TTE to predict CAD, is reported for the original Z-Alizadeh Sani dataset and for the dataset balanced using the SMOTE sampling technique.

4.2.1. Original dataset

This section reports the performance of the ensemble models and the three base classifiers for prediction of CAD on the original dataset in terms of classification accuracy, Kappa, sensitivity, specificity, precision, F1 measure and MCC. The developed model applies 10-fold cross-validation on the training samples, and the testing data do not affect model training. The test samples are tested independently to evaluate the performance of the models on unseen data.

The original dataset consists of 303 samples, of which 216 are diseased and 87 are healthy. The dataset is split into 274 training samples and 29 test samples. The average testing results of 10 runs in terms of accuracy, kappa, sensitivity, specificity, precision, F measure and MCC obtained by the base classifiers Random Forest (RF), K-Nearest Neighbour (K-NN) and Support Vector Machine (SVM), and by the ensemble classifiers based on Average Voting (AVEn), Majority Voting (MVEn) and Weighted-Average Voting (WAVEn), for the original Z-Alizadeh Sani dataset are reported in Table 2.

Table 2
Average testing results of 10 runs obtained by the base classifiers (RF, K-NN, and SVM) and ensemble classifiers (AVEn, MVEn, and WAVEn) using the selected five features on the original Z-Alizadeh Sani dataset.

Algorithm  ACC%   Kappa  Sen    Spe    Prec   F1     MCC
RF         98.3   0.96   1.00   0.94   0.98   0.989  0.96
K-NN       86.21  0.655  0.905  0.75   0.905  0.905  0.655
SVM        95.17  0.87   1.00   0.825  0.94   0.969  0.88
AVEn       92.76  0.8    1.00   0.74   0.91   0.953  0.82
MVEn       96.9   0.92   1.00   0.89   0.96   0.98   0.922
WAVEn      98.97  0.973  1.00   0.963  0.987  0.994  0.974

ACC - Accuracy, Sen - Sensitivity, Spe - Specificity, Prec - Precision, MCC - Matthew's Correlation Coefficient
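As a worked illustration of the evaluation measures in Eqs. (4)-(10) of Section 4.1, the following Python sketch computes them from a confusion matrix. Two points are assumptions rather than statements from the paper: the chance-agreement term `rand` is computed as in Cohen's kappa (from the row and column totals), and the confusion-matrix counts are hypothetical numbers for a 29-sample test fold, not results reported in the study.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Evaluation measures from confusion-matrix counts (assumes no empty class)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total                                  # Eq. (4)
    sensitivity = tp / (tp + fn)                                  # Eq. (5)
    specificity = tn / (tn + fp)                                  # Eq. (6)
    precision = tp / (tp + fp)                                    # Eq. (7)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (8)
    mcc = (tp * tn - fp * fn) / math.sqrt(                        # Eq. (9)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Assumption: chance agreement as in Cohen's kappa.
    rand = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - rand) / (1 - rand)                        # Eq. (10)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical fold: 21 CAD and 8 Normal test samples, one healthy person misclassified.
metrics = confusion_metrics(tp=21, tn=7, fp=1, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With zero false negatives the sensitivity is exactly 1, while the single false positive lowers specificity, precision, MCC and kappa, which is the pattern the text describes for the original dataset.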

Table 3
Average testing results of 10 runs obtained by the base classifiers (RF, K-NN, and SVM) and ensemble classifiers (AVEn, MVEn, and WAVEn) using the selected five features on the SMOTE-balanced Z-Alizadeh Sani dataset.

Algorithm  ACC%   Kappa  Sen    Spe    Prec   F1     MCC
RF         100    1.00   1.00   1.00   1.00   1.00   1.00
K-NN       88.34  0.767  0.905  0.862  0.868  0.886  0.767
SVM        97.62  0.952  1.00   0.952  0.955  0.977  0.953
AVEn       97.62  0.952  1.00   0.952  0.955  0.977  0.953
MVEn       97.86  0.957  1.00   0.957  0.959  0.979  0.958
WAVEn      100    1.00   1.00   1.00   1.00   1.00   1.00

ACC - Accuracy, Sen - Sensitivity, Spe - Specificity, Prec - Precision, MCC - Matthew's Correlation Coefficient
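The balanced results in Table 3 rely on SMOTE, which creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The Python sketch below shows that core idea only; it is not the R routine used in the study. The function name, the two toy numeric features and the choice k = 5 are assumptions. The count of 129 synthetic samples follows from the class sizes given in the text (216 healthy samples after balancing, 87 before).

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist2(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy stand-in for the 87 healthy records: two hypothetical numeric features.
data_rng = random.Random(42)
healthy = [(data_rng.uniform(30, 80), data_rng.uniform(20, 60)) for _ in range(87)]

# 216 diseased vs 87 healthy: 129 synthetic healthy samples balance the classes.
synthetic = smote(healthy, n_synthetic=216 - 87)
print(len(healthy) + len(synthetic))
```

Plain interpolation as sketched here suits numeric features; categorical features such as chest pain type would need a variant that handles nominal values, and only the training portion should be oversampled if the test set is to remain free of synthetic records.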

From the experimental results, it can be seen that the WAVEn model shows the best overall performance in terms of accuracy, kappa, sensitivity, specificity, precision, F measure and MCC. Of the three ensemble models developed for prediction, the Weighted-Average Voting Ensemble (WAVEn) technique achieves the best performance, with a classification accuracy of 98.97%, a sensitivity of 100% and a specificity of 96.3% on the original Z-Alizadeh Sani dataset with 303 patient records. This shows the ability of the model to predict all patients with CAD as positive cases, which is required for any medical diagnostic system. However, the results show that there are some false predictions: some of the healthy persons are classified as CAD patients. From the results, it can be inferred that the WAVEn algorithm identifies all CAD patients as patients, with a slight degradation in its discriminative power in identifying all healthy persons as healthy. It should be noted that the previously developed models for diagnosing CAD using the Z-Alizadeh Sani dataset have not achieved this accuracy level.

4.2.2. Balanced dataset

In this section, we discuss the performance of the ensemble models and the three base classifiers for prediction of CAD in terms of classification accuracy, Kappa, sensitivity, specificity, precision, F1 measure and MCC for the dataset balanced using the Synthetic Minority Over-sampling Technique (SMOTE). We have evaluated the performance of the classifier models and report the results obtained by the base classifiers and ensemble classifier models on the testing samples.

We have balanced the dataset to improve the robustness of the algorithm and achieve better performance by correctly classifying CAD patients as patients and healthy persons as healthy. The balanced dataset consists of 432 samples, with 216 diseased samples and 216 healthy samples. In developing our classifier models, we have used 390 samples for training and 42 samples for testing. The average testing results of 10 runs in terms of accuracy, kappa, sensitivity, specificity, precision, F measure and MCC obtained by the base classifiers RF, K-NN and SVM, and by the ensemble classifiers based on Average Voting (AVEn), Majority Voting (MVEn) and Weighted-Average Voting (WAVEn), for the balanced Z-Alizadeh Sani dataset are reported in Table 3.

From the experimental results, it can be observed that RF and WAVEn show superior performance on all evaluation measures, with an average accuracy of 100% and zero false positive and false negative rates. This shows the efficiency and robustness of the developed model in predicting CAD with less time and cost.

In addition, the performance of the proposed heterogeneous ensemble classifier model was compared with the results of previous studies reported in the literature for the Z-Alizadeh Sani dataset. From Table 4, it can be observed that the proposed methodology obtains better performance than all the existing works. It should be noted that we have used only five features, selected based on the importance score using the Boruta and SVM feature importance algorithms, which significantly improved the performance of the proposed algorithm. Moreover, our research differs from the previous works reported in the literature, as we have studied the performance of our algorithms on both the original unbalanced and the SMOTE-balanced datasets. It is important to mention that optimization algorithms such as evolutionary or nature-inspired techniques are not used in this research for tuning the hyper-parameters of the base classifiers.

4.3. Receiver operating characteristics (ROC) analysis

The Receiver Operating Characteristics (ROC) curve plots the true-positive rate against the false-positive rate. The relationship between the true-positive rate (sensitivity) and the false-positive rate (1 − specificity) helps to visualize the strength of the classification performance of a classifier model. The closer the ROC curve is to the ideal coordinate, the more accurate the classifier is in predicting CAD. The ROC graph is plotted with the true-positive rate (TPR) on the y-axis against the false-positive rate (FPR) on the x-axis for different cut-points, starting from the coordinate (0,0) and ending at (1,1).

The Z-Alizadeh Sani dataset is balanced using SMOTE to analyze the performance of the classifiers. The ROC curves for the three base classifiers and the ensemble models are given in Fig. 4 for the SMOTE-balanced Z-Alizadeh Sani dataset. From the figure, it is observed that the ROC curve rises to (0,1) and then runs horizontally to (1,1), which indicates that the Weighted-Average Voting Ensemble (WAVEn) identifies both positive and negative instances more effectively than the other classifiers. The RF and WAVEn models achieve higher predictive accuracy in identifying CAD when compared with the other classifier models. The ROC curve shows that the RF and WAVEn approaches achieve 100% classification accuracy in detecting CAD. The value of the Area under the Curve (AUC) for each classifier is given in red next to the corresponding classifier model in the figure.

Fig. 4. ROC curve of the three base classifiers and ensemble models for the SMOTE balanced Z-Alizadeh Sani dataset on 90% training and 10% testing ratio.

4.4. Experimental results of proposed method on the Cleveland dataset

In this section, we discuss the experimental results of our proposed method on another dataset. To analyze the performance of our proposed method on another dataset, we have selected


Table 4
Comparison of the proposed method with other methods in literature on the Z-Alizadeh Sani dataset.

Method                                              #Features  Accuracy(%)   Sensitivity  Specificity
Information Gain + SMO [5]                          16         94.08         0.963        0.885
Information Gain + SVM [6]                          24         86.14 (LAD)   NR           NR
                                                               83.17 (LCX)
                                                               83.50 (RCA)
SVM + Feature Engineering on Extended Dataset* [7]  32         96.4          1.00         0.88
Gini Index + SVM [9]                                26         86.43 (LAD)   NR           NR
                                                               83.67 (LCX)
                                                               82.67 (RCA)
Hybrid PSO [64]                                     13         84.25         NR           NR
SVM + GA [10]                                       40         86.64 (LAD)   NR           NR
                                                               83.47 (LCX)
                                                               82.85 (RCA)
Information Gain + C4.5 [11]                        37         74.2 (LAD)    NR           NR
                                                               63.76 (LCX)
                                                               69.33 (RCA)
SMO [12]                                            34         92.09         0.97         NR
SMO [13]                                            34         92.09         0.97         NR
Bagging-C4.5 [14]                                   20         79.54 (LAD)   NR           NR
                                                               61.46 (LCX)
                                                               68.96 (RCA)
Ensemble::Naive Bayes + SMO [15]                    16         88.52         0.91         0.82
Hybrid FSA + FA + ETCA + XGBoost + SMOTE [39]       27         92.58         0.93         NR
N2GC-nuSVM + balancing [2]                          29         93.08         NR           NR
NN-GA [19]                                          22         93.85         0.97         0.92
CART [28]                                           5          92.41         0.986        0.77
Proposed WAVEn Ensemble + Ori                       5          98.97         1.00         0.937
Proposed WAVEn Ensemble + Bal                       5          100           1.00         1.00

#Features - Number of features selected, NR - Not Reported, Ori - Original Dataset, Bal - SMOTE Balanced Dataset, Extended Dataset* has 500 patient records

the Cleveland dataset for experimental evaluation, as it is widely used by researchers for heart failure identification and prediction. The Cleveland dataset is a well-known heart failure dataset taken from the University of California, Irvine machine learning repository, with 303 records [25]. This dataset consists of 14 features: 13 independent feature attributes, namely Sex, Age, resting Blood Pressure (RBP), serum cholesterol (CHO), Fasting Blood Sugar (FBS), Rest Electrocardiogram (RECG), Maximum achieved Heart rate (THA), Angina induced by exercise (EXA), old peak (OP), peak exercise slope (SLO), major blood vessels coloured by Fluoroscopy (CA), Thallium Scan (THAL) and Chest Pain Type (CPT), and one target class (NUM). The target class has two values: value 0 represents the absence of disease and value 1 represents the presence of disease.

From the 303 records in the dataset, we have selected 297 records and removed the 6 records having missing values. The resulting dataset consists of 160 normal (healthy) records and 137 CAD (unhealthy) records, with a class distribution of 53.87% normal and 46.13% abnormal or CAD patients. The dataset is divided into training and testing subsets, with 90% of the data used for training and the remaining 10% for testing. A 10-fold cross-validation technique is used for training the model with seven significant features, namely Sex, THA, EXA, OP, CA, THAL, and CPT, selected using the Boruta feature selection algorithm.

Experiments are conducted to evaluate the performance of the proposed algorithm on both the original and the SMOTE-balanced dataset. From the experimental results, it can be observed that the WAVEn model shows better performance in terms of accuracy, kappa, sensitivity, specificity, precision, F measure and MCC. The WAVEn technique achieves a classification accuracy of 96.55% with a sensitivity of 100%, a specificity of 93.75%, a precision of 92.86% and an F1 score of 96.3% on the original Cleveland dataset. The quality of the binary classification, measured using the MCC score, has a value of 0.933, with a kappa value of 0.931. In addition, the 100% sensitivity shows the ability of the WAVEn model to predict all positive cases, which is required for any medical diagnostic system and makes the model practically applicable. From the results, it can be inferred that the WAVEn algorithm identifies all CAD patients as patients, with a slight degradation in its discriminative power in identifying all healthy persons as healthy.

Similarly, the experimental results on the SMOTE-balanced dataset show that the proposed WAVEn algorithm achieved 100% for all metrics. This shows the effectiveness of the proposed model in identifying all heart failure persons as CAD patients and normal persons as healthy. The previously developed models for identification and prediction of heart failure disease using the Cleveland dataset have not achieved this accuracy level.

Table 5 shows the performance of various algorithms and the number of features selected for predicting CAD using the Cleveland dataset. The proposed WAVEn algorithm improved the predictive accuracy with only 7 features and achieved a better classification accuracy than all the existing techniques [3,4,17,18,26,32,36,44,53,62], except one that has slightly better performance than ours only in terms of accuracy and specificity for the unbalanced dataset [54]. However, it should be noted that high sensitivity is required for any medical diagnosis system to accurately identify the diseased data from healthy data. Hence, our proposed algorithm is more effective than [54], as it discriminates the diseased data more accurately.

5. Discussion

In the proposed work, we achieved the maximum achievable accuracy, sensitivity, specificity, and precision in diagnosis and prediction of CAD. The main objective during the diagnosis of CAD is that a CAD patient should be reported as having CAD and a


Table 5
Comparison of the proposed method with other methods in literature on the Cleveland dataset.

Method                          #Features  Accuracy(%)  Sensitivity  Specificity
CFS + PSO + MLP [62]            7          90.28        NR           NR
ANN + F-AHP [53]                13         91.10        1.00         0.84
Logistic Regression [26]        13         85           0.89         0.81
Adaptive FDSS [44]              8          92.13        0.92         0.92
HRFLM [36]                      13         88.4         0.93         0.826
RSA-RF [32]                     7          93.33        0.95         0.898
Voting [17]                     9          87.41        NR           NR
Stacked SVM [4]                 9          92.22        0.829        1.00
DNN [3]                         11         93.33        0.85         1.00
HNCL + AMLN [54]                13         97.80        0.955        1.00
NE-nu-SVC + GA + balancing [18] 7          98.60        0.986        NR
Proposed WAVEn Ensemble + Ori   7          96.55        1.00         0.937
Proposed WAVEn Ensemble + Bal   7          100          1.00         1.00

#Features - Number of features selected, NR - Not Reported, Ori - Original Dataset, Bal - SMOTE Balanced Dataset

healthy person should not be reported as a patient. Also, correctly identifying a person having CAD as a patient is more important than identifying a healthy person. In this respect, the WAVEn algorithm always correctly identifies a person having CAD as a patient, with some false predictions in which healthy persons are classified as CAD patients in the original dataset. However, for the balanced dataset, the WAVEn and RF algorithms predict patients having CAD and also always report a healthy person as healthy, without any error. In accordance with these experimental results, the developed model can be used to test a patient initially to find whether he/she has CAD or not. When the test result is negative, no further diagnostic measures are required for the patient; otherwise, an angiography procedure is recommended for identification of the stenosis.

The experimental results clearly show the robustness of our algorithm in diagnosis and prediction of CAD. To the best of the authors' knowledge, no prior works in the literature have obtained results this high using only five features in CAD diagnosis, with an average accuracy of 98.97%, 100% sensitivity and 96.3% specificity for the original dataset, and 100% average accuracy, 100% sensitivity and 100% specificity for the balanced dataset. Comparing the performance on the original and balanced datasets, it is found that the proposed WAVEn ensemble model achieves zero false negatives in both, which means that the algorithm identifies CAD patients more accurately than it identifies healthy persons. Furthermore, the WAVEn model has zero false predictions for the balanced dataset. The performance results show the reliability of the WAVEn model in distinguishing CAD patients from healthy persons, which makes it more suitable for practical implementation.

6. Conclusion

In this study, a heterogeneous ensemble method is proposed to facilitate effective diagnosis and prediction of CAD in a patient. We have evaluated the performance of the proposed ensemble technique and the base classifiers to analyze the predictive or diagnostic performance of the model in correctly classifying CAD data using the original and balanced Z-Alizadeh Sani dataset. We have employed feature selection algorithms and selected five features based on feature importance and rank. Then, 10-fold cross-validation is applied on the training subset, and the performance of the algorithm is independently assessed on the unseen test data. Compared with all the classifier models, the WAVEn algorithm shows better performance and is able to identify CAD with 98.97% accuracy and a precision of 98.3% for the original dataset. Moreover, the WAVEn method is able to discriminate the diseased data from healthy data on the balanced dataset with the highest precision of 100%, and no prior works have achieved this reliability with only 5 features. Hence, the proposed model can be used effectively to predict CAD in a patient without an angiography procedure, which will greatly reduce the side-effects of angiography and save time and cost for the patients.

In this study, the highest classification accuracy of WAVEn with zero false negatives strongly supports the practical implementation of a diagnostic system in hospitals that can help a cardiologist diagnose CAD at an earlier stage. Finally, applying the algorithm to the extended version of the Z-Alizadeh Sani dataset with 59 features, and studying its performance in predicting stenosis of the LAD, LCX and RCA arteries, can be taken up as future research work.

Declaration of Competing Interest

The authors declare no conflict of interest and there has been no financial support for this work.

Acknowledgments

This research work does not receive any support or grant from public, private or non-commercial funding agencies.

References

[1] M. Abdar, U.R. Acharya, N. Sarrafzadegan, V. Makarenkov, NE-nu-SVC: a new nested ensemble clinical decision support system for effective diagnosis of coronary artery disease, IEEE Access, doi:10.1109/ACCESS.2019.2953920.
[2] M. Abdar, W. Ksiazek, U.R. Acharya, R.S. Tan, V. Makarenkov, P. Plawiak, A new machine learning technique for an accurate diagnosis of coronary artery disease, Comput. Methods Programs Biomed. 179 (2019), doi:10.1016/j.cmpb.2019.104992.
[3] L. Ali, A. Rahman, A. Khan, M. Zhou, A. Javeed, J.A. Khan, An automated diagnostic system for heart disease prediction based on χ² statistical model and optimally configured deep neural network, IEEE Access 7 (2019) 34938–34945.
[4] L. Ali, A. Niamat, J.A. Khan, N.A. Golilarz, X. Xingzhong, A. Noor, R. Nour, S.A.C. Bukhari, An optimized stacked support vector machines based expert system for the effective prediction of heart failure, IEEE Access 7 (2019) 54007–54014.
[5] R. Alizadehsani, J. Habibi, M.J. Hosseini, H. Mashayekhi, R. Boghrati, A. Ghandeharioun, B. Bahadorian, Z. Alizadeh Sani, A data mining approach for diagnosis of coronary artery disease, Comput. Methods Programs Biomed. 111 (2013) 52–61.
[6] R. Alizadehsani, M.H. Zangooei, M.J. Hosseini, J. Habibi, A. Khosravi, M. Roshanzamir, F. Khozeimeh, N. Sarrafzadegan, S. Nahavandi, Coronary artery disease detection using computational intelligence methods, Knowl.-Based Syst. 109 (2016) 187–197.
[7] R. Alizadehsani, M.J. Hosseini, A. Khosravi, F. Khozeimeh, M. Roshanzamir, N. Sarrafzadegan, S. Nahavandi, Non-invasive detection of coronary artery disease in high-risk patients based on the stenosis prediction of separate coronary arteries, Comput. Methods Programs Biomed. 162 (2018) 119–127.


[8] R. Alizadehsani, M. Abdar, M. Roshanzamir, A. Khosravi, P.M. Kebria, F. Khozeimeh, S. Nahavandi, N. Sarrafzadegan, U. Rajendra Acharya, Machine learning-based coronary artery disease diagnosis: a comprehensive review, Computers in Biology and Medicine 111 (2019), Art. No. 103346.
[9] R. Alizadehsani, M. Roshanzamir, M. Abdar, A. Beykikhoshk, M.H. Zangooei, A. Khosravi, S. Nahavandi, R.S. Tan, U.R. Acharya, Model uncertainty quantification for diagnosis of each main coronary artery stenosis, Soft Comput. 24 (2020) 10149–10160, doi:10.1007/s00500-019-04531-0.
[10] R. Alizadehsani, M. Roshanzamir, M. Abdar, A. Beykikhoshk, A. Khosravi, S. Nahavandi, P. Plawiak, R.S. Tan, U.R. Acharya, Hybrid genetic-discretized algorithm to handle data uncertainty in diagnosing stenosis of coronary arteries, Expert Syst., doi:10.1111/exsy.12573.
[11] R. Alizadehsani, J. Habibi, B. Bahadorian, H. Mashayekhi, A. Ghandeharioun, R. Boghrati, Z.A. Sani, Diagnosis of coronary arteries stenosis using data mining, J. Med. Signals Sens. 2 (3) (2012) 153–159.
[12] R. Alizadehsani, M.J. Hosseini, Z.A. Sani, A. Ghandeharioun, R. Boghrati, Diagnosis of coronary artery disease using cost-sensitive algorithms, in: IEEE 12th International Conference on Data Mining Workshops, Brussels, Belgium, 2012, pp. 9–16, doi:10.1109/ICDMW.2012.29.
[13] R. Alizadehsani, M.J. Hosseini, R. Boghrati, A. Ghandeharioun, F. Khozeimeh, Z.A. Sani, Exerting cost-sensitive and feature creation algorithms for coronary artery disease diagnosis, Int. J. Knowl. Discov. Bioinf. 3 (1) (2012) 59–79.
[14] R. Alizadehsani, J. Habibi, Z.A. Sani, H. Mashayekhi, R. Boghrati, A. Ghandeharioun, F. Khozeimeh, F. Alizadeh-Sani, Diagnosing coronary artery disease via data mining algorithms by considering laboratory and echocardiography features, Res. Cardiovasc. Med. 2 (2013) 133–139.
[15] R. Alizadehsani, J. Habibi, M.J. Hosseini, R. Boghrati, A. Ghandeharioun, B. Bahadorian, Z.A. Sani, Diagnosis of coronary artery disease using data mining techniques based on symptoms and ECG features, Eur. J. Sci. Res. 82 (4) (2012) 542–553.
[16] J.A. ALzubi, B. Bharathikannan, S. Tanwar, R. Manikandan, A. Khanna, C. Thaventhiran, Boosted neural network ensemble classification for lung cancer, Appl. Soft Comput. J. 80 (2019) 579–591.
[17] M.S. Amin, Y.K. Chiam, K.D. Varathan, Identification of significant features and data mining techniques in predicting heart disease, Telemat. Inf. 36 (2019) 82–93.
[18] Z. Aouabed, M. Abdar, N. Tahiri, J.C. Gareau, V. Makarenkov, A novel effective ensemble model for early detection of coronary artery disease, in: Innov. in
[35] M.B. Kursa, W.R. Rudnicki, Feature selection with the Boruta package, J. Stat. Softw. 36 (11) (2010) 1–13.
[36] S. Mohan, C. Thirumalai, G. Srivastava, Effective heart disease prediction using hybrid machine learning techniques, IEEE Access 7 (2019) 81542–81554.
[37] C.M.K. Peryiasamy, R. Balasubramanain, D. Velusamy, Predictive analysis of heat transfer characteristics of nanofluids in helically coiled tube heat exchanger using regression approach, Therm. Sci. 24 (1) (2020) 505–513.
[38] R. Nagarajan, M. Upreti, An ensemble predictive modeling framework for breast cancer classification, Methods 131 (2017) 128–134.
[39] E. Nasarian, M. Abdar, M.A. Fahami, R. Alizadehsani, S. Hussain, M.E. Basiri, M. Zomorodi-Moghadam, X. Zhou, P. Plawiak, U.R. Acharya, R.S. Tan, N. Sarrafzadegan, Association between work-related features and coronary artery disease: a heterogeneous hybrid feature selection integrated with balancing approach, Pattern Recognit. Lett. 133 (2020) 33–40.
[40] A. Ogunleye, Q.G. Wang, XGBoost model for chronic kidney disease diagnosis, IEEE/ACM Trans. Comp. Biol. Bioinf. (2019), doi:10.1109/TCBB.2019.2911071.
[41] O. Osanaiye, H. Cai, K.K.R. Choo, A. Dehghantanha, Z. Xu, M. Dlodlo, Ensemble-based multi-filter feature selection method for DDos detection in cloud computing, EURASIP J. Wirel. Commun. Netw. 130 (2016) 1–10, doi:10.1186/s13638-016-0623-3.
[42] D. Pal, K.M. Mandana, S. Pal, D. Sarkar, C. Chakraborty, Fuzzy expert system approach for coronary artery disease screening using clinical parameters, Knowl.-Based Syst. 36 (2012) 162–174.
[43] S. Patidar, R.B. Pachori, U. Rajendra Acharya, Automated diagnosis of coronary artery disease using tunable-Q wavelet transform applied on heart rate signals, Knowl.-Based Syst. 25 (2015) 1–10.
[44] A.K. Paul, P.C. Shill, M.R.I. Rabin, K. Murase, Adaptive weighted fuzzy rule-based system for the risk level assessment of heart disease, Appl. Intell. 48 (7) (2018) 1739–1756.
[45] D. Prabhakaran, P. Jeemon, A. Roy, Cardiovascular diseases in India: current epidemiology and future directions, Am. Heart Assoc. Inc. 133 (2016) 1605–1620.
[46] D. Prabhakaran, K. Singh, G.A. Roth, A. Banerjee, N.J. Pagidipati, M.D. Huffman, Cardiovascular diseases in India compared with the United States, J. Am. Coll. Cardiol. 72 (1) (2018) 79–95.
[47] C.J. Qin, Q. Guan, X.P. Wang, Application of ensemble algorithm integrating multiple criteria feature selection in coronary heart disease detection, Biomed. Eng. Appl. Basis Commun. 29 (6) (2017) 1–11.
[48] U.R. Acharya, V.K. Sudarshan, J.E.W. Koh, R.J. Martis, J.H. Tan, S.L. Oh,
Info. Sys. and Tech. to Support Learning Research Proc. of EMENA-ISTL, 2019, A. Muhammad, Y. Hagiwara, M. R. K. Mookiah, K. P. Chua, C. K. Chua, R. S.
pp. 480–489. Tan, Application of higher-order spectra for the characterization of coronary
[19] Z. Arabasadi, R. Alizadehsani, M. Roshanzamir, H. Moosaei, A. A. Yarifard, Com- artery disease using electrocardiogram signals, Biomed. Signal Process. Control
puter aided decision making for heart disease detection using hybrid neu- 31 (2017) 31–43.
ral network-genetic algorithm, Comput. Methods Programs Biomed. 141 (2017) [49] U.R. Acharya, H. Fujita, A. Muhammad, S. L. Oh, V.K. Sudarshan, J. H. Tan, J. E.
19–26. W. Koh, Y. Hagiwara, C. K. Chua, K. P. Chua, R. S. Tan, Automated characteriza-
[20] A. Ben-Hur, J. Weston, A user’s guide to support vector machines, Data Min. tion and classification of coronary artery disease and myocardial infarction by
Tech. Life Sci. 609 (2010) 223–239. decomposition of ECG signals: a comparative study, Inf. Sci. 377 (2017) 17–29.
[21] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32. [50] U.R. Acharya, H. Fujita, S. L. Oh, A. Muhammad, J. H. Tan, C. K. Chua, Automated
[22] L.M. Candanedo, V. Feldheim, D. Deramaix, Data driven prediction models of detection of coronary artery disease using different durations of ECG segments
energy use of appliances in a low-energy house, Energy Build. 140 (2017) with convolutional neural network, Knowl.-Based Syst. 132 (2017) 62–71.
81–97. [51] U.R. Acharya, H. Fujita, V.K. Sudarshan, S. L. Oh, A. Muhammad, J. H. Tan, K. J.
[23] Cardiovascular diseases, (CVDs)- Key Facts, World Health Organization, https: Hui, A. Jain, L. C. Min, C. K. Chua, Automated characterization of coronary
//www.who.int/health- topics/cardiovascular- diseases. artery disease, myocardial infarction, and congestive heart failure using con-
[24] A. Chaudhary, S. Kolhe, R. Kamal, A hybrid ensemble for classification in mul- tourlet and shearlet transforms of electrocardiogram signal, Knowl.-Based Syst.
ticlass datasets: an application to oilseed disease dataset, Comput. Electron. 132 (2017) 156–166.
Agric. 124 (2016) 65–72. [52] U.R. Acharya, O. Faust, V. Sree, G. Swapna, R. J. Martis, N. A. Kadri, J.S. Suri,
[25] R. Detrano, V.A.M. Center, Long beach and cleveland clinic foundation and Linear and nonlinear analysis of normal and CAD-affected heart rate signals,
university of california, irvine machine learning repository, 2019, Cleve- Comput. Methods Programs Biomed. 113 (2014) 55–68.
land Heart Disease DatasetAvailable at http://archive.ics.uci.edu/ml/datasets/ [53] O.W. Samuel, G.M. Asogbon, A.K. Sangaiah, P. Fang, G. Li, An integrated decision
Heart+Disease Date last accessed: 20-12. support system based on ANN and fuzzy_AHP for heart failure risk prediction,
[26] A.K. Dwivedi, Performance evaluation of different machine learning techniques Expert Syst. Appl. 68 (2017) 163–172.
for prediction of heart disease, Neural Comput. Appl. 29 (10) (2018) 685–693. [54] O. W. Samuel, B. Yang, Y. Geng, et al., A new technique for the prediction of
[27] M. Hosni, I. Abnane, A. Idri, J.M.C. de Gea, J. L. F. Alemán, Reviewing ensemble heart failure risk driven by hierarchical neighborhood component-based learn-
classification methods in breast cancer, Comput. Methods Programs Biomed. ing and adaptive multi-layer networks, Future Gener. Comput. Syst. 110 (2020)
177 (2019) 89–112. 781–794, doi:10.1016/j.future.2019.10.034.
[28] M.M. Ghiasi, S. Zendehboudi, A. A. Mohsenipour, Decision tree-based diagnosis [55] Z. A. Sani, R. Alizadehsani, M. Roshanzamir, Z-Alizadeh Sani data set, 2020,
of coronary artery disease: CART model, Computer Methods and Programs in [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Z-Alizadeh+Sani, Ac-
Biomedicine 192 (2020). Article No: 105400 cessed January.
[29] D. Giri, U.R. Acharya, R. J. Martis, S.V. Sree, T. C. Lim, V.I. T. Ahamed, [56] S. Sood, M. Kumar, R. B. Pachori, U.R. Acharya, Application of empirical
J.S. Suri, Automated diagnosis of coronary artery disease affected patients us- mode decomposition based features for analysis of normal and CAD heart
ing LDA, PCA, ICA and discrete wavelet transform, Knowl.-Based Syst. 37 (2013) rate signals, Journal of Mechanics in Medicine and Biology 16 (1) (2016).
274–282. Art.No.1640 0 02
[30] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. [57] G. Surrel, A. Aminifar, F. Rincon, S. Murali, D. Atienza, Online obstructive sleep
Learn. Res. 3 (2003) 1157–1182. apnea detection on medical wearable sensors, IEEE Trans. Biomed. Circuits
[31] J. Han, M. Kamber, Data Mining: Concepts and Techniques Morgan Kaufmann, Syst. 12 (2018) 762–773.
second ed., 2006. [58] J. H. Tan, Y. Hagiwara, W. Pang, I. Lim, S. L. Oh, A. Muhammad, R. S. Tan,
[32] A. Javeed, S. Zhou, L. Yongjian, I. Qasim, A. Noor, R. Nour, S. Waliand, A. Basit, M. Chen, U.R. Acharya, Application of stacked convolutional and long short-
An intelligent learning system based on random search algorithm and opti- -term memory network for accurate identification of CAD ECG signals, Com-
mized random forest model for improved heart disease detection, IEEE Access put. Biol. Med. 94 (2018) 19–26.
7 (2019) 180235–180243. [59] D. Velusamy, G. Pugalendhi, Fuzzy integrated Bayesian Dempster Shafer the-
[33] R. Ju, C. Hu, P. Zhou, Q. Li, Early diagnosis of Alzheimer’s disease based on ory to defend cross-layer heterogeneity attacks in communication network of
resting-state brain networks and deep learning, IEEE/ACM Trans. Comp. Biol. smart grid, Inf. Sci. 479 (2019) 542–566.
Bioinf. 16 (2019) 244–257. [60] D. Velusamy, G. Pugalendhi, Water cycle algorithm tuned fuzzy expert system
[34] M. Kumar, R.B. Pachori, U.R. Acharya, Characterization of coronary artery dis- for trusted routing in smart grid communication network, IEEE Trans. Fuzzy
ease using flexible analytic wavelet transform applied on ECG signals, Biomed. Syst. (2020), doi:10.1109/TFUZZ.2020.2968833.
Signal Process. Control 31 (2017) 301–308. [61] D. Velusamy, G. Pugalendhi, K. Ramasamy, A cross-layer trust evaluation pro-

12
D. Velusamy and K. Ramasamy Computer Methods and Programs in Biomedicine 198 (2021) 105770

tocol for secured routing in communication network of smart grid, IEEE J. Sel. [63] http://www.who.int/news-room/fact-sheets/detail/the- top- 10- causes- of- death.
Areas Commun. 38 (1) (2020) 193–204. Accessed (January)2020.
[62] L. Verma, S. Srivastava, P. Negi, A hybrid data mining model to predict coronary [64] M. Zomorodi-moghadam, M. Abdar, Z. Davarzani, X. Zhou, P. Plawiak,
artery disease cases using non-invasive clinical data, Journal of Medical Sys. 40 U.R. Acharya, Hybrid particle swarm optimization for rule discovery in the di-
(178) (2016) 1–7. (7) agnosis of coronary artery disease, Expert Syst. (2019), doi:10.1111/exsy.12485.
