International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3875
Performance Evaluation of Various Classification Algorithms
Shafali Deora
Amritsar College of Engineering & Technology, Punjab Technical University
-----------------------------------------------------------***----------------------------------------------------------
Abstract - Classification is a technique in which data is categorized into two or more classes. It can be performed on both
structured (linear) and unstructured (non-linear) data. The main goal of a classification problem is to identify the
category of the test data. The 'Heart Disease' dataset has been chosen for this study in order to infer and understand the
results from different classification models. This paper studies and evaluates the prediction abilities of different
classification models on the dataset using the scikit-learn library. Naive Bayes, Logistic Regression, Support Vector Machine,
Decision Tree, Random Forest, and K-Nearest Neighbor are the classification algorithms chosen for evaluation on the dataset.
After analysis of the results on the considered dataset, evaluation parameters such as the confusion matrix, precision, recall,
f1-score and accuracy have been calculated and compared for each of the models.
Key Words: Pandas, confusion matrix, precision, logistic regression, decision tree, random forest, Naïve Bayes, SVM,
sklearn.
1. INTRODUCTION
Machine learning rose to prominence in the 1990s, when the intersection of computer science and statistics gave rise to
probabilistic approaches in AI. With large-scale data available, scientists started building intelligent systems capable of
analysing and learning from large amounts of data.
Machine learning is a branch of artificial intelligence that gives computers the ability to learn without being
explicitly programmed. It focuses on the development of computer programs that adapt when exposed to new data,
based on past scenarios. It is closely related to computational statistics and mathematical optimization, which focus on
making predictions using computers. Machine learning uses supervised learning in a variety of practical scenarios: the
algorithm builds a mathematical model from a set of data that contains both the inputs and the desired outputs. The model
is first trained on actual datasets to build a mapping between the dependent and independent variables in order to predict
accurate output.
Classification belongs to the category of supervised learning, where the computer program learns from the input data given
to it and then uses this learning to classify new observations. This approach is used when there is a fixed number of
outputs. The technique categorizes the data into a given number of classes, and the goal is to identify the category or class
into which new data will fall.
Below are the classification algorithms used for evaluation:
1.1 Logistic Regression
Logistic Regression is a classification algorithm that produces a binary result. It predicts the outcome of a dependent
(target) variable: the algorithm uses a linear equation over independent variables, called predictors, to compute a value
that can lie anywhere between negative infinity and positive infinity. However, the output of the algorithm must be a class
variable, i.e. 0 or 1. Hence, the output of the linear equation is squashed into the range [0, 1]. The sigmoid function is
used to map the prediction to a probability.
Mathematical representation of the sigmoid function:
F(z) = 1 / (1 + e^(-z))
where
F(z): output between 0 and 1
z: input to the function
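The squashing described above can be sketched in a few lines of Python (a minimal illustration of the formula itself, not scikit-learn's internal implementation):

```python
import math

def sigmoid(z):
    """Squash any real-valued input into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

# An input of 0 maps to exactly 0.5; large-magnitude inputs
# saturate towards 0 or 1, giving a valid probability.
print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```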
Graphical representation of sigmoid function:
1.2 K- Nearest Neighbor:
K-Nearest Neighbor is a supervised learning technique that stores all the available cases and classifies new data based
on a similarity measure. It uses a distance measure to find the nearest neighbours: for each test data point it looks at
the 'K' nearest training data points, takes the most frequently occurring class among them, and assigns that class to
the test data.
K= number of nearest neighbours
Let's say K=6: the algorithm would look at the 6 nearest neighbours of the test data. Since class A forms the majority over
class B, class A would be assigned to the test data in the example below.
The distance between two points can be calculated using either the Euclidean distance, which is the shortest straight-line
distance between two points, or the Manhattan distance, which is the distance between the points measured along the axes at
right angles.
Euclidean Distance = √((5 − 1)² + (4 − 1)²) = 5
Manhattan Distance = |5 − 1| + |4 − 1| = 7
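The two distance measures can be checked with a short Python sketch, using the points (5, 4) and (1, 1) from the worked example above:

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Distance measured along the axes at right angles."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((5, 4), (1, 1)))  # 5.0
print(manhattan((5, 4), (1, 1)))  # 7
```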
1.3 Decision Tree
A Decision Tree is a tree representation of all the possible solutions to a decision based on certain conditions. The tree can
be explained by two entities: decision nodes, where the data is split, and leaf nodes, which represent the decisions or final
outcomes. Pruning is the process of removing unwanted nodes from the tree.
The CART (Classification and Regression Tree) algorithm is used to build the tree; it helps choose the best attribute and
decide where to split.
The split is decided on the basis of these factors:
 Gini Index: the measure of impurity used to build the decision tree in CART.
 Information Gain: the decrease in entropy after a dataset is split on an attribute; the aim is to find the attribute that returns the highest information gain.
 Reduction in Variance: used for continuous target variables; the split with the lower variance is selected as the splitting criterion.
 Chi-Square: finds the statistical significance of the differences between sub-nodes and the parent node.
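The Gini index and information gain can be illustrated with a small Python sketch (the helper names below are ours, chosen for illustration; CART implementations compute these quantities internally):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of a label set, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Decrease in entropy after splitting parent into left and right."""
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child

print(gini([1, 1, 1, 1]))                              # 0.0 (pure node)
print(gini([0, 0, 1, 1]))                              # 0.5 (maximally impure)
print(information_gain([0, 0, 1, 1], [0, 0], [1, 1]))  # 1.0 (perfect split)
```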
1.4 Random Forest
The primary weakness of a Decision Tree is that it does not tend to have the best predictive accuracy, partly because of high
variance: different splits of the training data can lead to different results. Random Forest uses a technique called
bagging to reduce this variance.
Here we create an ensemble of decision trees using bootstrapped samples of the training set, i.e. samples of the training set
drawn with replacement. The predictions of these decision trees are then merged to obtain a more accurate and stable
prediction. At each node, the algorithm chooses a random subset of features as split candidates. A larger number of trees
generally results in better accuracy.
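The two ingredients described above, bootstrap sampling and merging of tree votes, can be sketched with the standard library alone (a simplified illustration of bagging, not a full Random Forest):

```python
import random

def bootstrap_sample(data, seed=None):
    """Sample the training set with replacement, same size as the original."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Merge the individual tree predictions into one ensemble prediction."""
    return max(set(predictions), key=predictions.count)

train = [(1, 'A'), (2, 'A'), (3, 'B'), (4, 'B')]
sample = bootstrap_sample(train, seed=0)   # what one tree would be fitted on
print(len(sample) == len(train))           # True: same size, duplicates allowed
print(majority_vote(['A', 'A', 'B']))      # 'A'
```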
1.5 Support Vector Machine (SVM):
SVM is a technique used for both classification and regression problems; here we discuss its classification aspect. To
classify data, it uses hyperplanes, which act as decision boundaries between the various classes, creating segments such
that each segment contains only one type of data. SVM can also classify non-linear data using a kernel function, which
transforms the data into another dimension in order to select the best decision boundary. The best, or optimal, separating
line is the one with the largest distance between itself and the closest data points. This distance is calculated as the
perpendicular distance from the line to the closest points.
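The perpendicular distance just described is |w·x + b| / ||w|| for a hyperplane w·x + b = 0. A short sketch, with a made-up boundary and points chosen purely for illustration:

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    return abs(dot + b) / norm

# Candidate boundary x1 + x2 - 2 = 0 and a few sample points.
points = [(3, 3), (0, 0), (2, 1)]
margins = [distance_to_hyperplane((1, 1), -2, p) for p in points]
# SVM picks the boundary whose smallest such distance (the margin) is largest.
print(min(margins))
```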
1.6 Naïve Bayes:
Bayes' theorem gives the probability of an event, given prior knowledge of related events. Naive Bayes is a statistical
classifier. It assumes that the effect of a feature on a given class is independent of the other features; this assumption
is known as class conditional independence. Below is the equation:
P(A|B) = (P(B|A) · P(A)) / P(B)
P(A|B): the probability of event A given that event B has already happened (posterior)
P(B|A): the probability of event B given that event A has already happened (likelihood)
P(A): probability of event A (prior)
P(B): probability of event B (marginal)
Proof of Bayes' theorem:
P(A|B) = P(A ∩ B) / P(B)
P(B|A) = P(B ∩ A) / P(A)
Since P(A ∩ B) = P(B ∩ A),
P(A ∩ B) = P(A|B) · P(B) = P(B|A) · P(A)
Therefore P(A|B) = (P(B|A) · P(A)) / P(B)
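A quick numeric check of the equation, using invented probabilities for a hypothetical diagnostic test (the numbers are for illustration only, not taken from the dataset):

```python
def posterior(likelihood, prior, marginal):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / marginal

p_a = 0.01                      # prior: P(disease)
p_b_given_a = 0.9               # likelihood: P(positive test | disease)
p_b = 0.9 * 0.01 + 0.05 * 0.99  # marginal P(positive) via total probability
print(round(posterior(p_b_given_a, p_a, p_b), 4))  # 0.1538
```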
2. METHODOLOGY:
2.1 Dataset:
The data consists of a single CSV file from which the training and testing sets are formed. The training sample is used to fit
the machine learning models, and their performance is evaluated on the testing sample. The main objective for the models is to
predict whether or not a patient has heart disease.
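The split can be sketched as follows (a plain-Python stand-in for scikit-learn's train_test_split; the row count of 303 is the size of the commonly distributed UCI Heart Disease file and is assumed here):

```python
import random

def train_test_split(rows, test_size=0.3, seed=42):
    """Shuffle the rows, then hold out a test_size fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(303))          # stand-in for the CSV records
train, test = train_test_split(rows)
print(len(train), len(test))     # 213 90
```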
Below is the head of the dataset:
The target variable is binary; it is the dependent variable that the models must predict.
Data Dictionary:
 Age
 Sex
 Chest pain type (4 values)
 Resting blood pressure
 Serum cholesterol in mg/dl
 Fasting blood sugar > 120 mg/dl
 Resting electrocardiographic results (values 0,1,2)
 Maximum heart rate achieved
 Exercise induced angina
 Old peak = ST depression induced by exercise relative to rest
 Slope of the peak exercise ST segment
 Number of major vessels (0-3) coloured by fluoroscopy
 Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
2.2 Evaluation Parameters:
2.2.1 Confusion Matrix:
A confusion matrix is used to evaluate how an algorithm performed on the testing data. The rows of this matrix correspond
to what the machine learning algorithm predicted, and the columns correspond to the known truth.
2.2.2 Precision:
Precision is the number of true positives divided by the sum of the true positives and the false positives.
2.2.3 Recall:
Recall is also known as sensitivity. It is the number of true positives divided by the sum of the true positives and the false
negatives.
2.2.4 Accuracy:
Accuracy is the sum of the true positives and the true negatives divided by the overall number of cases.
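These definitions, together with the f1-score, can be computed directly from confusion-matrix counts. As a sketch, the counts below are read from the Naive Bayes confusion matrix in the results table ([[35 14], [4 38]]), treating class 1 as positive; small differences from the table come from rounding:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, f1-score and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

p, r, f1, acc = metrics(tp=38, fp=14, fn=4, tn=35)
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 4))
# 0.73 0.9 0.81 0.8022
```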
2.3 Experimental Results:
Classifier            Class   Precision   Recall   f1-score   Accuracy   Confusion Matrix
Logistic Regression     0       0.91       0.65      0.76      78.02%    [[32 17]
                        1       0.70       0.93      0.80                 [ 3 39]]
KNN                     0       0.76       0.51      0.61      64.83%    [[25 24]
                        1       0.59       0.81      0.68                 [ 8 34]]
Decision Tree           0       0.67       0.63      0.65      63.73%    [[31 18]
                        1       0.60       0.64      0.62                 [15 27]]
Random Forest           0       0.89       0.67      0.77      78.02%    [[33 16]
                        1       0.70       0.90      0.79                 [ 4 38]]
SVM                     0       0.00       0.00      0.00      46.15%    [[ 0 49]
                        1       0.46       1.00      0.63                 [ 0 42]]
Naïve Bayes             0       0.90       0.71      0.80      80.21%    [[35 14]
                        1       0.73       0.90      0.81                 [ 4 38]]
3. CONCLUSION:
In this work we tested six supervised learning classification models on the Heart Disease dataset. For the train-test split, a
test size of 0.3 was chosen. The results above show that the Naive Bayes classifier made the most correct predictions,
misclassifying only 18 samples, probably because the dataset favours the conditionally independent behaviour of the attributes
assumed by this classifier, whereas SVM performed the worst with an accuracy of just over 46%.

More Related Content

PDF
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET Journal
 
PDF
Dimensionality Reduction
mrizwan969
 
PDF
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
Venkata Karthik Gullapalli
 
PDF
Classification Based Machine Learning Algorithms
Md. Main Uddin Rony
 
PDF
Research scholars evaluation based on guides view using id3
eSAT Journals
 
PDF
Pca analysis
kunasujitha
 
PDF
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
ijaia
 
PDF
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
IJECEIAES
 
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET Journal
 
Dimensionality Reduction
mrizwan969
 
A Novel Methodology to Implement Optimization Algorithms in Machine Learning
Venkata Karthik Gullapalli
 
Classification Based Machine Learning Algorithms
Md. Main Uddin Rony
 
Research scholars evaluation based on guides view using id3
eSAT Journals
 
Pca analysis
kunasujitha
 
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
ijaia
 
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
IJECEIAES
 

What's hot (14)

PPTX
Recommendation system
Ding Li
 
PPTX
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Madhav Mishra
 
PDF
Fuzzy logic applications for data acquisition systems of practical measurement
IJECEIAES
 
PDF
IRJET-Handwritten Digit Classification using Machine Learning Models
IRJET Journal
 
PDF
Research scholars evaluation based on guides view
eSAT Publishing House
 
PPTX
WEKA: Algorithms The Basic Methods
DataminingTools Inc
 
PDF
Recognition of Handwritten Mathematical Equations
IRJET Journal
 
PPTX
Random forest
Musa Hawamdah
 
PPT
CART Classification and Regression Trees Experienced User Guide
Salford Systems
 
PPT
Supervised and unsupervised learning
AmAn Singh
 
PDF
Data Science - Part V - Decision Trees & Random Forests
Derek Kane
 
PDF
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
IRJET Journal
 
PDF
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
ijcseit
 
PDF
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Seval Çapraz
 
Recommendation system
Ding Li
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Madhav Mishra
 
Fuzzy logic applications for data acquisition systems of practical measurement
IJECEIAES
 
IRJET-Handwritten Digit Classification using Machine Learning Models
IRJET Journal
 
Research scholars evaluation based on guides view
eSAT Publishing House
 
WEKA: Algorithms The Basic Methods
DataminingTools Inc
 
Recognition of Handwritten Mathematical Equations
IRJET Journal
 
Random forest
Musa Hawamdah
 
CART Classification and Regression Trees Experienced User Guide
Salford Systems
 
Supervised and unsupervised learning
AmAn Singh
 
Data Science - Part V - Decision Trees & Random Forests
Derek Kane
 
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
IRJET Journal
 
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...
ijcseit
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Seval Çapraz
 
Ad

Similar to IRJET- Performance Evaluation of Various Classification Algorithms (20)

PDF
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET Journal
 
PDF
IRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET Journal
 
PDF
AIRLINE FARE PRICE PREDICTION
IRJET Journal
 
PDF
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
IRJET Journal
 
PDF
IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...
IRJET Journal
 
PDF
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
IRJET Journal
 
PDF
Real Estate Investment Advising Using Machine Learning
IRJET Journal
 
PDF
AMAZON STOCK PRICE PREDICTION BY USING SMLT
IRJET Journal
 
PDF
IRJET - An User Friendly Interface for Data Preprocessing and Visualizati...
IRJET Journal
 
PDF
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
IRJET Journal
 
PDF
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
PDF
Working with the data for Machine Learning
Mehwish690898
 
PDF
Student Performance Predictor
IRJET Journal
 
PPT
final report (ppt)
butest
 
PDF
A tour of the top 10 algorithms for machine learning newbies
Vimal Gupta
 
PDF
Machine Learning.pdf
BeyaNasr1
 
PDF
Proposing an Appropriate Pattern for Car Detection by Using Intelligent Algor...
Editor IJCATR
 
PDF
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET Journal
 
PDF
A02610104
theijes
 
PPTX
Big Data Analytics.pptx
Kaviya452563
 
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET Journal
 
IRJET - Rainfall Forecasting using Weka Data Mining Tool
IRJET Journal
 
AIRLINE FARE PRICE PREDICTION
IRJET Journal
 
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
IRJET Journal
 
IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...
IRJET Journal
 
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
IRJET Journal
 
Real Estate Investment Advising Using Machine Learning
IRJET Journal
 
AMAZON STOCK PRICE PREDICTION BY USING SMLT
IRJET Journal
 
IRJET - An User Friendly Interface for Data Preprocessing and Visualizati...
IRJET Journal
 
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
IRJET Journal
 
Parallel KNN for Big Data using Adaptive Indexing
IRJET Journal
 
Working with the data for Machine Learning
Mehwish690898
 
Student Performance Predictor
IRJET Journal
 
final report (ppt)
butest
 
A tour of the top 10 algorithms for machine learning newbies
Vimal Gupta
 
Machine Learning.pdf
BeyaNasr1
 
Proposing an Appropriate Pattern for Car Detection by Using Intelligent Algor...
Editor IJCATR
 
IRJET- A Detailed Study on Classification Techniques for Data Mining
IRJET Journal
 
A02610104
theijes
 
Big Data Analytics.pptx
Kaviya452563
 
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
PDF
Kiona – A Smart Society Automation Project
IRJET Journal
 
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
PDF
Breast Cancer Detection using Computer Vision
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
Kiona – A Smart Society Automation Project
IRJET Journal
 
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
Breast Cancer Detection using Computer Vision
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 

Recently uploaded (20)

PPTX
MSME 4.0 Template idea hackathon pdf to understand
alaudeenaarish
 
PDF
Zero Carbon Building Performance standard
BassemOsman1
 
PPTX
22PCOAM21 Session 1 Data Management.pptx
Guru Nanak Technical Institutions
 
PPT
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
PDF
Top 10 read articles In Managing Information Technology.pdf
IJMIT JOURNAL
 
PDF
20ME702-Mechatronics-UNIT-1,UNIT-2,UNIT-3,UNIT-4,UNIT-5, 2025-2026
Mohanumar S
 
PDF
2010_Book_EnvironmentalBioengineering (1).pdf
EmilianoRodriguezTll
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PPTX
Victory Precisions_Supplier Profile.pptx
victoryprecisions199
 
PDF
FLEX-LNG-Company-Presentation-Nov-2017.pdf
jbloggzs
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PDF
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
PDF
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
PPTX
AgentX UiPath Community Webinar series - Delhi
RohitRadhakrishnan8
 
PPTX
Information Retrieval and Extraction - Module 7
premSankar19
 
PDF
Introduction to Data Science: data science process
ShivarkarSandip
 
PDF
Unit I Part II.pdf : Security Fundamentals
Dr. Madhuri Jawale
 
PPTX
Introduction of deep learning in cse.pptx
fizarcse
 
PDF
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
PPTX
easa module 3 funtamental electronics.pptx
tryanothert7
 
MSME 4.0 Template idea hackathon pdf to understand
alaudeenaarish
 
Zero Carbon Building Performance standard
BassemOsman1
 
22PCOAM21 Session 1 Data Management.pptx
Guru Nanak Technical Institutions
 
1. SYSTEMS, ROLES, AND DEVELOPMENT METHODOLOGIES.ppt
zilow058
 
Top 10 read articles In Managing Information Technology.pdf
IJMIT JOURNAL
 
20ME702-Mechatronics-UNIT-1,UNIT-2,UNIT-3,UNIT-4,UNIT-5, 2025-2026
Mohanumar S
 
2010_Book_EnvironmentalBioengineering (1).pdf
EmilianoRodriguezTll
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
Victory Precisions_Supplier Profile.pptx
victoryprecisions199
 
FLEX-LNG-Company-Presentation-Nov-2017.pdf
jbloggzs
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
EVS+PRESENTATIONS EVS+PRESENTATIONS like
saiyedaqib429
 
Packaging Tips for Stainless Steel Tubes and Pipes
heavymetalsandtubes
 
AgentX UiPath Community Webinar series - Delhi
RohitRadhakrishnan8
 
Information Retrieval and Extraction - Module 7
premSankar19
 
Introduction to Data Science: data science process
ShivarkarSandip
 
Unit I Part II.pdf : Security Fundamentals
Dr. Madhuri Jawale
 
Introduction of deep learning in cse.pptx
fizarcse
 
67243-Cooling and Heating & Calculation.pdf
DHAKA POLYTECHNIC
 
easa module 3 funtamental electronics.pptx
tryanothert7
 

IRJET- Performance Evaluation of Various Classification Algorithms

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3875 Performance Evaluation of Various Classification Algorithms Shafali Deora Amritsar College of Engineering & Technology, Punjab Technical University -----------------------------------------------------------***---------------------------------------------------------- Abstract - Classification is a technique in which the data is categorized into 2 or more classes. It can be performed on both structured/ linear as well as unstructured/ non-linear data. The main goal of the classification problem is to identify the category of the test data. The 'Heart Disease' dataset has been chosen for study purpose in order to infer and understand the results from different classification models. An effort has been put forward through this paper to study and evaluate the prediction abilities of different classification models on the dataset using scikit-learn library. Naive Bayes, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, and K-Nearest Neighbor are some of the classification algorithms chosen for evaluation on the dataset. After analysis of the results on the considered dataset, the evaluation parameters like confusion matrix, precision, recall, f1-score and accuracy have been calculated and compared for each of the models. Key Words: Pandas, confusion matrix, precision, logistic regression, decision tree, random forest, Naïve Bayes, SVM, sklearn. 1. INTRODUCTION Machine learning became famous in 1990s as the intersection of computer science and statistics originated the probabilistic approaches in AI. Having large-scale data available, scientists started building intelligent systems capable of analysing and learning from large amounts of data. 
Machine learning is a type of artificial intelligence which provides computer with the ability to learn without being explicitly programmed. Machine learning basically focuses on the development of the computer programs that can change accordingly when exposed to new data depending upon the past scenarios. It is closely related to the field of computational statistics as well as mathematical optimization which focuses on making predictions using computers. Machine learning uses supervised learning in a variety of practical scenarios. The algorithm builds a mathematical model from a set of data that contains both the inputs and the desired outputs. The model is first trained by feeding the actual datasets to help it build a mapping between the dependent and independent variables in order to predict the accurate output. Classification belongs to the category of supervised learning where the computer program learns from the data input given to it and then uses this learning to classify new observations. This approach is used when there are fixed number of outputs. This technique categorizes the data into a given number of classes and the goal is to identify the category or class to which a new data will fall. Below are the classification algorithms used for evaluation: 1.1 Logistic Regression Logistic Regression is a classification algorithm which produces result in the binary format. This technique is used to predict the outcome of a dependent variable or the target wherein the algorithm uses a linear equation with independent variable called predictors to predict a value which can be anywhere between negative infinity to positive infinity. However, the output of the algorithm is required to be class variable, i.e. 0 or 1. Hence, the output of the linear equation is squashed into a range of [0, 1]. The sigmoid function is used to map the prediction to probabilities. 
Mathematical representation of sigmoid function: Where F(z): output between 0 and 1 z: input to the function F( z)= 1 1+ e− z
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3876 Graphical representation of sigmoid function: 1.2 K- Nearest Neighbor: K nearest neighbour is a supervised learning technique that stores all the available cases and classifies the new data based on a similarity measure. It uses the least distance measure in order to find its nearest neighbours where it looks at the ‘K’ nearest training data points for each test data point and takes the most frequently occurring class and assign that class to the test data. K= number of nearest neighbours Let’s say for K=6, the algorithm would look out for 6 nearest neighbours to the test data. As class A forms the majority over class B, hence class A would be assigned to the test data in the below example. The distance between two points can be calculated using either Euclidean distance which is the least distance between two points or Manhattan distance which is the distance between the points measured along the axis at right angle.
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 03 | Mar 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3877 Euclidean Distance = = 5 Manhattan Distance = |5 – 1| + |4 – 1| = 7 1.3 Decision Tree Decision Tree is the tree representation of all the possible solutions to a decision based on certain conditions. The tree can be explained by two entities- decision nodes and leaves where leaf nodes are the decisions or the final outcomes and the decision nodes are where the data is split. There is a concept of pruning in which the unwanted nodes are removed from the tree. The CART (Classification and Regression Tree) algorithm is used to design the tree which helps choosing the best attribute and deciding on where to split the tree. This splitting is decided on the basis of these factors: Gini Index: Measure of impurity used to build decision tree in CART. Information Gain: Decrease in the entropy after a dataset is split on basis of an attribute is information gain and the purpose is to find attribute that returns the highest information gain. Reduction in Variance: Algorithm used for continuous target variables and split with lower variance is selected as the criteria for splitting. Chi Square: Algorithm to find the statistical significance between sub nodes and parent nodes. √(5−1)2 + (4−1)2
1.4 Random Forest

The primary weakness of a Decision Tree is that it does not tend to have the best predictive accuracy, partly because of high variance: different splits of the training data can lead to very different trees. Random Forest uses a technique called bagging to reduce this variance. An ensemble of decision trees is built on bootstrapped samples of the training set (sampling with replacement), and at each node a random subset of features is considered for splitting. The predictions of these trees are then combined to give a more accurate and stable prediction; a larger number of trees generally results in better accuracy.

1.5 Support Vector Machine (SVM):

This technique is used for both classification and regression problems; here we discuss its classification aspect. To classify data, it uses hyperplanes, which act as decision boundaries between the classes and segment the space so that each segment contains only one type of data. SVM can also classify non-linear data using a kernel function, which transforms the data into another dimension where a good decision boundary can be selected. The best or optimal separating line is the one with the largest distance between itself and the closest data points, calculated as the perpendicular distance from the line to those points.
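Both ideas can be sketched with scikit-learn on a small synthetic stand-in for the real data (the cluster construction and parameter values here are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two well-separated synthetic clusters standing in for a real dataset.
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 4])
y = np.array([0] * 50 + [1] * 50)

# Bagging: each of the 100 trees sees a bootstrapped sample of the training
# set, and a random subset of features is considered at every split.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The RBF kernel maps the data into another dimension so that a
# maximum-margin decision boundary can be found.
svm = SVC(kernel="rbf").fit(X, y)

print(forest.score(X, y), svm.score(X, y))
```

On data this cleanly separated both models fit the training set almost perfectly; the differences the paper measures only show up on held-out test data.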
It assumes that the effect of a feature on a given class is independent of the other features; this assumption is known as class conditional independence. Bayes' theorem is given by:

P(A|B) = (P(B|A) · P(A)) / P(B)

P(A|B): probability of event A given that event B has already happened (posterior)
P(B|A): probability of event B given that event A has already happened (likelihood)
P(A): probability of event A (prior)
P(B): probability of event B (marginal)

Proof of Bayes' theorem:

P(A|B) = P(A ∩ B) / P(B)
P(B|A) = P(B ∩ A) / P(A)

Since P(A ∩ B) = P(B ∩ A),

P(A ∩ B) = P(A|B) · P(B) = P(B|A) · P(A)

and therefore P(A|B) = (P(B|A) · P(A)) / P(B).

2. METHODOLOGY:

2.1 Dataset:

The data comes as a single CSV file from which the training and testing sets are formed. The training sample is used to fit the machine learning models, and their performance is evaluated on the testing sample. The main objective of the models is to predict whether or not a patient has heart disease. Below is the head of the dataset:

[Figure: first rows of the heart disease dataset]
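The methodology just described can be sketched with scikit-learn. Since the CSV itself is not reproduced here, the snippet uses synthetic stand-in data of the same shape (303 rows, 13 features, binary target) and the 70/30 split mentioned in the conclusion:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(1)
# Synthetic stand-in for the heart disease CSV: 13 features, binary target.
X = rng.randn(303, 13)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 70/30 train/test split, matching the test size of 0.3 used in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Fit a naive Bayes classifier on the training sample and evaluate it
# on the held-out testing sample.
model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))
```

In the paper the same split feeds all six classifiers, so every model is compared on an identical held-out sample.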
The target variable, in binary form, is the dependent variable that the models need to predict.

Data Dictionary:
• Age
• Sex
• Chest pain type (4 values)
• Resting blood pressure
• Serum cholesterol in mg/dl
• Fasting blood sugar > 120 mg/dl
• Resting electrocardiographic results (values 0, 1, 2)
• Maximum heart rate achieved
• Exercise induced angina
• Oldpeak = ST depression induced by exercise relative to rest
• Slope of the peak exercise ST segment
• Number of major vessels (0-3) coloured by fluoroscopy
• Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

2.2 Evaluation Parameters:

2.2.1 Confusion Matrix: A confusion matrix is used to evaluate how an algorithm performed on the testing data. In the matrices reported here (scikit-learn's convention), the rows correspond to the known truth and the columns correspond to what the machine learning algorithm predicted.

2.2.2 Precision: Precision = TP / (TP + FP), i.e. the true positives divided by the sum of the true positives and the false positives.
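These quantities can be reproduced with scikit-learn. As a check, the label vectors below are constructed so that they yield the confusion matrix reported for logistic regression in the experimental results, [[32 17] [ 3 39]] (the vectors themselves are a reconstruction, not the paper's actual predictions):

```python
from sklearn.metrics import confusion_matrix, precision_score

# 49 actual negatives followed by 42 actual positives...
y_true = [0] * 49 + [1] * 42
# ...of which 32/17 and 3/39 were predicted as 0/1 respectively.
y_pred = [0] * 32 + [1] * 17 + [0] * 3 + [1] * 39

# Rows are the known truth, columns are the predictions.
print(confusion_matrix(y_true, y_pred))
# [[32 17]
#  [ 3 39]]

# Precision for class 0 = TP / (TP + FP) = 32 / (32 + 3)
print(round(precision_score(y_true, y_pred, pos_label=0), 2))  # -> 0.91
```

The recovered 0.91 matches the class-0 precision reported for logistic regression, confirming the row/column convention used in the results.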
2.2.3 Recall: Recall, also known as sensitivity, is Recall = TP / (TP + FN), i.e. the true positives divided by the sum of the true positives and the false negatives.

2.2.4 Accuracy: Accuracy = (TP + TN) / (total number of cases), i.e. the sum of the true positives and the true negatives divided by the overall number of cases.

2.3 Experimental Results:

Classifier           Class  Precision  Recall  f1-score  Accuracy  Confusion Matrix
Logistic Regression  0      0.91       0.65    0.76      78.02%    [[32 17]
                     1      0.70       0.93    0.80                 [ 3 39]]
KNN                  0      0.76       0.51    0.61      64.83%    [[25 24]
                     1      0.59       0.81    0.68                 [ 8 34]]
Decision Tree        0      0.67       0.63    0.65      63.73%    [[31 18]
                     1      0.60       0.64    0.62                 [15 27]]
Random Forest        0      0.89       0.67    0.77      78.02%    [[33 16]
                     1      0.70       0.90    0.79                 [ 4 38]]
SVM                  0      0.00       0.00    0.00      46.15%    [[ 0 49]
                     1      0.46       1.00    0.63                 [ 0 42]]
Naïve Bayes          0      0.90       0.71    0.80      80.21%    [[35 14]
                     1      0.73       0.90    0.81                 [ 4 38]]

3. CONCLUSION:

In this effort we tested 6 different supervised learning classification models on the Heart Disease dataset, using a train-test split with a test size of 0.3. The results above show that the Naïve Bayes classifier made the most correct predictions, misclassifying only 18 samples, probably because the dataset favours the conditionally independent behaviour of the attributes assumed by this classifier, whereas SVM performed the worst with an accuracy of just over 46%.