Employee Churn Prediction Using Logistic Regression
SJSU ScholarWorks
Fall 2021
Recommended Citation
Maharjan, Rajendra, "Employee Churn Prediction using Logistic Regression and Support Vector Machine"
(2021). Master's Projects. 1043.
DOI: https://doi.org/10.31979/etd.3t5h-excq
https://scholarworks.sjsu.edu/etd_projects/1043
Employee Churn Prediction using Logistic Regression and Support Vector Machine
A Project
Presented to
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
by
Rajendra Maharjan
December 2021
© 2021
Rajendra Maharjan
ABSTRACT
It is more of a challenge for a Human Resources (HR) team to retain their existing employees than to
hire new ones. For any company, losing valuable employees is a loss in terms of time, money,
productivity, and trust. This loss could possibly be minimized if HR could find out beforehand
which employees are planning to quit their jobs; hence, we investigated solving the
employee churn problem from a machine learning perspective. We designed machine
learning models using supervised, classification-based algorithms, namely Logistic Regression and
Support Vector Machine (SVM). The models were trained with the IBM HR employee dataset
retrieved from https://kaggle.com and later fine-tuned to boost their performance.
Metrics such as precision, recall, the confusion matrix, AUC, and the ROC curve were used to compare the
performance of the models. The Logistic Regression model recorded an accuracy of 0.67,
Sensitivity of 0.65, Specificity of 0.70, Type I Error of 0.30, Type II Error of 0.35, and an AUC score
of 0.73, whereas SVM achieved an accuracy of 0.93 with Sensitivity of 0.98, Specificity of 0.88,
Type I Error of 0.12, Type II Error of 0.01, and an AUC score of 0.96.
ACKNOWLEDGEMENTS
First of all, I am grateful to my advisor Dr. Robert Chun. His continuous support and
guidance have made this project better. I am thankful to Dr. Chun for giving me the freedom to
choose the project topic of my interest. I would like to thank my committee members Prof.
Christopher Pollett and Ms. Stuti Patel for their valuable feedback and suggestions. A heartfelt
thanks to everyone who supported me during this project. Last but not least, I am always thankful
to my friends and family members who encouraged me throughout.
TABLE OF CONTENTS
I. Introduction................................................................................................................ 1
a. Logistic Regression........................................................................................ 16
V. Implementation ....................................................................................................... 23
REFERENCES............................................................................................................. 59
LIST OF FIGURES
Figure 12. List of columns with its data types and non-null counts ............................. 24
Figure 13. Count of rows, columns, and data types in the dataset ............................... 24
Figure 44. Distribution of attrition and non-attrition employees in the train and test dataset ... 41
Figure 45. Confusion matrix and Classification report of Logistic Regression model ............. 42
Figure 46. Confusion matrix and Classification report of SVM ................................... 42
Figure 48. Total records and class label counts before and after Under Sampling ................ 44
Figure 50. Confusion matrix and Classification report of SVM using Under Sampling .............. 44
Figure 51. Total records and counts of the class label before and after Over Sampling .......... 45
Figure 52. Confusion matrix and Classification report of Logistic Regression with Over Sampling  46
Figure 53. Confusion matrix and Classification report of SVM with Over Sampling ................ 46
Figure 54. Class label counts before and after the SMOTE process ............................... 47
Figure 55. Confusion matrix and Classification report of Logistic Regression with SMOTETOMEK ... 48
Figure 58. Confusion matrix and Classification report of Logistic Regression with hyperparameters 51
Figure 59. Confusion matrix and Classification report of SVM with hyperparameters .............. 51
I. INTRODUCTION
Employees churn when they quit, resign, or retire from their job, requiring the position to be filled up by another candidate. Feeling a lack of
coaching and feedback, lack of growth, commute time, unsatisfied pay scale, feeling devalued,
work stress, work-life imbalance, and lack of trust from the supervisor, etc. are some of the
common reasons why people quit their jobs [1]. In 2018 a report was published by LinkedIn Talent
Solutions that showed the technology industry is on the top with the highest employee turnover of
13.2% [2]. To solve the employee churn problem, we designed two supervised models one using
the Logistic Regression and the other using SVM models to predict whether an employee will
churn or not. An exploratory data analysis was performed to explore and clean the IBM HR dataset
that we retrieved from www.kaggle.com. Techniques such as heatmap, χ2, Extra Tree classifier,
and Information Gain were implemented as a part of the feature selection process. Stratified K-
Fold cross-validation, under sampling, over sampling, and SMOTETOMEK methods were
implemented to balance the IBM HR dataset, and metrics such as Confusion matrix, Classification
report, Sensitivity, Specificity, Type I Error, Type II Error, ROC, and AUC score were used to
monitor the performance of the models. To optimize the performance of the model, the
GridSearchCV method was used for the Logistic Regression, and parameters such as C, kernel, and gamma were tuned for the SVM model.
Employees are always considered valuable assets of a company. Especially those
employees who have worked for a longer period, gained years of experience, and are top performers
are considered special employees. Companies bear a greater loss when such special employees
resign. A resignation can also have a psychological effect on the team, lowering the morale of the team members and
ultimately leading to the loss of productivity. Having an experienced employee resigning from the
job can bring insecurity to the HR and Management team thinking that their former employee may
apply the acquired intellectual knowledge to their rival companies making the competition fiercer.
Companies with a large amount of employee turnover can potentially damage their brand value
and also start losing the trust of their valuable clients. Filling up the vacant positions sometimes
requires coordinating with a third-party staffing agency, which adds extra expenses for the
company. A newly recruited staff member needs to be sent for various trainings so that he/she can perform
his/her duties well, and sending employees to training costs additional money to
the company. The newly recruited employees require some time to adjust to the new teammates
and workplace before they can start performing. The HR team has to go through all those steps
every time there is a search for a replacement resource. Hence, to minimize such situations, the HR team strives to retain its existing employees.
Various strategies have been approached by the managers and HR team to retain their top
talents. Compensation and Benefits packages are one of those strategies where the HR team works
on setting up a proper compensation structure across organizational levels, assuring fair pay and
equity, offering better benefits packages such as vesting, stock options, and cash bonuses, etc. [3].
Company perks such as extended PTO and flex time have been an effective way of retaining the
employees [4]. Establishing a staff/employee recognition program can make employees feel their
work is being valued and hence that may encourage them to stay with the company for a longer
period [5]. Conducting surveys is another good approach to collect employees’ feedback which
helps to get an insight into what the employees feel towards their employer. Some HR teams conduct
exit interviews to know the reason why an employee decided to quit the job. Exit interviews are
in-person interviews so there can be a situation where an employee may not feel comfortable
sharing the actual reason for leaving. Once an employee makes a final decision of leaving the
company and sends a resignation letter then it is almost impossible for HR to convince the
employee to change their decision. However, the case is different if HR discovers beforehand that
a staff member is planning to resign; HR could then take preliminary countermeasures which
could potentially help to change the employee's mind about resigning. Multiple efforts have been
made in the past to solve employee or customer churn problems using various machine learning
algorithms such as Naïve Bayes, KNN, Decision Trees, Reinforcement learning and Neural
Networks, etc. Past research work has recorded accuracy as high as 0.92 and a Recall of 0.75.
There are two main objectives to conduct this research work. The first objective is to
perform a deep analysis of the employee dataset, mine some patterns or useful information, and
visualize that information so that HR can get some insights. The second objective is to build two
predictive models using the Logistic regression and SVM that predict whether an employee will
churn or not. The goal is to obtain a robust and accurate model through fine-tuning so that the HR
has high confidence in using our designed models. We are hopeful that the employee churn
prediction made by our designed models will help HR in one way or another in taking necessary retention actions in time.
The rest of the paper is organized as follows: Section II explains the prior works that have
been conducted related to customer or employee churn problems, Section III discusses machine
learning topics, and Section IV explains algorithm selection. Sections V and VI cover the implementation and the experimental results, respectively.
II. RELATED WORK
There has been a lot of study and research work conducted by various groups and
individuals on the topic related to churn, attrition, and turnover. Zhang et al. proposed a solution
predicting the customer churn of a leading mobile telecommunication service provider [6]. In the
paper, the authors tried to prove that the network attributes or interpersonal influence have a better
contribution to improving the prediction accuracy compared to that of traditional attributes. The
authors approached the classification and the propagation models implementing decision trees,
logistic regression, and neural network machine learning algorithms. The results showed that the models incorporating the network attributes achieved better prediction accuracy than those using only the traditional attributes.
In [7] Yadav and his team have applied data mining techniques in conjunction with
machine learning algorithms to predict the early attrition of an employee. The brute-force approach
has been implemented to convert a long list of categorical data into two numerical values 0 and 1.
In addition to that, the one-hot encoding technique has also been implemented to convert
categorical variables into numerical values. Each feature available in the dataset has a different
level of contribution to the performance of the model. So, to collect only the meaningful features,
the authors implemented a feature selection method called Recursive Feature Elimination with
Cross-Validation (RFECV). This method recursively selects the optimal set of features by repeatedly
considering smaller and smaller sets of features.
Another interesting study was conducted by Raman and Bhattacharya [8] where they
approached a solution predicting whether a teaching faculty member will leave the business school
or not. Using the archival email dataset, the authors tried to study the email patterns, extracted the
sentiments expressed on the emails, and discovered the features that have a significant contribution
to the model’s prediction. Using the R program, correlation analysis was conducted to discover
the rich set of features. Similarly, word count analysis and sentiment analysis were also performed
using data logging tools. An email pattern was discovered where the faculty members leaving the
business school had more external email communication compared to internal. It was also
discovered that the negative sentiments were less for the faculty members who left the school
compared to those who stayed with the school. Researching the root cause of expressed sentiments
and the decision of leaving the school was beyond the scope of the study.
A case study prepared by Saradhi and Palshikar gives two distinct models, the predictive
model and the value model [9]. The predictive model is designed to predict whether a customer or
an employee will churn or not, whereas the value model is used to identify how many of the
churned employees or customers were valuable. Under the predictive model, the authors
performed a comparison of various machine learning algorithms. Two of the value models, the
customer lifetime value model, and the employee value model were briefly covered in the paper.
To further elaborate the employee model, the authors proposed a simple employee value model
that identified whether a churn employee was a valuable employee or not. The paper also defined
the differences between an employee churn vs customer churn and the valuable customer vs
valuable employee.
A deep learning model using a feed-forward neural network has been proposed to predict
employee attrition by Dutta and Bandyopadhyay [10]. The model is trained on 1470 sample employee
records, where attributes such as rating, distance from home, monthly income, stock option, and work-life balance, etc. were
taken into consideration to make the prediction. The proposed model comprised three layers of
neural network with 32, 16, and 1 number of nodes respectively. To perform diverse computation,
relu and sigmoid activation functions were applied to each of the three layers. To evaluate the
designed predictive model, the authors implemented a 10-fold cross-validation method where the
entire dataset got split into 10 groups and on each iteration, there would be 1 test data and the
remaining 9 groups as training data. The proposed model reported an accuracy of 87.01% and
0.1299 of Mean Square Error (MSE), which outperformed the other 6 classifiers, including SVM and Naïve Bayes.
Another study proposed an ensemble-based model to predict customer churn for a company [11]. The paper talked about how a negative correlation learning method can be used to
resolve the issue of target class/label imbalance. The authors claimed that the trained ensemble
multilayer perceptron (MLP) using the negative correlation learning method has better customer
churn prediction accuracy compared to that without NCL and other data mining techniques. The
optimal model parameters such as hidden layers of N = 10, learning rate = 0.3, use of Sigmoid
logistic activation function, 5-fold cross-validation, and a penalty factor in the range [0,1] were reported in the paper.
Gao and the team implemented a weighted quadratic random forest (WQRF) algorithm to
build a model which predicted employee turnover [12]. The model is an improved version of the
random forest algorithm which uses the F-measure of each decision tree and the weighted voting
mechanism to solve the data imbalance issue. The dataset was extracted from a Chinese
communication company which consists of 2000 employee records and 32 features where only
13.5% of the total population were churned. As a part of feature selection, the attributes were
ordered in descending order based on their importance score and only the top 15 attributes were
selected. A benchmark comparison of the WQRF algorithm with C4.5, logistic regression, BP, and
random forest was conducted, where the proposed algorithm outperformed the rest with a
reported accuracy of 92.80%, Recall of 0.653, F-measure of 0.711, and ROC area of 0.881.
Models based on Random Forest and Naïve Bayes classifiers were proposed by Valle and
Ruz to predict the turnover of the sales agent in a call center [13]. As a part of the experiment, a
sample of 3543 sales activity records from 2407 sales agents was taken into consideration. The
proposed models were trained with only 6 attributes such as logged hours, talked hours, effective
contacts, number of approved sales, number of finished records, and approved production. In
addition to that, the mean and standard deviation of performance measure metrics such as accuracy,
precision, recall, and area under the curve (AUC) were computed and compared for the Random Forest and Naïve Bayes models.
A churn prediction approach based on customers' textual comments was proposed by Andrea and Tronscoso [14]. The textual comments used by the customers to interact with the
service provider were considered as the main dataset. A total of 23,195 interactions from 14,531 customers
of a Chilean bank were considered to build the model. Text mining techniques such as part-of-speech
tagging and linguistic pattern analysis procedures were adapted to extract the churn determinant.
Models like logistic regression, decision tree, multi-layer perceptron neural network, support
vector machine, random forest, AdaBoost, and kNN were designed. Out of all, SVM was reported
as the best performing model with an overall accuracy of 58.3%, precision = 58.5%, and recall =
58.4%.
Customer churn prediction using reinforcement learning was explored by Panjasuchat and Limpiyakorn [15]. The original telecommunication dataset extracted from Kaggle
contained 100,000 sample records and 99 attributes. From the original dataset, 3 versions of the
dataset were prepared, dataset1 which is the preprocessed version, dataset2 is the shuffled version
and dataset3 is the combination of dataset1 and dataset2. Applying reinforcement learning and
using the Deep Q Network (DQN) algorithm, the model used the concept of reward with a positive
integer value for correct prediction and a negative integer as a penalty for an incorrect prediction.
The proposed neural network consists of 4 connected layers and 2 hidden layers where each layer
contains 256 neurons. Each layer of the neural network used ReLU as the activation function,
0.001 as a parameter for the learning rate, and Adam was the optimizer algorithm. The proposed
Deep Q Network (DQN) algorithm outperformed other algorithms XGBoost, random forest, and
kNN with the highest accuracy of 65.26%, Precision = 63.71%, Recall = 65.81%, and F1 score =
64.74%.
Madushanka et al. presented a new concept of cognitive learning technique to predict
customer churn behavior [16]. The model is designed into two phases, dataset clustering phase,
and model training phase. During the clustering phase, Kernelized Growing Self Organizing Map
(KGSOM) and Growing Self Organizing Map (GSOM) techniques were implemented to cluster
the data evenly. Later the pruning technique is applied to reduce the number of nodes and obtain
faster and better clusters. Models such as SVM, random forests, and logistic regression were
trained with KGSOM-preprocessed data. It was observed that KGSOM made a good contribution to improving the churn prediction accuracy of the models.
Multiple supervised classification algorithms like support vector machine, random forest,
Naïve Bayes, kNN, decision tree, and logistic regression are implemented to help the HR team
with employee turnover prediction [17], [18], [19]. Various performance measure metrics like
accuracy, confusion matrix, precision, recall, specificity are used to compare the performance of
each model.
Shang introduced a new idea of predicting employee turnover using survival analysis [20].
Survival analysis is a statistical technique for time-to-event data that computes the probability of
an event occurring at a given point in time. The author implemented the CoxRF
(Cox proportional hazards model with Random Forest) algorithm in conjunction with survival
analysis to build the predictive model. The probability of occurrence of an event is computed which
is considered as a feature to train the model on predicting employee turnover. Using the Kaplan-
Meier method it was discovered that factors such as gender (female employees), external
environmental factors (GDP growth), and industry type (IT) have a great influence on employee
turnover. It was proved that the CoxRF method outperformed other algorithms like SVM, naïve
Bayes, logistic regression, decision tree, XGBoost, and random forest in terms of reported accuracy.
All the related works that were discussed earlier focused on the performance metrics of
their models and had no specific information on whether the used dataset was balanced or
imbalanced. To cover this research gap, we decided to use an imbalanced IBM HR dataset and
experimented with various balancing techniques such as Stratified K-Fold cross-validation, Over
and Under Sampling, and SMOTETOMEK. We also compared the performance of the Logistic
Regression and SVM model before and after the balancing techniques were implemented. In our
project, we experimented with some of the new feature selection techniques such as the χ2 test,
ExtraTree classifier, and Information Gain which were not discussed in any of the earlier studies.
III. MACHINE LEARNING
Machine learning is a statistical learning framework that falls under the branch of artificial
intelligence. It has the self-learning ability to detect hidden patterns from the data and can make
predictions and decisions based on its learning. Fig 1. shows the hierarchical relationship of
machine learning with other studies of computer science. Machine learning follows a three-step
process of Data, Model, and Action [21]. Data is split into two parts: training and testing data.
Training data is used during the learning phase, where the model tries to learn the patterns and gain
empirical information and knowledge. After the training phase, the testing data is used to check
how well the trained model performs. Finally, the designed model is used to make decisions about the
problem. In our project, we have chosen a data split ratio of 70:30, where 70% of the entire dataset
is considered as training data and the remaining 30% as testing data. Various performance metrics like
accuracy, precision, recall, sensitivity, confusion matrix, ROC (Receiver Operating Characteristic),
and AUC (Area Under the Curve), etc. are used to evaluate the performance of the model. With an
iterative training process and hyperparameter tuning, the model gets better than its earlier version.
One of the salient features of machine learning technology is that it does not require any direct, explicit programming of the rules it learns.
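As an illustration of the 70:30 split described above, the following minimal sketch uses scikit-learn's train_test_split; the feature matrix and labels here are random placeholders rather than the actual IBM HR data.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # placeholder data standing in for the IBM HR features and churn labels
    X = np.random.rand(100, 5)
    y = np.random.randint(0, 2, size=100)

    # 70% of the rows become training data, the remaining 30% testing data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
    print(X_train.shape, X_test.shape)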
A very commonly raised question is when do we need machine learning? In today’s world,
machine learning is used almost everywhere. The tasks that humans are doing every day such as
driving, speech recognition, etc. can be implemented via machine learning. Other complex tasks
that are beyond human capabilities such as weather prediction, analysis of astronomical data, etc.
are made possible by machine learning techniques. Since machine learning requires no input
commands, tasks or problems which require adaptivity on the input data can also be solved through
machine learning [22]. Machine learning is broadly divided into 3 categories: supervised learning,
unsupervised learning, and reinforcement learning. Each of the learning methodologies has its own kind
of problem-solving capabilities and application scope. Fig 2. shows the different types of machine
learning algorithms.
a. Supervised learning
In supervised learning, we have a dataset with input variables (X) as well as labeled output
variables (Y). X^T = (X1, X2, ……, Xp) represents the vector of input features or the independent
variable and Y represents the response or a dependent variable. Using the past input and output
pairs, a supervised learning algorithm discovers a true function or a rule that gives the best
prediction of Y on the given values of X. The derived rule or function will map X->Y.
Classification and regression are the two main problems that are solved by supervised learning.
A classification problem is solved by the model predicting some qualitative, discrete, or categorical
values such as predicting a male or female, detecting cancer or not, and whether a patient survives
or not, etc. [23]. Classification problems can be further divided into binary and multi-class
classification. Binary classification contains only two class labels; churn or not and buy
or not etc. are some examples of binary classification whereas, in multi-class classification, it
contains more than two class labels. Human face classification and DNA classification are some
examples of multi-class classification [24]. The employee churn problem that we are trying to
solve in our project is considered as a binary classification problem where class label 0 represents
churn and 1 represents non-churn. Two supervised machine learning algorithms, Logistic Regression and
SVM are used to solve our classification problem. On the other hand, the regression problem is
solved by predicting continuous or quantitative variables such as predicting an age, salary or price,
etc. Naïve Bayes, Decision Tree, Random Forest, SVM, Linear Regression, Logistic Regression,
and k-nearest neighbor, etc. are some of the most commonly used supervised learning algorithms.
b. Unsupervised learning
In unsupervised learning, we have the input variables X but do not have any output
variables or labels Y. In other words, the model is trained only with the input data such that it tries
to detect hidden patterns or define a rule out of them. Generally, the descriptive models are built
using an unsupervised learning algorithm where the model learns through clustering by forming a
cluster of data points with similar characteristics or through the associate rule which discovers the
rule that describes the data [25]. K-means clustering, DBSCAN, and association rules are some of
the common unsupervised learning algorithms. Unsupervised learning is also widely used for
anomaly detection, such as the detection of fraudulent transactions, malware, fake customer reviews, etc.
c. Reinforcement learning
Unlike supervised algorithms, reinforcement learning does not rely on labeled data; rather,
it uses a unique concept of reward and penalty to train the model. A reward is granted when the
model performs the task correctly and a penalty is issued for any mistakes performed by the model.
A positive score value is represented as a reward and a penalty is denoted by a negative score. The
positive and negative score is used as feedback to iteratively improve the performance of the
model; i.e., the model is in a continuous phase of learning, leveraging the feedback received from
the previous iteration. Reinforcement learning can be related to a real-life example of a kid trying
to learn to play a computer game [21]. Since the game is completely new to the kid, in the
beginning, the kid makes multiple mistakes, but as he moves on, he learns from his previous
mistakes and using his learning he keeps on improving his gaming skills in the subsequent play.
Finally, the kid will complete the game. Q-Learning, Temporal Difference (TD), and Deep
Adversarial Networks are some of the common reinforcement learning algorithms. Fig 4. shows the reinforcement learning process.
V. ALGORITHM SELECTION
In machine learning, there is a famous theorem called “No Free Lunch”. According to that
theorem, no such single model or algorithm exists that can perfectly solve every kind of problem
[22]. Algorithm selection is one of the preliminary and most important decisions that we need
to make before we start solving the problem. Failure to choose an appropriate
algorithm can result in the poor performance of the model thus receiving an unsatisfactory result.
In our project, factors such as problem type, dataset size, computation time, resources, feature
correlation, and target class types (qualitative or quantitative) were considered while selecting the algorithms.
In this paper, we are trying to solve an employee churn classification problem with a
predictive model through a supervised learning algorithm, so the pros and cons of various
supervised learning algorithms are summarized as follows [26], [27]. Naïve Bayes is a simple and
faster-to-implement algorithm that supports a larger dataset. The algorithm is less susceptible to
irrelevant features and can be used to predict multi-label classification problems. As a downside,
the algorithm makes the naïve assumption that the variables of the input data are independent,
which may not be true all the time. The algorithm also fails to perform well if the dataset has an
unequal distribution of labeled classes. Since Logistic Regression is a simple and effective
algorithm that is well suited to solving binary classification problems, we chose it as one of the
algorithms for our project. Maximum likelihood estimation and stochastic gradient descent
methods can be well implemented to best fit the model. Since the algorithm is a linear model, it
may not perform well with the non-linear data. The performance of the model may not be good
when trained with irrelevant or highly correlated data. In the case of a decision tree classifier, the
algorithm is minimally impacted by missing values or irrelevant features present in the dataset.
Normalization or scaling of data is not required with the decision tree algorithm. However, the
algorithm is very prone to a common overfitting issue and it takes a longer training time compared
to the rest of the other classifiers. kNN, or K-nearest neighbor, is a simple and easy-to-implement
algorithm. It makes no prior assumption about the data, but the algorithm is sensitive
to outliers and slows down while processing larger datasets. In other words, the algorithm
performs poorly with higher dimension attributes. Random Forest is an ensemble of decision trees
that performs well even with the data having an unequal distribution of labeled classes. The
algorithm is not impacted by the outliers and is less prone to overfitting issues, but the model
relies heavily on the selected features hence, we need to be cautious while selecting the features.
Support vector machine is another popular classifier that suits best for solving a binary
classification problem. The algorithm has the capability of working with linear as well as nonlinear
data. One of the big advantages of the kernel SVM algorithm is that it makes use of various kernel
methods which can transform lower dimension data to a higher dimension. During the
hyperparameter tuning of the SVM model, we used Radial Basis Function (rbf) as one of the
parameters for the kernel method. SVM is a memory-intensive algorithm and it performs slow
with a larger dataset having overlapped classes. In addition to that, choosing appropriate
hyperparameters is a bit challenging with the SVM algorithm. After comparing the pros and cons of all these algorithms, we selected Logistic Regression and SVM for our project.
a. Logistic Regression
Logistic Regression is a supervised learning algorithm that is widely used to solve binary classification problems. Instead of making the direct prediction, the model outputs
the probability of a point that falls on one of the two sides of the plane. Let X be the predictor
variable and Y be its response variable. Then the relationship between X and Y can be represented as:
Y = β0 + β1X …………. equation 1
Mathematically, β0 and β1 are considered as the y-intercept and slope of a line, whereas in
logistic regression they are treated as the model coefficients. Specifically, β0 is known as the bias of the
model. Considering a binary classification problem, there are two class labels, 0 and 1 (the default class).
According to posterior probabilities, the probability of having 1 as the class label for a given X is provided by the following equations [23]
p(X) = Pr(Y = 1 | X) …………. equation 2
p(X) = β0 + β1X …………. equation 3
Approaching the problem with linear regression (equation 3), the probabilities obtained for p(X) may not fall within the range
[0,1], so to overcome this issue we introduce a special function called the logistic or sigmoid
function. The sigmoid function produces an S-shaped curve line as shown in fig. 5. Fig. 6 shows
the diagram of a logistic regression model in which the sigmoid function takes the sum of the
predicted values and its coefficients, maps the predicted values to corresponding probabilities. The
output of the sigmoid function will always remain in the boundary range of [0,1]. Mathematically
it can be represented by 0 ≤ σ(x) ≤ 1. The sigmoid function can be represented with the following
equation
σ(x) = 1 / (1 + e^(-x)) = e^x / (1 + e^x) …………. equation 4
The ratio p(X) / (1 − p(X)) is called the odds and can take any value from 0 to ∞. Taking the
natural logarithm on both sides of the equation, we get the following equation:
log( p(X) / (1 − p(X)) ) = β0 + β1X …………. equation 5
The quantity log( p(X) / (1 − p(X)) ) is called the log-odds or logit. To fit the logistic regression model,
we need to best estimate the coefficient values (β0 and β1) such that, using those coefficient values,
the model p(X) yields a number that is as close to 1 for a default class and as close to 0 for the
non-default class. It is an iterative process of estimating the coefficient values until and unless the
accuracy of the model reaches an acceptable point. Methods such as least squares, maximum
likelihood, and stochastic gradient descent are available to estimate the coefficient values, but since
maximum likelihood is a general method that can fit both linear and nonlinear models, we
decided to go with the maximum likelihood method to fit the logistic regression model.
Mathematically, the maximum likelihood function can be represented by the equation provided
below.
ℓ(β0, β1) = ∏(i: yi = 1) p(xi) × ∏(i′: yi′ = 0) (1 − p(xi′)) …………. equation 6
The overall goal is to best estimate the values of β0 and β1 that maximize the likelihood function.
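The following short sketch illustrates equations 2–5 numerically with NumPy; the coefficient values β0 and β1 are arbitrary placeholders, not values estimated from the HR dataset.

    import numpy as np

    def sigmoid(z):
        # equation 4: squashes any real number into the (0, 1) range
        return 1.0 / (1.0 + np.exp(-z))

    beta0, beta1 = -1.5, 0.8           # hypothetical coefficients
    x = np.array([0.0, 1.0, 2.0, 5.0])

    p = sigmoid(beta0 + beta1 * x)     # p(X) = Pr(Y = 1 | X)
    odds = p / (1.0 - p)               # odds, between 0 and infinity
    log_odds = np.log(odds)            # equation 5: equals beta0 + beta1 * x
    print(p, log_odds)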
b. Support Vector Machine (SVM)
Support vector machine is a supervised learning algorithm that can be used to solve
both classification as well as regression problems. Particularly, it serves best for solving a binary
classification problem which is one of the reasons we selected SVM as our algorithm to predict
employee churn. The goal of the SVM algorithm is to find the optimal hyperplane that can classify
the data points accurately. A hyperplane is a flat plane of p-1 dimension in a p dimensional space.
For example, in a two-dimensional space, a hyperplane would be one-dimensional, which means a simple line.
In our project, the hyperplane acts as a separating plane that will separate the churn employees from the non-churn employees.
In a two-dimensional space, the hyperplane will divide the space into two halves and it will act as
a separating plane for classifying the data points. Let X = (X1, X2, ……., Xp) be a vector of p
features and let the hyperplane be defined by β0 + β1X1 + β2X2 + …… + βpXp = 0. If this expression is greater than 0, it
indicates that X lies on the one side of the plane, representing the class label with the positive value
+1; if it is less than 0, X lies on the other side of the plane with the negative value -1. So, the model will classify the observations into one of the two
class labels {-1, +1}. As shown in Fig. 7 [23], there could exist multiple hyperplanes that still
classify the data points correctly, but among all those hyperplanes, the optimal hyperplane
contains two parallel margins that pass through one or more of the nearest data points of both
classes, called the support vectors, while keeping the maximum marginal distance. So, to obtain that
optimal hyperplane, a classifier called maximal margin classifier can be used. Fig. 8 [23] shows
various components of a maximal margin classifier. The maximal margin classifier is based on the
maximum margin distance rule. While following the rule, the optimal hyperplane may
continuously shift its position just to accommodate one or two newly added outliers. So, to avoid
that situation a classifier called soft margin or support vector classifier can be implemented. For
the soft margin classifier, it is completely acceptable to have a few observations be misclassified.
In other words, it is acceptable to classify some observations on the incorrect side of the margin or even on the wrong side of the hyperplane.
Fig. 9 [23] shows a few observations that are classified on the wrong side of the margin and the
hyperplane. The observations that lie on the margins or the wrong side of the margin and
hyperplane are known as the support vectors and they do affect the performance of the classifier.
In the case of the support vector classifier, it is up to the user to decide the number of
misclassifications that the model can accept and this is done through configuring a tuning
parameter C. In the real world, there can be a dataset with an overlapping class or a group of
nonlinear observations where even the maximal margin classifier fails to come up with a separating
hyperplane. In other words, a linear decision boundary will not work classifying the observation
correctly so in that situation, a method called a support vector machine can help mitigate the issue.
Fig. 10 a [23] shows a group of non-linearly separable observations and fig. 10 b [23] shows the
linear boundary created by the support vector classifier that has poorly classified the observations.
Support vector machine is an extended version of support vector classifier which uses the concept
of kernel methods. Using the kernel tricks, the algorithm can transform a lower dimension data
into a higher dimension so that a decision boundary can be created to classify the non-separable
observations easily. Polynomial, RBF, and Sigmoid kernels are some of the most common kernel
methods that are used in the support vector machine algorithm to classify linear and nonlinear
overlapping observations. In our project, we used rbf as a kernel method parameter which helped
to classify our churn and non-churn employees with higher accuracy. Fig. 11 [21] shows how a
two-dimensional data space is transformed into three dimensions and how a hyperplane is used to separate the classes in the transformed space.
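The sketch below shows, under illustrative assumptions, how an rbf-kernel SVM classifier can be fitted with scikit-learn on non-linearly separable toy data; it is not the project's actual training code.

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # toy, non-linearly separable data standing in for the employee features
    X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

    clf = SVC(kernel='rbf', C=1.0, gamma='scale')  # rbf kernel, as used in this project
    clf.fit(X_tr, y_tr)
    print('test accuracy:', clf.score(X_te, y_te))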
VI. IMPLEMENTATION
As per the implementation of this project, we chose Python as our primary programming
language and Visual Studio Code as the integrated development environment (IDE). All the
Python scripts are coded on Python Version 3.9. numpy, pandas, array, collections, seaborn,
matplotlib, sklearn, imblearn, textwrap, time and random, etc. are a few of the libraries that are
being referenced in the project. The implementation of the entire project is divided into 5 different phases:
a. Exploratory Data Analysis
b. Feature Selection
c. Model Building
d. Hyperparameter Tuning
e. Performance Evaluation
Exploratory Data Analysis (EDA) started with the data collection process. For this
project, we extracted an IBM HR analytics dataset of 4.3 MB (approx.) that was made publicly
available on https://kaggle.com. The dataset contained the records of 23,436 employees who had
either left the company or were still being employed. There were 37 different attributes to hold the
various information of the employees. Fig. 12 shows the list of all the attributes present in the
dataset and Fig. 13 shows the count of all the records and the various data types. Performing a
quick data quality check, we identified that the dataset contained some dirty data such as data
mismatch and null values. Fig. 14 and Fig. 15 show the count of dirty values and null values in the
dataset. Since data is the key element in this project and knowing that the dataset is not 100% clean
Fig. 12 List of columns with its data types and non-null counts
so, we performed data cleansing to obtain valid, consistent, and accurate data. Based on the type
of data that the attributes were holding, various data cleansing methods were implemented. For
the categorical data, we applied a random sampling method where, for all the columns with dirty
data, a randomly picked item from the clean subset was used to update the corresponding dirty
data. Whereas for the continuous data, a median value was computed for a column and then the
computed value was used to update the corresponding dirty data. Unlike the mean, the median is
robust and less impacted by the outliers; hence, we decided to use the median value. For the non-
relevant attributes such as Employee Number, Application ID, and Employee Count, if they hold
any dirty data, we just deleted those corresponding records from the dataset. Once the data
cleansing task was completed, we validated the data by checking the counts of Null values and the
total records in the cleaned dataset. Fig 16. shows that there are 0 Null values and Fig 17. shows
the total number of rows and columns in the cleaned dataset. Exploring the cleaned dataset, we
observed 3 different data types as Nominal, Ordinal and Numerical data. Attributes such as
Employee Source, Education Field and Department, etc. have no inherited order, so these
attributes are categorized as nominal data; Whereas, Percent Salary Hike, Job Level, and Job
Satisfaction, etc. have some inherent order and they are considered as ordinal data. Performing
statistical analysis on nominal and ordinal data is not meaningful; hence, we did
not carry out any statistical analysis for those attributes. In contrast, attributes such as Age, Monthly
Income, and Distance from Home, etc. are numerical data, so we performed some simple
statistical analysis such as min, max, median, range, and standard deviation. Fig. 18 shows the statistical summary of these numerical attributes.
Fig. 16 Validating count of null values
Fig. 17 Count of records after data cleansing
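A minimal pandas sketch of the cleansing rules described above (median imputation for continuous columns, random sampling from the clean subset for categorical columns); the column names and values are hypothetical, not the full IBM HR schema.

    import numpy as np
    import pandas as pd

    # tiny illustrative frame with placeholder columns
    df = pd.DataFrame({
        'MonthlyIncome': [5000, np.nan, 7200, 4100, np.nan],
        'Department':    ['Sales', None, 'R&D', 'Sales', 'HR'],
    })

    # continuous column: fill missing values with the outlier-robust median
    df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].median())

    # categorical column: fill each missing value with a random draw from the clean subset
    clean = df['Department'].dropna().to_numpy()
    mask = df['Department'].isna()
    df.loc[mask, 'Department'] = np.random.choice(clean, size=mask.sum())
    print(df)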
To understand the data in-depth and get more insights, we used matplotlib and seaborn
libraries from Python to visualize the data. The dataset contained a column named Attrition which
stored binary values 0 and 1; 0 represented the employees who had left the company and 1 indicated
the current employees. To add clarity, we did a label mapping where 0 was mapped to Attrition
and 1 to Non-Attrition respectively. Fig. 19 shows the percentage distribution of the attrition and non-attrition employees.
The dataset did contain some attributes such as Work-life balance, Job level, and Stock
option level, etc. which were categorical but still held numerical values. So, to add better clarity,
we mapped those numerical values with meaningful labels. Fig. 20-28 show the cascaded bar charts of those categorical attributes with respect to Attrition.
To graphically represent the continuous attributes, we applied binning. Based on the distribution
of the data and their corresponding ranges, the continuous data variables were mapped into bins of
appropriate intervals. Fig. 29-34 shows the cascaded bar chart of various continuous attributes with
respect to Attrition. Furthermore, to observe the variations in the distribution of continuous data
variables, we plotted a distribution plot. In a distribution plot, the data points were represented by
a combination of a histogram and a line. Fig. 35 shows the histogram distribution plots for the continuous attributes.
In addition, we also plotted box plots for the continuous attributes to check if there exist any
outliers in our dataset. Fig. 36 shows box plots for some sample continuous attributes. In some box
plots, it is observed that some data points are plotted beyond the maximum whisker boundary and
they are considered as an outlier. However, after logical analysis, we considered those observations
as valid data points and did not filter them out as an outlier.
In the dataset, except the Attrition column, all the other columns are considered as the
features or independent variables. Since the Attrition column is dependent upon the independent
variables, it is called the dependent variable. Since the Attrition column contained binary values 0
and 1, and these values did not provide any specific meaning, we used a label encoder method in
Python to map those values to meaningful labels. Hence, 0 was mapped to attrition and 1 to
non-attrition labels respectively. The independent variables or features are used by the machine
learning models to make the predictions of the dependent variable. Since the machine learning
models understand numbers better than words and may not perform well with the categorical
attributes, we used a dummy variable from pandas to map those categorical data into numerical.
With a dummy variable technique, a categorical column with a k number of distinct values will
result in a k-1 number of additional columns. Since this technique increases the number of
columns, a column that has a large number of distinct values can result in very high dimensionality.
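The dummy-variable encoding can be sketched with pandas as shown below; with drop_first=True a column holding k distinct values expands into k-1 indicator columns. The column name is illustrative.

    import pandas as pd

    df = pd.DataFrame({'Department': ['Sales', 'R&D', 'HR', 'Sales']})

    # 3 distinct values -> 2 indicator columns when drop_first=True
    encoded = pd.get_dummies(df, columns=['Department'], drop_first=True)
    print(encoded)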
b. Feature Selection
The dataset contains 1 dependent variable and 36 independent variables or features and
not all the features are equally important and relevant. Training the machine learning models
without filtering the irrelevant features costs an overhead of computational resources, time and
also results in the poor performance of the models. The higher the number of features or
dimensions, the more resources are required for the computation, and the longer the models will
take time for training and making predictions. This condition is well known as the curse of
dimensionality. Feature selection is a technique that helps to pick only the important and relevant
features which heavily contribute to determining or predicting the dependent variables. In this
project, we have implemented a series of procedures to pick the most important and relevant features.
We started the feature selection process by plotting two different heatmaps. One for the
continuous and the other for the categorical variables. Each cell of the heatmap indicates the
correlation values between the attributes listed on the x and y-axis. A correlation threshold value
of 0.05 was set. In other words, any independent variable that holds the correlation values of
greater than or equal to 0.05 with respect to the dependent variable is considered as highly
correlated, and that particular independent variable is chosen for further validation. At the same
time, if the correlation values between two independent variables are greater than or equal to 0.05,
then in that case, only one independent variable is chosen for the next step. Fig. 37 and 38 show
the heatmap of the Attrition attribute with categorical and continuous variables.
The χ2 test was the next task that was performed as a part of the feature selection process.
This task helped to determine the correlation between two features, and based on the level of
correlation, a score was generated. The higher the score value, the more the features are correlated.
In conjunction with χ2, we implemented the SelectKBest algorithm to select the top K best features
based on the score obtained from the χ2 test. Among all the features, Monthly Income recorded the
highest score of 471777 approx. and was marked as the most correlated attribute with respect to
Attrition. Environment Satisfaction, Daily Rate, Monthly Rate, and Age were the other features
that followed the list subsequently. Fig. 39 shows the list of features and their corresponding
feature scores. We also used an ExtraTree Classifier from the sklearn package to
calculate the importance score of all the features. The higher the score value, the more relevant the
features are. According to the ExtraTreeClassifier, Age recorded the highest feature importance
score of 0.058531 standing as the most relevant feature. Along with Age, Daily Rate, Distance
From Home, and Hourly Rate followed the list subsequently. Fig. 40 shows the list of all the features with their corresponding importance scores.
The last technique used in the feature selection was Information Gain. In this technique,
we used a mutual info classifier to compute the correlation values between the features and the
dependent variable. High correlation values indicated that the features are more relevant with the
dependent variable and have a better contribution to the model predicting the target class label.
The mutual info classifier recorded Daily Rate as the most correlated feature with the highest score
of 0.276273. Fig. 41 shows the list of features with their corresponding correlation values.
We experimented with heat map, χ2 Test, ExtraTree Classifier, and Information Gain to
select the most relevant features for our models. Since each method has its own mechanism to
determine the relevant features, we decided to analyze and compare the experimental results of all
the techniques and prepare a final list of features. The selected features shown in Fig. 42 are simply
listed without any rank or order because we are aware that in the training phase, the model uses its
own internal mechanism to rank those features and will not rely on the user-computed rank.
c. Model Building
After our study and research, we decided to use the Logistic Regression and SVM model to
predict employee churn. We started building the model by following 3 steps: training the model,
predicting the outcome using the trained model, and finally evaluating the performance of the
model. Before training the model, we plotted a scatter plot as shown in Fig. 43 to check the
proportion of the class label. Observing the scatter plot, we analyzed that there existed a 1:5 uneven
data distribution between attrition and non-attrition employees. In other words, our dataset is
imbalanced where the population of attrition employees is significantly less compared to non-
attrition employees. Even though we were aware of the imbalance dataset problem, still for our
learning purposes, we decided to build the model using the imbalanced dataset. To avoid the
overfitting situation, the entire dataset was split into train and test sets with a split ratio of
70:30; 70% of the entire data was considered as the training dataset and the remaining 30% as the
testing dataset. Fig 44 shows the bar chart of attrition and non-attrition employee counts for the train and test datasets.
Fig. 44 Distribution of attrition and non-attrition employee in the train and test dataset
Using the training and testing dataset, Logistic Regression and SVM models were trained
and tested respectively. To evaluate the performance of the models, we plotted a confusion matrix
and generated classification reports. Fig. 45-46 shows the confusion matrix and classification
report of Logistic Regression and SVM model respectively. In the confusion matrix, class 0 is
considered as positive and 1 as negative. Since the proportion of 1 is significantly higher than 0 in
the dataset, 1 is considered a majority and 0 as a minority class. Both the models recorded
significantly low true positive compared to the true negative. With this observation, we understood
that both the models seem biased towards the majority class and did not perform well on predicting
the minority class even though both the models recorded an accuracy of 84%. However, other
metrics such as precision, recall, and f1-score for the minority class were recorded as being
significantly low compared to that of the majority class. Since the imbalanced dataset directly
impacted the performance of both the Logistic Regression and SVM models, we decided to work on balancing the dataset.
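The confusion matrix and classification report used throughout this section can be produced as in the minimal sketch below; the imbalanced toy data only mimics the roughly 1:5 class ratio of the HR dataset.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.model_selection import train_test_split

    # imbalanced toy data mimicking the attrition / non-attrition ratio
    X, y = make_classification(n_samples=1000, weights=[0.2, 0.8], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    y_pred = model.predict(X_te)

    print(confusion_matrix(y_te, y_pred))        # rows = actual, columns = predicted
    print(classification_report(y_te, y_pred))   # precision, recall, f1 per class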
To fix the imbalanced dataset issue, we experimented with 4 different techniques. The first
technique is known as Stratified K-fold cross-validation. In this technique, first, we defined the
value of K, and based on that value the entire dataset will get split into that number of folds. The
algorithm gets executed for K iterations and on each iteration, one-fold will be picked as a testing
dataset and the remaining K-1 folds will be used as the training dataset. The models then get trained
and tested with the dataset. At the end of all the iterations, an average accuracy, precision, recall,
and f1-score are computed and considered as the final performance metrics of the model. The
beauty of Stratified K-fold cross-validation is that it tries to make sure that there is a uniform
distribution of the majority and minority classes across the training and testing datasets. We applied
this technique to both the Logistic Regression and SVM algorithms but did not observe any improvements in the performance of the
models. Fig. 47 shows the performance metrics of both Logistic Regression and SVM using Stratified K-Fold cross-validation.
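Stratified K-Fold cross-validation as described above can be sketched as follows; the fold count, scoring metric, and toy data are illustrative, not the exact project configuration.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=1000, weights=[0.2, 0.8], random_state=1)

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # class ratio preserved per fold
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf, scoring='f1')
    print('mean f1 across folds:', np.mean(scores))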
The second technique that we implemented is called Under Sampling. This technique uses
the NearMiss method which accepts an argument called sampling strategy. Based on the sampling
strategy value, the size of the majority class will be scaled down to match approximately the size
of the minority class. Scaling down the dataset results in loss of data which is one of the major
drawbacks of this technique. Fig. 48 shows the count of the majority and minority classes before
and after Under Sampling. With Under Sampling, the total number of records was reduced from
23419 to 9007. The Logistic Regression and SVM recorded a count of 530 and 277 as True Positive
which were higher, but the counts of True Negative were recorded as 1354 and 1449, which were
lower in comparison to the results of the unbalanced dataset. In addition, the overall accuracy
recorded for the Logistic Regression was 0.7 and that of SVM was 0.64 which was still not
considered acceptable. However, there was a significant improvement in the precision, recall, and
F1-score for the minority class which shows that both the classes were fairly treated by the models.
Fig. 49 and Fig. 50 show the confusion matrix and classification report for Logistic Regression and SVM using Under Sampling.
Fig. 48 Total records and class label counts of before and after Under Sampling
Fig. 49 Confusion matrix and Classification report of Logistic Regression using Under Sampling
Fig. 50 Confusion matrix and Classification report of SVM using Under Sampling
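Under Sampling with imblearn's NearMiss can be sketched as below; the sampling_strategy value and the toy class ratio are illustrative placeholders.

    from collections import Counter
    from imblearn.under_sampling import NearMiss
    from sklearn.datasets import make_classification

    # toy data with roughly a 1:5 minority-to-majority ratio
    X, y = make_classification(n_samples=2000, weights=[0.16, 0.84], random_state=1)

    nm = NearMiss(sampling_strategy=1.0)   # shrink the majority class toward the minority size
    X_res, y_res = nm.fit_resample(X, y)
    print('before:', Counter(y), 'after:', Counter(y_res))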
The third technique that we implemented is called Over Sampling. This technique uses the
RandomOverSampler method which accepts an argument value for sampling strategy. Based on
the argument value, the RandomOverSampler method will upscale the size of the minority class
by adding more replicas of its own class, thus approximately matching with the size of the majority
class. Since the method adds the exact replicas of the same class, this can lead to the situation of
overfitting which is the major drawback of this technique. Fig. 51 shows the count of the majority
and minority classes before and after oversampling. With the Over Sampling technique, the total
number of records increased from 23419 to 31536. The Logistic Regression model recorded 1386
as True Positive, 5173 as True Negative, 740 as False Positive, and 2162 as False Negative, which
were higher compared to the results of Under Sampling. However, the SVM model performed
very poorly with oversampled data recording 0 as True and False Positive, indicating that the
model did not perform well on predicting with the minority class. Fig. 52-53 shows the confusion
matrix and classification report for Logistic Regression and SVM with an oversampled dataset.
Fig. 51 Total records and counts of the class label before and after Over Sampling
Fig. 52 Confusion matrix and Classification report of Logistic Regression with Over Sampling
Fig. 53 Confusion matrix and Classification report of SVM with Over Sampling
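A comparable sketch for this step, again using imbalanced-learn, is shown below; the sampling_strategy ratio is illustrative rather than the value used in this project.

# Sketch: oversampling the minority class with RandomOverSampler (imbalanced-learn).
# Exact replicas of minority rows are added, which is why overfitting is a risk.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

print("Before:", Counter(y_train))
oversampler = RandomOverSampler(sampling_strategy=0.6, random_state=42)  # assumed ratio
X_over, y_over = oversampler.fit_resample(X_train, y_train)
print("After: ", Counter(y_over))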
The fourth technique that we implemented is called SMOTETOMEK, which is a
combination of the Under and Over Sampling techniques and overcomes the issues of data loss and
overfitting. The SMOTETOMEK method accepts an argument value for sampling strategy, and based
on that argument value, the method will either upscale the minority class or downscale the majority
class. This technique randomly picks a data point and, using the K-Nearest Neighbor algorithm,
adds more data points in its proximity. As shown in Fig. 54, the SMOTETOMEK method
increased the total number of attrition employee records from 3709 to 19707 and reduced the
non-attrition records from 19710 to 19707, thus making the number of attrition and non-attrition
records equal. With SMOTETOMEK, True Positive counts were recorded as 3707 for the Logistic
Regression and 3780 for the SVM model, whereas, in the case of True Negative, the Logistic
Regression recorded 4127, which was considerably higher than the 2892 of SVM. For False Positive,
the Logistic Regression recorded 1786, which was considerably lower than the 3021 of SVM. A False
Negative count of 2205 was recorded for the Logistic Regression model and 2132 for the SVM
respectively. The Logistic Regression model recorded an accuracy of 0.66 compared to 0.56 for SVM.
From all the above observations, we concluded that with the SMOTETOMEK technique, the Logistic
Regression model performed better than the SVM model. Figs. 55 and 56 show the confusion matrices
and classification reports of the Logistic Regression and SVM models. Comparing all the data
balancing techniques, SMOTETOMEK turned out to be the most effective, and hence we decided to
use this method to balance our dataset.
Fig. 55 Confusion matrix and Classification report of Logistic Regression with SMOTETOMEK
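For completeness, a minimal sketch of this balancing step with imbalanced-learn's SMOTETomek class (the library spelling of SMOTETOMEK) is given below; the default sampling_strategy="auto" is assumed.

# Sketch: combined resampling with SMOTETomek (imbalanced-learn).
# SMOTE synthesizes new minority samples from nearest neighbors, and Tomek links
# then remove ambiguous borderline pairs, leaving the two classes roughly equal.
from collections import Counter
from imblearn.combine import SMOTETomek

print("Before:", Counter(y_train))
resampler = SMOTETomek(sampling_strategy="auto", random_state=42)
X_bal, y_bal = resampler.fit_resample(X_train, y_train)
print("After: ", Counter(y_bal))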
d. Hyperparameter Tuning
So far, the Logistic Regression and SVM models were trained with the default parameters,
and we analyzed the results obtained from both models. However, in machine learning, most
algorithms provide the flexibility of tuning certain parameters called hyperparameters, which can
optimize the performance and make the models more robust and accurate. Each model has its
own set of hyperparameters to be configured, and tuning these parameters can significantly change
the behavior of the algorithm, so care must be exercised when adjusting the hyperparameters. In
the Logistic Regression model, there are 3 standard parameters, C, penalty, and solver, that are
available for hyperparameter tuning. The C parameter is a floating-point number that controls the
penalty strength: the lower the value of C, the lower the penalty for misclassification, and vice versa.
The next parameter is penalty, also called regularization, which controls the model's overfitting by
reducing the variance. The penalty parameter accepts 4 different values: none, l1, l2, and elasticnet.
The last parameter is the solver, which specifies the algorithm that will be used to optimize the
performance of the model; it accepts 5 different values: lbfgs, liblinear, newton-cg, sag, and saga.
After referencing the sklearn documentation for the Logistic Regression model, we understood the
compatibility between the penalty and solver parameters; that is, not all values of the penalty
parameter are compatible with every solver. So, after reviewing the compatibility chart and
researching the set of parameters that are commonly used, we decided to use l1 and l2 as the penalty
values and lbfgs, liblinear, and newton-cg as the solver values.
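A sketch of the resulting Logistic Regression search space is shown below; the candidate C values are assumptions since the exact values tried are not listed in the text, and the grid is split into two dictionaries because, among the chosen solvers, only liblinear supports the l1 penalty.

# Sketch: candidate hyperparameters for Logistic Regression (C values are assumed).
# The grid is split so that every penalty/solver pair is compatible.
logreg_param_grid = [
    {"penalty": ["l1"], "solver": ["liblinear"], "C": [0.01, 0.1, 1, 10, 100]},
    {"penalty": ["l2"], "solver": ["liblinear", "lbfgs", "newton-cg"], "C": [0.01, 0.1, 1, 10, 100]},
]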
The SVM model has 3 standard parameters C, kernel, and gamma available for the
hyperparameter tuning. A kernel is a function that is very helpful for creating the decision surface
when the data points are not linearly separable. Using the kernel trick, the kernel function computes
the decision boundaries in terms of similarity measures in a high dimensional space without
actually transforming the data points into higher dimensions. The kernel parameter accepts linear,
Radial Basis Function (rbf), sigmoid, and poly values. Experimenting with the poly and sigmoid kernels,
we experienced our system getting extremely slow and unresponsive for hours, so we decided to
choose rbf as our kernel parameter. Gamma is the next parameter that controls the distance of
influence of a single training point. A lower value of gamma increases the similarity radius,
meaning data points that are far apart from each other are also considered similar and get classified
as the same class. In contrast, a higher value of gamma reduces the similarity radius, so the data
points need to be close to each other to get classified under the same class. Higher values of gamma
can cause overfitting, whereas lower values create an overly generalized decision boundary and thus
cause high misclassification. Balancing the lower and higher gamma values and reviewing commonly
used settings, we decided to go with the following gamma values: 0.00001, 0.0001, 0.001, 0.01, 0.1,
and 1.
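The corresponding SVM search space can be sketched as follows; the gamma values mirror the ones listed above, while the candidate C values are assumptions.

# Sketch: candidate hyperparameters for the SVM (RBF kernel only, as discussed above).
svm_param_grid = {
    "kernel": ["rbf"],
    "C": [0.1, 1, 10, 100],                      # assumed candidate values
    "gamma": [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
}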
During hyperparameter tuning, trying out every possible set of parameters and checking the
results manually is a tedious and inefficient process. To avoid this, we used the GridSearchCV
method, which evaluates every combination of the supplied parameters and returns the best set of
parameters for the model. Fig. 57 shows the results of GridSearchCV for the Logistic Regression.
The best parameters suggested by GridSearchCV for the Logistic Regression were C = 0.01,
penalty = l2, and solver = liblinear, and for SVM they were C = 100, kernel = rbf, and
gamma = 0.00001. Both models were again trained and tested with their corresponding
hyperparameters, and the results are shown in Figs. 58 and 59. For the Logistic Regression model,
the count of False Negatives was reduced to 2093, but beyond that, no other significant
improvements were recorded. However, for SVM, the hyperparameters drastically changed the
overall performance of the model. The SVM model recorded the highest counts of 5818 True
Positives and 5197 True Negatives and the lowest counts of 716 False Positives and 95 False
Negatives respectively. In addition, the precision, recall, and f1-score for both the majority and
minority classes were high, and the accuracy of the model was boosted to 0.93. The rbf function
chosen as the kernel parameter created the most appropriate decision surface to separate the two
classes distinctly, whereas the Logistic Regression model created a more generic decision boundary,
resulting in higher misclassification errors compared to the SVM. Along with that, the suggested
optimum value of gamma = 0.00001 contributed to grouping the
two classes correctly. We believe these could be the potential reasons why the SVM model
performed better than the Logistic Regression with hyperparameter tuning.
Fig. 58 Confusion matrix and Classification report of Logistic Regression with hyperparameters
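The sketch below illustrates how such a search could be wired together with scikit-learn's GridSearchCV, reusing the parameter grids and the SMOTETOMEK-balanced data sketched earlier; the cv=5 folds and accuracy scoring are assumptions rather than the exact settings of this project.

# Sketch: exhaustive hyperparameter search with GridSearchCV.
# logreg_param_grid, svm_param_grid, X_bal, and y_bal come from the earlier sketches.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

searches = {
    "Logistic Regression": GridSearchCV(LogisticRegression(max_iter=1000),
                                        logreg_param_grid, cv=5, scoring="accuracy"),
    "SVM": GridSearchCV(SVC(), svm_param_grid, cv=5, scoring="accuracy"),
}
for name, search in searches.items():
    search.fit(X_bal, y_bal)   # every parameter combination is tried with cross-validation
    print(name, search.best_params_, round(search.best_score_, 3))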
e. Performance Evaluation
We evaluated the performance of the Logistic Regression and SVM models based on
the following metrics:
1. True Positive Rate (Recall/Sensitivity)
2. True Negative Rate (Specificity)
3. False Positive Rate (Type I Error)
4. False Negative Rate (Type II Error)
5. Accuracy
1. True Positive Rate (Recall/Sensitivity):
True Positive Rate specifies the proportion of actual positive classes that got correctly
classified by the model. True Positive Rate is also known as recall or sensitivity and is calculated
by the formula True Positive Rate = TP / (TP + FN).
The Logistic Regression model recorded a True Positive Rate of 0.65 and the SVM a True Positive
Rate of 0.98 respectively. In other words, SVM classified the positive class correctly 98% of the time.
2. True Negative Rate (Specificity):
True Negative Rate is defined as the proportion of actual negative classes that got
correctly classified by the model. It is also known as Specificity and is calculated by the formula
True Negative Rate = TN / (TN + FP).
A True Negative Rate of 0.70 was observed for the Logistic Regression model and 0.88 for SVM
respectively. In other words, SVM classified the negative class correctly 88% of the time.
3. False Positive Rate (Type I Error):
False Positive Rate is defined as the proportion of actual negative class entries that were
incorrectly classified as being in the positive class. It is also known as Type I error and can be
calculated by the formula False Positive Rate = FP / (FP + TN).
A False Positive Rate of 0.12 was recorded for SVM and 0.30 for the Logistic Regression
model respectively. In other words, SVM misclassified 12% of the negative class as positive,
whereas the Logistic Regression model misclassified 30%.
4. False Negative Rate (Type II Error):
False Negative Rate is defined as the proportion of actual positive class entries that were
incorrectly classified as being in the negative class. It is also known as Type II error and can be
calculated by the formula False Negative Rate = FN / (FN + TP).
A False Negative Rate of 0.01 was recorded for the SVM model and 0.35 for the Logistic
Regression respectively. In other words, SVM correctly classified the positive class with an
accuracy of 99%, and only 1% were misclassified as the negative class, whereas for the Logistic
Regression model, only 65% of the positive class were classified correctly and the remaining 35%
were misclassified as the negative class.
5. Accuracy:
Accuracy of the model is defined as the proportion of correct predictions to the total number of
predictions made and is calculated by the formula Accuracy = (TP + TN) / (TP + TN + FP + FN).
The Logistic Regression model recorded an accuracy of 0.67 whereas SVM recorded 0.93. It shows
that SVM classified both the positive and negative classes correctly with an accuracy of 93%
compared to 67% for the Logistic Regression model.
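All five metrics above can be derived directly from the confusion matrix; the snippet below is a small sketch of that calculation, where y_test and y_pred are assumed to be the true and predicted labels of the test split.

# Sketch: computing the evaluation metrics from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tpr = tp / (tp + fn)      # True Positive Rate (recall / sensitivity)
tnr = tn / (tn + fp)      # True Negative Rate (specificity)
fpr = fp / (fp + tn)      # False Positive Rate (Type I error)
fnr = fn / (fn + tp)      # False Negative Rate (Type II error)
acc = (tp + tn) / (tp + tn + fp + fn)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} FPR={fpr:.2f} FNR={fnr:.2f} Accuracy={acc:.2f}")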
Based on the prediction probabilities, the True Positive Rate and False Positive Rate for
the different threshold values were calculated. Using those calculated values, a curve is plotted
with the True Positive Rate on the y-axis and the False Positive Rate on the x-axis; this curve is
known as the ROC curve. The ROC curve is a tool that helps measure how well the model
performed at classifying the observations. Fig. 60 shows the ROC curves for the Logistic
Regression and SVM models. The blue dotted diagonal line seen in Fig. 60 is considered the
reference line: any curve that falls above it indicates a good model and any curve below it a poor
one. The closer the curve climbs toward the top-left corner of the plot, the more accurate the
model. The curve for the SVM model climbs steeply toward the top left of the plot, whereas the
curve for the Logistic Regression is above the reference line but still below the SVM curve.
Therefore, from this observation, we can say that the SVM model performed better at predicting
the classes than the Logistic Regression model.
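A minimal sketch of how such ROC curves and the corresponding AUC scores (discussed next) can be produced is given below; it assumes fitted models named logreg and svm together with a test split X_test / y_test, which are illustrative names rather than the project's actual variables.

# Sketch: ROC curves and AUC scores for both fitted models.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

model_scores = [
    ("Logistic Regression", logreg.predict_proba(X_test)[:, 1]),  # class-1 probabilities
    ("SVM", svm.decision_function(X_test)),                       # signed distance to the margin
]
for name, scores in model_scores:
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, scores):.2f})")
plt.plot([0, 1], [0, 1], "b--", label="Reference line")  # random-classifier diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()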
Area Under Curve (AUC) is the summary of the ROC curve. It is a score that ranges between
0 and 1. A model with an AUC score of 0.5 classifies observations no better than random guessing
and is considered a poor classifier, whereas an AUC score of 1 represents an ideal classifier that
classifies all the observations perfectly with no misclassification error. In general, a model with an
AUC score between 0.5 and 1 is acceptable. In our experiment, SVM recorded a high AUC score of
0.96 compared to 0.73 for Logistic Regression. Figs. 61 and 62 show bar charts with all the
performance metrics for the Logistic Regression and SVM models. As shown in Fig. 63, we also
plotted a bar chart comparing the training time of each model and observed that SVM took
approximately 413 seconds to train, whereas the Logistic Regression model trained in only 0.28
seconds. We generally want a model with higher accuracy, True Positive Rate, and True Negative
Rate and lower False Negative Rate and False Positive Rate, so, comparing all the performance
metrics for both models on the given dataset, we concluded that SVM performed much better at
predicting employee churn.
Using the balanced dataset obtained from SMOTETOMEK, the Logistic Regression model
achieved an accuracy of 0.66 with 3707 as True Positive, 4127 as True Negative, 1786 as False
Positive, and 2205 as False Negative whereas the SVM model recorded an accuracy of 0.56 with
3780 as True Positive, 2892 as True Negative, 3021 as False Positive and 2132 as False Negative
respectively. With the hyperparameter tuning, the Logistic Regression model achieved an accuracy
of 0.67 with counts of 3820 True Positive, 4118 True Negative, 1795 False Positive, and 2093
False Negative whereas the SVM model recorded an accuracy of 0.93 with counts of 5818 True
Positive, 5197 True Negative, 716 False Positive and 95 False Negative respectively. The Logistic
Regression model also recorded Sensitivity of 0.65, Specificity of 0.70, Type I Error of 0.30, Type
II Error of 0.35, and AUC score of 0.73 whereas the SVM model recorded Sensitivity of 0.98,
Specificity of 0.88, Type I Error of 0.12, Type II Error of 0.01 and AUC score of 0.96 respectively.
From all these observations, we concluded that with the models' default configurations the Logistic
Regression model predicted employee churn better than the SVM, whereas with hyperparameter
tuning, the SVM performed better than the Logistic Regression model.
In the future, we would like to extend this project by hosting our trained SVM model on
a cloud platform and allowing end users to access the model via a web interface. We would also
like to explore neural networks for training on our dataset and observe their predictions. Last but not least,
we would like to explore another HR dataset that has the records of employees who had resigned
before and during COVID-19, and using that dataset, discover the correlation and patterns of
employee churn.