Logistic Regression
Logistic regression is a type of regression analysis used for predicting the outcome of a categorical dependent
variable (a dependent variable that can take on a limited number of categories) based on one or more predictor
variables (continuous, ordinal or categorical).
In Binary Logistic Regression the outcome or dependent variable is binary or dichotomous, i.e. 0 or 1
Examples:
a customer will churn (1) or not (0)
a customer will respond to a campaign (1) or not (0)
should we grant a loan to a particular person (1) or not (0)
Logistic regression measures the relationship between a categorical dependent variable and one or more
independent variables by converting the dependent variable into a probability score (P)
The probability score signifies the probability of the event happening, for example the probability that a customer
will churn or respond to a campaign
How is Logistic Regression different from Linear Regression?
In Linear Regression, the outcome variable is continuous and the predictor variables can be a mix of numeric and
categorical. But often there are situations where we wish to evaluate the effects of multiple explanatory variables on
a binary outcome variable
For example, the effects of a number of factors on the development or otherwise of a disease. A patient may be
cured or not; a prospect may respond or not; a loan may be granted to a particular person or not; etc.
When the outcome or dependent variable is binary, and we wish to measure the effects of several independent
variables on it, we use Logistic Regression
The probability for each observation does not vary linearly with the predictors; it follows a sigmoid (S-shaped)
curve, so the predicted values stay between 0 and 1 and tend to lie close to those two extremes.
The binary outcome variable can be coded as 0 or 1.
The logistic curve (sigmoid function) is shown in the figure below:
Concept of Sigmoid Function in Logistic Regression
The sigmoid function is a bounded function: its values always lie between 0 and 1
p = 1 / (1 + e^-(a + bx))
If b is +ve, the curve rises from 0 towards 1 as x increases
If b is -ve, the shape of the curve is reversed
If we use linear regression, the predicted value can become greater than one or less than zero.
Basically, Y is a random variable having a 0 or 1 outcome, which is a Bernoulli random variable.
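As a rough illustration of why the sigmoid is used, the R snippet below compares linear predictions, which can fall outside the 0 to 1 range, with sigmoid predictions, which cannot. The coefficient values a and b here are made up purely for the example:

a <- -1; b <- 0.8                         # illustrative coefficients (assumed, not from any data)
x <- seq(-10, 10, by = 0.5)
linear  <- a + b * x                      # linear predictions: can go below 0 or above 1
sigmoid <- 1 / (1 + exp(-(a + b * x)))    # sigmoid predictions: always between 0 and 1
range(linear)
range(sigmoid)
plot(x, sigmoid, type = "l", ylab = "p")  # S-shaped curve; a negative b reverses the shape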
Log of odds:
ln(p / (1 - p)) = a + bx
This is also called the logit function
The estimation of the parameters is done using Maximum Likelihood Estimation (for the non-linear
distribution), unlike Linear Regression, where the method of Ordinary Least Squares is used.
Odds Ratio
Odds is calculated as P(Y = 1)/P(Y = 0)
Odds > 1 if Y = 1 is more likely
Odds < 1 if Y = 0 is more likely
The log of odds is called the logit, and its right-hand side looks like a linear regression equation
The bigger the logit, the bigger P(Y = 1)
Quick Question 1
Suppose the coefficients of a logistic regression model with two
independent variables are as follows:
β0 = -1.5, β1 = 3, β2 = -0.5
And we have an observation with the following values of independent
variables:
x1 = 1, x2 = 5
What is the value of the Logit for this observation? Recall that the Logit is
log(Odds)
What is the value of the Odds for this observation? Note that you can
compute e^x, for some number x, in your R console by typing exp(x). The
function exp() computes the exponential of its argument
What is the value of P(y = 1) for this observation?
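You can check the arithmetic in the R console using the coefficients and observation values given above:

b0 <- -1.5; b1 <- 3; b2 <- -0.5      # coefficients from the question
x1 <- 1; x2 <- 5                     # observation values
logit <- b0 + b1 * x1 + b2 * x2      # Logit = log(Odds)
odds  <- exp(logit)                  # Odds = e^Logit
p     <- odds / (1 + odds)           # P(y = 1) = Odds / (1 + Odds)
c(logit = logit, odds = odds, p = p)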
Applications of logistic regression in business
Response Model: response to an e-mail campaign
Conversion Model: subscriber conversion after a campaign
Attrition Model: churning of subscribers
Application Risk Model: finding credit card defaulters by demography and behavior
Behavioral Risk Model: finding the probability of loan defaults
Cross Sell / Up Sell Model: finding parameters that boost cross sell and up sell
These are just a few of them
Logistic Process
THE FRAMINGHAM HEART STUDY
Evaluating Risk Factors to Save Lives
Misconceptions in the first half of the 20th century about blood pressure:
High blood pressure, dubbed hypertension, was considered important for forcing blood through the arteries, and
lowering blood pressure was considered harmful
In the late 1940s, the US government set out to better understand
cardiovascular disease
The plan was to track a large cohort of initially healthy patients over
their lifetimes
The city of Framingham, Massachusetts, was chosen as the site for the study
Appropriate Size
Stable population
5209 patients aged 30 – 59 enrolled
Patients were given a questionnaire and an exam every 2 years:
Physical characteristics
Behavioral characteristics
THE FRAMINGHAM HEART STUDY Contd..
We use an anonymized version of the original data that was collected
Includes several demographic risk factors:
the sex of the patient, male or female;
the age of the patient in years;
the education level coded as either 1 for some high school, 2 for a
high school diploma or GED, 3 for some college or vocational
school, and 4 for a college degree.
Includes behavioral risk factors:
Does the patient smoke (yes/no)
Medical history - blood pressure medication, previously had a
stroke, hypertensive or not, diabetic or not
Includes risk factors from physical examination:
Cholesterol level
Systolic/diastolic blood pressure
Body mass index
Heart rate
Blood glucose level
THE FRAMINGHAM HEART STUDY Contd..
Building the model
Split the data randomly into training and testing sets
Use logistic regression to predict whether or not a patient experienced
Coronary Heart Disease (CHD) within 10 years of the first examination
After building the model, we will evaluate the predictive power of the
model on the test set, as sketched below
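A minimal sketch of these steps in R. The file name framingham.csv, the outcome column TenYearCHD and the 65/35 split are assumptions for illustration; the exact names depend on the version of the data you are given:

framingham <- read.csv("framingham.csv")   # assumed file name
set.seed(88)                               # make the random split reproducible
train_rows <- sample(nrow(framingham), size = floor(0.65 * nrow(framingham)))
train <- framingham[train_rows, ]
test  <- framingham[-train_rows, ]

# family = binomial tells glm() to fit a logistic regression model
chd_model <- glm(TenYearCHD ~ ., data = train, family = binomial)
summary(chd_model)

# Predicted probabilities of CHD within 10 years, for the test set
pred_test <- predict(chd_model, newdata = test, type = "response")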
Threshold Value
The outcome of a logistic regression model is a probability
Often, we want to make a binary prediction – whether this person will
suffer from CHD or not
We can do this using a threshold value t
If P(CHD = 1) ≥ t, predict CHD
If P(CHD = 1) < t, predict healthy
What value should we pick?
Often selected based on which type of error is more costly
Confusion Matrix
Compare actual outcomes to predicted outcomes using a confusion
matrix (classification matrix)
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
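Continuing the sketch above (pred_test and test$TenYearCHD are the assumed names from the earlier snippet), the confusion matrix for a threshold of, say, t = 0.5 can be built with table():

t <- 0.5
conf <- table(actual = test$TenYearCHD, predicted = pred_test >= t)
conf
# Provided both predicted classes occur, the measures above are:
sensitivity <- conf["1", "TRUE"]  / sum(conf["1", ])   # TP / (TP + FN)
specificity <- conf["0", "FALSE"] / sum(conf["0", ])   # TN / (TN + FP)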
Receiver Operating Characteristic
The Receiver Operating Characteristic (ROC) curve is a graph of the True Positive Rate (Sensitivity)
against the False Positive Rate (1 - Specificity)
Accuracy is measured by the area under the ROC curve: the greater the area under the curve, the better
the model. An area of 1 represents a perfect test.
Each point on the ROC curve corresponds to a cutoff probability. These cutoff points represent the
tradeoff between the sensitivity and specificity probabilities
Ideally the goal should be to have high probabilities for both Sensitivity and Specificity
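One common way to draw this curve in R is with the ROCR package (my choice here; pROC is another option). The names pred_test and test$TenYearCHD are again the assumptions from the earlier sketch:

library(ROCR)                                       # install.packages("ROCR") if needed
roc_pred <- prediction(pred_test, test$TenYearCHD)  # predicted probabilities vs. actual labels
roc_perf <- performance(roc_pred, "tpr", "fpr")     # true positive rate vs. false positive rate

# Plot the ROC curve, labelling a few candidate thresholds along it
plot(roc_perf, colorize = TRUE, print.cutoffs.at = seq(0, 1, by = 0.1))

# Area under the curve
as.numeric(performance(roc_pred, "auc")@y.values)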
Selecting a Threshold using ROC
Captures all thresholds simultaneously
High threshold means High specificity and Low sensitivity
Low Threshold means Low specificity and High sensitivity
Choose the threshold that gives the best trade-off between:
the cost of failing to detect positives
the cost of raising false alarms
Compute Outcome Measures
Overall accuracy = (TN + TP)/N
Overall error rate = (FP + FN)/N
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
False negative error rate = FN/(TP + FN)
False positive error rate = FP/(TN + FP)
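All of these measures can be computed directly from a 2 x 2 confusion matrix. A small helper sketch (the function name and the layout, actual outcomes in rows and predicted outcomes in columns, are my assumptions):

outcome_measures <- function(conf) {
  # conf: 2 x 2 table, actual 0/1 in rows, predicted FALSE/TRUE in columns
  TN <- conf[1, 1]; FP <- conf[1, 2]
  FN <- conf[2, 1]; TP <- conf[2, 2]
  N  <- sum(conf)
  c(accuracy    = (TN + TP) / N,
    error_rate  = (FP + FN) / N,
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP),
    fn_rate     = FN / (TP + FN),
    fp_rate     = FP / (TN + FP))
}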
Quick Question 2
Using the confusion matrix below (actual outcome in rows, predicted outcome in columns), answer the
following questions.
            predicted FALSE   predicted TRUE
actual 0    1069              6
actual 1    187               11
What is the sensitivity of our logistic regression model on the test set,
using a threshold of 0.5?
What is the specificity of our logistic regression model on the test set,
using a threshold of 0.5?
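To check your answers in the R console, reading the cells of the matrix as TN = 1069, FP = 6, FN = 187, TP = 11:

TN <- 1069; FP <- 6; FN <- 187; TP <- 11      # cells of the confusion matrix above
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
c(sensitivity = sensitivity, specificity = specificity)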
Thank You