
CSC 701 Introduction to Machine Learning
Learning with Regression and Trees
Course objectives and outcomes
Objectives:
1. To introduce the basic concepts and techniques of Machine Learning.
2. To acquire an in-depth understanding of various supervised and unsupervised algorithms.
3. To be able to apply various ensemble techniques for combining ML models.
4. To demonstrate dimensionality reduction techniques.
Outcomes:
1. To acquire fundamental knowledge of developing machine learning models.
2. To select, apply and evaluate an appropriate machine learning model for the given problem.
3. To demonstrate ensemble techniques to combine predictions from different models.
4. To demonstrate the dimensionality reduction techniques.
Syllabus
● Learning with Regression: Linear Regression,
Multivariate Linear Regression, Logistic Regression.
● Performance Measures : Model evaluation and
selection, Training, Testing and Validation Tests,
Confusion Matrix & Basic Evaluation Metrics,
Precision-recall.
Measuring error

The error, or "residual", is the difference between an observation and the model's prediction:
residual = observation − prediction
(Figure: observed points scattered around the fitted regression line; the vertical gaps between observation and prediction are the residuals.)
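As a small illustration (a minimal sketch with made-up data points, not taken from the slides), the residuals of a fitted line can be computed with NumPy:

    import numpy as np

    # Hypothetical observations (x, y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

    # Fit a straight line y = w*x + b by least squares
    w, b = np.polyfit(x, y, deg=1)

    predictions = w * x + b
    residuals = y - predictions              # error = observation - prediction
    print(residuals)
    print("Sum of squared residuals:", np.sum(residuals ** 2))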
Multivariate Linear Regression
Logistic Regression
•It’s a classification algorithm, that is used where the response variable is
categorical. The idea of Logistic Regression is to find a relationship between
features and probability of particular outcome.
E.g. When we have to predict if a student passes or fails in an exam when the
number of hours spent studying is given as a feature, the response variable has
two values, pass and fail. This type of a problem is referred to as Binomial Logistic
Regression, where the response variable has two values 0 and 1 or pass and fail or
true and false.
Multinomial Logistic Regression deals with situations where the response variable
can have three or more possible values. Do body weight, calorie intake, fat intake,
and age have an influence on the probability of having a heart attack (yes vs. no)?
See derivations in notes
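A minimal sketch of binomial logistic regression for the pass/fail example, using scikit-learn; the study-hours data below is hypothetical:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical data: hours studied vs. pass (1) / fail (0)
    hours = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [4.0], [5.0], [6.0]])
    passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    model = LogisticRegression()
    model.fit(hours, passed)

    # Predicted probability of passing after 2.5 hours of study
    print(model.predict_proba([[2.5]])[0, 1])
    print(model.predict([[2.5]]))            # predicted class: 0 = fail, 1 = pass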
Performance measures
What is a confusion matrix?
It is a matrix of size 2×2 for binary classification, with actual values on one axis and predicted values on the other.
EXAMPLE
A machine learning model is trained to predict tumors in patients. The test dataset consists of 100 people.

● A false positive (flagging a tumor in a healthy patient) is a false alarm, which is bad.
● A false negative (missing an actual tumor) is the worst outcome.
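A minimal sketch of building such a 2×2 matrix with scikit-learn; the label vectors below are hypothetical, not the actual 100-patient test set:

    from sklearn.metrics import confusion_matrix

    # Hypothetical ground truth and predictions (1 = tumor, 0 = no tumor)
    y_true = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]

    # For 0/1 labels the matrix is laid out as [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("TN =", tn, "FP =", fp, "FN =", fn, "TP =", tp)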
Accuracy
It’s the ratio of the correctly labeled subjects to the whole pool of subjects. Accuracy is the most intuitive
one.
Accuracy answers the following question: How many persons did we correctly label out
of all the persons?

Accuracy = (TP+TN)/(TP+FP+FN+TN)

numerator: all correctly labeled subjects (all trues)


denominator: all subjects
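For instance, with hypothetical counts for the 100-person tumor example (the numbers below are assumptions, not taken from the slides):

    # Hypothetical confusion-matrix counts for the 100-person example
    TP, TN, FP, FN = 40, 45, 10, 5

    accuracy = (TP + TN) / (TP + FP + FN + TN)
    print(accuracy)      # (40 + 45) / 100 = 0.85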
Precision
Both precision and recall are crucial for information retrieval, where the positive class matters the most compared to the negative class.

Precision: out of all the predicted positives, what percentage is truly positive?

The precision value lies between 0 and 1.

Precision is the ratio of subjects correctly labeled +ve by our program to all subjects labeled +ve:

Precision = TP/(TP+FP)

EXAMPLE 1: Precision answers the following: How many of those whom we labeled as diabetic are actually diabetic?
numerator: correctly +ve labeled diabetic people.
denominator: all people labeled +ve by our program (whether they're diabetic or not in reality).
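A minimal sketch, reusing the hypothetical counts above and scikit-learn's precision_score:

    from sklearn.metrics import precision_score

    # From the hypothetical counts
    TP, FP = 40, 10
    print(TP / (TP + FP))                    # 0.8

    # Or directly from (hypothetical) label vectors
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0]
    print(precision_score(y_true, y_pred))   # TP = 2, FP = 1 -> 0.667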
Recall (aka Sensitivity)

It is the same as the TPR (true positive rate):

Recall = TP/(TP+FN)

EXAMPLE 2: Credit card fraud detection

We do not want to miss any fraudulent transaction. Therefore, we want false negatives to be as low as possible. In these situations we can compromise on low precision, but recall should be high. Similarly, in medical applications we don't want to miss any sick patient, so we focus on having a high recall.

Recall answers the following question: Of all the people who are diabetic, how many did we correctly predict?

numerator: correctly +ve labeled diabetic people.
denominator: all people who are diabetic in reality (whether detected by our program or not).
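A minimal sketch for recall, with the same hypothetical numbers as above:

    from sklearn.metrics import recall_score

    # From the hypothetical counts
    TP, FN = 40, 5
    print(TP / (TP + FN))                    # 0.889

    # Or directly from (hypothetical) label vectors
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0]
    print(recall_score(y_true, y_pred))      # TP = 2, FN = 1 -> 0.667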
EXAMPLE 3 — Spam detection

In the detection of spam mail, it is okay if some spam mail remains undetected (a false negative), but what if we miss a critical mail because it is classified as spam (a false positive)? In this situation, false positives should be as low as possible. Here, precision is more vital than recall.
F1 Score
When comparing different models, it can be difficult to decide which is better (high precision and low recall, or vice versa). Therefore, there should be a metric that combines both. One such metric is the F1 score.
It is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

It takes both false positives and false negatives into account, and therefore performs well on an imbalanced dataset. The F1 score is highest when there is some sort of balance between precision (p) and recall (r); conversely, the F1 score is not high if one measure is improved at the expense of the other.
For example, if P is 1 and R is 0, the F1 score is 0.
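A minimal sketch of the harmonic mean, using the hypothetical precision and recall values from the sketches above:

    from sklearn.metrics import f1_score

    p, r = 0.8, 0.889                        # hypothetical precision and recall
    print(2 * p * r / (p + r))               # ~0.842

    # f1_score computes the same combination from (hypothetical) label vectors
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0]
    print(f1_score(y_true, y_pred))          # p = r = 2/3 here, so F1 = 2/3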
Specificity

Specificity is the ratio of subjects correctly labeled -ve by the program to all who are healthy in reality.

Specificity answers the following question: Of all the people who are healthy, how
many of those did we correctly predict?

Specificity = TN/(TN+FP)

numerator: -ve labeled healthy people.


denominator: all people who are healthy in reality (whether +ve or -ve labeled)
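scikit-learn has no dedicated specificity function, so one common approach (sketched here with hypothetical labels) is to read TN and FP off the confusion matrix:

    from sklearn.metrics import confusion_matrix

    # Hypothetical labels: 1 = diabetic, 0 = healthy
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tn / (tn + fp))                    # specificity = 4 / (4 + 1) = 0.8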
Attributes, training variables; numeric class, target variable, response
(Slide figure: a small example table. The input columns are the attributes, or training variables; the output column is the class / target variable / response, e.g. class 0 = Not buy, class 1 = Will buy. An attribute-value pair is, for example, Country = Canada.)
Which performance measure?
Accuracy is a great measure, but only when you have symmetric datasets (false negative and false positive counts are close) and false negatives and false positives have similar costs.
If the costs of false positives and false negatives are different, then F1 is your savior. F1 is best if you have an uneven class distribution.
Precision is how sure you are of your true positives, whilst recall is how sure you are that you are not missing any positives.
Choose Recall if false positives are far more acceptable than false negatives; in other words, if the occurrence of false negatives is unacceptable/intolerable, so that you'd rather get some extra false positives (false alarms) than let some positives slip through undetected, as in our diabetes example.
You'd rather have some healthy people labeled diabetic than leave a diabetic person labeled healthy.

Choose Precision if you want to be more confident of your true positives. For example, spam emails: you'd rather have some spam emails in your inbox than some regular emails in your spam box. So the email company wants to be extra sure that email Y is spam before they put it in the spam box, where you never get to see it.
Choose Specificity if you want to cover all true negatives, meaning you don't want any false alarms or false positives. For example, if you're running a drug test in which all people who test positive will immediately go to jail, you don't want anyone drug-free going to jail. False positives here are intolerable.
● An accuracy value of 90% means that 1 of every 10 labels is incorrect, and 9 are correct.
● A precision value of 80% means that, on average, 2 of every 10 people labeled diabetic by our program are healthy, and 8 are diabetic.
● A recall value of 70% means that 3 of every 10 people who are diabetic in reality are missed by our program, and 7 are labeled as diabetic.
● A specificity value of 60% means that 4 of every 10 people who are healthy in reality are mislabeled as diabetic, and 6 are correctly labeled as healthy.
Often, we choose Model Accuracy to evaluate the model. It’s a
popular choice because it is very easy to understand and explain.
Accuracy coincides well with the general aim of building a
classification model, i.e. to predict the class of new
observations accurately.
Accuracy might not be the best model evaluation metric every time.
It can convey the health of a model well only when all the classes
have similar prevalence in the data.
Say we were predicting whether an asteroid will hit the Earth.
If our model says NO every time, it will be highly accurate but it
would not be of much value to us. The number of asteroids that will
hit the earth is very low but missing even one of them might prove
very costly. When the classes’ distribution is imbalanced, accuracy is
not a good model evaluation metric.
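A minimal sketch of this effect with made-up, highly imbalanced labels: a model that always predicts NO scores very high accuracy but zero recall.

    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    # Hypothetical, highly imbalanced data: 1 "hit" among 1000 asteroids
    y_true = np.zeros(1000, dtype=int)
    y_true[0] = 1

    # A useless model that always predicts NO (0)
    y_pred = np.zeros(1000, dtype=int)

    print(accuracy_score(y_true, y_pred))    # 0.999 -- looks excellent
    print(recall_score(y_true, y_pred))      # 0.0   -- misses the one event that matters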
Optimal Probability Threshold — ROC Curve

EXAMPLE 4: Say we were building an email classification model to detect suspicious communication between terrorists over email.
In that case a terrorist email is Class YES and a non-terrorist email is Class NO.
We choose Sensitivity as a metric to improve this model.

Why?
Because it is absolutely necessary for the model to identify the terrorists’ emails
correctly. For the model to be considered useful, its true positive rate should be
high. In this pursuit, we might end up having a few false positives/false alarms
but that might be a compromise that we’ll have to make.

Let us say we end up with 2 really good models which have the same
Sensitivity score. Does that mean the two models have equal
predictive power? NO.
Most classification algorithms predict the probability that an observation belongs to class YES.
We need to decide a threshold for these probabilities, to classify the observations into one of
the two classes. Observations with a probability higher than the threshold are classified as
class YES.
Say we get the probability of an email being a terrorist email as 0.75. If we have set the threshold of our system at 0.8, then we will classify this email as a non-terrorist email. If we have set the threshold at 0.7, we will classify the email as a terrorist email. The performance of our system would vary as we change this threshold.
This threshold can be adjusted to tune the behavior of the model for a specific problem. An
example would be to reduce more of one or another type of error (FP/FN).
Two models with the same sensitivity (TPR) are not necessarily equivalent. Among the two, the model
with the lower FPR is clearly the better, more reliable model. We do not want to waste any
of our investigative resources on non-terrorist emails that were misclassified as
terrorist emails.
The threshold that we set can help us increase or decrease the TPR. If we choose a low threshold, more emails will be classified as terrorist emails; we will be able to catch more true positives, but the false positive rate will also increase. The choice of threshold entails a trade-off between false positives and false negatives.
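A minimal sketch of applying a custom probability threshold instead of the default 0.5, using scikit-learn and hypothetical data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical training data (1 = class YES)
    X = np.array([[0.1], [0.4], [0.5], [0.6], [0.9], [1.2], [1.5], [2.0]])
    y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

    model = LogisticRegression().fit(X, y)
    proba_yes = model.predict_proba(X)[:, 1]   # probability of class YES

    # The default predict() is equivalent to a 0.5 threshold; here we raise it to 0.7
    threshold = 0.7
    y_pred = (proba_yes >= threshold).astype(int)
    print(y_pred)
    # Lowering the threshold flags more observations as YES: TPR rises, but so does FPR.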
An ROC curve is a useful resource in this regard.
The Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate/Sensitivity (y-axis) versus the False Positive Rate/(1 − Specificity) (x-axis) for candidate threshold values between 0.0 and 1.0.
(Figure: ROC curve. The grey points on the orange curve mark the corresponding candidate thresholds; the ROC curve is plotted over all possible thresholds.)
1. In the above curve, if you wanted a model with a very low false positive rate, you might pick 0.8 as your threshold of choice. If you favour a low FPR but don't want a bad TPR, you might go for 0.5, the point where the curve starts turning hard to the right. If you prefer a low false negative rate/high Sensitivity (because you don't want to miss potential terrorists, for example), then you might decide that somewhere between 0.2 and 0.1 is the region where you start getting severely diminishing returns for improving the Sensitivity any further.
2. Notice the graph at thresholds 0.5 and 0.4. The Sensitivity at both thresholds is ~0.6, but the FPR is higher at threshold 0.4. It is clear that if we are happy with Sensitivity = 0.6, we should choose threshold = 0.5.
The ROC curve is great for choosing a threshold. Its shape contains a lot of information:
a) Smaller values on the x-axis of the plot indicate lower false positives and higher true
negatives.
b) Larger values on the y-axis of the plot indicate higher true positives and lower false
negatives.
c) A model that has high y values at low x values is a good model.

The ROC curve is a very useful tool for a few additional reasons:
a) The curves of different models can be compared directly in general or for different
thresholds.
b) The area under the curve (AUC) can be used as a summary of the model skill.
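A minimal sketch of plotting an ROC curve and computing the AUC with scikit-learn, on hypothetical labels and scores:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    # Hypothetical true labels and predicted probabilities of class YES
    y_true  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    y_score = [0.1, 0.3, 0.35, 0.8, 0.45, 0.9, 0.2, 0.6, 0.7, 0.55]

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("AUC:", roc_auc_score(y_true, y_score))

    plt.plot(fpr, tpr, marker="o")           # one point per candidate threshold
    plt.xlabel("False Positive Rate (1 - Specificity)")
    plt.ylabel("True Positive Rate (Sensitivity)")
    plt.title("ROC curve")
    plt.show()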
