
CSC 701 Introduction to Machine Learning
Learning with Regression and Trees
Course objectives and outcomes
Objectives:
1. To introduce the basic concepts and techniques of Machine Learning.
2. To acquire an in-depth understanding of various supervised and unsupervised algorithms.
3. To be able to apply various ensemble techniques for combining ML models.
4. To demonstrate dimensionality reduction techniques.
Outcomes:
1. To acquire fundamental knowledge of developing machine learning models.
2. To select, apply and evaluate an appropriate machine learning model for the given problem.
3. To demonstrate ensemble techniques to combine predictions from different models.
4. To demonstrate the dimensionality reduction techniques.
Syllabus
● Learning with Regression: Linear Regression,
Multivariate Linear Regression, Logistic Regression.
● Performance Measures : Model evaluation and
selection, Training, Testing and Validation Tests,
Confusion Matrix & Basic Evaluation Metrics,
Precision-recall.
Measuring error

The error, or "residual", is the difference between an observation and the model's prediction:
residual = observation − prediction
(Figure: observed points scattered around the fitted regression line; the vertical gaps between observation and prediction are the residuals.)
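As a small illustration (a minimal sketch with made-up data points, not taken from the slides), the residuals of a fitted line can be computed with NumPy:

    import numpy as np

    # Hypothetical observations (x, y)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

    # Fit a straight line y = w*x + b by least squares
    w, b = np.polyfit(x, y, deg=1)

    predictions = w * x + b
    residuals = y - predictions              # error = observation - prediction
    print(residuals)
    print("Sum of squared residuals:", np.sum(residuals ** 2))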
Multivariate Linear Regression
Logistic Regression
•It’s a classification algorithm, that is used where the response variable is
categorical. The idea of Logistic Regression is to find a relationship between
features and probability of particular outcome.
E.g. When we have to predict if a student passes or fails in an exam when the
number of hours spent studying is given as a feature, the response variable has
two values, pass and fail. This type of a problem is referred to as Binomial Logistic
Regression, where the response variable has two values 0 and 1 or pass and fail or
true and false.
Multinomial Logistic Regression deals with situations where the response variable
can have three or more possible values. Do body weight, calorie intake, fat intake,
and age have an influence on the probability of having a heart attack (yes vs. no)?
See derivations in notes
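A minimal sketch of binomial logistic regression for the pass/fail example, using scikit-learn; the study-hours data below is hypothetical:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical data: hours studied vs. pass (1) / fail (0)
    hours = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [4.0], [5.0], [6.0]])
    passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    model = LogisticRegression()
    model.fit(hours, passed)

    # Predicted probability of passing after 2.5 hours of study
    print(model.predict_proba([[2.5]])[0, 1])
    print(model.predict([[2.5]]))            # predicted class: 0 = fail, 1 = pass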
Performance measures
What is a confusion matrix?
It is a matrix of size 2×2 for binary classification, with actual values on one axis and predicted values on the other.
EXAMPLE
A machine learning model is trained to predict tumors in patients. The test dataset consists of 100 people.

● A false positive (flagging a tumor in a healthy patient) is a false alarm, which is bad.
● A false negative (missing an actual tumor) is the worst outcome.
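A minimal sketch of building such a 2×2 matrix with scikit-learn; the label vectors below are hypothetical, not the actual 100-patient test set:

    from sklearn.metrics import confusion_matrix

    # Hypothetical ground truth and predictions (1 = tumor, 0 = no tumor)
    y_true = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0]

    # For 0/1 labels the matrix is laid out as [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("TN =", tn, "FP =", fp, "FN =", fn, "TP =", tp)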
Accuracy
It’s the ratio of the correctly labeled subjects to the whole pool of subjects. Accuracy is the most intuitive
one.
Accuracy answers the following question: How many persons did we correctly label out
of all the persons?

Accuracy = (TP+TN)/(TP+FP+FN+TN)

numerator: all correctly labeled subjects (all trues)


denominator: all subjects
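For instance, with hypothetical counts for the 100-person tumor example (the numbers below are assumptions, not taken from the slides):

    # Hypothetical confusion-matrix counts for the 100-person example
    TP, TN, FP, FN = 40, 45, 10, 5

    accuracy = (TP + TN) / (TP + FP + FN + TN)
    print(accuracy)      # (40 + 45) / 100 = 0.85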
Precision
Both precision and recall are crucial for information retrieval, where the positive class matters the most compared to the negative class.

Precision: out of all the predicted positives, what percentage is truly positive?

The precision value lies between 0 and 1.

Precision is the ratio of subjects correctly labeled +ve by our program to all subjects labeled +ve:

Precision = TP/(TP+FP)

EXAMPLE 1: Precision answers the following: How many of those whom we labeled as diabetic are actually diabetic?
numerator: correctly +ve labeled diabetic people.
denominator: all people labeled +ve by our program (whether they're diabetic or not in reality).
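A minimal sketch, reusing the hypothetical counts above and scikit-learn's precision_score:

    from sklearn.metrics import precision_score

    # From the hypothetical counts
    TP, FP = 40, 10
    print(TP / (TP + FP))                    # 0.8

    # Or directly from (hypothetical) label vectors
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0]
    print(precision_score(y_true, y_pred))   # TP = 2, FP = 1 -> 0.667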
Recall (aka Sensitivity)

It is the same as the TPR (true positive rate):

Recall = TP/(TP+FN)

EXAMPLE 2: Credit card fraud detection

We do not want to miss any fraudulent transaction. Therefore, we want false negatives to be as low as possible. In these situations we can compromise on low precision, but recall should be high. Similarly, in medical applications we don't want to miss any sick patient, so we focus on having a high recall.

Recall answers the following question: Of all the people who are diabetic, how many did we correctly predict?

numerator: correctly +ve labeled diabetic people.
denominator: all people who are diabetic in reality (whether detected by our program or not).
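A minimal sketch for recall, with the same hypothetical numbers as above:

    from sklearn.metrics import recall_score

    # From the hypothetical counts
    TP, FN = 40, 5
    print(TP / (TP + FN))                    # 0.889

    # Or directly from (hypothetical) label vectors
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0]
    print(recall_score(y_true, y_pred))      # TP = 2, FN = 1 -> 0.667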
EXAMPLE 3 — Spam detection

In the detection of spam mail, it is okay if some spam mail remains undetected (a false negative), but what if we miss a critical mail because it is classified as spam (a false positive)? In this situation, false positives should be as low as possible. Here, precision is more vital than recall.
F1 Score
When comparing different models, it can be difficult to decide which is better (high precision and low recall, or vice versa). Therefore, there should be a metric that combines both. One such metric is the F1 score.
It is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

It takes both false positives and false negatives into account, and therefore performs well on an imbalanced dataset. The F1 score is highest when there is some sort of balance between precision (p) and recall (r); conversely, the F1 score is not high if one measure is improved at the expense of the other.
For example, if P is 1 and R is 0, the F1 score is 0.
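A minimal sketch of the harmonic mean, using the hypothetical precision and recall values from the sketches above:

    from sklearn.metrics import f1_score

    p, r = 0.8, 0.889                        # hypothetical precision and recall
    print(2 * p * r / (p + r))               # ~0.842

    # f1_score computes the same combination from (hypothetical) label vectors
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0]
    print(f1_score(y_true, y_pred))          # p = r = 2/3 here, so F1 = 2/3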
Specificity

Specificity is the ratio of subjects correctly labeled -ve by the program to all who are healthy in reality.

Specificity answers the following question: Of all the people who are healthy, how
many of those did we correctly predict?

Specificity = TN/(TN+FP)

numerator: -ve labeled healthy people.


denominator: all people who are healthy in reality (whether +ve or -ve labeled)
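scikit-learn has no dedicated specificity function, so one common approach (sketched here with hypothetical labels) is to read TN and FP off the confusion matrix:

    from sklearn.metrics import confusion_matrix

    # Hypothetical labels: 1 = diabetic, 0 = healthy
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tn / (tn + fp))                    # specificity = 4 / (4 + 1) = 0.8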
Attributes, training variables; numeric class, target variable, response
(Slide figure: a small example table. The input columns are the attributes, or training variables; the output column is the class / target variable / response, e.g. class 0 = Not buy, class 1 = Will buy. An attribute-value pair is, for example, Country = Canada.)
Which performance measure?
Accuracy is a great measure, but only when you have symmetric datasets (false negative and false positive counts are close) and false negatives and false positives have similar costs.
If the costs of false positives and false negatives are different, then F1 is your savior. F1 is best if you have an uneven class distribution.
Precision is how sure you are of your true positives, whilst recall is how sure you are that you are not missing any positives.
Choose Recall if false positives are far more acceptable than false negatives; in other words, if the occurrence of false negatives is unacceptable/intolerable, so that you'd rather get some extra false positives (false alarms) than let some positives slip through undetected, as in our diabetes example.
You'd rather have some healthy people labeled diabetic than leave a diabetic person labeled healthy.

Choose Precision if you want to be more confident of your true positives. For example, spam emails: you'd rather have some spam emails in your inbox than some regular emails in your spam box. So the email company wants to be extra sure that email Y is spam before they put it in the spam box, where you never get to see it.
Choose Specificity if you want to cover all true negatives, meaning you don't want any false alarms or false positives. For example, if you're running a drug test in which all people who test positive will immediately go to jail, you don't want anyone drug-free going to jail. False positives here are intolerable.
● An accuracy value of 90% means that 1 of every 10 labels is incorrect, and 9 are correct.
● A precision value of 80% means that, on average, 2 of every 10 people labeled diabetic by our program are healthy, and 8 are diabetic.
● A recall value of 70% means that 3 of every 10 people who are diabetic in reality are missed by our program, and 7 are labeled as diabetic.
● A specificity value of 60% means that 4 of every 10 people who are healthy in reality are mislabeled as diabetic, and 6 are correctly labeled as healthy.
Often, we choose Model Accuracy to evaluate the model. It’s a
popular choice because it is very easy to understand and explain.
Accuracy coincides well with the general aim of building a
classification model, i.e. to predict the class of new
observations accurately.
Accuracy might not be the best model evaluation metric every time.
It can convey the health of a model well only when all the classes
have similar prevalence in the data.
Say we were predicting whether an asteroid will hit the Earth.
If our model says NO every time, it will be highly accurate but it
would not be of much value to us. The number of asteroids that will
hit the earth is very low but missing even one of them might prove
very costly. When the classes’ distribution is imbalanced, accuracy is
not a good model evaluation metric.
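A minimal sketch of this effect with made-up, highly imbalanced labels: a model that always predicts NO scores very high accuracy but zero recall.

    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    # Hypothetical, highly imbalanced data: 1 "hit" among 1000 asteroids
    y_true = np.zeros(1000, dtype=int)
    y_true[0] = 1

    # A useless model that always predicts NO (0)
    y_pred = np.zeros(1000, dtype=int)

    print(accuracy_score(y_true, y_pred))    # 0.999 -- looks excellent
    print(recall_score(y_true, y_pred))      # 0.0   -- misses the one event that matters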
Optimal Probability Threshold — ROC Curve

EXAMPLE 4: Say we were building an email classification model to detect suspicious communication between terrorists over email.
In that case a terrorist email is Class YES and a non-terrorist email is Class NO.
We choose Sensitivity as a metric to improve this model.

Why?
Because it is absolutely necessary for the model to identify the terrorists’ emails
correctly. For the model to be considered useful, its true positive rate should be
high. In this pursuit, we might end up having a few false positives/false alarms
but that might be a compromise that we’ll have to make.

Let us say we end up with 2 really good models which have the same
Sensitivity score. Does that mean the two models have equal
predictive power? NO.
Most classification algorithms predict the probability that an observation belongs to class YES.
We need to decide a threshold for these probabilities, to classify the observations into one of
the two classes. Observations with a probability higher than the threshold are classified as
class YES.
Say we get the probability of an email being a terrorist email as 0.75. If we have set the threshold of our system at 0.8, then we will classify this email as a non-terrorist email. If we have set the threshold at 0.7, we will classify the email as a terrorist email. The performance of our system would vary as we change this threshold.
This threshold can be adjusted to tune the behavior of the model for a specific problem. An
example would be to reduce more of one or another type of error (FP/FN).
Two models with the same sensitivity (TPR) are not necessarily equivalent. Among the two, the model
with the lower FPR is clearly the better, more reliable model. We do not want to waste any
of our investigative resources on non-terrorist emails that were misclassified as
terrorist emails.
The threshold that we set can help us increase or decrease the TPR. If we choose a low threshold, more emails will be classified as terrorist emails; we will be able to catch more true positives, but the false positive rate will also increase. The choice of threshold entails a trade-off between false positives and false negatives.
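A minimal sketch of applying a custom probability threshold instead of the default 0.5, using scikit-learn and hypothetical data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical training data (1 = class YES)
    X = np.array([[0.1], [0.4], [0.5], [0.6], [0.9], [1.2], [1.5], [2.0]])
    y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

    model = LogisticRegression().fit(X, y)
    proba_yes = model.predict_proba(X)[:, 1]   # probability of class YES

    # The default predict() is equivalent to a 0.5 threshold; here we raise it to 0.7
    threshold = 0.7
    y_pred = (proba_yes >= threshold).astype(int)
    print(y_pred)
    # Lowering the threshold flags more observations as YES: TPR rises, but so does FPR.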
An ROC curve is a useful resource in this regard.
The Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate/Sensitivity (y-axis) versus the False Positive Rate/(1 − Specificity) (x-axis) for candidate threshold values between 0.0 and 1.0.
(Figure: ROC curve. The grey points on the orange curve mark the corresponding candidate thresholds; the ROC curve is plotted over all possible thresholds.)
1. In the above curve, if you wanted a model with a very low false positive rate, you might pick 0.8 as your threshold of choice. If you favour a low FPR but don't want a bad TPR, you might go for 0.5, the point where the curve starts turning hard to the right. If you prefer a low false negative rate/high Sensitivity (because you don't want to miss potential terrorists, for example), then you might decide that somewhere between 0.2 and 0.1 is the region where you start getting severely diminishing returns for improving the Sensitivity any further.
2. Notice the graph at thresholds 0.5 and 0.4. The Sensitivity at both thresholds is ~0.6, but the FPR is higher at threshold 0.4. It is clear that if we are happy with Sensitivity = 0.6, we should choose threshold = 0.5.
The ROC curve is great for choosing a threshold. Its shape contains a lot of information:
a) Smaller values on the x-axis of the plot indicate lower false positives and higher true
negatives.
b) Larger values on the y-axis of the plot indicate higher true positives and lower false
negatives.
c) A model that has high y values at low x values is a good model.

The ROC curve is a very useful tool for a few additional reasons:
a) The curves of different models can be compared directly in general or for different
thresholds.
b) The area under the curve (AUC) can be used as a summary of the model skill.
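A minimal sketch of plotting an ROC curve and computing the AUC with scikit-learn, on hypothetical labels and scores:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    # Hypothetical true labels and predicted probabilities of class YES
    y_true  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
    y_score = [0.1, 0.3, 0.35, 0.8, 0.45, 0.9, 0.2, 0.6, 0.7, 0.55]

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print("AUC:", roc_auc_score(y_true, y_score))

    plt.plot(fpr, tpr, marker="o")           # one point per candidate threshold
    plt.xlabel("False Positive Rate (1 - Specificity)")
    plt.ylabel("True Positive Rate (Sensitivity)")
    plt.title("ROC curve")
    plt.show()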
