Lecture 18 - 2024
• P(c|x) = the posterior probability of the class c (the target) given the predictor variable x
• P(x|c) = the likelihood, i.e., the probability of the predictor x given class c
• P(c) = the prior probability of the class
• P(x) = the prior probability of the predictor
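For reference, these quantities combine via Bayes' theorem (a standard statement, consistent with the definitions above); the "naive" version further factorizes the likelihood across features, which is exactly what the worked example later in these notes uses:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}, \qquad P(c \mid x_1,\dots,x_n) \propto P(c)\prod_{i=1}^{n} P(x_i \mid c)$$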
For example, a fruit may be considered an apple if it is red, round, and about
3 inches in diameter.
Even if these features depend on each other or on the existence of the other features, all of these properties are treated as contributing independently to the probability that this fruit is an apple, and that is why it is known as 'Naive'.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
• It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
• It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
• If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the "Zero Frequency" problem (see the smoothing sketch after this list).
• On the other hand, Naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
• Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
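As a rough illustration of the zero-frequency point and of treating predict_proba with caution, here is a minimal scikit-learn sketch; the tiny word-presence data set and the alpha values are invented for illustration. BernoulliNB's alpha parameter applies Laplace smoothing, so a feature value never seen with a class no longer forces a zero probability:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Tiny made-up word-presence data set (rows = documents, columns = words);
# the numbers are invented purely for illustration.
X_train = np.array([[1, 1, 0],
                    [1, 1, 1],
                    [0, 0, 0],
                    [1, 0, 0],
                    [0, 0, 1]])
y_train = np.array([1, 1, 1, 0, 0])      # 1 = spam, 0 = not spam
x_test = np.array([[0, 1, 0]])           # word 2 present, never seen in class 0

# Laplace smoothing (alpha > 0) keeps unseen feature/class combinations from
# forcing a zero probability; alpha near 0 reproduces the zero-frequency issue.
for alpha in (1e-10, 1.0):
    clf = BernoulliNB(alpha=alpha).fit(X_train, y_train)
    # predict_proba outputs should be treated as rough scores, not calibrated
    # probabilities (Naive Bayes is known to be a poor estimator).
    print(alpha, clf.predict(x_test), clf.predict_proba(x_test))
```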
‘secret’ (x1)   ‘prince’ (x2)   ‘password’ (x3)   Spam (c)
      0               1                0             ??

For simplicity, label this input as x = 010. We want P(Yes | x = 010) = ? and P(No | x = 010) = ?

$$P(\text{Yes}) \prod_{i=1}^{n} P(x_i \mid \text{Yes}) = \frac{3}{5} \cdot \frac{1}{3} \cdot \frac{2}{3} \cdot \frac{2}{3} = 0.089$$

$$P(\text{No}) \prod_{i=1}^{n} P(x_i \mid \text{No}) = \frac{2}{5} \cdot \frac{1}{2} \cdot 0 \cdot \frac{1}{2} = 0$$

Can convert the result into a probability as well; the results will always sum up to 1:

$$P(\text{Yes} \mid \boldsymbol{x} = 010) = \frac{0.089}{0.089 + 0} = 1, \qquad P(\text{No} \mid \boldsymbol{x} = 010) = \frac{0}{0.089 + 0} = 0$$
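A minimal sketch reproducing the arithmetic above directly from the conditional probabilities shown (the training table itself is not reproduced here; exactly which of the 'No' factors is zero is an assumption consistent with the zero-frequency discussion):

```python
# Priors and per-feature conditional probabilities for x = 010,
# taken from the worked example above.
p_yes, p_no = 3/5, 2/5
lik_yes = (1/3) * (2/3) * (2/3)      # P(x1=0|Yes) * P(x2=1|Yes) * P(x3=0|Yes)
lik_no  = (1/2) * 0 * (1/2)          # one factor is zero: assumed P(x2=1|No) = 0

score_yes = p_yes * lik_yes          # ~0.089
score_no  = p_no * lik_no            # 0

total = score_yes + score_no         # normalize so the posteriors sum to 1
print(round(score_yes, 3), score_no)
print(score_yes / total, score_no / total)   # 1.0 and 0.0
```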
The distance between the hyperplane and the closest class point is called
the margin.
The optimal hyperplane is the one with the largest margin, i.e., it maximizes the distance between the hyperplane and the closest data points of both classes.
Gamma - The gamma parameter defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'. In other words, with a low gamma, points far away from a plausible separation line are considered in the calculation of the separation line, whereas a high gamma means only the points close to the plausible line are considered.
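A hedged scikit-learn sketch of this effect (the data set and the gamma values are chosen only for illustration): with a small gamma the RBF decision boundary is smooth and shaped by far-away points, while a large gamma fits tightly around nearby points.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Small synthetic two-class data set, chosen purely for illustration.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for gamma in (0.01, 1.0, 100.0):        # low gamma = far-reaching influence
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    # Very large gamma typically drives training accuracy up (overfitting)
    # while relying on many support vectors close to the boundary.
    print(gamma, clf.score(X, y), clf.n_support_)
```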
It is completely separable
• No points violate the hyperplane
• Relaxing this assumption (allowing some violations) leads to the concept of a “soft margin”
It is linearly separable
• It can be separated using a hyperplane
• Relaxing this assumption leads to the trick of using kernels (see the sketch after this list)
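A rough sketch of the kernel point (data set and parameters invented for illustration): on concentric circles, which no single hyperplane can separate in the original 2-D space, a linear SVC stays near chance level while an RBF kernel separates the classes almost perfectly.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original feature space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel="linear").fit(X, y)
rbf_clf = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear_clf.score(X, y))   # roughly chance level
print("rbf kernel accuracy:   ", rbf_clf.score(X, y))      # close to 1.0
```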
Minimizes a loss that has a "hinge loss" term in which only vectors that are on the wrong side of the hyperplane or within the margin contribute, through slack variables $\xi_i$, to the loss:

$$L = \|\boldsymbol{w}\|_2^2 + C \sum_i \xi_i$$

The first term, $\|\boldsymbol{w}\|_2^2$, is inversely proportional to the margin width; the second term is the penalty for misclassifications. L tries to find a balance between maximizing the margin and avoiding misclassifications.

[Figure: points violating the margin of the decision function $z = b + w_1 x_1 + \cdots + w_n x_n$, each marked with a slack variable $\xi_0, \xi_1, \xi_2, \xi_3$.]

This leads to a regularization parameter, C > 0, that controls how much the model is penalized for misclassifications:
• C → 0: low penalty for misclassifications, wider margin
• C → ∞: high penalty for misclassifications, → hard margin
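A small numpy sketch of this objective (the weights, toy data, and C values are invented for illustration); only points with signed margin below 1, i.e., misclassified or inside the margin, contribute a nonzero slack term:

```python
import numpy as np

def soft_margin_loss(w, b, X, y, C):
    """L = ||w||^2 + C * sum(xi_i), with xi_i = max(0, 1 - y_i * (w . x_i + b))."""
    margins = y * (X @ w + b)                  # signed margins, labels y in {-1, +1}
    slack = np.maximum(0.0, 1.0 - margins)     # hinge: only violations contribute
    return float(np.dot(w, w) + C * slack.sum())

# Invented toy problem, just to show how C weights margin width vs. violations.
X = np.array([[2.0, 1.0], [0.3, -0.2], [-1.0, -1.0], [-0.1, 0.4]])
y = np.array([1, 1, -1, -1])
w, b = np.array([0.8, 0.2]), 0.0

for C in (0.01, 1.0, 100.0):
    print(C, soft_margin_loss(w, b, X, y, C))
```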
Example confusion matrix (γ = 0.033):
  Predicted Positive:   43    2
  Predicted Negative:    0   40
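A minimal sketch of how such a table is produced with scikit-learn (the label arrays are invented; note that scikit-learn puts the true classes on the rows):

```python
from sklearn.metrics import confusion_matrix

# Invented true and predicted labels, just to show the call.
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1]

# Rows = true classes, columns = predicted classes (scikit-learn convention).
print(confusion_matrix(y_true, y_pred))
```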
Majority vote to get the class
Works well in cases where the number of input features is greater than the number of training data points