
Data Analysis

ECE 710 Lecture 18 E. Lou 1


10 Most Common Machine Learning Algorithms
for Data Analysis
1. Linear Regression
2. Logistic Regression
3. Decision Trees
4. Random Forest Trees
5. Naïve Bayes
6. Support Vector Machine
7. K-Nearest Neighbors (KNN)
8. K-Means
9. Dimensionality Reduction
10. Artificial Neural Network

ECE 710 Lecture 18 E. Lou 2


Naïve Bayes

Naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features:

P(c|x) = P(x|c) · P(c) / P(x)

• P(c|x) = posterior probability of the class (c, target) given the predictor variable x
• P(x|c) = the likelihood, i.e. the probability of predictor x given class c
• P(c) = the prior probability of the class
• P(x) = the prior probability of the predictor (the evidence)

ECE 710 Lecture 18 E. Lou 3


Naïve Bayes

Involves calculating probabilities from the training data set to use in the final classification formula:

$$\hat{y} = \operatorname*{argmax}_{k \in \{1,\dots,K\}} P(c_k) \prod_{i=1}^{n} P(x_i \mid c_k)$$

where:
c = class variable
k = class index
x = input variable
i = variable index
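As a concrete illustration (not from the slides), here is a minimal pure-Python sketch of this decision rule; it assumes the priors P(c_k) and per-feature likelihood tables P(x_i | c_k) have already been estimated from the training data, and all names are hypothetical:

```python
import math

def naive_bayes_predict(x, priors, likelihoods):
    """Return the class that maximizes P(c_k) * prod_i P(x_i | c_k).

    x           : list of observed feature values
    priors      : dict {class: P(class)}
    likelihoods : dict {class: list of dicts, one per feature, {feature_value: P(value | class)}}
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            # Unseen feature values get probability 0 (the "zero frequency" issue discussed later)
            score *= likelihoods[c][i].get(value, 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score
```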

ECE 710 Lecture 18 E. Lou 4


Naïve Bayes

For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of the other features, all of these properties are treated as contributing independently to the probability that this fruit is an apple, and that is why the method is known as 'naive'.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods in some cases.

ECE 710 Lecture 18 E. Lou 5


Naïve Bayes - Example
Consider a training data set of weather conditions and the corresponding target variable 'Play' (indicating whether a game is played).

Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by finding probabilities such as P(Overcast) = 0.29 and P(Play = Yes) = 0.64.
Step 3: Use the naive Bayesian equation to calculate the posterior probability for each class.
ECE 710 Lecture 18 E. Lou 6
Naïve Bayes - Example

Question: Players will play if the weather is sunny. Is this statement correct?

Class: Play = Yes; Predictor: Weather = Sunny
Bayesian equation:
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than P(No | Sunny) = 1 − 0.60 = 0.40, so the statement is correct.
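A quick sketch, using only the counts quoted on this slide, that reproduces the calculation numerically:

```python
# Counts from the slide: 9 "Yes" and 5 "No" days out of 14,
# with Sunny on 3 of the 9 "Yes" days and on 5 of the 14 days overall.
p_sunny_given_yes = 3 / 9    # 0.33
p_yes = 9 / 14               # 0.64
p_sunny = 5 / 14             # 0.36

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6, so "play" is the more likely class given Sunny
```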

ECE 710 Lecture 18 E. Lou 7


Naïve Bayes - Pros

• It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models such as logistic regression, and you need less training data.
• It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

ECE 710 Lecture 18 E. Lou 8


Naïve Bayes - Cons

• If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the "zero frequency" problem.
• On the other hand, naive Bayes is also known to be a poor estimator, so the probability outputs from predict_proba should not be taken too seriously (see the short sklearn sketch below).
• Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
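To make the predict_proba remark concrete, here is a hedged scikit-learn sketch; the toy data and the integer encoding below are made up for illustration, and the alpha smoothing parameter is one standard way to avoid the zero-frequency problem:

```python
from sklearn.naive_bayes import CategoricalNB

# Hypothetical categorical data: each row is [weather, temperature] encoded as integers.
X = [[0, 1], [0, 0], [1, 1], [2, 0], [1, 0], [2, 1]]
y = [0, 0, 1, 1, 1, 0]  # 1 = play, 0 = don't play

clf = CategoricalNB(alpha=1.0)  # Laplace smoothing keeps unseen categories from getting probability 0
clf.fit(X, y)

# predict_proba returns numbers that look like calibrated probabilities, but (per the bullet
# above) they are often poor estimates -- useful for ranking classes, not as true probabilities.
print(clf.predict([[0, 1]]), clf.predict_proba([[0, 1]]))
```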

ECE 710 Lecture 18 E. Lou 9


Naïve Bayes – Multiple Predictors

ECE 710 Lecture 18 E. Lou 10


Naïve Bayes – Example
• Predict whether an e-mail is ‘spam’ or ‘not spam’
Training data:

  'secret' (x1)   'prince' (x2)   'password' (x3)   Spam (c)
        1               1                0             Yes
        0               0                1             No
        1               0                0             Yes
        0               1                1             Yes
        1               0                1             No

The class posterior P(c_k | x_1, x_2, ..., x_n) is proportional to P(c_k) times the product of the feature likelihoods, so

$$\hat{y} = \operatorname*{argmax}_{k \in \{1,\dots,K\}} P(c_k) \prod_{i=1}^{n} P(x_i \mid c_k)$$

New e-mail to classify:

  'secret' (x1)   'prince' (x2)   'password' (x3)   Spam (c)
        0               1                0             ??

For simplicity, label this input as x = 010. We want P(Yes | x = 010) and P(No | x = 010).

ECE 710 Lecture 18 E. Lou 11


Naïve Bayes – Example
(Training table and query x = 010 as on the previous slide.)

Prior for the 'Yes' class:

$$P(\text{Yes}) = \frac{N_{\text{Yes}}}{N_{\text{Yes}} + N_{\text{No}}} = \frac{3}{3 + 2} = \frac{3}{5}$$

Likelihoods for the 'Yes' class:

$$P(x_1 = 0 \mid \text{Yes}) = \frac{N_{x_1 = 0,\, c = \text{Yes}}}{N_{\text{Yes}}} = \frac{1}{3} \qquad P(x_2 = 1 \mid \text{Yes}) = \frac{N_{x_2 = 1,\, c = \text{Yes}}}{N_{\text{Yes}}} = \frac{2}{3} \qquad P(x_3 = 0 \mid \text{Yes}) = \;?$$

These plug into $\hat{y} = \operatorname*{argmax}_{k \in \{1,\dots,K\}} P(c_k) \prod_{i=1}^{n} P(x_i \mid c_k)$.

ECE 710 Lecture 18 E. Lou 12


Naïve Bayes – Example
(Training table and query x = 010 as on the previous slides.)

$$P(\text{Yes}) \prod_{i=1}^{n} P(x_i \mid \text{Yes}) = \frac{3}{5} \cdot \frac{1}{3} \cdot \frac{2}{3} \cdot \frac{2}{3} = 0.089$$

$$P(\text{No}) \prod_{i=1}^{n} P(x_i \mid \text{No}) = \frac{2}{5} \cdot \frac{1}{2} \cdot 0 \cdot 0 = 0$$

These are the two terms compared by $\hat{y} = \operatorname*{argmax}_{k \in \{1,\dots,K\}} P(c_k) \prod_{i=1}^{n} P(x_i \mid c_k)$.

ECE 710 Lecture 18 E. Lou 13


Naïve Bayes – Example
(Query: x = 010, i.e. x1 = 0, x2 = 1, x3 = 0.)

From the previous slide:

$$P(\text{Yes}) \prod_{i=1}^{n} P(x_i \mid \text{Yes}) = 0.089 \qquad P(\text{No}) \prod_{i=1}^{n} P(x_i \mid \text{No}) = 0$$

$$\hat{y} = \textbf{Yes} \text{ because } 0.089 > 0$$

We can also convert the result into probabilities:

$$P(\text{Yes} \mid \boldsymbol{x} = 010) = \frac{0.089}{0.089 + 0} = 1 \qquad P(\text{No} \mid \boldsymbol{x} = 010) = \frac{0}{0.089 + 0} = 0$$

The resulting probabilities always sum up to 1.
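A short sketch (not from the slides) that reproduces the numbers above by direct counting over the training table:

```python
# Training table from the slides: (secret, prince, password) -> spam?
data = [((1, 1, 0), "Yes"), ((0, 0, 1), "No"), ((1, 0, 0), "Yes"),
        ((0, 1, 1), "Yes"), ((1, 0, 1), "No")]
x_new = (0, 1, 0)

scores = {}
for c in ("Yes", "No"):
    rows = [x for x, label in data if label == c]
    score = len(rows) / len(data)                      # prior P(c)
    for i, value in enumerate(x_new):                  # product of likelihoods P(x_i | c)
        score *= sum(x[i] == value for x in rows) / len(rows)
    scores[c] = score

print(scores)                                          # {'Yes': 0.0888..., 'No': 0.0}
total = sum(scores.values())
print({c: s / total for c, s in scores.items()})       # normalized: {'Yes': 1.0, 'No': 0.0}
```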

ECE 710 Lecture 18 E. Lou 14


Naïve Bayes – Limitations
• Assumes that all features (x1, x2,…, xn) are independent from each other
• Violated by many data sets (e.g. BMI, blood pressure, insulin are not independent)
• Famously used in spam detection and document classification tasks

If an input variable is continuous, then it is assumed to follow a normal distribution, e.g. $P(\text{BMI} = 31.1 \mid c = 0) = \,?$

Calculate the mean (μ) and standard deviation (σ) of $x_i$ within each class to get the likelihood:

$$P(x_i \mid c_k) = \frac{1}{\sigma_k \sqrt{2\pi}}\, e^{-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}}$$

  Age   Blood pressure   BMI    Insulin   Diabetes outcome
  21         66          28.1      94            0
  33         40          43.1     168            1
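A minimal sketch of the normal likelihood above; the class-conditional mean and standard deviation used here are hypothetical values, not statistics estimated from a real data set:

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(x_i | c_k) under the normal assumption used for continuous features."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical BMI statistics for the non-diabetic class (c = 0)
mu_bmi_c0, sigma_bmi_c0 = 30.0, 6.5
print(gaussian_likelihood(31.1, mu_bmi_c0, sigma_bmi_c0))  # density value for P(BMI = 31.1 | c = 0)
```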

ECE 710 Lecture 18 E. Lou 15


Support Vector Machine

Support-vector machines (SVMs, also called support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.

SVM tries to draw lines between the data points with the largest margin between them.
To do this, we plot data items as points in n-dimensional space, where n is the number of input features.
Based on this, SVM finds an optimal boundary, called a hyperplane, which best separates the possible outputs by their class label.
In other words, the objective of the support vector machine algorithm is to find the hyperplane in N-dimensional space (N = the number of features) that most distinctly classifies the data points.
ECE 710 Lecture 18 E. Lou 16
Support Vector Machine

The distance between the hyperplane and the closest class point is called the margin.

The optimal hyperplane is the one with the largest margin, i.e. it maximizes the distance between the hyperplane and the closest data points of both classes.

ECE 710 Lecture 18 E. Lou 17


Support Vector Machine
Let's imagine we have two tags, red and blue, and our data has two features, x and y. We want a classifier that, given a pair of (x, y) coordinates, outputs whether it is red or blue. We plot our already labeled training data on a plane.

A support vector machine takes these data points and outputs the hyperplane that best separates the tags. This line is the decision boundary: anything that falls to one side of it we will classify as blue, and anything that falls to the other side as red.

ECE 710 Lecture 18 E. Lou 18


Support Vector Machine

An example where H1 does not separate the two classes.
H2 does, but only with a small margin.
H3 separates them with the maximal margin.

There are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between the data points of both classes.

ECE 710 Lecture 18 E. Lou 19


Support Vector Machine
It’s pretty clear that there’s not a linear
decision boundary (a single straight line
that separates both tags).
However, the vectors are very clearly
segregated and it looks as though it should
be easy to separate them.
So here’s what we’ll do: we will add a third
dimension. Up until now we had two
dimensions: x and y.
We create a new z dimension, and we rule
that it be calculated a certain way that is
convenient for us: z = x² + y² (you’ll notice
that’s the equation for a circle).
ECE 710 Lecture 18 E. Lou 20
Support Vector Machine

•Imagine the new space we want:


• z = x² + y²
•Figure out what the dot product in that space looks like:
• a · b = x_a·x_b + y_a·y_b + z_a·z_b
• a · b = x_a·x_b + y_a·y_b + (x_a² + y_a²)·(x_b² + y_b²)
•Tell SVM to do its thing, but using the new dot product — we call this a kernel

By using a nonlinear kernel (like above) we can get a nonlinear classifier


without transforming the data at all
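A sketch of this idea as a custom Gram-matrix kernel passed to sklearn's SVC; the ring-shaped data below are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def circle_kernel(A, B):
    """k(a, b) = a.b + (xa^2 + ya^2)(xb^2 + yb^2), i.e. the dot product in (x, y, z = x^2 + y^2) space."""
    return A @ B.T + np.outer((A ** 2).sum(axis=1), (B ** 2).sum(axis=1))

# Synthetic data: an inner disc (class 0) surrounded by an outer ring (class 1)
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(100), np.ones(100)])

clf = SVC(kernel=circle_kernel).fit(X, y)  # the circular boundary is linear in the lifted space
print(clf.score(X, y))
```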

ECE 710 Lecture 18 E. Lou 21


Support Vector Machine - Kernels

Kernels are transformations applied to the inputs so that the problem becomes more linearly separable.

Apply the kernel function x3 = x1² + x2²  →  Find the hyperplane  →  Project back into the original space

ECE 710 Lecture 18 E. Lou 22


Support Vector Machine - Example

Both answers are correct.
The first one tolerates some outlier points.
The second one tries to achieve zero tolerance with a perfect partition.

ECE 710 Lecture 18 E. Lou 23


Support Vector Machine

Regularization - The regularization parameter (often called the C parameter in Python's sklearn library) tells the SVM optimization how much you want to avoid misclassifying each training example.

For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly.
Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.

ECE 710 Lecture 18 E. Lou 24


Support Vector Machine

Gamma - The gamma parameter defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'.
In other words, with low gamma, points far away from the plausible separation line are considered in the calculation of the separation line, whereas high gamma means only the points close to the plausible line are considered.
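A hedged sketch of where C and gamma appear in sklearn's SVC; the values are illustrative, not tuned for any particular data set, and both models would be trained with .fit(X, y) as usual:

```python
from sklearn.svm import SVC

# Larger C -> narrower margin that tries to classify every training point correctly;
# larger gamma -> only nearby points shape the boundary (more wiggly, more prone to overfitting).
flexible_model = SVC(kernel="rbf", C=100.0, gamma=10.0)
smoother_model = SVC(kernel="rbf", C=0.1, gamma=0.01)   # wider margin, smoother boundary
```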

ECE 710 Lecture 18 E. Lou 25


Support Vector Machine

Margin - The margin is the separation between the line and the closest class points.

A good margin is one where this separation is large for both classes.
The images below give a visual example of a good and a bad margin. A good margin allows the points to stay within their respective classes without crossing over to the other class.

ECE 710 Lecture 18 E. Lou 26


Support Vector Machine

If the equation of the separating line is ax + by = c, then

ax + by ≥ c for all the black dots
ax + by < c for all the hollow dots

We need to find the values of a, b, and c.

ECE 710 Lecture 18 E. Lou 27


Support Vector Machine

A model primarily used for classification that finds the hyperplane(s)


that separate(s) the different classes
Determines the hyperplane that maximizes
the margin (separation) between the
hyperplane and the support vectors of
different classes

$$\hat{c}(\boldsymbol{x}) = \begin{cases} c_1, & z(\boldsymbol{x}) > 0 \\ c_2, & z(\boldsymbol{x}) < 0 \end{cases} \qquad z = b + w_1 x_1 + \dots + w_n x_n$$

ECE 710 Lecture 18 E. Lou 28


Support Vector Machine

Note that this is an illustration of a trivial


problem – problems like this are not realistic

It is completely separable
• No points violate the hyperplane
• When that does not hold, it leads to the concept of a "soft margin"

It is linearly separable
• It can be separated using a hyperplane
• When that does not hold, it leads to the trick of using kernels

$z = b + w_1 x_1 + \dots + w_n x_n$

ECE 710 Lecture 18 E. Lou 29


Support Vector Machine – Soft Margin

The SVM minimizes a loss with a "hinge loss" term, where only vectors whose slack $\xi_i$ is positive (i.e. points on the wrong side of the hyperplane or within the margin) contribute to the loss:

$$L = \lVert \boldsymbol{w} \rVert^2 + C \sum_i \xi_i$$

• $\lVert \boldsymbol{w} \rVert^2$ is inversely proportional to the margin width
• $C \sum_i \xi_i$ is the penalty for misclassifications
• L tries to find a balance between maximizing the margin and avoiding misclassifications

(Figure: slack variables $\xi_0, \xi_1, \xi_2, \xi_3$ for points violating the margin of the hyperplane $z = b + w_1 x_1 + \dots + w_n x_n$.)

This leads to a regularization parameter, $C > 0$, that controls how much the model is penalized for misclassifications:
• $C \to 0$: low penalty for misclassifications, wider margin
• $C \to \infty$: high penalty for misclassifications, approaching a hard margin
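A small numpy sketch that simply evaluates this objective for a fixed w and b; it is purely illustrative (a real SVM solver minimizes the loss rather than just evaluating it), and the sample points are hypothetical:

```python
import numpy as np

def soft_margin_loss(w, b, X, y, C):
    """L = ||w||^2 + C * sum(xi), with xi = max(0, 1 - y * z) and z = b + X @ w.

    y must be encoded as +1 / -1; only points inside the margin or on the
    wrong side of the hyperplane get a nonzero slack xi.
    """
    z = b + X @ w
    xi = np.maximum(0.0, 1.0 - y * z)   # hinge slack per training point
    return w @ w + C * xi.sum()

# Hypothetical tiny example
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(soft_margin_loss(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
```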

ECE 710 Lecture 18 E. Lou 30


Support Vector Machine - Example

Predict whether a breast cancer tumor is malignant (positive) or benign (negative) given information about the tumor.

Parameters:
• C = 4
• RBF kernel (radial basis function kernel)
• γ = 0.033

Confusion matrix:
                        Actual Positive   Actual Negative
  Predicted Positive          43                 2
  Predicted Negative           0                40

No probabilities – just classes are output.
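A hedged sketch of how a result of this kind could be produced with sklearn's built-in breast cancer data; the train/test split, the feature scaling, and therefore the exact confusion-matrix counts are my own assumptions and will not match the slide's numbers exactly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

scaler = StandardScaler().fit(X_train)            # RBF SVMs are sensitive to feature scale
clf = SVC(kernel="rbf", C=4, gamma=0.033)          # parameters quoted on the slide
clf.fit(scaler.transform(X_train), y_train)

y_pred = clf.predict(scaler.transform(X_test))     # classes only -- no probabilities by default
print(confusion_matrix(y_test, y_pred))
```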

ECE 710 Lecture 18 E. Lou 31


Support Vector Machine – Multi-class

One-vs-one
• Fit a binary SVM for each pair of the K classes
• Total # of SVMs: K(K−1)/2
• Majority vote across the binary SVMs to get the class

One-vs-all (or one-vs-rest)
• Fit a binary SVM for each class (that class vs. the rest)
• Total # of SVMs: K

ECE 710 Lecture 18 E. Lou 32


Support Vector Machine – Multi-class

One-vs-all can suffer from class imbalance issues


• Each SVM aims to predict one class against the rest

One-vs-one requires more computation time


• Fits K(K−1)/2 SVMs instead of K

If computation time is not an issue, then choose the one-vs-one method
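For reference, both strategies are available as wrappers in sklearn (a hedged sketch on the built-in iris data; note that SVC already uses one-vs-one internally for multi-class problems):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # K = 3 classes

ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)   # fits K(K-1)/2 = 3 SVMs, majority vote
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # fits K = 3 SVMs, one class vs. the rest

print(ovo.score(X, y), ovr.score(X, y))
```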

ECE 710 Lecture 18 E. Lou 33


Support Vector Machine – Advantages & Disadvantages

Disadvantages:
• Lack of interpretability in the results (direct probabilities are not output)
• Longer training time, especially with multi-class problems

Advantages:
• A highly capable model that can capture very complex relationships (this comes with a higher chance of overfitting)
• Works well in cases where the # of input features > # of training data points

ECE 710 Lecture 18 E. Lou 34
