Notes 2
Greater Noida Institute of Technology
Department of CSE
Unit-II (KCS-055)
Syllabus :
BAYESIAN LEARNING - Bayes theorem, Concept learning, Bayes Optimal Classifier, Naïve Bayes
classifier, Bayesian belief networks, EM algorithm.
SUPPORT VECTOR MACHINE: Introduction, Types of support vector kernel – (Linear kernel,
polynomial kernel, and Gaussian kernel), Hyperplane – (Decision surface), Properties of SVM, and Issues
in SVM.
Regression :
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us to understand how the value of the dependent variable changes with respect to one independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc. We can understand the concept of regression analysis using the example below:
Example: Suppose there is a marketing company A which runs various advertisements every year and gets sales from them. The list below shows the advertisement spend made by the company in the last 5 years and the corresponding sales:
Advertisement Sales
Rs. 3400 Rs. 38000
Rs. 900 Rs. 8078
Rs.50000 Rs. 300071
Rs. 1076 Rs. 9802
Rs. 23564 Rs. 34671
Rs. 1873 Rs.20121
Rs. 8000 Rs. 82005
Rs. 2000 ??
Now, the company wants to do an advertisement of Rs. 2000 in the year 2019 and wants to know the prediction about the sales for this year. To solve such prediction problems in machine learning, we need regression analysis. Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict the continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and determining the causal-effect relationship between variables. In regression, we plot a graph between the variables which best fits the given datapoints; using this plot, the machine learning model can make
predictions about the data. In simple words, "Regression shows a line or curve that passes through the datapoints on the target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between the datapoints and the line tells whether the model has captured a strong relationship or not.
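As a rough illustration (not part of the original notes), the sketch below fits a straight line to the advertisement/sales table above using plain NumPy and then predicts the sales for a spend of Rs. 2000; the exact numbers are only as good as the assumption that a straight line fits this data.

```python
# A minimal sketch (assumed, not from the notes): fit a simple linear regression
# line to the advertisement/sales table and predict sales for a Rs. 2000 spend.
import numpy as np

# Advertisement spend (x) and sales (y) from the table above, in rupees.
x = np.array([3400, 900, 50000, 1076, 23564, 1873, 8000], dtype=float)
y = np.array([38000, 8078, 300071, 9802, 34671, 20121, 82005], dtype=float)

# Least-squares fit of y = a0 + a1*x (np.polyfit returns [a1, a0]).
a1, a0 = np.polyfit(x, y, deg=1)

predicted_sales = a0 + a1 * 2000   # prediction for a Rs. 2000 advertisement
print(f"fitted line: y = {a0:.2f} + {a1:.4f} * x")
print(f"predicted sales for Rs. 2000 advertisement: Rs. {predicted_sales:.0f}")
```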
Dependent Variable: The main factor in regression analysis which we want to predict or understand is called the dependent variable. It is also called the target variable.
Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also known as predictors.
Outliers: Outlier is an observation which contains either very low value or very high value in
comparison to other observed values. An outlier may hamper the result, so it should be avoided.
Multicollinearity: If the independent variables are highly correlated with each other, then such a condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with the test dataset, then such a problem is called overfitting. And if our algorithm does not perform well even with the training dataset, then such a problem is called underfitting.
As mentioned above, regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales figures, marketing trends, etc., and for such cases we need a technique that can make predictions accurately. Regression analysis is a statistical method used for this purpose in machine learning and data science. Below are some other reasons for using regression analysis:
Regression estimates the relationship between the target and the independent variables.
It is used to find trends in data.
It helps to predict real/continuous values.
By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the others.
Types of Regression :
There are various types of regressions which are used in data science and machine learning. Each type has
its own importance on different scenarios, but at the core, all the regression methods analyze the effect of
the independent variable on dependent variables. Here we are discussing some important types of
regression which are given below:
a) Linear Regression
b) Logistic Regression
c) Polynomial Regression
d) Support Vector Regression
e) Decision Tree Regression
f) Random Forest Regression
g) Ridge Regression
h) Lasso Regression
Linear Regression: Linear regression is a statistical regression method which is used for predictive
analysis. It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables. It is used for solving the regression problem in machine
learning. Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence called linear regression.
If there is only one input variable (x), then such linear regression is called simple linear regression. And if
there is more than one input variable, then such linear regression is called multiple linear regression. The
relationship between variables in the linear regression model can be explained with a simple example: predicting the salary of an employee on the basis of years of experience (salary on the Y-axis, experience on the X-axis).
The values of the x and y variables are the training dataset used for the linear regression model representation.
Types of Linear Regression: Linear regression can be further divided into two types of algorithm:
Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called Simple Linear Regression.
Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a linear regression algorithm is called Multiple Linear Regression.
Linear Regression Line: A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:
Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship.
Negative Linear Relationship: If the dependent variable decreases on the Y-axis and independent
variable increases on the X-axis, then such a relationship is called a negative linear relationship.
Finding the best fit line: When working with linear regression, our main goal is to find the best fit line, which means that the error between the predicted values and the actual values should be minimized. The best fit line will have the least error. Different values of the weights or line coefficients (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line, and to calculate these we use a cost function.
Cost function: The different values of the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line. The cost function optimizes the regression coefficients or weights and measures how well a linear regression model is performing. We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable; this mapping function is also known as the hypothesis function. For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) * Σ (y_i − (a_0 + a_1·x_i))², where N is the number of observations, y_i is the i-th actual value, and a_0 + a_1·x_i is the corresponding predicted value.
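As a small illustration of this cost function (an addition to these notes, not from them), the sketch below computes the MSE for two candidate lines on a toy dataset, showing that a line close to the true relationship has a much lower cost.

```python
# A minimal sketch (assumed, not from the notes) of the MSE cost function
# for a candidate line y_hat = a0 + a1*x.
import numpy as np

def mse_cost(a0, a1, x, y):
    """Mean Squared Error between predictions a0 + a1*x and actual values y."""
    predictions = a0 + a1 * x
    return np.mean((y - predictions) ** 2)

# Toy data: y is roughly 2*x + 1 with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

print(mse_cost(1.0, 2.0, x, y))   # cost of a good line (small value)
print(mse_cost(0.0, 0.5, x, y))   # cost of a poor line (much larger value)
```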
MLR equation: In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple predictor variables x1, x2, x3, …, xn. Since it is an enhancement of Simple Linear Regression, the same idea applies to the multiple linear regression equation, which becomes:
Y = b0 + b1·x1 + b2·x2 + b3·x3 + … + bn·xn
Where,
Y = output/response variable,
b0, b1, b2, b3, …, bn = coefficients of the model,
x1, x2, x3, …, xn = independent/feature variables.
Assumptions for Multiple Linear Regression:
A linear relationship should exist between the Target and predictor variables.
The regression residuals must be normally distributed.
MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.
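To make the multiple linear regression equation concrete, here is a minimal sketch (an illustration, not from the notes) that estimates b0, b1 and b2 by ordinary least squares on a small synthetic dataset with two features.

```python
# A minimal sketch (assumed, not from the notes): estimating b0, b1, b2
# of Y = b0 + b1*x1 + b2*x2 with ordinary least squares.
import numpy as np

# Synthetic data generated from Y = 4 + 2*x1 + 3*x2 plus small noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 10, size=50)
y = 4 + 2 * x1 + 3 * x2 + rng.normal(0, 0.1, size=50)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

b0, b1, b2 = coeffs
print(f"b0 ≈ {b0:.2f}, b1 ≈ {b1:.2f}, b2 ≈ {b2:.2f}")  # close to 4, 2, 3
```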
Logistic Regression:
Logistic regression is another supervised learning algorithm which is used to solve the classification
problems. In classification problems, we have dependent variables in a binary or discrete format such as 0
or 1. Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or
False, Spam or not spam, etc. It is a predictive analysis algorithm which works on the concept of
probability. Logistic regression is a type of regression, but it differs from the linear regression algorithm in terms of how it is used. Logistic regression uses the sigmoid function, or logistic function, which leads to a more complex (non-linear) cost function than in linear regression. This sigmoid function is used to model the data in logistic regression. The function can be represented as:
F(x) = 1 / (1 + e^(−x))
where
F(x) = output between 0 and 1,
x = input to the function,
e = base of the natural logarithm.
When we provide the input values (data) to the function, it gives the S-curve as follows:
It uses the concept of threshold levels: values above the threshold level are rounded up to 1, and values below the threshold level are rounded down to 0.
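A minimal sketch (not from the notes) of the sigmoid function and this thresholding step:

```python
# A minimal sketch (assumed): the sigmoid function and threshold-based labelling.
import numpy as np

def sigmoid(x):
    """Logistic function F(x) = 1 / (1 + e^(-x)), output between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])   # outputs of the linear part
probabilities = sigmoid(scores)
labels = (probabilities >= 0.5).astype(int)      # threshold of 0.5

print(probabilities)  # values on the S-curve, between 0 and 1
print(labels)         # values above the threshold map to 1, below to 0
```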
There are three types of logistic regression:
Binary (0/1, pass/fail).
Multinomial (cats, dogs, lions).
Ordinal (low, medium, high).
Bayesian Learning
Some Terms to Understand Before delving into Bayesian learning:
Random variable (Stochastic variable) — In statistics, the random variable is a variable whose possible
values are a result of a random event. Therefore, each possible value of a random variable has some
probability attached to it to represent the likelihood of those values.
Probability distribution — The function that defines the probability of different outcomes/values of a
random variable. The continuous probability distributions are described using probability density
functions whereas discrete probability distributions can be represented using probability mass functions.
Conditional probability — This is a measure of probability P(A|B) of an event A given that another
event B has occurred.
Joint probability distribution - The probability distribution over all possible combinations of values of two or more random variables.
Imagine a situation where your friend gives you a new coin and asks you the fairness of the coin (or the
probability of observing heads) without even flipping the coin once. In fact, you are also aware that your
friend has not made the coin biased. In general, you have seen that coins are fair, thus you expect the
probability of observing heads is 0.5. In the absence of any such observations, you assert the fairness of
the coin only using your past experiences or observations with coins.
Suppose that you are allowed to flip the coin 10 times in order to determine the fairness of the coin. Your
observations from the experiment will fall under one of the following cases:
Case 1: observing 5 heads and 5 tails.
Case 2: observing h heads and 10−h tails, where h ≠ 5 (i.e., not exactly half heads).
If case 1 is observed, you are now more certain that the coin is a fair coin, and you will decide that the
probability of observing heads is 0.5 with more confidence.
If case 2 is observed, you can either:
1. Neglect your prior beliefs since now you have new data and decide the probability of observing heads
is h/10 by solely depending on recent observations.
2. Adjust your belief accordingly to the value of h that you have just observed, and decide the probability
of observing heads using your recent observations.
The first method suggests that we use the frequentist approach, where we ignore our prior beliefs when making decisions. However, the second method seems more reasonable, because 10 coin flips are insufficient to determine the fairness of a coin.
Therefore, we can make better decisions by combining our recent observations and beliefs that we have
gained through our past experiences. It is this thinking model that uses our most recent observations
together with our beliefs or inclination for critical thinking that is known as Bayesian thinking. Moreover,
assume that your friend allows you to conduct another 10 coin flips. Then, we can use these new
observations to further update our beliefs. As we gain more data, we can incrementally update our beliefs
increasing the certainty of our conclusions. This is known as incremental learning, where you update your
knowledge incrementally with new evidence.
Bayesian learning comes into play on such occasions, where we are unable to use frequentist statistics due to the drawbacks that we have discussed above. We can use Bayesian learning to address all these drawbacks, and it even provides additional capabilities (such as incremental updates of the posterior) when testing a hypothesis to estimate the unknown parameters of a machine learning model.
Bayesian learning uses Bayes' theorem to determine the conditional probability of a hypothesis given some evidence or observations.
Bayes Theorem : Bayes' theorem describes how the conditional probability of an event or a hypothesis
can be computed using evidence and prior knowledge. It is similar to concluding that our code has no
bugs given the evidence that it has passed all the test cases, including our prior belief that we have rarely
observed any bugs in our code. However, this intuition goes beyond that simple hypothesis test where
there are multiple events or hypotheses involved (let us not worry about this for the moment). The Bayes'
theorem is given by:
P(θ|X) = P(X|θ) · P(θ) / P(X)
where P(θ|X) = posterior probability, P(θ) = prior probability, P(X|θ) = likelihood (the probability of observing the evidence X given the hypothesis θ), and P(X) = evidence (the probability of the data X).
Consider the hypothesis that there are no bugs in our code. 𝛳 and X denote that our code is bug-free and
passes all the test cases respectively.
P(θ) — Prior Probability is the probability of the hypothesis θ being true before applying the Bayes'
theorem. Prior represents the beliefs that we have gained through past experience, which refers to either
common sense or an outcome of Bayes' theorem for some past observations.
For the example given, prior probability denotes the probability of observing no bugs in our code.
However, since this is the first time we are applying Bayes' theorem, we have to decide the priors using
other means (otherwise we could use the previous posterior as the new prior). Let us assume that it is very
unlikely to find bugs in our code because rarely have we observed bugs in our code in the past. With our
past experience of observing fewer bugs in our code, we can assign our prior P(θ) with a higher
probability. However, for now, let us assume that P(θ) = p.
P(X|θ) — Likelihood is the conditional probability of the evidence given a hypothesis. The likelihood is
mainly related to our observations or the data we have. If it is given that our code is bug-free, then the
probability of our code passing all test cases is given by the likelihood. Assuming we have implemented
these test cases correctly, if no bug is presented in our code, then it should pass all the test cases.
Therefore, the likelihood P(X|θ) = 1.
P(X) — Evidence term denotes the probability of the evidence or data. This can be expressed as a summation (or integral) of the probabilities of all possible hypotheses weighted by the corresponding likelihoods. Therefore, we can write P(X) as:
P(X) = Σ_i P(X|θ_i) · P(θ_i)
The argmax_θ operator estimates the event or hypothesis θ_i that maximizes the posterior probability P(θ_i|X):
θ_MAP = argmax_θ P(θ|X)
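As a small numeric illustration (not from the notes; the prior p and the likelihood of passing the tests when the code is buggy are assumed values chosen only for this example), the sketch below applies Bayes' theorem to the bug-free-code hypothesis.

```python
# A minimal sketch (assumed values): Bayes' theorem for the "bug-free code" example.
# Hypothesis theta: the code is bug-free.  Evidence X: all test cases pass.

prior_bug_free = 0.9                 # P(theta): assumed prior from past experience
likelihood_pass_if_bug_free = 1.0    # P(X|theta): bug-free code passes all tests
likelihood_pass_if_buggy = 0.4       # P(X|not theta): assumed, buggy code may still pass

# Evidence P(X) as a sum over both hypotheses, weighted by their priors.
evidence = (likelihood_pass_if_bug_free * prior_bug_free
            + likelihood_pass_if_buggy * (1.0 - prior_bug_free))

posterior_bug_free = likelihood_pass_if_bug_free * prior_bug_free / evidence
print(f"P(bug-free | all tests pass) = {posterior_bug_free:.3f}")  # ≈ 0.957
```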
Naive Bayes :
It is a simple technique for constructing classifier models that assign class labels to problem instances,
represented as vectors of feature values, where the class labels are drawn from some finite set. There is
not a single algorithm for training such classifiers, but a family of algorithms based on a common
principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the
value of any other feature, given the class variable.
For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A
naive Bayes classifier considers each of these features to contribute independently to the probability that
this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter
features. For some types of probability models, naive Bayes classifiers can be trained very efficiently in a
supervised learning setting. In many practical applications, parameter estimation for naive Bayes models
uses the method of maximum likelihood; in other words, one can work with the naive Bayes model
without accepting Bayesian probability or using any Bayesian methods. Despite their naive design and
apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex
real-world situations. In 2004, an analysis of the Bayesian classification problem
showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes
classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that
Bayes classification is outperformed by other approaches, such as boosted trees or random forests. An
advantage of naive Bayes is that it only requires a small number of training data to estimate the
parameters necessary for classification .
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter.
Even if these features depend on each other or upon the existence of the other features, all of these
properties independently contribute to the probability that this fruit is an apple and that is why it is known
as ‘Naive’.
Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity,
Naive Bayes is known to outperform even highly sophisticated classification methods. Bayes theorem
provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:
P(c|x) = P(x|c) · P(c) / P(x)
where c is the class and x is the attribute (feature) vector. Under the naive independence assumption, P(x|c) = P(x1|c) · P(x2|c) · … · P(xn|c).
Problem:
Players will play if the weather is sunny. Is this statement correct?
We can solve it using the posterior-probability method discussed above:
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
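The frequency table for this example is not reproduced in these notes; as a hedged illustration, the sketch below plugs in counts from the commonly used 14-day "play tennis" weather dataset (an assumption, not data taken from these notes) to evaluate the expression above.

```python
# A minimal sketch with assumed counts from the commonly used 14-day weather
# ("play tennis") dataset: 9 "Yes" days, 5 "No" days, 3 of the "Yes" days sunny,
# and 5 sunny days in total.  These counts are an illustrative assumption.

p_sunny_given_yes = 3 / 9    # P(Sunny | Yes)
p_yes = 9 / 14               # P(Yes)
p_sunny = 5 / 14             # P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(f"P(Yes | Sunny) = {p_yes_given_sunny:.2f}")  # ≈ 0.60, so playing is more likely
```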
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and in problems having multiple classes.
A limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors which are completely independent.
Real time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast. Thus, it can be used for making predictions in real time.
Multi class Prediction: This algorithm is also well known for multi class prediction feature. Here we
can predict the probability of multiple classes of target variable.
Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers mostly used in text
classification (due to better result in multi class problems and independence rule) have higher success rate
as compared to other algorithms. As a result, it is widely used in Spam filtering (identify spam e-mail)
and Sentiment Analysis (in social media analysis, to identify positive and negative customer sentiments)
Recommendation System: Naive Bayes Classifier and Collaborative Filtering together build a
Recommendation System that uses machine learning and data mining techniques to filter unseen
information and predict whether a user would like a given resource or not.
Bayesian Belief Networks: A Bayesian belief network represents the joint probability distribution over a set of variables as a directed graph together with conditional probability tables, so that the joint distribution factorises as P(x1, x2, …, xn) = Π_i P(xi | Parents(xi)). As you would understand from this formula, to be able to calculate the joint distribution we need to have the conditional probabilities indicated by the network. Furthermore, once we have the joint distribution, we can start to ask interesting questions.
For example, we can ask for the probability of "RAIN" given that "SEASON" is "WINTER" and "DOG BARK" is "TRUE".
EM (Expectation-Maximization) Algorithm:
The essence of the Expectation-Maximization algorithm is to use the available observed data of the dataset to estimate the missing data and then to use that estimate to update the values of the parameters. Let us
understand the EM algorithm in detail. Initially, a set of initial values of the parameters are considered. A
set of incomplete observed data is given to the system with the assumption that the observed data comes
from a specific model. The next step is known as “Expectation” – step or E-step. In this step, we use the
observed data in order to estimate or guess the values of the missing or incomplete data. It is basically
used to update the variables.
The next step is known as “Maximization”-step or M-step. In this step, we use the complete data
generated in the preceding “Expectation” – step in order to update the values of the parameters. It is
basically used to update the hypothesis. Now, in the fourth step, it is checked whether the values are converging or not; if yes, then stop, otherwise repeat step 2 and step 3, i.e. the "Expectation" step and the "Maximization" step, until convergence occurs.
The flow of the EM algorithm is therefore: initialize parameters → E-step → M-step → check for convergence, repeating the E and M steps until convergence.
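To make the E-step/M-step loop concrete, here is a minimal sketch (an illustrative assumption, not from the notes) of EM for a one-dimensional mixture of two Gaussians with known variance: the E-step computes responsibilities from the current parameters and the M-step re-estimates the means and mixing weights from them.

```python
# A minimal sketch (assumed, not from the notes): EM for a 1-D mixture of two
# Gaussians with known equal variance, estimating the two means and the mixing weights.
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data drawn from two clusters centred at 0 and 5.
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 300)])

mu = np.array([-1.0, 1.0])   # initial guesses for the two means
pi = np.array([0.5, 0.5])    # initial mixing weights
sigma = 1.0                  # variance assumed known for simplicity

def gaussian(x, mean):
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibility of each component for each data point.
    weighted = np.stack([pi[k] * gaussian(data, mu[k]) for k in range(2)])
    resp = weighted / weighted.sum(axis=0)

    # M-step: update means and mixing weights from the responsibilities.
    nk = resp.sum(axis=1)
    mu = (resp * data).sum(axis=1) / nk
    pi = nk / len(data)

print(f"estimated means: {mu.round(2)}, mixing weights: {pi.round(2)}")
```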
Usage of EM algorithm –
1. It can be used to fill in missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. It can be used for estimating the parameters of a Hidden Markov Model (HMM).
4. It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
1. It is always guaranteed that likelihood will increase with each iteration.
2. The E-step and M-step are often pretty easy for many problems in terms of implementation.
3. Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm –
1. It has slow convergence.
2. It makes convergence to the local optima only.
3. It requires both the probabilities, forward and backward (numerical optimization requires only
forward probability).
Support Vector Machine: Introduction
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N is the number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane that has the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
Hyperplanes and Support Vectors
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. The dimension of the hyperplane depends upon the number of features: if the number of input features is 2, then the hyperplane is just a line; if the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
Support Vectors: Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors will change the position of the hyperplane. These are the points that help us build our SVM.
Large Margin Intuition:
In logistic regression, we take the output of the linear function and squash the value within the range of [0,1] using the sigmoid function. If the squashed value is greater than a threshold value (0.5) we assign it the label 1, else we assign it the label 0. In SVM, we take the output of the linear function and if that output is greater than 1, we identify it with one class, and if the output is less than -1, we identify it with the other class. Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values ([-1,1]) which acts as the margin.
In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is the hinge loss.
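A minimal sketch (not from the notes) of the hinge loss mentioned above; it is zero for points on the correct side of the margin and grows linearly as points move to the wrong side:

```python
# A minimal sketch (assumed): hinge loss for labels y in {-1, +1} and raw scores f(x).
import numpy as np

def hinge_loss(y_true, scores):
    """Average of max(0, 1 - y*f(x)); zero when points are beyond the margin."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y = np.array([+1, +1, -1, -1])
scores = np.array([2.0, 0.5, -1.5, 0.3])   # outputs of the linear function w·x + b

print(hinge_loss(y, scores))  # only points inside or beyond the margin contribute
```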
The basics of Support Vector Machines and how it works are best understood with a simple example.
Let’s imagine we have two tags: red and blue, and our data has two features: x and y. We want a classifier
that, given a pair of (x,y) coordinates, outputs if it’s either red or blue. We plot our already labeled
training data on a plane:
A support vector machine takes these data points and outputs the hyperplane (which in two dimensions is simply a line) that best separates the tags. This line is the decision boundary: anything that falls to one side of it we will classify as blue, and anything that falls to the other side as red.
The best hyperplane for SVM is the one that maximizes the margins from both tags. In other words: the hyperplane (remember, it's a line in this case) whose distance to the nearest element of each tag is the largest.
Nonlinear data
Now this example was easy, since clearly the data was linearly separable: we could draw a straight line to separate the two classes (the triangles and the circles). Sadly, usually things aren't that simple. Take a look at this case:
So here's what we'll do: we will add a third dimension. Up until now we had two dimensions, x and y. We create a new z dimension, and we rule that it be calculated in a way that is convenient for us: z = x² + y² (the equation for a circle). This will give us a three-dimensional space. Taking a slice of that space, it looks like this:
From a different perspective, the data is now in two linearly separable groups.
What can SVM do with this? Let’s see:
Note that since we are in three dimensions now, the hyperplane is a plane parallel to the x–y plane at a certain z (let's say z = 1).
And there we go! Our decision boundary is a circumference of radius 1, which separates both tags using SVM.
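A minimal sketch (assumed, not from the notes) of the z = x² + y² mapping: points inside the unit circle get z < 1 and points outside get z > 1, so the plane z = 1 separates them.

```python
# A minimal sketch (assumed): lifting 2-D points with z = x^2 + y^2 so that a
# plane at z = 1 separates points inside the unit circle from points outside it.
import numpy as np

points = np.array([[0.2, 0.1],    # inside the circle of radius 1
                   [0.5, -0.4],   # inside
                   [1.5, 0.0],    # outside
                   [-1.0, 1.2]])  # outside

z = (points ** 2).sum(axis=1)     # the new third dimension z = x^2 + y^2
labels = np.where(z < 1.0, "inner class", "outer class")

for (x, y), zi, lab in zip(points, z, labels):
    print(f"({x:+.1f}, {y:+.1f}) -> z = {zi:.2f} -> {lab}")
```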
The kernel trick
In our example we found a way to classify nonlinear data by cleverly mapping our space to a higher
dimension. However, it turns out that calculating this transformation can get pretty computationally
expensive: there can be a lot of new dimensions, each one of them possibly involving a complicated
calculation. Doing this for every vector in the dataset can be a lot of work, so it’d be great if we could
find a cheaper solution.
Here’s a trick: SVM doesn’t need the actual vectors to work its magic, it actually can get by only with
the dot products between them. This means that we can sidestep the expensive calculations of the new
dimensions! This is what we do instead:
z = x² + y²
Figure out what the dot product in that space looks like:
Tell SVM to do its thing, but using the new dot product — we call this a kernel function.
The kernel trick allows us to sidestep a lot of expensive calculations. Normally the kernel is linear, and we get a linear classifier. However, by using a nonlinear kernel (like the one above) we can get a nonlinear classifier without transforming the data at all: we only change the dot product to that of the space that we want, and SVM will carry on working.
Note that the kernel trick isn't actually part of SVM. It can be used with other linear classifiers such as logistic regression as well. A support vector machine only takes care of finding the decision boundary.
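Tying this back to the kernel types listed in the syllabus, here is a minimal sketch (assumed, not from the notes) of the linear, polynomial and Gaussian kernel functions, each computed directly from the original vectors without constructing the higher-dimensional space:

```python
# A minimal sketch (assumed): linear, polynomial and Gaussian (RBF) kernels,
# each computed directly from the original vectors.
import numpy as np

def linear_kernel(a, b):
    return np.dot(a, b)

def polynomial_kernel(a, b, degree=2, c=1.0):
    return (np.dot(a, b) + c) ** degree

def gaussian_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

a = np.array([1.0, 2.0])
b = np.array([0.5, -1.0])

print(linear_kernel(a, b))       # dot product in the original space
print(polynomial_kernel(a, b))   # implicit dot product in a polynomial feature space
print(gaussian_kernel(a, b))     # implicit dot product in an infinite-dimensional space
```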
********************************************