
Greater Noida Institute of Technology

Department of CSE
Unit-II (KCS-055)

Unit II
Syllabus :

REGRESSION: Linear Regression and Logistic Regression

BAYESIAN LEARNING - Bayes theorem, Concept learning, Bayes Optimal Classifier, Naïve Bayes
classifier, Bayesian belief networks, EM algorithm.

SUPPORT VECTOR MACHINE: Introduction, Types of support vector kernel – (Linear kernel,
polynomial kernel, and Gaussian kernel), Hyperplane – (Decision surface), Properties of SVM, and Issues
in SVM.

Regression :

Regression Analysis in Machine learning

Regression analysis is a statistical method for modelling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes as one independent variable varies while the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc. We can understand the concept of regression analysis using the example below:

Example: Suppose there is a marketing company A that runs various advertisements every year and earns sales from them. The list below shows the amount spent on advertisement in past years and the corresponding sales:

Advertisement    Sales
Rs. 3400         Rs. 38000
Rs. 900          Rs. 8078
Rs. 50000        Rs. 300071
Rs. 1076         Rs. 9802
Rs. 23564        Rs. 34671
Rs. 1873         Rs. 20121
Rs. 8000         Rs. 82005
Rs. 2000         ??

Now the company wants to spend Rs. 2000 on advertisement in the year 2019 and wants to predict the sales for this year. To solve such prediction problems in machine learning, we need regression analysis. Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and determining causal-effect relationships between variables. In regression, we fit a line or curve between the variables that best fits the given data points; using this plot, the machine learning model can make
predictions about the data. In simple words, "Regression shows a line or curve that passes through all the
datapoints on target-predictor graph in such a way that the vertical distance between the datapoints and
the regression line is minimum." The distance between datapoints and line tells whether a model has
captured a strong relationship or not.
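
As a quick illustration of this idea, the sketch below fits a least-squares line to the advertisement/sales table above using numpy and uses it to predict the sales for a spend of Rs. 2000. This is only a minimal sketch; the predicted figure is whatever the fitted line produces, not an official answer.

```python
import numpy as np

# Advertisement spend (x) and sales (y) from the table above, in Rs.
x = np.array([3400, 900, 50000, 1076, 23564, 1873, 8000], dtype=float)
y = np.array([38000, 8078, 300071, 9802, 34671, 20121, 82005], dtype=float)

# Fit a straight line y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)

# Predict the sales for an advertisement spend of Rs. 2000.
predicted_sales = a * 2000 + b
print(f"fitted line: sales = {a:.2f} * advertisement + {b:.2f}")
print(f"predicted sales for Rs. 2000 of advertisement: Rs. {predicted_sales:.0f}")
```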

Some examples of regression can be as:

1) Prediction of rain using temperature and other factors.
2) Determining market trends.
3) Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

Dependent Variable: The main factor in regression analysis, the one we want to predict or understand, is called the dependent variable. It is also called the target variable.

Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also called predictors.

Outliers: An outlier is an observation with either a very low or a very high value in comparison to the other observed values. An outlier may hamper the results, so it should be avoided.

Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.

Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with the test dataset, the problem is called overfitting. And if our algorithm does not perform well even with the training dataset, the problem is called underfitting.

Why do we use Regression Analysis?

As mentioned above, regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales figures, and marketing trends, and for such cases we need a technique that can make predictions accurately. Regression analysis is such a statistical method, widely used in machine learning and data science. Below are some other reasons for using regression analysis:

It estimates the relationship between the target and the independent variables.
It is used to find trends in data.
It helps to predict real/continuous values.
By performing regression, we can determine the most important factor, the least important factor, and how the factors affect each other.
Types of Regression :

There are various types of regression which are used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of the independent variables on the dependent variable. Here we discuss some important types of regression, which are given below:

a) Linear Regression
b) Logistic Regression
c) Polynomial Regression
d) Support Vector Regression
e) Decision Tree Regression
f) Random Forest Regression
g) Ridge Regression
h) Lasso Regression
Linear Regression: Linear regression is a statistical regression method used for predictive analysis. It is one of the simplest algorithms and models the relationship between continuous variables. It is used for solving regression problems in machine learning. Linear regression shows a linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear regression.
If there is only one input variable (x), such linear regression is called simple linear regression, and if there is more than one input variable, it is called multiple linear regression. For example, we might predict the salary of an employee on the basis of years of experience.

Below is the mathematical equation for Linear regression:


Y = aX + b
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.
Some popular applications of linear regression are: i. analyzing trends and sales estimates, ii. salary forecasting, iii. real estate prediction, iv. estimating expected arrival times (ETA) in traffic.

Mathematically, we can represent a linear regression as:


y = a0 + a1x + ε
Here, y = dependent variable (target variable), x = independent variable (predictor variable),
a0 = intercept of the line (gives an additional degree of freedom),
a1 = linear regression coefficient (scale factor applied to each input value),
ε = random error.

The values of the x and y variables form the training dataset used to build the linear regression model.
Types of Linear Regression: Linear regression can be further divided into two types of algorithm:
Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, the algorithm is called simple linear regression.
Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, the algorithm is called multiple linear regression.
Linear Regression Line: A line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:
Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a positive linear relationship.
Negative Linear Relationship: If the dependent variable decreases on the Y-axis and independent
variable increases on the X-axis, then such a relationship is called a negative linear relationship.

Finding the best fit line: When working with linear regression, our main goal is to find the best fit line, which means the error between the predicted values and the actual values should be minimized. The best fit line has the least error. Different values for the weights or line coefficients (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1, and to do this we use a cost function.
Cost function: The cost function is used to estimate the values of the coefficients for the best fit line; it optimizes the regression coefficients or weights and measures how well a linear regression model is performing. We can use the cost function to assess the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the hypothesis function. For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values.

For the above linear equation, MSE can be calculated as:


MSE = (1/N) * Σ_{i=1..N} (Yi − (a1·xi + a0))²
Where,
N = total number of observations,
Yi = actual value,
(a1·xi + a0) = predicted value.
Residuals: The distance between an actual value and the corresponding predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high and so will the cost function. If the scatter points are close to the regression line, the residuals will be small and hence the cost function will be small as well.
Gradient Descent: Gradient descent is used to minimize the MSE by calculating the gradient of the cost function. A regression model uses gradient descent to update the coefficients of the line so as to reduce the cost function: the coefficient values are initialised (often randomly) and then iteratively updated until the cost function reaches its minimum.
Model Performance: The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the method below:
R-squared method: R-squared is a statistical measure that determines the goodness of fit. It measures the strength of the relationship between the dependent and independent variables on a scale of 0 to 100%. A high value of R-squared means a small difference between the predicted values and the actual values and hence indicates a good model. It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression. It can be calculated from the formula below:

R-squared = Explained variation / Total variation
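
The following is a minimal numpy sketch tying the last three ideas together: it minimizes the MSE cost function with gradient descent to obtain a0 and a1, and then reports the MSE and R-squared of the fitted line. The data, learning rate, and iteration count are assumed values chosen only for illustration.

```python
import numpy as np

# Illustrative training data (e.g. years of experience -> salary in lakhs).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

a0, a1 = 0.0, 0.0   # intercept and slope, started at arbitrary values
lr = 0.01           # learning rate
n = len(x)

for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # Gradients of MSE = (1/n) * sum((a0 + a1*x - y)^2) with respect to a0 and a1.
    grad_a0 = (2.0 / n) * error.sum()
    grad_a1 = (2.0 / n) * (error * x).sum()
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

mse = ((y - (a0 + a1 * x)) ** 2).mean()
# R-squared: explained variation / total variation (equivalently 1 - residual/total).
r_squared = 1 - ((y - (a0 + a1 * x)) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"a0={a0:.3f}, a1={a1:.3f}, MSE={mse:.4f}, R^2={r_squared:.4f}")
```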
Assumptions of Linear Regression
Below are some important assumptions of linear regression. These are formal checks to perform while building a linear regression model, and they help ensure the best possible result from the given dataset.
1.Linear relationship between the features and target: Linear regression assumes the linear
relationship between the dependent and independent variables.
2. Small or no multicollinearity between the features: Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable; in other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So the model assumes either little or no multicollinearity between the features or independent variables.
3.Homoscedasticity Assumption: Homoscedasticity is a situation when the error term is the same for all
the values of independent variables. With homoscedasticity, there should be no clear pattern distribution
of data in the scatter plot.
4. Normal distribution of error terms: Linear regression assumes that the error terms follow a normal distribution. If the error terms are not normally distributed, confidence intervals will become either too wide or too narrow, which may cause difficulties in estimating the coefficients. Normality can be checked using a q-q plot: if the plot shows an approximately straight line without large deviations, the errors are normally distributed.
5. No autocorrelations: The linear regression model assumes no autocorrelation in error terms. If there
will be any correlation in the error term, then it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.
Simple Linear regression algorithm has mainly two objectives:
1. Model the relationship between the two variables: Such as the relationship between Income and
expenditure, experience and Salary, etc.
2. Forecasting new observations: Such as Weather forecasting according to temperature, Revenue of a
company according to the investments in a year, etc.
Multiple Linear Regressions:
In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there may be various
cases in which the response variable is affected by more than one predictor variable; for such cases, the
Multiple Linear Regression algorithm is used. Moreover, Multiple Linear Regression is an extension of
Simple Linear regression as it takes more than one predictor variable to predict the response variable. We
can define it as:
“Multiple Linear Regression is one of the important regression algorithms which models the linear
relationship between a single dependent continuous variable and more than one independent variable.”
Example: Prediction of CO2 emission based on engine size and number of cylinders in a car.
Some key points about Multiple Linear Regression (MLR):
a) For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or independent variables may be continuous or categorical.
b) Each feature variable must model a linear relationship with the dependent variable.
c) MLR tries to fit a regression line (hyperplane) through a multidimensional space of data points.

MLR equation: In Multiple Linear Regression, the target variable(Y) is a linear combination of multiple
predictor variables x1, x2, x3, ...,xn. Since it is an enhancement of Simple Linear Regression, so the same
is applied for the multiple linear regression equation, the equation becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Where,
Y = output/response variable,
b0, b1, b2, b3, ..., bn = coefficients of the model,
x1, x2, x3, x4, ... = the various independent/feature variables.
Assumptions for Multiple Linear Regression:
A linear relationship should exist between the target and the predictor variables.
The regression residuals must be normally distributed.
MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.
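
A minimal sketch of the CO2-emission example follows, fitting Y = b0 + b1·x1 + b2·x2 by ordinary least squares with numpy. The engine-size, cylinder, and emission numbers are hypothetical values used only to show the mechanics.

```python
import numpy as np

# Hypothetical data: [engine_size (litres), cylinders] -> CO2 emission (g/km).
X = np.array([[1.0, 4], [1.6, 4], [2.0, 4], [2.5, 6],
              [3.0, 6], [3.5, 6], [4.0, 8]], dtype=float)
y = np.array([99, 115, 135, 170, 190, 210, 245], dtype=float)

# Add a column of ones so the intercept b0 is estimated along with b1 and b2.
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: solves min ||X1 @ b - y||^2.
coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
b0, b1, b2 = coeffs
print(f"CO2 ≈ {b0:.1f} + {b1:.1f}*engine_size + {b2:.1f}*cylinders")

# Predict for a hypothetical 2.4 L, 4-cylinder engine.
print("prediction:", b0 + b1 * 2.4 + b2 * 4)
```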
Logistic Regression:
Logistic regression is another supervised learning algorithm, used to solve classification problems. In classification problems the dependent variable is binary or discrete, such as 0 or 1. The logistic regression algorithm works with categorical variables such as 0 or 1, Yes or No, True or False, Spam or Not Spam, etc. It is a predictive analysis algorithm which works on the concept of probability. Logistic regression is a type of regression, but it differs from the linear regression algorithm in how it is used. Logistic regression uses the sigmoid function (logistic function) to map the output of the linear model to a probability between 0 and 1; this sigmoid function is used to model the data in logistic regression. The function can be represented as:
F(x) = 1 / (1 + e^(−x))
F(x) = output between 0 and 1,
x = input to the function,
e = base of the natural logarithm.
When we provide the input values (data) to the function, it gives the S-curve as follows:
It uses the concept of a threshold: values above the threshold are rounded up to 1, and values below the threshold are rounded down to 0 (see the sketch below).
There are three types of logistic regression:
Binary (0/1, pass/fail),
Multinomial (e.g. cats, dogs, lions),
Ordinal (low, medium, high).
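
The small sketch below shows the sigmoid and the thresholding step. The weights w and b and the input values are assumed numbers, standing in for a model that has already been fitted.

```python
import numpy as np

def sigmoid(z):
    """Logistic function F(z) = 1 / (1 + e^(-z)); the output always lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed, hand-picked coefficients of a fitted logistic model: z = w*x + b.
w, b = 1.5, -4.0
threshold = 0.5

for x in [1.0, 2.5, 3.0, 4.0]:
    p = sigmoid(w * x + b)                 # predicted probability of class 1
    label = 1 if p >= threshold else 0     # above the threshold -> 1, below -> 0
    print(f"x={x}: P(class=1)={p:.3f} -> predicted class {label}")
```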

Bayesian Learning
Some Terms to Understand Before delving into Bayesian learning:
Random variable (Stochastic variable) — In statistics, the random variable is a variable whose possible
values are a result of a random event. Therefore, each possible value of a random variable has some
probability attached to it to represent the likelihood of those values.
Probability distribution — The function that defines the probability of different outcomes/values of a
random variable. The continuous probability distributions are described using probability density
functions whereas discrete probability distributions can be represented using probability mass functions.
Conditional probability — This is a measure of probability P(A|B) of an event A given that another
event B has occurred.
Joint probability distribution - The probability distribution over all combinations of values of two or more random variables, e.g. P(A, B) for a pair of random variables A and B.
Imagine a situation where your friend gives you a new coin and asks you the fairness of the coin (or the
probability of observing heads) without even flipping the coin once. In fact, you are also aware that your
friend has not made the coin biased. In general, you have seen that coins are fair, thus you expect the
probability of observing heads is 0.5. In the absence of any such observations, you assert the fairness of
the coin only using your past experiences or observations with coins.
Suppose that you are allowed to flip the coin 10 times in order to determine the fairness of the coin. Your
observations from the experiment will fall under one of the following cases:
Case 1: observing 5 heads and 5 tails.
Case 2: observing h heads and 10−h tails, where h ≠ 5.
If case 1 is observed, you are now more certain that the coin is a fair coin, and you will decide that the
probability of observing heads is 0.5 with more confidence.
If case 2 is observed, you can either:
1. Neglect your prior beliefs since now you have new data and decide the probability of observing heads
is h/10 by solely depending on recent observations.
2. Adjust your belief accordingly to the value of h that you have just observed, and decide the probability
of observing heads using your recent observations.
The first method is the frequentist approach, where we ignore our prior beliefs when making decisions. However, the second method seems more reasonable, because 10 coin flips are insufficient to determine the fairness of a coin.
Therefore, we can make better decisions by combining our recent observations and beliefs that we have
gained through our past experiences. It is this thinking model that uses our most recent observations
together with our beliefs or inclination for critical thinking that is known as Bayesian thinking. Moreover,
assume that your friend allows you to conduct another 10 coin flips. Then, we can use these new
observations to further update our beliefs. As we gain more data, we can incrementally update our beliefs
increasing the certainty of our conclusions. This is known as incremental learning, where you update your
knowledge incrementally with new evidence.
Bayesian learning comes into play on such occasions, where we are unable to rely on frequentist statistics due to the drawbacks discussed above. We can use Bayesian learning to address all these drawbacks, and it comes with additional capabilities (such as incremental updates of the posterior) when testing a hypothesis to estimate the unknown parameters of a machine learning model.
Bayesian learning uses Bayes' theorem to determine the conditional probability of a hypotheses given
some evidence or observations.
Bayes Theorem : Bayes' theorem describes how the conditional probability of an event or a hypothesis
can be computed using evidence and prior knowledge. It is similar to concluding that our code has no
bugs given the evidence that it has passed all the test cases, including our prior belief that we have rarely
observed any bugs in our code. However, this intuition goes beyond that simple hypothesis test where
there are multiple events or hypotheses involved (let us not worry about this for the moment). The Bayes'
theorem is given by:
P(θ|X) = P(X|θ) · P(θ) / P(X)

Where P(θ|X) = posterior probability; P(θ) = prior probability; P(X|θ) = likelihood, the probability of observing the evidence X given the hypothesis θ; and P(X) = evidence, the probability of the data.

Consider the hypothesis that there are no bugs in our code. 𝛳 and X denote that our code is bug-free and
passes all the test cases respectively.
P(θ) — Prior Probability is the probability of the hypothesis θ being true before applying the Bayes'
theorem. Prior represents the beliefs that we have gained through past experience, which refers to either
common sense or an outcome of Bayes' theorem for some past observations.
For the example given, prior probability denotes the probability of observing no bugs in our code.
However, since this is the first time we are applying Bayes' theorem, we have to decide the priors using
other means (otherwise we could use the previous posterior as the new prior). Let us assume that it is very
unlikely to find bugs in our code because rarely have we observed bugs in our code in the past. With our
past experience of observing fewer bugs in our code, we can assign our prior P(θ) with a higher
probability. However, for now, let us assume that P(θ) = p.
P(X|θ) — Likelihood is the conditional probability of the evidence given a hypothesis. The likelihood is
mainly related to our observations or the data we have. If it is given that our code is bug-free, then the
probability of our code passing all test cases is given by the likelihood. Assuming we have implemented
these test cases correctly, if no bug is presented in our code, then it should pass all the test cases.
Therefore, the likelihood P(X|θ) = 1.
P(X) — The evidence term denotes the probability of the evidence or data. It can be expressed as a summation (or integral) over all possible hypotheses of the likelihood weighted by the prior. Therefore, we can write P(X) as:

P(X) = Σ_{θ ∈ Θ} P(X|θ) · P(θ)

where Θ is the set of all hypotheses.


Maximum a Posteriori (MAP): We can use MAP to determine the most plausible hypothesis from a set of hypotheses. According to MAP, the hypothesis that has the maximum posterior probability is selected. Therefore, we can express the hypothesis θ_MAP chosen using MAP as follows:

θ_MAP = argmax_θ P(θ|X) = argmax_θ P(X|θ) · P(θ) / P(X)

The argmax_θ operator selects the event or hypothesis θi that maximizes the posterior probability P(θi|X). Since P(X) does not depend on θ, it can be ignored when taking the argmax.
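
A minimal sketch of the bug-free-code example follows. It assumes a prior p = 0.9 for the "bug-free" hypothesis and, purely so that the numbers work out, an assumed likelihood of 0.4 for the tests passing when the code is buggy; it then computes the evidence, the posteriors, and the MAP hypothesis.

```python
# Hypotheses: "bug-free" (theta1) and "buggy" (theta2).
p = 0.9                                    # assumed prior P(bug-free)
priors = {"bug-free": p, "buggy": 1 - p}

# Likelihood of the evidence X = "all test cases pass" under each hypothesis.
# P(X | bug-free) = 1 as in the text; P(X | buggy) = 0.4 is an assumed value.
likelihoods = {"bug-free": 1.0, "buggy": 0.4}

# Evidence: P(X) = sum over hypotheses of P(X | theta) * P(theta).
evidence = sum(likelihoods[h] * priors[h] for h in priors)

# Posteriors via Bayes' theorem, then the MAP hypothesis.
posteriors = {h: likelihoods[h] * priors[h] / evidence for h in priors}
theta_map = max(posteriors, key=posteriors.get)

print("posteriors:", {h: round(v, 3) for h, v in posteriors.items()})
print("MAP hypothesis:", theta_map)
```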

Naive Bayes :
It is a simple technique for constructing classifier models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is no single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A
naive Bayes classifier considers each of these features to contribute independently to the probability that
this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter
features. For some types of probability models, naive Bayes classifiers can be trained very efficiently in a
supervised learning setting. In many practical applications, parameter estimation for naive Bayes models
uses the method of maximum likelihood; in other words, one can work with the naive Bayes model
without accepting Bayesian probability or using any Bayesian methods. Despite their naive design and
apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex
real-world situations. In 2004, an analysis of the Bayesian classification problem
showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes
classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that
Bayes classification is outperformed by other approaches, such as boosted trees or random forests. An
advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.
Even if the colour, roundness, and diameter features depend on each other or on the existence of other features, each of these properties is treated as contributing independently to the probability that the fruit is an apple, and that is why the method is known as 'Naive'.
The Naive Bayes model is easy to build and particularly useful for very large datasets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods in some settings. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)

How does the Naive Bayes algorithm work?


Let's understand it using an example.
Below we have a training dataset of weather conditions and the corresponding target variable 'Play' (indicating whether a game is played). We need to classify whether players will play or not based on the weather condition.

Let’s follow the below steps to perform it.


Step 1: Convert the dataset into a frequency table.
Step 2: Create a likelihood table by computing the probabilities, for example P(Overcast) = 0.29 and P(Play = Yes) = 0.64.
Step 3: Now use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.

Problem:
Players will play if the weather is sunny. Is this statement correct?
We can solve it using the method of posterior probability discussed above:
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny).

Solution : Here we have


P(Sunny | Yes) = 3/9 = 0.33,
P(Sunny) = 5/14 = 0.36,
P(Yes) = 9/14 = 0.64
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher posterior probability, so the prediction is that players will play.

Naive Bayes uses a similar method to predict the probability of each class based on various attributes. The algorithm is mostly used in text classification and in problems having multiple classes.
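
The short sketch below reproduces the P(Yes | Sunny) calculation and the corresponding P(No | Sunny). The counts (9 Yes days and 5 No days out of 14, of which 3 and 2 respectively were Sunny) are the ones implied by the probabilities stated above.

```python
# Counts implied by the example: 14 days total, 9 "Yes" and 5 "No";
# of these, 3 "Yes" days and 2 "No" days were Sunny.
counts = {"Yes": {"Sunny": 3, "total": 9}, "No": {"Sunny": 2, "total": 5}}
total_days = 14
p_sunny = 5 / 14

posteriors = {}
for label, c in counts.items():
    likelihood = c["Sunny"] / c["total"]               # P(Sunny | label)
    prior = c["total"] / total_days                    # P(label)
    posteriors[label] = likelihood * prior / p_sunny   # Bayes' theorem

print(posteriors)                                      # approx. Yes: 0.6, No: 0.4
print("prediction:", max(posteriors, key=posteriors.get))   # Yes
```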

What are the Pros and Cons of Naive Bayes?


Pros: It is easy and fast to predict the class of a test dataset, and it also performs well in multi-class prediction. When the assumption of independence holds, a Naive Bayes classifier performs better than other models such as logistic regression, and it needs less training data. It performs well with categorical input variables compared to numerical variables; for numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Cons: If a categorical variable has a category in the test dataset that was not observed in the training dataset, the model will assign it a probability of 0 (zero) and will be unable to make a prediction. This is often known as the "zero frequency" problem. To solve it, we can use a smoothing technique; one of the simplest smoothing techniques is Laplace estimation. On the other hand, naive Bayes is also known to be a poor estimator, so its predicted probabilities are not to be taken too seriously.

Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost
impossible that we get a set of predictors which are completely independent.

Applications of Naive Bayes Algorithms :

Real-time prediction: Naive Bayes is an eager learning classifier and it is certainly fast, so it can be used for making predictions in real time.
Multi-class prediction: This algorithm is also well known for its multi-class prediction capability; we can predict the probability of multiple classes of the target variable.
Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are widely used in text classification (owing to good results on multi-class problems and the independence assumption) and often achieve a high success rate compared to other algorithms. As a result, naive Bayes is widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
Recommendation systems: A Naive Bayes classifier and collaborative filtering together can build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

What is a Bayesian Belief Network?


Bayesian Belief Network or Bayesian Network or Belief Network is a Probabilistic Graphical Model
(PGM) that represents conditional dependencies between random variables through a Directed Acyclic
Graph (DAG). Bayesian Networks are applied in many fields. For example, disease diagnosis, optimized
web search, spam filtering, gene regulatory networks, etc. And this list can be extended. The main
objective of these networks is to understand the structure of causal relations. To clarify this, let's consider a disease diagnosis problem. Given symptoms and the resulting diseases, we construct our Belief Network, and when a new patient arrives, we can infer which disease or diseases the new
patient may have, by providing probabilities for each disease. Similarly, such causal relations can be constructed for other problems, and inference techniques can be applied to obtain interesting results.
The probabilities in a belief network are calculated with the following factorization of the joint distribution:

P(X1, X2, ..., Xn) = Π_i P(Xi | Parents(Xi))

As the formula shows, to calculate the joint distribution we need the conditional probabilities indicated by the network; and once we have the joint distribution, we can start to ask interesting questions.
For example, in the first example, we ask for the probability of “RAIN” if “SEASON” is “WINTER” and
“DOG BARK” is “TRUE”.
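
As a minimal sketch of this kind of query, the code below builds a tiny SEASON → RAIN → DOG BARK chain with made-up conditional probability tables, computes joint probabilities by multiplying each node's conditional probability given its parents, and answers the query by enumeration. All the probability values are assumptions chosen only for illustration.

```python
# Hypothetical CPTs for a small chain: SEASON -> RAIN -> DOG_BARK.
p_season = {"winter": 0.25, "summer": 0.75}
p_rain_given_season = {"winter": 0.6, "summer": 0.2}   # P(rain=True | season)
p_bark_given_rain = {True: 0.8, False: 0.4}            # P(bark=True | rain)

def joint(season, rain, bark):
    """P(season, rain, bark) = P(season) * P(rain | season) * P(bark | rain)."""
    pr = p_rain_given_season[season] if rain else 1 - p_rain_given_season[season]
    pb = p_bark_given_rain[rain] if bark else 1 - p_bark_given_rain[rain]
    return p_season[season] * pr * pb

# Query: P(RAIN = True | SEASON = winter, DOG_BARK = True), by enumeration.
num = joint("winter", True, True)
den = joint("winter", True, True) + joint("winter", False, True)
print("P(rain | winter, bark) =", round(num / den, 3))
```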

What is Expectation-Maximization algorithm?


The Expectation-Maximization (EM) algorithm is used to find (local) maximum-likelihood estimates of the parameters of a statistical model in cases where latent variables are involved or the data is missing or incomplete. It can be used for latent variables (variables that are not directly observable and are instead inferred from the values of other observed variables), provided the general form of the probability distribution governing those latent variables is known. This algorithm is at the base of many unsupervised clustering algorithms in the field of ML.
Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the dataset, estimate
(guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the expectation (E) step is
used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.
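
Below is a minimal numpy sketch of these four steps for a two-component, one-dimensional Gaussian mixture, a standard use of EM. The data, the starting parameters, and the fixed number of iterations are assumptions made only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians; which component produced each
# point is the unobserved (latent) information.
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

# 1. Starting parameters (arbitrary guesses): mixing weights, means, variances.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gaussian_pdf(x, mean, variance):
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

for _ in range(50):
    # 2. E-step: estimate the missing information, i.e. the responsibility
    #    (posterior probability) of each component for each data point.
    dens = np.stack([pi[k] * gaussian_pdf(data, mu[k], var[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)

    # 3. M-step: update the parameters using the "completed" data.
    nk = resp.sum(axis=1)
    pi = nk / len(data)
    mu = (resp * data).sum(axis=1) / nk
    var = (resp * (data - mu[:, None]) ** 2).sum(axis=1) / nk
    # 4. In practice, stop once the log-likelihood stops improving (convergence).

print("weights:", pi.round(2), "means:", mu.round(2), "variances:", var.round(2))
```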

The essence of Expectation-Maximization algorithm is to use the available observed data of the dataset to
estimate the missing data and then using that data to update the values of the parameters. Let us
understand the EM algorithm in detail. Initially, a set of initial values of the parameters are considered. A
set of incomplete observed data is given to the system with the assumption that the observed data comes
from a specific model. The next step is known as “Expectation” – step or E-step. In this step, we use the
observed data in order to estimate or guess the values of the missing or incomplete data. It is basically
used to update the variables.
The next step is known as the "Maximization" step or M-step. In this step, we use the complete data generated in the preceding Expectation step to update the values of the parameters; it is basically used to update the hypothesis. Finally, in the fourth step, it is checked whether the values are converging or not. If yes, we stop; otherwise we repeat step 2 and step 3, i.e. the Expectation step and the Maximization step, until convergence occurs.
The flow of the EM algorithm is therefore: initialize parameters → E-step → M-step → check for convergence (repeating the E and M steps until converged).

Usage of EM algorithm – It can be used to fill the missing data in a sample. It can be used as the basis
of unsupervised learning of clusters. It can be used for the purpose of estimating the parameters of Hidden
Markov Model (HMM). It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
1. It is always guaranteed that likelihood will increase with each iteration.
2. The E-step and M-step are often pretty easy for many problems in terms of implementation.
3. Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm –
1. It has slow convergence.
2. It makes convergence to the local optima only.
3. It requires both the probabilities, forward and backward (numerical optimization requires only
forward probability).

Support Vector Machine - Machine Learning Algorithms

Introduction

What is Support Vector Machine?

The objective of the support vector machine algorithm is to find a hyper plane in an N-dimensional space
(N — the number of features) that distinctly classifies the data points.

Possible hyper planes

To separate the two classes of data points, there are many possible hyper planes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
Hyperplanes and Support Vectors

Figure : Hyper planes in 2D and 3D feature space

Hyper planes are decision boundaries that help classify the data points. Data points falling on either side of
the hyper plane can be attributed to different classes. Also, the dimension of the hyper plane depends upon
the number of features. If the number of input features is 2, then the hyper plane is just a line. If the
number of input features is 3, then the hyper plane becomes a two-dimensional plane. It becomes difficult
to imagine when the number of features exceeds 3.

Support Vectors: Support vectors are data points that are closer to the hyper plane and influence the
position and orientation of the hyper plane. Using these support vectors, we maximize the margin of the
classifier. Deleting the support vectors will change the position of the hyper plane. These are the points
that help us build our SVM.
Large Margin Intuition:

In logistic regression, we take the output of the linear function and squash the value into the range [0,1] using the sigmoid function. If the squashed value is greater than a threshold value (0.5) we assign it the label 1, else we assign it the label 0. In SVM, we take the output of the linear function and if that output is greater than 1, we identify it with one class, and if the output is less than -1, we identify it with the other class. Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values ([-1,1]) which acts as the margin.

Cost Function and Gradient Updates:

In the SVM algorithm, we are looking to maximize the margin between the data points and the hyper
plane. The loss function that helps maximize the margin is hinge loss.
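
As a small illustration, the sketch below computes the hinge loss for a linear decision function f(x) = w·x + b with labels in {-1, +1}; points that sit outside the margin contribute zero loss. The weights and data points are assumed values.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Mean hinge loss: max(0, 1 - y * (w·x + b)) averaged over the samples.
    Points with y * f(x) >= 1 (outside the margin) contribute zero loss."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).mean()

# Assumed weights and a few 2-D points with labels in {-1, +1}.
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.0, 2.0], [0.5, 0.2], [-0.2, -0.5]])
y = np.array([+1, -1, +1, -1])

print("hinge loss:", hinge_loss(w, b, X, y))
```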

How Does SVM Work?

The basics of Support Vector Machines and how they work are best understood with a simple example. Let's imagine we have two tags, red and blue, and our data has two features, x and y. We want a classifier that, given a pair of (x, y) coordinates, outputs whether it's red or blue. We plot our already labelled training data on a plane:
A support vector machine takes these data points and outputs the hyper plane (which in two dimensions is simply a line) that best separates the tags. This line is the decision boundary: anything that falls on one side of it we will classify as blue, and anything that falls on the other side as red.

In 2D, the best hyperplane is simply a line

But, what exactly is the best hyper plane?

For SVM, it’s the one that maximizes the margins from both tags. In other words: the hyper plane
(remember it’s a line in this case) whose distance to the nearest element of each tag is the largest.
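
A minimal sketch of this in code, using scikit-learn's SVC with a linear kernel on a handful of made-up red/blue points, looks like the following; the coordinates and the query point are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# A few made-up (x, y) points with two tags, "red" and "blue".
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
labels = ["red", "red", "red", "blue", "blue", "blue"]

# A large C approximates a hard margin; kernel="linear" looks for a straight line.
clf = SVC(kernel="linear", C=1000)
clf.fit(X, labels)

print("support vectors:\n", clf.support_vectors_)   # points that define the margin
print("prediction for (3, 4):", clf.predict([[3, 4]]))
```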

Not all hyperplanes are created equal

Nonlinear data
Now this example was easy, since the data was clearly linearly separable: we could draw a straight line to separate the two classes. Sadly, usually things aren't that simple. Take a look at this case:

A more complex dataset


It’s pretty clear that there’s not a linear decision boundary (a single straight line that separates both tags).
However, the vectors are very clearly segregated and it looks as though it should be easy to separate
them.

So here’s what we’ll do: we will add a third dimension. Up until now we had two dimensions: x and y.
We create a new z dimension, and we rule that it be calculated a certain way that is convenient for us: 

z = x² + y² (you’ll notice that’s the equation for a circle).

This will give us a three-dimensional space. Taking a slice of that space, it looks like this:

From a different perspective, the data is now in two linearly separated groups
What can SVM do with this? Let's see:

Note that since we are in three dimensions now, the hyper plane is a plane parallel to the x-y plane at a certain z (let's say z = 1).

What’s left is mapping it back to two dimensions:

Back to our original view, everything is now neatly separated.

And there we go! Our decision boundary is a circumference of radius 1, which separates both tags using SVM.
The kernel trick

In our example we found a way to classify nonlinear data by cleverly mapping our space to a higher
dimension. However, it turns out that calculating this transformation can get pretty computationally
expensive: there can be a lot of new dimensions, each one of them possibly involving a complicated
calculation. Doing this for every vector in the dataset can be a lot of work, so it’d be great if we could
find a cheaper solution.

Here’s a trick: SVM doesn’t need the actual vectors to work its magic, it actually can get by only with
the dot products between them. This means that we can sidestep the expensive calculations of the new
dimensions! This is what we do instead:

Imagine the new space we want:

 z = x² + y²

Figure out what the dot product in that space looks like:

a · b = xa·xb + ya·yb + za·zb

Substituting z = x² + y²:

a · b = xa·xb + ya·yb + (xa² + ya²)·(xb² + yb²)

Tell SVM to do its thing, but using the new dot product — we call this a kernel function.

The kernel trick allows us to sidestep a lot of expensive calculations. Normally the kernel is linear, and we get a linear classifier. However, by using a nonlinear kernel (like the one above) we can get a nonlinear classifier without transforming the data at all: we only change the dot product to that of the space that we want, and SVM happily works along.

Note that the kernel trick isn’t actually part of SVM. It can be used with other linear classifiers such as
logistic regression also. A support vector machine only takes care of finding the decision boundary.
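
The short sketch below checks, for two arbitrary points, that the kernel k(a, b) = a·b + (|a|²)(|b|²) equals the ordinary dot product after the explicit mapping (x, y) → (x, y, x² + y²), and then hands the same function to scikit-learn's SVC as a callable kernel so the data never has to be lifted explicitly. The sample points and labels are made up.

```python
import numpy as np
from sklearn.svm import SVC

def lift(p):
    """Explicit mapping to the higher-dimensional space: (x, y) -> (x, y, x^2 + y^2)."""
    x, y = p
    return np.array([x, y, x ** 2 + y ** 2])

def kernel(A, B):
    """k(a, b) = a·b + (|a|^2)(|b|^2), computed as a Gram matrix without lifting."""
    return A @ B.T + (A ** 2).sum(axis=1)[:, None] * (B ** 2).sum(axis=1)[None, :]

# Two arbitrary 2-D points: the kernel equals the dot product in the lifted space.
a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(kernel(a[None, :], b[None, :])[0, 0], "==", lift(a) @ lift(b))

# The same callable can be handed to an SVM, so the data is never lifted explicitly.
X = np.array([[0.1, 0.2], [-0.2, 0.1], [2.0, 2.0], [-2.0, 2.1]])
y = [0, 0, 1, 1]
clf = SVC(kernel=kernel).fit(X, y)
print("predictions for (0, 0) and (2, -2):", clf.predict([[0.0, 0.0], [2.0, -2.0]]))
```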

********************************************
