
UNIT II

Linear Regression in Machine Learning


Linear regression is one of the simplest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis.
Linear regression makes predictions for continuous/real or numeric variables
such as sales, salary, age, product price, etc.
The linear regression algorithm models a linear relationship between a dependent
variable (y) and one or more independent variables (x), hence the name linear
regression. Because the relationship is linear, the model describes how the value
of the dependent variable changes as the value of the independent variable changes.
The linear regression model fits a sloped straight line representing the
relationship between the variables.

Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,
y = Dependent Variable (Target Variable)
x = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The observed values of the x and y variables form the training dataset used to fit
the linear regression model.
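As a minimal illustration, the model y = a0 + a1x can be fitted with ordinary least squares in a few lines of Python (the experience/salary numbers below are invented purely for illustration):

import numpy as np

# Toy data: years of experience (x) vs. salary in thousands (y); made-up values
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 42, 48, 55], dtype=float)

# Fit y = a0 + a1*x by ordinary least squares; polyfit returns [slope, intercept]
a1, a0 = np.polyfit(x, y, deg=1)
print(f"intercept a0 = {a0:.2f}, slope a1 = {a1:.2f}")

# Predict for a new value of x
print("predicted y for x = 6:", a0 + a1 * 6.0)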
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm
is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm
is called Multiple Linear Regression.
Linear Regression Line
A straight line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent
variable increases on the X-axis, the relationship is termed a
Positive linear relationship.

o Negative Linear Relationship:

If the dependent variable decreases on the Y-axis as the independent
variable increases on the X-axis, the relationship is called a
Negative linear relationship.
Logistic Regression in Machine Learning
o Logistic regression is one of the most popular Machine Learning
algorithms, which comes under the Supervised Learning technique. It is
used for predicting the categorical dependent variable using a given set
of independent variables.
o Logistic regression predicts the output of a categorical dependent
variable. Therefore the outcome must be a categorical or discrete
value. It can be either Yes or No, 0 or 1, true or False, etc. but instead of
giving the exact value as 0 and 1, it gives the probabilistic values which
lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how
it is used: Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S"
shaped logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of
something such as whether the cells are cancerous or not, a mouse is
obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm
because it has the ability to provide probabilities and classify new data
using continuous and discrete datasets.
o Logistic Regression can be used to classify observations using
different types of data and can easily determine the most effective
variables for the classification.

Note: Logistic regression uses the concept of predictive modeling just as
regression does, which is why it is called logistic regression; however, since it
is used to classify samples, it falls under the classification algorithms.
Logistic Function (Sigmoid Function):
o The sigmoid function is a mathematical function used to map the
predicted values to probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which
cannot go beyond this limit, so it forms a curve like the "S" form. The S-
form curve is called the Sigmoid function or the logistic function.
o In logistic regression, we use the concept of the threshold value, which
defines the probability of either 0 or 1. Such as values above the
threshold value tends to 1, and a value below the threshold values
tends to 0.
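A minimal sketch of the sigmoid function and the threshold rule described above (the threshold of 0.5 is just the conventional default):

import numpy as np

def sigmoid(z):
    # Map any real value z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)                    # values such as 0.018, 0.269, 0.5, 0.731, 0.982
labels = (probs >= 0.5).astype(int)   # apply the threshold to obtain class 0 or 1
print(probs, labels)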
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are
given below:
o We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide the above
equation by (1 - y):

y / (1 - y); 0 for y = 0, and infinity for y = 1

o But we need a range between -[infinity] and +[infinity], so taking the logarithm
of the equation, it becomes:

log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.


Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three
types:
o Binomial: In binomial Logistic regression, there can be only two
possible types of the dependent variables, such as 0 or 1, Pass or Fail,
etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or
more possible unordered types of the dependent variable, such as "cat",
"dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of dependent variables, such as "low", "Medium", or
"High".

What is Bayes Theorem?


Bayes' theorem is one of the most popular machine learning concepts; it
helps to calculate the probability of one event occurring with uncertain
knowledge, given that another related event has already occurred.
Bayes' theorem can be derived using the product rule and the conditional probability
of event X with known event Y:
o According to the product rule, we can express the probability of
event X with known event Y as follows:

P(X ∩ Y) = P(X|Y) P(Y)    {equation 1}

o Further, the probability of event Y with known event X:

P(X ∩ Y) = P(Y|X) P(X)    {equation 2}

Mathematically, Bayes' theorem can be expressed by equating the right-hand sides of
both equations and solving for P(X|Y):

P(X|Y) = P(Y|X) P(X) / P(Y)

The above equation is called Bayes' Rule or Bayes' Theorem.
o P(X|Y) is called the posterior, which we need to calculate. It is defined
as the updated probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of the evidence given
that the hypothesis is true.
o P(X) is called the prior probability, i.e., the probability of the hypothesis
before considering the evidence.
o P(Y) is called the marginal probability. It is defined as the probability of
the evidence under any consideration.
Hence, Bayes Theorem can be written as:
posterior = likelihood * prior / evidence
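A minimal numeric sketch of the rule posterior = likelihood * prior / evidence; the spam-filter numbers are invented purely for illustration:

# Hypothesis X: an email is spam; evidence Y: it contains the word "offer"
prior = 0.20          # P(X): fraction of emails that are spam (assumed)
likelihood = 0.60     # P(Y|X): "offer" appears in spam (assumed)
evidence = 0.25       # P(Y): "offer" appears in any email (assumed)

posterior = likelihood * prior / evidence   # P(X|Y) by Bayes' theorem
print(posterior)                            # 0.48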
Prerequisites for Bayes Theorem
While studying Bayes' theorem, we need to understand a few important
concepts. These are as follows:
1. Experiment
An experiment is defined as a planned operation carried out under controlled
conditions, such as tossing a coin, drawing a card, or rolling a die.
2. Sample Space
The results we get during an experiment are called possible outcomes,
and the set of all possible outcomes of an experiment is known as the sample space.
For example, if we are rolling a die, the sample space will be:
S1 = {1, 2, 3, 4, 5, 6}
Similarly, if our experiment is tossing a coin and recording its
outcome, then the sample space will be:
S2 = {Head, Tail}
3. Event
An event is defined as a subset of the sample space of an experiment. It is also
called a set of outcomes.
Assume in our experiment of rolling a dice, there are two event A and B such
that;
A = Event when an even number is obtained = {2, 4, 6}
B = Event when a number is greater than 4 = {5, 6}
o Probability of the event A, P(A) = Number of favourable outcomes /
Total number of possible outcomes
P(A) = 3/6 = 1/2 = 0.5
o Similarly, probability of the event B, P(B) = Number of favourable
outcomes / Total number of possible outcomes
= 2/6
= 1/3
= 0.333
o Union of events A and B:
A ∪ B = {2, 4, 5, 6}
o Intersection of events A and B:
A ∩ B = {6}

o Disjoint Event: If the intersection of the event A and B is an empty set


or null then such events are known as disjoint event or mutually
exclusive events also.
4. Random Variable:
A random variable is a real-valued function that maps the sample space of an
experiment to the real line. A random variable takes on some set of values, each
with some probability. Strictly speaking, it is neither random nor a variable; it
behaves as a function, and it can be discrete, continuous, or a combination of both.
5. Exhaustive Event:
As the name suggests, a set of events in which at least one event must occur at a
time is called an exhaustive set of events of an experiment.
Thus, two events A and B are said to be exhaustive if either A or B definitely
occurs, and they are mutually exclusive; for example, while tossing a coin,
the outcome will be either a Head or a Tail.
6. Independent Event:
Two events are said to be independent when occurrence of one event does not
affect the occurrence of another event. In simple words we can say that the
probability of outcome of both events does not depends one another.
Mathematically, two events A and B are said to be independent if:
P(A ∩ B) = P(AB) = P(A) * P(B)
7. Conditional Probability:
Conditional probability is defined as the probability of an event A, given that
another event B has already occurred (i.e., A given B). It is represented
by P(A|B) and we can define it as:
P(A|B) = P(A ∩ B) / P(B)
8. Marginal Probability:
Marginal probability is defined as the probability of an event A occurring,
irrespective of any other event B. It is considered the probability of the
evidence under any consideration.
P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)
Here ~B represents the event that B does not occur.
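A minimal sketch that verifies the dice-example probabilities above by direct counting:

from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}          # event: an even number is obtained
B = {5, 6}             # event: a number greater than 4 is obtained

def prob(event):
    return Fraction(len(event), len(sample_space))

print(prob(A), prob(B))              # 1/2 and 1/3
print(prob(A | B), prob(A & B))      # union 2/3, intersection 1/6
# Conditional probability P(A|B) = P(A ∩ B) / P(B)
print(prob(A & B) / prob(B))         # 1/2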
What is concept learning, and how does it work?
Now you must be wondering what is concept learning in machine learning?
Concept learning, as a broader term, includes both case-based and instance-
based learning. At its core, concept learning involves the extraction of general
rules or patterns from specific instances to make predictions on new, unseen
data. The ultimate goal is for the machine to grasp abstract concepts and
apply them in diverse contexts.
Concept learning in machine learning is not confined to a single pattern; it
spans various approaches, including rule-based learning, neural networks,
decision trees, and more. The choice of approach depends on the nature of
the problem and the characteristics of the data.
The process of concept learning in machine learning involves iterative
refinement. The model learns from examples, refines its understanding of the
underlying concepts, and continually updates its knowledge as it encounters
new instances. This adaptability is a hallmark of effective concept learning
systems
Learning may be characterized as "the problem of exploring through a preset
space of candidate hypotheses for the theory that best matches the training
instances" in terms of machine learning, according to Tom Mitchell.
The acquisition of general concepts from previous experiences accounts for a
large portion of human learning. Humans, for example, distinguish between
various cars based on specific traits defined over a vast collection of
attributes. This particular collection of characteristics distinguishes the subset
of automobiles within the collection of vehicles; such a distinguishing collection
of features is what we call a concept.

Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-
dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast machine
learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of
the probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm comprises two words, Naïve and Bayes,
which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of the other features. For
example, if a fruit is identified on the basis of color, shape, and taste, then
a red, spherical, and sweet fruit is recognized as an apple. Hence each
feature individually contributes to identifying it as an apple, without
depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge.
It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) P(A) / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed
event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the
probability of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the
evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the
below example:
Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we should
play or not on a particular day according to the weather conditions. To
solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the weather conditions:
Weather    Yes    No
Overcast    5      0
Rainy       2      2
Sunny       3      2
Total      10      4
Likelihood table for the weather conditions:
Weather    No             Yes            P(Weather)
Overcast   0              5              5/14 = 0.35
Rainy      2              2              4/14 = 0.29
Sunny      2              3              5/14 = 0.35
All        4/14 = 0.29    10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 ≈ 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
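The same frequency/likelihood counting can be reproduced with a few lines of plain Python; this is a minimal sketch of the calculation above, not a general-purpose Naïve Bayes implementation:

from collections import Counter

# The Outlook/Play dataset from the table above
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

n = len(data)
play_counts = Counter(play for _, play in data)   # {'Yes': 10, 'No': 4}
joint_counts = Counter(data)                      # (outlook, play) frequencies

def posterior(play, outlook):
    # P(play | outlook) = P(outlook | play) * P(play) / P(outlook)
    p_outlook_given_play = joint_counts[(outlook, play)] / play_counts[play]
    p_play = play_counts[play] / n
    p_outlook = sum(1 for o, _ in data if o == outlook) / n
    return p_outlook_given_play * p_play / p_outlook

print(posterior("Yes", "Sunny"))   # ~0.60
print(posterior("No", "Sunny"))    # ~0.40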
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a
class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated,
so it cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier
is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment
analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a
normal distribution. This means if predictors take continuous values
instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the
data follows a multinomial distribution. It is primarily used for document
classification problems, i.e., deciding which category a particular document
belongs to, such as sports, politics, education, etc.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean variables,
such as whether a particular word is present or not in a document.
This model is also well known for document classification tasks.
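A minimal scikit-learn sketch (assuming scikit-learn is available) showing the three model types on invented toy data:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Gaussian NB: continuous features (toy 2-D points, two classes)
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 3.2], [3.1, 2.9]])
print(GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]]))      # -> [0]

# Multinomial NB: word-count features (e.g. document term counts)
X_counts = np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]]))  # -> [0]

# Bernoulli NB: binary word present/absent features
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))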
Bayesian Belief Network in artificial intelligence
A Bayesian belief network is a key computer technology for dealing with probabilistic events and for solving problems which have uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability distribution, and they also use probability theory for prediction and anomaly detection.
Real-world applications are probabilistic in nature, and to represent the relationship between multiple events, we need a Bayesian network. It can also be used in various tasks including prediction, anomaly detection, diagnostics, automated insight, reasoning, time series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an Influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional probabilities between random variables. These directed links or arrows connect pairs of nodes in the graph.
These links represent that one node directly influences the other node; if there is no directed link, the nodes are independent of each other.
o For example, consider a network graph whose nodes are the random variables A, B, C, and D.
o If node B is connected with node A by a directed arrow, then node A is called the parent of node B.
o If there is no directed path between node C and node A, node C is independent of node A.
Note: The Bayesian network graph does not contain any cycles. Hence, it is known as a directed acyclic graph or DAG.
The Bayesian network has mainly two components:
o Causal component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)), which determines the effect of the parent on that node.
A Bayesian network is based on the joint probability distribution and conditional probability, so let's first understand the joint probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.
P[x1, x2, x3, ..., xn] can be written in the following way in terms of the joint probability distribution (chain rule):
= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] .... P[xn-1 | xn] P[xn]
In general, for each variable Xi in the network we can write:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))

Explanation of Bayesian network:


Let's understand the Bayesian network through an example by creating a directed acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably responds to a burglary but also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken the responsibility to inform Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets confused with the phone ringing and calls then too. On the other hand, Sophia likes to listen to loud music, so sometimes she misses hearing the alarm. Here we would like to compute the probability of a burglary alarm.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and David and Sophia both called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that burglary and earthquake are the parent nodes of the alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend on the alarm probability.
o The network represents that David and Sophia do not directly perceive the burglary, do not notice the minor earthquake, and do not confer with each other before calling.
o The conditional distributions for each node are given as a conditional probability table, or CPT.
o Each row in the CPT must sum to 1 because all the entries in the table represent an exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if there are two parents, the CPT will contain 4 probability values.
List of all events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of probability as P[D, S, A, B, E], and we can rewrite this probability statement using the joint probability distribution:
P[D, S, A, B, E] = P[D | S, A, B, E] · P[S, A, B, E]
= P[D | S, A, B, E] · P[S | A, B, E] · P[A, B, E]
= P[D | A] · P[S | A, B, E] · P[A, B, E]
= P[D | A] · P[S | A] · P[A | B, E] · P[B, E]
= P[D | A] · P[S | A] · P[A | B, E] · P[B | E] · P[E]

Let's take the observed probabilities for the Burglary and Earthquake components:
P(B = True) = 0.002, which is the probability of a burglary.
P(B = False) = 0.998, which is the probability of no burglary.
P(E = True) = 0.001, which is the probability of a minor earthquake.
P(E = False) = 0.999, which is the probability that an earthquake has not occurred.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm A:
The conditional probability of Alarm A depends on Burglary and Earthquake:
B        E        P(A = True)    P(A = False)
True     True     0.94           0.06
True     False    0.95           0.05
False    True     0.31           0.69
False    False    0.001          0.999
Conditional probability table for David Calls:
The conditional probability that David will call depends on the probability of the Alarm:
A        P(D = True)    P(D = False)
True     0.91           0.09
False    0.05           0.95
Conditional probability table for Sophia Calls:
The conditional probability that Sophia calls depends on its parent node, the Alarm:
A        P(S = True)    P(S = False)
True     0.75           0.25
False    0.02           0.98
From the formula of joint distribution, we can write the problem statement in the form of probability distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B ∧ ¬E) * P(¬B) * P(¬E)
= 0.75 * 0.91 * 0.001 * 0.998 * 0.999
= 0.00068045
Hence, a Bayesian network can answer any query about the domain by using Joint distribution.
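A minimal hand-rolled sketch that encodes these CPTs as Python dictionaries and evaluates the same joint probability (it illustrates the factorization above, not a general Bayesian-network library):

# Prior probabilities
P_B = {True: 0.002, False: 0.998}            # Burglary
P_E = {True: 0.001, False: 0.999}            # Earthquake
# CPT entries give P(child = True | parents)
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}   # P(Alarm = True | B, E)
P_D = {True: 0.91, False: 0.05}              # P(David calls = True | Alarm)
P_S = {True: 0.75, False: 0.02}              # P(Sophia calls = True | Alarm)

def joint(d, s, a, b, e):
    # P(D=d, S=s, A=a, B=b, E=e) via the chain-rule factorization above
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_d = P_D[a] if d else 1 - P_D[a]
    p_s = P_S[a] if s else 1 - P_S[a]
    return p_d * p_s * p_a * P_B[b] * P_E[e]

# Alarm sounded, both neighbors called, no burglary, no earthquake
print(joint(d=True, s=True, a=True, b=False, e=False))   # = 0.00068045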
The semantics of Bayesian Network:
There are two ways to understand the semantics of a Bayesian network, which are given below:
1. To understand the network as the representation of the Joint probability distribution.
It is helpful to understand how to construct the network.
2. To understand the network as an encoding of a collection of conditional independence statements.

EM Algorithm in Machine Learning


The EM algorithm, proposed by Arthur Dempster, Nan Laird, and Donald Rubin in
1977, is a latent variable method for finding the local maximum likelihood
parameters of a statistical model. The EM (Expectation-Maximization) algorithm
is one of the most commonly used techniques in machine learning for obtaining
maximum likelihood estimates of models whose variables are partly observable
and partly unobserved (also called latent). It has various real-world
applications in statistics, including obtaining the mode of the posterior
marginal distribution of parameters in machine learning and data mining
applications.

In most real-life applications of machine learning, many relevant features are

available, but only some of them are observable while the rest are unobservable.
If a variable is observable, its value can be predicted from the training
instances. For variables that are latent, i.e., not directly observable, the
Expectation-Maximization (EM) algorithm plays a vital role in predicting their
values, provided that the general form of the probability distribution governing
those latent variables is known to us. In this topic, we will discuss a basic
introduction to the EM algorithm, a flow chart of the EM algorithm, and its
applications, advantages, and disadvantages.
What is an EM algorithm?
The Expectation-Maximization (EM) algorithm is defined as the combination
of various unsupervised machine learning algorithms, which is used to
determine the local maximum likelihood estimates (MLE) or maximum a
posteriori estimates (MAP) for unobservable variables in statistical models.
Further, it is a technique to find maximum likelihood estimation when the
latent variables are present. It is also referred to as the latent variable model.
A latent variable model consists of both observable and unobservable
variables where observable can be predicted while unobserved are inferred
from the observed variable. These unobservable variables are known as latent
variables.
Key Points:
o It is known as the latent variable model to determine MLE and MAP
parameters for latent variables.
o It is used to predict values of parameters in instances where data is
missing or unobservable for learning, and this is done until
convergence of the values occurs.
EM Algorithm
The EM algorithm is the combination of various unsupervised ML algorithms,
such as the k-means clustering algorithm. Being an iterative approach, it
consists of two modes. In the first mode, we estimate the missing or latent
variables. Hence it is referred to as the Expectation/estimation step (E-step).
Further, the other mode is used to optimize the parameters of the models so
that it can explain the data more clearly. The second mode is known as
the maximization-step or M-step.
o Expectation step (E - step): It involves the estimation (guess) of all
missing values in the dataset so that after completing this step, there
should not be any missing value.
o Maximization step (M - step): This step involves the use of estimated
data in the E-step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of
the dataset to estimate the missing data of the latent variables and then use
that data to update the values of the parameters in the M-step.
What is Convergence in the EM algorithm?
Intuitively, convergence means that the values being estimated stop changing
appreciably: e.g., if two successive estimates of the probabilities (or
parameters) differ by a very small amount, they are said to have converged.
In other words, whenever the values of the given variables match each other
across iterations, it is called convergence.
Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps: the Initialization Step,
Expectation Step, Maximization Step, and Convergence Step. These steps are
explained as follows (a small worked sketch follows the list):
o 1st Step: The very first step is to initialize the parameter values.
The system is provided with incomplete observed data, with the
assumption that the data is obtained from a specific model.
o 2nd Step: This step is known as the Expectation or E-step. It is used to
estimate or guess the values of the missing or incomplete data using
the observed data. The E-step primarily updates the (latent) variables.
o 3rd Step: This step is known as the Maximization or M-step, where we use
the complete data obtained from the 2nd step to update the parameter
values. The M-step primarily updates the hypothesis.
o 4th Step: The last step is to check whether the values of the latent variables
are converging or not. If yes, stop the process; otherwise, repeat the
process from step 2 until convergence occurs.
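A minimal NumPy sketch of EM for a two-component 1-D Gaussian mixture, where the component that generated each point is the latent variable; the data and starting values are invented, and a fixed iteration count stands in for a proper convergence check:

import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data drawn from two Gaussians; the generating component is unobserved
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

# Initialization step: starting guesses for means, variances, and mixing weights
mu = np.array([0.5, 4.0])
var = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

def gaussian(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):          # in practice, loop until the estimates stop changing
    # E-step: responsibilities = posterior probability of each component per point
    weighted = weights[:, None] * gaussian(data[None, :], mu[:, None], var[:, None])
    resp = weighted / weighted.sum(axis=0)

    # M-step: update the parameters using the responsibilities as soft counts
    Nk = resp.sum(axis=1)
    mu = (resp * data).sum(axis=1) / Nk
    var = (resp * (data - mu[:, None]) ** 2).sum(axis=1) / Nk
    weights = Nk / len(data)

print(mu, var, weights)      # the means should approach roughly 0 and 5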

Support Vector Machine (SVM) Algorithm

Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or nonlinear classification, regression, and even outlier detection
tasks. SVMs can be used for a variety of tasks, such as text classification, image classification, spam detection, handwriting identification, gene expression
analysis, face detection, and anomaly detection. SVMs are adaptable and efficient in a variety of applications because they can manage high-dimensional
data and nonlinear relationships.

SVM algorithms are very effective as we try to find the maximum separating hyperplane between the different classes available in the target feature.

Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. Though it can be applied to regression
problems as well, it is best suited for classification. The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional space
that can separate the data points of different classes in the feature space. The hyperplane is chosen so that the margin between the closest points of different
classes is as large as possible. The dimension of the hyperplane depends upon the number of features. If the number of input features is two,
then the hyperplane is just a line. If the number of input features is three, then the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the
number of features exceeds three.
Let’s consider two independent variables x1, x2, and one dependent variable which is either a blue circle or a red circle.

Linearly Separable Data points

From the figure above it’s very clear that there are multiple lines (our hyperplane here is a line because we are considering only two input features x1, x2)
that segregate our data points or do a classification between red and blue circles. So how do we choose the best line or in general the best hyperplane that
segregates our data points?

How does SVM work?

One reasonable choice as the best hyperplane is the one that represents the largest separation or margin between the two classes.

Multiple hyperplanes separate the data from two classes

So we choose the hyperplane whose distance from it to the nearest data point on each side is maximized. If such a hyperplane exists it is known as
the maximum-margin hyperplane/hard margin. So from the above figure, we choose L2. Let’s consider a scenario like shown below
Selecting hyperplane for data with outlier

Here we have one blue ball inside the region of the red balls. So how does SVM classify the data? The blue ball lying among the red ones is an
outlier of the blue class. The SVM algorithm has the ability to ignore such outliers and still find the hyperplane that maximizes the margin; SVM is
robust to outliers.

Hyperplane which is the most optimized one

So for this type of data, SVM finds the maximum margin as it did for the previous data sets, but in addition it adds a penalty each time a
point crosses the margin. The margins in these cases are called soft margins. With a soft margin, SVM tries to
minimize (1/margin + λ·Σ penalty), where λ weights the penalty term. Hinge loss is a commonly used penalty: if there is no violation there is no hinge loss,
and if there is a violation the hinge loss is proportional to the distance of the violation.

Till now, we were talking about linearly separable data(the group of blue balls and red balls are separable by a straight line/linear line). What to do if data
are not linearly separable?
Original 1D dataset for classification

Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel. For a point xi on the line, we create a new
variable yi as a function of its distance from the origin; if we plot this, we get something like what is shown below.

Mapping 1D data to 2D to become able to separate the two classes

In this case, the new variable y is created as a function of distance from the origin. A nonlinear function that creates such a new variable is referred to as a kernel.

Support Vector Machine Terminology

1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data points of different classes in a feature space. In the case
of linear classifications, it will be a linear equation i.e. wx+b = 0.
2. Support Vectors: Support vectors are the data points closest to the hyperplane, which play a critical role in deciding the hyperplane and
the margin.
3. Margin: Margin is the distance between the support vectors and the hyperplane. The main objective of the support vector machine algorithm is
to maximize the margin. A wider margin indicates better classification performance.
4. Kernel: Kernel is the mathematical function, which is used in SVM to map the original input data points into high-dimensional feature spaces,
so, that the hyperplane can be easily found out even if the data points are not linearly separable in the original input space. Some of the
common kernel functions are linear, polynomial, radial basis function(RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a hyperplane that properly separates the data points of
different categories without any misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a soft margin technique. Each data point has a
slack variable introduced by the soft-margin SVM formulation, which softens the strict margin requirement and permits certain
misclassifications or violations. It discovers a compromise between increasing the margin and reducing violations.
7. C: The regularisation parameter C in SVM balances margin maximisation against misclassification penalties. It decides the penalty for crossing the
margin or misclassifying data points. A greater value of C imposes a stricter penalty, which results in a smaller margin
and perhaps fewer misclassifications.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect classifications or margin violations. The objective function in
SVM is frequently formed by combining it with the regularisation term.
9. Dual Problem: SVM can be solved through the dual of its optimisation problem, which requires finding the Lagrange multipliers associated with the
support vectors. The dual formulation enables the use of kernel tricks and more efficient computation.

Advantages of SVM

● Effective in high-dimensional cases.

● Memory efficient, as it uses only a subset of the training points (the support vectors) in the decision function.
● Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.
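A minimal scikit-learn sketch (assuming scikit-learn is installed) of an SVM classifier with an RBF kernel; the 2-D points are invented for illustration:

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two clusters, labeled 0 and 1
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# C balances margin width against misclassification; gamma controls the RBF width
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print(clf.predict([[3, 2], [7, 6]]))   # expected: [0 1]
print(clf.support_vectors_)            # the points that define the margin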

Major Kernel Functions in Support Vector Machine


What is Kernel Method?
A set of techniques known as kernel methods are used in machine learning to
address classification, regression, and other prediction issues. They are built
around the idea of kernels, which are functions that gauge how similar two
data points are to one another in a high-dimensional feature space.
The fundamental idea of kernel methods is to convert the input data into a
high-dimensional feature space, which makes it simpler to distinguish
between classes or generate predictions. Kernel methods employ a kernel
function to implicitly map the data into the feature space, as opposed to
manually computing the coordinates in that feature space.
The most popular kind of kernel approach is the Support Vector Machine
(SVM), a binary classifier that determines the best hyperplane that most
effectively divides the two groups. In order to efficiently locate the ideal
hyperplane, SVMs map the input into a higher-dimensional space using a
kernel function.
Other examples of kernel methods include kernel ridge regression, kernel PCA,
and Gaussian processes. Since they are strong, adaptable, and
computationally efficient, kernel approaches are frequently employed in
machine learning. They are resilient to noise and outliers and can handle
sophisticated data structures like strings and graphs.
Kernel Method in SVMs
Support Vector Machines (SVMs) use kernel methods to transform the input
data into a higher-dimensional feature space, which makes it simpler to
distinguish between classes or generate predictions. Kernel approaches in
SVMs work on the fundamental principle of implicitly mapping input data into
a higher-dimensional feature space without directly computing the
coordinates of the data points in that space.
The kernel function in SVMs is essential in determining the decision boundary
that divides the various classes. In order to calculate the degree of similarity
between any two points in the feature space, the kernel function computes
their dot product.
The most commonly used kernel function in SVMs is the Gaussian or radial
basis function (RBF) kernel. The RBF kernel maps the input data into an
infinite-dimensional feature space using a Gaussian function. This kernel
function is popular because it can capture complex nonlinear relationships in
the data.
Other types of kernel functions that can be used in SVMs include the
polynomial kernel, the sigmoid kernel, and the Laplacian kernel. The choice of
kernel function depends on the specific problem and the characteristics of the
data.
Basically, kernel methods in SVMs are a powerful technique for solving
classification and regression problems, and they are widely used in machine
learning because they can handle complex data structures and are robust to
noise and outliers.
Characteristics of Kernel Function
Kernel functions used in machine learning, including in SVMs (Support Vector
Machines), have several important characteristics, including:
o Mercer's condition: A kernel function must satisfy Mercer's condition
to be valid. This condition ensures that the kernel function is positive
semi-definite, which means that the kernel (Gram) matrix it produces has
no negative eigenvalues.
o Positive definiteness: A kernel function is positive definite if it is
always greater than zero, except when the inputs are equal to each
other.
o Non-negativity: A kernel function is non-negative, meaning that it
produces non-negative values for all inputs.
o Symmetry: A kernel function is symmetric, meaning that it produces
the same value regardless of the order in which the inputs are given.
o Reproducing property: A kernel function satisfies the reproducing
property if it can be used to reconstruct the input data in the feature
space.
o Smoothness: A kernel function is said to be smooth if it produces a
smooth transformation of the input data into the feature space.
o Complexity: The complexity of a kernel function is an important
consideration, as more complex kernel functions may lead to over
fitting and reduced generalization performance.
Basically, the choice of kernel function depends on the specific problem and
the characteristics of the data, and selecting an appropriate kernel function
can significantly impact the performance of machine learning algorithms.
Major Kernel Function in Support Vector Machine
In Support Vector Machines (SVMs), there are several types of kernel
functions that can be used to map the input data into a higher-dimensional
feature space. The choice of kernel function depends on the specific problem
and the characteristics of the data.
Here are some most commonly used kernel functions in SVMs:
Linear Kernel
A linear kernel is a type of kernel function used in machine learning, including
in SVMs (Support Vector Machines). It is the simplest and most commonly
used kernel function, and it defines the dot product between the input vectors
in the original feature space.
The linear kernel can be defined as:

K(x, y) = x · y

Where x and y are the input feature vectors. The dot product of the input
vectors is a measure of their similarity or distance in the original feature
space.
When using a linear kernel in an SVM, the decision boundary is a linear
hyperplane that separates the different classes in the feature space. This
linear boundary can be useful when the data is already separable by a linear
decision boundary or when dealing with high-dimensional data, where the use
of more complex kernel functions may lead to overfitting.
Polynomial Kernel
A polynomial kernel is a particular kind of kernel function utilised in machine
learning, such as in SVMs (Support Vector Machines). It is a nonlinear
kernel function that employs polynomial functions to map the input data
into a higher-dimensional feature space.
The polynomial kernel can be defined as:

K(x, y) = (x · y + c)^d

Where x and y are the input feature vectors, c is a constant term, and d is the
degree of the polynomial. The constant term is added to the dot product of the
input vectors, and the result is raised to the degree of the polynomial.
The decision boundary of an SVM with a polynomial kernel might capture
more intricate correlations between the input characteristics because it is a
nonlinear hyperplane.
The degree of nonlinearity in the decision boundary is determined by the
degree of the polynomial.
The polynomial kernel has the benefit of being able to detect both linear and
nonlinear correlations in the data. It can be difficult to select the proper degree
of the polynomial, though, as a larger degree can result in overfitting while a
lower degree cannot adequately represent the underlying relationships in the
data.
In general, the polynomial kernel is an effective tool for converting the input
data into a higher-dimensional feature space in order to capture nonlinear
correlations between the input characteristics.
Gaussian (RBF) Kernel
The Gaussian kernel, also known as the radial basis function (RBF) kernel, is
a popular kernel function used in machine learning, particularly in SVMs
(Support Vector Machines). It is a nonlinear kernel function that maps the
input data into a higher-dimensional feature space using a Gaussian function.
The Gaussian kernel can be defined as:

K(x, y) = exp(-gamma * ||x - y||^2)


Where x and y are the input feature vectors, gamma is a parameter that
controls the width of the Gaussian function, and ||x - y||^2 is the squared
Euclidean distance between the input vectors.
When using a Gaussian kernel in an SVM, the decision boundary is a
nonlinear hyper plane that can capture complex nonlinear relationships
between the input features. The width of the Gaussian function, controlled by
the gamma parameter, determines the degree of nonlinearity in the decision
boundary.
One advantage of the Gaussian kernel is its ability to capture complex
relationships in the data without the need for explicit feature engineering.
However, the choice of the gamma parameter can be challenging, as a smaller
value may result in under fitting, while a larger value may result in over fitting.
Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential
kernel, is a type of kernel function used in machine learning, including in
SVMs (Support Vector Machines). It is a non-parametric kernel that can be
used to measure the similarity or distance between two input feature vectors.
The Laplacian kernel can be defined as:

K(x, y) = exp(-gamma * ||x - y||)


Where x and y are the input feature vectors, gamma is a parameter that
controls the width of the Laplacian function, and ||x - y|| is the L1 norm or
Manhattan distance between the input vectors.
When using a Laplacian kernel in an SVM, the decision boundary is a
nonlinear hyperplane that can capture complex relationships between the
input features. The width of the Laplacian function, controlled by the gamma
parameter, determines the degree of nonlinearity in the decision boundary.
One advantage of the Laplacian kernel is its robustness to outliers, as it
places less weight on large distances between the input vectors than the
Gaussian kernel. However, like the Gaussian kernel, choosing the correct value
of the gamma parameter can be challenging.
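A minimal NumPy sketch that evaluates the kernel functions listed above for two feature vectors; the values of gamma, c, and d are arbitrary example choices:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 1.0])
gamma, c, d = 0.5, 1.0, 2      # example hyperparameter values

linear = x @ y                                        # K(x, y) = x . y
polynomial = (x @ y + c) ** d                         # K(x, y) = (x . y + c)^d
rbf = np.exp(-gamma * np.sum((x - y) ** 2))           # K(x, y) = exp(-gamma * ||x - y||^2)
laplacian = np.exp(-gamma * np.sum(np.abs(x - y)))    # K(x, y) = exp(-gamma * ||x - y||_1)

print(linear, polynomial, rbf, laplacian)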
Concept of Hyperplane
Hyperplanes are essentially boundaries that classify a data
set (for example, separating spam email from ham). They can be lines, 2-D planes, or
even n-dimensional planes that are beyond our imagination.
In two dimensions, a line that separates one class from another is a hyperplane.
Hyperplanes are decision boundaries that help classify the data points. Data
points falling on either side of the hyperplane can be attributed to different
classes. Also, the dimension of the hyperplane depends upon the number of
features. If the number of input features is 2, then the hyperplane is just a line.
If the number of input features is 3, then the hyperplane becomes a two-
dimensional plane. It becomes difficult to imagine when the number of
features exceeds 3.

In a p-dimensional space, a hyperplane is a flat affine subspace of


dimension p-1. Visually, in a 2D space, the hyperplane will be a line,
and in a 3D space, it will be a flat plane.

Mathematically, the hyperplane is simply:

β0 + β1X1 + β2X2 + ... + βpXp = 0

In general, if the data can be perfectly separated using a hyperplane,
then there is an infinite number of hyperplanes, since they can be
shifted up or down, or slightly rotated without coming into contact with
an observation.

That is why we use the maximal margin hyperplane or optimal

separating hyperplane, which is the separating hyperplane that
is farthest from the observations. For a given hyperplane, we calculate the
perpendicular distance from each training observation to it; the smallest such
distance is known as the margin.
The margin is the gap between the two lines through the closest data
points of the different classes. It can be calculated as the perpendicular
distance from the separating line to the support vectors. A large margin is
considered a good margin and a small margin is considered a bad
margin.

Support vectors are the data points that are closest to the hyperplane.

The separating line (hyperplane) is defined with the help of these data points.
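As a hedged illustration of these terms (again assuming scikit-learn), a linear SVM exposes the fitted hyperplane coefficients, the support vectors, and the margin width directly; the toy points are invented:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("support vectors:\n", clf.support_vectors_)
print("margin width:", 2 / np.linalg.norm(w))   # distance between the two margin lines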

Properties of SVM

● Flexibility in choosing a similarity function.

● Sparseness of the solution when dealing with large data sets: only the support vectors are used to specify the separating hyperplane.

● Ability to handle large feature spaces: the complexity does not depend on the dimensionality of the feature space.

● Overfitting can be controlled by the soft margin approach.

● Nice math property: training is a simple convex optimization problem which is guaranteed to converge to a single global solution.

Feature Selection
