
Unit 2

Supervised Learning: Regression

Pallavi Shukla
Assistant Professor
UCER
Regression
• Regression analysis is a statistical method for modelling the relationship
between a dependent (target) variable and one or more independent
(predictor) variables.
• It helps us understand how the value of the dependent variable changes
with respect to a given independent variable when the other independent
variables are held fixed.
• Regression searches for relationships among variables.
• For example, you can observe several employees of some company
and try to understand how their salaries depend on the features, such
as experience, level of education, role, city they work in, and so on.
Regression
• In regression, we plot a graph between the variables that best fits the
given data points.
• Using this plot, the machine learning model can make predictions about
the data.
• In simple words, "Regression shows a line or curve that passes through
the data points on the target-predictor graph in such a way that the
vertical distance between the data points and the regression line is
minimum."
• The distance between datapoints and line tells whether a model has
captured a strong relationship or not.
Examples
• Prediction of rain using temperature and other factors
• Determining market trends
• Prediction of road accidents due to rash driving
Taxonomy -

• Dependent Variable: The main factor in regression analysis which we
want to predict or understand is called the dependent variable. It is also
called the target variable.
• Independent Variable: The factors which affect the dependent variable,
or which are used to predict the values of the dependent variable, are
called independent variables, also known as predictors.
• Outliers: An outlier is an observation with either a very low or a very
high value in comparison to the other observed values. An outlier may
hamper the result, so it should be avoided.
Taxonomy -

• Multicollinearity: If the independent variables are highly correlated with
each other, the condition is called multicollinearity. It should not be
present in the dataset, because it creates problems when ranking the
most influential variables.
• Underfitting and Overfitting: If our algorithm works well with the training
dataset but not well with the test dataset, the problem is
called overfitting. And if our algorithm does not perform well even on the
training dataset, the problem is called underfitting.
Common Regression Algorithms
The most common regression algorithms are:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Linear Regression
• Multivariate adaptive regression splines
• Logistic Regression
• Maximum likelihood estimation (least squares)
Linear Regression:

• Linear regression is a statistical regression method that is used for
predictive analysis.
• It is one of the simplest algorithms; it works on regression
and shows the relationship between continuous variables.
• It is used for solving the regression problem in machine learning.
• Linear regression shows the linear relationship between the independent
variable (X-axis) and the dependent variable (Y-axis), hence called linear
regression.
• If there is only one input variable (x), then such linear regression is
called simple linear regression. And if there is more than one input
variable, then such linear regression is called multiple linear regression.
• The relationship between the variables in a linear regression model can be
illustrated with an example in which we predict the salary of an
employee on the basis of years of experience.
Applications of linear regression are:
• Analyzing trends and sales estimates
• Salary forecasting
• Real estate prediction
• Arriving at ETAs in traffic
Simple Linear Regression
Slope of Simple Linear Regression
Linear Positive Slope
Curve Linear Positive Slope
Linear Negative Slope
Curve Linear Negative Slope
No Relationship Graph
Error in Simple regression
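For reference (the slide figures are not reproduced here), the fitted line and the standard least-squares estimates of its slope and intercept can be written as:

y = b0 + b1·x,   b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,   b0 = ȳ − b1·x̄

These are the quantities computed column by column in the example that follows.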
Example
Obs    X       Y      X − X̄   Y − Ȳ   (X − X̄)(Y − Ȳ)   (X − X̄)²
1      15      49
2      23      63
3      18      58
4      23      60
5      24      58
6      22      61
7      22      60
8      19      63
9      19      60
10     16      52
11     24      62
12     11      30
13     24      59
14     16      49
15     23      68
SUM    299     852
MEAN   19.93333   56.8
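A minimal sketch in plain Python (data taken from the table above) that fills in the deviation columns and computes the least-squares slope and intercept:

xs = [15, 23, 18, 23, 24, 22, 22, 19, 19, 16, 24, 11, 24, 16, 23]
ys = [49, 63, 58, 60, 58, 61, 60, 63, 60, 52, 62, 30, 59, 49, 68]

x_bar = sum(xs) / len(xs)    # 299 / 15 = 19.93333...
y_bar = sum(ys) / len(ys)    # 852 / 15 = 56.8

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # sum of (X - X̄)(Y - Ȳ)
sxx = sum((x - x_bar) ** 2 for x in xs)                       # sum of (X - X̄)²

b1 = sxy / sxx               # slope of the fitted line
b0 = y_bar - b1 * x_bar      # intercept
print("slope =", b1, "intercept =", b0)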
Multiple Linear Regression-
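The slide content is not reproduced here; for reference, the general form of the multiple linear regression model (standard notation, assumed rather than taken from the slide) is:

y = b0 + b1·x1 + b2·x2 + … + bn·xn + ε

where x1, …, xn are the independent variables, b0, …, bn are the coefficients to be estimated, and ε is the error term.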
Logistic Regression
• Logistic regression is a supervised machine learning algorithm
mainly used for classification tasks, where the goal is to predict the
probability that an instance belongs to a given class or not.

• It is a powerful tool for decision-making.


Logistic Regression
• Here, the dependent variable (y) is binary (0/1) and the independent variables (X) are
continuous in nature.
• Logistic regression predicts the output of a categorical dependent
variable. Therefore the outcome must be a categorical or discrete
value.
• It can be either Yes or No, 0 or 1, true or False, etc. but instead of
giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
• Linear Regression is used for solving Regression problems,
whereas Logistic regression is used for solving the classification
problems.
• It is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using
continuous and discrete datasets.
Logistic Function (Sigmoid Function):

• The sigmoid function is a mathematical function used to map the
predicted values to probabilities.
• It maps any real value into another value within the range of 0 and 1.
• The output of logistic regression must be between 0 and 1 and
cannot go beyond this limit, so it forms a curve like the "S" form.
• The S-form curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which
decides between the classes 0 and 1: values above the threshold tend
towards 1, and values below the threshold tend towards 0.
Type of Logistic Regression:
1. Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.

2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".

3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as “low”, “Medium”, or “High”.

How does Logistic Regression work?

• The logistic regression model transforms the continuous output of the linear
regression function into a categorical output using a sigmoid function,
which maps any real-valued combination of the independent input
variables into a value between 0 and 1. This function is known as the
logistic function.
• Let the independent input features be X = (x1, x2, …, xn),
and let the dependent variable be Y, having only binary values, i.e. 0 or 1.
• Then apply the multi-linear function to the input variables X:
z = w1·x1 + w2·x2 + … + wn·xn + b = w·x + b
Sigmoid Function
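The equation on this slide is not reproduced in these notes; the standard sigmoid (logistic) function applied to the linear score z is:

σ(z) = 1 / (1 + e^(−z)),   so that   P(Y = 1 | X) = σ(w·x + b)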
Bayesian Decision Theory

• It is a method to take actions based on present observations.


• Rev. Thomas Bayes proposed this method; his essay was published
posthumously in 1763.
Basic Concepts in Bayes Decision Theory -

• Marginal Probability (Simple Probability) P(A) –


• The ordinary probability of occurrence of an event (A) irrespective of all
other events is called simple or marginal probability.
• P(A) = (No. of successful events) / (Total no. of all events)
Basic Concepts in Bayes Decision Theory -

• Conditional Probability P(A|B) –
• The probability (P) of the occurrence of an event (A), given that event (B)
has already occurred, is called conditional probability.
Basic Concepts in Bayes Decision Theory -

• Joint Probability P(A,B) –
• The probability of two events (A) and (B) occurring simultaneously is called
the joint probability.
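For reference, these three probabilities are related by the standard identities (consistent with the definitions above):

P(A|B) = P(A,B) / P(B),   and   P(A,B) = P(A|B) · P(B) = P(B|A) · P(A)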
Basic Concepts in Bayes Decision Theory -

• Prior: The prior knowledge or belief about the probabilities of


various hypotheses in H is called Prior in the Context of Bayes’
theorem.
• Ex- Knowledge about tumors can be used to validate tumors
being malignant.
Basic Concepts in Bayes Decision Theory

• Posterior – The probability that a particular hypothesis holds for a dataset


based on Prior is called the Posterior Probability or simply Posterior.

Example: The probability of the hypothesis that the patient has


a malignant tumour considering the Prior of correctness of the
malignancy test.
BAYES’ THEOREM
• It is based on conditional probability and is given as:
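In the notation used in these slides, Bayes' theorem (whose equation is not reproduced from the slide image) states:

P(A|B) = P(B|A) · P(A) / P(B)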
SOME MORE QUESTIONS:

• Q1 – Calculate the probability of "fire" when "smoke" is observed, given the data:
P(Fire) = Prior Probability = 0.3, P(Smoke|Fire) = Likelihood
Probability = 0.5, P(Smoke) = Evidence = 0.7.
• Q2 – (Patient Disease Problem) Consider the data of a patient, where
Effect = the state of the patient having a red dot on the skin and Cause = the state of the
patient having Rubella disease. Given the probabilities P(Cause) = 0.001,
P(Effect) = 0.01, P(Effect|Cause) = 0.9, use Bayes' rule to find the value of the
probability P(Cause|Effect).
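A worked check of both questions, using only the arithmetic implied by the numbers given above:

Q1: P(Fire|Smoke) = P(Smoke|Fire) · P(Fire) / P(Smoke) = (0.5 × 0.3) / 0.7 ≈ 0.214
Q2: P(Cause|Effect) = P(Effect|Cause) · P(Cause) / P(Effect) = (0.9 × 0.001) / 0.01 = 0.09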
Bayes’ Theorem in Terms of Posterior
Probability -
P(h|D) = P(D|h) · P(h) / P(D)

P(h|D) = posterior probability, i.e. the conditional probability of the hypothesis
(h) when the data (D) is given.
P(D|h) = likelihood, i.e. the conditional probability of the data (D) when the
hypothesis (h) is given.
P(h) = prior probability of the hypothesis (h), or the simple probability of the hypothesis (h).
P(D) = prior probability of the data (D), or the simple probability of D.
Maximum a Posterior(MAP) Hypothesis

• The most probable hypothesis is called the Maximum A Posteriori (MAP)
hypothesis.
• It is denoted by hMAP.
• hMAP = Arg max P(h|D) = Arg max [ P(D|h) · P(h) / P(D) ]
• Since P(D) does not depend on h, we can ignore the denominator:
hMAP = Arg max P(D|h) · P(h)
Maximum Likelihood(ML) Hypothesis

• If all hypotheses are assumed equiprobable, the MAP hypothesis reduces to the
Maximum Likelihood (ML) hypothesis:
hML = Arg max P(D|h)
Difference between Max f(x) and Arg max
f(x) functions in Mathematics -
Max f(x): the maximum value of the function f(x).
Arg max f(x): the argument (the value of x) at which the function f(x) attains its maximum value.

Ex – Max of sin θ = 1; it means sin θ has a maximum value of 1.
Ex – Arg max of sin θ = 90°; it means sin θ attains its maximum value at θ = 90°.
BRUTE FORCE BAYESIAN CONCEPT
LEARNING -
• Also called the Brute Force Algorithm.
• P(h|D) = P(D|h) · P(h) / P(D)
• hMAP = Arg max P(h|D)
• Let P(h) = 1 / |H| for all h in H.
• h = a single hypothesis; H = the set consisting of all hypotheses, H = {h1, h2, h3, …, hn}
• Now, P(h) = probability of the hypothesis (h).
• P(D|h) = 1 if di = h(xi) for every training example, and 0 otherwise.
P(D|h) = conditional probability of the data (D) when the hypothesis (h) is given;
di = data value, xi = variable value.
• For a hypothesis consistent with D, this gives
P(h|D) = (1 · 1/|H|) / P(D)
• But P(D) = |VS H,D| / |H|
• Now, putting this value in the above equation,
P(h|D) = 1 / |VS H,D|
• where |VS H,D| is the size of the version space of the hypothesis set (H) with respect to D.
BAYES' OPTIMAL CLASSIFIER
• It is a "probabilistic model" which makes the most probable prediction for a new
example.
• Its equation (reconstructed here in the notation of the slides) is:
P(vj|D) = Σ over hi in H of [ P(vj|hi) · P(hi|D) ]
and the classifier outputs the value vj that maximizes P(vj|D).
• P(vj|D) = probability of value (vj) when the data is given
• P(vj|hi) = probability of value (vj) when hypothesis (hi) is given
• P(hi|D) = probability of hypothesis (hi) when the data is given
UNIT II
Naïve Bayes Classifier
Pallavi Shukla
Assistant Professor
Naïve Bayes Classifier

• It is a supervised learning algorithm.


• Based on Bayes Theorem.
• Used for solving classification problems in machine learning.
• It is a probabilistic classifier.
• It predicts on the basis of the probability of an event.
Naïve Bayes Classifier
• Naïve: the word means "untrained" or "without experience"; here it refers to the
assumption that the features are independent of each other.
• Bayes: it is based on Bayes' Theorem.

• Question: We have been given a dataset of weather conditions with two
columns: one gives the weather condition (Outlook) and the other
records whether the player went out to play or not.
Find the probability of the player going to play on a sunny day.
No.   Outlook    Play
0     Rainy      Yes
1     Sunny      Yes
2     Overcast   Yes
3     Overcast   Yes
4     Sunny      No
5     Rainy      Yes
6     Sunny      Yes
7     Overcast   Yes
8     Rainy      No
9     Sunny      No
10    Sunny      Yes
11    Rainy      No
12    Overcast   Yes
13    Overcast   Yes
Solution: Frequency Table

Weather    Yes   No
Overcast   5     0
Rainy      2     2
Sunny      3     2
Total      10    4
Make Likelihood Table :

Weather    No            Yes            Likelihood
Overcast   0             5              5/14 = 0.35
Rainy      2             2              4/14 = 0.29
Sunny      2             3              5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71
Apply Bayes Theorem :
• P(A|B) = P(B|A) · P(A) / P(B)

• P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny)
• P(Sunny|Yes) = 3/10 = 0.3
• P(Sunny) = 5/14 = 0.35
• P(Yes) = 10/14 = 0.71
• P(Yes|Sunny) = (0.3 × 0.71) / 0.35 ≈ 0.60
• P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny) = (0.5 × 0.29) / 0.35 ≈ 0.41
• As P(Yes|Sunny) > P(No|Sunny), i.e. 0.60 > 0.41,
• we can say that on a sunny day the player will go to play.
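A minimal sketch in plain Python (no libraries) that reproduces the calculation above from the raw table; the function name and structure are illustrative:

# (Outlook, Play) pairs from the dataset above
data = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"),  ("Rainy", "Yes"), ("Sunny", "Yes"),    ("Overcast", "Yes"),
    ("Rainy", "No"),  ("Sunny", "No"),  ("Sunny", "Yes"),    ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]

def posterior(outlook, label):
    n = len(data)
    n_label = sum(1 for _, play in data if play == label)            # e.g. count of "Yes"
    n_both = sum(1 for o, play in data if o == outlook and play == label)
    p_outlook_given_label = n_both / n_label                          # e.g. P(Sunny|Yes)
    p_label = n_label / n                                             # e.g. P(Yes)
    p_outlook = sum(1 for o, _ in data if o == outlook) / n           # e.g. P(Sunny)
    return p_outlook_given_label * p_label / p_outlook                # Bayes' theorem

print(posterior("Sunny", "Yes"))   # 0.6
print(posterior("Sunny", "No"))    # 0.4 (the slide's 0.41 comes from rounding 0.29 and 0.35)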
Advantages of Naïve Bayes Classifier -
• It is a fast and easy algorithm for classification.
• It can be used for binary and multi-classification.
• It is mostly used for text classification problems.
Disadvantages of Naïve Bayes Classifier

• It cannot learn relationships between features, since it assumes that all features are independent.


Applications of Naïve Bayes Classifier
• Real Time Prediction
• Text Classification
• Sentiment Analysis
• Multiclass Classification
• Spam Filtering
• Recommendation System
BAYESIAN BELIEF NETWORKS
• A Bayesian Belief Network is a probabilistic graphical model. It represents a
set of variables and their conditional dependencies using a directed acyclic
graph.

• Also called a Bayes Network, Belief Network, Decision Network or Bayesian Model.
• The Bayesian network consists of two parts:
• Directed Acyclic Graph.
• Table of Conditional Probabilities.
• The Bayesian Belief networks are based on the joint probability and marginal
probability.
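For reference, a Bayesian network over variables X1, …, Xn factorizes the joint probability as (a standard result, not shown on the slide itself):

P(x1, x2, …, xn) = Π over i of P(xi | Parents(Xi))

i.e. each variable depends only on its parent nodes in the directed acyclic graph.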
EXPECTATION – MAXIMIZATION
ALGORITHM
• EM Algorithm is used to find unknown (unseen) variables of sample space.

• Latent Variables: The unseen (missing-data) variables are called latent
variables. The EM algorithm is used to find the latent variables of a dataset.
• Maximum Likelihood Estimation: The EM algorithm is used to find or
estimate the maximum likelihood estimates involving latent variables.
EM Algorithm:
• Expectation (E) Step – Using the observed, available data of the dataset, we
estimate (find) the values of the missing data. After this step, we get complete
data with no missing values.
• Maximization (M) Step – Now we use the complete data to re-estimate (update)
the parameters.
• The E and M steps are repeated until the estimates converge to a solution.
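A minimal sketch of the E and M steps for a two-component, one-dimensional Gaussian mixture (assumes NumPy is available; the toy data, initial values and iteration count are illustrative, not from the slides):

import numpy as np

x = np.array([1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 1.1, 5.1])           # toy observations
pi = np.array([0.5, 0.5])                                        # mixing weights
mu, sigma = np.array([0.0, 4.0]), np.array([1.0, 1.0])           # initial means / spreads

for _ in range(50):
    # E step: responsibility of each component for each point (posterior probabilities)
    dens = (pi / (sigma * np.sqrt(2 * np.pi))) * \
           np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M step: re-estimate weights, means and spreads from the "completed" data
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)   # estimated weights, means and standard deviations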
Applications of EM Algorithm
• Used to find missing data of a dataset.
• Used for parameter estimation of Hidden Markov Model.
• Used to estimate the parameters of Gaussian mixture (density) models.
• Used in Natural Language Processing(NLP) , computer vision etc.
• Used in Medical field and structural engineering.
Advantages of EM Algorithm
• Very easy to implement algorithm in machine learning as it has only two
steps.
• Solution of M step exists in closed form.
• After each iteration of EM algorithm, the value of likelihood is increased.
Disadvantages of EM Algorithm
• Has slow convergence
• Converges to local optimum only
• Requires both forward and backward probabilities.
UNIT II
Support Vector Machine
Pallavi Shukla
Assistant Professor
Support Vector Machine
• It is one of the most popular supervised learning techniques, used for both
classification and regression tasks.
• It is mainly used for classification problems in machine learning.
• The objective of the SVM algorithm is to find a hyperplane in an N-dimensional
space that distinctly classifies the data points.
Support Vectors
• They are simply the coordinates of individual observations, i.e. the data points
closest to the separating boundary.
• The SVM classifier is the frontier (hyperplane/line) that best segregates the two
classes.
Support Vector Machine Terminology
1. Hyperplane: The hyperplane is the decision boundary that is used to
separate the data points of different classes in a feature space. In
the case of linear classification, it is a linear equation, i.e.
w·x + b = 0.
2. Support Vectors: Support vectors are the data points closest to
the hyperplane, and they play a critical role in deciding the
hyperplane and the margin.
3. Margin: The margin is the distance between the support vectors and the
hyperplane. The main objective of the support vector machine
algorithm is to maximize the margin. A wider margin indicates
better classification performance.
Support Vector Machine Terminology
1. Kernel: The kernel is the mathematical function used in SVM to map the
original input data points into high-dimensional feature spaces, so that the
hyperplane can be found easily even if the data points are not linearly
separable in the original input space. Some of the common kernel functions are
linear, polynomial, radial basis function (RBF), and sigmoid.
2. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane
is a hyperplane that properly separates the data points of different categories
without any misclassifications.
3. Soft Margin: When the data is not perfectly separable or contains outliers, SVM
permits a soft margin technique. Each data point has a slack variable introduced
by the soft-margin SVM formulation, which softens the strict margin
requirement and permits certain misclassifications or violations. It discovers a
compromise between increasing the margin and reducing violations.
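A minimal sketch of training an SVM classifier with scikit-learn (assumed to be available); the dataset, kernel choice and parameter values are illustrative, not taken from the slides:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# kernel="rbf" selects the Gaussian (RBF) kernel; C controls the soft-margin trade-off
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("support vectors per class:", clf.n_support_)
print("test accuracy:", clf.score(X_test, y_test))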
Types of SVM

• Linear SVM
• Non-Linear SVM
UNIT II
KERNEL
Pallavi Shukla
Assistant Professor
KERNEL

- It is the mathematical function used in SVM to map the original input
data points into high-dimensional feature spaces, so that the hyperplane can be
found easily even if the data points are not linearly separable in the original
input space.

- Some of the common kernel functions are linear, polynomial, radial basis
function(RBF), and sigmoid.
• Kernels are used to solve a non-linear problem by using a linear classifier.
• The amazing thing about kernels is that we can go to higher dimensions
and perform smooth calculations with their help.
• We can go up to an infinite number of dimensions using kernels.
• Sometimes, we cannot have a hyperplane for certain problems. This
problem arises when we go up to higher dimensions and try to form a
hyperplane.
• A kernel helps to form the hyperplane in the higher dimension without
raising the complexity.
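A minimal sketch of the kernel idea for two 2-D points (NumPy assumed); the explicit feature map phi and the kernel (x·y)² are illustrative choices:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

def phi(v):
    # explicit map to a higher-dimensional space corresponding to the kernel (v·v')²
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

explicit = phi(x) @ phi(y)     # dot product computed in the expanded feature space
kernel = (x @ y) ** 2          # the same value computed directly in the original space

print(explicit, kernel)        # both print 121.0; no explicit mapping was needed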
Characteristics of Kernel Function

• Mercer's condition: A kernel function must satisfy Mercer's condition to be
valid. This condition ensures that the kernel function is positive semi-definite,
which means that it is always greater than or equal to zero.
• Positive definiteness: A kernel function is positive definite if it is always greater
than zero except for when the inputs are equal to each other.
• Non-negativity: A kernel function is non-negative, meaning that it produces
non-negative values for all inputs.
• Symmetry: A kernel function is symmetric, meaning that it produces the same
value regardless of the order in which the inputs are given.
• Reproducing property: A kernel function satisfies the reproducing
property if it can be used to reconstruct the input data in the feature
space.
• Smoothness: A kernel function is said to be smooth if it produces a
smooth transformation of the input data into the feature space.
• Complexity: The complexity of a kernel function is an important
consideration, as more complex kernel functions may lead to overfitting
and reduced generalization performance.
Linear Kernel

• It is the simplest and most commonly used kernel function, and it is defined as
the dot product between the input vectors in the original feature space.
• The linear kernel can be defined as:
• K(x, y) = x · y
• where x and y are the input feature vectors.
• When using a linear kernel in an SVM, the decision boundary is a linear
hyperplane that separates the different classes in the feature space.
Polynomial Kernel

• It is a nonlinear kernel function that employs polynomial functions to
transfer the input data into a higher-dimensional feature space.
• The polynomial kernel is an effective tool for converting the input data
into a higher-dimensional feature space in order to capture nonlinear
correlations between the input characteristics.
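The equation shown on the slide is not reproduced here; the polynomial kernel is commonly written as:

K(x, y) = (x·y + c)^d

where d is the degree of the polynomial and c ≥ 0 is a constant.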
GAUSSIAN KERNEL
• The Gaussian kernel, also known as the radial basis function (RBF) kernel,
is a popular kernel function used in machine learning, particularly in
SVMs (Support Vector Machines)
• It is a nonlinear kernel function that maps the input data into a higher-
dimensional feature space using a Gaussian function
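For reference, the Gaussian (RBF) kernel is commonly written as:

K(x, y) = exp(−γ ‖x − y‖²),   with γ = 1 / (2σ²)

where the parameter γ (or equivalently σ) controls the width of the Gaussian.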
GAUSSIAN KERNEL
• One advantage of the Gaussian kernel is its ability to capture complex
relationships in the data without the need for explicit feature
engineering. However, the choice of the gamma parameter can be
challenging, as a smaller value may result in underfitting, while a larger
value may result in overfitting.
Exponential Kernel

• This is closely related to the previous kernel, i.e. the Gaussian kernel,
the only difference being that the square of the norm is removed.

Laplace Kernel

• It is a non-parametric kernel that can be used to measure the similarity or
distance between two input feature vectors.
• This type of kernel is less sensitive to changes and is essentially equivalent to the
previously discussed exponential kernel. The equation of the
Laplacian kernel is given as:
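(The equation is not reproduced from the slide; a commonly cited form of the Laplacian kernel is K(x, y) = exp(−‖x − y‖ / σ).)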
Hyperbolic or the Sigmoid Kernel

• This kernel is used in neural network areas of machine learning.


• The activation function for the sigmoid kernel is the bipolar sigmoid
function. The equation for the hyperbolic kernel function is:
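(The equation is not reproduced from the slide; the sigmoid/hyperbolic-tangent kernel is commonly written as K(x, y) = tanh(α · x·y + c), where α and c are kernel parameters.)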
Anova radial basis kernel
• This kernel is known to perform very well in multidimensional regression
problems just like the Gaussian and Laplacian kernels.
• This also comes under the category of radial basis kernels. The equation
for the ANOVA kernel is:
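(The equation is not reproduced from the slide; one commonly cited form of the ANOVA radial basis kernel is K(x, y) = Σ from k=1 to n of exp(−σ (x_k − y_k)²)^d, where σ and d are kernel parameters.)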
Graph Kernel Function

• This kernel is used to compute inner products on graphs.
• Graph kernels measure the similarity between pairs of graphs.
• They contribute to areas
like bioinformatics, chemoinformatics, etc.
String Kernel Function

• This kernel operates on the basis of strings.


• It is mainly used in areas like text classification.
• They are very useful in text mining, genome analysis, etc.
Tree Kernel Function

• This kernel is more associated with the tree structure.


• The kernel helps to split the data into tree format and helps the
SVM to distinguish between them.
• This is helpful in language classification and it is used in areas
like NLP.
UNIT – II
Lecture - 12
Pallavi Shukla
Assistant Professor
Kernel method
PROPERTIES OF SVM
• Flexibility in choosing a similarity function
• Sparseness of solution when dealing with large data sets - only support vectors are
used to specify the separating hyperplane
• Ability to handle large feature spaces - complexity does not depend on the
dimensionality of the feature space
• Overfitting can be controlled by soft margin approach
• Nice math property: a simple convex optimization problem which is guaranteed to
converge to a single global solution
• Feature Selection
Advantages of SVM
• Handling high-dimensional data: SVMs are effective in handling high-
dimensional data, which is common in many applications such as image
and text classification.
• Handling small datasets: SVMs can perform well with small datasets,
as they only require a small number of support vectors to define the
boundary.
• Modeling non-linear decision boundaries: SVMs can model non-linear
decision boundaries by using the kernel trick, which maps the data into a
higher-dimensional space where the data becomes linearly separable.
Advantages of SVM
• Robustness to noise: SVMs are robust to noise in the data, as the decision boundary is determined
by the support vectors, which are the closest data points to the boundary.
• Generalization: SVMs have good generalization performance, which means that they are able to
classify new, unseen data well.
• Versatility: SVMs can be used for both classification and regression tasks, and it can be applied to a
wide range of applications such as natural language processing, computer vision, and
bioinformatics.
• Sparse solution: SVMs have sparse solutions, which means that they only use a subset of the
training data to make predictions. This makes the algorithm more efficient and less prone to
overfitting.
• Regularization: SVMs can be regularized, which means that the algorithm can be modified to avoid
overfitting.
Disadvantages of SVM
• Computationally expensive: SVMs can be computationally expensive for large
datasets, as the algorithm requires solving a quadratic optimization problem.
• Choice of kernel: The choice of kernel can greatly affect the performance of an
SVM, and it can be difficult to determine the best kernel for a given dataset.
• Sensitivity to the choice of parameters: SVMs can be sensitive to the choice of
parameters, such as the regularization parameter, and it can be difficult to
determine the optimal parameter values for a given dataset.
• Memory-intensive: SVMs can be memory-intensive, as the algorithm requires
storing the kernel matrix, which can be large for large datasets.
Disadvantages of SVM
• Limited to two-class problems: SVMs are primarily used for two-class
problems, although multi-class problems can be solved by using one-versus-
one or one-versus-all strategies.
• Lack of probabilistic interpretation: SVMs do not provide a probabilistic
interpretation of the decision boundary, which can be a disadvantage in some
applications.
• Not suitable for large datasets with many features: SVMs can be very slow and
can consume a lot of memory when the dataset has many features.
• Not suitable for datasets with missing values: SVMs require complete datasets
with no missing values; they cannot handle missing values directly.
Applications of SVM
1. Face detection – SVMs are used for detecting faces according to
the trained classifier and model.
2. Text and hypertext categorization – here, the categorization
technique is used to find the important (required) information for
organizing text.
3. Grouping of portrayals (image classification) – it is also used for grouping
images by comparing pieces of information and taking
action accordingly.
Applications of SVM
1. Bioinformatics – it is also used in medical science, e.g. in laboratory work,
DNA analysis, research, etc.
2. Handwriting recognition – SVMs are used for recognizing handwritten characters.
3. Protein fold and remote homology detection – it is used for classifying
proteins into functional and structural classes given their
amino acid sequences. This is one of the well-known problems in bioinformatics.
4. Generalized predictive control (GPC) – it is also used for generalized predictive
control, which relies on predictive control using a multilayer
feed-forward network as the plant's linear model.
Applications of SVM
5. Facial Expression Classification – Support vector machines (SVMs) are a
binary classification technique. The facial expression classification model
determines the precise facial expression by modelling the differences between
two facial images. Validation techniques include the leave-one-out
method and the K-fold test method.
6. Speech Recognition – the transcription of speech into text is called
speech recognition. Mel Frequency Cepstral Coefficient (MFCC)-based
features are used to train Support Vector Machines (SVMs), which are then used
to recognize speech. Speech recognition is a challenging classification
problem that is addressed using a variety of mathematical techniques,
including support vector machines, pattern recognition techniques, etc.
