Likelihood Frequentist

Likelihood Frequentist
• Introduction: Likelihood
• Likelihood describes how well a candidate distribution (or a particular value of its parameters) explains the data we have observed, while probability describes the chance of observing some outcome given a known distribution of the data.

• Maximum Likelihood Estimation (MLE) is a frequentist approach for estimating the parameters of a model given some observed data. The general approach for using MLE is:
• Observe some data.
• Write down a model for how we believe the data was generated.
• Set the parameters of our model to values which maximize the
likelihood of the parameters given the data.
Likelihood Frequentist
• Models
• A model is a formal representation of our beliefs, assumptions,
and simplifications surrounding some event or process.
• Let’s look at a couple of examples to make this idea clear.
• Example: Coin Flip
• We’d like to build a model for flipping a specific coin. What do we
know?
• The coin has two faces and an edge.
• The faces have different designs.
• The coin can sit on either face or on its edge.
• The weight of the coin.
• The diameter and thickness of the coin.
Likelihood Frequentist
• What assumptions can we make?
• The different designs probably cause the coin’s center of mass to
slightly favor one side over another.
• There’s no way to measure the force or angle exerted on the coin when
it’s flipped.
• Let’s take a first stab at writing down a model without simplifications:
• The initial position of the coin is drawn from a Bernoulli distribution.
This represents the flipper’s preference of starting the coin heads up
vs. heads down in their hand.
• The force exerted on the coin is drawn from an exponential
distribution.
• The angle in which the force is exerted is drawn from a truncated
normal distribution on the interval [-π, π].
• The center of mass of the coin is at some coordinate (x,y,z) in a system
where the center of the coin is the origin.
• The force of gravity is …
Likelihood Frequentist
• The real world can be complicated.
• Sometimes, a simplified model can do just as well or better. Let’s make a
simplified model:
• The outcome of the flip is drawn from a Bernoulli distribution with the
probability of heads p, and the probability of tails (1-p).
• Our simplified model has only a single parameter! In part one, we learned that we can estimate this parameter by simply flipping the coin a few times and counting the number of heads we get (a short Python sketch of this follows the slide).
• Fitting the complicated model would require many more flips and difficult
calculations. So which model is right?
• Usefulness is the key metric when designing models.
• In this case, use the simplified model even though we know it’s wrong.
• How do I know it’s wrong? I’ve assigned 0 probability to the coin landing on
its edge.
• I’ve never seen this happen in real life, so I’ve made the simplifying
assumption that it can’t occur.
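• As a rough illustration of the simplified model, the following is a minimal Python sketch of the maximum likelihood estimate of p from simulated flips (the bias value and flip count below are made up for the example):

import numpy as np

# Simplified Bernoulli coin model: the only parameter is p, the probability of heads.
# The maximum likelihood estimate of p is simply the observed fraction of heads.
rng = np.random.default_rng(0)
true_p = 0.55     # hypothetical bias of the coin, used only to simulate data
n_flips = 100     # illustrative number of observed flips

flips = rng.binomial(1, true_p, size=n_flips)   # 1 = heads, 0 = tails
p_hat = flips.mean()                            # MLE of p
print(f"MLE of p after {n_flips} flips: {p_hat:.3f}")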
Likelihood Frequentist
• Introduction to Maximum Likelihood Estimation for Machine Learning
• Density estimation is the problem of estimating the probability
distribution for a sample of observations from a problem domain.
• There are many techniques for solving density estimation, although a
common framework used throughout the field of machine learning is
maximum likelihood estimation.
• Maximum likelihood estimation involves defining a likelihood function
for calculating the conditional probability of observing the data sample
given a probability distribution and distribution parameters.

• This approach can be used to search a space of possible distributions and parameters.
Likelihood Frequentist
• Problem of Probability Density Estimation
• A common modeling problem involves how to estimate a joint
probability distribution for a dataset.
• For example, consider a sample of observations (X) from a domain (x1, x2, x3, …, xn), where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).
• Density estimation involves selecting a probability distribution
function and the parameters of that distribution that best
explain the joint probability distribution of the observed data
(X).
Likelihood Frequentist
• How do you choose the probability distribution function?
• How do you choose the parameters for the probability distribution
function?
• This problem is made more challenging as sample (X) drawn from the
population is small and has noise, meaning that any evaluation of an
estimated probability density function and its parameters will have some
error.
• There are many techniques for solving this problem, although two
common approaches are:
• Maximum a Posteriori (MAP), a Bayesian method.
• Maximum Likelihood Estimation (MLE), a frequentist method.
• The main difference is that MLE assumes that all solutions are equally
likely beforehand, whereas MAP allows prior information about the form
of the solution to be harnessed.
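• Written in the notation used later in these slides, the two objectives differ only by the prior term (this summary is a standard statement of the two methods, not taken verbatim from the slides):
• MLE: maximize P(X ; theta)
• MAP: maximize P(X ; theta) * P(theta)
• The prior P(theta) is what lets MAP harness prior information; with a flat (uniform) prior, MAP reduces to MLE.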
Likelihood Frequentist
• One solution to probability density estimation is referred to as
Maximum Likelihood Estimation, or MLE for short.
• Maximum Likelihood Estimation involves treating the problem as an
optimization or search problem, where we seek a set of parameters
that results in the best fit for the joint probability of the data sample
(X).
• First, it involves defining a parameter called theta that defines both the
choice of the probability density function and the parameters of that
distribution. It may be a vector of numerical values whose values
change smoothly and map to different probability distributions and
their parameters.
• In Maximum Likelihood Estimation, we wish to maximize the probability
of observing the data from the joint probability distribution given a
specific probability distribution and its parameters, stated formally as:
Likelihood Frequentist
• P(X | theta)
• This conditional probability is often stated using the semicolon (;)
notation instead of the bar notation (|) because theta is not a
random variable, but instead an unknown parameter. For example:
• P(X ; theta)
• or
• P(x1, x2, x3, …, xn ; theta)
• This resulting conditional probability is referred to as the likelihood
of observing the data given the model parameters and written
using the notation L() to denote the likelihood function. For
example:
• L(X ; theta)
Likelihood Frequentist
• The objective of Maximum Likelihood Estimation is to find the set of
parameters (theta) that maximize the likelihood function, e.g. result in
the largest likelihood value.
• maximize L(X ; theta)
• We can unpack the conditional probability calculated by the likelihood
function.
• Given that the sample is comprised of n examples, we can frame this
as the joint probability of the observed data samples x1, x2, x3, …,
xn in X given the probability distribution parameters (theta).
• L(x1, x2, x3, …, xn ; theta)
• The joint probability distribution can be restated as the multiplication
of the conditional probability for observing each example given the
distribution parameters.
Likelihood Frequentist
• product i to n P(xi ; theta)
• Multiplying many small probabilities together can be numerically unstable in practice, therefore,
it is common to restate this problem as the sum of the log conditional probabilities of observing
each example given the model parameters.
• sum i to n log(P(xi ; theta))
• Where the logarithm with base e, called the natural logarithm, is commonly used.
• This product over many probabilities can be inconvenient […] it is prone to numerical underflow.
To obtain a more convenient but equivalent optimization problem, we observe that taking the
logarithm of the likelihood does not change its arg max but does conveniently transform a
product into a sum
• Given the frequent use of log in the likelihood function, it is commonly referred to as a log-
likelihood function.
• It is common in optimization problems to prefer to minimize the cost function, rather than to
maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally
as a Negative Log-Likelihood (NLL) function.
• minimize -sum i to n log(P(xi ; theta))
• In software, we often phrase both as minimizing a cost function. Maximum likelihood thus
becomes minimization of the negative log-likelihood (NLL) …
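• As a concrete sketch of minimizing a negative log-likelihood numerically (assuming NumPy and SciPy are available; the Gaussian model and the simulated data below are illustrative, not part of the original slides):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Illustrative data: pretend these observations come from an unknown Gaussian.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=200)

def nll(params):
    # Negative log-likelihood of a Gaussian with parameters (mu, log_sigma).
    mu, log_sigma = params
    sigma = np.exp(log_sigma)            # work on the log scale to keep sigma positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(nll, x0=np.array([0.0, 0.0]))  # minimizing NLL = maximizing likelihood
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # should come out close to 5.0 and 2.0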
Likelihood Frequentist
• Relationship to Machine Learning
• This problem of density estimation is directly related to applied machine learning.
• We can frame the problem of fitting a machine learning model as the problem of
probability density estimation.
• Specifically, the choice of model and model parameters is referred to as a modeling
hypothesis h, and the problem involves finding h that best explains the data X.
• P(X ; h)
• We can, therefore, find the modeling hypothesis that maximizes the likelihood
function.
• maximize L(X ; h)
• Or, more fully:
• maximize sum i to n log(P(xi ; h))
• This provides the basis for estimating the probability density of a dataset, which is typically used in unsupervised machine learning algorithms, for example:
Likelihood Frequentist
• Clustering algorithms.
• Using the expected log joint probability as a key quantity for learning in
a probability model with hidden variables is better known in the context
of the celebrated “expectation maximization” or EM algorithm.
• The Maximum Likelihood Estimation framework is also a useful tool for
supervised machine learning.
• This applies to data where we have input and output variables, and the output variable may be a numerical value or a class label, in the case of regression and classification predictive modeling respectively.
• We can state this as the conditional probability of the output (y) given
the input (X) given the modeling hypothesis (h).
• maximize L(y|X ; h)
• Or, more fully:
• maximize sum i to n log(P(yi|xi ; h))
Likelihood Frequentist
• The maximum likelihood estimator can readily be generalized to the case
where our goal is to estimate a conditional probability
• P(y | x ; theta) in order to predict y given x.
• This is the most common situation because it forms the basis for most
supervised learning.
• This means that the same Maximum Likelihood Estimation framework that is
generally used for density estimation can be used to find a supervised learning
model and parameters.
• This provides the basis for foundational linear modeling techniques, such as:
• Linear Regression, for predicting a numerical value.
• Logistic Regression, for binary classification.
• In the case of linear regression, the model is constrained to a line and involves
finding a set of coefficients for the line that best fits the observed data.
• This problem can be solved analytically (e.g. directly using linear algebra).
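• A small sketch of that analytical (least squares) solution via linear algebra, using made-up data (under the Gaussian noise assumption this least squares fit coincides with the maximum likelihood fit):

import numpy as np

# Illustrative data roughly on the line y = 2 + 3x plus Gaussian noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# The least squares (and, under Gaussian noise, maximum likelihood) coefficients.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # approximately [2, 3]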
Likelihood Frequentist
• In the case of logistic regression, the model defines a line and involves finding
a set of coefficients for the line that best separates the classes.
• This cannot be solved analytically and is often solved by searching the space of possible coefficient values using an efficient optimization algorithm (e.g. the BFGS algorithm or variants; a sketch of such a fit follows this slide).

• Both methods can also be solved less efficiently using a more general
optimization algorithm such as stochastic gradient descent.

• In fact, most machine learning models can be framed under the maximum
likelihood estimation framework, providing a useful and consistent way to
approach predictive modeling as an optimization problem.
• An important benefit of the maximum likelihood estimator in machine learning is that as the size of the dataset increases, the quality of the estimator continues to improve.
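• The following is a hedged sketch (assuming NumPy and SciPy) of fitting logistic regression coefficients by minimizing the negative log-likelihood with BFGS; the data are simulated and the intercept is omitted for brevity:

import numpy as np
from scipy.optimize import minimize

# Illustrative binary classification data with two input features.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)

def nll(w):
    # Per-example negative log-likelihood of logistic regression: log(1 + exp(z)) - y*z.
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)

result = minimize(nll, x0=np.zeros(2), method="BFGS")
print(result.x)   # coefficients recovered close to true_w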
Fitting a Line using Likelihood
• Linear Regression as Maximum Likelihood
• We can frame the problem of fitting a machine
learning model as the problem of probability density
estimation.
• Specifically, the choice of model and model
parameters is referred to as a modeling hypothesis h,
and the problem involves finding h that best explains
the data X. We can, therefore, find the modeling
hypothesis that maximizes the likelihood function.
• maximize sum i to n log(P(xi ; h))
• Supervised learning can be framed as a conditional
probability problem of predicting the probability of
the output given the input:
• P(y | X)
• As such, we can define conditional maximum
likelihood estimation for supervised machine
learning as follows:
• maximize sum i to n log(P(yi|xi ; h))
• Now we can replace h with our linear regression
model.
• We can make some reasonable assumptions, such as the
observations in the dataset are independent and drawn
from the same probability distribution (i.i.d.), and that the
target variable (y) has statistical noise with a Gaussian
distribution, zero mean, and the same variance for all
examples.
• With these assumptions, we can frame the problem of
estimating y given X as estimating the mean value for y from
a Gaussian probability distribution given X.
• The analytical form of the Gaussian density function is as follows:
• f(y) = (1 / sqrt(2 * pi * sigma^2)) * exp(-1/(2 * sigma^2) * (y – mu)^2)
• Where mu is the mean of the distribution and sigma^2 is the variance (the square of the standard deviation).
• We can use this function as our likelihood function, where mu is defined
as the prediction from the model with a given set of coefficients (Beta)
and sigma is a fixed constant.
• First, we can state the problem as the maximization of the product of the
probabilities for each example in the dataset:
• maximize product i to n (1 / sqrt(2 * pi * sigma^2)) * exp(-1/(2 * sigma^2)
* (yi – h(xi, Beta))^2)
• Where xi is a given example and Beta refers to the coefficients of the
linear regression model. We can transform this to a log-likelihood model
as follows:
• maximize sum i to n log (1 / sqrt(2 * pi * sigma^2)) – (1/(2 * sigma^2) * (yi
– h(xi, Beta))^2)

• (It can be simplified further)
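• As a sketch of that simplification (standard algebra, not spelled out on the original slide): sigma is a fixed constant, so the term log(1 / sqrt(2 * pi * sigma^2)) does not depend on Beta, and the positive factor 1/(2 * sigma^2) does not change the arg max. Dropping both leaves:
• minimize sum i to n (yi – h(xi, Beta))^2
• In other words, maximizing the Gaussian log-likelihood is equivalent to ordinary least squares (minimizing the sum of squared errors).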


• It’s interesting that the prediction is the mean of a
distribution.
• It suggests that we can very reasonably add a bound
to the prediction to give a prediction interval based
on the standard deviation of the distribution, which
is indeed a common practice.
• Although the model assumes a Gaussian distribution
in the prediction (i.e. Gaussian noise function or
error function), there is no such expectation for the
inputs to the model (X).
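• A tiny numeric sketch of such a bound, using hypothetical values for the prediction and the estimated noise standard deviation (these names and numbers are placeholders, not from the slides):

y_hat = 12.3        # model prediction, i.e. the mean of the predictive Gaussian
sigma_hat = 1.8     # standard deviation estimated from the residuals

# Approximate 95% prediction interval under the Gaussian noise assumption.
lower, upper = y_hat - 1.96 * sigma_hat, y_hat + 1.96 * sigma_hat
print(lower, upper)   # about (8.77, 15.83)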
• A brief, simple derivation of the MLE equations for the line
• If we consider y to be our vector of measured data, β0 and β1 the actual linear model parameters (intercept and gradient), and ϵ the error vector, then the model can be expressed as:

y1 = β0 + β1*x1 + ϵ1
y2 = β0 + β1*x2 + ϵ2
…
yn = β0 + β1*xn + ϵn

• If the error vector ϵ is normally distributed N(0, σ^2), each measurement can be thought of as being sampled from its own distribution with mean μi = β0 + β1*xi and constant variance σ^2.
• If the probability of a single point is N(yi | μi, σ^2), then the probability of all points occurring from the distributions defined by μi = β0 + β1*xi and σ^2 is the product of these probabilities:
• product i=1 to n N(yi | μi, σ^2)
• This is also the likelihood of the normal distributions defined by μi and σ^2 being the distributions from which the data points yi have been sampled.

• Now, since the probability density function (pdf) for a normal distribution is:
• N(yi | μi, σ^2) = (1 / sqrt(2 * pi * σ^2)) * exp(-(yi – μi)^2 / (2 * σ^2))
• The likelihood can be defined as the product of the individual probabilities calculated for each data point:
• L(β0, β1, σ^2) = product i=1 to n (1 / sqrt(2 * pi * σ^2)) * exp(-(yi – (β0 + β1*xi))^2 / (2 * σ^2))
• Now let's take the natural logarithm of each side, mainly to help simplify the equation by separating the products into sums:
• ln L(β0, β1, σ^2) = -(n/2) * ln(2 * pi * σ^2) – (1/(2 * σ^2)) * sum i=1 to n (yi – (β0 + β1*xi))^2
• These formulas can be implemented in a programming language such as Python, as sketched below:
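• (A minimal sketch, assuming NumPy and SciPy are installed; the data below are simulated and the log-scale parameterization of σ is an implementation choice, not part of the derivation above.)

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated data from a hypothetical "true" line, for illustration only.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.5 * x + rng.normal(scale=1.5, size=100)

def neg_log_likelihood(params):
    # Negative of the log-likelihood derived above: sum of ln N(yi | β0 + β1*xi, σ^2).
    b0, b1, log_sigma = params
    mu = b0 + b1 * x                  # μi = β0 + β1*xi
    sigma = np.exp(log_sigma)         # parameterize on the log scale so σ stays positive
    return -np.sum(norm.logpdf(y, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0, 0.0]))
b0_hat, b1_hat, sigma_hat = result.x[0], result.x[1], np.exp(result.x[2])
print(b0_hat, b1_hat, sigma_hat)   # close to the least squares fit of the same line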
