Likelihood Frequentist

Likelihood Frequentist
• Introduction: Likelihood
• Likelihood describes how well a candidate distribution (or a particular value of its parameters) explains the data we have observed, while probability describes the chance of observing some outcome given a known distribution of the data.

• Maximum Likelihood Estimation (MLE) is a frequentist approach for estimating the parameters of a model given some observed data. The general approach for using MLE is:
• Observe some data.
• Write down a model for how we believe the data was generated.
• Set the parameters of our model to values which maximize the
likelihood of the parameters given the data.
Likelihood Frequentist
• Models
• A model is a formal representation of our beliefs, assumptions,
and simplifications surrounding some event or process.
• Let’s look at a couple of examples to make this idea clear.
• Example: Coin Flip
• We’d like to build a model for flipping a specific coin. What do we
know?
• The coin has two faces and an edge.
• The faces have different designs.
• The coin can sit on either face or on its edge.
• The weight of the coin.
• The diameter and thickness of the coin.
Likelihood Frequentist
• What assumptions can we make?
• The different designs probably cause the coin’s center of mass to
slightly favor one side over another.
• There’s no way to measure the force or angle exerted on the coin when
it’s flipped.
• Let’s take a first stab at writing down a model without simplifications:
• The initial position of the coin is drawn from a Bernoulli distribution.
This represents the flipper’s preference of starting the coin heads up
vs. heads down in their hand.
• The force exerted on the coin is drawn from an exponential
distribution.
• The angle in which the force is exerted is drawn from a truncated
normal distribution on the interval [-π, π].
• The center of mass of the coin is at some coordinate (x,y,z) in a system
where the center of the coin is the origin.
• The force of gravity is …
Likelihood Frequentist
• The real world can be complicated.
• Sometimes, a simplified model can do just as well or better. Let’s make a
simplified model:
• The outcome of the flip is drawn from a Bernoulli distribution with the
probability of heads p, and the probability of tails (1-p).
• Our simplified model has only a single parameter! In part one, we learned that we can estimate this parameter by simply flipping the coin a few times and counting the number of heads we get (a short Python sketch of this follows the slide).
• Fitting the complicated model would require many more flips and difficult
calculations. So which model is right?
• Usefulness is the key metric when designing models.
• In this case, use the simplified model even though we know it’s wrong.
• How do I know it’s wrong? I’ve assigned 0 probability to the coin landing on
its edge.
• I’ve never seen this happen in real life, so I’ve made the simplifying
assumption that it can’t occur.
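• As a rough illustration of the simplified model, the following is a minimal Python sketch of the maximum likelihood estimate of p from simulated flips (the bias value and flip count below are made up for the example):

import numpy as np

# Simplified Bernoulli coin model: the only parameter is p, the probability of heads.
# The maximum likelihood estimate of p is simply the observed fraction of heads.
rng = np.random.default_rng(0)
true_p = 0.55     # hypothetical bias of the coin, used only to simulate data
n_flips = 100     # illustrative number of observed flips

flips = rng.binomial(1, true_p, size=n_flips)   # 1 = heads, 0 = tails
p_hat = flips.mean()                            # MLE of p
print(f"MLE of p after {n_flips} flips: {p_hat:.3f}")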
Likelihood Frequentist
• Introduction to Maximum Likelihood Estimation for Machine Learning
• Density estimation is the problem of estimating the probability
distribution for a sample of observations from a problem domain.
• There are many techniques for solving density estimation, although a
common framework used throughout the field of machine learning is
maximum likelihood estimation.
• Maximum likelihood estimation involves defining a likelihood function
for calculating the conditional probability of observing the data sample
given a probability distribution and distribution parameters.

• This approach can be used to search a space of possible distributions and parameters.
Likelihood Frequentist
• Problem of Probability Density Estimation
• A common modeling problem involves how to estimate a joint
probability distribution for a dataset.
• For example, consider a sample of observations (X) from a domain (x1, x2, x3, …, xn), where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).
• Density estimation involves selecting a probability distribution
function and the parameters of that distribution that best
explain the joint probability distribution of the observed data
(X).
Likelihood Frequentist
• How do you choose the probability distribution function?
• How do you choose the parameters for the probability distribution
function?
• This problem is made more challenging as sample (X) drawn from the
population is small and has noise, meaning that any evaluation of an
estimated probability density function and its parameters will have some
error.
• There are many techniques for solving this problem, although two
common approaches are:
• Maximum a Posteriori (MAP), a Bayesian method.
• Maximum Likelihood Estimation (MLE), a frequentist method.
• The main difference is that MLE assumes that all solutions are equally
likely beforehand, whereas MAP allows prior information about the form
of the solution to be harnessed.
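• Written in the notation used later in these slides, the two objectives differ only by the prior term (this summary is a standard statement of the two methods, not taken verbatim from the slides):
• MLE: maximize P(X ; theta)
• MAP: maximize P(X ; theta) * P(theta)
• The prior P(theta) is what lets MAP harness prior information; with a flat (uniform) prior, MAP reduces to MLE.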
Likelihood Frequentist
• One solution to probability density estimation is referred to as
Maximum Likelihood Estimation, or MLE for short.
• Maximum Likelihood Estimation involves treating the problem as an
optimization or search problem, where we seek a set of parameters
that results in the best fit for the joint probability of the data sample
(X).
• First, it involves defining a parameter called theta that defines both the
choice of the probability density function and the parameters of that
distribution. It may be a vector of numerical values whose values
change smoothly and map to different probability distributions and
their parameters.
• In Maximum Likelihood Estimation, we wish to maximize the probability
of observing the data from the joint probability distribution given a
specific probability distribution and its parameters, stated formally as:
Likelihood Frequentist
• P(X | theta)
• This conditional probability is often stated using the semicolon (;)
notation instead of the bar notation (|) because theta is not a
random variable, but instead an unknown parameter. For example:
• P(X ; theta)
• or
• P(x1, x2, x3, …, xn ; theta)
• This resulting conditional probability is referred to as the likelihood
of observing the data given the model parameters and written
using the notation L() to denote the likelihood function. For
example:
• L(X ; theta)
Likelihood Frequentist
• The objective of Maximum Likelihood Estimation is to find the set of
parameters (theta) that maximize the likelihood function, e.g. result in
the largest likelihood value.
• maximize L(X ; theta)
• We can unpack the conditional probability calculated by the likelihood
function.
• Given that the sample is comprised of n examples, we can frame this
as the joint probability of the observed data samples x1, x2, x3, …,
xn in X given the probability distribution parameters (theta).
• L(x1, x2, x3, …, xn ; theta)
• The joint probability distribution can be restated as the multiplication
of the conditional probability for observing each example given the
distribution parameters.
Likelihood Frequentist
• product i to n P(xi ; theta)
• Multiplying many small probabilities together can be numerically unstable in practice, therefore,
it is common to restate this problem as the sum of the log conditional probabilities of observing
each example given the model parameters.
• sum i to n log(P(xi ; theta))
• Where the logarithm with base e, called the natural logarithm, is commonly used.
• This product over many probabilities can be inconvenient […] it is prone to numerical underflow.
To obtain a more convenient but equivalent optimization problem, we observe that taking the
logarithm of the likelihood does not change its arg max but does conveniently transform a
product into a sum
• Given the frequent use of log in the likelihood function, it is commonly referred to as a log-
likelihood function.
• It is common in optimization problems to prefer to minimize the cost function, rather than to
maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally
as a Negative Log-Likelihood (NLL) function.
• minimize -sum i to n log(P(xi ; theta))
• In software, we often phrase both as minimizing a cost function. Maximum likelihood thus
becomes minimization of the negative log-likelihood (NLL) …
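• As a concrete sketch of minimizing a negative log-likelihood numerically (assuming NumPy and SciPy are available; the Gaussian model and the simulated data below are illustrative, not part of the original slides):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Illustrative data: pretend these observations come from an unknown Gaussian.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=200)

def nll(params):
    # Negative log-likelihood of a Gaussian with parameters (mu, log_sigma).
    mu, log_sigma = params
    sigma = np.exp(log_sigma)            # work on the log scale to keep sigma positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(nll, x0=np.array([0.0, 0.0]))  # minimizing NLL = maximizing likelihood
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # should come out close to 5.0 and 2.0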
Likelihood Frequentist
• Relationship to Machine Learning
• This problem of density estimation is directly related to applied machine learning.
• We can frame the problem of fitting a machine learning model as the problem of
probability density estimation.
• Specifically, the choice of model and model parameters is referred to as a modeling
hypothesis h, and the problem involves finding h that best explains the data X.
• P(X ; h)
• We can, therefore, find the modeling hypothesis that maximizes the likelihood
function.
• maximize L(X ; h)
• Or, more fully:
• maximize sum i to n log(P(xi ; h))
• This provides the basis for estimating the probability density of a dataset, which is typically used in unsupervised machine learning algorithms, for example:
Likelihood Frequentist
• Clustering algorithms.
• Using the expected log joint probability as a key quantity for learning in
a probability model with hidden variables is better known in the context
of the celebrated “expectation maximization” or EM algorithm.
• The Maximum Likelihood Estimation framework is also a useful tool for
supervised machine learning.
• This applies to data where we have input and output variables, and the output variable may be a numerical value or a class label, in the case of regression and classification predictive modeling respectively.
• We can state this as the conditional probability of the output (y) given
the input (X) given the modeling hypothesis (h).
• maximize L(y|X ; h)
• Or, more fully:
• maximize sum i to n log(P(yi|xi ; h))
Likelihood Frequentist
• The maximum likelihood estimator can readily be generalized to the case
where our goal is to estimate a conditional probability
• P(y | x ; theta) in order to predict y given x.
• This is the most common situation because it forms the basis for most
supervised learning.
• This means that the same Maximum Likelihood Estimation framework that is
generally used for density estimation can be used to find a supervised learning
model and parameters.
• This provides the basis for foundational linear modeling techniques, such as:
• Linear Regression, for predicting a numerical value.
• Logistic Regression, for binary classification.
• In the case of linear regression, the model is constrained to a line and involves
finding a set of coefficients for the line that best fits the observed data.
• This problem can be solved analytically (e.g. directly using linear algebra).
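• A small sketch of that analytical (least squares) solution via linear algebra, using made-up data (under the Gaussian noise assumption this least squares fit coincides with the maximum likelihood fit):

import numpy as np

# Illustrative data roughly on the line y = 2 + 3x plus Gaussian noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# The least squares (and, under Gaussian noise, maximum likelihood) coefficients.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # approximately [2, 3]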
Likelihood Frequentist
• In the case of logistic regression, the model defines a line and involves finding
a set of coefficients for the line that best separates the classes.
• This cannot be solved analytically and is often solved by searching the space of possible coefficient values using an efficient optimization algorithm (e.g. the BFGS algorithm or variants; a sketch of such a fit follows this slide).

• Both methods can also be solved less efficiently using a more general
optimization algorithm such as stochastic gradient descent.

• In fact, most machine learning models can be framed under the maximum
likelihood estimation framework, providing a useful and consistent way to
approach predictive modeling as an optimization problem.
• An important benefit of the maximum likelihood estimator in machine learning is that as the size of the dataset increases, the quality of the estimator continues to improve.
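• The following is a hedged sketch (assuming NumPy and SciPy) of fitting logistic regression coefficients by minimizing the negative log-likelihood with BFGS; the data are simulated and the intercept is omitted for brevity:

import numpy as np
from scipy.optimize import minimize

# Illustrative binary classification data with two input features.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)

def nll(w):
    # Per-example negative log-likelihood of logistic regression: log(1 + exp(z)) - y*z.
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)

result = minimize(nll, x0=np.zeros(2), method="BFGS")
print(result.x)   # coefficients recovered close to true_w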
Fitting a Line using Likelihood
• Linear Regression as Maximum Likelihood
• We can frame the problem of fitting a machine
learning model as the problem of probability density
estimation.
• Specifically, the choice of model and model
parameters is referred to as a modeling hypothesis h,
and the problem involves finding h that best explains
the data X. We can, therefore, find the modeling
hypothesis that maximizes the likelihood function.
• maximize sum i to n log(P(xi ; h))
• Supervised learning can be framed as a conditional
probability problem of predicting the probability of
the output given the input:
• P(y | X)
• As such, we can define conditional maximum
likelihood estimation for supervised machine
learning as follows:
• maximize sum i to n log(P(yi|xi ; h))
• Now we can replace h with our linear regression
model.
• We can make some reasonable assumptions, such as the
observations in the dataset are independent and drawn
from the same probability distribution (i.i.d.), and that the
target variable (y) has statistical noise with a Gaussian
distribution, zero mean, and the same variance for all
examples.
• With these assumptions, we can frame the problem of
estimating y given X as estimating the mean value for y from
a Gaussian probability distribution given X.
• The analytical form of the Gaussian density function is as follows:
• f(y) = (1 / sqrt(2 * pi * sigma^2)) * exp(-1/(2 * sigma^2) * (y – mu)^2)
• Where mu is the mean of the distribution and sigma^2 is the variance (the square of the standard deviation).
• We can use this function as our likelihood function, where mu is defined
as the prediction from the model with a given set of coefficients (Beta)
and sigma is a fixed constant.
• First, we can state the problem as the maximization of the product of the
probabilities for each example in the dataset:
• maximize product i to n (1 / sqrt(2 * pi * sigma^2)) * exp(-1/(2 * sigma^2)
* (yi – h(xi, Beta))^2)
• Where xi is a given example and Beta refers to the coefficients of the
linear regression model. We can transform this to a log-likelihood model
as follows:
• maximize sum i to n log (1 / sqrt(2 * pi * sigma^2)) – (1/(2 * sigma^2) * (yi
– h(xi, Beta))^2)

• (It can be simplified further)
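• As a sketch of that simplification (standard algebra, not spelled out on the original slide): sigma is a fixed constant, so the term log(1 / sqrt(2 * pi * sigma^2)) does not depend on Beta, and the positive factor 1/(2 * sigma^2) does not change the arg max. Dropping both leaves:
• minimize sum i to n (yi – h(xi, Beta))^2
• In other words, maximizing the Gaussian log-likelihood is equivalent to ordinary least squares (minimizing the sum of squared errors).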


• It’s interesting that the prediction is the mean of a
distribution.
• It suggests that we can very reasonably add a bound
to the prediction to give a prediction interval based
on the standard deviation of the distribution, which
is indeed a common practice.
• Although the model assumes a Gaussian distribution
in the prediction (i.e. Gaussian noise function or
error function), there is no such expectation for the
inputs to the model (X).
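• A tiny numeric sketch of such a bound, using hypothetical values for the prediction and the estimated noise standard deviation (these names and numbers are placeholders, not from the slides):

y_hat = 12.3        # model prediction, i.e. the mean of the predictive Gaussian
sigma_hat = 1.8     # standard deviation estimated from the residuals

# Approximate 95% prediction interval under the Gaussian noise assumption.
lower, upper = y_hat - 1.96 * sigma_hat, y_hat + 1.96 * sigma_hat
print(lower, upper)   # about (8.77, 15.83)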
• A brief, simple derivation of the MLE equations for the line
• If we consider y to be our vector of measured data, β0 and β1 the actual linear model parameters (intercept and gradient), and ϵ the error vector, then the model can be expressed as:

y1 = β0 + β1*x1 + ϵ1
y2 = β0 + β1*x2 + ϵ2
…
yn = β0 + β1*xn + ϵn

• If the error vector ϵ is normally distributed N(0, σ^2), each measurement can be thought of as being sampled from its own distribution with mean μi = β0 + β1*xi and constant variance σ^2.
• If the probability of a single point is N(yi | μi, σ^2), then the probability of all points occurring from the distributions defined by μi = β0 + β1*xi and σ^2 is the product of these probabilities:
• product i=1 to n N(yi | μi, σ^2)
• This is also the likelihood of the normal distributions defined by μi and σ^2 being the distributions from which the data points yi have been sampled.

• Now, since the probability density function (pdf) for a normal distribution is:
• N(yi | μi, σ^2) = (1 / sqrt(2 * pi * σ^2)) * exp(-(yi – μi)^2 / (2 * σ^2))
• The likelihood can be defined as the product of the individual probabilities calculated for each data point:
• L(β0, β1, σ^2) = product i=1 to n (1 / sqrt(2 * pi * σ^2)) * exp(-(yi – (β0 + β1*xi))^2 / (2 * σ^2))
• Now let's take the natural logarithm of each side, mainly to help simplify the equation by separating the products into sums:
• ln L(β0, β1, σ^2) = -(n/2) * ln(2 * pi * σ^2) – (1/(2 * σ^2)) * sum i=1 to n (yi – (β0 + β1*xi))^2
• These formulas can be implemented in a programming language such as Python, as sketched below:
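• (A minimal sketch, assuming NumPy and SciPy are installed; the data below are simulated and the log-scale parameterization of σ is an implementation choice, not part of the derivation above.)

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated data from a hypothetical "true" line, for illustration only.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.5 * x + rng.normal(scale=1.5, size=100)

def neg_log_likelihood(params):
    # Negative of the log-likelihood derived above: sum of ln N(yi | β0 + β1*xi, σ^2).
    b0, b1, log_sigma = params
    mu = b0 + b1 * x                  # μi = β0 + β1*xi
    sigma = np.exp(log_sigma)         # parameterize on the log scale so σ stays positive
    return -np.sum(norm.logpdf(y, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0, 0.0]))
b0_hat, b1_hat, sigma_hat = result.x[0], result.x[1], np.exp(result.x[2])
print(b0_hat, b1_hat, sigma_hat)   # close to the least squares fit of the same line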
