The likelihood function is a central concept in statistics and machine learning and forms the basis of many key methods, such as maximum likelihood estimation (MLE), Bayesian inference, and model selection criteria like AIC and BIC. Although it is often confused with probability, the likelihood function has a distinct interpretation and serves a unique role in statistical modeling.
Probability vs. Likelihood
Before defining the likelihood function, it is important to differentiate between probability and likelihood:
- Probability treats the parameters as fixed and describes how likely different data outcomes are, i.e., P(\text{data} \mid \text{parameters}).
- Likelihood treats the data as fixed and the parameters as variable, i.e., it measures L(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters}).
Mathematical Definition of the Likelihood Function
Let X = (X_1, X_2, \dots, X_n) be a random sample from a probability distribution with probability density function (pdf) or probability mass function (pmf) f(x; \theta), where \theta is an unknown parameter or vector of parameters.
Given a realization x = (x_1, x_2, \dots, x_n), the likelihood function is defined as:
L(\theta \mid x) = \prod_{i=1}^n f(x_i; \theta)
Alternatively, the log-likelihood function is often used for computational convenience:
\ell(\theta \mid x) = \log L(\theta \mid x) = \sum_{i=1}^n \log f(x_i; \theta)
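The practical reason for working on the log scale can be seen numerically. The sketch below (an illustrative example using NumPy and SciPy with a simulated standard normal sample, not part of the original derivation) shows that multiplying thousands of small density values underflows to zero, while summing their logarithms remains stable:
Python
import numpy as np
from scipy.stats import norm

# Hypothetical sample: 2,000 draws from a standard normal distribution
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)

# Direct product of densities underflows to 0.0 in double precision ...
L = np.prod(norm.pdf(x))
# ... while the sum of log-densities is a perfectly usable number
ell = np.sum(norm.logpdf(x))

print(L)    # 0.0 (numerical underflow)
print(ell)  # a finite, large negative log-likelihood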
Example: Coin Toss
Suppose we toss a coin 10 times and get 7 heads.
Let \theta be the probability of heads. The likelihood function is:
L(\theta) = \binom{10}{7} \cdot \theta^7 \cdot (1 - \theta)^3
This function shows how likely it is to observe 7 heads, depending on the value of \theta. You can plot this to see which value of \theta makes the data most likely (this is called Maximum Likelihood Estimation).
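As a quick check, the likelihood can be evaluated at a few candidate values of \theta. The snippet below is a minimal sketch using scipy.stats.binom; the chosen values of \theta are arbitrary:
Python
from scipy.stats import binom

# Likelihood of observing 7 heads in 10 tosses, as a function of theta
for theta in [0.3, 0.5, 0.7, 0.9]:
    print(theta, binom.pmf(7, 10, theta))
# theta = 0.7 gives the largest value, matching the MLE of 7/10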
Maximum Likelihood Estimation (MLE)
The most common use of the likelihood function is in maximum likelihood estimation. MLE seeks the parameter value \hat{\theta} that maximizes the likelihood function:
\hat{\theta}_{\text{MLE}} = \arg \max_{\theta} L(\theta \mid x)
Or equivalently:
\hat{\theta}_{\text{MLE}} = \arg \max_{\theta} \log L(\theta \mid x)
Example: Bernoulli Distribution
Suppose X_1, \dots, X_n \sim \text{Bernoulli}(p), where p \in (0, 1) is unknown. Then,
L(p \mid x) = \prod_{i=1}^n p^{x_i}(1 - p)^{1 - x_i}
\ell(p \mid x) = \sum_{i=1}^n [x_i \log p + (1 - x_i) \log (1 - p)]
Taking the derivative with respect to p, setting it to zero, and solving:
\frac{d\ell}{dp} = \sum_{i=1}^n \left( \frac{x_i}{p} - \frac{1 - x_i}{1 - p} \right) = 0
\Rightarrow (1 - p) \sum_{i=1}^n x_i = p \left( n - \sum_{i=1}^n x_i \right) \Rightarrow \sum_{i=1}^n x_i = np
\Rightarrow \hat{p} = \frac{1}{n} \sum_{i=1}^n x_i
This is simply the sample mean, i.e., the proportion of successes in the sample.
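The closed-form result can be sanity-checked numerically. The sketch below assumes a small made-up Bernoulli sample and uses scipy.optimize.minimize_scalar to maximize the log-likelihood directly, recovering the sample mean:
Python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical Bernoulli sample (7 successes out of 10)
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

# Negative log-likelihood of the Bernoulli model
def neg_log_likelihood(p):
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method='bounded')
print(res.x)     # approximately 0.7
print(x.mean())  # 0.7, the closed-form MLE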
Likelihood Function in Continuous Distributions
For continuous distributions, the likelihood function has the same product form, with density values taking the place of probability masses (the density values themselves are not probabilities). For instance, for the normal distribution:
Let X_i \sim \mathcal{N}(\mu, \sigma^2). Then,
L(\mu, \sigma^2 \mid x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)
The log-likelihood becomes:
\ell(\mu, \sigma^2 \mid x) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2
Maximizing with respect to \mu and \sigma^2 yields the MLEs:
\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i, \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2
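These formulas are easy to verify on simulated data. The following sketch assumes a hypothetical sample from \mathcal{N}(5, 2^2) and simply evaluates the closed-form estimators; note that the MLE of \sigma^2 divides by n, not n - 1:
Python
import numpy as np

# Hypothetical sample from a normal distribution with mean 5 and standard deviation 2
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

mu_hat = x.mean()                       # MLE of mu: the sample mean
sigma2_hat = np.mean((x - mu_hat)**2)   # MLE of sigma^2: divides by n, not n - 1

print(mu_hat, sigma2_hat)
# np.var(x) also divides by n by default, so it matches sigma2_hat;
# np.var(x, ddof=1) gives the unbiased (n - 1) estimator instead.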
Properties of the Likelihood Function
1. Invariance under Reparameterization: If \hat{\theta} maximizes L(\theta \mid x), and \phi = g(\theta) is a bijective transformation, then \hat{\phi} = g(\hat{\theta}) maximizes L_\phi(\phi \mid x). For example, if \hat{p} = 0.7 for a Bernoulli model, the MLE of the odds \phi = p/(1-p) is 0.7/0.3 \approx 2.33.
2. Not a Probability: The likelihood generally does not integrate to 1 over the parameter space; it is a function of the parameters, not a probability distribution over them (see the sketch after this list).
3. Relative Scale Matters: The absolute value of the likelihood is often less important than relative likelihoods across different parameter values.
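To see property 2 concretely, the sketch below integrates the coin-toss likelihood L(\theta) = \theta^7 (1 - \theta)^3 over \theta \in [0, 1] (binomial coefficient omitted, as it does not depend on \theta) and shows that the area is far from 1:
Python
from scipy.integrate import quad

# Likelihood from the coin-toss example, viewed as a function of theta
area, _ = quad(lambda t: t**7 * (1 - t)**3, 0, 1)
print(area)  # about 0.00076 (the Beta function B(8, 4)), clearly not 1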
Likelihood vs Posterior in Bayesian Inference
In Bayesian inference, the likelihood plays a central role in updating beliefs:
P(\theta \mid x) = \frac{L(\theta \mid x) \cdot \pi(\theta)}{P(x)}
Where:
- \pi(\theta) is the prior,
- P(\theta \mid x) is the posterior,
- L(\theta \mid x) is the likelihood,
- P(x) is the marginal likelihood or evidence.
Thus, the likelihood is the bridge between prior and posterior.
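A small worked example may help. The sketch below assumes a Beta(2, 2) prior (an arbitrary illustrative choice) and the coin-toss data from earlier; because the Beta prior is conjugate to the Bernoulli/Binomial likelihood, the posterior is again a Beta distribution and can be updated in closed form:
Python
from scipy.stats import beta

# Coin-toss data from earlier: 7 heads out of 10 tosses
heads, n = 7, 10

# Hypothetical Beta(2, 2) prior on theta
a_prior, b_prior = 2, 2

# Conjugate update: posterior is Beta(a_prior + heads, b_prior + n - heads)
a_post, b_post = a_prior + heads, b_prior + n - heads

print(beta.mean(a_post, b_post))            # posterior mean, about 0.64
print(beta.interval(0.95, a_post, b_post))  # 95% credible interval for theta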
Visualization of Likelihood Functions in Python
1D Likelihood Visualization – Bernoulli Distribution
A Bernoulli model for the 1D case (estimating probability p).
Python
import numpy as np
import matplotlib.pyplot as plt
# Simulated data: 10 coin tosses (1 = heads, 0 = tails)
data = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
n = len(data)
x_sum = np.sum(data)  # number of heads

# Likelihood of the data as a function of p
# (the binomial coefficient is omitted since it does not depend on p)
def likelihood_bernoulli(p):
    return p**x_sum * (1 - p)**(n - x_sum)

# Parameter values
p_vals = np.linspace(0.01, 0.99, 100)
like_vals = np.array([likelihood_bernoulli(p) for p in p_vals])

# Normalize for plotting (only relative likelihood matters)
like_vals = like_vals / np.max(like_vals)

plt.figure(figsize=(6, 4))
plt.plot(p_vals, like_vals, label="Likelihood")
plt.axvline(x=x_sum/n, color='red', linestyle='--', label='MLE (Sample Mean)')
plt.title("Likelihood Function (Bernoulli)")
plt.xlabel("Parameter $p$")
plt.ylabel("Relative Likelihood")
plt.legend()
plt.grid(True)
plt.show()
Output: a plot titled "Likelihood Function (Bernoulli)" showing the relative likelihood as a function of p, with the MLE marked by the dashed vertical line.
Inference:
- The peak of the likelihood curve occurs at the MLE, \hat{p} = \frac{7}{10} = 0.7.
- The narrowness of the curve reflects how confident the estimate is: narrower = more information (larger n).
- The shape is unimodal and smooth, indicating a unique MLE.
2D Likelihood Visualization – Normal Distribution in Python
A univariate normal model with two unknown parameters (\mu and \sigma) for the 2D case.
Python
import matplotlib.pyplot as plt
import numpy as np
# Simulated data
np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=50)
# Grid of mu and sigma values
mu_vals = np.linspace(3, 7, 100)
sigma_vals = np.linspace(0.5, 4, 100)
MU, SIGMA = np.meshgrid(mu_vals, sigma_vals)
# Log-likelihood of the normal model; the additive constant -n/2 * log(2*pi)
# is dropped because it does not affect where the maximum occurs
def log_likelihood(mu, sigma):
    n = len(data)
    return -n*np.log(sigma) - np.sum((data - mu)**2) / (2*sigma**2)
Z = np.array([[log_likelihood(mu, sigma) for mu in mu_vals] for sigma in sigma_vals])
# Normalize for plotting
Z_norm = Z - np.max(Z)
plt.figure(figsize=(6, 4))
cp = plt.contourf(MU, SIGMA, Z_norm, levels=20, cmap='viridis')
plt.colorbar(cp)
plt.title("Log-Likelihood Contour (Normal Distribution)")
plt.xlabel("Mean $\mu$")
plt.ylabel("Standard Deviation $\sigma$")
plt.grid(True)
plt.show()
Output: a contour plot titled "Log-Likelihood Contour (Normal Distribution)" over the (\mu, \sigma) grid.
Inference:
- The peak of the contour represents the MLE (\hat{\mu}, \hat{\sigma}).
- The elliptical shape of the contours reflects the curvature of the likelihood function—more circular = lower correlation between parameters.
- A flat or elongated shape would indicate parameter uncertainty or collinearity.
Applications of the Likelihood Function
1. Model Fitting
Likelihood-based methods are used to fit models in:
- Regression (logistic, Poisson, etc.),
- Time series (ARIMA),
- Hidden Markov models (HMMs),
- Neural networks (via cross-entropy, which is derived from the log-likelihood; see the sketch below).
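For the last point, the connection is direct: binary cross-entropy is the negative average Bernoulli log-likelihood of the labels under the predicted probabilities. The sketch below uses made-up labels and predictions to illustrate:
Python
import numpy as np

# Hypothetical labels and predicted probabilities from a classifier
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Average Bernoulli log-likelihood of the labels under the predictions
log_lik = np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Binary cross-entropy is its negation
bce = -log_lik
print(bce)  # minimizing cross-entropy == maximizing the log-likelihood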
2. Hypothesis Testing
The likelihood ratio test (LRT) compares nested models:
\lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid x)}{\sup_{\theta \in \Theta} L(\theta \mid x)} \Rightarrow -2 \log \lambda \sim \chi^2_{df}
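As a concrete illustration (a minimal sketch reusing the coin-toss data), consider testing H_0: p = 0.5 against an unrestricted p. The null model has no free parameters and the alternative has one, so the statistic is compared to a \chi^2_1 distribution:
Python
import numpy as np
from scipy.stats import chi2

# Coin-toss data: 7 heads in 10 tosses
heads, n = 7, 10
p_hat = heads / n

# Bernoulli log-likelihood as a function of p
def log_lik(p):
    return heads * np.log(p) + (n - heads) * np.log(1 - p)

# -2 log(lambda): restricted model (p = 0.5) vs. unrestricted model (p = p_hat)
stat = -2 * (log_lik(0.5) - log_lik(p_hat))
p_value = chi2.sf(stat, df=1)
print(stat, p_value)  # stat about 1.65, p-value about 0.20 (fail to reject H0)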
3. Model Selection
Criteria like AIC and BIC are based on the likelihood:
- AIC: \text{AIC} = -2 \ell(\hat{\theta}) + 2k
- BIC: \text{BIC} = -2 \ell(\hat{\theta}) + k \log n
Where:
- k = number of parameters,
- n = number of observations.
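As an illustrative sketch (assuming a simulated normal sample and an arbitrary comparison between a mean-only model with variance fixed at 1 and the full two-parameter normal model), AIC and BIC can be computed directly from the maximized log-likelihoods:
Python
import numpy as np
from scipy.stats import norm

# Hypothetical sample
rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=50)
n = len(x)

def aic_bic(log_lik, k):
    return -2 * log_lik + 2 * k, -2 * log_lik + k * np.log(n)

# Model 1: normal with unknown mean, variance fixed at 1 (k = 1)
ll1 = np.sum(norm.logpdf(x, loc=x.mean(), scale=1.0))
# Model 2: normal with unknown mean and variance (k = 2), evaluated at the MLEs
ll2 = np.sum(norm.logpdf(x, loc=x.mean(), scale=x.std()))

print(aic_bic(ll1, 1))  # larger AIC/BIC
print(aic_bic(ll2, 2))  # smaller AIC/BIC: the two-parameter model is preferred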
Limitations and Challenges
- Overfitting: MLE tends to fit noise in small samples.
- Non-identifiability: Likelihood may be flat or multimodal.
- Computational issues: Numerical instability for large data or complex models.