The likelihood function is a central concept in statistics and machine learning and forms the basis of many key methods, such as maximum likelihood estimation (MLE), Bayesian inference, and model selection criteria like AIC and BIC. Although it is often confused with probability, the likelihood function has a distinct interpretation and serves a unique role in statistical modeling.
Probability vs. Likelihood
Before defining the likelihood function, it is important to differentiate between probability and likelihood:
- Probability treats the parameters as fixed and describes how likely different data outcomes are, i.e., P(\text{data} \mid \text{parameters}).
- Likelihood treats the data as fixed and the parameters as variable, i.e., it measures L(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters}).
Mathematical Definition of the Likelihood Function
Let X = (X_1, X_2, \dots, X_n) be a random sample from a probability distribution with probability density function (pdf) or probability mass function (pmf) f(x; \theta), where \theta is an unknown parameter or vector of parameters.
Given a realization x = (x_1, x_2, \dots, x_n), the likelihood function is defined as:
L(\theta \mid x) = \prod_{i=1}^n f(x_i; \theta)
Alternatively, the log-likelihood function is often used for computational convenience:
\ell(\theta \mid x) = \log L(\theta \mid x) = \sum_{i=1}^n \log f(x_i; \theta)
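The practical reason for working on the log scale can be seen numerically. The sketch below (an illustrative example using NumPy and SciPy with a simulated standard normal sample, not part of the original derivation) shows that multiplying thousands of small density values underflows to zero, while summing their logarithms remains stable:
Python
import numpy as np
from scipy.stats import norm

# Hypothetical sample: 2,000 draws from a standard normal distribution
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)

# Direct product of densities underflows to 0.0 in double precision ...
L = np.prod(norm.pdf(x))
# ... while the sum of log-densities is a perfectly usable number
ell = np.sum(norm.logpdf(x))

print(L)    # 0.0 (numerical underflow)
print(ell)  # a finite, large negative log-likelihood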
Example: Coin Toss
Suppose we toss a coin 10 times and get 7 heads.
Let \theta be the probability of heads. The likelihood function is:
L(\theta) = \binom{10}{7} \cdot \theta^7 \cdot (1 - \theta)^3
This function shows how likely it is to observe 7 heads, depending on the value of \theta. You can plot this to see which value of \theta makes the data most likely (this is called Maximum Likelihood Estimation).
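As a quick check, the likelihood can be evaluated at a few candidate values of \theta. The snippet below is a minimal sketch using scipy.stats.binom; the chosen values of \theta are arbitrary:
Python
from scipy.stats import binom

# Likelihood of observing 7 heads in 10 tosses, as a function of theta
for theta in [0.3, 0.5, 0.7, 0.9]:
    print(theta, binom.pmf(7, 10, theta))
# theta = 0.7 gives the largest value, matching the MLE of 7/10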
Maximum Likelihood Estimation (MLE)
The most common use of the likelihood function is in maximum likelihood estimation. MLE seeks the parameter value \hat{\theta} that maximizes the likelihood function:
\hat{\theta}_{\text{MLE}} = \arg \max_{\theta} L(\theta \mid x)
Or equivalently:
\hat{\theta}_{\text{MLE}} = \arg \max_{\theta} \log L(\theta \mid x)
Example: Bernoulli Distribution
Suppose X_1, \dots, X_n \sim \text{Bernoulli}(p), where p \in (0, 1) is unknown. Then,
L(p \mid x) = \prod_{i=1}^n p^{x_i}(1 - p)^{1 - x_i}
\ell(p \mid x) = \sum_{i=1}^n [x_i \log p + (1 - x_i) \log (1 - p)]
Taking the derivative with respect to p, setting it to zero, and solving:
\frac{d\ell}{dp} = \sum_{i=1}^n \left( \frac{x_i}{p} - \frac{1 - x_i}{1 - p} \right) = 0
\Rightarrow (1 - p) \sum_{i=1}^n x_i = p \left( n - \sum_{i=1}^n x_i \right) \Rightarrow \sum_{i=1}^n x_i = np
\Rightarrow \hat{p} = \frac{1}{n} \sum_{i=1}^n x_i
This is simply the sample mean, i.e., the proportion of successes in the sample.
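The closed-form result can be sanity-checked numerically. The sketch below assumes a small made-up Bernoulli sample and uses scipy.optimize.minimize_scalar to maximize the log-likelihood directly, recovering the sample mean:
Python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical Bernoulli sample (7 successes out of 10)
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

# Negative log-likelihood of the Bernoulli model
def neg_log_likelihood(p):
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method='bounded')
print(res.x)     # approximately 0.7
print(x.mean())  # 0.7, the closed-form MLE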
Likelihood Function in Continuous Distributions
For continuous distributions, the likelihood function has the same product form, with density values taking the place of probability masses (the density values themselves are not probabilities). For instance, for the normal distribution:
Let X_i \sim \mathcal{N}(\mu, \sigma^2). Then,
L(\mu, \sigma^2 \mid x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)
The log-likelihood becomes:
\ell(\mu, \sigma^2 \mid x) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2
Maximizing with respect to \mu and \sigma^2 yields the MLEs:
\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i, \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2
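These formulas are easy to verify on simulated data. The following sketch assumes a hypothetical sample from \mathcal{N}(5, 2^2) and simply evaluates the closed-form estimators; note that the MLE of \sigma^2 divides by n, not n - 1:
Python
import numpy as np

# Hypothetical sample from a normal distribution with mean 5 and standard deviation 2
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

mu_hat = x.mean()                       # MLE of mu: the sample mean
sigma2_hat = np.mean((x - mu_hat)**2)   # MLE of sigma^2: divides by n, not n - 1

print(mu_hat, sigma2_hat)
# np.var(x) also divides by n by default, so it matches sigma2_hat;
# np.var(x, ddof=1) gives the unbiased (n - 1) estimator instead.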
Properties of the Likelihood Function
1. Invariance under Reparameterization: If \hat{\theta} maximizes L(\theta \mid x), and \phi = g(\theta) is a bijective transformation, then \hat{\phi} = g(\hat{\theta}) maximizes L_\phi(\phi \mid x). For example, if \hat{p} = 0.7 for a Bernoulli model, the MLE of the odds \phi = p/(1-p) is 0.7/0.3 \approx 2.33.
2. Not a Probability: The likelihood generally does not integrate to 1 over the parameter space; it is a function of the parameters, not a probability distribution over them (see the sketch after this list).
3. Relative Scale Matters: The absolute value of the likelihood is often less important than relative likelihoods across different parameter values.
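To see property 2 concretely, the sketch below integrates the coin-toss likelihood L(\theta) = \theta^7 (1 - \theta)^3 over \theta \in [0, 1] (binomial coefficient omitted, as it does not depend on \theta) and shows that the area is far from 1:
Python
from scipy.integrate import quad

# Likelihood from the coin-toss example, viewed as a function of theta
area, _ = quad(lambda t: t**7 * (1 - t)**3, 0, 1)
print(area)  # about 0.00076 (the Beta function B(8, 4)), clearly not 1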
Likelihood vs Posterior in Bayesian Inference
In Bayesian inference, the likelihood plays a central role in updating beliefs:
P(\theta \mid x) = \frac{L(\theta \mid x) \cdot \pi(\theta)}{P(x)}
Where:
- \pi(\theta) is the prior,
- P(\theta \mid x) is the posterior,
- L(\theta \mid x) is the likelihood,
- P(x) is the marginal likelihood or evidence.
Thus, the likelihood is the bridge between prior and posterior.
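A small worked example may help. The sketch below assumes a Beta(2, 2) prior (an arbitrary illustrative choice) and the coin-toss data from earlier; because the Beta prior is conjugate to the Bernoulli/Binomial likelihood, the posterior is again a Beta distribution and can be updated in closed form:
Python
from scipy.stats import beta

# Coin-toss data from earlier: 7 heads out of 10 tosses
heads, n = 7, 10

# Hypothetical Beta(2, 2) prior on theta
a_prior, b_prior = 2, 2

# Conjugate update: posterior is Beta(a_prior + heads, b_prior + n - heads)
a_post, b_post = a_prior + heads, b_prior + n - heads

print(beta.mean(a_post, b_post))            # posterior mean, about 0.64
print(beta.interval(0.95, a_post, b_post))  # 95% credible interval for theta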
Visualization of Likelihood Functions in Python
1D Likelihood Visualization – Bernoulli Distribution
A Bernoulli model for the 1D case (estimating probability p).
Python
import numpy as np
import matplotlib.pyplot as plt
# Simulated data: 10 coin tosses (1 = heads, 0 = tails)
data = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
n = len(data)
x_sum = np.sum(data)  # number of heads

# Likelihood of the data as a function of p
# (the binomial coefficient is omitted since it does not depend on p)
def likelihood_bernoulli(p):
    return p**x_sum * (1 - p)**(n - x_sum)

# Parameter values
p_vals = np.linspace(0.01, 0.99, 100)
like_vals = np.array([likelihood_bernoulli(p) for p in p_vals])

# Normalize for plotting (only relative likelihood matters)
like_vals = like_vals / np.max(like_vals)

plt.figure(figsize=(6, 4))
plt.plot(p_vals, like_vals, label="Likelihood")
plt.axvline(x=x_sum/n, color='red', linestyle='--', label='MLE (Sample Mean)')
plt.title("Likelihood Function (Bernoulli)")
plt.xlabel("Parameter $p$")
plt.ylabel("Relative Likelihood")
plt.legend()
plt.grid(True)
plt.show()
Output: a plot titled "Likelihood Function (Bernoulli)" showing the relative likelihood as a function of p, with the MLE marked by the dashed vertical line.
Inference:
- The peak of the likelihood curve occurs at the MLE, \hat{p} = \frac{7}{10} = 0.7.
- The narrowness of the curve reflects how confident the estimate is: narrower = more information (larger n).
- The shape is unimodal and smooth, indicating a unique MLE.
2D Likelihood Visualization – Normal Distribution in Python
A univariate normal model with two unknown parameters (\mu and \sigma) for the 2D case.
Python
import matplotlib.pyplot as plt
import numpy as np
# Simulated data
np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=50)
# Grid of mu and sigma values
mu_vals = np.linspace(3, 7, 100)
sigma_vals = np.linspace(0.5, 4, 100)
MU, SIGMA = np.meshgrid(mu_vals, sigma_vals)
# Log-likelihood of the normal model; the additive constant -n/2 * log(2*pi)
# is dropped because it does not affect where the maximum occurs
def log_likelihood(mu, sigma):
    n = len(data)
    return -n*np.log(sigma) - np.sum((data - mu)**2) / (2*sigma**2)
Z = np.array([[log_likelihood(mu, sigma) for mu in mu_vals] for sigma in sigma_vals])
# Normalize for plotting
Z_norm = Z - np.max(Z)
plt.figure(figsize=(6, 4))
cp = plt.contourf(MU, SIGMA, Z_norm, levels=20, cmap='viridis')
plt.colorbar(cp)
plt.title("Log-Likelihood Contour (Normal Distribution)")
plt.xlabel("Mean $\mu$")
plt.ylabel("Standard Deviation $\sigma$")
plt.grid(True)
plt.show()
Output: a contour plot titled "Log-Likelihood Contour (Normal Distribution)" over the (\mu, \sigma) grid.
Inference:
- The peak of the contour represents the MLE (\hat{\mu}, \hat{\sigma}).
- The elliptical shape of the contours reflects the curvature of the likelihood function—more circular = lower correlation between parameters.
- A flat or elongated shape would indicate parameter uncertainty or collinearity.
Applications of the Likelihood Function
1. Model Fitting
Likelihood-based methods are used to fit models in:
- Regression (logistic, Poisson, etc.),
- Time series (ARIMA),
- Hidden Markov models (HMMs),
- Neural networks (via cross-entropy, which is derived from the log-likelihood; see the sketch below).
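For the last point, the connection is direct: binary cross-entropy is the negative average Bernoulli log-likelihood of the labels under the predicted probabilities. The sketch below uses made-up labels and predictions to illustrate:
Python
import numpy as np

# Hypothetical labels and predicted probabilities from a classifier
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Average Bernoulli log-likelihood of the labels under the predictions
log_lik = np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Binary cross-entropy is its negation
bce = -log_lik
print(bce)  # minimizing cross-entropy == maximizing the log-likelihood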
2. Hypothesis Testing
The likelihood ratio test (LRT) compares nested models:
\lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid x)}{\sup_{\theta \in \Theta} L(\theta \mid x)} \Rightarrow -2 \log \lambda \sim \chi^2_{df}
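As a concrete illustration (a minimal sketch reusing the coin-toss data), consider testing H_0: p = 0.5 against an unrestricted p. The null model has no free parameters and the alternative has one, so the statistic is compared to a \chi^2_1 distribution:
Python
import numpy as np
from scipy.stats import chi2

# Coin-toss data: 7 heads in 10 tosses
heads, n = 7, 10
p_hat = heads / n

# Bernoulli log-likelihood as a function of p
def log_lik(p):
    return heads * np.log(p) + (n - heads) * np.log(1 - p)

# -2 log(lambda): restricted model (p = 0.5) vs. unrestricted model (p = p_hat)
stat = -2 * (log_lik(0.5) - log_lik(p_hat))
p_value = chi2.sf(stat, df=1)
print(stat, p_value)  # stat about 1.65, p-value about 0.20 (fail to reject H0)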
3. Model Selection
Criteria like AIC and BIC are based on the likelihood:
- AIC: \text{AIC} = -2 \ell(\hat{\theta}) + 2k
- BIC: \text{BIC} = -2 \ell(\hat{\theta}) + k \log n
Where:
- k = number of parameters,
- n = number of observations.
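As an illustrative sketch (assuming a simulated normal sample and an arbitrary comparison between a mean-only model with variance fixed at 1 and the full two-parameter normal model), AIC and BIC can be computed directly from the maximized log-likelihoods:
Python
import numpy as np
from scipy.stats import norm

# Hypothetical sample
rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=50)
n = len(x)

def aic_bic(log_lik, k):
    return -2 * log_lik + 2 * k, -2 * log_lik + k * np.log(n)

# Model 1: normal with unknown mean, variance fixed at 1 (k = 1)
ll1 = np.sum(norm.logpdf(x, loc=x.mean(), scale=1.0))
# Model 2: normal with unknown mean and variance (k = 2), evaluated at the MLEs
ll2 = np.sum(norm.logpdf(x, loc=x.mean(), scale=x.std()))

print(aic_bic(ll1, 1))  # larger AIC/BIC
print(aic_bic(ll2, 2))  # smaller AIC/BIC: the two-parameter model is preferred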
Limitations and Challenges
- Overfitting: MLE tends to fit noise in small samples.
- Non-identifiability: Likelihood may be flat or multimodal.
- Computational issues: Numerical instability for large data or complex models.