Bayesian linear regression
Anna-Lena Popkes
In nearly all real-world situations, our data and knowledge about the world are incomplete, indirect, and noisy. Hence, uncertainty must be a
fundamental part of our decision-making process. This is exactly what
the Bayesian approach is about. It provides a formal and consistent way
to reason in the presence of uncertainty. Bayesian methods have been
around for a long time and are widely used in many areas of science
(e.g. astronomy). Although Bayesian methods have been applied to
machine learning problems too, they are usually less well known to
beginners. The major reason is that they require a good understanding
of probability theory.
In the following post we will work our way from linear regression to
Bayesian linear regression, including the most important theoretical
knowledge and code examples. Remember: you can run the original
notebook directly in your browser using Binder.
In regression, we want to model a function $f$ that maps inputs $x$ to real-valued function values $f(x) \in \mathbb{R}$. We are given an input dataset $\mathcal{D} = \{x_n, y_n\}_{n=1}^{N}$, where $y_n$ is a noisy observation of the function value: $y_n = f(x_n) + \epsilon$.
3. Fundamental concepts
One fundamental tool in Bayesian learning is Bayes’ theorem. For a hypothesis $h$ and observed evidence $e$, Bayes’ theorem looks as follows:
$$p(h \mid e) = \frac{p(e \mid h)\, p(h)}{p(e)}$$
In our regression setting, the hypothesis is the parameter vector $\theta$ and the evidence is the observed data, so we use Bayes’ theorem to compute
$$p(\theta \mid x, y) = \frac{p(y \mid x, \theta)\, p(\theta)}{p(y \mid x)}$$
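As a small numerical illustration of the theorem (a made-up example, not part of the original post), here is Bayes’ theorem applied to two competing hypotheses and a single piece of evidence:

# Prior beliefs p(h) for two hypotheses
prior = {"h1": 0.5, "h2": 0.5}
# Likelihood p(e | h) of the observed evidence under each hypothesis
likelihood = {"h1": 0.8, "h2": 0.2}

# Evidence p(e) = sum over h of p(e | h) * p(h)
evidence = sum(likelihood[h] * prior[h] for h in prior)
# Posterior p(h | e) = p(e | h) * p(h) / p(e)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)  # {'h1': 0.8, 'h2': 0.2}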
We model the targets as a linear function of the inputs with additive Gaussian noise, $y = x^{\top}\theta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. This corresponds to the following likelihood function:
$$p(y \mid x, \theta) = \mathcal{N}\big(y \mid x^{\top}\theta,\; \sigma^2\big)$$
When $x$ is not a vector but a single value (that is, the dataset is one-dimensional), the model $y_i = x_i \cdot \theta$ describes straight lines, with $\theta$ being the slope of the line.
The plot below shows example lines produced with the model $y = x \cdot \theta$, using different values for the slope $\theta$ and intercept 0.
Having a model which is linear both with respect to the parameters and
inputs limits the functions it can learn significantly. We can make our
model more powerful by making it nonlinear with respect to the inputs.
After all, linear regression refers to models which are linear in
the parameters, not necessarily in the inputs (linear in the parameters
means that the model describes a function by a linear combination of
input features).
Making the model nonlinear with respect to the inputs is easy. We can, for example, set $\phi_i(x) = x^i$, which turns the model into a polynomial of the input. Another common choice of basis function for the one-dimensional case is the Gaussian basis function. Linear regression with basis functions $\phi$ corresponds to the likelihood
$$p(y \mid x, \theta) = \mathcal{N}\big(\phi^{\top}(x)\,\theta,\; \sigma^2\big)$$
Collecting the transformed inputs of all $N$ training points, we define the design matrix $\Phi$ as follows:
$$\Phi := \begin{bmatrix} \phi^{\top}(x_1) \\ \vdots \\ \phi^{\top}(x_N) \end{bmatrix} = \begin{bmatrix} \phi_0(x_1) & \cdots & \phi_{K-1}(x_1) \\ \phi_0(x_2) & \cdots & \phi_{K-1}(x_2) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{K-1}(x_N) \end{bmatrix} \in \mathbb{R}^{N \times K}$$
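To make the definition concrete, here is a small illustration (not code from the original notebook) that builds the design matrix for the polynomial basis $\phi_i(x) = x^i$ with $K$ basis functions:

import numpy as np

def polynomial_design_matrix(x, K):
    """Design matrix with columns phi_0(x) = 1, phi_1(x) = x, ..., phi_{K-1}(x) = x^(K-1)."""
    # Shape (N, K): one row per input, one column per basis function
    return np.stack([x**i for i in range(K)], axis=-1)

x = np.array([1.0, 2.0, 3.0])
print(polynomial_design_matrix(x, K=3))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]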
In Bayesian linear regression we place a prior distribution $p(\theta) = \mathcal{N}(m_0, S_0)$ on the parameter $\theta$, which expresses which parameter values are plausible (before we have seen any data). Recall that the likelihood function is Gaussian, too: $p(y \mid x, \theta) = \mathcal{N}\big(y \mid \phi^{\top}(x)\,\theta,\; \sigma^2\big)$. For the full dataset $Y = \{y_1, \ldots, y_N\}$, the likelihood function becomes a multivariate Gaussian distribution:
$$p(Y \mid X, \theta) = \mathcal{N}\big(y \mid \Phi\theta,\; \sigma^2 I\big)$$
The nice thing about choosing a Gaussian distribution for our prior is that
the posterior distributions will be Gaussian, too (keyword conjugate
prior)!
To allow plotting the data later on, we will assume that it is two-dimensional ($d = 2$).
class BayesianLinearRegression:
    """ Bayesian linear regression

    Args:
        prior_mean: Mean values of the prior distribution (m_0)
        prior_cov: Covariance matrix of the prior distribution (S_0)
        noise_var: Variance of the noise distribution
    """
Our model is linear both in the parameters and in the inputs. Consequently, our data generating function will be of the form $y = \theta_0 + \theta_1 x + \epsilon$.
import numpy as np
import matplotlib.pyplot as plt


# (function header reconstructed; the argument order follows the calls below)
def compute_function_labels(slope, intercept, noise_std_dev, data):
    """ Compute target values for the linear function, optionally corrupted by noise

    Args:
        slope: slope of the function (theta_1)
        intercept: intercept of the function (theta_0)
        noise_std_dev: standard deviation of noise distribution (sigma)
        data: input feature values (x)

    Returns:
        target values, either true or corrupted with noise
    """
    n_samples = len(data)
    if noise_std_dev == 0:  # Real function
        return slope * data + intercept
    else:  # Noise corrupted
        return slope * data + intercept + np.random.normal(0, noise_std_dev, n_samples)
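The cell below relies on a few constants that are defined elsewhere in the notebook and are not part of this excerpt. Plausible placeholder values (assumptions, not the original settings) would be:

slope = 0.9                           # true slope theta_1 (assumed value)
intercept = -0.7                      # true intercept theta_0 (assumed value)
noise_std_dev = 0.5                   # sigma of the observation noise (assumed value)
lower_bound, upper_bound = -1.5, 1.5  # range of the input features (assumed values)
n_datapoints = 100                    # number of training points (assumed value)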
# Set random seed to ensure reproducibility
seed = 42
np.random.seed(seed)
# Generate dataset
features = np.random.uniform(lower_bound, upper_bound, n_datapoints)
labels = compute_function_labels(slope, intercept, 0., features)
noise_corrupted_labels = compute_function_labels(slope, intercept, noise_std_dev, features)
# Plot the dataset
plt.figure(figsize=(10,7))
plt.plot(features, labels, color='r', label="True values")
plt.scatter(features, noise_corrupted_labels, label="Noise corrupted values")
plt.xlabel("Features")
plt.ylabel("Labels")
plt.title("Real function along with noisy targets")
plt.legend();
6.3 Step 2: Posterior over the parameters
We can estimate the parameter posterior using Bayes’ theorem:
$$p(\theta \mid X, Y) = \frac{p(Y \mid X, \theta)\, p(\theta)}{p(Y \mid X)}$$
Here, $p(Y \mid X) = \int p(Y \mid X, \theta)\, p(\theta)\, \mathrm{d}\theta$ is the evidence, which ensures that the posterior is normalized (that is, that it integrates to 1).
The parameter posterior can be estimated in closed form (for a proof see Theorem 9.1 in the book Mathematics for Machine Learning):
$$p(\theta \mid X, Y) = \mathcal{N}(\theta \mid m_N, S_N)$$
$$S_N = \left(S_0^{-1} + \sigma^{-2}\Phi^{\top}\Phi\right)^{-1}$$
$$m_N = S_N\left(S_0^{-1} m_0 + \sigma^{-2}\Phi^{\top} y\right)$$
Coming back to our BayesianLinearRegression class we need to add a
method which allows us to update the posterior distribution given a
dataset.
class BayesianLinearRegression:
    """ Bayesian linear regression

    Args:
        prior_mean: Mean values of the prior distribution (m_0)
        prior_cov: Covariance matrix of the prior distribution (S_0)
        noise_var: Variance of the noise distribution
    """

    def update_posterior(self, features, targets):  # method name reconstructed from the text above
        """ Update the posterior distribution of the parameters

        Args:
            features: numpy array of features
            targets: numpy array of targets
        """
        # Reshape targets to allow correct matrix multiplication
        # Input shape is (N,) but we need (N, 1)
        targets = targets[:, np.newaxis]
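        # --- The rest of the update is not shown in this excerpt. ---
        # A minimal sketch implementing the closed-form formulas above,
        #   S_N = (S_0^{-1} + sigma^{-2} Phi^T Phi)^{-1}
        #   m_N = S_N (S_0^{-1} m_0 + sigma^{-2} Phi^T y),
        # assuming the constructor stored prior_mean (as a column vector),
        # prior_cov and noise_var on the instance:
        design_matrix = self.compute_design_matrix(features)  # Phi, shape (N, K)
        inv_prior_cov = np.linalg.inv(self.prior_cov)         # S_0^{-1}
        self.post_cov = np.linalg.inv(
            inv_prior_cov + design_matrix.T.dot(design_matrix) / self.noise_var)  # S_N
        self.post_mean = self.post_cov.dot(
            inv_prior_cov.dot(self.prior_mean)
            + design_matrix.T.dot(targets) / self.noise_var)  # m_N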
    def compute_design_matrix(self, features):
        """ Compute the design matrix

        Args:
            features: numpy array of features
        Returns:
            design_matrix: numpy array of transformed features
        """
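        # --- The body is not shown in this excerpt; a plausible sketch: ---
        # For the straight-line model y = theta_0 + theta_1 * x the feature
        # transformation is phi(x) = (1, x), i.e. a column of ones for the
        # intercept next to the raw inputs.
        design_matrix = np.stack((np.ones(len(features)), features), axis=-1)
        return design_matrix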
    def predict(self, features):  # method name reconstructed; computes the predictive posterior
        """ Compute the posterior predictive distribution for the given features

        Args:
            features: 1d numpy array of features
        Returns:
            pred_posterior: predictive posterior distribution
        """
        design_matrix = self.compute_design_matrix(features)
        pred_mean = design_matrix.dot(self.post_mean)
        pred_cov = design_matrix.dot(self.post_cov.dot(design_matrix.T)) + self.noise_var
        # univariate_normal is presumably an alias for scipy.stats.norm from the notebook's imports
        pred_posterior = univariate_normal(loc=pred_mean.flatten(), scale=pred_cov**0.5)
        return pred_posterior
The plot above illustrates both the prior parameter distribution and the
true parameter values that we want to find. If our model works correctly,
the posterior distribution should become narrower and move closer
to the true parameter values as the model sees more datapoints. This
can be visualized with contour plots, too! Below we update the posterior
distribution iteratively as the model sees more and more data. The
contour plots for each step show how the parameter posterior develops
and converges close to the true values in the end.
previous_n_points = n_points
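Only this final assignment of the notebook's update-and-plot loop survives in the excerpt above. A minimal sketch of such a loop, assuming the class completed as sketched earlier, the noisy dataset generated above, and a prior of our own choosing (the contour-plotting code itself is omitted), could look like this:

# Hypothetical prior: zero mean, unit covariance (assumed values, not the notebook's)
model = BayesianLinearRegression(prior_mean=np.zeros(2),
                                 prior_cov=np.eye(2),
                                 noise_var=noise_std_dev**2)

previous_n_points = 0
for n_points in [1, 3, 5, 10, 20, 50, 100]:  # growing dataset sizes, chosen for illustration
    # Recompute the posterior from the prior using the first n_points observations
    model.update_posterior(features[:n_points], noise_corrupted_labels[:n_points])
    # ... contour plot of the current parameter posterior goes here ...
    previous_n_points = n_points  # kept to mirror the surviving fragment above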
6.5 Step 3: Posterior predictive distribution
For a new input $x_*$, we obtain the posterior predictive distribution by integrating out the parameters:
$$\begin{aligned} p(y_* \mid X, Y, x_*) &= \int p(y_* \mid x_*, \theta)\, p(\theta \mid X, Y)\, \mathrm{d}\theta \\ &= \int \mathcal{N}\big(y_* \mid \phi^{\top}(x_*)\,\theta,\; \sigma^2\big)\, \mathcal{N}(\theta \mid m_N, S_N)\, \mathrm{d}\theta \\ &= \mathcal{N}\big(y_* \mid \phi^{\top}(x_*)\, m_N,\; \phi^{\top}(x_*)\, S_N\, \phi(x_*) + \sigma^2\big) \end{aligned}$$
First of all, note that the predictive posterior for a new input $x_*$ is a univariate Gaussian distribution. We can see that the mean of the predictive distribution, $\phi^{\top}(x_*)\, m_N$, is simply the posterior mean of the parameters applied to the transformed input, while its variance, $\phi^{\top}(x_*)\, S_N\, \phi(x_*) + \sigma^2$, combines the remaining uncertainty about the parameters with the observation noise $\sigma^2$.
    def predict(self, features):  # same method as shown above, repeated here for step 3
        """ Compute the posterior predictive distribution for the given features

        Args:
            features: 1d numpy array of features
        Returns:
            pred_posterior: predictive posterior distribution
        """
        design_matrix = self.compute_design_matrix(features)
        pred_mean = design_matrix.dot(self.post_mean)
        pred_cov = design_matrix.dot(self.post_cov.dot(design_matrix.T)) + self.noise_var
        pred_posterior = univariate_normal(loc=pred_mean.flatten(), scale=pred_cov**0.5)
        return pred_posterior
To make sure that we are all on the same page: given a new input
example our Bayesian linear regression model predicts not a single label
but a distribution over possible labels. This distribution is Gaussian. We
can get actual labels by sampling from this distribution.
import pandas as pd
import seaborn as sns
all_rows = []
sample_size = 1000
test_features = [-2, -1, 0, 1, 2]
all_labels = []
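The loop that actually draws the samples is not part of this excerpt. A minimal sketch, assuming the model from above and that predict returns a frozen scipy.stats.norm distribution (the exact visualization used in the original post is not shown; a violin plot is one natural choice):

for feature in test_features:
    # Posterior predictive distribution for this single test input
    pred_posterior = model.predict(np.array([feature]))
    # The distribution parameters have shape (1, 1), hence the 2d sample size
    labels = np.ravel(pred_posterior.rvs(size=(sample_size, 1)))
    all_labels.append(labels)
    for label in labels:
        all_rows.append({"feature": feature, "label": label})

samples_df = pd.DataFrame(all_rows)
sns.violinplot(x="feature", y="label", data=samples_df)  # one label distribution per test input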