
Bayesian linear regression (alpopkes.com)

Anna-Lena Popkes

Saturday, February 20, 2021

Bayesian linear regression


I finally found time to continue working on my machine learning basics repository, which implements fundamental machine learning algorithms in plain Python. In particular, I took a detailed look at Bayesian linear regression. The blog post below contains the same content as the original notebook. You can run the notebook directly in your browser using Binder.

1. What is Bayesian linear regression (BLR)?


Bayesian linear regression is the Bayesian interpretation of linear
regression. What does that mean? To answer this question we first have
to understand the Bayesian approach. In most of the algorithms we have
looked at so far we computed point estimates of our parameters. For
example, in linear regression we chose values for the weights and bias
that minimized our mean squared error cost function. In the Bayesian
approach we don’t work with exact values but with probabilities. This
allows us to model the uncertainty in our parameter estimates. Why is
this important?

In nearly all real-world situations, our data and knowledge about the
world are incomplete, indirect and noisy. Hence, uncertainty must be a
fundamental part of our decision-making process. This is exactly what
the Bayesian approach is about. It provides a formal and consistent way
to reason in the presence of uncertainty. Bayesian methods have been
around for a long time and are widely used in many areas of science
(e.g. astronomy). Although Bayesian methods have been applied to
machine learning problems too, they are usually less well known to
beginners. The major reason is that they require a good understanding
of probability theory.

In the following post we will work our way from linear regression to
Bayesian linear regression, including the most important theoretical
knowledge and code examples. Remember: you can run the original
notebook directly in your browser using Binder.

2. Recap linear regression


- In linear regression, we want to find a function $f$ that maps inputs $x \in \mathbb{R}^D$ to corresponding function values $f(x) \in \mathbb{R}$.
- We are given an input dataset $\mathcal{D} = \{x_n, y_n\}_{n=1}^N$, where $y_n$ is a noisy observation value: $y_n = f(x_n) + \epsilon$, with $\epsilon$ being an i.i.d. random variable that describes measurement/observation noise.
- Our goal is to infer the underlying function $f$ that generated the data such that we can predict function values at new input locations.
- In linear regression, we model the underlying function $f$ using a linear combination of the input features: $$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = x^T \theta$$
- For more details take a look at the notebook on linear regression.

3. Fundamental concepts
One fundamental tool in Bayesian learning is Bayes' theorem. Bayes' theorem looks as follows:

$$p(\theta \mid x, y) = \frac{p(y \mid x, \theta)\, p(\theta)}{p(x, y)}$$

- $p(y \mid x, \theta)$ is the likelihood. It describes the probability of the target values given the data and parameters.
- $p(\theta)$ is the prior. It describes our initial knowledge about which parameter values are likely and unlikely.
- $p(x, y)$ is the evidence. It describes the joint probability of the data and targets.
- $p(\theta \mid x, y)$ is the posterior. It describes the probability of the parameters given the observed data and targets.

Another important tool you need to know about is the Gaussian distribution. If you are not familiar with it I suggest you pause for a minute and understand its main properties before reading on.

The general Bayesian inference approach is as follows:

1. We start with some prior belief about a hypothesis, $p(h)$.
2. We observe some data, gaining new evidence $e$.
3. We update our belief using Bayes' theorem, $p(h \mid e) = \frac{p(e \mid h)\, p(h)}{p(e)}$, incorporating the new evidence and yielding a refined posterior belief.
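
To make these steps concrete, here is a minimal sketch (not part of the original notebook) of a single Bayesian update for a discrete hypothesis; the prior and the likelihood values are made-up numbers chosen purely for illustration.

# Hypothetical example: hypothesis h = "the coin is biased towards heads",
# evidence e = "we observed 7 heads in 10 flips" (all numbers are made up)
prior_h = 0.5                    # p(h): prior belief before seeing data
likelihood_e_given_h = 0.8       # p(e | h): assumed probability of the evidence if h holds
likelihood_e_given_not_h = 0.5   # p(e | not h)

# Evidence p(e) via the law of total probability
evidence = likelihood_e_given_h * prior_h + likelihood_e_given_not_h * (1 - prior_h)

# Bayes' theorem: p(h | e) = p(e | h) * p(h) / p(e)
posterior_h = likelihood_e_given_h * prior_h / evidence
print(posterior_h)  # ~0.615, so the belief in h increased after seeing the evidence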

4. Linear regression from a probabilistic perspective

In order to pave the way for Bayesian linear regression we will take a probabilistic spin on linear regression. Let's start by explicitly modelling the observation noise $\epsilon$. For simplicity, we assume that $\epsilon$ is normally distributed with mean $0$ and some known variance $\sigma^2$: $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

As mentioned in the beginning, a simple linear regression model assumes that the target function $f(x)$ is given by a linear combination of the input features: $y = f(x) + \epsilon = x^T \theta + \epsilon$

This corresponds to the following likelihood function: $p(y \mid x, \theta) = \mathcal{N}(x^T \theta, \sigma^2)$

Our goal is to find the parameters $\theta = \{\theta_1, \ldots, \theta_D\}$ that model the given data best. In standard linear regression we can find the best parameters using a least-squares, maximum likelihood (ML) or maximum a posteriori (MAP) approach. If you want to know more about these solutions take a look at the notebook on linear regression or at chapter 9.2 of the book Mathematics for Machine Learning.
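
As a quick reminder of what such a point estimate looks like in code, here is a minimal sketch (not from the original notebook) of the least-squares / maximum likelihood solution $\theta_{ML} = (X^T X)^{-1} X^T y$ on made-up toy data.

import numpy as np

# Made-up toy data: y = 2*x + noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2 * x + rng.normal(0, 0.1, size=50)

# Design matrix with a bias column so we also estimate an intercept
X = np.stack((np.ones_like(x), x), axis=1)

# Maximum likelihood / least-squares point estimate: theta = (X^T X)^{-1} X^T y
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ml)  # approximately [0.0, 2.0]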

5. Linear regression with basis functions

The simple linear regression model above is linear not only with respect to the parameters $\theta$ but also with respect to the inputs $x$. When $x$ is not a vector but a single value (that is, the dataset is one-dimensional) the model $y_i = x_i \cdot \theta$ describes straight lines with $\theta$ being the slope of the line.

The plot below shows example lines produced with the model $y = x \cdot \theta$, using different values for the slope $\theta$ and intercept 0.
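
The original figure is not reproduced here, so the following minimal sketch (not part of the original notebook) generates a comparable plot; the slope values are arbitrary.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-1, 1, 100)
for theta in [-2, -1, 0.5, 1, 3]:  # arbitrary example slopes
    plt.plot(x, theta * x, label=f"slope = {theta}")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Lines y = x * theta (intercept 0)")
plt.legend();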
Having a model which is linear both with respect to the parameters and
inputs limits the functions it can learn significantly. We can make our
model more powerful by making it nonlinear with respect to the inputs.
After all, linear regression refers to models which are linear in
the parameters, not necessarily in the inputs (linear in the parameters
means that the model describes a function by a linear combination of
input features).

Making the model nonlinear with respect to the inputs is easy. We can adapt it by using a nonlinear transformation of the input features, $\phi(x)$. With this adaptation our model looks as follows:

$$ \begin{split} y &= \pmb{\phi}^T(\pmb{x}) \pmb{\theta} + \epsilon \\ &= \sum_{k=0}^{K-1} \theta_k \phi_k(\pmb{x}) + \epsilon \end{split} $$

where $\phi: \mathbb{R}^D \rightarrow \mathbb{R}^K$ is a (non)linear transformation of the inputs $x$ and $\phi_k: \mathbb{R}^D \rightarrow \mathbb{R}$ is the $k$-th component of the feature vector $\phi$:

$$ \phi(x) = \begin{bmatrix} \phi_0(x) \\ \phi_1(x) \\ \vdots \\ \phi_{K-1}(x) \end{bmatrix} \in \mathbb{R}^K $$

With our new nonlinear transformation the likelihood function is given by $p(y \mid x, \theta) = \mathcal{N}(\phi^T(x)\theta, \sigma^2)$.

5.1 Example basis functions

Linear regression

The easiest example for a basis function (for one-dimensional data) would be simple linear regression, that is, no non-linear transformation at all. In this case we would choose $\phi_0(x) = 1$ and $\phi_1(x) = x$. This would result in the following vector $\phi(x)$:

$$ \phi(x) = \begin{bmatrix} \phi_0(x) \\ \phi_1(x) \end{bmatrix} = \begin{bmatrix} 1 \\ x \end{bmatrix} \in \mathbb{R}^2 $$
Polynomial regression

Another common choice of basis function for the one-dimensional case is polynomial regression. For this we would set $\phi_i(x) = x^i$ for $i = 0, \ldots, K-1$. The corresponding feature vector $\phi(x)$ would look as follows:

$$ \phi(x) = \begin{bmatrix} \phi_0(x) \\ \phi_1(x) \\ \vdots \\ \phi_{K-1}(x) \end{bmatrix} = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \\ x^{K-1} \end{bmatrix} \in \mathbb{R}^K $$

With this transformation we can lift our original one-dimensional input into a $K$-dimensional feature space. Our function $f$ can be any polynomial with degree $\leq K-1$: $f(x) = \sum_{k=0}^{K-1} \theta_k x^k$
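
As a small illustration (not from the original notebook), the sketch below computes the polynomial feature vector $\phi(x)$ for a single scalar input; the input value and the choice $K = 4$ are arbitrary.

import numpy as np

def polynomial_features(x: float, K: int) -> np.ndarray:
    """Return phi(x) = [1, x, x^2, ..., x^(K-1)] for a scalar input x."""
    return np.array([x**k for k in range(K)])

print(polynomial_features(2.0, K=4))  # [1. 2. 4. 8.]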

5.2 The design matrix

To make it easier to work with the transformations $\phi(x)$ for the different input vectors $x$ we typically create a so-called design matrix (also called feature matrix). Given our dataset $\mathcal{D} = \{x_n, y_n\}_{n=1}^N$ we define the design matrix as follows:

$$ \Phi := \begin{bmatrix} \phi^\top(x_1) \\ \vdots \\ \phi^\top(x_N) \end{bmatrix} = \begin{bmatrix} \phi_0(x_1) & \cdots & \phi_{K-1}(x_1) \\ \phi_0(x_2) & \cdots & \phi_{K-1}(x_2) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{K-1}(x_N) \end{bmatrix} \in \mathbb{R}^{N \times K} $$

Note that the design matrix is of shape $N \times K$. $N$ is the number of input examples and $K$ is the output dimension of the non-linear transformation $\phi(x)$.
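
Building on the polynomial features above, the design matrix can be assembled by stacking $\phi^\top(x_n)$ row by row. The sketch below (not from the original notebook) does this for a small, made-up set of inputs and checks the resulting shape.

import numpy as np

def polynomial_design_matrix(x: np.ndarray, K: int) -> np.ndarray:
    """Stack phi(x_n) = [1, x_n, ..., x_n^(K-1)] as rows -> shape (N, K)."""
    return np.stack([x**k for k in range(K)], axis=1)

x = np.array([0.0, 0.5, 1.0, 2.0])      # N = 4 example inputs
Phi = polynomial_design_matrix(x, K=3)  # K = 3 polynomial features
print(Phi.shape)  # (4, 3)
print(Phi)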

6. Bayesian linear regression


What changes when we consider a Bayesian interpretation of linear regression? Our data stays the same as before: $\mathcal{D} = \{x_n, y_n\}_{n=1}^N$. Given the data $\mathcal{D}$ we can define the set of all inputs as $X := \{x_1, \ldots, x_N\}$ and the set of all targets as $Y := \{y_1, \ldots, y_N\}$.

In simple linear regression we compute point estimates of our parameters (e.g. using a maximum likelihood approach) and use these estimates to make predictions. In contrast, Bayesian linear regression estimates distributions over the parameters and predictions. This allows us to model the uncertainty in our predictions.

To perform Bayesian linear regression we follow three steps:

1. We set up a probabilistic model that describes our assumptions about how the data and parameters are generated.
2. We perform inference for the parameters $\theta$, that is, we compute the posterior probability distribution over the parameters.
3. With this posterior we can perform inference for new, unseen inputs, predicting $y_*$. In this step we don't compute point estimates of the outputs. Instead, we compute the parameters of the posterior distribution over the outputs.

6.1 Step 1: Probabilistic model

We start by setting up a probabilistic model that describes our assumptions about how the data and parameters are generated. For this, we place a prior $p(\theta)$ over our parameters which encodes what parameter values are plausible (before we have seen any data).

Example: With a single parameter $\theta$, a Gaussian prior $p(\theta) = \mathcal{N}(0, 1)$ says that parameter values are normally distributed with mean 0 and standard deviation 1. In other words: the parameter values are most likely to fall into the interval $[-2, 2]$, which is two standard deviations around the mean value.

To keep things simple we will assume a Gaussian prior over the parameters: $p(\theta) = \mathcal{N}(m_0, S_0)$. Let's further assume that the likelihood function is Gaussian, too: $p(y \mid x, \theta) = \mathcal{N}(y \mid \phi^\top(x)\theta, \sigma^2)$.

Note: When considering the set of all targets $Y := \{y_1, \ldots, y_N\}$, the likelihood function becomes a multivariate Gaussian distribution: $p(Y \mid X, \theta) = \mathcal{N}(y \mid \Phi\theta, \sigma^2 I)$

The nice thing about choosing a Gaussian distribution for our prior is that
the posterior distributions will be Gaussian, too (keyword conjugate
prior)!

We will start our BayesianLinearRegression class with the knowledge we have so far - our probabilistic model. As mentioned in the beginning, we assume that the variance $\sigma^2$ of the noise $\epsilon$ is known. Furthermore, to allow plotting the parameter distribution later on, we will assume that the parameter vector is two-dimensional (d=2).

from scipy.stats import multivariate_normal
import numpy as np

class BayesianLinearRegression:
    """ Bayesian linear regression

    Args:
        prior_mean: Mean values of the prior distribution (m_0)
        prior_cov: Covariance matrix of the prior distribution (S_0)
        noise_var: Variance of the noise distribution
    """

    def __init__(self, prior_mean: np.ndarray, prior_cov: np.ndarray, noise_var: float):
        self.prior_mean = prior_mean[:, np.newaxis]  # column vector of shape (d, 1)
        self.prior_cov = prior_cov  # matrix of shape (d, d)
        # We initialize the prior distribution over the parameters using the given
        # mean and covariance matrix
        # In the formulas above this corresponds to m_0 (prior_mean) and S_0 (prior_cov)
        self.prior = multivariate_normal(prior_mean, prior_cov)

        # We also know the variance of the noise
        self.noise_var = noise_var  # single float value
        self.noise_precision = 1 / noise_var

        # Before performing any inference the parameter posterior equals the parameter prior
        self.param_posterior = self.prior
        # Accordingly, the posterior mean and covariance equal the prior mean and covariance
        self.post_mean = self.prior_mean  # corresponds to m_N in the formulas
        self.post_cov = self.prior_cov    # corresponds to S_N in the formulas

# Let's make sure that we can initialize our model
prior_mean = np.array([0, 0])
prior_cov = np.array([[0.5, 0], [0, 0.5]])
noise_var = 0.2
blr = BayesianLinearRegression(prior_mean, prior_cov, noise_var)

6.2 Generating a dataset

Before going any further we need a dataset to test our implementation. Remember that we assume that our targets were generated by a function of the form $y = \phi^T(x)\theta + \epsilon$, where $\epsilon$ is normally distributed with mean $0$ and some known variance $\sigma^2$: $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

To keep things simple we will work with one-dimensional data and simple linear regression (that is, no non-linear transformation of the inputs). Consequently, our data generating function will be of the form $y = \theta_0 + \theta_1 x + \epsilon$.

Note that we added a parameter $\theta_0$ which corresponds to the intercept of the linear function. Until now we assumed $\theta_0 = 0$. As mentioned earlier, $\theta_1$ represents the slope of the linear function.

import matplotlib.pyplot as plt
%matplotlib inline

def compute_function_labels(slope: float, intercept: float, noise_std_dev: float,
                            data: np.ndarray) -> np.ndarray:
    """
    Compute target values given function parameters and data.

    Args:
        slope: slope of the function (theta_1)
        intercept: intercept of the function (theta_0)
        noise_std_dev: standard deviation of noise distribution (sigma)
        data: input feature values (x)

    Returns:
        target values, either true or corrupted with noise
    """
    n_samples = len(data)
    if noise_std_dev == 0:  # Real function
        return slope * data + intercept
    else:  # Noise corrupted
        return slope * data + intercept + np.random.normal(0, noise_std_dev, n_samples)

# Set random seed to ensure reproducibility
seed = 42
np.random.seed(seed)

# Generate true values and noise corrupted targets
n_datapoints = 1000
intercept = -0.7
slope = 0.9
noise_std_dev = 0.5
noise_var = noise_std_dev**2
lower_bound = -1.5
upper_bound = 1.5

# Generate dataset
features = np.random.uniform(lower_bound, upper_bound, n_datapoints)
labels = compute_function_labels(slope, intercept, 0., features)
noise_corrupted_labels = compute_function_labels(slope, intercept, noise_std_dev, features)

# Plot the dataset
plt.figure(figsize=(10, 7))
plt.plot(features, labels, color='r', label="True values")
plt.scatter(features, noise_corrupted_labels, label="Noise corrupted values")
plt.xlabel("Features")
plt.ylabel("Labels")
plt.title("Real function along with noisy targets")
plt.legend();

6.3 Step 2: Posterior over the parameters

We finished setting up our probabilistic model. Next, we want to use this model and our dataset $X, Y$ to estimate the parameter posterior $p(\theta \mid X, Y)$. Keep in mind that we don't compute point estimates of the parameters. Instead, we determine the mean and covariance of the (Gaussian) posterior distribution and use this entire distribution when making predictions.

We can estimate the parameter posterior using Bayes' theorem:

$$p(\theta \mid X, Y) = \frac{p(Y \mid X, \theta)\, p(\theta)}{p(Y \mid X)}$$

- $p(Y \mid X, \theta)$ is the likelihood function, $p(Y \mid X, \theta) = \mathcal{N}(y \mid \Phi\theta, \sigma^2 I)$
- $p(\theta)$ is the prior distribution, $p(\theta) = \mathcal{N}(\theta \mid m_0, S_0)$
- $p(Y \mid X) = \int p(Y \mid X, \theta)\, p(\theta)\, \mathrm{d}\theta$ is the evidence which ensures that the posterior is normalized (that is, that it integrates to 1).

The parameter posterior can be computed in closed form (for a proof see theorem 9.1 in the book Mathematics for Machine Learning):

$$ \begin{split} p(\theta \mid X, Y) &= \mathcal{N}(\theta \mid m_N, S_N) \\ S_N &= (S_0^{-1} + \sigma^{-2} \Phi^\top \Phi)^{-1} \\ m_N &= S_N (S_0^{-1} m_0 + \sigma^{-2} \Phi^\top y) \end{split} $$
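
Translated directly into NumPy, the closed-form update is only a few lines. Here is a minimal, self-contained sketch (not part of the original notebook) with made-up toy numbers; the same computation is what the update_posterior method below implements.

import numpy as np

# Made-up toy setup: N = 5 datapoints, K = 2 features (bias + input value)
Phi = np.array([[1.0, -1.0], [1.0, -0.5], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])
y = np.array([[-1.6], [-1.2], [-0.7], [-0.2], [0.2]])  # roughly y = -0.7 + 0.9 x
sigma_sq = 0.25                 # known noise variance
m_0 = np.zeros((2, 1))          # prior mean
S_0 = 0.5 * np.identity(2)      # prior covariance

# Closed-form parameter posterior (theorem 9.1, Mathematics for Machine Learning)
S_0_inv = np.linalg.inv(S_0)
S_N = np.linalg.inv(S_0_inv + (1 / sigma_sq) * Phi.T @ Phi)  # posterior covariance
m_N = S_N @ (S_0_inv @ m_0 + (1 / sigma_sq) * Phi.T @ y)     # posterior mean
print(m_N.flatten())  # pulled towards intercept -0.7 and slope 0.9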
Coming back to our BayesianLinearRegression class we need to add a
method which allows us to update the posterior distribution given a
dataset.

from scipy.stats import multivariate_normal
from scipy.stats import norm as univariate_normal
import numpy as np

class BayesianLinearRegression:
    """ Bayesian linear regression

    Args:
        prior_mean: Mean values of the prior distribution (m_0)
        prior_cov: Covariance matrix of the prior distribution (S_0)
        noise_var: Variance of the noise distribution
    """

    def __init__(self, prior_mean: np.ndarray, prior_cov: np.ndarray, noise_var: float):
        self.prior_mean = prior_mean[:, np.newaxis]  # column vector of shape (d, 1)
        self.prior_cov = prior_cov  # matrix of shape (d, d)
        # We initialize the prior distribution over the parameters using the given
        # mean and covariance matrix
        # In the formulas above this corresponds to m_0 (prior_mean) and S_0 (prior_cov)
        self.prior = multivariate_normal(prior_mean, prior_cov)

        # We also know the variance of the noise
        self.noise_var = noise_var  # single float value
        self.noise_precision = 1 / noise_var

        # Before performing any inference the parameter posterior equals the parameter prior
        self.param_posterior = self.prior
        # Accordingly, the posterior mean and covariance equal the prior mean and covariance
        self.post_mean = self.prior_mean  # corresponds to m_N in the formulas
        self.post_cov = self.prior_cov    # corresponds to S_N in the formulas

    def update_posterior(self, features: np.ndarray, targets: np.ndarray):
        """
        Update the posterior distribution given new features and targets

        Args:
            features: numpy array of features
            targets: numpy array of targets
        """
        # Reshape targets to allow correct matrix multiplication
        # Input shape is (N,) but we need (N, 1)
        targets = targets[:, np.newaxis]

        # Compute the design matrix, shape (N, 2)
        design_matrix = self.compute_design_matrix(features)

        # Update the covariance matrix, shape (2, 2)
        design_matrix_dot_product = design_matrix.T.dot(design_matrix)
        inv_prior_cov = np.linalg.inv(self.prior_cov)
        self.post_cov = np.linalg.inv(inv_prior_cov +
                                      self.noise_precision * design_matrix_dot_product)

        # Update the mean, shape (2, 1)
        self.post_mean = self.post_cov.dot(
            inv_prior_cov.dot(self.prior_mean) +
            self.noise_precision * design_matrix.T.dot(targets))

        # Update the posterior distribution
        self.param_posterior = multivariate_normal(self.post_mean.flatten(), self.post_cov)

    def compute_design_matrix(self, features: np.ndarray) -> np.ndarray:
        """
        Compute the design matrix. To keep things simple we use simple linear
        regression and add the value phi_0 = 1 to our input data.

        Args:
            features: numpy array of features
        Returns:
            design_matrix: numpy array of transformed features

        >>> compute_design_matrix(np.array([2, 3]))
        np.array([[1., 2.], [1., 3.]])
        """
        n_samples = len(features)
        phi_0 = np.ones(n_samples)
        design_matrix = np.stack((phi_0, features), axis=1)
        return design_matrix

    def predict(self, features: np.ndarray):
        """
        Compute predictive posterior given new datapoint

        Args:
            features: 1d numpy array of features
        Returns:
            pred_posterior: predictive posterior distribution
        """
        design_matrix = self.compute_design_matrix(features)

        pred_mean = design_matrix.dot(self.post_mean)
        pred_cov = design_matrix.dot(self.post_cov.dot(design_matrix.T)) + self.noise_var

        # The diagonal of pred_cov holds the predictive variance of each datapoint
        pred_posterior = univariate_normal(loc=pred_mean.flatten(),
                                           scale=np.sqrt(np.diag(pred_cov)))
        return pred_posterior

6.4 Visualizing the parameter posterior

To ensure that our implementation is correct we can visualize how the


posterior over the parameters changes as the model sees more data. We
will visualize the distribution using a contour plot - a method for
visualizing three-dimensional functions. In our case we want to visualize
the density of our bi-variate Gaussian for each point (that is, each
slope/intercept combination). The plot below shows an example which
illustrates how the lines and colours of a contour plot correspond to a
Gaussian distribution:
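
The example figure is not reproduced here, so the following minimal sketch (not part of the original notebook) produces a comparable contour plot of an arbitrary bivariate Gaussian; with the default colormap, high-density regions appear yellow.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# An arbitrary bivariate Gaussian, purely for illustrating contour plots
example_gaussian = multivariate_normal(mean=[0, 0], cov=[[1.0, 0.5], [0.5, 1.0]])

xx, yy = np.mgrid[-3:3:.01, -3:3:.01]
grid = np.dstack((xx, yy))
plt.contourf(xx, yy, example_gaussian.pdf(grid), levels=15)
plt.colorbar(label="Density")
plt.title("Contour plot of a bivariate Gaussian");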
As we can see, the density is highest in the yellow regions and decreases when moving further out into the green and blue parts. This should give you a better understanding of contour plots.

To analyze our Bayesian linear regression class we will start by


initializing a new model. We can visualize its prior distribution over the
parameters before the model has seen any real data.

# Initialize BLR model
prior_mean = np.array([0, 0])
prior_cov = 1/2 * np.identity(2)
blr = BayesianLinearRegression(prior_mean, prior_cov, noise_var)

def plot_param_posterior(lower_bound, upper_bound, blr, title):
    fig = plt.figure()
    mesh_features, mesh_labels = np.mgrid[lower_bound:upper_bound:.01,
                                          lower_bound:upper_bound:.01]
    pos = np.dstack((mesh_features, mesh_labels))
    plt.contourf(mesh_features, mesh_labels, blr.param_posterior.pdf(pos), levels=15)
    plt.scatter(intercept, slope, color='red', label="True parameter values")
    plt.title(title)
    plt.xlabel("Intercept")
    plt.ylabel("Slope")
    plt.legend();

# Visualize parameter prior distribution
plot_param_posterior(lower_bound, upper_bound, blr, title="Prior parameter distribution")

The plot above illustrates both the prior parameter distribution and the
true parameter values that we want to find. If our model works correctly,
the posterior distribution should become more narrow and move closer
to the true parameter values as the model sees more datapoints. This
can be visualized with contour plots, too! Below we update the posterior
distribution iteratively as the model sees more and more data. The
contour plots for each step show how the parameter posterior develops
and converges close to the true values in the end.

n_points_lst = [1, 5, 10, 50, 100, 200, 500, 1000]
previous_n_points = 0
for n_points in n_points_lst:
    train_features = features[previous_n_points:n_points]
    train_labels = noise_corrupted_labels[previous_n_points:n_points]
    blr.update_posterior(train_features, train_labels)

    # Visualize updated parameter posterior distribution
    plot_param_posterior(lower_bound,
                         upper_bound,
                         blr,
                         title=f"Updated parameter distribution using {n_points} datapoints")

    previous_n_points = n_points

6.5 Step 3: Posterior predictive distribution

Given the posterior distribution over the parameters we can determine the predictive distribution (= posterior over the outputs) for a new input $(x_*, y_*)$. This is the distribution we are really interested in. A trained model is not particularly useful when we can't use it to make predictions, right?
The posterior predictive distribution looks as follows:

$$ \begin{split} p(y_* \mid X, Y, x_*) &= \int p(y_* \mid x_*, \theta)\, p(\theta \mid X, Y)\, \mathrm{d}\theta \\ &= \int \mathcal{N}(y_* \mid \phi^\top(x_*)\theta, \sigma^2)\, \mathcal{N}(\theta \mid m_N, S_N)\, \mathrm{d}\theta \\ &= \mathcal{N}(y_* \mid \phi^\top(x_*) m_N,\ \phi^\top(x_*) S_N \phi(x_*) + \sigma^2) \end{split} $$

First of all: note that the predictive posterior for a new input $x_*$ is a univariate Gaussian distribution. We can see that the mean of the distribution is given by the product of the design matrix for the new example ($\phi^\top(x_*)$) and the mean of the parameter posterior ($m_N$). The variance $\phi^\top(x_*) S_N \phi(x_*) + \sigma^2$ of the predictive posterior has two parts:

1. $\sigma^2$: the variance of the noise
2. $\phi^\top(x_*) S_N \phi(x_*)$: the posterior uncertainty associated with the parameters $\theta$

Let’s add a predict method to our BayesianLinearRegression class which


computes the predictive posterior for a new input (you will find the
method in the class definition above):

def predict(self, features: np.ndarray):
    """
    Compute predictive posterior given new datapoint

    Args:
        features: 1d numpy array of features
    Returns:
        pred_posterior: predictive posterior distribution
    """
    design_matrix = self.compute_design_matrix(features)

    pred_mean = design_matrix.dot(self.post_mean)
    pred_cov = design_matrix.dot(self.post_cov.dot(design_matrix.T)) + self.noise_var

    # The diagonal of pred_cov holds the predictive variance of each datapoint
    pred_posterior = univariate_normal(loc=pred_mean.flatten(),
                                       scale=np.sqrt(np.diag(pred_cov)))
    return pred_posterior
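
To see both variance components at work, here is a minimal sketch (not part of the original notebook) that queries the trained model on a grid of inputs, one point at a time, and plots the predictive mean together with a band of two predictive standard deviations; the grid bounds and variable names are arbitrary choices.

x_grid = np.linspace(-3, 3, 100)
pred_means, pred_stds = [], []
for x in x_grid:
    pred_posterior = blr.predict(np.array([x]))
    pred_means.append(pred_posterior.mean().item())
    pred_stds.append(pred_posterior.std().item())
pred_means = np.array(pred_means)
pred_stds = np.array(pred_stds)

plt.figure(figsize=(10, 7))
plt.plot(x_grid, pred_means, label="Predictive mean")
plt.fill_between(x_grid, pred_means - 2 * pred_stds, pred_means + 2 * pred_stds,
                 alpha=0.3, label="+/- 2 standard deviations")
plt.xlabel("Feature")
plt.ylabel("Label")
plt.title("Predictive mean and uncertainty")
plt.legend();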

6.6 Visualizing the predictive posterior


Our original dataset follows a simple linear function. After training the
model it should be able to predict labels for new datapoints, even if they
lie beyond the range from [-1.5, 1.5]. But how can we get from the
predictive distribution that our model computes to actual labels? That’s
easy: we sample from the predictive posterior.

To make sure that we are all on the same page: given a new input
example our Bayesian linear regression model predicts not a single label
but a distribution over possible labels. This distribution is Gaussian. We
can get actual labels by sampling from this distribution.

The code below implements and visualizes this:

- We create some test features for which we want predictions.
- Each feature is given to the trained BLR model which returns a univariate Gaussian distribution over possible labels (pred_posterior = blr.predict(np.array([feat]))).
- We sample from this distribution (sample_predicted_labels = pred_posterior.rvs(size=sample_size)).
- The predicted labels are saved in a format that makes it easy to plot them.
- Finally, we plot each input feature, its true label and the sampled predictions. Remember: the samples are generated from the predictive posterior returned by the predict method. Think of a Gaussian distribution plotted along the y-axis for each feature. We visualize this with a histogram: more likely values close to the mean will be sampled more often than less likely values.

import pandas as pd
import seaborn as sns

all_rows = []
sample_size = 1000
test_features = [-2, -1, 0, 1, 2]
all_labels = []

for feat in test_features:
    true_label = compute_function_labels(slope, intercept, 0, np.array([feat]))
    all_labels.append(true_label)
    pred_posterior = blr.predict(np.array([feat]))
    sample_predicted_labels = pred_posterior.rvs(size=sample_size)
    for label in sample_predicted_labels:
        all_rows.append([feat, label])

all_data = pd.DataFrame(all_rows, columns=["feature", "label"])
sns.displot(data=all_data, x="feature", y="label")
plt.scatter(x=test_features, y=all_labels, color="red", label="True values")
plt.title("Predictive posterior distributions")
plt.legend()
plt.plot();

7. Sources and further reading


The basis for this notebook is chapter 9.2 of the book Mathematics for
Machine Learning. I can highly recommend reading through chapter 9 to
get a deeper understanding of (Bayesian) linear regression.

You will find explanations and an implementation of simple linear regression in the notebook on linear regression.
