Bayesian Nonlinear SVMs for Big Data
1 Introduction
Statistical machine learning branches into two classic strands of research: Bayesian
and frequentist. In the classic supervised learning setting, both paradigms aim
to find, based on training data, a function f_β that predicts well on yet unseen
test data. The difference between the Bayesian and the frequentist approach lies in the
treatment of the parameter vector β of this function. In the frequentist setting,
we select the parameter β that minimizes a certain loss given the training data,
from a restricted set B of limited complexity. In the Bayesian school of thinking,
we express our prior belief about the parameter in the form of a probability
distribution over the parameter vector. When we observe data, we adapt our
belief, resulting in a posterior distribution over β.
Advantages of the Bayesian approach include automatic treatment of hyperparameters
and direct quantification of the uncertainty³ of the prediction in the
form of class membership probabilities, which can be of tremendous importance
in practice. Consider the following examples. (1) We have collected blood
samples of cancer patients and controls. The aim is to screen individuals that
have an increased likelihood of developing cancer. Knowledge of the uncertainty
in those predictions is invaluable to clinicians. (2) In the domain of physics it
is important to have a sense of the certainty level of predictions, since it
³ Note that frequentist approaches can also lead to other forms of uncertainty estimates,
e.g. in the form of confidence intervals. But since the classic SVM does not exhibit a
probabilistic formulation, these uncertainty estimates cannot be directly computed.
2 Related Work
There has recently been significant interest in utilizing max-margin based
discriminative Bayesian models for various applications. For example, [12] employs
max-margin based Bayesian classification to discover latent semantic structures
for topic models, and [13] uses a max-margin approach for efficient Bayesian matrix
factorization.
p(β | D) ∝ ∏_{i=1}^{n} L(y_i | x_i, β) p(β).
Here p(β) denotes a prior such that log p(β) ∝ −2γR(β). In the following we use
the prior β ∼ N(0, Σ), where Σ ∈ R^{d×d} is a positive definite matrix. From a
frequentist SVM view, this choice generalizes the usual L2-regularization to
non-isotropic regularizers. Note that our proposed framework can be easily extended to
other regularization techniques by adjusting the prior on β (e.g. block ℓ_{2,p}-norm
regularization, which is known as multiple kernel learning [17]). In order to obtain
a Bayesian interpretation of the SVM, we need to define a pseudolikelihood L
such that the following holds,

L(y_i | x_i, f(·)) ∝ exp( −2 max(1 − y_i f(x_i), 0) ).    (2)
By introducing latent variables λ := (λ1 , . . . , λn )> (data augmentation) and
making use of integral identities stemming from function theory, [2] show that
the specification of L in terms of the following marginal distribution satisfies (2):
L(y_i | x_i, β) = ∫_0^∞ 1/√(2πλ_i) exp( −(1/2) (1 + λ_i − y_i x_i^⊤ β)² / λ_i ) dλ_i.    (3)
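To see this identity concretely, the following short numerical check (an illustrative script of ours, not part of the paper; all names are hypothetical) integrates the right-hand side of (3) over λ_i and compares it with the pseudolikelihood (2) for a few margin values z = y_i x_i^⊤ β:

```python
# Numerical check of the data-augmentation identity behind (3): integrating the
# augmented density over the latent lambda_i recovers exp(-2 * max(1 - z, 0)),
# i.e. the SVM pseudolikelihood in (2). Illustrative sketch, not code from the paper.
import numpy as np
from scipy.integrate import quad

def svm_pseudolikelihood(z):
    """exp(-2 * hinge loss), with margin z = y_i * x_i^T beta."""
    return np.exp(-2.0 * max(1.0 - z, 0.0))

def augmented_integrand(lam, z):
    """Integrand of (3) for a single latent variable lambda_i."""
    return np.exp(-0.5 * (1.0 + lam - z) ** 2 / lam) / np.sqrt(2.0 * np.pi * lam)

for z in [-1.5, 0.0, 0.7, 1.0, 2.3]:
    integral, _ = quad(augmented_integrand, 0.0, np.inf, args=(z,))
    print(z, integral, svm_pseudolikelihood(z))  # the two columns agree
```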
Writing X ∈ R^{d×n} for the matrix of data points and Y = diag(y), the full
conditional distributions of this model are

β | λ, Σ, D ∼ N( BZ(λ^{-1} + 1), B ),
λ_i | β, D_i ∼ GIG( 1/2, 1, (1 − y_i x_i^⊤ β)² ),    (4)

with Z = Y X, B^{-1} = ZΛ^{-1}Z^⊤ + Σ^{-1}, Λ = diag(λ), and where GIG denotes a
generalized inverse Gaussian distribution. The n latent variables λ_i of the model
scale the variance of the full conditionals locally. The model thus constitutes a
special case of a normal variance-mean mixture, where we implicitly impose
the improper prior p(λ) = 1_{[0,∞)}(λ) on λ. This could be generalized by using a
generalized inverse Gaussian prior on λ_i, leading to a conjugate model for λ_i.
Henao et al. show that in the case of an exponential prior on λ_i, this leads to
a skewed Laplace full conditional for λ_i. Note, however, that this destroys the
equivalence to the frequentist linear SVM.
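For illustration, the full conditionals in (4) directly yield a Gibbs sampler for the linear model in the spirit of Polson and Scott [2]. The sketch below is ours and not the variational method proposed later in this paper; it uses an (n, d) data matrix, hypothetical names, and the standard fact that λ_i^{-1} | β is inverse Gaussian with mean |1 − y_i x_i^⊤ β|^{-1} and unit shape.

```python
# Illustrative Gibbs sampler for the linear Bayesian SVM using the full
# conditionals in (4); a sketch with our own naming, not the paper's method.
import numpy as np
from scipy.stats import invgauss

def gibbs_linear_bsvm(X, y, Sigma, n_iter=1000, seed=0):
    """X: (n, d) data matrix, y: (n,) labels in {-1, +1}, Sigma: (d, d) prior covariance."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Z = X * y[:, None]                    # rows z_i = y_i * x_i
    Sigma_inv = np.linalg.inv(Sigma)
    beta, lam = np.zeros(d), np.ones(n)
    samples = []
    for _ in range(n_iter):
        # lambda_i | beta: 1/lambda_i is inverse Gaussian with
        # mean |1 - y_i x_i^T beta|^{-1} and shape parameter 1.
        margin = 1.0 - Z @ beta
        mu_ig = 1.0 / np.maximum(np.abs(margin), 1e-10)
        lam = 1.0 / invgauss.rvs(mu_ig, scale=1.0, random_state=rng)
        # beta | lambda: Gaussian with covariance B and mean B Z^T (1/lambda + 1).
        B = np.linalg.inv(Z.T @ (Z / lam[:, None]) + Sigma_inv)
        mean = B @ (Z.T @ (1.0 / lam + 1.0))
        beta = rng.multivariate_normal(mean, B)
        samples.append(beta.copy())
    return np.array(samples)
```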
By using the ideas of Gaussian processes [9], Henao et al. develop a nonlinear
(kernelized) version of this model [3]. They assume a continuous decision function
f (x) to be drawn from a zero-mean Gaussian process GP(0, k), where k is a
kernel function. The random Gaussian vector f = (f_1, ..., f_n)^⊤ corresponds to
f(x) evaluated at the data points. They substitute the linear function x_i^⊤ β by f_i
in (3) and obtain the conditional posteriors

f | λ, D ∼ N( CY(λ^{-1} + 1), C ),
λ_i | f_i, D_i ∼ GIG( 1/2, 1, (1 − y_i f_i)² ),    (5)
with C^{-1} := K^{-1} + Λ^{-1}, where K := k(X, X), k_∗ := k(X, x_∗) and k_∗∗ := k(x_∗, x_∗). The
conditional class membership probability is

p(y_∗ = 1 | λ, x_∗, D) = Φ( k_∗^⊤ (K + Λ)^{-1} Y(1 + λ) / √( 1 + k_∗∗ − k_∗^⊤ (K + Λ)^{-1} k_∗ ) ).
The downside of this approach is that it does not scale to big datasets. The
covariance matrix of the variational distribution q(f) has dimension n × n and
has to be updated and inverted at every inference step. This operation has
computational complexity O(n³), where n is the number of data points.
Furthermore, in this setup we cannot apply stochastic gradient descent. We show
how to overcome both problems in the next section, paving the way to performing
inference on big datasets.
where K_mm is the kernel matrix resulting from evaluating the kernel function
between all inducing point locations, K_nm is the cross-covariance between the
data points and the inducing points, and K̃ is given by K̃ = K_nn − K_nm K_mm^{-1} K_mn.
The augmented model exhibits the joint distribution

p(y, λ, f, u) = p(y, λ | f) p(f | u) p(u),

with p(f | u) = N( f | K_nm K_mm^{-1} u, K̃ ) and p(u) = N( u | 0, K_mm ).
Note that we can recover the original joint distribution by marginalizing over u.
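For concreteness, the inducing-point quantities K_mm, K_nm and diag(K̃) can be computed directly from a kernel function. The following sketch is ours (it is not the paper's implementation); it uses the RBF kernel from section 5 and, as an assumption, scikit-learn's k-means for the inducing point locations. Only the diagonal of K̃ is needed later.

```python
# Sketch: K_mm, K_nm and diag(K~) for an RBF kernel, with inducing points
# chosen by k-means (as suggested in section 4). Names and the use of
# scikit-learn are our own choices.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def rbf_kernel(A, B, theta):
    """k(a, b) = exp(-||a - b||^2 / theta), following the kernel used in section 5."""
    return np.exp(-cdist(A, B, "sqeuclidean") / theta)

def inducing_point_matrices(X, m, theta, jitter=1e-6, seed=0):
    """Return inducing points U, K_mm, K_nm and diag(K~) for data X of shape (n, d)."""
    U = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X).cluster_centers_
    K_mm = rbf_kernel(U, U, theta) + jitter * np.eye(m)   # jitter for numerical stability
    K_nm = rbf_kernel(X, U, theta)
    # diag(K~) = diag(K_nn) - diag(K_nm K_mm^{-1} K_mn); diag(K_nn) = 1 for this RBF kernel.
    K_tilde_diag = np.ones(X.shape[0]) - np.einsum(
        "ij,ij->i", K_nm, np.linalg.solve(K_mm, K_nm.T).T
    )
    return U, K_mm, K_nm, K_tilde_diag
```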
We now aim to apply the methodology of variational inference to the marginal
joint distribution p(y, u, λ) = ∫ p(y, u, f, λ) df. We impose a variational
distribution q(u) = N(u | µ, ζ) on the inducing points u. We follow [7] and apply
Jensen's inequality to obtain a lower bound on the following intractable conditional
probability,
log p(y, λ | u) ≥ E_{p(f|u)}[ log p(y, λ | f) ]
  =^c −(1/2) Σ_{i=1}^n E_{p(f_i|u)}[ log λ_i + (1 + λ_i − y_i f_i)² / λ_i ]
  = −(1/2) Σ_{i=1}^n ( log λ_i + (1/λ_i) E_{p(f_i|u)}[ (1 + λ_i − y_i f_i)² ] )
  = −(1/2) Σ_{i=1}^n ( log λ_i + (1/λ_i) ( K̃_ii + (1 + λ_i − y_i K_im K_mm^{-1} u)² ) )
  =: L_1,

where =^c denotes equality up to an additive constant.
Plugging the lower bound L1 into the standard evidence lower bound (ELBO)
[18] leads to the new variational objective
log p(y) ≥ E_q[ log p(y, λ, u) ] − E_q[ log q(λ, u) ]
  = E_q[ log p(y, λ | u) ] + E_q[ log p(u) ] − E_q[ log q(λ, u) ]
  ≥ E_q[ L_1 ] + E_q[ log p(u) ] − E_q[ log q(λ, u) ]    (7)
  = −(1/2) Σ_{i=1}^n E_q[ log λ_i + (1/λ_i) ( K̃_ii + (1 + λ_i − y_i K_im K_mm^{-1} u)² ) ]
      − KL( q(u) || p(u) ) − E_{q(λ)}[ log q(λ) ]
  =: L.
The expectations can be computed analytically (details are given in the appendix)
and we obtain L in closed form,
L =^c (1/2) log|ζ| − (1/2) tr( K_mm^{-1} ζ ) − (1/2) µ^⊤ K_mm^{-1} µ + y^⊤ κ µ
      + Σ_{i=1}^n ( log( B_{1/2}(√α_i) ) + (1/4) log(α_i) )    (8)
      − Σ_{i=1}^n (1/2) α_i^{-1/2} ( 1 − α_i − 2 y_i κ_{i·} µ + ( κ(µµ^⊤ + ζ)κ^⊤ + K̃ )_{ii} ),

where κ = K_nm K_mm^{-1} and B_{1/2}(·) is the modified Bessel function with parameter
1/2 [20]. This objective is amenable to stochastic optimization where we subsample
from the sum to obtain a noisy gradient estimate. We develop a stochastic
variational inference scheme by following noisy natural gradients of the variational
objective L. Using the natural gradient instead of the standard Euclidean gradient is
often favorable, since natural gradients are invariant to reparameterization of
the variational family [21,22] and provide effective second-order optimization
updates [23,6]. The natural gradients of L w.r.t. the Gaussian natural parameters
η_1 = ζ^{-1}µ and η_2 = −(1/2)ζ^{-1} are

∇̃_{η_1} L = κ^⊤ Y( α^{-1/2} + 1 ) − η_1,    (9)
∇̃_{η_2} L = −(1/2) ( K_mm^{-1} + κ^⊤ A^{-1/2} κ ) − η_2,    (10)
with A = diag(α). Details can be found in the appendix. The natural gradient
updates always lead to a positive definite covariance matrix, and in our
implementation ζ does not have to be parametrized in any way to ensure positive
definiteness. The derivative of L w.r.t. α_i is

∇_{α_i} L = ( (1 − y_i κ_i µ)² + y_i( κ_i ζ κ_i^⊤ + K̃_ii ) y_i ) / ( 4 √(α_i³) ) − 1 / ( 4 √α_i ).    (11)

Setting it to zero gives the coordinate ascent update for α_i,

α_i = (1 − y_i κ_i µ)² + y_i( κ_i ζ κ_i^⊤ + K̃_ii ) y_i.
Details can be found in the appendix. The inducing point locations can either
be treated as hyperparameters and optimized during training [24] or be fixed
before optimizing the variational objective. We follow the first approach,
which is often preferred in a stochastic variational inference setup [7,10]. The
inducing point locations can be either randomly chosen as a subset of the training
set or selected via a density estimator. In our experiments we have observed that the
k-means clustering algorithm (kMeans) [25] yields the best results. Combining
our results, we obtain a fast stochastic variational inference algorithm for the
Bayesian nonlinear SVM, which is outlined in alg. 1 and sketched below. We apply
the adaptive learning rate method described in [26].
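To make the update loop concrete, here is a compact sketch of the stochastic natural-gradient procedure of alg. 1. It is an illustrative rendering with hypothetical names, not the authors' implementation: it uses a fixed learning rate instead of the adaptive scheme of [26], keeps the inducing point locations and the length scale θ fixed, and rescales the minibatch sums by n/|S| to obtain unbiased noisy versions of the gradients (9) and (10).

```python
# Illustrative sketch of the stochastic variational inference loop (alg. 1),
# with our own naming, a fixed learning rate and fixed inducing points U.
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(A, B, theta):
    return np.exp(-cdist(A, B, "sqeuclidean") / theta)

def svi_bayesian_svm(X, y, U, theta, n_epochs=50, batch_size=100, lr=0.05, seed=0):
    """X: (n, d) data, y: (n,) labels in {-1, +1}, U: (m, d) inducing points."""
    rng = np.random.default_rng(seed)
    n, m = X.shape[0], U.shape[0]
    K_mm = rbf_kernel(U, U, theta) + 1e-6 * np.eye(m)
    K_mm_inv = np.linalg.inv(K_mm)
    K_nm = rbf_kernel(X, U, theta)
    kappa = K_nm @ K_mm_inv                            # kappa = K_nm K_mm^{-1}
    K_tilde = 1.0 - np.einsum("ij,ij->i", kappa, K_nm)  # diag(K~), k(x, x) = 1 for RBF
    mu, zeta = np.zeros(m), np.eye(m)                  # parameters of q(u)
    eta1, eta2 = np.linalg.solve(zeta, mu), -0.5 * np.linalg.inv(zeta)
    for _ in range(n_epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            kap, yb = kappa[idx], y[idx]
            # coordinate ascent update for alpha on the minibatch (below eq. 11)
            alpha = (1.0 - yb * (kap @ mu)) ** 2 \
                + np.einsum("ij,jk,ik->i", kap, zeta, kap) + K_tilde[idx]
            s = n / len(idx)                           # rescale minibatch sums to full data
            # noisy natural gradients, eqs. (9) and (10)
            grad1 = s * kap.T @ (yb * (alpha ** -0.5 + 1.0)) - eta1
            grad2 = -0.5 * (K_mm_inv + s * kap.T @ (kap * (alpha ** -0.5)[:, None])) - eta2
            eta1, eta2 = eta1 + lr * grad1, eta2 + lr * grad2
            zeta = np.linalg.inv(-2.0 * eta2)          # recover (mu, zeta) from natural params
            mu = zeta @ eta1
    return mu, zeta, K_mm_inv
```

Note that each update of η_2 is a convex combination of negative definite matrices, which is why ζ stays positive definite without any explicit parametrization, as claimed above.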
Since the standard SVM does not exhibit a probabilistic formulation, its
hyperparameters have to be tuned via computationally very expensive methods
such as grid search and cross validation. Our approach allows us to estimate the
hyperparameters during training time and lets us follow gradients instead of only
evaluating single hyperparameter candidates.
In the appendix we provide the gradient of the variational objective L w.r.t.
a general kernel and show how to optimize arbitrary differentiable hyperparameters.
Our experiments exemplify our automated hyperparameter tuning
approach by optimizing the hyperparameter of an RBF kernel.
The approximate posterior predictive distribution of the latent function at a test point x_* is

∫ p(f_* | u) q(u) du = N( f_* | K_*m K_mm^{-1} µ, K_** − K_*m K_mm^{-1} K_m* + K_*m K_mm^{-1} ζ K_mm^{-1} K_m* ) =: q(f_* | x_*, D),

where K_*m denotes the kernel matrix between test and inducing points and
K_** the kernel matrix between test points. This leads to the approximate class
membership distribution

q(y_* | x_*, D) = Φ( K_*m K_mm^{-1} µ / √( K_** − K_*m K_mm^{-1} K_m* + K_*m K_mm^{-1} ζ K_mm^{-1} K_m* + 1 ) ),    (13)

where Φ(·) is the probit link function. Note that we have already computed the inverse
K_mm^{-1} for the training procedure, so the computational overhead stems
only from simple matrix multiplications. Our experiments show that (13) leads to
reasonable uncertainty estimates.
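A direct implementation of (13) for a batch of test points might look as follows. This is again an illustrative sketch with hypothetical names, reusing the quantities (µ, ζ, K_mm^{-1}) and inducing points returned by the training sketch above.

```python
# Sketch: approximate class membership probabilities via (13).
# Assumes the outputs of the training sketch above; names are ours.
import numpy as np
from scipy.stats import norm
from scipy.spatial.distance import cdist

def rbf_kernel(A, B, theta):
    return np.exp(-cdist(A, B, "sqeuclidean") / theta)

def predict_proba(X_star, U, mu, zeta, theta, K_mm_inv):
    """p(y_* = 1 | x_*, D) for each row of X_star, following (13)."""
    K_sm = rbf_kernel(X_star, U, theta)        # kernel between test and inducing points
    A = K_sm @ K_mm_inv                        # K_*m K_mm^{-1}
    mean = A @ mu
    # predictive variance: K_** - K_*m K_mm^{-1} K_m* + K_*m K_mm^{-1} zeta K_mm^{-1} K_m*
    var = 1.0 - np.einsum("ij,ij->i", A, K_sm) + np.einsum("ij,jk,ik->i", A, zeta, A)
    return norm.cdf(mean / np.sqrt(var + 1.0))  # probit link Phi
```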
We now consider the special case of using a linear kernel. If we are interested in
this case, we may consider the Bayesian model for the linear SVM proposed by
Polson et al. (cf. eq. 4). This can be favorable over using the nonlinear version
since this model is formulated in primal space and, therefore, the computational
complexity depends on the dimension d and not on the number of data points
n. Furthermore, focusing directly on the linear model allows us to optimize the
true ELBO, E_q[log p(y, λ, β)] − E_q[log q(λ, β)], without the need to rely on a
lower bound (as in eq. 7). This typically leads to a better approximate posterior.
We again follow the structured mean field approach and choose our variational
distributions to be in the same families as the full conditionals (4),

q(λ_i) ≡ GIG( 1/2, 1, α_i )   and   q(β) ≡ N( µ, ζ ).
We again use the fact that the coordinate updates of the variational parameters
can be obtained by computing the expected natural parameters of the
corresponding full conditionals (4); this yields ζ = ( Z A^{-1/2} Z^⊤ + Σ^{-1} )^{-1},
µ = ζ Z( α^{-1/2} + 1 ) and α_i = ( 1 − y_i x_i^⊤ µ )² + x_i^⊤ ζ x_i, with A = diag(α).
The approximate class membership probability of a test point x_* is then
p(y_* = 1 | x_*, D) ≈ ∫ Φ(f_*) p(f_* | β, x_*) q(β | D) dβ df_* = Φ( x_*^⊤ µ / √( x_*^⊤ ζ x_* + 1 ) ),

where x_* is the test point and q(β | D) = N( β | µ, ζ ) is the approximate posterior
obtained by the above described inference scheme.
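As an illustration of these coordinate updates for the linear special case, a compact sketch (our own naming and rendering of the update equations, not the authors' implementation) could look as follows; the prediction function implements the Φ formula above.

```python
# Sketch: mean-field coordinate ascent for the linear Bayesian SVM;
# an illustrative reading of the expected-natural-parameter updates.
import numpy as np
from scipy.stats import norm

def cavi_linear_bsvm(X, y, Sigma, n_iter=50):
    """X: (n, d), y in {-1, +1}, Sigma: (d, d) prior covariance of beta."""
    n, d = X.shape
    Z = X * y[:, None]                    # rows z_i = y_i x_i
    Sigma_inv = np.linalg.inv(Sigma)
    alpha = np.ones(n)
    for _ in range(n_iter):
        # q(beta) = N(mu, zeta): expected natural parameters of (4), using E[1/lambda_i] = alpha_i^{-1/2}
        zeta = np.linalg.inv(Z.T @ (Z * (alpha ** -0.5)[:, None]) + Sigma_inv)
        mu = zeta @ (Z.T @ (alpha ** -0.5 + 1.0))
        # q(lambda_i) = GIG(1/2, 1, alpha_i): alpha_i = E_q[(1 - y_i x_i^T beta)^2]
        alpha = (1.0 - Z @ mu) ** 2 + np.einsum("ij,jk,ik->i", Z, zeta, Z)
    return mu, zeta

def predict_proba_linear(X_star, mu, zeta):
    """Class membership probability Phi(x^T mu / sqrt(x^T zeta x + 1))."""
    mean = X_star @ mu
    var = np.einsum("ij,jk,ik->i", X_star, zeta, X_star)
    return norm.cdf(mean / np.sqrt(var + 1.0))
```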
5 Experiments
(S-GPC) [10], and libSVM with Platt scaling [29,8] (SVM + Platt). For all
experiments we use an RBF kernel⁵ with length-scale parameter θ. We perform
all experiments using only one CPU core with 2.9 GHz and 386 GB RAM.
Code is available at github.com/theogf/BayesianSVM.
We demonstrate the scalability of our method on the SUSY dataset [11], containing
5 million points with 17 features. This dataset size is very common in particle
physics due to the simplicity of artificially generating new events as well as
the quantity of data coming from particle detectors. Since it is important to
have a sense of the confidence of the predictions for such datasets, the Bayesian
SVM is an appropriate choice. We use an RBF kernel⁷, 64 inducing points and
minibatches of 100 points. The training of our model takes only 10 minutes
without any parallelization. We use the area under the receiver operating
⁵ The RBF kernel is defined as k(x_1, x_2, θ) = exp( −||x_1 − x_2||² / θ ), where θ is the length-scale parameter.
⁶ For a comparison with the stochastic variational inference version of GPC, see section 5.3.
⁷ The length-scale parameter tuning is not included in the training time. We found θ = 5.0 by our proposed automatic tuning approach.
We examine the run time of our methods and the competitors. We include both
the batch variational inference method (B-BSVM) described in section 4.1 and
our fast and scalable inference method (S-BSVM) described in section 4.2 in the
experiments. For each method we iteratively evaluate the prediction performance
on a held-out dataset given a certain training time budget. The prediction error as
a function of the training time is shown in fig. 1. We experiment on the Waveform
dataset from the Rätsch benchmark datasets (N = 5000, d = 21). We use an
RBF kernel with fixed length-scale parameter θ = 5.0 and for the stochastic
variational inference methods, S-BSVM and S-GPC, we use a batch size of 10
and 100 inducing points.
Our scalable method (S-BSVM) is around 10 times faster than the direct
competitor ECM while having slightly better prediction performance. The batch
variational inference version (B-BSVM) is the slowest of the Bayesian SVM infer-
ence methods. The related probabilistic model, Gaussian process classification, is
around 5000 times slower than S-BSVM. Its stochastic inducing point version
(S-GPC) has comparable run time to S-BSVM but is very unstable, leading to bad
prediction performance. S-GPC showed these instabilities for multiple settings
of the hyperparameters. The classic SVM (libSVM) has a run time similar to that of
our method. The speed and prediction performance of S-BSVM depend on the
number of inducing points. See section 5.5 for an empirical study. Note that the
run time in table 1 is determined after the methods have converged.
In section 4.3 we show that our inference method possesses the ability of automatic
hyperparameter tuning. In this experiment we demonstrate that our method,
indeed, finds the optimal length-scale hyperparameter of the RBF kernel. We
use the optimization scheme (12) and alternate between 10 variational parameter
updates and one hyperparameter update. We compute the true validation loss of
the length-scale parameter θ by a grid search approach, which consists of training
our model (S-BSVM) for each θ and measuring the prediction performance using
10-fold cross validation. In fig. 2 we plot the validation loss and the length-scale
parameter found by our method. We find the true optimum using only 5
hyperparameter optimization steps. Training and hyperparameter optimization
take only 0.3 seconds for our method, whereas grid search takes 188 seconds
(with a grid size of 1000 points).
Fig. 2. Average validation loss as function of the RBF kernel length-scale parameter
θ, computed by grid search and 10-fold cross validation. The red circle represents the
hyperparameter found by our proposed automatic tuning approach.
6 Conclusion
We presented a fast, scalable and reliable approximate inference method for the
Bayesian nonlinear SVM. While previous methods were restricted to rather small
datasets, our method enables the application of the Bayesian nonlinear SVM to
large real-world datasets containing millions of samples. Our experiments showed
that our method is orders of magnitude faster than the state of the art while
still yielding comparable prediction accuracies. We showed how to automatically
tune the hyperparameters and obtain prediction uncertainties, which is important
in many real-world scenarios.

Fig. 3. Average prediction error and training time as functions of the number of inducing
points selected by two different methods, with one standard deviation (using 10-fold
cross validation).

In future work we plan to further extend the Bayesian nonlinear SVM model to
deal with missing data and account for correlations between data points, building
on ideas from [34]. Furthermore, we want to develop Bayesian formulations of
important variants of the SVM, such as one-class SVMs [35].
References
1. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning (1995)
2. Polson, N.G., Scott, S.L.: Data augmentation for support vector machines. Bayesian
Anal. (2011)
3. Henao, R., Yuan, X., Carin, L.: Bayesian Nonlinear Support Vector Machines and
Discriminative Factor Modeling. NIPS (2014)
4. Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds
of classifiers to solve real world classification problems? JMLR (2014)
5. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of machine learning.
MIT press (2012)
6. Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic Variational Inference.
JMLR (2013)
7. Hensman, J., Fusi, N., Lawrence, N.D.: Gaussian processes for big data. In:
Conference on Uncertainty in Artificial Intellegence. (2013)
8. Platt, J.C.: Probabilistic Outputs for Support Vector Machines and Comparisons
to Regularized Likelihood Methods. Advances in Large Margin Classifiers (1999)
9. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning
(Adaptive Computation and Machine Learning). The MIT Press (2005)
10. Hensman, J., Matthews, A.: Scalable Variational Gaussian Process Classification.
AISTATS (2015)
11. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy
physics with deep learning. Nature communications (2014) 4308
12. Zhu, J., Chen, N., Perkins, H., Zhang, B.: Gibbs Max-margin Topic Models with
Data Augmentation. JMLR (2014)
13. Xu, M., Zhu, J., Zhang, B.: Fast Max-Margin Matrix Factorization with Data
Augmentation. ICML (2013) 978–986
14. Zhang, A., Zhu, J., Zhang, B.: Max-Margin Infinite Hidden Markov Models. ICML (2014)
15. Luts, J., Ormerod, J.T.: Mean field variational bayesian inference for support vector
machine classification. Comput. Stat. Data Anal. (May 2014) 163–176
16. Snelson, E., Ghahramani, Z.: Sparse Gaussian Processes using Pseudo-inputs. NIPS (2006)
17. Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A.: lp-norm multiple kernel learning.
JMLR (Mar) (2011) 953–997
18. Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An Introduction to
Variational Methods for Graphical Models. Mach. Learn. (1999)
19. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and
variational inference. Found. Trends Mach. Learn. (1-2) (January 2008) 1–305
20. Jørgensen, B.: Statistical properties of the generalized inverse Gaussian distribution.
Springer Science & Business Media (2012)
21. Amari, S., Nagaoka, H.: Methods of Information Geometry. Am. Math. Soc. (2007)
22. Martens, J.: New insights and perspectives on the natural gradient method. Arxiv
Preprint (2017)
23. Amari, S.: Natural gradient works efficiently in learning. Neural Computation (1998)
24. Titsias, M.K.: Variational learning of inducing variables in sparse Gaussian processes.
In: Artificial Intelligence and Statistics. (2009) 567–574
25. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press (2012)
26. Ranganath, R., Wang, C., Blei, D.M., Xing, E.P.: An Adaptive Learning Rate for
Stochastic Variational Inference. ICML (2013)
27. Maritz, J., Lwin, T.: Empirical Bayes Methods with Applications. Monographs on
Statistics and Applied Probability. (1989)
28. Mandt, S., Hoffman, M., Blei, D.: A Variational Analysis of Stochastic Gradient
Algorithms. ICML (2016)
29. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology (2011) 27:1–27:27
30. Brier, G.W.: Verification of forecasts expressed in terms of probability. Monthly
weather review (1) (1950) 1–3
31. Diethe, T.: 13 benchmark datasets derived from the UCI, DELVE and STATLOG
repositories (2015)
32. Bachem, O., Lucic, M., Hassani, H., Krause, A.: Fast and Provably Good Seedings
for k-Means. NIPS (2016)
33. Lichman, M.: UCI machine learning repository (2013)
34. Mandt, S., Wenzel, F., Nakajima, S., Cunningham, J.P., Lippert, C., Kloft, M.:
Sparse Probit Linear Mixed Model. Machine Learning Journal (2017)
35. Perdisci, R., Gu, G., Lee, W.: Using an Ensemble of One-Class SVM Classifiers to
Harden Payload-based Anomaly Detection Systems. Data Mining (2006)
A Appendix
In the following we give the details of the derivation of the variational objective
(8) for the inducing point model in section 4.2. The variational objective as
defined in (7) is

L = −(1/2) Σ_{i=1}^n E_q[ log λ_i + (1/λ_i) ( K̃_ii + (1 + λ_i − y_i K_im K_mm^{-1} u)² ) ] − KL( q(u) || p(u) ) − E_{q(λ)}[ log q(λ) ].

Using the abbreviation κ_i = K_im K_mm^{-1}, the first expectation term simplifies to
E_q[ log λ_i + (1/λ_i) ( K̃_ii + (1 + λ_i − y_i κ_i u)² ) ]
  = E_q[ log λ_i ] + E_q[ λ_i^{-1} ( K̃_ii + 1 + λ_i² + y_i² (κ_i u)² + 2λ_i − 2 y_i κ_i u − 2 λ_i y_i κ_i u ) ]   (with y_i² = 1)
  =^c E_{q(λ_i)}[ log λ_i ] + (1/√α_i) ( K̃_ii + 1 + (κ_i µ)² + κ_i ζ κ_i^⊤ − 2 y_i κ_i µ ) + E_{q(λ_i)}[ λ_i ] − 2 y_i κ_i µ
  = E_{q(λ_i)}[ log λ_i ] + (1/√α_i) ( K̃_ii + (1 − y_i κ_i µ)² + κ_i ζ κ_i^⊤ ) + E_{q(λ_i)}[ λ_i ] − 2 y_i κ_i µ,

where we used E_{q(λ_i)}[λ_i^{-1}] = α_i^{-1/2} and E_{q(u)}[(κ_i u)²] = (κ_i µ)² + κ_i ζ κ_i^⊤.
Moreover,

E_{q(λ_i)}[ log q(λ_i) ] = E_{q(λ_i)}[ −(1/4) log(α_i) − (1/2) log(λ_i) − log(2) − log( B_{1/2}(√α_i) ) − (1/2) ( λ_i + α_i/λ_i ) ]
  =^c −(1/4) log(α_i) − (1/2) E_{α_i}[ log(λ_i) ] − log( B_{1/2}(√α_i) ) − (1/2) E_{α_i}[ λ_i ] − (1/2) α_i E_{α_i}[ 1/λ_i ]
  = −(1/4) log(α_i) − (1/2) E_{α_i}[ log(λ_i) ] − log( B_{1/2}(√α_i) ) − (1/2) E_{α_i}[ λ_i ] − √α_i / 2,

where B_{1/2}(·) is the modified Bessel function with parameter 1/2 [20].
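For reference, the two GIG moments used implicitly above follow from standard Bessel identities, namely K_{-1/2}(z) = K_{1/2}(z) and K_{3/2}(z) = (1 + 1/z) K_{1/2}(z), writing K_p for the modified Bessel function denoted B_p in the main text. The following is a supplementary note of ours, not part of the original derivation:

```latex
% Moments of \lambda_i \sim \mathrm{GIG}(1/2, 1, \alpha_i), with density
% proportional to \lambda^{-1/2} \exp\{-(\lambda + \alpha_i/\lambda)/2\}:
\mathbb{E}_{\alpha_i}\!\left[\lambda_i^{-1}\right]
  = \frac{1}{\sqrt{\alpha_i}}\,
    \frac{K_{-1/2}(\sqrt{\alpha_i})}{K_{1/2}(\sqrt{\alpha_i})}
  = \frac{1}{\sqrt{\alpha_i}},
\qquad
\mathbb{E}_{\alpha_i}\!\left[\lambda_i\right]
  = \sqrt{\alpha_i}\,
    \frac{K_{3/2}(\sqrt{\alpha_i})}{K_{1/2}(\sqrt{\alpha_i})}
  = \sqrt{\alpha_i} + 1.
```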
By summing the terms, the remaining expectations cancel out and we obtain

L =^c Σ_{i=1}^n { −(1/2) E_{q(λ_i)}[ log λ_i ] − (1/(2√α_i)) ( K̃_ii + (1 − y_i κ_i µ)² + κ_i ζ κ_i^⊤ ) − (1/2) E_{q(λ_i)}[ λ_i ] + y_i κ_i µ
      + (1/4) log(α_i) + (1/2) E_{q(λ_i)}[ log(λ_i) ] + log( B_{1/2}(√α_i) ) + (1/2) E_{q(λ_i)}[ λ_i ] + √α_i / 2 } − KL( q(u) || p(u) )
  = Σ_{i=1}^n { −(1/(2√α_i)) ( K̃_ii + (1 − y_i κ_i µ)² + κ_i ζ κ_i^⊤ − α_i ) + y_i κ_i µ + (1/4) log(α_i) + log( B_{1/2}(√α_i) ) } − KL( q(u) || p(u) )
  =^c Σ_{i=1}^n { −(1/(2√α_i)) ( K̃_ii + (1 − y_i κ_i µ)² + κ_i ζ κ_i^⊤ − α_i ) + y_i κ_i µ + (1/4) log(α_i) + log( B_{1/2}(√α_i) ) }
      + (1/2) log|ζ| − (1/2) tr( K_mm^{-1} ζ ) − (1/2) µ^⊤ K_mm^{-1} µ
  = (1/2) log|ζ| − (1/2) tr( K_mm^{-1} ζ ) − (1/2) µ^⊤ K_mm^{-1} µ + y^⊤ κ µ
      + Σ_{i=1}^n { log( B_{1/2}(√α_i) ) + (1/4) log(α_i) − (1/2) α_i^{-1/2} ( 1 − α_i − 2 y_i κ_{i·} µ + ( κ(µµ^⊤ + ζ)κ^⊤ + K̃ )_{ii} ) }.
The derivative w.r.t. the parameter α_i of the generalized inverse Gaussian
distribution is

dL/dα_i = (1/4) d/dα_i log(α_i) + d/dα_i log( B_{1/2}(√α_i) ) + (1/2) d/dα_i √α_i − ( (1 − y_i κ_i µ)² + y_i( κ_i ζ κ_i^⊤ + K̃_ii ) y_i )/2 · d/dα_i ( 1/√α_i )
  = 1/(4α_i) − ( 1/(4α_i) + 1/(2√α_i) ) + 1/(4√α_i) + ( (1 − y_i κ_i µ)² + y_i( κ_i ζ κ_i^⊤ + K̃_ii ) y_i ) / ( 4 √(α_i³) )
  = ( (1 − y_i κ_i µ)² + y_i( κ_i ζ κ_i^⊤ + K̃_ii ) y_i ) / ( 4 √(α_i³) ) − 1 / ( 4 √α_i ).
The natural gradient can be computed by pre-multiplying the Euclidean gradient
with the inverse Fisher information matrix [21]. Applied to a Gaussian distribution,
this leads to the following expression for the natural gradient w.r.t. the natural
parameters [21],

∇̃_{(η_1, η_2)} L(η) = ( L′_µ(η) − 2 L′_ζ(η) µ, L′_ζ(η) ).

For the first component we obtain

∇̃_{η_1} L = (1/2) ( K_mm^{-1} + κ^⊤ A^{-1/2} κ ) η_2^{-1} η_1 + κ^⊤ Y( α^{-1/2} + 1 ) + (1/2) ( −2η_2 − K_mm^{-1} − κ^⊤ A^{-1/2} κ ) η_2^{-1} η_1
  = κ^⊤ Y( α^{-1/2} + 1 ) − η_1,

and

∇̃_{η_2} L = L′_ζ = −(1/2) ( K_mm^{-1} + κ^⊤ A^{-1/2} κ ) − η_2.
The derivative of L w.r.t. a kernel hyperparameter ω is

dL/dω = −(1/2) d/dω [ log|K_mm| + tr( K_mm^{-1} ζ ) + µ^⊤ K_mm^{-1} µ − 2( 1 + α^{-1/2} )^⊤ Y κ µ + ( α^{-1/2} )^⊤ diag( κ(µµ^⊤ + ζ)κ^⊤ + K̃ ) ].
Using the abbreviations J^ω_{∗∗} = dK_{∗∗}/dω and ι^ω = dκ/dω = ( J^ω_{nm} − κ J^ω_{mm} ) K_mm^{-1}, we obtain

dL/dω = −(1/2) [ Tr( K_mm^{-1} J^ω_{mm} ( I − K_mm^{-1} ζ ) ) − µ^⊤ K_mm^{-1} J^ω_{mm} K_mm^{-1} µ − 2( 1 + α^{-1/2} )^⊤ Y ι^ω µ
      + ( α^{-1/2} )^⊤ diag( κ( (µµ^⊤ + ζ) ι^{ω⊤} − J^ω_{mn} ) + ι^ω( (µµ^⊤ + ζ) κ^⊤ − K_mn ) + J^ω_{nn} ) ].