Bayesian Nonlinear Support Vector Machines for Big Data

Florian Wenzel¹, Théo Galy-Fajou¹, Matthäus Deutsch², and Marius Kloft¹

¹ Humboldt University of Berlin, Germany
² G+J Digital Products Hamburg, Germany
{wenzelfl,galy,kloft}@hu-berlin.de, [email protected]

arXiv:1707.05532v1 [stat.ML] 18 Jul 2017

Abstract. We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors such as accurate predictive uncertainty estimates and automatic hyperparameter search.

Keywords: Bayesian Approximative Inference, Support Vector Machines, Kernel Methods, Big Data

1 Introduction

Statistical machine learning branches into two classic strands of research: Bayesian and frequentist. In the classic supervised learning setting, both paradigms aim to find, based on training data, a function fβ that predicts well on yet unseen test data. The difference between the Bayesian and the frequentist approach lies in the treatment of the parameter vector β of this function. In the frequentist setting, we select the parameter β that minimizes a certain loss given the training data, from a restricted set B of limited complexity. In the Bayesian school of thinking, we express our prior belief about the parameter in the form of a probability distribution over the parameter vector. When we observe data, we adapt our belief, resulting in a posterior distribution over β.
Advantages of the Bayesian approach include automatic treatment of hyperparameters and direct quantification of the uncertainty³ of the prediction in the form of class membership probabilities, which can be of tremendous importance in practice. As examples consider the following. (1) We have collected blood samples of cancer patients and controls. The aim is to screen individuals that have an increased likelihood of developing cancer. The knowledge of the uncertainty in those predictions is invaluable to clinicians. (2) In the domain of physics it is important to have a sense of the certainty level of predictions, since it is mandatory to assert the statistical confidence in any physical variable measurement. (3) In the general context of decision making, it is crucial that the uncertainty of the estimated outcome of an action can be reliably determined.

³ Note that frequentist approaches can also lead to other forms of uncertainty estimates, e.g. in the form of confidence intervals. But since the classic SVM does not exhibit a probabilistic formulation, these uncertainty estimates cannot be directly computed.
Recently, it was shown that the support vector machine (SVM) [1], a classic supervised classification algorithm, admits a Bayesian interpretation through the technique of data augmentation [2,3]. This so-called Bayesian nonlinear SVM combines the best of both worlds: it inherits from the frequentist formulation of the SVM the geometric interpretation, robustness against outliers, state-of-the-art accuracy [4], and theoretical error guarantees [5], but like Bayesian methods it also allows for flexible feature modeling, automatic hyperparameter tuning, and predictive uncertainty quantification.
However, existing inference methods for the Bayesian support vector machine
(such as the expectation conditional maximization method introduced in [3])
scale rather poorly with the number of samples and are limited in application
to datasets with thousands of data points [3]. Based on stochastic variational
inference [6] and inducing points [7], we develop in this paper a fast and scalable
inference method for the nonlinear Bayesian SVM.
Our experiments show superior performance of our method over competing
methods for uncertainty quantification of SVMs such as Platt’s method [8].
Furthermore, we show that our approach is faster (by one to three orders of
magnitude) than the following competitors: expectation conditional maximization
(ECM) for nonlinear Bayesian SVM by [3], Gaussian process classification [9],
and the recently proposed scalable variational Gaussian process classification
method [10]. We apply our method to the domain of particle physics, namely on
the SUSY dataset [11] (a standard benchmark in particle physics containing 5
million data points) where our method takes only 10 minutes to train on a single
CPU machine.
Our experiments demonstrate that Bayesian inference techniques are mature
enough to compete with corresponding frequentist approaches (such as nonlinear
SVMs) in terms of scalability to big data, yet they offer additional benefits such
as uncertainty estimation and automated hyperparameter search.
Our paper is structured as follows. In section 2 we discuss related work and review the Bayesian nonlinear SVM model in section 3. In section 4 we propose our novel scalable inference algorithm, show how to optimize hyperparameters, and obtain an approximate predictive distribution. We also discuss the special case of the linear SVM, for which we propose a specially tailored fast inference algorithm. Section 5 concludes with experimental results.

2 Related Work

There has recently been significant interest in utilizing max-margin based discriminative Bayesian models for various applications. For example, [12] employs max-margin based Bayesian classification to discover latent semantic structures for topic models, [13] uses a max-margin approach for efficient Bayesian matrix factorization, and [14] develops a new max-margin approach to hidden Markov models.
All these approaches apply the Bayesian reformulation of the classic SVM
introduced by [2]. This model is extended by [3] to the nonlinear case. The authors
show improved accuracy compared to standard methods such as (non-Bayesian)
SVMs and Gaussian process (GP) classification.
However, the inference methods proposed in [2] and [3] have the drawback
that they partially rely on point estimates of the latent variables and do not scale
well to large datasets. In [15] the authors apply mean field variational inference
to the linear case of the model, but their proposed technique does not lead to
substantial performance improvements and neglects the nonlinear model.
Uncertainty estimation for SVMs is usually done via Platt’s technique [8],
which consists of applying a logistic regression on the function scores produced by
the SVM. In contrast, our technique directly yields a sound predictive distribution
instead of using a heuristically motivated transformation. We make use of the idea
of inducing point GPs to develop a scalable inference method for the Bayesian
nonlinear SVM. Sparse GPs using pseudo-inputs were first introduced in [16].
Building on this idea, Hensman et al. developed a stochastic variational inference scheme for GP regression and GP classification [7,10]. We further extend these ideas to the setting of the Bayesian nonlinear SVM.

3 The Bayesian SVM Model


Let D = {x_i, y_i}_{i=1}^n be n observations, where x_i ∈ R^d is a feature vector with corresponding label y_i ∈ {−1, 1}. The SVM aims to find an optimal score function f by solving the following regularized risk minimization objective:

$$\arg\min_f \; \gamma R(f) + \sum_{i=1}^n \max\big(0,\, 1 - y_i f(x_i)\big), \qquad (1)$$

where R is a regularizer function controlling the complexity of the decision function f, and γ is a hyperparameter adjusting the trade-off between training error and the complexity of f. The loss max(0, 1 − y f(x)) is called the hinge loss. The classifier is then defined as sign(f(x)).
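To make the objective concrete, the following minimal NumPy sketch evaluates the regularized risk (1); it assumes a linear score function f(x) = x^T β and an L2 regularizer R(β) = ||β||², which are illustrative choices on our part and not fixed by eq. (1).

```python
import numpy as np

def svm_objective(beta, X, y, gamma):
    """Regularized hinge-loss risk (1) for a linear score f(x) = x^T beta.

    X: (n, d) data matrix, y: (n,) labels in {-1, +1},
    gamma: trade-off hyperparameter, R(beta) = ||beta||^2 (assumed L2 regularizer).
    """
    hinge = np.maximum(0.0, 1.0 - y * (X @ beta))   # hinge loss per sample
    return gamma * (beta @ beta) + hinge.sum()
```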
For the case of a linear decision function, i.e. f(x) = x^T β, the SVM optimization problem (1) is equivalent to estimating the mode of a pseudo-posterior

$$p(\beta|\mathcal{D}) \propto \prod_{i=1}^n L(y_i|x_i, \beta)\, p(\beta).$$

Here p(β) denotes a prior such that log p(β) ∝ −2γR(β). In the following we use the prior β ∼ N(0, Σ), where Σ ∈ R^{d×d} is a positive definite matrix. From a frequentist SVM view, this choice generalizes the usual L2 regularization to non-isotropic regularizers. Note that our proposed framework can be easily extended to other regularization techniques by adjusting the prior on β (e.g. block ℓ_{(2,p)}-norm regularization, which is known as multiple kernel learning [17]). In order to obtain a Bayesian interpretation of the SVM, we need to define a pseudolikelihood L such that the following holds:

$$L\big(y|x, f(\cdot)\big) \propto \exp\big(-2\max(1 - y f(x),\, 0)\big). \qquad (2)$$
By introducing latent variables λ := (λ_1, ..., λ_n)^⊤ (data augmentation) and making use of integral identities stemming from function theory, [2] show that the specification of L in terms of the following marginal distribution satisfies (2):

$$L(y_i|x_i, \beta) = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_i}} \exp\left(-\frac{1}{2}\,\frac{\big(1 + \lambda_i - y_i x_i^\top\beta\big)^2}{\lambda_i}\right) d\lambda_i. \qquad (3)$$

Writing X ∈ R^{d×n} for the matrix of data points and Y = diag(y), the full conditional distributions of this model are

$$\begin{aligned}
\beta \mid \lambda, \Sigma, \mathcal{D} &\sim \mathcal{N}\!\big(B Z(\lambda^{-1} + 1),\; B\big), \\
\lambda_i \mid \beta, \mathcal{D}_i &\sim \mathcal{GIG}\!\big(1/2,\; 1,\; (1 - y_i x_i^\top\beta)^2\big),
\end{aligned} \qquad (4)$$

with Z = Y X, B^{-1} = Z Λ^{-1} Z^⊤ + Σ^{-1}, Λ = diag(λ), and where GIG denotes the generalized inverse Gaussian distribution. The n latent variables λ_i of the model scale the variance of the full posteriors locally. The model thus constitutes a special case of a normal variance-mean mixture, where we implicitly impose the improper prior p(λ) = 1_{[0,∞)}(λ) on λ. This could be generalized by using a generalized inverse Gaussian prior on λ_i, leading to a conjugate model for λ_i. Henao et al. [3] show that in the case of an exponential prior on λ_i, this leads to a skewed Laplace full conditional for λ_i. Note that this, however, destroys the equivalence to the frequentist linear SVM.
By using the ideas of Gaussian processes [9], Henao et al. develop a nonlinear (kernelized) version of this model [3]. They assume a continuous decision function f(x) to be drawn from a zero-mean Gaussian process GP(0, k), where k is a kernel function. The random Gaussian vector f = (f_1, ..., f_n)^⊤ corresponds to f(x) evaluated at the data points. They substitute the linear function x_i^⊤ β by f_i in (3) and obtain the conditional posteriors

$$\begin{aligned}
f \mid \lambda, \mathcal{D} &\sim \mathcal{N}\!\big(C Y(\lambda^{-1} + 1),\; C\big), \\
\lambda_i \mid f_i, \mathcal{D}_i &\sim \mathcal{GIG}\!\big(1/2,\; 1,\; (1 - y_i f_i)^2\big),
\end{aligned} \qquad (5)$$

with C^{-1} = Λ^{-1} + K^{-1}. For a test point x_*, the conditional predictive distribution for f_* = f(x_*) under this model is

$$f_* \mid \lambda, x_*, \mathcal{D} \sim \mathcal{N}\!\big(k_*^\top (K + \Lambda)^{-1} Y(1 + \lambda),\; k_{**} - k_*^\top (K + \Lambda)^{-1} k_*\big),$$

where K := k(X, X), k_* := k(X, x_*), and k_{**} := k(x_*, x_*). The conditional class membership probability is

$$p(y_* = 1 \mid \lambda, x_*, \mathcal{D}) = \Phi\!\left(\frac{k_*^\top (K + \Lambda)^{-1} Y(1 + \lambda)}{1 + k_{**} - k_*^\top (K + \Lambda)^{-1} k_*}\right),$$

where Φ(·) is the probit link function.


Note that the conditional posteriors as well as the class membership probability still depend on the local latent variables λ_i. We are interested in the marginal predictive distributions, but unfortunately the latent variables cannot be integrated out analytically. Both [2] and [3] propose MCMC algorithms and stepwise inference schemes similar to EM algorithms to overcome this problem. These methods do not scale well to big data problems, and the probability estimation still relies on point estimates of the n-dimensional λ. We overcome these problems by proposing a scalable inference method and obtaining approximate marginal predictive distributions (that are not conditioned on λ).
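For concreteness, the conditional predictive distribution and class membership probability above can be sketched as follows in NumPy; the function and argument names (K_star for the rows k(x_*, X), k_ss for the diagonal of k(X_*, X_*)) are our own illustrative choices, not from the paper.

```python
import numpy as np
from scipy.stats import norm

def conditional_predictive(K_star, k_ss, K, lam, y):
    """Conditional p(f_*|lambda, x_*, D) and p(y_*=1|lambda, x_*, D) for a batch of test points.

    K_star: (t, n) cross-kernel k(x_*, X), k_ss: (t,) diag of k(X_*, X_*),
    K: (n, n) kernel matrix, lam: (n,) latent variables, y: (n,) labels in {-1, +1}.
    """
    M = np.linalg.inv(K + np.diag(lam))                    # (K + Lambda)^{-1}
    mean = K_star @ (M @ (y * (1.0 + lam)))                # k_*^T (K+Lambda)^{-1} Y(1+lambda)
    var = k_ss - np.einsum('ij,jk,ik->i', K_star, M, K_star)
    prob = norm.cdf(mean / (1.0 + var))                    # probit link, as in the display above
    return mean, var, prob
```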

4 Scalable Inference and Automated Hyperparameter Tuning

In the following we develop a fast and reliable inference method for the Bayesian
nonlinear SVM. Our method builds on the idea of using inducing points for
Gaussian Processes in a stochastic variational inference setting [7] that scales
easily to millions of data points. We proceed by first discussing a standard batch
variational scheme in section 4.1 and then in section 4.2 we develop our fast and
scalable inference method. We show how to automatically tune hyperparameters
in section 4.3 and obtain uncertainty estimates for predictions in section 4.4.
Finally, we discuss the special case of the Bayesian linear SVM in section 4.5.

4.1 Batch Variational Inference


The idea of variational inference is to approximate the typically intractable pos-
terior of a probabilistic model by a variational (typically factorized) distribution.
We find the optimal approximating distribution by maximizing a lower bound
on the evidence (the so-called ELBO) with respect to the parameters of the
variational distribution, which is equivalent to minimizing the Kullback-Leibler
divergence between the variational distribution and the posterior [18,19].
In this section we first develop a batch variational inference scheme [18,19], which uses the full dataset in every iteration. We follow the structured mean field approach and choose the variational distributions within the same families as the full conditional distributions, q(f, λ) = q(f) ∏_{i=1}^n q(λ_i), with q(f) ≡ N(μ, ζ) and q(λ_i) ≡ GIG(1/2, 1, α_i). The coordinate ascent updates can be computed from the expected natural parameters of the corresponding full conditionals (5), leading to

$$\begin{aligned}
\alpha_i &= \mathbb{E}_{q(f)}\big[(1 - y_i f_i)^2\big] = (1 - y_i\mu_i)^2 + \zeta_{ii}, \\
\zeta &= \big(\mathbb{E}_{q(\lambda)}[\Lambda^{-1}] + K^{-1}\big)^{-1} = \big(A^{-\frac{1}{2}} + K^{-1}\big)^{-1}, \\
\mu &= \zeta\, \mathbb{E}_{q(\lambda)}\big[Y(\lambda^{-1} + 1)\big] = \zeta Y\big(\alpha^{-\frac{1}{2}} + 1\big),
\end{aligned}$$

with A = diag(α).
This concludes the batch variational inference scheme.
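The batch coordinate ascent sweep can be written compactly; the following NumPy sketch is our own illustrative implementation, with the updates taken directly from the displays above.

```python
import numpy as np

def batch_update(K, y, mu, zeta):
    """One coordinate ascent sweep of the batch variational scheme.

    K: (n, n) kernel matrix, y: (n,) labels in {-1, +1},
    mu: (n,) variational mean of q(f), zeta: (n, n) variational covariance.
    """
    alpha = (1.0 - y * mu) ** 2 + np.diag(zeta)            # alpha_i = (1 - y_i mu_i)^2 + zeta_ii
    zeta_new = np.linalg.inv(np.diag(alpha ** -0.5) + np.linalg.inv(K))
    mu_new = zeta_new @ (y * (alpha ** -0.5 + 1.0))        # zeta Y (alpha^{-1/2} + 1)
    return alpha, mu_new, zeta_new
```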



The downside of this approach is that it does not scale to big datasets. The covariance matrix of the variational distribution q(f) has dimension n × n and has to be updated and inverted at every inference step. This operation has computational complexity O(n³), where n is the number of data points. Furthermore, in this setup we cannot apply stochastic gradient descent. We show how to overcome both problems in the next section, paving the way to perform inference on big datasets.

4.2 Stochastic Variational Inference Using Inducing Points


We aim to develop a stochastic variational inference (SVI) scheme using only minibatches of the data in each iteration. The Bayesian nonlinear SVM model does not exhibit a set of global variables: both the number of latent variables λ and the observations of the latent GP f grow with the number of data points (cf. eq. (5)), i.e. they are local variables. This hinders us from directly developing an SVI scheme. We make use of the concept of inducing points [7], imposing a sparse GP that acts as a global variable. This allows us to apply SVI and reduces the complexity to O(m³), where m is the number of inducing points, which is independent of the number of data points.
We augment our original model (5) with m < n inducing points. Let u ∈ R^m be pseudo observations at inducing locations {x̂_1, ..., x̂_m}. We employ a prior on the inducing points, p(u) = N(0, K_mm), and connect f and u by setting

$$p(f|u) = \mathcal{N}\!\big(K_{nm} K_{mm}^{-1} u,\; \widetilde{K}\big), \qquad (6)$$

where K_mm is the kernel matrix obtained by evaluating the kernel function between all inducing point locations, K_nm is the cross-covariance between the data points and the inducing points, and K̃ is given by K̃ = K_nn − K_nm K_mm^{-1} K_mn. The augmented model exhibits the joint distribution

$$p(y, u, f, \lambda) = p(y, \lambda|f)\, p(f|u)\, p(u).$$

Note that we can recover the original joint distribution by marginalizing over u. We now aim to apply the methodology of variational inference to the marginal joint distribution p(y, u, λ) = ∫ p(y, u, f, λ) df. We impose a variational distribution q(u) = N(u|μ, ζ) on the inducing points u. We follow [7] and apply Jensen's inequality to obtain a lower bound on the following intractable conditional probability:

$$\begin{aligned}
\log p(y, \lambda|u) &= \log \mathbb{E}_{p(f|u)}\big[p(y, \lambda|f)\big] \\
&\geq \mathbb{E}_{p(f|u)}\big[\log p(y, \lambda|f)\big] \\
&= \sum_{i=1}^n \mathbb{E}_{p(f_i|u)}\big[\log p(y_i, \lambda_i|f_i)\big] \\
&= \sum_{i=1}^n \mathbb{E}_{p(f_i|u)}\left[\log\left((2\pi\lambda_i)^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\,\frac{(1 + \lambda_i - y_i f_i)^2}{\lambda_i}\right)\right)\right] \\
&\stackrel{c}{=} -\frac{1}{2}\sum_{i=1}^n \mathbb{E}_{p(f_i|u)}\left[\log\lambda_i + \frac{(1 + \lambda_i - y_i f_i)^2}{\lambda_i}\right] \\
&= -\frac{1}{2}\sum_{i=1}^n \left(\log\lambda_i + \frac{1}{\lambda_i}\,\mathbb{E}_{p(f_i|u)}\big[(1 + \lambda_i - y_i f_i)^2\big]\right) \\
&= -\frac{1}{2}\sum_{i=1}^n \left(\log\lambda_i + \frac{1}{\lambda_i}\Big(\widetilde{K}_{ii} + \big(1 + \lambda_i - y_i K_{im} K_{mm}^{-1} u\big)^2\Big)\right) \\
&=: L_1.
\end{aligned}$$
Plugging the lower bound L_1 into the standard evidence lower bound (ELBO) [18] leads to the new variational objective

$$\begin{aligned}
\log p(y) &\geq \mathbb{E}_q[\log p(y, \lambda, u)] - \mathbb{E}_q[\log q(\lambda, u)] \\
&= \mathbb{E}_q[\log p(y, \lambda|u)] + \mathbb{E}_q[\log p(u)] - \mathbb{E}_q[\log q(\lambda, u)] \\
&\geq \mathbb{E}_q[L_1] + \mathbb{E}_q[\log p(u)] - \mathbb{E}_q[\log q(\lambda, u)] \qquad (7)\\
&= -\frac{1}{2}\sum_{i=1}^n \mathbb{E}_q\left[\log\lambda_i + \frac{1}{\lambda_i}\Big(\widetilde{K}_{ii} + \big(1 + \lambda_i - y_i K_{im} K_{mm}^{-1} u\big)^2\Big)\right] \\
&\quad - \mathrm{KL}\big(q(u)\,\|\,p(u)\big) - \mathbb{E}_{q(\lambda)}[\log q(\lambda)] \\
&=: L.
\end{aligned}$$
The expectations can be computed analytically (details are given in the appendix) and we obtain L in closed form,

$$\begin{aligned}
L \stackrel{c}{=}\;& \frac{1}{2}\log|\zeta| - \frac{1}{2}\mathrm{tr}\big(K_{mm}^{-1}\zeta\big) - \frac{1}{2}\mu^\top K_{mm}^{-1}\mu + y^\top\kappa\mu \\
&+ \sum_{i=1}^n \left(\log B_{\frac{1}{2}}(\sqrt{\alpha_i}) + \frac{1}{4}\log\alpha_i\right) \qquad (8)\\
&- \sum_{i=1}^n \frac{1}{2}\alpha_i^{-\frac{1}{2}}\Big(1 - \alpha_i - 2 y_i\kappa_{i\cdot}\mu + \big[\kappa(\mu\mu^\top + \zeta)\kappa^\top + \widetilde{K}\big]_{ii}\Big),
\end{aligned}$$

where κ = K_nm K_mm^{-1} and B_{1/2}(·) is the modified Bessel function with parameter 1/2 [20]. This objective is amenable to stochastic optimization, where we subsample from the sum to obtain a noisy gradient estimate. We develop a stochastic variational inference scheme by following noisy natural gradients of the variational objective L. Using the natural gradient instead of the standard Euclidean gradient is often favorable since natural gradients are invariant to reparameterization of the variational family [21,22] and provide effective second-order optimization updates [23,6]. The natural gradients of L w.r.t. the Gaussian natural parameters η_1 = ζ^{-1}μ and η_2 = −(1/2)ζ^{-1} are

$$\widetilde{\nabla}_{\eta_1} L = \kappa^\top Y\big(\alpha^{-\frac{1}{2}} + 1\big) - \eta_1, \qquad (9)$$
$$\widetilde{\nabla}_{\eta_2} L = -\frac{1}{2}\Big(K_{mm}^{-1} + \kappa^\top A^{-\frac{1}{2}}\kappa\Big) - \eta_2, \qquad (10)$$

with A = diag(α). Details can be found in the appendix. The natural gradient updates always lead to a positive definite covariance matrix⁴, and in our implementation ζ does not have to be parametrized in any way to ensure positive definiteness. The derivative of L w.r.t. α_i is

$$\nabla_{\alpha_i} L = \frac{(1 - y_i\kappa_i\mu)^2 + y_i\big(\kappa_i\zeta\kappa_i^\top + \widetilde{K}_{ii}\big)y_i}{4\alpha_i^{3/2}} - \frac{1}{4\sqrt{\alpha_i}}. \qquad (11)$$

Setting it to zero gives the coordinate ascent update for α_i,

$$\alpha_i = (1 - y_i\kappa_i\mu)^2 + y_i\big(\kappa_i\zeta\kappa_i^\top + \widetilde{K}_{ii}\big)y_i.$$

⁴ This follows directly since K_mm and A^{-1/2} are positive definite.

Details can be found in the appendix. The inducing point locations can either be treated as hyperparameters and optimized during training [24] or be fixed before optimizing the variational objective. We follow the first approach, which is often preferred in a stochastic variational inference setup [7,10]. The inducing point locations can either be chosen randomly as a subset of the training set or via a density estimator. In our experiments we have observed that the k-means clustering algorithm (kMeans) [25] yields the best results. Combining our results, we obtain a fast stochastic variational inference algorithm for the Bayesian nonlinear SVM, which is outlined in alg. 1. We apply the adaptive learning rate method described in [26].

Algorithm 1 Inducing Point SVI

1: set the learning rate schedule ρ_t appropriately
2: initialize η_1, η_2
3: select m inducing point locations (e.g. via kMeans)
4: compute kernel matrices K_mm and K̃ = K_nn − K_nm K_mm^{-1} K_mn
5: while not converged do
6:     get S = minibatch index set of size s
7:     update α_i = (1 − y_i κ_i μ)² + y_i(κ_i ζ κ_i^⊤ + K̃_ii) y_i, for i ∈ S
8:     compute A_S = diag(α_i, i ∈ S)
9:     compute η̂_1 = κ^⊤ Y(α^{−1/2} + 1)
10:    compute η̂_2 = −(1/2)(K_mm^{−1} + κ^⊤ A^{−1/2} κ)
11:    update η_1 = (1 − ρ_t) η_1 + ρ_t η̂_1
12:    update η_2 = (1 − ρ_t) η_2 + ρ_t η̂_2
13: compute ζ = −(1/2) η_2^{−1}
14: compute μ = ζ η_1
15: return α_1, ..., α_n, μ, ζ
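A compact NumPy sketch of alg. 1 is given below. It is an illustrative implementation under a few assumptions that are not fixed by the pseudocode: an RBF kernel, a simple decaying learning rate ρ_t = 1/(1+t) instead of the adaptive schedule of [26], and a rescaling of the minibatch sums by n/s so that the stochastic natural gradients approximate the full-data expressions.

```python
import numpy as np

def rbf(A, B, theta):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / theta)

def svi_bsvm(X, y, Z, theta=1.0, n_iter=500, s=100, seed=0):
    """Inducing point SVI for the Bayesian nonlinear SVM (sketch of alg. 1).

    X: (n, d) data, y: (n,) labels in {-1, +1}, Z: (m, d) inducing locations.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape[0], Z.shape[0]
    Kmm = rbf(Z, Z, theta) + 1e-6 * np.eye(m)
    Knm = rbf(X, Z, theta)
    Kmm_inv = np.linalg.inv(Kmm)
    kappa = Knm @ Kmm_inv                                  # K_nm K_mm^{-1}
    Kt_diag = 1.0 - np.einsum('ij,ij->i', kappa, Knm)      # diag of K_tilde (k(x,x)=1 for RBF)
    eta1, eta2 = np.zeros(m), -0.5 * np.eye(m)             # natural parameters of q(u)
    for t in range(1, n_iter + 1):
        rho = 1.0 / (1.0 + t)                              # learning rate (assumption)
        zeta = -0.5 * np.linalg.inv(eta2)
        mu = zeta @ eta1
        S = rng.choice(n, size=s, replace=False)           # minibatch index set
        kS, yS = kappa[S], y[S]
        alpha = (1.0 - yS * (kS @ mu)) ** 2 \
                + np.einsum('ij,jk,ik->i', kS, zeta, kS) + Kt_diag[S]
        w = n / s                                          # minibatch rescaling (assumption)
        eta1_hat = w * kS.T @ (yS * (alpha ** -0.5 + 1.0))
        eta2_hat = -0.5 * (Kmm_inv + w * kS.T @ (kS * (alpha ** -0.5)[:, None]))
        eta1 = (1.0 - rho) * eta1 + rho * eta1_hat
        eta2 = (1.0 - rho) * eta2 + rho * eta2_hat
    zeta = -0.5 * np.linalg.inv(eta2)
    return zeta @ eta1, zeta, Kmm_inv
```

Since both η̂_2 and the running η_2 stay negative definite, the recovered ζ = −(1/2)η_2^{-1} remains a valid covariance matrix throughout, mirroring the remark after eq. (10).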

4.3 Auto Tuning of Hyperparameters


The probabilistic formulation of the SVM lets us directly learn the hyperparameters during training. To this end we maximize the marginal likelihood p(y|X, h), where h denotes the set of hyperparameters (this approach is called empirical Bayes [27]). We follow an approximate approach and optimize the fitted variational lower bound L(h) over h by alternating between optimization steps w.r.t. the variational parameters and the hyperparameters [28]. We include a gradient ascent step w.r.t. h after multiple variational updates in the SVI scheme; this is commonly known as Type II maximum likelihood (ML-II) [9]:

$$h^{(t)} = h^{(t-1)} + \tilde{\rho}_t\, \nabla_h L\big(\alpha^{(t-1)}, \mu^{(t-1)}, \zeta^{(t-1)}, h\big). \qquad (12)$$

Since the standard SVM does not exhibit a probabilistic formulation, the hyperparameters have to be tuned via computationally very expensive methods such as grid search and cross validation. Our approach allows us to estimate the hyperparameters during training time and lets us follow gradients instead of only evaluating single hyperparameter values. In the appendix we provide the gradient of the variational objective L w.r.t. a general kernel and show how to optimize arbitrary differentiable hyperparameters. Our experiments exemplify our automated hyperparameter tuning approach by optimizing the hyperparameter of an RBF kernel.
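As a rough illustration of the alternating ML-II step (12) for a single scalar hyperparameter such as the RBF length-scale, the following sketch uses a central finite difference as a stand-in for the analytic kernel gradient derived in the appendix; `elbo_fn` is a hypothetical callable evaluating the fitted lower bound L(h) at the current variational parameters.

```python
def ml2_step(h, elbo_fn, lr=1e-2, eps=1e-4):
    # One gradient ascent step on the fitted lower bound L(h), cf. eq. (12).
    # Finite differences stand in for the analytic gradient (assumption).
    grad = (elbo_fn(h + eps) - elbo_fn(h - eps)) / (2.0 * eps)
    return h + lr * grad
```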

4.4 Uncertainty Predictions

Besides the advantage of automated hyperparameter tuning, the probabilistic formulation of the SVM leads directly to uncertainty estimates of the predictions. The standard SVM lacks this capability; only heuristic approaches such as Platt's method [8] exist. Using the approximate posterior q(u|D) = N(u|μ, ζ) obtained by our stochastic variational inference method (alg. 1), we compute the class membership probability for a test point x_*,

$$\begin{aligned}
p(f_*|x_*, \mathcal{D}) &= \int p(f_*|u, x_*)\, p(u|\mathcal{D})\, du \\
&\approx \int p(f_*|u, x_*)\, q(u|\mathcal{D})\, du \\
&= \mathcal{N}\!\Big(f_* \,\Big|\, K_{*m}K_{mm}^{-1}\mu,\; K_{**} - K_{*m}K_{mm}^{-1}\big(K_{m*} - \zeta K_{mm}^{-1}K_{m*}\big)\Big) \\
&=: q(f_*|x_*, \mathcal{D}),
\end{aligned}$$

where K_{*m} denotes the kernel matrix between the test and inducing points and K_{**} the kernel matrix between the test points. This leads to the approximate class membership distribution

$$q(y_*|x_*, \mathcal{D}) = \Phi\!\left(\frac{K_{*m}K_{mm}^{-1}\mu}{K_{**} - K_{*m}K_{mm}^{-1}\big(K_{m*} - \zeta K_{mm}^{-1}K_{m*}\big) + 1}\right), \qquad (13)$$

where Φ(·) is the probit link function. Note that we have already computed the inverse of K_mm for the training procedure, so the computational overhead stems only from simple matrix multiplications. Our experiments show that (13) leads to reasonable uncertainty estimates.
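A short NumPy sketch of eq. (13), reusing the fitted quantities (μ, ζ, K_mm^{-1}) returned by the SVI sketch in section 4.2; `kernel(A, B)` is a hypothetical helper returning the kernel matrix between the rows of A and B.

```python
import numpy as np
from scipy.stats import norm

def predict_proba(X_star, Z, mu, zeta, Kmm_inv, kernel):
    """Approximate class membership probabilities q(y_*|x_*, D), following eq. (13)."""
    Ksm = kernel(X_star, Z)                                # K_*m
    A = Ksm @ Kmm_inv                                      # K_*m K_mm^{-1}
    mean = A @ mu
    k_ss = np.diag(kernel(X_star, X_star))                 # diag of K_**
    var = k_ss - np.einsum('ij,ij->i', A, Ksm) + np.einsum('ij,jk,ik->i', A, zeta, A)
    return norm.cdf(mean / (var + 1.0))                    # probit link, as in eq. (13)
```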

4.5 Special Case of Linear Bayesian SVM

We now consider the special case of using a linear kernel. If we are interested in this case, we may consider the Bayesian model for the linear SVM proposed by Polson et al. [2] (cf. eq. (4)). This can be favorable over using the nonlinear version since this model is formulated in primal space and, therefore, the computational complexity depends on the dimension d and not on the number of data points n. Furthermore, focusing directly on the linear model allows us to optimize the true ELBO, E_q[log p(y, λ, β)] − E_q[log q(λ, β)], without the need to rely on a lower bound (as in eq. (7)). This typically leads to a better approximate posterior. We again follow the structured mean field approach and choose our variational distributions to be in the same families as the full conditionals (4),

$$q(\lambda_i) \equiv \mathcal{GIG}\big(\tfrac{1}{2},\, 1,\, \alpha_i\big) \quad \text{and} \quad q(\beta) \equiv \mathcal{N}(\mu, \zeta).$$
We again use the fact that the coordinate updates of the variational parameters can be obtained by computing the expected natural parameters of the corresponding full conditionals (4) and obtain

$$\begin{aligned}
\alpha_i &= (1 - z_i^\top\mu)^2 + z_i^\top\zeta z_i, \\
\zeta &= \big(Z A^{-\frac{1}{2}} Z^\top + \Sigma^{-1}\big)^{-1}, \qquad (14)\\
\mu &= \zeta Z\big(\alpha^{-\frac{1}{2}} + 1\big),
\end{aligned}$$

where α = (α_i)_{1≤i≤n}, A = diag(α) and Z = Y X. Since the Bayesian linear SVM model exhibits global and local variables, we can directly employ stochastic variational inference by subsampling the data and only updating minibatches of α. Note that for the linear case the covariance matrices have size d × d, i.e. they are independent of the number of data points. Therefore, the SVI algorithm (14) for the Bayesian linear SVM has computational complexity O(d³). Luts et al. [15] develop a batch variational inference scheme for the Bayesian linear SVM, but their method does not scale to big datasets.
The hyperparameters can be tuned analogously to (12). The class membership probabilities are

$$p(y_* = 1|x_*, \mathcal{D}) \approx \int \Phi(f_*)\, p(f_*|\beta, x_*)\, q(\beta|\mathcal{D})\, d\beta\, df_* = \Phi\!\left(\frac{x_*^\top\mu}{x_*^\top\zeta x_* + 1}\right),$$

where x_* is the test point and q(β|D) = N(β|μ, ζ) is the approximate posterior obtained by the SVI scheme described above.
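A minimal NumPy sketch of the linear-case updates (14) and the resulting class probabilities; note that here X is stored as an (n, d) matrix with samples as rows, whereas the text uses the d × n convention, and the batch loop is our illustrative choice rather than the subsampled SVI variant.

```python
import numpy as np
from scipy.stats import norm

def linear_bsvm(X, y, Sigma_inv, n_iter=50):
    """Batch coordinate ascent for the linear Bayesian SVM, eq. (14)."""
    n, d = X.shape
    Z = X * y[:, None]                                     # rows z_i = y_i x_i
    mu, zeta = np.zeros(d), np.eye(d)
    for _ in range(n_iter):
        alpha = (1.0 - Z @ mu) ** 2 + np.einsum('ij,jk,ik->i', Z, zeta, Z)
        zeta = np.linalg.inv(Z.T @ (Z * (alpha ** -0.5)[:, None]) + Sigma_inv)
        mu = zeta @ (Z.T @ (alpha ** -0.5 + 1.0))
    return mu, zeta

def linear_predict_proba(X_star, mu, zeta):
    # p(y_* = 1 | x_*, D) = Phi(x_*^T mu / (x_*^T zeta x_* + 1))
    mean = X_star @ mu
    var = np.einsum('ij,jk,ik->i', X_star, zeta, X_star)
    return norm.cdf(mean / (var + 1.0))
```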

5 Experiments

We compare our approach against the expectation conditional maximization (ECM) method proposed by Henao et al. [3], Gaussian process classification (GPC) [9], its recently proposed scalable stochastic variational inference version (S-GPC) [10], and libSVM with Platt scaling [29,8] (SVM + Platt). For all experiments we use an RBF kernel⁵ with length-scale parameter θ. We perform all experiments using only one CPU core with 2.9 GHz and 386 GB RAM. Code is available at github.com/theogf/BayesianSVM.

5.1 Prediction Performance and Uncertainty Estimation

We experiment on seven real-world datasets and compare the prediction performance, the quality of the uncertainty estimates, and the run time of the methods. The results are presented in table 1. We show that our method (S-BSVM) is up to 22 times faster than the direct competitor ECM and up to 700 times faster than Gaussian process classification⁶, while outperforming the competitors in terms of prediction performance and quality of uncertainty estimates in most cases. The non-probabilistic SVM is naturally the fastest method. Combined with the heuristic Platt scaling approach it yields class membership probabilities, but it still lacks the advantages of a probabilistic model (such as uncertainty quantification of the learned parameters and automatic hyperparameter tuning). To evaluate the quality of the uncertainty estimates we compute the Brier score, which is considered a good performance measure for probabilistic predictions [30] and is defined as BS = (1/n) Σ_{i=1}^n (y_i − q(x_i))², where y_i ∈ {0, 1} is the observed output and q(x_i) ∈ [0, 1] is the predicted class membership probability. Note that a smaller Brier score indicates better performance.
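For reference, the Brier score used here is straightforward to compute; the helper below is a hypothetical two-liner of ours.

```python
import numpy as np

def brier_score(y01, q):
    # y01: observed outputs in {0, 1}; q: predicted class membership probabilities in [0, 1]
    return np.mean((np.asarray(y01) - np.asarray(q)) ** 2)
```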
The datasets are all from the Rätsch benchmark datasets [31] commonly
used to test the accuracy of binary nonlinear classifiers. We perform a 10-fold
cross-validation and use an RBF kernel with fixed parameters for all methods.
For S-BSVM we choose the number of inducing points as 20% of the training set
size, except for the datasets Splice, German and Waveform where we use 100
inducing points. For each dataset minibatches of 10 samples are used.

5.2 Big Data Experiments

We demonstrate the scalability of our method on the SUSY dataset [11], containing 5 million points with 17 features. This dataset size is very common in particle physics due to the simplicity of artificially generating new events as well as the quantity of data coming from particle detectors. Since it is important to have a sense of the confidence of the predictions for such datasets, the Bayesian SVM is an appropriate choice. We use an RBF kernel⁷, 64 inducing points and minibatches of 100 points. The training of our model takes only 10 minutes without any parallelization. We use the area under the receiver operating characteristic (ROC) curve (AUC) as performance measure since it is a standard evaluation measure on this dataset [11].

⁵ The RBF kernel is defined as k(x_1, x_2, θ) = exp(−||x_1 − x_2||²/θ), where θ is the length-scale parameter.
⁶ For a comparison with the stochastic variational inference version of GPC, see section 5.3.
⁷ The length-scale parameter tuning is not included in the training time. We found θ = 5.0 by our proposed automatic tuning approach.

Dataset        n     dim.  Metric        S-BSVM      ECM         GPC         SVM + Platt
Breast Cancer  263   9     Error         .26 ± .07   .27 ± .10   .27 ± .07   .27 ± .09
                           Brier Score   .18 ± .03   .19 ± .05   .18 ± .03   .19 ± .04
                           Time [s]      0.32        1.4         6.7         0.04
Diabetes       768   8     Error         .22 ± .06   .25 ± .07   .23 ± .07   .24 ± .07
                           Brier Score   .16 ± .04   .17 ± .04   .15 ± .04   .16 ± .04
                           Time [s]      3.9         33          67          0.11
Flare          144   9     Error         .36 ± .12   .36 ± .12   .36 ± .11   .36 ± .12
                           Brier Score   .22 ± .05   .25 ± .07   .24 ± .03   .24 ± .04
                           Time [s]      0.08        0.26        1.8         0.01
German         1000  20    Error         .24 ± .11   .25 ± .12   .25 ± .13   .27 ± .10
                           Brier Score   .17 ± .06   .17 ± .05   .17 ± .06   .18 ± .05
                           Time [s]      12          80          115         0.15
Heart          270   13    Error         .16 ± .06   .19 ± .09   .16 ± .06   .17 ± .07
                           Brier Score   .13 ± .04   .14 ± .04   .12 ± .03   .12 ± .04
                           Time [s]      0.34        2.2         6           0.04
Splice         2991  60    Error         .13 ± .03   .11 ± .03   .32 ± .14   .14 ± .01
                           Brier Score   .17 ± .01   .18 ± .01   .40 ± .14   .11 ± .01
                           Time [s]      18          406         419         1.3
Waveform       5000  21    Error         .09 ± .02   .10 ± .02   .10 ± .02   .10 ± .02
                           Brier Score   .06 ± .01   .15 ± .01   .06 ± .01   .06 ± .01
                           Time [s]      12.5        264         8691        2.3
Table 1. Average prediction error and Brier score with one standard deviation.

Our method achieves an AUC of 0.84 and a Brier score of 0.22, while the state of the art obtains an AUC of 0.88 using a deep neural network (5 layers, 300 hidden units each) [11]. Note that this approach takes much longer to train and does not include uncertainty estimates.

5.3 Run Time

We examine the run time of our methods and the competitors. We include both
the batch variational inference method (B-BSVM) described in section 4.1 and
our fast and scalable inference method (S-BSVM) described in section 4.2 in the
experiments. For each method we iteratively evaluate the prediction performance
on a held-out dataset given a certain training time budget. The prediction error as a function of the training time is shown in fig. 1. We experiment on the Waveform dataset from the Rätsch benchmark datasets (N = 5000, d = 21). We use an
RBF kernel with fixed length-scale parameter θ = 5.0 and for the stochastic
variational inference methods, S-BSVM and S-GPC, we use a batch size of 10
and 100 inducing points.
Our scalable method (S-BSVM) is around 10 times faster than the direct competitor ECM while having slightly better prediction performance. The batch variational inference version (B-BSVM) is the slowest of the Bayesian SVM inference methods. The related probabilistic model, Gaussian process classification, is around 5000 times slower than S-BSVM. Its stochastic inducing point version (S-GPC) has comparable run time to S-BSVM but is very unstable, leading to bad prediction performance. S-GPC showed these instabilities for multiple settings of the hyperparameters. The classic SVM (libSVM) has a similar run time as our method. The speed and prediction performance of S-BSVM depend on the number of inducing points. See section 5.5 for an empirical study. Note that the run time in table 1 is determined after the methods have converged.

Fig. 1. Prediction error on held-out dataset vs. training time.

5.4 Auto Tuning of Hyperparameters

In section 4.3 we show that our inference method possesses the ability of automatic hyperparameter tuning. In this experiment we demonstrate that our method indeed finds the optimal length-scale hyperparameter of the RBF kernel. We use the optimization scheme (12) and alternate between 10 variational parameter updates and one hyperparameter update. We compute the true validation loss for the length-scale parameter θ by a grid search approach, which consists of training our model (S-BSVM) for each θ and measuring the prediction performance using 10-fold cross validation. In fig. 2 we plot the validation loss and the length-scale parameter found by our method. We find the true optimum using only 5 hyperparameter optimization steps. Training and hyperparameter optimization take only 0.3 seconds for our method, whereas grid search takes 188 seconds (with a grid size of 1000 points).

Fig. 2. Average validation loss as function of the RBF kernel length-scale parameter
θ, computed by grid search and 10-fold cross validation. The red circle represents the
hyperparameter found by our proposed automatic tuning approach.

5.5 Inducing Points Selection


The sparse GP model used in our inference scheme builds on a set of inducing
points where both the number and the locations of the inducing points are free
parameters. We investigate three different inducing point selection methods:
random subset selection from the training set, the Gaussian Mixture Model
(GMM), and the k-means clustering algorithm with an improved k-means++
seeding (kMeans) [32]. Furthermore we show how the number of inducing points
affects the prediction accuracy and the run time. We test the three inducing
point selection methods on the USPS dataset [33] which we reduced to a binary
problem using only the digits 3 and 5 (N=1350 and d=256). For all methods we
progressively increase the number of inducing points and compute the prediction
error by 10-fold cross validation. We present our results in fig. 3.
The GMM is unable to fit large numbers of samples and dimensions and fails to converge for almost all datasets tried; therefore, we do not include it in the plot. Using the k-means selection algorithm leads, for small numbers of inducing points, to much better prediction performance than random subset selection. Furthermore, we show that using only a small fraction of inducing points (around 1% of the original dataset) leads to nearly optimal prediction performance while simultaneously significantly decreasing the run time. We observe similar results on all datasets we considered.

Fig. 3. Average prediction error and training time as functions of the number of inducing points selected by two different methods, with one standard deviation (using 10-fold cross validation).

6 Conclusion

We presented a fast, scalable and reliable approximate inference method for the Bayesian nonlinear SVM. While previous methods were restricted to rather small datasets, our method enables the application of the Bayesian nonlinear SVM to large real-world datasets containing millions of samples. Our experiments showed that our method is orders of magnitude faster than the state of the art while still yielding comparable prediction accuracies. We showed how to automatically tune the hyperparameters and obtain prediction uncertainties, which is important in many real-world scenarios.

In future work we plan to further extend the Bayesian nonlinear SVM model to deal with missing data and account for correlations between data points, building on ideas from [34]. Furthermore, we want to develop Bayesian formulations of important variants of the SVM, for instance one-class SVMs [35].

Acknowledgments. We thank Stephan Mandt, Manfred Opper and Patrick Jähnichen for fruitful discussions. This work was partly funded by the German Research Foundation (DFG) award KL 2698/2-1.

References
1. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning (1995)
2. Polson, N.G., Scott, S.L.: Data augmentation for support vector machines. Bayesian
Anal. (2011)
3. Henao, R., Yuan, X., Carin, L.: Bayesian Nonlinear Support Vector Machines and
Discriminative Factor Modeling. NIPS (2014)
4. Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D.: Do we need hundreds
of classifiers to solve real world classification problems? JMLR (2014)
5. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of machine learning.
MIT press (2012)
6. Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic Variational Inference.
JMLR (2013)
7. Hensman, J., Fusi, N., Lawrence, N.D.: Gaussian processes for big data. In: Conference on Uncertainty in Artificial Intelligence (2013)

8. Platt, J.C.: Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers (1999)
9. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning
(Adaptive Computation and Machine Learning). The MIT Press (2005)
10. Hensman, J., Matthews, A.: Scalable Variational Gaussian Process Classification.
AISTATS (2015)
11. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy
physics with deep learning. Nature communications (2014) 4308
12. Zhu, J., Chen, N., Perkins, H., Zhang, B.: Gibbs Max-margin Topic Models with
Data Augmentation. JMLR (2014)
13. Xu, M., Zhu, J., Zhang, B.: Fast Max-Margin Matrix Factorization with Data
Augmentation. ICML (2013) 978–986
14. Zhang, Jun, Zhang: Max-Margin Infinite Hidden Markov Models. ICML (2014)
15. Luts, J., Ormerod, J.T.: Mean field variational bayesian inference for support vector
machine classification. Comput. Stat. Data Anal. (May 2014) 163–176
16. Snelson, E., Ghahramani, Z.: Sparse GPs using Pseudo-inputs. NIPS (2006)
17. Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A.: lp-norm multiple kernel learning.
JMLR (Mar) (2011) 953–997
18. Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., Saul, L.K.: An Introduction to
Variational Methods for Graphical Models. Mach. Learn. (1999)
19. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and
variational inference. Found. Trends Mach. Learn. (1-2) (January 2008) 1–305
20. Jørgensen, B.: Statistical properties of the generalized inverse Gaussian distribution.
Springer Science & Business Media (2012)
21. Amari, S., Nagaoka, H.: Methods of Information Geometry. Am. Math. Soc. (2007)
22. Martens, J.: New insights and perspectives on the natural gradient method. Arxiv
Preprint (2017)
23. Amari, S.: Natural gradient works efficiently in learning. Neural Computation (1998)
24. Titsias, M.K.: Variational learning of inducing variables in sparse gaussian processes.
In: In Artificial Intelligence and Statistics 12. (2009) 567–574
25. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press (2012)
26. Ranganath, R., Wang, C., Blei, D.M., Xing, E.P.: An Adaptive Learning Rate for
Stochastic Variational Inference. ICML (2013)
27. Maritz, J., Lwin, T.: Empirical Bayes Methods with Applications. Monographs on
Statistics and Applied Probability. (1989)
28. Mandt, S., Hoffman, M., Blei, D.: A Variational Analysis of Stochastic Gradient
Algorithms. ICML (2016)
29. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology (2011) 27:1–27:27
30. Brier, G.W.: Verification of forecasts expressed in terms of probability. Monthly
weather review (1) (1950) 1–3
31. Diethe, T.: 13 benchmark datasets derived from the UCI, DELVE and STATLOG
repositories (2015)
32. Bachem, O., Lucic, M., Hassani, H., Krause, A.: Fast and Provably Good Seedings
for k-Means. NIPS (2016)
33. Lichman, M.: UCI machine learning repository (2013)
34. Mandt, S., Wenzel, F., Nakajima, S., Cunningham, J.P., Lippert, C., Kloft, M.:
Sparse Probit Linear Mixed Model. Machine Learning Journal (2017)
35. Perdisci, R., Gu, G., Lee, W.: Using an Ensemble of One-Class SVM Classifiers to Harden Payload-based Anomaly Detection Systems. Data Mining (2006)

A Appendix

A.1 Derivation of the Variational Objective

In the following we give the details of the derivation of the variational objective (8) for the inducing point model in section 4.2. The variational objective as defined in (7) is

$$\begin{aligned}
L &= \mathbb{E}_q[L_1] + \mathbb{E}_q[\log p(u)] - \mathbb{E}_q[\log q(\lambda, u)] \\
&= -\frac{1}{2}\sum_{i=1}^n \mathbb{E}_q\left[\log\lambda_i + \frac{1}{\lambda_i}\Big(\widetilde{K}_{ii} + \big(1 + \lambda_i - y_i K_{im}K_{mm}^{-1}u\big)^2\Big)\right] - \mathrm{KL}\big(q(u)\,\|\,p(u)\big) - \mathbb{E}_{q(\lambda)}[\log q(\lambda)].
\end{aligned}$$

Using the abbreviation κ_i = K_im K_mm^{-1}, the identity E_{q(λ_i)}[λ_i^{-1}] = α_i^{-1/2}, and E_{q(u)}[(κ_i u)²] = (κ_iμ)² + κ_iζκ_i^⊤, the first expectation term simplifies to

$$\begin{aligned}
&\mathbb{E}_q\left[\log\lambda_i + \frac{1}{\lambda_i}\Big(\widetilde{K}_{ii} + \big(1 + \lambda_i - y_i\kappa_i u\big)^2\Big)\right] \\
&= \mathbb{E}_q[\log\lambda_i] + \mathbb{E}_q\Big[\lambda_i^{-1}\Big(\widetilde{K}_{ii} + 1 + \lambda_i^2 + \underbrace{y_i^2}_{=1}(\kappa_i u)^2 + 2\lambda_i - 2y_i\kappa_i u - 2\lambda_i y_i\kappa_i u\Big)\Big] \\
&\stackrel{c}{=} \mathbb{E}_{q(\lambda_i)}[\log\lambda_i] + \frac{1}{\sqrt{\alpha_i}}\Big(\widetilde{K}_{ii} + 1 + (\kappa_i\mu)^2 + \kappa_i\zeta\kappa_i^\top - 2y_i\kappa_i\mu\Big) + \mathbb{E}_{q(\lambda_i)}[\lambda_i] - 2y_i\kappa_i\mu \\
&= \mathbb{E}_{q(\lambda_i)}[\log\lambda_i] + \frac{1}{\sqrt{\alpha_i}}\Big(\widetilde{K}_{ii} + (1 - y_i\kappa_i\mu)^2 + \kappa_i\zeta\kappa_i^\top\Big) + \mathbb{E}_{q(\lambda_i)}[\lambda_i] - 2y_i\kappa_i\mu.
\end{aligned}$$

The entropy of q(λ_i) is

$$\begin{aligned}
\mathbb{E}_{q(\lambda_i)}[\log q(\lambda_i)] &= \mathbb{E}_{q(\lambda_i)}\left[-\frac{1}{4}\log(\alpha_i) - \frac{1}{2}\log(\lambda_i) - \log(2) - \log\big(B_{\frac{1}{2}}(\sqrt{\alpha_i})\big) - \frac{1}{2}\Big(\lambda_i + \frac{\alpha_i}{\lambda_i}\Big)\right] \\
&\stackrel{c}{=} -\frac{1}{4}\log(\alpha_i) - \frac{1}{2}\mathbb{E}_{\alpha_i}[\log(\lambda_i)] - \log\big(B_{\frac{1}{2}}(\sqrt{\alpha_i})\big) - \frac{1}{2}\mathbb{E}_{\alpha_i}[\lambda_i] - \frac{\alpha_i}{2}\mathbb{E}_{\alpha_i}\!\left[\frac{1}{\lambda_i}\right] \\
&= -\frac{1}{4}\log(\alpha_i) - \frac{1}{2}\mathbb{E}_{\alpha_i}[\log(\lambda_i)] - \log\big(B_{\frac{1}{2}}(\sqrt{\alpha_i})\big) - \frac{1}{2}\mathbb{E}_{\alpha_i}[\lambda_i] - \frac{\sqrt{\alpha_i}}{2},
\end{aligned}$$

where B_{1/2}(·) is the modified Bessel function with parameter 1/2 [20].
By summing the terms, the remaining expectations cancel out and we obtain

$$\begin{aligned}
L &\stackrel{c}{=} \sum_{i=1}^n \Big\{ -\frac{1}{2}\mathbb{E}_{q(\lambda_i)}[\log\lambda_i] - \frac{1}{2\sqrt{\alpha_i}}\Big(\widetilde{K}_{ii} + (1 - y_i\kappa_i\mu)^2 + \kappa_i\zeta\kappa_i^\top\Big) - \frac{1}{2}\mathbb{E}_{q(\lambda_i)}[\lambda_i] + y_i\kappa_i\mu \\
&\qquad\;\; + \frac{1}{4}\log(\alpha_i) + \frac{1}{2}\mathbb{E}_{q(\lambda_i)}[\log(\lambda_i)] + \log\big(B_{\frac{1}{2}}(\sqrt{\alpha_i})\big) + \frac{1}{2}\mathbb{E}_{q(\lambda_i)}[\lambda_i] + \frac{\sqrt{\alpha_i}}{2}\Big\} - \mathrm{KL}\big(q(u)\|p(u)\big) \\
&= \sum_{i=1}^n \Big\{ -\frac{1}{2\sqrt{\alpha_i}}\Big(\widetilde{K}_{ii} + (1 - y_i\kappa_i\mu)^2 + \kappa_i\zeta\kappa_i^\top - \alpha_i\Big) + y_i\kappa_i\mu + \frac{1}{4}\log(\alpha_i) + \log\big(B_{\frac{1}{2}}(\sqrt{\alpha_i})\big)\Big\} - \mathrm{KL}\big(q(u)\|p(u)\big) \\
&\stackrel{c}{=} \sum_{i=1}^n \Big\{ -\frac{1}{2\sqrt{\alpha_i}}\Big(\widetilde{K}_{ii} + (1 - y_i\kappa_i\mu)^2 + \kappa_i\zeta\kappa_i^\top - \alpha_i\Big) + y_i\kappa_i\mu + \frac{1}{4}\log(\alpha_i) + \log\big(B_{\frac{1}{2}}(\sqrt{\alpha_i})\big)\Big\} \\
&\qquad + \frac{1}{2}\log|\zeta| - \frac{1}{2}\mathrm{tr}\big(K_{mm}^{-1}\zeta\big) - \frac{1}{2}\mu^\top K_{mm}^{-1}\mu \\
&= \frac{1}{2}\log|\zeta| - \frac{1}{2}\mathrm{tr}\big(K_{mm}^{-1}\zeta\big) - \frac{1}{2}\mu^\top K_{mm}^{-1}\mu + y^\top\kappa\mu \\
&\quad + \sum_{i=1}^n \left(\log\big(B_{\frac{1}{2}}(\sqrt{\alpha_i})\big) + \frac{1}{4}\log(\alpha_i) - \frac{1}{2}\alpha_i^{-\frac{1}{2}}\Big(1 - \alpha_i - 2y_i\kappa_{i\cdot}\mu + \big[\kappa(\mu\mu^\top + \zeta)\kappa^\top + \widetilde{K}\big]_{ii}\Big)\right).
\end{aligned}$$

A.2 Euclidean and Natural Gradients of the Variational Objective


First, we compute the standard Euclidean gradients of L. The derivatives w.r.t. the mean and the covariance matrix are (using y_i² = 1)

$$\begin{aligned}
\frac{dL}{d\zeta} &= \frac{1}{2}\left(\frac{d}{d\zeta}\log|\zeta| - \frac{d}{d\zeta}\mathrm{tr}\big(K_{mm}^{-1}\zeta\big)\right) - \sum_{i=1}^n \frac{1}{2\sqrt{\alpha_i}}\,\frac{d}{d\zeta}\, y_i\kappa_i\zeta\kappa_i^\top y_i \\
&= \frac{1}{2}\zeta^{-1} - \frac{1}{2}K_{mm}^{-1} - \frac{1}{2}\kappa^\top A^{-\frac{1}{2}}\kappa \\
&= \frac{1}{2}\Big(\zeta^{-1} - K_{mm}^{-1} - \kappa^\top A^{-\frac{1}{2}}\kappa\Big) \\
&=: L'_\zeta,
\end{aligned}$$

with A = diag(α), and

$$\begin{aligned}
\frac{dL}{d\mu} &= -\frac{1}{2}\frac{d}{d\mu}\,\mu^\top K_{mm}^{-1}\mu + \sum_{i=1}^n \left(\frac{d}{d\mu}\, y_i\kappa_i\mu + \frac{1}{2\sqrt{\alpha_i}}\,\frac{d}{d\mu}\,(1 - y_i\kappa_i\mu)^2\right) \\
&= -K_{mm}^{-1}\mu + \sum_{i=1}^n \left(y_i\kappa_i^\top + \frac{1}{\sqrt{\alpha_i}}\Big(y_i\kappa_i^\top - y_i^2\kappa_i^\top\kappa_i\mu\Big)\right) \\
&= -K_{mm}^{-1}\mu + \kappa^\top y + \kappa^\top Y\alpha^{-\frac{1}{2}} - \kappa^\top A^{-\frac{1}{2}}\kappa\mu \\
&= -\Big(K_{mm}^{-1} + \kappa^\top A^{-\frac{1}{2}}\kappa\Big)\mu + \kappa^\top Y\big(\alpha^{-\frac{1}{2}} + 1\big) \\
&=: L'_\mu.
\end{aligned}$$
The derivative w.r.t. the parameter α_i of the generalized inverse Gaussian distribution is

$$\begin{aligned}
\frac{dL}{d\alpha_i} &= \frac{1}{4}\frac{d}{d\alpha_i}\log(\alpha_i) + \frac{d}{d\alpha_i}\log\big(B_{\frac{1}{2}}(\sqrt{\alpha_i})\big) + \frac{1}{2}\frac{d}{d\alpha_i}\sqrt{\alpha_i} - \frac{(1 - y_i\kappa_i\mu)^2 + y_i\big(\kappa_i\zeta\kappa_i^\top + \widetilde{K}_{ii}\big)y_i}{2}\,\frac{d}{d\alpha_i}\frac{1}{\sqrt{\alpha_i}} \\
&= -\left(\frac{1}{4\alpha_i} + \frac{1}{2\sqrt{\alpha_i}}\right) + \frac{1}{4\alpha_i} + \frac{1}{4\sqrt{\alpha_i}} + \frac{(1 - y_i\kappa_i\mu)^2 + y_i\big(\kappa_i\zeta\kappa_i^\top + \widetilde{K}_{ii}\big)y_i}{4\alpha_i^{3/2}} \\
&= \frac{(1 - y_i\kappa_i\mu)^2 + y_i\big(\kappa_i\zeta\kappa_i^\top + \widetilde{K}_{ii}\big)y_i}{4\alpha_i^{3/2}} - \frac{1}{4\sqrt{\alpha_i}}.
\end{aligned}$$
The natural gradient can be computed by pre-multiplying the Euclidean gradient with the inverse Fisher information matrix [21]. Applied to a Gaussian distribution this leads to the following expression for the natural gradient w.r.t. the natural parameters [21],

$$\widetilde{\nabla}_{(\eta_1,\eta_2)} L(\eta) = \Big(L'_\mu(\eta) - 2L'_\zeta(\eta)\mu,\;\; L'_\zeta(\eta)\Big).$$

Using the identities η_1 = ζ^{-1}μ and η_2 = −(1/2)ζ^{-1} we obtain

$$\begin{aligned}
L'_\mu(\eta) &= \frac{1}{2}\Big(K_{mm}^{-1} + \kappa^\top A^{-\frac{1}{2}}\kappa\Big)\eta_2^{-1}\eta_1 + \kappa^\top Y\big(\alpha^{-\frac{1}{2}} + 1\big), \\
L'_\zeta(\eta) &= \frac{1}{2}\Big(-2\eta_2 - K_{mm}^{-1} - \kappa^\top A^{-\frac{1}{2}}\kappa\Big) = -\frac{1}{2}\Big(K_{mm}^{-1} + \kappa^\top A^{-\frac{1}{2}}\kappa\Big) - \eta_2.
\end{aligned}$$
Finally, this leads to the natural gradients with respect to the natural parameters,

$$\begin{aligned}
\widetilde{\nabla}_{\eta_1} L &= L'_\mu - 2L'_\zeta\mu \\
&= \frac{1}{2}\Big(K_{mm}^{-1} + \kappa^\top A^{-\frac{1}{2}}\kappa\Big)\eta_2^{-1}\eta_1 + \kappa^\top Y\big(\alpha^{-\frac{1}{2}} + 1\big) + \frac{1}{2}\Big(-2\eta_2 - K_{mm}^{-1} - \kappa^\top A^{-\frac{1}{2}}\kappa\Big)\eta_2^{-1}\eta_1 \\
&= \kappa^\top Y\big(\alpha^{-\frac{1}{2}} + 1\big) - \eta_1,
\end{aligned}$$

and

$$\widetilde{\nabla}_{\eta_2} L = L'_\zeta = -\frac{1}{2}\Big(K_{mm}^{-1} + \kappa^\top A^{-\frac{1}{2}}\kappa\Big) - \eta_2.$$

A.3 Optimization of the Kernel Hyperparameters


We consider a general multiple kernel approach. Let k(x, x') = Σ_j γ_j k_j(x, x', θ_j) be the kernel function, where θ_j denotes the hyperparameters of the kernel function k_j (e.g. the length-scale parameter of an RBF kernel) and γ_j the corresponding kernel weight. Let ω = {θ_j, γ_j}_{j=1,...,J} be the collection of all hyperparameters. The derivative of the variational objective L w.r.t. the hyperparameters is

$$\frac{dL}{d\omega} = -\frac{1}{2}\frac{d}{d\omega}\left(\log|K_{mm}| + \mathrm{tr}\big(K_{mm}^{-1}\zeta\big) + \mu^\top K_{mm}^{-1}\mu - 2\big(1 + \alpha^{-\frac{1}{2}}\big)^\top Y\kappa\mu + \alpha^{-\frac{1}{2}\top}\mathrm{diag}\Big(\kappa\big(\mu\mu^\top + \zeta\big)\kappa^\top + \widetilde{K}\Big)\right).$$

Using the abbreviations J^ω_{**} = dK_{**}/dω and ι^ω = dκ/dω = (J^ω_{nm} − κJ^ω_{mm})K_{mm}^{-1}, we obtain

$$\begin{aligned}
\frac{dL}{d\omega} = -\frac{1}{2}\bigg[&\,\mathrm{Tr}\Big(K_{mm}^{-1}J^\omega_{mm}\big(I - K_{mm}^{-1}\zeta\big)\Big) - \mu^\top K_{mm}^{-1}J^\omega_{mm}K_{mm}^{-1}\mu - 2\big(1 + \alpha^{-\frac{1}{2}}\big)^\top Y\iota^\omega\mu \\
&+ \alpha^{-\frac{1}{2}\top}\mathrm{diag}\Big(\kappa\big((\mu\mu^\top + \zeta)\iota^{\omega\top} - J^\omega_{mn}\big) + \iota^\omega\big((\mu\mu^\top + \zeta)\kappa^\top - K_{mn}\big) + J^\omega_{nn}\Big)\bigg].
\end{aligned}$$

To compute the gradient w.r.t. specific hyperparameters we only have to plug the derivatives of the kernel function dK_{**}/dω into the above formula.
