
Quantum Boltzmann Machine

Mohammad H. Amin,1,2 Evgeny Andriyash,1 Jason Rolfe,1 Bohdan Kulchytskyy,3 and Roger Melko3,4

1 D-Wave Systems Inc., 3033 Beta Avenue, Burnaby, BC, Canada V5G 4M9
2 Department of Physics, Simon Fraser University, Burnaby, BC, Canada V5A 1S6
3 Department of Physics and Astronomy, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1
4 Perimeter Institute for Theoretical Physics, Waterloo, Ontario, Canada N2L 2Y5

arXiv:1601.02036v1 [quant-ph] 8 Jan 2016

Inspired by the success of Boltzmann machines based on the classical Boltzmann distribution, we propose a new machine learning approach based on the quantum Boltzmann distribution of a transverse-field Ising Hamiltonian. Due to the non-commutative nature of quantum mechanics, the training process of the Quantum Boltzmann Machine (QBM) can become nontrivial. We circumvent the problem by introducing bounds on the quantum probabilities. This allows us to train the QBM efficiently by sampling. We show examples of QBM training with and without the bound, using exact diagonalization, and compare the results with classical Boltzmann training. We also discuss the possibility of using quantum annealing processors like D-Wave for QBM training and application.

I. INTRODUCTION

Machine learning is a rapidly growing field in computer science with applications in computer vision, voice recognition, medical diagnosis, spam filtering, search engines, etc. [1]. Machine learning algorithms operate by constructing a model with parameters that can be determined (learned) from a large amount of example inputs, called the training set. The trained model can then make predictions about unseen data. The ability to do so is called generalization. This could be, for example, detecting an object, like a cat, in an image, or recognizing a command from a voice input. One approach to machine learning is probabilistic modeling, in which the probability distribution of the data (P_v^data for a given state v) is approximated based upon a finite set of samples. If the process of training is successful, the learned distribution P_v has enough resemblance to the actual distribution of the data, P_v^data, that it can make correct predictions about unseen situations. Depending upon the details of the distributions and the approximation technique, machine learning can be used to perform classification, clustering, collaborative filtering, compression, denoising, inpainting, or a variety of other algorithmic tasks [2].

The possibility of using quantum computation for machine learning has been considered theoretically for both gate-model [3–5] and quantum annealing [6–13] schemes. With the development of quantum annealing processors [14], it has become possible to test machine learning ideas with actual quantum hardware [15, 16]. In all of the above works, however, the quantum processor is considered as a means to provide fast solutions to an otherwise classical problem. In other words, the model stays classical and quantum mechanics is only used to facilitate the training. In this work, we propose a quantum probabilistic model for machine learning based on the Boltzmann distribution of a quantum Hamiltonian, therefore a Quantum Boltzmann Machine (QBM). As we shall see, in our approach the quantum nature of the processor is exploited both in the model and in the training process.

The Boltzmann machine (BM) is a classic machine learning technique, and serves as the basis of powerful deep learning models such as deep belief networks and deep Boltzmann machines [17, 18, 20]. It comprises a probabilistic network of binary units with a quadratic energy function. In principle, one could consider more general energy functions to bring in more flexibility [19, 21, 22], but training can become impractical and generalization suffers as the number of parameters grows. A BM commonly consists of visible and hidden binary units, which we jointly denote by z_a, a = 1, ..., N, where N is the total number of units. To maintain consistency with the standard notation in quantum mechanics, we use z_a ∈ {−1, +1}, rather than z_a ∈ {0, 1}; the corresponding probability distributions are identical up to a linear transformation of their parameters. To distinguish the visible and hidden variables, we use the notation z_a = (z_ν, z_i), with index ν for visible variables and i for hidden ones. We also use vector notations v, h, and z = (v, h) to represent states of visible, hidden, and combined units, respectively. In physics language, the quadratic energy function over binary units z_a is referred to as the Ising model with the energy function

    E_z = − Σ_a b_a z_a − Σ_{a,b} w_{ab} z_a z_b.    (1)

The dimensionless parameters b_a and w_{ab} are tuned during the training [34]. In equilibrium, the probability of observing a state v of the visible variables is given by the Boltzmann distribution summed over the hidden variables,

    P_v = Z^{−1} Σ_h e^{−E_z},    Z = Σ_z e^{−E_z},    (2)

called the marginal distribution. Our goal is to determine the Hamiltonian parameters, θ ∈ {b_a, w_{ab}}, such that P_v becomes as close as possible to P_v^data defined by the training set. To achieve this, we need to maximize the average log-likelihood, or equivalently minimize the average negative log-likelihood defined by

    L = − Σ_v P_v^data log P_v,    (3)

which for the probability distribution (2) is

    L = − Σ_v P_v^data log [ Σ_h e^{−E_z} / Σ_{z'} e^{−E_{z'}} ].    (4)

The minimization can be done using the gradient descent technique. In each iteration, the parameter θ is changed by a small step in the direction opposite to the gradient:

    δθ = −η ∂_θ L,    (5)

where the learning rate, η, controls the step sizes. An important requirement for the applicability of the gradient descent technique is the ability to calculate the gradients ∂_θ L efficiently. Using (4), we have

    ∂_θ L = Σ_v P_v^data [ Σ_h ∂_θ E_z e^{−E_z} / Σ_h e^{−E_z} − Σ_z ∂_θ E_z e^{−E_z} / Σ_z e^{−E_z} ]
          = ⟨∂_θ E_z⟩_v̄ − ⟨∂_θ E_z⟩,    (6)

where ⟨...⟩ and ⟨...⟩_v are Boltzmann averages with free and fixed visible variables, respectively, and ⟨...⟩_v̄ ≡ Σ_v P_v^data ⟨...⟩_v denotes double averaging. Fixing visible variables to the data is usually called clamping. Using (1) for E_z, we obtain

    δb_a = η ( ⟨z_a⟩_v̄ − ⟨z_a⟩ ),    (7)
    δw_{ab} = η ( ⟨z_a z_b⟩_v̄ − ⟨z_a z_b⟩ ).    (8)

The gradient steps are expressed in terms of differences between the clamped (i.e., fixed v) and unclamped averages. These two terms are sometimes called the positive and negative phases. Since the averages can be estimated using sampling, the process of gradient estimation can be done efficiently, provided that we have an efficient way of performing sampling.
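As an illustration of the update rules (7)-(8), the following minimal sketch performs one gradient step for a small, fully visible BM (so the clamped averages reduce to data averages), evaluating the model averages by brute-force enumeration; a practical implementation would replace the enumeration with Gibbs sampling or contrastive divergence. The function names and the toy data are ours, not part of the paper.

```python
import itertools
import numpy as np

def energy(z, b, w):
    # Ising energy of Eq. (1); w is symmetric with zero diagonal, and the
    # factor 1/2 avoids double-counting each pair (a, b).
    return -b @ z - 0.5 * z @ w @ z

def model_moments(b, w):
    # Exact unclamped averages <z_a> and <z_a z_b> by enumerating all 2^N states.
    N = len(b)
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
    E = np.array([energy(z, b, w) for z in states])
    p = np.exp(-(E - E.min()))
    p /= p.sum()
    return p @ states, np.einsum('s,sa,sb->ab', p, states, states)

def bm_step(b, w, data, eta=0.1):
    # One gradient step of Eqs. (7)-(8): clamped (data) minus unclamped (model) averages.
    m1_model, m2_model = model_moments(b, w)
    m1_data = data.mean(axis=0)
    m2_data = data.T @ data / len(data)
    b_new = b + eta * (m1_data - m1_model)
    w_new = w + eta * (m2_data - m2_model)
    np.fill_diagonal(w_new, 0.0)
    return b_new, w_new

# Toy usage: 4 visible units and random +/-1 "data".
rng = np.random.default_rng(0)
data = rng.choice([-1.0, 1.0], size=(20, 4))
b, w = np.zeros(4), np.zeros((4, 4))
b, w = bm_step(b, w, data)
```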
II. QUANTUM BOLTZMANN MACHINE

We now replace the classical spins or bits in (1) with quantum bits (qubits). The mathematics of quantum mechanics is based on matrices (operators) with dimensionality equal to the number of possible states (2^N). This is in contrast to vectors with dimensionality equal to the number of variables (N) used in common machine learning techniques. For instance, instead of the energy function (1), one considers a 2^N × 2^N diagonal matrix, called the Hamiltonian:

    H = − Σ_a b_a σ_a^z − Σ_{a,b} w_{ab} σ_a^z σ_b^z.    (9)

This Hamiltonian is constructed in such a way that its diagonal elements are the energy values (1) corresponding to all 2^N binary states z, ordered lexicographically. To generate such a Hamiltonian, we replace z_a in (1) with the 2^N × 2^N matrix

    σ_a^z ≡ I ⊗ ... ⊗ I ⊗ σ_z ⊗ I ⊗ ... ⊗ I,    (10)

with a−1 identity factors before σ_z and N−a after it, where ⊗ means tensor product (sometimes called Kronecker or outer product) and

    I = [[1, 0], [0, 1]],    σ_z = [[1, 0], [0, −1]].    (11)

Every element in (10) is an identity matrix (I) except the a-th element, which is a Pauli matrix (σ_z). Equation (1) will therefore be replaced by the diagonal Hamiltonian (9), where b_a and w_{ab} are still scalars. Fig. 1a shows an example of such a model with visible and hidden qubits depicted as blue and red circles, respectively. We represent eigenstates of this Hamiltonian by |v, h⟩, where again v and h denote visible and hidden variables, respectively.

FIG. 1: (a) An example of a quantum Boltzmann machine with visible (blue) and hidden (red) qubits. (b) A restricted quantum Boltzmann machine with no lateral connection between the hidden variables. (c) Discriminative learning with QBM. The (green) squares represent classical input x, which are not necessarily binary numbers. The input applies energy biases to the hidden and output qubits according to the coupling coefficients represented by solid lines.
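The construction (10)-(11) is a straightforward Kronecker product; the sketch below builds σ_a^z and the diagonal Hamiltonian (9) with NumPy for a handful of qubits. The helper names are ours, and the couplings are assumed to be given as a symmetric matrix summed over pairs a < b.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
SZ = np.diag([1.0, -1.0])          # Pauli sigma_z, Eq. (11)

def sigma_z(a, N):
    # sigma_a^z = I (x) ... (x) I (x) sigma_z (x) I (x) ... (x) I, Eq. (10); a = 0, ..., N-1
    return reduce(np.kron, [SZ if k == a else I2 for k in range(N)])

def diagonal_hamiltonian(b, w):
    # 2^N x 2^N diagonal Hamiltonian of Eq. (9); its diagonal holds the classical
    # energies E_z of Eq. (1) in lexicographic order of the states z.
    N = len(b)
    H = np.zeros((2**N, 2**N))
    for a in range(N):
        H -= b[a] * sigma_z(a, N)
        for c in range(a + 1, N):
            H -= w[a, c] * sigma_z(a, N) @ sigma_z(c, N)
    return H
```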
We can now define matrix exponentiation through the Taylor expansion, e^{−H} = Σ_{k=0}^∞ (1/k!) (−H)^k. For a diagonal Hamiltonian, e^{−H} is a diagonal matrix with its 2^N diagonal elements being e^{−E_z}, corresponding to all the 2^N states. With the partition function given by Z = Tr[e^{−H}] (cf. (2)), we define the density matrix as

    ρ = Z^{−1} e^{−H}.    (12)

The diagonal elements of ρ are therefore the Boltzmann probabilities of all the 2^N states. For a given state |v⟩ of the visible variables, we can obtain the marginal Boltzmann probability P_v by tracing over the hidden variables,

    P_v = Tr[Λ_v ρ],    (13)

where Λ_v limits the trace only to diagonal terms that correspond to the visible variables being in state v. Thus, Λ_v is a diagonal matrix with diagonal elements being either 1, when the visibles are in state v, or 0 otherwise. In operator notation, we write

    Λ_v = |v⟩⟨v| ⊗ I_h,    (14)

where I_h is the identity matrix acting on the hidden variables, and

    |v⟩⟨v| ≡ Π_ν (1 + v_ν σ_ν^z)/2    (15)

is a projection operator onto the subspace of the visible variables. Equations (2) and (13) are equivalent when the Hamiltonian, and therefore the density matrix, are diagonal, but (13) also holds for non-diagonal matrices.

We can now add a transverse field to the Ising Hamiltonian by introducing the non-diagonal matrices

    σ_a^x ≡ I ⊗ ... ⊗ I ⊗ σ_x ⊗ I ⊗ ... ⊗ I,    σ_x = [[0, 1], [1, 0]],

which represent transverse components of spin. The transverse-field Ising Hamiltonian is then written as

    H = − Σ_a Γ_a σ_a^x − Σ_a b_a σ_a^z − Σ_{a,b} w_{ab} σ_a^z σ_b^z.    (16)

Every eigenstate of H is now a superposition in the computational basis made of the classical states |v, h⟩. As the probabilistic model for the QBM, we use the quantum Boltzmann distribution with the density matrix (12), which now has off-diagonal elements. In each measurement the states of the qubits are read out in the σ_z basis and the outcome is a classical value ±1. Because of the statistical nature of quantum mechanics, after each measurement a classical output v will appear for the visible variables with the probability P_v given by (13).
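For the small systems used later in Sec. IV, the density matrix (12) and the marginal (13) can be computed directly by exact diagonalization. The following sketch (our own helper names; qubits are ordered with the visible ones first) builds the transverse-field Hamiltonian (16), the projector Λ_v of Eqs. (14)-(15), and returns P_v.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
SZ = np.diag([1.0, -1.0])
SX = np.array([[0.0, 1.0], [1.0, 0.0]])

def embed(M, a, N):
    # Single-qubit operator M acting on qubit a of an N-qubit register, as in Eq. (10).
    return reduce(np.kron, [M if k == a else I2 for k in range(N)])

def tfim_hamiltonian(Gamma, b, w):
    # Transverse-field Ising Hamiltonian of Eq. (16); w is symmetric, summed over a < b.
    N = len(b)
    H = np.zeros((2**N, 2**N))
    for a in range(N):
        H -= Gamma[a] * embed(SX, a, N) + b[a] * embed(SZ, a, N)
        for c in range(a + 1, N):
            H -= w[a, c] * embed(SZ, a, N) @ embed(SZ, c, N)
    return H

def density_matrix(H):
    # rho = e^{-H} / Tr[e^{-H}], Eq. (12), via eigendecomposition of the real symmetric H.
    E, U = np.linalg.eigh(H)
    weights = np.exp(-(E - E.min()))
    rho = U @ np.diag(weights) @ U.T
    return rho / np.trace(rho)

def marginal_probability(rho, v, N):
    # P_v = Tr[Lambda_v rho], Eqs. (13)-(15), with the visible qubits listed first.
    Lam = np.eye(2**N)
    for nu, v_nu in enumerate(v):
        Lam = Lam @ (np.eye(2**N) + v_nu * embed(SZ, nu, N)) / 2.0
    return float(np.trace(Lam @ rho))

# Toy usage: 2 visible + 1 hidden qubit.
rng = np.random.default_rng(1)
N = 3
b, Gamma = rng.normal(size=N), np.full(N, 0.5)
w = rng.normal(size=(N, N)); w = np.triu(w, 1) + np.triu(w, 1).T
rho = density_matrix(tfim_hamiltonian(Gamma, b, w))
print(marginal_probability(rho, v=[+1, -1], N=N))   # probability of reading v = (+1, -1)
```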
To train a QBM, we change the parameters θ such that the probability distribution P_v becomes close to P_v^data of the input data. This is achieved by minimizing the negative log-likelihood, which from (3), (12), and (13) is

    L = − Σ_v P_v^data log ( Tr[Λ_v e^{−H}] / Tr[e^{−H}] ).    (17)

The gradient of L is given by

    ∂_θ L = Σ_v P_v^data ( Tr[Λ_v ∂_θ e^{−H}] / Tr[Λ_v e^{−H}] − Tr[∂_θ e^{−H}] / Tr[e^{−H}] ).    (18)

Once again, we hope to be able to estimate the gradients efficiently using sampling. However, since H and ∂_θ H are now matrices that do not commute, we have ∂_θ e^{−H} ≠ −e^{−H} ∂_θ H, and therefore we do not trivially obtain expectations of ∂_θ H as in the classical case. Writing e^{−H} = [e^{−δτ H}]^n, where δτ ≡ 1/n, we have

    ∂_θ e^{−H} = Σ_{m=1}^n e^{−mδτ H} (−∂_θ H δτ) e^{−(n−m)δτ H}.    (19)

Introducing the imaginary time τ ≡ mδτ, in the limit n → ∞ we obtain

    ∂_θ e^{−H} = − ∫_0^1 dτ e^{−τH} ∂_θ H e^{(τ−1)H}.    (20)

Tracing over both sides and using the cyclic permutation property of the trace, we find

    Tr[∂_θ e^{−H}] = −Tr[∂_θ H e^{−H}],    (21)

which is the same as the classical relation. Plugging this into the second term of (18) gives

    Tr[∂_θ e^{−H}] / Tr[e^{−H}] = −⟨∂_θ H⟩,    (22)

where ⟨...⟩ ≡ Tr[ρ ...] denotes Boltzmann averaging. This term can be estimated by sampling from the distribution (12). However, the first term in (18),

    Tr[Λ_v ∂_θ e^{−H}] / Tr[Λ_v e^{−H}] = − ∫_0^1 dt Tr[Λ_v e^{−tH} ∂_θ H e^{−(1−t)H}] / Tr[Λ_v e^{−H}],    (23)

cannot be estimated using sampling. This renders the training of a QBM inefficient and basically impractical for large systems. A workaround for this problem is to introduce a properly defined upper bound for L and minimize it, as we shall discuss below. We call this approach the bound-based QBM (bQBM). Minimizing a bound on the negative log-likelihood is a common approach in machine learning.
A. Bound-based QBM

One can define a lower bound for the probabilities using the Golden-Thompson inequality [23, 24],

    Tr[e^A e^B] ≥ Tr[e^{A+B}],    (24)

which holds for any Hermitian matrices A and B. We can therefore write

    P_v = Tr[Λ_v e^{−H}] / Tr[e^{−H}] = Tr[e^{−H} e^{ln Λ_v}] / Tr[e^{−H}] ≥ Tr[e^{−H + ln Λ_v}] / Tr[e^{−H}].    (25)

Introducing a new Hamiltonian,

    H_v = H − ln Λ_v,    (26)

we can write

    P_v ≥ Tr[e^{−H_v}] / Tr[e^{−H}].    (27)

Notice that H_v has an infinite energy penalty for any state of the visible qubits that is different from |v⟩. Therefore, for any practical purposes,

    H_v ≡ H(σ_ν^x = 0, σ_ν^z = v_ν).    (28)

This is a clamped Hamiltonian, because every visible qubit σ_ν^z is clamped to its corresponding classical data value v_ν. From (17) and (27) it follows that

    L ≤ L̃ ≡ − Σ_v P_v^data log ( Tr[e^{−H_v}] / Tr[e^{−H}] ).    (29)

Instead of minimizing L, we now minimize its upper bound L̃ using the gradient

    ∂_θ L̃ = Σ_v P_v^data ( Tr[e^{−H_v} ∂_θ H_v] / Tr[e^{−H_v}] − Tr[e^{−H} ∂_θ H] / Tr[e^{−H}] )
          = ⟨∂_θ H_v⟩_v̄ − ⟨∂_θ H⟩,    (30)

where

    ⟨...⟩_v̄ = Σ_v P_v^data ⟨...⟩_v = Σ_v P_v^data Tr[e^{−H_v} ...] / Tr[e^{−H_v}].    (31)

Taking θ to be b_a and w_{ab}, and using δθ = −η ∂_θ L̃, we obtain

    δb_a = η ( ⟨σ_a^z⟩_v̄ − ⟨σ_a^z⟩ ),    (32)
    δw_{ab} = η ( ⟨σ_a^z σ_b^z⟩_v̄ − ⟨σ_a^z σ_b^z⟩ ).    (33)

Again the gradient steps are expressed in terms of differences between the unclamped and clamped averages, ⟨...⟩ and ⟨...⟩_v̄, which can be obtained by sampling from a Boltzmann distribution with Hamiltonians H and H_v, respectively. In Section IV, we give examples of training a QBM and compare the results of minimizing L using (18) with minimizing its upper bound L̃ using (30).
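The resulting training loop has the same structure as the classical one. The sketch below (our own skeleton, not code from the paper) takes the positive- and negative-phase statistics as callables, since in practice they would be supplied by a sampler such as quantum Monte Carlo or a quantum annealer, or by exact diagonalization for small systems.

```python
import numpy as np

def bqbm_step(b, w, clamped_stats, model_stats, eta=0.05):
    # One bound-based QBM update, Eqs. (32)-(33).
    #   clamped_stats() -> (<sigma^z>_v-bar, <sigma^z sigma^z>_v-bar): positive phase,
    #       data-averaged expectations under the clamped Hamiltonians H_v of Eq. (28).
    #   model_stats()   -> the same expectations under the unclamped H of Eq. (16):
    #       the negative phase.
    # The transverse fields Gamma_a are held fixed; as discussed next in the text,
    # they cannot be learned from the bound itself.
    sz_data, szsz_data = clamped_stats()
    sz_model, szsz_model = model_stats()
    b_new = b + eta * (sz_data - sz_model)
    w_new = w + eta * (szsz_data - szsz_model)
    np.fill_diagonal(w_new, 0.0)   # no self-couplings
    return b_new, w_new
```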
One may also attempt to train Γ_a using the upper bound L̃. From (30) we obtain

    δΓ_a = η ( ⟨σ_a^x⟩_v̄ − ⟨σ_a^x⟩ ).    (34)

There are a few problems with using (34) to train Γ_a. First of all, one cannot calculate ⟨σ_a^x⟩ by sampling in the σ_a^z basis; a measurement in the σ_a^x basis is needed to estimate ⟨σ_a^x⟩. Moreover, the first term in (34) is always zero for visible variables, i.e., ⟨σ_ν^x⟩_v = 0 for all ν. Since ⟨σ_ν^x⟩ > 0 for positive Γ_ν, δΓ_ν will always be negative, which means Γ_ν → 0 for all visible variables. This is inconsistent with what we obtain when we train Γ_ν using the exact gradient (18). Therefore, the vanishing Γ_ν is an artifact of the upper-bound minimization. In other words, we cannot learn the transverse field using the upper bound. One may still train the transverse field using the exact log-likelihood, but this quickly becomes inefficient as the size of the QBM grows.

B. Restricted QBM

So far we have not imposed any restrictions on the connectivity between visible and hidden qubits, or on the lateral connectivity among visible or hidden qubits. We note that calculation of the first term in (32) and (33), sometimes called the positive phase, requires sampling from distributions with clamped Hamiltonians (28). This sampling can become computationally expensive for a large data set, because it has to be done for every input data element. If we restrict our QBM to have no lateral connectivity in the hidden layer (see Fig. 1b), the hidden qubits become uncoupled in the positive phase and the calculations can be carried out exactly. We can write the clamped Hamiltonian (28) as

    H_v = − Σ_i [ Γ_i σ_i^x + b_i^eff(v) σ_i^z ],    (35)

where b_i^eff(v) = b_i + Σ_ν w_{iν} v_ν. The expectations ⟨σ_i^z⟩_v entering (32) can then be computed exactly:

    ⟨σ_i^z⟩_v = (b_i^eff / D_i) tanh D_i,    (36)

where D_i = √(Γ_i² + (b_i^eff)²). Notice that (36) reduces to the classical RBM expression,

    ⟨σ_i^z⟩_v = tanh b_i^eff,    (37)

in the limit Γ_i → 0. We emphasize that, unlike the classical RBM, in which there are no lateral connections in either the hidden or the visible layer (for contrastive divergence techniques to work), we only require their absence in the hidden layer; this is usually called a semi-restricted Boltzmann machine [26]. In Section IV we give an example of training an RQBM and illustrate the importance of using (36) instead of its classical limit (37).
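Equation (36) makes the positive phase of an RQBM a closed-form computation; a short sketch (our own helper, assuming Γ_i > 0 so the expression is well defined):

```python
import numpy as np

def hidden_expectation(Gamma_i, b_i, w_i, v):
    # Clamped expectation <sigma_i^z>_v of Eq. (36) for one hidden unit:
    # b_eff = b_i + sum_nu w_{i nu} v_nu,  D_i = sqrt(Gamma_i^2 + b_eff^2).
    b_eff = b_i + np.dot(w_i, v)
    D = np.sqrt(Gamma_i**2 + b_eff**2)
    return (b_eff / D) * np.tanh(D)

# For Gamma_i -> 0 this tends to the classical RBM value tanh(b_eff) of Eq. (37).
```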
III. SUPERVISED LEARNING

One important application of machine learning is classification, in which a category (label) is assigned to each data point. For example, in spam detection the goal is to determine which of the two labels, "spam" or "not spam", should be assigned to a given text. The process of inferring a functional relation between input and label from a set of labeled data is called supervised learning. Denoting the feature vector (input) by x and the label (output) by y, the problem is to infer a function g(x): x → y from the set of labeled data (x_i, y_i). In probabilistic approaches to this problem, which are of our main interest here, the output y that is most probable, given the input x, is chosen as the label. Therefore, the function g(x) is determined by the conditional probability P_{y|x} of output given input:

    g(x) = arg max_y P_{y|x}.    (38)

The end goal of training is to make P_{y|x} as close as possible to the conditional distribution of the data, P_{y|x}^data. Assuming that the data come with a joint probability distribution P_{x,y}^data, we can write P_{y|x}^data = P_{x,y}^data / P_x^data, where P_x^data = Σ_y P_{x,y}^data is the marginal distribution. Two possible approaches to supervised learning are discriminative and generative learning [35]. In the discriminative approach, for each x we try to learn the conditional distribution P_{y|x}^data. If an input x appears in the training set with probability P_x^data, the loss function can be written as

    L_discr = − Σ_x P_x^data Σ_y P_{y|x}^data log P_{y|x} = − Σ_{x,y} P_{x,y}^data log P_{y|x}.    (39)

In the generative approach, on the other hand, we learn the joint probability distribution without separating input from output. The loss function is therefore

    L_gen = − Σ_{x,y} P_{x,y}^data log P_{x,y} = L_discr − Σ_x P_x^data log P_x,    (40)

where we have used P_{x,y} = P_{y|x} P_x. Notice that the first term is just L_discr, while the second term measures the difference between the probability distribution of the training-set inputs and the marginal distribution P_x. This second term is a cross-entropy and is equal to a KL-divergence, see Eq. (58), up to a constant. Now, we explore the possibility of applying QBM to both cases.

A. Generative learning

Generative learning with the loss (40) can be done with the methods of Section II by treating input and output (x, y) jointly as the visible data v = [x, y] in a QBM. At the end of training, the QBM provides samples with a joint probability P_{x,y} that is close to P_{x,y}^data. Therefore, the conditional probability (output only)

    P_{y|x} = P_{x,y} / P_x = Tr[Λ_x Λ_y e^{−H}] / Tr[Λ_x e^{−H}]    (41)

should also match P_{y|x}^data, as desired for supervised training. However, there is a problem when it comes to sampling from this conditional for a given x. If the input x appears with a very small probability (P_x ≪ 1), it would require a large number of samples from P_{x,y} and P_x to reliably calculate P_{y|x} using (41).

In a classical BM, one can sample from the conditional distribution by clamping the input variables x to the data and sampling the output y. To understand how that strategy would work for a QBM, let us introduce a clamped Hamiltonian

    H_x = H − ln Λ_x,    Λ_x = |x⟩⟨x| ⊗ I_y ⊗ I_h,    (42)

which clamps the input qubits to x. Here, I_y and I_h are identity matrices acting on the output and hidden variables, respectively. For classical Hamiltonians ([H, Λ_x] = 0), we have

    P_{y|x} = Tr[Λ_y e^{−H} e^{ln Λ_x}] / Tr[e^{−H} e^{ln Λ_x}] = P_{y|x}^clamped,    (43)

where

    P_{y|x}^clamped ≡ Tr[Λ_y e^{−H_x}] / Tr[e^{−H_x}].    (44)

This means that for any x we can sample P_{y|x}^clamped from H_x, and that will give us P_{y|x} in an efficient way regardless of how small P_x is. For quantum Hamiltonians, when [H, Λ_x] ≠ 0, we know that e^{−H} e^{ln Λ_x} ≠ e^{−H_x}. Therefore, P_{y|x}^clamped is not necessarily equal to P_{y|x}, and there is no easy way to draw samples from P_{y|x}.

One might still hope that the clamped distribution is not too far off from (41) and can be used as an approximation, P_{y|x} ≈ P_{y|x}^clamped. As we shall see in an example in Sec. IV C, this is not true in general.
B. Discriminative learning

In discriminative learning one distinguishes input from output during the training [2] and learns the conditional probability distribution using (39). This can be done by clamping the input x in both the positive and negative phases. Since the input is always clamped, its role is just to apply biases to the other variables, and therefore we do not need to assign any qubits to the input (see Fig. 1c). The Hamiltonian of the system for a particular state of the input, x, is given by

    H_x = − Σ_a [ Γ_a σ_a^x + b_a^eff(x) σ_a^z ] − Σ_{a,b} w_{ab} σ_a^z σ_b^z,    (45)

where the indices a and b range over both hidden and visible variables. Here, the input x provides a bias

    b_a^eff(x) = b_a + Σ_μ w_{aμ} x_μ    (46)

to the a-th qubit, where b_a and w_{aμ} are tunable parameters. Notice that x_μ does not need to be restricted to binary numbers, which can bring more flexibility to the supervised learning. The probability of measuring an output state y once the input is set to state x is given by

    P_{y|x} = Tr[Λ_y e^{−H_x}] / Tr[e^{−H_x}],    Λ_y = I_x ⊗ |y⟩⟨y| ⊗ I_h,    (47)

where H_x is given by (45) and I_x is an identity matrix acting on the input variables. The negative log-likelihood is given by (39). Using the same tricks as discussed in the previous section, we can define a clamped Hamiltonian,

    H_{x,y} = H_x − ln Λ_y,    (48)

and show that

    P_{y|x} ≳ Tr[e^{−H_{x,y}}] / Tr[e^{−H_x}].    (49)

Again we introduce an upper bound L̃ for L_discr,

    L_discr ≤ L̃_discr = − Σ_{x,y} P_{x,y}^data log ( Tr[e^{−H_{x,y}}] / Tr[e^{−H_x}] ).    (50)

The derivative of L̃ with respect to a Hamiltonian parameter θ is given by

    ∂_θ L̃ = ⟨∂_θ H_{x,y}⟩_{x,y} − ⟨∂_θ H_x⟩_x,    (51)

where

    ⟨A⟩_x = Σ_x P_x^data Tr[e^{−H_x} A] / Tr[e^{−H_x}],    (52)
    ⟨A⟩_{x,y} = Σ_{x,y} P_{x,y}^data Tr[e^{−H_{x,y}} A] / Tr[e^{−H_{x,y}}].    (53)

The gradient descent steps in the parameter space are given by

    δb_a = η ( ⟨σ_a^z⟩_{x,y} − ⟨σ_a^z⟩_x ),    (54)
    δw_{ab} = η ( ⟨σ_a^z σ_b^z⟩_{x,y} − ⟨σ_a^z σ_b^z⟩_x ),    (55)
    δw_{aμ} = η ( ⟨σ_a^z x_μ⟩_{x,y} − ⟨σ_a^z x_μ⟩_x ).    (56)

Notice that x is not only clamped in the positive phase (the first expectations), but is also clamped in the negative phase (the second expectations). The positive phase can still be done efficiently if we use an RQBM (Fig. 1c with no lateral connection among the hidden variables). The negative phase needs a more elegant sampling method. This can make the calculation of the gradient steps computationally expensive for large data sets, unless a very fast sampling method is available.
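A minimal skeleton of this update (ours, with the positive- and negative-phase statistics left as callables, since they again require a sampler or exact evaluation) makes explicit how the possibly real-valued input enters only through the effective bias (46):

```python
import numpy as np

def effective_bias(b, W_in, x):
    # Eq. (46): b_a^eff(x) = b_a + sum_mu w_{a mu} x_mu; x need not be binary.
    return b + W_in @ x

def discriminative_step(b, w, W_in, pos_stats, neg_stats, eta=0.05):
    # One update of Eqs. (54)-(56). pos_stats() are expectations under H_{x,y}
    # (input and output clamped), neg_stats() under H_x (input clamped only),
    # both averaged over the training data.
    sz_p, szsz_p, szx_p = pos_stats()
    sz_n, szsz_n, szx_n = neg_stats()
    return (b + eta * (sz_p - sz_n),
            w + eta * (szsz_p - szsz_n),
            W_in + eta * (szx_p - szx_n))
```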

IV. EXAMPLES

In this Section we describe a few toy examples illustrating the ideas described in the previous sections. In all examples studied, the training data was generated as a mixture of M factorized distributions (modes), each peaked around a random point. Every mode k is constructed by randomly selecting a center point s^k = [s_1^k, s_2^k, ..., s_N^k], with s_i^k ∈ {±1}, and using the Bernoulli distribution p^{N−d_v^k} (1−p)^{d_v^k}, where p is the probability of qubit ν being aligned with s_ν^k, and d_v^k is the Hamming distance between v and s^k. The average probability distribution over M such modes gives our data distribution

    P_v^data = (1/M) Σ_{k=1}^M p^{N−d_v^k} (1−p)^{d_v^k}.    (57)

In all our examples, we choose p = 0.9 and M = 8.

To have a measure of the quality of learning, we subtract from L its minimum L_min = − Σ_v P_v^data log P_v^data, which happens when P_v = P_v^data. The difference, commonly called the Kullback-Leibler (KL) divergence,

    KL = L − L_min = Σ_v P_v^data log ( P_v^data / P_v ),    (58)

is a non-negative number measuring the difference between the two distributions; KL = 0 if and only if the two distributions are identical.
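The data distribution (57) and the KL measure (58) are easy to reproduce; a short sketch (the function names and random seed are ours):

```python
import itertools
import numpy as np

def bernoulli_mixture(N, M=8, p=0.9, seed=0):
    # Data distribution of Eq. (57): a mixture of M modes, each a product
    # Bernoulli distribution peaked at a random center s^k in {-1,+1}^N.
    rng = np.random.default_rng(seed)
    centers = rng.choice([-1, 1], size=(M, N))
    states = np.array(list(itertools.product([-1, 1], repeat=N)))
    P = np.zeros(len(states))
    for s in centers:
        d = np.sum(states != s, axis=1)          # Hamming distance d_v^k
        P += p ** (N - d) * (1 - p) ** d
    return states, P / M

def kl_divergence(P_data, P_model):
    # Eq. (58): KL = sum_v P_v^data log(P_v^data / P_v); zero iff the two are identical.
    mask = P_data > 0
    return np.sum(P_data[mask] * np.log(P_data[mask] / P_model[mask]))

states, P_data = bernoulli_mixture(N=10)
print(P_data.sum())   # should be 1 up to numerical precision
```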
A. Fully visible model

We start with a fully visible example to compare BM with QBM and evaluate the quality of the bound (29) by training bQBM. We consider a fully connected model with N = 10 qubits. The classical BM will have N + N(N−1)/2 = N(N+1)/2 trainable parameters (b_a, w_{ab}). The Hamiltonian of the QBM has the form (16), where we restrict all Γ_a to be the same (= Γ). In order to understand the efficiency of the QBM in representing data, we train the exact log-likelihood using (18) and treat b_a, w_{ab}, and Γ as trainable parameters. This can be done using exact diagonalization for small systems. We also perform training of the bound L̃ using (30), treating (b_a, w_{ab}) as trainable parameters but fixing Γ to some ad hoc non-zero value, Γ = 2. Comparing the training results of QBM with bQBM will give us some idea of the efficiency of training the bound L̃.

Since all expectations entering the gradients of the log-likelihood are computed exactly, we use the second-order optimization routine BFGS [25]. The results of training BM, QBM and bQBM are given in Fig. 2a. The x-axis in the figure corresponds to iterations of BFGS, which does a line search along the gradient. QBM is able to learn the data noticeably better than BM, and bQBM approaches a value close to the one for QBM.

FIG. 2: Training of a fully visible, fully connected model with N = 10 qubits on artificial data from the Bernoulli mixture model (57). Training is done using the second-order optimization routine BFGS. (a) KL-divergence (58) of the BM, QBM, and bQBM models during the training process. Both QBM and bQBM learn to KL values that are lower than that for BM. (b) Classical and quantum average energies (59) during the training process.

In order to visualize the training process, we keep track of the average values of the classical and quantum parts of the Hamiltonian during the training,

    E_cl = − ⟨ Σ_a b_a σ_a^z + Σ_{a,b} w_{ab} σ_a^z σ_b^z ⟩,
    E_q = − ⟨ Σ_a Γ_a σ_a^x ⟩.    (59)
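These averages are simple traces against the density matrix; one point of the (|E_cl|, |E_q|) trajectory plotted in Fig. 2b can be computed as in the following sketch (our own helper, assuming the two parts of Eq. (16) are supplied as matrices):

```python
import numpy as np

def energy_trajectory_point(H_cl, H_q):
    # Eq. (59): with rho = e^{-H}/Tr[e^{-H}] and H = H_cl + H_q,
    # E_cl = Tr[rho H_cl] and E_q = Tr[rho H_q], where H_cl is the diagonal
    # (b, w) part of Eq. (16) and H_q = -sum_a Gamma_a sigma_a^x the transverse part.
    E, U = np.linalg.eigh(H_cl + H_q)
    weights = np.exp(-(E - E.min()))
    rho = U @ np.diag(weights / weights.sum()) @ U.T
    return abs(np.trace(rho @ H_cl)), abs(np.trace(rho @ H_q))
```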
Fig. 2b shows the learning trajectories in the space (|E_cl|, |E_q|). BM learns a model with average energy ≈ 3.5 and KL ≈ 0.62. One can see that QBM, which starts off with Γ = 0.1, initially lowers Γ and learns (b_a, w_{ab}) that are close to the best classical result (see the inset). Soon after, QBM increases Γ and (b_a, w_{ab}) until it converges to a point with Γ = 2.5 and KL ≈ 0.42, which is better than the classical BM value. Having a fixed transverse field, Γ = 2, bQBM starts with a large E_q and approaches the parameters learned by QBM (although it does not reach the best value at Γ = 2.5 learned by QBM).

B. Restricted QBM

We now consider the (semi-)restricted BM discussed in Section II B. Our toy model has 8 visible units and 2 hidden units. We allow full connectivity within the visible layer and all-to-all connectivity between the layers. The data are again generated using Eq. (57) for the visible variables, with p = 0.9 and M = 8. We present the results of training in Fig. 3. Similarly to the fully visible model, QBM outperforms BM, and bQBM represents a good proxy for learning the quantum distribution.

In order to illustrate the significance of consistent usage of the quantum distribution in evaluating the gradients (32) and (33), we train bQBM using the classical expression (37) instead of (36) for the expectations of the hidden units in the positive phase. The resulting machine (bQBM-CE in Fig. 3) learns worse than the purely classical BM, because the two terms in the gradient expressions are evaluated inconsistently.

C. Generative supervised learning

We consider a supervised learning example with 8 inputs and 3 outputs, with full connectivity between all units. For the training set we again used the multi-modal distribution (57) over x, with M = 8 and p = 0.9, and set the label y for each mode to be a 3-bit binary number from 0 to 7 [36]. Both BM and QBM are trained to learn the loss function (40). Our goal is to check whether P_{y|x} ≈ P_{y|x}^clamped when training QBM in this generative setup. In Fig. 4a, we plot the KL-divergence based on the generative log-likelihood (40) for both the classical BM and the QBM. It is clear that QBM has trained to a better KL-divergence than the classical BM. In Fig. 4b we plot the KL-divergence based on the discriminative log-likelihood (39), evaluated with the conditional probabilities P_{y|x} and the clamped probabilities P_{y|x}^clamped. One can see that although QBM is trained with the joint probability distribution, the conditional distribution is also learned better by QBM than by BM. The clamped probability distribution, on the other hand, starts very close to the conditional distribution at the beginning of the iterations, when QBM and BM are close to each other. But as the transverse field in QBM starts to grow, the clamped distribution deviates from the conditional one and its KL-divergence grows to a value much worse than the classical BM value. This shows that even for such a small example the clamped distribution can be very different from the true conditional distribution.
FIG. 3: Training of a restricted QBM with 8 visible and 2 hidden units on artificial data from the Bernoulli mixture model (57), using a second-order optimization routine. (a) KL-divergence (58) of the different models during the training process. Again QBM and bQBM outperform BM, but when the positive phase was calculated classically in bQBM, the performance (bQBM-CE curve in the figure) deteriorated and became worse than that for BM (see Section IV B for details). (b) Classical and quantum average energies (59) during the training process.

FIG. 4: Supervised learning using a fully visible, fully connected model with N = 11 qubits divided into 8 inputs and 3 outputs. As our training data we use artificial data from the Bernoulli mixture model (57) for the inputs and 3-bit binary labels (0 to 7) for the outputs. Training is done using the second-order optimization routine BFGS. (a) KL-divergence of the joint distribution (40) for the BM and QBM models during the training process. Once again QBM learns the distribution better than BM. (b) KL-divergence of the conditional distribution during the same training, for the BM and QBM models using (39), and for the clamped QBM (QBM-clamped) using (43). The conditional distribution is also learned better by QBM than by BM, but the clamped QBM distribution is very different from the conditional one and gives a KL-divergence much higher than that of the classical BM.

V. QBM WITH A QUANTUM ANNEALING PROCESSOR

Recent developments in manufacturing quantum annealing processors have made it possible to experimentally test some of the quantum machine learning ideas. Up to now, many experiments have confirmed the existence of quantum phenomena in such processors [14, 27–30], which includes entanglement [31]. A quantum annealing processor implements the time-dependent Hamiltonian

    H(s) = − A(s) Σ_a σ_a^x + B(s) [ Σ_a h_a σ_a^z + Σ_{a,b} J_{ab} σ_a^z σ_b^z ],    (60)

where s = t/t_a, t is time, t_a is the annealing time, h_a and J_{ab} are tuneable dimensionless parameters, and A(s) and B(s) are monotonic functions, with units of energy, such that A(0) ≫ B(0) ≈ 0 and B(1) ≫ A(1) ≈ 0. As discussed in Ref. [32], an open-system quantum annealer follows a quasistatic evolution along the equilibrium distribution, ρ = Z^{−1} e^{−βH(s)}, up to a point where the dynamics become too slow to establish equilibrium. Here, β = (k_B T)^{−1}, with T being the temperature and k_B the Boltzmann constant. The system will then deviate from the equilibrium distribution, and soon after the dynamics will freeze (see Fig. 2c in Ref. [32] and the related discussion).
In general, a quantum annealer with the linear annealing schedule s = t/t_a does not return a Boltzmann distribution. However, as argued in Ref. [32], if the dynamical slow-down and freeze-out happen within a short period of time during the annealing, then the final distribution will be close to the quantum Boltzmann distribution of (60) at a single point s*, called the freeze-out time. In such a case, the quantum annealer with linear annealing schedule will provide approximate samples from the Boltzmann distribution corresponding to the Hamiltonian H(s*). Moreover, if A(s*) happens to be small enough such that the quantum eigenstates at s* are close to the classical eigenstates, then the resulting Boltzmann distribution will be close to the classical Boltzmann distribution. In such a case, the quantum annealer can be used as an approximate classical Boltzmann sampler for training a BM, as was done in [15, 16]. Unfortunately, not all problems have a narrow freeze-out region, and A(s*) is not always small. If the freeze-out does not happen in a narrow region, then the final probability distribution will depend on the history within this region and will not correspond to a Boltzmann distribution at any particular point. This would limit the applicability of using a quantum annealer for Boltzmann sampling.

In principle, it is possible to controllably freeze the evolution at a desired point, s*, in the middle of the annealing and read out the qubits. One way to do this is via a nonuniform s(t) which anneals slowly at the beginning, up to s*, and then moves very fast (faster than all dynamics) to the end of the annealing. An experimental demonstration of such controlled sampling was done in [33] for a specially designed 16-qubit problem. If s* lies in the region where the evolution is still quasistatic, the quantum annealer will provide samples from the Boltzmann distribution of Hamiltonian (16), with

    Γ_a = Γ = βA(s*),    (61)
    b_a = βB(s*) h_a,    (62)
    w_{ab} = βB(s*) J_{ab}.    (63)

Since h_a and J_{ab} are tunable parameters, if one can control the freeze-out point s*, then all the dimensionless parameters in (16), i.e., Γ, b_a, and w_{ab}, can be tuned, and therefore the quantum annealer can be used for training a QBM.
QBM. rent commercial quantum annealers like D-Wave are not
The applicability of the controlled sampling technique designed to provide quantum Boltzmann samples, with
used in [33] is limited by how fast the second part of the minor modifications to the hardware design, such a fea-
annealing can be done, which is ultimately determined by ture can become available. This would open new possi-
the bandwidth of the filters that bring electrical signals bilities in both quantum information processing and ma-
to the chip. Because of this, such a technique is only chine learning research areas.
applicable to specially designed problems that have very
slow dynamics. With some modifications to the current
hardware design, however, such techniques can become
possible for general problems relevant to QBM in the
Acknowledgement
near future.

We are grateful to Ali Ghodsi, Firas Hamze, William


VI. CONCLUSION Macready, Anatoly Smirnov, and Giacomo Torlai for
fruitful discussions. This research was partially sup-
We have examined the possibility of training a quan- ported by a Natural Sciences and Engineering Research
tum Boltzmann machine (QBM), in which the classical Council of Canada (NSERC) Engage grant.
[1] M.I. Jordan and T.M. Mitchell, Machine learning: Trends, perspectives, and prospects, Science 349, 255 (2015).
[2] C.M. Bishop, Pattern Recognition and Machine Learning, Springer (2006).
[3] S. Lloyd, M. Mohseni, and P. Rebentrost, Quantum algorithms for supervised and unsupervised machine learning, arXiv:1307.0411.
[4] P. Rebentrost, M. Mohseni, and S. Lloyd, Quantum support vector machine for big data classification, Phys. Rev. Lett. 113, 130503 (2014).
[5] N. Wiebe, A. Kapoor, and K.M. Svore, Quantum Deep Learning, arXiv:1412.3489.
[6] H. Neven, G. Rose, and W.G. Macready, Image recognition with an adiabatic quantum computer I. Mapping to quadratic unconstrained binary optimization, arXiv:0804.4457.
[7] H. Neven, V.S. Denchev, G. Rose, and W.G. Macready, Training a Binary Classifier with the Quantum Adiabatic Algorithm, arXiv:0811.0416.
[8] H. Neven, V.S. Denchev, G. Rose, and W.G. Macready, Training a Large Scale Classifier with the Quantum Adiabatic Algorithm, arXiv:0912.0779.
[9] K.L. Pudenz and D.A. Lidar, Quantum adiabatic machine learning, arXiv:1109.0325.
[10] M. Denil and N. de Freitas, Toward the implementation of a quantum RBM, NIPS*2011 Workshop on Deep Learning and Unsupervised Feature Learning.
[11] V.S. Denchev, N. Ding, S.V.N. Vishwanathan, and H. Neven, Robust Classification with Adiabatic Quantum Optimization, arXiv:1205.1148.
[12] V. Dumoulin, I.J. Goodfellow, A. Courville, and Y. Bengio, On the Challenges of Physical Implementations of RBMs, AAAI 2014, 1199-1205.
[13] R. Babbush, V. Denchev, N. Ding, S. Isakov, and H. Neven, Construction of non-convex polynomial loss functions for training a binary classifier with quantum annealing, arXiv:1406.4203.
[14] M.W. Johnson, M.H.S. Amin, S. Gildert, T. Lanting, F. Hamze, N. Dickson, R. Harris, A.J. Berkley, J. Johansson, P. Bunyk, E.M. Chapple, C. Enderud, J.P. Hilton, K. Karimi, E. Ladizinsky, N. Ladizinsky, T. Oh, I. Perminov, C. Rich, M.C. Thom, E. Tolkacheva, C.J.S. Truncik, S. Uchaikin, J. Wang, B. Wilson, and G. Rose, Quantum Annealing with Manufactured Spins, Nature 473, 194 (2011).
[15] S.H. Adachi and M.P. Henderson, Application of Quantum Annealing to Training of Deep Neural Networks, arXiv:1510.06356.
[16] M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz, Estimation of effective temperatures in a quantum annealer and its impact in sampling applications: A case study towards deep learning applications, arXiv:1510.07611.
[17] G.E. Hinton and T.J. Sejnowski, Optimal perceptual inference, CVPR 1983.
[18] G.E. Hinton, S. Osindero, and Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18, 1527–1554 (2006).
[19] T.J. Sejnowski, Higher-order Boltzmann machines, AIP Conference Proceedings 151: Neural Networks for Computing (1986).
[20] R. Salakhutdinov and G.E. Hinton, Deep Boltzmann machines, AISTATS 2009.
[21] M. Ranzato and G.E. Hinton, Modeling pixel means and covariances using factorized third-order Boltzmann machines, CVPR 2010.
[22] R. Memisevic and G.E. Hinton, Learning to represent spatial transformations with factored higher-order Boltzmann machines, Neural Comput. 22, 1473–1492 (2010).
[23] S. Golden, Lower bounds for the Helmholtz function, Phys. Rev. 137, B1127 (1965).
[24] C.J. Thompson, Inequality with applications in statistical mechanics, J. Math. Phys. 6, 1812 (1965).
[25] https://en.wikipedia.org/wiki/Broyden-Fletcher-Goldfarb-Shanno algorithm.
[26] S. Osindero and G.E. Hinton, Modeling image patches with a directed hierarchy of Markov random fields, Advances in Neural Information Processing Systems (2008).
[27] R. Harris et al., Experimental Investigation of an Eight Qubit Unit Cell in a Superconducting Optimization Processor, Phys. Rev. B 82, 024511 (2010).
[28] S. Boixo, T. Albash, F.M. Spedalieri, N. Chancellor, and D.A. Lidar, Experimental Signature of Programmable Quantum Annealing, Nat. Commun. 4, 2067 (2013).
[29] S. Boixo, T.F. Rønnow, S.V. Isakov, Z. Wang, D. Wecker, D.A. Lidar, J.M. Martinis, and M. Troyer, Nature Phys. 10, 218 (2014).
[30] S. Boixo, V.N. Smelyanskiy, A. Shabani, S.V. Isakov, M. Dykman, V.S. Denchev, M. Amin, A. Smirnov, M. Mohseni, and H. Neven, arXiv:1502.05754; long version: arXiv:1411.4036 (2014).
[31] T. Lanting et al., Phys. Rev. X 4, 021041 (2014).
[32] M.H. Amin, Searching for quantum speedup in quasistatic quantum annealers, Phys. Rev. A 92, 052323 (2015).
[33] N.G. Dickson, M.W. Johnson, M.H. Amin, R. Harris, F. Altomare, A.J. Berkley, P. Bunyk, J. Cai, E.M. Chapple, P. Chavez, F. Cioata, T. Cirip, P. deBuen, M. Drew-Brook, C. Enderud, S. Gildert, F. Hamze, J.P. Hilton, E. Hoskinson, K. Karimi, E. Ladizinsky, N. Ladizinsky, T. Lanting, T. Mahon, R. Neufeld, T. Oh, I. Perminov, C. Petroff, A. Przybysz, C. Rich, P. Spear, A. Tcaciuc, M.C. Thom, E. Tolkacheva, S. Uchaikin, J. Wang, A.B. Wilson, Z. Merali, and G. Rose, Thermally assisted quantum annealing of a 16-qubit problem, Nature Commun. 4, 1903 (2013).
[34] In physical systems, Hamiltonian parameters have units of energy. We normalize these parameters by k_B T ≡ β^{−1}, where T is the temperature and k_B is the Boltzmann constant; we absorb β into the parameters.
[35] There are other techniques used for supervised learning, for example when only a small fraction of the available data is labeled.
[36] This choice was made to keep the number of qubits small, to allow for exact diagonalization.
