Quantum Boltzmann Machine
Mohammad H. Amin,1,2 Evgeny Andriyash,1 Jason Rolfe,1 Bohdan Kulchytskyy,3 and Roger Melko3,4
1 D-Wave Systems Inc., 3033 Beta Avenue, Burnaby BC Canada V5G 4M9
2 Department of Physics, Simon Fraser University, Burnaby, BC Canada V5A 1S6
3 Department of Physics and Astronomy, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1
4 Perimeter Institute for Theoretical Physics, Waterloo, Ontario, N2L 2Y5, Canada
Inspired by the success of Boltzmann Machines based on the classical Boltzmann distribution, we propose a new machine learning approach based on the quantum Boltzmann distribution of a transverse-field Ising Hamiltonian. Due to the non-commutative nature of quantum mechanics, the training process of the Quantum Boltzmann Machine (QBM) can become nontrivial. We circumvent the problem by introducing bounds on the quantum probabilities. This allows us to train the QBM efficiently by sampling. We show examples of QBM training with and without the bound, using exact diagonalization, and compare the results with classical Boltzmann training. We also discuss the possibility of using quantum annealing processors like D-Wave for QBM training and application.
[Fig. 1 appears here: schematic layouts of the machines, showing visible, hidden, input (x), and output (y) units.]

δθ = −η ∂θ L,   (5)

where the learning rate, η, controls the step sizes. An important requirement for the applicability of the gradient descent technique is the ability to calculate the gradients ∂θ L efficiently. Using (4), we have

\partial_\theta \mathcal{L} = \sum_v P_v^{\mathrm{data}} \left[ \frac{\sum_h \partial_\theta E_z\, e^{-E_z}}{\sum_h e^{-E_z}} - \frac{\sum_z \partial_\theta E_z\, e^{-E_z}}{\sum_z e^{-E_z}} \right].   (6)
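To make these updates concrete, here is a minimal sketch of one gradient-descent step (5) with the gradient (6) evaluated by exact enumeration. It is our own illustration, not code from the paper: it assumes NumPy, takes the model to be fully visible (so the positive phase reduces to a data average), and parameterizes the couplings by a symmetric matrix W with zero diagonal, E_z = −b·z − ½ zᵀW z.

```python
import itertools
import numpy as np

def energies(b, W, states):
    # E_z = -b.z - 0.5 * z^T W z  (W symmetric, zero diagonal)
    return -(states @ b + 0.5 * np.einsum('za,ab,zb->z', states, W, states))

def nll_gradient(b, W, data_states, data_probs):
    """Gradient of the negative log-likelihood of a small, fully visible BM,
    computed by summing over all 2^N states (feasible only for small N)."""
    N = len(b)
    states = np.array(list(itertools.product([-1, 1], repeat=N)), dtype=float)
    E = energies(b, W, states)
    p = np.exp(-E)
    p /= p.sum()                                   # Boltzmann distribution
    model_z = p @ states                           # negative-phase <z_a>
    model_zz = states.T @ (p[:, None] * states)    # negative-phase <z_a z_b>
    data_z = data_probs @ data_states              # positive-phase <z_a>
    data_zz = data_states.T @ (data_probs[:, None] * data_states)
    # dL/db_a = <z_a> - <z_a>_data ; dL/dW_ab = 0.5 (<z_a z_b> - <z_a z_b>_data)
    return model_z - data_z, 0.5 * (model_zz - data_zz)

def descent_step(b, W, data_states, data_probs, eta=0.1):
    """One step of eq. (5): theta -> theta - eta * dL/dtheta."""
    gb, gW = nll_gradient(b, W, data_states, data_probs)
    return b - eta * gb, W - eta * gW
```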
Hamiltonian, e^{-H} is a diagonal matrix with its 2^N diagonal elements being e^{-E_z}, corresponding to all the 2^N states. With the partition function given by Z = Tr[e^{-H}] (c.f. (2)), we define the density matrix as ρ = Z^{-1} e^{-H} (12), in terms of which the marginal probability of a visible state is P_v = Tr[Λ_v ρ] (13), with

\Lambda_v \equiv |v\rangle\langle v| \otimes I_h,   (14)

where I_h is the identity matrix acting on the hidden variables, and

|v\rangle\langle v| \equiv \prod_\nu \frac{1 + v_\nu \sigma_\nu^z}{2}   (15)

is a projection operator in the subspace of visible variables. Equations (2) and (13) are equivalent when the Hamiltonian, and therefore the density matrix, are diagonal, but (13) also holds for non-diagonal matrices.

We can now add a transverse field to the Ising Hamiltonian by introducing the non-diagonal matrices

\sigma_a^x \equiv \overbrace{I \otimes \cdots \otimes I}^{a-1} \otimes\, \sigma^x \otimes \overbrace{I \otimes \cdots \otimes I}^{N-a}, \qquad \sigma^x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},

which represent the transverse components of the spins. The transverse-field Ising Hamiltonian is then written as

H = -\sum_a \Gamma_a \sigma_a^x - \sum_a b_a \sigma_a^z - \sum_{a,b} w_{ab}\, \sigma_a^z \sigma_b^z.   (16)

Every eigenstate of H is now a superposition in the computation basis made of the classical states |v, h⟩. As the probabilistic model for the QBM, we use the quantum Boltzmann distribution with the density matrix (12), which now has off-diagonal elements. In each measurement the states of the qubits are read out in the σ^z basis and the outcome will be a classical value ±1. Because of the statistical nature of quantum mechanics, after each measurement a classical output v will appear for the visible variables with the probability P_v given by (13).

To train a QBM, we change the parameters θ such that the probability distribution P_v becomes close to P_v^data of the input data. This is achieved by minimizing the negative log-likelihood, which from (3), (12), and (13) is

\mathcal{L} = -\sum_v P_v^{\mathrm{data}} \log \frac{\mathrm{Tr}[\Lambda_v e^{-H}]}{\mathrm{Tr}[e^{-H}]}.   (17)

The gradient of L is given by

\partial_\theta \mathcal{L} = -\sum_v P_v^{\mathrm{data}} \left[ \frac{\mathrm{Tr}[\Lambda_v\, \partial_\theta e^{-H}]}{\mathrm{Tr}[\Lambda_v e^{-H}]} - \frac{\mathrm{Tr}[\partial_\theta e^{-H}]}{\mathrm{Tr}[e^{-H}]} \right].   (18)

Because H and ∂θ H do not commute in general, the derivative of the exponential is ∂θ e^{-H} = −∫_0^1 dt e^{-tH} ∂θH e^{-(1-t)H}. Tracing over both sides and using the permutation property of the trace, we find

\mathrm{Tr}[\partial_\theta e^{-H}] = -\mathrm{Tr}[\partial_\theta H\, e^{-H}],   (21)

which is the same as the classical relation. Plugging this into the second term of (18) gives

\frac{\mathrm{Tr}[\partial_\theta e^{-H}]}{\mathrm{Tr}[e^{-H}]} = -\langle \partial_\theta H \rangle,   (22)

where ⟨· · ·⟩ ≡ Tr[ρ · · ·] denotes Boltzmann averaging. This term can be estimated by sampling from the distribution (12). However, the first term in (18),

\frac{\mathrm{Tr}[\Lambda_v\, \partial_\theta e^{-H}]}{\mathrm{Tr}[\Lambda_v e^{-H}]} = -\int_0^1 dt\, \frac{\mathrm{Tr}[\Lambda_v\, e^{-tH} \partial_\theta H\, e^{-(1-t)H}]}{\mathrm{Tr}[\Lambda_v e^{-H}]},   (23)

cannot be estimated using sampling. This renders the training of a QBM inefficient and basically impractical for large systems. A workaround for this problem is to introduce a properly defined upper bound for L and minimize it, as we shall discuss below. We call this approach bound-based QBM (bQBM). Minimizing a bound on the negative log-likelihood is a common approach in machine learning.

A. Bound-based QBM

One can define a lower bound for the probabilities using the Golden-Thompson inequality [23, 24]:

\mathrm{Tr}[e^{A} e^{B}] \ge \mathrm{Tr}[e^{A+B}],   (24)

which holds for any Hermitian matrices A and B. We can therefore write

P_v = \frac{\mathrm{Tr}[\Lambda_v e^{-H}]}{\mathrm{Tr}[e^{-H}]} = \frac{\mathrm{Tr}[e^{-H} e^{\ln \Lambda_v}]}{\mathrm{Tr}[e^{-H}]} \ge \frac{\mathrm{Tr}[e^{-H+\ln \Lambda_v}]}{\mathrm{Tr}[e^{-H}]}.   (25)
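The Golden-Thompson bound (24) is easy to check numerically. Below is a small sketch of such a check — our own illustration, assuming NumPy and SciPy — which draws random Hermitian matrices and verifies that Tr[e^A e^B] ≥ Tr[e^{A+B}].

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

def random_hermitian(n):
    M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (M + M.conj().T) / 2

# Golden-Thompson, eq. (24): Tr[e^A e^B] >= Tr[e^(A+B)] for Hermitian A, B
for _ in range(5):
    A, B = random_hermitian(8), random_hermitian(8)
    lhs = np.trace(expm(A) @ expm(B)).real
    rhs = np.trace(expm(A + B)).real
    assert lhs >= rhs - 1e-9     # equality only when [A, B] = 0
    print(f"Tr[e^A e^B] = {lhs:.4f} >= Tr[e^(A+B)] = {rhs:.4f}")
```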
Introducing a new Hamiltonian,

H_v = H - \ln \Lambda_v,   (26)

we can write

P_v \ge \frac{\mathrm{Tr}[e^{-H_v}]}{\mathrm{Tr}[e^{-H}]}.   (27)

Notice that H_v has an infinite energy penalty for any state of the visible qubits that is different from |v⟩. Therefore, for any practical purposes,

H_v \equiv H(\sigma_\nu^x = 0,\ \sigma_\nu^z = v_\nu).   (28)

This is a clamped Hamiltonian, because every visible qubit σ_ν^z is clamped to its corresponding classical data value v_ν. From (17) and (27) it follows that

\mathcal{L} \le \tilde{\mathcal{L}} \equiv -\sum_v P_v^{\mathrm{data}} \log \frac{\mathrm{Tr}[e^{-H_v}]}{\mathrm{Tr}[e^{-H}]}.   (29)

Instead of minimizing L, we now minimize its upper bound L̃ using the gradient

\partial_\theta \tilde{\mathcal{L}} = \sum_v P_v^{\mathrm{data}} \frac{\mathrm{Tr}[e^{-H_v} \partial_\theta H_v]}{\mathrm{Tr}[e^{-H_v}]} - \frac{\mathrm{Tr}[e^{-H} \partial_\theta H]}{\mathrm{Tr}[e^{-H}]} = \overline{\langle \partial_\theta H_v \rangle}_v - \langle \partial_\theta H \rangle,   (30)

where

\overline{\langle \cdots \rangle}_v \equiv \sum_v P_v^{\mathrm{data}} \langle \cdots \rangle_v = \sum_v P_v^{\mathrm{data}} \frac{\mathrm{Tr}[e^{-H_v} \cdots]}{\mathrm{Tr}[e^{-H_v}]}.   (31)

Taking θ to be b_a, w_{ab}, and using δθ = −η ∂θ L̃, we obtain

\delta b_a = \eta \left( \overline{\langle \sigma_a^z \rangle}_v - \langle \sigma_a^z \rangle \right),   (32)

\delta w_{ab} = \eta \left( \overline{\langle \sigma_a^z \sigma_b^z \rangle}_v - \langle \sigma_a^z \sigma_b^z \rangle \right).   (33)

Applying the same prescription to the transverse field of a visible qubit ν fails: since σ_ν^z is clamped in (28), the clamped expectation of σ_ν^x vanishes, so the bound-based update can only reduce Γ_ν, which means Γ_ν → 0 for all visible variables. This is inconsistent with what we obtain when we train Γ_ν using the exact gradient (18). Therefore, vanishing Γ_ν is an artifact of the upper-bound minimization. In other words, we cannot learn the transverse field using the upper bound. One may still train the transverse field using the exact log-likelihood, but it quickly becomes inefficient as the size of the QBM grows.

B. Restricted QBM

So far we haven't imposed any restrictions on the connectivity between visible and hidden qubits or the lateral connectivity among visible or hidden qubits. We note that calculation of the first term in (32) and (33), sometimes called the positive phase, requires sampling from distributions with clamped Hamiltonians (28). This sampling can become computationally expensive for a large data set, because it has to be done for every input data element. If we restrict our QBM to have no lateral connectivity in the hidden layer (see Fig. 1b), the hidden qubits become uncoupled in the positive phase and the calculations can be carried out exactly. We can write the clamped Hamiltonian (28) as

H_v = -\sum_i \left[ \Gamma_i \sigma_i^x + b_i^{\mathrm{eff}}(v)\, \sigma_i^z \right],   (35)

where b_i^eff(v) = b_i + Σ_ν w_{iν} v_ν. Expectations ⟨σ_i^z⟩_v entering (32) can be computed exactly:

\langle \sigma_i^z \rangle_v = \frac{b_i^{\mathrm{eff}}}{D_i} \tanh D_i,   (36)

where D_i = \sqrt{\Gamma_i^2 + (b_i^{\mathrm{eff}})^2}. Notice that (36) reduces to the classical RBM expression, ⟨σ_i^z⟩_v = tanh b_i^eff(v), in the limit Γ_i → 0.
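The closed form (36) can be checked against a brute-force evaluation of the clamped single-qubit density matrix. The sketch below is our own illustration (assuming NumPy and SciPy, with arbitrary parameter values): it builds e^{-H} for one hidden qubit of the clamped Hamiltonian (35) and compares the exact expectation of σ^z with (b_eff/D) tanh D.

```python
import numpy as np
from scipy.linalg import expm

sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.array([[1.0, 0.0], [0.0, -1.0]])

def sigma_z_expectation(gamma, b_eff):
    """Exact <sigma_z> for a single hidden qubit of the clamped Hamiltonian (35):
    H = -gamma*sigma_x - b_eff*sigma_z (dimensionless, beta absorbed)."""
    rho = expm(gamma * sx + b_eff * sz)   # e^{-H}, unnormalized
    rho /= np.trace(rho)
    return float(np.trace(rho @ sz))

# arbitrary illustrative parameters: transverse field, bias, couplings, clamped visibles
gamma, b = 1.3, 0.2
w = np.array([0.5, -0.7, 0.1])
v = np.array([1, -1, 1])

b_eff = b + w @ v                        # effective bias b_i^eff(v)
D = np.hypot(gamma, b_eff)
closed_form = (b_eff / D) * np.tanh(D)   # eq. (36)
print(sigma_z_expectation(gamma, b_eff), closed_form)   # agree to machine precision
```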
Denoting the feature vector (input) by x and the label (output) by y, the problem is to infer a function g(x): x → y from the set of labeled data (x_i, y_i). In probabilistic approaches to this problem, which are of our main interest here, the output y that is most probable, subject to the input x, is chosen as the label. Therefore, the function g(x) is determined by the conditional probability P_{y|x} of output given input,

g(x) = \arg\max_y P_{y|x}.   (38)

The end goal of training is to make P_{y|x} as close as possible to the conditional distribution of the data, P_{y|x}^data. Assuming that the data comes with a joint probability distribution P_{x,y}^data, we can write P_{y|x}^data = P_{x,y}^data / P_x^data, where P_x^data = Σ_y P_{x,y}^data is the marginal distribution. Two possible approaches to supervised learning are discriminative and generative learning [35]. In the discriminative approach, for each x we try to learn the conditional distribution P_{y|x}^data. If an input x appears in the training set with probability P_x^data, the loss function can be written as

\mathcal{L}_{\mathrm{discr}} = -\sum_x P_x^{\mathrm{data}} \sum_y P_{y|x}^{\mathrm{data}} \log P_{y|x} = -\sum_{x,y} P_{x,y}^{\mathrm{data}} \log P_{y|x}.   (39)

In the generative approach, on the other hand, we learn the joint probability distribution without separating input from output. The loss function is therefore

\mathcal{L}_{\mathrm{gen}} = -\sum_{x,y} P_{x,y}^{\mathrm{data}} \log P_{x,y} = \mathcal{L}_{\mathrm{discr}} - \sum_x P_x^{\mathrm{data}} \log P_x,   (40)

where we have used P_{x,y} = P_{y|x} P_x. Notice that the first term is just L_discr, while the second term measures the difference between the probability distribution of the training-set inputs and the marginal distribution P_x. This second term is the cross-entropy, which is equal to the KL-divergence, Eq. (58), up to a constant. Now we explore the possibility of applying the QBM to both cases.

A. Generative learning

Generative learning with the loss (40) can be done with the methods of Section II by treating input and output (x, y) jointly as the visible data v = [x, y] in a QBM. At the end of training, the QBM provides samples with a joint probability P_{x,y} that is close to P_{x,y}^data. Therefore, the conditional probability (output only)

P_{y|x} = \frac{P_{x,y}}{P_x} = \frac{\mathrm{Tr}[\Lambda_x \Lambda_y e^{-H}]}{\mathrm{Tr}[\Lambda_x e^{-H}]}   (41)

should also match P_{y|x}^data, as desired for supervised training. However, there is a problem when it comes to sampling from this conditional for a given x. If the input x appears with a very small probability (P_x ≪ 1), it would require a large number of samples from P_{x,y} and P_x to reliably calculate P_{y|x} using (41).

In a classical BM, one can sample from the conditional distribution by clamping the input variables x to the data and sampling the output y. To understand how that strategy would work for a QBM, let us introduce a clamped Hamiltonian

H_x = H - \ln \Lambda_x, \qquad \Lambda_x = |x\rangle\langle x| \otimes I_y \otimes I_h,   (42)

which clamps the input qubits to x. Here, I_y and I_h are identity matrices acting on the output and hidden variables, respectively. For classical Hamiltonians ([H, Λ_x] = 0), we have

P_{y|x} = \frac{\mathrm{Tr}[\Lambda_y e^{-H} e^{\ln \Lambda_x}]}{\mathrm{Tr}[e^{-H} e^{\ln \Lambda_x}]} = P_{y|x}^{\mathrm{clamped}},   (43)

where

P_{y|x}^{\mathrm{clamped}} \equiv \frac{\mathrm{Tr}[\Lambda_y e^{-H_x}]}{\mathrm{Tr}[e^{-H_x}]}.   (44)

This means that for any x we can sample P_{y|x}^clamped from H_x, and that will give us P_{y|x} in an efficient way regardless of how small P_x is. For quantum Hamiltonians, when [H, Λ_x] ≠ 0, we know that e^{-H} e^{ln Λ_x} ≠ e^{-H_x}. Therefore, P_{y|x}^clamped is not necessarily equal to P_{y|x} and there is no easy way to draw samples from P_{y|x}.

One might still hope that the clamped distribution is not too far off from (41) and can be used as an approximation, P_{y|x} ≈ P_{y|x}^clamped. As we shall see in an example in Sec. IV-C, this is not true in general.

B. Discriminative learning

In discriminative learning one distinguishes input from output during the training [2] and learns the conditional probability distribution using (39). This can be done by clamping the input x in both positive and negative phases. Since the input is always clamped, its role is just to apply biases to the other variables, and therefore we don't need to assign any qubits to the input (see Fig. 1c). The Hamiltonian of the system for a particular state of the input, x, is given by

H_x = -\sum_a \left[ \Gamma_a \sigma_a^x + b_a^{\mathrm{eff}}(x)\, \sigma_a^z \right] - \sum_{a,b} w_{ab}\, \sigma_a^z \sigma_b^z,   (45)

where indices a and b range over both hidden and visible variables. Here, the input x provides a bias

b_a^{\mathrm{eff}}(x) = b_a + \sum_\mu w_{a\mu} x_\mu   (46)
to the a-th qubit, where b_a and w_{aμ} are tunable parameters. Notice that x_μ does not need to be restricted to binary numbers, which can bring more flexibility to the supervised learning.

The probability of measuring an output state y once the input is set to state x is given by

P_{y|x} = \frac{\mathrm{Tr}[\Lambda_y e^{-H_x}]}{\mathrm{Tr}[e^{-H_x}]}, \qquad \Lambda_y = I_x \otimes |y\rangle\langle y| \otimes I_h,   (47)

where H_x is given by (45) and I_x is an identity matrix acting on the input variables. The negative log-likelihood is given by (39). Using the same tricks as discussed in the previous section, we can define a clamped Hamiltonian,

H_{x,y} = H_x - \ln \Lambda_y,   (48)

and show that

P_{y|x} \gtrsim \frac{\mathrm{Tr}[e^{-H_{x,y}}]}{\mathrm{Tr}[e^{-H_x}]}.   (49)

Again we introduce an upper bound L̃_discr for L_discr,

\mathcal{L}_{\mathrm{discr}} \le \tilde{\mathcal{L}}_{\mathrm{discr}} = -\sum_{x,y} P_{x,y}^{\mathrm{data}} \log \frac{\mathrm{Tr}[e^{-H_{x,y}}]}{\mathrm{Tr}[e^{-H_x}]}.   (50)

The derivative of L̃ with respect to a Hamiltonian parameter θ is given by

\partial_\theta \tilde{\mathcal{L}} = \langle \partial_\theta H_{x,y} \rangle_{x,y} - \langle \partial_\theta H_x \rangle_{x}.   (51)

IV. EXAMPLES

In this Section we describe a few toy examples illustrating the ideas described in the previous sections. In all examples studied, the training data was generated as a mixture of M factorized distributions (modes), each peaked around a random point. Every mode (k) is constructed by randomly selecting a center point s^k = [s_1^k, s_2^k, ..., s_N^k], with s_i^k ∈ {±1}, and using the Bernoulli distribution p^{N−d_v^k}(1−p)^{d_v^k}, where p is the probability of qubit ν being aligned with s_ν^k, and d_v^k is the Hamming distance between v and s^k. The average probability distribution over M such modes gives our data distribution

P_v^{\mathrm{data}} = \frac{1}{M} \sum_{k=1}^{M} p^{N-d_v^k} (1-p)^{d_v^k}.   (57)

In all our examples, we choose p = 0.9 and M = 8.

To have a measure of the quality of learning, we subtract from L its minimum, L_min = −Σ_v P_v^data log P_v^data, which happens when P_v = P_v^data. The difference, commonly called the Kullback-Leibler (KL) divergence,

\mathrm{KL} = \mathcal{L} - \mathcal{L}_{\mathrm{min}} = \sum_v P_v^{\mathrm{data}} \log \frac{P_v^{\mathrm{data}}}{P_v},   (58)

is a non-negative number measuring the difference between the two distributions; KL = 0 if and only if the two distributions are identical.
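For concreteness, here is a short sketch — our own illustration, assuming NumPy; the function names are ours — that builds the Bernoulli-mixture data distribution (57) over all 2^N visible states and evaluates the KL divergence (58).

```python
import itertools
import numpy as np

def bernoulli_mixture(N=8, M=8, p=0.9, seed=1):
    """Data distribution of eq. (57): an equal mixture of M product-Bernoulli
    modes, each peaked around a random center s^k in {-1, +1}^N."""
    rng = np.random.default_rng(seed)
    centers = rng.choice([-1, 1], size=(M, N))
    states = np.array(list(itertools.product([-1, 1], repeat=N)))
    # Hamming distance d_v^k between every state v and every center s^k
    d = (states[:, None, :] != centers[None, :, :]).sum(axis=2)
    P = (p ** (N - d) * (1 - p) ** d).mean(axis=1)
    return states, P                          # P sums to 1

def kl_divergence(p_data, p_model):
    """Eq. (58): KL(P_data || P_model); zero iff the two distributions coincide."""
    mask = p_data > 0
    return float(np.sum(p_data[mask] * np.log(p_data[mask] / p_model[mask])))

states, p_data = bernoulli_mixture()
print(p_data.sum())                   # ~1.0
print(kl_divergence(p_data, p_data))  # 0.0
```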
[Fig. 2 appears here: KL-divergence and the (|E_cl|, |E_q|) learning trajectories versus training iteration; see the caption below.]

FIG. 2: Training of a fully visible, fully connected model with N = 10 qubits on artificial data from the Bernoulli mixture model (57). Training is done using the second-order optimization routine BFGS. (a) KL-divergence (58) of the BM, QBM, and bQBM models during the training process. Both QBM and bQBM learn to KL values that are lower than that of BM. (b) Classical and quantum average energies (59) during the training process.

In order to visualize the training process, we keep track of the average values of the classical and quantum parts of the Hamiltonian during the training,

E_{\mathrm{cl}} = -\left\langle \sum_a b_a \sigma_a^z + \sum_{a,b} w_{ab}\, \sigma_a^z \sigma_b^z \right\rangle, \qquad E_{\mathrm{q}} = -\left\langle \sum_a \Gamma_a \sigma_a^x \right\rangle.   (59)

Fig. 2b shows the learning trajectories in the space (|E_cl|, |E_q|). BM learns a model with average energy ≈ 3.5 and KL ≈ 0.62. One can see that QBM, which starts off with Γ = 0.1, initially lowers Γ and learns (b_a, w_{ab}) that are close to the best classical result (see the inset). Soon after, QBM increases Γ and (b_a, w_{ab}) until it converges to a point with Γ = 2.5 and KL ≈ 0.42, which is better than the classical BM value. Having a fixed transverse field, Γ = 2, bQBM starts with a large E_q and approaches the parameters learned by QBM (although it doesn't reach the best value at Γ = 2.5 learned by QBM).

B. Restricted QBM

We now consider a (semi-)restricted BM discussed in Section II B. Our toy model has 8 visible units and 2 hidden units. We allow full connectivity within the visible layer and all-to-all connectivity between the layers. The data is again generated using Eq. (57) for the visible variables, with p = 0.9 and M = 8. We present the results of training in Fig. 3. Similarly to the fully visible model, QBM outperforms BM, and bQBM represents a good proxy for learning the quantum distribution.

In order to illustrate the significance of consistent usage of the quantum distribution in evaluating the gradients (32) and (33), we train bQBM using the classical expression instead of (36) for the expectations of the hidden units in the positive phase. The resulting machine (bQBM-CE in Fig. 3) learns worse than the purely classical BM, because the two terms in the gradient expressions are evaluated inconsistently.

C. Generative supervised learning

We consider a supervised learning example with 8 inputs and 3 outputs with full connectivity between all units. For the training set we again used the multi-modal distribution (57) over x, with M = 8 and p = 0.9, and set the label y for each mode to be a 3-bit binary number from 0 to 7 [36]. Both BM and QBM are trained to learn the loss function (40). Our goal is to check whether P_{y|x}^clamped ≈ P_{y|x} when training the QBM in this generative setup. In Fig. 4a, we plot the KL-divergence based on the generative log-likelihood (40) for both the classical BM and the QBM. It is clear that QBM has trained to a better KL-divergence than the classical BM. In Fig. 4b we plot the KL-divergence based on the discriminative log-likelihoods (39), evaluated with the conditional probabilities P_{y|x} and the clamped probabilities P_{y|x}^clamped. One can see that although QBM is trained with the joint probability distribution, the conditional distribution is also learned better than by BM. The clamped probability distribution, on the other hand, starts very close to the conditional distribution at the beginning of the iterations, when QBM and BM are close to each other. But as the transverse field in QBM starts to grow, the clamped distribution deviates from the conditional one and its KL-divergence grows to a value much worse than the classical BM value. This shows that even for such a small example the clamped distribution can be very different from the true conditional distribution.
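The mismatch between the true conditional (41) and the clamped conditional (44) can be reproduced with a few lines of exact diagonalization. The sketch below is our own toy construction, not the paper's code: it assumes NumPy/SciPy, uses a three-qubit model (two inputs, one output, no hidden units) with arbitrary random parameters, and compares P_{y|x} computed from (41) with the clamped approximation built from the clamped Hamiltonian in the sense of (28)/(42).

```python
import numpy as np
from scipy.linalg import expm

I2 = np.eye(2)
sx = np.array([[0.0, 1.0], [1.0, 0.0]])
sz = np.array([[1.0, 0.0], [0.0, -1.0]])

def op(single, site, n):
    """Embed a single-qubit operator at position `site` in an n-qubit space."""
    out = np.array([[1.0]])
    for i in range(n):
        out = np.kron(out, single if i == site else I2)
    return out

def tfim(gammas, biases, W):
    """Transverse-field Ising Hamiltonian (16) on n qubits (W upper triangular)."""
    n = len(biases)
    H = np.zeros((2 ** n, 2 ** n))
    for a in range(n):
        H -= gammas[a] * op(sx, a, n) + biases[a] * op(sz, a, n)
        for b in range(a + 1, n):
            H -= W[a, b] * op(sz, a, n) @ op(sz, b, n)
    return H

def projector(bits, sites, n):
    """Projector onto sigma_z = bits on the given sites, identity elsewhere."""
    P = np.eye(2 ** n)
    for bit, s in zip(bits, sites):
        P = P @ (np.eye(2 ** n) + bit * op(sz, s, n)) / 2
    return P

# toy model: qubits 0,1 are inputs x, qubit 2 is the output y (no hidden units)
rng = np.random.default_rng(3)
n, inputs, outputs = 3, [0, 1], [2]
gammas = np.full(n, 2.0)
biases = rng.normal(size=n)
W = np.triu(rng.normal(size=(n, n)), k=1)

rho = expm(-tfim(gammas, biases, W))               # unnormalized e^{-H}
x = (1, -1)
Px = np.trace(projector(x, inputs, n) @ rho).real  # proportional to Tr[Lambda_x e^{-H}]

# clamped Hamiltonian in the sense of (28)/(42): inputs frozen to x,
# leaving a single output qubit with an effective bias
b_eff = biases[2] + W[0, 2] * x[0] + W[1, 2] * x[1]
rho_clamped = expm(gammas[2] * sx + b_eff * sz)    # e^{-H_x} on the output qubit

for y in (+1, -1):
    exact = np.trace(projector(x, inputs, n) @ projector((y,), outputs, n) @ rho).real / Px
    clamped = np.trace((np.eye(2) + y * sz) / 2 @ rho_clamped).real / np.trace(rho_clamped).real
    print(y, round(float(exact), 4), round(float(clamped), 4))  # generally not equal
```

For Γ → 0 the two printed conditionals coincide, recovering the classical identity (43); for a sizable transverse field they generally do not, which is the effect seen in Fig. 4b.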
[Fig. 3 and Fig. 4 appear here: KL-divergence versus training iteration for the different models, together with the learning trajectories in the (|E_cl|, |E_q|) plane; see the captions below.]
FIG. 3: Training of a (semi-)restricted BM with 8 visible and 2 hidden units on artificial data from the Bernoulli mixture model (57), using a second-order optimization routine. (a) KL-divergence (58) of the different models during the training process. Again QBM and bQBM outperform BM, but when the positive phase was calculated classically in bQBM, the performance (bQBM-CE curve in the figure) deteriorated and became worse than that of BM (see Section IV B for details). (b) Classical and quantum average energies (59) during the training process.

FIG. 4: Supervised learning using a fully visible, fully connected model with N = 11 qubits divided into 8 inputs and 3 outputs. As our training data we use artificial data from the Bernoulli mixture model (57) for the inputs and 3-bit binary labels (0 to 7) for the outputs. Training is done using the second-order optimization routine BFGS. (a) KL-divergence of the joint distribution (40) for the BM and QBM models during the training process. Once again QBM learns the distribution better than BM. (b) KL-divergence of the conditional distribution during the same training, for the BM and QBM models using (39), and for the clamped QBM (QBM-clamped) using (43). The conditional distribution is also learned better by QBM than by BM, but the clamped QBM distribution is very different from the conditional one and gives a KL-divergence much higher than the classical BM value.

V. QBM WITH A QUANTUM ANNEALING PROCESSOR
Recent developments in manufacturing quantum annealing processors have made it possible to experimentally test some of the quantum machine learning ideas. Up to now, many experiments have confirmed the existence of quantum phenomena in such processors [14, 27–30], which include entanglement [31]. A quantum annealing processor implements the time-dependent Hamiltonian

H(s) = -A(s) \sum_a \sigma_a^x + B(s) \left[ \sum_a h_a \sigma_a^z + \sum_{a,b} J_{ab}\, \sigma_a^z \sigma_b^z \right],   (60)

where s = t/t_a, t is time, t_a is the annealing time, h_a and J_{ab} are tunable dimensionless parameters, and A(s) and B(s) are monotonic functions, with units of energy, such that A(0) ≫ B(0) ≈ 0 and B(1) ≫ A(1) ≈ 0. As discussed in Ref. [32], an open-system quantum annealer follows a quasistatic evolution, remaining in the equilibrium distribution ρ = Z^{-1} e^{-βH(s)}, up to a point where the dynamics become too slow to establish equilibrium. Here, β = (k_B T)^{-1}, with T being the temperature and k_B the Boltzmann constant. The system will then deviate from the equilibrium distribution, and soon after, the dynamics will freeze (see Fig. 2c in Ref. [32] and the related discussion).

In general, a quantum annealer with a linear annealing schedule s = t/t_a does not return a Boltzmann distribution. However, as argued in Ref. [32], if the dynamical slow-down and freeze-out happen within a short period of time during the annealing, then the final distribution
will be close to the quantum Boltzmann distribution of (60) at a single point s∗, called the freeze-out time. In such a case, the quantum annealer with a linear annealing schedule will provide approximate samples from the Boltzmann distribution corresponding to the Hamiltonian H(s∗). Moreover, if A(s∗) happens to be small enough such that the quantum eigenstates at s∗ are close to the classical eigenstates, then the resulting Boltzmann distribution will be close to the classical Boltzmann distribution. In such a case, the quantum annealer can be used as an approximate classical Boltzmann sampler for training a BM, as was done in [15, 16]. Unfortunately, not all problems have a narrow freeze-out region, and A(s∗) is not always small. If the freeze-out does not happen in a narrow region, then the final probability distribution will depend on the history within this region and will not correspond to a Boltzmann distribution at any particular point. This would limit the applicability of using a quantum annealer for Boltzmann sampling.

In principle, it is possible to controllably freeze the evolution at a desired point, s∗, in the middle of the annealing and read out the qubits. One way to do this is via a nonuniform s(t) which anneals slowly at the beginning, up to s∗, and then moves very fast (faster than all dynamics) to the end of the annealing. An experimental demonstration of such controlled sampling was done in [33] for a specially designed 16-qubit problem. If s∗ lies in the region where the evolution is still quasistatic, the quantum annealer will provide samples from the Boltzmann distribution of Hamiltonian (16), with

\Gamma_a = \Gamma = \beta A(s^*),   (61)
b_a = \beta B(s^*)\, h_a,   (62)
w_{ab} = \beta B(s^*)\, J_{ab}.   (63)

Since h_a and J_{ab} are tunable parameters, if one can control the freeze-out point s∗, then all the dimensionless parameters in (16), i.e., Γ, b_a, w_{ab}, can be tuned, and therefore the quantum annealer can be used for training a QBM.

The applicability of the controlled sampling technique used in [33] is limited by how fast the second part of the annealing can be done, which is ultimately determined by the bandwidth of the filters that bring electrical signals to the chip. Because of this, such a technique is only applicable to specially designed problems that have very slow dynamics. With some modifications to the current hardware design, however, such techniques can become possible for general problems relevant to QBM in the near future.

In conclusion, we have introduced a quantum probabilistic model for machine learning, the Quantum Boltzmann Machine, in which the classical Ising Hamiltonian is augmented with a transverse field. Motivated by the success of stochastic gradient descent in training classical Boltzmann machines, one may wish to use a similar technique to optimize the log-likelihood of the QBM. However, unlike the classical BM, for which the gradients of the log-likelihood can be estimated using sampling, the existence of a transverse field in the QBM makes the gradient estimation nontrivial. We have introduced a lower bound on the log-likelihood, for which the gradient can be estimated using sampling. We have shown examples of QBM training through maximizing both the log-likelihood and its lower bound, using exact diagonalization, and compared the results with classical BM training. We have shown small-size examples in which the QBM learned the data distribution better than the BM. Whether QBM can learn and generalize better than BM at larger sizes is a question that needs to be answered in future work.

Our method is different from other existing quantum machine learning proposals [3–13, 15, 16], because quantum mechanics is not only used to facilitate the training process, as in other proposals, but is also exploited in the model. In other words, the probabilistic model, i.e., the quantum Boltzmann distribution, that we use in our QBM is different from any other model that has been studied in the machine learning community. Therefore, the potential of the model for machine learning is unexplored.

We should mention that the similarity between BM and QBM training may not hold in all situations. For example, as we have shown in Sec. III A, sampling from a conditional distribution cannot be performed by clamping in a QBM, as is commonly done in a classical BM. The two models may also differ in other aspects. Therefore, careful examination is needed before replacing BM with QBM in existing machine learning techniques.

Finally, we have discussed the possibility of using a quantum annealer for QBM training. Although the current commercial quantum annealers like D-Wave are not designed to provide quantum Boltzmann samples, with minor modifications to the hardware design, such a feature can become available. This would open new possibilities in both quantum information processing and machine learning research areas.

Acknowledgement
[1] M.I. Jordan and T.M. Mitchell, Machine learning: Trends, perspectives, and prospects, Science 349, 255 (2015).
[2] C.M. Bishop, Pattern Recognition and Machine Learning, Springer 2006.
[3] S. Lloyd, M. Mohseni, P. Rebentrost, Quantum algorithms for supervised and unsupervised machine learning, eprint: arXiv:1307.0411.
[4] P. Rebentrost, M. Mohseni, S. Lloyd, Quantum support vector machine for big data classification, Phys. Rev. Lett. 113, 130503 (2014).
[5] N. Wiebe, A. Kapoor, and K.M. Svore, Quantum Deep Learning, eprint: arXiv:1412.3489.
[6] H. Neven, G. Rose, W.G. Macready, Image recognition with an adiabatic quantum computer I. Mapping to quadratic unconstrained binary optimization, arXiv:0804.4457.
[7] H. Neven, V.S. Denchev, G. Rose, W.G. Macready, Training a Binary Classifier with the Quantum Adiabatic Algorithm, arXiv:0811.0416.
[8] H. Neven, V.S. Denchev, G. Rose, W.G. Macready, Training a Large Scale Classifier with the Quantum Adiabatic Algorithm, arXiv:0912.0779.
[9] K.L. Pudenz, D.A. Lidar, Quantum adiabatic machine learning, arXiv:1109.0325.
[10] M. Denil and N. de Freitas, Toward the implementation of a quantum RBM, NIPS*2011 Workshop on Deep Learning and Unsupervised Feature Learning.
[11] V.S. Denchev, N. Ding, S.V.N. Vishwanathan, H. Neven, Robust Classification with Adiabatic Quantum Optimization, arXiv:1205.1148.
[12] V. Dumoulin, I.J. Goodfellow, A. Courville, and Y. Bengio, On the Challenges of Physical Implementations of RBMs, AAAI 2014: 1199-1205.
[13] R. Babbush, V. Denchev, N. Ding, S. Isakov, H. Neven, Construction of non-convex polynomial loss functions for training a binary classifier with quantum annealing, arXiv:1406.4203.
[14] M.W. Johnson, M.H.S. Amin, S. Gildert, T. Lanting, F. Hamze, N. Dickson, R. Harris, A.J. Berkley, J. Johansson, P. Bunyk, E.M. Chapple, C. Enderud, J.P. Hilton, K. Karimi, E. Ladizinsky, N. Ladizinsky, T. Oh, I. Perminov, C. Rich, M.C. Thom, E. Tolkacheva, C.J.S. Truncik, S. Uchaikin, J. Wang, B. Wilson, and G. Rose, Quantum Annealing with Manufactured Spins, Nature 473, 194 (2011).
[15] S.H. Adachi, M.P. Henderson, Application of Quantum Annealing to Training of Deep Neural Networks, eprint: arXiv:1510.06356.
[16] M. Benedetti, J. Realpe-Gómez, R. Biswas, A. Perdomo-Ortiz, Estimation of effective temperatures in a quantum annealer and its impact in sampling applications: A case study towards deep learning applications, arXiv:1510.07611.
[17] G. E. Hinton, T. J. Sejnowski, Optimal perceptual inference, CVPR 1983.
[18] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18, 1527–1554 (2006).
[19] T. J. Sejnowski, Higher-order Boltzmann machines, AIP Conference Proceedings 151: Neural Networks for Computing (1986).
[20] R. Salakhutdinov, G. E. Hinton, Deep Boltzmann machines, AISTATS 2009.
[21] M. Ranzato, G. E. Hinton, Modeling pixel means and covariances using factorized third-order Boltzmann machines, CVPR 2010.
[22] R. Memisevic, G. E. Hinton, Learning to represent spatial transformation with factored higher-order Boltzmann machines, Neural Comput. 22, 1473–1492 (2010).
[23] S. Golden, Lower bounds for the Helmholtz function, Phys. Rev. 137, B1127 (1965).
[24] C.J. Thompson, Inequality with applications in statistical mechanics, J. Math. Phys. 6, 1812 (1965).
[25] https://en.wikipedia.org/wiki/Broyden-Fletcher-Goldfarb-Shanno_algorithm
[26] S. Osindero and G.E. Hinton, Modeling image patches with a directed hierarchy of Markov random fields, Advances in Neural Information Processing Systems (2008).
[27] R. Harris et al., Experimental Investigation of an Eight Qubit Unit Cell in a Superconducting Optimization Processor, Phys. Rev. B 82, 024511 (2010).
[28] S. Boixo, T. Albash, F. M. Spedalieri, N. Chancellor, and D.A. Lidar, Experimental Signature of Programmable Quantum Annealing, Nat. Commun. 4, 2067 (2013).
[29] S. Boixo, T.F. Rønnow, S.V. Isakov, Z. Wang, D. Wecker, D.A. Lidar, J.M. Martinis, M. Troyer, Nature Phys. 10, 218 (2014).
[30] S. Boixo, V. N. Smelyanskiy, A. Shabani, S. V. Isakov, M. Dykman, V. S. Denchev, M. Amin, A. Smirnov, M. Mohseni, and H. Neven, eprint arXiv:1502.05754; long version: arXiv:1411.4036 (2014).
[31] T. Lanting et al., Phys. Rev. X 4, 021041 (2014).
[32] M.H. Amin, Searching for quantum speedup in quasistatic quantum annealers, Phys. Rev. A 92, 052323 (2015).
[33] N.G. Dickson, M.W. Johnson, M.H. Amin, R. Harris, F. Altomare, A. J. Berkley, P. Bunyk, J. Cai, E. M. Chapple, P. Chavez, F. Cioata, T. Cirip, P. deBuen, M. Drew-Brook, C. Enderud, S. Gildert, F. Hamze, J.P. Hilton, E. Hoskinson, K. Karimi, E. Ladizinsky, N. Ladizinsky, T. Lanting, T. Mahon, R. Neufeld, T. Oh, I. Perminov, C. Petroff, A. Przybysz, C. Rich, P. Spear, A. Tcaciuc, M.C. Thom, E. Tolkacheva, S. Uchaikin, J. Wang, A. B. Wilson, Z. Merali, and G. Rose, Thermally assisted quantum annealing of a 16-qubit problem, Nature Commun. 4, 1903 (2013).
[34] In physical systems, Hamiltonian parameters have units of energy. We normalize these parameters by k_B T ≡ β^{-1}, where T is temperature and k_B is the Boltzmann constant; we absorb β into the parameters.
[35] There are other techniques used for supervised learning, for example, when only a small fraction of the available data is labeled.
[36] This choice was made to keep the number of qubits small to allow for exact diagonalization.