Simultaneous discovery of quantum error correction codes and encoders with a noise-aware reinforcement learning agent
https://doi.org/10.1038/s41534-024-00920-y
In the ongoing race towards experimental implementations of quantum error correction (QEC), finding
ways to automatically discover codes and encoding strategies tailored to the qubit hardware platform is
emerging as a critical problem. Reinforcement learning (RL) has been identified as a promising approach,
but so far it has been severely restricted in terms of scalability. In this work, we significantly expand the
power of RL approaches to QEC code discovery. Explicitly, we train an RL agent that automatically
discovers both QEC codes and their encoding circuits for a given gate set, qubit connectivity and error
model, from scratch. This is enabled by a reward based on the Knill-Laflamme conditions and a vectorized
Clifford simulator; we demonstrate the effectiveness of this approach with up to 25 physical qubits and
distance-5 codes, and present a roadmap for scaling it to 100 qubits and distance-10 codes in the near future. We
also introduce the concept of a noise-aware meta-agent, which learns to produce encoding strategies
simultaneously for a range of noise models, thus leveraging transfer of insights between different
situations. Our approach opens the door towards hardware-adapted accelerated discovery of QEC
approaches across the full spectrum of quantum hardware platforms of interest.
Quantum error correction1,2 (QEC) protects quantum information by encoding the state of a logical qubit into several physical qubits and is crucial to ensure that quantum technologies such as quantum communication or quantum computing can achieve their groundbreaking potential.

The past few years have witnessed dramatic progress in experimental realizations of QEC on different platforms3–7 (this includes especially various superconducting qubit architectures, ion traps, quantum dots, and neutral atoms), reaching a point where the lifetime of qubits has been extended by applying QEC8. Given the strong differences in native gate sets, qubit connectivities, and relevant noise models, there is a strong need for a flexible and efficient scheme to automatically discover not only codes but also efficient encoding circuits, adapted to the platform at hand.

In particular, in the field of quantum communication and networking, third-generation quantum repeaters rely on QEC to correct errors during transmission9. The use of QEC permits very high communication rates, since only one-way signaling is involved, in contrast to earlier generations of quantum repeaters. In this setting, we may in a first approximation assume that errors happen mainly during transmission over the noisy channel and treat the encoding circuits themselves as noiseless. This is the scenario we will adopt here.

Since Shor's original breakthrough10, different qubit-based QEC codes have been constructed, both analytically and numerically, leading to a zoo of codes, each of them conventionally labeled [[n, k, d]], where n is the number of physical qubits, k the number of encoded logical qubits, and d the code distance, which defines the number d − 1 of detectable errors. The first examples are provided by the [[5, 1, 3]] perfect code11, the [[7, 1, 3]] Steane12 and the [[9, 1, 3]] Shor10 codes, which encode one logical qubit into 5, 7, and 9 physical qubits, respectively, being able to detect up to 2 physical errors and correct up to 1 error on any physical qubit. The most promising approach so far is probably the family of so-called toric or surface codes13, which encode a logical qubit into the joint entangled state of a d × d square of physical qubits. More recently, examples of quantum Low-Density Parity Check (LDPC) codes that are competitive with the surface code have been discovered14.

However, knowledge of a code does not automatically translate into knowing how to encode the logical states of that code in an efficient way. Standard approaches are unconstrained, meaning that an all-to-all connectivity between qubits is assumed, as well as a set of gates that is not necessarily native to the hardware platform of interest15,16. This then leads to larger-than-necessary circuits when implementing them on specific devices.

Numerical techniques have already been employed to construct QEC codes. Often, this has involved greedy algorithms, which may lead to suboptimal solutions but can be relatively fast17–20.

The recent advent of powerful tools from the domain of Artificial Intelligence (AI) is transforming scientific approaches21. From these,

1Max Planck Institute for the Science of Light, Erlangen, Germany. 2Department of Physics, Friedrich-Alexander Universität Erlangen-Nürnberg, Erlangen, Germany. e-mail: [email protected]
Reinforcement Learning (RL), which is designed to solve complex decision-making problems by autonomously following an action-reward scheme22, is a promising artificial discovery tool for QEC strategies. The task to solve is encoded in a reward function, and the aim of RL training algorithms is to maximize this reward over time. RL can provide new answers to difficult questions, in particular in fields where optimization in a high-dimensional search space plays a crucial role. For this reason, RL can be an efficient tool to tackle the problem of QEC code construction and encoding under hardware-specific constraints.

The first example of RL-based automated discovery of QEC strategies23 did not rely on any human knowledge of QEC concepts. While this allowed exploration without any restrictions, e.g., going beyond stabilizer codes, it was limited to only small qubit numbers. More recent works have moved towards optimizing only certain QEC subtasks, injecting substantial human knowledge. For example, RL has been used for optimization of given QEC codes24, and to discover tensor network codes25 or codes based on "Quantum Lego" parametrizations26,27. Additionally, RL has been used to find efficient decoding processes28–31 and self-correcting control protocols32.

In our work, we significantly expand the scaling capabilities of RL code discovery by introducing two critical components:
1. An efficiently computable and general RL reward based on the Knill-Laflamme error correction conditions.
2. A highly parallelized custom-built Clifford circuit simulator that runs entirely on modern AI chip accelerators such as GPUs or TPUs.

The main results that are enabled by this strategy are the following:
1. A state-of-the-art scheme based on deep RL that simultaneously discovers QEC codes together with the encoding circuit from scratch, tailored to specific noise models, native gate sets, and connectivities, minimizing the circuit size for improved hardware efficiency.
2. Effortless discovery of both stabilizer and CSS codes and encoders with code distances from 3 (found in tens of seconds) to 5 (found in tens of minutes to a few hours) with up to 25 physical qubits.
3. A general RL agent that is trained only once but is afterwards able to adapt and switch its encoding strategy based on the specific noise that is present in the system. We call this a noise-aware RL agent.
4. A scalable platform for artificial scientific discovery of QEC strategies based on RL that potentially allows discovery of distance 8-10 codes on a single GPU, while offering further scaling opportunities on distributed machines.

Regarding applications to quantum computing, the discovered circuits are in general not fault-tolerant. However, strategies to build fault-tolerant versions out of non-fault-tolerant circuits exist33, and these can even be automated with RL34.

While the authors of ref. 35 also set themselves the task of finding both codes and their encoding circuits, this was done using variational quantum circuits involving continuously parametrized gates, which leads to much more costly numerical simulations and eventually only an approximate QEC scheme. By contrast, our RL-based approach does not rely on any human-provided circuit ansatz, can directly use any given discrete gate set, is able to exploit highly efficient Clifford simulations, and produces a meta-agent able to cover strategies for a range of noise models. In particular, their approach was not able to scale to d = 5 codes due to prohibitive computational costs.

The paper is organized as follows: In Section "Results" we detail the RL strategy, its numerical results, and estimates of how far this strategy can be scaled in principle. In Section "Methods" we give a reminder on stabilizer codes and the Knill-Laflamme conditions, provide background on the RL methods used in this work, and give all details of our implementation.

Results
Section "Reinforcement Learning Approach to QEC Code Discovery" describes our approach to build a noise-aware RL agent. Section "Reinforcement Learning Results" details the numerical results found with our strategy. Section "Scaling automated QEC discovery" explains how our approach can be scaled up to larger code parameters.

Reinforcement learning approach to QEC code discovery
The main objective of this work is to automate the discovery of QEC codes and their encoding circuits using RL. We exclusively focus on stabilizer codes due to their efficient simulability with classical computers. We will consider a scenario where the encoding circuit is assumed to be error-free (non-fault-tolerant encoding). This is applicable to quantum communication or quantum memories, where the majority of errors happen during transmission over a noisy channel or during the time the memory is retaining the information. Nevertheless, we remark that there exist techniques to make circuits fault-tolerant, such as flag fault-tolerance33, and the code itself would anyway be discovered with our strategy. A scheme of our approach can be found in Fig. 1, and the following sections are dedicated to explaining its different constituent parts.

Fig. 1 | QEC code and encoding discovery using a noise-aware RL meta-agent. A set of error operators, a gate set, and qubit connectivity are chosen. Different error models can be considered by varying some noise parameters, which are fed as an observation to the agent. The agent then builds a circuit using the available gate set and connectivity that detects the most likely errors from the target error model by using a reward based on the Knill-Laflamme QEC conditions according to Eq. (2). After training, a single RL agent is able to find suitable encodings for different noise models, which are able to encode any state $|\psi\rangle$ of choice.
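To make the circuit-building loop of Fig. 1 concrete, the following is a minimal sketch (our own illustration, not the authors' published implementation) of the underlying stabilizer bookkeeping: each code generator is stored as a binary symplectic vector (x | z), and every gate the agent places updates the tableau.

```python
import numpy as np

# Minimal stabilizer-tableau sketch (illustration only, not the authors'
# simulator). Each of the n - k code generators is a length-2n binary
# vector (x | z); global signs are ignored for brevity.

def initial_generators(n, k):
    """Initial generators Z_{k+1}, ..., Z_n (qubits 0..k-1 hold |psi>)."""
    tab = np.zeros((n - k, 2 * n), dtype=np.uint8)
    for row, qubit in enumerate(range(k, n)):
        tab[row, n + qubit] = 1  # set the z-bit of that qubit
    return tab

def apply_h(tab, q, n):
    """Hadamard on qubit q: exchange the x- and z-bits of that qubit."""
    tab[:, [q, n + q]] = tab[:, [n + q, q]]

def apply_cnot(tab, c, t, n):
    """CNOT(c, t): conjugation maps X_c -> X_c X_t and Z_t -> Z_c Z_t."""
    tab[:, t] ^= tab[:, c]          # propagate x-bits control -> target
    tab[:, n + c] ^= tab[:, n + t]  # propagate z-bits target -> control

# Example: the 3-qubit repetition encoder CNOT(0,1), CNOT(0,2) maps the
# initial generators Z on qubits 1 and 2 to Z_0 Z_1 and Z_0 Z_2.
n, k = 3, 1
tab = initial_generators(n, k)
apply_cnot(tab, 0, 1, n)
apply_cnot(tab, 0, 2, n)
```

A representation of this tableau is exactly the kind of observation the agent receives after each step before it proposes the next gate.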
Encoding circuit. In order to encode the state of k logical qubits on n physical qubits, one must find a sequence of quantum gates that will entangle the quantum information in such a way that QEC is possible with respect to a target noise channel. Initially, we imagine the first k qubits as the original containers of our (yet unencoded) quantum information, which can be in any state $|\psi\rangle \in (\mathbb{C}^2)^{\otimes k}$. The remaining n − k qubits are chosen to each be initialized in the state $|0\rangle$. These will be turned into the corresponding logical state $|\psi_L\rangle \in (\mathbb{C}^2)^{\otimes n}$ via the application of a sequence of Clifford gates on any of the n qubits. In the stabilizer formalism, this means that initially, the generators of the code stabilizer group are

$Z_{k+1}, Z_{k+2}, \ldots, Z_n.$   (1)

The task of the RL agent is to discover a suitable encoding sequence of gates for the particular error model under consideration. After applying each gate, the n − k code generators (1) are updated. The agent then receives a representation of these generators as input (as its observation) and suggests the next gate (action) to apply. In this way, an encoding circuit is built up step by step, taking into account the available gate set and connectivity for the particular hardware platform. This process terminates when the Knill-Laflamme conditions are satisfied for the target error channel, and the learned circuit can then be used to encode any state $|\psi\rangle$ of choice.

Reward. The most delicate matter in RL problems is building a suitable reward for the task at hand. Our goal is to design an agent that, given a list of (Pauli) errors {Eμ} with associated occurrence probabilities {pμ}, is able to find an encoding sequence that protects the quantum information from such noise.

Ideally, one would like to maximize the probability of successful recovery of the initial encoded state after decoding. Unfortunately, optimizing for this task is computationally too expensive. A much cheaper alternative is to use a scheme where the cumulative reward (which RL optimizes) simply is maximized whenever all the Knill-Laflamme conditions are fulfilled. One implementation of this idea uses what we call the (negative) weighted Knill-Laflamme sum as an instantaneous reward, which we define as:

$r_t = -\sum_{\mu} \lambda_\mu K_\mu,$   (2)

where Kμ = 0 if the corresponding error operator Eμ satisfies the Knill-Laflamme conditions, and Kμ = 1 otherwise, and where the λμ are real positive hyperparameters weighting each error. If all errors in {Eμ} can be detected, the reward is zero, and it is negative otherwise, thus leading the agent towards short gate sequences. In particular, note that the agent is not explicitly incentivized to minimize circuit depth or to place gates in parallel. However, reinforcing short gate sequences may sometimes also lead to a small circuit depth. The range of the index μ is found by counting the number of Pauli strings of weight w < d, which is

$|\{E_\mu\}|_{w<d} = \sum_{w=0}^{d-1} \binom{n}{w} 3^w,$   (3)

where the factor of three is for X, Y, Z Pauli errors. Thus, the fact that (3) grows exponentially with d will impose the most severe limitation on our approach (as is the case in any QEC application). Later, we will also be interested in situations where not all errors can be corrected simultaneously and a good compromise has to be found. In that case, one simple heuristic choice for the reward (2) would be λμ = pμ, giving more weight to errors that occur more frequently. While we will later see that maximizing the Knill-Laflamme reward given here is not precisely equivalent to maximizing the state recovery probability, one can still expect a reasonable performance at this task, and indeed this is what we find in our work.

Noise-aware meta-agent. Regarding the error channel to be targeted, there are in principle several choices that can be made. The most straightforward one is choosing a global depolarizing channel (see "Methods" (8)). This still allows for asymmetric noise, i.e., different probabilities pX, pY, pZ. One option would be to train an agent for any given, fixed choice of these probabilities, necessitating retraining if these characteristics change. However, we want to go beyond that and build a single agent capable of deciding what is the optimal encoding strategy for any level of bias in the noise channel (11). For instance, we want this noise-aware agent to be able to understand that it should prioritize detecting more Z errors than X ones when the channel is biased towards Z, yet it should do the opposite when X errors become more likely. This translates into two aspects: the first one is that the agent has to receive the noise parameters as input. In the illustrative example further below, we will choose to supply the bias parameter cZ = log pZ / log pX (see "Methods") as an extra observation, while keeping the overall error probability fixed. The second aspect is that the list of error operators will have to contain more operators than the total number that can actually be detected reliably, since it is now part of the agent's task to prioritize some of those errors while ignoring the least likely ones. All in all, the list of operators participating in the reward (2) will be fixed, and we will vary cZ during training.

Vectorized Clifford simulator. RL algorithms exploit guided trial-and-error loops until a signal of a good strategy is picked up and convergence is reached, so it is of paramount importance that simulations of our RL environment are extremely fast. Thanks to the Gottesman-Knill theorem, the Clifford circuits needed here can be simulated efficiently on classical computers. Optimized numerical implementations of Clifford circuits exist, e.g., Stim36. However, in an RL application we want to be able to run multiple circuits in parallel in an efficient, vectorized way that is compatible with modern machine learning frameworks. For that reason, we have implemented our own special-purpose vectorized GPU Clifford simulator (described in detail in Methods), which is publicly available in our repository37. When compared to Stim, we find a ~50× speedup at simulating random Clifford circuits and a ~450× speedup when restricted to the simulation of Calderbank-Shor-Steane (CSS) codes (see "Methods"). In particular, we can simulate 8000 random Clifford circuits of 1000 gates on 49 qubits in under a second. However, note that our simulator is not capable of sampling noisy circuits, which is the main application of Stim.

Reinforcement learning results
We will first illustrate the basic workings of our approach for a symmetric noise channel before showing the noise-aware meta-agent that is able to simultaneously discover strategies for a range of noise models.

Codes in a symmetric depolarizing noise channel. We now show the versatility of our approach by discovering a library of different [[n, k, d]] codes and their associated encoding circuits.

We fix the error model to be a symmetric depolarizing channel and consider different target code distances (from 3 to 5). The corresponding target error set is Eμ = {I, Xi, Yi, Zi, XiXj, …, ZiZj} for d = 3, and likewise for d = 4, 5, with the set for d = 5 including all Pauli string operators of up to weight 4. For illustrative purposes, we start by taking the gate set to be {Hi, CNOT(i < j)}, i.e., a directed all-to-all connectivity, which is sufficient given that our unencoded logical state is on the first k qubits by design. Nevertheless, we will also see examples with other connectivities and alternative gate sets. The error probability p is fixed, meaning pI = 1 − 3p, pX = pY = pZ = p, and thus no noise parameter is needed as an observation to the agent.

For d = 3 and d = 4 codes we proceed as follows: for any given target [[n, k, d]], we launch a few training runs. Once the codes are collected, we categorize them by calculating their quantum weight enumerators (see "Methods"), leading to a certain number of non-degenerate and degenerate code families.
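As a concrete illustration of the Knill-Laflamme-based reward of Eq. (2) (our own sketch; the authors' vectorized implementation is in their repository37), one can check detectability of every Pauli error of weight below d for the [[5, 1, 3]] perfect code, using the binary symplectic representation of Pauli strings.

```python
import numpy as np
from itertools import combinations, product

# Sketch of the Knill-Laflamme reward of Eq. (2) for the [[5, 1, 3]] code.
# A Pauli string is a binary (x | z) vector; two Paulis anticommute iff
# their symplectic product x1.z2 + z1.x2 is odd. For this nondegenerate
# code, an error is detectable iff it anticommutes with at least one
# stabilizer generator (errors inside the stabilizer group, which matter
# for degenerate codes, are not handled in this sketch).

def pauli_vec(pauli):
    """'XZZXI' -> binary (x | z) vector of length 2n."""
    n = len(pauli)
    v = np.zeros(2 * n, dtype=np.uint8)
    for q, p in enumerate(pauli):
        if p in 'XY':
            v[q] = 1
        if p in 'ZY':
            v[n + q] = 1
    return v

def anticommute(v, w, n):
    return (int(v[:n] @ w[n:]) + int(v[n:] @ w[:n])) % 2 == 1

n = 5
stabilizers = [pauli_vec(s) for s in ('XZZXI', 'IXZZX', 'XIXZZ', 'ZXIXZ')]

# All nontrivial Pauli errors of weight w < d = 3, as counted by Eq. (3).
errors = []
for w in (1, 2):
    for qubits in combinations(range(n), w):
        for letters in product('XYZ', repeat=w):
            s = ['I'] * n
            for q, l in zip(qubits, letters):
                s[q] = l
            errors.append(pauli_vec(''.join(s)))

# K_mu = 0 if E_mu is detected, 1 otherwise; equal weights lambda_mu = 1.
K = [0 if any(anticommute(e, g, n) for g in stabilizers) else 1
     for e in errors]
reward = -sum(K)
```

All 105 nontrivial errors of weight ≤ 2 are detected, so the reward reaches its maximum value of zero; including the identity, this matches the count of Eq. (3) for n = 5, d = 3: 1 + 15 + 90 = 106 Pauli strings.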
code families. We repeat this process and keep launching new training runs with an encoding circuit consisting of 32 gates in the minimal example,
until no new families are found. In this way, our strategy presumably finds which we show in the Supplementary.
all stabilizer codes that are possible for the given parameters n, k, d, together The largest d = 5 code that we have considered here is [[15, 2, 5]],
with a suitable encoding circuit. Note that this statement is based on although we will later show larger codes. We have found a single code family
empirical observations. While successive training runs do not yield new with weight enumerators
code families, this does not exclude the possibility of there being more. This
total number of families is shown in Fig. 2, with labels (x, y) for each [[n, k,
A ¼ ð1; 0; 0; 0; 0; 0; 23; 96; 361; 776; 1318; 1832;
d]], where x is the number of non-degenerate families and y is the number of
degenerate ones. It should be stressed that categorizing all stabilizer code 1814; 1304; 579; 88Þ;
ð5Þ
families is in general an NP-complete problem38, yet our framework is very B ¼ ð1; 0; 0; 0; 0; 101; 449; 1763; 5081; 12034;
effective at solving this task. To the best of our knowledge, this work provides 21722; 29366; 29622; 20489; 8661; 1783Þ:
the most detailed tabulation of (x, y) populations together with optimal
encoding circuits for the code parameters shown here.
This approach discovers suitable encoding circuits, given the assumed and an encoding circuit consisting of 49 gates shown in the Supplementary.
gate set, for a large set of codes. Among them are the following known codes Other successfully discovered d = 5 codes are shown in Methods, Fig. 4.
for d = 3 (see ref. 39 for explicit constructions of codes [[n, n − r, 3]] with
minimal r, for all n): The first one is the five-qubit perfect code11, which Noise-aware meta-agent. We now move on to codes in more general
consists of a single non-degenerate [[5, 1, 3]] code family and is the smallest asymmetric depolarizing noise channels. This lets us illustrate a powerful
stabilizer code that corrects an arbitrary single-qubit error. Next are the 10 aspect of RL-based encoding and code discovery: One and the same agent
families38 of [[7, 1, 3]] codes, one of which corresponds to Steane’s code12. can learn to switch its encoding strategy depending on some parameter
The smallest single-error-correcting surface code, Shor’s code10, is redis- characterizing the noise channel. This is realized by training this noise-
covered as one of the 143 degenerate code families with parameters [[9, 1, aware agent on many different runs with varying choices of the para-
3]]. The smallest quantum Hamming code40[[8, 3, 3]] is obtained as well. meter, which is fed as an additional input to the agent.
Our approach is efficient enough to discover codes with up to 20 physical In the present example, the parameter in question is the bias parameter
qubits in under 10 min, at which point we stopped increasing n. We also cZ ¼ log pZ = log pX . This allows the same agent to switch its strategy
include in the Supplementary the encoding circuit for a [[20, 13, 3]] code depending on the kind of bias present in the noise channel. The error set Eμ
consisting of a total of 45 gates. is now taken to be all Pauli strings of weight ≤4, i.e., {Eμ} = {I, Xi, Yi, Zi, XiXj,
The RL framework presented here easily allows to find encoding cir- …, ZiZjZkZl}, but their associated error probabilities will vary depending on
cuits for different connectivities. The connectivity affects the likelihood of cZ. For every RL training trajectory, a new cZ is chosen and the error
discovering codes within a certain family during RL training as well as the probabilities pμ are updated correspondingly.
typical circuit sizes. In Fig. 3 we illustrate this for the case of [[9, 3, 3]] codes, We apply this strategy to target codes with parameters n = 9, k = 1 in
with their 13 families, for two different connectivities: an all-to-all (directed, asymmetric noise channels. We allow a maximum number of 35 gates.
i.e., CNOT(i < j)) and a nearest-neighbor square lattice connectivity. On Moreover, we consider an all-to-all connectivity, taking as available gate set
average, the agent needs one less gate to prepare the encoding on the all-to- {Hi, Si, CNOT(i, j)}, where Si is the phase gate acting on qubit i.
all connectivity than when using the square lattice. This difference in circuit We discover codes with the following parameters: [[9, 1,
size is likely to become larger for larger qubit numbers. We also include in de(cZ = 0.5) = 2]], [[9, 1, de(cZ = 0.6) = 3]], [[9, 1, de(cZ = 1.4) = 4]], [[9, 1,
Methods examples using different gatesets and a larger variety of de(cZ = 2) = 5]], where de is the effective code distance, defined in Methods.
connectivities. To the best of our knowledge, the last two codes are new. Codes inbetween,
We now move to distance d = 5 codes. These are more challenging to 0.5 ≤ cZ < 0.6, have de = 2, 0.6 ≤ cZ < 1.4 have de = 3, and so on.
find due to the significantly increased number of error operators (3) to keep Next, we evaluate the performance of the noise-aware agent trained
track of, which impacts both the computation time and the hardness of with this strategy at minimizing the failure probability, defined in “Meth-
satisfying all Knill-Laflamme conditions simultaneously. Nevertheless, our ods”. The main results are shown in Fig. 5. We start by comparing the two
strategy is also successful in this case. It is known that the smallest possible best-performing post-selected agents according to minimizing the weighted
distance—5 code has parameters [[11, 1, 5]], a result that we confirm with Knill-Laflamme sum (green) and minimizing the failure probability
our strategy. We find the single family of this code to have weight enu- (orange), see Fig. 5a, b. There we see that there is a nice correlation between
merators, the two tasks, especially in the region cZ < 1. We also compare the smallest
undetected effective weight of the codes found by these two agents in Fig. 5c.
Surprisingly, the code found by the best agent according to the weighted
A ¼ ð1; 0; 0; 0; 0; 0; 198; 0; 495; 0; 330; 0Þ;
ð4Þ Knill-Laflamme sum (green) at cZ = 2 has de = 5, while the best code at
B ¼ ð1; 0; 0; 0; 0; 198; 198; 990; 495; 1650; 330; 234Þ; minimizing the failure probability (orange) has de = 4. However, at the
Fig. 4 | Families of d = 5 stabilizer codes found with RL. The labels (x, y) indicate
the number of non-degenerate (x) and degenerate (y) code families. The circuit size
shown is the absolute minimum throughout all families using a directed
(CNOT(i < j)) qubit connectivity.
on the specific value of cZ. Thus, the strategies found during training at a
fixed value of pI are readily usable in other situations.
We continue by analyzing the encoding circuits and code generators
for some selected values of cZ. These are chosen after computing the
quantum weight enumerators (see “Methods”), which we show in Fig. 6a.
There we see that the same code family is kept for 0.5 ≤ cZ < 0.9, where Z
errors are more likely than X/Y. From that point onward, the agent switches
to a new code family that is kept until the end (cZ = 2). We thus choose to
analyze the encoding circuits and their associated code generators for the
values cZ = {0.5, 0.9, 1.4, 2}. However, we remark that this particular code
switching only occurs for the best post-selected agent and there is a large
variety of strategies observed for the 714 meta-agents that we have trained,
both in terms of where the switching occurs and the number of switches.
We begin by showing the encoding circuits in Fig. 6b, highlighting
common motifs that are re-used across various values of cZ with different
colors, indicative of transfer learning. Another interesting behavior is that S
gates are used more prominently at small values of cZ, in particular in the
combination S ⋅ H. This gate combination implements a permutation: X →
Y, Y → Z, Z → X (ignoring signs), which is very useful to exchange Y by Z
efficiently. In situations where Z errors are more likely than X/Y, (cZ < 1),
this operation is beneficial. While we have been able to identify and interpret
this simple combination of gates with the naked eye, extracting general
Fig. 3 | Influence of connectivity. Characteristics of the 13 families of [[9, 3, 3]] principles from the discovered codes remains challenging but is nonetheless
codes found with our framework, clustered according to families distinguished by a valuable and important area that deserves further analysis.
their quantum weight enumerators (13). Families 9 and 13 are degenerate, while the Next, we show the code generators of such encoding circuits in Fig. 6c.
rest are non-degenerate. We have trained a total of 10240 agents for each of both Since the code used at cZ = 0.5 is the only one from a different code family, it
cases. In the all-to-all (directed: CNOT(i < j)) connectivity, 9574 agents were suc- is natural that its code generator pattern is the most distinct. However, we
cessful, while this number went down to 3808 in the other case. The bars display how see that the generators of the remaining values of cZ have similar structures.
these codes are distributed across different families. Codes in the same family found So far we have shown that a single meta-agent trained on different
by different agents are not necessarily distinct, so the bars are rather an indication of values of the noise bias parameter can find suitable strategies for all values of
the likelihood of a training run to find a code within the family. The points show the
such a parameter. Now, we want to compare the performance of such meta-
mean circuit size, averaged within each family, while the error bar is its standard
agent against an ensemble of agents that each have been trained on a single
deviation. It is interesting to see that even with different connectivities, families occur
with similar likelihoods during training. We explicitly list the corresponding
value of the noise bias parameter. The settings of this comparison are
quantum weight enumerators computed with (13) in the Supplementary. explained in Methods. The results are shown in Fig. 7. The first stark result is
that the simple agents perform rather bad at the extreme values cZ = 1.9 and
cZ = 2. Outside of these two points, they perform comparably to the best
meta-agent, even though the meta-agent strategy yields better performance
overall. This advantage is enabled by transfer learning, i.e., the idea that
specific point cZ = 2 these two codes perform equally well in terms of the patterns that work in one situation can be reused in other places effectively
failure probability (see Fig. 5b). (recall the common motifs from Fig. 6b). In our case, the meta-agent
Now we focus on the agent that performs best at minimizing the failure switched the code family as early as cZ = 0.9 (recall Fig. 6a), and all the
probability (orange) since it is the one of most interest in practical scenarios. experiences between cZ = 0.9 and cZ = 2 were useful in providing a superior
We begin by evaluating the performance of the same agent on different performance to that of the simple agents. Moreover, the noise-aware meta-
values of pI. This is shown in Fig. 5d. There where we see that the failure agent is able to provide predictions for all continuous values in the con-
probability asymptotically follows a power law with exponent ≳2 depending sidered range, while the simple agents cannot.
Fig. 5 | Performance of the noise-aware RL agent. The agent finds n = 9, k = 1 codes undetected effective weight (effective code distance is the integer part) as a function
and encoding circuits, simultaneously for different levels of noise bias cZ, with single- of the noise bias parameter cZ. While there is almost a perfect overlap between both
qubit fidelity pI = 0.9. In panels a,b,c, green represents the agent that was post- best agents until cZ = 1.1, the situation changes afterwards, leading at cZ = 2 to a de = 5
selected among all trained agents for performing best at minimizing the weighted code (green) or a de = 4 code (orange) that perform equally well in terms of the failure
Knill-Laflamme sum, averaged over all cZ values. Orange refers to the agent mini- probability, as seen in b. d Evaluation of the failure-probability of the best-
mizing the failure probability, averaged over cZ. a Weighted Knill-Laflamme sum as a performing agent (orange in the other panels) for larger values of pI (smaller errors)
function of the noise bias parameter cZ (best agent: green line). b Failure probability than the ones it was trained on.
as a function of the noise bias parameter cZ (best agent: orange line) (c) Smallest
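To see how the bias parameter cZ = log pZ / log pX skews the channel, one can fix pX and derive pZ. This is a hypothetical parametrization for illustration only (the exact single-qubit channel used in training is Eq. (11) in Methods); it shows why cZ < 1 corresponds to Z-biased noise.

```python
import math

# Illustration (assumed parametrization, not the paper's Methods Eq. (11)):
# invert the defining relation c_Z = log(p_Z) / log(p_X) by fixing p_X.

def p_z_from_bias(p_x, c_z):
    """For 0 < p_X < 1, c_Z = log(p_Z)/log(p_X) gives p_Z = p_X ** c_Z."""
    return p_x ** c_z

p_x = 0.03
p_z_biased = p_z_from_bias(p_x, 0.5)      # c_Z < 1: Z errors dominate
p_z_suppressed = p_z_from_bias(p_x, 2.0)  # c_Z > 1: Z errors suppressed
```

Since 0 < pX < 1, any cZ < 1 yields pZ > pX (the Z-biased regime where the agent favored the S ⋅ H motif), while cZ > 1 suppresses Z errors, consistent with the strategy switching reported above.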
Scaling automated QEC discovery
In this final section we explore to what extent our RL-based strategy can be scaled up. We will see that by restricting to CSS10,12 codes (which are a subclass of stabilizer codes) we are able to reduce the computational demands of our algorithms, leading to better estimated scaling with larger code parameters.

In order to exclusively target CSS codes, it is sufficient to constrain the structure of the circuit to contain an initial layer of Hadamard gates applied to a subset of the qubits, followed by CNOT gates thereafter (see "Methods" for a proof).

There are several possible modifications that we could make to our RL strategy in order to target CSS codes, which we discuss in Methods. In this work, we choose a mixed human-AI strategy where we decide the content of the Hadamard layer (i.e., how many gates and where they are placed) and where the agent has to discover suitable CNOT blocks. In this way, we simplify the task of the agent as much as possible.

We have tested this approach by targeting weakly self-dual codes (meaning the Hadamard layer contains num(H) = (n − k)/2 gates) of distance d = 5, using next-to-nearest neighbor CNOT connectivity and placing the initial Hadamard gates at alternating qubit indices.

We have found that we can discover [[17, 1, 5]] codes (with num(H) = 8) from scratch, together with their encoding circuits. An example of such a discovered circuit is shown in Fig. 8. It consists of 8 Hadamard gates (that we chose) and a remaining sequence of 46 CNOT gates discovered by the agent. The few CNOTs that connect seemingly distant qubits are due to allowing periodic boundary conditions. An interesting strategy that the agent uses is first building Bell pairs between adjacent qubits (which are [[2, 0, 2]] codes) and then entangling these pairs with each other to gradually build up a d = 5 code. We remind the reader that the largest (non-CSS) code that we had shown in previous sections was [[15, 2, 5]], and it needed roughly 4 h of compute time. The [[17, 1, 5]] code presented here only needs around 20 min.

An interesting observation is that the strategy of initially creating Bell pairs is persistent. We thus consider a final scenario where we initialize the circuit with neighboring Bell pairs and ask the agent to complete the encoding circuit. Now we focus on [[25, 1, 5]], due to these parameters being compatible with the first d = 5 surface code. We present an example of such a discovered code with its encoding circuit in a next-to-nearest neighbor connectivity in Fig. 8. It uses a total of 83 gates, where the last 59 CNOT gates were discovered by the agent and took around 2 h to train. If we instead ask the agent to start from a circuit where only the Hadamard layer is provided, it still finds good encodings. The drawback is that it takes longer to train, and the agent still prepares the Bell pairs (but has to learn to do so). We remark that these code parameters are by no means the upper limit of what is possible with our strategy.
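The claim that an initial Hadamard layer followed by CNOTs always yields a CSS code can be checked directly in the binary picture used by our simulator. The following is a minimal sketch (not the QDX implementation; the example circuit and qubit indices are illustrative): starting from the stabilizers of |0…0⟩, a Hadamard swaps the X and Z parts on its qubit, while a CNOT propagates X from control to target and Z from target to control, so every generator stays X-only or Z-only.

```python
import numpy as np

def final_tableau(n, h_qubits, cnots):
    # stabilizers of |0...0>: one Z per qubit; row i is generator g_i as (x | z) bits
    x = np.zeros((n, n), dtype=np.uint8)
    z = np.eye(n, dtype=np.uint8)
    for q in h_qubits:            # H swaps the X and Z parts on qubit q
        x[:, q], z[:, q] = z[:, q].copy(), x[:, q].copy()
    for c, t in cnots:            # CNOT: X spreads control -> target, Z spreads target -> control
        x[:, t] ^= x[:, c]
        z[:, c] ^= z[:, t]
    return x, z

# Hadamards on alternating qubits, then an arbitrary CNOT sequence
x, z = final_tableau(4, [0, 2], [(0, 1), (2, 3), (1, 2)])
is_css = all(rx.sum() == 0 or rz.sum() == 0 for rx, rz in zip(x, z))
print(is_css)  # True: every generator is X-only or Z-only
```

Any other CNOT sequence after the same Hadamard layer gives the same conclusion, in line with the proof in Methods.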
Fig. 6 | Characteristics of the 9-qubit codes and encodings found by the noise-aware meta-agent post-selected for minimizing the failure probability. a Associated code family according to their (symmetric) weight enumerators A, B. The same code family is used from 0.5 ≤ cZ < 0.9, while a family switch occurs at cZ = 0.9, and it is kept until cZ = 2. b Encoding circuits: here we see that many small gate sequences (highlighted with different colors) are reused across different values of cZ. This is an indication of transfer learning, i.e., the power of the meta-agent. We remark that while the agent does not place gates in parallel, the circuits shown here display gates in parallel for compactness. c Code generators gi corresponding to the encoding circuits. To aid visualization, we have chosen different colors for different Pauli matrices. However, since our scenario is by construction symmetric in X/Y, we represent X and Y by the same color, i.e., we do not make a distinction between X and Y. Here we see that the code generators gi vary across different values of cZ.
recent quasi-cyclic codes from ref. 14 in the near future. To achieve such a milestone, one should be able to target LDPC codes directly. As a starting point, one could add an additional term in the reward that penalizes stabilizers with large weights. This would not be guaranteed to work out of the box, as one would need to tune the relative importance of the original Knill-Laflamme term and this new term through some new hyperparameter. In addition, stabilizer generators of LDPC codes must also be local, meaning that their weight must be distributed along neighboring qubits for efficient measurement cycles. Finally, there is a large degeneracy in how the code generators are chosen: there are many possible choices of which n − k Pauli strings out of the 2^{n−k} elements of the stabilizer group are the stabilizer generators, leading to different stabilizer weights. All in all, we believe that, while promising, substantial innovations are needed in order to discover LDPC codes with such an RL-based strategy. However, the payoff would be quite substantial: a strategy based on RL would not be restricted to the particular ansatz of quasi-cyclic codes. In addition, not only would the codes be discovered, but their encoding circuits would also be automatically known.

One of the limits of our approach is GPU memory. However, this could be circumvented through different means. While it is always possible to trade performance for memory load, the tendency to train very large AI models is driving both the development of novel hardware with increased memory capabilities and the integration of distributed computing options in modern machine learning libraries. These developments make us envision scenarios where the framework presented in this work could be scaled up straightforwardly to multiple GPU machines. This makes us optimistic about AI-discovered QEC in the very near future.
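To see why restricting to CSS codes eases this memory pressure, one can count the error operators that must be stored to reward the agent. The sketch below is our own back-of-the-envelope count (Eq. (3) of the paper is not reproduced in this excerpt; we use the standard count of Pauli strings of weight below d, with three non-identity Paulis per supported site), compared against the CSS count derived in Methods:

```python
from math import comb

def num_errors_stabilizer(n, d):
    # all Pauli strings of weight < d: choose the support, then one of the
    # three non-identity Paulis per supported site (standard counting)
    return sum(comb(n, w) * 3**w for w in range(d))

def num_errors_css(n, d):
    # X- and Z-type errors are detected independently: a factor 2, no 3**w
    return 2 * sum(comb(n, w) for w in range(d))

n, d = 25, 5
print(num_errors_stabilizer(n, d))  # 1089526
print(num_errors_css(n, d))         # 30552
```

For the [[25, 1, 5]] parameters this is already a reduction of more than an order of magnitude, which is the effect visible in Fig. 9.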
Methods
Stabilizer codes
The stabilizer formalism. Some of the most promising QEC codes are based on the stabilizer formalism15, which leverages the properties of the Pauli group Gn on n qubits. The basic idea of the stabilizer formalism is that many quantum states of interest for QEC can be more compactly described by listing the set of n operators that stabilize them, where an operator O stabilizes a state |ψ⟩ if |ψ⟩ is an eigenvector of O with eigenvalue +1: O|ψ⟩ = |ψ⟩. The Pauli group on a single qubit, G1, is defined as the group generated by the Pauli matrices X, Y, Z under matrix multiplication. Explicitly, G1 = { ±I, ±iI, ±X, ±iX, ±Y, ±iY, ±Z, ±iZ}. The generalization to n qubits consists of all n-fold tensor products of Pauli matrices (called Pauli strings).

Fig. 7 | Noise-aware meta-agent vs ensemble of agents trained on fixed single values of noise. They have comparable performance at minimizing the failure probability (smaller is better), but the simple agents perform badly at larger values of cZ. The noise-aware meta-agent reaches a superior performance by reusing useful sub-circuits across different values of cZ and can provide encoding circuits for all continuous values of cZ.

A code that encodes k logical qubits into n physical qubits is a 2^k-dimensional subspace (the code space C) of the full 2^n-dimensional Hilbert space. It is completely specified by the set of Pauli strings SC that stabilize it, i.e., SC = {si ∈ Gn | si|ψ⟩ = |ψ⟩, ∀|ψ⟩ ∈ C}. SC is called the stabilizer group of C and is usually written in terms of its group generators gi as SC = ⟨g1, g2, …, g_{n−k}⟩, where each gi is a Pauli string.
either

{Eμ, gi} = 0,   (9)

or

Eμ ∈ SC.   (10)
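For small systems, condition (9) can be checked directly with dense matrices. The following is a minimal illustration with explicit Pauli matrices (our own example; the paper's implementation instead uses the binary symplectic formalism described below):

```python
import numpy as np
from functools import reduce

I = np.eye(2)
X = np.array([[0, 1], [1, 0]])
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0])
PAULI = {"I": I, "X": X, "Y": Y, "Z": Z}

def pauli(string):
    # tensor product of single-qubit Paulis, e.g. "ZZI" -> Z (x) Z (x) I
    return reduce(np.kron, [PAULI[c] for c in string])

def anticommutes(a, b):
    return np.allclose(a @ b + b @ a, 0.0)

# A weight-1 error X on the first qubit anticommutes with the generator ZZI,
# so condition (9) holds and that generator flags the error.
print(anticommutes(pauli("XII"), pauli("ZZI")))  # True
print(anticommutes(pauli("XII"), pauli("IZZ")))  # False: they commute
```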
The smallest weight in Gn for which none of the above two conditions holds is called the distance of the code. For instance, a distance-3 code is capable of detecting all Pauli strings of up to weight 2, meaning that the Knill-Laflamme conditions (9), (10) are satisfied for all Pauli strings of weights 0, 1 and 2. Moreover, the smallest weight for which these are not satisfied is 3, meaning that there is at least one weight-3 Pauli string violating both (9) and (10). However, some weight-3 Pauli strings (and higher weights) will satisfy the Knill-Laflamme conditions, in general.

While these conditions are framed in the context of quantum error detection, there is a direct correspondence with quantum error correction. Indeed, a quantum code of distance d can correct all errors of up to weight t = ⌊(d − 1)/2⌋15. If all the errors that are detected with a weight smaller than d obey (9), the code is called non-degenerate. On the other hand, if some of the errors satisfy (10), the code is called degenerate.

Fig. 9 | Scaling CSS code and encoding discovery to larger code parameters. We show the fraction of the 80 GB of GPU memory needed (NVIDIA A100 GPU) to store all the error operators that are required to reward the agent. We also show for comparison the memory load of stabilizer (non-CSS) code discovery for code distance d = 10. We identify a region of opportunity where our RL strategy could outperform some of the qLDPC codes found in ref. 14 in the near future.

Quantum noise. Noise affecting quantum processes can be represented using the so-called operator-sum representation41, where a quantum noise channel N induces dynamics on the state ρ according to

N(ρ) = ∑α Eα ρ Eα†,   (6)

where the Eα are Kraus operators satisfying ∑α Eα† Eα = I. The most elementary example is the so-called depolarizing noise channel,

N_DP(ρ) = pI ρ + pX XρX + pY YρY + pZ ZρZ,   (7)

where pI + pX + pY + pZ = 1 and the set of Kraus operators is Eα ∈ {√pI I, √pX X, √pY Y, √pZ Z}. When considering n qubits, one can generalize the depolarizing noise channel by introducing the global depolarizing channel,

N_GDP(ρ) = ⊗_{j=1}^{n} N_DP^{(j)}(ρj).   (8)

Asymmetric codes. The default weight-based [[n, k, d]] classification of QEC codes implicitly assumes that the error channel is symmetric, meaning that the probabilities of Pauli X, Y, and Z errors are equal. However, this is usually not the case in experimental setups: for example, dephasing (Z errors) may dominate bit-flip (X) errors. In our work, we consider an asymmetric noise channel where pX = pY but pX ≠ pZ. To quantify the asymmetry, we use the bias parameter cZ35, defined as

cZ = log pZ / log pX.   (11)

For symmetric error channels, cZ = 1. If Z errors dominate, then 0 < cZ < 1, since pZ = pX^cZ and pX, pZ ≪ 1; conversely, cZ > 1 when X/Y errors are more likely than Z errors.

The weight of operators and the code distance can both be generalized to asymmetric noise channels44–47. Consider a Pauli string operator Eμ and denote by wX the number of Pauli X operators inside Eμ (and likewise for Y, Z). Then one can introduce the cZ-effective weight35 of Eμ.
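As a quick numerical illustration of Eq. (11) (the probabilities below are illustrative, not values used in the paper):

```python
import math

def bias(p_x, p_z):
    # c_Z = log(p_Z) / log(p_X), Eq. (11)
    return math.log(p_z) / math.log(p_x)

p = 1e-3
print(bias(p, p))             # -> 1.0 (symmetric channel)
print(bias(p, p ** 2))        # ~ 2.0: Z errors rarer, X/Y dominate (c_Z > 1)
print(bias(p, math.sqrt(p)))  # ~ 0.5: Z errors dominate (c_Z < 1)
```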
Table 1 | PPO hyperparameters and typical values

Hyperparameter     Value
LR                 (1–5) × 10^−4
NUM_ENVS           8–1024
NUM_STEPS          8–32
NUM_EPOCHS         1000–12000
UPDATE_EPOCHS      2–4
NUM_MINIBATCHES    8–128
GAMMA              0.99
GAE_LAMBDA         0.95
CLIP_EPS           0.1–0.2
ENT_COEF           0.01–0.05
VF_COEF            0.5
MAX_GRAD_NORM      0.05–0.25
ANNEAL_LR          True
where w is the operator (cZ = 1) weight, j runs from 0 to n, and PC is the orthogonal projector onto the code space. Intuitively, Aj counts the number of error operators of weight j in SC, while Bj counts the number of error operators of weight j that commute with all elements of SC. Logical errors are thus the ones that commute with SC but are not in SC, and these are counted by Bj − Aj.

Such a classification is especially useful in scenarios with symmetric noise channels, where it is irrelevant whether the undetected errors contain a specific Pauli operator at a specific position. However, such a distinction can in principle be important in asymmetric noise channels. One could in principle generalize (13) to asymmetric noise channels by substituting the weight w with the effective weight we of operators, but then comparing codes across different values of noise bias becomes cumbersome. Hence, in the present work we always refer to (symmetric) code families according to (13) for all values of cZ, i.e., we effectively pretend that cZ = 1 when computing the weight enumerators of asymmetric codes.

Implementation and hyperparameters. We use the PPO implementation of ref. 52, which we break down in more detail here (see also Fig. 10 and Table 1 for a list of hyperparameters). In our implementation, the RL environment is vectorized, meaning that the agent interacts with multiple different quantum circuits at the same time. The hyperparameter that determines this number of RL environments is called NUM_ENVS. The learning algorithm consists of two processes: collect and update. During collection, the agent interacts with the environments and a total of NUM_STEPS sequences of (observation, action, reward) are collected per environment. Following the collection, the update process begins. Here, we have a total of NUM_ENVS * NUM_STEPS individual steps that are shuffled and reshaped into NUM_MINIBATCHES minibatches (each of size NUM_ENVS * NUM_STEPS // NUM_MINIBATCHES). These are used for updating the weights of the neural networks through gradient ascent, which happens UPDATE_EPOCHS times during every update process. The whole collection-update cycle is repeated NUM_EPOCHS times.
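The collect-and-update bookkeeping can be illustrated with plain array arithmetic. A sketch with illustrative values from within the Table 1 ranges (the integers stand in for the real (observation, action, reward) tuples):

```python
import numpy as np

# illustrative values within the ranges of Table 1
NUM_ENVS, NUM_STEPS, NUM_MINIBATCHES = 8, 32, 4
rng = np.random.default_rng(0)

# collect: NUM_STEPS transitions per environment
batch = np.arange(NUM_ENVS * NUM_STEPS)

# update: shuffle all transitions, then reshape into minibatches
minibatches = rng.permutation(batch).reshape(NUM_MINIBATCHES, -1)
print(minibatches.shape)  # (4, 64): each of size NUM_ENVS * NUM_STEPS // NUM_MINIBATCHES
```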
Reinforcement learning
Reinforcement Learning (RL)49 is designed to discover optimal action sequences in decision-making problems. The goal in any RL task is encoded by choosing a suitable reward r, a quantity that measures how well the task has been solved, and consists of an agent (the entity making the decisions) interacting with an environment (the physical system of interest or a simulation of it).

The neural networks that we have chosen are standard feedforward fully-connected neural networks with ReLU activation functions and with identical architectures for both the actor and value networks, except for the output layer. In particular, they both consist of an input layer of size 2n(n − k) given by the observation from the environment, followed by two hidden layers of size h (we have experimented with sizes 16 to 400) and an output layer of size nA (the number of actions) in the case of the actor network and of size 1 for the value network (see Fig. 10). The number of actions nA is determined by the number of physical qubits, the available gate set and the qubit connectivity.

Fig. 11 | Example of a training run for [[7, 1, 3]] code discovery. a Return and circuit size during training. b Details of the data calculation pipeline and complete set of hyperparameters used for this run. Here, 4 parallel agents each interact with batches of 64 circuits processed in parallel. Each agent finds a different encoding circuit, and the training finishes in 20 s on a single GPU. The meaning of every hyperparameter is explained in Methods.
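A minimal numpy sketch of these architectures (illustrative only: the actual implementation uses JAX, and the value of n_actions below is made up rather than derived from a gate set):

```python
import numpy as np

def init_mlp(sizes, rng):
    # one (W, b) pair per layer of a fully-connected network
    return [(0.01 * rng.normal(size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)  # ReLU hidden layers
    W, b = params[-1]
    return x @ W + b                    # linear output layer

n, k, h, n_actions = 7, 1, 64, 42       # n_actions is illustrative
rng = np.random.default_rng(0)
actor = init_mlp([2 * n * (n - k), h, h, n_actions], rng)
value = init_mlp([2 * n * (n - k), h, h, 1], rng)

obs = np.zeros(2 * n * (n - k))         # flattened binary tableau observation
print(forward(actor, obs).shape, forward(value, obs).shape)  # (42,) (1,)
```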
Other hyperparameters that participate in the PPO implementation, which we include for completeness (we refer to ref. 51 for further explanations), are the discount factor γ, the generalized advantage estimator (GAE) parameter λ, the actor loss clipping parameter ε, the entropy coefficient and the value function (VF) coefficient (see Table 1 for typical values that we have found to work well).
Regarding the optimizer itself, we use ADAM with clipping of the gradient norm (MAX_GRAD_NORM) and an initial learning rate (LR) that gets annealed (ANNEAL_LR) using a linear schedule as the training evolves; see Table 1 for specific numerical values of these hyperparameters.
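Both ingredients are simple to state in code. A sketch of the linear learning-rate schedule and gradient-norm clipping (our own minimal versions, not the optimizer code from the repository):

```python
import numpy as np

def linear_lr(step, total_steps, lr0=3e-4):
    # ANNEAL_LR: decay the learning rate linearly to zero over training
    return lr0 * (1.0 - step / total_steps)

def clip_by_norm(grad, max_norm):
    # MAX_GRAD_NORM: rescale the gradient if its norm exceeds the threshold
    norm = np.linalg.norm(grad)
    return grad * min(1.0, max_norm / norm) if norm > 0 else grad

print(linear_lr(0, 1000), linear_lr(500, 1000), linear_lr(1000, 1000))
print(np.linalg.norm(clip_by_norm(np.array([3.0, 4.0]), 0.25)))  # norm now <= MAX_GRAD_NORM
```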
Next, we show an example of a typical training trajectory in Fig. 11, together with all the hyperparameter values that were used and the execution time on a single NVIDIA Quadro RTX 6000 GPU. There, 4 agents are tasked to find [[7, 1, 3]] codes, which each of them completes successfully, running in parallel in 20 s. The error channel is chosen to be global symmetric depolarizing with pI = 0.9 (i.e., pX = pY = pZ = (1 − pI)/3). The average circuit size starts at 20 by design, i.e., if no code has been found after 20 gates, the circuit gets reinitialized. This number starts decreasing when codes start being found, and it saturates to a final value which is in general different for each agent. As a final remark, running the same script on a CPU node with two Xeon Gold 6130 processors takes 7 min 40 s.

Finally, we show how the runtime scales when increasing the number of physical qubits n and the code distance d in Fig. 12. In order to get a meaningful comparison, we fix all other hyperparameters to be identical to those shown in Fig. 11. We remark that in general the agents will not have converged to a successful encoding sequence given the allotted resources.

Fig. 12 | Time to reach 1 million training steps. Execution time of training trajectories of 4 parallel agents (on a single GPU) with identical hyperparameters as those shown in Fig. 11, with different numbers of physical qubits n and code distances d (but keeping the number of logical qubits k = 1).

Clifford simulator
Here we give more details on the implementation of our simulations, which are based on the binary symplectic formalism16 of the Pauli group and have been optimized to be compatible with modern vectorized machine learning frameworks running on Graphics Processing Units (GPUs). All the operations that are required both for simulating the quantum circuits and for computing the reward have been implemented using binary linear algebra. Our Clifford simulator is implemented using JAX53, a state-of-the-art modern machine learning framework with good vectorization and just-in-time compilation capabilities. On top of that, we also train multiple RL agents in parallel on a single GPU. This is achieved by interfacing with PUREJAXRL52, a library that offers a high-performance end-to-end JAX RL implementation. The source code for our project is available on GITHUB under the name QDX37, which is an acronym for Quantum Discovery with JAX. It includes the Clifford simulator, the PPO algorithm and demo Jupyter notebooks to reproduce some of our main results.

A stabilizer generator gi is formally represented as a Pauli string P1 ⊗ P2 ⊗ ⋯ ⊗ Pn, where Pi ∈ {I, X, Y, Z} is any Pauli operator, and numerically as a binary vector of size 2n. For example, the Pauli matrices are represented as I = (0, 0), X = (1, 0), Y = (1, 1), Z = (0, 1), and a general Pauli string is represented as (x1, …, xn, z1, …, zn), where all xi and zi are either 0 or 1. For instance, the binary vector (1, 1, 0, 0, 0, 1, 1, 0) represents the Pauli string XYZI. Matrix multiplication gets mapped to binary addition (ignoring global phases), e.g.,

X · Y = Z ↔ (1, 0) + (1, 1) = (0, 1) (mod 2).   (14)
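This binary representation and Eq. (14) can be reproduced in a few lines (a sketch of the encoding itself, not the vectorized JAX simulator):

```python
import numpy as np

def to_binary(pauli):
    # (x_1..x_n, z_1..z_n) with X=(1,0), Z=(0,1), Y=(1,1), I=(0,0)
    x = [1 if p in "XY" else 0 for p in pauli]
    z = [1 if p in "ZY" else 0 for p in pauli]
    return np.array(x + z, dtype=np.uint8)

def multiply(a, b):
    # Pauli multiplication up to a global phase is a bitwise XOR, Eq. (14)
    return (a + b) % 2

print(to_binary("XYZI"))                         # [1 1 0 0 0 1 1 0]
print(multiply(to_binary("X"), to_binary("Y")))  # [0 1], i.e. Z
```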
code generator gi anticommutes with any given error operator. This means that the result has to be transformed into a binary vector of size num(Eμ), where a 1 means that the first Knill-Laflamme condition, Eq. (9), is satisfied for the corresponding operator Eμ, and a 0 that it is not.

The second Knill-Laflamme condition, Eq. (10), requires checking whether any error operator Eμ ∈ SC. In principle, the full stabilizer group of 2^{n−k} elements must be built at every time step of our simulations. For the physical qubit numbers that we have considered in our work, this computation is still fast enough, becoming more challenging for n − k ≥ 13. In practice, not many error operators end up being in SC, which we leverage by introducing a softness parameter s such that only a subgroup of SC is built. More precisely, s = 0 means that this subgroup is empty, s = 1 means taking only the generators gi as the subgroup, s = 2 means taking the generators gi and all pairwise products of generators gigj, and so on for larger s.
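The softness construction can be sketched as follows (illustrative two-qubit generators; the real implementation works on batched binary arrays on the GPU):

```python
from itertools import combinations
import numpy as np

def subgroup(generators, s):
    # products of up to s generators; in the binary picture a product is a XOR
    elements = set()
    for r in range(1, s + 1):
        for combo in combinations(generators, r):
            g = np.bitwise_xor.reduce(np.array(combo))
            elements.add(tuple(int(v) for v in g))
    return elements

# toy 2-qubit generators in (x1, x2, z1, z2) form: XX and ZZ
gens = [np.array([1, 1, 0, 0], dtype=np.uint8), np.array([0, 0, 1, 1], dtype=np.uint8)]
print(len(subgroup(gens, 1)))  # 2: only the generators themselves
print(len(subgroup(gens, 2)))  # 3: also their product XX*ZZ = YY (up to phase)
```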
by which we mean that for every cZ, the corresponding set of pμ's gets normalized by the maximal value of pμ in that set. We choose pI = 0.9, even though both slightly smaller and larger values around pI ≈ 0.9 perform equally well. However, going below pI ≲ 0.8 or above pI ≳ 0.95 comes with different challenges. In the former case (large errors), we lose the important property that the sum of pμ's decreases as a function of weight, (∑μ pμ)w=1 > (∑μ pμ)w=2 > …. In the latter case (small errors), the range of values of pμ is so large that one would need to use a 64-bit floating-point representation to compute the reward with sufficient precision. Since both RL algorithms and GPUs are currently designed to work best with 32-bit precision, we decide to avoid this range of values for pI during training, but we will still evaluate the strategies found by the RL agent at different values of pI.
We allow a maximum of 35 gates before the trajectory gets reinitialized. Even though all encodings that the meta-agent outputs have circuit size 35, we notice that trivial gate sequences are applied at the last few steps, effectively reducing the overall gate count. We remark that this feature is not problematic: it means that the agent is done well before a new training run is launched, and the best thing it can do is to collect small negative rewards until the end. We manually prune the encodings to get rid of such trivial operations, and the resulting circuit sizes vary from 22 to 35, depending on the value of cZ.
layer and where the RL agent decides the content of the CNOT block. Here we comment on other possibilities.

The first would be to keep both H and CNOT gates as actions for the agent to use, but penalize the agent every time a Hadamard gate is used after a CNOT gate. This would in principle lead to an agent that learns the correct architecture for CSS codes, at the expense of having to fine-tune this new penalty term in the reward. We avoided this strategy because we did not want to introduce further hyperparameters. The second option would be a multi-agent scenario with two agents: one that only places Hadamards and another that only places CNOTs. While interesting, multi-agent tasks are typically harder to train and would involve redesigning our entire framework.
Circuit structure of CSS codes. Here we give a proof of the claim that codes resulting from circuits with an initial block of Hadamard gates on a subset of the qubits, followed by CNOT gates thereafter, can only be CSS.
Let us label physical qubits with index 1 ≤ q ≤ n and target a CSS code with parameters [[n, k, d]]. Let us assume for simplicity that the initial block of Hadamard gates is applied to qubits k + 1, …, k + nH, with nH < n − k. The initial tableau of the would-be code reads

g1 = X_{k+1},
g2 = X_{k+2},
⋮
g_{nH} = X_{k+nH},   (22)
g_{nH+1} = Z_{k+nH+1},
⋮
g_{n−k} = Z_n.

From this moment forward, only CNOT gates are allowed. Let us start by considering the effect of a CNOT gate with control qubit inside the H-block, i.e., control ∈ {k + 1, …, k + nH}. For whatever target qubit, what such a CNOT does is populate the target position of the corresponding stabilizer g_control with an X. Subsequent CNOT gates affecting those positions, either as control or target qubits, will either introduce additional X's or simply do nothing. Since X² = 1, the stabilizers g1, g2, …, g_{nH} will only ever contain either X's or 1's. Similarly, the effect of CNOTs on the stabilizers g_{nH+1}, …, g_{n−k} is simply to populate them with Z's or 1's. Since the set of stabilizer generators can be clearly separated into a subset built with only X's and 1's and another one with only Z's and 1's, such a tableau describes a CSS code.

Fig. 15 | Comparison between the performance of the noise-aware meta-agent vs simple agents. The results shown are averaged over their respective ensembles. Error bars are one standard deviation.

• Finally, we calculate the failure probability as the sum of all error probabilities except the most likely one in that given syndrome.

If the code is degenerate, there could still be the possibility that the actual error was misidentified and that, after correction, one could still have ended up with an "error" that is inside the stabilizer group. The contribution from these cases is negligible in our case and is thus ignored. However, one would in principle still have to consider them in a general scenario. In practice, one could still evaluate the codes discovered with our RL approach by substituting the decoder accordingly.

Noise-aware meta-agent vs an ensemble of simple agents. Here we explain the settings of this experiment (shown in Fig. 7) in order to make a fair comparison. There are 16 possible values of the bias parameter, cZ = {0.5, 0.6, …, 1.9, 2}. Since each meta-agent has seen instances of all 16 values, we only allow the single-cZ agents to be trained on one sixteenth of the total timesteps used for each meta-agent. In addition, the best post-selected meta-agent was selected out of 714 training runs. Therefore, we train 714 × 16 = 11424 single-cZ agents to make the comparison. All other hyperparameters are kept fixed.

We also include an extended statistical analysis over the entire ensemble of both meta-agents and simple agents in Fig. 15. There, we average over the respective ensembles and show the average performance of agents of each class, together with their standard deviations. We see that all simple agents consistently fail at minimizing the failure probability at large values of cZ. The larger error bars at smaller values of cZ for the meta-agents can also be interpreted as this class of more general agents allocating a larger effort to both exploration and generalization to other values of cZ.

CSS codes
A particularly useful subclass of stabilizer codes are CSS codes10,12. They are defined by their stabilizer generators containing either only X or only Z Pauli operators. This restriction is useful because X-type and Z-type errors are detected independently, thereby implying the detection of Y-type errors when the corresponding X- and Z-type stabilizers fire simultaneously. Moreover, strong contenders for implementation in large-scale quantum computations, such as surface codes or color codes, are of the CSS type.

Alternative strategies using RL. In the main text we have argued that CSS codes can be constructed by constraining the encoding circuit to be built from an initial layer of Hadamard gates and CNOTs thereafter. In order to adapt our RL strategy to CSS code discovery, we have considered a mixed human-AI strategy where we decide the Hadamard

GPU memory estimation. The independence of X- and Z-type error detection in CSS codes means that the number of error operators that we have to keep track of drastically reduces from (3) to

|{Eμ}_CSS|_{w ≤ d−1} = 2 ∑_{w=0}^{d−1} (n choose w),   (23)

where the overall factor of 2 counts both X- and Z-type errors. Thanks to the separability of X and Z in the stabilizer generators, the tableaus that we have to simulate are block-diagonal,

( gX  0 )
( 0   gZ ),   (24)

where gX is a binary matrix of size num(H) × n containing the X-type stabilizer generators, and gZ is of size (n − k − num(H)) × n and contains the representation of the Z-type generators. Here, num(H) is the number of Hadamard gates that are applied at the very beginning.

Separability of X- and Z-type error detection implies that gX must detect all Z-type errors (by the first Knill-Laflamme condition (9)), and correspondingly gZ all X-type errors. If the code is degenerate, it must happen that some X-type errors are elements of the stabilizer subgroup generated by gX, and likewise for Z.

All in all, this means that we can reduce the number of error operators (23) by a factor of 2 (since we use the same representation for both X- and Z-type errors). Each such error operator is a binary array of size n, which amounts to 8n bits of memory. We therefore estimate the memory usage by counting the number of error operators (23) (divided by 2, as argued above), times the number of binary digits that have to be specified for each of them, i.e., 8n.

Data availability
The data that support the findings of this study are openly available in the GitHub repository https://github.com/jolle-ag/qdx (ref. 37).

Code availability
The code that supports the findings of this study is openly available in the GitHub repository https://github.com/jolle-ag/qdx (ref. 37).

Received: 21 May 2024; Accepted: 15 November 2024;

References
1. Inguscio, M., Ketterle, W. & Salomon, C. Proceedings of the International School of Physics "Enrico Fermi." Vol. 164 (IOS Press, 2007).
2. Girvin, S. M. Introduction to quantum error correction and fault tolerance. SciPost Phys. Lect. Notes (2023).
3. Krinner, S. et al. Realizing repeated quantum error correction in a distance-three surface code. Nature 605, 669–674 (2022).
4. Ryan-Anderson, C. et al. Realization of real-time fault-tolerant quantum error correction. Phys. Rev. X 11, 041058 (2021).
5. Postler, L. et al. Demonstration of fault-tolerant universal quantum gate operations. Nature 605, 675–680 (2022).
6. Cong, I. et al. Hardware-efficient, fault-tolerant quantum computation with Rydberg atoms. Phys. Rev. X 12, 021049 (2022).
7. Acharya, R. et al. Suppressing quantum errors by scaling a surface code logical qubit. Nature 614, 676–681 (2023).
8. Sivak, V. et al. Real-time quantum error correction beyond break-even. Nature 616, 50–55 (2023).
9. Azuma, K. et al. Quantum repeaters: From quantum networks to the quantum internet. Rev. Mod. Phys. 95, 045006 (2023).
10. Calderbank, A. R. & Shor, P. W. Good quantum error-correcting codes exist. Phys. Rev. A 54, 1098–1105 (1996).
11. Laflamme, R., Miquel, C., Paz, J. P. & Zurek, W. H. Perfect quantum error correcting code. Phys. Rev. Lett. 77, 198–201 (1996).
12. Steane, A. M. Simple quantum error-correcting codes. Phys. Rev. A 54, 4741–4751 (1996).
13. Kitaev, A. Y. Quantum computations: algorithms and error correction. Russian Math. Surv. 52, 1191 (1997).
14. Bravyi, S. et al. High-threshold and low-overhead fault-tolerant quantum memory. Nature 627, 778–782 (2024).
15. Gottesman, D. Stabilizer codes and quantum error correction. Preprint at arXiv:quant-ph/9705052 (1997).
16. Aaronson, S. & Gottesman, D. Improved simulation of stabilizer circuits. Phys. Rev. A 70, 052328 (2004).
17. Grassl, M. & Han, S. Computing extensions of linear codes using a greedy algorithm. In 2012 IEEE International Symposium on Information Theory Proceedings 1568–1572 (IEEE, 2012).
18. Grassl, M., Shor, P. W., Smith, G., Smolin, J. & Zeng, B. New constructions of codes for asymmetric channels via concatenation. IEEE Trans. Inf. Theory 61, 1879–1886 (2015).
19. Li, M., Gutiérrez, M., David, S. E., Hernandez, A. & Brown, K. R. Fault tolerance with bare ancillary qubits for a [[7,1,3]] code. Phys. Rev. A 96, 032341 (2017).
20. Chuang, I., Cross, A., Smith, G., Smolin, J. & Zeng, B. Codeword stabilized quantum codes: Algorithm and structure. J. Math. Phys. https://doi.org/10.1063/1.3086833 (2009).
21. Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
22. Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 12 (1999).
23. Fösel, T., Tighineanu, P., Weiss, T. & Marquardt, F. Reinforcement learning with neural networks for quantum feedback. Phys. Rev. X 8, 031084 (2018).
24. Nautrup, H. P., Delfosse, N., Dunjko, V., Briegel, H. J. & Friis, N. Optimizing quantum error correction codes with reinforcement learning. Quantum 3, 215 (2019).
25. Mauron, C., Farrelly, T. & Stace, T. M. Optimization of tensor network codes with reinforcement learning. New J. Phys. 26, 023024 (2024).
26. Su, V. P. et al. Discovery of optimal quantum error correcting codes via reinforcement learning. Preprint at arXiv:2305.06378 (2023).
27. Cao, C. & Lackey, B. Quantum lego: Building quantum error correction codes from tensor networks. PRX Quantum 3, 020332 (2022).
28. Andreasson, P., Johansson, J., Liljestrand, S. & Granath, M. Quantum error correction for the toric code using deep reinforcement learning. Quantum 3, 183 (2019).
29. Sweke, R., Kesselring, M. S., van Nieuwenburg, E. P. & Eisert, J. Reinforcement learning decoders for fault-tolerant quantum computation. Mach. Learn. Sci. Technol. 2, 025005 (2020).
30. Colomer, L. D., Skotiniotis, M. & Muñoz-Tapia, R. Reinforcement learning for optimal error correction of toric codes. Phys. Lett. A 384, 126353 (2020).
31. Fitzek, D., Eliasson, M., Kockum, A. F. & Granath, M. Deep q-learning decoder for depolarizing noise on the toric code. Phys. Rev. Res. 2, 023230 (2020).
32. Metz, F. & Bukov, M. Self-correcting quantum many-body control using reinforcement learning with tensor networks. Nat. Mach. Intell. 5, 780–791 (2023).
33. Chao, R. & Reichardt, B. W. Quantum error correction with only two extra qubits. Phys. Rev. Lett. 121, 050502 (2018).
34. Zen, R. et al. Quantum circuit discovery for fault-tolerant logical state preparation with reinforcement learning. Preprint at arXiv:2402.17761 (2024).
35. Cao, C., Zhang, C., Wu, Z., Grassl, M. & Zeng, B. Quantum variational learning for quantum error-correcting codes. Quantum 6, 828 (2022).
36. Gidney, C. Stim: a fast stabilizer circuit simulator. Quantum 5, 497 (2021).
37. QDX: An AI discovery tool for quantum error correction codes. https://github.com/jolle-ag/qdx.
38. Yu, S., Chen, Q. & Oh, C. H. Graphical quantum error-correcting codes. Preprint at arXiv:0709.1780 (2007).
39. Yu, S., Bierbrauer, J., Dong, Y., Chen, Q. & Oh, C. All the stabilizer codes of distance 3. IEEE Trans. Inf. Theory 59, 5179–5185 (2013).
40. Gottesman, D. Class of quantum error-correcting codes saturating the quantum Hamming bound. Phys. Rev. A 54, 1862–1868 (1996).
41. Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information (Cambridge University Press, 2010).
42. Bennett, C. H., DiVincenzo, D. P., Smolin, J. A. & Wootters, W. K. Mixed-state entanglement and quantum error correction. Phys. Rev. A 54, 3824–3851 (1996).
43. Knill, E. & Laflamme, R. Theory of quantum error-correcting codes. Phys. Rev. A 55, 900 (1997).
44. Ioffe, L. & Mézard, M. Asymmetric quantum error-correcting codes. Phys. Rev. A 75, 032345 (2007).
45. Wang, L., Feng, K., Ling, S. & Xing, C. Asymmetric quantum codes: characterization and constructions. IEEE Trans. Inf. Theory 56, 2938–2945 (2010).
46. Ezerman, M. F., Ling, S. & Sole, P. Additive asymmetric quantum

Competing interests
codes. IEEE Trans. Inf. Theory 57, 5536–5550 (2011). The authors declare no competing interests.
47. Guardia, G. G. L. On the construction of asymmetric quantum codes.
Int. J. Theor. Phys. 53, 2312–2322 (2014). Additional information
48. Shor, P. & Laflamme, R. Quantum analog of the MacWilliams identities Supplementary information The online version contains
for classical coding theory. Phys. Rev. Lett. 78, 1600 (1997). supplementary material available at
49. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction https://doi.org/10.1038/s41534-024-00920-y.
(MIT Press, 2018).
50. Konda, V. & Tsitsiklis, J. Actor-critic algorithms. Adv. Neural Inf. Correspondence and requests for materials should be addressed to
Process. Syst. 12 (1999). Jan Olle.
51. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O.
Proximal policy optimization algorithms. arXiv:1707.06347 (2017). Reprints and permissions information is available at
52. Lu, C. et al. Discovered policy optimisation. Adv. Neural Inf. Process. http://www.nature.com/reprints
Syst. 35, 16455–16468 (2022).
53. Bradbury, J. et al. JAX: composable transformations of Python Publisher’s note Springer Nature remains neutral with regard to
+NumPy programs. http://github.com/google/jax (2018). jurisdictional claims in published maps and institutional affiliations.