Simultaneous discovery of quantum error correction codes and encoders with a noise-aware reinforcement learning agent
https://doi.org/10.1038/s41534-024-00920-y
In the ongoing race towards experimental implementations of quantum error correction (QEC), finding
ways to automatically discover codes and encoding strategies tailored to the qubit hardware platform is
emerging as a critical problem. Reinforcement learning (RL) has been identified as a promising approach,
but so far it has been severely restricted in terms of scalability. In this work, we significantly expand the
power of RL approaches to QEC code discovery. Explicitly, we train an RL agent that automatically
discovers both QEC codes and their encoding circuits for a given gate set, qubit connectivity and error
model, from scratch. This is enabled by a reward based on the Knill-Laflamme conditions and a vectorized
Clifford simulator; we demonstrate the effectiveness of this approach with up to 25 physical qubits and
distance-5 codes, and present a roadmap for scaling it to 100 qubits and distance-10 codes in the near future. We
also introduce the concept of a noise-aware meta-agent, which learns to produce encoding strategies
simultaneously for a range of noise models, thus leveraging transfer of insights between different
situations. Our approach opens the door towards hardware-adapted accelerated discovery of QEC
approaches across the full spectrum of quantum hardware platforms of interest.
Quantum error correction1,2 (QEC) protects quantum information by encoding the state of a logical qubit into several physical qubits and is crucial to ensure that quantum technologies such as quantum communication or quantum computing can achieve their groundbreaking potential.

The past few years have witnessed dramatic progress in experimental realizations of QEC on different platforms3–7 (this includes especially various superconducting qubit architectures, ion traps, quantum dots, and neutral atoms), reaching a point where the lifetime of qubits has been extended by applying QEC8. Given the strong differences in native gate sets, qubit connectivities, and relevant noise models, there is a strong need for a flexible and efficient scheme to automatically discover not only codes but also efficient encoding circuits, adapted to the platform at hand.

In particular, in the field of quantum communication and networking, third-generation quantum repeaters rely on QEC to correct errors during transmission9. The use of QEC permits very high communication rates, since only one-way signaling is involved, in contrast to earlier generations of quantum repeaters. In this setting, we may in a first approximation assume that errors happen mainly during transmission over the noisy channel and treat the encoding circuits themselves as noiseless. This is the scenario we will adopt here.

Since Shor's original breakthrough10, different qubit-based QEC codes have been constructed, both analytically and numerically, leading to a zoo of codes, each of them conventionally labeled [[n, k, d]], where n is the number of physical qubits, k the number of encoded logical qubits, and d the code distance, which defines the number d − 1 of detectable errors. The first examples are provided by the [[5, 1, 3]] perfect code11, the [[7, 1, 3]] Steane12 and the [[9, 1, 3]] Shor10 codes, which encode one logical qubit into 5, 7, and 9 physical qubits, respectively, being able to detect up to 2 physical errors and correct up to 1 error on any physical qubit. The most promising approach so far is probably the family of so-called toric or surface codes13, which encode a logical qubit into the joint entangled state of a d × d square of physical qubits. More recently, examples of quantum Low-Density Parity Check (LDPC) codes that are competitive with the surface code have been discovered14.

However, knowledge of a code does not automatically translate into knowing how to encode the logical states of that code in an efficient way. Standard approaches are unconstrained, meaning that an all-to-all connectivity between qubits is assumed, as well as a set of gates that is not necessarily native to the hardware platform of interest15,16. This then leads to larger-than-necessary circuits when implementing them on specific devices.

Numerical techniques have already been employed to construct QEC codes. Often, this has involved greedy algorithms, which may lead to suboptimal solutions but can be relatively fast17–20.

The recent advent of powerful tools from the domain of Artificial Intelligence (AI) is transforming scientific approaches21. From these,

1Max Planck Institute for the Science of Light, Erlangen, Germany. 2Department of Physics, Friedrich-Alexander Universität Erlangen-Nürnberg, Erlangen, Germany. e-mail: [email protected]
Reinforcement Learning (RL), which is designed to solve complex decision-making problems by autonomously following an action-reward scheme22, is a promising artificial discovery tool for QEC strategies. The task to solve is encoded in a reward function, and the aim of RL training algorithms is to maximize this reward over time. RL can provide new answers to difficult questions, in particular in fields where optimization in a high-dimensional search space plays a crucial role. For this reason, RL can be an efficient tool to tackle the problem of QEC code construction and encoding under hardware-specific constraints.

The first example of RL-based automated discovery of QEC strategies23 did not rely on any human knowledge of QEC concepts. While this allowed exploration without any restrictions, e.g., going beyond stabilizer codes, it was limited to only small qubit numbers. More recent works have moved towards optimizing only certain QEC subtasks, injecting substantial human knowledge. For example, RL has been used for optimization of given QEC codes24, and to discover tensor network codes25 or codes based on "Quantum Lego" parametrizations26,27. Additionally, RL has been used to find efficient decoding processes28–31 and self-correcting control protocols32.

In our work, we significantly expand the scaling capabilities of RL code discovery by introducing two critical components:
1. An efficiently computable and general RL reward based on the Knill-Laflamme error correction conditions.
2. A highly parallelized custom-built Clifford circuit simulator that runs entirely on modern AI chip accelerators such as GPUs or TPUs.

The main results that are enabled by this strategy are the following:
1. A state-of-the-art scheme based on deep RL that simultaneously discovers QEC codes together with the encoding circuit from scratch, tailored to specific noise models, native gate sets, and connectivities, minimizing the circuit size for improved hardware efficiency.
2. Effortless discovery of both stabilizer and CSS codes and encoders with code distances from 3 (found in tens of seconds) to 5 (found in tens of minutes to a few hours) with up to 25 physical qubits.
3. A general RL agent that is trained only once but is afterwards able to adapt and switch its encoding strategy based on the specific noise that is present in the system. We call this a noise-aware RL agent.
4. A scalable platform for artificial scientific discovery of QEC strategies based on RL that potentially allows discovery of distance 8-10 codes on a single GPU, while offering further scaling opportunities on distributed machines.

Regarding applications to quantum computing, the discovered circuits are in general not fault-tolerant. However, strategies to build fault-tolerant versions out of non-fault-tolerant circuits exist33, and these can even be automated with RL34.

While the authors of ref. 35 also set themselves the task of finding both codes and their encoding circuits, this was done using variational quantum circuits involving continuously parametrized gates, which leads to much more costly numerical simulations and eventually only an approximate QEC scheme. By contrast, our RL-based approach does not rely on any human-provided circuit ansatz, can directly use any given discrete gate set, is able to exploit highly efficient Clifford simulations, and produces a meta-agent able to cover strategies for a range of noise models. In particular, their approach was not able to scale to d = 5 codes due to prohibitive computational costs.

The paper is organized as follows: In Section "Results" we detail the RL strategy, its numerical results, and estimates of how far this strategy can be scaled in principle. In Section "Methods" we give a reminder on stabilizer codes and the Knill-Laflamme conditions, provide background on the RL methods used in this work, and give all details of our implementation.

Results
Section "Reinforcement Learning Approach to QEC Code Discovery" describes our approach to build a noise-aware RL agent. Section "Reinforcement Learning Results" details the numerical results found with our strategy. Section "Scaling automated QEC discovery" explains how our approach can be scaled up to larger code parameters.

Reinforcement learning approach to QEC code discovery
The main objective of this work is to automate the discovery of QEC codes and their encoding circuits using RL. We exclusively focus on stabilizer codes due to their efficient simulability with classical computers. We will consider a scenario where the encoding circuit is assumed to be error-free (non-fault-tolerant encoding). This is applicable to quantum communication or quantum memories, where the majority of errors happen during transmission over a noisy channel or during the time the memory is retaining the information. Nevertheless, we remark that there exist techniques to make circuits fault-tolerant, such as flag fault-tolerance33, and the code itself would anyway be discovered with our strategy. A scheme of our approach can be found in Fig. 1, and the following sections are dedicated to explaining its different constituent parts.

Fig. 1 | QEC code and encoding discovery using a noise-aware RL meta-agent. A set of error operators, a gate set, and qubit connectivity are chosen. Different error models can be considered by varying some noise parameters, which are fed as an observation to the agent. The agent then builds a circuit using the available gate set and connectivity that detects the most likely errors from the target error model by using a reward based on the Knill-Laflamme QEC conditions according to Eq. (2). After training, a single RL agent is able to find suitable encodings for different noise models, which are able to encode any state $|\psi\rangle$ of choice.
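To make the circuit-building loop of Fig. 1 concrete, the following is a minimal sketch (our own illustration, not the authors' published implementation) of the underlying stabilizer bookkeeping: each code generator is stored as a binary symplectic vector (x | z), and every gate the agent places updates the tableau.

```python
import numpy as np

# Minimal stabilizer-tableau sketch (illustration only, not the authors'
# simulator). Each of the n - k code generators is a length-2n binary
# vector (x | z); global signs are ignored for brevity.

def initial_generators(n, k):
    """Initial generators Z_{k+1}, ..., Z_n (qubits 0..k-1 hold |psi>)."""
    tab = np.zeros((n - k, 2 * n), dtype=np.uint8)
    for row, qubit in enumerate(range(k, n)):
        tab[row, n + qubit] = 1  # set the z-bit of that qubit
    return tab

def apply_h(tab, q, n):
    """Hadamard on qubit q: exchange the x- and z-bits of that qubit."""
    tab[:, [q, n + q]] = tab[:, [n + q, q]]

def apply_cnot(tab, c, t, n):
    """CNOT(c, t): conjugation maps X_c -> X_c X_t and Z_t -> Z_c Z_t."""
    tab[:, t] ^= tab[:, c]          # propagate x-bits control -> target
    tab[:, n + c] ^= tab[:, n + t]  # propagate z-bits target -> control

# Example: the 3-qubit repetition encoder CNOT(0,1), CNOT(0,2) maps the
# initial generators Z on qubits 1 and 2 to Z_0 Z_1 and Z_0 Z_2.
n, k = 3, 1
tab = initial_generators(n, k)
apply_cnot(tab, 0, 1, n)
apply_cnot(tab, 0, 2, n)
```

A representation of this tableau is exactly the kind of observation the agent receives after each step before it proposes the next gate.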
Encoding circuit. In order to encode the state of k logical qubits on n physical qubits, one must find a sequence of quantum gates that will entangle the quantum information in such a way that QEC is possible with respect to a target noise channel. Initially, we imagine the first k qubits as the original containers of our (yet unencoded) quantum information, which can be in any state $|\psi\rangle \in (\mathbb{C}^2)^{\otimes k}$. The remaining n − k qubits are chosen to each be initialized in the state $|0\rangle$. These will be turned into the corresponding logical state $|\psi_L\rangle \in (\mathbb{C}^2)^{\otimes n}$ via the application of a sequence of Clifford gates on any of the n qubits. In the stabilizer formalism, this means that initially, the generators of the code stabilizer group are

$Z_{k+1}, Z_{k+2}, \ldots, Z_n.$   (1)

The task of the RL agent is to discover a suitable encoding sequence of gates for the particular error model under consideration. After applying each gate, the n − k code generators (1) are updated. The agent then receives a representation of these generators as input (as its observation) and suggests the next gate (action) to apply. In this way, an encoding circuit is built up step by step, taking into account the available gate set and connectivity for the particular hardware platform. This process terminates when the Knill-Laflamme conditions are satisfied for the target error channel, and the learned circuit can then be used to encode any state $|\psi\rangle$ of choice.

Reward. The most delicate matter in RL problems is building a suitable reward for the task at hand. Our goal is to design an agent that, given a list of (Pauli) errors {Eμ} with associated occurrence probabilities {pμ}, is able to find an encoding sequence that protects the quantum information from such noise.

Ideally, one would like to maximize the probability of successful recovery of the initial encoded state after decoding. Unfortunately, optimizing for this task is computationally too expensive. A much cheaper alternative is to use a scheme where the cumulative reward (which RL optimizes) simply is maximized whenever all the Knill-Laflamme conditions are fulfilled. One implementation of this idea uses what we call the (negative) weighted Knill-Laflamme sum as an instantaneous reward, which we define as:

$r_t = -\sum_{\mu} \lambda_\mu K_\mu,$   (2)

where Kμ = 0 if the corresponding error operator Eμ satisfies the Knill-Laflamme conditions, and Kμ = 1 otherwise, and where the λμ are real positive hyperparameters weighting each error. If all errors in {Eμ} can be detected, the reward is zero, and it is negative otherwise, thus leading the agent towards short gate sequences. In particular, note that the agent is not explicitly incentivized to minimize circuit depth or to place gates in parallel. However, reinforcing short gate sequences may sometimes also lead to a small circuit depth. The range of the index μ is found by counting the number of Pauli strings of weight w < d, which is

$|\{E_\mu\}|_{w<d} = \sum_{w=0}^{d-1} \binom{n}{w} 3^w,$   (3)

where the factor of three is for X, Y, Z Pauli errors. Thus, the fact that (3) grows exponentially with d will impose the most severe limitation on our approach (as is the case in any QEC application). Later, we will also be interested in situations where not all errors can be corrected simultaneously and a good compromise has to be found. In that case, one simple heuristic choice for the reward (2) would be λμ = pμ, giving more weight to errors that occur more frequently. While we will later see that maximizing the Knill-Laflamme reward given here is not precisely equivalent to maximizing the state recovery probability, one can still expect a reasonable performance at this task, and indeed this is what we find in our work.

Noise-aware meta-agent. Regarding the error channel to be targeted, there are in principle several choices that can be made. The most straightforward one is choosing a global depolarizing channel (see "Methods" (8)). This still allows for asymmetric noise, i.e., different probabilities pX, pY, pZ. One option would be to train an agent for any given, fixed choice of these probabilities, necessitating retraining if these characteristics change. However, we want to go beyond that and build a single agent capable of deciding what is the optimal encoding strategy for any level of bias in the noise channel (11). For instance, we want this noise-aware agent to be able to understand that it should prioritize detecting more Z errors than X ones when the channel is biased towards Z, yet it should do the opposite when X errors become more likely. This translates into two aspects: the first one is that the agent has to receive the noise parameters as input. In the illustrative example further below, we will choose to supply the bias parameter cZ = log pZ / log pX (see "Methods") as an extra observation, while keeping the overall error probability fixed. The second aspect is that the list of error operators will have to contain more operators than the total number that can actually be detected reliably, since it is now part of the agent's task to prioritize some of those errors while ignoring the least likely ones. All in all, the list of operators participating in the reward (2) will be fixed, and we will vary cZ during training.

Vectorized Clifford simulator. RL algorithms exploit guided trial-and-error loops until a signal of a good strategy is picked up and convergence is reached, so it is of paramount importance that simulations of our RL environment are extremely fast. Thanks to the Gottesman-Knill theorem, the Clifford circuits needed here can be simulated efficiently on classical computers. Optimized numerical implementations of Clifford circuits exist, e.g., Stim36. However, in an RL application we want to be able to run multiple circuits in parallel in an efficient, vectorized way that is compatible with modern machine learning frameworks. For that reason, we have implemented our own special-purpose vectorized GPU Clifford simulator (described in detail in Methods), which is publicly available in our repository37. When compared to Stim, we find a ~50× speedup at simulating random Clifford circuits and a ~450× speedup when restricted to the simulation of Calderbank-Shor-Steane (CSS) codes (see "Methods"). In particular, we can simulate 8000 random Clifford circuits of 1000 gates on 49 qubits in under a second. However, note that our simulator is not capable of sampling noisy circuits, which is the main application of Stim.

Reinforcement learning results
We will first illustrate the basic workings of our approach for a symmetric noise channel before showing the noise-aware meta-agent that is able to simultaneously discover strategies for a range of noise models.

Codes in a symmetric depolarizing noise channel. We now show the versatility of our approach by discovering a library of different [[n, k, d]] codes and their associated encoding circuits.

We fix the error model to be a symmetric depolarizing channel and consider different target code distances (from 3 to 5). The corresponding target error set is Eμ = {I, Xi, Yi, Zi, XiXj, …, ZiZj} for d = 3, and likewise for d = 4, 5, with the set for d = 5 including all Pauli string operators of up to weight 4. For illustrative purposes, we start by taking the gate set to be {Hi, CNOT(i < j)}, i.e., a directed all-to-all connectivity, which is sufficient given that our unencoded logical state is on the first k qubits by design. Nevertheless, we will also see examples with other connectivities and alternative gate sets. The error probability p is fixed, meaning pI = 1 − 3p, pX = pY = pZ = p, and thus no noise parameter is needed as an observation to the agent.

For d = 3 and d = 4 codes we proceed as follows: for any given target [[n, k, d]], we launch a few training runs. Once the codes are collected, we categorize them by calculating their quantum weight enumerators (see "Methods"), leading to a certain number of non-degenerate and degenerate code families.
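As a concrete illustration of the Knill-Laflamme-based reward of Eq. (2) (our own sketch; the authors' vectorized implementation is in their repository37), one can check detectability of every Pauli error of weight below d for the [[5, 1, 3]] perfect code, using the binary symplectic representation of Pauli strings.

```python
import numpy as np
from itertools import combinations, product

# Sketch of the Knill-Laflamme reward of Eq. (2) for the [[5, 1, 3]] code.
# A Pauli string is a binary (x | z) vector; two Paulis anticommute iff
# their symplectic product x1.z2 + z1.x2 is odd. For this nondegenerate
# code, an error is detectable iff it anticommutes with at least one
# stabilizer generator (errors inside the stabilizer group, which matter
# for degenerate codes, are not handled in this sketch).

def pauli_vec(pauli):
    """'XZZXI' -> binary (x | z) vector of length 2n."""
    n = len(pauli)
    v = np.zeros(2 * n, dtype=np.uint8)
    for q, p in enumerate(pauli):
        if p in 'XY':
            v[q] = 1
        if p in 'ZY':
            v[n + q] = 1
    return v

def anticommute(v, w, n):
    return (int(v[:n] @ w[n:]) + int(v[n:] @ w[:n])) % 2 == 1

n = 5
stabilizers = [pauli_vec(s) for s in ('XZZXI', 'IXZZX', 'XIXZZ', 'ZXIXZ')]

# All nontrivial Pauli errors of weight w < d = 3, as counted by Eq. (3).
errors = []
for w in (1, 2):
    for qubits in combinations(range(n), w):
        for letters in product('XYZ', repeat=w):
            s = ['I'] * n
            for q, l in zip(qubits, letters):
                s[q] = l
            errors.append(pauli_vec(''.join(s)))

# K_mu = 0 if E_mu is detected, 1 otherwise; equal weights lambda_mu = 1.
K = [0 if any(anticommute(e, g, n) for g in stabilizers) else 1
     for e in errors]
reward = -sum(K)
```

All 105 nontrivial errors of weight ≤ 2 are detected, so the reward reaches its maximum value of zero; including the identity, this matches the count of Eq. (3) for n = 5, d = 3: 1 + 15 + 90 = 106 Pauli strings.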
code families. We repeat this process and keep launching new training runs with an encoding circuit consisting of 32 gates in the minimal example,
until no new families are found. In this way, our strategy presumably finds which we show in the Supplementary.
all stabilizer codes that are possible for the given parameters n, k, d, together The largest d = 5 code that we have considered here is [[15, 2, 5]],
with a suitable encoding circuit. Note that this statement is based on although we will later show larger codes. We have found a single code family
empirical observations. While successive training runs do not yield new with weight enumerators
code families, this does not exclude the possibility of there being more. This
total number of families is shown in Fig. 2, with labels (x, y) for each [[n, k,
A ¼ ð1; 0; 0; 0; 0; 0; 23; 96; 361; 776; 1318; 1832;
d]], where x is the number of non-degenerate families and y is the number of
degenerate ones. It should be stressed that categorizing all stabilizer code 1814; 1304; 579; 88Þ;
ð5Þ
families is in general an NP-complete problem38, yet our framework is very B ¼ ð1; 0; 0; 0; 0; 101; 449; 1763; 5081; 12034;
effective at solving this task. To the best of our knowledge, this work provides 21722; 29366; 29622; 20489; 8661; 1783Þ:
the most detailed tabulation of (x, y) populations together with optimal
encoding circuits for the code parameters shown here.
This approach discovers suitable encoding circuits, given the assumed and an encoding circuit consisting of 49 gates shown in the Supplementary.
gate set, for a large set of codes. Among them are the following known codes Other successfully discovered d = 5 codes are shown in Methods, Fig. 4.
for d = 3 (see ref. 39 for explicit constructions of codes [[n, n − r, 3]] with
minimal r, for all n): The first one is the five-qubit perfect code11, which Noise-aware meta-agent. We now move on to codes in more general
consists of a single non-degenerate [[5, 1, 3]] code family and is the smallest asymmetric depolarizing noise channels. This lets us illustrate a powerful
stabilizer code that corrects an arbitrary single-qubit error. Next are the 10 aspect of RL-based encoding and code discovery: One and the same agent
families38 of [[7, 1, 3]] codes, one of which corresponds to Steane’s code12. can learn to switch its encoding strategy depending on some parameter
The smallest single-error-correcting surface code, Shor’s code10, is redis- characterizing the noise channel. This is realized by training this noise-
covered as one of the 143 degenerate code families with parameters [[9, 1, aware agent on many different runs with varying choices of the para-
3]]. The smallest quantum Hamming code40[[8, 3, 3]] is obtained as well. meter, which is fed as an additional input to the agent.
Our approach is efficient enough to discover codes with up to 20 physical In the present example, the parameter in question is the bias parameter
qubits in under 10 min, at which point we stopped increasing n. We also cZ ¼ log pZ = log pX . This allows the same agent to switch its strategy
include in the Supplementary the encoding circuit for a [[20, 13, 3]] code depending on the kind of bias present in the noise channel. The error set Eμ
consisting of a total of 45 gates. is now taken to be all Pauli strings of weight ≤4, i.e., {Eμ} = {I, Xi, Yi, Zi, XiXj,
The RL framework presented here easily allows to find encoding cir- …, ZiZjZkZl}, but their associated error probabilities will vary depending on
cuits for different connectivities. The connectivity affects the likelihood of cZ. For every RL training trajectory, a new cZ is chosen and the error
discovering codes within a certain family during RL training as well as the probabilities pμ are updated correspondingly.
typical circuit sizes. In Fig. 3 we illustrate this for the case of [[9, 3, 3]] codes, We apply this strategy to target codes with parameters n = 9, k = 1 in
with their 13 families, for two different connectivities: an all-to-all (directed, asymmetric noise channels. We allow a maximum number of 35 gates.
i.e., CNOT(i < j)) and a nearest-neighbor square lattice connectivity. On Moreover, we consider an all-to-all connectivity, taking as available gate set
average, the agent needs one less gate to prepare the encoding on the all-to- {Hi, Si, CNOT(i, j)}, where Si is the phase gate acting on qubit i.
all connectivity than when using the square lattice. This difference in circuit We discover codes with the following parameters: [[9, 1,
size is likely to become larger for larger qubit numbers. We also include in de(cZ = 0.5) = 2]], [[9, 1, de(cZ = 0.6) = 3]], [[9, 1, de(cZ = 1.4) = 4]], [[9, 1,
Methods examples using different gatesets and a larger variety of de(cZ = 2) = 5]], where de is the effective code distance, defined in Methods.
connectivities. To the best of our knowledge, the last two codes are new. Codes inbetween,
We now move to distance d = 5 codes. These are more challenging to 0.5 ≤ cZ < 0.6, have de = 2, 0.6 ≤ cZ < 1.4 have de = 3, and so on.
find due to the significantly increased number of error operators (3) to keep Next, we evaluate the performance of the noise-aware agent trained
track of, which impacts both the computation time and the hardness of with this strategy at minimizing the failure probability, defined in “Meth-
satisfying all Knill-Laflamme conditions simultaneously. Nevertheless, our ods”. The main results are shown in Fig. 5. We start by comparing the two
strategy is also successful in this case. It is known that the smallest possible best-performing post-selected agents according to minimizing the weighted
distance—5 code has parameters [[11, 1, 5]], a result that we confirm with Knill-Laflamme sum (green) and minimizing the failure probability
our strategy. We find the single family of this code to have weight enu- (orange), see Fig. 5a, b. There we see that there is a nice correlation between
merators, the two tasks, especially in the region cZ < 1. We also compare the smallest
undetected effective weight of the codes found by these two agents in Fig. 5c.
Surprisingly, the code found by the best agent according to the weighted
A ¼ ð1; 0; 0; 0; 0; 0; 198; 0; 495; 0; 330; 0Þ;
ð4Þ Knill-Laflamme sum (green) at cZ = 2 has de = 5, while the best code at
B ¼ ð1; 0; 0; 0; 0; 198; 198; 990; 495; 1650; 330; 234Þ; minimizing the failure probability (orange) has de = 4. However, at the
Fig. 4 | Families of d = 5 stabilizer codes found with RL. The labels (x, y) indicate
the number of non-degenerate (x) and degenerate (y) code families. The circuit size
shown is the absolute minimum throughout all families using a directed
(CNOT(i < j)) qubit connectivity.
on the specific value of cZ. Thus, the strategies found during training at a
fixed value of pI are readily usable in other situations.
We continue by analyzing the encoding circuits and code generators
for some selected values of cZ. These are chosen after computing the
quantum weight enumerators (see “Methods”), which we show in Fig. 6a.
There we see that the same code family is kept for 0.5 ≤ cZ < 0.9, where Z
errors are more likely than X/Y. From that point onward, the agent switches
to a new code family that is kept until the end (cZ = 2). We thus choose to
analyze the encoding circuits and their associated code generators for the
values cZ = {0.5, 0.9, 1.4, 2}. However, we remark that this particular code
switching only occurs for the best post-selected agent and there is a large
variety of strategies observed for the 714 meta-agents that we have trained,
both in terms of where the switching occurs and the number of switches.
We begin by showing the encoding circuits in Fig. 6b, highlighting
common motifs that are re-used across various values of cZ with different
colors, indicative of transfer learning. Another interesting behavior is that S
gates are used more prominently at small values of cZ, in particular in the
combination S ⋅ H. This gate combination implements a permutation: X →
Y, Y → Z, Z → X (ignoring signs), which is very useful to exchange Y by Z
efficiently. In situations where Z errors are more likely than X/Y, (cZ < 1),
this operation is beneficial. While we have been able to identify and interpret
this simple combination of gates with the naked eye, extracting general
Fig. 3 | Influence of connectivity. Characteristics of the 13 families of [[9, 3, 3]] principles from the discovered codes remains challenging but is nonetheless
codes found with our framework, clustered according to families distinguished by a valuable and important area that deserves further analysis.
their quantum weight enumerators (13). Families 9 and 13 are degenerate, while the Next, we show the code generators of such encoding circuits in Fig. 6c.
rest are non-degenerate. We have trained a total of 10240 agents for each of both Since the code used at cZ = 0.5 is the only one from a different code family, it
cases. In the all-to-all (directed: CNOT(i < j)) connectivity, 9574 agents were suc- is natural that its code generator pattern is the most distinct. However, we
cessful, while this number went down to 3808 in the other case. The bars display how see that the generators of the remaining values of cZ have similar structures.
these codes are distributed across different families. Codes in the same family found So far we have shown that a single meta-agent trained on different
by different agents are not necessarily distinct, so the bars are rather an indication of values of the noise bias parameter can find suitable strategies for all values of
the likelihood of a training run to find a code within the family. The points show the
such a parameter. Now, we want to compare the performance of such meta-
mean circuit size, averaged within each family, while the error bar is its standard
agent against an ensemble of agents that each have been trained on a single
deviation. It is interesting to see that even with different connectivities, families occur
with similar likelihoods during training. We explicitly list the corresponding
value of the noise bias parameter. The settings of this comparison are
quantum weight enumerators computed with (13) in the Supplementary. explained in Methods. The results are shown in Fig. 7. The first stark result is
that the simple agents perform rather bad at the extreme values cZ = 1.9 and
cZ = 2. Outside of these two points, they perform comparably to the best
meta-agent, even though the meta-agent strategy yields better performance
overall. This advantage is enabled by transfer learning, i.e., the idea that
specific point cZ = 2 these two codes perform equally well in terms of the patterns that work in one situation can be reused in other places effectively
failure probability (see Fig. 5b). (recall the common motifs from Fig. 6b). In our case, the meta-agent
Now we focus on the agent that performs best at minimizing the failure switched the code family as early as cZ = 0.9 (recall Fig. 6a), and all the
probability (orange) since it is the one of most interest in practical scenarios. experiences between cZ = 0.9 and cZ = 2 were useful in providing a superior
We begin by evaluating the performance of the same agent on different performance to that of the simple agents. Moreover, the noise-aware meta-
values of pI. This is shown in Fig. 5d. There where we see that the failure agent is able to provide predictions for all continuous values in the con-
probability asymptotically follows a power law with exponent ≳2 depending sidered range, while the simple agents cannot.
Fig. 5 | Performance of the noise-aware RL agent. The agent finds n = 9, k = 1 codes undetected effective weight (effective code distance is the integer part) as a function
and encoding circuits, simultaneously for different levels of noise bias cZ, with single- of the noise bias parameter cZ. While there is almost a perfect overlap between both
qubit fidelity pI = 0.9. In panels a,b,c, green represents the agent that was post- best agents until cZ = 1.1, the situation changes afterwards, leading at cZ = 2 to a de = 5
selected among all trained agents for performing best at minimizing the weighted code (green) or a de = 4 code (orange) that perform equally well in terms of the failure
Knill-Laflamme sum, averaged over all cZ values. Orange refers to the agent mini- probability, as seen in b. d Evaluation of the failure-probability of the best-
mizing the failure probability, averaged over cZ. a Weighted Knill-Laflamme sum as a performing agent (orange in the other panels) for larger values of pI (smaller errors)
function of the noise bias parameter cZ (best agent: green line). b Failure probability than the ones it was trained on.
as a function of the noise bias parameter cZ (best agent: orange line) (c) Smallest
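To see how the bias parameter cZ = log pZ / log pX skews the channel, one can fix pX and derive pZ. This is a hypothetical parametrization for illustration only (the exact single-qubit channel used in training is Eq. (11) in Methods); it shows why cZ < 1 corresponds to Z-biased noise.

```python
import math

# Illustration (assumed parametrization, not the paper's Methods Eq. (11)):
# invert the defining relation c_Z = log(p_Z) / log(p_X) by fixing p_X.

def p_z_from_bias(p_x, c_z):
    """For 0 < p_X < 1, c_Z = log(p_Z)/log(p_X) gives p_Z = p_X ** c_Z."""
    return p_x ** c_z

p_x = 0.03
p_z_biased = p_z_from_bias(p_x, 0.5)      # c_Z < 1: Z errors dominate
p_z_suppressed = p_z_from_bias(p_x, 2.0)  # c_Z > 1: Z errors suppressed
```

Since 0 < pX < 1, any cZ < 1 yields pZ > pX (the Z-biased regime where the agent favored the S ⋅ H motif), while cZ > 1 suppresses Z errors, consistent with the strategy switching reported above.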
Scaling automated QEC discovery
In this final section we explore to what extent our RL-based strategy can be scaled up. We will see that by restricting to CSS10,12 codes (which are a subclass of stabilizer codes) we are able to reduce the computational demands of our algorithms, leading to better estimated scaling with larger code parameters.

In order to exclusively target CSS codes, it is sufficient to constrain the structure of the circuit to contain an initial layer of Hadamard gates applied to a subset of the qubits, followed by CNOT gates thereafter (see "Methods" for a proof).

There are several possible modifications that we could make to our RL strategy in order to target CSS codes, which we discuss in Methods. In this work, we choose a mixed human-AI strategy where we decide the content of the Hadamard layer (i.e., how many gates and where they are placed) and where the agent has to discover suitable CNOT blocks. In this way, we simplify the task of the agent as much as possible.

We have tested this approach by targeting weakly self-dual codes (meaning the Hadamard layer contains num(H) = (n − k)/2 gates) of distance d = 5, using next-to-nearest neighbor CNOT connectivity and placing the initial Hadamard gates at alternating qubit indices.

We have found that we can discover [[17, 1, 5]] codes (with num(H) = 8) from scratch, together with their encoding circuits. An example of such a discovered circuit is shown in Fig. 8. It consists of 8 Hadamard gates (that we chose) and a remaining sequence of 46 CNOT gates discovered by the agent. The few CNOTs that connect seemingly distant qubits are due to allowing periodic boundary conditions. An interesting strategy that the agent uses is first building Bell pairs between adjacent qubits (which are [[2, 0, 2]] codes) and then entangling these pairs with each other to gradually build up a d = 5 code. We remind the reader that the largest (non-CSS) code that we had shown in previous sections was [[15, 2, 5]], and it needed roughly 4 h of compute time. The [[17, 1, 5]] code presented here only needs around 20 min.

An interesting observation is that the strategy of initially creating Bell pairs is persistent. We thus consider a final scenario where we initialize the circuit with neighboring Bell pairs and ask the agent to complete the encoding circuit. Now we focus on [[25, 1, 5]], due to these parameters being compatible with the first d = 5 surface code. We present an example of such a discovered code with its encoding circuit in a next-to-nearest neighbor connectivity in Fig. 8. It uses a total of 83 gates, where the last 59 CNOT gates were discovered by the agent and took around 2 h to train. If we instead ask the agent to start from a circuit where only the Hadamard layer is provided, it still finds good encodings. The drawback is that it takes longer to train, and the agent still prepares the Bell pairs (but has to learn to do so). We remark that these code parameters are by no means the upper limit of what is possible with our strategy.
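The claim that an initial Hadamard layer followed by CNOTs always yields a CSS code can be checked directly in the binary picture used by our simulator. The following is a minimal sketch (not the QDX implementation; the example circuit and qubit indices are illustrative): starting from the stabilizers of |0…0⟩, a Hadamard swaps the X and Z parts on its qubit, while a CNOT propagates X from control to target and Z from target to control, so every generator stays X-only or Z-only.

```python
import numpy as np

def final_tableau(n, h_qubits, cnots):
    # stabilizers of |0...0>: one Z per qubit; row i is generator g_i as (x | z) bits
    x = np.zeros((n, n), dtype=np.uint8)
    z = np.eye(n, dtype=np.uint8)
    for q in h_qubits:            # H swaps the X and Z parts on qubit q
        x[:, q], z[:, q] = z[:, q].copy(), x[:, q].copy()
    for c, t in cnots:            # CNOT: X spreads control -> target, Z spreads target -> control
        x[:, t] ^= x[:, c]
        z[:, c] ^= z[:, t]
    return x, z

# Hadamards on alternating qubits, then an arbitrary CNOT sequence
x, z = final_tableau(4, [0, 2], [(0, 1), (2, 3), (1, 2)])
is_css = all(rx.sum() == 0 or rz.sum() == 0 for rx, rz in zip(x, z))
print(is_css)  # True: every generator is X-only or Z-only
```

Any other CNOT sequence after the same Hadamard layer gives the same conclusion, in line with the proof in Methods.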
Fig. 6 | Characteristics of the 9-qubit codes and encodings found by the noise-aware meta-agent post-selected for minimizing the failure probability. a Associated code family according to their (symmetric) weight enumerators A, B. The same code family is used from 0.5 ≤ cZ < 0.9, while a family switch occurs at cZ = 0.9, and it is kept until cZ = 2. b Encoding circuits: here we see that many small gate sequences (highlighted with different colors) are reused across different values of cZ. This is an indication of transfer learning, i.e., the power of the meta-agent. We remark that while the agent does not place gates in parallel, the circuits shown here display gates in parallel for compactness. c Code generators gi corresponding to the encoding circuits. To aid visualization, we have chosen different colors for different Pauli matrices. However, since our scenario is by construction symmetric in X/Y, we represent X and Y by the same color, i.e., we do not make a distinction between X and Y. Here we see that the code generators gi vary across different values of cZ.
recent quasi-cyclic codes from ref. 14 in the near future. To achieve such a milestone, one should be able to target LDPC codes directly. As a starting point, one could add an additional term in the reward that penalizes stabilizers with large weights. This would not be guaranteed to work out of the box, as one would need to tune the relative importance of the original Knill-Laflamme term and this new term through some new hyperparameter. In addition, stabilizer generators of LDPC codes must also be local, meaning that their weight must be distributed along neighboring qubits for efficient measurement cycles. Finally, there is a large degeneracy in how the code generators are chosen: there are many possible choices of which n − k Pauli strings out of the 2^{n−k} elements of the stabilizer group are the stabilizer generators, leading to different stabilizer weights. All in all, we believe that, while promising, substantial innovations are needed in order to discover LDPC codes with such an RL-based strategy. However, the payoff would be quite substantial: a strategy based on RL would not be restricted to the particular ansatz of quasi-cyclic codes. In addition, not only would the codes be discovered, but their encoding circuits would also be automatically known.

One of the limits of our approach is GPU memory. However, this could be circumvented through different means. While it is always possible to trade performance for memory load, the tendency to train very large AI models is driving both the development of novel hardware with increased memory capabilities and the integration of distributed computing options in modern machine learning libraries. These developments make us envision scenarios where the framework presented in this work could be scaled up straightforwardly to multiple GPU machines. This makes us optimistic about AI-discovered QEC in the very near future.
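To see why restricting to CSS codes eases this memory pressure, one can count the error operators that must be stored to reward the agent. The sketch below is our own back-of-the-envelope count (Eq. (3) of the paper is not reproduced in this excerpt; we use the standard count of Pauli strings of weight below d, with three non-identity Paulis per supported site), compared against the CSS count derived in Methods:

```python
from math import comb

def num_errors_stabilizer(n, d):
    # all Pauli strings of weight < d: choose the support, then one of the
    # three non-identity Paulis per supported site (standard counting)
    return sum(comb(n, w) * 3**w for w in range(d))

def num_errors_css(n, d):
    # X- and Z-type errors are detected independently: a factor 2, no 3**w
    return 2 * sum(comb(n, w) for w in range(d))

n, d = 25, 5
print(num_errors_stabilizer(n, d))  # 1089526
print(num_errors_css(n, d))         # 30552
```

For the [[25, 1, 5]] parameters this is already a reduction of more than an order of magnitude, which is the effect visible in Fig. 9.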
Methods
Stabilizer codes
The stabilizer formalism. Some of the most promising QEC codes are based on the stabilizer formalism15, which leverages the properties of the Pauli group Gn on n qubits. The basic idea of the stabilizer formalism is that many quantum states of interest for QEC can be more compactly described by listing the set of n operators that stabilize them, where an operator O stabilizes a state |ψ⟩ if |ψ⟩ is an eigenvector of O with eigenvalue +1: O|ψ⟩ = |ψ⟩. The Pauli group on a single qubit, G1, is defined as the group generated by the Pauli matrices X, Y, Z under matrix multiplication. Explicitly, G1 = { ±I, ±iI, ±X, ±iX, ±Y, ±iY, ±Z, ±iZ}. The generalization to n qubits consists of all n-fold tensor products of Pauli matrices (called Pauli strings).

Fig. 7 | Noise-aware meta-agent vs ensemble of agents trained on fixed single values of noise. They have comparable performance at minimizing the failure probability (smaller is better), but the simple agents perform badly at larger values of cZ. The noise-aware meta-agent reaches a superior performance by reusing useful sub-circuits across different values of cZ and can provide encoding circuits for all continuous values of cZ.

A code that encodes k logical qubits into n physical qubits is a 2^k-dimensional subspace (the code space C) of the full 2^n-dimensional Hilbert space. It is completely specified by the set of Pauli strings SC that stabilize it, i.e., SC = {si ∈ Gn | si|ψ⟩ = |ψ⟩, ∀|ψ⟩ ∈ C}. SC is called the stabilizer group of C and is usually written in terms of its group generators gi as SC = ⟨g1, g2, …, g_{n−k}⟩, where each gi is a Pauli string.
either

{Eμ, gi} = 0,   (9)

or

Eμ ∈ SC.   (10)
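For small systems, condition (9) can be checked directly with dense matrices. The following is a minimal illustration with explicit Pauli matrices (our own example; the paper's implementation instead uses the binary symplectic formalism described below):

```python
import numpy as np
from functools import reduce

I = np.eye(2)
X = np.array([[0, 1], [1, 0]])
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0])
PAULI = {"I": I, "X": X, "Y": Y, "Z": Z}

def pauli(string):
    # tensor product of single-qubit Paulis, e.g. "ZZI" -> Z (x) Z (x) I
    return reduce(np.kron, [PAULI[c] for c in string])

def anticommutes(a, b):
    return np.allclose(a @ b + b @ a, 0.0)

# A weight-1 error X on the first qubit anticommutes with the generator ZZI,
# so condition (9) holds and that generator flags the error.
print(anticommutes(pauli("XII"), pauli("ZZI")))  # True
print(anticommutes(pauli("XII"), pauli("IZZ")))  # False: they commute
```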
The smallest weight in Gn for which none of the above two conditions holds is called the distance of the code. For instance, a distance-3 code is capable of detecting all Pauli strings of up to weight 2, meaning that the Knill-Laflamme conditions (9), (10) are satisfied for all Pauli strings of weights 0, 1 and 2. Moreover, the smallest weight for which these are not satisfied is 3, meaning that there is at least one weight-3 Pauli string violating both (9) and (10). However, some weight-3 Pauli strings (and higher weights) will satisfy the Knill-Laflamme conditions, in general.

While these conditions are framed in the context of quantum error detection, there is a direct correspondence with quantum error correction. Indeed, a quantum code of distance d can correct all errors of up to weight t = ⌊(d − 1)/2⌋15. If all the errors that are detected with a weight smaller than d obey (9), the code is called non-degenerate. On the other hand, if some of the errors satisfy (10), the code is called degenerate.

Fig. 9 | Scaling CSS code and encoding discovery to larger code parameters. We show the fraction of the 80 GB of GPU memory needed (NVIDIA A100 GPU) to store all the error operators that are required to reward the agent. We also show for comparison the memory load of stabilizer (non-CSS) code discovery for code distance d = 10. We identify a region of opportunity where our RL strategy could outperform some of the qLDPC codes found in ref. 14 in the near future.

Quantum noise. Noise affecting quantum processes can be represented using the so-called operator-sum representation41, where a quantum noise channel N induces dynamics on the state ρ according to

N(ρ) = ∑α Eα ρ Eα†,   (6)

where the Eα are Kraus operators satisfying ∑α Eα† Eα = I. The most elementary example is the so-called depolarizing noise channel,

N_DP(ρ) = pI ρ + pX XρX + pY YρY + pZ ZρZ,   (7)

where pI + pX + pY + pZ = 1 and the set of Kraus operators is Eα ∈ {√pI I, √pX X, √pY Y, √pZ Z}. When considering n qubits, one can generalize the depolarizing noise channel by introducing the global depolarizing channel,

N_GDP(ρ) = ⊗_{j=1}^{n} N_DP^{(j)}(ρj).   (8)

Asymmetric codes. The default weight-based [[n, k, d]] classification of QEC codes implicitly assumes that the error channel is symmetric, meaning that the probabilities of Pauli X, Y, and Z errors are equal. However, this is usually not the case in experimental setups: for example, dephasing (Z errors) may dominate bit-flip (X) errors. In our work, we consider an asymmetric noise channel where pX = pY but pX ≠ pZ. To quantify the asymmetry, we use the bias parameter cZ35, defined as

cZ = log pZ / log pX.   (11)

For symmetric error channels, cZ = 1. If Z errors dominate, then 0 < cZ < 1, since pZ = pX^cZ and pX, pZ ≪ 1; conversely, cZ > 1 when X/Y errors are more likely than Z errors.

The weight of operators and the code distance can both be generalized to asymmetric noise channels44–47. Consider a Pauli string operator Eμ and denote by wX the number of Pauli X operators inside Eμ (and likewise for Y, Z). Then one can introduce the cZ-effective weight35 of Eμ.
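As a quick numerical illustration of Eq. (11) (the probabilities below are illustrative, not values used in the paper):

```python
import math

def bias(p_x, p_z):
    # c_Z = log(p_Z) / log(p_X), Eq. (11)
    return math.log(p_z) / math.log(p_x)

p = 1e-3
print(bias(p, p))             # -> 1.0 (symmetric channel)
print(bias(p, p ** 2))        # ~ 2.0: Z errors rarer, X/Y dominate (c_Z > 1)
print(bias(p, math.sqrt(p)))  # ~ 0.5: Z errors dominate (c_Z < 1)
```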
Table 1 | PPO hyperparameters and typical values

Hyperparameter     Value
LR                 (1–5) × 10^−4
NUM_ENVS           8–1024
NUM_STEPS          8–32
NUM_EPOCHS         1000–12000
UPDATE_EPOCHS      2–4
NUM_MINIBATCHES    8–128
GAMMA              0.99
GAE_LAMBDA         0.95
CLIP_EPS           0.1–0.2
ENT_COEF           0.01–0.05
VF_COEF            0.5
MAX_GRAD_NORM      0.05–0.25
ANNEAL_LR          True
where w is the operator (cZ = 1) weight, j runs from 0 to n, and PC is the orthogonal projector onto the code space. Intuitively, Aj counts the number of error operators of weight j in SC, while Bj counts the number of error operators of weight j that commute with all elements of SC. Logical errors are thus the ones that commute with SC but are not in SC, and these are counted by Bj − Aj.

Such a classification is especially useful in scenarios with symmetric noise channels, where it is irrelevant whether the undetected errors contain a specific Pauli operator at a specific position. However, such a distinction can in principle be important in asymmetric noise channels. One could in principle generalize (13) to asymmetric noise channels by substituting the weight w with the effective weight we of operators, but then comparing codes across different values of noise bias becomes cumbersome. Hence, in the present work we always refer to (symmetric) code families according to (13) for all values of cZ, i.e., we effectively pretend that cZ = 1 when computing the weight enumerators of asymmetric codes.

Implementation and hyperparameters. We use the PPO implementation of ref. 52, which we break down in more detail here (see also Fig. 10 and Table 1 for a list of hyperparameters). In our implementation, the RL environment is vectorized, meaning that the agent interacts with multiple different quantum circuits at the same time. The hyperparameter that determines this number of RL environments is called NUM_ENVS. The learning algorithm consists of two processes: collect and update. During collection, the agent interacts with the environments and a total of NUM_STEPS sequences of (observation, action, reward) are collected per environment. Following the collection, the update process begins. Here, we have a total of NUM_ENVS * NUM_STEPS individual steps that are shuffled and reshaped into NUM_MINIBATCHES minibatches (each of size NUM_ENVS * NUM_STEPS // NUM_MINIBATCHES). These are used for updating the weights of the neural networks through gradient ascent, which happens UPDATE_EPOCHS times during every update process. The whole collection-update cycle is repeated NUM_EPOCHS times.
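The collect-and-update bookkeeping can be illustrated with plain array arithmetic. A sketch with illustrative values from within the Table 1 ranges (the integers stand in for the real (observation, action, reward) tuples):

```python
import numpy as np

# illustrative values within the ranges of Table 1
NUM_ENVS, NUM_STEPS, NUM_MINIBATCHES = 8, 32, 4
rng = np.random.default_rng(0)

# collect: NUM_STEPS transitions per environment
batch = np.arange(NUM_ENVS * NUM_STEPS)

# update: shuffle all transitions, then reshape into minibatches
minibatches = rng.permutation(batch).reshape(NUM_MINIBATCHES, -1)
print(minibatches.shape)  # (4, 64): each of size NUM_ENVS * NUM_STEPS // NUM_MINIBATCHES
```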
Reinforcement learning
Reinforcement Learning (RL)49 is designed to discover optimal action sequences in decision-making problems. The goal in any RL task is encoded by choosing a suitable reward r, a quantity that measures how well the task has been solved, and consists of an agent (the entity making the decisions) interacting with an environment (the physical system of interest or a simulation of it).

The neural networks that we have chosen are standard feedforward fully-connected neural networks with ReLU activation functions and with identical architectures for both the actor and value networks, except for the output layer. In particular, they both consist of an input layer of size 2n(n − k) given by the observation from the environment, followed by two hidden layers of size h (we have experimented with sizes 16 to 400) and an output layer of size nA (the number of actions) in the case of the actor network and of size 1 for the value network (see Fig. 10). The number of actions nA is determined by the number of physical qubits, the available gate set and the qubit connectivity.

Fig. 11 | Example of a training run for [[7, 1, 3]] code discovery. a Return and circuit size during training. b Details of the data calculation pipeline and complete set of hyperparameters used for this run. Here, 4 parallel agents each interact with batches of 64 circuits processed in parallel. Each agent finds a different encoding circuit, and the training finishes in 20 s on a single GPU. The meaning of every hyperparameter is explained in Methods.
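A minimal numpy sketch of these architectures (illustrative only: the actual implementation uses JAX, and the value of n_actions below is made up rather than derived from a gate set):

```python
import numpy as np

def init_mlp(sizes, rng):
    # one (W, b) pair per layer of a fully-connected network
    return [(0.01 * rng.normal(size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)  # ReLU hidden layers
    W, b = params[-1]
    return x @ W + b                    # linear output layer

n, k, h, n_actions = 7, 1, 64, 42       # n_actions is illustrative
rng = np.random.default_rng(0)
actor = init_mlp([2 * n * (n - k), h, h, n_actions], rng)
value = init_mlp([2 * n * (n - k), h, h, 1], rng)

obs = np.zeros(2 * n * (n - k))         # flattened binary tableau observation
print(forward(actor, obs).shape, forward(value, obs).shape)  # (42,) (1,)
```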
Other hyperparameters that participate in the PPO implementation, which we include for completeness (we refer to ref. 51 for further explanations), are the discount factor γ, the generalized advantage estimator (GAE) parameter λ, the actor loss clipping parameter ε, the entropy coefficient and the value function (VF) coefficient (see Table 1 for typical values that we have found to work well).
Regarding the optimizer itself, we use ADAM with clipping of the gradient norm (MAX_GRAD_NORM) and an initial learning rate (LR) that gets annealed (ANNEAL_LR) using a linear schedule as the training evolves; see Table 1 for specific numerical values of these hyperparameters.
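Both ingredients are simple to state in code. A sketch of the linear learning-rate schedule and gradient-norm clipping (our own minimal versions, not the optimizer code from the repository):

```python
import numpy as np

def linear_lr(step, total_steps, lr0=3e-4):
    # ANNEAL_LR: decay the learning rate linearly to zero over training
    return lr0 * (1.0 - step / total_steps)

def clip_by_norm(grad, max_norm):
    # MAX_GRAD_NORM: rescale the gradient if its norm exceeds the threshold
    norm = np.linalg.norm(grad)
    return grad * min(1.0, max_norm / norm) if norm > 0 else grad

print(linear_lr(0, 1000), linear_lr(500, 1000), linear_lr(1000, 1000))
print(np.linalg.norm(clip_by_norm(np.array([3.0, 4.0]), 0.25)))  # norm now <= MAX_GRAD_NORM
```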
Next, we show an example of a typical training trajectory in Fig. 11, together with all the hyperparameter values that were used and the execution time on a single NVIDIA Quadro RTX 6000 GPU. There, 4 agents are tasked to find [[7, 1, 3]] codes, which each of them completes successfully, running in parallel in 20 s. The error channel is chosen to be global symmetric depolarizing with pI = 0.9 (i.e., pX = pY = pZ = (1 − pI)/3). The average circuit size starts at 20 by design, i.e., if no code has been found after 20 gates, the circuit gets reinitialized. This number starts decreasing when codes start being found, and it saturates to a final value which is in general different for each agent. As a final remark, running the same script on a CPU node with two Xeon Gold 6130 processors takes 7 min 40 s.

Finally, we show how the runtime scales when increasing the number of physical qubits n and the code distance d in Fig. 12. In order to get a meaningful comparison, we fix all other hyperparameters to be identical to those shown in Fig. 11. We remark that in general the agents will not have converged to a successful encoding sequence given the allotted resources.

Fig. 12 | Time to reach 1 million training steps. Execution time of training trajectories of 4 parallel agents (on a single GPU) with identical hyperparameters as those shown in Fig. 11, with different numbers of physical qubits n and code distances d (but keeping the number of logical qubits k = 1).

Clifford simulator
Here we give more details on the implementation of our simulations, which are based on the binary symplectic formalism16 of the Pauli group and have been optimized to be compatible with modern vectorized machine learning frameworks running on Graphics Processing Units (GPUs). All the operations that are required both for simulating the quantum circuits and for computing the reward have been implemented using binary linear algebra. Our Clifford simulator is implemented using JAX53, a state-of-the-art modern machine learning framework with good vectorization and just-in-time compilation capabilities. On top of that, we also train multiple RL agents in parallel on a single GPU. This is achieved by interfacing with PUREJAXRL52, a library that offers a high-performance end-to-end JAX RL implementation. The source code for our project is available on GITHUB under the name QDX37, which is an acronym for Quantum Discovery with JAX. It includes the Clifford simulator, the PPO algorithm and demo Jupyter notebooks to reproduce some of our main results.

A stabilizer generator gi is formally represented as a Pauli string P1 ⊗ P2 ⊗ ⋯ ⊗ Pn, where Pi ∈ {I, X, Y, Z} is any Pauli operator, and numerically as a binary vector of size 2n. For example, the Pauli matrices are represented as I = (0, 0), X = (1, 0), Y = (1, 1), Z = (0, 1), and a general Pauli string is represented as (x1, …, xn, z1, …, zn), where all xi and zi are either 0 or 1. For instance, the binary vector (1, 1, 0, 0, 0, 1, 1, 0) represents the Pauli string XYZI. Matrix multiplication gets mapped to binary addition (ignoring global phases), e.g.,

X · Y = Z ↔ (1, 0) + (1, 1) = (0, 1) (mod 2).   (14)
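This binary representation and Eq. (14) can be reproduced in a few lines (a sketch of the encoding itself, not the vectorized JAX simulator):

```python
import numpy as np

def to_binary(pauli):
    # (x_1..x_n, z_1..z_n) with X=(1,0), Z=(0,1), Y=(1,1), I=(0,0)
    x = [1 if p in "XY" else 0 for p in pauli]
    z = [1 if p in "ZY" else 0 for p in pauli]
    return np.array(x + z, dtype=np.uint8)

def multiply(a, b):
    # Pauli multiplication up to a global phase is a bitwise XOR, Eq. (14)
    return (a + b) % 2

print(to_binary("XYZI"))                         # [1 1 0 0 0 1 1 0]
print(multiply(to_binary("X"), to_binary("Y")))  # [0 1], i.e. Z
```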
code generator gi anticommutes with any given error operator. This means that the result has to be transformed into a binary vector of size num(Eμ), where a 1 means that the first Knill-Laflamme condition, Eq. (9), is satisfied for the corresponding operator Eμ, and a 0 that it is not.

The second Knill-Laflamme condition, Eq. (10), requires checking whether any error operator Eμ ∈ SC. In principle, the full stabilizer group of 2^{n−k} elements must be built at every time step of our simulations. For the physical qubit numbers that we have considered in our work, this computation is still fast enough, becoming more challenging for n − k ≥ 13. In practice, not many error operators end up being in SC, which we leverage by introducing a softness parameter s such that only a subgroup of SC is built. More precisely, s = 0 means that this subgroup is empty, s = 1 means taking only the generators gi as the subgroup, s = 2 means taking the generators gi and all pairwise products of generators gigj, and so on for larger s.
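The softness construction can be sketched as follows (illustrative two-qubit generators; the real implementation works on batched binary arrays on the GPU):

```python
from itertools import combinations
import numpy as np

def subgroup(generators, s):
    # products of up to s generators; in the binary picture a product is a XOR
    elements = set()
    for r in range(1, s + 1):
        for combo in combinations(generators, r):
            g = np.bitwise_xor.reduce(np.array(combo))
            elements.add(tuple(int(v) for v in g))
    return elements

# toy 2-qubit generators in (x1, x2, z1, z2) form: XX and ZZ
gens = [np.array([1, 1, 0, 0], dtype=np.uint8), np.array([0, 0, 1, 1], dtype=np.uint8)]
print(len(subgroup(gens, 1)))  # 2: only the generators themselves
print(len(subgroup(gens, 2)))  # 3: also their product XX*ZZ = YY (up to phase)
```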
by which we mean that for every cZ, the corresponding set of pμ's gets normalized by the maximal value of pμ in that set. We choose pI = 0.9, even though both slightly smaller and larger values around pI ≈ 0.9 perform equally well. However, going below pI ≲ 0.8 or above pI ≳ 0.95 comes with different challenges. In the former case (large errors), we lose the important property that the sum of pμ's decreases as a function of weight, (∑μ pμ)w=1 > (∑μ pμ)w=2 > …. In the latter case (small errors), the range of values of pμ is so large that one would need to use a 64-bit floating-point representation to compute the reward with sufficient precision. Since both RL algorithms and GPUs are currently designed to work best with 32-bit precision, we decide to avoid this range of values for pI during training, but we will still evaluate the strategies found by the RL agent at different values of pI.
We allow a maximum of 35 gates before the trajectory gets reinitialized. Even though all encodings that the meta-agent outputs have circuit size 35, we notice that trivial gate sequences are applied at the last few steps, effectively reducing the overall gate count. We remark that this feature is not problematic: it means that the agent is done well before a new training run is launched, and the best thing it can do is to collect small negative rewards until the end. We manually prune the encodings to get rid of such trivial operations, and the resulting circuit sizes vary from 22 to 35, depending on the value of cZ.
layer and where the RL agent decides the content of the CNOT block. Here we comment on other possibilities.

The first would be to keep both H and CNOT gates as actions for the agent to use, but penalize the agent every time a Hadamard gate is used after a CNOT gate. This would in principle lead to an agent that learns the correct architecture for CSS codes, at the expense of having to fine-tune this new penalty term in the reward. We avoided this strategy because we did not want to introduce further hyperparameters. The second option would be a multi-agent scenario with two agents: one that only places Hadamards and another that only places CNOTs. While interesting, multi-agent tasks are typically harder to train and would involve redesigning our entire framework.
Circuit structure of CSS codes. Here we give a proof of the claim that codes resulting from circuits with an initial block of Hadamard gates on a subset of the qubits, followed by CNOT gates thereafter, can only be CSS.
Let us label physical qubits with index 1 ≤ q ≤ n and target a CSS code with parameters [[n, k, d]]. Let us assume for simplicity that the initial block of Hadamard gates is applied to qubits k + 1, …, k + nH, with nH < n − k. The initial tableau of the would-be code reads

g1 = X_{k+1},
g2 = X_{k+2},
⋮
g_{nH} = X_{k+nH},   (22)
g_{nH+1} = Z_{k+nH+1},
⋮
g_{n−k} = Z_n.

From this moment forward, only CNOT gates are allowed. Let us start by considering the effect of a CNOT gate with control qubit inside the H-block, i.e., control ∈ {k + 1, …, k + nH}. For whatever target qubit, what such a CNOT does is populate the target position of the corresponding stabilizer g_control with an X. Subsequent CNOT gates affecting those positions, either as control or target qubits, will either introduce additional X's or simply do nothing. Since X² = 1, the stabilizers g1, g2, …, g_{nH} will only ever contain either X's or 1's. Similarly, the effect of CNOTs on the stabilizers g_{nH+1}, …, g_{n−k} is simply to populate them with Z's or 1's. Since the set of stabilizer generators can be clearly separated into a subset built with only X's and 1's and another one with only Z's and 1's, such a tableau describes a CSS code.

Fig. 15 | Comparison between the performance of the noise-aware meta-agent vs simple agents. The results shown are averaged over their respective ensembles. Error bars are one standard deviation.

• Finally, we calculate the failure probability as the sum of all error probabilities except the most likely one in that given syndrome.

If the code is degenerate, there could still be the possibility that the actual error was misidentified and that, after correction, one could still have ended up with an "error" that is inside the stabilizer group. The contribution from these cases is negligible in our case and is thus ignored. However, one would in principle still have to consider them in a general scenario. In practice, one could still evaluate the codes discovered with our RL approach by substituting the decoder accordingly.

Noise-aware meta-agent vs an ensemble of simple agents. Here we explain the settings of this experiment (shown in Fig. 7) in order to make a fair comparison. There are 16 possible values of the bias parameter, cZ = {0.5, 0.6, …, 1.9, 2}. Since each meta-agent has seen instances of all 16 values, we only allow the single-cZ agents to be trained on one sixteenth of the total timesteps used for each meta-agent. In addition, the best post-selected meta-agent was selected out of 714 training runs. Therefore, we train 714 × 16 = 11424 single-cZ agents to make the comparison. All other hyperparameters are kept fixed.

We also include an extended statistical analysis over the entire ensemble of both meta-agents and simple agents in Fig. 15. There, we average over the respective ensembles and show the average performance of agents of each class, together with their standard deviations. We see that all simple agents consistently fail at minimizing the failure probability at large values of cZ. The larger error bars at smaller values of cZ for the meta-agents can also be interpreted as this class of more general agents allocating a larger effort to both exploration and generalization to other values of cZ.

CSS codes
A particularly useful subclass of stabilizer codes are CSS codes10,12. They are defined by their stabilizer generators containing either only X or only Z Pauli operators. This restriction is useful because X-type and Z-type errors are detected independently, thereby implying the detection of Y-type errors when the corresponding X- and Z-type stabilizers fire simultaneously. Moreover, strong contenders for implementation in large-scale quantum computations, such as surface codes or color codes, are of the CSS type.

Alternative strategies using RL. In the main text we have argued that CSS codes can be constructed by constraining the encoding circuit to be built from an initial layer of Hadamard gates and CNOTs thereafter. In order to adapt our RL strategy to CSS code discovery, we have considered a mixed human-AI strategy where we decide the Hadamard

GPU memory estimation. The independence of X- and Z-type error detection in CSS codes means that the number of error operators that we have to keep track of drastically reduces from (3) to

|{Eμ}_CSS|_{w ≤ d−1} = 2 ∑_{w=0}^{d−1} (n choose w),   (23)

where the overall factor of 2 counts both X- and Z-type errors. Thanks to the separability of X and Z in the stabilizer generators, the tableaus that we have to simulate are block-diagonal,

( gX  0 )
( 0   gZ ),   (24)

where gX is a binary matrix of size num(H) × n containing the X-type stabilizer generators, and gZ is of size (n − k − num(H)) × n and contains the representation of the Z-type generators. Here, num(H) is the number of Hadamard gates that are applied at the very beginning.

Separability of X- and Z-type error detection implies that gX must detect all Z-type errors (by the first Knill-Laflamme condition (9)), and correspondingly gZ all X-type errors. If the code is degenerate, it must happen that some X-type errors are elements of the stabilizer subgroup generated by gX, and likewise for Z.

All in all, this means that we can reduce the number of error operators (23) by a factor of 2 (since we use the same representation for both X- and Z-type errors). Each such error operator is a binary array of size n, which amounts to 8n bits of memory. We therefore estimate the memory usage by counting the number of error operators (23) (divided by 2, as argued above), times the number of binary digits that have to be specified for each of them, i.e., 8n.

Data availability
The data that support the findings of this study are openly available in the GitHub repository https://github.com/jolle-ag/qdx (ref. 37).

Code availability
The code that supports the findings of this study is openly available in the GitHub repository https://github.com/jolle-ag/qdx (ref. 37).

Received: 21 May 2024; Accepted: 15 November 2024;

References
1. Inguscio, M., Ketterle, W. & Salomon, C. Proceedings of the International School of Physics "Enrico Fermi." Vol. 164 (IOS Press, 2007).
2. Girvin, S. M. Introduction to quantum error correction and fault tolerance. SciPost Phys. Lect. Notes (2023).
3. Krinner, S. et al. Realizing repeated quantum error correction in a distance-three surface code. Nature 605, 669–674 (2022).
4. Ryan-Anderson, C. et al. Realization of real-time fault-tolerant quantum error correction. Phys. Rev. X 11, 041058 (2021).
5. Postler, L. et al. Demonstration of fault-tolerant universal quantum gate operations. Nature 605, 675–680 (2022).
6. Cong, I. et al. Hardware-efficient, fault-tolerant quantum computation with Rydberg atoms. Phys. Rev. X 12, 021049 (2022).
7. Acharya, R. et al. Suppressing quantum errors by scaling a surface code logical qubit. Nature 614, 676–681 (2023).
8. Sivak, V. et al. Real-time quantum error correction beyond break-even. Nature 616, 50–55 (2023).
9. Azuma, K. et al. Quantum repeaters: From quantum networks to the quantum internet. Rev. Mod. Phys. 95, 045006 (2023).
10. Calderbank, A. R. & Shor, P. W. Good quantum error-correcting codes exist. Phys. Rev. A 54, 1098–1105 (1996).
11. Laflamme, R., Miquel, C., Paz, J. P. & Zurek, W. H. Perfect quantum error correcting code. Phys. Rev. Lett. 77, 198–201 (1996).
12. Steane, A. M. Simple quantum error-correcting codes. Phys. Rev. A 54, 4741–4751 (1996).
13. Kitaev, A. Y. Quantum computations: algorithms and error correction. Russian Math. Surv. 52, 1191 (1997).
14. Bravyi, S. et al. High-threshold and low-overhead fault-tolerant quantum memory. Nature 627, 778–782 (2024).
15. Gottesman, D. Stabilizer codes and quantum error correction. Preprint at arXiv:quant-ph/9705052 (1997).
16. Aaronson, S. & Gottesman, D. Improved simulation of stabilizer circuits. Phys. Rev. A 70, 052328 (2004).
17. Grassl, M. & Han, S. Computing extensions of linear codes using a greedy algorithm. In 2012 IEEE International Symposium on Information Theory Proceedings 1568–1572 (IEEE, 2012).
18. Grassl, M., Shor, P. W., Smith, G., Smolin, J. & Zeng, B. New constructions of codes for asymmetric channels via concatenation. IEEE Trans. Inf. Theory 61, 1879–1886 (2015).
19. Li, M., Gutiérrez, M., David, S. E., Hernandez, A. & Brown, K. R. Fault tolerance with bare ancillary qubits for a [[7,1,3]] code. Phys. Rev. A 96, 032341 (2017).
20. Chuang, I., Cross, A., Smith, G., Smolin, J. & Zeng, B. Codeword stabilized quantum codes: Algorithm and structure. J. Math. Phys. https://doi.org/10.1063/1.3086833 (2009).
21. Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
22. Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 12 (1999).
23. Fösel, T., Tighineanu, P., Weiss, T. & Marquardt, F. Reinforcement learning with neural networks for quantum feedback. Phys. Rev. X 8, 031084 (2018).
24. Nautrup, H. P., Delfosse, N., Dunjko, V., Briegel, H. J. & Friis, N. Optimizing quantum error correction codes with reinforcement learning. Quantum 3, 215 (2019).
25. Mauron, C., Farrelly, T. & Stace, T. M. Optimization of tensor network codes with reinforcement learning. New J. Phys. 26, 023024 (2024).
26. Su, V. P. et al. Discovery of optimal quantum error correcting codes via reinforcement learning. Preprint at arXiv:2305.06378 (2023).
27. Cao, C. & Lackey, B. Quantum lego: Building quantum error correction codes from tensor networks. PRX Quantum 3, 020332 (2022).
28. Andreasson, P., Johansson, J., Liljestrand, S. & Granath, M. Quantum error correction for the toric code using deep reinforcement learning. Quantum 3, 183 (2019).
29. Sweke, R., Kesselring, M. S., van Nieuwenburg, E. P. & Eisert, J. Reinforcement learning decoders for fault-tolerant quantum computation. Mach. Learn. Sci. Technol. 2, 025005 (2020).
30. Colomer, L. D., Skotiniotis, M. & Muñoz-Tapia, R. Reinforcement learning for optimal error correction of toric codes. Phys. Lett. A 384, 126353 (2020).
31. Fitzek, D., Eliasson, M., Kockum, A. F. & Granath, M. Deep q-learning decoder for depolarizing noise on the toric code. Phys. Rev. Res. 2, 023230 (2020).
32. Metz, F. & Bukov, M. Self-correcting quantum many-body control using reinforcement learning with tensor networks. Nat. Mach. Intell. 5, 780–791 (2023).
33. Chao, R. & Reichardt, B. W. Quantum error correction with only two extra qubits. Phys. Rev. Lett. 121, 050502 (2018).
34. Zen, R. et al. Quantum circuit discovery for fault-tolerant logical state preparation with reinforcement learning. Preprint at arXiv:2402.17761 (2024).
35. Cao, C., Zhang, C., Wu, Z., Grassl, M. & Zeng, B. Quantum variational learning for quantum error-correcting codes. Quantum 6, 828 (2022).
36. Gidney, C. Stim: a fast stabilizer circuit simulator. Quantum 5, 497 (2021).
37. QDX: An AI discovery tool for quantum error correction codes. https://github.com/jolle-ag/qdx.
38. Yu, S., Chen, Q. & Oh, C. H. Graphical quantum error-correcting codes. Preprint at arXiv:0709.1780 (2007).
39. Yu, S., Bierbrauer, J., Dong, Y., Chen, Q. & Oh, C. All the stabilizer codes of distance 3. IEEE Trans. Inf. Theory 59, 5179–5185 (2013).
40. Gottesman, D. Class of quantum error-correcting codes saturating the quantum Hamming bound. Phys. Rev. A 54, 1862–1868 (1996).
41. Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information (Cambridge University Press, 2010).
42. Bennett, C. H., DiVincenzo, D. P., Smolin, J. A. & Wootters, W. K. Mixed-state entanglement and quantum error correction. Phys. Rev. A 54, 3824–3851 (1996).
43. Knill, E. & Laflamme, R. Theory of quantum error-correcting codes. Phys. Rev. A 55, 900 (1997).
44. Ioffe, L. & Mézard, M. Asymmetric quantum error-correcting codes. Phys. Rev. A 75, 032345 (2007).
45. Wang, L., Feng, K., Ling, S. & Xing, C. Asymmetric quantum codes: characterization and constructions. IEEE Trans. Inf. Theory 56, 2938–2945 (2010).
46. Ezerman, M. F., Ling, S. & Sole, P. Additive asymmetric quantum

Competing interests
codes. IEEE Trans. Inf. Theory 57, 5536–5550 (2011). The authors declare no competing interests.
47. Guardia, G. G. L. On the construction of asymmetric quantum codes.
Int. J. Theor. Phys. 53, 2312–2322 (2014). Additional information
48. Shor, P. & Laflamme, R. Quantum analog of the MacWilliams identities Supplementary information The online version contains
for classical coding theory. Phys. Rev. Lett. 78, 1600 (1997). supplementary material available at
49. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction https://doi.org/10.1038/s41534-024-00920-y.
(MIT Press, 2018).
50. Konda, V. & Tsitsiklis, J. Actor-critic algorithms. Adv. Neural Inf. Correspondence and requests for materials should be addressed to
Process. Syst. 12 (1999). Jan Olle.
51. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O.
Proximal policy optimization algorithms. arXiv:1707.06347 (2017). Reprints and permissions information is available at
52. Lu, C. et al. Discovered policy optimisation. Adv. Neural Inf. Process. http://www.nature.com/reprints
Syst. 35, 16455–16468 (2022).
53. Bradbury, J. et al. JAX: composable transformations of Python Publisher’s note Springer Nature remains neutral with regard to
+NumPy programs. http://github.com/google/jax (2018). jurisdictional claims in published maps and institutional affiliations.