A computation of D(9) using FPGA Supercomputing
arXiv:2304.03039v3 [cs.DM] 18 Apr 2023
Lennart Van Hirtum¹,²,³, Patrick De Causmaecker¹, Jens Goemaere¹,
Tobias Kenter²,³, Heinrich Riebler²,³, Michael Lass²,³,
and Christian Plessl²,³
¹ KU Leuven, Department of Computer Science, KULAK
² Department of Computer Science, Paderborn University
³ Paderborn Center for Parallel Computing, Paderborn University
E-mail addresses in footnote∗
April 2023
Abstract
This preprint claims to have computed the 9th Dedekind number. This was
done by building an efficient FPGA accelerator for the core operation of
the process, and parallelizing it on the Noctua 2 supercluster at
Paderborn University. The resulting value is
286386577668298411128469151667598498812366
This value can be verified in two steps. We have made the data file
containing the 490M results available. Each of these results can be
verified separately on CPU, and the whole file sums to our proposed value.
1 Introduction
Let us consider the finite set A = {1, . . . , n}, which we will call the base set, and
let us denote the set of subsets of A by P(A). The number of monotone Boolean
functions on P(A) is denoted by D(n) and is called the nth Dedekind number.
We write D_n for the set of these functions.
The permutations of the elements of the base set A generate an equivalence
relation on D_n. The set of equivalence classes of this relation is denoted by R_n,
and the number of such equivalence classes is denoted by R(n).
∗ [email protected], [email protected],
[email protected], [email protected], [email protected],
[email protected], [email protected]
Richard Dedekind first defined the numbers D(n) in 1897 [1]. Over the previous
century, Dedekind numbers have been a challenge for computational power
in the evolving domain of computer science. Computing the numbers proved
exceptionally hard, and so far only formulas with a double exponential time
complexity are known. Until recently, the largest known Dedekind number was
D(8). In this paper, we report on a first result for D(9). Table 1 shows the
known numbers, including the first result of our computation. As we explain
below, a verification run is needed. We expect to have verified our result in
about three months' time.
Table 2 shows the known numbers R(n) of equivalence classes of monotone
Boolean functions under permutation of the elements of the base set. Note that
the last result dates from 2021.
Number  Value                                       Source
D(0)    2                                           Dedekind (1897)
D(1)    3                                           Dedekind (1897)
D(2)    6                                           Dedekind (1897)
D(3)    20                                          Dedekind (1897)
D(4)    168                                         Dedekind (1897)
D(5)    7581                                        Church (1940)
D(6)    7828354                                     Ward (1946)
D(7)    2414682040998                               Church (1965)
D(8)    56130437228687557907788                     Wiedemann (1991)
D(9)    286386577668298411128469151667598498812366  Our Proposal (2023)
Table 1: Known Dedekind Numbers [2] and our first result.
Number  Value                Source
R(0)    2
R(1)    3
R(2)    5
R(3)    10
R(4)    30
R(5)    210
R(6)    16353
R(7)    490013148            Tamon Stephen & Timothy Yusun (2014) [3]
R(8)    1392195548889993358  Bartlomiej Pawelski (2021) [4]
Table 2: Known Equivalence Class Counts
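The small entries of Tables 1 and 2 can be reproduced by brute force. The sketch below is only an illustration and not the method used in this paper: it encodes a function as a bitmask over the 2^n subsets (an encoding chosen here for convenience), and the helper names are hypothetical. It becomes infeasible beyond n = 4.

```python
from itertools import permutations

def monotone_functions(n):
    """All monotone functions on n elements, as bitmasks over the 2**n subsets.

    Bit x of a mask is 1 iff the function is True on the subset encoded by x.
    Following the text, True on X implies True on every subset of X.
    """
    funcs = []
    for f in range(1 << (1 << n)):
        # It suffices to check that removing any single element keeps f True.
        if all(f >> (x & ~(1 << i)) & 1
               for x in range(1 << n) if f >> x & 1
               for i in range(n) if x >> i & 1):
            funcs.append(f)
    return funcs

def permute_subset(x, perm):
    """Image of subset mask x when element i is renamed to perm[i]."""
    return sum(1 << perm[i] for i in range(len(perm)) if x >> i & 1)

def permute_function(f, perm, n):
    """Image of function mask f under the same renaming of base elements."""
    return sum(1 << permute_subset(x, perm) for x in range(1 << n) if f >> x & 1)

def count_functions_and_classes(n):
    """Return (D(n), R(n)) by enumeration; only feasible for n <= 4."""
    funcs = monotone_functions(n)
    perms = list(permutations(range(n)))
    # The smallest mask in each orbit serves as the class representative.
    canonical = {min(permute_function(f, p, n) for p in perms) for f in funcs}
    return len(funcs), len(canonical)

for n in range(5):
    print(n, count_functions_and_classes(n))
```

For n = 0 to 4 this agrees with the first five rows of both tables.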
Note that a monotone Boolean function is completely defined by the set of sets
that are maximal among the sets for which the function value is True. For
any monotone Boolean function, no two of its maximal sets include each other.
Such a set of sets is called an anti-chain. Since a monotone Boolean function
is completely determined by its associated anti-chain, and any anti-chain is
completely determined by its associated monotone Boolean function, we will
use either of the two representations, whichever is more convenient. We will
represent monotone Boolean functions or anti-chains by letters from the Greek
alphabet. If we say that X ∈ α, we mean that X is a maximal set among the
sets for which α is True, in other words
\[ \forall Y \subseteq X : \alpha(Y) = \mathrm{True} \quad\text{and}\quad \forall Z \supsetneq X : \alpha(Z) = \mathrm{False} \]
If we say that α = {X, Y, Z}, we mean that the sets X, Y, Z ⊆ A are the maximal
sets among the sets for which α is True. For the set D_n of monotone Boolean
functions on the base set, a natural partial order ≤ is defined by
\[ \forall \alpha, \beta \in D_n : \alpha \le \beta \Leftrightarrow \forall X \subseteq A : \alpha(X) \Rightarrow \beta(X) \tag{1} \]
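This partial ordering defines a complete lattice on D_n. We denote by ⊥ and ⊤
the smallest, respectively the largest, element of D_n:
\[ \forall X \subseteq A : \bot(X) = \mathrm{False}, \quad \top(X) = \mathrm{True} \tag{2} \]
or, written as anti-chains,
\[ \bot = \{\}, \quad \top = \{A\} \tag{3} \]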
Intervals in D_n are denoted by
\[ \forall \alpha, \beta \in D_n : [\alpha, \beta] = \{\chi \in D_n : \alpha \le \chi \le \beta\} \tag{4} \]
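For α, β ∈ D_n, the join α ∨ β and the meet α ∧ β are the monotone Boolean
functions defined by
\[ \forall X \subseteq A : (\alpha \vee \beta)(X) = \alpha(X) \text{ or } \beta(X) \tag{5} \]
\[ \forall X \subseteq A : (\alpha \wedge \beta)(X) = \alpha(X) \text{ and } \beta(X) \tag{6} \]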
Finally, in the formulas below, a number defined for each pair α ≤ β ∈ D_n plays
an important role. We refer to this number as the connector number C_{α,β} of α
and β. It counts the number of connected components of the anti-chain β with
respect to α. Two such sets X, Y ∈ β are connected if α(X ∩ Y) = False, or if
there is a path X, Z_1, ..., Z_k, Y of such sets X, Z_1, ..., Z_k, Y ⊆ A in which
every two subsequent sets satisfy
α(X ∩ Z_1) = α(Z_1 ∩ Z_2) = ... = α(Z_k ∩ Y) = False.
It turns out that the number of solutions of
\[ \chi \vee \upsilon = \beta \tag{7} \]
\[ \chi \wedge \upsilon = \alpha \tag{8} \]
for χ, υ ∈ D_n is given by 2^{C_{α,β}}. This is called the PCoeff [5, 6].
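The connector number can be illustrated on small base sets. The sketch below is not the authors' implementation, and its helper names are hypothetical. It encodes functions as bitmasks over subsets, so that join and meet become bitwise OR and AND. It also makes one assumption explicit that the text leaves implicit: only the maximal sets of β on which α is False are used as graph vertices, which gives C_{α,α} = 0. The loop at the end compares 2^C with a direct count of the solutions of Equations 7 and 8.

```python
def monotone_functions(n):
    """All monotone functions on n elements, as bitmasks over the 2**n subsets."""
    return [f for f in range(1 << (1 << n))
            if all(f >> (x & ~(1 << i)) & 1
                   for x in range(1 << n) if f >> x & 1
                   for i in range(n) if x >> i & 1)]

def maximal_sets(f, n):
    """The anti-chain of f: sets where f is True but False on every one-element extension."""
    return [x for x in range(1 << n) if f >> x & 1
            and not any(f >> (x | 1 << i) & 1 for i in range(n) if not x >> i & 1)]

def connector_number(alpha, beta, n):
    """Connected components among the maximal sets of beta, relative to alpha."""
    # Assumption: vertices are the maximal sets of beta on which alpha is False.
    unseen = {x for x in maximal_sets(beta, n) if not alpha >> x & 1}
    components = 0
    while unseen:
        components += 1
        stack = [unseen.pop()]
        while stack:
            x = stack.pop()
            # X and Y are linked when alpha is False on their intersection.
            linked = {y for y in unseen if not alpha >> (x & y) & 1}
            unseen -= linked
            stack.extend(linked)
    return components

def count_decompositions(alpha, beta, funcs):
    """Brute-force count of pairs (chi, upsilon) with join beta and meet alpha."""
    return sum(1 for chi in funcs for ups in funcs
               if chi | ups == beta and chi & ups == alpha)

for n in range(4):
    funcs = monotone_functions(n)
    for alpha in funcs:
        for beta in funcs:
            if alpha & beta == alpha:  # alpha <= beta
                assert count_decompositions(alpha, beta, funcs) == 2 ** connector_number(alpha, beta, n)
```

The brute-force comparison is only practical up to n = 3, but it serves as a sanity check on the reading of the definition used here.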
2 Method, Theory
The original PCoeff formula, as taken from [5], is
\[ D(n+2) = \sum_{\alpha, \beta \in D_n} |[\bot, \alpha]| \, 2^{C_{\alpha,\beta}} \, |[\beta, \top]| \tag{9} \]
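Equation 9 can be checked numerically for small n. The following is a minimal sketch with hypothetical helper names, using the same bitmask encoding of functions over subsets as a convenience. It restricts the sum to pairs α ≤ β, for which the connector number is defined, and it assumes that only maximal sets of β on which α is False count as vertices.

```python
def monotone_functions(n):
    """All monotone functions on n elements, as bitmasks over the 2**n subsets."""
    return [f for f in range(1 << (1 << n))
            if all(f >> (x & ~(1 << i)) & 1
                   for x in range(1 << n) if f >> x & 1
                   for i in range(n) if x >> i & 1)]

def connector_number(alpha, beta, n):
    """Number of connected components of the anti-chain of beta with respect to alpha."""
    # Maximal sets of beta on which alpha is False (assumed vertex set).
    unseen = {x for x in range(1 << n)
              if beta >> x & 1 and not alpha >> x & 1
              and not any(beta >> (x | 1 << i) & 1 for i in range(n) if not x >> i & 1)}
    components = 0
    while unseen:
        components += 1
        stack = [unseen.pop()]
        while stack:
            x = stack.pop()
            linked = {y for y in unseen if not alpha >> (x & y) & 1}
            unseen -= linked
            stack.extend(linked)
    return components

def dedekind_via_pcoeff(n):
    """Evaluate the right-hand side of Equation 9, which should equal D(n + 2)."""
    funcs = monotone_functions(n)
    below = {a: sum(1 for f in funcs if f & a == f) for a in funcs}  # |[bottom, a]|
    above = {b: sum(1 for f in funcs if f & b == b) for b in funcs}  # |[b, top]|
    return sum(below[a] * 2 ** connector_number(a, b, n) * above[b]
               for a in funcs for b in funcs if a & b == a)

print([dedekind_via_pcoeff(n) for n in range(4)])
```

For n = 0 to 3 the sum reproduces D(2) to D(5) from Table 1.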
In the master's thesis of the first author of the current paper, Lennart Van Hirtum
[7], this formula was reworked into a form that uses equivalence classes
to reduce the total number of terms:
\[ D(n+2) = \sum_{\alpha \in R_n} |[\bot, \alpha]| \, D_\alpha \sum_{\substack{\beta \in R_n \\ \exists \delta \simeq \beta : \alpha \le \delta}} |[\beta, \top]| \, \frac{D_\beta}{n!} \sum_{\substack{\gamma \in \mathrm{Permut}_\beta \\ \alpha \le \gamma}} 2^{C_{\alpha,\gamma}} \tag{10} \]
Here Permut_β is the collection of all n! equivalents of β under permutation
of the base set. D_β is the number of different equivalents, and hence Permut_β
contains duplicates iff D_β < n!. These duplicates are divided out by the D_β/n!
factor.
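The step from Equation 9 to Equation 10 can be sketched as follows. This is our reading of the regrouping, not a derivation quoted from [7]. It assumes that the sum in Equation 9 runs over pairs with α ≤ γ, and that every factor is invariant when α and γ are permuted simultaneously.

```latex
\begin{align*}
D(n+2)
  &= \sum_{\alpha \in D_n} \sum_{\substack{\gamma \in D_n \\ \alpha \le \gamma}}
     |[\bot,\alpha]| \, 2^{C_{\alpha,\gamma}} \, |[\gamma,\top]| \\
  &= \sum_{\alpha \in R_n} D_\alpha \, |[\bot,\alpha]|
     \sum_{\substack{\gamma \in D_n \\ \alpha \le \gamma}}
     2^{C_{\alpha,\gamma}} \, |[\gamma,\top]| \\
  &= \sum_{\alpha \in R_n} D_\alpha \, |[\bot,\alpha]|
     \sum_{\beta \in R_n} |[\beta,\top]| \, \frac{D_\beta}{n!}
     \sum_{\substack{\gamma \in \mathrm{Permut}_\beta \\ \alpha \le \gamma}}
     2^{C_{\alpha,\gamma}}
\end{align*}
```

The second line uses that the inner sum is the same for all D_α members of a class. The third line groups γ by its class β, uses |[γ, ⊤]| = |[β, ⊤]|, and corrects for each distinct equivalent of β appearing n!/D_β times in Permut_β. The condition ∃δ ≃ β : α ≤ δ in Equation 10 only skips classes whose inner sum is empty.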
For D(9), this means iterating through D_7. That would require iterating over an
estimated 4.59 × 10^16 pairs α, β. The total number of P-Coëfficients (C_{α,γ}) that
needed to be computed was 1.148 × 10^19. However, we were able to improve on
this further using a process of 'deduplication': pairs α, β give identical results
to their dual pair β, α, as per Equation 11. This allowed us to halve the total
amount of work again, to 5.574 × 10^18 P-Coëfficients.¹
\[ |[\bot, \alpha]| \, 2^{C_{\alpha,\beta}} \, |[\beta, \top]| = |[\alpha, \top]| \, 2^{C_{\beta,\alpha}} \, |[\bot, \beta]| \tag{11} \]
3 Computing P-Coëfficients on FPGA
Computing P-Coëfficients is uniquely well-suited for hardware implementation.
Computing these terms requires counting the number of distinct connected
components within a graph. An example of such a graph with its distinct
connected components colored is shown in Figure 1. Counting connected
components is an incredibly branchy procedure, which is why traditional
instruction-based computing methods fare poorly on it, especially any kind of
Single Instruction, Multiple Data (SIMD) based method.
However, since counting connected components consists almost purely of plain
Boolean operations, it translates very well to a hardware implementation. A simple
schematic implementation is shown in Figure 2. A detailed explanation of how
it works is provided in the first author's master's thesis [7]. That thesis also
derives some optimizations that bring the average number of iterations down to
4.061.
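To illustrate why this problem reduces to plain Boolean operations, here is a generic software sketch of component counting by iterative flood fill on bitmasks. It is not the CountConnected Core design from [7]. The adjacency-mask representation, the function name, and the example graph are assumptions made for illustration.

```python
def count_components(adjacency):
    """Count connected components of a graph given as adjacency bitmasks.

    adjacency[v] has bit w set iff vertices v and w share an edge.
    Returns (components, iterations), where one iteration is one
    expansion of the current component by all of its neighbours.
    """
    remaining = (1 << len(adjacency)) - 1  # vertices not yet assigned
    components = iterations = 0
    while remaining:
        component = remaining & -remaining  # seed: lowest remaining vertex
        while True:
            iterations += 1
            grown = component
            for v in range(len(adjacency)):
                if component >> v & 1:
                    grown |= adjacency[v]  # add all neighbours at once
            if grown == component:  # fixed point: component is complete
                break
            component = grown
        remaining &= ~component
        components += 1
    return components, iterations

# Hypothetical example: edges 0-1, 1-2 and 3-4, with vertex 5 isolated.
example = [0b000010, 0b000101, 0b000010, 0b010000, 0b001000, 0b000000]
print(count_components(example))
```

Each expansion step uses only AND, OR and shifts, and the number of steps depends on the data. That data dependence makes the loop branchy on a CPU, while in hardware each step is a fixed block of logic.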
¹ We made sure not to deduplicate pairs that were their own dual, i.e. when β = α.
Figure 1: Connected components of an example graph. In this case there are 3
connected components.
4 Computation on Noctua 2
We implemented this hardware accelerator on the Intel Stratix 10 GX 2800
cards found in Paderborn University's Noctua 2 supercomputer. We were able
to fit 300 of these CountConnected Cores on a single field-programmable gate
array (FPGA) die. These CountConnected Cores run at 450 MHz. This gives
us a throughput of about 33 billion CountConnected operations per second. At
this rate, a single FPGA processes about 5.2 α values per second, taking 47,000
FPGA hours to compute D(9) on Noctua 2, or about 3 months real-time.
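The quoted throughput is consistent with the core count, clock rate, and average iteration count given earlier. The arithmetic below assumes that each core spends one clock cycle per iteration. That is our reading and is not stated explicitly in the text.

```python
cores_per_fpga = 300      # CountConnected Cores per FPGA die
clock_hz = 450e6          # 450 MHz
avg_iterations = 4.061    # average iterations per CountConnected operation

# Assumption: one clock cycle per iteration on each core.
ops_per_second = cores_per_fpga * clock_hz / avg_iterations
print(f"{ops_per_second:.3g} CountConnected operations per second")
```

This gives roughly 3.3 × 10^10, in line with the figure of about 33 billion.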
The computation is split across the system along the lines of Equation 10. The α
values (also named tops) are divided at the job level. There are 490M tops to
be processed for D(9). We split these into 15000 jobs of 30000 tops each. The
β values per top (also named bottoms) are placed in large buffers of 46M bottoms
on average, and sent over PCIe (Peripheral Component Interconnect Express)
to the FPGA. The FPGA then computes all 5040 permutations (γ) of each
bottom, computes and adds up their P-Coëfficients, and stores the result in an
output buffer of the same size.
The artifact of this computation is a dataset with an intermediary result for
each of the 490M α values. Each of these can be checked separately², and the
whole file sums to 286386577668298411128469151667598498812366.
² It takes about 10-200 s to compute a single α result on 128 AMD Epyc CPU cores.
Figure 2: The CountConnected Core
5 Correctness
As much of the code as possible is written generically. This means the same
system is used for computing D(3) through D(8), all of which yield the correct
results. The FPGA kernel, however, is written specifically for the D(9)
computation, so its correctness was verified by comparing its results with the
CPU results for a small sample. In effect, both methods verified each other's
correctness.
In addition, we have several checks to increase our confidence in the result.
• The most direct is the D(9) ≡ 6 mod 210 check provided by Pawelski &
Szepietowski [8]. Our result passes this check. Sadly, due to the structure
of our computation, nearly all terms are divisible by 210, which strongly
hampers the usefulness of this check. One thing that this check does
give us is that no integer overflow has occurred, which was an important
concern given that we were working with integers 128 and 192 bits wide.
• Our computation was plagued by one issue in particular: a bug in the
vendor library for communication over PCIe, wherein, occasionally and at
a low incidence rate, full 4K pages of FPGA data are not copied properly
from FPGA memory to host memory. This results in large blocks of
incorrect bottoms for some tops. We encountered this issue in about 2300
tops. We were able to mitigate this issue by sending extra data from the
FPGA to host memory, namely the 'valid permutation count'. By checking
these values, we could determine if a bottom buffer had been corrupted.
Additionally, adding all of these counts yields the value for D(8), which
shows that the correct number of terms has been added.
• Finally, there is an estimation formula, which gives an estimate that
is relatively close to our result. The Korshunov estimation formula
estimates D(9) ≈ 1.15 × 10^41, which is off by about a factor of 2.³
³ This isn't too unusual though, as the results for odd values are off by quite a lot. The
estimate for D(3) overestimates by a factor of 2, the one for D(5) also overestimates by a
factor of 2, and the one for D(7) overestimates by roughly 10%.
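The first check in the list above is easy to reproduce with arbitrary-precision integers. The snippet below takes the value from Table 1 and also reports its bit length, which shows why integers wider than 128 bits were needed. The variable name is our own.

```python
# Proposed value of D(9), copied from Table 1.
D9 = 286386577668298411128469151667598498812366

print(D9 % 210)          # congruence check from Pawelski & Szepietowski [8]
print(D9.bit_length())   # number of bits needed to store the value
```

The residue is 6, as required, and the bit length lies between 128 and 192.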
Figure 3: The FPGA Accelerator Die
6 Remaining potential source of errors
The one way our result could still be wrong is due to a Single Event Upset (SEU),
such as a bitflip in the FPGA fabric during processing, or a bitflip during data
transfer from FPGA DDR memory to main memory.
It is hard to characterise the odds of these SEUs. To the best of our knowledge,
the expected number of occurrences for the FPGAs we used is not available.
But example values shown on Intel's website pin the error rate at
around 5000 SEUs per billion FPGA hours. In that case, given our 47000
FPGA hours, we expect to see 0.235 errors, Poisson distributed, giving us a
chance of about 20% of a hit. Of course, this is just an example and the real odds
might be higher than that.
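The estimate above follows from the Poisson model, where the probability of at least one event is 1 − e^(−λ). The snippet uses the example rate quoted in the text, which is not a measured value for the cards used here.

```python
import math

seu_per_fpga_hour = 5000 / 1e9   # example rate: 5000 SEUs per billion FPGA hours
fpga_hours = 47_000

expected_seus = seu_per_fpga_hour * fpga_hours       # Poisson mean
p_at_least_one = 1 - math.exp(-expected_seus)        # P(N >= 1)
print(f"expected SEUs: {expected_seus:.3f}, P(at least one): {p_at_least_one:.1%}")
```

This gives a mean of 0.235 and a probability of about 21%.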
7 Conclusion
In conclusion, our method for computing D(9) works, and our implementation
should theoretically give the correct result. All that remains is the question:
have any bit errors occurred during this first computation? In any case, we're
starting up a second run now, and as it progresses we'll gain more and more
confidence in our result. Each subresult is computed a second time, and any
values that differ are recomputed a third time as a tiebreaker. If we find no
errors, then we can be sure that the 9th Dedekind number was found on the 8th
of March, 2023 using the Noctua 2 supercluster at Paderborn University. This
value was registered in the corresponding GitHub commit:
https://github.com/VonTum/Dedekind/commit/1cf7b019afca655586e8210f97fbb5399d61e842
All code is available at https://github.com/VonTum/Dedekind.
8 Note
On April 4th, right before the present publication, a preprint claiming D(9)
was published: "A computation of the ninth Dedekind Number" by Christian
Jäkel [9]. That paper confirms our result, which shows that our tentative
result is in fact correct.
References
[1] R. Dedekind. Über Zerlegungen von Zahlen Durch Ihre Grössten
Gemeinsamen Theiler, pages 1–40. Vieweg+Teubner Verlag, Wiesbaden,
1897.
[2] Doug Wiedemann. A computation of the eighth Dedekind number.
https://link.springer.com/article/10.1007%2FBF00385808, 1991.
[3] Tamon Stephen and Timothy Yusun. Counting inequivalent monotone
Boolean functions. Discrete Applied Mathematics, 167:15–24, 2014.
[4] Bartlomiej Pawelski. On the number of inequivalent monotone Boolean
functions of 8 variables, 2021.
[5] Patrick De Causmaecker and Stefan De Wannemacker. On the number of
antichains of sets in a finite universe, 2014.
[6] Patrick De Causmaecker, Stefan De Wannemacker, and Jay Yellen. Intervals
of antichains and their decompositions, 2016.
[7] Lennart Van Hirtum. A path to compute the 9th Dedekind number using
FPGA supercomputing. https://hirtum.com/thesis.pdf, 2021. KU Leuven,
Master's Thesis.
[8] Bartlomiej Pawelski and Andrzej Szepietowski. Divisibility properties of
Dedekind numbers, 2023.
[9] Christian Jäkel. A computation of the ninth Dedekind number. Preprint,
April 2023.