Rethinking Attention With Performers
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on
any priors such as sparsity or low-rankness. To approximate softmax attention-
kernels, Performers use a novel Fast Attention Via positive Orthogonal Random
features approach (FAVOR+), which may be of independent interest for scalable
kernel methods. FAVOR+ can also be used to efficiently model kernelizable
attention mechanisms beyond softmax. This representational power is crucial to
accurately compare softmax with other kernels for the first time on large-scale tasks,
beyond the reach of regular Transformers, and investigate optimal attention-kernels.
Performers are linear architectures fully compatible with regular Transformers
and with strong theoretical guarantees: unbiased or nearly-unbiased estimation
of the attention matrix, uniform convergence and low estimation variance. We
tested Performers on a rich set of tasks stretching from pixel-prediction through
text models to protein sequence modeling. We demonstrate competitive results
with other examined efficient sparse and dense attention methods, showcasing
effectiveness of the novel attention-learning paradigm leveraged by Performers.
layers (Child et al., 2019). Unfortunately, there is a lack of rigorous guarantees for the representation
power produced by such methods, and sometimes the validity of sparsity patterns can only be verified
empirically through trial and error by constructing special GPU operations (e.g. either writing C++
CUDA kernels (Child et al., 2019) or using TVMs (Beltagy et al., 2020)). Other techniques which
aim to reduce Transformers’ space complexity include reversible residual layers allowing one-time
activation storage in training (Kitaev et al., 2020) and shared attention weights (Xiao et al., 2019).
These constraints may impede application to long-sequence problems, where approximations of
the attention mechanism are not sufficient. Approximations based on truncated back-propagation
(Dai et al., 2019) are also unable to capture long-distance correlations since the gradients are only
propagated inside a localized window. Other methods propose biased estimation of regular attention
but only in the non-causal setting and with large mean squared error (Wang et al., 2020).
In response, we introduce the first Transformer architectures, Performers, capable of provably
accurate and practical estimation of regular (softmax) full-rank attention, but with only linear space and time complexity and without relying on any priors such as sparsity or low-rankness. Performers
use the Fast Attention Via positive Orthogonal Random features (FAVOR+) mechanism, leveraging
new methods for approximating softmax and Gaussian kernels, which we propose. We believe
these methods are of independent interest, contributing to the theory of scalable kernel methods.
Consequently, Performers are the first linear architectures fully compatible (via small amounts
of fine-tuning) with regular Transformers, providing strong theoretical guarantees: unbiased or
nearly-unbiased estimation of the attention matrix, uniform convergence and lower variance of the
approximation.
FAVOR+ can also be applied to efficiently model other kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks that are beyond the reach of regular Transformers, and to find for them optimal attention-kernels. FAVOR+ can also be applied beyond the Transformer scope as a
more scalable replacement for regular attention, which itself has a wide variety of uses in computer
vision (Fu et al., 2019), reinforcement learning (Zambaldi et al., 2019), training with softmax cross
entropy loss, and even combinatorial optimization (Vinyals et al., 2015).
We test Performers on a rich set of tasks ranging from pixel-prediction through text models to protein
sequence modeling. We demonstrate competitive results with other examined efficient sparse and
dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm
leveraged by Performers. We emphasize that in principle, FAVOR+ can also be combined with other
techniques, such as reversible layers (Kitaev et al., 2020) or cluster-based attention (Roy et al., 2020).
2 FAVOR+ MECHANISM & POSITIVE ORTHOGONAL RANDOM FEATURES
Below we describe in detail the FAVOR+ mechanism - the backbone of the Performer's architecture.
We introduce a new method for estimating softmax (and Gaussian) kernels with positive orthogonal
random features which FAVOR+ leverages for the robust and unbiased estimation of regular (softmax)
attention and show how FAVOR+ can be applied for other attention-kernels.
2.1 PRELIMINARIES - REGULAR ATTENTION MECHANISM
Let L be the size of an input sequence of tokens. Then regular dot-product attention (Vaswani et al., 2017) is a mapping which accepts matrices Q, K, V ∈ R^{L×d} as input where d is the hidden dimension (dimension of the latent representation). Matrices Q, K, V are intermediate representations of the input and their rows can be interpreted as queries, keys and values of the continuous dictionary data structure respectively. Bidirectional (or non-directional (Devlin et al., 2018)) dot-product attention has the following form, where A ∈ R^{L×L} is the so-called attention matrix:
\mathrm{Att}_{\leftrightarrow}(Q, K, V) = D^{-1} A V, \quad A = \exp(QK^{\top}/\sqrt{d}), \quad D = \mathrm{diag}(A \mathbf{1}_L). \tag{1}
Here exp(·) is applied elementwise, 1_L is the all-ones vector of length L, and diag(·) is a diagonal matrix with the input vector as the diagonal. Time and space complexity of computing (1) are O(L²d) and O(L² + Ld) respectively, because A has to be stored explicitly. Hence, in principle, dot-product attention of type (1) is incompatible with end-to-end processing of long sequences. Bidirectional attention is applied in encoder self-attention and encoder-decoder attention in Seq2Seq architectures.
Another important type of attention is unidirectional dot-product attention which has the form:
Att→ (Q, K, V) = D e −1 AV,
e A
e = tril(A), D e = diag(A1e L ), (2)
where tril(·) returns the lower-triangular part of the argument matrix including the diagonal. As
discussed in (Vaswani et al., 2017), unidirectional attention is used for autoregressive generative
modelling, e.g. as self-attention in generative Transformers as well as the decoder part of Seq2Seq
Transformers.
We will show that the attention matrix A can be approximated up to any precision in time O(Ld² log(d)). For comparison, popular methods leveraging sparsity via Locality-Sensitive Hashing (LSH) techniques (Kitaev et al., 2020) have O(Ld² log L) time complexity. In the main body of the paper we will describe FAVOR+ for bidirectional attention. Completely analogous results can be obtained for the unidirectional variant via the mechanism of prefix-sums (all details in Appendix B.1).
2.2 GENERALIZED KERNELIZABLE ATTENTION
FAVOR+ works for attention blocks using matrices A ∈ R^{L×L} of the form A(i, j) = K(q_i^{\top}, k_j^{\top}), with q_i/k_j standing for the i-th/j-th query/key row-vector in Q/K and kernel K : R^d × R^d → R_+ defined for the (usually randomized) mapping φ : R^d → R_+^r (for some r > 0) as:

K(x, y) = \mathbb{E}[\phi(x)^{\top}\phi(y)]. \tag{3}

We call φ(u) a random feature map for u ∈ R^d. For Q', K' ∈ R^{L×r} with rows given as φ(q_i^{\top})^{\top} and φ(k_i^{\top})^{\top} respectively, Equation 3 leads directly to the efficient attention mechanism of the form:

\widehat{\mathrm{Att}}_{\leftrightarrow}(Q, K, V) = \widehat{D}^{-1}(Q'((K')^{\top}V)), \quad \widehat{D} = \mathrm{diag}(Q'((K')^{\top}\mathbf{1}_L)). \tag{4}

Here \widehat{\mathrm{Att}}_{\leftrightarrow} stands for the approximate attention and brackets indicate the order of computations. It is easy to see that such a mechanism is characterized by space complexity O(Lr + Ld + rd) and time complexity O(Lrd) as opposed to O(L² + Ld) and O(L²d) of the regular attention (see also Fig. 1).
Figure 1: Approximation of the regular attention mechanism AV (before D−1 -renormalization) via (random)
feature maps. Dashed-blocks indicate order of computation with corresponding time complexities attached.
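To make the bracketing in Equation 4 concrete, the following is a minimal JAX sketch (our own illustration with hypothetical names, not the released implementation): regular attention materializes the L × L matrix A, while the kernelized variant only forms r-dimensional and r × d intermediates. The ReLU feature map here is just a placeholder for a generic φ.

```python
import jax
import jax.numpy as jnp

def regular_attention(Q, K, V):
    # Eq. (1): O(L^2 d) time and O(L^2) memory, since A is stored explicitly.
    A = jnp.exp(Q @ K.T / jnp.sqrt(Q.shape[-1]))
    return A @ V / A.sum(axis=-1, keepdims=True)

def kernelized_attention(Q_prime, K_prime, V):
    # Eq. (4): brackets fix the order of computation, so only an (r, d)
    # matrix and an L-dimensional renormalizer are formed -> O(Lrd) time.
    KV = K_prime.T @ V                         # (r, d)
    D_hat = Q_prime @ K_prime.sum(axis=0)      # (L,) = Q'((K')^T 1_L)
    return (Q_prime @ KV) / D_hat[:, None]

def relu_features(X, W):
    # Placeholder feature map phi (generalized ReLU attention), not the
    # softmax estimator of Section 2.3; W is a (d, r) projection matrix.
    return jax.nn.relu(X @ W)

# toy usage
L, d, r = 8, 4, 16
kq, kk, kv, kw = jax.random.split(jax.random.PRNGKey(0), 4)
Q, K, V = (jax.random.normal(k, (L, d)) for k in (kq, kk, kv))
W = jax.random.normal(kw, (d, r))
out = kernelized_attention(relu_features(Q, W), relu_features(K, W), V)  # (L, d)
```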
The above scheme constitutes the FA-part of the FAVOR+ mechanism. The remaining OR+ part answers the following questions: (1) How expressive is the attention model defined in Equation 3, and in particular, can we use it in principle to approximate regular softmax attention? (2) How do we implement it robustly in practice, and in particular, can we choose r ≪ L for L ≫ d to obtain desired space and time complexity gains? We answer these questions in the next sections.
2.3 HOW TO AND HOW NOT TO APPROXIMATE SOFTMAX-KERNELS FOR ATTENTION
It turns out that by taking φ of the following form for functions f_1, ..., f_l : R → R, a function h : R^d → R and deterministic vectors ω_i or ω_1, ..., ω_m \overset{\mathrm{iid}}{\sim} D for some distribution D ∈ P(R^d):

\phi(x) = \frac{h(x)}{\sqrt{m}}\left(f_1(\omega_1^{\top}x), \ldots, f_1(\omega_m^{\top}x), \ldots, f_l(\omega_1^{\top}x), \ldots, f_l(\omega_m^{\top}x)\right), \tag{5}
we can model most kernels used in practice. Furthermore, in most cases D is isotropic (i.e. with pdf function constant on a sphere), usually Gaussian. For example, by taking h(x) = 1, l = 1 and D = N(0, I_d) we obtain estimators of the so-called PNG-kernels (Choromanski et al., 2017) (e.g. f_1 = sgn corresponds to the angular kernel). Configurations: h(x) = 1, l = 2, f_1 = sin, f_2 = cos correspond to shift-invariant kernels, in particular D = N(0, I_d) leads to the Gaussian kernel K_gauss (Rahimi & Recht, 2007). The softmax-kernel which defines regular attention matrix A is given as:
SM(x, y) \overset{\mathrm{def}}{=} \exp(x^{\top}y). \tag{6}

In the above, without loss of generality, we omit the √d-renormalization since we can equivalently renormalize input keys and queries. Since SM(x, y) = \exp(\frac{\|x\|^2}{2}) K_{\mathrm{gauss}}(x, y) \exp(\frac{\|y\|^2}{2}), based on what we have said, we obtain a random feature map unbiased approximation of SM(x, y) using trigonometric functions with: h(x) = \exp(\frac{\|x\|^2}{2}), l = 2, f_1 = sin, f_2 = cos. We call it \widehat{SM}^{\mathrm{trig}}_m(x, y).
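For contrast with the positive features introduced next, the trigonometric estimator \widehat{SM}^{\mathrm{trig}}_m can be written as the following JAX sketch (ours, with hypothetical names); the paragraph below explains why we ultimately do not rely on it.

```python
import jax.numpy as jnp

def trig_softmax_features(X, W):
    # Trigonometric features (h(x) = exp(||x||^2 / 2), l = 2, f1 = sin, f2 = cos).
    # Unbiased for SM(x, y), but entries may be negative and the variance is
    # large precisely where SM(x, y) is close to 0 (see Section 3).
    m = W.shape[0]
    projections = X @ W.T                                    # (L, m): omega_i^T x
    sq_norms = 0.5 * jnp.sum(X ** 2, axis=-1, keepdims=True)
    scale = jnp.exp(sq_norms) / jnp.sqrt(m)
    return scale * jnp.concatenate(
        [jnp.sin(projections), jnp.cos(projections)], axis=-1)
```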
There is however a caveat there. The attention module from (1) constructs for each token, a convex
combination of value-vectors with coefficients given as corresponding renormalized kernel scores.
That is why kernels producing non-negative scores are used. Applying random feature maps with
potentially negative dimension-values (sin / cos) leads to unstable behaviours, especially when kernel
scores close to 0 (which is the case for many entries of A corresponding to low relevance tokens) are
approximated by estimators with large variance in such regions. This results in abnormal behaviours,
e.g. negative-diagonal-values renormalizers D^{-1}, and consequently either completely prevents training or leads to sub-optimal models. We demonstrate empirically that this is what happens for \widehat{SM}^{\mathrm{trig}}_m and provide detailed theoretical explanations showing that the variance of \widehat{SM}^{\mathrm{trig}}_m is large as approximated values tend to 0 (see: Section 3). This is one of the main reasons why the robust random feature map mechanism for approximating regular softmax attention was never proposed. We propose a robust mechanism in this paper. Furthermore, the variance of our new unbiased positive random feature map estimator tends to 0 as approximated values tend to 0 (see: Section 3).
Lemma 1 (Positive Random Features (PRFs) for Softmax). For x, y ∈ R^d, z = x + y we have:

SM(x, y) = \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\left[\exp\left(\omega^{\top}x - \frac{\|x\|^2}{2}\right)\exp\left(\omega^{\top}y - \frac{\|y\|^2}{2}\right)\right] = \Lambda\, \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\cosh(\omega^{\top}z), \tag{7}

where Λ = \exp(-\frac{\|x\|^2 + \|y\|^2}{2}) and cosh is hyperbolic cosine. Consequently, the softmax-kernel admits a positive random feature map unbiased approximation with h(x) = \exp(-\frac{\|x\|^2}{2}), l = 1, f_1 = \exp and D = \mathcal{N}(0, I_d) or: h(x) = \frac{1}{\sqrt{2}}\exp(-\frac{\|x\|^2}{2}), l = 2, f_1(u) = \exp(u), f_2(u) = \exp(-u) and the same D (the latter for further variance reduction). We call the related estimators \widehat{SM}^{+}_m and \widehat{SM}^{\mathrm{hyp+}}_m.
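To illustrate Lemma 1, here is a small JAX sketch (ours; the function names are hypothetical and not taken from the released code) of the positive feature map with h(x) = \exp(-\|x\|^2/2), l = 1, f_1 = \exp, so that φ(x)^{\top}φ(y) is an unbiased, strictly positive Monte Carlo estimate of SM(x, y) = \exp(x^{\top}y).

```python
import jax
import jax.numpy as jnp

def positive_softmax_features(X, W):
    # Positive random features of Lemma 1 (l = 1, f1 = exp).
    # X: (L, d) queries or keys; W: (m, d) with rows drawn from N(0, I_d).
    m = W.shape[0]
    projections = X @ W.T                                    # (L, m): omega^T x
    sq_norms = 0.5 * jnp.sum(X ** 2, axis=-1, keepdims=True)
    # h(x) = exp(-||x||^2 / 2), with the 1/sqrt(m) normalization of Eq. (5).
    return jnp.exp(projections - sq_norms) / jnp.sqrt(m)

# sanity check: the dot product of feature vectors estimates exp(x^T y)
d, m = 16, 4096
kx, kw = jax.random.split(jax.random.PRNGKey(0))
x, y = 0.3 * jax.random.normal(kx, (2, d))
W = jax.random.normal(kw, (m, d))
phi_x = positive_softmax_features(x[None, :], W)
phi_y = positive_softmax_features(y[None, :], W)
approx = (phi_x @ phi_y.T)[0, 0]     # Monte Carlo estimate of SM(x, y)
exact = jnp.exp(x @ y)
```

Because every coordinate of φ is positive, the estimator cannot produce the negative kernel scores that destabilize the trigonometric variant.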
Figure 2: Left: Symmetrized (around origin) utility function r (defined as the ratio of the mean squared errors
(MSEs) of estimators built on: trigonometric and positive random features) as a function of the angle φ (in
radians) between input feature vectors and their lengths l. Larger values indicate regions of (φ, l)-space with
better performance of positive random features. We see that for critical regions with φ large enough (small
enough softmax-kernel values) our method is arbitrarily more accurate than trigonometric random features. Plot
presented for domain [−π, π] × [−2, 2]. Right: The slice of function r for fixed l = 1 and varying angle φ.
Right Upper Corner: Comparison of the MSEs of both the estimators in a low softmax-kernel value region.
In Fig. 2 we visualize the advantages of positive versus standard trigonometric random features. In
critical regions, where kernel values are small and need careful approximation, our method outper-
forms its counterpart. In Section 4 we further confirm our method’s advantages empirically, using
positive features to efficiently train softmax-based linear Transformers. If we replace in (7) ω with \sqrt{d}\,\frac{\omega}{\|\omega\|}, we obtain the so-called regularized softmax-kernel SMREG, which we can approximate in a similar manner, simply changing D = \mathcal{N}(0, I_d) to D = \mathrm{Unif}(\sqrt{d}\,S^{d-1}), a distribution corresponding to Haar measure on the sphere of radius √d in R^d, obtaining estimator \widehat{SMREG}^{+}_m. As we show in Section 3, such random features can also be used to accurately approximate the regular softmax-kernel.
Consequently, positive random features for SMREG can be used to approximate the softmax-kernel.
Our next result shows that orthogonality provably reduces mean squared error of the estimation with
positive random features for any dimensionality d > 0 and we explicitly provide the gap.
Theorem 2. If \widehat{SM}^{\mathrm{ort+}}_m(x, y) stands for the modification of \widehat{SM}^{+}_m(x, y) with orthogonal random features (and thus for m ≤ d), then the following holds for any d > 0:

\mathrm{MSE}(\widehat{SM}^{\mathrm{ort+}}_m(x, y)) \leq \mathrm{MSE}(\widehat{SM}^{+}_m(x, y)) - \frac{2(m-1)}{m(d+2)}\left(SM(x, y) - \exp\left(-\frac{\|x\|^2 + \|y\|^2}{2}\right)\right)^2. \tag{10}
Furthermore, completely analogous result holds for the regularized softmax-kernel SMREG.
For the regularized softmax-kernel, orthogonal features provide additional concentration results - the
first exponentially small bounds for probabilities of estimators’ tails that are strictly better than for
non-orthogonal variants for every d > 0. Our next result enables us to explicitly estimate the gap.
Theorem 3. Let x, y ∈ R^d. The following holds for any a > SMREG(x, y), θ > 0 and m ≤ d:

\mathbb{P}[\widehat{SMREG}^{+}_m(x, y) > a] \leq \exp(-\theta m a)\, M_Z(\theta)^m, \qquad \mathbb{P}[\widehat{SMREG}^{\mathrm{ort+}}_m(x, y) > a] \leq \exp(-\theta m a)\left(M_Z(\theta)^m - \frac{m\theta^4}{2}\exp\left(-\frac{m(m-1)}{4(d+2)}(\|x\|^2 + \|y\|^2)\right)\|x + y\|^4\right),

where \widehat{SMREG}^{\mathrm{ort+}}_m(x, y) stands for the modification of \widehat{SMREG}^{+}_m(x, y) with ORFs, Z = \Lambda \exp\left(\sqrt{d}\,\frac{\omega^{\top}}{\|\omega\|_2}(x + y)\right), ω ∼ \mathcal{N}(0, I_d), Λ is as in Lemma 1 and M_Z is the moment generating function of Z.
We see that ORFs provide exponentially small and sharper bounds for critical regions where the softmax-kernel is small. Below we show that even for the \widehat{SM}^{\mathrm{trig}} mechanism with ORFs, it suffices to take m = Θ(d log(d)) random projections to accurately approximate the attention matrix (thus, if not for attention renormalization, PRFs would not be needed). In general, m depends on the dimensionality d of the embeddings, the radius R of the ball where all queries/keys live and the precision parameter ε (see Appendix F.6 for additional discussion), but does not depend on the input sequence length L.
Theorem 4 (uniform convergence for attention approximation). Assume that the L²-norms of queries/keys are upper-bounded by R > 0. Define l = Rd^{-1/4} and take h^* = \exp(\frac{l^2}{2}). Then for any ε > 0, δ = \frac{\epsilon}{(h^*)^2} and the number of random projections m = \Theta\left(\frac{d}{\delta^2}\log\left(\frac{4d^{3/4}R}{\delta}\right)\right), the following holds for the attention approximation mechanism leveraging estimators \widehat{SM}^{\mathrm{trig}} with ORFs: \|\widehat{A} - A\|_{\infty} \leq \epsilon with any constant probability, where \widehat{A} approximates the attention matrix A.
4 EXPERIMENTS
We implemented our setup on top of pre-existing Transformer training code in Jax (Frostig et al.,
2018) optimized with just-in-time (jax.jit) compilation, and complement our theory with em-
pirical evidence to demonstrate the practicality of FAVOR+ in multiple settings. Unless explicitly
stated, a Performer replaces only the attention component with our method, while all other com-
ponents are exactly the same as for the regular Transformer. For shorthand notation, we denote
unidirectional/causal modelling as (U) and bidirectional/masked language modelling as (B).
In terms of baselines, we use other Transformer models for comparison, although some of them
are restricted to only one case - e.g. Reformer (Kitaev et al., 2020) is only (U), and Linformer
(Wang et al., 2020) is only (B). Furthermore, we use PG-19 (Rae et al., 2020) as an alternative (B)
pretraining benchmark, as it is made for long-length sequence training compared to the (now publicly
unavailable) BookCorpus (Zhu et al., 2015) + Wikipedia dataset used in BERT (Devlin et al., 2018)
and Linformer. All model and tokenization hyperparameters are shown in Appendix A.
Figure 3: Comparison of Transformer and Performer in terms of forward and backward pass speed and
maximum L allowed. "X" (OPT) denotes the maximum possible speedup achievable, when attention simply
returns the V-matrix. Plots shown up to when a model produces an out of memory error on a V100 GPU with
16GB. Vocabulary size used was 256. Best in color.
4.1 COMPUTATIONAL COSTS
We compared speed-wise the backward pass of the Transformer and the Performer in (B) setting, as it is one of the main computational bottlenecks during training, when using the regular default size (n_heads, n_layers, d_ff, d) = (8, 6, 2048, 512), where d_ff denotes the width of the MLP layers.
We observed (Fig. 3) that in terms of L, the Performer reaches nearly linear time and sub-quadratic
memory consumption (since the explicit O(L²) attention matrix is not stored). In fact, the Performer
achieves nearly optimal speedup and memory efficiency possible, depicted by the "X"-line when
attention is replaced with the "identity function" simply returning the V-matrix. The combination of
both memory and backward pass efficiencies for large L allows respectively, large batch training and
lower wall clock time per gradient step. Extensive additional results are demonstrated in Appendix E
by varying layers, raw attention, and architecture sizes.
4.2 SOFTMAX ATTENTION APPROXIMATION ERROR
We further examined the approximation error via FAVOR+ in Fig. 4. We demonstrate that 1.
Orthogonal features produce lower error than unstructured (IID) features, 2. Positive features produce
lower error than trigonometric sin/cos features. These two empirically validate the PORF mechanism.
Figure 4: MSE of the approximation output when comparing Orthogonal vs IID features and trigonometric
sin/cos vs positive features. We took L = 4096, d = 16, and varied the number of random samples m. Standard
deviations shown across 15 samples of appropriately normalized random matrix input data.
To further improve overall approximation of attention blocks across multiple iterations which further
improves training, random samples should be periodically redrawn (Fig. 5, right). This is a cheap
procedure, but can be further optimized (Appendix B.2).
4.3 SOFTMAX APPROXIMATION ON TRANSFORMERS
Even if the approximation of the attention mechanism is tight, small errors can easily propagate
throughout multiple Transformer layers (e.g. MLPs, multiple heads), as we show in Fig. 14
(Appendix). In other words, the model’s Lipschitz constant can easily scale up small attention
approximation error, which means that very tight approximations may sometimes be needed. Thus,
when applying FAVOR(+)’s softmax approximations on a Transformer model (i.e. "Performer-X-
SOFTMAX"), we demonstrate that:
1. Backwards compatibility with pretrained models is available as a benefit from softmax approximation, via small finetuning (required due to error propagation) even for trigonometric features (Fig. 5, left) on the LM1B dataset (Chelba et al., 2014). However, on the larger PG-19 dataset, 2. Positive (POS) softmax features (with redrawing) become crucial for achieving performance matching regular Transformers (Fig. 5, right).
Figure 5: We transferred the original pretrained Transformer’s weights into the Performer, which produces
an initial non-zero 0.07 accuracy (dotted orange line), but quickly recovers accuracy in a small fraction of the
original number of gradient steps. However on PG-19, Trigonometric (TRIG) softmax approximation becomes
highly unstable (full curve in Appendix D.2), while positive features (POS) (without redrawing) and Linformer
(which also approximates softmax) even with redrawn projections, plateau at the same perplexity. Positive
softmax with feature redrawing is necessary to match the Transformer, with SMREG (regularization from Sec.
3) allowing faster convergence. Additional ablation studies over many attention kernels, showing also that
trigonometric random features lead even to NaN values in training, are given in Appendix D.3.
Figure 6: Train = Dashed, Validation = Solid. For TrEMBL, we used the exact same model parameters (n_heads, n_layers, d_ff, d) = (8, 36, 1024, 512) from (Madani et al., 2020) for all runs. For fairness, all TrEMBL
experiments used 16x16 TPU-v2’s. Batch sizes were maximized for each separate run given the compute
constraints. Hyperparameters can be found in Appendix A. Extended results including dataset statistics, out of
distribution evaluations, and visualizations, can be found in Appendix C.
4.5 LARGE LENGTH TRAINING - COMMON DATASETS
On the standard (U) ImageNet64 benchmark from (Parmar et al., 2018) with L = 12288, which is unfeasible for regular Transformers, we set all models to use the same (n_heads, d_ff, d) but varying n_layers. Performer/6-layers matches the Reformer/12-layers, while the Performer/12-layers matches
the Reformer/24-layers (Fig. 7: left). Depending on hardware (TPU or GPU), we also found that the
Performer can be 2x faster than the Reformer via Jax optimizations for the (U) setting.
For a proof of principle study, we also create an initial protein benchmark for predicting interactions
among groups of proteins by concatenating protein sequences to length L = 8192 from TrEMBL,
long enough to model protein interaction networks without the large sequence alignments required by
existing methods (Cong et al., 2019). In this setting, a regular Transformer overloads memory even at
a batch size of 1 per chip, by a wide margin. Thus as a baseline, we were forced to use a significantly
smaller variant, reducing to (n_heads, n_layers, d_ff, d) = (8, {1, 2, 3}, 256, 256). Meanwhile, the Performer trains efficiently at a batch size of 8 per chip using the standard (8, 6, 2048, 512) architecture. We see in Fig. 7 (right subfigure) that the smaller Transformer (n_layers = 3) is quickly bounded at ≈ 19%, while the Performer is able to train continuously to ≈ 24%.
Figure 7: Train = Dashed, Validation = Solid. For ImageNet64, all models used the standard (n_heads, d_ff, d) = (8, 2048, 512). We further show that our positive softmax approximation achieves the same performance as ReLU in Appendix D.2. For concatenated TrEMBL, we varied n_layers ∈ {1, 2, 3} for the smaller Transformer.
Hyperparameters can be found in Appendix A.
5 CONCLUSION
We presented Performer, a new type of Transformer, relying on our Fast Attention Via positive Or-
thogonal Random features (FAVOR+) mechanism to significantly improve space and time complexity
of regular Transformers. Our mechanism provides to our knowledge the first effective unbiased esti-
mation of the original softmax-based Transformer with linear space and time complexity and opens
new avenues in the research on Transformers and the role of non-sparsifying attention mechanisms.
6 BROADER IMPACT
We believe that the presented algorithm can be impactful in various ways:
Biology and Medicine: Our method has the potential to directly impact research on biological
sequence analysis by enabling the Transformer to be applied to much longer sequences without
constraints on the structure of the attention matrix. The initial application that we consider is the
prediction of interactions between proteins on the proteome scale. Recently published approaches
require large evolutionary sequence alignments, a bottleneck for applications to mammalian genomes
(Cong et al., 2019). The potentially broad translational impact of applying these approaches to biolog-
ical sequences was one of the main motivations of this work. We believe that modern bioinformatics
can immensely benefit from new machine learning techniques with Transformers being among the
most promising. Scaling up these methods to train faster, more accurate language models opens
the door to the ability to design sets of molecules with pre-specified interaction properties. These
approaches could be used to augment existing physics-based design strategies that are of critical
importance for example in the development of new nanoparticle vaccines (Marcandalli et al., 2019).
Environment: As we have shown, Performers with FAVOR+ are characterized by much lower
compute costs and substantially lower space complexity which can be directly translated to CO2
emission reduction (Strubell et al., 2019) and lower energy consumption (You et al., 2020), as regular
Transformers require very large computational resources.
Research on Transformers: We believe that our results can shape research on efficient Transformer architectures, guiding the field towards methods with strong mathematical foundations. Our research may also hopefully extend Transformers beyond their standard scope (e.g. by considering the Generalized Attention mechanism and connections with kernels). Exploring scalable Transformer architectures that can handle L on the order of several thousand and more, while preserving accuracy of the baseline, is a gateway to new breakthroughs in bio-informatics, e.g. language modeling for proteins, as we explained in the paper. Our presented method can potentially be a first step.
Backward Compatibility: Our Performer can be used on top of a regular pre-trained Transformer
as opposed to other Transformer variants. Even if up-training is not required, FAVOR+ can still be
used for fast inference with no loss of accuracy. We think about this backward compatibility as a
very important additional feature of the presented techniques that might be particularly attractive for
practitioners.
Attention Beyond Transformers: Finally, FAVOR+ can be applied to approximate exact attention
also outside the scope of Transformers. This opens a large volume of new potential applications
including: hierarchical attention networks (HANS) (Yang et al., 2016), graph attention networks
(Velickovic et al., 2018), image processing (Fu et al., 2019), and reinforcement learning/robotics
(Tang et al., 2020).
7 ACKNOWLEDGEMENTS
We thank Nikita Kitaev and Wojciech Gajewski for multiple discussions on the Reformer, and
also thank Aurko Roy and Ashish Vaswani for multiple discussions on the Routing Transformer.
We further thank Joshua Meier, John Platt, and Tom Weingarten for many fruitful discussions on
biological data and useful comments on this draft. We lastly thank Yi Tay and Mostafa Dehghani for
discussions on comparing baselines.
Valerii Likhosherstov acknowledges support from the Cambridge Trust and DeepMind. Lucy Colwell
acknowledges support from the Simons Foundation. Adrian Weller acknowledges support from a
Turing AI Fellowship under grant EP/V025379/1, The Alan Turing Institute under EPSRC grant
EP/N510129/1 and U/B/000074, and the Leverhulme Trust via CFI.
REFERENCES
Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented
convolutional networks. CoRR, abs/1904.09925, 2019. URL http://arxiv.org/abs/
1904.09925.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer.
CoRR, abs/2004.05150, 2020. URL https://arxiv.org/abs/2004.05150.
William Chan, Chitwan Saharia, Geoffrey E. Hinton, Mohammad Norouzi, and Navdeep Jaitly.
Imputer: Sequence modelling via imputation and dynamic programming. CoRR, abs/2002.08926,
2020. URL https://arxiv.org/abs/2002.08926.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony
Robinson. One billion word benchmark for measuring progress in statistical language modeling.
In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication
Association, Singapore, September 14-18, 2014, pp. 2635–2639, 2014.
Ciprian Chelba, Mia Xu Chen, Ankur Bapna, and Noam Shazeer. Faster transformer decoding:
N-gram masked self-attention. CoRR, abs/2001.04589, 2020. URL https://arxiv.org/
abs/2001.04589.
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George F. Foster,
Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz
Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining
recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20,
2018, Volume 1: Long Papers, pp. 76–86. Association for Computational Linguistics, 2018. doi:
10.18653/v1/P18-1008. URL https://www.aclweb.org/anthology/P18-1008/.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse
transformers. CoRR, abs/1904.10509, 2019. URL http://arxiv.org/abs/1904.10509.
Krzysztof Choromanski, Carlton Downey, and Byron Boots. Initialization matters: Orthogonal predic-
tive state recurrent neural networks. In 6th International Conference on Learning Representations,
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
OpenReview.net, 2018a. URL https://openreview.net/forum?id=HJJ23bW0b.
Krzysztof Choromanski, Mark Rowland, Tamás Sarlós, Vikas Sindhwani, Richard E. Turner, and
Adrian Weller. The geometry of random features. In International Conference on Artificial
Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary
Islands, Spain, volume 84 of Proceedings of Machine Learning Research, pp. 1–9. PMLR, 2018b.
URL http://proceedings.mlr.press/v84/choromanski18a.html.
Krzysztof Choromanski, Aldo Pacchiano, Jeffrey Pennington, and Yunhao Tang. KAMA-NNs:
Low-dimensional rotation based neural networks. In The 22nd International Conference on
Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan,
volume 89 of Proceedings of Machine Learning Research, pp. 236–245. PMLR, 2019a. URL
http://proceedings.mlr.press/v89/choromanski19a.html.
Krzysztof Choromanski, Mark Rowland, Wenyu Chen, and Adrian Weller. Unifying orthog-
onal Monte Carlo methods. In Proceedings of the 36th International Conference on Ma-
chine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of
Proceedings of Machine Learning Research, pp. 1203–1212. PMLR, 2019b. URL http:
//proceedings.mlr.press/v97/choromanski19a.html.
Krzysztof Marcin Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness
of structured random orthogonal embeddings. In Advances in Neural Information Processing
Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December
2017, Long Beach, CA, USA, pp. 219–228, 2017.
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network
learning by exponential linear units (elus). In 4th International Conference on Learning Represen-
tations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
URL http://arxiv.org/abs/1511.07289.
Qian Cong, Ivan Anishchenko, Sergey Ovchinnikov, and David Baker. Protein interaction networks
revealed by proteome coevolution. Science, 365(6449):185–189, 2019.
UniProt Consortium. Uniprot: a worldwide hub of protein knowledge. Nucleic acids research, 47
(D1):D506–D515, 2019.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to
Algorithms, 3rd Edition. MIT Press, 2009. ISBN 978-0-262-03384-8. URL http://mitpress.
mit.edu/books/introduction-algorithms.
Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan
Salakhutdinov. Transformer-XL: Language modeling with longer-term dependency, 2019. URL
https://openreview.net/forum?id=HJePno0cYm.
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal
transformers. In 7th International Conference on Learning Representations, ICLR 2019, New
Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.
net/forum?id=HyzdRiR9Y7.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep
bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL
http://arxiv.org/abs/1810.04805.
Yilun Du, Joshua Meier, Jerry Ma, Rob Fergus, and Alexander Rives. Energy-based models for
atomic-resolution protein conformations. arXiv preprint arXiv:2004.13167, 2020.
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, and Burkhard Rost. End-to-end multitask
learning, from protein language to protein features without alignments. bioRxiv, pp. 864405, 2019.
Roy Frostig, Matthew Johnson, and Chris Leary. Compiling machine learning programs via high-
level tracing. In Conference on Machine Learning and Systems 2018, 2018. URL http://www.
sysml.cc/doc/2018/146.pdf.
Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention
network for scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 3146–3154, 2019.
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo
Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented
transformer for speech recognition, 2020.
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam
Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music
transformer: Generating music with long-term structure. In 7th International Conference on
Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net,
2019. URL https://openreview.net/forum?id=rJe4ShAcF7.
John Ingraham, Vikas Garg, Regina Barzilay, and Tommi Jaakkola. Generative models for graph-
based protein design. In Advances in Neural Information Processing Systems, pp. 15794–15805,
2019.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are
rnns: Fast autoregressive transformers with linear attention. CoRR, abs/2006.16236, 2020. URL
https://arxiv.org/abs/2006.16236.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In
8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia,
April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=
rkgNKkHtvB.
Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. Revealing the dark secrets of
bert. arXiv preprint arXiv:1908.08593, 2019.
Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword
tokenizer and detokenizer for neural text processing. CoRR, abs/1808.06226, 2018. URL http:
//arxiv.org/abs/1808.06226.
Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. J. ACM, 27(4):831–838,
October 1980. ISSN 0004-5411. doi: 10.1145/322217.322232. URL https://doi.org/10.
1145/322217.322232.
Han Lin, Haoxian Chen, Tianyi Zhang, Clément Laroche, and Krzysztof Choromanski. Demystifying
orthogonal Monte Carlo and beyond. CoRR, abs/2005.13590, 2020.
Haoneng Luo, Shiliang Zhang, Ming Lei, and Lei Xie. Simplified self-attention for transformer-based
end-to-end speech recognition. CoRR, abs/2005.10463, 2020. URL https://arxiv.org/
abs/2005.10463.
Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi,
Po-Ssu Huang, and Richard Socher. Progen: Language modeling for protein generation. CoRR,
abs/2004.03497, 2020. URL https://arxiv.org/abs/2004.03497.
Jessica Marcandalli, Brooke Fiala, Sebastian Ols, Michela Perotti, Willem de van der Schueren, Joost
Snijder, Edgar Hodge, Mark Benhaim, Rashmi Ravichandran, Lauren Carter, et al. Induction of
potent neutralizing antibody responses by a designed protein nanoparticle vaccine for respiratory
syncytial virus. Cell, 176(6):1420–1431, 2019.
Nikita Nangia and Samuel R. Bowman. Listops: A diagnostic dataset for latent tree learning.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for
Computational Linguistics, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 2-4, 2018,
Student Research Workshop, pp. 92–99, 2018. doi: 10.18653/v1/n18-4013. URL https:
//doi.org/10.18653/v1/n18-4013.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku,
and Dustin Tran. Image transformer. In Proceedings of the 35th International Conference
on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018,
volume 80 of Proceedings of Machine Learning Research, pp. 4052–4061. PMLR, 2018. URL
http://proceedings.mlr.press/v80/parmar18a.html.
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Com-
pressive transformers for long-range sequence modelling. In International Conference on Learning
Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in
Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference
on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December
3-6, 2007, pp. 1177–1184. Curran Associates, Inc., 2007. URL http://papers.nips.cc/
paper/3182-random-features-for-large-scale-kernel-machines.
Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Zitnick, Jerry Ma, and
Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250
million protein sequences. bioArxiv, 04 2019. doi: 10.1101/622803.
Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamás Sarlós, and Adrian Weller.
Orthogonal estimation of Wasserstein distances. In The 22nd International Conference on Artificial
Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89
of Proceedings of Machine Learning Research, pp. 186–195. PMLR, 2019. URL http://
proceedings.mlr.press/v89/rowland19a.html.
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse
attention with routing transformers. CoRR, abs/2003.05997, 2020. URL https://arxiv.
org/abs/2003.05997.
Zhuoran Shen, Mingyuan Zhang, Shuai Yi, Junjie Yan, and Haiyu Zhao. Factorized attention:
Self-attention with linear complexities. CoRR, abs/1812.01243, 2018. URL http://arxiv.
org/abs/1812.01243.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for
deep learning in NLP. CoRR, abs/1906.02243, 2019. URL http://arxiv.org/abs/1906.
02243.
Yujin Tang, Duong Nguyen, and David Ha. Neuroevolution of self-interpretable agents. CoRR,
abs/2003.08165, 2020. URL https://arxiv.org/abs/2003.08165.
Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao,
Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient
transformers. In International Conference on Learning Representations, 2021.
Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhut-
dinov. Transformer dissection: An unified understanding for transformer’s attention via the
lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Lan-
guage Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pp. 4335–4344, 2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information
Processing Systems 30, pp. 5998–6008. Curran Associates, Inc., 2017. URL http://papers.
nips.cc/paper/7181-attention-is-all-you-need.pdf.
Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
Bengio. Graph attention networks. In 6th International Conference on Learning Representations,
ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
OpenReview.net, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.
Jesse Vig. A multiscale visualization of attention in the transformer model. arXiv preprint
arXiv:1906.05714, 2019.
Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language
model. CoRR, abs/1906.04284, 2019. URL http://arxiv.org/abs/1906.04284.
Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, and Nazneen Fatema
Rajani. Bertology meets biology: Interpreting attention in protein language models. CoRR,
abs/2006.15222, 2020. URL https://arxiv.org/abs/2006.15222.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural
Information Processing Systems 28: Annual Conference on Neural Information Processing Systems
2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 2692–2700, 2015.
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with
linear complexity. CoRR, abs/2006.04768, 2020. URL https://arxiv.org/abs/2006.
04768.
Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. Sharing attention weights for
fast transformer. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial
Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp. 5292–5298. ijcai.org, 2019. doi:
10.24963/ijcai.2019/735. URL https://doi.org/10.24963/ijcai.2019/735.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy.
Hierarchical attention networks for document classification. In NAACL HLT 2016, The 2016
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 1480–1489.
The Association for Computational Linguistics, 2016. doi: 10.18653/v1/n16-1174. URL https:
//doi.org/10.18653/v1/n16-1174.
Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk,
Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Toward more efficient training of
deep networks. In International Conference on Learning Representations, 2020. URL https:
//openreview.net/forum?id=BJxsrgStvr.
Felix X. Yu, Ananda Theertha Suresh, Krzysztof Marcin Choromanski, Daniel N. Holtmann-Rice,
and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing
Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10,
2016, Barcelona, Spain, pp. 1975–1983, 2016.
Vinícius Flores Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin,
Karl Tuyls, David P. Reichert, Timothy P. Lillicrap, Edward Lockhart, Murray Shanahan, Victoria
Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter W. Battaglia. Deep
reinforcement learning with relational inductive biases. In 7th International Conference on
Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching
movies and reading books. In 2015 IEEE International Conference on Computer Vision, ICCV
2015, Santiago, Chile, December 7-13, 2015, pp. 19–27, 2015. doi: 10.1109/ICCV.2015.11. URL
https://doi.org/10.1109/ICCV.2015.11.
• We use k = 600, which is more than twice the default k = 256 from the paper, and also more than twice our default number of random features m = 256.
• We also use redrawing, which avoids "unlucky" projections on Q and K.
2 https://github.com/google-research/google-research/blob/master/performer/fast_attention
3 https://github.com/google/trax/blob/master/trax/supervised/configs/reformer_imagenet64.gin
Figure 8: Visual representation of the prefix-sum algorithm for unidirectional attention. For clarity, we omit
attention normalization in this visualization. The algorithm keeps the prefix-sum which is a matrix obtained
by summing the outer products of random features corresponding to keys with value-vectors. At each given
iteration of the prefix-sum algorithm, a random feature vector corresponding to a query is multiplied by the most
recent prefix-sum (obtained by summing all outer-products corresponding to preceding tokens) to obtain a new
row of the matrix AV which is output by the attention mechanism.
For the unidirectional case, our analysis is similar to the bidirectional case, but this time our goal is to compute tril(Q'(K')^{\top})C without constructing and storing the L × L-sized matrix tril(Q'(K')^{\top}) explicitly, where C = [V \mathbf{1}_L] ∈ R^{L×(d+1)}. In order to do so, observe that ∀ 1 ≤ i ≤ L:

[\mathrm{tril}(Q'(K')^{\top})C]_i = G^{\mathrm{PS}}_{i,:,:} \times Q'_i, \quad G^{\mathrm{PS}}_{i,:,:} = \sum_{j=1}^{i} G_{j,:,:}, \quad G_{j,:,:} = K'_j C_j^{\top} \in \mathbb{R}^{M \times (d+1)}, \tag{11}
where G, G^{PS} ∈ R^{L×M×(d+1)} are 3d-tensors. Each slice G^{PS}_{:,l,p} is therefore a result of a prefix-sum (or cumulative-sum) operation applied to G_{:,l,p}: G^{PS}_{i,l,p} = \sum_{j=1}^{i} G_{j,l,p}. An efficient algorithm to compute the prefix-sum of L elements takes O(L) total steps and O(log L) time when computed in parallel (Ladner & Fischer, 1980; Cormen et al., 2009). See Algorithm 1 for the whole approach.
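A compact JAX rendition of this prefix-sum computation is sketched below (our simplification, not Algorithm 1 or the released code verbatim): the outer products K'_j C_j^{\top} are accumulated with a cumulative sum over the sequence axis, and each query contracts only against its own prefix, so tril(Q'(K')^{\top}) is never materialized.

```python
import jax.numpy as jnp

def causal_favor_attention(Q_prime, K_prime, V):
    # Q_prime, K_prime: (L, M) random-feature representations; V: (L, d).
    # Returns unidirectional attention output without any L x L matrix.
    L = V.shape[0]
    # C = [V 1_L]: appending a column of ones lets the same prefix sum
    # produce both the numerator and the renormalizer D_tilde.
    C = jnp.concatenate([V, jnp.ones((L, 1))], axis=-1)   # (L, d + 1)
    G = jnp.einsum('lm,lc->lmc', K_prime, C)              # outer products K'_j C_j^T
    G_ps = jnp.cumsum(G, axis=0)                          # prefix sums over j <= i
    # Note: this stores the whole (L, M, d + 1) tensor for clarity; a scan-based
    # aggregation removes that memory cost, as discussed in Subsec. B.3.
    out = jnp.einsum('lm,lmc->lc', Q_prime, G_ps)         # row i sees only its prefix
    numerator, denominator = out[:, :-1], out[:, -1:]
    return numerator / denominator
```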
B.2 ORTHOGONAL RANDOM FEATURES - EXTENSIONS
As mentioned in the main text, for isotropic Ω (true for most practical applications, including regular
attention), instead of sampling ωi independently, we can use orthogonal random features (ORF) (Yu
et al., 2016; Choromanski et al., 2017; 2018b): these maintain the marginal distributions of samples
ωi while enforcing that different samples are orthogonal. If we need m > d, ORFs still can be used
locally within each d × d block of W (Yu et al., 2016).
ORFs were introduced to reduce the variance of Monte Carlo estimators (Yu et al., 2016; Choromanski
et al., 2017; 2018b; 2019a; Rowland et al., 2019; Choromanski et al., 2018a; 2019b) and we showed
in the theoretical and experimental sections from the main body that they do indeed lead to more
accurate approximations and substantially better downstream results. There exist several variants of
the ORF-mechanism and in the main body we discussed only the base one (that we refer to here as
regular). Below we briefly review the most efficient ORF mechanisms (based on their strengths and
costs) to present the most complete picture.
(1) Regular ORFs [R-ORFs]: Applies Gaussian orthogonal matrices (Yu et al., 2016). Encodes
matrix W of ω-samples (with different rows corresponding to different samples) in O(md) space.
Provides algorithm for computing Wx in O(md) time for any x ∈ Rd . Gives unbiased estimation.
Requires one-time O(md2 ) preprocessing (Gram-Schmidt orthogonalization).
(2) Hadamard/Givens ORFs [H/G-ORFs]: Applies random Hadamard (Choromanski et al., 2017)
or Givens matrices (Choromanski et al., 2019b). Encodes matrix W in O(m) or O(m log(d)) space.
Provides algorithm for computing Wx in O(m log(d)) time for any x ∈ Rd . Gives small bias
(tending to 0 with d → ∞).
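As a concrete illustration of the regular R-ORF construction in point (1) above, the sketch below (ours; the released code may organize this differently) orthogonalizes Gaussian blocks via QR (equivalent to Gram-Schmidt) and rescales rows so that each row keeps the marginal distribution of an unstructured Gaussian sample.

```python
import jax
import jax.numpy as jnp

def gaussian_orthogonal_matrix(key, m, d):
    # Regular ORFs (Yu et al., 2016): m rows in R^d, orthogonal within each
    # d x d block, each row marginally distributed as N(0, I_d).
    # Incurs the one-time O(m d^2) orthogonalization cost mentioned above.
    num_blocks = -(-m // d)                      # ceil(m / d)
    keys = jax.random.split(key, num_blocks + 1)
    blocks = []
    for k in keys[:-1]:
        q, _ = jnp.linalg.qr(jax.random.normal(k, (d, d)))
        blocks.append(q.T)                       # orthonormal rows
    W = jnp.concatenate(blocks, axis=0)[:m]      # (m, d)
    # Row norms of an i.i.d. Gaussian matrix follow a chi(d) distribution;
    # rescaling restores Gaussian marginals while keeping rows orthogonal.
    norms = jnp.linalg.norm(jax.random.normal(keys[-1], (m, d)), axis=1)
    return W * norms[:, None]
```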
B.3 TIME AND SPACE COMPLEXITY - DETAILED ANALYSIS
We see that a variant of bidirectional FAVOR+ using iid samples or R-ORFs has O(md + Ld + mL) space complexity as opposed to Θ(L² + Ld) space complexity of the baseline. Unidirectional FAVOR+ using fast prefix-sum pre-computation in parallel (Ladner & Fischer, 1980; Cormen et al., 2009) has O(mLd) space complexity to store G^{PS}, which can be reduced to O(md + Ld + mL) by running a simple (though non-parallel in L) aggregation of G^{PS}_{i,:,:} without storing the whole tensor G^{PS} in memory. From Subsec. B.2, we know that if instead we use G-ORFs, then space complexity is reduced to O(m log(d) + Ld + mL) and if the H-ORFs mechanism is used, then space is further reduced to O(m + Ld + mL) = O(Ld + mL). Thus for m, d ≪ L all our variants provide substantial space complexity improvements since they do not need to store the attention matrix explicitly.

The time complexity of Algorithm 1 is O(Lmd) (note that constructing Q' and K' can be done in time O(Lmd)). Note that the time complexity of our method is much lower than O(L²d) of the baseline for L ≫ m.
As explained in Subsec. B.2, the R-ORF mechanism incurs an extra one-time O(md²) cost (negligible compared to the O(Lmd) term for L ≫ d). H-ORFs or G-ORFs do not have this cost, and when
FAVOR+ uses them, computing Q0 and K0 can be conducted in time O(L log(m)d) as opposed
to O(Lmd) (see: Subsec. B.2). Thus even though H/G-ORFs do not change the asymptotic time
complexity, they improve the constant factor from the leading term. This might play an important
role in training very large models.
The number of random features m allows a trade-off between computational complexity and the level
of approximation: bigger m results in higher computation costs, but also in a lower variance of the
estimate of A. In the theoretical section from the main body we showed that in practice we can take
m = Θ(d log(d)).
Observe that the FAVOR+ algorithm is highly-parallelizable, and benefits from fast matrix multiplica-
tion and broadcasted operations on GPUs or TPUs.
Figure 9: Visualization of the estimated empirical distribution for the 20 standard amino acids, colored by their
class. Note the consistency with the statistics on the TrEMBL web page.
A random baseline, with uniform probability across all the vocabulary tokens at every position, has
accuracy 5% (when including only the 20 standard amino acids) and 4% (when also including the
5 anomalous amino acids (Consortium, 2019)). However, the empirical frequencies of the various
amino acids in our dataset may be far from uniform, so we also consider an empirical baseline where
the amino acid probabilities are proportional to their empirical frequencies in the training set.
Figure 9 shows the estimated empirical distribution. We use both the standard and anomalous
amino acids, and we crop sequences to length 1024 to match the data processing performed for the
Transformer models. The figure shows only the 20 standard amino acids, colored by their class, for
comparison with the visualization on the TrEMBL web page (https://www.uniprot.org/statistics/TrEMBL).
C.3 TABULAR RESULTS
Table 2 contains the results on the single protein sequence modeling task (L = 1024). We report
accuracy and perplexity as defined in Appendix A:
(Kovaleva et al., 2019). In Figure 12 we highlight these attention patterns by focusing on the first 25
tokens, and in Figure 11, we illustrate in more detail two attention heads.
Amino acid similarity. Furthermore, we analyze the amino-acid similarity matrix estimated from
the attention matrices produced by the Performer model, as described in Vig et al. (Vig et al., 2020).
We aggregate the attention matrix across 800 sequences. The resulting similarity matrix is illustrated
in Figure 13. Note that the Performer recognizes highly similar amino acid pairs such as (D, E) and
(F, Y).
Figure 10: We show the attention matrices for the first 4 layers and all 8 heads (each row is a layer, each column
is head index, each cell contains the attention matrix across the entire BPT1_BOVIN protein sequence). Note
that many heads show a diagonal pattern, where each node attends to its neighbors, and some heads show a
vertical pattern, where each head attends to the same fixed positions.
Figure 11: We illustrate in more detail two attention heads. The sub-figures correspond respectively to: (1)
Head 1-2 (second layer, third head), (2) Head 4-1 (fifth layer, second head). Note the block attention in Head 1-2
and the vertical attention (to the start token (‘M’) and the 85th token (‘C’)) in Head 4-1.
Figure 12: We highlight the attention patterns by restricting our attention to the first 25 tokens (note that we do
not renormalize the attention to these tokens). The illustration is based on Vig et al. (Vig, 2019; Vig & Belinkov,
2019). Note that, similar to prior work on protein Transformers (Madani et al., 2020), the attention matrices
include both local and global patterns.
Figure 13: Amino acid similarity matrix estimated from attention matrices aggregated across a small subset
of sequences, as described in Vig et al. (Vig et al., 2020). The sub-figures correspond respectively to: (1) the
normalized BLOSUM matrix, (2) the amino acid similarity estimated via a trained Performer model. Note that
the Performer recognizes highly similar amino acid pairs such as (D, E) and (F, Y).
Figure 14: Output approximation errors between a vanilla Transformer and a Performer (with
orthogonal features) for varying numbers of layers.
D.2 APPROXIMATE SOFTMAX - EXTENDED PROPERTIES
We show the following properties of our softmax approximation, in Fig. 15:
Redrawing: While the benefits of redrawing features were shown in Subsec. 4.3 of the main body of
the paper, we also demonstrate its benefits when there are multiple layers with large scale (16x16
TPU-v2) training.
Unidirectional: While we have shown on TrEMBL that Performer with generalized ReLU attention
outperforms softmax, we also show that approximate softmax attention can still be a solid choice, for
example on ImageNet64 (U). After 100K steps of training, the Performer-ReLU, Performer-Softmax,
and Performer-Softmax (SMREG) variants achieve respectively, 3.67, 3.69, 3.67 BPD.
Instability of Trigonometric Features: We see the full view of the unstable training curve when
using Trigonometric softmax.
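As a concrete illustration of the redrawing mechanism discussed above, the following sketch (ours, with hypothetical names; the training code behind Fig. 15 differs) simply re-samples the projection matrix used by FAVOR+ every fixed number of steps:

```python
# Minimal sketch of feature redrawing: re-sample the random projection every
# `redraw_every` optimizer steps so that an "unlucky" draw cannot dominate training.
import jax

def draw_projection(key, m, d):
    # Plain Gaussian projection; orthogonal blocks can be substituted (cf. Sec. 3).
    return jax.random.normal(key, (m, d))

def train_sketch(num_steps=5000, redraw_every=1000, m=256, d=64):
    key, subkey = jax.random.split(jax.random.PRNGKey(0))
    projection = draw_projection(subkey, m, d)
    for step in range(num_steps):
        if step > 0 and step % redraw_every == 0:
            key, subkey = jax.random.split(key)
            projection = draw_projection(subkey, m, d)
        # ... compute FAVOR+ attention with `projection`, the loss, and a gradient step ...
```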
Figure 15: Best viewed zoomed in. Left: The importance of redrawing features. If redrawing is not
used, an "unlucky" set of random features may cause training degradation, shown by the early-stopped
curve with Seed 1, while a "lucky" set of random features may cause no issue, shown by the curve
with Seed 2. Redrawing allows the training to correct itself, as seen at the black vertical line. Middle:
Using the same 8x8 TPU-v2 compute and the same 6-layer standard model, approximate softmax with
positive features achieves the same result as generalized ReLU attention. Right: Zoomed-out view of
the right subfigure of Fig. 5, showing that Trigonometric softmax causes very unstable training behavior.
D.3 Generalized Attention
We investigated Generalized Attention mechanisms (mentioned in Sec. 2.2) on TrEMBL with L = 512
for various kernel functions; a minimal sketch of the attention variant being swept is given below. This is
similar to (Tsai et al., 2019), which also experiments with various attention kernels for natural language.
Using hyperparameter sweeps across multiple variables in FAVOR, we compared several kernels and also
renormalization on/off (Fig. 16 and Fig. 17; Renormalize corresponds to applying the $D^{-1}$ operator in
attention, as in the standard mechanism, though we noticed that disabling it does not necessarily hurt
accuracy) to produce the best training configuration for the Performer. We note that the effective batch
size slightly affects the rankings (as shown by the difference between the 2x2 and 4x4 TPU runs); by
default we use the generalized ReLU kernel with the other default hyperparameters shown in Appendix A,
as we observed that they are empirically optimal for large batch size runs (i.e. 8x8 or 16x16 TPUs).
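For concreteness, the following is a minimal sketch of the (non-causal) generalized attention variant being swept, under our own naming and with the experimental details omitted: features are f(Wx) for an elementwise kernel function f, and the D^{-1} renormalization can be toggled.

```python
# Minimal sketch of non-causal generalized FAVOR attention with a configurable
# elementwise kernel f and optional renormalization (the D^{-1} operator).
import jax
import jax.numpy as jnp

def generalized_attention(q, k, v, proj, f=jax.nn.relu, renormalize=True):
    """q, k: [L, d]; v: [L, d_v]; proj: [m, d] projection matrix."""
    q_prime = f(q @ proj.T)                              # [L, m]
    k_prime = f(k @ proj.T)                              # [L, m]
    out = q_prime @ (k_prime.T @ v)                      # [L, d_v], linear in L
    if renormalize:
        normalizer = q_prime @ k_prime.sum(axis=0)       # [L], the D^{-1} term
        out = out / (normalizer[:, None] + 1e-6)
    return out

# The kernels swept in Fig. 16-17 can be plugged in as f, e.g. jax.nn.sigmoid,
# jnp.exp, jax.nn.relu, jnp.abs, jax.nn.gelu, jnp.cos, jnp.tanh, or the identity.
```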
Figure 16: To emphasize the highest accuracy runs but also show the NaN issues with certain kernels
which caused runs to stop early, we set both x and y axes to be log-scale. We tested kernels defined
by different functions f (see: Sec. 2.2): sigmoid, exponential, ReLU, absolute, gelu, cosine (original
softmax approximation), tanh, and identity. All training runs were performed on 2x2 TPU-v2’s, 128
batch size per device.
Figure 17: We also performed a similar setup as Fig. 16 for 4x4 TPU-v2’s.
D.4 Comparison with Linear Transformer
We use the attention implementation of the Linear Transformer from (Katharopoulos et al., 2020),
which mainly involves setting our feature map to φ(x) = elu(x) + 1, where elu(x) is the exponential
linear unit from (Clevert et al., 2016).
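For reference, the two feature maps compared in this subsection can be written side by side; this is a minimal sketch that omits the scaling and numerical-stabilization details of the full implementations.

```python
# The deterministic elu(x) + 1 map of the Linear Transformer vs. Performer's
# positive random features for approximate softmax (cf. Lemma 1).
import jax
import jax.numpy as jnp

def linear_transformer_features(x):
    # phi(x) = elu(x) + 1, applied elementwise (Katharopoulos et al., 2020).
    return jax.nn.elu(x) + 1.0

def performer_positive_features(x, proj):
    # (1/sqrt(m)) exp(w_i^T x - ||x||^2 / 2): positive features, unbiased for softmax.
    m = proj.shape[0]
    return jnp.exp(x @ proj.T - 0.5 * jnp.sum(x**2, axis=-1, keepdims=True)) / jnp.sqrt(m)
```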
Figure 18: Left: In the unidirectional 36-ProGen setting, we ran 3 seeds of the Linear Transformer,
and found that all 3 seeds produced exploding gradients very early on, stopping the training run.
Right: The Linear Transformer in the bidirectional setting also produced an exploding gradient in
the middle of training, near 125K steps. Exploding gradients can be evidenced by the sharp drop in
train accuracy right before a NaN error.
For the sake of fairness and to prevent confounding results, while (Katharopoulos et al., 2020) also
uses the GeLU nonlinearity for the MLPs in the Linear Transformer, we instead use the original
ReLU nonlinearity. We also used exactly the same training hyperparameters as Performer-ReLU on
our exact ProGen setting from Fig. 6. Ultimately, we empirically found that the Linear Transformer
was numerically unstable during training, showing unstable training curves and eventually halting
with exploding gradients (NaNs) (Fig. 18).
D.5 Long Range Arena
Performers are compared against many additional (scalable and non-scalable) methods not included
in our paper: Local Attention, Sparse Attention, Longformer, Sinkhorn Transformer, Synthesizer,
Big Bird and the aforementioned Linear Transformer, on challenging long-range-context tasks in the
Long Range Arena (Tay et al., 2021), with Fig. 19 displaying the original paper's results. Performers
obtain the largest LRA (Long Range Arena) score among all tested scalable Transformer methods
(which we define as those running at > 100 examples/sec).
Tasks used for comparison include: (1) a longer variation of the standard ListOps task proposed in
(Nangia & Bowman, 2018), (2) byte-level text classification using real-world data, (3) byte-level
document retrieval, (4) image classification on sequences of pixels, and (5) the Pathfinder task (a long-
range spatial dependency problem). In the Long Range Arena paper, the authors found that none of the
models learn anything on the Path-X task (denoted by FAIL), in contrast to the Pathfinder task, which shows
that increasing the sequence length can cause serious difficulties for model training.
Figure 19: Upper Table: Results on the Long-Range Arena benchmark. The best model is in boldface and
the second best is underlined. Lower Table: Benchmark results of all X-former models with a consistent
batch size of 32 across all models. The authors report the relative speed increase/decrease in comparison
with the vanilla Transformer in brackets next to the steps per second. Memory usage refers to per-device
memory usage across each TPU device. Benchmarks are run on 4x4 TPU-v3 chips. Right
Fig: Performance (y-axis), speed (x-axis), and memory footprint (size of the circles) of different
models.
1. Performer, with varying number of layers. We show that our method can scale up to (but is not
necessarily limited to) 20 layers.
2. Attention time complexities when comparing standard attention (from Transformer) and
FAVOR (from Performer). Note that the maximum memory size here is not reflective of
the maximum memory size in an actual model (shown below), as this benchmark requires
computing explicit tensors (causing memory increases) in Jax, while a model does not.
3. Time complexities when comparing the Transformer and Performer models. "X" (OPT)
denotes the maximum possible speedup achievable, when attention simply returns the V
vector, showing that the Performer is nearly optimal. We see that the maximum possible
power-of-2 length allowed on a V100 GPU (16GB) is $2^{15} = 32768$ using regular dimensions.
Since some of the computational bottleneck in the Transformer may originate from the extra
feed-forward layers (Kitaev et al., 2020), we also benchmark the "Small" version, i.e.
$(n_{heads}, n_{layers}, d_{ff}, d) = (1, 6, 64, 64)$, for which the attention component is the dominant
source of computation and memory; a sketch of the two attention computations being timed is given
below. We remind the reader that the "Regular" version consists of $(n_{heads}, n_{layers}, d_{ff}, d) = (8, 6, 2048, 512)$.
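The sketch below illustrates the two attention computations being timed; it is an illustrative micro-benchmark under assumed shapes, not the harness used to produce Figures 20-21.

```python
# Quadratic attention materializes an [L, L] matrix, while the FAVOR-style
# bracketing never does (O(L m d) time and memory).
import time
import jax
import jax.numpy as jnp

def quadratic_attention(q, k, v):
    scores = jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)   # [L, L]
    return scores @ v

def linear_attention(q_prime, k_prime, v):
    return q_prime @ (k_prime.T @ v)                                    # [L, m] @ [m, d]

def bench(fn, *args, reps=10):
    fn(*args).block_until_ready()                                       # warm-up
    start = time.perf_counter()
    for _ in range(reps):
        out = fn(*args)
    out.block_until_ready()
    return (time.perf_counter() - start) / reps

L, d, m = 4096, 64, 256
key = jax.random.PRNGKey(0)
q, k, v = (jax.random.normal(key, (L, d)) for _ in range(3))
q_prime, k_prime = (jnp.abs(jax.random.normal(key, (L, m))) for _ in range(2))
print(bench(quadratic_attention, q, k, v), bench(linear_attention, q_prime, k_prime, v))
```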
Figure 20: The 2x2 subfigures corresponding to benchmarks (1) and (2) described above.
Figure 21: The 2x2 subfigure corresponding to benchmark (3) described above.
F Theoretical Results
We provide here the proofs of all theoretical results presented in the paper.
F.1 Proof of Lemma 1
Proof. We first deduce that for any $x, y \in \mathbb{R}^d$:
$$\mathrm{SM}(x, y) = \exp(x^\top y) = \exp(-\|x\|^2/2) \cdot \exp(\|x + y\|^2/2) \cdot \exp(-\|y\|^2/2).$$
Next, let $w \in \mathbb{R}^d$. We use the fact that
$$(2\pi)^{-d/2}\int \exp(-\|w - c\|_2^2/2)\, dw = 1.$$
The cancellation of the odd moments $\mathbb{E}[(\omega^\top u)^{2i+1}]$ follows directly from the fact that $\omega$ is taken
from an isotropic distribution (i.e. a distribution with pdf constant on each sphere). That
completes the proof.
The above immediately follows from the fact that positive random feature maps provide unbiased
estimation of the softmax-kernel, thus the following is true:
$$\mathrm{SM}(x, y) = \exp\Big(-\frac{\|x\|^2 + \|y\|^2}{2}\Big)\,\mathbb{E}_{\omega\sim\mathcal{N}(0,I_d)}[\exp(\omega^\top z)]. \qquad (17)$$
Therefore we obtain:
$$\begin{aligned}
\mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_{m}(x, y)) &= \frac{1}{m}\exp(-(\|x\|^2 + \|y\|^2))\,\mathrm{Var}(\exp(\omega^\top z)) \\
&= \frac{1}{m}\exp(-(\|x\|^2 + \|y\|^2))\big(\mathbb{E}[\exp(2\omega^\top z)] - (\mathbb{E}[\exp(\omega^\top z)])^2\big) \\
&= \frac{1}{m}\exp(-(\|x\|^2 + \|y\|^2))\big(\exp(2\|z\|^2) - \exp(\|z\|^2)\big),
\end{aligned} \qquad (18)$$
where the last equality follows from Equation 16. Therefore we have:
$$\mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_{m}(x, y)) = \frac{1}{m}\exp(-(\|x\|^2 + \|y\|^2))\exp(\|z\|^2)(\exp(\|z\|^2) - 1) = \frac{1}{m}\exp(\|z\|^2)\,\mathrm{SM}^2(x, y)\,(1 - \exp(-\|z\|^2)). \qquad (19)$$
Finally,
$$\begin{aligned}
\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{hyp}+}_{m}(x, y)) &= \frac{1}{4m}\exp\Big(-\frac{\|x\|^2 + \|y\|^2}{2}\Big)^2\big(\mathrm{Var}(\exp(\omega^\top z)) + \mathrm{Var}(\exp(-\omega^\top z)) + 2\,\mathrm{Cov}(\exp(\omega^\top z), \exp(-\omega^\top z))\big) \\
&= \frac{1}{4m}\exp\Big(-\frac{\|x\|^2 + \|y\|^2}{2}\Big)^2\big(2\,\mathrm{Var}(\exp(\omega^\top z)) + 2\,\mathrm{Cov}(\exp(\omega^\top z), \exp(-\omega^\top z))\big) \\
&= \frac{1}{2m}\exp(-(\|x\|^2 + \|y\|^2))\big(\mathrm{Var}(\exp(\omega^\top z)) + 1 - (\mathbb{E}[\exp(\omega^\top z)])^2\big) \\
&= \frac{1}{2m}\exp(-(\|x\|^2 + \|y\|^2))\big(\exp(2\|z\|^2) - \exp(\|z\|^2) + 1 - \exp(\|z\|^2)\big) \\
&= \frac{1}{2m}\exp(-(\|x\|^2 + \|y\|^2))(\exp(\|z\|^2) - 1)^2 \\
&= \frac{1}{2}(1 - \exp(-\|z\|^2))\,\mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_{m}(x, y)).
\end{aligned} \qquad (20)$$
In the chain of equalities above we used the fact that the random variables $\exp(\omega^\top z)$ and $\exp(-\omega^\top z)$
have the same distribution. This is true since $\omega$ and $-\omega$ have the same distribution ($\omega$ is Gaussian).
That completes the proof.
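Equation 19 is easy to sanity-check numerically. The following rough Monte Carlo check (ours, for illustration only; it is not part of the paper's experiments) compares the empirical MSE of the positive-feature estimator against the closed form.

```python
# Empirical MSE of the positive random-feature estimator of SM(x, y) vs. the
# closed form (1/m) exp(||z||^2) SM(x, y)^2 (1 - exp(-||z||^2)), with z = x + y.
import jax
import jax.numpy as jnp

def check_eq19(key, d=8, m=32, trials=20000):
    kx, ky, kw = jax.random.split(key, 3)
    x = 0.3 * jax.random.normal(kx, (d,))
    y = 0.3 * jax.random.normal(ky, (d,))
    z = x + y
    sm = jnp.exp(x @ y)
    omegas = jax.random.normal(kw, (trials, m, d))       # fresh features per trial
    estimates = jnp.exp(-(x @ x + y @ y) / 2) * jnp.exp(omegas @ z).mean(axis=1)
    empirical_mse = jnp.mean((estimates - sm) ** 2)
    closed_form = jnp.exp(z @ z) * sm**2 * (1 - jnp.exp(-(z @ z))) / m
    return empirical_mse, closed_form

print(check_eq19(jax.random.PRNGKey(0)))
```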
where $e_1 \overset{\mathrm{def}}{=} (1, 0, \ldots, 0)^\top \in \mathbb{R}^d$. To obtain the above we used the fact that $\mathcal{N}(0, I_d)$ is isotropic (which
in particular implies zeroing of the odd terms in the Taylor expansion).
Let us denote: $A(k, d) \overset{\mathrm{def}}{=} \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\big[\big(\big(\tfrac{\omega}{\|\omega\|_2}\big)^{\top} e_1\big)^{k}\big]$. It turns out that:
$$A(2k, d) = \frac{(2k-1)!!}{(d + 2k - 2)(d + 2k - 4) \cdot \ldots \cdot d}. \qquad (22)$$
The proof of that fact can be found in the supplement of (Choromanski et al., 2018b), yet we provide
it below for completeness and the convenience of the Reader:
Proof. Note first that for $d \geq 2$ the density function $p_d(\theta)$ of the angle between a vector $r \in \mathbb{R}^d$
chosen uniformly at random from the unit sphere and $e_1$ is given by the following formula:
$$p_d(\theta) = \frac{\sin^{d-2}(\theta)}{\int_0^\pi \sin^{d-2}(\theta)\,d\theta}. \qquad (24)$$
Let us denote: $F(k, d) \overset{\mathrm{def}}{=} \int_0^\pi \cos^k(\theta)\sin^d(\theta)\,d\theta$. Using partial integration, we get:
$$\int_0^\pi \cos^k(\theta)\sin^d(\theta)\,d\theta = \int_0^\pi \cos^{k-1}(\theta)\sin^d(\theta)(\sin(\theta))'\,d\theta = \cos^{k-1}(\theta)\sin^{d+1}(\theta)\Big|_0^\pi - \int_0^\pi \sin(\theta)\big((k-1)\cos^{k-2}(\theta)(-\sin(\theta))\sin^d(\theta) + d\cos^k(\theta)\sin^{d-1}(\theta)\big)\,d\theta. \qquad (25)$$
Thus we conclude that: $F(k, d) = \frac{k-1}{d+1}F(k-2, d+2)$. Therefore we have:
$$F(2k, d) = \frac{(2k-1)!!}{(d+1)(d+3) \cdot \ldots \cdot (d+2k-1)}\int_0^\pi \sin^{d+2k}(\theta)\,d\theta. \qquad (26)$$
We again conduct partial integration and get:
$$\int_0^\pi \sin^d(\theta)\,d\theta = -\frac{1}{d}\sin^{d-1}(\theta)\cos(\theta)\Big|_0^\pi + \frac{d-1}{d}\int_0^\pi \sin^{d-2}(\theta)\,d\theta = \frac{d-1}{d}\int_0^\pi \sin^{d-2}(\theta)\,d\theta. \qquad (27)$$
Thus we obtain:
$$\frac{\mathrm{SMREG}(x, y)}{\mathrm{SM}(x, y)} = e^{-w}\sum_{k=0}^{\infty}\frac{w^k}{k!}f(k, d). \qquad (30)$$
We also have, for $l = d^{\frac{1}{3}}$:
$$\frac{\mathrm{SMREG}(x, y)}{\mathrm{SM}(x, y)} = e^{-w}\sum_{k=0}^{l}\frac{w^k}{k!}f(k, d) + e^{-w}\sum_{k=l+1}^{\infty}\frac{w^k}{k!}f(k, d) \geq f(l, d)e^{-w}\sum_{k=0}^{l}\frac{w^k}{k!} + e^{-w}\sum_{k=l+1}^{\infty}\frac{w^k}{k!}f(k, d) \geq f(l, d)\Big(1 - e^{-w}\sum_{k=l+1}^{\infty}\frac{w^k}{k!}\Big) = f(l, d)(1 - \mathbb{P}[\mathrm{Po}(w) > l]), \qquad (32)$$
where $\mathrm{Po}(w)$ stands for a random variable with the Poisson distribution with parameter $w$. Therefore we
get, for $t = \ln(\frac{l}{w})$:
$$\begin{aligned}
\frac{\mathrm{SMREG}(x, y)}{\mathrm{SM}(x, y)} &\geq \Big(1 - \frac{2l-2}{d}\Big)^l(1 - \mathbb{P}[\mathrm{Po}(w) > l]) \geq \exp\Big(l\ln\Big(1 - \frac{2l-2}{d}\Big)\Big)(1 - \mathbb{P}[t\mathrm{Po}(w) \geq tl]) \\
&= \exp\Big(-l\sum_{i=1}^{\infty}\frac{1}{i}\Big(\frac{2l-2}{d}\Big)^i\Big)(1 - \mathbb{P}[\exp(t\mathrm{Po}(w) - tl) \geq 1]) \\
&\geq \exp\Big(-\frac{2}{d^{\frac{1}{3}}} + o\Big(\frac{1}{d^{\frac{1}{3}}}\Big)\Big)(1 - \exp(-tl)\,\mathbb{E}[\exp(t\mathrm{Po}(w))]) \\
&= \exp\Big(-\frac{2}{d^{\frac{1}{3}}} + o\Big(\frac{1}{d^{\frac{1}{3}}}\Big)\Big)(1 - \exp(-w - l(t-1))),
\end{aligned} \qquad (33)$$
where the last equality is implied by the formula for the Laplace Transform for the Poisson random
variable:
E[exp(tPo(w))] = exp(w(exp(t) − 1)). (34)
Notice that: $w = \frac{\|z\|^2}{2} = \frac{\ln(\mathrm{SM}(x,x)) + \ln(\mathrm{SM}(y,y)) + 2\ln(\mathrm{SM}(x,y))}{2} \leq 2\ln(C)$. We conclude that:
$$\frac{\mathrm{SMREG}(x, y)}{\mathrm{SM}(x, y)} \geq \Big(1 - \frac{2}{d^{\frac{1}{3}}} + o\Big(\frac{1}{d^{\frac{1}{3}}}\Big)\Big)\Big(1 - C^{-2}\Big(\frac{d^{\frac{1}{3}}}{2e\ln(C)}\Big)^{-d^{\frac{1}{3}}}\Big) = 1 - \frac{2}{d^{\frac{1}{3}}} + o\Big(\frac{1}{d^{\frac{1}{3}}}\Big). \qquad (35)$$
That completes the proof.
Interestingly, beautiful functions can be used to define softmax and consequently, Gaussian kernels
(both standard and regularized), leading to our PRF mechanism presented in the main body of the
paper, as we explain below.
Remark 1. If one takes $\Omega = \mathcal{N}(0, I_d)$ (note that $\mathcal{N}(0, I_d)$ is isotropic) and $g: x \rightarrow \exp(x)$ (such $g$
is clearly entire with nonnegative power-series coefficients) then the following is true for $z = x + y$:
$$\mathrm{SM}(x, y) = \exp\Big(-\frac{\|x\|^2 + \|y\|^2}{2}\Big)F_{\Omega,g}(z). \qquad (37)$$
Similarly: $\mathrm{SMREG}(x, y) = \exp\Big(-\frac{\|x\|^2 + \|y\|^2}{2}\Big)F_{\Omega_{\mathrm{reg}},g}(z)$, where $\Omega_{\mathrm{reg}}$ stands for the distribution
corresponding to Haar measure on the sphere of radius $\sqrt{d}$ (which is clearly isotropic). Therefore
general concentration results for Monte Carlo estimators of beautiful functions immediately imply
corresponding results for the (standard and regularized) softmax (and thus also Gaussian) kernel.
We will consider two estimators of the beautiful functions from Definition 1 that directly lead
(through Remark 1) to: the PRF-based approximation of the softmax-kernel and its enhanced version
with orthogonal features. The standard Monte Carlo estimator samples independently $\omega_1^{\mathrm{iid}}, \ldots, \omega_m^{\mathrm{iid}} \overset{\mathrm{iid}}{\sim} \Omega$,
where $m$ stands for the number of samples, and then computes:
$$\widehat{F}_m^{\mathrm{iid}}(z) \overset{\mathrm{def}}{=} \frac{1}{m}\sum_{i=1}^{m} g((\omega_i^{\mathrm{iid}})^\top z). \qquad (38)$$
The orthogonal Monte Carlo estimator samples $\omega_1^{\mathrm{ort}}, \ldots, \omega_m^{\mathrm{ort}}$ ($m \leq d$) in such a way that marginally we
have: $\omega_i^{\mathrm{ort}} \sim \Omega$, but $(\omega_i^{\mathrm{ort}})^\top \omega_j^{\mathrm{ort}} = 0$ for $i \neq j$ (such an orthogonal ensemble can always be created
if $\Omega$ is isotropic, as we already mentioned in the main body of the paper). We define:
$$\widehat{F}_m^{\mathrm{ort}}(z) \overset{\mathrm{def}}{=} \frac{1}{m}\sum_{i=1}^{m} g((\omega_i^{\mathrm{ort}})^\top z). \qquad (39)$$
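To make the two estimators concrete, the sketch below implements (38) and (39) for $\Omega = \mathcal{N}(0, I_d)$ and $g = \exp$. The orthogonal ensemble is built via a QR decomposition with rows rescaled to Gaussian-like norms, one standard construction when $\Omega$ is isotropic; this is illustrative code, not the paper's implementation.

```python
# The iid and orthogonal Monte Carlo estimators (38)-(39) for Omega = N(0, I_d), g = exp.
import jax
import jax.numpy as jnp

def gaussian_orthogonal_ensemble(key, m, d):
    """Return m <= d rows that are exactly pairwise orthogonal and whose
    marginal distribution matches N(0, I_d)."""
    kq, kn = jax.random.split(key)
    q, _ = jnp.linalg.qr(jax.random.normal(kq, (d, d)))        # orthonormal rows
    norms = jnp.linalg.norm(jax.random.normal(kn, (m, d)), axis=1, keepdims=True)
    return q[:m] * norms                                       # chi(d)-distributed norms

def f_hat(omegas, z, g=jnp.exp):
    # Monte Carlo estimator (1/m) sum_i g(omega_i^T z).
    return jnp.mean(g(omegas @ z))

key = jax.random.PRNGKey(0)
k_iid, k_ort, kz = jax.random.split(key, 3)
d = m = 16
z = jax.random.normal(kz, (d,)) / jnp.sqrt(d)
print(f_hat(jax.random.normal(k_iid, (m, d)), z),              # F_hat^iid
      f_hat(gaussian_orthogonal_ensemble(k_ort, m, d), z))     # F_hat^ort
```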
Lemma 4. Consider an estimator $\widehat{F}_m^{\mathrm{iid}}(z)$ of the beautiful function $F$ evaluated at $z$. Then the
following holds for any $a > F(z)$, $\theta > 0$:
$$\mathbb{P}[\widehat{F}_m^{\mathrm{iid}}(z) > a] \leq \exp(-\theta m a)M_X(\theta)^m, \qquad (40)$$
where $X = g(\omega^\top z)$, $\omega \sim \Omega$.
The above result provides us with exponentially small (in Legendre Transform) upper bounds on tail
probabilities for the standard estimator. Below we provide our two main theoretical results.
Theorem 5 (orthogonality provides smaller tails). If $F_{\Omega,g}$ is a beautiful function then the following
holds for $m \leq d$, $X$ as in Lemma 4 and any $a > F(z)$, $\theta > 0$:
$$\mathbb{P}[\widehat{F}_m^{\mathrm{ort}}(z) > a] \leq \exp(-\theta m a)\Big(M_X(\theta)^m - \frac{\theta^4 m(m-1)}{4d^2(d+2)}a_1^2 a_0^{m-2}\|z\|^4\big(\mathbb{E}\|\omega\|_2^2\big)^2\Big). \qquad (41)$$
This result shows that features obtained from the ensembles of pairwise orthogonal random vectors
provide exponentially small bounds on tail probabilities and that these bounds are strictly better than
for estimators using unstructured features. Furthermore, the result is universal, i.e. holds for any
dimensionality d, not just asymptotically for d large enough.
We also obtain a similar result regarding the mean squared errors (MSEs) of the considered estimators:
Theorem 6. If $F_{\Omega,g}$ is a beautiful function then the following holds for $m \leq d$:
$$\mathrm{MSE}(\widehat{F}_m^{\mathrm{ort}}(z)) \leq \mathrm{MSE}(\widehat{F}_m^{\mathrm{iid}}(z)) - \Big(1 - \frac{1}{m}\Big)\frac{2}{d+2}\big(F_{\Omega,g}(z) - a_0\big)^2. \qquad (42)$$
As before, an orthogonal estimator leads to better concentration results and as before, this is the case
for any d > 0, not only asymptotically for large enough d.
Note that from what we have said above, Theorem 2 and Theorem 3 follow immediately from
Theorem 6 and Theorem 5 respectively.
Thus in the remainder of this section we will prove Theorem 6 and Theorem 5.
F.4.2 Proof of Theorem 5
Proof. Note that by an application of Markov's inequality analogous to the one in Lemma 4, we get:
$$\mathbb{P}[\widehat{F}_m^{\mathrm{ort}}(z) > a] \leq \frac{\mathbb{E}[e^{\theta(X_1^{\mathrm{ort}} + \ldots + X_m^{\mathrm{ort}})}]}{e^{\theta m a}}, \qquad (43)$$
where we have: $X_i^{\mathrm{ort}} = g((\omega_i^{\mathrm{ort}})^\top z)$. We see that it suffices to show that for any $\theta > 0$ the following
holds: $\mathbb{E}[e^{\theta(X_1^{\mathrm{ort}} + \ldots + X_m^{\mathrm{ort}})}] < \mathbb{E}[e^{\theta(X_1^{\mathrm{iid}} + \ldots + X_m^{\mathrm{iid}})}]$. We have:
$$\mathbb{E}[e^{\theta(X_1^{\mathrm{ort}} + \ldots + X_m^{\mathrm{ort}})}] = \mathbb{E}\Big[\sum_{j=0}^{\infty}\frac{(\theta\sum_{i=1}^m X_i^{\mathrm{ort}})^j}{j!}\Big] = \sum_{j=0}^{\infty}\frac{\theta^j}{j!}\mathbb{E}\Big[\Big(\sum_{i=1}^m X_i^{\mathrm{ort}}\Big)^j\Big] = \sum_{j=0}^{\infty}\frac{\theta^j}{j!}\sum_{(j_1,\ldots,j_m)\in S_j}\binom{j}{j_1,\ldots,j_m}\mathbb{E}[(X_1^{\mathrm{ort}})^{j_1}\cdot\ldots\cdot(X_m^{\mathrm{ort}})^{j_m}], \qquad (44)$$
Similarly, we get:
$$\mathbb{E}[e^{\theta(X_1^{\mathrm{iid}} + \ldots + X_m^{\mathrm{iid}})}] = \sum_{j=0}^{\infty}\frac{\theta^j}{j!}\sum_{(j_1,\ldots,j_m)\in S_j}\binom{j}{j_1,\ldots,j_m}\mathbb{E}[(X_1^{\mathrm{iid}})^{j_1}\cdot\ldots\cdot(X_m^{\mathrm{iid}})^{j_m}]. \qquad (46)$$
Therefore we get:
$$\Delta = \mathbb{E}[e^{\theta(X_1^{\mathrm{iid}} + \ldots + X_m^{\mathrm{iid}})}] - \mathbb{E}[e^{\theta(X_1^{\mathrm{ort}} + \ldots + X_m^{\mathrm{ort}})}] = \sum_{j=0}^{\infty}\frac{\theta^j}{j!}\sum_{(j_1,\ldots,j_m)\in S_j}\binom{j}{j_1,\ldots,j_m}\Big(\mathbb{E}[(X_1^{\mathrm{iid}})^{j_1}\cdot\ldots\cdot(X_m^{\mathrm{iid}})^{j_m}] - \mathbb{E}[(X_1^{\mathrm{ort}})^{j_1}\cdot\ldots\cdot(X_m^{\mathrm{ort}})^{j_m}]\Big). \qquad (47)$$
Note first that, using the fact that $g$ is entire, we can rewrite each $X_i^{\mathrm{ort}}$ as:
$$X_i^{\mathrm{ort}} = \sum_{s=0}^{\infty} a_s\big((\omega_i^{\mathrm{ort}})^\top z\big)^s, \qquad (48)$$
where $g(x) = \sum_{s=0}^{\infty} a_s x^s$ and $a_0, a_1, \ldots \geq 0$. Similarly,
$$X_i^{\mathrm{iid}} = \sum_{s=0}^{\infty} a_s\big((\omega_i^{\mathrm{iid}})^\top z\big)^s. \qquad (49)$$
By plugging the above formulae for $X_i^{\mathrm{ort}}$ and $X_i^{\mathrm{iid}}$ into the formula for $\Delta$ and expanding power-
expressions, we obtain:
$$\Delta = \sum_{j=0}^{\infty}\frac{\theta^j}{j!}\sum_{(j_1,\ldots,j_m)\in S_j}\binom{j}{j_1,\ldots,j_m}\sum_{(d_1,\ldots,d_m)\in D(j_1,\ldots,j_m)}\widehat{c}_{j_1,\ldots,j_m}(d_1,\ldots,d_m)\,\widehat{\Delta}(d_1,\ldots,d_m), \qquad (50)$$
for some ordered subsets of indices (with potentially repeating entries) $D(j_1,\ldots,j_m)$ and some
nonnegative $\widehat{c}_{j_1,\ldots,j_m}(d_1,\ldots,d_m)$ (an exact formula for those can be given, but we do not need it to
complete the proof, and since it is technical it would unnecessarily complicate the proof, so we skip
it), and with $\widehat{\Delta}(d_1,\ldots,d_m)$ defined as:
$$\widehat{\Delta}(d_1,\ldots,d_m) = \mathbb{E}[((\omega_1^{\mathrm{iid}})^\top z)^{d_1}\cdot\ldots\cdot((\omega_m^{\mathrm{iid}})^\top z)^{d_m}] - \mathbb{E}[((\omega_1^{\mathrm{ort}})^\top z)^{d_1}\cdot\ldots\cdot((\omega_m^{\mathrm{ort}})^\top z)^{d_m}]. \qquad (51)$$
$$Y' = \Big(e_1^\top\frac{g}{\|g\|_2}\|z\|_2\Big)^{d_1}\cdot\ldots\cdot\Big(e_m^\top\frac{g}{\|g\|_2}\|z\|_2\Big)^{d_m}\cdot(\|\omega_1^{\mathrm{ort}}\|_2)^{d_1}\cdot\ldots\cdot(\|\omega_m^{\mathrm{ort}}\|_2)^{d_m}, \qquad (53)$$
where $g$ is a Gaussian vector taken from the $\mathcal{N}(0, I_d)$ distribution, independently from:
$\|\omega_1^{\mathrm{ort}}\|_2, \ldots, \|\omega_m^{\mathrm{ort}}\|_2$.
This comes from the fact that for a fixed $z$ one can think about the set $\frac{\omega_1^{\mathrm{ort}}}{\|\omega_1^{\mathrm{ort}}\|_2}, \ldots, \frac{\omega_m^{\mathrm{ort}}}{\|\omega_m^{\mathrm{ort}}\|_2}$ as a
random rotation of the system of $m$ canonical basis vectors $e_1, \ldots, e_m$. Thus instead of applying
a random rotation to $e_1, \ldots, e_m$, one can equivalently randomly rotate the vector $z$. A randomly rotated
vector $z$ has the same distribution as $\frac{g}{\|g\|_2}\|z\|_2$.
Now let us focus on the second expression from the formula on $\widehat{\Delta}(d_1,\ldots,d_m)$. We have:
$$\mathbb{E}[((\omega_1^{\mathrm{iid}})^\top z)^{d_1}\cdot\ldots\cdot((\omega_m^{\mathrm{iid}})^\top z)^{d_m}] = \prod_{i=1}^{m}\mathbb{E}[((\omega_i^{\mathrm{iid}})^\top z)^{d_i}] = \mathbb{E}[(\|\omega_1^{\mathrm{iid}}\|_2)^{d_1}]\cdot\ldots\cdot\mathbb{E}[(\|\omega_m^{\mathrm{iid}}\|_2)^{d_m}]\cdot\|z\|_2^{d_1+\ldots+d_m}\cdot\prod_{i=1}^{m}\mathbb{E}\Big[\frac{g_i^{d_i}}{\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{d_i}}\Big], \qquad (56)$$
where the first equality comes from the fact that different $\omega_i^{\mathrm{iid}}$'s are independent and the second one is
implied by an analysis analogous to the one conducted above.
We will need the following lemma:
Lemma 5. For every $s \in \mathbb{N}_+$ such that $s \leq d$ and every $k_1, \ldots, k_s \in \mathbb{N}_+$ the following holds:
$$\mathbb{E}\Big[\frac{g_1^{k_1}\cdot\ldots\cdot g_s^{k_s}}{\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{k_1+\ldots+k_s}}\Big] = \frac{\prod_{i=1}^{s}\mathbb{E}[g_i^{k_i}]}{\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{k_1+\ldots+k_s}\big]}. \qquad (57)$$
Proof. Take $r = \frac{g}{\|g\|_2}\|\tilde{g}\|_2$, where $\tilde{g}$ is an independent copy of $g$. Note that $r \sim g$. We have:
Note that by Lemma 5, we can rewrite the right expression from the formula on $\widehat{\Delta}(d_1,\ldots,d_m)$ as:
$$\mathbb{E}[(\|\omega_1^{\mathrm{ort}}\|_2)^{d_1}]\cdot\ldots\cdot\mathbb{E}[(\|\omega_m^{\mathrm{ort}}\|_2)^{d_m}]\cdot\|z\|_2^{d_1+\ldots+d_m}\cdot\frac{\prod_{i=1}^{m}\mathbb{E}[g_i^{d_i}]}{\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{d_1+\ldots+d_m}\big]}. \qquad (60)$$
Since the marginal distributions of $\omega_i^{\mathrm{ort}}$ and $\omega_i^{\mathrm{iid}}$ are the same, we can rewrite $\widehat{\Delta}(d_1,\ldots,d_m)$ as:
$$\widehat{\Delta}(d_1,\ldots,d_m) = L(d_1,\ldots,d_m)\big(1 - \tau(d_1,\ldots,d_m)\big), \qquad (62)$$
where $\tau(d_1,\ldots,d_m)$ is defined as:
$$\tau(d_1,\ldots,d_m) = \frac{\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{d_1}\big]\cdot\ldots\cdot\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{d_m}\big]}{\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{d_1+\ldots+d_m}\big]}. \qquad (63)$$
We now need a few observations regarding $\widehat{\Delta}(d_1,\ldots,d_m)$. Note first that, since odd moments of
the Gaussian scalar distribution $\mathcal{N}(0,1)$ are zero, $\widehat{\Delta}(d_1,\ldots,d_m)$ is zero if at least one of the $d_i$ is odd.
Furthermore, $\widehat{\Delta}(d_1,\ldots,d_m)$ is trivially zero if all but at most one of the $d_i$ are zero.
Therefore (see our observations on $\widehat{\Delta}(d_1,\ldots,d_m)$), to complete the proof it suffices to show that
$\tau(d_1,\ldots,d_m) \leq \frac{d}{d+2}$ if at least two $d_i, d_j$ with $i \neq j$ are nonzero and all $d_i$ are even.
Lemma 6. The following holds if for some $i \neq j$ we have $d_i, d_j > 0$ and all $d_i$ are even:
$$\tau(d_1,\ldots,d_m) \leq \frac{d}{d+2}. \qquad (64)$$
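The boundary case of this bound is easy to verify numerically: for $d_i = d_j = 2$ and all other $d_k = 0$, $\tau = (\mathbb{E}\|g\|^2)^2/\mathbb{E}\|g\|^4 = d^2/(d(d+2)) = \frac{d}{d+2}$. A rough Monte Carlo check of this case (ours, for illustration only):

```python
# For d_i = d_j = 2, tau = (E||g||^2)^2 / E||g||^4 = d / (d + 2), since E||g||^4 = d(d + 2).
import jax
import jax.numpy as jnp

d, n_samples = 10, 200000
g = jax.random.normal(jax.random.PRNGKey(0), (n_samples, d))
sq_norms = jnp.sum(g**2, axis=1)
tau_22 = jnp.mean(sq_norms) ** 2 / jnp.mean(sq_norms**2)
print(tau_22, d / (d + 2))   # both should be close to 0.833
```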
$$\widehat{\Delta}(d_1, \ldots, d_m) \geq \frac{2}{d+2}\Lambda(d_1, \ldots, d_m) \geq 0.$$
Hence, since all terms in the sum
$$\begin{aligned}
\Delta = \sum_{j=0}^{\infty}\frac{\theta^j}{j!}\sum_{(j_1,\ldots,j_m)\in S_j}&\binom{j}{j_1,\ldots,j_m}\sum_{(d_1,\ldots,d_m)\in D(j_1,\ldots,j_m)}\widehat{c}_{j_1,\ldots,j_m}(d_1,\ldots,d_m) \qquad (66)\\
&\times\;\widehat{\Delta}(d_1,\ldots,d_m) \qquad (67)
\end{aligned}$$
are nonnegative, we get a lower bound on $\Delta$ by taking only a subset of these terms. For this subset,
we take $j = 4$ and the subset of $S_4$ with only two nonzero entries $j_{k_1} = j_{k_2} = 2$ for some $k_1 \neq k_2$ (there are
$\binom{m}{2}$ combinations of such $j_1,\ldots,j_m$). Then, we take only those $d_1,\ldots,d_m$ from $D(j_1,\ldots,j_m)$
which correspond to $s = 1$ in (49) for $k_1, k_2$ and $s = 0$ for all other $k$'s. Hence, $d_{k_1} = d_{k_2} = 2$,
all other $d_k$'s are zero, and the corresponding weight from the second sum in (67) is
$a_1^2 a_0^{m-2}$. For $d_1,\ldots,d_m$ in such a set, we have $\tau(d_1,\ldots,d_m) \leq \frac{d}{d+2}$ by Lemma 6 and, hence,
$\widehat{\Delta}(d_1,\ldots,d_m) \geq \frac{2}{d+2}\Lambda(d_1,\ldots,d_m)$. As a result, we get the following lower bound on $\Delta$:
$$\begin{aligned}
\Delta &\geq \frac{2\theta^4}{4!(d+2)}\binom{m}{2}\binom{4}{2,2,0,\ldots,0}a_1^2 a_0^{m-2}\Lambda(2,2,0,\ldots,0) \\
&= \frac{\theta^4 m(m-1)}{4(d+2)}a_1^2 a_0^{m-2}\Lambda(2,2,0,\ldots,0) \\
&= \frac{\theta^4 m(m-1)}{4(d+2)}a_1^2 a_0^{m-2}\|z\|^4\big(\mathbb{E}\|\omega\|_2^2\big)^2\frac{(\mathbb{E}(g_1^2))^2}{(\mathbb{E}\|g\|_2^2)^2}.
\end{aligned}$$
Since $g \sim \mathcal{N}(0, I_d)$, $\mathbb{E}g_1^2 = 1$ and $\mathbb{E}\|g\|_2^2 = d\,\mathbb{E}g_1^2 = d$. This results in
$$\Delta \geq \frac{\theta^4 m(m-1)}{4d^2(d+2)}a_1^2 a_0^{m-2}\|z\|^4\big(\mathbb{E}\|\omega\|_2^2\big)^2, \qquad (68)$$
which concludes the proof.
Similarly,
$$\mathrm{Var}(\widehat{F}_m^{\mathrm{ort}}(z)) = \mathbb{E}[(\widehat{F}_m^{\mathrm{ort}}(z))^2] - F^2(z). \qquad (70)$$
We have:
$$\mathbb{E}[(\widehat{F}_m^{\mathrm{iid}}(z))^2] = \frac{1}{m^2}\sum_{i=1}^{m}\mathbb{E}[(X_i^{\mathrm{iid}})^2] + \frac{1}{m^2}\sum_{i\neq j}\mathbb{E}[X_i^{\mathrm{iid}}X_j^{\mathrm{iid}}]. \qquad (71)$$
Similarly, we get:
$$\mathbb{E}[(\widehat{F}_m^{\mathrm{ort}}(z))^2] = \frac{1}{m^2}\sum_{i=1}^{m}\mathbb{E}[(X_i^{\mathrm{ort}})^2] + \frac{1}{m^2}\sum_{i\neq j}\mathbb{E}[X_i^{\mathrm{ort}}X_j^{\mathrm{ort}}]. \qquad (72)$$
Therefore, since the marginal distributions of $X_i^{\mathrm{iid}}$ and $X_i^{\mathrm{ort}}$ are the same, we have:
$$\mathrm{MSE}(\widehat{F}_m^{\mathrm{iid}}(z)) - \mathrm{MSE}(\widehat{F}_m^{\mathrm{ort}}(z)) = \binom{m}{2}\cdot 2\cdot\frac{1}{m^2}\big(\mathbb{E}[X_1^{\mathrm{iid}}X_2^{\mathrm{iid}}] - \mathbb{E}[X_1^{\mathrm{ort}}X_2^{\mathrm{ort}}]\big) = \Big(1 - \frac{1}{m}\Big)\big(\mathbb{E}[X_1^{\mathrm{iid}}X_2^{\mathrm{iid}}] - \mathbb{E}[X_1^{\mathrm{ort}}X_2^{\mathrm{ort}}]\big). \qquad (73)$$
Plugging in the formulae for $X_i^{\mathrm{ort}}$ and $X_i^{\mathrm{iid}}$ from Equation 48 and Equation 49, and using our analysis
from the proof of Theorem 5, we obtain:
$$\mathrm{MSE}(\widehat{F}_m^{\mathrm{iid}}(z)) - \mathrm{MSE}(\widehat{F}_m^{\mathrm{ort}}(z)) = \Big(1 - \frac{1}{m}\Big)\sum_{t,u=0}^{\infty}a_t a_u\|z\|_2^{t+u}\,\mathbb{E}[\|\omega\|_2^t]\,\mathbb{E}[\|\omega\|_2^u]\cdot\frac{\mathbb{E}[r^t]\,\mathbb{E}[r^u]}{\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{t}\big]\,\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{u}\big]}\big(1 - \tau(t, u)\big) \qquad (74)$$
for $\omega \sim \Omega$ and $r \sim \mathcal{N}(0, 1)$.
Based on the definition of $\tau$ (63), if $t = 0$ or $u = 0$ then $\tau(t, u) = 1$ and the whole corresponding term in
the sum (74) is zero. Also, if $t$ is odd, $\mathbb{E}[r^t] = 0$ and, again, the corresponding term in the sum (74)
is zero; the same holds for odd $u$. Based on the analysis from Theorem 5's proof and the definition of
$F_{\Omega,g}(z)$ we have:
$$F_{\Omega,g}(z) = \sum_{t=0}^{\infty}a_t\|z\|_2^t\,\mathbb{E}[\|\omega\|_2^t]\cdot\frac{\mathbb{E}[r^t]}{\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{t}\big]} = \sum_{t=0}^{\infty}a_{2t}\|z\|_2^{2t}\,\mathbb{E}[\|\omega\|_2^{2t}]\cdot\frac{\mathbb{E}[r^{2t}]}{\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{2t}\big]},$$
where in the second transition we use the fact that $\mathbb{E}[r^t] = 0$ for odd $t$.
Hence, we can rewrite (74) by excluding terms which are definitely zero and using Lemma 6:
$$\begin{aligned}
\mathrm{MSE}(\widehat{F}_m^{\mathrm{iid}}(z)) - \mathrm{MSE}(\widehat{F}_m^{\mathrm{ort}}(z)) &\geq \Big(1 - \frac{1}{m}\Big)\frac{2}{d+2}\sum_{t,u=1}^{\infty}a_{2t}a_{2u}\|z\|_2^{2t+2u}\,\mathbb{E}[\|\omega\|_2^{2t}]\,\mathbb{E}[\|\omega\|_2^{2u}]\cdot\frac{\mathbb{E}[r^{2t}]\,\mathbb{E}[r^{2u}]}{\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{2t}\big]\,\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{2u}\big]} \\
&= \Big(1 - \frac{1}{m}\Big)\frac{2}{d+2}\Bigg(\sum_{t=1}^{\infty}a_{2t}\|z\|_2^{2t}\,\mathbb{E}[\|\omega\|_2^{2t}]\cdot\frac{\mathbb{E}[r^{2t}]}{\mathbb{E}\big[\big(\sqrt{g_1^2+\ldots+g_d^2}\big)^{2t}\big]}\Bigg)^2 \\
&= \Big(1 - \frac{1}{m}\Big)\frac{2}{d+2}\big(F_{\Omega,g}(z) - a_0\big)^2.
\end{aligned} \qquad (75)$$
That completes the proof.
is a radial basis function (RBF) kernel (Choromanski et al., 2018b) with corresponding spectral
distribution $\Omega$ (e.g. the Gaussian kernel, for which $\Omega = \mathcal{N}(0, I_d)$). Assume that the rows of matrices $Q$
and $K$ are taken from a ball $B(R)$ of radius $R$, centered at $0$ (i.e. norms of queries and keys are upper-
bounded by $R$). Define $l = Rd^{-\frac{1}{4}}$ and take $g^* = \max_{x\in B(l)}|g(x)|$ and $h^* = \max_{x\in B(l)}|h(x)|$.
Then for any $\epsilon > 0$, $\delta = \frac{\epsilon}{g^*h^*}$ and the number of random projections $m = \Omega\big(\frac{d}{\delta^2}\log\big(\frac{4\sigma R}{\delta d^{\frac{1}{4}}}\big)\big)$ for
$\sigma = \mathbb{E}_{\omega\sim\Omega}[\omega^\top\omega]$ the following holds: $\|\widehat{A} - A\|_\infty \leq \epsilon$ with any constant probability, where $\widehat{A}$
approximates the generalized attention matrix via orthogonal trigonometric random features.
The result holds in particular for regular softmax-attention, for which $\mathrm{K}$ is a Gaussian kernel and
$g(x) = h(x) = \exp(\frac{\|x\|^2}{2})$. In that case $m_{\mathrm{opt}} = \Omega\big(\frac{d}{\delta^2}\log\big(\frac{4d^{\frac{3}{4}}R}{\delta}\big)\big)$ since $\sigma = d$.
Proof. Let $D_Q$ be a diagonal matrix with entries of the form $g(q_i^\top)$ and let $D_K$ be a diagonal matrix
with entries of the form $h(k_i^\top)$. Denote $B = \big[\mathrm{K}\big(\tfrac{1}{d^{1/4}}q_i^\top, \tfrac{1}{d^{1/4}}k_j^\top\big)\big]_{i,j} \in \mathbb{R}^{L\times L}$. Denote by $\widehat{A}$ an
approximation of the attention matrix obtained from trigonometric orthogonal random features and
by $\widehat{B}$ an approximation of the matrix $B$ that those random features provide. We rely on Theorem 3 from
(Lin et al., 2020). Note that we can apply it in our case, since for RBF kernels the corresponding
functions $f_i$ satisfy $f_1(x) = \sin(x)$, $f_2(x) = \cos(x)$ (thus in particular are bounded). Also, it is not
hard to observe (see for instance the analysis in Claim 1 from (Rahimi & Recht, 2007)) that we can take
$L_f = 1$ (for $L_f$ as in Theorem 3 from (Lin et al., 2020)). Using Theorem 3 from (Lin et al., 2020),
we conclude that:
$$\|\widehat{B} - B\|_\infty \leq \delta \qquad (76)$$
with any constant probability as long as $m = \Omega\big(\frac{d}{\delta^2}\log\big(\frac{\sigma\cdot\mathrm{diam}(\mathcal{M})}{\delta}\big)\big)$, where $\sigma = \mathbb{E}[\omega^\top\omega]$ and $\mathrm{diam}(\mathcal{M})$
is the diameter of the smallest ball $\mathcal{M}$ containing all vectors of the form $z = \frac{Q_i}{d^{1/4}} - \frac{K_j}{d^{1/4}}$. Since
$\|Q_i\|_2, \|K_j\|_2 \leq R$, we conclude that $\|z\|_2 \leq \frac{2R}{d^{1/4}}$ and thus one can take $\mathrm{diam}(\mathcal{M}) = \frac{4R}{d^{1/4}}$. We have:
$$\|\widehat{A} - A\|_\infty = \|D_Q(\widehat{B} - B)D_K\|_\infty \leq \|D_Q\|_\infty\|\widehat{B} - B\|_\infty\|D_K\|_\infty \leq \delta g^* h^*. \qquad (77)$$
Taking $\delta = \frac{\epsilon}{g^* h^*}$ completes the proof.