Tensor Cookbook
February 9, 2025
Chapter 1
Introduction
What is this? These pages are a guide to tensors, using the visual language of “tensor
diagrams”. To illustrate the generality of the approach, I have tried to closely follow the
legendary “Matrix Cookbook”. As such, most of the presentation is a collection of facts
(identities, approximations, inequalities, relations, ...) about tensors and matters relating
to them. You won’t find many results not in the original cookbook, but hopefully the
diagrams will give you a new way to understand and appreciate them.
It’s ongoing: The Matrix Cookbook is a long book, and not all the sections are equally
amenable to diagrams. Hence I’ve opted to skip certain sections and shorten others.
Perhaps in the future, I, or others, will expand the coverage further.
For example, while we cover all of the results on Expectation of Linear Combinations
and Gaussian moments, we skip the section on general multi-variate distributions. I have
also had to rearrange the material a bit, to avoid having to introduce all the notation up
front.
Complex Matrices and Covariance Tensor diagrams (or networks) are currently
most often seen in Quantum Physics, but this is not a book for physicists. The Matrix
Cookbook is a book for engineers, in particular in Machine Learning, where complex
numbers are less common. Without complex numbers, we don’t have to worry about
complex conjugation, which simplifies transposes, and gets rid of the need for co- and
contra-variant tensors. If you are a physicist, you probably want a book on Tensor
Analysis.
Tensorgrad The symbolic nature of tensor diagrams makes them well suited for symbolic
computation; chapter 12 describes tensorgrad, a library built on this idea.
Advantages of Tensor Diagram Notation: Tensor diagram notation has many ben-
efits compared to other notations:
Various operations, such as a trace, tensor product, or tensor contraction, can be
expressed simply without extra notation. Names of indices and tensors can often be
omitted, which saves time and lightens the notation; this is especially useful for internal
indices that exist mainly to be summed over. The order of the tensor resulting from an
operation can be read off directly as the number of free edges in the diagram.
Etymology The term "tensor" is rooted in the Latin word tensio, meaning “tension” or
“stretching,” derived from the verb tendere, which means “to stretch” or “to extend.” It was
first introduced in the context of mathematics in the mid-19th century by William Rowan
Hamilton in his work on quaternions, where it referred to the magnitude of a quaternion.
The modern usage of "tensor" was later established by Gregorio Ricci-Curbastro and
Tullio Levi-Civita in their development of tensor calculus, a framework that generalizes
the concept of scalars, vectors, and matrices to more complex, multidimensional entities [1, 10].
Contents
1 Introduction 1
1.1 Tensor Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 The Copy Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Sums of Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Transposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Higher order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.1 Higher order traces . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.7 Symmetry and Symmetrization . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8 Covariance and Contravariance . . . . . . . . . . . . . . . . . . . . . . . . 13
1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Simple Derivatives 14
2.1 Derivatives of Matrices, Vectors and Scalar Forms . . . . . . . . . . . . . 14
2.1.1 First Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Second Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Higher Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Derivatives of Traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 First Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Second Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 Higher Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.8.1 Stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.9 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Functions 28
4.1 The Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 The Hessian Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.2 Chain Rule with Broadcasting . . . . . . . . . . . . . . . . . . . . 31
4.1.3 Functions with multiple inputs . . . . . . . . . . . . . . . . . . . . 31
4.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Known derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.2 Pseudo-linear forms . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.3 Trace identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.4 Taylor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7 Advanced Derivatives 47
7.1 Derivatives of vector norms . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.1.1 Two-norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.2 Derivatives of matrix norms . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.3 Derivatives of Structured Matrices . . . . . . . . . . . . . . . . . . . . . . 48
7.3.1 Symmetric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.3.2 Diagonal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.3.3 Toeplitz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.4 Derivatives of a Determinant . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.5 General forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.6 Linear forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8 Special Matrices 50
8.0.1 Block matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
8.0.2 The Discrete Fourier Transform Matrix . . . . . . . . . . . . . . . 50
8.0.3 Fast Kronecker Multiplication . . . . . . . . . . . . . . . . . . . . . 50
8.0.4 Hermitian Matrices and skew-Hermitian . . . . . . . . . . . . . . . 54
8.0.5 Idempotent Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.0.6 Orthogonal matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.0.7 Positive Definite and Semi-definite Matrices . . . . . . . . . . . . . 54
8.0.8 Singleentry Matrix, The . . . . . . . . . . . . . . . . . . . . . . . . 54
8.0.9 Symmetric, Skew-symmetric/Antisymmetric . . . . . . . . . . . . . 54
8.0.10 Toeplitz Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.0.11 Units, Permutation and Shift . . . . . . . . . . . . . . . . . . . . . 54
8.0.12 Vandermonde Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 54
9 Decompositions 55
9.1 Higher-order singular value decomposition . . . . . . . . . . . . . . . . . . 55
9.2 Rank Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.2.1 Border Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.3 Fast Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
11 Tensor Algorithms 59
11.1 Tensor Contraction Orders . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
11.1.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
12 Tensorgrad 61
12.1 Isomorphisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
12.1.1 In Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
12.1.2 In Sums . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
12.1.3 In Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
12.1.4 In Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
12.1.5 In Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
12.1.6 In Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
12.1.7 In Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
12.1.8 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
12.2 Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
12.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
12.3.1 Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
12.4 Simplification Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
13 Appendix 69
We think of vectors and matrices as tensors of order 1 and 2. The order corresponds
to the number of dimensions in their array visualization, e.g. a vector is a 1-dimensional
list of numbers, while a matrix is a 2-dimensional grid of numbers. The order
also determines the degree of the node representing the variable in the tensor graph.
Diagram notation becomes more interesting when you have tensors of order 3 and
higher. An order-3 tensor is a cube of numbers, or a stack of matrices. E.g. we can write
this as T ∈ R^{n×m×k}, so T_i ∈ R^{m×k} is a matrix for i = 1 . . . n. Of course we could slice T
along the other axes too, so T_{:,j,:} ∈ R^{n×k} and T_{:,:,ℓ} ∈ R^{n×m} are matrices too.
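As a quick illustration in numpy (a sketch; the sizes 2, 3, 4 are arbitrary), slicing an order-3 tensor along each axis gives the matrices described above:

    import numpy as np

    T = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # T in R^{n x m x k} with n, m, k = 2, 3, 4
    print(T[0].shape)        # (3, 4): the matrix T_i for i = 0
    print(T[:, 1, :].shape)  # (2, 4): slicing along the second axis
    print(T[:, :, 2].shape)  # (2, 3): slicing along the third axis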
A matrix having two outgoing edges means there are two ways you can multiply a
vector onto it, either on the left, x^T M, or on the right, M x. In graph notation we just
write x−M− and −M−x. An order-3 tensor has three edges, so we can multiply it with
a vector in three ways:

[diagram: the tensor T with the vector x attached to each of its three edges in turn]
To be perfectly precise about what each one means, we should give the edges labels. For
example we would write [diagram: x attached to the i edge of T] to specify the matrix
$\sum_i T_i x_i$. However, often the edge in question will be clear from the context, which
is part of what makes tensor diagram notation cleaner than, say, Einstein sum notation.
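In code, the three contractions correspond to three different einsums. A minimal numpy sketch (dimension sizes arbitrary):

    import numpy as np

    T = np.random.rand(2, 3, 4)
    x1, x2, x3 = np.random.rand(2), np.random.rand(3), np.random.rand(4)
    A = np.einsum("ijk,i->jk", T, x1)  # contract x onto the first edge
    B = np.einsum("ijk,j->ik", T, x2)  # contract x onto the second edge
    C = np.einsum("ijk,k->ij", T, x3)  # contract x onto the third edge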
$$Y_{i,j} = \sum_{k,l,m,n,o} A_{i,k}\, B_{l,n,o}\, C_{j,k,l,m}\, D_{m,n}\, E_o
\quad\Leftrightarrow\quad \text{[tensor diagram: } Y \text{ as the network of } A, B, C, D, E\text{]}$$
The key principle of tensor diagrams is that edge contraction is associative. This
means you can contract any edge in any order you prefer. This can be seen from the sum
representation above, which can be reordered to sum over k, l, m, n, o in any order.
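A tiny numerical check of associativity (a sketch with a chain of three matrices):

    import numpy as np

    A, B, C = np.random.rand(2, 3), np.random.rand(3, 4), np.random.rand(4, 5)
    assert np.allclose((A @ B) @ C, A @ (B @ C))  # same tensor, different contraction order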
The computational price of different contraction orders can vary widely. Unfortunately
it is not computationally easy to find the optimal order. See section 11.1 for
algorithms to find the best contraction order, and for approximate contraction methods.
Note that tensor graphs are not always connected. We already saw that the outer
product of two vectors can be written as two disconnected nodes, a and b. This is natural
from the sum representation: no edges simply means no sums. So here $y_{i,j} = a_i b_j$,
which is exactly the outer product $y = a \otimes b$.
$$D_a = \text{[diagram: a copy tensor with two free edges and } a \text{ attached to the third]}$$

Why? Because $(D_a)_{i,j} = \sum_k \mathrm{Copy}_{i,j,k}\, a_k = \sum_k [i = j = k]\, a_k = [i = j]\, a_i$.
Similarly the Hadamard product, $(a \circ b)_i = a_i b_i$, can be written

$$a \circ b = \text{[diagram: a copy tensor with one free edge and } a, b \text{ attached to the other two]}$$
Now, let's see why everyone loves copy tensors by using them to prove the identity
$D_a D_b = D_{a \circ b}$ by "copy tensor manipulation":

$$D_a D_b = \text{[diagram sequence merging the two copy tensors into one]} = D_{a \circ b}.$$

You can verify this using the sum representation.
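A quick numerical check of $D_a D_b = D_{a \circ b}$ in numpy:

    import numpy as np

    a, b = np.random.rand(4), np.random.rand(4)
    # multiplying diagonal matrices = taking the Hadamard product of their diagonals
    assert np.allclose(np.diag(a) @ np.diag(b), np.diag(a * b))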
The general rule at play is that any connected sub-graph of copy tensors can be
combined into a single one. Sometimes we are even lucky enough that this simplification
applies, as in the following example:

[diagram: a network of S, T and several copy tensors, in which all the connected copy tensors merge into one]

1 For a logical proposition P, we define [P] = 1 if P holds and 0 otherwise.
The only time you have to be a bit careful is when the resulting tensor has order 0.
Depending on how you define the order-0 copy tensor, certain identities may or may not hold.
Lots of other constructions that require special notation in normal vector notation (like
diagonal matrices or Hadamard products) can be unified using the copy tensor.
In the Matrix Cookbook they define the order-4 tensor J, which satisfies $J_{i,j,k,l} = [i = k][j = l]$
and which we'd draw as two parallel identity edges. It satisfies, for example, $\frac{dX}{dX} = J$.
Using tensor products you could write $J = I \otimes I$. Note that J is different from the order-4
copy tensor.
When adding tensors that don't have the same number of edges, or have edges with
different names, we can use "broadcasting". Say we want to add a matrix M and a vector
x. What does it even mean? If we want to add x to every row of M, we write M plus x
drawn with an extra dangling edge [diagram]. This works because x with an extra edge is
the outer product between x and the all-ones vector, which is a matrix in which every row
is the same. Similarly, if we want to add x to every column, we could attach the extra
edge on the other side of x [diagram].
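Numpy's broadcasting follows exactly this outer-product-with-the-all-ones-vector picture. A small sketch:

    import numpy as np

    M = np.random.rand(3, 4)
    r = np.random.rand(4)   # one entry per column: added to every row
    c = np.random.rand(3)   # one entry per row: added to every column
    assert np.allclose(M + r, M + np.outer(np.ones(3), r))
    assert np.allclose(M + c[:, None], M + np.outer(c, np.ones(4)))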
Note that we typically don't talk about "rows" or "columns" when dealing with tensors,
but simply refer to the named edges (sometimes called axes) of the tensor. When using
named edges, operations from classical vector notation like "transpose" can also be removed.
The matrix X^T is simply X where the left and right edge have been swapped. But if the
edges are named, we don't have to keep track of "where the edge is" at all.
1.4 Transposition
In classical matrix notation, transposition flips the indices of a matrix, so that $(A^T)_{i,j} = A_{j,i}$.
In tensor diagram notation, we have two choices depending on whether we want the
position of the edges to be significant. With significant edge positions, we typically
let the “left edge” be the first index, and the “right edge” be the second index. Thus
transposition requires flipping the edges:
( −A− )^T = −A− with its left and right edges crossed = −A^T−.

A fun notation used by some authors is flipping the tensor symbol upside down, as a
simpler way to flip the left and right edges.
In practical computations, keeping track of the edge positions is easy to mess up.
It's more robust to name the "outputs" of the tensor, and let transposition rename the
edges:

( i−A−j )^T = j−A−i.

Renaming can be done using multiplication by identity matrices:

( i−A−j )( j−I−k ) = i−A−k,

but we have to be careful because overlapping edge names can make multiplication non-associative. E.g.

[( i−A−j )( j−I−k )]( k−I−j ) = i−A−j  ≠  ( i−A−j )[( j−I−k )( k−I−j )] = n · ( i−A−j ),

where n, a closed loop in the diagram, equals the matrix dimension. In tensorgrad we solve
this problem by requiring that any edge name is present at most twice in any product.
For the purpose of transcribing the Matrix Cookbook, using the "significant position"
notation is more convenient. We observe the following identities:

$$(A^T)^T = A$$
$$(A + B)^T = A^T + B^T \tag{4}$$
$$(AB)^T = B^T A^T \tag{5}$$
1.5 Trace
The "trace" of a square matrix is defined $\mathrm{Tr}(A) = \sum_i A_{i,i}$. In tensor diagram notation,
that corresponds to a self-edge on A. The Matrix Cookbook has a list of identities
using traces. Let's reproduce them with tensor diagrams:

$$\sum_{i=1}^n A_{ii} = \mathrm{Tr}(A) = \mathrm{Tr}(AI) \tag{11}$$
$$\mathrm{Tr}(A) = \mathrm{Tr}(A^T) \tag{13}$$
$$\mathrm{Tr}(AB) = \mathrm{Tr}(BA) \tag{14}$$
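The trace identities are easy to check numerically, reading a self-edge as a repeated einsum index:

    import numpy as np

    A, B = np.random.rand(4, 4), np.random.rand(4, 4)
    assert np.isclose(np.einsum("ii->", A), np.trace(A))      # (11): a self-edge is a trace
    assert np.isclose(np.trace(A), np.trace(A.T))             # (13)
    assert np.isclose(np.trace(A @ B), np.trace(B @ A))       # (14)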
In quantum mechanics, it is common to use a "partial trace", which we can define using
index notation:

$$[\mathrm{Tr}_{i,j}(T)]_k = \sum_i T_{i,i,k}.$$

Of course with tensor diagrams, we can also use partial traces without naming the indices.
We don't even have to think about whether the contractions we use are traces or not.
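In code, a partial trace is just an einsum with one repeated index. A minimal sketch:

    import numpy as np

    T = np.random.rand(3, 3, 5)            # order-3 tensor with two edges of equal size
    partial = np.einsum("iik->k", T)       # trace out the first two edges
    assert partial.shape == (5,)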
1.6 Eigenvalues
Eigenvalues and eigenvectors are fundamental concepts in linear algebra that have impor-
tant applications in tensor network theory. In tensor diagram notation, we can represent
these concepts in a visually intuitive way.
For a matrix A, if there exists a non-zero vector v and a scalar λ such that Av = λv,
then λ is called an eigenvalue of A, and v is the corresponding eigenvector. In tensor
diagram notation it's convenient to write the eigendecomposition as $A = Q \Lambda Q^{-1}$, where
Q is a matrix whose columns are the eigenvectors of A, and Λ is a diagonal matrix of the
eigenvalues. Thus:

$$A = Q\, D_\lambda\, Q^{-1}, \tag{1.1}$$

where $D_\lambda$ is the copy tensor with the eigenvalue vector λ attached.
The trace of a matrix is equal to the sum of its eigenvalues. We can represent this
relationship using tensor diagrams:

$$\mathrm{Tr}(A) = \mathrm{Tr}(Q\, D_\lambda\, Q^{-1}) = \mathrm{Tr}(Q^{-1} Q\, D_\lambda) = \mathrm{Tr}(D_\lambda) = \sum_i \lambda_i \tag{12}$$
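A quick numerical check that the trace equals the sum of the eigenvalues:

    import numpy as np

    A = np.random.rand(5, 5)
    eigvals = np.linalg.eigvals(A)
    # the imaginary parts of complex-conjugate eigenvalue pairs cancel in the sum
    assert np.isclose(np.trace(A), eigvals.sum().real)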
1.9 Exercises
Exercise 1. Given a sequence of matrices A1 , A2 , . . . , An ∈ Rn×n , and vectors v1 , v2 , . . . , vn ∈
Rn , draw, using tensor diagrams, the matrix made of vectors A1 v1 , A2 v2 , . . . , An vn .
Exercise 2. Represent the Hadamard product (element-wise multiplication) of two ma-
trices using tensor diagrams. How does this differ from regular matrix multiplication?
(We will see more about this in 3.6.)
Exercise 3. Represent the SVD of a matrix A = U ΣV T using tensor diagrams. How
does this compare to the eigendecomposition diagram? How can you generalize it to
higher order tensors? (In section 9.1 we will see more about this.)
Chapter 2
Simple Derivatives
A derivative with respect to a tensor is simply the collection of derivatives with respect
to each element of this tensor. We can keep track of $\frac{dT}{dU}$ by making a tensor of shape
$\mathrm{shape}(T) \cup \mathrm{shape}(U)$. For example, if T is an order-3 tensor and U is an order-2 tensor,
we draw $dT/dU$ as T with a derivative circle that adds the two edges of U.
This notation follows Penrose. The two extra lines coming from the black dot on the
circle make the derivative an order-5 tensor. That the order of derivatives grows this
way is one of the main reasons we'll encounter for tensors to show up in the first place.
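A sketch in pytorch of how the order grows (the function T below is a hypothetical example, not one from the text): if T(U) has order 3 and U has order 2, the Jacobian has order 5.

    import torch

    v = torch.randn(3)
    def T(U):                                   # U in R^{4x5}  ->  T(U) in R^{4x5x3}
        return torch.einsum("ab,c->abc", U, v)

    U = torch.randn(4, 5)
    J = torch.autograd.functional.jacobian(T, U)
    print(J.shape)                              # torch.Size([4, 5, 3, 4, 5]): order 5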
When there are not too many edges, we will use a simple inline notation like this:
(T )
The Matrix Cookbook defines the single-entry matrix $J^{i,j} \in \mathbb{R}^{n\times n}$ as the matrix which
is zero everywhere except in the entry (i, j), in which it is 1. Alternatively we could write
$J^{i,j}_{n,m} = [i = n][j = m]$.
$$\frac{\partial\, x^T a}{\partial x} = a \tag{69}$$
$$\frac{\partial\, a^T x}{\partial x} = a \tag{69}$$
$$\frac{\partial\, a^T X b}{\partial X} = a b^T \tag{70}$$

In the diagrams, taking the derivative simply removes the variable and leaves its edges
free, so the first two results are the vector a and the third is the outer product of a and b.
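A quick autograd check of (70) in pytorch (sizes arbitrary):

    import torch

    a, b = torch.randn(3), torch.randn(4)
    X = torch.randn(3, 4, requires_grad=True)
    (a @ X @ b).backward()                             # scalar a^T X b
    assert torch.allclose(X.grad, torch.outer(a, b))   # equals a b^T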
$$\frac{\partial X}{\partial X_{i,j}} = J^{i,j} \tag{73}$$
$$\frac{\partial (XA)_{i,j}}{\partial X_{m,n}} = (J^{m,n} A)_{i,j} \tag{74}$$
The product rule in diagram form: differentiating a product T U gives the derivative
applied to T, times U, plus T times the derivative applied to U.
Note that this rule holds independently of how many edges are between T and U, even if
there are none.
$$\frac{\partial}{\partial X_{i,j}} \sum_{k,l,m,n} X_{k,l}\, X_{m,n}
  = \frac{\partial}{\partial X_{i,j}} \Big(\sum_{k,l} X_{k,l}\Big)^2
  = 2 \sum_{k,l} X_{k,l} \tag{76}$$

$$\frac{\partial\, b^T X^T X c}{\partial X} = X(b c^T + c b^T) \tag{77}$$
$$\frac{\partial}{\partial X_{i,j}} (X^T B X)_{k,l} = \delta_{l,j}\,(X^T B)_{k,i} + \delta_{k,j}\,(B X)_{i,l} \tag{79}$$
$$\frac{\partial}{\partial X_{i,j}} X^T B X = X^T B J^{i,j} + J^{j,i} B X \qquad\text{(same as above)} \tag{80}$$
$$\frac{\partial}{\partial x}\, x^T B x = (B + B^T)\, x \tag{81}$$
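A quick autograd check of (81):

    import torch

    B = torch.randn(5, 5)
    x = torch.randn(5, requires_grad=True)
    (x @ B @ x).backward()                        # scalar x^T B x
    assert torch.allclose(x.grad, (B + B.T) @ x)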
$$\frac{\partial (X^n)_{k,l}}{\partial X_{i,j}} = \sum_{r=0}^{n-1} (X^r)_{k,i}\, (X^{n-1-r})_{j,l}$$

$$\frac{\partial}{\partial X}\, a^T X^n b = \sum_{r=0}^{n-1} (X^r)^T a\, b^T (X^{n-1-r})^T \tag{91}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(X) = I \tag{99}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(XA) = A^T \tag{100}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(AXB) = A^T B^T \tag{101}$$
Continues for (102-105). The last one uses the Kronecker product, which we may have
to introduce first.
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(X^2) = 2X^T \tag{106}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(X^2 B) = (XB + BX)^T \tag{107}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(X^T B X) = \frac{\partial}{\partial X}\, \mathrm{Tr}(X X^T B) = \frac{\partial}{\partial X}\, \mathrm{Tr}(B X X^T) = (B + B^T) X \tag{108, 109, 110}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(X B X^T) = \frac{\partial}{\partial X}\, \mathrm{Tr}(X^T X B) = \frac{\partial}{\partial X}\, \mathrm{Tr}(B X^T X) = X(B^T + B) \tag{111, 112, 113}$$
The last equation is a bit surprising, since we might assume we could simply obtain it from
the previous one by substituting X^T for X.
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(X^n) = n\,(X^{n-1})^T \tag{121}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(A X^n) = \sum_{r=0}^{n-1} (X^r A X^{n-1-r})^T \tag{122}$$
2.3 Exercises
Exercise 4. Find the derivative of (xT Ax)2 with respect to x.
Exercise 5. Find the second derivative of
xT AT xxT Ax
with respect to x.
Exercise 6. Find the derivative of X T AX with respect to X.
Exercise 7. Show the derivatives:
$$\frac{\partial}{\partial x}\, (Bx + b)^T C (Dx + d) = B^T C (Dx + d) + D^T C^T (Bx + b) \tag{78}$$
$$\frac{\partial}{\partial X}\, b^T X^T D X c = D^T X b c^T + D X c b^T \tag{82}$$
$$\frac{\partial}{\partial X}\, (Xb + c)^T D (Xb + c) = (D + D^T)(Xb + c)\, b^T \tag{83}$$
Exercise 8. Show the remaining second order trace derivatives from the Matrix Cookbook:
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(AXBX) = A^T X^T B^T + B^T X^T A^T \tag{114}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(X^T X) = \frac{\partial}{\partial X}\, \mathrm{Tr}(X X^T) = 2X \tag{115}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(B^T X^T C X B) = C^T X B B^T + C X B B^T \tag{116}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(X^T B X C) = B X C + B^T X C^T \tag{117}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(A X B X^T C) = A^T C^T X B^T + C A X B \tag{118}$$
$$\frac{\partial}{\partial X}\, \mathrm{Tr}\big((AXB + C)(AXB + C)^T\big) = 2 A^T (AXB + C) B^T \tag{119}$$

Exercise 9. Show the derivative of the fourth order trace from the Matrix Cookbook:
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(B^T X^T C X X^T C X B)
 = C X X^T C X B B^T + C^T X B B^T X^T C^T X + C X B B^T X^T C X + C^T X X^T C^T X B B^T \tag{123}$$
Chapter 3
Kronecker and Vec Operator
3.1 Flattening
Flattening is a common operation for programmers. In the language of numpy, we may
write np.ones((2,3,4)).reshape(2, 12) to flatten a shape (2,3,4) tensor into a shape
(2,12) matrix. Similarly, in mathematical notation, vec(X) is commonly used to denote
the flattening of a matrix into a vector.
Typically the main reason to do this is as a kludge for dealing with bad general notation
for tensors. Hence, with tensor diagrams, we can avoid this operation entirely. However,
it is still interesting to see how tensor diagrams can make a lot of properties of flattening
much more transparent.
To begin with we note that flattening is a linear operation, and hence can be represented
as a simple tensor. We'll use a triangle to denote this:

$$\triangleright_{i,j,k} = [i + jn = k].$$

Here n is the dimension of the i edge. Note we use a double line to denote the output of
the flattening operation. This is simply a syntactic choice to remind ourselves that the
output is a bundle of two edges.
Using this notation we can write

$$\mathrm{vec}(X)_k = \sum_{i,j} \triangleright_{i,j,k}\, X_{i,j}.$$
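The triangle tensor can be built explicitly and compared against a reshape; a numpy sketch (using the column-major convention k = i + jn from the definition above):

    import numpy as np

    n, m = 3, 4
    tri = np.zeros((n, m, n * m))
    for i in range(n):
        for j in range(m):
            tri[i, j, i + j * n] = 1.0           # the [i + jn = k] indicator

    X = np.random.rand(n, m)
    vec_X = np.einsum("ij,ijk->k", X, tri)       # contract X with the triangle
    assert np.allclose(vec_X, X.reshape(n * m, order="F"))  # vec stacks columns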
[Diagrams (3.1)–(3.4): basic identities for the flattening tensor.]
$$A \otimes (B + C) = A \otimes B + A \otimes C \tag{506}$$
$$A \otimes (B \otimes C) = (A \otimes B) \otimes C \tag{508}$$
$$aA \otimes bB = ab\,(A \otimes B) \tag{509}$$
$$(A \otimes B)^T = A^T \otimes B^T \tag{510}$$
$$(A \otimes B)(C \otimes D) = AC \otimes BD \tag{511}$$
$$(A \otimes I)(I \otimes B) = A \otimes B \tag{511b}$$
$$\mathrm{Tr}(A \otimes B) = \mathrm{Tr}(A)\,\mathrm{Tr}(B) \tag{515}$$

In each diagram, the Kronecker product is just the two matrices drawn in parallel with
their edges bundled by the flattening tensor.
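Two of these Kronecker identities checked numerically:

    import numpy as np

    A, B = np.random.rand(2, 3), np.random.rand(4, 5)
    C, D = np.random.rand(3, 2), np.random.rand(5, 4)
    assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))   # (511)
    M, N = np.random.rand(3, 3), np.random.rand(4, 4)
    assert np.isclose(np.trace(np.kron(M, N)), np.trace(M) * np.trace(N))      # (515)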
$$\mathrm{eig}(A \otimes B) = \mathrm{eig}(A)\,\mathrm{eig}(B) \tag{519}$$

Writing $A = Q_1 \Lambda_1 Q_1^{-1}$ and $B = Q_2 \Lambda_2 Q_2^{-1}$, the diagram for $A \otimes B$ factors as
$(Q_1 \otimes Q_2)(\Lambda_1 \otimes \Lambda_2)(Q_1^{-1} \otimes Q_2^{-1})$, so the eigenvalues of $A \otimes B$ are the
products $\lambda_i(A)\,\lambda_j(B)$.
At the start of the chapter we showed how to represent the vec-operator using the
flattening tensor. The Matrix Cookbook gives the following properties of the vec-operator:

$$\mathrm{vec}(A^T X B) = (A \otimes B)^T \mathrm{vec}(X) \tag{520}$$
$$\mathrm{Tr}(A^T B) = \mathrm{vec}(A)^T \mathrm{vec}(B) \tag{521}$$
$$\mathrm{vec}(A + B) = \mathrm{vec}(A) + \mathrm{vec}(B) \tag{522}$$
$$\mathrm{vec}(aA) = a\,\mathrm{vec}(A) \tag{523}$$
$$a^T X B X^T c = \mathrm{vec}(X)^T (B \otimes c a^T)\, \mathrm{vec}(X) \tag{524}$$
[diagram: an example network combining the tensors a, B, C, D, E, f, g, H with and without bundled edges]

Consider solving the matrix equation

$$AX + XB = C \tag{272}$$

for X.
We use the rewriting $\mathrm{vec}(AX + XB) = (I \otimes A + B^T \otimes I)\,\mathrm{vec}(X)$, which follows from
massaging the tensor diagram, after which we can take the normal matrix inverse to get
$\mathrm{vec}(X) = (I \otimes A + B^T \otimes I)^{-1}\,\mathrm{vec}(C)$. More generally,

$$\sum_n A_n X B_n = C \tag{274}$$
$$\mathrm{vec}(X) = \Big(\sum_n B_n^T \otimes A_n\Big)^{-1} \mathrm{vec}(C) \tag{275}$$
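A numpy sketch of solving the Sylvester equation this way (assuming the generic case where the flattened system is invertible; vec is column stacking, i.e. order="F"):

    import numpy as np

    n = 4
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    X_true = np.random.rand(n, n)
    C = A @ X_true + X_true @ B

    K = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(n))   # vec(AX + XB) = K vec(X)
    vec_X = np.linalg.solve(K, C.reshape(-1, order="F"))
    X = vec_X.reshape(n, n, order="F")
    assert np.allclose(X, X_true)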
$$x^T (A \circ B)\, y = \mathrm{tr}(A^T D_x B D_y)$$
$$(A \otimes B) \circ (C \otimes D) = (A \circ C) \otimes (B \circ D)$$

The first equation is simply massaging the tensor diagram. The second follows from
(3.3). Alternatively, it suffices to follow the double lines to see that A and C both use
the "upper" part of the double edge, while B and D use the lower part.
$$A * B = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} * \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
 = \begin{bmatrix} A_{11} B_{11} & A_{12} B_{12} \\ A_{11} B_{21} & A_{12} B_{22} \\ A_{21} B_{11} & A_{22} B_{12} \\ A_{21} B_{21} & A_{22} B_{22} \end{bmatrix},$$

$$A \bullet B = \begin{bmatrix} A_{11} B_{11} & A_{11} B_{12} & A_{12} B_{11} & A_{12} B_{12} \\ A_{21} B_{21} & A_{21} B_{22} & A_{22} B_{21} & A_{22} B_{22} \end{bmatrix}.$$

In terms of tensor diagrams, these products correspond simply to flattening the product
on one side, and using a copy tensor on the other:

[diagrams: A and B in parallel, sharing a copy tensor on one side and a flattening tensor on the other]

Clearly the two are identical up to transpose. Indeed, $(A * B)^T = B^T \bullet A^T$ and
$(A \bullet B)^T = B^T * A^T$.
There are multiple "mixed product" identities:

$$(A \bullet B)(C \otimes D) = (AC) \bullet (BD)$$
$$(Ax) \circ (By) = (A \bullet B)(x \otimes y)$$

TODO
3.8.1 Stacking
Can be part of Kronecker section
From Josh: Proposition 2.5. For any field $\mathbb{F}$, integers $d_1, d_2, d_3, d_4$ and matrices
$X_1 \in \mathbb{F}^{d_1\times d_2}$, $X_2 \in \mathbb{F}^{d_2\times d_3}$, $X_3 \in \mathbb{F}^{d_1\times d_4}$, and $X_4 \in \mathbb{F}^{d_4\times d_3}$, we have

$$X_1 X_2 + X_3 X_4 = (X_1 \mid X_3) \begin{pmatrix} X_2 \\ X_4 \end{pmatrix},$$

where we write '|' to denote matrix concatenation.
With tensor diagrams we can write stacking along a new axis i as

$$\mathrm{stack}_i(X, Y) = e^{(0)}_i \otimes X + e^{(1)}_i \otimes Y,$$

where $e^{(k)}_i = 1$ if $i = k$ and 0 elsewhere. From this we easily get the identity

$$(A \mid C)\begin{pmatrix} B \\ D \end{pmatrix} = \mathrm{stack}_i(A, C)\,\mathrm{stack}_i(B, D) \tag{3.11}$$
$$= \big(e^{(0)} \otimes A + e^{(1)} \otimes C\big)\big(e^{(0)} \otimes B + e^{(1)} \otimes D\big) \tag{3.12}$$
$$= AB + CD, \tag{3.14}$$

where the cross terms vanish because $e^{(0)}$ and $e^{(1)}$ are orthogonal.
3.9 Derivatives
3.10 Exercises
Exercise 10. Let $J = \mathrm{Tr}[(I_N \otimes F)^T A (I_N \otimes F) B]$ where $F \in \mathbb{R}^{N \times Nn}$,
$A \in \mathbb{R}^{N^2 \times N^2}$, $B \in \mathbb{R}^{N^2 n \times N^2 n}$. Find the derivative of J with respect to F.

Exercise 11. Consider $J = \|G - (B \otimes X)\|_F^2$, where G and B are matrices, and $\|\cdot\|_F$ is
the Frobenius norm. Find the derivative with respect to X.
Exercise 12. Prove the equation [8]:
Here diag(b) is a diagonal matrix with the vector b on the diagonal, 1 is a vector of ones
of the right size, and b ⊗ A, the Kronecker product for a vector and a matrix, is defined
by stacking b on top of A and flattening on one side.
Exercise 13. Let a and b be two vectors and let D and X be two matrices. Minimize
the following cost function with respect to X:
$$E = \|a - DXb\|_2^2.$$
Exercise 14. Prove using diagrams that Tr(AT B) = 1T vec(A ◦ B), where 1 is a vector
of the appropriate size.
Exercise 15. Find the derivative of
Tr(G(A ⊗ X))
with respect to X.
Exercise 16.
$$\frac{\partial}{\partial X}\, \mathrm{Tr}(X \otimes X) = \frac{\partial}{\partial X}\, \mathrm{Tr}(X)\,\mathrm{Tr}(X) = 2\,\mathrm{Tr}(X)\, I$$
Exercise 17. Take the derivative of
$$\mathrm{Tr}(G(A \otimes X))$$
with respect to X, where G is an $\mathbb{R}^{n^2 \times n^2}$ matrix, and A and X are $\mathbb{R}^{n\times n}$ matrices.
Write the result in terms of vec(A) and a reshaped version of G.
Exercise 18. Verify the following identity:
d
vec(diag(x) A diag(x)) = diag(vec(A))(I ⊗ x + x ⊗ I).
dx
Hint: Use the matrix-vector identities (3.5).
Exercise 19. Show that
$$\frac{\partial}{\partial x}\, x^T A (x \otimes x) = A(x \otimes x) + (x^T \otimes I + I \otimes x^T)\, A^T x$$
Hint: Use the matrix-vector identities (3.5).
Chapter 4
Functions
In this chapter we explore general functions from tensors into other tensors. While not
quite as elegant as linear functions, they are important for many applications, such as
non-linearities in neural networks. The main goal is to make the vector chain rule intuitive
and easy to apply, but we also cover Taylor series, determinants and other topics.
Standard notation for function over vectors can be quite ambiguous when the function
broadcasts over some dimensions. If x ∈ Rn is a vector, it’s not clear whether f (x) applies
to each element of x independently or if it’s a function of the whole vector. With tensor
diagrams, we make this distinction clear by explicitly showing the edges of x that go into
f , and which don’t. We use the notation m f n x when f is a function from Rn to
Rm . If f is element-wise, we write ( f x n ) ∈ Rn . We always assume that the function
(arrow) edges are contracted before normal edges. If we want something else, like f (M x),
we can use brackets: f ( M x ) . It may be helpful with some more examples:
f : R^n → R, f(x) ∈ R (scalar function)
g : R^n → R^m, g(x) ∈ R^m (vector function)
u : R^n → R^m, v ∈ R^m, v^T u(x) ∈ R (vector times vector function)
A : R^n → R^{m×n}, v ∈ R^n, A(x)v ∈ R^m (vector times matrix function)
f : R^d → R, X ∈ R^{b×d}, f(X) ∈ R^b (vector function, batched)
A : R^{n×m} → R, X ∈ R^{n×m}, A(X) ∈ R (matrix input function)
f : R^n × R^m → R^d, u ∈ R^{b×n}, v ∈ R^{b×m}, f(u, v) ∈ R^{b×d} (two inputs, batched)
To make it more concrete, let's consider some well known functions: (1) The determinant
function, det : R^{n×n} → R, is a scalar function, and we can apply it to a batch of matrices
by broadcasting over the batch edge.
For a more advanced example, consider the softmax function. We can write it using
elementary functions as

$$\mathrm{softmax}(x) = \frac{\exp(x)}{\mathrm{sum}(\exp(x))} = \exp(x) \cdot \mathrm{pow}_{-1}\Big(\sum_i \exp(x_i)\Big).$$

It looks a bit complicated at first, but let us break it down step by step: $\exp(x) \in \mathbb{R}^n$
is the element-wise exponential function. If we contract it with the all-ones vector, we get
the sum $s = \sum_i \exp(x_i)$. Finally, we apply $\mathrm{pow}_{-1}$ to s and multiply it with the
exponential function to get the softmax function.
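The same decomposition written out in numpy (a sketch; softmax here is a hypothetical helper, not library code):

    import numpy as np

    def softmax(x):
        e = np.exp(x)               # element-wise exp
        s = np.ones_like(x) @ e     # contraction with the all-ones vector = the sum
        return e * s ** -1          # multiply by pow_{-1}(s)

    x = np.random.rand(5)
    assert np.allclose(softmax(x), np.exp(x) / np.exp(x).sum())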
One situation where things may get a bit confusing is when the output tensor is contracted
with the broadcasting edges of an input tensor. But as long as we remember to contract
the function edges first, things work out. For example, for a function f : R^n → R and a
matrix x ∈ R^{m×n}, we can broadcast f over the m edge of x and then contract the resulting
vector with another vector v along that same edge.
4.1 The Chain Rule

In traditional notation, the chain rule is written $J_{f\circ v}(x) = \nabla f(v(x))\, J_v(x)$, where $J_v(x)$
is the Jacobian of v and $\nabla f(v(x))$ is the gradient of f at v(x). With tensor diagrams the
chain rule is actually a chain: the derivative of f ∘ v ∘ u at x is the diagram chaining the
derivative of f (evaluated at v(u(x))), the derivative of v (evaluated at u(x)), and the
derivative of u (evaluated at x), connected by the new edges created by differentiation.
This clearly follows the classical mnemonic: “the derivative of the outer function times the
derivative of the inner function”. The only refinement we need is that the “outer function”
is connected to the “inner function” using the new edge created by taking the derivative.
We could further simplify the circled f to f′, where f′ is the derivative of f. For example,
if f were cos, then f′ would be −sin. But we can also just keep the circled f as an easy
to recognize symbol for the derivative.
4.1.1 The Hessian Chain Rule

This "Hessian Chain Rule" is much easier to derive using tensor diagrams: differentiating
the chain-rule diagram for f ∘ v once more with the product rule gives two terms, one
containing the second derivative of f contracted with two copies of the Jacobian of v, and
one containing the gradient of f contracted with the second derivative of v. If we continue
to expand this way, we see there is a simple pattern in the chain rule for longer chains of
functions.

[diagrams: repeated differentiation of the chain f–v–u, producing one term for each placement of the second derivative]
Yaroslav Bulatov has written extensively on the Hessian Chain Rule, and how to evaluate
it efficiently for deep learning applications. Interested readers may refer to his blog post
on the topic.
4.2 Examples
4.2.1 Known derivatives
Two examples where we can expand f with known derivatives:
Here inv(X) is the matrix inverse function, taking a matrix X to its inverse X −1 . In the
case of the determinant, one may continue taking derivatives to find the simple pattern
$$\frac{\partial^k \det(X)}{\partial X^k} = \det(X)\,(X^{-1})^{\otimes k}.$$
[diagram: the product rule applied to A(x)x, giving the derivative of A contracted with x, plus A itself]

We may appreciate the simplicity of this expression when we consider the following
derivation given by Peyman Milanfar using classical notation:

$$\begin{aligned}
d[A(x)x] &= d[A(x)]\,x + A(x)\,dx \\
&= \mathrm{vec}\big(d[A(x)]\,x\big) + A(x)\,dx \\
&= \mathrm{vec}\big(I\, d[A(x)]\,x\big) + A(x)\,dx \\
&= (x^T \otimes I)\,\mathrm{vec}\big(d[A(x)]\big) + A(x)\,dx \\
&= (x^T \otimes I)\, D\,\mathrm{vec}[A(x)]\,dx + A(x)\,dx \\
&= \big[(x^T \otimes I)\, D\,\mathrm{vec}[A(x)] + A(x)\big]\,dx
\end{aligned}$$

which finally implies

$$\frac{\partial A(x)x}{\partial x} = (x^T \otimes I)\,\frac{\partial}{\partial x}\mathrm{vec}[A(x)] + A(x).$$
which is equivalent to cos(x) ◦ I. That is, cos(x) but zero everywhere except the diagonal.
The reason is that the Matrix Cookbook actually uses a slightly different definition of
"function applied to a matrix". If F can be written as a power series, then one way to define
F(X) is the matrix power series

$$F(X) = \sum_{k=0}^{\infty} a_k X^k.$$

In this case, the derivative of Tr(F(X)) is $f(X)^T$, where f(X) is the scalar derivative of
F(X), matching the Matrix Cookbook's formula.
4.2.4 Taylor

For an n-times differentiable function $v : \mathbb{R}^d \to \mathbb{R}^d$ we can write the Taylor expansion:

$$v(x + \varepsilon) \approx v(x) + v'(x)\,\varepsilon + \tfrac{1}{2}\, v''(x)(\varepsilon, \varepsilon) + \tfrac{1}{6}\, v'''(x)(\varepsilon, \varepsilon, \varepsilon) + \dots$$

where each derivative of v is contracted with the corresponding number of copies of ε.
4.3 Exercises
Exercise 20. Draw the tensor diagram for a function f : Rn → R that applies an element-
wise nonlinearity (for instance, exp) followed by a summation over the components. Verify
that the diagram corresponds to the conventional formula for the softmax denominator.
Exercise 21. Represent the composition of two functions
f : Rm → R and v : Rn → Rm ,
using tensor diagrams. Then, using the diagrammatic chain rule, derive the expression
for the derivative of f ◦ v with respect to x ∈ Rn .
Exercise 22. For a matrix function A(x) that depends on a vector x, use tensor diagrams
to illustrate the derivative
∂
[A(x)x],
∂x
and explain how the product rule is implemented in the diagram.
Exercise 23. Represent the KL-divergence term for a Variational Autoencoder (VAE) as

$$\mathrm{KL}(\mu, \sigma) = -\tfrac{1}{2}\big(1 + \log \sigma^2 - \mu^2 - \sigma^2\big),$$

with parameters given by

$$\mu = W x + c \quad\text{and}\quad \log \sigma^2 = W x + c.$$

Derive the gradient ∇_W KL with respect to the weight matrix W. Be sure to keep track
of dimensions and account for elementwise operations.
Exercise 26. Consider a Gaussian process with covariance matrix K(θ) and the log-marginal
likelihood defined as

$$L(\theta) = -\tfrac{1}{2}\big(y^T K(\theta)^{-1} y + \log \det K(\theta)\big).$$

Derive the gradient of L with respect to θ, showing that

$$\frac{\partial L}{\partial \theta} = \tfrac{1}{2}\,\mathrm{tr}\Big(\big(K^{-1} y y^T K^{-1} - K^{-1}\big)\,\frac{\partial K}{\partial \theta}\Big).$$

Exercise 27. In logistic regression, let

$$p_i = \sigma(w^T x_i) \quad\text{with}\quad \sigma(z) = \frac{1}{1 + e^{-z}},$$

and consider the negative log-likelihood

$$J(w) = -\sum_{i=1}^n \big[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \big].$$

Derive the gradient $\nabla_w J$ and the Hessian $H = \nabla_w^2 J$. In particular, show that

$$\nabla_w J = \sum_{i=1}^n (p_i - y_i)\, x_i, \qquad H = \sum_{i=1}^n p_i (1 - p_i)\, x_i x_i^T.$$
f (A) = aT A−1 b,
where a, b ∈ Rn are fixed vectors and A ∈ Rn×n is invertible. Using the result from
differentiating the inverse, show that
Exercise 32. Let X ∈ Rn×n be a symmetric matrix with a simple eigenvalue λ and
corresponding unit eigenvector v. Prove that the derivative of λ with respect to X is
given by
∇X λ = v v T .
Additionally, discuss why this result does not hold when λ has multiplicity greater than
one.
Exercise 33. For a symmetric matrix A ∈ Rn×n , consider the Rayleigh quotient
$$R(x) = \frac{x^T A x}{x^T x}, \qquad x \in \mathbb{R}^n \setminus \{0\}.$$
Using Lagrange multipliers, show that the stationary values of R(x) correspond to the
eigenvalues of A, and that the maximizer (minimizer) is the eigenvector associated with
the largest (smallest) eigenvalue.
Exercise 34. Let A ∈ Rm×m and B ∈ Rn×n be constant matrices, and let X ∈ Rm×n
be a variable matrix. Prove that
∇X Tr(AXB) = AT B T .
Exercise 36. Let s ∈ Rn be a vector and C ∈ Rn×n be a fixed matrix. Define the matrix
function
F (s) = diag(s) C diag(s),
where diag(s) denotes the diagonal matrix with the entries of s. Express the differential
dF in terms of s, ds, and C, and hence derive an expression for the derivative ∇s F (s).
Exercise 37. Let A ∈ Rp×m , X ∈ Rm×n , and B ∈ Rn×q . Prove the identity
vec(A X B T ) = (B ⊗ A) vec(X).
Discuss how this identity can be used to convert certain matrix derivative problems into
vectorized forms.
Exercise 38. Let
f (x) = xT Ax,
where x ∈ Rn and A ∈ Rn×n is not necessarily symmetric. Show that
∇x f (x) = (A + AT )x.
Then, compute the Hessian ∇2x f (x) and discuss what simplification occurs when A is
symmetric.
Exercise 39. Consider the Frobenius norm $\|X\|_F = \sqrt{\mathrm{Tr}(X^T X)}$, for $X \in \mathbb{R}^{m\times n}$.
Show that
$$\nabla_X \|X\|_F = \frac{X}{\|X\|_F}$$
for $X \neq 0$.
Exercise 40. Let Y ∈ Rn×n be a symmetric matrix, and define ϕ(Y ) to be its largest
eigenvalue. If this eigenvalue has multiplicity k > 1, show that the derivative of ϕ(Y ) is
not unique. In particular, demonstrate that any matrix of the form
$$G = \sum_{i=1}^{k} u_i v_i^T,$$
i=1
where {ui }ki=1 is any orthonormal basis for the eigenspace corresponding to the largest
eigenvalue, is a valid subgradient of ϕ(Y ). Explain the challenges that arise in defining a
unique gradient in this setting.
Chapter 5
Statistics and Probability
$$m = \mathrm{E}[x] \quad\text{and}\quad M = \mathrm{E}[(x \ominus m)(x \ominus m)].$$

We will use the circled minus, ⊖, to distinguish the operation from contraction edges.
We can also define the third and fourth centralized moment tensors

$$M_3 = \mathrm{E}\big[(x \ominus m)^{\otimes 3}\big] \quad\text{and}\quad M_4 = \mathrm{E}\big[(x \ominus m)^{\otimes 4}\big].$$
[diagram: the expectation of a network containing three copies of X and the constants A, B, C, D, E; by linearity, the expectation acts only on the X's]

where $M_3$ is the expectation¹ $\mathrm{E}[X \otimes X \otimes X]$, which is an order-9 tensor with no dependence
on the constants A, B, C and D. In practice you would want to name the edges to keep
track of what gets multiplied with what.

1 FIXME: This is different from the notation just introduced above.
We can even use linearity of expectation to push the expectation inside an infinite
sum of tensors, as in the following moment generating function, which relates all the $M_k$
tensors:

$$\mathrm{E}\big(e^{\langle x \ominus m,\, t\rangle}\big)
 = \sum_{k=0}^{\infty} \frac{1}{k!}\, \mathrm{E}\big[\langle x \ominus m, t\rangle^k\big]
 = \sum_{k=0}^{\infty} \frac{1}{k!}\, \mathrm{E}\,\langle (x \ominus m)^{\otimes k}, t^{\otimes k}\rangle
 = \sum_{k=0}^{\infty} \frac{1}{k!}\, \langle M_k, t^{\otimes k}\rangle$$
$$= 1 + \langle m, t\rangle + \tfrac{1}{2}\langle M, t^{\otimes 2}\rangle + \tfrac{1}{6}\langle M_3, t^{\otimes 3}\rangle + \tfrac{1}{24}\langle M_4, t^{\otimes 4}\rangle + \dots$$

$$\mathrm{E}[AXB + C] = A\,\mathrm{E}[X]\,B + C \tag{312}$$
This makes it easy to handle the quadratic forms from the Matrix Cookbook:

$$\mathrm{Var}[Ax] = A\,\mathrm{Var}[x]\,A^T = A\, M_2\, A^T \tag{313}$$
$$\mathrm{E}[x^T A x] = \mathrm{Tr}(A M) + m^T A m \tag{318}$$
$$\mathrm{E}[x^{\otimes 3}]
 = \mathrm{E}\big[(x \ominus m)^{\otimes 3}\big] + 3\,\mathrm{E}\big[(x \ominus m)^{\otimes 2} \otimes m\big]
 + 3\,\mathrm{E}\big[(x \ominus m) \otimes m^{\otimes 2}\big] + m^{\otimes 3}
 = M_3 + 3\, M \otimes m + m^{\otimes 3}.$$

TODO: The edges from the m, M term need to be symmetrized.
But this is still a bit of a mess. See also below on Cumulants.
Assume x to be a stochastic vector with independent coordinates, mean m, covariance
M and central moments v3 = E[(x − m)3 ]. Then (see [7])
+ (Tr(M ) + mT m)m
5.2 Cumulants
Given a random vector $x \in \mathbb{R}^d$, its nth cumulant tensor, $K_n \in \mathbb{R}^{d\times\cdots\times d}$, is defined by

$$\log \mathrm{E}\big(e^{\langle t, x\rangle}\big) = \sum_{n=1}^{\infty} \frac{1}{n!}\, \langle K_n, t^{\otimes n}\rangle
 = \langle K_1, t\rangle + \tfrac{1}{2}\langle K_2, t^{\otimes 2}\rangle + \tfrac{1}{6}\langle K_3, t^{\otimes 3}\rangle + \dots$$

The first couple of cumulants are similar to the central moments:

$$K_1 = m, \qquad K_2 = M, \qquad K_3 = M_3, \qquad K_4 = M_4 - \big(M_2 \otimes M_2 \text{ summed over its three pairings}\big).$$
They have the nice property (which is easy to see from the definition) that they are
additive for independent random variables: Kn (x + y) = Kn (x) + Kn (y). This generalizes
the standard property that the variance of the sum of independent random variables is
the sum of the variances.
We can write the expectations of $x^{\otimes n}$ in terms of the cumulants; for example

$$\mathrm{E}[x^{\otimes 3}] = K_3 + \big(K_2 \otimes K_1 \text{ in its three pairings}\big) + K_1^{\otimes 3}.$$

In general the sum is over all the partitions of the set $\{1, \dots, n\}$.
If the entries of x are independent, the off-diagonals of the cumulant tensors $K_1, K_2, \dots$
are zero. Assume each $x_i$ has cumulants $\kappa_1, \kappa_2, \kappa_3, \kappa_4 \in \mathbb{R}$; then $\mathrm{E}[x^{\otimes 4}]$ expands into
a sum over partition diagrams with coefficients $\kappa_4$, $\kappa_3\kappa_1$, $\kappa_2^2$, $\kappa_2\kappa_1^2$ and $\kappa_1^4$.

Note in particular that if the mean, $\kappa_1$, is zero, only four terms survive. For n = 5
there are 52 partitions in total, but only 11 survive if $\kappa_1 = 0$. If x is Gaussian, all
cumulants of order 3 and higher are zero. For n = 6 there are 203 partitions in total, but
only 15 terms² for $\mathrm{E}[x^{\otimes 6}]$.

2 This is why $\mathrm{E}[g^6] = 15$ for $g \sim N(0, 1)$.
$$\mathrm{Var}[x^T A x] = \mathrm{E}\big[(x^T A x)^2\big] - \mathrm{E}[x^T A x]^2,$$

which in terms of the per-coordinate cumulants expands into terms weighted by $\kappa_4$, $\kappa_3\kappa_1$
and $\kappa_2^2$, recovering the Matrix Cookbook's

$$\mathrm{Var}[x^T A x] = 2\mu_2^2\,\mathrm{Tr}(A^2) + 4\mu_2\, c^T A^2 c + 4\mu_3\, c^T A a + (\mu_4 - 3\mu_2^2)\, a^T a \tag{319}$$
$$\mathrm{E}[y] = m = w^T \mu \tag{321}$$
$$\mathrm{E}[(y \ominus m)^2] = \langle M_2, w^{\otimes 2}\rangle \tag{322}$$
$$\mathrm{E}[(y \ominus m)^3] = \langle M_3, w^{\otimes 3}\rangle \tag{323}$$
$$\mathrm{E}[(y \ominus m)^4] = \langle M_4, w^{\otimes 4}\rangle \tag{324}$$

For $x \sim N(0, 1)$, we have the inequality

$$\mathrm{E}[y^n]^{1/n} \le \sqrt{2/\pi}\, \|w\|_2.$$
For instance:

$$\mathrm{E}\big[(x \ominus m)^{\otimes 4}\big] = M \otimes M + M \otimes M + \dots$$

(summing over the different pairings of indices).

[diagram: pushing an expectation through a function f(X) of X]

Combined with the tensor chain rule from chapter 4, this can be a very powerful way to
evaluate many hard expectations.
+ Tr(Σ)(Σ + mmT )
See [7].
$$\mathrm{E}[x] = \sum_k \rho_k\, m_k \tag{384}$$
$$\mathrm{Cov}(x) = \sum_k \sum_{k'} \rho_k \rho_{k'} \big(\Sigma_k + m_k m_k^T - m_k m_{k'}^T\big) \tag{385}$$
5.4.5 Derivatives
Derivatives of moments with respect to m or M can be found by differentiating under
the integral sign, and using Stein’s lemma for Gaussian cases.
5.5 Exercises
Exercise 41.
$$\begin{aligned}
\mathrm{E}\big[(Ax + a)\, b^T (Cx + c)(Dx + d)^T\big] ={}& (Am + a)\, b^T \big(C M D^T + (Cm + c)(Dm + d)^T\big) \\
&+ \big(A M C^T + (Am + a)(Cm + c)^T\big)\, b\, (Dm + d)^T \\
&+ b^T (Cm + c)\,\big(A M D^T - (Am + a)(Dm + d)^T\big)
\end{aligned}$$
Exercise 42. Find more identities in the Matrix Reference Manual and try to prove
them. Also try to verify your derivations using tensorgrad.
Chapter 6
Determinant and Inverses
6.1 Determinant
It's convenient to write the determinant in tensor notation as

$$\det(A) = \frac{1}{n!}\, \varepsilon_{i_1,\dots,i_n}\, A_{i_1 j_1} \cdots A_{i_n j_n}\, \varepsilon_{j_1,\dots,j_n},$$

where $\varepsilon_{i_1,\dots,i_n}$ is the rank-n Levi-Civita tensor defined by

$$\varepsilon_{i_1,\dots,i_n} = \begin{cases} \mathrm{sign}(\sigma) & \text{if } \sigma = (i_1, \dots, i_n) \text{ is a permutation} \\ 0 & \text{otherwise.} \end{cases}$$
$$\det(A) = \prod_i \lambda_i \tag{18}$$
$$\det(AB) = \det(A)\det(B) \tag{21}$$
6.2 Inverses
Might be reduced, unless cofactor matrices have a nice representation?
Chapter 7
Advanced Derivatives
$$\frac{d}{dx}\, \|x - a\|_2 = \frac{x - a}{\|x - a\|_2} \tag{7.1}$$
$$\frac{d}{dx}\, \frac{x - a}{\|x - a\|_2} = \frac{I}{\|x - a\|_2} - \frac{(x - a)(x - a)^T}{\|x - a\|_2^3} \tag{7.2}$$
$$\frac{d}{dx}\, \|x\|_2^2 = \frac{d}{dx}\, \|x^T x\|_2 = 2x \tag{7.3}$$
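A quick autograd check of (7.1):

    import torch

    a = torch.randn(5)
    x = torch.randn(5, requires_grad=True)
    torch.linalg.norm(x - a).backward()
    assert torch.allclose(x.grad, (x - a) / torch.linalg.norm(x - a))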
with respect to Σ.
Chapter 8
Special Matrices
Hadamard

The Hadamard matrix is defined as $H_{2^n} = H_2^{\otimes n} = H_2 \otimes \cdots \otimes H_2$ where
$H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$. For example

$$H_4 = H_2 \otimes H_2 = \begin{bmatrix} H_2 & H_2 \\ H_2 & -H_2 \end{bmatrix}
 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}.$$

This gives a very simple tensor diagram representation: n copies of $H_2$ in parallel, with
their edges bundled into a single input and a single output.
The Fast Hadamard Transform (FHT) is usually described recursively by

$$H_{2^n} x = \begin{bmatrix} H_{2^{n-1}} & H_{2^{n-1}} \\ H_{2^{n-1}} & -H_{2^{n-1}} \end{bmatrix}
 \begin{bmatrix} x^{(1)} \\ x^{(2)} \end{bmatrix},$$

where $x^{(1)}$ and $x^{(2)}$ are the first and second half of x. Because of the redundancy in the
matrix multiplication (it only depends on $H_{2^{n-1}} x^{(1)}$ and $H_{2^{n-1}} x^{(2)}$), the algorithm
computes $H_N x$ in $O(N \log N)$ time.

Alternatively we could just use the general fact, as described above, with $a_i = 2$ for
all i. Then the "fast Kronecker multiplication" method takes time
$(a_1 a_2 \cdots a_n)(a_1 + a_2 + \cdots + a_n) = 2^n \cdot 2n = 2N \log_2 N$.
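A numpy sketch of the fast Kronecker multiplication idea for the Hadamard case: reshape x into an order-n tensor of shape (2, ..., 2) and apply H_2 along each axis (fht here is a hypothetical helper, not library code):

    import numpy as np

    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])

    def fht(x):
        n = int(np.log2(x.size))
        y = x.reshape((2,) * n)
        for axis in range(n):
            y = np.tensordot(H2, y, axes=([1], [axis]))  # contract H2 into one axis
            y = np.moveaxis(y, 0, axis)                  # move the new axis back in place
        return y.reshape(-1)

    x = np.random.rand(8)
    H8 = np.kron(np.kron(H2, H2), H2)
    assert np.allclose(fht(x), H8 @ x)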
Fourier

The Discrete Fourier Matrix is defined by $(F_N)_{i,j} = \omega^{ij}$, where $\omega = e^{-2\pi i/N}$:

$$F_N = \begin{bmatrix}
1 & 1 & 1 & 1 & \cdots & 1 \\
1 & \omega & \omega^2 & \omega^3 & \cdots & \omega^{N-1} \\
1 & \omega^2 & \omega^4 & \omega^6 & \cdots & \omega^{2(N-1)} \\
1 & \omega^3 & \omega^6 & \omega^9 & \cdots & \omega^{3(N-1)} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \omega^{N-1} & \omega^{2(N-1)} & \omega^{3(N-1)} & \cdots & \omega^{(N-1)(N-1)}
\end{bmatrix}$$

Equivalently, $F_N$ is exp applied elementwise to $-2\pi i/N$ times the outer product
$\mathrm{arange}(N)\,\mathrm{arange}(N)^T$.
TODO: Show how the matrix can be written with the function notation.
$$F_N = P_1 \big(F_{p_1^{i_1}} \otimes F_{p_2^{i_2}} \otimes \cdots \otimes F_{p_n^{i_n}}\big)\, P_2,$$

where $N = p_1^{i_1} p_2^{i_2} \cdots p_n^{i_n}$ is the prime factorisation of N, and $P_1$ and $P_2$ are some
permutation matrices.

Using fast Kronecker multiplication, this takes $(p_1^{i_1} + \cdots + p_n^{i_n})\, N$ time.
By padding x with zeros, we can increase N by a constant factor to get a string of
$n = O(\log(N)/\log\log(N))$ primes, the sum of which is $\sim n^2/\log n = O(\log(N)^2)$. The
complete algorithm thus takes time $O(N \log(N)^2)$. Next we will see how to reduce this
to $O(N \log N)$.
The classical Cooley–Tukey FFT algorithm uses a recursion:

$$F_N = \begin{bmatrix} I & I \\ I & -I \end{bmatrix}
 \begin{bmatrix} I & 0 \\ 0 & D_{N/2} \end{bmatrix}
 \begin{bmatrix} F_{N/2} & 0 \\ 0 & F_{N/2} \end{bmatrix}
 \begin{bmatrix} \text{even-odd} \\ \text{permutation} \end{bmatrix},$$

where $D_N = \mathrm{diag}(1, \omega_N, \omega_N^2, \dots)$. The even-odd permutation moves all the even values
to the start. If we reshape $I_{2^n}$ as $I_2 \otimes \cdots \otimes I_2$, this permutation is just a rotation of
the axes, or in pytorch: x.permute([3,0,1,2]). Also note that
$\begin{bmatrix} I & I \\ I & -I \end{bmatrix} = H_2 \otimes I$ and
$\begin{bmatrix} F_{N/2} & 0 \\ 0 & F_{N/2} \end{bmatrix} = I_2 \otimes F_{N/2}$.
So we can write the whole FFT in tensor diagram notation as a cascade of $H_2$ factors and
diagonal twiddle matrices $D_{N/2^k}$.

Since one can multiply with the permutation and diagonal matrices in linear time, the
$O(N \log N)$ time complexity follows from the same argument as for Hadamard.
Note there are a bunch of symmetries, such as by transposing (horizontal flip), since
the matrix is symmetric. Or by pushing the permutation to the left side.
We don't have to split the matrix in half, we can also split it in thirds, fourths, etc.
With this generalized Cooley–Tukey algorithm, we get a diagram with one block $F_{n_i}$
for each factor of $N = n_0 n_1 n_2 n_3$, connected by twiddle factors $F^{(n_i, n_j)}$. We can use
the property $F^{a,bc} = F^{a,b} \bullet (F^{a,c})^{1/b}$ to simplify the diagram further, so that only
pairwise twiddle factors $F^{n_i, n_j}$ remain. In the simple case where we split in 2 every
time, this is also called the "Quantum FFT" algorithm.
We hid some stuff above, namely that the matrices should be divided by different Ns.
Note that this figure may look different from some FFT ("butterfly") diagrams you have
seen: these typically have $2^n$ rows, while the tensor diagram only has n rows (or $\log_2 N$).
Chapter 9
Decompositions
[Strassen's rank-7 decomposition of the $(2,2,2)$ matrix multiplication tensor: the factor matrices W, $S_A$ and $S_B$.]

To multiply two matrices, A and B, faster than the normal $n^3$ time, we reshape them as
block matrices, of shape $(2, \tfrac{n}{2}, 2, \tfrac{n}{2})$, and use Strassen's tensor. Contracting the edges
in the right order uses only $\tfrac{7}{8} n^3 + O(n^2)$ operations.

If we instead reshape to $(2, 2, \dots, 2)$ and use Strassen's tensor along each axis, the work
is reduced by $(7/8)^{\log_2(n)}$, giving us matrix multiplication in time $n^{3 + \log_2(7/8)} = n^{2.80735}$.
Other

If we instead wrote A and B using (n, m) and (m, p) shaped blocks, we could factor
$I_n \otimes I_m \otimes I_p$ and get a matrix multiplication algorithm using the same approach as the
Strassen (2, 2, 2) tensor above. Lots of papers have investigated this problem, which has
led to the best algorithms by Josh Alman and others. For example, DeepMind found a
rank 47 factorization of $I_3 \otimes I_4 \otimes I_5$.

Maybe a more interesting example is the (4, 4, 4) tensor, for which they find a rank
47 factorization. An easy way to create a rank 49 factorization is to take Strassen and
double it. Would this be a nice thing to show? Maybe too messy? Well, actually their
rank 47 construction only works in the "modular" case. Then (3, 4, 5) is general.
Chapter 10
Chapter 11
Tensor Algorithms
Note that we also write the sizes of the respective dimensions, i.e., A is a 2 × 2 × 4-
tensor, while B is a 4 × 2-matrix. Now, the contraction cost is defined as the number of
FLOPS performed during the contraction; as a convention, this is the number of scalar
multiplications.1 In our example, the contraction cost is 2 × 2 × 4 × 2 = 32, i.e., we simply
multiply the dimension sizes.
The previous example was rather small. However, tensor networks in the wild tend to
have hundreds of tensors. Naturally, these also need to be contracted to a single tensor.
Now comes the interesting part: The order in which we perform these contraction can
have a tremendous impact on the execution time. To get an intuition for this, consider
an extended example of a 3-tensor diagram:

[diagram: a chain A–B–C with free edges of size 2 and internal edges of sizes 4 and 1]

If we were to contract A with B first (and the resulting tensor with C), we would have a
cost of $2^2 \times 4 \times 2 + 2^3 \times 1 \times 2^2 = 32 + 32 = 64$ multiplications, whereas performing the
contraction between B and C at the beginning would result in $2^3 \times 1 \times 2^2 + 2^2 \times 4 \times 2^3 =
32 + 128 = 160$ multiplications in total. Hence, the first order is much better.
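Numpy's einsum_path reports the cost of different contraction orders; a sketch with an illustrative chain (shapes chosen arbitrarily, not the ones in the figure):

    import numpy as np

    A = np.random.rand(8, 4)      # edges i, j
    B = np.random.rand(4, 2)      # edges j, k
    C = np.random.rand(2, 100)    # edges k, l
    path, report = np.einsum_path("ij,jk,kl->il", A, B, C, optimize="optimal")
    print(report)                 # includes the chosen order and its FLOP count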
It is thus natural to ask: Can we always find the optimal contraction order?
1 This assumes a naive implementation of the operation, i.e., nested loops over the dimensions.
11.1.1 Algorithms
We summarize well-known results and frameworks to find good contraction orders.
Optimal Algorithms
Indeed, finding the optimal contraction order for arbitrary tensor network shapes is pretty
hard; better said, NP-hard [3]. There is a well-known exact algorithm running in $O(3^n)$
time [17], where n is the number of tensors in the network. This finds the optimal
contraction order with respect to the total contraction cost. If one is interested in minimizing
the size of the largest intermediate tensor, i.e., to optimize for the memory used
during the execution, this can be done faster, in $O(2^n n^3)$ time [22].
The good news is that for some restricted shapes of tensor networks, there are indeed
efficient algorithms. A classic example is the dynamic programming solution for the
matrix-chain problem [4], which is just our problem, but only for matrices. The naive
algorithm runs in $O(n^3)$ time, but can be implemented in $O(n^2)$ time [26] (or even in
$O(n \log n)$ time [11, 12]). Another shape for which polynomial-time algorithms exist is
that of tree tensor networks [25, 23].
Another prominent way to optimize contraction orders is via the tree decomposition of
the line graph representation of the tensor network [15, 5, 19]. In particular, this results
in a contraction order with a maximal intermediate tensor rank equal to the treewidth
of the tree decomposition. Loosely speaking, treewidth measures how tree-like a graph
is. This does not directly solve our problem since finding the tree decompositions of
the smallest treewidth is itself hard [14]. Two well-known frameworks to find good tree
decompositions are QuickBB [6] and FlowCutter [24, 9].
Best-Effort Algorithms
However, once we want to contract arbitrary network shapes, the best we can do is to fall
back on heuristics or approximations. Two well-known frameworks are opt_einsum [20]
and cotengra [7], which aim to optimize the contraction order (also referred to as “con-
traction path”) of arbitrary einsums: For tensor networks where the optimal algorithm
would be too slow, opt_einsum applies an ad-hoc greedy algorithm, while cotengra uses
hypergraph partitioning [18], along with a Bayesian optimization approach, which has
been later refined [21]. Other algorithms adopted from database query optimization are
implemented in netzwerk [23]. Another option is to learn the best contraction orders,
e.g., using reinforcement learning [16].
Naturally, the above heuristics do not come with an optimality guarantee. There
exists a $(1 + \varepsilon)$-approximation $O^*(2^n/\varepsilon)$-time algorithm that minimizes the sum of the
intermediate tensor sizes [22].
Worth mentioning is the SQL view of einsum [2]: it allows performing the entire tensor
network contraction on any database engine. Indeed, for some einsum classes, running
the contractions on state-of-the-art database engines, such as Hyper [13], can be much
faster than using the de-facto numpy-array implementation.
Chapter 12
Tensorgrad
Implementation details
12.1 Isomorphisms
There is actually a concept of “tensor isomorphism”, but it’s basically just the same as
graph isomorphism.
We need to understand isomorphisms in many different parts of the code.
12.1.1 In Products
- Cancelling or combining equal parts of a product: this is actually extra hard, because
you have to collect a subset of nodes that constitute isomorphic subgraphs. Right now
we hack this a bit by just considering separate components of the product.
[diagram: two products containing isomorphic sub-networks built from S, T, U, V, G and pow(2) nodes, which should be recognized and combined]
The problem is probably NP-hard, but it might still have an algorithm that's faster
than the $2^n$ of trying all subsets. In particular, we might modify the VF2 algorithm, which
iteratively tries to match nodes in G1 and G2. The NetworkX library already has a
GraphMatcher, which searches for isomorphic subgraphs. It might be extendable to our
problem... But honestly I don't know if we even want to solve this problem in the most
general setting, since it corresponds a bit to factoring the graph. And we don't do factoring,
just as we don't do inverse distribution.

In either case, it's clear that we need to be able to compare nodes and edges for
isomorphism.

Also, the basic use case of isomorphism canonicalization in products is simply to compute
the canonical product itself from its parts. Part of our approach here is taking the outer
edges and turning them into nodes, so they can be colored.
12.1.2 In Sums
When deciding whether A + B is equal to 2A we need to check if A and B are isomorphic.
But we also need to do this under the current renaming of the edges. That’s why you
can't just transform A + A^T into 2A.
The way it actually works in my code is:
def key_fn(t: Tensor):
    # Align tensor edges to have the same order, using Sum's order as reference.
    canons = [t.canonical_edge_names[t.edges.index(e)] for e in self.edges]
    return hash((t.canon,) + tuple(canons))

ws_tensors = TensorDict(key_fn=key_fn, default_fn=int)
for w, t in zip(weights, tensors):
    ws_tensors[t] += w
ws_tensors = [(w, t) for t, w in ws_tensors.items()]
which says that the hash is the canonical form of the tensor, plus the canonical form of
the edges in the order of the edges in the sum. These are basically the orbits, meaning
that if the summed tensor has a symmetry, we are allowed to "flip" it to make the
summands isomorphic.
In the “compute canonical” method, we do more or less the same, but we also include
the weights.
def _compute_canonical(self):
    hashes = []
    for e in self.edges:
        canons = [t.canonical_edge_names[t.edges.index(e)] for t in self.tensors]
        hashes.append(hash(("Sum",) + tuple(sorted(zip(self.weights, canons)))))
In the future we want to use symmetry groups instead. What would be the symmetry
group of a sum? It’s the diagonal of the product of the symmetry groups of the summands.
How can we find the generators of this group? Maybe we should just construct some joint
graph and then find the automorphisms of that graph.
Alternatively we can use sympy. It is not known whether this problem is solvable
in polynomial time. I think Babai proved that it is quasi-polynomial but not with a
practical algorithm. Incidentally the problems of intersections of subgroups, centralizers
of elements, and stabilizers of subsets of {1, . . . , n} have been proved (by Eugene Luks)
to be polynomially equivalent.
Actually making a graph and using nauty is a really good idea, since it would be able
to detect that A + AT is symmetric. Just taking the intersection of the automorphism
groups of the summands would not find that.
Another option is to convert the sum to a function... But no, that’s weird. That
would require me to support functions with arbitrary numbers of inputs, which is not
currently the case.
12.1.3 In Evaluation
When evaluating a tensor, we can look at the graph of the tensor and see if it’s isomorphic
to a previously evaluated tensor. This is an example where we don’t really need a canonical
form, but an approximate hash plus vf2 would be fine. Also note that in this case we
don’t care about the edge renaming, because we can just rename the edges before we
return the tensor. E.g. if we have already evaluated A, we can use that to get AT easily.
12.1.4 In Variables
In variables we include the name of the variable in the hash. Basically we assume that
variables named the same refer to the same data.
base = hash(("Variable", self.name))
return base, [hash((base, e)) for e in self.original_edges]
For the original canonical edge names, we use the edge names before renaming. This
means that, in the case of A^T, it will have the same hash as A. But because it's renamed,
the t.edges.index call in the Sum will flip the edges.
We could imagine variables taking an automorphism group as an argument, which
would allow us to define variables with different symmetries. Such as a symmetric matrix
A where A + AT is actually 2A.
12.1.5 In Constants
When computing the canonical form of a constant, like Zero or Copy we don’t care about
the edge names. I guess because the constants we use are all maximally symmetric? We
currently include the constants tag, which is the hash of the variable that it came from,
if any.
12.1.6 In Functions
One issue is that while the original names are usually part of the function definition,
the new edges added by differentiation are often automatically generated based on the
context, so they shouldn’t really be part of the canonical form.
In contrast to Sum, we don’t sort the canons here, since the order of the inputs
matters.
Maybe functions should be allowed to transform the symmetry group? E.g. if we have
a function that takes a symmetric matrix and returns a symmetric matrix, we should be
able to use the symmetry group of the input to simplify the output.
12.1.7 In Derivatives
All we do is hash the tensor and the wrt argument, and then add new edges for the derivative.
12.1.8 Other
For some tensors there might be edge dimension relations that aren’t equivalences. For
example, a flatten tensor would have the “out” edge dimension equal to the product of
the “in” edge dimensions.
In a previous version I had every tensor register a “callback” function. Whenever
an edge dimension “became available”, the tensor would get a chance to emit new edge
dimensions. However, this was a lot more work for each tensor to implement, and not
needed for any of the existing tensors.
12.2 Renaming
This is an important part of the code.
12.3 Evaluation
An important part of evaluation is determining the dimension of each edge. To do this, I'm
basically creating a full graph of the tensor, using a function called edge_equivalences
which returns a list of tuples $((t_1, e_1), (t_2, e_2))$, indicating that edge $e_1$ of tensor $t_1$ is
equivalent to edge $e_2$ of tensor $t_2$. Note that the same edge name can appear multiple
times in the graph, so we need to keep track of the tensor as well.
For variables, since the user gives edge dimensions in terms of variables, it’s important
to keep track of renamed edge names:
for e1, e2 in zip(self.original_edges, self.edges):
    yield (self, e1), (self, e2)
For constants, there might be some equivalences based on tensors that the constant
was derived from.
def edge_equivalences(self):
    if self.link is not None:
        yield from self.link.edge_equivalences()
        for e in self.link.edges:
            if e in self.edges:
                yield (self, e), (self.link, e)
For functions we can’t really say anything about the edges of the function itself
(self.edges_out), but at least we can say something about the broadcasted edges.
for t, *inner_edges in self.inputs:
    yield from t.edge_equivalences()
    for e in t.edges:
        if e not in inner_edges:
            yield (t, e), (self, e)
We could maybe also say that input edges with the same name are equivalent?
For products, we look at each edge (t1, e, t2) and yield ((t1, e), (t2, e)). However, the
free edges (t, e) are matched with the product itself, yielding ((t, e), (self, e)).
def edge_equivalences(self):
    pairs = defaultdict(list)
    for t in self.tensors:
        yield from t.edge_equivalences()
        for e in t.edges:
            pairs[e].append(t)
    for e, ts in pairs.items():
        if len(ts) == 1:
            yield (self, e), (ts[0], e)
        else:
            t1, t2 = ts
            yield (t1, e), (t2, e)
Finally, we use BFS to propagate the edge dimensions from the variables (which are
given by the user) to the rest of the graph.
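A simplified sketch of this propagation step, with assumed function and argument names:

from collections import defaultdict, deque

def propagate_dims(equivalences, known_dims):
    # equivalences: pairs ((t1, e1), (t2, e2)) as yielded by edge_equivalences.
    # known_dims: {(variable, edge): size} supplied by the user.
    neighbors = defaultdict(list)
    for a, b in equivalences:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dims = dict(known_dims)
    queue = deque(known_dims)
    while queue:
        node = queue.popleft()
        for other in neighbors[node]:
            if other not in dims:
                dims[other] = dims[node]  # flood the size across the equivalence
                queue.append(other)
    return dims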
Why is it even necessary for non-variables to know the edge dimensions? Mostly
because of copy tensors, which we use for hyperedges and have to construct explicitly. Could
we get rid of this if we computed hyperedges more efficiently, without copy tensors? There are
also sometimes "detached" copies...
Also, an alternative idea would be to actually construct the full graph. I originally
didn't think this would be possible because of the Sums, which aren't really graphs. But
maybe with the new approach of using nauty, we could actually do this.
12.3.1 Products
We simply evaluate the tensors in the product and give them to einsum.
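Stylized, this could look as follows (attribute names are assumptions; torch.einsum stands in for whichever backend is used, and mapping every edge to a single letter limits the sketch to 26 edges):

import torch

def evaluate_product(product, values):
    # values[t] is the already-evaluated torch tensor for each inner tensor t.
    edge_names = sorted({e for t in product.tensors for e in t.edges})
    letters = {e: chr(ord("a") + i) for i, e in enumerate(edge_names)}
    subscripts = ",".join("".join(letters[e] for e in t.edges) for t in product.tensors)
    output = "".join(letters[e] for e in product.edges)  # the product's free edges
    return torch.einsum(f"{subscripts}->{output}", *[values[t] for t in product.tensors])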
Bibliography
[2] Mark Blacher, Julien Klaus, Christoph Staudt, Sören Laue, Viktor Leis, and Joachim
Giesen. Efficient and portable einstein summation in SQL. Proc. ACM Manag. Data,
1(2):121:1–121:19, 2023.
[3] Lam Chi-Chung, P Sadayappan, and Rephael Wenger. On optimizing a class of
multi-dimensional loops with reduction for parallel execution. Parallel Processing
Letters, 7(02):157–168, 1997.
[4] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.
[5] Jeffrey M Dudek, Leonardo Duenas-Osorio, and Moshe Y Vardi. Efficient contraction
of large tensor networks for weighted model counting through graph decompositions.
arXiv preprint arXiv:1908.04381, 2019.
[6] Vibhav Gogate and Rina Dechter. A complete anytime algorithm for treewidth. In
David Maxwell Chickering and Joseph Y. Halpern, editors, UAI ’04, Proceedings of
the 20th Conference in Uncertainty in Artificial Intelligence, Banff, Canada, July
7-11, 2004, pages 201–208. AUAI Press, 2004.
[7] Johnnie Gray and Stefanos Kourtis. Hyper-optimized tensor network contraction.
Quantum, 5:410, 2021.
[8] greg (https://math.stackexchange.com/users/357854/greg). Proving that
$\operatorname{vec}(a \operatorname{diag}(b)\, c) = ((c^T \otimes 1_a) \odot (1_c \otimes a))\, b$.
Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/2993406 (version: 2018-11-11).
[9] Michael Hamann and Ben Strasser. Graph bisection with pareto optimization. Jour-
nal of Experimental Algorithmics (JEA), 23:1–34, 2018.
[10] William Rowan Hamilton. Lectures on Quaternions. Hodges and Smith, 1853.
[13] Alfons Kemper and Thomas Neumann. HyPer: A hybrid OLTP&OLAP main memory
database system based on virtual memory snapshots. In 2011 IEEE 27th Interna-
tional Conference on Data Engineering, pages 195–206. IEEE, 2011.
[14] Tuukka Korhonen. A single-exponential time 2-approximation algorithm for
treewidth. SIAM Journal on Computing, (0):FOCS21–174, 2023.
[15] Igor L Markov and Yaoyun Shi. Simulating quantum computation by contracting
tensor networks. SIAM Journal on Computing, 38(3):963–981, 2008.
[16] Eli Meirom, Haggai Maron, Shie Mannor, and Gal Chechik. Optimizing tensor
network contraction using reinforcement learning. In International Conference on
Machine Learning, pages 15278–15292. PMLR, 2022.
[17] Robert N. C. Pfeifer, Jutho Haegeman, and Frank Verstraete. Faster identification
of optimal contraction sequences for tensor networks. Phys. Rev. E, 90:033315, 2014.
[18] Sebastian Schlag, Vitali Henne, Tobias Heuer, Henning Meyerhenke, Peter Sanders,
and Christian Schulz. K-way hypergraph partitioning via n-level recursive bisec-
tion. In 2016 Proceedings of the Eighteenth Workshop on Algorithm Engineering and
Experiments (ALENEX), pages 53–67. SIAM, 2016.
[19] Roman Schutski, Danil Lykov, and Ivan Oseledets. Adaptive algorithm for quantum
circuit simulation. Phys. Rev. A, 101:042335, Apr 2020.
[20] Daniel G. A. Smith and Johnnie Gray. opt_einsum - a Python package for optimizing
contraction order for einsum-like expressions. Journal of Open Source Software,
3(26):753, 2018.
[21] Christoph Staudt, Mark Blacher, Julien Klaus, Farin Lippmann, and Joachim Giesen.
Improved cut strategy for tensor network contraction orders. In Leo Liberti, editor,
22nd International Symposium on Experimental Algorithms, SEA 2024, July 23-26,
2024, Vienna, Austria, volume 301 of LIPIcs, pages 27:1–27:19. Schloss Dagstuhl -
Leibniz-Zentrum für Informatik, 2024.
[22] Mihail Stoian and Andreas Kipf. Dpconv: Super-polynomially faster join ordering,
2024.
[23] Mihail Stoian, Richard M. Milbradt, and Christian B. Mendl. On the optimal linear
contraction order of tree tensor networks, and beyond. SIAM Journal on Scientific
Computing, 46(5):B647–B668, 2024.
[24] Ben Strasser. Computing tree decompositions with flowcutter: PACE 2017 submis-
sion. CoRR, abs/1709.08949, 2017.
[25] Jianyu Xu, Ling Liang, Lei Deng, Changyun Wen, Yuan Xie, and Guoqi Li. Towards
a polynomial algorithm for optimal contraction sequence of tensor networks from
trees. Phys. Rev. E, 100:043309, 2019.
[26] F. Frances Yao. Efficient dynamic programming using quadrangle inequalities. In
Raymond E. Miller, Seymour Ginsburg, Walter A. Burkhard, and Richard J. Lipton,
editors, Proceedings of the 12th Annual ACM Symposium on Theory of Computing,
April 28-30, 1980, Los Angeles, California, USA, pages 429–435. ACM, 1980.
Chapter 13
Appendix
Contains some proofs, such as those of equations 524 and 571. They are fairly long, and can
be useful for contrasting with the diagram proofs.