
The Tensor Cookbook

Thomas Dybdahl Ahle

February 9, 2025
Chapter 1

Introduction

What is this? These pages are a guide to tensors, using the visual language of "tensor
diagrams". To illustrate the generality of the approach, I've tried to closely follow the
legendary "Matrix Cookbook". As such, most of the presentation is a collection of facts
(identities, approximations, inequalities, relations, ...) about tensors and matters relating
to them. You won't find many results that are not in the original cookbook, but hopefully the
diagrams will give you a new way to understand and appreciate them.

It's ongoing: The Matrix Cookbook is a long book, and not all the sections are equally
amenable to diagrams. Hence I've opted to skip certain sections and shorten others.
Perhaps in the future, I, or others, will expand the coverage further.
For example, while we cover all of the results on Expectation of Linear Combinations
and Gaussian moments, we skip the section on general multi-variate distributions. I have
also had to rearrange the material a bit, to avoid having to introduce all the notation up
front.

Complex Matrices and Covariance Tensor diagrams (or networks) are currently
most often seen in Quantum Physics, but this is not a book for physicists. The Matrix
Cookbook is a book for engineers, in particular in Machine Learning, where complex
numbers are less common. Without complex numbers, we don’t have to worry about
complex conjugation, which simplifies transposes, and gets rid of the need for co- and
contra-variant tensors. If you are a physicist, you probably want a book on Tensor
Analysis.

Tensorgrad The symbolic nature of tensor diagrams makes them well suited for symbolic
computation.

Advantages of Tensor Diagram Notation: Tensor diagram notation has many benefits
compared to other notations. Various operations, such as a trace, tensor product, or
tensor contraction, can be expressed simply without extra notation. Names of indices and
tensors can often be omitted; this saves time and lightens the notation, and is especially
useful for internal indices which exist mainly to be summed over. The order of the tensor
resulting from a complicated network of contractions can be determined by inspection: it
is just the number of unpaired lines. For example, a tensor network with all lines joined,
no matter how complicated, must result in a scalar.

Etymology The term "tensor" is rooted in the Latin word tensio, meaning “tension” or
“stretching,” derived from the verb tendere, which means “to stretch” or “to extend.” It was
first introduced in the context of mathematics in the mid-19th century by William Rowan
Hamilton in his work on quaternions, where it referred to the magnitude of a quaternion.
The modern usage of "tensor" was later established by Gregorio Ricci-Curbastro and
Tullio Levi-Civita in their development of tensor calculus, a framework that generalizes
the concept of scalars, vectors, and matrices to more complex, multidimensional enti-
ties. [1, 10].
Contents

1 Introduction 1
1.1 Tensor Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 The Copy Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Sums of Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Transposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.1 Higher order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5.1 Higher order traces . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.7 Symmetry and Symmetrization . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8 Covariance and Contravariance . . . . . . . . . . . . . . . . . . . . . . . . 13
1.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Simple Derivatives 14
2.1 Derivatives of Matrices, Vectors and Scalar Forms . . . . . . . . . . . . . 14
2.1.1 First Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Second Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Higher Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Derivatives of Traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 First Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Second Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 Higher Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Kronecker and Vec Operator 20


3.1 Flattening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 The Kronecker Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 The Vec Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 Kronecker Vector Product . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5 General Matrification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5.1 The Lyapunov Equation . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5.2 Encapsulating Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.6 The Hadamard Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.7 Khatri–Rao product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.8 Tracy-Singh product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


3.8.1 Stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.9 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4 Functions 28
4.1 The Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 The Hessian Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.2 Chain Rule with Broadcasting . . . . . . . . . . . . . . . . . . . . 31
4.1.3 Functions with multiple inputs . . . . . . . . . . . . . . . . . . . . 31
4.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.1 Known derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.2 Pseudo-linear forms . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.3 Trace identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.4 Taylor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5 Statistics and Probability 37


5.1 Definition of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1.1 Expectation of Linear Combinations . . . . . . . . . . . . . . . . . 37
5.1.2 Linear Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1.3 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1.4 Cubic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Cumulants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.1 Quartic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Weighted Scalar Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4 Gaussian Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4.1 Gaussian Integration by Parts . . . . . . . . . . . . . . . . . . . . . 42
5.4.2 Cubic forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.4.3 Mean of Quartic Forms . . . . . . . . . . . . . . . . . . . . . . . . 42
5.4.4 Mixture of Gaussians . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4.5 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

6 Determinant and Inverses 45


6.1 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2 Inverses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

7 Advanced Derivatives 47
7.1 Derivatives of vector norms . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.1.1 Two-norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.2 Derivatives of matrix norms . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.3 Derivatives of Structured Matrices . . . . . . . . . . . . . . . . . . . . . . 48
7.3.1 Symmetric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.3.2 Diagonal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.3.3 Toeplitz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.4 Derivatives of a Determinant . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.5 General forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.6 Linear forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

7.7 Square forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


7.8 From Stack Exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.9 Derivatives of an Inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.9.1 Trace Identities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.10 Derivatives of Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

8 Special Matrices 50
8.0.1 Block matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
8.0.2 The Discrete Fourier Transform Matrix . . . . . . . . . . . . . . . 50
8.0.3 Fast Kronecker Multiplication . . . . . . . . . . . . . . . . . . . . . 50
8.0.4 Hermitian Matrices and skew-Hermitian . . . . . . . . . . . . . . . 54
8.0.5 Idempotent Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.0.6 Orthogonal matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.0.7 Positive Definite and Semi-definite Matrices . . . . . . . . . . . . . 54
8.0.8 Singleentry Matrix, The . . . . . . . . . . . . . . . . . . . . . . . . 54
8.0.9 Symmetric, Skew-symmetric/Antisymmetric . . . . . . . . . . . . . 54
8.0.10 Toeplitz Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.0.11 Units, Permutation and Shift . . . . . . . . . . . . . . . . . . . . . 54
8.0.12 Vandermonde Matrices . . . . . . . . . . . . . . . . . . . . . . . . . 54

9 Decompositions 55
9.1 Higher-order singular value decomposition . . . . . . . . . . . . . . . . . . 55
9.2 Rank Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.2.1 Border Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
9.3 Fast Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

10 Machine Learning Applications 58


10.1 Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
10.2 Hessian of Cross Entropy Loss . . . . . . . . . . . . . . . . . . . . . . . . 58
10.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 58
10.4 Transformers / Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
10.5 Tensor Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
10.6 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

11 Tensor Algorithms 59
11.1 Tensor Contraction Orders . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
11.1.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

12 Tensorgrad 61
12.1 Isomorphisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
12.1.1 In Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
12.1.2 In Sums . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
12.1.3 In Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
12.1.4 In Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
12.1.5 In Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
12.1.6 In Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
12.1.7 In Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

12.1.8 Other . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
12.2 Renaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
12.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
12.3.1 Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
12.4 Simplification Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

13 Appendix 69

1.1 Tensor Diagrams


Tensor diagrams are simple graphs (or "networks") where nodes represent variables (e.g.
vectors or matrices) and edges represent contractions (e.g. matrix multiplication or inner
products). The following table shows how some basic operations can be written with tensor
diagrams:
Dot product:      y = Σ_i a_i b_i          — a and b joined by a single edge; no free edges, so the result is a scalar.
Outer product:    Y_{i,j} = a_i b_j        — a and b with no shared edge; two free edges, so the result is a matrix.
Matrix-vector:    y_i = Σ_j A_{i,j} b_j    — A and b share the j edge; one free edge remains.
Matrix-matrix:    Y_{i,k} = Σ_j A_{i,j} B_{j,k} — A and B share the j edge; two free edges remain.
[the original table also draws each operation as a grid-of-numbers picture]
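As a quick illustration (not part of the original text), the four operations above map directly onto numpy's einsum, where each repeated index letter is a contracted edge and each remaining letter is a free edge. A minimal sketch:

```python
import numpy as np

a, b = np.random.randn(5), np.random.randn(5)
A, B = np.random.randn(4, 5), np.random.randn(5, 3)

dot   = np.einsum('i,i->', a, b)      # dot product: one shared edge, no free edges
outer = np.einsum('i,j->ij', a, b)    # outer product: no shared edges, two free edges
mv    = np.einsum('ij,j->i', A, b)    # matrix-vector: one shared edge, one free edge
mm    = np.einsum('ij,jk->ik', A, B)  # matrix-matrix: one shared edge, two free edges

assert np.allclose(dot, a @ b)
assert np.allclose(outer, np.outer(a, b))
assert np.allclose(mv, A @ b)
assert np.allclose(mm, A @ B)
```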

We think of vectors and matrices as tensors of order 1 and 2. The order corresponds
to the number of dimensions in their [· · · ] visualization above, e.g. a vector is a 1-
dimensional list of numbers, while a matrix is a 2-dimensional grid of numbers. The order
also determines the degree of the node representing the variable in the tensor graph.
Diagram notation becomes more interesting when you have tensors of order 3 and
higher. An order-3 tensor is a cube of numbers, or a stack of matrices. E.g. we can write
this as T ∈ R^{n×m×k}, so T_i ∈ R^{m×k} is a matrix for i = 1 . . . n. Of course we could slice T
along the other axes too, so T_{:,j,:} ∈ R^{n×k} and T_{:,:,ℓ} ∈ R^{n×m} are matrices too.
A matrix having two outgoing edges means there are two ways you can multiply a
vector onto it, either on the left: xT M , or on the right: M x. In graph notation we just
write x−M− and −M −x. An order 3 tensor has three edges, so we can multiply it with
a vector in three ways: [diagrams: x contracted with T along each of T's three edges].
To be perfectly precise about what each one means, we should give the edges labels. For
example we would write T with its i edge contracted by x to specify the matrix Σ_i T_i x_i.
However, often the edge in question will be clear from the context, which is part of what
makes tensor diagram notation cleaner than, say, Einstein sum notation.

Y_{i,j} = Σ_{k,l,m,n,o} A_{i,k} B_{l,n,o} C_{j,k,l,m} D_{m,n} E_o   ⇔   [diagram: nodes A, B, C, D, E joined along the shared indices k, l, m, n, o, with i and j left as the free edges of Y]
The key principle of tensor diagrams is that edge contraction is associative. This
means you can contract any edge in any order you prefer. This can be seen from the sum
representation above, which can be reordered to sum over k, l, m, n in any order.

The computational price of different contraction orders can differ widely. Unfortunately
it's not computationally easy to find the optimal order. See section 11.1 for algorithms to
find the best contraction order, and for approximate contraction methods. A small numpy
sketch of both points follows below.
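As a hedged illustration (the tensor sizes below are arbitrary), np.einsum computes the same result for any contraction order, and np.einsum_path reports the pairwise order it chose and the associated cost:

```python
import numpy as np

# The five-tensor contraction from the example above:
# Y[i,j] = sum_{k,l,m,n,o} A[i,k] B[l,n,o] C[j,k,l,m] D[m,n] E[o]
i, j, k, l, m, n, o = 8, 9, 10, 11, 12, 13, 14
A = np.random.randn(i, k)
B = np.random.randn(l, n, o)
C = np.random.randn(j, k, l, m)
D = np.random.randn(m, n)
E = np.random.randn(o)

expr = 'ik,lno,jklm,mn,o->ij'
Y1 = np.einsum(expr, A, B, C, D, E, optimize=False)      # naive contraction
Y2 = np.einsum(expr, A, B, C, D, E, optimize='optimal')  # optimized pairwise order
assert np.allclose(Y1, Y2)  # any contraction order gives the same tensor

# einsum_path shows the chosen pairwise contraction order and its FLOP estimate
path, info = np.einsum_path(expr, A, B, C, D, E, optimize='optimal')
print(info)
```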
Note that tensor graphs are not always connected. We already saw that the outer
product of two vectors can be written as the two disconnected nodes a and b. This is natural
from the sum representation: no edges simply means no sums. So here y_{i,j} = a_i b_j, which
is exactly the outer product y = a ⊗ b.

1.2 The Copy Tensor


A particularly important tensor is the "copy" tensor, also known as the "diagonal", "Kronecker
delta" or "spider" tensor. The simplest version is the all-ones vector, drawn as a dot with a
single edge: its entries are all 1. The general order-n copy tensor is 1 on the diagonal, 0
everywhere else:
copy_{i,j,k,...} = 1 if i = j = k = . . ., and 0 otherwise.
Or, using Iversonian notation,¹ copy_{i,j,k,...} = [i = j = k = . . .]. We see the order-2 copy-


tensor, − − = I, is just the identity matrix, so we can simply remove it from graphs like
this:
−A− −B− = −A−B−
Higher order copy-tensors are very useful, because they let us turn the simple tensor
graphs into hyper-graphs. A simple example of how we can use this is the diagonal matrix
D_a, which has a on the diagonal and 0 elsewhere. We can write this as
D_a = [an order-3 copy tensor with a contracted onto its third edge].
Why? Because (D_a)_{i,j} = Σ_k copy_{i,j,k} a_k = Σ_k [i = j = k] a_k = [i = j] a_i. Similarly the
Hadamard product, (a ∘ b)_i = a_i b_i, can be written
a ∘ b = [an order-3 copy tensor with a and b contracted onto two of its edges].
Now, let's see why everyone loves copy tensors by using them to prove the identity D_a D_b =
D_{a∘b} by "copy tensor manipulation":
D_a D_b = [diagram: the copy tensors of D_a and D_b share an edge, merge into a single copy tensor] = D_{a∘b}.
You can verify this using the sum representation.
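Here is a small numpy check of the same identity (our own illustration, building D_a from an explicit order-3 copy tensor):

```python
import numpy as np

a, b = np.random.randn(6), np.random.randn(6)

# Order-3 copy tensor: 1 where i == j == k, else 0
copy3 = np.zeros((6, 6, 6))
for i in range(6):
    copy3[i, i, i] = 1.0

Da = np.einsum('ijk,k->ij', copy3, a)   # diag(a) via the copy tensor
Db = np.einsum('ijk,k->ij', copy3, b)

assert np.allclose(Da, np.diag(a))
assert np.allclose(Da @ Db, np.diag(a * b))   # D_a D_b = D_{a∘b}
```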
The general rule at play is that any connected sub-graph of copy-tensors can be
combined into a single one. Sometimes we are even lucky enough that this simplification
¹ For a logical proposition P, we define [P] = 1 if P is true and 0 otherwise.

leaves us with an identity matrix we can remove too:
[diagram: a sub-graph of copy tensors between S and T collapses to a single copy tensor, and an order-2 copy tensor is removed as an identity matrix, leaving S and T directly connected].

The only time you have to be a bit careful is when the resulting tensor has order 0.
Depending on how you define the order-0 copy tensor, , you may or may not have the
identity − = .
Lots of other constructions that require special notation in normal vector notation (like
diagonal matrices or Hadamard products) can be unified using the copy tensor.
In the Matrix Cookbook they define the order-4 tensor J, which satisfies J_{i,j,k,l} = [i = k][j = l]
and which we'd draw as two parallel identity edges. It satisfies, for example, dX/dX = J.
Using "tensor products" you could write J = I ⊗ I. Note that J is different from the order-4
copy tensor.

1.3 Sums of Tensors


Tensor networks can express any multilinear function, that is, any f such that f(ax, by) = ab·f(x, y).
Unfortunately not all operations on tensors are multilinear. Even something as simple as the
sum of two vectors, x + y, cannot be displayed with a simple contraction graph. (Note
that the sum is not multilinear, because ax + by ≠ ab(x + y).)
To handle this important operation, Penrose suggested simply writing the two graphs
with a plus sign between them, such as −x + −y. Note that this is itself an order-1 tensor,
even though it may look like there are two free edges. If we want to multiply the sum
with another tensor, we can use parentheses, like −M−(−x + −y).
It can be helpful to use named edges when dealing with sums, to make it clear how
the edges are matched up. Sums and tensor products interact nicely, with a general form
of the distributive law:
[diagram: the distributive law — contracting a tensor with sums of tensors expands into the sum of all the individual contracted products]

When adding tensors that don't have the same number of edges, or have edges with
different names, we can use "broadcasting". Say we want to add a matrix M and a vector
x. What does it even mean? If we want to add x to every row of M, we add to M the
outer product of the all-ones vector with x, which is a matrix in which every row equals x.
Similarly, if we want to add x to every column, we add the outer product of x with the
all-ones vector. A small numpy sketch of this follows below.
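A minimal numpy sketch of the broadcasting idea (the variable names are ours, not the book's):

```python
import numpy as np

M = np.random.randn(3, 4)
x = np.random.randn(4)   # to be added to every row of M
y = np.random.randn(3)   # to be added to every column of M

ones_rows = np.ones(3)
ones_cols = np.ones(4)

# "x broadcast over rows" is the outer product of the all-ones vector with x
assert np.allclose(M + x, M + np.outer(ones_rows, x))
# "y broadcast over columns" is the outer product of y with the all-ones vector
assert np.allclose(M + y[:, None], M + np.outer(y, ones_cols))
```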
Note that we typically don't talk about "rows" or "columns" when dealing with tensors,
but simply use the edge names (sometimes called axes) of the tensor. When using named edges,
operations from classical vector notation like "transpose" can also be removed. The matrix
X^T is simply X with the left and right edges swapped. But if the edges are named, we
don't have to keep track of "where the edge is" at all.

1.4 Transposition
In classical matrix notation, transposition flips the indices of a matrix, so that

(AT )ij = Aji .

In tensor diagram notation, we have two choices depending on whether we want the
position of the edges to be significant. With significant edge positions, we typically
let the “left edge” be the first index, and the “right edge” be the second index. Thus
transposition requires flipping the edges:

( A )T = A = AT .
A
A fun notation used by some authors is flipping the tensor upside down, , as a
simpler way to flip the left and right edges.
In practical computations, keeping track of the edge positions can be easy to mess up.
It’s more robust to name the “outputs” of the tensor, and let transposition rename the
edges:
( i A j )T = i j A i j
Renaming can be done using multiplication by identity matrices:

(i A j )( j k )= i Aj k ,

but we have to be careful because overlapping edge names can make multiplication non-
associative. E.g. grouping ((i A j)(j k))(k j) gives back i A j, while (i A j)((j k)(k j))
gives i A j multiplied by the trace of the identity, i.e. by the matrix dimension.
In tensorgrad we solve this problem by requiring
that any edge name is present at most twice in any product.
For the purpose of transcribing the matrix cookbook, using the “significant position”
notation is more convenient. We observe the following identities:

(A^T)^T = A
(A + B)^T = A^T + B^T   (4)
(AB)^T = B^T A^T   (5)
[diagrams: each identity is drawn by flipping the left and right edges of the corresponding network]

1.4.1 Higher order


It is possible to generalize the idea of a transpose to higher order tensors. It requires
partitioning the edges as “input” and “output” edges, or more commonly, “contravariant”
and “covariant” edges. See the section at the end of this chapter for more on this.
We say matrices are symmetric if AT = A. For higher order tensors, there are many
ways to be symmetric. See the section 1.7 for more on this.

1.5 Trace
The "trace" of a square matrix is defined as Tr(A) = Σ_i A_{i,i}. In tensor diagram notation,
that corresponds to a self-edge: the two edges of A are joined. The Matrix Cookbook has a list of identities
using traces. Let's reproduce them with tensor diagrams:

Σ_{i=1}^n A_{ii} = Tr(A) = Tr(AI)   (11)
Tr(A) = Tr(A^T)   (13)
Tr(AB) = Tr(BA)   (14)
Tr(A + B) = Tr(A) + Tr(B)   (15)
Tr(ABC) = Tr(BCA) = Tr(CAB)   (16)
a^T a = Tr(a a^T)   (17)
[diagrams: in each case both sides are the same closed network, read starting from a different node]
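These identities are easy to spot-check numerically; the following numpy snippet (our own, with arbitrary sizes) verifies each one:

```python
import numpy as np

n = 5
A, B, C = np.random.randn(n, n), np.random.randn(n, n), np.random.randn(n, n)
a = np.random.randn(n)

assert np.allclose(np.einsum('ii->', A), np.trace(A))            # (11) self-edge = trace
assert np.allclose(np.trace(A), np.trace(A.T))                   # (13)
assert np.allclose(np.trace(A @ B), np.trace(B @ A))             # (14)
assert np.allclose(np.trace(A + B), np.trace(A) + np.trace(B))   # (15)
assert np.allclose(np.trace(A @ B @ C), np.trace(C @ A @ B))     # (16)
assert np.allclose(a @ a, np.trace(np.outer(a, a)))              # (17)
```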

1.5.1 Higher order traces


For higher order tensors, we still define the trace as the sum of the diagonal elements.
That is, for a tensor T of order n, the trace is just the contraction of T with the order-n copy
tensor: Tr(T) = Σ_i T_{i,i,...,i}.
In quantum mechanics, it is common to use a "partial trace", which we can define using
index notation:
Tr_{i,j}(T)_k = Σ_i T_{i,i,k},
i.e. only the i and j edges are joined, while k stays free.

Of course with tensor diagrams, we can also use partial traces without naming the indices.
We don’t even have to think about whether the contractions we use are traces or not.
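A partial trace is just an einsum in which some, but not all, repeated indices are summed. A small sketch (our own notation):

```python
import numpy as np

n, m = 4, 5
T = np.random.randn(n, m, n, m)   # order-4 tensor with two pairs of matching edges

full_trace    = np.einsum('ijij->', T)    # join both pairs of edges
partial_trace = np.einsum('ijkj->ik', T)  # trace out only the second and fourth edges

# Tracing the remaining pair of the partial trace recovers the full trace
assert np.allclose(full_trace, np.trace(partial_trace))
```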

1.6 Eigenvalues
Eigenvalues and eigenvectors are fundamental concepts in linear algebra that have impor-
tant applications in tensor network theory. In tensor diagram notation, we can represent
these concepts in a visually intuitive way.
For a matrix A, if there exists a non-zero vector v and a scalar λ such that Av = λv,
then λ is called an eigenvalue of A, and v is the corresponding eigenvector. In tensor
diagram notation, it's convenient to write the eigendecomposition as A = QΛQ^{-1}, where
Q is a matrix whose columns are the eigenvectors of A, and Λ is a diagonal matrix of the
eigenvalues, drawn as a copy tensor contracted with the eigenvalue vector λ:

A = Q D_λ Q^{-1}   (1.1)

The trace of a matrix is equal to the sum of its eigenvalues. We can see this with tensor
diagrams by inserting the eigendecomposition into the trace loop: Q^{-1} meets Q and
cancels, leaving the copy tensor contracted with λ, i.e. the sum of the eigenvalues:

Tr(A) = Tr(Q D_λ Q^{-1}) = Tr(Q^{-1} Q D_λ) = Tr(D_λ) = Σ_i λ_i   (12)

1.7 Symmetry and Symmetrization


A common property of tensors is symmetry. For matrices, symmetry means that A^T = A.
For tensors it means that permuting the indices does not change the value. Typical examples
are covariance or Hessian matrices, but also certain higher-order statistical moments,
like the third or fourth moment tensors, can exhibit symmetries: sometimes only with
respect to a particular group of permutations, but most of the time with respect to all
permutations of the edges. The copy tensor is a simple example of a completely symmetric tensor.
Sometimes it will be useful to symmetrize a tensor by summing over all permutations
of its indices. We write this using a squiggly line over the symmetrized edges. [diagrams:
the order-2 symmetrizer expands into 2 terms, the order-3 symmetrizer into all 6 permutations.]
For example, if A is a square matrix, its symmetrization is A + A^T. There
is a complementary notion of anti-symmetrization (or skew-symmetrization), where we
sum over all permutations with the appropriate sign. For instance, an order-2 skew-symmetric
matrix A satisfies A_{ij} = −A_{ji}; we can anti-symmetrize using a flat thick line, giving
A − A^T. In higher-order cases, the sign of each term is determined by the parity
of the permutation.
Both operations are idempotent, since symmetrizing a symmetric tensor has no effect.

1.8 Covariance and Contravariance


In physics it's often relevant to change between different coordinate systems. Most such
changes transform the left and right side of matrices differently, and similarly row and
column vectors. For tensors this generalizes to the concepts of covariance and contravariance.
With this notion, we keep track of "input" and "output" edges. We'll also need to
distinguish between the differently directed versions of the identity matrix.
This notion allows some ability to “algebraically” combine tensors, by connecting input
edges to output edges, just like we do with matrices and vectors. However, for more
complicated tensors, we need to know which edges to combine, and index notation is
more useful.
In computer science and machine learning, the concept of covariance and contravari-
ance is typically not useful, so we won’t use it in this book.

1.9 Exercises
Exercise 1. Given a sequence of matrices A1 , A2 , . . . , An ∈ Rn×n , and vectors v1 , v2 , . . . , vn ∈
Rn , draw, using tensor diagrams, the matrix made of vectors A1 v1 , A2 v2 , . . . , An vn .
Exercise 2. Represent the Hadamard product (element-wise multiplication) of two ma-
trices using tensor diagrams. How does this differ from regular matrix multiplication?
(We will see more about this in 3.6.)
Exercise 3. Represent the SVD of a matrix A = U ΣV T using tensor diagrams. How
does this compare to the eigendecomposition diagram? How can you generalize it to
higher order tensors? (In section 9.1 we will see more about this.)
Chapter 2

Simple Derivatives

A derivative with respect to a tensor is simply the collection of derivatives with respect
to each element of that tensor. We can keep track of dT/dU by making a tensor of shape
shape(T) ∪ shape(U). For example, if T is an order-3 tensor and U is an order-2 tensor,
we draw dT/dU as
[diagram: T with a derivative bubble contributing two extra edges]
This notation follows Penrose. The two extra lines coming from the black dot on the
circle make the derivative an order-5 tensor. That the order of derivatives grows this
way is one of the main reasons tensors show up in the first place.
When there are not too many edges, we will use a simple inline notation like this:

(T )

The Matrix Cookbook defines the single-entry matrix J^{i,j} ∈ R^{n×n} as the matrix which
is zero everywhere except in the entry (i, j), in which it is 1. Alternatively we could write
J^{i,j}_{n,m} = [i = n][j = m].

2.1 Derivatives of Matrices, Vectors and Scalar Forms


2.1.1 First Order
The following first order derivatives show the basic linearity properties of the derivative
operator.

∂(x^T a)/∂x = a   (69)
∂(a^T x)/∂x = a   (69)
∂(a^T X b)/∂X = a b^T   (70)
[diagrams: the derivative removes the node being differentiated and leaves its edges free]


∂X/∂X_{i,j} = J^{i,j}   (73)
∂(XA)_{i,j}/∂X_{m,n} = (J^{m,n} A)_{i,j}   (74)
[diagrams omitted]
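A quick finite-difference spot-check of (70); the helper function is our own sketch, not part of tensorgrad:

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function f w.r.t. the matrix X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

a, b = np.random.randn(4), np.random.randn(3)
X = np.random.randn(4, 3)

G = num_grad(lambda X: a @ X @ b, X)
assert np.allclose(G, np.outer(a, b), atol=1e-5)   # (70): d(a^T X b)/dX = a b^T
```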

2.1.2 Second Order


The second order derivatives follow from the product rule:
∂(T U) = (∂T) U + T (∂U).
Note that this rule holds independently of how many edges are between T and U, even if
there are none.

∂/∂X_{i,j} Σ_{k,l,m,n} X_{k,l} X_{m,n} = ∂/∂X_{i,j} (Σ_{k,l} X_{k,l})² = 2 Σ_{k,l} X_{k,l}   (76)
∂(b^T X^T X c)/∂X = X(b c^T + c b^T)   (77)
∂(X^T B X)_{k,l}/∂X_{i,j} = δ_{l,j}(X^T B)_{k,i} + δ_{k,j}(B X)_{i,l}   (79)
∂(X^T B X)/∂X_{i,j} = X^T B J^{i,j} + J^{j,i} B X   (80)
∂(x^T B x)/∂x = (B + B^T) x   (81)
[the diagram derivations expand each product with the product rule and merge the resulting copy tensors]

TODO: Assume W is symmetric, then... (84) - (88)



2.1.3 Higher Order


Integer powers of matrices, like X n , are easy to handle by writing out the product and
using the product rule. The Matrix Cookbook includes a few derivatives we can handle
this way.

∂(X^n)_{k,l}/∂X_{i,j} = Σ_{r=0}^{n-1} (X^r J^{i,j} X^{n-1-r})_{k,l}   (90)
∂(a^T X^n b)/∂X = Σ_{r=0}^{n-1} (X^r)^T a b^T (X^{n-1-r})^T   (91)
[diagrams: the product rule replaces each of the n copies of X in turn by the derivative node]

2.2 Derivatives of Traces


The Matrix Cookbook contains a lot of derivatives for traces. These can be elegant in
classical notation, since traces are scalar, so the derivatives are low order.

2.2.1 First Order


∂Tr(X)/∂X = I   (99)
∂Tr(XA)/∂X = A^T   (100)
∂Tr(AXB)/∂X = A^T B^T   (101)
[diagrams omitted]

Continues for (102-105). The last one uses the Kronecker product, which we may have
to introduce first.

2.2.2 Second Order


∂Tr(X²)/∂X = 2X^T   (106)
∂Tr(X²B)/∂X = (XB + BX)^T   (107)
∂Tr(X^T B X)/∂X = ∂Tr(X X^T B)/∂X = ∂Tr(B X X^T)/∂X = (B + B^T) X   (108, 109, 110)
∂Tr(X B X^T)/∂X = ∂Tr(X^T X B)/∂X = ∂Tr(B X^T X)/∂X = X(B^T + B)   (111, 112, 113)

The last equation is a bit surprising, since we might assume we could simply substitute
X for X^T in the previous equation and conclude

(B + B^T)X = ∂/∂X Tr(XBX^T) = ∂/∂X Tr(X^T BX) = X(B^T + B).

However that is clearly not the case. Such a substitution would only work for a linear
function, not a quadratic one. In general it is the case that ∂f(X)^T/∂X ≠ ∂f(X^T)/∂X.

2.2.3 Higher Order


∂Tr(X^n)/∂X = Σ_{r=0}^{n-1} (X^r X^{n-r-1})^T = n (X^{n-1})^T = n (X^T)^{n-1}   (121)
∂Tr(AX^n)/∂X = Σ_{r=0}^{n-1} (X^r A X^{n-1-r})^T = Σ_{r=0}^{n-1} (X^r)^T A^T (X^{n-1-r})^T   (122)
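The same finite-difference idea spot-checks (122); the helper below is our own sketch:

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

n = 3                              # power in X^n
A = np.random.randn(4, 4)
X = np.random.randn(4, 4)

G = num_grad(lambda X: np.trace(A @ np.linalg.matrix_power(X, n)), X)
closed_form = sum((np.linalg.matrix_power(X, r) @ A @ np.linalg.matrix_power(X, n - 1 - r)).T
                  for r in range(n))
assert np.allclose(G, closed_form, atol=1e-4)   # (122)
```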

2.3 Exercises
Exercise 4. Find the derivative of (xT Ax)2 with respect to x.
Exercise 5. Find the second derivative of

xT AT xxT Ax

with respect to x.
Exercise 6. Find the derivative of X T AX with respect to X.
Exercise 7. Show the derivatives:

(Bx + b)T C(Dx + d) = B T C(Dx + d) + DT C T (Bx + b) (78)
∂x
∂ T T
b X DXc = DT XbcT + DXcbT (82)
∂X

(Xb + c)T D(Xb + c) = (D + DT )(Xb + c)bT (83)
∂X

Exercise 8. Show the remaining second order trace derivatives from the Matrix Cook-
book:

Tr(AXBX) = AT X T B T + B T X T AT (114)
∂X
∂ ∂
Tr X T X = Tr XX T = 2X
 
(115)
∂X ∂X

Tr B T X T CXB = C T XBB T + CXBB T

(116)
∂X

Tr X T BXC = BXC + B T XC T
 
(117)
∂X

Tr AXBX T C = AT C T XB T + CAXB

(118)
∂X

Tr (AXB + C)(AXB + C)T = 2AT (AXB + C)B T
 
(119)
∂X

Exercise 9. Show the derivative of the fourth order trace from the Matrix Cookbook:

Tr B T X T CXX T CXB = CXX T CXBB T
 
(123)
∂X
+ C T XBB T X T C T X
+ CXBB T X T CX
+ C T XX T C T XBB T
Chapter 3

Kronecker and Vec Operator

3.1 Flattening
Flattening is a common operation for programmers. In the language of numpy, we may
write np.ones((2,3,4)).reshape(2, 12) to flatten a shape (2,3,4) tensor into a shape
(2,12) matrix. Similarly, in mathematical notation, vec(X) is commonly used to denote
the flattening of a matrix into a vector.
Typically the main reason to do this is as a kludge for dealing with poor general notation
for tensors. Hence, with tensor diagrams, we can avoid this operation entirely. However,
it is still interesting to see how tensor diagrams make a lot of properties of flattening
much more transparent.
To begin with we note that flattening is a linear operation, and hence can be repre-
sented as a simple tensor. We’ll use a triangle to denote this:
▷_{i,j,k} = [i + j·n = k],
where n is the dimension of the i edge. Note we use a double line to denote the output of
the flattening operation. This is simply a syntactic choice to remind ourselves that the
output is a bundle of two edges.
Using this notation we can write
vec(X)_k = Σ_{i,j} ▷_{i,j,k} X_{i,j}.

Some basic properties of ▷:

= (3.1)

= (3.2)

= (3.3)

= (3.4)


3.2 The Kronecker Product


The Kronecker product of an m × n matrix A and an r × q matrix B is the mr × nq matrix
A ⊗ B = [ A_{1,1}B  A_{1,2}B  · · ·  A_{1,n}B ;  A_{2,1}B  A_{2,2}B  · · ·  A_{2,n}B ;  . . . ;  A_{m,1}B  A_{m,2}B  · · ·  A_{m,n}B ].
Using index notation we can also write this as (A ⊗ B)_{p(r−1)+v, q(s−1)+w} = A_{r,s} B_{v,w}, but
it's pretty hard to read.
In tensor notation the Kronecker product is simply the outer product of the two matrices,
flattened "on both sides": A ⊗ B = [diagram: A and B stacked in parallel, with ▷ applied to the pair of left edges and to the pair of right edges].
The Kronecker product has the following properties:

A ⊗ (B + C) = A ⊗ B + A ⊗ C   (506)
A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C   (508)
aA ⊗ bB = ab (A ⊗ B)   (509)
(A ⊗ B)^T = A^T ⊗ B^T   (510)
(A ⊗ B)(C ⊗ D) = AC ⊗ BD   (511)
(A ⊗ I)(I ⊗ B) = A ⊗ B   (511b)
Tr(A ⊗ B) = Tr(A) Tr(B)   (515)
[diagrams omitted]
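A few of these identities checked numerically with np.kron (our own illustration, arbitrary shapes):

```python
import numpy as np

A, B = np.random.randn(3, 4), np.random.randn(2, 5)
C, D = np.random.randn(4, 6), np.random.randn(5, 7)

assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))                      # (510)
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))    # (511)

A2, B2 = np.random.randn(3, 3), np.random.randn(4, 4)
assert np.allclose(np.trace(np.kron(A2, B2)), np.trace(A2) * np.trace(B2))  # (515)
```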

Recalling the eigendecomposition of a matrix, (1.1), we also get
eig(A ⊗ B) = eig(A) eig(B),   (519)
i.e. if A = Q1 Λ1 Q1^{-1} and B = Q2 Λ2 Q2^{-1}, then A ⊗ B = (Q1 ⊗ Q2)(Λ1 ⊗ Λ2)(Q1 ⊗ Q2)^{-1},
so the eigenvalues of A ⊗ B are the pairwise products of the eigenvalues of A and B,
where the last step is due to (3.4).
This is easier to see when we consider that V = [diagram: M with a copy tensor attached on each side] represents the tensor where
V_{i,j,i,j} = M_{i,j} and 0 otherwise. So flattening V on both sides is the same as diag(vec(M)).

3.3 The Vec Operator


The vec-operator applied to a matrix A stacks the columns into a vector, i.e. for a 2 × 2
matrix
A = [ A11  A12 ;  A21  A22 ],   vec(A) = [ A11, A21, A12, A22 ]^T.
At the start of the chapter we showed how to represent the vec-operator using the
flattening tensor: vec(X) = X . The Matrix Cookbook gives the following prop-
erties of the vec-operator:

vec(A^T X B) = (A ⊗ B)^T vec(X)   (520)
Tr(A^T B) = vec(A)^T vec(B)   (521)
vec(A + B) = vec(A) + vec(B)   (522)
vec(aA) = a vec(A)   (523)
a^T X B X^T c = vec(X)^T (B ⊗ c a^T) vec(X)   (524)
[diagrams omitted]

3.4 Kronecker Vector Product


Sometimes it's convenient to write the Kronecker product between a matrix and a vector as a matrix.
We define
A ⊗ v   (3.5),   A ⊗ v^T   (3.6),   v ⊗ A   (3.7),   v^T ⊗ A   (3.8),
each drawn as the outer product of A and v with the flattening tensor applied on the side where v sits.
Similarly, we can define the Kronecker product between two vectors,
v ⊗ w   (3.9)   and   v^T ⊗ w^T   (3.10),
as the outer product of v and w flattened into a column or a row vector, respectively. [diagrams omitted]

3.5 General Matrification


The last equation is an example of a general idea: Any tensor network can be transformed
into a series of matrix multiplications by applying the vec-operator to all tensors and the
flattening tensor to all edges. For example, the following complicated graph:

[diagram: a network of tensors a, B, C, D, E, H, f, g, drawn first as a general graph and then regrouped into a chain]

Can be written as a simple vector-matrix-matrix-vector product, aM1 M2 b, where M1 =


vec(B) ⊗ C ′ , M2 = E ′ ⊗ D′ ⊗ I and b = f ⊗ g ⊗ vec(H), where C ′ , D′ and E ′ are rank 3
tensors flattened on one side, and vec(B) is interpreted as a matrix with a single column.

3.5.1 The Lyapunov Equation


A nice application of Kronecker product rewritings is to solve equations like

AX + XB = C. (272)

We use the rewriting vec(AX + XB) = (I ⊗ A + B T ⊗ I)vec(X), which follows from the
tensor diagram massaging:

[diagram: vec(AX) = (I ⊗ A) vec(X) and vec(XB) = (B^T ⊗ I) vec(X), so the flattened left-hand side of (272) is (I ⊗ A + B^T ⊗ I) vec(X)]
after which we can take the normal matrix inverse to get

vec(X) = (I ⊗ A + B T ⊗ I)−1 vec(C). (273)
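A minimal numpy sketch of solving (272) this way, assuming the system is solvable and using column-stacking for vec:

```python
import numpy as np

n = 4
A, B, C = np.random.randn(n, n), np.random.randn(n, n), np.random.randn(n, n)

# Solve AX + XB = C via vec(AX + XB) = (I ⊗ A + B^T ⊗ I) vec(X),
# where vec stacks columns, i.e. vec(X) = X.flatten(order='F').
I = np.eye(n)
K = np.kron(I, A) + np.kron(B.T, I)        # assumes K is invertible (generic A, B)
x = np.linalg.solve(K, C.flatten(order='F'))
X = x.reshape(n, n, order='F')

assert np.allclose(A @ X + X @ B, C)
```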

3.5.2 Encapsulating Sum


This is a generalization of the previous equation.

Σ_n A_n X B_n = C   (274)
vec(X) = (Σ_n B_n^T ⊗ A_n)^{-1} vec(C)   (275)

3.6 The Hadamard Product


The Hadamard product, also known as element-wise multiplication, is not described in the
Matrix Cookbook. Yet, it is a very useful operation, and has some interesting properties
in connection with the Kronecker product.
We define the Hadamard product of two 2 × 2 matrices A and B as
A ∘ B = [ A11  A12 ; A21  A22 ] ∘ [ B11  B12 ; B21  B22 ] = [ A11B11  A12B12 ; A21B21  A22B22 ].
In tensor notation, the Hadamard product can be represented using two order-3 copy tensors:
A ∘ B = [diagram: A and B in parallel, with one copy tensor joining their left edges and another joining their right edges].
Some properties of the Hadamard product are:
x^T (A ∘ B) y = Tr(A^T D_x B D_y)
(A ⊗ B) ∘ (C ⊗ D) = (A ∘ C) ⊗ (B ∘ D)
The first equation is simply massaging the tensor diagram. The second follows from
(3.3). Alternatively, it suffices to follow the double lines to see that A and C both use
the “upper” part of the double edge, while B and D use the lower part.
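Both identities are easy to verify numerically; a small sketch (our own sizes and names):

```python
import numpy as np

n = 5
A, B = np.random.randn(n, n), np.random.randn(n, n)
x, y = np.random.randn(n), np.random.randn(n)

# x^T (A ∘ B) y = Tr(A^T D_x B D_y)
lhs = x @ (A * B) @ y
rhs = np.trace(A.T @ np.diag(x) @ B @ np.diag(y))
assert np.allclose(lhs, rhs)

# (A ⊗ B) ∘ (C ⊗ D) = (A ∘ C) ⊗ (B ∘ D)
C, D = np.random.randn(n, n), np.random.randn(n, n)
assert np.allclose(np.kron(A, B) * np.kron(C, D), np.kron(A * C, B * D))
```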

3.7 Khatri–Rao product


Also known as the column-wise Kronecker, row-wise Kronecker or "face-splitting" product.
We use the symbols ∗ and • for the column- and row-wise Kronecker products, respectively.
For 2 × 2 matrices A and B:
A ∗ B = [ A11B11  A12B12 ; A11B21  A12B22 ; A21B11  A22B12 ; A21B21  A22B22 ]
A • B = [ A11B11  A11B12  A12B11  A12B12 ; A21B21  A21B22  A22B21  A22B22 ]
In terms of tensor diagrams, these products correspond simply to flattening the product
on one side, and using a copy tensor on the other: [diagrams: for A ∗ B the column edges are
joined by a copy tensor and the row edges are flattened; for A • B the row edges are joined
and the column edges flattened].
Clearly the two are identical up to transpose. Indeed, (A ∗ B)^T = A^T • B^T and (A • B)^T = A^T ∗ B^T.
There are multiple "mixed product" identities:
(A • B)(C ⊗ D) = (AC) • (BD)
(Ax) ∘ (By) = (A • B)(x ⊗ y)

3.8 Tracy-Singh product

TODO

3.8.1 Stacking
Can be part of Kronecker section

From Josh: Proposition 2.5. For any field F, integers d1, d2, d3, d4 and matrices X1 ∈ F^{d1×d2},
X2 ∈ F^{d2×d3}, X3 ∈ F^{d1×d4}, and X4 ∈ F^{d4×d3}, we have
X1 × X2 + X3 × X4 = (X1 | X3) × (X2 ; X4),
where we write '|' to denote horizontal matrix concatenation and ';' to denote vertical stacking.
With tensor diagrams we can write stacking along a new axis i as
stack_i(X, Y) = e^{(0)} ⊗ X + e^{(1)} ⊗ Y,
where e^{(j)} is the standard basis vector attached along the new edge i, i.e. e^{(j)}_j = 1 and 0 elsewhere.
From this we easily get the identity
(A | C)(B ; D) = stack_i(A, C) · stack_i(B, D)   (3.11)
= (e^{(0)} ⊗ A + e^{(1)} ⊗ C)(e^{(0)} ⊗ B + e^{(1)} ⊗ D)   (3.12)
= AB + CD,   (3.13)–(3.14)
since contracting the i edges gives ⟨e^{(j)}, e^{(k)}⟩ = [j = k].

TODO: Relation to direct sum, which is basically stacking + flattening. Or maybe


it’s nicer to do it by hadamard producting with the ei vector. Also notice that this is
what quantum comp people call “controlling”.
Also have a bunch of properties like A ⊗ (B ⊕ C) = A ⊗ B ⊕ A ⊗ C.

3.9 Derivatives
3.10 Exercises
Exercise 10. Let J = Tr[(I_N ⊗ F)^T A (I_N ⊗ F) B], where F ∈ R^{N×Nn}, A ∈ R^{N²×N²},
B ∈ R^{N²n×N²n}. Find the derivative of J with respect to F.
Exercise 11. Consider J = ∥G − (B ⊗ X)∥2F , where G and B are matrices, and ∥ · ∥F is
the Frobenius norm. Find the derivative with respect to X
Exercise 12. Prove the equation [8]:

vec(A diag(b) C) = ((C T ⊗ 1) ◦ (1 ⊗ A))b.

Here diag(b) is a diagonal matrix with the vector b on the diagonal, 1 is a vector of ones
of the right size, and b ⊗ A, the Kronecker product for a vector and a matrix, is defined
by b , that is, you just flatten on one side.
A
Exercise 13. Let a and b be two vectors and let D and X be two matrices. Minimize
the following cost function with respect to X:

E = ∥a − DXb∥22 .

Exercise 14. Prove using diagrams that Tr(AT B) = 1T vec(A ◦ B), where 1 is a vector
of the appropriate size.
Exercise 15. Find the derivative of

Tr(G(A ⊗ X))

with respect to X.
Exercise 16. Show that
∂/∂X Tr(X ⊗ X) = ∂/∂X (Tr(X) Tr(X)) = 2 Tr(X) I.
Exercise 17. Take the derivative of
Tr(G(A ⊗ X))
with respect to X, where G is an R^{n²×n²} matrix, and A and X are R^{n×n} matrices.
Write the result in terms of vec(A) and a reshaped version of G.
Exercise 18. Verify the following identity:
d
vec(diag(x) A diag(x)) = diag(vec(A))(I ⊗ x + x ⊗ I).
dx
Hint: Use the matrix-vector identities (3.5).
Exercise 19. Show that
∂ T
x A (x ⊗ x) = A(x ⊗ x) + (xT ⊗ I + I ⊗ xT )AT x
∂x
Hint: Use the matrix-vector identities (3.5).
Chapter 4

Functions

In this chapter we explore general functions from tensors into other tensors. While not
quite as elegant as linear functions, they are important for many applications, such as
non-linearities in neural networks. The main goal is to make the vector chain rule intuitive
and easy to apply, but we also cover Taylor series, determinants and other topics.
Standard notation for functions over vectors can be quite ambiguous when the function
broadcasts over some dimensions. If x ∈ Rn is a vector, it’s not clear whether f (x) applies
to each element of x independently or if it’s a function of the whole vector. With tensor
diagrams, we make this distinction clear by explicitly showing the edges of x that go into
f , and which don’t. We use the notation m f n x when f is a function from Rn to
Rm . If f is element-wise, we write ( f x n ) ∈ Rn . We always assume that the function
(arrow) edges are contracted before normal edges. If we want something else, like f (M x),
we can use brackets: f ( M x ) . It may be helpful with some more examples:

f : R^n → R, f(x) ∈ R   (scalar function)
g : R^n → R^m, g(x) ∈ R^m   (vector function)
h : R → R, h(x) ∈ R^n   (element-wise function)
u : R^n → R^m, v ∈ R^m, v^T u(x) ∈ R   (vector times vector function)
A : R^n → R^{m×n}, v ∈ R^n, A(x)v ∈ R^m   (vector times matrix function)
f : R^d → R, X ∈ R^{b×d}, f(X) ∈ R^b   (vector function, batched)
A : R^{n×m} → R, X ∈ R^{n×m}, A(X) ∈ R   (matrix input function)
f : R^n × R^m → R^d, u ∈ R^{b×n}, v ∈ R^{b×m}, f(u, v) ∈ R^{b×d}   (two inputs, batched)
[diagrams omitted]


To make it more concrete, let's consider some well known functions: (1) The determinant
function, det : R^{n×n} → R, is a scalar function, and we can apply it to a batch
of matrices, which results in a single vector in R^b. (2) Cosine similarity,
cossim : R^n × R^n → R, is a scalar function of two vectors, which results in a single value in R.
(3) If the element-wise power function, pow_n : R → R, is applied to two copies of a vector,
they simplify under the Hadamard product:
pow_n(u) ∘ pow_m(u) = pow_{n+m}(u).

For a more advanced example, consider the softmax function, softmax : R^d → R^d. We
can write it using elementary functions as
softmax(x) = exp(x) / sum(exp(x)) = exp(x) · pow_{-1}(⟨1, exp(x)⟩).
It looks a bit complicated at first, but let us break it down step by step: exp(x) ∈ R^n is
the element-wise exponential function. If we contract it with the all-ones (copy) vector, we get the sum
s = Σ_i exp(x_i). Finally, we apply pow_{-1} to s and multiply it with the element-wise exponential
function to get the softmax function.
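As a minimal numpy sketch (our own code, not the author's tensorgrad), the same decomposition into an element-wise exp, a contraction with the all-ones vector, and pow_{-1} looks like this:

```python
import numpy as np

def softmax_from_pieces(x):
    # softmax(x) = exp(x) * pow_{-1}(sum(exp(x)))
    e = np.exp(x)                  # element-wise exp
    s = np.ones_like(x) @ e        # contraction with the all-ones vector: the sum
    return e * s**(-1)             # pow_{-1}(s) times exp(x)

x = np.random.randn(7)
reference = np.exp(x) / np.exp(x).sum()
assert np.allclose(softmax_from_pieces(x), reference)
```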
One situation where things may get a bit confusing is if the output tensor is contracted
with the broadcasting edges of an input tensor. But as long as we remember to contract
the function edges first, things work out. For example, for a function f : R^n → R^m and a
matrix x ∈ R^{m×n}, broadcasting gives f(x) ∈ R^{m×m}, and contracting the output edge with
the broadcasting edge gives Tr(f(x)).

4.1 The Chain Rule


One of the most attractive features of tensor diagrams is the transparency of the chain
rule. For two functions f : Rm → R and v : Rn → Rm , the composite function
f ◦ v : Rn → R is diagrammed simply by feeding the output of v into f :

f v x

In traditional notation, the chain rule is written Jf ◦v (x) = ∇f (v(x)) Jv (x), where Jv (x)
is the Jacobian of v and ∇f (v(x)) is the gradient of f at v(x). With tensor diagrams the
chain rule is actually a chain!

[diagrams: the derivative of f(v(x)) is the chain (df/dv)(dv/dx); composing three functions f(v(u(x))) gives the chain (df/dv)(dv/du)(du/dx)]

This clearly follows the classical mnemonic: "the derivative of the outer function times the
derivative of the inner function". The only refinement we need is that the "outer function"
is connected to the "inner function" using the new edge created by taking the derivative.
We could further simplify the circled f to f′, where f′ is the derivative of f. For example,
if f were cos, then f′ would be −sin. But we can also just keep the circled f as an easy
to recognize symbol for the derivative.

4.1.1 The Hessian Chain Rule


Things get even more interesting when we consider the second derivative. Using standard
notation, the Hessian of f ◦ v is
H_{f∘v}(x) = Dv(x)^T · D²f(v(x)) · Dv(x) + Σ_{k=1}^d (∂f/∂u_k)(v(x)) · (∂²v_k/∂x ∂x^T)(x).

This “Hessian Chain Rule” is much easier to derive using tensor diagrams:

f f v
=
v v x
x x

f v f v
= +
v x v x
x x

f v f v
= v +
v x v x
x
x x
If we continue to expand this way, we see there is a simple pattern in the chain rule for
more functions:

f f v u f v u f v u
= v u + u +
v v u x v u x v u x
u x x
u u x u x u x
x
x x x x

Yaroslav Bulatov has written extensively on the Hessian Chain Rule, and how to evaluate
it efficiently for deep learning applications. Interested readers may refer to his blog post
on the topic.

4.1.2 Chain Rule with Broadcasting


When the chain rule is applied to functions with broadcasting, the result still follows
the “outer derivative times inner derivative” mnemonic, but the meaning of “times” is
expanded to take the Hadamard product of the broadcasted edges.

4.1.3 Functions with multiple inputs


If f : R^m × R^n → R is a function with two inputs, the derivative of f(x, y) is simply
(∂f/∂x) dx + (∂f/∂y) dy, or in tensor diagram notation: [diagram omitted]

4.2 Examples
4.2.1 Known derivatives
Two examples where we can expand f with known derivatives:

Here inv(X) is the matrix inverse function, taking a matrix X to its inverse X −1 . In the
case of the determinant, one may continue taking derivatives to find the simple pattern
∂^k det(X)/∂X^k = det(X) (X^{-1})^{⊗k}.

4.2.2 Pseudo-linear forms


Pseudo-linear forms, A(x)x, are common. All pixel-adaptive filters like non-local means,
bilateral, etc, and the so-called attention mechanism in transformers can be written this
way. According to Peyman Milanfar, the gradient of this function is important and has
a form worth remembering:

[diagram: d/dx [A(x) x] = (dA/dx contracted with x) + A(x)]

We may appreciate the simplicity of this expression when we consider the following
derivation given by Peyman Milanfar using classical notation:
d[A(x)x] = d[A(x)]x + A(x)dx
= vec(d[A(x)]x) + A(x)dx
= vec(I d[A(x)] x) + A(x)dx
= (x^T ⊗ I) vec(d[A(x)]) + A(x)dx
= (x^T ⊗ I) D vec[A(x)] dx + A(x)dx
= [(x^T ⊗ I) D vec[A(x)] + A(x)] dx
which finally implies:
∂[A(x)x]/∂x = (x^T ⊗ I) ∂vec[A(x)]/∂x + A(x).

4.2.3 Trace identity


The Matrix Cookbook has the formula:
∂Tr(sin(x))
= cos(x)T (128)
∂x
If we attempt to derive this using tensor diagrams, we get the following different result,

which is equivalent to cos(x) ◦ I. That is, cos(x) but zero everywhere except the diagonal.
The reason is that the matrix cookbook actually uses a slightly different definition of
“function applied to matrix”. If F can be written as a power series, then one way to define
F (X) is the matrix power series:

F(X) = Σ_{k=0}^∞ a_k X^k.

In this case, the derivative of Tr(F (X)) is f (X)T , where f (X) is the scalar derivative of
F (X), matching the Matrix Cookbook’s formula.

4.2.4 Taylor
For an n-times differentiable function v : Rd → Rd we can write the Taylor expansion:
[diagram: v(x + ε) ≈ v(x) + (dv/dx)·ε + ½ (d²v/dx²)·(ε, ε) + ⅙ (d³v/dx³)·(ε, ε, ε) + . . .]

Writing this using classical notation is quite messy, requiring nested vec operators and
Kronecker products with ε. With index notation it's ok:

v_i(x + ε) ≈ v_i(x) + Σ_j (∂v_i(x)/∂x_j) ε_j + ½ Σ_{j,k} (∂²v_i(x)/∂x_j∂x_k) ε_j ε_k + ⅙ Σ_{j,k,ℓ} (∂³v_i(x)/∂x_j∂x_k∂x_ℓ) ε_j ε_k ε_ℓ.

4.3 Exercises
Exercise 20. Draw the tensor diagram for a function f : Rn → R that applies an element-
wise nonlinearity (for instance, exp) followed by a summation over the components. Verify
that the diagram corresponds to the conventional formula for the softmax denominator.
Exercise 21. Represent the composition of two functions

f : Rm → R and v : Rn → Rm ,

using tensor diagrams. Then, using the diagrammatic chain rule, derive the expression
for the derivative of f ◦ v with respect to x ∈ Rn .
Exercise 22. For a matrix function A(x) that depends on a vector x, use tensor diagrams
to illustrate the derivative

[A(x)x],
∂x
and explain how the product rule is implemented in the diagram.
Exercise 23. Represent the KL-divergence term for a Variational Autoencoder (VAE)
as
1 
KL(µ, σ) = − 1 + log σ 2 − µ2 − σ 2 ,
2
with parameters given by

µ = W x + c and log σ 2 = W x + c.

Derive the gradient ∇W KL with respect to the weight matrix W . Be sure to keep track
of dimensions and account for elementwise operations.

Exercise 24. Let the softmax function s : R^n → R^n be defined by
s_i(z) = e^{z_i} / Σ_{j=1}^n e^{z_j},   i = 1, . . . , n.
Prove that the Jacobian matrix of s is given by
∂s_i/∂z_j = s_i (δ_{ij} − s_j),
where δij is the Kronecker delta.
Exercise 25. Suppose x(1) , . . . , x(n) ∈ Rd are independent samples from a multivariate
normal distribution N (µ, Σ). Write the log-likelihood function and derive the gradients
∇µ ℓ and ∇Σ ℓ. Then, by setting these gradients to zero, show that the maximum likelihood
estimators are
n n
1 X (i) 1 X (i)
µ̂ = x and Σ̂ = (x − µ̂)(x(i) − µ̂)T .
n i=1 n i=1

Exercise 26. Consider a Gaussian process with covariance matrix K(θ) and the log-marginal
likelihood defined as
L(θ) = −½ ( y^T K(θ)^{-1} y + log det K(θ) ).
Derive the gradient of L with respect to θ, showing that
∂L/∂θ = ½ tr[ (K^{-1} y y^T K^{-1} − K^{-1}) ∂K/∂θ ].
Exercise 27. In logistic regression, let
p_i = σ(w^T x_i) with σ(z) = 1/(1 + e^{-z}),
and consider the negative log-likelihood
J(w) = − Σ_{i=1}^n [ y_i ln p_i + (1 − y_i) ln(1 − p_i) ].
Derive the gradient ∇_w J and the Hessian H = ∇²_w J. In particular, show that
∇_w J = Σ_{i=1}^n (p_i − y_i) x_i,   and   H = Σ_{i=1}^n p_i (1 − p_i) x_i x_i^T.

Exercise 28. Let X ∈ Rn×n be an invertible matrix. Prove that


∂ det(X)
= det(X) (X −1 )T .
∂X
In other words, show that the gradient of det(X) with respect to X is det(X) times the
transpose of X −1 .

Exercise 29. For an invertible matrix X ∈ Rn×n , prove that



ln det(X) = (X −1 )T .
∂X
Briefly discuss why this result is useful in statistical applications such as Gaussian likeli-
hoods.
Exercise 30. Let A ∈ Rn×n be an invertible matrix (which may depend on a parameter).
Starting from the identity I = A−1 A, differentiate both sides to show that

d(A−1 ) = −A−1 (dA) A−1 .


∂A−1
Deduce an expression for the partial derivative ∂Aij .
Exercise 31. Define the scalar function

f (A) = aT A−1 b,

where a, b ∈ Rn are fixed vectors and A ∈ Rn×n is invertible. Using the result from
differentiating the inverse, show that

∇A f (A) = −A−T (a bT )A−T .

Exercise 32. Let X ∈ Rn×n be a symmetric matrix with a simple eigenvalue λ and
corresponding unit eigenvector v. Prove that the derivative of λ with respect to X is
given by
∇X λ = v v T .
Additionally, discuss why this result does not hold when λ has multiplicity greater than
one.
Exercise 33. For a symmetric matrix A ∈ Rn×n , consider the Rayleigh quotient

xT Ax
R(x) = , x ∈ Rn \ {0}.
xT x
Using Lagrange multipliers, show that the stationary values of R(x) correspond to the
eigenvalues of A, and that the maximizer (minimizer) is the eigenvector associated with
the largest (smallest) eigenvalue.
Exercise 34. Let A ∈ Rm×m and B ∈ Rn×n be constant matrices, and let X ∈ Rm×n
be a variable matrix. Prove that

∇X Tr(X T AXB) = AXB T + AT XB.

(Hint: Use the cyclic property of the trace to rearrange terms.)


Exercise 35. (a) Show that for a constant matrix A ∈ Rm×n and variable X ∈ Rm×n ,

Tr(AT X) = A.
∂X
(b) More generally, for constant matrices A ∈ Rp×m and B ∈ Rn×q , prove that

∇X Tr(AXB) = AT B T .

Exercise 36. Let s ∈ Rn be a vector and C ∈ Rn×n be a fixed matrix. Define the matrix
function
F (s) = diag(s) C diag(s),
where diag(s) denotes the diagonal matrix with the entries of s. Express the differential
dF in terms of s, ds, and C, and hence derive an expression for the derivative ∇s F (s).
Exercise 37. Let A ∈ Rp×m , X ∈ Rm×n , and B ∈ Rn×q . Prove the identity

vec(A X B T ) = (B ⊗ A) vec(X).

Discuss how this identity can be used to convert certain matrix derivative problems into
vectorized forms.
Exercise 38. Let
f (x) = xT Ax,
where x ∈ Rn and A ∈ Rn×n is not necessarily symmetric. Show that

∇x f (x) = (A + AT )x.

Then, compute the Hessian ∇2x f (x) and discuss what simplification occurs when A is
symmetric.
p
Exercise 39. Consider the Frobenius norm ∥X∥F = Tr(X T X), for X ∈ Rm×n .
Show that
X
∇X ∥X∥F =
∥X∥F
for X ̸= 0.
Exercise 40. Let Y ∈ Rn×n be a symmetric matrix, and define ϕ(Y ) to be its largest
eigenvalue. If this eigenvalue has multiplicity k > 1, show that the derivative of ϕ(Y ) is
not unique. In particular, demonstrate that any matrix of the form
k
X
G= ui viT ,
i=1

where {ui }ki=1 is any orthonormal basis for the eigenspace corresponding to the largest
eigenvalue, is a valid subgradient of ϕ(Y ). Explain the challenges that arise in defining a
unique gradient in this setting.
Chapter 5

Statistics and Probability

5.1 Definition of Moments


Let x ∈ R^n be a random variable. We write m = E[x] ∈ R^n for the expectation and
M = Var[x] = E[(x − m)(x − m)^T] for the covariance (when these quantities are defined).
In tensor diagrams, we will use square brackets for expectations:
m = [x]   and   M = [(x ⊖ m)(x ⊖ m)].
We will use the circled minus, ⊖, to distinguish the operation from contraction edges.
We can also define the third and fourth centralized moment tensors
M3 = E[(x ⊖ m)^{⊗3}]   and   M4 = E[(x ⊖ m)^{⊗4}].

5.1.1 Expectation of Linear Combinations


General principle: The “linearity of expectation” lets you pull out all parts of the graph
not involving X.

[diagram: the expectation of a network containing three copies of X and the constants A, B, C, D, E equals the same network with the three X's replaced by their joint expectation M3]
where M3 is the expectation¹ E[X ⊗ X ⊗ X], which is an order-9 tensor with no dependence
on the constants A, B, C and D. In practice you would want to name the edges to keep
track of what gets multiplied with what.
1 FIXME: This is different from the notation just introduced above.


We can even use linearity of expectation to push the expectation inside an infinite
sum of tensors, as in the following moment generating function, which relates all the Mk
tensors:
E[e^{⟨x⊖m, t⟩}] = Σ_{k=0}^∞ (1/k!) E[⟨x ⊖ m, t⟩^k] = Σ_{k=0}^∞ (1/k!) E[⟨(x ⊖ m)^{⊗k}, t^{⊗k}⟩] = Σ_{k=0}^∞ (1/k!) ⟨M_k, t^{⊗k}⟩
[diagram: the first few terms of the series, with M2, M3, M4, . . . each contracted with the corresponding number of copies of t]

5.1.2 Linear Forms


The Matrix Cookbook gives the following simple expectation:

" #
A X B A [X] B
E[AXB + C] = AE[X]B + C = (312)
+ C + C

5.1.3 Quadratic Forms


We often prefer to write expectations in terms of the simple centered moments, which we
can do by pulling out the mean:
   
x− (x ⊖ m)− m− m−
= + =M +
x− (x ⊖ m)− m− m−

This makes it easy to handle the quadratic forms from the Matrix Cookbook:

" # " #
T A x ⊖ [A x] A (x ⊖ m)
Var[Ax] = AVar[x]A = (313)
A x ⊖ [A x] A (x ⊖ m)
" #
A (x ⊖ m)
=
A (x ⊖ m)
= A M2 A
  
(x ⊖ m)− m−
E[xT Ax] = Tr(AM ) + mT Am [x A x] = + A (318)
(x ⊖ m)− m−
" #
(x ⊖ m) m
= A+m A
(x ⊖ m)

= M A +m A m

E[(Ax + a)(Bx + b)T ] = AM B T + (Am + a)(Bm + b)T (320)


CHAPTER 5. STATISTICS AND PROBABILITY 39

E[xx^T] = M + mm^T   (321)
E[xa^T x] = (M + mm^T) a   (322)
E[x^T a x^T] = a^T (M + mm^T)   (323)
E[(Ax)(Ax)^T] = A(M + mm^T)A^T   (324)
E[(x + a)(x + a)^T] = M + (m + a)(m + a)^T   (325)
E[(Ax + a)^T(Bx + b)] = Tr(AMB^T) + (Am + a)^T(Bm + b)   (326)
E[x^T x] = Tr(M) + m^T m   (327)
E[x^T A x] = Tr(AM) + m^T A m   (328)
E[(Ax)^T(Ax)] = Tr(AMA^T) + (Am)^T(Am)   (329)
E[(x + a)^T(x + a)] = Tr(M) + (m + a)^T(m + a)   (330)
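As a sanity check, here is a small Monte Carlo verification of (321) and (328) (my own addition, not from the Matrix Cookbook):

    import numpy as np

    rng = np.random.default_rng(0)
    n, N = 3, 1_000_000
    m = rng.standard_normal(n)
    L = rng.standard_normal((n, n))
    M = L @ L.T                                  # covariance
    x = m + rng.standard_normal((N, n)) @ L.T    # samples with mean m, covariance M
    A = rng.standard_normal((n, n))

    # (321): E[x x^T] = M + m m^T
    print(np.max(np.abs(x.T @ x / N - (M + np.outer(m, m)))))   # small
    # (328): E[x^T A x] = Tr(A M) + m^T A m
    emp = np.mean(np.einsum("ni,ij,nj->n", x, A, x))
    print(emp, np.trace(A @ M) + m @ A @ m)                      # nearly equal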

5.1.4 Cubic Forms


 
[Tensor diagram: κ3 + κ2 κ1 (three pairings) + κ1^3.]

When x is a stochastic vector with mean vector m, it can be convenient to expand


the raw third moment in terms of the central moments:

         
[ x x x ] = [ (x ⊖ m)(x ⊖ m)(x ⊖ m) ] + 3 [ (x ⊖ m)(x ⊖ m) ] ⊗ m + 3 [ (x ⊖ m) ] ⊗ m ⊗ m + m ⊗ m ⊗ m
 = M3 + 3 M ⊗ m + m^{⊗3},

since the term with a single (x ⊖ m) factor vanishes.
TODO: The edges from the M ⊗ m term need to be symmetrized.
But this is still a bit of a mess. See also below on Cumulants.
Assume x to be a stochastic vector with independent coordinates, mean m, covariance
M and central moments v3 = E[(x − m)3 ]. Then (see [7])

E[(Ax + a)(Bx + b)T (Cx + c)]


= A diag(B T C)v3
+ (Am + a)Tr(BM C T )
+ AM C T (Bm + b)
+ AM B T (Cm + c)
+ (Am + a)(Bm + b)T (Cm + c)
E[xxT x]
= v3
+ 2M m

+ (Tr(M ) + mT m)m

5.2 Cumulants
Given a random vector x ∈ R^d, its nth Cumulant Tensor, Kn ∈ R^{d×···×d}, is defined by

log E[e^{⟨t,x⟩}] = Σ_{n=1}^∞ (1/n!) ⟨Kn, t^{⊗n}⟩ = ⟨K1, t⟩ + (1/2) ⟨K2, t ⊗ t⟩ + (1/6) ⟨K3, t^{⊗3}⟩ + . . .
The first couple of cumulants are similar to the central moments:

K1 = m,   K2 = M,   K3 = M3,   and   K4 = M4 − M2 ⊗ M2 − M2 ⊗ M2 − M2 ⊗ M2,

where the three subtracted terms run over the three pairings of the four indices.

They have the nice property (which is easy to see from the definition) that they are
additive for independent random variables: Kn (x + y) = Kn (x) + Kn (y). This generalizes
the standard property that the variance of the sum of independent random variables is
the sum of the variances.
We can write the expectations of x^{⊗n} in terms of the cumulants:

[ x x x ] = K3 + (K2 ⊗ K1, summed over the three ways to pair two of the indices) + K1 ⊗ K1 ⊗ K1.

In general the sum is over all the partitions of the set {1, . . . , n}.
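As a hedged illustration of the partition formula, here is a scalar-case check in Python (the helper name and use of sympy's multiset_partitions are my own; for a standard Gaussian only κ2 = 1 is nonzero, so E[g^6] should come out to 15):

    from sympy.utilities.iterables import multiset_partitions

    def moment_from_cumulants(n, kappa):
        # E[x^n] = sum over set partitions P of {1..n} of prod_{B in P} kappa_{|B|}.
        total = 0
        for partition in multiset_partitions(list(range(n))):
            term = 1
            for block in partition:
                term *= kappa.get(len(block), 0)
            total += term
        return total

    gaussian_kappa = {2: 1}
    assert moment_from_cumulants(4, gaussian_kappa) == 3
    assert moment_from_cumulants(6, gaussian_kappa) == 15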
If the entries of x are independent, the off-diagonal entries of the cumulant tensors K1, K2, . . .
are zero. Assume each x_i has cumulants κ1, κ2, κ3, κ4 ∈ R; then

[ x x x x ] = κ4 (one term, all four indices equal)
 + κ3 κ1 (four terms)
 + κ2^2 (three terms)
 + κ2 κ1^2 (six terms)
 + κ1^4 (one term),

where each term corresponds to a partition of the four indices.
Note in particular that if the mean, κ1, is zero, only four terms survive. For n = 5
there are 52 partitions in total, but only 11 survive if κ1 = 0:

κ5 (one term) + κ3 κ2 (ten pairings).

If x is Gaussian, all cumulants of order 3 and higher are zero. For n = 6 there are 203
partitions in total, but only the 15 terms² of the form κ2^3, one per pairing of the six indices, survive in E[x^{⊗6}].
²This is why E[g^6] = 15 for g ∼ N(0, 1).

5.2.1 Quartic Forms


We can use this to compute Var[xT Ax]. We assume A is symmetric:

Var[x^T A x] = E[(x^T A x)^2] − E[x^T A x]^2.

[Diagrammatic derivation: expand E[(x^T A x)^2] using the cumulant expansion of [ x x x x ] above, and subtract the square of E[x^T A x] = κ2 Tr(A) + κ1^2 1^T A 1; the κ2^2 Tr(A)^2, κ2 κ1^2 Tr(A) 1^T A 1 and κ1^4 terms cancel, leaving]

Var[x^T A x] = 2κ2^2 Tr(A^2) + 4κ2 κ1^2 1^T A^2 1 + 4κ3 κ1 1^T A diag(A) + κ4 diag(A)^T diag(A),

where 1 is the all-ones vector and diag(A) is the vector of diagonal entries of A.

The Matrix Cookbook lists this as

Var[x^T A x] = 2µ2^2 Tr(A^2) + 4µ2 c^T A^2 c + 4µ3 c^T A a + (µ4 − 3µ2^2) a^T a   (319)

where c = µ1 · 1 is the mean of x, a = diag(A) is the diagonal of A, and µ4 = κ4 + 3κ2^2
is the fourth central moment.
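A Monte Carlo sketch checking (319) (my own example; it uses i.i.d. Exp(1) entries, whose central moments µ2 = 1, µ3 = 2, µ4 = 9 exercise the skewness and kurtosis terms):

    import numpy as np

    rng = np.random.default_rng(0)
    n, N = 4, 1_000_000
    A = rng.standard_normal((n, n)); A = A + A.T          # symmetric
    x = rng.exponential(size=(N, n))                      # i.i.d. Exp(1) entries
    mu1, mu2, mu3, mu4 = 1.0, 1.0, 2.0, 9.0
    c = np.full(n, mu1)
    a = np.diag(A)

    lhs = np.var(np.einsum("ni,ij,nj->n", x, A, x))
    rhs = (2 * mu2**2 * np.trace(A @ A) + 4 * mu2 * c @ A @ A @ c
           + 4 * mu3 * c @ A @ a + (mu4 - 3 * mu2**2) * a @ a)
    print(lhs, rhs)   # should agree up to Monte Carlo error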

5.3 Weighted Scalar Variable


Let y = w^T x, and let m = E[y]. Then

E[y] = m = w^T E[x]   (321)
E[(y ⊖ m)^2] = w^T M2 w   (322)
E[(y ⊖ m)^3] = ⟨M3, w^{⊗3}⟩   (323)
E[(y ⊖ m)^4] = ⟨M4, w^{⊗4}⟩   (324)
For x ∼ N(0, I), we have E[|y|] = √(2/π) ∥w∥_2, and hence, by Jensen's inequality,

E[|y|^n]^{1/n} ≥ √(2/π) ∥w∥_2.

5.4 Gaussian Moments


For a Gaussian vector x ∼ N(m, M), all odd centered moments vanish (M3 = 0, etc.), and
even moments can be computed via Isserlis' theorem.

For instance:
E[(x ⊖ m)^{⊗4}] = M ⊗ M + M ⊗ M + M ⊗ M,
summing over the three different pairings of the four indices.

5.4.1 Gaussian Integration by Parts


If X is a tensor with Gaussian entries, zero mean, and some covariance, Stein’s lemma
gives the following very general equation, for any differentiable function f :

[ X f(X) ] = [ X X ] [ ∂f(X)/∂X ],   i.e.   E[X_i f(X)] = Σ_j E[X_i X_j] E[∂f(X)/∂X_j].

Combined with the tensor chain rule from chapter 4, this can be a very powerful way to
evaluate many hard expectations.
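For a concrete, non-diagrammatic illustration, here is a Monte Carlo check of Stein's lemma for a Gaussian vector (my own sketch; the choice f(x) = Σ_i tanh(x_i) is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, N = 3, 1_000_000
    L = rng.standard_normal((n, n))
    Sigma = L @ L.T
    x = rng.standard_normal((N, n)) @ L.T        # zero-mean Gaussian, covariance Sigma

    f = np.tanh(x).sum(axis=1)                   # f(x) = sum_i tanh(x_i)
    grad = 1 - np.tanh(x) ** 2                   # gradient of f, one row per sample

    lhs = (x * f[:, None]).mean(axis=0)          # E[x f(x)]
    rhs = Sigma @ grad.mean(axis=0)              # Sigma E[grad f(x)]
    print(lhs, rhs)                              # agree up to Monte Carlo error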

E[xx^T] = Σ + mm^T   (377)
E[x^T A x] = Tr(AΣ) + m^T A m   (378)
Var[x^T A x] = Tr[AΣ(A + A^T)Σ] + m^T(A + A^T)Σ(A + A^T)m   (379)
E[(x − m′)^T A (x − m′)] = (m − m′)^T A (m − m′) + Tr(AΣ)   (380)

If Σ = σ^2 I and A is symmetric, then

Var[x^T A x] = 2σ^4 Tr(A^2) + 4σ^2 m^T A^2 m   (381)

Assume x ∼ N(0, σ^2 I) and A and B to be symmetric, then

Cov[x^T A x, x^T B x] = 2σ^4 Tr(AB)   (382)

5.4.2 Cubic forms


Assume x to be a stochastic vector with independent coordinates, mean m and covariance M. Then

E[x b^T x x^T] = m b^T (M + mm^T) + (M + mm^T) b m^T + b^T m (M − mm^T)   (383)

5.4.3 Mean of Quartic Forms

E[xxT xxT ] = 2(Σ + mmT )2 + mT m(Σ − mmT )



+ Tr(Σ)(Σ + mmT )

E[xxT AxxT ] = (Σ + mmT )(A + AT )(Σ + mmT )


+ mT Am(Σ − mmT ) + Tr[AΣ](Σ + mmT )

E[xT xxT x] = 2Tr(Σ2 ) + 4mT Σm + (Tr(Σ) + mT m)2

E[xT AxxT Bx] = Tr[AΣ(B + B T )Σ] + mT (A + AT )Σ(B + B T )m


+ (Tr(AΣ) + mT Am)(Tr(BΣ) + mT Bm)

E[aT xbT xcT xdT x] = (aT (Σ + mmT )b)(cT (Σ + mmT )d)


+ (aT (Σ + mmT )c)(bT (Σ + mmT )d)
+ (aT (Σ + mmT )d)(bT (Σ + mmT )c) − 2aT mbT mcT mdT m

E[(Ax + a)(Bx + b)T (Cx + c)(Dx + d)T ]


= [AΣB T + (Am + a)(Bm + b)T ][CΣDT + (Cm + c)(Dm + d)T ]
+ [AΣC T + (Am + a)(Cm + c)T ][BΣDT + (Bm + b)(Dm + d)T ]
+ (Bm + b)T (Cm + c)[AΣDT − (Am + a)(Dm + d)T ]
+ Tr(BΣC T )[AΣDT + (Am + a)(Dm + d)T ]

E[(Ax + a)T (Bx + b)(Cx + c)T (Dx + d)]


= Tr[AΣ(C T D + DT C)ΣB T ]
+ [(Am + a)T B + (Bm + b)T A]Σ[C T (Dm + d) + DT (Cm + c)]
+ [Tr(AΣB T ) + (Am + a)T (Bm + b)][Tr(CΣDT ) + (Cm + c)T (Dm + d)]

See [7].

5.4.4 Mixture of Gaussians


For a mixture of Gaussians,

x ∼ Σ_k π_k N(m_k, M_k),

the moments are weighted sums:

E[x] = Σ_k π_k m_k,   Var[x] = Σ_k π_k (M_k + m_k m_k^T) − (Σ_k π_k m_k)(Σ_k π_k m_k)^T.

Higher moments similarly combine via linearity. In the Matrix Cookbook's notation (with weights ρ_k):

E[x] = Σ_k ρ_k m_k   (384)

Cov[x] = Σ_k Σ_{k′} ρ_k ρ_{k′} (Σ_k + m_k m_k^T − m_k m_{k′}^T)   (385)

5.4.5 Derivatives
Derivatives of moments with respect to m or M can be found by differentiating under
the integral sign, and using Stein’s lemma for Gaussian cases.

5.5 Exercises
Exercise 41.

E[(Ax + a)(Ax + a)^T(Ax + a)] = A diag(A^T A) v_3
 + [2 A M A^T + (Am + a)(Am + a)^T](Am + a)
 + Tr(A M A^T)(Am + a)

E[(Ax + a) b^T (Cx + c)(Dx + d)^T] = (Am + a) b^T (C M D^T + (Cm + c)(Dm + d)^T)
 + (A M C^T + (Am + a)(Cm + c)^T) b (Dm + d)^T
 + b^T (Cm + c) (A M D^T − (Am + a)(Dm + d)^T)

Exercise 42. Find more identities in the Matrix Reference Manual and try to prove
them. Also try to verify your derivations using tensorgrad.
Chapter 6

Determinant and Inverses

6.1 Determinant
It’s convenient to write the determinant in tensor notation as
1
det(A) = A ··· A
n!
where i1 i2 ... in = εi1 ,...,in is the rank-n Levi-Civita tensor defined by
(
sign(σ) σ = (i1 , . . . , in ) is a permutation
εi1 ,...,in =
0 otherwise.

To see that the definition makes sense, let's first consider

det(I) = (1/n!) Σ_{i_1,...,i_n, j_1,...,j_n} ε_{i_1,...,i_n} ε_{j_1,...,j_n} [i = j] = (1/n!) Σ_{i_1,...,i_n} ε_{i_1,...,i_n}^2 = 1.

In general we get from the permutation definition of the determinant:

Σ_{i_1,...,i_n, j_1,...,j_n} ε_{i_1,...,i_n} ε_{j_1,...,j_n} A_{i_1,j_1} · · · A_{i_n,j_n}
 = Σ_{σ,τ} sign(σ) sign(τ) A_{σ_1,τ_1} · · · A_{σ_n,τ_n}
 = Σ_σ sign(σ) Σ_τ sign(τ) A_{σ_1,τ_1} · · · A_{σ_n,τ_n}
 = Σ_σ sign(σ)^2 det(A)
 = n! det(A).
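As a brute-force sanity check (my own sketch, only sensible for small n), one can build the Levi-Civita tensor explicitly and contract everything with einsum:

    import itertools, math
    import numpy as np

    def levi_civita(n):
        # Rank-n Levi-Civita tensor: sign of the permutation on permutations, 0 elsewhere.
        eps = np.zeros((n,) * n)
        for perm in itertools.permutations(range(n)):
            inversions = sum(p > q for i, p in enumerate(perm) for q in perm[i + 1:])
            eps[perm] = (-1) ** inversions
        return eps

    n = 3
    A = np.random.randn(n, n)
    eps = levi_civita(n)
    det = np.einsum("ijk,abc,ia,jb,kc->", eps, eps, A, A, A) / math.factorial(n)
    assert np.isclose(det, np.linalg.det(A))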

The definition generalizes to Cayley’s “hyper determinants” by . . . .


A curious property is that

[Tensor diagram omitted: an identity relating two contractions of n copies of A with the Levi-Civita tensor.]

(18) det(A) = Π_i λ_i

(19) det(cA) = c^n det(A)

(20) det(A) = det(A^T)

(21) det(AB) = det(A) det(B)

(22) det(A^{−1}) = 1/det(A)

(23) det(A^n) = det(A)^n

(24) det(I + uv^T) = 1 + u^T v

(The corresponding diagram proofs are omitted here.)

6.2 Inverses
Might be reduced, unless cofactor matrices have a nice representation?
Chapter 7

Advanced Derivatives

7.1 Derivatives of vector norms


7.1.1 Two-norm

d/dx ∥x − a∥_2 = (x − a)/∥x − a∥_2   (7.1)

d/dx (x − a)/∥x − a∥_2 = I/∥x − a∥_2 − (x − a)(x − a)^T/∥x − a∥_2^3   (7.2)

d/dx ∥x∥_2^2 = d/dx (x^T x) = 2x   (7.3)


7.2 Derivatives of matrix norms


7.3 Derivatives of Structured Matrices
7.3.1 Symmetric
7.3.2 Diagonal
7.3.3 Toeplitz

7.4 Derivatives of a Determinant


7.5 General forms
7.6 Linear forms
7.7 Square forms
7.8 From Stack Exchange
Let F(t) = det(I + tA); then F′(0) = Tr(A) and F′′(0) = Tr(A)^2 − Tr(A^2).
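A quick finite-difference check of this claim (my own sketch):

    import numpy as np

    A = np.random.randn(4, 4)
    F = lambda t: np.linalg.det(np.eye(4) + t * A)
    h = 1e-5
    F1 = (F(h) - F(-h)) / (2 * h)              # approximates F'(0)
    F2 = (F(h) - 2 * F(0) + F(-h)) / h**2      # approximates F''(0)
    assert np.isclose(F1, np.trace(A), atol=1e-4)
    assert np.isclose(F2, np.trace(A)**2 - np.trace(A @ A), atol=1e-3)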

7.9 Derivatives of an Inverse


7.9.1 Trace Identities
∂/∂X Tr(A X^{−1} B) = −(X^{−1} B A X^{−1})^T = −X^{−T} A^T B^T X^{−T}

Assume B and C to be symmetric, then

∂/∂X Tr[(X^T C X)^{−1} A] = −(C X (X^T C X)^{−1})(A + A^T)(X^T C X)^{−1}

∂/∂X Tr[(X^T C X)^{−1} (X^T B X)] = −2 C X (X^T C X)^{−1} X^T B X (X^T C X)^{−1} + 2 B X (X^T C X)^{−1}

∂/∂X Tr[(A + X^T C X)^{−1} (X^T B X)] = −2 C X (A + X^T C X)^{−1} X^T B X (A + X^T C X)^{−1} + 2 B X (A + X^T C X)^{−1}

7.10 Derivatives of Eigenvalues


7.11 Exercises
Exercise 43. Find the derivative of
 
f(x, Σ, µ) = 1/√((2π)^k |Σ|) · exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ))

with respect to Σ.
Chapter 8

Special Matrices

8.0.1 Block matrices


Stuff like Schur complements is interesting. But can we say anything useful using tensor
diagrams?

8.0.2 The Discrete Fourier Transform Matrix


I think FFT can be nicely described with diagrams
 1 ⊗n
Let’s start with the Hadamard matrix: Hn = 11 −1 . Hm. It’s just a bunch of
matrices below each other, kinda boring.
What about the FFT? Does that require a bit more?

8.0.3 Fast Kronecker Multiplication


Say we want to compute (A1 ⊗A2 · · · An )x, where Ai is a ai ×ai matrix, and x ∈ Ra1 a2 ···an .
If we first compute the Kronecker product, and then the matrix-vector multiplication, this
would take (a1 · · · an )2 time.
Instead we can reshape x into an a_1 × · · · × a_n tensor and perform the multiplication

[Tensor diagram: the reshaped x with A_1, . . . , A_n contracted onto its n edges]

by contracting the a_i edges one by one. This takes time

a_1^2 (a_2 · · · a_n) + a_2^2 (a_1 a_3 · · · a_n) + · · · + a_n^2 (a_1 · · · a_{n−1}) = (a_1 + · · · + a_n)(a_1 · · · a_n),

which is the basis of many fast algorithms, as we will see.
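A minimal numpy sketch of this idea (the helper name kron_mv is my own):

    import numpy as np
    from functools import reduce

    def kron_mv(As, x):
        # Compute (A_1 kron ... kron A_n) @ x by reshaping x and
        # contracting one axis at a time, never forming the full Kronecker product.
        dims = [A.shape[1] for A in As]
        X = x.reshape(dims)
        for i, A in enumerate(As):
            X = np.moveaxis(np.tensordot(A, X, axes=([1], [i])), 0, i)
        return X.reshape(-1)

    As = [np.random.randn(d, d) for d in (2, 3, 4)]
    x = np.random.randn(2 * 3 * 4)
    assert np.allclose(kron_mv(As, x), reduce(np.kron, As) @ x)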


Hadamard
The Hadamard matrix is defined as H_{2^n} = H_2^{⊗n} = H_2 ⊗ · · · ⊗ H_2, where H_2 = (1 1; 1 −1).
For example

H_4 = H_2 ⊗ H_2 = (H_2 H_2; H_2 −H_2) = (1 1 1 1; 1 −1 1 −1; 1 1 −1 −1; 1 −1 −1 1).
This gives a very simple tensor diagram representation as:

[Tensor diagram: H_{2^n} drawn as n parallel copies of H_2, one per axis.]

The Fast Hadamard Transform (FHT) is usually described recursively by

H_{2^n} x = (H_{2^{n−1}} H_{2^{n−1}}; H_{2^{n−1}} −H_{2^{n−1}}) (x^{(1)}; x^{(2)}),

where x^{(1)} and x^{(2)} are the first and second halves of x. Because of the redundancy in the matrix
multiplication (it only depends on H_{2^{n−1}} x^{(1)} and H_{2^{n−1}} x^{(2)}), the algorithm computes H_N x
in O(N log N) time.
Alternatively we could just use the general fact, as described above, where a_i = 2 for
all i. Then the “fast Kronecker multiplication” method takes time (a_1 a_2 · · · a_n)(a_1 + a_2 + · · · + a_n) = 2^n · 2n = 2N log_2 N.
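For instance, a sketch of the resulting O(N log N) Hadamard transform, which is just the fast Kronecker multiplication above with every a_i = 2 (my own code):

    import numpy as np
    from functools import reduce

    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])

    def fht(x):
        # Apply H2 along each of the n axes of x viewed as a 2 x ... x 2 tensor.
        n = int(np.log2(len(x)))
        X = x.reshape((2,) * n)
        for i in range(n):
            X = np.moveaxis(np.tensordot(H2, X, axes=([1], [i])), 0, i)
        return X.reshape(-1)

    x = np.random.randn(8)
    assert np.allclose(fht(x), reduce(np.kron, [H2] * 3) @ x)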

Fourier
The Discrete Fourier Matrix is defined by (F_N)_{i,j} = ω^{ij}, where ω = e^{−2πi/N}:

F_N = (1 1 1 1 · · · 1;
 1 ω ω^2 ω^3 · · · ω^{N−1};
 1 ω^2 ω^4 ω^6 · · · ω^{2(N−1)};
 1 ω^3 ω^6 ω^9 · · · ω^{3(N−1)};
 . . . ;
 1 ω^{N−1} ω^{2(N−1)} ω^{3(N−1)} · · · ω^{(N−1)(N−1)})

 = exp(−2πi/N · arange(N) ⊗ arange(N)),

applying exp entrywise to the outer product of arange(N) with itself.

TODO: Show how the matrix can be written with the function notation.

The Good-Thomas Fast Fourier Transform (FFT) uses a decomposition based on
the Chinese Remainder Theorem:

F_N = P_1 (F_{p_1^{i_1}} ⊗ F_{p_2^{i_2}} ⊗ · · · ⊗ F_{p_n^{i_n}}) P_2,

where N = p_1^{i_1} p_2^{i_2} · · · p_n^{i_n} is the prime factorisation of N, and P_1 and P_2 are some permutation matrices.
Using fast Kronecker multiplication, this takes (p_1^{i_1} + · · · + p_n^{i_n}) N time.
By padding x with zeros, we can increase N by a constant factor to get a string of
n = O(log(N)/log log(N)) primes, the sum of which is O(log(N)^2). The
complete algorithm thus takes time O(N log(N)^2). Next we will see how to reduce this
to O(N log N).
The classical Cooley-Tukey FFT algorithm uses a recursion:

F_N = (I I; I −I) (I 0; 0 D_{N/2}) (F_{N/2} 0; 0 F_{N/2}) (even-odd permutation),

where D_N = diag(1, ω_N, ω_N^2, . . .). The even-odd permutation moves all the even-indexed values to
the start. If we reshape I_{2^n} as I_2 ⊗ · · · ⊗ I_2, this permutation is just a cyclic rotation of the axes, or in pytorch:
x.permute([3,0,1,2]). Also note that (I I; I −I) = H_2 ⊗ I and (F_{N/2} 0; 0 F_{N/2}) = I_2 ⊗ F_{N/2}.
So we can write in tensor diagram notation:

[Tensor diagram: F_N decomposed into H_2 factors and diagonal matrices D_{N/2^k}, one layer per level of the recursion.]

Since one can multiply with the permutation and diagonal matrices in linear time, the
O(N log N) time complexity follows from the same argument as for Hadamard.
Note there are a bunch of symmetries, such as by transposing (horizontal flip), since
the matrix is symmetric. Or by pushing the permutation to the left side.
We don’t have to split the matrix in half, we can also split it in thirds, fourths, etc.
With this generalized Cooley-Tukey algorithm, we get the following diagram:
Fn0
Fn1
FN = Fn2 ,
Fn3
F (n0 ,n1 n2 n3 ) F (n1 ,n2 n3 ) F (n2 ,n3 )

where n0 n1 n2 n3 = N . Here we replaced the DN matrix with the “generalized” Fourier


(a,b)
matrix F (a,b) matrix, which is defined as Fj,k = e−2πijk/(ab) , and we reshaped as nec-

essary. In the simple case of where we split in just two parts of roughly n1 = n2 = N ,
this is also called “Four step FFT” or Bailey’s FFT algorithm.

We can use the property F_{a,bc} = F_{a,b} • F_{a,c}^{1/b} to simplify the diagram further:

[Tensor diagram: F_N written using only the single-factor matrices F_{n_i} and the pairwise twiddle factors F^{n_i, n_j}.]

In the simple case where we split in 2 every time, this is also called the “Quantum FFT”
algorithm.
We hid some stuff above, namely that the matrices should be divided by different N s.
Note that this figure may look different from some FFT diagrams you have seen.
These typically look like this:

[Standard radix-2 FFT butterfly diagram: the inputs x[0..7] are split into even and odd halves, each transformed by an N/2-point DFT into E[0..3] and O[0..3], and combined into the outputs X[0..7].]

Such figures have 2^n rows. The tensor diagram only has n rows (or log_2 N).

Multi-dimensional Fourier Transform


This is just taking the Fourier transform along each axis.

8.0.4 Hermitian Matrices and skew-Hermitian


Complex. Skip

8.0.5 Idempotent Matrices


Skip

8.0.6 Orthogonal matrices


Skip

8.0.7 Positive Definite and Semi-definite Matrices


Skip

8.0.8 Singleentry Matrix, The


Describes the matrix J. All of this is trivial with diagrams.

8.0.9 Symmetric, Skew-symmetric/Antisymmetric


Could introduce Penrose’s symmetric tensors here?

8.0.10 Toeplitz Matrices


Could talk about the convolution tensor here...

8.0.11 Units, Permutation and Shift


Not that interesting...

8.0.12 Vandermonde Matrices


Does this have a nice description? Not a lot of properties are given in the Cookbook.
Chapter 9

Decompositions

9.1 Higher-order singular value decomposition


Say we have an order n tensor A. We “unfold” A along each dimension. This means
pulling the edge i to the left, and flattening the rest to the right. Then we compute the
SVD, U S V^T. Here U is a square matrix, which we keep. We multiply the i-th edge of A by
U^T (which is also the inverse of U). The result is a “core” tensor as well as a sequence of
U tensors. If we want a more compact SVD, we can make each U low rank, like in a normal
SVD. There is also the “interlacing computation”, where we multiply the U^T's onto A as
we go along.
For order 3 tensors, this method is called a “Tucker decomposition”.
If the “core matrix” is diagonal, this is called tensor rank decomposition. If we were
good at that, we could use it to factor I ⊗3 to get better matrix multiplication algorithms.
Unfortunately tensor rank decomposition is NP hard.
I guess HOSVD gives a rank decomposition if we diagonalize the core tensor. It just
won’t be an efficient one.

9.2 Rank Decomposition


9.2.1 Border Rank
The border rank of a tensor is the smallest rank of a tensor that is close to it. TODO:
Example where the border rank is much smaller than the rank.

9.3 Fast Matrix Multiplication


Strassen defines 3 tensors, S_A, S_B and W, of shape 7 × 2 × 2. [The explicit lists of seven 2 × 2 coefficient matrices, one per tensor, are omitted here; they are the coefficients of Strassen's seven products.]

These tensors have the neat property that they factor I2 ⊗ I2 ⊗ I2 :

[Tensor diagram: contracting S_A, S_B and W along their shared length-7 axis yields the matrix multiplication tensor I_2 ⊗ I_2 ⊗ I_2.]

To multiply two matrices, A and B, faster than the normal n^3 time, we reshape them as
block matrices of shape (2, n/2, 2, n/2) and use Strassen's tensor:

[Tensor diagram: A and B contracted through S_A, S_B and W acting on the two block axes.]

Contracting the edges in the right order uses only (7/8) n^3 + O(n^2) operations.
If we instead reshape to (2, 2, . . . , 2) and use Strassen's tensor along each axis, the work
is reduced by a factor (7/8)^{log_2(n)}, giving us matrix multiplication in time n^{3 + log_2(7/8)} = n^{2.80735}.

Contracting the double edges, S_A − A and S_B − B, both take O(n^2) time.

It remains to verify that this is actually faster than the naive matrix multiplication:
Contracting S_A − A takes 7 · 2^2 (n/2)^2 operations, and likewise S_B − B. Next we contract
S_A A − S_B B, which takes 7 (n/2)^3 time. And finally we contract the edge with W, which
takes 2^2 · 7 (n/2)^2. The important term is the cubic (7/8) n^3, which, if instead done recursively,
leads to the classical O(n^{log_2 7}) algorithm.
FIXME: What “edge with W ”? I think we have to/want to contract the hyperedge
with W immediately?
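As a concrete check of the 7-multiplication idea, here is the standard textbook form of Strassen's products for one level of 2 × 2 blocking (my own sketch; this is not necessarily the same basis as the S_A, S_B, W tensors above):

    import numpy as np

    def strassen_2x2_blocks(A, B):
        # One level of Strassen: 7 block products instead of 8.
        n = A.shape[0] // 2
        A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
        M1 = (A11 + A22) @ (B11 + B22)
        M2 = (A21 + A22) @ B11
        M3 = A11 @ (B12 - B22)
        M4 = A22 @ (B21 - B11)
        M5 = (A11 + A12) @ B22
        M6 = (A21 - A11) @ (B11 + B12)
        M7 = (A12 - A22) @ (B21 + B22)
        C11 = M1 + M4 - M5 + M7
        C12 = M3 + M5
        C21 = M2 + M4
        C22 = M1 - M2 + M3 + M6
        return np.block([[C11, C12], [C21, C22]])

    A, B = np.random.randn(8, 8), np.random.randn(8, 8)
    assert np.allclose(strassen_2x2_blocks(A, B), A @ B)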

Other
If we instead wrote A and B using (n, m) and (m, p) shaped blocks, we could factor
I_n ⊗ I_m ⊗ I_p and get a matrix multiplication algorithm using the same approach as the
Strassen (2, 2, 2) tensors above. Lots of papers have investigated this problem, which has
led to the best algorithms by Josh Alman and others. For example, DeepMind found a
rank 47 factorization of I_3 ⊗ I_4 ⊗ I_5.
Maybe a more interesting example is the (4, 4, 4) tensor, for which they also find a rank
47 factorization. An easy way to create a rank 49 factorization is to take Strassen and double
it. Would this be a nice thing to show? Maybe too messy? Well, actually their rank 47
construction only works in the “modular” case. The (3, 4, 5) one is general.
Chapter 10

Machine Learning Applications

10.1 Least Squares


10.2 Hessian of Cross Entropy Loss
10.3 Convolutional Neural Networks
10.4 Transformers / Attention
10.5 Tensor Sketch
10.6 Reinforcement Learning
When considering imperfect-information games, let's try to approximate each player's
strategy with a low-rank tensor. We can use the tensor sketch to approximate the tensor.
Basically that means picking a random outer product of vectors and multiplying it onto the
derivative. This should be fast...

Chapter 11

Tensor Algorithms

11.1 Tensor Contraction Orders


Throughout the previous chapters, we drew a lot of tensor networks. While the theo-
retical aspect is sufficient for some applications, there are indeed research areas, mostly
computational ones such as quantum circuit simulation, where one wants to actually
compute the tensor underlying the tensor network. This is done by the so-called tensor
contractions. To see what this means, consider the following 2-tensor diagram, where we
contract two tensors A and B:
[Diagram: a 2 × 2 × 4 tensor A and a 4 × 2 matrix B sharing the size-4 edge are contracted into a 2 × 2 × 2 tensor A′.]

Note that we also write the sizes of the respective dimensions, i.e., A is a 2 × 2 × 4-
tensor, while B is a 4 × 2-matrix. Now, the contraction cost is defined as the number of
FLOPS performed during the contraction; as a convention, this is the number of scalar
multiplications.1 In our example, the contraction cost is 2 × 2 × 4 × 2 = 32, i.e., we simply
multiply the dimension sizes.
The previous example was rather small. However, tensor networks in the wild tend to
have hundreds of tensors. Naturally, these also need to be contracted to a single tensor.
Now comes the interesting part: The order in which we perform these contraction can
have a tremendous impact on the execution time. To get an intuition for this, consider
an extended example of a 3-tensor diagram:

[Diagram: a chain of three tensors A – B – C, with edge sizes 2, 4, 1 and 2 as used in the cost calculation below.]

If we were to contract A with B first (and the resulting tensor with C), we would have a
cost of 22 × 4 × 2 + 23 × 1 × 22 = 32 + 32 = 64 multiplications, whereas performing the
contraction between B and C at the beginning would result in 23 × 1 × 22 + 22 × 4 × 23 =
32 + 128 = 160 multiplications in total. Hence, the first order is much better.
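numpy's einsum can report the contraction path it chooses, which makes this effect easy to see; the shapes below are one possible reading of the example above (my own choice):

    import numpy as np

    A = np.random.randn(2, 2, 4)   # indices i, j, k
    B = np.random.randn(4, 2, 1)   # indices k, l, m
    C = np.random.randn(1, 2, 2)   # indices m, n, o

    # einsum_path reports the chosen pairwise contraction order and its cost.
    path, info = np.einsum_path("ijk,klm,mno->ijlno", A, B, C, optimize="greedy")
    print(path)   # e.g. ['einsum_path', (0, 1), (0, 1)]: contract A with B first
    print(info)   # includes the theoretical FLOP count of the chosen order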
It is thus natural to ask: Can we always find the optimal contraction order?
1 This assumes a naive implementation of the operation, i.e., nested loops over the dimensions.


11.1.1 Algorithms
We summarize well-known results and frameworks to find good contraction orders.

Optimal Algorithms
Indeed, finding the optimal contraction order for arbitrary tensor network shapes is pretty
hard; better said, NP-hard [3]. There is a well-known exact algorithm running in O(3n )-
time [17], where n is the number of tensors in the network. This finds the optimal
contraction order with respect to the total contraction cost. If one is interested in mini-
mizing the size of the largest intermediate tensor, i.e., to optimize for the memory used
during the execution, this can be done faster in O(2n n3 )-time [22].
The good news is that for some restricted shapes of tensor networks, there are indeed
efficient algorithms. A classic example is the dynamic programming solution for the
matrix-chain problem [4], which is just our problem, but only for matrices. The naive
algorithm runs in O(n3 )-time, but can be implemented in O(n2 )-time [26] (or even in
O(n log n)-time [11, 12]). Another shape for which polynomial-time algorithms exist is
that of tree tensor networks [25, 23].
Another prominent way to optimize contraction orders is via the tree decomposition of
the line graph representation of the tensor network [15, 5, 19]. In particular, this results
in a contraction order with a maximal intermediate tensor rank equal to the treewidth
of the tree decomposition. Loosely speaking, treewidth measures how tree-like a graph
is. This does not directly solve our problem since finding the tree decompositions of
the smallest treewidth is itself hard [14]. Two well-known frameworks to find good tree
decompositions are QuickBB [6] and FlowCutter [24, 9].

Best-Effort Algorithms
However, once we want to contract arbitrary network shapes, the best we can do is to fall
back on heuristics or approximations. Two well-known frameworks are opt_einsum [20]
and cotengra [7], which aim to optimize the contraction order (also referred to as “con-
traction path”) of arbitrary einsums: For tensor networks where the optimal algorithm
would be too slow, opt_einsum applies an ad-hoc greedy algorithm, while cotengra uses
hypergraph partitioning [18], along with a Bayesian optimization approach, which has
been later refined [21]. Other algorithms adopted from database query optimization are
implemented in netzwerk [23]. Another option is to learn the best contraction orders,
e.g., using reinforcement learning [16].
Naturally, the above heuristics do not come with an optimality guarantee. There
exists a (1 + ε)-approximation O∗ (2n /ε)-time algorithm that minimizes the sum of the
intermediate tensor sizes [22].
Worth mentioning is the SQL view of einsum [2]: it allows one to perform the entire tensor
network contraction on any database engine. Indeed, for some einsum classes, running
the contractions on state-of-the-art database engines, such as Hyper [13], can be much
faster than using the de-facto numpy-array implementation.
Chapter 12

Tensorgrad

Implementation details

12.1 Isomorphisms
There is actually a concept of “tensor isomorphism”, but it’s basically just the same as
graph isomorphism.
We need to understand isomorphisms in many different parts of the code.

12.1.1 In Products
Cancelling / combining equal parts of a product: this is actually extra hard, because
you have to collect a subset of nodes that constitute isomorphic subgraphs. Right now
we hack this a bit by just considering separate components of the product.

Figure 12.1: Combining equal parts of a product. [Diagram omitted.]

Basically the problem is:


1. You are given a multigraph G with nodes V and edges E.

2. Nodes and edges are all labeled.


3. You are to find two disjoint subsets V1 and V2 of V such that the subgraphs G1 and
G2 induced by V1 and V2 are isomorphic. Also, under the isomorphism, the labels
of the nodes and edges in G1 and G2 are the same.

The problem is probably NP-hard, but it might still have an algorithm that's faster
than the 2^n of trying all subsets. In particular, we might modify the VF2 algorithm, which
iteratively tries to match nodes in G1 and G2. The NetworkX library already has a
GraphMatcher, which searches for isomorphic subgraphs. It might be extendable to our
problem... But honestly I don't know if we even want to solve this problem in its most
general form, since it corresponds a bit to factoring the graph. And we don't do factoring, just
as we don't do inverse distribution.
In either case, it’s clear that we need to be able to compare nodes and edges for
isomorphism.
Also, the basic use case of isomorphism canonicalization in products is simply to compute
the canonical product itself from its parts. Part of our approach here is taking the outer
edges and turning them into nodes, so they can be colored.

12.1.2 In Sums
When deciding whether A + B is equal to 2A we need to check if A and B are isomorphic.
But we also need to do this under the current renaming of the edges. That’s why you
can’t just transform A + AT = 2A.
The way it actually works in my code is
def key_fn(t: Tensor):
    # Align tensor edges to have the same order, using Sum's order as reference.
    canons = [t.canonical_edge_names[t.edges.index(e)] for e in self.edges]
    return hash((t.canon,) + tuple(canons))

ws_tensors = TensorDict(key_fn=key_fn, default_fn=int)
for w, t in zip(weights, tensors):
    ws_tensors[t] += w
ws_tensors = [(w, t) for t, w in ws_tensors.items()]

which says that I’m using for a hash, the canonical form of the tensor, plus the canonical
form of the edges in the order of the edges in the sum. These are basically the orbits,
meaning that if the summed tensor has a symmetry, we are allowed to "flip" it to make
the summands isomorphic.
In the “compute canonical” method, we do more or less the same, but we also include
the weights.
def _compute_canonical(self):
    hashes = []
    for e in self.edges:
        canons = [t.canonical_edge_names[t.edges.index(e)] for t in self.tensors]
        hashes.append(hash(("Sum",) + tuple(sorted(zip(self.weights, canons)))))
    base = hash(("Sum", len(self.tensors)))
    hashes = [hash((base, h)) for h in hashes]
    return base, hashes

In the future we want to use symmetry groups instead. What would be the symmetry
group of a sum? It’s the diagonal of the product of the symmetry groups of the summands.
How can we find the generators of this group? Maybe we should just construct some joint
graph and then find the automorphisms of that graph.
Alternatively we can use sympy. It is not known whether this problem is solvable
in polynomial time. I think Babai proved that it is quasi-polynomial but not with a
practical algorithm. Incidentally the problems of intersections of subgroups, centralizers
of elements, and stabilizers of subsets of {1, . . . , n} have been proved (by Eugene Luks)
to be polynomially equivalent.
Actually making a graph and using nauty is a really good idea, since it would be able
to detect that A + AT is symmetric. Just taking the intersection of the automorphism
groups of the summands would not find that.
Another option is to convert the sum to a function... But no, that’s weird. That
would require me to support functions with arbitrary numbers of inputs, which is not
currently the case.

12.1.3 In Evaluation
When evaluating a tensor, we can look at the graph of the tensor and see if it’s isomorphic
to a previously evaluated tensor. This is an example where we don’t really need a canonical
form, but an approximate hash plus vf2 would be fine. Also note that in this case we
don’t care about the edge renaming, because we can just rename the edges before we
return the tensor. E.g. if we have already evaluated A, we can use that to get AT easily.

12.1.4 In Variables
In variables we include the name of the variable in the hash. Basically we assume that
variables named the same refer to the same data.
base = hash(("Variable", self.name))
return base, [hash((base, e)) for e in self.original_edges]

For the original canonical edge names, we use the edge names before renaming. This
means that, in the case of A^T, it will have the same hash as A. But because it's renamed,
the t.edges.index call in the Sum will flip the edges.
We could imagine variables taking an automorphism group as an argument, which
would allow us to define variables with different symmetries. Such as a symmetric matrix
A where A + AT is actually 2A.

12.1.5 In Constants
When computing the canonical form of a constant, like Zero or Copy we don’t care about
the edge names. I guess because the constants we use are all maximally symmetric? We
currently include the constants tag, which is the hash of the variable that it came from,
if any.

12.1.6 In Functions
One issue is that while the original names are usually part of the function definition,
the new edges added by differentiation are often automatically generated based on the
context, so they shouldn’t really be part of the canonical form.
In contrast to Sum, we don’t sort the canons here, since the order of the inputs
matters.
Maybe functions should be allowed to transform the symmetry group? E.g. if we have
a function that takes a symmetric matrix and returns a symmetric matrix, we should be
able to use the symmetry group of the input to simplify the output.

12.1.7 In Derivatives
All we do is hash the tensor and the wrt variable, and then add new edges for the derivative.

12.1.8 Other
For some tensors there might be edge dimension relations that aren’t equivalences. For
example, a flatten tensor would have the “out” edge dimension equal to the product of
the “in” edge dimensions.
In a previous version I had every tensor register a “callback” function. Whenever
an edge dimension “became available”, the tensor would get a chance to emit new edge
dimensions. However, this was a lot more work for each tensor to implement, and not
needed for any of the existing tensors.

12.2 Renaming
This is an important part of the code.

12.3 Evaluation
An important part of evaluation is determining the dimension of each edge. To do this, I’m
basically creating a full graph of the tensor, using a function called edge_equivalences,
which yields a list of tuples ((t1, e1), (t2, e2)), indicating that edge e1 of tensor t1 is equivalent
to edge e2 of tensor t2 . Note that the same edge name can appear multiple times in the
graph, so we need to keep track of the tensor as well.
For variables, since the user gives edge dimensions in terms of variables, it’s important
to keep track of renamed edge names:
for e1, e2 in zip(self.original_edges, self.edges):
    yield (self, e1), (self, e2)

For constants, there might be some equivalences based on tensors that the constant
was derived from.
def edge_equivalences(self):
    if self.link is not None:
        yield from self.link.edge_equivalences()
        for e in self.link.edges:
            if e in self.edges:
                yield (self, e), (self.link, e)

For the copy tensor, everything is equivalent:


def edge_equivalences(self):
    yield from super().edge_equivalences()
    for e in self.edges[1:]:
        yield (self, self.edges[0]), (self, e)

For functions we can’t really say anything about the edges of the function itself
(self.edges_out), but at least we can say something about the broadcasted edges.
for t, *inner_edges in self.inputs:
    yield from t.edge_equivalences()
    for e in t.edges:
        if e not in inner_edges:
            yield (t, e), (self, e)

We could maybe also say that input edges with the same name are equivalent?
For products, we look at each edge (t1 , e, t2 ) and yield (t1 , e), (t2 , e). However for the
free edges, (t, e), we match them with ourselves, (t, e), (self, e).
def edge_equivalences(self):
    pairs = defaultdict(list)
    for t in self.tensors:
        yield from t.edge_equivalences()
        for e in t.edges:
            pairs[e].append(t)
    for e, ts in pairs.items():
        if len(ts) == 1:
            yield (self, e), (ts[0], e)
        else:
            t1, t2 = ts
            yield (t1, e), (t2, e)

Similarly, for sums, everything is just matched with ourselves:


def edge_equivalences(self):
    for t in self.tensors:
        yield from t.edge_equivalences()
        for e in t.edges:
            yield (t, e), (self, e)

Finally, we use BFS to propagate the edge dimensions from the variables (which are
given by the user) to the rest of the graph.
Why is it even necessary for non-variables to know the edge dimensions? Mostly
because of copy tensors, which we use for hyper edges, and have to construct. Could we
get rid of this if we computed hyper-edges more efficiently without copy’s? There are also
sometimes "detached" copies...
Also, an alternative idea would be to actually construct the full graph. I originally
didn’t think this would be possible because of the Sum’s which aren’t really graphs. But
maybe with the new approach of using nauty, we could actually do this.

12.3.1 Products
We simply evaluate the tensors in the product and give them to einsum.

12.4 Simplification Rules


There are a bunch of these.
Mostly we can do everything in a single depth-first pass, but a few times we need to
do multiple passes. That can be done with the full-simplify method, which repeatedly
calls simplify until nothing changes.
Bibliography

[1] John Aldrich. The Origins of Mathematical Words: A Comprehensive Dictionary of


Latin, Greek, and Arabic Roots. Mathematical Association of America, 2010.

[2] Mark Blacher, Julien Klaus, Christoph Staudt, Sören Laue, Viktor Leis, and Joachim
Giesen. Efficient and portable einstein summation in SQL. Proc. ACM Manag. Data,
1(2):121:1–121:19, 2023.
[3] Lam Chi-Chung, P Sadayappan, and Rephael Wenger. On optimizing a class of
multi-dimensional loops with reduction for parallel execution. Parallel Processing
Letters, 7(02):157–168, 1997.
[4] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.
[5] Jeffrey M Dudek, Leonardo Duenas-Osorio, and Moshe Y Vardi. Efficient contraction
of large tensor networks for weighted model counting through graph decompositions.
arXiv preprint arXiv:1908.04381, 2019.
[6] Vibhav Gogate and Rina Dechter. A complete anytime algorithm for treewidth. In
David Maxwell Chickering and Joseph Y. Halpern, editors, UAI ’04, Proceedings of
the 20th Conference in Uncertainty in Artificial Intelligence, Banff, Canada, July
7-11, 2004, pages 201–208. AUAI Press, 2004.

[7] Johnnie Gray and Stefanos Kourtis. Hyper-optimized tensor network contraction.
Quantum, 5:410, 2021.
[8] greg (https://math.stackexchange.com/users/357854/greg). Proving that
vec(a diag(b) c) = ((c^T ⊗ 1_a) ⊙ (1_c ⊗ a)) b. Mathematics Stack Exchange.
URL: https://math.stackexchange.com/q/2993406 (version: 2018-11-11).

[9] Michael Hamann and Ben Strasser. Graph bisection with pareto optimization. Jour-
nal of Experimental Algorithmics (JEA), 23:1–34, 2018.
[10] William Rowan Hamilton. Lectures on Quaternions. Hodges and Smith, 1853.

[11] T. C. Hu and M. T. Shing. Computation of matrix chain products. part I. SIAM J.


Comput., 11(2):362–373, 1982.
[12] T. C. Hu and M. T. Shing. Computation of matrix chain products. part II. SIAM
J. Comput., 13(2):228–251, 1984.


[13] Alfons Kemper and Thomas Neumann. Hyper: A hybrid oltp&olap main memory
database system based on virtual memory snapshots. In 2011 IEEE 27th Interna-
tional Conference on Data Engineering, pages 195–206. IEEE, 2011.
[14] Tuukka Korhonen. A single-exponential time 2-approximation algorithm for
treewidth. SIAM Journal on Computing, (0):FOCS21–174, 2023.
[15] Igor L Markov and Yaoyun Shi. Simulating quantum computation by contracting
tensor networks. SIAM Journal on Computing, 38(3):963–981, 2008.
[16] Eli Meirom, Haggai Maron, Shie Mannor, and Gal Chechik. Optimizing tensor
network contraction using reinforcement learning. In International Conference on
Machine Learning, pages 15278–15292. PMLR, 2022.
[17] Robert N. C. Pfeifer, Jutho Haegeman, and Frank Verstraete. Faster identification
of optimal contraction sequences for tensor networks. Phys. Rev. E, 90:033315, 2014.
[18] Sebastian Schlag, Vitali Henne, Tobias Heuer, Henning Meyerhenke, Peter Sanders,
and Christian Schulz. K-way hypergraph partitioning via n-level recursive bisec-
tion. In 2016 Proceedings of the Eighteenth Workshop on Algorithm Engineering and
Experiments (ALENEX), pages 53–67. SIAM, 2016.
[19] Roman Schutski, Danil Lykov, and Ivan Oseledets. Adaptive algorithm for quantum
circuit simulation. Phys. Rev. A, 101:042335, Apr 2020.
[20] Daniel G. A. Smith and Johnnie Gray. opt_einsum - a Python package for optimizing
contraction order for einsum-like expressions. Journal of Open Source Software,
3(26):753, 2018.
[21] Christoph Staudt, Mark Blacher, Julien Klaus, Farin Lippmann, and Joachim Giesen.
Improved cut strategy for tensor network contraction orders. In Leo Liberti, editor,
22nd International Symposium on Experimental Algorithms, SEA 2024, July 23-26,
2024, Vienna, Austria, volume 301 of LIPIcs, pages 27:1–27:19. Schloss Dagstuhl -
Leibniz-Zentrum für Informatik, 2024.
[22] Mihail Stoian and Andreas Kipf. Dpconv: Super-polynomially faster join ordering,
2024.
[23] Mihail Stoian, Richard M. Milbradt, and Christian B. Mendl. On the optimal linear
contraction order of tree tensor networks, and beyond. SIAM Journal on Scientific
Computing, 46(5):B647–B668, 2024.
[24] Ben Strasser. Computing tree decompositions with flowcutter: PACE 2017 submis-
sion. CoRR, abs/1709.08949, 2017.
[25] Jianyu Xu, Ling Liang, Lei Deng, Changyun Wen, Yuan Xie, and Guoqi Li. Towards
a polynomial algorithm for optimal contraction sequence of tensor networks from
trees. Phys. Rev. E, 100:043309, 2019.
[26] F. Frances Yao. Efficient dynamic programming using quadrangle inequalities. In
Raymond E. Miller, Seymour Ginsburg, Walter A. Burkhard, and Richard J. Lipton,
editors, Proceedings of the 12th Annual ACM Symposium on Theory of Computing,
April 28-30, 1980, Los Angeles, California, USA, pages 429–435. ACM, 1980.
Chapter 13

Appendix

Contains some proofs, such as of equation 524 or 571. They are pretty long and could be
useful for contrasting with the diagram proofs.
