
Formal Aspects of Language Modeling

Ryan Cotterell, Anej Svete, Clara Meister, Tianyu Liu, and Li Du

Wednesday 6th March, 2024


Contents

1 Introduction
  1.1 Introduction

2 Probabilistic Foundations
  2.1 An Invitation to Language Modeling
  2.2 A Measure-theoretic Foundation
  2.3 Language Models: Distributions over Strings
    2.3.1 Sets of Strings
    2.3.2 Defining a Language Model
  2.4 Global and Local Normalization
    2.4.1 Globally Normalized Language Models
    2.4.2 Locally Normalized Language Models
  2.5 Tight Language Models
    2.5.1 Tightness
    2.5.2 Defining the probability measure of an LNM
    2.5.3 Interpreting the Constructed Probability Space
    2.5.4 Characterizing Tightness

3 Modeling Foundations
  3.1 Representation-based Language Models
    3.1.1 Vector Space Representations
    3.1.2 Compatibility of Symbol and Context
    3.1.3 Projecting onto the Simplex
    3.1.4 Representation-based Locally Normalized Models
    3.1.5 Tightness of Softmax Representation-based Models
  3.2 Estimating a Language Model from Data
    3.2.1 Data
    3.2.2 Language Modeling Objectives
    3.2.3 Parameter Estimation
    3.2.4 Regularization Techniques

4 Classical Language Models
  4.1 Finite-state Language Models
    4.1.1 Weighted Finite-state Automata
    4.1.2 Finite-state Language Models
    4.1.3 Normalizing Finite-state Language Models
    4.1.4 Tightness of Finite-state Models
    4.1.5 The n-gram Assumption and Subregularity
    4.1.6 Representation-based n-gram Models
  4.2 Pushdown Language Models
    4.2.1 Human Language Is not Finite-state
    4.2.2 Context-free Grammars
    4.2.3 Weighted Context-free Grammars
    4.2.4 Context-free Language Models
    4.2.5 Tightness of Context-free Language Models
    4.2.6 Normalizing Weighted Context-free Grammars
    4.2.7 Pushdown Automata
    4.2.8 Pushdown Language Models
    4.2.9 Multi-stack Pushdown Automata
  4.3 Exercises

5 Neural Network Language Models
  5.1 Recurrent Neural Language Models
    5.1.1 Human Language is Not Context-free
    5.1.2 Recurrent Neural Networks
    5.1.3 General Results on Tightness
    5.1.4 Elman and Jordan Networks
    5.1.5 Variations on Recurrent Networks
    5.1.6 Representational Capacity of Recurrent Neural Networks
  5.2 Transformer-based Language Models
    5.2.1 Informal Motivation of the Transformer Architecture
    5.2.2 A Formal Definition of Transformers
    5.2.3 Tightness of Transformer-based Language Models
  5.3 Computational Expressiveness of Transformers
Chapter 1

Introduction

1.1 Introduction

Welcome to the class notes for the first third of Large Language Models (263-5354-00L). The course comprises an omnibus introduction to language modeling. The first third of the lectures focuses on a formal treatment of the subject. The second part focuses on the practical aspects of implementing a language model and its applications. Many universities are offering similar courses at the moment, e.g., CS324 at Stanford University ([Link]) and CS 600.471 ([Link]) at Johns Hopkins University. Their syllabi may serve as useful references.

Disclaimer. This is the third time the course is being taught, and we are improving the notes as we go. We will try to be as careful as possible to make them typo- and error-free. However, there will undoubtedly be mistakes scattered throughout. We will be very grateful if you report any mistakes you spot, or anything you find unclear and confusing in general—this will benefit the students as well as the teaching staff by helping us organize a better course!
Chapter 2

Probabilistic Foundations

2.1 An Invitation to Language Modeling

The first module of the course focuses on defining a language model mathematically. To see why such a definition is nuanced, we are going to give an informal definition of a language model and demonstrate two ways in which that definition breaks and fails to meet our desired criteria.

Definition 2.1.1: Language Model (Informal)

Given an alphabet^a Σ and a distinguished end-of-sequence symbol eos ∉ Σ, a language model is a collection of conditional probability distributions p(y | y) for symbols y ∈ Σ ∪ {eos} and strings y ∈ Σ*, where Σ* is the set of all strings over the alphabet Σ. The term p(y | y) represents the probability of the symbol y occurring as the next symbol after the string y.

^a An alphabet is a finite, non-empty set. It is also often referred to as a vocabulary.

Definition 2.1.1 is the definition of a language model that is implicitly assumed in most papers on language modeling. We say implicitly since most technical papers on language modeling simply write down the following autoregressive factorization

p(y) = p(y_1 ··· y_T) = p(eos | y) ∏_{t=1}^{T} p(y_t | y_{<t})    (2.1)

as the probability of a string according to the distribution p.^1 The part that is left implicit in Eq. (2.1) is whether or not p is indeed a probability distribution and, if it is, over what space. The natural assumption in Definition 2.1.1 is that p is a distribution over Σ*, i.e., the set of all finite strings^2 over an alphabet Σ. However, in general, it is not true that all such collections of conditionals will yield a valid probability distribution over Σ*; some may "leak" probability mass to infinite sequences.^3 More subtly, we additionally have to be very careful when dealing with uncountably infinite spaces lest we run into a classic paradox. We highlight these two issues with two very simple examples. The first example is a well-known paradox in probability theory.

^1 Many authors (erroneously) avoid writing eos for concision.
^2 Some authors assert that strings are by definition finite.
^3 However, the converse is true: All valid distributions over Σ* may be factorized as the above.

Example 2.1.1: Infinite Coin Toss

Consider the infinite independent fair coin toss model, where we aim to place a distribution over {H, T}^∞, the (uncountable) set of infinite sequences over {H, T} (H represents the event of throwing heads and T the event of throwing tails). Intuitively, such a distribution corresponds to a "language model" as defined above in which, for all y_{<t}, p(H | y_{<t}) = p(T | y_{<t}) = 1/2 and p(eos | y_{<t}) = 0. However, each individual infinite sequence over {H, T} should also be assigned probability (1/2)^∞ = 0. Without a formal foundation, one arrives at the following paradox:

1 = p({H, T}^∞)
  = p( ⋃_{ω ∈ {H,T}^∞} {ω} )
 ?= Σ_{ω ∈ {H,T}^∞} p({ω})
  = Σ_{ω ∈ {H,T}^∞} 0 = 0.

The suspicious step, marked with a question mark, applies (countable) additivity to an uncountable union.

The second example is more specific to language modeling. As we stated above, an implicit assumption made by most language modeling papers is that a language model constitutes a distribution over Σ*. However, in our next example, we show that a collection of conditionals that satisfy Definition 2.1.1 may not sum to 1 if the sum is restricted to elements of Σ*. This means that it is not a priori clear what space our probability distribution is defined over.^4

Figure 2.1: Graphical depiction of the possibly finite coin toss model as a probabilistic finite-state automaton. The final weight 1/2 of the state 2 corresponds to the probability p(eos | y_{t−1} = T) = 1/2.

^4 This also holds for the first example.

Example 2.1.2: Possibly Finite Coin Toss

Consider now the possibly finite "coin toss" model with a rather peculiar coin: when tossing the coin for the first time, both H and T are equally likely. After the first toss, however, the coin gets stuck: If y_1 = H, we can only ever toss another H again, whereas if y_1 = T, the next toss can result in another T or "end" the sequence of throws (eos) with equal probability. We, therefore, model a probability distribution over {H, T}* ∪ {H, T}^∞, the set of finite and infinite sequences of tosses. Formally:^a

p(H | y_{<1}) = p(T | y_{<1}) = 1/2

p(H | y_{<t}) = 1 if t > 1 and y_{t−1} = H,  and 0 if t > 1 and y_{t−1} = T

p(T | y_{<t}) = 1/2 if t > 1 and y_{t−1} = T,  and 0 if t > 1 and y_{t−1} = H

p(eos | y_{<t}) = 1/2 if t > 1 and y_{t−1} = T,  and 0 otherwise.

If you are familiar with (probabilistic) finite-state automata,^b you can imagine the model as depicted in Fig. 2.1. It is easy to see that this model only places probability 1/2 on finite sequences of tosses. If we were only interested in those (analogously to how we are only interested in finite strings when modeling language), yet still allowed the model to specify the probabilities as in this example, the resulting probability distribution would not model what we require.

^a Note that p(H | y_{<1}) = p(H | ε) and p(T | y_{<1}) = p(T | ε).
^b They will be formally introduced in §4.1.
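To build intuition for where the missing probability mass goes, here is a minimal Python simulation of the possibly finite coin toss model (our sketch, not part of the original notes; the cutoff max_len is an arbitrary stand-in for non-termination). It samples sequences and reports the fraction that terminate, which approaches 1/2.

import random

def sample_toss_sequence(max_len=1000):
    """Sample from the possibly finite coin toss model of Example 2.1.2.

    Returns the finite list of tosses if eos is generated, or None if the
    sequence exceeds max_len symbols (a proxy for "never terminates").
    """
    seq = [random.choice("HT")]            # first toss: H or T with probability 1/2 each
    while len(seq) < max_len:
        if seq[-1] == "H":                 # once H is tossed, the coin is stuck on H; eos has probability 0
            seq.append("H")
        else:                              # after T: another T or eos, with probability 1/2 each
            if random.random() < 0.5:
                return seq                 # eos: a finite sequence of tosses
            seq.append("T")
    return None                            # treated as an infinite sequence

samples = [sample_toss_sequence() for _ in range(10_000)]
finite = [s for s in samples if s is not None]
print(f"fraction of terminating samples: {len(finite) / len(samples):.3f}")   # approximately 0.5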

It takes some mathematical heft to define a language model in a manner that avoids such paradoxes. The tool of choice for mathematicians is measure theory, as it allows us to define probability over uncountable sets^5 in a principled way. Thus, we begin our formal treatment of language modeling with a primer on measure theory in §2.2. Then, we will use concepts discussed in the primer to work up to a formal definition of a language model.

^5 As stated earlier, {H, T}^∞ is uncountable. It is easy to see there exists a surjection from {H, T}^∞ onto the binary expansions of the real interval (0, 1]. Readers who are interested in more details and mathematical implications can refer to §1 in Billingsley (1995).

2.2 A Measure-theoretic Foundation

At their core, (large) language models are an attempt to place a probability distribution over natural language utterances. However, our toy examples in Examples 2.1.1 and 2.1.2 in the previous section reveal that it can be relatively tricky to arrive at a satisfying definition of a language model. Thus, our first step forward is to review the basics of rigorous probability theory,^6 the tools we need to come to a satisfying definition. Our course will assume that you have had some exposure to rigorous probability theory before, and we only review the basics. However, it is also possible to learn the basics of rigorous probability on the fly during the course if it is new to you. Specifically, we will cover the measure-theoretic foundations of probability theory. This might come as a bit of a surprise since we are mostly going to be talking about language, which is made up of discrete objects—strings. However, as we will see in §2.5, a formal treatment of language modeling indeed requires some mathematical rigor from measure theory.

The goal of measure-theoretic probability is to assign probabilities to subsets of an outcome space Ω. However, in the course of the study of measure theory, it has become clear that for many common Ω, it is impossible to assign probabilities in a way that satisfies a set of reasonable desiderata.^7 Consequently, the standard approach to probability theory resorts to assigning probability only to certain "nice" (but not necessarily all) subsets of Ω, which are referred to as events or measurable subsets, as in the theory of integration or functional analysis. The set of measurable subsets is commonly denoted as F (Definition 2.2.1), and a probability measure P : F → [0, 1] is the function that assigns a probability to each measurable subset. The triple (Ω, F, P) is collectively known as a probability space (Definition 2.2.2). As it turns out, the following simple and reasonable requirements imposed on F and P are enough to rigorously discuss probability.

^6 By rigorous probability theory we mean a measure-theoretic treatment of probability theory.
^7 Measure theory texts commonly discuss such desiderata and the dilemma that comes with them. See, e.g., Chapter 7 in Tao (2016), Chapter 3 in Royden (1988), or Chapter 3 in Billingsley (1995). We also give an example later.

Definition 2.2.1: σ-algebra

Let P(Ω) be the power set of Ω. Then F ⊆ P(Ω) is called a σ-algebra (or σ-field) over Ω if the following conditions hold:

1) Ω ∈ F,
2) if E ∈ F, then E^c ∈ F,
3) if E_1, E_2, … is a finite or infinite sequence of sets in F, then ⋃_n E_n ∈ F.

If F is a σ-algebra over Ω, we call the tuple (Ω, F) a measurable space.

Example 2.2.1: σ-algebras

Let Ω be any set. Importantly, there is more than one way to construct a σ-algebra over Ω:

1. The family consisting of only the empty set ∅ and the set Ω, i.e., F := {∅, Ω}, is called the minimal or trivial σ-algebra.

2. The full power set F := P(Ω) is called the discrete σ-algebra.

3. Given A ⊆ Ω, the family F := {∅, A, Ω∖A, Ω} is a σ-algebra induced by A.

4. Suppose we are rolling a six-sided die. There are six events that can happen: We can roll any of the numbers 1–6. In this case, we will then define the set of outcomes Ω as Ω := {The number observed is n | n = 1, …, 6}. There are of course multiple ways to define an event space F and with it a σ-algebra over this outcome space. By definition, ∅ ∈ F and Ω ∈ F. One way to intuitively construct a σ-algebra is to consider that all individual events (observing any number) are possible, meaning that we would like to later assign probabilities to them (see Definition 2.2.2). This means that we should include the individual singleton events in the event space: {The number observed is n} ∈ F for n = 1, …, 6. It is easy to see that in this case, to satisfy the axioms in Definition 2.2.1, the resulting event space should be F = P(Ω).

You might want to confirm these are indeed σ-algebras by checking them against the axioms in Definition 2.2.1.
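As a quick sanity check (ours, not from the notes), the following Python snippet verifies the axioms of Definition 2.2.1 for the σ-algebra induced by a subset A from item 3, using die outcomes as Ω. For a finite family, closure under countable unions reduces to closure under pairwise unions.

# Toy example (assumed): Omega = die outcomes 1..6, A = the even rolls.
Omega = frozenset(range(1, 7))
A = frozenset({2, 4, 6})
F = {frozenset(), A, Omega - A, Omega}     # the sigma-algebra induced by A (item 3)

def is_sigma_algebra(F, Omega):
    """Check the axioms of Definition 2.2.1 for a finite family F over Omega."""
    contains_omega = Omega in F
    closed_under_complement = all(Omega - E in F for E in F)
    # With finitely many sets, countable unions reduce to repeated pairwise unions.
    closed_under_union = all(E1 | E2 in F for E1 in F for E2 in F)
    return contains_omega and closed_under_complement and closed_under_union

print(is_sigma_algebra(F, Omega))                        # True
print(is_sigma_algebra({frozenset(), A, Omega}, Omega))  # False: the complement of A is missing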

A measurable space guarantees that operations on countably many sets are always valid, and hence permits the following definition.

Definition 2.2.2: Probability measure

A probability measure P over a measurable space (Ω, F) is a function P : F → [0, 1] such that

1) P(Ω) = 1,
2) if E_1, E_2, … is a countable sequence of disjoint sets in F, then P(⋃_n E_n) = Σ_n P(E_n).

In this case we call (Ω, F, P) a probability space.

As mentioned, measure-theoretic probability only assigns probabilities to "nice" subsets of Ω. In fact, it is often impossible to assign a probability measure to every single subset of Ω, and we must restrict the collection of measurable sets to a strict subset of P(Ω). More precisely, the sets B ⊆ Ω for which a probability (or, more generally, a volume) cannot be defined are called non-measurable sets. An example of such a set is the Vitali set.^8 See also Appendix A.2 in Durrett (2019).

Later, we will be interested in modeling probability spaces over sets of (infinite) sequences. By virtue of a theorem due to Carathéodory, there is a natural way to construct such a probability space for sequences (and many other spaces) that behaves in accordance with our intuition, as we will clarify later. Here, we shall lay out a few other necessary definitions.

Definition 2.2.3: Algebra

A ⊆ P(Ω) is called an algebra (or field) over Ω if

1) Ω ∈ A,
2) if E ∈ A, then E^c ∈ A,
3) if E_1, E_2 ∈ A, then E_1 ∪ E_2 ∈ A.

^8 See [Link] and [Link].

Definition 2.2.4: Probability pre-measure

Let A be an algebra over some set Ω. A probability pre-measure over (Ω, A) is a function P_0 : A → [0, 1] such that

1) P_0(Ω) = 1,
2) if E_1, E_2, … is a (countable) sequence of disjoint sets in A whose (countable) union is also in A, then P_0(⋃_{n=1}^{∞} E_n) = Σ_{n=1}^{∞} P_0(E_n).

Note that the only difference between a σ-algebra (Definition 2.2.1) and an algebra is that condition 3 is weakened from countable to finite, and the only difference between a probability measure (Definition 2.2.2) and a pre-measure is that the latter is defined with respect to an algebra instead of a σ-algebra.

The idea behind Carathéodory's extension theorem is that there is often a simple construction of an algebra A over Ω such that there is a natural way to define a probability pre-measure. One can then extend this probability pre-measure to a probability measure that is both minimal and unique in a precise sense. For example, the standard Lebesgue measure over the real line can be constructed this way.

Finally, we define random variables.

Definition 2.2.5: Random variable

A mapping x : Ω → S between two measurable spaces (Ω, F) and (S, T) is an (S, T)-valued random variable, or a measurable mapping, if, for all B ∈ T,

x^{-1}(B) := {ω ∈ Ω : x(ω) ∈ B} ∈ F.    (2.2)

Any measurable function (random variable) induces a new probability measure on the output σ-algebra based on the one defined on the original σ-algebra. This is called the pushforward measure (cf. §2.4 in Tao, 2011), which we will denote by P_*, given by

P_*(x ∈ E) := P(x^{-1}(E)),    (2.3)

that is, the probability of the result of x being in some event E is determined by the probability of the set of all elements which x maps into E, i.e., the pre-image of E under x.

Example 2.2.2: Random Variables

We give some simple examples of random variables.

1. Let Ω be the set of possible outcomes of throwing a fair coin, i.e., Ω := {T, H}. Define F := P(Ω), S := {0, 1}, and T := P(S). Then, the random variable

x : T ↦ 0,  H ↦ 1

assigns tails (T) the value 0 and heads (H) the value 1.

2. Consider the probability space of throwing two dice (similar to Example 2.2.1), where Ω = {(i, j) : i, j = 1, …, 6}, the element (i, j) refers to rolling i on the first and j on the second die, and F = P(Ω). Define S := ℤ and T := P(S). Then, the random variable

x : (i, j) ↦ i + j

is an (S, T)-valued random variable which represents the sum of the two dice.
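The pushforward measure in Eq. (2.3) can be computed mechanically for finite spaces. Here is a small Python sketch (ours) for item 2 of Example 2.2.2: the uniform measure on the 36 dice outcomes is pushed forward through x(i, j) = i + j.

from collections import defaultdict
from fractions import Fraction

# The probability space of two fair dice: Omega = {1,...,6}^2 with the uniform measure.
Omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {omega: Fraction(1, 36) for omega in Omega}

def pushforward(P, x):
    """Return the pushforward measure P_*(x = s) = P(x^{-1}({s}))."""
    P_star = defaultdict(Fraction)          # Fraction() is 0, so sums start from zero
    for omega, prob in P.items():
        P_star[x(omega)] += prob
    return dict(P_star)

P_star = pushforward(P, lambda ij: ij[0] + ij[1])   # the sum of the two dice
print(P_star[7])               # 1/6: six of the 36 outcomes map to a sum of 7
print(sum(P_star.values()))    # 1: P_* is again a probability measure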

2.3 Language Models: Distributions over Strings

Language models are defined as probability distributions over sequences of words, referred to as utterances. This chapter delves into the formalization of the term "utterance" and introduces fundamental concepts such as the alphabet, string, and language. Utilizing these concepts, a formal definition of a language model is presented, along with a discussion of the intricacies of defining distributions over infinite sets.

2.3.1 Sets of Strings

We begin by defining the very basic notions of alphabets and strings, where we take inspiration from formal language theory. First and foremost, formal language theory concerns itself with sets of structures. The simplest structure it considers is a string. So what is a string? We start with the notion of an alphabet.

Definition 2.3.1: Alphabet

An alphabet is a finite, non-empty set. In this course, we will denote an alphabet using Greek capital letters, e.g., Σ and ∆. We refer to the elements of an alphabet as symbols or letters and will denote them with lowercase letters: a, b, c.

Definition 2.3.2: String

A string^a over an alphabet is any finite sequence of letters. Strings made up of symbols from Σ will be denoted by bolded Latin letters, e.g., y = y_1 ··· y_T where each y_n ∈ Σ.

^a A string is also referred to as a word, which continues with the linguistic terminology.

The length of a string, written as |y|, is the number of letters it contains. Usually, we will use T to denote |y| more concisely whenever the usage is clear from the context. There is only one string of length zero, which we denote with the distinguished symbol ε and refer to as the empty string. By convention, ε is not an element of the original alphabet.

New strings are formed from other strings and symbols with concatenation. Concatenation, denoted with x ∘ y or just xy, is an associative operation on strings. Formally, the concatenation of two strings y and x is the string y ∘ x = yx, which is obtained by writing the second argument after the first one. Concatenating with ε from either side results in the original string, which means that ε is the unit of concatenation and the set of all strings over an alphabet together with the operation of concatenation forms a monoid.

We have so far only defined strings as individual sequences of symbols. To give our strings made up of symbols in Σ a set to live in, we now define the Kleene closure of an alphabet Σ.

Definition 2.3.3: Kleene Star

Let Σ be an alphabet. The Kleene star Σ* is defined as

Σ* := ⋃_{n=0}^{∞} Σ^n    (2.4)

where

Σ^n := Σ × ··· × Σ  (n times).    (2.5)

Note that we define Σ^0 := {ε}. We call Σ* the Kleene closure of the alphabet Σ. We also define

Σ^+ := ⋃_{n=1}^{∞} Σ^n = Σ Σ*.    (2.6)

Finally, we also define the set of all infinite sequences of symbols from some alphabet Σ, denoted Σ^∞.

Definition 2.3.4: Infinite sequences

Let Σ be an alphabet. The set of all infinite sequences over Σ is defined as

Σ^∞ := Σ × Σ × ···    (2.7)

Since strings are canonically finite in computer science, we will explicitly use the terms infinite sequence or infinite string to refer to elements of Σ^∞. More informally, we can think of Σ* as the set which contains ε and all (finite-length) strings which can be constructed by concatenating arbitrary symbols from Σ. Σ^+, on the other hand, does not contain ε but contains all other strings of symbols from Σ. The Kleene closure of an alphabet is a countably infinite set (this will come into play later!). In contrast, the set Σ^∞ is uncountably infinite for any Σ such that |Σ| ≥ 2.

The notion of the Kleene closure leads us very naturally to our next definition.

Definition 2.3.5: Formal language

Let Σ be an alphabet. A language L is a subset of Σ*.

That is, a language is just a specified subset of all possible strings made up of the symbols in the alphabet. This subset can be specified by simply enumerating a finite set of strings or by a formal model. We will see examples of those later. Importantly, these strings are finite. If not specified explicitly, we will often assume that L = Σ*.

A note on terminology. As we mentioned, these definitions are inspired by formal language theory. We defined strings as our main structures of interest and symbols as their building blocks. When we talk about natural language, the terminology is often slightly different: we may refer to the basic building blocks (symbols) as tokens or words (which might be composed of one or more characters) and their compositions (strings) as sequences or sentences. Furthermore, what we refer to here as an alphabet may be called a vocabulary (of words or tokens) in the context of natural language. Sentences are therefore concatenations of words from a vocabulary in the same way that strings are concatenations of symbols from an alphabet.

Example 2.3.1: Kleene Closure

Let Σ = {a, b, c}. Then

Σ* = {ε, a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc, aaa, aab, aac, …}.

Examples of languages over this alphabet include L_1 := {a, b, ab, ba}, L_2 := {y ∈ Σ* | y_1 = a}, and L_3 := {y ∈ Σ* | |y| is even}.
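The Kleene closure can be enumerated length by length, which makes its countability concrete. The generator below (our illustration; ε is rendered as the empty Python string) reproduces the beginning of the listing in Example 2.3.1.

from itertools import count, islice, product

def kleene_star(alphabet):
    """Yield the strings of Sigma* in order of increasing length.

    Every string is reached after finitely many steps, witnessing that
    Sigma* is countably infinite.
    """
    for n in count(0):                       # n = 0 yields only the empty string
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)

print(list(islice(kleene_star("abc"), 14)))
# ['', 'a', 'b', 'c', 'aa', 'ab', 'ac', 'ba', 'bb', 'bc', 'ca', 'cb', 'cc', 'aaa']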

Next, we introduce two notions of subelements of strings; a small code illustration follows the definition.

Definition 2.3.6: String Subelements

A subsequence of a string y is defined as a sequence that can be formed from y by deleting some or no symbols, leaving the order untouched. A substring is a contiguous subsequence. For instance, ab and bc are substrings and subsequences of y = abc, while ac is a subsequence but not a substring. Prefixes and suffixes are special cases of substrings. A prefix is a substring of y that begins at the first letter of y, and a suffix is a substring of y that ends at the last letter of y. We will also denote a prefix y_1 … y_{n−1} of the string y = y_1 … y_T as y_{<n}. We will also use the notation y ◁ y′ to denote that y is a suffix of y′.
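The distinctions in Definition 2.3.6 are easy to check programmatically. A small Python illustration (ours) on the string y = abc:

def is_substring(x, y):
    """x is a contiguous subsequence of y."""
    return x in y

def is_subsequence(x, y):
    """x can be obtained from y by deleting symbols without reordering the rest."""
    it = iter(y)
    return all(symbol in it for symbol in x)   # each symbol must appear after the previous match

y = "abc"
print(is_substring("ac", y), is_subsequence("ac", y))   # False True
print(is_substring("bc", y), is_subsequence("bc", y))   # True True
print(y.startswith("ab"), y.endswith("bc"))             # prefix and suffix checks: True True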

2.3.2 Defining a Language Model

We are now ready to introduce the main interest of the entire lecture series: language models.

Definition 2.3.7: Language model

Let Σ be an alphabet. A language model is a (discrete) distribution p_LM over Σ*.

Example 2.3.2: A very simple language model

Let Σ = {a}. For n ∈ ℕ_{≥0}, define

p_LM(a^n) := 2^{-(n+1)},

where a^0 := ε and a^n := a···a (n times).

We claim that p_LM is a language model. To see that, we verify that it is a valid probability distribution over Σ*. It is easy to see that p_LM(a^n) ≥ 0 for any n. Additionally, we see that the probabilities of finite strings indeed sum to 1:

Σ_{y ∈ Σ*} p_LM(y) = Σ_{n=0}^{∞} p_LM(a^n) = Σ_{n=0}^{∞} 2^{-(n+1)} = (1/2) Σ_{n=0}^{∞} 2^{-n} = (1/2) · 1/(1 − 1/2) = 1.
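A quick numerical check of Example 2.3.2 (our sketch): the partial sums of p_LM over a^0, a^1, …, a^{49} approach 1, with the missing mass exactly the tail 2^{-50}.

from fractions import Fraction

def p_LM(n):
    """Probability of the string a^n under the model of Example 2.3.2."""
    return Fraction(1, 2 ** (n + 1))

partial = sum(p_LM(n) for n in range(50))   # strings a^0, ..., a^49
print(partial)          # 1125899906842623/1125899906842624, i.e., 1 - 2**-50
print(1 - partial)      # 1/1125899906842624, the tail mass of the remaining strings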

In our formal analysis of language models, we will also often refer to the language defined by a language model.

Definition 2.3.8: Weighted language

Let p_LM be a language model. The weighted language of p_LM is defined as

L(p_LM) := {(y, p_LM(y)) | y ∈ Σ*}.    (2.8)

Example 2.3.3: Language of a language model

The (weighted) language of the language model from Example 2.3.2 is

L(p_LM) := {(a^n, 2^{-(n+1)}) | n ∈ ℕ_{≥0}}.    (2.9)

A language model is itself a very simple concept—it is simply a distribution that weights strings (natural utterances) by their probability of occurring in a particular language. Note that we have not said anything about how we can represent or model this distribution yet. Besides, for any (natural) language, the ground-truth language model p_LM is of course unknown and complex. The next chapter, therefore, discusses in depth the computational models which we can use to try to tractably represent distributions over strings and ways of approximating (learning) the ground-truth distribution based on finite datasets using such models.

2.4 Global and Local Normalization

The previous section introduced a formal definition of a language as a set of strings and the definition of a language model as a distribution over strings. We now delve into a potpourri of technical questions to complete the theoretical minimum for discussing language models. While doing so, we will introduce (and begin to answer) three fundamental questions of the first part of the course. We will introduce them over the course of the section.

A note on terminology. Unfortunately, we will encounter some ambiguous terminology. In §2.5, we explicitly define a language model as a valid probability distribution over Σ*, the Kleene closure of some alphabet Σ, which means that Σ_{y ∈ Σ*} p_LM(y) = 1. As we will see later, this means that the model is tight, whereas it is non-tight if Σ_{y ∈ Σ*} p_LM(y) < 1. Definitionally, then, all language models are tight. However, it is standard in the literature to refer to many non-tight language models as language models as well. We pardon in advance the ambiguity that this introduces. Over the course of the notes, we attempt to stick to the convention that the term "language model" without qualification only refers to a tight language model, whereas a "non-tight language model" is used to refer to a language model in the more colloquial sense. Linguistically, tight is acting as a non-intersective adjective. Just as in English, where a fake gun is not a gun, so too in our course notes a non-tight language model is not a language model. This distinction does in fact matter. On one hand, we can prove that many language models whose parameters are estimated from data (e.g., a finite-state language model estimated by means of maximum-likelihood estimation) are, in fact, tight. On the other hand, we can show that this is not true in general, i.e., not all language models estimated from data will be tight. For instance, a recurrent neural network language model estimated through gradient descent may not be tight (Chen et al., 2018).

When specifying p_LM, we have two fundamental options. Depending on whether we model p_LM(y) for each string y directly or we model the individual conditional probabilities p_LM(y_t | y_{<t}), we distinguish globally and locally normalized models. The names naturally come from the way the distributions in the two families are normalized: whereas globally normalized models are normalized by summing over the entire (infinite) space of strings, locally normalized models define a sequence of conditional distributions and make use of the chain rule of probability to define the joint probability of a whole string.

The bos symbol. Conventionally, we will include a special symbol over which globally or locally normalized models operate: the beginning-of-sequence (bos) symbol, which, as the name suggests, denotes the beginning of a string or a sequence. For a string y = y_1 ··· y_T, we will suggestively denote y_0 := bos.

2.4.1 Globally Normalized Language Models

We start with globally normalized models. Such models are also called energy-based language models in the literature (Bakhtin et al., 2021). To define a globally normalized language model, we start with the definition of an energy function.

Definition 2.4.1: Energy function

An energy function is a function p̂ : Σ* → ℝ.
Inspired by concepts from statistical mechanics, an energy function can be used to define a very general class of probability distributions by normalizing its exponentiated negative values. Now, we can define a globally normalized language model in terms of an energy function over Σ*.

Definition 2.4.2: Globally normalized models

Let p̂_GN : Σ* → ℝ be an energy function. A globally normalized model (GNM) is defined as

p_LM(y) := exp[−p̂_GN(y)] / Σ_{y′ ∈ Σ*} exp[−p̂_GN(y′)] = (1/Z_G) exp[−p̂_GN(y)],    (2.10)

where Z_G := Σ_{y′ ∈ Σ*} exp[−p̂_GN(y′)].^a We call Z_G the normalization constant.

^a We will later return to this sort of normalization when we define the softmax function in §3.1.

Globally normalized models are attractive because one only needs to define an (unnormalized) energy function p̂_GN, which scores entire sequences at once. This is often easier than specifying a probability distribution. Furthermore, they define a probability distribution over strings y ∈ Σ* directly. As we will see in §2.4.2, this stands in contrast to locally normalized language models, which require care with the space over which they operate. However, the downside is that it may be difficult to compute the normalizer Z_G.

Normalizability

In defining the normalizer Z_G := Σ_{y′ ∈ Σ*} exp[−p̂_GN(y′)], we notationally cover up a certain subtlety: the set Σ* is countably infinite, so Z_G may diverge to ∞. In this case, Eq. (2.10) is not well-defined. This motivates the following definition.

Definition 2.4.3: Normalizable energy function

We say that an energy function is normalizable if the quantity Z_G in Eq. (2.10) is finite, i.e., if Z_G < ∞.

With this definition, we can state a relatively trivial result that characterizes when an energy function can be turned into a globally normalized language model.

Theorem 2.4.1: Normalizable energy functions induce language models

Any normalizable energy function p̂_GN induces a language model, i.e., a distribution over Σ*.
Proof. Given a normalizable energy function p̂_GN, we have exp[−p̂_GN(y)] ≥ 0 and

Σ_{y ∈ Σ*} p_LM(y) = Σ_{y ∈ Σ*} exp[−p̂_GN(y)] / Σ_{y′ ∈ Σ*} exp[−p̂_GN(y′)]    (2.11)
                   = (1 / Σ_{y′ ∈ Σ*} exp[−p̂_GN(y′)]) Σ_{y ∈ Σ*} exp[−p̂_GN(y)]    (2.12)
                   = 1,    (2.13)

which means that p_LM is a valid probability distribution over Σ*. ■

While the fact that normalizable energy functions always induce a language model is a big advantage, we will see later that ensuring that they are normalizable can be difficult and restrictive. This brings us to the first fundamental question of the section:

Question 2.1: Normalizing an energy function

When is an energy function normalizable? More precisely, for which energy functions p̂_GN is Z_G < ∞?

We will not discuss any specific results here, as there are no general necessary or sufficient conditions; the answer of course depends on the precise definition of p̂_GN. Later in the course notes, we will present two formalisms where we can exactly characterize when an energy function is normalizable: first, when it is defined by a weighted finite-state automaton (cf. §4.1), and, second, when it is defined through a weighted context-free grammar (§4.2); we discuss the specific sufficient and necessary conditions there. However, under certain assumptions, determining whether an energy function is normalizable in the general case is undecidable.

Moreover, even if it is known that an energy function is normalizable, we still need an efficient algorithm to compute Z_G. But efficiently computing Z_G can be challenging: the fact that Σ* is infinite means that we cannot always compute Z_G in a tractable way. In fact, there are no general-purpose algorithms for this. Moreover, sampling from the model is similarly intractable, as entire sequences have to be drawn at a time from the large space Σ*.
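To get a feel for Question 2.1, consider a toy energy function (our illustration, not from the notes) that depends only on string length: p̂_GN(y) := α|y| for a constant α ∈ ℝ. Grouping strings by length,

Z_G = Σ_{y ∈ Σ*} exp[−α|y|] = Σ_{n=0}^{∞} |Σ|^n e^{−αn} = Σ_{n=0}^{∞} (|Σ| e^{−α})^n,

a geometric series that converges if and only if |Σ| e^{−α} < 1, i.e., α > log|Σ|, in which case Z_G = 1/(1 − |Σ| e^{−α}). Even for this very simple family, normalizability is a non-trivial condition on the parameter α; for richer parameterizations of p̂_GN, characterizing it becomes correspondingly harder.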

2.4.2 Locally Normalized Language Models

The inherent difficulty in computing the normalizer, an infinite summation over Σ*, motivates the definition of locally normalized language models, which we will denote with p_LN. Rather than defining a probability distribution over Σ* directly, they decompose the problem into that of modeling a series of conditional distributions over the next possible symbol given the context so far, i.e., p_LN(y | y), which could be naïvely combined into the full probability of the string by multiplying the conditional probabilities.^9 Intuitively, this reduces the problem of having to normalize a distribution over the infinite set Σ* to the problem of modeling the distribution of the next symbol y_n given the symbols seen so far, y_{<n}. This means that normalization only ever requires summation over |Σ| symbols at a time, solving the tractability issues encountered by globally normalized models.

^9 We will soon see why this does not quite work and why we have to be a bit more careful.
However, we immediately encounter another problem: In order for p_LN to be a language model, its conditionals must combine into a probability distribution over Σ*. However, as we will discuss in the next section, this may not be the case, because locally normalized models can place positive probability mass on infinitely long sequences (cf. Example 2.5.1 in §2.5.1). Additionally, we also have to introduce a new symbol that tells us to "stop" generating a string, which we call the end-of-sequence symbol, eos. Throughout the notes, we will assume eos ∉ Σ, and we define

Σ̄ := Σ ∪ {eos}.    (2.14)

Moreover, we will explicitly denote symbols in Σ̄ as ȳ and symbols in Σ as y. Given a sequence of symbols containing the eos symbol, we take the string to be the sequence of symbols encountered before the first eos symbol. Informally, you can think of the bos symbol as marking the beginning of the string and the eos symbol as denoting the end of the string, or even as a language model terminating its generation, as we will see later.

Due to the issues with defining valid probability distributions over Σ*, we will use the term sequence model to refer to any model that may place positive probability on infinitely long sequences. Thus, sequence models are strictly more general than language models, which, by definition, only place positive probability mass on strings, i.e., finite sequences.

Definition 2.4.4: Sequence model

Let Σ be an alphabet. A sequence model (SM) over Σ is defined as a set of conditional probability distributions

p_SM(y | y)    (2.15)

for symbols y ∈ Σ and strings y ∈ Σ*. We will refer to the string y in p_SM(y | y) as the history or the context.

Note that we will mostly consider SMs over the set Σ̄. To reiterate, we have just formally defined locally normalized sequence models rather than locally normalized language models. That has to do with the fact that, in contrast to a globally normalized model with a normalizable energy function, an SM might not correspond to a language model, as alluded to at the beginning of this section and as we discuss in more detail shortly. We will now work up to a locally normalized language model.

Definition 2.4.5: Locally normalized language model

Let Σ be an alphabet and let p_SM be a sequence model over Σ̄. A locally normalized language model (LNM) over Σ is defined as

p_LN(y) := p_SM(eos | y) ∏_{t=1}^{T} p_SM(y_t | y_{<t})    (2.16)

for y ∈ Σ* with |y| = T. We say a locally normalized language model is tight if

Σ_{y ∈ Σ*} p_LN(y) = 1.    (2.17)

Tightness is a nuanced concept that will be discussed in great detail in §2.5.

We now contrast globally and locally normalized models pictorially in the following example.

Example 2.4.1: Locally and globally normalized language models

Fig. 2.2a shows a simple instance of what a locally normalized language model would look like. We can compute the probabilities of various strings by starting at the root node bos and choosing one of the paths to a leaf node, which will always be eos. The value on an edge represents the conditional probability of observing the word at the target of the edge given the context seen on the path so far, i.e., p_LN(y_t | y_{<t}) at level t of the tree. For example, the probability of the string bos "The best" eos under this language model is 0.04 · 0.13 · 0.22 = 0.001144. On the other hand, a globally normalized model would simply score all possible sentences using the score function p̂_GN(y), as is hinted at in Fig. 2.2b.
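To make Eq. (2.16) concrete, here is a minimal Python sketch (ours; the conditional probabilities are made up, loosely following the fragment of Fig. 2.2a, and only a few contexts are specified) that scores a string under a locally normalized model by multiplying next-word probabilities and the final eos probability.

EOS = "<eos>"

# Hypothetical conditionals p_SM(y | context), keyed by the context (a tuple of
# previously generated words). Only a handful of entries are shown; a full
# sequence model would specify a distribution summing to 1 for every context.
p_SM = {
    (): {"The": 0.04, "Hello": 0.03, "Please": 0.01},
    ("The",): {"best": 0.13, "quick": 0.07},
    ("The", "best"): {EOS: 0.22, "!": 0.10},
}

def p_LN(words):
    """Chain-rule probability of a finite string, as in Eq. (2.16)."""
    prob, context = 1.0, ()
    for w in list(words) + [EOS]:
        prob *= p_SM.get(context, {}).get(w, 0.0)
        context = context + (w,)
    return prob

print(p_LN(["The", "best"]))   # 0.04 * 0.13 * 0.22, approximately 0.001144, as in Example 2.4.1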

Locally Normalizing a Language Model

The second fundamental question of this section concerns the relationship between language models and local normalization.

Question 2.2: Locally normalizing a language model

When can a language model be locally normalized?

The answer to that is simple: every language model can be locally normalized! While the intuition behind this is very simple, the precise formulation is not. Before we discuss the details, we have to introduce the concept of prefix probabilities, which denote the sum of the probabilities of all strings beginning with a certain prefix.

Definition 2.4.6: Prefix probability

Let p_LM be a language model. We define p_LM's prefix probability π as

π(y) := Σ_{y′ ∈ Σ*} p_LM(y y′),    (2.18)

that is, the probability that y is a prefix of any string y y′ in the language or, equivalently, the cumulative probability of all strings beginning with y.

Note that, naturally, π(ε) = 1.

Theorem 2.4.2: Any language model can be locally normalized

Let p_LM be a language model. Then, there exists a locally normalized language model p_LN, defined via a sequence model p_SM, such that, for all y ∈ Σ* with |y| = T,

p_LM(y) = p_LN(y) = p_SM(eos | y) ∏_{t=1}^{T} p_SM(y_t | y_{<t}).    (2.19)

(a) An example of a locally normalized language model, drawn as a tree rooted at bos whose leaves are eos. The values on the edges represent the conditional probability of observing the new word given the words observed higher up on the path from the root node bos. Note that the probabilities stemming from any inner node should sum to 1—however, to avoid clutter, only a subset of the possible arcs is drawn.

(b) An example of a globally normalized model, which can, for example, generate sentences based on the probabilities determined by normalizing the assigned scores p̂_GN.

Figure 2.2: "Examples" of a locally and a globally normalized language model.

Proof. We define the individual conditional probability distributions over the next symbol of the SM p_SM using the chain rule of probability. If π(y) > 0, then define

p_SM(y | y) := π(y y) / π(y)    (2.20)

for y ∈ Σ and y ∈ Σ*. We still have to define the probabilities of ending the sequence under p_SM by defining the eos probabilities. We define, for any y ∈ Σ* such that π(y) > 0,

p_SM(eos | y) := p_LM(y) / π(y),    (2.21)

that is, the probability that the model generates exactly the string y and not any continuation y y′ of it, given that y has already been generated. Each of the conditional distributions of this model (Eqs. (2.20) and (2.21)) is clearly a distribution over Σ̄. This, therefore, defines a valid SM. To see that p_LN constitutes the same distribution as p_LM, consider two cases.

Case 1: Assume π(y) > 0. Then, we have

p_LN(y) = [∏_{t=1}^{T} p_SM(y_t | y_{<t})] p_SM(eos | y)    (2.22)
        = (π(y_1)/π(ε)) (π(y_1 y_2)/π(y_1)) ··· (π(y_{<T})/π(y_{<T−1})) (π(y)/π(y_{<T})) p_SM(eos | y)
        = (π(y)/π(ε)) (p_LM(y)/π(y))    (2.23)
        = p_LM(y),    (2.24)

where the product telescopes and π(ε) = 1.

Case 2: Assume π(y) = 0. Let y = y_1 ··· y_T and let t′ be the smallest index 1 ≤ t′ ≤ T such that π(y_{≤t′}) = 0 (it exists since π(y_{≤T}) = π(y) = 0, while π(y_{<1}) = π(ε) = 1). Then p_SM(y_{t′} | y_{<t′}) = π(y_{≤t′})/π(y_{<t′}) = 0, and hence

p_LN(y) = p_SM(eos | y) ∏_{t=1}^{T} p_SM(y_t | y_{<t}) = 0,    (2.25)

whereas the conditional probabilities at later positions, whose contexts have zero prefix probability, can be arbitrarily defined, since they do not affect the string having probability 0. Moreover, p_LM(y) ≤ π(y) = 0, so p_LN(y) = p_LM(y) in this case as well. ■
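As a concrete instance of this construction (our worked example), take the language model of Example 2.3.2 with p_LM(a^n) = 2^{-(n+1)}. Its prefix probabilities and the resulting conditionals of Eqs. (2.20) and (2.21) are

π(a^n) = Σ_{m ≥ n} p_LM(a^m) = Σ_{m ≥ n} 2^{-(m+1)} = 2^{-n},
p_SM(a | a^n) = π(a^{n+1}) / π(a^n) = 1/2,
p_SM(eos | a^n) = p_LM(a^n) / π(a^n) = 1/2,

so the locally normalized form of this model flips a fair coin at every step between emitting another a and stopping; the chain rule then indeed recovers p_LM(a^n) = (1/2)^n · (1/2) = 2^{-(n+1)}.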

When Is a Locally Normalized Language Model a Language Model?

LNMs, which specify distributions over strings p_LN(y_1 … y_T) in terms of their conditional probabilities p_SM(y_t | y_{<t}) for t = 1, …, T and p_SM(eos | y), have become the standard in the NLP literature. However, LNMs come with their own set of problems. An advantage of normalizable globally normalized models is that they, by definition, always define a valid probability distribution over Σ*. Although this might be counterintuitive at first, the same cannot be said for LNMs—in this sense, locally normalized "language models" might not even be language models! One might expect that in an LNM p_LN, it would hold that Σ_{y ∈ Σ*} p_LN(y) = 1. However, this might not be the case! This is the issue with the terminology we brought up earlier, and it brings us to the last fundamental question of this section.

Question 2.3: Locally normalized language models

When does an LNM encode a language model?

As the conditions are a bit more nuanced, this question requires a longer treatment. We explore it in much more detail in the next section.

2.5 Tight Language Models

We saw in the last section that any language model p_LM can be converted into a locally normalized sequence model (cf. §2.4.2). The converse, however, is not true. As alluded to in the previous section, and as we detail in this section, there exist sets of conditional distributions p_SM(y | y) such that p_LN(y), as defined in Eq. (2.16), does not represent a valid probability measure over Σ* (after taking into account the semantics of eos), i.e., over the set of finite strings. Indeed, we will later show that some popular classes of locally normalized sequence models used in practice have parameter settings in which the generative process terminates with probability < 1. This means that p_LN "leaks" some of its probability mass to infinite sequences. This section investigates this behavior in detail. It is based on the recent work of Du et al. (2022).

2.5.1 Tightness

Models whose generative process may fail to terminate are called non-tight (Chi, 1999).^10

Definition 2.5.1: Tightness

A locally normalized language model p_LN derived from a sequence model p_SM is called tight if it defines a valid probability distribution over Σ*:

Σ_{y ∈ Σ*} p_LN(y) = Σ_{y ∈ Σ*} p_SM(eos | y) ∏_{t=1}^{T} p_SM(y_t | y_{<t}) = 1.    (2.26)

Note that the individual conditional distributions p_SM(y | y) in a non-tight LNM are still valid conditional distributions (i.e., they sum to one). However, the distribution over all possible strings that they induce may not sum to 1. To be able to investigate this phenomenon more closely, let us first examine what the conditional probabilities of an LNM actually define and how they can result in non-tightness. We now ask ourselves: given a sequence model p_SM, what is p_LN? Is p_LN a language model, i.e., a distribution over Σ* (after taking into account the semantics of eos)? Certainly, the answer is yes if the LNM's conditional probabilities match the conditional probabilities of some known language model p_LM as defined in §2.4.2,^11 in which case p_LN is specifically the language model p_LM itself. In this case, clearly p_LN(Σ*) := Σ_{y ∈ Σ*} p_LN(y) = Σ_{y ∈ Σ*} p_LM(y) = 1. If instead p_LN(Σ*) < 1, the LNM's conditional probabilities do not match the conditional probabilities of any language model p_LM.

To see how this can happen, we now exhibit such an LNM in the following example.

^10 Tight models are also called consistent (Booth and Thompson, 1973; Chen et al., 2018) and proper (Chi, 1999) in the literature.
^11 That is, p_LM(y_t | y_{<t}) = p_LN(y_t | y_{<t}) whenever the former conditional probability is well-defined under the language model p_LM, i.e., whenever y_t ∈ Σ̄ and y_{<t} ∈ Σ* with π(y_{<t}) > 0.

Example 2.5.1: A non-tight 2-gram model

Consider the bigram model defined in Fig. 2.3a over the alphabet Σ = {a, b}.^a Although the conditional probability distributions p_LN(· | y_{<n}) each sum to 1 over Σ̄, they fail to combine into a model p_LN that sums to 1 over Σ* (i.e., a language model): under this model, any finite string that contains the symbol b will have probability 0, since p_LN(eos | b) = p_LN(a | b) = 0. This implies p_LN(Σ*) = Σ_{n=1}^{∞} p_LN(a^n) = Σ_{n=1}^{∞} (0.7)^{n−1} · 0.1 = 0.1/(1 − 0.7) = 1/3 < 1.

^a The graphical representation of the LNM depicts a so-called weighted finite-state automaton, a framework of language models we will introduce shortly. For now, it is not crucial that you understand the graphical representation; you can simply focus on the conditional probabilities specified in the figure.

Example 2.5.2: A tight 2-gram model

On the other hand, in the bigram model in Fig. 2.3b, obtained from Example 2.5.1 by changing the arcs leaving the b state, p_LN(Σ*) = 1. We can see that by calculating:

P(Σ*) = Σ_{n=1}^{∞} Σ_{m=0}^{∞} P(a^n b^m)
      = Σ_{n=1}^{∞} ( P(a^n) + Σ_{m=1}^{∞} P(a^n b^m) )
      = Σ_{n=1}^{∞} ( 0.1 · (0.7)^{n−1} + Σ_{m=1}^{∞} (0.7)^{n−1} · 0.2 · (0.9)^{m−1} · 0.1 )
      = Σ_{n=1}^{∞} ( 0.1 · (0.7)^{n−1} + (0.7)^{n−1} · 0.2 · (1/(1 − 0.9)) · 0.1 )
      = Σ_{n=1}^{∞} ( 0.1 · (0.7)^{n−1} + 0.2 · (0.7)^{n−1} )
      = Σ_{n=1}^{∞} 0.3 · (0.7)^{n−1} = 0.3 / (1 − 0.7) = 1.
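The two computations above can also be checked numerically. The following dynamic-programming sketch (ours) sums p_LN(y) over all strings up to a length cutoff for the bigram models of Fig. 2.3 and reproduces the values 1/3 and 1.

def total_mass(p, max_len=200):
    """Sum of p_LN(y) over all strings of length < max_len under a bigram model.

    p[prev] gives the conditional probabilities of "a", "b", and "eos" given
    the previous symbol prev (with prev = "bos" at the beginning).
    """
    mass = 0.0
    state = {"bos": 1.0}                      # probability mass of prefixes ending in each symbol
    for _ in range(max_len):
        new_state = {}
        for prev, prob in state.items():
            mass += prob * p[prev].get("eos", 0.0)          # terminate here
            for sym in ("a", "b"):                          # or extend the prefix
                q = prob * p[prev].get(sym, 0.0)
                if q > 0.0:
                    new_state[sym] = new_state.get(sym, 0.0) + q
        state = new_state
    return mass

non_tight = {"bos": {"a": 1.0}, "a": {"a": 0.7, "b": 0.2, "eos": 0.1}, "b": {"b": 1.0}}
tight = {"bos": {"a": 1.0}, "a": {"a": 0.7, "b": 0.2, "eos": 0.1}, "b": {"b": 0.9, "eos": 0.1}}
print(total_mass(non_tight))   # approximately 0.3333: two thirds of the mass leaks to infinite sequences
print(total_mass(tight))       # approximately 1.0: all of the mass lies on finite strings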

Example 2.5.1 confirms that local normalization does not necessarily yield a p_LN that is a valid distribution over Σ*. But if p_LN is not a language model, what is it? It is intuitive to suspect that, in a model with p_LN(Σ*) < 1, the remainder of the probability mass "leaks" to infinite sequences, i.e., the generative process may continue forever with probability > 0. This means that, to be able to characterize p_LN, we will have to be able to somehow take infinite sequences into account. We will make this intuition formal below.

Delving a bit deeper, the non-tightness of Example 2.5.1 is related to the fact that the conditional probability of eos is 0 at some states, in contrast to Example 2.5.2. However, requiring p_LN(y_n = eos | y_{<n}) > 0 for all prefixes y_{<n} is neither necessary nor sufficient to ensure tightness. It is not necessary because one can, for example, construct an LNM in which p_LN(y_n = eos | y_{<n}) = 0.1 when n is even but = 0 otherwise. Such a model generates only odd-length strings but is tight. We will postpone the discussion of non-sufficiency until later, where we will present specific LNMs under which the conditional probability of eos is always > 0, yet which are non-tight.

(a) A non-tight 2-gram model: p_LN(a | bos) = 1, p_LN(a | a) = 0.7, p_LN(b | a) = 0.2, p_LN(eos | a) = 0.1, p_LN(b | b) = 1, p_LN(eos | eos) = 1.

(b) A tight 2-gram model: p_LN(a | bos) = 1, p_LN(a | a) = 0.7, p_LN(b | a) = 0.2, p_LN(eos | a) = 0.1, p_LN(b | b) = 0.9, p_LN(eos | b) = 0.1, p_LN(eos | eos) = 1.

Figure 2.3: Tight and non-tight bigram models, expressed as Mealy machines. Symbols with conditional probability 0 are omitted.

2.5.2 Defining the probability measure of an LNM

We now rigorously characterize the kind of distribution induced by an LNM, i.e., we investigate what p_LN is. As mentioned earlier, an LNM can lose probability mass to the set of infinite sequences, Σ^∞. However, Σ^∞, unlike Σ*, is uncountable, and it is due to this fact that we need to work explicitly with the measure-theoretic formulation of probability which we introduced in §2.2. We already saw in Example 2.1.1 the peril of not treating distributions over uncountable sets carefully—the set of all infinite sequences of coin tosses is indeed uncountable.

Including infinite strings and the end-of-string symbol. As we saw in Example 2.1.1, sampling successive symbols from a non-tight LNM has probability > 0 of continuing forever, i.e., of generating infinite strings. Motivated by that, we hope to regard the LNM as defining a valid probability space over Ω = Σ* ∪ Σ^∞, i.e., both finite as well as infinite strings, and then "relate" it to our definition of true language models. Notice, however, that we also have to account for the difference in the alphabets: while we would like to characterize language models in terms of strings over the alphabet Σ, LNMs work over symbols in Σ̄.

With this in mind, we now embark on our journey of discovering what p_LN represents. Given an LNM, we will first need to turn its p_LN into a measurable space by defining an appropriate σ-algebra. This type of distribution is more general than a language model, as it works over both finite as well as infinite sequences. To distinguish the two, we will expand our vocabulary and explicitly differentiate between true language models and non-tight LNMs. We will refer to a distribution over Σ* ∪ Σ^∞ as a sequence model. As noted in our definition of a sequence model (cf. Definition 2.4.4), an LNM defines a probability measure over Σ* ∪ Σ^∞. Thus, an equivalent definition, which will be useful for this section, is the following.

Figure 2.4: The outline of our measure-theoretic treatment of LNMs in this section to arrive at a precise characterization of p_LN: cylinder sets give an algebra (Σ̄^∞, C); the conditional probabilities of p_LN equip it with a pre-measure (Σ̄^∞, C, P_0); Carathéodory's extension theorem yields a measure (Σ̄^∞, σ(C), P); and a random variable construction finally yields a measure over Σ^∞ ∪ Σ*. The final box corresponds to the sequence model (probability measure over Σ* ∪ Σ^∞) constructed for p_LN.

Definition 2.5.2: Sequence model

A sequence model is a probability space over the set Σ* ∪ Σ^∞.

Intuitively, and we will make this precise later, the set Σ^∞ ⊂ Σ* ∪ Σ^∞ in Definition 2.5.2 represents the event where the sequence model is non-terminating, i.e., it attempts to generate an infinitely long sequence. We can then understand language models in a new sense.

Definition 2.5.3: Re-definition of a language model

A language model is a probability space over Σ*. Equivalently, a language model is a sequence model such that P(Σ^∞) = 0.

Now buckle up! Our goal through the rest of this section is to rigorously construct the probability space of a sequence model, as in Definition 2.2.2 and Definition 2.5.2, which encodes the probabilities assigned by an LNM. Then, we will use this characterization to formally investigate tightness. An outline of what this is going to look like is shown in Fig. 2.4.

12 Defining an Algebra over Σ8 (Step 1)


13 Since an LNM produces conditional distributions over the augmented alphabet Σ (first box in
14 Fig. 2.4) and results in possibly infinite strings, we will first construct a probability space over Σ8 ,
15 which will naturally induce a sequence model. We will do that by first constructing an algebra (cf.

1 Definition 2.2.3) over Ω “ Σ8 for some alphabet Σ (second box in Fig. 2.4). Then, assuming we
2 are given an LNM pLN over Σ, we will associate the constructed algebra with a pre-measure (cf.
3 Definition 2.2.4) that is “consistent” with pLN (third box in Fig. 2.4).
4 We will make use of the following definition to construct the algebra:

Definition 2.5.4: Cylinder set

Given any set H ⊆ Σ^k, i.e., a set of sequences of symbols from Σ of length k, define its
cylinder set (of rank k) to be

C(H) := { yω : y ∈ H, ω ∈ Σ^∞ }.   (2.27)

In essence, a cylinder set of rank k is the set of infinite strings that share their k-prefix with some
string y ∈ H ⊆ Σ^k. In particular, for a length-k string y = y_1 ··· y_k, the cylinder set C(y) := C({y})
is the set of all infinite strings prefixed by y.12


We denote the collection of all rank-k cylinder sets by

C_k := { C(H) : H ∈ P(Σ^k) }   (2.28)

and define

C := ∪_{k=1}^∞ C_k   (2.29)

to be the collection of all cylinder sets over Ω.13


12 The following lemma asserts C Ď PpΩq is what we want in the second block of Fig. 2.4.

Lemma 2.5.1

C Ď PpΩq is an algebra over Ω “ Σ8 .


13

Proof. First, Σ^∞ = C(Σ^k) for any k, and is in particular a cylinder set of any rank. Secondly,
given a cylinder set C(H) of rank k, i.e., H ⊆ Σ^k, we have (C(H))^c = C(Σ^k \ H). Hence, C is closed under
complements. Finally, notice that the intersection of two cylinder sets of ranks k_1 ≤ k_2 is another
cylinder set of rank k_2. Hence, C is an algebra over Ω. ■

18 With this, the first step of Fig. 2.4 is done!

19 Defining a Pre-measure over C (Step 2)


We are now ready to define the pre-measure P_0 for the cylinder algebra C. Given an LNM p_LN and
any set C(H) ∈ C, let

P_0(C(H)) := ∑_{y ∈ H} p_LN(y)   (2.30)

12 This type of cylinder set, i.e., one that is generated by a singleton, is also called a thin cylinder.
13 We invite the reader to verify that C_1 ⊂ C_2 ⊂ C_3 ⊂ ···.

where, for a string y of length T, we have defined

p_LN(y) := ∏_{t=1}^{T} p_LN(y_t | y_{<t}).   (2.31)

2 Note that there is a caveat here since the same cylinder set may admit different H.14 Before showing
3 that P0 defines a valid pre-measure, we address this and show that P0 is indeed well defined.
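To make Eqs. (2.30) and (2.31) concrete, here is a minimal sketch in Python, assuming a hypothetical LNM whose conditional distribution is the same for every context; the symbols "a" and "b" and the probability values are illustrative choices, not anything prescribed by the text.

```python
# A toy LNM over {"a", "b"} plus eos; we assume (hypothetically) that the
# conditional distribution is the same for every context.
def p_lnm(symbol, context):
    return {"a": 0.45, "b": 0.45, "eos": 0.10}[symbol]

def p_prefix(y):
    """p_LN(y) = prod_t p_LN(y_t | y_<t), cf. Eq. (2.31)."""
    p = 1.0
    for t, symbol in enumerate(y):
        p *= p_lnm(symbol, y[:t])
    return p

def pre_measure(H):
    """P0(C(H)) = sum_{y in H} p_LN(y), cf. Eq. (2.30)."""
    return sum(p_prefix(y) for y in H)

# The whole outcome space is the rank-1 cylinder of the full augmented alphabet.
print(pre_measure([("a",), ("b",), ("eos",)]))             # 1.0
# The same cylinder set described by two different prefix sets (cf. Prop. 2.5.1):
print(pre_measure([("a",)]))                               # 0.45
print(pre_measure([("a", s) for s in ("a", "b", "eos")]))  # ~0.45 as well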

Proposition 2.5.1

P0 as defined in Eq. (2.30) is a well-defined function.


4

Proof. Suppose a cylinder set can be described by two different prefix sets: H_1 ⊆ Σ^{k_1} and H_2 ⊆ Σ^{k_2}.
In other words, C(H_1) = C(H_2). Without loss of generality, assume that k_1 ≤ k_2. Then,

C(H_2) = C(H_1)   (2.32a)
  = ∪_{y ∈ H_1} C(y)   (2.32b)
  = ∪_{y ∈ H_1} ∪_{y' ∈ Σ^{k_2−k_1}} C(yy').   (2.32c)

All the unions above are disjoint, and hence H_2 = ∪_{y ∈ H_1} { yy' : y' ∈ Σ^{k_2−k_1} }. Then, by the
locally-normalizing property of p_LN (the conditional probabilities of all one-symbol extensions of a
prefix sum to 1), extending every y ∈ H_1 by all possible suffixes of length k_2 − k_1 leaves the total
probability unchanged, and we have that

P_0(C(H_1)) = P_0(C(H_2)).   (2.33)

■

10 With this, we are able to state and prove the lemma which shows that P0 is a pre-measure, which
11 is what we need in the third block of Fig. 2.4.

Lemma 2.5.2

P0 is a pre-measure over C.
12

13 For the proof of Lemma 2.5.2, we will mostly follow the proof of Theorem 2.3 in Billingsley
14 (1995), with the exception of invoking the Tychonoff theorem directly. This proof depends on the
15 following lemma, which is Example 2.10 in Billingsley (1995). We repeat the statement and proof
16 here for the reader’s convenience.

Lemma 2.5.3

Let P_0 be a finitely additive probability pre-measure over C such that, for every decreasing
sequence of sets A_1 ⊃ A_2 ⊃ ··· in C with ∩_{n=1}^∞ A_n = ∅, we have lim_{n→∞} P_0(A_n) = 0. Then, P_0 is
also countably additive over C.

14 For example, in the infinite coin toss model, C({H}) = C({HH, HT}).

Proof. Let {A_n} be a sequence of disjoint sets in C such that A = ∪_n A_n ∈ C. Then, defining
B_n := ∪_{m>n} A_m, we see that B_1 ⊃ B_2 ⊃ ··· and ∩_n B_n = ∅. Notice that

A = A_1 ∪ B_1 = A_1 ∪ A_2 ∪ B_2 = ··· = A_1 ∪ ··· ∪ A_n ∪ B_n   (2.34)

for any n, and hence, by finite additivity of P_0,

P_0(A) = P_0(A_1) + ··· + P_0(A_n) + P_0(B_n)   (2.35)

or, equivalently,

P_0(A_1) + ··· + P_0(A_n) = P_0(A) − P_0(B_n).   (2.36)

Since B_n ↓ ∅ implies that P_0(B_n) ↓ 0 by assumption, taking the limits on both sides of Eq. (2.36)
yields

∑_n P_0(A_n) = lim_{N→∞} ∑_{n ≤ N} P_0(A_n) = P_0(A) − lim_{N→∞} P_0(B_N) = P_0(A)   (2.37)

which shows countable additivity. ■


8 We also recall the Tychonoff theorem.15

Theorem 2.5.1: Tychonoff

Let {X_α}_{α∈J} be an indexed family of compact topological spaces. Then, their product
∏_{α∈J} X_α is also compact in the product topology.

10 We can now give the proof for Lemma 2.5.2.


Proof of Lemma 2.5.2. We first show that P_0 is finitely additive over C. Let C(H_1) and C(H_2) be
two disjoint cylinder sets. By Proposition 2.5.1, we can assume they are of the same rank without
loss of generality. Then,

C(H_1) ∪ C(H_2) = ∪_{y ∈ H_1} {yω : ω ∈ Σ^∞} ∪ ∪_{y ∈ H_2} {yω : ω ∈ Σ^∞}   (2.38a)
  = ∪_{y ∈ H_1 ∪ H_2} {yω : ω ∈ Σ^∞}   (H_1 and H_2 of equal rank and disjoint)   (2.38b)
  = C(H_1 ∪ H_2)   (2.38c)

which leads to

P_0(C(H_1) ∪ C(H_2)) = P_0(C(H_1 ∪ H_2))   (2.39a)
  = ∑_{y ∈ H_1 ∪ H_2} p_LN(y)   (2.39b)
  = P_0(C(H_1)) + P_0(C(H_2)).   (2.39c)

Hence, P_0 is finitely additive.
Now, equip Σ with the discrete topology. Since Σ is finite, it is compact under the discrete
topology, and so is Σ^∞ by Theorem 2.5.1. Then, by properties of the product topology over discrete
finite spaces, all cylinder sets in Σ^∞ are compact. To apply Lemma 2.5.3, let C_1 ⊃ C_2 ⊃ ···
be a decreasing sequence of cylinder sets with empty intersection. Suppose, to the contrary, that
lim_{n→∞} P_0(C_n) > 0. This would imply that all C_n are nonempty (any of these being empty would result
in a measure of 0). However, by Cantor's intersection theorem,16 ∩_n C_n would then be nonempty,
contradicting the assumption that the intersection is empty. Hence, lim_{n→∞} P_0(C_n) = 0, and by
Lemma 2.5.3, P_0 is countably additive.

15 See §37 in Munkres (2000) for a detailed and well-written treatise.


With this, we have proved that P_0 is countably additive. To show that P_0 defines a pre-measure,
we still have to show that P_0(Ω) = 1. Recall from the proof of Lemma 2.5.1 that Σ^∞ = C(Σ^k) for
any k > 0. In particular, Σ^∞ = C(Σ^1) = C(Σ). This means that

P_0(Ω) = P_0(C(Σ))   (2.40)
  = ∑_{y ∈ Σ} p_LN(y)   (2.41)
  = ∑_{y ∈ Σ} p_LN(y | bos) = 1.   (2.42)

The last equality follows from the local normalization of p_LN. ■

10 With this, we have successfully completed the first two steps of Fig. 2.4! However, we have
11 only defined a pre-measure over the set of infinite eos-containing sequences Σ8 . This does not yet
12 satisfy all the properties we would like from a probability space. Because of that, we next extend
the constructed probability pre-measure P_0 into a valid probability measure P to arrive at a valid
probability space.

15 Extending the Pre-measure P0 into a Measure P (Step 3)


16 To extend P0 into a measure, we will use Carathéodory’s theorem:

Theorem 2.5.2: Carathéodory’s Extension Theorem

Given an algebra A over some set Ω and a probability pre-measure P0 : A Ñ r0, 1s, there exists
a probability space pΩ, F, Pq such that A Ă F and P|A “ P0 . Furthermore, the σ-algebra F
depends only on A and is minimal and unique, which we will also denote by σpAq, and the
probability measure P is unique.
17

18 Proof Sketch. First, construct an outer measure by approximation with countable coverings. Then,
19 show that the collection of sets that is measurable with respect to this outer measure is a σ-algebra
20 F that contains A. Finally, restricting the outer measure to this σ-algebra, one is then left with
21 a probability space. To show minimality, one can show that F is contained in any σ-algebra that
22 contains A. Uniqueness is given by applying Dynkin’s π-λ theorem (Theorem 3.2 in Billingsley,
23 1995).
Great care must be taken in each step involved in the outline above. Addressing these steps is well
beyond the scope of this treatment, and we refer the reader to the many excellent texts with a proof
of this theorem, such as Chapter 12 in Royden (1988) and Chapter 11 in Billingsley (1995). ■
16 Cantor's intersection theorem states that a decreasing sequence of nonempty compact sets has a nonempty
intersection. A version of this result in introductory real analysis is the Nested Interval Theorem.

1 Applying Carathéodory’s extension theorem to our cylinder algebra C and pre-measure P0 , we


2 see that there exists a probability space pΣ8 , σpCq, Pq over Σ8 that agrees with the LNM pLN ’s
3 probabilities.
4 Phew! This now gets us to the fourth box in Fig. 2.4 and we only have one step remaining.

5 Defining a Sequence Model (Step 4)


We now have to make sure that the outcome space of the defined probability space fits the definition
of a sequence model. That is, we have to find a way to convert (map) the infinite eos-containing
sequences from Σ^∞ into the eos-free finite or infinite strings required by a sequence model as per
Definition 2.5.2. We will achieve this through the use of a random variable.
Recall from Definition 2.2.5 that a random variable is a measurable mapping between two measurable
spaces. Since we want our final measure space to have the outcome space Σ* ∪ Σ^∞, we therefore want
to construct a σ-algebra over Σ* ∪ Σ^∞ and then map elements from Σ^∞ to Σ* ∪ Σ^∞ to obtain the
appropriate objects. We will do so in a similar fashion as we constructed (Σ^∞, C). Given H ⊂ Σ^k,
define a rank-k cylinder set in Σ* ∪ Σ^∞ to be

C(H) := { yω : y ∈ H, ω ∈ Σ* ∪ Σ^∞ }.   (2.43)

Notice the major change from Eq. (2.27): the suffixes ω of the elements of C(H) now come from
Σ* ∪ Σ^∞ rather than Σ^∞. This means (i) that they do not contain eos and (ii) that they (and
thus elements of C(H)) can also be finite. Let C_k be the set of all rank-k cylinder sets of this kind,
and define C := ∪_{k=1}^∞ C_k. Then, σ(C) is a σ-algebra by the same reasoning as in Lemma 2.5.1 and Theorem 2.5.2.

We can now define the following random variable: for any ω ∈ Σ^∞,

x(ω) := { ω_{<k}   if the first eos in ω occurs at position k,
        { ω        otherwise (if eos ∉ ω).                       (2.44)

The proposition below shows that x is a measurable mapping and hence a well-defined random variable.

Proposition 2.5.2

The function x : pΣ8 , σpCqq Ñ pΣ˚ Y Σ8 , σpCqq defined in Eq. (2.44) is a measurable mapping.
21

Proof. To show that x is measurable, it suffices to show the measurability of the preimages of a
generating set of the σ-algebra. Note that the set of thin cylinder sets is a generating set. Let C(y)
be a thin cylinder set in Σ* ∪ Σ^∞. Then

x^{−1}(C(y)) = x^{−1}({yω : ω ∈ Σ* ∪ Σ^∞})   (2.45a)
  = x^{−1}({yω : ω ∈ Σ*}) ∪ x^{−1}({yω : ω ∈ Σ^∞})   (2.45b)
  = ( ∪_{ω ∈ Σ*} C(yω eos) ) ∪ ( C(y) ∩ ∩_{k=1}^∞ A_k^c )   (2.45c)

Note that the sets A_k above, defined in Eq. (2.58), are cylinder sets representing the event that eos
appears at position k. From the derivation above, we can see that x^{−1}(C(y)) is formed by
countable operations over measurable sets (cylinder sets) in Σ^∞, and is hence measurable. So x is
a measurable function. ■

Intuitively, x "cuts out" the first stretch of ω before the first eos symbol (where an LNM would
stop generating), or leaves the sequence intact if there is no termination symbol eos. One can
check that the pushforward P* of P under x, defined by P*(A) := P(x^{−1}(A)), is indeed a probability
measure on (Σ* ∪ Σ^∞, σ(C)), and hence (Σ* ∪ Σ^∞, σ(C), P*) is a probability space. We have
therefore arrived at the final box of Fig. 2.4 and shown that, given any LNM, we can construct an
associated sequence model as defined in Definition 2.5.2! In other words, given an LNM p_LN, we have
constructed a sequence model p_SM (a probability space over Σ^∞ ∪ Σ*) whose probabilities agree with
those assigned by p_LN.
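The following sketch illustrates the construction numerically, again assuming a hypothetical LNM with context-independent conditionals (the symbols and probabilities are placeholders): it samples eos-containing sequences, applies the random variable x of Eq. (2.44), and checks that the empirical frequency of a particular finite string matches p_LN(y) p_LN(eos | y), anticipating Eq. (2.50) below.

```python
import random

random.seed(0)
EOS = "eos"
SYMS = ["a", "b", EOS]
PROBS = [0.45, 0.45, 0.10]          # hypothetical, context-independent LNM conditionals

def sample_omega(max_len=50):
    """Sample an eos-containing sequence from the LNM (truncated for the demo)."""
    return random.choices(SYMS, weights=PROBS, k=max_len)

def x(omega):
    """The random variable of Eq. (2.44): cut omega just before its first eos."""
    out = []
    for s in omega:
        if s == EOS:
            return tuple(out)        # a finite, eos-free string
        out.append(s)
    return tuple(out)                # no eos seen: an (here truncated) infinite string

# Empirically, P*(x = y) should match p_LN(y) * p_LN(eos | y), cf. Eq. (2.50).
n, target = 100_000, ("a", "b")
hits = sum(x(sample_omega()) == target for _ in range(n))
print(hits / n)                      # ~= 0.45 * 0.45 * 0.10 = 0.02025
```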

9 2.5.3 Interpreting the Constructed Probability Space


Under the formulation of a probability space together with a random variable, useful probabilistic
quantities arise naturally and intuitively.
Consider, for example, the probability of a single finite string y ∈ Σ*, P*(y). By the definition of x,
this equals

P*(y) = P*(x = y)   (2.46)
  = P(x^{−1}(y))   (2.47)
  = P(all the sequences ω ∈ Σ^∞ which map to y).   (2.48)

All the sequences ω ∈ Σ^∞ which map to y are sequences of the form ω = y eos ω' for ω' ∈ Σ^∞: this
is exactly the cylinder C(y eos)! By the definition of the probability space (Σ^∞, σ(C), P), this is

P(C(y eos)) = ∑_{y' ∈ {y eos}} p_LN(y') = p_LN(y eos)   (2.49)

and, as before, p_LN(y eos) = ∏_{t=1}^{T} p_LN(y_t | y_{<t}) · p_LN(eos | y).
17 Altogether, this means that, given a finite string y P Σ˚ , we intuitively have

P˚ px “ yq “ pLN peos | yqpLN pyq. (2.50)

18 Additionally, as we will show in the next section, the probability of the set of infinite strings
19 P˚ px P Σ8 q is the probability of generating an infinite string.
20 An important technical detail left out in this discussion so far is that both the singleton set tyu
21 and Σ8 need to be measurable in pΣ˚ Y Σ8 , σpCqq for the above to make sense. This is addressed
22 by Proposition 2.5.3 and Proposition 2.5.4.

Proposition 2.5.3

In measure space pΣ˚ Y Σ8 , σpCqq, tyu is measurable for all y P Σ˚ .


23

Proof. By the definition in Eq. (2.43), for any y ∈ Σ*,

C(y) = {yω : ω ∈ Σ* ∪ Σ^∞}   (2.51a)
  = {yω : ω ∈ Σ*} ∪ {yω : ω ∈ Σ^∞}   (2.51b)

where

{yω : ω ∈ Σ*} = {y} ∪ ∪_{a ∈ Σ} {yaω : ω ∈ Σ*}   (2.52)

and

{yω : ω ∈ Σ^∞} = ∪_{a ∈ Σ} {yaω : ω ∈ Σ^∞}.   (2.53)

So,

C(y) = {y} ∪ ∪_{a ∈ Σ} ( {yaω : ω ∈ Σ*} ∪ {yaω : ω ∈ Σ^∞} )   (2.54a)
  = {y} ∪ ∪_{a ∈ Σ} C(ya)   (2.54b)

which implies that {y} = C(y) \ ∪_{a ∈ Σ} C(ya) and is hence measurable. ■

Proposition 2.5.4

In the measure space pΣ˚ Y Σ8 , σpCqq, Σ8 is measurable.


5

Proof. First, the outcome space Σ* ∪ Σ^∞ is measurable by the definition of a σ-algebra. Notice that

Σ^∞ = (Σ* ∪ Σ^∞) \ ∪_{y ∈ Σ*} {y}.   (2.55)

Since each {y} above is measurable by Proposition 2.5.3 and Σ* is a countable set, Σ^∞ is
then measurable. ■

9 Since both tyu and Σ8 are measurable in pΣ˚ Y Σ8 , σpCqq by Propositions 2.5.3 and 2.5.4, we
10 have the following.

Proposition 2.5.5

A sequence model (Σ* ∪ Σ^∞, σ(C), P) is tight if and only if ∑_{y ∈ Σ*} P({y}) = 1.

Proof. By definition, a sequence model is tight if and only if P(Σ^∞) = 0. By Propositions 2.5.3
and 2.5.4, and since P(Σ* ∪ Σ^∞) = 1, we can write

1 = P(Σ* ∪ Σ^∞) = P(Σ^∞) + P(Σ*)   (2.56a)
  = P(Σ^∞) + ∑_{y ∈ Σ*} P({y}).   (countable additivity)   (2.56b)

Hence, a sequence model is tight if and only if ∑_{y ∈ Σ*} P({y}) = 1. ■
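Proposition 2.5.5 suggests a simple numerical diagnostic: accumulate ∑_y P({y}) over all strings of bounded length and see whether it approaches 1. The sketch below does this for two hypothetical LNMs whose eos probability depends only on the prefix length (so the sum over strings of a given length collapses to a single term); the specific probability schedules are illustrative, not taken from the text.

```python
# Mass that the induced measure P* places on strings of length <= L, i.e.,
# sum_{|y| <= L} P*({y}) = sum_{|y| <= L} p_LN(y) p_LN(eos | y).
# p_eos(t) is the (hypothetical) eos probability at generation step t.
def finite_mass(p_eos, L):
    total, survive = 0.0, 1.0          # survive = prob. of no eos in the first t-1 steps
    for t in range(1, L + 2):
        total += survive * p_eos(t)    # terminate exactly at step t (string length t-1)
        survive *= 1.0 - p_eos(t)
    return total

print(finite_mass(lambda t: 0.1, 5000))             # ~1.0              (tight)
print(finite_mass(lambda t: 0.5 ** (t + 1), 5000))  # ~0.42, stays < 1  (non-tight)
```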

1 Deriving eos

As an aside, the preceding section allows us to motivate the eos token in an LNM as a construct that
emerges naturally. Specifically, for any y ∈ Σ*, rearranging Eq. (2.50):

p_LN(eos | y) = P*(x = y) / p_LN(y)   (2.57a)
  = P*(x = y) / P*(x ∈ C(y))   (2.57b)
  = P*(x = y | x ∈ C(y))   (2.57c)

where we have used p_LN(y) = P(C(y)) = P(x^{−1}(C(y))) = P*(x ∈ C(y)). This means that the eos
probability in an LNM emerges as the conditional probability that, given that we must generate a
string with a prefix y ∈ Σ*, the string is exactly y, i.e., that generation ends there.

7 2.5.4 Characterizing Tightness


Now that we have derived a measure-theoretic formalization of the probability space induced by
locally-normalized models, we can use it to provide an exact characterization of tightness in LNMs.
First, we consider the event

A_k := {ω ∈ Σ^∞ : ω_k = eos}   (2.58)

in the probability space (Σ^∞, σ(C), P). Intuitively, A_k is the event that an eos symbol appears at
position k in the string. Note that under this definition the A_k are not disjoint. For example, the
string ω = ab eos c eos dddd ··· lives in the intersection of A_3 and A_5 since eos appears at both
position 3 and position 5. Using Eq. (2.58), we can express the event consisting of all finite strings as

∪_{k=1}^∞ A_k.   (2.59)

It follows that we can express the event of an infinite string as

( ∪_{k=1}^∞ A_k )^c = ∩_{k=1}^∞ A_k^c.   (2.60)

Thus, using the random variable x, we can express the probability of generating an infinite string as

P*(x ∈ Σ^∞) = P(x^{−1}(Σ^∞))   (2.61a)
  = P( ∩_{k=1}^∞ A_k^c ).   (2.61b)

17 Hence, we can now restate and formalize the notion of tightness.



Definition 2.5.5: Tight sequence model

A sequence model is said to be tight if P˚ px P Σ8 q “ 0, in which case it is also a language


model. Otherwise, we say that it is non-tight.
1

Note that the definition of A_k only depends on a string's k-prefix, and hence A_k is a cylinder set of
rank k. Recalling that the cylinder sets are measurable, and so are the sets countably generated from
them, we see that both the event consisting of all finite strings and the event consisting of all infinite
strings are measurable. Thus, P(∪_{k=1}^∞ A_k) and P(∩_{k=1}^∞ A_k^c) are well defined.

6 A Lower Bound Result

We have characterized tightness in terms of the probability of a specific event, P(∩_{k=1}^∞ A_k^c), a
quantity we now seek to determine.

Lemma 2.5.4
If ∑_{n=2}^∞ P(A_n | ∩_{m=1}^{n−1} A_m^c) = ∞, then P(∩_{m=1}^∞ A_m^c) = 0.

Proof. First, recall the elementary inequality that, for x > 0,

x − 1 ≥ log x  ⟺  1 − x ≤ log(1/x).   (2.62)

Note that P(∩_{m=1}^n A_m^c) > 0 for any n, for otherwise the conditional probabilities would be undefined.

Let p_n := P(∩_{m=1}^n A_m^c). Then we have that p_n > 0 for all n, and

∞ = ∑_{n=2}^∞ P(A_n | ∩_{m=1}^{n−1} A_m^c)   (2.63a)
  = ∑_{n=2}^∞ ( 1 − P(A_n^c | ∩_{m=1}^{n−1} A_m^c) )   (2.63b)
  = lim_{N→∞} ∑_{n=2}^N ( 1 − P(A_n^c | ∩_{m=1}^{n−1} A_m^c) )   (2.63c)
  ≤ lim_{N→∞} ∑_{n=2}^N log ( 1 / P(A_n^c | ∩_{m=1}^{n−1} A_m^c) )   (by Eq. (2.62))   (2.63d)
  = lim_{N→∞} ∑_{n=2}^N log ( P(∩_{m=1}^{n−1} A_m^c) / P(∩_{m=1}^n A_m^c) )   (2.63e)
  = lim_{N→∞} ∑_{n=2}^N log ( p_{n−1} / p_n )   (2.63f)
  = lim_{N→∞} ∑_{n=2}^N ( log p_{n−1} − log p_n )   (2.63g)
  = lim_{N→∞} ( log p_1 − log p_N )   (2.63h)
  = log p_1 − lim_{N→∞} log p_N   (2.63i)

which implies that

lim_{N→∞} log p_N = −∞   (2.64a)
⟺ lim_{N→∞} p_N = 0   (2.64b)
⟺ lim_{N→∞} P(∩_{m=1}^N A_m^c) = 0   (2.64c)
⟺ P(∩_{m=1}^∞ A_m^c) = 0.   (by continuity of measure)   (2.64d)

■

Using Lemma 2.5.4, we can derive the following useful sufficient condition for the tightness of a language
model. Specifically, it applies when the probability of eos is lower bounded by a function that depends
only on the length, and not the content, of the prefix.

Proposition 2.5.6

If p_LN(eos | y) ≥ f(t) for all t and all y ∈ Σ^t, and ∑_{t=1}^∞ f(t) = ∞, then P(∩_{k=1}^∞ A_k^c) = 0.
In other words, p_LN is tight.

Proof. Suppose p_LN(eos | y) ≥ f(t) for all t and all y ∈ Σ^t. To apply Lemma 2.5.4, we observe that

A_n ∩ (A_1^c ∩ ··· ∩ A_{n−1}^c) = {ω ∈ Σ^∞ : ω_n = eos} ∩ ( ∩_{i=1}^{n−1} {ω ∈ Σ^∞ : ω_i ≠ eos} )   (2.65a)
  = {ω ∈ Σ^∞ : ω_n = eos and ω_i ≠ eos for all i < n}   (2.65b)
  = {ω ∈ Σ^∞ : ω's first eos is at position n}   (2.65c)

and similarly

A_1^c ∩ ··· ∩ A_{n−1}^c = {ω ∈ Σ^∞ : there is no eos in ω's first n − 1 positions}.   (2.66)

Setting G := {ω eos : ω ∈ Σ^{n−1}} ⊂ Σ^n, we get

P(A_n | A_1^c ∩ ··· ∩ A_{n−1}^c) = P(A_n ∩ (A_1^c ∩ ··· ∩ A_{n−1}^c)) / P(A_1^c ∩ ··· ∩ A_{n−1}^c)   (2.67a)
  = P(C(G)) / P(C(Σ^{n−1}))   (definition of G)   (2.67b)
  = ∑_{ω ∈ Σ^{n−1}} p_LN(eos | ω) p_LN(ω) / ∑_{ω ∈ Σ^{n−1}} p_LN(ω)   (by Eq. (2.30))   (2.67c)
  ≥ ∑_{ω ∈ Σ^{n−1}} f(n−1) p_LN(ω) / ∑_{ω ∈ Σ^{n−1}} p_LN(ω)   (definition of f(t))   (2.67d)
  = f(n−1) · ∑_{ω ∈ Σ^{n−1}} p_LN(ω) / ∑_{ω ∈ Σ^{n−1}} p_LN(ω)   (2.67e)
  = f(n−1).   (2.67f)

Since ∑_{t=1}^∞ f(t) = ∞, Lemma 2.5.4 shows that the event of a string never terminating, i.e., ∩_{k=1}^∞ A_k^c,
has probability measure P(∩_{k=1}^∞ A_k^c) = 0. In other words, if the eos probability of a language model
is, at every step, lower bounded by the terms of a divergent series, then the event that this language model
terminates has probability 1. ■
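A small numerical illustration of Proposition 2.5.6, under the assumption of a hypothetical LNM whose eos probability equals the lower bound f(t) exactly: when the series of bounds diverges (a harmonic schedule), the probability of never terminating, ∏_t (1 − f(t)), is driven to 0 even though f(t) → 0; when the series converges, it is not.

```python
import math

# Probability of never terminating when p_LN(eos | y) = f(|y| + 1) exactly
# (a hypothetical LNM whose eos probability depends only on the prefix length).
def never_terminates(f, steps=10**6):
    log_prob = 0.0
    for t in range(1, steps + 1):
        log_prob += math.log1p(-f(t))          # accumulates log prod_t (1 - f(t))
    return math.exp(log_prob)

# Divergent series (harmonic tail): sum_t f(t) = inf  ->  the model is tight.
print(never_terminates(lambda t: 1.0 / (t + 1)))        # ~1e-6, heading to 0
# Convergent series: sum_t f(t) < inf  ->  positive mass on infinite strings.
print(never_terminates(lambda t: 1.0 / (t + 1) ** 2))   # ~0.5, bounded away from 0
```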

8 The Borel–Cantelli Lemmata


It turns out that Proposition 2.5.6 admits a converse statement in which we can prove a similar
property of p_LN by assuming that the model is tight. To show this result, we will use a fundamental
result from probability theory: the Borel–Cantelli lemmata. The Borel–Cantelli lemmata are
useful for our purposes because they relate the probability measure of sets of the form ∩_{n=0}^∞ A_n or
∪_{n=0}^∞ A_n to a series ∑_{n=0}^∞ p_n. We will only state the lemmata here without supplying their proofs;17
however, we point out that Lemma 2.5.4 can be viewed as a parallel statement to the Borel–Cantelli
lemmata, and one can prove the lemmata using a very similar proof (cf. proof of Theorem 2.3.7 in
Durrett, 2019).
Concretely, given a sequence of events {A_n}_{n=1}^∞ in some probability space, the Borel–Cantelli
lemmata are statements about the event

{A_n i.o.} := ∩_{m=1}^∞ ∪_{n=m}^∞ A_n   (2.68)
17 See §2.3 in Durrett (2019) or §4 in Billingsley (1995) instead.

where i.o. stands for "infinitely often." Intuitively, {A_n i.o.} is the set of outcomes that appear
in infinitely many sets in the collection {A_n}_{n=1}^∞: these are the outcomes that always remain in the
union of an infinite family of sets no matter how many of the leading ones we remove (hence the
name). We will not use Borel–Cantelli directly, but the lemmata offer a probabilistic proof of a key
result (Corollary 2.5.1) which will in turn lead to the desired statement about tightness. We formally
state the first and second Borel–Cantelli lemmata below.

Lemma 2.5.5: Borel–Cantelli I

If ∑_{n=1}^∞ P(A_n) < ∞, then P(A_n i.o.) = 0.

Lemma 2.5.6: Borel–Cantelli II

Assume {A_n} is a sequence of independent events. Then, ∑_{n=1}^∞ P(A_n) = ∞ ⟹ P(A_n i.o.) = 1.

9 Using the Borel–Cantelli lemmata, we can prove the following useful fact.

Corollary 2.5.1

Given a sequence {p_n} with p_n ∈ [0, 1), we have

∏_{n=1}^∞ (1 − p_n) = 0  ⟺  ∑_{n=1}^∞ p_n = ∞.   (2.69)

11 To show Corollary 2.5.1, we first show the following simple consequence of Borel–Cantelli.

Corollary 2.5.2

If P(A_n i.o.) = 1, then ∑_{n=1}^∞ P(A_n) = ∞.

Proof. Suppose to the contrary that ∑_{n=1}^∞ P(A_n) < ∞. Then, by Borel–Cantelli I (Lemma 2.5.5),
P(A_n i.o.) = 0, which contradicts the assumption. Hence, ∑_{n=1}^∞ P(A_n) = ∞. ■

Proof. We can use a product measure to construct a sequence of independent events {A_n}_{n=1}^∞ such
that P(A_n) = p_n. (The product measure ensures independence.) Then, by the definition in Eq. (2.68),

{A_n i.o.}^c = ∪_{m=1}^∞ ∩_{n≥m} A_n^c.   (2.70)

So,

1 − P(A_n i.o.) = P( ∪_m ∩_{n≥m} A_n^c )   (2.71a)
  = lim_{m→∞} P( ∩_{n≥m} A_n^c )   (2.71b)
  = lim_{m→∞} ∏_{n≥m} P(A_n^c)   (A_n are independent by construction)   (2.71c)
  = lim_{m→∞} ∏_{n≥m} (1 − p_n)   (2.71d)

(⟹): Assume ∏_{n=1}^∞ (1 − p_n) = 0. Then, for any m,

0 = ∏_{n≥1} (1 − p_n) = ( ∏_{1≤n<m} (1 − p_n) ) ( ∏_{n≥m} (1 − p_n) )   (2.72)

where the first factor on the right-hand side is > 0. So it must be the case that, for any m,
∏_{n≥m} (1 − p_n) = 0. Therefore,

1 − P(A_n i.o.) = lim_{m→∞} ∏_{n≥m} (1 − p_n) = 0   (2.73)

which implies P(A_n i.o.) = 1. Corollary 2.5.2 then implies that ∑_{n=1}^∞ p_n = ∞.

(⟸): Assume ∑_{n=1}^∞ p_n = ∞. Then by Borel–Cantelli II (Lemma 2.5.6), P(A_n i.o.) = 1, which
implies

0 = 1 − P(A_n i.o.) = lim_{m→∞} ∏_{n≥m} (1 − p_n).   (2.74)

Observe that { ∏_{n≥m} (1 − p_n) }_m is a non-decreasing sequence in m; to see this, note that as m
grows larger we multiply strictly fewer values (1 − p_n) ∈ (0, 1]. However, since we know the sequence
is non-negative and tends to 0, it follows that, for any m, we have

∏_{n≥m} (1 − p_n) = 0.   (2.75)

It follows that, for any m, we have

∏_{n=1}^∞ (1 − p_n) = ( ∏_{n<m} (1 − p_n) ) ( ∏_{n≥m} (1 − p_n) ) = ( ∏_{n<m} (1 − p_n) ) · 0 = 0.   (2.76)

■

We now turn to proving a more general version of Proposition 2.5.6, which will also imply its
converse. First, we define the following quantity:

p̃_eos(t) := P(A_t | A_1^c ∩ ··· ∩ A_{t−1}^c)   (2.77)

which can be viewed as the eos probability at step t, given that eos was not generated at any
earlier step. One can also show that, when p̃_eos(t) is defined, it has the same value as

p̃_eos(t) = ∑_{ω ∈ Σ^{t−1}} p_LN(ω) p_LN(eos | ω) / ∑_{ω ∈ Σ^{t−1}} p_LN(ω),   (2.78)

which one can view as the weighted average probability of terminating at step t, averaged over
prefixes of length t − 1.
6 We can now completely characterize the tightness of an LNM with the following theorem.

Theorem 2.5.3: A necessary and sufficient condition for tightness

An LNM is tight if and only if p̃_eos(t) = 1 for some t or ∑_{t=1}^∞ p̃_eos(t) = ∞.

Proof. Recall from Eq. (2.77) that

p̃_eos(t) := P(A_t | A_1^c ∩ ··· ∩ A_{t−1}^c).   (2.79)

Case 1. Suppose that p̃_eos(t) < 1 for all t. Consider the probability of never terminating:

P( ∩_{t=1}^∞ A_t^c ) = lim_{T→∞} P( ∩_{t=1}^T A_t^c )   (2.80a)
  = lim_{T→∞} ∏_{t=1}^T P(A_t^c | A_1^c ∩ ··· ∩ A_{t−1}^c)   (2.80b)
  = lim_{T→∞} ∏_{t=1}^T (1 − p̃_eos(t))   (2.80c)
  = ∏_{t=1}^∞ (1 − p̃_eos(t)).   (2.80d)

In the above, we have assumed that P(A_1^c ∩ ··· ∩ A_t^c) > 0 for all t, which is true by the assumption
that p̃_eos(t) < 1. Hence, by Corollary 2.5.1, Eq. (2.80d) is 0 if and only if ∑_t p̃_eos(t) = ∞.

Case 2. If p̃_eos(t) = 1 is true for some t = t_0, then P(A_1^c ∩ ··· ∩ A_{t_0}^c) = 0 and hence
P(∩_{t=1}^∞ A_t^c) = 0, and such a language model is guaranteed to terminate by step t_0. ■
14 The first condition intuitively says that there exists a step t at which the LNM will stop with
15 probability 1. If the first case of the condition does not hold, the second case can be checked since its
16 summands will be well-defined (the conditional probabilities in Eq. (2.78) will not divide by 0). We
17 remark that Theorem 2.5.3 is a generalization of Proposition 2.5.6 since if p̃eos ptq is lower-bounded
18 by f ptq whose series diverges, its own series would also diverge. However, since p̃eos ptq involves the
19 computation of a partition function in its denominator, it is most likely intractable to calculate

1 (Lin et al., 2021a; Lin and McCarthy, 2022). Hence, Proposition 2.5.6 will be the main tool for
2 determining tightness when we explore concrete language modeling frameworks later.
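The following sketch computes p̃_eos(t) of Eq. (2.78) exactly for a small hypothetical LNM by brute-force enumeration of Σ^{t−1}; the alphabet and conditional probabilities are made up for illustration. The |Σ|^{t−1} terms in the sum are precisely the partition-function-like computation that makes Theorem 2.5.3 hard to apply directly in practice.

```python
import itertools

SIGMA = ["a", "b"]                 # eos-free alphabet of a small hypothetical LNM

def p_eos_given(context):
    # Hypothetical conditionals: eos is more likely after "b" than after "a".
    if not context:
        return 0.1
    return 0.05 if context[-1] == "a" else 0.3

def p_symbol_given(symbol, context):
    # The remaining probability mass is split evenly between "a" and "b".
    return (1.0 - p_eos_given(context)) / len(SIGMA)

def p_prefix(y):
    p = 1.0
    for t, s in enumerate(y):
        p *= p_symbol_given(s, y[:t])
    return p

def p_tilde_eos(t):
    """Eq. (2.78): weighted average eos probability over prefixes of length t - 1."""
    num = den = 0.0
    for y in itertools.product(SIGMA, repeat=t - 1):   # |Sigma|^(t-1) terms!
        w = p_prefix(y)
        num += w * p_eos_given(y)
        den += w
    return num / den

print([round(p_tilde_eos(t), 4) for t in range(1, 8)])
# Every term is >= 0.05 here, so sum_t p_tilde_eos(t) diverges and, by
# Theorem 2.5.3, this particular LNM is tight.
```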
3 We have now very thoroughly defined the notion of language model tightness and provided
4 sufficient and necessary conditions for an LNM or a sequence model to be tight. In the next
5 sections, we start our exploration of concrete computational models of language, from the very
6 simple and historically important finite-state language models, their neural variants, to the modern
7 Transformer architectures. For each of them, we will also individually discuss their tightness results
8 and conditions.
1 Chapter 3

2 Modeling Foundations

3 The previous chapter introduced the fundamental measure-theoretic characteristics of language


4 modeling. We will revisit those over and over as they will serve as the foundations on which
5 subsequent concepts are built.
6 In this chapter, we turn our attention to modeling foundations, that is, the decisions we face
7 when we want to build a distribution over strings and learn the appropriate parameters for that
8 distribution. We first discuss how to parameterize a distribution over strings (§3.1), what it means
9 to learn good parameters, and how this can be done with modern optimization techniques and
10 objectives (§3.2).
11 Continuing our framing of the notes in terms of questions, we will try to address the following:

Question 3.1: Parametrizing a sequence model

How can a sequence model be parameterized?


12

13 We introduce a more formal definition of a “parametrized model” later. For now, you can simply
14 think of it as a function pθ : Σ˚ Ñ R described by some free parameters θ P Θ from a parameter
15 space Θ. This means that the values that pθ maps its inputs to might depend on the choice of the
16 parameters θ—the presence of parameters in a model, therefore, allows us to fit them, which in our
17 context specifically, means choosing them to maximize some objective with respect to data. This
18 raises the following question:

Question 3.2: Training a model

Given a parameterized model and a dataset, how can model parameters be chosen to reflect
the dataset as well as possible?
19

20 We begin with Question 3.1.


1 3.1 Representation-based Language Models


Most modern language models are defined as locally normalized models. To define a locally normalized
language model, we first define a sequence model pSM(y | y) and then prove that the specific
parameterization used in pSM(y | y) encodes a tight locally normalized language model; recall from
Example 2.5.1 that not all sequence models encode tight locally normalized language models in the
sense of Definition 2.5.1. So far, however, we have only talked
7 about this process abstractly. For example, we have proven that every language model can be locally
8 normalized and we have also given necessary and sufficient conditions for when a sequence model
9 encodes a tight locally normalized language model. In this section, we start making the abstraction
10 more concrete by considering a very general framework for parameterizing a locally normalized
11 language model through sequence models pSM py | yq. We will call this the representation-based
12 language modeling framework.
13 In the representation-based language modeling framework, each conditional distribution in a
14 sequence model pSM py | yq directly models the probability of the next symbol y P Σ given the
15 context y—in other words, it tells us how likely y is to appear in the context of y.1 For example,
16 given the string y “ “Papa eats caviar with a”, we would like pLN py | yq to capture that “spoon” is
more likely than “fork”. At the same time, since eating caviar with a fork is technically possible, we
18 would also like pLN py | yq to capture that “fork” is likelier than, for example, “pencil”.
However, it is not a priori clear how we should model pSM(y | y) concretely. We want to define
a function that maps contexts y to a distribution over possible continuations y, with the requirement
that this distribution can be easily adjusted, i.e., that we can optimize its parameters with some objective
in mind (cf. §3.2). We will do this by adopting the very general idea of defining pSM(y | y) in terms
of the similarity between representations of the symbol y and of the context y. The more
compatible the symbol y is with the context y, the more probable it should be. Intuitively, going from
25 the example above, this means that “spoon” should be more similar to “Papa eats caviar with a”
26 than “fork” should be, and that should still be more similar than “pencil”. On the other hand,
27 notice that this also means that “spoon” and “fork” should be closer together than any of them to
28 “pencil”.
29 One possibility for doing this is by embedding individual symbols y and all possible contexts y
30 as vectors in a Hilbert space, i.e., a complete vector space endowed with an inner product. Once we
31 embed the symbols and contexts in such a space, we can talk about how similar they are. We will first
describe how this can be done abstractly in §3.1.1 and then discuss how exactly vector representations
33 can be used when defining discrete probability distributions over the symbols in §3.1.3 by taking into
34 account the notion of similarities between vectors. We discuss methods for learning representations
35 later in this chapter (§3.2) and in Chapter 5.

36 3.1.1 Vector Space Representations


37 It is not immediately obvious how to measure the similarity or compatibility between two symbols,
38 two contexts or a symbol and a context. However, such a notion is required as part of our intuitive
39 desiderata for pSM py | yq. We begin by stating an important guiding principle, which we describe in
40 detail next and use heavily throughout the rest of the notes.
1 Unless explicitly stated otherwise, we use the phrase “in the context of” to imply given prior context—i.e.,

when discussing probability distributions, this refers to the distribution pSM py t | y ăt q with y “ y t . We will also see
examples of models which specify the conditional probabilities in terms of symbols that do not necessarily appear
before the current one.

Principle 3.1.1: Representation Learning

The good representation principle states that the success of a machine learning model
depends—in great part—on the representation that is chosen (or learned) for the objects that
are being modeled. In the case of language modeling, the two most salient choice points are
the representations chosen for the symbols, elements of Σ, and the representations chosen for
the contexts, elements of Σ˚ .
1

2 Learning vector representations from data where individual entities are represented in some
3 representation space (i.e., a Hilbert space) has a rich history in NLP and machine learning in
4 general (Bengio et al., 2013).
5 To discuss the representations of symbols and strings more formally, we first introduce the notion
6 of a Hilbert space, which leads us to a useful geometric manner to discuss the similarity and
7 compatibility of symbols and contexts. We first start with some more basic definitions. A vector
8 space over a field F is a set V together with two binary operations that satisfy certain axioms. The
9 elements of F are often referred to as scalars and the elements of V as vectors. The two operations
10 in the definition of a vector space are the addition of vectors and scalar multiplication of vectors.

Definition 3.1.1: Vector space

A vector space over a field F is a set V together with two binary operations that satisfy the
following axioms:
1. Associativity of vector addition: for all v, u, q P V

pv ` uq ` q “ v ` pu ` qq (3.1)

2. Commutativity of vector addition: for all v, u P V

v`u“u`v (3.2)

3. Identity element of vector addition: there exists 0 P V such that for all v P V

v`0“v (3.3)

4. Inverse elements of vector addition: for every v P V there exists a ´v P V such that

v ` p´vq “ 0 (3.4)

5. Compatibility of scalar multiplication with field multiplication: for all v P V and


x, y P F
x pyvq “ pxyq v (3.5)

6. Identity element of scalar multiplication: for all v P V

1v “ v (3.6)

where 1 is the multiplicative identity in F.


11

7. Distributivity of scalar multiplication with respect to vector addition: for all x P F


and all u, v P V
x pv ` uq “ xv ` xu (3.7)

8. Distributivity of scalar multiplication with respect to field addition: for all x, y P F


and all v P V
px ` yq v “ xv ` yv (3.8)
1

2 In almost all practical cases, F will be R and V will be RD for some D P N.


3 An important characteristic of a vector space is its dimensionality, which, informally, corre-
4 sponds to the number of independent directions—basis vectors—in the space. Any v P V can be
5 expressed as a linear combination of the D basis vectors. The coefficients of this linear combination
6 can then be combined into a D-dimensional coordinate vector in FD . Vector spaces, therefore,
7 allow us to talk about their elements in terms of their expressions with respect to the basis vectors.
8 Inner product spaces additionally define an inner product, mapping pairs of elements of the vector
9 space to scalars. More formally, it is a vector space together with a map x¨, ¨y (the inner product)
10 defined as follows.

Definition 3.1.2: Inner product space

An inner product space is a vector space V over a field F coupled with a map

x¨, ¨y : V ˆ V Ñ F (3.9)

such that the following axioms hold


1. Conjugate symmetry: for all v, u P V

xv, uy “ xu, vy (3.10)

where x denotes the conjugate of the element x P F.


2. Linearity in the first argument: for all v, u, z P V and x, y P F

xxv ` yu, zy “ xxv, zy ` yxu, zy (3.11)

3. Positive-definiteness: for all v ‰ 0

xv, vy ą 0 (3.12)
11

12 Inner products are often defined such that they capture some notion of similarity of the vectors
13 in V. We will use this when formally defining pSM py | yq in §3.1.2.
14 Every inner product on a real or complex vector space induces a vector norm defined as follows.

Definition 3.1.3: Norm

Given a vector space V over R or C and an inner product ⟨·, ·⟩ over it, the norm induced by
the inner product is defined as the function ‖·‖ : V → R≥0 where

‖v‖ := √⟨v, v⟩.   (3.13)

A Hilbert space is then an inner product space that satisfies a useful property with respect to the
norm defined by the inner product: every Cauchy sequence (and, equivalently, every absolutely
convergent series) converges to a vector in V.

Definition 3.1.4: Hilbert space

A Hilbert space is an inner product space that is complete with respect to the norm defined
by the inner product. An inner product space is complete with respect to the norm if every
Cauchy sequence (i.e., a sequence whose elements become arbitrarily close to each other)
converges to an element in V. Equivalently, an inner product space is complete if, for every series

∑_{n=1}^∞ v_n   (3.14)

such that

∑_{n=1}^∞ ‖v_n‖ < ∞,   (3.15)

it holds that the series converges to an element of V:

∑_{n=1}^∞ v_n ∈ V.   (3.16)
5

6 Note that even if an inner product space V is not necessarily a Hilbert space, V can always be
7 completed to a Hilbert space.

Theorem 3.1.1: Completion theorem for inner product spaces

Any inner product space can be completed into a Hilbert space.


8

9 We omit the proof for this theorem. More precisely, the inner product space can be completed
10 into a Hilbert space by completing it with respect to the norm induced by the inner product on the
11 space. For this reason, inner product spaces are also called pre-Hilbert spaces.
12 To motivate our slightly more elaborate treatment of representation spaces, we consider an
13 example of a model which falls under our definition of a representation-based language model but
14 would be ill-defined if it worked under any space with fewer axioms than a Hilbert space.

Space: Utility
Vector space: A space in which the representations of symbols and strings live. It also allows expressing the vector representations in terms of the basis vectors.
Inner product space: Defines an inner product, which induces a norm and can measure similarity.
Hilbert space: There are no "holes" in the representation space with respect to the induced norm, since all Cauchy sequences converge to elements of V.

Table 3.1: The utility of the different spaces introduced in this section.

Example 3.1.1: A series of representations

Recurrent neural networks are a type of neural network that sequentially process their input
and compute the output (context representation) at time step t based on the output at time
step t − 1: h_t = f(h_{t−1}, y_t). A formal definition of a recurrent neural network, which we
provide in §5.1.2, is not required at the moment. However, note that a recurrent neural
network with one-dimensional representations h could, for example, take the specific form

h_t = (1/2) h_{t−1} + 1/h_{t−1}   (3.17)

with h_0 = 2.
Suppose we chose the inner product space Q over the field Q for our representation space.
All elements of the sequence h_t are indeed rational numbers. However, the limit of the
sequence, which can be shown to be √2, is not in the inner product space! This shows
that Q is not a Hilbert space and that we must, in full generality, work with Hilbert spaces
whenever we are dealing with possibly infinite sequences of data. The reason this is especially
relevant for language modeling is the need to consider arbitrarily long strings (contexts), whose
representations we would like to construct in a way similar to Eq. (3.17). Such representations
can, therefore, approach a limiting representation outside the space whenever the representation
space does not satisfy the axioms of a Hilbert space.
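A quick numerical check of Example 3.1.1, using exact rational arithmetic so that every iterate provably stays in Q while the sequence heads toward √2:

```python
from fractions import Fraction

# Iterate Eq. (3.17) from h_0 = 2: each h_t is a rational number, yet the
# sequence converges to sqrt(2), which is not in Q.
h = Fraction(2)
for t in range(1, 6):
    h = h / 2 + 1 / h
    print(t, h, float(h))
# 1 3/2 1.5
# 2 17/12 1.4166...
# 3 577/408 1.41421...
```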

The utility of the three algebraic spaces introduced in this subsection is summarized
in Tab. 3.1.

4 Representation Functions
5 We can now introduce the notion of a general representation function.

Definition 3.1.5: Representation function

Let S be a set and V a Hilbert space over some field F. A representation function f for
the elements of S is a function of the form
f : S ÞÑ V. (3.18)
6

The dimensionality of the Hilbert space of the representations, D, is determined by the modeler.
In NLP, D usually ranges between 10 and 10000.
Importantly, in the case that S is finite, we can represent a representation function as a matrix
E ∈ R^{|S|×D} (assuming V = R^D), where the nth row corresponds to the representation of the nth
element of S. This method for representing f is both more concise and will be useful for integrating
the symbol representation function into a model, where matrix multiplications are often the most
efficient way to implement such functions on modern hardware.
This is the case for the representations of the individual symbols y from Σ, where the representation
function, which we will denote as e(·), is implemented as a lookup into the embedding matrix
E ∈ R^{|Σ|×D}, i.e., e(y) = E_y.2 In this case, we will also refer to e(·) as the embedding function.

Definition 3.1.6: Symbol embedding function

Let Σ be an alphabet. An embedding function ep¨q : Σ Ñ RD is a representation function


of individual symbols y P Σ.
11

12 The representations epyq are commonly referred to as embeddings, but, for consistency, we
13 will almost exclusively use the term representations in this text. Let us first consider possibly the
14 simplest way to represent discrete symbols with real-valued vectors: one-hot encodings.

Example 3.1.2: One-hot encodings

Let n : Σ → {1, . . . , |Σ|} be a bijection (i.e., an ordering of the alphabet, assigning an index
to each symbol in Σ). A one-hot encoding ⟦·⟧ is a representation function which assigns the
symbol y ∈ Σ the n(y)-th basis vector:

⟦y⟧ := d_{n(y)},   (3.19)

where d_n is the nth canonical basis vector, i.e., a vector of zeros with a 1 at position n.

16 While one-hot encodings are an easy way to create vector representations of symbols, they have
17 a number of drawbacks. First, these representations are relatively large—we have D “ |Σ|—and
18 sparse, since only one of the dimensions is non-zero. Second, such representations are not ideal
19 for capturing the variation in the similarity between different words. For example, the cosine
20 similarity—a metric we will motivate in the next section for measuring the similarity between symbol
21 representations—between symbols’ one-hot encodings is zero for all non-identical symbols. Ideally,
22 we would like symbol representations to encode semantic information, in which case, a metric such
23 as cosine similarity could be used to quantify semantic similarity. This motivates the use of more
24 complex representation functions, which we subsequently discuss.
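The following sketch makes the drawback concrete; the tiny vocabulary is, of course, only an illustrative stand-in.

```python
import numpy as np

SIGMA = ["papa", "eats", "caviar", "spoon", "fork", "pencil"]
n = {y: i for i, y in enumerate(SIGMA)}          # the bijection of Example 3.1.2

def one_hot(y):
    e = np.zeros(len(SIGMA))
    e[n[y]] = 1.0
    return e

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(one_hot("spoon"), one_hot("fork")))    # 0.0
print(cosine(one_hot("spoon"), one_hot("pencil")))  # 0.0 -- no notion of "more similar"
```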
25 While most systems use this standard way of defining individual symbol representations using the
26 embedding matrix, the way that the context is encoded (and what even is considered as context) is
27 really the major difference between the different architectures which we will consider later. Naturally,
28 since the set of all contexts is infinite, we cannot simply represent the representation function with
29 a matrix. Rather, we define the representation of a context y through an encoding function.

2 Here, we use the notation Ey to refer to the lookup of the row in E corresponding to y.

Definition 3.1.7: Context encoding function

Let Σ be an alphabet. A context encoding function enc(·) : Σ* → R^D is a representation
function of strings y ∈ Σ*.a

a Note that, to be completely consistent, the encoding function should be defined over the set (Σ ∪ {bos})*
to allow for the case when y_0 = bos. However, unlike eos, we do not necessarily require bos in any formal
setting, which is why we leave it out. We apologize for this inconsistency.
1

2 We will refer to encpyq as the encoding of y P Σ˚ . In the general framework, we can simply
3 consider the encoding function enc to be a black box—however, a major part of Chapter 5 will
4 concern defining specific functions enc and analyzing their properties.
5 With this, we now know how we can represent the discrete symbols and histories as real-valued
6 vectors. We next consider how to use such representations for defining probability distributions over
7 the next symbol.

8 3.1.2 Compatibility of Symbol and Context


Inner products naturally give rise to the geometric notion of angle by giving us the means to
measure the similarity between two representations. Concretely, the smaller the angle between the
two representations is, the more similar the two representations are. In a Hilbert space, we define
the cosine of the angle θ between two representations u and v as

cos(θ) := ⟨u, v⟩ / (‖u‖ ‖v‖).   (3.20)

The Cauchy–Schwarz inequality immediately gives us that cos(θ) ∈ [−1, 1], since −‖u‖‖v‖ ≤
⟨u, v⟩ ≤ ‖u‖‖v‖. Traditionally, however, we take the unnormalized cosine similarity as our measure
of similarity, which simply corresponds to the inner product of the Hilbert space.
Given a context representation enc(y), we can compute its inner products with all symbol
representations e(y):

⟨e(y), enc(y)⟩,   (3.21)

which can be achieved simply with a matrix-vector product:

E enc(y).   (3.22)

19 E enc pyq P R|Σ| , therefore, has the nice property that each of the individual entries corresponds to
20 the similarities of a particular symbol to the context y. For reasons that will become clear soon, the
21 entries of the vector E enc pyq are often called scores or logits. This brings us almost to the final
22 formulation of the probability distribution pSM py | yq.
23 If E enc pyq encodes similarity or compatibility, then a natural way to model the probability
24 distribution pSM py | yq would be as proportional to the inner product between epyq and enc pyq.
25 However, the inner product xepyq, enc pyqy may be negative; further, the sum over the similarity
26 between a context and all tokens is not necessarily 1. To resolve this, we have to introduce the last
27 piece of the puzzle: transforming E enc pyq into a valid discrete probability distribution by using a
28 projection function.
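As a minimal sketch of Eq. (3.22), the code below computes the score of every symbol with a single matrix-vector product; the embedding matrix and the context encoding are random placeholders rather than trained representations.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA_BAR = ["a", "b", "c", "eos"]
D = 4

E = rng.normal(size=(len(SIGMA_BAR), D))   # one row per symbol (the embedding matrix)
enc_y = rng.normal(size=D)                 # stand-in for enc(y), the context encoding

scores = E @ enc_y                         # Eq. (3.22): one inner product per symbol
for symbol, score in zip(SIGMA_BAR, scores):
    print(f"{symbol:>4s}  {score:+.3f}")   # higher score = more compatible with y
```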

1 3.1.3 Projecting onto the Simplex


2 In the previous subsections we discussed how to encode symbols and contexts in a Hilbert space and
how an inner product gives us a natural notion of similarity between a potentially infinite number
4 of items. We can now finally discuss how to create the conditional distribution pSM py | yq, i.e., how
5 we can map the real-valued E enc pyq that encodes symbol–context similarities to a valid probability
6 distribution—a vector on the probability simplex.

7 Projection Functions: Mapping Vectors onto the Probability Simplex


8 pSM py | yq is a categorical distribution with |Σ| categories, i.e., a vector of probabilities whose
9 components correspond to the probabilities of individual categories. Perhaps the simplest way to
10 represent a categorical distribution is as a vector on a probability simplex.

Definition 3.1.8: Probability Simplex

A probability simplex Δ^{D−1} is the set of non-negative vectors in R^D whose components sum
to 1:

Δ^{D−1} := { x ∈ R^D | x_d ≥ 0 for d = 1, . . . , D, and ∑_{d=1}^D x_d = 1 }.   (3.23)

12 So far, we have framed pSM as a function assigning the conditional distribution over y to each
13 string y. The definition of a simplex means that we can more formally express pSM as a projection
14 from the Hilbert space of the context representations to ∆|Σ|´1 , i.e., pSM : V Ñ ∆|Σ|´1 . Yet all
15 we have discussed so far is creating a vector E enc pyq that encodes symbol–context similarities—
16 E enc pyq is not necessarily on the probability simplex ∆|Σ|´1 . To address this issue, we turn to
17 projection functions:

Definition 3.1.9: Projection Function

A projection function f_{Δ^{D−1}} is a mapping from a real-valued Hilbert space R^D to the
probability simplex Δ^{D−1}:

f_{Δ^{D−1}} : R^D → Δ^{D−1}.   (3.24)

which allows us to define a probability distribution according to E enc(y):

pSM(y | y) = f_{Δ^{|Σ|−1}}(E enc(y))_y.   (3.25)

Clearly, we still want the projection of E enc(y) onto Δ^{|Σ|−1} to maintain several attributes of
the original vector; otherwise, we will lose the notion of compatibility that E enc(y) inherently
encodes. However, f_{Δ^{|Σ|−1}} must satisfy several additional criteria in order to map onto a valid
point in Δ^{|Σ|−1}. For example, the inner product of two vectors (and consequently E enc(y)) is not
necessarily positive, yet all points in Δ^{|Σ|−1} have non-negative components (see Definition 3.1.8).
These characteristics motivate the use of a projection function that is both monotonic and positive
everywhere. Thus, one clear choice is to base our chosen projection function on the exponential
function, i.e.,

f_{Δ^{|Σ|−1}}(E enc(y)) ∝ exp(E enc(y)).   (3.26)



1 To make a function of the form in Eq. (3.26) a valid projection function, we now simply have to
2 ensure that the output of f ∆D´1 sums to 1, which can easily be accomplished by re-normalizing the
3 vector of exponentiated values by their sum. This brings us to the main star of this subsection: the
4 softmax.
5 While we simply motivated its introduction by chasing our goal of ending up on the probability
6 simplex, the origin of the softmax function goes back to the Boltzmann distribution from statistical
7 mechanics introduced in the mid-1800s by Boltzmann (1868). It was then studied intensely and
8 popularized by Gibbs (1902). It was originally introduced as a way to convert the energy function
9 of the Boltzmann distribution into a probability distribution.3 Yet now, for reasons we will see in
10 this subsection, the softmax is the predominant choice of projection function in machine learning
11 applications.
12 Formally, the softmax is often defined in terms of a temperature parameter τ as follows.

Definition 3.1.10: Softmax


Let τ ∈ R+ be the temperature. The softmax at temperature τ is the projection function
defined as:

softmax(x)_d := exp(x_d / τ) / ∑_{j=1}^D exp(x_j / τ),   for d = 1, . . . , D.   (3.27)

14 where the temperature parameter τ gives us a mechanism for controlling the entropy of the
15 softmax function by scaling the individual scores in the input vector before their exponentiation. In
16 the context of the Boltzmann distribution, it was used to control the “randomness” of the system:
17 When the temperature is high, the softmax function outputs a more uniform probability distribution
18 whose probabilities are relatively evenly spread out among the different categories. When the
19 temperature is low, the softmax function outputs a peaked probability distribution, where the
20 probability mass is concentrated on the most likely category. In the limit, as we take τ to the edge
21 of the possible values it can assume, the following properties hold:

Theorem 3.1.2: Limiting behavior of the softmax function

lim_{τ→∞} softmax(x) = (1/D) 1   (3.28)
lim_{τ→0+} softmax(x) = e_{argmax(x)},   (3.29)

where e_d denotes the dth basis vector in R^D, 1 ∈ R^D the vector of all ones, and

argmax(x) := min { d | x_d = max_{d'=1,...,D} x_{d'} },   (3.30)

i.e., the index of the maximum element of the vector x (with ties broken by choosing
the lowest such index). In words, this means that the output of the softmax approaches the
uniform distribution as τ → ∞ and a single mode as τ → 0+.a

a τ → 0+ denotes the limit from above.

3 This is precisely the connection we mentioned in Definition 2.4.1.



Proof. Let us first consider the case of τ → 0+. Without loss of generality, let us consider a
2-dimensional vector x = [x_1, x_2]^⊤:

lim_{τ→0+} softmax(x)_1 = lim_{τ→0+} exp(x_1/τ) / ( exp(x_1/τ) + exp(x_2/τ) )   (3.31)
  = lim_{τ→0+} ( exp(x_1/τ) exp(−x_1/τ) ) / ( ( exp(x_1/τ) + exp(x_2/τ) ) exp(−x_1/τ) )   (3.32)
  = lim_{τ→0+} 1 / ( 1 + exp((x_2 − x_1)/τ) )   (3.33)

which leads us to the following element-wise limit:

lim_{τ→0+} exp((x_2 − x_1)/τ) = 0 if x_1 > x_2;  1 if x_1 = x_2;  ∞ otherwise.   (3.34)

Then the limit of the softmax as τ → 0+ is given as

lim_{τ→0+} softmax(x) = [1, 0]^⊤ if x_1 > x_2;  [1/2, 1/2]^⊤ if x_1 = x_2;  [0, 1]^⊤ otherwise,   (3.35)

which is equivalent to the argmax operator over x. The proof extends to arbitrary D-dimensional
vectors.
The case of τ → ∞ follows similar logic, albeit lim_{τ→∞} exp((x_2 − x_1)/τ) = 1 in all cases. Hence,
we get lim_{τ→∞} softmax(x) = (1/D) 1. ■

9 The second property, specifically, shows that the softmax function resembles the argmax function
10 as the temperature approaches 0—in that sense, a more sensible name for the function would have
11 been “softargmax”. We will most often simply take τ to be 1. However, different values of the
12 parameter are especially useful when sampling or generating text from the model, as we discuss
13 subsequently.
14 The output of the softmax is equivalent to the solution to a particular optimization problem,
15 giving it a variational interpretation.

Theorem 3.1.3: Variational characterization of the softmax


Given a set of real-valued scores x, the following equality holds:

softmax(x) = argmax_{p ∈ Δ^{D−1}} ( p^⊤ x − τ ∑_{d=1}^D p_d log p_d )   (3.36)
  = argmax_{p ∈ Δ^{D−1}} ( p^⊤ x + τ H(p) ).   (3.37)

This tells us that the softmax can be given a variational characterization, i.e., it can be viewed as
the solution to an optimization problem.

Proof. Eq. (3.36) can equivalently be written as

softmax(x) = argmax_p ( p^⊤ x − τ ∑_{d=1}^D p_d log p_d )   (3.38)
  s.t. ∑_d p_d = 1   (3.39)

from which we can clearly see that the Lagrangian of this optimization problem is
Λ = p^⊤ x − τ ∑_{d=1}^D p_d log p_d + λ ( ∑_d p_d − 1 ). Taking the derivative of Λ with respect to p_d,
we see that the optimum is reached when

∂Λ/∂p_d = x_d − τ (log p_d + 1) + λ = 0.   (3.40)

Solving for p_d gives us p_d = Z exp(x_d/τ), where Z is the normalizing constant that ensures ∑_d p_d = 1.
This solution is equivalent to performing the softmax operation over x, as desired. ■
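We can also sanity-check the variational characterization numerically: the sketch below compares the objective of Eq. (3.37) at the softmax output against many random points on the simplex; the score vector is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, tau=1.0):
    e = np.exp((x - x.max()) / tau)
    return e / e.sum()

def objective(p, x, tau=1.0):
    # p^T x + tau * H(p), the objective of Eq. (3.37); 0 log 0 is taken to be 0.
    ent = -np.sum(p * np.log(np.where(p > 0.0, p, 1.0)))
    return float(p @ x + tau * ent)

x = np.array([0.3, -1.2, 2.0, 0.5])
p_star = softmax(x)
best_random = max(
    objective(rng.dirichlet(np.ones(len(x))), x) for _ in range(10_000)
)
print(objective(p_star, x), ">=", best_random)   # the softmax attains the maximum
```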


7 Theorem 3.1.3 reveals an interpretation of the softmax as the projection p P ∆D´1 that has the
8 maximal similarity with x while being regularized to produce a solution with high entropy. Further,
from both Definition 3.1.10 and Eq. (3.36), we can see that the softmax leads to non-sparse solutions,
as an entry softmax(x)_d can only be 0 if x_d = −∞.
11 In summary, the softmax has a number of desirable properties for use in machine learning
12 settings.

Theorem 3.1.4: Desirable properties of the softmax function

The softmax function with temperature parameter τ exhibits the following properties.
1. In the limit as τ Ñ 0` and τ Ñ 8, the softmax recovers the argmax operator and
projection to the center of the probability simplex (at which lies the uniform distribution),
respectively.

2. softmaxpx ` c1q “ softmaxpxq for c P R, i.e., the softmax is invariant to adding the
same constant to all coordinates in x.
3. The softmax is continuously differentiable everywhere, and the value of
its derivative can be explicitly computed.

4. For all temperatures τ P R` , if xi ď xj , then softmaxpxqi ď softmaxpxqj . In words, the


softmax maintains the rank of x.
13

Proof. Property 1. is simply a restatement of Theorem 3.1.2. Property 2. can be shown using
simple algebraic manipulation:

softmax(x + c1)_d = exp((x_d + c)/τ) / ∑_{j=1}^D exp((x_j + c)/τ)
  = ( exp(x_d/τ) · exp(c/τ) ) / ( ∑_{j=1}^D exp(x_j/τ) · exp(c/τ) ) = softmax(x)_d.   (3.41)

The derivative of the softmax at position i with respect to the variable at position j (taking τ = 1)
is given by

∂softmax(x)_i / ∂x_j = ( δ_i(j) · exp(x_i) ∑_k exp(x_k) − exp(x_i) · exp(x_j) ) / ( ∑_k exp(x_k) )²   (3.42)

where δ_i(j) is the Kronecker delta, defined as δ_i(j) = 1 if i = j and 0 otherwise. Clearly, Eq. (3.42)
is continuous. Further, it is defined for all x ∈ R^D. Lastly, property 4. follows from the
monotonicity of the exp function.
■
There are many other valid projection functions that one could choose from. For example,
Martins and Astudillo (2016) introduce the sparsemax, which can output sparse distributions:

sparsemax(x) := argmin_{p ∈ Δ^{D−1}} ‖p − x‖₂².   (3.43)

7 In words, sparsemax directly maps x onto the probability simplex, which often leads to solutions on
8 the boundary, i.e., where at least one entry of p is 0. Martins and Astudillo (2016) provide a method
9 for computing the closed form solution of this optimization problem in Alg. 1 of their work. Blondel
10 et al. (2019) later introduced a framework that encompasses many different projection functions,
11 which they term regularized prediction functions. Essentially, this framework considers the subset of
12 projection functions that can be written as:

f ∆|Σ|´1 pxq “ argmax pJ x ´ Ωppq (3.44)


def
` ˘
pP∆D´1

where Ω : ℝ^D → ℝ is a regularization term. For certain choices of Ω, there are straightforward closed-form solutions to Eq. (3.44). For example, as we can see from Eq. (3.36), Eq. (3.44) is equivalent to the softmax when Ω(p) = −τ H(p), meaning we can compute its closed form using Eq. (3.27). Further, we recover the sparsemax when Ω(p) = ½‖p‖₂², which likewise has a closed-form solution. The notion of regularizing p may be unintuitive at first, but we can view it as trying to balance the "suitability" term p^⊤ x against a "confidence" term Ω(p), which should be smaller when p is "uncertain." We point the interested reader to the comprehensive work of Blondel et al. (2019) for further elaboration.
So why aren't these other projection functions more widely employed in machine learning frameworks? First, not all choices of Ω lead to closed-form solutions; further, not all meet the desirable criteria listed in Theorem 3.1.4. For example, the sparsemax is not everywhere differentiable, meaning that one could not simply use out-of-the-box automatic differentiation frameworks when training a model with the sparsemax as its projection function. Rather, one would have to specify its gradient explicitly.

Theorem 3.1.5: Derivative of the sparsemax function

The derivative of the sparsemax with respect to its input x is as follows:

    ∂sparsemax(x)_i / ∂x_j = δ_ij − 1/|S(x)|  if i, j ∈ S(x),  and 0 otherwise,        (3.45)

where S(x) denotes the support of sparsemax(x), i.e., the set of coordinates assigned nonzero probability.

Proof. See Martins and Astudillo (2016). ■
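For concreteness, here is a small NumPy sketch of the sorting-based closed-form computation in the spirit of Alg. 1 of Martins and Astudillo (2016); the function name sparsemax and the particular vectorization are ours, so treat it as an illustrative reconstruction rather than their reference implementation.

```python
import numpy as np

def sparsemax(x):
    # Euclidean projection of x onto the probability simplex (cf. Eq. 3.43).
    z = np.sort(x)[::-1]                  # coordinates in decreasing order
    cssv = np.cumsum(z) - 1.0             # cumulative sums shifted by the simplex constraint
    ks = np.arange(1, len(x) + 1)
    support = ks * z > cssv               # coordinates that remain positive after thresholding
    k = ks[support][-1]                   # size of the support S(x)
    tau = cssv[support][-1] / k           # threshold
    return np.maximum(x - tau, 0.0)

x = np.array([2.0, 1.1, -0.3, 0.2])
p = sparsemax(x)
print(p, p.sum())   # [0.95 0.05 0.   0.  ] 1.0 -- a sparse distribution, unlike the softmax
```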


To conclude, projection functions, together with symbol representations and the representation function enc, give us the tools to define a probability distribution over next symbols that encodes complex linguistic interactions. We now bring all the components together into the locally normalized modeling framework in the next section.

3.1.4 Representation-based Locally Normalized Models

With these tools at hand, we now define representation-based locally normalized language models.

Definition 3.1.11: Representation-Based Locally Normalized Model

Let enc be an encoding function. A representation-based locally normalized model is a model of the following form:

    p_SM(y_t | y_{<t}) ≝ f_{Δ^{|Σ|−1}}(E enc(y_{<t}))_{y_t}               (3.46)

where, unless otherwise stated, we assume f_{Δ^{|Σ|−1}} = softmax. It defines the probability of an entire string y ∈ Σ* as

    p_LN(y) ≝ p_SM(eos | y) ∏_{t=1}^{T} p_SM(y_t | y_{<t})                 (3.47)

where y_0 ≝ bos.

Alternatively, we could also include an additive bias term b as part of the projection function f_{Δ^{|Σ|−1}} in the definition of the conditional distribution p_SM(y_t | y_{<t}), i.e., p_SM(y_t | y_{<t}) = f_{Δ^{|Σ|−1}}(E enc(y_{<t}) + b)_{y_t}. However, note that the bias term can be absorbed into the encoding function enc, meaning that we can assume the form of Eq. (3.46) without loss of generality. In representation-based language models, the symbol representations e(y) and the context representations enc(y) carry all the necessary information to determine how probable individual symbols are given the context. Therefore, the design choices of e(y) and enc are crucial when building language models this way. Indeed, a large portion of the discussion in the remainder of the notes will center around how to build good representations of the context and of individual symbols.
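The following is a minimal NumPy sketch of Definition 3.1.11. The "bag-of-symbols" encoder enc, the random embedding matrix E, and the helper names p_next and p_string are all stand-ins of our own; any real encoder (e.g., an RNN or transformer) would take enc's place.

```python
import numpy as np
rng = np.random.default_rng(0)

Sigma = ["a", "b", "c"]                     # alphabet
Sigma_bar = Sigma + ["eos"]                 # output space including eos
D = 4                                       # dimensionality of context representations
E = rng.normal(size=(len(Sigma_bar), D))    # output symbol embedding matrix

def enc(prefix):
    # Toy, bounded context encoder standing in for a learned representation function.
    h = np.zeros(D)
    for i, y in enumerate(prefix):
        h += np.tanh(np.eye(D)[i % D] * (Sigma.index(y) + 1))
    return np.tanh(h)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def p_next(prefix):
    # Eq. (3.46): conditional distribution over the next symbol given the context.
    return softmax(E @ enc(prefix))

def p_string(y):
    # Eq. (3.47): probability of an entire string, terminated by eos.
    prob = 1.0
    for t in range(len(y)):
        prob *= p_next(y[:t])[Sigma_bar.index(y[t])]
    return prob * p_next(y)[Sigma_bar.index("eos")]

print(p_next([]))            # distribution over {a, b, c, eos} given the empty context
print(p_string(["a", "b"]))  # p(a | bos) * p(b | a) * p(eos | a b)
```

Note that the string probability multiplies one conditional per symbol and closes with the eos term, exactly as in Eq. (3.47).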

3.1.5 Tightness of Softmax Representation-based Models

Having introduced representation-based language models, we can now state a very general result about the tightness of such models. It connects the notion of tightness to the intuition about the "compatibility" of symbols with the context—namely, the compatibility of the eos symbol with the context (compared to the compatibility of all other symbols). Compatibility is here captured by the distance between the representation of the eos symbol and the representations of the other symbols: if the norm of the context representation grows slowly enough with respect to t (relative to this distance), the model is tight.

Theorem 3.1.6: Proposition 5.9 in Du et al., 2022

Let p_SM be a representation-based sequence model over the alphabet Σ, as defined in Definition 3.1.11. Let

    s ≝ sup_{y ∈ Σ} ‖e(y) − e(eos)‖₂,                                      (3.48)

i.e., the largest distance between a symbol representation and the representation of the eos symbol, and

    z_max^t ≝ max_{y ∈ Σ^t} ‖enc(y)‖₂,                                      (3.49)

i.e., the maximum attainable context representation norm for contexts of length t. Then the locally normalized model p_LN induced by p_SM is tight if

    s · z_max^t ≤ log t                                                     (3.50)

for all sufficiently large t.

Proof. Let x_t(ω) be the random variable that is equal to the t-th token in an outcome ω ∈ Ω. Then for an arbitrary t ∈ ℕ and any y ∈ Σ^t, we have:

    P(x_t = eos | x_{<t} = y)
        = exp[e(eos)^⊤ enc(y)] / ∑_{y' ∈ Σ∪{eos}} exp[e(y')^⊤ enc(y)]                      (3.51a)
        = 1 / ∑_{y' ∈ Σ∪{eos}} ( exp[e(y')^⊤ enc(y)] / exp[e(eos)^⊤ enc(y)] )              (3.51b)
        = 1 / ( 1 + ∑_{y' ∈ Σ} exp[(e(y') − e(eos))^⊤ enc(y)] )                            (3.51c)
        ≥ 1 / ( 1 + ∑_{y' ∈ Σ} exp[‖e(y') − e(eos)‖₂ ‖enc(y)‖₂] )      (Cauchy–Schwarz)    (3.51d)
        ≥ 1 / ( 1 + ∑_{y' ∈ Σ} exp[s ‖enc(y)‖₂] )                                          (3.51e)
        = 1 / ( 1 + |Σ| exp[s ‖enc(y)‖₂] )                                                 (3.51f)

Recalling that z_max^t ≝ max_{y ∈ Σ^t} ‖enc(y)‖₂, we then have that for all t ∈ ℕ and all y ∈ Σ^t:

    P(x_t = eos | x_{<t} = y) ≥ 1 / (1 + |Σ| exp(s z_max^t))                                (3.52)

Now, by Proposition 2.5.6, we have that if ∑_{t=1}^∞ 1/(1 + |Σ| exp(s z_max^t)) diverges, then the language model is tight. We will show that if there exists N ∈ ℕ such that s z_max^t ≤ log t for all t ≥ N, then the sequence model must be tight.
First, note that lim_{t→∞} (1/t) · (1 + |Σ|t) = lim_{t→∞} (1/t + |Σ|) = |Σ| ∈ (0, ∞). Hence, by the limit comparison test, since ∑_{t=1}^∞ 1/t diverges, ∑_{t=1}^∞ 1/(1 + |Σ|t) must also diverge.
Now, suppose that s z_max^t ≤ log t for all t ≥ N. This implies that for t ≥ N we have 1/(1 + |Σ| exp(s z_max^t)) ≥ 1/(1 + |Σ|t), which, combined with the above and the comparison test, implies that ∑_{t=N}^∞ 1/(1 + |Σ| exp(s z_max^t)) diverges. This in turn means that ∑_{t=1}^∞ 1/(1 + |Σ| exp(s z_max^t)) diverges. Hence, if s z_max^t ≤ log t for all t ≥ N for some N ∈ ℕ, then the language model is tight. ■
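As a quick numerical sanity check of the bound in Eq. (3.52), the sketch below (with a random embedding matrix and an arbitrary context vector of our own choosing) compares the exact softmax probability of eos against the lower bound 1/(1 + |Σ| exp(s ‖enc(y)‖₂)).

```python
import numpy as np
rng = np.random.default_rng(1)

V, D = 6, 8                                  # |Sigma| non-eos symbols, representation size
E = rng.normal(size=(V + 1, D))              # last row plays the role of e(eos)
h = rng.normal(size=D)                       # stand-in for enc(y)

logits = E @ h
p_eos = np.exp(logits[-1]) / np.exp(logits).sum()

s = max(np.linalg.norm(E[i] - E[-1]) for i in range(V))    # cf. Eq. (3.48)
lower = 1.0 / (1.0 + V * np.exp(s * np.linalg.norm(h)))    # cf. Eq. (3.52)

print(p_eos >= lower, p_eos, lower)          # the bound holds, though it is typically loose
```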
Theorem 3.1.6 is a generalization of the following result from Welleck et al. (2020).

Theorem 3.1.7: Representation-based language models with bounded encodings are tight

A locally normalized representation-based language model, as defined in Definition 3.1.11, with uniformly bounded ‖enc(y)‖_p (for some p ≥ 1) is tight.

For most of the language models that we consider, enc(y) is bounded due to the choice of activation functions. In turn, E enc(y_{<t}) is bounded for all y. Further, by the definition of the softmax, f_{Δ^{|Σ|−1}}(E enc(y_{<t}))_eos > η for some constant η > 0, so the eos probability is bounded away from zero and tightness follows by Proposition 2.5.6.
This concludes our investigation of general representation-based models. The next section discusses learning parametrized models (and, as a special case, learning symbol and context representations).

3.2 Estimating a Language Model from Data

The language modeling task refers to any attempt to estimate the parameters⁴ of a model p_M of the ground-truth probability distribution over natural language strings p_LM using data D = {y^(n)}_{n=1}^N, where we assume the samples y^(n) were generated according to p_LM. This task is often treated as an optimization problem. Here we will discuss the various components of this optimization problem, primarily the objective and the algorithm used to perform the optimization. Note that the material covered here corresponds to what is colloquially referred to as pre-training. The learning paradigm for fine-tuning a language model for a downstream task will be covered later in the course.

3.2.1 Data

In this course, we consider objectives that are defined in terms of data D. Therefore, we will first discuss the nature of this data, which, more precisely, is a corpus of texts. Following the notation used throughout the rest of these notes, let Σ be an alphabet. A corpus D = {y^(n)}_{n=1}^N ⊂ Σ* is a collection of N strings. We will use the terms corpus and dataset interchangeably throughout this section. We make the following assumption about the data-generating process of D:

Assumption 3.2.1: Independently and identically distributed assumption

The strings y^(n) in our corpus D are generated independently and identically distributed (i.i.d.) by some unknown distribution p_LM.

Note that the y^(n) are strings of arbitrary length; they can be single words, sentences, paragraphs, or even entire documents, depending on how we choose Σ. For example, our models' architectural designs often make them unable to process document-length strings efficiently, e.g., such strings might not fit into a context window that can be reasonably processed by a transformer language model; we will elaborate on this statement in our discussion of transformers in §5.2. Thus, in practice, we often chunk documents into paragraphs that we treat as separate data points.⁵ This means that our model may not be able to learn properties of language such as discourse structure.

3.2.2 Language Modeling Objectives

Similarly to many other machine learning tasks, we can cast our problem as the search for the best model p_M of the ground-truth distribution over strings p_LM. In order to make this search tractable, we must limit the models p_M that we consider. Explicitly, we make the following assumption:

Assumption 3.2.2: Parametrized model

p_LM is a member of the parameterized family of models {pθ | θ ∈ Θ}, the set of all distributions representable by parameters θ in a given parameter space Θ.

⁴ Most of this course focuses on the parametric case, i.e., where p_M is governed by a set of parameters θ. However, we will briefly touch upon various non-parametric language models.
⁵ This practice technically breaks Assumption 3.2.1, yet the negative (empirically observed) effects of this violation are minimal and perhaps outweighed by the additional data it allows us to make use of.

As a concrete example, θ could be the conditional probabilities in a simple, standard n-gram model for a given prefix of size n − 1, i.e., θ consists of one simplex of size |Σ| per possible prefix of length n − 1.⁶ As another example, θ could be the weights of a neural network; the set Θ would then cover all possible valid weight matrices that could parameterize our model.
Assumption 3.2.2 implies that we can equivalently write p_LM as pθ* for certain (unknown) parameters θ* ∈ Θ.⁷ Further, an arbitrary model p_M from this hypothesis space with parameters θ can be written as pθ; we will use this notation for the remainder of the chapter to make the parameterization of our distribution explicit. We now turn to the general framework for choosing the best parameters θ ∈ Θ so that our model pθ serves as a good approximation of pθ*.⁸

General Framework

We search for model parameters θ̂ ∈ Θ such that the model induced by those parameters maximizes a chosen objective or, alternatively, minimizes some loss function ℓ : Θ × Θ → ℝ≥0. This loss can be used to measure the quality of the model as an approximation to pθ*. In simple math, we search for the solution

    θ̂ ≝ argmin_{θ ∈ Θ} ℓ(θ*, θ)                                            (3.53)

where our loss function is chosen with the following principle in mind.

Principle 3.2.1: Proximity Principle

We seek a model pθ that is "close" to pθ*.

That is, we choose our loss function to be a measure M of the difference between a distribution parameterized by θ and one parameterized by the true θ*, i.e., that of our ground-truth distribution. Yet we are immediately faced with a problem: computing an arbitrary M between θ and θ* (or at least between the distributions induced by these sets of parameters) requires knowledge of both, and for the latter we only have the samples y^(n) ∈ D. We will therefore use our corpus D as an approximation to pθ*, which is typically implemented by representing D as an empirical distribution—a collection of Dirac delta functions—which we will denote as p̃θ*. Formally, we define

    p̃θ*(y) ≝ (1/N) ∑_{n=1}^{N} δ_{y^(n)}(y)                                (3.54)

where the Dirac delta function δ_{x'}(x) = 1 if x = x' and 0 otherwise is essentially a point mass with all probability on x'. We can decompose this definition over symbols in our strings as well, i.e., we can compute

    p̃θ*(y_t | y_{<t}) = (1/N_{y_{<t}}) ∑_{n=1}^{N} δ_{y_t^(n) | y_{<t}^(n)}(y_t | y_{<t})    (3.55)

⁶ One might be tempted to assume we only need |Σ| − 1 parameters per simplex, but we condition over Σ classes per prefix position.
⁷ We discuss the implications of the case that p_LM ∉ {pθ | θ ∈ Θ} later in this section.
⁸ The modeling paradigms that we will discuss in this section are predominantly generative, i.e., these models try to learn the underlying distribution of the data rather than the boundaries between different classes or categories. The implication is that parameter estimation in language modeling typically makes use of unannotated text data, and is therefore sometimes referred to as self-supervised.

where N_{y_{<t}} ≝ ∑_{n=1}^{N} 1{y_{<t}^(n) = y_{<t}}. Note that we can likewise define Eq. (3.55) in terms of the one-hot encodings of symbols, i.e., using the definition in Example 3.1.2: p̃θ*(· | y_{<t}) = (1/N_{y_{<t}}) ∑_{n=1}^{N} ⟦y_t^(n)⟧ 1{y_{<t}^(n) = y_{<t}}. In fact, the empirical distribution is often also referred to in machine learning as the one-hot encoding of a dataset.
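A small sketch of how the empirical conditional distributions of Eq. (3.55) can be tabulated from a corpus; the toy corpus and the helper names are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus: each string is a tuple of symbols.
corpus = [("a", "b", "b"), ("a", "b"), ("a", "c"), ("b",)]

counts = defaultdict(Counter)       # context prefix -> counts of the next symbol (incl. eos)
for y in corpus:
    for t in range(len(y) + 1):
        nxt = y[t] if t < len(y) else "eos"
        counts[y[:t]][nxt] += 1

def empirical_next(prefix):
    # p~(. | prefix): relative frequencies among the corpus strings sharing this prefix.
    c = counts[prefix]
    n = sum(c.values())
    return {sym: k / n for sym, k in c.items()}

print(empirical_next(("a",)))       # {'b': 0.67, 'c': 0.33} (up to rounding)
print(empirical_next(("a", "b")))   # {'b': 0.5, 'eos': 0.5}
```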
Now that we are equipped with methods for representing both p̃θ* and pθ, we can define a loss function for approximating pθ* using pθ.


Cross-Entropy. A natural choice for a loss function is cross-entropy, a measure of the difference between two probability distributions, which has its roots in information theory (Shannon, 1948). Specifically, in Eq. (3.53), we take ℓ(θ*, θ) = H(p̃θ*, pθ), where the definition of the cross-entropy H between distributions p1 (with support Y) and p2 is as follows:

    H(p1, p2) = − ∑_{y∈Y} p1(y) log p2(y)                                   (3.56)

Further, most of the models that we will encounter in this course are locally normalized. Thus, it is more common to see the cross-entropy expressed in terms of the symbol-level conditional distributions:

    H(p1, p2) = − ∑_{y∈Y} p1(y) ∑_{t=1}^{T} log p2(y_t | y_{<t}).           (3.57)

Note that the cross-entropy is not symmetric, i.e., H(p1, p2) ≠ H(p2, p1). To motivate cross-entropy as a loss function, as well as the intuitive difference between the two argument orderings, we turn to coding theory, a sub-field of information theory. In words, the cross-entropy between two probability distributions is the expected number of bits needed to encode an event y ∈ Y from p1 when using the optimal encoding scheme corresponding to distribution p2. Importantly, the optimal encoding scheme for p1 uses − log p1(y) bits to encode an event y that occurs with probability p1(y), implying that the minimal cross-entropy is achieved when p1 = p2. This characteristic of cross-entropy motivates another metric: the KL divergence D_KL.

KL Divergence. A divergence measure is a measure of statistical distance⁹ between two probability distributions. The KL divergence is defined as:

    D_KL(p1 || p2) = ∑_{y∈Y} p1(y) log p1(y) − p1(y) log p2(y)               (3.58)

The KL divergence can intuitively be viewed as the cross-entropy shifted by the expected number of bits used by the optimal encoding scheme for p1, i.e., it is the additional number of expected bits needed to encode events from p1 when using the encoding scheme derived from p2. Indeed, taking ℓ(θ*, θ) = D_KL(p̃θ* || pθ) leads to the same solution as taking ℓ(θ*, θ) = H(p̃θ*, pθ), because the ∑_y p̃θ*(y) log p̃θ*(y) term is constant with respect to the model parameters θ.
Ă θ pyq term is constant with respect to model parameters θ.

⁹ Divergences are not technically distances because they are not symmetric, i.e., it may be the case for a divergence measure D and probability distributions p and q that D(p || q) ≠ D(q || p). However, they do meet the criteria that D(p || q) ≥ 0 for all p, q and D(p || q) = 0 ⟺ p = q.
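A minimal sketch of the two quantities just defined, computed over an explicit finite support; it illustrates the asymmetry of the cross-entropy and the identity D_KL(p1 || p2) = H(p1, p2) − H(p1, p1) ≥ 0.

```python
import numpy as np

def cross_entropy(p1, p2):
    # H(p1, p2) = - sum_y p1(y) log p2(y), with the convention 0 log 0 = 0.
    p1, p2 = np.asarray(p1), np.asarray(p2)
    mask = p1 > 0
    return -np.sum(p1[mask] * np.log(p2[mask]))

def kl(p1, p2):
    # D_KL(p1 || p2) = H(p1, p2) - H(p1, p1).
    return cross_entropy(p1, p2) - cross_entropy(p1, p1)

p, q = [0.7, 0.2, 0.1], [0.4, 0.4, 0.2]
print(cross_entropy(p, q), cross_entropy(q, p))     # asymmetric
print(kl(p, q) >= 0.0, np.isclose(kl(p, p), 0.0))   # True True
```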

Relationship to Maximum Likelihood Estimation

An alternative way that we could frame our search for model parameters θ̂ ∈ Θ is in terms of data likelihood. Formally, the likelihood of the corpus D under the distribution pθ is the joint probability of all y^(n):

    L(θ) = ∏_{n=1}^{N} pθ(y^(n)).                                            (3.59)

The principle of maximum likelihood then dictates:

Principle 3.2.2: Maximum Likelihood

The optimal parameters for a model are those that maximize the likelihood of observing the given data under that model. Formally:

    θ̂_MLE ≝ argmax_{θ ∈ Θ} L(θ)                                              (3.60)

Note that in practice, we typically work with the log-likelihood log L(θ) rather than the likelihood itself, for a number of reasons: e.g., it replaces the product over the corpus with a sum, and it is more numerically stable given the small probabilities we encounter when using L and the finite precision of the computing frameworks that we employ. Since log is a monotonically increasing function, this does not change the solution to Eq. (3.60). Further, as is the case with Eq. (3.57), we decompose our loss over symbol-level distributions.
Notably, in our setting, finding parameters that maximize data log-likelihood is equivalent to finding those that minimize cross-entropy. We show this equivalence below.

Proposition 3.2.1

The optimal parameters under Eq. (3.60) are equivalent to the optimal parameters when solving Eq. (3.53) with the cross-entropy loss between the empirical distribution p̃θ* and the model pθ.

Proof. Under the standard convention that 0 log 0 = 0, the only elements of Y that make a nonzero contribution to H(p̃θ*, pθ) are sequences in the support of p̃θ*, making summing over Y equivalent to summing over D:

    H(p̃θ*, pθ) = − ∑_{y∈Σ*} p̃θ*(y) log pθ(y)                                 (3.61)
               = − ∑_{y∈Σ*} (1/N) ∑_{n=1}^{N} δ_{y^(n)}(y) log pθ(y)           (3.62)
               = − (1/N) ∑_{n=1}^{N} log pθ(y^(n))                             (3.63)
               ∝ − ∑_{y∈D} log pθ(y)                                           (3.64)
               = − log L(θ)                                                    (3.65)

Thus, we can see that the objectives are equivalent, up to a multiplicative constant that is independent of the model parameters. ■

The equivalence of cross-entropy, KL divergence, and maximum likelihood as learning objectives provides intuition about our goals when learning pθ: (1) we want a close (with respect to a given metric) approximation of the data-generating distribution, and (2) this approximation should place high probability on samples of real language data.

Properties of θ̂ under the cross-entropy loss. Assumption 3.2.2 may feel quite strong, as it implies we know a great deal about the nature of p_LM. However, it allows us to prove the optimality of p_θ̂ under certain conditions.

Theorem 3.2.1: Maximum likelihood estimate is consistent

Suppose that our loss function is ℓ(θ*, θ) = H(p̃θ*, pθ) (or, equivalently, ℓ(θ*, θ) = D_KL(p̃θ* || pθ)). Given Assumption 3.2.1 and that the minimizer θ̂ of H(p̃θ*, pθ) is unique, then under certain (quite strong) regularity conditions on {pθ | θ ∈ Θ}, θ̂ is a consistent estimator, i.e., it converges to θ* in probability as N → ∞.

11 Arguably, in practice, Assumption 3.2.2 does not hold; we often make some incorrect modeling
12 assumptions. Naturally, this raises the following question: If we misspecify the family of models
13 that pLM belongs to, i.e., pLM R tpθ | θ P Θu, then is our optimal model pθp under the cross-entropy
14 loss at all meaningful? Fortunately, the answer here is yes. In this case, we can interpret pθp as a
15 projection of pLM onto the manifold of parametric models tpθ | θ P Θu. This projection is formally
16 known as an information projection (Nielsen, 2018), which while we do not cover formally here,
17 we can intuit as a mapping of pLM onto its “closest” point in tpθ | θ P Θu. In this setting, using
18 different metrics M leads to different definitions of closeness, which in turn means that optimal
19 models under different M exhibit different properties.

20 Potential drawbacks of cross-entropy loss. A closer inspection of Eq. (3.56) reveals that,
21 when we use HpĂ pθ , pθ q as our loss function, pθ must put probability mass on all samples y pnq in

22 the support of pĂθ ; otherwise, our loss is infinite. Since the model is not explicitly penalized for

23 extraneous coverage, it will thus resort to placing mass over all of Σ˚ to avoid such gaps;10 this is
24 sometimes referred to as mean-seeking behavior. In practice, this means that sequences of symbols
25 that one might qualitatively describe as gibberish are assigned nonzero probability by pθ . It is
26 unclear whether this is a desirable property under a language model. While perhaps useful when
27 using such a model to assign probabilities to strings—in which case, we might be more interested
28 in how strings’ probabilities rank against each other and may not want to write off any string as
29 completely improbable—it could prove problematic when generating strings from these models, a
30 topic covered later in this course.

31 Teacher Forcing. The loss functions that we have considered thus far are all based on our
32 model’s predictions conditioned on prior context. Here we are faced with a choice during training:
10 This behavior can also be (at least partially) attributed to the softmax used to transform model outputs into a

probability distribution over symbols. Since the softmax maps to the interior of the probability simplex, no symbol
can be assigned a probability of exactly 0.

we could either use the model's own predictions from the previous time steps (e.g., the most probable symbols), i.e., condition on pθ(· | ŷ_{<t}), or use the ground-truth prior context from our data, i.e., condition on pθ(· | y_{<t}). The latter method is often referred to as teacher forcing: even if our model makes an incorrect prediction at one step of training, we intervene and provide the correct answer for it to make subsequent predictions with.
From a theoretical perspective, training with the cross-entropy loss mandates that we use the teacher-forcing approach, since each conditional distribution is defined with respect to the ground-truth context; this is elucidated, for example, in Eq. (3.57). Yet such meticulous guidance can lead to poor performance in tasks where the model is required to accurately predict an entire sequence of symbols on its own. For example, in language generation, since the model is not exposed to its own generations during training, small errors in predictions can compound, leading to degenerate text. This problem is known as exposure bias. On the other hand, using previous model outputs in order to make subsequent predictions can lead to serious instability during training, especially if implemented from the start of training. Methods with more stable training dynamics have been proposed for alleviating exposure bias, such as scheduled sampling (Bengio et al., 2015), which we discuss in §3.2.2.
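The distinction can be made concrete with a small sketch. The stand-in model below (a bigram-like parameterization whose conditionals depend only on the most recent context symbol) and all names are our own; the point is only that the two regimes differ in which context the conditionals are evaluated on.

```python
import numpy as np

Sigma_bar = ["a", "b", "eos"]

def p_next(prefix, theta):
    # Stand-in locally normalized model: logits depend on the most recent symbol (or bos).
    row = Sigma_bar.index(prefix[-1]) if prefix else len(Sigma_bar)   # last row acts as bos
    e = np.exp(theta[row] - theta[row].max())
    return e / e.sum()

def nll_teacher_forcing(y, theta):
    # Negative log-likelihood with ground-truth contexts, cf. Eq. (3.57).
    nll = 0.0
    for t in range(len(y)):
        nll -= np.log(p_next(y[:t], theta)[Sigma_bar.index(y[t])])
    return nll - np.log(p_next(y, theta)[Sigma_bar.index("eos")])

def nll_free_running(y, theta):
    # Same targets, but each context is built from the model's own greedy predictions.
    pred_prefix, nll = [], 0.0
    for t in range(len(y)):
        dist = p_next(pred_prefix, theta)
        nll -= np.log(dist[Sigma_bar.index(y[t])])
        pred_prefix.append(Sigma_bar[int(np.argmax(dist[:-1]))])      # feed back a non-eos guess
    return nll - np.log(p_next(pred_prefix, theta)[Sigma_bar.index("eos")])

theta = np.random.default_rng(0).normal(size=(len(Sigma_bar) + 1, len(Sigma_bar)))
y = ["a", "b", "a"]
print(nll_teacher_forcing(y, theta), nll_free_running(y, theta))      # generally different
```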

Alternative Objectives

Masked Language Modeling. So far, our parameter estimation strategies have made use of the decomposition of pθ(y) into individual symbol probabilities conditioned on prior symbols, i.e., pθ(y) = ∏_{t=1}^{T} pθ(y_t | y_{<t}). In other words, we do not give a model both sides of a symbol's context when asking it to estimate the probability distribution over that symbol. While this paradigm might be more realistic when using a language model for tasks such as generation—for which we may want to generate outputs sequentially to mimic human language production—access to both sides of a symbol's context could be critical when using the model for tasks such as acceptability judgments or classification. This motivates the use of an alternative objective for parameter estimation. Similarly to the maximum likelihood objective in Eq. (3.59), we can choose model parameters by optimizing for the per-symbol log-likelihood of a dataset D, albeit in this case using both sides of each symbol's context:

    L_MLM(θ) = ∑_{n=1}^{N} ∑_{t=1}^{T} log pθ(y_t^(n) | y_{<t}^(n), y_{>t}^(n))       (3.66)

Eq. (3.66) is sometimes referred to as the pseudo(log)likelihood (Besag, 1975), since it gives us an approximation of the true log-likelihood, i.e., ∑_{t=1}^{T} log pθ(y_t | y_{<t}, y_{>t}) ≈ log pθ(y). Pseudolikelihood has its origins in thermodynamics, where it was used as an approximate inference technique for parameter estimation in Ising models. In such situations, computing pθ(y_t | y_{≠t}) often proved computationally easier than computing the exact set of conditional probabilities whose product equals the marginal.
Using Eq. (3.66) as a model's training objective is also motivated by psychological tests of language understanding—specifically, the Cloze task (Taylor, 1953) in psychology, in which the goal is to predict the omitted symbol of a piece of text such that it constitutes a logical and coherent completion. For example, in the string

Example 3.2.1: The Cloze task

The students [MASK] to learn about language models.

2 we predict want or like with high probability for the [MASK] position. When used as an objective
3 in NLP, estimating the probability distribution over symbols at the masked position is referred
4 to as masked language modeling; BERT (Devlin et al., 2019) is one well known example of a
5 masked language model. In practice, typically only the distributions over symbols at a percentage of
6 randomly-chosen positions in D are estimated during training. As mentioned in §2.5, a model whose
7 parameters are estimated with the masked language modeling objective is not a valid language
8 model in the sense of Definition 2.3.7 because it does not provide a valid distribution over Σ˚ . Yet,
9 masked language models have become increasingly popular as base models for fine-tuning on certain
10 downstream tasks, where they sometimes lead to superior performance over standard language
11 models.
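A minimal sketch of the pseudo-log-likelihood objective of Eq. (3.66). The stand-in "bidirectional" model below (a linear function of bag-of-context counts) is an invention for illustration only, not an actual masked language model.

```python
import numpy as np

Sigma = ["a", "b", "c"]

def p_masked(y, t, theta):
    # Stand-in bidirectional model: a distribution over the symbol at position t
    # given all other positions, parameterized by counts of the surrounding symbols.
    context = [Sigma.index(s) for i, s in enumerate(y) if i != t]
    logits = theta @ np.bincount(context, minlength=len(Sigma)).astype(float)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def pseudo_log_likelihood(corpus, theta):
    # Eq. (3.66): sum over strings and positions of log p(y_t | y_{<t}, y_{>t}).
    total = 0.0
    for y in corpus:
        for t in range(len(y)):
            total += np.log(p_masked(y, t, theta)[Sigma.index(y[t])])
    return total

theta = np.random.default_rng(0).normal(size=(len(Sigma), len(Sigma)))
corpus = [["a", "b", "b"], ["c", "a"]]
print(pseudo_log_likelihood(corpus, theta))
```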

Other Divergence Measures. Within a given hypothesis space (see Assumption 3.2.2), the distribution that minimizes a given divergence measure with respect to pθ* exhibits certain properties in terms of how its probability mass is spread over the support of that distribution. For example, the model pθ that minimizes D_KL(pθ* || pθ) exhibits mean-seeking behavior, as discussed earlier in this section. These properties have been studied in depth by a number of works (Minka, 2005; Theis et al., 2016; Huszár, 2015; Labeau and Cohen, 2019). The implication of these findings is that, depending on the use case for the model, other divergence measures may be better suited as a learning objective. For example, prior work has noted frequency biases in models estimated using the standard log-likelihood objective, i.e., these models exhibit an inability to accurately represent the tails of probability distributions (Gong et al., 2018). This is particularly relevant in the case of language modeling, as symbol usage in natural language tends to follow a power-law distribution (Zipf, 1935). Consequently, when we care particularly about accurately estimating the probability of rare words, we may wish to instead use a loss function that prioritizes good estimation of the tails of the probability distribution. On the other hand, in the case of language generation, we may desire models that only assign probability mass to outputs that are highly likely according to pθ*, even if this means assigning probabilities of 0 to some outcomes possible under pθ*. In other words, we may want a model with mode-seeking behavior, which is characteristic of models trained to minimize the reverse KL divergence D_KL(pθ || pθ*). However, there are a number of computational issues with using other divergence measures—such as general power divergences, the reverse KL divergence, and total variation distance—for training neural probabilistic models over large supports, making them difficult to work with in practice. For example, we can compute a Monte Carlo estimate of the forward KL divergence simply by using samples from pθ*, which is exactly what we have in our dataset. However, an unbiased estimator of the reverse KL divergence would require the ability to query pθ* for probabilities, which we do not have.

Scheduled Sampling and Alternative Target Distributions. Scheduled sampling (Bengio et al., 2015) is an algorithm proposed with the goal of alleviating exposure bias: after an initial period of training using the standard teacher-forcing approach, some percentage of the model's predictions are conditioned on prior model outputs, rather than on the ground-truth context. However, under this algorithm, θ̂ is not a consistent estimator of θ* (Huszár, 2015). Other methods likewise aim to alleviate the discrepancy between settings during parameter estimation and those at inference time by specifying an alternative target distribution, for example, one that ranks "higher-quality" text as more probable than average-quality text. Ultimately, these methods often make use of techniques developed for reinforcement learning, e.g., the REINFORCE algorithm. These methods fall under the category of fine-tuning criteria, which are discussed later in this course.

Auxiliary Prediction Tasks. Certain works jointly optimize for an additional objective when performing parameter estimation. For example, the parameters of BERT were learned using both the masked language modeling objective and a task referred to as next sentence prediction, i.e., given two sentences, estimating the probability that the second sentence followed the first in a document. A number of similar auxiliary tasks have subsequently been proposed, such as symbol frequency prediction or sentence ordering (see Aroca-Ouellette and Rudzicz (2020) for a summary). However, these tasks do not have a formal relationship to language modeling, and it is unclear what their effects are on a model's ability to serve as a valid probability distribution over strings. They likely lead to models that no longer fulfill the formal criteria of §2.5.4.

14 3.2.3 Parameter Estimation


15 Given a loss function ℓ and a parameter space Θ from which to choose model parameters, we are
now tasked with finding the parameters θ̂, i.e., solving Eq. (3.53). For the class of models that
17 we consider (those parameterized by large neural networks), finding an exact solution analytically
18 would be impractical, if not impossible. Thus, we must resort to numerical methods, where we
19 find approximately optimal parameters by iterating over solutions. This is known as parameter
20 estimation, or more colloquially as training our model.
21 Here we will review the various components of training a language model from start to finish.
22 Many of the techniques used for training language models are generally applicable machine learning
23 techniques, e.g., gradient-descent algorithms. Further, these techniques are constantly evolving and
24 often viewed as trade secrets, meaning that entities building and deploying models may not reveal
25 the combination of components that they employed. Thus, we give a more general overview of the
26 design choices involved in parameter estimation, along with the characteristics common to most
27 components.

28 Data Splitting
29 In any machine learning setting, we may overestimate model quality if we evaluate solely on its
30 performance w.r.t. the data on which its parameters were estimated. While we can often construct a
31 model that performs arbitrarily well on a given dataset, our goal is to build a model that generalizes
32 to unseen data. Thus, it is important to measure the final performance of a model on data that has
33 had no influence on the choice of model parameters.
34 This practice can be accomplished simply by splitting the data into several sets. The two basic
35 data splits are a training set Dtrain and test set Dtest ; as the names imply, the training set is used
36 during parameter estimation while the test set is used for evaluating final performance. When
37 samples from Dtest can be found in Dtrain , we call this data leakage. The training set can be further
38 divided to produce a validation set Dval . Typically, Dval is not used to define the objective for which
39 parameters are optimized. Rather, it serves as a check during training for the generalization abilities
40 of a model, i.e., to see whether the model has started overfitting to the training data. The validation
41 set can be used, e.g., to determine when to stop updating parameters.

Numerical Optimization

From a starting point θ_0 ∈ Θ chosen according to our initialization strategy, we want to find θ̂ in an efficient manner. This is where numerical optimization algorithms come into play—a precise set of rules for choosing how to move within Θ in order to find our next set of parameters. The output of a numerical optimization algorithm is a sequence of iterates {θ_s}_{s=0}^{S}, with the property that as S → ∞ we find the minimizer of our objective ℓ. Ideally, even after a finite number of iterations, we will be sufficiently close to θ̂.
The basic algorithm for searching the parameter space for θ̂ follows a simple formula: starting from θ_0 ∈ Θ, we iteratively compute θ_1, θ_2, . . . as

    θ_{s+1} = θ_s + update magnitude × update direction,                     (3.67)

where the update added to θ_s to obtain θ_{s+1} is intended to move us closer to θ̂. Once some maximum number of updates S has been reached or a pre-defined desideratum has been met, e.g., our loss has not improved in subsequent iterations, we stop and return the current set of parameters. Many of the numerical optimization techniques in machine learning are gradient-based, i.e., we use the gradient of the objective with respect to the current model parameters (denoted ∇_{θ_s} ℓ(θ_s)) to determine our update direction. Standard vanilla gradient descent takes the form of Algorithm 1, where the learning rate schedule η = ⟨η_0, · · · , η_S⟩ determines the step size of our parameter update in the loss-minimizing direction—there is an inherent trade-off between the rate of convergence and overshooting—and the stopping criterion C determines whether we can terminate parameter updates before our maximum number of iterations S. In vanilla gradient descent, we set η = c · 1 for some constant c and
Algorithm 1 Gradient descent for parameter optimization.

Input: ℓ                                   objective
       θ_0                                 initial parameters
       η                                   learning rate schedule
       C : ℓ × Θ × Θ → {True, False}       stopping criterion

1. for s = 0, · · · , S :
2.     θ_{s+1} ← θ_s − η_s · ∇_θ ℓ(θ_s)
3.     if C(ℓ, θ_s, θ_{s−1}) :
4.         break
5. return θ_s

19

C(ℓ, θ_s, θ_{s−1}) = 1{|ℓ(θ_s) − ℓ(θ_{s−1})| < ϵ} for a user-chosen ϵ—in words, we stop when the change in loss between parameter updates is below a chosen threshold. In practice, more sophisticated learning rate schedules η, e.g., square-root functions of the timestep (Hoffer et al., 2017) or adaptive functions that take into account model parameter values (Duchi et al., 2011), and stopping criteria C are employed.
Modern training frameworks rely on backpropagation—also known as reverse-mode automatic differentiation (Griewank and Walther, 2008)—to compute gradients efficiently (and, as the name implies, automatically!). In fact, gradients can be computed using backpropagation in the same complexity as the evaluation of the original function. We do not provide a formal discussion of backpropagation here, but see Griewank and Walther (2008) for this material.
Recall that our loss function—and consequently the gradient of our loss function—is defined with respect to the entire dataset. Vanilla gradient descent therefore requires iterating through all of D_train

in order to determine the update direction, which is an incredibly time-consuming computation for the large datasets employed in modern machine learning settings. Rather, an optimization algorithm would likely take much less time to converge if it could rapidly compute estimates of the gradient at each step. This is the motivation behind perhaps the most widely employed class of optimization algorithms in machine learning: variations of stochastic gradient descent (SGD), such as mini-batch gradient descent. Explicitly, these algorithms make use of the fact that E_{D' ∼ D}[∇_θ ℓ(θ, D')] = ∇_θ ℓ(θ, D), where in a slight abuse of notation, we use D' ∼ D to signify that the multiset D' consists of random i.i.d. samples from D. Thus we can instead base our loss ℓ, and consequently the update direction, on a randomly selected subset of the data.¹¹ In practice, though, this sample is taken without replacement, which breaks the i.i.d. assumption. This in turn implies that our gradient estimates are biased under the mini-batch gradient descent algorithm. However, this bias does not seem to empirically harm the performance of such optimization strategies. Indeed, an entire branch of machine learning called curriculum learning focuses on trying to find an optimal data ordering with which to train models to achieve desirable characteristics such as generalization abilities. Even when orderings are randomly selected, the chosen ordering can have a large impact on model performance (Dodge et al., 2020).
17 A number of optimization algorithms have since iterated on SGD, e.g., the momentum algorithm
18 (Polyak, 1964). In short, the momentum algorithm computes an exponentially decaying moving
19 average of past gradients, and continues updating parameters in this direction, which can drastically
20 speed up convergence. A widely-employed optimization algorithm called ADAM (Kingma and Ba,
21 2015) takes a similar approach. Just as in momentum, it computes update directions using a moving
22 average (first moment) of gradients, albeit it additionally makes use of the variance of gradients
23 (second moment) when computing update directions. ADAM is one of the most popular optimization
24 algorithms used for training large language models in modern ML frameworks.
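A minimal sketch, on a toy quadratic objective of our own choosing, of the two update rules just described: momentum, which maintains an exponentially decaying average of past gradients, and an Adam-style update, which additionally rescales each coordinate by a running estimate of the gradient's second moment. The hyperparameter values are illustrative, not recommendations.

```python
import numpy as np

def loss_and_grad(theta):
    # Toy objective standing in for the corpus loss: a quadratic bowl around `target`.
    target = np.array([1.0, -2.0])
    return 0.5 * np.sum((theta - target) ** 2), theta - target

def sgd_momentum(theta, steps=200, eta=0.1, beta=0.9):
    v = np.zeros_like(theta)
    for _ in range(steps):
        _, g = loss_and_grad(theta)
        v = beta * v + g                      # decaying average of past gradients
        theta = theta - eta * v
    return theta

def adam(theta, steps=200, eta=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    for s in range(1, steps + 1):
        _, g = loss_and_grad(theta)
        m = b1 * m + (1 - b1) * g             # first moment (as in momentum)
        v = b2 * v + (1 - b2) * g ** 2        # second moment (per-coordinate scale)
        m_hat, v_hat = m / (1 - b1 ** s), v / (1 - b2 ** s)   # bias correction
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

theta0 = np.zeros(2)
print(sgd_momentum(theta0.copy()), adam(theta0.copy()))   # both approach [1, -2]
```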

25 Parameter Initialization
26 Our search for (approximately) optimal model parameters must start from some point in the
27 parameter space, which we denote as θ 0 . Ideally, starting from any point would lead us to the
28 same solution, or at least to solutions of similar quality. Unfortunately, this is not the case: both
29 training dynamics and the performance of the final model can depend quite heavily on the chosen
30 initialization strategy, and can even have high variance between different runs of the same strategy.
This makes sense at some intuitive level, though: depending on the learning algorithm, an initial starting point can heavily dictate the amount of searching we will have to do in order to find θ̂, and how many local optima are on the route to θ̂. Consequently, a poor initial starting point may lead to models that take longer to train and/or may lead our learning algorithm to converge to sub-optimal solutions (i.e., an alternative local optimum) (Dodge et al., 2020; Sellam et al., 2022).
36 This can be the case even when only estimating the final layer of a network, e.g., when building
37 a classifier by appending a new layer to a pretrained model—a recent, widely-adopted practice in
38 NLP (Dodge et al., 2020).
39 Methods for initializing the parameters of neural language models are largely the same as those
11While this logic holds even for samples of size 1 (which is the sample size for standard SGD by definition), basing

updates off of single samples can lead to noisy updates. Depending on resource constraints, batch sizes of a few
hundred are often used, leading to much more stable training (although in the face of memory constraints, larger
batch sizes can be mimicked by accumulating, i.e., averaging, gradients across multiple batches when computing
update directions). Batch size itself is often viewed as a model hyperparameter that can have a significant effect on
model performance.

1 for initializing other neural networks. Perhaps the simplest approach is to randomly initialize all
2 parameters, e.g., using a uniform or normal random variable generator. The parameters of these
3 generators (mean, standard deviation, bounds) are considered hyperparameters of the learning
4 algorithm. Subsequent methods have iterated on this strategy to develop methods that take into
account optimization dynamics or model architectures. One consideration that is particularly relevant for language models is that the input and output sizes of the embedding layer and the fully connected layer can be very different; this exacerbates the problem of vanishing or exploding gradients during training. For example, Glorot and Bengio (2010) proposed Xavier initialization, which keeps the variance of the input and output of all layers within a similar range in order to prevent vanishing or exploding gradients; He et al. (2015) proposed a uniform initialization strategy specifically designed to work with ReLU activation units.
12 during parameter initialization can likewise alleviate the problem of vanishing gradients. While
13 most deep learning libraries use thoughtfully-selected initialization strategies for neural networks, it
14 is important to internalize the variance in performance that different strategies can cause.
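A sketch of the two initialization strategies mentioned above, using their common uniform variants (bound sqrt(6/(fan_in + fan_out)) for the Xavier scheme and sqrt(6/fan_in) for the He scheme); the function names are ours.

```python
import numpy as np
rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # Glorot and Bengio (2010): W ~ U(-a, a) with a = sqrt(6 / (fan_in + fan_out)),
    # keeping activation and gradient variance roughly constant across layers.
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_out, fan_in))

def he_uniform(fan_in, fan_out):
    # He et al. (2015): scale by sqrt(6 / fan_in), tailored to ReLU activations.
    a = np.sqrt(6.0 / fan_in)
    return rng.uniform(-a, a, size=(fan_out, fan_in))

W = xavier_uniform(512, 64)
print(W.std(), np.sqrt(2.0 / (512 + 64)))   # empirical std close to the intended value
```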

15 Early Stopping
16 As previously discussed, performance on Dtrain is not always the best indicator of model performance.
17 Rather, even if our objective continues to increase as we optimize over model parameters, performance
18 on held-out data, i.e., Dtest or even Dval , may suffer as the model starts to overfit to the training
19 data. This phenomenon inspires a practice called early stopping, where we stop updating model
20 parameters before reaching (approximately) optimal model parameter values w.r.t. Dtrain . Instead,
21 we base our stopping criterion C off of model performance on Dval as a quantification of generalization
22 performance, a metric other than that which model parameters are optimized for, or just a general
23 slow down in model improvement on the training objective.
24 Early stopping sacrifices better training performance for better generalization performance; in
25 this sense, it can also be viewed as a regularization technique, a topic which we discuss next. As
26 with many regularization techniques, early stopping can have adverse effects as well. Recent work
27 suggests that many models may have another period of learning after an initial period of plateauing
28 train/validation set performance. Indeed, a sub-field has recently emerged studying the “grokking”
29 phenomenon (Power et al., 2022), when validation set performance suddenly improves from mediocre
30 to near perfect after a long period in which it appears that model learning has ceased, or even that
31 the model has overfit to the training data. Thus, it is unclear whether early stopping is always a
32 good practice.
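A minimal sketch of a patience-based early-stopping loop; step and val_loss are user-supplied callables standing in for a parameter update and a validation-set evaluation, and the toy loss sequence is invented to show the mechanism.

```python
def train_with_early_stopping(step, val_loss, max_steps=1000, patience=5):
    # Stop once the validation loss has not improved for `patience` consecutive checks.
    best, bad_checks = float("inf"), 0
    for _ in range(max_steps):
        step()                       # one round of parameter updates
        current = val_loss()         # evaluate on D_val
        if current < best:
            best, bad_checks = current, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break
    return best

# Toy usage: a validation loss that bottoms out and then starts to overfit.
losses = iter([3.0, 2.0, 1.5, 1.4, 1.45, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1])
print(train_with_early_stopping(step=lambda: None, val_loss=lambda: next(losses)))  # 1.4
```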

33 3.2.4 Regularization Techniques


34 Our goal during learning is to produce a model pθ that generalizes beyond the observed data; a model
35 that perfectly fits the training data but produces unrealistic estimates for a new datapoint is of little
36 use. Exactly fitting the empirical distribution is therefore perhaps not an ideal goal. It can lead to
37 overfitting, which we informally define as the situation when a model uses spurious relationships
38 between inputs and target variables observed in training data in order to make predictions. While
39 this behavior decreases training loss, it generally harms the model’s ability to perform on unseen
40 data, for which such spurious relationships likely do not hold.
41 To prevent overfitting, we can apply some form of regularization.

Principle 3.2.3: Regularization

Regularization is a modification to a learning algorithm that is intended to increase a model’s


generalization performance, perhaps at the cost of training performance.a
a Adapted from Goodfellow et al. (2016), Ch. 7.
1

There are many ways of implementing regularization, such as smoothing a distribution towards a chosen baseline or adding a penalty to the loss function to reflect a prior belief that we may have about the values model parameters should take on (Hastie et al., 2001; Bishop, 2006). Further, many regularization techniques are formulated for specific model architectures: for example, the count-based smoothing methods used in n-gram language models (Ney et al., 1994; Gale and Sampson, 1995). Here we specifically consider the forms of regularization often used in the estimation of neural language models. Most fall into two categories: methods that try to ensure a model's robustness to yet unseen (or rarely seen) inputs—e.g., by introducing noise into the optimization process—or methods that add a term to our loss function that reflects biases we would like to impart on our model. This is by no means a comprehensive discussion of regularization techniques, for which we refer the reader to Ch. 7 of Goodfellow et al. (2016).

Weight Decay

A bias that we may wish to impart on our model is that not all the variables available to the model may be necessary for an accurate prediction. Rather, we hope for our model to learn the simplest mapping from inputs to target variables, as this is likely the function that will be most robust to statistical noise.¹² This bias can be operationalized using regularization techniques such as weight decay (Goodfellow et al., 2016)—also often referred to as ℓ2 regularization. In short, a penalty on the ℓ2 norm of θ is added to ℓ. This should in theory discourage the learning algorithm from assigning high values to model parameters corresponding to variables with only a noisy relationship to the output, instead assigning them a value close to 0 that reflects this non-robust relationship.

Entropy Regularization

One sign of overfitting in a language model pθ is that it places effectively all of its probability mass on a single symbol.¹³ Rather, we may want the distributions output by our model to generally have higher entropy, following the principle of maximum entropy: "the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data" (Jaynes, 1957). Several regularization techniques, which we refer to as entropy regularizers, explicitly penalize the model for low-entropy distributions.
Label smoothing (Szegedy et al., 2015) and the confidence penalty (Pereyra et al., 2017) add terms to ℓ that penalize the model for outputting peaky distributions. Explicitly, label smoothing reassigns a portion of the probability mass in the reference distribution from the ground-truth symbol to all other symbols in the vocabulary. It is equivalent to adding a term D_KL(u || pθ) to ℓ, where u is the uniform distribution. The confidence penalty regularizes against low-entropy distributions by adding a term −H(pθ) to ℓ, i.e., by subtracting the entropy, thereby encouraging high entropy in model outputs. The general class of entropy regularizers has proven effective in training neural models (Meister et al., 2020).

¹² This philosophy can be derived from Occam's Razor, i.e., the principle that one should search for explanations constructed using the smallest possible set of elements.
¹³ The softmax transformation serves as somewhat of a regularizer against this behavior since it does not allow any symbol to be assigned a probability of exactly 0.
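To make the two entropy regularizers concrete, here is a minimal sketch that builds a label-smoothed target (one common variant, with a smoothing weight alpha spread uniformly over the vocabulary) and evaluates a confidence-penalized loss; the numbers and the penalty weight are purely illustrative.

```python
import numpy as np

def label_smoothed_target(gold_index, vocab_size, alpha=0.1):
    # Move a fraction alpha of the reference mass from the gold symbol to the vocabulary.
    target = np.full(vocab_size, alpha / vocab_size)
    target[gold_index] += 1.0 - alpha
    return target

def entropy(p):
    return -np.sum(p * np.log(p))

p_model = np.array([0.85, 0.05, 0.05, 0.05])    # the model's predicted distribution
gold = 0

nll = -np.log(p_model[gold])
smoothed = -np.sum(label_smoothed_target(gold, 4) * np.log(p_model))   # label-smoothing loss
confidence_penalized = nll - 0.1 * entropy(p_model)                    # confidence penalty

print(nll, smoothed, confidence_penalized)
```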

Dropout

Regularization also encompasses methods that expose a model to noise that can occur in the data at inference time. The motivation behind such methods is to penalize a model for being overly dependent on any given variable (whether directly from the input or somewhere further along in the computational graph) when making predictions. Dropout does this explicitly by randomly "dropping" variables from a computation in the network (Srivastava et al., 2014).
More formally, consider a model defined as a series of computational nodes, where any given node is the product of a transformation of previous nodes. When dropout is applied to the module that contains a node, the node is zeroed out with some percentage chance, i.e., it is excluded from all functions that may make use of it to compute the values of future nodes. In this case, the model is penalized if it relied completely on the value of that node for any given computation in the network. Dropout can be applied to most variables within a model, e.g., the inputs to the model itself, the inputs to the final linear projection in a feed-forward layer, or the summands in the attention head of a Transformer. Note that at inference time, all nodes are used to compute the model's output.¹⁴
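A minimal sketch of (inverted) dropout; rescaling the surviving coordinates by 1/(1 − p_drop) during training is one standard way of handling the renormalization mentioned in the footnote, so that the layer is simply the identity at inference time.

```python
import numpy as np
rng = np.random.default_rng(0)

def dropout(h, p_drop=0.1, train=True):
    # Zero each coordinate with probability p_drop and rescale the survivors so that
    # the expected value of the output matches the input.
    if not train:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones(10_000)
print(dropout(h).mean())                 # ~1.0 in expectation during training
print(dropout(h, train=False).mean())    # exactly 1.0 at inference time
```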

Batch and Layer Normalization

Rescaling variables within a network helps with training stability and, further, with generalization, by keeping variables within the same range and with unit variance. Specifically, batch normalization helps alleviate the problem of covariate shift, where the distribution of features (both the input features and the variables corresponding to transformed features within a network) differs between the training data and the data at inference time. Batch normalization addresses this problem by recentering (around 0) and rescaling (to unit variance) data points, such that the data flowing between intermediate layers of the network follows approximately the same distribution across batches. Layer normalization likewise performs centering and rescaling, albeit across features rather than across data points. Specifically, normalization is performed so that all of the feature values within a data point have mean 0 and unit variance.

14 Some form of renormalization is typically performed to account for the fact that model parameters are learned

with only partial availability of variables. Thus when all variables are used in model computations, the scale of the
output will (in expectation) be larger than during training, potentially leading to poor estimates.
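A minimal sketch contrasting the two normalizations on a random matrix of data points (rows) and features (columns); the learnable gain and bias parameters that real implementations include are omitted for brevity.

```python
import numpy as np

def batch_norm(X, eps=1e-5):
    # Normalize each feature (column) across the batch dimension.
    mu, var = X.mean(axis=0, keepdims=True), X.var(axis=0, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def layer_norm(X, eps=1e-5):
    # Normalize each data point (row) across its features.
    mu, var = X.mean(axis=1, keepdims=True), X.var(axis=1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

X = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(8, 5))
print(batch_norm(X).mean(axis=0).round(6))   # ~0 for every feature
print(layer_norm(X).mean(axis=1).round(6))   # ~0 for every data point
```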
Chapter 4

Classical Language Models

Next, we turn to two classical language modeling frameworks: finite-state language models (a natural generalization of the well-known n-gram models) in §4.1 and pushdown language models in §4.2. Although the most successful approaches to language modeling are based on neural networks, the study of older approaches to language modeling is invaluable. First, due to the simplicity of these models, learning how they work helps distill concepts. Moreover, they often serve as important baselines in modern NLP and provide very useful insights into the capabilities of modern architectures, as we will see when we discuss those architectures in Chapter 5.
10 In the spirit of our question-motivated investigation, we will focus on the following two questions.

Question 4.1: Representing conditional distributions

How can we tractably represent all conditional distributions of the form pSM py | yq in a simple
way?
11

Question 4.2: Representing hierarchical structure

How can we tractably represent the hierarchical structure of human language?


12

4.1 Finite-state Language Models

After rigorously defining what language models are (and what they are not) and discussing how we can estimate them, it is time to finally introduce our first class of language models—those based on finite-state automata. Language models derived from probabilistic finite-state automata are some of the simplest classes of language models because they definitionally distinguish only a finite number of contexts when modeling the conditional distribution of the next symbol p_M(y | y). We first give an intuitive definition of a finite-state language model and then introduce a more formal definition, which we will use throughout the rest of the section.


Definition 4.1.1: Informal definition of a finite-state language model

A language model pLM is finite-state if it defines only finitely many unique conditional
distributions pLM py | yq. In other words, there are only finitely many contexts y which define
the distribution over the next symbol, pLM py | yq.
1

2 Intuitively, this framework might be useful because it bounds the number of unique conditional
3 distributions we have to learn. However, as we will see later in this chapter, finite-state language
4 models are not sufficient for modeling human language. Nevertheless, they can still offer a baseline
5 for modeling more complex phenomena. They also offer a useful theoretical tool in the understanding
6 of neural language models, which we will discuss in Chapter 5.

7 4.1.1 Weighted Finite-state Automata


8 Before we introduce finite-state language models, we go on a brief detour into the theory of finite-state
9 automata. As we will see, finite-state automata are a tidy and well-understood formalism. As we will
10 see later in §5.1.6, they also provide a solid and convenient theoretical framework for understanding
11 modern neural language models, e.g., those based on recurrent neural networks and transformers.
12 We, therefore, begin by briefly introducing the theory of finite-state automata with real-valued
13 weights.

14 Finite-state Automata

15 In words, finite-state automata are one of the simplest devices for defining a formal language (cf.
16 Definition 2.3.5). We give a formal definition below.

Definition 4.1.2: Finite-state Automata

A finite-state automaton (FSA) is a 5-tuple (Σ, Q, I, F, δ) where
• Σ is an alphabet;
• Q is a finite set of states;
• I ⊆ Q is the set of initial states;
• F ⊆ Q is the set of final or accepting states;
• δ is a finite multiset δ ⊆ Q × (Σ ∪ {ε}) × Q.ᵃ Elements of δ are generally called transitions.

ᵃ The fact that it is a multiset reflects that it can contain multiple copies of the same element (i.e., transitions between the same pair of states with the same symbol).

The name finite-state automaton stems from the requirement that the set of states Q is finite, which stands in contrast to the remaining formalisms we will cover in this course, e.g., pushdown automata and recurrent neural networks. We will denote a general finite-state automaton with a (subscripted) A. We will also adopt a more suggestive notation for transitions, denoting a transition (q₁, a, q₂) as q₁ --a--> q₂.

1 An FSA can be graphically represented as a labeled, directed multi-graph.1 The vertices in the
2 graph represent the states q P Q and the (labeled) edges between them the transitions in δ. The
3 labels on the edges correspond to the input symbols a P Σ which are consumed when transitioning
4 over the edges. The initial states qι P I are marked by a special incoming arrow while the final
5 states qφ P F are indicated using a double circle.

Example 4.1.1: An example of a finite-state automaton

An example of an FSA can be seen in Fig. 4.1. Formally, we can specify it as
• Σ = {a, b, c}
• Q = {1, 2, 3}
• I = {1}
• F = {3}
• δ = {(1, a, 2), (1, b, 3), (2, b, 2), (2, c, 3)}

[Figure 4.1: Example of a simple FSA. The automaton has states 1, 2, and 3, with transitions 1 --a--> 2, 1 --b--> 3, 2 --b--> 2 (a self-loop), and 2 --c--> 3.]

A finite-state automaton sequentially reads in the individual symbols of an input string y ∈ Σ* and transitions from state to state according to the transition relation δ. The traversal through the automaton starts in a state q_ι ∈ I (more precisely, it acts as if starting from all of them in parallel). It then transitions from state q into state q' upon reading the symbol a if and only if q --a--> q' ∈ δ. ε-labeled transitions, however, allow a finite-state machine to transition to a new state without consuming a symbol. This is in line with ε's definition as the empty string.
A natural question to ask at this point is what happens if, for a state–symbol pair (q, a), there is more than one possible transition allowed under the relation δ. In such a case, we take all possible transitions simultaneously, which leads us to the following pair of definitions.

Definition 4.1.3: Deterministic finite-state automaton

An FSA A = (Σ, Q, I, F, δ) is deterministic if

• it does not have any ε-transitions;
• for every (q, a) ∈ Q × Σ, there is at most one q′ ∈ Q such that q -a-> q′ ∈ δ;
• there is a single initial state, i.e., |I| = 1.

Otherwise, A is non-deterministic.

1 The "multi-" aspect of the multi-graph refers to the fact that we can have multiple transitions between any pair of states, and "labeled" refers to the fact that we label those transitions with symbols from the alphabet Σ.

2 An important, and perhaps not entirely obvious, result is that the classes of deterministic and
3 non-deterministic FSA are equivalent, in the sense that you can always represent a member of one
4 class with a member of the other.
5 If the automaton ends up, after reading in the last symbol of the input string, in one of the final
6 states qφ P F , we say that the automaton accepts that string. A finite-state automaton is therefore
7 a computational device that determines whether a string satisfies a condition (namely, the condition
8 that the automaton, by starting in an initial state and following one of the paths labeled with that
9 string, ends in a final state). A string that satisfies this condition is said to be recognized by the
10 automaton and the set of all strings satisfying this condition form the language of the automaton.2

Definition 4.1.4: Language of a finite-state automaton

Let A = (Σ, Q, I, F, δ) be a finite-state automaton. The language of A, L(A), is defined as

    L(A) ≝ {y | y is recognized by A}.    (4.1)

12 Abstractly, a finite-state automaton is hence a specification of a set of rules that strings must
13 satisfy to be included in its language. The set of languages that finite-state automata can recognize
14 is known as the class of regular languages.

Definition 4.1.5: Regular language

A language L Ď Σ˚ is regular if and only if it can be recognized by an unweighted finite-state


automaton, i.e., if there exists a finite-state automaton A such that L “ L pAq.
15

Example 4.1.2: Additional examples of finite-state automata

Additional simple examples of FSAs are shown in Fig. 4.2. The FSA in Fig. 4.2a, for example,
can formally be defined with
• Σ “ ta, b, cu
• Q “ t1, 2, 3, 4, 5, 6u
• I “ t1u
• F “ t6u
• δ “ tp1, a, 2q , p1, b, 3q , p2, b, 2q , p2, c, 4q , p3, c, 4q , p3, b, 5q , p4, a, 6q , p5, a, 6qu
The FSA in Fig. 4.2a is deterministic while the one in Fig. 4.2b is non-deterministic.
A few examples of strings accepted by A1 include bba, bca, aca, abca, abbca, abbbca, . . . . In fact, due to the self-loop at state 2, the symbol b can appear an arbitrary number of times at the second position of the accepted string abca. Notice that, starting from state 1 and following the transitions dictated by any of the accepted strings, we always end up in the only final state, state 6. In particular, the string "abbca" is accepted with the following sequence of transitions in A1:

    1 -a-> 2, 2 -b-> 2, 2 -b-> 2, 2 -c-> 4, 4 -a-> 6.

(a) A deterministic FSA, A1. Each state has at most one outgoing transition labeled with a given symbol.

(b) A non-deterministic FSA, A2. State 1 has two outgoing transitions labeled with a, whereas state 3 has two outgoing transitions labeled with b.

Figure 4.2: Examples of a deterministic and a non-deterministic FSA.

2 We also say that the automaton recognizes this set of strings (language).
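To make the acceptance condition concrete, the following is a minimal Python sketch (our own, not part of the original text) of an FSA and its acceptance check; the class and variable names are ours, and the transition set is that of A1 from Fig. 4.2a. Non-determinism is handled by tracking the set of all states reachable after each consumed symbol.

from collections import defaultdict

# A minimal sketch of unweighted FSA acceptance (no ε-transitions),
# following Definition 4.1.2.
class FSA:
    def __init__(self, transitions, initial, final):
        # transitions: iterable of (q, a, q') triples
        self.delta = defaultdict(set)
        for q, a, q_next in transitions:
            self.delta[(q, a)].add(q_next)
        self.initial = set(initial)
        self.final = set(final)

    def accepts(self, string):
        # Track the set of states reachable after consuming each symbol.
        current = set(self.initial)
        for a in string:
            current = {q_next for q in current for q_next in self.delta[(q, a)]}
            if not current:
                return False
        return bool(current & self.final)

# The FSA A1 from Fig. 4.2a.
A1 = FSA(
    transitions=[(1, "a", 2), (1, "b", 3), (2, "b", 2), (2, "c", 4),
                 (3, "c", 4), (3, "b", 5), (4, "a", 6), (5, "a", 6)],
    initial=[1],
    final=[6],
)

assert A1.accepts("abca") and A1.accepts("abbca") and A1.accepts("bba")
assert not A1.accepts("ab")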

2 Weighted Finite-state Automata


3 A common and very useful augmentation to finite-state automata is through the addition of weights
4 on the transitions. The general theory of weighted automata makes use of semiring theory, which is
5 beyond the scope of this course.3 In this course, we will limit ourselves to the study of automata
6 with real-valued weights.

Definition 4.1.6: Real-weighted Finite-State Automaton

A real-weighted finite-state automaton (WFSA) A is a 5-tuple (Σ, Q, δ, λ, ρ) where

• Σ is a finite alphabet;
• Q is a finite set of states;
• δ ⊆ Q × (Σ ∪ {ε}) × ℝ × Q is a finite multiset of transitions;a
• λ : Q → ℝ is an (initial) weighting function over Q;
• ρ : Q → ℝ is a (final) weighting function over Q.

a Again, we use the notation q -a/w-> q′ to denote (q, a, w, q′) ∈ δ.
7

3 Semirings and semiring-weighted formal languages are covered in detail in the Advanced Formal Language Theory

course offered at ETH as well.



Notice that we omit the initial and final state sets from the definition of WFSAs. Those can implicitly be specified by the states given non-zero initial or final weights by the λ and ρ functions, i.e., I = {q ∈ Q | λ(q) ≠ 0} and F = {q ∈ Q | ρ(q) ≠ 0}. We might refer to them in the text later for notational convenience and clarity of exposition. We will also sometimes denote transition weights with ω(q -a/w-> q′) ≝ w.

6 Graphically, we write the transition weights on the edges of the graph representing the WFSA
7 after the output symbol, separated by a “/”. The same separator is also used to separate the state
8 name from its final weight, which is written in the node. The initial weights, however, are written
9 on the incoming arrow denoting initial states.

Example 4.1.3: An example of a weighted finite-state automaton

Fig. 4.3 shows a weighted version of the FSA from Fig. 4.2a above.
Figure 4.3: The WFSA corresponding to the FSA from Fig. 4.2a. Its transition weights can be read off the transition matrices in Example 4.1.4; additionally, state 1 carries the initial weight 0.3 and state 6 the final weight 1/e.

11 The connection of WFSAs to graphs makes it natural to define a set of transition matrices
12 specified by a WFSA.

Definition 4.1.7: Transition matrix

Let A = (Σ, Q, δ, λ, ρ) be a WFSA. For any a ∈ Σ, we define the symbol-specific transition matrix T⁽ᵃ⁾ as the transition matrix of the graph restricted to a-labeled transitions. We also define the (full) transition matrix as T ≝ ∑_{a∈Σ} T⁽ᵃ⁾.

13

Example 4.1.4: Examples of transition matrices

Consider the WFSA A in Fig. 4.3. The (symbol-specific) transition matrices for A are

T⁽ᵃ⁾ = ⎡0  0.5   0   0   0   0      ⎤
       ⎢0  0     0   0   0   0      ⎥
       ⎢0  0     0   0   0   0      ⎥
       ⎢0  0     0   0   0   1/(π·e)⎥
       ⎢0  0     0   0   0   0.29   ⎥
       ⎣0  0     0   0   0   0      ⎦

T⁽ᵇ⁾ = ⎡0  0     1/π  0   0     0⎤
       ⎢0  0.63  0    0   0     0⎥
       ⎢0  0     0    0   0.13  0⎥
       ⎢0  0     0    0   0     0⎥
       ⎢0  0     0    0   0     0⎥
       ⎣0  0     0    0   0     0⎦

T⁽ᶜ⁾ = ⎡0  0  0  0     0  0⎤
       ⎢0  0  0  0.9   0  0⎥
       ⎢0  0  0  0.21  0  0⎥
       ⎢0  0  0  0     0  0⎥
       ⎢0  0  0  0     0  0⎥
       ⎣0  0  0  0     0  0⎦

T = T⁽ᵃ⁾ + T⁽ᵇ⁾ + T⁽ᶜ⁾ = ⎡0  0.5   1/π  0     0     0      ⎤
                          ⎢0  0.63  0    0.9   0     0      ⎥
                          ⎢0  0     0    0.21  0.13  0      ⎥
                          ⎢0  0     0    0     0     1/(π·e)⎥
                          ⎢0  0     0    0     0     0.29   ⎥
                          ⎣0  0     0    0     0     0      ⎦
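As a small illustration, the transition matrices above can be assembled directly as numpy arrays. The sketch below is our own (not from the text); it uses 0-based indices for states 1–6 and reads the non-zero entries off the matrices of Example 4.1.4.

import numpy as np

# Symbol-specific transition matrices of the WFSA from Fig. 4.3,
# with states 1..6 mapped to indices 0..5.
T_a = np.zeros((6, 6)); T_b = np.zeros((6, 6)); T_c = np.zeros((6, 6))
T_a[0, 1] = 0.5
T_a[3, 5] = 1 / (np.pi * np.e)
T_a[4, 5] = 0.29
T_b[0, 2] = 1 / np.pi
T_b[1, 1] = 0.63
T_b[2, 4] = 0.13
T_c[1, 3] = 0.9
T_c[2, 3] = 0.21

# The full transition matrix is the sum over symbols.
T = T_a + T_b + T_c

# The inner weight of the path recognizing "abca"
# (1 -a-> 2 -b-> 2 -c-> 4 -a-> 6) is the product of the entries it traverses.
w_inner = T_a[0, 1] * T_b[1, 1] * T_c[1, 3] * T_a[3, 5]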

2 Paths and Path Weights


3 A path is an important concept when talking about (weighted) finite-state automata as it defines
4 the basic structure by which a string is recognized or weighted. We now give a formal definition of a
5 path and discuss how to weight paths.

Definition 4.1.8: Path

A path π is an element of δ* (a sequence of transitions) with consecutive transitions, meaning that it is of the form (q1 -•/•-> q2, q2 -•/•-> q3, . . . , q_{N−1} -•/•-> q_N), where • is a placeholder.a The length of a path is the number of transitions in it; we denote the length as |π|. We use p(π) and n(π) to denote the origin and the destination of a path, respectively. The yield of a path is the concatenation of the input symbols on the edges along the path, which we will denote by s(π). Furthermore, we denote sets of paths with a capital Π. Throughout the text, we will use a few different variants involving Π to avoid clutter:

• Π(A) is the set of all paths in the automaton A;
• Π(A, y) is the set of all paths in the automaton A with yield y ∈ Σ*;
• Π(A, q, q′) is the set of all paths in the automaton A from state q to state q′.

a Notice that we use the Kleene closure of the set δ here; a path is thus a sequence of transitions from δ.

2 One of the most important questions when talking about weighted formalisms like weighted
3 finite-state automata is how to combine weights of atomic units like transitions into weights of
4 complete structures.4 We begin by multiplicatively combining the weights of individual transitions
5 in a path into the weights of the full path.

Definition 4.1.9: Path Weight

The inner path weight of a path π = q1 -a1/w1-> q2 · · · q_{N−1} -a_{N−1}/w_{N−1}-> q_N is defined as

    w_I(π) = ∏_{n=1}^{N−1} w_n.    (4.2)

The (full) path weight of the path π is then defined as

    w(π) = λ(p(π)) w_I(π) ρ(n(π)).    (4.3)

A path π is called accepting or successful if w(π) ≠ 0.


6

7 The inner path weight is therefore the product of the weights of the transitions on the path,
8 while the (full) path weight is the product of the transition weights as well as the initial and final
9 weights of the origin and the destination of the path, respectively.

10 String Acceptance Weights and Weighted Regular Languages

11 When we introduced unweighted finite-state automata, we defined the important concept of recogniz-
12 ing a string and recognizing a language. We generalize these concepts to the very natural quantity
13 of the weight assigned by a WFSA to a string y P Σ˚ , i.e., its acceptance weight, or stringsum, as
14 the sum of the weights of the paths that yield y.

Definition 4.1.10: Stringsum

The stringsum, string weight, or acceptance weight of a string y ∈ Σ* under a WFSA A is defined as

    A(y) ≝ ∑_{π ∈ Π(A, y)} w(π).    (4.4)

4 In the case of WFSAs, a structure is a path. In the next section, we will see how to combine weights from basic

units into trees.



1 This naturally generalizes the notion of acceptance by an unweighted FSA—whereas an un-


2 weighted FSA only makes a binary decision of accepting or rejecting a string, a weighted FSA always
3 accepts a string with a specific weight. This leads to the definition of the weighted language of the
4 WFSA.

Definition 4.1.11: Weighted language of a weighted finite-state automaton

Let A be a WFSA. Its (weighted) language is defined as

    L(A) ≝ {(y, A(y)) | y ∈ Σ*}.    (4.5)

6 We say a language is a weighted regular language if it is a language of some WFSA:

Definition 4.1.12: Weighted regular language

A weighted language L is a weighted regular language if there exists a WFSA A such that
L “ L pAq.
7

8 Lastly, we also define the full and state-specific allsum of the automaton. The former refers to
9 the total weight assigned to all possible strings, or all possible paths whereas the latter refers to the
10 sum of the path weights of the paths stemming from a specific state.

Definition 4.1.13: State-specific allsum

Let A = (Σ, Q, δ, λ, ρ) be a WFSA. The allsum of a state q ∈ Q is defined as

    Z(A, q) ≝ ∑_{π ∈ Π(A), p(π) = q} w_I(π) ρ(n(π)).    (4.6)

12 State-specific allsums are also referred to as the backward values in the literature and are
13 often denoted as β pqq.

Definition 4.1.14: WFSA allsum

Let A = (Σ, Q, δ, λ, ρ) be a WFSA. The allsum of A is defined as

    Z(A) ≝ ∑_{y ∈ Σ*} A(y) = ∑_{y ∈ Σ*} ∑_{π ∈ Π(A, y)} w(π) = ∑_{π ∈ Π(A)} w(π).    (4.7)

The second equality in Eq. (4.7) comes from the crucial observation that the double sum in the second term sums over precisely all paths of the automaton A, which is where the name allsum comes from.5 This is easy to see if we consider that by summing over all possible strings, we enumerate all possible path yields, and each path in the automaton has a yield in Σ*. Z(A) is again the result of summing over infinitely many terms (whether over the set of strings in Σ* or over the infinitely many paths in a cyclic WFSA), and might therefore not necessarily be finite. For reasons which will become clear shortly, we will say that a WFSA A is normalizable if Z(A) < ∞.

5 Analogously, given some (implicitly defined) set of paths S, we will call the sum over the weights of the paths in S the allsum over S.
3 Note that the sum in Eq. (4.4) only contains one term if the automaton is deterministic. Whenever
4 the automaton is non-deterministic, or when we are interested in the sum of paths with different
5 yields as in Eq. (4.7), the interactions (namely, the distributive law) between the sum over the
6 different paths and the multiplications over the transitions in the paths play an important role when
7 designing efficient algorithms. Indeed, many algorithms defined for WFSAs rely on decompositions
8 of such sums enabled by the distributive law.6

9 Accessibility and Probabilistic Weighted Finite-state Automata

10 An important property of states of a WFSA which we will need when investigating the tightness of
11 finite-state language models is accessibility.

Definition 4.1.15: (Co)-Accessible and useful states

A state q P Q of a WFSA is accessible if there is a non-zero-weighted path to q from some


state qι with λ(qι) ≠ 0; it is co-accessible if there is a non-zero-weighted path from q
to some state qφ with ρ pqφ q ‰ 0. It is useful if it is both accessible and co-accessible, i.e., q
appears on some non-zero-weighted accepting path.
12

Definition 4.1.16: Trim automaton


Trimming a WFSA means removing its useless states.a Removing the non-useful states
Ñ
Ý
means removing their rows and columns from T as well as their rows from λ and Ñ
ρ , yielding
Ý
Ñ
Ý
possibly smaller T , λ and ρ .
1 1 Ñ
Ý 1

a This does not affect the weights of the strings with w pyq ‰ 0.
13

14 We will use WFSAs to specify language models. However, not every WFSA is a language model,
15 i.e., a distribution over strings. Generally, the weight of a string could be negative if we allow
16 arbitrary real weights. Thus, a restriction we will impose on all weighted automata that represent
17 finite-state language models is that the weights be non-negative.
18 Furthermore, a special class of WFSAs that will be of particular interest later is probabilistic
19 WFSAs.

Definition 4.1.17: Probabilistic Weighted Finite-State Automaton

A WFSA A = (Σ, Q, δ, λ, ρ) is probabilistic (a PFSA) if

    ∑_{q ∈ Q} λ(q) = 1    (4.8)

and, for all q ∈ Q and all outgoing transitions q -a/w-> q′ ∈ δ, it holds that

    λ(q) ≥ 0    (4.9)
    ρ(q) ≥ 0    (4.10)
    w ≥ 0    (4.11)

and

    ∑_{q -a/w-> q′} w + ρ(q) = 1.    (4.12)

6 Many such examples are covered in the Advanced Formal Language Theory course.

2 This means that the initial weights of all the states of the automaton form a probability
3 distribution (the initial weight of a state corresponds to the probability of starting in it), as well as
that, for any state q in the WFSA, the weights of its outgoing transitions (with any label) together
5 with its final weight form a valid discrete probability distribution. In a certain way, probabilistic
6 finite-state automata naturally correspond to locally normalized language models, as we explore in
7 the next subsection.

8 The eos symbol and the final weights. Notice that the final weights in a PFSA play an
9 analogous role to the eos symbol: the probability of ending a path in a specific state q—and
10 therefore ending a string—is q’s final weight! That is, the probability ρ pqφ q for some qφ P Q,
11 representing the probability of ending the path in qφ , is analogous to the probability of ending a
12 string y, pSM peos | yq, where qφ “represents” the string (history) y.7 When modeling language with
13 weighted finite-state automata, we will therefore be able to avoid the need to specify the special
14 symbol and rather rely on the final weights, which are naturally part of the framework.

15 4.1.2 Finite-state Language Models


16 We can now formally define what it means for a language model to be finite-state:

Definition 4.1.18: Finite-state language models

A language model pLM is finite-state if it can be represented by a weighted finite-state automa-


ton, i.e., if there exists a WFSA A “ pΣ, Q, δ, λ, ρq such that L pAq “ L ppLM q. Equivalently,
we could say that pLM is finite-state if its language is a weighted regular language.
17

On the other hand, given a WFSA A, there are two established ways of defining the probability of a string.

20 String Probabilities in a Probabilistic Finite-state Automaton


21 In a probabilistic FSA (cf. Definition 4.1.17), any action from a state q P Q is associated with a
22 probability. Since the current state completely encodes all the information of the input seen so far
23 in a finite-state automaton, it is intuitive to see those probabilities as conditional probabilities of
7 Due to the possible non-determinism of WFSAs, the connection is of course not completely straightforward, but

the point still stands.



1 the next symbol given the input seen so far. One can, therefore, define the probability of a path as
2 the product of these individual “conditional” probabilities.

Definition 4.1.19: Path probability in a PFSA

We call the weight of a path π P Π pAq in a probabilistic FSA the probability of the path π.
3

4 This alone is not enough to define the probability of any particular string y P Σ˚ since there
5 might be multiple accepting paths for y. Naturally, we define the probability of y as the sum of the
6 individual paths that recognize it:

Definition 4.1.20: String probability in a PFSA

We call the stringsum of a string y ∈ Σ* in a probabilistic FSA the probability of the string y:

    p_A(y) ≝ A(y).    (4.13)

8 Crucially, notice that these two definitions did not require any normalization over all possible
9 paths or strings. This closely resembles the way we defined locally normalized models based on the
10 conditional probabilities of a sequence model. Again, such definitions of string probabilities are
11 attractive as the summation over all possible strings is avoided. However, a careful reader might
12 then ask themself: do these probabilities actually sum to 1, i.e., is a probabilistic FSA tight? As you
13 might guess, they might not.8 We explore this question in §4.1.4.

14 String Probabilities in a General Weighted Finite-state Automaton


15 To define string probabilities in a general weighted FSA, we use the introduced notions of the
16 stringsum and the allsum. The allsum allows us to tractably normalize the stringsum to define the
17 globally normalized probability of a string y as the proportion of the total weight assigned to all
18 strings that is assigned to y.9

Definition 4.1.21: String probability in a WFSA

Let A = (Σ, Q, δ, λ, ρ) be a normalizable WFSA with non-negative weights. We define the probability of a string y ∈ Σ* under A as

    p_A(y) ≝ A(y) / Z(A).    (4.14)

20 Language Models Induced by a WFSA


21 With the notions of string probabilities in both probabilistic and general weighted FSAs, we can
22 now define the language model induced by A as follows.

8 Notice that, however, whenever a PFSA is tight, its allsum is 1.


9We will see how the allsum can be computed tractably in §4.1.3.

Definition 4.1.22: A language model induced by a WFSA

Let A = (Σ, Q, δ, λ, ρ) be a WFSA. We define the language model induced by A as the following probability distribution over Σ*:

    pLM_A(y) ≝ p_A(y).    (4.15)

2 It is easy to see that while global normalization requires the computation of the allsum, language
3 models induced by weighted FSAs through Eq. (4.14) are globally normalized and thus always tight.
4 In the next subsection, we consider how the quantities needed for computing Eq. (4.14) can be
5 computed. Of particular interest will be the quantity Z pAq, as it involves the summation over
6 possibly infinitely many terms and therefore requires some clever tricks to be computed.

7 4.1.3 Normalizing Finite-state Language Models


8 In this subsection, we develop an algorithm for normalizing a globally normalized language model (cf.
9 Definition 2.4.2) defined by a WFSA, i.e., an algorithm for computing the allsum Z pAq whenever
10 this quantity is finite. Moreover, the derivation will also reveal necessary and sufficient conditions
11 for WFSAs to be normalizable.

Converting a matrix of pairwise pathsums to the allsum. Before we consider how to compute Z(A), let us first consider a much simpler problem. Suppose we had a matrix M which contained at the entry M_ij the sum of the inner weights over all paths between the states i and j, i.e.,

    M_ij = ∑_{π ∈ Π(A, i, j)} w_I(π).

How could we then compute the quantity Z(A)?


    Z(A) = ∑_{π ∈ Π(A)} w(π)    (4.16)
         = ∑_{π ∈ Π(A)} λ(p(π)) w_I(π) ρ(n(π))    (4.17)
         = ∑_{i,j ∈ Q} ∑_{π ∈ Π(A,i,j)} λ(p(π)) w_I(π) ρ(n(π))    (4.18)
         = ∑_{i,j ∈ Q} ∑_{π ∈ Π(A,i,j)} λ(i) w_I(π) ρ(j)    (4.19)
         = ∑_{i,j ∈ Q} λ(i) ( ∑_{π ∈ Π(A,i,j)} w_I(π) ) ρ(j)    (4.20)
         = ∑_{i,j ∈ Q} λ(i) M_ij ρ(j)    (4.21)
         = λᵀ M ρ,    (4.22)

where λ and ρ here denote the vectors resulting from the "vectorization" of the functions λ and ρ, i.e., λ_n = λ(n) and ρ_n = ρ(n). This also explains the naming of the functions λ and ρ: the initial weights function λ, "lambda," appears on the left side of the closed-form expression for Z(A) and of the definition of the path weight (cf. Eq. (4.3)), whereas the final weights function ρ, "rho," appears on the right side of the expression and of the definition of the path weight.

Computing the matrix of pairwise pathsums. Let T be the transition matrix of the automaton A. Notice that the entry T_ij by definition contains the sum of the inner weights of all paths of length exactly 1 (individual transitions) between the states i and j. We also define T⁰ = I, meaning that the sum of the weights of the paths between i and j of length zero is 0 if i ≠ j and 1 (the unit for multiplication) if i = j. This corresponds to not transitioning, i.e., staying in place. We next state a basic result from graph theory.

Lemma 4.1.1

Let T be the transition matrix of some weighted directed graph G. Then the matrix Tᵈ contains the allsums of all paths of length exactly d, i.e.,

    (Tᵈ)_ij = ∑_{π ∈ Π(A,i,j), |π| = d} w_I(π).    (4.23)

11 Proof. By induction on the path length. Left as an exercise for the reader. ■

It follows directly that the matrix

    T^{≤d} ≝ ∑_{k=0}^{d} Tᵏ

contains the pairwise pathsums of paths of length at most d.


In general, the WFSA representing an n-gram language model can of course be cyclic. This means that the number of paths in Π(A) might be infinite and that paths might be of arbitrary length (the result of looping through a cycle arbitrarily many times). To compute the pairwise pathsums over all possible paths, we therefore have to compute

    T* ≝ lim_{d→∞} T^{≤d} = ∑_{d=0}^{∞} Tᵈ.    (4.24)

18 This is exactly the matrix form of the geometric sum. Similarly to the scalar version, we can

manipulate the expression in Eq. (4.24) to arrive at a closed-form expression for computing it:

    T* = ∑_{d=0}^{∞} Tᵈ    (4.25)
       = I + ∑_{d=1}^{∞} Tᵈ    (4.26)
       = I + ∑_{d=1}^{∞} T Tᵈ⁻¹    (4.27)
       = I + T ∑_{d=1}^{∞} Tᵈ⁻¹    (4.28)
       = I + T ∑_{d=0}^{∞} Tᵈ    (4.29)
       = I + T T*.    (4.30)

If the inverse of (I − T) exists, we can further rearrange this equation to arrive at

    T* = I + T T*    (4.31)
    T* − T T* = I    (4.32)
    T* − T* T = I    (4.33)
    T* (I − T) = I    (4.34)
    T* = (I − T)⁻¹.    (4.35)
This means that, if (I − T)⁻¹ exists, we can compute the pairwise pathsums by simply inverting I − T! Using the remark above on how to convert a matrix of pairwise pathsums into the full allsum, we therefore see that we can globally normalize an n-gram language model by computing a matrix inversion. Since the runtime of inverting an N × N matrix is O(N³), and N = |Q| for the transition matrix of a WFSA with states Q, we can globally normalize an n-gram language model in time cubic in the number of its states. This is a special case of the general algorithm by Lehmann (1977). Note, however, that this might still be prohibitively expensive: as we will see, the number of states in an n-gram model grows exponentially with n, and even small n's and reasonable alphabet sizes might result in an intractable number of states in the WFSA given the cubic runtime.
12 We still have to determine when the infinite sum in Eq. (4.24) converges. One can see by writing
13 out the product Td in terms of its eigenvalues that the entries of Td diverge towards ˘8 as soon as
14 the magnitude of any of T’s eigenvalues is larger than 1. This means that ∥T∥2 ă 1 (spectral norm)
15 is a necessary condition for the infinite sum to exist. This is, however, also a sufficient condition: if
16 ∥T∥2 ă 1, all of T’s eigenvalues are smaller than 1 in magnitude, meaning that the eigenvalues of
17 I ´ T are strictly positive and the matrix I ´ T is invertible.10
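The following numpy sketch (our own, not from the text) puts these pieces together for the WFSA of Fig. 4.3: it builds T, checks that the geometric series converges, and computes Z(A) = λᵀ (I − T)⁻¹ ρ. The initial weight 0.3 on state 1 and the final weight 1/e on state 6 are the values read off the figure; all other initial and final weights are taken to be 0.

import numpy as np

# Initial and final weight vectors (states 1..6 as indices 0..5).
lam = np.array([0.3, 0, 0, 0, 0, 0])
rho = np.array([0, 0, 0, 0, 0, 1 / np.e])

# Full transition matrix T, as in Example 4.1.4.
T = np.zeros((6, 6))
T[0, 1], T[0, 2] = 0.5, 1 / np.pi
T[1, 1], T[1, 3] = 0.63, 0.9
T[2, 3], T[2, 4] = 0.21, 0.13
T[3, 5] = 1 / (np.pi * np.e)
T[4, 5] = 0.29

# The geometric series converges whenever the spectral radius of T is < 1.
assert np.max(np.abs(np.linalg.eigvals(T))) < 1
M = np.linalg.inv(np.eye(6) - T)     # matrix of pairwise pathsums T*
Z = lam @ M @ rho                    # allsum Z(A)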

18 Speed-ups of the Allsum Algorithm


The introduced algorithm for computing the allsum in a WFSA can, therefore, be implemented as a matrix inverse. This means that its runtime is O(|Q|³), which can be relatively expensive.
10 1 − λ is an eigenvalue of I − T iff λ is an eigenvalue of T.

1 Fortunately, faster algorithms exist for WFSAs with more structure (in their transition functions)—
2 for example, the allsum can be computed in time linear in the number of transitions if the automaton
3 is acyclic using a variant of the Viterbi algorithm (Eisner, 2016). Furthermore, if the automaton
4 “decomposes” into many smaller strongly connected components (i.e., subgraphs that are cyclic), but
5 the components are connected sparsely and form an acyclic graph of components, the allsum can
6 also be computed more efficiently using a combination of the algorithms described above and the
7 algorithm for acyclic WFSA, resulting in a possibly large speedup over the original algorithm.
8 Importantly, the allsum algorithm and all the speed-ups are differentiable, meaning that they
9 can be used during the gradient-based training (cf. §3.2.3) of a finite-state language model, where
10 the weights are parametrized using some learnable parameters—we will return to this point shortly.

11 Locally Normalizing a Globally Normalized Finite-state Language Model


12 As shown in Theorem 2.4.2, any language model (and thus, any globally-normalized model with a
13 normalizable energy function) can also be locally normalized. In the case of finite-state language
14 models, we can actually explicitly construct the WFSA representing the locally normalized variant
15 using a procedure that is conceptually similar to the allsum algorithm described here. In contrast
16 to the procedure we presented here, however, the local normalization algorithm computes the
17 pathsums of the paths stemming from every possible state q individually and then “reweights” the
18 transitions depending on the pathsums of their target states r. You can think of this as computing
19 the contributions to the entire allsum from q made by all the individual outgoing transitions from q
20 and then normalizing those contributions. This is an instance of the more general weight pushing
21 algorithm.11 This can be summarized by the following theorem:

Theorem 4.1.1: PFSAs and WFSAs are equally expressive

Normalizable weighted finite-state automata with non-negative weights and tight probabilistic
finite-state automata are equally expressive.
22

23 In the proof of this theorem, we will make use of the following lemma.

Lemma 4.1.2

Let A = (Σ, Q, δ, λ, ρ) be a WFSA and q ∈ Q. Then

    Z(A, q) = ∑_{q -a/w-> q′ ∈ δ} ω(q -a/w-> q′) Z(A, q′) + ρ(q).    (4.36)

25 Proof. You are asked to show this in Exercise 4.1. ■

26 We can now prove Theorem 4.1.1

27 Proof. To prove the theorem, we have to show that any WFSA can be written as a PFSA and vice
28 versa.12
11 See Mohri et al. (2008) for a more thorough discussion of weight pushing.
12 By “written as”, we mean that the weighted language is the same.

1 ð Since any tight probabilistic FSA is simply a WFSA with Z pAq “ 1, this holds trivially.
2 ñ Local normalization is a general property of automata resulting from weight pushing. Here,
3 we describe the construction in the special case of working with real-valued weights. See Mohri et al.
4 (2008) for a general treatment.
Let A = (Σ, Q, δ, λ, ρ) be a normalizable, trim WFSA with non-negative weights that encodes a distribution over Σ* via Eq. (4.14). We now show that there exists a PFSA encoding the same language model by constructing a tight probabilistic finite-state automaton A_L = (Σ, Q, δ_AL, λ_AL, ρ_AL) whose language is identical. We define the initial and final weights of the probabilistic FSA as follows:

    λ_AL(q) ≝ λ(q) Z(A, q) / Z(A)    (4.37)

    ρ_AL(q) ≝ ρ(q) / Z(A, q)    (4.38)

We define the transitions of the probabilistic FSA as follows:

    ω_AL(q -a-> q′) ≝ ω(q -a-> q′) Z(A, q′) / Z(A, q).    (4.39)

This means that A_L contains the same transitions as A; they are simply reweighted. Note that the assumption that A is trim means that all the quantities in the denominators are non-zero.
It is easy to see that the weights defined this way are non-negative due to the non-negativity of A's weights. Furthermore, the weights of all outgoing arcs from any q ∈ Q together with its final weight sum to 1:

    ∑_{q -a/w-> q′ ∈ δ_AL} w + ρ_AL(q)    (4.40)
      = ∑_{q -a/w-> q′ ∈ δ_AL} ω(q -a-> q′) Z(A, q′) / Z(A, q) + ρ(q) / Z(A, q)    (definition of δ_AL)    (4.41)
      = (1 / Z(A, q)) ( ∑_{q -a/w-> q′ ∈ δ_AL} ω(q -a-> q′) Z(A, q′) + ρ(q) )    (4.42)
      = 1.    (Lemma 4.1.2)    (4.43)

16 It is also easy to see that the initial weights form a probability distribution over the states of the

constructed automaton:

    ∑_{q ∈ Q} λ_AL(q) = ∑_{q ∈ Q} λ(q) Z(A, q) / Z(A)    (4.44)
      = (1 / Z(A)) ∑_{q ∈ Q} λ(q) Z(A, q)    (4.45)
      = Z(A) / Z(A) = 1.    (4.46)
We now have to show that the probabilities assigned by these two automata match. We will do that by showing that the probabilities assigned to individual paths match, which implies that the stringsums match as well. The probability of a path is defined analogously to the probability of a string, i.e., p_A(π) = w(π) / Z(A) (where Z(A) = 1 for tight probabilistic FSAs). Let then π = (q1 -a1/w1-> q2, . . . , q_{N−1} -a_{N−1}/w_{N−1}-> q_N) ∈ Π(A) = Π(A_L). Then, by the definitions of ω_AL, λ_AL, and ρ_AL,

    p_AL(π) = λ_AL(q1) ( ∏_{n=1}^{N−1} ω_AL(q_n -a_n-> q_{n+1}) ) ρ_AL(q_N)    (4.47)
            = λ(q1) Z(A, q1)/Z(A) · ( ∏_{n=1}^{N−1} ω(q_n -a_n-> q_{n+1}) Z(A, q_{n+1}) / Z(A, q_n) ) · ρ(q_N) / Z(A, q_N).    (4.48)

Notice that the state-specific allsums of all the inner states of the path (all states apart from q1 and q_N) cancel out as the product moves over the transitions of the path. Additionally, the terms Z(A, q1) and Z(A, q_N) cancel with the definitions of λ_AL and ρ_AL. This leaves us with

    p_AL(π) = (1/Z(A)) λ(q1) ( ∏_{n=1}^{N−1} ω(q_n -a_n-> q_{n+1}) ) ρ(q_N) = p_A(π),    (4.49)

finishing the proof. ■


12 While Theorem 2.4.2 shows that any language model can be locally normalized, Theorem 4.1.1
13 shows that in the context of finite-state language models, the locally normalized version of a
14 globally-normalized model is also a finite-state model.
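The construction in the proof of Theorem 4.1.1 can be summarized in a few lines of code. The sketch below is our own (names and signatures are hypothetical); it assumes the input WFSA is trim and normalizable, so that all state-specific allsums are finite and strictly positive, and reweights the initial, final, and transition weights as in Eqs. (4.37)–(4.39).

import numpy as np

def locally_normalize(n, transitions, lam, rho):
    # n: number of states (indexed 0..n-1)
    # transitions: iterable of (q, a, w, q') tuples; lam, rho: numpy vectors
    T = np.zeros((n, n))
    for q, _, w, q_next in transitions:
        T[q, q_next] += w
    # Backward values Z(A, q) = ((I - T)^-1 rho)_q, i.e., the state-specific allsums.
    Z_q = np.linalg.inv(np.eye(n) - T) @ rho
    Z = lam @ Z_q                        # allsum Z(A) of the whole automaton

    lam_new = lam * Z_q / Z                                  # Eq. (4.37)
    rho_new = rho / Z_q                                      # Eq. (4.38)
    trans_new = [(q, a, w * Z_q[q_next] / Z_q[q], q_next)    # Eq. (4.39)
                 for q, a, w, q_next in transitions]
    return trans_new, lam_new, rho_new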

15 Defining a Parametrized Globally Normalized Language Model


Having learned how an arbitrary normalizable finite-state language model can be normalized, we now discuss how models in this framework can be parametrized to enable fitting them to some training data. Crucial for parametrizing a globally normalized model is a score function f_δ,θ : Q × Σ × Q → ℝ, which parametrizes the transitions between the states and thus determines the weights of the (accepting) paths. Additionally, we also parametrize the initial and final functions with f_λ,θ and f_ρ,θ. These parametrized functions then define the automaton A_θ = (Σ, Q, δ_θ, λ_θ, ρ_θ), where δ_θ ≝ { q1 -y/f_δ,θ(q1, y, q2)-> q2 }, λ_θ(qι) ≝ f_λ,θ(qι), and ρ_θ(qφ) ≝ f_ρ,θ(qφ). Note that we can parametrize the function f_θ in any way we want; for example, the function could be a neural network using distributed representations (we will see a similar example at the end of this section), or it could simply be a lookup table of weights. The fact that the function f_θ(q1, y, q2) can only "look at" the identities of the states and the symbol might seem limiting; however, the states alone can encode a lot of information: for example, in the n-gram models we describe below, they will encode the information about the previous n − 1 symbols, and the transitions will then encode the probabilities of transitioning between such sequences of symbols.
8 The globally parametrized model then simply takes in any string y P Σ˚ and computes its
9 stringsum value under the parametrized automaton, which in turn, as per Eq. (4.15), defines
10 probabilities of the strings. The quantity Z pAθ q can be computed with the allsum algorithm
11 discussed in §4.1.3. Importantly, since the algorithms for computing the string probabilities are
12 differentiable, the model defined this way can also be trained with gradient-based learning as
13 described in §3.2.3.
You might notice that this formulation does not exactly match the formulation of globally normalized models from Definition 2.4.2: the function A : Σ* → ℝ does not exactly match the form of an energy function, as its values are not exponentiated as in Eq. (2.11). However, we can tie this back to the definition of globally normalized models by defining an actual energy function as a simple transformation of the stringsum given by A_θ. We can define the globally normalizing energy function p̂GN,A_θ as

    p̂GN,A_θ(y) ≝ − log(A_θ(y)),    (4.50)

which, after exponentiating it as in Eq. (2.11), can easily be seen to result in the same expression as Eq. (4.15). With this, we have formulated finite-state language models as general globally normalized models.
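A minimal sketch of such a parametrized globally normalized model is given below. It is our own illustration (all names and sizes are hypothetical, not the book's formulation): transition scores are stored in a lookup table of free parameters, exponentiated to be non-negative, and rescaled so that the automaton is normalizable; string probabilities are then stringsums divided by the allsum.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_symbols = 3, 2
theta = rng.normal(size=(n_states, n_symbols, n_states))    # f_δ,θ as a table
theta_lam = rng.normal(size=n_states)                       # f_λ,θ
theta_rho = rng.normal(size=n_states)                       # f_ρ,θ

W = np.exp(theta)
W /= W.sum()              # crude rescaling so that the spectral radius of T is < 1
lam, rho = np.exp(theta_lam), np.exp(theta_rho)
T = W.sum(axis=1)         # full transition matrix T = Σ_a T^(a)

def stringsum(y):
    # A_θ(y) = λᵀ T^(y_1) · · · T^(y_T) ρ, summing over all paths with yield y
    v = lam
    for a in y:
        v = v @ W[:, a, :]
    return v @ rho

Z = lam @ np.linalg.inv(np.eye(n_states) - T) @ rho          # allsum Z(A_θ)
p = stringsum([0, 1, 1]) / Z                                 # Eq. (4.14)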
23 Having introduced WFSAs as a formal and abstract computational model which can define a set
24 of weighted strings, we now show how it can be used to explicitly model a particularly simple family
25 of languages. We arrive at this family of language models when we impose a specific assumption on
26 the set of conditional distributions of the language models that ensures that they are finite-state:
27 the n-gram assumption.

28 4.1.4 Tightness of Finite-state Models


29 Any normalizable globally normalized finite-state language model is tight by definition because
30 the sum of the scores over all finite strings is finite, and since they are normalized, they sum to 1.
31 We, therefore, focus on locally normalized finite-state models and provide necessary and sufficient
32 conditions for their tightness. Locally normalized finite-state models are exactly probabilistic WFSAs
33 (Definition 4.1.17). Luckily, the tightness of probabilistic WFSAs can be easily characterized, as the
34 following theorem shows.

Theorem 4.1.2: A sufficient condition for tightness of finite-state language models

A probabilistic FSA is tight if and only if all accessible states are also co-accessible.
35

36 Proof. We prove each direction in turn.

(⇒): Assume the WFSA is tight. Let q ∈ Q be an accessible state, which means that q can be reached after a finite number of steps with positive probability. By the tightness assumption, there must then be a positive-probability path from q to termination, or else the WFSA would not be able to terminate after reaching q, resulting in non-tightness. This means that q is also co-accessible. So, assuming that the WFSA is tight, every accessible state is also co-accessible.

4 (ð): Assume that all accessible states are co-accessible. First, one may consider a Markov chain
5 consisting only of the set of accessible states QA Ď Q, since all other states will have probability 0
6 at every step. Recall a fundamental result in finite-state Markov chain theory which states that, if
7 there exists a unique absorbing state which is reachable from every state, then the Markov process
8 is absorbed by this state with probability 1 (see, e.g., Theorem 11.3 in Grinstead and Snell, 1997).
9 We already have that
10 • eos is an absorbing state, and that
11 • by assumption, every state in QA is co-accessible which implies that they can reach eos.
12 Hence, it remains to show that eos is the unique absorbing state. Suppose there is another state
13 (or group of states) in QA distinct from eos that is absorbing, i.e., cannot leave once entered. Then,
14 these states cannot reach eos by assumption, which means they are not co-accessible, contradicting
15 the assumption that every state in QA is co-accessible. Hence, eos is the only absorbing state in QA
16 and by the property of an absorbing Markov chain, the process is absorbed by eos with probability
17 1. In other words, the WFSA is tight. ■
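The criterion of Theorem 4.1.2 is easy to check algorithmically with two graph searches. The sketch below is our own (hypothetical function names): it computes the accessible and co-accessible state sets over the positively weighted transitions and tests the inclusion.

from collections import defaultdict

def reachable(start, edges):
    # Standard graph reachability from a set of start states.
    seen, stack = set(start), list(start)
    while stack:
        q = stack.pop()
        for q_next in edges[q]:
            if q_next not in seen:
                seen.add(q_next)
                stack.append(q_next)
    return seen

def is_tight(states, transitions, lam, rho):
    # transitions: iterable of (q, a, w, q') tuples; lam, rho: dicts q -> weight.
    fwd, bwd = defaultdict(set), defaultdict(set)
    for q, _, w, q_next in transitions:
        if w > 0:
            fwd[q].add(q_next)
            bwd[q_next].add(q)
    accessible = reachable({q for q in states if lam[q] > 0}, fwd)
    co_accessible = reachable({q for q in states if rho[q] > 0}, bwd)
    # Theorem 4.1.2: the PFSA is tight iff every accessible state is co-accessible.
    return accessible <= co_accessible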
Notice that trimming a PFSA results in a model that satisfies ρ(q) + ∑_{q -a/w-> q′} w ≤ 1, but might no longer achieve equality as required by Definition 4.1.17. We call such models substochastic WFSAs.

Definition 4.1.23: Substochastic Weighted Finite-State Automaton

A WFSA A = (Σ, Q, δ, λ, ρ) is substochastic if for all q ∈ Q and all outgoing transitions q -a/w-> q′ ∈ δ it holds that

    λ(q) ≥ 0    (4.51)
    ρ(q) ≥ 0    (4.52)
    w ≥ 0    (4.53)

and

    ρ(q) + ∑_{q -a/w-> q′} w ≤ 1.    (4.54)

22 We can then express the termination probability of a WFSA in simple linear algebra terms.

Theorem 4.1.3: A sufficient condition for the tightness of a sub-stochastic WFSA

Let T′ be the transition sum matrix of a trimmed substochastic WFSA, and let λ′ and ρ′ be its initial and final weight vectors. Then I − T′ is invertible and p(Σ*) = λ′ᵀ (I − T′)⁻¹ ρ′ ≤ 1.

24 In the following, we will make use of the spectral radius of a matrix.



Definition 4.1.24: Spectral radius

The spectral radius of a matrix M ∈ ℂ^{N×N} with eigenvalues λ1, . . . , λ_N is defined as

    ρ_s(M) ≝ max{|λ1|, . . . , |λ_N|}.    (4.55)
2 To prove Theorem 4.1.3, we will make use of the following useful lemma.

Lemma 4.1.3

Let T′ be the transition sum matrix of a trimmed substochastic WFSA; then ρ_s(T′) < 1.
3

To begin with, we wish to apply the following result, which connects the row sums of a matrix to its spectral radius. Below, M_N denotes the set of N × N matrices, and ∥A∥_∞ = max_{1≤n≤N} ∑_{i=1}^{N} |A_{ni}| denotes the infinity matrix norm.

Proposition 4.1.1: §6.2.P8; Horn and Johnson, 2012

For any A P MN , ρs pAq ď ∥A∥8 . Additionally, if A is irreducible and not all absolute row
sums of A are equal, then ρs pAq ă ∥A∥8 .
7

However, the transition sum matrix T′ of a substochastic WFSA may be reducible, whereas the irreducibility condition in Proposition 4.1.1 cannot be dropped. Hence, we need to "decompose" T′ in a way that recovers irreducibility. We use the Frobenius normal form (also known as the irreducible normal form) to achieve this.

Proposition 4.1.2: §8.3.P8; Horn and Johnson, 2012

Let A ∈ M_N be non-negative. Then either A is irreducible or there exists a permutation matrix P such that

    Pᵀ A P = ⎡A1      *⎤
             ⎢   ⋱     ⎥    (4.56)
             ⎣0      A_K⎦

is block upper triangular, and each diagonal block is irreducible (possibly a 1-by-1 zero matrix). This is called a Frobenius normal form (or irreducible normal form) of A. Additionally, Λ(A) = Λ(A1) ∪ ⋯ ∪ Λ(A_K), where Λ(·) denotes the set of eigenvalues of a matrix.

13 We now proceed to the proof of Lemma 4.1.3.

Proof. Notice that, by way of a similarity transformation via a permutation matrix, the Frobenius normal form amounts to a relabeling of the states in the trimmed WFSA, in the sense that

    (Pᵀ λ′)ᵀ (Pᵀ T′ P)ᵏ (Pᵀ ρ′) = (λ′ᵀ P)(Pᵀ T′ᵏ P)(Pᵀ ρ′)    (4.57a)
                                = λ′ᵀ T′ᵏ ρ′,    (4.57b)

where the equalities follow from the fact that the inverse of a permutation matrix P is its transpose. Hence, with an appropriate relabeling, we may assume without loss of generality that T′ is already in a Frobenius normal form

    T′ = ⎡T′1      *⎤
         ⎢    ⋱     ⎥    (4.58)
         ⎣0      T′_K⎦

where each T′_k is irreducible.


Since the transition sum matrix T′ of a trimmed substochastic WFSA is a substochastic matrix, each T′_k is also substochastic. In fact, each T′_k is strictly substochastic, meaning that at least one of its rows sums to less than 1. To see this, suppose to the contrary that some T′_k is probabilistic. Since the WFSA is trimmed, every state is both accessible and co-accessible. Being accessible implies that there is a positive probability of reaching every state in T′_k. However, the probabilisticity of T′_k forces the corresponding entries of ρ′ to be 0. Hence, none of these states can transition to eos, meaning that they are not co-accessible, contradicting the assumption. Hence, every T′_k is strictly substochastic and has at least one row sum strictly less than 1. Then either all row sums of T′_k are less than 1, or some row sums are 1 and some are less than 1. In either case, Proposition 4.1.1 implies that ρ_s(T′_k) < 1 for all 1 ≤ k ≤ K. Finally, as Proposition 4.1.2 entails, ρ_s(T′) = max{ρ_s(T′1), . . . , ρ_s(T′_K)}, where each ρ_s(T′_k) < 1. Hence, ρ_s(T′) < 1. ■

16 We now use the stated results to finally prove Theorem 4.1.3.

Proof. By Lemma 4.1.3, ρ_s(T′) < 1, in which case I − T′ is invertible and the Neumann series I + T′ + T′² + ⋯ converges to (I − T′)⁻¹ (§5.6, Horn and Johnson, 2012). Hence, we can write (I − T′)⁻¹ = ∑_{k=0}^{∞} T′ᵏ. Then,

    p(Σ*) = ∑_{k=0}^{∞} P(Σᵏ)    (4.59a)
          = ∑_{k=0}^{∞} λ′ᵀ T′ᵏ ρ′    (4.59b)
          = λ′ᵀ ( ∑_{k=0}^{∞} T′ᵏ ) ρ′    (4.59c)
          = λ′ᵀ (I − T′)⁻¹ ρ′.    (4.59d)

■
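As a numerical illustration of Theorem 4.1.3, the sketch below (our own, using a hypothetical two-state substochastic automaton) evaluates the termination probability λ′ᵀ (I − T′)⁻¹ ρ′ and shows that it can be strictly smaller than 1.

import numpy as np

T = np.array([[0.3, 0.5],
              [0.0, 0.6]])           # row sums 0.8 and 0.6 (substochastic)
lam = np.array([1.0, 0.0])           # start in state 0
rho = np.array([0.1, 0.4])           # final weights with rho(q) <= 1 - row sum

assert np.max(np.abs(np.linalg.eigvals(T))) < 1     # Lemma 4.1.3
p_terminate = lam @ np.linalg.inv(np.eye(2) - T) @ rho
# State 0 satisfies rho(0) + sum of outgoing weights = 0.9 < 1, so some
# probability mass is "lost" and p_terminate < 1: the automaton is not tight.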

21 4.1.5 The n-gram Assumption and Subregularity


22 We now turn our attention to one of the first historically significant language modeling frameworks:
23 n-gram models. While they are often taught completely separately from (weighted) finite-state
24 automata, we will see shortly that they are simply a special case of finite-state language models and
25 thus all results for the more general finite-state language models also apply to the specific n-gram
26 models as well.

As we saw in Theorem 2.4.2, we can factorize the language model p_LM for y = y1 . . . y_T ∈ Σ* as

    p_LM(y) = p_LN(y) = p_SM(eos | y) ∏_{t=1}^{T} p_SM(y_t | y_{<t}),    (4.60)

where the conditional distributions p_SM are specified by a locally normalized model (Definition 2.4.5).


3 Recall that SMs specify individual conditional distributions of the next symbol y t given the
4 previous t ´ 1 symbols for all possible t. However, as t grows and the history of seen tokens
5 accumulates, the space of possible histories (sequences of strings to condition on) grows very large
6 (and indeed infinite as t Ñ 8). This makes the task of modeling individual conditional distributions
7 for large t computationally infeasible. One way to make the task more manageable is by using the
8 n-gram assumption.

Assumption 4.1.1: n-gram assumption

In words, the n-gram assumption states that the probability of a word y_t only depends on the n − 1 previous words y_{t−n+1}, . . . , y_{t−1}, where y_0 ≝ bos. Notationally, we can write the n-gram assumption as a conditional independence assumption, i.e.,

    p_SM(y_t | y_{<t}) ≝ p_SM(y_t | y_{t−n+1} ⋯ y_{t−1}) = p_SM(y_t | y_{t−n+1:t−1}).    (4.61)

The sequence y_{t−n+1} ⋯ y_{t−1} is often called the history or the context.

In plain English, this means that the probability of a token only depends on the previous n − 1 tokens. The n-gram assumption is, therefore, an alias for the (n − 1)-th order Markov assumption in the language modeling context.

Handling edge cases by padding. Given our definition in Eq. (4.61), where the conditional probability p_SM(y_t | y_{t−n+1:t−1}) depends on exactly n − 1 previous symbols, we could run into an issue with negative indices for t < n. To handle the edge cases for t < n, we will pad the sequences with bos symbols at the beginning, that is, we will assume that the sequences y1 . . . y_t for t < n − 1 are "transformed" as

    y1 y2 . . . y_t ↦ bos . . . bos y1 y2 . . . y_t  (with n − 1 − t bos symbols).    (4.62)
18 Notice that with such a transformation, we always end up with strings of length n ´ 1, which is
19 exactly what we need for conditioning in an n-gram model. In the following, we will assume that all
20 such sequences are already transformed, but at the same time, we will assume that
    p_SM(y | bos . . . bos y1 y2 . . . y_t) = p_SM(y | y1 y2 . . . y_t)  (with n − 1 − t bos symbols).    (4.63)

By definition, n-gram language models can only model dependencies spanning n tokens or less. By limiting the length of the relevant context when determining p_SM(y_t | y_{<t}) to the previous n − 1 tokens, the n-gram assumption limits the number of possible probability distributions that need to be tracked to O(|Σ|^{n−1}).

1 A particularly simple case of the n-gram model is the bigram model where n “ 2, which
2 means that the probability of the next word only depends on the previous one, i.e., pSM py t | y ăt q “
3 pSM py t | y t´1 q.13

Example 4.1.5: A simple bigram model

Let us look at a specific example of a simple bigram model. Suppose our vocabulary consists
of the words “large”, “language”, and “models”, thus, |Σ| “ 3. To specify the bigram
model, we have to define the conditional probabilities p_SM(y_j | y_i) for y_i ∈ Σ ∪ {bos, eos} and y_j ∈ Σ ∪ {eos} (remember that we do not have to model the probability of the next token being bos). In the case of bigrams, we can represent those in a table, where the entry at position i, j represents the probability p_SM(y_j | y_i):

                 "large"   "language"   "models"   eos
    bos            0.4        0.2          0.2      0.2
    "large"        0.1        0.4          0.2      0.3
    "language"     0.1        0.1          0.4      0.4
    "models"       0.2        0.2          0.1      0.5
    eos            0.4        0.2          0.2      0.2

Under our model, the probability of the sentence “large language models” would be

pSM p“large” | bosq


¨ pSM p“language” | “large”q
¨ pSM p“models” | “language”q
¨ pSM peos | “models”q
“ 0.4 ¨ 0.4 ¨ 0.4 ¨ 0.5 “ 0.032

while the probability of the sentence “large large large” would be

pSM p“large” | bosq


¨ pSM p“large” | “large”q
¨ pSM p“large” | “large”q
¨ pSM peos | “large”q
“ 0.4 ¨ 0.1 ¨ 0.1 ¨ 0.3 “ 0.0012.

Note that the probabilities in the above table are made up and not completely reasonable. A
real n-gram model would not allow for probabilities of exactly 0 to avoid pathological behavior.
4

13What would the uni-gram (n “ 1) model look like? What conditional dependencies between words in a sentence

could be captured by it?
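The bigram model of Example 4.1.5 can be implemented as a simple nested lookup table. The sketch below is our own illustration (names are hypothetical); it reproduces the two sentence probabilities computed in the example.

# Made-up bigram table from Example 4.1.5, as p[context][next_symbol].
p = {
    "BOS":      {"large": 0.4, "language": 0.2, "models": 0.2, "EOS": 0.2},
    "large":    {"large": 0.1, "language": 0.4, "models": 0.2, "EOS": 0.3},
    "language": {"large": 0.1, "language": 0.1, "models": 0.4, "EOS": 0.4},
    "models":   {"large": 0.2, "language": 0.2, "models": 0.1, "EOS": 0.5},
}

def bigram_probability(sentence):
    # p_LN(y) = p(y_1 | BOS) * p(y_2 | y_1) * ... * p(EOS | y_T)
    prob, context = 1.0, "BOS"
    for word in sentence + ["EOS"]:
        prob *= p[context][word]
        context = word
    return prob

assert abs(bigram_probability(["large", "language", "models"]) - 0.032) < 1e-12
assert abs(bigram_probability(["large", "large", "large"]) - 0.0012) < 1e-12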



1 Representing n-gram Models as WFSAs

2 We define n-gram language models as models that only consider a finite amount of context when
3 defining the conditional probabilities of the next token. This means that the set of possible conditional
4 distributions pSM py | yq is also finite which very naturally connects them to weighted finite-state
5 automata—indeed, every n-gram language model is a WFSA—specifically, a probabilistic finite-state
6 automaton (or a substochastic one). We will make this connection more formal in this subsection,
7 thus formally showing that n-gram models are indeed finite-state. Note that this is different from
8 §4.1.3, where we discussed how to parametrize a general WFSA and use it as a globally normalized
9 model—in contrast, in this section, we consider how to fit a (locally normalized) n-gram model into
10 the finite-state framework.
11 The intuition behind the connection is simple: the finite length of the context implies a finite
12 number of histories we have to model. These histories represent the different states the corresponding
13 automaton can reside in at any point. Given any history y with |y| ă n and the state q P Q
14 representing y, then, the conditional distribution of the next token given y dictate the transition
15 weights into the next states in the WFSA, representing the new, updated history of the input.
16 Importantly, since we want PFSAs to represent globally-normalized models, we will also remove
17 the eos symbol from the n-gram model before transforming it into a PFSA—as the remark above
18 about the relationship between the eos symbol and the final states hints, the latter will fill in
19 the role of the eos symbol. The way we do that is the following. From the semantics of the eos
20 symbol discussed in the section on tightness (cf. Eq. (2.44)), we also know that to model the
21 probability distribution over finite strings in Σ˚ , we only require to keep track of strings up to
22 the first occurrence of the eos symbol. Therefore, when converting a given n-gram model to a
23 WFSA, we will only model sequences up to the first occurrence of the special symbol, meaning that
24 eos will never occur in the context of any conditional distribution pSM py | yq. We now detail this
25 construction.
26 Let pLN be a well-defined n-gram language model specified by conditional distributions pSM as
27 defined by ??. We will now construct a WFSA representing pLN . Intuitively, its states will represent
28 all possible sequences of words of length n while the transitions between the states q 1 and q 2 will
29 correspond to the possible transitions between the n-grams which those represent. This means that
30 the only possible (positively weighted) transitions will be between the n-grams which can follow
31 each other, i.e. y t´n:t´1 and y t´n`2:t for some y t´n , y t P Σ (until the first occurrence of eos). The
32 transition’s weight will depend on the probability of observing the “new” word y 0 in the second
33 n-gram given the starting n-gram y ´n y ´pn´1q . . . y ´1 . Further, the final weights of the states will
34 correspond to ending the string in them. In pLN , this is modeled as the probability of observing eos
35 given the context y t´n:t´1 —this, therefore, is set as the final weight of the state representing the
36 history y t´n:t´1 . Formally, we can map a n-gram model into a WFSA A “ pΣA , QA , δ A , λA , ρA q by
37 constructing A as follows.

• The automaton's alphabet:

    Σ_A ≝ Σ.    (4.64)

• The set of states:

    Q_A ≝ ⋃_{t=0}^{n−1} {bos}^{n−1−t} × Σᵗ.    (4.65)

• The transition set:

    δ_A ≝ { y_{t−n+1:t−1} -y_t / p_SM(y_t | y_{t−n+1:t−1})-> y_{t−n+2:t} | y_{t−n+1:t−1} ∈ Q_A, y_t ∈ Σ }.    (4.66)

• The initial function:

    λ_A : ȳ ↦ 1 if ȳ = bos . . . bos (n − 1 times), and 0 otherwise.    (4.67)

• The final function:

    ρ_A : ȳ ↦ p_SM(eos | ȳ),  ȳ ∈ Q_A.    (4.68)

The definition of the state set Q_A captures exactly the notion of padding with the bos symbol for handling the edge cases described above. This shows that n-gram language models are indeed finite-state (we leave the formal proof that L(A) = L(p_LN) to the reader).
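For concreteness, the sketch below (our own, with hypothetical names) instantiates the construction above in the bigram case n = 2, where histories are single symbols and the state set is {bos} ∪ Σ.

from itertools import product

def bigram_to_wfsa(sigma, p_sm):
    # p_sm: dict mapping a history (a state) to a dict over sigma + {"EOS"}.
    states = ["BOS"] + list(sigma)
    lam = {q: 1.0 if q == "BOS" else 0.0 for q in states}          # Eq. (4.67)
    rho = {q: p_sm[q]["EOS"] for q in states}                      # Eq. (4.68)
    delta = [(q, y, p_sm[q][y], y)                                 # Eq. (4.66)
             for q, y in product(states, sigma)]
    return states, delta, lam, rho

# Reusing the example bigram table p from the earlier sketch:
# states, delta, lam, rho = bigram_to_wfsa(["large", "language", "models"], p)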

Defining an n-gram language model through a parametrized WFSA. We now consider how we can use the WFSA framework to define a more "flexible" parametrized globally normalized model. In this case, we do not start from an existing locally normalized set of distributions forming p_SM. Rather, we would like to model the "suitability" of different n-grams following each other; that is, we would like to somehow parametrize the probability that some n-gram ȳ′ will follow an n-gram ȳ without having to worry about normalizing the model at every step. This will allow us to then fit the probability distributions of the model to those in the data, e.g., with the techniques described in §3.2.3. Luckily, the flexibility of the WFSA modeling framework allows us to do exactly that.

15 Subregularity
16 We saw that language models implementing the very natural n-gram assumption can be represented
17 using weighted finite-state automata. However, n-gram models do not “need the full expressive
18 power” of WFSAs—they can actually be modeled using even simpler machines than finite-state
19 automata. This, along with several other examples of simple families of formal languages, motivates
20 the definition of subregular languages.

Definition 4.1.25: Subregular language

A language is subregular if it can be recognized by a finite-state automaton or any weaker


machine.
21

22 Most subregular languages can indeed be recognized by formalisms which are much simpler than
23 FSAs. Many useful and interesting classes of subregular languages have been identified—recently,
24 especially in the field of phonology. Naturally, due to their simpler structure, they also allow for
25 more efficient algorithms—this is why we always strive to represent a language with the simplest
26 formalism that still captures it adequately. See Jäger and Rogers (2012); Avcu et al. (2017) for
27 comprehensive overviews of subregular languages.

Subregular languages actually form multiple hierarchies of complexity within the regular languages. Interestingly, n-gram models fall into the simplest level of one of these hierarchies, directly above the finite languages. This class of subregular languages is characterized by patterns that depend solely on the blocks of symbols that occur consecutively in the string, with each of the blocks considered independently of the others; it is easy to see that n-gram models intuitively fall within such languages. This family of subregular languages is suggestively called the family of strictly local languages.

Definition 4.1.26: Strictly local languages

A language L is strictly n-local (SLn ) if, for every string y of length |y| “ n ´ 1, and all
strings x1 , x2 , z 1 , z 2 P Σ˚ , it holds that if x1 yz 1 P L and x2 yz 2 P L, then also x1 yz 2 P L
(and x2 yz 1 P L).
A language is strictly local (SL) if it is strictly n-local for any n.
8

Note that we could, of course, also define this over the eos-augmented alphabet. You can intuitively think of this definition as postulating that the history more than n symbols back does not matter for determining whether a string is in the language (or its weight, in the weighted case); this is exactly what the n-gram assumption states.

13 4.1.6 Representation-based n-gram Models


14 So far, we have mostly talked about the conditional probabilities and the WFSA weights defining
15 a language model very abstractly. Apart from describing how one can generally parametrize the
16 weights of the underlying WFSA with the scoring function in §4.1.5, we only discussed what values
17 the weights can take for the language model to be well-defined and what implications that has
18 on the distribution defined by the WFSA. In this section, we consider for the first time what an
19 actual implementation of a finite-state, or more precisely, a n-gram language model might look
20 like. Concretely, we will define our first parameterized language model in our General language
21 modeling framework (cf. §3.1) by defining a particular form of the encoding function enc as a simple
22 multi-layer feed-forward neural network.14
However, before we dive into that, let us consider, as an alternative, possibly the simplest way to define a (locally normalized) n-gram language model: by directly parametrizing the probabilities of each of the symbols y in the distribution p_SM(y | ȳ) for every context ȳ, that is,

    θ ≝ { θ_{y|ȳ} ≝ p_SM(y | ȳ) | y ∈ Σ, ȳ ∈ Σ^{n−1}, θ_{y|ȳ} ≥ 0, ∑_{y′∈Σ} θ_{y′|ȳ} = 1 }.    (4.69)

26 The following proposition shows that the maximum likelihood solution (Eq. (3.60)) to this parametriza-
27 tion is what you would probably expect.

14While we introduce particular architectures of neural networks, for example, recurrent neural networks and

transformers later in Chapter 4, we assume some familiarity with neural networks in general. See Chapter 6 of
Goodfellow et al. (2016) for an introduction.

Proposition 4.1.3

The MLE solution of Eq. (4.69) is

    p_SM(y_n | y_{<n}) = C(y1, . . . , y_n) / C(y1, . . . , y_{n−1})    (4.70)

whenever the denominator is > 0, where C(y1, . . . , y_n) denotes the number of occurrences of the string y1 . . . y_n in the training data and C(y1, . . . , y_{n−1}) denotes the number of occurrences of the string y1 . . . y_{n−1}.

Proof. Let D = {y^(1), . . . , y^(M)} be the training dataset. The log-likelihood of a single example y^(m) is

log pLN(y^(m)) = log ( ∏_{t=1}^{|y^(m)|} pSM(y_t^(m) | y_{t−n:t−1}^(m)) )    (4.71)

= ∑_{t=1}^{|y^(m)|} log pSM(y_t^(m) | y_{t−n:t−1}^(m)),    (4.72)

which means that the log-likelihood of the entire dataset is

ℓℓ(D) = ∑_{m=1}^{M} ∑_{t=1}^{|y^(m)|} log pSM(y_t^(m) | y_{t−n:t−1}^(m))    (4.73)

= ∑_{m=1}^{M} ∑_{t=1}^{|y^(m)|} log θ_{y_t^(m) | y_{t−n:t−1}^(m)}.    (4.74)

Exercise 4.2 asks you to show that this can be rewritten, switching from tokens to types, as

ℓℓ(D) = ∑_{y : |y| = n} C(y) log θ_{y_n | y_{<n}}.    (4.75)

The maximum likelihood parameters can then be determined using the Karush–Kuhn–Tucker (KKT) conditions15, which take into account the non-negativity and local normalization constraints:

∇_θ ( ℓℓ(D) − ∑_{y ∈ Σ^{n−1}} λ_y ( ∑_{y′ ∈ Σ} θ_{y′|y} − 1 ) − ∑_{y ∈ Σ^{n−1}} ∑_{y′ ∈ Σ} η_{y,y′} θ_{y′|y} ) = 0.    (4.76)

Recall that the KKT conditions state that θ is an optimal solution of ℓℓ if and only if θ, {λ_y}_{y ∈ Σ^{n−1}}, and {η_{y,y′}}_{y ∈ Σ^{n−1}, y′ ∈ Σ} satisfy Eq. (4.76). Since ℓℓ(D) is simply a sum over the dataset with no interactions between the parameters θ_{y′|y} of individual contexts y with |y| = n − 1, it can be solved for each context y individually.
15 See [Link]

Moreover, as you are asked to show in Exercise 4.3, it holds that

∑_{y′ ∈ Σ} C(y_1 . . . y_{n−1} y′) = C(y_1 . . . y_{n−1})    (4.77)

for any y = y_1 . . . y_{n−1} ∈ Σ^{n−1}. This leaves us with the following system for each context y ∈ Σ^{n−1}:

∑_{y′ ∈ Σ} C(y y′) log θ_{y′|y} − λ_y ( ∑_{y′ ∈ Σ} θ_{y′|y} − 1 ) − ∑_{y′ ∈ Σ} η_{y,y′} θ_{y′|y}.    (4.78)

It is easy to confirm that θ_{y′|y} = C(y y′) / C(y) with λ_y = C(y) and η_{y,y′} = 0 is a saddle point of Eq. (4.76). This means that θ_{y′|y} = C(y y′) / C(y) is indeed the maximum likelihood solution. ■

5 This results in a locally normalized n-gram model. To avoid issues with division-by-zero and
6 assigning 0 probability to unseen sentences, we can employ methods such as smoothing and backoff,
7 which are beyond the scope of the course.16
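To make the MLE solution above concrete, here is a minimal sketch (our own illustration, not code from the text) of a count-based n-gram estimator; the padding symbols and function names are assumptions made for this example.

```python
from collections import Counter

def mle_ngram(corpus, n=2, bos="<s>", eos="</s>"):
    """Estimate the conditional probabilities as ratios of n-gram to (n-1)-gram counts."""
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in corpus:
        padded = [bos] * (n - 1) + sentence + [eos]
        for t in range(n - 1, len(padded)):
            context = tuple(padded[t - n + 1:t])
            ngram_counts[context + (padded[t],)] += 1
            context_counts[context] += 1
    # theta_{y | context} = C(context, y) / C(context), cf. Eq. (4.70)
    return {gram: count / context_counts[gram[:-1]]
            for gram, count in ngram_counts.items()}

corpus = [["we", "adopt", "a", "dog"], ["we", "adopt", "a", "puppy"]]
p = mle_ngram(corpus, n=2)
print(p[("a", "dog")])     # 0.5
print(p[("we", "adopt")])  # 1.0
```

Unseen n-grams simply do not appear in the resulting table, which is exactly the zero-probability problem that smoothing and backoff address.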
8 While this model might seem like an obvious choice, it comes with numerous drawbacks. To see
9 what can go wrong, consider the following example.

Example 4.1.6: n-gram model

Suppose we have a large training corpus that contains sentences like “We are going to the shelter to adopt a dog.”, “We are going to the shelter to adopt a puppy.”, and “We are going to the shelter to adopt a kitten.”, but not the sentence “We are going to the shelter to adopt a cat.” Fitting an n-gram model using the count statistics and individual tables of conditional probabilities pSM py | yq, we would assign the probability

pSM py t “ cat | y ăt “ We are going to the shelter to adopt aq

the value 0 (or some “default” probability if we are using smoothing). However, the words
“dog”, “puppy”, “kitten”, and “cat” are semantically very similar—they all describe pets often
found in shelters. It would therefore be safe to assume that the word “cat” is similarly probable
given the context “We are going to the shelter to adopt a” as the other three words observed
in the training dataset. However, if we estimate all the conditional probabilities independently,
we have no way of using this information—the words have no relationship in the alphabet,
they are simply different indices in a lookup table. Additionally, statistics gathered for the
sentences above will not help us much when encountering very similar sentences, such as
“We went to a nearby shelter and adopted a kitten.” The issue is that there are simply many
ways of expressing similar intentions. We would thus like our language models to be able
to generalize across different surface forms and make use of more “semantic” content of the
sentences and words. However, if the model is parametrized as defined in Eq. (4.69), it is not
able to take advantage of any such relationships.
10

11 The model defined by Eq. (4.69) is therefore unable to take into account the relationships and
12 similarities between words. The general modeling framework defined in §3.1 allows us to remedy
16 See (Chen and Goodman, 1996) and Chapter 4 in (Jurafsky and Martin, 2009).

1 this using the distributed word representations. Recall that, in that framework, we associate
2 each word y with its vector representation epyq (its embedding), and we combine those into the
3 embedding matrix E. Importantly, word embeddings are simply additional parameters of the model
4 and can be fit on the training dataset together with the language modeling objective. One of the
5 first successful applications of encp¨q is due to Bengio et al. (2003), which we discuss next.
6 To be able to use the embeddings in our general framework, we now just have to define the
7 concrete form of the context-encoding function enc. In the case of the neural n-gram model which
8 we consider here and as defined by (Bengio et al., 2003), the representations of the context y ăt ,
9 enc py ăt q, are defined as the output of a neural network which looks at the previous n ´ 1 words in
10 the context:
encpy ăt q “ encpy t´1 , y t´2 , . . . , y t´n`1 q, (4.79)
def

where enc is a neural network we define in more detail shortly. The full language model is therefore defined through the conditional distributions

pSM(y_t | y_{<t}) := softmax( enc(y_{t−1}, y_{t−2}, . . . , y_{t−n+1})⊤ E + b )_{y_t},    (4.80)

resulting in the locally normalized model

pLN(y) = softmax( enc(y_T, y_{T−1}, . . . , y_{T−n+2})⊤ E + b )_{eos}    (4.81)

· ∏_{t=1}^{T} softmax( enc(y_{t−1}, y_{t−2}, . . . , y_{t−n+1})⊤ E + b )_{y_t}    (4.82)

14 for y P Σ˚ .
15 Importantly, notice that although this is a neural model, it is nonetheless still an n-gram model
16 with finite context—Eq. (4.79) is simply a restatement of the n-gram assumption in terms of the
17 neural encoding function enc. It therefore still suffers from some of the limitations of regular n-gram
18 models, such as the inability to model dependencies spanning more than n words. However, it solves
19 the problems encountered in Example 4.1.6 by considering word similarities and sharing parameters
20 across different contexts in the form of an encoding function rather than a lookup table.
While the encoding function enc in Eq. (4.79) could in principle take any form, the original model of Bengio et al. (2003) defines the output for a context y_t , y_{t−1} , . . . , y_{t−n+1} as

enc py t , y t´1 , . . . , y t´n`1 q “ b ` Wx ` U tanh pd ` Hxq , (4.83)


where x := concat(e(y_t), e(y_{t−1}), . . . , e(y_{t−n+1})) denotes the concatenation of the context symbol embeddings into a long vector of size (n − 1) · R, and b, d, W, and U define the parameters of the encoding function. This completes our definition of the model in the general language modeling framework—the model can then simply be trained on the language modeling objective as defined in §3.2.2.
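As a concrete illustration, here is a minimal numpy sketch (our own, not code from Bengio et al. (2003)) of the forward pass in Eqs. (4.80) and (4.83) for a single context; the dimensions, the random initialization, and the separate output bias b_out are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
V, R, H, n = 10, 8, 16, 3               # vocabulary size, embedding dim, hidden dim, n-gram order

# Parameters (randomly initialized here; in practice they are learned, cf. §3.2.2).
E = rng.normal(size=(V, R))             # symbol embeddings e(y), one row per symbol
W = rng.normal(size=((n - 1) * R, R))   # direct (linear) connection
U = rng.normal(size=(H, R))             # hidden-to-output weights
Hmat = rng.normal(size=((n - 1) * R, H))  # input-to-hidden weights
b, d = rng.normal(size=R), rng.normal(size=H)
b_out = rng.normal(size=V)              # assumed output bias over the vocabulary

def enc(context):
    """Eq. (4.83): concatenate the context embeddings and apply the feed-forward network."""
    x = np.concatenate([E[y] for y in context])
    return b + x @ W + np.tanh(d + x @ Hmat) @ U

def p_next(context):
    """Eq. (4.80): softmax over inner products of enc(context) with the symbol embeddings."""
    logits = enc(context) @ E.T + b_out
    z = np.exp(logits - logits.max())
    return z / z.sum()

probs = p_next([4, 7])                  # a context of n - 1 = 2 symbol indices
print(probs.shape, probs.sum())         # (10,) 1.0 (up to floating-point rounding)
```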
We can also see that such a model reduces the number of parameters required to specify an n-gram model: whereas a lookup-table-based n-gram model with no parameter sharing requires
30 O p|Σ|n q parameters to be defined, the number of parameters required by a representation-based
31 n-gram model scales linearly with n—all we have to do is add additional rows to the matrices defined
32 in Eq. (4.83). We will later see how this can be reduced to a constant number of parameters w.r.t.
33 the sequence length in the case of recurrent neural networks in §5.1.2.

Figure 4.4: A pictorial depiction of the n-gram neural language model from the original publication
(Bengio et al., 2003). Note that the quantity C pwq corresponds to epyq for a word y in our notation.

1 Pictorially, we can imagine the model as depicted in Fig. 4.4 (taken from the original publication).
2 This shows that the n-gram modeling framework is not limited to counting co-occurrence statistics.
The model from Eq. (4.79) can also be represented by a WFSA just like the simpler models we discussed above, with the weights on the transitions parametrized by the neural network. This allows us both to understand it using insights from formal language theory and to train it in the flexible way afforded by the non-linear encoding function. However, the model from
7 Eq. (4.79) is still limited to statistics of the last n tokens or less. If we want to model arbitrarily long
8 dependencies and hierarchical structures, we have to leave the space of finite-state languages behind
9 and develop formalisms capable of modeling more complex languages. The next section explores the
10 first of such frameworks: context-free languages with the computational models designed to model
11 them.

1 4.2 Pushdown Language Models


A strong limitation of finite-state language models is that they can definitionally only distinguish a
3 finite set of contexts. However, human language has inherently more structure than what a finite set
4 of contexts can encode. For example, human language contains arbitrarily deep recursive structures
5 which cannot be captured by a finite set of possible histories—we will see an example of this soon in
6 §4.2.1.
7 To be able to model these structures we are climbing a rung higher on the ladder of the hierarchy
8 of formal languages: we are going to consider context-free languages, a larger class of languages
9 than regular languages. Luckily, we will see that a lot of the formal machinery we introduce in
10 this section closely follows analogs from the finite-state section and we invite the reader to pay
11 close attention to the parallels. For example, similarly to how we weighted a string in a regular
12 language by summing over the weights of the paths labeled with that string, we will weight strings
13 in context-free languages by summing over analogous structures.
To be able to recognize context-free languages, we will have to extend finite-state automata from
15 §4.1.1 with an additional data structure—the stack. Finite-state automata augmented with a stack
16 are called pushdown automata. We introduce them in §4.2.7. Before giving a formal treatment of
17 pushdown automata, however, we will discuss an arguably more natural formalism for generating
18 the context-free languages—context-free grammars.17
19 In the last part of the section, we will then further extend the regular pushdown automaton with
20 an additional stack. Interestingly, this will make it more powerful: as we will see, it will raise its
21 expressive power from context-free languages to all computable languages, as it is Turing complete.
22 While this augmentation will not be immediately useful from a language modeling perspective, we
23 will then later use this machine to prove some theoretical properties of other modern language
24 models we consider later in the course.

25 4.2.1 Human Language Is not Finite-state


26 As hinted above, human language contains structures that cannot be modeled by finite-state
27 automata. Before we introduce ways of modeling context-free languages, let us, therefore, first
28 motivate the need for a more expressive formalism by more closely considering a specific phenomenon
29 often found in human language: recursive hierarchical structure. We discuss it through an example,
30 based on Jurafsky and Martin (2009).

Example 4.2.1: Center embeddings

Consider the sentence:

“The cat likes to cuddle.”


It simply describes a preference of a cat. However, we can also extend it to give additional
information about the cat:
“The cat the dog barked at likes to cuddle.”
31

17 You might wonder what non-context-free grammars are: a superclass of context-free grammars is that of context-

sensitive grammars, in which a production rule may be surrounded by a left and right context. They are still however
a set of restricted cases of general grammars, which are grammars that can emulate Turing machines.

This sentence, in turn, can be extended to include additional information about the dog:
“The cat the dog the mouse startled barked at likes to cuddle.”
Of course, we can continue on:

“The cat the dog the mouse the rat frightened startled barked at likes to cuddle.”
and on:
“The cat the dog the mouse the rat the snake scared frightened startled barked at likes to cuddle.”

In theory, we could continue like this for as long as we wanted—all these sentences are
grammatically correct—this is an instance of the so-called center embeddings.
Crucially, such sentences cannot be captured by a regular language, i.e., a language based
on an automaton with finitely many states. While we would need formal machinery beyond
the scope of this course to formally prove this, the intuition is quite simple. By adding more
and more “levels” of recursion to the sentences (by introducing more and more animals in
the chain), we unboundedly increase the amount of information the model has to “remember”
about the initial parts of the sentence while processing it sequentially, to be able to process
or generate the matching terms on the other end of the sentence correctly. Because such
hierarchies can be arbitrarily deep (and thus the sentences arbitrarily long), there is no bound
on the number of states needed to remember them, which means they cannot be captured by
a finite-state automaton.
Note that this example also touches upon the distinction between grammatical competence and
grammatical performance (Chomsky, 1959; Chomsky and Schützenberger, 1963; Chomsky,
1965). The former refers to the purely theoretical properties of human language, for example,
the fact that such hierarchical structures can be arbitrarily long and still grammatically correct.
Grammatical performance, on the other hand, studies language grounded more in the way
people actually use it. For example, nested structures like the one above are never very deep
in day-to-day speech—indeed, you probably struggled to understand the last few sentences
above. We rarely come across nestings of depth more than three in human language (Miller
and Chomsky, 1963; Jin et al., 2018; Karlsson, 2007).
1

2 4.2.2 Context-free Grammars


3 How can we capture recursive structures like those in Example 4.2.1 and the long-term dependencies
4 arising from them? The first formalism modeling such phenomena we will introduce is context-free
5 grammars: a generative formalism which can tell us how to generate or “compose“ strings in the
6 language it describes. Later in the section (§4.2.7), we will introduce the context-free analog of
7 finite-state automata, which will tell us how to recognize whether a string is in a context-free
8 language (rather than generate a string): pushdown automata.

Definition 4.2.1: Context-free Grammar

A context-free grammar (CFG) is a 4-tuple G “ pΣ, N , S, Pq where Σ is an alphabet of


terminal symbols, N is a non-empty set of non-terminal symbols with N X Σ “ H, S P N is
9

the designated start non-terminal symbol and P is the set of production rules, where each
rule p P P is of the form X Ñ α with X P N and α P pN Y Σq˚ .a
a As is the case for initial states in FSAs, multiple start symbols could be possible. However we consider

only one for the sake of simplicity.


1

Example 4.2.2: A simple context-free grammar

Let G “ pΣ, N , S, Pq be defined as follows:

• Σ “ ta, bu
• N “ tXu
• S“X

• P “ tX Ñ aXb, X Ñ εu
This defines a simple context-free grammar. We will return to it later, when we will formally
show that it generates the language L “ tan bn | n P Ně0 u.
2

3 Rule Applications and Derivations

4 Context-free grammars allow us to generate strings y P Σ˚ by applying production rules on its


5 non-terminals. We apply a production rule X Ñ α to X P N in a rule p by taking X on the
6 right-hand side of p and replacing it with α.18

Definition 4.2.2: Rule Application

A production rule Y Ñ β, β P pN Y Σq˚ , is applicable to Y in a rule p, if p takes the form

X Ñ α Y γ, α, γ P pN Y Σq˚ .

The result of applying Y Ñ β to α Y γ is α β γ.


7

8 Starting with S, we apply S Ñ α to S for some pS Ñ αq P P, then take a non-terminal in α and


9 apply a new production rule.19 To generate a string we follow this procedure until all non-terminal
10 symbols have been transformed into terminal symbols. The resulting string, i.e., the yield, will be
the string obtained by concatenating all terminal symbols read from left to right. More formally, a
12 derivation can be defined as follows.

18We say that X is on the right-hand side of a rule p if p takes the form p “ pY Ñ α Xγq, where α, γ P pN Y Σq˚ .

We will sometimes refer to X as the head of the production rule X Ñ α, and the right-hand side α as the body of the
production rule.
19We will write X P α, which formally means a substring of α with length 1. Unless otherwise stated, X can be

either a non-terminal or a terminal.



Definition 4.2.3: Derivation


A derivation in a grammar G is a sequence α1 , . . . , αM , where α1 P N , α2 , . . . , αM ´1 P
pN Y Σq˚ and αM P Σ˚ , in which each αm`1 is formed by applying a production rule in P to
αm .
1

2 We say that α P pN Y Σq˚ is derived from X P N if we can apply a finite sequence of production
˚
3 rules to generate α starting from X. We will denote this as XñG α. See the following formal
4 definition.

Definition 4.2.4: Derives

Let G “ pΣ, N , S, Pq be a CFG. We say that X derives β under the grammar G, denoted
def

as XñG β if Dp P P such that p “ pX Ñ α β γq, α, γ P pN Y Σq˚ and β P pN Y Σq˚ ztεu. The


special case XñG ε holds iff X Ñ ε P P. We denote the reflexive transitive closure of the ñG
˚ ˚
relation as ñG . We say that β is derived from X if XñG β.
5

6 The (context-free) language of a CFG G is defined as all the strings y P Σ˚ that can be derived
7 from the start symbol S of G, or alternatively, the set of all yields possible from derivations in G
8 that start with S. We will denote the language generated by G as LpGq.

Definition 4.2.5: Language of a Grammar

The language of a context-free grammar G is

LpGq “ ty P Σ˚ | S ñ˚G yu    (4.84)
9
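As a small illustration (our own sketch, not part of the text), the following breadth-first enumeration computes the strings of L(G) up to a bounded length for the grammar of Example 4.2.2; the tuple encoding of rules is an assumption made for this example.

```python
from collections import deque

# Grammar of Example 4.2.2: X -> a X b | ε, with start symbol X.
rules = {"X": [("a", "X", "b"), ()]}   # () encodes the empty right-hand side ε
start = "X"

def language_up_to(max_len):
    """Enumerate the strings of L(G) of length <= max_len via BFS over sentential forms."""
    found, queue, seen = set(), deque([(start,)]), {(start,)}
    while queue:
        form = queue.popleft()
        nonterminals = [i for i, s in enumerate(form) if s in rules]
        if not nonterminals:                      # only terminals left: the yield of a derivation
            found.add("".join(form))
            continue
        i = nonterminals[0]                       # expand the leftmost non-terminal
        for rhs in rules[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # For this grammar a sentential form never shrinks and contains at most one
            # non-terminal, so forms longer than max_len + 1 can be pruned.
            if len(new) <= max_len + 1 and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(found, key=len)

print(language_up_to(6))   # ['', 'ab', 'aabb', 'aaabbb']
```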

10 Parse Trees and Derivation Sets

11 A natural representation of a derivation in a context-free grammar is a derivation tree d (also


12 known as a parse tree). A derivation tree represents the sequence of applied rules in a derivation
13 with a directed tree. The tree’s internal nodes correspond to the non-terminals in the derivation,
14 and each of their children corresponds to a symbol (from Σ Y N ) on the right side of the applied
15 production in the derivation. The leaves, representing terminal symbols, “spell out” the derived
16 string—the tree’s yield. More formally, for each production rule X Ñ α, the node corresponding to
17 the specific instance of the non-terminal X in the derivation is connected to the nodes corresponding
18 to Y P α where Y P Σ Y N .
19 We will mostly be interested in representing derivations starting with S—the root node of a tree
20 representing any such derivation will correspond to S. We will denote the string generated by a tree
21 d—its yield—by s pdq. See Fig. 4.5 for examples of parse trees for the grammar from Example 4.2.2.
22

23 Importantly, a grammar may in fact admit multiple derivations and hence multiple derivation
24 trees for any given string.

X X X X

ε a X b a X b a X b

ε a X b a X b

ε a X b

Figure 4.5: A sequence of derivation trees for the strings in tan bn | n “ 0, 1, 2, 3u in the grammar
from Example 4.2.2.

X X

ε Y

Figure 4.6: Two parse trees in the modified grammar G yielding ε.

Example 4.2.3: Multiple derivation strings

It is relatively easy to see that in the grammar G, each string an bn is only generated by a single
derivation tree—each new pair of symbols a and b can only be added by applying the rule
X Ñ aXb and the string an bn can only be generated by the application of the rule X Ñ aXb n
times and the rule X Ñ ε once in this order.
However, we can modify G by adding, for instance, a non-terminal Y and rules X Ñ Y, Y Ñ ε.
The empty string ε may then be derived either by pX Ñ εq, or pX Ñ Yq, pY Ñ εq, corresponding
to two separate derivation trees, as shown in Fig. 4.6. The set of these two trees comprises
what we call the derivation set of ε.
1

2 We denote a derivation set of a string y, generated by the grammar G, as DG pyq.

Definition 4.2.6: String derivation set

Let y P Σ˚ . Its derivation set, denoted by DG pyq is defined as

DG pyq “ td | s pdq “ yu. (4.85)


def

4 We say that a grammar is unambiguous if, for every string that can be generated by the
5 grammar, there is only one associated derivation tree.

Definition 4.2.7: Unambiguity

A grammar G is unambiguous if for all y P LpGq, |DG pyq| “ 1.


1

2 The converse holds for ambiguous grammars.

Definition 4.2.8: Ambiguity

A grammar G is ambiguous if Dy P LpGq such that |DG pyq| ą 1.


3

4 The set of all derivation trees in a grammar is its derivation set.

Definition 4.2.9: Grammar derivation set


The derivation set of a grammar, DG , is the set of all derivations possible under the grammar. More formally, it can be defined as the union of the derivation sets of the strings in its language,

DG := ⋃_{y′ ∈ L(G)} DG(y′).    (4.86)
5

Definition 4.2.10: Non-terminal derivation set

The derivation set of a non-terminal Y P N in G, denoted DG pYq, is defined as the set of


derivation subtrees with root node Y.
6

7 Note that DG could be defined as DG pSq. For a terminal symbol a P Σ, we trivially define the
8 derivation set DG paq to be empty.20
9 In cases where it is irrelevant to consider the order of the production rules in a derivation tree,
10 we will write pX Ñ αq P d to refer to specific production rules in the tree—viewing trees as multisets
11 (or bags) over the production rules they include.

Example 4.2.4: Nominal Phrases

CFGs are often used to model natural languages. Terminals would then correspond to words in
the natural language, strings would be text sequences and non-terminals would be abstractions
over words. As an example, consider a grammar G that can generate a couple of nominal
phrases. We let N “ tAdj, Det, N, Nominal, NPu, Σ “ ta, big, female, giraffe, male, tall, theu,
S “ Nominal and define the following production rules:

Nominal Ñ Det NP
NP Ñ N | Adj NP
Det Ñ a | the
N Ñ female | giraffe | male
Adj Ñ big | female | male | tall
12

20 Empty derivation sets for terminal symbols are defined solely for ease of notation later.

Nominal Nominal Nominal

Det NP Det NP Det NP

a N the Adj NP a Adj NP

giraffe big N tall Adj NP

male female N

giraffe

Figure 4.7: Derivation trees for natural language nominal phrases.

See Fig. 4.7 for a few examples of derivation trees in this grammar.
1

Example 4.2.5: The generalized Dyck languages Dpkq

A very widely studied family of context-free languages are the Dyck-k languages, Dpkq, the
languages of well-nested brackets of k types. They are, in some ways, archetypal context-free
languages (Chomsky and Schützenberger, 1963). Formally, we can define them as follows.

Definition 4.2.11: Dpkq languages

Let k P N. The Dpkq language is the language of the following context-free grammar
G “ pΣ, N , S, Pq
def

• Σ “ txn | n “ 1, . . . , ku Y tyn | n “ 1, . . . , ku
def

• N “ tSu
def

• S“S
def

• P “ tS Ñ ε, S Ñ SSu Y tS Ñ xn Syn | n “ 1, . . . , ku
def

Examples of strings in the language Dp3q would be x3 y3 x2 y2 x1 y1 , x3 y3 x1 x2 x2 y2 y2 y1 , and


x1 x2 x2 y2 y2 x3 x1 y1 y3 y1 . The string x2 x2 y1 y2 is not in the language Dp3q.
2
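As an aside (our own sketch, not part of the text), membership in D(k) can be checked with a single stack, foreshadowing the pushdown automata of §4.2.7; the encoding of each bracket as a pair (type, is_opening) is an assumption made for this example.

```python
def in_dyck(string):
    """Check membership in D(k), where each symbol is a pair (bracket type, is_opening)."""
    stack = []
    for kind, is_opening in string:
        if is_opening:
            stack.append(kind)                  # push the type of the opening bracket x_n
        elif not stack or stack.pop() != kind:  # a closing bracket y_n must match the top of the stack
            return False
    return not stack                            # every opened bracket must have been closed

# Encode x_n as (n, True) and y_n as (n, False).
print(in_dyck([(3, True), (3, False), (2, True), (2, False), (1, True), (1, False)]))  # True
print(in_dyck([(2, True), (2, True), (1, False), (2, False)]))                          # False
```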

3 To give you a taste of what formally working with context-free grammars might look like,
4 we now formally show that the grammar from Example 4.2.2 really generates the language L “
5 tan bn | n P Ně0 u, as we claimed.

Example 4.2.6: Recognizing an bn

The language L “ tan bn | n P Nu is not regular.a However, we can show that it is context-free
and recognized exactly by the simple grammar from Example 4.2.2. We restate it here for
convenience: G “ pΣ, N , S, Pq with N “ tXu, Σ “ ta, bu, S “ X, P “ tX Ñ aXb, X Ñ εu.

Lemma 4.2.1

Given the grammar G defined above, we have L pGq “ tan bn | n P Nu.

Proof. We will show that L “ LpGq in two steps: (i) showing that L Ď LpGq and (ii) showing
that LpGq Ď L. Define y n “ an bn .
(i) We first need to show that each y P L can be generated by G, which we will do by induction.

Base case (n “ 0) We have that y 0 “ ε, which is generated by d “ pX Ñ εq.

Inductive step (n ą 1) We have that y n is generated by

d “ pX Ñ aXbq ¨ ¨ ¨ pX Ñ aXbq pX Ñ εq, with the rule X Ñ aXb applied n times.

It is then easy to see that y n`1 is generated by the derivation we get by replacing the last
rule pX Ñ εq with pX Ñ aXbqpX Ñ εq—they are exactly the trees illustrated in Fig. 4.5.
(ii) Next, we show that for each d P DG , we have that s pdq P L.

Base case (d “ pX Ñ εq) It is trivial to see that the derivation d “ pX Ñ εq yields s pdq “ ε.

Inductive step Now observe that P only contains two production rules and one non-terminal.
Starting with X, we can either apply X Ñ aXb to get one new non-terminal X, or apply X Ñ ε
to terminate the process. Hence, if we fix the length of the sequence of production rules, there
is no ambiguity in which string will be generated. Thus, by induction, we conclude that if
we have a derivation tree given by pX Ñ aXbq, . . . , pX Ñ aXbq, pX Ñ εq, with X Ñ aXb applied n times, generating an bn , the derivation tree given by pX Ñ aXbq, . . . , pX Ñ aXbq, pX Ñ εq, with X Ñ aXb applied n ` 1 times, will generate an`1 bn`1 . ■

a Again, while the intuition behind this is similar to our reasoning from Example 4.2.1, this would have to

be proven using the so-called pumping lemma for regular languages.


1

2 Reachable Non-terminals and Pruning


Similarly to how some states in a WFSA can be useless in the sense that they are not accessible from an initial state or do not lead to a final state, so too can non-terminals in a CFG be useless by not being reachable from the start symbol or by not leading to any string of terminals. In the context of CFGs, we typically use a different terminology: “reachable” instead of “accessible” and “generating” instead of “co-accessible”.

Definition 4.2.12: Accessibility for CFGs

A symbol X P N Y Σ is reachable (or accessible) if Dα, α1 P pN Y Σq˚ such that S ñ˚ αXα1 .

Definition 4.2.13: Co-accessibility for CFGs

A non-terminal Y is generating (or co-accessible) if Dy P Σ˚ such that Y ñ˚ y.
2

3 In words, reachable symbols are those that can be derived from the start symbol, whereas
4 generating non-terminals are those from which at least one string (including the empty string) can
5 be derived. Note that we define reachable for both non-terminals and terminals while generating is
6 only defined for non-terminals.
7 This allows us to define a pruned context-free grammar, which is the CFG version of a trimmed
8 WFSA.

Definition 4.2.14: Pruned CFG

A CFG is pruned (or trimmed) if it has no useless non-terminals, i.e. all non-terminals
are both reachable and generating. Pruning (or trimming) refers to the removal of useless
non-terminals.
9

10 4.2.3 Weighted Context-free Grammars


11 As we did with finite-state automata, we will augment the classic, unweighted context-free grammars
12 with real-valued weights. We do that by associating with each rule X Ñ α a weight WpX Ñ αq P R.

Definition 4.2.15: Weighted Context-free Grammar

A real-weighted context-free grammar is a 5-tuple pΣ, N , S, P, Wq where Σ is an alphabet


of terminal symbols, N is a non-empty set of non-terminal symbols with N X Σ “ H, S P N is
the designated start non-terminal symbol, P is the set of production rules, and W a function
W : P Ñ R, assigning each production rule a real-valued weight.
13

w
14 For notational brevity, we will denote rules p P P as p “ X Ý
Ñ α for X P N , α P pN Y Σq˚ and
15 w “ WpX Ñ αq P R.

Example 4.2.7: A simple weighted context-free grammar

Consider the grammar G “ pΣ, N , S, P, Wq defined as follows:


• Σ “ ta, bu
• N “ tXu

• S“X
16

• P “ tX Ñ aXb, X Ñ εu
• W “ tX Ñ aXb ÞÑ 1{2, X Ñ ε ÞÑ 1{2u

This defines a simple weighting of the CFG from Example 4.2.2.


1

Weights assigned to productions by WCFGs can be arbitrary real numbers. Analogously to


3 probabilistic WFSAs (Definition 4.1.17) describing locally normalized finite-state language models,
4 we also define probabilistic WCFGs, where the weights of applicable production rules to any non-
5 terminal form a probability distribution.

Definition 4.2.16: Probabilistic Context-free grammar

A weighted context-free grammar G “ pΣ, N , S, P, Wq is probabilistic if the weights of the


productions of every non-terminal are non-negative and sum to 1, i.e., for all X P N , it holds
that
@ X Ñ α P P, W pX Ñ αq ě 0 (4.87)
and ÿ
W pX Ñ αq “ 1 (4.88)
XÑαPP
6

7 Intuitively, this means that all the production weights are non-negative and that, for any left
8 side of a production rule X, the weights over all production rules X Ñ α sum to 1. The grammar
9 from Example 4.2.7 is, therefore, also probabilistic.
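A minimal sketch (our own, not part of the text) of this check, representing a WCFG as a mapping from production rules to weights; the tuple encoding of rules is an assumption made for this example.

```python
from collections import defaultdict

def is_probabilistic(weighted_rules, tol=1e-9):
    """Check Definition 4.2.16: non-negative weights that sum to 1 for every left-hand side."""
    totals = defaultdict(float)
    for (lhs, rhs), w in weighted_rules.items():
        if w < 0:
            return False
        totals[lhs] += w
    return all(abs(total - 1.0) <= tol for total in totals.values())

# The WCFG of Example 4.2.7: X -> a X b with weight 1/2 and X -> ε with weight 1/2.
example = {("X", ("a", "X", "b")): 0.5, ("X", ()): 0.5}
print(is_probabilistic(example))  # True
```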
10 Again analogously to the WFSA case, we say that a string y is in the language of WCFG G if
11 there exists a derivation tree d in G containing only non-zero weights with yield s pdq “ y.

12 Tree Weights, String Weights, and Allsums

13 In the case of regular languages, we discussed how individual strings are “produced” by paths
14 in the automaton (in the sense that each path yields a string). As Example 4.2.4 showed, the
15 structures that “produce” or yield strings in a context-free grammar are trees—those, therefore, play
16 an analogous role in context-free grammars to paths in finite-state automata.
17 Just like we asked ourselves how to combine individual transition weights in a WFSA into weights
18 of entire paths and later how to combine those into weights of strings, we now consider the questions
19 of how to combine the weights of individual production rules into the weight of entire trees and
20 later also individual strings. We start by giving a definition of the weight of a tree as the product
21 over the weights of all the rules in the tree, i.e., as a multiplicatively decomposable function over the
22 weights of its rules. As you can probably foresee, we will then define the weight of a string as the
23 sum over all the trees that yield that string.

Definition 4.2.17: Weight of a derivation tree

The weight of a derivation tree d P DG defined by a WCFG G is

w pdq “ ∏_{pXÑαq P d} WpX Ñ αq.    (4.89)
1

2 The stringsum or the string acceptance weight of a particular string under a grammar is then
3 defined as follows:

Definition 4.2.18: Stringsum in a context-free grammar

The stringsum G pyq of a string y generated by a WCFG G is defined by

G pyq “ ∑_{d P DG pyq} w pdq    (4.90)

“ ∑_{d P DG pyq} ∏_{pXÑαq P d} WpX Ñ αq.    (4.91)
4

5 Lastly, analogously to the allsum in WFSAs, an allsum is the sum of the weights of all the trees
6 in a WCFG. We first define the allsum for symbols (non-terminals and terminals).

Definition 4.2.19: Nonterminal allsum in a context-free grammar

The allsum for a non-terminal Y in a grammar G is defined by

Z pG, Yq “ ∑_{d P DG pYq} w pdq    (4.92)

“ ∑_{d P DG pYq} ∏_{pXÑαq P d} WpX Ñ αq.    (4.93)

The allsum for a terminal a P Σ Y tεu is defined to be

Z paq “ 1.    (4.94)

8 The allsum for a grammar is then simply the allsum for its start symbol.

Definition 4.2.20: Allsum in a context-free grammar

The allsum of a weighted context-free grammar G “ pΣ, N , S, P, Wq is

Z pGq “ Z pG, Sq    (4.95)

“ ∑_{d P DG pSq} w pdq    (4.96)

“ ∑_{d P DG pSq} ∏_{pXÑαq P d} WpX Ñ αq.    (4.97)
1

2 When the grammar G we refer to is clear from context, we will drop the subscript and write e.g.
3 ZpSq.
4 Although we can in some cases compute the allsum of a WCFG in closed form, as we will see in
5 the example below, we generally require some efficient algorithm to be able to do so.

Example 4.2.8: Geometric Series as an Allsum

Consider the WCFG G “ pΣ, N , S, P, Wq, given by N “ tXu, Σ “ tau, S “ X, and the rules X Ñ a X with weight 1{3 and X Ñ ε with weight 1.
The language generated by G is LpGq “ tan | n ě 0u. Further note that this grammar is unambiguous—each string y “ am , for some m ě 0, is associated with the single derivation tree given by m applications of pX Ñ a Xq followed by pX Ñ εq. Due to the multiplicative decomposition over the weights of the rules, the weight associated with each such derivation tree d is hence

w pdq “ p1{3qm ¨ 1 “ p1{3qm .

Accordingly, we can compute the allsum of G using the closed-form expression for the geometric series:

Z pGq “ ∑_{m“0}^{∞} p1{3qm “ 1{p1 ´ 1{3q “ 3{2.
6
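As a quick numerical sanity check (our own sketch, not part of the text), we can approximate this allsum by truncating the sum over derivation trees:

```python
# Approximate Z(G) for Example 4.2.8 by summing the weights of all derivation trees
# that apply X -> a X (weight 1/3) at most max_rules times before X -> ε (weight 1).
def truncated_allsum(max_rules, w_rec=1/3, w_stop=1.0):
    return sum((w_rec ** m) * w_stop for m in range(max_rules + 1))

print(truncated_allsum(5))    # 1.4979...
print(truncated_allsum(50))   # ~1.5, approaching Z(G) = 3/2
```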

Just like we defined normalizable WFSAs, we also define normalizable WCFGs in terms of their
8 allsum.

Definition 4.2.21: Normalizable Weighted Context-free Grammar

A weighted context-free grammar G is normalizable if Z pGq is finite, i.e., Z pGq ă 8.


9

1 4.2.4 Context-free Language Models


2 This brings us to the definition of context-free language models.

Definition 4.2.22: Context-free language model

A language model pLM is context-free if its weighted language equals the language of some
weighted context-free grammar, i.e., if there exists a weighted context-free grammar G such
that L pGq “ L ppLM q.
3

4 Going the other way—defining string probabilities given a weighted context-free grammar—there
5 are again two established ways of defining the probability of a string in its language.

6 String Probabilities in a Probabilistic Context-free Grammar


7 In a probabilistic CFG (cf. Definition 4.2.16), any production from a non-terminal X P N is
associated with a probability. As the probabilities of continuing a derivation (and, therefore, a derivation tree) depend solely on the individual non-terminals being expanded (this is the core of context-freeness!), it is intuitive to see those probabilities as conditional probabilities of the new symbols given the output generated so far. One can, therefore, define the probability of a derivation tree as the product of these individual “conditional” probabilities.

Definition 4.2.23: Tree probability in a PCFG

We call the weight of a tree d P DG in a probabilistic CFG the probability of the tree d.
13

14 This alone is not enough to define the probability of any particular string y P Σ˚ since there
15 might be multiple derivations of y. Naturally, we define the probability of y as the sum of the
16 individual trees that generate it:

Definition 4.2.24: String probability in a PCFG

We call the stringsum of a string y P Σ˚ in a probabilistic CFG G the probability of the


string y:
(4.98)
def
pG pyq “ G pyq .
17

18 These definitions and their affordances mirror the ones in probabilistic finite-state automata (cf.
19 §4.1.2): they again do not require any normalization and are therefore attractive as the summation
20 over all possible strings is avoided. Again, the question of tightness of such models comes up: we
explore this question in §4.2.5.

22 String Probabilities in a General Weighted Context-free Grammar


23 To define string probabilities in a general weighted CFG, we use the introduced notions of the
24 stringsum and the allsum—we normalize the stringsum to define the globally normalized probability
25 of a string y as the proportion of the total weight assigned to all strings that is assigned to y.

Definition 4.2.25: String probability in a WCFG

Let G “ pΣ, N , S, P, Wq be a normalizable WCFG with non-negative weights. We define the probability of a string y P Σ˚ under G as

pG pyq “ G pyq / Z pGq.    (4.99)
1

2 Language Models Induced by a Weighted Context-free Grammar

3 With the notions of string probabilities in both probabilistic and general weighted CFGs, we can
4 now define the language model induced by G as follows.

Definition 4.2.26: A language model induced by a WCFG

Let G “ pΣ, N , S, P, Wq be a WCFG. We define the language model induced by G as the following probability distribution over Σ˚ :

pLMG pyq “ pG pyq.    (4.100)
5

6 Again, it is easy to see that while global normalization requires the computation of the allsum,
language models induced by weighted CFGs through Eq. (4.99) are globally normalized and thus
8 always tight. The tightness of probabilistic WCFGs is discussed next, after which we investigate the
9 relationship between globally- and locally-normalized context-free grammars.

10 4.2.5 Tightness of Context-free Language Models


11 Again, an advantage of globally normalized context-free language models (grammars) is that they
12 are always tight, as the derivation trees are explicitly normalized with the global normalization
13 constant such that they sum to 1 over the set of possible sentences.
14 In this section, we, therefore, consider the tightness of probabilistic context-free grammars. We
15 follow the exposition from Booth and Thompson (1973). The proof requires the use of multiple new
16 concepts, which we first introduce below.

Definition 4.2.27: Generation level


We define the level of a generation sequence inductively as follows. The zeroth level γ 0 of
a generation sequence is defined as S. Then, for any n ą 0, γ n corresponds to the string obtained by applying the applicable productions to all nonterminals of γ n´1 .
17

Example 4.2.9: Generation levels

Let G “ pΣ, N , S, Pq with Σ “ ta, bu, N “ tS, X, Yu,


and P “ tS Ñ a X Y, X Ñ Y X, X Ñ bY Y, Y Ñ a a Y, Y Ñ au. Then the generation sequence
18

of the string aabaaaaaaa would be

γ 0 “ S    (definition)
γ 1 “ aXY    (applying S Ñ a X Y)
γ 2 “ aYXaaY    (applying X Ñ Y X, Y Ñ a a Y)
γ 3 “ aabYYaaaaY    (applying Y Ñ a, X Ñ b Y Y, Y Ñ a a Y)
γ 4 “ aabaaaaaaa    (applying Y Ñ a, Y Ñ a, Y Ñ a)
1

2 We will also rely heavily on generating functions. A generating function is simply a way of
3 representing an infinite sequence by encoding its elements as the coefficients of a formal power series.
4 Unlike ordinary series such as the geometric power series from Example 4.2.8, a formal power series
5 does not need to converge: in fact, at its core a generating function is not actually regarded as a
6 function—its “variables” are indeterminate and they simply serve as “hooks” for the numbers in the
7 sequence.

Definition 4.2.28: Production generating function

Let G “ pΣ, N , S, P, Wq be a PCFG and N “ |N |. For each Xn P N , define its production generating function as

g_n(s_1, . . . , s_N) := ∑_{X_n → α} W(X_n → α) s_1^{r_1(α)} s_2^{r_2(α)} · · · s_N^{r_N(α)},    (4.101)

where r_m(α) denotes the number of times the nonterminal X_m P N appears in α P pΣ Y N q˚ .

Example 4.2.10: Tightness of a context-free grammar

Let G “ pΣ, N , S, Pq with Σ “ ta, bu, N “ tS, Xu, and P “


tS Ñ a S X, S Ñ b, X Ñ aX X, X Ñ aau. Then

g 1 ps1 , s2 q “ W pS Ñ aSXq s1 s2 ` W pS Ñ bq
g 2 ps1 , s2 q “ W pX Ñ aXXq s22 ` W pX Ñ aaq
9

Definition 4.2.29: Generating function

The generating function of the lth level is defined as

G_0(s_1, . . . , s_N) := s_1    (4.102)
G_1(s_1, . . . , s_N) := g_1(s_1, . . . , s_N)    (4.103)
G_l(s_1, . . . , s_N) := G_{l−1}(g_1(s_1, . . . , s_N), . . . , g_N(s_1, . . . , s_N)),    (4.104)

that is, the lth-level generating function is defined as the (l − 1)st-level generating function applied to the production generating functions as arguments.
10

Example 4.2.11: Tightness of a context-free grammar

For the grammar from Example 4.2.10, we have

G0 ps1 , s2 q “ s1
G1 ps1 , s2 q “ g ps1 , s2 q “ W pS Ñ aSXq s1 s2 ` W pS Ñ bq
G2 ps1 , s2 q “ W pS Ñ aSXq rg 1 ps1 , s2 qs rg 2 ps1 , s2 qs ` W pS Ñ bq
2
“ W pS Ñ aSXq W pX Ñ aXXq s1 s32
2
` W pS Ñ aSXq W pX Ñ aaq s1 s2
` W pS Ñ aSXq W pS Ñ bq W pX Ñ aXXq s22
` W pS Ñ aSXq W pS Ñ bq W pX Ñ aaq
` W pS Ñ bq
1

2 We can see that a generating function Gl ps1 , . . . , sN q can be expressed as

Gl ps1 , . . . , sN q “ Dl ps1 , . . . , sN q ` Cl (4.105)

3 where the polynomial Dl ps1 , . . . , sN q does not contain any constant terms. It is easy to see that the
4 constant Cl then corresponds to the probability of all strings that can be derived in l levels or fewer.
5 This brings us to the following simple lemma.

Lemma 4.2.2
A PCFG is tight if and only if
lim Cl “ 1. (4.106)
lÑ8
6

7 Proof. Suppose that limlÑ8 Cl ă 1. This means that the generation process can enter a generation
8 sequence that has a non-zero probability of not terminating—this corresponds exactly to it not
9 being tight.
10 On the other hand, limlÑ8 Cl “ 1 implies that no such sequence exists, since the limit represents
11 the probability of all strings that can be generated by derivations of a finite number of production
12 rules. ■

13 The rest of the section considers necessary and sufficient conditions for Eq. (4.106) to hold. For
14 this, we first define the first-moment matrix of a PCFG.

Definition 4.2.30: First-moment matrix

Let G “ pΣ, N , S, P, Wq be a PCFG. We define its first-moment matrix (its mean matrix) E P R^{N×N} as

E_{nm} := ∂g_n(s_1, . . . , s_N) / ∂s_m |_{s_1 = ··· = s_N = 1}.    (4.107)
15

Note that E_{nm} represents the expected number of occurrences of the non-terminal X_m in the sequences α with X_n ñG α, i.e., the sequences that X_n can be rewritten into:

E_{nm} = ∑_{X_n → α} W(X_n → α) r_m(α).    (4.108)

The informal intuition behind this is the following: each of the terms W(X_n → α) s_1^{r_1(α)} s_2^{r_2(α)} · · · s_N^{r_N(α)} in g_n contains the information about how many times any non-terminal X_m appears in the production rule X_n → α, as well as the probability of “using” or applying that production rule to X_n. Differentiating W(X_n → α) s_1^{r_1(α)} s_2^{r_2(α)} · · · s_N^{r_N(α)} with respect to s_m then “moves” the exponent r_m (the number of occurrences of X_m in X_n → α) in front of the term in g_n, effectively multiplying the probability of the rule with the number of occurrences of X_m in the rule; this is exactly the expected number of occurrences of X_m contributed by this particular rule. Summing over all applicable production rules for X_n (which form a probability distribution) gives us the total expected number of occurrences of X_m. This brings us to the core theorem of this section, characterizing the tightness of PCFGs.

Theorem 4.2.1: A sufficient condition for the tightness of probabilistic context-free


grammars

A PCFG is tight if |λmax | ă 1 and is non-tight if |λmax | ą 1, where λmax is the eigenvalue of
E with the largest absolute value.
14

Proof. The coefficient of the term s_1^{r_1} s_2^{r_2} · · · s_N^{r_N} in the generating function G_l(s_1, . . . , s_N) corresponds to the probability that there will be r_1 non-terminal symbols X_1, . . . , r_N non-terminal symbols X_N in the lth level of the generation sequence. In particular, if the grammar is tight, this means that

lim_{l→∞} G_l(s_1, . . . , s_N) = lim_{l→∞} [ D_l(s_1, . . . , s_N) + C_l ] = 1.    (4.109)

18 This, however, is only true if


lim_{l→∞} D_l(s_1, . . . , s_N) = 0    (4.110)

and this, in turn, can only be true if lim_{l→∞} r_n = 0 for all n = 1, . . . , N. The expected value of r_n at level l is

r_{l,n} = ∂G_l(s_1, . . . , s_N) / ∂s_n |_{s_1 = ··· = s_N = 1}.    (4.111)

Reasoning about this is similar to the intuition behind the first-moment matrix, with the difference that we are now considering the number of occurrences after a sequence of l applications. Denoting

r_l := [ r_{l,1}, . . . , r_{l,N} ],    (4.112)

we have

r_l = [ ∑_{j=1}^{N} ∂G_{l−1}(g_1(s_1, . . . , s_N), . . . , g_N(s_1, . . . , s_N)) / ∂g_j    (4.113)

· ∂g_j(s_1, . . . , s_N) / ∂s_n | n = 1, . . . , N ] |_{s_1 = ··· = s_N = 1}    (4.114)

= r_{l−1} E.    (4.115)

Applying this relationship repeatedly, we get

r_l = r_0 E^l = [1, 0, . . . , 0] E^l,    (4.116)

meaning that

lim_{l→∞} r_l = 0 iff lim_{l→∞} E^l = 0.    (4.117)

4 The matrix E satisfies this condition if |λmax | ă 1. On the other hand, if |λmax | ą 1, the limit
5 diverges. ■

6 Note that the theorem does not say anything about the case when |λmax | “ 1.
7 We conclude the subsection by noting that, interestingly, weighted context-free grammars trained
8 on data with maximum likelihood are always tight (Chi and Geman, 1998; Chi, 1999). This is not
9 the case for some models we consider later, e.g., recurrent neural networks (cf. §5.1.2).
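As an illustration of Theorem 4.2.1 (our own sketch; the concrete rule probabilities p and q are arbitrary assumptions), we can compute the mean matrix E for the two-non-terminal grammar of Example 4.2.10 and inspect its spectral radius:

```python
import numpy as np

# Grammar of Example 4.2.10 with assumed rule probabilities:
#   S -> a S X (prob. p), S -> b (1 - p);  X -> a X X (prob. q), X -> aa (1 - q).
def mean_matrix(p, q):
    # E[n, m] = expected number of occurrences of non-terminal m when rewriting
    # non-terminal n once (Eq. (4.108)); non-terminal order: [S, X].
    return np.array([[p, p],           # S -> aSX contributes one S and one X
                     [0.0, 2 * q]])    # X -> aXX contributes two X

for p, q in [(0.5, 0.4), (0.5, 0.6)]:
    rho = max(abs(np.linalg.eigvals(mean_matrix(p, q))))
    verdict = "tight" if rho < 1 else ("not tight" if rho > 1 else "inconclusive")
    print(f"p={p}, q={q}: spectral radius {rho:.2f} -> {verdict}")
# p=0.5, q=0.4: spectral radius 0.80 -> tight
# p=0.5, q=0.6: spectral radius 1.20 -> not tight
```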

10 4.2.6 Normalizing Weighted Context-free Grammars


Having investigated probabilistic context-free grammars in terms of their tightness, we now turn our attention to general weighted context-free grammars, which define string probabilities using global normalization (cf. Eq. (4.99)). To be able to compute these probabilities, we require a way to compute
14 the normalizing constant Z pGq and the stringsum G pyq. In the section on finite-state automata, we
15 explicitly presented an algorithm for computing the normalizing constant Z pAq. The derivation of
16 a general allsum algorithm for weighted context-free grammars, on the other hand, is more involved
17 and beyond the scope of this course.21 Here, we simply assert that there are ways of computing the
18 quantities in Eq. (4.99) and only consider the following result:

Theorem 4.2.2: PCFGs and WCFGs are equally expressive (Smith and Johnson,
2007)

Normalizable weighted context-free grammars with non-negative weights and tight probabilistic
context-free grammars are equally expressive.
19

20 Proof. To prove the theorem, we have to show that any WCFG can be written as a PCFG and vice
21 versa.22
21 The allsums of individual non-terminals can be expressed as solutions to a nonlinear set of equations. Again, the

interested reader should have a look at the Advanced Formal Language Theory course.
22 Again, by “written as”, we mean that the weighted language is the same.

1 ð Since any tight probabilistic context-free grammar is simply a WCFG with Z pGq “ 1, this
2 holds trivially.
3 ñ We now show that, for any WCFG, there exists a PCFG encoding the same language model.
4 Let G G “ pΣ, N , S, P, Wq be a pruned WCFG that encodes a distribution over Σ˚ using Eq. (4.99).
5 We now construct a tight probabilistic context-free grammar G L “ pΣ, N , S, P, W L q whose language
6 is identical. Notice that all components of the grammar remain identical apart from the weighting
7 function. This means that the derivations of the strings in the grammars remain the same (i.e.,
D_{G_G} = D_{G_L})—only the weights of the derivations change, as we detail next. We define the production weights of the probabilistic CFG as follows:

W_{G_L}(X → α) := ( W(X → α) ∏_{Y ∈ α} Z(G, Y) ) / Z(G, X).    (4.118)
10 Remember that Z paq “ 1 for a P Σ. Note that the assumption that G is pruned means that all the
11 quantities in the denominators are non-zero.
It is easy to see that the weights defined this way are non-negative due to the non-negativity of G’s weights. Furthermore, the weights of all production rules for any non-terminal X P N sum to 1, as by the definitions of W_{G_L} and Z(G, X) we have

∑_{X→α} W_{G_L}(X → α) = ∑_{X→α} ( W(X → α) ∏_{Y ∈ α} Z(G, Y) ) / Z(G, X)    (4.119)

= (1 / Z(G, X)) ∑_{X→α} W(X → α) ∏_{Y ∈ α} Z(G, Y)    (4.120)

= (1 / Z(G, X)) Z(G, X)    (4.121)

= 1.    (4.122)
We now have to show that the probabilities assigned by these two grammars match. We will do that by showing that the probabilities assigned to individual derivations match, implying that the stringsums match as well. The probability of a derivation is defined analogously to the probability of a string, i.e., p_G(d) = w(d) / Z(G) (where Z(G) = 1 for tight probabilistic grammars). Let then d ∈ D_{G_G} = D_{G_L}. Then

p_{G_L}(d) = ∏_{X→α ∈ d} W_{G_L}(X → α)    (4.123)

= ∏_{X→α ∈ d} ( W(X → α) ∏_{Y ∈ α} Z(G, Y) ) / Z(G, X)    (definition of W_{G_L}).    (4.124)

20 Notice that by multiplying over the internal nodes of the derivation tree, Eq. (4.123) includes the
21 non-terminal allsum of each internal (non-root and non-leaf) non-terminal in the derivation twice:
22 once as a parent of a production in the denominator, and once as a child in the numerator. These
23 terms, therefore, all cancel out in the product. The only terms which are left are the allsums of the
24 leaf nodes—the terminals—which are 1, and the allsum of the root node—S—which equals Z pG G q
25 and the weights of the individual productions, which multiply into the weight assigned to d by the
26 original grammar G G . This means that
p_{G_L}(d) = (1 / Z(G, S)) ∏_{X→α ∈ d} W(X → α) = (1 / Z(G_G)) w(d) = p_{G_G}(d),    (4.125)

1 finishing the proof. ■


2 This means that the classes of probabilistic and weighted context-free grammars are in fact equally
3 expressive. In other words, this result is analogous to Theorem 4.1.1 in WFSAs: it shows that in
4 the context of context-free language models, the locally normalized version of a globally-normalized
5 model is also context-free.
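As a minimal numeric sketch (our own) of Eq. (4.118), consider the single-non-terminal grammar of Example 4.2.8, whose allsum Z(G, X) = 3/2 we computed in closed form above; the dictionary encoding of rules is an assumption made for this example.

```python
from math import prod

# The WCFG of Example 4.2.8: X -> a X with weight 1/3 and X -> ε with weight 1.
weights = {("X", ("a", "X")): 1/3, ("X", ()): 1.0}

# Allsums: Z(G, X) = 3/2 (computed in Example 4.2.8) and Z(a) = 1 for the terminal.
Z = {"X": 1.5, "a": 1.0}

# Eq. (4.118): W_L(X -> α) = W(X -> α) · Π_{Y in α} Z(G, Y) / Z(G, X).
local = {(lhs, rhs): w * prod(Z[s] for s in rhs) / Z[lhs]
         for (lhs, rhs), w in weights.items()}

print(local)                             # {('X', ('a', 'X')): 0.333..., ('X', ()): 0.666...}
print(round(sum(local.values()), 10))    # 1.0: the renormalized grammar is probabilistic
```

The renormalized rule probabilities 1/3 and 2/3 define the same distribution over derivation trees as the original globally normalized grammar, as the proof above shows.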

6 4.2.7 Pushdown Automata


7 We presented context-free grammars as a formalism for specifying and representing context-free
8 languages. Many algorithms for processing context-free languages, for example, the allsum algorithms
9 and their generalizations, can also be directly applied to context-free grammars. However, it is also
10 convenient to talk about processing context-free languages in terms of computational models in the
11 form of automata, i.e., the recognizer of the language.23 As we mentioned, the types of automata
12 we considered so far, (weighted) finite-state automata, can only recognize regular languages. To
13 recognize context-free languages, we must therefore extend finite-state automata.24 We do that by
14 introducing pushdown automata (PDA), a more general and more expressive type of automata.

15 Single-stack Pushdown Automata


16 Pushdown automata augment finite-state automata by implementing an additional stack memory
17 structure for storing arbitrarily long strings from a designated alphabet, which allows them to work
18 with unbounded memory effectively. Abstractly, this unbounded memory is the only difference to
19 finite-state automata. However, the definition looks slightly different:

Definition 4.2.31: Pushdown automaton

A pushdown automaton (PDA) is a tuple P “ pΣ, Q, Γ, δ, pqι , γ ι q, pqφ , γ φ qq, where:


def

• Q is a finite set of states;


• Σ is a finite set of input symbols called the input alphabet;

• Γ is a finite set of stack symbols called the stack alphabet;


• δ Ď Q ˆ Γ˚ ˆ pΣ Y tεuq ˆ Q ˆ Γ˚ is a multiset representing the transition function;
• pqι , γ ι q is called the initial configuration and pqφ , γ φ q is called the final configuration,
where qι , qφ P Q and γ ι , γ φ P Γ˚ .
20

The initial and final configurations in pushdown automata play analogous roles to the sets of initial and final states of finite-state automata. Compared to the latter, they also allow for different starting
23 configurations of the stack coupled with each possible initial or final state.
23 This relationship between a formalism specifying how to generate (i.e., a grammar) and a model of recognizing a

language can be seen in multiple levels of the hierarchy of formal languages. In the case of context-free languages, the
former are context-free grammars, while the latter are pushdown automata discussed in this subsection. Regular
languages as introduced in the previous section, however, are simply defined in terms of their recognizers—finite-state
automata.
24 Formally, we would of course have to prove that finite-state automata cannot model context-free languages. This

can be done with the so-called pumping lemma, which are outside the scope of this class.

1 Stacks are represented as strings over Γ, from bottom to top. Thus, in the stack γ “ X1 X2 ¨ ¨ ¨ Xn ,
the symbol X1 is at the bottom of the stack, while Xn is at the top. γ “ H denotes the empty stack.

Definition 4.2.32: Configuration of a pushdown automaton

A configuration of a PDA is a pair pq, γq, where q P Q is the current state and γ P Γ˚ is the
current contents of the stack.
4

5 The initial and final configurations of a PDA are examples of configurations; it is possible to
6 generalize the initial and final stacks to (say) regular expressions over Γ, but the above definition
7 suffices for our purposes.
A PDA moves from configuration to configuration by following transitions of the form q ÝÑ r with label a, γ 1 Ñ γ 2 ,
9 which represents a move from the state q to state r, while popping the sequence of symbols γ 1 P Γ˚
10 from the top of the stack and pushing the sequence γ 2 P Γ˚ . The PDA transition function therefore
11 not only depends on the current state q and input symbol a, but also on some finite sequence
12 of symbols on the top of the stack. The stack hence determines the behavior of the automaton,
13 and since the set of possible configurations of the stack is infinite, the set of configurations of the
14 automaton is infinite, in contrast to finite-state automata.
15 To describe how pushdown automata process strings, we introduce the concepts of scanning and
16 runs.

Definition 4.2.33: Scanning

We say that τ “ pp, γ 1 , a, q, γ 2 q P δ scans a, and if a ‰ ε, we call τ scanning; otherwise, we


call it non-scanning.
17

Definition 4.2.34: Pushdown automaton transitions


a,γ Ñγ
If pq 1 , γγ 1 q and pq 2 , γγ 2 q are configurations, and τ is a transition q 1 ÝÝÝ1ÝÝÝÑ
2
q 2 , we write
pq 1 , γγ 1 q ñτ pq 2 , γγ 2 q.
18

19 Since the behavior of a pushdown automaton does not only depend on the states encountered by
20 it but also on the content of the stack, we generalize the notion of a path to include the configuration
21 of the automaton. This is called a run.

Definition 4.2.35: Run of a pushdown automaton

A run of a PDA P is a sequence of configurations and transitions


π “ pq 0 , γ 0 q, τ 1 , pq 1 , γ 1 q, . . . , τ n , pq N , γ N q
where, for n “ 1, . . . , N , we have pq n´1 , γ n´1 q ñτ n pq n , γ n q.a A run is called accepting if
pq 0 , γ 0 q is the initial configuration and pq N , γ N q is the final configuration. If, for n “ 1, . . . , N ,
τ n scans an , then we say that π scans the string a1 ¨ ¨ ¨ aN . We write Π pP, yq for the set of
runs that scan y and Π pPq for the set of all accepting runs of P.
a Sometimes it will be convenient to treat π as a sequence of only configurations or only transitions.
22

Definition 4.2.36: Recognition of a string by a pushdown automaton

We say that the PDA P recognizes the string y if Π pP, yq ‰ H, i.e., if there exists an
accepting run with the yield y. The set of all strings recognized by P is the language
recognized by P, which we denote by L pPq, i.e.,

L pPq “ ty | Π pP, yq ‰ Hu . (4.126)


def

Example 4.2.12: Example of a pushdown automaton

Fig. 4.8 shows an example of a pushdown automaton P accepting the language L pPq “ tan bn | n P Nu. The sequence of transitions 1 ÝÑ 1 with label a, ε Ñ X followed by 1 ÝÑ 2 with label ε, ε Ñ ε is a run of P; the sequence 1 ÝÑ 1 (a, ε Ñ X), 1 ÝÑ 2 (ε, ε Ñ ε), 2 ÝÑ 2 (b, X Ñ ε) is an accepting run of P.
2

Figure 4.8: The PDA that accepts the language tan bn | n P Nu. It has two states, 1 and 2, a self-loop on state 1 with label a, ε Ñ X, a transition from 1 to 2 with label ε, ε Ñ ε, and a self-loop on state 2 with label b, X Ñ ε.
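A minimal sketch (our own, not part of the text) of simulating this PDA on an input string via breadth-first search over configurations; the tuple encoding of transitions is an assumption made for this example.

```python
from collections import deque

# Transitions of the PDA in Fig. 4.8: (source state, scanned symbol or "", pop, push, target state).
transitions = [
    (1, "a", "", "X", 1),   # self-loop on 1: scan a, push X
    (1, "",  "", "",  2),   # 1 -> 2: scan nothing, leave the stack unchanged
    (2, "b", "X", "",  2),  # self-loop on 2: scan b, pop X
]
initial, final = (1, ""), (2, "")     # (state, stack), stacks written from bottom to top

def recognizes(string):
    """Return True iff some accepting run of the PDA scans the string."""
    queue, seen = deque([(initial[0], initial[1], 0)]), set()
    while queue:
        state, stack, pos = queue.popleft()
        if (state, stack) == final and pos == len(string):
            return True
        for src, sym, pop, push, dst in transitions:
            if src != state or not stack.endswith(pop):
                continue
            if sym and (pos >= len(string) or string[pos] != sym):
                continue
            nxt = (dst, stack[:len(stack) - len(pop)] + push, pos + len(sym))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(recognizes("aabb"))  # True
print(recognizes("aab"))   # False
```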

3 Lastly, we define deterministic pushdown automata, analogously to their finite-state version


4 (Definition 4.1.3). Recall that in the case of finite-state automata, a deterministic machine has at
most one possible next move for each state. Similarly, a deterministic pushdown automaton has at
6 most one possible next move for each configuration.

Definition 4.2.37: Deterministic pushdown automaton

A PDA P “ pΣ, Q, Γ, δ, pqι , γ ι q, pqφ , γ φ qq is deterministic if

• there are no transitions of the type pq, ε, γ, p, γq;


• for every pq, a, γq P Q ˆ Σ Y tεu ˆ Γ˚ , there is at most one transition pq, a, γ, p, γ 1 q P δ;
• if there is a transition pq, a, γ, p, γ 1 q P δ for some a P Σ, then there is no transition
pq, ε, γ, p, γ 2 q P δ.

Otherwise, P is non-deterministic.
7

Importantly, not all context-free languages can be recognized by deterministic pushdown automata. That is, in contrast to finite-state automata, where deterministic machines are just as powerful as non-deterministic ones (at least in the unweighted case—interestingly, some weighted non-deterministic FSAs cannot be determinized), non-deterministic pushdown automata are more expressive than deterministic ones. Specifically, as stated in Theorem 4.2.3, non-deterministic pushdown automata recognize exactly the context-free languages, while deterministic pushdown automata recognize only a subset of them (Sipser, 2013).

3 Weighted Pushdown Automata

4 Analogously to the finite-state case, and the case of context-free grammars, we now also extend the
5 definition of a pushdown automaton to the weighted case. The formal definition is:

Definition 4.2.38: Weighted pushdown automaton

A real-weighted pushdown automaton (WPDA) is a tuple $\mathcal{P} = (Q, \Sigma, \Gamma, \delta, (q_\iota, \gamma_\iota), (q_\varphi, \gamma_\varphi))$, where:

• $Q$ is a finite set of states;

• $\Sigma$ is a finite set of input symbols called the input alphabet;

• $\Gamma$ is a finite set of stack symbols called the stack alphabet;

• $\delta \subseteq Q \times \Gamma^* \times (\Sigma \cup \{\varepsilon\}) \times Q \times \Gamma^* \times \mathbb{R}$ is a multiset representing the transition weighting function;

• $(q_\iota, \gamma_\iota)$ is called the initial configuration and $(q_\varphi, \gamma_\varphi)$ is called the final configuration, where $q_\iota, q_\varphi \in Q$ and $\gamma_\iota, \gamma_\varphi \in \Gamma^*$.

As you can see, the only difference between the weighted and the unweighted case is the transition function, which in the weighted case assigns weights to the individual transitions instead of merely specifying the set of possible target configurations.
As with WFSAs (Definition 4.1.17) and WCFGs (Definition 4.2.16), we now define probabilistic WPDAs. This definition, however, is a bit more subtle. Notice that the transition weighting “function” δ in a WPDA is crucially still finite—there are only finitely many actions we can ever take. Similarly, when defining a probabilistic PDA, we have to limit ourselves to a finite number of configurations over which we define probability distributions over the next possible actions.
We define a probabilistic pushdown automaton as follows.

Definition 4.2.39: Probabilistic pushdown automaton

A WPDA $\mathcal{P} = (Q, \Sigma, \Gamma, \delta, (q_\iota, \gamma_\iota), (q_\varphi, \gamma_\varphi))$ is probabilistic if it holds that
$$\forall\; q \xrightarrow{a,\, \gamma_1 \to \gamma_2 / w} r \in \delta : \quad w \geq 0 \qquad (4.127)$$
and, for any $q \in Q$ and $\gamma \in \Gamma^*$,
$$\sum_{\substack{q \xrightarrow{a,\, \gamma_1 \to \gamma_2 / w} r \\ \text{s.t. } \gamma_1 \triangleleft \gamma}} w = 1. \qquad (4.128)$$

Definition 4.2.40: Transitions of a weighted pushdown automaton

If $(q_1, \gamma\gamma_1)$ and $(q_2, \gamma\gamma_2)$ are configurations, and $\tau$ is a transition $q_1 \xrightarrow{a,\, \gamma_1 \to \gamma_2 / w} q_2$ with $w \neq 0$, we write $(q_1, \gamma\gamma_1) \Rightarrow_{\tau} (q_2, \gamma\gamma_2)$.

Definition 4.2.41: Transition weights in a pushdown automaton

If $\delta(p, \gamma_1, a, q, \gamma_2) = w$, then we usually write
$$\tau\!\left(p, \gamma_1 \xrightarrow{a} q, \gamma_2\right) = w \qquad (4.129)$$
or say that $\delta$ has the transition $p \xrightarrow{a,\, \gamma_1 \to \gamma_2 / w} q$. We sometimes let $\tau$ stand for a transition, and we define $\delta(\tau) = w$.

3 And again, just like we combined the weights of individual transitions into the weights of paths
4 in WFSAs, and we combined the weights of production rules into the weights of the trees in WCFGs,
5 we now multiplicatively combine the weights of individual transitions in a run to define the weight
6 of a run in a WPDA:

Definition 4.2.42: Run weight

The weight $w(\pi)$ of a run
$$\pi = (q_0, \gamma_0), \tau_1, (q_1, \gamma_1), \ldots, \tau_N, (q_N, \gamma_N)$$
is the product of the transition weights, i.e.,
$$w(\pi) \overset{\mathrm{def}}{=} \prod_{n=1}^{N} \delta(\tau_n) \qquad (4.130)$$

8 Analogously to a stringsum in WFSAs, we define the stringsum for a string y in a WPDA P as


9 the sum over the weights of all runs scanning y.

Definition 4.2.43: Stringsum in a pushdown automaton

Let $\mathcal{P}$ be a WPDA and $\boldsymbol{y} \in \Sigma^*$ a string. The stringsum for $\boldsymbol{y}$ in $\mathcal{P}$ is defined as
$$\mathcal{P}(\boldsymbol{y}) \overset{\mathrm{def}}{=} \sum_{\pi \in \Pi(\mathcal{P}, \boldsymbol{y})} w(\pi) \qquad (4.131)$$
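The following short Python sketch (our own illustration, not from the text) makes Eqs. (4.130) and (4.131) concrete: a run is represented simply by the list of its transition weights, the run weight is their product, and the stringsum sums the run weights over a given collection of runs that scan the same string.

```python
import math

def run_weight(transition_weights):
    """Weight of a run: the product of its transition weights, Eq. (4.130)."""
    return math.prod(transition_weights)

def stringsum(runs):
    """Stringsum of a string y: the sum of the weights of all runs scanning y,
    Eq. (4.131). `runs` is an iterable of runs, each given as the list of its
    transition weights."""
    return sum(run_weight(run) for run in runs)

# Two hypothetical runs scanning the same string.
runs_of_y = [[0.5, 0.2, 1.0], [0.5, 0.8, 0.1]]
print(stringsum(runs_of_y))  # 0.5*0.2*1.0 + 0.5*0.8*0.1 = 0.14
```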

Definition 4.2.44: Recognition by a weighted pushdown automaton

We say that the PDA P recognizes the string y with the weight P pyq.
11

12 With this, we can define the weighted language defined by a WPDA.



Definition 4.2.45: Weighted language of a weighted pushdown automaton

Let $\mathcal{P}$ be a WPDA. The (weighted) language $L(\mathcal{P})$ of $\mathcal{P}$ is defined as
$$L(\mathcal{P}) \overset{\mathrm{def}}{=} \{(\boldsymbol{y}, \mathcal{P}(\boldsymbol{y})) \mid \boldsymbol{y} \in \Sigma^*\} \qquad (4.132)$$

2 Finally, we also define the WPDA allsum and normalizable WPDAs.

Definition 4.2.46: Allsum of a weighted pushdown automaton

The allsum of a WPDA $\mathcal{P}$ is defined as
$$Z(\mathcal{P}) \overset{\mathrm{def}}{=} \sum_{\pi \in \Pi(\mathcal{P})} w(\pi) \qquad (4.133)$$

Definition 4.2.47: Normalizable weighted pushdown automaton

A WPDA $\mathcal{P}$ is normalizable if $Z(\mathcal{P})$ is finite, i.e., if $Z(\mathcal{P}) < \infty$.

5 Relationship to Context-free Grammars


6 We motivated the introduction of pushdown automata as a means of recognizing context-free
7 languages. However, this correspondence is not obvious from the definition! Indeed, the equivalence
8 of the expressive power of context-free grammars and pushdown automata is a classic result in
9 formal language theory, and it is summarised by the theorem below:

Theorem 4.2.3: Context-free grammars and pushdown automata are equally ex-
pressive

A language is context-free if and only if some pushdown automaton recognizes it.


10

11 Proof. See Theorem 2.20 in Sipser (2013). ■

12 This result extends to the probabilistic case.

Theorem 4.2.4: Probabilistic context-free grammars and probabilistic pushdown


automata are equally expressive

A language is generated by a probabilistic context-free grammar if and only if some probabilistic


pushdown automaton recognizes it.
13

14 Proof. See Theorems 3 and 7 in Abney et al. (1999). ■

15 Lastly, analogously to how Theorem 4.2.2 showed that weighted context-free grammars are
16 equally expressive as probabilistic context-free grammars, the following theorem asserts the same
17 about pushdown automata:

Theorem 4.2.5: Globally normalized weighted pushdown automata can be locally normalized

Any globally normalized weighted pushdown automaton can be locally normalized. More precisely, this means the following. Let $\mathcal{P}$ be a weighted pushdown automaton. Then, there exists a probabilistic pushdown automaton $\mathcal{P}_p$ such that
$$\mathcal{P}_p(\boldsymbol{y}) = \frac{\mathcal{P}(\boldsymbol{y})}{Z(\mathcal{P})} \qquad (4.134)$$
for all $\boldsymbol{y} \in \Sigma^*$.

2 Proof. The proof is not straightforward: it can be shown that one cannot simply convert an arbitrary
3 weighted pushdown automaton into a locally-normalized one directly. Rather, the construction of
4 the latter goes through their context-free grammars: given a WPDA P, one first constructs the
5 WCFG equivalent to P, and then converts that to a locally normalized one (cf. Theorem 4.2.2), i.e.,
6 a PCFG. Then, this PCFG can be converted to a structurally quite different probabilistic pushdown
7 automaton P p , which nevertheless results in the language we require (Eq. (4.134)).
8 This construction is described in more detail in Abney et al. (1999) and Butoi et al. (2022). ■

9 4.2.8 Pushdown Language Models


10 We can now define pushdown language models, the title of this section.

Definition 4.2.48: Pushdown language model

A pushdown language model is a language model whose weighted language equals the
language of some weighted pushdown automaton, i.e., if there exists a weighted pushdown
automaton P such that L pPq “ L ppLM q.
11

12 Similarly, pushdown automata also induce language models.

Definition 4.2.49: Language model induced by a pushdown automaton

Let $\mathcal{P}$ be a weighted pushdown automaton. We define the language model induced by $\mathcal{P}$ as the probability distribution induced by the probability mass function
$$p_{\mathrm{LM}\,\mathcal{P}}(\boldsymbol{y}) \overset{\mathrm{def}}{=} \frac{\mathcal{P}(\boldsymbol{y})}{Z(\mathcal{P})}, \qquad (4.135)$$
for any $\boldsymbol{y} \in \Sigma^*$.

You might wonder why we specifically define pushdown language models and the models induced by them if WPDAs are equivalent to WCFGs (cf. §4.2.7). In that sense, a language model is context-free if and only if it is a pushdown language model. However, this holds only for the single-stack pushdown automata which we have discussed so far. We make this explicit distinction of pushdown language models with an eye to the next section, in which we introduce multi-stack WPDAs. Those are, as it turns out, much more powerful (expressive) than context-free grammars. We will, however, reuse this definition of a pushdown language model for those more powerful machines.

3 4.2.9 Multi-stack Pushdown Automata


4 We now consider an extension of (weighted) pushdown automata, namely, machines that employ
5 multiple stacks. While this might not seem like an important distinction, we will see shortly that
6 this augmentation results in a big difference in the expressiveness of the framework!

Definition 4.2.50: Two-stack pushdown automaton

A two-stack pushdown automaton (2-PDA) is a tuple P “


pΣ, Q, Γ1 , Γ2 , δ, pqι , γ ι1 , γ ι2 q, pqφ , γ φ1 , γ φ2 qq, where:

• Σ is a finite set of input symbols called the input alphabet;


• Q is a finite set of states;
• Γ1 and Γ2 are finite sets of stack symbols called the stack alphabets;
• δ Ď Q ˆ Γ˚1 ˆ Γ˚2 ˆ pΣ Y tεuq ˆ Q ˆ Γ˚1 ˆ Γ˚2 is a multiset representing the transition
function;
• pqι , γ ι1 , γ ι2 q is called the initial configuration and pqφ , γ φ1 , γ φ2 q is called the final
configuration, where qι , qφ P Q, γ ι1 , γ φ1 P Γ˚1 , and γ ι2 , γ φ2 P Γ˚2 .
7

8 Note that we could more generally define a k-stack PDA by including k stacks in the definition,
9 but the restriction to two stacks will be sufficient for our needs, as we will see in the next subsection.
10 The transition function now depends on the values stored in both of the stacks. The definitions of
11 the configuration and run of a two-stack PDA are analogous to the single-stack variant, with the
12 addition of the two stacks. We again extend this definition to the weighted and the probabilistic
13 case.

Definition 4.2.51: Two-stack weighted pushdown automaton

A two-stack real-weighted pushdown automaton (2-WPDA) is a tuple P “


pΣ, Q, Γ1 , Γ2 , δ, pqι , γ ι1 , γ ι2 q, pqφ , γ φ1 , γ φ2 qq, where:

• Σ is a finite set of input symbols called the input alphabet;


• Q is a finite set of states;
• Γ1 and Γ2 are finite sets of stack symbols called the stack alphabets;

• δ Ď Q ˆ Γ˚1 ˆ Γ˚2 ˆ pΣ Y tεuq ˆ Q ˆ Γ˚1 ˆ Γ˚2 ˆ R is a multiset representing the transition


weighting function;
• pqι , γ ι1 , γ ι2 q is called the initial configuration and pqφ , γ φ1 , γ φ2 q is called the final
configuration, where qι , qφ P Q, γ ι1 , γ φ1 P Γ˚1 , and γ ι2 , γ φ2 P Γ˚2 .
14

1 And lastly, we define probabilistic two-stack PDAs:

Definition 4.2.52: Probabilistic two-stack pushdown automaton

A 2-WPDA $\mathcal{P} = (Q, \Sigma, \Gamma_1, \Gamma_2, \delta, (q_\iota, \gamma_{\iota 1}, \gamma_{\iota 2}), (q_\varphi, \gamma_{\varphi 1}, \gamma_{\varphi 2}))$ is probabilistic if, for any configuration $(q, \gamma_1, \gamma_2)$, it holds that
$$\forall\; q \xrightarrow{a,\, \gamma_1 \to \gamma_1',\, \gamma_2 \to \gamma_2' / w} r \in \delta : \quad w \geq 0 \qquad (4.136)$$
and, for any $q \in Q$, $\gamma \in \Gamma_1^*$, and $\gamma' \in \Gamma_2^*$,
$$\sum_{\substack{q \xrightarrow{a,\, \gamma_1 \to \gamma_1',\, \gamma_2 \to \gamma_2' / w} r \\ \text{s.t. } \gamma_1 \triangleleft \gamma \text{ and } \gamma_2 \triangleleft \gamma'}} w = 1. \qquad (4.137)$$

3 Turing Completeness of Multi-stack Pushdown Automata


Besides modeling more complex languages than the finite-state language models from §4.1, (multi-stack) pushdown automata will also play an important role in analyzing some modern language models that we introduce later. Namely, we will show that recurrent neural networks (cf. §5.1.2) can simulate any two-stack PDA. This will be useful when reasoning about the computational expressiveness of recurrent neural networks because of a fundamental result in the theory of computation, namely, that two-stack PDAs are Turing complete:

Theorem 4.2.6: Two-stack pushdown automata are Turing complete

The class of two-stack pushdown automata is Turing complete, i.e., any Turing machine can be simulated by some 2-PDA.


10

11 Proof. The equivalence is quite intuitive: the two stacks (which are infinite in one direction) of the
12 2-PDA can simulate the tape of a Turing machine by popping symbols from one stack and pushing
13 symbols onto the other one simultaneously. The head of the Turing machine then effectively reads
14 the entries at the top of one of the two stacks. For a formal proof, see Theorem 8.13 in Hopcroft
15 et al. (2006). ■
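As a quick illustration of the proof idea (our own sketch, not part of the formal construction), the following Python snippet shows how two stacks can emulate a Turing machine tape: the cell under the head sits on top of the right stack, and moving the head simply shuttles symbols between the two stacks.

```python
class TwoStackTape:
    """Emulate a Turing-machine tape with two stacks (proof idea of
    Theorem 4.2.6). The left stack holds the tape to the left of the
    head (top = closest cell); the right stack holds the head cell and
    everything to its right."""

    def __init__(self, tape, blank="_"):
        self.blank = blank
        self.left = []                     # cells to the left of the head
        self.right = list(reversed(tape))  # head cell is on top

    def read(self):
        return self.right[-1] if self.right else self.blank

    def write(self, symbol):
        if not self.right:
            self.right.append(self.blank)
        self.right[-1] = symbol

    def move_right(self):
        self.left.append(self.right.pop() if self.right else self.blank)

    def move_left(self):
        self.right.append(self.left.pop() if self.left else self.blank)

tape = TwoStackTape("abc")
tape.write("x")
tape.move_right()
print(tape.read())  # 'b'
```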

16 This is also the reason why we only have to consider two-stack PDAs—they can compute
17 everything that can be computed, meaning that additional stacks do not increase their expressiveness!
18 Since unweighted pushdown automata are simply special cases of weighted PDAs, which are
19 equivalent to probabilistic PDAs, we can therefore also conclude:

Corollary 4.2.1

Weighted two-stack pushdown automata are Turing complete.


20

Corollary 4.2.2

Probabilistic two-stack pushdown automata are Turing complete.


21

1 A straightforward consequence of the Turing completeness of two-stack PPDAs is that their


2 tightness is undecidable.

Theorem 4.2.7: Tightness of 2-PPDA is undecidable

The tightness of a probabilistic two-stack pushdown automaton is undecidable.


3

Proof. We start with a simple observation: a probabilistic pushdown automaton $\mathcal{P}$ is tight if and only if it halts with probability 1 (with respect to the probability measure on $\Sigma^* \cup \Sigma^\infty$ defined in §2.5.4), as this, by the definition of the language accepted by the WPDA (cf. §4.2.7), corresponds to its language containing only finite strings with probability 1.
Let $\mathcal{M}$ then be a Turing machine and $\mathcal{P}$ a 2-PPDA which simulates it. Then, $\mathcal{P}$ is tight if and only if it halts with probability 1 (again, with respect to the probability measure from above). This is equivalent to $\mathcal{M}$ halting with probability 1—a variant of the halting problem, which is one of the fundamental undecidable problems. We have therefore reduced the problem of deciding whether $\mathcal{M}$ halts with probability 1 to the problem of determining the tightness of a 2-PPDA, implying that the latter is undecidable. ■

14 You might wonder what this means for the (weighted) languages recognized by multiple-stack
15 (weighted) automata. Turing machines can recognize recursively enumerable languages. This means
16 that weighted multi-stack pushdown automata model distributions over recursively enumerable
17 languages. To see why this might be useful, let us finish the discussion of context-free languages
18 with an example of a language model that is not context-free:

Example 4.2.13: Example of a non-context-free distribution over strings

Let $\Sigma = \{a\}$ and $p_{\mathrm{LM}}(a^n) = e^{-\lambda}\,\frac{\lambda^n}{n!}$ for $n \in \mathbb{N}_{\geq 0}$, i.e., $L(p_{\mathrm{LM}}) \ni \boldsymbol{y} = a^n \sim \mathrm{Poisson}(\lambda)$ for some $\lambda > 0$. This language is not context-free; the proof, however, is not trivial. We direct the reader to Icard (2020b) for one.
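To see the example numerically, here is a small Python sketch (ours, not from the text) that evaluates this Poisson-length language model and checks that the probabilities of the finite strings $a^0, a^1, \ldots$ sum to (approximately) 1, i.e., that the model is tight.

```python
import math

def p_lm(n: int, lam: float) -> float:
    """Probability of the string a^n under the Poisson-length LM of
    Example 4.2.13: p_LM(a^n) = exp(-lam) * lam^n / n!."""
    return math.exp(-lam) * lam**n / math.factorial(n)

lam = 2.0  # an arbitrary choice of lambda > 0
total = sum(p_lm(n, lam) for n in range(50))
print(total)  # ~1.0: the distribution over finite strings sums to one
```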

4.3 Exercises

Exercise 4.1
Prove the following lemma.

Lemma 4.3.1

Let $\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)$ and $q \in Q$. Then
$$Z(\mathcal{A}, q) = \sum_{q \xrightarrow{a/w} q' \in \delta} \omega\!\left(q \xrightarrow{a/\cdot} q'\right) Z(\mathcal{A}, q') + \rho(q) \qquad (4.138)$$

Exercise 4.2
Show that the expression for the log-likelihood of the n-gram model can be rewritten as
$$\ell\ell(\mathcal{D}) = \sum_{m=1}^{M} \sum_{t=1}^{|\boldsymbol{y}^{(m)}|} \log \theta_{y_t \mid \boldsymbol{y}_{<t}} = \sum_{\substack{\boldsymbol{y} \\ |\boldsymbol{y}| = n}} C(\boldsymbol{y}) \log \theta_{y_n \mid \boldsymbol{y}_{<n}} \qquad (4.139)$$
with the quantities as defined in Proposition 4.1.3. This is a common trick. It is also known as the token to type switch because we switch from counting over the individual tokens to counting over their identities (types).

Exercise 4.3
Let $C(\boldsymbol{y})$ be the string occurrence count for $\boldsymbol{y} \in \Sigma^*$ as defined in Proposition 4.1.3. Show (or simply convince yourself) that, in a given training corpus $\mathcal{D}$,
$$\sum_{y' \in \Sigma} C(y_1 \ldots y_{n-1}\, y') = C(y_1 \ldots y_{n-1}) \qquad (4.140)$$
1 Chapter 5

2 Neural Network Language Models

3 Chapter 4 introduced two classical language modeling frameworks: finite-state language models and
4 context-free language models. While those served as a useful introduction to the world of language
5 modeling, most of today’s state-of-the-art language models go beyond the modeling assumptions
6 of these two frameworks. This chapter dives into the diverse world of modern language modeling
7 architectures, which are based on neural networks. We define two of the most common architectures—
8 recurrent neural networks and transformers—and some of their variants. The focus is again on
9 rigorous formalization and theoretical understanding—we analyze the introduced models in terms of
10 the theoretical foundations so far (e.g., expressiveness and tightness)—but, due to their practical
11 applicability, we also study some practical aspects of the models.
12 We begin with recurrent neural networks.


1 5.1 Recurrent Neural Language Models

The first neural language modeling architecture we consider is one based on recurrent neural networks. Recurrent neural networks capture the idea of the sequential processing of strings relatively naturally while also making decisions based on an unbounded context. Before delving into the technical details of recurrent neural networks (RNNs), however, we first motivate the introduction of modeling contexts of unbounded length. Then, we formally define recurrent neural networks and devote a large portion of the section to their theoretical properties. The most important of those will be the Turing completeness of this architecture, as it has numerous consequences for the solvability of many of the tasks we might be interested in, such as finding the most probable string in the language model represented by a recurrent neural network and determining whether an RNN is tight.

11 5.1.1 Human Language is Not Context-free

12 Recall that we motivated the introduction of context-free languages by observing that finite memory
13 is insufficient to model all formal phenomena of human language, e.g., infinite recursion (cf. Ex-
14 ample 4.2.1). Context-free languages described by context-free grammars and pushdown automata
15 were able to capture those. However, human language is more expressive than that—it includes
16 linguistic phenomena that cannot be described by context-free grammars. A typical example is
17 called cross-serial dependencies, which are common in Swiss German.

Example 5.1.1: Cross-Serial Dependencies in Swiss German (Shieber, 1985)

Swiss German is a textbook example of a language with grammatical cross-serial dependencies, i.e., dependencies in which the arcs representing them cross. In the example sentence below, the words connected by arcs are objects and verbs belonging to the same predicates (verb phrases). Because of that, they have to agree in form—they depend on one another. As we show next, context-free languages cannot capture such dependencies.

...mer d’chind em Hans s’ huus lönd hälfe aastriiche
...we the children Hans the house let help paint

Why are cross-serial dependencies non-context-free? Before reasoning about the phenomenon of cross-serial dependencies, we revisit Example 4.2.1 with a somewhat more formal approach. The arbitrarily deep nesting can, for example, be abstractly represented with the expression
$$x\, A^n\, B^n\, y \qquad (5.1)$$
with¹

x = “The cat”
A = “the dog”
B = “barked at”
y = “likes to cuddle”.

From this abstract perspective, center embeddings are very similar to the $\mathrm{D}(1)$ language (Example 4.2.5), in that every noun phrase “the dog” has to be paired with a verb phrase “barked at”, which cannot be represented by any regular language.
In a similar fashion, Example 5.1.1 can abstractly be represented with the expression
$$x\, A^m\, B^n\, C^m\, y\, D^n\, z \qquad (5.2)$$

with

x = “...mer”
A = “d’chind”
B = “em Hans”
y = “s’ huus”
C = “lönd”
D = “hälfe”
z = “aastriiche”.

7 Admittedly, this is a relatively uncommon formulation even with n “ m “ 1. It should be taken with
8 a grain of salt, as the title of the original publication discussing this phenomenon, Evidence against
9 context-freeness (Shieber, 1985), also suggests. However, theoretically, the number of repetitions of
10 “d’chind” and “lönd”, as well as “em Hans” and “hälfe”, can be increased arbitrarily. Repeating
11 the former would correspond to having many groups of children. The last of the groups would let
12 Hans help paint the house, whereas each of the previous groups would let the group after them either
13 let Hans paint the house or recurse onto another group of children. Similarly, repeating “em Hans”
14 and “hälfe” would correspond to a number of Hanses, each either helping another Hans or helping
15 paint the house. Then, using the pumping lemma for context-free languages, it can be shown that
16 the expressions of the form in Eq. (5.2) cannot be recognized by any context-free grammar. We
refer the reader to Hopcroft et al. (2006, Example 7.20) for a detailed proof.
18 Example 5.1.1 means that to model a human language formally, we need more expressive
19 formalisms than context-free grammars or pushdown automata as described in the previous sections.2
20 However, instead of defining a more expressive formalism motivated by formal language theory (like
21 we did with context-free grammars and center embeddings), we now introduce recurrent neural
22 networks, which, as we will see, under certain assumptions, have the capacity to model all computable
23 languages (i.e., they are Turing complete). Moreover, they can also model infinite lengths of the
24 context y ăt in a very flexible way. In the next section, we define them formally.
1 In this case, we of course only consider an arbitrarily long sequence of barking dogs.
2 On the other hand, note that we would ideally also like to upper-bound the expressive power of the formal models,
as this introduces useful inductive biases for learning and sparks insights into how humans process language. This
means that we would not simply like to jump to Turing-complete models in such an exploration of language models.

1 5.1.2 Recurrent Neural Networks


2 As discussed, natural languages are beyond the descriptive power of regular and context-free
3 languages. Now, we turn to a class of models that is theoretically capable of recognizing all
4 computable languages: recurrent neural networks (RNNs).3

5 An Informal Introduction to Recurrent Neural Networks


6 Human language is inherently sequential: we produce and consume both spoken as well as written
7 language as a stream of units.4 This structure is reflected in some of the algorithms for processing
8 language we have seen so far. For example, finite-state automata (cf. §4.1) process the input string
9 one symbol at a time and build the representations of the string seen so far in the current state of
10 the automaton. Pushdown automata function similarly, but additionally keep the stack as part of
11 the configuration.
12 Recurrent neural networks are neural networks that capture the same idea of iterative processing
13 of the input but do so in a more flexible way than the finite-memory finite-state automata and the
14 stack-based pushdown automata. Very abstractly, a recurrent neural network sequentially processes
15 a sequence of inputs and, while doing so, produces a sequence of hidden states, which we will
denote by h, based on a transition function in the form of a recurrent dynamics map, which acts
17 similarly to a (deterministic) transition function in a finite-state machine: given the current hidden
18 state and an input symbol, it (deterministically) determines the next hidden state. The hidden states
19 play, as we will see, an analogous role to the states of a finite-state automaton or the configuration of
20 a pushdown automaton: The current hidden state of a recurrent neural network at time t determines,
21 together with the input at time t, through the dynamics map, the hidden state at time t ` 1—indeed,
22 very similar to how finite-state automata process strings and transition between their states. Again,
23 the hidden state can be thought of as a compact (constant-size) summary of the input y ďt seen so
24 far and should ideally characterize y ďt as well as possible (in the sense of retaining all information
25 required for continuing the string). Remember from §4.2.1 that the finite number of states of a
26 finite-state automaton presented a serious limitation to its ability to model human language. As
27 we will see, the main difference between (weighted) finite-state automata and RNNs is that the
28 latter can work with infinite state spaces, for example, RD in the abstract formulation, or QD in a
29 digital computing system, such as a computer. This, together with the flexibility of the transition
30 function between hidden states, will allow RNNs to represent more complex languages than those
31 recognized by finite-state automata or context-free grammars. In fact, the large state space and the
32 flexible transition functions endow RNNs, under some assumptions, with the possibility to model
33 infinitely-long-term dependencies on the input string, distinguishing them from the Markovian
34 n-gram models.
35 You might wonder why we refer to the current state of an RNN as hidden states instead of
36 only as states, as with finite-state automata. Indeed, when analyzing recurrent neural networks
37 in terms of their expressivity and connections to classical models of computation, we will regard
38 the hidden states as completely analogous to states in a finite-state or pushdown automaton. The
39 hidden part comes from the fact that the hidden states h are usually not what we are interested
40 in when modeling language with an RNN. Rather, h is simply seen as a component in a system
3 In this subsection, we focus on the applications of recurrent neural networks to language modeling. However,

recurrent neural networks have been widely used to process sequential data and time series, thanks to their power of
taking in arbitrary-length inputs.
4 So far, we have simply referred to those units as symbols.

[Figure 5.1 consists of three panels: (a) an abstract depiction of how an RNN processes one symbol in a string, where the hidden state $\boldsymbol{h}_t$ summarizes the inputs $y_1 y_2 \ldots y_t$; (b) an abstract depiction of an RNN as an automaton, whose transitions between the possibly infinitely many hidden states are determined by the dynamics map; (c) an abstract depiction of an RNN as a system updating the hidden state $\boldsymbol{h}$ depending on the input $y$.]

Figure 5.1: Different possible depictions of an abstract RNN model. The way that the hidden states are updated based on the input symbol $y_t$ is abstracted away.

1 that produces individual conditional probabilities over the next symbol, as in sequence models (cf.
2 Definition 2.5.2)—these conditional probabilities are the actual “visible” parts, while the “internal”
3 states are, therefore, referred to as hidden.
4 RNNs are abstractly illustrated in different ways in the literature. Often, they are represented
5 as a sequence of hidden states and the input symbols consumed to arrive at those states—this
6 is shown in Fig. 5.1a. They can also be presented more similarly to automata, with (a possibly
7 infinite) labeled graph, where the transition labels again correspond to the symbols used to enter
8 the individual states. This is presented in Fig. 5.1b. Lastly, due to the infinite state space, one can
9 also think of an RNN as a system that keeps the most current hidden state in memory and updates
10 it as new symbols are consumed—this is shown in Fig. 5.1c.5

11 A Formal Definition of Recurrent Neural Networks


12 Having introduced RNNs and their motivations informally, we now move to their formal definition.
13 Our definition and treatment of recurrent neural networks might differ slightly from what you
14 might normally encounter in the literature. Namely, we define RNNs below as abstract systems
15 transitioning between possibly infinitely-many states. Our definition will allow for an intuitive
16 connection to classical language models such as finite-state and pushdown language models as
5 More precisely, these illustrations correspond to first-order RNNs, which are by far the most common. Later, we

will also briefly consider higher-order RNNs, whose hidden state update depends on multiple previous hidden states.

1 well as for tractable theoretical analysis in some special cases. Specifically, when analyzing RNNs
2 theoretically, we will make use of their connections to automata we saw in Chapter 4.
3 In an abstract sense, recurrent neural networks can be defined as a system transitioning between
4 possibly infinitely-many states, which we will assume to be vectors in a vector space. Specifically,
5 we will distinguish between real and rational recurrent neural networks.

Definition 5.1.1: Real-valued Recurrent Neural Network

Let Σ be an alphabet. A (deterministic) real-valued recurrent neural network R is a


four-tuple pΣ, D, f , h0 q where

• Σ is the alphabet of input symbols;


• D is the dimension of R;
• f : RD ˆ Σ Ñ RD is the dynamics map, i.e., a function defining the transitions between
subsequent states;

• h0 P RD is the initial state.


6

7 We analogously define rational-valued recurrent neural networks as recurrent neural


8 networks with the hidden state space QD instead of RD . You might wonder why we make the
9 distinction. Soon, when we take on theoretical analysis of RNNs, it will become important over
10 which state spaces the models are defined. RNNs implemented in a computer using floating-point
11 numbers, of course, cannot have irrational-valued weights—in this sense, all implemented recurrent
12 neural networks are rational. However, defining the models over the real numbers crucially allows us
13 to perform operations from calculus for which some sort of continuity and smoothness is required,
14 for example, differentiation for gradient-based learning (cf. §3.2.3).

Example 5.1.2: A rational-valued RNN

An example of a rational-valued RNN is the series
$$h_t = \frac{1}{2} h_{t-1} + \frac{1}{h_{t-1}} \qquad (5.3)$$
which we considered in Example 3.1.1. In this case

• $\Sigma = \{a\}$

• $D = 1$

• $f : (x, a) \mapsto \frac{1}{2} x + \frac{1}{x}$

• $h_0 = 2$
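A few iterations of this dynamics map in Python (our own check, not from the text) make the behavior of the hidden state concrete: starting from $h_0 = 2$, the state converges to $\sqrt{2}$ as the RNN reads more and more $a$'s.

```python
def f(x: float, symbol: str = "a") -> float:
    """Dynamics map of the rational-valued RNN in Example 5.1.2."""
    return 0.5 * x + 1.0 / x

h = 2.0  # initial state h_0
for t in range(1, 6):
    h = f(h, "a")
    print(t, h)  # approaches sqrt(2) ~ 1.41421356...
```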

Example 5.1.3: Another example of an RNN

The tuple $\mathcal{R} = (\Sigma, D, f, \boldsymbol{h}_0)$ where

• $\Sigma = \{a, b\}$

• $D = 2$

• $f : (\boldsymbol{x}, y) \mapsto \begin{cases} \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix} \boldsymbol{x} & \text{if } y = a \\[2ex] \begin{pmatrix} \cos\psi & -\sin\psi \\ \sin\psi & \cos\psi \end{pmatrix} \boldsymbol{x} & \text{otherwise} \end{cases}$

• $\boldsymbol{h}_0 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$

is an example of a real-valued RNN which rotates the current hidden state by the angle $\phi$ if the input symbol is $a$ and rotates it by $\psi$ if the symbol is $b$.

Example 5.1.4: Another example of an RNN

Another example of an RNN would be the tuple with

• $\Sigma = \mathrm{GOOD} \cup \mathrm{BAD} = \{\text{“great”}, \text{“nice”}, \text{“good”}\} \cup \{\text{“awful”}, \text{“bad”}, \text{“abysmal”}\}$

• $D = 2$

• $f : (\boldsymbol{h}, a) \mapsto \begin{cases} \boldsymbol{h} + \begin{pmatrix} 1 \\ 0 \end{pmatrix} & \text{if } a \in \mathrm{GOOD} \\[2ex] \boldsymbol{h} + \begin{pmatrix} 0 \\ 1 \end{pmatrix} & \text{otherwise} \end{cases}$

• $\boldsymbol{h}_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$

which counts the number of occurrences of positive and negative words.

3 To define language models using recurrent neural networks, we will use them as the encoder
4 functions enc in our general language modeling framework (cf. §3.1). To connect Definition 5.1.1
5 with the general LM framework, we define the RNN encoding function.

Definition 5.1.2: Recurrent Neural Encoding Function

Let $\mathcal{R} = (\Sigma, D, f, \boldsymbol{h}_0)$ be a recurrent neural network. A recurrent neural encoding function $\mathrm{enc}_{\mathcal{R}}$ is a representation function (cf. §3.1.1) that recursively encodes strings of arbitrary lengths using its dynamics map $f$:
$$\mathrm{enc}_{\mathcal{R}}(\boldsymbol{y}_{<t+1}) \overset{\mathrm{def}}{=} f\left(\mathrm{enc}_{\mathcal{R}}(\boldsymbol{y}_{<t}), y_t\right) \in \mathbb{R}^D \qquad (5.4)$$
and
$$\mathrm{enc}_{\mathcal{R}}(\boldsymbol{y}_{<1}) \overset{\mathrm{def}}{=} \boldsymbol{h}_0 \in \mathbb{R}^D \qquad (5.5)$$

2 Intuitively, an RNN R takes an input string y and encodes it with the encoding function
3 encR by sequentially applying its dynamics map f . The representations of individual prefixes (cf.
4 Definition 2.3.6) of the input string are called hidden states.

Definition 5.1.3: Hidden State

Let $\mathcal{R} = (\Sigma, D, f, \boldsymbol{h}_0)$ be an RNN. The hidden state $\boldsymbol{h}_t \in \mathbb{R}^D$ describes the state of $\mathcal{R}$ after reading $y_t$. It is recursively computed according to the dynamics map $f$ as follows:
$$\boldsymbol{h}_t \overset{\mathrm{def}}{=} \mathrm{enc}_{\mathcal{R}}(\boldsymbol{y}_{<t+1}) = f(\boldsymbol{h}_{t-1}, y_t) \qquad (5.6)$$

Example 5.1.5: Hidden states

The hidden states of the RNN from Example 5.1.2 are the individual values $h_t$, which, as $t$ increases, approach $\sqrt{2}$.
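As a concrete illustration of Definitions 5.1.2 and 5.1.3 (our own sketch, not from the text), the following Python function applies an arbitrary dynamics map $f$ symbol by symbol to compute $\mathrm{enc}_{\mathcal{R}}(\boldsymbol{y})$ and the intermediate hidden states, here instantiated with the word-counting RNN of Example 5.1.4.

```python
GOOD = {"great", "nice", "good"}

def f(h, a):
    """Dynamics map of the word-counting RNN from Example 5.1.4."""
    good, bad = h
    return (good + 1, bad) if a in GOOD else (good, bad + 1)

def enc(string, f, h0):
    """Recurrent neural encoding function enc_R (Definition 5.1.2):
    fold the dynamics map over the string, returning all hidden states
    h_0, h_1, ..., h_T (Definition 5.1.3)."""
    states = [h0]
    for symbol in string:
        states.append(f(states[-1], symbol))
    return states

hidden_states = enc(["great", "awful", "nice"], f, h0=(0, 0))
print(hidden_states[-1])  # (2, 1): two positive words, one negative
```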

Recurrent Neural Sequence Models

A recurrent neural network based on Definition 5.1.1 does not on its own define a sequence model, but simply a context encoding function $\mathrm{enc}_{\mathcal{R}} : \Sigma^* \to \mathbb{R}^D$. To define a sequence model based on an RNN, we simply plug the RNN encoding function (Definition 5.1.2) into the general language modeling framework from §3.1.

Definition 5.1.4: Recurrent neural sequence model

Let $\mathcal{R} = (\Sigma, D, f, \boldsymbol{h}_0)$ be a recurrent neural network and $\mathbf{E} \in \mathbb{R}^{|\Sigma| \times D}$ a symbol representation matrix. A D-dimensional recurrent neural sequence model over an alphabet $\Sigma$ is a tuple $(\Sigma, D, f, \mathbf{E}, \boldsymbol{h}_0)$ defining the sequence model of the form
$$p_{\mathrm{SM}}(y_t \mid \boldsymbol{y}_{<t}) \overset{\mathrm{def}}{=} f_{\Delta^{|\Sigma|-1}}\!\left(\mathbf{E}\, \mathrm{enc}_{\mathcal{R}}(\boldsymbol{y}_{<t})\right)_{y_t} = f_{\Delta^{|\Sigma|-1}}\!\left(\mathbf{E}\, \boldsymbol{h}_{t-1}\right)_{y_t}. \qquad (5.7)$$
By far the most common choice of the projection function is the softmax, yielding the sequence model
$$p_{\mathrm{SM}}(y_t \mid \boldsymbol{y}_{<t}) \overset{\mathrm{def}}{=} \mathrm{softmax}\!\left(\mathbf{E}\, \mathrm{enc}_{\mathcal{R}}(\boldsymbol{y}_{<t})\right)_{y_t} = \mathrm{softmax}\!\left(\mathbf{E}\, \boldsymbol{h}_{t-1}\right)_{y_t}. \qquad (5.8)$$
For conciseness, we will refer to RNN sequence models whose next-symbol probability distributions are computed using the softmax function as softmax RNN sequence models.
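A minimal numpy sketch of Eq. (5.8) (ours, not from the text; the embedding matrix and hidden state below are arbitrary placeholders) shows how the next-symbol distribution is obtained from the current hidden state: project the encoding with $\mathbf{E}$ and normalize with a softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_symbol_distribution(E, h_prev):
    """Eq. (5.8): p_SM(. | y_<t) = softmax(E h_{t-1})."""
    return softmax(E @ h_prev)

# Toy instantiation: 3 symbols, hidden dimension 2.
rng = np.random.default_rng(0)
E = rng.normal(size=(3, 2))      # symbol representation matrix
h = np.array([0.5, -1.0])        # some hidden state enc_R(y_<t)
p = next_symbol_distribution(E, h)
print(p, p.sum())                # a probability vector summing to 1
```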

13 From this perspective, we see that RNNs are simply a special case of our general language
14 modeling framework with parameterized representations of tokens y P Σ and the history y P Σ˚ (cf.
15 §3.1)—an RNN simply defines how the encoding function enc is specified. The three figures from
16 Fig. 5.1 are presented again with this probabilistic perspective in Fig. 5.2.

[Figure 5.2 mirrors Fig. 5.1 from a generative perspective, in three panels: (a) an abstract depiction of how an RNN generates a string one symbol at a time, where the hidden state $\boldsymbol{h}_t$ summarizes the string $y_1 y_2 \ldots y_t$ generated so far and dotted lines denote the sampling steps $y_{t+1} \sim p_{\mathrm{SM}}(\cdot \mid \boldsymbol{h}_t)$; (b) an abstract depiction of a generative RNN as an automaton; (c) an abstract depiction of an RNN as a system updating the hidden state $\boldsymbol{h}_t$ depending on the generated symbol $y_t$.]

Figure 5.2: Different possible depictions of an abstract RNN model generating symbols. The way that the hidden states are updated based on the input symbol $y_t$ is abstracted away.

1 A Few More Definitions


2 In the following, we will often use the so-called one-hot encodings of symbols for concise notation.
3 We define them here.

Definition 5.1.5: One-hot encoding

Let $\Sigma$ be an alphabet and $n : \Sigma \to \{1, \ldots, |\Sigma|\}$ a bijection (i.e., an ordering of the alphabet, assigning an index to each symbol in $\Sigma$). A one-hot encoding $\llbracket \cdot \rrbracket$ is a representation function of the symbols in $\Sigma$ which assigns the symbol $y \in \Sigma$ the $n(y)^{\text{th}}$ basis vector:
$$\llbracket y \rrbracket \overset{\mathrm{def}}{=} \boldsymbol{d}_{n(y)}, \qquad (5.9)$$
where $\boldsymbol{d}_n$ is the $n^{\text{th}}$ canonical basis vector, i.e., a vector of zeros with a 1 at position $n$.

Example 5.1.6: One-hot encoding

Let $\Sigma = \{\text{“large”}, \text{“language”}, \text{“models”}\}$ and $n = \{\text{“large”} : 1, \text{“language”} : 2, \text{“models”} : 3\}$. The one-hot encoding of the vocabulary is:
$$\llbracket \text{“large”} \rrbracket = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad \llbracket \text{“language”} \rrbracket = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad \llbracket \text{“models”} \rrbracket = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \qquad (5.10)$$
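In code, a one-hot encoding is a one-liner; the numpy sketch below (ours, not from the text) mirrors Example 5.1.6.

```python
import numpy as np

def one_hot(symbol, ordering):
    """One-hot encoding [[y]] = d_{n(y)} (Definition 5.1.5)."""
    vec = np.zeros(len(ordering))
    vec[ordering[symbol] - 1] = 1.0   # ordering is 1-indexed, as in the text
    return vec

ordering = {"large": 1, "language": 2, "models": 3}
print(one_hot("language", ordering))  # [0. 1. 0.]
```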

6 Many specific variants of recurrent neural networks define the dynamics map f in a specific way:
7 the output of the function is some element-wise (non-linear) transformation of some “inner” function
8 g. The dynamics map of such an RNN is then the composition of g and the non-linearity.

Definition 5.1.6: Activation function

Let $\mathcal{R} = (\Sigma, D, f, \mathbf{E}, \boldsymbol{h}_0)$ be an RNN. If the hidden states $\boldsymbol{h}_t$ of the RNN are computed as
$$\boldsymbol{h}_t = \sigma\left(g(\boldsymbol{h}_{t-1}, y)\right) \qquad (5.11)$$
for some function $g : \mathbb{R}^D \times \Sigma \to \mathbb{R}^D$ and some function $\sigma : \mathbb{R} \to \mathbb{R}$ which is computed element-wise (that is, $\sigma(\boldsymbol{x})_d = \sigma(x_d)$ for all $d = 1, \ldots, D$ and $\boldsymbol{x} \in \mathbb{R}^D$), we call $\sigma$ an activation function.

10 This finishes our formal definition of recurrent neural networks. We next consider some of their
11 theoretical properties, starting with tightness.

12 5.1.3 General Results on Tightness


13 We now discuss a general result on the tightness of recurrent neural sequence models, as defined
14 in Definition 5.1.4. The analysis is straightforward and is a translation of the generic results on
15 tightness (cf. §3.1.5) to the case of the norm of the hidden states of an RNN, ht as the encodings of
16 the prefixes y ďt , but it requires us to focus specifically on softmax RNN sequence models.

Theorem 5.1.1: Tightness of Recurrent Neural Sequence Models

A softmax recurrent neural sequence model is tight if for all time steps $t$ it holds that
$$s\, \|\boldsymbol{h}_t\|_2 \leq \log t, \qquad (5.12)$$
where $s \overset{\mathrm{def}}{=} \max_{y \in \Sigma} \|\boldsymbol{e}(y) - \boldsymbol{e}(\textsc{eos})\|_2$.

Proof. This is simply a restatement of Theorem 3.1.6 for the case when $\mathrm{enc}$ takes the form of a general RNN encoding function, $\mathrm{enc}_{\mathcal{R}}$. ■

Corollary 5.1.1: RNNs with bounded dynamics maps are tight

A softmax recurrent neural sequence model $\mathcal{R} = (\Sigma, D, f, \boldsymbol{h}_0)$ with a bounded dynamics map $f$, i.e., with a dynamics map $f$ such that
$$|f(\boldsymbol{x})_d| \leq M \qquad (5.13)$$
for some $M \in \mathbb{R}$, for all $d = 1, \ldots, D$ and all $\boldsymbol{x} \in \mathbb{R}^D$, is tight.

Proof. If the dynamics map is bounded, the norm of the hidden state, $\|\boldsymbol{h}_t\|_2$, is bounded as well. This means that the left-hand side of Eq. (5.12) is bounded with respect to $t$, while $\log t$ grows without bound, so the condition holds. ■

A special case of Corollary 5.1.1 is that of RNNs with bounded activation functions (cf. Definition 5.1.6): such RNNs are tight whenever the activation function itself is bounded. This implies that all standard sigmoid- and tanh-activated recurrent neural networks are tight. However, the same does not hold for RNNs with unbounded activation functions, which have lately become more popular (one of the reasons being the vanishing gradient problem (Glorot et al., 2011)).

Example 5.1.7: RNNs with unbounded activation functions may not be tight

A very popular unbounded activation function is the so-called rectified linear unit (ReLU), defined as
$$\mathrm{ReLU}(x) \overset{\mathrm{def}}{=} \max(0, x). \qquad (5.14)$$
This function is clearly unbounded.

Now suppose we had the following RNN over the simple alphabet $\Sigma = \{a\}$:
$$\boldsymbol{h}_t = \begin{pmatrix} h_{t-1} + 1 \end{pmatrix}, \qquad (5.15)$$
initial state
$$\boldsymbol{h}_0 = \begin{pmatrix} 0 \end{pmatrix} \qquad (5.16)$$
and the output matrix
$$\mathbf{E} = \begin{pmatrix} -1 \\ 1 \end{pmatrix} \qquad (5.17)$$
where the top row of $\mathbf{E}$ computes the logit of the eos symbol and the bottom one that of $a$. It is easy to see that
$$\boldsymbol{h}_t = \begin{pmatrix} t \end{pmatrix}. \qquad (5.18)$$
This already does not look promising for tightness—the norm of the hidden state, which is in this case $\left\|\begin{pmatrix} t \end{pmatrix}\right\| = t$, is increasing at a much higher rate than the $O(\log t)$ required by Theorem 5.1.1. We encounter a similar hint against tightness if we compute the conditional probabilities of the eos symbol and the symbol $a$:
$$p_{\mathrm{SM}}(\textsc{eos} \mid \boldsymbol{y}_{<t}) = \mathrm{softmax}\!\left(\begin{pmatrix} -1 \\ 1 \end{pmatrix}(t-1)\right)_{\textsc{eos}} = \frac{\exp[-t+1]}{\exp[-t+1] + \exp[t-1]} \qquad (5.19)$$
$$p_{\mathrm{SM}}(a \mid \boldsymbol{y}_{<t}) = \mathrm{softmax}\!\left(\begin{pmatrix} -1 \\ 1 \end{pmatrix}(t-1)\right)_{a} = \frac{\exp[t-1]}{\exp[-t+1] + \exp[t-1]} \qquad (5.20)$$
The probability of ending the string at time step $t$ is, therefore,
$$p_{\mathrm{SM}}(\textsc{eos} \mid \boldsymbol{y}_{<t}) = \frac{1}{1 + \exp[2(t-1)]}. \qquad (5.21)$$
Intuitively, this means that the probability of ending the string (generating eos) diminishes rapidly with $t$—in this case much faster than the diverging sum required by Theorem 2.5.3 allows. All signs, thus, point towards the RNN from Eq. (5.15) not being tight. Indeed, for this specific case, one can show using some algebraic manipulations that
$$\sum_{\boldsymbol{y} \in \Sigma^*} p_{\mathrm{LN}}(\boldsymbol{y}) = \sum_{n \in \mathbb{N}_{\geq 0}} p_{\mathrm{LN}}(a^n) < 0.15 \qquad (5.22)$$
where $p_{\mathrm{LN}}$ is the locally normalized model induced by the RNN. This means that the RNN from Eq. (5.15) assigns less than 0.15 probability to finite strings—all other probability mass leaks to infinite sequences.

Example 5.1.8: RNNs with unbounded activation functions can still be tight

Example 5.1.7 showed that RNNs with unbounded activation functions can indeed result in non-tight sequence models. However, this is not necessarily the case, as the following simple modification of the RNN from Example 5.1.7 shows. The only aspect of the RNN that we modify is the output matrix $\mathbf{E}$, which we change by flipping its rows:
$$\mathbf{E} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \qquad (5.23)$$
Now the probability of ending the string at time step $t$ is
$$p_{\mathrm{SM}}(\textsc{eos} \mid \boldsymbol{y}_{<t}) = \frac{\exp[t-1]}{\exp[-t+1] + \exp[t-1]} = \frac{1}{\exp[-2(t-1)] + 1}. \qquad (5.24)$$
Compared to Eq. (5.21), the probability of eos in Eq. (5.24) does not diminish. Indeed, since $\frac{1}{\exp[-2(t-1)] + 1} \geq \frac{1}{2}$ for all $t \geq 1$, the sum
$$\sum_{t=0}^{\infty} p_{\mathrm{SM}}(\textsc{eos} \mid \boldsymbol{y}_{<t}) \qquad (5.25)$$
diverges, which, according to Proposition 2.5.6, implies that the sequence model is tight.

2 5.1.4 Elman and Jordan Networks


3 The characterization of dynamics maps we gave in Definition 5.1.4 allows for f to be an arbitrary
4 mapping from the previous state and the current input symbol to the new state. In this section, we
5 introduce two seminal and particularly simple parameterizations of this map—the simplest recurrent
6 neural sequence models. We term them Elman sequence models and Jordan sequence models, as
7 each is inspired by architectures proposed by Elman (1990) and Jordan (1986), respectively. The
8 definitions we present here are slightly different than those found in the original works—most notably,
9 both Elman and Jordan networks were originally defined for transduction (mapping an input string
10 to an output string, as with translation) rather than language modeling.
11 Put simply, these two models restrict the form of the dynamics map f in the definition of an
12 RNN (cf. Definition 5.1.1). They define particularly simple relationships between the subsequent
13 hidden states, which are composed of affine transformations of the previous hidden state and
14 the representation of the current input symbol passed through a non-linear activation function
15 (cf. Definition 5.1.6). The affine transformations are performed by different matrices and bias
16 vectors—the parameters of the model (cf. Assumption 3.2.2)—each transforming a separate part of
17 the input to the dynamics map.

Definition 5.1.7: Elman Sequence Model (Elman, 1990)

An Elman sequence model $\mathcal{R} = (\Sigma, D, \mathbf{U}, \mathbf{V}, \mathbf{E}, \boldsymbol{b}_h, \boldsymbol{h}_0)$ is a D-dimensional recurrent neural sequence model over an alphabet $\Sigma$ with the following dynamics map:
$$\boldsymbol{h}_t = \sigma\left(\mathbf{U} \boldsymbol{h}_{t-1} + \mathbf{V} \boldsymbol{e}'(y_t) + \boldsymbol{b}_h\right). \qquad (5.26)$$
Here, $\boldsymbol{e}'(\cdot) : \Sigma \to \mathbb{R}^R$ is the input symbol embedding function, which represents each symbol $y \in \Sigma$ as an $R$-dimensional vector, and $\sigma$ is an element-wise non-linearity.* Furthermore, $\boldsymbol{b}_h \in \mathbb{R}^D$, $\mathbf{U} \in \mathbb{R}^{D \times D}$, and $\mathbf{V} \in \mathbb{R}^{D \times R}$.

* The symbol representations $\boldsymbol{e}'(y)$ are often also referred to as static symbol embeddings because they do not depend on the string surrounding or preceding $y$. Here, we treat them as any other parameters of the model, which can be learned using gradient-based learning (cf. §3.2). However, note that learning good static embeddings was a very active field before the emergence of the large end-to-end systems we see today. Very popular examples include Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017).
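The Elman update in Eq. (5.26) is a one-line computation; the numpy sketch below (our illustration; the parameter values are arbitrary placeholders) implements a single step of the dynamics map with a tanh non-linearity and one-hot input embeddings.

```python
import numpy as np

def elman_step(h_prev, y_onehot, U, V, b_h, sigma=np.tanh):
    """One Elman update, Eq. (5.26): h_t = sigma(U h_{t-1} + V e'(y_t) + b_h)."""
    return sigma(U @ h_prev + V @ y_onehot + b_h)

D, R = 4, 3                       # hidden and embedding dimensions
rng = np.random.default_rng(0)
U = rng.normal(size=(D, D))       # recurrence matrix
V = rng.normal(size=(D, R))       # input matrix
b_h = np.zeros(D)                 # hidden bias vector

h = np.zeros(D)                   # h_0
for y in [0, 2, 1]:               # a string of symbol indices
    e_y = np.eye(R)[y]            # one-hot input embedding e'(y_t)
    h = elman_step(h, e_y, U, V, b_h)
print(h.shape)                    # (4,)
```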

19 Due to its simplicity, the Elman RNN is also known as the vanilla RNN variant, emphasizing it
20 is one of the most fundamental variants of the framework.

1 On the symbol representations. Notice that in Eq. (5.26), the input symbols y t are first
2 transformed into their vector representations e1 py t q and then additionally linearly transformed
3 using the matrix V. This results in an over-parametrized network—since the symbols are already
4 embedded using the representation function e1 , the matrix V is theoretically superfluous and could
5 be replaced by the identity matrix. However, the matrix V could still be useful if the representations
6 e1 py t q are fixed—in this case, the matrix can be used by the RNN to transform the representations
7 during training to fit the training data better. This is especially useful if the symbol representations
8 e1 py t q already represent the input symbols in a compact representation space in which the parameters
9 can be shared across different symbols. Alternatively, we could represent the symbols using their
10 one-hot encodings, i.e., e1 py t q “ Jy t K, in which case the columns of the matrix V would correspond to
11 the symbol representations (analogously to the representation matrix E from Eq. (3.46)). However,
12 notice that in this case, the representations on the symbols do not share any parameters, and
13 each column of the matrix is therefore an unconstrained vector. Such matrix-lookup-based input
14 symbol representations from Eq. (5.26) are sometimes tied, i.e., e1 ¨ “ ep¨q, with the output symbol
15 representations from the embedding matrix E in the definition of the sequence model induced by an
16 RNN (cf. Definition 3.1.11 and Eq. (5.7)).
17 However, embedding tying is non-essential to representation-based LMs. The input symbol
18 embedding function can always be chosen independently with the output symbol embedding function
19 Definition 3.1.6.
20 The Jordan network is somewhat different in that it feeds the output logits computed through
21 the output matrix E into the computation of the next state, and not directly the hidden state.

Definition 5.1.8: Jordan Sequence Model (Jordan, 1986)

A Jordan sequence model is a D-dimensional recurrent neural sequence model over an alphabet $\Sigma$ with the following dynamics map:
$$\boldsymbol{h}_t = \sigma\left(\mathbf{U} \boldsymbol{r}_{t-1} + \mathbf{V} \boldsymbol{e}'(y_t) + \boldsymbol{b}_h\right) \qquad (5.27)$$
$$\boldsymbol{r}_t = \sigma_o\left(\mathbf{E} \boldsymbol{h}_t\right) \qquad (5.28)$$
Again, $\boldsymbol{e}'(\cdot) : \Sigma \to \mathbb{R}^R$ is the input symbol embedding function, which represents each symbol $y \in \Sigma$ as an $R$-dimensional vector, while $\sigma$ and $\sigma_o$ are element-wise non-linearities. Furthermore, $\boldsymbol{b}_h \in \mathbb{R}^D$, $\mathbf{U} \in \mathbb{R}^{D \times D}$, and $\mathbf{V} \in \mathbb{R}^{D \times R}$.

Notice that the hidden state $\boldsymbol{h}_t$ in Eq. (5.27) is not computed based on the previous hidden state $\boldsymbol{h}_{t-1}$, but rather on the transformed outputs $\boldsymbol{r}_{t-1}$—this is analogous to feeding the logits computed in Eq. (5.7) back into the computation of $\boldsymbol{h}$ instead of the previous hidden state. The sequence model induced by a Jordan network is then directly induced by the logits $\boldsymbol{r}_t$ (i.e., the conditional probabilities are computed by putting $\boldsymbol{r}_t$ through the softmax).
In both architectures, the activation function $\sigma$ can be any suitable element-wise function. The canonical choices have been the sigmoid and tanh functions; however, a more common choice nowadays is the ReLU function or any of its more modern variants.⁶
Since we will refer to the individual matrices defining the dynamics maps in Elman and Jordan networks quite a lot in the next subsections, we give them specific names. The matrix $\mathbf{U}$, which linearly transforms the previous hidden state (or the output), is the recurrence matrix. The matrix $\mathbf{V}$, which linearly transforms the representations of the input symbol, is called the input matrix. Lastly, the matrix which linearly transforms the hidden state before computing the output values $\boldsymbol{r}_t$ with an activation function is called the output matrix. $\boldsymbol{b}_h$ is the hidden bias vector.

⁶ See Goodfellow et al. (2016, §6.3.1) for an overview of modern activation functions used in neural networks.

4 Tightness of Elman and Jordan Recurrent Neural Networks As a simple corollary of


5 Corollary 5.1.1, we can characterize the tightness of Elman and Jordan recurrent neural networks as
6 follows.

Corollary 5.1.2: Tightness of simple RNNs

Elman and Jordan RNNs with a bounded activation function σ and the softmax projection
function are tight.
7

1 5.1.5 Variations on Recurrent Networks


In the previous sections, we introduced the two simplest RNN variants: the Elman (Elman, 1990) and Jordan (Jordan, 1997) networks. Even though such simple RNNs are, in theory, all we need to model any computable language, those architectures empirically face many challenges. Among the biggest are the vanishing and exploding gradient problems (Hochreiter and Schmidhuber, 1997), which in practice are linked to the issue of learning long-term dependencies in language.
7 In this subsection, we expand our repertoire of RNN variants by going beyond the simple recurrent
8 dynamics defined by the Elman and Jordan update rules. To do so, we take a step back and return
9 to Definition 5.1.1 of a recurrent neural network R as the tuple pΣ, D, f , h0 q . We will define more
10 elaborate dynamics maps f which both aim to tackle some of the (empirically encountered) challenges
11 of simpler variants as well as improve some theoretical aspects of the networks. Importantly, keep
12 in mind that the only aspect of the RNN we will strive to modify is the dynamics map—that is,
13 the mapping from ht´1 to ht . Given a hidden state, the definition of a sequence model will remain
14 identical.
15 A common component of the more complex dynamics maps we explore in this section is the
16 gating mechanism, which is why we start with it.

17 Gating
18 The update equations of Elman and Jordan RNNs define relatively simple transformations of the
19 hidden states as an affine transformation of the previous hidden state and the new input, followed
20 by some form of non-linearity. In this sense, the interaction between the previous hidden state and
21 the input symbol is relatively limited—the hidden state is transformed by the recurrence matrix U
22 at every time step invariant to the input symbol being read. To see why this could be a limiting
23 factor, consider the following example.

Example 5.1.9: RNN Gates

Consider the language L “ tan bn cn xam bm cm | n, m P Ně0 u. It intuitively consists of two-part


strings, where the two parts are separated by a symbol x. The part on the left side of x
contains a sequence of n a’s followed by n b’s, which is followed by n c’s. The substring on the
right side of x contains a sequence of m a’s which is again followed by m b’s, and later by m c’s.
Both parts of the string can be arbitrarily long, and, intuitively, to correctly recognize a string
in this language, a computational model has to keep the information about the number of a’s
while reading in b’s to be able to ensure there is a correct number of c’s as well. This creates
a long-term dependency across the entire block of b’s. However, notice that, after reading the
symbol x, the information about the number of a’s becomes irrelevant to the recognition of
the string: the model can, therefore, discard it and solely focus on modeling the rest of the
string, which again requires keeping track of the number of the symbol occurrences. In other
words: after a certain amount of time, previous information becomes irrelevant, and we may
want to design a network that is able to select which information is important to keep around.
24

25 To enable richer interaction between the transformation of the RNN hidden state and the input
26 symbols, we introduce the gating mechanism. Intuitively, the gating mechanism enables more
27 fine-grained control over the transformations of the hidden state by “selecting” which aspects of the
28 hidden state should be retained, which should be modified, and which should be deleted—in general,

1 based on both the previous hidden state as well as the current input symbol. Such transformations
2 are defined using gates and gating functions.

Definition 5.1.9: Gate

A gate is a real-valued vector $\boldsymbol{g} \in \mathbb{R}^D$ such that $g_d \in [0, 1]$ for all $d \in \{1, \ldots, D\}$. Gates are computed using gating functions, i.e., functions whose outputs live in $[0, 1]^D$.

The fact that every dimension in a gate $\boldsymbol{g}_t$ takes a value between 0 and 1 invites a natural interpretation of the values as soft switches, analogous to how switches are used in electrical engineering. Intuitively, in the context of RNNs, where information is passed around in the hidden states $\boldsymbol{h}_t$, a gate of the same dimensionality as the hidden state can control which aspects (dimensions) of the hidden state should be forgotten (switched off) and which ones retained (kept on)—a gate value close to 0 can be interpreted as a signal that the information captured in the corresponding dimension of the hidden state should be “forgotten”, and a gate value close to 1 as the opposite. Such modifications of the hidden state are performed using an element-wise multiplication of the hidden state $\boldsymbol{h}$ and the gate $\boldsymbol{g}$, which we denote by $\boldsymbol{h} \odot \boldsymbol{g}$.
Importantly, the gates can be computed based on the information about the string seen so far as well as the new input symbol—this means that the decision on what should be remembered and what should be forgotten can be made for each situation individually. This allows RNN variants using gating to implement mechanisms that tackle challenges such as the one described in Example 5.1.9. Furthermore, gating not only enables RNNs to selectively keep information about the string, but also helps combat the vanishing and exploding gradient problems (Hochreiter and Schmidhuber, 1997, Appendix 2). We next consider two of the best-known gated RNNs: Long Short-Term Memory and Gated Recurrent Unit networks.

21 Long Short-term Memory Networks

22 Long Short-term Memory Networks (LSTM, Hochreiter and Schmidhuber, 1997) are perhaps the best-
23 known type of a gated recurrent network. They were introduced specifically to combat the vanishing
24 gradient problem in the famous paper with more than 80 000 citations. The somewhat unusual
25 name comes from connections to human memory, in which short-term memory is characterized by
26 evanescent neural activations, and long-term memory is based on the growth and structural change
27 in neuron connections (Hebb, 1949).

28 The LSTM unit. LSTM RNNs are built from the LSTM units, which implement the RNN
29 dynamics map and therefore perform the RNN update step. To transform the hidden state ht at
30 each time step, an LSTM network additionally keeps another running summary of the string y ăt , on
31 which the recurrent update depends—this is the so-called memory cell which we will denote by ct .
32 Informally, one can think of the context as the information needed to decide on how to transform
33 the hidden state at each individual time step, depending on the input string. The formal definition
34 of the LSTM cell is the following.

Definition 5.1.10: Long Short-Term Memory

A long short-term memory unit is a recurrent neural network with the dynamics map defined through the following sequence of computations:
$$\boldsymbol{i}_t = \sigma\left(\mathbf{U}^i \boldsymbol{h}_{t-1} + \mathbf{V}^i \boldsymbol{e}'(y_t) + \boldsymbol{b}^i\right) \qquad \text{(input gate)}$$
$$\boldsymbol{f}_t = \sigma\left(\mathbf{U}^f \boldsymbol{h}_{t-1} + \mathbf{V}^f \boldsymbol{e}'(y_t) + \boldsymbol{b}^f\right) \qquad \text{(forget gate)}$$
$$\boldsymbol{o}_t = \sigma\left(\mathbf{U}^o \boldsymbol{h}_{t-1} + \mathbf{V}^o \boldsymbol{e}'(y_t) + \boldsymbol{b}^o\right) \qquad \text{(output gate)}$$
$$\boldsymbol{g}_t = \tanh\left(\mathbf{U}^g \boldsymbol{h}_{t-1} + \mathbf{V}^g \boldsymbol{e}'(y_t) + \boldsymbol{b}^g\right) \qquad \text{(candidate vector)}$$
$$\boldsymbol{c}_t = \boldsymbol{f}_t \odot \boldsymbol{c}_{t-1} + \boldsymbol{i}_t \odot \boldsymbol{g}_t \qquad \text{(memory cell)}$$
$$\boldsymbol{h}_t = \boldsymbol{o}_t \odot \tanh(\boldsymbol{c}_t) \qquad \text{(hidden state)}$$
Here, $\boldsymbol{i}_t$, $\boldsymbol{f}_t$, and $\boldsymbol{o}_t$ are the input, forget, and output gates, $\boldsymbol{c}_t$ is the memory cell vector, and $\boldsymbol{g}_t$ is the candidate vector. $\sigma$ refers to the original sigmoid function.
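For concreteness, here is a compact numpy sketch of one LSTM update (ours, not from the text; the parameters are random placeholders). It follows the six equations of Definition 5.1.10 literally.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, e_y, params):
    """One LSTM update (Definition 5.1.10)."""
    Ui, Vi, bi, Uf, Vf, bf, Uo, Vo, bo, Ug, Vg, bg = params
    i = sigmoid(Ui @ h_prev + Vi @ e_y + bi)   # input gate
    f = sigmoid(Uf @ h_prev + Vf @ e_y + bf)   # forget gate
    o = sigmoid(Uo @ h_prev + Vo @ e_y + bo)   # output gate
    g = np.tanh(Ug @ h_prev + Vg @ e_y + bg)   # candidate vector
    c = f * c_prev + i * g                     # memory cell
    h = o * np.tanh(c)                         # hidden state
    return h, c

D, R = 4, 3
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(D, D), (D, R), (D,)] * 4]
h, c = np.zeros(D), np.zeros(D)
h, c = lstm_step(h, c, np.eye(R)[1], params)
print(h.shape, c.shape)  # (4,) (4,)
```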

As we can see, the update rule of an LSTM network is considerably more complex than that of an Elman RNN. It is also computationally more expensive, as it involves more matrix multiplications. However, LSTMs have consistently shown improved performance compared to vanilla RNNs and are therefore, together with GRUs, considered the go-to choice for an RNN architecture (Goodfellow et al., 2016). The theoretical reason for their success is that their gating mechanism helps to reduce the vanishing/exploding gradient problem, and thus to learn long-term dependencies (Hochreiter and Schmidhuber, 1997, Appendix 2).

9 The names of the different quantities computed in Definition 5.1.10 reflect their intuitive
10 interpretations. The input, forget, and output vectors are all gates: they control the information
11 which will be added to the memory cell based on the new input, the information which will be
12 retained or forgotten from the previous memory cell, and the information which will be transferred
13 from the memory cell to the hidden state, respectively. Notice the identical nature in which all
14 three gates are computed: they are non-linear transformations of affine transformations of the
15 previous hidden state and the input symbol representations. Their parameter matrices define the
16 way in which the gates will influence the memorization, forgetting, and addition of information. The
17 additional information added to the memory cell in the form of the candidate vector gt is computed
18 similarly, with the only difference being the activation function. This is the step that bears the most
19 resemblance to the update step of the vanilla RNN (Eq. (5.26)). However, compared to the latter,
20 only parts of this transformation are kept (based on the input and forget vectors it and ft ). The
21 memory cell ct then combines the old memory content ct´1 with the newly integrated information
22 in gt to form the new memory content, which is then transformed using the tanh function and
23 combined with the output gate to produce the hidden state ht . This is pictorially presented in
24 Fig. 5.3.
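For concreteness, the following is a minimal NumPy sketch of a single LSTM update step following the sequence of computations in Definition 5.1.10. The function names, the toy dimensionalities, and the random initialization are illustrative assumptions rather than part of the definition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, e_y, params):
    """One LSTM update: gates, candidate vector, memory cell, hidden state."""
    Ui, Vi, bi, Uf, Vf, bf, Uo, Vo, bo, Ug, Vg, bg = params
    i = sigmoid(Ui @ h_prev + Vi @ e_y + bi)      # input gate
    f = sigmoid(Uf @ h_prev + Vf @ e_y + bf)      # forget gate
    o = sigmoid(Uo @ h_prev + Vo @ e_y + bo)      # output gate
    g = np.tanh(Ug @ h_prev + Vg @ e_y + bg)      # candidate vector
    c = f * c_prev + i * g                        # memory cell
    h = o * np.tanh(c)                            # hidden state
    return h, c

# Toy dimensions (hidden size 4, symbol representation size 3), illustrative only.
rng = np.random.default_rng(0)
D, R = 4, 3
params = tuple(rng.normal(size=s) for s in [(D, D), (D, R), (D,)] * 4)
h, c = np.zeros(D), np.zeros(D)
h, c = lstm_step(h, c, rng.normal(size=R), params)
print(h.shape, c.shape)
```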

25 As mentioned, the LSTM update step is noticeably more computationally complex than that of
26 a vanilla RNN. This has led to a line of work trying to combine the efficiency of vanilla RNNs and
27 the empirical performance of gated RNNs. In the next subsection, we consider Gated Recurrent
28 Units, one of the best-known compromises found in this domain.

Figure 5.3: A pictorial depiction of the LSTM cell in action. The input i_t, forget f_t, and output o_t gates control which information of the input and of the previous hidden state is retained in the memory cell, and which information is passed to the next hidden state.

1 Gated Recurrent Units

The Gated Recurrent Unit (GRU, Cho et al., 2014b,a) provides a compromise between the simplicity of vanilla recurrent neural networks and the LSTM's empirical success at modeling long-term dependencies. It defines a gated recurrent update unit that implements a simpler dynamics map by removing the memory cell c_t of the LSTM and combining the input and forget gates i_t, f_t into a single update gate. These changes make GRUs more memory-efficient and easier to train than LSTMs in practice. The full GRU update step is defined as follows.

Definition 5.1.11: Gated Recurrent Units

A gated recurrent unit defines a dynamics map in which a new hidden state is computed as:

    r_t = σ(U^r h_{t−1} + V^r e′(y_t) + b^r)             (reset gate)
    z_t = σ(U^z h_{t−1} + V^z e′(y_t) + b^z)             (update gate)
    g_t = tanh(U^g (r_t ⊙ h_{t−1}) + V^g e′(y_t) + b^g)  (candidate vector)
    h_t = (1 − z_t) ⊙ g_t + z_t ⊙ h_{t−1}

r_t, z_t are known as the reset and update gates, and g_t as the candidate vector.

Intuitively, the update gate works like a hot/cold water mixing valve: it is trained to find the optimal blend of the information in the candidate vector with that coming from the previous hidden state. The reset gate, in turn, can zero out the information of the previous hidden state when computing the candidate vector. This allows the model to forget past information that has become irrelevant, exactly as in the LSTM architecture.
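Analogously to the LSTM sketch above, a single GRU update step can be sketched in a few lines; the parameter names and the toy usage are illustrative assumptions only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, e_y, Ur, Vr, br, Uz, Vz, bz, Ug, Vg, bg):
    """One GRU update (Definition 5.1.11)."""
    r = sigmoid(Ur @ h_prev + Vr @ e_y + br)          # reset gate
    z = sigmoid(Uz @ h_prev + Vz @ e_y + bz)          # update gate
    g = np.tanh(Ug @ (r * h_prev) + Vg @ e_y + bg)    # candidate vector
    return (1 - z) * g + z * h_prev                   # interpolated new hidden state

# Toy usage (dimensions and random parameters are illustrative only).
rng = np.random.default_rng(0)
D, R = 4, 3
params = [rng.normal(size=s) for s in [(D, D), (D, R), (D,)] * 3]
h = gru_step(np.zeros(D), rng.normal(size=R), *params)
print(h.shape)
```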

2 Parallelizability: The Achilles’ Heel of Recurrent Neural Networks

3 It is easy to see from the definition of the RNN hidden state (cf. Definition 5.1.3) that, to compute
4 ht , we have to compute ht1 for all t1 ă t first. Another way to say this is that RNNs are inherently
5 sequential models, processing the input string one symbol at a time to update their hidden state
6 ht , and using this hidden state in turn to compute ht`1 . This results in perhaps the biggest
7 shortcoming of the architecture for its applicability to real-world language modeling: The inability
8 to efficiently parallelize the processing (encoding) of long strings. Let us first consider what we
9 mean by the parallelizability of a language model architecture.7 Due to this sequential nature, the
10 training procedure of RNNs is difficult to parallelize effectively, leading to slower training times.
11 This characteristic poses a significant challenge when modeling long strings, as the computation for
12 each element is dependent on the computation of the previous element, leading to a bottleneck in
13 the training process.
14 In short, in our specific use case of language modeling, parallelization refers to the division of the
15 processing of a specific string across multiple computational nodes, such that any specific node only
16 performs a subset of operations required to process the entire string—the results of the subsets of
17 the operations are then combined to build the representation of the entire string. Importantly, the
18 computations should be performed independently between nodes in the sense that no node has to
19 wait for any other node to provide it the results of its computation. Being able to parallelize large
20 models across computational nodes has led to some of the biggest advancements in modern deep
21 learning. As such, parallelizability is a crucial feature of any successful deep learning architecture.
22 However, notice that any dependence between the computations performed by the nodes defeats
23 the purpose of parallelization—if the nodes have to wait for each other to finish computations,
24 the same operations might as well be performed by a single node. This is where recurrent neural
25 networks fall short: the computations required to encode a string y into the hidden state h|y| will
26 always be sequential, preventing their distribution across different nodes.

27 Parallelizability in language modeling. When talking about parallelizing language models, it


28 is important to think about which parts can actually be parallelized. In the case of RNNs, we saw
29 that no part of the processing can be (besides the matrix multiplication in a single update rule)—the
30 length of the longest chain of dependent computation will always scale linearly with the length of
31 the string. In the next section, we introduce transformers, a recent neural network architecture first
32 introduced for processing text. One of the big contributions of transformers is their parallelizability
during training—it enables their training on extremely large corpora and is thus one of the main reasons behind the success of many modern large language models. Parallelizability during training is crucial—notice that parallelization is, in fact, not possible
36 during generation from a locally normalized language model (cf. Definition 2.4.5)—by definition,
37 such models will generate one symbol at a time. To compute the representation of the new sentence
38 (or the new prefix), which is required for the generation of the next symbol, the generated (sampled)
39 symbol has to first be determined, which leaves their generation process inherently sequential. In
40 that respect, RNNs are as parallelizable as they can be during generation. However, the sequential
7 This section provides a very brief and intuitive treatment of parallelization. Our main goal is simply to point out

this shortcoming of RNN LMs and with it motivate the next neural architecture we will introduce: transformers.

computation of the hidden states prevents the parallelization of the computation of enc_R(y_{≤t}) even if the whole string is already given.
We will see that the big parallelizability improvements of other architectures only come into play during training, when the model is given the whole string in advance (such that no part of the string has to be sequentially generated) and can compute the (log-)likelihood of the given ground-truth next symbols given the context. That is, during training and given a string y ∈ Σ*, the model simply has to compute p_SM(y_t | y_{<t}) for all t = 1, ..., T—in representation-based language models (cf. Definition 3.1.11) this depends on enc(y_{<t}). Crucially, computing p_SM(y_t | y_{<t}) is all that we need for training a language model (in the simplest case that we analyze): the computed log-likelihood of the ground-truth next symbol y_t enters the loss function and is used for gradient-based updates to the parameters, as discussed in §3.2.3. In the next section, we will see how enc(y_{<t}) can be computed without sequential dependencies. However, in the case of RNNs, enc(y_{<t}) can only be computed sequentially—even if the entire string is known in advance. This results in a crucial bottleneck when training RNNs on large corpora and vastly limits their applicability to implementing large language models.
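The sequential bottleneck can be made concrete with a small sketch: even with the full string available, the per-position conditionals of an RNN LM have to be computed inside a loop whose iterations depend on one another. The tanh Elman-style update, the parameter names, and the toy dimensions below are illustrative assumptions.

```python
import numpy as np

def rnn_prefix_log_likelihoods(y_ids, U, V, b, E, h0, emb):
    """Per-symbol log-probabilities log p_SM(y_t | y_<t) under a tanh Elman RNN LM.
    Even though the whole string is known in advance, the loop cannot be
    parallelized across t: computing enc(y_<t) = h_{t-1} requires h_{t-2}."""
    h, out = h0, []
    for y in y_ids:
        logits = E @ h                                  # scores over the alphabet
        out.append(logits[y] - np.log(np.exp(logits).sum()))
        h = np.tanh(U @ h + V @ emb[y] + b)             # the sequential dependency
    return out

# Toy instantiation (all shapes and values are illustrative assumptions).
rng = np.random.default_rng(0)
D, R, A = 8, 5, 4                                       # hidden, embedding, alphabet sizes
U, V, b = rng.normal(size=(D, D)), rng.normal(size=(D, R)), np.zeros(D)
E, emb, h0 = rng.normal(size=(A, D)), rng.normal(size=(A, R)), np.zeros(D)
print(sum(rnn_prefix_log_likelihoods([0, 2, 1, 3], U, V, b, E, h0, emb)))
```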

16 5.1.6 Representational Capacity of Recurrent Neural Networks


17 Recurrent neural networks are one of the fundamental and most successful neural language model
18 architectures. In this section, we study some theoretical explanations behind their successes as well
19 as some of their theoretical limitations. Answering this question is essential whenever we require
20 formal guarantees of the correctness of the outputs generated by an LM. For example, one might ask
21 a language model to solve a mathematical problem based on a textual description (Shridhar et al.,
22 2023) or ask it to find an optimal solution to an everyday optimization problem (Lin et al., 2021b).
23 If such problems fall outside the theoretical capabilities of the LM, we have no ground to believe
24 that the result provided by the model is correct. The question also follows a long line of work on
25 the linguistic capabilities of LMs, as LMs must be able to implement mechanisms of recognizing
26 specific syntactic structures to generate grammatical sequences (Talmor et al., 2020; Hewitt and
27 Manning, 2019; Jawahar et al., 2019; Liu et al., 2019; Icard, 2020a; Manning et al., 2020; Rogers
et al., 2021; Belinkov, 2022; Delétang et al., 2022, inter alia).
29 One way of quantifying the expressive power of computational models is with the complexity
of formal languages they can recognize (Delétang et al., 2022)—we, too, will study the classes of
31 (weighted) formal languages (such as the regular languages and the Turing computable languages)
32 they can express. Through this, diverse formal properties of modern LM architectures have been
33 shown (e.g., Siegelmann and Sontag, 1992; Hao et al., 2018; Korsky and Berwick, 2019; Merrill,
34 2019; Merrill et al., 2020; Hewitt et al., 2020; Merrill et al., 2022a,b, inter alia). Inspecting complex
35 models such as recurrent neural networks through the lens of formal language theory allows us to
36 apply the well-studied theoretical results and understanding from the field to the recently more
37 successful neural models. While studying neural language models, we will revisit various aspects of
38 the classical language models introduced in Chapter 4—indeed, this was also our main motivation
39 for studying those closely.
40 Specifically, we will focus mainly on the Elman recurrent neural networks due to their simplicity
41 and their role as the “fundamental” RNN, capturing their recurrent nature. We will also briefly
42 touch upon the computational power of LSTM networks due to their somewhat different theoretical
43 properties. However, note that most of the results presented in the section generalize to other
44 architectures as well. We begin by investigating Elman RNNs in a practical setting, that is, under

Figure 5.4: A non-determinizable WFSA.

1 relatively strict and realistic assumptions, such as fixed-point arithmetic. We show that Elman RNNs
2 under such a regime are in fact equivalent to weighted finite-state automata. Next, in the second
3 subsection, we show that under more permissive assumptions of infinite precision and unbounded
4 computation time, Elman RNNs are Turing complete.

5 Equivalence of Formalisms and Language Homomorphisms

Example 5.1.10: A non-determinizable WFSA

The WFSA in Fig. 5.4 is not determinizable. While the formal proof (and a discussion of what it even means to be determinizable formally) is beyond the scope of this work,^a note that the intuition behind the issue is that we can arrive at the states q_1 and q_2 with the same string (a), and then loop over the b-labeled self-loop in both states. However, these self-loops have different weights, which means that they give rise to paths with the same label but with different weights. Because we can take the self-loops arbitrarily many times, there are infinitely many pairs of paths with the same label yet different weights. If we wanted the WFSA to be deterministic, paths with the same label would have to lead to the same state (otherwise, we would create non-determinism at some point). However, since we can construct infinitely many paths with the same label but with different weights, we cannot "group" them into the same state. That is why the WFSA cannot be determinized.
^a As always, more on this can be found in the Advanced formal language theory course.
6

7 RNNs and Weighted Regular Languages


8 Analyzing complex systems with intricate interactions between inputs and parameters and temporal
9 dependencies can be tricky. This is a common issue when studying neural networks in general. In
10 fact, most, if not all, theoretical frameworks for analyzing neural models such as RNNs rely on

1 various assumptions about their components to make the analysis feasible. For example, theoretical
2 results on neural networks (for example, optimization guarantees or function/system identifiability
3 guarantees) often make the assumption that the activation functions are linear or of some other
4 easy-to-analyze form. Similarly, a fruitful manner to analyze the expressivity of recurrent neural
5 networks specifically is by making (somewhat different) simplifying assumptions on the non-linear
6 activation functions, since those are what often make analysis difficult. A common simplification is
7 the use of the Heaviside activation function.

Definition 5.1.12: Heaviside function

The Heaviside function is defined as

    H(x) = 1 if x > 0, and H(x) = 0 otherwise.    (5.29)

In words, the Heaviside function maps every real value to either 0 or 1, depending on whether or not it is greater than zero. In the following, we will denote the set {0, 1} by B ≝ {0, 1}.

Definition 5.1.13: Heaviside Elman Network

A Heaviside Elman network (HRNN) is an Elman network with the Heaviside function H as the non-linearity.

Elman network parameters. Importantly, note that the parameters of the network do not have to be elements of B—we assume those can take arbitrary real (or rational) values. Indeed, networks constrained to parameters θ ∈ B would only be able to recognize unweighted languages. Furthermore, for this section, we expand our definition of a real- or rational-weighted RNN to allow the weights ∞ and −∞. While those are not real (or rational) numbers, we will see that they become useful when we want to explicitly exclude specific sequences from the support of the model, i.e., when we want to assign probability 0 to them.
Before we move to the central result of the subsection, we first introduce a fact that makes it easier to talk about how an RNN language model can simulate a deterministic PFSA. We will be interested in conjoining elements of vectors in B^D, which can be performed by an Elman RNN with appropriately set parameters.

Fact 5.1.1: Performing the AND operation with a neural network

Let m ∈ [D], i_1, ..., i_m ∈ [D], and x, v ∈ B^D with v_i = 1{i ∈ {i_1, ..., i_m}}. Then, H(v^⊤ x − (m − 1)) = 1 if and only if x_{i_k} = 1 for all k = 1, ..., m. In other words, H(v^⊤ x − (m − 1)) = x_{i_1} ∧ ⋯ ∧ x_{i_m}.
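Fact 5.1.1 can be checked directly with a few lines of NumPy; the dimensionality and the index set below are arbitrary illustrative values.

```python
import numpy as np

def heaviside(x):
    return (x > 0).astype(int)

D = 5
idx = [0, 2, 3]                       # i_1, ..., i_m (0-indexed here)
m = len(idx)
v = np.zeros(D, dtype=int)
v[idx] = 1                            # v_i = 1{i in {i_1, ..., i_m}}

def conj(x):
    # H(v^T x - (m - 1)) equals the conjunction x_{i_1} AND ... AND x_{i_m} for binary x.
    return heaviside(np.array([v @ x - (m - 1)]))[0]

x1 = np.array([1, 0, 1, 1, 0])        # all selected coordinates are 1
x2 = np.array([1, 0, 0, 1, 0])        # one selected coordinate is 0
print(conj(x1), conj(x2))             # prints: 1 0
```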

24 The central result. The central result of this section is captured in the following theorem.

Theorem 5.1.2: Equivalence of Heaviside Elman RNNs and WFSAs

Heaviside Elman RNNs are equivalent to deterministic probabilistic finite-state automata.


1

2 Notice that we only make the claim for probabilistic WFSA. This is without loss of generality, as,
3 from Theorem 4.1.1, we know we can assume A is locally normalized. We will prove Theorem 5.1.2
4 by showing that an RNN with Heaviside activations is at most regular, and then showing how such
5 an RNN can in fact simulate any deterministic PFSA. We show each direction as its own lemma.

Lemma 5.1.1
The distribution represented by a recurrent neural network with a Heaviside non-linearity H
is regular.
6

Proof. Let R = (Σ, D, U, V, E, b_h, h_0) be an HRNN defining the conditional probabilities p_SM. We construct a deterministic PFSA A = (Σ, Q, δ, λ, ρ) defining the same string probabilities. Let s : B^D → Z_{2^D} be a bijection, i.e., an ordering of the possible hidden states. For every hidden state h ∈ B^D, we construct a state q = s(h) ∈ Q and, for every y ∈ Σ, a transition q --y/w--> q′ where q′ = s(H(U h + V ⟦y⟧ + b_h)) and the weight is w = p_SM(y | h) = f_{Δ^{|Σ|−1}}(E h)_y. We define the initial function as λ(s(h)) ≝ 1{h = h_0} and the final function ρ with ρ(q) ≝ p_SM(eos | s^{−1}(q)).
It is easy to see that A defined this way is deterministic. We now prove that the weights assigned to strings by A and R are the same. Let y ∈ Σ* with |y| = T and let

    π = ( q_1 --y_1/w_1--> q_2, ..., q_T --y_T/w_T--> q_{T+1} ),    with q_1 = s(h_0),

be the y-labeled path starting in s(h_0) (such a path exists since the defined automaton is complete—all possible transitions are defined in all states). Then

    A(y) = λ(q_1) · [ ∏_{t=1}^{T} w_t ] · ρ(q_{T+1})
         = 1 · ∏_{t=1}^{T} p_SM(y_t | s^{−1}(q_t)) · p_SM(eos | s^{−1}(q_{T+1}))
         = p_LN(y),

which is exactly the weight assigned to y by R. Note that all paths not starting in s(h_0) have weight 0 due to the definition of the initial function. ■

18 Let us look at an example of the construction above.



Figure 5.5: The WFSA corresponding to the RNN defined in Eq. (5.30).

Example 5.1.11: A PFSA simulating an RNN

Let R = (Σ, D, f, E, h_0) be a Heaviside RNN sequence model with the parameters

    Σ = {a, b}                                                      (5.30)
    D = 2                                                           (5.31)
    f(h_{t−1}, y_t) = H( (1 0; 0 1) h_{t−1} + (1 0; 0 1) ⟦y_t⟧ )    (5.32)
    E = (1 0; 0 1; 1 −∞)                                            (5.33)
    h_0 = (0, 0)^⊤                                                  (5.34)

and n(a) = 1, n(b) = 2, and n(eos) = 3 (the rows of E correspond to a, b, and eos in this order). The automaton corresponding to this RNN contains the states q_{ij} corresponding to the hidden states h = (i, j)^⊤. It is shown in Fig. 5.5; as we can see, the automaton has a number of useful states exponential in the dimensionality of the hidden state, meaning that the RNN is a very compact way of representing it.
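The construction in the proof of Lemma 5.1.1 can be sketched programmatically: enumerate the reachable binary hidden states and read off the transition and final weights. The sketch below uses the parameters of Example 5.1.11, with a large negative constant standing in for the −∞ entry of E (a simplification of the extended-real convention), and only prints the resulting automaton.

```python
import numpy as np
from collections import deque

NEG_INF = -1e9                    # finite stand-in for the -infinity entries of E

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def heaviside(x):
    return (x > 0).astype(int)

def hrnn_to_pfsa(U, V, b, E, h0, alphabet):
    """Enumerate the hidden states reachable by a Heaviside Elman RNN and read off
    the transitions and weights of an equivalent deterministic PFSA, mirroring the
    construction in the proof of Lemma 5.1.1 (restricted to reachable states)."""
    start = tuple(int(x) for x in h0)
    queue, seen = deque([start]), {start}
    transitions, final = [], {}
    while queue:
        h = queue.popleft()
        hv = np.array(h, dtype=float)
        probs = softmax(E @ hv)                       # rows of E: alphabet symbols, then eos
        final[h] = probs[-1]
        for k, y in enumerate(alphabet):
            e_y = np.zeros(len(alphabet)); e_y[k] = 1.0
            h_next = tuple(int(x) for x in heaviside(U @ hv + V @ e_y + b))
            transitions.append((h, y, probs[k], h_next))
            if h_next not in seen:
                seen.add(h_next)
                queue.append(h_next)
    return transitions, final

# The HRNN from Example 5.1.11 (U = V = identity, no bias).
U = np.eye(2); V = np.eye(2); b = np.zeros(2)
E = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, NEG_INF]])
transitions, final = hrnn_to_pfsa(U, V, b, E, np.zeros(2), ["a", "b"])
for q, y, w, q_next in transitions:
    print(f"{q} --{y}/{w:.3f}--> {q_next}    final({q}) = {final[q]:.3f}")
```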

2 To show the other direction of Theorem 5.1.2, we now give a variant of a classic theorem originally
3 due to Minsky (1986) but with a probabilistic twist, allowing us to model weighted languages with
4 Elman RNNs.

Lemma 5.1.2: Elman RNNs can encode PFSAs

Let A = (Σ, Q, δ, λ, ρ) be a tight, deterministic probabilistic finite-state automaton. Then, there exists a Heaviside-activated Elman network with a hidden state of size D = |Σ||Q| that encodes the same distribution as A.
1

We give a proof by construction: Given a deterministic PFSA A = (Σ, Q, δ, λ, ρ), we construct an Elman RNN R = (Σ, D, U, V, E, b_h, h_0) accepting the same weighted language as A, i.e., L(A) = L(R), by defining the elements of the tuple (Σ, D, U, V, E, b_h, h_0). In the rest of the section, we first describe the construction intuitively, and then formally prove the central results that the construction relies on. Let n : Q × Σ → Z_{|Q||Σ|} be a bijection, i.e., an ordering of Q × Σ, m : Σ → Z_{|Σ|} an ordering of Σ, and m̄ : Σ̄ → Z_{|Σ̄|} an ordering of Σ̄ ≝ Σ ∪ {eos}; these mappings assign each element of their domain an integer, which can then be used to index into matrices and vectors, as we will see below. We use n, m, and m̄ to define the one-hot encodings ⟦·⟧ of state–symbol pairs and of the symbols. That is, we assume that ⟦q, y⟧_d = 1{d = n(q, y)}, and similarly for ⟦y⟧. Similarly to the proof of Lemma 5.1.1, we denote with s and s^{−1} the mapping from Q × Σ to the hidden-state space of the RNN and its inverse. The alphabet of the RNN, of course, matches that of the WFSA.

13 HRNN’s hidden states. The hidden states of the RNN live in B|Q||Σ| . A hidden state ht encodes
14 the state q t the simulated A is in at time t and the transition symbol y t with which A “arrived” at
15 q t as a one-hot encoding of the pair pq t , y t q. Formally,

ht “ Jpq t , y t qK P B|Q||Σ| . (5.35)

16 This also means that D “ |Q||Σ|. There is a small caveat: how do we set the incoming symbol
17 of A’s (sole) initial state qι (the first time it is entered)? A straightforward solution would be to
18 augment the alphabet of the RNN with the bos symbol (cf. §2.4), which we define to be the label
19 of the incoming arc denoting the initial state (this would be the only transition labeled with bos).
However, as we show later, the symbol used to arrive in a state does not have an effect on the subsequent
21 transitions—it is only needed to determine the target of the current transition. Therefore, we can
22 simply represent the initial state h0 of R with the one-hot encoding of any pair pqι , aq, where qι is
23 the initial state of the WFSA and a P Σ.
24 For example, for the fragment of a WFSA in Fig. 5.6, the hidden state encoding the current
state q and the incoming arc b is of the form presented in Eq. (5.36).
    h_t = (0, ..., 0, 1, 0, ..., 0)^⊤ ∈ B^{|Q||Σ|},    with the single 1 at position n(q, b).    (5.36)

Figure 5.6: A fragment of a WFSA: the state q is entered by a b-labeled arc and has outgoing arcs q --a/◦--> q^1 and q --b/◦--> q^2.

Figure 5.7: A high-level illustration of how the transition function of the FSA is simulated in Minsky's construction on a fragment of an FSA starting at q (encoded in h) and reading the symbol a. The top path disjoins the representations of the states in the out-neighborhood of q, whereas the bottom path disjoins the representations of states reachable by an a-transition. The Heaviside activation conjoins these two representations into h′ (rightmost fragment). Projecting Eh′ results in the vector defining the same probability distribution as the outgoing arcs of q (green box).

Encoding the transition function. The idea behind the definitions of U, V, and b_h is for the Elman update rule to perform, upon reading y_{t+1}, an element-wise conjunction between the representation of the out-neighborhood of q_t and the representation of the states that A can transition into after reading y_{t+1} from any state. The former is encoded in the recurrence matrix U, which has access to the current hidden state encoding q_t, while the latter is encoded in the input matrix V, which has access to the one-hot representation of y_{t+1}. Conjoining the entries of these two representations will, due to the determinism of A, result in a single non-zero entry: the one representing the state that can be reached from q_t (first component) using the symbol y_{t+1} (second component); see Fig. 5.7.
The recurrence matrix U lives in B^{|Σ||Q|×|Σ||Q|}. The main idea of the construction is for each column U_{:, n(q,y)} of the matrix to represent the "out-neighborhood" of the state q in the sense that the column contains 1's at the indices corresponding to the state–symbol pairs (q′, y′) such that A transitions from q to q′ after reading in the symbol y′. That is, for q, q′ ∈ Q and y, y′ ∈ Σ, we define

    U_{n(q′,y′), n(q,y)} ≝ 1{ q --y′/◦--> q′ ∈ δ }.    (5.37)
13 Since y is free, each column is repeated |Σ|-times: once for every y P Σ—this is why, after entering
14 the next state, the symbol used to enter it does not matter anymore and, in the case of the initial
15 state, any incoming symbol can be chosen to represent h0 .
For example, for the fragment of a WFSA in Fig. 5.6, the recurrence matrix contains, in the column indexed by n(q, b),

    U_{:, n(q,b)} = (0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0)^⊤,    with 1's at positions n(q^1, a) and n(q^2, b),    (5.38)

and the matrix-vector product U h_t with h_t from before results in

    U h_t = (0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0)^⊤,    with 1's at positions n(q^1, a) and n(q^2, b).    (5.39)

The input matrix V lives in B^{|Σ||Q|×|Σ|} and encodes the information about which states can be reached by which symbols (from any state in A). The non-zero entries in the column corresponding to y′ ∈ Σ correspond to the state–symbol pairs (q′, y′) such that q′ is reachable with y′ from some state:

    V_{n(q′,y′), m(y′)} ≝ 1{ ◦ --y′/◦--> q′ ∈ δ }.    (5.40)

For example, for the fragment of a WFSA in Fig. 5.8a, the input matrix contains, in the column indexed by m(b),

    V_{:, m(b)} = (0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0)^⊤,    with 1's at positions n(p, b) and n(q^2, b),    (5.41)

and the matrix-vector products V e_{m(a)} and V e_{m(b)} take the form (see also Fig. 5.8b)

    V e_{m(a)} = (0, ..., 0, 1, 0, ..., 0)^⊤,                   with a single 1 at position n(q^1, a),
    V e_{m(b)} = (0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0)^⊤,     with 1's at positions n(p, b) and n(q^2, b).    (5.42)

Lastly, we define the bias as b_h ≝ −1 ∈ R^{|Q||Σ|} (the vector with −1 in every entry), which allows the Heaviside function to perform the needed conjunction.
Figure 5.8: (a) An example of a fragment of a WFSA. (b) An example of a fragment of a WFSA.

To put these components together, consider that, at each step of the computation, R computes h_{t+1} = H(U h_t + V e_{m(a)} + b_h), where y_{t+1} = a. The input to the non-linearity is computed as follows:

    U h_t + V e_{m(a)} + b_h:  the entry at position n(q^1, a) equals 1 + 1 − 1 = 1, the entry at position n(q^2, b) equals 1 + 0 − 1 = 0, and all remaining entries equal −1; applying H therefore yields h_{t+1} = ⟦(q^1, a)⟧.    (5.43)

2 The following lemma proves that the construction described correctly implements the transition
3 function of the PFSA.

Lemma 5.1.3

Let A “ pΣ, Q, δ, λ, ρq be a deterministic PFSA, y “ y 1 . . . y T P Σ˚ , and q t the state arrived at


by A upon reading the prefix y ďt . Let R be the HRNN specified by the Minsky construction
for A, n the ordering defining the one-hot representations of state-symbol pairs by R, and ht
R’s hidden state after reading y ďt . Then, it holds that h0 “ Jpqι , yqK where qι is the initial
state of A and y P Σ and hT “ Jpq T , y T qK.
4

Proof. Define s(⟦(q, y)⟧) ≝ q, i.e., s maps a one-hot encoding of a state–symbol pair back to its state. We can then restate the lemma as s(h_T) = q_T for all y ∈ Σ* with |y| = T. Let π be the y-labeled path in A. We prove the lemma by induction on the string length T.

Base case: T = 0. Holds by the construction of h_0.

Inductive step: T > 0. Let y ∈ Σ* with |y| = T and assume that s(h_{T−1}) = q_{T−1}. We prove that the specifications of U, V, and b_h ensure that s(h_T) = q_T. By the definition of the recurrence matrix U (cf. Eq. (5.37)), the vector U h_{T−1} contains a 1 at the entries n(q′, y′) for q′ ∈ Q and y′ ∈ Σ such that q_{T−1} --y′/◦--> q′ ∈ δ. This can equivalently be written as U h_{T−1} = ⋁_{q_{T−1} --y′/◦--> q′ ∈ δ} ⟦(q′, y′)⟧, where the disjunction is applied element-wise.
On the other hand, by the definition of the input matrix V (cf. Eq. (5.40)), the vector V ⟦y_T⟧ contains a 1 at the entries n(q′, y_T) for q′ ∈ Q such that ◦ --y_T/◦--> q′ ∈ δ. This can also be written as V ⟦y_T⟧ = ⋁_{◦ --y_T/◦--> q′ ∈ δ} ⟦(q′, y_T)⟧.
By Fact 5.1.1, H(U h_{T−1} + V ⟦y_T⟧ + b_h)_{n(q′,y′)} = H(U h_{T−1} + V ⟦y_T⟧ − 1)_{n(q′,y′)} = 1 holds if and only if (U h_{T−1})_{n(q′,y′)} = 1 and (V ⟦y_T⟧)_{n(q′,y′)} = 1. This happens if and only if

    q_{T−1} --y′/◦--> q′ ∈ δ  and  ◦ --y_T/◦--> q′ ∈ δ,  i.e., if and only if  q_{T−1} --y_T/◦--> q′ ∈ δ,    (5.44)

that is, if and only if A transitions from q_{T−1} to q′ = q_T upon reading y_T (it transitions only to q_T due to determinism). Since the string y was arbitrary, this finishes the proof. ■

11 Encoding the transition probabilities. We now turn to the second part of the construction:
12 encoding the string acceptance weights given by A into the probability distribution defined by R.
13 We present two ways of doing that: using the more standard softmax formulation, where we make
14 use of the extended real numbers, and with the sparsemax.
The conditional probabilities assigned by R are controlled by the |Σ̄| × |Q||Σ|-dimensional output matrix E. Since h_t is a one-hot encoding of the state–symbol pair (q_t, y_t), the matrix-vector product E h_t simply looks up the values in the n(q_t, y_t)-th column. After being projected to the probability simplex, the entry in the projected vector corresponding to some y_{t+1} ∈ Σ̄ should match the probability of that symbol given that A is in the state q_t. This is easy to achieve by simply encoding the weights of the outgoing transitions into the n(q_t, y_t)-th column, in a way that depends on the projection function used. This is especially simple in the case of the sparsemax formulation. By definition, in a PFSA, the weights of the outgoing transitions and the final weight of a state q_t form a probability distribution for every q_t ∈ Q. Projecting those values to the probability simplex, therefore, leaves them intact. We can therefore define

    E_{m̄(y′), n(q,y)} ≝ ω(q --y′/·--> ◦)  if y′ ∈ Σ,    and    E_{m̄(eos), n(q,y)} ≝ ρ(q),    (5.45)

where ω(q --y′/·--> ◦) denotes the weight of the (unique) y′-labeled transition leaving q. Projecting the resulting vector E h_t, therefore, results in a vector whose entries represent the transition probabilities of the symbols in Σ̄.
In the more standard softmax formulation, we proceed similarly but take the logarithms of the non-zero transition weights. Defining log 0 ≝ −∞,^8 we set

    E_{m̄(y′), n(q,y)} ≝ log ω(q --y′/·--> ◦)  if y′ ∈ Σ,    and    E_{m̄(eos), n(q,y)} ≝ log ρ(q).    (5.46)

It is easy to see that the entries of the vector softmax(E h_t) form the same probability distribution as the original outgoing transitions out of q. Over the course of an entire input string, these weights are multiplied as the RNN transitions between different hidden states corresponding to the transitions in the original PFSA A.
^8 Note that the −∞ entries are only needed whenever the original WFSA assigns probability 0 to some transitions. In many implementations using softmax-activated probabilities, this would not be required.



Figure 5.9: An example of a fragment of a WFSA.

For example, for the fragment of a WFSA in Fig. 5.9, the output matrix contains, in the column indexed by n(q, b),

    E_{:, n(q,b)} = (−∞, ..., log w_1, ..., log w_2, ..., −∞)^⊤,    with log w_1 at position m̄(a) and log w_2 at position m̄(b).    (5.47)

This means that, if h_t encodes the state–symbol pair (q, y), the vector E h_t copies the selected column of E, which contains the output weights for all outgoing symbols y′ of q, i.e., the entry (E h_t)_{m̄(y′)} contains the weight on the arc q --y′/w--> ◦. Over the course of an entire input string y, these probabilities are simply multiplied as the RNN transitions between different hidden states corresponding to the transitions in the original WFSA A.
For example, for the fragment of a WFSA in Fig. 5.9, the matrix-vector product E h_t takes the form

    E h_t = (−∞, ..., log w_1, ..., log w_2, ..., −∞)^⊤,    with log w_1 at position m̄(a) and log w_2 at position m̄(b).    (5.48)

12 The equivalence of the produced RNN LM to the PFSA is shown in the following lemma.

Figure 5.10: The WFSA A.

Lemma 5.1.4

Let A “ pΣ, Q, δ, λ, ρq be a deterministic PFSA, y “ y 1 . . . y T P Σ˚ , and q t the state arrived at


by A upon reading the prefix y ďt . Let R be the HRNN specified by the Minsky construction
for A, E the output matrix specified by the generalized Minsky construction, n the ordering
defining the one-hot representations of state-symbol pairs by R, and ht R’s hidden state after
reading y ďt . Then, it holds that pLN pyq “ A pyq.
1

Proof. Let y ∈ Σ*, |y| = T, and let π be the y-labeled path in A. Again, let p(y) ≝ ∏_{t=1}^{|y|} p_SM(y_t | y_{<t}). We prove p(y) = ∏_{t=1}^{T} w_t by induction on T.

Base case: T = 0. In this case, y = ε, i.e., the empty string, and the empty product over π's transition weights equals 1. R likewise computes p(ε) = ∏_{t=1}^{0} p_SM(y_t | y_{<t}) = 1.

Inductive step: T > 0. Assume that p(y_1 ... y_{T−1}) = ∏_{t=1}^{T−1} w_t. By Lemma 5.1.3, we know that s(h_{T−1}) = q_{T−1} and s(h_T) = q_T. By the definition of E for the specific f_{Δ^{|Σ|−1}}, it holds that f_{Δ^{|Σ|−1}}(E h_{T−1})_{m(y_T)} = ω(s(h_{T−1}) --y_T/w_T--> s(h_T)) = w_T. This means that p(y_{≤T}) = ∏_{t=1}^{T} w_t, which is what we wanted to prove.
Clearly, p_LN(y) = p(y) p_SM(eos | y). By the definition of E (cf. Eq. (5.45)), (E h_T)_{m̄(eos)} = ρ(s(h_T)), meaning that

    p_LN(y) = p(y) p_SM(eos | y) = [ ∏_{t=1}^{T} w_t ] ρ(s(h_T)) = A(y).

Since y ∈ Σ* was arbitrary, this finishes the proof. ■

13 We now walk through an example of the Minsky construction.



Example 5.1.12: Minsky construction

Let A = (Σ, Q, δ, λ, ρ) be the WFSA shown in Fig. 5.10. Since A has |Q| = 3 states and an alphabet of |Σ| = 2 symbols, the hidden state of the representing RNN R will be of dimensionality 3 · 2 = 6. Assume that the set of state–symbol pairs is ordered as (q_0, a), (q_0, b), (q_1, a), (q_1, b), (q_2, a), (q_2, b). The initial state can be represented (choosing a as the arbitrary "incoming symbol") as

    h_0 = (1, 0, 0, 0, 0, 0)^⊤.    (5.49)

The recurrence matrix U of R is

    U = ( 0 0 1 1 0 0
          0 0 0 0 0 0
          1 1 0 0 0 0
          0 0 0 0 0 0
          0 0 0 0 0 0
          1 1 1 1 1 1 ),    (5.50)

the input matrix V is

    V = ( 1 0
          0 0
          1 0
          0 0
          0 0
          0 1 ),    (5.51)

and the output matrix E is

    E = ( log(0.1)  log(0.1)  log(0.5)  log(0.5)  −∞        −∞
          log(0.9)  log(0.9)  log(0.5)  log(0.5)  log(0.5)  log(0.5)
          −∞        −∞        −∞        −∞        log(0.5)  log(0.5) ),    (5.52)

where the last row corresponds to the symbol eos. The target of the b-labeled transition from q_0 (q_0 --b/0.9--> q_2) is computed as follows:

    h_1 = H(U h_0 + V ⟦b⟧ + b_h)
        = H( (0, 0, 1, 0, 0, 1)^⊤ + (0, 0, 0, 0, 0, 1)^⊤ + (−1, −1, −1, −1, −1, −1)^⊤ )
        = H( (−1, −1, 0, −1, −1, 1)^⊤ )
        = (0, 0, 0, 0, 0, 1)^⊤,

which corresponds exactly to the configuration in which A is in state q_2, which it arrived at by reading in the symbol b.
The probability of the string y = b under the locally normalized model induced by R can be computed as

    p_LN(y) = p_LN(b) = p_SM(b | bos) p_SM(eos | b) = p_SM(b | h_0) p_SM(eos | h_1)
            = softmax(E h_0)_b · softmax(E h_1)_eos
            = softmax( (log(0.1), log(0.9), −∞)^⊤ )_b · softmax( (−∞, log(0.5), log(0.5))^⊤ )_eos
            = 0.9 · 0.5 = 0.45.
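The computation in Example 5.1.12 can be verified numerically with a short NumPy sketch; the only liberty taken is replacing the −∞ entries of E with a large negative constant so that standard floating-point arithmetic applies.

```python
import numpy as np

NEG_INF = -1e9          # finite stand-in for the -infinity (log 0) entries of E

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def heaviside(x):
    return (x > 0).astype(float)

# Rows/columns ordered as (q0,a), (q0,b), (q1,a), (q1,b), (q2,a), (q2,b);
# rows of E ordered as (a, b, eos), as in Example 5.1.12.
U = np.array([[0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1, 1]], dtype=float)
V = np.array([[1, 0], [0, 0], [1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
E = np.array([[np.log(0.1), np.log(0.1), np.log(0.5), np.log(0.5), NEG_INF, NEG_INF],
              [np.log(0.9), np.log(0.9), np.log(0.5), np.log(0.5), np.log(0.5), np.log(0.5)],
              [NEG_INF, NEG_INF, NEG_INF, NEG_INF, np.log(0.5), np.log(0.5)]])
b_h = -np.ones(6)
h0 = np.zeros(6); h0[0] = 1.0          # one-hot encoding of (q0, a)
e_b = np.array([0.0, 1.0])             # one-hot encoding of the symbol b

# One Elman update simulates the transition q0 --b/0.9--> q2.
h1 = heaviside(U @ h0 + V @ e_b + b_h)
print(h1)                              # [0. 0. 0. 0. 0. 1.] = one-hot of (q2, b)

# p_LN(b) = p_SM(b | h0) * p_SM(eos | h1) = 0.9 * 0.5 = 0.45
p = softmax(E @ h0)[1] * softmax(E @ h1)[2]
print(round(float(p), 2))              # 0.45
```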

2 Implications for recurrent neural language models. Lemmas 5.1.1 and 5.1.2 formalize the
3 equivalence between HRNNs and deterministic PFSAs. A direct corollary of this result is that
HRNNs are at most as expressive as deterministic PFSAs and are, therefore, strictly less expressive than

Figure 5.11: A non-determinizable PFSA. It assigns the string a b^n c the probability A(a b^n c) = 0.5 · 0.9^n · 0.1 + 0.5 · 0.1^n · 0.9, which cannot be expressed as a single term for arbitrary n ∈ N_{≥0}.

1 general, non-deterministic, PFSAs.9 An example of a very simple non-deterministic PFSA, i.e., a


2 PFSA whose distribution cannot be expressed by an HRNN LM, is shown in Fig. 5.11. Furthermore,
3 even if a non-deterministic PFSA can be determinized, the number of states of the determinized
4 machine can be exponential in the size of the non-deterministic one (Buchsbaum et al., 2000).
5 In this sense, non-deterministic PFSAs can be seen as exponentially compressed representations
6 of finite-state LMs. However, the compactness of this non-deterministic representation must be
7 “undone” using determinization before it can be encoded by an HRNN.
8 While Lemma 5.1.1 focuses on HRNN LMs and shows that they are finite-state, a similar
9 argument could be made for any RNN whose activation functions map onto a finite set. This is
10 the case with any implementation of an RNN on a computer with finite-precision arithmetic—in
11 that sense, all deployed RNNLMs are finite-state, albeit very large in the sense of encoding possibly
12 very large weighted finite-state automata. However, there are a few important caveats with this:
13 firstly, notice that, although finite, the number of states represented by an RNN is exponential in
14 the size of the hidden state. Even for moderate hidden state dimensionalities, this can be very large
15 (hidden states can easily be of size 100–1000). In other words, one can view RNNs as very compact
16 representations of large deterministic probabilistic finite-state automata whose transition functions
17 are represented by the RNN’s update function. Furthermore, since the topology of this implicit
18 WFSA is completely determined by the update function of the RNN, it can be learned very flexibly
19 yet efficiently based on the training data—this is made possible by the sharing of parameters across
20 the entire graph of the WFSA instead of explicitly parametrizing every possible transition, as, for
21 example, in §4.1.3, or hard-coding the allowed transitions as in §4.1.5. This means that the WFSA
22 is not only represented, but also parametrized very efficiently by an RNN. Nevertheless, there is an
23 important detail that we have somewhat neglected so far: this is the requirement that the simulated
24 WFSA be deterministic.

9 General PFSAs are, in turn, equivalent to probabilistic regular grammars and discrete Hidden Markov Models

(Icard, 2020b).

1 Addendum to Minsky’s Construction: Constructing a smaller RNN; Lower Bounds on


2 the Space Complexity of Simulating PFSAs with RNNs

3 Lemma 5.1.2 shows that HRNN LMs are at least as powerful as dPFSAs. More precisely, it shows
4 that any dPFSA A “ pΣ, Q, δ, λ, ρq can be simulated by an HRNN LM of size O p|Q||Σ|q. In this
5 section, we address the following question: How large does an HRNN LM have to be such that it
6 can correctly simulate a dPFSA? We study the asymptotic bounds with respect to the size of the
7 set of states, |Q|, as well as the number of symbols, |Σ|.

Asymptotic Bounds in |Q|. Intuitively, the 2^D configurations of a D-dimensional HRNN hidden state could represent 2^D states of a (P)FSA. One could therefore expect that we could achieve exponential compression of a dPFSA by representing it as an HRNN LM. Interestingly, this is not possible in general: extending work by Dewdney (1977), Indyk (1995) shows that, to represent an unweighted FSA with an HRNN, one requires an HRNN of size Ω(|Σ| √|Q|). This lower bound can be achieved. For completeness, we next present the constructions by Dewdney (1977) and Indyk (1995), which represent an unweighted FSA with an HRNN of size O(|Σ| |Q|^{3/4}) and O(|Σ| √|Q|), respectively, before giving a lower bound for the probabilistic case.
16 Lemma 5.1.2 gives a relatively simple construction of an RNN recognizing a weighted regular
17 language. However, the resulting RNN is relatively large, with a hidden state of size linear in
18 the number of states of the (deterministic) WFSA recognizing the language, with the additional
19 multiplicative factor in the size of the alphabet. Note that constructions resulting in smaller RNNs
20 exist, at least for the unweighted case. For example, for an arbitrary WFSA A “ pΣ, Q, δ, λ, ρq,
Dewdney (1977) and Alon et al. (1991) present a construction of an RNN with a hidden state of size O(|Q|^{3/4}) simulating A, whereas Indyk (1995) provides a construction of an RNN with a hidden state of size O(|Q|^{1/2}). The latter is also provably a lower bound on the number of neurons required
24 to represent an arbitrary unweighted FSA with a Heaviside-activated recurrent neural network
25 (Indyk, 1995). It is not yet clear if this can be generalized to the weighted case or if Minsky’s
26 construction is indeed optimal in this setting. This is quite interesting since one would expect that
27 an RNN with a hidden state of size D can represent up to 2D individual states (configurations of the
28 D-dimensional vector). However, the form of the transition function with the linear transformation
29 followed by a Heaviside activation limits the number of transition functions that can be represented
30 using D dimensions, resulting in the required exponential increase in the size of the hidden state.
31 Minsky’s construction (Lemma 5.1.2) describes how to represent a dPFSA A with a HRNN of
32 size linear in the number of A’s states. Importantly, the encoding of the FSA transition function
33 (taken from Minsky’s original construction) is decoupled from the parameter defining the probability
34 distribution, E. This section describes two asymptotically more space-efficient ways of constructing
35 the component simulating the transition function. They originate in the work by Dewdney (1977),
who showed that an unweighted FSA A = (Σ, Q, I, F, δ) can be represented by an HRNN of size O(|Σ| |Q|^{3/4}). Using the same ideas, but with a specific trick to further compress the size of the processing layer of the RNN, Indyk (1995) reduced this bound to O(|Σ| √|Q|), which, as discussed
39 in §5.1.6, is asymptotically optimal. Naturally, as shown in §5.1.6, the space-efficiency gain can
40 not be carried over to the weighted case—that is, the space-efficiency is asymptotically overtaken
41 by the output matrix E. Nevertheless, for a more complete treatment of the subject, we cover the

1 two compressed constructions of the HRNN simulating an unweighted FSA in this section in our
2 notation. Importantly, given a dPFSA, we focus only on the underlying FSA, i.e., the unweighted
3 transition function of the automaton, since by Theorem 5.1.5, the compression can only be achieved
4 with components representing that part of the automaton.

5 Dewdney’s Construction

6 This section describes the construction due to Dewdney (1977) in our notation. Since some of the
7 parts are very similar to the construction due to Indyk (1995), those parts are reused in §5.1.6 and
8 introduced more generally.

9 Representing states of the FSA. Let A “ pΣ, Q, I, F , δq be a deterministic FSA. Recall that
10 Minsky’s construction encodes the A’s current state as a one-hot encoding of the state-symbol pair.
11 The construction due to Dewdney (1977), on the other hand, represents the states separately from
the symbols. It encodes the states with two-hot representations by using the coefficients of what we call a square-root state representation. This results in representations of states of size O(√|Q|). The input symbols are incorporated into the hidden state separately.^10
14 The input symbols are incorporated into the hidden state separately.10

Definition 5.1.14: Square-root state representation

Let A = (Σ, Q, I, F, δ) be an FSA and s ≝ ⌈√|Q|⌉. We define the square-root state representation of A's states q ∈ Q as^a

    φ_2(q) ≝ ( ⌊q / s⌋, q mod s ).    (5.53)

We denote the inverse of φ_2 with φ_2^{−1} and further define, for k ∈ Z_s,

    φ_2^{−1}(k, ·) ≝ { q ∈ Q | φ_0 = k, where φ = φ_2(q) }    (5.54)

and φ_2^{−1}(·, k) analogously.
^a Notice that φ_2(q) represents the coefficients of the expression of q ∈ N in base s.

Specifically, we will denote φ_2^{−1}(k, ·) and φ_2^{−1}(·, k), with k in the j-th position (j ∈ Z_2: 0 for φ_2^{−1}(k, ·) and 1 for φ_2^{−1}(·, k)), as Φ_{k,j}.
We can think of the function φ_2 as representing states of the FSA in a two-dimensional space Z_s × Z_s. However, to efficiently simulate A with an HRNN, it is helpful to think of φ_2(q) in two different ways: as a vector in B^{2s}, or as a matrix in B^{s×s}, in the following sense.

Definition 5.1.15: Vector and matrix state representations

Given a square-root state representation function φ_2, we define the vector representation of the state q ∈ Q as the vector v(q) ∈ B^{2s} with

    v(q)_{φ_0} = 1    (5.55)
    v(q)_{s+φ_1} = 1,    (5.56)

where φ = (φ_0, φ_1) = φ_2(q), and all other entries 0. Furthermore, we define the matrix representation of the state q ∈ Q as the matrix W_Q(q) ∈ B^{s×s} with

    W_Q(q)_{φ_0 φ_1} = 1    (5.57)

and all other entries 0.
^10 This again adds a factor |Σ| to the size of the hidden state, as we discuss later.

2 Dewdney’s construction also heavily relies on the representations of sets of states. We define
3 those additively.

Definition 5.1.16: Matrix and vector representation of state sets

Let 𝒬 ⊆ Q be a set of states. We define the vector representation of 𝒬 as the vector

    v(𝒬) ≝ ⋁_{q ∈ 𝒬} v(q).    (5.58)

Similarly, we define the matrix representation of 𝒬 as the matrix

    W_Q(𝒬) ≝ ⋁_{q ∈ 𝒬} W_Q(q).    (5.59)

To help understand the above definitions, we give an example of an FSA and the representations of its states.

Example 5.1.13: Dewdney's construction

Consider the FSA in Fig. 5.12, for which s = ⌈√|Q|⌉ = ⌈√3⌉ = 2, meaning that

    φ_2(0) = (0, 0)    φ_2(1) = (0, 1)    φ_2(2) = (1, 0),    (5.60)

resulting in the state-to-vector mapping^a

    v(0) = (1 0 | 1 0)^⊤    (5.61)
    v(1) = (1 0 | 0 1)^⊤    (5.62)
    v(2) = (0 1 | 1 0)^⊤,    (5.63)

and the state-to-matrix mapping

    W_Q(0) = (1 0; 0 0)    W_Q(1) = (0 1; 0 0)    W_Q(2) = (0 0; 1 0).    (5.64)

The two components of the vector representations separated by "|" denote the two halves of the representation vectors, corresponding to the two components of φ_2(q).
^a Despite the notation (... | ...), we assume we are working with column vectors.

Figure 5.12: An example of a fragment of an FSA.
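The square-root state representation and its vector and matrix forms are easy to reproduce in a few lines; the sketch below recomputes the representations of Example 5.1.13 (the function names are ours, chosen for illustration).

```python
import numpy as np
from math import ceil, sqrt

def phi2(q, s):
    """Square-root state representation (Definition 5.1.14): base-s digits of q."""
    return (q // s, q % s)

def vec_repr(q, s):
    """Two-hot vector v(q) in B^{2s}: a 1 at phi_0 and a 1 at s + phi_1 (Definition 5.1.15)."""
    phi0, phi1 = phi2(q, s)
    v = np.zeros(2 * s, dtype=int)
    v[phi0], v[s + phi1] = 1, 1
    return v

def mat_repr(q, s):
    """Matrix W_Q(q) in B^{s x s} with a single 1 at position (phi_0, phi_1)."""
    phi0, phi1 = phi2(q, s)
    W = np.zeros((s, s), dtype=int)
    W[phi0, phi1] = 1
    return W

num_states = 3
s = ceil(sqrt(num_states))                    # s = 2, as in Example 5.1.13
for q in range(num_states):
    print(q, phi2(q, s), vec_repr(q, s))
# Representation of a set of states: element-wise disjunction (Definition 5.1.16).
print(np.maximum(mat_repr(0, s), mat_repr(2, s)))
```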

2 High-level idea of Dewdney’s construction. Given these definitions, the intuition behind
3 Dewdney’s construction of an HRNN simulating an FSA A is the following:

4 1. Represent A’s states as vectors in B2s , or, equivalently, matrices in Bsˆs .

5 2. For each q P Q, construct the matrix representation of the set of y-predecessors WQ Pred pq; yq
6 for all y P Σ.

7 3. To simulate A’s transition function δ, compare the representation of the current state q t
8 with all constructed predecessor matrices WQ Pred pq; y t q given the current input symbol y t .
9 Activate the two-hot representation of the (unique) state q t`1 for which the representation of
10 q t was detected in q t`1 ’s predecessor matrix for symbol y t , WQ Pred pq t`1 ; y t q.

11 Simulating the transition function of an FSA by detecting preceding states. We elaborate


12 on the last point above since it is the central part of the construction.11 The idea of simulating the
13 transition function δ is reduced to detecting whose predecessor given the current input symbol y t is
14 currently active—naturally, this should be the state active at t ` 1. Concretely, consider again the
15 FSA A in Fig. 5.12. The predecessors of the three states, indexed by the incoming symbols are: for
16 0 tb : 2u, for 1 ta : 1, b : 0u, and for 2 ta : 1, b : 0u. Suppose that at some time t, A is in state 0 and
17 is reading in the symbol b. Then, since the state 0 is the b-predecessor of the state 2, we know that
18 at time t ` 1, A will be in state 2. This principle can be applied more generally: to determine the
19 state of an FSA at time t ` 1, we simply have to somehow detect whose predecessor is active at
20 time t given the current input symbol at time t.
21 The crux of Dewdney’s construction is then the following:12 How do we, using only the Elman
22 update rule, determine whose y t -predecessor is active at time t? This can be done by detecting
23 which predecessor matrix WQ Pred pq; y t q the representation of the current state q t is included in
24 in the sense that if ϕ2 pq t q “ φ, it holds that WQ Pred pq; y t qφ0 φ1 “ 1. To be able to formally talk
11 Later, we will see that Indyk (1995) uses the exact same idea for simulating δ.
12 Again, the same applies to Indyk (1995).

1 about the detection of a representation in a set of predecessors, we define several notions of matrix
2 detection.
3 Informally, we say that a matrix is easily detectable if the presence of its non-zero elements can
4 be detected using a single neuron in the hidden layer of a HRNN.

Definition 5.1.17: Easily detectable matrices

Let B ∈ B^{D×D} be a binary matrix. We say that B is easily detectable if there exist w ∈ Q^{2D} and b ∈ Q (neuron coefficients) such that

    σ(⟨e_{ij}, w⟩ + b) = 1 ⟺ B_{ij} = 1,    (5.65)

where e_{ij} = (e_i | e_j) refers to the 2D-dimensional vector with 1's at positions i and D + j. In words, this means that the neuron defined by w, b fires on the input e_{ij} if and only if B_{ij} = 1.
5

6 We define detectable matrices as the matrices which can be detected using a conjunction of two
7 neurons.

Definition 5.1.18: Detectable matrices

Let B ∈ B^{D×D} be a binary matrix. We say that B is detectable if there exist w_1, w_2 ∈ Q^{2D} and b_1, b_2 ∈ Q such that

    σ(⟨e_{ij}, w_1⟩ + b_1) = 1 ∧ σ(⟨e_{ij}, w_2⟩ + b_2) = 1 ⟺ B_{ij} = 1.    (5.66)

9 Furthermore, we say that a matrix is (easily) permutation-detectable if there exist permutation


10 matrices P and Q such that PBQ is (easily) detectable.
11 Intuitively, this means that one can effectively replace an easily detectable matrix B with a
12 single neuron: instead of specifying the matrix explicitly, one can simply detect if an entry B ij of B
13 is 1 by passing eij through the neuron and seeing if it fires. This reduces the space complexity from
14 D2 to 2D. Similarly, one can replace a detectable matrix with two neurons. As shown in Fact 5.1.1,
15 the required conjunction of the two resulting neurons can then easily be performed by a third (small)
16 neuron, meaning that a detectable matrix is effectively represented by a two-layer MLP.
17 An example of easily detectable matrices are the so-called northwestern matrices.

Definition 5.1.19: Northwestern matrix

A matrix B ∈ B^{D×D} is northwestern if there exists a vector α with |α| = D and D ≥ α_1 ≥ ⋯ ≥ α_D ≥ 0 such that

    B_{ij} = 1 ⟺ j ≤ α_i.    (5.67)

Intuitively, northwestern matrices contain all their ones contiguously in their upper left (northwest) corner. An example of a northwestern matrix for α = (2, 1, 1)^⊤ is

    B = ( 1 1 0
          1 0 0
          1 0 0 ).    (5.68)

Lemma 5.1.5

Northwestern matrices are easily detectable.

Proof. Define w ≝ (α | D, D−1, ..., 1) and b ≝ −D. It is easy to see that for any e_{ij} where B_{ij} = 1, it holds that

    ⟨e_{ij}, w⟩ = α_i + (D − j + 1) ≥ j + D − j + 1 = D + 1
    ⟹ H(⟨e_{ij}, w⟩ + b) = H(⟨e_{ij}, w⟩ − D) = 1.

On the other hand, for B_{ij} = 0, we have

    ⟨e_{ij}, w⟩ = α_i + (D − j + 1) < j + D − j + 1 = D + 1
    ⟹ H(⟨e_{ij}, w⟩ + b) = H(⟨e_{ij}, w⟩ − D) = 0.

■
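The neuron from the proof of Lemma 5.1.5 can be checked on the example matrix from Eq. (5.68). The sketch below uses 0-based indices, so the second half of the weight vector is (D, D−1, ..., 1) indexed from 0; the detection condition itself is unchanged.

```python
import numpy as np

def heaviside(x):
    return (x > 0).astype(int)

# Northwestern matrix for alpha = (2, 1, 1), as in Eq. (5.68).
alpha = np.array([2, 1, 1])
D = 3
B = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 0, 0]])

# Neuron from the proof of Lemma 5.1.5: w = (alpha | D, D-1, ..., 1), b = -D.
w = np.concatenate([alpha, np.arange(D, 0, -1)])
b = -D

def e(i, j, D):
    # e_ij: 2D-dimensional indicator of positions i and D + j.
    x = np.zeros(2 * D, dtype=int)
    x[i], x[D + j] = 1, 1
    return x

for i in range(D):
    for j in range(D):
        fired = heaviside(np.array([w @ e(i, j, D) + b]))[0]
        assert fired == B[i, j]
print("a single Heaviside neuron detects all entries of B")
```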

6 A more general useful class of detectable matrices are line matrices (Dewdney, 1977).

Definition 5.1.20: Line matrix

A binary matrix B P BDˆD is a line matrix if any of the following conditions hold:

1. All B’s ones lie either in the same row (B is a row matrix) or in the same column (B
is a column matrix).
2. B is a transversal, i.e., a matrix in which there is at most one 1 in any column and row.
7

Lemma 5.1.6

Row and column matrices are easily permutation-detectable.

Proof. Let i ∈ Z_D, N ∈ Z_D, and let B be a row matrix with B_{i j_n} = 1 for n ∈ Z_N, i.e., a row matrix with all its ones in the i-th row. Define P ∈ B^{D×D} with P_{1i} = 1 and 0 elsewhere, and Q ∈ B^{D×D} with Q_{j_n n} = 1 and 0 elsewhere. Then, PBQ contains all its 1's in its northwestern corner (contiguously in the first row) and is thus easily detectable. Let w = (α | D, D−1, ..., 1), b = −D be the neuron weights from Lemma 5.1.5. Define w′ ≝ (P^⊤ α | Q (D, D−1, ..., 1)^⊤), b′ ≝ −D. It is easy to see that this "rearranges" the components of the neuron recognizing the northwestern matrix PBQ so that they recognize the original matrix, meaning that the neuron defined by w′ and b′ recognizes the line matrix. The proof for a column matrix is analogous. ■

Lemma 5.1.7
Transversals are permutation-detectable.
17

Proof. The core idea of this proof is that every transversal can be permuted into a diagonal matrix, which can be written as a Hadamard product of a lower-triangular and an upper-triangular matrix. Let B be a transversal. Pre-multiplying B with its transpose, P ≝ B^⊤, results in a diagonal matrix. It is easy to see that PB can be written as a Hadamard product H_1 ⊙ H_2 of a lower-triangular matrix H_1 and an upper-triangular matrix H_2. Both are easily permutation-detectable. A conjunction of the neurons detecting H_1 and H_2 (again, performed by another neuron) detects the original matrix B. In the following, we will refer to H_1 and H_2 as the factors of the transversal. ■

Crucially, any binary matrix B ∈ B^{D×D} can be decomposed into a set of line matrices 𝓑 whose disjunction is B: ⋁_{M ∈ 𝓑} M = B. It is easy to see that B_{ij} = 1 if and only if there exists M ∈ 𝓑 such that M_{ij} = 1. This means that the non-zero entries of any B ∈ B^{D×D} decomposed into the set of line matrices 𝓑 can be detected using an MLP in two steps:

12 1. Detect the non-zero entries of the individual line matrices from the decomposition B (which
13 are, as shown above, detectable).

14 2. Take a disjunction of the detections of the individual line matrices to result in the activation
15 of the original matrix.

16 The disjunction can again be performed by applying another 2-layer MLP to the activations of the
17 line matrices. An important consideration in both Dewdney’s as well as Indyk’s construction later
18 will be how large B has to be.
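As a small illustration of this two-step scheme, the sketch below decomposes a hand-picked binary matrix into a column matrix and a transversal and checks that the element-wise disjunction of the (stand-in) line-matrix detections recovers the matrix; the matrix and its decomposition are illustrative assumptions.

```python
import numpy as np

# A binary matrix and a hand-picked decomposition into line matrices:
# one column matrix and one transversal (at most one 1 per row and column).
B = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
column = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 0, 0]])
transversal = np.array([[0, 1, 0],
                        [0, 0, 0],
                        [0, 0, 1]])

# Element-wise disjunction of the line matrices recovers B.
assert np.array_equal(np.maximum(column, transversal), B)

def or_neuron(bits):
    # The disjunction of the individual detections can itself be a Heaviside neuron.
    return int(sum(bits) > 0)

for i in range(3):
    for j in range(3):
        detections = [column[i, j], transversal[i, j]]   # stand-ins for line-matrix detectors
        assert or_neuron(detections) == B[i, j]
print("disjunction of line-matrix detections recovers B")
```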

19 Using matrix decomposition and detection for simulating the transition function. We
20 now describe how Dewdney’s construction uses matrix detection based on the decomposition of
21 matrices into line matrices to simulate an FSA using an HRNN. From a high level, the update
22 steps of the HRNN will, just like in Minsky’s construction, simulate the transition function of the
23 simulated FSA. However, in contrast to the Minsky construction, in which each transition step in
24 the FSA was implemented by a single application of the Elman update rule, here, a single transition
25 in the FSA will be implemented using multiple applications of the Elman update rule, the end result
26 of which is the activation of the two-hot representation of the appropriate next state. Nonetheless,
27 there are, abstractly, two sub-steps of the update step, analogous to the Minsky construction (cf.
28 Fig. 5.7):

29 1. Detect the activations of all possible next states, considering any possible input symbol
30 (performed by the term Uht in Minsky’s construction).

31 2. Filter the activations of the next states by choosing only the one transitioned into by a
32 y t -transition (performed by conjoining with the term VJy t K in Minsky’s construction).

33 The novelty of Dewdney’s construction comes in the first sub-step: How can the Elman update
34 step be used to activate the two-hot representation of q t ’s out-neighborhood? As alluded to, this
35 relies on the pre-computed predecessor matrices Pred pq; yq (cf. Definition 5.1.15). The predecessor
36 matrices of individual states are compressed (disjoined) into component-activating matrices, the
37 representation matrices of the predecessors of specific sets of states (cf. Definition 5.1.16), defined
38 through the function ϕ2 in the following sense.
180 CHAPTER 5. NEURAL NETWORK LANGUAGE MODELS

Definition 5.1.21: Component-activating matrix

A component-activating matrix is the representation matrix B_{j,y,k} ≝ W_Q(Pred(Φ_{k,j}; y)) for some k ∈ Z_s and j ∈ Z_2.


1

2 Intuitively, the component-activating matrix Bj,y,k is the result of the disjunction of the matrix
3 representations of all y-predecessors q of all states q 1 whose j th component of the vector ϕ2 pq 1 q
4 equals k. This results in 2|Σ|s matrices. They can be pre-computed and naturally depend on the
5 transition function δ. The name component-activating matrix is inspired by the fact that each of the
6 matrices “controls” the activation of one of the 2|Σ|s neurons in a specific sub-vector of the HRNN
7 hidden state. That is, each component-activating matrix controls a particular dimension, indexed
8 by the tuple pj, y, kq for j P B, y P Σ, k P Zs , in the data sub-vector of the HRNN hidden state.
9 As we will see shortly, they contain all the information required for simulating A with a HRNN.
10 To define the transition function of the HRNN simulating A, all 2|Σ|s component-activating
11 matrices are decomposed into permutation-detectable line matrices (cf. Definition 5.1.20) whose
12 activations are combined (disjoined) into the activations of individual component-activating matrices.
Analogously to above, we will denote the sets of line matrices decomposing the component-activating matrices as $\mathcal{B}_{j,y,k}$, i.e., $\mathbf{B}_{j,y,k} = \bigvee_{\mathbf{M} \in \mathcal{B}_{j,y,k}} \mathbf{M}$. The dimensions of the hidden state corresponding to
15 the activations of the line matrices before they are combined into the activations of the component-
16 activating matrices form the processing sub-vector of the HRNN hidden state since they are
17 required in the pre-processing steps of the update step to determine the activation of the actual
18 hidden state. This is schematically drawn in Fig. 5.13a.
19 For any component-activating matrix B decomposed into the set of line matrices B, we know
by Lemmas 5.1.6 and 5.1.7 that all $\mathbf{M} \in \mathcal{B}$ are detectable by a single-layer MLP. By adding an additional layer to the MLP, we can disjoin the detections of $\mathbf{M} \in \mathcal{B}$ into the detection of $\mathbf{B}$. More abstractly, this MLP, therefore, detects the activation of one of the $2|\Sigma|s$ cells of the data sub-vector of the HRNN hidden state—all of them together then form the two-hot encoding of all possible next states of the FSA (before taking into account the input symbol). Designing $2|\Sigma|s$ such single-valued MLPs, therefore, results in an MLP activating the two-hot representations of all possible next states of the simulated FSA. Conjoining these activations with the input symbol, analogously to how this
27 is done in the Minsky construction, results in the activation of the two-hot representation of only
28 the actual next state of the simulated FSA. This is illustrated in Fig. 5.13b.

29 High-level overview of simulating a transition. In summary, after decomposing all the


30 component-activating matrices into the sets B j,y,k , the detection of all candidate next states (before
31 considering the input symbol) in the update step of HRNN is composed of the following sub-steps.

32 1. Compute the activations of the two factors of all the transversals in B j,y,k for all j, y, k
33 (Lemma 5.1.7).

34 2. Conjoin the activations of the two factors into the activations of the transversals (Lemma 5.1.7).

35 3. Compute the activations of the column and row matrices in B j,y,k for all j, y, k (Lemma 5.1.6).

4. Disjoin the activations of all the line matrices (transversals, row, and column matrices) in $\mathcal{B}_{j,y,k}$ for all $j, y, k$ to compute the activations of all $2|\Sigma|s$ component-activating matrices.
[Figure 5.13(a): schematic of the data sub-vector and the processing sub-vector, connected through AND and OR gates.]
(a) High-level overview of Dewdney's construction. The highlighted orange neuron in the representation of the state from the data sub-vector corresponds to the activation of one of the components of the red states (which have in common that the 0th component of $\phi_2(q)$ is the same). The matrix corresponding to the disjunction of the representations of their $y$-predecessors (blue states) is decomposed into two line matrices—a transversal and a column matrix. The non-zero elements of the former can be detected by a conjunction of two neurons, while the non-zero elements of the latter can be detected directly by a single neuron. Those activations are then disjoined to result in the activation of the orange neuron. The purple neurons in the processing sub-vector are composed of the neurons in the networks implementing the detection of line matrices and their conjunctions and disjunctions (also shown in purple).

[Figure 5.13(b): schematic of the four phases (Phase 1–4) on an FSA fragment with states $q, q', q''$, showing the sub-vectors $\mathbf{v}(q)$, $\mathbf{v}(q)$, $\mathbf{v}(\{q', q''\})$, $\mathbf{v}(q')$ and processing bits $\mathbf{p}, \mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3$.]
(b) A high-level illustration of how the transition function of the FSA is implemented in Dewdney's construction on an example of an FSA fragment, where the simulated automaton is initially in the state $q$ and reads the symbol $a$, transitioning to $q'$. The components whose changes are relevant at a given step are highlighted. Starting in the state $q$, which is stored in the data sub-vector ($\mathbf{v}(q)$), in the first sub-step, the processing bits of the appropriate line matrices are activated ($\mathbf{p}_1$). Next, the activated line matrices are used to activate the representations of all the states in the out-neighborhood of $q$ in the data sub-vector ($\mathbf{v}(\{q', q''\})$). Lastly, these representations are conjoined with the states reachable by the symbol $a$, resulting in the representation of the state $q'$ in the data sub-vector ($\mathbf{v}(q')$).

1 This results in the activation of the two-hot representations of all possible next states (i.e., the entire
2 out-neighborhood of q t ). In the last sub-step of the HRNN update step, these are conjoined with
3 the representation of the current input symbol. This step is very similar to the analogous stage in
Minsky's construction, with the difference that here, the non-zero entries of the vector $\mathbf{V}\llbracket y_t \rrbracket$ must cover the two-hot representations of the states with an incoming $y_t$-transition. This conjunction
6 then ensures that among all the states in the out-neighborhood of q t , only the one reached by taking
7 the y t -transition will be encoded in ht`1 . The construction just described can be summarized by
8 the following lemma.13

Lemma 5.1.8

Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. Then, Dewdney's construction results in an HRNN correctly simulating $\mathcal{A}$'s transition function, i.e., $s(\mathbf{h}_t) = q_t$ for all $t$.

This shows that Dewdney's construction correctly encodes the FSA in an HRNN. However, its space efficiency remains to be determined. As mentioned above, working with two-hot representations of the states means that the data sub-vector is of size $O\left(|\Sigma|\sqrt{|Q|}\right)$. However, the construction also requires a number of processing dimensions in the processing sub-vector. To understand the full complexity of the construction, we have to determine the maximal number of processing bits in the HRNN. The first step towards the answer is contained in the following lemma, which describes the number of line matrices required to cover an arbitrary binary matrix. It lies at the core of the efficiency of Dewdney's construction.

Lemma 5.1.9

Let $\mathbf{B} \in \mathbb{B}^{D \times D}$ with $N^2$ elements equal to $1$. Then, there exists a decomposition $\mathcal{B}$ of $\mathbf{B}$ into at most $2N$ line matrices such that $\bigvee_{\mathbf{M} \in \mathcal{B}} \mathbf{M} = \mathbf{B}$.

Proof. Based on Dewdney (1977). Define the sequence of transversals $\mathbf{T}_1, \mathbf{T}_2, \ldots$ where $\mathbf{T}_i$ is a transversal containing the maximum number of ones in the matrix $\mathbf{B}_i \overset{\mathrm{def}}{=} \mathbf{B} - \bigvee_{j=1}^{i-1} \mathbf{T}_j$. A transversal containing the maximal number of ones can be found using a maximum-matching algorithm. Continue this sequence until there are no more ones in $\mathbf{B}_i$. The number of ones in the matrices $\mathbf{B}_i$, $\lVert \mathbf{B}_i \rVert_1$, forms a (weakly) decreasing sequence.
If there are at most $2N$ transversals in the sequence, the lemma holds. Otherwise, we compare the functions $f(i) \overset{\mathrm{def}}{=} \lVert \mathbf{T}_i \rVert_1$ and $g(i) \overset{\mathrm{def}}{=} 2N - i$.

• If $f(i) > g(i)$ for all $i = 1, \ldots, N$, then $\sum_{i=1}^{N} f(i) = \sum_{i=1}^{N} \lVert \mathbf{T}_i \rVert_1 > \sum_{i=1}^{N} (2N - i) = 2N^2 - \tfrac{1}{2}N(N+1) \geq N^2$. However, the transversals in the decomposition cannot contain more ones than the original matrix, which has only $N^2$ of them.

• We conclude that for some $i \leq N$, $f(i) \leq g(i)$. Let $i_0$ be the first such index in $1, \ldots, N$ and $\mathcal{L}_1 \overset{\mathrm{def}}{=} \{\mathbf{T}_1, \ldots, \mathbf{T}_{i_0 - 1}\}$. The maximum number of independent ones (in the sense that at most one appears in a single row/column) in $\mathbf{B}_{i_0}$ is $\lVert \mathbf{T}_{i_0} \rVert_1 \leq 2N - i_0$ (those are chosen by the maximum transversal $\mathbf{T}_{i_0}$). By König's theorem (Szárnyas, 2020), there is therefore a set of at most $2N - i_0$ column or row matrices $\mathcal{L}_2 \overset{\mathrm{def}}{=} \{\mathbf{L}_1, \ldots, \mathbf{L}_k\}$ with $k \leq 2N - i_0$ which cover $\mathbf{B}_{i_0}$.¹⁴

Therefore, $\mathcal{L} \overset{\mathrm{def}}{=} \mathcal{L}_1 \cup \mathcal{L}_2$ constitutes a valid cover of $\mathbf{B}$ with at most $(i_0 - 1) + (2N - i_0) \leq 2N = O(N)$ matrices. ■

¹³ To formally prove it is correct, we would have to follow a similar set of steps to how the correctness of Minsky's construction (Lemma 5.1.3) was proved. We omit this for conciseness.
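The greedy procedure from the proof can be sketched in a few lines of code. This is an illustration under our own simplifications, not a reference implementation: it finds maximum transversals with Kuhn's augmenting-path matching algorithm and, once the transversals become small, covers the residual ones row by row (a König-style cover would pick at most $2N - i$ rows/columns).

```python
import numpy as np

def maximum_transversal(B):
    """Return a binary matrix T <= B with at most one 1 per row and per column
    and a maximum number of 1s, via Kuhn's augmenting-path bipartite matching."""
    D = B.shape[0]
    col_of_row, row_of_col = [-1] * D, [-1] * D

    def augment(c, visited):
        for r in range(D):
            if B[r, c] and not visited[r]:
                visited[r] = True
                if col_of_row[r] == -1 or augment(col_of_row[r], visited):
                    col_of_row[r], row_of_col[c] = c, r
                    return True
        return False

    for c in range(D):
        augment(c, [False] * D)
    T = np.zeros_like(B)
    for c, r in enumerate(row_of_col):
        if r != -1:
            T[r, c] = 1
    return T

def line_decomposition(B):
    """Greedily peel off maximum transversals as in the proof of Lemma 5.1.9;
    cover the residual ones by row matrices once the transversals become small."""
    residual = B.copy()
    N = int(np.ceil(np.sqrt(B.sum()))) if B.sum() else 0
    lines, i = [], 1
    while residual.sum() > 0:
        T = maximum_transversal(residual)
        if T.sum() <= 2 * N - i:          # switch-over point from the proof
            break
        lines.append(T)
        residual = residual - T
        i += 1
    for r in np.flatnonzero(residual.any(axis=1)):
        row_matrix = np.zeros_like(residual)
        row_matrix[r] = residual[r]
        lines.append(row_matrix)
    return lines

rng = np.random.default_rng(0)
B = (rng.random((8, 8)) < 0.25).astype(int)
cover = line_decomposition(B)
assert (np.clip(sum(cover), 0, 1) == B).all()
print(f"{int(B.sum())} ones covered by {len(cover)} line matrices")
```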
We will denote the number of matrices in the line decomposition of a matrix $\mathbf{B}$ constructed by the greedy procedure from Lemma 5.1.9 as $L(\mathbf{B})$. Connecting this lemma to Dewdney's construction, it shows that the number of neurons required to detect the activation of a single set $\mathrm{Pred}(\Phi_{k,j}; y)$ grows asymptotically as the square root of the number of ones in its representation matrix—this is how many line matrices the matrix will decompose into. Each of these neurons takes the $2|\Sigma|s$-dimensional data sub-vector as its input.
10 This allows us to show how many neurons the entire HRNN simulating A has. Since we know that
11 the data sub-vector will always have exactly 2|Σ|s cells, we characterize the number of processing
12 cells in the following lemma.

Lemma 5.1.10

Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. Then, Dewdney's construction results in an HRNN with a hidden state of size $O\left(|\Sigma||Q|^{3/4}\right)$.

Proof. The number of cells in the entire processing sub-vector is simply the sum of the processing neurons of all the data components. In the worst case, a single component-activating matrix $\mathbf{B}$ requires $2L(\mathbf{B}) + 1$ neurons ($2$ for each transversal in the decomposition of $\mathbf{B}$ and an additional one for their disjunction). Therefore, enumerating the set of matrices $\{\mathbf{B}_{j,y,k} \mid j \in \mathbb{Z}_2, y \in \Sigma, k \in \mathbb{Z}_s\}$ as $\mathbf{B}_n$ for $n = 1, \ldots, 2|\Sigma|s$, the number of neurons required by all component-activating matrices is bounded as follows:
$$\sum_{n=1}^{2|\Sigma|s} \left(2L(\mathbf{B}_n) + 1\right) \leq \sum_{n=1}^{2|\Sigma|s} \left(2 \cdot 2\left\lceil\sqrt{\lVert\mathbf{B}_n\rVert_1}\right\rceil + 1\right) \overset{\mathrm{def}}{=} \sum_{n=1}^{2|\Sigma|s} \left(4m_n + 1\right). \qquad (5.69)$$
Since the matrices $\mathbf{B}_n$ contain one non-zero entry for each state–symbol pair, it holds that
$$\sum_{n=1}^{2|\Sigma|s} \lVert\mathbf{B}_n\rVert_1 \leq \sum_{n=1}^{2|\Sigma|s} m_n^2 = |\Sigma||Q|. \qquad (5.70)$$
Pretending that $m_n$ can take real values, the value of Eq. (5.69) is maximized under the constraint from Eq. (5.70) when all $m_n$ are equal, with $m_n = \sqrt{2s}$. This means that
$$\sum_{n=1}^{2|\Sigma|s} \left(4m_n + 1\right) \leq \sum_{n=1}^{2|\Sigma|s} \left(4\sqrt{2s} + 1\right) = 8|\Sigma|s\sqrt{2s} + 2|\Sigma|s = O\left(|\Sigma||Q|^{3/4}\right), \qquad (5.71)$$
finishing the proof. ■


24 All results stated in this section can be summarized in the following theorem.

¹⁴ Intuitively, since all ones are contained within at most $2N - i_0$ rows or columns, they can simply be covered by matrices containing those.

Theorem 5.1.3: Dewdney (1977)

Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. Then, there exists an HRNN of size $O\left(|\Sigma||Q|^{3/4}\right)$ correctly simulating $\mathcal{A}$.

Indyk's Construction

§5.1.6 describes a construction of an HRNN of size $O\left(|\Sigma||Q|^{3/4}\right)$ simulating an FSA. While this improves the space efficiency compared to Minsky's construction, it is not asymptotically optimal. Indyk (1995) proved that an HRNN simulating an FSA $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ over a binary alphabet $\Sigma = \mathbb{B}$ requires at least $\Omega\left(\sqrt{|Q|}\right)$ hidden dimensions. He also provided a construction that achieves this lower bound. This construction is conceptually very similar to Dewdney's in that it works by activating neurons corresponding to some form of compressed predecessor matrices (component-activating matrices) and then selecting the transition which matches the input symbol. Again, it additively covers these matrices with components that are easy to detect, similar to how Dewdney's construction uses line matrices. However, Indyk's construction defines component-activating matrices based on different sets of states and covers them with a different decomposition—these are the two crucial differences allowing the construction to achieve the optimal lower bound.
We first define the component-activating matrices and their role in updating the hidden state of the HRNN. In Indyk's construction, the component-activating matrices are based on four-hot rather than two-hot encodings of states.

Definition 5.1.22: Four-hot representation of a state

Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be an FSA, $r \overset{\mathrm{def}}{=} \left\lceil |Q|^{\frac{1}{4}} \right\rceil$, and $\pi$ a permutation of $Q = [|Q|]$.ᵃ We define the four-hot representation of $q \in Q$ as
$$\phi_4(q) \overset{\mathrm{def}}{=} (\ell_1, \ell_2, \ell_3, \ell_4) \qquad (5.72)$$
where
$$\ell_j \overset{\mathrm{def}}{=} \left\lfloor \frac{\pi(q)}{r^{j-1}} \right\rfloor \bmod r. \qquad (5.73)$$
We denote the inverse of $\phi_4$ with $\phi_4^{-1}$ and further define for $k \in \mathbb{Z}_r$
$$\phi_4^{-1}(k, \cdot, \cdot, \cdot) \overset{\mathrm{def}}{=} \{q \in Q \mid \phi_4(q)_1 = k\} \qquad (5.74)$$
and $\phi_4^{-1}(\cdot, k, \cdot, \cdot)$, $\phi_4^{-1}(\cdot, \cdot, k, \cdot)$, and $\phi_4^{-1}(\cdot, \cdot, \cdot, k)$ analogously.

ᵃ The exact form of $\pi$ will be important later. For now, one can think of $\pi$ as the identity function.
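As a quick illustration, here is a minimal sketch of $\phi_4$ and its component-wise preimages (ours), assuming $\pi$ is the identity and indexing the four components from $0$ rather than $1$; the function names are ours, not the book's.

```python
import math

def four_hot(q_index, Q_size):
    """Four-hot representation of a state index: its four digits in base
    r = ceil(|Q|^(1/4)), assuming the permutation pi is the identity."""
    r = math.ceil(Q_size ** 0.25)
    return tuple((q_index // r ** j) % r for j in range(4))

def phi4_inverse_component(Q_size, j, k):
    """The set of states whose j-th four-hot component equals k
    (the set Phi_{k,j} in the text, with zero-based j)."""
    return {q for q in range(Q_size) if four_hot(q, Q_size)[j] == k}

# Example with |Q| = 20, so r = ceil(20^(1/4)) = 3.
print(four_hot(17, 20))                          # digits of 17 in base 3: (2, 2, 1, 0)
print(sorted(phi4_inverse_component(20, 0, 2)))  # states whose 0th component is 2
```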

18 We will denote ϕ´14 p. . . , k, . . .q with k in j


th
position (with j P Z4 ) as Φk,j . Despite using the
19 four-hot representations, Indyk’s construction still requires the two-hot representations based on ϕ2
20 as before. In this case, however, they again depend on the chosen permutation π. This allows us to
21 define the component-activating matrices as follows.

Definition 5.1.23: Component-activating matrix

A component-activating matrix in Indyk's construction is the representation matrix $\mathbf{W}_{\mathrm{Pred}\left(\Phi_{k,j};\, y\right)}$ for some $k \in \mathbb{Z}_r$, $j \in \mathbb{Z}_4$, and $y \in \Sigma$.

2 For efficient detection, the component-activating matrices are covered by so-called non-decreasing
3 matrices.

Definition 5.1.24: Non-decreasing matrix

We say that $\mathbf{B} \in \mathbb{B}^{D \times D}$ is non-decreasing if there exists a non-decreasing (partial) function $f : \mathbb{Z}_D \to \mathbb{Z}_D$ (from columns to rows) such that
$$\mathbf{B}_{ij} = 1 \iff f(j) = i \qquad (5.75)$$
and, if $f$ is defined for some $j \in \mathbb{Z}_D$, it is also defined for all $j' \geq j$.

Example 5.1.14: Non-decreasing matrices

An example of a non-decreasing matrix is
$$\mathbf{B} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \qquad (5.76)$$
The function defining the non-decreasing matrix $\mathbf{B}$ is $f = \begin{pmatrix} 0 & 1 & 2 & 3 \\ \varnothing & 0 & 2 & 2 \end{pmatrix}$, where $\varnothing$ denotes that the function is not defined.

6 Again, clearly, any matrix B P BDˆD can be (non-uniquely) decomposed into at most D
7 non-decreasing matrices. Moreover, non-decreasing matrices are detectable.

Lemma 5.1.11
Non-decreasing matrices are detectable.
8

Proof. Let $\mathbf{B} \in \mathbb{B}^{D \times D}$ be a non-decreasing matrix defined by the partial function $f$ (in our case, $D = r^2$). Divide the domain of $f$ into the set of intervals on which the function is constant, with $I(j)$ denoting the index of the interval containing $j \in \mathbb{Z}_{r^2}$ (for $j$ such that $f(j)$ is defined). Then, it is easy to see that $\mathbf{B}_{ij} = 1 \iff i = f(j)$, meaning that by defining the parameters $\mathbf{w}$ and $b$ as
$$w_{f(j)} \overset{\mathrm{def}}{=} r^2 - I(j) \qquad (5.77)$$
$$w_{r^2 + j} \overset{\mathrm{def}}{=} I(j) \qquad (5.78)$$
$$b \overset{\mathrm{def}}{=} -r^2 \qquad (5.79)$$
and all other elements as $0$, we get that
$$\mathbf{B}_{ij} = 1 \iff i = f(j) \iff w_i + w_{r^2 + j} + b = 0. \qquad (5.80)$$
Compared to earlier, where component-activating matrices were detected by testing an inequality, detecting a non-decreasing matrix requires testing an equality. Since all terms in the equality are integers, testing the equality can be performed with the Heaviside activation function by conjoining two neurons: one testing the inequality $w_i + w_{r^2+j} + b - 1 < 0$ and another one testing the inequality $w_i + w_{r^2+j} + b + 1 > 0$. Both can individually be performed by a single neuron and then conjoined by an additional one. ■
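The following minimal sketch (ours, using the convention $H(x) = 1$ iff $x > 0$ and a generic dimension $D$ in place of $r^2$) builds the weights from the proof for the matrix of Example 5.1.14 and checks that the two-neuron equality test fires exactly on its non-zero entries.

```python
import numpy as np

def heaviside(x):
    # Assumed convention: H(x) = 1 iff x > 0.
    return int(x > 0)

def detector_weights(f, D):
    """Weights from the proof of Lemma 5.1.11 for a non-decreasing partial
    function f (a dict column -> row). I(j) is the 1-based index of the
    constant interval containing column j."""
    interval, I, prev = 0, {}, None
    for j in sorted(f):
        if f[j] != prev:
            interval, prev = interval + 1, f[j]
        I[j] = interval
    w = np.zeros(2 * D)
    for j in sorted(f):
        w[f[j]] = D - I[j]      # row part of the weight vector
        w[D + j] = I[j]         # column part of the weight vector
    return w, -D

def detect(w, b, i, j, D):
    """Conjoin two Heaviside neurons testing z >= 0 and z <= 0, i.e. the
    equality z = w_i + w_{D+j} + b = 0, on the two-hot input (row i, col j)."""
    x = np.zeros(2 * D)
    x[i], x[D + j] = 1, 1
    z = w @ x + b
    geq, leq = heaviside(z + 1), heaviside(1 - z)   # z is an integer
    return heaviside(geq + leq - 1)                 # logical AND

# The non-decreasing matrix of Example 5.1.14: f(1) = 0, f(2) = f(3) = 2.
D, f = 4, {1: 0, 2: 2, 3: 2}
w, b = detector_weights(f, D)
B = np.zeros((D, D), dtype=int)
for j, i in f.items():
    B[i, j] = 1
assert all(detect(w, b, i, j, D) == B[i, j] for i in range(D) for j in range(D))
```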
8 With this, the high-level idea of Indyk’s construction is outlined in Fig. 5.14. After constructing
9 the component-activating matrices based on ϕ4 and decomposing them into non-decreasing matrices,
10 the rest of Indyk’s construction is very similar to Dewdney’s construction, although the full update
11 step of the HRNN requires some additional processing. To test the equality needed to detect
12 non-decreasing matrices in the decomposition, Eq. (5.80), the four-hot representations are first
13 converted into two-hot ones. This can be done by a simple conjunction of the first two and the last
14 two components of the four-hot representation. Then, the activations of the non-decreasing matrices
15 can be computed and disjoined into the representations of the component-activating matrices. These
16 form the 4|Σ|r components of the data sub-vector of the HRNN hidden state. They contain the
17 activations of all possible next states, i.e., the out-neighborhood of the current state of A. These are
18 then conjoined with the representation of the current input symbol in the same way as in Dewdney’s
19 construction but adapted to the four-hot representations of the states. The process is thus very
similar to the phases of Dewdney's construction illustrated in Fig. 5.13b.
21 Indyk’s construction can be summarized by the following lemma.15

Lemma 5.1.12

Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. Then, Indyk's construction results in an HRNN correctly simulating $\mathcal{A}$'s transition function, i.e., $s(\mathbf{h}_t) = q_t$ for all $t$.

23 The only remaining thing to show is that Indyk’s construction achieves the theoretically optimal
24 lower bound on the size of the HRNN simulating a deterministic FSA. All previous steps of the
25 construction were valid no matter the chosen permutation π. The permutation, however, matters
26 for space efficiency: intuitively, it determines how efficiently one can decompose the resulting
27 component-activating matrices (which depend on the permutation) into non-decreasing matrices in
28 the sense of how many non-decreasing matrices are required to cover it. Indyk, therefore, proved
29 that there always exists, with non-zero probability, a permutation in which the decomposition across
30 all states is efficient enough to achieve the minimum number of neurons required. This is formalized
31 by the following lemma, whose proof can be found in Indyk (1995, Lemma 6).

Lemma 5.1.13

Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. There exists a permutation of $Q$ such that Indyk's construction results in an HRNN of size $O\left(|\Sigma|\sqrt{|Q|}\right)$.

15 Again, to formally prove it is correct, we would have to follow a similar set of steps to how the correctness of

Minsky’s construction (Lemma 5.1.3) was proved. We omit this for conciseness.

1 This concludes our presentation of Indyk’s construction. All results stated in this section can be
2 summarized by the following theorem.

Theorem 5.1.4: Indyk (1995)

Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a deterministic FSA. There exists an HRNN of size $O\left(|\Sigma|\sqrt{|Q|}\right)$ correctly simulating $\mathcal{A}$.

4 Lower Bound in the Probabilistic Setting

We now ask whether the same lower bound can also be achieved when simulating dPFSAs. We find that the answer is negative: dPFSAs may require an HRNN LM of size $\Omega(|\Sigma||Q|)$ to faithfully
7 represent their probability distribution. Since the transition function of the underlying FSA can be
8 simulated with more efficient constructions, the bottleneck comes from defining the same probability
9 distribution. Indeed, as the proof of the following theorem shows, the issue intuitively arises in the
10 fact that, unlike in an HRNN LM, the local probability distributions of the different states in a
11 PFSA are completely arbitrary, whereas they are defined by shared parameters (the output matrix
12 E) in an HRNN LM.

Theorem 5.1.5: A lower bound on the size of the RNN simulating a PFSA

Let A “ pΣ, Q, δ, λ, ρq be a minimal dPFSA and pΣ, D, U, V, E, bh , h0 q an HRNN LM defining


the same LM. Then, D must scale linearly with |Q|.
13

Proof. Without loss of generality, we work with $\mathbb{R}$-valued hidden states. Let $\mathcal{A}$ be a minimal deterministic PFSA and $\mathcal{R} = (\Sigma, D, \mathbf{U}, \mathbf{V}, \mathbf{E}, \mathbf{b}_h, \mathbf{h}_0)$ an HRNN with $p_{\mathrm{LN}}(\boldsymbol{y}) = \mathcal{A}(\boldsymbol{y})$ for every $\boldsymbol{y} \in \Sigma^*$. Let $\boldsymbol{y}_{<T} \in \Sigma^*$ and $\boldsymbol{y}_{\leq T} = \boldsymbol{y}_{<T}\, y$ for some $y \in \Sigma$. Define $p(\boldsymbol{y}) \overset{\mathrm{def}}{=} \prod_{t=1}^{|\boldsymbol{y}|} p_{\mathrm{SM}}(y_t \mid \boldsymbol{y}_{<t})$. It is easy to see that $p(\boldsymbol{y}_{<T}\, y_T) = p(\boldsymbol{y}_{<T})\, p_{\mathrm{SM}}(y_T \mid \boldsymbol{y}_{<T})$. The conditional distributions $p_{\mathrm{SM}}(\cdot \mid \boldsymbol{y}_{<T})$ are proportional to the values in $\mathbf{E}\mathbf{h}_{T-1}$. By definition of the deterministic PFSA, there are $|Q|$ such conditional distributions. Moreover, these distributions (represented by vectors in $\boldsymbol{\Delta}^{|\Sigma|-1}$) can in general be linearly independent. In that case, for any $q$, the probability distribution of the outgoing transitions cannot be expressed as a linear combination of the probability distributions of the other states. To express the probability vectors of all states, the column space of the output matrix $\mathbf{E}$ therefore has to have dimension at least $|Q|$, implying that $\mathbf{E}$ must have at least $|Q|$ columns, i.e., $D \geq |Q|$. This means that the total space complexity (and thus the size of the HRNN representing the same distribution as $\mathcal{A}$) is $\Omega(|Q|)$. ■

Note that there exist regular languages motivated by phenomena in human language that can be represented in logarithmic space in the number of the states of their minimal FSA. For example, Hewitt et al. (2020) show that bounded Dyck languages with $k$ parenthesis types and maximal depth $m$, which require an FSA with $k^m$ states to be recognized, can be represented by HRNN LMs of size $O(m \log k)$, which is an exponential improvement over Indyk's lower bound.


[Figure 5.14: schematic of the data sub-vector and the processing sub-vector, connected through two AND gates and an OR gate.]
Figure 5.14: High-level overview of Indyk's construction. The highlighted orange neuron in the representation of the state from the data sub-vector corresponds to the activation of one of the components of the red states (which have in common that the 0th component of $\phi_4(q)$ is the same). The matrix corresponding to the disjunction of the representations of their $y$-predecessors (blue states) is decomposed into two non-decreasing matrices. The non-zero elements of both can be detected by a conjunction of two neurons; here, $f_1 = \begin{pmatrix} 0 & 1 & 2 & 3 \\ \varnothing & 0 & 0 & 0 \end{pmatrix}$ and $f_2 = \begin{pmatrix} 0 & 1 & 2 & 3 \\ \varnothing & \varnothing & 1 & 2 \end{pmatrix}$, meaning that $\mathbf{w}_1 = \begin{pmatrix} 3 & 0 & 0 & 0 \mid 0 & 1 & 1 & 1 \end{pmatrix}$, $\mathbf{w}_2 = \begin{pmatrix} 0 & 3 & 2 & 0 \mid 0 & 0 & 1 & 2 \end{pmatrix}$, and $b_1 = b_2 = -4$. Those activations are then disjoined to result in the activation of the orange neuron. The purple neurons in the processing sub-vector are composed of the neurons in the networks implementing the detection of the non-decreasing matrices and their conjunctions and disjunctions (also shown in purple). Note that even if the second matrix were not non-decreasing in itself (i.e., if the columns of the two ones were flipped), one could still transform it into a non-decreasing matrix by permuting the columns and permuting the corresponding neurons.

Figure 5.15: The FSA $\mathcal{A}_N$: from the initial state $0$, the symbol $y_1$ leads to state $1$, and each of $y_2, \ldots, y_N$ leads to state $2$.

1 Asymptotic Bounds in |Σ| Since each of the input symbols can be encoded in log |Σ| bits, one
2 could expect that the linear factor in the size of the alphabet from the constructions above could be
3 reduced to O plog |Σ|q. However, we again find that such reduction is in general not possible—the set
4 of FSAs presented next is an example of a family that requires an HRNN whose size scales linearly
5 with |Σ| to be simulated correctly. We also provide a sketch of the proof of why a compression in
6 |Σ| is not possible.
Let $\mathcal{A}_N = (\Sigma_N, \{0, 1, 2\}, \{0\}, \{1\}, \delta_N)$ be an FSA over the alphabet $\Sigma_N = \{y_1, \ldots, y_N\}$ such that $\delta_N = \left\{0 \xrightarrow{y_1} 1\right\} \cup \left\{0 \xrightarrow{y_n} 2 \mid n = 2, \ldots, N\right\}$ (see Fig. 5.15).
9 Clearly, to be able to correctly represent all local distributions of the dPFSA, the HRNN LM
10 must contain a representation of each possible state of the dPFSA in a unique hidden state. On the
11 other hand, the only way that the HRNN can take into account the information about the current
12 state q t of the simulated FSA A is through the hidden state ht . The hidden state, in turn, only
13 interacts with the recurrence matrix U, which does not have access to the current input symbol
14 y t`1 . The only interaction between the current state and the input symbol is thus through the
15 addition in Uht ` VJy t`1 K. This means that, no matter how the information about q t is encoded in
16 ht , in order to be able to take into account all possible transitions stemming in q t (before taking
17 into account y t`1 ), Uht must activate all possible next states, i.e., the entire out-neighborhood of
18 q t . On the other hand, since VJy t`1 K does not have precise information about q t , it must activate
19 all states which can be entered with an y t`1 -transition, just like in Minsky’s construction.
In Minsky's construction, the recognition of the correct next state was done by keeping a separate entry (one-dimensional sub-vector) for each possible pair $q_{t+1}, y_{t+1}$. However, when working with compressed representations of states (e.g., in logarithmic space), a single common sub-vector of size smaller than $|\Sigma|$ (e.g., $\log |\Sigma|$) has to be used for all possible symbols $y \in \Sigma$. Nonetheless, the interaction between $\mathbf{U}\mathbf{h}_t$ and $\mathbf{V}\llbracket y_{t+1} \rrbracket$ must then ensure that only the correct state $q_{t+1}$ is activated. For example, in Minsky's construction, this was done by simply taking the conjunction between the entries corresponding to $q, y$ in $\mathbf{U}\mathbf{h}_t$ and the entries corresponding to $q', y'$ in $\mathbf{V}\llbracket y' \rrbracket$, which were all represented in individual entries of the vectors. On the other hand, in the case of the log encoding, this could intuitively be done by trying to match the $\log |\Sigma|$ ones in the representation $(\mathbf{p}(y) \mid \mathbf{1} - \mathbf{p}(y))$, where $\mathbf{p}(y)$ represents the binary encoding of $y$. If the $\log |\Sigma|$ ones match (which is checked simply, as it would result in a large enough sum in the corresponding entry of the matrix–vector product), the correct transition could be chosen (to perform the conjunction from Fact 5.1.1 correctly, the bias would simply be set to $-(\log |\Sigma| - 1)$). However, an issue arises as soon as multiple dense representations of symbols in $\mathbf{V}\llbracket y \rrbracket$ have to be activated against the same sub-vector in $\mathbf{U}\mathbf{h}_t$—the only way this can be achieved is if the sub-vector in $\mathbf{U}\mathbf{h}_t$ contains the disjunction of the representations of all the symbols which should be activated with it. If this sets too many entries in $\mathbf{U}\mathbf{h}_t$ to one, this can result in "false positives". This is explained in more detail for the dPFSAs in Fig. 5.15 next.
Let $\mathbf{r}_n$ represent any dense encoding of $y_n$ in the alphabet of $\mathcal{A}_N$ (e.g., in the logarithmic case, that would be $(\mathbf{p}(n) \mid \mathbf{1} - \mathbf{p}(n))$). Following the intuition outlined above, in any HRNN simulating $\mathcal{A}_N$, the vector $\mathbf{U}\mathbf{h}_0$ must, among other things, contain a sub-vector corresponding to the states $1$ and $2$. The sub-vector corresponding to the state $2$ must activate (through the interaction in the Heaviside function) against any $y_n$ for $n = 2, \ldots, N$ in $\mathcal{A}_N$. This means it has to match all representations $\mathbf{r}_n$ for $n = 2, \ldots, N$. The only way this can be done is if the pattern for recognizing that state $2$ is being entered with any $y_n$ for $n = 2, \ldots, N$ is of the form $\mathbf{r} = \bigvee_{n=2}^{N} \mathbf{r}_n$. However, for sufficiently large $N$, $\mathbf{r} = \bigvee_{n=2}^{N} \mathbf{r}_n$ will be a vector of all ones—including all entries active in $\mathbf{r}_1$. This means that any encoding of a symbol will be activated against it—among others, $y_1$. Upon reading $y_1$ in state $0$, the network will therefore not be able to deterministically activate only the sub-vector corresponding to the correct state $1$. This means that the linear-size encoding of the symbols is, in general, optimal for representing dPFSAs with HRNN LMs. This discussion implies the following theorem.

Theorem 5.1.6: A lower bound on the size of the RNN simulating a PFSA

Let A “ pΣ, Q, δ, λ, ρq be a minimal dPFSA and pΣ, D, U, V, E, bh , h0 q an HRNN LM defining


the same LM. Then, D must scale linearly with |Σ|.
16

17 Based on the challenges encountered in the example above, we can devise a simple sufficient
18 condition for a logarithmic compression w.r.t. |Σ| to be possible: namely, that for any pair of states
19 q, q 1 P Q, there is at most a single transition leading from q to q 1 . Importantly, this condition is met
20 by classical n-gram LMs, and also in the language studied by Hewitt et al. (2020), which allowed
21 them to reduce the complexity w.r.t. |Σ| to the logarithmic factor. This intuitive characterization
22 can be formalized by a property we call log |Σ|-separability.

Definition 5.1.25: log |Σ|-separable finite-state automaton

An FSA $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ is $\log |\Sigma|$-separable if it is deterministic and, for any pair $q, q' \in Q$, there is at most one symbol $y \in \Sigma$ such that $q \xrightarrow{y} q' \in \delta$.

24 log |Σ|-separability is a relatively restrictive condition. To amend that, we introduce a simple


procedure which, at the expense of enlarging the state space by a factor of $|\Sigma|$, transforms a general
26 deterministic (unweighted) FSA into a log |Σ|-separable one. We call this log |Σ|-separation.
27 Intuitively, it augments the state space by introducing a new state pq, yq for every outgoing transition
y
28 qÝÑ q 1 of every state q P Q, such that pq, yq simulates the only state the original state q would
29 transition to upon reading y. Due to the determinism of the original FSA, this results in a
30 log |Σ|-separable FSA with at most |Q||Σ| states.
31 While the increase of the state space might seem like a step backward, recall that using Indyk’s
32 construction, we can construct an HRNN simulating an FSA whose size scales with the square
33 root of the number of states. And, since the resulting FSA is log |Σ|-separable, we can reduce
the space complexity with respect to $|\Sigma|$ to $\log |\Sigma|$. This is summarized in the following theorem,
35 which characterizes how compactly general deterministic FSAs can be encoded by HRNNs. To our
36 knowledge, this is the tightest bound on simulating general unweighted deterministic FSAs with

1 HRNNs.

Theorem 5.1.7: Efficiently simulating general FSAs

Let $\mathcal{A} = (\Sigma, Q, I, F, \delta)$ be a minimal FSA recognizing the language $L$. Then, there exists an HRNN $\mathcal{R} = (\Sigma, D, \mathbf{U}, \mathbf{V}, \mathbf{E}, \mathbf{b}_h, \mathbf{h}_0)$ accepting $L$ with $D = O\left(\log |\Sigma| \sqrt{|\Sigma||Q|}\right)$.

The full $\log |\Sigma|$-separation procedure is presented in Algorithm 2. It follows the intuition of creating a separate "target" for each transition $q \xrightarrow{y} q'$ of every state $q \in Q$. To keep the resulting FSA deterministic, a new, artificial initial state with no incoming transitions is added and is connected to the (augmented) out-neighborhood of the original initial state.

Algorithm 2
1.  def Separate(A = (Σ, Q, I, F, δ)):
2.      A′ ← (Σ, Q′ = Q × Σ ∪ {q′ι}, δ′ = ∅, I′ = {q′ι}, F′ = ∅)
3.      ▷ Connect the out-neighborhood of the original initial state qι to the new, artificial, initial state.
4.      for y ∈ Σ :
5.          for qι −y→ q′ ∈ δ :
6.              add q′ι −y→ (q′, y) to δ′
7.      for q ∈ Q, y ∈ Σ :
8.          for q −y′→ q′ ∈ δ :
9.              add (q, y) −y′→ (q′, y′) to δ′
10.     ▷ Add all state–symbol pairs with a state from the original set of final states to the new set of final states.
11.     for qφ ∈ F, y ∈ Σ :
12.         add (qφ, y) to F′
13.     if qι ∈ F :   ▷ Corner case: if the original initial state qι is a final state, make the artificial initial state q′ι final.
14.         add q′ι to F′
15.     return A′
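A minimal Python sketch of the procedure (ours; the dictionary-based FSA encoding, the function name, and the sanity check at the end are our own illustrative choices) might look as follows.

```python
from collections import defaultdict

def separate(Sigma, Q, q_init, F, delta):
    """log|Sigma|-separation of a deterministic FSA with a single initial
    state q_init; delta maps (state, symbol) -> next state."""
    NEW_INIT = object()                      # fresh artificial initial state
    Q_sep = {(q, y) for q in Q for y in Sigma} | {NEW_INIT}
    delta_sep = {}

    # Connect the new initial state to the out-neighborhood of q_init.
    for y in Sigma:
        if (q_init, y) in delta:
            delta_sep[(NEW_INIT, y)] = (delta[(q_init, y)], y)

    # Every original transition q --y'--> q' becomes (q, y) --y'--> (q', y').
    for (q, y_prime), q_next in delta.items():
        for y in Sigma:
            delta_sep[((q, y), y_prime)] = (q_next, y_prime)

    # Final states, including the corner case of a final original initial state.
    F_sep = {(qf, y) for qf in F for y in Sigma}
    if q_init in F:
        F_sep.add(NEW_INIT)
    return Q_sep, NEW_INIT, F_sep, delta_sep

# Tiny example: a two-state FSA over {a, b}.
Sigma, Q, F = {"a", "b"}, {0, 1}, {1}
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
Q_sep, init_sep, F_sep, delta_sep = separate(Sigma, Q, 0, F, delta)

# Separability check: at most one symbol connects any ordered state pair.
symbols = defaultdict(set)
for (src, y), dst in delta_sep.items():
    symbols[(src, dst)].add(y)
assert all(len(s) <= 1 for s in symbols.values())
```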

7 The following simple lemmata show the formal correctness of the procedure and show that it
8 results in a log |Σ|-separable FSA, which we need for compression in the size of the alphabet.

Lemma 5.1.14

For any $y \in \Sigma$, $(q, y) \xrightarrow{y'} (q', y') \in \delta'$ if and only if $q \xrightarrow{y'} q' \in \delta$.

Proof. Ensured by the loop on Lines 7–9. ■

Lemma 5.1.15

log |Σ|-separation results in an equivalent FSA.


11

Proof. We have to show that, for any $\boldsymbol{y} \in \Sigma^*$, $\boldsymbol{y}$ leads to a final state in $\mathcal{A}$ if and only if $\boldsymbol{y}$ leads to a final state in $\mathcal{A}'$. For the string of length $0$, this is clear by Lines 13 and 14. For strings of length $\geq 1$, it follows from Lemma 5.1.14 that $\boldsymbol{y}$ leads to a state $q$ in $\mathcal{A}$ if and only if there exists $y \in \Sigma$ such that $\boldsymbol{y}$ leads to $(q, y)$ in $\mathcal{A}'$. By Lines 11 and 12, $(q, y) \in F'$ if and only if $q \in F$, finishing the proof. ■

Lemma 5.1.16

log |Σ|-separation results in a log |Σ|-separable FSA.


5

6 Proof. Since the state pq 1 , y 1 q is the only state in Q1 transitioned to from pq, yq after reading y 1 (for
7 any y P Σ), it is easy to see that A1 is indeed log |Σ|-separable. ■

Discussion and the practical applicability of these results. This section showed that Heaviside-activated RNNs are equivalent to WFSAs. This might come as a bit of a surprise, considering that we introduced RNNs with the goal of overcoming some limitations of exactly those models, e.g., the finite context length. However, note that to arrive at this result, we considerably restricted the form of a recurrent neural network. While the restriction to the Heaviside activation function means that all the RNNs we considered in this section can be implemented and represented in a computer, the RNN sequence models that we usually deal with are much more complex than this analysis allowed for. Furthermore, note that RNNs in practice do not learn sparse hidden states of the form considered in the construction in the proof of Lemma 5.1.2—indeed, networks with Heaviside activation functions are not trainable with the methods discussed in §3.2, as the gradient on the entire parameter space would be either 0 or undefined; in this sense, trained networks would never exhibit such hidden-state dynamics. The dynamics of RNNs in practice result in dense hidden states, i.e., states in which many dimensions are non-zero. Nonetheless, keep in mind that, theoretically, due to the finite-precision nature of our computers, all models we ever consider will be at most finite-state—the differentiating factor between them will be how appropriately for the task they are able to learn the topology (transitions) of the finite-state automaton they represent, and how efficiently they are able to do so.

1 Turing Completeness of Recurrent Neural Networks


2 We now turn to the (purely theoretical) treatment of the expressive capacity of recurrent neural
3 networks in which we take the liberty of making somewhat unrealistic assumptions. Specifically, in
4 practice, RNNs usually have the following properties:

5 • The weights and intermediate computations are done with finite floating point precision;

6 • An RNN always operates in real-time, meaning that it performs a constant number of operations
7 before consuming/outputting a symbol.

8 Under these assumptions, we saw that RNNs with Heaviside activations in a practical setting lie at the
9 bottom of the weighted Chomsky hierarchy, being able to only recognize regular languages. However,
10 if we relax these two assumptions, allowing for arbitrary precision and unbounded computation time
11 between symbols, RNNs jump directly to the top of the hierarchy: they become Turing complete.
12 We start by introducing the saturated sigmoid, one of the building blocks we will use to show
13 this.

Definition 5.1.26: Saturated Sigmoid function

The saturated sigmoid is defined as
$$\sigma(x) \overset{\mathrm{def}}{=} \begin{cases} 0 & \text{if } x \leq 0 \\ x & \text{if } 0 < x \leq 1 \\ 1 & \text{if } x > 1 \end{cases}. \qquad (5.81)$$

Intuitively, the saturated sigmoid clips all negative values to $0$, all values larger than $1$ to $1$, and leaves the elements of $[0, 1]$ intact. The graph of this function is shown in Fig. 5.16.

Figure 5.16: The saturated sigmoid (flat at $0$ for $x \leq 0$, linear on $(0, 1]$, flat at $1$ for $x > 1$).
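In code, the saturated sigmoid is simply a clip to $[0, 1]$; the small sketch below (ours) is reused in the later examples.

```python
import numpy as np

def saturated_sigmoid(x):
    # Clip negative values to 0, values above 1 to 1, and keep [0, 1] intact.
    return np.clip(x, 0.0, 1.0)

print(saturated_sigmoid(np.array([-0.5, 0.25, 0.7, 1.3])))  # [0.   0.25 0.7  1.  ]
```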


16
17 The central result of this subsection is then summarized in the following theorem.

Theorem 5.1.8: Saturated Sigmoid Elman RNNs are Turing complete

Elman recurrent neural network sequence models with the saturated sigmoid activation
functions are Turing complete.
18

1 By the end of this subsection, we will have proven this result by showing that Saturated Sigmoid
2 Elman RNNs can encode two-stack pushdown automata which are computationally equivalent to
3 Turing machines (cf. Definitions 4.2.50 and 4.2.51). We start with a simpler construction: building
4 on our placement of RNNs on at least the regular rung of the ladder of formal language complexity
5 (cf. Lemma 5.1.2), we take one step up the ladder and show that RNNs can simulate a single-stack
6 pushdown automaton (cf. Definition 4.2.31). This will help us gain the intuition behind how
7 RNNs can use infinite precision arithmetic to simulate a stack. We will then simply generalize this
8 construction to the two-stack case.
9 Let us begin by considering the problem of representing a stack—an arbitrarily long sequence
of symbols, accessible in a last-in-first-out (LIFO) fashion—in a vector of constant size, e.g., the
11 hidden state of a recurrent neural network. For simplicity, but without the loss of generality, assume
12 that we are working with a simple two-letter stack alphabet Γ “ t0, 1u. Any stack sequence γ will
13 be a member of Γ˚ , i.e., a string of 0’s and 1’s. If we think of the stack symbols as numbers for a
14 moment, there is a natural correspondence between the possible stack configurations and numbers
15 expressed in base 2. By convention, we will represent a string of stack symbols γ with numbers
16 after the decimal point, rather than as integers. Assuming infinite precision, we can therefore
17 simply represent each stack configuration as a single number (of course, the stack alphabet does not
18 have to be exactly Γ “ t0, 1u—we can always map symbols from any alphabet into their numeric
19 representations in some base large enough to allow for the entire alphabet). Notice that in this
20 case, pushing or popping from the stack can be performed by division and multiplication of the
21 value representing the stack—if we want to push a value x P t0, 1u, we can divide the current
22 representation (by 2) and append x to the right side of the new representation and if we want to
23 pop any value, we simply have to multiply the current representation by 2. This also gives us an
24 idea of how to represent a stack in the hidden state of an RNN: the entire stack sequence will simply
25 be represented in a single dimension of the hidden state, and the value stored in the cell will be
26 updated according to the transitions defined by the simulated automaton. Note, however, that the
27 RNN will not only have a single dimension in the hidden state: other dimensions will contain values
28 that will be required to control the RNN updates correctly.
29 In our proofs, we consider a special type of pushdown automata, as defined in Definition 4.2.31:
30 we will use pushdown automata which only consider the topmost element of the stack when
31 defining the possible transitions from a configuration and can only push one stack symbol at a
32 time. More formally, this means that in the tuple P “ pΣ, Q, Γ, δ, pqι , γ ι q, pqφ , γ φ qq, we have that
33 δ Ď Q ˆ Γ ˆ pΣ Y tεuq ˆ Q ˆ Γ rather than the more general δ Ď Q ˆ Γ˚ ˆ pΣ Y tεuq ˆ Q ˆ Γ˚ .
34 Furthermore, we assume that γ ι “ ε and γ φ “ ε, that is, the PDA starts off with an empty stack
35 and has to empty it again to arrive at a final configuration. Note that these restrictions can be done
36 without loss of generality—that is, such pushdown automata are as powerful as the unrestricted
37 versions (Sipser, 2013). With this in mind, we can show that arbitrary precision RNNs are capable
38 of recognizing at least deterministic context-free languages:

Theorem 5.1.9: RNNs can recognize deterministic context-free languages

Elman recurrent neural networks can recognize deterministic context-free languages.


39

40 Before we continue to the proof of Theorem 5.1.9, let us remark on three simple but important
41 intuitions which will be crucial for understanding the construction of the Elman RNN, both in the
42 single- as well as the two-stack variants of PDAs. Multiple times in the construction, we will be

1 faced with the task of moving or copying the value from some dimension i to the dimension j in the
2 vector. The following fact shows how this can be done using simple matrix multiplication with a
3 specific matrix.

Fact 5.1.2: Copying elements of a vector

Let $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{M} \in \mathbb{R}^{D \times D}$ such that $\mathbf{M}_{j,:} = \mathbf{e}_i$, where $\mathbf{M}_{j,:}$ denotes the $j$th row of $\mathbf{M}$ and $\mathbf{e}_i$ denotes the $i$th basis vector. Then, it holds that $(\mathbf{M}\mathbf{x})_j = x_i$.

Also, note that setting the row $\mathbf{M}_{j,:}$ to the zero vector $\mathbf{0} \in \mathbb{R}^D$ sets the entry $(\mathbf{M}\mathbf{x})_j$ to $0$, i.e., it erases the entry.
7 Furthermore, we will use the saturated sigmoid function multiple times to detect whether a
8 number of dimensions of a vector are set to one at the same time. Given the recurrent dynamics of
9 the Elman RNN (cf. Eq. (5.26)), we can perform this check as follows.

Fact 5.1.3: Detecting the activation of multiple values in the hidden state

Let $\sigma$ be the saturated sigmoid from Definition 5.1.26, $m \in \{1, \ldots, D\}$, $i_1, \ldots, i_m, j \in \{1, \ldots, D\}$, $\mathbf{x} \in \mathbb{R}^D$, $\mathbf{b} \in \mathbb{R}^D$, and $\mathbf{M} \in \mathbb{R}^{D \times D}$ such that
$$\mathbf{M}_{j,i} = \begin{cases} 1 & \text{if } i \in \{i_1, \ldots, i_m\} \\ 0 & \text{otherwise} \end{cases}$$
and $b_j = -(m - 1)$. Then, it holds that $\left(\sigma(\mathbf{M}\mathbf{x} + \mathbf{b})\right)_j = 1$ if and only if $x_{i_k} = 1$ for all $k = 1, \ldots, m$.ᵃ

ᵃ Note that this is simply a restatement of Fact 5.1.1, which we include here for clarity and to make the connection with the construction that follows clearer.

11 Lastly, we will sometimes have to turn off certain dimensions of the hidden state if any of the
12 other dimensions are active. Using the dynamics of Elman RNNs and the saturated sigmoid, this
13 can be done as follows.

Fact 5.1.4: Turning off dimensions in the hidden state

Let $\sigma$ be the saturated sigmoid from Definition 5.1.26, $m \in \{1, \ldots, D\}$, $i_1, \ldots, i_m, j \in \{1, \ldots, D\}$, $\mathbf{x} \in \mathbb{R}^D$, $\mathbf{b} \in \mathbb{R}^D$, and $\mathbf{M} \in \mathbb{R}^{D \times D}$ such that
$$\mathbf{M}_{j,i} = \begin{cases} -1 & \text{if } i \in \{i_1, \ldots, i_m\} \\ 0 & \text{otherwise} \end{cases}$$
and $b_j = 1$. Then, it holds that $\left(\sigma(\mathbf{M}\mathbf{x} + \mathbf{b})\right)_j = 0$ if and only if $x_{i_k} = 1$ for some $k = 1, \ldots, m$.
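The three facts above can be checked numerically; the following small sketch (ours, with arbitrarily chosen dimensions and values) applies each of them with the saturated sigmoid to a toy vector.

```python
import numpy as np

sat_sigma = lambda x: np.clip(x, 0.0, 1.0)

D = 4
x = np.array([0.0, 1.0, 1.0, 0.25])

# Fact 5.1.2: copy dimension i into dimension j with a single matrix row.
i, j = 3, 0
M_copy = np.zeros((D, D))
M_copy[j, i] = 1.0                      # row j is the i-th basis vector
assert (M_copy @ x)[j] == x[i]

# Fact 5.1.3: detect that dimensions 1 and 2 are both 1 (conjunction).
M_and, b_and = np.zeros((D, D)), np.zeros(D)
M_and[0, [1, 2]] = 1.0
b_and[0] = -(2 - 1)                     # bias -(m - 1) for m = 2 inputs
assert sat_sigma(M_and @ x + b_and)[0] == 1.0

# Fact 5.1.4: turn dimension 0 off if dimension 1 or 2 is active.
M_off, b_off = np.zeros((D, D)), np.zeros(D)
M_off[0, [1, 2]] = -1.0
b_off[0] = 1.0
assert sat_sigma(M_off @ x + b_off)[0] == 0.0
```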

15 With these intuitions in mind, we now prove Theorem 5.1.9. Due to the relatively elaborate
16 construction, we limit ourselves to pushdown automata with a two-symbol input alphabet Σ “ ta, bu
17 as well as a two-symbol stack alphabet Γ “ t0, 1u. Note, however, that this restriction can be done
18 without the loss of generality, meaning that this is enough to prove the Turing completeness of

1 RNNs in general.16

2 Proof. We show this by constructing, for a given deterministic pushdown automaton P recognizing
3 a deterministic context-free language, an Elman RNN simulating the steps performed by P.
4 Let P “ pΣ, Q, Γ, δ, pqι , γ ι q, pqφ , γ φ qq be such a deterministic pushdown automaton. We now
5 define the parameters of the RNN R “ pΣ, D, U, V, E, bh , h0 q such that the updates to R’s hidden
6 state will correspond to the configuration changes in P.
7 The construction is more involved than the one in Minsky’s theorem (cf. Lemma 5.1.2). We,
8 therefore, first intuitively describe the semantics of the different components of the hidden state
9 of the RNN. Then, we describe the submatrices of the parameters U, V, and b that control these
10 components of the vector. The hidden state h of the RNN will altogether have five components.

11 • Component 1: Data component: This component, consisting of three cells, will contain
12 the actual numerical representation of the stack, STACK, as well as two additional “buffer” cells,
13 BUFF1 and BUFF2 , which will be used for intermediate copies of the stack values during the
14 computation of the new state.

15 • Component 2: Top of stack component: This component contains three cells, each
16 corresponding to a flag denoting that (a) the stack is empty (STACKε ), (b) the top element of
17 the stack is a 0 (STACK0 ), or (c) the top element of the stack is a 1 (STACK1 ).

18 • Component 3: Configuration component: This component encodes the current configu-


19 ration of the stack (Component 2) together with the current input symbol. Note that, while
20 we assume that the input PDA works with the two-symbol alphabet Σ “ ta, bu, the sequence
21 model defined by the RNN requires an eos symbol to be able to terminate generation (cf.
22 Eq. (2.44)): R, therefore, defines the conditional probabilities over the set Σ “ ta, b, eosu.
With this, there are nine possible configurations $(y, \gamma)$ for $\gamma \in \{\varepsilon, 0, 1\}$ and $y \in \{a, b, \text{eos}\}$, meaning that there are nine cells in this component, $\text{CONF}_{\gamma,y}$, each corresponding to one of these configurations.

• Component 4: Computation component: This component contains five cells in which the next value of the stack is computed. There are five cells $\text{OP}_{\text{action},\gamma}$ because all possible actions (PUSH 0, PUSH 1, POP 0, POP 1, and NO-OP) are performed simultaneously, and only the correct one is copied into the data component (Component 1) in the end.

30 • Component 5: Acceptance component: This component contains a single cell, ACCEPT,


31 signaling whether the RNN accepts the string y after reading in the input y eos.

Altogether, the hidden state of $\mathcal{R}$ contains $3 + 3 + 9 + 5 + 1 = 21$ dimensions. The initial hidden state $\mathbf{h}_0$ is a vector with a single non-zero component, whose value is $1$: the cell $\text{STACK}_\varepsilon$, since we assume that the stack of the simulated automaton is empty at the beginning of the execution. We now intuitively describe the dynamics that these components define.

16 To simulate an arbitrary Turing machine with a machine with the binary alphabet, we simply have to encode

each of the finitely-many symbols of the simulated machine using binary encoding.

1 The full update step of the network. The RNN will compute the next hidden state corre-
2 sponding to the new stack configuration by applying the Elman update rule (cf. Eq. (5.26)) four
3 times to complete four discrete sub-steps of the computation. We first define

ht`1 “ ht (5.82)
p0q def

4 and ´ ¯
ht`1 “ σ Uht`1 ` Vepy t q ` bh (5.83)
pnq def pn´1q

5 for n “ 1, 2, 3, 4. Then
ht`1 “ ht`1 . (5.84)
def p4q

6 Intuitively, each of the four stages of computation of the actual next hidden state “detects” some
7 parts of the pattern contributing to the transition in the pushdown automaton. We describe those
8 patterns next intuitively before talking about the submatrices (or subvectors) of the RNN parameters
9 corresponding to the specific parts that update the individual components of the hidden state.

10 Data component. The cells of the data component form a queue of three components: the STACK
11 cell forms the head of the queue, followed by BUFF1 and BUFF2 . The values in the cells are updated
12 at each execution of Eq. (5.83) by moving the currently stored values into the next cell in the queue.
13 By doing so, the entry in BUFF2 gets discarded. The value of STACK is copied from the cells of the
14 computation component by summing them. We will see later that at any point of the computation
15 (when it matters), only one of the computation components will be non-zero, which means that
16 the summation simply corresponds to copying the non-zero computation component. All these
17 operations can be performed by matrix multiplication outlined in Fact 5.1.2.

Encoding the stack sequence. While we outlined a possible encoding of a stack sequence above, the encoding we use in this construction is a bit different. Remember that for a stack sequence $\boldsymbol{\gamma} \in \Gamma^*$ of length $N$, the right-most symbol $\gamma_N$ denotes the top of the stack. We encode the stack sequence $\boldsymbol{\gamma} \in \Gamma^*$ as follows:
$$\mathrm{rep}(\gamma_1 \ldots \gamma_N) \overset{\mathrm{def}}{=} \sum_{n=1}^{N} \mathrm{digit}(\gamma_n)\, 10^{\,n - N - 1} \qquad (5.85)$$
where $\mathrm{digit}(\gamma) \overset{\mathrm{def}}{=} \begin{cases} 1 & \text{if } \gamma = 0 \\ 3 & \text{otherwise} \end{cases}$.

Example 5.1.15: Scalar stack representations

For example, the stack sequence γ “ 00110111 would be represented with rep p00110111q “
0.33313311. Notice the “opposite orientation” of the two strings: the top of the stack in γ is
the right-most symbol, while it is the left-most digit in the numerical representation.
23

24 Note that the digits 1 and 3 in the definition of digit p¨q are chosen somewhat arbitrarily—the
25 encoding could also have been chosen differently. Similarly, a different (non-decimal) base could
26 have been chosen.
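As a quick sanity check of this encoding (ours, in ordinary floating point—the construction itself assumes arbitrary precision), pushing and popping correspond to shifting the decimal representation by one digit:

```python
def rep(stack):
    """Scalar encoding of a stack string over {0, 1} per Eq. (5.85): the top
    (rightmost symbol) becomes the first decimal digit, with digit(0) = 1, digit(1) = 3."""
    digit = lambda g: 1 if g == "0" else 3
    return sum(digit(g) * 10 ** (n - len(stack) - 1) for n, g in enumerate(stack, start=1))

def push(enc, symbol):
    # Pushing shifts the encoding one decimal place and writes the new top digit.
    return enc / 10 + (0.1 if symbol == "0" else 0.3)

def pop(enc):
    # Popping shifts the encoding back; the old top digit is dropped.
    top_digit = int(10 * enc)               # 1 or 3
    return 10 * enc - top_digit, "0" if top_digit == 1 else "1"

print(round(rep("00110111"), 10))           # 0.33313311, as in Example 5.1.15
enc = push(rep("0011011"), "1")
print(round(enc, 10))                       # the same value, built by a push
popped_enc, popped_sym = pop(enc)
print(round(popped_enc, 10), popped_sym)    # back to rep("0011011"), popped "1"
```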

Top of stack component. As mentioned, the $\text{STACK}_\varepsilon$ cell of this component is set to $1$ in the initial state of the RNN. The individual cells of this component then get updated according to the top symbol on the stack encoded in the STACK cell, taking into account how the stack is represented by STACK. Specifically, the parameters of the RNN are defined such that $h_{\text{STACK}_\varepsilon} = 1$ if previously $h_{\text{STACK}} = 0$, $h_{\text{STACK}_0} = 1$ if previously $h_{\text{STACK}} = 0.1\ldots$, and $h_{\text{STACK}_1} = 1$ if previously $h_{\text{STACK}} = 0.3\ldots$.

7 Configuration component. The cells of the configuration component combine the pattern
8 captured by the top of the stack component with the input symbol at the current time step to
9 activate only the appropriate cell CONFγ,y . This can be done by incorporating the information from
10 the top of the stack component with the information about the current input symbol from VJy t K.
11 More precisely, the parameters of R are set such that hCONFγ,y “ 1 if at the previous one of the four
12 sub-steps of the computation of the next hidden state, hSTACKγ “ 1 and the input symbol is y.

Computation component. The computation component contains the cells in which the results of all the possible actions on the stack are computed. The parameters of the computation component are set such that, given that the previous stack encoding is $h_{\text{STACK}} = 0.x_1 x_2 \ldots x_N$, the cells of the computation component are set as
$$h_{\text{OP}_{\text{POP},\gamma}} = 0.x_2 \ldots x_N$$
$$h_{\text{OP}_{\text{PUSH},\gamma}} = 0.\mathrm{digit}(\gamma)\, x_1 \ldots x_N.$$

17 Acceptance component. The cell in the acceptance component is activated if and only if the
18 current input symbol is eos (denoting the end of the string whose recognition should be determined)
19 and the stack is empty, i.e., the STACKε cell is activated.
More precisely, the dynamics described here are implemented by the four sub-steps of the hidden state update as follows (where, as above, $\mathbf{h}_{t+1}^{(0)} = \mathbf{h}_t$).

• In phase 1, the configuration of the stack is determined by setting the top of the stack component in $\mathbf{h}_{t+1}^{(1)}$.

• In phase 2, the configuration of the stack and the input symbol are combined by setting the configuration component in $\mathbf{h}_{t+1}^{(2)}$.

• In phase 3, all possible operations on the stack are performed in the computation component, and, at the same time, the results of all invalid operations (only one operation is valid at each time step due to the deterministic nature of $\mathcal{P}$) are zeroed out in $\mathbf{h}_{t+1}^{(3)}$. This is done by setting the entries of the recurrence matrix $\mathbf{U}$ such that only the valid action is not zeroed out.

• In phase 4, the result of the executed operations (only one of which is non-zero) is copied over to the STACK cell in the hidden state $\mathbf{h}_{t+1} = \mathbf{h}_{t+1}^{(4)}$.

32 Having defined the intuition behind the dynamics of the hidden state updates, we now formally
33 define how the parameters of the RNN are set to enable them. Whenever an entry of a matrix or
34 vector is not set explicitly, it is assumed that it is 0 (that is, we only explicitly set the non-zero
35 values). Again, we define them for each component in turn.

1 The data component. The values of the parameters in the data component are set as follows.

$$\mathbf{U}_{\text{BUFF}_1,\,\text{STACK}} = 1 \qquad (5.86)$$
$$\mathbf{U}_{\text{BUFF}_2,\,\text{BUFF}_1} = 1 \qquad (5.87)$$
$$\mathbf{U}_{\text{STACK},\,\text{OP}_{\text{PUSH},0}} = \mathbf{U}_{\text{STACK},\,\text{OP}_{\text{PUSH},1}} = \mathbf{U}_{\text{STACK},\,\text{OP}_{\text{POP},0}} = \mathbf{U}_{\text{STACK},\,\text{OP}_{\text{POP},1}} = 1 \qquad (5.88)$$

2 The first two elements correspond to moving the values to the next element in the data component
3 queue, while the entries in the last row correspond to summing up the values from the computation
4 component to move them into the stack cell after the computation has been completed. Note that,
5 of course, the elements of the computation component are always summed up and written in the
6 STACK cell, no matter what the values there are. However, the division of the computation of the
7 next hidden state into phases ensures that when it matters, i.e., after the third phase, there is only
8 a single computation component that is non-zero, and that one is copied into the STACK component
9 in the fourth computation sub-step. All other parameters (in V and bh ) are 0.

10 The top of the stack component. The parameters setting the top of the stack component are
11 set as follows:

$$\mathbf{U}_{\text{STACK}_\varepsilon,\,\text{STACK}} = -10 \qquad (5.89)$$
$$\mathbf{U}_{\text{STACK}_0,\,\text{STACK}} = -10 \qquad (5.90)$$
$$\mathbf{U}_{\text{STACK}_1,\,\text{STACK}} = 10 \qquad (5.91)$$
$$b_{\text{STACK}_\varepsilon} = 1 \qquad (5.92)$$
$$b_{\text{STACK}_0} = 3 \qquad (5.93)$$
$$b_{\text{STACK}_1} = -2. \qquad (5.94)$$

12 Other parameters (V) are 0. The reasoning behind these parameters is the following. The cell
13 STACK contains the numeric encoding of the stack content. We distinguish three cases.

14 • If the stack is empty, hSTACK “ 0. Therefore, using the parameters above, the value of the cell
15 STACK1 after the sub-step update will be 0, while the cells STACKε and STACK0 will be 1 due to
16 the positive bias term. This might not be what you would expect—it might seem like, in this
17 case, this step erroneously signals both an empty stack and a stack whose top component is 0.
18 This, however, is corrected for in the configuration component, as we discuss below.

• If the top of the stack is the symbol $0$, $h_{\text{STACK}} = 0.1\ldots$. This means that $10 \cdot h_{\text{STACK}} - 2 < 0$ and, therefore, after the update rule application, $h_{\text{STACK}_1} = 0$. It is easy to see that the setting of the parameters also implies $h_{\text{STACK}_\varepsilon} = 0$. However, since $-10 \cdot h_{\text{STACK}} \geq -2$, we have that $h_{\text{STACK}_0} = 1$.

23 • Lastly, if the top of the stack is the symbol 1, hSTACK “ 0.3 . . .. Therefore, 10 ¨ hSTACK ě 3,
24 meaning that after the update rule application, hSTACK1 “ 1. Again, it is easy to see that the
25 setting of the parameters also implies hSTACKε “ 0. On the other hand, since ´10 ¨ hSTACK ď ´3,
26 it also holds that hSTACK0 “ 0.
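These three cases can be verified directly; the snippet below (ours) plugs representative STACK encodings into the weights and biases above.

```python
import numpy as np

sat_sigma = lambda x: np.clip(x, 0.0, 1.0)

def top_of_stack_cells(h_stack):
    """Apply the weights (5.89)-(5.94) to the scalar STACK encoding and return
    the resulting (STACK_eps, STACK_0, STACK_1) activations."""
    u = np.array([-10.0, -10.0, 10.0])   # weights from STACK to the three cells
    b = np.array([1.0, 3.0, -2.0])       # biases for STACK_eps, STACK_0, STACK_1
    return sat_sigma(u * h_stack + b)

print(top_of_stack_cells(0.0))       # empty stack:     [1. 1. 0.] (STACK_0 spuriously on)
print(top_of_stack_cells(0.1331))    # top symbol is 0: [0. 1. 0.]
print(top_of_stack_cells(0.3311))    # top symbol is 1: [0. 0. 1.]
```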

The configuration component. The configuration component is composed of the most cells of any component:
$$\mathbf{U}_{\text{CONF}_{\gamma,y},\,\text{STACK}_\gamma} = 1 \quad \text{for } \gamma \in \{\varepsilon, 0, 1\},\ y \in \{\text{eos}, a, b\} \qquad (5.95)$$
$$\mathbf{U}_{\text{CONF}_{0,y},\,\text{STACK}_\varepsilon} = -1 \quad \text{for } y \in \{\text{eos}, a, b\} \qquad (5.96)$$
$$\mathbf{V}_{\text{CONF}_{\gamma,y},\,m(y)} = 1 \quad \text{for } \gamma \in \{\varepsilon, 0, 1\},\ y \in \{\text{eos}, a, b\} \qquad (5.97)$$
$$b_{\text{CONF}_{\gamma,y}} = -1 \quad \text{for } \gamma \in \{\varepsilon, 0, 1\},\ y \in \{\text{eos}, a, b\} \qquad (5.98)$$
Here, the first, third, and fourth terms together ensure that the cell $\text{CONF}_{\gamma,y}$ is activated if the current input symbol is $y$ ($\mathbf{V}_{\text{CONF}_{\gamma,y},\,m(y)}$) and the top of the stack is $\gamma$ ($\mathbf{U}_{\text{CONF}_{\gamma,y},\,\text{STACK}_\gamma}$); $b_{\text{CONF}_{\gamma,y}}$ ensures that both conditions have to be met. The second term, $\mathbf{U}_{\text{CONF}_{0,y},\,\text{STACK}_\varepsilon}$, on the other hand, takes care of an edge case: as shown above, $b_{\text{STACK}_0} = 3 > 0$, which means that $\text{STACK}_0$ is set to $1$ even when the stack is empty. The negative weight $\mathbf{U}_{\text{CONF}_{0,y},\,\text{STACK}_\varepsilon} = -1$ ensures that, if the stack is indeed empty, the effect of this spurious activation is "canceled out", i.e., the configuration cell is not activated by mistake.

The computation component. This is the most complicated component. The computation components are manipulated with the following parameters:
$$\mathbf{U}_{\text{OP}_{\text{PUSH},0},\,\text{BUFF}_2} = \mathbf{U}_{\text{OP}_{\text{PUSH},1},\,\text{BUFF}_2} = \tfrac{1}{10} \qquad (5.99)$$
$$\mathbf{U}_{\text{OP}_{\text{POP},0},\,\text{BUFF}_2} = \mathbf{U}_{\text{OP}_{\text{POP},1},\,\text{BUFF}_2} = 10 \qquad (5.100)$$
$$\mathbf{U}_{\text{OP}_{\text{NO-OP}},\,\text{BUFF}_2} = 1 \qquad (5.101)$$
$$\mathbf{U}_{A,\,\text{CONF}_{\gamma,y}} = -10 \quad \text{for } A \in \{\text{OP}_{\text{PUSH},0}, \text{OP}_{\text{PUSH},1}, \text{OP}_{\text{POP},0}, \text{OP}_{\text{POP},1}, \text{OP}_{\text{NO-OP}}\},\ \gamma \in \{\varepsilon, 0, 1\},\ y \in \{a, b\} \qquad (5.102)$$
$$\mathbf{U}_{\text{OP}_{A,\gamma'},\,\text{CONF}_{\gamma,y}} = 0 \quad \text{for } q \xrightarrow{y,\ \gamma \to \gamma'} q \in \delta,\ A \text{ the corresponding action among } \{\text{PUSH}, \text{POP}, \text{NO-OP}\} \qquad (5.103)$$
$$\mathbf{U}_{\text{OP}_{\text{NO-OP}},\,\text{CONF}_{\gamma,\text{eos}}} = 0 \quad \text{for } \gamma \in \{\varepsilon, 0, 1\} \qquad (5.104)$$
$$b_{\text{OP}_{\text{PUSH},0}} = \tfrac{1}{10} \qquad (5.105)$$
$$b_{\text{OP}_{\text{PUSH},1}} = \tfrac{3}{10} \qquad (5.106)$$
$$b_{\text{OP}_{\text{POP},0}} = -1 \qquad (5.107)$$
$$b_{\text{OP}_{\text{POP},1}} = -3 \qquad (5.108)$$
$$b_{\text{OP}_{\text{NO-OP}}} = 0 \qquad (5.109)$$

11 The first three parameters above concern copying the value of the stack encoded by the previous
12 hidden state into the computation component and preparing it for modification. They work together
13 with the corresponding entries in the bias vector bh . For example, a value can be pushed onto the
14 stack by dividing the value of the stack encoding by 10 and adding either 0.1 or 0.3, depending on
15 whether 0 or 1 is being pushed. This is encoded by the first setting above and bOPPUSH,0 and bOPPUSH,1 .
16 Similarly, a value can be popped from the stack by multiplying the stack encoding with 10 and then
17 subtracting the appropriate value according to the bias entry. The NO-OP action is implemented
18 simply by copying the values of the stack into its cell. The remaining three parameter settings above

1 ensure that, after executing all possible stack actions, only the appropriate computation is kept and
2 all others are zeroed out. The fourth row above ensures that “by default”, all computation cells are
3 reset to 0 after every update. However, the next row “removes” the negative weights (sets them to 0)
4 for the changes in the configuration which correspond to the valid transitions, or valid actions, in the
5 pushdown automaton. That is, setting those values of the matrix U to 0 disables “erasing” the entry
6 OPA,γ 1 in the hidden state by the configuration CONFγ,y if the transition from the configuration with
7 the top of the stack γ to γ 1 with the action A upon reading y is encoded by the original automaton.
8 The last remaining row simply ensures that reading in the eos symbol results in the NO-OP action
9 being executed (eos actions are not encoded by the original pushdown automaton).
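To make the arithmetic behind these stack updates concrete, the following short sketch (our own Python illustration, not part of the construction itself; the helper names are hypothetical) mimics the PUSH and POP updates on the fractional encoding of the stack described above: pushing divides the encoding by 10 and adds 0.1 or 0.3, popping multiplies it by 10 and subtracts 1 or 3.

```python
# A small illustration of the fractional stack encoding manipulated by the
# PUSH and POP cells described above (digit 1 encodes symbol 0, digit 3 encodes symbol 1).

def push(h: float, gamma: int) -> float:
    # Divide the encoding by 10 and add 0.1 (gamma = 0) or 0.3 (gamma = 1).
    return h / 10 + (0.1 if gamma == 0 else 0.3)

def pop(h: float, gamma: int) -> float:
    # Multiply the encoding by 10 and subtract 1 (gamma = 0) or 3 (gamma = 1).
    return 10 * h - (1 if gamma == 0 else 3)

h = 0.0          # empty stack
h = push(h, 1)   # stack: 1      -> encoding 0.3
h = push(h, 0)   # stack: 0 1    -> encoding 0.13 (0 is now on top)
print(h)         # 0.13
h = pop(h, 0)    # stack: 1      -> encoding 0.3 (up to floating-point error)
print(h)
```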

The acceptance component. Lastly, the acceptance component is controlled by the following parameters:

U_{ACCEPT, A} = −10    for A ∈ {OP_{PUSH,0}, OP_{PUSH,1}, OP_{POP,0}, OP_{POP,1}, OP_{NO-OP}}    (5.110)
U_{ACCEPT, CONF_{γ,y}} = −10    for γ ∈ {ε, 0, 1}, y ∈ {eos, a, b}    (5.111)
b_{ACCEPT} = 1    (5.112)

The entry b_ACCEPT ensures that, by default, the value of ACCEPT is set to 1. However, the other parameters ensure that, as soon as any part of the configuration is not compatible with the acceptance state (the read symbol is not eos or the stack is not empty), the acceptance bit is turned off.
A full proof of the theorem would now require us to show formally that the update rule in Eq. (5.84) results in the correct transitions of the PDA. We, however, stop at the intuitive reasoning behind the setting of the parameters and leave the formal verification as an exercise for the reader. The construction is also demonstrated in the Python implementation available here: [Link] ■
The construction described in the proof of Theorem 5.1.9 is demonstrated in the following example.

Example 5.1.16: Siegelmann’s construction

Let P be the single-stack PDA presented in Fig. 5.17. We now simulate the recognition of the string y = ab, which is accepted by P. The initial state h_0 has a single non-zero cell, STACKε. The four phases of the processing of the first input symbol a are shown in Tab. 5.1, and the four phases of the processing of the second input symbol b in Tab. 5.2.

Theorem 5.1.9 shows that Elman RNNs are theoretically at least as expressive as deterministic CFGs. We now return to the main result of this subsection: the Turing completeness of RNNs. Luckily, Theorem 5.1.9 gets us most of the way there! Recall that, by §4.2.9, two-stack PDAs are Turing complete. We make use of this fact by generalizing the construction in the proof of Theorem 5.1.9 to the two-stack case. This will prove that RNNs can in fact simulate any Turing machine and are, therefore, Turing complete.

Lemma 5.1.17

Let P “ pQ, Σ, Γ, δ, pqι , γ ι , σ ι q, pqφ , γ φ , σ φ qq be a two-stack pushdown automaton. Then,


there exists an Elman RNN R simulating P, i.e., L pRq “ L pPq.
29

             Initial state   Phase 1   Phase 2   Phase 3   Phase 4

STACK             0            0         2/5       2/5       1/10
BUFF1             0            0         0         2/5       2/5
BUFF2             0            0         0         0         2/5
STACKε            1            1         1         0         0
STACK0            0            1         1         0         0
STACK1            0            0         0         1         1
CONFeos,a         0            0         1         1         0
CONFeos,b         0            0         0         0         0
CONF0,a           0            0         0         0         0
CONF0,b           0            0         0         0         0
CONF1,a           0            0         0         0         1
CONF1,b           0            0         0         0         0
CONFε,eos         0            0         0         0         0
CONF0,eos         0            0         0         0         0
CONF1,eos         0            0         0         0         0
OPPUSH,0          0            1/10      1/10      1/10      1/10
OPPUSH,1          0            3/10      3/10      0         0
OPPOP,0           0            0         0         0         0
OPPOP,1           0            0         0         0         0
OPNO-OP           0            0         0         0         0
ACCEPT            0            1         0         0         0

Table 5.1: The simulation of the processing of the first symbol a by the RNN simulating the PDA
in Fig. 5.17. After the fourth phase, the stack cell contains the encoding of the stack as 0.1.

             Initial state   Phase 1   Phase 2   Phase 3   Phase 4

STACK             1/10         1/10      1         2/5       0
BUFF1             2/5          1/10      1/10      1         2/5
BUFF2             2/5          2/5       1/10      1/10      1
STACKε            0            0         0         0         0
STACK0            0            1         1         0         0
STACK1            1            0         0         1         1
CONFeos,a         0            0         0         0         0
CONFeos,b         0            0         0         0         0
CONF0,a           0            0         0         0         0
CONF0,b           0            0         1         1         0
CONF1,a           1            0         0         0         0
CONF1,b           0            1         0         0         1
CONFε,eos         0            0         0         0         0
CONF0,eos         0            0         0         0         0
CONF1,eos         0            0         0         0         0
OPPUSH,0          1/10         0         0         0         0
OPPUSH,1          0            0         0         0         0
OPPOP,0           0            0         0         0         0
OPPOP,1           0            1         0         0         0
OPNO-OP           0            0         2/5       0         0
ACCEPT            0            0         0         0         0

Table 5.2: The simulation of the processing of the second symbol b by the RNN simulating the PDA
in Fig. 5.17. After the fourth phase, the stack cell contains the encoding of the empty stack.

Figure 5.17: The single-stack pushdown automaton P, with a single state q and the transitions a, ε → 0; a, 0 → 1; a, 1 → ε; b, ε → 1; b, 0 → ε; b, 1 → 1.

1 Proof. Again, given a two-stack PDA P “ pQ, Σ, Γ, δ, pqι , γ ι , σ ι q, pqφ , γ φ , σ φ qq with Σ “ ta, bu,
2 Γ1 “ t0, 1u, and Γ2 “ t0, 1u, we construct an Elman RNN R “ pΣ, D, U, V, E, bh , h0 q which
3 recognizes the same language as P.
4 The hidden state of R will contain the same components as in the proof of Theorem 5.1.9.
5 Moreover, their dynamics will be exactly the same—they will simply be larger to account for more
6 possible configurations of the two stacks together. For example, the top of the stack component will
7 now consist of cells STACKγ1 γ2 for γ1 P Γ1 and γ2 P Γ2 flagging that the top symbol on stack 1 is γ1
8 and the top symbol of stack 2 is γ2 . Furthermore, the configuration component will contain cells
9 of the form OPaction,γ 1 γ 2 with an analogous interpretation. Lastly, all the computation and data
10 component cells would be duplicated (with one sub-component for each of the stacks), whereas the
11 acceptance component stays the same. We now again describe these components and their dynamics
12 intuitively, whenever there is any difference to the single-stack version.

13 Data component. Instead of having a single queue of data cells, R now has two queues, one
14 for each of the stacks. The first queue will be formed by STACK1, BUFF11 , and STACK21, and the
15 second one by STACK2, BUFF12 , and STACK22. Each of the queues acts exactly the same as in the
16 single-stack version, and they act independently based on the computations done in the computation
17 cells of the respective stacks. Each of the stacks is also encoded as a numeric sequence in the same
18 way as in the single-stack version.

19 Top of stack component. Again, R starts in an initial state in which the cell STACKεε is 1 and
20 all others are 0. The individual cells of this component then get updated according to the top
21 symbols on the stacks encoded in the STACK1 and STACK2 cells.

22 Configuration component. The cells of the configuration component combine the pattern
23 captured by the top of both stack components with the input symbol at the current time step to
24 activate only the appropriate cell CONFγ1 γ2 ,y .

25 Computation component. Again, the computation component contains the cells in which the
26 results of all the possible actions on both stacks are executed. They execute the actions on both
27 stacks independently.

28 Acceptance component. The acceptance component functions identically to the single-stack


29 case.

1 Using these components, the RNN then transitions between the phases exactly like in the
2 single-stack case. We leave the specifications of the matrix parameter values to the reader. They
3 again follow those presented in the single-stack case but treat the transitions and configurations of
4 both stacks.
5 ■

6 The Computational Power of RNN Variants


Most of this section was devoted to understanding the computational power of simple Elman RNN language models, whose simplicity allows for an easy connection to formal models of computation. However, as mentioned in §5.1.5, gated RNN variants such as LSTMs and GRUs have become the standard in modern natural language processing, showing better and more reliable performance on a variety of tasks. Interestingly, besides their better resilience to the vanishing gradient problem, LSTM-based language models are also provably more expressive than simple RNN language models. On the other hand, the simpler GRU-based language models are in some ways only as powerful as simple Elman RNN language models. To give an intuition for this, this subsection provides a short overview of the results concerning the computational power of RNN variants.
Weiss et al. (2018) compare the practical computational power of different RNN variants under the constraint of bounded computation time and limited precision. Interestingly, they empirically find that LSTMs can learn to recognize languages that require some form of counting, like {a^n b^n | n ∈ N} or {a^n b^n c^n | n ∈ N}, while GRUs struggle to do so. This invites the comparison of LSTMs with counter machines (Hopcroft et al., 2006; Fischer et al., 1968), a class of formal computational models with the ability to count. Simply put, counter machines are finite-state automata with an additional (unbounded) counter cell, which they can manipulate by incrementing and decrementing the value stored in it.17 Counter machines present an interesting addition to the traditional hierarchy of computational models, since they in some ways cut across it: they can recognize some, but not all, context-free languages, while also being able to recognize some context-sensitive languages. Among others, they can for example recognize the context-free language {a^n b^n | n ∈ N} and the context-sensitive language {a^n b^n c^n | n ∈ N}, while not being able to recognize the Dyck context-free languages D(k) with k different parenthesis types (intuitively, this is because recognizing Dyck languages requires keeping track of the order in which the parentheses appeared, not only their counts).
Counter machines can recognize languages like {a^n b^n c^n | n ∈ N} by counting the number of as and making sure that it matches the numbers of bs and cs. Further, by analyzing the activations of the memory cell and of the hidden state, Weiss et al. (2018) find that one can recognize a counting mechanism implemented by the LSTM, which uses one or more dimensions of the memory cell as counters. This result is particularly interesting as GRUs are often considered an equivalent variant of the LSTM, with the same computational power but smaller computational overhead. However, GRUs seem to lack this counting behavior, an observation that is also backed by theoretical analysis.
Merrill et al. (2020) look at the computational power of the different RNN variants from another perspective. They consider space complexity and whether the networks are rationally recurrent, i.e., whether their hidden state update function can be expressed in terms of finite-state machine computations. Making the assumption of saturated networks,18 they find that, while GRUs and Elman networks are rationally recurrent and therefore at most regular, LSTMs are not, meaning that their hidden state update function cannot be expressed by means of finite-state machines.
17We only consider counter machines with a single counter cell.
18 Informally, a saturated network is a neural network in which the norms of the parameters are taken to ∞. This has the effect of pushing all the squashing functions of the network to one of their extreme values: {0, 1} in the case of the sigmoid and {−1, 1} in the case of the hyperbolic tangent.

Figure 5.18: Picture from Weiss et al. (2018): plot of the activation value of the memory cell (LSTM) and of the hidden state (GRU) versus the index of the input symbol. The networks have been trained to recognize the languages {a^n b^n | n ∈ N} or {a^n b^n c^n | n ∈ N}. In both cases, the LSTM has learned to use one or more dimensions of the memory cell as a counter, which allows it to count how many as and bs have been consumed so far. Conversely, the GRU has not developed this counter mechanism, and empirical evidence indeed shows that it struggles to recognize the described languages.


3 Consequences of the Turing completeness of recurrent neural networks


The section above outlines Siegelmann and Sontag's (1992) construction for encoding a Turing machine in an RNN. While Turing completeness means that RNNs are in many ways computationally very powerful (as powerful as they can be), it also brings with it many of the computational challenges faced when working with Turing machines. Computability theory, for example, defines many problems related to Turing machines and their properties which are not computable, or undecidable, meaning that no algorithm (or, equivalently, a Turing machine) can solve them.19 The most classical and fundamental of such problems is the halting problem (Turing, 1937).
19 The notion of solving a problem computationally is quite nuanced, and we only provide very broad intuitions for now. Readers who want to delve deeper into the topic of computability, and with it the implications of Turing completeness, are encouraged to look at some classical textbooks on the material, for example Sipser (2013, Part Two).

Definition 5.1.27: Halting problem

Let M be a Turing machine over the input alphabet Σ and y ∈ Σ*. The halting problem is the problem of deciding whether M (halts and) accepts y.a
a A formal definition of the halting problem would require us to define the formal language L of Turing machine–input pairs (M, y) and to ask whether there exists a Turing machine M′ that accepts L. However, to keep the discussion brief, we define the problem more intuitively.
1

The halting problem is the foundation of the results presented by Chen et al. (2018), who, building on Siegelmann and Sontag (1992), consider its implications for the theory of RNN language models. For example, they show that determining many practically useful properties of general rationally weighted RNNs is undecidable. Such properties include the tightness of a general RNN,20 the equivalence of two RNN language models, the minimal size (in the size of the hidden state) of an RNN defining the same language model as some given RNN, and the highest-probability string of an RNN, i.e., argmax_{y ∈ Σ*} p_LN(y). By simulating a Turing machine, we can encode the halting problem as solving any of those tasks.21 This means that if we could solve these problems for RNNs, we could also solve the halting problem, which is provably impossible, meaning that these problems are not computable either. We briefly outline the individual findings of Chen et al. (2018) below, sketching the rough idea behind the proofs.22

Theorem 5.1.10: Tightness

Determining tightness of an RNN language model is undecidable.


13

Proof. Note first that not all RNNs are tight language models. As mentioned, this is not a contradiction to the results from §5.1.3, which considered softmax-projected RNN language models. Indeed, the RNN we constructed in §5.1.6 is not such an RNN. That construction shows that, given an arbitrary Turing machine M and input y, we can construct an RNN that simulates M running on y. Using this construction, we can reduce the halting problem, i.e., the problem of answering “Does M halt on y?”, to the problem of deciding the tightness of a general (rationally weighted Elman) RNN. We do that by constructing an RNN that simulates the given Turing machine M on the input y, ending generation if M halts, and, at the same time, produces strings according to a distribution that is tight on the condition that no infinite-length strings can be produced. The language model is then tight if and only if M halts on y. Therefore, deciding tightness is at least as hard as solving the halting problem, and since the latter is undecidable, so is tightness. ■

Theorem 5.1.11: Highest weighted string

Finding the highest weighted string of an RNN LM is undecidable.


25

26 Proof. Once more, we reduce the halting problem to the given task on the RNN. The idea behind
27 this is to again simulate an arbitrary Turing machine by constructing an RNN LM which is not
20 Note that the results from §5.1.3 consider only RNN LMs with the softmax projection function.
21 In computability theory, this is known as reducing the halting problem to the given problem.
22 The interested reader is encouraged to see the original paper for the details.

tight unless the simulated Turing machine halts. One can create such an RNN with a one-symbol output alphabet such that its highest-weighted output is an infinite string. Now, if we again enforce that the RNN stops producing outputs once the simulated Turing machine halts, it can be shown that every string has a probability of less than 0.12 if and only if the Turing machine does not halt. On the other hand, if the Turing machine does halt after T steps, producing a string of length 3T − 5 has a probability of at least 0.25. Therefore, the weight of the highest-probability string depends on whether the simulated Turing machine halts, which is undecidable. ■

Theorem 5.1.12: Equivalence

Equivalence between two RNN LMs in the sense of defining the same language model is
undecidable.
9

10 Proof. The proof of this claim is again a reduction from the halting problem. We construct an RNN
11 which simulates a given arbitrary Turing machine until it halts and has the same outputs as some
12 other RNN. As soon as the Turing machine halts, the outputs differ, so the RNNs are equivalent if
13 and only if the Turing machine does not halt. Hence, equivalence is undecidable. ■

Theorem 5.1.13: Minimization


Finding the RNN with the minimum number of hidden layer neurons defining the same
language model as a given RNN LM is undecidable.
14

Proof. We can reduce the halting problem to the following problem: for some RNN LM and an integer D, return yes if there is another RNN LM with at most D hidden units that generates the same weighted language. Assume that there is a Turing machine M that can decide this problem. Now, for another Turing machine M′ and input y, construct a one-symbol RNN LM R that simulates M′ running on y and stops generating if M′ halts. We assume without loss of generality that M′ runs for more than one computation step. Now we run M on the input (R, 0), which checks whether there is another RNN LM that generates the same weighted language as R and has no hidden units. Having no hidden units means that the output probabilities of each symbol would have to be constant across time steps. If M returns true, the output probabilities of R cannot change over time, which means that M′ has to run forever. Conversely, if M returns false, the output probabilities change when M′ halts. Therefore, M deciding the minimal-hidden-state problem on (R, 0) is equivalent to it deciding the halting problem for (M′, y). ■
This concludes our investigation of the formal properties of recurrent neural language models. The sequential nature of the architecture and the relatively simple transition functions of vanilla RNN architectures made the link to automata from formal language theory relatively straightforward, which allowed for relatively strong theoretical insights. However, it was exactly this sequential nature of RNNs and the issues associated with it that eventually led to them being overtaken by another neural architecture, which is now at the core of most, if not all, modern state-of-the-art language models: the transformer.23 We introduce transformers and discuss their theoretical underpinnings in the next section.
23We will not discuss the issues with the training speed and parallelization of RNNs in detail. Some of these issues

will be highlighted in the latter parts of the course.



1 5.2 Transformer-based Language Models


In the previous section (§5.1.2), we introduced and studied RNN language models as language models capable of storing an arbitrarily long context in their encoding enc(y_{<t}) by updating their hidden state h_t an arbitrary number of times. As we saw in their theoretical analysis, this mechanism gives them, under some assumptions, a lot of expressive power. Nevertheless, as we also discussed, RNN LMs come with their own distinct set of drawbacks. Some of them, e.g., the exploding and vanishing gradient problems, can be amended using specific mechanisms resulting in more complex recurrent neural networks, such as LSTMs and GRUs (cf. §5.1.5). As discussed in §5.1.5, a more fundamental issue that cannot be avoided is the difficulty of parallel training, which is particularly noticeable on the vast internet-scale corpora used nowadays to train language models. Can we do anything about that? As discussed, the inherently sequential nature of RNNs suggests strict limits on this front. Motivated by this limitation, in this section, we present a newer architecture that took over the field of language modeling (and NLP in general): the transformer (Vaswani et al., 2017). It was originally introduced for machine translation, but it can easily be applied to language modeling and has led to the success of models such as GPT-n.
16 The structure of this section will be a bit different from that of the other sections in the notes.
17 We will first give a formal definition of a transformer model in §5.2.2 and, based on this definition,
18 derive a number of results analogous to those for RNN LMs from §5.1.6. However, due to the current
19 practical relevance of transformers in language modeling, we then devote a significant portion of the
20 section to more practical aspects of transformer models and introduce a number of modifications
21 used in modern language modeling systems.

22 5.2.1 Informal Motivation of the Transformer Architecture


23 Before we introduce transformers, let us consider another practical drawback of RNN LMs, which
24 will give us more clues on how to improve them and motivate the architectural decisions behind
25 transformers. Luckily, the simplest patch-up of this issue will also lend itself naturally to paral-
26 lelization, as we will see shortly. The main characteristic of RNNs is the use of a single hidden
27 state ht to represent an arbitrary prefix of any string y ăt up to the current time step t. While this
28 allows RNNs to model strings of any length, it also means that arbitrarily long strings must be
29 compressed into this hidden vector of fixed size. Intuitively, this becomes increasingly difficult as
30 the length of the context grows: As the amount of information to be compressed into the hidden
31 state increases with the prefix length, the hidden state may struggle to model the entirety of the
32 preceding context. How can we amend that? The simplest naı̈ve way to go about this is to retain
33 the contextual encodings of all prefixes of the string. In this case, it is actually more natural to
34 talk about contextual encodings not of full prefixes, but simply of the individual symbols in the
35 string.24 Here, contextual means that the symbol encodings are augmented with the information
36 from the rest of the string (in most cases, about the preceding context, as with the hidden states of
37 an RNN). With this, we avoid the need to summarize the entire context into a single state. Note
38 that our infinite-precision RNNs from the previous section implicitly did that as well—for example,
39 by storing the information in the “stack” neurons, they could, in principle, store the entire history of
40 the string. However, storing all the encodings explicitly makes their utilization more direct and thus
41 easier. This of course leaves the model with the issue of remembering increasingly large amounts of
24 For example, looking back to RNNs, we could consider h to simply be an encoding of the symbol y augmented
t t
with the information from y ăt .

Figure 5.19: An abstract depiction of how a transformer language model produces the contextual embeddings of all symbols in a string. The hidden state h_t can “attend to” (the precise meaning of this term will be introduced soon in §5.2.2) all preceding symbols y_{<t} and the current symbol y_t.

1 information as the length of the context increases, but we will, for the moment, assume that we can
2 always store enough information to process any string in this way.
3 Having decided to keep around the encodings of all symbols in the string, let us think about
4 parallelizing the process of encoding a string, i.e., computing encpyq. Remember the very general
5 way in which RNNs build a representation of the string y ăt by incrementally modifying ht , which
6 is illustrated in Fig. 5.1a—this incrementality brings with it all the challenges of impossible par-
7 allelization. The workaround for the issues with the sequential processing of RNNs is to process
8 the context for each y t independently, without relying on the encodings of the previous symbols,
9 thus avoiding the sequential bottleneck. Nevertheless, we still want the contextual encoding of y t to
10 contain information about the rest of the string, i.e., the preceding context. How can we achieve that
11 without relying on recurrence? Again, we grab onto the simplest solution: to compute the symbol
12 encodings for each symbol y t “from the ground up” based only on the static symbol encodings e1 py t q,
13 which do not require any recurrence. This is abstractly illustrated in Fig. 5.19, whereas Fig. 5.20
14 shows how this translates into the generative framework from Definition 3.1.11, where individual
15 symbols y t are both sequentially generated based on the encodings of the preceding context encpy ăt q
16 as well as used to build the representation of the context in the next time step encpy t q. Notice that
17 instead of containing arcs denoting dependencies between the symbol encodings (“hidden states”)
18 ht , Fig. 5.19 and Fig. 5.20 contain arcs connecting each ht to all symbols y j for all j ď t. Compare
19 this to Fig. 5.1a, where the arcs between ht´1 and ht induce the temporal dependence, and carry
20 the information about the symbols y j for all j ď t.
21 Clearly, this avoids the issues faced by RNNs due to their sequentiality. However, it also
22 introduces more work required to compute the individual contextual encodings from the static
23 symbol representations. The operations performed to do this by transformers, which are together
24 known as the attention mechanism, are introduced in the next subsection. They represent
25 possibly the most important component of the entire structure of the transformer—by “attending”
26 to relevant preceding symbols when computing the symbol encodings (i.e., using them to compute
27 encpy ăt q), the transformer can model long-range dependencies very effectively, and use them for
28 appropriately modeling the distribution over the next word.
29 To recap, in this section, we informally motivated the new architecture, transformers, with the
30 goals of 1. remembering the contextual encodings of all symbols explicitly and 2. parallelizing the
31 computation of the contextual symbol encodings. The next subsection formally introduces the
32 architecture, before we dive into their theoretical properties.

Figure 5.20: An abstract depiction of how a transformer language model generates a string one symbol at a time. The hidden state h_t can attend to all previously generated symbols y_{<t} to sample the next symbol y_t. The dotted lines denote the sampling steps.

1 5.2.2 A Formal Definition of Transformers


Having informally introduced the two main ideas behind the transformer architecture in the previous subsection, we now provide a formal definition of a transformer model, which we will then augment with more practical considerations later in this section.

Definition 5.2.1: Transformer network

A transformer network T is a tuple pΣ, D, encT q where

• Σ is the alphabet of input symbols,


• D is the dimension of T , and
• encT is the transformer encoding function (cf. Definition 3.1.7), which we define in more
detail below (Definition 5.2.2).
5

6 From afar, the definition of a transformer network is therefore relatively simple; it is stated to
7 make the transformer models fit well into our representation-based locally normalized language
8 modeling framework (cf. Definition 3.1.11). The complexity of the models of course comes from the
9 definition of the transformer encoding function encT , to which we devote the rest of the section.
10 Continuing in the framework of representation-based locally normalized language models, the
11 hidden states of the transformer play an analogous role to those of RNNs, with the only difference
12 being how they are computed.

Definition 5.2.2: Transformer hidden state

Let T = (Σ, D, enc_T) be a transformer network. The hidden state h_t ∈ R^D describes the state of T after reading y_{≤t}. It is defined with respect to the transformer encoding function enc_T as follows:

h_t := enc_T(y_{≤t})    (5.113)

13

14 Crucially, as we will see shortly, the hidden state ht of the transformer does not have any
15 dependence on the preceding hidden states themselves (although, as we will see, it is partially a

1 function of the same inputs).


2 As hinted above, with this, we can easily fit transformers into the representation-based locally
3 normalized language modeling framework and define a sequence model based on the model.

Definition 5.2.3: Transformer sequence model

Let T be a transformer network and E ∈ R^{|Σ|×D} a symbol representation matrix. A D-dimensional transformer sequence model over the alphabet Σ is a tuple (Σ, D, enc_T, E) defining the sequence model of the form

p_SM(y_t | y_{<t}) := softmax(E h_{t−1})_{y_t} = softmax(E enc_T(y_{<t}))_{y_t}    (5.114)
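As a minimal, self-contained illustration of Eq. (5.114) (the names below are our own placeholders; in particular, enc_T stands in for the transformer encoding function defined in the rest of this section and is replaced here by a random vector so the sketch runs on its own):

```python
import numpy as np

# A minimal sketch of Eq. (5.114): p_SM(y_t | y_<t) = softmax(E enc_T(y_<t))_{y_t}.
rng = np.random.default_rng(0)
D, alphabet = 4, ["a", "b", "eos"]
E = rng.normal(size=(len(alphabet), D))     # symbol representation matrix

def enc_T(prefix):                          # placeholder for the context encoder
    return rng.normal(size=D)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

p = softmax(E @ enc_T(["a", "b"]))          # distribution over the next symbol
print(dict(zip(alphabet, p.round(3))))
```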

Now that we have connected the transformer T to the theoretical framework introduced so far in the course, we can look at the internal structure of the transformer encoding function, which is where the novelty of the transformer architecture comes from.

8 The Attention Mechanism


9 As we mentioned in the informal motivation, to avoid over-compressing information about sentences
10 into a single vector, a transformer model retains the encodings (captured in the hidden states ht ) of
11 all possible prefixes of the string, which we can equivalently simply regard as encodings of individual
12 symbols augmented with the information from the preceding string (see Fig. 5.19).25 However,
13 rather than computing the encodings sequentially like an RNN, the encodings of the individual
14 symbols are computed with the so-called attention mechanism.

Definition 5.2.4: Attention

Let f : R^D × R^D → R be a scoring function and f_{Δ^{D−1}} a projection function. Furthermore, let q ∈ R^D, K_t = (k_1^⊤, . . . , k_t^⊤) ∈ R^{t×D}, and V_t = (v_1^⊤, . . . , v_t^⊤) ∈ R^{t×D}.
Attention over K_t, V_t, also denoted by Att(q, K_t, V_t) : R^D × R^{t×D} × R^{t×D} → R^D, is a function computing the vector a_t in the following two-step process:

s_t = (s_1, . . . , s_t) := f_{Δ^{D−1}}(f(q, k_1), f(q, k_2), . . . , f(q, k_t))    (5.115)
a_t = Att(q, K_t, V_t) := s_1 v_1 + s_2 v_2 + · · · + s_t v_t    (5.116)

15

16 q, K, and V are commonly referred to as the query, keys, and values of the attention
17 mechanism, respectively. We talk about the parameters q, K, and V completely abstractly for now.
18 However, to help you connect this to the representation-based language modeling framework, note
19 that q will later correspond to a query representing an individual symbol y t , whereas K and V will
20 contain the information from y ăt used to compute ht .

21 What the attention function computes. The scoring function f is, abstractly, simply a
22 parameter of the model which we can choose freely. Intuitively, it should express the relevance of a
25 From now on, we will talk about contextual symbol encodings, which simply refers to the hidden states

corresponding to individual symbols.



particular key k to the query q: the more the key is relevant to the query, the more “attention” the model will pay to the value associated with that key. The projection function f_{Δ^{D−1}} then transforms the computed scores, ensuring that the transformed scores sum to 1.26 The vector of transformed scores s (Eq. (5.115)) is then used to compute the result of the attention function, the vector a, which is a convex combination of the values v passed to the attention function. Abstractly, therefore, the keys contain the information used for “indexing” the values with the specific query.

The scoring function. As mentioned, the scoring function is supposed to measure the “relevance” of a particular value for a query q through the value's key. The most common choice for f is the dot product between the query and the key, which is often scaled by the square root of the vector dimensionality:

f(q, k) = (1/√D) ⟨q, k⟩    (5.117)
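As a small sketch of Definition 5.2.4 with the scaled dot-product scoring function of Eq. (5.117) and, for concreteness, the softmax as the projection function (introduced formally just below); the variable names are ours:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention(q, K, V):
    """Att(q, K_t, V_t): scaled dot-product scores, softmax projection,
    and a convex combination of the value vectors (Eqs. 5.115-5.116)."""
    D = q.shape[0]
    scores = K @ q / np.sqrt(D)   # f(q, k_j) = <q, k_j> / sqrt(D) for all j
    s = softmax(scores)           # normalized attention weights, summing to 1
    return s @ V                  # a = s_1 v_1 + ... + s_t v_t

rng = np.random.default_rng(1)
t, D = 5, 8
q, K, V = rng.normal(size=D), rng.normal(size=(t, D)), rng.normal(size=(t, D))
print(attention(q, K, V).shape)   # (8,)
```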

10 The projection function and soft and hard attention. The projection function used to
11 transform the un-normalized attention scores is a crucial component of the transformer model. By
12 far the most commonly used projection function is again the softmax. In this case, the attention
13 function is referred to as soft attention.

Definition 5.2.5: Soft attention


The soft attention is computed with the projection function f ∆D´1 “ softmax.
14

15 However, the softmax again makes the models difficult to analyze. In our voyage to theoretically
16 understand transformer-based language models, we will, therefore, again make specific (less frequently
17 used) modeling choices, particularly in the case of the projection function.
Indeed, to be able to derive any interesting expressivity results (see §5.3), we jump to the other side of the spectrum and define hard attention. Simply put, instead of spreading the attention across all values like the softmax, hard attention puts all the mass on the element whose key maximizes the scoring function f. One way to arrive at it from the definition of soft attention is by sending the temperature τ in the definition of the softmax function (cf. Definition 3.1.10) to 0. Recall that this results in the output vector representing a uniform distribution over the elements that maximize the input vector. This is known as averaging hard attention.

Definition 5.2.6: Averaging hard attention

The averaging hard attention is an attention mechanism with the projection function f_{Δ^{D−1}} = hardmax_avg, where hardmax_avg is defined as:

hardmax_avg(x)_d := 1/r if d ∈ argmax(x), and 0 otherwise,    for d = 1, . . . , D    (5.118)

where x ∈ R^D and r = |argmax(x)| is the cardinality of the argmax set over x.


25

26While the fact that the transformed scores sum to one invites their interpretation as probabilities, this is not

their central role. Rather, the weights are simply used to define a convex combination of the values.

1 Interestingly, there is another form of hard attention that results in a model with a different
2 expressive capacity: the unique hard attention. The difference lies exactly in how it handles ties
3 in the elements which maximize the scoring function. Unique hard attention chooses only a single
4 element of those that maximize the score: it can be chosen randomly or deterministically (e.g.,
5 always the first one).

Definition 5.2.7: Unique hard attention

The unique hard attention is an attention mechanism with the projection function f_{Δ^{D−1}} = hardmax_uni, where hardmax_uni is defined as follows. For x ∈ R^D, sample d̂ ∼ Unif(argmax(x)) or choose some d̂ ∈ argmax(x) deterministically. Then

hardmax_uni(x)_d := 1 if d = d̂, and 0 otherwise,    for d = 1, . . . , D.    (5.119)
6

While the difference between unique and averaging hard attention might seem subtle and marginal, it actually results in a large difference in the expressivity of transformer-based language models, as we discuss in §5.3. While we will investigate this in a lot more detail there, we just mention that the intuition behind the expressive difference is relatively straightforward: while the keys maximizing the unnormalized scores might be identical (even though they do not have to be if f is not injective), the values (whose content is decoupled from the keys) that those keys index might not be, and in some cases, all of those different values might be relevant for the task at hand. Unique hard attention only ever allows us to “look up” a single value associated with those keys, no matter how “different” and relevant all of them are. It also does not allow the attention mechanism to sum (“summarize”) across all the elements that maximize the attention scores. This is a very limiting characteristic, as many of the expressivity results that we will see later rely on summing over all the elements that maximize the attention scores.
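To make the distinction concrete, the following sketch (our own naming) applies the three projection functions to the same score vector; note how averaging hard attention spreads its mass uniformly over all maximizing positions, while unique hard attention keeps only a single one:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def hardmax_avg(x):
    # uniform mass over all positions attaining the maximum (Eq. 5.118)
    mask = (x == x.max()).astype(float)
    return mask / mask.sum()

def hardmax_uni(x):
    # all mass on a single maximizing position, chosen here deterministically
    # as the first one (Eq. 5.119)
    out = np.zeros_like(x, dtype=float)
    out[int(np.argmax(x))] = 1.0
    return out

scores = np.array([2.0, 5.0, 5.0, 1.0])
print(softmax(scores))      # soft attention: mass spread over all positions
print(hardmax_avg(scores))  # [0. , 0.5, 0.5, 0. ]
print(hardmax_uni(scores))  # [0. , 1. , 0. , 0. ]
```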

19 Transformer Blocks

20 We have, so far, described the “low-level” details of how the attention function is computed. We now
21 combine the computations performed into larger blocks, showing how they are used to compute the
22 string-augmented encodings of the individual symbols. In particular, we have to connect the concepts
23 of queries, keys, and values to the symbols and their (initial, static) encodings. Intuitively, this is
24 done by transforming the static encodings of those symbols using specific functions implemented
25 through the attention mechanism and using the transformed encodings as queries, keys, and values,
26 as we describe below.
We first abstract the attention mechanism from Definition 5.2.4 a bit. With this, we will, in a few steps, arrive at exactly how the hidden states h_t, i.e., the contextual encodings, are computed in a transformer. At the core of this computation lies a repeated application of the same sequence of operations, which, as we will see, augments the “current version” of the contextual encodings with information from the preceding context. We call a single such sequence of operations a transformer layer.

Definition 5.2.8: Transformer layer

Let Q, K, V, and O be parametrized functions from R^D to R^D.
A transformer layer is a function T : R^{T×D} → R^{T×D} that takes as input a sequence of vectors X = (x_1^⊤, x_2^⊤, . . . , x_T^⊤) and returns Z = (z_1^⊤, z_2^⊤, . . . , z_T^⊤) ∈ R^{T×D} according to the following two steps:

a_t = Att(Q(x_t), K(X_t), V(X_t)) + x_t    (5.120)    (Att as defined in Definition 5.2.4)
z_t = O(a_t) + a_t    (5.121)

for t = 1, . . . , T, so that T(X) := Z = (z_1^⊤, z_2^⊤, . . . , z_T^⊤) ∈ R^{T×D}.

While we defined the transformer layer on a general matrix (with T rows), note that these T vectors will refer to the (current) symbol encodings of the symbols in the string up to the T-th symbol, i.e., y_{≤T}.
5 What do these quantities correspond to? Eqs. (5.120) and (5.121) outline a two-step process of
6 computing the outputs of a single transformer layer: X “ pxJ 1 , x2 , . . . , xT q represents the input
J J

7 to the layer, which T transforms into the output sequence Z “ pz1 , z2 , . . . , zT J q. Before being fed
J J

8 into the attention mechanism, the inputs X are first transformed into the quantities required by the
9 attention mechanism: the query qt (a single one for each xt ), the matrix of keys Kt , and the matrix
10 of values Vt —all of these are, therefore, transformations of the input sequence of vectors. The
11 transformations Q, K, and V determine how these inputs are transformed into the (interpretable)
12 quantities required by the attention mechanism.
13 The individual at represent the “intermediate” results of the computation—the results of applying
14 the actual attention mechanism (with a slight modification) from Definition 5.2.4 onto the produced
15 values of the query, the keys, and the values.
16 The modification mentioned refers to the addition of the inputs xt to the output of the attention
17 mechanism in Eq. (5.120). This mechanism is known as adding residual connections to the model.
18 First introduced by He et al. (2016) in the context of deep convolutional neural network-based
19 architectures, residual connections are now a common feature in many state-of-the-art deep learning
20 architectures. Note that their use is mostly motivated by empirically better performance—this is
21 often attributed to the fact that, intuitively, residual connections allow gradients (i.e., learning signals)
22 to flow through the network through a more direct route rather than all layers that can “squish” the
23 signal similarly to the Elman RNN case (in that sense, they help mitigate the vanishing gradient
24 issue). However, as we will see later in our analysis, residual connections will also play an important
25 role in determining the theoretical properties of transformers, particularly their computational
26 expressive power. The same mechanism is applied in the second step of the transformer layer, where
27 the intermediate results at are transformed by the output transformation O into the final outputs of
28 the layer.
29 In the simplest case, you can imagine the inputs X to be the initial static embeddings of the
30 symbols. The application of the transformer layer, in this case, therefore, “selectively” (determined by
31 the attention mechanism) augments the static embeddings with the information from the preceding
32 context. However, as we will see shortly, a transformer model will apply multiple transformer
33 blocks to the input sequence and thus transform it in multiple steps, analogously to how layers are
34 composed in a regular feed-forward neural network. In that sense, the inputs to the transformer
35 blocks will refer to general intermediate representations produced from the initial static embeddings

1 after some number of applications of transformer layers.
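The following sketch spells out Eqs. (5.120) and (5.121), including the residual connections; as one possible choice (discussed below), Q, K, and V are taken to be linear maps and O a small MLP, and all variable names are ours:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention(q, K, V):
    return softmax(K @ q / np.sqrt(q.shape[0])) @ V

def transformer_layer(X, WQ, WK, WV, O):
    """One transformer layer (Eqs. 5.120-5.121): for each position t, attend
    over positions 1..t and add residual connections around both steps."""
    Z = np.zeros_like(X)
    for t in range(X.shape[0]):
        Xt = X[: t + 1]                                           # x_1, ..., x_t
        a = attention(WQ @ X[t], Xt @ WK.T, Xt @ WV.T) + X[t]     # Eq. (5.120)
        Z[t] = O(a) + a                                           # Eq. (5.121)
    return Z

rng = np.random.default_rng(2)
T, D = 6, 8
X = rng.normal(size=(T, D))
WQ, WK, WV = (rng.normal(size=(D, D)) for _ in range(3))
W1, W2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))
O = lambda a: W2 @ np.maximum(W1 @ a, 0.0)        # a small MLP as the output map O
print(transformer_layer(X, WQ, WK, WV, O).shape)  # (6, 8)
```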


Lastly, let us consider how the current symbol representations X are transformed into the queries, keys, and values using Q, K, and V. The original formulation (Vaswani et al., 2017) and all standard implementations of the transformer architecture use one of the simplest possible mappings: a linear transformation implemented by matrix multiplication. On the other hand, the final output of the transformer layer, computed with the output mapping O, is usually implemented by a multi-layer perceptron.
The transformer layer puts the attention mechanism into a functional block that describes how a sequence of current symbol representations is transformed in a single step into a sequence of representations augmented with the information in the current set of values. However, we are not done abstracting yet! As mentioned, this process can be applied arbitrarily many times, resulting in “deep” encodings of individual symbols, which contain information from the preceding symbols in the string computed in a composite way. This also answers the question of how the augmented symbol representations used in the language modeling formulation are computed from the initial symbol representations: they are the result of multiple applications of the transformer layer to the matrix of initial symbol representations. That is, multiple transformer layers are stacked on top of one another so that the output of one layer becomes the input of the next.
17 We now have all the building blocks to define the full transformer architecture, which computes
18 the encodings of the string prefixes (and thus the hidden states) in Definition 5.2.3.

Definition 5.2.9: Transformer


For L ∈ N, we define an L-layer transformer model as a D-dimensional transformer sequence model over an alphabet Σ where the hidden state h_t := enc_T(y_1 . . . y_t) = enc_T(y) is computed as follows:

X^1 := (e′(y_0), e′(y_1), . . . , e′(y_t))    (5.122)
Z^ℓ = T^ℓ(X^ℓ)    for 1 ≤ ℓ ≤ L    (5.123)
X^{ℓ+1} = Z^ℓ    for 1 ≤ ℓ < L    (5.124)
h_t = F(z_t^L)    (5.125)

Here, T^ℓ for ℓ = 1, . . . , L represent L different transformer layers with decoupled parameters (cf. Definition 5.2.8), F : R^D → R^D is a transformation function applied to the contextual encoding of the last symbol in the last (L-th) layer, and e′ : Σ → R^D is a symbol representation function computing the initial representations of the symbols passed to the first layer of the transformer.a
a The symbol representation function e1 is often also implemented as a linear transformation of the one-hot

representations (cf. Definition 5.1.5) of symbols, i.e., it is simply a table-lookup.


19

20 With this, the transformer model now fully specifies how to compute the representations
21 required for the representation-based locally normalized sequence models from Definition 5.2.3—the
22 representation function encT is the composition of L transformer layers applied to the sequence of
23 static encodings, followed by a final transformation F .
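A sketch of the resulting encoding function of Definition 5.2.9, under the simplifying assumptions that each layer uses linear maps for Q, K, V, a small MLP for O, and that F is the identity (all names are ours):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention(q, K, V):
    return softmax(K @ q / np.sqrt(q.shape[0])) @ V

def layer(X, params):
    WQ, WK, WV, W1, W2 = params
    Z = np.zeros_like(X)
    for t in range(X.shape[0]):
        Xt = X[: t + 1]
        a = attention(WQ @ X[t], Xt @ WK.T, Xt @ WV.T) + X[t]
        Z[t] = W2 @ np.maximum(W1 @ a, 0.0) + a
    return Z

def enc_T(X1, layers, F=lambda z: z):
    """Eqs. (5.122)-(5.125): compose L transformer layers over the static
    encodings X1 and apply F to the encoding of the last symbol."""
    X = X1
    for params in layers:            # X^{l+1} = Z^l = T^l(X^l)
        X = layer(X, params)
    return F(X[-1])                  # h_t = F(z_t^L)

rng = np.random.default_rng(3)
T, D, L = 5, 8, 2
X1 = rng.normal(size=(T, D))         # static symbol encodings e'(y_0), ..., e'(y_t)
layers = [tuple(rng.normal(size=(D, D)) for _ in range(5)) for _ in range(L)]
print(enc_T(X1, layers).shape)       # (8,)
```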

Figure 5.21: enc_T(y_{≤t}) is a function of the symbols y_1, . . . , y_t computed with multiple applications of the transformer block. Here, the dashed lines illustrate the dependencies of the outputs h_t on the initial static encodings of the symbols y denoted by green nodes. Naïvely, h_3 and h_t could be computed by independently applying the attention mechanism from Definition 5.2.4. However, as we describe in the text, while the applications of the attention mechanism do not share computations, they can be written concisely together.

1 Making Attention Work Fast Over Entire Strings


2 Notice that, in the formulations so far, we always presented computations of the attention mechanism
3 for individual queries qt . This corresponds to the computation of the new version of the representation
4 of a single symbol in the string, with the keys and values representing the preceding symbols (including
5 the symbol itself). This is illustrated in Fig. 5.21. This could of course be applied |y|-times to
6 compute the representations of all |y| symbols in a single transformer block. However, this would
7 unnecessarily re-compute the keys and values of the symbols multiple times—and, as we motivated
8 at the beginning, speedups were one of the main reasons to talk about transformers in the first
9 place. We can now show how the attention mechanism can be conveniently applied to entire strings
10 at once. Specifically, we focus on the case where the attention scoring function f is implemented as
11 a dot-product.27

What does the attention mechanism from Definition 5.2.4 do in this case? Given a query q_t and a matrix of keys K = (k_1^⊤, . . . , k_t^⊤) ∈ R^{t×D}, the scoring function simply computes28

u_j = f(q_t, k_j) = q_t^⊤ k_j.

Notice that, in this case, the vector u = (u_1, . . . , u_t) of unnormalized attention weights can simply be computed as a single matrix-vector product

u = q_t^⊤ K^⊤.

27 For conciseness, we will ignore the scaling factor, which could easily be added.
28 We switch notation from ⟨q_t, k_j⟩ to q_t^⊤ k_j to make the connection to matrix multiplication later clearer.

Furthermore, with this, attention can easily be extended to consider many queries in parallel by stacking multiple queries into a matrix Q := (q_1^⊤, q_2^⊤, . . . , q_t^⊤), as we detail now.29 Consider now the product

U = QK^⊤.
4 Each entry of the resulting matrix U ij is exactly the dot-product between the query qi and the
5 key kj ! The rows of U then contain the unnormalized score vectors ui from the definition of the
6 attention mechanism. This means that if we now apply the normalization function f ∆D´1 row-wise
(such that the elements in each row sum to 1), we end up with exactly the normalized scores required for combining the values from the value matrix. With some abuse of notation, we will simply write that as

S := (s_1^⊤, . . . , s_t^⊤) = f_{Δ^{D−1}}(U) := f_{Δ^{D−1}}(QK^⊤).    (5.126)
The rows of f_{Δ^{D−1}}(U), therefore, represent the normalized attention weights. This brings us to the final step of the matrix-multiplication-based attention mechanism: combining the values based on the computed attention weights. Again, this can be performed by a single matrix multiplication. Notice that the value vectors are the same for all queries; they are simply combined with different (attention) weights based on the query. Right-multiplying the matrix of values V = (v_1^⊤, . . . , v_t^⊤) ∈ R^{t×D} with S therefore performs the convex combinations of the value vectors v_1, . . . , v_t such that

a_i = s_i V = S_{i,:} V    (5.127)

and thus

A := (a_1^⊤, . . . , a_t^⊤) = SV.    (5.128)

Altogether, this means that, given a sequence of (contextual) symbol encodings X, we can compute the attention values (i.e., the outputs of the attention mechanism) of all queries (i.e., for all symbols in the string) with a single matrix multiplication, as long as the scoring function is the (scaled) dot-product. We refer to this version of attention as an attention block, which, intuitively, simply replaces the element-wise definition of the attention mechanism from Definition 5.2.4 with a more efficient (and concise) definition through matrix multiplications.30

Definition 5.2.10: Attention Block

Let Q, K, and V be parametrized functions from R^{T×D} to R^{T×D} and X ∈ R^{T×D} the matrix of input encodings. An attention block is the function A : R^{T×D} → R^{T×D} defined as

A(X) = f_{Δ^{D−1}}(Q(X) K(X)^⊤) V(X)    (5.129)

Further, we define the attention matrix as the square matrix U := Q(X)K(X)^⊤ ∈ R^{T×T}.
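A compact sketch of the attention block of Eq. (5.129), with Q, K, and V implemented as linear maps and a row-wise softmax as the projection function (our own variable names); each row of the output coincides with the per-query computation described above:

```python
import numpy as np

def row_softmax(U):
    Z = np.exp(U - U.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def attention_block(X, WQ, WK, WV):
    """Eq. (5.129): A(X) = f_proj(Q(X) K(X)^T) V(X), with f_proj = row-wise softmax."""
    Q, K, V = X @ WQ, X @ WK, X @ WV   # linear maps applied row-wise
    U = Q @ K.T                        # attention matrix: U_ij = <q_i, k_j>
    S = row_softmax(U)                 # normalized attention weights (Eq. 5.126)
    return S @ V                       # convex combinations of the values (Eq. 5.128)

rng = np.random.default_rng(4)
T, D = 6, 8
X = rng.normal(size=(T, D))
WQ, WK, WV = (rng.normal(size=(D, D)) for _ in range(3))
print(attention_block(X, WQ, WK, WV).shape)   # (6, 8)
```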

23

29 Note that, for easier presentation, we make a slight departure from the original definition of the attention

mechanism, where the result of the attention mechanism for query t only depended on the keys and values j ď t. For
the rest of the paragraph, we assume that a query qi with i ă t can consider keys and values kj and vj with j ą i,
which, in the interpretation of attention applied to strings, would mean that the symbols can “look ahead” in the
string to their right. This will be addressed shortly with masking.
30 Again, with the caveat that the attention weights are not confined to the preceding symbols but to all symbols in

the string.

Figure 5.22: In the context of language modeling, the attention mechanism is allowed to consider the symbols y_j and their keys/values for j ≤ t (the green dashed lines) when computing the contextual encoding of the symbol y_t. In Definition 5.2.4 this is enforced by the definition of the matrices K_t and V_t. However, in the attention block formulation of the attention mechanism from Definition 5.2.10, since the matrices K and V contain the values corresponding to the entire string y, the query q_t could, in principle, index into the values corresponding to the symbols y_{t+1}, . . . , y_T (the red dotted lines). Masking prevents that by enforcing the attention weights a_{t+1}, . . . , a_T to be 0. In this sense, it removes the red dotted lines.

1 As mentioned, the functions Qp¨q, Kp¨q, and V p¨q are usually implemented as a linear trans-
2 formation via matrix multiplication using weight matrices WQ , WK , and WV . This means that
3 the query matrix Q can be computed as Q “ XWQ , where WQ P RDˆD is a matrix of learnable
4 parameters. Since the attention block uses the same input matrix X to encode queries, keys, and
5 values, it is usually called self-attention.

6 Confining attention to preceding symbols in the string. We now address the departure
7 from the original definition of the attention mechanism in which the query qt was only allowed
8 to consider the keys and values kj and vj with j ď t. Notice that, in general, the version of
9 attention of Eq. (5.129) allows each symbol to attend to any symbol in the string, even those in
10 later positions in the string. Note that there is nothing inherently wrong with that—the contextual
11 symbol encodings could, in principle, depend on the information from the entire string. In fact, this
12 is very common in the so-called masked language modeling, which, importantly, despite its name,
13 does not define language models in our sense of the word. A very commonly used family of masked
14 models is the BERT family of models.31 However, in the case of locally normalized language models,
15 “looking ahead” obviously violates the autoregressive structure of language modeling, i.e., violates
16 our assumption that the context at time t forms y ăt . This is illustrated in Fig. 5.22. To recover the
17 autoregressive nature of the language model, we, therefore, posthoc modify Eq. (5.129) to allow
18 each symbol to attend only to itself and to preceding symbols, while still being able to implement it
19 using matrix multiplication. We do that by adding a mask to zero out the unwanted elements of U.
31 In a very unfortunate, but also understandable, turn of events, we mention two completely different notions of

masking in a single paragraph. Importantly, the masking in masked language models (which, again, are not language
models in our strict definition) has nothing to do with the “causal” masking relevant for autoregressive language
modeling which we introduce in this section.

Definition 5.2.11: Masked Attention Block

Let Q(·), K(·), and V(·) be parametrized functions from R^{T×D} to R^{T×D}. A masked attention block is a function A(X, M) : R^{T×D} × R^{T×T} → R^{T×D} defined as

A(X, M) = softmax(Q(X)K(X)^⊤ ⊙ M) V(X)    (5.130)

where ⊙ is the element-wise product between matrices, and M ∈ R^{T×T}, the masking matrix, is constructed as follows:

M_{i,j} = 1 if j ≤ i, and −∞ otherwise,    for 0 ≤ i, j < T.    (5.131)
1

This implements a very easy “fix” to the looking-ahead problem by simply setting the normalized attention scores of the “illegal” elements to 0. In general, the exact value of the elements M_{i,j} with j > i of course depends on the projection function f_{Δ^{D−1}}; for simplicity, we only define M for the case where f_{Δ^{D−1}} = softmax.
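A sketch of a causally masked attention block in the spirit of Definition 5.2.11 (our own names). Rather than taking the element-wise product with a mask of 1 and −∞ entries as in Eq. (5.130), the sketch sets the disallowed entries of the attention matrix to −∞ directly, which yields the same zero weights after the softmax:

```python
import numpy as np

def row_softmax(U):
    Z = np.exp(U - U.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def masked_attention_block(X, WQ, WK, WV):
    """Causally masked self-attention: position i may only attend to j <= i.
    The disallowed scores are set to -inf before the softmax, so they receive
    zero attention weight."""
    T = X.shape[0]
    Q, K, V = X @ WQ, X @ WK, X @ WV
    U = Q @ K.T                                               # U_ij = <q_i, k_j>
    U = np.where(np.tril(np.ones((T, T))) == 1, U, -np.inf)   # keep only j <= i
    return row_softmax(U) @ V

rng = np.random.default_rng(5)
T, D = 5, 4
X = rng.normal(size=(T, D))
WQ, WK, WV = (rng.normal(size=(D, D)) for _ in range(3))
print(masked_attention_block(X, WQ, WK, WV).shape)   # (5, 4)
```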

6 Bits and Bobs of the Transformer Architecture: Positional Encodings, Multiple Heads,
7 and Layer Normalization
Let us now take a step back and consider what the transformer model introduced so far does abstractly. A transformer takes as input a string y ∈ Σ*, computes the initial symbol embeddings X^1 = X, and transforms those through a sequence of L applications of the transformer layer (cf. Definition 5.2.9). This results in the final augmented (contextual) symbol representations h_t = enc_T(y_{≤t}), which are then used to compute the conditional probabilities in the representation-based locally normalized language model defined by the transformer (cf. Definition 5.2.3), as illustrated at the top of Fig. 5.21. In this subsection, which will finish off our formal definition of the architecture, we introduce the last three components often connected closely to the transformer model: symbol positional encodings, multi-head attention, and layer normalization.

Adding positional information into the transformer architecture. There is an important omission we still have not addressed when talking about transformers: How does the model incorporate any notion of word order into the contextual representations of symbols or the encodings of the context h_t? The motivation is very clear: The meaning of a sentence depends on the word order. The meaning of "A dog bit a man." is not the same as that of "A man bit a dog.". This is one of the reasons why simple "sentence encoding" functions such as bag-of-words, which represent sentences only by the counts of the individual words they contain, do not work well. A careful reader might have noticed that at no point in our discussion of transformers and the attention mechanism did we say anything about the positions of the words. Importantly, we did not talk about word positions in the case of RNNs either. However, the sequential and incremental processing nature of RNNs makes it easy to "manually" keep track of the position of the current symbol y_t of the string, to the extent that the RNN variant is capable of "counting" (cf. §5.1.6). In contrast, all operations composing the transformer model are position-agnostic: The convex combination of the value vectors V will be the same no matter the permutation of the vectors (provided, of course, that we accordingly permute the keys). The keys also cannot contain any positional information, since they are computed from

position-agnostic static embeddings and a transformation function K which does not depend on the position.
All that is to say that, to be able to take word order into account in a transformer, we have to explicitly provide the positional information to the model. The simplest way to do this is to augment the static symbol encodings in the first transformer layer with positional encodings: vectors which can be added to or concatenated with the static encodings of the symbols (Vaswani et al., 2017).³²

Definition 5.2.12: Positional encoding

A positional encoding is a function f_pos : N → R^D.

This is a very simple definition: A positional encoding simply assigns a vector representation to a position in a string. A trivial example would be f_pos(t) = (t). This allows us to define a position-augmented symbol representation function.

Definition 5.2.13: Position-augmented representation function

Let e′ : Σ → R^D be a symbol representation function and f_pos : N → R^D a positional encoding. A position-augmented representation function of a symbol y_t in a string y is the representation function e′_pos : Σ → R^D defined as

    e′_pos(y_t) ≔ e′(y_t) + f_pos(t).    (5.132)

To make the positional information available to the transformer model, we now simply pass the position-augmented "static" symbol encodings X_pos to the model instead of the original ones X. Apart from that, the transformer model can remain unaltered and function exactly as defined above, now taking into account the positional information included in its inputs. Importantly, the intuitive notion of the importance of positional encodings for understanding natural language also transfers to the computational power of the model: Transformers as introduced in this section without positional information are strictly less powerful than those with positional information (Pérez et al., 2021). Again, this intuitively makes sense: Without positional information, a transformer model could not even recognize the simple (unweighted) regular language

    L = {a bⁿ | n ∈ N},

since it would have no way of knowing, provided that a is contained in a given string y, whether it appears in the first position or in any other position of the string.
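As a concrete (and deliberately simplistic) illustration of Definitions 5.2.12 and 5.2.13, the sketch below augments random symbol embeddings with a trivial positional encoding in the spirit of the text's example f_pos(t) = (t), padded to D dimensions so it can be added to the symbol embedding. All names and parameter values here are hypothetical.

```python
import numpy as np

D, vocab_size = 8, 5
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, D))   # e'(y): one (random) row per symbol

def f_pos(t):
    # Trivial positional encoding: write the position into the first
    # coordinate and pad with zeros to dimension D.
    v = np.zeros(D)
    v[0] = t
    return v

def position_augmented(y):
    # y: a sequence of symbol indices; returns X_pos with one row per position,
    # each row being e'(y_t) + f_pos(t) as in Eq. (5.132).
    return np.stack([embedding[sym] + f_pos(t) for t, sym in enumerate(y, start=1)])

X_pos = position_augmented([3, 1, 4, 1])
print(X_pos.shape)  # (4, 8)
```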

Multiple heads. Importantly, the transformer introduced so far computes a single set of contextual representations: one for every input symbol (at every layer of the transformer). However, we can easily extend the model to compute multiple contextual representations for each symbol. This is done using so-called multi-head attention, where a single attention block is called an
32 For simplicity, we assume the positional encodings are added to the static ones; notice that, by dividing the D-dimensional vectors into two components, one responsible for the static encoding and one for the positional one (where the positional component is zeroed out in the static encoding and vice versa), one can easily implement "concatenation" of the two representations using only addition.

attention head. This increases the representation space of the individual symbols and thus enables the model to capture more information about the symbols and the sentence. Computing multiple representations (one for each head) independently also invites the interpretation that each of the heads "focuses" on a separate aspect of the text. To be able to use the outputs of multi-head attention as inputs to the next block, the outputs of the different attention heads are concatenated and then projected down to the output size of a single attention block using an additional transformation.

Definition 5.2.14: Multi-Head Attention Block

Let H ∈ N be the number of attention heads, let Q_h(·), K_h(·), and V_h(·) be parametrized functions from R^{T×D} to R^{T×D} for 0 ≤ h < H, and let f_H : R^{T·H×D} → R^{T×D} be a parametrized function. A multi-head attention block is a function MH-A(X) : R^{T×D} → R^{T×D} defined as

    MH-A(X) = f_H( concat_{0≤h<H}( softmax(Q_h(X) K_h(X)^⊤) V_h(X) ) ).    (5.133)

While multi-head attention is mostly motivated by empirically better performance (and the intuitive motivation of being able to separately focus on different notions of similarity), it also has implications for the computational power of the model. As we will see shortly, having multiple heads makes it very easy to reason about how a transformer model can simulate an n-gram model.³³
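The following is a minimal NumPy sketch of a multi-head attention block in the spirit of Definition 5.2.14, assuming one linear map per head for Q_h, K_h, V_h and a single linear map realizing f_H. The head outputs are concatenated along the feature dimension here, which is one common way to realize the concatenation; all parameter values are illustrative.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, Wo):
    # heads: list of (Wq, Wk, Wv) triples, one per attention head.
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        outputs.append(softmax(Q @ K.T) @ V)      # one T x D representation per head
    # Concatenate the head outputs and project back down to T x D (realizing f_H).
    return np.concatenate(outputs, axis=-1) @ Wo

T, D, H = 5, 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D))
heads = [tuple(rng.normal(size=(D, D)) for _ in range(3)) for _ in range(H)]
Wo = rng.normal(size=(H * D, D))                  # f_H as a single linear map
print(multi_head_attention(X, heads, Wo).shape)   # (5, 4)
```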

Layer normalization. As a final component of a transformer, we mention layer normalization. Layer normalization, similar to the use of residual connections, is a common "trick" in deep learning for ensuring more stable and reliable gradient-based learning; as such, it is not limited to transformers. Formally, we can define layer normalization as follows (Ba et al., 2016).

Definition 5.2.15: Layer normalization

Let x, γ, β ∈ R^D, and ϵ > 0. The layer normalization function LN : R^D → R^D is defined as

    LN(x; γ, β) ≔ (x − x̄) / √(σ²(x) + ϵ) ⊙ γ + β,    (5.134)

where x̄ refers to the mean of the vector x (and is subtracted from all elements of x in the formulation above) and σ²(x) refers to the variance of the elements of x. ϵ is added in the denominator to ensure stability if σ²(x) ≪ 1.

Intuitively, the application of the layer normalization function ensures that the mean of the vector x is (approximately) β and that its variance is controlled by γ (after the vector is standardized by dividing by the standard deviation of x). Most commonly, we simply set γ = 1 ∈ R^D and β = 0 ∈ R^D.
Layer normalization is most commonly applied to the output of the transformer layer (on every
33 This does not, however, mean that multiple heads are required for simulating n-gram models. As we will see, under some caveats, single-head transformers are Turing complete.



layer), i.e., to z_i in Eq. (5.121). The full output of the transformer layer is therefore computed as

    z_i ≔ LN(O(a_i) + a_i; γ, β).    (5.135)
Interestingly, although we mentioned that layer normalization is mostly motivated by the stability it brings to training, and with it better performance, it does, just like multi-head attention, seem to contribute to the computational expressivity of transformer models. As we will see in §5.3, layer normalization allows for a simple fix that addresses one of the best-known formal limitations of the transformer architecture (again, under some assumptions) (Hahn, 2020; Chiang and Cholak, 2022).
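Below is a minimal sketch of layer normalization as in Definition 5.2.15 and its use in Eq. (5.135). The choice of tanh as a stand-in for the output transformation O, and the default γ = 1, β = 0, are illustrative assumptions for this example only.

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    # LN(x; gamma, beta) = (x - mean(x)) / sqrt(var(x) + eps) * gamma + beta
    gamma = np.ones_like(x) if gamma is None else gamma
    beta = np.zeros_like(x) if beta is None else beta
    mean = x.mean()          # x-bar: mean of the entries of x
    var = x.var()            # sigma^2(x): variance of the entries of x
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

a = np.array([2.0, -1.0, 0.5, 3.5])          # a_i: an illustrative residual-stream vector
z = layer_norm(np.tanh(a) + a)               # z_i = LN(O(a_i) + a_i) with tanh standing in for O
print(z.mean().round(6), z.std().round(3))   # approximately 0 and 1
```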

Connecting the Formal Definition Back to our Desiderata

This brings us to the end of the formal definition of the transformer architecture. As we saw, a transformer has many more moving parts than an RNN. Is the additional complexity warranted? Let us now return to the informal motivations, or desiderata, that we laid out in §5.2.1 and see how the components we defined in this section ensure that transformers meet them. First of all, the transformer layers clearly store the representations of all symbols at all times: they are all needed to produce the query, key, and value matrices required by the attention mechanism. As mentioned above, this allows us to easily store information about the entire string in a convenient and "accessible" format without having to compress it into a single hidden state. Furthermore, the fact that the contextual representations z^{ℓ+1}_t are computed directly from the representations x^ℓ_1, ..., x^ℓ_t at every time step, with no direct dependence between the different z^{ℓ+1}_i, means that these computations can easily be parallelized. More concretely, in the most common implementations of the transformer components, most operations take the form of matrix multiplications, which makes the computation and parallelization that much more efficient.³⁴ Again, note that here we are only interested in parallelizing the processing of entire strings, as, for example, given in a training corpus. As discussed in §5.1.5, there is an aspect of language modeling that is inherently sequential, and even heavily parallelizable architectures such as the transformer cannot overcome it: generating strings one symbol at a time. While generating strings, even a transformer model has to generate symbols one at a time and, therefore, recompute (parts of) the encoding enc_T(y_{≤t}) anew at every time step to generate the next symbol. The advantages of parallelizability therefore come only at training time; however, given the vast corpora used for training today's models, this makes a crucial difference in the applicability of the architecture compared to recurrent ones.
Altogether, this means that transformers do, indeed, achieve the desiderata from our informal motivation! This concludes our formal definition of transformers. We now move on to analyzing their theoretical properties.

5.2.3 Tightness of Transformer-based Language Models

Having introduced transformers formally, we can start investigating their formal properties. As we did for RNN LMs, we first consider their tightness. Specifically, in this subsection, we show that all soft-attention-based transformer language models are tight. Key to our proof of the tightness of transformer language models, as well as the tightness of various other neural architectures, is the following basic fact from topology.
34 Note that there is, of course, some sense of recurrence in the transformer: the composition of the transformer layers, which are stacked on top of each other, does require sequential computation. Crucially, however, the number of layers does not depend on the length of the string; the number of sequential steps required to process a string therefore does not depend on its length, which is what we wanted to achieve.

Theorem 5.2.1: Compactness

Let X be a compact topological space and Y be any topological space. If f : X → Y is continuous, then f(X) ⊆ Y is also compact.

Proof. Let {U_α}_{α∈A} be any open cover of f(X). By continuity, f⁻¹(U_α) ⊆ X is open for any α ∈ A, and hence {f⁻¹(U_α)}_{α∈A} is an open cover of X. By the compactness of X, there is a finite sub-cover {f⁻¹(U_{α_n})}_{n=1}^N, in which case {U_{α_n}}_{n=1}^N forms a finite sub-cover of f(X). ■

We now further mathematically abstract transformers as a function on vector tuples,³⁵ f_Att : (R^D)⁺ → (R^D)⁺, that is length-preserving in the sense that f_Att(R^{t×D}) ⊆ R^{t×D} for all t > 0. Intuitively, this definition says that f_Att is a function that maps a nonempty vector tuple {v_j}_{j=1}^t to another vector tuple {h_j}_{j=1}^t of the same length,

    f_Att(v_1, ..., v_t) = (h_1, ..., h_t) ∈ R^{t×D},    (5.136)

where v_j = e′(y_j) ∈ R^D are the initial representations of the input symbols y_j. In particular, we can take the function f_Att : (R^D)⁺ → (R^D)⁺ to be the function defined by a stack of transformer layers, i.e., an attention block. This setup will help us state the following.

Lemma 5.2.1

Let f_Att : (R^D)⁺ → (R^D)⁺ be the function defined by L transformer layers with continuous functions Q, K, V, and O, and let K ⊂ R^D be a compact set. Then there exists a compact set K′ ⊂ R^D such that, for every t ∈ Z_{>0},

    f_Att(K^t) ⊆ (K′)^t.    (5.137)

Note. We make use of the following notation in the proof below: B_r(z) = {v ∈ R^D : dist(z, v) < r} denotes the open ball centered at z with radius r; Ā denotes the closure of the set A.

Proof. Let K_0 = K. In an autoregressive transformer, each of the L layers consists of two blocks: a self-attention block and a feedforward block. We will use induction on the 2L blocks to build up compact sets K_1, K_2, ..., K_{2L} that contain the output vectors of these respective blocks, and then take K′ = K_{2L}.
The self-attention block is a function (R^D)⁺ → (R^D)⁺. So, let t ∈ Z_{>0} be arbitrary and consider any sequence of input vectors (v_1, ..., v_t) such that v_i ∈ K_0 for all i. Denote the output vectors of the attention block by (v′_1, ..., v′_t). By the definition of attention, each output vector is v′_j = ∑_{i=1}^t s_i^{(j)} v_i, where s^{(j)} ∈ Δ^{t−1} are the attention weight vectors obtained through the softmax function. Compact sets in R^D are bounded (by the Heine–Borel theorem), and hence there exists
35 Here, (R^D)⁺ is the set of nonempty tuples of vectors in R^D. This is formally the disjoint union (coproduct) ∐_{t ∈ Z_{>0}} R^{t×D}.

M > 0 such that K_0 ⊆ B_M(0). Noting that the norm function ‖·‖ on R^D is convex, we have the following:

    ‖v′_j‖ = ‖∑_{i=1}^t s_i^{(j)} v_i‖    (5.138a)
           ≤(∗) ∑_{i=1}^t s_i^{(j)} ‖v_i‖
           ≤ ∑_{i=1}^t s_i^{(j)} M = M,    (5.138b)

where (∗) results from Jensen's inequality. Eq. (5.138b) shows that each of the output vectors v′_j lies in the closure of B_M(0), which is compact. Hence, setting K_1 to this closed ball, we have shown that, for any t ∈ Z_{>0}, the attention block maps K_0^t into K_1^t.
Note that we cannot use Theorem 5.2.1 here because the attention block defines a different function R^{t×D} → R^{t×D} for each t, and Theorem 5.2.1 would only imply that there exists a separate, length-dependent output compact set K^t ⊂ R^{t×D} for each t, which is different from this lemma's statement.
The feedforward function is a continuous function R^D → R^D and therefore, by Theorem 5.2.1, maps its input compact set K_1 to an output compact set, which we call K_2.
Finally, residual connections and layer norms are also continuous functions acting on each of the input vectors and hence, by the same reasoning, also preserve compactness.
We can now use induction to show that there exist compact sets K_3, K_4, ..., K_{2L−1}, K_{2L}, where K_{2L} contains the output set of the final layer. Setting K′ = K_{2L} proves the statement. ■
Now recall that a transformer language model with the softmax projection function (Definition 5.2.3) defines the conditional probabilities using the softmax transformation

    p_SM(y_t | y_{<t}) = exp(e(y_t)^⊤ h_t) / ∑_{y′ ∈ Σ} exp(e(y′)^⊤ h_t),    (5.139)

where e(y) ∈ R^D is the output symbol embedding of y ∈ Σ and h_t is defined from the input embeddings of y_{<t} via Eq. (5.136). Using Lemma 5.2.1, together with the finiteness of the vocabulary Σ and the continuity of the softmax transformation (5.139), readily yields our main result on transformer language models.

Theorem 5.2.2: Transformer language models are tight

The representation-based locally normalized language model (cf. Definition 5.2.3) defined by
any (fixed-depth) transformer with soft attention is tight.

Proof. Given the transformer, there exists a fixed compact set K that contains all inputs v_i ∈ R^D to the first layer. This is true because each v_i is the sum of a word embedding, which falls in a finite set since Σ is finite, and a position embedding, which lies in the compact set [−1, 1]^D. Hence, by Lemma 5.2.1, there exists a fixed compact set K′ that contains all output embedding vectors (regardless of how long the sequence is).

[Figure 5.23: An FSA recognizing the language First = {y ∈ Σ* | Σ = {0, 1}, y_1 = 1}.]

The final output probability is given by a multiplication with the word embedding matrix followed by the softmax function, as in Eq. (5.139). This process amounts to composing two continuous functions. In particular, we can extract the eos probability as a continuous R-valued function g_eos : K′ → (0, 1) (neither 0 nor 1 is in the range of the softmax function). By the continuity of g_eos and Theorem 5.2.1, K_2 ≔ g_eos(K′) ⊆ (0, 1) is compact. Since K_2 is compact, and hence closed, inf K_2 ∈ K_2. Thus inf K_2 ∈ (0, 1) and in particular inf K_2 > 0. Therefore, taking ϵ = inf K_2, we have shown that the eos probability of a transformer is bounded below by some ϵ > 0 (regardless of the length of the sequence). Hence, by Proposition 2.5.6, any transformer-based sequence model is tight and thus defines a language model. ■
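The following small numerical experiment illustrates (but does not prove) the argument above: if the context encodings stay in a bounded set, the softmax of Eq. (5.139) assigns eos a probability bounded away from 0. The random embedding matrix, the radius M of the bounded set, and the crude analytic bound exp(−2BM)/|Σ| used for comparison are assumptions made only for the sake of this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, vocab = 16, 32                      # vocabulary is assumed to include eos at index 0
E = rng.normal(size=(vocab, D))        # illustrative output symbol embeddings e(y)
B = np.linalg.norm(E, axis=1).max()    # bound on the output embedding norms
M = 3.0                                # assumed radius of the compact set containing h_t

def eos_prob(h):
    logits = E @ h
    logits -= logits.max()
    p = np.exp(logits) / np.exp(logits).sum()
    return p[0]

# Sample context encodings on the sphere of radius M (a bounded set).
hs = rng.normal(size=(10_000, D))
hs = hs / np.linalg.norm(hs, axis=1, keepdims=True) * M
empirical_min = min(eos_prob(h) for h in hs)

# Since every logit lies in [-B*M, B*M], p(eos | h) >= exp(-2*B*M) / vocab.
analytic_bound = np.exp(-2 * B * M) / vocab
print(empirical_min > analytic_bound > 0)   # True: eos probability stays bounded away from 0
```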

5.3 Computational Expressiveness of Transformers

So far, we have introduced our formal definition of the transformer architecture and examined its tightness. We now move on to the computational power of the architecture. This section mirrors §5.1.6 and examines the expressivity of the transformer language model as defined in Definition 5.2.3.
Transformers are a much more recent architecture than recurrent neural language models, and our theoretical understanding of them is thus much more limited. However, over the last few years, a series of results establishing various properties of the transformer model has emerged. At first glance, one might find a number of contradictions among them: One of the first results shows that transformers are not even able to recognize the very simple First language recognized by the (unweighted) finite-state automaton shown in Fig. 5.23, nor the Dyck language. On the other hand, there is work showing that transformers can recognize the majority language (determining whether a string contains more symbols a or b) and can even count; both of these are instances of non-regular languages. Moreover, a fundamental result by Pérez et al. (2021) even shows that transformers are Turing complete. Upon closer inspection, the results can be explained by the different theoretical abstractions of the original transformer model that the different works make, and even by different notions of equivalence. Even very subtle differences in the model can lead to substantial differences in its expressivity, as we will see below. In this section, we present some original results which show that transformers can, with infinite precision, in fact reach the top of the hierarchy of formal languages that we consider in these notes: They are Turing complete. We also comment on the differences between our approach and prior work (the main difference and novelty being that we embed the analysis into our language modeling framework) and try to unify the two. We will show that infinite-precision transformers can simulate recognizers across the entire hierarchy: (weighted) finite-state automata, pushdown automata, and Turing machines. Luckily, apart from the first construction, the proofs and constructions in our ascent up the hierarchy will be based on a unified approach in which we build on Pérez et al. (2021) and sequentially add components in order to recognize more and more complex languages.

1 Besides being more novel and thus less researched, transformers are also less intuitive to think
2 about as sequential machines transitioning between states as with finite-state or pushdown automata.
3 All classical computational models we introduced (finite-state automata, pushdown automata, and
4 Turing machines) rely on some notion of an internal state which is sequentially updated, where
5 the next state is determined based on the current configuration. We also said in §5.1.5 that this
6 sequentiality is the Achilles’ heel of the ability to parallelize and thus speed up the computations in
7 a language model. One of the main motivations for defining the transformer model is to avoid these
8 sequential dependencies and to make sure the contextual representations of the individual symbols
9 can be computed independently. However, the lack of sequentiality in transformers makes it more
10 difficult to compare to classical and well-understood models of computation—they simply do not
11 define any notion of a configuration that would be passed over by reading a symbol at a time, and
12 relating the configurations at different time points to the configuration of some classical model of
13 computation was the main idea of most of the analyses in §5.1.6. This will not be possible with
14 transformers, and we will have to be more clever about it to draw parallels to better-understood
15 formalisms. What is more, it seems like their parallelizable nature is one of the reasons for the lower
16 (or, at least, ambiguous) computational power under some formalisms, as covered in Merrill et al.
17 (2022a); Merrill and Sabharwal (2023).

Infinite-precision vs. finite-precision.

A word on model equivalence. As mentioned above, the nature of the transformer architecture does not lend itself well to a straightforward comparison with classical models of computation. To make the connection, we will have to be somewhat clever about the analysis. As we will see shortly, we will mainly deal with this in two ways: (i) by foregoing any notion of a machine state in the case of n-gram language models,³⁶ and (ii) by embedding the state of a computational model into the alphabet itself: the model then uses the augmented output alphabet to keep track of its state in the string itself, without relying on any notion of its own internal state which would have to be updated sequentially.³⁷ How can this help us? As will become clear in our analysis of the Turing completeness of a transformer model, the model can use the generated string as a sort of sequential memory structure. Because the transformer model can look back at the entirety of the string when computing enc_T(y_{≤t}) (where y_{≤t} is the augmented string generated so far), it is able to "read off" its internal state from the string. Importantly, the generated string will still contain the information about the generated symbols, besides including the state of the computational model. As the transformer then computes the new encodings enc_T(y_{≤t}), it will be able to account for the state it should be in. This is illustrated in Fig. 5.24.
While this might seem like a convenient trick to achieve Turing completeness (and in many ways, it is), it is also, in a way, cheating. This "cheating" can be described formally as the difference between model equivalence and homomorphism equivalence. When we discussed the Turing completeness of RNN LMs, we showed that they can model a Turing machine by directly recognizing the same strings (for the time being, we ignored the string weights). This means that, for every Turing machine
36 Recall that n-gram language models are in fact subregular (cf. §4.1.5), and we make use of that in our analysis. Because their recognition relies purely on local patterns in the strings, and a transformer model has the ability to consider large enough substrings, we will see that we can model an n-gram language model without keeping any notion of a state in the transformer.
37 Recall that, as discussed in §5.1.5, generation is inherently sequential. One can thus imagine augmenting the alphabet as a sort of exploitation of this sequential nature.



[Figure 5.24: An abstract illustration of how a model can keep track of its internal state by "outputting" it into the generated string: the original tape of the Turing machine contains y_1 y_2 y_3 y_4 ... y_t ..., while the augmented tape of the stateless model contains (y_1, q_1) (y_2, q_2) (y_3, q_3) (y_4, q_4) ... (y_t, q_t) .... By reading the augmented symbol generated at time t, the model can then determine its internal state.]

M, there exists an RNN R which recognizes the same language: L(R) = L(M). However, we will not be able to make statements like this in the case of transformer models. The augmented alphabet will instead bring us to a statement of the sort "For every Turing machine M, there exists a transformer T which recognizes the same language augmented with the state set of the Turing machine: L_Δ(T) = L(M)," where L_Δ refers to the language of strings in which each symbol is additionally augmented with the state of the Turing machine. This might seem like a small difference, but, in formal language theory, homomorphism equivalence refers to a different problem from that of standard model equivalence (Culik and Salomaa, 1978) and thus has to be considered separately. Intuitively, it allows additional information to be stored in the strings (in our case, the state of the Turing machine) while still considering the models to be "equivalent". Formally, model equivalence asks the following question.

Definition 5.3.1: Model equivalence

Two computational models C_1 and C_2 are equivalent if

    L(C_1) = L(C_2).    (5.140)

On the other hand, homomorphic equivalence considers the following.

Definition 5.3.2: Homomorphism

Let C_1 and C_2 be two computational models. C_1 is homomorphically equivalent to C_2 if there exists a homomorphism h : L(C_1) → L(C_2) such that

    h(L(C_1)) ≔ {h(y) | y ∈ L(C_1)} = L(C_2).    (5.141)

Definition 5.3.3: Homomorphic equivalence

Let C_1 and C_2 be two computational models. C_1 and C_2 are homomorphically equivalent if C_1 is homomorphically equivalent to C_2 and C_2 is homomorphically equivalent to C_1, as per Definition 5.3.2.
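To make this notion more tangible, here is a tiny illustration (with made-up strings and states) of the kind of homomorphism used below: the symbol-wise map that drops the state component of an augmented symbol, mapping the augmented language back to a language over the original alphabet. The toy language and state names are purely hypothetical.

```python
# Symbol-wise homomorphism h((y, q)) = y: drop the state component of every
# augmented symbol, recovering the original string.
def h(augmented_string):
    return tuple(y for (y, q) in augmented_string)

# A toy "augmented language" whose symbols carry both an output symbol and a state.
augmented_language = {
    (("a", "q0"), ("b", "q1")),
    (("a", "q0"), ("a", "q0"), ("b", "q1")),
}

# Applying h symbol-wise to every string recovers the language over the original alphabet.
print({h(w) for w in augmented_language})   # the strings ('a', 'b') and ('a', 'a', 'b')
```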

Transformers and the Inability to Recognize Simple Languages

We start our exploration of the computational power of transformer models with some negative results, which we will later "correct" by using our formalization of a transformer model and different components of the transformer architecture (for example, a different form of attention). Given their success at modeling human language, which is assumed to be at least mildly context-sensitive (Huybregts et al., 1984; Shieber, 1985), it seems surprising that transformers cannot, in fact, recognize some very simple regular languages, such as Parity or First (the FSA shown in Fig. 5.23), as well as simple non-regular context-free languages such as the Dyck languages:

    First  = {y ∈ Σ* | Σ = {0, 1}, y_1 = 1}
    Parity = {y ∈ Σ* | Σ = {0, 1}, y has an odd number of 1s}
    Dyck   = {y ∈ Σ* | Σ = {(, )}, y is correctly parenthesized}

This has been formally shown by Hahn (2020) and experimentally verified by Chiang and Cholak (2022). Bhattamishra et al. (2020a) found that transformers especially struggle to learn languages that require counting occurrences in some way, such as the number of 0s and 1s in Parity or the number of previously opened and closed parentheses in Dyck. Hahn (2020) finds that with unique hard attention, these languages cannot be recognized: Recognizing them with a transformer in that formulation would require the number of parameters to increase with the length of the input. Chiang and Cholak (2022) consider the setting with soft attention, where the issue is more subtle: In theory, it is possible for a transformer to recognize languages such as First and Parity, but with less and less confidence as the length increases. This is reflected by the cross-entropy of deciding language membership approaching the worst possible value of 1 bit per symbol. The reason behind this is quite intuitive: The membership of a string in any of the languages defined above changes if a single symbol changes. However, by examining the information flow in a transformer, one can show that the corresponding information gets less and less weight relative to the length of the string due to the attention mechanism averaging over all positions.

Transformers Can Simulate n-gram Models

§5.3 showed that transformer models can struggle to recognize some of the simplest formal languages. While we did not discuss those results in detail, intuitively, they stem from the use of unique hard attention and the resulting inability to take into account all values whose keys maximize the attention scoring function. By relaxing that restriction to averaging hard attention, the model becomes more expressive. To show this, we begin by looking at the very simple n-gram language models, as defined in §4.1.5. By constructing, for any n-gram model, a transformer representing it, we will show the following theorem.

Theorem 5.3.1: Transformer language models can simulate n-gram language models

Let p_LN be an n-gram language model. Then, there exists a transformer T with L(p_LN) = L(T).

Alternatively, we could say that transformers can recognize strictly local languages (cf. §4.1.5).

Proof. We prove the theorem by constructing, for p_LN, a transformer T with L(p_LN) = L(T). Note that we will mostly restrict the proof to the construction of the transformer, i.e., the formal definition

[Figure 5.25: An abstract depiction of how a transformer can simulate an n-gram model using n − 1 heads (here, n = 4): the heads attend over the input positions y_1, ..., y_{t−1} and feed into f_H, which defines p_SM(y_t | y_{<t}). The stronger arrows from the heads to the symbols in the string show where the heads concentrate their attention. The lighter arrows represent that the heads can still consider the entire history of the input so far but are configured such that they only look at the appropriate position.]

of its parameters. The (mostly trivial) mathematical details and derivations are left as an exercise to the reader.
Recall that, by definition, an n-gram language model considers a fixed number of previous symbols, exactly n − 1 of them, to define p_SM(y_t | y_{<t}). The constructed transformer T will capture this idea with n − 1 heads, each of them attending to exactly one of the previous n − 1 positions.³⁸ We can then use the symbols the heads attended to (and thus identified) to identify the current n-gram and, with it, define the relevant conditional distribution over the next symbol. To be able to attend to the positions of interest, namely the ones containing the previous n − 1 symbols, we have to make use of appropriate positional encodings (cf. Definition 5.2.12), which will allow the model to attend to them. The idea of the construction is abstractly illustrated in Fig. 5.25.
For, hopefully, a better pedagogical effect, we will present this proof from the "last" part of the construction to the "first". We therefore start with the final step: Assuming we have identified the appropriate n-gram y_{t−n:t−1} y_t, how can we encode the conditional probability distribution p_SM(y_t | y_{t−n:t−1})? The construction we use here directly mirrors the one in Minsky's construction (cf. Lemma 5.1.2): Knowing what the individual p_SM(y_t | y_{t−n:t−1}) for y_t ∈ Σ are (those are, as described in §4.1.5, "hard-coded", or specified, for each n-gram separately in a look-up table), we can simply put their logits (log probabilities, in case we are using the softmax projection function) or the probabilities directly (if we are using the sparsemax projection function) into a vector and
38 Note that, given an n-gram model, the number n is fixed. This means that, for a given n-gram model, we can always fix the number of heads and therefore construct such a transformer.



concatenate all the constructed vectors (for all possible n-grams) into a large matrix E.
If we one-hot encode the identified n-gram by defining

    enc_T(y_{<t}) ≔ ⟦y_{t−n:t−1}⟧,    (5.142)

we can then, using the formulation of the transformer sequence model from Definition 5.2.3, use the one-hot encoded n-gram to look up the column containing the conditional probabilities of all possible y_t ∈ Σ given the identified n-gram.³⁹ The formal proof of correctness, given that we have identified the correct n-gram, is therefore analogous to the final part of the Minsky construction.
We now consider the preceding step of the simulation: How can we identify the complete n-gram given that the n − 1 heads of the transformer have identified the symbols at the positions they attended to? This, it turns out, is a simple instance of the "AND" problem investigated in Fact 5.1.3: After concatenating the values of the n − 1 heads into a common vector v (each of which is a |Σ|-dimensional vector), this vector of size |Σ|(n − 1) will contain the multi-hot representation of the n-gram of interest. Let y_1, ..., y_{n−1} be the symbols represented by v. This means that v is of the form

    v = (⟦y_1⟧; ... ; ⟦y_{n−1}⟧)    (5.143)

and v_{k|Σ|+j} = 1 if and only if m(y_k) = j, for an ordering m of Σ determining the one-hot representations of the individual symbols. We would then like to transform this vector into a vector u ∈ R^{|Σ|^{n−1}} such that

    u_i = 1 if and only if i = s(y_1, ..., y_{n−1})    (5.144)

for some ordering s of Σ × ··· × Σ (n − 1 times). This can be equivalently written as

    u_i = 1 if and only if v_{k|Σ|+m(y_k)} = 1 for all k = 1, ..., n − 1,    (5.145)

where i = s(y_1, ..., y_{n−1}). Clearly, this is the same problem as the one described in Fact 5.1.3 and can therefore be solved by a linear transformation followed by the application of the thresholded sigmoid nonlinearity, which together form the transformation f_H combining the information obtained from all the heads of the transformer model. Note that, to make this more similar to how transformers are implemented in practice, we could also use the ReLU activation function instead of the saturated sigmoid.
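A minimal sketch of this "AND" step follows, under the assumption that the thresholded nonlinearity is a Heaviside function: one linear map W and a threshold of n − 1 turn the multi-hot vector v of Eq. (5.143) into the one-hot n-gram indicator u of Eq. (5.144). The toy alphabet and all names are illustrative.

```python
import itertools
import numpy as np

alphabet = ["a", "b", "c"]
n = 3                                            # trigram model: n - 1 = 2 context positions
ngrams = list(itertools.product(alphabet, repeat=n - 1))

# W has one row per candidate (n-1)-gram; the row "reads off" the indicator of
# symbol y_k in block k of v, for every position k.
W = np.zeros((len(ngrams), (n - 1) * len(alphabet)))
for i, gram in enumerate(ngrams):
    for k, sym in enumerate(gram):
        W[i, k * len(alphabet) + alphabet.index(sym)] = 1.0

def heaviside(x):
    return (x >= 0).astype(float)

def ngram_one_hot(symbols):
    # Multi-hot v: concatenation of the one-hot encodings of the n-1 symbols.
    v = np.concatenate([np.eye(len(alphabet))[alphabet.index(s)] for s in symbols])
    # u_i = 1 iff all n-1 indicators of n-gram i fire, i.e. (W v)_i >= n - 1.
    return heaviside(W @ v - (n - 1))

u = ngram_one_hot(["b", "c"])
print(ngrams[int(u.argmax())], int(u.sum()))     # ('b', 'c') 1
```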
This brings us to the final part of the proof, which considers the first part of determining the conditional probability of the n-gram model by the transformer: identifying the symbols at the previous n − 1 positions with the n − 1 heads of the transformer. To show how this can be done, let us consider and define the "degrees of freedom" we have left when specifying a transformer model in our framework.

• The symbol representations r. We will use simple one-hot encodings of the tokens: r(y) ≔ ⟦y⟧.

• The positional encodings f_pos. We will use the following simple positional encoding: f_pos(t) ≔ (t, 1)^⊤. The utility of the constant 1 will be made clear shortly. We will combine positional encodings with symbol representations by concatenating them into a vector of size |Σ| + 2.

39 Note that we are again working over the set of extended reals R̄ = R ∪ {−∞, ∞} in the case of the softmax activation function.

• The number of transformer layers. We will use a single transformer layer.

• The number of heads H. As mentioned, we will use H = n − 1 heads to attend to the previous n − 1 symbols.

• The form of the attention scoring function f. While not the most typical choice, we will use the following scoring function:

    f(q, k) ≔ −|⟨q, k⟩|.    (5.146)

  It will, together with the positional encodings, allow us to easily single out the positions in the string that we care about.

• The form of attention. We will use hard attention (in this case, it can be either unique or averaging).

• The parameters of each of the attention heads, that is, the transformations Q, K, and V. Each of those will take the form of a linear transformation of the symbol embedding. We describe them and their roles in more detail below.

As mentioned above, the input symbol y_t is presented to the transformer model together with its positional encoding in the form

    r(y_t) = (⟦y_t⟧; t; 1) ∈ R^{|Σ|+2}.    (5.147)

The parameters of all the heads are defined in the same way, with the only difference being a simple parameter that depends on the "index" h of the head we are considering. Therefore, in the following, we describe the construction of a single head, Head h. At any time step t (i.e., when modeling the conditional distribution p_SM(y_t | y_{<t})), the head h will attend to, or be "responsible for" recognizing, the symbol at position t − h, y_{t−h}. This can be seen in Fig. 5.25, where, for example, Head 3 is responsible for position t − 3, which is denoted by the stronger arrow to that position. All we still have to do is describe the individual transformations Q_h, K_h, V_h of head h. All of them will be linear transformations, i.e., matrix multiplications:

    q ≔ Q(x) ≔ Q_h x    (5.148)
    k ≔ K(x) ≔ K_h x    (5.149)
    v ≔ V(x) ≔ V_h x    (5.150)

We now define the matrices Q_h, K_h, and V_h, specifically in the first (in this case, the only) layer of a transformer language model. Importantly, since we are talking about only the first layer, we can simply take as inputs to the layer the original static symbol representations together with their positional encodings, rather than any contextual representations. First, let us consider again what roles the matrices play in computing enc_T(y_{<t}). In the context of language modeling, the matrix Q_h takes in the representation of the "latest" generated symbol y_{t−1} and produces from it the query vector of y_{t−1}. It is, therefore, only applied once per generation step, only for the symbol y_{t−1}. The matrices K_h and V_h, on the other hand, transform all non-masked input symbols into the key and value vectors. That is, they take the representations of the input symbols and their positional encodings for every j = 1, ..., t − 1 and transform them into the key and value

vectors. The keys will then be compared with the query constructed for y_{t−1} with the Q_h matrix, while the constructed values will be used to compute the new hidden state h_t.⁴⁰
So, what kind of query, key, and value vectors do we want? As mentioned, the head h will be responsible for identifying the symbol at position t − h. Therefore, we want it to put all of its attention on this position. In other words, given the query q_{t−1}, we want the attention function in Eq. (5.146) to be maximized by the key of the symbol at position t − h. Notice that the key therefore does not have to depend on the identity of the symbol at position t − h; only the positional information matters. Let us then consider the following query and key transformations for head h:

    Q : (⟦y_t⟧; t; 1) ↦ (t − h, 1)    (5.151)
    K : (⟦y_j⟧; j; 1) ↦ (−1, j).    (5.152)

Given such a query and such keys, the attention scoring function computes

    f(q_t, k_j) = −|⟨(t − h, 1), (−1, j)⟩| = −|t − h − j|,    (5.153)

which is maximized exactly when j = t − h, that is, at the position that we want head h to attend to! This means that the hard attention we use will put all of its probability mass on exactly the position we intended. Intuitively, both transformations keep only the positional information. The query transformation "injects" the knowledge of which position should maximize the attention score, while the key transformation (which is, again, applied to all the non-masked positions) simply "exposes" the positional information of the symbol. The alternating constant 1 (or −1) and the index of the position ensure that the inner product simply computes the difference between the position of the symbol and the position of interest; we will use this trick multiple times in later constructions as well. It is easy to see that the two transformations are indeed linear.
This leaves us with the question of how to use this position of the symbol of interest (t − h) to extract the one-hot encoding of the symbol at that position. Luckily, due to the information contained in the symbol representations r(y_j), this is trivial. All that the transformation V has to do is the following:

    V : (⟦y_j⟧; j; 1) ↦ ⟦y_j⟧.    (5.154)

With this, the identity of the symbol is carried forward through the attention mechanism. Again, it is easy to see that this is a linear transformation of the symbol representation. Notice that the only head-dependent transformation is the query transformation: it depends on the index of the head, which determines the position of interest, meaning that every head defines a different query transformation, while the key and value transformations are the same among all heads.
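The following sketch plays through one head of this construction numerically: with the query of Eq. (5.151), the keys of Eq. (5.152), the values of Eq. (5.154), and the scoring function of Eq. (5.146), hard attention selects exactly position t − h and copies out that symbol's one-hot encoding. The toy alphabet and the helper names are hypothetical.

```python
import numpy as np

alphabet = ["a", "b", "c"]
one_hot = {s: np.eye(len(alphabet))[i] for i, s in enumerate(alphabet)}

def head(y_prefix, t, h):
    # y_prefix: symbols y_1 ... y_{t-1} generated so far (positions are 1-indexed).
    query = np.array([t - h, 1.0])                       # Eq. (5.151)
    keys = [np.array([-1.0, j]) for j in range(1, t)]    # Eq. (5.152)
    values = [one_hot[y] for y in y_prefix]              # Eq. (5.154)
    scores = [-abs(query @ k) for k in keys]             # f(q, k) = -|<q, k>| = -|t - h - j|
    j_star = int(np.argmax(scores))                      # hard attention: unique maximizer j = t - h
    return values[j_star]

y = ["a", "c", "b", "a"]                                 # y_1 ... y_4; we model y_5
print(head(y, t=5, h=2))                                 # one-hot of y_3 = "b": [0. 1. 0.]
```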
This concludes the proof. Fig. 5.26 shows an illustration of the described model with all the defined components. ■
40 Importantly, in a multi-layer transformer, the queries would be constructed for every non-masked symbol, and its representation (hidden state) would be updated. However, since the updated representations would not be used in the single-layer case, we only have to compute the representation of the newest symbol here.

pSM py t | y ăt q
E

fH

Head 3 Head 2 Head 1

y1 y2 ¨¨¨ y t´3 y t´2 y t´1 yt ¨¨¨

Query t´3
1

Keys ´1
1
´1
2 ¨¨¨ ´1
t´3
´1
t´2
´1
t´1

Values Jy 1 K Jy 2 K ¨¨¨ Jy t´3 K Jy t´2 K Jy t´1 K

Attention scores 2´t 1´t ¨¨¨ 0 ´1 ´2

Hard attention weights 0 0 ¨¨¨ 1 0 0

Figure 5.26: A more complete illustration of the construction described in the proof for the case of
the third head, Head 3, based on Fig. 5.25. Note that, among the three heads, only the query vector
(transformation) differs, while the key and value transformations are identical among the heads.

This proof establishes the only "concrete" result on (the lower bound of) the expressivity of transformers in the form of model equivalence (cf. Definition 5.3.1) that we know of. In the next subsections, we discuss how transformer-based language models can simulate more complex formal models. However, the simulation will not be as "direct" as in the n-gram case, in the sense that we will have to work with modified alphabets, which, as we noted above, results in a different notion of equivalence of models than the one we have considered so far. We will thus not model the conditional probabilities p_SM(y | y_{<t}) directly, but rather probabilities over some more complex (but still finitely many) objects, which carry in them more information than just the generated symbol. As we will discuss, this is required due to the limited ability of transformers to execute sequential operations compared to RNNs and classical language models, as hinted at in §5.1.5.

Infinite-precision Transformers and Finite-state Languages

Having shown that transformers can perfectly represent at least the strictly local subregular languages, we now start our "standardized" climb up the hierarchy of regular, context-free, and all computable languages. This is also the point where we depart from the stricter notion of equivalence discussed above. As we will see shortly, to be able to correctly simulate the sequential processing of the classical language models, we will have to augment the alphabet of generated symbols with additional information, meaning that, from now on, we will talk about homomorphism equivalence (cf. Definition 5.3.3).
The central result of this subsection can be summarized by the following theorem.

Theorem 5.3.2: Transformer language models can simulate probabilistic finite-state automata

Infinite-precision transformers can simulate probabilistic finite-state automata.

Proof. Theorem 5.3.2 presents a result roughly analogous to the fact that RNNs can simulate deterministic probabilistic weighted finite-state automata (cf. Lemma 5.1.2). This result, however, is in some sense stronger: It says that a transformer can simulate any PFSA, even non-deterministic ones. This, of course, comes with the caveat of the augmented alphabet and the corresponding homomorphism equivalence, which we discuss shortly.
As always, we will prove the theorem by constructing, for a given PFSA A = (Σ, Q, δ, λ, ρ), a transformer model T with the same weighted language, i.e., L(A) = L(T). While a PFSA can in general have multiple initial states, we assume here, without loss of generality, that it has a single initial state.⁴¹
We start by defining the structure of the hidden states (i.e., contextual representations of symbols) of the transformer, then discuss how those will be used to simulate the PFSA, and lastly discuss the concrete parameter settings which allow the transformer to carry out the actions defined by the PFSA. For all of that to be possible, however, we will require an augmented alphabet of symbols.

41 As an exercise, you might think about how one can represent a WFSA with multiple initial states with one that has a single initial state.



An augmented alphabet of symbols. To simulate a PFSA with a transformer, the transformer will work over an extended, but crucially still finite, alphabet of "symbols".⁴² Let us consider why this might be required. As mentioned, transformers have no mechanism to pass any notion of an inner state between subsequent symbols in the string (apart from the notion of passing the state between layers; however, there is always a finite number of those). Importantly, the inner state of a machine (e.g., the state in a finite-state automaton, the configuration in a pushdown automaton, or the hidden state in an RNN) is not in any direct correspondence to the current symbol, as it depends on all the previously read symbols in the string. This is where the modification comes in: What if we encoded the inner state of the machine directly in the output symbol itself? This would then allow a model to access it simply from the raw context y_{<t} itself, without having to keep any latent variable in an inner state. This is the main idea behind using a modified version of the automaton's alphabet to simulate it with a transformer: We will expose the machine's current state q_t in the generated symbol so that the transformer is able to access it in the next time step with the attention mechanism, which can always look at the entire history so far. The alphabet whose Kleene closure we will therefore be modeling with the transformer is

    Δ ≔ Q × Σ.    (5.155)

That is, the transformer will generate (or model conditional distributions over) pairs of states of, and symbols read by, the PFSA. This will allow it to keep track of both the symbols read or generated by the PFSA as well as its computation steps (even in the non-deterministic case).

Transformer hidden states. From a very high level, we will encode the possible "configurations" of the PFSA in the transformer's hidden states as follows:

    h_t = (⟦q_t⟧; ⟦y_t⟧; ... ; control variables; ... ; positional encoding),    (5.156)

where q_t refers to the state A is in at time t and y_t to the input symbol it is reading. ⟦·⟧ refers to the one-hot encoding function of the appropriate dimensionality, i.e., ⟦q⟧ ∈ B^{|Q|} and ⟦y⟧ ∈ B^{|Σ|}. We will again use orderings of the sets Q × Σ (ordering n), Σ (ordering m), and Q (ordering r). Notice that the first two elements together exactly correspond to an element of the extended alphabet Δ. Lastly, the positional encodings will be of the form

    f_pos(t) ≔ (1, t + 1, 1/(t + 1), 1/(t + 1)²)^⊤.    (5.157)
42 Some previous work also shows that transformers are Turing complete by showing that they can simulate an RNN by simply encoding the hidden states of the RNN in the generated symbols of the transformer (Bhattamishra et al., 2020b). This, together with the Turing completeness of RNNs, would suffice for the Turing completeness of transformers. However, notice that by generating RNN hidden states, the transformer model is no longer generating symbols from a finite alphabet, meaning that, by our definition (and other standard definitions of a transformer), it is no longer a language model.

The need for the constant 1 will become apparent later.

Modeling sequentiality with a transformer. As mentioned above, transformers, unlike RNNs, do not model their states sequentially, meaning that their transitions are harder to analyze by comparing them to automata. However, by considering what the attention mechanism does and looking at transformers as sequence models, we can make the connection to sequential machines more apparent. Recall from Definition 5.2.3 that the transformer computes the probability of the next token y_t given the context y_{<t} as

    p_SM(y_t | y_{<t}) = f_{Δ^{|Σ|−1}}(E h_t)_{y_t},    (5.158)

where h_t is the hidden state computed for position t by the attention block.
Importantly, during generation, given (all) the symbols y_{<t}, we can always compute p_SM(y_t | y_{<t}), i.e., the distribution over the next symbol, and thus generate it. This is done one symbol at a time; this is the analogue in the transformer model of how automata work, and such sequential generation can never be parallelized, as discussed in §5.1.5. We will, therefore, use the generative steps of a transformer model to simulate the transitions of an automaton one transition at a time. Note, however, that the transformer still does not keep any notion of a hidden state that is passed through the generation steps, as is the case for an RNN (cf. ??). Therefore, all the information needed to compute the next state of the automaton (i.e., to simulate the transition) has to be present in the so-far generated sequence itself. In the case of simulating finite-state automata, this can be done quite easily with the augmented alphabet described above; we will describe the details shortly. However, this notion of storing the information about the configuration of the simulated machine in the generated string will be especially important once we extend this construction and start working with potentially arbitrarily large amounts of information, for example, encoded by a stack.
Coming back to the structure of the transformer, our goal in this proof is to construct a model which can sequentially produce hidden states of the form of Eq. (5.156). However, before we do that, let us consider why keeping hidden states of the form of Eq. (5.156) is enough to capture the distribution defined by the PFSA. In other words, we want to show that, if we can design a transformer that manages to generate symbols from the extended alphabet Δ correctly, we are guaranteed to be able to represent A's distribution. We show this in the following lemma.

Lemma 5.3.1

Suppose that a transformer T defines hidden states of the form of Eq. (5.156), where q_t refers to the state A is in at time t and y_t to the input symbol it is reading. Then, we can define a transformation f_out of the transformer hidden states, with f_out(h_t) ∈ R^{|Q||Σ|}, and a symbol representation (output) matrix E ∈ R^{|Σ|×|Q||Σ|} such that the resulting transformer sequence model assigns strings the same probabilities as A.

Proof. The idea is, based on the state q_t and the read symbol y_t encoded in h_t, to one-hot encode the pair, and to construct a lookup matrix E in a way similar to the Minsky construction in Lemma 5.1.2 and the n-gram construction in the proof of Theorem 5.3.1. Notice that the hidden state always contains the information needed to do that: We can always extract the one-hot encodings of the state and the symbol and then combine them into a one-hot encoding of the pair (q_t, y_t) by implementing the AND function; this is completely analogous to the construction of the

one-hot encoding of an n-gram in the proof of Theorem 5.3.1. This is done by the transformation f_out.⁴³
After transforming h_t into the one-hot encoding of the state–symbol pair, we can look up the logits of the next-state distribution in the representation matrix E. This is again analogous to the construction in Theorem 5.3.1. The matrix embeds the symbols y ∈ Σ and thus enables the computation of the softmax in the definition of a transformer sequence model (cf. Definition 5.2.3). More formally, for y ∈ Σ, we define

    E_{m(y), n(p,a)} ≔ log w if p --a/w--> ∘ ∈ δ, and −∞ otherwise,    (5.159)

and for eos, we define

    E_{m(eos), n(p,a)} ≔ log ρ(q) if ρ(q) > 0, and −∞ otherwise.    (5.160)

We can show that this ensures that the probabilities assigned to strings by the transformer language model match those assigned by the PFSA by simply multiplying the conditional probabilities of the transformer defined above and checking that they match the string acceptance weights assigned by A. ■
Having shown that hidden states of the form of Eq. (5.156) suffice, let us now outline the entire architecture of the transformer simulating A. As mentioned, by outputting state–symbol pairs, the model can always access the current state and the read symbol of the simulated automaton in the previously generated symbol. The application of the transformer layer therefore has to "extract" this information from the entire generated string (by attending to the previous symbol) and then simulate the weighted transition function of the PFSA. This will be done in two stages, which corresponds to the transformer having two layers. The first task is very similar to what we did in the n-gram case: The transformer simply needs one head to attend to the previously generated symbol in the string, which can be done easily by using the same positional encodings and attention scoring function as before. ■

Turing Completeness of Transformers

Turing completeness of two-stack pushdown automata.

Theorem 5.3.3: Turing completeness

Transformer language models are Turing complete.

Lemma 5.3.2

Transformer language models can simulate probabilistic two-stack pushdown automata.

43 This does represent a slight departure from the standard definition of a transformer language model, in which the hidden state is used directly, together with E, to determine p_SM(y_{t+1} | y_{≤t}). For simplicity, we allow ourselves this departure. Note, however, that this transformation could also easily be performed by an additional transformer block, in which the attention mechanism would mostly just copy the values over and the output transformation F would perform the conjunction.

Just like in the RNN case in §5.1.6, we will start by proving a weaker claim whose proof is simpler but conceptually completely the same: the fact that transformers can simulate deterministic single-stack pushdown automata. We will then generalize the construction to the two-stack case.

Transformer hidden states. From a very high level, we will encode the transformer's hidden states as follows:

    h_t = (⟦q_t⟧; ⟦γ_t⟧; d(a_t); ... ; control variables; ... ; positional encoding).    (5.161)

More concretely, the positional encodings will be of the form

    f_pos(t) ≔ (1, t + 1, 1/(t + 1), 1/(t + 1)²)^⊤.    (5.162)

The need for the constant 1 will become apparent later.

Two memory structures. The construction will work with two "memory structures": the sequence of the generated "symbols" so far and the stack of the simulated automaton. Distinguishing and finding the relationship between them (in the sense of how the sequence of generated symbols encodes the stack) will be the main challenge of the construction. The sequence of generated symbols plays a role analogous to the context of the transformer sequence model as defined in Definition 5.2.3. Conceptually, we can imagine the stack being represented on an infinite tape (infinite in one direction), whose cells have indices through which we can look up the values of the stack. See ?? for an abstract depiction. Note that we will only ever access the top of the stack; we only use the indexing to make the construction and manipulation of the stack through the transformer hidden states easier.

Lemma 5.3.3

Turing completeness of two-stack pushdown automata.

The Nuances of Transformer Computational Power

• Importance of positional encodings

• Difference between HAT and AHAT

• The difficulty of placing them in the Chomsky hierarchy


[Figure 5.27: A pictorial representation of the expressive power of various language model architectures (axes: Expressivity vs. Time), covering n-gram LMs, transformer LMs, PFSA LMs, RNN LMs, context-free LMs, and pushdown LMs, grouped into strictly local, regular, context-free, and more expressive classes.]

– Hahn's construction
– Satwik counter construction

This concludes our investigation of the computational capacity of transformers. While this does not finish our theoretical investigation of language models, it is the last result on the computational complexity of language models that we cover. We therefore summarize the results covered in these notes in Fig. 5.27, where we pictorially represent the computational capacity of all the different language models we considered.

Transformers and First-Order Logic


Index

2 Symbols 37 bigram model 98


3 bos symbol 18
4 eos symbol 21 38 C
5 σ-algebra 10 39 candidate 155
6 softmax RNN sequence models 144 40 candidate vector 154
7 σ-algebra 10 41 categorical distribution 53
8 (full) path weight 82 42 center embeddings 108
9 (weighted) language 83, 131 43 co-accessible 84
10 n-gram assumption 97 44 co-accessible state 84
11 beginning of sequence 18 45 column matrix 178
12 end of sequence 21 46 complete 49
13 n-gram assumption 97 47 component-activating matrix 180, 185
48 concatenation 14
14 A 49 configuration 127
15 accepting 82, 127 50 conjugate 48
16 accepts 78 51 consistent 26
17 accessible 84 52 context 21, 97
18 accessible state 84 53 context encoding function 52
19 activation function 146 54 context-free 119
20 algebra 11 55 context-free grammar 108
21 allsum 83, 117, 118, 131 56 applicable production 109
22 alphabet 14 57 derivation 110
23 ambiguous 112 58 derivation tree 110
24 applicable 109 59 derive 110
25 Attention 212 60 language 110
26 attention block 218 61 non-terminal 109
27 attention head 222 62 parse tree 110
28 attention matrix 218 63 production 109
29 attention mechanism 210 64 start-symbol 109
30 averaging hard attention 213 65 terminal 109
66 context-free language 107
31 B 67 context-free languages 107
32 backward values 83 68 contextual symbol encodings 212
33 basis vectors 48 69 coordinate vector 48
34 bias 58 70 corpus 61
35 bias vector 151 71 counter machines 205
36 bigram 98 72 cross-serial dependencies 138


1 cross-serial dependency 138 48 Frobenius normal form 95


2 curriculum learning 70
3 cylinder set 30 49 G
50 gate 153
4 D 51 gated recurrent unit 155
5 data leakage 68 52 gating functions 153
6 data sub-vector 180 53 generating 115
7 derivation 110 54 generating function 121
8 derivation set 111 55 globally normalized model 19
9 derivation set of a grammar 112 56 good representation principle 47
10 derivation set of a non-terminal 112
11 derivation tree 110 57 H
12 derived from 110 58 halting problem 135, 207
13 derives 110 59 Heaviside 159
14 detectable 177 60 Heaviside Elman network 159
15 deterministic 77, 128 61 hidden state 140
16 deterministic FSA 77 62 hidden states 140
17 deterministic PDA 128 63 Hilbert space 47, 49
18 dimensionality 48 64 history 21, 97
19 distributed word representations 104 65 homomorphically equivalent 228
20 divergence measure 63
21 dynamics map 140
66 I
67 infinite sequence 15
22 E 68 infinite sequences 15
23 early stopping 71 69 information projection 65
24 easily detectable 177 70 initial state 142
25 Elman sequence model 149 71 inner path weight 82
26 embedding function 51, 149 72 inner product 48
27 embedding tying 150
73 inner product space 48
28 embeddings 51
74 input 154
29 empty string 14
75 input matrix 151
30 encoding 52
76 input string 77
31 energy function 18
77 irreducible normal form 95
32 energy-based 18
33 entropy regularizer 72
34 entropy regularizers 72 78 J
35 equivalent 228 79 Jordan sequence model 150
36 event 10
37 events 10 80 K
38 exposure bias 66 81 keys 212
82 Kleene closure 15
39 F 83 Kleene star 14
40 finite-state 76, 85
41 finite-state automaton 76 84 L
42 recognized language 78 85 language 15, 78, 110
43 string acceptance 78 86 language model 16, 29
44 first-moment matrix 122 87 context-free 119
45 forget 154 88 energy-based 18
46 formal language theory 14 89 finite-state 85
47 four-hot representation 184 90 globally normalized 19

1 locally normalized 21 47 O
2 pushdown 132 48 one-hot encoding 146
3 weighted language 17 49 one-hot encodings 51
4 language model induced by G 120 50 optimization algorithms 69
5 language model induced by P 132 51 outcome space 10
6 language model induced by A 87 52 output 154
7 language modeling task 61 53 output matrix 151
8 language recognized by P 128 54 overfitting 71
9 layer normalization 222
10 length 81 55 P
11 level of a generation sequence 120 56 pad 97
12 likelihood 64 57 padding 97
13 log 64 58 parameter estimation 68
14 pseudo 66 59 path 81
15 line matrix 178 60 accepting 82
16 locally normalized language model 21 61 inner weight 82
17 log-likelihood 64 62 length 81
18 logits 52 63 successful 82
19 long short-term memory unit 154 64 weight 82
20 LSTM 154 65 yield 81
66 permutation-detectable 177
67 position-augmented 221
21 M 68 positional encoding 221
22 masked attention block 220 69 prefix 16
23 masking matrix 220 70 prefix probability 22
24 matrix detection 177 71 Prefixes 16
25 matrix representation 175 72 probabilistic 84, 116, 129, 134
26 measurable space 10 73 probabilistic context-free grammar 116
27 measurable subset 10 74 probabilistic finite-state automaton 84
28 measurable subsets 10 75 probabilistic pushdown automaton 129
29 memory cell 154 76 probabilistic two-stack pushdown
30 monoid 14 77 automaton 134
31 multi-head attention 221 78 probability 86, 119
32 multi-head attention block 222 79 probability measure 11
33 multi-hot 231 80 probability pre-measure 12
81 probability simplex 53
82 probability space 11
34 N 83 processing sub-vector 180
35 non-decreasing 185 84 production
36 non-deterministic 78, 128 85 yield 109
37 non-deterministic FSA 78 86 production generating function 121
38 non-deterministic PDA 128 87 projection 53
39 non-scanning 127 88 projection function 53
40 non-terminating 29 89 proper 26
41 non-tight 26, 38 90 pruned 115
42 norm 49 91 Pruning 115
43 normalizable 19, 84, 118, 131 92 pseudo(log)likelihood 66
44 normalizable energy function 19 93 pushdown automata 126
45 normalization constant 19 94 pushdown automaton 126
46 northwestern 177 95 accepting run 127

1 configuration 127 48 scalars 47


2 non-scanning transition 127 49 scanning 127
3 recognized language 128 50 scans 127
4 recognized string 128 51 score function 92
5 run 127 52 scores 52
6 scan 127 53 scoring function 212
7 scanning transition 127 54 self-attention 219
8 pushdown language model 132 55 self-supervised 62
9 pushforward measure 12 56 sentence 15
57 sentences 15
10 Q 58 sequence 15, 239
11 query 212 59 sequence model 21, 29
60 sequences 15
12 R 61 soft attention 213
13 random variable 12 62 softmax 54
14 rational-valued recurrent neural network 63 sparsemax 57
15 142 64 spectral radius 95
16 rational-valued recurrent neural networks 65 square-root state representation 174
17 142 66 stack 239
18 reachable 115 67 static symbol embeddings 149
19 real-valued recurrent neural network 142 68 stochastic gradient descent 70
20 real-weighted context-free grammar 115 69 strictly n-local 101
21 real-weighted finite-state automaton 79 70 strictly n-local 101
22 real-weighted pushdown automaton 129 71 strictly local 101
23 recognizes 128, 130 72 string 14
24 rectified linear unit 147 73 stringsum 82, 117, 130
25 recurrence matrix 150 74 subregular 100
26 recurrent dynamics map 140 75 subregular language 100
27 recurrent neural encoding function 143 76 subsequence 16
28 recurrent neural network 140 77 substochastic 94
29 recurrent neural networks 140 78 substring 16
30 recurrent neural sequence model 144 79 successful 82
31 regular 78 80 suffix 16
32 regular language 78 81 suffixes 16
33 regular languages 78 82 symbol-specific transition matrix 80
34 ReLU 147 83 symbols 14
35 representation function 50
36 representation space 47 84 T
37 representation-based language modeling 46 85 teacher forcing 66
38 representation-based locally normalized 86 temperature 54
39 model 58 87 thin cylinder 30
40 reset 155 88 tied 150
41 residual connections 215 89 tight 21, 26, 38
42 result of applying 109 90 token 15
43 RNN 140 91 token to type switch 102, 136
44 row matrix 178 92 tokens 15
45 run 127 93 training 68
94 transformer 216
46 S 95 transformer layer 215
47 saturated sigmoid 193 96 transformer network 211

1 transformer sequence model 212 27 weight pushing 90


2 transition 76 28 weighted context-free grammar 115
3 transition matrix 80 29 allsum 118
4 transitions 76 30 derivation tree weight 117
5 Trimming 84 31 induced language model 120
6 two-stack pushdown automaton 133 32 non-terminal allsum 117
7 two-stack real-weighted pushdown 33 normalizable 118
8 automaton 133 34 stringsum 117
9 two-stack weighted pushdown automaton 35 weighted finite-state automaton 79
10 133 36 allsum 83
37 induced language model 87
11 U 38 normalizable 84
12 unambiguous 111, 112 39 state-specific allsum 83
13 unique hard attention 214 40 stringsum 82
14 unit 14 41 substochastic 94
15 update 155 42 transition matrix 80
16 useful 84 43 trim 84
17 useful state 84 44 weighted language 17
45 weighted pushdown automaton 129
18 V 46 allsum 131
19 values 212 47 induced language model 132
20 vector representation 46, 174 48 normalizable 131
21 vector space 47 49 run weight 130
22 vectors 47 50 stringsum 130
23 vocabulary 15 51 word 14, 15
52 words 15
24 W
25 weight 130 53 Y
26 weight of a derivation tree 117 54 yield 81, 109
Bibliography

2 Steven Abney, David McAllester, and Fernando Pereira. 1999. Relating probabilistic grammars
3 and automata. In Proceedings of the 37th Annual Meeting of the Association for Computa-
4 tional Linguistics, pages 542–549, College Park, Maryland, USA. Association for Computational
5 Linguistics.

6 Noga Alon, A. K. Dewdney, and Teunis J. Ott. 1991. Efficient simulation of finite automata by
7 neural nets. J. ACM, 38(2):495–514.

8 Stéphane Aroca-Ouellette and Frank Rudzicz. 2020. On Losses for Modern Language Models.
9 In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
10 (EMNLP), pages 4970–4981, Online. Association for Computational Linguistics.

11 Enes Avcu, Chihiro Shibata, and Jeffrey Heinz. 2017. Subregular complexity and deep learning.
12 ArXiv, abs/1705.05940.

13 Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization.

14 Anton Bakhtin, Yuntian Deng, Sam Gross, Myle Ott, Marc’Aurelio Ranzato, and Arthur Szlam.
15 2021. Residual energy-based models for text. Journal of Machine Learning Research, 22(40):1–41.

16 Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational
17 Linguistics, 48(1):207–219.

18 Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for
19 sequence prediction with recurrent neural networks. In Advances in Neural Information Processing
20 Systems, volume 28.

21 Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and
22 new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828.

24 Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic
25 language model. Journal of Machine Learning Research, 3:1137–1155.

27 Julian Besag. 1975. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society.
28 Series D (The Statistician), 24(3):179–195.


1 Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. 2020a. On the Ability and Limitations of
2 Transformers to Recognize Formal Languages. In Proceedings of the 2020 Conference on Empirical
3 Methods in Natural Language Processing (EMNLP), pages 7096–7116, Online. Association for
4 Computational Linguistics.

5 Satwik Bhattamishra, Arkil Patel, and Navin Goyal. 2020b. On the computational power of
6 transformers and its implications in sequence modeling. In Proceedings of the 24th Conference on
7 Computational Natural Language Learning, pages 455–475, Online. Association for Computational
8 Linguistics.

9 Patrick Billingsley. 1995. Probability and Measure, 3rd edition. Wiley.

10 Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer-Verlag, Berlin,
11 Heidelberg.

12 Mathieu Blondel, Andre Martins, and Vlad Niculae. 2019. Learning classifiers with fenchel-young
13 losses: Generalized entropies, margins, and algorithms. In Proceedings of the Twenty-Second
14 International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of
15 Machine Learning Research, pages 606–615. PMLR.

16 Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word
17 vectors with subword information. Transactions of the Association for Computational Linguistics,
18 5:135–146.

19 L. Boltzmann. 1868. Studien über das Gleichgewicht der lebendigen Kraft zwischen bewegten
20 materiellen Punkten: vorgelegt in der Sitzung am 8. October 1868 . k. und k. Hof- und Staatsdr.

21 T.L. Booth and R.A. Thompson. 1973. Applying probability measures to abstract languages. IEEE
22 Transactions on Computers, C-22(5):442–450.

23 Adam L. Buchsbaum, Raffaele Giancarlo, and Jeffery R. Westbrook. 2000. On the determinization
24 of weighted finite automata. SIAM Journal on Computing, 30(5):1502–1531.

25 Alexandra Butoi, Brian DuSell, Tim Vieira, Ryan Cotterell, and David Chiang. 2022. Algorithms
26 for weighted pushdown automata.

27 Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for
28 language modeling. In 34th Annual Meeting of the Association for Computational Linguistics,
29 pages 310–318, Santa Cruz, California, USA. Association for Computational Linguistics.

30 Yining Chen, Sorcha Gilroy, Andreas Maletti, Jonathan May, and Kevin Knight. 2018. Recurrent
31 neural networks as weighted language recognizers. NAACL HLT 2018 - 2018 Conference of the
32 North American Chapter of the Association for Computational Linguistics: Human Language
33 Technologies - Proceedings of the Conference, 1:2261–2271.

34 Zhiyi Chi. 1999. Statistical properties of probabilistic context-free grammars. Computational


35 Linguistics, 25(1):131–160.

36 Zhiyi Chi and Stuart Geman. 1998. Estimation of probabilistic context-free grammars. Computational
37 Linguistics, 24(2):299–305.

1 David Chiang and Peter Cholak. 2022. Overcoming a theoretical limitation of self-attention. In
2 Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume
3 1: Long Papers), pages 7654–7664, Dublin, Ireland. Association for Computational Linguistics.

4 Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the
5 properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8,
6 Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111,
7 Doha, Qatar. Association for Computational Linguistics.

8 Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
9 Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder–
10 decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical
11 Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association
12 for Computational Linguistics.

13 N. Chomsky and M.P. Schützenberger. 1963. The algebraic theory of context-free languages. In
14 P. Braffort and D. Hirschberg, editors, Computer Programming and Formal Systems, volume 35
15 of Studies in Logic and the Foundations of Mathematics, pages 118–161. Elsevier.

16 Noam Chomsky. 1959. On certain formal properties of grammars. Information and Control,
17 2(2):137–167.

18 Noam Chomsky. 1965. Aspects of the Theory of Syntax, 50 edition. The MIT Press.

19 K. Culik and Arto Salomaa. 1978. On the decidability of homomorphism equivalence for languages.
20 Journal of Computer and System Sciences, 17(2):163–175.

21 Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot
22 Catt, Marcus Hutter, Shane Legg, and Pedro A. Ortega. 2022. Neural networks and the chomsky
23 hierarchy. ArXiv, abs/2207.02098.

24 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training
25 of deep bidirectional transformers for language understanding. In Proceedings of the 2019
26 Conference of the North American Chapter of the Association for Computational Linguistics:
27 Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis,
28 Minnesota. Association for Computational Linguistics.

29 A. K. Dewdney. 1977. Threshold matrices and the state assignment problem for neural nets. In
30 Proceedings of the 8th SouthEastern Conference on Combinatorics, Graph Theory and Computing,
31 pages 227–245, Baton Rouge, La, USA.

32 Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah A. Smith.
33 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early
34 stopping. CoRR, abs/2002.06305.

35 Li Du, Lucas Torroba Hennigen, Tiago Pimentel, Clara Meister, Jason Eisner, and Ryan Cotterell.
36 2022. A measure-theoretic characterization of tight language models.

37 John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning
38 and stochastic optimization. J. Mach. Learn. Res., 12(null):2121–2159.

1 Rick Durrett. 2019. Probability: Theory and Examples, 5th edition. Cambridge Series in Statistical
2 and Probabilistic Mathematics. Cambridge university press.

3 Jason Eisner. 2016. Inside-outside and forward-backward algorithms are just backprop (tutorial
4 paper). In Proceedings of the Workshop on Structured Prediction for NLP, pages 1–17, Austin,
5 TX. Association for Computational Linguistics.

6 Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

7 Patrick C. Fischer, Albert R. Meyer, and Arnold L. Rosenberg. 1968. Counter machines and counter
8 languages. Mathematical systems theory, 2:265–283.

9 William A. Gale and Geoffrey Sampson. 1995. Good-turing frequency estimation without tears. J.
10 Quant. Linguistics, 2:217–237.

11 W. G. Gibbs. 1902. Elementary Principles in Statistical Mechanics. Charles Scribner’s Sons.

12 Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In
13 Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics,
14 volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL,
15 USA. PMLR.

16 Glorot, Xavier and Bengio, Yoshua. 2010. Understanding the difficulty of training deep feedforward
17 neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence
18 and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna
19 Resort, Sardinia, Italy. PMLR.

20 Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2018. FRAGE: Frequency-
21 Agnostic word representation. In Proceedings of the 32nd International Conference on Neural
22 Information Processing Systems, NIPS’18, page 1341–1352, Red Hook, NY, USA. Curran Associates
23 Inc.

24 Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

25 Andreas Griewank and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques
26 of Algorithmic Differentiation, 2nd edition. SIAM.

27 Charles M. Grinstead and J. Laurie Snell. 1997. Introduction to Probability, 2nd revised edition.
28 American Mathematical Society.

29 Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions
30 of the Association for Computational Linguistics, 8:156–171.

31 Yiding Hao, William Merrill, Dana Angluin, Robert Frank, Noah Amsel, Andrew Benz, and Simon
32 Mendelsohn. 2018. Context-free transductions with neural stacks. In Proceedings of the 2018
33 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages
34 306–315, Brussels, Belgium. Association for Computational Linguistics.

35 Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning.
36 Springer Series in Statistics. Springer New York Inc., New York, NY, USA.

1 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers:
2 Surpassing human-level performance on imagenet classification. In 2015 IEEE International
3 Conference on Computer Vision (ICCV), pages 1026–1034.

4 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image
5 recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
6 pages 770–778.

7 D.O. Hebb. 1949. The Organization of Behavior: A Neuropsychological Theory. A Wiley book in
8 clinical psychology. Wiley.

9 John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. 2020. RNNs
10 can generate bounded hierarchical languages with optimal memory. In Proceedings of the 2020
11 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978–2010,
12 Online. Association for Computational Linguistics.

13 John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word
14 representations. In Proceedings of the 2019 Conference of the North American Chapter of the
15 Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
16 and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational
17 Linguistics.

18 Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation,
19 9:1735–80.

20 Elad Hoffer, Itay Hubara, and Daniel Soudry. 2017. Train longer, generalize better: Closing
21 the generalization gap in large batch training of neural networks. In Proceedings of the 31st
22 International Conference on Neural Information Processing Systems, pages 1729—-1739, Red
23 Hook, NY, USA. Curran Associates Inc.

24 John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. 2006. Introduction to Automata Theory,
25 Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., USA.

26 Roger A. Horn and Charles R. Johnson. 2012. Matrix Analysis, 2nd edition. Cambridge University
27 Press.

28 Ferenc Huszár. 2015. How (not) to Train your Generative Model: Scheduled Sampling, Likelihood,
29 Adversary? CoRR, abs/1511.05101.

30 Riny Huybregts, Germen de Haan, Mieke Trommelen, and Wim Zonneveld. 1984. Van periferie naar
31 kern. Computational Linguistics.

32 Thomas Icard. 2020a. Calibrating generative models: The probabilistic chomsky-schützenberger


33 hierarchy. Journal of Mathematical Psychology, 95.

34 Thomas F. Icard. 2020b. Calibrating generative models: The probabilistic chomsky–schützenberger


35 hierarchy. Journal of Mathematical Psychology, 95:102308.

36 P. Indyk. 1995. Optimal simulation of automata by neural nets. In STACS 95, pages 337–348,
37 Berlin, Heidelberg. Springer Berlin Heidelberg.

1 Gerhard Jäger and James Rogers. 2012. Formal language theory: Refining the chomsky hierarchy.
2 Philos Trans R Soc Lond B Biol Sci, 367(1598):1956–1970.

3 Ganesh Jawahar, Benoı̂t Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure
4 of language? In Proceedings of the 57th Annual Meeting of the Association for Computational
5 Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

6 E. T. Jaynes. 1957. Information theory and statistical mechanics. Phys. Rev., 106:620–630.

7 Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018. Depth-
8 bounding is effective: Improvements and evaluation of unsupervised PCFG induction. In Pro-
9 ceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages
10 2721–2731, Brussels, Belgium. Association for Computational Linguistics.

11 Michael I. Jordan. 1986. Serial order: A parallel distributed processing approach. Technical report.

12 Michael I. Jordan. 1997. Chapter 25 - serial order: A parallel distributed processing approach.
13 In John W. Donahoe and Vivian Packard Dorsel, editors, Neural-Network Models of Cognition,
14 volume 121 of Advances in Psychology, pages 471–495. North-Holland.

15 Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing (2nd Edition).
16 Prentice-Hall, Inc., USA.

17 Fred Karlsson. 2007. Constraints on multiple center-embedding of clauses. Journal of Linguistics,


18 43(2):365–392.

19 Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd
20 International Conference on Learning Representations.

21 Samuel A. Korsky and Robert C. Berwick. 2019. On the computational power of rnns.

22 Matthieu Labeau and Shay B. Cohen. 2019. Experimenting with power divergences for language
23 modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
24 Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-
25 IJCNLP), pages 4104–4114, Hong Kong, China. Association for Computational Linguistics.

26 Daniel J. Lehmann. 1977. Algebraic structures for transitive closure. Theoretical Computer Science,
27 4(1):59–76.

28 Chu-Cheng Lin, Aaron Jaech, Xin Li, Matthew R. Gormley, and Jason Eisner. 2021a. Limitations
29 of autoregressive models and their alternatives. In Proceedings of the 2021 Conference of the
30 North American Chapter of the Association for Computational Linguistics: Human Language
31 Technologies, pages 5147–5173, Online. Association for Computational Linguistics.

32 Chu-Cheng Lin, Aaron Jaech, Xin Li, Matthew R. Gormley, and Jason Eisner. 2021b. Limitations
33 of autoregressive models and their alternatives. In Proceedings of the 2021 Conference of the
34 North American Chapter of the Association for Computational Linguistics: Human Language
35 Technologies, pages 5147–5173, Online. Association for Computational Linguistics.

36 Chu-Cheng Lin and Arya D. McCarthy. 2022. On the uncomputability of partition functions in
37 energy-based sequence models. In International Conference on Learning Representations.

1 Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019.
2 Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019
3 Conference of the North American Chapter of the Association for Computational Linguistics:
4 Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis,
5 Minnesota. Association for Computational Linguistics.

6 Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020.
7 Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings
8 of the National Academy of Sciences, 117(48):30046–30054.

9 André F. T. Martins and Ramón F. Astudillo. 2016. From softmax to sparsemax: A sparse model
10 of attention and multi-label classification. In Proceedings of the 33rd International Conference on
11 International Conference on Machine Learning - Volume 48, ICML’16, page 1614–1623. [Link].

12 Clara Meister, Elizabeth Salesky, and Ryan Cotterell. 2020. Generalized entropy regularization or:
13 There’s nothing special about label smoothing. In Proceedings of the 58th Annual Meeting of the
14 Association for Computational Linguistics, pages 6870–6886, Online. Association for Computational
15 Linguistics.

16 William Merrill. 2019. Sequential neural networks as automata. In Proceedings of the Workshop on
17 Deep Learning and Formal Languages: Building Bridges, pages 1–13, Florence. Association for
18 Computational Linguistics.

19 William Merrill and Ashish Sabharwal. 2023. The parallelism tradeoff: Limitations of log-precision
20 transformers. Transactions of the Association for Computational Linguistics, 11:531–545.

21 William Merrill, Ashish Sabharwal, and Noah A. Smith. 2022a. Saturated transformers are constant-
22 depth threshold circuits. Transactions of the Association for Computational Linguistics, 10:843–
23 856.

24 William Merrill, Alex Warstadt, and Tal Linzen. 2022b. Entailment semantics can be extracted from
25 an ideal language model. In Proceedings of the 26th Conference on Computational Natural Language
26 Learning (CoNLL), pages 176–193, Abu Dhabi, United Arab Emirates (Hybrid). Association for
27 Computational Linguistics.

28 William Merrill, Gail Weiss, Yoav Goldberg, Roy Schwartz, Noah A. Smith, and Eran Yahav. 2020.
29 A formal hierarchy of RNN architectures. In Proceedings of the 58th Annual Meeting of the
30 Association for Computational Linguistics, pages 443–459, Online. Association for Computational
31 Linguistics.

32 Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word
33 representations in vector space.

34 George A. Miller and Noam Chomsky. 1963. Finitary models of language users. In D. Luce, editor,
35 Handbook of Mathematical Psychology, pages 2–419. John Wiley & Sons.

36 Thomas Minka. 2005. Divergence measures and message passing. Technical report, Microsoft.

37 Marvin Lee Minsky. 1986. Neural Nets and the brain model problem. Ph.D. thesis, Princeton
38 University.

1 Mehryar Mohri, Fernando Pereira, and Michael Riley. 2008. Speech Recognition with Weighted
2 Finite-State Transducers, pages 559–584. Springer Berlin Heidelberg, Berlin, Heidelberg.

3 James R. Munkres. 2000. Topology, 2nd edition. Prentice Hall, Inc.

4 Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On structuring probabilistic dependences in
5 stochastic language modelling. Computer Speech & Language, 8(1):1–38.

6 Frank Nielsen. 2018. What is an information projection? Notices of the American Mathematical
7 Society, 65:1.

8 Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for
9 word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural
10 Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational
11 Linguistics.

12 Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017.
13 Regularizing neural networks by penalizing confident output distributions. In Proceedings of the
14 International Conference on Learning Representations.

15 B.T. Polyak. 1964. Some methods of speeding up the convergence of iteration methods. USSR
16 Computational Mathematics and Mathematical Physics, 4(5):1–17.

17 Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, and Vedant Misra. 2022. Grokking:
18 Generalization beyond overfitting on small algorithmic datasets. CoRR, abs/2201.02177.

19 Jorge Pérez, Pablo Barceló, and Javier Marinkovic. 2021. Attention is turing-complete. Journal of
20 Machine Learning Research, 22(75):1–35.

21 Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A Primer in BERTology: What We
22 Know About How BERT Works. Transactions of the Association for Computational Linguistics,
23 8:842–866.

24 Halsey L. Royden. 1988. Real Analysis, 3rd edition. Prentice-Hall.

25 Thibault Sellam, Steve Yadlowsky, Ian Tenney, Jason Wei, Naomi Saphra, Alexander D’Amour, Tal
26 Linzen, Jasmijn Bastings, Iulia Raluca Turc, Jacob Eisenstein, Dipanjan Das, and Ellie Pavlick.
27 2022. The multiBERTs: BERT reproductions for robustness analysis. In International Conference
28 on Learning Representations.

29 Claude E. Shannon. 1948. A mathematical theory of communication. The Bell System Technical
30 Journal, 27(3):379–423.

31 Stuart M. Shieber. 1985. Evidence against the context-freeness of natural language. Linguistics and
32 Philosophy, 8:333–343.

33 Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. 2023. Distilling reasoning capabilities
34 into smaller language models. In Findings of the Association for Computational Linguistics: ACL
35 2023, pages 7059–7073, Toronto, Canada. Association for Computational Linguistics.

1 Hava T. Siegelmann and Eduardo D. Sontag. 1992. On the computational power of neural nets. In
2 Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, page
3 440–449, New York, NY, USA. Association for Computing Machinery.
4 Michael Sipser. 2013. Introduction to the Theory of Computation, third edition. Course Technology,
5 Boston, MA.
6 Noah A. Smith and Mark Johnson. 2007. Weighted and probabilistic context-free grammars are
7 equally expressive. Computational Linguistics, 33(4):477–491.
8 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
9 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine
10 Learning Research, 15(56):1929–1958.
11 Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2015. Re-
12 thinking the inception architecture for computer vision. 2016 IEEE Conference on Computer
13 Vision and Pattern Recognition, pages 2818–2826.
14 Gábor Szárnyas. 2020. Graphs and matrices: A translation of ”Graphok és matrixok” by Dénes
15 Kőnig (1931).
16 Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2020. oLMpics-On What Language
17 Model Pre-training Captures. Transactions of the Association for Computational Linguistics,
18 8:743–758.
19 Terence Tao. 2011. An Introduction to Measure Theory. American Mathematical Society.
20 Terence Tao. 2016. Analysis II: Third Edition. Texts and Readings in Mathematics. Springer
21 Singapore.
22 Wilson L. Taylor. 1953. “Cloze Procedure”: A new tool for measuring readability. Journalism
23 Quarterly, 30(4):415–433.
24 L. Theis, A. van den Oord, and M. Bethge. 2016. A note on the evaluation of generative models. In
25 4th International Conference on Learning Representations.
26 A. M. Turing. 1937. On Computable Numbers, with an Application to the Entscheidungsproblem.
27 Proceedings of the London Mathematical Society, s2-42(1):230–265.
28 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz
29 Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information
30 Processing Systems, volume 30.
31 Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of
32 finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the
33 Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne,
34 Australia. Association for Computational Linguistics.
35 Sean Welleck, Ilia Kulikov, Jaedeok Kim, Richard Yuanzhe Pang, and Kyunghyun Cho. 2020.
36 Consistency of a recurrent language model with respect to incomplete decoding. In Proceedings
37 of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
38 5553–5568, Online. Association for Computational Linguistics.

1 George Kingsley Zipf. 1935. The Psycho-Biology of Language. Houghton-Mifflin, New York, NY,
2 USA.
