Quantitative Economics with Python

Contents

2 Modeling COVID 19
   2.1 Overview
   2.2 The SIR Model
   2.3 Implementation
   2.4 Experiments
   2.5 Ending Lockdown
3 Linear Algebra
   3.1 Overview
   3.2 Vectors
   3.3 Matrices
   3.4 Solving Systems of Equations
   3.5 Eigenvalues and Eigenvectors
   3.6 Further Topics
   3.7 Exercises
   3.8 Solutions
4 QR Decomposition
   4.1 Overview
   4.2 Matrix Factorization
   4.3 Gram-Schmidt process
   4.4 Some Code
   4.5 Example
   4.6 Using QR Decomposition to Compute Eigenvalues
   4.7 QR and PCA
6 Circulant Matrices
   6.1 Overview
   6.2 Constructing a Circulant Matrix
   6.3 Connection to Permutation Matrix
   6.4 Examples with Python
   6.5 Associated Permutation Matrix
   6.6 Discrete Fourier Transform
9.4 A forward looking model
16 Introduction to Artificial Neural Networks
   16.1 Overview
   16.2 A Deep (but not Wide) Artificial Neural Network
   16.3 Calibrating Parameters
   16.4 Back Propagation and the Chain Rule
   16.5 Training Set
   16.6 Example 1
   16.7 How Deep?
   16.8 Example 2
IV Introduction to Dynamics
22 Dynamics in One Dimension
   22.1 Overview
   22.2 Some Definitions
   22.3 Graphical Analysis
   22.4 Exercises
   22.5 Solutions
27.8 Pure Multiplier Model
27.9 Summary
V Search
32 Job Search I: The McCall Search Model
   32.1 Overview
   32.2 The McCall Model
   32.3 Computing the Optimal Policy: Take 1
   32.4 Computing the Optimal Policy: Take 2
   32.5 Exercises
   32.6 Solutions
34 Job Search III: Fitted Value Function Iteration
   34.1 Overview
   34.2 The Algorithm
   34.3 Implementation
   34.4 Exercises
   34.5 Solutions
40.2 The Model
40.3 The Value Function
40.4 The Optimal Policy
40.5 The Euler Equation
40.6 Exercises
40.7 Solutions
47.6 Solutions
53 Computing Mean of a Likelihood Ratio Process
   53.1 Overview
   53.2 Mathematical Expectation of Likelihood Ratio
   53.3 Importance sampling
   53.4 Selecting a Sampling Distribution
   53.5 Approximating a cumulative likelihood ratio
   53.6 Distribution of Sample Mean
   53.7 More Thoughts about Choice of Sampling Distribution
IX LQ Control
58 LQ Control: Foundations
   58.1 Overview
   58.2 Introduction
   58.3 Optimality – Finite Horizon
   58.4 Implementation
   58.5 Extensions and Comments
   58.6 Further Applications
   58.7 Exercises
   58.8 Solutions
64.3 Results
64.4 Exercises
64.5 Solutions
XI Asset Pricing and Finance
71 Asset Pricing: Finite State Models
   71.1 Overview
   71.2 Pricing Models
   71.3 Prices in the Risk-Neutral Case
   71.4 Risk Aversion and Asset Prices
   71.5 Exercises
   71.6 Solutions
76.7 Summary
76.8 Exercises
76.9 Solutions
80 References
Bibliography
Index
Quantitative Economics with Python
This website presents a set of lectures on quantitative economic modeling, designed and written by Thomas J. Sargent
and John Stachurski.
For an overview of the series, see this page
• Tools and Techniques
– Geometric Series for Elementary Economics
– Modeling COVID 19
– Linear Algebra
– QR Decomposition
– Complex Numbers and Trigonometry
– Circulant Matrices
– Singular Value Decomposition (SVD)
• Elementary Statistics
– Elementary Probability with Matrices
– Univariate Time Series with Matrix Algebra
– LLN and CLT
– Two Meanings of Probability
– Multivariate Hypergeometric Distribution
– Multivariate Normal Distribution
– Heavy-Tailed Distributions
– Fault Tree Uncertainties
– Introduction to Artificial Neural Networks
– Randomized Response Surveys
– Expected Utilities of Random Responses
• Linear Programming
– Linear Programming
– Optimal Transport
– Von Neumann Growth Model (and a Generalization)
• Introduction to Dynamics
– Dynamics in One Dimension
– AR1 Processes
– Finite Markov Chains
– Inventory Dynamics
– Linear State Space Models
– Samuelson Multiplier-Accelerator
– Kesten Processes and Firm Dynamics
– Wealth Distribution Dynamics
Previous website
While this new site will receive all future updates, you may still view the old site here for the next month.
Part I
Tools and Techniques
CHAPTER
ONE
GEOMETRIC SERIES FOR ELEMENTARY ECONOMICS
Contents
1.1 Overview
The lecture describes important ideas in economics that use the mathematics of geometric series.
Among these are
• the Keynesian multiplier
• the money multiplier that prevails in fractional reserve banking systems
• interest rates and present values of streams of payouts from assets
(As we shall see below, the term multiplier comes down to meaning sum of a convergent geometric series)
These and other applications prove the truth of the wisecrack that
"in economics, a little knowledge of geometric series goes a long way."
Below we’ll use the following imports:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
import numpy as np
import sympy as sym
from sympy import init_printing, latex
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
We begin with the infinite geometric series

$$1 + c + c^2 + c^3 + \cdots$$

and its truncated, finite counterpart

$$1 + c + c^2 + c^3 + \cdots + c^T$$

The key formula for the finite sum is

$$1 + c + c^2 + c^3 + \cdots + c^T = \frac{1 - c^{T+1}}{1 - c}$$
Remark: The above formula works for any value of the scalar 𝑐 other than 𝑐 = 1. We don't have to restrict 𝑐 to the set (−1, 1).
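To make the formula concrete, here is a quick numerical check; the particular values of c and T are arbitrary choices:

c, T = 0.9, 10
lhs = sum(c**t for t in range(T + 1))
rhs = (1 - c**(T + 1)) / (1 - c)
print(lhs, rhs)   # the two numbers agree up to floating-point error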
We now move on to describe some famous economic applications of geometric series.
In a fractional reserve banking system, banks hold only a fraction 𝑟 ∈ (0, 1) of cash behind each deposit receipt that
they issue
• In recent times
– cash consists of pieces of paper issued by the government and called dollars or pounds or …
– a deposit is a balance in a checking or savings account that entitles the owner to ask the bank for immediate
payment in cash
• When the UK and France and the US were on either a gold or silver standard (before 1914, for example)
– cash was a gold or silver coin
– a deposit receipt was a bank note that the bank promised to convert into gold or silver on demand; (sometimes
it was also a checking or savings account balance)
Economists and financiers often define the supply of money as an economy-wide sum of cash plus deposits.
In a fractional reserve banking system (one in which the reserve ratio 𝑟 satisfies 0 < 𝑟 < 1), banks create money by
issuing deposits backed by fractional reserves plus loans that they make to their customers.
A geometric series is a key tool for understanding how banks create money (i.e., deposits) in a fractional reserve system.
The geometric series formula (1.1) is at the heart of the classic model of the money creation process – one that leads us
to the celebrated money multiplier.
𝐿𝑖 + 𝑅 𝑖 = 𝐷 𝑖 (1.2)
The left side of the above equation is the sum of the bank’s assets, namely, the loans 𝐿𝑖 it has outstanding plus its reserves
of cash 𝑅𝑖 .
The right side records bank 𝑖’s liabilities, namely, the deposits 𝐷𝑖 held by its depositors; these are IOU’s from the bank to
its depositors in the form of either checking accounts or savings accounts (or before 1914, bank notes issued by a bank
stating promises to redeem note for gold or silver on demand).
Each bank 𝑖 sets its reserves to satisfy the equation

$$R_i = r D_i \tag{1.3}$$

Loans made by bank 𝑖 are, in turn, deposited at bank 𝑖 + 1:

$$D_{i+1} = L_i \tag{1.4}$$
Thus, we can think of the banks as being arranged along a line with loans from bank 𝑖 being immediately deposited in
𝑖+1
• in this way, the debtors to bank 𝑖 become creditors of bank 𝑖 + 1
Finally, we add an initial condition about an exogenous level of bank 0’s deposits
𝐷0 is given exogenously
We can think of 𝐷0 as being the amount of cash that a first depositor put into the first bank in the system, bank number
𝑖 = 0.
Now we do a little algebra.
Combining equations (1.2) and (1.3) tells us that
𝐿𝑖 = (1 − 𝑟)𝐷𝑖 (1.5)
This states that bank 𝑖 loans a fraction (1 − 𝑟) of its deposits and keeps a fraction 𝑟 as cash reserves.
Combining equation (1.5) with equation (1.4) tells us that

$$D_{i+1} = (1 - r) D_i \quad \text{for } i \geq 0,$$

which implies that

$$D_i = (1 - r)^i D_0 \tag{1.6}$$

Equation (1.6) expresses 𝐷𝑖 as the 𝑖th term in the product of 𝐷0 and the geometric series

$$1, \; (1 - r), \; (1 - r)^2, \; \cdots$$
The money multiplier is a number that tells the multiplicative factor by which an exogenous injection of cash into bank 0 leads to an increase in the total deposits in the banking system.
Summing the geometric series of deposits gives total deposits in the banking system:

$$\sum_{i=0}^{\infty} D_i = \sum_{i=0}^{\infty} (1 - r)^i D_0 = \frac{D_0}{r} \tag{1.7}$$

Equation (1.7) asserts that the money multiplier is $\frac{1}{r}$.
• An initial deposit of cash of $D_0$ in bank 0 leads the banking system to create total deposits of $\frac{D_0}{r}$.
• The initial deposit $D_0$ is held as reserves, distributed throughout the banking system according to $D_0 = \sum_{i=0}^{\infty} R_i$.
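A quick simulation of the chain of deposits confirms the multiplier; the reserve ratio and the initial deposit below are arbitrary illustrative values:

r = 0.1          # reserve ratio
D0 = 100.0       # initial cash deposit in bank 0

deposits = [D0 * (1 - r)**i for i in range(200)]   # D_i for the first 200 banks
print(sum(deposits), D0 / r)                       # both are approximately 1000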
The famous economist John Maynard Keynes and his followers created a simple model intended to determine national
income 𝑦 in circumstances in which
• there are substantial unemployed resources, in particular excess supply of labor and capital
• prices and interest rates fail to adjust to make aggregate supply equal demand (e.g., prices and interest rates are
frozen)
• national income is entirely determined by aggregate demand
An elementary Keynesian model of national income determination consists of three equations that describe aggregate
demand for 𝑦 and its components.
The first equation is a national income identity asserting that consumption 𝑐 plus investment 𝑖 equals national income 𝑦:
𝑐+𝑖=𝑦
The second equation is a Keynesian consumption function asserting that people consume a fraction 𝑏 ∈ (0, 1) of their
income:
𝑐 = 𝑏𝑦
Combining the two equations and solving for 𝑦 gives $y = \frac{1}{1-b} i$, so that the multiplier is $\frac{1}{1-b} = \sum_{t=0}^{\infty} b^t$ (the series converges because $b \in (0,1)$).

The expression $\sum_{t=0}^{\infty} b^t$ motivates an interpretation of the multiplier as the outcome of a dynamic process that we describe next.
We arrive at a dynamic version by interpreting the nonnegative integer 𝑡 as indexing time and changing our specification
of the consumption function to take time into account
• we add a one-period lag in how income affects consumption
We let 𝑐𝑡 be consumption at time 𝑡 and 𝑖𝑡 be investment at time 𝑡.
We modify our consumption function to assume the form
𝑐𝑡 = 𝑏𝑦𝑡−1
so that 𝑏 is the marginal propensity to consume (now) out of last period’s income.
We begin with an initial condition stating that

$$y_{-1} = 0$$

and we assume that investment is constant at an exogenous level

$$i_t = i \quad \text{for all } t \geq 0$$

Iterating forward from the initial condition gives

$$y_0 = i + c_0 = i + b y_{-1} = i$$
and
𝑦1 = 𝑐1 + 𝑖 = 𝑏𝑦0 + 𝑖 = (1 + 𝑏)𝑖
and
𝑦2 = 𝑐2 + 𝑖 = 𝑏𝑦1 + 𝑖 = (1 + 𝑏 + 𝑏2 )𝑖
and, more generally,

$$y_t = b y_{t-1} + i = (1 + b + b^2 + \cdots + b^t) i$$

or

$$y_t = \frac{1 - b^{t+1}}{1 - b} i$$

Evidently, as $t \to +\infty$,

$$y_t \to \frac{1}{1 - b} i$$
Remark 1: The above formula is often applied to assert that an exogenous increase in investment of $\Delta i$ at time 0 ignites a dynamic process of increases in national income by successive amounts

$$\Delta i, \; (1 + b)\Delta i, \; (1 + b + b^2)\Delta i, \; \cdots$$

at times 0, 1, 2, ….
Remark 2 Let 𝑔𝑡 be an exogenous sequence of government expenditures.
If we generalize the model so that the national income identity becomes
𝑐𝑡 + 𝑖 𝑡 + 𝑔 𝑡 = 𝑦 𝑡
then a version of the preceding argument shows that the government expenditures multiplier is also $\frac{1}{1-b}$, so that a permanent increase in government expenditures ultimately leads to an increase in national income equal to the multiplier times the increase in government expenditures.
We can apply our formula for geometric series to study how interest rates affect values of streams of dollar payments that
extend over time.
We work in discrete time and assume that 𝑡 = 0, 1, 2, … indexes time.
We let 𝑟 ∈ (0, 1) be a one-period net nominal interest rate
• if the nominal interest rate is 5 percent, then 𝑟 = .05
A one-period gross nominal interest rate 𝑅 is defined as
𝑅 = 1 + 𝑟 ∈ (1, 2)
To compute present values, we work with the two geometric sequences

$$1, \; R, \; R^2, \; \cdots \tag{1.8}$$

and

$$1, \; R^{-1}, \; R^{-2}, \; \cdots \tag{1.9}$$

Sequence (1.8) tells us how dollar values of an investment accumulate through time.
Sequence (1.9) tells us how to discount future dollars to get their values in terms of today's dollars.
1.5.1 Accumulation
Geometric sequence (1.8) tells us how one dollar invested and re-invested in a project with gross one period nominal rate
of return accumulates
• here we assume that net interest payments are reinvested in the project
• thus, 1 dollar invested at time 0 pays interest 𝑟 dollars after one period, so we have 𝑟 + 1 = 𝑅 dollars at time 1
• at time 1 we reinvest 1 + 𝑟 = 𝑅 dollars and receive interest of 𝑟𝑅 dollars at time 2 plus the principal 𝑅 dollars, so
we receive 𝑟𝑅 + 𝑅 = (1 + 𝑟)𝑅 = 𝑅2 dollars at the end of period 2
• and so on
Evidently, if we invest 𝑥 dollars at time 0 and reinvest the proceeds, then the sequence

$$x, \; xR, \; xR^2, \; \cdots$$

tells us how our investment accumulates through time.
1.5.2 Discounting
Geometric sequence (1.9) tells us how much future dollars are worth in terms of today’s dollars.
Remember that the units of 𝑅 are dollars at 𝑡 + 1 per dollar at 𝑡.
It follows that
• the units of 𝑅−1 are dollars at 𝑡 per dollar at 𝑡 + 1
• the units of 𝑅−2 are dollars at 𝑡 per dollar at 𝑡 + 2
• and so on; the units of 𝑅−𝑗 are dollars at 𝑡 per dollar at 𝑡 + 𝑗
So if someone has a claim on 𝑥 dollars at time 𝑡 + 𝑗, it is worth 𝑥𝑅−𝑗 dollars at time 𝑡 (e.g., today).
Expanding:
We could have also approximated by removing the second term 𝑟𝑔𝑥0 (𝑇 + 1) when 𝑇 is relatively small compared to
1/(𝑟𝑔) to get 𝑥0 (𝑇 + 1) as in the finite stream approximation.
We will plot the true finite stream present-value and the two approximations, under different values of 𝑇 , and 𝑔 and 𝑟 in
Python.
First we plot the true finite stream present-value after computing it below
# Infinite lease
def infinite_lease(g, r, x_0):
G = (1 + g)
R = (1 + r)
return x_0 / (1 - G * R**(-1))
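The names finite_lease_pv_true and plot_function are used in the plotting code below but their definitions are not shown in this excerpt. A minimal sketch is given here so that the code runs: finite_lease_pv_true sums the geometric series of discounted payments $x_t = (1+g)^t x_0$, and plot_function is a thin wrapper around ax.plot. (These bodies are reconstructions, not the lecture's exact code.)

def finite_lease_pv_true(T, g, r, x_0):
    "Present value of a lease paying x_0*(1+g)**t for t = 0, ..., T."
    G = (1 + g)
    R = (1 + r)
    return x_0 * (1 - (G / R)**(T + 1)) / (1 - G / R)

def plot_function(ax, x_vals, func, args):
    "Plot func evaluated at args against x_vals, labelled by the function's name."
    ax.plot(x_vals, func(*args), label=func.__name__)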
Now that we have defined our functions, we can plot some outcomes.
First we study the quality of our approximations
T_max = 50
T = np.arange(0, T_max+1)
g = 0.02
r = 0.03
x_0 = 1
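# Bundle the parameters and list the functions to plot.  In the full lecture
# `funcs` also contains Taylor-style approximation functions that are not
# reproduced in this excerpt, so only the true present value is plotted here.
our_args = (T, g, r, x_0)
funcs = [finite_lease_pv_true]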
fig, ax = plt.subplots()
ax.set_title('Finite Lease Present Value $T$ Periods Ahead')
for f in funcs:
plot_function(ax, T, f, our_args)
ax.legend()
ax.set_xlabel('$T$ Periods Ahead')
ax.set_ylabel('Present Value, $p_0$')
plt.show()
The graph above shows how as duration 𝑇 → +∞, the value of a lease of duration 𝑇 approaches the value of a perpetual
lease.
Now we consider two different views of what happens as 𝑟 and 𝑔 covary
# First view
# Changing r and g
fig, ax = plt.subplots()
ax.set_title('Value of lease of length $T$')
ax.set_ylabel('Present Value, $p_0$')
ax.set_xlabel('$T$ periods ahead')
T_max = 10
T=np.arange(0, T_max+1)
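# The plotting loop is missing from this excerpt; a minimal reconstruction.
# The (r, g) pairs are illustrative, chosen to compare r > g with r < g.
rs = (0.9, 0.5, 0.4001, 0.4)
gs = (0.4, 0.4, 0.4, 0.5)
for r_val, g_val in zip(rs, gs):
    ax.plot(T, finite_lease_pv_true(T, g_val, r_val, x_0),
            label=f'$r={r_val}$, $g={g_val}$')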
ax.legend()
plt.show()
This graph gives a big hint for why the condition 𝑟 > 𝑔 is necessary if a lease of length 𝑇 = +∞ is to have finite value.
For fans of 3-d graphs the same point comes through in the following graph.
If you aren’t enamored of 3-d graphs, feel free to skip the next visualization!
# Second view
fig = plt.figure()
T = 3
ax = fig.add_subplot(projection='3d')   # fig.gca(projection='3d') is deprecated in recent Matplotlib
r = np.arange(0.01, 0.99, 0.005)
g = np.arange(0.011, 0.991, 0.005)
rr, gg = np.meshgrid(r, g)
z = finite_lease_pv_true(T, gg, rr, x_0)
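# The surface-drawing commands are not shown in the excerpt; one way to draw it:
surf = ax.plot_surface(rr, gg, z, cmap=cm.viridis, antialiased=True)
fig.colorbar(surf, shrink=0.5, aspect=5)
ax.set_xlabel('$r$')
ax.set_ylabel('$g$')
ax.set_zlabel('Present Value, $p_0$')
plt.show()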
We can use a little calculus to study how the present value 𝑝0 of a lease varies with 𝑟 and 𝑔.
We will use a library called SymPy.
SymPy enables us to do symbolic math calculations including computing derivatives of algebraic equations.
We will illustrate how it works by creating a symbolic expression that represents our present value formula for an infinite
lease.
After that, we’ll use SymPy to compute derivatives
$$\frac{x_0}{-\frac{g+1}{r+1} + 1}$$
print('dp0 / dg is:')
dp_dg = sym.diff(p0, g)
dp_dg
dp0 / dg is:
$$\frac{x_0}{\left(r + 1\right)\left(-\frac{g+1}{r+1} + 1\right)^{2}}$$
print('dp0 / dr is:')
dp_dr = sym.diff(p0, r)
dp_dr
dp0 / dr is:
$$-\frac{x_0 \left(g + 1\right)}{\left(r + 1\right)^{2}\left(-\frac{g+1}{r+1} + 1\right)^{2}}$$
We can see that $\frac{\partial p_0}{\partial r} < 0$ as long as $r > g$, $r > 0$, $g > 0$, and $x_0$ is positive, so $\frac{\partial p_0}{\partial r}$ will always be negative.
Similarly, $\frac{\partial p_0}{\partial g} > 0$ as long as $r > g$, $r > 0$, $g > 0$, and $x_0$ is positive, so $\frac{\partial p_0}{\partial g}$ will always be positive.
We will now go back to the case of the Keynesian multiplier and plot the time path of 𝑦𝑡 , given that consumption is a
constant fraction of national income, and investment is fixed.
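The function calculate_y used below is not reproduced in this excerpt. A straightforward version that iterates $y_t = b y_{t-1} + i + g$ from the initial condition is:

def calculate_y(i, b, g, T, y_init):
    """Compute a length T+1 path of aggregate output y_t."""
    y = np.zeros(T + 1)
    y[0] = i + b * y_init + g
    for t in range(1, T + 1):
        y[t] = b * y[t-1] + i + g
    return y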
# Initial values
i_0 = 0.3
g_0 = 0.3
# 2/3 of income goes towards consumption
b = 2/3
y_init = 0
T = 100
fig, ax = plt.subplots()
ax.set_title('Path of Aggregate Output Over Time')
ax.set_xlabel('$t$')
ax.set_ylabel('$y_t$')
ax.plot(np.arange(0, T+1), calculate_y(i_0, b, g_0, T, y_init))
# Output predicted by geometric series
ax.hlines(i_0 / (1 - b) + g_0 / (1 - b), xmin=-1, xmax=101, linestyles='--')
plt.show()
In this model, income grows over time, until it gradually converges to the infinite geometric series sum of income.
We now examine what will happen if we vary the so-called marginal propensity to consume, i.e., the fraction of income
that is consumed
fig,ax = plt.subplots()
ax.set_title('Changing Consumption as a Fraction of Income')
ax.set_ylabel('$y_t$')
ax.set_xlabel('$t$')
x = np.arange(0, T+1)
bs = (1/3, 2/3, 5/6, 0.9)   # several values of the marginal propensity to consume (illustrative choices)
for b in bs:
y = calculate_y(i_0, b, g_0, T, y_init)
ax.plot(x, y, label=r'$b=$'+f"{b:.2f}")
ax.legend()
plt.show()
Increasing the marginal propensity to consume 𝑏 increases the path of output over time.
Now we will compare the effects on output of increases in investment and government spending.
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 10))

x = np.arange(0, T+1)
values = [0.3, 0.4]

# hold the marginal propensity to consume at its baseline value while i and g vary
b = 2/3

for i in values:
    y = calculate_y(i, b, g_0, T, y_init)
    ax1.plot(x, y, label=f"i={i}")

for g in values:
    y = calculate_y(i_0, b, g, T, y_init)
    ax2.plot(x, y, label=f"g={g}")
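# Finishing touches for the two panels (the title wording is assumed)
ax1.set_title('An increase in investment')
ax1.legend(loc='lower right')
ax2.set_title('An increase in government spending')
ax2.legend(loc='lower right')
ax2.set_xlabel('$t$')
plt.show()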
Notice here, whether government spending increases from 0.3 to 0.4 or investment increases from 0.3 to 0.4, the shifts
in the graphs are identical.
TWO
MODELING COVID 19
Contents
• Modeling COVID 19
– Overview
– The SIR Model
– Implementation
– Experiments
– Ending Lockdown
2.1 Overview
This is a Python version of the code for analyzing the COVID-19 pandemic provided by Andrew Atkeson.
See, in particular
• NBER Working Paper No. 26867
• COVID-19 Working papers and code
The purpose of his notes is to introduce economists to quantitative modeling of infectious disease dynamics.
Dynamics are modeled using a standard SIR (Susceptible-Infected-Removed) model of disease spread.
The model dynamics are represented by a system of ordinary differential equations.
The main objective is to study the impact of suppression through social distancing on the spread of the infection.
The focus is on US outcomes but the parameters can be adjusted to study other countries.
We will use the following standard imports:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
import numpy as np
from numpy import exp
We will also use SciPy’s numerical routine odeint for solving differential equations.
This routine calls into compiled code from the FORTRAN library odepack.
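The import itself is not shown above; it is the standard SciPy call:

from scipy.integrate import odeint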
2.2 The SIR Model

In the version of the SIR model we will analyze there are four states.
All individuals in the population are assumed to be in one of these four states.
The states are: susceptible (S), exposed (E), infected (I) and removed (R).
Comments:
• Those in state R have been infected and either recovered or died.
• Those who have recovered are assumed to have acquired immunity.
• Those in the exposed group are not yet infectious.
$$\begin{aligned}
\dot s(t) &= -\beta(t) \, s(t) \, i(t) \\
\dot e(t) &= \beta(t) \, s(t) \, i(t) - \sigma e(t) \\
\dot i(t) &= \sigma e(t) - \gamma i(t)
\end{aligned} \tag{2.1}$$
In these equations,
• 𝛽(𝑡) is called the transmission rate (the rate at which individuals bump into others and expose them to the virus).
• 𝜎 is called the infection rate (the rate at which those who are exposed become infected)
• 𝛾 is called the recovery rate (the rate at which infected people recover or die).
• the dot symbol 𝑦 ̇ represents the time derivative 𝑑𝑦/𝑑𝑡.
We do not need to model the fraction 𝑟 of the population in state 𝑅 separately because the states form a partition.
In particular, the “removed” fraction of the population is 𝑟 = 1 − 𝑠 − 𝑒 − 𝑖.
We will also track 𝑐 = 𝑖 + 𝑟, which is the cumulative caseload (i.e., all those who have or have had the infection).
The system (2.1) can be written in vector form as

$$\dot x = F(x, t), \qquad x := (s, e, i) \tag{2.2}$$
2.2.2 Parameters
2.3 Implementation
pop_size = 3.3e8
γ = 1 / 18
σ = 1 / 5.2
"""
s, e, i = x
# Time derivatives
ds = - ne
de = ne - σ * e
di = σ * e - γ * i
# initial conditions of s, e, i
i_0 = 1e-7
e_0 = 4 * i_0
s_0 = 1 - i_0 - e_0
x_0 = s_0, e_0, i_0
We solve for the time path numerically using odeint, at a sequence of dates t_vec.
"""
G = lambda x, t: F(x, t, R0)
s_path, e_path, i_path = odeint(G, x_init, t_vec).transpose()
2.4 Experiments
t_length = 550
grid_size = 1000
t_vec = np.linspace(0, t_length, grid_size)
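# The grid of R0 values, the plot labels and the path containers are not shown
# in this excerpt; the particular values below are an assumption.
R0_vals = np.linspace(1.6, 3.0, 6)
labels = [f'$R0 = {r:.2f}$' for r in R0_vals]
i_paths, c_paths = [], []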
for r in R0_vals:
i_path, c_path = solve_path(r, t_vec)
i_paths.append(i_path)
c_paths.append(c_path)
def plot_paths(paths, labels, times=t_vec):
    fig, ax = plt.subplots()
    for path, label in zip(paths, labels):
        ax.plot(times, path, label=label)
    ax.legend(loc='upper left')
    plt.show()
plot_paths(i_paths, labels)
plot_paths(c_paths, labels)
Let’s look at a scenario where mitigation (e.g., social distancing) is successively imposed.
Here’s a specification for R0 as a function of time.
This is what the time path of R0 looks like at these alternative rates:
fig, ax = plt.subplots()
for η, label in zip(η_vals, labels):
    ax.plot(t_vec, R0_mitigating(t_vec, η=η), label=label)
ax.legend()
plt.show()
i_paths, c_paths = [], []

for η in η_vals:
    R0 = lambda t: R0_mitigating(t, η=η)
    i_path, c_path = solve_path(R0, t_vec)
    i_paths.append(i_path)
    c_paths.append(c_path)
plot_paths(i_paths, labels)
plot_paths(c_paths, labels)
2.5 Ending Lockdown

The following replicates additional results by Andrew Atkeson on the timing of lifting lockdown.
Consider these two mitigation scenarios:
1. 𝑅𝑡 = 0.5 for 30 days and then 𝑅𝑡 = 2 for the remaining 17 months. This corresponds to lifting lockdown in 30
days.
2. 𝑅𝑡 = 0.5 for 120 days and then 𝑅𝑡 = 2 for the remaining 14 months. This corresponds to lifting lockdown in 4
months.
The parameters considered here start the model with 25,000 active infections and 75,000 agents already exposed to the
virus and thus soon to be contagious.
# initial conditions
i_0 = 25_000 / pop_size
e_0 = 75_000 / pop_size
s_0 = 1 - i_0 - e_0
x_0 = s_0, e_0, i_0
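The two R0 paths are not listed in the excerpt, but they follow directly from the scenarios described above (a lockdown value of 0.5 lifted after 30 or 120 days):

R0_paths = (lambda t: 0.5 if t < 30 else 2,
            lambda t: 0.5 if t < 120 else 2)

labels = ['Lift lockdown in 30 days', 'Lift lockdown in 120 days']
i_paths, c_paths = [], []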
for R0 in R0_paths:
i_path, c_path = solve_path(R0, t_vec, x_init=x_0)
i_paths.append(i_path)
c_paths.append(c_path)
plot_paths(i_paths, labels)
ν = 0.01
Pushing the peak of the curve further into the future may reduce cumulative deaths if a vaccine is found.
THREE
LINEAR ALGEBRA
Contents
• Linear Algebra
– Overview
– Vectors
– Matrices
– Solving Systems of Equations
– Eigenvalues and Eigenvectors
– Further Topics
– Exercises
– Solutions
3.1 Overview
Linear algebra is one of the most useful branches of applied mathematics for economists to invest in.
For example, many applied problems in economics and finance require the solution of a linear system of equations, such
as
$$\begin{aligned}
y_1 &= a x_1 + b x_2 \\
y_2 &= c x_1 + d x_2
\end{aligned}$$

or, more generally,

$$\begin{aligned}
y_1 &= a_{11} x_1 + a_{12} x_2 + \cdots + a_{1k} x_k \\
&\vdots \\
y_n &= a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nk} x_k
\end{aligned}$$

The objective here is to solve for the "unknowns" $x_1, \ldots, x_k$ given $a_{11}, \ldots, a_{nk}$ and $y_1, \ldots, y_n$.
When considering such problems, it is essential that we first consider at least some of the following questions
• Does a solution actually exist?
• Are there in fact many solutions, and if so how should we interpret them?
• If no solution exists, is there a best “approximate” solution?
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
import numpy as np
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from scipy.interpolate import interp2d
from scipy.linalg import inv, solve, det, eig
3.2 Vectors
A vector of length 𝑛 is just a sequence (or array, or tuple) of 𝑛 numbers, which we write as 𝑥 = (𝑥1 , … , 𝑥𝑛 ) or 𝑥 =
[𝑥1 , … , 𝑥𝑛 ].
We will write these sequences either horizontally or vertically as we please.
(Later, when we wish to perform certain matrix operations, it will become necessary to distinguish between the two)
The set of all 𝑛-vectors is denoted by ℝ𝑛 .
For example, ℝ2 is the plane, and a vector in ℝ2 is just a point in the plane.
Traditionally, vectors are represented visually as arrows from the origin to the point.
The following figure represents three vectors in this manner
The two most common operators for vectors are addition and scalar multiplication, which we now describe.
As a matter of definition, when we add two vectors, we add them element-by-element
$$x + y =
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} +
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} :=
\begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix}$$

Scalar multiplication is an operation that takes a number $\gamma$ and a vector $x$ and produces

$$\gamma x :=
\begin{bmatrix} \gamma x_1 \\ \gamma x_2 \\ \vdots \\ \gamma x_n \end{bmatrix}$$
fig, ax = plt.subplots(figsize=(10, 8))
# Set axis limits so that both scalar multiples are visible
ax.set(xlim=(-5, 5), ylim=(-5, 5))

x = (2, 2)          # the vector to be scaled (an illustrative choice)
scalars = (-2, 2)
x = np.array(x)
v = s * x
ax.annotate('', xy=v, xytext=(0, 0),
arrowprops=dict(facecolor='red',
shrink=0,
alpha=0.5,
width=0.5))
ax.text(v[0] + 0.4, v[1] - 0.2, f'${s} x$', fontsize='16')
plt.show()
In Python, a vector can be represented as a list or tuple, such as x = (2, 4, 6), but is more commonly represented
as a NumPy array.
One advantage of NumPy arrays is that scalar multiplication and addition have very natural syntax

x = np.ones(3)            # a vector of three ones
y = np.array((2, 4, 6))   # converts the tuple (2, 4, 6) into a NumPy array

4 * x

array([4., 4., 4.])

The inner product of the two vectors can be computed with np.sum(x * y) or, equivalently, with the @ operator; for the vectors above both give

12.0

and the (Euclidean) norm of x can be computed either as np.sqrt(x @ x) or with np.linalg.norm(x); both return

1.7320508075688772
3.2.3 Span
Given a set of vectors 𝐴 ∶= {𝑎1 , … , 𝑎𝑘 } in ℝ𝑛 , it’s natural to think about the new vectors we can create by performing
linear operations.
New vectors created in this manner are called linear combinations of 𝐴.
In particular, 𝑦 ∈ ℝ𝑛 is a linear combination of 𝐴 ∶= {𝑎1 , … , 𝑎𝑘 } if
In this context, the values 𝛽1 , … , 𝛽𝑘 are called the coefficients of the linear combination.
The set of linear combinations of 𝐴 is called the span of 𝐴.
The next figure shows the span of 𝐴 = {𝑎1 , 𝑎2 } in ℝ3 .
The span is a two-dimensional plane passing through these two points and the origin.
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(projection='3d')

x_min, x_max = -5, 5
x_coords, y_coords = (3, 3), (4, -4)      # the two vectors a1 and a2 (illustrative values)

α, β = 0.2, 0.1
f = lambda x, y: α * x + β * y            # the plane spanned by a1 and a2

# Axes through the origin
gs = 3
z = np.linspace(x_min, x_max, gs)
x = np.zeros(gs)
y = np.zeros(gs)
ax.plot(x, y, z, 'k-', lw=2, alpha=0.5)
ax.plot(z, x, y, 'k-', lw=2, alpha=0.5)
ax.plot(y, z, x, 'k-', lw=2, alpha=0.5)

# Lines to vectors
for i in (0, 1):
    x = (0, x_coords[i])
    y = (0, y_coords[i])
    z = (0, f(x_coords[i], y_coords[i]))
    ax.plot(x, y, z, 'b-', lw=1.5, alpha=0.6)

# Draw the plane spanned by the two vectors
grid = np.linspace(x_min, x_max, 20)
xg, yg = np.meshgrid(grid, grid)
ax.plot_surface(xg, yg, f(xg, yg), alpha=0.2, linewidth=0)
plt.show()
Examples
If 𝐴 contains only one vector 𝑎1 ∈ ℝ2 , then its span is just the scalar multiples of 𝑎1 , which is the unique line passing
through both 𝑎1 and the origin.
If 𝐴 = {𝑒1 , 𝑒2 , 𝑒3 } consists of the canonical basis vectors of ℝ3 , that is
$$e_1 := \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad
e_2 := \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad
e_3 := \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$
then the span of 𝐴 is all of ℝ3 , because, for any 𝑥 = (𝑥1 , 𝑥2 , 𝑥3 ) ∈ ℝ3 , we can write
𝑥 = 𝑥1 𝑒1 + 𝑥2 𝑒2 + 𝑥3 𝑒3
As we’ll see, it’s often desirable to find families of vectors with relatively large span, so that many vectors can be described
by linear operators on a few vectors.
The condition we need for a set of vectors to have a large span is what’s called linear independence.
In particular, a collection of vectors 𝐴 ∶= {𝑎1 , … , 𝑎𝑘 } in ℝ𝑛 is said to be
• linearly dependent if some strict subset of 𝐴 has the same span as 𝐴.
• linearly independent if it is not linearly dependent.
Put differently, a set of vectors is linearly independent if no vector is redundant to the span and linearly dependent
otherwise.
To illustrate the idea, recall the figure that showed the span of vectors {𝑎1 , 𝑎2 } in ℝ3 as a plane through the origin.
If we take a third vector 𝑎3 and form the set {𝑎1 , 𝑎2 , 𝑎3 }, this set will be
• linearly dependent if 𝑎3 lies in the plane
• linearly independent otherwise
As another illustration of the concept, since ℝ𝑛 can be spanned by 𝑛 vectors (see the discussion of canonical basis vectors
above), any collection of 𝑚 > 𝑛 vectors in ℝ𝑛 must be linearly dependent.
The following statements are equivalent to linear independence of 𝐴 ∶= {𝑎1 , … , 𝑎𝑘 } ⊂ ℝ𝑛
1. No vector in 𝐴 can be formed as a linear combination of the other elements.
2. If 𝛽1 𝑎1 + ⋯ 𝛽𝑘 𝑎𝑘 = 0 for scalars 𝛽1 , … , 𝛽𝑘 , then 𝛽1 = ⋯ = 𝛽𝑘 = 0.
(The zero in the first expression is the origin of ℝ𝑛 )
Another nice thing about sets of linearly independent vectors is that each element in the span has a unique representation
as a linear combination of these vectors.
In other words, if $A := \{a_1, \ldots, a_k\} \subset \mathbb{R}^n$ is linearly independent and

$$y = \beta_1 a_1 + \cdots + \beta_k a_k,$$

then no other list of coefficients will produce the same vector $y$.
3.3 Matrices
Matrices are a neat way of organizing data for use in linear operations.
An $n \times k$ matrix is a rectangular array $A$ of numbers with $n$ rows and $k$ columns:

$$A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1k} \\
a_{21} & a_{22} & \cdots & a_{2k} \\
\vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nk}
\end{bmatrix}$$
Often, the numbers in the matrix represent coefficients in a system of linear equations, as discussed at the start of this
lecture.
For obvious reasons, the matrix 𝐴 is also called a vector if either 𝑛 = 1 or 𝑘 = 1.
In the former case, 𝐴 is called a row vector, while in the latter it is called a column vector.
If 𝑛 = 𝑘, then 𝐴 is called square.
The matrix formed by replacing 𝑎𝑖𝑗 by 𝑎𝑗𝑖 for every 𝑖 and 𝑗 is called the transpose of 𝐴 and denoted 𝐴′ or 𝐴⊤ .
If 𝐴 = 𝐴′ , then 𝐴 is called symmetric.
For a square matrix 𝐴, the 𝑛 elements of the form 𝑎𝑖𝑖 for 𝑖 = 1, … , 𝑛 are called the principal diagonal.
𝐴 is called diagonal if the only nonzero entries are on the principal diagonal.
If, in addition to being diagonal, each element along the principal diagonal is equal to 1, then 𝐴 is called the identity matrix
and denoted by 𝐼.
Just as was the case for vectors, a number of algebraic operations are defined for matrices.
Scalar multiplication and addition are immediate generalizations of the vector case:

$$\gamma A = \gamma \begin{bmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nk} \end{bmatrix} :=
\begin{bmatrix} \gamma a_{11} & \cdots & \gamma a_{1k} \\ \vdots & & \vdots \\ \gamma a_{n1} & \cdots & \gamma a_{nk} \end{bmatrix}$$

and

$$A + B = \begin{bmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nk} \end{bmatrix} +
\begin{bmatrix} b_{11} & \cdots & b_{1k} \\ \vdots & & \vdots \\ b_{n1} & \cdots & b_{nk} \end{bmatrix} :=
\begin{bmatrix} a_{11} + b_{11} & \cdots & a_{1k} + b_{1k} \\ \vdots & & \vdots \\ a_{n1} + b_{n1} & \cdots & a_{nk} + b_{nk} \end{bmatrix}$$
In the latter case, the matrices must have the same shape in order for the definition to make sense.
We also have a convention for multiplying two matrices.
The rule for matrix multiplication generalizes the idea of inner products discussed above and is designed to make multi-
plication play well with basic linear operations.
If 𝐴 and 𝐵 are two matrices, then their product 𝐴𝐵 is formed by taking as its 𝑖, 𝑗-th element the inner product of the 𝑖-th
row of 𝐴 and the 𝑗-th column of 𝐵.
There are many tutorials to help you visualize this operation, such as this one, or the discussion on the Wikipedia page.
If 𝐴 is 𝑛 × 𝑘 and 𝐵 is 𝑗 × 𝑚, then to multiply 𝐴 and 𝐵 we require 𝑘 = 𝑗, and the resulting matrix 𝐴𝐵 is 𝑛 × 𝑚.
As perhaps the most important special case, consider multiplying an $n \times k$ matrix $A$ and a $k \times 1$ column vector $x$; by the rule above, the product $Ax$ is an $n \times 1$ column vector.
NumPy arrays are also used as matrices, and have fast, efficient functions and methods for all the standard matrix operations1.
You can create them manually from tuples of tuples (or lists of lists) as follows
A = ((1, 2),
(3, 4))
type(A)
tuple
A = np.array(A)
type(A)
numpy.ndarray
A.shape
(2, 2)
The shape attribute is a tuple giving the number of rows and columns — see here for more discussion.
To get the transpose of A, use A.transpose() or, more simply, A.T.
There are many convenient functions for creating common matrices (matrices of zeros, ones, etc.) — see here.
Since operations are performed elementwise by default, scalar multiplication and addition have very natural syntax
A = np.identity(3)
B = np.ones((3, 3))
2 * A
1 Although there is a specialized matrix data type defined in NumPy, it’s more standard to work with ordinary NumPy arrays. See this discussion.
A + B
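The rule for matrix multiplication described above is implemented in NumPy by the @ operator (or np.dot); note that * would instead multiply elementwise:

A @ B          # matrix product: the 3 x 3 identity times a matrix of ones returns B

x = np.ones(3)
A @ x          # matrix-vector product, an array of ones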
Each 𝑛 × 𝑘 matrix 𝐴 can be identified with a function 𝑓(𝑥) = 𝐴𝑥 that maps 𝑥 ∈ ℝ𝑘 into 𝑦 = 𝐴𝑥 ∈ ℝ𝑛 .
These kinds of functions have a special property: they are linear.
A function 𝑓 ∶ ℝ𝑘 → ℝ𝑛 is called linear if, for all 𝑥, 𝑦 ∈ ℝ𝑘 and all scalars 𝛼, 𝛽, we have
You can check that this holds for the function 𝑓(𝑥) = 𝐴𝑥 + 𝑏 when 𝑏 is the zero vector and fails when 𝑏 is nonzero.
In fact, it’s known that 𝑓 is linear if and only if there exists a matrix 𝐴 such that 𝑓(𝑥) = 𝐴𝑥 for all 𝑥.
𝑦 = 𝐴𝑥 (3.3)
The problem we face is to determine a vector 𝑥 ∈ ℝ𝑘 that solves (3.3), taking 𝑦 and 𝐴 as given.
This is a special case of a more general problem: Find an 𝑥 such that 𝑦 = 𝑓(𝑥).
Given an arbitrary function 𝑓 and a 𝑦, is there always an 𝑥 such that 𝑦 = 𝑓(𝑥)?
If so, is it always unique?
The answer to both these questions is negative, as the next figure shows
def f(x):
return 0.6 * np.cos(4 * x) + 1.4
for ax in axes:
# Set the axes through the origin
for spine in ['left', 'bottom']:
ax.spines[spine].set_position('zero')
for spine in ['right', 'top']:
ax.spines[spine].set_color('none')
ax = axes[0]
ax = axes[1]
ybar = 2.6
ax.plot(x, x * 0 + ybar, 'k--', alpha=0.5)
ax.text(0.04, 0.91 * ybar, '$y$', fontsize=16)
plt.show()
In the first plot, there are multiple solutions, as the function is not one-to-one, while in the second there are no solutions,
since 𝑦 lies outside the range of 𝑓.
Can we impose conditions on 𝐴 in (3.3) that rule out these problems?
In this context, the most important thing to recognize about the expression 𝐴𝑥 is that it corresponds to a linear combination
of the columns of 𝐴.
In particular, if 𝑎1 , … , 𝑎𝑘 are the columns of 𝐴, then
𝐴𝑥 = 𝑥1 𝑎1 + ⋯ + 𝑥𝑘 𝑎𝑘
Indeed, it follows from our earlier discussion that if {𝑎1 , … , 𝑎𝑘 } are linearly independent and 𝑦 = 𝐴𝑥 = 𝑥1 𝑎1 +⋯+𝑥𝑘 𝑎𝑘 ,
then no 𝑧 ≠ 𝑥 satisfies 𝑦 = 𝐴𝑧.
Let’s discuss some more details, starting with the case where 𝐴 is 𝑛 × 𝑛.
This is the familiar case where the number of unknowns equals the number of equations.
For arbitrary 𝑦 ∈ ℝ𝑛 , we hope to find a unique 𝑥 ∈ ℝ𝑛 such that 𝑦 = 𝐴𝑥.
In view of the observations immediately above, if the columns of 𝐴 are linearly independent, then their span, and hence
the range of 𝑓(𝑥) = 𝐴𝑥, is all of ℝ𝑛 .
Hence there always exists an 𝑥 such that 𝑦 = 𝐴𝑥.
Moreover, the solution is unique.
In particular, the following are equivalent
1. The columns of 𝐴 are linearly independent.
2. For any 𝑦 ∈ ℝ𝑛 , the equation 𝑦 = 𝐴𝑥 has a unique solution.
The property of having linearly independent columns is sometimes expressed as having full column rank.
Inverse Matrices
If a square matrix 𝐴 has linearly independent columns (full column rank), then there exists a unique matrix 𝐴−1, called the inverse of 𝐴, and the unique solution of 𝑦 = 𝐴𝑥 is 𝑥 = 𝐴−1𝑦.
Determinants
Another quick comment about square matrices is that to every such matrix we assign a unique number called the deter-
minant of the matrix — you can find the expression for it here.
If the determinant of 𝐴 is not zero, then we say that 𝐴 is nonsingular.
Perhaps the most important fact about determinants is that 𝐴 is nonsingular if and only if 𝐴 is of full column rank.
This gives us a useful one-number summary of whether or not a square matrix can be inverted.
This is the 𝑛 × 𝑘 case with 𝑛 < 𝑘, so there are fewer equations than unknowns.
In this case there are either no solutions or infinitely many — in other words, uniqueness never holds.
For example, consider the case where 𝑘 = 3 and 𝑛 = 2.
Thus, the columns of 𝐴 consists of 3 vectors in ℝ2 .
This set can never be linearly independent, since it is possible to find two vectors that span ℝ2 .
(For example, use the canonical basis vectors)
It follows that one column is a linear combination of the other two.
For example, let’s say that 𝑎1 = 𝛼𝑎2 + 𝛽𝑎3 .
Then if 𝑦 = 𝐴𝑥 = 𝑥1 𝑎1 + 𝑥2 𝑎2 + 𝑥3 𝑎3 , we can also write
Here’s an illustration of how to solve linear equations with SciPy’s linalg submodule.
All of these routines are Python front ends to time-tested and highly optimized FORTRAN code
-2.0
array([[-2. , 1. ],
[ 1.5, -0.5]])
x = A_inv @ y # Solution
A @ x # Should equal y
array([[1.],
[1.]])
array([[-1.],
[ 1.]])
Observe how we can solve for 𝑥 = 𝐴−1 𝑦 by either via inv(A) @ y, or using solve(A, y).
The latter method uses a different algorithm (LU decomposition) that is numerically more stable, and hence should almost
always be preferred.
To obtain the least-squares solution 𝑥̂ = (𝐴′ 𝐴)−1 𝐴′ 𝑦, use scipy.linalg.lstsq(A, y).
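A small worked example of the least-squares routine, with an illustrative overdetermined system:

from scipy.linalg import lstsq

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # 3 equations, 2 unknowns
y = np.array([1.0, 2.0, 2.0])
x_hat, residues, rank, singular_values = lstsq(A, y)
x_hat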
𝐴𝑣 = 𝜆𝑣
A = ((1, 2),
(2, 1))
A = np.array(A)
evals, evecs = eig(A)
evecs = evecs[:, 0], evecs[:, 1]
plt.show()
The eigenvalue equation is equivalent to (𝐴 − 𝜆𝐼)𝑣 = 0, and this has a nonzero solution 𝑣 only when the columns of
𝐴 − 𝜆𝐼 are linearly dependent.
This in turn is equivalent to stating that the determinant is zero.
Hence to find all eigenvalues, we can look for 𝜆 such that the determinant of 𝐴 − 𝜆𝐼 is zero.
This problem can be expressed as one of solving for the roots of a polynomial in 𝜆 of degree 𝑛.
This in turn implies the existence of 𝑛 solutions in the complex plane, although some might be repeated.
Some nice facts about the eigenvalues of a square matrix 𝐴 are as follows
1. The determinant of 𝐴 equals the product of the eigenvalues.
2. The trace of 𝐴 (the sum of the elements on the principal diagonal) equals the sum of the eigenvalues.
3. If 𝐴 is symmetric, then all of its eigenvalues are real.
4. If 𝐴 is invertible and 𝜆1 , … , 𝜆𝑛 are its eigenvalues, then the eigenvalues of 𝐴−1 are 1/𝜆1 , … , 1/𝜆𝑛 .
A corollary of the first statement is that a matrix is invertible if and only if all its eigenvalues are nonzero.
Using SciPy, we can solve for the eigenvalues and eigenvectors of a matrix as follows
A = ((1, 2),
     (2, 1))
A = np.array(A)
evals, evecs = eig(A)
evals

evecs
It is sometimes useful to consider the generalized eigenvalue problem, which, for given matrices 𝐴 and 𝐵, seeks generalized
eigenvalues 𝜆 and eigenvectors 𝑣 such that
𝐴𝑣 = 𝜆𝐵𝑣
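SciPy's eig handles this case as well: passing a second matrix solves the generalized problem. The matrices below are illustrative:

A = np.array([[2.0, 0.0], [0.0, 3.0]])
B = np.array([[1.0, 0.0], [0.0, 2.0]])
evals, evecs = eig(A, B)    # solves A v = λ B v
evals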
We round out our discussion by briefly mentioning several other important topics.
Matrix Norms

The matrix norm we use is the one induced by the vector norm, defined by

$$\| A \| := \max_{\| x \| = 1} \| A x \|$$

The norms on the right-hand side are ordinary vector norms, while the norm on the left-hand side is a matrix norm — in this case, the so-called spectral norm.
For example, for a square matrix 𝑆, the condition ‖𝑆‖ < 1 means that 𝑆 is contractive, in the sense that it pulls all vectors
towards the origin2 .
2 Suppose that ‖𝑆‖ < 1. Take any nonzero vector 𝑥, and let 𝑟 ∶= ‖𝑥‖. We have ‖𝑆𝑥‖ = 𝑟‖𝑆(𝑥/𝑟)‖ ≤ 𝑟‖𝑆‖ < 𝑟 = ‖𝑥‖. Hence every point is
pulled towards the origin.
Neumann's Theorem

If $\| A^k \| < 1$ for some $k \in \mathbb{N}$, then $I - A$ is invertible and Neumann's theorem tells us that

$$(I - A)^{-1} = \sum_{k=0}^{\infty} A^k \tag{3.4}$$

Spectral Radius

A result known as Gelfand's formula tells us that, for any square matrix $A$,

$$\rho(A) = \lim_{k \to \infty} \| A^k \|^{1/k}$$
Here 𝜌(𝐴) is the spectral radius, defined as max𝑖 |𝜆𝑖 |, where {𝜆𝑖 }𝑖 is the set of eigenvalues of 𝐴.
As a consequence of Gelfand’s formula, if all eigenvalues are strictly less than one in modulus, there exists a 𝑘 with
‖𝐴𝑘 ‖ < 1.
In which case (3.4) is valid.
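A quick numerical illustration of the Neumann series, with an illustrative matrix whose spectral radius is below one:

A = np.array([[0.5, 0.2], [0.1, 0.4]])
rho = max(abs(np.linalg.eigvals(A)))
print(rho)                                   # spectral radius, below 1

# (I - A)^{-1} should equal I + A + A^2 + ...
series = sum(np.linalg.matrix_power(A, k) for k in range(50))
print(np.max(np.abs(series - np.linalg.inv(np.eye(2) - A))))   # approximately 0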
5. $\frac{\partial y' B z}{\partial B} = y z'$

Exercise 3.7.1 below asks you to apply these formulas.
3.7 Exercises
Exercise 3.7.1
Let 𝑥 be a given 𝑛 × 1 vector and consider the problem
$$v(x) = \max_{y, u} \{ -y' P y - u' Q u \}$$

subject to the constraint

$$y = A x + B u$$
Here
• 𝑃 is an 𝑛 × 𝑛 matrix and 𝑄 is an 𝑚 × 𝑚 matrix
• 𝐴 is an 𝑛 × 𝑛 matrix and 𝐵 is an 𝑛 × 𝑚 matrix
• both 𝑃 and 𝑄 are symmetric and positive semidefinite
(What must the dimensions of 𝑦 and 𝑢 be to make this a well-posed problem?)
One way to solve the problem is to form the Lagrangian
ℒ = −𝑦′ 𝑃 𝑦 − 𝑢′ 𝑄𝑢 + 𝜆′ [𝐴𝑥 + 𝐵𝑢 − 𝑦]
Note: If we don’t care about the Lagrange multipliers, we can substitute the constraint into the objective function, and
then just maximize −(𝐴𝑥 + 𝐵𝑢)′ 𝑃 (𝐴𝑥 + 𝐵𝑢) − 𝑢′ 𝑄𝑢 with respect to 𝑢. You can verify that this leads to the same
maximizer.
3.8 Solutions

We have the optimization problem

$$v(x) = \max_{y, u} \{ -y' P y - u' Q u \}$$

s.t.

$$y = A x + B u$$
with primitives
• 𝑃 be a symmetric and positive semidefinite 𝑛 × 𝑛 matrix
• 𝑄 be a symmetric and positive semidefinite 𝑚 × 𝑚 matrix
• 𝐴 an 𝑛 × 𝑛 matrix
• 𝐵 an 𝑛 × 𝑚 matrix
The associated Lagrangian is:
𝐿 = −𝑦′ 𝑃 𝑦 − 𝑢′ 𝑄𝑢 + 𝜆′ [𝐴𝑥 + 𝐵𝑢 − 𝑦]
Step 1.
Differentiating Lagrangian equation w.r.t y and setting its derivative equal to zero yields
$$\frac{\partial L}{\partial y} = -(P + P') y - \lambda = -2 P y - \lambda = 0,$$
since P is symmetric.
Accordingly, the first-order condition for maximizing L w.r.t. y implies
𝜆 = −2𝑃 𝑦
Step 2.
Differentiating Lagrangian equation w.r.t. u and setting its derivative equal to zero yields
$$\frac{\partial L}{\partial u} = -(Q + Q') u + B' \lambda = -2 Q u + B' \lambda = 0$$
Substituting 𝜆 = −2𝑃 𝑦 gives
𝑄𝑢 + 𝐵′ 𝑃 𝑦 = 0
𝑄𝑢 + 𝐵′ 𝑃 (𝐴𝑥 + 𝐵𝑢) = 0
(𝑄 + 𝐵′ 𝑃 𝐵)𝑢 + 𝐵′ 𝑃 𝐴𝑥 = 0
which is the first-order condition for maximizing 𝐿 w.r.t. 𝑢.
Thus, the optimal choice of u must satisfy
𝑢 = −(𝑄 + 𝐵′ 𝑃 𝐵)−1 𝐵′ 𝑃 𝐴𝑥 ,
which follows from the definition of the first-order conditions for Lagrangian equation.
Step 3.
Rewriting our problem by substituting the constraint into the objective function, we get

$$v(x) = \max_{u} \{ -(Ax + Bu)' P (Ax + Bu) - u' Q u \}
      = \max_{u} \{ -x'A'PAx - 2u'B'PAx - u'(B'PB + Q)u \}$$
Since we know the optimal choice of $u$ satisfies $u = -(Q + B'PB)^{-1} B'PA x$, i.e., $u = Sx$ with $S := -(Q + B'PB)^{-1} B'PA$, the second term becomes

$$\begin{aligned}
-2u'B'PAx &= -2x'S'B'PAx \\
&= 2x'A'PB(Q + B'PB)^{-1} B'PAx
\end{aligned}$$
Notice that the term (𝑄 + 𝐵′ 𝑃 𝐵)−1 is symmetric as both P and Q are symmetric.
Regarding the third term $-u'(Q + B'PB)u$, substituting $u = Sx$ gives

$$-u'(Q + B'PB)u = -x'A'PB(Q + B'PB)^{-1} B'PAx$$

Hence, the summation of the second and third terms is $x'A'PB(Q + B'PB)^{-1} B'PAx$.
This implies that

$$v(x) = -x'A'PAx + x'A'PB(Q + B'PB)^{-1} B'PAx
       = -x' \left[ A'PA - A'PB(Q + B'PB)^{-1} B'PA \right] x$$

Therefore, the solution to the optimization problem is $v(x) = -x' \tilde{P} x$, where we denote $\tilde{P} := A'PA - A'PB(Q + B'PB)^{-1} B'PA$.
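As a sanity check, the closed-form value $-x'\tilde{P}x$ can be compared with the objective evaluated at the optimal $u$; all matrices below are randomly generated for illustration:

n, m = 3, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
C = rng.standard_normal((n, n)); P = C.T @ C   # symmetric positive semidefinite
D = rng.standard_normal((m, m)); Q = D.T @ D
x = rng.standard_normal(n)

u = -np.linalg.solve(Q + B.T @ P @ B, B.T @ P @ A @ x)      # optimal u
y = A @ x + B @ u
v_direct = -y @ P @ y - u @ Q @ u

P_tilde = A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(Q + B.T @ P @ B, B.T @ P @ A)
v_formula = -x @ P_tilde @ x

print(v_direct, v_formula)   # the two numbers agree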
FOUR
QR DECOMPOSITION
4.1 Overview
The QR decomposition (also called the QR factorization) of a matrix is a decomposition of a matrix into the product of
an orthogonal matrix and a triangular matrix.
A QR decomposition of a real matrix 𝐴 takes the form
𝐴 = 𝑄𝑅
where
• 𝑄 is an orthogonal matrix (so that 𝑄𝑇 𝑄 = 𝐼)
• 𝑅 is an upper triangular matrix
We’ll use a Gram-Schmidt process to compute a QR decomposition
Because doing so is so educational, we’ll write our own Python code to do the job
𝐴 = [ 𝑎1 𝑎2 ⋯ 𝑎𝑛 ]
Here (𝑎𝑗 ⋅ 𝑒𝑖 ) can be interpreted as the linear least squares regression coefficient of 𝑎𝑗 on 𝑒𝑖
• it is the inner product of 𝑎𝑗 and 𝑒𝑖 divided by the inner product of 𝑒𝑖 with itself, which equals 1 because the 𝑒𝑖 have been normalized.
• this regression coefficient has an interpretation as being a covariance divided by a variance
It can be verified that

$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} =
\begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}
\begin{bmatrix}
a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\
0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_n \cdot e_n
\end{bmatrix}$$

Thus $A = QR$, where

$$Q = \begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}$$

and

$$R = \begin{bmatrix}
a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\
0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_n \cdot e_n
\end{bmatrix}$$
When $A$ is $n \times m$ with $m > n$, the same construction delivers only $n$ orthonormal vectors $e_1, \ldots, e_n$, and

$$A = \begin{bmatrix} a_1 & a_2 & \cdots & a_m \end{bmatrix} =
\begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix}
\begin{bmatrix}
a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 & a_{n+1} \cdot e_1 & \cdots & a_m \cdot e_1 \\
0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 & a_{n+1} \cdot e_2 & \cdots & a_m \cdot e_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_n \cdot e_n & a_{n+1} \cdot e_n & \cdots & a_m \cdot e_n
\end{bmatrix}$$

which decomposes the columns as

$$\begin{aligned}
a_1 &= (a_1 \cdot e_1) e_1 \\
a_2 &= (a_2 \cdot e_1) e_1 + (a_2 \cdot e_2) e_2 \\
&\vdots \\
a_n &= (a_n \cdot e_1) e_1 + (a_n \cdot e_2) e_2 + \cdots + (a_n \cdot e_n) e_n \\
a_{n+1} &= (a_{n+1} \cdot e_1) e_1 + (a_{n+1} \cdot e_2) e_2 + \cdots + (a_{n+1} \cdot e_n) e_n \\
&\vdots \\
a_m &= (a_m \cdot e_1) e_1 + (a_m \cdot e_2) e_2 + \cdots + (a_m \cdot e_n) e_n
\end{aligned}$$
Now let’s write some homemade Python code to implement a QR decomposition by deploying the Gram-Schmidt process
described above.
import numpy as np
from scipy.linalg import qr
def QR_Decomposition(A):
    n, m = A.shape # get the shape of A

    # initialize empty matrices to hold the orthonormal basis and the u vectors
    Q = np.empty((n, n))
    u = np.empty((n, n))

    # the first column
    u[:, 0] = A[:, 0]
    Q[:, 0] = u[:, 0] / np.linalg.norm(u[:, 0])

    for i in range(1, n):
        u[:, i] = A[:, i]
        for j in range(i):
            u[:, i] -= (A[:, i] @ Q[:, j]) * Q[:, j] # get each u vector
        Q[:, i] = u[:, i] / np.linalg.norm(u[:, i])  # normalize to get each e vector

    # compute the upper-triangular R
    R = np.zeros((n, m))
    for i in range(n):
        for j in range(i, m):
            R[i, j] = A[:, j] @ Q[:, i]

    return Q, R
The preceding code is fine but can benefit from some further housekeeping.
We want to do this because later in this notebook we want to compare results from using our homemade code above with
the code for a QR that the Python scipy package delivers.
There can be sign differences between the 𝑄 and 𝑅 matrices produced by different numerical algorithms.
All of these are valid QR decompositions because of how the sign differences cancel out when we compute 𝑄𝑅.
However, to make the results from our homemade function and the QR module in scipy comparable, let’s require that
𝑄 have positive diagonal entries.
We do this by adjusting the signs of the columns in 𝑄 and the rows in 𝑅 appropriately.
To accomplish this we’ll define a pair of functions.
def diag_sign(A):
"Compute the signs of the diagonal of matrix A"
D = np.diag(np.sign(np.diag(A)))
return D
def adjust_sign(Q, R):
    "Adjust the signs of the columns in Q and rows in R so that Q has a positive diagonal"
    D = diag_sign(Q)

    Q[:, :] = Q @ D
    R[:, :] = D @ R

    return Q, R
4.5 Example

A = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])

Q, R = adjust_sign(*QR_Decomposition(A))
Q_scipy, R_scipy = adjust_sign(*qr(A))

print('Our Q: \n', Q)
print('\n')
print('Scipy Q: \n', Q_scipy)
Our Q:
[[ 0.70710678 -0.40824829 -0.57735027]
[ 0.70710678 0.40824829 0.57735027]
[ 0. -0.81649658 0.57735027]]
Scipy Q:
[[ 0.70710678 -0.40824829 -0.57735027]
[ 0.70710678 0.40824829 0.57735027]
[ 0. -0.81649658 0.57735027]]
print('Our R: \n', R)
print('\n')
print('Scipy R: \n', R_scipy)
Our R:
[[ 1.41421356 0.70710678 0.70710678]
[ 0. -1.22474487 -0.40824829]
[ 0. 0. 1.15470054]]
Scipy R:
[[ 1.41421356 0.70710678 0.70710678]
[ 0. -1.22474487 -0.40824829]
[ 0. 0. 1.15470054]]
The above outcomes give us the good news that our homemade function agrees with what scipy produces.
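As a further check, $Q$ should have orthonormal columns and the product $QR$ should reproduce $A$:

print(np.abs(Q @ R - A).max())                   # approximately 0: QR reproduces A
print(np.abs(Q.T @ Q - np.identity(3)).max())    # approximately 0: the columns of Q are orthonormal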
Now let’s do a QR decomposition for a rectangular matrix 𝐴 that is 𝑛 × 𝑚 with 𝑚 > 𝑛.
Q, R = adjust_sign(*QR_Decomposition(A))
Q, R
def QR_eigvals(A, tol=1e-12, maxiter=1000):
    "Find the eigenvalues of A using repeated QR decompositions."
    A_old = np.copy(A)
    A_new = np.copy(A)

    diff = np.inf
    i = 0
    while (diff > tol) and (i < maxiter):
        A_old[:, :] = A_new
        Q, R = QR_Decomposition(A_old)

        A_new[:, :] = R @ Q

        diff = np.abs(A_new - A_old).max()
        i += 1

    eigvals = np.diag(A_new)

    return eigvals
Now let’s try the code and compare the results with what scipy.linalg.eigvals gives us
Here goes
sorted(QR_eigvals(A))
sorted(np.linalg.eigvals(A))
There are interesting connections between the 𝑄𝑅 decomposition and principal components analysis (PCA).
Here are some.
1. Let 𝑋 ′ be a 𝑘 × 𝑛 random matrix where the 𝑗th column is a random draw from 𝒩(𝜇, Σ) where 𝜇 is 𝑘 × 1 vector
of means and Σ is a 𝑘 × 𝑘 covariance matrix. We want 𝑛 >> 𝑘 – this is an “econometrics example”.
2. Form 𝑋 ′ = 𝑄𝑅 where 𝑄 is 𝑘 × 𝑘 and 𝑅 is 𝑘 × 𝑛.
3. Form the eigenvalues of $RR'$, i.e., we'll compute $RR' = \tilde{P} \Lambda \tilde{P}'$.
4. Form $X'X = Q \tilde{P} \Lambda \tilde{P}' Q'$ and compare it with the eigen decomposition $X'X = P \hat{\Lambda} P'$.
5. It will turn out that $\Lambda = \hat{\Lambda}$ and that $P = Q \tilde{P}$.
Let’s verify conjecture 5 with some Python code.
Start by simulating a random $(n, k)$ matrix $X$.

k = 5
n = 1000

# draw some random first and second moments (the particular values are immaterial)
μ = np.random.random(k)
C = np.random.random((k, k))
Σ = C.T @ C                      # a positive semi-definite covariance matrix

# each row of X is a draw from N(μ, Σ)
X = np.random.multivariate_normal(μ, Σ, size=n)

X.shape
(1000, 5)
Q, R = adjust_sign(*QR_Decomposition(X.T))
Q.shape, R.shape
RR = R @ R.T

λ, P_tilde = np.linalg.eigh(RR)
Λ = np.diag(λ)

We can also apply the decomposition to $X'X = P \hat{\Lambda} P'$.

XX = X.T @ X

λ_hat, P = np.linalg.eigh(XX)
Λ_hat = np.diag(λ_hat)

λ, λ_hat
QP_tilde = Q @ P_tilde

np.abs(QP_tilde - P).max()

4.385380947269368e-15

QPΛPQ = Q @ P_tilde @ Λ @ P_tilde.T @ Q.T

np.abs(QPΛPQ - XX).max()

5.4569682106375694e-12
CHAPTER
FIVE
COMPLEX NUMBERS AND TRIGONOMETRY

Contents

5.1 Overview
A complex number $z = x + iy$ has modulus

$$r = |z| = \sqrt{x^2 + y^2}$$

The value $\theta$ is the angle of $(x, y)$ with respect to the real axis.
Evidently, the tangent of $\theta$ is $\frac{y}{x}$.
Therefore,

$$\theta = \tan^{-1}\left(\frac{y}{x}\right)$$

Three elementary trigonometric functions are

$$\cos\theta = \frac{x}{r} = \frac{e^{i\theta} + e^{-i\theta}}{2}, \quad
\sin\theta = \frac{y}{r} = \frac{e^{i\theta} - e^{-i\theta}}{2i}, \quad
\tan\theta = \frac{y}{x}$$
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
import numpy as np
from sympy import *
5.1.2 An Example

Consider the complex number $z = 1 + \sqrt{3} i$.
For $z = 1 + \sqrt{3} i$, $x = 1$, $y = \sqrt{3}$.
It follows that $r = 2$ and $\theta = \tan^{-1}(\sqrt{3}) = \frac{\pi}{3} = 60^\circ$.
Let's use Python to plot the trigonometric form of the complex number $z = 1 + \sqrt{3} i$.
# Set parameters
π = np.pi
r = 2
θ = π/3
x = r * np.cos(θ)
x_range = np.linspace(0, x, 1000)
θ_range = np.linspace(0, θ, 1000)

# Plot
fig = plt.figure(figsize=(8, 8))
ax = plt.subplot(111, projection='polar')

ax.plot((0, θ), (0, r), marker='o', color='b')            # the radius from the origin to z
ax.plot(np.zeros(x_range.shape), x_range, color='b')      # the horizontal side, of length x
ax.plot(θ_range, x / np.cos(θ_range), color='b')          # the vertical side, of length y
ax.plot(θ_range, np.full(θ_range.shape, 0.1), color='r')  # a small arc marking the angle θ

ax.set_rmax(2)
ax.set_rticks((0.5, 1, 1.5, 2))  # Less radial ticks
ax.set_rlabel_position(-88.5)    # Get radial labels away from plotted line

ax.grid(True)
plt.show()
5.3.1 Example 1
$$x^2 + y^2 = r^2$$
5.3.2 Example 2
𝑥𝑛 = 𝑎𝑧 𝑛 + 𝑎𝑧̄ 𝑛̄
= 𝑝𝑒𝑖𝜔 (𝑟𝑒𝑖𝜃 )𝑛 + 𝑝𝑒−𝑖𝜔 (𝑟𝑒−𝑖𝜃 )𝑛
= 𝑝𝑟𝑛 𝑒𝑖(𝜔+𝑛𝜃) + 𝑝𝑟𝑛 𝑒−𝑖(𝜔+𝑛𝜃)
= 𝑝𝑟𝑛 [cos (𝜔 + 𝑛𝜃) + 𝑖 sin (𝜔 + 𝑛𝜃) + cos (𝜔 + 𝑛𝜃) − 𝑖 sin (𝜔 + 𝑛𝜃)]
= 2𝑝𝑟𝑛 cos (𝜔 + 𝑛𝜃)
5.3.3 Example 3

This example provides machinery that is at the heart of Samuelson's analysis of his multiplier-accelerator model [Sam39].
Thus, consider a second-order linear difference equation

$$x_{n+2} = c_1 x_{n+1} + c_2 x_n$$

whose characteristic polynomial

$$z^2 - c_1 z - c_2 = 0$$

or

$$(z^2 - c_1 z - c_2) = (z - z_1)(z - z_2) = 0$$

has roots $z_1$, $z_2$.
A solution is a sequence $\{x_n\}_{n=0}^{\infty}$ that satisfies the difference equation.
Under the following circumstances, we can apply our example 2 formula to solve the difference equation
• the roots 𝑧1 , 𝑧2 of the characteristic polynomial of the difference equation form a complex conjugate pair
• the values 𝑥0 , 𝑥1 are given initial conditions
To solve the difference equation, recall from example 2 that

$$x_n = 2 p r^n \cos(\omega + n\theta)$$

where $\omega$, $p$ are coefficients to be determined from information encoded in the initial conditions $x_1, x_0$.
Since $x_0 = 2p\cos\omega$ and $x_1 = 2pr\cos(\omega + \theta)$, the ratio of $x_1$ to $x_0$ is

$$\frac{x_1}{x_0} = \frac{r \cos(\omega + \theta)}{\cos\omega}$$

We can solve this equation for $\omega$, then solve for $p$ using $x_0 = 2p r^0 \cos(\omega + 0\cdot\theta) = 2p\cos\omega$.
With the sympy package in Python, we are able to solve and plot the dynamics of 𝑥𝑛 given different values of 𝑛.
In this example, we set the initial values: $r = 0.9$, $\theta = \frac{1}{4}\pi$, $x_0 = 4$, $x_1 = r \cdot 2\sqrt{2} = 1.8\sqrt{2}$.
We first numerically solve for 𝜔 and 𝑝 using nsolve in the sympy package based on the above initial condition:
# Set parameters
π = np.pi
r = 0.9
θ = π/4
x0 = 4
x1 = 2 * r * sqrt(2)

# Define the symbols to be solved for
ω, p = symbols('ω p', real=True)
# Solve for ω
## Note: we choose the solution near 0
eq1 = Eq(x1/x0 - r * cos(ω+θ) / cos(ω), 0)
ω = nsolve(eq1, ω, 0)
ω = float(ω)
print(f'ω = {ω:1.3f}')

# Solve for p
eq2 = Eq(x0 - 2 * p * cos(ω), 0)
p = nsolve(eq2, p, 0)
p = float(p)
print(f'p = {p:1.3f}')
ω = 0.000
p = 2.000
# Define range of n
max_n = 30
n = np.arange(0, max_n+1, 0.01)
# Define x_n
x = lambda n: 2 * p * r**n * np.cos(ω + n * θ)
# Plot
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(n, x(n))
ax.set(xlim=(0, max_n), ylim=(-5, 5), xlabel='$n$', ylabel='$x_n$')
ax.grid()
plt.show()
We can obtain a complete suite of trigonometric identities by appropriately manipulating polar forms of complex numbers.
We’ll get many of them by deducing implications of the equality
𝑒𝑖(𝜔+𝜃) = 𝑒𝑖𝜔 𝑒𝑖𝜃
For example, we’ll calculate identities for
cos (𝜔 + 𝜃) and sin (𝜔 + 𝜃).
Using the sine and cosine formulas presented at the beginning of this lecture, we have:
𝑒𝑖(𝜔+𝜃) + 𝑒−𝑖(𝜔+𝜃)
cos (𝜔 + 𝜃) =
2
𝑒𝑖(𝜔+𝜃) − 𝑒−𝑖(𝜔+𝜃)
sin (𝜔 + 𝜃) =
2𝑖
We can also obtain the trigonometric identities as follows:
cos (𝜔 + 𝜃) + 𝑖 sin (𝜔 + 𝜃) = 𝑒𝑖(𝜔+𝜃)
= 𝑒𝑖𝜔 𝑒𝑖𝜃
= (cos 𝜔 + 𝑖 sin 𝜔)(cos 𝜃 + 𝑖 sin 𝜃)
= (cos 𝜔 cos 𝜃 − sin 𝜔 sin 𝜃) + 𝑖(cos 𝜔 sin 𝜃 + sin 𝜔 cos 𝜃)
Since both real and imaginary parts of the above formula should be equal, we get:
cos (𝜔 + 𝜃) = cos 𝜔 cos 𝜃 − sin 𝜔 sin 𝜃
sin (𝜔 + 𝜃) = cos 𝜔 sin 𝜃 + sin 𝜔 cos 𝜃
The equations above are also known as the angle sum identities. We can verify the equations using the simplify
function in the sympy package:
# Define symbols
ω, θ = symbols('ω θ', real=True)
# Verify
print("cos(ω)cos(θ) - sin(ω)sin(θ) =",
simplify(cos(ω)*cos(θ) - sin(ω) * sin(θ)))
print("cos(ω)sin(θ) + sin(ω)cos(θ) =",
simplify(cos(ω)*sin(θ) + sin(ω) * cos(θ)))
We can also compute the trigonometric integrals using polar forms of complex numbers.
For example, we want to solve the following integral:
∫_{−𝜋}^{𝜋} cos(𝜔) sin(𝜔) 𝑑𝜔
We can verify the analytical as well as numerical results using integrate in the sympy package:
ω = Symbol('ω')
print('The analytical solution for integral of cos(ω)sin(ω) is:')
integrate(cos(ω) * sin(ω), ω)
5.3.6 Exercises
We invite the reader to verify analytically and with the sympy package the following two equalities:
∫_{−𝜋}^{𝜋} cos(𝜔)² 𝑑𝜔 = 𝜋

∫_{−𝜋}^{𝜋} sin(𝜔)² 𝑑𝜔 = 𝜋
SIX
CIRCULANT MATRICES
6.1 Overview
import numpy as np
from numba import njit
import matplotlib.pyplot as plt
%matplotlib inline
np.set_printoptions(precision=3, suppress=True)
[𝑐₀ 𝑐₁ 𝑐₂ 𝑐₃ 𝑐₄ ⋯ 𝑐_{𝑁−1}].

After setting entries in the first row, the remaining rows of a circulant matrix are determined as follows:

C = \begin{bmatrix}
c_0     & c_1     & c_2 & c_3 & c_4 & \cdots & c_{N-1} \\
c_{N-1} & c_0     & c_1 & c_2 & c_3 & \cdots & c_{N-2} \\
c_{N-2} & c_{N-1} & c_0 & c_1 & c_2 & \cdots & c_{N-3} \\
\vdots  & \vdots  & \vdots & \vdots & \vdots &        & \vdots \\
c_3     & c_4     & c_5 & c_6 & c_7 & \cdots & c_2 \\
c_2     & c_3     & c_4 & c_5 & c_6 & \cdots & c_1 \\
c_1     & c_2     & c_3 & c_4 & c_5 & \cdots & c_0
\end{bmatrix}    (6.1)

It is also possible to construct a circulant matrix by creating the transpose of the above matrix, in which case only the first column needs to be specified.
Let’s write some Python code to generate a circulant matrix.
77
Quantitative Economics with Python
@njit
def construct_cirlulant(row):

    N = row.size
    C = np.empty((N, N))

    for i in range(N):
        # row i is the first row rotated i places to the right
        C[i, i:] = row[:N-i]
        C[i, :i] = row[N-i:]

    return C
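As a quick illustration (the vector below is just an example, not taken from the lecture), each row of the output is the previous row rotated one place to the right:

c = np.array([1., 2., 3., 4.])
construct_cirlulant(c)
# array([[1., 2., 3., 4.],
#        [4., 1., 2., 3.],
#        [3., 4., 1., 2.],
#        [2., 3., 4., 1.]])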
𝑐 = [𝑐0 𝑐1 ⋯ 𝑐𝑁−1 ]
𝑎 = [𝑎0 𝑎1 ⋯ 𝑎𝑁−1 ]
𝑏 = 𝐶𝑇 𝑎
P = \begin{bmatrix}
0 & 1 & 0 & 0 & \cdots & 0 \\
0 & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & 0 & \cdots & 0
\end{bmatrix}    (6.3)
serves as a cyclic shift operator that, when applied to an 𝑁 × 1 vector ℎ, shifts entries in rows 2 through 𝑁 up one row
and shifts the entry in row 1 to row 𝑁 .
Eigenvalues of the cyclic shift permutation matrix 𝑃 defined in equation (6.3) can be computed by constructing
P - \lambda I = \begin{bmatrix}
-\lambda & 1 & 0 & 0 & \cdots & 0 \\
0 & -\lambda & 1 & 0 & \cdots & 0 \\
0 & 0 & -\lambda & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & 0 & \cdots & -\lambda
\end{bmatrix}

and solving det(𝑃 − 𝜆𝐼) = 0, whose solutions are the 𝑁 complex numbers 𝜆 satisfying 𝜆^𝑁 = 1.

Note also that 𝑃 is an orthogonal matrix:

𝑃𝑃′ = 𝐼
@njit
def construct_P(N):

    P = np.zeros((N, N))

    for i in range(N-1):
        P[i, i+1] = 1
    P[-1, 0] = 1

    return P
P4 = construct_P(4)
P4
λ, Q = np.linalg.eig(P4)

for i in range(4):
    print(f'λ{i} = {λ[i]:.1f} \nvec{i} = {Q[i, :]}\n')

λ0 = -1.0+0.0j
vec0 = [-0.5+0.j 0. +0.5j 0. -0.5j -0.5+0.j ]

λ1 = 0.0+1.0j
vec1 = [ 0.5+0.j -0.5+0.j -0.5-0.j -0.5+0.j]

λ2 = 0.0-1.0j
vec2 = [-0.5+0.j 0. -0.5j 0. +0.5j -0.5+0.j ]

λ3 = 1.0+0.0j
vec3 = [ 0.5+0.j 0.5-0.j 0.5+0.j -0.5+0.j]
In graphs below, we shall portray eigenvalues of a shift permutation matrix in the complex plane.
These eigenvalues are uniformly distributed along the unit circle.
They are the 𝑛 roots of unity, meaning they are the 𝑛 numbers 𝑧 that solve 𝑧 𝑛 = 1, where 𝑧 is a complex number.
In particular, the 𝑛 roots of unity are

𝑧 = exp(2𝜋𝑗𝑘/𝑁),   𝑘 = 0, … , 𝑁 − 1
where 𝑗 denotes the purely imaginary unit number.
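As a small numerical check of this claim (a sketch that is not part of the lecture's code; the value of 𝑁 is illustrative), we can generate the roots and confirm that each satisfies 𝑧^𝑁 = 1:

import numpy as np

N = 4
k = np.arange(N)
z = np.exp(2j * np.pi * k / N)    # the N roots of unity
print(np.allclose(z**N, 1))       # True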
# plot eigenvalues of the cyclic shift matrix P_N for several N
# (the grid of N values below is illustrative)
fig, ax = plt.subplots(2, 2, figsize=(10, 10))

for i, N in enumerate([3, 4, 6, 8]):
    row_i = i // 2
    col_i = i % 2

    P = construct_P(N)
    λ, Q = np.linalg.eig(P)

    for j in range(N):
        ax[row_i, col_i].scatter(λ[j].real, λ[j].imag, c='b')
    ax[row_i, col_i].set_title(f'N = {N}')

plt.show()
𝐶 = 𝑐0 𝐼 + 𝑐1 𝑃 + 𝑐2 𝑃 2 + ⋯ + 𝑐𝑁−1 𝑃 𝑁−1 .
def construct_F(N):
    # DFT matrix: F[j, k] = w**(j*k) with w = exp(-2πi/N)
    w = np.e ** (-complex(0, 2*np.pi/N))
    F = w ** np.outer(np.arange(N), np.arange(N))
    return F, w

F8, w = construct_F(8)

w
(0.7071067811865476-0.7071067811865475j)
F8
# normalize
Q8 = F8 / np.sqrt(8)
P8 = construct_P(8)
diff_arr
Next, we execute calculations to verify that the circulant matrix 𝐶 defined in equation (6.1) can be written as
𝐶 = 𝑐0 𝐼 + 𝑐1 𝑃 + ⋯ + 𝑐𝑛−1 𝑃 𝑛−1
c = np.random.random(8)
C8 = construct_cirlulant(c)
N = 8
C = np.zeros((N, N))
P = np.eye(N)
for i in range(N):
C += c[i] * P
P = P8 @ P
C8
Now let’s compute the difference between two circulant matrices that we have constructed in two different ways.
np.abs(C - C8).max()
0.0
The 𝑘th column of 𝑄₈, an eigenvector of 𝑃₈ associated with eigenvalue 𝑤^{𝑘}, is an eigenvector of 𝐶₈ associated with the eigenvalue ∑_{ℎ=0}^{7} 𝑐ₕ𝑤^{ℎ𝑘}.
_C8 = np.zeros(8, dtype=complex)

for j in range(8):
    for k in range(8):
        _C8[j] += c[k] * w ** (j * k)

_C8
# verify
for j in range(8):
diff = C8 @ Q8[:, j] - _C8[j] * Q8[:, j]
print(diff)
The Discrete Fourier Transform (DFT) allows us to represent a discrete time sequence as a weighted sum of complex
sinusoids.
Consider a sequence of 𝑁 real numbers {𝑥ⱼ}_{𝑗=0}^{𝑁−1}.

The discrete Fourier transform maps {𝑥ⱼ}_{𝑗=0}^{𝑁−1} into a sequence of complex numbers {𝑋ₖ}_{𝑘=0}^{𝑁−1}, where

𝑋ₖ = ∑_{𝑛=0}^{𝑁−1} 𝑥ₙ 𝑒^{−2𝜋𝑖 𝑘𝑛/𝑁}
def DFT(x):
    "The discrete Fourier transform."

    N = len(x)
    w = np.e ** (-complex(0, 2*np.pi/N))

    X = np.zeros(N, dtype=complex)
    for k in range(N):
        for n in range(N):
            X[k] += x[n] * w ** (k * n)

    return X
𝑥ₙ = 1/2 for 𝑛 = 0, 1 and 𝑥ₙ = 0 otherwise
x = np.zeros(10)
x[0:2] = 1/2
array([0.5, 0.5, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ])
X = DFT(x)
We can plot magnitudes of a sequence of numbers and the associated discrete Fourier transform.
def plot_magnitude(x=None, X=None):
    "Plot the magnitudes of a sequence x and/or its DFT X."

    data = []
    names = []
    xs = []
    if (x is not None):
        data.append(x)
        names.append('x')
        xs.append('n')
    if (X is not None):
        data.append(X)
        names.append('X')
        xs.append('j')

    num = len(data)
    for i in range(num):
        n = data[i].size
        plt.figure(figsize=(8, 3))
        plt.scatter(range(n), np.abs(data[i]))
        plt.vlines(range(n), 0, np.abs(data[i]), color='b')

        plt.xlabel(xs[i])
        plt.ylabel('magnitude')
        plt.title(names[i])
        plt.show()
plot_magnitude(x=x, X=X)
def inverse_transform(X):

    N = len(X)
    w = np.e ** (complex(0, 2*np.pi/N))

    x = np.zeros(N, dtype=complex)
    for n in range(N):
        for k in range(N):
            x[n] += X[k] * w ** (k * n) / N

    return x
inverse_transform(X)
array([ 0.5+0.j, 0.5-0.j, -0. -0.j, -0. -0.j, -0. -0.j, -0. -0.j,
-0. +0.j, -0. +0.j, -0. +0.j, -0. +0.j])
Another example is

𝑥ₙ = 2 cos(2𝜋 (11/40) 𝑛),   𝑛 = 0, 1, 2, ⋯ , 19

Since 𝑁 = 20, we cannot use an integer multiple of 1/20 to represent the frequency 11/40.

To handle this, we shall end up using all 𝑁 of the available frequencies in the DFT.

Since 11/40 is in between 10/40 and 12/40 (each of which is an integer multiple of 1/20), the complex coefficients in the DFT have their largest magnitudes at 𝑘 = 5, 6, 15, 16, not just at a single frequency.
N = 20
x = np.empty(N)
for j in range(N):
x[j] = 2 * np.cos(2 * np.pi * 11 * j / 40)
X = DFT(x)
plot_magnitude(x=x, X=X)
N = 20
x = np.empty(N)
for j in range(N):
x[j] = 2 * np.cos(2 * np.pi * 10 * j / 40)
X = DFT(x)
plot_magnitude(x=x, X=X)
If we represent the discrete Fourier transform as a matrix, we discover that it equals the matrix 𝐹𝑁 of eigenvectors of the
permutation matrix 𝑃𝑁 .
We can use the example where 𝑥ₙ = 2 cos(2𝜋 (11/40) 𝑛), 𝑛 = 0, 1, 2, ⋯ , 19 to illustrate this.
N = 20
x = np.empty(N)

for j in range(N):
    x[j] = 2 * np.cos(2 * np.pi * 11 * j / 40)

X = DFT(x)
X
Now let’s evaluate the outcome of postmultiplying the eigenvector matrix 𝐹20 by the vector 𝑥, a product that we claim
should equal the Fourier transform of the sequence {𝑥ₙ}_{𝑛=0}^{𝑁−1}.
F20, _ = construct_F(20)
F20 @ x
Similarly, the inverse DFT can be expressed as an inverse DFT matrix 𝐹₂₀^{−1}.
F20_inv = np.linalg.inv(F20)
F20_inv @ X
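As an additional cross-check (not in the original lecture), NumPy's built-in FFT uses the same sign convention as the DFT function defined above, so the two should agree up to numerical precision:

import numpy as np

x = np.zeros(10)
x[0:2] = 1/2

# np.fft.fft computes X_k = sum_n x_n exp(-2πi kn/N), the same convention as DFT above
print(np.allclose(np.fft.fft(x), DFT(x)))   # True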
SEVEN
In addition to regular packages contained in Anaconda by default, this lecture also requires:
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt
%matplotlib inline
import quandl as ql
import pandas as pd
7.1 Overview
The singular value decomposition is a work-horse in applications of least squares projection that form a foundation for
some important machine learning methods.
This lecture describes the singular value decomposition and two of its uses:
• principal components analysis (PCA)
• dynamic mode decomposition (DMD)
Each of these can be thought of as a data-reduction procedure designed to capture salient patterns by projecting data onto
a limited set of factors.
𝑋 = 𝑈 Σ𝑉 𝑇
where
𝑈𝑈𝑇 = 𝐼 𝑈𝑇 𝑈 = 𝐼
𝑉𝑉𝑇 = 𝐼 𝑉 𝑇𝑉 = 𝐼
where
• 𝑈 is an 𝑚 × 𝑚 matrix whose columns are eigenvectors of 𝑋𝑋^𝑇
• 𝑉 is an 𝑛 × 𝑛 matrix whose columns are eigenvectors of 𝑋^𝑇𝑋
• Σ is an 𝑚 × 𝑛 matrix in which the first 𝑟 places on its main diagonal are positive numbers 𝜎1 , 𝜎2 , … , 𝜎𝑟 called
singular values; remaining entries of Σ are all zero
• The 𝑟 singular values are square roots of the eigenvalues of the 𝑚 × 𝑚 matrix 𝑋𝑋 𝑇 and the 𝑛 × 𝑛 matrix 𝑋 𝑇 𝑋
• When 𝑈 is a complex valued matrix, 𝑈^𝑇 denotes the conjugate-transpose or Hermitian-transpose of 𝑈, meaning that 𝑈^𝑇_{𝑖𝑗} is the complex conjugate of 𝑈_{𝑗𝑖}.
• Similarly, when 𝑉 is a complex valued matrix, 𝑉^𝑇 denotes the conjugate-transpose or Hermitian-transpose of 𝑉.
In what is called a full SVD, the shapes of 𝑈 , Σ, and 𝑉 are (𝑚, 𝑚), (𝑚, 𝑛), (𝑛, 𝑛), respectively.
There is also an alternative shape convention called an economy or reduced SVD .
Thus, note that because we assume that 𝑋 has rank 𝑟, there are only 𝑟 nonzero singular values, where 𝑟 = rank(𝑋) ≤
min (𝑚, 𝑛).
A reduced SVD uses this fact to express 𝑈 , Σ, and 𝑉 as matrices with shapes (𝑚, 𝑟), (𝑟, 𝑟), (𝑟, 𝑛).
Sometimes, we will use a full SVD in which 𝑈 , Σ, and 𝑉 have shapes (𝑚, 𝑚), (𝑚, 𝑛), (𝑛, 𝑛)
Caveat: The properties
𝑈𝑈𝑇 = 𝐼 𝑈𝑇 𝑈 = 𝐼
𝑉𝑉𝑇 = 𝐼 𝑉 𝑇𝑉 = 𝐼
apply to a full SVD but not to a reduced SVD.
In the tall-skinny case in which 𝑚 >> 𝑛, for a reduced SVD

𝑈𝑈^𝑇 ≠ 𝐼   𝑈^𝑇𝑈 = 𝐼
𝑉𝑉^𝑇 = 𝐼   𝑉^𝑇𝑉 = 𝐼

while in the short-fat case in which 𝑚 << 𝑛, for a reduced SVD

𝑈𝑈^𝑇 = 𝐼   𝑈^𝑇𝑈 = 𝐼
𝑉𝑉^𝑇 ≠ 𝐼   𝑉^𝑇𝑉 = 𝐼
When we study Dynamic Mode Decomposition below, we shall want to remember this caveat because we’ll be using
reduced SVD’s to compute key objects.
import numpy as np
X = np.random.rand(5,2)
U, S, V = np.linalg.svd(X,full_matrices=True) # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V ='), U, S, V
U, S, V =
(None,
array([[-0.37541622, -0.19584961, -0.67837623, -0.4401461 , -0.40839036],
[-0.55849099, -0.35735004, 0.65235117, -0.3671743 , -0.00312385],
[-0.42904804, 0.86421085, 0.11626597, 0.02008157, -0.23481127],
[-0.48271863, -0.27745485, -0.10739446, 0.81450895, -0.12265046],
[-0.36062582, 0.10051012, -0.2986508 , -0.08732897, 0.87351479]]),
array([1.55734944, 0.39671477]),
array([[-0.96609979, -0.25816894],
[-0.25816894, 0.96609979]]))
(None,
array([[-0.37541622, -0.19584961],
[-0.55849099, -0.35735004],
[-0.42904804, 0.86421085],
[-0.48271863, -0.27745485],
[-0.36062582, 0.10051012]]),
array([1.55734944, 0.39671477]),
array([[-0.96609979, -0.25816894],
[-0.25816894, 0.96609979]]))
rr = np.linalg.matrix_rank(X)
print('rank of X - '), rr
rank of X -
(None, 2)
Properties:
• Where 𝑈 is constructed via a full SVD, 𝑈^𝑇𝑈 = 𝐼_{𝑚×𝑚} and 𝑈𝑈^𝑇 = 𝐼_{𝑚×𝑚}
• Where 𝑈̂ is constructed via a reduced SVD, although 𝑈̂ 𝑇 𝑈̂ = 𝐼𝑟×𝑟 it happens that 𝑈̂ 𝑈̂ 𝑇 ≠ 𝐼𝑚×𝑚
We illustrate these properties for our example with the following code cells.
UTU = U.T@U
UUT = [email protected]
print('UUT, UTU = '), UUT, UTU
UUT, UTU =
(None,
array([[ 1.00000000e+00, 7.25986945e-17, -4.29313316e-17,
-7.16514089e-20, -3.42603458e-17],
[ 7.25986945e-17, 1.00000000e+00, 1.64603035e-16,
-6.51859739e-17, -8.41387647e-18],
[-4.29313316e-17, 1.64603035e-16, 1.00000000e+00,
9.85891939e-18, -1.53238976e-16],
[-7.16514089e-20, -6.51859739e-17, 9.85891939e-18,
1.00000000e+00, -4.63853856e-17],
[-3.42603458e-17, -8.41387647e-18, -1.53238976e-16,
-4.63853856e-17, 1.00000000e+00]]),
array([[ 1.00000000e+00, 6.09772859e-18, 1.03210775e-16,
7.46020143e-17, 9.69958745e-18],
[ 6.09772859e-18, 1.00000000e+00, 1.78870202e-16,
1.51877577e-17, 3.14064924e-17],
[ 1.03210775e-16, 1.78870202e-16, 1.00000000e+00,
8.72751834e-17, -6.17786179e-17],
[ 7.46020143e-17, 1.51877577e-17, 8.72751834e-17,
1.00000000e+00, 3.04189949e-17],
[ 9.69958745e-18, 3.14064924e-17, -6.17786179e-17,
3.04189949e-17, 1.00000000e+00]]))
UhatUhatT = [email protected]
UhatTUhat = Uhat.T@Uhat
print('UhatUhatT, UhatTUhat= '), UhatUhatT, UhatTUhat
UhatUhatT, UhatTUhat=
(None,
array([[ 0.17929441, 0.27965344, -0.00818377, 0.23555983, 0.11569991],
[ 0.27965344, 0.43961124, -0.06920632, 0.36874251, 0.16548898],
[-0.00818377, -0.06920632, 0.93094262, -0.03267001, 0.24158774],
[ 0.23555983, 0.36874251, -0.03267001, 0.30999847, 0.14619378],
[ 0.11569991, 0.16548898, 0.24158774, 0.14619378, 0.14015327]]),
array([[1.00000000e+00, 6.09772859e-18],
[6.09772859e-18, 1.00000000e+00]]))
Remark: The cells above illustrate the application of the full_matrices=True and full_matrices=False options. Using full_matrices=False returns a reduced singular value decomposition. This option implements an optimal reduced-rank approximation of a matrix, in the sense of minimizing the Frobenius norm of the discrepancy between the approximating matrix and the matrix being approximated. Optimality in this sense is established in the celebrated Eckart–Young theorem. See https://en.wikipedia.org/wiki/Low-rank_approximation.
When we study Dynamic Mode Decompositions below, it will be important for us to remember the following important
properties of full and reduced SVD’s in such tall-skinny cases.
Let’s do another exercise, but now we’ll set 𝑚 = 2 < 5 = 𝑛
import numpy as np
X = np.random.rand(2,5)
U, S, V = np.linalg.svd(X,full_matrices=True) # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V ='), U, S, V
U, S, V =
(None,
array([[ 0.75354378, -0.65739773],
[ 0.65739773, 0.75354378]]),
array([1.52371698, 0.78047845]),
array([[ 0.24335076, 0.20429337, 0.41002952, 0.69252016, 0.50133447],
[ 0.47340092, 0.16756132, 0.56260509, -0.076725 , -0.65222969],
[-0.35798988, -0.64724402, 0.63701494, -0.16367735, 0.14261877],
[-0.61776202, 0.05930924, -0.01976992, 0.58720849, -0.51927627],
[-0.45484648, 0.71256229, 0.3304125 , -0.37805425, 0.18240677]]))
(None,
array([[ 0.75354378, -0.65739773],
[ 0.65739773, 0.75354378]]),
rr = np.linalg.matrix_rank(X)
print('rank X = '), rr
rank X =
(None, 2)
𝑋 = 𝑆𝑄
where
𝑆 = 𝑈 Σ𝑈 𝑇
𝑄 = 𝑈𝑉 𝑇
and 𝑆 is evidently a symmetric matrix and 𝑄 is an orthogonal matrix.
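A minimal numerical sketch of this polar decomposition, assuming only NumPy (the random matrix below is purely illustrative), confirms that 𝑆 is symmetric, 𝑄 is orthogonal, and 𝑋 = 𝑆𝑄:

import numpy as np

X = np.random.rand(4, 4)
U, σ, VT = np.linalg.svd(X)

S = U @ np.diag(σ) @ U.T      # symmetric factor S = U Σ U^T
Q = U @ VT                    # orthogonal factor Q = U V^T

print(np.allclose(X, S @ Q))                 # True
print(np.allclose(Q @ Q.T, np.eye(4)))       # True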
Let’s begin with a case in which 𝑛 >> 𝑚, so that we have many more observations 𝑛 than random variables 𝑚.
The matrix 𝑋 is short and fat in an 𝑛 >> 𝑚 case as opposed to a tall and skinny case with 𝑚 >> 𝑛 to be discussed
later.
We regard 𝑋 as an 𝑚 × 𝑛 matrix of data:
𝑋 = [𝑋1 ∣ 𝑋2 ∣ ⋯ ∣ 𝑋𝑛 ]
where for 𝑗 = 1, … , 𝑛 the column vector 𝑋ⱼ = [𝑋_{1𝑗}, 𝑋_{2𝑗}, … , 𝑋_{𝑚𝑗}]^𝑇 is a vector of observations on the variables [𝑥₁, 𝑥₂, … , 𝑥ₘ]^𝑇.
In a time series setting, we would think of columns 𝑗 as indexing different times at which random variables are observed,
while rows index different random variables.
In a cross section setting, we would think of columns 𝑗 as indexing different individuals for which random variables are
observed, while rows index different random variables.
The number of positive singular values equals the rank of matrix 𝑋.
Arrange the singular values in decreasing order.
Arrange the positive singular values on the main diagonal of the matrix Σ and collect them in a vector 𝜎_𝑅.
Set all other entries of Σ to zero.
To relate an SVD to a PCA (principal component analysis) of data set 𝑋, first construct the SVD of the data matrix 𝑋:

𝑋 = 𝑈Σ𝑉^𝑇 = 𝜎₁𝑈₁𝑉₁^𝑇 + 𝜎₂𝑈₂𝑉₂^𝑇 + ⋯ + 𝜎ᵣ𝑈ᵣ𝑉ᵣ^𝑇    (7.1)
where 𝑉^𝑇 is the 𝑛 × 𝑛 matrix whose rows are 𝑉₁^𝑇, 𝑉₂^𝑇, … , 𝑉ₙ^𝑇.
In equation (7.1), each of the 𝑚 × 𝑛 matrices 𝑈𝑗 𝑉𝑗𝑇 is evidently of rank 1.
Thus, we have

𝑋 = ∑_{𝑘=1}^{𝑟} 𝜎ₖ𝑈ₖ𝑉ₖ^𝑇    (7.2)
Here is how we would interpret the objects in the matrix equation (7.2) in a time series context:
• 𝑉𝑘𝑇 = [𝑉𝑘1 𝑉𝑘2 … 𝑉𝑘𝑛 ] for each 𝑘 = 1, … , 𝑛 is a time series {𝑉𝑘𝑗 }𝑛𝑗=1 for the 𝑘th principal component
• 𝑈ₖ = [𝑈_{1𝑘}, 𝑈_{2𝑘}, … , 𝑈_{𝑚𝑘}]^𝑇, 𝑘 = 1, … , 𝑚 is a vector of loadings of the variables 𝑋ᵢ, 𝑖 = 1, … , 𝑚, on the 𝑘th principal component
• 𝜎𝑘 for each 𝑘 = 1, … , 𝑟 is the strength of 𝑘th principal component
Ω = 𝑋𝑋 𝑇
Ω = 𝑃 Λ𝑃 𝑇
Here
• 𝑃 is 𝑚 × 𝑚 matrix of eigenvectors of Ω
• Λ is a diagonal matrix of eigenvalues of Ω
𝑋 = 𝑃𝜖
where
𝜖𝜖𝑇 = Λ.
𝑋𝑋 𝑇 = 𝑃 Λ𝑃 𝑇 .
𝑋 = [𝑋₁|𝑋₂| … |𝑋ₘ] = [𝑃₁|𝑃₂| … |𝑃ₘ][𝜖₁; 𝜖₂; … ; 𝜖ₘ] = 𝑃₁𝜖₁ + 𝑃₂𝜖₂ + … + 𝑃ₘ𝜖ₘ
where
𝜖𝜖𝑇 = Λ.
To reconcile the preceding representation with the PCA that we obtained through the SVD above, we first note that
𝜖ⱼ𝜖ⱼ^𝑇 = 𝜆ⱼ ≡ 𝜎ⱼ².

Now define 𝜖ⱼ̃ = 𝜖ⱼ/√𝜆ⱼ, which evidently implies that 𝜖ⱼ̃ 𝜖ⱼ̃^𝑇 = 1.
Therefore
𝑋 = √𝜆₁𝑃₁𝜖₁̃ + √𝜆₂𝑃₂𝜖₂̃ + … + √𝜆ₘ𝑃ₘ𝜖ₘ̃
  = 𝜎₁𝑃₁𝜖₁̃ + 𝜎₂𝑃₂𝜖₂̃ + … + 𝜎ₘ𝑃ₘ𝜖ₘ̃ ,
7.9 Connections
To pull things together, it is useful to assemble and compare some formulas presented above.
First, consider the following SVD of an 𝑚 × 𝑛 matrix:
𝑋 = 𝑈 Σ𝑉 𝑇
Compute:
𝑋𝑋 𝑇 = 𝑈 Σ𝑉 𝑇 𝑉 Σ𝑇 𝑈 𝑇
≡ 𝑈 ΣΣ𝑇 𝑈 𝑇
≡ 𝑈 Λ𝑈 𝑇
Thus, 𝑈 in the SVD is the matrix 𝑃 of eigenvectors of 𝑋𝑋 𝑇 and ΣΣ𝑇 is the matrix Λ of eigenvalues.
Second, let’s compute
𝑋 𝑇 𝑋 = 𝑉 Σ𝑇 𝑈 𝑇 𝑈 Σ𝑉 𝑇
= 𝑉 Σ𝑇 Σ𝑉 𝑇
𝑋𝑋 𝑇 = 𝑃 Λ𝑃 𝑇
𝑋𝑋 𝑇 = 𝑈 ΣΣ𝑇 𝑈 𝑇
𝑋 = 𝑃 𝜖 = 𝑈 Σ𝑉 𝑇
It follows that
𝑈 𝑇 𝑋 = Σ𝑉 𝑇 = 𝜖
𝜖𝜖𝑇 = Σ𝑉 𝑇 𝑉 Σ𝑇 = ΣΣ𝑇 = Λ,
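The following short check (a sketch, not taken from the lecture) verifies numerically that the nonzero eigenvalues of 𝑋𝑋^𝑇 equal the squared singular values of 𝑋:

import numpy as np

m, n = 5, 2
X = np.random.rand(m, n)

U, σ, VT = np.linalg.svd(X, full_matrices=True)
Λ, P = np.linalg.eigh(X @ X.T)     # eigenvalues in ascending order

# the n largest eigenvalues of X X^T equal the squared singular values
print(np.allclose(np.sort(Λ)[-n:], np.sort(σ**2)))   # True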
class DecomAnalysis:
    """
    A class for conducting PCA and SVD.
    """

    def __init__(self, X, n_component=None):

        self.X = X
        self.Ω = (X @ X.T)

        self.m, self.n = X.shape
        self.r = LA.matrix_rank(X)

        # number of principal components to keep
        if n_component:
            self.n_component = n_component
        else:
            self.n_component = self.m

    def pca(self):

        λ, P = LA.eigh(self.Ω)    # columns of P are eigenvectors

        # sort by eigenvalues
        ind = sorted(range(λ.size), key=lambda i: λ[i], reverse=True)
        self.λ = λ[ind]
        P = P[:, ind]
        self.P = P @ diag_sign(P)

        self.Λ = np.diag(self.λ)

        # principal components
        self.ε = self.P.T @ self.X

        P = self.P[:, :self.n_component]
        ε = self.ε[:self.n_component, :]

        # transform data
        self.X_pca = P @ ε

    def svd(self):

        U, σ, VT = LA.svd(self.X)

        # sort by singular values
        ind = sorted(range(σ.size), key=lambda i: σ[i], reverse=True)
        d = min(self.m, self.n)

        self.σ = σ[ind]
        U = U[:, ind]
        D = diag_sign(U)
        self.U = U @ D
        VT[:d, :] = D @ VT[ind, :]
        self.VT = VT

        self.Σ = np.zeros((self.m, self.n))
        self.Σ[:d, :d] = np.diag(self.σ)

        σ_sq = self.σ ** 2
        self.explained_ratio_svd = np.cumsum(σ_sq) / σ_sq.sum()

        # transform data
        self.X_svd = self.U[:, :self.n_component] @ \
            self.Σ[:self.n_component, :self.n_component] @ \
            self.VT[:self.n_component, :]

    def fit(self, n_component):

        # pca
        P = self.P[:, :n_component]
        ε = self.ε[:n_component, :]

        # transform data
        self.X_pca = P @ ε

        # svd
        U = self.U[:, :n_component]
        Σ = self.Σ[:n_component, :n_component]
        VT = self.VT[:n_component, :]

        # transform data
        self.X_svd = U @ Σ @ VT
def diag_sign(A):
"Compute the signs of the diagonal of matrix A"
D = np.diag(np.sign(np.diag(A)))
return D
We also define a function that prints out information so that we can compare decompositions obtained by different algorithms.
def compare_pca_svd(da):
    """
    Compare the outcomes of PCA and SVD.
    """

    da.pca()
    da.svd()

    # loading matrices
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    plt.suptitle('loadings')
    axs[0].plot(da.P.T)
    axs[0].set_title('P')
    axs[0].set_xlabel('m')
    axs[1].plot(da.U.T)
    axs[1].set_title('U')
    axs[1].set_xlabel('m')
    plt.show()

    # principal components
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    plt.suptitle('principal components')
    axs[0].plot(da.ε.T)
    axs[0].set_title('ε')
    axs[0].set_xlabel('n')
    axs[1].plot(da.VT[:da.n_component, :].T)
    axs[1].set_title('$V^T$')
    axs[1].set_xlabel('n')
    plt.show()
For an example PCA applied to analyzing the structure of intelligence tests see this lecture Multivariable Normal Distri-
bution.
Look at the parts of that lecture that describe and illustrate the classic factor analysis model.
We want to fit a first-order vector autoregression

𝑋_{𝑡+1} = 𝐴𝑋_𝑡 + 𝐶𝜖_{𝑡+1}    (7.3)

where 𝜖_{𝑡+1} is the time 𝑡 + 1 instance of an i.i.d. 𝑚 × 1 random vector with mean vector zero and identity covariance matrix and
where the 𝑚 × 1 vector 𝑋_𝑡 is

𝑋_𝑡 = [𝑋_{1,𝑡}  𝑋_{2,𝑡}  ⋯  𝑋_{𝑚,𝑡}]^𝑇    (7.4)

and where ^𝑇 again denotes complex transposition and 𝑋_{𝑖,𝑡} is an observation on variable 𝑖 at time 𝑡.
We want to fit equation (7.3).
Our data are organized in an 𝑚 × (𝑛 + 1) matrix 𝑋̃
𝑋̃ = [𝑋1 ∣ 𝑋2 ∣ ⋯ ∣ 𝑋𝑛 ∣ 𝑋𝑛+1 ]
𝑋 = [𝑋1 ∣ 𝑋2 ∣ ⋯ ∣ 𝑋𝑛 ]
and
𝑋 ′ = [𝑋2 ∣ 𝑋3 ∣ ⋯ ∣ 𝑋𝑛+1 ]
Here ′ does not indicate matrix transposition but instead is part of the name of the matrix 𝑋 ′ .
In forming 𝑋 and 𝑋 ′ , we have in each case dropped a column from 𝑋,̃ the last column in the case of 𝑋, and the first
column in the case of 𝑋 ′ .
Evidently, 𝑋 and 𝑋 ′ are both 𝑚 × 𝑛 matrices.
We denote the rank of 𝑋 as 𝑝 ≤ min(𝑚, 𝑛).
Two possible cases are
• 𝑛 >> 𝑚, so that we have many more time series observations 𝑛 than variables 𝑚
• 𝑚 >> 𝑛, so that we have many more variables 𝑚 than time series observations 𝑛
At a general level that includes both of these special cases, a common formula describes the least squares estimator 𝐴 ̂ of
𝐴 for both cases, but important details differ.
The common formula is
𝐴̂ = 𝑋′𝑋+ (7.5)
𝑋 + = 𝑋 𝑇 (𝑋𝑋 𝑇 )−1
𝐴 ̂ = 𝑋 ′ 𝑋 𝑇 (𝑋𝑋 𝑇 )−1
𝑋 + = (𝑋 𝑇 𝑋)−1 𝑋 𝑇
𝐴 ̂ = 𝑋 ′ (𝑋 𝑇 𝑋)−1 𝑋 𝑇 (7.6)
̂ = 𝑋′
𝐴𝑋
so that the regression equation fits perfectly, the usual outcome in an underdetermined least-squares model.
Thus, we want to fit equation (7.3) in a situation in which we have a number 𝑛 of observations that is small relative to the
number 𝑚 of variables that appear in the vector 𝑋𝑡 .
To reiterate and offer an idea about how we can efficiently calculate the pseudo-inverse 𝑋 + , as our estimator 𝐴 ̂ of 𝐴 we
form an 𝑚 × 𝑚 matrix that solves the least-squares best-fit problem
𝐴̂ = 𝑋′𝑋+ (7.8)
𝑋 = 𝑈 Σ𝑉 𝑇 (7.9)
We can use the singular value decomposition (7.9) efficiently to construct the pseudo-inverse 𝑋 + by recognizing the
following string of equalities.
𝑋 + = (𝑋 𝑇 𝑋)−1 𝑋 𝑇
= (𝑉 Σ𝑈 𝑇 𝑈 Σ𝑉 𝑇 )−1 𝑉 Σ𝑈 𝑇
= (𝑉 ΣΣ𝑉 𝑇 )−1 𝑉 Σ𝑈 𝑇 (7.10)
= 𝑉 Σ−1 Σ−1 𝑉 𝑇 𝑉 Σ𝑈 𝑇
= 𝑉 Σ−1 𝑈 𝑇
(Since we are in the 𝑚 >> 𝑛 case in which 𝑉 𝑇 𝑉 = 𝐼 in a reduced SVD, we can use the preceding string of equalities
for a reduced SVD as well as for a full SVD.)
Thus, we shall construct a pseudo-inverse 𝑋 + of 𝑋 by using a singular value decomposition of 𝑋 in equation (7.9) to
compute
𝑋 + = 𝑉 Σ−1 𝑈 𝑇 (7.11)
where the matrix Σ−1 is constructed by replacing each non-zero element of Σ with 𝜎𝑗−1 .
We can use formula (7.11) together with formula (7.8) to compute the matrix 𝐴 ̂ of regression coefficients.
Thus, our estimator 𝐴 ̂ = 𝑋 ′ 𝑋 + of the 𝑚 × 𝑚 matrix of coefficients 𝐴 is
𝐴 ̂ = 𝑋 ′ 𝑉 Σ−1 𝑈 𝑇 (7.12)
In addition to doing that, we’ll eventually use dynamic mode decomposition to compute a rank 𝑟 approximation to 𝐴,̂
where 𝑟 < 𝑝.
Remark: We described and illustrated a reduced singular value decomposition above, and compared it with a full
singular value decomposition. In our Python code, we’ll typically use a reduced SVD.
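Here is a minimal sketch of how formula (7.12) can be computed with a reduced SVD. The data matrix below is random and purely illustrative, and the comparison against np.linalg.pinv is just a sanity check:

import numpy as np

m, n = 10, 6                         # tall-skinny illustration (more variables than observations in spirit)
X_tilde = np.random.rand(m, n + 1)   # fake data
X, Xprime = X_tilde[:, :-1], X_tilde[:, 1:]

U, σ, VT = np.linalg.svd(X, full_matrices=False)     # reduced SVD
A_hat = Xprime @ VT.T @ np.diag(1/σ) @ U.T           # A_hat = X' V Σ^{-1} U^T

print(np.allclose(A_hat, Xprime @ np.linalg.pinv(X)))   # True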
Next, we describe alternative representations of our first-order linear dynamic system.
7.11 Representation 1
𝑏̃𝑡 = 𝑈 𝑇 𝑋𝑡 (7.13)
and
𝑋𝑡 = 𝑈 𝑏̃𝑡 (7.14)
(Here we use the notation 𝑏 to remind ourselves that we are creating a basis vector.)
Since we are using a full SVD, 𝑈 𝑈 𝑇 is an 𝑚 × 𝑚 identity matrix.
So it follows from equation (7.13) that we can reconstruct 𝑋𝑡 from 𝑏̃𝑡 by using
• Equation (7.13) serves as an encoder that rotates the 𝑚 × 1 vector 𝑋𝑡 to become an 𝑚 × 1 vector 𝑏̃𝑡
• Equation (7.14) serves as a decoder that recovers the 𝑚 × 1 vector 𝑋𝑡 by rotating the 𝑚 × 1 vector 𝑏̃𝑡
Define a transition matrix for a rotated 𝑚 × 1 state 𝑏̃_𝑡 by

𝐴̃ = 𝑈^𝑇𝐴̂𝑈    (7.15)

We can recover 𝐴̂ from

𝐴̂ = 𝑈𝐴̃𝑈^𝑇

Dynamics of the rotated state are then governed by

𝑏̃_{𝑡+1} = 𝐴̃𝑏̃_𝑡
To construct forecasts 𝑋̄_𝑡 of future values of 𝑋_𝑡 conditional on 𝑋₁, we can apply decoders (i.e., rotators) to both sides of this equation and deduce

𝑋̄_{𝑡+1} = 𝑈𝐴̃^𝑡𝑈^𝑇𝑋₁
7.12 Representation 2
𝐴 ̃ = 𝑊 Λ𝑊 −1 (7.16)
where Λ is a diagonal matrix of eigenvalues and 𝑊 is an 𝑚 × 𝑚 matrix whose columns are eigenvectors corresponding
to rows (eigenvalues) in Λ.
When 𝑈 𝑈 𝑇 = 𝐼𝑚×𝑚 , as is true with a full SVD of 𝑋, it follows that
𝐴̂ = 𝑈𝐴̃𝑈^𝑇 = 𝑈𝑊Λ𝑊^{−1}𝑈^𝑇    (7.17)
Evidently, according to equation (7.17), the diagonal matrix Λ contains eigenvalues of 𝐴 ̂ and corresponding eigenvectors
of 𝐴 ̂ are columns of the matrix 𝑈 𝑊 .
Thus, the systematic (i.e., not random) parts of the 𝑋𝑡 dynamics captured by our first-order vector autoregressions are
described by
𝑋𝑡+1 = 𝑈 𝑊 Λ𝑊 −1 𝑈 𝑇 𝑋𝑡
𝑊 −1 𝑈 𝑇 𝑋𝑡+1 = Λ𝑊 −1 𝑈 𝑇 𝑋𝑡
or
𝑏̂𝑡+1 = Λ𝑏̂𝑡
𝑏̂𝑡 = 𝑊 −1 𝑈 𝑇 𝑋𝑡
𝑋𝑡 = 𝑈 𝑊 𝑏̂𝑡
We can use this representation to construct a predictor 𝑋 𝑡+1 of 𝑋𝑡+1 conditional on 𝑋1 via:
𝑋 𝑡+1 = 𝑈 𝑊 Λ𝑡 𝑊 −1 𝑈 𝑇 𝑋1 (7.18)
Define

Φₛ = 𝑈𝑊    (7.19)

Then the predictor (7.18) can be written as

𝑋̄_{𝑡+1} = ΦₛΛ^𝑡Φₛ⁺𝑋₁    (7.20)
7.13 Representation 3
Departing from the procedures used to construct Representations 1 and 2, each of which deployed a full SVD, we now
use a reduced SVD.
Again, we let 𝑝 ≤ min(𝑚, 𝑛) be the rank of 𝑋.
Construct a reduced SVD
𝑋 = 𝑈̃ Σ̃ 𝑉 ̃ 𝑇 ,
𝐴 ̂ = 𝑋 ′ 𝑉 ̃ Σ̃ −1 𝑈̃ 𝑇
Paralleling a step in Representation 1, define a transition matrix for a rotated 𝑝 × 1 state 𝑏̃𝑡 by
𝐴̃ = 𝑈̃^𝑇𝐴̂𝑈̃    (7.21)
𝐴 ̃ = 𝑊 Λ𝑊 −1 (7.22)
Mimicking our procedure in Representation 2, we cross our fingers and compute the 𝑚 × 𝑝 matrix
Φ̃ₛ = 𝑈̃𝑊    (7.23)

𝐴̂Φ̃ₛ = (𝑋′𝑉̃Σ̃^{−1}𝑈̃^𝑇)(𝑈̃𝑊)
     = 𝑋′𝑉̃Σ̃^{−1}𝑊
     ≠ (𝑈̃𝑊)Λ
     = Φ̃ₛΛ
That 𝐴̂Φ̃ₛ ≠ Φ̃ₛΛ means that, unlike the corresponding situation in Representation 2, columns of Φ̃ₛ = 𝑈̃𝑊 are not eigenvectors of 𝐴̂ corresponding to eigenvalues Λ.
But in a quest for eigenvectors of 𝐴̂ that we can compute with a reduced SVD, let’s define

Φ ≡ 𝐴̂Φ̃ₛ = 𝑋′𝑉̃Σ̃^{−1}𝑊
It turns out that columns of Φ are eigenvectors of 𝐴,̂ a consequence of a result established by Tu et al. [TRL+14].
To present their result, for convenience we’ll drop the tilde ⋅ ̃ above 𝑈 , 𝑉 , and Σ and adopt the understanding that each of
them is computed with a reduced SVD.
Thus, we now use the notation that the 𝑚 × 𝑝 matrix Φ is defined as
Φ = 𝑋′𝑉Σ^{−1}𝑊    (7.24)

𝐴̂Φ = ΦΛ    (7.25)

Let 𝜙ᵢ be the 𝑖th column of Φ and 𝜆ᵢ be the corresponding 𝑖th eigenvalue of 𝐴̃ from decomposition (7.22).
Writing out the 𝑚 × 1 vectors on both sides of equation (7.25) and equating them gives

𝐴̂𝜙ᵢ = 𝜆ᵢ𝜙ᵢ.
𝐴 ̂ = ΦΛΦ+ . (7.26)
𝑏̌𝑡+1 = Λ𝑏̌𝑡
where
𝑏̌𝑡 = Φ+ 𝑋𝑡 (7.27)
Φ† = (Φ𝑇 Φ)−1 Φ𝑇
and so
𝑏̌ is recognizable as the matrix of least squares regression coefficients of the matrix 𝑋 on the matrix Φ, and

𝑋̌ = Φ𝑏̌

is the fitted value, so that

𝑋 = Φ𝑏̌ + 𝜖

where 𝜖 is an 𝑚 × 𝑛 matrix of least squares errors satisfying the least squares orthogonality conditions 𝜖^𝑇Φ = 0 or, equivalently, Φ^𝑇𝜖 = 0.
There is a better way to compute the 𝑝 × 1 vector 𝑏̌𝑡 than provided by formula (7.27).
In particular, the following argument from [BK19] (page 240) provides a computationally efficient way to compute 𝑏̌𝑡 .
For convenience, we’ll do this first for time 𝑡 = 1.
For 𝑡 = 1, we have
𝑋1 = Φ𝑏̌1 (7.30)
and consequently

𝑏̃₁ = 𝐴̃𝑊𝑏̌₁

𝑏̃₁ = 𝑊Λ𝑏̌₁

Consequently,

𝑏̌₁ = (𝑊Λ)^{−1}𝑏̃₁

or

𝑏̌₁ = (𝑊Λ)^{−1}𝑈^𝑇𝑋₁    (7.31)

which is computationally more efficient than the following instance of equation (7.27) for computing the initial vector 𝑏̌₁:

𝑏̌₁ = Φ⁺𝑋₁    (7.32)
Users of DMD sometimes call components of the basis vector 𝑏̌_𝑡 = Φ⁺𝑋_𝑡 ≡ (𝑊Λ)^{−1}𝑈^𝑇𝑋_𝑡 the exact DMD modes.
Some of the preceding formulas assume that we have retained all 𝑝 modes associated with the positive singular values of
𝑋.
We can adjust our formulas to describe a situation in which we instead retain only the 𝑟 < 𝑝 largest singular values.
In that case, we simply replace Σ with the appropriate 𝑟 × 𝑟 matrix of singular values, 𝑈 with the 𝑚 × 𝑟 matrix whose columns correspond to the 𝑟 largest singular values, and 𝑉 with the 𝑛 × 𝑟 matrix whose columns correspond to the 𝑟 largest singular values.
Counterparts of all of the salient formulas above then apply.
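The following sketch (again using random, purely illustrative data) assembles the objects above with a reduced SVD and verifies the eigenvector property 𝐴̂Φ = ΦΛ from (7.25):

import numpy as np

m, n = 20, 8
X_tilde = np.random.rand(m, n + 1)                  # fake data
X, Xp = X_tilde[:, :-1], X_tilde[:, 1:]

U, σ, VT = np.linalg.svd(X, full_matrices=False)    # reduced SVD
A_tilde = U.T @ Xp @ VT.T @ np.diag(1/σ)            # rotated transition matrix, as in (7.21)
Λ, W = np.linalg.eig(A_tilde)

Φ = Xp @ VT.T @ np.diag(1/σ) @ W                    # DMD modes, as in (7.24)
A_hat = Xp @ VT.T @ np.diag(1/σ) @ U.T              # estimator, as in (7.12)

print(np.allclose(A_hat @ Φ, Φ * Λ))                # columns of Φ are eigenvectors of A_hat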
CHAPTER
EIGHT
ELEMENTARY STATISTICS
This lecture uses matrix algebra to illustrate some basic ideas about probability theory.
After providing somewhat informal definitions of the underlying objects, we’ll use matrices and vectors to describe
probability distributions.
Among the concepts that we’ll study are
• a joint probability distribution
• marginal distributions associated with a given joint distribution
• conditional probability distributions
• statistical independence of two random variables
• joint distributions associated with a prescribed set of marginal distributions
– couplings
– copulas
• the probability distribution of a sum of two independent random variables
– convolution of marginal distributions
• parameters that define a probability distribution
• sufficient statistics as data summaries
We’ll use a matrix to represent a bivariate probability distribution and a vector to represent a univariate probability distribution.
As usual, we’ll start with some imports
import numpy as np
import matplotlib.pyplot as plt
import prettytable as pt
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')
%matplotlib inline
We’ll briefly define what we mean by a probability space, a probability measure, and a random variable.
For most of this lecture, we sweep these objects into the background, but they are there underlying the other objects that
we’ll mainly focus on.
Let Ω be a set of possible underlying outcomes and let 𝜔 ∈ Ω be a particular underlying outcome.
Let 𝒢 ⊂ Ω be a subset of Ω.
Let ℱ be a collection of such subsets 𝒢 ⊂ Ω.
The pair Ω, ℱ forms our probability space on which we want to put a probability measure.
A probability measure 𝜇 maps a set of possible underlying outcomes 𝒢 ∈ ℱ into a scalar number between 0 and 1.

A random variable 𝑋(𝜔) is a function of the underlying outcome 𝜔 ∈ Ω.

The random variable 𝑋(𝜔) has a probability distribution that is induced by the underlying probability measure 𝜇 and the function 𝑋(𝜔):

Prob{𝑋 ∈ 𝐴} = 𝜇{𝜔 ∈ Ω ∶ 𝑋(𝜔) ∈ 𝐴}

where 𝐴 is a set of possible values of 𝑋; this is the “probability” that 𝑋 belongs to 𝐴.
Before diving in, we’ll say a few words about what probability theory means and how it connects to statistics.
These are topics that are also touched on in the quantecon lectures https://python.quantecon.org/prob_meaning.html and
https://python.quantecon.org/navy_captain.html.
For much of this lecture we’ll be discussing fixed “population” probabilities.
These are purely mathematical objects.
To appreciate how statisticians connect probabilities to data, the key is to understand the following concepts:
• A single draw from a probability distribution
• Repeated independently and identically distributed (i.i.d.) draws of “samples” or “realizations” from the same
probability distribution
• A statistic defined as a function of a sequence of samples
• An empirical distribution or histogram (a binned empirical distribution) that records observed relative fre-
quencies
• The idea that a population probability distribution is what we anticipate relative frequencies will be in a long
sequence of i.i.d. draws. Here the following mathematical machinery makes precise what is meant by anticipated
relative frequencies
– Law of Large Numbers (LLN)
– Central Limit Theorem (CLT)
Scalar example

Consider the following discrete distribution

𝑋 ∼ {𝑓ᵢ}_{𝑖=0}^{𝐼−1},   𝑓ᵢ ⩾ 0,   ∑ᵢ 𝑓ᵢ = 1
A probability distribution Prob(𝑋 ∈ 𝐴) can be described by its cumulative distribution function (CDF)

𝐹(𝑥) = Prob{𝑋 ≤ 𝑥}

Sometimes, but not always, a random variable can also be described by a density function 𝑓(𝑥) that is related to its CDF by

Prob{𝑋 ∈ 𝐵} = ∫_{𝑡∈𝐵} 𝑓(𝑡)𝑑𝑡

𝐹(𝑥) = ∫_{−∞}^{𝑥} 𝑓(𝑡)𝑑𝑡
Here 𝐵 is a set of possible 𝑋’s whose probability we want to compute.
When a probability density exists, a probability distribution can be characterized either by its CDF or by its density.
For a discrete-valued random variable
We’ll devote most of this lecture to discrete-valued random variables, but we’ll say a few things about continuous-valued
random variables.
𝑓 = [𝑓₀, 𝑓₁, ⋯ , 𝑓_{𝐼−1}]^𝑇    (8.2)

for which 𝑓ᵢ ∈ [0, 1] for each 𝑖 and ∑_{𝑖=0}^{𝐼−1} 𝑓ᵢ = 1.

This vector defines a probability mass function.

The distribution (8.2) has parameters {𝑓ᵢ}_{𝑖=0,1,⋯,𝐼−2} since 𝑓_{𝐼−1} = 1 − ∑_{𝑖=0}^{𝐼−2} 𝑓ᵢ.
These parameters pin down the shape of the distribution.
(Sometimes 𝐼 = ∞.)
Such a “non-parametric” distribution has as many “parameters” as there are possible values of the random variable.
We often work with special distributions that are characterized by a small number parameters.
In these special parametric distributions,
𝑓𝑖 = 𝑔(𝑖; 𝜃)
Let 𝑋 be a continuous random variable that takes values 𝑋 ∈ 𝑋̃ ≡ [𝑋_𝐿, 𝑋_𝑈] whose distribution has parameters 𝜃.

Prob{𝑋 ∈ 𝑋̃} = 1
𝑋 ∈ {0, … , 𝐼 − 1}

𝑌 ∈ {0, … , 𝐽 − 1}
Then their joint distribution is described by a matrix
𝑓𝑖𝑗 = Prob{𝑋 = 𝑖, 𝑌 = 𝑗} ≥ 0
where
∑ ∑ 𝑓𝑖𝑗 = 1
𝑖 𝑗
The marginal distributions are

Prob{𝑋 = 𝑖} = ∑_{𝑗=0}^{𝐽−1} 𝑓ᵢⱼ = 𝜇ᵢ,   𝑖 = 0, … , 𝐼 − 1

Prob{𝑌 = 𝑗} = ∑_{𝑖=0}^{𝐼−1} 𝑓ᵢⱼ = 𝜈ⱼ,   𝑗 = 0, … , 𝐽 − 1
For example, let the joint distribution over (𝑋, 𝑌 ) be
𝐹 = \begin{bmatrix} .25 & .1 \\ .15 & .5 \end{bmatrix}    (8.3)
Then marginal distributions are:
Prob{𝑋 = 0} = .25 + .1 = .35
Prob{𝑋 = 1} = .15 + .5 = .65
Prob{𝑌 = 0} = .25 + .15 = .4
Prob{𝑌 = 1} = .1 + .5 = .6
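In NumPy these marginal distributions are just row and column sums of the joint matrix; for instance, for the joint distribution (8.3):

import numpy as np

F = np.array([[0.25, 0.1],
              [0.15, 0.5]])

print(F.sum(axis=1))   # marginal of X: [0.35 0.65]
print(F.sum(axis=0))   # marginal of Y: [0.4  0.6 ]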
Digression: If two random variables 𝑋, 𝑌 are continuous and have joint density 𝑓(𝑥, 𝑦), then marginal densities can be computed by

𝑓_𝑋(𝑥) = ∫ 𝑓(𝑥, 𝑦)𝑑𝑦,   𝑓_𝑌(𝑦) = ∫ 𝑓(𝑥, 𝑦)𝑑𝑥

Two discrete random variables 𝑋, 𝑌 are statistically independent if

Prob{𝑋 = 𝑖, 𝑌 = 𝑗} = 𝑓ᵢ𝑔ⱼ
where
Prob{𝑋 = 𝑖} = 𝑓𝑖 ≥ 0 ∑ 𝑓𝑖 = 1
Prob{𝑌 = 𝑗} = 𝑔𝑗 ≥ 0 ∑ 𝑔𝑗 = 1
A continuous random variable having density 𝑓_𝑋(𝑥) has mean and variance

𝜇_𝑋 ≡ 𝔼[𝑋] = ∫_{−∞}^{∞} 𝑥𝑓_𝑋(𝑥)𝑑𝑥

𝜎²_𝑋 ≡ 𝔻[𝑋] = 𝔼[(𝑋 − 𝜇_𝑋)²] = ∫_{−∞}^{∞} (𝑥 − 𝜇_𝑋)²𝑓_𝑋(𝑥)𝑑𝑥
Suppose we have at our disposal a pseudo random number that draws a uniform random variable, i.e., one with probability
distribution
Prob{𝑋̃ = 𝑖} = 1/𝐼,   𝑖 = 0, … , 𝐼 − 1

How can we transform 𝑋̃ to get a random variable 𝑋 for which Prob{𝑋 = 𝑖} = 𝑓ᵢ, 𝑖 = 0, … , 𝐼 − 1, where 𝑓ᵢ is an arbitrary discrete probability distribution on 𝑖 = 0, 1, … , 𝐼 − 1?
The key tool is the inverse of a cumulative distribution function (CDF).
Observe that the CDF of a distribution is monotone and non-decreasing, taking values between 0 and 1.
We can draw a sample of a random variable 𝑋 with a known CDF as follows:
• draw a random variable 𝑢 from a uniform distribution on [0, 1]
• pass the sample value of 𝑢 into the “inverse” target CDF for 𝑋
𝑋 = 𝐹^{−1}(𝑈).

To see why this works, note that

Prob{𝑋 ≤ 𝑥} = Prob{𝐹^{−1}(𝑈) ≤ 𝑥} = Prob{𝑈 ≤ 𝐹(𝑥)} = 𝐹(𝑥)

where the last equality occurs because 𝑈 is distributed uniformly on [0, 1] while 𝐹(𝑥) is a constant given 𝑥 that also lies on [0, 1].
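A minimal sketch of this recipe for an arbitrary discrete distribution (the probabilities below are illustrative) can be written with np.searchsorted, which plays the role of the “inverse” CDF:

import numpy as np

f = np.array([0.2, 0.5, 0.3])     # pmf on {0, 1, 2}
F = np.cumsum(f)                  # CDF values

u = np.random.rand(1_000_000)
x = np.searchsorted(F, u)         # smallest i with F[i] >= u

print(np.bincount(x) / len(x))    # close to [0.2, 0.5, 0.3]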
Let’s use numpy to compute some examples.
Example: A continuous geometric (exponential) distribution
Let 𝑋 follow a geometric distribution, with parameter 𝜆 > 0.
Its density function is

𝑓(𝑥) = 𝜆𝑒^{−𝜆𝑥}

Its CDF is

𝐹(𝑥) = ∫_{0}^{𝑥} 𝜆𝑒^{−𝜆𝑡}𝑑𝑡 = 1 − 𝑒^{−𝜆𝑥}

Let’s draw 𝑢 from 𝑈[0, 1] and calculate 𝑥 = −log(1 − 𝑢)/𝜆.
We’ll check whether 𝑋 seems to follow a continuous geometric (exponential) distribution.
Let’s check with numpy.
n, λ = 1_000_000, 0.3

# draw uniform random numbers
u = np.random.rand(n)

# transform
x = -np.log(1-u)/λ
Geometric distribution
Let 𝑋 be distributed geometrically, that is

Prob(𝑋 = 𝑖) = (1 − 𝜆)𝜆^𝑖,   𝑖 = 0, 1, … ,   𝜆 ∈ (0, 1)

Its CDF is

Prob(𝑋 ≤ 𝑖) = (1 − 𝜆) ∑_{𝑗=0}^{𝑖} 𝜆^𝑗
            = (1 − 𝜆)[(1 − 𝜆^{𝑖+1})/(1 − 𝜆)]
            = 1 − 𝜆^{𝑖+1}
            = 𝐹(𝑋) = 𝐹ᵢ
Again, let 𝑈̃ follow a uniform distribution and we want to find 𝑋 such that 𝐹 (𝑋) = 𝑈̃ .
Let’s deduce the distribution of 𝑋 from

𝑈̃ = 𝐹(𝑋) = 1 − 𝜆^{𝑥+1}

1 − 𝑈̃ = 𝜆^{𝑥+1}

log(1 − 𝑈̃) = (𝑥 + 1) log 𝜆

log(1 − 𝑈̃)/log 𝜆 = 𝑥 + 1

log(1 − 𝑈̃)/log 𝜆 − 1 = 𝑥

So let

𝑥 = ⌈log(1 − 𝑈̃)/log 𝜆 − 1⌉
n, λ = 1_000_000, 0.8

# draw uniform random numbers
u = np.random.rand(n)

# transform
x = np.ceil(np.log(1-u)/np.log(λ) - 1)
np.random.geometric(1-λ, n).max()
56
np.log(0.4)/np.log(0.3)
0.7610560044063083
Let’s write some Python code to compute means and variances of some univariate random variables.
We’ll use our code to
• compute population means and variances from the probability distribution
• generate a sample of 𝑁 independently and identically distributed draws and compute sample means and variances
• compare population and sample means and variances
Prob(𝑋 = 𝑘) = (1 − 𝑝)^{𝑘−1}𝑝,   𝑘 = 1, 2, …

⟹

𝔼(𝑋) = 1/𝑝

𝔻(𝑋) = (1 − 𝑝)/𝑝²
We draw observations from the distribution and compare the sample mean and variance with the theoretical results.
# specify parameters
p, n = 0.3, 1_000_000

# draw observations and compute sample moments
x = np.random.geometric(p, n)
μ_hat = np.mean(x)
σ2_hat = np.var(x)

print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
The Newcomb–Benford law fits many data sets, e.g., reports of incomes to tax authorities, in which the leading digit is
more likely to be small than large.
See https://en.wikipedia.org/wiki/Benford%27s_law
A Benford probability distribution is
Prob{𝑋 = 𝑑} = log₁₀(𝑑 + 1) − log₁₀(𝑑) = log₁₀(1 + 1/𝑑)
where 𝑑 ∈ {1, 2, ⋯ , 9} can be thought of as a first digit in a sequence of digits.
This is a well defined discrete distribution since we can verify that probabilities are nonnegative and sum to 1.
log₁₀(1 + 1/𝑑) ≥ 0,   ∑_{𝑑=1}^{9} log₁₀(1 + 1/𝑑) = 1
We verify the above and compute the mean and variance using numpy.
# pmf over the digits 1, ..., 9
Benford_pmf = np.array([np.log10(1 + 1/d) for d in range(1, 10)])
k = np.arange(1, 10)

# mean
mean = np.sum(k * Benford_pmf)

# variance
var = np.sum((k - mean)**2 * Benford_pmf)
# verify sum to 1
print(np.sum(Benford_pmf))
print(mean)
print(var)
0.9999999999999999
3.440236967123206
6.056512631375667
# plot distribution
plt.plot(range(1,10), Benford_pmf, 'o')
plt.title('Benford\'s distribution')
plt.show()
print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
print("\nThe population mean is: ", r*(1-p)/p)
print("The population variance is: ", r*(1-p)/p**2)
We write
𝑋 ∼ 𝑁 (𝜇, 𝜎2 )
# specify parameters
μ, σ = 0, 0.1

# draw observations and compute sample moments
x = np.random.normal(μ, σ, 1_000_000)
μ_hat = np.mean(x)
σ_hat = np.std(x)

# compare
print(μ-μ_hat < 1e-3)
print(σ-σ_hat < 1e-3)
True
True
𝑋 ∼ 𝑈 [𝑎, 𝑏]
𝑓(𝑥) = 1/(𝑏 − 𝑎) for 𝑎 ≤ 𝑥 ≤ 𝑏, and 𝑓(𝑥) = 0 otherwise

The population mean and variance are

𝔼(𝑋) = (𝑎 + 𝑏)/2

𝕍(𝑋) = (𝑏 − 𝑎)²/12
# specify parameters
a, b = 10, 20
𝑃(𝑋 = 0) = 0.95

𝑃(300 ≤ 𝑋 ≤ 400) = ∫_{300}^{400} 𝑓(𝑥)𝑑𝑥 = 0.05,   𝑓(𝑥) = 0.0005
Let’s start by generating a random sample and computing sample moments.
x = np.random.rand(1_000_000)
# x[x > 0.95] = 100*x[x > 0.95]+300
x[x > 0.95] = 100*np.random.rand(len(x[x > 0.95]))+300
x[x <= 0.95] = 0
μ_hat = np.mean(x)
σ2_hat = np.var(x)
print("The sample mean is: ", μ_hat, "\nThe sample variance is: ", σ2_hat)
The population mean and variance are

𝜇 = 0.95 × 0 + 0.05 × 350 = 17.5

𝜎² = 0.95 × (0 − 17.5)² + ∫_{300}^{400} (𝑥 − 17.5)²𝑓(𝑥)𝑑𝑥
   = 0.95 × 17.5² + 0.0005 × (1/3)(𝑥 − 17.5)³ |_{300}^{400}
mean: 17.5
variance: 5860.416666666666
Let’s use matrices to represent a joint distribution, conditional distribution, marginal distribution, and the mean and
variance of a bivariate random variable.
The table below illustrates a probability distribution for a bivariate random variable.
𝐹 = [𝑓ᵢⱼ] = \begin{bmatrix} 0.3 & 0.2 \\ 0.1 & 0.4 \end{bmatrix}

Prob(𝑋 = 𝑖) = ∑ⱼ 𝑓ᵢⱼ = 𝑢ᵢ

Prob(𝑌 = 𝑗) = ∑ᵢ 𝑓ᵢⱼ = 𝑣ⱼ
Below we draw some samples and confirm that the “sampling” distribution agrees well with the “population” distribution.
Sample results:
# specify parameters
xs = np.array([0, 1])
ys = np.array([10, 20])
f = np.array([[0.3, 0.2], [0.1, 0.4]])
f_cum = np.cumsum(f)
[[ 0. 0. 0. ... 0. 1. 0.]
[10. 20. 10. ... 10. 20. 10.]]
Here, we use exactly the inverse CDF technique to generate samples from the joint distribution 𝐹.
# marginal distribution
xp = np.sum(x[0, :] == xs[0])/1_000_000
yp = np.sum(x[1, :] == ys[0])/1_000_000
# print output
print("marginal distribution for x")
xmtb = pt.PrettyTable()
xmtb.field_names = ['x_value', 'x_prob']
xmtb.add_row([xs[0], xp])
xmtb.add_row([xs[1], 1-xp])
print(xmtb)
# conditional distributions
xc1 = x[0, x[1, :] == ys[0]]
xc2 = x[0, x[1, :] == ys[1]]

yc1 = x[1, x[0, :] == xs[0]]
yc2 = x[1, x[0, :] == xs[1]]

# sample conditional probabilities of x given y
xc1p = np.sum(xc1 == xs[0]) / len(xc1)
xc2p = np.sum(xc2 == xs[0]) / len(xc2)

# print output
print("conditional distribution for x")
xctb = pt.PrettyTable()
xctb.field_names = ['y_value', 'prob(x=0)', 'prob(x=1)']
xctb.add_row([ys[0], xc1p, 1-xc1p])
xctb.add_row([ys[1], xc2p, 1-xc2p])
print(xctb)
print(xctb)
Let’s calculate population marginal and conditional probabilities using matrix algebra.
The joint distribution, with marginal sums appended, is

          𝑦₁     𝑦₂     sum
  𝑥₁      0.3    0.2    0.5
  𝑥₂      0.1    0.4    0.5
  sum     0.4    0.6    1

⟹

(1) Marginal distribution:

  var      var₁   var₂
  𝑥        0.5    0.5
  𝑦        0.4    0.6

(2) Conditional distribution:

  Prob(𝑋 = 𝑥₁ | 𝑌 = 𝑦₁) = 0.3/0.4 = 0.75,   Prob(𝑋 = 𝑥₂ | 𝑌 = 𝑦₁) = 0.1/0.4 = 0.25
  Prob(𝑋 = 𝑥₁ | 𝑌 = 𝑦₂) = 0.2/0.6 ≈ 0.33,   Prob(𝑋 = 𝑥₂ | 𝑌 = 𝑦₂) = 0.4/0.6 ≈ 0.67

  Prob(𝑌 = 𝑦₁ | 𝑋 = 𝑥₁) = 0.3/0.5 = 0.6,    Prob(𝑌 = 𝑦₂ | 𝑋 = 𝑥₁) = 0.2/0.5 = 0.4
  Prob(𝑌 = 𝑦₁ | 𝑋 = 𝑥₂) = 0.1/0.5 = 0.2,    Prob(𝑌 = 𝑦₂ | 𝑋 = 𝑥₂) = 0.4/0.5 = 0.8
These population objects closely resemble sample counterparts computed above.
Let’s wrap some of the functions we have used in a Python class for a general discrete bivariate joint distribution.
class discrete_bijoint:
    """A general discrete bivariate joint distribution."""

    def __init__(self, f, xs, ys):
        '''initialization
        parameters:
        f:  the bivariate joint probability matrix
        xs: values of x
        ys: values of y
        '''
        self.f, self.xs, self.ys = f, np.array(xs), np.array(ys)

    def draw(self, n):
        '''draw n samples from the joint distribution'''
        f, xs, ys = self.f, self.xs, self.ys
        # sample a flattened cell index with probability f[i, j]
        idx = np.random.choice(f.size, size=n, p=f.flatten())
        self.x = np.vstack([xs[idx // len(ys)], ys[idx % len(ys)]])
        self.n = n
def joint_tb(self):
'''print the joint distribution table'''
xs = self.xs
ys = self.ys
f = self.f
jtb = pt.PrettyTable()
jtb.field_names = ['x_value/y_value', *ys, 'marginal sum for x']
for i in range(len(xs)):
jtb.add_row([xs[i], *f[i, :], np.sum(f[i, :])])
jtb.add_row(['marginal_sum for y', *np.sum(f, 0), np.sum(f)])
print("\nThe joint probability distribution for x and y\n", jtb)
self.jtb = jtb
def marg_dist(self):
'''marginal distribution'''
x = self.x
xs = self.xs
ys = self.ys
n = self.n
xmp = [np.sum(x[0, :] == xs[i])/n for i in range(len(xs))]
ymp = [np.sum(x[1, :] == ys[i])/n for i in range(len(ys))]
# print output
xmtb = pt.PrettyTable()
ymtb = pt.PrettyTable()
xmtb.field_names = ['x_value', 'x_prob']
ymtb.field_names = ['y_value', 'y_prob']
for i in range(max(len(xs), len(ys))):
if i < len(xs):
xmtb.add_row([xs[i], xmp[i]])
if i < len(ys):
ymtb.add_row([ys[i], ymp[i]])
xmtb.add_row(['sum', np.sum(xmp)])
ymtb.add_row(['sum', np.sum(ymp)])
print("\nmarginal distribution for x\n", xmtb)
print("\nmarginal distribution for y\n", ymtb)
self.xmp = xmp
self.ymp = ymp
def cond_dist(self):
'''conditional distribution'''
x = self.x
xs = self.xs
ys = self.ys
n = self.n
xcp = np.empty([len(ys), len(xs)])
ycp = np.empty([len(xs), len(ys)])
for i in range(max(len(ys), len(xs))):
if i < len(ys):
xi = x[0, x[1, :] == ys[i]]
idx = xi.reshape(len(xi), 1) == xs.reshape(1, len(xs))
xcp[i, :] = np.sum(idx, 0)/len(xi)
if i < len(xs):
yi = x[1, x[0, :] == xs[i]]
idy = yi.reshape(len(yi), 1) == ys.reshape(1, len(ys))
ycp[i, :] = np.sum(idy, 0)/len(yi)
# print output
xctb = pt.PrettyTable()
yctb = pt.PrettyTable()
xctb.field_names = ['x_value', *xs, 'sum']
yctb.field_names = ['y_value', *ys, 'sum']
for i in range(max(len(xs), len(ys))):
if i < len(ys):
xctb.add_row([ys[i], *xcp[i], np.sum(xcp[i])])
if i < len(xs):
yctb.add_row([xs[i], *ycp[i], np.sum(ycp[i])])
self.xcp = xcp
self.xyp = ycp
# joint
d = discrete_bijoint(f, xs, ys)
d.joint_tb()
# sample marginal
d.draw(1_000_000)
d.marg_dist()
# sample conditional
d.cond_dist()
Example 2
d_new.draw(1_000_000)
d_new.marg_dist()
d_new.cond_dist()
𝑓(𝑥, 𝑦) = (2𝜋𝜎₁𝜎₂√(1 − 𝜌²))^{−1} exp[ −(1/(2(1 − 𝜌²))) ( (𝑥 − 𝜇₁)²/𝜎₁² − 2𝜌(𝑥 − 𝜇₁)(𝑦 − 𝜇₂)/(𝜎₁𝜎₂) + (𝑦 − 𝜇₂)²/𝜎₂² ) ]
We start with a bivariate normal distribution pinned down by
𝜇 = \begin{bmatrix} 0 \\ 5 \end{bmatrix},   Σ = \begin{bmatrix} 5 & .2 \\ .2 & 1 \end{bmatrix}
μ1 = 0
μ2 = 5
σ1 = np.sqrt(5)
σ2 = np.sqrt(1)
ρ = .2 / np.sqrt(5 * 1)
Joint Distribution
Let’s plot the population joint density.
# %matplotlib notebook
fig = plt.figure()
ax = plt.axes(projection='3d')
# %matplotlib notebook
fig = plt.figure()
ax = plt.axes(projection='3d')
Next we can simulate from a built-in numpy function and calculate a sample marginal distribution from the sample mean
and variance.
μ= np.array([0, 5])
σ= np.array([[5, .2], [.2, 1]])
n = 1_000_000
data = np.random.multivariate_normal(μ, σ, n)
x = data[:, 0]
y = data[:, 1]
Marginal distribution
-0.0001294339629668653 2.2338665663818036
4.998845851883271 1.0005847916711021
Conditional distribution
[𝑋 | 𝑌 = 𝑦] ∼ 𝑁[𝜇_𝑋 + 𝜌𝜎_𝑋 (𝑦 − 𝜇_𝑌)/𝜎_𝑌,   𝜎²_𝑋(1 − 𝜌²)]

[𝑌 | 𝑋 = 𝑥] ∼ 𝑁[𝜇_𝑌 + 𝜌𝜎_𝑌 (𝑥 − 𝜇_𝑋)/𝜎_𝑋,   𝜎²_𝑌(1 − 𝜌²)]
Let’s approximate the joint density by discretizing and mapping the approximating joint density into a matrix.
We can compute the discretized marginal density by just using matrix algebra and noting that

Prob{𝑋 = 𝑖 | 𝑌 = 𝑗} = 𝑓ᵢⱼ / ∑ᵢ 𝑓ᵢⱼ

Fix 𝑦 = 0.

The conditional mean and variance are

𝔼[𝑋 | 𝑌 = 𝑗] = ∑ᵢ 𝑖 Prob{𝑋 = 𝑖 | 𝑌 = 𝑗} = ∑ᵢ 𝑖 𝑓ᵢⱼ / ∑ᵢ 𝑓ᵢⱼ

𝔻[𝑋 | 𝑌 = 𝑗] = ∑ᵢ (𝑖 − 𝜇_{𝑋|𝑌=𝑗})² 𝑓ᵢⱼ / ∑ᵢ 𝑓ᵢⱼ
Let’s draw from a normal distribution with above mean and variance and check how accurate our approximation is.
# discretized mean
μx = np.dot(x, z)
# sample
zz = np.random.normal(μx, σx, 1_000_000)
plt.hist(zz, bins=300, density=True, alpha=0.3, range=[-10, 10])
plt.show()
Fix 𝑥 = 1.
# sample
zz = np.random.normal(μy,σy,1_000_000)
plt.hist(zz, bins=100, density=True, alpha=0.3)
plt.show()
We compare with the analytically computed parameters and note that they are close.
print(μx, σx)
print(μ1 + ρ * σ1 * (0 - μ2) / σ2, np.sqrt(σ1**2 * (1 - ρ**2)))
print(μy, σy)
print(μ2 + ρ * σ2 * (1 - μ1) / σ1, np.sqrt(σ2**2 * (1 - ρ**2)))
-0.9997518414498433 2.22658413316977
-1.0 2.227105745132009
5.039999456960771 0.9959851265795592
5.04 0.9959919678390986
Let 𝑋, 𝑌 be two independent discrete random variables that take values in 𝑋,̄ 𝑌 ̄ , respectively.
Define a new random variable 𝑍 = 𝑋 + 𝑌 .
Evidently, 𝑍 takes values from 𝑍 ̄ defined as follows:
𝑋̄ = {0, 1, … , 𝐼 − 1}; 𝑓𝑖 = Prob{𝑋 = 𝑖}
𝑌 ̄ = {0, 1, … , 𝐽 − 1}; 𝑔𝑗 = Prob{𝑌 = 𝑗}
𝑍 ̄ = {0, 1, … , 𝐼 + 𝐽 − 2}; ℎ𝑘 = Prob{𝑋 + 𝑌 = 𝑘}
Independence of 𝑋 and 𝑌 implies that
ℎ𝑘 = Prob{𝑋 = 0, 𝑌 = 𝑘} + Prob{𝑋 = 1, 𝑌 = 𝑘 − 1} + … + Prob{𝑋 = 𝑘, 𝑌 = 0}
ℎ𝑘 = 𝑓0 𝑔𝑘 + 𝑓1 𝑔𝑘−1 + … + 𝑓𝑘−1 𝑔1 + 𝑓𝑘 𝑔0 for 𝑘 = 0, 1, … , 𝐼 + 𝐽 − 2
Thus, we have:
ℎₖ = ∑_{𝑖=0}^{𝑘} 𝑓ᵢ𝑔_{𝑘−𝑖} ≡ 𝑓 ∗ 𝑔
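Since this is exactly a discrete convolution, np.convolve can compute the distribution of 𝑍 = 𝑋 + 𝑌 directly; the two pmfs below are illustrative:

import numpy as np

f = np.array([0.1, 0.6, 0.3])   # pmf of X on {0, 1, 2}
g = np.array([0.5, 0.5])        # pmf of Y on {0, 1}

h = np.convolve(f, g)           # pmf of Z = X + Y on {0, 1, 2, 3}
print(h, h.sum())               # probabilities sum to 1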
Prob{𝑋 = 𝑖, 𝑌 = 𝑗} = 𝜌𝑖𝑗
where 𝑖 = 0, … , 𝐼 − 1; 𝑗 = 0, … , 𝐽 − 1 and
∑ ∑ 𝜌𝑖𝑗 = 1, 𝜌𝑖𝑗 ⩾ 0.
𝑖 𝑗
where
\begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix}
8.19 Coupling
𝑓𝑖𝑗 = Prob{𝑋 = 𝑖, 𝑌 = 𝑗}
𝑖 = 0, ⋯ 𝐼 − 1
𝑗 = 0, ⋯ 𝐽 − 1
stacked to an 𝐼 × 𝐽 matrix
e.g., for 𝐼 = 𝐽 = 2, the matrix is

\begin{bmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{bmatrix}
From the joint distribution, we have shown above that we obtain unique marginal distributions.
Now we’ll try to go in a reverse direction.
We’ll find that from two marginal distributions we can usually construct more than one joint distribution that verifies these marginals.

Each of these joint distributions is called a coupling of the two marginal distributions.
Let’s start with marginal distributions
Prob{𝑋 = 𝑖} = ∑ⱼ 𝑓ᵢⱼ = 𝜇ᵢ,   𝑖 = 0, ⋯ , 𝐼 − 1

Prob{𝑌 = 𝑗} = ∑ᵢ 𝑓ᵢⱼ = 𝜈ⱼ,   𝑗 = 0, ⋯ , 𝐽 − 1

Given two marginal distributions, 𝜇 for 𝑋 and 𝜈 for 𝑌, a joint distribution 𝑓ᵢⱼ with these marginals is said to be a coupling of 𝜇 and 𝜈.
Example:
Prob{𝑋 = 0} =1 − 𝑞 = 𝜇0
Prob{𝑋 = 1} =𝑞 = 𝜇1
Prob{𝑌 = 0} =1 − 𝑟 = 𝜈0
Prob{𝑌 = 1} =𝑟 = 𝜈1
where 0 ≤ 𝑞 < 𝑟 ≤ 1
One coupling of these two marginal distributions is the joint distribution

𝑓ᵢⱼ = \begin{bmatrix} (1-q)(1-r) & (1-q)r \\ q(1-r) & qr \end{bmatrix}

Another coupling is

𝑓ᵢⱼ = \begin{bmatrix} 1-r & r-q \\ 0 & q \end{bmatrix}

To verify that the second matrix is a well-defined joint distribution with the required marginals, note that

1 − 𝑟 + 𝑟 − 𝑞 + 𝑞 = 1

𝜇₀ = 1 − 𝑞,   𝜇₁ = 𝑞,   𝜈₀ = 1 − 𝑟,   𝜈₁ = 𝑟
Thus, our two proposed joint distributions have the same marginal distributions.
But the joint distributions differ.
Thus, multiple joint distributions [𝑓𝑖𝑗 ] can have the same marginals.
Remark:
• Couplings are important in optimal transport problems and in Markov processes.
In a reverse direction of logic, given univariate marginal distributions 𝐹1 (𝑥1 ), 𝐹2 (𝑥2 ), … , 𝐹𝑁 (𝑥𝑁 ) and a
copula function 𝐶(⋅), the function 𝐻(𝑥1 , 𝑥2 , … , 𝑥𝑁 ) = 𝐶(𝐹1 (𝑥1 ), 𝐹2 (𝑥2 ), … , 𝐹𝑁 (𝑥𝑁 )) is a coupling of
𝐹1 (𝑥1 ), 𝐹2 (𝑥2 ), … , 𝐹𝑁 (𝑥𝑁 ).
Thus, for given marginal distributions, we can use a copula function to determine a joint distribution when the associated
univariate random variables are not independent.
Copula functions are often used to characterize dependence of random variables.
Discrete marginal distribution
As mentioned above, for two given marginal distributions there can be more than one coupling.
For example, consider two random variables 𝑋, 𝑌 with distributions
Prob(𝑋 = 0) = 0.6,
Prob(𝑋 = 1) = 0.4,
Prob(𝑌 = 0) = 0.3,
Prob(𝑌 = 1) = 0.7,
For these two random variables there can be more than one coupling.
Let’s first generate X and Y.
# define parameters
mu = np.array([0.6, 0.4])
nu = np.array([0.3, 0.7])
# number of draws
draws = 1_000_000

# draw X and Y independently from their marginal distributions
x = np.random.choice([0, 1], size=draws, p=mu)
y = np.random.choice([0, 1], size=draws, p=nu)

# print output
print("distribution for x")
xmtb = pt.PrettyTable()
xmtb.field_names = ['x_value', 'x_prob']
xmtb.add_row([0, np.mean(x == 0)])
xmtb.add_row([1, np.mean(x == 1)])
print(xmtb)

print("distribution for y")
ymtb = pt.PrettyTable()
ymtb.field_names = ['y_value', 'y_prob']
ymtb.add_row([0, np.mean(y == 0)])
ymtb.add_row([1, np.mean(y == 1)])
print(ymtb)
distribution for x
+---------+----------+
| x_value | x_prob |
+---------+----------+
| 0 | 0.600175 |
| 1 | 0.399825 |
+---------+----------+
distribution for y
+---------+----------+
| y_value | y_prob |
+---------+----------+
| 0 | 0.300562 |
| 1 | 0.699438 |
+---------+----------+
Let’s now take our two marginal distributions, one for 𝑋, the other for 𝑌 , and construct two distinct couplings.
For the first joint distribution:
Prob(𝑋 = 𝑖, 𝑌 = 𝑗) = 𝑓𝑖𝑗
where
[𝑓ᵢⱼ] = \begin{bmatrix} 0.18 & 0.42 \\ 0.12 & 0.28 \end{bmatrix}
Let’s use Python to construct this joint distribution and then verify that its marginal distributions are what we want.
# define parameters
f1 = np.array([[0.18, 0.42], [0.12, 0.28]])
f1_cum = np.cumsum(f1)

# number of draws
draws1 = 1_000_000

# draw from the joint distribution via the inverse CDF
p = np.random.rand(draws1)
c1 = np.searchsorted(f1_cum, p)   # flattened cell index 0, 1, 2, 3
c1_x = c1 // 2                    # X value (row index)
c1_y = c1 % 2                     # Y value (column index)
c1_q_hat = np.mean(c1_x == 1)     # sample Prob(X = 1)
c1_r_hat = np.mean(c1_y == 1)     # sample Prob(Y = 1)

# print output
print("marginal distribution for x")
c1_x_mtb = pt.PrettyTable()
c1_x_mtb.field_names = ['c1_x_value', 'c1_x_prob']
c1_x_mtb.add_row([0, 1-c1_q_hat])
c1_x_mtb.add_row([1, c1_q_hat])
print(c1_x_mtb)
Now, let’s construct another joint distribution that is also a coupling of 𝑋 and 𝑌
[𝑓ᵢⱼ] = \begin{bmatrix} 0.3 & 0.3 \\ 0 & 0.4 \end{bmatrix}
# define parameters
f2 = np.array([[0.3, 0.3], [0, 0.4]])
f2_cum = np.cumsum(f2)

# number of draws
draws2 = 1_000_000

# draw from the joint distribution via the inverse CDF
p = np.random.rand(draws2)
c2 = np.searchsorted(f2_cum, p)   # flattened cell index 0, 1, 2, 3
c2_x = c2 // 2                    # X value (row index)
c2_y = c2 % 2                     # Y value (column index)
c2_q_hat = np.mean(c2_x == 1)     # sample Prob(X = 1)
c2_r_hat = np.mean(c2_y == 1)     # sample Prob(Y = 1)

# print output
print("marginal distribution for x")
c2_x_mtb = pt.PrettyTable()
c2_x_mtb.field_names = ['c2_x_value', 'c2_x_prob']
c2_x_mtb.add_row([0, 1-c2_q_hat])
c2_x_mtb.add_row([1, c2_q_hat])
print(c2_x_mtb)
We have verified that both joint distributions, 𝑐1 and 𝑐2 , have identical marginal distributions of 𝑋 and 𝑌 , respectively.
So they are both couplings of 𝑋 and 𝑌 .
Remark:
• This is a key formula for a theory of optimally predicting a time series.
NINE
Contents
9.1 Overview
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import cm
plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
Consider the second-order linear difference equation

𝑦ₜ = 𝛼₀ + 𝛼₁𝑦_{𝑡−1} + 𝛼₂𝑦_{𝑡−2},   𝑡 = 1, 2, … , 𝑇    (9.1)

where we assume that 𝑦₀ and 𝑦₋₁ are given numbers that we take as initial conditions.
In Samuelson’s model, 𝑦𝑡 stood for national income or perhaps a different measure of aggregate activity called gross
domestic product (GDP) at time 𝑡.
Equation (9.1) is called a second-order linear difference equation.
But actually, it is a collection of 𝑇 simultaneous linear equations in the 𝑇 variables 𝑦1 , 𝑦2 , … , 𝑦𝑇 .
Note: To be able to solve a second-order linear difference equation, we require two boundary conditions that can take
the form either of two initial conditions or two terminal conditions or possibly one of each.
𝐴𝑦 = 𝑏
where
𝑦 = [𝑦₁, 𝑦₂, ⋯ , 𝑦_𝑇]^𝑇
Evidently 𝑦 can be computed from
𝑦 = 𝐴−1 𝑏
T = 80

# parameters
α0 = 10.0
α1 = 1.53
α2 = -.9

# initial conditions (consistent with the vector b displayed below)
y_1 = 28.  # y_{-1}
y0 = 24.

# construct A and b
A = np.identity(T)

for i in range(T):
    if i-1 >= 0:
        A[i, i-1] = -α1
    if i-2 >= 0:
        A[i, i-2] = -α2

b = np.full(T, α0)
b[0] = α0 + α1 * y0 + α2 * y_1
b[1] = α0 + α2 * y0
Let’s look at the matrix 𝐴 and the vector 𝑏 for our example.
A, b
(array([[ 1. , 0. , 0. , ..., 0. , 0. , 0. ],
[-1.53, 1. , 0. , ..., 0. , 0. , 0. ],
[ 0.9 , -1.53, 1. , ..., 0. , 0. , 0. ],
...,
[ 0. , 0. , 0. , ..., 1. , 0. , 0. ],
[ 0. , 0. , 0. , ..., -1.53, 1. , 0. ],
[ 0. , 0. , 0. , ..., 0.9 , -1.53, 1. ]]),
array([ 21.52, -11.6 , 10. , 10. , 10. , 10. , 10. , 10. ,
10. , 10. , 10. , 10. , 10. , 10. , 10. , 10. ,
10. , 10. , 10. , 10. , 10. , 10. , 10. , 10. ,
10. , 10. , 10. , 10. , 10. , 10. , 10. , 10. ,
10. , 10. , 10. , 10. , 10. , 10. , 10. , 10. ,
10. , 10. , 10. , 10. , 10. , 10. , 10. , 10. ,
10. , 10. , 10. , 10. , 10. , 10. , 10. , 10. ,
10. , 10. , 10. , 10. , 10. , 10. , 10. , 10. ,
10. , 10. , 10. , 10. , 10. , 10. , 10. , 10. ,
10. , 10. , 10. , 10. , 10. , 10. , 10. , 10. ]))
A_inv = np.linalg.inv(A)
y = A_inv @ b
y_second_method = np.linalg.solve(A, b)
Here make sure the two methods give the same result, at least up to floating point precision:
np.allclose(y, y_second_method)
True
Note: In general, np.linalg.solve is more numerically stable than using np.linalg.inv directly. However,
stability is not an issue for this small example. Moreover, we will repeatedly use A_inv in what follows, so there is added
value in computing it directly.
plt.plot(np.arange(T)+1, y)
plt.xlabel('t')
plt.ylabel('y')
plt.show()
The steady state value 𝑦* of 𝑦ₜ is obtained by setting 𝑦ₜ = 𝑦_{𝑡−1} = 𝑦_{𝑡−2} = 𝑦* in (9.1), which yields

𝑦* = 𝛼₀ / (1 − 𝛼₁ − 𝛼₂)
If we set the initial values to 𝑦0 = 𝑦−1 = 𝑦∗ , then 𝑦𝑡 will be constant:
y_star = α0 / (1 - α1 - α2)
y_1_steady = y_star # y_{-1}
y0_steady = y_star

b_steady = np.full(T, α0)
b_steady[0] = α0 + α1 * y0_steady + α2 * y_1_steady
b_steady[1] = α0 + α2 * y0_steady

y_steady = A_inv @ b_steady

plt.plot(np.arange(T)+1, y_steady)
plt.xlabel('t')
plt.ylabel('y')
plt.show()
To generate some excitement, we’ll follow in the spirit of the great economists Eugen Slutsky and Ragnar Frisch and replace our original second-order difference equation with the following second-order stochastic linear difference equation:

𝑦ₜ = 𝛼₀ + 𝛼₁𝑦_{𝑡−1} + 𝛼₂𝑦_{𝑡−2} + 𝑢ₜ

where 𝑢ₜ ∼ 𝑁(0, 𝜎ᵤ²) and is IID, meaning independent and identically distributed.

We’ll stack these 𝑇 equations into a system cast in terms of matrix algebra.

Let’s define the random vector

𝑢 = [𝑢₁, 𝑢₂, ⋯ , 𝑢_𝑇]^𝑇
Where 𝐴, 𝑏, 𝑦 are defined as above, now assume that 𝑦 is governed by the system
𝐴𝑦 = 𝑏 + 𝑢
𝑦 = 𝐴−1 (𝑏 + 𝑢)
σu = 2.
u = np.random.normal(0, σu, size=T)
y = A_inv @ (b + u)
plt.plot(np.arange(T)+1, y)
plt.xlabel('t')
plt.ylabel('y')
plt.show()
The above time series looks a lot like (detrended) GDP series for a number of advanced countries in recent decades.
We can simulate 𝑁 paths.
N = 100

for i in range(N):
    col = cm.viridis(np.random.rand())  # Choose a random color from viridis
    u = np.random.normal(0, σu, size=T)
    y = A_inv @ (b + u)
    plt.plot(np.arange(T)+1, y, lw=0.5, color=col)

plt.xlabel('t')
plt.ylabel('y')
plt.show()
Also consider the case when 𝑦0 and 𝑦−1 are at steady state.
N = 100

for i in range(N):
    col = cm.viridis(np.random.rand())  # Choose a random color from viridis
    u = np.random.normal(0, σu, size=T)
    y_steady = A_inv @ (b_steady + u)
    plt.plot(np.arange(T)+1, y_steady, lw=0.5, color=col)

plt.xlabel('t')
plt.ylabel('y')
plt.show()
Samuelson’s model is backwards looking in the sense that we give it initial conditions and let it run.
Let’s now turn to model that is forward looking.
We apply similar linear algebra machinery to study a perfect foresight model widely used as a benchmark in macroeco-
nomics and finance.
As an example, we suppose that 𝑝𝑡 is the price of a stock and that 𝑦𝑡 is its dividend.
We assume that 𝑦ₜ is determined by a second-order difference equation that we analyzed just above, so that
𝑦 = 𝐴−1 (𝑏 + 𝑢)
We take the perfect foresight price of the stock to be the discounted present value of current and future dividends, 𝑝ₜ = ∑_{𝑗=0}^{𝑇−𝑡} 𝛽^𝑗 𝑦_{𝑡+𝑗}, or in matrix form 𝑝 = 𝐵𝑦, where 𝐵 is upper triangular with 𝐵ᵢⱼ = 𝛽^{𝑗−𝑖} for 𝑗 ≥ 𝑖.

β = .96

# construct B
B = np.zeros((T, T))

for i in range(T):
    B[i, i:] = β ** np.arange(0, T-i)

σu = 0.
u = np.random.normal(0, σu, size=T)
y = A_inv @ (b + u)
y_steady = A_inv @ (b_steady + u)

p = B @ y
plt.show()
Can you explain why the trend of the price is downward over time?
Also consider the case when 𝑦0 and 𝑦−1 are at the steady state.
p_steady = B @ y_steady
plt.show()
TEN
Contents
10.1 Overview
This lecture illustrates two of the most important theorems of probability and statistics: The law of large numbers (LLN)
and the central limit theorem (CLT).
These beautiful theorems lie behind many of the most fundamental results in econometrics and quantitative economic
modeling.
The lecture is based around simulations that show the LLN and CLT in action.
We also demonstrate how the LLN and CLT break down when the assumptions they are based on do not hold.
In addition, we examine several useful extensions of the classical theorems, such as
• The delta method, for smooth functions of random variables.
• The multivariate case.
Some of these extensions are presented as exercises.
We’ll need the following imports:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
import random
import numpy as np
from scipy.stats import t, beta, lognorm, expon, gamma, uniform, cauchy
10.2 Relationships
10.3 LLN
We begin with the law of large numbers, which tells us when sample averages will converge to their population means.
The classical law of large numbers concerns independent and identically distributed (IID) random variables.
Here is the strongest version of the classical LLN, known as Kolmogorov’s strong law.
Let 𝑋1 , … , 𝑋𝑛 be independent and identically distributed scalar random variables, with common distribution 𝐹 .
When it exists, let 𝜇 denote the common mean of this sample:
𝜇 ∶= 𝔼𝑋 = ∫ 𝑥𝐹 (𝑑𝑥)
In addition, let the sample mean be
$$\bar X_n := \frac{1}{n} \sum_{i=1}^n X_i$$
Kolmogorov's strong law states that, if $\mathbb E|X|$ is finite, then
$$\mathbb P \left\{ \bar X_n \to \mu \ \text{as} \ n \to \infty \right\} = 1 \tag{10.1}$$
10.3.2 Proof
The proof of Kolmogorov’s strong law is nontrivial – see, for example, theorem 8.3.5 of [Dud02].
On the other hand, we can prove a weaker version of the LLN very easily and still get most of the intuition.
The version we prove is as follows: If $X_1, \ldots, X_n$ is IID with $\mathbb E X_i^2 < \infty$, then, for any $\epsilon > 0$, we have
$$\mathbb P \left\{ |\bar X_n - \mu| \geq \epsilon \right\} \to 0 \quad \text{as} \quad n \to \infty \tag{10.2}$$
(This version is weaker because we claim only convergence in probability rather than almost sure convergence, and assume a finite second moment.)
To see that this is so, fix 𝜖 > 0, and let 𝜎2 be the variance of each 𝑋𝑖 .
Recall the Chebyshev inequality, which tells us that
$$\mathbb P \left\{ |\bar X_n - \mu| \geq \epsilon \right\} \leq \frac{\mathbb E[(\bar X_n - \mu)^2]}{\epsilon^2} \tag{10.3}$$
Now observe that
$$
\begin{aligned}
\mathbb E[(\bar X_n - \mu)^2]
&= \mathbb E \left\{ \left[ \frac{1}{n} \sum_{i=1}^n (X_i - \mu) \right]^2 \right\} \\
&= \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \mathbb E (X_i - \mu)(X_j - \mu) \\
&= \frac{1}{n^2} \sum_{i=1}^n \mathbb E (X_i - \mu)^2 \\
&= \frac{\sigma^2}{n}
\end{aligned}
$$
Here the crucial step is at the third equality, which follows from independence.
Independence means that if 𝑖 ≠ 𝑗, then the covariance term 𝔼(𝑋𝑖 − 𝜇)(𝑋𝑗 − 𝜇) drops out.
As a result, 𝑛2 − 𝑛 terms vanish, leading us to a final expression that goes to zero in 𝑛.
Combining our last result with (10.3), we come to the estimate
$$\mathbb P \left\{ |\bar X_n - \mu| \geq \epsilon \right\} \leq \frac{\sigma^2}{n \epsilon^2} \tag{10.4}$$
The claim in (10.2) is now clear.
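As a quick sanity check (not part of the original lecture code), here is a minimal simulation that compares the empirical frequency of large deviations with the bound in (10.4), assuming standard normal draws so that $\mu = 0$ and $\sigma^2 = 1$:

import numpy as np

np.random.seed(0)
n, num_reps, ϵ = 100, 50_000, 0.25
σ = 1.0   # standard normal draws

# num_reps independent realizations of the sample mean of n draws
sample_means = np.random.randn(num_reps, n).mean(axis=1)

empirical = np.mean(np.abs(sample_means) >= ϵ)
bound = σ**2 / (n * ϵ**2)
print(f"empirical frequency = {empirical:.4f}, Chebyshev bound = {bound:.4f}")

The empirical frequency comes out well below the bound, as (10.4) requires.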
Of course, if the sequence 𝑋1 , … , 𝑋𝑛 is correlated, then the cross-product terms 𝔼(𝑋𝑖 − 𝜇)(𝑋𝑗 − 𝜇) are not necessarily
zero.
While this doesn’t mean that the same line of argument is impossible, it does mean that if we want a similar result then
the covariances should be “almost zero” for “most” of these terms.
In a long sequence, this would be true if, for example, 𝔼(𝑋𝑖 − 𝜇)(𝑋𝑗 − 𝜇) approached zero when the difference between
𝑖 and 𝑗 became large.
In other words, the LLN can still work if the sequence 𝑋1 , … , 𝑋𝑛 has a kind of “asymptotic independence”, in the sense
that correlation falls to zero as variables become further apart in the sequence.
This idea is very important in time series analysis, and we’ll come across it again soon enough.
10.3.3 Illustration
Let’s now illustrate the classical IID law of large numbers using simulation.
In particular, we aim to generate some sequences of IID random variables and plot the evolution of 𝑋̄ 𝑛 as 𝑛 increases.
Below is a figure that does just this.
It shows IID observations from three different distributions and plots 𝑋̄ 𝑛 against 𝑛 in each case.
The dots represent the underlying observations 𝑋𝑖 for 𝑖 = 1, … , 100.
In each of the three cases, convergence of 𝑋̄ 𝑛 to 𝜇 occurs as predicted
n = 100

# A small collection of distributions to sample from
distributions = {"student's t with 10 degrees of freedom": t(10),
                 "β(2, 2)": beta(2, 2),
                 "lognormal LN(0, 1/2)": lognorm(0.5)}

fig, axes = plt.subplots(3, 1, figsize=(10, 20))
plt.subplots_adjust(hspace=0.5)
legend_args = {'loc': 'upper right'}

for ax in axes:
    # Choose a randomly selected distribution
    name = random.choice(list(distributions.keys()))
    distribution = distributions.pop(name)

    # Generate n draws and the running sample mean
    data = distribution.rvs(n)
    sample_mean = np.cumsum(data) / np.arange(1, n+1)

    # Plot
    ax.plot(list(range(n)), data, 'o', color='grey', alpha=0.5)
    axlabel = '$\\bar X_n$ for $X_i \sim$' + name
    ax.plot(list(range(n)), sample_mean, 'g-', lw=3, alpha=0.6, label=axlabel)
    m = distribution.mean()
    ax.plot(list(range(n)), [m] * n, 'k--', lw=1.5, label='$\mu$')
    ax.vlines(list(range(n)), m, data, lw=0.2)
    ax.legend(**legend_args, fontsize=12)

plt.show()
The three distributions are chosen at random from a selection stored in the dictionary distributions.
10.4 CLT
Next, we turn to the central limit theorem, which tells us about the distribution of the deviation between sample averages
and population means.
The central limit theorem is one of the most remarkable results in all of mathematics.
In the classical IID setting, it tells us the following:
If the sequence 𝑋1 , … , 𝑋𝑛 is IID, with common mean 𝜇 and common variance 𝜎2 ∈ (0, ∞), then
$$\sqrt{n} (\bar X_n - \mu) \stackrel{d}{\to} N(0, \sigma^2) \quad \text{as} \quad n \to \infty \tag{10.5}$$
Here $\stackrel{d}{\to} N(0, \sigma^2)$ indicates convergence in distribution to a centered (i.e., zero mean) normal with standard deviation $\sigma$.
10.4.2 Intuition
The striking implication of the CLT is that for any distribution with finite second moment, the simple operation of adding
independent copies always leads to a Gaussian curve.
A relatively simple proof of the central limit theorem can be obtained by working with characteristic functions (see, e.g.,
theorem 9.5.6 of [Dud02]).
The proof is elegant but almost anticlimactic, and it provides surprisingly little intuition.
In fact, all of the proofs of the CLT that we know are similar in this respect.
Why does adding independent copies produce a bell-shaped distribution?
Part of the answer can be obtained by investigating the addition of independent Bernoulli random variables.
In particular, let 𝑋𝑖 be binary, with ℙ{𝑋𝑖 = 0} = ℙ{𝑋𝑖 = 1} = 0.5, and let 𝑋1 , … , 𝑋𝑛 be independent.
Think of $X_i = 1$ as a "success", so that $Y_n = \sum_{i=1}^n X_i$ is the number of successes in $n$ trials.
The next figure plots the probability mass function of 𝑌𝑛 for 𝑛 = 1, 2, 4, 8
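The code cell that produces this figure is not reproduced in this extract; the following is a minimal sketch that generates the same plot from the binomial pmf in scipy:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
plt.subplots_adjust(hspace=0.4)
ns = [1, 2, 4, 8]

for ax, n in zip(axes.flatten(), ns):
    # pmf of Y_n = number of successes in n fair trials
    ks = np.arange(n + 1)
    ax.bar(ks, binom.pmf(ks, n, 0.5), alpha=0.6, width=0.4)
    ax.set_xticks(ks)
    ax.set_title(f'$n = {n}$')

plt.show()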
When 𝑛 = 1, the distribution is flat — one success or no successes have the same probability.
When 𝑛 = 2 we can either have 0, 1 or 2 successes.
Notice the peak in probability mass at the mid-point 𝑘 = 1.
The reason is that there are more ways to get 1 success (“fail then succeed” or “succeed then fail”) than to get zero or two
successes.
Moreover, the two trials are independent, so the outcomes “fail then succeed” and “succeed then fail” are just as likely as
the outcomes “fail then fail” and “succeed then succeed”.
(If there was positive correlation, say, then “succeed then fail” would be less likely than “succeed then succeed”)
Here, already we have the essence of the CLT: addition under independence leads probability mass to pile up in the middle
and thin out at the tails.
For 𝑛 = 4 and 𝑛 = 8 we again get a peak at the “middle” value (halfway between the minimum and the maximum
possible value).
The intuition is the same — there are simply more ways to get these middle outcomes.
If we continue, the bell-shaped curve becomes even more pronounced.
We are witnessing the binomial approximation of the normal distribution.
10.4.3 Simulation 1
Since the CLT seems almost magical, running simulations that verify its implications is one good way to build intuition.
To this end, we now perform the following simulation
1. Choose an arbitrary distribution 𝐹 for the underlying observations 𝑋𝑖 .
2. Generate independent draws of $Y_n := \sqrt{n} (\bar X_n - \mu)$.
3. Use these draws to compute some measure of their distribution — such as a histogram.
4. Compare the latter to 𝑁 (0, 𝜎2 ).
Here’s some code that does exactly this for the exponential distribution 𝐹 (𝑥) = 1 − 𝑒−𝜆𝑥 .
(Please experiment with other choices of 𝐹 , but remember that, to conform with the conditions of the CLT, the distribution
must have a finite second moment.)
from scipy.stats import norm

# Set parameters
n = 250          # Choice of n
k = 100000       # Number of draws of Y_n
distribution = expon(scale=2)  # Exponential distribution, λ = 1/2
μ, s = distribution.mean(), distribution.std()

# Draw underlying RVs; each row is one draw of X_1, ..., X_n
data = distribution.rvs((k, n))
# Compute the mean of each row, giving k draws of the sample mean
sample_means = data.mean(axis=1)
# Generate observations of Y_n
Y = np.sqrt(n) * (sample_means - μ)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
xmin, xmax = -3 * s, 3 * s
ax.set_xlim(xmin, xmax)
ax.hist(Y, bins=60, alpha=0.5, density=True)
xgrid = np.linspace(xmin, xmax, 200)
ax.plot(xgrid, norm.pdf(xgrid, scale=s), 'k-', lw=2, label='$N(0, \sigma^2)$')
ax.legend()
plt.show()
Notice the absence of for loops — every operation is vectorized, meaning that the major calculations are all shifted to
highly optimized C code.
The fit to the normal density is already tight and can be further improved by increasing n.
You can also experiment with other specifications of 𝐹 .
10.4.4 Simulation 2
Our next simulation is somewhat like the first, except that we aim to track the distribution of $Y_n := \sqrt{n} (\bar X_n - \mu)$ as $n$ increases.
In the simulation, we’ll be working with random variables having 𝜇 = 0.
Thus, when 𝑛 = 1, we have 𝑌1 = 𝑋1 , so the first distribution is just the distribution of the underlying random variable.
For $n = 2$, the distribution of $Y_2$ is that of $(X_1 + X_2)/\sqrt{2}$, and so on.
What we expect is that, regardless of the distribution of the underlying random variable, the distribution of 𝑌𝑛 will smooth
out into a bell-shaped curve.
The next figure shows this process for 𝑋𝑖 ∼ 𝑓, where 𝑓 was specified as the convex combination of three different beta
densities.
(Taking a convex combination is an easy way to produce an irregular shape for 𝑓.)
In the figure, the closest density is that of 𝑌1 , while the furthest is that of 𝑌5
from scipy.stats import gaussian_kde

beta_dist = beta(2, 2)

def gen_x_draws(k):
    """
    Returns a flat array containing k independent draws from the
    distribution of X, the underlying random variable.  This distribution
    is a convex combination of three beta distributions, as described above.
    """
    bdraws = beta_dist.rvs((3, k))
    # Shift each row so that it represents a different distribution
    bdraws[0, :] -= 0.5
    bdraws[1, :] += 0.6
    bdraws[2, :] -= 1.1
    # Set X[i] = bdraws[j, i], where j is drawn uniformly from {0, 1, 2}
    js = np.random.randint(0, 3, size=k)
    X = bdraws[js, np.arange(k)]
    # Demean so that X has zero mean
    return X - X.mean()

nmax = 5
reps = 100000
ns = list(range(1, nmax + 1))

# Build Y[:, n-1] = sqrt(n) * (X̄_n - μ) from cumulative sums of the draws
Z = np.column_stack([gen_x_draws(reps) for _ in ns])
Y = Z.cumsum(axis=1) / np.sqrt(ns)

# Plot
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(projection='3d')   # gca(projection=...) is deprecated
a, b = -3, 3
gs = 100
xs = np.linspace(a, b, gs)

# Build verts
greys = np.linspace(0.3, 0.7, nmax)
verts = []
for n in ns:
    density = gaussian_kde(Y[:, n-1])
    ys = density(xs)
    verts.append(list(zip(xs, ys)))
# (The rendering of verts as a 3D polygon collection is omitted in this extract.)
The law of large numbers and central limit theorem work just as nicely in multidimensional settings.
To state the results, let’s recall some elementary facts about random vectors.
A random vector X is just a sequence of 𝑘 random variables (𝑋1 , … , 𝑋𝑘 ).
Each realization of X is an element of ℝ𝑘 .
A collection of random vectors $\mathbf X_1, \ldots, \mathbf X_n$ is called independent if, given any $n$ vectors $\mathbf x_1, \ldots, \mathbf x_n$ in $\mathbb R^k$, we have
$$\mathbb P \{\mathbf X_1 \leq \mathbf x_1, \ldots, \mathbf X_n \leq \mathbf x_n\} = \mathbb P \{\mathbf X_1 \leq \mathbf x_1\} \times \cdots \times \mathbb P \{\mathbf X_n \leq \mathbf x_n\}$$
(The vector inequalities are understood element-wise.)
The expectation of $\mathbf X$ is defined element-wise:
$$\mathbb E[\mathbf X] := \begin{pmatrix} \mathbb E[X_1] \\ \mathbb E[X_2] \\ \vdots \\ \mathbb E[X_k] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix} =: \mu$$
The sample mean vector is
$$\bar{\mathbf X}_n := \frac{1}{n} \sum_{i=1}^n \mathbf X_i$$
The vector law of large numbers states that, if $\mathbb E \|\mathbf X_i\|$ is finite, then
$$\mathbb P \left\{ \bar{\mathbf X}_n \to \mu \ \text{as} \ n \to \infty \right\} = 1 \tag{10.6}$$
10.5 Exercises
Exercise 10.5.1
One very useful consequence of the central limit theorem is as follows.
Assume the conditions of the CLT as stated above.
If $g : \mathbb R \to \mathbb R$ is differentiable at $\mu$ and $g'(\mu) \neq 0$, then
$$\sqrt{n} \{ g(\bar X_n) - g(\mu) \} \stackrel{d}{\to} N(0, g'(\mu)^2 \sigma^2) \quad \text{as} \quad n \to \infty \tag{10.8}$$
This theorem is used frequently in statistics to obtain the asymptotic distribution of estimators — many of which can be
expressed as functions of sample means.
(These kinds of results are often said to use the “delta method”.)
The proof is based on a Taylor expansion of 𝑔 around the point 𝜇.
Taking the result as given, let the distribution 𝐹 of each 𝑋𝑖 be uniform on [0, 𝜋/2] and let 𝑔(𝑥) = sin(𝑥).
Derive the asymptotic distribution of $\sqrt{n} \{ g(\bar X_n) - g(\mu) \}$ and illustrate convergence in the same spirit as the program discussed above.
What happens when you replace [0, 𝜋/2] with [0, 𝜋]?
What is the source of the problem?
Exercise 10.5.2
Here’s a result that’s often used in developing statistical tests, and is connected to the multivariate central limit theorem.
If you study econometric theory, you will see this result used again and again.
Assume the setting of the multivariate CLT discussed above, so that
1. X1 , … , X𝑛 is a sequence of IID random vectors, each taking values in ℝ𝑘 .
2. 𝜇 ∶= 𝔼[X𝑖 ], and Σ is the variance-covariance matrix of X𝑖 .
3. The convergence
$$\sqrt{n} (\bar{\mathbf X}_n - \mu) \stackrel{d}{\to} N(0, \Sigma) \tag{10.9}$$
is valid.
In a statistical setting, one often wants the right-hand side to be standard normal so that confidence intervals are easily
computed.
This normalization can be achieved on the basis of three observations.
First, if X is a random vector in ℝ𝑘 and A is constant and 𝑘 × 𝑘, then
Var[AX] = A Var[X]A′
Second, by the continuous mapping theorem, if $\mathbf Z_n \stackrel{d}{\to} \mathbf Z$ in $\mathbb R^k$ and $\mathbf A$ is constant and $k \times k$, then
$$\mathbf A \mathbf Z_n \stackrel{d}{\to} \mathbf A \mathbf Z$$
Third, if S is a 𝑘 × 𝑘 symmetric positive definite matrix, then there exists a symmetric positive definite matrix Q, called
the inverse square root of S, such that
QSQ′ = I
Applying the continuous mapping theorem one more time tells us that
$$\| \mathbf Z_n \|^2 \stackrel{d}{\to} \| \mathbf Z \|^2$$
Your task is to illustrate this convergence numerically when each random vector takes the form
$$\mathbf X_i := \begin{pmatrix} W_i \\ U_i + W_i \end{pmatrix}$$
where
• each 𝑊𝑖 is an IID draw from the uniform distribution on [−1, 1].
• each 𝑈𝑖 is an IID draw from the uniform distribution on [−2, 2].
• 𝑈𝑖 and 𝑊𝑖 are independent of each other.
Hints:
1. scipy.linalg.sqrtm(A) computes the square root of A. You still need to invert it.
2. You should be able to work out Σ from the preceding information.
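As a quick illustration of the first hint (a sketch only, not the exercise solution), the inverse square root can be computed and checked as follows with an arbitrary symmetric positive definite matrix:

import numpy as np
from scipy.linalg import sqrtm, inv

S = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # any symmetric positive definite matrix

Q = inv(sqrtm(S))            # inverse square root of S

# Q S Q' should be (approximately) the identity matrix
print(np.round(np.real_if_close(Q @ S @ Q.T), 10))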
10.6 Solutions
"""
Illustrates the delta method, a consequence of the central limit theorem.
"""
# Set parameters
n = 250
replications = 100000
(continues on next page)
g = np.sin
g_prime = np.cos
# Plot
asymptotic_sd = g_prime(μ) * s
fig, ax = plt.subplots(figsize=(10, 6))
xmin = -3 * g_prime(μ) * s
xmax = -xmin
ax.set_xlim(xmin, xmax)
ax.hist(error_obs, bins=60, alpha=0.5, density=True)
xgrid = np.linspace(xmin, xmax, 200)
lb = "$N(0, g'(\mu)^2 \sigma^2)$"
ax.plot(xgrid, norm.pdf(xgrid, scale=asymptotic_sd), 'k-', lw=2, label=lb)
ax.legend()
plt.show()
What happens when you replace [0, 𝜋/2] with [0, 𝜋]?
In this case, the mean 𝜇 of this distribution is 𝜋/2, and since 𝑔′ = cos, we have 𝑔′ (𝜇) = 0.
Hence the conditions of the delta theorem are not satisfied.
Since linear combinations of normal random variables are normal, the vector $\mathbf Q \mathbf Y$ is also normal.
Its mean is clearly $\mathbf 0$, and its variance-covariance matrix is
$$\mathrm{Var}[\mathbf Q \mathbf Y] = \mathbf Q \, \mathrm{Var}[\mathbf Y] \, \mathbf Q' = \mathbf Q \Sigma \mathbf Q' = \mathbf I$$
In conclusion, $\mathbf Q \mathbf Y_n \stackrel{d}{\to} \mathbf Q \mathbf Y \sim N(\mathbf 0, \mathbf I)$, which is what we aimed to show.
Now we turn to the simulation exercise.
Our solution is as follows
from scipy.linalg import inv, sqrtm

# Set parameters
n = 250
replications = 50000
dw = uniform(loc=-1, scale=2)  # Uniform(-1, 1)
du = uniform(loc=-2, scale=4)  # Uniform(-2, 2)
sw, su = dw.std(), du.std()
vw, vu = sw**2, su**2
Σ = ((vw, vw), (vw, vw + vu))
Σ = np.array(Σ)

# Compute Σ^{-1/2}
Q = inv(sqrtm(Σ))

# Plot
# (The simulation of Q Y_n and the comparison with the χ²(2) density are in
# the elided remainder of this cell.)
fig, ax = plt.subplots(figsize=(10, 6))
ELEVEN

TWO MEANINGS OF PROBABILITY

11.1 Overview

Start by watching the following video on the frequentist approach to constructing confidence intervals:
https://youtu.be/8JIe_cz6qGA
After you watch that video, please watch the following video on the Bayesian approach to constructing coverage intervals
https://youtu.be/Pahyv9i_X2k
After you are familiar with the material in these videos, this lecture uses the Socratic method to help consolidate your understanding of the different questions that are answered by
• a frequentist confidence interval
• a Bayesian coverage interval
We do this by inviting you to write some Python code.
It would be especially useful if you tried doing this after each question that we pose for you, before proceeding to read
the rest of the lecture.
We provide our own answers as the lecture unfolds, but you’ll learn more if you try writing your own code before reading
and running ours.
In addition to what's in Anaconda, this lecture deploys the prettytable library; the code for answering the questions uses the following imports:
import numpy as np
import pandas as pd
import prettytable as pt
import matplotlib.pyplot as plt
from scipy.stats import binom
Empowered with these Python tools, we’ll now explore the two meanings described above.
$$\textrm{Prob}(X = k \mid \theta) = \frac{n!}{k!(n-k)!} \, \theta^k (1-\theta)^{n-k}$$
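For concreteness, here is a quick check that the formula matches scipy's binomial pmf; the particular values of θ, n and k below are illustrative assumptions, not values taken from the lecture:

from math import comb
from scipy.stats import binom

θ, n, k = 0.7, 20, 14   # illustrative values only

manual = comb(n, k) * θ**k * (1 - θ)**(n - k)
print(manual, binom.pmf(k, n, θ))   # the two numbers agree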
Exercise 11.2.1
1. Please write a Python class to compute 𝑓𝑘𝐼
2. Please use your code to compute 𝑓𝑘𝐼 , 𝑘 = 0, … , 𝑛 and compare them to Prob(𝑋 = 𝑘|𝜃) for various values of 𝜃, 𝑛
and 𝐼
3. With the Law of Large Numbers in mind, use your code to say something about what happens to the accuracy of the frequentist estimates as the number of independent sequences 𝐼 grows
class frequentist:

    def __init__(self, θ, n, I):
        '''
        initialization
        -----------------
        parameters:
        θ : probability that one toss of a coin will be a head with Y = 1
        n : number of independent flips in each independent sequence of draws
        I : number of independent sequences of draws
        '''
        self.θ, self.n, self.I = θ, n, I

    def binomial(self, k):
        '''compute the theoretical probability Prob(X = k | θ)'''
        θ, n = self.θ, self.n
        self.k = k
        self.P = binom.pmf(k, n, θ)

    def draw(self):
        '''simulate I independent sequences of n coin flips each'''
        θ, n, I = self.θ, self.n, self.I
        self.Y = (np.random.rand(I, n) < θ).astype(int)

    def compute_fk(self, kk):
        '''compute the frequentist estimate of Prob(X = kk | θ)'''
        Y, I = self.Y, self.I
        K = np.sum(Y, 1)               # number of heads in each sequence
        f_kI = np.sum(K == kk) / I     # fraction of sequences with kk heads
        self.f_kI = f_kI
        self.kk = kk

    def compare(self):
        '''tabulate theoretical vs frequentist probabilities for k = 1, ..., n'''
        n = self.n
        comp = pt.PrettyTable()
        comp.field_names = ['k', 'Theoretical', 'Frequentist']
        self.draw()
        for i in range(n):
            self.binomial(i+1)
            self.compute_fk(i+1)
            comp.add_row([i+1, self.P, self.f_kI])
        print(comp)
# θ, n and I are set in an elided cell above; the table below is consistent
# with θ = 0.7, n = 20 and I = 1_000_000
freq = frequentist(θ, n, I)
freq.compare()
+----+------------------------+-------------+
| k | Theoretical | Frequentist |
+----+------------------------+-------------+
| 1 | 1.6271660538000033e-09 | 0.0 |
| 2 | 3.606884752589999e-08 | 0.0 |
| 3 | 5.04963865362601e-07 | 1e-06 |
| 4 | 5.007558331512455e-06 | 6e-06 |
| 5 | 3.7389768875293014e-05 | 2.8e-05 |
| 6 | 0.00021810698510587546 | 0.000214 |
| 7 | 0.001017832597160754 | 0.000992 |
| 8 | 0.003859281930901185 | 0.003738 |
| 9 | 0.012006654896137007 | 0.011793 |
| 10 | 0.030817080900085007 | 0.030752 |
| 11 | 0.065369565545635 | 0.06523 |
| 12 | 0.11439673970486108 | 0.114338 |
| 13 | 0.1642619852172365 | 0.165208 |
| 14 | 0.19163898275344246 | 0.191348 |
| 15 | 0.17886305056987967 | 0.178536 |
| 16 | 0.1304209743738704 | 0.130838 |
| 17 | 0.07160367220526209 | 0.071312 |
| 18 | 0.027845872524268643 | 0.027988 |
| 19 | 0.006839337111223871 | 0.006858 |
| 20 | 0.0007979226629761189 | 0.00082 |
+----+------------------------+-------------+
From the table above, can you see the law of large numbers at work?
From the above graphs, we can see that 𝐼, the number of independent sequences, plays an important role.
When 𝐼 becomes larger, the difference between theoretical probability and frequentist estimate becomes smaller.
Also, as long as 𝐼 is large enough, changing 𝜃 or 𝑛 does not substantially change the accuracy of the observed fraction as
an approximation of 𝜃.
The Law of Large Numbers is at work here.
For each draw of an independent sequence, $\textrm{Prob}(X_i = k \mid \theta)$ is the same, so aggregating all draws forms an i.i.d. sequence of a binary random variable $\rho_{k,i}$, $i = 1, 2, \ldots, I$, with a mean of $\textrm{Prob}(X = k \mid \theta)$ and a variance of $\textrm{Prob}(X = k \mid \theta)\,\bigl(1 - \textrm{Prob}(X = k \mid \theta)\bigr)$.
By the Law of Large Numbers, the average of the $\rho_{k,i}$ converges to
$$E[\rho_{k,i}] = \textrm{Prob}(X = k \mid \theta) = \frac{n!}{k!(n-k)!} \, \theta^k (1-\theta)^{n-k}$$
as $I$ goes to infinity.
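A minimal sketch of this convergence follows; the values of θ, n and k are illustrative assumptions:

import numpy as np
from scipy.stats import binom

θ, n, k = 0.7, 20, 14          # illustrative values only
theoretical = binom.pmf(k, n, θ)

for I in [1_000, 10_000, 100_000, 1_000_000]:
    # number of heads in each of I independent sequences of n flips
    K = np.random.binomial(n, θ, size=I)
    f_kI = np.mean(K == k)     # frequentist estimate of Prob(X = k | θ)
    print(f"I = {I:>9,}: |f_kI - theoretical| = {abs(f_kI - theoretical):.5f}")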
$$P(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$$
where $B(\alpha,\beta)$ is a beta function, so that $P(\theta)$ is a beta distribution with parameters $\alpha, \beta$.
Exercise 11.3.1
a) Please write down the likelihood function for a sample of length 𝑛 from a binomial distribution with parameter 𝜃.
b) Please write down the posterior distribution for 𝜃 after observing one flip of the coin.
c) Please pretend that the true value of 𝜃 = .4 and that someone who doesn’t know this has a beta prior distribution with
parameters with 𝛽 = 𝛼 = .5.
d) Please write a Python class to simulate this person’s personal posterior distribution for 𝜃 for a single sequence of 𝑛
draws.
e) Please plot the posterior distribution for 𝜃 as a function of 𝜃 as 𝑛 grows as 1, 2, ….
f) For various 𝑛’s, please describe and compute a Bayesian coverage interval for the interval [.45, .55].
g) Please tell what question a Bayesian coverage interval answers.
h) Please compute the posterior probability that 𝜃 ∈ [.45, .55] for various values of sample size 𝑛.
i) Please use your Python class to study what happens to the posterior distribution as 𝑛 → +∞, again assuming that the
true value of 𝜃 = .4, though it is unknown to the person doing the updating via Bayes’ Law.
b) Please write the posterior distribution for 𝜃 after observing one flip of our coin.
The prior distribution is
$$\textrm{Prob}(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$$
We can derive the posterior distribution for $\theta$ via Bayes' Law:
$$
\begin{aligned}
\textrm{Prob}(\theta \mid Y) &= \frac{\textrm{Prob}(Y \mid \theta)\,\textrm{Prob}(\theta)}{\textrm{Prob}(Y)} \\
&= \frac{\textrm{Prob}(Y \mid \theta)\,\textrm{Prob}(\theta)}{\int_0^1 \textrm{Prob}(Y \mid \theta)\,\textrm{Prob}(\theta)\, d\theta} \\
&= \frac{\theta^Y (1-\theta)^{1-Y} \, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}}{\int_0^1 \theta^Y (1-\theta)^{1-Y} \, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)} \, d\theta} \\
&= \frac{\theta^{Y+\alpha-1}(1-\theta)^{1-Y+\beta-1}}{\int_0^1 \theta^{Y+\alpha-1}(1-\theta)^{1-Y+\beta-1}\, d\theta}
\end{aligned}
$$
which means that
$$\textrm{Prob}(\theta \mid Y) \sim \textrm{Beta}(\alpha + Y, \beta + (1 - Y))$$
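As a minimal numerical sketch of this conjugacy (the prior parameters, true θ and sample length below are illustrative assumptions): after a sequence of flips the posterior is simply a Beta distribution whose parameters add the observed heads and tails.

import numpy as np
import scipy.stats as st

α, β, θ_true = 0.5, 0.5, 0.4          # illustrative prior parameters and true θ

draws = (np.random.rand(50) < θ_true).astype(int)
heads, tails = draws.sum(), len(draws) - draws.sum()

posterior = st.beta(α + heads, β + tails)   # Beta(α + #heads, β + #tails)
print(posterior.mean(), posterior.std())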
c) Please pretend that the true value of 𝜃 = .4 and that someone who doesn’t know this has a beta prior with 𝛽 = 𝛼 = .5.
d) Please write a Python class to simulate this person’s personal posterior distribution for 𝜃 for a single sequence of 𝑛
draws.
import scipy.stats as st


class Bayesian:

    def __init__(self, θ=0.4, n=1_000_000, α=0.5, β=0.5):
        """
        Parameters
        ----------
        θ : float, ranging from [0, 1].
            probability that one toss of a coin generates a head with Y = 1
        n : int.
            number of independent flips in an independent sequence of draws
        α&β : int or float.
            parameters of the prior distribution on θ
            (defaults chosen to match the exercise: true θ = .4, α = β = .5)
        """
        self.θ, self.n, self.α, self.β = θ, n, α, β
        self.prior = st.beta(α, β)

    def draw(self):
        """
        simulate a single sequence of draws of length n, given probability θ
        """
        array = np.random.rand(self.n)
        self.draws = (array < self.θ).astype(int)

    def form_single_posterior(self, step_num):
        """
        form a posterior distribution after observing the first step_num draws

        Parameters
        ----------
        step_num: int.
            number of steps observed to form a posterior distribution

        Returns
        ------
        the posterior distribution for sake of plotting in the subsequent steps
        """
        heads_num = self.draws[:step_num].sum()
        tails_num = step_num - heads_num

        return st.beta(self.α + heads_num, self.β + tails_num)

    def form_posterior_series(self, num_obs_list):
        """
        form a series of posterior distributions that form after observing
        different numbers of draws.

        Parameters
        ----------
        num_obs_list: a list of int.
            a list of the number of observations used to form a series of
            posterior distributions.
        """
        self.posterior_list = []

        for num in num_obs_list:
            self.posterior_list.append(self.form_single_posterior(num))
# num_list, the list of sample sizes at which posteriors are formed
# (1, 2, 3, 4, 5, 10, 20, ...), is defined in an elided cell above
Bay_stat = Bayesian()
Bay_stat.draw()
Bay_stat.form_posterior_series(num_list)
ax.legend(fontsize=11)
plt.show()
f) For various 𝑛’s, please describe and compute .05 and .95 quantiles for posterior probabilities.
interval_df = pd.DataFrame()
interval_df['upper'] = upper_bound
interval_df['lower'] = lower_bound
interval_df.index = num_list[:14]
interval_df = interval_df.T
interval_df
1 2 3 4 5 10 20 \
upper 0.228520 0.430741 0.235534 0.16528 0.127776 0.347322 0.280091
lower 0.998457 0.999132 0.937587 0.83472 0.739366 0.814884 0.629953
As 𝑛 increases, we can see that Bayesian coverage intervals narrow and move toward 0.4.
g) Please tell what question a Bayesian coverage interval answers.
The Bayesian coverage interval tells the range of $\theta$ that corresponds to the $[p_1, p_2]$ quantiles of the cumulative distribution function (CDF) of the posterior distribution.
To construct the coverage interval we first compute a posterior distribution of the unknown parameter 𝜃.
If the CDF is 𝐹 (𝜃), then the Bayesian coverage interval [𝑎, 𝑏] for the interval [𝑝1 , 𝑝2 ] is described by
𝐹 (𝑎) = 𝑝1 , 𝐹 (𝑏) = 𝑝2
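With a Beta posterior, the interval can be read off the quantile (inverse CDF) function; here is a small sketch using illustrative posterior parameters rather than the lecture's simulated ones:

import scipy.stats as st

posterior = st.beta(0.5 + 40, 0.5 + 60)       # illustrative: 40 heads in 100 flips
p1, p2 = 0.05, 0.95

a, b = posterior.ppf(p1), posterior.ppf(p2)   # F(a) = p1, F(b) = p2
print(f"Bayesian coverage interval: [{a:.3f}, {b:.3f}]")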
h) Please compute the posterior probability that 𝜃 ∈ [.45, .55] for various values of sample size 𝑛.
fontsize=13)
ax.set_xticks(np.arange(0, len(posterior_prob_list), 3))
ax.set_xticklabels(num_list[::3])
ax.set_xlabel('Number of Observations', fontsize=11)
plt.show()
Notice that in the graph above the posterior probability that 𝜃 ∈ [.45, .55] typically exhibits a hump shape as 𝑛 increases.
Two opposing forces are at work.
The first force is that the individual adjusts his belief as he observes new outcomes, so his posterior probability distribution becomes more and more realistic, which explains the rise of the posterior probability.
However, [.45, .55] actually excludes the true 𝜃 = .4 that generates the data.
As a result, the posterior probability drops as larger and larger samples refine his posterior probability distribution of 𝜃.
The descent seems precipitous only because of the scale of the graph, whose horizontal axis increases the number of observations disproportionately.
When the number of observations becomes large enough, our Bayesian becomes so confident about 𝜃 that he considers
𝜃 ∈ [.45, .55] very unlikely.
That is why we see a nearly horizontal line when the number of observations exceeds 500.
i) Please use your Python class to study what happens to the posterior distribution as 𝑛 → +∞, again assuming that the
true value of 𝜃 = .4, though it is unknown to the person doing the updating via Bayes’ Law.
Using the Python class we made above, we can see the evolution of posterior distributions as 𝑛 approaches infinity.
ax.legend(fontsize=11)
plt.show()
As 𝑛 increases, we can see that the probability density functions concentrate on 0.4, the true value of 𝜃.
Here the posterior means converges to 0.4 while the posterior standard deviations converges to 0 from above.
To show this, we compute the means and variances statistics of the posterior distributions.
ax[0].plot(mean_list)
ax[0].set_title('Mean Values of Posterior Distribution', fontsize=13)
ax[0].set_xticks(np.arange(0, len(mean_list), 3))
ax[0].set_xticklabels(num_list[::3])
ax[0].set_xlabel('Number of Observations', fontsize=11)
ax[1].plot(std_list)
ax[1].set_title('Standard Deviations of Posterior Distribution', fontsize=13)
ax[1].set_xticks(np.arange(0, len(std_list), 3))
ax[1].set_xticklabels(num_list[::3])
ax[1].set_xlabel('Number of Observations', fontsize=11)
plt.show()
$$
\begin{aligned}
\frac{\binom{N}{k} (1-\theta)^{N-k} \theta^k \, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}}
     {\int_0^1 \binom{N}{k} (1-\theta)^{N-k} \theta^k \, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)} \, d\theta}
&= \frac{(1-\theta)^{\beta+N-k-1} \, \theta^{\alpha+k-1}}
        {\int_0^1 (1-\theta)^{\beta+N-k-1} \, \theta^{\alpha+k-1} \, d\theta} \\
&= \textrm{Beta}(\alpha + k, \beta + N - k)
\end{aligned}
$$
A beta distribution with parameters $\alpha$ and $\beta$ has the following mean and variance.
The mean is $\frac{\alpha}{\alpha+\beta}$.
The variance is $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$.
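These formulas can be checked against scipy; a quick sketch with arbitrary parameter values:

import scipy.stats as st

α, β = 3.0, 7.0   # arbitrary values

dist = st.beta(α, β)
print(dist.mean(), α / (α + β))
print(dist.var(), α * β / ((α + β)**2 * (α + β + 1)))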
ax.legend(fontsize=11)
plt.show()
After observing a large number of outcomes, the posterior distribution collapses around 0.4.
Thus, the Bayesian statistician comes to believe that 𝜃 is near .4.
As shown in the figure above, as the number of observations grows, the Bayesian coverage intervals (BCIs) become
narrower and narrower around 0.4.
However, if you take a closer look, you will find that the centers of the BCIs are not exactly 0.4, due to the persistent
influence of the prior distribution and the randomness of the simulation path.
TWELVE

MULTIVARIATE HYPERGEOMETRIC DISTRIBUTION

Contents

12.1 Overview

This lecture describes how an administrator deployed a multivariate hypergeometric distribution in order to assess the fairness of a procedure for awarding research grants.
In the lecture we’ll learn about
• properties of the multivariate hypergeometric distribution
• first and second moments of a multivariate hypergeometric distribution
• using a Monte Carlo simulation of a multivariate normal distribution to evaluate the quality of a normal approxi-
mation
• the administrator’s problem and why the multivariate hypergeometric distribution is the right tool
To evaluate whether the selection procedure is color blind the administrator wants to study whether the particular re-
alization of 𝑋 drawn can plausibly be said to be a random draw from the probability distribution that is implied by the
color blind hypothesis.
The appropriate probability distribution is the one described here.
Let’s now instantiate the administrator’s problem, while continuing to use the colored balls metaphor.
The administrator has an urn with 𝑁 = 238 balls.
157 balls are blue, 11 balls are green, 46 balls are yellow, and 24 balls are black.
So (𝐾1 , 𝐾2 , 𝐾3 , 𝐾4 ) = (157, 11, 46, 24) and 𝑐 = 4.
15 balls are drawn without replacement.
So 𝑛 = 15.
The administrator wants to know the probability distribution of outcomes
$$X = \begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ k_4 \end{pmatrix}.$$
In particular, he wants to know whether a particular outcome - in the form of a 4 × 1 vector of integers recording the numbers of blue, green, yellow, and black balls, respectively - contains evidence against the hypothesis that the selection process is fair, which here means color blind, so that the balls truly are random draws without replacement from the population of 𝑁 balls.
The right tool for the administrator’s job is the multivariate hypergeometric distribution.
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
import matplotlib.cm as cm
import numpy as np
from scipy.special import comb
from scipy.stats import normaltest
from numba import njit, prange
$$\Pr\{X_i = k_i \; \forall i\} = \frac{\prod_{i=1}^{c} \binom{K_i}{k_i}}{\binom{N}{n}}$$
Mean:
$$\mathbb E(X_i) = n \frac{K_i}{N}$$
Variances and covariances:
$$\mathrm{Var}(X_i) = n \frac{N-n}{N-1} \frac{K_i}{N} \left(1 - \frac{K_i}{N}\right)$$
$$\mathrm{Cov}(X_i, X_j) = -n \frac{N-n}{N-1} \frac{K_i}{N} \frac{K_j}{N}$$
To do our work for us, we’ll write an Urn class.
class Urn:

    def __init__(self, K_arr):
        """
        Initialization given the number of each type i object in the urn.

        Parameters
        ----------
        K_arr: ndarray(int)
            number of each type i object.
        """
        self.K_arr = np.array(K_arr)
        self.N = np.sum(K_arr)
        self.c = len(K_arr)

    def pmf(self, k_arr):
        """
        Probability mass function.

        Parameters
        ----------
        k_arr: ndarray(int)
            number of observed successes of each object.
        """
        K_arr, N = self.K_arr, self.N

        k_arr = np.atleast_2d(k_arr)
        n = np.sum(k_arr, 1)

        num = np.prod(comb(K_arr, k_arr), 1)
        denom = comb(N, n)

        pr = num / denom

        return pr

    def moments(self, n):
        """
        Compute the mean and variance-covariance matrix of the
        multivariate hypergeometric distribution.

        Parameters
        ----------
        n: int
            number of draws.
        """
        K_arr, N, c = self.K_arr, self.N, self.c

        # mean
        μ = n * K_arr / N

        # variance-covariance matrix
        Σ = np.full((c, c), n * (N - n) / (N - 1) / N ** 2)
        for i in range(c-1):
            Σ[i, i] *= K_arr[i] * (N - K_arr[i])
            for j in range(i+1, c):
                Σ[i, j] *= - K_arr[i] * K_arr[j]
                Σ[j, i] = Σ[i, j]

        # last diagonal element
        Σ[-1, -1] *= K_arr[-1] * (N - K_arr[-1])

        return μ, Σ

    def simulate(self, n, size=1, seed=None):
        """
        Simulate a sample from the multivariate hypergeometric distribution,
        where at each draw we take n objects from the urn without replacement.

        Parameters
        ----------
        n: int
            number of objects for each draw.
        size: int(optional)
            sample size.
        seed: int(optional)
            random seed.
        """
        K_arr = self.K_arr

        gen = np.random.Generator(np.random.PCG64(seed))
        sample = gen.multivariate_hypergeometric(K_arr, n, size=size)

        return sample
12.3 Usage
$$P(2 \text{ black}, 2 \text{ white}, 2 \text{ red}) = \frac{\binom{5}{2}\binom{10}{2}\binom{15}{2}}{\binom{30}{6}} = 0.079575596816976$$
Now use the Urn Class method pmf to compute the probability of the outcome 𝑋 = (2 2 2)
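The code cell itself is not shown in this extract; a minimal version, assuming the urn holds 5 black, 10 white and 15 red balls as in the formula above, is:

# 5 black, 10 white and 15 red balls, as in the formula above
k_arr = [2, 2, 2]
urn = Urn([5, 10, 15])
urn.pmf(k_arr)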
array([0.0795756])
We can use the code to compute probabilities of a list of possible outcomes by constructing a 2-dimensional array k_arr
and pmf will return an array of probabilities for observing each case.
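A sketch of such a call follows; the second row of k_arr is an assumption chosen so that it reproduces the second probability shown below:

k_arr = [[2, 2, 2], [1, 3, 2]]
urn.pmf(k_arr)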
array([0.0795756, 0.1061008])
n = 6
μ, Σ = urn.moments(n)

# The administrator's urn, urn = Urn([157, 11, 46, 24]) with n = 15 draws,
# is instantiated in an elided cell before the following computation
k_arr = [10, 1, 4, 0]
urn.pmf(k_arr)
array([0.01547738])
We can compute the probabilities of three possible outcomes by stacking them into a two-dimensional array k_arr; the method pmf of the Urn class then returns an array of the three probabilities.
n = 6 # number of draws
μ, Σ = urn.moments(n)
# mean
μ
# variance-covariance matrix
Σ
We can simulate a large sample and verify that sample means and covariances closely approximate the population means
and covariances.
size = 10_000_000
sample = urn.simulate(n, size=size)
# mean
np.mean(sample, 0)
Evidently, the sample means and covariances approximate their population counterparts well.
To judge the quality of a multivariate normal approximation to the multivariate hypergeometric distribution, we draw
a large sample from a multivariate normal distribution with the mean vector and covariance matrix for the correspond-
ing multivariate hypergeometric distribution and compare the simulated distribution with the population multivariate
hypergeometric distribution.
x_μ = x - μ_x
y_μ = y - μ_y
@njit
def count(vec1, vec2, n):
size = sample.shape[0]
return count_mat
c = urn.c
fig, axs = plt.subplots(c, c, figsize=(14, 14))
for i in range(c):
axs[i, i].hist(sample[:, i], bins=np.arange(0, n, 1), alpha=0.5, density=True,␣
↪label='hypergeom')
axs[i, i].legend()
axs[i, i].set_title('$k_{' +str(i+1) +'}$')
for j in range(c):
if i == j:
continue
plt.show()
The diagonal graphs plot the marginal distributions of 𝑘𝑖 for each 𝑖 using histograms.
Note the substantial differences between hypergeometric distribution and the approximating normal distribution.
The off-diagonal graphs plot the empirical joint distribution of 𝑘𝑖 and 𝑘𝑗 for each pair (𝑖, 𝑗).
The darker the blue, the more data points are contained in the corresponding cell. (Note that 𝑘𝑖 is on the x-axis and 𝑘𝑗 is
on the y-axis).
The contour maps plot the bivariate Gaussian density function of (𝑘𝑖 , 𝑘𝑗 ) with the population mean and covariance given
by slices of 𝜇 and Σ that we computed above.
Let’s also test the normality for each 𝑘𝑖 using scipy.stats.normaltest that implements D’Agostino and Pearson’s
test that combines skew and kurtosis to form an omnibus test of normality.
The null hypothesis is that the sample follows normal distribution.
normaltest returns an array of p-values associated with tests for each 𝑘𝑖 sample.
test_multihyper = normaltest(sample)
test_multihyper.pvalue
As we can see, all the p-values are almost 0 and the null hypothesis is soundly rejected.
By contrast, the sample from the normal distribution does not lead us to reject the null hypothesis.
test_normal = normaltest(sample_normal)
test_normal.pvalue
The lesson to take away from this is that the normal approximation is imperfect.
THIRTEEN

MULTIVARIATE NORMAL DISTRIBUTION

Contents

13.1 Overview
This lecture describes a workhorse in probability theory, statistics, and economics, namely, the multivariate normal
distribution.
In this lecture, you will learn formulas for
• the joint distribution of a random vector 𝑥 of length 𝑁
• marginal distributions for all subvectors of 𝑥
• conditional distributions for subvectors of 𝑥 conditional on other subvectors of 𝑥
We will use the multivariate normal distribution to formulate some useful models:
This lecture defines a Python class MultivariateNormal to be used to generate marginal and conditional distri-
butions associated with a multivariate normal distribution.
For a multivariate normal distribution it is very convenient that
• conditional expectations equal linear least squares projections
• conditional distributions are characterized by multivariate linear regressions
We apply our Python class to some examples.
We use the following imports:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
import numpy as np
from numba import njit
import statsmodels.api as sm
@njit
def f(z, μ, Σ):
    """
    The density function of multivariate normal distribution.

    Parameters
    ---------------
    z: ndarray(float, dim=2)
        random vector, N by 1
    μ: ndarray(float, dim=1 or 2)
        the mean of z, N by 1
    Σ: ndarray(float, dim=2)
        the covariance matrix of z, N by N
    """
    N = z.size

    temp1 = np.linalg.det(Σ) ** (-1/2)
    temp2 = np.exp(-.5 * (z - μ).T @ np.linalg.inv(Σ) @ (z - μ))

    return (2 * np.pi) ** (-N/2) * temp1 * temp2
We partition the random vector as
$$z = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix},$$
where the matrix of population regression coefficients of $z_1$ on $z_2$ is
$$\beta = \Sigma_{12} \Sigma_{22}^{-1}$$
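To make the partition formulas concrete, here is a small standalone sketch (independent of the MultivariateNormal class defined below) that computes the conditional mean and covariance of $z_1$ given $z_2$; the standard formulas $\hat\mu_1 = \mu_1 + \beta(z_2 - \mu_2)$ and $\hat\Sigma_{11} = \Sigma_{11} - \beta \Sigma_{21}$ are assumed:

import numpy as np

def cond_dist_z1_given_z2(μ, Σ, k, z2):
    """
    Conditional distribution of z1 (the first k coordinates) given z2.
    Uses μ1 + β (z2 - μ2) and Σ11 - β Σ21 with β = Σ12 Σ22^{-1}.
    """
    μ1, μ2 = μ[:k], μ[k:]
    Σ11, Σ12 = Σ[:k, :k], Σ[:k, k:]
    Σ21, Σ22 = Σ[k:, :k], Σ[k:, k:]

    β = Σ12 @ np.linalg.inv(Σ22)
    μ_hat = μ1 + β @ (z2 - μ2)
    Σ_hat = Σ11 - β @ Σ21
    return μ_hat, Σ_hat

# Example with the bivariate case used just below
μ = np.array([.5, 1.])
Σ = np.array([[1., .5], [.5, 1.]])
print(cond_dist_z1_given_z2(μ, Σ, k=1, z2=np.array([5.])))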
class MultivariateNormal:
    """
    Class of multivariate normal distribution.

    Parameters
    ----------
    μ: ndarray(float, dim=1)
        the mean of z, N by 1
    Σ: ndarray(float, dim=2)
        the covariance matrix of z, N by N

    Arguments
    ---------
    μ, Σ:
        see parameters
    μs: list(ndarray(float, dim=1))
        list of mean vectors μ1 and μ2 in order
    Σs: list(list(ndarray(float, dim=2)))
        2 dimensional list of covariance matrices
        Σ11, Σ12, Σ21, Σ22 in order
    βs: list(ndarray(float, dim=1))
        list of regression coefficients β1 and β2 in order
    """

    def __init__(self, μ, Σ):
        self.μ = np.array(μ)
        self.Σ = np.atleast_2d(Σ)

    def partition(self, k):
        """
        Given k, partition z into a size-k vector z1 and a size N-k vector z2;
        partition μ and Σ conformably and compute β1 and β2.
        """
        μ, Σ = self.μ, self.Σ

        self.μs = [μ[:k], μ[k:]]
        self.Σs = [[Σ[:k, :k], Σ[:k, k:]],
                   [Σ[k:, :k], Σ[k:, k:]]]

        self.βs = [self.Σs[0][1] @ np.linalg.inv(self.Σs[1][1]),
                   self.Σs[1][0] @ np.linalg.inv(self.Σs[0][0])]

    def cond_dist(self, ind, z):
        """
        Compute the conditional distribution of z1 given z2 (ind=0),
        or of z2 given z1 (ind=1).

        Returns
        ---------
        μ_hat: ndarray(float, ndim=1)
            The conditional mean of z1 or z2.
        Σ_hat: ndarray(float, ndim=2)
            The conditional covariance matrix of z1 or z2.
        """
        β = self.βs[ind]
        μs, Σs = self.μs, self.Σs

        μ_hat = μs[ind] + β @ (z - μs[1-ind])
        Σ_hat = Σs[ind][ind] - β @ Σs[1-ind][ind]

        return μ_hat, Σ_hat
$$\mu = \begin{bmatrix} .5 \\ 1.0 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 1 & .5 \\ .5 & 1 \end{bmatrix}$$
μ = np.array([.5, 1.])
Σ = np.array([[1., .5], [.5 ,1.]])
k = 1  # choose partition

# construct an instance, partition it, and report the two regression coefficients
multi_normal = MultivariateNormal(μ, Σ)
multi_normal.partition(k)
multi_normal.βs[0], multi_normal.βs[1]
(array([[0.5]]), array([[0.5]]))
Let’s illustrate the fact that you can regress anything on anything else.
We have computed everything we need to compute two regression lines, one of 𝑧2 on 𝑧1 , the other of 𝑧1 on 𝑧2 .
We’ll represent these regressions as
𝑧1 = 𝑎 1 + 𝑏 1 𝑧2 + 𝜖 1
and
𝑧2 = 𝑎 2 + 𝑏 2 𝑧1 + 𝜖 2
𝐸𝜖1 𝑧2 = 0
and
𝐸𝜖2 𝑧1 = 0
Let’s compute 𝑎1 , 𝑎2 , 𝑏1 , 𝑏2 .
beta = multi_normal.βs
a1 = μ[0] - beta[0]*μ[1]
b1 = beta[0]
a2 = μ[1] - beta[1]*μ[0]
b2 = beta[1]
a1 = [[0.]]
b1 = [[0.5]]
a2 = [[0.75]]
b2 = [[0.5]]
Now let’s plot the two regression lines and stare at them.
z2 = np.linspace(-4,4,100)
a1 = np.squeeze(a1)
b1 = np.squeeze(b1)
a2 = np.squeeze(a2)
b2 = np.squeeze(b2)
z1 = b1*z2 + a1
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(1, 1, 1)
a1 = 0.0
b1 = 0.5
-a2/b2 = -1.5
1/b2 = 2.0
We can use these regression lines or our code to compute conditional expectations.
Let’s compute the mean and variance of the distribution of 𝑧2 conditional on 𝑧1 = 5.
After that we’ll reverse what are on the left and right sides of the regression.
Now let’s compute the mean and variance of the distribution of 𝑧1 conditional on 𝑧2 = 5.
Let’s compare the preceding population mean and variance with outcomes from drawing a large sample and then regressing
𝑧1 − 𝜇1 on 𝑧2 − 𝜇2 .
We know that
𝐸𝑧1 |𝑧2 = (𝜇1 − 𝛽𝜇2 ) + 𝛽𝑧2
which can be arranged to
𝑧1 − 𝜇1 = 𝛽 (𝑧2 − 𝜇2 ) + 𝜖,
We anticipate that for larger and larger sample sizes, estimated OLS coefficients will converge to 𝛽 and the estimated
variance of 𝜖 will converge to Σ̂ 1 .
# OLS regression
μ1, μ2 = multi_normal.μs
results = sm.OLS(z1_data - μ1, z2_data - μ2).fit()
Let’s compare the preceding population 𝛽 with the OLS sample estimate on 𝑧2 − 𝜇2
multi_normal.βs[0], results.params
(array([[0.5]]), array([0.49951561]))
Let’s compare our population Σ̂ 1 with the degrees-of-freedom adjusted estimate of the variance of 𝜖
(array([[0.75]]), 0.7499568468007555)
(array([2.5]), array([2.49806245]))
Thus, in each case, for our very large sample size, the sample analogues closely approximate their population counterparts.
A Law of Large Numbers explains why sample analogues approximate population objects.
μ = np.random.random(3)
C = np.random.random((3, 3))
Σ = C @ C.T # positive semi-definite
multi_normal = MultivariateNormal(μ, Σ)
μ, Σ
k = 1
multi_normal.partition(k)
Let's compute the distribution of $z_1$ conditional on $z_2 = \begin{bmatrix} 2 \\ 5 \end{bmatrix}$.
ind = 0
z2 = np.array([2., 5.])
n = 1_000_000
data = np.random.multivariate_normal(μ, Σ, size=n)
z1_data = data[:, :k]
z2_data = data[:, k:]
μ1, μ2 = multi_normal.μs
results = sm.OLS(z1_data - μ1, z2_data - μ2).fit()
As above, we compare population and sample regression coefficients, the conditional covariance matrix, and the condi-
tional mean vector in that order.
multi_normal.βs[0], results.params
(array([[0.00678627]]), 0.006790326919375728)
(array([6.74777753]), array([6.74743547]))
Once again, sample analogues do a good job of approximating their population counterparts.
Let’s move closer to a real-life example, namely, inferring a one-dimensional measure of intelligence called IQ from a list
of test scores.
The 𝑖th test score 𝑦𝑖 equals the sum of an unknown scalar IQ 𝜃 and a random variable 𝑤𝑖 .
𝑦𝑖 = 𝜃 + 𝜎𝑦 𝑤𝑖 , 𝑖 = 1, … , 𝑛
The distribution of IQ’s for a cross-section of people is a normal random variable described by
𝜃 = 𝜇𝜃 + 𝜎𝜃 𝑤𝑛+1 .
$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \\ w_{n+1} \end{bmatrix} \sim N(0, I_{n+1})$$
The following system describes the $(n+1) \times 1$ random vector $X$ that interests us:
$$
X = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \\ \theta \end{bmatrix}
  = \begin{bmatrix} \mu_\theta \\ \mu_\theta \\ \vdots \\ \mu_\theta \\ \mu_\theta \end{bmatrix}
  + \begin{bmatrix}
      \sigma_y & 0 & \cdots & 0 & \sigma_\theta \\
      0 & \sigma_y & \cdots & 0 & \sigma_\theta \\
      \vdots & \vdots & \ddots & \vdots & \vdots \\
      0 & 0 & \cdots & \sigma_y & \sigma_\theta \\
      0 & 0 & \cdots & 0 & \sigma_\theta
    \end{bmatrix}
    \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \\ w_{n+1} \end{bmatrix},
$$
or equivalently,
$$X = \mu_\theta \mathbf 1_{n+1} + D w$$
where $X = \begin{bmatrix} y \\ \theta \end{bmatrix}$, $\mathbf 1_{n+1}$ is a vector of 1s of size $n+1$, and $D$ is an $(n+1) \times (n+1)$ matrix.
Let’s define a Python function that constructs the mean 𝜇 and covariance matrix Σ of the random vector 𝑋 that we know
is governed by a multivariate normal distribution.
As arguments, the function takes the number of tests 𝑛, the mean 𝜇𝜃 and the standard deviation 𝜎𝜃 of the IQ distribution,
and the standard deviation of the randomness in test scores 𝜎𝑦 .
n = 50
μθ, σθ, σy = 100., 10., 10.
(array([100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
We can now use our MultivariateNormal class to construct an instance, then partition the mean vector and co-
variance matrix as we wish.
We want to regress IQ, the random variable 𝜃 (what we don’t know), on the vector 𝑦 of test scores (what we do know).
We choose k=n so that 𝑧1 = 𝑦 and 𝑧2 = 𝜃.
k = n
multi_normal_IQ.partition(k)
Using the generator multivariate_normal, we can make one draw of the random vector from our distribution and
then compute the distribution of 𝜃 conditional on our test scores.
Let’s do that and then print out some pertinent quantities.
x = np.random.multivariate_normal(μ_IQ, Σ_IQ)
y = x[:-1] # test scores
θ = x[-1] # IQ
104.54899674277466
The method cond_dist takes test scores 𝑦 as input and returns the conditional normal distribution of the IQ 𝜃.
In the following code, ind sets the variables on the right side of the regression.
Given the way we have defined the vector 𝑋, we want to set ind=1 in order to make 𝜃 the left side variable in the
population regression.
ind = 1
multi_normal_IQ.cond_dist(ind, y)
(array([104.49298531]), array([[1.96078431]]))
The first number is the conditional mean 𝜇𝜃̂ and the second is the conditional variance Σ̂ 𝜃 .
How do additional test scores affect our inferences?
To shed light on this, we compute a sequence of conditional distributions of 𝜃 by varying the number of test scores in the
conditioning set from 1 to 𝑛.
We'll make a pretty graph showing how our judgment of the person's IQ changes as more test results come in.
plt.show()
The solid blue line in the plot above shows 𝜇𝜃̂ as a function of the number of test scores that we have recorded and
conditioned on.
The blue area shows the span that comes from adding or subtracting 1.96𝜎̂𝜃 from 𝜇𝜃̂ .
Therefore, 95% of the probability mass of the conditional distribution falls in this range.
The value of the random 𝜃 that we drew is shown by the black dotted line.
As more and more test scores come in, our estimate of the person's 𝜃 becomes more and more reliable.
By staring at the changes in the conditional distributions, we see that adding more test scores makes 𝜃 ̂ settle down and
approach 𝜃.
Thus, each 𝑦𝑖 adds information about 𝜃.
If we were to drive the number of tests $n \to +\infty$, the conditional standard deviation $\hat\sigma_\theta$ would converge to $0$ at rate $\frac{1}{n^{.5}}$.
Σ ≡ 𝐷𝐷′ = 𝐶𝐶 ′
and
𝐸𝜖𝜖′ = 𝐼.
It follows that
𝜖 ∼ 𝑁 (0, 𝐼).
Let $G = C^{-1}$; $G$ is also lower triangular.
We can compute $\epsilon$ from
$$\epsilon = G \left(X - \mu_\theta \mathbf 1_{n+1}\right)$$
This formula confirms that the orthonormal vector 𝜖 contains the same information as the non-orthogonal vector
(𝑋 − 𝜇𝜃 1𝑛+1 ).
We can say that 𝜖 is an orthogonal basis for (𝑋 − 𝜇𝜃 1𝑛+1 ).
Let $c_i$ be the $i$th element in the last row of $C$. Then we can write
$$\theta = \mu_\theta + c_1 \epsilon_1 + c_2 \epsilon_2 + \cdots + c_{n+1} \epsilon_{n+1} \tag{13.1}$$
The mutual orthogonality of the 𝜖𝑖 ’s provides us with an informative way to interpret them in light of equation (13.1).
Thus, relative to what is known from tests 𝑖 = 1, … , 𝑛 − 1, 𝑐𝑖 𝜖𝑖 is the amount of new information about 𝜃 brought by
the test number 𝑖.
Here new information means surprise or what could not be predicted from earlier information.
Formula (13.1) also provides us with an enlightening way to express conditional means and conditional variances that we
computed earlier.
In particular,
$$E\left[\theta \mid y_1, \ldots, y_k\right] = \mu_\theta + c_1 \epsilon_1 + \cdots + c_k \epsilon_k$$
and
$$\mathrm{Var}\left(\theta \mid y_1, \ldots, y_k\right) = c_{k+1}^2 + c_{k+2}^2 + \cdots + c_{n+1}^2.$$
C = np.linalg.cholesky(Σ_IQ)
G = np.linalg.inv(C)
ε = G @ (x - μθ)
cε = C[n, :] * ε
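The arrays μθ_hat_arr_C and Σθ_hat_arr_C used in the comparison below are not constructed in this extract (nor are μθ_hat_arr and Σθ_hat_arr, which are assumed to have been computed earlier with the MultivariateNormal class); a minimal sketch following the two formulas above is:

# conditional means and variances of θ given the first k test scores,
# k = 1, ..., n, computed from the Cholesky coefficients c_i = C[n, :]
μθ_hat_arr_C = np.array([μθ + cε[:k].sum() for k in range(1, n+1)])
Σθ_hat_arr_C = np.array([(C[n, k:] ** 2).sum() for k in range(1, n+1)])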
To confirm that these formulas give the same answers that we computed earlier, we can compare the means and variances
of 𝜃 conditional on {𝑦𝑖 }𝑘𝑖=1 with what we obtained above using the formulas implemented in the class Multivari-
ateNormal built on our original representation of conditional distributions for multivariate normal distributions.
# conditional mean
np.max(np.abs(μθ_hat_arr - μθ_hat_arr_C)) < 1e-10
True
# conditional variance
np.max(np.abs(Σθ_hat_arr - Σθ_hat_arr_C)) < 1e-10
True
Evidently, the Cholesky factorization automatically computes the population regression coefficients and associated statistics that are produced by our MultivariateNormal class.
The Cholesky factorization computes these things recursively.
Indeed, in formula (13.1),
• the random variable 𝑐𝑖 𝜖𝑖 is information about 𝜃 that is not contained by the information in 𝜖1 , 𝜖2 , … , 𝜖𝑖−1
• the coefficient 𝑐𝑖 is the simple population regression coefficient of 𝜃 − 𝜇𝜃 on 𝜖𝑖
When $n = 2$, we assume that outcomes are draws from a multivariate normal distribution with representation
$$
X = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ \theta \\ \eta \end{bmatrix}
  = \begin{bmatrix} \mu_\theta \\ \mu_\theta \\ \mu_\eta \\ \mu_\eta \\ \mu_\theta \\ \mu_\eta \end{bmatrix}
  + \begin{bmatrix}
      \sigma_y & 0 & 0 & 0 & \sigma_\theta & 0 \\
      0 & \sigma_y & 0 & 0 & \sigma_\theta & 0 \\
      0 & 0 & \sigma_y & 0 & 0 & \sigma_\eta \\
      0 & 0 & 0 & \sigma_y & 0 & \sigma_\eta \\
      0 & 0 & 0 & 0 & \sigma_\theta & 0 \\
      0 & 0 & 0 & 0 & 0 & \sigma_\eta
    \end{bmatrix}
    \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \\ w_6 \end{bmatrix}
$$
where $w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_6 \end{bmatrix}$ is a standard normal random vector.
We construct a Python function construct_moments_IQ2d to construct the mean vector and covariance matrix of
the joint normal distribution.
def construct_moments_IQ2d(n, μθ, σθ, μη, ση, σy):

    μ_IQ2d = np.empty(2*(n+1))
    μ_IQ2d[:n] = μθ
    μ_IQ2d[2*n] = μθ
    μ_IQ2d[n:2*n] = μη
    μ_IQ2d[2*n+1] = μη

    # (the construction of the covariance matrix Σ_IQ2d and the return
    # statement are in the elided remainder of this cell)
n = 2
# mean and variance of θ, η, and y
μθ, σθ, μη, ση, σy = 100., 10., 100., 10, 10
(104.87169696314103, 107.9697815380512)
(array([101.17829519, 105.80501858]),
array([[33.33333333, 0. ],
[ 0. , 33.33333333]]))
Now let's compute distributions of 𝜃 and 𝜂 separately conditional on various subsets of test scores.
It will be fun to compare outcomes with the help of an auxiliary function cond_dist_IQ2d that we now construct.
def cond_dist_IQ2d(μ, Σ, data):

    n = len(μ)

    multi_normal = MultivariateNormal(μ, Σ)
    multi_normal.partition(n-1)
    μ_hat, Σ_hat = multi_normal.cond_dist(1, data)

    return μ_hat, Σ_hat
for indices, IQ, conditions in [([*range(2*n), 2*n], 'θ', 'y1, y2, y3, y4'),
([*range(n), 2*n], 'θ', 'y1, y2'),
([*range(n, 2*n), 2*n], 'θ', 'y3, y4'),
([*range(2*n), 2*n+1], 'η', 'y1, y2, y3, y4'),
([*range(n), 2*n+1], 'η', 'y1, y2'),
([*range(n, 2*n), 2*n+1], 'η', 'y3, y4')]:
The mean and variance of θ conditional on y1, y2, y3, y4 are 101.18 and 33.33␣
↪respectively
The mean and variance of θ conditional on y1, y2 are 101.18 and 33.33␣
↪respectively
The mean and variance of θ conditional on y3, y4 are 100.00 and 100.00␣
↪respectively
The mean and variance of η conditional on y1, y2, y3, y4 are 105.81 and 33.33␣
↪respectively
The mean and variance of η conditional on y1, y2 are 100.00 and 100.00␣
↪respectively
The mean and variance of η conditional on y3, y4 are 105.81 and 33.33␣
↪respectively
Evidently, math tests provide no information about 𝜃 and language tests provide no information about 𝜂.
We can use the multivariate normal distribution and a little matrix algebra to present foundations of univariate linear time
series analysis.
Let 𝑥𝑡 , 𝑦𝑡 , 𝑣𝑡 , 𝑤𝑡+1 each be scalars for 𝑡 ≥ 0.
Consider the following model:
𝑥0 ∼ 𝑁 (0, 𝜎02 )
𝑥𝑡+1 = 𝑎𝑥𝑡 + 𝑏𝑤𝑡+1 , 𝑤𝑡+1 ∼ 𝑁 (0, 1) , 𝑡 ≥ 0
𝑦𝑡 = 𝑐𝑥𝑡 + 𝑑𝑣𝑡 , 𝑣𝑡 ∼ 𝑁 (0, 1) , 𝑡 ≥ 0
Define
$$X = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_T \end{bmatrix}$$
and the covariance matrix Σ𝑥 can be constructed using the moments we have computed above.
Similarly, we can define
$$Y = \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_T \end{bmatrix}, \qquad v = \begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_T \end{bmatrix}$$
and therefore
𝑌 = 𝐶𝑋 + 𝐷𝑉
where 𝐶 and 𝐷 are both diagonal matrices with constant 𝑐 and 𝑑 as diagonal respectively.
Consequently, the covariance matrix of 𝑌 is
Σ𝑦 = 𝐸𝑌 𝑌 ′ = 𝐶Σ𝑥 𝐶 ′ + 𝐷𝐷′
Define the stacked vector
$$Z = \begin{bmatrix} X \\ Y \end{bmatrix}$$
and its covariance matrix
$$\Sigma_z = E Z Z' = \begin{bmatrix} \Sigma_x & \Sigma_x C' \\ C \Sigma_x & \Sigma_y \end{bmatrix}$$
Thus, the stacked sequences {𝑥𝑡 }𝑇𝑡=0 and {𝑦𝑡 }𝑇𝑡=0 jointly follow the multivariate normal distribution 𝑁 (0, Σ𝑧 ).
# the parameters a, b, c, d, σ0 and horizon T are set in an elided cell above
Σx = np.empty((T+1, T+1))

Σx[0, 0] = σ0 ** 2
for i in range(T):
    Σx[i, i+1:] = Σx[i, i] * a ** np.arange(1, T+1-i)
    Σx[i+1:, i] = Σx[i, i+1:]

    Σx[i+1, i+1] = a ** 2 * Σx[i, i] + b ** 2

Σx

C = np.eye(T+1) * c
D = np.eye(T+1) * d

Σy = C @ Σx @ C.T + D @ D.T

μz = np.zeros(2*(T+1))
Σz = np.empty((2*(T+1), 2*(T+1)))

Σz[:T+1, :T+1] = Σx
Σz[:T+1, T+1:] = Σx @ C.T
Σz[T+1:, :T+1] = C @ Σx
Σz[T+1:, T+1:] = Σy

Σz
z = np.random.multivariate_normal(μz, Σz)
x = z[:T+1]
y = z[T+1:]
print("X = ", x)
print("Y = ", y)
print(" E [ X | Y] = ", )
multi_normal_ex1.cond_dist(0, y)
t = 3
sub_Σz
sub_y = y[:t]
multi_normal_ex2.cond_dist(0, sub_y)
(array([0.59205609]), array([[1.00201996]]))
t = 3
j = 2
sub_μz = np.zeros(t-j+2)
sub_Σz = np.empty((t-j+2, t-j+2))
sub_Σz
sub_y = y[:t-j+1]
multi_normal_ex3.cond_dist(0, sub_y)
(array([0.91359083]), array([[1.81413617]]))
𝜖 = 𝐻 −1 𝑌 .
H = np.linalg.cholesky(Σy)
array([[1.00124922, 0. , 0. , 0. ],
[0.8988771 , 1.00225743, 0. , 0. ],
[0.80898939, 0.89978675, 1.00225743, 0. ],
[0.72809046, 0.80980808, 0.89978676, 1.00225743]])
ε = np.linalg.inv(H) @ y
This example is an instance of what is known as a Wold representation in time series analysis.
$$\Sigma_b = \begin{bmatrix} C \Sigma_{\tilde y} C' & \mathbf 0_{2 \times N-2} \\ \mathbf 0_{N-2 \times 2} & \mathbf 0_{N-2 \times N-2} \end{bmatrix}, \qquad
C = \begin{bmatrix} \alpha_2 & \alpha_1 \\ 0 & \alpha_2 \end{bmatrix}$$
$$\Sigma_u = \begin{bmatrix} \sigma_u^2 & 0 & \cdots & 0 \\ 0 & \sigma_u^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_u^2 \end{bmatrix}$$
# set parameters
T = 80
T = 160

# coefficients of the second order difference equation
α0 = 10
α1 = 1.53
α2 = -.9

# variance of u
σu = 1.
σu = 10.

A = np.zeros((T, T))

for i in range(T):
    A[i, i] = 1

    if i-1 >= 0:
        A[i, i-1] = -α1

    if i-2 >= 0:
        A[i, i-2] = -α2

A_inv = np.linalg.inv(A)

# μb, the mean vector of b, is constructed in an elided cell above
μy = A_inv @ μb

Σb = np.zeros((T, T))
Let
$$p_t = \sum_{j=0}^{T-t} \beta^j y_{t+j}$$
Form
$$
\underbrace{\begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ \vdots \\ p_T \end{bmatrix}}_{\equiv p}
= \underbrace{\begin{bmatrix}
1 & \beta & \beta^2 & \cdots & \beta^{T-1} \\
0 & 1 & \beta & \cdots & \beta^{T-2} \\
0 & 0 & 1 & \cdots & \beta^{T-3} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}}_{\equiv B}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_T \end{bmatrix}
$$
we have
𝜇𝑝 = 𝐵𝜇𝑦
Σ𝑝 = 𝐵Σ𝑦 𝐵′
β = .96
# construct B
B = np.zeros((T, T))
for i in range(T):
B[i, i:] = β ** np.arange(0, T-i)
Denote
$$z = \begin{bmatrix} y \\ p \end{bmatrix} = \underbrace{\begin{bmatrix} I \\ B \end{bmatrix}}_{\equiv D} \, y$$
Thus, {𝑦𝑡 }𝑇𝑡=1 and {𝑝𝑡 }𝑇𝑡=1 jointly follow the multivariate normal distribution 𝑁 (𝜇𝑧 , Σ𝑧 ), where
𝜇𝑧 = 𝐷𝜇𝑦
Σ𝑧 = 𝐷Σ𝑦 𝐷′
D = np.vstack([np.eye(T), B])
μz = D @ μy
Σz = D @ Σy @ D.T
We can simulate paths of 𝑦𝑡 and 𝑝𝑡 and compute the conditional mean 𝐸 [𝑝𝑡 ∣ 𝑦𝑡−1 , 𝑦𝑡 ] using the MultivariateNor-
mal class.
z = np.random.multivariate_normal(μz, Σz)
y, p = z[:T], z[T:]
cond_Ep = np.empty(T-1)
sub_μ = np.empty(3)
sub_Σ = np.empty((3, 3))
for t in range(2, T+1):
sub_μ[:] = μz[[t-2, t-1, T-1+t]]
sub_Σ[:, :] = Σz[[t-2, t-1, T-1+t], :][:, [t-2, t-1, T-1+t]]
plt.xlabel('t')
plt.legend(loc=1)
plt.show()
In the above graph, one line shows what the price of the stock would be if people had perfect foresight about the path of dividends, while the other shows the conditional expectation $E p_t \mid y_t, y_{t-1}$, which is what the price would be if people did not have perfect foresight but were optimally predicting future dividends on the basis of the information $y_t, y_{t-1}$ at time $t$.
Assume that $x_0$ is an $n \times 1$ random vector and that $y_0$ is a $p \times 1$ random vector determined by the observation equation
$$y_0 = G x_0 + v_0, \qquad x_0 \sim \mathcal N(\hat x_0, \Sigma_0), \quad v_0 \sim \mathcal N(0, R)$$
The joint distribution of $\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$ is multivariate normal with mean vector and covariance matrix
$$\mu = \begin{bmatrix} \hat x_0 \\ G \hat x_0 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_0 & \Sigma_0 G' \\ G \Sigma_0 & G \Sigma_0 G' + R \end{bmatrix}$$
By applying an appropriate instance of the above formulas for the mean vector $\hat\mu_1$ and covariance matrix $\hat\Sigma_{11}$ of $z_1$ conditional on $z_2$, we find that the probability distribution of $x_0$ conditional on $y_0$ is $\mathcal N(\tilde x_0, \tilde \Sigma_0)$ where
$$
\begin{aligned}
\beta_0 &= \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} \\
\tilde x_0 &= \hat x_0 + \beta_0 (y_0 - G \hat x_0) \\
\tilde \Sigma_0 &= \Sigma_0 - \Sigma_0 G' (G \Sigma_0 G' + R)^{-1} G \Sigma_0
\end{aligned}
$$
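A minimal sketch of these three formulas as a standalone function (the function name is illustrative; the lecture's own iterative implementation appears further below):

import numpy as np

def filter_time0(x0_hat, Σ0, G, R, y0):
    "Conditional distribution of x0 given y0, following the formulas above."
    β0 = Σ0 @ G.T @ np.linalg.inv(G @ Σ0 @ G.T + R)
    x0_tilde = x0_hat + β0 @ (y0 - G @ x0_hat)
    Σ0_tilde = Σ0 - β0 @ G @ Σ0   # equals Σ0 - Σ0 G'(GΣ0G'+R)^{-1} G Σ0
    return x0_tilde, Σ0_tilde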
Now suppose that we are in a time series setting and that we have the one-step state transition equation
$$x_{t+1} = A x_t + C w_{t+1}$$
together with an observation equation $y_t = G x_t + v_t$ for every $t$.
Define
$$\hat x_1 = A \tilde x_0, \qquad \Sigma_1 = A \tilde \Sigma_0 A' + C C'$$
where as before $x_0 \sim \mathcal N(\hat x_0, \Sigma_0)$, $w_{t+1}$ is the $(t+1)$th component of an i.i.d. stochastic process distributed as $w_{t+1} \sim \mathcal N(0, I)$, $v_t$ is the $t$th component of an i.i.d. process distributed as $v_t \sim \mathcal N(0, R)$, and the $\{w_{t+1}\}_{t=0}^\infty$ and $\{v_t\}_{t=0}^\infty$ processes are orthogonal at all pairs of dates.
The logic and formulas that we applied above imply that the probability distribution of $x_t$ conditional on $y_0, y_1, \ldots, y_{t-1} = y^{t-1}$ is
$$x_t \mid y^{t-1} \sim \mathcal N(\hat x_t, \Sigma_t)$$
where $\{\tilde x_t, \tilde \Sigma_t\}_{t=1}^\infty$ can be computed by iterating on the following equations, starting from $t = 1$ and the initial conditions for $\tilde x_0, \tilde \Sigma_0$ computed as we have above:
$$
\begin{aligned}
\Sigma_t &= A \tilde \Sigma_{t-1} A' + C C' \\
\hat x_t &= A \tilde x_{t-1} \\
\beta_t &= \Sigma_t G' (G \Sigma_t G' + R)^{-1} \\
\tilde x_t &= \hat x_t + \beta_t (y_t - G \hat x_t) \\
\tilde \Sigma_t &= \Sigma_t - \Sigma_t G' (G \Sigma_t G' + R)^{-1} G \Sigma_t
\end{aligned}
$$
If we shift the first equation forward one period and then substitute the expression for $\tilde \Sigma_t$ on the right side of the fifth equation into it, we obtain
$$\Sigma_{t+1} = A \Sigma_t A' + C C' - A \Sigma_t G' (G \Sigma_t G' + R)^{-1} G \Sigma_t A'$$
This is a matrix Riccati difference equation that is closely related to another matrix Riccati difference equation that appears
in a quantecon lecture on the basics of linear quadratic control theory.
That equation has the form
Stare at the two preceding equations for a moment or two: the first is a matrix difference equation for a conditional covariance matrix, while the second is a matrix difference equation for the matrix that appears in a quadratic form for an intertemporal cost or value function.
Although the two equations are not identical, they display striking family resemblances.
• the first equation tells dynamics that work forward in time
• the second equation tells dynamics that work backward in time
• while many of the terms are similar, one equation seems to apply matrix transformations to some matrices that play similar roles in the other equation
The family resemblance of these two equations reflects a transcendent duality between control theory and filtering theory.
13.12.3 An example
# A, C, x0_hat and Σ0 are set in an elided cell above
G = np.array([[1., 3.]])
R = np.array([[1.]])

μ = np.hstack([x0_hat, G @ x0_hat])
Σ = np.block([[Σ0, Σ0 @ G.T], [G @ Σ0, G @ Σ0 @ G.T + R]])

# construct the joint distribution of (x0, y0) and partition it
multi_normal = MultivariateNormal(μ, Σ)
multi_normal.partition(2)
# the observation of y
y0 = 2.3
# conditional distribution of x0
μ1_hat, Σ11 = multi_normal.cond_dist(0, y0)
μ1_hat, Σ11
(array([-0.078125, 0.803125]),
array([[ 0.72098214, -0.203125 ],
[-0.403125 , 0.228125 ]]))
# conditional distribution of x1
x1_cond = A @ μ1_hat
Σ1_cond = C @ C.T + A @ Σ11 @ A.T
x1_cond, Σ1_cond
Here is code for solving a dynamic filtering problem by iterating on our equations, followed by an example.
def iterate(x0_hat, Σ0, A, C, G, R, y_seq):
    "Iterate the filtering and forecasting equations over the observations in y_seq."
    p, n = G.shape

    T = len(y_seq)
    x_hat_seq = np.empty((T+1, n))
    Σ_hat_seq = np.empty((T+1, n, n))

    x_hat_seq[0] = x0_hat
    Σ_hat_seq[0] = Σ0

    for t in range(T):
        xt_hat = x_hat_seq[t]
        Σt = Σ_hat_seq[t]
        μ = np.hstack([xt_hat, G @ xt_hat])
        Σ = np.block([[Σt, Σt @ G.T], [G @ Σt, G @ Σt @ G.T + R]])

        # filtering
        multi_normal = MultivariateNormal(μ, Σ)
        multi_normal.partition(n)
        x_tilde, Σ_tilde = multi_normal.cond_dist(0, y_seq[t])

        # forecasting
        x_hat_seq[t+1] = A @ x_tilde
        Σ_hat_seq[t+1] = C @ C.T + A @ Σ_tilde @ A.T

    return x_hat_seq, Σ_hat_seq

# example call (the observation sequence y_seq is defined in an elided cell)
iterate(x0_hat, Σ0, A, C, G, R, y_seq)
(array([[0. , 1. ],
[0.1215625 , 0.24875 ],
[0.18680212, 0.06904689],
[0.75576875, 0.05558463]]),
array([[[1. , 0.5 ],
[0.3 , 2. ]],
[[4.12874554, 1.95523214],
[1.92123214, 1.04592857]],
[[4.08198663, 1.99218488],
[1.98640488, 1.00886423]],
[[4.06457628, 2.00041999],
[1.99943739, 1.00275526]]]))
The iterative algorithm just described is a version of the celebrated Kalman filter.
We describe the Kalman filter and some applications of it in A First Look at the Kalman Filter
The factor analysis model widely used in psychology and other fields can be represented as
𝑌 = Λ𝑓 + 𝑈
where
1. 𝑌 is an 𝑛 × 1 random vector,
2. Λ is an 𝑛 × 𝑘 coefficient matrix,
3. 𝑓 is a 𝑘 × 1 random vector with 𝐸𝑓𝑓′ = 𝐼,
4. 𝑈 is an 𝑛 × 1 random vector with 𝐸𝑈𝑈′ = 𝐷 a diagonal matrix and 𝑈 ⟂ 𝑓 (i.e., 𝐸𝑈𝑓′ = 0),
5. It is presumed that 𝑘 is small relative to 𝑛; often 𝑘 is only 1 or 2, as in our IQ examples.
This implies that
Σ𝑦 = 𝐸𝑌 𝑌 ′ = ΛΛ′ + 𝐷
𝐸𝑌 𝑓 ′ = Λ
𝐸𝑓𝑌 ′ = Λ′
Thus, the covariance matrix Σ𝑌 is the sum of a diagonal matrix 𝐷 and a positive semi-definite matrix ΛΛ′ of rank 𝑘.
This means that all covariances among the 𝑛 components of the 𝑌 vector are intermediated by their common dependencies on the 𝑘 < 𝑛 factors.
Form
$$Z = \begin{pmatrix} f \\ Y \end{pmatrix}$$
with covariance matrix
$$\Sigma_z = E Z Z' = \begin{pmatrix} I & \Lambda' \\ \Lambda & \Lambda\Lambda' + D \end{pmatrix}$$
In the following, we first construct the mean vector and the covariance matrix for the case where 𝑁 = 10 and 𝑘 = 2.
N = 10
k = 2
The first half of the first column of Λ is filled with 1s and the second half with 0s, and symmetrically for the second column (second half 1s, first half 0s).
𝐷 is a diagonal matrix with parameter 𝜎𝑢2 on the diagonal.
Λ = np.zeros((N, k))
Λ[:N//2, 0] = 1
Λ[N//2:, 1] = 1
σu = .5
D = np.eye(N) * σu ** 2
# compute Σy
Σy = Λ @ Λ.T + D
We can now construct the mean vector and the covariance matrix for 𝑍.
μz = np.zeros(k+N)

Σz = np.empty((k+N, k+N))
Σz[:k, :k] = np.eye(k)
Σz[:k, k:] = Λ.T
Σz[k:, :k] = Λ
Σz[k:, k:] = Σy

z = np.random.multivariate_normal(μz, Σz)

f = z[:k]
y = z[k:]

# construct a MultivariateNormal instance and partition z into (f, Y)
multi_normal_factor = MultivariateNormal(μz, Σz)
multi_normal_factor.partition(k)
Let’s compute the conditional distribution of the hidden factor 𝑓 on the observations 𝑌 , namely, 𝑓 ∣ 𝑌 = 𝑦.
multi_normal_factor.cond_dist(0, y)
(array([0.37829706, 0.31441423]),
array([[0.04761905, 0. ],
[0. , 0.04761905]]))
B = Λ.T @ np.linalg.inv(Σy)
B @ y
array([0.37829706, 0.31441423])
multi_normal_factor.cond_dist(1, f)
Λ @ f
To learn about Principal Components Analysis (PCA), please see this lecture Singular Value Decompositions.
For fun, let’s apply a PCA decomposition to a covariance matrix Σ𝑦 that in fact is governed by our factor-analytic model.
Technically, this means that the PCA model is misspecified. (Can you explain why?)
Nevertheless, this exercise will let us study how well the first two principal components from a PCA can approximate the
conditional expectations 𝐸𝑓𝑖 |𝑌 for our two factors 𝑓𝑖 , 𝑖 = 1, 2 for the factor analytic model that we have assumed truly
governs the data on 𝑌 we have generated.
So we compute the PCA decomposition
$$\Sigma_y = P \tilde\Lambda P'$$
where $\tilde\Lambda$ is a diagonal matrix of eigenvalues. We then represent
$$Y = P \epsilon$$
and
$$\epsilon = P' Y$$
Note that we will arrange the eigenvectors in 𝑃 in the descending order of eigenvalues.
λ_tilde, P = np.linalg.eigh(Σy)

# arrange eigenvalues and eigenvectors in descending order of eigenvalues
ind = sorted(range(N), key=lambda x: λ_tilde[x], reverse=True)

P = P[:, ind]
λ_tilde = λ_tilde[ind]
Λ_tilde = np.diag(λ_tilde)

print('λ_tilde =', λ_tilde)

λ_tilde = [5.25 5.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25]
4.440892098500626e-16
array([[1.25, 1. , 1. , 1. , 1. , 0. , 0. , 0. , 0. , 0. ],
[1. , 1.25, 1. , 1. , 1. , 0. , 0. , 0. , 0. , 0. ],
[1. , 1. , 1.25, 1. , 1. , 0. , 0. , 0. , 0. , 0. ],
[1. , 1. , 1. , 1.25, 1. , 0. , 0. , 0. , 0. , 0. ],
[1. , 1. , 1. , 1. , 1.25, 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 1.25, 1. , 1. , 1. , 1. ],
[0. , 0. , 0. , 0. , 0. , 1. , 1.25, 1. , 1. , 1. ],
[0. , 0. , 0. , 0. , 0. , 1. , 1. , 1.25, 1. , 1. ],
[0. , 0. , 0. , 0. , 0. , 1. , 1. , 1. , 1.25, 1. ],
[0. , 0. , 0. , 0. , 0. , 1. , 1. , 1. , 1. , 1.25]])
ε = P.T @ y
print("ε = ", ε)
print('f = ', f)
f = [0.34553632 0.37994724]
plt.scatter(range(N), y, label='y')
plt.scatter(range(N), ε, label='$\epsilon$')
plt.hlines(f[0], 0, N//2-1, ls='--', label='$f_{1}$')
plt.hlines(f[1], N//2, N-1, ls='-.', label='$f_{2}$')
plt.legend()
plt.show()
ε[:2]
array([0.88819283, 0.73820417])
array([0.37829706, 0.31441423])
The fraction of variance in 𝑦𝑡 explained by the first two principal components can be computed as below.
_tilde[:2].sum() / _tilde.sum()
0.84
Compute
$$\hat Y = P_j \epsilon_j + P_k \epsilon_k$$
where $P_j, P_k$ are the eigenvectors associated with the two largest eigenvalues and $\epsilon_j, \epsilon_k$ are the first two components of $\epsilon$.
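The corresponding code cell is not shown in this extract; a one-line sketch consistent with the definitions above is:

# projection of y on the first two principal components
y_hat = P[:, :2] @ ε[:2]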
In this example, it turns out that the projection 𝑌 ̂ of 𝑌 on the first two principal components does a good job of approx-
imating 𝐸𝑓 ∣ 𝑦.
We confirm this in the following plot, which shows $f$, $E y \mid f$, $E f \mid y$, and $\hat y$ plotted against the coordinate number.
plt.scatter(range(N), Λ @ f, label='$Ey|f$')
plt.scatter(range(N), y_hat, label='$\hat{y}$')
plt.hlines(f[0], 0, N//2-1, ls='--', label='$f_{1}$')
plt.hlines(f[1], N//2, N-1, ls='-.', label='$f_{2}$')
Efy = B @ y
plt.hlines(Efy[0], 0, N//2-1, ls='--', color='b', label='$Ef_{1}|y$')
plt.hlines(Efy[1], N//2, N-1, ls='-.', color='b', label='$Ef_{2}|y$')
plt.legend()
plt.show()
The covariance matrix of $\hat Y$ can be computed by first constructing the covariance matrix of $\epsilon$ and then using the upper-left block, which corresponds to $\epsilon_1$ and $\epsilon_2$.
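A sketch of that computation, consistent with the output shown below:

# covariance matrix of ε, then keep the block for the first two components
Σεjk = P.T @ Σy @ P
Pjk = P[:, :2]

Σy_hat = Pjk @ Σεjk[:2, :2] @ Pjk.T
print('Σy_hat = \n', Σy_hat)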
Σy_hat =
[[1.05 1.05 1.05 1.05 1.05 0. 0. 0. 0. 0. ]
[1.05 1.05 1.05 1.05 1.05 0. 0. 0. 0. 0. ]
[1.05 1.05 1.05 1.05 1.05 0. 0. 0. 0. 0. ]
[1.05 1.05 1.05 1.05 1.05 0. 0. 0. 0. 0. ]
[1.05 1.05 1.05 1.05 1.05 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 1.05 1.05 1.05 1.05 1.05]
FOURTEEN
HEAVY-TAILED DISTRIBUTIONS
Contents
• Heavy-Tailed Distributions
– Overview
– Visual Comparisons
– Failure of the LLN
– Classifying Tail Properties
– Exercises
– Solutions
In addition to what’s in Anaconda, this lecture will need the following libraries:
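The install cell itself is not reproduced in this excerpt; a typical cell, assuming the yfinance and quantecon packages imported below are the ones intended, would be:

!pip install --upgrade quantecon yfinance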
14.1 Overview
Most commonly used probability distributions in classical statistics and the natural sciences have either bounded support
or light tails.
When a distribution is light-tailed, extreme observations are rare and draws tend not to deviate too much from the mean.
Having internalized these kinds of distributions, many researchers and practitioners use rules of thumb such as “outcomes
more than four or five standard deviations from the mean can safely be ignored.”
However, some distributions encountered in economics have far more probability mass in the tails than distributions like
the normal distribution.
With such heavy-tailed distributions, what would be regarded as extreme outcomes for someone accustomed to thin
tailed distributions occur relatively frequently.
Examples of heavy-tailed distributions observed in economic and financial settings include
• the income distribution and the wealth distribution (see, e.g., [Vil96], [BB18]),
• the firm size distribution ([Axt01], [Gab16]),
• the distribution of returns on holding assets over short time horizons ([Man63], [Rac03]), and
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (11, 5) #set default figure size
import numpy as np
import quantecon as qe
The following two lines can be added to avoid an annoying FutureWarning, and prevent a specific compatibility issue
between pandas and matplotlib from causing problems down the line:
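Those two lines are not shown in this excerpt; a sketch of the standard fix, assuming the usual pandas/matplotlib converter registration is what is meant, is:

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()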
14.2 Visual Comparisons
One way to build intuition on the difference between light and heavy tails is to plot independent draws and compare them side-by-side.
14.2.1 A Simulation
The figure below shows a simulation. (You will be asked to replicate it in the exercises.)
The top two subfigures each show 120 independent draws from the normal distribution, which is light-tailed.
The bottom subfigure shows 120 independent draws from the Cauchy distribution, which is heavy-tailed.
In the top subfigure, the standard deviation of the normal distribution is 2, and the draws are clustered around the mean.
In the middle subfigure, the standard deviation is increased to 12 and, as expected, the amount of dispersion rises.
The bottom subfigure, with the Cauchy draws, shows a different pattern: tight clustering around the mean for the great
majority of observations, combined with a few sudden large deviations from the mean.
This is typical of a heavy-tailed distribution.
import yfinance as yf
import pandas as pd

# daily price data for a single asset (the ticker and date range here are
# illustrative placeholders; the original cell is truncated in this excerpt)
s = yf.download('AMZN', '2015-1-1', '2019-11-1')['Adj Close']
r = s.pct_change()

fig, ax = plt.subplots()
ax.vlines(r.index, 0, r.values, lw=0.2)   # plot the daily returns
ax.set_ylabel('returns', fontsize=12)
ax.set_xlabel('date', fontsize=12)
plt.show()
Five of the 1217 observations are more than 5 standard deviations from the mean.
Overall, the figure is suggestive of heavy tails, although not to the same degree as the Cauchy distribution in the figure above.
If, however, one takes tick-by-tick data rather than daily data, the heavy-tailedness of the distribution increases further.
14.3 Failure of the LLN
One impact of heavy tails is that sample averages can be poor estimators of the underlying mean of the distribution.
To understand this point better, recall our earlier discussion of the Law of Large Numbers, which considered IID 𝑋1, … , 𝑋𝑛 with common distribution 𝐹.
If 𝔼|𝑋𝑖| is finite, then the sample mean 𝑋̄𝑛 ∶= (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑋𝑖 satisfies

ℙ {𝑋̄𝑛 → 𝜇 as 𝑛 → ∞} = 1      (14.1)

where 𝜇 ∶= 𝔼𝑋𝑖 denotes the common mean.
from scipy.stats import cauchy    # heavy-tailed distribution used below

np.random.seed(1234)
N = 1_000
distribution = cauchy()

fig, ax = plt.subplots()
data = distribution.rvs(N)

# sample mean of the first n observations, for n = 1, ..., N
sample_mean = np.cumsum(data) / np.arange(1, N+1)

# Plot
ax.plot(range(N), sample_mean, alpha=0.6, label='$\\bar X_n$')
ax.legend()
plt.show()
To see why the sample mean of Cauchy draws fails to settle down, let 𝜙(𝑡) = 𝔼 exp(𝑖𝑡𝑋𝑗) denote the characteristic function of a single draw. By independence,

𝔼 exp(𝑖𝑡𝑋̄𝑛) = 𝔼 exp{ 𝑖 (𝑡/𝑛) ∑_{𝑗=1}^{𝑛} 𝑋𝑗 }
            = 𝔼 ∏_{𝑗=1}^{𝑛} exp{ 𝑖 (𝑡/𝑛) 𝑋𝑗 }
            = ∏_{𝑗=1}^{𝑛} 𝔼 exp{ 𝑖 (𝑡/𝑛) 𝑋𝑗 } = [𝜙(𝑡/𝑛)]^𝑛

For the Cauchy distribution, 𝜙(𝑡) = exp(−|𝑡|), so [𝜙(𝑡/𝑛)]^𝑛 = exp(−|𝑡|) for every 𝑛: the sample mean 𝑋̄𝑛 has the same Cauchy distribution as a single draw, no matter how large 𝑛 is, and hence never converges.
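As a quick numerical illustration (not part of the original text), we can check that the empirical characteristic function of the Cauchy sample mean stays close to exp(−|𝑡|) regardless of the sample size 𝑛:

from scipy.stats import cauchy
import numpy as np

t_grid = np.linspace(-3, 3, 7)
n, reps = 100, 20_000
sample_means = cauchy().rvs((reps, n)).mean(axis=1)   # 20,000 sample means, each from n draws
ecf = np.exp(1j * np.outer(t_grid, sample_means)).mean(axis=1)
print(np.round(ecf.real, 2))                  # ≈ exp(-|t|), the c.f. of a single draw
print(np.round(np.exp(-np.abs(t_grid)), 2))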
14.4 Classifying Tail Properties
To keep our discussion precise, we need some definitions concerning tail properties.
We will focus our attention on the right hand tails of nonnegative random variables and their distributions.
The definitions for left hand tails are very similar and we omit them to simplify the exposition.
We say that a nonnegative random variable 𝑋 is heavy-tailed if its distribution 𝐹 (𝑥) ∶= ℙ{𝑋 ≤ 𝑥} is heavy-tailed.
This is equivalent to stating that its moment generating function 𝑚(𝑡) ∶= 𝔼 exp(𝑡𝑋) is infinite for all 𝑡 > 0.
• For example, the lognormal distribution is heavy-tailed because its moment generating function is infinite every-
where on (0, ∞).
A distribution 𝐹 on ℝ+ is called light-tailed if it is not heavy-tailed.
A nonnegative random variable 𝑋 is light-tailed if its distribution 𝐹 is light-tailed.
• Example: Every random variable with bounded support is light-tailed. (Why?)
• Example: If 𝑋 has the exponential distribution, with cdf 𝐹 (𝑥) = 1 − exp(−𝜆𝑥) for some 𝜆 > 0, then its moment
generating function is finite whenever 𝑡 < 𝜆. Hence 𝑋 is light-tailed.
One can show that if 𝑋 is light-tailed, then all of its moments are finite.
The contrapositive is that if some moment is infinite, then 𝑋 is heavy-tailed.
The latter condition is not necessary, however.
• Example: the lognormal distribution is heavy-tailed but every moment is finite.
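As a rough numerical check (not in the original lecture), we can compare truncated estimates of the moment generating function 𝑚(𝑡) = 𝔼 exp(𝑡𝑋) at 𝑡 = 0.5 for an exponential and a lognormal distribution; the former settles down while the latter keeps growing as the truncation point increases:

import numpy as np
from scipy.stats import expon, lognorm
from scipy.integrate import quad

t = 0.5
for name, dist in [('exponential(1)', expon()), ('lognormal(0, 1)', lognorm(s=1))]:
    # integrate exp(t x) f(x) over [0, b] for increasing truncation points b
    estimates = [quad(lambda x: np.exp(t * x) * dist.pdf(x), 0, b)[0] for b in (10, 20, 40)]
    print(name, np.round(estimates, 2))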
One specific class of heavy-tailed distributions has been found repeatedly in economic and social phenomena: the class
of so-called power laws.
Specifically, given 𝛼 > 0, a nonnegative random variable 𝑋 is said to have a Pareto tail with tail index 𝛼 if

lim_{𝑥→∞} 𝑥^𝛼 ℙ{𝑋 > 𝑥} = 𝑐      (14.4)

for some positive constant 𝑐.
Evidently (14.4) implies the existence of positive constants 𝑏 and 𝑥̄ such that ℙ{𝑋 > 𝑥} ≥ 𝑏𝑥^{−𝛼} whenever 𝑥 ≥ 𝑥̄.
The implication is that ℙ{𝑋 > 𝑥} converges to zero no faster than 𝑥^{−𝛼}.
In some sources, a random variable obeying (14.4) is said to have a power law tail.
The primary example is the Pareto distribution, which has distribution function

𝐹(𝑥) = { 1 − (𝑥̄/𝑥)^𝛼   if 𝑥 ≥ 𝑥̄
       { 0               if 𝑥 < 𝑥̄        (14.5)
One graphical technique for investigating Pareto tails and power laws is the so-called rank-size plot.
This kind of figure plots log size against log rank of the population (i.e., location in the population when sorted from
smallest to largest).
Often just the largest 5 or 10% of observations are plotted.
For a sufficiently large number of draws from a Pareto distribution, the plot generates a straight line. For distributions
with thinner tails, the data points are concave.
A discussion of why this occurs can be found in [NOM04].
The figure below provides one example, using simulated data.
The rank-size plot shows draws from three different distributions: folded normal, chi-squared with 1 degree of freedom, and Pareto.
The Pareto sample produces a straight line, while the lines produced by the other samples are concave.
You are asked to reproduce this figure in the exercises.
14.5 Exercises
Exercise 14.5.1
Replicate the figure presented above that compares normal and Cauchy draws.
Use np.random.seed(11) to set the seed.
Exercise 14.5.2
Prove: If 𝑋 has a Pareto tail with tail index 𝛼, then 𝔼[𝑋 𝑟 ] = ∞ for all 𝑟 ≥ 𝛼.
Exercise 14.5.3
Repeat Exercise 14.5.1, but replace the three distributions (two normal, one Cauchy) with three Pareto distributions using
different choices of 𝛼.
For 𝛼, try 1.15, 1.5 and 1.75.
Use np.random.seed(11) to set the seed.
Exercise 14.5.4
Replicate the rank-size plot figure presented above.
If you like you can use the function qe.rank_size from the quantecon library to generate the plots.
Use np.random.seed(13) to set the seed.
Exercise 14.5.5
There is an ongoing argument about whether the firm size distribution should be modeled as a Pareto distribution or a
lognormal distribution (see, e.g., [FDGA+04], [KLS18] or [ST19a]).
This sounds esoteric but has real implications for a variety of economic phenomena.
To illustrate this fact in a simple way, let us consider an economy with 100,000 firms, an interest rate of r = 0.05 and
a corporate tax rate of 15%.
Your task is to estimate the present discounted value of projected corporate tax revenue over the next 10 years.
Because we are forecasting, we need a model.
We will suppose that
1. the number of firms and the firm size distribution (measured in profits) remain fixed and
2. the firm size distribution is either lognormal or Pareto.
Present discounted value of tax revenue will be estimated by
1. generating 100,000 draws of firm profit from the firm size distribution,
2. multiplying by the tax rate, and
3. summing the results with discounting to obtain present value.
The Pareto distribution is assumed to take the form (14.5) with 𝑥̄ = 1 and 𝛼 = 1.05.
(The value of the tail index 𝛼 is plausible given the data; see [Gab16].)
To make the lognormal option as similar as possible to the Pareto option, choose its parameters such that the mean and
median of both distributions are the same.
Note that, for each distribution, your estimate of tax revenue will be random because it is based on a finite number of
draws.
To take this into account, generate 100 replications (evaluations of tax revenue) for each of the two distributions and
compare the two samples by
• producing a violin plot visualizing the two samples side-by-side and
• printing the mean and standard deviation of both samples.
For the seed use np.random.seed(1234).
What differences do you observe?
(Note: a better approach to this problem would be to model firm dynamics and try to track individual firms given the
current distribution. We will discuss firm dynamics in later lectures.)
14.6 Solutions
n = 120
np.random.seed(11)

fig, axes = plt.subplots(3, 1, figsize=(6, 12))

for ax in axes:
    ax.set_ylim((-120, 120))    # common y-axis limits (values chosen for display only)

s_vals = 2, 12

# top two panels: normal draws with standard deviations 2 and 12
for ax, s in zip(axes[:2], s_vals):
    data = np.random.randn(n) * s
    ax.plot(list(range(n)), data, linestyle='', marker='o', alpha=0.5, ms=4)
    ax.vlines(list(range(n)), 0, data, lw=0.2)
    ax.set_title(f"draws from $N(0, \\sigma^2)$ with $\\sigma = {s}$", fontsize=11)

# bottom panel: Cauchy draws
ax = axes[2]
distribution = cauchy()
data = distribution.rvs(n)
ax.plot(list(range(n)), data, linestyle='', marker='o', alpha=0.5, ms=4)
ax.vlines(list(range(n)), 0, data, lw=0.2)
ax.set_title("draws from the Cauchy distribution", fontsize=11)

plt.subplots_adjust(hspace=0.25)
plt.show()
Let 𝑋 have a Pareto tail with tail index 𝛼 and fix 𝑟 ≥ 𝛼. As noted after (14.4), there are positive constants 𝑏 and 𝑥̄ such that ℙ{𝑋 > 𝑥} ≥ 𝑏𝑥^{−𝛼} whenever 𝑥 ≥ 𝑥̄. But then

𝔼𝑋^𝑟 = 𝑟 ∫_{0}^{∞} 𝑥^{𝑟−1} ℙ{𝑋 > 𝑥} 𝑑𝑥 ≥ 𝑟 ∫_{0}^{𝑥̄} 𝑥^{𝑟−1} ℙ{𝑋 > 𝑥} 𝑑𝑥 + 𝑟 ∫_{𝑥̄}^{∞} 𝑥^{𝑟−1} 𝑏𝑥^{−𝛼} 𝑑𝑥.

We know that ∫_{𝑥̄}^{∞} 𝑥^{𝑟−𝛼−1} 𝑑𝑥 = ∞ whenever 𝑟 − 𝛼 − 1 ≥ −1.
Since 𝑟 ≥ 𝛼, we have 𝔼𝑋^𝑟 = ∞.
np.random.seed(11)
n = 120
alphas = [1.15, 1.50, 1.75]

fig, axes = plt.subplots(3, 1, figsize=(6, 8))
for a, ax in zip(alphas, axes):                  # one panel per tail index
    data = np.random.pareto(a, size=n) + 1       # Pareto draws with x̄ = 1
    ax.plot(range(n), data, linestyle='', marker='o', alpha=0.5, ms=4)
    ax.set_title(f"Pareto draws with α = {a}", fontsize=11)

plt.subplots_adjust(hspace=0.4)
plt.show()
sample_size = 1000
np.random.seed(13)
z = np.random.randn(sample_size)

data_1 = np.abs(z)                                                     # folded normal
data_2 = np.exp(z)
data_3 = np.exp(np.random.exponential(scale=1.0, size=sample_size))   # Pareto, tail index 1

data_list = [data_1, data_2, data_3]
labels = ['$|z|$', '$\\exp(z)$', 'Pareto with tail index $1$']

fig, axes = plt.subplots(3, 1, figsize=(6, 8))

for data, label, ax in zip(data_list, labels, axes):
    rank_data, size_data = qe.rank_size(data)     # rank-size data from the quantecon library
    ax.loglog(rank_data, size_data, 'o', markersize=3.0, alpha=0.5, label=label)
    ax.set_xlabel("log rank")
    ax.set_ylabel("log size")
    ax.legend()

fig.subplots_adjust(hspace=0.4)
plt.show()
The mean and median of the Pareto distribution (14.5) with 𝑥̄ = 1 are 𝛼/(𝛼 − 1) and 2^{1/𝛼} respectively.
Using the corresponding expressions for the lognormal distribution leads us to the equations

𝛼/(𝛼 − 1) = exp(𝜇 + 𝜎²/2)   and   2^{1/𝛼} = exp(𝜇)

which we solve for 𝜇 and 𝜎 given 𝛼 = 1.05.
Here is code that generates the two samples, produces the violin plot and prints the mean and standard deviation of the
two samples.
num_firms = 100_000
num_years = 10
tax_rate = 0.15
r = 0.05
β = 1 / (1 + r) # discount factor
x_bar = 1.0
α = 1.05
def pareto_rvs(n):
"Uses a standard method to generate Pareto draws."
u = np.random.uniform(size=n)
y = x_bar / (u**(1/α))
return y
μ = np.log(2) / α
σ_sq = 2 * (np.log(α/(α - 1)) - np.log(2)/α)
σ = np.sqrt(σ_sq)
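As a quick sanity check (not part of the original solution), we can confirm that with these values the lognormal distribution matches the Pareto mean 𝛼/(𝛼 − 1) = 21 and median 2^{1/𝛼}:

print(np.exp(μ + σ_sq/2), α / (α - 1))   # means of the two distributions
print(np.exp(μ), 2**(1/α))               # medians of the two distributions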
Here’s a function to compute a single estimate of tax revenue for a particular choice of distribution dist.
def tax_rev(dist):
tax_raised = 0
for t in range(num_years):
if dist == 'pareto':
π = pareto_rvs(num_firms)
else:
π = np.exp(μ + σ * np.random.randn(num_firms))
tax_raised += β**t * np.sum(π * tax_rate)
return tax_raised
num_reps = 100
np.random.seed(1234)
tax_rev_lognorm = np.empty(num_reps)
tax_rev_pareto = np.empty(num_reps)
for i in range(num_reps):
tax_rev_pareto[i] = tax_rev('pareto')
tax_rev_lognorm[i] = tax_rev('lognorm')
fig, ax = plt.subplots()
data = tax_rev_pareto, tax_rev_lognorm    # the two samples to compare
ax.violinplot(data)
plt.show()
tax_rev_pareto.mean(), tax_rev_pareto.std()
(1.4587290546623734e+06, 406089.3613661567)
tax_rev_lognorm.mean(), tax_rev_lognorm.std()
(2556174.8615230713, 25586.44456513965)
Looking at the output of the code, our main conclusion is that the Pareto assumption leads to a lower mean and greater
dispersion.
FIFTEEN
FAULT TREE UNCERTAINTIES
15.1 Overview
This lecture puts elementary tools to work to approximate probability distributions of the annual failure rates of a system
consisting of a number of critical parts.
We’ll use log normal distributions to approximate probability distributions of critical component parts.
To approximate the probability distribution of the sum of 𝑛 log normal probability distributions that describes the failure
rate of the entire system, we’ll compute the convolution of those 𝑛 log normal probability distributions.
We’ll use the following concepts and tools:
• log normal distributions
• the convolution theorem that describes the probability distribution of the sum of independent random variables
• fault tree analysis for approximating a failure rate of a multi-component system
• a hierarchical probability model for describing uncertain probabilities
• Fourier transforms and inverse Fourier transforms as efficient ways of computing convolutions of sequences
For more about Fourier transforms, see this quantecon lecture Circulant Matrices as well as these lectures Covariance
Stationary Processes and Estimation of Spectra.
El-Shanawany, Ardron, and Walker [ESAW18] and Greenfield and Sargent [GS93] used some of the methods described
here to approximate probabilities of failures of safety systems in nuclear facilities.
These methods respond to some of the recommendations made by Apostolakis [Apo90] for constructing procedures for
quantifying uncertainty about the reliability of a safety system.
We’ll start by bringing in some Python machinery.
import numpy as np
from numpy import fft
import matplotlib.pyplot as plt
import scipy as sc
from scipy.signal import fftconvolve
from tabulate import tabulate
import time                      # used below to time the convolution routines
np.set_printoptions(precision=3, suppress=True)
If a random variable 𝑥 follows a normal distribution with mean 𝜇 and variance 𝜎², then its exponential, say
𝑦 = exp(𝑥), follows a log normal distribution with parameters 𝜇, 𝜎².
Notice that we said parameters and not mean and variance 𝜇, 𝜎².
• 𝜇 and 𝜎² are the mean and variance of 𝑥 = log(𝑦)
• they are not the mean and variance of 𝑦
• instead, the mean of 𝑦 is 𝑒^{𝜇 + 𝜎²/2} and the variance of 𝑦 is (𝑒^{𝜎²} − 1)𝑒^{2𝜇 + 𝜎²}
A log normal random variable 𝑦 is nonnegative.
The density for a log normal random variate 𝑦 is

𝑓(𝑦) = (1 / (𝑦𝜎√(2𝜋))) exp( −(log 𝑦 − 𝜇)² / (2𝜎²) )

for 𝑦 ≥ 0.
Important features of a log normal random variable are

mean: 𝑒^{𝜇 + 𝜎²/2}
variance: (𝑒^{𝜎²} − 1) 𝑒^{2𝜇 + 𝜎²}
median: 𝑒^{𝜇}
mode: 𝑒^{𝜇 − 𝜎²}
.95 quantile: 𝑒^{𝜇 + 1.645𝜎}
.95-.05 quantile ratio: 𝑒^{2 × 1.645𝜎} = 𝑒^{3.29𝜎}
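These formulas can be verified numerically with scipy.stats.lognorm; the parameter values below are arbitrary choices for illustration only:

import numpy as np
from scipy.stats import lognorm

μ, σ = 0.5, 1.2
y = lognorm(s=σ, scale=np.exp(μ))     # scipy's parameterization of the log normal

print(y.mean(),   np.exp(μ + σ**2/2))                          # mean
print(y.var(),    (np.exp(σ**2) - 1) * np.exp(2*μ + σ**2))     # variance
print(y.median(), np.exp(μ))                                   # median
print(y.ppf(.95), np.exp(μ + 1.645*σ))                         # .95 quantile (1.645 ≈ Φ⁻¹(.95))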
Recall the following stability property of two independent normally distributed random variables:
If 𝑥1 is normal with mean 𝜇1 and variance 𝜎12 and 𝑥2 is independent of 𝑥1 and normal with mean 𝜇2 and variance 𝜎22 ,
then 𝑥1 + 𝑥2 is normally distributed with mean 𝜇1 + 𝜇2 and variance 𝜎12 + 𝜎22 .
Independent log normal distributions have a different stability property.
The product of independent log normal random variables is also log normal.
In particular, if 𝑦1 is log normal with parameters (𝜇1 , 𝜎12 ) and 𝑦2 is log normal with parameters (𝜇2 , 𝜎22 ), then the product
𝑦1 𝑦2 is log normal with parameters (𝜇1 + 𝜇2 , 𝜎12 + 𝜎22 ).
Note: While the product of two log normal random variables is log normal, the sum of two log normal random variables is not log normal.
This observation sets the stage for the challenge that confronts us in this lecture, namely, to approximate probability distributions of sums of independent log normal random variables.
To compute the probability distribution of the sum of two log normal random variables, we can use the following convolution property of the distribution of a sum of independent random variables: if 𝑋 has probability mass function 𝑓 and 𝑌 is independent of 𝑋 with probability mass function 𝑔, then 𝑍 = 𝑋 + 𝑌 has probability mass function ℎ = 𝑓 ∗ 𝑔 given by

ℎ_𝑚 = ∑_𝑘 𝑓_𝑘 𝑔_{𝑚−𝑘}

We shall apply this property to compute a discretized version of the probability distribution of the sum of two random variables, one with probability mass function 𝑓, the other with probability mass function 𝑔.
Before applying the convolution property to sums of log normal distributions, let’s practice on some simple discrete
distributions.
To take one example, let’s consider the following two probability distributions
𝑓𝑗 = Prob(𝑋 = 𝑗), 𝑗 = 0, 1
and
𝑔𝑗 = Prob(𝑌 = 𝑗), 𝑗 = 0, 1, 2, 3
and the distribution of their sum
ℎ𝑗 = Prob(𝑍 ≡ 𝑋 + 𝑌 = 𝑗), 𝑗 = 0, 1, 2, 3, 4
which, by the convolution property, satisfies
ℎ = 𝑓 ∗ 𝑔 = 𝑔 ∗ 𝑓
f = [.75, .25]
g = [0., .6, 0., .4]
h = np.convolve(f,g)
hf = fftconvolve(f,g)
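Working the convolution out by hand for these choices gives ℎ = (0, 0.45, 0.15, 0.3, 0.1); for instance, ℎ1 = 𝑓0 𝑔1 = 0.75 × 0.6 = 0.45. Printing the two computed arrays should confirm this, up to floating point error:

print("h  =", h)     # [0.   0.45 0.15 0.3  0.1 ]
print("hf =", hf)    # same values, computed with the FFT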
A little later we’ll explain some advantages that come from using scipy.signal.fftconvolve rather than numpy.convolve.
They provide the same answers, but scipy.signal.fftconvolve is much faster.
That’s why we rely on it later in this lecture.
We’ll construct an example to verify that discretized distributions can do a good job of approximating samples drawn
from underlying continuous distributions.
We’ll start by generating samples of size 25000 of three independent log normal random variates as well as pairwise and
triple-wise sums.
Then we’ll plot histograms and compare them with convolutions of appropriate discretized log normal distributions.
# parameter values are assumptions chosen to match the theoretical mean
# exp(mu + sigma**2/2) ≈ 244.7 reported below
mu1, sigma1 = mu2, sigma2 = mu3, sigma3 = 5., 1.
s1 = np.random.lognormal(mu1, sigma1, 25_000)
s2 = np.random.lognormal(mu2, sigma2, 25_000)
s3 = np.random.lognormal(mu3, sigma3, 25_000)
## create sums of two and three log normal random variates
ssum2 = s1 + s2
ssum3 = s1 + s2 + s3
samp_mean2 = np.mean(s2)
pop_mean2 = np.exp(mu2 + (sigma2**2)/2)
Here are helper functions that create a discretized version of a log normal probability density function.
def p_log_normal(x,μ,σ):
p = 1 / (σ*x*np.sqrt(2*np.pi)) * np.exp(-1/2*((np.log(x) - μ)/σ)**2)
return p
def pdf_seq(μ,σ,I,m):
x = np.arange(1e-7,I,m)
p_array = p_log_normal(x,μ,σ)
p_array_norm = p_array/np.sum(p_array)
return p_array,p_array_norm,x
Now we shall set a grid length 𝐼 and a grid increment size 𝑚 for our discretizations.
Note: We set 𝐼 equal to a power of two because we want to be free to use a Fast Fourier Transform to compute a
convolution of two sequences (discrete distributions).
We recommend experimenting with different values of the power 𝑝 of 2.
Setting it to 15 rather than 12, for example, improves how well the discretized probability mass function approximates
the original continuous probability density function being studied.
p=15
I = 2**p # Truncation value
m = .1 # increment size
p1,p1_norm,x = pdf_seq(mu1,sigma1,I,m)
## compute number of points to evaluate the probability mass function
NT = x.size
plt.figure(figsize=(8, 8))
plt.subplot(2, 1, 1)
plt.plot(x[:NT], p1[:NT], label='')
plt.xlim(0, 2500)
count, bins, ignored = plt.hist(s1, 1000, density=True, align='mid')
plt.show()
# Compute mean from discretized pdf and compare with the theoretical value
mean = np.sum(np.multiply(x[:NT], p1_norm[:NT]))
meantheory = np.exp(mu1 + .5*sigma1**2)
mean, meantheory
(2.446905989830291e+02, 244.69193226422038)
Now let’s use the convolution theorem to compute the probability distribution of a sum of the two log normal random
variables we have parameterized above.
We’ll also compute the probability of a sum of three log normal distributions constructed above.
Before we do these things, we shall explain our choice of Python algorithm to compute a convolution of two sequences.
Because the sequences that we convolve are long, we use the scipy.signal.fftconvolve function rather than
the numpy.convolve function.
These two functions give virtually equivalent answers but for long sequences scipy.signal.fftconvolve is much
faster.
The program scipy.signal.fftconvolve uses fast Fourier transforms and their inverses to calculate convolu-
tions.
Let’s define the Fourier transform and the inverse Fourier transform.
The Fourier transform of a sequence {𝑥𝑡}_{𝑡=0}^{𝑇−1} is a sequence of complex numbers {𝑥(𝜔𝑗)}_{𝑗=0}^{𝑇−1} given by

𝑥(𝜔𝑗) = ∑_{𝑡=0}^{𝑇−1} 𝑥𝑡 exp(−𝑖𝜔𝑗 𝑡)      (15.1)

where 𝜔𝑗 = 2𝜋𝑗/𝑇 for 𝑗 = 0, 1, … , 𝑇 − 1.
The inverse Fourier transform of the sequence {𝑥(𝜔𝑗)}_{𝑗=0}^{𝑇−1} is

𝑥𝑡 = 𝑇^{−1} ∑_{𝑗=0}^{𝑇−1} 𝑥(𝜔𝑗) exp(𝑖𝜔𝑗 𝑡)      (15.2)
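As a small check (not in the original text), numpy's fft and ifft implement exactly (15.1) and (15.2), with the same sign and scaling conventions:

import numpy as np

T = 8
x = np.random.randn(T)
ω = 2 * np.pi * np.arange(T) / T
X_direct = np.array([np.sum(x * np.exp(-1j * w * np.arange(T))) for w in ω])

print(np.allclose(X_direct, np.fft.fft(x)))          # the transform matches (15.1)
print(np.allclose(x, np.fft.ifft(np.fft.fft(x))))    # the inverse recovers x, as in (15.2)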
p1,p1_norm,x = pdf_seq(mu1,sigma1,I,m)
p2,p2_norm,x = pdf_seq(mu2,sigma2,I,m)
p3,p3_norm,x = pdf_seq(mu3,sigma3,I,m)
tic = time.perf_counter()
c1 = np.convolve(p1_norm, p2_norm)
c2 = np.convolve(c1, p3_norm)
toc = time.perf_counter()
tdiff1 = toc - tic            # time taken by np.convolve

tic = time.perf_counter()
c1f = fftconvolve(p1_norm, p2_norm)
c2f = fftconvolve(c1f, p3_norm)
toc = time.perf_counter()
tdiff2 = toc - tic            # time taken by fftconvolve

print("time with np.convolve = ", tdiff1, "; time with fftconvolve = ", tdiff2)
The fast Fourier transform is two orders of magnitude faster than numpy.convolve.
Now let’s plot our computed probability mass function approximation for the sum of two log normal random variables
against the histogram of the sample that we formed above.
NT = np.size(x)

plt.figure(figsize=(8, 8))
plt.subplot(2, 1, 1)
plt.plot(x[:NT], c1f[:NT]/m, label='')
plt.xlim(0, 5000)
# overlay the histogram of the sampled sum ssum2 (this line is a reconstruction;
# the original cell's histogram call is not shown in this excerpt)
count, bins, ignored = plt.hist(ssum2, 1000, density=True, align='mid')
plt.show()
NT = np.size(x)

plt.figure(figsize=(8, 8))
plt.subplot(2, 1, 1)
plt.plot(x[:NT], c2f[:NT]/m, label='')
plt.xlim(0, 5000)
# overlay the histogram of the sampled sum ssum3 (reconstructed line)
count, bins, ignored = plt.hist(ssum3, 1000, density=True, align='mid')
plt.show()
As a check, the mean computed from each convolved, discretized distribution is very close to the theoretical mean of the corresponding sum:
(489.38109740938546, 489.38386452844077)
(734.0714863312252, 734.0757967926611)
We shall soon apply the convolution theorem to compute the probability of a top event in a fault tree analysis.
Before applying the convolution theorem, we first describe the model that connects constituent events to the top event whose failure rate we seek to quantify.
The model is an example of the widely used fault tree analysis described by El-Shanawany, Ardron, and Walker [ESAW18].
To construct the statistical model, we repeatedly use what is called the rare event approximation.
We want to compute the probability of an event 𝐴 ∪ 𝐵.
• the union 𝐴 ∪ 𝐵 is the event that 𝐴 OR 𝐵 occurs
A law of probability tells us that 𝐴 OR 𝐵 occurs with probability
𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃 (𝐵) − 𝑃 (𝐴 ∩ 𝐵)
where the intersection 𝐴 ∩ 𝐵 is the event that 𝐴 AND 𝐵 both occur and the union 𝐴 ∪ 𝐵 is the event that 𝐴 OR 𝐵
occurs.
If 𝐴 and 𝐵 are statistically independent, then

𝑃 (𝐴 ∩ 𝐵) = 𝑃 (𝐴)𝑃 (𝐵)
If 𝑃 (𝐴) and 𝑃 (𝐵) are both small, then 𝑃 (𝐴)𝑃 (𝐵) is even smaller.
The rare event approximation is
𝑃 (𝐴 ∪ 𝐵) ≈ 𝑃 (𝐴) + 𝑃 (𝐵)
15.7 Application
A system has been designed with the feature that a system failure occurs when any of its 𝑛 critical components fails.
The failure probability 𝑃 (𝐴𝑖 ) of each event 𝐴𝑖 is small.
We assume that failures of the components are statistically independent random variables.
We repeatedly apply the rare event approximation to the event 𝐹 = 𝐴1 ∪ 𝐴2 ∪ ⋯ ∪ 𝐴𝑛 that the system fails, which gives the following approximation to the probability of a system failure:

𝑃 (𝐹) ≈ ∑_{𝑖=1}^{𝑛} 𝑃 (𝐴𝑖)      (15.3)
Probabilities for each event are recorded as failure rates per year.
Now we come to the problem that really interests us, following [ESAW18] and Greenfield and Sargent [GS93] in the
spirit of Apostolakis [Apo90].
The constituent probabilities or failure rates 𝑃 (𝐴𝑖 ) are not known a priori and have to be estimated.
We address this problem by specifying probabilities of probabilities that capture one notion of not knowing the con-
stituent probabilities that are inputs into a failure tree analysis.
Thus, we assume that a system analyst is uncertain about the failure rates 𝑃 (𝐴𝑖 ), 𝑖 = 1, … , 𝑛 for components of a system.
The analyst copes with this situation by regarding the system's failure probability 𝑃 (𝐹) and each of the component probabilities 𝑃 (𝐴𝑖) as random variables.
• the dispersion of the probability distribution of 𝑃 (𝐴𝑖) characterizes the analyst’s uncertainty about the failure probability 𝑃 (𝐴𝑖)
• the dispersion of the implied probability distribution of 𝑃 (𝐹) characterizes his uncertainty about the probability of a system’s failure.
This leads to what is sometimes called a hierarchical model in which the analyst has probabilities about the probabilities
𝑃 (𝐴𝑖 ).
The analyst formalizes his uncertainty by assuming that
• the failure probability 𝑃 (𝐴𝑖 ) is itself a log normal random variable with parameters (𝜇𝑖 , 𝜎𝑖 ).
• failure rates 𝑃 (𝐴𝑖 ) and 𝑃 (𝐴𝑗 ) are statistically independent for all pairs with 𝑖 ≠ 𝑗.
The analyst calibrates the parameters (𝜇𝑖 , 𝜎𝑖 ) for the failure events 𝑖 = 1, … , 𝑛 by reading reliability studies in engineering
papers that have studied historical failure rates of components that are as similar as possible to the components being used
in the system under study.
The analyst assumes that such information about the observed dispersion of annual failure rates, or times to failure, can
inform him of what to expect about parts’ performances in his system.
The analyst assumes that the random variables 𝑃 (𝐴𝑖 ) are statistically mutually independent.
The analyst wants to approximate a probability mass function and cumulative distribution function of the system's failure probability 𝑃 (𝐹).
• We say probability mass function because of how we discretize each random variable, as described earlier.
The analyst calculates the probability mass function for the top event 𝐹 , i.e., a system failure, by repeatedly applying
the convolution theorem to compute the probability distribution of a sum of independent log normal random variables, as
described in equation (15.3).
Note: Because the failure rates are all very small, log normal distributions with the above parameter values actually describe 𝑃 (𝐴𝑖) times 10⁻⁹.
So the probabilities that we’ll put on the 𝑥 axis of the probability mass function and associated cumulative distribution function should be multiplied by 10⁻⁹.
To extract a table that summarizes computed quantiles, we’ll use a helper function
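The helper's definition is not reproduced in this excerpt; a standard sketch consistent with how it is called below (it returns the index of the array entry closest to a given value) is:

def find_nearest(array, value):
    "Return the index of the entry of `array` closest to `value`."
    array = np.asarray(array)
    return np.abs(array - value).argmin()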
p=15
I = 2**p # Truncation value
m = .05 # increment size
p1,p1_norm,x = pdf_seq(mu1,sigma1,I,m)
p2,p2_norm,x = pdf_seq(mu2,sigma2,I,m)
p3,p3_norm,x = pdf_seq(mu3,sigma3,I,m)
p4,p4_norm,x = pdf_seq(mu4,sigma4,I,m)
p5,p5_norm,x = pdf_seq(mu5,sigma5,I,m)
p6,p6_norm,x = pdf_seq(mu6,sigma6,I,m)
p7,p7_norm,x = pdf_seq(mu7,sigma7,I,m)
p8,p8_norm,x = pdf_seq(mu7,sigma7,I,m)
p9,p9_norm,x = pdf_seq(mu7,sigma7,I,m)
p10,p10_norm,x = pdf_seq(mu7,sigma7,I,m)
p11,p11_norm,x = pdf_seq(mu7,sigma7,I,m)
p12,p12_norm,x = pdf_seq(mu7,sigma7,I,m)
p13,p13_norm,x = pdf_seq(mu7,sigma7,I,m)
p14,p14_norm,x = pdf_seq(mu7,sigma7,I,m)
tic = time.perf_counter()
c1 = fftconvolve(p1_norm,p2_norm)
c2 = fftconvolve(c1,p3_norm)
c3 = fftconvolve(c2,p4_norm)
c4 = fftconvolve(c3,p5_norm)
c5 = fftconvolve(c4,p6_norm)
c6 = fftconvolve(c5,p7_norm)
c7 = fftconvolve(c6,p8_norm)
c8 = fftconvolve(c7,p9_norm)
c9 = fftconvolve(c8,p10_norm)
c10 = fftconvolve(c9,p11_norm)
c11 = fftconvolve(c10,p12_norm)
c12 = fftconvolve(c11,p13_norm)
c13 = fftconvolve(c12,p14_norm)
toc = time.perf_counter()
d13 = np.cumsum(c13)
Nx = 1400
plt.figure()
# plot the cumulative distribution function d13 of the approximated system failure rate
# (this plotting call is a reconstruction; the original continuation is not shown)
plt.plot(x[:Nx], d13[:Nx])
plt.hlines(0.5,min(x),Nx,linestyles='dotted',colors = {'black'})
plt.hlines(0.9,min(x),Nx,linestyles='dotted',colors = {'black'})
plt.hlines(0.95,min(x),Nx,linestyles='dotted',colors = {'black'})
plt.hlines(0.1,min(x),Nx,linestyles='dotted',colors = {'black'})
plt.hlines(0.05,min(x),Nx,linestyles='dotted',colors = {'black'})
plt.ylim(0,1)
plt.xlim(0,Nx)
plt.xlabel("$x10^{-9}$",loc = "right")
plt.show()
x_1 = x[find_nearest(d13,0.01)]
x_5 = x[find_nearest(d13,0.05)]
x_10 = x[find_nearest(d13,0.1)]
x_50 = x[find_nearest(d13,0.50)]
x_66 = x[find_nearest(d13,0.665)]
x_85 = x[find_nearest(d13,0.85)]
x_90 = x[find_nearest(d13,0.90)]
x_95 = x[find_nearest(d13,0.95)]
x_99 = x[find_nearest(d13,0.99)]
x_9978 = x[find_nearest(d13,0.9978)]
print(tabulate([
['1%',f"{x_1}"],
['5%',f"{x_5}"],
['10%',f"{x_10}"],
['50%',f"{x_50}"],
['66.5%',f"{x_66}"],
['85%',f"{x_85}"],
['90%',f"{x_90}"],
['95%',f"{x_95}"],
['99%',f"{x_99}"],
['99.78%',f"{x_9978}"]],
headers = ['Percentile', 'x * 1e-9']))
Percentile      x * 1e-9
------------    ----------
1%                 76.15
5%                106.5
10%               128.2
50%               260.55
66.5%             338.55
85%               509.4
90%               608.8
95%               807.6
99%              1470.2
99.78%           2474.85
SIXTEEN
INTRODUCTION TO ARTIFICIAL NEURAL NETWORKS
Note: If you are running this on Google Colab the above cell will present an error. This is because Google Colab doesn’t
use Anaconda to manage the Python packages. However this lecture will still execute as Google Colab has plotly
installed.
16.1 Overview
𝑦 = 𝑓(𝑥)

Commonly used activation functions include the rectified linear unit (ReLU)

ℎ(𝑧) = max(0, 𝑧)

and the identity function

ℎ(𝑧) = 𝑧

As activation functions below, we’ll use the sigmoid function for layers 1 to 𝑁 − 1 and the identity function for layer 𝑁.
To approximate a function 𝑓(𝑥) we construct an approximation 𝑓̂(𝑥) by proceeding as follows.
Let

𝑙_𝑖(𝑥) = 𝑤_𝑖 𝑥 + 𝑏_𝑖.

Then

𝑓(𝑥) ≈ 𝑓̂(𝑥) = ℎ_𝑁 ∘ 𝑙_𝑁 ∘ ℎ_{𝑁−1} ∘ 𝑙_{𝑁−1} ∘ ⋯ ∘ ℎ_1 ∘ 𝑙_1(𝑥)
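The following minimal NumPy sketch (not the lecture's own implementation) evaluates this composition for scalar inputs; the number of layers and the parameter values are placeholders chosen only for illustration:

import numpy as np

def sigmoid(z):
    "Sigmoid activation used for layers 1 to N-1."
    return 1 / (1 + np.exp(-z))

def f_hat(x, weights, biases):
    "Evaluate h_N ∘ l_N ∘ h_{N-1} ∘ l_{N-1} ∘ ... ∘ h_1 ∘ l_1 at a scalar x."
    a = x
    N = len(weights)
    for i, (w, b) in enumerate(zip(weights, biases), start=1):
        z = w * a + b                       # affine map l_i(a) = w_i a + b_i
        a = z if i == N else sigmoid(z)     # identity activation in layer N, sigmoid otherwise
    return a

# example with N = 3 layers and placeholder parameters
print(f_hat(0.5, weights=[1.5, -0.7, 2.0], biases=[0.1, 0.3, -0.2]))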