Machine Learning in the Cross Section of Asset Returns

Doron Avramov

Updated: October 16, 2024


Machine Learning Methods in Asset Pricing
 Machine learning offers a vast collection of high-dimensional models designed to predict, explain,
or classify quantities of interest, while imposing regularization to prevent overfitting.
 In empirical finance, machine learning models have been deployed to predict returns, estimate the
stochastic discount factor (SDF), extract common factors, and test asset pricing models.
 There are many other applications including the study of mutual funds.
 Machine learning applies to any asset class, including equities, fixed-income securities,
currencies, and cryptocurrencies.
 This set of notes comprehensively covers the architecture of various routines and their broad
implementations in empirical asset pricing.
 The routines apply to beta-pricing (IPCA, CA) and pricing kernel (Ridge, GAN) representations.
 The routines can also apply to reduced-form settings (Ridge, LASSO, NN, LSTM, TE, RL),
where the return-generating process does not conform to any structural representation.
 IPCA, CA, Ridge, GAN, LASSO, NN, LSTM, TE, and RL are all abbreviations for machine
learning routines to be explained in detail throughout the notes.
 I will also discuss how ChatGPT works, building on concepts from stock return prediction.



Asset pricing models
 We start with a brief review of asset pricing settings.
 In unconditional asset pricing, stock return moments are fixed.
 That is, all model parameters are time-invariant.
 Conditional models formulate time variation.
 There are different ways to model time-varying moments.
 First, risk and risk premia can vary with macro variables (e.g., the dividend-to-price
ratio) or firm characteristics (e.g., size, book-to-market).
 Second, risk and risk premia can follow a latent ARMA process, e.g., AR(1).
 Third, time variation can be captured through high-frequency data in rolling samples.
 Theoretically, beta pricing and pricing kernel representations are equivalent.
 Empirical tests, however, could be different.



Beta pricing representation
 In beta pricing settings, the expected excess return of asset i at time t is given by
  E[ r^{e}_{i,t} ] = \alpha_i + \beta_i' E[ f_t ]

where f_t denotes a set of K portfolio spreads, \beta_i is a K-vector of factor loadings, and \alpha_i
reflects the expected return component unexplained by factors, or model mispricing.
 The aim is to identify economic (ICAPM) or statistical (APT) factors that eliminate
mispricing.
 Absent alpha, the expected return differential across assets is triggered by factor
loadings only.
 The presence of model mispricing could give rise to additional cross-sectional effects.
 The alpha-beta debate: does an anomaly represent a risk factor or instead mispricing?
 An anomaly can also reflect false discovery; hence, t-ratio thresholds should be higher.



Pricing kernel representation
 The pricing kernel representation for asset pricing can be formulated through the Hansen-
Jagannathan equation:

  M_t = 1 - b'( r_t - \mu ),

where \mu is an N-vector of expected returns and r_t is an N-vector of realized returns.

 The unobservable pricing kernel is projected on the space of demeaned returns.

 Identification of the projection slopes is feasible through the first-order conditions.

 FOCs are also known as the Euler equation, E[ M_t r_t ] = 1_N, where 1_N is an N-vector of ones.

 There are three plausible formulations for the N-vector of slope coefficients.



Pricing kernel parameters
 First, 𝑏 can be constant – amounting to unconditional models.

 Moreover, b can also vary with macro conditions or firm characteristics.

 To illustrate, consider variation with firm characteristics.

 We have b_{t-1} = C_{t-1} d, where C_{t-1} is an N \times H matrix of H characteristics (e.g., size,
profitability, past returns) for each of the N stocks, and d is an H \times 1 vector.

 Plugging b_{t-1} into the pricing kernel yields

  M_t = 1 - d' [ C_{t-1}' r_t - E_{t-1}( C_{t-1}' r_t ) ].

 The quantity C_{t-1}' r_t can be interpreted as H returns on managed portfolios.

 Conceptually, the pricing kernel is now projected on the space of H demeaned managed
portfolio returns.
OLS
 Prior to delving into machine learning methods, we start, for perspective, with
ordinary least squares (OLS).
 OLS is the best linear unbiased estimator (BLUE) of the regression coefficients.
 Regression errors do not have to be normal, nor do they have to be
independently and identically distributed.
 But errors have to be zero mean, serially uncorrelated, and homoskedastic.
 In the presence of heteroskedasticity or autocorrelation, OLS is no longer BLUE.
 We can still use OLS estimators by finding heteroskedasticity-robust estimators
of the variance, or we can devise an efficient estimator by re-weighting the data
appropriately to incorporate heteroskedasticity.
 Similarly, with autocorrelation, we can find an autocorrelation-robust estimator
of the variance, or we can devise an efficient estimator by re-weighting the data
appropriately to account for autocorrelation.



Is BLUE so promising?
 Requiring linearity is binding since nonlinear estimators do exist.
 This is where nonparametric Lasso and neural networks (NN) come to play.
 Likewise, requiring unbiasedness is binding since biased estimators do exist.
 This is where shrinkage methods come into play: the OLS estimator's variance
can be too large because OLS coefficients are unregularized.
 If judged by Mean Squared Error (MSE), alternative biased estimators could be
more effective if they produce substantially smaller variance within the set of
linear models.
 Recall that MSE = Variance + Bias Squared.
 Likewise, alternative non-linear estimators could be more effective within the
set of unbiased models.
 The MSE of an OLS estimate is computed on the next page.



MSE of OLS
 Let \beta denote the true regression coefficients and let \hat{\beta} = (X'X)^{-1} X'Y, where X is a T \times M
matrix of de-meaned predictors and Y is a T \times 1 vector of the dependent variable.
 Both \beta and its estimate \hat{\beta} are vectors of dimension M.
 The mean squared error (MSE) of the OLS estimate is given by

  MSE( \hat{\beta} ) = E[ ( \hat{\beta} - \beta )' ( \hat{\beta} - \beta ) ]
                   = E[ tr( ( \hat{\beta} - \beta )' ( \hat{\beta} - \beta ) ) ]
                   = E[ tr( ( \hat{\beta} - \beta ) ( \hat{\beta} - \beta )' ) ]
                   = tr( E[ ( \hat{\beta} - \beta ) ( \hat{\beta} - \beta )' ] )
                   = tr[ (X'X)^{-1} \sigma^2 ]
                   = \sigma^2 tr[ (X'X)^{-1} ].

 When predictors are highly correlated, the matrix X'X is ill-conditioned and the expression
tr[ (X'X)^{-1} ] quickly explodes.
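As a sanity check, here is a minimal Monte Carlo sketch (illustrative simulated data, not from the notes) confirming that the simulated MSE of the OLS estimate matches \sigma^2 tr[ (X'X)^{-1} ]:

```python
# Verify MSE(beta_hat) = sigma^2 * tr[(X'X)^{-1}] by simulation.
import numpy as np

rng = np.random.default_rng(0)
T, M, sigma = 200, 5, 0.5
X = rng.standard_normal((T, M))
X -= X.mean(axis=0)                      # de-meaned predictors, fixed across draws
beta = rng.standard_normal(M)            # true coefficients

XtX_inv = np.linalg.inv(X.T @ X)
theory = sigma**2 * np.trace(XtX_inv)    # sigma^2 tr[(X'X)^{-1}]

sq_err = []
for _ in range(5000):
    Y = X @ beta + sigma * rng.standard_normal(T)
    b_hat = XtX_inv @ X.T @ Y            # OLS estimate
    sq_err.append(np.sum((b_hat - beta) ** 2))

print(f"Monte Carlo MSE: {np.mean(sq_err):.5f}  theory: {theory:.5f}")
```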



Shortcomings of OLS
 In the presence of many predictors, OLS delivers nonzero estimates for all coefficients – thus
it is difficult to implement variable selection when the true data-generating process is sparse.
 Interpretation becomes challenging, as even insignificant coefficients contribute to the
predicted value.
 The OLS solution is unique only if the design matrix X has full rank.
 OLS does not handle potential nonlinearities and interactions between predictors.
 In summary, OLS is restrictive, often yields poor predictions, may overfit, does not penalize
model complexity or large coefficients, and can be difficult to interpret.
 Bayesian perspective: one can introduce informed priors on regression coefficients to shrink
slopes toward zero or values implied by economic theory or sound intuition.
 Classical perspective: shrinkage methods penalize complexity and impose regularization.
 Nonlinearities and interactions between predictors can also be accounted for.
 Such objectives are accomplished through an assortment of machine-learning methods.



Stock Return Predictability: Economic Restrictions on OLS
 Still, there are two simple ways to possibly improve OLS estimates.
 Source: Gu, Kelly, and Xiu (2019).
 Base case: the pooled OLS estimator corresponds to a panel (balanced) regression of future returns on firm
attributes, where T and N represent the time-series and the cross-section dimensions.
 The objective is formulated as
  L( \theta ) = (1/(NT)) \sum_{i=1}^{N} \sum_{t=1}^{T} ( r_{i,t+1} - f( x_{i,t}; \theta ) )^2

where r_{i,t+1} is the return on stock i at time t+1, f( x_{i,t}; \theta ) = x_{i,t}' \theta is the corresponding
predicted return, x_{i,t} is a set of firm characteristics, and \theta stands for model parameters.

 Predictive performance could be improved using an optimization that weights stocks by market size,
inverse volatility (precision), inverse credit risk, etc., rather than equally.



Stock Return Predictability: Economic Restrictions on OLS
 An alternative optimization takes account of the heavy tails displayed by stock returns and the
potentially harmful effects of outliers.
 The objective is formulated such that squared (absolute) loss is applied to small (large) errors:
  L( \theta ) = (1/(NT)) \sum_{i=1}^{N} \sum_{t=1}^{T} H( r_{i,t+1} - f( x_{i,t}; \theta ), \xi )

where \xi is a tuning hyper-parameter and

  H( y, \xi ) = y^2                  if |y| <= \xi
              = 2\xi |y| - \xi^2     if |y| > \xi

 The hyper-parameter 𝜉 is determined by model performance in a validation sample.
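A minimal sketch of the H(y, \xi) loss above (variable names are illustrative; the threshold \xi would be tuned on a validation sample):

```python
# Huber-style loss: squared for small errors, linear (absolute) beyond xi.
import numpy as np

def huber(y, xi):
    y = np.asarray(y)
    return np.where(np.abs(y) <= xi,
                    y**2,
                    2.0 * xi * np.abs(y) - xi**2)

errors = np.array([-3.0, -0.5, 0.1, 0.8, 4.0])   # r_{i,t+1} - f(x_{i,t}; theta)
print(huber(errors, xi=1.0))   # large errors are penalized linearly, not quadratically
```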


 Later, the selection of hyperparameters is described in detail.
 At this point, we are ready to proceed with machine learning methods, starting from Ridge.



Ridge Regression
 Ridge is one of several shrinkage methods.
 Hoerl and Kennard (1970a, 1970b) introduce the Ridge regression

  min_{\beta} (Y - X\beta)'(Y - X\beta)   s.t.   \sum_{j=1}^{M} \beta_j^2 <= c

 The minimization can be rewritten as

  L( \beta ) = (Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta

 We get

  \hat{\beta}^{ridge} = (X'X + \lambda I_M)^{-1} X'Y

where I_M is the identity matrix of order M.
 A first-order benefit is that the regression coefficients can be expressed analytically.
 Notice that including \lambda makes X'X + \lambda I_M nonsingular even when X'X is noninvertible.
 \lambda is a hyperparameter that controls the amount of regularization.
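A minimal sketch of the analytical Ridge solution (illustrative simulated data; `lam` plays the role of \lambda):

```python
# Closed-form Ridge estimator: (X'X + lam*I)^{-1} X'Y.
import numpy as np

def ridge(X, Y, lam):
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ Y)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
Y = X[:, 0] + 0.1 * rng.standard_normal(100)
print(ridge(X, Y, lam=0.0)[:3])    # lambda -> 0: the OLS estimator obtains
print(ridge(X, Y, lam=1e6)[:3])    # lambda -> infinity: coefficients shrink to zero
```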



Ridge Regression
 As \lambda \to 0, the OLS estimator obtains.
 As \lambda \to \infty, we have \hat{\beta}^{ridge} = 0.
 Ridge regressions do not have a sparse representation (dropping irrelevant predictors), so
using model selection criteria to pick 𝜆 is infeasible.
 Instead, validation methods are employed.
 In particular, split the sample into three intervals: training, validation, and testing.
 Fit the model on the training sample for various values of \lambda, each of which delivers a prediction.
 The validation sample is used to choose the \lambda that provides the smallest MSE.
 Hence, both training and validation samples are used to pick 𝜆.
 Then, the experiment is assessed through out-of-sample predictions given the choice of 𝜆.



Ridge Regression
 As shown below, in a Bayesian setting, the parameter 𝜆 denotes the prior precision of beliefs
that regression slope coefficients are all equal to zero.
 Precision is the inverse of the variance.
 Classical perspective: the ridge estimator is biased:

  E[ \hat{\beta}^{ridge} ] \ne \beta

 There is no bias from a Bayesian perspective because \hat{\beta}^{ridge} is the posterior mean of \beta.
 The Bayesian analysis combines informed prior views with the likelihood function.
 The posterior mean is the weighted average of (i) the prior mean and (ii) the sample mean,
with weights reflecting the precisions of (i) the prior and (ii) the sample estimate, respectively.



Interpretations of the Ridge Regression
Interpretation #1: Data Augmentation
 The Ridge-minimization problem can be formulated as

  \sum_{t=1}^{T} ( y_t - x_t'\beta )^2 + \sum_{j=1}^{M} ( 0 - \sqrt{\lambda}\,\beta_j )^2

 Thus, the Ridge estimator is the usual OLS estimator where the data is augmented such that

  X_\lambda = [ X ; \sqrt{\lambda} I_M ],   Y_\lambda = [ Y ; 0_M ]

where 0_M is an M-vector of zeros and the extra rows are stacked beneath the originals.
 Then, it follows that

  \hat{\beta}^{ridge} = ( X_\lambda' X_\lambda )^{-1} X_\lambda' Y_\lambda
                     = (X'X + \lambda I_M)^{-1} X'Y
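A quick numerical check of Interpretation #1 (a sketch with simulated data): OLS on the augmented data reproduces the Ridge estimator.

```python
# Data augmentation: stack sqrt(lam)*I under X and zeros under Y, then run OLS.
import numpy as np

rng = np.random.default_rng(2)
T, M, lam = 80, 6, 3.0
X = rng.standard_normal((T, M))
Y = rng.standard_normal(T)

X_aug = np.vstack([X, np.sqrt(lam) * np.eye(M)])
Y_aug = np.concatenate([Y, np.zeros(M)])

b_ols_aug = np.linalg.lstsq(X_aug, Y_aug, rcond=None)[0]
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ Y)
print(np.allclose(b_ols_aug, b_ridge))   # True
```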



Interpretations of the Ridge Regression
Interpretation #2: Informative Bayes Prior
 Suppose the prior on \beta is of the form:

  \beta \sim N( 0, (1/\lambda) I_M )

 Then, the posterior mean of \beta is:

  (X'X + \lambda I_M)^{-1} X'Y
 Bayesian methods are quite useful in asset pricing.
 For instance, consider the time-series asset pricing regression
𝑟𝑡 = 𝛼 + 𝛽𝑟𝑚𝑡 + 𝜀𝑡
 From a Bayesian perspective, we can formulate an informed prior on mispricing

  \alpha | V \sim N( 0, (\sigma^2 / s^2) V^{\eta} )

where V is the covariance matrix of the residuals in time-series asset pricing regressions,
s^2 = trace(V) = \sum_{j=1}^{N} \lambda_j, and the \lambda_j's are the eigenvalues of V.



Interpretations of the Ridge Regression
Interpretation #2: Informative Bayes Prior
 Notice that we can decompose the positive definite matrix V as

  V = Q \Lambda Q'

 Q is a matrix of ordered eigenvectors that are orthogonal and \Lambda is a diagonal matrix with the
corresponding ordered eigenvalues.
 Then, trace(V) = trace( Q \Lambda Q' ) = trace( \Lambda Q'Q ) = trace( \Lambda ) = \sum_{j=1}^{N} \lambda_j

 \sigma^2 controls the degree of confidence in the prior, tuned by trace(V).


 Limit cases: zero \sigma^2 means dogmatic beliefs while infinitely large \sigma^2 amounts to noninformative priors.
 The case 𝜂 = 1 resembles the asset pricing prior of Pastor (2000) and Pastor and Stambaugh (2000).
 The PS prior is flexible since factors are prespecified and are not ordered per their importance, yet there are
also merits for the 𝜂 = 2 case, which applies to factors that are principal components, as discussed below.



Interpretations of the Ridge Regression
 Source: Kozak, Nagel, and Santosh (2020).
 To continue the Bayesian interpretation, consider the Hansen-Jagannathan representation of the
pricing kernel

  M_t = 1 - b'( r_t - \mu )
      = 1 - \mu' V^{-1} ( r_t - \mu )
      = 1 - \mu' Q \Lambda^{-1} Q' ( r_t - \mu )
      = 1 - \mu_Q' \Lambda^{-1} ( r_{Q,t} - \mu_Q )
      = 1 - b_Q' ( r_{Q,t} - \mu_Q )

where r_{Q,t} = Q'r_t, \mu_Q = Q'\mu, b_Q = \Lambda^{-1}\mu_Q, and the second equality follows by the Euler
equation E[ M_t ( r_t - \mu ) ] = 0.
 Assuming \mu \sim N( 0, (\sigma^2/s^2) V^{\eta} ), it follows that \mu_Q = Q'\mu has the prior distribution (V is assumed known)

  \mu_Q = Q'\mu \sim N( 0, (\sigma^2/s^2) Q' V^{\eta} Q )
               \sim N( 0, (\sigma^2/s^2) Q' ( Q \Lambda^{\eta} Q' ) Q )
               \sim N( 0, (\sigma^2/s^2) \Lambda^{\eta} )



Interpretations of the Ridge Regression
 As b_Q = \Lambda^{-1}\mu_Q, its prior distribution is formulated as (\Lambda is assumed known):

  b_Q = \Lambda^{-1}\mu_Q \sim N( 0, (\sigma^2/s^2) \Lambda^{\eta-2} )

 For \eta < 2, the variance of the b_Q coefficients associated with the smallest eigenvalues explodes.
 For \eta = 2, the pricing kernel coefficients b = V^{-1}\mu have the prior distribution

  b \sim N( 0, (\sigma^2/s^2) I_N )

 Picking \eta = 2 makes the prior of b independent of V.
 Let us stick to this prior and further formulate the likelihood for b as

  b \sim N( V^{-1}\hat{\mu}, (1/T) V^{-1} )

where \hat{\mu} is the sample mean return.
 Then, the posterior mean of b is given by E[b] = ( V + \lambda I_N )^{-1} \hat{\mu}, where \lambda = s^2 / ( T \sigma^2 ).

 A Ridge regression of the pricing kernel projected on the N demeaned returns, with tuning parameter \lambda,
would deliver the same E[b] coefficient.
 That is, the Ridge (biased) coefficient equals the Bayesian posterior mean.
Interpretations of the Ridge Regression
 The prior expected value of the squared Sharpe ratio (SR) is given by (V is assumed known):

  E[ SR^2 ] = E[ \mu' V^{-1} \mu ]
            = E[ \mu' Q \Lambda^{-1} Q' \mu ]
            = E[ \mu_Q' \Lambda^{-1} \mu_Q ]
            = E[ trace( \mu_Q' \Lambda^{-1} \mu_Q ) ]
            = trace( \Lambda^{-1} E[ \mu_Q \mu_Q' ] )
            = (\sigma^2/s^2) trace( \Lambda^{\eta-1} )

 The Pastor-Stambaugh type prior (\eta = 1) tells you that

  E[ SR^2 ] = (\sigma^2/s^2) trace( I_N ) = N \sigma^2 / s^2

 Thus, each principal component portfolio has the same expected contribution to the Sharpe ratio.
 If \eta = 2, then

  E[ SR^2 ] = \sum_{j=1}^{N} (\sigma^2/s^2) \lambda_j = \sigma^2

 Thus, the expected contribution of each PC is proportional to its eigenvalue.


 The derivations are based on knowing V, the covariance matrix; a full Bayesian approach would
formulate an inverted Wishart distribution for V, and the derivations would then differ.
Interpretations of the Ridge Regression
Interpretation #3: Eigen-values and Eigen-vectors
 The singular value decomposition (SVD) is the basis for techniques in dimensionality reduction.
 By the unique SVD, we can express X as

  X_{T \times M} = U_{T \times M} \Lambda^{0.5}_{M \times M} V'_{M \times M}

where U = [ U_1, ..., U_M ] is a T \times M orthonormal matrix, \Lambda^{0.5} = diag( \lambda_1^{0.5}, ..., \lambda_M^{0.5} ) is an M \times M matrix
so that \lambda_1 >= \lambda_2 >= ... >= \lambda_M, and V = [ V_1, V_2, ..., V_M ] is an M \times M orthonormal matrix.
 As 𝑋 ′ 𝑋 = 𝑉Λ𝑉 ′ , the columns of V are the eigenvectors of 𝑋 ′ 𝑋 and (𝜆1 , … , 𝜆𝑀 ) are the corresponding
eigenvalues.
 As 𝑋𝑋 ′ = 𝑈Λ𝑈′ , the columns of U are the eigenvectors of 𝑋𝑋 ′ .
 The SVD facilitates dimensionality reduction by allowing us to focus on the most
significant components.
 By retaining only the largest singular values (and their corresponding singular vectors),
we can approximate the original matrix with lower rank, capturing the essential
structure while reducing noise and complexity.
OLS – Eigen-values and Eigen-vectors based analysis
 By the singular value decomposition, the OLS estimate can be reformulated as

  \hat{\beta}^{OLS} = (X'X)^{-1} X'Y
                  = ( V \Lambda V' )^{-1} V \Lambda^{0.5} U'Y
                  = V \Lambda^{-1} V'V \Lambda^{0.5} U'Y
                  = V \Lambda^{-0.5} U'Y
                  = V diag( \lambda_1^{-0.5}, \lambda_2^{-0.5}, ..., \lambda_M^{-0.5} ) U'Y

 The fitted value is

  \hat{Y}^{OLS} = X \hat{\beta}^{OLS}
              = U \Lambda^{0.5} V'V diag( \lambda_1^{-0.5}, \lambda_2^{-0.5}, ..., \lambda_M^{-0.5} ) U'Y
              = U \Lambda^{0.5} \Lambda^{-0.5} U'Y
              = U U'Y
              = \sum_{j=1}^{M} ( U_j U_j' ) Y

 Interpretation: we project 𝑌 on all the M columns of U.


 For comparison, as we show on the next pages, Ridge gives stronger prominence to columns
in U associated with higher eigenvalues, while PCA considers only the first K < M columns.
Interpretations of the Ridge Regression
 We examine how the predicted value of Y is related to the eigenvectors.
 First, we would like to find the eigenvectors and eigenvalues of the matrix Z

  Z = X'X + \lambda I_M

 We know that for every j = 1, 2, ..., M, the following holds by definition

  (X'X) V_j = \lambda_j V_j

 Thus

  ( X'X + \lambda I_M ) V_j = (X'X) V_j + \lambda V_j
                          = \lambda_j V_j + \lambda V_j
                          = ( \lambda_j + \lambda ) V_j

 Telling you that V still contains the eigenvectors of Z while \lambda_j + \lambda is the j-th eigenvalue.
 Notice now that if A = V \Lambda V' then A^L = V \Lambda^L V', where L can be either positive or negative.
 Hence, the eigenvectors are the same while the eigenvalues are raised to the power of L.



Interpretations of the Ridge Regression
 Then, the inverse of the matrix Z is given by

  Z^{-1} = V diag( 1/(\lambda_1 + \lambda), 1/(\lambda_2 + \lambda), ..., 1/(\lambda_M + \lambda) ) V'

 The Ridge regression coefficients are

  \hat{\beta}^{ridge} = Z^{-1} X'Y
                     = (X'X + \lambda I_M)^{-1} X'Y
                     = V diag( \lambda_1^{0.5}/(\lambda_1 + \lambda), \lambda_2^{0.5}/(\lambda_2 + \lambda), ..., \lambda_M^{0.5}/(\lambda_M + \lambda) ) U'Y

 And the fitted value is

  \hat{Y}^{ridge} = X \hat{\beta}^{ridge}
                = \sum_{j=1}^{M} ( \lambda_j/(\lambda_j + \lambda) ) ( U_j U_j' ) Y
                = U diag( \lambda_1/(\lambda_1 + \lambda), \lambda_2/(\lambda_2 + \lambda), ..., \lambda_M/(\lambda_M + \lambda) ) U'Y

 Ridge regression projects Y onto components with large \lambda_j, shrinking the coefficients of
low-variance components.
PCA Vs. OLS Vs. Ridge Regressions
 In a principal components analysis (PCA) setup, we project Y on a subset of U_j, j = 1, 2, ..., K < M.
 Notice that the U_j's are ordered per their corresponding eigenvalues in descending order.
 The X'X expression is approximated using the K largest eigenvectors and eigenvalues

  X'X = V \Lambda V' \approx \tilde{V} \tilde{\Lambda} \tilde{V}',   where \tilde{V} = [ V_1, ..., V_K ] and \tilde{\Lambda} = diag( \lambda_1, \lambda_2, ..., \lambda_K )

 Then, using the generalized inverse,

  \hat{\beta}^{PCA} = ( \tilde{V} \tilde{\Lambda} \tilde{V}' )^{+} V \Lambda^{0.5} U'Y
                  = \tilde{V} \tilde{\Lambda}^{-1} \tilde{V}' V \Lambda^{0.5} U'Y
                  = V diag( \lambda_1^{-0.5}, \lambda_2^{-0.5}, ..., \lambda_K^{-0.5}, 0, ..., 0 ) U'Y



PCA – the predicted value
 The PCA predicted value is

  \hat{Y}^{PCA} = X \hat{\beta}^{PCA}
              = U \Lambda^{0.5} V'V diag( \lambda_1^{-0.5}, \lambda_2^{-0.5}, ..., \lambda_K^{-0.5}, 0, ..., 0 ) U'Y
              = U diag( 1, 1, ..., 1, 0, 0, ... ) U'Y
              = \sum_{j=1}^{K} ( U_j U_j' ) Y

 PCA projects Y on the first K columns of U, rather than the entire M columns.
 Later in the notes, I discuss non-linear ways of extracting factors including autoencoder, conditional
autoencoder (CA), and LSTM.
 Keep in mind that PCA seeks to maximize the variance within the predictors only, without
considering the relationship with the response variable, aiming to reduce dimensionality while
retaining as much variance as possible from X.
 Partial Least Squares (PLS), on the other hand, maximizes the covariance between the predictors and
the response variables; more on the next slide.
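A sketch contrasting the three projections with simulated data: OLS projects Y on all M columns of U, Ridge shrinks each projection by \lambda_j/(\lambda_j + \lambda), and PCA keeps only the first K columns.

```python
# Compare OLS, Ridge, and PCA fitted values through the SVD of X.
import numpy as np

rng = np.random.default_rng(3)
T, M, K, lam = 120, 8, 3, 5.0
X = rng.standard_normal((T, M))
Y = rng.standard_normal(T)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V'; eigenvalues are s**2
eig = s**2

Y_ols   = U @ (U.T @ Y)                                  # sum_j U_j U_j' Y
Y_ridge = U @ np.diag(eig / (eig + lam)) @ (U.T @ Y)     # shrunk projections
Y_pca   = U[:, :K] @ (U[:, :K].T @ Y)                    # first K components only

# check the Ridge projection against the direct closed form
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ Y)
print(np.allclose(Y_ridge, X @ b_ridge))   # True
```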



Partial Least Squares (PLS)
 The PLS components are called latent variables.
 They are derived from both X and Y and are designed to maximize the covariance
between X and Y.
 This makes PLS more effective for predictive modeling because it aligns
the dimensionality reduction with the prediction task.
 PLS Latent Variable Extraction works as follows.
 Let t_1 = X w_1, where t_1 is the first latent variable and w_1 is the weight vector.
 The first weight vector w_1 maximizes Cov( X w_1, Y ).
 It is given by w_1 \propto X'Y; see the derivation on the next page.



Derivation of w₁ using a Lagrange Multiplier
 The objective is to maximize Cov(X 𝑤1 , Y) = 𝑤1 ' X’ Y subject to a unity norm constraint.

 The Lagrangian is given by:

𝓛(𝑤1, λ) = 𝑤1' X' Y - λ(𝑤1′ 𝑤1 - 1)

 Taking the derivative

∂𝓛/∂𝑤1 = X' Y - 2λ𝑤1 = 0

 Solving gives: 𝑤1∝ X’ Y

 The second component is found similarly with the additional orthogonality constraint w.r.t.
the first component.

 The third component has two orthogonality conditions and so on.


PLS Regression and Solution for Slope
 After extracting the latent variables, PLS regresses Y on these components:

  Y = T C + E

where T is the matrix of latent components, C is the regression coefficients, and E is the error.

 The solution for the slope coefficients is given by:

C = (T'T)⁻¹ T'Y

 This is analogous to the Ordinary Least Squares (OLS) solution but uses the latent
components instead of the original X matrix.
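A single-component sketch of the extraction and regression steps above (illustrative simulated data; further components would add orthogonality constraints and deflation):

```python
# One-component PLS: w1 proportional to X'Y, then regress Y on t1 = X w1.
import numpy as np

def pls_one_component(X, Y):
    w1 = X.T @ Y
    w1 /= np.linalg.norm(w1)        # unit norm, so w1 is proportional to X'Y
    t1 = X @ w1                     # first latent variable
    c1 = (t1 @ Y) / (t1 @ t1)       # OLS of Y on t1: C = (T'T)^{-1} T'Y
    return w1, t1, c1

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
Y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)
w1, t1, c1 = pls_one_component(X - X.mean(0), Y - Y.mean())
print(w1.round(2), round(c1, 3))
```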



PLS – Summary
 PLS is a dimensionality reduction technique that projects predictors
(independent variables) and responses (dependent variables) onto a new space,
maximizing covariance between them.
 Key Features:
1. Covariance Maximization: PLS finds latent components (factors) that explain
the maximum covariance between the predictors and response variables.
2. Dimensionality Reduction: Like PCA, PLS reduces the number of predictors,
but it focuses on those that are most predictive of the response.
3. Handling Multicollinearity: PLS is particularly effective when predictors are
highly collinear, as it constructs orthogonal latent variables that summarize the
predictor space.
4. Predictive Power: By focusing on the relationship between predictors and
responses, PLS enhances predictive power compared to methods like PCA that
ignore the response.
IPCA: Instrumental Principal Component Analysis
 IPCA, per Kelly, Pruitt, and Su (2017, 2019), is conceptually closer to PLS than to PCA because
both IPCA and PLS are supervised methods that incorporate the relationship between the
predictors and the target variables.
 The factor model for excess returns is formulated as

  r_{i,t+1} = \beta_{i,t}' f_{t+1} + \epsilon_{i,t+1}
  \beta_{i,t}' = x_{i,t}' \Gamma_\beta

where f_{t+1} is a K-vector of latent factors.
 The loadings depend on observable asset characteristics contained in the M \times 1 vector
x_{i,t} (the first element is one for the intercept), while \Gamma_\beta is an M \times K matrix.
 Motivation: Gomes, Kogan, and Zhang (2003) formulate an equilibrium model where beta
varies with firm level predictors, such as size and book-to-market.
 Avramov and Chordia (2006) show empirically that conditional beta that varies with firm
characteristics improves model pricing abilities.
 Characteristics can simply reflect covariances, or risk sources.
IPCA
 Rewriting the model in a vector form (collecting all the assets at time t):

  r_{t+1} = X_t \Gamma_\beta f_{t+1} + \epsilon_{t+1}

where r_{t+1} is an N \times 1 vector of excess returns (the number of assets can be time varying), X_t is an N \times M
matrix of the characteristics, and \epsilon_{t+1} is an N \times 1 vector of residuals.

 The estimation objective is to minimize

  min_{\Gamma_\beta, F} \sum_{t=1}^{T-1} ( r_{t+1} - X_t \Gamma_\beta f_{t+1} )' ( r_{t+1} - X_t \Gamma_\beta f_{t+1} )

 From the first-order conditions, we get that for t = 1, 2, ..., T-1

  \hat{f}_{t+1} = ( \hat{\Gamma}_\beta' X_t' X_t \hat{\Gamma}_\beta )^{-1} \hat{\Gamma}_\beta' X_t' r_{t+1}

and

  vec( \hat{\Gamma}_\beta' ) = ( \sum_{t=1}^{T-1} X_t' X_t \otimes \hat{f}_{t+1} \hat{f}_{t+1}' )^{-1} \sum_{t=1}^{T-1} [ X_t \otimes \hat{f}_{t+1}' ]' r_{t+1}.
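A stylized alternating-least-squares sketch of these first-order conditions (illustrative dimensions and simulated data; the identifying normalization of \Gamma_\beta is ignored here, so the recovered factors and loadings are known only up to rotation):

```python
# Alternate between the f-step and the Gamma-step of the IPCA FOCs.
import numpy as np

rng = np.random.default_rng(5)
N, M, K, T = 50, 6, 2, 60
X = rng.standard_normal((T, N, M))                  # X_t: N x M characteristics
Gamma_true = rng.standard_normal((M, K))
F_true = rng.standard_normal((T, K))
R = np.stack([X[t] @ Gamma_true @ F_true[t] + 0.1 * rng.standard_normal(N)
              for t in range(T)])                   # r_{t+1}

Gamma = 0.1 * rng.standard_normal((M, K))
for _ in range(100):
    # f step: f_{t+1} = (Gamma' X_t'X_t Gamma)^{-1} Gamma' X_t' r_{t+1}
    F = np.zeros((T, K))
    for t in range(T):
        B = X[t] @ Gamma                            # instrumented betas X_t Gamma
        F[t] = np.linalg.solve(B.T @ B, B.T @ R[t])
    # Gamma step: stacked regression of r on rows kron(f', x')
    A = np.zeros((M * K, M * K)); c = np.zeros(M * K)
    for t in range(T):
        Z = np.kron(F[t][None, :], X[t])            # N x (M*K)
        A += Z.T @ Z; c += Z.T @ R[t]
    Gamma = np.linalg.solve(A, c).reshape(M, K, order="F")

resid = np.stack([R[t] - X[t] @ Gamma @ F[t] for t in range(T)])
print(round(1 - resid.var() / R.var(), 3))          # in-sample R^2 close to one
```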



IPCA
 Estimating the IPCA parameters can efficiently be conducted through Bayesian Gibbs Sampling.
 The Bayesian approach is extremely convenient; moreover, one can adopt economically
meaningful prior beliefs on the various coefficients.
 Kelly, Pruitt, and Su (2017) propose ways of solving the system and give a plausible
managed-portfolio interpretation of the problem.
 Notice that in a static latent factor model, stock returns are formulated as

  r_t = \beta f_t + \epsilon_t

and the PCA factor solution is

  \hat{f}_t = ( \beta' \beta )^{-1} \beta' r_t
 IPCA is analogous, while it accounts for dynamic instrumented betas.



IPCA
 One can also allow for mispricing, where the intercepts in the unrestricted factor model
vary with the same firm characteristics.
 Then, the model is formulated as

  r_{i,t+1} = x_{i,t}' \Gamma_\alpha + x_{i,t}' \Gamma_\beta f_{t+1} + \epsilon_{i,t+1}

where \Gamma_\alpha is an M \times 1 vector.
 Let \tilde{\Gamma} = [ \Gamma_\alpha, \Gamma_\beta ] and let \tilde{f}_{t+1} = ( 1, f_{t+1}' )'.
 The model can be rewritten in a matrix form

  r_{t+1} = X_t \tilde{\Gamma} \tilde{f}_{t+1} + \epsilon_{t+1}

 From the first-order minimization conditions, we get for t = 1, 2, ..., T-1

  \hat{f}_{t+1} = ( \hat{\Gamma}_\beta' X_t' X_t \hat{\Gamma}_\beta )^{-1} \hat{\Gamma}_\beta' X_t' ( r_{t+1} - X_t \hat{\Gamma}_\alpha )

and

  vec( \hat{\tilde{\Gamma}}' ) = ( \sum_{t=1}^{T-1} X_t' X_t \otimes \hat{\tilde{f}}_{t+1} \hat{\tilde{f}}_{t+1}' )^{-1} \sum_{t=1}^{T-1} [ X_t \otimes \hat{\tilde{f}}_{t+1}' ]' r_{t+1}.
Lasso (Least Absolute Shrinkage and Selection Operator)
 Tibshirani (1996) was the first to introduce Lasso.
 Lasso simultaneously performs variable selection and coefficient estimation via shrinkage.
 While the ridge regression implements an l2-penalty, Lasso is an l1-optimization:

  min_{\beta} (Y - X\beta)'(Y - X\beta)   s.t.   \sum_{j=1}^{M} |\beta_j| <= c

 The l1 penalization approach is called basis pursuit in signal processing.
 There is, again, a non-negative tuning parameter \lambda that controls the amount of
regularization:

  L( \beta ) = (Y - X\beta)'(Y - X\beta) + \lambda \sum_{j=1}^{M} |\beta_j|

 Both Ridge and Lasso have solutions even when X'X is not of full rank (e.g., when there are more
explanatory variables than time-series observations) or is ill-conditioned.



Lasso – Sparsity
 Unlike Ridge, the Lasso coefficients cannot be expressed in closed form.
 However, Lasso generates sparse solutions, retaining the variables that matter most.
 This improves the interpretability of regression models.
 Large enough 𝜆 will set some coefficients exactly to zero.
 To understand why, notice that LASSO can be cast as having a Laplace prior on \beta

  P( \beta | \lambda ) \propto ( \lambda / 2\sigma ) exp( -\lambda |\beta| / \sigma )

 Lasso obtains by combining Laplace prior and normal likelihood.


 Like the normal distribution, Laplace is symmetric.
 Unlike the normal distribution, Laplace spikes at zero (first derivative is discontinuous) and
it concentrates its probability mass closer to zero than does the normal distribution.
 This explains why Lasso sets some coefficients to zero, while Ridge (normal prior) does not.
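A sketch of the mechanics via coordinate descent (illustrative data): each coordinate update is a soft-thresholding step, which maps sufficiently small updates exactly to zero.

```python
# Coordinate descent for Lasso: each update soft-thresholds the OLS-type update.
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    T, M = X.shape
    beta = np.zeros(M)
    for _ in range(n_iter):
        for j in range(M):
            r_j = Y - X @ beta + X[:, j] * beta[j]          # partial residual
            z = X[:, j] @ r_j
            beta[j] = soft_threshold(z, lam / 2) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 10))
Y = 2.0 * X[:, 0] + 0.5 * rng.standard_normal(200)
print(lasso_cd(X, Y, lam=100.0).round(2))   # most coefficients are exactly zero
```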



Lasso – picking the shrinkage intensity
 Because LASSO imposes sparsity, you can use model selection criteria to pick \lambda.
 Can also use a validation sample, as in Ridge.
 Examples of model selection criteria include AIC, BIC, FIC, and PIC.
 Model selection criterion consists of (i) goodness-of-fit and (ii) a penalty factor that
gets larger as the number of retained variables increases.
 Occam’s razor: the law of parsimony – thinner models are preferred.
 Bayesian information criterion (BIC) is often used:

  BIC = T \times log( RSS / T ) + l \times log( T ),   where l is the number of variables retained.
 Different values of \lambda lead the optimization to retain different sets of characteristics.
 You choose 𝜆 as follows: initiate a range of values, compute BIC for each value, and
pick the one that minimizes BIC.
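A sketch of this recipe using scikit-learn's Lasso, whose `alpha` plays the role of \lambda up to scaling (illustrative simulated data):

```python
# Grid-search lambda by minimizing BIC = T*log(RSS/T) + l*log(T).
import numpy as np
from sklearn.linear_model import Lasso

def bic_select(X, Y, grid):
    T = len(Y)
    best = None
    for lam in grid:
        beta = Lasso(alpha=lam, fit_intercept=False).fit(X, Y).coef_
        rss = np.sum((Y - X @ beta) ** 2)
        l = np.count_nonzero(beta)                  # number of variables retained
        bic = T * np.log(rss / T) + l * np.log(T)
        if best is None or bic < best[0]:
            best = (bic, lam, beta)
    return best

rng = np.random.default_rng(7)
X = rng.standard_normal((300, 20))
Y = X[:, :3] @ np.array([1.0, -0.5, 0.25]) + 0.3 * rng.standard_normal(300)
bic, lam, beta = bic_select(X, Y, np.logspace(-3, 0, 20))
print(lam, np.count_nonzero(beta))
```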
The Spike and Slab Regression
 In all fields of econometrics, there is large uncertainty about whether the true
underlying model is sparse (as in Lasso) or dense (as in Ridge) —whether only a few
predictors truly matter, or if many variables have non-trivial influence.
 A spike-and-slab prior introduces a Bayesian framework that addresses this
uncertainty by learning from the data whether the model is sparse or dense.
 This dual prior allows the combination of sparse and non-sparse representations.
 The steps are provided below, assuming M candidate predictors of future returns.
 Consider a binary inclusion vector γ = (γ1 , γ2 , … γ𝑀 )′ where γ𝑖 = 1 indicates the
inclusion of the i-th variable in the model, and γ𝑖 = 0 indicates its exclusion.
 In the absence of compelling prior information, we assign a Bernoulli prior with
probability p to each variable, determining the subset of variables retained in the
model and the subset that is excluded.



The Spike and Slab Regression
 The 'spike' component of the prior concentrates mass at zero for coefficients
associated with γ𝑖 = 0.
 Conditional on γ, we draw the regression coefficients β from a multivariate normal
prior, typically with zero mean and a large variance ('slab'), which allows for non-
zero effects when variables are included.
 The posterior distribution of both the inclusion vector γ and the coefficients β is
updated using Gibbs Sampling, a type of Markov Chain Monte Carlo (MCMC)
method.
 This algorithm iterates between updating γ and β in a stepwise manner based on their
conditional posterior distributions.



Step-by-Step Procedure
 Sample \gamma_i: For each variable, given the current value of \beta and the observed data, the
probability that \gamma_i = 1 is updated using the Bernoulli distribution:

  P( \gamma_i = 1 | \beta, y, X ) \propto P( y | X, \beta ) P( \beta | \gamma )

 Sample β: Conditional on the sampled γ, the regression coefficients are drawn from a
multivariate normal distribution:

β | γ, y, X ∼ N (μβ , Σβ )

 These steps are repeated iteratively across thousands of iterations, yielding the
posterior distributions of γ and β, approximating the true posterior.
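A stylized Gibbs sketch under simplifying assumptions that I state explicitly: a point-mass spike at zero, an N(0, \tau^2) slab, a Bernoulli(p) inclusion prior, and a known residual variance \sigma^2 (a full treatment would also sample \sigma^2 and discard burn-in draws):

```python
# Gibbs sampler for a point-mass spike-and-slab regression (known sigma^2).
import numpy as np

def spike_slab_gibbs(X, y, p=0.5, tau2=1.0, sigma2=1.0, n_draws=2000, seed=0):
    rng = np.random.default_rng(seed)
    T, M = X.shape
    beta = np.zeros(M)
    gammas, betas = [], []
    for _ in range(n_draws):
        for j in range(M):
            r = y - X @ beta + X[:, j] * beta[j]          # partial residual
            v = 1.0 / (X[:, j] @ X[:, j] / sigma2 + 1.0 / tau2)
            m = v * (X[:, j] @ r) / sigma2
            # posterior odds of inclusion: slab marginal likelihood vs. spike
            log_odds = (np.log(p / (1 - p))
                        + 0.5 * np.log(v / tau2)
                        + 0.5 * m**2 / v)
            if rng.random() < 1.0 / (1.0 + np.exp(-log_odds)):
                beta[j] = m + np.sqrt(v) * rng.standard_normal()
            else:
                beta[j] = 0.0
        gammas.append(beta != 0); betas.append(beta.copy())
    return np.mean(gammas, axis=0), np.mean(betas, axis=0)

rng = np.random.default_rng(8)
X = rng.standard_normal((200, 8))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.standard_normal(200)
probs, means = spike_slab_gibbs(X, y, sigma2=0.25)
print(probs.round(2))   # high inclusion probability for the first two variables
```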



The Spike and Slab Prior Explained
 The spike component enforces sparsity by placing mass on zero for coefficients
where the inclusion indicator γ𝑖 = 0.

 It models the probability that a given variable does not influence the dependent
variable.

 For variables with γ𝑖 = 1, the corresponding β coefficients are drawn from a normal
distribution.

 The variance of this slab prior reflects our belief that included variables can have
significant but varying effects.

 The combination of spike and slab allows the model to infer which variables should
be included (sparsity) and which should have non-zero effects (inclusion) based on
the posterior probability of the coefficients.
The Spike and Slab - Advantages
 The spike-and-slab prior adapts to both sparse and dense models, allowing for efficient
variable selection in high-dimensional settings where the true model structure is uncertain.

 Using Gibbs sampling, we can draw inferences from the posterior distributions of the model
parameters, accounting for uncertainty in both inclusion (via γ) and coefficients (via β).

 Bayesian econometrics provides a natural framework for incorporating prior beliefs about
model structure, whether based on theoretical considerations or previous empirical studies.

 Prior knowledge about the importance of certain variables can be encoded into the model via
the prior distribution.

 The spike-and-slab prior helps avoid overfitting by penalizing the inclusion of irrelevant
variables, improving the model's inferences.



The Adaptive Lasso
 Revisiting LASSO, the approach forces coefficients to be equally penalized.
 One modification is to assign different weights to different coefficients:

  L( \beta ) = (Y - X\beta)'(Y - X\beta) + \lambda \sum_{j=1}^{M} w_j |\beta_j|

 It can be shown that if the weights are data driven and are chosen in the right way, the weighted LASSO can have
the so-called oracle properties even when the LASSO does not have the oracle property.
 This is the adaptive LASSO.
 For instance, w_j can be chosen to equal one divided by the absolute value of the corresponding
OLS coefficient raised to the power of \gamma > 0. That is, w_j = 1 / |\hat{\beta}_j|^{\gamma} for j = 1, ..., M, where \hat{\beta}_j comes from
unconstrained optimization (OLS).
 The adaptive LASSO estimates are then obtained by minimizing the weighted criterion above with these
data-driven weights.

 Hyper-parameters 𝜆 and 𝛾 can be chosen using model selection criteria.


 The adaptive LASSO is a convex optimization problem and thus does not suffer from multiple local minima.
 Later, I describe how adaptive LASSO has been implemented in asset pricing through an RFS paper.



Bridge Regression
 Frank and Friedman (1993) introduce the bridge regression.
 This specification generalizes to an ℓq penalty.
 The optimization is given by

  L( \beta ) = (Y - X\beta)'(Y - X\beta) + \lambda \sum_{j=1}^{M} |\beta_j|^q

 Notice that q = 1 and q = 2 correspond to LASSO and Ridge, respectively, while q \to 0 approaches
best-subset selection (and \lambda = 0 recovers OLS).
 Moreover, the optimization is convex for q >= 1 and the solution is sparse for 0 <= q <= 1.
 Eventually, q is a hyperparameter to be selected.
 So, there are two hyperparameters.
 When the solution is sparse – use model selection criteria to pick
hyperparameters.
 Otherwise, use a validation sample.



The Elastic Net
 The elastic net is yet another regularization and variable selection method.
 Zou and Hastie (2005) describe it as stretchable fishing net that retains “all big fish.”
 Using simulation, they show that it often outperforms Lasso in terms of prediction
accuracy.
 The elastic net encourages a grouping effect, where strongly correlated predictors tend to
be in or out of the model together.
 The elastic net is particularly useful when the number of predictors is much larger than
the number of observations.
 The naïve version of the elastic net is formulated through

  L( \beta ) = (Y - X\beta)'(Y - X\beta) + \lambda_1 \sum_{j=1}^{M} |\beta_j| + \lambda_2 \sum_{j=1}^{M} \beta_j^2
 Thus, the elastic net combines 𝑙1 and 𝑙2 norm penalties.


 It still produces sparse representations.
 Thus, one can use model selection criteria to pick the hyperparameters.



The Group Lasso
 Suppose the M predictors can be classified into L distinct groups.
 In financial economics, you can divide predictive characteristics into
accounting versus market or technical versus fundamental.
 Or you can divide the characteristics into valuation ratios, profitability,
investment, and liquidity.
 Let M_l denote the number of predictors per group l.
 Let X_l represent the predictors corresponding to the l-th group, while \beta_l is
the corresponding coefficient vector.
 Then, \beta = [ \beta_1', \beta_2', ..., \beta_L' ]': a regression with M coefficients organized into L distinct blocks.



The Group Lasso
 The group Lasso solves the convex optimization problem

  L( \beta ) = ( Y - \sum_{l=1}^{L} X_l \beta_l )' ( Y - \sum_{l=1}^{L} X_l \beta_l ) + \lambda \sum_{l=1}^{L} ( \beta_l' \beta_l )^{1/2}

 The group Lasso yields sparsity across the groups, in that some groups are excluded.
 However, there is no sparsity within a group: if a group of parameters is nonzero, all the
group members are nonzero.
 The sparse group Lasso criterion yields sparsity both across and within groups

  L( \beta ) = ( Y - \sum_{l=1}^{L} X_l \beta_l )' ( Y - \sum_{l=1}^{L} X_l \beta_l ) + \lambda_1 \sum_{l=1}^{L} ( \beta_l' \beta_l )^{1/2} + \lambda_2 \sum_{j=1}^{M} |\beta_j|



Nonparametric-nonlinear methods
 Lasso, Adaptive Lasso, Group Lasso, Ridge, Bridge, and Elastic net are all linear or
parametric approaches for shrinkage.
 Some other parametric approaches (not covered here) include the smoothly clipped absolute
deviation (SCAD) penalty of Fan and Li (2001) and Fan and Peng (2004) and the
minimax concave penalty of Zhang (2010).
 In many applications, however, there is little a priori justification to assume that the
effects of covariates take a linear form or belong to other known parametric families.
 Huang, Horowitz, and Wei (2010) thus propose to use a nonparametric approach: the
adaptive group Lasso for variable selection.
 This approach is based on a spline approximation to the nonparametric components.
 To achieve model selection consistency, they apply Lasso in two steps.
 First, they use group Lasso to obtain an initial estimator and reduce the dimension of
the problem.
 Second, they use the adaptive group Lasso to select the final set of nonparametric
components.
Nonparametric Models in asset pricing
 Cochrane (2011) notes that portfolio sorts are equivalent to nonparametric cross section regressions.
 Following Huang, Horowitz, and Wei (2010), Freyberger, Neuhierl, and Weber (2017) study this equivalence formally.

 The cross section of stock returns is modelled as a nonlinear function of firm characteristics:

  r_{it} = m_t( C_{1,it-1}, ..., C_{S,it-1} ) + \epsilon_{it}

 Notation:

 𝑟𝑖𝑡 is the return on firm i at time t.


 m_t is a function of S firm characteristics C_1, C_2, ..., C_S.
 Notice, m_t itself is not stock specific but firm characteristics are, just like slopes in cross-section regressions.
 Consider an additive model of the following form

  m_t( C_1, ..., C_S ) = \sum_{s=1}^{S} m_{t,s}( C_s )

 As the additive model implies that \partial^2 m_t( c_1, ..., c_S ) / \partial c_s \partial c_{s'} = 0 for s \ne s', there are no
cross-dependencies between characteristics.
 Such dependencies can still be accommodated by producing additional predictors as interactions between characteristics.
Nonparametric Models
 For each characteristic s, let F_{s,t}(.) be a strictly monotone function and let F_{s,t}^{-1}(.) denote its inverse.
 Define \tilde{C}_{s,it-1} = F_{s,t}( C_{s,it-1} ) such that \tilde{C}_{s,it-1} \in [0,1].
 That is, characteristics are monotonically mapped into the [0,1] interval.
 An example for F_{s,t}(.) is the rank function: F_{s,t}( C_{s,it-1} ) = rank( C_{s,it-1} ) / N_t, where N_t is the total number of firms at time t.
 The aim then is to find \tilde{m}_t such that

  m_t( C_1, ..., C_S ) = \tilde{m}_t( \tilde{C}_{1,it-1}, ..., \tilde{C}_{S,it-1} )

 In particular, to estimate the \tilde{m}_t function, the normalized characteristic interval [0,1] is divided into L subintervals
(L+1 knots): 0 = x_0 < x_1 < ... < x_{L-1} < x_L = 1.
 To illustrate, consider the equal spacing case.
 Then, x_l = l/L for l = 0, ..., L and the intervals are:

  \tilde{I}_1 = [ x_0, x_1 ),   \tilde{I}_l = [ x_{l-1}, x_l ) for l = 2, ..., L-1,   and   \tilde{I}_L = [ x_{L-1}, x_L ]



Nonparametric Models
 Each firm characteristic is transformed into its corresponding interval.
 Estimating the unknown function \tilde{m}_{t,s} nonparametrically is done by using quadratic splines.
 A quadratic spline is a differentiable piecewise quadratic function.
 The function \tilde{m}_{t,s} is approximated by a quadratic function on each interval \tilde{I}_l.
 Quadratic functions in each interval are chosen such that \tilde{m}_{t,s} is continuous and differentiable on the whole
interval [0,1].

  \tilde{m}_{t,s}( \tilde{c} ) = \sum_{k=1}^{L+2} \beta_{tsk} \times p_k( \tilde{c} ),   where the p_k( \tilde{c} ) are basis functions and the \beta_{tsk} are estimated slopes.

 In particular, p_1(y) = 1, p_2(y) = y, p_3(y) = y^2, and p_k(y) = max( y - x_{k-3}, 0 )^2 for k = 4, ..., L+2.
 In that way, you can get a continuous and differentiable function.
 To illustrate, consider the case of two characteristics, e.g., size and book-to-market (BM), and 3 intervals.
 Then, the \tilde{m}_t function is:

  \tilde{m}_t( \tilde{c}_{i,size}, \tilde{c}_{i,BM} )
    = \beta_{t,size,1} \times 1 + \beta_{t,size,2} \times \tilde{c}_{i,size} + \beta_{t,size,3} \times \tilde{c}_{i,size}^2
      + \beta_{t,size,4} \times max( \tilde{c}_{i,size} - 1/3, 0 )^2 + \beta_{t,size,5} \times max( \tilde{c}_{i,size} - 2/3, 0 )^2
    + \beta_{t,BM,1} \times 1 + \beta_{t,BM,2} \times \tilde{c}_{i,BM} + \beta_{t,BM,3} \times \tilde{c}_{i,BM}^2
      + \beta_{t,BM,4} \times max( \tilde{c}_{i,BM} - 1/3, 0 )^2 + \beta_{t,BM,5} \times max( \tilde{c}_{i,BM} - 2/3, 0 )^2


Adaptive group Lasso
 The estimation of \tilde{m}_t is done in two steps:
 First step, estimate the slope coefficients b_{sk} using the group Lasso routine:

  \hat{\beta}_t = argmin_{ b_{sk}: s=1,...,S; k=1,...,L+2 } \sum_{i=1}^{N_t} ( r_{it} - \sum_{s=1}^{S} \sum_{k=1}^{L+2} b_{sk} \times p_k( \tilde{C}_{s,it-1} ) )^2 + \lambda_1 \sum_{s=1}^{S} ( \sum_{k=1}^{L+2} b_{sk}^2 )^{1/2}

 Altogether, the number of b_{sk} coefficients is S \times (L+2).
 The second expression is a penalty term applied to the spline expansion.
 \lambda_1 is chosen such that it minimizes the Bayesian Information Criterion (BIC).
 Each characteristic represents a group.
 The essence of group Lasso is to either include or exclude all L+2 spline terms associated with a given characteristic.
 While this optimization yields a sparse solution, there are still many characteristics retained.
 To include only characteristics with strong predictive power, the adaptive Lasso is then employed.



Adaptive group Lasso
 To implement adaptive group Lasso, define the following weights using estimates for b_{sk} from the first step:

  w_{ts} = ( \sum_{k=1}^{L+2} \tilde{b}_{sk}^2 )^{-1/2}   if \sum_{k=1}^{L+2} \tilde{b}_{sk}^2 \ne 0
         = \infty                             if \sum_{k=1}^{L+2} \tilde{b}_{sk}^2 = 0

 Then, estimate again the coefficients b_{sk} using the above-estimated weights w_{ts}

  \hat{\beta}_t = argmin_{ b_{sk}: s=1,...,S; k=1,...,L+2 } \sum_{i=1}^{N_t} ( r_{it} - \sum_{s=1}^{S} \sum_{k=1}^{L+2} b_{sk} \times p_k( \tilde{C}_{s,it-1} ) )^2 + \lambda_2 \sum_{s=1}^{S} w_{ts} ( \sum_{k=1}^{L+2} b_{sk}^2 )^{1/2}

 \lambda_2 is chosen such that it minimizes BIC.
 The above formulation of weights w_{ts} guarantees that the second step does not pick characteristics that are
excluded in the first step.



Regression Trees
 Regression trees are a nonparametric machine learning technique used to model decisions
based on input variables, resulting in a tree-like structure where each decision (or split) is
made based on specific criteria.
 They are particularly useful when predicting outcomes that depend on multiple interacting
variables and when the relationship between predictors and outcomes is nonlinear.
 In the context of asset returns, a regression tree can be used to decide whether to sort stocks
by a particular characteristic.
 If sorting by that characteristic is not effective, the tree can then ask whether another
characteristic might be more useful.
 At each decision point (or node), the tree asks whether a cut-off at a specific value of the
chosen variable would help divide the stocks into two groups—each with similar
characteristics.
 For example, the tree might ask if sorting stocks by market capitalization and then applying
a cut-off at a particular value could create two distinct portfolios.
Regression Trees
 The tree is built by selecting splits that minimize prediction error.
 Specifically, at each node, the dataset is divided into two subsets that
minimize the mean squared error (MSE) of predicted returns:

  L( C, C_left, C_right ) = (1/N_left) \sum_{ z_{i,t} \in C_left } ( r_{i,t+1} - \theta_left )^2 + (1/N_right) \sum_{ z_{i,t} \in C_right } ( r_{i,t+1} - \theta_right )^2

 Notation: C denotes the data set from the preceding step, while the new bins are C_left and
C_right, and the N's are the corresponding numbers of observations.
 The predicted return is the average of the returns of all stocks within the group

  \theta_left = (1/N_left) \sum_{ z_{i,t} \in C_left } r_{i,t+1} ;   \theta_right = (1/N_right) \sum_{ z_{i,t} \in C_right } r_{i,t+1}
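A sketch of the split search for one candidate characteristic (simulated data; the cut-off grid is simply the observed values):

```python
# Scan cut-offs and keep the one minimizing the per-group MSE loss above.
import numpy as np

def best_split(z, r):
    """z: characteristic values; r: next-period returns. Returns (cutoff, loss)."""
    best = (None, np.inf)
    for cut in np.unique(z)[:-1]:
        left, right = r[z <= cut], r[z > cut]
        loss = np.mean((left - left.mean())**2) + np.mean((right - right.mean())**2)
        if loss < best[1]:
            best = (cut, loss)
    return best

rng = np.random.default_rng(9)
size = rng.uniform(0, 1, 500)
ret = np.where(size < 0.4, 0.02, -0.01) + 0.01 * rng.standard_normal(500)
print(best_split(size, ret))   # the recovered cut-off is near 0.4
```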



Pros of Regression Trees
 Simplicity and Interpretability:
• Regression trees are intuitive and easy to interpret. Each split represents a simple decision rule that
can be visualized, making it easy to understand the relationship between variables and outcomes.
 Nonlinearity and Interactions:
• They naturally capture nonlinear relationships and interactions between predictors without requiring
complex transformations or assumptions about the underlying data distribution.
 Versatility:
• Regression trees can be used for both regression and classification tasks, making them versatile across
different domains.
 Handle Missing Data:
• They can handle missing values by splitting the data based on available information, without requiring
imputation or removal of missing cases.
 No Assumptions on Data Distribution:
• Unlike linear models, regression trees do not require any assumptions about the linearity or normality
of the data, allowing them to adapt to complex datasets.
 Variable Importance:
• Trees naturally rank variables by importance, as splits are based on the variable that most reduces the
prediction error at each step.



Cons of Regression Trees
 Prone to Overfitting:
• Without proper tuning, regression trees can easily overfit the training data, capturing noise rather than underlying patterns. This
happens when the tree grows too deep with too many nodes.
 Instability:
• Small changes in the data can result in entirely different trees being generated, making them unstable. This is because the splits at
each step are highly sensitive to variations in the data.
 Bias Toward Dominant Features:
• Regression trees can be biased toward features with more levels or categories. Features with more unique values are more likely to be
chosen for splits, which may not always reflect true predictive power.
 Limited Predictive Power:
• While easy to interpret, single regression trees often lack predictive accuracy, especially when compared to more sophisticated models
like random forests, gradient boosting, or neural networks.
 Difficulty Capturing Smooth Relationships:
• Since regression trees use step functions to make predictions, they may struggle to capture smooth relationships between the
independent and dependent variables.
 Greedy Algorithm:
• The algorithm makes locally optimal decisions at each split, without considering the overall structure of the tree. This can lead to
suboptimal trees, which might not represent the best global model.
 Need for Pruning:
• To combat overfitting, trees often need to be pruned (i.e., removing branches that add complexity without improving accuracy), which
can be computationally intensive and adds an additional layer of complexity.



Random Forest
 A random forest is an ensemble learning method that combines multiple decision trees to
improve prediction accuracy and control for overfitting.
 While a single decision tree may be prone to overfitting by learning too much from the
training data, a random forest builds several trees using random subsets of the data and
features, and then aggregates their predictions.
 The process works as follows:
1. Bootstrap Sampling: Each tree in the random forest is trained on a randomly sampled
subset (with replacement) of the original data.
2. Feature Randomness: At each node within a tree, only a random subset of the features
is considered for splitting. This increases diversity among the trees and prevents any
single variable from dominating the predictions.
3. Averaging Predictions: Once all trees are built, the final prediction is made by
averaging the predictions of each individual tree (in the case of regression) or by
majority voting (in the case of classification).
 Random forests improve prediction performance by reducing variance.
 The individual trees might have high variance (i.e., they could overfit the training data),
but when their predictions are averaged together, the variance is reduced, leading to
more accurate and robust predictions.
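A sketch using scikit-learn's RandomForestRegressor; `bootstrap` and `max_features` control the two sources of randomness described above (simulated, illustrative data):

```python
# Random forest: bootstrapped trees with per-split feature randomness, averaged.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(10)
X = rng.standard_normal((1000, 10))               # e.g., firm characteristics
y = np.maximum(X[:, 0], 0) * X[:, 1] + 0.1 * rng.standard_normal(1000)

forest = RandomForestRegressor(
    n_estimators=300,      # number of trees whose predictions are averaged
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree sees a resampled subset of the data
    random_state=0,
).fit(X, y)
print(forest.predict(X[:3]))
```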
Neural Networks
 All upcoming machine learning routines will utilize Neural Networks (NN) with varying levels
of complexity.
 NNs form the foundation of deep learning and are highly capable of approximating complex
functions in high-dimensional spaces, as well as capturing intricate time dependencies (e.g.,
LSTM, Transformer Encoders).
 Inspired by the structure of the human brain, NNs consist of layers of "neurons" connected by
"synapses" that transmit signals between layers.
 NN models ingest data, learn to recognize patterns, and produce predictions.
 NNs have diverse applications, including facial recognition, time series forecasting (e.g., stock
returns, rainfall), music composition, and self-driving cars.
 The network processes information from an input layer, through one or more hidden layers,
to an output layer—a structure often called a Feed-Forward Network (FFN).
 The output layer makes predictions similar to fitted values in regression analysis.
 Most data processing occurs in the hidden layers, which consist of interconnected neurons that
extract complex features.



Fully Connected (Dense Layer) Network

[Figure: a fully connected feed-forward network with an input layer, hidden layer 1 with 4 neurons, hidden layer 2 with 2 neurons, and an output layer.]
Neural Networks
 Each neuron applies a nonlinear activation function f to its aggregated signal before sending its output to the next layer

  x_k^l = f( \theta_0 + \sum_j z_j \theta_j )

where x_k^l corresponds to neuron k \in {1, 2, ..., K^l} in the hidden layer l \in {1, 2, ..., L}, and the z_j are the outputs of the previous layer.
 The activation function (or the threshold function) is usually one of the following
 Sigmoid: \sigma(x) = 1 / ( 1 + e^{-x} )
 tanh(x) = 2\sigma(2x) - 1
 ReLU(x) = 0 if x < 0, and x otherwise
 The sigmoid function is between 0 and 1
 The hyperbolic tangent is between -1 and 1
 The result of the activation function determines whether the particular neuron will get activated.
 An activated neuron transmits data to the neuron of the next layer over the channel – forward propagation – data propagates
through the network.
 NN is essentially a nonlinear nonparametric regression.
 With a linear activation function, NN boils down to OLS.



Neural Networks
 For the ReLU activation function, we can rewrite the neural network function as:

  output = max( max( ... max( max( X W_{h1}, 0 ) W_{h2}, 0 ) ... W_{hn}, 0 ) W_{output}

where X is the input, W_{hi} is the weight matrix of the neurons in hidden layer i \in {1, ..., n}, n is the
number of hidden layers, and W_{output} are the weights of the output layer.
 In NN, slopes are termed weights while intercepts are termed biases.
 Then, run an optimization to minimize the loss function.
 When the predicted variable is continuous, mean squared errors (MSE) can be used for the
loss function.
 To predict probability or for classification purposes, can use Softmax loss function.
 If the activation function is linear – simply ignore the MAX operator in the above equation.
 Then the output is XW – boiling down to OLS.



A simple example with ReLU
 Let us assume two inputs only: market cap (size) and BM (book-to-market).
 One hidden layer with three neurons: A, B, and C.
 The W's are the slopes (weights) while the b's are the intercepts (biases).
 input_A = size \times W^A_size + BM \times W^A_BM + b_A
 output_A = max( input_A, 0 )
 input_B = size \times W^B_size + BM \times W^B_BM + b_B
 output_B = max( input_B, 0 )
 input_C = size \times W^C_size + BM \times W^C_BM + b_C
 output_C = max( input_C, 0 )
 Output layer (ol): output = output_A \times W^{ol}_A + output_B \times W^{ol}_B + output_C \times W^{ol}_C + b^{ol}
 The output is the predicted return.
 Implement that procedure for any stock while the parameters are identical across stocks.
 To find the parameters, you minimize the sum of squared errors (realized versus predicted returns)
by aggregating across all stocks and all months.

[Figure: the two inputs (size, BM) feed hidden-layer neurons A, B, and C, which feed the output layer.]
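A NumPy sketch of the forward pass of this exact two-input, three-neuron network; the weights and biases are illustrative numbers, not fitted values:

```python
# Forward pass: hidden = ReLU(x @ W_h + b_h); output = hidden @ W_ol + b_ol.
import numpy as np

W_h = np.array([[ 0.4, -0.2,  0.1],     # rows: size, BM; columns: neurons A, B, C
                [-0.3,  0.5,  0.2]])
b_h = np.array([0.0, 0.1, -0.1])
W_ol = np.array([0.3, -0.1, 0.2])       # output-layer weights
b_ol = 0.01

def predict_return(size, bm):
    x = np.array([size, bm])
    hidden = np.maximum(x @ W_h + b_h, 0.0)     # ReLU: output_k = max(input_k, 0)
    return hidden @ W_ol + b_ol                 # predicted return

print(predict_return(size=0.8, bm=0.3))
```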


The output Layer – Interpretation
 The output layer is interpreted based on the particular experiment.
 In supervised learning, the output tries to come close to the label.
 If the label is return, as in the previous example, the output is predicted return.
 If the label is about identifying the top decile of stocks (ones versus zeros), the
output is about the predicted probability of belonging to the top.
 Further, while in the previous example, the output reflects a scalar, it can also be a
vector of dimension d.
 For instance, you apply different weights and biases to predict future returns –
hence, there are multiple return predictions per stock.
 Then, if there are 1000 stocks, the intermediate output is a 1000 \times d matrix.
 You have to convert that matrix to a vector of dimension 1000 by multiplying the
matrix by a vector of order d, which is yet another set of parameters to estimate.



Hyperparameters
 The number of hidden layers and the number of neurons per layer are both hyperparameters.
 The regularization (next page) parameters are also hyperparameters.
 The learning rate (coming up soon) establishes yet another hyperparameter.
 Can split the sample into three pieces: training, validation, and testing.
 To illustrate, suppose the sample spans January 1981 through the end of 2024.
 Use the first twenty years as training – January 1981 till December 2000.
 Use the next ten years as validation – January 2001 till December 2010.
 Then, generate one year of monthly predictions – for the year 2011.
 Next…
 Training sample becomes January 1981 till December 2001.
 Validation sample becomes January 2002 till December 2011.
 Generate yet another year of monthly predictions – for the year 2012.
 And so on.
 The training sample is expanding.
 The validation sample is rolling.
Regularization
 Machine learning methods are subject to overfitting; hence, regularization is essential.
 You can implement LASSO on the weights and biases when minimizing the loss function.
 Lasso will mute some of the coefficients.
 However, it does not necessarily mute an entire input variable, as Lasso does in linear regressions.
 You can also implement Ridge or Elastic Net.
 Batch normalization: Normalizes the inputs to each layer by adjusting the mean and variance of
the mini-batch, which helps stabilize training, allows for higher learning rates, and provides
some regularization benefits.
 Early stopping: Stops training once the model's performance on a validation set starts to
deteriorate, preventing the network from overfitting to the training data.
 There are many more regularization methods.
 One useful and intuitive approach is dropout, to be explained on the next page.



Dropout
 Dropout works by randomly "dropping out" a fraction of the neurons in the network during training.
 This prevents the network from relying too heavily on any one neuron and forces the model to learn
more robust features.
 Here’s how dropout works in practice:
1. Training Phase: During each iteration, which consists of a forward pass followed by a backward pass, a
fraction of the neurons (e.g., p=20% or 50%) is randomly selected and dropped out. These neurons are
ignored during both the forward pass (when activations are calculated) and the backward pass (when
gradients are computed), so their weights do not get updated. The remaining neurons continue to process
the data. Each time a new batch of data is processed in an iteration, a different random set of neurons is
dropped out.
2. Inference Phase: During testing or validation, all the neurons are active, but their outputs are scaled
down by the same fraction that was used during training. This adjustment ensures that the output remains
consistent and accounts for the randomness introduced during training.
 The key benefit of dropout is that it reduces the risk of overfitting by preventing the network from
relying too much on specific neurons. By forcing the network to use different subsets of neurons in each
iteration, dropout helps create a more generalized model.
Training the network
 Training a neural network involves minimizing a loss function, quantifying the sum
(across time and assets) of differences between predicted and actual values.
 This is achieved through gradient descent, an optimization technique that updates the
network's weights iteratively.
• Weights are typically initialized and updated using gradient descent.
• In each step, weights are adjusted as follows:
new weight = old weight - (learning rate × gradient of the loss w.r.t. the weight)
• The learning rate is a crucial hyperparameter that controls the step size.
• If it is too large, the network may overshoot the minimum.
• If it is too small, training can be slow or get stuck in local minima.
• AdaTune is an adaptive learning rate algorithm that adjusts the learning rate
dynamically during training, allowing for faster and more stable convergence.
• The training process aims to move in the direction of the steepest descent of the loss
function, using the gradient (or its approximation) at each point to guide the updates.
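A one-parameter sketch of the update rule above, minimizing the toy loss L(w) = (w − 3)²; the learning rate and step count are illustrative.

```python
# Gradient descent on L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, learning_rate = 0.0, 0.1
for step in range(50):
    gradient = 2.0 * (w - 3.0)
    w = w - learning_rate * gradient  # new weight = old weight - lr * gradient
print(w)  # approaches the minimizer w* = 3
```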

69 Professor Doron Avramov, IDC, Israel


Training the network
 Backpropagation is the key algorithm used to train neural networks by calculating the gradients required for
updating weights.
 It works by applying the chain rule to compute the gradient of the loss w.r.t. each weight, starting from the output
layer and propagating backward through the network.
 The chain rule allows efficient computation by breaking the gradient into simpler components at each layer.
 Backpropagation stores intermediate results, so gradients for previous layers don't need to be recalculated, making
the process efficient.
 The goal is to minimize the loss by updating the weights based on these gradients.
 However, challenges arise due to:
• Non-convexity: The loss function often has many local minima and saddle points, making it hard to find the global
minimum.
• Vanishing gradients: In deep networks, gradients can become very small as they propagate backward, slowing
down learning.
 Common activation functions like ReLU (Rectified Linear Unit) help mitigate vanishing gradients by introducing
non-linearity and maintaining strong gradient signals.
 Different optimizers can be used for training:
• Stochastic Gradient Descent (SGD): Simple and widely used, but can be slow to converge.
• Adam: An adaptive optimizer that combines the benefits of momentum and learning rate adaptation for faster
convergence.
70 Professor Doron Avramov, IDC, Israel
Reconciling Classical Theory with Deep Learning
 Traditionally, it is expected that larger models should eventually underperform
due to overfitting.
 However, modern deep-learning models challenge this conventional wisdom.
 So, which is correct—common wisdom or the empirical evidence from deep
learning?
 The two perspectives can be reconciled:
• Under-Parameterized Models: Test error follows the traditional bias-variance tradeoff as model complexity increases.
• Over-Parameterized Models: Surprisingly, further increasing complexity still
reduces test error, defying traditional overfitting concerns.
• Critically-Parameterized Models: At critical points, increasing complexity can
either increase or decrease test error, leading to unpredictable behavior.
 In the under-parameterized regime, test error follows a U-shaped curve as
model complexity increases, in line with classical bias-variance tradeoff
predictions.
71 Professor Doron Avramov, IDC, Israel
Autoencoding: nonlinear dimension reduction
 Autoencoders are a type of unsupervised learning model that perform dimensionality
reduction by compressing input data into a lower-dimensional representation, then
reconstructing the original input from this compressed version.
 This process is highly useful in extracting latent features from complex data, such as in
asset pricing applications.
 Autoencoders consist of two main steps:
• Encoding: The input data is compressed into a smaller, lower-dimensional
representation. The goal is to capture the most important features of the data using
fewer variables.
• Decoding: The compressed representation is then transformed back into the original
input space. This step evaluates how well the autoencoder can reconstruct the input
data, making it useful for tasks like denoising.
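A minimal sketch of a one-hidden-layer autoencoder trained by gradient descent on synthetic factor data; the dimensions, tanh activation, and learning rate are illustrative assumptions rather than the specification of any paper discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 500, 20, 3                       # observations, input dim, latent dim
F = rng.standard_normal((n, k))            # latent factors driving the data
X = F @ rng.standard_normal((k, d)) + 0.1 * rng.standard_normal((n, d))

W1, b1 = 0.1 * rng.standard_normal((d, k)), np.zeros(k)   # encoder parameters
W2, b2 = 0.1 * rng.standard_normal((k, d)), np.zeros(d)   # decoder parameters
lr = 0.01

for epoch in range(2000):
    Z = np.tanh(X @ W1 + b1)               # encoding: compress to k features
    X_hat = Z @ W2 + b2                    # decoding: reconstruct the input
    err = X_hat - X                        # reconstruction error
    dX_hat = 2 * err / err.size            # gradient of the mean squared error
    dW2, db2 = Z.T @ dX_hat, dX_hat.sum(0)
    dZ = dX_hat @ W2.T
    dpre = dZ * (1 - Z ** 2)               # tanh derivative
    dW1, db1 = X.T @ dpre, dpre.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(((X_hat - X) ** 2).mean())           # reconstruction MSE after training
```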

72 Professor Doron Avramov, IDC, Israel


Comparison with PCA
 Both autoencoders and Principal Component Analysis (PCA) perform
dimensionality reduction, but there are key differences:
• PCA is a linear technique that reduces dimensions by projecting data onto
orthogonal axes, retaining as much variance as possible.
• Autoencoders, by contrast, use nonlinear neural networks. This enables them
to capture more complex patterns in the data that PCA may miss.
 Nonlinearity through Neural Networks
 The use of nonlinear activation functions like ReLU (Rectified Linear Unit)
in autoencoders allows them to map input data through complex
transformations, making them more versatile than PCA. For example, in
financial data where relationships between variables are often nonlinear,
autoencoders can uncover hidden factors that explain more variance than
traditional linear methods.
 Loss Function and Reconstruction Error
 The primary goal of the autoencoder is to minimize the reconstruction error,
which measures how different the original input is from the reconstructed
output. This is achieved by training the network to optimize a loss function,
typically the mean squared error between the input and the output.
73 Professor Doron Avramov, IDC, Israel
Applications in Asset Pricing
 In asset pricing, autoencoders can be used to extract nonlinear factors
that are better suited for complex market behaviors than traditional
factor models.
 These latent factors, derived from the encoding process, capture hidden
relationships between stock returns and firm characteristics, leading to
better predictions and a deeper understanding of asset risk.
 The next slide provides a clear implementation of the routine in a simple
one-layer setup.
 The slide after next provides a color scheme that helps visualize the
compression (encoding) and decompression (decoding) processes that are
core to the autoencoder structure.

74 Professor Doron Avramov, IDC, Israel


The structure of an autoencoder
1. Green Circles (Input Layer): These circles represent the original input variables or features. In the context of asset pricing, these inputs could be the returns or characteristics of multiple assets that are to be compressed.
2. Purple Circles (Hidden Layer): These circles represent the hidden-layer neurons. This hidden layer performs the "encoding" step, where the input data is transformed into a lower-dimensional, compressed representation. The use of nonlinear activation functions (like ReLU) allows the autoencoder to capture complex patterns in the data.
3. Red Circles (Output Layer): The red circles represent the "decoded" output, which is a reconstruction of the input variables. The goal is for the output to closely match the original input after passing through the bottleneck (hidden layer), minimizing reconstruction error.
4. Transition Between Layers:
1. The transition from the green to the purple layer represents the encoding process, where the input is compressed into latent factors.
2. The transition from the purple to the red circles represents the decoding process, where the compressed representation is expanded back into the original form.
75 Professor Doron Avramov, IDC, Israel
Autoencoding: unconditional asset pricing

76 Professor Doron Avramov, IDC, Israel


Autoencoding with Multiple Layers
 There may be cases where an autoencoder includes multiple hidden layers. When
this happens:
• The encoding layers (compressing the data) have a decreasing number of neurons
as the model goes deeper, condensing the input data into fewer dimensions.
• Conversely, the decoding layers (reconstructing the data) have an increasing
number of neurons as they attempt to recover the original data from the
compressed representation.
 In this setup, the encoding phase identifies the essential lower-dimensional latent
factors from the input data. These latent factors are key to understanding the
underlying patterns.
 In the particular example provided, there are three latent factors (K = 3), which
means the autoencoder has reduced the input to just three key features.
 The model is unconditional, meaning that the factor loadings (which map input
features to latent factors) are fixed and do not change over time.
 The model's parameters are optimized by minimizing the sum of squared
errors between the actual returns and the reconstructed returns, ensuring that the
model accurately captures the key features while reducing dimensionality.
77 Professor Doron Avramov, IDC, Israel
Conditional Autoencoding (CA)
 Gu, Kelly, and Xiu (2019) implement autoencoding in a setup where betas vary nonlinearly with firm characteristics.
 Beta variations characterize conditional asset pricing models.
 The conditional autoencoder extends IPCA, in which loadings are a linear function of firm characteristics.
 So, we have the two pairs {PCA, autoencoder} and {IPCA, CA}, reflecting linear versus nonlinear activation functions, while the first (second) pair refers to unconditional (conditional) asset pricing.
 The figure on the next page describes CA. Source: Gu, Kelly, and Xiu (2019).
 The left side of the network models factor loadings as a nonlinear function of predictive
characteristics, while the right-side network formulates factors as portfolios of individual stock
returns.
 Let us start with the left-hand-side.
 The yellow level describes a panel of predictive characteristics – N stocks, P characteristics per stock.
 Characteristics are transformed through s hidden layers to form intermediate outputs – factor
loadings.
 The right side is about constructing factors through autoencoder.
 Factors and loadings are interacted to form the eventual output layer – predicted returns.

78 Professor Doron Avramov, IDC, Israel


CA: beta pricing

79 Professor Doron Avramov, IDC, Israel


Recurrent NN (RNN)
 A significant extension of neural networks (NNs) is the recurrent neural network
(RNN).
 To illustrate, imagine observing a snapshot of a flying ball and being asked to
predict its future location.
 Without prior information about its motion or history, any prediction would be
purely a guess.
 The flying ball is an example of a sequence, where past information influences
future outcomes.
 Other common examples include:
• Audio, which is a sequence of sound waves;
• Text, a sequence of characters or words;
• Genetic data and EKG signals are also sequential.
➢ In all these cases, a sequence is defined by its order: one event follows another.
80 Professor Doron Avramov, IDC, Israel
Recurrent NN (RNN)
 In financial economics, time-series data with short- or long-run dependencies are
examples of sequences.
 For instance, asset returns exhibit short- and long-term serial dependencies, such
as short-term reversals, long-term reversals, and intermediate-term momentum.
 Can traditional NNs predict the outcome of sequences?
 Standard NNs lose effectiveness when dealing with sequentially dependent
inputs.
 This is because they map a fixed and static input into a fixed and static output,
ignoring the temporal relationships between data points.
 To address this, deep sequence models like RNNs are used.
 RNNs take into account the temporal dimension by relating the output to both (i)
the current input and (ii) the prior history, stored as a latent cell state.

81 Professor Doron Avramov, IDC, Israel


Recurrent NN (RNN)
 The current cell state in an RNN depends on both the input and the past state
ℎ𝑡 = G(𝑥𝑡 , ℎ𝑡−1 )
 For instance,
ℎ𝑡 = 𝑡𝑎𝑛ℎ 𝑊ℎ𝑥 𝑥𝑡 + 𝑊ℎℎ ℎ𝑡−1 + 𝑏ℎ
 The output is then given by
ŷ𝑡 = 𝑊𝑦ℎ ℎ𝑡 + 𝑏𝑦
 In an RNN, only the last hidden state is used to generate the output at time t.
 The loss is the forecast error, or the difference between 𝑦𝑡 and ŷ𝑡 .
 The total loss is the sum of forecast errors squared throughout the sample and experiments
(e.g., N stocks).
 Estimate the model parameters by minimizing the loss function.
 The model parameters are common across the sequence.
 There are three weight matrices
𝑊ℎ𝑥 defines how the inputs at each time step are being transformed
𝑊ℎℎ defines the relationship between the prior and current hidden states
𝑊𝑦ℎ transforms the hidden state to the output at a particular time step
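A minimal NumPy sketch of this recursion for a single sequence; all dimensions and the random initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d, h, T = 4, 8, 12                          # input size, hidden size, sequence length
W_hx = 0.1 * rng.standard_normal((h, d))    # transforms the input at each step
W_hh = 0.1 * rng.standard_normal((h, h))    # links prior and current hidden states
W_yh = 0.1 * rng.standard_normal((1, h))    # maps the hidden state to the output
b_h, b_y = np.zeros(h), np.zeros(1)

x = rng.standard_normal((T, d))             # one input sequence
h_t = np.zeros(h)                           # h_0 = 0
for t in range(T):
    h_t = np.tanh(W_hx @ x[t] + W_hh @ h_t + b_h)  # h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h)
y_hat = W_yh @ h_t + b_y                    # output generated from the last hidden state
print(y_hat)
```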
82 Professor Doron Avramov, IDC, Israel
Simple RNN structure (left side) and its unfolded representation (right side)
 A key feature of RNNs is their ability to handle sequences of arbitrary length.
 For example, consider predicting the next word in a text.
 In a feedforward neural network (FFN), the input text must be of a fixed size (e.g., 30 words), which complicates
processing longer or shorter sequences.
 While the last word is critical, its importance depends on its relationship with prior words in the sequence—a
relationship that FFNs cannot handle effectively.
 In contrast, RNNs process sequences step-by-step, making them versatile for predicting word sequences, stock
returns, and other financial applications.

83 Professor Doron Avramov, IDC, Israel


RNN – summary
 RNNs can be thought of as analogous to traditional time-series analysis in
econometrics, as both aim to model patterns in sequential data.
 However, RNNs are more flexible, as they uncover these patterns in a data-
driven and highly nonlinear manner, often involving many hidden states.
 In summary, RNNs:
a. Handle variable-length sequences, making them versatile for different
types of sequential data;
b. Capture long-term dependencies, though this may be challenging for
vanilla RNNs, which can suffer from vanishing gradients;
c. Retain information about the order of elements within a sequence, which is
essential for tasks requiring sequential context;
d. Share parameters across the sequence, allowing the model to generalize
across different parts of the sequence effectively.

84 Professor Doron Avramov, IDC, Israel


RNN – a caveat
 Feed-forward networks are trained using the backpropagation
algorithm. The process involves taking a set of inputs, making a forward
pass through the network (from input to output), and then adjusting the
weights through backpropagation. Specifically, the derivative of the loss
with respect to each weight parameter is calculated, and the weights are
updated to minimize the loss function.
 In RNNs, the forward pass occurs through time, and backpropagation is
also through time, referred to as backpropagation through time
(BPTT). This means errors are propagated back from the most recent time
step to the beginning of the sequence, tracing through all previous steps. As
this happens, the gradients are multiplied by the same weight matrix
repeatedly.
 This recursive multiplication introduces a significant challenge: exploding
gradients or, more commonly, vanishing gradients.

85 Professor Doron Avramov, IDC, Israel


RNN – a caveat
 To illustrate:
• If you keep multiplying 0.9 by itself, the sequence eventually approaches zero
(vanishing gradient).
• Conversely, multiplying 2 by itself leads to an explosive increase (exploding
gradient).
 When the gradients become too small, the network struggles to learn long-term
dependencies because the contribution of earlier time steps diminishes rapidly.
This results in biases toward short-term dependencies, even when long-term
ones are important. On the other hand, if the gradients explode, the model
becomes unstable and fails to converge.
 Due to this vanishing gradient problem, standard RNNs are often ineffective
at learning from long-range dependencies in sequences. This issue motivated the
development of more sophisticated architectures, like Long Short-Term
Memory (LSTM) and Gated Recurrent Units (GRUs), which are designed to
mitigate these challenges by preserving information over longer periods in a
sequence.
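The arithmetic behind the illustration:

```python
# Repeated multiplication by the same factor, as in backpropagation through time.
for n in (10, 50, 100):
    print(n, 0.9 ** n, 2.0 ** n)
# 0.9**100 is about 2.7e-05 (vanishing); 2**100 is about 1.3e+30 (exploding)
```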
86 Professor Doron Avramov, IDC, Israel
LSTM: Maintaining Gradient Value in Backpropagation
 Problem with RNNs: As noted, in traditional RNNs, gradients tend to
vanish or explode during backpropagation, especially over long sequences,
which limits their ability to capture long-term dependencies.
 Solution – LSTM Cell:
• LSTM replaces the simple RNN cell with a more sophisticated structure
that includes multiple gates:
• Forget Gate: Controls how much of the past information to retain.
• Input Gate: Decides which parts of the new input to incorporate into the memory.
• Output Gate: Determines how much of the cell’s state to pass to the output.
• This structure allows LSTM to effectively preserve gradients, enabling
learning over longer periods.

87 Professor Doron Avramov, IDC, Israel


LSTM: Maintaining Gradient Value in Backpropagation
 Input-Output Transformation:
• LSTM transforms time-series inputs (of dimension d) into time-series outputs (of
dimension h).
• This ability to manage dependencies across time makes LSTM especially suited for
financial time series with low signal-to-noise ratios, which often exhibit both short-term
volatility and long-term trends.
 Recent Advances:
• Attention Mechanisms: More recent models, such as Transformer architectures,
leverage attention mechanisms to focus on relevant parts of the input sequence, allowing
them to outperform LSTMs in many tasks, including natural language processing (NLP)
and translation.
 Relevance to Finance:
• Despite these advances, LSTM remains a valid approach for predicting stock returns and
financial outcomes, especially when signal-to-noise ratios are low.
• For a detailed comparison of deep sequence models, refer to "Deep Sequence Modeling:
Development and Applications in Asset Pricing" by Cong, Tang, Wang, and Zhang (2020),
which compares various deep sequence models in predicting future returns.

88 Professor Doron Avramov, IDC, Israel


Key Parts of an LSTM Cell:
A cell state (c) – Represents long-term memory, holding all accumulated learning from previous steps.
Three regulators ("gates") that control the flow of information inside the LSTM unit:
Input gate (i) – Controls how much of the new information should be allowed into the cell state.
Forget gate (f) – Decides which parts of the previous cell state should be discarded.
Output gate (o) – Determines what information should be passed to the next time step as output.

89 Professor Doron Avramov, IDC, Israel


LSTM Unit functionality
 The gates regulate the flow of information into and out of the cell, ensuring
the LSTM can maintain long-term dependencies while filtering out
irrelevant details.
 In particular,
• Forget Irrelevant Information: The forget gate decides how much of the
previous memory should be kept. It controls what portion of the previous
cell state continues to the next step and what portion is discarded.
• Store Relevant New Information: The input gate decides what new information to
retain.
• Update the Cell State: The cell state gets updated with the combination of previous
and new information.
• Generate Output: The output gate decides what part of the memory and new input
should be output at the current time step.
90 Professor Doron Avramov, IDC, Israel
A schematic figure for an LSTM unit
 𝑥𝑡 is the input consisting of time-series observations
 ℎ𝑡 is a hidden state
 𝑐𝑡 is the long-term memory, maintaining new relevant information while discarding irrelevant information
 𝑖𝑡 , 𝑜𝑡 , and 𝑓𝑡 are the gates
 𝐶̃𝑡 is the candidate cell state
[Figure: the cell state flows from 𝐶𝑡−1 to 𝐶𝑡 through the forget (𝑓𝑡), input (𝑖𝑡), and output (𝑜𝑡) gates, with candidate state 𝐶̃𝑡; the hidden state flows from ℎ𝑡−1 to ℎ𝑡.]
91 Professor Doron Avramov, IDC, Israel


The LSTM – Math Representation
 The functions are defined as follows:
f_t = σ_g(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ_g(W_i x_t + U_i h_{t-1} + b_i)
C̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t
o_t = σ_g(W_o x_t + U_o h_{t-1} + b_o)
h_t = o_t ∘ tanh(C_t)
where ∘ is an element-by-element product, and C_0 and h_0 are initial values (can be set to zero).
 The new cell state is updated by two main components:
1. A portion of the previous cell state, which is kept or discarded based on the decision of the forget gate. This gate controls how much of the old information should be retained.
2. The new candidate cell state, which is introduced through the input gate. The input gate controls how much of the new information should be added to the cell's memory.
Variables: x_t ∈ ℝ^d is the input vector; f_t ∈ ℝ^h is the forget gate; i_t ∈ ℝ^h is the input update gate; o_t ∈ ℝ^h is the output gate; C̃_t (C_t) ∈ ℝ^h is the cell input (state) vector; the W's ∈ ℝ^{h×d} and U's ∈ ℝ^{h×h} are weight matrices; and the b's ∈ ℝ^h are intercept vectors, all of which are learned during the training stage. d is the pre-specified number of input features; input and output can have different dimensions.
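A direct NumPy transcription of the equations above, stepping a single sequence through the cell; the dimensions and random initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d, h = 5, 3                                                 # input and hidden sizes
W = {g: 0.1 * rng.standard_normal((h, d)) for g in "fico"}  # W_f, W_i, W_c, W_o
U = {g: 0.1 * rng.standard_normal((h, h)) for g in "fico"}  # U_f, U_i, U_c, U_o
b = {g: np.zeros(h) for g in "fico"}                        # b_f, b_i, b_c, b_o

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate state
    c = f * c_prev + i * c_tilde                                # new cell state
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    return o * np.tanh(c), c                                    # new hidden and cell states

h_t, c_t = np.zeros(h), np.zeros(h)                             # h_0 = C_0 = 0
for x_t in rng.standard_normal((12, d)):                        # a 12-step sequence
    h_t, c_t = lstm_step(x_t, h_t, c_t)
print(h_t, c_t)
```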
92 Professor Doron Avramov, IDC, Israel
LSTM dynamics and hyperparameters
 Role of Gates:
 Larger f: Indicates a decision to retain more of the previous cell state, giving greater
importance to past information.
 Larger i: Implies more weight is given to the new input, allowing the current values to
have a stronger impact on the cell state.
 Independence of f and i: They function separately, meaning they do not act as explicit
complementary weights, and their weights in the cell state do not sum up to one.
 Hyperparameters:
 h: Represents the number of hidden units, which controls the output dimension of the
LSTM. This is a key hyperparameter that defines the model's capacity to learn patterns.
 LASSO Regularization: Adding LASSO during optimization introduces another
hyperparameter, which controls how much regularization is applied to prevent
overfitting.
 Hyperparameter Tuning: performed on the validation sample.

93 Professor Doron Avramov, IDC, Israel


Key Aspects of LSTM Architecture
• Maintain a Separate Cell State: The cell state is updated across time steps to store long-term information.
• Gates to Control Information Flow:
• Forget Gate: Discards irrelevant information.
• Input Gate: Stores relevant information from the current input.
• Cell State Update: Selectively updates the cell state based on the input gate.
• Output Gate: Returns a filtered version of the cell state as output.
• Backpropagation Through Time: Ensures uninterrupted gradient flow for learning long-term dependencies.
 Three Important Caveats About LSTM:
1. High Memory Requirement: Due to its complex architecture, LSTMs require more memory.
2. Training Challenges: LSTMs face difficulties during training because of long gradient paths, similar to training a
100-layer neural network on a 100-word document.
3. Activation Functions:
1. Sigmoid & Tanh: Can be difficult to work with due to saturation issues.
2. ReLU (Rectified Linear Unit): Less sensitive to random initialization, allowing neurons to express strong
opinions.
94 Professor Doron Avramov, IDC, Israel
LSTM: Example and factor extraction
 Let firm i have M characteristics whose time-t realizations are denoted by x_t^i.
 At time t, we use the most recent K periods to predict the next-period return r̂_{t+1}^i.
 The input is the series x_{t-K+1}^i, …, x_t^i, while the final output, h_t^i, generates the prediction r̂_{t+1}^i.
 LSTM parameters are estimated by minimizing the loss function
ℒ = Σ_{i=1}^N Σ_{t=1}^T (r_{i,t+1} − h_t^i)²
 In that setup, it is assumed that the output is a single number per stock: the predicted return.
 If h_t^i is a vector of length h, the output is given by h̃_t^i = W_h h_t^i, and h̃_t^i replaces h_t^i in the loss function above.
 Notice that the estimated parameters are identical across stocks.
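A sketch of how this pooled loss is assembled; the predict function below is only a stand-in for the trained LSTM output h_t^i, and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

N, T, K, M = 50, 120, 12, 10           # stocks, months, lookback, characteristics
X = rng.standard_normal((N, T, M))     # panel of characteristics x_t^i
r = rng.standard_normal((N, T))        # realized returns, r[i, t] observed at t

def predict(window):
    """Stand-in for the LSTM output h_t^i given the last K periods."""
    return window.mean()               # illustrative only

loss = 0.0
for i in range(N):                     # the same parameters serve every stock
    for t in range(K - 1, T - 1):
        r_hat = predict(X[i, t - K + 1 : t + 1])   # input: x_{t-K+1}^i, ..., x_t^i
        loss += (r[i, t + 1] - r_hat) ** 2         # squared forecast error
print(loss)
```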
95 Professor Doron Avramov, IDC, Israel


LSTM factors
 LSTM can also be applied to a large set of macro variables where lower-
dimension state cells summarize the short and long-run dependencies.
Hidden State (𝒉𝒕 ) as Dynamic Factors:
 The hidden state in an LSTM cell represents the output at each time step, capturing the
most relevant features from the input sequence.
 The hidden states can be interpreted as a set of dynamic factors that evolve, reflecting
immediate influences on the time-series data, such as market sentiment or short-term
economic fluctuations.
Cell State (𝑪𝒕 ) as Long-Term Factors:
 The cell state carries the long-term memory of the LSTM, accumulating and retaining
information across time steps.
 The cell state can be viewed as encapsulating long-term factors, such as persistent
macroeconomic trends or underlying market conditions, that affect the sequence over a
more extended period.

96 Professor Doron Avramov, IDC, Israel


Factor Extraction through LSTM
• Unlike traditional linear factor models, LSTMs can capture complex,
nonlinear dependencies in the data, providing a richer, more refined
understanding of the underlying factors.
• By processing sequences of macro predictors or other time-series data,
LSTMs dynamically extract and update factors that are crucial for predicting
future outcomes.
 Complementing Other Factor Models:
• PCA: Linear extraction of static factors.
• IPCA: Instrumental PCA with varying loadings based on characteristics.
• Autoencoder: Nonlinear, unsupervised factor extraction.
• CA: Conditional Autoencoder capturing nonlinear conditional relationships.
• LSTM: Adds the temporal dimension, capturing both short-term and long-
term dependencies in a dynamic, data-driven manner.
97 Professor Doron Avramov, IDC, Israel
RNN and LSTM – attention mechanism
 Let us revisit the RNN; recall that the latent states are denoted by ℎ1 , ℎ2 , … , ℎ𝑇 .
 Notice that ℎ𝑡 is perceived to contain all the abstract features in the
entire sequence.
 As all hidden states before t are not involved directly in generating the
output, the old information is washed out after being propagated over
multiple time steps.
 LSTM has the same drawback.
 To address this issue, the attention mechanism is proposed (Chaudhari
et al., 2019).
 With attention, not only ℎ𝑡 but also all the hidden states play a role.
 I will later explain the attention mechanism in the context of TE.
98 Professor Doron Avramov, IDC, Israel
Machine Learning versus Economic Restrictions
 Do machine learning methods lead to better forecasts of future stock returns?
 Avramov, Cheng, and Metzker (2021) show that deep learning signals, such as those
described earlier, confront similar caveats as individual anomalies.
 Investment performance considerably deteriorates when distressed stocks are excluded.
 Likewise, performance is vastly stronger during high limits-to-arbitrage market states, such as high market volatility (VIX).
 In addition, the stochastic discount factor estimated by KNS takes extreme long and short positions that cannot be practically implemented in real time.
 As noted by Chen, Pelger, and Zhu (2021), it is a natural idea to use machine learning
techniques, such as deep neural networks, to deal with the high dimensionality
and complex functional dependencies of input data.
 However, machine-learning tools are designed to work well for prediction tasks
in a high signal-to-noise environment.
 As asset returns seem to be dominated by unforecastable news, it is hard to
predict their risk premia with off-the-shelf methods.
 The presentation of the Avramov et al (2021) paper is in the appendix to the class notes.
99 Professor Doron Avramov, IDC, Israel
Machine Learning versus Economic Restrictions
 On the other hand, Avramov, Kaplanski, and Subrahmanyam (2021) apply LASSO, Ridge, and
Elastic Net techniques (shallow learners) to newly defined variables and document robust
performance.
 In particular, they consider all COMPUSTAT items and compute, per item, the distance between
current values and moving averages over past quarters.
 Such deviations predict future stock returns to economically significant degrees.
 The rule based on their Fundamental Deviation Index (FDI) survives recent years, excluding
microcaps, long positions only, all market states, and reasonable trading costs.
 They attribute their findings to investor’s anchoring.
 Avramov, Cheng, Metzker, and Voigt (2021) show the robust prediction ability of Bayesian
Model Averaging (BMA).
 Asset pricing inferences in BMA draw on an integrated model that weights individual models
based on posterior probabilities.
 Is there hope for deep learning signals?
 In what follows, two state-of-the-art approaches are explained, Reinforcement Learning (RL)
and adversarial GMM – both can be potentially useful.
100 Professor Doron Avramov, IDC, Israel
Reinforcement Learning (RL)
 RL is based on two key ideas: trial-and-error learning and delayed reward.
 It focuses on solving problems rather than using specific methods.
 Any method that fits the problem can be considered RL.
 Difference from Supervised Learning:
• In supervised learning, labeled examples are provided by an external supervisor.
• RL involves learning from interaction with the environment, where the agent must
explore and exploit based on feedback.
 Exploration vs Exploitation:
• RL agents need to balance exploiting known rewards and exploring
unknown actions for potentially better outcomes.
• This is especially important in dynamic and uncertain environments.

101 Professor Doron Avramov, IDC, Israel


Core Elements of RL

• Policy: The agent's strategy for selecting actions based on the current state.
• Reward Function: Defines the immediate feedback for an action, signaling good or
bad outcomes.
• Value Function: Estimates the long-term reward from a state, considering future
potential rewards.

102 Professor Doron Avramov, IDC, Israel


Q Learning

• A model-free RL algorithm where the agent learns the optimal action for
each state through trial and error.
• The agent updates its "Q-values" (estimates of the action's value) based on
feedback from the environment.
 Example:
• A chess-playing agent improves over time by learning which moves lead to
victories (rewards) and which do not, adjusting its policy to maximize long-
term success.

103 Professor Doron Avramov, IDC, Israel


Reinforcement Learning in Asset Pricing
• Reinforcement Learning (RL) in asset pricing was pioneered by Wang,
Zhang, Tang, Wu, and Xiong (2021).
• RL incorporates three components for return prediction, while the RL aspect
focuses on forming optimal portfolios based on these predictions.
1. Step 1: Use LSTM or Transformer-Encoder to generate a representation for
each asset based on its historical state.
2. Step 2: Introduce a Cross Asset Attention Network (CAAN), which utilizes
these asset representations to extract interrelationships among the assets.
3. Step 3: Employ a portfolio generator, which uses a scalar winner score for
each asset from CAAN to derive optimal portfolio weights.
• RL models the joint distribution of asset returns, observing trading actions,
testing a range of actions (portfolio weights), and exploring a high-dimensional
parameter space to maximize the Sharpe ratio.

104 Professor Doron Avramov, IDC, Israel


Transformer Encoder (TE)
 LSTM is effective but limited due to its sequential processing and lack of an
attention mechanism.
 The TE is a deep learning model that incorporates an attention mechanism, which
assigns different levels of importance to each part of the input data.
 Applications: TE is widely used in fields such as Natural Language Processing
(NLP) and Computer Vision (CV), including tasks like self-driving cars and
interactive gaming.
 Like RNNs, Transformer Encoders aim to process sequential input data.
 Unlike RNNs, TEs do not process the data in a strict sequence order, allowing for
more flexibility.
 The attention mechanism provides context to any position within the input
sequence, making it highly efficient for complex tasks.
 Example: In a sentence, the Transformer does not need to process the beginning
before the end—it can analyze the entire input simultaneously.
105 Professor Doron Avramov, IDC, Israel
TE: Key Features
 Instead of sequentially processing the data, the TE identifies the context
that gives meaning to each word or element in the sequence.
 This feature allows for parallelization, which significantly reduces training
times compared to sequential models like LSTMs.
 TEs have become the model of choice for NLP tasks, replacing LSTMs due
to their efficiency and scalability.
 Parallel processing enables TEs to handle much larger datasets.
 TEs also overcome vanishing and exploding gradient problems, which
are common in RNNs.
 The success of transformers has led to the development of pre-trained
models such as BERT (Bidirectional Encoder Representations from
Transformers) and GPT (Generative Pre-trained Transformer), both trained
on massive datasets like Wikipedia and fine-tuned for specific tasks.
106 Professor Doron Avramov, IDC, Israel
TE: Architecture
 The transformer uses an encoder-decoder architecture.
 The encoder consists of multiple layers that process the input data iteratively, one layer
at a time. Similarly, the decoder processes the encoded output.
 Each encoder layer identifies which parts of the input are most relevant and generates a
compressed representation.
 The output of each encoder layer is passed to the next layer for further processing.
 The decoder layers operate in reverse: they use contextual information to produce an
output sequence from the encoded data.
 Both the encoder and decoder rely on the attention mechanism, which assigns
different weights to parts of the input data based on relevance.
 For each input, attention determines which other inputs are most important for
producing the final output.
 Each decoder layer incorporates an additional attention mechanism that references the
output from previous decoder layers before drawing from the encoder.
 Both the encoder and decoder layers include feedforward neural networks, residual
connections, and layer normalization for processing outputs.
107 Professor Doron Avramov, IDC, Israel
TE in Finance
 Common Applications in general
1. Translating from one language to another.
2. Generating an answer (output) for a given question (input).
 In Finance, Cong et al. use only the encoder part of the TE.
 The authors also implement LSTM for time series encoding, using the Ct values
to capture the sequence of stock characteristics.
 Both the TE and LSTM models are referred to as Sequence Representation
Extraction Models (SREM) in the paper.
 The input for TE or LSTM is the time series of stock characteristics, which
are encoded into a lower-dimensional representation.
 The encoded data is used to model the cross-sectional interactions between
stocks using CAAN (Cross Asset Attention Network).
 These interactions are then converted into scores and weights to build portfolios
aimed at achieving the highest ex-post Sharpe ratio.
108 Professor Doron Avramov, IDC, Israel
TE in Finance
 The first part of the analysis consists of time-series encoding.
 Consider firm i characteristics over the past K months:
X^i = (x_1^i, …, x_k^i, …, x_K^i)
 In the Cong et al. paper, the dimension of the input is K = 12, reflecting one year of monthly observations, while there are 51 firm characteristics.
 Thus, x_k^i is a vector of dimension d = 51, while there are K = 12 such vectors.
 Time-series observations are encoded by TE or LSTM into hidden states
Z^i = (z_1^i, …, z_k^i, …, z_K^i)
 It is a sequence-to-sequence encoding for each firm separately.
 The dimension of z_k^i could be equal to or different from the dimension of x_k^i.
 The hidden states attempt to capture long-range dependencies in the data.
 Below, I explain encoding – the transition from X^i to Z^i – through TE.
 LSTM encoding was explained earlier; the cell states (C) contain the encoded information.
109 Professor Doron Avramov, IDC, Israel
TE: Self attention
 Again, the inputs for firm i are given by the matrix of characteristics
X^i = (x_1^i, …, x_k^i, …, x_K^i)
 The self-attention unit forms, for each x_k^i, a query vector q_k^i, a key vector k_k^i, and a value vector v_k^i.
 Specifically, the query, key, and value vectors are given by
q_k^i = W^Q x_k^i
k_k^i = W^K x_k^i
v_k^i = W^V x_k^i
 The dimensions of q_k^i and k_k^i are equal to d_1, which is a hyperparameter.
 The dimension of v_k^i, say d_2, is also a hyperparameter and could be different from d_1.
 Thus, W^Q and W^K are d_1 × d matrices and W^V is a d_2 × d matrix (recall, d = 51).

110 Professor Doron Avramov, IDC, Israel


TE: Self attention
Attention is defined as
Attention(Q, K, V) = softmax(Q′K / √d_k) V
where the product of the query and the key represents relevance, and the softmax function is
softmax(z_k) = exp(z_k) / Σ_{k′=1}^K exp(z_{k′}), for k = 1, …, K
Then, the interrelationship between firm i characteristics at different times in the historical period, (k, k′) ∈ {1, …, K}, is modeled by the dot product of the query q_k^i and the key k_{k′}^i:
β_{k,k′} = (q_k^i)′ k_{k′}^i / √d_1
The attention score for time k is a softmax-weighted average of the values:
a_k^i = Σ_{k′=1}^K exp(β_{k,k′}) v_{k′}^i / Σ_{k′=1}^K exp(β_{k,k′})
Notice that the attention a_k^i is a vector of dimension d_2 obtained as a weighted average (the weights sum to one) of the values.
111 Professor Doron Avramov, IDC, Israel
Transformer– Self attention
 The output of the self-attention process can also be written in matrix notation as
Z^i = softmax(X^i′ W^Q′ W^K X^i / √d_1) X^i′ W^V′
 Z^i is a K × d_2 matrix that collects the K values of a_k^i.
 The value at each position is calculated from all the positions in the sequence.
 The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
 So, we can compute the attention function on a set of queries simultaneously, rather than period by period.
 The transition from X^i to Z^i is complete.
 This is done stock by stock; the query, key, and value matrices (the W's) are identical across stocks.
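A NumPy sketch of the single-head computation above for one stock; d_1 = d_2 = 16 and the random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d, K, d1, d2 = 51, 12, 16, 16                 # characteristics, months, q/k and v sizes
X = rng.standard_normal((K, d))               # rows are x_k^i for one stock
W_Q = 0.1 * rng.standard_normal((d1, d))
W_K = 0.1 * rng.standard_normal((d1, d))
W_V = 0.1 * rng.standard_normal((d2, d))

Q, Kmat, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T  # queries, keys, values (as rows)
beta = Q @ Kmat.T / np.sqrt(d1)               # relevance scores beta_{k,k'}
Z = softmax(beta, axis=1) @ V                 # K x d2 matrix stacking the a_k^i
print(Z.shape)                                # (12, 16)
```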
112 Professor Doron Avramov, IDC, Israel
Transformer– Multi Head Attention
 In order to capture a number of complex interrelations in a sequence, self-
attention units are grouped and connected in parallel.
 This connected group is termed multi-head attention.
 For instance, if it is about understanding a sentence, different people could
have different perspectives on that sentence.
 The multi-head attention unit is a group of four (hyperparameter) self
attention units each with different weights 𝑊 𝑄 , 𝑊 𝐾 , 𝑊 𝑉 .
 Each self attention unit has an output 𝑍ℎ , ℎ = 1, … , 4, which is a 𝐾 × 𝑑2
matrix.
 Do the same thing four times, and let the network learn four different items to pay attention to.
 All those matrices are concatenated into a new matrix:
Z̃ = (Z_1, …, Z_4), of dimension K × 4d_2

113 Professor Doron Avramov, IDC, Israel


Transformer– Multi Head Attention
 Then Z̃ is multiplied from the right by a weight matrix W^O of dimension 4d_2 × d.
 The output of the multi-head attention is of dimension K × d, which is 12 × 51 in our case:
head_j = Attention(Q W_j^Q, K W_j^K, V W_j^V)
MultiHead(Q, K, V) = Concat(head_1, …, head_4) W^O
 Notice that Z̃ is computed for each stock.
 Stocks differ only in their inputs, while the parameters are shared.
 The aim is to transform the observed stock characteristics into hidden states by encoding the characteristics through the TE mechanism with multi-head attention.
 Each attention head acts like a "different lens" through which the data is viewed, capturing various aspects of the relationships between the elements in the sequence.
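A self-contained sketch of the four-head version; the head count and the 12 × 51 output mirror the slides, while the initialization is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d, K, d1, d2, H = 51, 12, 16, 16, 4            # four heads, as on the slide
X = rng.standard_normal((K, d))                # one stock's characteristics

heads = []
for _ in range(H):                             # each head has its own W_Q, W_K, W_V
    W_Q, W_K, W_V = (0.1 * rng.standard_normal((m, d)) for m in (d1, d1, d2))
    beta = (X @ W_Q.T) @ (X @ W_K.T).T / np.sqrt(d1)
    heads.append(softmax(beta, axis=1) @ (X @ W_V.T))   # K x d2 per head

Z_tilde = np.concatenate(heads, axis=1)        # K x 4*d2 concatenation
W_O = 0.1 * rng.standard_normal((H * d2, d))
out = Z_tilde @ W_O                            # K x d output
print(out.shape)                               # (12, 51)
```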
114 Professor Doron Avramov, IDC, Israel
Cross asset attention network (CAAN)
 Up to this stage, we have transformed the real observations X^i into hidden states Z^i for each stock.
 LSTM is an alternative way to make that transformation, with the Z^i replaced by the cell states.
 The cross-sectional interrelations among firms' hidden states are modeled by a self-attention module, termed the cross asset attention network (CAAN), similar to the multi-head attention.
 All the vectors of hidden states for firm i, Z^i = (z_1^i, …, z_k^i, …, z_K^i), are concatenated into a vector
y^i = Concat(z_1^i, …, z_k^i, …, z_K^i)
 y^i is a vector of length dK that represents firm i.
 The representation y^i is used to construct three other representation vectors: a query q^i, a key k^i, and a value v^i, using the trainable matrices W_CAAN^Q, W_CAAN^K, W_CAAN^V (learned during the training process):
q^i = W_CAAN^Q y^i
k^i = W_CAAN^K y^i
v^i = W_CAAN^V y^i
 These matrices for query, key, and value are identical across assets.
 W_CAAN^Q and W_CAAN^K are d_3 × dK matrices and W_CAAN^V is a d_4 × dK matrix.
115 Professor Doron Avramov, IDC, Israel


CAAN
 The interrelationship between stock i and stock j is modeled by the dot product of stock i's query, q^i, and stock j's key, k^j:
β_{i,j} = (q^i)′ k^j / √d_3
 The attention score for stock i is a softmax of the interrelationships of stock i with all stocks j, weighting each stock j's value:
a^i = Σ_{j=1}^I exp(β_{i,j}) v^j / Σ_{j=1}^I exp(β_{i,j})
where I is the overall number of stocks and a^i is a vector of length d_4.
 Finally, the winner score determining the long-short position of stock i in the portfolio is
s^i = sigmoid(W^S a^i + b^S)
where W^S is a 1 × d_4 weight vector and b^S is the bias.

116 Professor Doron Avramov, IDC, Israel


Reinforcement Learning Optimization
 Given the scores of each stock, s^1, …, s^i, …, s^I, the long and short portfolios are constructed from the G extreme scores.
 Let o^i be the rank of stock i in descending order.
 Stock i is in the long portfolio b^+ if o^i ∈ [1, G]:
b^+(i) = exp(s^i) / Σ_{o^{i′} ∈ [1,G]} exp(s^{i′})
 Stock i is in the short portfolio b^− if o^i ∈ [I − G + 1, I]:
b^−(i) = exp(−s^i) / Σ_{o^{i′} ∈ [I−G+1,I]} exp(−s^{i′})
 Denote by b^c the vector of weights for all the I stocks.
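A sketch of this construction; the number of stocks, the scores, and G are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def long_short_weights(s, G):
    """Softmax of scores over the top G stocks (long) and of minus the
    scores over the bottom G stocks (short)."""
    order = np.argsort(-s)                      # descending rank o^i
    b = np.zeros_like(s)
    top, bottom = order[:G], order[-G:]
    b[top] = np.exp(s[top]) / np.exp(s[top]).sum()               # b+ sums to +1
    b[bottom] = -np.exp(-s[bottom]) / np.exp(-s[bottom]).sum()   # b- sums to -1
    return b

s = rng.random(10)                              # winner scores for I = 10 stocks
b = long_short_weights(s, G=3)
print(b.round(3), b.sum())                      # a zero-net-investment portfolio
```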

117 Professor Doron Avramov, IDC, Israel


RL: Forming portfolios through winner scores (s)

118 Professor Doron Avramov, IDC, Israel


Reinforcement Learning
 “Learning what to do – how to map situations into actions – so as to maximize a numerical reward signal” (Sutton and Barto, 2008)
 The characteristics X_t^i = (x_1^i, …, x_k^i, …, x_K^i) over the past K periods, for i = 1, …, I, constitute the state of the environment.
 The action at time t is the vector of stock weights in the constructed portfolio, b_t^c.
 The reward at time t is the portfolio return at t+1: r_{t+1}^p = r_{t+1}′ b_t^c.
 The value is the Sharpe ratio for a sequence of realized returns, say 12 months:
J = SR(r_1^p, r_2^p, …, r_{12}^p)
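A sketch of this value objective, the Sharpe ratio of 12 realized monthly portfolio returns; the numbers are illustrative.

```python
import numpy as np

def sharpe(returns):
    """Sharpe ratio of a sequence of realized portfolio returns."""
    returns = np.asarray(returns)
    return returns.mean() / returns.std(ddof=1)

monthly = [0.02, -0.01, 0.015, 0.005, 0.03, -0.02,
           0.01, 0.0, 0.02, 0.01, -0.005, 0.025]     # r^p_1, ..., r^p_12
print(sharpe(monthly))                                # the J that RL maximizes over theta
```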

119 Professor Doron Avramov, IDC, Israel


Reinforcement Learning
 Let us denote by θ the model parameters that affect b_t^c; then the optimization is to find θ* such that
θ* = argmax_θ J(θ)
 To summarize, the parameters are
θ = { W_CAAN^Q, W_CAAN^K, W_CAAN^V, W^S, b^S (CAAN parameters); 4 × (W^Q, W^K, W^V), W^O (TE parameters) }
 The collection of hyperparameters {d, d_1, d_2, d_3, d_4, h, K, G} is determined in the test sample.
 Because reinforcement learning is not supervised, there are no training/validation samples.
 Hyperparameters are determined in the test sample by experimenting with different values.
 The investor is a price taker; hence, his actions do not affect the evolution of the environment.
 Other reinforcement learning techniques and applications in economics, game theory, operations research, and finance can be found in Charpentier, Elie, and Remlinger (2020).

120 Professor Doron Avramov, IDC, Israel


Stock Return Prediction versus ChatGPT
 ChatGPT, like many advanced language models, relies on the Transformer architecture described earlier.
 Let’s explore how the same principles apply to generating coherent and contextually appropriate responses in a
conversation.
 Understanding Transformers in ChatGPT
1. Sequential Data Handling:
o Just as the Transformer Encoder processes sequences of stock characteristics over time, ChatGPT processes
sequences of words or tokens in a sentence or paragraph. Each word is treated as part of a sequence, with the
model considering the context provided by all previous words to generate the next word.
2. Self-Attention Mechanism:
o The self-attention mechanism in ChatGPT, like in predicting stock returns, allows the model to weigh the
importance of different words in a sequence. For example, when predicting the next word in a sentence, the
model looks at all previous words, determines which ones are most relevant to the context, and uses this
information to generate the most appropriate next word.
o In ChatGPT, the multi-head attention allows the model to understand different nuances of meaning and
relationships between words, making it capable of generating contextually accurate and sophisticated
language.
121 Professor Doron Avramov, IDC, Israel
Stock Return Prediction versus ChatGPT
3. Parallel Processing and Efficiency:
o One of the reasons ChatGPT is so powerful is because of the parallelization capability of transformers.
Unlike traditional models that process data sequentially (like RNNs), transformers can process multiple parts
of a sequence simultaneously. This is why ChatGPT can generate responses quickly and handle long
conversations without losing context.
4. Encoding and Decoding:
o While in stock return prediction, we mainly discussed the encoding part of the Transformer, ChatGPT also
relies on a decoding process. The encoded information (which represents the understanding of the input text)
is used by the decoder to generate the response. The decoder works similarly by applying attention
mechanisms to ensure the generated text is relevant to the input query.
5. Fine-Tuning and Adaptation:
o Just as you might tune a model's parameters to better predict stock returns, ChatGPT is fine-tuned on vast
amounts of text data. This allows it to adapt to different types of questions, styles of conversation, and even
specific topics, providing more accurate and relevant responses.

122 Professor Doron Avramov, IDC, Israel


Stock Return Prediction versus ChatGPT
To sum up, ChatGPT works by leveraging the same principles we’ve used in stock
return prediction with transformers: it processes sequences of data (in this case, words),
applies self-attention to understand context, and uses this information to generate
coherent and contextually relevant responses.

The power of transformers, with their ability to handle long-range dependencies and
process data efficiently, is what makes models like ChatGPT so effective in natural
language processing tasks.

123 Professor Doron Avramov, IDC, Israel


Generative Adversarial Network - GAN
 GAN is a setup in which two neural networks contest with each other in a (often zero-sum) game.
 For example, let w and g be two neural networks' outputs.
 The loss function is defined over both outputs, L(w, g).
 The competition between the two neural networks proceeds by iterating on both w and g sequentially:
 w is updated by minimizing the loss while g is given:
ŵ = argmin_w L(w|g)
 g is the adversary, and it is updated by maximizing the loss while w is given:
ĝ = argmax_g L(g|w)
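A stylized sketch of the alternating updates on a toy loss; the bilinear-plus-quadratic L(w, g) = w·g + 0.5·w² is purely illustrative, not an asset pricing model.

```python
# Alternating gradient steps: w descends the loss, the adversary g ascends it.
w, g, lr = 1.0, 1.0, 0.1
for step in range(200):
    w -= lr * (g + w)      # w-step: move against dL/dw with g held fixed
    g += lr * w            # g-step: move along dL/dg with w held fixed
print(w, g)                # the play settles near the saddle point (0, 0)
```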

124 Professor Doron Avramov, IDC, Israel


Adversarial GMM
 Chen, Pelger, and Zhu (2019) employ an adversarial GMM to estimate the SDF.
 The CPZ model is formulated as follows.
 For any excess return, no arbitrage suggests that
E_t[M_{t+1} R^e_{t+1,i}] = 0, where M_{t+1} = 1 − Σ_{i=1}^N w_{t,i} R^e_{t+1,i} and w_{t,i} is a general function of the conditioning information.
 It then follows that E_t[M_{t+1} R^e_{t+1,i} g(I_{t,i}, I_t)] = 0.
 That is because you can multiply the moment conditions with any time-t measurable function of firm characteristics and macro variables.
 The unconditional representation follows from the LIE: E[M_{t+1} R^e_{t+1,i} g(I_{t,i}, I_t)] = 0.
 The unconditional moment conditions can be interpreted as the pricing errors for a
choice of portfolios and times, determined by g(.).
 The challenge is to find the relevant moment conditions to identify the SDF.
 Considering only unconditional moments, the g function is constant.
125 Professor Doron Avramov, IDC, Israel
Adversarial GMM
 One can use the adversarial approach to select the moment conditions that lead to the largest mispricing.
 This is a minimax optimization problem:
min_w max_g (1/N) Σ_{j=1}^N ( E[ (1 − Σ_{i=1}^N w(I_t, I_{t,i}) R^e_{t+1,i}) R^e_{t+1,j} g(I_t, I_{t,j}) ] )²
where w and g are normalized functions chosen from a specified functional class.
 These types of problems can be modeled as a zero-sum game, where one player, the
asset pricing modeler, aims to choose an asset pricing model, while the adversary
searches for conditions under which the asset pricing model performs badly.
 This can be interpreted as first finding portfolios or times that are the most
mispriced and then tuning the asset pricing model to also price these assets.
 The process is repeated until the adversary cannot find portfolios with large enough
pricing errors.
126 Professor Doron Avramov, IDC, Israel
Adversarial GMM
 Note that this is a data-driven generalization of the research protocol conducted in asset pricing over the last decades.
 To illustrate, assume that the asset pricing modeler uses the Fama-French 5
factor model, spanned by the five factors.
 The adversary might propose momentum sorted test assets, that is g is a
vector of indicator functions for different quantiles of past returns.
 As these test assets have significant pricing errors with respect to the Fama-
French 5 factors, the asset pricing modeler needs to revise the candidate
SDF, for example, by adding a momentum factor.
 Next, the adversary searches for other mispriced anomalies or states of the
economy, which the asset pricing modeler will exploit in revising the SDF.

127 Professor Doron Avramov, IDC, Israel


Adversarial GMM
 The adversarial estimation with a minimax objective function is motivated from the insights of Hansen and
Jagannathan (1997).
 HJ show that if the SDF implied by an asset pricing model is only a proxy that does not price all possible
assets in the economy, then minimizing the largest possible pricing error corresponds to estimating the
SDF that is the closest to an admissible true SDF in a least square distance.
 HJ discuss the estimation of the SDF based on the minimax objective function and compare it with the
conventional efficient GMM estimation for parametric models with a low dimensional parameter set.
 They conclude that the minimax estimation has desirable properties when models are misspecified and the
resulting SDFs have substantially less variation relative to the conventional GMM approach.
 In CPZ, the SDF is implicitly constrained by the fact that it can only depend on stock specific
characteristics 𝐼𝑖,𝑡 but not the identity of the stocks themselves and by a regularization in the estimation.
 Hence, even in-sample, the SDF will have non-zero pricing errors for some stocks and their characteristic
managed portfolios, which naturally puts CPZ into the setup of Hansen and Jagannathan (1997).

128 Professor Doron Avramov, IDC, Israel


Adversarial GMM
 Choosing the conditioning function g correspond to finding optimal instruments in a
GMM estimation.
 The conventional GMM approach assumes a finite number of moments that identify
a finite dimensional set of parameters.
 In CPZ, there are an infinite number of candidate moments without the knowledge
of which moments identify the parameters.
 The parameter set is also of infinite dimension, and, hence, there is not an
asymptotic normal distribution with a feasible estimator of the covariance matrix.
 The approach thus selects the moments based on robustness.
 By controlling the worst possible pricing error, the approach aims to choose the test
assets that can identify all parameters of the SDF and provide a robust fit.
 The conditioning function g generates a very large number of test assets to identify
a complex SDF structure.

129 Professor Doron Avramov, IDC, Israel


Adversarial GMM
 The moment conditions are averaged over the sample of all instrumented stocks; that is, the loss function is (1/N) Σ_{i=1}^N Σ_{d=1}^D α_{i,d}², where the moment deviation α_{i,d} = E[M_{t+1} R_i g_d(I_t, I_{t,i})] can be interpreted as the pricing error of stock i instrumented by the element g_d of the vector-valued function g(.).
 Note that the instruments g_d are normalized to be in [−1, 1].
 In their benchmark model, CPZ consider N = 10,000 stocks and D = 8 instruments, and therefore the total is 80,000 instrumented assets.
 Hence, the SDF depends only on information that affects a very large proportion of the stocks, amounting to systematic mispricing.
 This also implies that the adversarial approach will only select instruments that lead to mispricing for most stocks.

130 Professor Doron Avramov, IDC, Israel


Adversarial GMM
 Once CPZ obtain the SDF factor weights, the loadings are proportional to the conditional moments E_t[F_{t+1} R^e_{t+1,i}].
 A key element of their approach is to avoid estimating directly conditional
means of stock returns.
 CPZ show that they can better estimate the conditional co-movement of
stock returns with the SDF factors, which is a second moment, than the
conditional first moment.

131 Professor Doron Avramov, IDC, Israel


Adversarial GMM
 The empirical loss function of the model minimizes the weighted sample moments, which can be interpreted as weighted sample mean pricing errors:
L(ω|ĝ, I_t, I_{t,i}) = (1/N) Σ_{i=1}^N (T_i/T) [ (1/T_i) Σ_{t∈T_i} M_{t+1} R^e_{t+1,i} ĝ(I_t, I_{t,i}) ]²
for a given conditioning function ĝ(.) and information set.
 As the convergence rate of the moments under suitable conditions is 1/√T_i, they weight each cross-sectional moment condition by T_i/T, which assigns a higher weight to moments that are estimated more precisely and down-weights the moments of assets that are observed only for a short time period.

132 Professor Doron Avramov, IDC, Israel


Adversarial GMM
 For a given conditioning function ĝ(.) and choice of information set, the SDF portfolio weights are estimated by a feedforward network that minimizes the pricing error loss:
ω̂ = argmin_ω L(ω|ĝ, I_t, I_{t,i})
 This is the SDF network.
 CPZ then construct the conditioning function 𝑔ො via a conditional network
with a similar neural network architecture.
 The conditional network serves as an adversary and competes with the SDF
network to identify the assets and portfolio strategies that are the hardest to
explain.
 The macroeconomic information dynamics are summarized by
macroeconomic state variables ℎ𝑡 which are obtained by LSTM.
 The model architecture is summarized on the next page.
133 Professor Doron Avramov, IDC, Israel
Adversarial GMM

134 Professor Doron Avramov, IDC, Israel


Appendix

MACHINE LEARNING VERSUS ECONOMIC RESTRICTIONS:


EVIDENCE FROM STOCK RETURN PREDICTABILITY

Doron Avramov, IDC Herzliya, Israel


Si Cheng, Chinese University of Hong Kong
Lior Metzker, Hebrew University of Jerusalem
Zoo of Anomalies

Source: Harvey, Liu, and Zhu (2016)
136 Professor Doron Avramov, IDC, Israel
Zoo of Anomalies: Challenges
 Harvey, Liu, and Zhu (2016): 296 anomalies, 27% to 53% are likely
to be false discoveries
 Hou, Xue, and Zhang (2020): 452 anomalies, 82% turn insignificant
upon excluding microcaps + value-weighting
 Other challenges: anomaly profits mostly originate from short-leg
distressed stocks and often disappear in recent years (e.g., Avramov,
Chordia, Jostova, and Philipov, 2013)

137 Professor Doron Avramov, IDC, Israel


Zoo of Anomalies: Challenges
 Traditional methods:
 Portfolio sorts and cross-sectional regressions
 Low-dimensional
 What needs to be done?
 High-dimensional, noisy and correlated predictors
 Flexible functional forms
 Model selection
 Mitigate overfitting biases
 Machine Learning: automated detection of complex patterns in data;
combine multiple weak sources of information into a meaningful
composite signal
 Growing literature on return prediction and asset pricing models
138 Professor Doron Avramov, IDC, Israel
FinTech Adoption in Asset Management

141 Professor Doron Avramov, IDC, Israel


Research Questions
 Two strands of literature: diminishing anomalies vs. increasing
prominence of ML methods
 Do ML methods clear the common economic restrictions in asset
pricing?
Exclude difficult-to-arbitrage stocks
Cannot infer from individual anomalies
 Does the return predictability of ML signals vary over time?
Exclude market states with high limits to arbitrage
Again, cannot infer…
 What are the economic grounds for the seemingly opaque ML
methods?
142 Professor Doron Avramov, IDC, Israel
Summary of Machine Learning Methods
Method    Linearity    Asset Pricing Model    Testing Asset    Predictors
GKX       Nonlinear    Reduced Form           Stock            Firm + Macro
CPZ       Nonlinear    Pricing Kernel         Stock            Firm + Macro
IPCA      Linear       Beta Pricing           Stock            Firm
CA        Nonlinear    Beta Pricing           Stock            Firm
KNS       Linear       Pricing Kernel         Portfolio        Firm
• GKX: Gu, Kelly, and Xiu (2020)
• CPZ: Chen, Pelger, and Zhu (2019)
• IPCA: Kelly, Pruitt, and Su (2019)
• CA: Gu, Kelly, and Xiu (2019)
• KNS: Kozak, Nagel, and Santosh (2020)

143 Professor Doron Avramov, IDC, Israel


Machine Learning Method I: GKX
Neural network with 3 hidden layers (NN3)
Example: 1 hidden layer with 5 neurons
(4 + 1) × 5 + 6 = 31 parameters

144 Professor Doron Avramov, IDC, Israel


Machine Learning Method I: GKX
Neural network with 3 hidden layers (NN3)
32, 16, and 8 neurons per layer
Reduced form, no economic restriction
94 firm characteristics + 8 macroeconomic predictors + 74
industry dummies + interactions
(8+1) × 94 + 74 = 920 predictors
Training sample: 18 years, 1957 to 1974
Validation sample: 12 years, 1975 to 1986
Out-of-sample test: 31 years, 1987 to 2017

145 Professor Doron Avramov, IDC, Israel


Machine Learning Method II: CPZ
Adversarial approach, multiple connected neural networks
Incorporate no-arbitrage condition to estimate SDF and stock
risk loadings
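Loosely (our notation, following Chen, Pelger, and Zhu), the two networks solve a min-max problem over the no-arbitrage moment conditions, where 𝜔 is the SDF network, 𝑔 the adversary network, and 𝐼𝑡, 𝐼𝑖,𝑡 denote macro and firm-level information:
min over 𝜔, max over 𝑔: Σ𝑖 ( 𝔼[ 𝑀𝑡+1 𝑟ᵉ𝑖,𝑡+1 𝑔(𝐼𝑡, 𝐼𝑖,𝑡) ] )², with 𝑀𝑡+1 = 1 − Σ𝑗 𝜔(𝐼𝑡, 𝐼𝑗,𝑡) 𝑟ᵉ𝑗,𝑡+1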
[Diagram: the asset-pricing modeler (SDF network) and the adversary (conditional network) as connected neural networks]
146 Professor Doron Avramov, IDC, Israel


Machine Learning Method II: CPZ
Adversarial approach, multiple connected neural networks
46 firm characteristics + 178 macroeconomic predictors +
interactions → 10,000+ predictors
Training sample: 20 years, 1967 to 1986
Validation sample: 5 years, 1987 to 1991
Out-of-sample test: 25 years, 1992 to 2016

147 Professor Doron Avramov, IDC, Israel


Machine Learning Method III: IPCA
Instrumented principal component analysis

𝑟𝑖,𝑡+1 = 𝛼𝑖,𝑡 + 𝛽𝑖,𝑡′ 𝑓𝑡+1 + 𝜖𝑖,𝑡+1
𝛽𝑖,𝑡′ = 𝑧𝑖,𝑡′ Γ𝛽 + 𝜐𝛽,𝑖,𝑡′
Factor loadings vary with predictive characteristics linearly, 6
latent factors
Incorporate no-arbitrage condition
94 firm characteristics
Estimated in each month
Out-of-sample test: 31 years, 1987 to 2017
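A minimal estimation sketch (Python/NumPy), assuming excess returns R (T×N), instruments Z (T×N×L), and K factors; identification normalizations and convergence checks are omitted:

    import numpy as np

    def ipca_als(R, Z, K, n_iter=100):
        # Alternating least squares for the restricted (alpha = 0) IPCA model:
        # r_t = (Z_t Gamma) f_t + e_t, with loadings linear in characteristics.
        T, N, L = Z.shape
        Gamma = np.linalg.svd(np.random.randn(L, K))[0][:, :K]  # random start
        for _ in range(n_iter):
            # Given Gamma: factor realizations from period-by-period regressions
            F = np.stack([np.linalg.lstsq(Z[t] @ Gamma, R[t], rcond=None)[0]
                          for t in range(T)])
            # Given factors: Gamma from a pooled regression over all (i, t),
            # using r_{i,t} = (f_t' kron z_{i,t}') vec(Gamma) + e_{i,t}
            X = np.concatenate([np.kron(F[t][None, :], Z[t]) for t in range(T)])
            g = np.linalg.lstsq(X, R.reshape(-1), rcond=None)[0]
            Gamma = g.reshape(K, L).T
        return Gamma, F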

148 Professor Doron Avramov, IDC, Israel


Machine Learning Method IV: CA
Conditional autoencoder with 2 hidden layers (CA2)
Factor loadings vary with predictive characteristics nonlinearly
through neural networks.
32 and 16 neurons per layer, 5 latent factors
Incorporate no-arbitrage condition
94 firm characteristics
Training sample: 18 years, 1957 to 1974
Validation sample: 12 years, 1975 to 1986
Out-of-sample test: 30 years, 1987 to 2016
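A simplified sketch of the two networks (PyTorch); in the actual model the factor network is fed characteristic-managed portfolio returns, and regularization details are omitted:

    import torch.nn as nn

    class ConditionalAutoencoder(nn.Module):
        # Beta network: characteristics -> factor loadings (nonlinear).
        # Factor network: the N-stock return cross-section -> K factors.
        def __init__(self, n_chars, n_stocks, n_factors=5):
            super().__init__()
            self.beta_net = nn.Sequential(
                nn.Linear(n_chars, 32), nn.ReLU(),
                nn.Linear(32, 16), nn.ReLU(),
                nn.Linear(16, n_factors),
            )
            self.factor_net = nn.Linear(n_stocks, n_factors)

        def forward(self, Z, r):
            beta = self.beta_net(Z)   # N x K loadings from characteristics
            f = self.factor_net(r)    # K factor realizations from returns
            return beta @ f           # fitted returns: beta_{i,t}' f_t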

150 Professor Doron Avramov, IDC, Israel


Data
CRSP: daily and monthly stock data
COMPUSTAT: quarterly and annual financial statement data
GKX, IPCA, and CA: all NYSE/AMEX/Nasdaq stocks, set
missing characteristics to cross-sectional median
21,882 stocks, between 5,117 and 7,877 per month
CPZ: all U.S. stocks from CRSP with available data on firm
characteristics
7,904 stocks, between 1,933 and 2,755 per month

151 Professor Doron Avramov, IDC, Israel


Economic Restrictions
Cross-sectional return predictability is concentrated in
microcaps and distressed firms
Exclude microcaps: market cap smaller than the 20th NYSE
size percentile
Rated firms: firms with data on S&P long-term issuer credit
rating
Exclude distressed firms: [−12, +12] months around an issuer
credit rating downgrade

152 Professor Doron Avramov, IDC, Israel


Subsamples with Economic Restrictions
[Bar chart, reconstructed as a table: number of stocks in each subsample]

                        GKX       CPZ
Full Sample             21,882    7,904
Non-Microcaps           13,119    4,499
Credit Rating Sample     5,083    2,436
Non-Downgrades           4,715    2,294
153 Professor Doron Avramov, IDC, Israel


GKX Portfolio Return Spread: EW vs. VW
[Bar chart: GKX long-short decile spreads, equal- vs. value-weighted: raw return and alphas relative to the CAPM, FFC, FFC+PS, FF5, FF6, and SY models. EW spreads range from 2.24 to 2.74; VW spreads from 0.77 to 1.89.]

• VW performance is 48% lower than EW.
154 Professor Doron Avramov, IDC, Israel


GKX Portfolio Return Spread: Economic Restrictions
[Bar chart: GKX long-short spreads, raw and risk-adjusted (CAPM, FFC, FFC+PS, FF5, FF6, SY), across the full sample, non-microcaps, credit rating sample, and non-downgrades. Spreads are insignificant after excluding microcaps or distressed firms.]

• Non-microcaps: 48% lower than the full sample; Rated firms: 46% ↓; Non-downgrades: 70% ↓
155 Professor Doron Avramov, IDC, Israel
Robustness Test: Train the NN3 Model in Subsamples
[Bar chart: GKX long-short spreads, raw and risk-adjusted, when the NN3 model is re-trained within each subsample.]

• Non-microcaps: 37% lower than the full sample; Rated firms: 50% ↓; Non-downgrades: 84% ↓
156 Professor Doron Avramov, IDC, Israel
Robustness Test: Alternative Objective Function

 NN3: EW loss function, predict return


 NN3-VW: VW loss function, predict FF6 alpha
[Bar chart: FF6 alphas of NN3-VW vs. NN3 across the full sample, non-microcaps, credit rating sample, and non-downgrades; full-sample values of 0.99 and 0.61 are visible.]

• The seemingly more aligned objective function does not necessarily improve the predictive performance.
157 Professor Doron Avramov, IDC, Israel
CPZ Portfolio Return Spread: Economic Restrictions
[Bar chart: CPZ long-short spreads, raw and risk-adjusted (CAPM, FFC, FFC+PS, FF5, FF6, SY), across the full sample, non-microcaps, credit rating sample, and non-downgrades.]

• Non-microcaps: 62% lower than the full sample; Rated firms: 72% ↓; Non-downgrades: 65% ↓
159 Professor Doron Avramov, IDC, Israel
ML Portfolio Return Spread: IPCA vs. GKX vs. CPZ
[Bar chart: raw return spreads and FF6 alphas of IPCA, GKX, and CPZ in the full sample; raw spreads range from 1.56 to 2.18 and FF6 alphas from 0.62 to 0.95, with IPCA the lowest on both.]

• IPCA underperforms the deep learning models in the full sample.
160 Professor Doron Avramov, IDC, Israel


ML Portfolio Return Spread: IPCA vs. GKX vs. CPZ
[Bar chart: raw return spreads and FF6 alphas of IPCA, GKX, and CPZ across the full sample, non-microcaps, credit rating sample, and non-downgrades.]

• IPCA: no material deterioration of performance among the cheap-to-trade stocks
161 Professor Doron Avramov, IDC, Israel
CA Portfolio Return Spread: Economic Restrictions
[Bar chart: CA long-short raw return spreads and FF6 alphas across the full sample, non-microcaps, credit rating sample, and non-downgrades.]

• Non-microcaps: 48% lower than the full sample; Rated firms: 75% ↓; Non-downgrades: 94% ↓
162 Professor Doron Avramov, IDC, Israel
Characteristics of ML Portfolios
 ML methods: positive/less negative skewness; smaller maximum drawdown than
the market; higher return during the crisis period
                 Sharpe Ratio  Skewness  Excess Kurtosis  Maximum Drawdown  Return in Crisis  Turnover
Panel A: Sorted by NN3-Predicted Return
Full Sample      0.944         0.631     5.222            0.350             4.100             0.976
Non-Microcaps    0.644         0.361     7.062            0.349             3.563             0.869
Panel B: Sorted by Risk Loading
Full Sample      1.225         1.063     5.932            0.209             0.472             1.664
Non-Microcaps    0.839         0.326     1.582            0.246             0.677             1.625
Panel C: Sorted by IPCA-Predicted Return
Full Sample      0.967         -0.449    4.805            0.203             0.574             1.186
Non-Microcaps    0.978         -0.267    5.369            0.234             1.493             1.130
Panel D: Sorted by CA2-Predicted Return
Full Sample      0.784         -0.077    2.418            0.202             -0.047            1.565
Non-Microcaps    0.748         0.291     4.684            0.207             -0.529            1.478
Panel E: Market Portfolio
Full Sample      0.527         -0.978    3.323            0.486             -6.954            0.089
Non-Microcaps    0.530         -0.959    3.222            0.485             -6.907            0.086
163 Professor Doron Avramov, IDC, Israel


Characteristics of ML Portfolios
 ML methods: require high turnover in portfolio rebalancing
 One-side turnover: < 10% (size, value); 14% to 35% (failure probability, IVOL); >
90% (short-term reversals, seasonality)
(Table repeated from the previous slide.)
164 Professor Doron Avramov, IDC, Israel


Break-Even Transaction Cost
 Novy-Marx and Velikov (2016): > 0.5%
 Brandt, Santa-Clara, and Valkanov (2009): 𝑐𝑖,𝑡 = 𝑧𝑖,𝑡 × 𝑇𝑡, where 𝑧𝑖,𝑡 = 0.006 − 0.0025 × 𝑀𝐸𝑖,𝑡 → 0.67% for the full sample and 0.64% for non-microcaps
                 FF6      Turnover   Break-Even Cost
Panel A: Sorted by NN3-Predicted Return
Full Sample      0.916    0.976      0.94
Non-Microcaps    0.312    0.869      0.36
Panel B: Sorted by Risk Loading
Full Sample      1.867    1.664      1.12
Non-Microcaps    0.548    1.625      0.34
Panel C: Sorted by IPCA-Predicted Return
Full Sample      0.624    1.186      0.53
Non-Microcaps    0.613    1.130      0.54
Panel D: Sorted by CA2-Predicted Return
Full Sample      0.746    1.565      0.48
Non-Microcaps    0.387    1.478      0.26
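The last column is evidently the risk-adjusted profit per unit of turnover; a quick check in Python:

    # Break-even cost: the proportional cost that fully consumes the
    # FF6 alpha, i.e., alpha / turnover (monthly figures from the table).
    ff6_alpha, turnover = 0.916, 0.976     # NN3-sorted, full sample
    print(round(ff6_alpha / turnover, 2))  # 0.94, matching Panel A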

165 Professor Doron Avramov, IDC, Israel


ML Portfolio Return Spreads: Non-Microcaps + VW
 Assume transaction cost = 0.5% of the long-short portfolio turnover

[Bar chart: raw returns, FF6 alphas, and estimated transaction costs of the GKX, CPZ, IPCA, and CA strategies among non-microcaps with value weighting; bar values range from 0.31 to 1.11.]
166 Professor Doron Avramov, IDC, Israel


An Alternative ML Method
 CPZ: estimate SDF for individual stocks
 Kozak, Nagel, and Santosh (2020): estimate SDF for equity
portfolios, i.e., long-short portfolio return based on predictive
characteristics
 Minimize the Hansen-Jagannathan (1991) distance
 Ridge regression with three-fold cross-validation
 Apply the 94 characteristics in GKX
 In-sample estimation: 1964 to 2004
 Out-of-sample test: 2005 to 2017
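A minimal sketch of the shrinkage step (Python/NumPy), assuming F is a T×H matrix of characteristic-based portfolio returns; the function name is ours:

    import numpy as np

    def sdf_coefficients(F, gamma):
        # Ridge-shrunk SDF coefficients b in M_t = 1 - b'(F_t - mu):
        # b = (Sigma + gamma I)^{-1} mu, which minimizes the HJ distance
        # subject to an L2 penalty; gamma is picked by cross-validation.
        mu = F.mean(axis=0)
        Sigma = np.cov(F, rowvar=False)
        return np.linalg.solve(Sigma + gamma * np.eye(F.shape[1]), mu)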

167 Professor Doron Avramov, IDC, Israel


Characteristics of SDF-Implied MVE Portfolios

                       CAPM      FF6       Sharpe   SDF-Implied MVE Portfolio Weights
                                           Ratio    Mean    10%     25%     Median  75%    90%
Full Sample            3.662***  3.338***  2.318    0.083   -1.994  -0.912  0.341   0.964  1.687
                       (6.01)    (5.90)
Non-Microcaps          1.543***  0.895***  0.977    0.084   -0.592  -0.238  0.072   0.407  0.647
                       (3.88)    (2.87)
Credit Rating Sample   1.418***  0.717*    0.898    -0.006  -0.382  -0.137  -0.003  0.187  0.326
                       (2.97)    (1.93)
Non-Downgrades         1.308***  0.545     0.828    -0.022  -0.370  -0.217  0.004   0.135  0.293
                       (2.92)    (1.59)

• Imposing economic restrictions reduces performance and the odds of extreme positions.
• Deep learning techniques face the usual challenge of cross-sectional return predictability: profits concentrated in difficult-to-arbitrage stocks plus sizable trading costs (high turnover and extreme positions).

168 Professor Doron Avramov, IDC, Israel


Time-Varying Return Predictability: GKX-FF6

 Binding limits to arbitrage → more profitable anomaly-based trading strategies
 High sentiment, high volatility, and low liquidity

[Bar chart: GKX FF6 alphas in low vs. high states of SENT, MKTVOL, VIX, and MKTILLIQ, for the full sample and non-microcaps.]
169 Professor Doron Avramov, IDC, Israel


Time-Varying Return Predictability: Full Sample-FF6

 CPZ: outperforms in high limits-to-arbitrage periods
 IPCA and CA: low time-series variation, mixed evidence

[Bar chart: FF6 alphas of CPZ, IPCA, and CA in low vs. high states of SENT, MKTVOL, VIX, and MKTILLIQ, full sample.]
170 Professor Doron Avramov, IDC, Israel


Time-Varying Return Predictability: GKX and CPZ

 𝐻𝑀𝐿𝑡 = 𝛼0 + 𝛽1 𝐻𝑖𝑔ℎ 𝑆𝐸𝑁𝑇𝑡−1 + 𝛽2 𝐻𝑖𝑔ℎ 𝑀𝐾𝑇𝑉𝑂𝐿𝑡−1 + 𝛽3 𝐻𝑖𝑔ℎ 𝑀𝐾𝑇𝐼𝐿𝐿𝐼𝑄𝑡−1 + 𝛽4 𝑀𝑡−1 + 𝑐′𝐹𝑡 + 𝑒𝑡

                 Sorted by NN3-Predicted Return          Sorted by Risk Loading
                 Model 1   Model 2   Model 3   Model 4   Model 5   Model 6   Model 7   Model 8
Constant         0.016     -0.453    0.865     1.252     0.103     -0.081    0.432     0.528
                 (0.03)    (-0.92)   (1.14)    (1.42)    (0.16)    (-0.13)   (0.32)    (0.36)
High SENT        1.534**   1.710**   0.228     -0.005    1.412**   1.395**   1.161*    1.124*
                 (2.43)    (2.51)    (0.58)    (-0.01)   (2.32)    (2.23)    (1.91)    (1.75)
High MKTVOL      0.791               0.959*              0.787               1.065
                 (1.24)              (1.93)              (1.29)              (1.56)
High VIX                   1.851***            1.647***            1.255**             1.563**
                           (2.85)              (3.28)               (2.09)             (2.28)
High MKTILLIQ    0.754     0.529     0.592     0.695     1.828***  1.918***  1.609**   1.851**
                 (1.24)    (0.78)    (1.37)    (1.41)    (2.86)    (2.89)    (2.20)    (2.31)

Controls         N         N         Y         Y         N         N         Y         Y

• Controls: down market state, term spread, default spread, Fama-French six
factors
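A sketch of how this regression could be run (Python/statsmodels with Newey-West errors; the DataFrame df and its column names are hypothetical):

    import statsmodels.formula.api as smf

    # Monthly HML spread on lagged market-state dummies plus controls
    # (down market, term and default spreads, FF6 factors), HAC errors.
    res = smf.ols(
        "hml ~ high_sent + high_mktvol + high_mktilliq + down_mkt"
        " + term + default + mkt_rf + smb + hml_f + rmw + cma + umd",
        data=df,
    ).fit(cov_type="HAC", cov_kwds={"maxlags": 12})
    print(res.params, res.tvalues)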

171 Professor Doron Avramov, IDC, Israel


Return Predictability in Recent Years: Non-Microcaps

[Bar chart: long-short spreads of GKX, CPZ, IPCA, and CA among non-microcaps after 2001: raw return and alphas relative to CAPM, FFC, FFC+PS, FF5, FF6, and SY.]

• Unlike individual anomalies, most ML signals continue to predict stock returns after 2001.
173 Professor Doron Avramov, IDC, Israel
Return Predictability in Recent Years: FF6

[Bar chart: FF6 alphas of GKX, CPZ, IPCA, and CA for 1987 to 2017 vs. 2001 to 2017, in the full sample and among non-microcaps.]

• Unlike individual anomalies, there is no vast drop in the trading profits of ML signals.
174 Professor Doron Avramov, IDC, Israel
Deep Learning Models as ‘Black Boxes’

• Common features of stocks selected by ML methods
• Decompose into intra-industry vs. inter-industry strategies
175 Professor Doron Avramov, IDC, Israel


Stock Characteristics of ML Portfolios
All ML methods identify stocks in line with most anomaly-based trading strategies.
Long positions: small, value, illiquid, and old stocks with low price, low beta, high 11-month return, low asset growth, low equity issuance, low credit rating coverage, and low analyst coverage.
Exceptions: long stocks with high corporate investment and high idiosyncratic volatility.
The relation is nonlinear, interacting with other firm characteristics and macro conditions.
Advantage: no prior knowledge of the truly useful characteristics and models is required, avoiding the data-snooping problem.
176 Professor Doron Avramov, IDC, Israel
Intra-Industry vs. Inter-Industry Return Predictability

 ML signals identify mispricing in difficult-to-arbitrage stocks.
 Industry adjustment controls for firm fundamentals and might better predict the subsequent corrections.
 Unconditional strategy: buy market winners and sell market losers
 𝑊𝑀𝐿𝑡+1 = (1/𝐻𝑡) Σ𝑖 (𝑅̂𝑖,𝑡 − 𝑅̂𝑚,𝑡) 𝑅𝑖,𝑡+1
 𝐻𝑡 = (1/2) Σ𝑖 |𝑅̂𝑖,𝑡 − 𝑅̂𝑚,𝑡|
 𝑅̂𝑖,𝑡 − 𝑅̂𝑚,𝑡 = (𝑅̂𝑖,𝑡 − 𝑅̂𝑗,𝑡) + (𝑅̂𝑗,𝑡 − 𝑅̂𝑚,𝑡)
 where sums run over the 𝑁𝑡 stocks, 𝑅̂ denotes predicted return, 𝑚 the market, and 𝑗 the industry of stock 𝑖

177 Professor Doron Avramov, IDC, Israel


Intra-Industry vs. Inter-Industry Return Predictability

 𝑊𝑀𝐿𝑡+1 = (1/𝐻𝑡) Σ𝑖 (𝑅̂𝑖,𝑡 − 𝑅̂𝑗,𝑡) 𝑅𝑖,𝑡+1 + (1/𝐻𝑡) Σ𝑗 (𝑅̂𝑗,𝑡 − 𝑅̂𝑚,𝑡) 𝑁𝑗,𝑡 𝑅𝑗,𝑡+1
 where the second sum runs over the 𝐿𝑡 industries and 𝑁𝑗,𝑡 is the number of stocks in industry 𝑗
 Unconditional = Intra-Industry + Inter-Industry (a numerical check follows below)
 Intra-industry strategy: buy industry winners and sell industry losers
 Inter-industry strategy: buy winner industries and sell loser industries
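The identity can be verified numerically; a self-contained sketch with random data, hypothetical industry assignments, and an equal-weighted market average:

    import numpy as np

    rng = np.random.default_rng(0)
    N, L = 500, 10
    pred = rng.normal(size=N)          # predicted returns R-hat_{i,t}
    real = rng.normal(size=N)          # realized returns R_{i,t+1}
    ind = rng.integers(0, L, size=N)   # industry j of each stock i

    m = pred.mean()                                               # R-hat_{m,t}
    j = np.array([pred[ind == ind[i]].mean() for i in range(N)])  # R-hat_{j,t}
    H = 0.5 * np.abs(pred - m).sum()

    wml = ((pred - m) * real).sum() / H
    intra = ((pred - j) * real).sum() / H
    inter = sum((pred[ind == k].mean() - m) * (ind == k).sum()
                * real[ind == k].mean() for k in np.unique(ind)) / H

    print(np.isclose(wml, intra + inter))  # True: unconditional = intra + inter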

178 Professor Doron Avramov, IDC, Israel


Intra-Industry vs. Inter-Industry Strategy: GKX

 Intra-industry strategy accounts for 84% (93%) of the unconditional payoff in raw (risk-adjusted) return across all stocks → stock selection

[Bar chart: intra-industry (WML_INTRA × H_INTRA/H) vs. inter-industry (WML_INTER × H_INTER/H) payoff components, raw return and alphas relative to CAPM, FFC, FFC+PS, FF5, FF6, and SY, for the full sample and non-microcaps.]
179 Professor Doron Avramov, IDC, Israel


Intra-Industry vs. Unconditional Strategy: GKX

2.50

2.00

1.50

1.00

0.50

0.00 CAPM

CAPM
Return

Return
FFC

FF6

FF6
FF5

FFC

FF5
FFC+PS

FFC+PS
SY

SY
Full Sample Non-Microcaps
WML WML_INTRA

• Intra-industry strategy improves performance, especially on non-microcaps.

180 Professor Doron Avramov, IDC, Israel


ML in Asset Management
 Mitigate the downside risk and hedge against crisis
 Remain profitable in recent years
 Profitable in long positions: e.g., GKX signal, non-microcaps + VW
[Bar chart: GKX long and short legs and the high-minus-low (HML) spread among non-microcaps with value weighting, raw return and alphas relative to CAPM, FFC, FFC+PS, FF5, FF6, and SY.]

181 Professor Doron Avramov, IDC, Israel


Conclusion
 Investments based on deep learning signals extract profitability primarily
from difficult-to-arbitrage stocks and during high limits-to-arbitrage
market states.
 Performance further deteriorates due to sizable trading costs.
 Despite their opaque nature, ML methods generate economically
interpretable trading strategies and are mostly informative for stock
selection.
 Beyond economic restrictions, ML signals are profitable in long positions,
remain viable in recent years, and command low downside risk.
 In “Post Fundamentals Price Drift in Capital Markets: A Regression Regularization Perspective,” my coauthors and I apply regularized regressions to “acceleration” in accounting items.
 The resulting investment payoffs survive all of the restrictions studied here.
182 Professor Doron Avramov, IDC, Israel
References

183 Professor Doron Avramov, IDC, Israel


 Avramov, D., T. Chordia, G. Jostova, and A. Philipov, 2013, Anomalies and financial distress. Journal of Financial Economics
108:139–159.
 Avramov, D., S. Cheng, and L. Metzker, 2021, Machine Learning versus Economic Restrictions: Evidence from Stock Return Predictability. Management Science.
 Avramov, D., S. Cheng, L. Metzker, and S. Voigt, 2021, Integrating Factor Models, Working Paper.
 Avramov, D., G. Kaplanski, and A. Subrahmanyam, 2021, Post Fundamentals Price Drift in Capital Markets: A Regression
Regularization Perspective. Management Science.
 Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen, 2014, Inference on treatment effects after selection among high-
dimensional controls, The Review of Economic Studies 81, 608-650.
 Borisenko, Dmitry, 2019, Dissecting Momentum: We Need to Go Deeper. Working Paper.
 Chaudhari, Sneha, Gungor Polatkan, Rohan Ramanath, and Varun Mithal. 2019. “An Attentive Survey of Attention Models.”
ArXiv Preprint ArXiv:1904.02874.
 Charpentier, A., Elie R. and Remlinger C., 2020 Reinforcement Learning in Economics and Finance, arXiv 2003.10014v1
 Chen, L., M. Pelger, and J. Zhu , 2019 , Deep learning in asset pricing. Working Paper.
 Chordia, T., A. Subrahmanyam, and Q. Tong , 2014 , Have capital market anomalies attenuated in the recent era of high liquidity
and trading activity? Journal of Accounting and Economics 58:51–58.
 Cochrane, John H., 2011, Presidential Address: Discount Rates, The Journal of Finance 66, 1047-1108.
 Cong L. W., Tang K., Wang J., and Zhang Y, 2020, Deep Sequence Modeling – Development and applications in asset pricing.
 Cong L. W., Tang K., Wang J., and Zhang Y, 2021, AlphaPortfolio, Direct construction through deep reinforcement learning and
interpretable AI, working paper
 Feng, G., S. Giglio, and D. Xiu, 2020, Taming the factor zoo: A test of new factors. Journal of Finance 75:1327-1370.
184 Professor Doron Avramov, IDC, Israel
 Frank, I.E., and J.H. Friedman, 1993, A Statistical View of Some Chemometrics Regression Tools. Technometrics 35:109-135.
 Freyberger, J., A. Neuhierl, and M. Weber , 2018 , Dissecting characteristics nonparametrically. Working
Paper.
 Fu, Wenjiang J., 1998, Penalized regressions: the bridge versus the lasso, Journal of computational and
graphical statistics 7, 397-416.
 Gu, S., Kelly, B., Xiu, D., 2019, Autoencoder Asset Pricing Models, Forthcoming in Journal of
Econometrics.
 Gu, S., B. Kelly, and D. Xiu, 2020, Empirical asset pricing via machine learning. Review of Financial Studies 33:2223-2273.
 Harvey, C. R., Y. Liu, and H. Zhu , 2016, ...and the cross-section of expected returns. Review of
Financial Studies 29:5–68.
 Hoerl, Arthur E., and Robert W. Kennard, 1970, Ridge regression: Biased estimation for nonorthogonal
problems, Technometrics 12, 55-67.
 Hoerl, Arthur E., and Robert W. Kennard, 1970, Ridge regression: applications to nonorthogonal
problems, Technometrics 12, 69-82.
 Hou, K., C. Xue, and L. Zhang, 2020, Replicating anomalies. Review of Financial Studies 33:2019-2133.
185 Professor Doron Avramov, IDC, Israel
 Huang, J., J. L. Horowitz, and F. Wei, 2010, Variable Selection in Nonparametric Additive Models,
Annals of statistics 38, 2282-2313.
 Kelly, B., S. Pruitt, and Y. Su, 2019, Characteristics are covariances: A unified model of risk and return. Journal of Financial Economics 134:501-524.
 Kelly, B., Pruitt S., and Su Y., 2017, Instrumental principal component analysis, working papers,
Chicago Booth and ASU WP Carey.
 Kozak, S., S. Nagel, and S. Santosh, 2020, Shrinking the cross-section. Journal of Financial Economics 135:271-292.
 Lettau, M., and M. Pelger. 2018a. Estimating latent asset-pricing factors. Forthcoming in Journal
of Econometrics.
 Lettau, M., and M. Pelger. 2018b. Factors that fit the time series and cross-section of stock returns.
Working Paper
186 Professor Doron Avramov, IDC, Israel


 McLean, R., and J. Pontiff 2016, Does academic research destroy stock return predictability? Journal of
Finance 71:5–31.
 Nair, Vinod, and Geoffrey E Hinton, 2010, Rectified linear units improve restricted boltzmann machines,
in Proceedings of the 27th international conference on machine learning (ICML-10), 807–814
 Pástor, Ľuboš, and Robert F. Stambaugh, 2000, Comparing asset pricing models: An investment
perspective. Journal of Financial Economics 56 (3): 335-81.
 Pástor, Ľuboš. 2000. Portfolio selection and asset pricing models. The Journal of Finance 55 (1): 179-
223
 Stambaugh, R. F., J. Yu, and Y. Yuan , 2012 , The short of it: Investor sentiment and anomalies. Journal
of Financial Economics 104:288–302.
 Wang, Jingyuan, Yang Zhang, Ke Tang, Junjie Wu, and Zhang Xiong, 2019, Alphastock: A buying-
winners-and-selling-losers investment strategy using interpretable deep reinforcement attention
networks, in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining pp. 1900-1908.
 Zou, Hui, and Trevor Hastie, 2005, Regularization and variable selection via the elastic net, Journal of
the Royal Statistical Society: Series B (Statistical Methodology) 67, 301-320.

187 Professor Doron Avramov, IDC, Israel