
CFA® Level II
Quantitative Methods

Opal Xu
CONTENTS

1. Quantitative Methods (1)
• R4 Introduction to Linear Regression
• R5 Multiple Regression
• R6 Time-Series Analysis

2. Quantitative Methods (2)


• R7 Machine Learning
• R8 Big Data Projects
• R9 Excerpt from “Probabilistic Approaches: Scenario Analysis, Decision Trees, and Simulations”

Level I vs. Level II

➢ Level I mainly covers descriptive statistics and the estimation and inference parts of inferential statistics. Level II focuses on regression, the prediction part of inferential statistics.

➢ Level II makes heavy use of the hypothesis testing material from Level I, so it is worth reviewing that material before starting the Level II content.

➢ Course characteristics and study advice:
⚫ Quantitative content, tested in a qualitative style;

⚫ The logic builds cumulatively, so make sure you understand each topic before moving on;

⚫ Combine lectures with practice problems, but simply grinding through question banks is not recommended.

R4 Introduction to Linear Regression
What are we going to learn?

The candidate should be able to:

 Explain the assumptions underlying linear regression and interpret regression coefficients;

 Calculate and interpret the standard error of estimate, the coefficient of determination, and a confidence interval;

 Formulate a null and alternative hypothesis about a population value of a regression coefficient and determine the appropriate test statistic and whether the null hypothesis is rejected at a given level of significance;

What are we going to learn?

The candidate should be able to:

 Calculate the predicted value for the dependent variable, given an estimated regression model and a value for the independent variable;

 Calculate and interpret a confidence interval for the predicted value of the dependent variable;

 Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the F-statistic;

 Describe limitations of regression analysis.

The Basics of Simple Linear Regression

➢ Linear regression with one independent variable, sometimes called simple linear regression, models the relationship between two variables as a straight line.

➢ Linear regression allows you to
⚫ summarize the relationship between two variables, if that relationship is linear;
⚫ use one variable to make predictions about the other.

Scatter Plots

➢ Recall that in CFA Level I, the correlation coefficient is used as a measure of linear association.

➢ Interpretations of Correlation Coefficients

[Scatter plots of Variable B against Variable A illustrating r = +1, 0 < r < 1, r = 0, −1 < r < 0, and r = −1.]

Sample Covariance and Correlation

➢ Covariance:
⚫ Covariance measures how one random variable moves with another random variable: one measure of linear association.
⚫ Sample covariance:

$$\mathrm{Cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{n-1}$$

➢ Correlation:
⚫ The correlation coefficient measures the direction and extent of linear association between two variables:

$$r = \frac{\mathrm{Cov}(X,Y)}{s_X s_Y}$$

⚫ Correlation has no units and ranges from −1 to +1
⚫ 0 < r < 1: positive linear association
⚫ −1 < r < 0: negative linear association

Simple linear regression model
➢ The simple linear regression model

$$Y_i = b_0 + b_1 X_i + \varepsilon_i, \quad i = 1, \dots, n$$

➢ Linear regression assumes a linear relation between the dependent and the independent
variables.
⚫ The dependent variable, Y is the variable whose variation about its mean is to be explained by
the regression.
⚫ The independent variable, X is the variable used to explain the dependent variable in a
regression.
⚫ Regression coefficients: b0 is the intercept term of the regression and b1 is the slope coefficient.
⚫ The error term, εi is the portion of the dependent variable that is not explained by the
independent variable(s) in the regression.
Calculation of Regression Coefficients
➢ How does linear regression estimate b0 and b1?
⚫ Computes a line that best fits the observations
⚫ Minimize the sum of the squared vertical distances between the observations and the
regression line
⚫ The estimated intercept coefficient ($\hat b_0$) is interpreted as the value of Y when X is equal to zero.
⚫ The estimated slope coefficient ($\hat b_1$) measures the sensitivity of Y to a change in X; it equals the covariance of X and Y divided by the variance of X.
➢ Example of interpreting estimated coefficients
⚫ An estimated slope coefficient of 2 indicates that the dependent variable changes by two units for every 1-unit change in the independent variable.
⚫ An intercept term of 2% means that when the independent variable is zero, the dependent variable equals 2%.

Calculation of Regression Coefficients

➢ Ordinary least squares (OLS)


⚫ OLS estimation is a process that estimates the population parameters Bi with corresponding estimates bi that minimize the sum of the squared residuals (i.e., error terms).

⚫ The OLS sample coefficients are those that satisfy:

$$\hat b_1 = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n}(X_i - \bar X)^2}$$

⚫ The estimated intercept coefficient ($\hat b_0$): because the point $(\bar X, \bar Y)$ lies on the regression line, we can solve $\hat b_0 = \bar Y - \hat b_1 \bar X$.
Calculation of Regression Coefficients
Case 1 Example: Calculate A Regression Coefficient

• Bouvier Co. is a Canadian company that sells forestry products to several Pacific Rim customers. Bouvier's sales are very sensitive to exchange rates. The following table shows recent annual sales (in millions of Canadian dollars) and the average exchange rate for the year (expressed as the units of foreign currency needed to buy one Canadian dollar).

Calculation of Regression Coefficients
Case 1 Example: Calculate A Regression Coefficient

Year i Xi = Exchange Rate Yi = Sales


1 0.40 20
2 0.36 25
3 0.42 16
4 0.31 30
5 0.33 35
6 0.34 30

• Calculate the intercept and slope coefficient for an estimated linear regression with the exchange rate as the independent variable and sales as the dependent variable.

Calculation of Regression Coefficients
Case 1 Example: Calculate a Regression Coefficient
• The following table provides several useful calculations:

Year i  Xi = Exchange Rate  Yi = Sales  (Xi − X̄)²  (Yi − Ȳ)²  (Xi − X̄)(Yi − Ȳ)
1 0.4 20 0.0016 36 -0.24
2 0.36 25 0 1 0
3 0.42 16 0.0036 100 -0.6
4 0.31 30 0.0025 16 -0.2
5 0.33 35 0.0009 81 -0.27
6 0.34 30 0.0004 16 -0.08
Sum 2.16 156 0.009 250 -1.39

Calculation of Regression Coefficients
Case 1 Example: Calculate A Regression Coefficient

• The sample mean of the exchange rate is:

$$\bar X = \sum_{i=1}^{n} X_i / n = 2.16/6 = 0.36$$

• The sample mean of sales is:

$$\bar Y = \sum_{i=1}^{n} Y_i / n = 156/6 = 26$$

• We want to estimate a regression equation of the form Yi = b0 + b1Xi + εi. The estimates of the slope coefficient and the intercept are:

$$\hat b_1 = \frac{\sum_{i=1}^{n}(Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^{n}(X_i - \bar X)^2} = \frac{-1.39}{0.009} = -154.44, \text{ and}$$

$$\hat b_0 = \bar Y - \hat b_1 \bar X = 26 - (-154.444)(0.36) = 26 + 55.6 = 81.6$$

• So the regression equation is Ŷi = 81.6 − 154.444Xi
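As a quick numerical check, here is a minimal Python sketch (not part of the original slides) that reproduces these OLS estimates from the raw data:

```python
import numpy as np

# Bouvier Co. data from the example
x = np.array([0.40, 0.36, 0.42, 0.31, 0.33, 0.34])   # exchange rate
y = np.array([20, 25, 16, 30, 35, 30], dtype=float)  # sales

# OLS slope = Cov(X, Y) / Var(X); the intercept forces the line through (x_bar, y_bar)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)  # approximately -154.44 and 81.6, matching the slide
```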
The Assumptions of the Linear Regression

➢ The assumptions
⚫ A linear relationship exists between X and Y
⚫ The independent variable, X, is not random (the weaker condition that X is uncorrelated with the error term can substitute for the condition that X is not random).
⚫ The expected value of the error term is zero (i.e., E(εi)=0 )
⚫ The variance of the error term is constant (i.e., the error terms are
homoskedastic)
⚫ The error term is uncorrelated across observations (i.e., E(εiεj)=0 for all
i≠j)
⚫ The error term is normally distributed.
Analysis of Variance (ANOVA Table)

➢ Elements of the ANOVA Table

⚫ The sum of squared errors or residuals (SSE)

⚫ The regression sum of squares (RSS)

⚫ The total sum of squares (SST)

[Figure: for each observation, the total deviation Yi − Ȳ decomposes into the residual Yi − Ŷi plus the explained deviation Ŷi − Ȳ around the fitted regression line.]
Analysis of Variance (ANOVA Table)

➢ ANOVA table

➢ Standard error of estimate:

$$SEE = \sqrt{\frac{SSE}{n-2}} = \sqrt{MSE}$$

➢ Coefficient of determination (R²):

$$R^2 = \frac{RSS}{SST} = 1 - \frac{SSE}{SST} = \frac{\text{explained variation}}{\text{total variation}} = 1 - \frac{\text{unexplained variation}}{\text{total variation}}$$
Standard Error of Estimate

➢ Standard Error of Estimate (SEE) gives some indication of how certain we can
be about a particular prediction of Y using the regression equation.
⚫ SEE is low if the regression is very strong and high if the relationship is weak.

➢ The formula of SEE

⚫ The SEE formula looks like the formula for computing a standard deviation, except that n − 2 appears in the denominator instead of n − 1.
⚫ In fact, the SEE is the standard deviation of the error term, because the degrees of freedom of the error are n − 2.
Coefficient of Determination (R2)
➢ Coefficient of determination (R2) measures the fraction of the total variation in the
dependent variable that is explained by the independent variable.
⚫ Its limits are 0≤R2≤1;
⚫ Example: R2 of 0.8250 means the independent variable explains approximately
82.5 percent of the variation in the dependent variable.

➢ For simple linear regression, R² is equal to the squared correlation coefficient.

➢ For reporting purposes, regression programs also report multiple R, which is the correlation between the actual and forecast values of Y; the coefficient of determination is the square of multiple R.

Coefficient of Determination (R2)

➢ The differences between R², multiple R, and the correlation coefficient

⚫ The correlation coefficient indicates the sign of the relationship between two variables, whereas the coefficient of determination does not.
⚫ Multiple R is the correlation between the actual values and the forecast values of Y.
✓ It is the square root of R² and is always positive.
✓ It equals the correlation between the dependent and independent variable only in the case of a simple linear regression with a positive slope coefficient.
✓ It can also apply to multiple regression.
⚫ The coefficient of determination can apply to an equation with several independent variables, and it measures explanatory power, while the correlation coefficient only applies to two variables and does not imply explanatory power between the variables.
Regression coefficient confidence interval

➢Regression coefficient confidence interval


$$\hat b_1 \pm t_c \, s_{\hat b_1}$$

⚫ If the confidence interval at a given degree of confidence does not include the hypothesized value, the null is rejected, and the coefficient is said to be statistically different from the hypothesized value.

⚫ $s_{\hat b_1}$ is the standard error of the estimated coefficient.

⚫ Stronger regression results (usually lower SEE or higher R²) lead to a smaller standard error of the estimated coefficient $s_{\hat b_1}$ and tighter confidence intervals.

Hypothesis Testing

➢Hypothesis testing about regression coefficient


⚫ H0: b1 = hypothesized value of b1
⚫ Test statistic:

$$t = \frac{\hat b_1 - \text{hypothesized value of } b_1}{s_{\hat b_1}}, \quad df = n - 2$$

⚫ Decision rule: reject H0 if |t| > t critical

⚫ Rejection of the null means that the slope coefficient is significantly different from the hypothesized value of b1.

Hypothesis Testing

➢Significance test for regression coefficient


⚫ H0: b1 = 0
⚫ Test statistic:

$$t = \frac{\hat b_1 - 0}{s_{\hat b_1}} = \frac{\hat b_1}{s_{\hat b_1}}, \quad df = n - 2$$

⚫ Decision rule: reject H0 if |t| > t critical

⚫ Rejection of the null means that the slope coefficient is significantly different from zero.

Hypothesis Testing
Case 2 Hypothesis Testing
• An analyst ran a regression and got the following result:
Coefficient t-statistic p-value

Intercept -0.5 -0.91 0.18

Slope 2 12.00 <0.001

ANOVA Table df SS MSS


Regression 1 8000 ?
Error ? 2000 ?
Total 51 ? -
Hypothesis Testing
Case 2 Hypothesis Testing
• Fill in the blanks of the ANOVA Table.
• What is the standard error of estimate?
• What is the result of the slope coefficient significance test?
• What is the result of the sample correlation?
• What is the 95% confidence interval of the slope coefficient?
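The blanks follow directly from the ANOVA identities. Here is a sketch of the arithmetic in Python (the critical t of 2.009 for 50 degrees of freedom is an assumed table value):

```python
import math

# Given: regression df = 1, total df = 51, RSS = 8000, SSE = 2000
n = 52                    # total df = n - 1 = 51
df_err = n - 2            # error df = 50 for simple linear regression
sst = 8000 + 2000         # SST = RSS + SSE = 10000
msr = 8000 / 1            # mean square regression = 8000
mse = 2000 / df_err       # mean square error = 40
see = math.sqrt(mse)      # standard error of estimate, about 6.32
r2 = 8000 / sst           # R-squared = 0.80
r = math.sqrt(r2)         # sample correlation, about +0.894 (slope is positive)
f_stat = msr / mse        # F = 200: the slope is clearly significant

# 95% CI for the slope: standard error = coefficient / t-statistic
s_b1 = 2 / 12.00          # about 0.167
t_c = 2.009               # assumed critical t, df = 50, 5% two-tailed
ci = (2 - t_c * s_b1, 2 + t_c * s_b1)   # about (1.67, 2.33)
print(sst, msr, mse, see, r2, r, f_stat, ci)
```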
Predicted Value of the Dependent Variable

➢Two sources of uncertainty when using the regression model and the
estimated parameters to make a prediction.
⚫ The error term itself contains uncertainty

⚫ Uncertainty in the estimated parameters

➢ Point estimate:

$$\hat Y = \hat b_0 + \hat b_1 X$$

Predicted Value of the Dependent Variable

➢Confidence interval estimate


$$\hat Y \pm t_c \, s_f$$

where $t_c$ = the critical t-value with df = n − 2, and $s_f$ = the standard error of the forecast:

$$s_f = SEE\sqrt{1 + \frac{1}{n} + \frac{(X - \bar X)^2}{(n-1)s_X^2}} = SEE\sqrt{1 + \frac{1}{n} + \frac{(X - \bar X)^2}{\sum (X_i - \bar X)^2}}$$
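A minimal sketch of this forecast interval, reusing the Bouvier example data (the forecast point X = 0.38 is an illustrative assumption):

```python
import numpy as np
from scipy import stats

x = np.array([0.40, 0.36, 0.42, 0.31, 0.33, 0.34])
y = np.array([20, 25, 16, 30, 35, 30], dtype=float)
n = len(x)

# Fit the regression and compute SEE
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
see = np.sqrt(np.sum(resid ** 2) / (n - 2))

x_new = 0.38                                   # hypothetical forecast point
s_f = see * np.sqrt(1 + 1 / n
                    + (x_new - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
y_hat = b0 + b1 * x_new
t_c = stats.t.ppf(0.975, df=n - 2)             # 95% two-tailed critical value
print(y_hat - t_c * s_f, y_hat + t_c * s_f)    # confidence interval for the prediction
```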

Limitations of Regression Analysis
➢ Regression relations can change over time, just as correlations can.
⚫ Parameter instability: the problem or issue of population regression parameters
that have changed over time.

➢ Public knowledge of regression relationships may negate their future usefulness.
⚫ For example, an analyst discovers that stocks with a certain characteristic have had historically very high returns. If other analysts discover and act upon this relationship, then the prices of stocks with that characteristic will be bid up and the relation will no longer hold in the future.

➢ If the regression assumptions are violated, hypothesis tests and predictions based on linear regression will not be valid.
Summary

➢ Importance: ☆☆☆
➢ Content:
⚫ Underlying assumptions of linear regression;

⚫ Prediction of the dependent variable;

⚫ Interpretation of hypothesis testing for regression coefficients;

⚫ ANOVA;

⚫ SEE, R², and F-statistic

➢ Exam tips:
⚫ Underlying assumptions; hypothesis testing of regression coefficients.

⚫ Given an ANOVA table, compute a missing cell; calculation and interpretation of R² (both computation and concept questions are possible).

R5 Multiple Regression

What are we going to learn?

The candidate should be able to:

 Formulate a multiple regression equation to describe the relation between a dependent variable and several independent variables and determine the statistical significance of each independent variable;

 Interpret estimated regression coefficients and their p-values;

 Interpret the results of hypothesis tests of regression coefficients;

 Calculate and interpret 1) a confidence interval for the population value of a regression coefficient and 2) a predicted value for the dependent variable, given an estimated regression model and assumed values for the independent variables;

What are we going to learn?

The candidate should be able to:

 Calculate and interpret the F-statistic, and describe how it is used in regression analysis;

 Distinguish between and interpret the R² and adjusted R² in multiple regression;

 Evaluate how well a regression model explains the dependent variable by analyzing the output of the regression equation and an ANOVA table;

 Formulate a multiple regression equation by using dummy variables to represent qualitative factors and interpret the coefficients and regression results;
The Basics of Multiple Regression

➢ The multiple linear regression model

$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki} + \varepsilon_i$$

⚫ Xj = the jth independent variable, j = 1, 2, …, k
⚫ b0 = the intercept of the equation
⚫ b1, …, bk = the slope coefficients for each of the independent variables

➢ Predicted value of the dependent variable

$$\hat Y = \hat b_0 + \hat b_1 X_1 + \hat b_2 X_2 + \cdots + \hat b_k X_k$$

⚫ Multiple linear regression allows us to determine the effect of more than one independent variable on a particular dependent variable.
Interpreting The Multiple Regression Results

➢ The slope coefficients in a multiple regression are known as partial regression coefficients.
⚫ A partial regression coefficient measures the expected change in the dependent variable for a 1-unit increase in an independent variable, holding all the other independent variables constant.

➢ The intercept coefficient is interpreted as the value of Y when X1, X2, …, Xk are all equal to zero.

Multiple Regression Assumptions

➢The assumptions of the multiple linear regression


⚫ The relationship between the dependent variable, Y, and the independent
variables, X1, X2, ..., Xk, is linear
⚫ The independent variables are not random. And no exact linear relation exists
between two or more of the independent variables
⚫ The expected value of the error term, conditioned on the independent variables,
is zero: E(εi| X1, X2, ..., Xk)=0
⚫ The variance of the error term is the same for all observations (The error terms
are homoskedastic)
⚫ The error term is uncorrelated across observations (E(εiεj)=0 for all i≠j)
⚫ The error term is normally distributed.
Hypothesis Testing about Regression Coefficient

➢ Significance test for a regression coefficient


⚫ H0: bj = 0
⚫ Test statistic:

$$t = \frac{\hat b_j}{s_{\hat b_j}}, \quad df = n - k - 1$$

➢ P-value: the smallest significance level for which the null hypothesis can be rejected
⚫ Reject H0 if p-value < α
⚫ Fail to reject H0 if p-value > α

➢ Regression coefficient confidence interval

$$\hat b_j \pm t_c \, s_{\hat b_j}$$
Regression Coefficient F-test
➢ How to test the regression’s overall significance?

⚫ If none of the independent variables in a regression model help explain the dependent
variable, the slope coefficient should all equal 0.

⚫ In a multiple regression, however, we cannot test the null hypothesis that all slope
coefficients equal 0 based on t-test, because the individual tests do not account for
the effects of interactions among the independent variables.

➢ To test the null hypothesis that all of the slope coefficients in the
multiple regression model are jointly equal to 0,we must use an F-test.
⚫ The F-statistic measures how well the regression equation explains the variation in the
dependent variable.
Regression Coefficient F-test
➢ Define hypothesis:
⚫ H0: b1= b2= b3= … = bk=0

⚫ Ha: at least one bj≠0 (for j = 1, 2, …, k)

➢ F-statistic:
$$F = \frac{MSR}{MSE} = \frac{RSS/k}{SSE/(n-k-1)}$$
⚫ k, n-k-1 are the degrees of freedom for an F-test

⚫ In simple liner regression model, F-test duplicates the t-test for the significance of the
slope coefficient

Regression Coefficient F-test
➢ Decision rule
⚫ Reject H0 : if F-statistic > Fα (k, n-k-1)

⚫ Specifically, the F-statistic for testing the null hypothesis (that all the slope coefficients are equal to 0) has a value of 0 when the independent variables do not explain the dependent variable at all.

➢ Note that we use a “one-tailed” F-test.

➢ The test assesses the effectiveness of the model as a whole in explaining the
dependent variable.
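A short sketch of the overall F-test with illustrative ANOVA inputs (RSS = 8000, SSE = 2000, n = 52, k = 4 are assumed numbers, not from the curriculum):

```python
from scipy import stats

rss, sse = 8000.0, 2000.0                         # assumed sums of squares
n, k = 52, 4                                      # assumed sample size, slope count

f_stat = (rss / k) / (sse / (n - k - 1))          # F = MSR / MSE
f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)  # one-tailed 5% critical value
p_val = stats.f.sf(f_stat, dfn=k, dfd=n - k - 1)
print(f_stat, f_crit, p_val)                      # reject H0 if f_stat > f_crit
```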

Analysis of Variance (ANOVA)
➢ ANOVA Table

➢ Standard error of estimate:

$$SEE = \sqrt{\frac{SSE}{n-k-1}} = \sqrt{MSE}$$

➢ Coefficient of determination (R²)
⚫ Is R² still reliable?

$$R^2 = \frac{RSS}{SST} = 1 - \frac{SSE}{SST}$$
Coefficient of Determination ( R2 )

➢ R2
⚫ The percentage of variation in the dependent variable that is collectively explained by all of the independent variables.

➢ Adjusted R2
⚫ In a multiple linear regression, R2 by itself is less appropriate as a
measure of whether a regression model fits the data well (goodness
of fit).
✓ We can increase R2 simply by including many additional independent variables
that explain even a slight amount of the previously unexplained variation, even if
the amount they explain is not statistically significant.
Adjusted R2

➢ Properties of adjusted R²

⚫ It is adjusted for degrees of freedom:

$$\bar R^2 = 1 - \left(\frac{n-1}{n-k-1}\right)(1 - R^2)$$

⚫ If k ≥ 1, R² is strictly greater than adjusted R²

⚫ Adjusted R² may be less than zero

⚫ A high adjusted R² does not necessarily mean the correct choice of variables.

Multiple Regression Assumption Violations
➢ Heteroskedasticity

➢ Serial correlation (autocorrelation)

➢ Multicollinearity

Multiple Regression Assumption Violations
➢ Heteroskedasticity

⚫ Heteroskedasticity refers to the situation that the variance of the error


term is not constant. (i.e., the error terms are not homoskedastic)

⚫ Unconditional heteroskedasticity occurs when heteroskedasticity of the


error variance is not correlated with the independent variables in the
multiple regression. It creates no major problems for statistical inference.

⚫ Conditional heteroskedasticity is heteroskedasticity in the error variance


that is correlated with (conditional on) the values of the independent
variables in the regression.
✓ Conditional heteroskedasticity causes the most problems for statistical inference.
Multiple Regression Assumption Violations
➢ Effect of Heteroskedasticity on Regression Analysis
⚫ The coefficient estimates ($\hat b_j$) are not affected

⚫ It does not affect the consistency of regression parameter estimators

✓ Consistency: the larger the sample size, the lower the probability of error in the estimates.

⚫ Heteroskedasticity introduces bias into estimators of the standard error of


regression coefficients.

✓ t-tests for the significance of individual regression coefficients are unreliable.

✓ The F-test for the overall significance of the regression is unreliable.

Multiple Regression Assumption Violations
➢ Detecting heteroskedasticity
⚫ Two methods to detect heteroskedasticity:

✓ Residual scatter plots (residuals plotted against the independent variable)

✓ The Breusch-Pagan χ² test

➢ H0: no heteroskedasticity; a one-tailed test

➢ Chi-square test statistic: BP = n × R²residual, with df = k

➢ Tip: regress the squared residuals on the independent variables, X; R²residual is the coefficient of determination of that regression.

➢ Decision rule: under H0 the BP statistic should be small; reject H0 if it exceeds the critical value from the χ² distribution table (see the sketch below)
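A minimal sketch of the Breusch-Pagan test on simulated data (one regressor, so df = 1; all names and numbers here are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
# Simulated conditional heteroskedasticity: error variance grows with |x|
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 + np.abs(x), size=n)

# Step 1: fit the original regression and square the residuals
b1, b0 = np.polyfit(x, y, 1)
e2 = (y - (b0 + b1 * x)) ** 2

# Step 2: regress the squared residuals on x and compute that regression's R-squared
c1, c0 = np.polyfit(x, e2, 1)
fitted = c0 + c1 * x
r2 = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)

bp = n * r2                            # BP = n times R-squared of the residual regression
crit = stats.chi2.ppf(0.95, df=1)      # one regressor, so df = 1
print(bp, crit, bp > crit)             # True means reject H0 of homoskedasticity
```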

Multiple Regression Assumption Violations
➢ Correcting heteroskedasticity
⚫ Computing robust standard errors (a.k.a. White-corrected standard errors) to correct the standard errors of the estimated coefficients

⚫ Generalized least squares: modify the regression equation to eliminate the heteroskedasticity

Multiple Regression Assumption Violations
➢ Serial correlation (autocorrelation)
⚫ Regression errors are correlated across observations.

⚫ Serial correlation is often found in time series data


✓ Positive serial correlation is serial correlation in which a positive error for one observation increases the chance of a positive error for another observation; in other words, Cov(εi, εi+1) > 0.

✓ Negative serial correlation is serial correlation in which a positive error for one observation increases the chance of a negative error for another observation; in other words, Cov(εi, εi+1) < 0.

Multiple Regression Assumption Violations
➢ Effect of serial correlation on regression analysis
⚫ Positive serial correlation → Type I error & F-test unreliable
✓ Not affect the consistency of estimated regression coefficients.

✓ F-statistic to test for overall significance of the regression may be inflated because
the mean squared error will tend to underestimate the population error variance.

✓ Standard errors for the regression coefficient are artificially small → the estimated
t-statistics to be overestimated →the prob. of type I error increased.

⚫ Negative serial correlation → Type II error & F-test, t-test unreliable


✓ Standard errors for the regression coefficient are artificially big → the estimated t-
statistics to be underestimated →the prob. of type II error increased.
Multiple Regression Assumption Violations
➢ Detecting serial correlation
⚫ Durbin-Watson test
✓ H0: no serial correlation

$$DW = \frac{\sum_{t=2}^{T}(\hat\varepsilon_t - \hat\varepsilon_{t-1})^2}{\sum_{t=1}^{T}\hat\varepsilon_t^2} \approx 2(1-r)$$

✓ Decision rule:

0 to dL: reject H0, conclude positive serial correlation
dL to dU: inconclusive
dU to 4−dU: do not reject H0
4−dU to 4−dL: inconclusive
4−dL to 4: reject H0, conclude negative serial correlation
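The statistic itself is one line of arithmetic on the residuals; a tiny sketch with simulated positively autocorrelated residuals:

```python
import numpy as np

rng = np.random.default_rng(7)
# AR(1)-style residuals with positive serial correlation, for illustration
e = np.zeros(100)
for t in range(1, 100):
    e[t] = 0.6 * e[t - 1] + rng.normal()

dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(dw)  # well below 2, signalling positive serial correlation
```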

Multiple Regression Assumption Violations
➢Detecting positive serial correlation
⚫ Durbin-Watson test
✓ H0: No positive serial correlation

✓ DW ≈ 2×(1−r)

✓ Decision rule:

0 to dL: reject H0, conclude positive serial correlation
dL to dU: inconclusive
above dU: fail to reject the null hypothesis of no positive serial correlation

Multiple Regression Assumption Violations
➢ Methods to correct serial correlation
⚫ Adjust the coefficient standard errors for the linear regression parameter
estimates to account for the serial correlation (Recommended)
✓ Hansen method to adjust the coefficient standard errors

✓ The Hansen method simultaneously corrects for conditional heteroskedasticity


➢ However, when heteroskedasticity is the only problem, computing robust standard errors is
still the better method.

⚫ Modify the regression equation itself to eliminate the serial correlation.


✓ May result in inconsistent parameter estimates unless implemented with extreme care.

Multiple Regression Assumption Violations
➢ Multicollinearity
⚫ Multicollinearity occurs when two or more independent variables (or
combinations of independent variables) are highly correlated with
each other.

⚫ In practice, multicollinearity is often a matter of degree rather than


of absence or presence, because approximate linear relationship
among financial variables are common.

Multiple Regression Assumption Violations
➢ Effect of multicollinearity on regression analysis
⚫ Not affect the consistency of coefficient estimates

⚫ The estimates become extremely imprecise and unreliable,


practically impossible to distinguish the individual impacts of the
independent variables on the dependent variables

⚫ Introduces bias into estimators of the standard error of regression


coefficients.
✓ Inflated standard errors for the regression coefficients

✓ t-tests for the significance of individual regression coefficients are unreliable.


Multiple Regression Assumption Violations
➢ Two methods to detect multicollinearity
⚫ Occasionally suggested method: Using the magnitude of pairwise correlations among
the independent variables to assess multicollinearity.
✓ High pairwise correlations among the independent variables can usually indicate multicollinearity.

⚫ Classic method
✓ t-statistics indicate that none of the individual coefficients is significantly different from zero
✓ A significant F-statistic
✓ A high R²

➢Methods to correct multicollinearity


⚫ Excluding one or more of the regression variables.

Summary of Assumption Violations

Model Misspecification
➢ Misspecified functional form.
⚫ One or more important variables could be omitted from regression.
✓ e.g. If the true regression model is Yi = b0 + b1X1i + b2X2i + εi, but we estimate the model Yi = a0 + a1X1i + εi
⚫ One or more of the regression variables may need to be transformed before
estimating the regression.
✓ e.g. Regress the natural logarithm of the variable
⚫ The regression model pools data from different samples that should not be pooled.
✓ e.g. Represent the relationship between two financial variables at two different
time periods

Model Misspecification
➢ Time-Series Misspecification (Regressors that are correlated
with the error term)
⚫ Including lagged dependent variables as independent variables in
regressions with serially correlated errors.

⚫ Including a function of a dependent variable as an independent variable,


sometimes as a result of the incorrect dating of variables.

⚫ Independent variables are measured with error.

Model Misspecification
➢ Other Types of Time-Series Misspecification (Nonstationarity)
⚫ Relations among time series with trends (for example, the relation
between consumption and GDP).

⚫ Relations among time series that may be random walks (time series for
which the best predictor of next period’s value is this period’s value).
Exchange rates are often random walks.

Qualitative Independent Variables
➢Dummy variables
⚫ To use qualitative variables as independent variables in a regression

⚫ Takes on a value of 1 if a particular condition is true and 0 if that


condition is false

⚫ If we want to distinguish among n categories, we need n − 1 dummy


variables

Qualitative Independent Variables
➢Illustrates the use of dummy variables
⚫ Returnst = b0 + b1Jant + b2Febt + ··· + b11Novt + εt
✓ Returnst = a monthly observation of returns
✓ Jant = 1 if period t is in January, Jant = 0 otherwise
✓ Febt = 1 if period t is in February, Febt = 0 otherwise
✓ ···
✓ Novt = 1 if period t is in November, Novt = 0 otherwise
⚫ The intercept, b0, measures the average return for stocks in December because
there is no dummy variable for December.

⚫ Each of the estimated coefficients for the dummy variables shows the estimated
difference between returns in that month and returns for December.
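A small sketch of how such month dummies can be built and estimated (the returns are simulated and the column names m1..m11 are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "month": np.tile(np.arange(1, 13), 5),   # five years of monthly data
    "ret": rng.normal(0.01, 0.05, 60),       # simulated monthly returns
})
# One dummy per month January..November; December is the omitted base category
for m in range(1, 12):
    df[f"m{m}"] = (df["month"] == m).astype(int)

X = np.column_stack([np.ones(len(df))] + [df[f"m{m}"] for m in range(1, 12)])
beta, *_ = np.linalg.lstsq(X, df["ret"].to_numpy(), rcond=None)
print(beta[0])   # intercept: the average December return
print(beta[1])   # January coefficient: January return minus December return
```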

Qualitative Independent Variables
➢ Qualitative dependent variables are dummy variables used as dependent variables
instead of as independent variables.
⚫ Probit and logit models estimate the probability of a discrete outcome given the values
of the independent variables used to explain that outcome.
✓ The probit model, which is based on the normal distribution, estimates the probability that Y = 1 (a
condition is fulfilled) given the value of the independent variable X.
✓ The logit model is identical, except that it is based on the logistic distribution rather than the normal
distribution.
✓ Both models must be estimated using maximum likelihood methods.

⚫ Discriminant models yields a linear function, similar to a regression equation, which can
then be used to create an overall score, or ranking, for an observation. Based on the
score, an observation can be classified into the bankrupt or not bankrupt category.

Credit Analysis
➢ Z – score
Z = 1.2 A + 1.4 B + 3.3 C + 0.6 D + 1.0 E
Where:
A = WC / TA
B = RE / TA
C = EBIT / TA
D = MV of Equity / BV of Debt
E = Revenue / TA
⚫ If Z<1.81 → Bankruptcy.
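The score is a direct weighted sum; a sketch with made-up ratio inputs:

```python
def altman_z(wc_ta, re_ta, ebit_ta, mve_bvd, rev_ta):
    """Z-score from the five ratios defined on the slide."""
    return (1.2 * wc_ta + 1.4 * re_ta + 3.3 * ebit_ta
            + 0.6 * mve_bvd + 1.0 * rev_ta)

# Hypothetical firm; Z < 1.81 signals bankruptcy under this model
z = altman_z(wc_ta=0.10, re_ta=0.15, ebit_ta=0.08, mve_bvd=0.90, rev_ta=1.10)
print(z, "bankruptcy zone" if z < 1.81 else "above the cutoff")
```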

Model Misspecification
Case 2
Hansen is developing a regression model to predict the initial return for IPOs.

Exhibit 1. Hansen’s Regression Results Dependent Variable: IPO Initial Return


(Expressed in Decimal Form, i.e., 1% = 0.01)

Variable Coefficient (bj) Standard Error t-Statistic

Intercept 0.0477 0.0019 25.11

Underwriter rank 0.0150 0.0049 3.06

Pre-offer price adjustment 0.4350 0.0202 21.53

Offer size -0.0009 0.0011 −0.82

Fraction retained 0.0500 0.0260 1.92


Model Misspecification
Case 2
Hansen is developing a regression model to predict the initial return for IPOs.

Exhibit 2. Selected ANOVA Results for Hansen’s Regression

Degrees of Freedom (df) Sum of Squares (SS)

Regression 4 51.433

Residual 1,720 91.436

Total 1,724 142.869

He believes that for each 1 percent increase in pre-offer price adjustment, the initial return
will increase by less than 0.5 percent, holding other variables constant.
Model Misspecification
Case 2
Before applying his model, Hansen asks a colleague, Phil Chang, to review its
specification and results. After examining the model, Chang concludes that the
model suffers from two problems: 1) conditional heteroskedasticity, and 2) omitted
variable bias. Chang makes the following statements:

• Statement 1: “Conditional heteroskedasticity will result in consistent coefficient


estimates, but both the t-statistics and F-statistic will be biased, resulting in false
inferences.”

• Statement 2: “If an omitted variable is correlated with variables already included


in the model, coefficient estimates will be biased and inconsistent and standard
errors will also be inconsistent.”
Model Misspecification
Case 2
1. The 95 percent confidence interval for the regression coefficient for the pre-offer price
adjustment is closest to:

A. 0.156 to 0.714.

B. 0.395 to 0.475.

C. 0.402 to 0.468.

2. The most appropriate null hypothesis and the most appropriate conclusion regarding
Hansen’s belief about the magnitude of the initial return relative to that of the pre-offer
price adjustment (reflected by the coefficient bj) are:

Null Hypothesis Conclusion about bj (0.05 Level of Significance)

A H0: bj = 0.5 Reject H0

B H0: bj ≥ 0.5 Fail to reject H0

C H0: bj ≥ 0.5 Reject H0


Model Misspecification
Case 2

3. Is Chang’s Statement 1 correct?

A. Yes.

B. No, because the model’s F-statistic will not be biased.

C. No, because the model’s t-statistics will not be biased.

4. Is Chang’s Statement 2 correct?

A. Yes.

B. No, because the model’s coefficient estimates will be unbiased.

C. No, because the model’s coefficient estimates will be consistent.


Model Misspecification
Case 2

5. Hansen is concerned about the possible presence of multicollinearity in the regression. He states that adding a new independent variable that is highly correlated with one or more independent variables already in the regression model has three potential consequences. Which one is incorrect?

1. The R2 is expected to decline.

2. The regression coefficient estimates can become imprecise and unreliable.

3. The standard errors for some or all of the regression coefficients will
become inflated.
Summary

➢ Importance: ☆☆☆

➢ Content:
⚫ Assumptions of multiple linear regression;

⚫ Interpretation and hypothesis testing of regression coefficients;

⚫ Prediction of dependent variable

➢ Exam tips:
⚫ Hypothesis testing of regression coefficients;

⚫ Question formats are flexible, including calculating test statistics and judging and interpreting test results.

R6 Time-Series Analysis

What are we going to learn?

The candidate should be able to:

 Calculate and evaluate the predicted trend value for a time series, modeled as either a linear trend or a log-linear trend, given the estimated trend coefficients;

 Describe factors that determine whether a linear or a log-linear trend should be used with a particular time series and evaluate limitations of trend models;

 Explain the requirement for a time series to be covariance stationary and describe the significance of a series that is not stationary;

 Describe the structure of an autoregressive (AR) model of order p and calculate one- and two-period-ahead forecasts given the estimated coefficients;

 Explain mean reversion and calculate a mean-reverting level;


What are we going to learn?

The candidate should be able to:

 Explain how autocorrelations of the residuals can be used to test whether the autoregressive model fits the time series;

 Contrast in-sample and out-of-sample forecasts and compare the forecasting accuracy of different time-series models based on the root mean squared error criterion;

 Describe characteristics of random walk processes and contrast them to covariance stationary processes;

 Describe implications of unit roots for time-series analysis, explain when unit roots are likely to occur and how to test for them, and demonstrate how a time series with a unit root can be transformed so it can be analyzed with an AR model.

Trend Models
➢ Linear trend model
⚫ $y_t = b_0 + b_1 t + \varepsilon_t$

⚫ Same as linear regression, except that the independent variable is time t (t = 1, 2, 3, …)

[Plot: yt increasing linearly in t.]
Trend Models
➢ Log-linear trend model
⚫ $y_t = e^{b_0 + b_1 t}$
⚫ $\ln(y_t) = b_0 + b_1 t + \varepsilon_t$
⚫ Model the natural log of the series using a linear trend
⚫ Use the Durbin-Watson statistic to detect autocorrelation

[Plots: yt grows exponentially in t, while ln(yt) is linear in t.]

Trend Models
➢ How to select a trend model

⚫ A linear trend model may be appropriate if fitting a linear trend to a time series
leads to uncorrelated errors.

⚫ A log-linear model may be more appropriate if a time series grows at an


exponential rate, we can model the natural log of that series using a linear trend.

➢ Limitations of Trend Model


⚫ Usually the time series data exhibit serial correlation, which means that the
model is not appropriate for the time series

⚫ Existence of serial correlation suggests we build better forecasting model than


trend model
Autoregressive Models (AR)
➢An autoregressive model uses past values of dependent variables as
independent variables
⚫ AR(p) model

$$x_t = b_0 + b_1 x_{t-1} + b_2 x_{t-2} + \cdots + b_p x_{t-p} + \varepsilon_t$$

⚫ AR (p): AR model of order p (p indicates the number of lagged values that the
autoregressive model will include as independent variable).
⚫ For example, a model with two lags is referred to as a second-order
autoregressive model or an AR (2) model.

Autoregressive Models (AR)
➢Multiperiod forecasts
⚫ Chain rule of forecasting
✓ The one-period-ahead forecast of xt from an AR(1) model is:

$$\hat x_{t+1} = \hat b_0 + \hat b_1 x_t$$

✓ If we want to forecast xt+2 using an AR(1) model, our forecast will be based on:

$$\hat x_{t+2} = \hat b_0 + \hat b_1 \hat x_{t+1}$$
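A sketch of the chain rule using assumed AR(1) estimates (b0 = 16.54, b1 = 0.65, the same numbers as the mean-reversion example later in this reading):

```python
b0_hat, b1_hat = 16.54, 0.65    # assumed estimated AR(1) coefficients
x_t = 42.5                      # current value of the series

x_t1 = b0_hat + b1_hat * x_t    # one-period-ahead forecast: 44.165
x_t2 = b0_hat + b1_hat * x_t1   # two-period-ahead: chain the first forecast
print(x_t1, x_t2)               # 44.165, then about 45.25
```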

Autoregressive Models (AR)
➢ Forecasting with an autoregressive model, we should prove:
⚫ No autocorrelation

⚫ Covariance-stationary series

⚫ No conditional heteroskedasticity

Autocorrelation
➢ Autocorrelation in an AR model
⚫ When the error terms are correlated, standard errors are unreliable.
⚫ Durbin-Watson statistic is invalid when the independent variables include past values of the
dependent variable
⚫ Using t-test with residual autocorrelation and the standard error of the residual autocorrelation

➢Detecting autocorrelation in an AR model


⚫ Compute the autocorrelations of the residual
⚫ t-tests to see whether the residual autocorrelations differ significantly from 0:

$$t = \frac{r_{\varepsilon_t,\varepsilon_{t-k}} - 0}{s_r} = \frac{r_{\varepsilon_t,\varepsilon_{t-k}}}{1/\sqrt{n}}$$

⚫ If the residual autocorrelations differ significantly from 0, the model is not correctly specified, so
we may need to modify it (e.g. seasonality)
⚫ Correction: add lagged values
Autocorrelation
➢ Seasonality – a special question
⚫ Time series shows regular patterns of movement within the year
⚫ The seasonal autocorrelation of the residual will differ significantly from 0
⚫ We should uses a seasonal lag in an AR model
⚫ For example: xt=b0+b1 xt-1+ b2 xt-4+εt, AR(1) model with a seasonal lag

Autocorrelation
Case 1 Example

• Suppose we decide to use an autoregressive model with a seasonal lag


because of the seasonal autocorrelation in the previous problem. We are
modeling quarterly data, so we estimate Equation:

(ln Salest – ln Salest–1) = b0 + b1(ln Salest–1 – ln Salest–2) + b2(ln Salest–4 – ln Salest–5) + εt
Q1: Using the information in Table 1, determine if the model is correctly specified.

Q2: If sales grew by 1 percent last quarter and by 2 percent four quarters ago, use
the model to predict the sales growth for this quarter.
Autocorrelation
Case 1 Example
[Table 1, with the residual autocorrelations and their t-statistics, is not reproduced here.]
Autocorrelation
Case 1 Example
• Answer to Q1

• At the 0.05 significance level, with 68 observations and three parameters, this
model has 65 degrees of freedom. The critical value of the t-statistic needed to
reject the null hypothesis is thus about 2.0.

• The absolute value of the t-statistic for each autocorrelation is all below 2.0, so
we cannot reject the null hypothesis that each autocorrelation is not significantly
different from 0. We have determined that the model is correctly specified.

• Answer to Q2

• If sales grew by 1 percent last quarter and by 2 percent four quarters ago, then the model predicts that sales growth this quarter will be e^(0.0121 − 0.0839·ln(1.01) + 0.6292·ln(1.02)) − 1 = 2.40%.
Covariance-stationary
➢ Covariance-stationary series
⚫ Statistical inference based on OLS estimates for a lagged time series model assumes that the
time series is covariance stationary.

⚫ Three conditions for covariance stationarity:

✓ The expected value of the time series must be constant and finite in all periods: $E(y_t) = \mu$ and $|\mu| < \infty$, $t = 1, 2, \dots, T$

✓ The variance of the time series must be constant and finite in all periods.

✓ The covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods: $\mathrm{Cov}(y_t, y_{t-s}) = \lambda_s$, $|\lambda_s| < \infty$, $t = 1, 2, \dots, T$; $s = 0, 1, 2, \dots, T$

⚫ Stationarity in the past does not guarantee stationarity in the future

⚫ All covariance-stationary time series have a finite mean-reverting level.


Covariance-stationary
➢ Mean reversion
⚫ A time series shows mean reversion if it tends to fall when its level is above its mean and rise when its level is below its mean.
⚫ For an AR(1) model, $x_t = b_0 + b_1 x_{t-1}$, the mean-reverting level is given by:

$$x_t = \frac{b_0}{1 - b_1}$$

⚫ The time series will increase if $x_t < \frac{b_0}{1-b_1}$
⚫ The time series will decrease if $x_t > \frac{b_0}{1-b_1}$

Example (Mean-reverting level): Suppose a one-lag autoregressive model $x_t = b_0 + b_1 x_{t-1}$, where the coefficients are 16.54 and 0.65 respectively. If X is currently 42.5, what is the trend for X in the next period?

Answer: Mean-reverting level = $\frac{b_0}{1-b_1} = \frac{16.54}{1-0.65} = 47.26 > 42.5$, so X tends to increase in the next period.
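The same arithmetic as a short sketch:

```python
b0, b1, x_now = 16.54, 0.65, 42.5
mrl = b0 / (1 - b1)   # mean-reverting level, about 47.26
print(mrl, "increase" if x_now < mrl else "decrease")  # X tends to increase
```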
Covariance-Stationary

➢ Instability of regression coefficients


⚫ The regression coefficient estimates of a time-series model estimated using an earlier
sample period can be quite different from those of a model estimated using a later sample
period.

✓ e.g. Exchange rate with fixed regime and floating regime

⚫ The estimates can be different between models estimated using relatively shorter and
longer sample periods.

➢ So we need to check covariance stationary when applying sample data.

Covariance-stationary
➢ Random walk
⚫ Random walk without a drift
✓ xt =xt-1+εt (b0=0 and b1=1)
✓ The best forecast of xt that can be made in period t−1 is xt−1. (the expected value of εt is zero). In
fact, in this model, xt−1 is the best forecast of x in every period after t−1
✓ xt - xt-1= εt

⚫ Random walk with a drift


✓ xt=b0+xt-1+εt (b0≠0, b1=1)
✓ A random walk with drift should increase or decrease by a constant amount in each period.

➢ Features
⚫ A random walk has an undefined mean reverting level
⚫ A time series must have a finite mean reverting level to be covariance stationary
⚫ A random walk, with or without a drift, is not covariance stationary
Unit Root Test
➢The unit root test of nonstationarity
⚫ The time series is said to have a unit root if the lag coefficient is equal to one
⚫ A conventional t-test of the hypothesis that b1 = 1 is invalid for testing a unit root, because the test statistic does not follow the usual t-distribution when a unit root is present.
⚫ Dickey-Fuller test (DF test) to test the unit root
✓ Start with an AR(1) model: $x_t = b_0 + b_1 x_{t-1} + \varepsilon_t$

Subtract $x_{t-1}$ from both sides: $x_t - x_{t-1} = b_0 + (b_1 - 1)x_{t-1} + \varepsilon_t = b_0 + g\,x_{t-1} + \varepsilon_t$
✓ H0: g=0 (has a unit root and is nonstationary)
Ha: g<0 (does not have a unit root and is stationary)
✓ Calculate conventional t-statistic and use revised t-table, which is computed by Dickey and Fuller.
✓ If we can reject the null, the time series does not have a unit root and is stationary.
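In practice the test is usually run with a library. A sketch using the augmented Dickey-Fuller implementation in statsmodels, applied to a simulated random walk (so we expect to fail to reject the unit-root null):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
x = np.cumsum(rng.normal(size=500))   # random walk: has a unit root

adf_stat, p_value, *_ = adfuller(x)
print(adf_stat, p_value)              # large p-value: cannot reject the unit root
```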

Unit Root Correction
➢ If a time series appears to have a unit root
⚫ One method that is often successful is to first-difference the time series (as
discussed previously) and try to model the first-differenced series as an
autoregressive time series.

➢ First differencing
⚫ Define yt as yt = xt − xt−1 (for a random walk, yt = εt)

⚫ The first-differenced variable yt is covariance stationary

Autoregressive Conditional Heteroskedasticity
➢ Heteroskedasticity refers to the situation that the variance of the error term is not
constant.
➢ Test whether a time series is ARCH(1)
⚫ $\hat\varepsilon_t^2 = a_0 + a_1\hat\varepsilon_{t-1}^2 + u_t$
⚫ If the estimate of a1 is statistically significantly different from zero, we conclude that the
time series is ARCH(1).
✓ If a time-series model has ARCH(1) errors, then the variance of the errors in period t + 1 can be predicted in
period t.

➢ If ARCH exists,
⚫ the standard errors for the regression parameters will not be correct. Generalized least
squares must be used to develop a predictive model.

⚫ ARCH model can be used to predict the variance of the residuals in future periods.
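A sketch of the ARCH(1) test: regress the squared residuals on their own lag and check the significance of a1 (the residuals are simulated; in practice they come from your fitted time-series model):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
# Simulate ARCH(1)-style errors: variance depends on the last squared error
e = np.zeros(500)
for t in range(1, 500):
    e[t] = rng.normal(scale=np.sqrt(0.5 + 0.4 * e[t - 1] ** 2))

e2 = e ** 2
X = sm.add_constant(e2[:-1])           # lagged squared residuals plus constant
res = sm.OLS(e2[1:], X).fit()
print(res.params[1], res.tvalues[1])   # significant a1: ARCH(1) effects present
```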

Compare Forecasting Power with RMSE
➢Comparing forecasting model performance
⚫ In-sample forecast errors are the residuals from a fitted time-series model.

⚫ Out-of-sample forecast errors are the differences between actual and predicted values when the model is used to make predictions outside the sample period over which it was fitted.
✓ Root mean squared error (RMSE): the model with the smallest RMSE is the most accurate out-of-sample.

✓ RMSE is the square root of the average squared error.
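RMSE in a couple of lines (the actual and predicted arrays are placeholders):

```python
import numpy as np

actual = np.array([2.1, 1.8, 2.5, 2.0])      # hypothetical out-of-sample values
predicted = np.array([2.0, 2.0, 2.3, 2.2])   # the model's forecasts
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)
```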

Regression with More Than One Time Series
➢ In linear regression, if any time series contains a unit root, OLS may
be invalid.
➢ Use DF tests for each of the time series to detect unit root, we will
have 3 possible scenarios.
⚫ None of the time series has a unit root: we can use multiple regression
⚫ At least one time series has a unit root while at least one time series does not: we
cannot use multiple regression
⚫ Each time series has a unit root: we need to establish whether the time series are
cointegrated.
✓ If cointegrated, we can estimate the long-term relation between the two series (but this may not be the best model of the short-term relationship between the two series).

Regression with More Than One Time Series
➢ Use the Dickey-Fuller Engle-Granger test (DF-EG test) to test the
cointegration
⚫ H0: no cointegration Ha: cointegration
⚫ If we cannot reject the null, we cannot use multiple regression
⚫ If we can reject the null, we can use multiple regression
⚫ Critical value calculated by Engle and Granger

Steps in Time-Series Forecasting

[Flowchart, part 1] Plot the data and ask: does the series have a trend?
• Yes: fit either a linear trend or an exponential (log-linear) trend.
• No: judge whether a seasonality factor is present.
Then use the Durbin-Watson test to check whether the residuals are serially correlated:
• No serial correlation: use the trend model.
• Serial correlation: use an AR model.
Steps in Time-Series Forecasting

[Flowchart, part 2] Is the series covariance stationary?
• No: take first differences and model the differenced series.
• Yes: begin by estimating an AR(1) model, then refine it:
  • Are the residuals serially correlated? If yes, keep adding autoregressive lags.
  • Is seasonality present? If yes, add the corresponding seasonal lag.
  • Use an ARCH model to test for heteroskedasticity; if present, correct the model errors using generalized least squares.
Once the model passes these checks, it is complete; test its forecasting power.
Time-Series analysis
Case 2 Example
⚫ Angela Martinez, an energy sector analyst at an investment bank, is concerned about the future
level of oil prices and how it might affect portfolio values. She is considering whether to
recommend a hedge for the bank portfolio's exposure to changes in oil prices. Martinez
examines West Texas Intermediate (WTI) monthly crude oil price data, expressed in US dollars
per barrel, for the 181-month period from August 2000 through August 2015. The end-of-month
WTI oil price was $51.16 in July 2015 and $42.86 in August 2015 (Month 181).

⚫ After reviewing the time-series data, Martinez determines that the mean and variance of the time
series of oil prices are not constant over time. She then runs the following four regressions using
the WTI time-series data. (Exhibit 1 presents selected data from all four regressions)

Linear trend model: Oil pricet = b0 + b1t + et

Log-linear trend model: ln Oil pricet = b0 + b1t + et

AR(1) model: Oil pricet = b0 + b1Oil pricet-1 + et

AR(2) model: Oil pricet = b0 + b1Oil pricet-1 + b2Oil pricet-2 + et


Time-Series analysis
Case 2 Example
Exhibit 1. Crude Oil Price per Barrel, August 2000-August 2015
Regression Statistics (t-statistics for coefficients are reported in parentheses)

                   Linear      Log-Linear   AR(1)       AR(2)
R²                 0.5703      0.6255       0.9583      0.9656
Standard error     18.6327     0.3034       5.7977      5.2799
Observations       181         181          180         179
Durbin-Watson      0.10        0.08         1.16        2.08
RMSE                                        2.0787      2.0530

Coefficients:
Intercept          28.3278     3.3929       1.5948      2.0017
                   (10.1846)   (74.9091)    (1.4610)    (1.9957)
t (Trend)          0.4086      0.0075
                   (15.4148)   (17.2898)
Oil Price t−1                               0.9767      1.3946
                                            (63.9535)   (20.2999)
Oil Price t−2                                           −0.4249
                                                        (−6.2064)
Time-Series analysis
Case 3 Example
In Exhibit 1, at the 5% significance level, the lower critical value for the Durbin-Watson test statistic is
1.75 for both the linear and log-linear regressions. Exhibit 2 presents selected autocorrelation data
from the AR(1) models.

Exhibit 2. Autocorrelations of the Residual from AR(1) Model

Lag Autocorrelation t-Statistic

1 0.4157 5.5768

2 0.2388 3.2045

3 0.0336 0.4512

4 -0.0426 -0.5712

Note: At the 5% significance level, the critical value for a t-statistic is 1.97.

After reviewing the data and regression results, Martinez draws the following conclusions.

Conclusion 1: The time series for WTI oil prices is covariance stationary.

Conclusion 2: Out-of-sample forecasting using the AR(1) model appears to be more accurate than that of the AR(2) model.
Time-Series analysis
Case 3 Example
1. Based on Exhibit 1, the predicted WTI oil price for October 2015 using the linear
trend model is closest to:

A. $29.15.

B. $74.77.

C. $103.10.

2. Based on Exhibit 1, the predicted WTI oil price for September 2015 using the log-
linear trend model is closest to:

A. $29.75.

B. $29.98.

C. $116.50.

Time-Series analysis
Case 3 Example
3. Based on the regression output in Exhibit 1, there is evidence of positive serial
correlation in the errors in:

A. the linear trend model but not the log-linear trend model.

B. both the linear trend model and the log-linear trend model.

C. neither the linear trend model nor the log-linear trend model.

4. Martinez's Conclusion 1 is:

A. correct.

B. incorrect because the mean and variance of WTI oil prices are not constant over time.

C. incorrect because the Durbin-Watson statistic of the AR(2) model is greater than 1.75.
Time-Series analysis
Case 3 Example
5. Based on Exhibit 1, the forecasted oil price in September 2015 based on
the AR(2) model is closest to:

A. $38.03.

B. $40.04.

C. $61.77.

6. Based on the data for the AR(1) model in Exhibits 1 and 2, Martinez can
conclude that the:

A. residuals are not serially correlated.

B. autocorrelations do not differ significantly from zero.

C. standard error for each of the autocorrelations is 0.0745.


Time-Series analysis
Case 3 Example

7. Based on the mean-reverting level implied by the AR(1) model regression


output in Exhibit 1, the forecasted oil price for September 2015 is most likely
to be:

A. less than $42.86.

B. equal to $42.86.

C. greater than $42.86.

Summary
➢ Importance: ☆☆☆

➢ Content:
⚫ Linear trend model & log-linear trend model ; Limitation of trend models;

⚫ Covariance stationary and AR model;

⚫ Auto-correlation and seasonality;

⚫ Mean reversion

➢ Exam tips:
⚫ Calculation of the mean-reverting level;

⚫ Effects and testing of heteroskedasticity and serial correlation (concept questions).

R7 Machine Learning

What are we going to learn?
The candidate should be able to:

 Distinguish between supervised machine learning, unsupervised machine learning, and deep learning; describe overfitting and identify methods of addressing it;

 Describe supervised machine learning algorithms—including penalized regression, support vector machine, k-nearest neighbor, classification and regression tree, ensemble learning, and random forest—and determine the problems for which they are best suited;

 Describe unsupervised machine learning algorithms—including principal components analysis, k-means clustering, and hierarchical clustering—and determine the problems for which they are best suited;

 Describe neural networks, deep learning nets, and reinforcement learning.


Defining Machine Learning
➢ Statistical approaches rely on foundational assumptions and explicit models of
structure, such as observed samples that are assumed to be drawn from a specified
underlying probability distribution. These a priori restrictive assumptions can fail in
reality.

➢ Machine learning seeks to extract knowledge from large amounts of data with no such
restrictions. The goal of machine learning algorithms is to automate decision- making
processes by generalizing (i.e., “learning”) from known examples to determine an
underlying structure in the data.
⚫ Machine learning techniques are better able than statistical approaches to handle problems with many variables (high dimensionality) or with a high degree of non-linearity.

Machine Learning Algorithms

➢ Machine learning vocabulary

⚫ In regression analysis

✓ Y variable known as the dependent variable

✓ X variables are known as independent variables or explanatory variables

⚫ In machine learning

✓ Y variable called the target variable or tag variable

✓ X variables called features

⚫ Hyperparameter: model input specified by the researcher.

Overview of Supervised Learning

Types of Machine Learning

➢ Supervised learning

➢ Unsupervised learning

➢ Neural networks, deep learning and reinforcement learning

Types of Machine Learning
➢ Supervised learning
⚫ Supervised learning requires a labeled data set, one that contains matched sets of observed
inputs and the associated output.
⚫ Applying the ML algorithm to this data set to infer the pattern between the inputs and output is
called “training” the algorithm.
⚫ Once the algorithm has been trained, the inferred pattern can be used to predict output values
based on new inputs (i.e., ones not in the training data set).
⚫ Two categories of supervised learning
✓ Regression model : making predictions of continuous target variables.
 Multiple regression is an example of supervised learning
✓ Classification problems: sorting observations into distinct categories.
 Binary classification, e.g., likely to default or not likely to default
 Multicategory classification, e.g., bond ratings

Types of Machine Learning
➢ Unsupervised learning
⚫ Unsupervised learning is machine learning that does not make use of labeled data.

⚫ The algorithm seeks to discover structure within the data themselves

⚫ Two categories of unsupervised learning


✓ Dimension reduction: reducing the number of features while retaining variation
across observations to preserve the information contained in that variation.
✓ Clustering : sorting observations into groups (clusters) such that observations in the
same cluster are more similar to each other than they are to observations in other
clusters.

Neural Networks, Deep Learning, and Reinforcement Learning
➢ Deep Learning and Reinforcement Learning
⚫ Neural networks(NNs, also called artificial neural networks, or ANNs) include highly flexible
ML algorithms that have been successfully applied to a variety of tasks characterized by
non-linearities and interactions among features.

⚫ Deep learning and reinforcement learning are themselves based on neural networks.
✓ In deep learning, sophisticated algorithms address highly complex tasks, such as image classification, face recognition, speech recognition, and natural language processing.
✓ In reinforcement learning, a computer learns from interacting with itself (or data
generated by the same algorithm).

Summary of ML Algorithms

➢ How to Choose ML algorithms?

Data Sets
➢ To measure how well a model generalizes , data analysts create three
nonoverlapping data sets :
⚫ Training sample ( used to develop the model)
✓ In-sample prediction errors occur with the training sample.

⚫ Validation sample(used for tuning the model)

⚫ Test sample (used for evaluating the model using new data)

Overfitting
➢ Underfitting means the model does not capture the relationships in the data.

➢ Overfitting means training a model to such a degree of specificity to the


training data that the model begins to incorporate noise coming from quirks or
spurious correlations; it mistakes randomness for patterns and relationships.
⚫ The main contributors to overfitting are thus high noise levels in the data and too much
complexity in the model.
⚫ Complexity refers to the number of features, terms, or branches in the model and to whether
the model is linear or non- linear (non- linear is more complex).
⚫ As models become more complex, overfitting risk increases.

➢ A good fit/robust model fits the training (in-sample) data well and generalizes
well to out-of-sample data, both within acceptable degrees of error.

Overfitting
➢ Data scientists decompose the total out- of- sample error into three
sources:
⚫ Bias error, or the degree to which a model fits the training data. Algorithms with
erroneous assumptions produce high bias with poor approximation, causing
underfitting and high in- sample error.

⚫ Variance error, or how much the model’s results change in response to new data
from validation and test samples. Unstable models pick up noise and produce high
variance, causing overfitting and high out- of- sample error.

⚫ Base error due to randomness in the data.

Overfitting
➢ A fitting curve, which shows in- and out- of- sample
error rates (Ein and Eout) on the y- axis plotted
against model complexity on the x- axis.
⚫ Variance error increases with model complexity;
⚫ Bias error decreases with complexity;
⚫ Linear functions are more susceptible to bias error and
underfitting;
⚫ Non-linear functions are more prone to variance error
and overfitting.
⚫ An optimal level of complexity minimizes the total
error and is a key part of successful model
generalization .

Preventing Overfitting
➢ Two common guiding principles and two methods are used to reduce
overfitting:

⚫ Preventing the algorithm from getting too complex during selection and training, which
requires estimating an overfitting penalty;
✓ In supervised machine learning, it means limiting the number of features and penalizing
algorithms that are too complex or too flexible by constraining them to include only
parameters that reduce out- of- sample error.

⚫ Proper data sampling achieved by using cross- validation, a technique for estimating
out- of- sample error directly by determining the error in validation samples.

Preventing Overfitting
➢ In cross- validation techniques, out-of-sample error is estimated directly
by determining the error in validation samples.
⚫ One such technique is k- fold cross- validation
✓ In which the data (excluding test sample and fresh data) are shuffled randomly and
then are divided into k equal sub- samples, with k – 1 samples used as training samples
and one sample, the kth, used as a validation sample.
✓ Note that k is typically set at 5 or 10.
✓ This process is then repeated k times, which helps minimize both bias and variance by
ensuring that each data point is used in the training set k – 1 times and in the
validation set once.
✓ The average of the k validation errors (mean Eval) is then taken as a reasonable
estimate of the model’s out- of- sample error (Eout).
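A minimal sketch of k-fold cross-validation with k = 5, assuming Python with NumPy and scikit-learn are available; the simulated data set and the linear model are hypothetical choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                      # 100 observations, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)   # data shuffled, k = 5

# Each fold serves once as the validation sample; the mean validation error
# is a reasonable estimate of the model's out-of-sample error (Eout).
scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                         scoring="neg_mean_squared_error")
print("Mean validation MSE:", -scores.mean())
```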
Supervised Learning Algorithms
➢ Supervised machine learning models are trained using labeled data, and
depending on the nature of the target (Y) variable, they can be divided into two
types: regression for a continuous target variable and classification for a
categorical or ordinal target variable.
⚫ Penalized Regression

Penalized Regression
➢ Penalized Regression
⚫ Reduce the problem of overfitting by imposing a penalty term.
⚫ LASSO (least absolute shrinkage and selection operator) automatically performs feature selection.

✓ Lambda (λ) is a hyperparameter that determines the balance between fitting the model versus keeping the model
parsimonious.

✓ Note: when λ = 0, LASSO penalized regression reduces to OLS regression.

⚫ In addition to minimizing the sum of the squared residuals, LASSO also involves minimizing the sum of
the absolute values of the regression coefficients.

➢ Regularization describes methods that reduce statistical variability in high dimensional data
estimation problems.
⚫ Regularization can be applied to non-linear models.
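A minimal sketch of LASSO, assuming scikit-learn (where λ is exposed as the alpha parameter); the simulated data and the choice alpha = 0.1 are hypothetical:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                     # 10 candidate features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only 2 matter

# The penalty term is alpha times the sum of absolute coefficient values;
# as alpha shrinks toward 0, the fit approaches OLS (the λ = 0 case above).
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # irrelevant features are driven to exactly zero
```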

Support Vector Machine( SVM )
➢ SVM
⚫ Support vector machine (SVM) is a linear classifier that determines the hyperplane that optimally
separates the observations into two sets of data points.

⚫ It is a powerful supervised algorithm used for classification, regression, and outlier detection.

⚫ Core idea: any p-dimensional space can be divided into two parts by a (p − 1)-dimensional hyperplane.

Support Vector Machine( SVM )
➢ SVM
⚫ SVM maximizes the probability of making a correct prediction by determining the boundary that is
farthest away from all the observations.

⚫ SVM separates the data by the maximum margin.

Support Vector Machine( SVM )
➢ Many real-world data sets, however, are not perfectly linearly separable , in
that case, soft margin classification is applied.
⚫ This adaptation adds a penalty to the objective function for observations in the training set
that are misclassified; it optimizes the tradeoff between a wider margin and classification
error.

➢ As an alternative to soft margin classification, a non- linear SVM algorithm can


be run by introducing more advanced, non- linear separation boundaries.
⚫ These algorithms will reduce the number of misclassified instances in the training data
sets but will have more features, thus adding to the model’s complexity.
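A minimal sketch contrasting a linear (soft margin) SVM with a non-linear SVM, assuming scikit-learn; the synthetic data set and the value of the penalty parameter C are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# C controls the soft margin penalty: a smaller C tolerates more
# misclassified training points in exchange for a wider margin.
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# A non-linear kernel gives a more flexible separation boundary,
# at the cost of added model complexity.
nonlinear_svm = SVC(kernel="rbf", C=1.0).fit(X, y)
print(linear_svm.score(X, y), nonlinear_svm.score(X, y))
```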

K- Nearest Neighbor
➢ K- nearest neighbor (KNN) is a supervised learning technique most often used
for classification and sometimes for regression. The idea is to classify a new
observation by finding similarities (“nearness”) between this new observation
and the existing data.

K- Nearest Neighbor
➢ Two vital concerns
⚫ A critical challenge of KNN, however, is defining what it means to be “similar” (or near).
⚫ The researcher must specify k, the hyperparameter of the model, with an understanding
of the data and the problem.
E.g., for predicting the credit rating of an unrated bond, should k be the 3, 15, or 50 most
similar bonds to the unrated bond?
✓ If k is an even number, there may be ties and no clear classification.;
✓ Choosing a value for k that is too small would result in a high error rate and sensitivity to local
outliers;
✓ Choosing a value for k that is too large would dilute the concept of nearest neighbors by averaging too
many outcomes.
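A minimal sketch of how the choice of k might be examined, assuming scikit-learn; the synthetic data and the candidate values of k are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=7)

# Odd values of k avoid ties in a binary classification problem.
for k in (3, 15, 51):
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k={k}: cross-validated accuracy {acc:.3f}")
```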

Classification and Regression Tree (CART)
➢ Classification and regression trees (CART)
⚫ Classification trees are appropriate when the target variable is categorical.
✓ Typically used when the target is binary (e.g., an IPO will be successful vs. not
successful).

⚫ Regression trees are appropriate when the target is continuous.

⚫ CART is most commonly used with a binary target, and is better adapted to
classification problems with significant non-linear relationships among
variables.

Classification and Regression Tree (CART)
➢ CART—Decision Tree and Partitioning of
the Feature Space

⚫ Such a classification requires a binary tree: a


combination of an initial root node, decision nodes,
and terminal nodes. The root node and each
decision node represent a single feature (f) and a
cutoff value (c) for that feature.

Classification and Regression Tree (CART)
➢ The CART algorithm chooses the feature and the cutoff value at each node that
generates the widest separation of the labeled data to minimize classification error.
⚫ From the initial root node, the data are partitioned at decision nodes into smaller and smaller subgroups

until terminal nodes are formed that contain the predicted labels, so observations in each group have

lower within-group error than before.

⚫ At any level of the tree, when the classification error does not diminish much more from another

split (bifurcation), the process stops, the node is a terminal node, and the category that is in the

majority at that node is assigned to it.

⚫ If the objective of the model is classification, then the prediction of the algorithm at each terminal
node will be the category with the majority of data points.

⚫ If the goal is regression, then the prediction at each terminal node is the mean of the labeled values.

Classification and Regression Tree (CART)
➢ CART is a popular supervised machine learning model because the tree provides a
visual explanation for the prediction. This contrasts favorably with other
algorithms that are often considered to be “black boxes.”

➢ To avoid overfitting

⚫ Regularization criteria such as maximum tree depth, maximum number of decision

nodes, and so on are specified by the researcher.

⚫ Alternatively, sections of the tree with minimal explanatory power are pruned.
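A minimal sketch of a regularized classification tree, assuming scikit-learn; max_depth and min_samples_leaf illustrate the regularization criteria mentioned above, and the synthetic data are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=3)

# max_depth and min_samples_leaf are regularization criteria that keep
# the tree from growing until it fits noise in the training data.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20).fit(X, y)
print(export_text(tree))   # the printed tree is the model's visual explanation
```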

Ensemble Learning and Random Forest
➢ This technique of combining the predictions from a collection of models is called
ensemble learning.

⚫ Ensemble learning typically produces more accurate and more stable predictions than

the best single model.

⚫ Ensemble learning can be divided into two main categories:

✓ 1) an ensemble method can be an aggregation of heterogeneous learners (i.e., different types of

algorithms combined together with a voting classifier);

✓ 2) an ensemble method can be an aggregation of homogenous learners (i.e., a combination of the

same algorithm, using different training data that are based, for example, on a bootstrap

aggregating (i.e., bagging) technique as discussed later).

Voting Classifiers
➢ Suppose you have been working on a machine learning project for some time and
have trained and compared the results of several algorithms, such as SVM, KNN,
and CART. A majority- vote classifier will assign to a new data point the predicted
label with the most votes.

⚫ For example, if the SVM and KNN models are both predicting the category “stock

outperformance” and the CART model is predicting the category “stock

underperformance,” then the majority- vote classifier will choose “stock

outperformance.”

⚫ The more individual models you have trained, the higher the accuracy of the

aggregated prediction up to a point.

Bootstrap Aggregating (Bagging)
➢ Bootstrap aggregating (or bagging) is a technique whereby the original training data
set is used to generate n new training data sets or bags of data.

⚫ Each new bag of data is generated by random sampling with replacement from the

initial training set.

⚫ The algorithm can now be trained on n independent data sets that will generate n new

models.

⚫ Then, for each new observation, we can aggregate the n predictions using a majority-

vote classifier for a classification or an average for a regression.

⚫ Bagging is a very useful technique because it helps to improve the stability of

predictions and protects against overfitting the model.

Random Forest
➢ A random forest classifier is a collection of a large number of decision trees trained
via a bagging method.
⚫ For example, a CART algorithm would be trained using each of the n independent data sets (from the
bagging process) to generate the multitude of different decision trees that make up the random forest
classifier.

⚫ For any new observation, we let all the classifier trees (the “random forest”) undertake classification
by majority vote—implementing a machine learning version of the “wisdom of crowds.”

⚫ The process involved in random forest tends to protect against overfitting on the training data. It also
reduces the ratio of noise to signal because errors cancel out across the collection of slightly different
classification trees.

⚫ However, an important drawback of random forest is that it lacks the ease of interpretability of
individual trees; as a result, it is considered a relatively black box- type algorithm.
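A minimal sketch of a random forest classifier, assuming scikit-learn; the number of trees and the synthetic data are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=5)

# 200 trees, each grown on a bootstrap sample (bagging) with a random
# subset of features; new observations are classified by majority vote.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=5).fit(X, y)
print(forest.predict(X[:3]))
```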

Unsupervised Learning Algorithms
➢ Unsupervised learning is machine learning that does not use labeled
data (i.e., no target variable); thus, the algorithms are tasked with
finding patterns within the data themselves .

➢The two main types of unsupervised ML algorithms are


⚫ Dimension reduction

✓ principal components analysis

⚫ Clustering

✓ K- means

✓ Hierarchical clustering.

Principal Components Analysis
➢ Dimension reduction. Problems associated with too much noise often arise
when the number of features in a data set (i.e., its dimension) is excessive.

➢ PCA is used to summarize or reduce highly correlated features of data into


a few main, uncorrelated composite variables.
⚫ A composite variable is a variable that combines two or more variables that are statistically
strongly related to each other.

⚫ The eigenvectors define new, mutually uncorrelated composite variables that are linear
combinations of the original features. As a vector, an eigenvector also represents a
direction. Associated with each eigenvector is an eigenvalue.

⚫ An eigenvalue gives the proportion of total variance in the initial data that is explained by
each eigenvector.

Principal Components Analysis
➢ The PCA algorithm orders the eigenvectors from highest to lowest according to their
eigenvalues—that is, in terms of their usefulness in explaining the total variance in the initial data.

➢ PCA selects as the first principal component the eigenvector that explains the largest proportion
of variation in the data set (the eigenvector with the largest eigenvalue).

➢ The second principal component explains the next largest proportion of variation remaining after
the first principal component; it continues for the third, fourth, and subsequent principal
components.

➢ As the principal components are linear combinations of the initial feature set, only a few principal
components are typically required to explain most of the total variance in the initial feature
covariance matrix.

➢ In practice, the smallest number of principal components that should be retained is that which the
scree plot shows as explaining 85% to 95% of total variance in the initial data set.
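A minimal sketch of selecting the number of principal components from the cumulative explained variance, assuming scikit-learn; the correlated synthetic features and the 90% threshold are hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
base = rng.normal(size=(300, 3))
X = np.hstack([base, base + rng.normal(scale=0.1, size=(300, 3))])  # 6 correlated features

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)  # eigenvalue proportions
n_keep = int(np.argmax(cumulative >= 0.90)) + 1        # smallest number explaining 90%
print(cumulative, "components to retain:", n_keep)
```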

Principal Components Analysis
➢ The main drawback of PCA is that since the principal components are combinations
of the data set’s initial features, they typically cannot be easily labeled or directly
interpreted by the analyst. Compared to modelling data with variables that
represent well- defined concepts, the end user of PCA may perceive PCA as
something of a “black box.”

Clustering
➢ Clustering is used to organize data points into similar groups called clusters.
⚫ A cluster contains a subset of observations from the data set such that all the observations
within the same cluster are deemed “similar.”

⚫ The aim is to find a good clustering of the data—meaning that the observations inside each
cluster are similar or close to each other (a property known as cohesion) and the observations in
two different clusters are as far away from one another or are as dissimilar as possible (a
property known as separation).

⚫ In practice, expert human judgment has a role in defining what is similar.

✓ Euclidean distance, the straight-line distance between two observations, is one common
metric that is used.

✓ The smaller the distance, the more similar the observations; the larger the distance, the
more dissimilar the observations.

Clustering

➢ Common types of clustering :


⚫ K- Means Clustering

⚫ Hierarchical clustering.

✓ Agglomerative clustering ( or bottom-up)

✓ Divisive clustering ( or top-down)

K- Means Clustering
➢ K- means is a relatively old algorithm that repeatedly partitions observations into a
fixed number, k, of non- overlapping clusters. The number of clusters, k, is a model
hyperparameter—a parameter whose value must be set by the researcher before
learning begins.

➢ Each cluster is characterized by its centroid (i.e., center), and each observation is
assigned by the algorithm to the cluster with the centroid to which that observation
is closest.

K- Means Clustering
➢ The k- means algorithm is fast and works well on very large data sets with hundreds
of millions of observations.

➢ However, the final assignment of observations to clusters can depend on the initial
location of the centroids.
⚫ To address this problem, the algorithm can be run several times using different sets of initial
centroids, and then one can choose the clustering that is most useful given the business purpose.

➢ One limitation of this technique is that the hyperparameter, k, the number of


clusters in which to partition the data, must be decided before k- means can be run.
⚫ one can run the algorithm using a range of values for k to find the optimal number of clusters
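A minimal sketch of running k-means over a range of k values, assuming scikit-learn; inertia (total within-cluster dispersion) is one simple criterion for comparing the clusterings, and the synthetic data are hypothetical:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=2)

# Inertia falls as k rises; the "elbow" where it stops improving much
# is one heuristic for choosing k, which must be set before running.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=2).fit(X)
    print(k, round(km.inertia_, 1))
```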

Hierarchical Clustering: Agglomerative and Divisive
➢ Hierarchical clustering is an iterative procedure used to build a hierarchy of
clusters.
⚫ Agglomerative clustering (or bottom-up) hierarchical clustering begins with each
observation being treated as its own cluster.

⚫ Divisive clustering (or top- down) hierarchical clustering starts with all the
observations belonging to a single cluster.

Neural Networks
➢ Neural networks (also called artificial neural networks, or ANNs) are a highly flexible type
of ML algorithm that have been successfully applied to a variety of tasks characterized by
non- linearities and complex interactions among features.

➢ Neural networks are commonly used for classification and regression supervised learning
but are also important in reinforcement learning, which can be unsupervised.

➢ Neural networks have three types of layers:

⚫ Input layer (here with a node for each of the 4 features);

⚫ Hidden layers, where learning occurs in training and inputs are processed on trained nets;

⚫ Output layer (here consisting of a single node for the target variable y), which passes
information to outside the network.

Neural Networks
➢ Note that for neural networks, the
feature inputs would be scaled
(i.e., standardized) to account for
differences in the units of the data.
For example, if the inputs were
positive numbers, each could be
scaled by its maximum value so that
their values lie between 0 and 1.

Neural Networks
➢ Each node has, conceptually, two functional parts: a summation operator and an
activation function.
⚫ Once the node receives the four input values, the summation operator multiplies each
value by a weight and sums the weighted values to form the total net input.

⚫ The total net input is then passed to the activation function, which transforms this
input into the final output of the node.
✓ Informally, the activation function operates like a light dimmer switch that decreases or
increases the strength of the input.

✓ The activation function is characteristically non- linear, such as an S- shaped (sigmoidal)


function (with output range of 0 to 1) or the rectified linear unit function.
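A minimal sketch of a single node's two functional parts, assuming NumPy; the input values, weights, and bias are hypothetical:

```python
import numpy as np

def node_output(inputs, weights, bias):
    """One node: a summation operator followed by a sigmoid activation."""
    total_net_input = np.dot(weights, inputs) + bias   # summation operator
    return 1.0 / (1.0 + np.exp(-total_net_input))      # S-shaped output in (0, 1)

x = np.array([0.2, 0.8, 0.5, 0.1])    # four scaled feature inputs
w = np.array([0.4, -0.6, 0.9, 0.3])   # weights adjusted during training
print(node_output(x, w, bias=0.1))
```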

Neural Networks
➢ Forward propagation
➢ This activation function is shown in Exhibit 20,
where in the left graph a negative total net input is
transformed via the S- shaped function into an
output close to 0. This low output implies the node
does not “fire,” so there is nothing to pass to the
next node. Conversely, in the right graph a positive
total net input is transformed into an output close
to 1, so the node does fire. The output of the
activation function is then transmitted to the next
set of nodes if there is a second hidden layer or, as
in this case, to the output layer node as the
predicted value.

Neural Networks
➢ Starting with an initialized set of random network weights, training a neural network in a
supervised learning context is an iterative process in which predictions are compared to
actual values of labeled data and evaluated by a specified performance measure (e.g., mean
squared error).

➢ Then, network weights are adjusted to reduce total error of the network.

➢ If the process of adjustment works backward through the layers of the network, this
process is called backward propagation.

➢ Learning takes place through this process of adjustment to network weights with the aim of
reducing total error.

Deep Learning Nets
➢ Neural networks with many hidden layers—at least 3 but often more than 20 hidden layers—
are known as deep learning nets (DLNs) and are the backbone of the artificial intelligence
revolution.

➢ Advances in DLNs have driven developments in many complex activities, such as image,
pattern, and speech recognition.

➢ Some researchers have used DLNs known as multi- layer perceptrons (or feed- forward
networks) with many nodes (sometimes over 1,000) split over several hidden layers to predict
corporate fundamental factors and price- related technical factors.

Reinforcement Learning
➢ Reinforcement learning (RL) is an algorithm that made headlines in 2017 when DeepMind’s
AlphaGo program beat the reigning world champion at the ancient game of Go. The RL algorithm
involves an agent that should perform actions that will maximize its rewards over time, taking into
consideration the constraints of its environment.

➢ For example, in a video game, a virtual gamer (the agent) uses his/her console commands (the actions)
with the information on the screen (the environment) to maximize his/her score (the reward).

➢ Unlike supervised learning, reinforcement learning has neither direct labeled data for each
observation nor instantaneous feedback.

➢ With RL, the algorithm needs to observe its environment, learn by testing new actions (some of
which may not be immediately optimal), and reuse its previous experiences. The learning
subsequently occurs through millions of trials and errors.

Summary
➢ Supervised learning
⚫ Penalized regression
⚫ Support vector machine (SVM)
⚫ K-nearest neighbor (KNN)
⚫ Classification and regression tree (CART) algorithms
⚫ Ensemble and Random forest

➢ Unsupervised learning
⚫ Dimension reduction
✓ Principal components analysis
⚫ Clustering
✓ K- Means Clustering
✓ Hierarchical clustering.

➢ Neural networks, deep learning and reinforcement learning


How to Choose Among Them
➢ First, start by asking: are the data complex, having many features that are highly correlated?
⚫ If yes, then apply dimensionality reduction using principal components analysis (PCA).

➢ Next, is the problem one of classification or numerical prediction?
⚫ If numerical prediction, then depending on whether or not the data have non-linear characteristics:
✓ Penalized regression/LASSO for linear data;
✓ CART, random forest, and neural networks for non-linear data.
⚫ If classification, then depending on whether or not the data are labeled:
✓ If the data are labeled, use classification algorithms; then, depending on whether or not the data have non-linear characteristics:
 K-nearest neighbor (KNN) and support vector machine (SVM) for linear data;
 CART, random forest, and neural networks for non-linear data.
✓ If the data are unlabeled, use clustering algorithms; then, depending on whether or not the data have non-linear characteristics:
 Neural networks for non-linear data;
 For linear data, k-means with a known number of categories, and hierarchical clustering with an unknown number of categories.

How to Choose Among Them
➢ For different situations, we need to choose the most suitable machine learning method:
⚫ If the data contain both features and labels and the goal is to learn the mapping between them, use supervised learning;
⚫ If there are no labels and the goal is to explore the structure of the data itself, use unsupervised learning;
⚫ If the learning task consists of a sequence of actions with a corresponding reward mechanism, use reinforcement learning;
⚫ If the label to be predicted is a categorical variable (e.g., whether a stock will rise or fall), use a classification method;
⚫ If the label is a continuous numerical variable (e.g., the exact size of the price change), use a regression method;
⚫ In addition, the number of observations and features, and the characteristics of the data itself, all drive the choice of algorithm.

Machine Learning
Case 1 Example

• Which of the following best describes machine learning? Machine


learning:

A. is a type of computer algorithm.

B. is a set of computer-driven approaches that can be used to


extract information from Big Data.

C.is a set of computer-driven approaches adapted to extracting


information from structured data.

Answer: B
Machine Learning
Case 2 Example

• Which of the following statements is most accurate? Machine


learning:

A. contrasts with human learning in relation to measuring


performance on specific tasks.

B. takes place when a computer program is programmed to perform


specific tasks.

C.takes place when a computer improves performance in a specific


class of tasks as experience increases.

Answer: C
Machine Learning
Case 3 Example

• Which of the following statements is most accurate? When


attempting to place data into groups based on their inherent
similarities and differences:

A. an unsupervised ML algorithm is used.

B. an ML algorithm that is given tagged data is used.

C.an ML algorithm that is given tagged data and untagged data is


used.

Answer: A
Machine Learning
Case 4 Example

• Which of the following statements concerning supervised learning


best distinguishes it from unsupervised learning? Supervised
learning involves:

A. training on labeled data.

B. training on unlabeled data.

C.learning from unlabeled data.

Answer : A
Machine Learning
Case 5 Example

• As used in supervised machine learning, regression problems


involve:

A. binary target variables.

B. continuous target variables.

C. categorical target variables.

Answer: B
Machine Learning
Case 6 Example

• Which of the following best describes penalized regression?


Penalized regression:

A. is unrelated to multiple linear regression.

B. involves a penalty term for the sum of squared residuals.

C. is a category of general linear models that is used when the


number of independent variables is a concern.

Answer: C is correct.
Machine Learning
Case 7 Example

• CART is best described as a type of:

A. unsupervised ML.

B. a clustering algorithm based on decision trees.

C. a supervised ML algorithm that accounts for non-linear


relationships among the features.

Answer :C
Machine Learning
Case 8 Example

• Neural networks are best described as an ML technique for learning:

A. exactly modeled on the human nervous system.

B. based on layers of nodes when the relationships among the


features are usually non-linear.

C. based on a tree structure of nodes when the relationships


among the features are non-linear.

Answer : B
Machine Learning
Case 9 Example

• Clustering is best described as a technique in which:

A. the grouping of observations is unsupervised.

B. features are grouped into clusters by a top-down algorithm.

C. observations are classified according to predetermined labels.

Answer : A is correct.
Machine Learning
Case 10 Example

• Dimension reduction techniques are best described as means to


reduce a set of features:

A. to a manageable size.

B. to a manageable size while controlling for variation in the data.

C. to a manageable size while retaining as much of the variation in


the data as possible.

Answer : C is correct.
R8 Big Data Projects

What we are going to learn?

The candidate should be able to:

 State and explain steps in a data analysis project;

 Describe objectives, steps, and examples of preparing and wrangling data;

 Describe objectives, steps, and techniques in model training;

 Describe preparing, wrangling, and exploring text-based data for financial forecasting;

 Describe methods for extracting, selecting and engineering features from textual data;

 Evaluate the fit of a machine learning algorithm.

Big Data Introduction

➢ Big data differs from traditional data sources based on the presence of a set of
characteristics commonly referred to as the 3Vs: volume, variety, and velocity.
⚫ Volume refers to the quantity of data. Big data refers to a huge volume of data.
⚫ Variety pertains to the array of available data sources. Variety includes traditional transactional data;
user-generated text, images, and videos; social media; sensor-based data; web and mobile clickstreams;
and spatial-temporal data. Effectively leveraging the variety of available data presents both
opportunities and challenges, including such legal and ethical issues as data privacy.
⚫ Velocity is the speed at which data are created. Many large organizations collect several petabytes of
data every hour.

➢ When used for generating inferences, an additional characteristic, the veracity or validity of the
data, needs to be considered. Not all data sources are reliable and the researcher has to separate
quality from quantity to generate robust forecasts.
Big Data Introduction

➢ Structured data (e.g. balance sheet data for companies) is neatly


organized in rows and columns.

➢ Unstructured data(e.g. text from SEC filings) is unorganized, and the


ML algorithm has to sift through the noise to pick out information.
⚫ With the proliferation of textual big data(e.g. online news articles, internet financial
forums),such unstructured data have been shown to offer insights faster (as they are
real-time) and have enhanced predictive power.

Big Data Introduction
➢ Model Building for Financial Forecasting Using Big Data: Structured (Traditional) vs.
Unstructured (Text)

Structured Data Analysis
➢ Structured data are organized in a systematic format that is readily searchable and
readable by computer operations for processing and analyzing. In structured data, data
errors can be in the form of incomplete, invalid, inaccurate, inconsistent, non-uniform,
and duplicate data observations. The data cleansing process mainly deals with
identifying and mitigating all such errors. Exhibit 3 shows a raw dataset before
cleansing. The data have been collected from different sources and are organized in a
data matrix (or data table) format. Each row contains observations of each customer of
a US-based bank. Each column represents a variable (or feature) corresponding to each
customer.

Structured Data Analysis

➢ To illustrate the steps involved in analyzing data for financial


forecasting (traditional ML model), we will use an example of a consumer credit
scoring model in the following five steps:
⚫ Conceptualization of the modeling task

⚫ Data collection.

⚫ Data preparation and wrangling.

⚫ Data exploration.

⚫ Model training.

Conceptualization of the modeling task

➢ Conceptualization of the modeling task

⚫ The crucial first step entails determining what the output of the model should be (e.g.,
whether the price of a stock will go up/down one week from now), how this model will
be used and by whom, and how it will be embedded in existing or new business
processes.

Data Collection

➢ Data Collection

⚫ For financial forecasting, usually structured , numeric data is collected from internal
and external sources.

⚫ External data usually can be accessed through an application programming interface


(API)—a set of well-defined methods of communication between various software
components—or the vendors can deliver the required data in the form of csv files or
other formats (as previously mentioned).

Data Preparation and Wrangling

➢ Data Preparation and Wrangling involve cleansing and organizing raw data
into a consolidated format.
⚫ Data Preparation (Cleansing) is the process of examining, identifying, and mitigating errors in
raw data; it includes addressing any missing values and verifying any out-of-range values.

⚫ Data Wrangling (Preprocessing) performs transformation and critical processing
steps on the cleansed data to make the data ready for ML model training, involving aggregating,
filtering, or extracting relevant variables.

Data Preparation (Cleansing)

➢ The possible errors in a raw dataset include the following:


⚫ Incompleteness error is where the data are not present, resulting in missing data. This
can be corrected by investigating alternate data sources.

⚫ Invalidity error is where the data are outside of a meaningful range, resulting in invalid
data. This can be corrected by verifying other administrative data records.

⚫ Inaccuracy error is where the data are not a measure of the true value. This can be
rectified with the help of business records and administrators.

Data Preparation (Cleansing)

➢ The possible errors in a raw dataset include the following:


⚫ Inconsistency error is where the data conflict with the corresponding data points or
reality. This contradiction should be eliminated by clarifying with another source.

⚫ Non-uniformity error is where the data are not present in an identical format. This can
be resolved by converting the data points into a preferable standard format.

⚫ Duplication error is where duplicate observations are present. This can be corrected by
removing the duplicate entries.

Data Preparation (Cleansing)

➢ Raw Data Before Cleansing

Data Preparation (Cleansing)

➢ Data After Cleansing

Data Wrangling (Preprocessing)
➢ Data preprocessing primarily includes transformations and scaling of
the data.
➢ Data transformations
⚫ Extraction: A new variable can be extracted from the current variable for ease of analyzing and using
for training the ML model.
⚫ Aggregation: Two or more variables can be aggregated into one variable to consolidate similar
variables.
⚫ Filtration: The data rows that are not needed for the project must be identified and filtered.
⚫ Selection: The data columns that are intuitively not needed for the project can be removed. This
should not be confused with feature selection, which is explained later.
⚫ Conversion: The variables can be of different types: nominal, ordinal, continuous, and categorical. The
variables in the dataset must be converted into appropriate types to further process and analyze
them correctly.
Data Wrangling (Preprocessing)
➢ Data After Applying Transformations

Data Wrangling (Preprocessing)
➢ Scaling is a process of adjusting the range of a feature by shifting and changing the
scale of data.

➢ Here are two of the most common ways of scaling:


⚫ Normalization is the process of rescaling numeric variables in the range of [0, 1].

✓ Sensitive to outliers, so treatment of outliers is necessary before normalization is performed.

✓ Used when the distribution of data is not known.

⚫ Standardization is the process of both centering and scaling the variables.

✓ Less sensitive to outliers as it depends on the mean and standard deviation of the data.

✓ The data must be normally distributed to use standardization.
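A minimal sketch of both scaling methods, assuming NumPy; the sample values are hypothetical:

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 20.0])

normalized = (x - x.min()) / (x.max() - x.min())   # rescaled to the range [0, 1]
standardized = (x - x.mean()) / x.std()            # centered (mean 0) and scaled (std 1)
print(normalized)
print(standardized)
```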


Data Wrangling (Preprocessing)
➢ Outliers may be present in the data, and domain knowledge is needed
to deal with them.
⚫ Any outliers that are present must first be identified.

⚫ The outliers then should be examined and a decision made to either remove or replace
them with values imputed using statistical techniques.

Data Wrangling (Preprocessing)
➢There are several practical methods for handling outliers.
⚫ When extreme values and outliers are simply removed from the dataset, it is
known as trimming (also called truncation).
✓ E.g., a 5% trimmed dataset is one for which the 5% highest and the 5% lowest values have
been removed.

⚫ When extreme values and outliers are replaced with the maximum (for large value
outliers) and minimum (for small value outliers) values of data points that are not
outliers, the process is known as winsorization.
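A minimal sketch contrasting trimming and winsorization at the 5th/95th percentiles, assuming NumPy; the sample data are hypothetical:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)  # 100 is an outlier

lo, hi = np.percentile(x, [5, 95])
trimmed = x[(x >= lo) & (x <= hi)]   # trimming: extreme observations are dropped
winsorized = np.clip(x, lo, hi)      # winsorization: extremes are capped at the bounds
print(trimmed)
print(winsorized)
```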

Data exploration
➢Data exploration is a crucial part of big data projects. The prepared data
are explored to investigate and comprehend data distributions and
relationships.
➢ Data exploration involves three important tasks: exploratory data analysis, feature
selection, and feature engineering.

Data exploration-EDA
➢ Exploratory data analysis (EDA) is the preliminary step in data exploration.
⚫ An important objective of EDA is to serve as a communication medium among
project stakeholders, including business users, domain experts, and analysts.

➢ Tools of EDA as follows:


⚫ Summary statistics
✓ Mean, variance , skewness, kurtosis

⚫ Visualizations
✓ Histograms, bar charts ,box plots ,density plots

Data exploration-EDA
➢ Visualizations

Data exploration-FS and FE
➢ After using EDA to discover relevant patterns in the data, it is essential to identify and
remove unneeded, irrelevant, and redundant features. Feature selection is a process
whereby only pertinent features from the dataset are selected for ML model training
such as PCA.
⚫ The objective of the feature selection process is to assist in identifying significant features that when used in
a model retain the important patterns and complexities of the larger dataset while requiring fewer data overall.

➢ Feature Engineering helps further optimize and improve the features


⚫ This action involves engineering an existing feature into a new feature or decomposing it into multiple
features.

⚫ The feature engineering process attempts to produce good features that describe the structures
inherent in the dataset.

➢ Model performance heavily depends on feature selection and engineering.


Model training
➢ Model training. This step involves determining the appropriate ML algorithm to use,

evaluating the algorithm using a training data set, and tuning the model. The choice of the

model depends on the nature of the relationship between the features and the target

variable.

Method Selection
➢ Selecting and applying a method or an algorithm is the first step of the

training process. Method selection is governed by the following factors:


⚫ Supervised or unsupervised learning.

⚫ Type of data.(numerical, text, speech, image)

⚫ Size of data.

✓ Number of feature and number of observation.

Performance Evaluation
➢ It is important to measure the model training performance or goodness of

fit for validation of the model.

➢ Several techniques to measure model performance that are well suited

specifically for binary classification models:


⚫ Error analysis

⚫ Receiver Operating Characteristic (ROC).

⚫ Root Mean Squared Error (RMSE).

Performance Evaluation
➢ Error analysis. For classification problems, error analysis involves computing four basic

evaluation metrics: true positive (TP), false positive (FP), true negative (TN), and false

negative (FN) metrics.

⚫ FP is also called a Type I error, and FN is also called a Type II error.

➢ Confusion matrix : Assume in the following explanation that Class “0” is “not defective”

and Class “1” is “defective.”

Performance Evaluation
➢ Elements in error analysis.
⚫ Precision is the ratio of correctly predicted positive classes to all predicted positive

classes.

Precision (P) = TP/(TP + FP).

⚫ Recall (also known as sensitivity) is the ratio of correctly predicted positive classes

to all actual positive classes.

Recall (R) = TP/(TP + FN).

Performance Evaluation
➢ Elements in error analysis.
⚫ Accuracy is the percentage of correctly predicted classes out of total predictions

Accuracy = (TP + TN)/(TP + FP + TN + FN).

⚫ F1 score is the harmonic mean of precision and recall.

F1 score = (2 * P * R)/(P + R).

✓ F1 score is more appropriate (than accuracy) when unequal class distribution is in the

dataset and it is necessary to measure the equilibrium of precision and recall.

Performance Evaluation
Case 1 Example

• Once satisfied with the final set of features, Steele selects and runs
a model on the training set that classifies the text as having positive
sentiment (Class “1”) or negative sentiment (Class “0”). She
then evaluates its performance using error analysis. The resulting
confusion matrix is presented in Exhibit.

                                Actual Training Results
                                Class “1”       Class “0”
Predicted    Class “1”          TP = 182        FP = 52
Result       Class “0”          FN = 31         TN = 96

Performance Evaluation
Case 1 Example

• Based on Exhibit, the model’s precision metric is closest to:

A. 78%

B. 81%

C. 85%

• Based on Exhibit , the model’s F1 score is closest to :

A. 77%

B. 81%

C. 85%

Performance Evaluation
Case 1 Example

• Based on Exhibit , the model’s accuracy metric is closest to :

A. 77%

B. 81%

C. 85%
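• Answers (worked from the formulas above):
Precision = 182/(182 + 52) ≈ 78%, so A.
Recall = 182/(182 + 31) ≈ 85%, and F1 = (2 × 0.778 × 0.854)/(0.778 + 0.854) ≈ 81%, so B.
Accuracy = (182 + 96)/(182 + 52 + 31 + 96) = 278/361 ≈ 77%, so A.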

Performance Evaluation
➢ Receiver Operating Characteristic (ROC). This technique for assessing model
performance involves the plot of a curve showing the trade-off between the
false positive rate (x-axis) and true positive rate (y-axis) for various cutoff points.
⚫ False positive rate (FPR) = FP/(TN + FP)
⚫ True positive rate (TPR) = TP/(TP + FN)

⚫ The shape of the ROC curve provides insight into the model’s

performance.

⚫ A more convex curve indicates better model performance.

⚫ Area under the curve (AUC) is the metric that measures the

area under the ROC curve.

⚫ An AUC close to 1.0 indicates near perfect prediction, while an

AUC of 0.5 signifies random guessing.

Performance Evaluation
➢ Root Mean Squared Error (RMSE). This measure is appropriate for continuous data

prediction and is mostly used for regression methods.

⚫ The root mean squared error is computed by finding the square root of the mean of the

squared differences between the actual values and the model’s predicted values (error).

⚫ A small RMSE indicates potentially better model performance.
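A minimal sketch of the RMSE computation, assuming NumPy; the actual and predicted values are hypothetical:

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.8, 5.4, 2.9, 6.5])

# Square root of the mean squared difference between predicted and actual values
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)
```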

Model Tuning
➢ Model fitting has two types of error: bias and variance. It is necessary to find an
optimum tradeoff between bias and variance errors, such that the model is neither
underfitting (bias error) nor overfitting (variance error).

⚫ It is not possible to completely eliminate both types of errors. However, both errors can be
minimized so that the total aggregate error (bias error + variance error) is at a minimum.

⚫ Parameters are critical for a model and are dependent on the training data. Parameters are

learned from the training data as part of the training process by an optimization technique.

⚫ Hyperparameters are used for estimating model parameters and are not dependent on the

training data.

Model Tuning
➢ Methods of model tuning
⚫ Grid search: the model is trained using different combinations of hyperparameter values until the
optimum set of values is found (a minimal sketch follows this list).

⚫ Ceiling analysis is a systematic process of evaluating different components in the

pipeline of model building. It helps to understand what part of the pipeline can

potentially improve in performance by further tuning.

✓ The performance of the larger model depends on performance of the sub- model(s). Ceiling

analysis can help determine which sub- model needs to be tuned to improve the overall

accuracy of the larger model.
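A minimal sketch of grid search, assuming scikit-learn's GridSearchCV; the estimator and the candidate hyperparameter values are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=4)

# Every combination of candidate hyperparameter values is trained and
# scored with 5-fold cross-validation; the best set is kept.
grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```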

Unstructured (Text) Data Analysis
➢ Unstructured, text-based data is more suitable for human use. The
first four of the five steps need to be modified in order to
analyze unstructured, text-based data.
⚫ Text problem formulation. The analyst will determine the problem and identify the exact
inputs and output of the model.
⚫ Data collection (curation). This involves determining the sources of data to be used (e.g., web scraping,
specific social media sites).
⚫ Text preparation and wrangling. This requires preprocessing the streams of unstructured
data to make it usable by traditional structured modeling methods.
⚫ Text exploration. This involves test visualization as well as text feature selection and
engineering.
⚫ Model training.
Text preparation and wrangling
➢ Unstructured data can be in the form of text, images, videos, and
audio files.
⚫ For analysis and use to train the ML model, the unstructured data must be transformed
into structured data.


Text Preparation (Cleansing)
➢ The following steps describe the basic operations in the text cleansing process.
⚫ Remove html tags: The initial task is to remove (or strip) the html tags that are not part of
the actual text using programming functions or using regular expressions.

⚫ Remove Punctuations: Most punctuations are not necessary for text analysis and should be
removed. However, some punctuations, such as percentage signs, currency symbols, and
question marks, may be useful for ML model training. These punctuations should be
substituted with such annotations as /percentSign/, /dollarSign/, and /questionMark/ to
preserve their grammatical meaning in the text.

⚫ Remove Numbers: When numbers (or digits) are present in the text, they should be
removed or substituted with an annotation /number/.

⚫ Remove white spaces: Extra formatting-related white spaces(e.g . tabs, indents) do not
serve any purpose in text processing and are removed.
Text Cleansing Process Example

Text Wrangling (Preprocessing)
➢ To further understand text processing, tokens and tokenization need to

be defined.
⚫ A token is equivalent to a word.

⚫ Tokenization is the process of splitting a given text into separate tokens. In other words, a

text is considered to be a collection of tokens. Tokenization can be performed at word or

character level, but it is most commonly performed at word level.

Text Wrangling (Preprocessing)
➢ Step 1: Normalization
⚫ Lowercasing ; This action helps the computers to process the same words appropriately (e.g., “The”
and “the”).

⚫ Removal of stop words; Stop words are such commonly used words as “the,” “is,” and “a.” Stop
words do not carry a semantic meaning for the purpose of text analyses and ML training.

⚫ Stemming is a rule-based approach that converts all variations of a word into a common value.
✓ For example, the stem of the words “analyzed” and “analyzing” is “analyz.” Similarly, the British English variant “analysing” would
become “analys.” Stemming is available in R and Python.

✓ While stemming makes the text confusing for human processing, it is ideally suited for machines.

⚫ Lemmatization is the process of converting inflected forms of a word into its morphological root
(known as lemma).
✓ Lemmatization is similar to stemming ,but is computationally more expensive and advanced. It is an algorithmic
approach and depends on the knowledge of the word and language structure.

Text Wrangling (Preprocessing)
➢ Step 2. After the cleansed text is normalized, a bag-of-words (BOW) is created.

⚫ BOW is simply a set of words and does not capture the position or sequence of words

present in the text. However, it is memory efficient and easy to handle for text analyses.

Text Wrangling (Preprocessing)
➢ The last step of text preprocessing is using the final BOW after normalizing to build a
document term matrix (DTM).
Text Wrangling (Preprocessing)
➢ If the sequence of text is important, N-grams can be used to represent word
sequences.
⚫ The length of a sequence can vary from 1 to n. When one word is used, it is a unigram; a
two-word sequence is a bigram; a three-word sequence is a trigram; and so on.

⚫ Consider the sentence, “The market is up today.”

✓ Bigrams of this sentence include “the_ market”, “market _is “,”is_up”, and “up_today”.
BOW is then applied to the bigrams instead of the original words.

✓ N-grams implementation will affect the normalization of the BOW because stop words
will not be removed.

Text Wrangling (Preprocessing)
➢ The advantage of n- grams is that they can be used in the same way as unigrams to build a BOW.

In practice, different n- grams can be combined to form a BOW and eventually be used to build

a DTM.

Text Wrangling (Preprocessing)
➢ Step 3: Build a document term matrix (DTM).
⚫ DTM is a matrix that is similar to a data table for structured data and is widely used for text data.

⚫ Each row of the matrix belongs to a document (or text file), and each column represents a token (or term). The

cells can contain the counts of the number of times a token is present in each document.

DTM
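A minimal sketch of building a DTM from a small corpus, assuming scikit-learn's CountVectorizer; the documents are hypothetical, and setting ngram_range=(2, 2) would build the matrix from bigrams instead:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the market is up today",
        "the market is down today",
        "analysts expect the market to rise"]

# Each row is a document, each column a token, each cell a count.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # get_feature_names() in older versions
print(dtm.toarray())
```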

Text Exploration - EDA
➢Various statistics are used to explore, summarize, and analyze text data.
⚫ Term frequency (TF), the ratio of the number of times a given token occurs in all the

texts in the dataset to the total number of tokens in the dataset.

⚫ Visualization such as word cloud can be applied.

Text Exploration - Feature Selection
➢ For text data, feature selection involves selecting a subset of the

terms or tokens occurring in the dataset. The tokens serve as features

for ML model training.


⚫ Feature selection in text data effectively decreases the size of the vocabulary or BOW.

This helps the ML model be more efficient and less complex.

⚫ Another benefit is to eliminate noisy features from the dataset. Noisy features are both

the most frequent and most sparse (or rare) tokens in the dataset.

Text Exploration - Feature Selection
➢ The general feature selection methods in text data are as follows:
⚫ Frequency measures can be used for vocabulary pruning to remove noise features by
filtering the tokens with very high and low TF values across all the texts.

⚫ Chi- square test is applied to test the independence of two events: occurrence of the token
and occurrence of the class. The test ranks the tokens by their usefulness to each class in
text classification problems.
✓ Tokens with the highest chi- square test statistic values occur more frequently in texts associated with a particular
class and therefore can be selected for use as features for ML model training due to higher discriminatory potential

⚫ Mutual information (MI) measures how much information is contributed by a token to a


class of texts.
✓ The mutual information value will be equal to 0 if the token’s distribution in all text classes is the same.

✓ The MI value approaches 1 as the token in any one class tends to occur more often in only that particular class of
text.
Text Exploration - Feature Engineering
➢ Techniques of FE include:
⚫ Numbers: tokens with standard lengths are identified and converted into a token such as
“/number/.” For example, numbers with four digits may indicate years and are assigned a value
of /number4/.

⚫ N- grams: Multi- word patterns that are particularly discriminative can be identified and

their connection kept intact.

⚫ Name entity recognition (NER): NER analyzes the individual tokens and their surrounding

semantics while referring to its dictionary to tag an object class to the token.

✓ For example, Microsoft would be assigned a NER tag of ORG and Europe would be assigned a NER tag of Place.

NER object class assignment is meant to make the selected features more discriminatory.

Text Exploration - Feature Engineering
➢ Techniques of FE include:
⚫ Parts of speech (POS): Similar to NER, parts of speech uses language structure and

dictionaries to tag every token in the text with a corresponding part of speech.

✓ For example, Microsoft would be assigned as a POS tag of NNP(indicating a proper noun), and the year 1969

would be assigned a POS tag of CD( indicating a cardinal number).

✓ For example, the word “market” can be a verb when used as “to market …” or noun when used as “in the

market.”

✓ Differentiating such tokens can help further clarify the meaning of the text.

Model Training and Evaluation
➢ Once the unstructured data have been processed and codified in a structured form such
as a data matrix, model training is similar to that for structured data. ML seeks to identify
patterns in the data set via a set of rules. Model fitting describes how well the model
generalizes to new data (i.e., how the model performs out of sample).

Summary

➢ Importance: ☆☆

➢ Content:
⚫ Big data introduction

⚫ Structured data analysis (data preparation and wrangling ; data exploration)

⚫ Unstructured data analysis (text preparation and wrangling ; text exploration )

➢ Exam tips:
⚫ The focus is on data preparation and preprocessing.

R9 Excerpt from “Probabilistic
Approaches: Scenario Analysis,
Decision Trees, and Simulations”

What we are going to learn?
The candidate should be able to:

 Describe steps in running a simulation;

 Explain three ways to define the probability distributions for a simulation’s variables;

 Describe how to treat correlation across variables in a simulation;

 Describe advantages of using simulations in decision making;

 Describe some common constraints introduced into simulations;

 Describe issues in using simulations in risk assessment;

 Compare scenario analysis, decision trees, and simulations.

Simulation
➢Steps in Simulation
⚫ Determine “probabilistic” variables
⚫ Define probability distributions for these variables
✓ Historical data
✓ Cross sectional data
✓ Statistical distribution and parameters
⚫ Check for correlation across variables
✓ When there is strong correlation, positive or negative, across inputs, you have two
choices.
 One is to pick only one of the two inputs to vary;
 The other is to build the correlation explicitly into the simulation;
⚫ Run the simulation
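A minimal sketch of these steps, assuming NumPy; the probabilistic inputs, their distributions, and the valuation formula are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(123)
n_trials = 100_000

# Probabilistic variables drawn from assumed distributions
revenue_growth = rng.normal(loc=0.05, scale=0.02, size=n_trials)
operating_margin = rng.uniform(low=0.10, high=0.20, size=n_trials)

base_revenue = 1_000.0
operating_income = base_revenue * (1 + revenue_growth) * operating_margin

# The output is a full distribution of outcomes, not a point estimate
print(operating_income.mean())
print(np.percentile(operating_income, [5, 95]))
```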
Simulation
➢ Advantage of using simulation in decision making
⚫ Better input estimation
⚫ It yields a distribution for expected value rather than a point estimate

➢ Simulations with constraints: by introducing a constraint into a simulation, we can evaluate the effectiveness of risk-hedging tools by examining the likelihood that the constraint will be violated (see the sketch after this list).
⚫ Book value constraints
✓ Regulatory capital restrictions (Financial service firms)
✓ Negative book value for equity

⚫ Earnings and cash flow constraints


✓ Either internally or externally imposed

⚫ Market value constraints


✓ Explicitly model the effect of distress on expected cash flows and discount rates.
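
As a hedged illustration, the following Python sketch estimates the likelihood of violating a book value constraint; the starting equity, earnings distribution, and horizon are all hypothetical:

import numpy as np

rng = np.random.default_rng(seed=7)
n_trials, n_years = 100_000, 5

# Hypothetical firm: book equity of 50, annual earnings ~ Normal(5, 15).
earnings = rng.normal(loc=5.0, scale=15.0, size=(n_trials, n_years))
book_equity = 50.0 + earnings.cumsum(axis=1)    # equity path in each trial

# Book value constraint: equity must stay positive in every year.
violated = (book_equity < 0).any(axis=1)
print("P(negative book equity):", violated.mean())
# Re-running with hedged (lower-variance) earnings would show whether the
# hedge reduces the likelihood of violating the constraint.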
www.zecaiedu.cn
Simulation

➢ Issues in using simulation


⚫ Garbage in, garbage out (GIGO)

⚫ Real data may not fit distributions

⚫ Non-stationary distributions

⚫ Changing correlation across inputs

www.zecaiedu.cn
Simulation

➢ Risk-Adjusted Value and Simulations


⚫ If the cash flows are expected cash flows and are not risk adjusted, we should be discounting these cash flows at a risk-adjusted rate.

⚫ When you use the standard deviation in values from a simulation as a measure of risk, also using a risk-adjusted discount rate will result in a double counting of risk.

www.zecaiedu.cn
Comparing The Approaches
➢ Selective versus full risk analysis
⚫ Scenario analysis: we will not have a complete assessment of all possible outcomes from risky investments or assets.

⚫ Decision trees and simulations: we attempt to consider all possible outcomes.

✓ Decision trees: we try to accomplish this by converting continuous risk into a manageable set of possible outcomes.

✓ Simulations: we use probability distributions to capture all possible outcomes.

⚫ Put in terms of probability, the sum of the probabilities of the scenarios we


examine in scenario analysis can be less than one, whereas the sum of the
probabilities of outcomes in decision trees and simulations has to equal one.

www.zecaiedu.cn
Comparing The Approaches
➢ Type of risk
⚫ Scenario analysis and decision trees: generally built around discrete outcomes in risky events
✓ Scenario analysis: easier to use when risks occur concurrently
✓ Decision trees: better suited for sequential risks.
⚫ Simulations: better suited for continuous risks.

➢ Correlation across risks


⚫ Simulations allow for explicitly modeling these correlations
⚫ In scenario analysis, we can deal with correlations subjectively by creating scenarios
that allow for them; the high (low) interest rate scenario will also include slower
(higher) economic growth.
⚫ Correlated risks are difficult to model in decision trees.
www.zecaiedu.cn
Comparing The Approaches
➢ The quality of the information
⚫ Since simulations are heavily dependent upon being able to assess probability distributions
and parameters, they work best in cases where there is substantial historical and cross
sectional data available.

www.zecaiedu.cn
Comparing with Risk-Adjusted Value
➢ Complement or Replacement for Risk-Adjusted Value
⚫ Both decision trees and simulations are approaches that can be used as either
complements to or substitutes for risk-adjusted value.

⚫ Scenario analysis, on the other hand, will always be a complement to risk-adjusted


value, since it does not look at the full spectrum of possible outcomes.

www.zecaiedu.cn
Conclusion
➢ In the most extreme form of scenario analysis, you look at the value in the best case and worst case scenarios
and contrast them with the expected value. In its more general form, you estimate the value under a small
number of likely scenarios, ranging from optimistic to pessimistic.

➢ Decision trees are designed for sequential and discrete risks, where the risk in an investment is considered in phases and the risk in each phase is captured in the possible outcomes and the probabilities that they will occur. A decision tree provides a complete assessment of risk and can be used to determine the optimal courses of action at each phase and an expected value for an asset today.

➢ Simulations provide the most complete assessments of risk since they are based upon probability distributions
for each input (rather than a single expected value or just discrete outcomes). The output from a simulation
takes the form of an expected value across simulations and a distribution for the simulated values.

➢ With all three approaches, the keys are to avoid double counting risk (by using a risk-adjusted discount rate
and considering the variability in estimated value as a risk measure) or making decisions based upon the wrong
types of risk.
www.zecaiedu.cn
Summary

➢ Importance:

➢ Content:
⚫ Simulation ( steps; advantages; risk-adjusted value)

⚫ Comparing The Approaches

⚫ Conclusion

➢ Exam tips:
⚫ The focus is on comparing and distinguishing the three approaches.

www.zecaiedu.cn
THANKS
No pain, no gain

www.zecaiedu.cn
