Chapter 2 - Quantitative Analysis
Level II
Quantitative Methods
Opal Xu
CONTENTS: Quantitative Methods
• R4 Introduction to Linear Regression
• R5 Multiple Regression
• R6 Time-Series Analysis
• R7 Machine Learning
• R8 Big Data Projects
Level I vs. Level II
➢ Level I mainly covers descriptive statistics plus the estimation and hypothesis-testing parts of inferential statistics. Level II focuses on regression, the prediction part of inferential statistics.
➢ Course characteristics and study advice:
⚫ The content is mathematical, but the exam tests it conceptually;
⚫ The logic builds step by step, so master each topic before moving on to the next;
⚫ Combine the lectures with practice questions, but mechanically grinding through large volumes of problems is not recommended.
R4 Introduction to Linear
Regression
What we are going to learn?
Calculate the predicted value for the dependent variable, given an estimated
regression model and a value for the independent variable;
Calculate and interpret a confidence interval for the predicted value of the
dependent variable;
The Basics of Simple Linear Regression
Scatter Plots
(Scatter plots of Variable B against Variable A illustrating different degrees of linear association, e.g., r = 0, −1 < r < 0, and r = −1.)
Sample Covariance and Correlation
➢ Covariance:
⚫ Covariance measures how one random variable moves with another random variable; it is one measure of linear association.
⚫ Sample covariance:
Cov(X, Y) = Σᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) / (n − 1)
➢ Correlation:
⚫ The correlation coefficient measures the direction and extent of linear association between two variables
r = Cov(X, Y) / (sX · sY)
⚫ Correlation has no units, ranges from –1 to +1
⚫ 0<r<1, positive linear association
⚫ -1<r<0, negative linear association
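A minimal numeric sketch of both formulas in plain Python (the data values are made up for illustration):
```python
# Sample covariance and correlation from first principles.
xs = [1.2, 2.0, 2.8, 3.5, 4.1]   # hypothetical X observations
ys = [2.3, 2.9, 3.6, 4.4, 4.9]   # hypothetical Y observations

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Cov(X, Y): sum of cross-products of deviations, divided by (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Sample standard deviations (also n - 1 in the denominator)
s_x = (sum((x - mean_x) ** 2 for x in xs) / (n - 1)) ** 0.5
s_y = (sum((y - mean_y) ** 2 for y in ys) / (n - 1)) ** 0.5

r = cov_xy / (s_x * s_y)   # correlation is unit-free and lies in [-1, +1]
print(cov_xy, r)
```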
Simple linear regression model
➢ The simple linear regression model
Yi = b0 + b1·Xi + εi,  i = 1, …, n
➢ Linear regression assumes a linear relation between the dependent and the independent
variables.
⚫ The dependent variable, Y is the variable whose variation about its mean is to be explained by
the regression.
⚫ The independent variable, X is the variable used to explain the dependent variable in a
regression.
⚫ Regression coefficients: b0 is the intercept term of the regression and b1 is the slope coefficient of the regression.
⚫ The error term, εi is the portion of the dependent variable that is not explained by the
independent variable(s) in the regression.
Calculation of Regression Coefficients
➢ How does linear regression estimate b0 and b1?
⚫ Computes a line that best fits the observations
⚫ Minimize the sum of the squared vertical distances between the observations and the
regression line
⚫ The estimated intercept coefficient ( b̂0 ) is interpreted as the value of Y when X is equal to
zero.
⚫ The estimated slope coefficient ( b̂1 ) measures the sensitivity of Y to a change in X. It equals the covariance of X and Y divided by the variance of X.
➢ Example of interpretation of estimated coefficients
⚫ An estimated slope coefficient of 2 would indicate that the dependent variable will change two
units for every 1 unit change in the independent variable.
⚫ An intercept term of 2% means that when the independent variable is zero, the expected value of the dependent variable is 2%.
Calculation of Regression Coefficients
➢ The estimated slope coefficient:
b̂1 = Cov(X, Y) / Var(X) = Σᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) / Σᵢ₌₁ⁿ (Xi − X̄)²
⚫ The estimated intercept coefficient ( b̂0 ): because the point (X̄, Ȳ) lies on the regression line, we can solve b̂0 = Ȳ − b̂1·X̄.
Calculation of Regression Coefficients
Case 1 Example: Calculate A Regression Coefficient
• Suppose an analyst wants to estimate the relation between a company's sales and exchange rates. The following table shows recent annual sales (in millions of Canadian dollars) and the average exchange rate for the Canadian dollar.
Calculation of Regression Coefficients
Case 1 Example: Calculate a Regression Coefficient
• The following table provides several useful calculations:
Year i | Xi = Exchange Rate | Yi = Sales | (Xi − X̄)² | (Yi − Ȳ)² | (Xi − X̄)(Yi − Ȳ)
1 | 0.40 | 20 | 0.0016 | 36 | −0.24
2 | 0.36 | 25 | 0.0000 | 1 | 0.00
3 | 0.42 | 16 | 0.0036 | 100 | −0.60
4 | 0.31 | 30 | 0.0025 | 16 | −0.20
5 | 0.33 | 35 | 0.0009 | 81 | −0.27
6 | 0.34 | 30 | 0.0004 | 16 | −0.08
Sum | 2.16 | 156 | 0.0090 | 250 | −1.39
Calculation of Regression Coefficients
Case 1 Example: Calculate A Regression Coefficient
• The sample mean of the exchange rate is:
X̄ = Σ Xi / n = 2.16 / 6 = 0.36
• The sample mean of sales is:
Ȳ = Σ Yi / n = 156 / 6 = 26
• The estimated slope coefficient is:
b̂1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² = −1.39 / 0.009 = −154.44, and
• the estimated intercept coefficient is:
b̂0 = Ȳ − b̂1·X̄ = 26 − (−154.44)(0.36) = 81.60
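The same arithmetic can be verified in a few lines of plain Python, a sketch using the table above:
```python
# Verify the Case 1 slope and intercept estimates.
xs = [0.40, 0.36, 0.42, 0.31, 0.33, 0.34]   # exchange rates
ys = [20, 25, 16, 30, 35, 30]               # sales, millions of CAD

n = len(xs)
x_bar = sum(xs) / n                          # 0.36
y_bar = sum(ys) / n                          # 26

b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)     # -1.39 / 0.009 = -154.44
b0 = y_bar - b1 * x_bar                      # 26 - (-154.44)(0.36) = 81.60
print(round(b1, 2), round(b0, 2))
```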
Assumptions of the Linear Regression Model
➢ The assumptions:
⚫ A linear relationship exists between X and Y
⚫ X is not random (alternatively, the condition that X is uncorrelated with the error term can substitute for the condition that X is not random)
⚫ The expected value of the error term is zero (i.e., E(εi)=0 )
⚫ The variance of the error term is constant (i.e., the error terms are
homoskedastic)
⚫ The error term is uncorrelated across observations (i.e., E(εiεj)=0 for all
i≠j)
⚫ The error term is normally distributed.
Analysis of Variance (ANOVA Table)
(Figure: scatter plot with the fitted regression line and intercept b0.)
Analysis of Variance(ANOVA table)
➢ ANOVA table
➢ Standard error of estimate: SEE = √( SSE / (n − 2) ) = √MSE
➢ Standard Error of Estimate (SEE) gives some indication of how certain we can
be about a particular prediction of Y using the regression equation.
⚫ SEE is low if the regression is very strong and high if the relationship is weak.
⚫ The SEE formula looks like the formula for computing a standard deviation,
except that n−2 appears in the denominator instead of n−1.
⚫ In fact, the SEE is the standard deviation of the error term, because the degrees of freedom of the error are n − 2.
Coefficient of Determination (R2)
➢ Coefficient of determination (R2) measures the fraction of the total variation in the
dependent variable that is explained by the independent variable.
⚫ Its limits are 0≤R2≤1;
⚫ Example: R2 of 0.8250 means the independent variable explains approximately
82.5 percent of the variation in the dependent variable.
➢ For reporting purposes, regression programs also report multiple R, the correlation between the actual and forecast values of Y; the coefficient of determination is the square of multiple R.
Confidence Interval for a Regression Coefficient
⚫ If the confidence interval with a given degree of confidence does not include
the hypothesized value, the null is rejected, and the coefficient is said to be
statistically different from hypothesized value.
⚫ Stronger regression results (usually lower SEE or higher R²) lead to a smaller standard error of an estimated coefficient (s_b1) and tighter confidence intervals.
Hypothesis Testing
Hypothesis Testing
Case 2 Hypothesis Testing
• An analyst ran a regression and got the following result:
(Regression output table: coefficient, t-statistic, p-value.)
➢ Two sources of uncertainty when using the regression model and the estimated parameters to make a prediction:
⚫ The error term itself contains uncertainty;
⚫ The estimated parameters (b̂0 and b̂1) are themselves uncertain.
➢ Point estimate
Ŷ = b̂0 + b̂1·X
Predicted Value of the Dependent Variable
s_f = SEE · √( 1 + 1/n + (X − X̄)² / ((n − 1)·sX²) ) = SEE · √( 1 + 1/n + (X − X̄)² / Σ(Xi − X̄)² )
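A sketch of how these pieces combine into a prediction interval (all regression inputs are hypothetical, and the critical t-value would come from a t-table with n − 2 degrees of freedom):
```python
import math

# Hypothetical regression summary statistics
n, SEE = 36, 3.6
x_bar, s2_x = 2.0, 0.5      # mean and sample variance of X
b0, b1 = 1.0, 2.5           # estimated coefficients
X = 3.0                     # value of X for the forecast
t_crit = 2.03               # two-tailed 5% critical value, df = n - 2 (from a table)

y_hat = b0 + b1 * X                                   # point estimate
s_f = SEE * math.sqrt(1 + 1/n + (X - x_bar)**2 / ((n - 1) * s2_x))
lower, upper = y_hat - t_crit * s_f, y_hat + t_crit * s_f
print(y_hat, (lower, upper))
```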
Limitations of Regression Analysis
➢ Regression relations can change over time, just as correlations can.
⚫ Parameter instability: the problem or issue of population regression parameters
that have changed over time.
Summary
➢ Importance: ☆☆☆
➢ Content:
⚫ Underlying assumptions of linear regression;
⚫ ANOVA;
➢ Exam tips:
⚫ Underlying assumptions; hypothesis tests on the regression coefficients.
R5 Multiple Regression
What we are going to learn?
Calculate and interpret 1) a confidence interval for the population value of a regression
coefficient and 2) a predicted value for the dependent variable, given an estimated
regression model and assumed values for the independent variables;
Regression Coefficient F-test
⚫ If none of the independent variables in a regression model helps explain the dependent variable, the slope coefficients should all equal 0.
⚫ In a multiple regression, however, we cannot test the null hypothesis that all slope coefficients equal 0 with t-tests, because the individual tests do not account for the effects of interactions among the independent variables.
➢ To test the null hypothesis that all of the slope coefficients in the
multiple regression model are jointly equal to 0,we must use an F-test.
⚫ The F-statistic measures how well the regression equation explains the variation in the
dependent variable.
Regression Coefficient F-test
➢ Define hypothesis:
⚫ H0: b1= b2= b3= … = bk=0
➢ F-statistic:
RSS
MSR k
F= =
MSE SSE
(n − k − 1)
⚫ k, n-k-1 are the degrees of freedom for an F-test
⚫ In a simple linear regression model, the F-test duplicates the t-test for the significance of the slope coefficient
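A small numeric sketch of the F-statistic built from hypothetical ANOVA quantities:
```python
# F-test that all slope coefficients jointly equal zero.
RSS, SSE = 120.0, 80.0      # regression and error sums of squares (hypothetical)
k, n = 3, 54                # number of slope coefficients, number of observations

MSR = RSS / k               # mean square regression
MSE = SSE / (n - k - 1)     # mean square error
F = MSR / MSE               # compare with the F(k, n-k-1) critical value (one-tailed)
print(F)                    # 40 / 1.6 = 25
```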
Regression Coefficient F-test
➢ Decision rule
⚫ Reject H0 : if F-statistic > Fα (k, n-k-1)
⚫ In particular, the F-statistic for testing the null hypothesis (that all the slope coefficients are equal to 0) has a value of 0 when the independent variables do not explain the dependent variable at all.
➢ The test assesses the effectiveness of the model as a whole in explaining the
dependent variable.
Analysis of Variance (ANOVA)
➢ ANOVA Table
➢ R2
⚫ The percentage of variation in the dependent variable that is collectively explained by all of the independent variables.
➢ Adjusted R2
⚫ In a multiple linear regression, R2 by itself is less appropriate as a
measure of whether a regression model fits the data well (goodness
of fit).
✓ We can increase R2 simply by including many additional independent variables
that explain even a slight amount of the previously unexplained variation, even if
the amount they explain is not statistically significant.
Adjusted R2
➢ Function of adjusted R²
⚫ Adjusted R² = 1 − [(n − 1) / (n − k − 1)] × (1 − R²); it adjusts R² for the number of independent variables.
⚫ A high adjusted R² does not necessarily mean the correct choice of variables.
Multiple Regression Assumption Violations
➢ Heteroskedasticity
➢ Serial correlation
➢ Multicollinearity
Multiple Regression Assumption Violations
➢ Heteroskedasticity: the variance of the error term is not constant across observations.
✓ Consistency: the larger the sample size, the lower the probability of estimation error.
Multiple Regression Assumption Violations
➢ Detecting heteroskedasticity
⚫ Two methods to detect heteroskedasticity: examining scatter plots of the residuals, and the Breusch-Pagan chi-square test.
➢ Tip: regress the squared residuals on the independent variables, with Rresidual² as the coefficient of determination of that regression; the Breusch-Pagan test statistic is n × Rresidual², distributed chi-square with k degrees of freedom.
Multiple Regression Assumption Violations
➢ Correcting heteroskedasticity
⚫ Compute robust standard errors (a.k.a. White-corrected standard errors) to correct the standard errors of the estimated coefficients.
⚫ Use generalized least squares, which modifies the original equation.
Multiple Regression Assumption Violations
➢ Serial correlation (autocorrelation)
⚫ Regression errors are correlated across observations.
✓ Positive serial correlation: a positive error for one observation increases the chance of a positive error for another observation (Cov(εi, εi+1) > 0).
✓ Negative serial correlation: a positive error for one observation increases the chance of a negative error for another observation; in other words, Cov(εi, εi+1) < 0.
Multiple Regression Assumption Violations
➢ Effect of serial correlation on regression analysis
⚫ Positive serial correlation → Type I error & F-test unreliable
✓ Not affect the consistency of estimated regression coefficients.
✓ F-statistic to test for overall significance of the regression may be inflated because
the mean squared error will tend to underestimate the population error variance.
✓ Standard errors for the regression coefficient are artificially small → the estimated
t-statistics to be overestimated →the prob. of type I error increased.
DW = Σ(t=2..T) (ε̂t − ε̂t−1)² / Σ(t=1..T) ε̂t² ≈ 2 × (1 − r)
✓ Decision rule (H0: no serial correlation), by range of DW from 0 to 4:
0 to dL: reject H0, conclude positive serial correlation
dL to dU: inconclusive
dU to 4 − dU: do not reject H0
4 − dU to 4 − dL: inconclusive
4 − dL to 4: reject H0, conclude negative serial correlation
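For reference, the statistic itself is straightforward to compute from the residuals; a sketch in plain Python with made-up residuals:
```python
# Durbin-Watson statistic from regression residuals.
resid = [0.5, 0.8, 0.3, -0.2, -0.6, -0.1, 0.4]   # hypothetical residuals

num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
den = sum(e ** 2 for e in resid)
dw = num / den        # approximately 2(1 - r); a value near 2 suggests no serial correlation
print(dw)             # compare with dL and dU from a Durbin-Watson table
```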
Multiple Regression Assumption Violations
➢Detecting positive serial correlation
⚫ Durbin-Watson test
✓ H0: No positive serial correlation
✓ DW ≈ 2×(1−r)
✓ Decision rule
Reject H0 and conclude positive serial correlation if DW < dL; the test is inconclusive if dL ≤ DW ≤ dU; fail to reject the null hypothesis of no positive serial correlation if DW > dU.
Multiple Regression Assumption Violations
➢ Methods to correct serial correlation
⚫ Adjust the coefficient standard errors for the linear regression parameter
estimates to account for the serial correlation (Recommended)
✓ The Hansen method adjusts the coefficient standard errors (it also corrects for conditional heteroskedasticity, if present)
Multiple Regression Assumption Violations
➢ Multicollinearity
⚫ Multicollinearity occurs when two or more independent variables (or
combinations of independent variables) are highly correlated with
each other.
Multiple Regression Assumption Violations
➢ Effect of multicollinearity on regression analysis
⚫ Does not affect the consistency of coefficient estimates, but it inflates the standard errors of the estimated coefficients
⚫ Classic symptoms used to detect multicollinearity:
✓ t-statistics indicate that none of the individual coefficients is significantly different from zero
✓ A significant F-statistic
✓ A high R²
⚫ Correction: exclude one or more of the correlated independent variables
Summary of Assumption Violations
Model Misspecification
➢ Misspecified functional form.
⚫ One or more important variables could be omitted from regression.
✓ e.g., if the true regression model is Yi = b0 + b1·X1i + b2·X2i + εi, but we estimate the model Yi = a0 + a1·X1i + εi
⚫ One or more of the regression variables may need to be transformed before
estimating the regression.
✓ e.g. Regress the natural logarithm of the variable
⚫ The regression model pools data from different samples that should not be pooled.
✓ e.g. Represent the relationship between two financial variables at two different
time periods
Model Misspecification
➢ Time-Series Misspecification (Regressors that are correlated
with the error term)
⚫ Including lagged dependent variables as independent variables in
regressions with serially correlated errors.
Model Misspecification
➢ Other Types of Time-Series Misspecification (Nonstationarity)
⚫ Relations among time series with trends (for example, the relation
between consumption and GDP).
⚫ Relations among time series that may be random walks (time series for
which the best predictor of next period’s value is this period’s value).
Exchange rates are often random walks.
Qualitative Independent Variables
➢Dummy variables
⚫ Dummy variables (taking the value 0 or 1) make it possible to use qualitative variables as independent variables in a regression.
⚫ To distinguish among n categories, the regression must include n − 1 dummy variables.
Qualitative Independent Variables
➢Illustrates the use of dummy variables
⚫ Returnst =b0 + b1Jant + b2Febt +···+ b11Novt + εt
✓ Returnst = a monthly observation of returns
✓ Jant = 1 if period t is in January, Jant = 0 otherwise
✓ Febt = 1 if period t is in February, Febt = 0 otherwise
✓ ···
✓ Novt =1 if period t is in the November, Novt =0 otherwise
⚫ The intercept, b0, measures the average return for stocks in December because
there is no dummy variable for December.
⚫ Each of the estimated coefficients for the dummy variables shows the estimated
difference between returns in that month and returns for December.
Qualitative Independent Variables
➢ Qualitative dependent variables are dummy variables used as dependent variables
instead of as independent variables.
⚫ Probit and logit models estimate the probability of a discrete outcome given the values
of the independent variables used to explain that outcome.
✓ The probit model, which is based on the normal distribution, estimates the probability that Y = 1 (a
condition is fulfilled) given the value of the independent variable X.
✓ The logit model is identical, except that it is based on the logistic distribution rather than the normal
distribution.
✓ Both models must be estimated using maximum likelihood methods.
⚫ Discriminant models yield a linear function, similar to a regression equation, which can
then be used to create an overall score, or ranking, for an observation. Based on the
score, an observation can be classified into the bankrupt or not bankrupt category.
Credit Analysis
➢ Z – score
Z = 1.2 A + 1.4 B + 3.3 C + 0.6 D + 1.0 E
Where:
A = WC / TA
B = RE / TA
C = EBIT / TA
D = MV of Equity / BV of Debt
E = Revenue / TA
⚫ If Z<1.81 → Bankruptcy.
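A quick sketch of the Z-score as a weighted sum of the five ratios (the company inputs are hypothetical):
```python
# Altman Z-score from the five accounting ratios.
def altman_z(wc_ta, re_ta, ebit_ta, mve_bvd, rev_ta):
    return (1.2 * wc_ta + 1.4 * re_ta + 3.3 * ebit_ta
            + 0.6 * mve_bvd + 1.0 * rev_ta)

z = altman_z(0.15, 0.20, 0.10, 1.1, 0.9)   # hypothetical company
print(z, "bankruptcy predicted" if z < 1.81 else "no bankruptcy predicted")
```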
Model Misspecification
Case 2 Example
Hansen is developing a regression model to predict the initial return for IPOs. He believes that for each 1 percent increase in the pre-offer price adjustment, the initial return will increase by less than 0.5 percent, holding other variables constant.
Before applying his model, Hansen asks a colleague, Phil Chang, to review its specification and results. After examining the model, Chang concludes that the model suffers from two problems: 1) conditional heteroskedasticity, and 2) omitted variable bias.
(The regression exhibit and the related multiple-choice questions are omitted here.)
Summary
➢ Importance: ☆☆☆
➢ Content:
⚫ Assumptions of multiple linear regression;
➢ Exam tips:
⚫ Hypothesis tests on the regression coefficients;
⚫ Exam questions are flexible, covering calculation of test statistics and judgment and interpretation of test results.
R6 Time-Series Analysis
What we are going to learn?
Calculate and evaluate the predicted trend value for a time series, modeled as either a linear trend or a log-linear trend, given the estimated trend coefficients;
Describe factors that determine whether a linear or a log-linear trend should be used with a particular time series and evaluate limitations of trend models;
Explain the requirement for a time series to be covariance stationary and describe the significance of a series that is not stationary;
Describe the structure of an autoregressive (AR) model of order p and calculate one- and two-period-ahead forecasts given the estimated coefficients;
Explain how autocorrelations of the residuals can be used to test whether the autoregressive model fits the time series;
Contrast in-sample and out-of-sample forecasts and compare the forecasting accuracy of different time-series models based on the root mean squared error criterion;
Describe characteristics of random walk processes and contrast them to covariance stationary processes;
Describe implications of unit roots for time-series analysis, explain when unit roots are likely to occur and how to test for them, and demonstrate how a time series with a unit root can be transformed so it can be analyzed with an AR model.
Trend Models
➢ Linear trend model
⚫ yt = b0 + b1·t + εt
⚫ Same as linear regression, except that the independent variable is time t (t = 1, 2, 3, …)
(Plot: yt versus t with a fitted linear trend.)
Trend Models
➢ Log-linear trend model
⚫ yt = e^(b0 + b1·t)
⚫ ln(yt) = b0 + b1·t + εt
⚫ Model the natural log of the series using a linear trend
⚫ Use the Durbin Watson statistic to detect autocorrelation
(Plots: yt versus t, and ln(yt) versus t.)
Trend Models
➢ How to select a trend model
⚫ A linear trend model may be appropriate if fitting a linear trend to a time series leads to uncorrelated errors.
⚫ A log-linear trend model may be more appropriate when the series grows at a roughly constant rate, so the data plot exhibits exponential growth.
Autoregressive Models (AR)
➢ An autoregressive model regresses the series on its own past values:
xt = b0 + b1·xt−1 + b2·xt−2 + … + bp·xt−p + εt
⚫ AR(p): AR model of order p (p indicates the number of lagged values that the autoregressive model includes as independent variables).
⚫ For example, a model with two lags is referred to as a second-order autoregressive model, or an AR(2) model.
Autoregressive Models (AR)
➢Multiperiod forecasts
⚫ Chain rule of forecasting
✓ The one-period-ahead forecast of xt from an AR(1) model is:
x̂t+1 = b̂0 + b̂1·xt
✓ If we want to forecast xt+2 using an AR(1) model, our forecast will be based on:
x̂t+2 = b̂0 + b̂1·x̂t+1
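A sketch of the chain rule in Python, borrowing the AR(1) coefficients from the mean-reversion example later in this reading:
```python
# Two-period-ahead forecast from an AR(1) model via the chain rule.
b0, b1 = 16.54, 0.65        # estimated AR(1) coefficients
x_t = 42.5                  # current value

x_t1 = b0 + b1 * x_t        # one-period-ahead forecast
x_t2 = b0 + b1 * x_t1       # two-period-ahead: reuse the forecast, not x_t
print(x_t1, x_t2)           # the forecasts trend toward the mean-reverting level
```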
Autoregressive Models (AR)
➢ Before forecasting with an autoregressive model, we should verify:
⚫ No autocorrelation
⚫ Covariance-stationary series
⚫ No conditional heteroskedasticity
Autocorrelation
➢ Autocorrelation in an AR model
⚫ When the error terms are correlated, standard errors are unreliable.
⚫ Durbin-Watson statistic is invalid when the independent variables include past values of the
dependent variable
⚫ Instead, test each residual autocorrelation with a t-test, using the standard error of the residual autocorrelation (1/√T)
⚫ If the residual autocorrelations differ significantly from 0, the model is not correctly specified, so
we may need to modify it (e.g. seasonality)
⚫ Correction: add lagged values
Autocorrelation
➢ Seasonality – a special question
⚫ Time series shows regular patterns of movement within the year
⚫ The seasonal autocorrelation of the residual will differ significantly from 0
⚫ We should use a seasonal lag in an AR model
⚫ For example: xt=b0+b1 xt-1+ b2 xt-4+εt, AR(1) model with a seasonal lag
Autocorrelation
Case 1 Example
Q1: Based on the t-statistics of the residual autocorrelations, is the model correctly specified?
Q2: If sales grew by 1 percent last quarter and by 2 percent four quarters ago, use the model to predict the sales growth for this quarter.
• Answer to Q1
• At the 0.05 significance level, with 68 observations and three parameters, this model has 65 degrees of freedom. The critical value of the t-statistic needed to reject the null hypothesis is thus about 2.0.
• The absolute value of the t-statistic for each residual autocorrelation is below 2.0, so we cannot reject the null hypothesis that each autocorrelation equals 0. We have determined that the model is correctly specified.
• Answer to Q2
• If sales grew by 1 percent last quarter and by 2 percent four quarters ago, then the model predicts that sales growth this quarter will be e^(0.0121 − 0.0839·ln(1.01) + 0.6292·ln(1.02)) − 1 = 2.40%.
Covariance-stationary
➢ Covariance-stationary series
⚫ Statistical inference based on OLS estimates for a lagged time series model assumes that the
time series is covariance stationary.
✓ The expected value of the time series must be constant and finite in all periods.
E(yt) = μ and |μ| < ∞, t = 1, 2, …, T
✓ The variance of the time series must be constant and finite in all periods.
✓ The covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods: Cov(yt, yt−s) = λs, |λs| < ∞, t = 1, 2, …, T; s = 0, 1, 2, …, T
Case: Mean-reverting level
Suppose a one-lag autoregressive model xt = b0 + b1·xt−1. The coefficients b0 and b1 are 16.54 and 0.65, respectively. If X is currently 42.5, what is the trend for X in the next period?
Referenced Answer
Mean-reverting level = b0 / (1 − b1) = 16.54 / (1 − 0.65) = 47.26 > 42.5, so X tends to increase in the next period.
Covariance-Stationary
⚫ The estimates can be different between models estimated using relatively shorter and
longer sample periods.
Covariance-stationary
➢ Random walk
⚫ Random walk without a drift
✓ xt =xt-1+εt (b0=0 and b1=1)
✓ The best forecast of xt that can be made in period t−1 is xt−1. (the expected value of εt is zero). In
fact, in this model, xt−1 is the best forecast of x in every period after t−1
✓ xt − xt−1 = εt
⚫ Random walk with a drift
✓ xt = b0 + xt−1 + εt (b0 ≠ 0, b1 = 1)
➢ Features
⚫ A random walk has an undefined mean reverting level
⚫ A time series must have a finite mean reverting level to be covariance stationary
⚫ A random walk, with or without a drift, is not covariance stationary
Unit Root Test
➢The unit root test of nonstationarity
⚫ The time series is said to have a unit root if the lag coefficient is equal to one; a series with a unit root (e.g., a random walk) is not covariance stationary.
⚫ A conventional t-test of the hypothesis that b1 = 1 is invalid for testing a unit root, because under that null the test statistic does not follow the usual t-distribution.
⚫ Dickey-Fuller test (DF test) for a unit root:
✓ Start with an AR(1) model xt = b0 + b1·xt−1 + εt and subtract xt−1 from both sides:
xt − xt−1 = b0 + (b1 − 1)·xt−1 + εt = b0 + g·xt−1 + εt
✓ Test H0: g = 0 (the series has a unit root) against Ha: g < 0, using revised (Dickey-Fuller) critical values for the t-statistic.
Unit Root Correction
➢ If a time series appears to have a unit root
⚫ One method that is often successful is to first-difference the time series (as
discussed previously) and try to model the first-differenced series as an
autoregressive time series.
➢ First differencing
⚫ Define yt as yt = xt − xt−1 = εt; the first-differenced series is covariance stationary, so it can be modeled with an AR model.
Autoregressive Conditional Heteroskedasticity
➢ Heteroskedasticity refers to the situation that the variance of the error term is not
constant.
➢ Test whether a time series is ARCH(1)
⚫ Regress the squared residuals on their own first lag: ε̂t² = a0 + a1·ε̂t−1² + ut
⚫ If the estimate of a1 is statistically significantly different from zero, we conclude that the
time series is ARCH(1).
✓ If a time-series model has ARCH(1) errors, then the variance of the errors in period t + 1 can be predicted in
period t.
➢ If ARCH exists,
⚫ the standard errors for the regression parameters will not be correct. Generalized least
squares must be used to develop a predictive model.
⚫ ARCH model can be used to predict the variance of the residuals in future periods.
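A sketch of the test: regress the squared residuals on their own first lag with plain-Python OLS and then examine whether a1 is significant (the residual values are made up):
```python
# ARCH(1) test: regress e_t^2 on e_{t-1}^2 and check the significance of a1.
resid = [0.5, -0.8, 1.2, -0.3, 0.9, -1.5, 0.2, 0.7, -1.1, 0.4]  # hypothetical
sq = [e ** 2 for e in resid]
y, x = sq[1:], sq[:-1]            # e_t^2 regressed on e_{t-1}^2

n = len(y)
x_bar, y_bar = sum(x) / n, sum(y) / n
a1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
a0 = y_bar - a1 * x_bar
# A statistically significant a1 (t-test, df = n - 2) indicates ARCH(1) errors.
print(a0, a1)
```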
Compare Forecasting Power with RMSE
➢ Comparing forecasting model performance
⚫ In-sample forecast errors are the residuals from a fitted time-series model.
⚫ Out-of-sample forecast errors are the differences between actual values and forecasts made for periods outside the sample used to fit the model.
⚫ The model with the smaller root mean squared error (RMSE) of out-of-sample forecasts is the more accurate forecasting model.
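A sketch of the RMSE criterion (the actual and forecast values are hypothetical):
```python
import math

# RMSE of out-of-sample forecast errors; a smaller RMSE means a better model.
actual   = [4.1, 3.8, 5.0, 4.6]     # hypothetical realized values
forecast = [3.9, 4.0, 4.7, 4.8]     # the model's out-of-sample forecasts

rmse = math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))
print(rmse)
```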
Regression with More Than One Time Series
➢ In linear regression, if any time series contains a unit root, OLS may
be invalid.
➢ Use DF tests for each of the time series to detect unit root, we will
have 3 possible scenarios.
⚫ None of the time series has a unit root: we can use multiple regression
⚫ At least one time series has a unit root while at least one time series does not: we
cannot use multiple regression
⚫ Each time series has a unit root: we need to establish whether the time series are
cointegrated.
✓ If cointegrated, we can estimate the long-term relation between the two series (but it may
not be the best model of the short-term relationship between the two series).
Regression with More Than One Time Series
➢ Use the Dickey-Fuller Engle-Granger test (DF-EG test) to test the
cointegration
⚫ H0: no cointegration Ha: cointegration
⚫ If we cannot reject the null, we cannot use multiple regression
⚫ If we can reject the null, we can use multiple regression
⚫ Critical value calculated by Engle and Granger
Steps in Time-Series Forecasting
➢ Model-building roadmap for trend models:
⚫ Plot the series and determine whether it has a trend.
⚫ If there is a trend, fit a linear trend or an exponential (log-linear) trend, as appropriate; if there is no trend, check for seasonality and other patterns.
⚫ Use the DW test to determine whether the trend residuals are serially correlated.
⚫ If there is no serial correlation, use the trend model; if the residuals are serially correlated, use an AR model instead.
Steps in Time-Series Forecasting
➢ Model-building roadmap for AR models:
⚫ Is the series covariance stationary? If not, take first differences until it is; if so, begin by estimating an AR(1) model.
⚫ Are the residuals serially correlated? If so, keep adding lags until the residual autocorrelations are no longer significant.
⚫ Is seasonality present? If so, add the corresponding seasonal lag.
⚫ Use an ARCH test to check the residuals for heteroskedasticity; if it is present, adjust the model using generalized least squares.
⚫ Once the model is built, test its out-of-sample forecasting power.
Time-Series analysis
Case 2 Example
⚫ Angela Martinez, an energy sector analyst at an investment bank, is concerned about the future
level of oil prices and how it might affect portfolio values. She is considering whether to
recommend a hedge for the bank portfolio's exposure to changes in oil prices. Martinez
examines West Texas Intermediate (WTI) monthly crude oil price data, expressed in US dollars
per barrel, for the 181-month period from August 2000 through August 2015. The end-of-month
WTI oil price was $51.16 in July 2015 and $42.86 in August 2015 (Month 181).
⚫ After reviewing the time-series data, Martinez determines that the mean and variance of the time
series of oil prices are not constant over time. She then runs the following four regressions using
the WTI time-series data. (Exhibit 1 presents selected data from all four regressions)
Exhibit 2. Autocorrelations of the residuals from the AR(1) model:
Lag | Autocorrelation | t-Statistic
1 | 0.4157 | 5.5768
2 | 0.2388 | 3.2045
3 | 0.0336 | 0.4512
4 | −0.0426 | −0.5712
Note: At the 5% significance level, the critical value for a t-statistic is 1.97.
After reviewing the data and regression results, Martinez draws the following conclusions.
Conclusion 1: The time series for WTI oil prices is covariance stationary.
Conclusion 2: Out-of-sample forecasting using the AR(1) model appears to be more accurate than that of the AR(2) model.
Time-Series analysis
Case 3 Example
1. Based on Exhibit 1, the predicted WTI oil price for October 2015 using the linear
trend model is closest to:
A. $29.15.
B. $74.77.
C. $103.10.
2. Based on Exhibit 1, the predicted WTI oil price for September 2015 using the log-
linear trend model is closest to:
A. $29.75.
B. $29.98.
C. $116.50.
Time-Series analysis
Case 3 Example
3. Based on the regression output in Exhibit 1, there is evidence of positive serial
correlation in the errors in:
A. the linear trend model but not the log-linear trend model.
B. both the linear trend model and the log-linear trend model.
C. neither the linear trend model nor the log-linear trend model.
4. Based on the regression results, Martinez's Conclusion 1 is:
A. correct.
B. incorrect because the mean and variance of WTI oil prices are not constant over time.
C. incorrect because the Durbin-Watson statistic of the AR(2) model is greater than 1.75.
Time-Series analysis
Case 3 Example
5. Based on Exhibit 1, the forecasted oil price in September 2015 based on
the AR(2) model is closest to:
A. $38.03.
B. $40.04.
C. $61.77.
6. Based on the data for the AR(1) model in Exhibits 1 and 2, Martinez can
conclude that the:
B. equal to $42.86.
Summary
➢ Importance: ☆☆☆
➢ Content:
⚫ Linear trend model & log-linear trend model ; Limitation of trend models;
⚫ Mean reversion
➢ Exam tips:
⚫ Calculating the mean-reverting level;
R7 Machine Learning
What we are going to learn?
The candidate should be able to:
Distinguish between supervised machine learning, unsupervised machine learning, and deep learning;
Describe overfitting and identify methods of addressing it;
➢ Machine learning seeks to extract knowledge from large amounts of data with no such restrictive statistical assumptions. The goal of machine learning algorithms is to automate decision-making processes by generalizing (i.e., "learning") from known examples to determine an underlying structure in the data.
⚫ Machine learning techniques are better able than statistical approaches to handle problems with many variables (high dimensionality) or with a high degree of non-linearity.
Machine Learning Algorithms
⚫ In regression analysis, the Y variable is the dependent variable and the X variables are the independent variables.
⚫ In machine learning, the Y variable is the target variable (or tag) and the X variables are the features.
Overview of Supervised Learning
Types of Machine Learning
➢ Supervised learning
➢ Unsupervised learning
Types of Machine Learning
➢ Supervised learning
⚫ Supervised learning requires a labeled data set, one that contains matched sets of observed
inputs and the associated output.
⚫ Applying the ML algorithm to this data set to infer the pattern between the inputs and output is
called “training” the algorithm.
⚫ Once the algorithm has been trained, the inferred pattern can be used to predict output values
based on new inputs (i.e., ones not in the training data set).
⚫ Two categories of supervised learning
✓ Regression models: making predictions of continuous target variables.
Multiple regression is an example of supervised learning.
✓ Classification problems: sorting observations into distinct categories.
Binary classification, e.g., likely to default vs. not likely to default
Multicategory classification, e.g., bond ratings
Types of Machine Learning
➢ Unsupervised learning
⚫ Unsupervised learning is machine learning that does not make use of labeled data.
Deep Learning and Reinforcement Learning
➢ Deep Learning and Reinforcement Learning
⚫ Neural networks(NNs, also called artificial neural networks, or ANNs) include highly flexible
ML algorithms that have been successfully applied to a variety of tasks characterized by
non-linearities and interactions among features.
⚫ Deep learning and reinforcement learning are themselves based on neural networks.
✓ In deep learning, sophisticated algorithms address highly complex tasks, such as image classification, face recognition, speech recognition, and natural language processing.
✓ In reinforcement learning, a computer learns from interacting with itself (or data
generated by the same algorithm).
Summary of ML Algorithms
Data Sets
➢ To measure how well a model generalizes, data analysts create three nonoverlapping data sets:
⚫ Training sample (used to develop the model)
✓ In-sample prediction errors occur with the training sample.
⚫ Validation sample (used for validating and tuning the model)
⚫ Test sample (used for evaluating the model using new data)
✓ Out-of-sample prediction errors occur with the validation and test samples.
Overfitting
➢ Underfitting means the model does not capture the relationships in the data.
➢ Overfitting means the model fits the training data too well, treating noise as if it were signal, and so does not generalize well to new data.
➢ A good fit/robust model fits the training (in-sample) data well and generalizes well to out-of-sample data, both within acceptable degrees of error.
Overfitting
➢ Data scientists decompose the total out-of-sample error into three sources:
⚫ Bias error, or the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and high in-sample error.
⚫ Variance error, or how much the model's results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance, causing overfitting and high out-of-sample error.
⚫ Base error, due to randomness in the data.
Overfitting
➢ A fitting curve, which shows in- and out- of- sample
error rates (Ein and Eout) on the y- axis plotted
against model complexity on the x- axis.
⚫ Variance error increases with model complexity;
⚫ Bias error decreases with complexity;
⚫ Linear functions are more susceptible to bias error and
underfitting;
⚫ Non-linear functions are more prone to variance error
and overfitting.
⚫ An optimal level of complexity minimizes the total
error and is a key part of successful model
generalization .
Preventing Overfitting
➢ Two common guiding principles and two methods are used to reduce
overfitting:
⚫ Preventing the algorithm from getting too complex during selection and training, which
requires estimating an overfitting penalty;
✓ In supervised machine learning, it means limiting the number of features and penalizing
algorithms that are too complex or too flexible by constraining them to include only
parameters that reduce out- of- sample error.
⚫ Proper data sampling achieved by using cross- validation, a technique for estimating
out- of- sample error directly by determining the error in validation samples.
Preventing Overfitting
➢ In cross- validation techniques, out-of-sample error is estimated directly
by determining the error in validation samples.
⚫ One such technique is k- fold cross- validation
✓ In which the data (excluding test sample and fresh data) are shuffled randomly and
then are divided into k equal sub- samples, with k – 1 samples used as training samples
and one sample, the kth, used as a validation sample.
✓ Note that k is typically set at 5 or 10.
✓ This process is then repeated k times, which helps minimize both bias and variance by ensuring that each data point is used in the training set k − 1 times and in the validation set once.
✓ The average of the k validation errors (mean Eval) is then taken as a reasonable estimate of the model's out-of-sample error (Eout).
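A sketch of k-fold splitting in plain Python; `train_and_validate` is a hypothetical stand-in for fitting the model on the training folds and returning its validation error:
```python
import random

def k_fold_errors(data, k, train_and_validate):
    """Shuffle, split into k folds, and average the k validation errors."""
    data = data[:]                       # copy so the caller's list is untouched
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        validation = folds[i]            # the i-th fold validates...
        training = [obs for j, f in enumerate(folds) if j != i for obs in f]
        errors.append(train_and_validate(training, validation))
    return sum(errors) / k               # mean E_val, an estimate of E_out
```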
Supervised Learning Algorithms
➢ Supervised machine learning models are trained using labeled data, and
depending on the nature of the target (Y) variable, they can be divided into two
types: regression for a continuous target variable and classification for a
categorical or ordinal target variable.
⚫ Penalized Regression
Penalized Regression
➢ Penalized Regression
⚫ Reduce the problem of overfitting by imposing a penalty term.
⚫ LASSO (least absolute shrinkage and selection operator) automatically performs feature selection.
⚫ In addition to minimizing the sum of the squared residuals, LASSO also involves minimizing the sum of the absolute values of the regression coefficients.
✓ Lambda (λ) is a hyperparameter that determines the balance between fitting the model and keeping the model parsimonious.
➢ Regularization describes methods that reduce statistical variability in high dimensional data
estimation problems.
⚫ Regularization can be applied to non-linear models.
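In symbols, the LASSO estimate minimizes the penalized sum of squared residuals (a standard formulation; λ ≥ 0 is the hyperparameter):
```latex
\hat{\beta}^{\text{LASSO}} = \arg\min_{\beta}
\left[ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\Bigr)^{2}
+ \lambda \sum_{j=1}^{k}\lvert \beta_j \rvert \right]
```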
Support Vector Machine( SVM )
➢ SVM
⚫ Support vector machine (SVM) is a linear classifier that determines the hyperplane that optimally
separates the observations into two sets of data points .
⚫ It is a powerful supervised algorithm used for classification, regression, and outlier detection.
Support Vector Machine( SVM )
➢ SVM
⚫ SVM maximizes the probability of making a correct prediction by determining the boundary that is
farthest away from all the observations.
Support Vector Machine( SVM )
➢ Many real-world data sets, however, are not perfectly linearly separable , in
that case, soft margin classification is applied.
⚫ This adaptation adds a penalty to the objective function for observations in the training set that are misclassified; it optimizes the trade-off between a wider margin and classification error.
K- Nearest Neighbor
➢ K- nearest neighbor (KNN) is a supervised learning technique most often used
for classification and sometimes for regression. The idea is to classify a new
observation by finding similarities (“nearness”) between this new observation
and the existing data.
K- Nearest Neighbor
➢ Two vital concerns
⚫ A critical challenge of KNN, however, is defining what it means to be “similar”(or near).
⚫ The researcher must specify k, the hyperparameter of the model, with an understanding of the data and the problem.
E.g., for predicting the credit rating of an unrated bond, should k be the 3, 15, or 50 most similar bonds to the unrated bond?
✓ If k is an even number, there may be ties and no clear classification;
✓ Choosing a value for k that is too small would result in a high error rate and sensitivity to local outliers;
✓ Choosing a value for k that is too large would dilute the concept of nearest neighbors by averaging too many outcomes.
Classification and Regression Tree (CART)
➢ Classification and regression trees (CART)
⚫ Classification trees are appropriate when the target variable is categorical.
✓ Typically used when the target is binary (e.g., an IPO will be successful vs. not successful).
Classification and Regression Tree (CART)
➢ CART—Decision Tree and Partitioning of
the Feature Space
Classification and Regression Tree (CART)
➢ The CART algorithm chooses the feature and the cutoff value at each node that
generates the widest separation of the labeled data to minimize classification error.
⚫ From the initial root node, the data are partitioned at decision nodes into smaller and smaller subgroups until terminal nodes are formed that contain the predicted labels, so that observations in each group share the same predicted label or value.
⚫ At any level of the tree, when the classification error does not diminish much more from another split (bifurcation), the process stops, the node is a terminal node, and the category in the majority at that node is assigned to it.
⚫ If the objective of the model is classification, then the prediction of the algorithm at each terminal node will be the category with the majority of data points.
⚫ If the goal is regression, then the prediction at each terminal node is the mean of the labeled values.
Classification and Regression Tree (CART)
➢ CART is a popular supervised machine learning model because the tree provides a visual explanation for the prediction. This contrasts favorably with other algorithms that are often considered to be "black boxes."
➢ To avoid overfitting
⚫ regularization criteria can be added (e.g., a maximum depth of the tree), or sections of the tree with little classifying power can be pruned.
Ensemble Learning and Random Forest
➢ This technique of combining the predictions from a collection of models is called ensemble learning.
⚫ Ensemble learning typically produces more accurate and more stable predictions than any single model.
⚫ An ensemble can combine different types of algorithms (a voting classifier), or it can combine the same algorithm using different training data that are based, for example, on a bootstrap aggregating (bagging) procedure.
Voting Classifiers
➢ Suppose you have been working on a machine learning project for some time and
have trained and compared the results of several algorithms, such as SVM, KNN,
and CART. A majority- vote classifier will assign to a new data point the predicted
label with the most votes .
⚫ For example, if the SVM and KNN models are both predicting the category "stock outperformance" while the CART model predicts "stock underperformance," the majority-vote classifier predicts "stock outperformance."
⚫ The more individual models you have trained, the higher the accuracy of the aggregated prediction tends to be, provided the models are reasonably independent.
Bootstrap Aggregating (Bagging)
➢ Bootstrap aggregating (or bagging) is a technique whereby the original training data
set is used to generate n new training data sets or bags of data.
⚫ Each new bag of data is generated by random sampling with replacement from the initial training data set.
⚫ The algorithm can now be trained on n independent data sets that will generate n new models.
⚫ Then, for each new observation, we can aggregate the n predictions using a majority-vote classifier (for classification) or an average (for regression).
Random Forest
➢ A random forest classifier is a collection of a large number of decision trees trained
via a bagging method.
⚫ For example, a CART algorithm would be trained using each of the n independent data sets (from the
bagging process) to generate the multitude of different decision trees that make up the random forest
classifier.
⚫ For any new observation, we let all the classifier trees (the “random forest”) undertake classification
by majority vote—implementing a machine learning version of the “wisdom of crowds.”
⚫ The process involved in random forest tends to protect against overfitting on the training data. It also
reduces the ratio of noise to signal because errors cancel out across the collection of slightly different
classification trees.
⚫ However, an important drawback of random forest is that it lacks the ease of interpretability of individual trees; as a result, it is considered a relatively black-box-type algorithm.
Unsupervised Learning Algorithms
➢ Unsupervised learning is machine learning that does not use labeled
data (i.e., no target variable); thus, the algorithms are tasked with
finding patterns within the data themselves .
⚫ Dimension reduction
✓ Principal components analysis (PCA)
⚫ Clustering
✓ K-means
✓ Hierarchical clustering
Principal Components Analysis
➢ Dimension reduction. Problems associated with too much noise often arise
when the number of features in a data set (i.e., its dimension) is excessive.
⚫ The eigenvectors define new, mutually uncorrelated composite variables that are linear
combinations of the original features. As a vector, an eigenvector also represents a
direction. Associated with each eigenvector is an eigenvalue.
⚫ An eigenvalue gives the proportion of total variance in the initial data that is explained by
each eigenvector.
Principal Components Analysis
➢ The PCA algorithm orders the eigenvectors from highest to lowest according to their
eigenvalues—that is, in terms of their usefulness in explaining the total variance in the initial data.
➢ PCA selects as the first principal component the eigenvector that explains the largest proportion
of variation in the data set (the eigenvector with the largest eigenvalue).
➢ The second principal component explains the next largest proportion of variation remaining after
the first principal component; it continues for the third, fourth, and subsequent principal
components.
➢ As the principal components are linear combinations of the initial feature set, only a few principal
components are typically required to explain most of the total variance in the initial feature
covariance matrix.
➢ In practice, the smallest number of principal components that should be retained is that which the
scree plot shows as explaining 85% to 95% of total variance in the initial data set.
Principal Components Analysis
➢ The main drawback of PCA is that since the principal components are combinations
of the data set’s initial features, they typically cannot be easily labeled or directly
interpreted by the analyst. Compared to modelling data with variables that
represent well- defined concepts, the end user of PCA may perceive PCA as
something of a “black box.”
Clustering
➢ Clustering is used to organize data points into similar groups called clusters.
⚫ A cluster contains a subset of observations from the data set such that all the observations
within the same cluster are deemed “similar.”
⚫ The aim is to find a good clustering of the data—meaning that the observations inside each
cluster are similar or close to each other (a property known as cohesion) and the observations in
two different clusters are as far away from one another or are as dissimilar as possible (a
property known as separation).
✓ Euclidean distance, the straight-line distance between two observations, is one common metric that is used.
✓ The smaller the distance, the more similar the observations; the larger the distance, the
more dissimilar the observations.
K- Means Clustering
➢ K- means is a relatively old algorithm that repeatedly partitions observations into a
fixed number, k, of non- overlapping clusters. The number of clusters, k, is a model
hyperparameter—a parameter whose value must be set by the researcher before
learning begins.
➢ Each cluster is characterized by its centroid (i.e., center), and each observation is
assigned by the algorithm to the cluster with the centroid to which that observation
is closest.
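A toy sketch of the two-step iteration (assignment, then centroid update) on one-dimensional data; production work would use a library implementation instead:
```python
def k_means_1d(points, centroids, iters=10):
    """Toy 1-D k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

print(k_means_1d([1.0, 1.2, 0.8, 5.0, 5.3, 4.9], [0.0, 6.0]))
```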
K- Means Clustering
➢ The k- means algorithm is fast and works well on very large data sets with hundreds
of millions of observations.
➢ However, the final assignment of observations to clusters can depend on the initial
location of the centroids.
⚫ To address this problem, the algorithm can be run several times using different sets of initial
centroids, and then one can choose the clustering that is most useful given the business purpose.
Hierarchical Clustering: Agglomerative and Divisive
➢ Hierarchical clustering is an iterative procedure used to build a hierarchy of
clusters.
⚫ Agglomerative clustering (or bottom-up) hierarchical clustering begins with each
observation being treated as its own cluster.
⚫ Divisive clustering (or top- down) hierarchical clustering starts with all the
observations belonging to a single cluster.
Neural Networks
➢ Neural networks (also called artificial neural networks, or ANNs) are a highly flexible type
of ML algorithm that have been successfully applied to a variety of tasks characterized by
non- linearities and complex interactions among features.
➢ Neural networks are commonly used for classification and regression supervised learning
but are also important in reinforcement learning, which can be unsupervised.
➢ A neural network consists of an input layer, hidden layers, and an output layer:
⚫ Input layer, which receives the feature inputs;
⚫ Hidden layers, where learning occurs in training and inputs are processed on trained nets;
⚫ Output layer (here consisting of a single node for the target variable y), which passes information to outside the network.
Neural Networks
➢ Note that for neural networks, the
feature inputs would be scaled
(i.e., standardized) to account for
differences in the units of the data.
For example, if the inputs were
positive numbers, each could be
scaled by its maximum value so that
their values lie between 0 and 1.
Neural Networks
➢ Each node has, conceptually, two functional parts: a summation operator and an
activation function.
⚫ Once the node receives the four input values, the summation operator multiplies each
value by a weight and sums the weighted values to form the total net input.
⚫ The total net input is then passed to the activation function, which transforms this
input into the final output of the node.
✓ Informally, the activation function operates like a light dimmer switch that decreases or
increases the strength of the input.
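A sketch of a single node with a sigmoid activation (weights, bias, and inputs are hypothetical):
```python
import math

def node_output(inputs, weights, bias=0.0):
    # Summation operator: weighted sum of the inputs (total net input)
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation function: sigmoid squashes the net input into (0, 1)
    return 1.0 / (1.0 + math.exp(-net))

print(node_output([0.5, 0.1, 0.9, 0.4], [0.2, -0.6, 0.8, 0.1]))
```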
Neural Networks
➢ Forward propagation
➢ This activation function is shown in Exhibit 20,
where in the left graph a negative total net input is
transformed via the S- shaped function into an
output close to 0. This low output implies the node
does not “fire,” so there is nothing to pass to the
next node. Conversely, in the right graph a positive
total net input is transformed into an output close
to 1, so the node does fire. The output of the
activation function is then transmitted to the next
set of nodes if there is a second hidden layer or, as
in this case, to the output layer node as the
predicted value.
Neural Networks
➢ Starting with an initialized set of random network weights, training a neural network in a
supervised learning context is an iterative process in which predictions are compared to
actual values of labeled data and evaluated by a specified performance measure (e.g., mean
squared error).
➢ Then, network weights are adjusted to reduce total error of the network.
➢ If the process of adjustment works backward through the layers of the network, this
process is called backward propagation.
➢ Learning takes place through this process of adjustment to network weights with the aim of
reducing total error.
Deep Learning Nets
➢ Neural networks with many hidden layers—at least 3 but often more than 20 hidden layers—
are known as deep learning nets (DLNs) and are the backbone of the artificial intelligence
revolution.
➢ Advances in DLNs have driven developments in many complex activities, such as image,
pattern, and speech recognition.
➢ Some researchers have used DLNs known as multi-layer perceptrons (or feed-forward networks) with many nodes (sometimes over 1,000) split over several hidden layers to predict corporate fundamental factors and price-related technical factors.
Reinforcement Learning
➢ Reinforcement learning (RL) is an algorithm that made headlines in 2017 when DeepMind’s
AlphaGo program beat the reigning world champion at the ancient game of Go. The RL algorithm
involves an agent that should perform actions that will maximize its rewards over time, taking into
consideration the constraints of its environment.
➢ In the case of AlphaGo, a virtual gamer (the agent) uses his/her console commands (the actions)
with the information on the screen (the environment) to maximize his/ her score (the reward).
➢ Unlike supervised learning, reinforcement learning has neither direct labeled data for each
observation nor instantaneous feedback.
➢ With RL, the algorithm needs to observe its environment, learn by testing new actions (some of
which may not be immediately optimal), and reuse its previous experiences. The learning
subsequently occurs through millions of trials and errors.
Summary
➢ Supervised learning
⚫ Penalized regression
⚫ Support vector machine (SVM)
⚫ K-nearest neighbor ( KNN)
⚫ Classification and regression tree(CART) algorithms
⚫ Ensemble and Random forest
➢ Unsupervised learning
⚫ Dimension reduction
✓ Principal components analysis
⚫ Clustering
✓ K- Means Clustering
✓ Hierarchical clustering.
How to Choose Among Them
➢ First, start by asking, are the data complex, having many features that are highly correlated?
⚫ If yes, then dimensionality reduction using principal components analysis (PCA) is a good first step.
How to Choose Among Them
➢ Choose the most suitable machine learning method for the situation:
⚫ If the data contain features and labels and the goal is to learn the mapping from features to labels, use supervised learning;
⚫ If there are no labels and the goal is to explore the structure of the data itself, use unsupervised learning;
⚫ If the learning task consists of a series of actions with an associated reward mechanism, use reinforcement learning;
⚫ If the label to predict is categorical (e.g., predicting whether a stock will rise or fall), use a classification method;
⚫ If the label is a continuous numerical variable (e.g., predicting the magnitude of the rise), use a regression method;
⚫ In addition, the number of observations and features and the characteristics of the data themselves drive the choice of algorithm.
R8 Big Data Projects
What we are going to learn?
Describe preparing, wrangling, and exploring text-based data for financial forecasting;
Describe methods for extracting, selecting and engineering features from textual data;
Big Data Introduction
➢ Big data differs from traditional data sources based on the presence of a set of
characteristics commonly referred to as the 3Vs: volume, variety, and velocity.
⚫ Volume refers to the quantity of data. Big data refers to a huge volume of data.
⚫ Variety pertains to the array of available data sources. Variety includes traditional transactional data;
user-generated text, images, and videos; social media; sensor-based data; web and mobile clickstreams;
and spatial-temporal data. Effectively leveraging the variety of available data presents both
opportunities and challenges, including such legal and ethical issues as data privacy.
⚫ Velocity is the speed at which data are created. Many large organizations collect several petabytes of
data every hour.
➢ When used for generating inferences, an additional characteristic, the veracity or validity of the
data, needs to be considered. Not all data sources are reliable and the researcher has to separate
quality from quantity to generate robust forecasts.
Big Data Introduction
➢ Model Building for Financial Forecasting Using Big Data: Structured (Traditional) vs.
Unstructured (Text)
Structured Data Analysis
➢ Structured data are organized in a systematic format that is readily searchable and
readable by computer operations for processing and analyzing. In structured data, data
errors can be in the form of incomplete, invalid, inaccurate, inconsistent, non-uniform,
and duplicate data observations. The data cleansing process mainly deals with
identifying and mitigating all such errors. Exhibit 3 shows a raw dataset before
cleansing. The data have been collected from different sources and are organized in a
data matrix (or data table) format. Each row contains observations of each customer of
a US-based bank. Each column represents a variable (or feature) corresponding to each
customer.
Structured Data Analysis
➢ The steps in a data analysis project using structured data:
⚫ Conceptualization of the modeling task.
⚫ Data collection.
⚫ Data preparation and wrangling.
⚫ Data exploration.
⚫ Model training.
Conceptualization of the modeling task
⚫ The crucial first step entails determining what the output of the model should be (e.g., whether the price of a stock will go up/down one week from now), how this model will be used and by whom, and how it will be embedded in existing or new business
processes.
Data Collection
➢ Data Collection
⚫ For financial forecasting, usually structured , numeric data is collected from internal
and external sources.
Data Preparation and Wrangling
➢ Data Preparation and Wrangling involve cleansing and organizing raw data
into a consolidated format.
⚫ Data Preparation (Cleansing) is the process of examining, identifying, and mitigating errors in
raw data, includes addressing any missing values or verification of any out-of-range values.
⚫ Data Wrangling (Preprocessing) performs transformation and critical processing steps on the cleansed data to make the data ready for ML model training, involving aggregating, filtering, or extracting relevant variables.
Data Preparation (Cleansing)
⚫ Incompleteness error is where the data are missing. Missing values can be obtained from other records or imputed; otherwise the observation may be deleted.
⚫ Invalidity error is where the data are outside of a meaningful range, resulting in invalid data. This can be corrected by verifying other administrative data records.
⚫ Inaccuracy error is where the data are not a measure of the true value. This can be rectified with the help of business records and administrators.
⚫ Inconsistency error is where some data conflict with the corresponding data points or reality. This contradiction should be eliminated by clarifying with another source.
Data Preparation (Cleansing)
⚫ Non-uniformity error is where the data are not present in an identical format. This can
be resolved by converting the data points into a preferable standard format.
⚫ Duplication error is where duplicate observations are present. This can be corrected by
removing the duplicate entries.
Data Wrangling (Preprocessing)
➢ Data preprocessing primarily includes transformations and scaling of
the data.
➢ Data transformations
⚫ Extraction: A new variable can be extracted from the current variable for ease of analyzing and using
for training the ML model.
⚫ Aggregation: Two or more variables can be aggregated into one variable to consolidate similar variables.
⚫ Filtration: The data rows that are not needed for the project must be identified and filtered.
⚫ Selection: The data columns that are intuitively not needed for the project can be removed. This
should not be confused with feature selection, which is explained later.
⚫ Conversion: The variables can be of different types: nominal, ordinal, continuous, and categorical. The
variables in the dataset must be converted into appropriate types to further process and analyze
them correctly.
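To make the five transformation types concrete, here is a minimal Python (pandas) sketch; the dataset and column names are illustrative inventions, not taken from the curriculum's bank example:

import pandas as pd

# Hypothetical raw customer data (names and columns are illustrative).
df = pd.DataFrame({
    "name": ["Mr. ABC", "Ms. XYZ", "Mr. DEF"],
    "date_of_birth": ["1985-03-12", "1990-07-04", "1978-11-30"],
    "salary": [50000, 62000, 48000],
    "other_income": [5000, 0, 2000],
    "state": ["VA", "NY", "??"],   # "??" marks a row to filter out
})

# Extraction: derive a new variable (age) from an existing one (date_of_birth).
df["age"] = 2024 - pd.to_datetime(df["date_of_birth"]).dt.year

# Aggregation: combine two similar variables into one.
df["total_income"] = df["salary"] + df["other_income"]

# Filtration: drop data rows not needed for the project.
df = df[df["state"] != "??"]

# Selection: drop data columns intuitively not needed for model training.
df = df.drop(columns=["name", "date_of_birth"])

# Conversion: cast variables into appropriate types.
df["state"] = df["state"].astype("category")
print(df)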
www.zecaiedu.cn
Data Wrangling (Preprocessing)
➢ Data After Applying Transformations
www.zecaiedu.cn
Data Wrangling (Preprocessing)
➢ Scaling is a process of adjusting the range of a feature by shifting and changing the
scale of data. Two common approaches are:
⚫ Normalization rescales a variable to the range [0, 1]; it is sensitive to outliers because it
depends on the minimum and maximum values of the data.
⚫ Standardization centers a variable at 0 and scales it by its standard deviation.
✓ Less sensitive to outliers as it depends on the mean and standard deviation of the data.
⚫ Before scaling, the outliers should be examined and a decision made to either remove or replace
them with values imputed using statistical techniques.
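A short sketch of the two scaling approaches above, assuming numpy; the sample values are made up:

import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 95.0])   # 95 is an outlier

# Normalization: rescale to [0, 1]; sensitive to outliers (driven by min/max).
x_normalized = (x - x.min()) / (x.max() - x.min())

# Standardization: center at 0 and scale by the standard deviation;
# less sensitive to outliers than normalization.
x_standardized = (x - x.mean()) / x.std()

print(x_normalized)
print(x_standardized)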
www.zecaiedu.cn
Data Wrangling (Preprocessing)
➢There are several practical methods for handling outliers.
⚫ When extreme values and outliers are simply removed from the dataset, it is
known as trimming (also called truncation).
✓ e.g., a 5% trimmed dataset is one for which the 5% highest and the 5% lowest values have
been removed.
⚫ When extreme values and outliers are replaced with the maximum (for large value
outliers) and minimum (for small value outliers) values of data points that are not
outliers, the process is known as winsorization.
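The two outlier treatments can be sketched as follows (numpy; the 5%/95% cutoffs and data are illustrative):

import numpy as np

x = np.array([1.0, 8.0, 9.0, 10.0, 10.5, 11.0, 12.0, 13.0, 14.0, 99.0])

lo, hi = np.percentile(x, [5, 95])   # assumed non-outlier boundaries

# Trimming (truncation): remove the extreme observations entirely.
trimmed = x[(x >= lo) & (x <= hi)]

# Winsorization: replace extremes with the boundary (max/min non-outlier) values.
winsorized = np.clip(x, lo, hi)

print(trimmed)
print(winsorized)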
www.zecaiedu.cn
Data exploration
➢Data exploration is a crucial part of big data projects. The prepared data
are explored to investigate and comprehend data distributions and
relationships.
➢ Data exploration involves three important tasks: exploratory data analysis, feature
selection, and feature engineering.
www.zecaiedu.cn
Data exploration-EDA
➢ Exploratory data analysis (EDA) is the preliminary step in data exploration.
⚫ An important objective of EDA is to serve as a communication medium among
project stakeholders, including business users, domain experts, and analysts.
⚫ Visualizations
✓ Histograms, bar charts, box plots, density plots
www.zecaiedu.cn
Data exploration-EDA
➢ Visualizations
[Exhibit: example visualizations not reproduced]
www.zecaiedu.cn
Data exploration-FS and FE
➢ After using EDA to discover relevant patterns in the data, it is essential to identify and
remove unneeded, irrelevant, and redundant features. Feature selection is a process
whereby only pertinent features from the dataset are selected for ML model training;
dimensionality reduction methods such as PCA can support this process.
⚫ The objective of the feature selection process is to assist in identifying significant features that, when used in
a model, retain the important patterns and complexities of the larger dataset while requiring fewer data overall.
⚫ The feature engineering process attempts to produce good features that describe the structures
inherent in the dataset.
➢ Model training involves selecting the appropriate ML method, evaluating the algorithm
using a training dataset, and tuning the model. The choice of the model depends on the
nature of the relationship between the features and the target variable.
www.zecaiedu.cn
Method Selection
➢ Selecting and applying a method or an algorithm is the first step of the model
training process. Factors that affect the choice include:
⚫ Whether the task calls for supervised or unsupervised learning.
⚫ Type of data (numerical, text, image, speech).
⚫ Size of data (number of instances and number of features).
www.zecaiedu.cn
Performance Evaluation
➢ It is important to measure the model training performance or goodness of fit for
validation of the model.
www.zecaiedu.cn
Performance Evaluation
➢ Error analysis. For classification problems, error analysis involves computing four basic
evaluation metrics: true positive (TP), false positive (FP), true negative (TN), and false
negative (FN) metrics.
➢ Confusion matrix: Assume in the following explanation that Class “0” is “not defective”
and Class “1” is “defective.”
www.zecaiedu.cn
Performance Evaluation
➢ Elements in error analysis.
⚫ Precision is the ratio of correctly predicted positive classes to all predicted positive
classes: P = TP/(TP + FP).
⚫ Recall (also known as sensitivity) is the ratio of correctly predicted positive classes
to all actual positive classes: R = TP/(TP + FN).
www.zecaiedu.cn
Performance Evaluation
➢ Elements in error analysis.
⚫ Accuracy is the percentage of correctly predicted classes out of total predictions:
Accuracy = (TP + TN)/(TP + FP + TN + FN).
⚫ F1 score is the harmonic mean of precision and recall: F1 = 2 × (P × R)/(P + R).
✓ F1 score is more appropriate (than accuracy) when unequal class distribution is in the
dataset.
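A minimal sketch of these metrics in Python, using made-up confusion matrix counts (not the counts from the case that follows):

# Hypothetical confusion matrix counts.
tp, fp, tn, fn = 80, 10, 95, 15

precision = tp / (tp + fp)                      # TP / (TP + FP)
recall = tp / (tp + fn)                         # TP / (TP + FN), i.e., sensitivity
accuracy = (tp + tn) / (tp + fp + tn + fn)      # correct predictions / all predictions
f1_score = 2 * precision * recall / (precision + recall)   # harmonic mean of P and R

print(round(precision, 2), round(recall, 2), round(accuracy, 2), round(f1_score, 2))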
www.zecaiedu.cn
Performance Evaluation
Case 1 Example
• Once satisfied with the final set of features, Steele selects and runs
a model on the training set that classifies the text as having positive
sentiment (Class “1”) or negative sentiment (Class “0”). She
then evaluates its performance using error analysis. The resulting
confusion matrix is presented in the exhibit.
www.zecaiedu.cn
Performance Evaluation
Case 1 Example
[Exhibit residue: the multiple-choice question stems and confusion matrix are not
reproduced; only the answer choices (77%, 78%, 81%, 85%) remain.]
www.zecaiedu.cn
Performance Evaluation
➢ Receiver Operating Characteristic (ROC). This technique for assessing model
performance plots a curve of the false positive rate (x-axis) against the true
positive rate (y-axis).
⚫ The shape of the ROC curve provides insight into the model’s
performance: a more convex curve (bowed toward the top-left corner)
indicates a better-performing model.
⚫ Area under the curve (AUC) is the metric that measures the area under the
ROC curve; an AUC close to 1.0 indicates near-perfect prediction, while an
AUC of 0.5 signifies random guessing.
www.zecaiedu.cn
Performance Evaluation
➢ Root Mean Squared Error (RMSE). This measure is appropriate for continuous data
prediction problems.
⚫ The root mean squared error is computed by finding the square root of the mean of the
squared differences between the actual values and the model’s predicted values (the error):
RMSE = √( Σ (Predictedᵢ − Actualᵢ)² / n )
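A one-line computation of RMSE, assuming numpy and illustrative values:

import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

# Square root of the mean squared difference between predicted and actual.
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)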
www.zecaiedu.cn
Model Tuning
➢ Model fitting has two types of error: bias and variance. It is necessary to find an
optimum tradeoff between bias and variance errors, such that the model is neither
underfitted nor overfitted.
⚫ It is not possible to completely eliminate both types of errors. However, both errors can be
minimized so the total aggregate error (bias error + variance error) is at a minimum.
⚫ Parameters are critical for a model and are dependent on the training data. Parameters are
learned from the training data as part of the training process by an optimization technique.
⚫ Hyperparameters are used for estimating model parameters and are not dependent on the
training data; they are set manually before training and tuned.
www.zecaiedu.cn
Model Tuning
➢Method of model tuning
⚫ Grid search: the model is trained using different combinations of hyperparameter values
until the combination that produces the best model performance is found, as sketched below.
⚫ Ceiling analysis: a systematic process of evaluating different components in the
pipeline of model building. It helps to understand what part of the pipeline can
potentially improve performance by further tuning.
✓ The performance of the larger model depends on performance of the sub-model(s). Ceiling
analysis can help determine which sub-model needs to be tuned to improve the overall
model performance.
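A hand-rolled grid search sketch; validation_score is a hypothetical stand-in for training the model with the given hyperparameters and scoring it on a validation set:

from itertools import product

param_grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.5]}

def validation_score(params):
    # Stand-in: in practice, train with `params` and score on validation data.
    return -(params["max_depth"] - 4) ** 2 - (params["learning_rate"] - 0.1) ** 2

best_score, best_params = float("-inf"), None
for values in product(*param_grid.values()):        # every combination in the grid
    params = dict(zip(param_grid.keys(), values))
    score = validation_score(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params)   # the combination with the best validation performance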
www.zecaiedu.cn
Unstructured (Text) Data Analysis
➢ Unstructured, text-based data are more suitable for human use. The
five steps involved need to be modified (the first four) in order to
analyze unstructured, text-based data.
⚫ Text problem formulation. The analyst will determine the problem and identify the exact
inputs and output of the model.
⚫ Data collection (curation). This is determining the sources of data to be used (e.g., web
scraping, specific social media sites).
⚫ Text preparation and wrangling. This requires preprocessing the streams of unstructured
data to make them usable by traditional structured modeling methods.
⚫ Text exploration. This involves text visualization as well as text feature selection and
engineering.
⚫ Model training.
www.zecaiedu.cn
Text preparation and wrangling
➢ Unstructured data can be in the form of text, images, videos, and
audio files.
⚫ For analysis and use to train the ML model, the unstructured data must be transformed
into structured data.
⚫ For example, raw text can be cleansed, tokenized, and converted into a document term
matrix, as described in the following slides.
www.zecaiedu.cn
Text Preparation (Cleansing)
➢ The following steps describe the basic operations in the text cleansing process.
⚫ Remove html tags: The initial task is to remove (or strip) the html tags that are not part of
the actual text using programming functions or using regular expressions.
⚫ Remove Punctuations: Most punctuations are not necessary for text analysis and should be
removed. However, some punctuations, such as percentage signs, currency symbols, and
question marks, may be useful for ML model training. These punctuations should be
substituted with such annotations as /percentSign/, /dollarSign/, and /questionMark/ to
preserve their grammatical meaning in the text.
⚫ Remove Numbers: When numbers (or digits) are present in the text, they should be
removed or substituted with an annotation /number/.
⚫ Remove white spaces: Extra formatting-related white spaces (e.g., tabs, indents) do not
serve any purpose in text processing and are removed.
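The four cleansing operations might look like this in Python with regular expressions; the input string is invented for illustration:

import re

raw = "<p>Profits rose 9% to $12 million.</p>"

text = re.sub(r"<[^>]+>", " ", raw)            # remove HTML tags
text = text.replace("%", " /percentSign/ ")    # annotate useful punctuation
text = text.replace("$", " /dollarSign/ ")
text = text.replace("?", " /questionMark/ ")
text = re.sub(r"\d+", "/number/", text)        # substitute numbers
text = re.sub(r"[^\w\s/]", "", text)           # drop remaining punctuation
text = re.sub(r"\s+", " ", text).strip()       # remove extra white spaces

print(text)   # Profits rose /number/ /percentSign/ to /dollarSign/ /number/ million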
www.zecaiedu.cn
Text Cleansing Process Example
➢ [Exhibit: text cleansing process example not reproduced]
www.zecaiedu.cn
Text Wrangling (Preprocessing)
➢ To further understand text processing, tokens and tokenization need to
be defined.
⚫ A token is equivalent to a word.
⚫ Tokenization is the process of splitting a given text into separate tokens. In other words, a
text is considered a collection of tokens.
www.zecaiedu.cn
Text Wrangling (Preprocessing)
➢ Step 1: Normalization
⚫ Lowercasing: This action helps the computer to process the same words appropriately (e.g., “The”
and “the”).
⚫ Removal of stop words: Stop words are such commonly used words as “the,” “is,” and “a.” Stop
words do not carry a semantic meaning for the purpose of text analyses and ML training.
⚫ Stemming is a rule-based approach that converts all variations of a word into a common value.
✓ For example, the stem of the words “analyzed” and “analyzing” is “analyz.” Similarly, the British English variant “analysing” would
become “analys.” Stemming is available in R and Python.
✓ While stemming makes the text confusing for human processing, it is ideally suited for machines.
⚫ Lemmatization is the process of converting inflected forms of a word into its morphological root
(known as lemma).
✓ Lemmatization is similar to stemming, but is computationally more expensive and advanced. It is an algorithmic
approach and depends on the knowledge of the word and language structure.
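A self-contained sketch of normalization; the tiny stop-word list and suffix rules are toy stand-ins for the library implementations available in R and Python:

tokens = ["The", "analyst", "is", "analyzing", "the", "markets"]

stop_words = {"the", "is", "a"}   # toy stop-word list

def crude_stem(word):
    # Toy rule-based stemmer: strip common suffixes to reach a common value.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

normalized = [crude_stem(t.lower()) for t in tokens if t.lower() not in stop_words]
print(normalized)   # ['analyst', 'analyz', 'market']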
www.zecaiedu.cn
Text Wrangling (Preprocessing)
➢ Step 2: After the cleansed text is normalized, a bag-of-words (BOW) is created.
⚫ BOW is simply a set of words and does not capture the position or sequence of words
present in the text. However, it is memory efficient and easy to handle for text analyses.
www.zecaiedu.cn
Text Wrangling (Preprocessing)
➢ The last step of text preprocessing is using the final BOW after normalizing to build a
document term matrix (DTM).
www.zecaiedu.cn
Text Wrangling (Preprocessing)
➢ If the sequence of text is important, N-grams can be used to represent word
sequences.
⚫ The length of a sequence can vary from 1 to n. When one word is used, it is a unigram; a
two-word sequence is a bigram; a 3-word sequence is a trigram; and so on.
✓ For example, for the sentence “The market is up today,” the bigrams include “the_market,”
“market_is,” “is_up,” and “up_today.”
BOW is then applied to the bigrams instead of the original words.
✓ N-grams implementation will affect the normalization of the BOW because stop words
will not be removed.
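Generating n-grams from a token list takes only a few lines; this sketch reproduces the bigrams above:

def ngrams(tokens, n):
    # Join each n-token window with "_" to form one sequence token.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "market", "is", "up", "today"]
print(ngrams(tokens, 2))   # ['the_market', 'market_is', 'is_up', 'up_today']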
www.zecaiedu.cn
Text Wrangling (Preprocessing)
➢ The advantage of n-grams is that they can be used in the same way as unigrams to build a BOW.
In practice, different n-grams can be combined to form a BOW and eventually be used to build
a DTM.
www.zecaiedu.cn
Text Wrangling (Preprocessing)
➢ Step 3: Build a document term matrix (DTM).
⚫ DTM is a matrix that is similar to a data table for structured data and is widely used for text data.
⚫ Each row of the matrix belongs to a document (or text file), and each column represents a token (or term). The
cells can contain the counts of the number of times a token is present in each document.
[Exhibit: sample DTM not reproduced]
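A minimal sketch that builds a BOW and then a count-based DTM from two tokenized texts; the documents are invented:

from collections import Counter

docs = [["market", "is", "up"],
        ["market", "is", "down", "down"]]      # two tokenized documents

bow = sorted(set(tok for doc in docs for tok in doc))   # the bag-of-words

# One row per document, one column per token; cells hold token counts.
dtm = [[Counter(doc)[tok] for tok in bow] for doc in docs]

print(bow)   # ['down', 'is', 'market', 'up']
print(dtm)   # [[0, 1, 1, 1], [2, 1, 1, 0]]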
www.zecaiedu.cn
Text Exploration - EDA
➢Various statistics are used to explore, summarize, and analyze text data.
⚫ Term frequency (TF) is the ratio of the number of times a given token occurs in all the
texts in the dataset to the total number of tokens in the dataset.
www.zecaiedu.cn
Text Exploration - Feature Selection
➢ For text data, feature selection involves selecting a subset of the tokens in the BOW to
serve as features; this effectively reduces the size of the BOW and makes the model
less complex.
⚫ Another benefit is to eliminate noisy features from the dataset. Noisy features are both
the most frequent and most sparse (or rare) tokens in the dataset.
www.zecaiedu.cn
Text Exploration - Feature Selection
➢ The general feature selection methods in text data are as follows:
⚫ Frequency measures can be used for vocabulary pruning to remove noise features by
filtering the tokens with very high and low TF values across all the texts.
⚫ Chi- square test is applied to test the independence of two events: occurrence of the token
and occurrence of the class. The test ranks the tokens by their usefulness to each class in
text classification problems.
✓ Tokens with the highest chi-square test statistic values occur more frequently in texts associated with a particular
class and therefore can be selected for use as features for ML model training due to higher discriminatory potential.
⚫ Mutual information (MI) measures how much information is contributed by a token to a class
of texts.
✓ The mutual information value will be equal to 0 if the token’s distribution in all text classes is the same.
✓ The MI value approaches 1 as the token in any one class tends to occur more often in only that particular class of
text.
www.zecaiedu.cn
Text Exploration - Feature Engineering
➢ Techniques of FE include:
⚫ Numbers: tokens with standard lengths are identified and converted into a token such as
“/number/.” For example, numbers with four digits may indicate years and are assigned a value
of “/number4/.”
⚫ N-grams: Multi-word patterns that are particularly discriminative can be identified and
their connection kept intact (e.g., “stock_market” as a bigram token).
⚫ Name entity recognition (NER): NER analyzes the individual tokens and their surrounding
semantics while referring to its dictionary to tag an object class to the token.
✓ For example, Microsoft would be assigned a NER tag of ORG and Europe would be assigned a NER tag of Place.
NER object class assignment is meant to make the selected features more discriminatory.
www.zecaiedu.cn
Text Exploration - Feature Engineering
➢ Techniques of FE include:
⚫ Parts of speech (POS): Similar to NER, parts of speech uses language structure and
dictionaries to tag every token in the text with a corresponding part of speech.
✓ For example, Microsoft would be assigned a POS tag of NNP (indicating a proper noun), and the year 1969
would be assigned a POS tag of CD (indicating a cardinal number).
✓ For example, the word “market” can be a verb when used as “to market …” or a noun when used as “in the
market.”
✓ Differentiating such tokens can help further clarify the meaning of the text.
www.zecaiedu.cn
Model Training and Evaluation
➢ Once the unstructured data have been processed and codified in a structured form such
as a data matrix, model training is similar to that of structured data. ML seeks to identify
patterns in the dataset via a set of rules. Model fitting describes how well the model
generalizes to new data (i.e., how the model performs out of sample).
www.zecaiedu.cn
Summary
➢ Importance: ☆☆
➢ Content:
⚫ Big data introduction
➢ Exam tips:
⚫ The focus is on data preparation and preprocessing.
www.zecaiedu.cn
R9 Excerpt from “probabilistic
approaches: scenario analysis,
decision trees, and simulation”
www.zecaiedu.cn
What we are going to learn?
The candidate should be able to:
Explain three ways to define the probability distributions for a simulation’s variables;
www.zecaiedu.cn
Simulation
➢Steps in Simulation
⚫ Determine “probabilistic” variables
⚫ Define probability distributions for these variables
✓ Historical data
✓ Cross sectional data
✓ Statistical distribution and parameters:
⚫ Check for correlation across variables
✓ When there is strong correlation, positive or negative, across inputs, you have two
choices.
One is to pick only one of the two inputs to vary;
The other is to build the correlation explicitly into the simulation;
⚫ Run the simulation
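The steps can be sketched as a toy Monte Carlo valuation in Python; the distributions, parameters, and perpetuity formula are assumptions for illustration only:

import numpy as np

rng = np.random.default_rng(seed=42)
n_trials = 10_000

# Steps 1-2: probabilistic inputs with assumed distributions and parameters.
revenue = rng.normal(loc=100.0, scale=15.0, size=n_trials)   # normal
margin = rng.uniform(low=0.10, high=0.20, size=n_trials)     # uniform

# Step 3: correlated inputs would instead be drawn jointly
# (e.g., from a multivariate normal), or only one of the two varied.

# Step 4: run the simulation; toy valuation as a perpetuity at an 8% rate.
value = revenue * margin / 0.08

# The output is a distribution of value, not a point estimate.
print(value.mean(), np.percentile(value, [5, 95]))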
www.zecaiedu.cn
Simulation
➢ Advantage of using simulation in decision making
⚫ Better input estimation
⚫ It yields a distribution for expected value rather than a point estimate
⚫ Non-stationary distributions
www.zecaiedu.cn
Simulation
⚫ When you use the standard deviation of values from a simulation as a measure of risk,
also using a risk-adjusted discount rate will result in a double counting of risk.
www.zecaiedu.cn
Comparing The Approaches
➢ Selective versus full risk analysis
⚫ Scenario analysis: we will not have a complete assessment of all possible
outcomes from risky investments or assets.
www.zecaiedu.cn
Comparing The Approaches
➢ Type of risk
⚫ Scenario analysis and decision trees: generally built around discrete outcomes in
risky events.
✓ Scenario analysis: easier to use when risks occur concurrently.
✓ Decision trees: better suited for sequential risks.
⚫ Simulations: better suited for continuous risks.
www.zecaiedu.cn
Comparing with Risk-Adjusted Value
➢ Complement or Replacement for Risk-Adjusted Value
⚫ Both decision trees and simulations are approaches that can be used as either
complements to or substitutes for risk-adjusted value.
www.zecaiedu.cn
Conclusion
➢ In the most extreme form of scenario analysis, you look at the value in the best case and worst case scenarios
and contrast them with the expected value. In its more general form, you estimate the value under a small
number of likely scenarios, ranging from optimistic to pessimistic.
➢ Decision trees are designed for sequential and discrete risks, where the risk in an investment is considered
in phases, and the risk in each phase is captured in the possible outcomes and the probabilities that they will
occur. A decision tree provides a complete assessment of risk and can be used to determine the optimal courses
of action at each phase and an expected value for an asset today.
➢ Simulations provide the most complete assessments of risk since they are based upon probability distributions
for each input (rather than a single expected value or just discrete outcomes). The output from a simulation
takes the form of an expected value across simulations and a distribution for the simulated values.
➢ With all three approaches, the keys are to avoid double counting risk (by using a risk-adjusted discount rate
and considering the variability in estimated value as a risk measure) or making decisions based upon the wrong
types of risk.
www.zecaiedu.cn
Summary
➢ Importance:
➢ Content:
⚫ Simulation ( steps; advantages; risk-adjusted value)
⚫ Conclusion
➢ Exam tips:
⚫ The focus is on comparing and distinguishing the three approaches.
www.zecaiedu.cn
THANKS
No pains No gains
www.zecaiedu.cn