File4 Session3 Introduction To Regression

Uploaded by Đan Anh Pham

Introduction to Regression
Introduction to linear regression
● Simple linear regression
● Multiple regression
● Finding relationships between dependent and
independent variables
● Introduction to SPSS
Simple linear regression

y = β0 + β1x + ε

y = Dependent variable
x = Independent variable
β0 = y-intercept of the line (constant), cuts through the y-axis
β1 = Unknown parameter – slope of the line
ε = Random error component

Terminology for simple
regression
y X
Dependent variable Independent variable
Explained variable Explanatory variable
Response variable Control variable
Predicted variable Predictor variable
Regressand Regressor
Examples

- With a sample from the population:
- (x1, y1), (x2, y2), …, (xn, yn) denote a random sample of
size n from the population
File: dataspss-s4.1
DETERMINING THE EQUATION OF THE
REGRESSION LINE

● Deterministic Regression Model – a mathematical
model that produces an ‘exact’ output for a given
input

● Probabilistic Regression Model – a model that
includes an error term, allowing various output
values to occur for a given value of input

ŷi = predicted value of y
xi = value of the independent variable for the ith observation
yi = actual value of the dependent variable for the ith observation
β1 = population slope
β0 = population intercept
εi = error of prediction for the ith observation
Simple Linear Regression Model

[Figure: scatter of Y against X. At a given Xi, the observed value of Y
differs from the predicted value on the line by the random error εi;
the line has intercept β0 and slope β1.]
Sample Regression
Function (SRF)

ŷi = b0 + b1xi

ŷi = estimated value of Y for observation i
xi = value of X for observation i
b0 = Y-intercept:
the value of Y when X is zero
b1 = slope of the regression line:
the change in Y for a 1-unit change in X

b1 > 0 : line goes up; positive relationship between X and Y
b1 < 0 : line goes down; negative relationship between X and Y
SIMPLE LINEAR REGRESSION MODEL
(sample)

Simple Linear Regression:

yi = b0 + b1xi + ei, where ei is the residual

Sample Regression Function
(SRF) (continued)
● b0 and b1 are obtained by finding the values
that minimize the sum of the squared residuals (minimize the
error). This process is called Least Squares Analysis.

Yi = actual value of Y for observation i
Ŷi = predicted value of Y for observation i
ei = residual (error)
● b0 provides an estimate of β0
● b1 provides an estimate of β1
RESIDUAL ANALYSIS
Simple example

x        y        x − x̄     (x − x̄)y    (x − x̄)²
1        1        -2          -2           4
2        1        -1          -1           1
3        2         0           0           0
4        2         1           2           1
5        4         2           8           4
x̄ = 3   ȳ = 2               Total = 7    Total = 10

X = experience (years)
Y = Income (10 million VND)
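The table's least-squares computation can be sketched in a few lines (Python used only for illustration; the data are the five (x, y) pairs above):

```python
# Least-squares estimates for the worked example:
# b1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)**2)
# b0 = y_bar - b1 * x_bar
x = [1, 2, 3, 4, 5]   # experience (years)
y = [1, 1, 2, 2, 4]   # income (10 million VND)

n = len(x)
x_bar = sum(x) / n                                               # 3.0
y_bar = sum(y) / n                                               # 2.0

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))   # 7.0
sxx = sum((xi - x_bar) ** 2 for xi in x)                         # 10.0

b1 = sxy / sxx             # slope: 0.7
b0 = y_bar - b1 * x_bar    # intercept: -0.1
print(b0, b1)
```

These reproduce the slide's estimates b0 = −0.1 and b1 = 0.7.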
Fit to the data

[Figure: scatter plot of the five (x, y) points with the fitted
regression line.]
Result of estimation by SPSS

Degrees of freedom:
Regression: k
Residual: n-k-1
Total: n-1
Meaning of b0 and b1
● Y-Intercept (b0)
• The average value of individual income (Y) is
-0.1 (10 million VND) when experience
(X) is 0 years

● Slope (b1)
• Income (Y) is expected to increase by 0.7 (*10 million
VND) for each unit increase in experience (years)

b1 > 0 : line goes up; positive relationship between X and Y (increase)
b1 < 0 : line goes down; negative relationship between X and Y (decrease)
Measures of Variation
Total variation is made up of two parts:

SST = SSR + SSE

Total Sum of Squares (SST): measures the variation of the Yi values
around their mean Ȳ.
Regression Sum of Squares (SSR): explained variation attributable to
the relationship between X and Y.
Error Sum of Squares (SSE): variation attributable to factors other
than the relationship between X and Y.

/* Other notation for SSyy is SST. They are the same!
Measure of Variation: The Sum of Squares (continued)

SSE = Σ(Yi − Ŷi)²
SSyy = Σ(Yi − Ȳ)²
SSR = Σ(Ŷi − Ȳ)²

[Figure: the three deviations shown at a point Xi relative to the
fitted line and the mean Ȳ.]
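Using the fitted line ŷ = −0.1 + 0.7x from the worked example, the three sums of squares can be checked directly (a sketch, not SPSS output):

```python
# Decomposition of variation: SST = SSR + SSE, r2 = SSR / SST
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
b0, b1 = -0.1, 0.7                            # estimates from the example
y_bar = sum(y) / len(y)                       # 2.0
y_hat = [b0 + b1 * xi for xi in x]            # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)               # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual

r2 = ssr / sst
print(sst, ssr, sse, r2)
```

This gives SST = 6, SSR = 4.9, SSE = 1.1, so SSR + SSE = SST and r² ≈ 0.817.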
Result of estimation by SPSS
Coefficient of Determination, r2
The coefficient of determination is the portion of the total variation in
the dependent variable that is explained by variation in the
independent variable.
The coefficient of determination is also called r-squared and is denoted
as r2:

r2 = SSR / SST

Note: 0 ≤ r2 ≤ 1
The Coefficient of Determination r2
and the Coefficient of Correlation r

0 ≤ r2 ≤ 1
r2 = Coefficient of Determination
Measures the % of variation in Y that is explained by the
independent variable X in the regression model

-1 ≤ r ≤ 1
r = Coefficient of Correlation
Measures how strong the relationship is between X and Y

r > 0 if b1 > 0
r < 0 if b1 < 0
Examples of Approximate r2 values

r2 = 1: perfect linear relationship between X and Y; 100% of the
variation in Y is explained by variation in X.

r2 = 0: no linear relationship between X and Y; the value of Y does
not depend on X (none of the variation in Y is explained by variation
in X).
Examples of Approximate r2 values

0 < r2 < 1: weaker linear relationships between X and Y; some but not
all of the variation in Y is explained by variation in X.
Standard Error of the Estimate
The standard deviation of the variation of observations around the
regression line is estimated by:

Se = √(SSE / (n − 2))

where
SSE = error sum of squares
n = sample size
Result of estimation by SPSS
Inferences About the Slope
The standard error of the regression slope coefficient (b1) is
estimated by:

Sb1 = Se / √(Σ(xi − x̄)²)

where
Sb1 = estimate of the standard error of the least squares slope
Se = standard error of the estimate

Result of estimation by SPSS

Sb1 = 0.1914854
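The two standard-error formulas can be verified against the SPSS value Sb1 = 0.1914854, using SSE = 1.1 and Σ(x − x̄)² = 10 from the worked example (a quick check, not SPSS output):

```python
import math

# S_e  = sqrt(SSE / (n - 2))            -- standard error of the estimate
# S_b1 = S_e / sqrt(sum((x - x_bar)^2)) -- standard error of the slope
sse = 1.1     # error sum of squares from the worked example
n = 5         # sample size
sxx = 10      # sum of squared deviations of x

s_e = math.sqrt(sse / (n - 2))
s_b1 = s_e / math.sqrt(sxx)
print(round(s_e, 4), round(s_b1, 7))
```

This reproduces Se ≈ 0.6055 and Sb1 = 0.1914854, matching the SPSS output.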
Inference about the Slope: t Test

t test for a population slope:

• Is there a linear relationship between X and Y?

Null hypothesis (H0) and alternative hypothesis (H1):
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)

Test statistic, with d.f. = n-2:

t = (b1 − β1) / Sb1

where
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope
Result of estimation by SPSS
Inferences about the Slope: t Test
Example
H0: β1 = 0
H1: β1 ≠ 0

d.f. = 5-2 = 3
α/2 = .025 in each tail
Test statistic: t = 3.66
Critical values: t = ±3.182 (from t tables)

Decision: Reject H0, since t = 3.66 > 3.182.
Conclusion: There is sufficient evidence that experience
affects income.
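The test statistic in this example follows directly from the slope estimate and its standard error (a quick check using the slides' own numbers):

```python
# t statistic for H0: beta1 = 0, using the worked example's estimates
b1 = 0.7              # estimated slope
beta1_h0 = 0.0        # hypothesized slope under H0
s_b1 = 0.1914854      # standard error of the slope (from SPSS)

t = (b1 - beta1_h0) / s_b1
print(round(t, 2))
```

The result, t ≈ 3.66, exceeds the critical value 3.182, so H0 is rejected.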
F Test for Significance
F test statistic:

F = MSR / MSE, where MSR = SSR / k and MSE = SSE / (n - k - 1)

F follows an F distribution with k numerator and (n - k - 1)
denominator degrees of freedom.
k = the number of independent (explanatory) variables in the
regression model.
Result of estimation by SPSS
F Test for Significance Example
df1 = k = 1
df2 = n-k-1 = 5-1-1 = 3
H0: β1 = 0
H1: β1 ≠ 0
α = .05
Critical value: F.05 = 10.128
Test statistic: F ≈ 13.36 (equal to t² = 3.66²)

Conclusion: Reject H0 at α = 0.05, since F > 10.128.
There is sufficient evidence that experience affects income.
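The F statistic here can be recomputed from the sums of squares of the worked example (a sketch; SSR = 4.9 and SSE = 1.1 come from the earlier decomposition):

```python
# F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))
ssr, sse = 4.9, 1.1   # from the worked example
n, k = 5, 1

msr = ssr / k             # mean square regression
mse = sse / (n - k - 1)   # mean square error
f = msr / mse
print(round(f, 2))
```

This gives F ≈ 13.36, which exceeds the critical value 10.128; it also equals t² for the simple-regression t test, as expected.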
Introduction to SPSS (file: dataspss-s4.1)
Result of estimation by SPSS
Interpreting the results
● R-squared ranges in value between 0 and 1
● R2 = 0: nothing helps explain the variance in y
● R2 = 1: all the sample points lie on the estimated regression line
● Example: R2 = 0.93 implies that the regression equation explains
93% of the variation in the dependent variable

● Sig. (significance): goodness of fit only if Sig. is small enough:
● Sig. < 0.01: significant at 1%, H0 is rejected
● 0.01 ≤ Sig. < 0.05: significant at 5%, H0 is rejected
● 0.05 ≤ Sig. < 0.1: significant at 10%, H0 is rejected
Introduction to multiple
regression

● Multiple regression
● Finding relationships between dependent and
independent variables
● Dummy variables included
● Solutions and SPSS
Linear regression

y = β0 + β1X1 + β2X2 + … + βkXk + ε

y = Dependent (or response) variable
X1, X2, …, Xk = Independent or predictor variables
β0 = y-intercept (constant), cuts through the y-axis
β1, …, βk = Unknown parameters – slopes
ε = Random error component
Terminology for multiple
regression
y X1, x2, …, xk
Dependent variable Independent variable
Explained variable Explanatory variable
Response variable Control variable
Predicted variable Predictor variable
Regressand Regressor
Example
● File: dataspss-s4.2
● Dependent variable ?
● Independent variables
● SPSS program
● Estimate and discuss
Think
● Survey conducted with variables:
● Income
● Age
● Years of working experience
● Education
● Gender
● ………
Think about which are the dependent and the independent
variables.
Regression with dummy
independent variables
● Independent variable: Gender
● 1 = female, 0 = male
● If the estimated coefficient of gender is positive, the
dependent variable is higher for females, other things equal
● If the estimated coefficient of gender is negative, the
dependent variable is higher for males, other things equal.
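With a single binary dummy as the only regressor, the least-squares fit reduces to group means: the intercept is the mean of the 0 (male) group, and the slope is the female − male difference. A sketch with hypothetical income values (not from the session's data file):

```python
# Dummy-variable regression y = b0 + b1*D with D = 1 (female), 0 (male)
# reduces to group means: b0 = male mean, b1 = female mean - male mean.
income = [3, 4, 5, 6]   # hypothetical incomes (10 million VND)
female = [0, 0, 1, 1]   # dummy: 1 = female, 0 = male

male_mean = sum(y for y, d in zip(income, female) if d == 0) / female.count(0)
female_mean = sum(y for y, d in zip(income, female) if d == 1) / female.count(1)

b0 = male_mean                 # intercept: average income for males
b1 = female_mean - male_mean   # positive => higher average for females
print(b0, b1)
```

Here b1 > 0, so, as the slide says, the dependent variable increases with the category coded 1 (female).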
Samples of hypotheses

● An increase in education does not cause a rise
in earnings
● People's earnings are not positively influenced by
their age
● There is no significant relationship between
earnings and gender
Adjusted R2
Adjusted R-square identifies a good regression model once more
variables are added.
The higher the Adjusted R-square, the better the model.

Adjusted R2 = 1 − (1 − R2)(n − 1) / (n − k − 1)

(where n = sample size, k = number of independent variables)

• Helps control the number of independent variables added: it
penalizes the inclusion of unimportant independent variables.
• Adjusted R-square is always less than R-square.
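The adjusted R-square formula can be illustrated with the numbers from the simple worked example (r² ≈ 0.817, n = 5, k = 1); the values are taken from the earlier slides, the computation is only a sketch:

```python
# Adjusted R-square: adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
r2 = 49 / 60   # r2 from the worked example (~0.8167)
n, k = 5, 1

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))
```

Note that the adjusted value (≈ 0.756) is below r² itself, as the slide states.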
Collinearity
Collinearity: High correlation exists among two or more
independent variables.
This means the correlated variables contribute redundant
information to the multiple regression model.
Including two highly correlated independent variables
can adversely affect the regression results.
No new information provided:
• Can lead to unstable coefficients (large standard error
and low t-values).
• Coefficient signs may not match prior expectations.
Some Indications of Strong Collinearity

Incorrect signs on the coefficients.


Large change in the value of a previous coefficient when a
new variable is added to the model.
A previously significant variable becomes non-significant
when a new independent variable is added.
The estimate of the standard deviation of the model increases
when a variable is added to the model.

Measuring Collinearity: Variance
Inflationary Factor
The variance inflationary factor VIFj can be used to measure collinearity:

VIFj = 1 / (1 − R2j)

VIF – PHStat program

where R2j is the coefficient of
multiple determination of
independent variable Xj with all other
X variables.

If VIFj = 1, Xj is uncorrelated with the other Xs.
If VIFj > 10, Xj is highly correlated with the other Xs
(a conservative estimate reduces this to VIFj > 5).
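With only two predictors, R²j reduces to the squared correlation between them, so VIF can be sketched without running a full regression (the x2 values below are made up for illustration):

```python
import math

# VIF_j = 1 / (1 - R_j^2). With two predictors, R_j^2 is simply the
# squared correlation between them.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 3, 5, 4, 6]   # hypothetical second predictor

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / math.sqrt(sum((ai - ma) ** 2 for ai in a)
                           * sum((bi - mb) ** 2 for bi in b))

r2 = corr(x1, x2) ** 2   # R_j^2 for either predictor
vif = 1 / (1 - r2)
print(round(vif, 2))
```

For these data r = 0.9, so VIF ≈ 5.26: above the conservative cut-off of 5 but below 10, i.e. borderline collinearity.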
Section Summary
• Developed the multiple regression model.
• Tested the significance of the multiple regression model.
• Discussed r2, adjusted r2 and overall F test.
• Discussed using residual plots to check model
assumptions.
• Tested individual regression coefficients.
• Used dummy variables.
• Evaluated interaction effects.
• Evaluated collinearity.
Regression and collinearity
Select Statistics to open the
multicollinearity check

Select Collinearity
diagnostics
Typical output from SPSS
Dependent variable: Satisfaction (Hài lòng)

VIF > 10: multicollinearity

Tolerance < 0.1 (i.e. VIF > 10): multicollinearity
Group assignment
● Check database of group assignment
● Develop general regression model (multiple
regression)
● Develop hypotheses
● Test regression model + check collinearity +
write out the estimated regression model
● Present the result of hypothesis testing
● Develop possible solutions and think of solution
ranking
