
Correlation

Major Points - Correlation

• Questions answered by correlation
• Scatterplots
• An example
• The correlation coefficient
• Other kinds of correlations
• Factors affecting correlations
• Testing for significance

Keywords

1. Scatterplot
2. Correlation coefficient
3. Range restrictions
4. Nonlinearity
5. Outliers
The Question

• Are two variables related?
  - Does one increase as the other increases?
    e.g. skills and income
  - Does one decrease as the other increases?
    e.g. health problems and nutrition
• How can we get a numerical measure of the degree of relationship?
Scatterplots

• AKA scatter diagram or scattergram.
• Graphically depicts the relationship between two variables in two-dimensional space.
Direct Relationship

[Scatterplot: Video Games and Alcohol Consumption. X-axis: Average Hours of Video Games Per Week (0-25). Y-axis: Average Number of Alcoholic Drinks Per Week (0-20).]
Inverse Relationship

[Scatterplot: Video Games and Test Score. X-axis: Average Hours of Video Games Per Week (0-20). Y-axis: Exam Score (0-100).]
An Example

• Does smoking cigarettes increase systolic blood pressure?
• Plotting number of cigarettes smoked per day against systolic blood pressure
  - Fairly moderate relationship
  - Relationship is positive
Trend?

[Scatterplot: SMOKING (x-axis, 0-30 cigarettes per day) vs. SYSTOLIC blood pressure (y-axis, 100-170).]
Smoking and BP

• Note that the relationship is moderate, but real.
• Why do we care about the relationship?
  - What would we conclude if there were no relationship?
  - What if the relationship were near perfect?
  - What if the relationship were negative?
Heart Disease and Cigarettes

• Data on heart disease and cigarette smoking in 21 developed countries (Landwehr and Watkins, 1987)
• Data have been rounded for computational convenience.
  - The results were not affected.
The Data

Country   Cigarettes   CHD
1         11           26
2          9           21
3          9           24
4          9           21
5          8           19
6          8           13
7          8           19
8          6           11
9          6           23
10         5           15
11         5           13
12         5            4
13         5           18
14         5           12
15         5            3
16         4           11
17         4           15
18         4            6
19         3           13
20         3            4
21         3           14

Surprisingly, the U.S. is the first country on the list--the country with the highest consumption and highest mortality.
Scatterplot of Heart Disease

• CHD mortality goes on the ordinate (Y axis)
  - Why?
• Cigarette consumption goes on the abscissa (X axis)
  - Why?
• What does each dot represent?
• Best-fitting line included for clarity
[Scatterplot: Cigarette Consumption per Adult per Day (x-axis, 2-12) vs. CHD Mortality per 10,000 (y-axis, 0-30), with the best-fitting line; the point {X = 6, Y = 11} is labeled.]
What Does the Scatterplot Show?

• As smoking increases, so does coronary heart disease mortality.
• Relationship looks strong.
• Not all data points are on the line.
  - This gives us "residuals" or "errors of prediction"
  - To be discussed later
Correlation

• Co-relation
• The relationship between two variables
• Measured with a correlation coefficient
• Most popularly seen correlation coefficient: Pearson Product-Moment Correlation
Types of Correlation

• Positive correlation
  - High values of X tend to be associated with high values of Y.
  - As X increases, Y increases.
• Negative correlation
  - High values of X tend to be associated with low values of Y.
  - As X increases, Y decreases.
• No correlation
  - No consistent tendency for values on Y to increase or decrease as X increases.
Correlation Coefficient

• A measure of degree of relationship.
• Ranges between -1 and 1
  - Sign refers to direction.
• Based on covariance
  - Measure of degree to which large scores on X go with large scores on Y, and small scores on X go with small scores on Y
• Think of it as variance, but with 2 variables instead of 1 (What does that mean??)
Covariance

• Remember that variance is:

$$\mathrm{Var}_X = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{\sum (X - \bar{X})(X - \bar{X})}{N - 1}$$

• The formula for covariance is:

$$\mathrm{Cov}_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$$

• How this works, and why?
• When would Cov_XY be large and positive? Large and negative?
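As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the sample covariance formula above; the function and variable names are my own:

```python
# Minimal sketch of the sample covariance formula (names are illustrative).
from statistics import mean

def covariance(x, y):
    """Sum of cross-products of deviations, divided by N - 1."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)
```

Pairs where both deviations have the same sign contribute positive cross-products (large and positive covariance); opposite signs push the sum negative.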
Example

$$\mathrm{Cov}_{cig.\&CHD} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1} = \frac{222.44}{21 - 1} = 11.12$$

• What the heck is a covariance?
  - I thought we were talking about correlation?
Correlation Coefficient

• Pearson's Product-Moment Correlation
• Symbolized by r
• Covariance ÷ (product of the 2 SDs)

$$r = \frac{\mathrm{Cov}_{XY}}{s_X s_Y}$$

• Correlation is a standardized covariance
Calculation for Example

• Cov_XY = 11.12
• s_X = 2.33
• s_Y = 6.69

$$r = \frac{\mathrm{Cov}_{XY}}{s_X s_Y} = \frac{11.12}{(2.33)(6.69)} = \frac{11.12}{15.59} = .713$$
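A sketch of the same computation in Python, using the 21-country data from the table above (stdlib only; names are illustrative):

```python
# Sketch: r as a standardized covariance, using the 21-country data above.
from statistics import mean, stdev

cigarettes = [11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3]
chd = [26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14]

x_bar, y_bar = mean(cigarettes), mean(chd)
cov = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(cigarettes, chd)) / (len(chd) - 1)
r = cov / (stdev(cigarettes) * stdev(chd))
print(round(r, 3))  # ~0.71, matching the slides up to rounding
```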
Example

• Correlation = .713
• Sign is positive
  - Why?
• If the sign were negative
  - What would it mean?
  - It would not alter the degree of relationship.
Other Calculations

• Z-score method:

$$r = \frac{\sum z_x z_y}{N - 1}$$

• Computational (raw score) method:

$$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$$
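For illustration, a sketch of the raw-score method in Python; it avoids computing deviation scores and should agree with the covariance-based result up to rounding:

```python
# Sketch of the computational (raw score) formula for r.
from math import sqrt

def pearson_raw(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
```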
Other Kinds of Correlation

• Spearman rank-order correlation coefficient (r_sp)
  - used with 2 ranked/ordinal variables
  - uses the same Pearson formula

Attractiveness   Symmetry
3                2
4                6
1                1
2                3
5                4
6                5

r_sp = 0.77
Other Kinds of Correlation

• Point-biserial correlation coefficient (r_pb)
  - used with one continuous scale and one nominal, ordinal, or dichotomous scale
  - uses the same Pearson formula

Attractiveness   Date?
3                0
4                0
1                1
2                1
5                1
6                0

r_pb = -0.49
Other Kinds of Correlation

• Phi coefficient (φ)
  - used with two dichotomous scales
  - uses the same Pearson formula

Attractiveness   Date?
0                0
1                0
1                1
1                1
0                0
1                1

φ = 0.71
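Since all three coefficients use the same Pearson formula, one function covers them all; below is a sketch applied to the three small tables above, with the data recoded as ranks or as 0/1:

```python
# Sketch: Spearman, point-biserial, and phi are Pearson r on recoded data.
from statistics import mean, stdev

def pearson(x, y):
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

print(pearson([3, 4, 1, 2, 5, 6], [2, 6, 1, 3, 4, 5]))  # two rankings: ~0.77
print(pearson([3, 4, 1, 2, 5, 6], [0, 0, 1, 1, 1, 0]))  # one dichotomy: ~-0.49
print(pearson([0, 1, 1, 1, 0, 1], [0, 0, 1, 1, 0, 1]))  # two dichotomies: ~0.71
```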
Factors Affecting r

• Range restrictions
  - Looking at only a small portion of the total scatterplot (looking at a smaller portion of the scores' variability) decreases r.
  - Reducing variability reduces r.
• Nonlinearity
  - The Pearson r (and its relatives) measure the degree of linear relationship between two variables.
  - If a strong nonlinear relationship exists, r will provide a low, or at least inaccurate, measure of the true relationship.
Factors Affecting r

• Heterogeneous subsamples
  - Everyday examples (e.g. height and weight using both men and women)
• Outliers
  - Can overestimate the correlation
  - Can underestimate the correlation
Countries With Low Consumptions

[Scatterplot: data with restricted range, truncated at 5 cigarettes per day. X-axis: Cigarette Consumption per Adult per Day (2.5-5.5). Y-axis: CHD Mortality per 10,000 (2-20).]
[Illustrative scatterplots follow in the original slides: Truncation, Non-linearity, Heterogeneous samples, Outliers.]
Testing Correlations

• So you have a correlation. Now what?
• In terms of magnitude, how big is big?
  - Small correlations in large samples are "big."
  - Large correlations in small samples aren't always "big."
• Depends upon the magnitude of the correlation coefficient AND the size of your sample.
Testing r

• Population parameter = ρ (rho)
• Null hypothesis H₀: ρ = 0
  - Test of linear independence
  - What would a true null mean here?
  - What would a false null mean here?
• Alternative hypothesis (H₁): ρ ≠ 0
  - Two-tailed
Tables of Significance

• We can convert r to t and test for significance:

$$t = r\sqrt{\frac{N - 2}{1 - r^2}}$$

• where df = N - 2
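A sketch of the conversion in Python (the critical-value lookup is left to a t table, as in the slides):

```python
# Sketch: convert r to t with df = N - 2.
from math import sqrt

def r_to_t(r, n):
    return r * sqrt((n - 2) / (1 - r ** 2))

print(round(r_to_t(0.713, 21), 2))  # ~4.43; compare to t-crit(19 df) = 2.09
```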
Tables of Significance

• In our example r was .71, and N - 2 = 21 - 2 = 19:

$$t = r\sqrt{\frac{N - 2}{1 - r^2}} = .71\sqrt{\frac{19}{1 - .71^2}} = .71\sqrt{\frac{19}{.4959}} \approx 4.39$$

• t-crit (19 df) = 2.09
• Since 4.39 is larger than 2.09, reject ρ = 0.
Computer Printout

• The printout gives the test of significance.

Correlations
                                CIGARET   CHD
CIGARET   Pearson Correlation   1         .713**
          Sig. (2-tailed)       .         .000
          N                     21        21
CHD       Pearson Correlation   .713**    1
          Sig. (2-tailed)       .000      .
          N                     21        21
**. Correlation is significant at the 0.01 level (2-tailed).
Regression

What is regression?

• How do we predict one variable from another?
• How does one variable change as the other changes?
  - Influence
Linear Regression

• A technique we use to predict the most likely score on one variable from those on another variable
• Uses the nature of the relationship (i.e. correlation) between two variables to enhance your prediction
Linear Regression: Parts

• Y - the variable you are predicting
  - i.e. dependent variable
• X - the variable you are using to predict
  - i.e. independent variable
• Ŷ - your predictions (also known as Y')
Why Do We Care?

• We may want to make a prediction.
• More likely, we want to understand the relationship.
  - How fast does CHD mortality rise with a one-unit increase in smoking?
• Note: we speak about predicting, but often don't actually predict.
An Example

• Cigarettes and CHD mortality again
  - Data repeated below
• We want to predict the level of CHD mortality in a country averaging 10 cigarettes per day.
The Data

• Based on the data we have, what would we predict the rate of CHD to be in a country that smoked 10 cigarettes on average?
• First, we need to establish a prediction of CHD from smoking…

(The cigarette consumption and CHD mortality table is repeated here from "The Data" above.)
[Scatterplot with regression line: Cigarette Consumption per Adult per Day (x-axis, 2-12) vs. CHD Mortality per 10,000 (y-axis, 0-30). For a country that smokes 6 C/A/D, we predict a CHD rate of about 14.]
Regression Line

• Formula:

$$\hat{Y} = bX + a$$

• Ŷ = the predicted value of Y (e.g. CHD mortality)
• X = the predictor variable (e.g. average cig./adult/country)
Regression Coefficients

• "Coefficients" are a and b
• b = slope
  - Change in predicted Y for a one-unit change in X
• a = intercept
  - value of Ŷ when X = 0
Calculation

• Slope:

$$b = \frac{\mathrm{Cov}_{XY}}{s_X^2} \quad\text{or}\quad b = r\left(\frac{s_y}{s_x}\right) \quad\text{or}\quad b = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$$

• Intercept:

$$a = \bar{Y} - b\bar{X}$$
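A minimal Python sketch of these two formulas (stdlib only; names are illustrative):

```python
# Sketch of the slope and intercept formulas.
from statistics import mean, stdev

def fit_line(x, y):
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)
    b = cov / stdev(x) ** 2   # slope: covariance over the variance of X
    a = y_bar - b * x_bar     # intercept: the line passes through the means
    return b, a
```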
For Our Data

• Cov_XY = 11.12
• s²_X = 2.334² = 5.447
• b = 11.12/5.447 = 2.042
• a = 14.524 - 2.042 × 5.952 = 2.37
• See the SPSS printout below.

Answers are not exact due to rounding error and the desire to match SPSS.
SPSS Printout

[SPSS coefficients printout]

Note:

• The values we obtained are shown on the printout.
• The intercept is the value in the B column labeled "constant."
• The slope is the value in the B column labeled with the name of the predictor variable.
Making a Prediction

• Second, once we know the relationship, we can predict:

$$\hat{Y} = bX + a = 2.042X + 2.367$$
$$\hat{Y} = 2.042 \times 10 + 2.367 = 22.787$$

• We predict that about 22.79 people per 10,000 will die of CHD in a country with an average of 10 C/A/D.
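The same prediction as a two-line Python sketch, using the coefficients above:

```python
# Using the fitted line for the prediction above.
b, a = 2.042, 2.367     # slope and intercept from the slides
y_hat = b * 10 + a      # a country averaging 10 cigarettes/adult/day
print(round(y_hat, 2))  # 22.79 CHD deaths per 10,000
```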
Accuracy of Prediction

• Finnish smokers smoke 6 C/A/D
• We predict:

$$\hat{Y} = bX + a = 2.042 \times 6 + 2.367 = 14.619$$

• They actually have 23 deaths/10,000
• Our error ("residual") = 23 - 14.619 = 8.381
  - a large error
[Scatterplot with regression line: Cigarette Consumption per Adult per Day (x-axis, 2-12) vs. CHD Mortality per 10,000 (y-axis, 0-30); the line's value is labeled as the prediction and the vertical distance from the observed point to the line as the residual.]
Residuals

• When we predict Ŷ for a given X, we will sometimes be in error.
• Y - Ŷ for any X is an error of estimate
  - Also known as: a residual
• We want Σ(Y - Ŷ) to be as small as possible.
• BUT, there are infinitely many lines that can do this.
  - Just draw ANY line that goes through the mean of the X and Y values.
• Minimize errors of estimate… How?
Minimizing Residuals

• Again, the problem lies with this definition of the mean:

$$\sum (X - \bar{X}) = 0$$

• So, how do we get rid of the 0's?
  - Square them.
Regression Line: A Mathematical Definition

• The regression line is the line which, when drawn through your data set, produces the smallest value of:

$$\sum (Y - \hat{Y})^2$$

• Called the sum of squared residuals, or SS_residual
• The regression line is also called a "least squares line."
Summarizing Errors of Prediction

• Residual variance
  - The variability of the observations around the predicted values

$$s_{Y-\hat{Y}}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{N - 2} = \frac{SS_{residual}}{N - 2}$$
Standard Error of Estimate

• Standard error of estimate
  - The standard deviation of the observations around the predicted values

$$s_{Y-\hat{Y}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{N - 2}} = \sqrt{\frac{SS_{residual}}{N - 2}}$$

• A common measure of the accuracy of our predictions
  - We want it to be as small as possible.
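A sketch of this formula in Python, taking observed and predicted values as inputs (names are illustrative):

```python
# Sketch: standard error of estimate from observed and predicted values.
from math import sqrt

def standard_error_of_estimate(y, y_hat):
    ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return sqrt(ss_residual / (len(y) - 2))  # df = N - 2
```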
Example

Country   X (Cig.)   Y (CHD)   Y'       (Y - Y')   (Y - Y')²
1         11         26        24.829    1.171      1.371
2          9         21        20.745    0.255      0.065
3          9         24        20.745    3.255     10.595
4          9         21        20.745    0.255      0.065
5          8         19        18.703    0.297      0.088
6          8         13        18.703   -5.703     32.524
7          8         19        18.703    0.297      0.088
8          6         11        14.619   -3.619     13.097
9          6         23        14.619    8.381     70.241
10         5         15        12.577    2.423      5.871
11         5         13        12.577    0.423      0.179
12         5          4        12.577   -8.577     73.565
13         5         18        12.577    5.423     29.409
14         5         12        12.577   -0.577      0.333
15         5          3        12.577   -9.577     91.719
16         4         11        10.535    0.465      0.216
17         4         15        10.535    4.465     19.936
18         4          6        10.535   -4.535     20.566
19         3         13         8.493    4.507     20.313
20         3          4         8.493   -4.493     20.187
21         3         14         8.493    5.507     30.327
Mean       5.952     14.524
SD         2.334      6.690
Sum                                      0.04      440.757

$$s_{Y-\hat{Y}}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{N - 2} = \frac{440.756}{21 - 2} = 23.198$$

$$s_{Y-\hat{Y}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{N - 2}} = \sqrt{23.198} = 4.816$$
Regression and Z Scores

• When your data are standardized (linearly transformed to z-scores), the slope of the regression line is called β (beta).
  - DO NOT confuse this β with the β associated with Type II errors. They're different.
• When we have one predictor, r = β
• Ẑ_y = βz_x, since a now equals 0
Partitioning Variability

• Sums of squared deviations
  - Total:

$$SS_{total} = \sum (Y - \bar{Y})^2$$

  - Regression:

$$SS_{regression} = \sum (\hat{Y} - \bar{Y})^2$$

  - Residual (we already covered):

$$SS_{residual} = \sum (Y - \hat{Y})^2$$

• SS_total = SS_regression + SS_residual
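A sketch of the partition in Python; with a least-squares line, the first value should equal the sum of the other two up to rounding:

```python
# Sketch: the three sums of squares; ss_total = ss_regression + ss_residual.
from statistics import mean

def partition_ss(y, y_hat):
    y_bar = mean(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
    ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return ss_total, ss_regression, ss_residual
```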


Partitioning Variability

• Degrees of freedom
  - Total: df_total = N - 1
  - Regression: df_regression = number of predictors
  - Residual: df_residual = df_total - df_regression
• df_total = df_regression + df_residual
Partitioning Variability

• Variance (or mean square)
  - Total variance: s²_total = SS_total / df_total
  - Regression variance: s²_regression = SS_regression / df_regression
  - Residual variance: s²_residual = SS_residual / df_residual
Example

Country   X (Cig.)   Y (CHD)   Y'       (Y - Y')   (Y - Y')²   (Y' - Ȳ)²   (Y - Ȳ)²
1         11         26        24.829    1.171      1.371      106.193     131.699
2          9         21        20.745    0.255      0.065       38.701      41.939
3          9         24        20.745    3.255     10.595       38.701      89.795
4          9         21        20.745    0.255      0.065       38.701      41.939
5          8         19        18.703    0.297      0.088       17.464      20.035
6          8         13        18.703   -5.703     32.524       17.464       2.323
7          8         19        18.703    0.297      0.088       17.464      20.035
8          6         11        14.619   -3.619     13.097        0.009      12.419
9          6         23        14.619    8.381     70.241        0.009      71.843
10         5         15        12.577    2.423      5.871        3.791       0.227
11         5         13        12.577    0.423      0.179        3.791       2.323
12         5          4        12.577   -8.577     73.565        3.791     110.755
13         5         18        12.577    5.423     29.409        3.791      12.083
14         5         12        12.577   -0.577      0.333        3.791       6.371
15         5          3        12.577   -9.577     91.719        3.791     132.803
16         4         11        10.535    0.465      0.216       15.912      12.419
17         4         15        10.535    4.465     19.936       15.912       0.227
18         4          6        10.535   -4.535     20.566       15.912      72.659
19         3         13         8.493    4.507     20.313       36.373       2.323
20         3          4         8.493   -4.493     20.187       36.373     110.755
21         3         14         8.493    5.507     30.327       36.373       0.275
Mean       5.952     14.524
SD         2.334      6.690
Sum                                      0.04      440.757      454.307     895.247

Y' = (2.04 × X) + 2.37
Example

$$SS_{total} = \sum (Y - \bar{Y})^2 = 895.247;\quad df_{total} = 21 - 1 = 20$$

$$SS_{regression} = \sum (\hat{Y} - \bar{Y})^2 = 454.307;\quad df_{regression} = 1 \text{ (only 1 predictor)}$$

$$SS_{residual} = \sum (Y - \hat{Y})^2 = 440.757;\quad df_{residual} = 20 - 1 = 19$$

$$s_{total}^2 = \frac{\sum (Y - \bar{Y})^2}{N - 1} = \frac{895.247}{20} = 44.762$$

$$s_{regression}^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{1} = \frac{454.307}{1} = 454.307$$

$$s_{residual}^2 = \frac{\sum (Y - \hat{Y})^2}{N - 2} = \frac{440.757}{19} = 23.198$$

Note: s²_residual = s²_{Y-Ŷ}
Coefficient of Determination

• It is a measure of the percent of predictable variability:

$$r^2 = \text{the correlation squared} \quad\text{or}\quad r^2 = \frac{SS_{regression}}{SS_Y}$$

• The percentage of the total variability in Y explained by X
r² for Our Example

• r = .713
• r² = .713² = .508

$$r^2 = \frac{SS_{regression}}{SS_Y} = \frac{454.307}{895.247} = .507$$

• Approximately 50% of the variability in CHD mortality is associated with variability in smoking.
Coefficient of Alienation

• It is defined as 1 - r², or:

$$1 - r^2 = \frac{SS_{residual}}{SS_Y}$$

• Example: 1 - .508 = .492

$$1 - r^2 = \frac{SS_{residual}}{SS_Y} = \frac{440.757}{895.247} = .492$$
r², SS and s_{Y-Y'}

• r² × SS_total = SS_regression
• (1 - r²) × SS_total = SS_residual
• We can also use r² to calculate the standard error of estimate as:

$$s_{Y-\hat{Y}} = s_y\sqrt{(1 - r^2)\left(\frac{N - 1}{N - 2}\right)} = 6.690\sqrt{(.492)\left(\frac{20}{19}\right)} = 4.816$$
Testing the Overall Model

• We can test the overall prediction of the model by forming the ratio:

$$F = \frac{s_{regression}^2}{s_{residual}^2}$$

• If the calculated F value is larger than a tabled value (F table), we have a significant prediction.
Testing the Overall Model

• Example:

$$F = \frac{s_{regression}^2}{s_{residual}^2} = \frac{454.307}{23.198} = 19.594$$

• F critical is found using two things: df_regression (numerator) and df_residual (denominator).
• From the F table, F_crit (1, 19) = 4.38
• 19.594 > 4.38, so the overall prediction is significant.
• Should all sound familiar…
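The same ratio as a short Python sketch, built from the sums of squares computed earlier:

```python
# Sketch: the overall F ratio from the two mean squares.
ms_regression = 454.307 / 1   # SS_regression / df_regression
ms_residual = 440.757 / 19    # SS_residual / df_residual
print(round(ms_regression / ms_residual, 2))  # ~19.6 > F_crit(1, 19) = 4.38
```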
SPSS Output

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .713a   .508       .482                4.81640
a. Predictors: (Constant), CIGARETT

ANOVA(b)
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  454.482          1    454.482       19.592   .000a
   Residual    440.757          19   23.198
   Total       895.238          20
a. Predictors: (Constant), CIGARETT
b. Dependent Variable: CHD
Testing Slope and Intercept

• The regression coefficients can be tested for significance.
• Each coefficient divided by its standard error equals a t value that can also be looked up in a t table.
• Each coefficient is tested against 0.
Testing the Slope

• With only 1 predictor, the standard error for the slope is:

$$se_b = \frac{s_{Y-\hat{Y}}}{s_X\sqrt{N - 1}}$$

• For our example:

$$se_b = \frac{4.816}{2.334\sqrt{21 - 1}} = \frac{4.816}{10.438} = .461$$
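A sketch of the slope test in Python, using the values above:

```python
# Sketch: standard error of the slope and its t test against zero.
from math import sqrt

s_est, s_x, n, b = 4.816, 2.334, 21, 2.042
se_b = s_est / (s_x * sqrt(n - 1))
print(round(se_b, 3), round(b / se_b, 2))  # ~0.461 and t ~ 4.43
```

Note that t² is approximately the F from the overall model test, as expected with a single predictor.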
Testing Slope and Intercept

• These are given in the computer printout as a t test.
Testing

• The t values in the second-from-right column are tests on slope and intercept.
• The associated p values are next to them.
• The slope is significantly different from zero, but the intercept is not.
  - Why do we care?
Testing

• What does it mean if the slope is not significant?
  - How does that relate to the test on r?
• What if the intercept is not significant?
• Does a significant slope mean we predict quite well?
