Correlation
Lecture 4
Survey Research & Design in Psychology
James Neill, 2012
Overview
1. Purpose of correlation
2. Covariation
3. Linear correlation
4. Types of correlation
5. Interpreting correlation
6. Assumptions / limitations
7. Dealing with several correlations
Readings
Howell (2010)
Ch6 Categorical Data & Chi-Square
Ch9 Correlation & Regression
Ch10 Alternative Correlational
Techniques
10.1 Point-Biserial Correlation and Phi:
Pearson Correlation by Another Name
10.3 Correlation Coefficients for Ranked
Data
Purpose of correlation
Purpose of correlation
The underlying purpose of
correlation is to help address the
question:
What is the
relationship or
degree of association or
amount of shared variance
between two variables?
Purpose of correlation
Other ways of expressing the
underlying correlational question
include:
To what extent
do two variables covary?
are two variables dependent or
independent of one another?
can one variable be predicted
from another?
Covariation
The world is made of covariation.
We observe covariations in the psychosocial world, e.g., depictions of violence in the environment, or psychological states such as stress and depression.
Do they tend to co-occur? We can measure our observations.
Covariations are the basis of more complex models.
Linear correlation
Linear correlation
The extent to which two variables
have a simple linear (straight-line)
relationship.
Linear correlations provide the
building blocks for multivariate
correlational analyses, such as:
Factor analysis
Reliability
Multiple linear regression
Linear correlation
Linear relations between variables
are indicated by correlations:
Direction: Correlation sign (+ / -)
indicates direction of linear relationship
Strength: Correlation size indicates
strength (ranges from -1 to +1)
Statistical significance: p indicates
likelihood that observed relationship
could have occurred by chance
What is the linear correlation?
Types of answers
No relationship (independence)
Linear relationship:
As one variable increases, so does the other (+ve)
As one variable increases, the other decreases (-ve)
Non-linear relationship
Exercise caution due to:
Heteroscedasticity
Restricted range
Heterogeneous samples
Types of correlation
To decide which type of
correlation to use, consider
the levels of measurement
for each variable
Types of correlation
Nominal by nominal:
Phi (φ) / Cramér's V, Chi-square
Ordinal by ordinal:
Spearman's rank / Kendall's Tau-b
Dichotomous by interval/ratio:
Point-biserial (rpb)
Interval/ratio by interval/ratio:
Product-moment or Pearson's r
Types of correlation and LOM
Nominal by nominal: clustered bar chart; Chi-square, Phi (φ) or Cramér's V
Nominal (dichotomous) by interval/ratio: scatterplot, bar chart or error-bar chart; point-biserial correlation (rpb)
Ordinal by ordinal: scatterplot or clustered bar chart; Spearman's rho or Kendall's tau
Ordinal by nominal: recode to the lower level of measurement
Interval/ratio by ordinal: recode to the lower level of measurement
Interval/ratio by interval/ratio: scatterplot; product-moment correlation (r)
Nominal by nominal
Nominal by nominal
correlational approaches
Contingency (or cross-tab) tables
Observed frequencies
Expected frequencies
Row and/or column %s
Marginal totals
Clustered bar chart
Chi-square
Phi (φ) / Cramér's V
Contingency tables
Bivariate frequency tables
Cell frequencies (red)
Marginal totals (blue)
Contingency table: Example
RED = Contingency cells
BLUE = Marginal totals
Contingency table: Example
Chi-square is based on the differences between
the actual and expected cell counts.
Contingency table: Example
Row and/or column cell percentages may also aid interpretation, e.g., ~2/3rds of smokers snore, whereas only ~1/3rd of non-smokers snore.
Clustered bar graph
Bivariate bar graph of frequencies or percentages.
The category axis bars are clustered (by colour or fill pattern) to indicate the second variable's categories.
~2/3rds of snorers are smokers, whereas only ~1/3rd of non-snorers are smokers.
Pearson chi-square test
Pearson chi-square test: Example
Write-up: χ²(1, N = 186) = 10.26, p = .001
Chi-square distribution: Example
Phi (φ) & Cramér's V
(non-parametric measures of correlation)
Phi (φ)
Use for 2x2, 2x3, 3x2 analyses
e.g., Gender (2) & Pass/Fail (2)
Cramér's V
Use for 3x3 or greater analyses
e.g., Favourite Season (4) x Favourite Sense (5)
Phi (φ) & Cramér's V: Example
χ²(1, N = 186) = 10.26, p = .001, φ = .24
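The arithmetic behind chi-square, phi, and Cramér's V can be sketched in plain Python (standard library only). The 2x2 table below is made up for illustration; it is not the slides' smoking/snoring data.

```python
import math

def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total x column total / N
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def phi_coefficient(table):
    # Phi for a 2x2 table: sqrt(chi-square / N)
    return math.sqrt(chi_square(table) / sum(map(sum, table)))

def cramers_v(table):
    # Cramér's V generalises phi to larger tables:
    # sqrt(chi-square / (N * (min(rows, cols) - 1)))
    n = sum(map(sum, table))
    k = min(len(table), len(table[0]))
    return math.sqrt(chi_square(table) / (n * (k - 1)))

table = [[10, 20], [20, 10]]  # hypothetical 2x2 counts
```

For a 2x2 table, Cramér's V reduces to phi, which is why the slides recommend phi for 2x2 analyses and V for larger tables.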
Ordinal by ordinal
Ordinal by ordinal
correlational approaches
Spearman's rho (rs)
Kendall's tau (τ)
Alternatively, use nominal by nominal techniques (i.e., treat as a lower level of measurement)
Graphing ordinal by ordinal data
Ordinal by ordinal data is difficult to visualise because it's non-parametric, yet there may be many points.
Consider using:
Non-parametric approaches (e.g., clustered bar chart)
Parametric approaches (e.g., scatterplot with binning)
Spearman's rho (rs) or
Spearman's rank order correlation
For ranked (ordinal) data
e.g., Olympic Placing correlated with World Ranking
Uses the product-moment correlation formula
Interpretation is adjusted to consider the underlying ranked scales
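The point that Spearman's rho "uses the product-moment correlation formula" on ranks can be made concrete with a short sketch (standard library only; the placing/ranking numbers are invented for illustration):

```python
import statistics as st

def _ranks(values):
    # Convert scores to ranks; tied scores share the average of their positions
    ordered = sorted(values)
    return [(ordered.index(v) + 1 + ordered.index(v) + ordered.count(v)) / 2
            for v in values]

def pearson_r(x, y):
    # Product-moment correlation: standardised covariance
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman_rho(x, y):
    # Spearman's rho: the product-moment formula applied to the ranks
    return pearson_r(_ranks(x), _ranks(y))
```

For example, hypothetical Olympic placings [1, 2, 3, 4, 5] against world rankings [2, 1, 3, 5, 4] give rho = .80: the ordering mostly agrees, with two adjacent swaps.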
Kendall's tau (τ)
Tau-a
Does not take joint ranks into account
Tau-b
Takes joint ranks into account
For square tables
Tau-c
Takes joint ranks into account
For rectangular tables
Dichotomous by
interval/ratio
Point-biserial correlation (rpb)
One dichotomous & one continuous variable
e.g., belief in god (yes/no) and amount of international travel
Calculate as for Pearson's product-moment r
Adjust interpretation to consider the underlying scales
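Since rpb is just Pearson's r with the dichotomous variable coded 0/1, a minimal sketch needs nothing beyond the product-moment formula (standard library only; the 0/1 codes and scores below are invented, not the slides' belief-in-God data):

```python
import statistics as st

def point_biserial(codes, scores):
    """r_pb: Pearson's product-moment r with the dichotomous variable
    coded 0/1; only the interpretation differs from ordinary r."""
    mc, ms = st.mean(codes), st.mean(scores)
    num = sum((c - mc) * (s - ms) for c, s in zip(codes, scores))
    den = (sum((c - mc) ** 2 for c in codes) *
           sum((s - ms) ** 2 for s in scores)) ** 0.5
    return num / den

# Hypothetical example: group membership (0 = No, 1 = Yes) vs. a score
r_pb = point_biserial([0, 0, 1, 1], [1, 2, 3, 4])
```

A positive rpb means the group coded 1 tends to score higher; swapping the 0/1 coding flips the sign, which is why the coding must be reported (as the example slide's "0 = No, 1 = Yes" does).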
Point-biserial correlation (rpb): Example
Those who report that they believe in God also report having travelled to slightly fewer countries (rpb = -.10), but this difference could have occurred by chance (p > .05), thus H0 is not rejected.
(Groups: Do not believe / Believe)
Point-biserial correlation (rpb): Example
0 = No
1 = Yes
Interval/ratio by
Interval/ratio
Scatterplot
Plot each pair of observations (X, Y)
x = predictor variable (independent)
y = criterion variable (dependent)
By convention:
the IV should be plotted on the x (horizontal) axis
the DV on the y (vertical) axis.
Scatterplot showing relationship between age & cholesterol with line of best fit
Line of best fit
The correlation between 2 variables is a measure of the degree to which pairs of numbers (points) cluster together around a best-fitting straight line
Line of best fit: y = a + bx
Check for:
outliers
linearity
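The line of best fit y = a + bx can be sketched with the usual least-squares formulas (a minimal stdlib-only illustration; the data are made up):

```python
def best_fit(x, y):
    # Least-squares line of best fit: y = a + b*x
    # b = covariance(x, y) / variance(x); a = mean(y) - b * mean(x)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) /
         sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x
a, b = best_fit([1, 2, 3], [3, 5, 7])
```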
What's wrong with this scatterplot?
The IV should be treated as X and the DV as Y, although the IV/DV distinction is not always clear-cut.
Scatterplot example: Strong positive (.81)
Q: Why is infant mortality positively linearly associated with the number of physicians (with the effects of GDP removed)?
A: Because more doctors tend to be deployed to areas with high infant mortality (socio-economic status aside).
Scatterplot example: Weak positive (.14)
Scatterplot example: Moderately strong negative (-.76)
Pearson product-moment correlation (r)
The product-moment correlation is the standardised covariance.
Covariance
Variance shared by 2 variables:
Cov_XY = Σ(X − X̄)(Y − Ȳ) / (N − 1)
The numerator sums the cross-products of deviations from the means.
Covariance reflects the direction of the relationship:
+ve cov indicates a positive relationship
-ve cov indicates a negative relationship
Covariance: Cross-products
Dividing a scatterplot at the means of X and Y gives four quadrants: points in the upper-right and lower-left quadrants contribute +ve deviation cross-products; points in the upper-left and lower-right quadrants contribute -ve deviation cross-products.
Covariance
Covariance depends on the scale of measurement, so covariances can't be compared across different scales of measurement (e.g., age by weight in kilos versus age by weight in grams).
Therefore, standardise the covariance (divide by the cross-product of the SDs) to obtain the correlation.
Correlation is an effect size, i.e., a standardised measure of the strength of linear relationship.
Covariance, SD, and correlation: Quiz
For a given set of data the covariance between X and Y is 1.20. The SD of X is 2 and the SD of Y is 3. The resulting correlation is:
a. .20
b. .30
c. .40
d. 1.20
Answer: 1.20 / (2 x 3) = .20
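The covariance formula and its standardisation into r can be sketched directly (stdlib only; the data are illustrative). This mirrors the quiz: a covariance of 1.20 with SDs of 2 and 3 gives r = 1.20 / (2 x 3) = .20.

```python
import statistics as st

def covariance(x, y):
    # Sample covariance: summed cross-products of deviations over N - 1
    n = len(x)
    mx, my = st.mean(x), st.mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    # Standardise the covariance by dividing by the product of the SDs,
    # yielding Pearson's product-moment r
    return covariance(x, y) / (st.stdev(x) * st.stdev(y))
```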
Hypothesis testing
Almost all correlations are not 0,
therefore the question is:
What is the likelihood that a
relationship between variables is a
true relationship - or could it
simply be a result of random
sampling variability or chance?
Significance of correlation
Null hypothesis (H0): ρ = 0: assumes that there is no true relationship (in the population)
Alternative hypothesis (H1): ρ ≠ 0: assumes that the relationship is real (in the population)
Initially assume H0 is true, and evaluate whether the data support H1.
ρ (rho) = population product-moment correlation coefficient
How to test the null hypothesis
Select a critical alpha level (α); commonly .05
Can use a 1- or 2-tailed test
Calculate the correlation and its p value. Compare this to the critical alpha.
If p < α, the correlation is statistically significant, i.e., there is less than a 5% chance (for α = .05) that the relationship being tested is due to random sampling variability.
Correlation SPSS output
Correlations: Cigarette Consumption per Adult per Day x CHD Mortality per 10,000
Pearson Correlation: .713**
Sig. (2-tailed): .000
N: 21
**. Correlation is significant at the 0.01 level (2-tailed).
Imprecision in hypothesis testing
Type I error: rejecting H0 when it is true
Type II error: accepting H0 when it is false
The significance test result will depend on the power of the study, which is a function of:
Effect size (r)
Sample size (N)
Critical alpha level (αcrit)
Significance of correlation
df (N-2)   critical r (p = .05)
5          .67
10         .50
15         .41
20         .36
25         .32
30         .30
50         .23
200        .11
500        .07
1000       .05
The size of correlation required to be significant decreases as N increases. Why?
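The reason the required r shrinks with N can be sketched via the standard test statistic t = r√(n−2) / √(1−r²), which has n − 2 degrees of freedom. A stdlib-only illustration (the fixed cutoff t ≈ 2.0 is an assumption: it roughly approximates the two-tailed .05 critical t for moderate-to-large df, so the resulting critical r values are approximate, not the table's exact ones):

```python
import math

def t_from_r(r, n):
    # Convert a sample correlation into a t statistic with n - 2 df
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def approx_critical_r(n, t_crit=2.0):
    # Invert the formula: the smallest |r| that reaches t_crit.
    # t_crit = 2.0 is a rough stand-in for the two-tailed .05 cutoff.
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)
```

Because df sits under the square root in the denominator, approx_critical_r falls as N grows: with more data, even a small r is unlikely to arise from sampling variability alone.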
Scatterplot showing a confidence
interval for a line of best fit
US States 4th Academic Achievement by SES
Practice quiz question:
Significance of correlation
If the correlation between Age and
test Performance is statistically
significant, it means that:
a. there is an important relationship between Age
and test Performance
b. the true correlation between Age and
Performance in the population is equal to 0
c. the true correlation between Age and
Performance in the population is not equal to 0
d. getting older causes you to do poorly on tests
Interpreting correlation
Coefficient of Determination (r²)
CoD = the proportion of variance in one variable that can be accounted for by another variable.
e.g., r = .60, r² = .36 (36% of variance shared)
Interpreting correlation (Cohen, 1988)
A correlation is an effect size, so guidelines re strength can be suggested:
weak: r = .1 to .3 (r² = 1 to 10%)
moderate: r = .3 to .5 (r² = 10 to 25%)
strong: r > .5 (r² > 25%)
Interpreting correlation (Evans, 1996)
very weak: r = .00 to .19 (r² = 0 to 4%)
weak: r = .20 to .39 (r² = 4 to 16%)
moderate: r = .40 to .59 (r² = 16 to 36%)
strong: r = .60 to .79 (r² = 36 to 64%)
very strong: r = .80 to 1.00 (r² = 64 to 100%)
Correlation of this scatterplot = -.9
Scale has no effect on correlation: replotting the same data on different axis scales leaves r = -.9 unchanged.
What do you estimate the correlation of this scatterplot of height and weight to be?
a. -.5  b. -1  c. 0  d. .5  e. 1
What do you estimate the correlation of this scatterplot to be?
a. -.5  b. -1  c. 0  d. .5  e. 1
What do you estimate the correlation of this scatterplot to be?
a. -.5  b. -1  c. 0  d. .5  e. 1
Write-up: Example
Number of children and marital satisfaction were inversely related (r(48) = -.35, p < .05), such that contentment in marriage tended to be lower for couples with more children. Number of children explained approximately 10% of the variance in marital satisfaction, a small-to-moderate effect (see Figure 1).
Assumptions and
limitations
(Pearson product-moment
linear correlation)
Assumptions and limitations
1. Levels of measurement (interval or ratio)
2. Correlation is not causation
3. Linearity
   - Effects of outliers
   - Non-linearity
4. Normality
5. Homoscedasticity
6. Range restriction
7. Heterogeneous samples
Correlation is not causation e.g.,:
correlation between ice cream consumption and crime,
but actual cause is temperature
Correlation is not causation e.g.,:
Stop global warming: Become a pirate
Causation may be
in the eye of the
beholder
It's a rather interesting
phenomenon. Every time I
press this lever, that
graduate student breathes
a sigh of relief.
Effect of outliers
Outliers can disproportionately increase or decrease r.
Options:
compute r with & without outliers
get more data for outlying values
recode outliers as having more conservative scores
transform the variable
recode the variable into a lower level of measurement
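The first option, computing r with and without the outliers, can be sketched in a few lines (stdlib only; the six data points are invented to make the effect vivid, not the slides' age/self-esteem data):

```python
import statistics as st

def pearson_r(x, y):
    # Pearson's product-moment correlation
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical data with one extreme point as the final pair
x = [20, 25, 30, 35, 40, 75]
y = [6, 6, 5, 5, 4, 9]

r_with = pearson_r(x, y)               # outlier included
r_without = pearson_r(x[:-1], y[:-1])  # outlier excluded
```

Here the single outlier flips a clearly negative correlation to a positive one, which is exactly why the slides recommend reporting both values.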
Age & self-esteem (r = .63)
(Scatterplot of SE by AGE, including outliers)
Age & self-esteem, outliers removed (r = .23)
(Scatterplot of SE by AGE)
Non-linear relationships
Check the scatterplot. Can a linear relationship capture the lion's share of the variance? If so, use r.
(Scatterplot of X2 by Y2 showing a curvilinear relationship)
Non-linear relationships
If non-linear, consider:
Does a linear relation help?
Transforming variables to create a linear relationship
Using a non-linear mathematical function to describe the relationship between the variables
Normality
The X and Y data should be sampled from populations with normal distributions
Do not overly rely on a single indicator of normality; use histograms, skewness and kurtosis, and inferential tests (e.g., Shapiro-Wilk)
Note that inferential tests of normality are overly sensitive when the sample is large
Homoscedasticity
Homoscedasticity refers to even spread of observations about a line of best fit
Heteroscedasticity refers to uneven spread about a line of best fit
Assess visually and with Levene's test
Range restriction
Range restriction occurs when the sample contains a restricted (or truncated) range of scores
e.g., cognitive capacity and age < 18 might have a linear relationship
If the range is restricted, be cautious in generalising beyond the range for which data are available
e.g., cognitive capacity does not continue to increase linearly with age after age 18
Heterogeneous samples
Sub-samples (e.g., males & females) may artificially increase or decrease the overall r.
Solution: calculate r separately for sub-samples and overall, and look for differences.
(Scatterplot of height (H1) by weight (W1) for a combined sample)
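The recommended check, r overall and r within each sub-sample, can be sketched as follows (stdlib only; the group labels and data are invented to show how sub-samples can mask a within-group pattern):

```python
import statistics as st

def pearson_r(x, y):
    # Pearson's product-moment correlation
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def r_by_group(groups, x, y):
    """Overall r plus r within each sub-sample, keyed by group label."""
    out = {"overall": pearson_r(x, y)}
    for g in set(groups):
        xs = [a for grp, a in zip(groups, x) if grp == g]
        ys = [b for grp, b in zip(groups, y) if grp == g]
        out[g] = pearson_r(xs, ys)
    return out

# Hypothetical data: each group is perfectly negative, yet a gap between
# the groups makes the pooled correlation strongly positive
result = r_by_group(["m", "m", "m", "f", "f", "f"],
                    [1, 2, 3, 11, 12, 13],
                    [3, 2, 1, 13, 12, 11])
```

Comparing the dictionary entries makes any discrepancy between the pooled and within-group correlations immediately visible.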
Scatterplot of Same-sex & Opposite-sex Relations by Gender
(The female and male sub-samples show different correlations: r = .67 and r = .52)
Scatterplot of Weight and Self-esteem by Gender
(The male and female sub-samples show opposite relationships: r = .50 and r = -.48)
Dealing with several correlations
Scatterplot matrices organise scatterplots and correlations amongst several variables at once.
However, they become hard to read with more than about five variables at a time.
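A correlation matrix like the APA-style table on the next slide can be built by applying Pearson's r to every pair of variables. A stdlib-only sketch (the variable names and data are hypothetical):

```python
import statistics as st

def pearson_r(x, y):
    # Pearson's product-moment correlation
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def correlation_matrix(data):
    """data: dict mapping variable name -> list of scores.
    Returns a nested dict of pairwise correlations, rounded to 2 dp
    (the usual precision in an APA-style table)."""
    names = list(data)
    return {a: {b: round(pearson_r(data[a], data[b]), 2) for b in names}
            for a in names}

matrix = correlation_matrix({"x": [1, 2, 3, 4],
                             "y": [2, 4, 6, 8],
                             "z": [4, 3, 2, 1]})
```

The diagonal is always 1.0 (each variable correlates perfectly with itself), so published tables usually report only the lower triangle.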
Correlation matrix: Example of an APA Style Correlation Table
Scatterplot matrix
Summary
Key points
1. Covariations are the building blocks of more complex analyses, e.g., reliability analysis, factor analysis, multiple regression
2. Correlation does not prove causation: the relationship may be in the opposite direction, co-causal, or due to other variables.
Key points
3. Choose measure of correlation
and graphs based on levels of
measurement.
4. Check graphs (e.g., scatterplot):
Outliers?
Linear?
Range?
Homoscedasticity?
Sub-samples to consider?
Key points
5. Consider effect size (e.g., φ, Cramér's V, r, r²) and direction of relationship
6. Conduct inferential test (if needed).
Key points
7. Interpret/Discuss
Relate back to research
hypothesis
Describe & interpret correlation
(direction, size, significance)
Acknowledge limitations e.g.,
Heterogeneity (sub-samples)
Range restriction
Causality?
References
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Evans, J. D. (1996). Straightforward statistics for the behavioral sciences. Pacific Grove, CA: Brooks/Cole.
Howell, D. C. (2007). Fundamental statistics for the behavioral sciences. Belmont, CA: Wadsworth.
Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA: Wadsworth.
Open Office Impress
This presentation was made using Open Office Impress, free and open source software.
http://www.openoffice.org/product/impress.html