
Chapter 14

Correlation and Regression

•2


In Chapter 14
14.1 Data
14.2 Scatterplots
14.3 Correlation
14.4 Regression

•3
14.1 Data
• Quantitative response variable Y (“dependent
variable”)
• Quantitative explanatory variable X (“independent
variable”)
• Historically important public health data set used to
illustrate techniques (Doll, 1955)
– n = 11 countries
– Explanatory variable = per capita cigarette consumption in
1930 (CIG1930)
– Response variable = lung cancer mortality per 100,000
(LUNGCA)
•4
Table 14.2 Data used for chapter illustrations. Per capita
cigarette consumption in 1930 (cig1930) and lung cancer
cases per 100,000 in 1950 (lungca) in 11 countries.

Data from Doll, R. (1955). Etiology of lung cancer. Advances in Cancer Research, 3, 1–50. Data stored online in the file doll-ecol.sav.

•5
Figure 14.1 Scatterplot of Doll’s illustration of the correlation
between smoking and lung cancer rates. (Data listed in Table
14.2.) The data point for the United States is highlighted.
Inspect scatterplots
• Form: Can the relation be described with a
straight line or some other type of line?
• Direction: Do points trend upward or
downward?
• Strength of association: Do points adhere
closely to an imaginary trend line?
• Outliers (if any): Are there any striking
deviations from the overall pattern?
(A minimal plotting sketch follows this list.)
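The same inspection can be scripted. A minimal sketch with matplotlib; the data values are illustrative stand-ins, not the actual Doll (1955) figures, which are not reproduced in this transcript:

```python
# Sketch: draw a scatterplot and inspect form, direction, strength, outliers.
# The cig1930/lungca values below are hypothetical stand-ins.
import matplotlib.pyplot as plt

cig1930 = [220, 250, 310, 380, 455, 510, 680, 760, 1100, 1145, 1280]
lungca  = [ 60,  90, 115, 170, 165, 245, 170, 350,  455,  460,  200]

plt.scatter(cig1930, lungca)
plt.xlabel("Per capita cigarette consumption, 1930 (cig1930)")
plt.ylabel("Lung cancer deaths per 100,000, 1950 (lungca)")
plt.title("Inspect form, direction, strength, and outliers")
plt.show()
```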
•7
Judging Correlational Strength
• Correlational strength refers to the degree to
which points adhere to a trend line
• The eye is not a good judge of strength.
• The top plot appears to show a weaker
correlation than the bottom plot. However,
both plots display the same data. (The
perceived difference is an artifact of axis
scaling.)

•8
Figure 14.2
Scatterplots of the
same data with
different axis
scalings. It is difficult
to determine
correlational
strength visually.
§14.3. Correlation
• Correlation coefficient r quantifies linear relationship
with a number between −1 and 1.
• When all points fall on a line with an upward slope, r
= 1. When all data points fall on a line with a
downward slope, r = −1
• When data points trend upward, r is positive; when
data points trend downward, r is negative.
• The closer r is to 1 or −1, the stronger the
correlation.

•10
Figure 14.3
Examples of
different
correlations.

•11
Calculating r
• Formula:

    r = (1 / (n − 1)) Σ (zX · zY)

• The correlation coefficient tracks the degree to which X
and Y "go together."
• Recall that z scores quantify the number of standard
deviations a value lies above or below its mean.
• When the z scores for X and Y track in the same
direction, their products are positive and r is positive
(and vice versa). (A computational sketch follows.)
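A minimal sketch of this formula in Python/numpy; the x and y values are the illustrative stand-ins used earlier, not the actual Doll figures:

```python
# Sketch: compute r from z scores, per the slide's formula
# r = (1/(n-1)) * sum(z_x * z_y). Data are hypothetical stand-ins.
import numpy as np

x = np.array([220, 250, 310, 380, 455, 510, 680, 760, 1100, 1145, 1280], float)
y = np.array([ 60,  90, 115, 170, 165, 245, 170, 350,  455,  460,  200], float)

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)   # z scores use the sample SD (ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

r = np.sum(zx * zy) / (n - 1)
print(round(r, 3), round(r**2, 3))    # r and the coefficient of determination
```

The result agrees with `np.corrcoef(x, y)[0, 1]`, which is the usual shortcut in practice.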
•12
Table 14.3 Calculation of correlation
coefficient r, illustrative data.

•13
Calculating r
• In practice, we rely on computers and
calculators to calculate r.
– SPSS
– Scientific and graphing calculators
• I encourage my students to use these tools
whenever possible.

•14
Calculating r
• SPSS output for Analyze > Correlate > Bivariate
using the illustrative data:

•15
Interpretation of r
1. Direction. The sign of r indicates the direction of the
association: positive (r > 0), negative (r < 0), or no
association (r ≈ 0).
2. Strength. The closer r is to 1 or −1, the stronger the
association.
3. Coefficient of determination. The square of the
correlation coefficient (r²) is called the coefficient of
determination. This statistic quantifies the proportion
of the variance in Y [mathematically] "explained" by
X. For the illustrative data, r = 0.737 and r² = 0.54.
Therefore, 54% of the variance in Y is explained by X.

•16
Notes, cont.

4. Reversible relationship. With correlation, it does
not matter whether variable X or Y is specified as
the explanatory variable; the calculations come out the
same either way. [This will not be true for
regression.]
5. Outliers. Outliers can have a profound effect on r.
This figure has an r of 0.82 that is fully accounted
for by the single outlier (see next slide).

•17
Figure 14.4 The calculated correlation for this
data set is r = 0.82. The single influential
observation in the upper-right quadrant
accounts for this large r.
Notes, cont.
6. Linear relations only. Correlation applies only to
linear relationships. This figure shows a strong
nonlinear relationship, yet r = 0.00.
•19
Notes, cont.
7. Correlation does not necessarily mean causation.
Beware lurking variables.
• A near-perfect negative correlation (r = −0.987) was
seen between cholera mortality and elevation
above sea level during a 19th-century epidemic.
• We now know that cholera is transmitted by water.
• The observed relationship between cholera and
elevation was confounded by the lurking variable
"proximity to polluted water."
• See next slide

•20
Figure 14.6 Cholera mortality and elevation above sea
level were strongly correlated in the 1850s (r = −0.987),
but this correlation was an artifact of confounding by
the extraneous factor of "water source."

•21
Hypothesis Test
• Random selection from a random scatter can result
in an apparent correlation
• We conduct the hypothesis test to guard against
identifying too many random correlations.

•22
Hypothesis Test
A. Hypotheses. Let ρ represent the population correlation
coefficient.
H0: ρ = 0 vs. Ha: ρ ≠ 0 (two-sided)
[or Ha: ρ > 0 (right-sided) or Ha: ρ < 0 (left-sided)]
B. Test statistic:

    tstat = r / SEr,  where SEr = sqrt((1 − r²) / (n − 2))
    df = n − 2
C. P-value. Convert tstat to P-value with software or Table C.
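A sketch of steps B and C in Python with scipy, using r = 0.737 and n = 11 from the chapter's illustrative example:

```python
# Sketch: t test for H0: rho = 0 using the slide's formulas.
from math import sqrt
from scipy import stats

r, n = 0.737, 11
se_r = sqrt((1 - r**2) / (n - 2))          # SEr = sqrt((1 - r^2)/(n - 2))
t_stat = r / se_r                          # tstat = r / SEr
df = n - 2
p = 2 * stats.t.sf(abs(t_stat), df)        # two-sided P-value
print(round(se_r, 4), round(t_stat, 2), round(p, 4))  # 0.2253, 3.27, ~0.0097
```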

•23
Hypothesis Test – Illustrative Example
A. H0: ρ = 0 vs. Ha: ρ ≠ 0 (two-sided)
B. Test statistic:

    SEr = sqrt((1 − 0.737²) / (11 − 2)) = 0.2253
    tstat = 0.737 / 0.2253 = 3.27
    df = 11 − 2 = 9

C. .005 < P < .01 by Table C; P = .0097 by computer. The
evidence against H0 is highly significant.
•24
Confidence Interval for ρ
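The slide's formulas are not reproduced in this transcript. A sketch of the conventional approach, Fisher's z transformation (assumed here, since the slide content is missing), with r = 0.737 and n = 11:

```python
# Sketch of the standard Fisher z method for a CI for rho (assumed method;
# the slide's own formulas are not shown above).
from math import atanh, tanh, sqrt
from scipy import stats

r, n = 0.737, 11
z = atanh(r)                        # Fisher transform: 0.5 * ln((1+r)/(1-r))
se = 1 / sqrt(n - 3)                # standard error on the z scale
zcrit = stats.norm.ppf(0.975)       # 1.96 for 95% confidence
lo, hi = tanh(z - zcrit * se), tanh(z + zcrit * se)
print(round(lo, 2), round(hi, 2))   # back-transformed 95% CI for rho
```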

•25
Illustrative Example

•26
Conditions for Inference
• Independent observations
• Bivariate Normality (r can still be used
descriptively when data are not bivariate
Normal)

•27
Figure 14.8 Bivariate Normality
§14.4. Regression
• Regression describes the relationship in the
data with a line that predicts the average
change in Y per unit X.
• The best-fitting line is found by minimizing the
sum of squared residuals, as shown in this
figure.

•29
Figure 14.9 Fitted regression line and residuals,
smoking and lung cancer illustrative data

•30
Regression Line, cont.
• The regression line equation is:

    ŷ = a + b·x

where ŷ ≡ the predicted value of Y,
a ≡ the intercept of the line, and
b ≡ the slope of the line
• Equations to calculate a and b (the standard least-squares forms):

    SLOPE: b = r · (sY / sX)
    INTERCEPT: a = ȳ − b·x̄

(A computational sketch follows.)
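A sketch of the slope and intercept calculations under the formulas above; the data values are the illustrative stand-ins used earlier:

```python
# Sketch: least-squares slope and intercept from summary statistics,
# b = r * (sY/sX) and a = ybar - b*xbar. Hypothetical stand-in data.
import numpy as np

x = np.array([220, 250, 310, 380, 455, 510, 680, 760, 1100, 1145, 1280], float)
y = np.array([ 60,  90, 115, 170, 165, 245, 170, 350,  455,  460,  200], float)

r = np.corrcoef(x, y)[0, 1]
b = r * (y.std(ddof=1) / x.std(ddof=1))   # SLOPE
a = y.mean() - b * x.mean()               # INTERCEPT
print(round(b, 3), round(a, 3))           # y-hat = a + b*x
```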
•31
Figure 14.10 Components of a
regression model
Slope b is the key statistic produced by the regression analysis.
Regression Line, illustrative example

Here’s the output from SPSS:

•33
Inference
• Let α represent the population intercept, β
represent population slope, and εi represent the
residual “error” for point i.
• The population regression model is

    yi = α + β·xi + εi

• The estimated standard error of the regression is

    sY|x = sqrt( Σ(yi − ŷi)² / (n − 2) )
•34
Inference
• A (1 − α)100% CI for the population slope β is

    b ± t(df = n − 2, 1 − α/2) · SEb
•35
Confidence Interval for β – Example

95% Confidence Interval for B

    Model             Lower Bound    Upper Bound
    1  (Constant)        −4.342         17.854
       cig1930             .007           .039
•36
t Test of Slope Coefficient
A. Hypotheses. H0: β = 0 against Ha: β ≠ 0
B. Test statistic.
    tstat = b / SEb,  where SEb = sY|x / (sqrt(n − 1) · sX)
    df = n − 2

C. P-value. Convert the tstat to a P-value
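A sketch of the slope t test using the SPSS point estimates shown on the next slide (b = .023, SE = .007); the small discrepancy from the printed t = 3.275 reflects rounding of these inputs:

```python
# Sketch: t test of the slope, tstat = b / SEb with df = n - 2.
from scipy import stats

b, se_b, n = 0.023, 0.007, 11
t_stat = b / se_b
df = n - 2
p = 2 * stats.t.sf(abs(t_stat), df)    # two-sided P-value
print(round(t_stat, 2), round(p, 3))   # ~3.29 and ~.009 here; SPSS prints 3.275, .010
```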

•37
t Test: Illustrative Example

                    Unstandardized         Standardized
                     Coefficients          Coefficients
    Model               B      Std. Error     Beta         t       Sig.
    1  (Constant)     6.756      4.906                   1.377     .202
       cig1930         .023       .007         .737      3.275     .010

•38
Analysis of Variance of the
Regression Model
• An ANOVA technique equivalent to the t test
can also be used to test H0: β = 0.
This technique is covered on pp. 321–324 in
the text but is not included in this
presentation.

•39
Conditions for Inference
• Inference about the regression line requires
these conditions
– Linearity
– Independent observations
– Normality at each level of X
– Equal variance at each level of X

•40
Figure 14.12 Population regression model showing
Normality and homoscedasticity conditions

•41
Assessing Conditions
• The scatterplot should be visually inspected
for linearity, Normality, and equal variance
• Plotting the residuals from the model can be
helpful in this regard (a computational sketch
follows this list).
• The table on the next slide lists the residuals for
the illustrative data.
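A sketch of how residuals (observed y minus fitted ŷ) can be computed; np.polyfit provides the least-squares fit, and the data values are the illustrative stand-ins used earlier:

```python
# Sketch: residuals e_i = y_i - y-hat_i for condition checking.
import numpy as np

x = np.array([220, 250, 310, 380, 455, 510, 680, 760, 1100, 1145, 1280], float)
y = np.array([ 60,  90, 115, 170, 165, 245, 170, 350,  455,  460,  200], float)

b, a = np.polyfit(x, y, 1)       # slope, intercept of least-squares line
residuals = y - (a + b * x)      # e_i = y_i - y-hat_i
print(np.round(residuals, 1))    # inspect with a stemplot or residual plot
```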

•42
Table 14.7 Residuals in the smoking and lung cancer
illustrative data set
Assessing Conditions, cont.
• A stemplot of the residuals shows no major
departures from Normality:

    |-1|6
    |-0|2336
    | 0|01366
    | 1|4
    (×10)

• A residual plot shows more variability at
higher X values (but the data are very sparse)
• See next slide
•44
Figure 14.15 Residual plot for
illustrative data set
Residual Plots
• With a little experience, you can get good at
reading residual plots.
• On the next three slides, see:
A. An example of linearity with equal variance
B. An example of linearity with unequal variance
C. An example of non-linearity with equal variance

•46
Figure 14.16 Residual plot demonstrating (A)
linearity with equal variance
Figure 14.16 Residual plot demonstrating
(B) linearity with unequal variance
Figure 14.16 Residual plot
demonstrating (C) nonlinearity.
Dependence and
Independence of Data

•50
Validity Issue

• Question: Is there a positive
association between X and Y?

•51
Your Answer Is …
• It is so clear that there is a
positive association between X
and Y.
• Wait… let's check whether
the data are independent or
dependent (i.e., are there repeated
measurements?)

•52
• Every two points connected by the
same red line are measurements
observed from the same person
(labeled A1/A2, B1/B2, and C1/C2
in the figure).

It is so clear that "Y" will decrease
as "X" increases for nearly all
study subjects.

•53
Failure to consider "dependence" of data
may result in biased study findings.

•54
Test for “Normality”
• Examining the residuals (or
standardized residuals) helps detect
violations of the required conditions.
• Nonnormality
– Use Excel to obtain the standardized
residual histogram (see the sketch below)
– Examine the histogram and look for a
bell-shaped diagram with a mean close
to zero
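A sketch of the same check outside Excel, using numpy/matplotlib; note the standardization here is the crude residual-over-SD form, not the leverage-adjusted standardized residuals SPSS reports:

```python
# Sketch: histogram of (crudely) standardized residuals to check Normality.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([220, 250, 310, 380, 455, 510, 680, 760, 1100, 1145, 1280], float)
y = np.array([ 60,  90, 115, 170, 165, 245, 170, 350,  455,  460,  200], float)

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
std_resid = resid / resid.std(ddof=2)   # SD with n-2 df; no leverage adjustment

plt.hist(std_resid, bins=np.arange(-2.5, 3.0, 1.0))
plt.xlabel("Standardized residual")
plt.title("Look for a roughly bell-shaped histogram centered near 0")
plt.show()
```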
•55
Cont.
Standardized residuals

[Excel histogram of standardized residuals, binned from −2 to 2 and "More"]

It seems the residuals are Normally distributed
with mean zero.
•56
