
Correlation & Regression
BIO323
SBASSE, LUMS

Correlation Coefficient, R

• "R" is a measure of the strength of the linear association between two variables, x and y.

• Most statistical packages and some hand calculators can calculate R.

• For the data in our BW example, R = 0.94.

• R provides a single-number summary of how closely the data follow a linear trend.


Formula – Pearson Correlation, R

The Pearson correlation evaluates the linear relationship between two continuous variables.
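
The slide's formula image did not carry over in this transcript; for reference, the standard definition of the Pearson sample correlation coefficient (consistent with the R described above) is:

```latex
R = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
        {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```
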
Correlation Coefficient, R

• R takes values between −1 and +1

• R = 0 represents no linear relationship between the two variables

• R > 0 implies a direct (positive) linear relationship

• R < 0 implies an inverse (negative) linear relationship

• The closer R comes to either +1 or −1, the stronger the linear relationship

Coefficient of Determination

• R² is another important measure of linear association between x and y (0 ≤ R² ≤ 1)

• R² measures the proportion of the total variation in y which is explained by x

• For example, R² = 0.8751 indicates that 87.51% of the variation in BW is explained by the independent variable x (BMI).
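
As a reminder of where this proportion comes from (the slide itself gives only the numeric example), the coefficient of determination can be written in terms of sums of squares; in simple linear regression it also equals the square of the Pearson correlation:

```latex
R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2},
\qquad 0 \le R^2 \le 1
```
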
Difference between Correlation and Regression

• The correlation coefficient, R, measures the strength of bivariate association

• The regression line is a prediction equation that estimates the value of y for any given x within the domain of the independent variable

Limitations of the correlation coefficient

• Though R measures how closely the two variables approximate a straight line, it does not validly measure the strength of a nonlinear relationship

• When the sample size, n, is small, we also have to be careful about the reliability of the correlation

• Outliers can have a marked effect on R

• Correlation alone does not establish a causal linear relationship


Background: Regression

For continuous variables
• Correlation Coefficient, R
• Coefficient of Determination, R²

For categorical variables
• Odds Ratio (OR)
• Relative Risk (RR)

Example

• A researcher believes that there is a linear relationship between the BMI (kg/m²) of pregnant mothers and the birth weight (BW, in kg) of their newborns

• The following data set provides information on 15 pregnant mothers who were contacted for this study

BMI (kg/m²)   Birth weight (kg)

20 2.7
30 2.9
50 3.4
45 3.0
10 2.2
30 3.1
40 3.3
25 2.3
50 3.5
20 2.5
10 1.5
55 3.8
60 3.7
50 3.1
35 2.8
Scatter Diagram

• A scatter diagram is a graphical method to display the relationship between two variables

• A scatter diagram plots pairs of bivariate observations (x, y) on the X–Y plane

• Y is called the dependent variable

• X is called an independent variable


[Figure: Scatter diagram of BMI and birth weight — BMI (kg/m²) on the x-axis (0–70), birth weight (kg) on the y-axis (0–4)]

Is there a linear relationship between BMI and BW?

• Scatter diagrams are important for initial exploration of the relationship between two quantitative variables

• In the above example, we may wish to summarize this relationship by a straight line drawn through the scatter of points

Simple Linear Regression

• Although we could fit a line "by eye", e.g. using a transparent ruler, this would be a subjective approach and therefore unsatisfactory.

• An objective, and therefore better, way of determining the position of a straight line is to use the method of least squares.

• Using this method, we choose a line such that the sum of squares of the vertical distances of all points from the line is minimized.

Computing the vertical distance

Least-Squares or Regression Line

• These vertical distances, i.e., the distances between the y values and their corresponding estimated values on the line, are called residuals

• The line which fits best is called the regression line or, sometimes, the least-squares line

• The line always passes through the point defined by the mean of Y and the mean of X
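
To make the least-squares recipe concrete, here is a minimal Python sketch (not part of the original slides; it assumes NumPy is available, and the function name fit_line is illustrative only). It computes the slope and intercept from the closed-form least-squares solution and checks that the fitted line passes through (mean of X, mean of Y). Because it uses the 15 BMI/BW pairs as transcribed above, the fitted values may differ slightly from the rounded coefficients quoted later in the slides.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = a + b*x (closed form)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_bar, y_bar = x.mean(), y.mean()
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
    a = y_bar - b * x_bar                                             # intercept
    return a, b

# BMI and birth-weight pairs from the example data set
bmi = [20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35]
bw  = [2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8]

a, b = fit_line(bmi, bw)
print(f"intercept = {a:.4f}, slope = {b:.4f}")

# The regression line always passes through the point (mean x, mean y)
assert np.isclose(a + b * np.mean(bmi), np.mean(bw))
```
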
Linear Regression Model

• The method of least squares is usually referred to as linear regression

• Y is also known as the outcome variable

• X is also called a predictor


Assumptions for Linear Regression

Assumption # 1 — Independence of observations

• Independence is determined by the data collection process

• Dependent observations are not wanted, e.g. time-series data

• Independent observations are wanted, e.g. experimental studies, where the assignment of participants is randomized and controlled

Assumption # 2 — Linear and additive relationship

The relationship between the independent and dependent variables must be linear; fitting a straight line to a non-linear relationship produces an inefficient, misspecified model.

Linearity: e.g., the relationship between height and weight must be linear.

Assumption # 3 — Independence of errors

There should not be a relationship between the residuals and X: in a residual plot the points should be randomly scattered, with no evident pattern.


Assumption # 4 — Normality of errors
The residuals must be approximately normally distributed.

Assumption # 5 — Equal variances

The variance of the residuals should be the same for all values of x: a residual plot with no pattern supports this assumption, whereas a funnel shape indicates unequal variances.
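
As a rough illustration of how assumptions 3–5 are usually checked, here is a minimal Python sketch (not from the slides; it assumes NumPy and Matplotlib are installed). It fits the line with np.polyfit and plots residuals against fitted values; a patternless, evenly spread cloud around zero is what the assumptions require, while curvature or a funnel shape signals trouble.

```python
import numpy as np
import matplotlib.pyplot as plt

# BMI and birth-weight data from the earlier example slide
bmi = np.array([20, 30, 50, 45, 10, 30, 40, 25, 50, 20, 10, 55, 60, 50, 35], float)
bw  = np.array([2.7, 2.9, 3.4, 3.0, 2.2, 3.1, 3.3, 2.3, 3.5, 2.5, 1.5, 3.8, 3.7, 3.1, 2.8])

slope, intercept = np.polyfit(bmi, bw, 1)   # degree-1 least-squares fit
fitted = intercept + slope * bmi
residuals = bw - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted birth weight (kg)")
plt.ylabel("Residual (kg)")
plt.title("Residuals vs. fitted values")
plt.show()
```
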
The Least-Squares Line

y is a value on the vertical axis,
x is a value on the horizontal axis,
a is the point where the line crosses the vertical axis, and
b shows the amount by which y changes for each unit change in x.

Simple linear regression model: ŷ = a + bx

Estimated Regression Line for BW Data

ŷ = α̂ + β̂x = 1.775351 + 0.0330187x

α̂ = 1.775351 is called the y-intercept
β̂ = 0.0330187 is called the slope


Application of Regression Line

This equation allows you to estimate the BW of other newborns when the BMI is given.

e.g., for a mother who has BMI = 40, i.e. x = 40, we predict BW to be

ŷ = α̂ + β̂x = 1.775351 + 0.0330187 × 40 ≈ 3.096
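
A one-line check of this prediction, using the fitted coefficients quoted on the slide (a minimal sketch; the helper name predict_bw is illustrative, not from the slides):

```python
def predict_bw(bmi, intercept=1.775351, slope=0.0330187):
    """Predicted birth weight (kg) from the fitted regression line."""
    return intercept + slope * bmi

print(round(predict_bw(40), 3))  # 3.096
```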


A question of interest is how well one can predict and estimate deep abdominal AT (adipose tissue) from knowledge of the waist circumference.

Coefficient of Determination

R² as a measure of closeness of fit of the sample regression line to the sample observations.

We wish to know if we can conclude that the slope of the population regression line
describing the relationship between X and Y is zero.

Assumptions: We presume that the simple linear regression model and its underlying
assumptions are applicable.

Hypotheses: H0: β = 0 versus HA: β ≠ 0

Distribution of test statistic.

When the assumptions are met and H0 is true, the test statistic is distributed as Student's t with n − 2 degrees of freedom.

Decision rule:

Reject H0 if the computed value of t is either greater than or equal to 1.9826 or less than or equal to −1.9826.

Calculation of statistic:
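
The worked calculation on this slide did not survive the conversion; assuming the usual t test for a regression slope is what is being applied here, the test statistic has the standard form

```latex
t = \frac{\hat{\beta} - 0}{s_{\hat{\beta}}},
\qquad
s_{\hat{\beta}} = \frac{s_{y|x}}{\sqrt{\sum_{i}\left(x_i - \bar{x}\right)^2}},
\qquad
df = n - 2
```

where s_{y|x} is the standard error of estimate; for this example the computed value is the 14.741 quoted below.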

Statistical decision: Reject H0 because 14.741 > 1.9826

Conclusion: We conclude that the slope of the true regression line is not zero

Multiple regression model:
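
The model equation on this slide did not carry over; the standard form of a multiple linear regression model with k predictors is:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon
```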

Logistic regression
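
The body of this slide also did not carry over; as general background (not taken from the slide), logistic regression models a binary outcome through the log-odds of the event:

```latex
\ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad p = P(Y = 1 \mid x_1, \ldots, x_k)
```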

Types of Logistic regression

The End!

Example of Computing a Regression Line

• Data: (1, 2), (2, 1), (4, 3)

• M(x) = (1 + 2 + 4)/3 = 7/3 ≈ 2.33
• M(y) = (2 + 1 + 3)/3 = 2
• M(xy) = (1×2 + 2×1 + 4×3)/3 = 16/3 ≈ 5.33
• M(x²) = (1×1 + 2×2 + 4×4)/3 = 21/3 = 7

• Gradient = (M(xy) − M(x)·M(y)) / (M(x²) − M(x)²) = (5.33 − 4.67)/(7 − 5.44) ≈ 0.429

• Intercept = M(y) − Gradient × M(x) = 2 − (0.429 × 2.33) = 1
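
A quick numerical check of this worked example (a minimal sketch assuming NumPy; np.polyfit is just one of several ways to obtain the same least-squares line):

```python
import numpy as np

x = np.array([1, 2, 4], float)
y = np.array([2, 1, 3], float)

# Gradient and intercept via the means formula used above
gradient = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
intercept = y.mean() - gradient * x.mean()
print(gradient, intercept)   # ~0.4286 and 1.0

# Cross-check with NumPy's built-in least-squares polynomial fit
slope, const = np.polyfit(x, y, 1)
print(slope, const)          # same values
```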
