Linear Regression
Dr Menaal Kaushal
JR II
Department of S P M
S N Medical College, Agra
1 22-11-2013
Statistical Analysis can be:
Univariate: when only one variable is studied, e.g. heights of all the IV graders, ages of mothers delivering at a DH, etc. (Measures of Central Tendency, Measures of Dispersion)
Bivariate: when the relationship between two variables is studied, e.g. the relationship between height and weight of every child in the IV grade; the relationship between mother’s age and birth weight of her baby, etc.
Multivariate: when the relationships among more than two variables are studied, e.g. the relationship between height, weight and MAC of every child in the IV grade
Bivariate Regression
Linear Regression: when the outcome (dependent) variable is continuous
Logistic Regression: when the outcome variable is categorical, e.g. when the research question is answered with a yes or no category
Levels (Types) of Data
Nominal (Categorical) Measures: categories are exhaustive and mutually exclusive (e.g., religion, gender).
Ordinal Measures: all of the above, plus the categories can be rank-ordered (e.g., social class).
Interval Measures: all of the above, plus equal differences between measurement points (e.g., temperature in ℃ or ℉).
Ratio Measures: all of the above, plus a true zero point (e.g., weight, absolute temperature in kelvin).
Relationship Between Two
Variables
Association: any relation between variables
Positive association: above-average values of one variable tend to go with above-average values of the other; the scatter slopes up
Negative association: above-average values of one variable tend to go with below-average values of the other; the scatter slopes down
Linear association: roughly, the scatter diagram is clustered around a straight line. This is correlation
The “Football” Bivariate
Normal Scatter Plot
Can you identify any
difference?
How Tightly Clustered
Are these Data?
Calculating the Correlation
Coefficient
So, How to Calculate r
Formula of Correlation
Coefficient
Let's simplify:
Convert the data into standard units: (value − mean) / SD.
Multiply the corresponding standard unit values of x and y.
r is the mean of these products.
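The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the small data set are hypothetical, and the SD is taken as the population SD (dividing by n), which is an assumption consistent with "r is the mean of the products".

```python
import math

def standard_units(values):
    """Convert a list of numbers to standard units: (value - mean) / SD."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population SD (divide by n, not n - 1).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def correlation(x, y):
    """r = mean of the products of x and y in standard units."""
    zx = standard_units(x)
    zy = standard_units(y)
    products = [a * b for a, b in zip(zx, zy)]
    return sum(products) / len(products)

# Hypothetical illustrative data (not from the slides).
x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]

r = correlation(x, y)
print(round(r, 2))  # 0.4
```

Here x has mean 4 and SD 2, y has mean 7 and SD 4, and the products of the standard units average to 0.4.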
Properties of Correlation
Coefficient
The calculation uses only standard units, so r is a pure number with no units.
-1 ≤ r ≤ 1
In the extreme cases, r = -1 when the scatter diagram is a perfect straight line sloping down, and r = 1 when the scatter diagram is a perfect straight line sloping up.
Switching the variables x and y does not change r.
Adding a constant to one of the lists just slides the scatter diagram, so r stays the same.
Multiplying one of the lists by a positive constant does not change the standard units, so r stays the same.
Multiplying just one (not both) of the lists by a negative constant switches the signs of the standard units of that variable, so r keeps the same absolute value but its sign is switched.
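These properties can be checked numerically. The sketch below assumes a hand-written `correlation` helper (population SD) and a small hypothetical data set; the constants 10, 3 and -2 are arbitrary choices for illustration.

```python
import math

def correlation(x, y):
    """r as the mean product of standard units (population SD assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sdy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum(((a - mx) / sdx) * ((b - my) / sdy) for a, b in zip(x, y)) / n

# Hypothetical illustrative data.
x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]

r = correlation(x, y)
r_swapped = correlation(y, x)                    # switch x and y
r_shifted = correlation([a + 10 for a in x], y)  # add a constant to one list
r_scaled = correlation([3 * a for a in x], y)    # multiply by a positive constant
r_negated = correlation([-2 * a for a in x], y)  # multiply by a negative constant

print(r, r_swapped, r_shifted, r_scaled, r_negated)
```

The first four values agree up to rounding, and the last has the same size with the opposite sign.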
Heteroscedastic Curve
What can r not tell?
Association is not causation. r does not tell “Why”.
r should only be used for linearly related variables. It measures linear association.
This diagram shows a strong relation between x and y, but it is not linear, so r for this diagram comes out to be zero.
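A small numeric check of this point, assuming a hypothetical symmetric parabola y = x² (the slide's actual diagram is not reproduced here): y is completely determined by x, yet r is zero because the relation is not linear.

```python
import math

def correlation(x, y):
    """r as the mean product of standard units (population SD assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sdy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum(((a - mx) / sdx) * ((b - my) / sdy) for a, b in zip(x, y)) / n

# Hypothetical data: y depends exactly on x, but along a curve.
x = [-2, -1, 0, 1, 2]
y = [a ** 2 for a in x]  # [4, 1, 0, 1, 4]

r = correlation(x, y)
print(abs(r) < 1e-12)  # True: r is zero up to rounding
```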
Beware of:
Outliers
Tendency for Ecological correlations
Deal with the outliers
Can you find the outlier?
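To see why outliers deserve attention, the sketch below compares r with and without a single outlying point. The data are hypothetical: five points lying exactly on a line, plus one added point (6, 0) far from it.

```python
import math

def correlation(x, y):
    """r as the mean product of standard units (population SD assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sdy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum(((a - mx) / sdx) * ((b - my) / sdy) for a, b in zip(x, y)) / n

# Hypothetical data: five points exactly on the line y = 2x.
x_clean = [1, 2, 3, 4, 5]
y_clean = [2, 4, 6, 8, 10]

# The same data with one hypothetical outlier appended.
x_out = x_clean + [6]
y_out = y_clean + [0]

r_clean = correlation(x_clean, y_clean)
r_out = correlation(x_out, y_out)
print(round(r_clean, 3), round(r_out, 3))  # 1.0 0.143
```

One point out of six is enough to pull r from 1 down to 1/7 in this example.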
Avoid “Ecological Correlation”:
Replacing individuals (e.g. students) by group averages can artificially increase clustering and so overstate the strength of the association. This is not desirable.
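The effect can be illustrated with made-up numbers. The sketch assumes three hypothetical groups of three students each; the correlation is computed once from the nine individuals and once from the three group averages.

```python
import math

def correlation(x, y):
    """r as the mean product of standard units (population SD assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sdy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum(((a - mx) / sdx) * ((b - my) / sdy) for a, b in zip(x, y)) / n

# Hypothetical (x, y) scores for three groups of three students.
groups = [
    [(0, 2), (1, 0), (2, 1)],
    [(2, 4), (3, 2), (4, 3)],
    [(4, 6), (5, 4), (6, 5)],
]

# Correlation across all nine individual students.
individuals = [point for group in groups for point in group]
r_individuals = correlation([p[0] for p in individuals],
                            [p[1] for p in individuals])

# Correlation across the three group averages only.
avg_x = [sum(p[0] for p in g) / len(g) for g in groups]
avg_y = [sum(p[1] for p in g) / len(g) for g in groups]
r_averages = correlation(avg_x, avg_y)

print(round(r_individuals, 2), round(r_averages, 2))  # 0.7 1.0
```

Averaging hides the spread inside each group, so the group-level correlation (1.0) overstates the individual-level one (0.7).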
Regression
The technique to estimate the dependent variable “y” for a given value of the variable “x”, when the two are linearly associated and the correlation coefficient “r” is known.
Each estimate is at the center of the vertical strip
The slope of the green line = r
The Equation of Regression
Estimate of y (in standard units) = r × given x (in standard units)
⇒ (estimate of y − µy) / SDy = r × (x − µx) / SDx
⇒ Estimate of y = slope × x + intercept
(Here slope = r × SDy / SDx and intercept = µy − slope × µx)
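As a worked sketch, the regression estimate can be computed from the five summary numbers alone, taking the intercept as µy − slope × µx so that the line passes through the point of averages. The function names and the height/weight figures below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the regression line of y on x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return slope, intercept

def estimate_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Regression estimate of y for a given x."""
    slope, intercept = regression_line(mean_x, sd_x, mean_y, sd_y, r)
    return slope * x + intercept

# Hypothetical summary statistics: height in cm (x), weight in kg (y).
mean_x, sd_x = 150.0, 10.0
mean_y, sd_y = 40.0, 5.0
r = 0.6

slope, intercept = regression_line(mean_x, sd_x, mean_y, sd_y, r)
weight_at_160 = estimate_y(160.0, mean_x, sd_x, mean_y, sd_y, r)
print(slope, intercept, weight_at_160)
```

A child of 160 cm is 1 SD above average height, so the estimated weight is r = 0.6 SDs above average weight: 40 + 0.6 × 5 = 43 kg, which matches slope × 160 + intercept.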
Why Call It “Regression”?
Sir Francis Galton (1822-1911): “The Galton Effect”
“Those who have high values in one variable tend to be not as high in the second variable”
A eugenicist, who introduced the ideas of correlation and regression
“Fathers who are tall tend to have sons who are not quite that tall on average”
All data regress towards “mediocrity”, i.e. towards the mean
The Regression Fallacy (e.g. the “Sophomore Slump”): mistaking regression towards the mean for a real effect
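Regression towards the mean follows directly from the regression estimate in standard units (estimate of y = r × x). The sketch below assumes a hypothetical correlation of 0.5 between fathers' and sons' heights; the function name is illustrative only.

```python
def predicted_standard_units(x_standard_units, r):
    """Regression estimate of y in standard units: r times x in standard units."""
    return r * x_standard_units

r = 0.5  # hypothetical correlation between fathers' and sons' heights

tall_father = 2.0    # father 2 SDs above the fathers' average
short_father = -2.0  # father 2 SDs below the fathers' average

son_of_tall = predicted_standard_units(tall_father, r)
son_of_short = predicted_standard_units(short_father, r)
print(son_of_tall, son_of_short)  # 1.0 -1.0
```

Whenever r lies strictly between -1 and 1, the estimate is closer to the average than the given value: sons of very tall fathers are predicted to be tall but less extreme, and likewise for sons of very short fathers.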
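[Figure: Univariate Normal vs Bivariate Normal. Univariate: 68% of values lie within 1 SD of µx. Bivariate: 68% of points lie within 1 r.m.s. error of the regression line.]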
Residual Plot
Regardless of the shape of the scatter diagram:
the average of the residuals is always 0;
there is no linear association between the residuals and x.
The residual plot should not show any trend or linear relation.
Good regression: the residual plot should look like a formless blob around the horizontal axis.
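The two residual properties can be verified numerically. This is a minimal sketch with hypothetical data and hand-written helpers (population SD assumed); the regression line is built from r, the SDs and the averages, and the residuals are the observed y minus the estimated y.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Population SD (divide by n); an assumption of this sketch."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(x, y):
    mx, my, sdx, sdy = mean(x), mean(y), sd(x), sd(y)
    return mean([((a - mx) / sdx) * ((b - my) / sdy) for a, b in zip(x, y)])

# Hypothetical illustrative data.
x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]

# Regression line of y on x.
r = correlation(x, y)
slope = r * sd(y) / sd(x)
intercept = mean(y) - slope * mean(x)

# Residual = observed y minus regression estimate of y.
residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]

mean_residual = mean(residuals)
r_residuals_x = correlation(x, residuals)
print(abs(mean_residual) < 1e-9, abs(r_residuals_x) < 1e-9)  # True True
```

Both checks hold up to rounding for any data set with non-zero SDs, so a residual plot that shows a trend points to a problem with the linear fit rather than with these identities.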
Residual Plot as a Diagnostic
Tool
Questions??