Epidemiological Applications in Health Services Research
Introduction to Multivariate Analysis
Dr. Ibrahim Awad Ibrahim.
Areas to be addressed today
Introduction to variables and data
Simple linear regression
Correlation
Population covariance
Multiple regression
Canonical correlation
Discriminant analysis
Logistic regression
Survival analysis
Principal component analysis
Factor analysis
Cluster analysis
Types of variables (Stevens' classification, 1951)
Nominal: distinct categories (race, religion, county, sex)
Ordinal: rankings (education, health status, smoking levels)
Interval: equal differences between levels (time, temperature, blood glucose levels)
Ratio: interval with a natural zero (bone density, weight, height)
Variables used in data analysis
Dependent: result or outcome (e.g. developing CHD)
Independent: explanatory (age, sex, diet, exercise)
Latent constructs: SES, satisfaction, health status
Measurable indicators: education, employment, revisits, miles walked
Variables in data example

Name         Description                # of characters   Position
STFIPS       FIPS code (state)          1                 2
STCENSUS                                1                 3
LEVEL                                   1                 4
STABBREV                                1                 5
AREANAME     Name of US/state/county    7                 6
POPULATION   1992 ABS                   7                 13
ITEM002      xyz                                          20
Data
Data screening and transformation
Normality
Independence
Correlation (or lack of independence)
Variable types and measures of central tendency
Nominal: mode
Ordinal: median
Interval: mean
Ratio: geometric mean and harmonic mean
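A minimal sketch, using NumPy and SciPy on made-up values, of the measure of central tendency matched to each variable type above.

```python
import numpy as np
from scipy import stats

nominal = ["white", "black", "white", "asian", "white"]   # nominal -> mode
ordinal = [1, 2, 2, 3, 4]                                  # ordinal -> median
interval = [98.6, 99.1, 97.8, 98.2]                        # interval -> mean
ratio = [55.0, 60.5, 72.3, 80.1]                           # ratio -> geometric / harmonic mean

mode = max(set(nominal), key=nominal.count)    # most frequent category
median = np.median(ordinal)
mean = np.mean(interval)
geometric_mean = stats.gmean(ratio)
harmonic_mean = stats.hmean(ratio)

print(mode, median, mean, geometric_mean, harmonic_mean)
```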
Simple linear regression
Y = A + BX
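A minimal sketch of fitting the line Y = A + BX by least squares with NumPy; the x and y values are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

B, A = np.polyfit(x, y, deg=1)   # slope B and intercept A
y_hat = A + B * x                # fitted values on the regression line
print(f"Y = {A:.2f} + {B:.2f} X")
```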
Correlation
Mean: μ = Σx / N
Variance: σ² = Σ(x − μ)² / N
Population covariance: σxy = Σ(X − μx)(Y − μy) / N
Product moment coefficient: ρ = σxy / (σx σy)
ρ lies between -1 and 1
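A minimal sketch computing these quantities with NumPy on made-up x and y values; the product moment coefficient agrees with np.corrcoef and lies between -1 and 1.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 3.1, 4.2, 6.8, 7.9])

mean_x, mean_y = x.mean(), y.mean()
var_x = ((x - mean_x) ** 2).mean()                # population variance of x
var_y = ((y - mean_y) ** 2).mean()                # population variance of y
cov_xy = ((x - mean_x) * (y - mean_y)).mean()     # population covariance
r = cov_xy / np.sqrt(var_x * var_y)               # product moment coefficient

print(r, np.corrcoef(x, y)[0, 1])                 # the two values agree
```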
Example: physical and mental health indicators
Correlations
PHYSICAL MENTAL
PHYSICAL Pearson Correlation 1.000 .230**
Sig. (2-tailed) . .000
N 109888 109888
MENTAL Pearson Correlation .230** 1.000
Sig. (2-tailed) .000 .
N 109888 109888
**. Correlation is significant at the 0.01 level (2-tailed).
Negative correlation
Correlations
WEIGHT AGEDIAB
WEIGHT Pearson Correlation 1.000 -.029**
Sig. (2-tailed) . .000
N 109888 109888
AGEDIAB Pearson Correlation -.029** 1.000
Sig. (2-tailed) .000 .
N 109888 109888
**. Correlation is significant at the 0.01 level (2-tailed).
Population covariance
[Scatter plots illustrating correlations of ρ = 0.00, 0.33, 0.6, and 0.88.]
Multiple regression and correlation
Simple linear: Y = α + βX
Multiple regression: Y = α + β1X1 + β2X2 + β3X3 + . . . + βpXp
Example: EF (ejection fraction) as a function of exercise and body fat
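A minimal sketch of a multiple regression fit by least squares with NumPy; the ejection fraction, exercise, and body fat values are made up for illustration.

```python
import numpy as np

exercise = np.array([1.0, 3.0, 5.0, 2.0, 4.0])        # hours per week
body_fat = np.array([30.0, 25.0, 20.0, 28.0, 22.0])   # percent
ef = np.array([50.0, 58.0, 65.0, 53.0, 61.0])         # ejection fraction

# Design matrix with an intercept column: Y = alpha + b1*exercise + b2*body_fat
X = np.column_stack([np.ones_like(exercise), exercise, body_fat])
coef, *_ = np.linalg.lstsq(X, ef, rcond=None)
alpha, b_exercise, b_fat = coef
print(alpha, b_exercise, b_fat)
```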
Issues with regression
Missing values: may be random or follow a pattern; common remedies are mean substitution and maximum likelihood (ML) estimation
Dummy variables: category codes should not be treated as equally spaced (equal intervals!); see the coding sketch below
Multicollinearity: independent variables are highly correlated
"Garbage can" method: entering every available variable without theoretical justification
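A minimal sketch, assuming pandas is available, of dummy coding a nominal variable and of screening for multicollinearity by inspecting pairwise correlations among the independent variables; the column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [45, 52, 61, 38, 57, 49, 66, 44],
    "bmi":  [24.0, 27.5, 30.1, 22.8, 29.4, 26.3, 31.0, 23.5],
    "race": ["white", "black", "white", "asian", "black", "white", "asian", "black"],
})

# Dummy variables: one 0/1 column per category (one category dropped as the
# reference), instead of a single numeric code that would imply equal intervals.
X = pd.get_dummies(df, columns=["race"], drop_first=True)
print(X.head())

# High pairwise correlations among independent variables flag multicollinearity.
print(X.astype(float).corr().round(2))
```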
Canonical correlation
An extension of multiple regression: multiple Y variables and multiple X variables.
We find several linear combinations of the X variables and the same number of linear combinations of the Y variables.
These combinations are called canonical variables, and the correlations between the corresponding pairs of canonical variables are called CANONICAL CORRELATIONS.
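A minimal sketch of canonical correlation using scikit-learn's CCA on simulated data; the canonical correlations are computed as the correlations between corresponding pairs of canonical variables.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                  # three X variables
Y = X @ rng.normal(size=(3, 2)) + rng.normal(size=(200, 2))    # two related Y variables

cca = CCA(n_components=2)
U, V = cca.fit_transform(X, Y)          # pairs of canonical variables
canonical_corrs = [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(2)]
print(canonical_corrs)                   # the canonical correlations
```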
Correlation matrix
Pearson correlations (N per pair ranges from 38,018 to 109,888)

            WTFORHTX  GENHLTH  PHYSHLTH  MENTHLTH  POORHLTH  HLTHPLAN  BPTAKE   TOLDHI
WTFORHTX     1.000     .072**   -.008**    .016**   -.005      .023**   .011**   .000
GENHLTH       .072**  1.000     -.228**   -.061**   -.147**    .035**  -.084**  -.091**
PHYSHLTH     -.008**  -.228**   1.000      .223**    .295**   -.011**   .083**   .030**
MENTHLTH      .016**  -.061**    .223**   1.000     -.120**   -.038**   .019**   .014**
POORHLTH     -.005    -.147**    .295**   -.120**   1.000     -.001     .055**   .014**
HLTHPLAN      .023**   .035**   -.011**   -.038**   -.001     1.000     .152**   .022**
BPTAKE        .011**  -.084**    .083**    .019**    .055**    .152**  1.000     .039**
TOLDHI        .000    -.091**    .030**    .014**    .014**    .022**   .039**  1.000
**. Correlation is significant at the 0.01 level (2-tailed).
Discriminant analysis
A method used to classify an individual into one of two or more groups based on a set of measurements.
Examples: at risk for heart disease, cancer, diabetes, etc.
It can be used for prediction and description.
Discriminant analysis
[Scatter plot of two overlapping groups, A and B; points a and b are wrongly classified.]
The discriminant function describes the probability of being classified in the right group.
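A minimal sketch of discriminant analysis with scikit-learn's LinearDiscriminantAnalysis on two simulated groups A and B; the group means and sample sizes are made up.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # group A measurements
group_b = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))   # group B measurements
X = np.vstack([group_a, group_b])
y = np.array([0] * 50 + [1] * 50)        # 0 = group A, 1 = group B

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[1.0, 1.0]]))         # predicted group for a new individual
print(lda.predict_proba([[1.0, 1.0]]))   # probability of belonging to each group
```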
Logistic regression
An alternative to discriminant analysis for classifying an individual into one of two populations based on a set of criteria.
It is appropriate for any combination of discrete or continuous variables.
It uses maximum likelihood estimation to classify individuals based on the independent variable list.
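A minimal sketch using scikit-learn's LogisticRegression (which fits by penalized maximum likelihood by default) on simulated data; the predictor names, coefficients, and outcome are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
age = rng.uniform(30, 80, size=200)              # continuous predictor
smoker = rng.integers(0, 2, size=200)            # discrete predictor
X = np.column_stack([age, smoker])

# Simulated outcome: probability of CHD rises with age and smoking.
p = 1 / (1 + np.exp(-(-8 + 0.1 * age + 1.0 * smoker)))
chd = rng.binomial(1, p)

model = LogisticRegression(max_iter=1000).fit(X, chd)
print(model.coef_, model.intercept_)
print(model.predict_proba([[65.0, 1]]))          # P(no CHD), P(CHD) for a 65-year-old smoker
```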
Survival analysis (event history analysis)
Analyzes the length of time it takes for a specific event to occur.
Time to death, organ failure, retirement, etc.
Length of time is modeled as a function of explanatory variables (covariates).
Survival data example
[Timeline of subjects followed from 1980 to 1990: three died, one was lost to follow-up, and one was still surviving at the end of follow-up.]
Log-linear regression
A regression model in which the dependent variable is the log of survival time (t) and the independent variables are the explanatory variables.
Multiple regression: Y = α + β1X1 + β2X2 + β3X3 + . . . + βpXp
Log-linear: log(t) = α + β1X1 + β2X2 + β3X3 + . . . + βpXp + e
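A minimal sketch of the log-linear model: ordinary least squares on log survival time, ignoring censoring for simplicity; the covariates and coefficients are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)
age = rng.uniform(40, 80, size=100)
treated = rng.integers(0, 2, size=100)
# Simulated log survival times that shorten with age and lengthen with treatment.
log_t = 5.0 - 0.03 * age + 0.5 * treated + rng.normal(scale=0.3, size=100)

X = np.column_stack([np.ones(100), age, treated])     # intercept + covariates
beta, *_ = np.linalg.lstsq(X, log_t, rcond=None)
print(beta)   # alpha, beta_age, beta_treated on the log-time scale
```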
Cox proportional hazards model
Another method to model the relationship between
survival time and a set of explanatory variables.
[Survival curve from 1980 to 1990; the proportion of the population who die up to time t is the shaded area.]
Cox proportional hazards model
The hazard functions (h) of groups 1 and 2 are proportional over time, so that h1(t)/h2(t) is constant.
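A minimal sketch of a Cox proportional hazards fit, assuming the lifelines package is available; the data frame, column names, and event indicator are simulated for illustration.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "age": rng.uniform(40, 80, size=200),
    "treated": rng.integers(0, 2, size=200),
})
# Simulated survival times and event indicator (1 = died, 0 = censored).
df["time"] = rng.exponential(scale=(10 / (1 + 0.02 * df["age"])).to_numpy(), size=200)
df["died"] = rng.integers(0, 2, size=200)

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="died")
cph.print_summary()   # hazard ratios for age and treated
```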
Principal component analysis
Aimed at simplifying the description of a set of interrelated variables.
All variables are treated equally.
You end up with uncorrelated new variables called principal components.
Each one is a linear combination of the original variables.
The measure of the information conveyed by each is its variance.
The PCs are arranged in descending order of the variance explained.
Principal component analysis
A general rule is to retain PCs explaining at least 5% of the variance, but you can use a higher cutoff for parsimony.
Theory should guide the selection of the cutoff point.
PCA is sometimes used to alleviate multicollinearity.
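A minimal sketch of PCA with scikit-learn on simulated standardized variables; explained_variance_ratio_ gives the variance explained by each component in descending order, to which the 5% rule above can be applied.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6))
X[:, 3] = X[:, 0] + 0.5 * rng.normal(size=300)   # make two variables correlated

pca = PCA()
scores = pca.fit_transform(StandardScaler().fit_transform(X))  # principal components
print(pca.explained_variance_ratio_)    # variance explained, in descending order

# Keep components above the chosen cutoff (e.g. at least 5% of the variance).
keep = pca.explained_variance_ratio_ >= 0.05
print(keep.sum(), "components retained")
```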
Factor analysis
The objective is to understand the underlying structure explaining the relationships among the original variables.
We use the factor loading of each variable on the generated factors to determine the usability of that variable.
Again, theory should guide what structures are depicted by the common factors encompassing the selected variables.
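A minimal sketch of factor analysis with scikit-learn's FactorAnalysis on simulated data; components_ holds the loading of each observed variable on each factor.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(6)
latent = rng.normal(size=(500, 2))                        # two underlying factors
loadings = rng.normal(size=(2, 8))
X = latent @ loadings + 0.5 * rng.normal(size=(500, 8))   # eight observed variables

fa = FactorAnalysis(n_components=2).fit(X)
print(fa.components_.round(2))   # loading of each variable on each factor
```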
Factor analysis
Total Variance Explained

            Initial Eigenvalues                  Extraction Sums of Squared Loadings
Component   Total   % of Variance  Cumulative %  Total   % of Variance  Cumulative %
1           1.699   16.986         16.986        1.699   16.986         16.986
2           1.663   16.629         33.614        1.663   16.629         33.614
3           1.108   11.083         44.697        1.108   11.083         44.697
4           1.035   10.351         55.048        1.035   10.351         55.048
5            .908    9.077         64.125
6            .881    8.808         72.933
7            .834    8.338         81.271
8            .788    7.879         89.150
9            .571    5.714         94.865
10           .514    5.135        100.000
Extraction Method: Principal Component Analysis.
Factor analysis
Component Matrix (a)
Component:        1       2       3       4
GENHLTH .450 .207 -.150 -.552
PHYSHLTH -.770 .254 -3.31E-03 -.208
MENTHLTH .652 -.232 -6.74E-02 .353
POORHLTH -.612 6.329E-02 -1.03E-02 .110
BPTAKE -.128 .352 -.465 .474
BLOODCHO 6.411E-02 .335 -.563 .158
SEATBELT .166 .697 .242 .222
SFTYLT16 .137 .676 .447 .188
BIKEHLMT .156 .414 .210 -.299
SMOKENOW -.112 -.382 .495 .356
Extraction Method: Principal Component Analysis.
a. 4 components extracted.
Cluster analysis
A method for classifying individuals into previously unknown groups.
It proceeds from the most general to the most specific:
Kingdom: Animalia
Phylum: Chordata
Subphylum: Vertebrata
Class: Mammalia
Order: Primates
Family: Hominidae
Genus: Homo
Species: sapiens
Patient clustering
Major: patients
Type: medical
Subtype: neurological
Class: genetic
Order: late onset
Disease: Guillain-Barré syndrome
Hierarchical clustering can be divisive or agglomerative (an agglomerative sketch follows below).
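A minimal sketch of agglomerative hierarchical clustering with SciPy on simulated patient measurements; the two underlying groups and the variables are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
patients = np.vstack([
    rng.normal(loc=0.0, size=(20, 4)),   # one underlying patient group
    rng.normal(loc=3.0, size=(20, 4)),   # a second, previously unknown group
])

Z = linkage(patients, method="ward")                # agglomerative merge tree
labels = fcluster(Z, t=2, criterion="maxclust")     # cut the tree into two clusters
print(labels)
```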
Conclusions
Presentation Schedule
4 presentations each on 4/22 and 4/27
5 presentations on 4/29
Each presentation should be a maximum of 10 minutes, with 5 minutes for discussion.
E-mail me your software and hardware requirements for your presentation.
Final projects are due 5/7/99 by 5:00 pm in my office.
Presentation Schedule 1
Date Time Who
4/22 1:00 - 1:15
1:16 - 1:30
1:31 - 1:45
1:46 - 2:00
Presentation Schedule 2
Date Time Who
4/27 1:00 - 1:15
1:16 - 1:30
1:31 - 1:45
1:46 - 2:00
2:01 - 2:15
Presentation Schedule 3
Date Time Who
4/29 1:00 - 1:15
1:16 - 1:30
1:31 - 1:45
1:46 - 2:00