C01 Introduction S
INTRODUCTION
1.1 Aspects
Multivariate data arise when researchers record the values of several random variables
on a number of subjects or objects, or perhaps one of a variety of other things (we
will use the general term “units” or “items”), in which they are interested, leading to a
vector-valued or multidimensional observation from each. Since the data include
simultaneous measurements on many variables, this body of methodology is called
multivariate analysis.
(iv) Environmentalists might assess pollution levels of a set of cities, along with
noting other characteristics of the cities related to climate and human ecology.
1.2 Classification
It is difficult to establish a classification scheme for multivariate techniques that is both
widely accepted and indicates the appropriateness of the techniques.
▪ One classification distinguishes techniques designed to study interdependent
relationships from those designed to study dependent relationships.
▪ Another classifies techniques according to the number of populations and the
number of sets of variables being studied.
The choice of methods and the types of analyses employed are largely determined by
the objective of the investigation.
(ii) Sorting and grouping: Groups of “similar” objects or variables are created,
based upon measured characteristics. Alternatively, rules for classifying objects or
variables into well-defined groups may be required.
1.3 Applications
To give some indication of the usefulness of multivariate techniques, a few examples are
given below. These examples are multifaceted and could be placed in more than one
category.
Prediction
▪ Measurements on several accounting and financial variables were used to
develop a method for identifying potentially insolvent property-liability insurers.
▪ cDNA microarray experiments (gene expression data) are increasingly used to
study molecular variations among cancer tumours. A reliable classification of
tumours is essential for successful diagnosis and treatment of cancer.
The values of these variables are all recorded for each item, individual, or experimental
unit.
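The notation $x_{jk}$ indicates the particular value of the kth variable that is observed on
the jth item. That is,
\[
\begin{array}{l|cccccc}
 & \text{Variable 1} & \text{Variable 2} & \cdots & \text{Variable } k & \cdots & \text{Variable } p \\
\hline
\text{Item 1} & x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
\text{Item 2} & x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
\text{Item } j & x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots
\end{array}
\]
This array contains the data consisting of all of the observations on all of the variables.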
Example 1.1
A selection of four receipts from a bookstore was obtained to investigate the nature of
book sales. Each receipt provided, among other things, the number of books sold and
the total amount of the sale.
Let the first variable be total dollar sales and the second variable the number of books
sold. We can then regard the four receipts as four measurements on two variables.
In this notation:
\[ x_{11} = 42, \quad x_{21} = 52, \quad x_{31} = 48, \quad x_{41} = 58 \]
\[ x_{12} = 4, \quad x_{22} = 5, \quad x_{32} = 4, \quad x_{42} = 3 \]
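The same observations can be stored as a data array with one row per item and one column per variable. The following is a minimal sketch, assuming NumPy is available; the variable names are chosen here for illustration only.

```python
import numpy as np

# Rows are the four receipts (items); columns are the two variables:
# column 0 = total dollar sales, column 1 = number of books sold.
X = np.array([
    [42, 4],
    [52, 5],
    [48, 4],
    [58, 3],
])

n, p = X.shape   # n = 4 items, p = 2 variables

# x_{32} is the value of variable 2 on item 3. NumPy indices start at 0,
# so the subscripts (3, 2) become the indices [2, 1].
x_32 = X[2, 1]
```

Note the shift between the one-based subscripts used in the text and the zero-based indices used by the array.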
(ii) Ordinal
Where there is ordering but no implication of equal distance between the
different points of the scale. Examples: educational attainment (no schooling,
primary, secondary, tertiary), social class, degree classification (distinction, merit,
pass).
(iii) Interval
Where there are equal differences between successive points on the scale but
the position of zero is arbitrary. For example, the measurement of temperature
using Celsius or Fahrenheit scales.
(iv) Ratio
The highest level of measurement, where one can investigate the relative
magnitudes of scores as well as the differences between them. The position of
zero is fixed. Examples: the absolute measure of temperature in kelvins, age,
weight and length.
1.4.2 Missing values
Consider the table below, where NA denotes missing values. Missing values are one of the
problems that face statisticians undertaking statistical analysis in general and
multivariate analysis in particular.
Missing values are observations and measurements that should have been recorded
but, for one reason or another, were not.
In multivariate data, missing values arise from, for example, non-response in sample surveys,
dropouts in longitudinal studies, and refusal to answer particular questions in a questionnaire.
The most important way of dealing with missing data is to try to avoid them during the
data-collection stage of a study. However, this is not always possible.
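As an illustration of what missing values look like in a data array, the sketch below uses a small hypothetical data set (not taken from these notes) in which `np.nan` plays the role of NA. It counts the missing entries per variable, keeps only the items with no missing values, and computes column means from the values that are available. These are two simple options among many, not a recommended treatment.

```python
import numpy as np

# Hypothetical 4 x 3 data matrix; np.nan stands in for NA.
X = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
])

# Number of missing values in each column (variable).
missing_per_variable = np.isnan(X).sum(axis=0)

# Keep only the rows (items) that have no missing value at all.
complete_rows = ~np.isnan(X).any(axis=1)
X_complete = X[complete_rows]

# Column means computed from the available values only.
available_case_means = np.nanmean(X, axis=0)
```

Dropping incomplete items discards information, which is one reason avoiding missing values at the data-collection stage is preferable.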
1.5 Graphical Techniques
There are several graphical displays that can be used to aid in data analysis. It is
impossible to simultaneously plot all measurements made on several variables and
study the configurations. However, plots of individual variables and plots of pairs of variables
can still be very informative.
In the diagram below, specimen (board) 16 and possibly specimen (board) 9 are identified
as unusual observations. Figures 1.12(a), (b), and (c) contain perspectives of the
stiffness data in the $x_1, x_2, x_3$ space.
These views were obtained by continually rotating and turning the three-dimensional
coordinate axes. Spinning the coordinate axes allows one to get a better understanding
of the three-dimensional aspects of the data. Figure 1.12(d) gives one picture of the
stiffness data in $x_2, x_3, x_4$ space. Notice that Figures 1.12(a) and (d) visually confirm
specimens 9 and 16 as outliers. Specimen 9 is very large in all three coordinates.
Additional insights can sometimes be gleaned from visual inspection of the slowly
spinning data. It is this dynamic aspect that statisticians are just beginning to understand
and exploit.
Plots like those in Figure 1.12 allow one to identify readily observations that do not
conform to the rest of the data and that may heavily influence inferences based on
standard data-generating models.
1.6 Descriptive Statistics
Information contained in the data can be assessed by calculating summary statistics.
For example, the sample mean provides a measure of location, i.e. a “central value”
for a set of numbers. The average of the squares of the distances of all numbers from
the mean provides a measure of the variation (or spread) in the numbers.
Consider the dataset as a matrix $X$; we can then treat each column of $X$ separately.
That is, let $x_{11}, x_{21}, \ldots, x_{n1}$ be the n measurements on the first variable.
▪ sample mean
\[ \bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad j = 1, 2, \ldots, p \]
▪ sample variance
\[ s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2, \qquad j = 1, 2, \ldots, p \]
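\[
\begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix},
\begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix},
\ldots,
\begin{bmatrix} x_{n1} \\ x_{n2} \end{bmatrix}
\]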
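A measure of the linear association between the measurements of variables 1 and 2 is
given by the sample covariance
\[ s_{12} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2) \]
More generally, for the jth and kth variables,
\[ s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k), \qquad j = 1, 2, \ldots, p; \; k = 1, 2, \ldots, p \]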
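The sample correlation coefficient for the jth and kth variables is
\[
r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}} \sqrt{s_{kk}}}
= \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)}
{\sqrt{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2} \sqrt{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}},
\qquad j = 1, \ldots, p; \; k = 1, \ldots, p
\]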
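(iv) If $r_{jk} = \pm 1$, then there are constants a and b such that $x_{ij} = a + b x_{ik}$
for $i = 1, 2, \ldots, n$.
(v) The value of $r_{jk}$ does not change if either variable is subject to a linear
transformation with positive slope. A negative slope reverses the sign of $r_{jk}$ but leaves its magnitude unchanged.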
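Properties (iv) and (v) can be checked numerically. The sketch below computes the sample correlation coefficient directly from its definition and applies it to hypothetical measurements (the numbers are made up for illustration). It checks the converse direction of (iv), namely that an exact linear relation gives a correlation of +1 or -1, and it checks how linear rescaling affects the coefficient.

```python
import numpy as np

def sample_corr(x, y):
    """Sample correlation coefficient computed directly from the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the sample mean of x
    dy = y - y.mean()   # deviations from the sample mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical measurements on two variables.
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
y = np.array([2.0, 1.0, 5.0, 6.0, 9.0])

r = sample_corr(x, y)

# An exact linear relation gives a correlation of +1 or -1,
# depending on the sign of the slope.
r_exact_pos = sample_corr(x, 3.0 + 2.0 * x)
r_exact_neg = sample_corr(x, 3.0 - 2.0 * x)

# Rescaling with a positive slope leaves r unchanged;
# a negative slope reverses its sign.
r_rescaled = sample_corr(10.0 + 0.5 * x, y)
r_flipped = sample_corr(10.0 - 0.5 * x, y)
```

The divisor n - 1 does not appear in `sample_corr` because it cancels between the numerator and the denominator.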
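We can represent all these quantities as arrays:

Sample means
\[
\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}
\]
Sample variances-covariances
\[
S_n = \begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{bmatrix}
\]
Sample correlations
\[
R = \begin{bmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{bmatrix}
\]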
Example 1.6.1
Consider the data given in Example 1.1. Compute the sample means, variances,
covariance and correlation coefficient.
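One way to carry out this computation is sketched below, assuming NumPy and the divisor n - 1 used in the formulas of this section (which is also NumPy's default for `np.cov`). The array repeats the four receipts of Example 1.1.

```python
import numpy as np

# Example 1.1: column 0 = total dollar sales, column 1 = number of books sold.
X = np.array([
    [42, 4],
    [52, 5],
    [48, 4],
    [58, 3],
], dtype=float)

x_bar = X.mean(axis=0)             # sample means of the two variables
S = np.cov(X, rowvar=False)        # variances on the diagonal, covariance off it
R = np.corrcoef(X, rowvar=False)   # sample correlation matrix
```

By hand, the means are 50 and 4, the sums of squared deviations are 136 and 2, and the sum of cross-products is -6. With divisor 3 this gives $s_{11} = 136/3 \approx 45.33$, $s_{22} = 2/3 \approx 0.67$, $s_{12} = -2$, and $r_{12} = -6/\sqrt{136 \times 2} \approx -0.36$.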
1.7 Geometry
(i) The straight-line or Euclidean distance of a point $P = (x_1, x_2, \ldots, x_p)$ from the
origin $O = (0, \ldots, 0)$ is
\[ d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2} \tag{1.7.1} \]
All points that lie a constant squared distance, such as $c^2$, from the origin
satisfy the equation
\[ d^2(O, P) = x_1^2 + x_2^2 + \cdots + x_p^2 = c^2 \tag{1.7.2} \]
(ii) The straight-line distance between two arbitrary points P and Q with
coordinates $P = (x_1, x_2, \ldots, x_p)$ and $Q = (y_1, y_2, \ldots, y_p)$ is given by
\[ d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2} \tag{1.7.3} \]
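The Euclidean distance in (1.7.3) translates directly into a short function. This is a minimal sketch assuming NumPy; the function name and the convention that an omitted second point means the origin are choices made here for illustration.

```python
import numpy as np

def euclidean_distance(p, q=None):
    """Straight-line distance between p and q; q defaults to the origin."""
    p = np.asarray(p, dtype=float)
    q = np.zeros_like(p) if q is None else np.asarray(q, dtype=float)
    # Square root of the sum of squared coordinate differences.
    return float(np.sqrt(((p - q) ** 2).sum()))

d_origin = euclidean_distance([3, 4])             # distance of (3, 4) from O
d_pq = euclidean_distance([1, 2, 3], [4, 6, 3])   # distance between P and Q
```

Every coordinate contributes with equal weight here, which is the feature that the statistical distance relaxes.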
(iii) We need to find a ‘statistical’ distance that accounts for differences in variation
and, in due course, the presence of correlation. Our choice will depend upon
the sample variances and covariances. The statistical distance is fundamental to
multivariate analysis.
\[
d(O, P) = \sqrt{\left(\frac{x_1}{\sqrt{s_{11}}}\right)^2 + \left(\frac{x_2}{\sqrt{s_{22}}}\right)^2}
= \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}} \tag{1.7.4}
\]
• In $\mathbb{R}^2$, the difference between (1.7.1) and (1.7.4) is due to the
weights $k_1 = 1/s_{11}$ and $k_2 = 1/s_{22}$ attached to $x_1^2$ and $x_2^2$.
• If the sample variances are the same, $k_1 = k_2$, then $x_1^2$ and $x_2^2$ will
receive the same weight.
• If the variability in the $x_1$ direction is the same as the variability
in the $x_2$ direction, and the $x_1$ values vary independently of the
$x_2$ values, Euclidean distance is appropriate.
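The effect of the weights in (1.7.4) can be seen with a small sketch. The sample variances below are hypothetical values chosen for illustration, with $x_1$ varying more than $x_2$.

```python
import math

def statistical_distance_2d(x1, x2, s11, s22):
    """Distance from the origin as in (1.7.4): squared coordinates weighted by 1/s_ii."""
    return math.sqrt(x1 ** 2 / s11 + x2 ** 2 / s22)

# Hypothetical sample variances: x1 is more variable than x2.
s11, s22 = 4.0, 1.0

# Two points at different Euclidean distances from the origin (2 and 1)...
d_a = statistical_distance_2d(2.0, 0.0, s11, s22)
d_b = statistical_distance_2d(0.0, 1.0, s11, s22)
# ...are at the same statistical distance, because a deviation of 2 in the
# more variable x1 direction counts the same as a deviation of 1 in x2.

# With equal unit variances the statistical distance is the Euclidean one.
d_equal = statistical_distance_2d(3.0, 4.0, 1.0, 1.0)
```

Both `d_a` and `d_b` equal 1, so the points (2, 0) and (0, 1) lie on the same contour of constant statistical distance.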
Using (1.7.4), we see that all points which have coordinates $(x_1, x_2)$ and
are a constant squared distance, $c^2$, from the origin must satisfy
\[ \frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2 \]
which is the equation of an ellipse centred at the origin, whose major
and minor axes coincide with the coordinate axes. That is, the statistical
distance in (1.7.4) has an ellipse as the locus of all points a constant
distance from the origin. This general case is shown below.
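Consider the statistical distance from a point $P = (x_1, x_2, \ldots, x_p)$ to a point
$Q = (y_1, y_2, \ldots, y_p)$.
Assuming that the coordinates vary independently of one another, the
statistical distance is
\[
d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}} \tag{1.7.5}
\]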
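Thus, we rotate the original coordinate system through the angle $\theta$ while
keeping the scatter fixed, and label the rotated axes $\tilde{x}_1$ and $\tilde{x}_2$. This suggests
that we calculate the sample variances $\tilde{s}_{11}$ and $\tilde{s}_{22}$ using the $\tilde{x}_1$ and $\tilde{x}_2$ coordinates
and measure distance as in (1.7.4). That is,
\[ d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}} \tag{1.7.6} \]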
[Figure: A scatterplot for positively correlated measurements and a rotated coordinate system]
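The relation between the original coordinates $(x_1, x_2)$ and the rotated coordinates
$(\tilde{x}_1, \tilde{x}_2)$ is given by
\[ \tilde{x}_1 = x_1 \cos\theta + x_2 \sin\theta \]
\[ \tilde{x}_2 = -x_1 \sin\theta + x_2 \cos\theta \]
Substituting these relations into (1.7.6) expresses the distance in terms of the original
coordinates as
\[ d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2} \]
where
\[
a_{11} = \frac{\cos^2\theta}{\cos^2\theta \, s_{11} + 2 \sin\theta \cos\theta \, s_{12} + \sin^2\theta \, s_{22}}
+ \frac{\sin^2\theta}{\cos^2\theta \, s_{22} - 2 \sin\theta \cos\theta \, s_{12} + \sin^2\theta \, s_{11}}
\]
\[
a_{22} = \frac{\sin^2\theta}{\cos^2\theta \, s_{11} + 2 \sin\theta \cos\theta \, s_{12} + \sin^2\theta \, s_{22}}
+ \frac{\cos^2\theta}{\cos^2\theta \, s_{22} - 2 \sin\theta \cos\theta \, s_{12} + \sin^2\theta \, s_{11}}
\]
\[
a_{12} = \frac{\cos\theta \sin\theta}{\cos^2\theta \, s_{11} + 2 \sin\theta \cos\theta \, s_{12} + \sin^2\theta \, s_{22}}
- \frac{\sin\theta \cos\theta}{\cos^2\theta \, s_{22} - 2 \sin\theta \cos\theta \, s_{12} + \sin^2\theta \, s_{11}}
\]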
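The algebra behind these coefficients can be checked numerically. The sketch below rotates a point, computes the variances of the rotated coordinates from the original variances and covariance, and confirms that the distance measured in the rotated system equals the quadratic expression in the original coordinates. The variances, covariance, angle, and point are hypothetical values chosen for illustration. The coefficient `a12` of the cross-product term $x_1 x_2$ is obtained from the same substitution as `a11` and `a22`.

```python
import math

def rotate(x1, x2, theta):
    """Coordinates of the same point relative to axes rotated through theta."""
    return (x1 * math.cos(theta) + x2 * math.sin(theta),
            -x1 * math.sin(theta) + x2 * math.cos(theta))

def rotated_variances(s11, s12, s22, theta):
    """Sample variances of the rotated coordinates (the two denominators)."""
    c, s = math.cos(theta), math.sin(theta)
    v1 = c * c * s11 + 2 * s * c * s12 + s * s * s22
    v2 = c * c * s22 - 2 * s * c * s12 + s * s * s11
    return v1, v2

def coefficients(s11, s12, s22, theta):
    """Coefficients a11, a12, a22 of the distance in the original coordinates."""
    c, s = math.cos(theta), math.sin(theta)
    v1, v2 = rotated_variances(s11, s12, s22, theta)
    a11 = c * c / v1 + s * s / v2
    a22 = s * s / v1 + c * c / v2
    a12 = c * s / v1 - s * c / v2
    return a11, a12, a22

# Hypothetical sample variances, covariance, rotation angle, and point.
s11, s12, s22 = 4.0, 1.5, 2.0
theta = math.radians(30)
x1, x2 = 1.0, 2.0

# Distance measured in the rotated coordinate system.
xt1, xt2 = rotate(x1, x2, theta)
v1, v2 = rotated_variances(s11, s12, s22, theta)
d_rotated = math.sqrt(xt1 ** 2 / v1 + xt2 ** 2 / v2)

# The same distance written in the original coordinates.
a11, a12, a22 = coefficients(s11, s12, s22, theta)
d_original = math.sqrt(a11 * x1 ** 2 + 2 * a12 * x1 * x2 + a22 * x2 ** 2)
```

The two values agree up to floating-point rounding for any angle, since the second expression is only a rewriting of the first.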
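Hence, in general
\[
d(P, Q) = \sqrt{
\begin{aligned}
& a_{11}(x_1 - y_1)^2 + a_{22}(x_2 - y_2)^2 + \cdots + a_{pp}(x_p - y_p)^2 \\
& + 2a_{12}(x_1 - y_1)(x_2 - y_2) + 2a_{13}(x_1 - y_1)(x_3 - y_3) + \cdots \\
& + 2a_{p-1,p}(x_{p-1} - y_{p-1})(x_p - y_p)
\end{aligned}
}
\]
where $O = (0, 0, \ldots, 0)$ denotes the origin, $P = (x_1, x_2, \ldots, x_p)$ is an arbitrary point, and
$Q = (y_1, y_2, \ldots, y_p)$ is a specified fixed point.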
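If the coefficients are collected in a symmetric matrix $A$ with entries $a_{ik}$, the expression under the square root is the quadratic form $(P - Q)^T A (P - Q)$. The following sketch assumes NumPy and assumes that $A$ is chosen so that this quadratic form is never negative; the function name and the example matrices are illustrative only.

```python
import numpy as np

def general_distance(p, q, A):
    """sqrt((p - q)^T A (p - q)) for a symmetric coefficient matrix A = (a_ik)."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    A = np.asarray(A, dtype=float)
    # For symmetric A the quadratic form expands to
    # sum_i a_ii * diff_i^2 + 2 * sum_{i<k} a_ik * diff_i * diff_k.
    return float(np.sqrt(diff @ A @ diff))

origin = [0.0, 0.0]

# Identity matrix: the Euclidean distance is recovered.
d_euclid = general_distance([3.0, 4.0], origin, np.eye(2))

# Diagonal matrix with entries 1/s_ii (hypothetical s11 = 4, s22 = 1):
# the statistical distance for independently varying coordinates.
d_diag = general_distance([2.0, 1.0], origin, np.diag([1 / 4.0, 1.0]))

# A nonzero off-diagonal entry a12 adds the cross-product term.
d_cross = general_distance([1.0, 1.0], origin, [[1.0, 0.5], [0.5, 1.0]])
```

Writing the distance this way makes the earlier formulas special cases that differ only in the choice of $A$.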