
1. INTRODUCTION

1.1 Aspects
Multivariate data arise when researchers record the values of several random variables on a number of subjects or objects (or perhaps one of a variety of other things; we will use the general terms "units" or "items") in which they are interested, leading to a vector-valued or multidimensional observation for each. Since the data include simultaneous measurements on many variables, this body of methodology is called multivariate analysis.

The need to understand the relationships between many variables makes multivariate analysis an inherently difficult subject: the human mind is overwhelmed by the sheer bulk of the data, and more mathematics is required to derive multivariate statistical techniques for making inferences.

Our emphasis will be on the analysis of measurements obtained without actively controlling or manipulating any of the variables on which the measurements are made.

Multivariate data are ubiquitous, as illustrated by the examples below:

(i) Psychologists and other behavioural scientists often record the values of several different cognitive variables on a number of subjects.

(ii) Educational researchers may be interested in the examination marks obtained by students in a variety of different subjects.

(iii) Archaeologists may make a set of measurements on artefacts of interest.

(iv) Environmentalists might assess pollution levels of a set of cities along with noting other characteristics of the cities related to climate and human ecology.

1.2 Classification
It is difficult to establish a classification scheme for multivariate techniques that is both widely accepted and indicative of the appropriateness of each technique.
▪ One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships.
▪ Another classifies techniques according to the number of populations and the number of sets of variables being studied.

The choice of methods and the types of analyses employed are largely determined by
the objective of the investigation.

The objectives of scientific investigations to which multivariate methods lend themselves include the following:

(i) Data reduction or structural simplification:- The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.

(ii) Sorting and grouping:- Groups of "similar" objects or variables are created based upon measured characteristics, or rules for classifying objects or variables into well-defined groups may be required.

(iii) Investigation of the dependence among variables:- The nature of relationships among variables is of interest. Are all the variables mutually independent, or are one or more variables dependent on the others? If so, how?

(iv) Prediction:- Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.

(v) Hypothesis construction and testing:- Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

1.3 Applications
To give some indication of the usefulness of multivariate techniques, a few examples are given below. These examples are multifaceted and could be placed in more than one category.

Data reduction or structural simplification

▪ Track records from many nations were used to develop an index of performance for both male and female athletes.
▪ Multispectral image data collected by a high-altitude scanner were reduced to a form that could be viewed as images of a shoreline in two dimensions.
▪ Data on several variables related to cancer patient responses to radiotherapy were used to construct a simple measure of patient response to radiotherapy.
▪ Data on several variables relating to yield and protein content were used to create an index to select parents of subsequent generations of improved bean plants.

Sorting and grouping

▪ Measurements of several physiological variables were used to develop a screening procedure that discriminates alcoholics from non-alcoholics.
▪ Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease.

Investigation of the dependence among variables

▪ The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behaviour and performance.
▪ Data on several variables were used to identify factors that were responsible for client success in hiring external consultants.

Prediction

▪ Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insurers.
▪ cDNA microarray experiments (gene expression data) are increasingly used to study molecular variations among cancer tumours. A reliable classification of tumours is essential for successful diagnosis and treatment of cancer.

Hypothesis construction and testing

▪ Experimental data on several variables were used to see whether the nature of the instructions makes any difference in perceived risks, as quantified by test scores.
▪ Data on several variables were used to determine whether different types of firms in newly industrialised countries exhibited different patterns of innovation.

1.4 Data Structure

Investigators seeking to understand a social or physical phenomenon select a number $p \geq 1$ of variables or characters to record.

The values of these variables are all recorded for each item, individual, or experimental unit.

The notation $x_{jk}$ indicates the particular value of the $k$th variable that is observed on the $j$th item. That is,

$$x_{jk} = \text{measurement of the } k\text{th variable on the } j\text{th item}$$

Consequently, $n$ measurements on $p$ variables can be displayed as follows:

$$
\begin{array}{c|cccccc}
 & \text{variable 1} & \text{variable 2} & \cdots & \text{variable } k & \cdots & \text{variable } p \\
\hline
\text{Item } 1 & x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
\text{Item } 2 & x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
\text{Item } j & x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
\text{Item } n & x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{array}
$$

This can be represented as a matrix of $n$ rows and $p$ columns:

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{pmatrix}
$$

This matrix contains the data consisting of all of the observations on all of the variables.

Example 1.1
A selection of four receipts from a bookstore was obtained to investigate the nature of book sales. The receipts provided, among other things, the number of books sold and the total amount of each sale.

Let the first variable be total dollar sales and the second variable the number of books sold. We can consider the four receipts as four measurements on two variables.

Variable 1 (dollar sales): 42 52 48 58
Variable 2 (number of books): 4 5 4 3

In our notation:

$$x_{11} = 42 \quad x_{21} = 52 \quad x_{31} = 48 \quad x_{41} = 58$$
$$x_{12} = 4 \quad\;\; x_{22} = 5 \quad\;\; x_{32} = 4 \quad\;\; x_{42} = 3$$

Matrix (data array):

$$
X = \begin{pmatrix}
42 & 4 \\
52 & 5 \\
48 & 4 \\
58 & 3
\end{pmatrix}
$$

1.4.1 Types of variables

(i) Nominal
Unordered categorical variables. Examples: the sex of a respondent, hair colour, presence or absence of depression, and nationality.

(ii) Ordinal
Where there is an ordering but no implication of equal distance between the different points of the scale. Examples: educational attainment (no schooling, primary, secondary, tertiary), social class, degree classification (distinction, merit, pass).

(iii) Interval
Where there are equal differences between successive points on the scale but the position of zero is arbitrary. For example, the measurement of temperature using the Celsius or Fahrenheit scales.

(iv) Ratio
The highest level of measurement, where one can investigate the relative magnitudes of scores as well as the differences between them. The position of zero is fixed. Examples: the absolute measure of temperature in Kelvin, age, weight, and length.

1.4.2 Missing values
Consider the table below, where NA denotes a missing value. Missing data are one of the problems that face statisticians undertaking statistical analysis in general and multivariate analysis in particular.

ID  Sex     Age  IQ   Depression  Health     Weight
1   Male    21   120  Yes         Very good  150
2   Male    43   NA   No          Very good  160
3   Male    22   135  No          Average    135
4   Female  16   130  Yes         Good       110
5   Female  NA   150  Yes         Good       110
6   Male    86   150  No          Average    140
7   Female  22   84   No          Average    105
8   Female  22   84   No          Very good  105

Missing values are observations and measurements that should have been recorded but, for one reason or another, were not.

In multivariate data, missing values arise from, for example, non-response in sample surveys, dropouts in longitudinal studies, or refusal to answer particular questions in a questionnaire.

The most important way of dealing with missing data is to try to avoid them during the data-collection stage of a study. However, this is not always possible.

There are several ways to deal with missing values, a couple of which are sketched in code after this list:

▪ Complete-case analysis
▪ Available-case analysis
▪ Imputation
▪ Multiple imputation
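
As an illustration of the first and third strategies, here is a minimal sketch in Python using pandas; the data frame mirrors the table above, and the mean-fill rule is an illustrative assumption rather than a recommended default.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame mirroring the table above (NaN = missing).
df = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Female",
            "Female", "Male", "Female", "Female"],
    "Age": [21, 43, 22, 16, np.nan, 86, 22, 22],
    "IQ":  [120, np.nan, 135, 130, 150, 150, 84, 84],
})

# Complete-case analysis: keep only rows with no missing values.
complete_cases = df.dropna()

# Simple (single) imputation: replace each missing value with the
# sample mean of its column; this understates the true variability.
imputed = df.copy()
for col in ["Age", "IQ"]:
    imputed[col] = imputed[col].fillna(imputed[col].mean())

print(complete_cases)
print(imputed)
```

Available-case analysis would instead compute each summary statistic from all rows observed for the variables involved, and multiple imputation would repeat the fill step several times with draws from a model, combining the results.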

1.5 Graphical Techniques
There are several graphical displays that can be used to aid in data analysis. It is impossible to simultaneously plot all measurements made on several variables and study the configurations, but plots of individual variables and plots of pairs of variables can still be very informative.

Below are a few plots that frequently aid in data analysis.

In the diagram below, specimen (board) 16 and possibly specimen (board) 9 are identified as unusual observations. Figures 1.12(a), (b), and (c) contain perspectives of the stiffness data in the $\left( x_1, x_2, x_3 \right)$ space.

These views were obtained by continually rotating and turning the three-dimensional coordinate axes. Spinning the coordinate axes allows one to get a better understanding of the three-dimensional aspects of the data. Figure 1.12(d) gives one picture of the stiffness data in the $\left( x_2, x_3, x_4 \right)$ space. Notice that Figures 1.12(a) and (d) visually confirm specimens 9 and 16 as outliers. Specimen 9 is very large in all three coordinates.

A counter-clockwise-like rotation of the axes in Figure 1.12(a) produces Figure 1.12(b), and the two unusual observations are masked in this view. A further spinning of the $x_2, x_3$ axes gives Figure 1.12(c); one of the outliers (16) is now hidden.

Additional insights can sometimes be gleaned from visual inspection of the slowly spinning data. It is this dynamic aspect that statisticians are just beginning to understand and exploit.

Plots like those in Figure 1.12 allow one to readily identify observations that do not conform to the rest of the data and that may heavily influence inferences based on standard data-generating models.
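
For readers who want to experiment with such displays, the following is a minimal sketch, assuming hypothetical stiffness-like data; the planted outliers and variable names are placeholders, not the actual data behind Figure 1.12.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical stiffness-like measurements: 30 boards, 4 variables.
X = rng.normal(loc=1800, scale=100, size=(30, 4))
X[8] += 400      # plant an outlier large in all coordinates (cf. specimen 9)
X[15, 1] -= 350  # plant a second unusual observation (cf. specimen 16)

# Three-dimensional scatterplot of (x1, x2, x3).
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2])
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("x3")

# Changing the viewing angles (or dragging the plot interactively)
# mimics the axis spinning described above.
ax.view_init(elev=20, azim=35)
plt.show()
```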

1.6 Descriptive Statistics
Information contained in the data can be assessed by calculating summary statistics. For example, the sample mean provides a measure of location, i.e., a "central value" for a set of numbers, while the average of the squared distances of all numbers from the mean provides a measure of the variation (or spread) in the numbers.

Consider the dataset as a matrix $X$; then we can treat each column of $X$ separately. That is, let $x_{11}, x_{21}, \ldots, x_{n1}$ be the $n$ measurements on the first variable.

For each column we can find the:

▪ sample mean

$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \qquad \text{for } j = 1, 2, \ldots, p$$

▪ sample variance

$$s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_{ij} - \bar{x}_j \right)^2 \qquad \text{for } j = 1, 2, \ldots, p$$

The positive square root of $s_j^2$ is known as the sample standard deviation.

Consider $n$ pairs of measurements on each of variables 1 and 2:

$$\begin{pmatrix} x_{11} \\ x_{12} \end{pmatrix}, \begin{pmatrix} x_{21} \\ x_{22} \end{pmatrix}, \ldots, \begin{pmatrix} x_{n1} \\ x_{n2} \end{pmatrix}$$

That is, $x_{j1}$ and $x_{j2}$ are observed on the $j$th experimental item $\left( j = 1, 2, \ldots, n \right)$.

A measure of the linear association between the measurements of variables 1 and 2 is given by the sample covariance

$$s_{12} = \frac{1}{n-1} \sum_{j=1}^{n} \left( x_{j1} - \bar{x}_1 \right) \left( x_{j2} - \bar{x}_2 \right)$$

▪ sample covariance (a measure of linear dependence) between two variables

$$s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_{ij} - \bar{x}_j \right) \left( x_{ik} - \bar{x}_k \right) \qquad j = 1, 2, \ldots, p; \; k = 1, 2, \ldots, p$$

Note that the covariance reduces to the sample variance when $j = k$, i.e., $s_j^2 = s_{jj}$.

Also, $s_{jk} = s_{kj}$ for all $j$ and $k$.

▪ the (Pearson's) sample correlation coefficient

This measures the strength of the linear relationship between two variables and does not depend on the units of measurement.

The sample correlation coefficient for the $j$th and $k$th variables is
$$r_{jk} = \frac{\displaystyle\sum_{i=1}^{n} \left( x_{ij} - \bar{x}_j \right)\left( x_{ik} - \bar{x}_k \right)}{\sqrt{\displaystyle\sum_{i=1}^{n} \left( x_{ij} - \bar{x}_j \right)^2} \sqrt{\displaystyle\sum_{i=1}^{n} \left( x_{ik} - \bar{x}_k \right)^2}} = \frac{s_{jk}}{\sqrt{s_{jj}} \sqrt{s_{kk}}} \qquad j = 1, \ldots, p; \; k = 1, \ldots, p$$

Note that $r_{jk} = r_{kj}$ for all $j$ and $k$.

▪ Properties of the Correlation Coefficient

(i) $-1 \leq r_{jk} \leq 1$ for all $j, k$.

(ii) $r_{jk}$ gives the strength of the relationship, with values of $\left| r_{jk} \right|$ close to one implying a strong relationship and values close to zero implying a weak relationship.

(iii) The sign of $r_{jk}$ gives the direction of the association.

(iv) If $\left| r_{jk} \right| = 1$, then there are constants $a$ and $b$ such that $x_{ij} = a + b x_{ik}$ for $i = 1, 2, \ldots, n$.

(v) The value of $r_{jk}$ does not change if either variable is subjected to a linear transformation $y = a + bx$ with $b > 0$ (if $b < 0$, only the sign of $r_{jk}$ changes); this is checked numerically in the sketch after this list.
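
As a quick numerical check of property (v), the sketch below correlates a small made-up sample before and after linear rescalings; the data values are arbitrary illustrations.

```python
import numpy as np

# Arbitrary illustrative measurements on two variables.
x = np.array([42.0, 52.0, 48.0, 58.0])
y = np.array([4.0, 5.0, 4.0, 3.0])

r = np.corrcoef(x, y)[0, 1]

# Linear transformation with positive slope: correlation unchanged.
x_scaled = 3.0 + 2.0 * x
r_scaled = np.corrcoef(x_scaled, y)[0, 1]

# Negative slope: only the sign of the correlation flips.
x_flipped = 3.0 - 2.0 * x
r_flipped = np.corrcoef(x_flipped, y)[0, 1]

print(r, r_scaled, r_flipped)  # r == r_scaled and r_flipped == -r
```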

We can represent all these quantities as arrays:

Sample means
$$\bar{\mathbf{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}$$

Sample variances and covariances
$$S_n = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}$$

Sample correlations
$$R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}$$

Example 1.6.1
Consider the data given in Example 1.1. Compute the sample means, variances, covariance, and correlation coefficient.
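
A minimal worked sketch of the computation in Python follows, using the formulas above with divisor $n - 1$ (numpy's ddof=1); the printed values can serve as a check against a hand calculation.

```python
import numpy as np

# Data array from Example 1.1: rows are receipts,
# columns are (dollar sales, number of books).
X = np.array([[42, 4],
              [52, 5],
              [48, 4],
              [58, 3]], dtype=float)

x_bar = X.mean(axis=0)               # sample means: [50., 4.]
S = np.cov(X, rowvar=False, ddof=1)  # sample covariance matrix S
R = np.corrcoef(X, rowvar=False)     # sample correlation matrix R

print(x_bar)  # [50.  4.]
print(S)      # [[45.333..., -2.], [-2., 0.666...]]
print(R)      # off-diagonal entry r12 is approximately -0.36
```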

1.7 Geometry
(i) The straight-line or Euclidean distance of a point $P = \left( x_1, x_2, \ldots, x_p \right)$ from the origin $O = \left( 0, \ldots, 0 \right)$ is

$$d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2} \tag{1.7.1}$$

All points that lie at a constant squared distance, say $c^2$, from the origin satisfy the equation

$$d^2(O, P) = x_1^2 + x_2^2 + \cdots + x_p^2 = c^2 \tag{1.7.2}$$

This is the equation of a hypersphere (a circle, if $p = 2$); points equidistant from the origin lie on a hypersphere.

(ii) The straight-line distance between two arbitrary points $P$ and $Q$ with coordinates $P = \left( x_1, x_2, \ldots, x_p \right)$ and $Q = \left( y_1, y_2, \ldots, y_p \right)$ is given by

$$d(P, Q) = \sqrt{\left( x_1 - y_1 \right)^2 + \left( x_2 - y_2 \right)^2 + \cdots + \left( x_p - y_p \right)^2} \tag{1.7.3}$$

Straight-line distance is unsatisfactory for most statistical purposes, because each coordinate contributes equally to the calculation.

(iii) We need a 'statistical' distance that accounts for differences in variation and, in due course, the presence of correlation. Our choice will depend upon the sample variances and covariances. Statistical distance is fundamental to multivariate analysis.

One way to proceed is to divide each coordinate by the sample standard deviation; this gives standardised coordinates. These are now on the same footing, and we find the distance between points by applying the standard Euclidean formula to them.

(a) Consider the distance of the point $P = \left( x_1, x_2 \right)$ from the origin $O = \left( 0, 0 \right)$.

The statistical distance is

$$d(O, P) = \sqrt{\left( \frac{x_1}{\sqrt{s_{11}}} \right)^2 + \left( \frac{x_2}{\sqrt{s_{22}}} \right)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}} \tag{1.7.4}$$

• In $\mathbb{R}^2$, the difference between (1.7.1) and (1.7.4) is due to the weights $k_1 = s_{11}^{-1}$ and $k_2 = s_{22}^{-1}$ attached to $x_1^2$ and $x_2^2$.
• If the sample variances are the same, then $k_1 = k_2$ and $x_1^2$ and $x_2^2$ receive the same weight.
• If the variability in the $x_1$ direction is the same as the variability in the $x_2$ direction, and the $x_1$ values vary independently of the $x_2$ values, Euclidean distance is appropriate.

Using (1.7.4), we see that all points with coordinates $\left( x_1, x_2 \right)$ that are a constant squared distance $c^2$ from the origin must satisfy

$$\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2$$

which is the equation of an ellipse centred at the origin whose major and minor axes coincide with the coordinate axes. That is, the statistical distance in (1.7.4) has an ellipse as the locus of all points a constant distance from the origin. This general case is shown below.

(b) Consider the distance from a point $P = \left( x_1, x_2, \ldots, x_p \right)$ to any fixed point $Q = \left( y_1, y_2, \ldots, y_p \right)$.

Assuming that the coordinates vary independently of one another, the statistical distance is

$$d(P, Q) = \sqrt{\frac{\left( x_1 - y_1 \right)^2}{s_{11}} + \frac{\left( x_2 - y_2 \right)^2}{s_{22}} + \cdots + \frac{\left( x_p - y_p \right)^2}{s_{pp}}} \tag{1.7.5}$$

(c) Equation (1.7.5) assumes independent coordinates. Suppose now that the coordinates of the pairs $\left( x_1, x_2 \right)$ exhibit a tendency to be large or small together, so that the sample correlation coefficient is positive, and that the variability in the $x_2$ direction is larger than the variability in the $x_1$ direction.

Thus, we rotate the original coordinate system through an angle $\theta$ while keeping the scatter fixed, and label the rotated axes $\tilde{x}_1$ and $\tilde{x}_2$. This suggests that we calculate the sample variances using the $\tilde{x}_1$ and $\tilde{x}_2$ coordinates and measure distance as in (1.7.4). That is,

$$d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}} \tag{1.7.6}$$

where $\tilde{s}_{11}$ and $\tilde{s}_{22}$ denote the sample variances computed with the $\tilde{x}_1$ and $\tilde{x}_2$ measurements.

Figure: A scatterplot for positively correlated measurements and a rotated coordinate system.

The relation between the original coordinates $\left( x_1, x_2 \right)$ and the rotated coordinates $\left( \tilde{x}_1, \tilde{x}_2 \right)$ is given by

$$\tilde{x}_1 = x_1 \cos(\theta) + x_2 \sin(\theta)$$
$$\tilde{x}_2 = -x_1 \sin(\theta) + x_2 \cos(\theta)$$

In terms of the original coordinates,

$$d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2}$$

where

$$a_{11} = \frac{\cos^2(\theta)}{\cos^2(\theta)\, s_{11} + 2 \sin(\theta) \cos(\theta)\, s_{12} + \sin^2(\theta)\, s_{22}} + \frac{\sin^2(\theta)}{\cos^2(\theta)\, s_{22} - 2 \sin(\theta) \cos(\theta)\, s_{12} + \sin^2(\theta)\, s_{11}}$$

$$a_{22} = \frac{\sin^2(\theta)}{\cos^2(\theta)\, s_{11} + 2 \sin(\theta) \cos(\theta)\, s_{12} + \sin^2(\theta)\, s_{22}} + \frac{\cos^2(\theta)}{\cos^2(\theta)\, s_{22} - 2 \sin(\theta) \cos(\theta)\, s_{12} + \sin^2(\theta)\, s_{11}}$$

$$a_{12} = \frac{\cos(\theta) \sin(\theta)}{\cos^2(\theta)\, s_{11} + 2 \sin(\theta) \cos(\theta)\, s_{12} + \sin^2(\theta)\, s_{22}} - \frac{\sin(\theta) \cos(\theta)}{\cos^2(\theta)\, s_{22} - 2 \sin(\theta) \cos(\theta)\, s_{12} + \sin^2(\theta)\, s_{11}}$$

Hence, in general,

$$d(P, Q) = \left[ a_{11} \left( x_1 - y_1 \right)^2 + a_{22} \left( x_2 - y_2 \right)^2 + \cdots + a_{pp} \left( x_p - y_p \right)^2 + 2 a_{12} \left( x_1 - y_1 \right)\left( x_2 - y_2 \right) + 2 a_{13} \left( x_1 - y_1 \right)\left( x_3 - y_3 \right) + \cdots + 2 a_{p-1,p} \left( x_{p-1} - y_{p-1} \right)\left( x_p - y_p \right) \right]^{1/2}$$

where $O = \left( 0, 0, \ldots, 0 \right)$ denotes the origin, $P = \left( x_1, x_2, \ldots, x_p \right)$ is a point, and $Q = \left( y_1, y_2, \ldots, y_p \right)$ is a specified fixed point.
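
To make the quadratic form concrete, here is a minimal sketch in Python that evaluates $d(P, Q)$ from a symmetric coefficient matrix $A = (a_{jk})$; the particular numbers for $A$, $P$, and $Q$ are illustrative assumptions. With $A$ equal to the identity matrix the formula reduces to the Euclidean distance (1.7.3), and with a diagonal $A$ holding reciprocal sample variances it reduces to (1.7.5).

```python
import numpy as np

def statistical_distance(p, q, A):
    """General statistical distance: the square root of (p - q)' A (p - q),
    where A is a symmetric matrix of coefficients a_jk."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.sqrt(diff @ A @ diff))

# Illustrative points and coefficient matrices (assumed values).
P = np.array([3.0, 4.0])
Q = np.array([0.0, 0.0])

A_euclid = np.eye(2)                   # identity: Euclidean distance, 5.0
A_indep = np.diag([1 / 4.0, 1 / 9.0])  # diagonal 1/s11, 1/s22 as in (1.7.5)

print(statistical_distance(P, Q, A_euclid))  # 5.0
print(statistical_distance(P, Q, A_indep))   # sqrt(9/4 + 16/9) ≈ 2.007
```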

