Cluster Sampling
The distinct and identifiable units into which a finite population is divided are called sampling units. The
smallest unit into which the population can be divided is called an element of the population, and groups of
elements are called clusters.
Thus cluster sampling is a sampling technique which consists in forming suitable clusters of elements and surveying
all the elements in a sample of clusters selected according to an appropriate sampling scheme. In cluster sampling
the sampling unit is a cluster.
Example
To estimate the milk production in a Thana, a village can be taken as a cluster whose elements are households.
To save time and expenditure, we choose some villages from the Thana as clusters and survey all the households
belonging to those villages. Such a procedure for estimating the milk production in a Thana is called cluster
sampling, and it can be extended to a district or the whole country.
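The village example can be sketched in code. This is only an illustration: the village names, the household milk figures and the function name `cluster_sample` are assumptions and do not come from the text.

```python
import random

def cluster_sample(clusters, n, seed=None):
    """Pick n clusters by simple random sampling without replacement
    and keep every element of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n)
    return {name: clusters[name] for name in chosen}

# Hypothetical data: milk production (litres per day) of each household, by village.
villages = {
    "A": [4.0, 6.5, 3.0],
    "B": [5.5, 2.0, 7.0],
    "C": [1.5, 3.5, 4.5],
    "D": [6.0, 6.0, 2.5],
}

sample = cluster_sample(villages, n=2, seed=1)
# Every household in the selected villages is surveyed.
households = [y for ys in sample.values() for y in ys]
mean_per_household = sum(households) / len(households)
```

Only the list of villages is needed as a frame; no list of all households in the Thana is required.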
Area Sampling
If the entire area containing the population under study is divided into smaller area segments and each element in
the population is associated with one and only one such area segment, the procedure is called area sampling.
2. To increase the efficiency of cluster estimates, the number of clusters should be large and the number of
elements in a cluster should be small.
Discuss the Reasons for Conducting Cluster Sampling With Some Example.
There are two main reasons for the widespread application of cluster sampling:
1. When the population under study is so large that no reliable list of its elements is available, or a frame
is impossible to obtain.
2. When the time, money and labor needed to survey individually selected elements are beyond our capacity.
Example
A simple random sample of 600 houses covers a town more evenly than 20 city blocks containing an average of 30
houses apiece. But greater field costs are incurred in locating 600 houses and in travel between them than in
locating 20 blocks and visiting all the houses in these blocks.
1. Cluster sampling is operationally more convenient and less costly than SRS and stratified sampling because it
saves listing and traveling costs. It also saves time in journeys, identification, contacts etc.
2. When a sampling frame of elements is not readily available or is costly to construct, the cluster sampling
procedure can be adopted.
1
021521
2. The efficiency of cluster sampling is likely to decrease as the cluster size increases.
3. In most practical situations, the loss of efficiency may be balanced by the reduction in cost, because the cost
per unit in cluster sampling is less than in simple random sampling.
2. Regional analysis: classifying the cities of a region into typologies based on demographic or fiscal variables.
3. Marketing analysis: classifying customers into segments on the basis of product use.
1. SRS: A frame is essential for drawing a simple random sample.
   Cluster sampling: A frame is not essential for drawing a cluster sample.
2. SRS: When the population under study is large, SRS is more expensive and more time consuming.
   Cluster sampling: When the population under study is large, cluster sampling is less expensive and less time
   consuming.
3. SRS: If each cluster is of size M = 1, there is no difference between SRS and cluster sampling.
   Cluster sampling: The efficiency of cluster sampling increases as the mean square between clusters decreases.
4. SRS: SRS is more efficient than cluster sampling.
   Cluster sampling: Cluster sampling is less efficient than SRS.
Difference between Cluster and Strata
1. Cluster: The smallest unit into which the population can be divided is called an element of the population. A
   group of such elements is known as a cluster.
   Strata: A relatively homogeneous (non-overlapping) subgroup of a population is known as a stratum.
2. Cluster: The elements within each cluster are heterogeneous, and the clusters are homogeneous among themselves.
   Strata: The elements within each stratum are homogeneous, and the strata are heterogeneous among themselves.
3. Cluster: Clusters are generally made up on the basis of geographical location.
   Strata: Strata are generally made up on the basis of some characteristics of the population.
4. Cluster: Obtaining a sample from clusters generally costs less.
   Strata: Obtaining a sample from strata generally costs more.
1. Cluster sampling: The elements within each cluster are heterogeneous, and the clusters are homogeneous among
   themselves.
   Stratified sampling: The elements within each stratum are homogeneous, and the strata are heterogeneous among
   themselves.
2. Cluster sampling: Clustering is often done on the basis of geographical location.
   Stratified sampling: Stratification is done on the basis of some characteristics of the population.
3. Cluster sampling: The selected clusters are subjected to complete enumeration.
   Stratified sampling: The strata are subjected to sampling.
4. Cluster sampling: It is less costly than stratified sampling.
   Stratified sampling: It is more costly than cluster sampling.
5. Cluster sampling: It gives less efficient results than stratified sampling.
   Stratified sampling: It gives more efficient results than cluster sampling.
1. Two-stage sampling: Auxiliary information is not necessary.
   Two-phase sampling: Auxiliary information is necessary.
2. Two-stage sampling: The sampling unit is not the same for both stages.
   Two-phase sampling: The sampling unit is the same for both phases.
3. Two-stage sampling: It is not more advantageous when the gain in precision of the estimates increases.
   Two-phase sampling: It is more advantageous when the gain in precision of the estimates increases.
4. Two-stage sampling: A sampling frame of the second stage units is necessary for the selected first stage units.
   Two-phase sampling: It is necessary to have a complete sampling frame of the units.
Design Effect
L. Kish defined design effect as the ratio of the variance of the estimate obtained from the more complex sample to
the variance of the estimate obtained from a simple random sample of same number of units (elements).
The design effect has two primary uses:
For instance, systematic sampling may be considered a particular case of cluster sampling, since in this case the
population is divided into a number of clusters, each cluster consists of units distributed at a fixed interval
(systematically) over the whole population, and one such cluster is selected at random.
Other sampling procedures, viz. SRS, PPS and stratified sampling can be applied to sampling of clusters by treating
the clusters themselves as sampling units.
Theorem
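For the $u$th type of unit, let $N_u$ be the number of units in the population, $n_u$ the number of units in the
sample, $M_u$ the size of the unit, $S_u^2$ the variance among the unit totals and $C_u$ the cost of taking one
unit. The estimate of the population total has variance (ignoring the f.p.c.)

\[ V = \frac{N_u^2 S_u^2}{n_u} . \]

If $V$ is specified,

\[ n_u = \frac{N_u^2 S_u^2}{V} \qquad (1) \]

The cost of taking $n_u$ units is

\[ C_u n_u = \frac{C_u N_u^2 S_u^2}{V} . \]

Since $N_u M_u$ = constant for different types of unit, the cost is proportional to

\[ \frac{C_u S_u^2}{M_u^2} . \]

On the other hand, if the cost $C$ is specified, $n_u = C / C_u$ and equation (1) gives

\[ V = \frac{C_u N_u^2 S_u^2}{C} \propto \frac{C_u S_u^2}{M_u^2} . \qquad \text{(Proved)} \]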
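The rule that both the cost (for fixed variance) and the variance (for fixed cost) are proportional to $C_u S_u^2 / M_u^2$ can be used to compare candidate unit types. The sketch below assumes two hypothetical unit types with made-up costs and variances; none of the numbers or names come from the text.

```python
def cost_variance_index(C_u, S_u2, M_u):
    """C_u * S_u^2 / M_u^2: proportional to the cost for a fixed variance
    and to the variance for a fixed cost, so smaller is better."""
    return C_u * S_u2 / M_u ** 2

# Hypothetical unit types: (cost per unit C_u, variance among unit totals S_u^2, size M_u).
unit_types = {
    "single household": (1.0, 4.0, 1),
    "block of 10 households": (6.0, 250.0, 10),
}

index = {name: cost_variance_index(*values) for name, values in unit_types.items()}
best_unit = min(index, key=index.get)
```

With these assumed figures the index is 4.0 for single households and 15.0 for blocks, so the smaller unit would be preferred.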
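\[
\begin{array}{c|cccccc}
\text{Cluster} & 1 & 2 & \cdots & i & \cdots & N \\
\hline
 & y_{11} & y_{21} & \cdots & y_{i1} & \cdots & y_{N1} \\
 & y_{12} & y_{22} & \cdots & y_{i2} & \cdots & y_{N2} \\
 & \vdots & \vdots & & \vdots & & \vdots \\
 & y_{1M} & y_{2M} & \cdots & y_{iM} & \cdots & y_{NM}
\end{array}
\]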
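i.e. the population consists of $N$ clusters, each of $M$ elements, and a sample of $n$ clusters is drawn by the
method of simple random sampling. Let

$y_{ij}$ = the value of the characteristic under study for the $j$th element ($j = 1, 2, \dots, M$) in the $i$th
cluster ($i = 1, 2, \dots, N$).

\[ \bar{Y}_i = \frac{1}{M}\sum_{j=1}^{M} y_{ij} = \text{the mean per element of the } i\text{th cluster.} \]

\[ \bar{y} = \frac{1}{n}\sum_{i=1}^{n} \bar{Y}_i = \text{the mean of the cluster means in the sample of } n \text{ clusters.} \]

\[ \bar{Y} = \frac{1}{N}\sum_{i=1}^{N} \bar{Y}_i = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij} = \text{the mean per element in the population.} \]

\[ S_i^2 = \frac{1}{M-1}\sum_{j=1}^{M} (y_{ij} - \bar{Y}_i)^2 = \text{the mean square among the elements within the } i\text{th cluster.} \]

\[ S_w^2 = \frac{1}{N}\sum_{i=1}^{N} S_i^2 = \text{the mean square within clusters.} \]

\[ S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N} (\bar{Y}_i - \bar{Y})^2 = \text{the mean square between cluster means in the population.} \]

\[ S^2 = \frac{1}{NM-1}\sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \bar{Y})^2 = \text{the mean square between the elements in the population.} \]

\[ \rho = \frac{E(y_{ij} - \bar{Y})(y_{ik} - \bar{Y})}{E(y_{ij} - \bar{Y})^2}
       = \frac{\sum_{i=1}^{N}\sum_{j \neq k} (y_{ij} - \bar{Y})(y_{ik} - \bar{Y})}{(M-1)(NM-1)S^2} \]

where $\rho$ is the intra-cluster correlation coefficient between elements within clusters.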
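The quantities defined above can be computed directly for a small population. The following is a minimal sketch, assuming the population is held as a list of $N$ equal-sized clusters; the function name and the toy data in the usage note are illustrative assumptions.

```python
def cluster_population_stats(y):
    """y is a list of N clusters, each a list of M element values."""
    N, M = len(y), len(y[0])
    cluster_means = [sum(c) / M for c in y]
    grand_mean = sum(cluster_means) / N
    # Mean square within clusters: average of the N within-cluster mean squares.
    S_w2 = sum(sum((v - m) ** 2 for v in c) / (M - 1)
               for c, m in zip(y, cluster_means)) / N
    # Mean square between cluster means.
    S_b2 = sum((m - grand_mean) ** 2 for m in cluster_means) / (N - 1)
    # Mean square between all elements in the population.
    S2 = sum((v - grand_mean) ** 2 for c in y for v in c) / (N * M - 1)
    # Intra-cluster correlation: cross products over all ordered pairs j != k.
    cross = sum((c[j] - grand_mean) * (c[k] - grand_mean)
                for c in y for j in range(M) for k in range(M) if j != k)
    rho = cross / ((M - 1) * (N * M - 1) * S2)
    return {"S_w2": S_w2, "S_b2": S_b2, "S2": S2, "rho": rho}

stats = cluster_population_stats([[1, 2], [3, 4], [5, 6]])
```

For this toy population of three clusters of two elements, the within mean square is 0.5, the between mean square is 4 and $S^2$ is 3.5, which gives a strongly positive $\rho$ because neighbouring values were grouped together.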
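Let $y_{ij}$ be the observed value for the $j$th element within the $i$th unit and let $Y_i$ be the unit total. The
intra-cluster correlation coefficient is defined as

\[ \rho = \frac{E(y_{ij} - \bar{Y})(y_{ik} - \bar{Y})}{E(y_{ij} - \bar{Y})^2}
       = \frac{2\sum_{i=1}^{N}\sum_{j<k} (y_{ij} - \bar{Y})(y_{ik} - \bar{Y})}{(M-1)(NM-1)S^2} \]

where $\sum_i Y_i / N$ is the population mean per cluster and $\bar{Y} = \sum_i Y_i / (NM)$, which is the mean per
cluster divided by $M$, is the population mean per element.
The number of cross-product terms in the numerator $E$ is $NM(M-1)/2$, and the denominator $E$ is
$(NM-1)S^2 / (NM)$.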
Theorem
A simple random sample of $n$ clusters, each containing $M$ elements, is drawn from the $N$ clusters in the
population. Then the sample mean per element $\bar{y}$ is an unbiased estimate of $\bar{Y}$ with variance

\[ \operatorname{var}(\bar{y}) = \frac{1-f}{n}\cdot\frac{NM-1}{M^2(N-1)}\, S^2\,[1 + (M-1)\rho]
   \approx \frac{1-f}{nM}\, S^2\,[1 + (M-1)\rho] , \]

where $f = n/N$ and $\rho$ is the intra-cluster correlation coefficient.
Proof
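Let $y_i$ denote the total for the $i$th cluster, so that the sample mean of the cluster totals is
$\sum_{i=1}^{n} y_i / n$.
Now we have,

\[ \bar{y} = \frac{1}{nM}\sum_{i=1}^{n}\sum_{j=1}^{M} y_{ij} = \frac{1}{n}\sum_{i=1}^{n} \bar{Y}_i . \]

Therefore

\[ E(\bar{y}) = \frac{1}{n}\sum_{i=1}^{n} E(\bar{Y}_i)
   = \frac{1}{n}\cdot n \sum_{i=1}^{N} \frac{\bar{Y}_i}{N}
   = \sum_{i=1}^{N}\sum_{j=1}^{M} \frac{y_{ij}}{NM} = \bar{Y} , \]

so $\bar{y}$ is unbiased. Since the $n$ cluster means are a simple random sample from the $N$ cluster means,

\[ \operatorname{var}(\bar{y}) = \frac{1-f}{n}\sum_{i=1}^{N} \frac{(\bar{Y}_i - \bar{Y})^2}{N-1} = \frac{1-f}{n}\, S_b^2 . \]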
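Here,

\[ \sum_{i=1}^{N} (\bar{Y}_i - \bar{Y})^2
   = \sum_{i=1}^{N} \Big( \sum_{j=1}^{M} \frac{y_{ij}}{M} - \bar{Y} \Big)^2
   = \frac{1}{M^2}\sum_{i=1}^{N} \Big( \sum_{j=1}^{M} y_{ij} - M\bar{Y} \Big)^2 \]

\[ = \frac{1}{M^2}\sum_{i=1}^{N} \Big[ \sum_{j=1}^{M} (y_{ij} - \bar{Y})^2 + \sum_{j \neq k} (y_{ij} - \bar{Y})(y_{ik} - \bar{Y}) \Big] \]

\[ = \frac{1}{M^2}\Big[ \sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \bar{Y})^2 + \sum_{i=1}^{N}\sum_{j \neq k} (y_{ij} - \bar{Y})(y_{ik} - \bar{Y}) \Big] \]

\[ = \frac{1}{M^2}\Big[ (NM-1)S^2 + (M-1)(NM-1)\rho S^2 \Big]
   = \frac{(NM-1)S^2}{M^2}\,[1 + (M-1)\rho] . \]

Hence

\[ \operatorname{var}(\bar{y}) = \frac{1-f}{n}\cdot\frac{1}{N-1}\cdot\frac{NM-1}{M^2}\, S^2\,[1 + (M-1)\rho] \]

\[ \approx \frac{1-f}{n}\cdot\frac{NM}{NM^2}\, S^2\,[1 + (M-1)\rho] \quad \text{for large } N \]

\[ = \frac{1-f}{nM}\, S^2\,[1 + (M-1)\rho] . \qquad \text{(proved)} \]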
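The exact form of the theorem can be checked numerically on a small population by enumerating every possible sample of $n$ clusters. This is a sketch for illustration only; the function name and the toy data are assumptions, not part of the text.

```python
from itertools import combinations

def exact_and_formula_variance(y, n):
    """Compare the true variance of the cluster-sample mean per element
    with (1-f)/n * (NM-1)/(M^2 (N-1)) * S^2 * [1 + (M-1) rho]."""
    N, M = len(y), len(y[0])
    means = [sum(c) / M for c in y]
    Y_bar = sum(means) / N
    # Exact variance: average over all equally likely samples of n clusters.
    sample_means = [sum(means[i] for i in s) / n
                    for s in combinations(range(N), n)]
    exact = sum((m - Y_bar) ** 2 for m in sample_means) / len(sample_means)
    # Ingredients of the formula.
    S2 = sum((v - Y_bar) ** 2 for c in y for v in c) / (N * M - 1)
    cross = sum((c[j] - Y_bar) * (c[k] - Y_bar)
                for c in y for j in range(M) for k in range(M) if j != k)
    rho = cross / ((M - 1) * (N * M - 1) * S2)
    f = n / N
    formula = ((1 - f) / n) * (N * M - 1) / (M ** 2 * (N - 1)) * S2 * (1 + (M - 1) * rho)
    return exact, formula

exact, formula = exact_and_formula_variance([[1, 2], [3, 4], [5, 6]], n=2)
```

The two values agree up to rounding error, because the first expression in the theorem is exact; only the second, simpler expression relies on $N$ being large.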
From the above function, it is easily observed that the variance in cluster sampling depends on the number of
clusters, the size of the cluster, the intra-cluster correlation coefficient and $S^2$.
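Again we know that the sample mean per cluster (the mean of the sampled cluster totals) is $M\bar{y}$, so

\[ \operatorname{var}(M\bar{y}) = M^2 \operatorname{var}(\bar{y}) . \]

Also, the estimate of the population total is

\[ \hat{Y} = NM\bar{y} , \]

\[ \operatorname{var}(\hat{Y}) = N^2M^2 \operatorname{var}(\bar{y})
   = N^2M^2\,\frac{1-f}{nM}\, S^2\,[1 + (M-1)\rho]
   = \frac{N^2M(1-f)}{n}\, S^2\,[1 + (M-1)\rho] . \]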
If $M$ is large and $\rho$ is positive, then $(M-1)\rho$ is positive, which increases the value of
$\operatorname{var}(\bar{y})$, i.e. cluster sampling is less precise; but if $\rho < 0$, cluster sampling is more
precise. Though $\rho$ usually decreases with an increase in $M$, the efficiency of cluster sampling declines,
because the factor $(M-1)\rho$ greatly increases with $M$.
Thus, the importance of the intra-cluster correlation coefficient is that it helps us decide whether cluster
sampling will be more or less precise than simple random sampling of the same number of elements.
According to Hansen, Hurwitz and Madow (1953), $\rho$ is a "measure of homogeneity" of the cluster. If $M = 1$,
the cluster sampling design and the simple random sampling design are equally efficient, i.e. both processes are
equally good in this situation.
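Expression for $\rho$
Let $S_b^2$ denote the variance among cluster totals, on a single unit basis. Then, with $Y_i$ the total of the
$i$th cluster and $\bar{Y}$ the population mean per element,

\[ \sum_{i=1}^{N} (Y_i - M\bar{Y})^2 = (N-1)\,M S_b^2 . \]

We have,

\[ \sum_{i=1}^{N} (Y_i - M\bar{Y})^2 = (NM-1)\,S^2\,[1 + (M-1)\rho] , \]

so that

\[ (N-1)\,M S_b^2 = (NM-1)\,S^2\,[1 + (M-1)\rho] \]

\[ \rho = \frac{(N-1)\,M S_b^2 - (NM-1)\,S^2}{(NM-1)(M-1)\,S^2} \approx \frac{S_b^2 - S^2}{(M-1)\,S^2} \]

when terms in $1/N$ are negligible.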
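The value of the within-cluster mean square is

\[ S_w^2 = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \bar{Y}_i)^2}{N(M-1)} . \]

By the analysis of variance, with $Y_i$ the total of the $i$th cluster,

\[ (NM-1)S^2 = \sum_{i=1}^{N} \frac{(Y_i - M\bar{Y})^2}{M} + \sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \bar{Y}_i)^2
   = \frac{NM-1}{M}\, S^2\,[1 + (M-1)\rho] + N(M-1)\,S_w^2 . \]

Multiplying by $M$,

\[ M(NM-1)S^2 = (NM-1)S^2\,[1 + (M-1)\rho] + NM(M-1)\,S_w^2 \]

\[ NM(M-1)\,S_w^2 = (NM-1)S^2\,[(M-1) - (M-1)\rho] = (NM-1)S^2\,(M-1)(1-\rho) \]

\[ NM\,S_w^2 = (NM-1)S^2\,(1-\rho) \]

\[ S_w^2 = \frac{NM-1}{NM}\, S^2\,(1-\rho) \approx S^2\,(1-\rho) \]

\[ \rho \approx \frac{S^2 - S_w^2}{S^2} . \]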
Relative Efficiency of Cluster Sampling
In sampling of nM elements from the population by SRSWOR, we have,
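\[ V_{srs}(\bar{y}) = \frac{1-f}{nM}\, S^2 \]

and in cluster sampling

\[ V_{c}(\bar{y}) = \frac{1-f}{n}\, S_b^2 . \]

\[ \text{Relative efficiency, R.E.} = \frac{V_{srs}(\bar{y})}{V_{c}(\bar{y})} = \frac{1}{M}\cdot\frac{S^2}{S_b^2} . \]

This shows that the efficiency of cluster sampling increases as the mean square between cluster means, $S_b^2$,
decreases.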
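Again we have,

\[ (NM-1)S^2 = \sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \bar{Y})^2
   = \sum_{i=1}^{N}\sum_{j=1}^{M} \big[ (y_{ij} - \bar{Y}_i) + (\bar{Y}_i - \bar{Y}) \big]^2 \]

\[ = \sum_{i=1}^{N}\sum_{j=1}^{M} (y_{ij} - \bar{Y}_i)^2 + M\sum_{i=1}^{N} (\bar{Y}_i - \bar{Y})^2 \]

\[ = (M-1)\sum_{i=1}^{N} S_i^2 + M(N-1)\,S_b^2 = (M-1)\,N S_w^2 + M(N-1)\,S_b^2 , \]

the cross-product term vanishing because $\sum_j (y_{ij} - \bar{Y}_i) = 0$ in every cluster. Hence

\[ S_b^2 = \frac{(NM-1)S^2 - N(M-1)\,S_w^2}{M(N-1)} \]

\[ \text{R.E.} = \frac{1}{M}\cdot\frac{S^2}{\dfrac{(NM-1)S^2 - N(M-1)\,S_w^2}{M(N-1)}}
   = \frac{S^2}{\dfrac{(NM-1)S^2}{N-1} - \dfrac{N(M-1)\,S_w^2}{N-1}} , \]

i.e. if $S_w^2$, the mean square within clusters, increases, the efficiency of cluster sampling increases.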
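Again,

\[ S_b^2 = \frac{\sum_{i=1}^{N} (\bar{Y}_i - \bar{Y})^2}{N-1}
   = \frac{(NM-1)\,S^2\,[1 + (M-1)\rho]}{M^2(N-1)}
   \approx \frac{S^2}{M}\,[1 + (M-1)\rho] , \]

so that

\[ \operatorname{var}(\bar{y}) = \frac{1-f}{nM}\, S^2\,[1 + (M-1)\rho] . \]

Then,

\[ \text{R.E.} = \frac{\dfrac{1-f}{nM}\, S^2}{\dfrac{1-f}{nM}\, S^2\,[1 + (M-1)\rho]} = \frac{1}{1 + (M-1)\rho} . \]

In the case of complete homogeneity within clusters, $S_w^2 = 0$, so $\rho = 1$ and R.E. $= 1/M$, i.e. cluster
sampling is not efficient.
In the case of complete heterogeneity within clusters, all the variation lies within clusters
($N(M-1)S_w^2 = (NM-1)S^2$), so $S_b^2 = 0$ and $\rho = -\dfrac{1}{M-1}$, i.e. cluster sampling is very effective.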
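The large-$N$ relative efficiency $1 / [1 + (M-1)\rho]$ is easy to tabulate for different cluster sizes. The helper below is a minimal sketch; its name and the error handling are assumptions made for illustration.

```python
def relative_efficiency(M, rho):
    """R.E. of cluster sampling relative to SRS of the same number of
    elements, using the large-N form 1 / (1 + (M - 1) * rho)."""
    inflation = 1 + (M - 1) * rho
    if inflation <= 0:
        # At rho = -1/(M-1) the cluster variance is zero and R.E. is unbounded.
        raise ValueError("1 + (M - 1) * rho must be positive")
    return 1 / inflation

re_homogeneous = relative_efficiency(5, 1.0)    # complete homogeneity: 1/M
re_uncorrelated = relative_efficiency(5, 0.0)   # rho = 0: same as SRS
re_negative = relative_efficiency(4, -0.1)      # negative rho: better than SRS
```

The three calls correspond to the cases discussed above: $\rho = 1$ gives $1/M$, $\rho = 0$ gives 1, and a negative $\rho$ gives a value above 1.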
Estimate the Sampling Variance in Case of Cluster Sampling
We know,
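\[ \operatorname{var}(\bar{y}) = \frac{1-f}{nM}\, S^2\,[1 + (M-1)\rho] ,
   \quad \text{where} \quad S^2 = \frac{M(N-1)\,S_b^2 + N(M-1)\,S_w^2}{NM-1} . \]

If we select $n$ clusters from the $N$ clusters, we get the sample mean squares $s_b^2$ and $s_w^2$. In this case,
$s_b^2$ is an unbiased estimate of $S_b^2$ and $s_w^2$ is an unbiased estimate of $S_w^2$.
But the sample mean square $s^2$ is not an unbiased estimate of $S^2$, because the $nM$ sampled elements are not a
simple random sample from the population of $NM$ elements. So if we put $\hat{S}_b^2 = s_b^2$ and
$\hat{S}_w^2 = s_w^2$, then we get

\[ \hat{S}^2 = \frac{M(N-1)\,s_b^2 + N(M-1)\,s_w^2}{NM-1} , \]

which is an unbiased estimate of $S^2$. The estimated variance is

\[ \widehat{\operatorname{var}}(\bar{y}) = \frac{1-f}{nM}\, \hat{S}^2\,[1 + (M-1)\hat{\rho}] , \]

\[ \hat{\rho} = \frac{(n-1)\,M s_b^2 - n\, s_w^2}{(n-1)\,M s_b^2 + n(M-1)\, s_w^2} . \]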
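The estimation steps above can be sketched as follows. The function name, the argument layout (a list of sampled clusters plus the population count $N$) and the toy data are assumptions for illustration; the formulas are the ones given in this section.

```python
def estimate_cluster_variance(sample, N):
    """sample: list of n sampled clusters, each a list of M values.
    N: number of clusters in the population."""
    n, M = len(sample), len(sample[0])
    means = [sum(c) / M for c in sample]
    y_bar = sum(means) / n
    # Sample mean square between cluster means.
    s_b2 = sum((m - y_bar) ** 2 for m in means) / (n - 1)
    # Sample mean square within clusters.
    s_w2 = sum((v - m) ** 2 for c, m in zip(sample, means) for v in c) / (n * (M - 1))
    # Estimate of S^2 obtained by substituting s_b2 and s_w2.
    S2_hat = (M * (N - 1) * s_b2 + N * (M - 1) * s_w2) / (N * M - 1)
    # Estimate of the intra-cluster correlation coefficient.
    rho_hat = ((n - 1) * M * s_b2 - n * s_w2) / ((n - 1) * M * s_b2 + n * (M - 1) * s_w2)
    f = n / N
    var_hat = (1 - f) / (n * M) * S2_hat * (1 + (M - 1) * rho_hat)
    return y_bar, var_hat, rho_hat

y_bar, var_hat, rho_hat = estimate_cluster_variance([[1, 2], [3, 4], [5, 6]], N=10)
```

Only the sampled clusters and the population count $N$ are needed; no element-level frame for the unsampled clusters enters the calculation.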
Variance Function
It is of interest to examine how the variance $\operatorname{var}(\bar{y})$ behaves with the cluster size $M$. This
involves investigating the relation between $S_b^2$ and $M$.
By the analysis of variance, $S_b^2$ can be found if we know
the variance $S^2$ between all elements in the population and
the variance $S_w^2$ within clusters.
Our approach is to predict $S_w^2$ and $S^2$ and then to find $S_b^2$ by the analysis of variance. The sample data
produce estimates of $S^2$ and $S_w^2$. Since $S^2$ is the variance among elements, it is not affected by the size
of the unit. However, $S_w^2$ will be affected. Jessen (1942), Mahalanobis (1944) and Hendricks (1944) attempted to
develop a general law to predict how $S_w^2$ changes with the size of the unit. On the basis of several agricultural
surveys, the within-cluster variance formula is

\[ S_w^2 = A M^g ; \quad g > 0 \qquad (1) \]

where $A$ and $g$ are constants that do not depend on $M$, $g$ being a small positive quantity.
If this formula fits, $\log S_w^2$ should plot as a straight line against $\log M$. Values of $S_w^2$ for at least
two values of $M$ are needed to estimate $A$ and $g$. Regarding the whole population as a single cluster of $NM$
elements,

\[ S^2 = A (NM)^g . \]

Now $A$ and $g$ can be estimated by using the value of $M$. The two equations that lead to the estimates are

\[ \log S_w^2 = \log A + g \log M , \qquad \log S^2 = \log A + g \log (NM) . \]
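Taking logarithms of the power law $S_w^2 = A M^g$ at two cluster sizes gives two linear equations in $\log A$ and $g$, which can be solved directly. The sketch below assumes exactly two observed points; the function name and the numbers in the usage line are hypothetical.

```python
import math

def fit_power_law(M1, Sw2_1, M2, Sw2_2):
    """Solve log(Sw2) = log(A) + g * log(M) through two points (M, Sw2)."""
    g = (math.log(Sw2_2) - math.log(Sw2_1)) / (math.log(M2) - math.log(M1))
    A = Sw2_1 / M1 ** g
    return A, g

# Hypothetical check: generate two points from A = 2, g = 0.3 and recover them.
A_fit, g_fit = fit_power_law(4, 2 * 4 ** 0.3, 16, 2 * 16 ** 0.3)
```

The second point may be the whole population, taken as one cluster of size $NM$ with variance $S^2$. With more than two cluster sizes, a least-squares line through the points on the log-log scale would be the natural extension.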
Hence it is necessary to determine a balancing point by finding out the optimum cluster size and the number of
clusters in the samples which can minimize the sampling variance for a given cost or alternatively, minimizing the
cost for a fixed variance.
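\[ c = c_1 nM + c_2\sqrt{n} \]

where $c_1$ is the cost of enumerating an element, including the cost of travel between units within the cluster,
and $c_2\sqrt{n}$ is the part of the cost that depends only on the number of clusters $n$.
Assume that $c_1$ is expected to be considerably less than $c_2$. The variance of the estimator $\bar{y}$ based on a
sample of $n$ clusters is

\[ V = \frac{S_b^2}{n} , \quad \text{if the f.p.c. is ignored,} \]

\[ \text{where} \quad S_b^2 = \frac{(NM-1)S^2 - N(M-1)\,S_w^2}{M(N-1)} . \]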
It is of interest to examine how the variance $\operatorname{var}(\bar{y})$ behaves with the cluster size $M$. This
involves investigating the relationship between $S_b^2$ and $M$.
Jessen (1942), Mahalanobis (1944) and Hendricks (1944) empirically demonstrated, on the basis of several
agricultural surveys, that $S_w^2$ is related to $M$ by the relation

\[ S_w^2 = A M^g \]

where $A$ and $g$ are positive constants that are to be determined from the survey data and are independent of $M$.
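\[ S_b^2 = \frac{(NM-1)S^2 - N(M-1)\,A M^g}{M(N-1)} \]

\[ \approx \frac{NM S^2 - N(M-1)\,A M^g}{MN} ; \quad \text{for large } N \]

\[ = S^2 - (M-1)\,A M^{g-1} \]

\[ \operatorname{var}(\bar{y}) = V = \frac{S^2 - (M-1)\,A M^{g-1}}{n} . \]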
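To determine the optimum size of cluster, we find $M$, and incidentally $n$, to minimize $V$ for fixed $c$.
We minimize $\phi = c + \lambda V$, where $\lambda$ is Lagrange's multiplier, and differentiate with respect to $n$
and $M$ respectively:

\[ \phi = c_1 nM + c_2\sqrt{n} + \lambda V \]

\[ \frac{\partial \phi}{\partial n} = c_1 M + \frac{c_2}{2\sqrt{n}} + \lambda \frac{\partial V}{\partial n} = 0 . \]

Since $V = [S^2 - (M-1)\,A M^{g-1}]/n$, we have $\partial V/\partial n = -[S^2 - (M-1)\,A M^{g-1}]/n^2 = -V/n$, so

\[ c_1 M + \frac{c_2}{2\sqrt{n}} - \lambda \frac{V}{n} = 0 \]

\[ c_1 M + \frac{c_2}{2\sqrt{n}} = \lambda \frac{V}{n} \qquad (i) \]

And

\[ \frac{\partial \phi}{\partial M} = c_1 n + \lambda \frac{\partial V}{\partial M} = 0 \]

\[ c_1 n = -\lambda \frac{\partial V}{\partial M} \qquad (ii) \]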
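Dividing (ii) by (i) we get,

\[ \frac{c_1 n}{c_1 M + \dfrac{c_2}{2\sqrt{n}}} = -\frac{n}{V}\,\frac{\partial V}{\partial M} \]

\[ \frac{1}{V}\,\frac{\partial V}{\partial M} = -\frac{1}{M}\cdot\frac{1}{1 + \dfrac{c_2}{2 c_1 M \sqrt{n}}} \]

\[ \frac{M}{V}\,\frac{\partial V}{\partial M} = -\frac{1}{1 + \dfrac{c_2}{2 c_1 M \sqrt{n}}} \qquad (iii) \]

Solving the cost equation for $\sqrt{n}$:

\[ c = c_1 nM + c_2\sqrt{n} \]

\[ c\,c_1 M = c_1^2 M^2 n + c_1 c_2 M \sqrt{n} \]

\[ 4 c\,c_1 M = 4 c_1^2 M^2 n + 4 c_1 c_2 M \sqrt{n} \]

\[ 4 c\,c_1 M + c_2^2 = 4 c_1^2 M^2 n + 4 c_1 c_2 M \sqrt{n} + c_2^2 \]

\[ 4 c\,c_1 M + c_2^2 = \big( 2 c_1 M \sqrt{n} + c_2 \big)^2 \]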
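From $4 c\,c_1 M + c_2^2 = (2 c_1 M \sqrt{n} + c_2)^2$ we get

\[ \frac{2 c_1 M \sqrt{n}}{c_2} = \Big( 1 + \frac{4 c\,c_1 M}{c_2^2} \Big)^{1/2} - 1 , \]

so that

\[ 1 + \frac{c_2}{2 c_1 M \sqrt{n}}
   = 1 + \frac{1}{\Big( 1 + \dfrac{4 c\,c_1 M}{c_2^2} \Big)^{1/2} - 1}
   = \frac{\Big( 1 + \dfrac{4 c\,c_1 M}{c_2^2} \Big)^{1/2}}{\Big( 1 + \dfrac{4 c\,c_1 M}{c_2^2} \Big)^{1/2} - 1} . \]

Substituting in (iii), $\dfrac{M}{V}\dfrac{\partial V}{\partial M} = -\Big[ 1 + \dfrac{c_2}{2 c_1 M \sqrt{n}} \Big]^{-1}$, gives

\[ \frac{M}{V}\,\frac{\partial V}{\partial M} = \Big( 1 + \frac{4 c\,c_1 M}{c_2^2} \Big)^{-1/2} - 1 . \]

Now $V = \dfrac{S^2 - (M-1)\,A M^{g-1}}{n} = \dfrac{S^2 - A\,(M^g - M^{g-1})}{n}$, so

\[ \frac{\partial V}{\partial M} = -\frac{A}{n}\Big[ g M^{g-1} - (g-1) M^{g-2} \Big] \]

\[ \frac{M}{V}\,\frac{\partial V}{\partial M}
   = -\frac{A M}{nV}\Big[ g M^{g-1} - (g-1) M^{g-2} \Big]
   = -\frac{A M^{g-1}\,(gM - g + 1)}{nV}
   = -\frac{A M^{g-1}\,(gM - g + 1)}{S^2 - (M-1)\,A M^{g-1}} . \]

Equating the two expressions,

\[ \frac{A M^{g-1}\,(gM - g + 1)}{S^2 - (M-1)\,A M^{g-1}} = 1 - \Big( 1 + \frac{4 c\,c_1 M}{c_2^2} \Big)^{-1/2} \qquad (iv) \]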
The above equation can be solved by iterative method and gives the optimum value of M . Using this optimum value
of M we can find optimum value of n .
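As an alternative to iterating on the equation itself, the optimum can be approximated by a direct grid search: for each candidate $M$, find the $n$ allowed by the cost function $c = c_1 nM + c_2\sqrt{n}$ and evaluate $V = [S^2 - (M-1)AM^{g-1}]/n$. This is a sketch of that substitute technique, not the iterative solution mentioned above; all function names and input values are hypothetical.

```python
import math

def clusters_affordable(M, c, c1, c2):
    """Solve c = c1*n*M + c2*sqrt(n) for n (positive root of the quadratic in sqrt(n))."""
    root_n = (-c2 + math.sqrt(c2 ** 2 + 4 * c * c1 * M)) / (2 * c1 * M)
    return root_n ** 2

def variance(M, n, S2, A, g):
    """V = [S^2 - (M-1) A M^(g-1)] / n, the large-N form with S_w^2 = A M^g."""
    return (S2 - (M - 1) * A * M ** (g - 1)) / n

def best_cluster_size(c, c1, c2, S2, A, g, M_max=50):
    """Return (M, n, V) with the smallest V among cluster sizes 1..M_max."""
    candidates = []
    for M in range(1, M_max + 1):
        n = clusters_affordable(M, c, c1, c2)
        V = variance(M, n, S2, A, g)
        if V > 0:  # skip sizes where the variance model breaks down
            candidates.append((V, M, n))
    V, M, n = min(candidates)
    return M, n, V

# Hypothetical inputs, chosen only for illustration.
M_opt, n_opt, V_opt = best_cluster_size(c=20000, c1=1, c2=1000, S2=3.162, A=1, g=0.1)
```

The returned $n$ is real-valued and would be rounded in practice. Rerunning the search with a larger $c_1$ or a smaller $c_2$ is a quick way to see how the preferred cluster size responds to the cost structure.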
From the relation (iv), it follows that $M$ will change according to changes in $c_1$, $c_2$ and $c$ such that
$\dfrac{c_1 c M}{c_2^2}$ is nearly constant. This leads to the conclusion that the optimum size of the unit will be
smaller when $c_1$ or $c$ increases, or when $c_2$ decreases.
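Assuming that $P_i$ is the proportion of elements in the $i$th cluster that belong to a specified class, and that
$\hat{P}_c = \frac{1}{n}\sum_{i=1}^{n} P_i$ is the mean of the $n$ sampled cluster proportions, we have

\[ E(\hat{P}_c) = \frac{1}{n}\sum_{i=1}^{n} E(P_i)
   = \frac{1}{n}\cdot n \sum_{i=1}^{N} \frac{P_i}{N}
   = \frac{1}{N}\sum_{i=1}^{N} P_i = P , \]

\[ \operatorname{var}(\hat{P}_c) = \frac{1-f}{n}\, S_b^2 \]

where $S_b^2$ is the variance between cluster proportions and is given by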
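\[ S_b^2 = \frac{\sum_{i=1}^{N} (P_i - P)^2}{N-1}
   = \frac{N}{N-1}\Big[ PQ - \sum_{i=1}^{N} \frac{P_i Q_i}{N} \Big]
   \approx PQ - \sum_{i=1}^{N} \frac{P_i Q_i}{N} \quad \text{for large } N , \]

where $Q = 1 - P$ and $Q_i = 1 - P_i$, so that

\[ \operatorname{var}(\hat{P}_c) \approx \frac{1-f}{n}\Big[ PQ - \sum_{i=1}^{N} \frac{P_i Q_i}{N} \Big] . \]

For large $N$ and $M$ the total variance can be written as

\[ S^2 = S_b^2 + S_w^2 = PQ \]

and the within variance $S_w^2$ is given by

\[ S_w^2 = \sum_{i=1}^{N} \frac{P_i Q_i}{N} . \]

The intra-cluster correlation coefficient is

\[ \rho = 1 - \frac{M \sum_{i=1}^{N} P_i Q_i}{(M-1)\,N P Q} . \]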
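Therefore, the sampling variance in terms of the intra-cluster correlation coefficient can be expressed as

\[ \operatorname{var}(\hat{P}_c) = \frac{1-f}{nM}\cdot\frac{NPQ}{N-1}\,[1 + (M-1)\rho] . \]

If a SRS of $nM$ elements could be taken, the variance of the sample proportion $\hat{P}$ would be given by

\[ \operatorname{var}(\hat{P}) = \frac{1-f}{nM}\cdot\frac{NM\,PQ}{NM-1} . \]

\[ \text{R.E.} = \frac{\operatorname{var}(\hat{P})}{\operatorname{var}(\hat{P}_c)}
   = \frac{N-1}{NM-1}\cdot\frac{NPQ}{NPQ - \sum_{i=1}^{N} P_i Q_i} . \]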
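An estimator of $\operatorname{var}(\hat{P}_c)$ is given by

\[ \widehat{\operatorname{var}}(\hat{P}_c) = (1-f)\,\frac{s_b^2}{n}
   = \frac{1-f}{n(n-1)}\sum_{i=1}^{n} (P_i - \hat{P}_c)^2 . \]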
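The estimator of the variance of a cluster-sample proportion can be sketched as below. The function name and the example proportions are assumptions for illustration; the formula is $(1-f)\,s_b^2/n$ with $s_b^2$ computed from the $n$ sampled cluster proportions.

```python
def cluster_proportion_estimate(cluster_proportions, N):
    """cluster_proportions: proportions P_i observed in the n sampled clusters.
    N: number of clusters in the population."""
    n = len(cluster_proportions)
    p_hat = sum(cluster_proportions) / n
    f = n / N
    # Sample variance between the cluster proportions.
    s_b2 = sum((p - p_hat) ** 2 for p in cluster_proportions) / (n - 1)
    var_hat = (1 - f) * s_b2 / n
    return p_hat, var_hat

# Hypothetical sample of three clusters out of thirty.
p_hat, var_hat = cluster_proportion_estimate([0.2, 0.4, 0.6], N=30)
```

Because every element of a sampled cluster is enumerated, each $P_i$ is known exactly, and only the variation between clusters contributes to the estimated variance.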
An estimator for the variance of the total number of units belonging to a specified class can be obtained by