Sampling: Lecture Notes
SAMPLING
Session Objectives:
Upon completion of the lecture sessions, each student should be able to:
1. Explain the importance of sampling in veterinary epidemiological research;
2. Distinguish between probability and nonprobability sampling;
3. Understand the factors to consider when determining sample size;
4. Understand the steps in developing a sampling plan.
Disease Surveys
Two types of cross-sectional study are commonly performed.
1. Censuses- In this kind of study, the investigator includes every unit in the target
population. This is feasible if the population is small. Some claim this is the most accurate
way of conducting a survey. In reality, however, most target populations are too large and
too expensive to study in full. In addition, the number of units can be too large to manage
and measure accurately.
2. Sample surveys or Surveys- A survey examines only a small part (sample) of the target
population.
Advantages of Sampling
1. SPEED. A smaller team can be trained and mobilized to collect sample data in a shorter
period of time. Sampling is faster!
2. COST. Since the study needs only a few paid individuals to study a small segment
of the population for a limited period of time, sampling is cheaper than a
census that covers the whole population.
3. QUALITY. Sampling allows a more thorough investigation of each selected unit than would
be possible if the whole population had to be examined.
Definitions
Population                             Sample
Infinite or finite size                Finite size
Characterized by unknown parameters    Characterized by measurable statistics
                                       (e.g., mean, standard deviation)
2. Sampling- is the process of selecting a small number of units from a larger defined target
group of units such that the information generated from the small group will allow inferences to
be made about the larger group.
3. Target population: This is any complete, or the theoretically specified aggregation of study
elements. It is usually the ideal population or universe to which research results are to be
generalized. For example, all buffaloes in the Philippines.
4. Study population- The study population is the population from which the sample is actually
drawn and to which the results of the study can be directly inferred. For example, all buffaloes
in the Philippines except those in remote mountains and islands. The study population depends
upon the research question.
5. Sampling unit (Basic sampling unit, BSU)- the units which are chosen in selecting the sample,
for example animals, herds, or villages.
6. Sampling frame- A list of sampling units from which units to be sampled can be selected. In
most situations, it is difficult to get an accurate list. Sample frame error occurs when certain
elements of the population are accidentally omitted or not included on the list.
7. Sampling scheme- Method used to select sampling units from the sampling frame.
9. Inference is the process of assuming that the disease status of the population is similar to the
disease status of the sample.
10. Sampling error- the difference between the value of the parameter being investigated and the
estimates of this value based on the different samples. For example, the difference between the
sample mean and the population mean.
11. Confidence level- a statement of how often you could expect to find similar results if the
survey were to be repeated, or the degree of certainty of obtaining the same results. It also
indicates how often the findings would be expected to fall outside the margin of error.
12. Confidence interval is a range in which we are fairly certain that the population value lies.
14. Statistic- the summary description of a given variable in a sample. Examples: sample mean,
sample variance.
Characteristics of a good sample
1. REPRESENTATIVE. Taken at random so that every member of the population of data
has an equal chance of selection. Unbiased by the sampling procedure or equipment. The
sample possesses the characteristics of the target population.
2. ADEQUATE. Large enough to give sufficient precision;
3. OBTAINABLE. The sample can be collected or measured according to the sampling
design.
4. AFFORDABLE. The individual or organization doing the survey can collect the data at
the least possible cost.
A. Probability sampling- Every unit in the population has a known probability of being
selected. The rules and procedures for selecting the sample and estimating the parameters
are clearly defined.
B. Non-probability sampling- The probability of a unit being selected is unknown.
Comparison of probability and non-probability sampling
Types of Probability Sampling Methods
1. Simple random sampling
2. Systematic sampling
3. Stratified sampling
4. Cluster sampling
5. Multistage sampling
1. Simple random sampling
Procedure
1: Number all units
2: Randomly draw units
How to generate random numbers
Random numbers can be obtained using your calculator, a computer program for random
number generation, a spreadsheet, printed tables of random numbers, or by the more
traditional methods of drawing slips of paper from a hat, tossing coins or rolling dice.
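As a minimal sketch of drawing a simple random sample with a computer program, the snippet below uses Python's standard random module. The sampling frame (ear-tag numbers 1 to 100), the sample size of 20, and the function name are hypothetical and chosen only for illustration.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Draw n units without replacement; every unit has the same chance of selection."""
    rng = random.Random(seed)      # a fixed seed makes the draw reproducible
    return rng.sample(units, n)

# Hypothetical sampling frame: 100 animals numbered 1..100
frame = list(range(1, 101))
sample = simple_random_sample(frame, 20, seed=42)
print(sorted(sample))
```

Recording the seed alongside the survey protocol lets someone else reproduce exactly the same selection later.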
Advantages
Simple
Sampling error easily measured
Disadvantages
Need complete list of units
Does not always achieve best representativity
2. Systematic sampling
Description- Systematic random sampling is a method of probability sampling
in which the defined target population is ordered and the sample is selected
according to position using a skip or sampling interval.
Procedure
1: Obtain a list of units that contains an acceptable frame of the target population (N)
2: Determine the number of units in the list and the desired sample size (n)
3: Compute the skip interval (sampling interval calculated as k = N/n)
4: Draw a random number between 1 and k as the starting point
5: Beginning at the start point, select the units by choosing each unit that corresponds to
the skip interval
Advantages
Applicable to situations when no sampling frame is available.
Ensures representativity across list
Applicable to situations when the sampling units are too numerous to number for
purposes of simple random sampling.
Easy to implement
Disadvantage
Dangerous if list has cycles
Example 2
Target Population size= N= 100
Desired Sample size= n= 20
Skip interval= N/n= 100/20= 5
Choose a random number from 1 to 5 as the starting point; let's assume the number 3
Start with unit number 3 and take every 5th unit thereafter (3, 8, 13, ..., 98)
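The procedure and Example 2 can be sketched in a few lines of Python. The function name and the numbered frame are hypothetical. The sketch assumes N is a whole multiple of n, so that the skip interval k = N/n is an integer.

```python
import random

def systematic_sample(units, n, start=None, seed=None):
    """Select every k-th unit after a random start between 1 and k (1-based positions)."""
    k = len(units) // n                      # skip interval k = N / n
    if start is None:
        start = random.Random(seed).randint(1, k)
    return [units[pos - 1] for pos in range(start, start + k * n, k)]

frame = list(range(1, 101))                  # N = 100 numbered units
example = systematic_sample(frame, 20, start=3)
print(example)                               # begins 3, 8, 13 and ends at 98
```

Leaving start unset lets the function draw the starting point at random, which is what the procedure requires in a real survey.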
3. Stratified sampling
Description- Stratified random sampling is a method of probability sampling in which the
population is divided into different subgroups (strata) and samples are selected randomly
from each stratum.
A major objective of stratified sampling is to increase precision without increasing
cost.
The strata should be mutually exclusive and collectively exhaustive in that every
population element should be assigned to one and only one stratum and no population
elements should be omitted.
Elements are selected from each stratum by a random procedure, usually simple
random sample (SRS).
The elements within a stratum should be as homogeneous as possible, but the
elements in different strata should be as heterogeneous as possible.
Procedure
1. Classify population into homogeneous subgroups (strata)
2. Draw random samples from each stratum
3. Combine results of all strata into a single sample of the target population
Advantages
More precise if the variable of interest is associated with the strata
All subgroups represented, allowing separate conclusions about each of them
Disadvantages
Sampling error difficult to measure
Loss of precision if small numbers sampled in individual strata
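The stratified procedure can be sketched as follows. This is only an illustration. It assumes proportional allocation (the same sampling fraction in every stratum), which is one of several possible allocation rules, and it uses a hypothetical frame of 80 backyard and 20 commercial animals.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Simple random sample of the same fraction within every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:                       # step 1: classify units into strata
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):              # step 2: draw randomly within each stratum
        members = strata[name]
        n_h = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n_h))
    return sample                            # step 3: combined sample

# Hypothetical frame of (animal_id, farm_type) pairs
frame = [(i, "backyard") for i in range(1, 81)] + \
        [(i, "commercial") for i in range(81, 101)]
sample = stratified_sample(frame, lambda u: u[1], 0.2, seed=7)
print(len(sample))
```

With very small strata, a fixed minimum number per stratum may be preferable to a fixed fraction, because small within-stratum samples lose precision.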
4. Cluster sampling
Description- Cluster sampling is a method of probability sampling in which the
population is divided into a large number of groups, called clusters. Then a random
sample of clusters is selected, based on a probability sampling technique such as SRS.
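In one-stage cluster sampling, every element found in each selected cluster is included in the
study; in two-stage cluster sampling, only a sample of the elements in each selected cluster is
included.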
Advantages
Simple: No list of units required
Less travel/resources required
Disadvantages
Imprecise if units within clusters are homogeneous (large design effect)
Sampling error difficult to measure
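A minimal sketch of one-stage cluster sampling is shown below. The villages and animal identifiers are hypothetical. Only a list of clusters is needed to make the selection, not a list of individual animals.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select clusters and keep every unit found in the selected clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical population: 10 villages with 5 animals each
clusters = {
    f"village_{v}": [f"v{v}_animal_{a}" for a in range(1, 6)]
    for v in range(1, 11)
}
picked = one_stage_cluster_sample(clusters, 3, seed=3)
for village, animals in picked.items():
    print(village, animals)
```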
5. Multistage sampling
Description- Multi-stage sampling is a method of probability sampling wherein sampling
a population is undertaken in different stages, with the sample unit being different at each
stage.
Procedure:
1. FIRST-STAGE. The population is first divided into a set of primary or first-stage
sampling units. For example, the researcher divided the Philippines into 15
regions. From this sampling frame, he randomly selected six (6) regions.
2. SECOND STAGE. Each of the selected units from the first-stage sampling is
further subdivided into secondary or second-stage sampling units. For example,
the researcher divided each of the selected six (6) regions into existing provinces.
From this sampling frame of provinces, he randomly selected two (2) provinces
per region.
3. ADDITIONAL STAGES. The procedure is repeated until the desired stage is
reached. The third stage may involve listing the commercial swine farms in each
province and a random sample of say 3 farms per province is selected. Once the
farm units have been selected, it may prove possible to construct a sample frame
of the animals within the units and sample these in turn, say 30 pigs per farm (this
procedure constitutes the fourth stage).
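The four stages described above can be sketched as nested random draws. The numbers selected at each stage (6 regions, 2 provinces, 3 farms, 30 pigs) follow the example. The sizes of the frames (5 provinces per region, 10 farms per province, 200 pigs per farm) and all names are assumptions made only so the sketch can run.

```python
import random

def multistage_sample(seed=None):
    """Regions -> provinces -> farms -> pigs, with a new frame built at each stage."""
    rng = random.Random(seed)
    regions = [f"R{r}" for r in range(1, 16)]                     # first-stage frame: 15 regions
    pigs_selected = []
    for region in rng.sample(regions, 6):                         # stage 1: 6 regions
        provinces = [f"{region}-P{p}" for p in range(1, 6)]       # assumed 5 provinces per region
        for province in rng.sample(provinces, 2):                 # stage 2: 2 provinces per region
            farms = [f"{province}-F{f}" for f in range(1, 11)]    # assumed 10 farms per province
            for farm in rng.sample(farms, 3):                     # stage 3: 3 farms per province
                pigs = [f"{farm}-pig{i}" for i in range(1, 201)]  # assumed 200 pigs per farm
                pigs_selected.extend(rng.sample(pigs, 30))        # stage 4: 30 pigs per farm
    return pigs_selected

pigs = multistage_sample(seed=11)
print(len(pigs))   # 6 x 2 x 3 x 30 pigs
```

Frames for provinces, farms, and pigs are built only inside the units selected at the previous stage, which is why no complete listing of the whole population is required.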
Advantages
No complete listing of population required
Most feasible approach for large populations
The complete sample frame is needed only at the first-stage sampling.
The reduction in places to visit for data collection makes this sampling design cheaper.
Disadvantages
Several sampling lists
Sampling error difficult to measure
For example, in a hypothetical population in which precisely 50.0% of goats less than 6 months
old have anemia, a very well-done survey of 300 young goats shows that 135 (45.0%) have
anemia.
Two explanations:
1. Bias - Something is wrong with the way the sampling was done or the measurements
taken.
2. Sampling error - Just by chance, even in the perfect survey, a sample selected randomly
from a population will almost never be exactly the same as the entire population.
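To get a feel for how much of the gap in the goat example chance alone could explain, the sketch below computes the exact binomial probability of finding 135 or fewer anemic goats in a random sample of 300 when the true prevalence is 50%. It assumes simple random sampling from a large population; the function name is hypothetical.

```python
from math import comb

def prob_at_most(k, n, p):
    """Binomial probability of observing k or fewer positives among n sampled animals."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p_low = prob_at_most(135, 300, 0.5)
print(f"P(135 or fewer anemic goats out of 300) = {p_low:.3f}")
```

A small probability means such a result is unusual but still possible by chance. It does not by itself distinguish sampling error from bias.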
Bias
Bias is the difference between survey result and population value due to:
Incorrect measurements, resulting in measurement bias
Selection of a non-representative sample, resulting in sampling bias
When your survey records a different result, consider the following questions:
Did you perform the measurements correctly?
Did you sample from the right animals?
Sampling error
Sampling error is the difference between survey result and population value due to the random
selection of animals or farms to include in the sample. Sampling error is the error that occurs just
because of chance (some call it bad luck). No sample is a perfect mirror image of the
population.
Unlike bias, sampling error can be predicted, calculated, and accounted for. There are several
measures of sampling error:
Confidence intervals
Standard error
Coefficient of variation
P values
Others
Confidence Interval
A confidence interval gives an estimated range of values which is likely to include an unknown
population parameter, the estimated range being calculated from a given set of sample data.
The width of the confidence interval gives us some idea about how uncertain we are about the
unknown parameter (see precision). A very wide interval may indicate that more data should be
collected before anything very definite can be said about the parameter.
Confidence Limits
Confidence limits are the lower and upper boundaries / values of a confidence interval, that is,
the values which define the range of a confidence interval.
Confidence Level
The confidence level is the probability value (1 - α) associated with a confidence interval. It is
often expressed as a percentage. For example, if α = 0.05 = 5%, then the confidence level is
equal to (1 - 0.05) = 0.95, i.e. a 95% confidence level.
Confidence Level is the likelihood - expressed as a percentage - that the results of a test are real
and repeatable, and not just random. The idea is based on the concept of the "normal distribution
curve," which shows that variation in almost any data (such as the heights of all Landrace
breeding boars, or the amount of rainfall in June) tends to be clustered around an average value,
with relatively few individual measurements at the extremes.
In surveys, the most common measurement of sampling error is the 95% confidence interval.
If you repeat the same survey many times and measure the same indicator with the same
methodology and same sample size, about 95% of these surveys will produce confidence
intervals that contain the true value of this indicator in the population.
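The repeated-survey interpretation can be checked with a small simulation. The sketch below is illustrative only. It assumes a true prevalence of 50%, surveys of 300 animals, and the simple normal-approximation (Wald) interval with Z = 1.96; none of these choices come from a specific survey.

```python
import random
from math import sqrt

def wald_ci(positives, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = positives / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage(true_p, n, surveys, seed=0):
    """Fraction of simulated surveys whose interval contains the true prevalence."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(surveys):
        positives = sum(rng.random() < true_p for _ in range(n))
        low, high = wald_ci(positives, n)
        hits += low <= true_p <= high
    return hits / surveys

print(coverage(true_p=0.5, n=300, surveys=2000))
```

The printed fraction should be close to 0.95, although it will not be exactly 0.95 because the number of simulated surveys is finite and the interval is an approximation.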
The drawing below is another way of visualizing confidence intervals. It imagines that a
single survey is a dart which produces a single estimate of some health outcome, for example,
the prevalence of having a safe water supply. If the sampling error is large because the sample
size of the survey was small, the dart might have a large circle of uncertainty. We may be 95%
sure that the true population value is somewhere in the circle, but if the circle is large, this survey
result may not be very useful. If the sampling error is small because the sample size was large,
the circle of uncertainty may be much smaller, as shown on the right. Now if we are 95% sure that
the true population value is within this small circle, the survey result may be very useful.
Sample Size - The larger your sample, the more sure you can be that the results truly
reflect the population. This indicates that for a given confidence level, the larger your
sample size, the smaller your confidence interval. However, the relationship is not linear:
the width of the interval shrinks with the square root of the sample size, so doubling the
sample size does not halve the confidence interval (quadrupling it does).
Percentage - Your accuracy also depends on the percentage of your sample that picks a
particular answer. If 99% of your sample said "Yes" and 1% said "No" the chances of
error are remote, irrespective of sample size. However, if the percentages are 51% and
49% the chances of error are much greater. It is easier to be sure of extreme answers than
of middle-of-the-road ones. When determining the sample size needed for a given level
of accuracy you must use the worst case percentage (50%). You should also use this
percentage if you want to determine a general level of accuracy for a sample you already
have. To determine the confidence interval for a specific answer your sample has given,
you can use the percentage picking that answer and get a smaller interval.
Population Size - How many animals are there in the group your samples represent? This
may be the number of broiler chickens in a province you are studying, the number of pigs
vaccinated against hog cholera, etc. Often you may not know the exact population size. This
is not a problem. The mathematics of probability shows that the size of the population is
irrelevant unless the size of the sample exceeds a few percent of the total population you
are examining. This means that a sample of 500 people is equally useful in examining the
opinions of a state of 15,000,000 as it would be for a city of 100,000. For this reason, sample
size calculators usually ignore the population size when it is "large" or unknown. Population
size is only likely to be a factor when you work with a relatively small and known group.
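The effect of sample size and percentage on the width of the confidence interval can be illustrated with the margin of error e = Z * sqrt(p(1 - p) / n). The sketch below assumes simple random sampling from a large population and Z = 1.96 (95% confidence); the function name is hypothetical.

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * sqrt(p * (1 - p) / n)

# Sample size: quadrupling n halves the margin, doubling n does not
for n in (100, 200, 400):
    print(n, round(margin_of_error(0.5, n), 3))

# Percentage: 50% is the worst case, extreme percentages give narrower intervals
for p in (0.5, 0.2, 0.01):
    print(p, round(margin_of_error(p, 300), 3))
```

Population size does not appear in the formula at all, which is the sense in which it is irrelevant for large populations.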
Narrow CI = precise
Wide CI = imprecise
Accuracy defined
The degree to which a measurement, or an estimate based on measurements, represents
the true value of the attribute that is being measured (Last, A Dictionary of
Epidemiology, 1988).
Accuracy vs. precision
A measurement (or in our case, the estimate from a survey) is precise if it obtains similar
results with repeated measurement (or repeated surveys).
A measurement is accurate if it is close to the truth with repeated measurement (or
repeated surveys).
A faulty measurement may be expressed precisely but may not be accurate.
Measurements should be both accurate and precise, but the two terms are not
synonymous (Last, A Dictionary of Epidemiology).
Precision of the estimate of Prevalence
Accurate and precise
Sample Size Estimation
More complicated studies (e.g. those involving multiple regression models, survival analysis, or
longitudinal data with repeated measurements) may require specialized software and additional
biostatistical input to calculate sample size.
To estimate prevalence of disease with a sample from a large population (theoretically infinite)
The formula below assumes random sampling and that the sample size is small relative to the
population size (in practice, this is true when the sample size is less than about 10% of the
population size).
Formula:
$$ n = \frac{Z^2 \, p(1 - p)}{e^2} $$
where,
Z is the critical value obtained from a standard normal distribution. For each level of
confidence there is a corresponding value of Z: 1.645 for 90%, 1.96 for 95%, and 2.576 for 99%.
e is the margin of error (e.g., 0.1 = 10%, and 0.05 = 5%); same as desired accuracy or
absolute precision
p is the estimated value for the proportion of a sample that have the condition of interest
(e.g., .50 for 50%). Theoretically this is based on the assumption that the test that
estimates this proportion is perfectly sensitive and specific, but the calculation can
also assume the proportion estimated is the apparent (test-based) prevalence
The expected prevalence can be based on previous investigations. If there are none, a pilot study
may be undertaken.
Example:
Calculate the sample size needed to study a disease with an expected prevalence of 20%. Assume
a level of confidence of 95% and a desired absolute precision of 5%.
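As a worked sketch of this example, the calculation n = Z^2 p(1 - p) / e^2 can be carried out directly. The function name is hypothetical, and Z = 1.96 is assumed for 95% confidence.

```python
from math import ceil

def sample_size_prevalence(p, e, z=1.96):
    """Sample size to estimate a prevalence p within +/- e in a large population."""
    return z**2 * p * (1 - p) / e**2

n = sample_size_prevalence(p=0.20, e=0.05)
print(round(n, 2), "->", ceil(n))   # 245.86 -> 246 animals
```

The result is rounded up because a fraction of an animal cannot be sampled.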
Notes:
To estimate prevalence of disease with
a sample from a small (finite)
population
When population sizes are less than 10 times the estimated sample size, it is prudent to use the
Finite Population Correction (FPC) to calculate a corrected sample size.
Formula:
Assume a small ruminant farm with 800 sheep and an expected prevalence of 20% for caseous
lymphadenitis. How many samples are needed to give an estimate of prevalence within 10% of
the true value with 95% confidence?
Expected      Desired accuracy (absolute precision)
prevalence    0.1     0.05    0.01    0.001
0.2           61      246     6146    614633
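The correction formula itself is not reproduced in these notes. As a hedged sketch, the code below assumes one widely used form of the finite population correction, n_adj = n / (1 + (n - 1) / N), which may differ from the version intended in the lecture. It is applied to the 800-sheep example, starting from the uncorrected sample size of about 61.

```python
from math import ceil

def sample_size_prevalence(p, e, z=1.96):
    """Uncorrected sample size for estimating a prevalence (large population)."""
    return z**2 * p * (1 - p) / e**2

def fpc_adjust(n, N):
    """Assumed form of the finite population correction: n / (1 + (n - 1) / N)."""
    return n / (1 + (n - 1) / N)

n0 = sample_size_prevalence(p=0.20, e=0.10)   # about 61.5 before correction
n_adj = fpc_adjust(n0, N=800)
print(round(n0, 2), round(n_adj, 2), ceil(n_adj))
```

With this assumed formula the corrected sample size for the farm is slightly smaller than the uncorrected one, and the correction becomes larger as the sample becomes a bigger share of the population.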
To Detect the presence of a disease
(source: Hawkins, C. [Link] Field Survey Tables)
Percentage of diseased animals in the population (d/N), OR percentage sampled and found clean (n/N)
Population size (N) 50% 40% 30% 25% 20% 10% 5% 2% 1% 0.5% 0.1%
10 4 5 6 7 7 10 10 10 10 10 10
20 4 5 7 8 10 15 19 20 20 20 20
30 5 6 8 9 11 19 26 30 30 30 30
40 5 6 8 10 12 21 31 39 40 40 40
50 5 6 8 10 12 22 35 48 50 50 50
60 5 6 8 10 12 23 37 55 60 60 60
70 5 6 8 10 13 24 40 62 69 70 70
80 5 6 8 10 13 24 42 68 78 80 80
90 5 6 9 10 13 25 43 73 87 90 90
100 5 6 9 10 13 25 44 77 95 100 100
120 5 6 9 10 13 26 46 85 110 119 120
140 5 6 9 11 13 26 48 92 123 138 140
160 5 6 9 11 13 26 49 97 135 156 160
180 5 6 9 11 13 27 50 101 146 174 180
200 5 6 9 11 13 27 51 105 155 190 200
250 5 6 9 11 14 27 52 112 174 227 250
300 5 6 9 11 14 28 53 117 189 259 300
350 5 6 9 11 14 28 54 121 201 287 350
400 5 6 9 11 14 28 55 124 210 310 400
450 5 6 9 11 14 28 55 127 218 331 450
500 5 6 9 11 14 28 56 129 225 349 499
600 5 6 9 11 14 28 56 132 235 379 596
700 5 6 9 11 14 28 56 134 243 402 690
800 5 6 9 11 14 28 57 136 249 421 781
900 5 6 9 11 14 28 57 137 254 437 868
1000 5 6 9 11 14 29 57 138 258 450 950
1200 5 6 9 11 14 29 57 140 264 471 1101
1400 5 6 9 11 14 29 58 141 269 487 1235
1600 5 6 9 11 14 29 58 142 272 499 1354
1800 5 6 9 11 14 29 58 143 275 509 1459
2000 5 6 9 11 14 29 58 143 277 517 1553
3000 5 6 9 11 14 29 58 145 284 542 1894
4000 5 6 9 11 14 29 58 146 288 556 2108
5000 5 6 9 11 14 29 59 147 290 564 2253
6000 5 6 9 11 14 29 59 147 291 569 2358
7000 5 6 9 11 14 29 59 147 292 573 2436
8000 5 6 9 11 14 29 59 147 293 576 2498
9000 5 6 9 11 14 29 59 148 294 579 2547
10000 5 6 9 11 14 29 59 148 294 581 2588
"Infinite" 5 6 9 11 14 29 59 149 299 598 2994
Notes:
1. The previous table gives the sample size (n) required to be 95% certain of including at
least one positive if the disease is present at the specified level.
Example: if the expected percentage of positives is 20% and the population size is
878 (use 900), the required sample size to be 95% certain of detecting at least one
positive is 14.
2. The table can also be used to determine the upper limit to the number (d) of diseased
animals in a population given that the specified proportion were tested and found to be
negative.
Example: if a 10% sample taken from a population of 2000 was found to be entirely
negative, the upper 95% confidence limit for the number of positives is 29.
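The logic behind the table can be sketched with an exact calculation. The code below is an illustration, not the method used to build the table. It assumes a perfect test and sampling without replacement (hypergeometric probabilities), and it searches for the smallest n for which the chance of missing all d positives is 5% or less. Values can differ slightly from the table, which may have been produced with an approximation or different rounding.

```python
from math import ceil, comb, log

def prob_all_negative(N, d, n):
    """Probability that a random sample of n from N contains none of the d positives."""
    return comb(N - d, n) / comb(N, n)

def detection_sample_size(N, d, confidence=0.95):
    """Smallest n giving at least the stated chance of including one or more positives."""
    for n in range(1, N + 1):
        if prob_all_negative(N, d, n) <= 1 - confidence:
            return n
    return N

def detection_sample_size_infinite(prevalence, confidence=0.95):
    """Large-population version: solve (1 - prevalence)^n <= 1 - confidence for n."""
    return ceil(log(1 - confidence) / log(1 - prevalence))

print(detection_sample_size(N=900, d=180))     # 20% of 900 animals positive
print(detection_sample_size_infinite(0.20))
```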
To estimate the mean of a continuous
variable
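Formula:
$$ n = \frac{Z^2 \sigma^2}{e^2} $$
where Z is the critical value for the chosen level of confidence, σ is the standard deviation
of the variable, and e is the acceptable error.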
Example
Determine the number of piglets required for an experiment to measure the mean increase in
weight over a 30-day feeding trial using a new diet. Assume the following conditions: the
standard deviation of the group is 100 g, the acceptable error is 50 g, and the confidence
level is 95% (Z = 1.96, so Z^2 is approximately 3.84).
$$ n = \frac{3.84 \times (100)^2}{(50)^2} = \frac{38{,}400}{2{,}500} = 15.36, \text{ rounded up to } 16 $$
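The same calculation can be sketched in code. The function name is hypothetical. The sketch uses Z = 1.96 rather than the rounded Z^2 = 3.84, so the intermediate value differs slightly in the third decimal place while the rounded-up answer is the same.

```python
from math import ceil

def sample_size_mean(sd, e, z=1.96):
    """Sample size to estimate a mean within +/- e, given standard deviation sd."""
    return (z * sd / e) ** 2

n = sample_size_mean(sd=100, e=50)   # both in grams
print(round(n, 2), "->", ceil(n))    # 15.37 -> 16 piglets
```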