Study Notes MPT
Statistics refers to numerically stated facts, such as the population of a country. It can also be
described as the science of collecting, tabulating, analysing and interpreting data. In data
analysis, statistics is classified into descriptive statistics and inferential statistics.
Descriptive Statistics – used to describe the basic features of the data in a study. They provide
simple summaries about the sample and the measures, including frequency counts, percentages,
ranges (min-max), mean, median, mode, SD etc.
Inferential Statistics – used to draw conclusions about a population by examining a sample.
The accuracy of the inference depends on how representative the sample is of the population.
Inferential statistics help researchers to test hypotheses, answer research questions and derive
meaning from the results. Researchers set the significance level for each statistical test they
conduct and, using probability theory as the basis for the test, assess how likely it is that a
difference as large as the one observed would arise by chance alone (p-value).
Biostatistics
Statistical methods applied in medicine, biology and public health are termed
biostatistics. The term is used when the tools of statistics are applied to data derived from
biological sciences such as medicine; in short, it is the application of statistical methods to
the solution of biological problems. Biostatistics is known by many names such as
medical statistics, health statistics and vital statistics.
Medical statistics : Statistics related to clinical and laboratory parameters, their relationship, efficacy
of drug, diagnostic analysis etc.
Health statistics : Statistics related to health of people in a community, epidemiology of diseases,
association of occurrence of various diseases with socioeconomic and demographic variables,
control and prevention of diseases etc.
Vital statistics : Statistics related to vital events in life such as births, deaths, marriages, morbidity
etc. These terms are overlapping and not exclusive of each other.
Uses of Biostatistics
Statistical methods are widely used in almost all fields. Most of the basic as well as
advanced statistical methods are applied in fields such as medicine, biology, public health etc.
Statistical methods are useful in planning and conducting meaningful and valid research
studies on medical, health and biological problems in the population, for the prevention of diseases,
for finding effective and appropriate treatment modalities etc. In general, statistical methods are needed for:
> Collection of medical and health data scientifically
> Summarizing the collected data to make it comprehensible
> Generalizing the result from the sample to the entire population with scientific validity
> Drawing conclusions from the summarized data and generalized results.
Example :
> To determine the normal limits of various laboratory and clinical parameters such as BP, pulse
rate, Cholesterol level, Blood sugar level etc.
> To find difference between means and proportions of two different groups or places or periods.
> To find correlation between variables such as cholesterol and BMI, exercise and obesity etc.
> To find action of a drug or to compare between two drugs.
> To find relative potency of a new drug with respect to a standard drug.
> To find the efficacy of a line of treatment or to compare the efficacies of two different lines of
treatment.
> In community medicine and public health, to compare the prevalence of deaths among
vaccinated and unvaccinated people in a community etc.
Demerits
1. It cannot be located graphically.
2. Affected by extreme values. For example, if there are three terms 4, 7, 10, the mean is 7.
If we add a new term 95, the new mean is (4+7+10+95)/4 = 116/4 = 29. This is a big
change compared to the size of the first three terms.
3. Qualitative characteristics such as cleverness, gender etc. cannot give a mean, as the data
can't be expressed numerically.
Median: Merits
1. It can be easily calculated; and can be easily understood.
2. It can be located graphically.
3. Median is rigidly defined as in the case of mean.
4. As it is a positional measure, it is not affected by extreme values. For example, if there are
five terms 4, 5, 7, 8, 10, the median is 7. If we add a new term 95, the new median
is 7.5. Adding the new value 95 did not affect the median much.
5. It can also be used for data for which a mean cannot be calculated, as in the case of intelligence etc.
Demerits
1. Because even very large extreme values barely affect it, the median sometimes fails to
remain representative of the series.
2. It is affected much more by fluctuations of sampling than A.M.
3. Median cannot be used for further algebraic treatment.
Mode: Merits
1. Unlike the mean or median, which may be values not present in the series, the mode
is always a value present among the observations.
2. It is not affected by extreme values hence is a good representative of the series.
3. It can be found graphically also.
4. It can effectively be used in case of qualitative phenomenon.
Demerits
1. Mode cannot be uniquely determined if the series is bimodal or multimodal.
2. Mode is most affected by fluctuation of sampling.
3. Mode is not so rigidly defined.
4. It is not capable of further algebraic treatment.
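The effect of an extreme value on the three measures can be checked with a short sketch. It uses Python's standard statistics module and the same numbers as the examples above; the blood group list is a made-up illustration of the mode on qualitative data.

```python
from statistics import mean, median, mode

# Mean: three terms, then the same terms plus the extreme value 95
mean_before = mean([4, 7, 10])
mean_after = mean([4, 7, 10, 95])

# Median: five terms, then the same terms plus the extreme value 95
median_before = median([4, 5, 7, 8, 10])
median_after = median([4, 5, 7, 8, 10, 95])

# Mode works on qualitative data too (hypothetical blood groups)
blood_groups = ["O", "A", "O", "B", "O", "AB"]
most_common = mode(blood_groups)

print("mean:", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
print("mode:", most_common)
```

The mean jumps from 7 to 29, while the median only moves from 7 to 7.5.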
CORRELATION AND REGRESSION ANALYSIS
Correlation
It is a statistical measure which shows the relationship between two or more variables
moving in the same direction or in opposite direction. With correlation, two or more variables may be
compared to determine if there is a relationship and to measure the strength of that relationship.
The correlation coefficient gives the strength of relationship between the variables.
The correlation may be positive, negative or zero. The first role of correlation is to
determine the strength of the relationship between the two variables represented on the x-axis and
y-axis. The measure of this magnitude is called the correlation coefficient. The data required to
compute this coefficient are two continuous measurements (x, y) obtained on the same entity.
If there is a perfect relationship, a straight line can be drawn through all the data points. The
greater the change in y for a constant change in x, the steeper the slope of the line. In a less than
perfect relationship between two variables, the closer the data points are located on a straight line,
the stronger the relationship and greater the correlation coefficient. In contrast, a zero correlation
would indicate absolutely no linear relationship between the two variables.
Positive Correlation
One variable increases with increase of the other or decreases with decrease of the other.
Eg: Body temperature and pulse.
Negative Correlation
One variable increases with decrease of the other or decreases with increase of the other.
Eg: Insulin and blood sugar.
Zero Correlation
There is no relation between the variables.
The Coefficient of Correlation
A measure of the strength of the linear relationship between two variables, defined as
the covariance of the variables divided by the product of their standard deviations.
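As a minimal sketch of this definition, the coefficient can be computed as the covariance divided by the product of the two standard deviations. The function name pearson_r and the temperature and pulse readings are hypothetical, chosen only to illustrate a positive correlation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient: covariance / (SD of x * SD of y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (sd_x * sd_y)

# Hypothetical paired readings taken on the same five patients
temperature = [36.5, 37.0, 37.5, 38.0, 38.5]
pulse = [70, 76, 80, 88, 92]

r = pearson_r(temperature, pulse)
print(round(r, 3))  # close to +1: strong positive correlation
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little or no linear relationship.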
In regression analysis, researchers control the values of at least one of the variables and
assign objects at random to different levels of these variables. Where correlation simply described
the strength and direction of the relationship, regression analysis provides a method for describing
the nature of the relationship between two or more continuous variables. Correlation coefficient can
support the interpretation associated with regression. If a linear relationship is established, the
magnitude of the effect of the independent variable can be used to predict the corresponding
magnitude of the effect on the dependent variable.
Regression analysis is a form of predictive modeling technique which investigates the
relationship between a dependent (response) and independent variable(s) (predictor). This
technique is used for forecasting, time series modeling and finding the causal effect relationship
between the variables.
Regression analysis is a statistical method to estimate or predict the values of one variable
(dependent variable) for the given values of independent variable.
> Dependent variable is to be estimated or predicted (response)
> Independent variable is the given variable (predictor)
Example: weight of a baby depends on age.
So age is the independent variable whereas weight is dependent variable.
Types
Simple Linear Regression (1 response – 1 predictor)
Multiple Regression (1 response – Many predictors)
Logistic Regression (categorical response: nominal / ordinal – any predictors)
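The regression line equations can be calculated from the means, the standard deviations and the
correlation coefficient of the two variables.
Regression equation of x on y : $x - \bar{x} = b_{xy}\,(y - \bar{y})$, where $b_{xy} = r\,\sigma_x / \sigma_y$
Regression equation of y on x : $y - \bar{y} = b_{yx}\,(x - \bar{x})$, where $b_{yx} = r\,\sigma_y / \sigma_x$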
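A minimal sketch of the two regression lines, assuming the usual least-squares forms: y on x has slope b_yx = Sxy / Sxx and x on y has slope b_xy = Sxy / Syy, and both lines pass through the point of means. The function name and the age and weight figures are hypothetical.

```python
def regression_lines(xs, ys):
    """Return the means and the two least-squares regression coefficients."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    b_yx = s_xy / s_xx  # slope of y on x (predict y from x)
    b_xy = s_xy / s_yy  # slope of x on y (predict x from y)
    return mean_x, mean_y, b_yx, b_xy

# Hypothetical data: age of a baby (months) and weight (kg)
age_months = [1, 2, 3, 4, 5]
weight_kg = [4.0, 5.0, 5.5, 6.5, 7.0]

mean_x, mean_y, b_yx, b_xy = regression_lines(age_months, weight_kg)

# Predict weight (dependent) at 6 months (independent) from the line of y on x
predicted_weight = mean_y + b_yx * (6 - mean_x)
print(round(b_yx, 2), round(predicted_weight, 2))
```

The product of the two coefficients equals the square of the correlation coefficient, so it can never exceed 1.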
A statistical package is a suite of computer programs that are specialised for statistical
analysis. It enables people to obtain the results of standard statistical procedures and statistical
significance tests, without requiring low-level numerical programming. Most statistical packages also
provide facilities for data management.
(1) SPSS (Statistical Package for Social Sciences):
SPSS is a computer program used for statistical analysis. SPSS (Statistical Package for the
Social Sciences) was released in its first version in 1968 after being developed by Norman H. Nie
and C. Hadlai Hull. It is used by market researchers, health researchers, survey companies,
government, education researchers, marketing organizations and others.
The many features of SPSS are accessible via pull-down menus or can be programmed with
command syntax language. Additionally, some complex applications can only be programmed in
syntax and are not accessible through the menu structure.
SPSS has two views : Data View and Variable View. In the Variable View, the variables
such as age, sex, stress score, knowledge score etc. are declared. In the Data View, we enter the
data collected from the samples. SPSS data files have a 2-dimensional table structure where the
rows represent cases (individuals) and the columns represent variables or measurements (such as
age, sex or household income).
Rather than typing all of your data directly into the Data Editor, you can read data from
applications such as Microsoft Excel. On opening an Excel file, the Opening Excel Data Source dialog
box is displayed, allowing you to specify whether variable names are included in the spreadsheet, as
well as the cells that you want to import. If you want to import only a portion of the spreadsheet,
specify the range of cells to be imported in the Range text box.
Statistical analyses such as the paired t-test, unpaired t-test, Chi-square test, test of normality,
ANOVA etc. can be carried out very easily in the software. For two-group studies, the data can be
split by group and analysed separately for each group.
Applications:
1. Calculation is very fast and no paper work is required for computing.
2. Different significance tests can be performed with one time entry of data.
3. Graphical presentation of data can also be done with the data.
4. Different comparisons and associations can be made at a time.
5. Documentation of the whole data analysis is easy.
Epi Info uses three distinct modules to accomplish these tasks: Form Designer, Enter, and
Analysis. Other modules include the Dashboard module, a mapping module, and various utilities
such as StatCalc. Within the analysis module, analytic routines include t-tests, ANOVA,
nonparametric statistics, cross tabulations and stratification with estimates of odds ratios, risk ratios,
and risk differences, logistic regression, survival analysis, and analysis of complex survey data.
1. The Abstract:
The abstract is an overview of the research study and is typically two to four paragraphs in
length. Think of it as an executive summary that condenses the key elements of the remaining
sections into a few sentences.
2. Introduction:
The introduction provides the key question that the researcher is attempting to answer and a
review of any literature that is relevant. In addition, the researcher will provide a rationale for why
the research is important and will present a hypothesis that attempts to answer the key question.
The introduction should summarize the state of the key question following the completion of the
research.
3. Methodology:
The methodology section of the research report is the most important component of the
thesis. It allows readers to evaluate the quality of the research and it provides the details by which
another researcher may replicate and validate the findings. The information in the methodology
section is arranged in chronological order with the most important information at the top of each
section.
5. Discussion:
The discussion section is where the results of the study are interpreted and evaluated
against the existing body of research literature. In addition, should there be any anomalies found in
the results, this is where the authors will point them out. The discussion section will attempt to
connect the results to the bigger picture and show how the results might be applied.
6. References:
This section provides a list of each author and paper cited in the research report. Any fact,
idea, or direct quotation used in the report should be cited and referenced.
Semi-logarithmic plots:
In a semi-logarithmic graph, one axis has a logarithmic scale and the other axis has a linear
scale. It is a way of visualizing data that are related according to an exponential relationship. This
kind of plotting method is useful when one of the variables being plotted covers a large range of
values and the other has only a restricted range – the advantage being that it can bring out features
in the data that would not easily be seen if both variables had been plotted linearly.
Eg: A semi-log graph defined by a logarithmic scale on the y-axis and a linear scale on the
x-axis. Plotted lines are: y = 10^x, y = x, y = log(x).
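The reason a semi-log plot suits exponential data can be sketched numerically: taking the logarithm of y turns y = 10^x into a straight line with equal steps. The variable names below are illustrative only.

```python
from math import log10

xs = [0, 1, 2, 3, 4]
ys = [10 ** x for x in xs]        # exponential growth: 1, 10, 100, ...

# On a logarithmic y-axis the plotted height is log10(y)
log_ys = [log10(y) for y in ys]

# Equal steps in x give equal steps in log10(y), i.e. a straight line
steps = [b - a for a, b in zip(log_ys, log_ys[1:])]
print(ys)
print(steps)
```

On a linear y-axis the same values span 1 to 10,000, so the early points would be squeezed against the x-axis.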
Incidence
Incidence in epidemiology is a measure of the probability of occurrence of a given medical
condition (disease) in a population within a specified period of time. Although it is sometimes
expressed simply as the number of new cases during some time period, it is better expressed as a
proportion or a rate.
Incidence proportion (also known as cumulative incidence) is the number of new cases within a
specified time period divided by the size of the population initially at risk. For example, if a
population initially contains 1,000 non-diseased persons and 28 develop a condition over one year
of observation, the incidence proportion is 28 cases per 1,000 persons per year, i.e. 2.8% in a year.
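The arithmetic of this example can be written as a small sketch; the function name incidence_proportion is hypothetical.

```python
def incidence_proportion(new_cases, population_at_risk):
    """New cases in the period divided by the population initially at risk."""
    return new_cases / population_at_risk

ip = incidence_proportion(28, 1000)
per_1000 = ip * 1000   # cases per 1,000 persons per year
percent = ip * 100     # percentage in a year
print(round(per_1000, 1), "per 1,000 =", round(percent, 1), "%")
```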
Prevalence
It is the proportion of a particular population found to be affected by a disease or a risk
factor. It is calculated by comparing the number of people found to have the condition with the total
number of people studied, and is usually expressed as a fraction, as a percentage, or as the number of cases
per 1000 people. Point prevalence is the proportion of a population that has the condition at a
specific point in time. Period prevalence is the proportion of a population that has the condition
during a given period (e.g., 12 month prevalence), and includes people who already have the
condition at the start of the study period as well as those who acquire it during that period.
For example, consider a disease that takes a long time to cure and was widespread in 2002
but dissipated in 2003. This disease will have both high incidence and high prevalence in 2002, but
in 2003 it will have a low incidence yet will continue to have a high prevalence (because it takes a
long time to cure, so the fraction of individuals that are affected remains high). In contrast, a disease
that has a short duration may have a low prevalence and a high incidence.
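The difference between point and period prevalence can be sketched with made-up counts. All figures below are hypothetical, and the population is assumed to stay the same size over the year.

```python
def prevalence(cases, population):
    """Proportion of the population that has the condition."""
    return cases / population

population = 5000            # hypothetical community
existing_at_start = 150      # already have the condition on 1 January
new_during_year = 50         # acquire the condition during the year

# Point prevalence: those with the condition at one point in time
point_prev = prevalence(existing_at_start, population)

# Period prevalence: existing cases plus those acquired during the period
period_prev = prevalence(existing_at_start + new_during_year, population)

print(round(point_prev * 1000), "per 1000 at the start of the year")
print(round(period_prev * 1000), "per 1000 over the 12 month period")
```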
Randomization
Randomization means that the intervention group into which a subject is placed is
determined randomly. The goal of randomization is to minimize selection bias and to increase the
likelihood that comparison groups are similar in all variables except the intervention of interest. The
larger the clinical trial, the lower the risk that treatment groups will be different from each other.
There are several strategies which can be used to randomize subjects.
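One such strategy can be sketched as follows: shuffle the subject list with a random number generator, then deal subjects alternately into the groups so that group sizes stay equal. This is only an illustration, not a validated trial procedure; the function name, group labels and seed are assumptions.

```python
import random

def randomize(subject_ids, groups=("intervention", "control"), seed=None):
    """Shuffle subjects, then deal them into groups in turn (equal sizes)."""
    rng = random.Random(seed)      # a seed makes the list reproducible
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    allocation = {}
    for position, subject in enumerate(shuffled):
        allocation[subject] = groups[position % len(groups)]
    return allocation

# Allocate 20 hypothetical subjects numbered 1 to 20
allocation = randomize(range(1, 21), seed=42)
for subject in sorted(allocation):
    print(subject, allocation[subject])
```

Because chance alone decides the order, the investigator cannot choose which subject goes into which group.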
Study Blinding
Blinding in a study means that either subjects (single-blind) or subjects and investigators
(double-blind) are not aware of the treatment into which subjects are randomized. Blinding
decreases the likelihood that study outcome will be influenced by knowing the intervention to which
a subject is randomized. Not every protocol lends itself to blinding. Subjects can be “unblinded” if
there are any adverse events that occur during the trial.
Research methodology
When we talk of research methodology we not only talk of the research methods but also
consider the logic behind the methods we use in the context of our research study and explain why
we are using a particular method or technique and why we are not using others so that research
results are capable of being evaluated either by the researcher himself or by others.
Research Designs
A research design is the set of methods and procedures used in collecting and analysing
measures of the variables specified in the research problem. The design of a study defines
the study type (descriptive, correlation, semi-experimental, experimental, review, meta-analytic),
research problem, hypotheses, independent and dependent variables, experimental design, and if
applicable, data collection methods and a statistical analysis plan. Research design is the
framework that has been created to find answers to research questions. Two main categories of
research designs are
Observational studies
Experimental / Interventional studies
1. Observational studies
In observational studies, the investigators stand apart from events taking place in the study
– simply observe and record. Observational studies provide a mechanism for clarifying the
epidemiology and risk factors/exposures that place populations at risk for disease. Observational
studies assist in identifying clinical outcomes that might aid in predicting future disease. This type
of study allows researchers to draw conclusions about responses to variables. Observational
studies can be:
a) Descriptive
b) Analytical
a) Descriptive studies:
This is the first phase of an epidemiological study. Observations are done on various
aspects of illness among a population like time of occurrence, symptoms, duration of illness,
place of residence and individual characteristics (age, education, occupation etc.). This will lead
to formulation of hypothesis. Census and clinical records are important sources of information in
this type of study. Case reports, case series, cross-sectional and ecologic studies come under
this category.
Eg : A case of a 40 year old woman developing pulmonary embolism within 5 weeks of oral
contraceptive usage.
(ii) Case Series:
The individual case reports can be expanded to a case series. Case series are studies
that describe in detail a homogenous group of subjects who share common characteristics
and/or have received similar interventions.
Eg : A study to find out the proportion of obese children in high school classes in a city.
If the data are analysed to demonstrate differences either between exposed and non-
exposed groups or between those with and without outcome with respect to exposure, it is
analytical cross sectional study.
b) Analytical studies:
Analytical studies are done in order to find out if an outcome is related to exposure. These
studies are used to test hypothesis concerning the relationship between a suspected risk factor
and an outcome and its statistical significance.
Cohort studies may be retrospective (looking back in time) or prospective (looking forward in
time).
[Figure: cohort study design. Population, then sample at risk, divided into those exposed to the risk
factor (develop disease / do not develop disease) and those not exposed to the risk factor (develop
disease / do not develop disease).]
Example: Employees in a factory are enrolled in the study from a particular period in the
past, from company records, and outcome measured in the present.
Here the researcher first selects a research question, then frames a hypothesis about the
potential outcome of an exposure. The researcher then observes a group of people (cohort)
over a period of time (often several years), collecting data that may be relevant to the disease.
The study begins in the present and continues into the future.
[Figure: cohort study design. Reference population, then sample, divided into those exposed to the
risk factor (develop disease / do not develop disease) and those not exposed to the risk factor
(develop disease / do not develop disease).]
Example: A study to determine the natural history of cardiovascular disease in a group of
children. The children are followed up from birth till they develop/do not develop the disease.
The study is a combination of prospective and retrospective. Here the exposure has
occurred and the outcome may or may not have occurred. The follow up for outcome
estimation is started in the past, and will continue in the present and to the future. Some of the
outcome might have been measured and the search is continuing for the newer outcome
during further follow-up.
The specific hypothesis under investigation must be clearly stated before the case
control study. Case control studies start with an outcome. Individuals with a particular
disease (case) from a given population are selected and individuals without disease (control)
are also selected from same population. The same diagnostic criteria should be followed to
assign as a case or as a control.
> Analysis
Both cases and controls should stem from the same source population. The controls should
have a similar risk of becoming a case if exposed (with potential to develop the disease). Both
cases and controls should meet the same inclusion criteria and should be similar in all aspects
except in outcome status.
We can analyse the data by using Odds Ratio. Odds Ratio is a measure of association
between an exposure and an outcome. (explain Odds Ratio)
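As a minimal sketch, the odds ratio from a case control study is the odds of exposure among cases divided by the odds of exposure among controls, which simplifies to the cross-product of the 2 x 2 table. The function name and the counts are hypothetical.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure in cases divided by odds of exposure in controls."""
    odds_in_cases = exposed_cases / unexposed_cases
    odds_in_controls = exposed_controls / unexposed_controls
    return odds_in_cases / odds_in_controls

# Hypothetical 2 x 2 table
#              Cases   Controls
# Exposed        40       20
# Unexposed      60       80
or_value = odds_ratio(40, 60, 20, 80)
print(round(or_value, 2))
```

An odds ratio above 1 suggests the exposure is associated with higher odds of the outcome, a value of 1 suggests no association, and a value below 1 suggests lower odds.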
2. Experimental studies
Experimental study designs are the primary method for testing the effectiveness of new
therapies and other interventions, including innovative drugs. An experimental study may be
controlled or non-controlled, but for a definitive answer we need a control group that does not get
the intervention under study. An experimental study is designed to compare the benefits of an
intervention, such as a new drug therapy or prevention program, with standard treatments or no
treatment.
Eg : The pharmaceutical industry has adopted experimental methods and other research designs to
develop and screen new compounds and test drugs for therapeutic benefits.
a) Pre-experimental designs
b) True experimental designs
c) Quasi experimental designs
a) Pre-experimental designs:
These studies follow basic experimental steps but fail to include a control group. A
single group is often studied but no comparison between an equivalent non-treatment group
is made.
b) True-experimental designs:
This is a type of design used to establish cause and effect relationships. There are three
criteria that must be met in order for an experiment to be determined as a true experiment.
This type of design has two randomly assigned groups: an experimental group and a
control group. Neither group is pre-tested before the implementation of the treatment. The
treatment is applied to experimental group and the post-test is carried out on both groups to
assess the effect of the treatment. This type of design is common when it is not possible to
pre-test the subjects.
The subjects are randomly assigned to either the experimental or the control group.
Both groups are pre-tested on the dependent variable. The experimental group receives
the treatment and both groups are post-tested to examine the effects of manipulating the
independent variable on the dependent variable.
[Figure: Experimental group: Pre-test, then Intervention, then Post-test.]
Eg: Testing of two different protocols for burn wounds with the frequency of care being
administered in 2, 4 and 6 hours.
            Frequency of care
Protocols   2 hrs   4 hrs   6 hrs
P1          A       B       C
P2          D       E       F
(iv) Randomized Block design:
The design is used when there are inherent differences between the subjects and
possible differences in experimental conditions. If there is a large number of experimental
groups, the randomized block design may be used to bring some homogeneity to each
group. Using this design, subjects are assigned to blocks in order to reduce variability within
each treatment level. The observations or subjects within each block are more homogenous
than subjects in different blocks.
Eg: Assume the age of the subjects influences the study results and the researcher
wants to include all possible age groups with each of the possible treatment levels. The
subjects are divided into groups based on age. Then one subject from each age group is
randomly selected to receive each treatment. In this randomized block design, each age
group represents one block and there is only one observation per cell.
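The allocation in this example can be sketched as below: within each age block the treatments are shuffled and handed out so that every block receives every treatment exactly once. The block labels, subject codes and treatment names are hypothetical.

```python
import random

def randomized_block(blocks, treatments, seed=None):
    """Within each block, assign each treatment to one randomly chosen subject."""
    rng = random.Random(seed)
    plan = {}
    for block_name, subjects in blocks.items():
        if len(subjects) != len(treatments):
            raise ValueError("each block needs one subject per treatment")
        order = list(treatments)
        rng.shuffle(order)                 # random order inside the block only
        plan[block_name] = dict(zip(subjects, order))
    return plan

# Hypothetical age blocks with three subjects each
blocks = {
    "20-39 yrs": ["S1", "S2", "S3"],
    "40-59 yrs": ["S4", "S5", "S6"],
    "60+ yrs": ["S7", "S8", "S9"],
}
treatments = ("T1", "T2", "T3")

plan = randomized_block(blocks, treatments, seed=1)
for block_name, assignment in plan.items():
    print(block_name, assignment)
```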
[Figure: Samples receive Treatment 1 and Treatment 2 over a set time period.]
(vi) Randomized Controlled Trials (RCT):
Probability in Medicine
In medical practice, diagnosis and prognosis always carry varying degrees of
uncertainty and at best can be stated as probable in a particular case. When a practicing clinician
reads that some new treatment is superior to the conventional one, he will assess the evidence
critically, and at best he will conclude that probably it is true. The clinician’s belief in a particular
diagnosis in an individual patient may be based on the recorded experience in a group of patients,
but it is still a subjective probability. It reflects not only the observed frequency of the disease in a
reference group but also the clinician’s theoretical knowledge which determines the choice of
reference group. Recorded experience is never the sole basis of clinical decision making.
Probability is increasingly used in modern science and in medicine, in scientific proof claims.
In medicine, probability is now used both in experimental data analysis and in survey data
analysis; for example, a survey may report that exposure A carries a 10% risk of causing illness B.
Probability is our preferred means of expressing uncertainty. In this framework, probability
(p) expresses a physician’s opinion about the likelihood of an event as a number between 0 and 1.
An event that is certain to occur has a probability of 1; an event that is certain not to occur has a
probability of 0.
Consider a hypothetical survey used for probability testing, where the unknown facts that the
study seeks to discover are:
o Disease A is actually caused by using too much of product C or the weaker product D.
o Product C is more expensive than product D.
o Product C is used more by middle-class vegetarians.
o Product D is used more by working-class smokers.
o Product C sells in fewer locations than product D, and some locations sell neither.
Illness rates will vary between sub-populations, such that it may be reported that 'cigar-
smoking carries a 20% risk for this illness' and 'egg-eating carries a 15% risk for this illness'.
For instance, a patient is observed to have a certain symptom, and Bayes' formula can be
used to compute the probability that a diagnosis is correct, given that observation. In simple words,
suppose a doctor is interested in whether a person has cancer, and knows the person's age. If
cancer is related to age, then, using Bayes' theorem, the person's age can be used to assess more
accurately the probability that the patient has cancer.
Example :
A clinician suspects that one of his patients has cancer ‘C’. He requests a suitable test to
confirm the diagnosis, and suppose the test is positive for cancer ‘C’. He now wishes to assess the
probability of the diagnosis being correct on the basis of this information, but perhaps the medical
literature only provides the information that a positive test is seen in 70% of the patients with cancer
‘C’. However, it is also positive in 2% of patients without cancer ‘C’. (S indicates signs & symptoms.)
The frequency-based probabilities which the doctor found here may be written in statistical notation
as follows:
P(S/C+) = 0.70, i.e., the probability of the presence of this particular sign given this particular
cancer is 70%.
P(S/C–) = 0.02, i.e., the probability of this particular sign given the absence of this particular
cancer is 2%.
In clinical practice, we need P(C/S), i.e., the probability of the cancer in a particular patient
given this positive sign. This can be estimated by means of Bayes' theorem. We must also know
the prior probability of the presence and the absence of the disease. Let P(C+) = probability of
having cancer 'C' = 25% = 0.25, and hence P(C–) = probability of not having cancer 'C' = 1 - P(C+) =
75% = 0.75.
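$$P(C/S) = \frac{P(S/C+)\,P(C+)}{P(S/C+)\,P(C+) + P(S/C-)\,P(C-)}$$
$$= \frac{0.70 \times 0.25}{(0.70 \times 0.25) + (0.02 \times 0.75)} = \frac{0.175}{0.19}$$
= 0.9211
= 92.11 % chance that the patient is having cancer 'C' given the condition
that he is having the particular sign S (diagnosed S as positive).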
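The same calculation can be checked with a short sketch of Bayes' theorem; the function and parameter names are hypothetical, and the three inputs are the figures from the example.

```python
def posterior_probability(p_sign_given_disease, p_sign_given_no_disease, prior):
    """Bayes' theorem: probability of disease given a positive sign."""
    true_positive = p_sign_given_disease * prior
    false_positive = p_sign_given_no_disease * (1 - prior)
    return true_positive / (true_positive + false_positive)

p_cancer_given_sign = posterior_probability(
    p_sign_given_disease=0.70,     # P(S/C+)
    p_sign_given_no_disease=0.02,  # P(S/C-)
    prior=0.25,                    # P(C+)
)
print(round(p_cancer_given_sign, 4))  # 0.9211
```

Changing the prior shows how strongly the answer depends on how common the disease is in the reference group.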
PROBABILITY DISTRIBUTIONS
Example :
The probability that a person suffering from migraine will obtain relief with a particular drug is
0.9. Three randomly selected sufferers from migraine are given the drug. Find the probability that all
the three will get relief ?
p = probability of success = 0.9, q = 1 - p = 0.1
$$p(x = 3) = {}^{3}C_{3}\, p^{3}\, q^{3-3} = 1 \times (0.9)^{3} \times (0.1)^{0} = 0.729$$
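The binomial probability in this example can be verified with a small sketch using math.comb from the standard library; the function name binomial_pmf is hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# Three migraine sufferers, probability of relief 0.9 each
all_three = binomial_pmf(3, 3, 0.9)
print(round(all_three, 3))  # 0.729

# The probabilities of 0, 1, 2 and 3 successes add up to 1
total = sum(binomial_pmf(x, 3, 0.9) for x in range(4))
```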
3. Poisson Distribution
This can be considered as a limiting case of Binomial distribution under the following
conditions,
The number of trials is very large
The probability of success (p) is very small.
The mean np is finite and np = m.
The variable 'x' is said to follow Poisson distribution if its probability mass function is
$$p(x) = \frac{e^{-m}\, m^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$
Example :
The probability of a bad reaction from a certain injection is 0.001. Determine the probability
that, out of 2000 individuals, more than two will get a bad reaction.
p = 0.001 ; n = 2000 ; m = np = 0.001 x 2000 = 2
$$p(x > 2) = 1 - [p(0) + p(1) + p(2)] = 1 - e^{-2}\left[1 + 2 + \frac{2^{2}}{2!}\right] = 1 - e^{-2}\,[1 + 2 + 2]$$
= 1 - 0.1353 x 5
= 1 - 0.6765
= 0.3235
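A short sketch of the Poisson calculation, with m = np = 2. It computes both the probability of exactly two reactions and the probability of more than two; the working 1 - 0.1353 x 5 corresponds to the "more than two" case. The function name poisson_pmf is hypothetical.

```python
from math import exp, factorial

def poisson_pmf(x, m):
    """Probability of exactly x events when the mean number of events is m."""
    return exp(-m) * m ** x / factorial(x)

m = 0.001 * 2000   # mean number of bad reactions, np = 2

exactly_two = poisson_pmf(2, m)
more_than_two = 1 - sum(poisson_pmf(k, m) for k in range(3))

print(round(exactly_two, 4))
print(round(more_than_two, 4))
```

Without intermediate rounding the second value is about 0.3233; the hand calculation gives 0.3235 because e^-2 was rounded to 0.1353 first.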
Population:
A research population is generally a large collection of individuals or objects that is the main
focus of interest of the researcher. It is for the benefit of the population that researches are done.
Due to the large sizes of populations, researchers often cannot test every individual in the
population because it is too expensive and time-consuming. This is the reason why researchers rely
on samples.
Sample:
A sample is simply a subset of the population. The sample must be representative of the
population from which it was drawn and it must be large enough for further statistical analysis. The
main function of the sample is to allow the researchers to conduct the study on individuals from the
population so that the results of their study can be used to derive conclusions that will apply to the
entire population. The population “gives” the sample, and then it “takes” conclusions from the results
obtained from the sample.
(i) Type I error, α - error (Level of significance) : the cut-off level at which we say a p-value is
significant. It is the probability of concluding that there is a statistically significant difference
when in truth there is none. Typically 5%.
(ii) Type II error (β - error) and Power (1- β): Type II error is the probability of failing to detect a
difference that truly exists. Power, symbolized as 1- β, is the ability of a statistical test to show a
significant difference if it truly exists. In hypothesis testing, it is important to
have a sizable sample to allow statistical tests to show significant differences where they exist.
Power is typically set at 80% or 90%.
(iii) Effect Size: It is the difference the researcher expects to see. What has been seen
previously from reviews. What is a clinically important difference?
(iv) Standard deviation of population: It is the standard deviation of the outcome variable, in
most of the cases obtained from previous studies.
1. Estimating the sample size for a descriptive study based on a proportion
To calculate the sample size based on the sample required to estimate a proportion, the
following formula is used:
$$n \geq \frac{z^{2}\,p\,q}{m^{2}}$$
n is the required sample size, z is the normal distribution value corresponds to 95% limits
(1.96) or 99% limits (2.58), p = proportion of population having that characteristic, which can be
known from previous studies or other sources, q = 1 – p (or 100 – p if p and q are expressed in
percentages), m is the allowable error.
Example : A study on anaemic children in schools. Proportion of anaemic children in a similar study
is found to be 30%. Find the minimum sample size required at a confidence limit of 95% and
accepting an error of 10% of the population.
$$n \geq \frac{1.96^{2} \times 30 \times 70}{10^{2}} = 80.67 \approx 81 \text{ or more samples}$$
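The sample size for a proportion can be sketched as below, assuming the formula n ≥ z²pq/m² with p, q and m expressed in percentages, which reproduces the figure of 80.67 in this example. The function name is hypothetical.

```python
from math import ceil

def sample_size_proportion(p, m, z=1.96):
    """Minimum sample size to estimate a proportion; p and m in percent."""
    q = 100 - p
    return z ** 2 * p * q / m ** 2

n = sample_size_proportion(p=30, m=10)   # 30% anaemic, 10% allowable error
print(round(n, 2))   # 80.67
print(ceil(n))       # 81, rounded up to whole subjects
```

Using z = 2.58 instead of 1.96 gives the requirement for 99% confidence limits.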
2. Estimating the sample size for difference in means (comparison of two means)
$$n \geq \frac{2\sigma^{2}\,(Z_{\beta} + Z_{\alpha/2})^{2}}{\text{difference}^{2}}$$
n is the required sample size; Zα/2 is the normal distribution value for the desired level of
significance (type I error): 1.96 for 95% limits or 2.58 for 99% limits; Zβ is the value for the desired
power, typically 0.84 for 80% power; σ is the standard deviation of the outcome variable, obtained
from previous studies; difference is the effect size, i.e. the difference in means.
Example : An interventional study on anaemic children in schools to improve their Hb level. From a
previous study the mean difference is found to be 1.25 and the standard deviation is 2, calculate the
required sample size to determine the significant difference with 5% level of significance and 80%
power.
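$$n \geq \frac{2\sigma^{2}\,(Z_{\beta} + Z_{\alpha/2})^{2}}{\text{difference}^{2}} = \frac{2 \times 2^{2} \times (0.84 + 1.96)^{2}}{1.25^{2}} = 40.14$$
≈ 41 or more samples (rounded up, since n must be at least 40.14)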
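A minimal sketch of the same calculation, using the formula above with Zα/2 = 1.96 and Zβ = 0.84 as defaults. The function name is hypothetical; rounding up to the next whole subject is the usual convention and gives 41.

```python
from math import ceil

def sample_size_two_means(sd, difference, z_alpha=1.96, z_beta=0.84):
    """Sample size to detect a given difference between two means."""
    return 2 * sd ** 2 * (z_beta + z_alpha) ** 2 / difference ** 2

n = sample_size_two_means(sd=2, difference=1.25)
print(round(n, 2))   # 40.14
print(ceil(n))       # 41
```

A smaller expected difference or a larger standard deviation increases the required sample size.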