Unit 2: Biostatistics
By- Mrs. Mitali M. Bora
Biostatistics: Basics and Definitions
• Biostatistics is the application of statistical methods and techniques to analyse and interpret data related to
living organisms and health sciences.
• Biostatistics helps researchers and healthcare professionals make informed decisions, draw meaningful
conclusions, and identify patterns and trends in biological and health-related data.
• Biostatistics is the application of statistical principles to questions and problems in medicine, public health,
and biology.
Cont.…
• Biostatistics is a broad field that encompasses a wide range of topics, including:
• Descriptive statistics: Summarizing and describing data, such as calculating the mean, median and mode.
• Inferential statistics: Drawing conclusions about a population based on a sample, such as using a hypothesis
test to determine whether a new drug is effective.
• Research design: Planning and conducting studies in a way that will produce reliable and valid results.
• Data analysis: Using statistical methods to analyse data and interpret the results.
• Public health: Using statistics to track and understand diseases, evaluate the effectiveness of public health
interventions, and develop policies to improve the health of the population.
Applications
• Biostatistics is a field that applies statistics to biological and medical research. It is used to design studies,
collect data, analyse data, and interpret results. Biostatistics is essential for conducting rigorous and reliable
scientific research.
Applications of Biostatistics by research field:
• Clinical trials: Designing clinical trials, determining sample size, randomization procedures and data collection methods, and analysing data to determine the safety and efficacy of new drugs and treatments.
• Epidemiology: Studying the distribution and determinants of diseases and other health conditions in populations, identifying risk factors for disease, evaluating the effectiveness of public health interventions, and tracking the spread of diseases.
• Genetics: Studying the genetic factors that contribute to diseases and other health conditions, identifying genes that are associated with diseases, understanding the mechanisms of disease, and developing new diagnostic tests and treatments.
• Public health: Evaluating the effectiveness of public health interventions, tracking the spread of diseases, identifying populations at risk, and developing policies to improve access to healthcare.
• The various applications:
• Biostatistics is used to design clinical trials, which are studies that test the safety and efficacy of new drugs
and treatments.
• Biostatistics is used to study the distribution and determinants of diseases and other health conditions in populations.
Biostatistics: Sample size
• Sample size in biostatistics is the number of participants or observations included in a study. It is an important
factor to consider when designing a study, as it can affect the power and precision of the study.
• The sample size is the number of individuals, subjects or data points to be included in a study, and it directly
impacts the statistical power and validity of the research findings.
• In general, a larger sample size is required to achieve higher power, detect a smaller effect size, and reduce the margin of error.
Cont..
• The choice of a specific formula for calculating the required sample size in a study depends on the research design, the type of data, and the specific hypothesis being tested. Different formulas are used for various statistical tests and scenarios, and they typically take into account the desired power, the expected effect size, the variability of the outcome, and the acceptable margin of error. One common case is sketched below.
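• As a minimal illustration (assuming Python with SciPy; the function name and example numbers are ours, not from this unit), the normal-approximation formula for the per-group sample size when comparing two independent means can be computed as follows:

```python
import math
from scipy.stats import norm

def sample_size_two_means(sigma, delta, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two independent means:
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)           # power = 1 - beta
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)                # round up to a whole participant

# Example: detect a 5 mmHg difference in mean blood pressure,
# assuming a standard deviation of 10 mmHg, at 80% power:
print(sample_size_two_means(sigma=10, delta=5))  # ~63 per group
```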
Importance of sample size
• The sample size in biostatistics is of paramount importance for several reasons, and it plays a critical role in the validity and reliability of research findings. Specific benefits of sample size calculation include:
1. Increased power: A study with a larger sample size will have more power to detect true effects. This means that
the study is more likely to find a difference between the groups being compared, if a difference actually exists.
2. Improved precision: A study with a larger sample size will produce more precise estimates of population
parameters. This means that the results of the study are more likely to be reliable.
3. Enhanced generalizability: A study with a larger sample size is more likely to be representative of the population
as a whole. This means that the results of the study are more likely to be applicable to other populations.
4. Reduced costs and time: Sample size calculations can help researchers avoid conducting studies that are underpowered or oversampled. This can save researchers time and money.
Factors influencing sample size
• There are a number of factors that can affect the required sample size for a study, including:
• The desired power of the study
• The expected effect size
• The variability of the outcome variable
• The desired margin of error
Sample size determination is a complex process that involves considering multiple factors. Sample size is an
important consideration in biostatistics because it can affect the power, precision and generalizability of a study.
• Desired power: The probability of detecting a true effect if one exists.
• Expected effect size: The magnitude of the difference between the two groups being compared.
• Variability of the outcome variable: The amount of variation in the outcome variable within each group.
• Desired margin of error: The maximum amount of error that is acceptable in the study results.
• Type of study design: Some study designs, such as randomized controlled trials, require larger sample sizes than other study designs, such as observational studies.
• Number of groups being compared: Studies that compare more than two groups require larger sample sizes than studies that compare two groups.
• Presence of confounding variables: Confounding variables can affect the outcome of a study but are not of interest to the researcher. Studies with confounding variables may require larger sample sizes to account for the effects of these variables.
• Availability of resources: The availability of resources, such as time and money, can also influence the sample size; studies with limited resources may need to use smaller sample sizes.
Dropouts
• Dropouts, in the context of biostatistics and clinical research, are study participants who, for various reasons, discontinue their participation in a clinical trial or study before its completion. Dropouts can occur for a variety of reasons, including:
• Lack of time or interest: Participants may drop out of a study if they find that they do not have enough time to participate or if they lose interest in the study.
• Adverse effects: Participants may drop out of a study if they experience adverse effects from the study treatment or intervention.
• Lack of efficacy: Participants may drop out of a study if they find that the study treatment or intervention is not effective.
• Death: Participants may drop out of a study if they die during the course of the study.
Statistical tests of significance
• Statistical tests of significance are used to determine whether the results of a study are statistically significant.
Statistical significance means that the results of the study are unlikely to have occurred by chance.
• In other words, statistical significance suggests that the observed difference between the groups being compared is unlikely to be due to chance alone. Some common statistical tests of significance include:
• T-test: Used to compare the means of two independent groups.
• ANOVA: Used to compare the means of three or more independent groups.
• Chi-square test: Used to compare the proportions of two or more groups.
• Linear regression: Used to test for a relationship between two continuous variables.
• Logistic regression: Used to test for a relationship between a binary outcome variable and one or more predictor variables.
1. t-test:
• Independent samples t-test: Compares the means of two independent groups to assess whether they are statistically different.
• Paired samples t-test: Compares the means of paired or matched observations, such as before and after measurements.
2. Analysis of variance (ANOVA):
• One-way ANOVA: Tests if there are statistically significant differences in means among three or more independent groups.
• Two-way ANOVA: Examines the effects of two independent categorical variables on a continuous dependent variable.
3. Chi-Square test:
• Chi-Square goodness of fit test: Compares observed categorical data to expected values to determine if they match a theoretical
distribution.
• Chi-Square test of independence: Assesses whether there is an association between two categorical variables.
4. Wilcoxon rank-sum test (Mann-Whitney U test): A non-parametric test that compares the distribution of two
independent groups when the assumptions for the t-test are not met.
5. Kruskal-Wallis test: A non-parametric alternative to one-way ANOVA, used when comparing three or more
independent groups.
6. Wilcoxon signed-rank test: A non-parametric test used for paired samples, analogous to the paired samples t-test.
7. Fisher’s Exact test: A test for independence in a contingency table, often used when dealing with small sample sizes (see the sketch after this list).
8. Logistic regression: Used for modelling binary or categorical outcomes based on one or more predictor variables.
9. Cox Proportional Hazards Model:
Used for survival analysis to assess the impact of predictor variables on the hazard of an event (e.g., time to death).
10. Pearson’s correlation coefficient:
Measures the linear association between two continuous variables.
11. Spearman’s Rank Correlation:
A non-parametric alternative to Pearson’s correlation, which assesses the monotonic relationship between variables.
12. Linear Regression:
Models the relationship between a continuous dependent variable and one or more continuous or categorical
predictor variables.
13. Multiple Regression:
Extends linear regression to include multiple predictor variables, allowing for the assessment of their combined impact on the dependent variable.
14. Chi-Square Test for Homogeneity:
Determines whether the distribution of a categorical variable is the same across different groups.
15. Repeated measures analysis of variance (RMANOVA):
Compares means of related groups, such as measurements taken from the same subjects over time.
16. Bayesian Hypothesis Testing:
A Bayesian approach to hypothesis testing that provides posterior probabilities for hypotheses and allows for the
incorporation of prior beliefs.
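• As one illustration of how these tests are run in practice, here is a minimal sketch of Fisher’s Exact test (item 7) in Python, assuming SciPy is available; the 2×2 counts are invented for demonstration:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 contingency table: rows = treatment/placebo,
# columns = side effect present/absent. Counts are small, which is
# exactly when Fisher's exact test is preferred over chi-square.
table = [[3, 17],   # treatment group
         [9, 11]]   # placebo group

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```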
Types of Significance tests
• There are two main types of significance tests: Parametric and non-parametric tests.
• Parametric tests assume that the data is normally distributed. Some common parametric tests include:
• t-test: Used to compare the means of two independent groups.
• ANOVA: Used to compare the means of three or more independent groups.
• Linear regression: Used to test for a relationship between two continuous variables.
• Logistic regression: Used to test for a relationship between a binary outcome variable and one or more
predictor variables.
Cont.…
Nonparametric tests do not make any assumptions about the distribution of the data. Some common
nonparametric tests include:
• Chi-Squared test: Used to compare the proportions of two or more groups.
• Wilcoxon rank-sum test: Used to compare the medians of two independent groups.
• Kruskal-Wallis test: Used to compare the medians of three or more independent groups.
• Spearman’s rank correlation test: Used to test for a relationship between two ordinal variables.
The type of significance test that is used depends on the type of data that is being analysed and the research
questions that are being asked.
Parametric tests (Student’s t-test, ANOVA, Correlation coefficient, Regression)
• Parametric tests, which assume that the data follows a specific probability
distribution (typically the normal distribution), are commonly used in statistical
analysis when certain assumptions are met.
• They are typically used to test hypotheses about the relationship between two or more variables, such as differences in group means or associations between measurements.
• Parametric tests are widely used when data adhere to specific assumptions, such as normal distribution and homogeneity of variances. They are particularly useful for comparing means, quantifying associations, and modelling relationships between variables.
Student’s t-test
• The Student’s t-test is a statistical test that is used to compare the means of two independent groups. It is a
parametric test, which means that it assumes that the data is normally distributed. The t-test can be used to
test for a difference in means between two groups, or to test whether the mean of one group is different from a
known population mean.
• Independent samples t-test: Used to compare means between two independent groups (e.g., treatment vs. control group) to determine if there is a significant difference.
• Paired samples t-test: Compares means of paired or matched observations (e.g., before and after
measurements within the same subjects).
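• A minimal sketch of both t-test variants in Python, assuming SciPy is available; all measurements are invented for demonstration:

```python
import numpy as np
from scipy import stats

# Hypothetical blood-pressure reductions (mmHg) in two groups.
treatment = np.array([12, 9, 14, 10, 8, 13, 11, 9])
control   = np.array([ 5, 7,  4,  6, 8,  5,  6, 7])

# Independent samples t-test: are the two group means different?
t_ind, p_ind = stats.ttest_ind(treatment, control)
print(f"independent t = {t_ind:.2f}, p = {p_ind:.4f}")

# Paired samples t-test: before vs. after within the same subjects.
before = np.array([150, 142, 138, 160, 155])
after  = np.array([141, 139, 132, 151, 148])
t_rel, p_rel = stats.ttest_rel(before, after)
print(f"paired t = {t_rel:.2f}, p = {p_rel:.4f}")
```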
ANOVA (Analysis of Variance)
• The analysis of variance (ANOVA) is a statistical test that is used to compare the means of three or more
independent groups. It is a parametric test, which means that it assumes that the data is normally distributed.
ANOVA is used to test for a difference in means between three or more groups, or to test whether the means
of two or more groups are different from a known population mean.
• One-way ANOVA: Used to compare means among three or more independent groups (e.g., different drug
doses) to determine if there are statistically significant differences.
• Two-way ANOVA: Extends one-way ANOVA to assess the effects of two independent categorical variables
on a continuous dependent variable.
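• A minimal one-way ANOVA sketch in Python, assuming SciPy is available; the dose-response values are invented for demonstration:

```python
from scipy import stats

# Hypothetical responses under three drug doses.
low    = [4.1, 5.0, 4.6, 4.9, 5.2]
medium = [5.8, 6.1, 5.5, 6.4, 6.0]
high   = [7.2, 6.9, 7.5, 7.1, 6.8]

# One-way ANOVA: do the three dose groups share a common mean?
f_stat, p_value = stats.f_oneway(low, medium, high)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```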
Correlation coefficient
• The correlation coefficient is a statistical measure of the strength and direction of the linear relationship
between two variables. It is a number between -1 and 1, where a correlation coefficient of -1 indicates a
perfect negative correlation, a correlation coefficient of 1 indicates a perfect positive correlation, and a
correlation coefficient of 0 indicates no correlation.
• Pearson’s Correlation Coefficient (r): Measures the strength and direction of a linear relationship between
two continuous variables. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation).
• Partial Correlation: Assesses the relationship between two variables while controlling for the effects of one
or more additional variables.
• Bivariate Correlation: Examines the correlation between two variables, often as a preliminary step in
regression analysis.
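• A minimal Pearson correlation sketch in Python, assuming SciPy is available; the paired age and blood-pressure values are invented:

```python
from scipy import stats

# Hypothetical paired measurements: age (years) and
# systolic blood pressure (mmHg).
age = [25, 32, 41, 48, 55, 61, 67]
sbp = [118, 121, 127, 131, 138, 142, 147]

r, p_value = stats.pearsonr(age, sbp)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")  # r near +1 here
```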
Regression
• Regression in biostatistics is a statistical method that is used to model the relationship between two or more
variables. It is used to predict the value of one variable (the dependent variable) based on the value of another
variable (the independent variable).
• Linear regression: Models the relationship between one or more predictor variables and a continuous outcome
(dependent variable) using a linear equation.
• Multiple regression: Extends linear regression to include multiple predictor variables to predict a continuous
outcome.
• Logistic regression: Models the probability of a binary outcome based on predictor variables, providing odds ratios.
• Polynomial regression: Models relationships that are non-linear by including polynomial terms (quadratic, cubic
etc.) in the regression equation.
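• A minimal simple linear regression sketch in Python, assuming SciPy is available; the dose-response values are invented for demonstration:

```python
from scipy import stats

# Hypothetical data: drug dose (mg) vs. measured response.
dose     = [10, 20, 30, 40, 50, 60]
response = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Simple linear regression: response = intercept + slope * dose
result = stats.linregress(dose, response)
print(f"slope = {result.slope:.3f}, intercept = {result.intercept:.3f}")
print(f"r^2 = {result.rvalue**2:.3f}, p = {result.pvalue:.4g}")

# Predict the response at a new dose of 45 mg:
print(result.intercept + result.slope * 45)
```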
Non-Parametric tests (Wilcoxon rank tests, Analysis of variance, Correlation, Chi-Square test)
• Non-parametric tests, also known as distribution-free tests, are used when the assumptions of normal
distribution or equal variances required for parametric tests are not met or when dealing with ordinal or
nominal data.
• Non-parametric tests are a type of statistical test that does not assume that the data is normally distributed.
They are often used when the data is not normally distributed, or when the sample size is small.
Wilcoxon Rank-Sum Test (Mann-Whitney U Test)
• Used to compare the distribution of two independent groups when the assumptions for the t-test are not met. It
assesses whether one group has significantly higher or lower values than the other.
• The Wilcoxon rank-sum test, also known as the Mann-Whitney U test, is a nonparametric test that is used to
compare two independent groups on a continuous or ordinal dependent variable. It is the nonparametric
alternative to the two-sample t-test.
• The Wilcoxon rank-sum test works by ranking the data in each group and then comparing the ranks between
groups. The null hypothesis of the Wilcoxon rank-sum test is that the two groups have the same median. If the
null hypothesis is rejected, then we can conclude that at least one group has a different median than the other
group.
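• A minimal Mann-Whitney U sketch in Python, assuming SciPy is available; the ordinal pain scores are invented for demonstration:

```python
from scipy import stats

# Hypothetical pain scores (ordinal 0-10) for two independent groups.
group_a = [2, 3, 5, 4, 6, 3, 2]
group_b = [6, 7, 5, 8, 9, 7, 6]

# Wilcoxon rank-sum / Mann-Whitney U test (two-sided).
u_stat, p_value = stats.mannwhitneyu(group_a, group_b,
                                     alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```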
Kruskal-Wallis Test
• A non-parametric alternative to one-way ANOVA. It is used to compare three or more independent groups to
determine if there are statistically significant differences in their distributions.
• The Kruskal-Wallis test is a non-parametric test that is used to compare three or more groups on a continuous
or ordinal dependent variable. It is the non-parametric alternative to the one-way ANOVA.
• The Kruskal-Wallis test works by ranking the data in each group and then comparing the ranks between
groups. The null hypothesis of the Kruskal-Wallis test is that the groups have the same median. If the null
hypothesis is rejected, then we can conclude that at least one group has a different median than the other
groups.
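• A minimal Kruskal-Wallis sketch in Python, assuming SciPy is available; the recovery times are invented for demonstration:

```python
from scipy import stats

# Hypothetical recovery times (days) in three independent groups.
clinic_1 = [7, 9, 8, 12, 10]
clinic_2 = [11, 13, 12, 15, 14]
clinic_3 = [6, 8, 7, 9, 7]

# Kruskal-Wallis H test: non-parametric comparison of 3+ groups.
h_stat, p_value = stats.kruskal(clinic_1, clinic_2, clinic_3)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```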
Spearman’s Rank Correlation
• Measures the strength and direction of the monotonic relationship between two continuous or ordinal
variables. It is a non-parametric alternative to Pearson’s correlation coefficient and is more suitable for data
that may not follow a linear relationship.
• Spearman’s rank correlation coefficient, also known as Spearman’s rho, is a non-parametric measure of rank
correlation.
• Spearman's rank correlation coefficient is calculated by ranking the data for each variable and then calculating the correlation between the ranks. The correlation coefficient can range from -1 to 1. A value of -1 indicates a perfect negative correlation, meaning that the two variables are inversely related. A value of 1 indicates a perfect positive correlation, meaning that the two variables are directly related. A value of 0 indicates no correlation.
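• A minimal Spearman correlation sketch in Python, assuming SciPy is available; the severity grades and symptom scores are invented for demonstration:

```python
from scipy import stats

# Hypothetical ordinal data: disease severity grade (1-5) and a
# patient-reported symptom score.
severity = [1, 2, 2, 3, 4, 4, 5]
symptoms = [10, 14, 13, 20, 26, 24, 33]

# Spearman's rho: rank-based, so it captures monotonic trends.
rho, p_value = stats.spearmanr(severity, symptoms)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
```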
Chi-Square Test
• The chi-square test is a statistical test used to compare observed and expected data. It is used to test the null
hypothesis that there is no relationship between two variables. The chi-squared test is calculated by comparing the
observed and expected frequencies of a categorical variable. The Chi-Squared test is a non-parametric test, which
means that it does not make any assumptions about the distribution of the data.
• Chi-Square Goodness of Fit Test: Compares observed categorical data to expected values to determine if they match
a theoretical distribution.
• Chi-Square test of Independence: Assesses the independence or association between two categorical variables by comparing observed and expected frequencies in a contingency table (a sketch follows below).
• Non-parametric tests are robust and versatile, making them suitable for various situations when parametric
assumptions are violated or when dealing with non-continuous data. These tests are also valuable when working
with ranked or ordered data or when the sample size is small.
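• As noted above, a minimal sketch of the chi-square test of independence in Python, assuming SciPy is available; the contingency counts are invented for demonstration:

```python
from scipy import stats

# Hypothetical 2x2 contingency table:
# rows = drug / placebo, columns = improved / not improved.
observed = [[45, 15],
            [30, 30]]

# Chi-square test of independence on the contingency table.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
print("expected counts under independence:", expected)
```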
Null Hypothesis
• The null hypothesis in biostatistics is a statement that there is no difference between the two groups being compared. It is the starting point for statistical tests. The null hypothesis is denoted by the symbol H0.
• The null hypothesis in biostatistics is typically a statement of no effect, no difference, or no association between variables. Some examples of null hypotheses in biostatistics:
✓ There is no difference in the mean height of children who receive a new growth hormone and children who receive a
placebo.
✓ There is no difference in the mean blood pressure of patients who receive a new drug and patients who receive a placebo.
✓ There is no relationship between blood pressure and age.
✓ There is no difference in the proportion of patients who develop a side effect from a new drug and the proportion of
patients who develop a side effect from a placebo.
P-values, Interpretation of P-values & Degrees of Freedom
• P-values are the probability of obtaining the results of a study, or more extreme results, if there is no real difference between the groups being compared. A P-value of 0.05 or less is generally considered to be statistically significant.
• The degrees of freedom are the number of independent values in a statistical test, calculated by subtracting the number of constraints from the total number of values. For example, the degrees of freedom for a two-sample t-test equal the number of participants in the study minus two. The interpretation of P-values and degrees of freedom depends on the type of statistical test that is being used. However, some general guidelines can be followed:
• A P-value of less than 0.05 indicates that the results of the study are statistically significant. This means that there is
a less than 5% chance of obtaining the results of the study or more extreme results if there is no real difference
between the groups being compared.
• Degrees of freedom greater than about 30 indicate that the results of the study are more likely to be reliable, because estimates based on more independent observations are more stable.
Interpretation of P-value
• The interpretation of a p-value depends on the context of your hypothesis test and a predetermined significance level (alpha, often set at 0.05). Common interpretations:
▪ If the p-value is less than or equal to alpha (p ≤ α), typically 0.05, you can reject the null hypothesis. In this case, you have sufficient evidence to suggest that there is an effect, difference, or association in the data.
▪ If the p-value is greater than alpha (p > α), you fail to reject the null hypothesis. In this case, you do not have enough evidence to suggest that there is a significant effect, difference, or association in the data.
▪ Smaller p-values (much less than alpha) indicate stronger evidence against the null hypothesis, while larger p-values
suggest weaker evidence against the null hypothesis.
▪ The p-value does not provide information about the magnitude of the effect or its clinical or practical significance. It only assesses whether an effect exists.
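• A minimal sketch tying these ideas together in Python, assuming SciPy is available; the group measurements, group sizes, and alpha are invented for demonstration:

```python
from scipy import stats

# Hypothetical example: two groups of 10 participants each, so the
# degrees of freedom for the t-test are 10 + 10 - 2 = 18.
group_1 = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]
group_2 = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0, 6.2, 5.6, 6.4, 5.9]

t_stat, p_value = stats.ttest_ind(group_1, group_2)
dof = len(group_1) + len(group_2) - 2
alpha = 0.05  # predetermined significance level

print(f"t = {t_stat:.2f}, df = {dof}, p = {p_value:.4g}")
if p_value <= alpha:
    print("Reject H0: evidence of a difference between the groups.")
else:
    print("Fail to reject H0: insufficient evidence of a difference.")
```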