STATISTICS
Statistical methods and analysis techniques play a crucial role in extracting insights
from data. These methods help researchers make sense of complex datasets and
draw valid conclusions. Some common statistical methods and techniques include:
1. Descriptive Statistics:
Central Tendency Measures:
• Mean: The average value of a variable.
Use mean(data$variable_name) where data is your data frame
and variable_name is the specific variable you're analyzing.
• Median: The middle value in a sorted dataset.
Use median(data$variable_name).
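• Mode: The most frequent value. R has no built-in function for the
statistical mode (mode() returns an object's storage type), so use
names(which.max(table(data$variable_name))).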
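The three measures above can be sketched on a small made-up data frame. The values are hypothetical and chosen only for illustration; data and variable_name mirror the placeholder names used in the bullets.

```r
# Hypothetical example data; `data` and `variable_name` are placeholder names.
data <- data.frame(variable_name = c(2, 3, 3, 5, 7, 3, 8, 5))

mean_value   <- mean(data$variable_name)    # arithmetic average
median_value <- median(data$variable_name)  # middle value of the sorted data

# Base R has no function for the statistical mode, so count how often each
# value occurs and keep the most frequent one. which.max() returns the first
# maximum, so only one value is reported if several are tied.
counts     <- table(data$variable_name)
mode_value <- as.numeric(names(counts)[which.max(counts)])

print(c(mean = mean_value, median = median_value, mode = mode_value))
```

If the variable contains missing values, mean() and median() return NA unless na.rm = TRUE is passed.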
Frequency Distributions:
2. Hypothesis Testing:
• Formulating Hypotheses: Define the null hypothesis (no difference or
effect) and the alternative hypothesis (a difference or effect exists) based
on your research question.
• Choosing Tests: Select the appropriate statistical test based on the type of
data (categorical or continuous) and the number of variables being compared.
o Categorical Data: Chi-square tests (e.g., Chi-square test of
independence) are used to assess relationships between two
categorical variables.
o Continuous Data:
▪ One Sample: T-tests (e.g., one-sample t-test) compare the
sample mean to a specific value.
▪ Two or More Samples: T-tests (e.g., two-sample t-test, paired
t-test) compare means between two groups; ANOVA (Analysis of
Variance) compares means across three or more groups.
• Conducting Tests: R offers various functions like chisq.test, t.test,
and aov for performing specific hypothesis tests.
• Interpreting Results: Evaluate the p-value (the probability of observing data
at least as extreme as the sample, assuming the null hypothesis is true) against
the chosen significance level (typically alpha = 0.05). Reject the null
hypothesis if the p-value is less than alpha.
Common hypothesis tests: t-tests, z-tests, chi-square tests, ANOVA, etc.
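A minimal sketch of the workflow above, using t.test, chisq.test, and aov on simulated data. The data frame scores, the contingency table membership_table, and all group labels are hypothetical and exist only for this illustration.

```r
set.seed(42)  # make the simulated data reproducible

# Hypothetical continuous data: three groups of 20 observations each.
scores <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  score = c(rnorm(20, mean = 50, sd = 10),
            rnorm(20, mean = 55, sd = 10),
            rnorm(20, mean = 60, sd = 10))
)
alpha <- 0.05  # significance level

a <- scores$score[scores$group == "A"]
b <- scores$score[scores$group == "B"]

# One-sample t-test: does the mean of group A differ from 50?
one_sample <- t.test(a, mu = 50)

# Two-sample t-test (Welch's version by default): do A and B differ in mean?
two_sample <- t.test(a, b)

# One-way ANOVA: do the means differ across all three groups?
anova_fit <- aov(score ~ group, data = scores)
anova_p   <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Hypothetical categorical data: membership type by preferred format.
membership_table <- matrix(
  c(30, 20, 15, 35), nrow = 2,
  dimnames = list(membership = c("student", "staff"),
                  format     = c("print", "digital"))
)
# Chi-square test of independence (continuity-corrected for 2 x 2 tables).
chi_sq <- chisq.test(membership_table)

# Reject the null hypothesis wherever the p-value falls below alpha.
p_values <- c(one_sample = one_sample$p.value,
              two_sample = two_sample$p.value,
              anova      = anova_p,
              chi_square = chi_sq$p.value)
reject_null <- p_values < alpha
print(reject_null)
```

For paired measurements, t.test accepts paired = TRUE with two equal-length vectors.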
Regression Analysis:
Confidence Intervals:
• Confidence intervals provide a range of values within which the population
parameter is likely to lie with a certain level of confidence (e.g., 95%
confidence interval).
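As a rough illustration, a 95% confidence interval for a mean can be read from t.test or computed by hand as the sample mean plus or minus a critical t value times the standard error. The sample x below is simulated and hypothetical, and the sketch assumes a t-based interval is appropriate (roughly normal data or a reasonably large sample).

```r
set.seed(1)  # reproducible simulated sample
x <- rnorm(30, mean = 100, sd = 15)  # hypothetical sample of 30 values

# t.test() reports a confidence interval for the population mean.
ci_builtin <- t.test(x, conf.level = 0.95)$conf.int

# The same interval by hand: mean +/- t critical value * standard error.
n      <- length(x)
se     <- sd(x) / sqrt(n)          # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value
ci_manual <- mean(x) + c(-1, 1) * t_crit * se

print(ci_builtin)
print(ci_manual)
```

Both approaches give the same bounds; the manual version only makes the ingredients of the interval visible.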
Non-parametric Methods:
• Non-parametric tests are used when the assumptions of parametric tests are
violated or when data are not normally distributed.
• When data are skewed or contain outliers, rank-based non-parametric tests
are typically more robust than their parametric counterparts.
• Wilcoxon signed-rank test: This test compares two related samples (e.g.,
user satisfaction before and after using a new library feature). Use
the wilcox.test function with paired = TRUE.
• Mann-Whitney U test: This test compares two independent samples (e.g.,
user borrowing rates for different membership types). Use
the wilcox.test function with paired = FALSE.
• Kruskal-Wallis test: This test compares three or more independent samples
(e.g., resource usage across different academic disciplines). Use
the kruskal.test function.
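The three tests above might be run as follows. All data are simulated and hypothetical (satisfaction scores, borrowing rates, and usage hours are invented for this sketch), and continuous values are used so that ties do not trigger warnings about exact p-values.

```r
set.seed(7)  # reproducible simulated data

# Wilcoxon signed-rank test: the same 15 users measured before and after.
before <- rnorm(15, mean = 6, sd = 1.5)
after  <- before + rnorm(15, mean = 0.8, sd = 1)
signed_rank <- wilcox.test(before, after, paired = TRUE)

# Mann-Whitney U test: two independent groups (paired = FALSE is the default).
group_x <- rexp(20, rate = 1 / 4)
group_y <- rexp(20, rate = 1 / 6)
mann_whitney <- wilcox.test(group_x, group_y, paired = FALSE)

# Kruskal-Wallis test: three independent groups.
usage <- data.frame(
  discipline = factor(rep(c("arts", "science", "law"), each = 12)),
  hours      = c(rexp(12, rate = 1 / 3),
                 rexp(12, rate = 1 / 5),
                 rexp(12, rate = 1 / 4))
)
kruskal <- kruskal.test(hours ~ discipline, data = usage)

p_values <- c(signed_rank  = signed_rank$p.value,
              mann_whitney = mann_whitney$p.value,
              kruskal      = kruskal$p.value)
print(p_values)
```

Note that R labels the Mann-Whitney U test as a "Wilcoxon rank sum" test in its output; the two names refer to the same procedure.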
Survival Analysis
• Survival analysis focuses on analyzing the time until an event of interest
occurs and the factors influencing its occurrence.
Data Preparation:
• Import libraries: Use dplyr and tidyr for data manipulation and wrangling
tasks.
• Load data: Use read.csv or relevant functions to read your data from its
source (e.g., CSV file).
• Check data structure: Use str(data) to understand variable types,
dimensions, and potential missing values.
• Handle missing values: Employ appropriate methods (e.g., removal,
imputation) to address missing values if necessary.
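The data preparation steps (importing libraries, loading data, checking structure, and handling missing values) can be sketched as below. The CSV contents, the column names member_id, membership, and visit_hours, and the choice of median imputation are all hypothetical; the sketch assumes dplyr and tidyr are installed and writes its own temporary file so that it runs without external data.

```r
library(dplyr)  # data manipulation (mutate, %>%)
library(tidyr)  # data tidying (drop_na, replace_na)

# Write a small hypothetical CSV to a temporary file so the example is
# self-contained; in practice csv_path would point at your own file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("member_id,membership,visit_hours",
             "1,student,1.5",
             "2,staff,NA",
             "3,student,3.0",
             "4,,2.5"), csv_path)

# Load data, treating both "NA" and empty fields as missing.
data <- read.csv(csv_path, na.strings = c("NA", ""))

# Check structure: variable types and dimensions.
str(data)

# Count missing values per column.
print(colSums(is.na(data)))

# Option 1 (removal): keep only rows with no missing values.
complete_rows <- drop_na(data)

# Option 2 (imputation): replace missing visit_hours with the column median.
imputed <- data %>%
  mutate(visit_hours = replace_na(visit_hours,
                                  median(visit_hours, na.rm = TRUE)))
```

Whether removal or imputation is appropriate depends on how much data is missing and why; the median is used here only as one simple option.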
4. Interpretation: