STATISTICS
1. Hypothesis Testing: In hypothesis testing, the normal distribution plays a crucial role
in determining critical regions and calculating p-values. Many statistical tests, such as
the t-test and z-test, rely on the assumption that the data follows a normal
distribution. Deviations from normality may affect the validity of these tests,
emphasizing the importance of understanding normal distribution properties.
2. Statistical Inference: Many statistical methods assume normality. Parametric tests
(e.g., t-tests, ANOVA, linear regression) often require the assumption of normality.
3. Modeling Real-World Phenomena:
o The normal distribution is used to model and describe the behavior of many real-
valued random variables.
o In natural and social sciences, we encounter phenomena that exhibit a bell-shaped
distribution. Examples include:
o Heights of people in a population.
o Errors in measurements (e.g., instrument readings, experimental data).
o Blood pressure levels in patients.
o Test scores in educational assessments.
o IQ scores.
4. The normal distribution is of great value in educational evaluation and educational
research, where we make use of mental measurement.
APPLICATIONS OF NPC
There are a number of applications of the normal curve in the field of psychology as well as
educational measurement and evaluation. These are:
i) To determine the percentage of cases (in a normal distribution) within given limits
or scores (see the sketch after this list).
ii) To determine the percentage of cases that are above or below a given score or
reference point.
iii) To determine the limits of scores which include a given percentage of cases, and to
determine the percentile rank of an individual or a student in his own group.
iv) To find out the percentile value of an individual on the basis of his percentile rank.
v) Dividing a group into sub-groups according to certain ability and assigning the
grades.
vi) To compare the two distributions in terms of overlapping.
vii) To determine the relative difficulty of test items.
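As an illustration of applications (i) and (ii), the percentage of cases between two scores, or above a reference score, can be read directly off the normal curve. A minimal Python sketch (not from the source), assuming scipy and a made-up score distribution with mean 50 and SD 10:

from scipy.stats import norm

dist = norm(loc=50, scale=10)          # hypothetical score distribution

# (i) percentage of cases between scores 40 and 60 (within +/- 1 SD)
pct_within = (dist.cdf(60) - dist.cdf(40)) * 100    # about 68.3%

# (ii) percentage of cases above a reference score of 65
pct_above = (1 - dist.cdf(65)) * 100                # about 6.7%

print(round(pct_within, 1), round(pct_above, 1))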
Mean: It is the average of the data set
- Mean as the Location Parameter
- The mean signifies the center of the bell curve.
- It tells you where the data tends to cluster the most.
- Imagine the curve shifting left or right on the horizontal axis (x-axis). The mean value
dictates this location shift.
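A minimal Python sketch (assuming scipy; the means are illustrative) showing that changing only the mean slides the bell curve along the x-axis while its shape and peak height stay the same:

from scipy.stats import norm

for mu in (-2, 0, 2):                             # three different means, same SD
    peak_height = norm.pdf(mu, loc=mu, scale=1)   # the peak always sits at x = mu
    print(mu, round(peak_height, 4))              # same height (~0.3989) every time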
Skewness
A distribution is said to be skewed when the mean and median fall at different points in
the distribution, and the balance, i.e. the point of the center of gravity, is shifted to one
side or the other, to the left or right. In a normal distribution the mean equals the median
exactly and the skewness is of course zero (SK = 0).
There are two types of skewness which may appear in a distribution.
a) Negative Skewness: A distribution is said to be skewed negatively, or to the left, when
scores are massed at the high end of the scale (the right side of the curve) and are spread
out more gradually toward the low end (the left side of the curve). In a negatively skewed
distribution the value of the median will be higher than the value of the mean.
b) Positive Skewness: Distributions are skewed positively, or to the right, when scores are
massed at the low end of the scale (the left side) and are spread out gradually toward the
high or right end.
Kurtosis
The term kurtosis refers to the divergence in the height of the curve, especially in its
peakedness. There are two types of divergence in the peakedness of the curve.
a) Leptokurtosis: The curve becomes more peaked, i.e. its top becomes narrower than the
normal curve, and the scatter in the scores (the area of the curve) shrinks towards the
center. Thus, in a leptokurtic distribution, the frequency distribution curve is more peaked
than the normal distribution curve.
b) Platykurtosis: Now suppose we put heavy pressure on the top of a normal curve made of
wire. What would be the change in the shape of the curve? Probably you may say that the
top of the curve becomes flatter than the normal. Thus a distribution with a flatter peak
than the normal is known as a platykurtic distribution.
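A minimal Python sketch (assuming scipy and made-up scores) of how skewness and kurtosis are computed in practice; for a normal distribution both statistics are close to zero (scipy reports excess kurtosis, i.e. kurtosis minus 3):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
scores = rng.normal(loc=50, scale=10, size=1000)   # roughly normal scores

print(round(skew(scores), 3))       # near 0: roughly symmetric
print(round(kurtosis(scores), 3))   # near 0: neither leptokurtic nor platykurtic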
Several factors cause skewness and kurtosis (divergence from normality) in a distribution:
1. Selection of the Sample: Selection of the subjects (individuals) can produce skewness and
kurtosis in the distribution. If the sample size is small or the sample is a biased one,
skewness is possible in the distribution of scores obtained from the selected sample or
group of individuals. Scores from small and highly heterogeneous groups yield platykurtic
distributions.
2. Unsuitable or Poorly Made Tests: If a test is too easy, scores will pile up at the high
end of the scale, whereas if the test is too hard, scores will pile up at the low end of the
scale.
3. The Trait being Measured is Non-Normal: Skewness or kurtosis or both will appear
when there is a real lack of normality in the trait being measured, e.g. interest,
attitude, suggestibility.
4. Errors in the Construction and Administration of Tests: While administering the test,
unclear instructions, errors in timing, errors in scoring, and differences in practice and
motivation to complete the test may all cause skewness in the distribution.
PARAMETRIC TESTS
Parametric tests are those tests that make assumptions about the parameters of the
population distribution from which the sample is drawn. When data can be measured in
units which are interchangeable, e.g., weights (ratio scale) or temperatures (interval scale),
the data are said to be parametric and can be subjected to most kinds of statistical and
mathematical processes.
Assumptions
1. Normality: the population from which the sample is drawn is normally distributed.
2. Homogeneity of variance: the groups being compared have approximately equal variances.
3. Independence of observations.
4. Level of measurement: the data are on an interval or ratio scale.
CORRELATION
Correlation is a measure of association between two variables. Typically, one variable is
denoted as X and the other variable is denoted as Y. The relationship between these
variables is assessed by a correlation coefficient.
The relationship between two variables can be of various types. Broadly, they can be
classified as linear and nonlinear relationships.
1. Linear Relationship - One of the basic forms of relationship is linear relationship.
Linear relationship can be expressed as a relationship between two variables that can
be plotted as a straight line.
2. Non-linear Relationship - These are also called curvilinear relationships. The
Yerkes-Dodson Law, Stevens' Power Law in psychophysics, etc. are good examples
of non-linear relationships.
If the two variables are correlated then the relationship is either positive or negative. The
absence of relationship indicates “zero correlation”.
1. Positive Correlation - A positive correlation indicates that as the values of one
variable increase, the values of the other variable also increase. Consequently, as the
values of one variable decrease, the values of the other variable also decrease. This
means that both variables move in the same direction.
2. Negative Correlation - A negative correlation indicates that as the values of one
variable increase, the values of the other variable decrease. Consequently, as the
values of one variable decrease, the values of the other variable increase. This
means that the two variables move in opposite directions.
3. No relationship - If the two variables do not share any relationship (that is,
technically the correlation coefficient is zero), then, obviously, the direction of the
correlation is neither positive nor negative. This is often called zero correlation or
no correlation.
Correlation Coefficient
1. The correlation between any two variables is expressed in terms of a number, usually
called the correlation coefficient. The correlation coefficient is denoted by various
symbols depending on the type of correlation. The most common is ‘r’ (small ‘r’),
indicating Pearson’s product-moment correlation coefficient.
2. The range of the correlation coefficient is from –1.00 to + 1.00
3. If the magnitude of the correlation coefficient is 1, then the relationship between the
two variables is perfect.
4. This happens when the correlation coefficient is – 1 or + 1.
5. As the correlation coefficient moves nearer to + 1 or – 1, the strength of relationship
between the two variables increases.
6. If the correlation coefficient moves away from the + 1 or – 1, then the strength of
relationship between two variables decreases (that is, it becomes weak).
When using Pearson’s product-moment correlation coefficient (also known as Pearson’s r),
there are several assumptions that need to be met:
1. Level of Measurement:
Both variables should be measured at the interval or ratio level.
2. Linear Relationship:
There should exist a linear relationship between the two variables.
3. Normality:
Both variables should be roughly normally distributed.
4. Related Pairs:
Each observation in the dataset should have a pair of values for the two variables.
5. No Outliers:
There should be no extreme outliers in the dataset.
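A minimal Python sketch (assuming scipy, with made-up paired scores) of computing Pearson's r and its p-value:

from scipy.stats import pearsonr

x = [2, 4, 5, 7, 8, 10, 11, 13]        # e.g. hours of study
y = [35, 40, 48, 55, 60, 68, 70, 80]   # e.g. test scores

r, p_value = pearsonr(x, y)
print(round(r, 3), round(p_value, 4))  # r near +1 indicates a strong positive correlation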
BISERIAL CORRELATION
1. The biserial correlation coefficient is computed when one variable is continuous and
the other variable is artificially reduced to two categories (dichotomy).
2. The general formula for this is:
r_b = [(M_p – M_q) / σ_t] × (pq / y)
where M_p and M_q are the means of the continuous variable for the two categories, σ_t is
the standard deviation of the whole group, p and q are the proportions of cases in the two
categories, and y is the ordinate (height) of the normal curve at the point that divides p
and q.
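A minimal Python sketch of the formula above (assuming scipy and numpy; the function name biserial_r and the data are illustrative only):

import numpy as np
from scipy.stats import norm

def biserial_r(continuous, dichotomy):
    # dichotomy is coded 0/1 (an artificially dichotomised variable)
    x = np.asarray(continuous, dtype=float)
    d = np.asarray(dichotomy)
    p = d.mean()                       # proportion of cases in the '1' category
    q = 1.0 - p
    m_p = x[d == 1].mean()             # mean of the continuous variable, '1' group
    m_q = x[d == 0].mean()             # mean of the continuous variable, '0' group
    sigma_t = x.std()                  # SD of the whole group
    y = norm.pdf(norm.ppf(p))          # ordinate of the normal curve at the split
    return (m_p - m_q) / sigma_t * (p * q / y)

print(biserial_r([55, 60, 62, 48, 70, 45, 66, 52], [1, 1, 1, 0, 1, 0, 1, 0]))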
STUDENT T-TEST
A t test is a statistical test that is used to compare the means of two groups. It is often used
in hypothesis testing to determine whether a process or treatment actually has an effect on
the population of interest, or whether two groups are different from one another.
Assumptions
The t test is a parametric test of difference, meaning that it makes the same assumptions
about your data as other parametric tests. The t test assumes your data:
1. are independent
2. are (approximately) normally distributed
3. have a similar amount of variance within each group being compared (a.k.a.
homogeneity of variance)
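A minimal Python sketch (assuming scipy, with made-up scores for two groups) of an independent-samples t test:

from scipy.stats import ttest_ind

group_a = [72, 75, 78, 80, 82, 85, 88]
group_b = [65, 68, 70, 71, 74, 76, 79]

t_stat, p_value = ttest_ind(group_a, group_b)   # equal variances assumed by default
print(round(t_stat, 3), round(p_value, 4))      # small p -> the group means differ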
ONE-WAY ANOVA
A statistical test used to compare the means of three or more independent groups based on
one categorical independent variable.
NULL HYPOTHESIS: The mean scores of all the groups are the same.
ALTERNATE HYPOTHESIS: At least one group has a different mean.
Assumptions:
1. Normality: The data within each group should be approximately normally distributed.
2. Homogeneity of Variance: The variance of the data within each group should be
similar.
3. Independence: Observations within each group should be independent.
Formula
F = MSB / MSE
where MSB is the mean sum of squares between the groups and MSE is the mean sum of
squares due to error (within groups).
Assumptions
Normality
Independence
Homogeneity of variance
Level of measurement: Continuous and categorical
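A minimal Python sketch (assuming scipy, with made-up scores for three groups) of a one-way ANOVA; scipy computes the F ratio (MSB/MSE) and its p-value:

from scipy.stats import f_oneway

group1 = [23, 25, 27, 29, 30]
group2 = [31, 33, 35, 36, 38]
group3 = [24, 26, 28, 30, 32]

f_stat, p_value = f_oneway(group1, group2, group3)
print(round(f_stat, 3), round(p_value, 4))   # small p -> at least one group mean differs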
TWO-WAY ANOVA
A statistical test used to determine the effect of two nominal predictor variables on a
continuous outcome variable. In two-way ANOVA, there is still one quantitative dependent
variable and two categorical independent variables.
NULL HYPOTHESIS 1: The means of observations grouped by one factor are the same.
NULL HYPOTHESIS 2: The means of observations grouped by the other factor are the same.
NULL HYPOTHESIS 3: There is no interaction effect between the two factors
Assumptions
Normality
Independence
Homogeneity of variance
Level of measurement: Continuous and categorical
When to Use Two-Way ANOVA:
You can use it when you have collected data on a quantitative dependent variable at
multiple levels of two categorical independent variables.
Examples:
Investigating the effect of different social media platforms (Facebook, Twitter,
Instagram) and time of day (morning, afternoon, evening) on user engagement.
Analyzing how temperature (hot, moderate, cold) and humidity (low, moderate, high)
impact energy consumption in buildings.
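A minimal Python sketch (assuming pandas and statsmodels; the column names platform, time and engagement are illustrative, not from the source) of a two-way ANOVA with an interaction term:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "platform":   ["FB", "FB", "TW", "TW", "IG", "IG"] * 4,
    "time":       ["morning", "evening"] * 12,
    "engagement": [5, 7, 6, 8, 9, 11, 4, 6, 5, 7, 8, 10,
                   6, 8, 7, 9, 10, 12, 5, 7, 6, 8, 9, 11],
})

# Two categorical factors plus their interaction; anova_lm gives an F test for each
model = ols("engagement ~ C(platform) * C(time)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))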
NON-PARAMETRIC TEST
ASSUMPTIONS
1. The underlying data do not meet the assumptions about the population distribution.
2. The population sample size is too small.
3. Level of measurement: the analyzed data are ordinal or nominal.
Chi-square TEST
It is used to determine whether there is a relationship between two categorical variables
and to see whether the data are significantly different from what was expected. Chi-square
is symbolically written as χ². It is used for one of two purposes:
Goodness of fit: To see whether the data from the sample match the population
from which the data were taken. In other words, to test whether the frequency
distribution of a categorical variable matches your expectations.
Test for Independence: To see whether two categorical variables are related to
(associated with) each other.
df = (rows – 1)(columns – 1)
Assumptions
Level of measurement: Categorical
Independence
Cells in the contingency table are mutually exclusive: It’s assumed that individuals
can only belong to one cell in the contingency table. That is, cells in the table are
mutually exclusive – an individual cannot belong to more than one cell.
The expected value of cells should be 5 or greater in at least 80% of cells.
The null hypothesis (H0) is that there is no association between the two variables
The alternative hypothesis (H1) is that there is an association of any kind.
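A minimal Python sketch (assuming scipy, with made-up frequencies) of both uses of chi-square: the test for independence on a contingency table, and the goodness-of-fit test:

from scipy.stats import chi2_contingency, chisquare

# Test for independence: e.g. group (rows) vs. pass/fail (columns)
table = [[30, 10],
         [20, 40]]
chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 3), round(p, 4), dof)          # dof = (rows - 1)(columns - 1) = 1

# Goodness of fit: do observed counts match the expected frequencies?
chi2_gof, p_gof = chisquare(f_obs=[18, 22, 20, 40], f_exp=[25, 25, 25, 25])
print(round(chi2_gof, 3), round(p_gof, 4))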
SIGN TEST
- It is used to compare two related (paired) groups.
- The Sign Test stands as a fundamental non-parametric statistical method designed to
compare two related samples, typically used in scenarios where more conventional
tests such as the t-test cannot be applied due to the distributional characteristics of
the data.
- It focuses on the direction (sign) of changes between paired observations rather than
their numerical difference.
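A minimal Python sketch (assuming scipy, with made-up before/after ratings) of a sign test; only the direction of each paired change is used, and ties are dropped:

from scipy.stats import binomtest

before = [6, 5, 7, 4, 6, 5, 7, 6, 5, 4]
after  = [7, 6, 7, 6, 8, 6, 8, 7, 6, 6]

diffs = [a - b for a, b in zip(after, before) if a != b]   # drop ties
n_positive = sum(1 for d in diffs if d > 0)

# Under H0 a positive sign has probability 0.5
result = binomtest(n_positive, n=len(diffs), p=0.5)
print(n_positive, len(diffs), round(result.pvalue, 4))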
Practical Application
1. Consumer preference testing: Imagine a taste test comparing two sodas (A and B)
for a group of people. By analyzing the "before and after" preferences (positive for A,
negative for B), the sign test can tell you if there's a statistically significant preference
for one soda over the other.
2. Medical research: Researchers might use the sign test to compare the effectiveness
of two different pain medications. They can track pain levels (before and after) for
patients and use the sign test to see if one medication leads to a significantly greater
reduction in pain compared to the other.
3. Survey analysis: Imagine a survey asking people's opinions on a new policy before
and after its implementation. The sign test can help analyze if there's a significant
shift in public opinion (more positive, more negative) after the policy change.
MEDIAN TEST
The median test is used to compare the performance of two independent groups, for
example an experimental group and a control group.
The null hypothesis: the groups are drawn from populations with the same median.
The alternative hypothesis: either that the two medians are different (two-tailed test) or
that one median is greater than the other (one-tailed test).
Assumptions
Level of measurement: ordinal or continuous.
Independence
Random Sampling: each observation is chosen randomly and represents the
population.
Making decisions based on results: The outcome of a median test, typically given by a p-
value, helps you decide whether to reject the null hypothesis (that the medians are equal).
This informs practical decisions. For example, a marketing campaign might target a specific
demographic group if the median test shows a significant difference in purchase
preferences between that group and others.
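A minimal Python sketch (assuming scipy, with made-up scores) of Mood's median test comparing an experimental group with a control group:

from scipy.stats import median_test

experimental = [88, 92, 75, 83, 97, 90, 78, 85]
control      = [72, 80, 69, 76, 81, 74, 70, 77]

stat, p_value, grand_median, table = median_test(experimental, control)
print(round(stat, 3), round(p_value, 4), grand_median)   # small p -> medians differ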
MANN-WHITNEY U TEST
The Mann-Whitney U test is the non-parametric alternative to the independent samples t-
test. It is used to compare two independent samples in order to test whether they come
from the same population, i.e., whether the two samples differ or not.
Assumptions
1. The sample drawn from the population is random.
2. Independence within the samples and mutual
independence is assumed. That means that an
observation is in one group or the other (it cannot be
in both).
3. Ordinal measurement scale is assumed.
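A minimal Python sketch (assuming scipy, with made-up ordinal scores) of the Mann-Whitney U test:

from scipy.stats import mannwhitneyu

group_a = [3, 4, 2, 5, 4, 3, 5, 4]
group_b = [2, 1, 3, 2, 2, 3, 1, 2]

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u_stat, round(p_value, 4))   # small p -> the two groups differ in location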
Practical Implications
Wide range of applications: The Mann-Whitney U test finds use in various fields like
psychology (comparing treatment effects), medicine (evaluating drug efficacy
between groups), economics (analyzing differences between income groups), and
many more.
Focus on medians: While not directly providing information about means, the test
helps us understand if the medians (center points) of the two groups are likely to be
different. This can be crucial when data may have outliers or skewed distributions.
Decision making: The test results help researchers and analysts decide whether to
reject the null hypothesis (no difference between groups) or accept it. This informs
conclusions about the effectiveness of interventions, group characteristics, and more.
FRIEDMAN TEST
The Friedman test is a non-parametric statistical test developed by Milton Friedman. Similar
to the parametric repeated measures ANOVA, it is used to detect differences in treatments
across multiple test attempts
Assumptions
Data should be ordinal (e.g. the Likert scale) or continuous,
Data comes from a single group, measured on at least three different
occasions,
The sample was created with a random sampling method,
Blocks are mutually independent (i.e. all of the pairs are independent — one
doesn’t affect the other),
Observations are ranked within blocks with no ties.
Formula:
χ²F = [12 / (n k (k + 1))] × Σ Rj² – 3 n (k + 1)
where
k = number of columns (conditions or treatments)
n = number of rows (subjects or blocks)
Rj = sum of ranks in column j
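A minimal Python sketch (assuming scipy, with made-up ratings from six subjects measured under three conditions) of the Friedman test:

from scipy.stats import friedmanchisquare

condition_1 = [7, 6, 8, 5, 7, 6]
condition_2 = [8, 7, 9, 6, 8, 7]
condition_3 = [5, 5, 6, 4, 6, 5]

stat, p_value = friedmanchisquare(condition_1, condition_2, condition_3)
print(round(stat, 3), round(p_value, 4))   # small p -> at least one condition differs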
Practical Implications
By using the Friedman test, researchers can gain valuable insights into whether different
interventions, conditions, or time points have a statistically significant impact within the
same group of subjects. This helps them draw stronger conclusions about the effects being
studied.
MULTIPLE REGRESSION
Multiple regression is a statistical technique that explores how several independent
(predictor) variables influence a single dependent (criterion) variable. It can be used to
predict the value of the dependent variable if the independent variables are known. It is
also used to see if there is a statistically significant relationship between sets of variables
and to find trends in those sets of data.
Assumptions
1. Model Specification: Ensure the model includes all relevant variables and
accurately reflects the relationships being studied.
2. Linearity: The relationship between the predictors and the outcome should be
linear.
3. Normality: The variables involved should follow a normal distribution.
4. Homoscedasticity: The variance (spread of values) should be consistent across all
levels of the predictors.
Formula
Y = a + b₁X₁ + b₂X₂ + … + bₖXₖ + e
where Y is the dependent (criterion) variable, X₁ … Xₖ are the independent (predictor)
variables, a is the intercept (constant), b₁ … bₖ are the regression coefficients, and e is the
error term.
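A minimal Python sketch of the formula above (assuming numpy and statsmodels; the predictor names marketing and price and the outcome sales are illustrative only):

import numpy as np
import statsmodels.api as sm

marketing = np.array([10, 12, 15, 18, 20, 22, 25, 28])
price     = np.array([5.0, 5.2, 4.8, 5.5, 5.1, 4.9, 5.3, 5.0])
sales     = np.array([100, 110, 130, 135, 150, 165, 170, 185])

X = sm.add_constant(np.column_stack([marketing, price]))   # builds a + b1*X1 + b2*X2
model = sm.OLS(sales, X).fit()
print(model.params)      # estimated a, b1, b2
print(model.rsquared)    # proportion of variance in Y explained by the predictors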
Practical Implications
Understanding complex relationships: In real-world scenarios, outcomes are rarely
influenced by just one factor. Multiple regression allows you to analyze how multiple
independent variables interact to affect a dependent variable. This provides a more
comprehensive understanding of the system you're studying.
Making predictions: A well-constructed regression model can be used to predict
future values of the dependent variable based on the values of the independent
variables. This is useful in various domains, from business (e.g., predicting sales based
on marketing spend and economic factors) to science (e.g., predicting crop yield
based on weather patterns and fertilizer application).
Informed decision-making: By isolating the independent contributions of different
factors, you can make more informed decisions. For instance, a company might use
regression to assess the impact of various advertising channels on customer
conversion rates, allowing them to optimize their marketing budget.
Types
Standard multiple regression: This is the most common type of multiple regression.
In standard multiple regression, all of the independent variables (predictors) are
entered into the regression equation at once. This type of regression is used to assess
the overall relationship between the independent variables and the dependent
variable.
Stepwise multiple regression: Stepwise multiple regression is a more complex type of
regression that is used to identify the best subset of independent variables to predict
the dependent variable. In stepwise regression, the variables are entered into the
model one at a time, based on a statistical criterion. The process continues until no
more variables meet the criteria for inclusion. Stepwise regression is a useful tool for
identifying the most important predictors of a dependent variable, but it is important
to be aware that the results can be sensitive to the order in which the variables are
entered into the model.
Ridge regression and Lasso regression: These are types of regression that are used to
address the problem of multicollinearity, which occurs when the independent
variables are highly correlated with each other. Multicollinearity can make it difficult
to estimate the coefficients of the regression model and can lead to unreliable
results. Ridge regression and Lasso regression are techniques that can be used to
shrink the coefficients of the regression model, which can help to reduce the impact
of multicollinearity.
FACTOR ANALYSIS
Factor analysis is a sophisticated statistical method aimed at reducing a large number of
variables into a smaller set of factors. This technique is valuable for extracting the
maximum common variance from all variables, transforming them into a single score for
further analysis. It is a part of the general linear model (GLM).
It determines whether underlying latent variables (factors) may explain the predictable
(patterned) connections within a set of observed variables.
Four primary objectives
To determine the factors that underlie a set of observable variables.
To provide a system that can explain variance amid certain observable variables
through fewer statistically established factors.
To reduce the data by extracting a small group of factors from a collection of
observable variables so as to summarise said variables into fewer factors.
To establish the characteristics of the extracted variables.
Assumptions
1. There is a linear relationship between variables
2. There is no multicollinearity, which implies that each variable is unique;
multicollinearity exists when two independent variables are highly correlated.
3. Relevant variables are included in the analysis.
4. There is a true correlation between variables and factors.
5. There are no outliers in the data set.
6. The sample used is of sufficient size, that is, there are more variables than factors
and each variable has more data values than there are factors.
Practical Applications
Here are some specific examples of how factor analysis is used in practice:
Psychology: Identifying personality traits (e.g., "Big Five").
Marketing: Understanding customer preferences and segmenting markets.
Finance: Analyzing financial risk factors in investments.
Education: Assessing student learning outcomes and identifying areas for
improvement.
Types
Exploratory Factor Analysis (EFA): It is applied in situations where there is not a fixed
idea of the number of factors involved or of the relationship they have with the observed
variables. The goal is to investigate the way the factors are structured and to identify
the underlying correlations within the variables.
It is not based on previous theories, and it aims to uncover structures in large sets of
variables by measuring the latent factors that affect the variables within a given data
structure.
It does not require a previous hypothesis on the relationship between factors and
variables, and the results are of an inductive nature, based on observations.
It is mostly used in empirical research and in the development, validation, and
adaptation of measurement instruments in psychology, because it is useful for detecting
a set of common factors that explain the responses to test items.
Confirmatory Factor Analysis (CFA): It is used to confirm predefined factors that
have already been explored in the literature, and it is applied to verify the
effects and the possible correlations between a collection of certain factors and
variables.
It usually requires a large sample, the model is specified in advance, and it
produces statistics based on deduction. It is used in situations where the
researcher has a particular hypothesis about how many factors there are and how the
observable variables are associated with each factor.
The hypothesis is founded on past studies or theories and has the purpose of
corroborating that there is a link between the factors and the observed variables.
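A minimal Python sketch (assuming scikit-learn, with randomly generated data used purely for illustration) of an exploratory factor analysis reducing six observed variables to two latent factors:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                    # two underlying factors
loadings = rng.normal(size=(2, 6))                    # how the factors map to items
observed = latent @ loadings + rng.normal(scale=0.5, size=(200, 6))

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(observed)   # factor scores for each of the 200 cases
print(scores.shape)                   # (200, 2)
print(fa.components_.shape)           # (2 factors, 6 observed variables)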