RM-Quantitative Data Analysis
Hazura Mohamed
Center of Software Technology and Management
Faculty of Information Science and Technology
Universiti Kebangsaan Malaysia
[email protected]
Purpose of Data Analysis
• Independent variables
• "predictor variable“, “explanatory variable”
• The variable that you believe will influence
your outcome measure. Can be
manipulated or controlled, or changed.
• Studies to see its relationship or effects.
• Dependent variables
• "predicted variable", "measured variable"
• The variable that is dependent on or
influenced by the independent variable(s).
A dependent variable may also be the
variable you are trying to predict.
Example
• The tutor wants to know why some students perform
better than others. The tutor thinks that it might be
because of two reasons.
• 1. some students spend more time revising for their test
• 2. some students are naturally more intelligent than
others.
• The tutor decides to investigate the effect of revision time
and intelligence on the test performance of the 100
students. What are the dependent and independent
variables for the study?
Solution
• Dependent variable: test performance (the students' test scores).
• Independent variables: revision time and intelligence.
1. Descriptive Statistics
2. Inferential Statistics
DESCRIPTIVE STATISTICS
Descriptive Statistics
• Descriptive statistics refers to the branch of statistics that
involves summarizing and presenting data in a meaningful
and informative way.
• The primary goal of descriptive statistics is to organize,
simplify, and describe the main features of a dataset. This
involves using various numerical (frequency or percentage,
mean, median, mode, standard deviation) and graphical
techniques (bar chart, pie chart, histogram, box plot) to
convey the essential characteristics of the data.
• Can only be used to describe the group that is being
studied.
• The results cannot be generalized to any larger group.
• Descriptive statistics uses two tools to organize and describe
data. These are given as follows:
• Central tendency is simply the location of the middle in a
distribution of scores.
• Measures of dispersion describe the spread of the data, or
its variation around a central value.
Measures of central tendency
Mean
• The sum of all the scores divided by the number of
scores. Often referred to as the average.
• Good measure of central tendency.
• The mean can be misleading because it can be
greatly influenced by extreme scores (very high, or
very low scores).
• Extreme cases or values are called outliers.
• Used for interval and ratio data.
Measures of central tendency
Median
• Median is the middle number in a set of data when
the data is arranged in numerical order.
• Half the scores are above the median and half are
below the median.
• Sometimes the median may yield more information
when your distribution contains outliers or is
skewed (not normally distributed).
• Used for the ordinal, interval and ratio data.
Measures of central tendency
Mode
• The mode is the number that occurs the most.
• Not recommended as the only measure of central
tendency.
• Distributions can have more than one mode, called
"multimodal.“
• Used for nominal and ordinal data.
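All three measures can be computed with Python's standard library; a minimal sketch with made-up scores (not from the slides):

```python
import statistics

# Hypothetical test scores (illustrative only), including one outlier (98)
scores = [55, 60, 60, 62, 65, 67, 70, 98]

print(statistics.mean(scores))       # 67.125 -- pulled upward by the outlier
print(statistics.median(scores))     # 63.5   -- robust to the outlier
print(statistics.multimode(scores))  # [60]   -- the most frequent score(s)
```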
Measures of Spread
Range
• The range is the difference between the largest and
the smallest observation in the data.
• Easy to calculate.
• Sensitive to outliers (highly affected by outliers)
and does not use all the observations in a data set
(Swinscow TD, 2003).
• It is more informative to provide the minimum and
the maximum values rather than providing the
range.
Measures of Spread
Interquartile Range (IQR)
• The difference between the “third quartile” (75th
percentile) and the “first quartile” (25th
percentile). So, the “middle-half” of the values.
• IQR = Q3-Q1
• Quartile deviation = IQR/2
• According to Norizan (2003), a quartile deviation (QD)
of 0.5 or less indicates high consensus.
• Robust to outliers or extreme observations.
• Works well for skewed data.
Measures of Spread
Variance
• Measures average squared deviation of data points
from their mean.
• If measuring the variance of a population, denoted by σ²
("sigma-squared").
• If measuring the variance of a sample, denoted by s²
("s-squared").
• Highly affected by outliers. Best for symmetric
data.
• Problem is units are squared.
Measures of Spread
Standard Deviation
• Measures average deviation of data points from
their mean.
• Sample standard deviation is square root of sample
variance, and so is denoted by s.
• Units are the original units.
• A larger standard deviation indicates greater variability.
• Also, highly affected by outliers.
Measures of Spread
Coefficient of Variation
• Ratio of sample standard deviation to sample mean
multiplied by 100.
• Measures relative variability, that is, variability
relative to the magnitude of the data.
• Unitless, so good for comparing variation between
two groups.
Choosing Appropriate
Measure of Variability
• If data are symmetric, with no serious outliers, use
range and standard deviation.
• If data are skewed, and/or have serious outliers,
use IQR.
• If comparing variation across two data sets, use
coefficient of variation.
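A minimal NumPy sketch (hypothetical data, not from the slides) computing each measure of spread described above:

```python
import numpy as np

# Hypothetical data (illustrative only)
x = np.array([8.0, 10.0, 12.0, 12.0, 14.0, 15.0, 18.0, 25.0])

data_range = x.max() - x.min()           # range
q1, q3 = np.percentile(x, [25, 75])      # first and third quartiles
iqr = q3 - q1                            # interquartile range
qd = iqr / 2                             # quartile deviation
var = x.var(ddof=1)                      # sample variance (s^2)
sd = x.std(ddof=1)                       # sample standard deviation (s)
cv = sd / x.mean() * 100                 # coefficient of variation (%)

print(data_range, iqr, qd, round(var, 2), round(sd, 2), round(cv, 1))
```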
Box Plot
• A box plot, also known as a box-and-whisker plot, is
a graphical representation that provides a visual
summary of the distribution of a dataset.
• It includes 5 key summary statistics, and the box
itself is directly related to the measure of
dispersion, specifically the interquartile range (IQR).
How does a box plot relate to measures of dispersion?
• Box (Interquartile Range - IQR):
• The box in a box plot represents the interquartile range (IQR), which is
the range of values between the first quartile (Q1) and the third
quartile (Q3).
• The lower edge of the box corresponds to Q1, and the upper edge
corresponds to Q3.
• The length of the box (height if the box is oriented vertically) is
proportional to the IQR, providing a visual representation of the spread
of the middle 50% of the data.
• Median (Middle Line in the Box):
• The median (Q2) is typically represented by a line inside the box. It
shows the center of the distribution.
How does a box plot relate to measures of dispersion?
• Whiskers:
• The whiskers of the box plot extend from the edges of the box to
indicate the range of the data outside the IQR.
• The whiskers can vary in length and may extend to a certain range
beyond the Q1 and Q3, or they may extend to the minimum and
maximum values.
• Outliers:
• Individual data points beyond the whiskers may be marked as outliers.
The definition of outliers can vary, but they are often identified based
on a certain multiple of the IQR.
Skewness
• Skewness measures the asymmetry of a distribution.
• A distribution can be either positively skewed (tail on the right) or
negatively skewed (tail on the left).
• If the majority of the data is concentrated to the left and the right
tail is longer, the distribution is positively skewed.
• If the majority of the data is concentrated to the right and the left
tail is longer, the distribution is negatively skewed.
• Skewness is often quantified using the third standardized moment.
A skewness value of 0 indicates a perfectly symmetrical
distribution.

$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$
Kurtosis
• Kurtosis measures the "tailedness" or the sharpness of the peak of a
distribution.
• It indicates whether the data have heavy or light tails compared to a normal
distribution.
• A distribution with positive kurtosis (leptokurtic) has heavier tails and a more
peaked central region than a normal distribution.
• A distribution with negative kurtosis (platykurtic) has lighter tails and a flatter
central region than a normal distribution.
• A kurtosis value of 3 is subtracted to measure excess kurtosis, so a normal
distribution has a kurtosis of 0.

$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$$
• The values for skewness and kurtosis between -2 and +2 are considered
acceptable in order to prove normal univariate distribution (George & Mallery,
2010).
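Both statistics can be computed with SciPy; a minimal sketch with hypothetical data. Note that scipy.stats.kurtosis returns excess kurtosis by default, so the −3 adjustment is already built in:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)  # right-skewed hypothetical sample

print(skew(x))      # > 0: long right tail (positively skewed)
print(kurtosis(x))  # excess kurtosis; approximately 0 for a normal distribution
```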
Inferential Statistics
• Example of a Hypothesis:
• Null Hypothesis (H0): "There is no significant difference in test
scores between students who receive tutoring and those who do
not."
• Alternative Hypothesis (Ha): "Students who receive tutoring will
show a significant improvement in test scores compared to those
who do not."
Hypotheses Statement
• H0: The null hypothesis: It is a statement about the population
that either is believed to be true or is used to put forth an
argument unless it can be shown to be incorrect beyond a
reasonable doubt.
• Ha: The alternative hypothesis (research hypothesis): It is a claim
about the population that is contradictory to H0 and what we
conclude when we reject H0.
• Since the null and alternative hypotheses are contradictory, you
must examine evidence to decide if you have enough evidence to
reject the null hypothesis or not. The evidence is in the form of
sample data.
                                  Truth (for population studied)
Decision (based on sample)        Null hypothesis true    Null hypothesis false
Reject null hypothesis            Type I error            Correct decision
Fail to reject null hypothesis    Correct decision        Type II error
LEVEL OF SIGNIFICANCE
Statistics
• Descriptive
• Inferential
  • Parametric tests and their non-parametric counterparts:

    Parametric                Non-Parametric
    t-test                    Mann-Whitney U test
    Pearson's correlation     Chi-square test
    Linear regression         Wilcoxon test
Parametric Test
• Check normality through:
  • skewness and kurtosis values
  • histogram with normality plot
  • Q-Q plot
  • a formal normality test
• If the data are not normal, check for outliers.
• Data must meet the normality assumption before a
parametric test is used.
• Example data (40 observations):
23.6, 21.0, 14.5, 25.8, 17.6, 28.0, 16.0, 22.2, 9.9, 15.0, 16.5, 24.6,
17.6, 16.6, 27.4, 16.2, 39.0, 13.5, 22.2, 32.0, 13.1, 10.7, 16.2,
20.0, 15.0, 8.3, 24.4, 16.7, 19.8, 14.2, 11.0, 18.2, 14.2, 13.5, 19.7,
12.4, 13.1, 18.1, 8.7, 19.9
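These checks can also be run outside SPSS. A minimal SciPy sketch applied to the 40 values above (Shapiro-Wilk is used here as the normality test; the slide does not name which test to run, so that choice is an assumption):

```python
from scipy.stats import shapiro, skew, kurtosis

data = [23.6, 21.0, 14.5, 25.8, 17.6, 28.0, 16.0, 22.2, 9.9, 15.0,
        16.5, 24.6, 17.6, 16.6, 27.4, 16.2, 39.0, 13.5, 22.2, 32.0,
        13.1, 10.7, 16.2, 20.0, 15.0, 8.3, 24.4, 16.7, 19.8, 14.2,
        11.0, 18.2, 14.2, 13.5, 19.7, 12.4, 13.1, 18.1, 8.7, 19.9]

print(skew(data), kurtosis(data))  # both should fall within -2 .. +2
stat, p = shapiro(data)            # H0: data come from a normal distribution
print(p)                           # p > .05 -> no evidence against normality
```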
Common statistical tests and their use
• Tests that describe the relationship between two variables.
• Tests that look for the difference between the means of variables.
• Non-parametric tests: used when the data do not meet the
assumptions required for parametric tests.
t-test
Independent sample t-test
Assumptions
1. Independence of observations, which means that there is
no relationship between the observations in each group
or between the groups themselves.
2. There should be no significant outliers.
3. Dependent variable should be approximately normally
distributed for each group of the independent variable.
4. Homogeneity of Variance: The two populations must
have equal variances. You can test this assumption in
SPSS using Levene’s test for homogeneity of variances.
Independent sample t-test
Example
The scores are listed below, along with boxplots to
visualize the scores recorded at hospital A and hospital B.
Hospital A Hospital B
23 15
41 36
17 25
38 28
16 31
37 26
33 12
40 29
38 33
36 35
22 20
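Before turning to the SPSS tables, the same comparison can be sketched in Python using the hospital scores above (Levene's test checks the equal-variance assumption first):

```python
from scipy.stats import ttest_ind, levene

hospital_a = [23, 41, 17, 38, 16, 37, 33, 40, 38, 36, 22]
hospital_b = [15, 36, 25, 28, 31, 26, 12, 29, 33, 35, 20]

lev_stat, lev_p = levene(hospital_a, hospital_b)  # H0: equal variances
t_stat, p = ttest_ind(hospital_a, hospital_b,
                      equal_var=(lev_p > 0.05))   # Welch's t if variances differ
print(lev_p, t_stat, p)
```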
Output of the independent t-test in SPSS
• SPSS generates two main tables of output for the independent t-test.
SPSS Output

Group Statistics
Condition            N     Mean     Std. Deviation    Std. Error Mean
integrated method    20    85.65    8.242             1.843
traditional method   20    79.45    10.782            2.411
Students taking the integrated course would conduct better quality research
projects than students in the traditional courses.
Example
Research question: Is there a significant difference in the mean
cholesterol concentrations between an exercise-training programme
and a calorie-controlled diet for overweight, physically inactive
males?
H0: There is no significant difference in the mean scores for
the two groups (the exercise-training programme and the calorie-
controlled diet).
DV: cholesterol concentration
IV: treatment (diet group and exercise group)
t-test procedure in SPSS
• Click Analyze > Compare Means > Independent-Samples t Test... on the
top menu, as shown below:
Descriptive Statistics
This table provides useful descriptive statistics for the two groups that you
compared, including the mean and standard deviation.
Looking at the Group Statistics table, we can see that those people who
undertook the exercise trial had lower cholesterol levels at the end of the
programme than those who underwent a calorie-controlled diet.
Independent Samples t-test Table
Example
SPSS output
ANOVA Tests
• Requirements:
• The dependent variable is measured in interval or ratio
scales.
• The independent variable is measured in nominal or
ordinal scales.
• The dependent variable data are normally distributed in
all independent variable groups, with identical
(homogeneous) variances.
• Population and sample means are normally distributed.
Example
• A teacher wants to compare the effectiveness of the six
different techniques of teaching science.
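A one-way ANOVA for such a design can be sketched with SciPy; the scores below are hypothetical stand-ins, since the slide does not show the data:

```python
from scipy.stats import f_oneway

# Hypothetical science scores for six teaching techniques (illustrative only)
t1 = [65, 70, 68, 72, 66]
t2 = [75, 78, 74, 80, 77]
t3 = [60, 62, 65, 59, 63]
t4 = [70, 69, 73, 71, 68]
t5 = [82, 79, 85, 80, 83]
t6 = [67, 66, 70, 64, 69]

f_stat, p = f_oneway(t1, t2, t3, t4, t5, t6)  # H0: all six group means are equal
print(f_stat, p)
```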
SPSS output
Correlation tests
Example of correlation tests
Example
SPSS Output and interpretation
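For readers working outside SPSS, a Pearson correlation can be sketched in Python; the paired values below are hypothetical:

```python
from scipy.stats import pearsonr

# Hypothetical paired measurements (illustrative only)
hours_revised = [2, 4, 5, 7, 8, 10, 11, 13]
test_score    = [52, 58, 60, 68, 70, 77, 80, 88]

r, p = pearsonr(hours_revised, test_score)  # H0: no linear correlation (rho = 0)
print(r, p)
```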
Example of non-parametric tests
Sign Test
Example

Reported height:   68     74     82.2   66.5   69     68     71     70     70     67     68     70
Measured height:   66.8   73.9   74.3   66.1   67.2   67.9   69.4   69.9   68.6   67.9   67.6   68.8
SPSS Output and interpretation

Sign Test (measured height − reported height):
  Exact Sig. (2-tailed) = .006 (binomial distribution used)

Wilcoxon Signed Ranks Test (reported height − measured height):
  Z = −2.595 (based on negative ranks)
  Asymp. Sig. (2-tailed) = .009
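Both results can be reproduced in Python from the height data above. The sign test reduces to a binomial test on the signs of the paired differences (a minimal sketch):

```python
from scipy.stats import wilcoxon, binomtest

reported = [68, 74, 82.2, 66.5, 69, 68, 71, 70, 70, 67, 68, 70]
measured = [66.8, 73.9, 74.3, 66.1, 67.2, 67.9, 69.4, 69.9, 68.6, 67.9, 67.6, 68.8]

diffs = [m - r for m, r in zip(measured, reported)]

# Sign test: binomial test on the count of positive, non-zero differences
pos = sum(d > 0 for d in diffs)
n = sum(d != 0 for d in diffs)
print(binomtest(pos, n, 0.5).pvalue)  # compare with Exact Sig. = .006 above

# Wilcoxon signed-ranks test on the same pairs
stat, p = wilcoxon(measured, reported)
print(p)                              # compare with Asymp. Sig. = .009 above
```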
Hypotheses
H0: Annual energy costs for Westin freezers and
Brand-X freezers are the same.
Ha: Annual energy costs differ for the two brands of
freezers.
Example: Output for Mann-Whitney test

Test Statistics: volumes of the right cordate (grouping variable: group)
  Mann-Whitney U                    18.500
  Wilcoxon W                        73.500
  Z                                 −2.737
  Asymp. Sig. (2-tailed)            .006
  Exact Sig. [2*(1-tailed Sig.)]    .004 (not corrected for ties)
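A Mann-Whitney U test can be sketched in SciPy; the annual energy costs below are hypothetical, illustrating the freezer hypotheses rather than reproducing the output above:

```python
from scipy.stats import mannwhitneyu

# Hypothetical annual energy costs (dollars) for the two freezer brands
westin  = [55.8, 54.9, 57.1, 56.3, 55.2, 56.9, 55.5]
brand_x = [58.4, 59.1, 57.8, 60.2, 58.9, 59.6, 58.1]

u_stat, p = mannwhitneyu(westin, brand_x, alternative="two-sided")
print(u_stat, p)  # H0: the two distributions of energy costs are the same
```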
Kruskal-Wallis test
• The Kruskal-Wallis test is a non-parametric statistical test used to determine
whether there are any statistically significant differences between the medians
of three or more independent groups. It is a non-parametric alternative to one-
way ANOVA.
• Hypothesis statements for the Kruskal-Wallis test can be framed as follows:
SPSS output: Ranks and Test Statistics tables.
Write up as follows:
Students' intelligibility ratings were significantly affected by which language
statistics was taught in, χ²(2) = 6.19, p = .045.
It looks like lectures are more intelligible in Malay than in either English or
Arabic (which are similar to each other).
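A Kruskal-Wallis test can be sketched in SciPy; the intelligibility ratings below are hypothetical stand-ins for the three language groups:

```python
from scipy.stats import kruskal

# Hypothetical intelligibility ratings by language of instruction
malay   = [8, 9, 7, 9, 8, 7]
english = [6, 5, 7, 6, 5, 6]
arabic  = [5, 6, 6, 7, 5, 5]

h_stat, p = kruskal(malay, english, arabic)  # H0: all group medians are equal
print(h_stat, p)  # report as chi-square with k-1 degrees of freedom
```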
Spearman Rank Correlation
• The hypotheses:
  H0: There is no monotonic association between the two variables.
  Ha: There is a monotonic association between the two variables.
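A Spearman rank correlation can be sketched in SciPy; the ranks below are hypothetical:

```python
from scipy.stats import spearmanr

# Hypothetical exam ranks and revision-time ranks for ten students
exam_rank     = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
revision_rank = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

rho, p = spearmanr(exam_rank, revision_rank)  # H0: no monotonic association
print(rho, p)
```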
Example
Source: https://statistics.laerd.com/spss-tutorials/chi-square-test-for-association-using-spss-statistics.php
Solution
This table provides the R and R2 values. The R value represents the simple
correlation and is 0.873 (the "R" Column), which indicates a high degree of
correlation. The R2 value (the "R Square" column) indicates how much of the total
variation in the dependent variable, Price, can be explained by the independent
variable, Income. In this case, 76.2% can be explained, which is very large.
The next table is the ANOVA table, which reports how well the regression equation fits
the data (i.e., predicts the dependent variable)
This table indicates that the regression model predicts the dependent variable
significantly well. Here, p < 0.0005, which is less than 0.05, and indicates that,
overall, the regression model statistically significantly predicts the outcome
variable (i.e., it is a good fit for the data).
• The Coefficients table provides us with the necessary information to predict
price from income, as well as determine whether income contributes
statistically significantly to the model (by looking at the "Sig." column).
Furthermore, we can use the values in the "B" column under the
"Unstandardized Coefficients" column.
Profit = a + b × Sales

Coefficients table (SPSS output)
Model and Required Conditions
• We allow for k independent
variables to potentially be related
to the dependent variable:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

• VIF ≥ 5 implies severe correlation among the independent variables
(a multicollinearity issue).
7. We should not have significant outliers, high-leverage points, or highly influential points.
  i. Outliers can be detected using "casewise diagnostics" and "studentized deleted residuals".
  ii. High-leverage points can be detected using leverage values.
  iii. Influential points can be detected using a measure of influence known as Cook's distance:
a point is influential if its removal causes a large change in the regression coefficients.
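The multicollinearity and influence checks above can be sketched with statsmodels; the data below are simulated and the variable names are illustrative:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated data for a model with k = 2 independent variables
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

# VIF for each predictor (column 0 is the constant)
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)  # VIF >= 5 would signal severe multicollinearity

# Cook's distance flags influential points
cooks_d = model.get_influence().cooks_distance[0]
print(cooks_d.max())
```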