
QUANTITATIVE DATA ANALYSIS
Hazura Mohamed
Center of Software Technology and Management
Faculty of Information Science and Technology
Universiti Kebangsaan Malaysia
[email protected]
Purpose of Data Analysis

• To describe or summarize data.
• To search for consistent patterns or themes among data.
• To enable you to answer your research questions.
Variable
For effective statistical data analysis, a crucial prerequisite is a
comprehensive understanding of variables and the aspects that
should be measured through their analysis.

VARIABLES - Characteristics that can take on different values,


such as height, age, temperature, or test scores.

To select suitable statistical tests and interpret study results


accurately, it is essential to identify the types of variables you
are working with.
Types of Variables
Qualitative Variables

Qualitative variables, also known as categorical variables, are types of


variables that represent categories or groups. These variables describe
qualities or characteristics that cannot be measured with a numerical value.
Instead, they are typically expressed in words or labels.
Example : hair color, religion, race, gender, social status, method of
payment, and so on.
Quantitative Variables
• Quantitative variables, also known as numerical variables, are types of variables that represent measurable quantities with numerical values. These variables can be discrete or continuous.
1. Discrete - These are quantitative
variables that take on a finite or countable
number of distinct values. Examples include
the number of children in a family, the
count of cars in a parking lot, or the number
of items purchased.
2. Continuous - Continuous variables can take
on an infinite number of possible values
within a given range. They are often
associated with measurements that can be
more precise. Examples include height,
weight, temperature, or time.
Independent and Dependent Variables

• Independent variables
• "predictor variable“, “explanatory variable”
• The variable that you believe will influence
your outcome measure. Can be
manipulated or controlled, or changed.
• Studies to see its relationship or effects.

• Dependent variables
• "predicted variable", "measured variable"
• The variable that is dependent on or
influenced by the independent variable(s).
A dependent variable may also be the
variable you are trying to predict.
Example
• The tutor wants to know why some students perform
better than others. The tutor thinks that it might be
because of two reasons.
• 1. some students spend more time revising for the test
• 2. some students are naturally more intelligent than
others.
• The tutor decides to investigate the effect of revision time
and intelligence on the test performance of the 100
students. What are the dependent and independent
variables for the study?
Solution

• DV: Test mark (measured from 0 to 100)


• IV: Revision time (measured in hours), Intelligence
(measured using IQ score)
Example

• Identify the dependent and independent variables for the


following examples
• 1. A study of teacher-student classroom interaction at
different levels of schooling.
• 2. A comparative study of the professional attitudes of
secondary school teachers by gender.
Solution

• 1. IV: level of schooling.


• DV: Score on a classroom observation (measuring teacher-student interaction)
• 2. IV: gender of the teacher - male, female
• DV: score on a professional attitude scale.
Level of Measurement
- The level of measurement, also known as the scale of
measurement or scale of data.
- The level of measurement is an important concept in statistics
because it determines the types of analyses that can be
performed on a given set of data and the inferences that can be
drawn from those analyses.
- There are four main levels of measurement.
Nominal
- Nominal data represent categories (two or more)
or labels with no inherent order or ranking.
- Operations like counting and assigning
frequencies are applicable, but mathematical
operations such as addition or subtraction are not
meaningful.
- Examples include gender (male, female), colors
(red, blue, green), or types of fruit (apple, banana,
orange).
Ordinal
- Ordinal data represent categories with a meaningful order or
ranking.
- Rank order is meaningful, but the differences between ranks may not be uniform, so we cannot say exactly how large the difference between two ranks is.
- Examples
Socioeconomic status: Low income, medium income, high income
Workplace status: Entry Analyst, Analyst I, Analyst II, Lead Analyst
Degree of pain: Small amount of pain, medium amount of pain,
high amount of pain
Educational levels: high school, bachelor's, master's
Interval

- Interval data have a consistent interval between points on


the scale, but they lack a true zero point.
- A zero in an interval variable does not represent the
absence of the quantity; it is simply a point on the scale.
- Examples include temperature measured in Celsius or
Fahrenheit.
- Arithmetic operations such as addition and subtraction are
meaningful, but multiplication and division are not, as there
is no true zero.
Ratio

-Ratio variables also have numerical values and a consistent


interval between points.
-The crucial difference is that ratio variables have a true zero
point, indicating the complete absence of the quantity being
measured. For example, if you have a variable like weight, a
weight of 0 means the absence of weight.
-Examples of ratio variables include height, weight, income,
and age (when measured from birth).
Activity
1) Researcher A wants to examine if a woman’s calcium consumption
is related to large foot size. Calcium is measured in milligrams, and
foot size is measured in centimetres. Researcher A hypothesises
that calcium significantly affects foot size.
2) Researcher B wants to know if a man’s consumption of orange
juice is related to an increase in male pattern baldness.
Consumption of orange juice is measured in millilitres, and male
pattern baldness is measured on a scale of 1-3 (1=totally bald,
2=some balding, 3=no balding). Researcher B hypothesises that
orange juice significantly affects male pattern baldness.
3) Researcher C wants to know if pet type is related to happiness. Pet
type is measured on a scale of 1-5 (1=cat, 2=dog, 3=bird, 4=fish,
5=other). Happiness is measured on a scale of 1-3 (1=not happy,
2=somewhat happy, 3=very happy). Researcher C hypothesises
that pet type significantly affects the level of happiness.
Questions
• Determine the dependent and independent variables in these studies.
• What is the level of measurement for each variable?
Types of Statistics

1. Descriptive Statistics
2. Inferential Statistics
DESCRIPTIVE STATISTICS
Descriptive Statistics
• Descriptive statistics refers to the branch of statistics that
involves summarizing and presenting data in a meaningful
and informative way.
• The primary goal of descriptive statistics is to organize,
simplify, and describe the main features of a dataset. This
involves using various numerical (frequency or percentage,
mean, mode median, standard deviation) and graphical
techniques (bar chart, pie chart, histogram, box plot) to
convey the essential characteristics of the data.
• Can only be used to describe the group that is being studied.
• The results cannot be generalized to any larger group.
• Descriptive statistics uses two tools to organize and describe
data. These are given as follows:
• Central tendency is simply the location of the middle in a
distribution of scores.
• Measures of dispersion describe the spread of the data, or its variation around a central value.
Measures of central tendency
Mean
• The sum of all the scores divided by the number of
scores. Often referred to as the average.
• Good measure of central tendency.
• The mean can be misleading because it can be
greatly influenced by extreme scores (very high, or
very low scores).
• Extreme cases or values are called outliers.
• Used for interval and ratio data.
Measures of central tendency
Median
• Median is the middle number in a set of data when
the data is arranged in numerical order.
• Half the scores are above the median and half are
below the median.
• Sometimes the median may yield more information
when your distribution contains outliers or is
skewed (not normally distributed).
• Used for the ordinal, interval and ratio data.
Measures of central tendency
Mode
• The mode is the number that occurs the most.
• Not recommended as the only measure of central
tendency.
• Distributions can have more than one mode, called “multimodal.”
• Used for nominal and ordinal data.
Measures of Spread
Range
• The range is the difference between the largest and
the smallest observation in the data.
• Easy to calculate.
• Sensitive to outliers (highly affected by outliers)
and does not use all the observations in a data set
(Swinscow TD, 2003).
• It is more informative to provide the minimum and
the maximum values rather than providing the
range.
Measures of Spread
Interquartile Range (IQR)
• The difference between the “third quartile” (75th
percentile) and the “first quartile” (25th
percentile). So, the “middle-half” of the values.
• IQR = Q3-Q1
• Quartile deviation = IQR/2
• According to Norizan (2003), high consensus, if QD
≤ 0.5
• Robust to outliers or extreme observations.
• Works well for skewed data.
Measures of Spread
Variance
• Measures average squared deviation of data points
from their mean.
• If measuring the variance of a population, it is denoted by σ² (“sigma-squared”).
• If measuring the variance of a sample, it is denoted by s² (“s-squared”).
• Highly affected by outliers. Best for symmetric
data.
• Problem is units are squared.
Measures of Spread
Standard Deviation
• Measures average deviation of data points from
their mean.
• Sample standard deviation is square root of sample
variance, and so is denoted by s.
• Units are the original units.
• Larger standard deviation → greater variability
• Also, highly affected by outliers.
Measures of Spread
Coefficient of Variation
• Ratio of sample standard deviation to sample mean
multiplied by 100.
• Measures relative variability, that is, variability
relative to the magnitude of the data.
• Unitless, so good for comparing variation between
two groups.
Choosing Appropriate
Measure of Variability
• If data are symmetric, with no serious outliers, use
range and standard deviation.
• If data are skewed, and/or have serious outliers,
use IQR.
• If comparing variation across two data sets, use
coefficient of variation.
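
As an illustration, a minimal Python sketch (numpy only; the scores are hypothetical, not from the slides) computing the measures of spread discussed above:

```python
import numpy as np

scores = np.array([56, 61, 63, 65, 68, 70, 72, 75, 79, 94])  # hypothetical test scores

mean = scores.mean()
data_range = scores.max() - scores.min()   # sensitive to outliers
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1                              # robust to outliers; works well for skewed data
sd = scores.std(ddof=1)                    # sample standard deviation (divides by n - 1)
cv = sd / mean * 100                       # coefficient of variation: unitless, good for comparing groups

print(f"range={data_range}, IQR={iqr:.1f}, SD={sd:.2f}, CV={cv:.1f}%")
```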
Box Plot
• A box plot, also known as a box-and-whisker plot, is
a graphical representation that provides a visual
summary of the distribution of a dataset.
• It includes 5 key summary statistics, and the box
itself is directly related to the measure of
dispersion, specifically the interquartile range (IQR).
How a box plot relates to the measure of
dispersion?
• Box (Interquartile Range - IQR):
• The box in a box plot represents the interquartile range (IQR), which is
the range of values between the first quartile (Q1) and the third
quartile (Q3).
• The lower edge of the box corresponds to Q1, and the upper edge
corresponds to Q3.
• The length of the box (height if the box is oriented vertically) is
proportional to the IQR, providing a visual representation of the spread
of the middle 50% of the data.
• Median (Middle Line in the Box):
• The median (Q2) is typically represented by a line inside the box. It
shows the center of the distribution.
How a box plot relates to the measure of
dispersion?

• Whiskers:
• The whiskers of the box plot extend from the edges of the box to
indicate the range of the data outside the IQR.
• The whiskers can vary in length and may extend to a certain range
beyond the Q1 and Q3, or they may extend to the minimum and
maximum values.
• Outliers:
• Individual data points beyond the whiskers may be marked as outliers.
The definition of outliers can vary, but they are often identified based
on a certain multiple of the IQR.
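
A short sketch of how the box and whiskers are typically derived, assuming the common 1.5 × IQR fence convention (other conventions exist, as noted above); the data are hypothetical:

```python
import numpy as np

data = np.array([12, 15, 17, 18, 19, 20, 21, 23, 24, 45])  # hypothetical values

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; points outside are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}, outliers={outliers}")
```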
Skewness
• Skewness measures the asymmetry of a distribution.
• A distribution can be either positively skewed (tail on the right) or
negatively skewed (tail on the left).
• If the majority of the data is concentrated to the left and the right tail is longer, the distribution is positively skewed.
• If the majority of the data is concentrated to the right and the left tail is longer, the distribution is negatively skewed.
• Skewness is often quantified using the third standardized moment.
A skewness value of 0 indicates a perfectly symmetrical
distribution.

skewness = Σ (xᵢ − x̄)³ / (n s³), where the sum runs over i = 1 to n and s is the sample standard deviation.
Kurtosis
• Kurtosis measures the "tailedness" or the sharpness of the peak of a
distribution.
• It indicates whether the data have heavy or light tails compared to a normal
distribution.
• A distribution with positive kurtosis (leptokurtic) has heavier tails and a more
peaked central region than a normal distribution.
• A distribution with negative kurtosis (platykurtic) has lighter tails and a flatter
central region than a normal distribution.
• A kurtosis value of 3 is subtracted to measure excess kurtosis, so a normal distribution has a kurtosis of 0.

kurtosis = Σ (xᵢ − x̄)⁴ / (n s⁴) − 3, where the sum runs over i = 1 to n.
• The values for skewness and kurtosis between -2 and +2 are considered
acceptable in order to prove normal univariate distribution (George & Mallery,
2010).
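
A minimal scipy sketch of these two statistics on synthetic data; note that scipy.stats.kurtosis already reports excess kurtosis (normal → 0), matching the formula above:

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(loc=50, scale=10, size=200)  # synthetic, roughly normal

skew = stats.skew(x)              # third standardized moment
excess_kurt = stats.kurtosis(x)   # Fisher definition: 3 already subtracted

# Rule of thumb from the slides (George & Mallery, 2010): values in [-2, 2] are acceptable
acceptable = -2 <= skew <= 2 and -2 <= excess_kurt <= 2
print(f"skewness={skew:.2f}, excess kurtosis={excess_kurt:.2f}, acceptable={acceptable}")
```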
Inferential Statistics

• Inferential statistics is a branch of statistics that involves drawing


conclusions and making inferences about a population based on a
sample of data taken from that population. The goal of inferential
statistics is to make predictions, generalize findings, and test
hypotheses about a population using the information gathered from
a representative sample.
• Population: Group that the researcher wishes to study.
• Sample: A group of individuals selected from the population.
• Census: Gathering data from all units of a population, no sampling.
• Inferential statistics generally require that data come from a
random sample.
• In a random sample each person/object/item of the
population has an equal chance of being chosen.
Hypothesis
A hypothesis is a specific, testable statement or proposition about a
phenomenon or relationship between variables that is made to guide
empirical research. It is a fundamental element in the scientific method
and is used in various fields, including science, social science, and
research.

• “A hypothesis can be defined as a tentative explanation of the research


problem, a possible outcome of the research, or an educated guess
about the research outcome” (Sarantakos, 1993: 1991).

• “A hypothesis is a statement or explanation that is suggested by


knowledge or observation but has not, yet, been proved or disproved”
(Macleod Clark J and Hockey L 1981).
Characteristics of a hypothesis
• Testability:
• A hypothesis must be formulated in a way that allows it to be tested and
potentially falsified through empirical observations or experimentation.
• Clear and Specific:
• It should be clear and specific, outlining the expected relationship between
variables or the predicted outcome.
• Based on Existing Knowledge:
• A hypothesis is typically grounded in existing knowledge, theories, or
observations. It represents an educated guess or prediction about a
phenomenon.
• Potential for Refutation:
• A good hypothesis allows for the possibility of being proven wrong or
refuted based on evidence. This is essential for scientific rigor.
• Directs Research:
• The formulation of a hypothesis guides the research process, helping
researchers design experiments, collect data, and analyze results.
HYPOTHESIS TESTING

Formulating and testing hypotheses about the characteristics of


a population based on sample data. Involves statistical tests
that help determine whether observed differences or
relationships are statistically significant or could have occurred
by chance.
Statistical hypothesis

• A verbal statement, or claim, about a population


parameter.
• The actual test begins by considering two hypotheses:
null hypothesis and alternative hypothesis.
• These hypotheses contain opposing viewpoints.

• Example of a Hypothesis:
• Null Hypothesis (H0): "There is no significant difference in test
scores between students who receive tutoring and those who do
not."
• Alternative Hypothesis (Ha): "Students who receive tutoring will
show a significant improvement in test scores compared to those
who do not."
Hypotheses Statement
• H0: The null hypothesis: It is a statement about the population
that either is believed to be true or is used to put forth an
argument unless it can be shown to be incorrect beyond a
reasonable doubt.
• Ha: The alternative hypothesis (research hypothesis): It is a claim
about the population that is contradictory to H0 and what we
conclude when we reject H0.
• Since the null and alternative hypotheses are contradictory, you
must examine evidence to decide if you have enough evidence to
reject the null hypothesis or not. The evidence is in the form of
sample data.

• After you have determined which hypothesis the sample


supports, you make a decision.
• There are two options for a decision. They are “reject H0” if the
sample information favors the alternative hypothesis or “do not
reject H0” or “decline to reject H0” if the sample information is
insufficient to reject the null hypothesis.
Outcome, Type I & Type II Error
• In every hypothesis test, the outcomes are dependent on a
correct interpretation of the data.
• Incorrect calculations or misunderstood summary statistics
can yield errors that affect the results.
• A Type I error occurs when a true null hypothesis is rejected.
• A Type II error occurs when a false null hypothesis is not
rejected.
Outcome, Type I & Type II Error
• The power of the test, 1 − β, is the probability that the test correctly rejects a false null hypothesis. A high power is desirable.
• α (alpha) = probability of a Type I error = P(Type I error) = probability of rejecting the null hypothesis when the null hypothesis is true.
• Also called the significance level: a value determined by the researcher in order to reject or retain the null hypothesis. It is a pre-determined value, not calculated.
• β (beta) = probability of a Type II error = P(Type II error) = probability of not rejecting the null hypothesis when the null hypothesis is false.
Example
• Suppose the null hypothesis, H0, is: Trank’s rock climbing
equipment is safe.
• Type I error: Trank thinks that his rock climbing equipment
may not be safe when, in fact, it really is safe.
• Type II error: Trank thinks that his rock climbing equipment
may be safe when, in fact, it is not safe.
Type I and Type II Error

Decision (based on sample)         Truth: null hypothesis true    Truth: null hypothesis false
Reject null hypothesis             Type I error                   Correct decision
Fail to reject null hypothesis     Correct decision               Type II error
LEVEL OF SIGNIFICANCE

In a hypothesis test, the level of significance (α) is your maximum


allowable probability of making a type I error.

By setting the level of significance at a small value, you are saying


that you want the probability of rejecting a true null hypothesis to
be small.

Commonly used levels of significance:

α = 0.10, α = 0.05, α = 0.01


P-VALUES
P-value or probability value is the probability of obtaining sample results at least as extreme as those observed, assuming the null hypothesis (Ho) is true.
The P-value of a hypothesis test depends on the nature of
the test.

Rule: reject Ho if p-value ≤ α


Type of statistical test

Statistics divides into descriptive and inferential statistics. Inferential tests are either parametric or non-parametric:

Parametric               Non-Parametric
T-test                   Mann-Whitney U test
ANOVA                    Kruskal-Wallis test
Pearson’s correlation    Chi-square test
Linear regression        Wilcoxon test
Parametric Test

Parametric tests assume that the variable under study comes from a normal distribution. Most parametric tests require an interval or ratio level of measurement.
Non-Parametric Test

Non-parametric tests do not require the assumption of normality. Most non-parametric tests are used with nominal/ordinal level data.
Normality of data

• How do you know if data are normally distributed?

• Through
• skewness and kurtosis value
• Histogram with normality plot
• If not normal check for the outliers.
• Q-Q plot
• Run normality test
• Meeting the normality assumption is required for parametric tests.

Understanding Normal Distribution


https://www.youtube.com/watch?v=mtH1fmUVkfE

Checking Normality Via Excel


https://www.youtube.com/watch?v=mGFuE4AeVXc
Normality Test
• Many statistical tests, like the calculation of confidence
intervals and tests of statistical hypotheses, require the
sample data to follow the normal distribution. In other
words, it is essential that the populations from which the
samples were derived, follow the normal distribution.
• So, the normality tests are the first and most important
tests to run in order to analyze the data properly.
Normality Test
• In such tests the null hypothesis and the alternative
hypothesis are:
• H0: The sample comes from a normal population
• Ha: The sample does not come from a normal population

• The principal statistical tests for normality apply the


Kolmogorov-Smirnov and Shapiro-Wilk criteria, of which
the latter is considered more robust.
Normality Test
• SPSS uses both criteria in order to calculate the p-value.
• When the p-value is larger than α = 0.05, we accept that
the samples follow the normal distribution, whereas when
the p-value is smaller than α = 0.05, we reject the null
hypothesis, that is, we accept that the sample is not
normal.
Solution
❖ We go: Analyze → Descriptive Statistics →
Explore.
❖ We insert the variable under study to the
box Dependent List and click on Plots.
❖ In the Boxplots panel we click on the option
None, in the Descriptive panel we
deactivate the option Stem-and-leaf and
select Normality plots with tests.
Test for Normality
• There are several different tests that can be used to test the following hypotheses:
Ho: The distribution is normal
HA: The distribution is NOT normal
• Common tests of normality include:
Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, and Lilliefors.
• Problem: THEY DON’T ALWAYS AGREE!!
• Shapiro-Wilk test is the most powerful normality test, followed
by Anderson-Darling test, Lilliefors test and Kolmogorov-Smirnov
test. However, the power of all four tests is still low for small
sample size (Razali & Wah, 2011).
• The Shapiro–Wilk test is the more appropriate method for small sample sizes (<50), although it can also handle larger samples, while the Kolmogorov–Smirnov test is used for n ≥ 50.
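
Both tests are also available outside SPSS. A minimal scipy sketch on an illustrative sample (note that a Kolmogorov-Smirnov test against a normal fitted from the same sample strictly calls for the Lilliefors correction):

```python
import numpy as np
from scipy import stats

x = np.array([23.6, 21.0, 14.5, 25.8, 17.6, 28.0, 16.0, 22.2, 9.9, 15.0])  # illustrative

w, p_sw = stats.shapiro(x)            # Shapiro-Wilk: preferred for n < 50

z = (x - x.mean()) / x.std(ddof=1)    # standardize before K-S against N(0, 1)
d, p_ks = stats.kstest(z, "norm")

print(f"Shapiro-Wilk p={p_sw:.3f}, K-S p={p_ks:.3f}")  # p > 0.05: do not reject normality
```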
Results

We observe that the values Sig. (p-value) = 0.72


for the Shapiro-Wilk test and 0.2 for the
Kolmogorov-Smirnov test are larger than 0.05,
therefore we cannot reject the null hypothesis,
that is, we accept that the sample is normal.
Example
The following values express the lengths (in cm) of metal knives
from an archaeological site:

23.6, 21.0, 14.5, 25.8, 17.6, 28.0, 16.0, 22.2, 9.9, 15.0, 16.5, 24.6,
17.6, 16.6, 27.4, 16.2, 39.0, 13.5, 22.2, 32.0, 13.1, 10.7, 16.2,
20.0, 15.0, 8.3, 24.4, 16.7, 19.8, 14.2, 11.0, 18.2, 14.2, 13.5, 19.7,
12.4, 13.1, 18.1, 8.7, 19.9

Examine whether the sample follows the normal distribution.


Results

The histogram of the data shows a lack of symmetry, which


suggests that the sample is not normal, that is, it does not
originate from a population that follows the normal distribution.
Results

❖ The Q-Q plot does


not help us decide if
the sample is
normal since most
points lie on the line
but there are some
that deviate.
Results
This is confirmed by the Shapiro-Wilk criterion, which has a p-value
= 0.026 < 0.05. In contrast the Kolmogorov-Smirnov criterion gives
a p-value = 0.133 > 0.05. However, as mentioned, this criterion is
less robust than the Shapiro-Wilk and this example proves it.
Types of Analyses
Univariate analysis- the analysis of one variable.
• Frequency/percentage
• Mean
• Median
• Mode
• Range
• Standard deviation
Types of Analyses
• Bivariate analysis is a kind of data analysis that explores
the association between two variables.
• Some examples of bivariate analysis include:
-Pearson’s correlation
-T-Test
-Spearman’s Rho
-Mann-Whitney Test
-Linear regression (not multiple regression)

• Example: Are height and weight correlated?


Types of Analyses
• Multivariate analysis: the analysis of more than two
variables.

• Examples of multivariate analysis include:


-Multiple regression

• Example: Do age, diet, exercise, and diabetes


predict heart disease?
• Selecting the appropriate statistical test requires several steps.
• The level of variable is a major component in deciding what
test to use.
Test selection should be based on:
1. What is your goal: Description? Comparison? Prediction?
Quantify association? Prove effectiveness? Prove causality?
2. What kind of data have you collected?
3. Is your data normally distributed? Can you use a parametric
or non-parametric test?
4. What are the assumptions of the statistical test you would
like to use? Does the data meet these assumptions?
Common statistical tests and their use

Use: describe the relationship between two variables.

Type of Test           Use
Pearson correlation    Tests for the strength of the association between two continuous variables
Spearman correlation   Tests for the strength of the association between two ordinal variables (does not rely on the assumption of normally distributed data)
Chi-square             Tests for the strength of the association between two categorical variables
Common statistical tests and their use

Use: look for the difference between the means of variables.

Type of Test         Use
Paired t-test        Tests for a difference between two related variables
Independent t-test   Tests for a difference between two independent groups
ANOVA                Tests the difference between group means after any other variance in the outcome variable is accounted for
Common statistical tests and their use

Use: for predicting the unknown value of a variable from the known value of one or more variables.

Type of Test          Use
Simple regression     For determining how one variable of interest (the response/dependent variable) is affected by changes in one variable (the explanatory/independent variable)
Multiple regression   For determining how one variable of interest (the response/dependent variable) is affected by changes in two or more variables (the explanatory/independent variables)
Common statistical tests and their use

Use: non-parametric tests are used when the data do not meet the assumptions required for parametric tests.

Type of Test                                  Use
Wilcoxon rank-sum test or Mann–Whitney test   Tests for a difference between two independent groups; takes into account magnitude and direction of difference
Wilcoxon sign-rank test                       Tests for a difference between two related variables; takes into account magnitude and direction of difference
Sign test                                     Tests if two related variables are different; ignores magnitude of change, only takes into account direction
Example of Tests

t-test

• Allows the comparison of mean of two groups.


• Independent sample t-test
• Paired-sample t-test

Independent sample t-test

- The independent-samples t-test compares the means


between two unrelated groups on the same
dependent variable.
- Dependent variable is measured at the interval or
ratio level.
- Independent variable should consist of two
categories.
- e.g: Employer want to know whether first year
graduate salaries differed based on gender
- DV : first year graduate salaries
- IV : gender, which has two groups: male and female.

Assumptions
1. Independence of observations, which means that there is
no relationship between the observations in each group
or between the groups themselves.
2. There should be no significant outliers.
3. Dependent variable should be approximately normally
distributed for each group of the independent variable.
4. Homogeneity of Variance: The two populations must
have equal variances. You can test this assumption in
SPSS using Levene’s test for homogeneity of variances.

Independent sample t-test

Example

“A computerized records system has been recently introduced


into all of the out-patient departments in hospital A. A
researcher administers a questionnaire to departmental
receptionists regarding the ease of locating and updating
patients records. Total scores can range from 50 (most positive
response) to zero (most negative response). The researcher
administers the same questionnaire to staff doing similar duties
in nearby hospital B, which does not yet have the new system”

The scores are listed on the next slide along with boxplots to
visualize the scores recorded at hospital A and hospital B
Hospital A Hospital B
23 15
41 36
17 25
38 28
16 31
37 26
33 12
40 29
38 33
36 35
22 20

Inspection of the boxplots suggests that Hospital A


(with the new computerized system) does improve the
ease of locating and updating patients records
Independent sample t-test
– Null hypothesis: The new computerized record system
makes no significant difference to the ease of locating and
updating patients records
– Alternative Hypothesis: The new computerized records
system significantly improves the ease of locating and
updating patients records

– Descriptive statistical analysis (i.e. the boxplots) suggests


hypothesis is true, but only for the sample
– We now need to perform a test to examine the hypothesis
and determine if sample results are of sufficient
significance to apply to a population (i.e. all hospitals) and
also determine with what level of confidence (probability)
that the results apply
Independent sample t-test
•What test do we use for “computerized hospital
records system” study?
–Our boxplots show frequency distributions with well
defined mean values so we select a parametric test (although
strictly speaking the scores for Hospital A are skewed and
not normal)
–We follow the “I’m examining differences between groups
on one or more variables” branch,
–The same participants are not being tested more than once
We are dealing with two groups (hospital A
receptionists and hospital B receptionists)
–So we choose “t-test for independent samples”
Independent sample t-test
• Having chosen the parametric t-test for
independent samples we use SPSS to run the test
Independent sample t-test

Select hospital independent variable and click right arrow


to transfer into the “Grouping Variable” field
Independent sample t-test

Click “Define Groups” to specify the groups (samples)


for which we want to run the t-test
Independent sample t-test

Enter the group (sample) nominal identifier values, 1


= “Hospital A”, 2 = “Hospital B”
Independent sample t-test

Select score dependent variable and click right arrow to


transfer into the “Test Variable(s)” box
Independent sample t-test

Finally click OK to run the test


•Interpreting the test results
–Shaded column “Sig. (2-tailed)” gives the p-value of the test
–If we take the “Equal variances assumed” row, p > 0.05
Conclusion: A computerized records system did not help
in ease of locating and updating patients records.
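
For readers who prefer to cross-check the SPSS result, a minimal scipy sketch of the same test on the hospital scores listed earlier (Levene's test first, to decide between the pooled and Welch versions):

```python
from scipy import stats

hospital_a = [23, 41, 17, 38, 16, 37, 33, 40, 38, 36, 22]
hospital_b = [15, 36, 25, 28, 31, 26, 12, 29, 33, 35, 20]

# Levene's test for homogeneity of variances
_, p_levene = stats.levene(hospital_a, hospital_b)

# Independent-samples t-test; fall back to Welch's test if variances differ
t, p = stats.ttest_ind(hospital_a, hospital_b, equal_var=p_levene > 0.05)
print(f"Levene p={p_levene:.3f}, t={t:.3f}, p={p:.3f}")
```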
Example
Independent – samples t – test
• A study to determine the effectiveness of an integrated
statistics/experimental methods course as opposed to the
traditional method of taking the two courses separately was
conducted.
• It was hypothesized that the students taking the integrated
course would conduct better quality research projects than
students in the traditional courses as a result of their integrated
training.
• Hypotheses
• Ho : there is no significant difference in the score
means of those students in integrated course and
traditional course.
• Ha : there is a significant difference in the score means
of those students in integrated course and traditional
course.
Output of the independent t-test in SPSS
However, since you should have tested your data for these
assumptions, you will also need to interpret the SPSS output that was
produced when you tested for them (i.e., you will have to interpret:
(a) the boxplots you used to check if there were any significant
outliers;
(b) the output SPSS produces for your Shapiro-Wilk test of
normality to determine normality; and
(c) the output SPSS produces for Levene's test for
homogeneity of variances).

Output of the independent t-test in SPSS
• SPSS generates two main tables of output for the independent t-test.

• If your data passed assumption #2 (i.e., there were no significant


outliers), assumption #3 (i.e., your dependent variable was
approximately normally distributed for each group of the
independent variable) and assumption #4 (i.e., there was
homogeneity of variances) you will only need to interpret these two
main tables.

Output SPSS

Group Statistics
Condition            N    Mean    Std. Deviation   Std. Error Mean
integrated method    20   85.65    8.242           1.843
traditional method   20   79.45   10.782           2.411

Independent Samples Test
                              Levene's Test          t-test for Equality of Means
                              F       Sig.    t       df       Sig. (2-tailed)   Mean Diff.   Std. Error   95% CI (Lower, Upper)
Equal variances assumed       3.880   .056    2.043   38       .048              6.200        3.035        (.057, 12.343)
Equal variances not assumed                   2.043   35.551   .049              6.200        3.035        (.043, 12.357)

Students taking the integrated course would conduct better quality research
projects than students in the traditional courses.
Example
Research question: Is there a significant difference in the mean
cholesterol concentrations for an exercise training programme
and a calorie-controlled diet for overweight, physically inactive
males?
Ho : There is no significant difference in the mean scores for the two groups (an exercise-training programme and a calorie-controlled diet).
DV : cholesterol concentrations
IV : Treatment ( diet group and exercise group)

t-test procedure in SPSS
• Click Analyze > Compare Means > Independent-Samples t Test... on the
top menu, as shown below:

Descriptive Statistics
This table provides useful descriptive statistics for the two groups that you
compared, including the mean and standard deviation.

Looking at the Group Statistics table, we can see that those people who
undertook the exercise trial had lower cholesterol levels at the end of the
programme than those who underwent a calorie-controlled diet.

Independent Samples t-test Table

• This test for homogeneity of variance provides an F statistic and a


significance value (p-value).

• We are primarily concerned with the significance level - if it is


greater than 0.05, our group variances can be treated as equal.

• However, if p < 0.05, we have unequal variances and we have


violated the assumption of homogeneity of variance.

• From the result of Levene's Test for Equality of Variances, we can


reject the null hypothesis that there is no difference in the variances
between the groups and accept the alternative hypothesis that there
is a significant difference in the variances between groups.
Reporting the output of the
independent t-test

• When reporting the result of an independent t-test, you need to


include the t-statistic value, the degrees of freedom (df) and the
significance value of the test (p-value).
• The format of the test result is: t(df) = t-statistic, p = significance
value.
• Therefore, for the example above, you could report the result as
t(7.001) = 2.233, p = 0.061. In this case, we therefore do not accept
the alternative hypothesis and accept that there are no statistically
significant differences between means.
Paired Samples t-Test
• Used when each individual in the sample is measured twice
using the same test before and after a period of time (or the
sample is measured in two different situations), and the two
measurement data are compared.
• You wish to determine whether the difference between mean
for the two sets of scores is the same or different.
• Assumption :
• Interval or ratio measuring scale.
• Random samples
• Normality
• Repeated measurement: each subject in the
sample is measured twice

Example

• We want to examine if preparation course improved student's


score on the test
• The evaluation was made before (pre-course) and after (post-
course).
• The evaluation score is compared for each student.
• Ho : there is no significant difference in students’ scores before and after the preparation course.
• Ha : there is a significant difference in students’ scores before and after the preparation course.
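
A minimal scipy sketch of a paired-samples t-test; the pre/post scores here are hypothetical, since the slide's raw data are not listed:

```python
from scipy import stats

pre  = [55, 60, 62, 58, 65, 70, 59, 61, 64, 66, 57, 63]  # hypothetical pre-course scores
post = [58, 62, 61, 63, 66, 74, 60, 65, 63, 70, 59, 68]  # hypothetical post-course scores

t, p = stats.ttest_rel(pre, post)   # paired-samples (dependent) t-test
print(f"t({len(pre) - 1}) = {t:.3f}, p = {p:.3f}")
```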

SPSS output

As can be seen from the output, there is no significant difference between the pre-test and post-test scores (t(11) = -2.171, p > 0.05). The preparation course did not help.
Exercise
Twenty masters' students were selected for a study to determine whether a
research methodology course improves their data gathering skills.
The students attended the course for one month. They were tested for
requirement data gathering skills before the course began and were retested at
the end of the course. Given the output from the statistical software for the
hypothesis testing.

1. What is your research question?


2. What are the variables in this study?
3. Identify the level of measurement of the variables.
4. Which statistical procedure could we use to test the research question?
5. What is your research hypothesis?
6. Based on the SPSS output, what is your conclusion?
Exercise
In the example below, the same students took both the writing and the
reading test.

Please answer the following questions:


• What is your research question?
• What are the variables in this study?
• Identify the level of measurement of the variables.
• What is your research hypothesis?
• Based on the SPSS output, what is your conclusion?
ANOVA Tests
• Similar to a t-test, ANOVA is concerned with differences in means, but the test can be applied to three or more groups (IV).
• The test is usually applied to interval and ratio types (DV).

ANOVA Tests
• Requirements:
• The dependent variable is measured in interval or ratio
scales.
• The independent variable is measured in nominal or
ordinal scales.
• The dependent variable data are normally distributed in
all independent variable groups and identical variance
values.
• Population and sample means are normally distributed.

Example
• A teacher wants to compare the effectiveness of the six
different techniques of teaching science.

• Ho : there is no significant difference in the effectiveness


of the six techniques for teaching science.
• Ha : there is a significant difference in the effectiveness of
the six techniques for teaching science.
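
A minimal scipy sketch of a one-way ANOVA with hypothetical scores for the six techniques (only four students per group shown here for brevity):

```python
from scipy import stats

# Hypothetical science scores under six teaching techniques
t1 = [72, 75, 71, 69]
t2 = [80, 83, 79, 85]
t3 = [65, 60, 68, 63]
t4 = [74, 77, 73, 70]
t5 = [88, 84, 90, 86]
t6 = [69, 72, 66, 71]

f, p = stats.f_oneway(t1, t2, t3, t4, t5, t6)
print(f"F = {f:.2f}, p = {p:.4f}")   # p < .05: at least one group mean differs
```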

SPSS output

• Reporting the test results
• The ANOVA test shows that there is a significant difference in the effectiveness of the six science teaching techniques, F(5, 114) = 25.3, p < .05.

Correlation tests

• Allows an examination of the relationship between


variables; is there a relationship between these variables?
Are they positively or negatively related?

• A correlation coefficient of 0 means that there is no relationship between the variables; -1 indicates a perfect negative relationship, and 1 a perfect positive relationship.

Correlation tests

• Important: Correlation is not causation.


• Ex. What is the relationship between exercise and
depression?
• Does depression increase when exercise increases?
• Does depression decrease when exercise increases?
• Is there significant correlation between exercise and
depression?

Example of correlation tests

Example

• To identify relationship between the Rosenberg Self-


Esteem Scale and the Assessing Anxiety Scale.

• Ho : there is no significant relationship between self-esteem and anxiety.
• Ha : there is a significant relationship between self-esteem and anxiety.
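
A minimal scipy sketch of a Pearson correlation on hypothetical self-esteem and anxiety scores:

```python
from scipy import stats

self_esteem = [32, 28, 35, 30, 25, 38, 27, 33]  # hypothetical scale scores
anxiety     = [40, 46, 35, 42, 50, 33, 47, 38]  # hypothetical scale scores

r, p = stats.pearsonr(self_esteem, anxiety)
print(f"r = {r:.3f}, p = {p:.3f}")   # negative r: anxiety falls as self-esteem rises
```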

SPSS Output and interpretation

There is a negative correlation between self-esteem and


anxiety, indicating that anxiety decreases as self-esteem
increases (r = -.378, p < 0.05).

Example of non-parametric tests

• Sign test and Wilcoxon Signed- Rank test - Tests of


differences between variables (dependent samples).

• Mann-Whitney U test and Kruskal-Wallis analysis - Tests of


differences between groups (independent samples).

• Spearman R - Tests of relationships between variables.

Sign Test

• The objective is to determine whether there is a difference in


preference between the two items being compared.
• To record the preference data, we use a plus sign if the individual
prefers one brand and a minus sign if the individual prefers the
other brand.
• Because the data are recorded as plus and minus signs, this test is
called the sign test.
Example
• Is there any difference between reported height and
measured height?
• A random sample of 12 male students provide the data.
• Each student reported his height, then his height was measured.
• 0.05 significance level was used to test the claim that there
is no difference between reported height and measured
height.

Example
Reported 68 74 82.2 66.5 69 68 71 70 70 67 68 70
height
Measured 66.8 73.9 74.3 66.1 67.2 67.9 69.4 69.9 68.6 67.9 67.6 68.8
height

Ho: there is no significant difference between reported heights and measured heights.
Ha: there is a significant difference between reported heights and measured heights.
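
The sign test reduces to an exact two-sided binomial test with p = 0.5. A minimal scipy sketch on the heights above (assumes scipy >= 1.7 for binomtest) reproduces the p-value in the SPSS output that follows:

```python
import numpy as np
from scipy import stats

reported = np.array([68, 74, 82.2, 66.5, 69, 68, 71, 70, 70, 67, 68, 70])
measured = np.array([66.8, 73.9, 74.3, 66.1, 67.2, 67.9, 69.4, 69.9, 68.6, 67.9, 67.6, 68.8])

signs = np.sign(reported - measured)
signs = signs[signs != 0]               # drop ties: they carry no sign information
n_plus = int((signs > 0).sum())

result = stats.binomtest(n_plus, n=len(signs), p=0.5)   # two-sided by default
print(f"exact two-sided p = {result.pvalue:.3f}")       # ~0.006
```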

SPSS Output and interpretation
Test Statistics (Sign Test)
measured height - reported height
Exact Sig. (2-tailed): .006 (binomial distribution used)

Significant value, p < 0.05. Reject Ho.
There is sufficient evidence to reject the claim that there is no difference between the reported and measured heights.
Wilcoxon Signed-Rank Test

• This test is the nonparametric alternative to the


parametric matched-sample test.

• This test is nonparametric method for determining


whether there is a difference between two populations.
Example: Express Deliveries

• A firm has decided to select one of two express delivery


services to provide next-day deliveries to the district
offices.
• To test the delivery times of the two services, the firm
sends two reports to a sample of 10 district offices, with
one report carried by one service and the other report
carried by the second service.
• Do the data (delivery times in hours) on the next slide
indicate a difference in the two services?
Example: Express Deliveries
District Office Overnight NiteFlite
Seattle 32 hrs. 25 hrs.
Los Angeles 30 24
Boston 19 15
Cleveland 16 15
New York 15 13
Houston 18 15
Atlanta 14 15
St. Louis 10 8
Milwaukee 7 9
Denver 16 11
Example: Express Deliveries
District Office   Differ.   Rank of |Diff.|   Signed Rank
Seattle              7        10                +10
Los Angeles          6         9                 +9
Boston               4         7                 +7
Cleveland            1         1.5               +1.5
New York             2         4                 +4
Houston              3         6                 +6
Atlanta             -1         1.5               -1.5
St. Louis            2         4                 +4
Milwaukee           -2         4                 -4
Denver               5         8                 +8
Sum of signed ranks: +44
Example: Express Deliveries
• Hypotheses
H0: there is no significant difference in delivery times of
the two services.
Ha: there is a significant difference in delivery times of the
two services; recommend the one with the smaller times.
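
A minimal scipy sketch of the same test on the delivery times above:

```python
from scipy import stats

overnight = [32, 30, 19, 16, 15, 18, 14, 10, 7, 16]
niteflite = [25, 24, 15, 15, 13, 15, 15, 8, 9, 11]

# Wilcoxon signed-rank test on the paired differences
w, p = stats.wilcoxon(overnight, niteflite)
print(f"W = {w}, p = {p:.3f}")
```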
Example: Output for Wilcoxon signed-rank test

Test Statistics (Wilcoxon Signed Ranks Test)
reported height - measured height
Z: -2.595 (based on negative ranks)
Asymp. Sig. (2-tailed): .009

Conclusion: p < 0.05; therefore, reject H0. The delivery times differ between the two services; recommend the one with the smaller times.
Mann-Whitney Test

• This test is the nonparametric alternative to the


parametric independent t-test.

• This test is another nonparametric method for


determining whether there is a difference between two
populations.
Example: Westin Freezers

Manufacturer labels indicate the annual energy cost


associated with operating home appliances such as
freezers.
The energy costs for a sample of 10 Westin freezers and a
sample of 10 Brand-X Freezers are shown on the next slide.
Do the data indicate, using α = .05, that a difference exists
in the annual energy costs associated with the two brands
of freezers?
Example: Westin Freezers
Westin Freezers Brand-X Freezers
$55.10 $56.10
54.50 54.70
53.20 54.40
53.00 55.40
55.50 54.10
54.90 56.00
55.80 55.50
54.00 55.00
54.20 54.30
55.20 57.00
Example: solution

Hypotheses
H0: Annual energy costs for Westin freezers and
Brand-X freezers are the same.
Ha: Annual energy costs differ for the two brands of
freezers.
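
A minimal scipy sketch on the freezer data above:

```python
from scipy import stats

westin  = [55.10, 54.50, 53.20, 53.00, 55.50, 54.90, 55.80, 54.00, 54.20, 55.20]
brand_x = [56.10, 54.70, 54.40, 55.40, 54.10, 56.00, 55.50, 55.00, 54.30, 57.00]

u, p = stats.mannwhitneyu(westin, brand_x, alternative="two-sided")
print(f"U = {u}, p = {p:.3f}")
```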
Example: Output for Mann-Whitney test

Ranks
group                N    Mean Rank   Sum of Ranks
experimental group   10    7.35        73.50
control group        12   14.96       179.50
Total                22

Test Statistics (volumes of the right caudate)
Mann-Whitney U: 18.500
Wilcoxon W: 73.500
Z: -2.737
Asymp. Sig. (2-tailed): .006
Exact Sig. [2*(1-tailed Sig.)]: .004 (not corrected for ties)
Grouping variable: group
Kruskal-Wallis test
• The Kruskal-Wallis test is a non-parametric statistical test used to determine
whether there are any statistically significant differences between the medians
of three or more independent groups. It is a non-parametric alternative to one-
way ANOVA.
• Hypothesis statements for the Kruskal-Wallis test can be framed as follows:

• Null Hypothesis (H0):

• There is no significant difference in medians among the groups.


• All groups have the same population median.

• Alternative Hypothesis (Ha):

• There is a significant difference in medians among the groups.


• At least one group differs in median from the others.
Example
Does it make any difference to students’ comprehension of
statistics whether the lectures are given in English, Malay or
Arabic?
• Group A: lectures in English;
• Group B: lectures in Malay;
• Group C: lectures in Arabic.
• DV: student rating of lecturer's intelligibility on 100-point scale
("0" = "incomprehensible").
• Ho : there is no significant difference in students’ intelligibility ratings between the three languages in which statistics was taught
• Ha : there is a significant difference in students’ intelligibility ratings between the three languages in which statistics was taught
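
A minimal scipy sketch with hypothetical ratings (four students per language, matching the group sizes in the output below):

```python
from scipy import stats

english = [55, 60, 52, 58]   # hypothetical intelligibility ratings (0-100)
malay   = [75, 80, 72, 78]
arabic  = [50, 57, 54, 49]

h, p = stats.kruskal(english, malay, arabic)
print(f"H = {h:.2f}, p = {p:.3f}")
```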
SPSS output for Kruskal-Wallis test:

Ranks
language   N    Mean Rank
English    4     5.00
Malay      4    10.13
Arabic     4     4.38
Total     12

Test Statistics (Kruskal-Wallis test; grouping variable: language)
intelligibility: Chi-Square = 6.190, df = 2, Asymp. Sig. = .045

Write as follows:
Students’ intelligibility ratings were significantly affected by which language statistics was taught in, χ²(2) = 6.19, p = .045.
It looks like lectures are more intelligible in Malay than in either English or Arabic (which are similar to each other).
Spearman Rank Correlation

• The Spearman rank-correlation coefficient, rs, is a measure


of association between two variables when only ordinal data
are available.
Spearman Rank Correlation Test

• The hypotheses:
• H0: ρs = 0 (There is no significant correlation between the two variables)
• Ha: ρs ≠ 0 (There is a significant correlation between the two variables)
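
A minimal scipy sketch with hypothetical ordinal scores:

```python
from scipy import stats

self_esteem = [3, 5, 2, 4, 1, 5, 2, 4]   # hypothetical ranked scores
anxiety     = [3, 1, 5, 2, 5, 1, 4, 2]   # hypothetical ranked scores

rho, p = stats.spearmanr(self_esteem, anxiety)
print(f"rho = {rho:.3f}, p = {p:.3f}")
```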
SPSS Output and interpretation
• Correlation test between Self-Esteem Scale and Anxiety Scale.

The output indicates that there is a significant negative correlation


between Self-Esteem Scale and Anxiety Scale, indicating that
anxiety decreases as self-esteem increases (r = -.392, p < 0.003).
Chi Square Test of independence

• Used to test whether two categorical variables are


related to each other.

example

• Educators are always looking for novel ways in which to teach


statistics to undergraduates as part of a non-statistics degree
course (e.g., psychology). With current technology, it is
possible to present how-to guides for statistical programs
online instead of in a book. However, different people learn in
different ways. An educator would like to know whether
gender (male/female) is associated with the preferred type of
learning medium (online vs. books). Therefore, we have two
nominal variables: Gender (male/female) and Preferred
Learning Medium (online/books).

Source: https://statistics.laerd.com/spss-tutorials/chi-square-test-for-association-
using-spss-statistics.php
Solution

• Ho: there is no significant association between gender and


preferred learning medium.
• Ha: there is a significant association between gender and preferred
learning medium.
• We can see here that χ²(1) = 0.487, p = .485. This tells us that there is no
statistically significant association between Gender and Preferred Learning
Medium; that is, both Males and Females equally prefer online learning
versus books.
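
A minimal scipy sketch with a hypothetical 2x2 table of counts (the Laerd example's raw counts are not reproduced here):

```python
import numpy as np
from scipy import stats

# Rows: male, female; columns: prefers online, prefers books (hypothetical counts)
observed = np.array([[21, 19],
                     [25, 15]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(f"chi2({dof}) = {chi2:.3f}, p = {p:.3f}")
```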
Linear Regression Analysis
• Correlation shows strength of relationship but not the causality
• Causality indicates the likely impact of IV on DV
• e.g. How many pax would visit AMS if more flights were
provided (forecasting)
• Regression calculates equation for ‘best fit line’:
y = a + bx
a = a constant representing the point the line crosses the y-axis
b = a co-efficient representing the gradient of the slope
• Simple linear regression uses the following null and alternative
hypotheses:
the null hypothesis: H0 : There is no effect of X on Y.
versus the alternative hypothesis: H1 : There is an effect of X on Y.
Assumptions

1.Linearity: The relationship between X and the mean


of Y is linear.
2.Homoscedasticity: The variance of residual is the
same for any value of X.
3.Independence: Observations are independent of
each other.
4.Normality: For any fixed value of X, Y is normally
distributed.
example
• A salesperson for a large car brand wants to determine
whether there is a relationship between an individual's
income and the price they pay for a car. As such, the
individual's "income" is the independent variable and the
"price" they pay for a car is the dependent variable. The
salesperson wants to use this information to determine
which cars to offer potential customers in new areas where
average income is known.
solution

• Ho: there is no significant effect of income on price.


• Ha: there is a significant effect of income on price.
output
• The first table of interest is the Model
Summary table, as shown below:

This table provides the R and R2 values. The R value represents the simple
correlation and is 0.873 (the "R" Column), which indicates a high degree of
correlation. The R2 value (the "R Square" column) indicates how much of the total
variation in the dependent variable, Price, can be explained by the independent
variable, Income. In this case, 76.2% can be explained, which is very large.
The next table is the ANOVA table, which reports how well the regression equation fits
the data (i.e., predicts the dependent variable)

This table indicates that the regression model predicts the dependent variable
significantly well. Here, p < 0.0005, which is less than 0.05, and indicates that,
overall, the regression model statistically significantly predicts the outcome
variable (i.e., it is a good fit for the data).
• The Coefficients table provides us with the necessary information to predict
price from income, as well as determine whether income contributes
statistically significantly to the model (by looking at the "Sig." column).
Furthermore, we can use the values in the "B" column under the
"Unstandardized Coefficients" column.

The regression equation is:

Price = 8287 + 0.564(Income)
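
A minimal scipy sketch of fitting such an equation, with hypothetical income/price pairs:

```python
import numpy as np
from scipy import stats

income = np.array([20000, 30000, 40000, 50000, 60000, 75000])  # hypothetical
price  = np.array([18000, 25000, 31000, 36000, 42000, 50000])  # hypothetical

res = stats.linregress(income, price)
print(f"Price = {res.intercept:.0f} + {res.slope:.3f} * Income, "
      f"R^2 = {res.rvalue**2:.3f}, p = {res.pvalue:.4f}")
```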
Example: Annual ticket sales & profits data
p/region for HiMolde Airlines
Region Sales (NOKmn) Profits (NOKmn)
North East 1181 38.4
North West 1140 49.3
Yorkshire Humber 740 31.9
East Midlands 1050 39.9
West Midlands 1165 52.1
East of England 1129 32.1
London 1134 65.4
South East 1497 58.9
South West 687 30
Wales 912 27.2
Scotland 808 26.6
Northern Ireland 551 10.3
xy scatter plot indicates possible linear
relationship (or not)
Best Fit Line

• Perhaps we wish to predict profit for given values of sales


• Profit is the dependent variable (y)
• Sales the independent variable (x)
• Data seems scattered around a straight line
• Then need to find the equation of a ‘best fit’ line:
y = a + bx
Profit = ‘a number’ + (‘some other number’ x sales)
Linear Regression Output SPSS

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .815   .664       .630                9.46114
Predictors: (Constant), SalesNOKmn

The R Square value shows the extent to which the DV can be predicted by the IV(s), i.e. 66%.

Profit = a + b x sales

Coefficients (Dependent Variable: ProfitsNOKmn)
Model          Unstandardized B   Std. Error   Standardized Beta   t       Sig.
1 (Constant)   -9.237             11.080                           -.834   .424
  SalesNOKmn     .048               .011       .815                4.446   .001

The effect of the IV on the DV is significant (p = .001, i.e. 0.1%).
Linear Regression

• Focuses on prediction. Involves discovering the equation for a


line that is the best fit for the given data. That linear equation
is then used to predict values for the data.
• Do variables a and b predict event c?
• Example.
• Does age predict income?
• Does the effective life of a cutting tool depend
on the cutting speed and the tool angle?
Multiple Regression
• Simple linear regression (i.e., with one IV) allows us to study
the relationship between two variables only.

• But in reality, we do not believe that only a single variable


explains all the variation of the dependent variable.

• For example, in the scenario of IQ and income, we do not


expect IQ only to explain income, but that there are also
other variables, such as years in education, to explain income.

Model and Required Conditions
• We allow for k independent variables to potentially be related to the dependent variable:

Y = b0 + b1X1 + b2X2 + … + bkXk + e

where Y is the dependent variable, X1, …, Xk are the independent variables, b0, …, bk are the coefficients, and e is a random error variable.
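
A minimal numpy sketch of a least-squares fit of this model, with hypothetical IQ/education/income data:

```python
import numpy as np

iq     = np.array([95, 100, 110, 105, 120, 115, 98, 130])   # hypothetical
educ   = np.array([12, 14, 16, 12, 18, 16, 10, 20])          # years of education
income = np.array([35, 42, 55, 44, 70, 60, 30, 85])          # in $1000s

# Design matrix with an intercept column: Y = b0 + b1*IQ + b2*educ + e
X = np.column_stack([np.ones(len(iq)), iq, educ])
b, *_ = np.linalg.lstsq(X, income, rcond=None)
print(f"income = {b[0]:.1f} + {b[1]:.2f}*IQ + {b[2]:.2f}*education")
```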
Model Assessment

• The model is assessed using three


measures:
• The standard error of estimate
• The coefficient of determination
• The F-test of the analysis of variance
• The standard error of estimates is used
in the calculations for the other
measures.
Assumptions of Multiple Linear Regression

1. Dependent variable is in continuous scale.

2. Independent variables are either continuous or categorical (e.g.


ordinal or nominal). Nominal: ethnicity, gender etc.

3. Independence of observations (Durbin-Watson statistics)

4. There should be linear relationships between the dependent variable and each of the independent variables, and between the dependent variable and the independent variables collectively.

5. Data should be homoscedastic.


6. Data should not have multicollinearity. Multicollinearity exists if two or more independent variables are highly correlated with each other. This impedes our understanding of which independent variable contributes to the variance explained in the dependent variable.

The Variance Inflation Factor (VIF) helps us to detect multicollinearity (a computational sketch follows this list):

VIF = 1 implies no correlation among independent variables.
1 < VIF < 5 implies moderate correlation among independent variables.
VIF ≥ 5 implies severe correlation among independent variables (an issue with multicollinearity).

7. We should not have significant outliers, high leverage, or highly influential points.

i. outliers can be detected using "casewise diagnostics" and "studentized deleted residuals”.

ii. Leverage points can be checked.


- A data point that has an unusual predictor value.

iii. influential points can be detected using a measure of influence known as Cook's Distance.
- If the removal of a data point causes a large change in the values of correlation coefficients.

8. Residuals should be approximately normally distributed.
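
As noted under assumption 6, a minimal numpy sketch of the VIF computation: each predictor is regressed on the others, and VIF = 1 / (1 − R²). The predictors are synthetic, with x2 built to nearly duplicate x1 so that both show severe multicollinearity:

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns and return 1 / (1 - R^2)."""
    y, others = X[:, j], np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(X.shape[1])])   # x1, x2 large; x3 near 1
```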
