Hypothesis Testing in Python
Problem Statement
A traffic analyst in the city of Zreeha wants to find out whether crash frequencies (number of crashes
per year) differ between rear-end and side-swipe crashes. The transport department collected crash
frequencies for one year at 10 sites, all 4-legged intersections. The data are shown below in the data frame df.
Statistically speaking, the analyst wants to answer the question:
Are the crash frequencies of rear-end and side-swipe crashes at 4-legged intersections statistically different?
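Stated formally, the question can be written as a pair of two-tailed hypotheses. The symbol $\mu_d$ is introduced here for illustration and is not part of the original notebook; it denotes the mean of the per-site differences (rear-end minus side-swipe crash frequency):

```latex
H_0 : \mu_d = 0 \qquad \text{versus} \qquad H_A : \mu_d \neq 0
```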
1.1 Data Description
1.1.1 Reading Data
Out[21]:
   Unnamed: 2
0  Side-swipe
1          12
2           9
3           4
4           7
We can see that the first row is unnecessary here, so we can skip it.
In [22]: df = pd.read_csv('C:\\Users\\durraniu\\Documents\\HT2.csv', skiprows=2)
         df.head()
Out[22]:
   Site #  Rear-end  Side-swipe
0       1        10          12
1       2         7           9
2       3         6           4
3       4         5           7
4       5         9           8
1.1.2 Summary Statistics
In [23]: df.describe()
Out[23]:
         Site #   Rear-end  Side-swipe
count  10.00000  10.000000   10.000000
mean    5.50000   8.200000    8.300000
std     3.02765   1.932184    2.311805
min     1.00000   5.000000    4.000000
25%     3.25000   7.000000    7.000000
50%     5.50000   8.500000    8.000000
75%     7.75000   9.750000    9.750000
max    10.00000  11.000000   12.000000
However, we are interested not in the individual averages of rear-end and side-swipe crashes but in the
difference between them. Our main goal is to test whether the mean of the per-site differences is
significantly different from zero.
1.1.3 Hypothesis Testing
To test the significance of the mean difference in crash frequencies, we'll first compute the difference at each site:
In [24]: df['d'] = df['Rear-end'] - df['Side-swipe']
         df.head()
Out[24]:
   Site #  Rear-end  Side-swipe  d
0       1        10          12 -2
1       2         7           9 -2
2       3         6           4  2
3       4         5           7 -2
4       5         9           8  1
We can use the ttest_rel function in the scipy.stats module to find the t-statistic and p-value for a two-tailed paired test:
In [74]: paired_sample = stats.ttest_rel(df['Rear-end'], df['Side-swipe'])
         print("The t-statistic is %.3f and the p-value is %.3f." % tuple(paired_sample))
The t-statistic is -0.190 and the p-value is 0.853.
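To make the calculation behind the paired test more concrete, here is a minimal sketch that computes the statistic by hand: the mean of the differences divided by its standard error, with n - 1 degrees of freedom. It assumes only the five sites visible in the df.head() output (the remaining five sites are not shown in this document), so its result will not match the -0.190 reported for all 10 sites. The function name paired_t_statistic is hypothetical and chosen for illustration.

```python
import math

# First five sites only, copied from the df.head() output.
# The other five sites are not shown, so this is an illustration,
# not a reproduction of the full 10-site result.
rear_end = [10, 7, 6, 5, 9]
side_swipe = [12, 9, 4, 7, 8]


def paired_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for two paired samples."""
    # Per-pair differences, matching the 'd' column.
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / float(n)
    # Sample variance of the differences (n - 1 in the denominator).
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)
    # Standard error of the mean difference.
    se = math.sqrt(var_d / n)
    return mean_d / se, n - 1


t_stat, dof = paired_t_statistic(rear_end, side_swipe)
print("t = %.3f with %d degrees of freedom" % (t_stat, dof))
```

Applied to the same two columns, this is the quantity that ttest_rel reports as its statistic; the p-value additionally requires the t distribution with the returned degrees of freedom.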
1.2 Conclusion
Because the t-statistic (-0.190) falls in the acceptance region, i.e. between the critical t-values of -2.262
and 2.262 (two-tailed test, 9 degrees of freedom, 5% significance level), we fail to reject the null hypothesis.
Another way to interpret the result is through the p-value, which at 0.853 is higher than the significance
level of 0.05. That is, the probability of getting a mean difference at least as extreme as the observed one,
given that the null hypothesis is true, is higher than the probability we accept of rejecting the null
hypothesis when it is in fact true. Therefore, we fail to reject the null hypothesis. In the context of this
example, we say that the mean difference between rear-end and side-swipe crash frequencies is not
statistically significant.
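The two decision rules described above can be sketched in a few lines. The t-statistic, p-value and critical value are the ones reported in this document; the significance level of 0.05 is an assumption, inferred from the critical value of 2.262 at 9 degrees of freedom. The function name two_tailed_decision is hypothetical.

```python
def two_tailed_decision(t_stat, t_crit, p_value, alpha):
    """Return (reject_by_t, reject_by_p) for a two-tailed test."""
    # Critical-value rule: reject H0 if |t| lies beyond the critical value.
    reject_by_t = abs(t_stat) > t_crit
    # p-value rule: reject H0 if the p-value is below the significance level.
    reject_by_p = p_value < alpha
    return reject_by_t, reject_by_p


# Values reported in this document; alpha = 0.05 is an assumed level.
reject_by_t, reject_by_p = two_tailed_decision(
    t_stat=-0.190, t_crit=2.262, p_value=0.853, alpha=0.05
)
print("Reject H0 by critical value: %s" % reject_by_t)
print("Reject H0 by p-value: %s" % reject_by_p)
```

Both rules lead to the same decision for a given test and significance level, so either one is enough in practice.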
1.3 Resources
Learning Python for Data Analysis and Visualization
Data Analysis and Statistical Inference course
Caldwell, Sally. Statistics unplugged. Cengage Learning, 2012.
paired t test in python
Version
2.7.9 64bit [MSC v.1500 64 bit (AMD64)]
3.0.0
Windows 8 6.2.9200
1.9.2
0.15.1
1.4.3
0.7.6
0.16.0
0.6.5
17:40:56 2015 Eastern Daylight Time