
Hypothesis Testing in Python

The document describes using a hypothesis test to analyze crash frequency data from rear-end and side-swipe crashes at intersections. The null hypothesis is that there is no difference in the mean crash frequencies between the two types. A paired t-test is used to calculate the t-statistic and p-value, finding no statistically significant difference, since the t-value lies between the critical values and the p-value is above the significance level. Therefore, the analyst cannot reject the null hypothesis of no difference in the mean crash frequencies.

Uploaded by

Umair Durrani

Hypothesis Testing for Mean Difference (2 Samples) using Python

April 26, 2015


In [1]: # Telling IPython to render plots inside cells
%matplotlib inline
In [3]: # Importing required Libraries
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import ggplot as gg
from IPython.display import display
from IPython.display import Image
from IPython.display import HTML

Problem Statement

A traffic analyst in the city of Zreeha wants to find out whether there is any difference in the crash frequencies (number of crashes per year) between rear-end and side-swipe crashes. The transport department collects crash frequencies for a year at 10 sites of 4-legged intersections. The data is described below in the data frame df.
Statistically speaking, the analyst wants to answer the question:
Are the crash frequencies of rear-end and side-swipe crashes at 4-legged intersections statistically different?

In [14]: # Rear-end Crash


HTML('<img src="http://upload.wikimedia.org/wikipedia/commons/1/1f/Head_On_Collision.jpg" width
Out[14]: <IPython.core.display.HTML object>

In [13]: # Side-swipe Crash


HTML('<img src="http://upload.wikimedia.org/wikipedia/commons/5/50/Japanese_car_accident_blur.j
Out[13]: <IPython.core.display.HTML object>

1.1 Data Description

1.1.1 Reading Data

We will first read the data which is saved in a csv file:


In [21]: df = pd.read_csv("C:\\Users\\durraniu\\Documents\\HT2.csv")
df.head()

Out[21]:
  Unnamed: 0  Crash Frequency \n(Crashes per year)  Unnamed: 2
0     Site #                              Rear-end  Side-swipe
1          1                                    10          12
2          2                                     7           9
3          3                                     6           4
4          4                                     5           7

The first header row is unnecessary here, so we skip it when reading the file:
In [22]: df = pd.read_csv("C:\\Users\\durraniu\\Documents\\HT2.csv", skiprows=2)
df.head()
Out[22]:
   Site #  Rear-end  Side-swipe
0       1        10          12
1       2         7           9
2       3         6           4
3       4         5           7
4       5         9           8

1.1.2 Summary Statistics

In [23]: df.describe()
Out[23]:
         Site #   Rear-end  Side-swipe
count  10.00000  10.000000   10.000000
mean    5.50000   8.200000    8.300000
std     3.02765   1.932184    2.311805
min     1.00000   5.000000    4.000000
25%     3.25000   7.000000    7.000000
50%     5.50000   8.500000    8.000000
75%     7.75000   9.750000    9.750000
max    10.00000  11.000000   12.000000

However, we are interested not in the individual averages of rear-end and side-swipe crashes but in the difference between them at each site. Our main goal is to test whether the mean of these differences is significantly different from zero.
1.1.3 Hypothesis Testing

To test whether the mean difference in crash frequencies is significant, we first compute the difference at each site:
In [24]: df["d"] = df["Rear-end"] - df["Side-swipe"]
df.head()
Out[24]:
   Site #  Rear-end  Side-swipe  d
0       1        10          12 -2
1       2         7           9 -2
2       3         6           4  2
3       4         5           7 -2
4       5         9           8  1

The mean of the differences between the two samples is:

In [27]: dbar = df["d"].mean()
print(dbar)
-0.1

And the standard deviation is:


In [28]: s = df["d"].std()
print(s)
1.66332999332
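Note that pandas computes the sample standard deviation (dividing by n - 1), which is the s that the t formula needs, whereas NumPy's np.std divides by n unless told otherwise. The sketch below illustrates the difference using only the five differences visible in the df.head() output above. Because the other five sites are not shown, these numbers are illustrative and will not match the full-sample value of s; the variable names are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Only the first five differences, copied from the df.head() output;
# this is an illustrative subset, NOT the full 10-site sample.
d_head = pd.Series([-2, -2, 2, -2, 1])

s_sample = d_head.std()                          # pandas default: ddof=1 (divide by n - 1)
s_population = np.std(d_head.values)             # NumPy default: ddof=0 (divide by n)
s_numpy_sample = np.std(d_head.values, ddof=1)   # NumPy with ddof=1 matches pandas

print("pandas std (ddof=1): %.4f" % s_sample)
print("numpy std (ddof=0):  %.4f" % s_population)
print("numpy std (ddof=1):  %.4f" % s_numpy_sample)
```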
Hypothesis
Our null hypothesis is that there is no difference between the crash frequencies of rear-end and side-swipe crashes or, in other words, that the mean of the population of all these differences is zero:
$H_0 : \mu_D = 0$
and the alternative hypothesis is:
$H_A : \mu_D \neq 0$
Level of significance: $\alpha = 0.05$
In [64]: HTML('<img src="HT2.png" width=750 height=500/>')
Out[64]: <IPython.core.display.HTML object>
Critical Value
Because we have a sample size of only 10, we will use the t distribution instead of the Z (standard normal) distribution.
The sampling distribution of the mean difference in crash frequencies of rear-end and side-swipe crashes is centered on the population mean difference, which is assumed to be zero in this case.
We can find the two-tailed critical t for a 0.05 significance level and 9 degrees of freedom (n - 1) using the following command:
In [73]: from scipy.stats import distributions as dists
tcritical = dists.t.ppf(1-0.05/2, 9)
print(tcritical)
2.26215716274
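To see what this critical value means, the sketch below recomputes it, checks that the two tails beyond the critical values together hold a probability of 0.05, and compares it with the corresponding Z critical value. It uses scipy.stats.t and scipy.stats.norm; the variable names are illustrative and not from the original notebook.

```python
from scipy import stats

alpha = 0.05   # significance level used in the text
dof = 9        # degrees of freedom: n - 1 for n = 10 sites

# Two-tailed critical values: put alpha / 2 in each tail.
t_crit = stats.t.ppf(1 - alpha / 2, dof)
z_crit = stats.norm.ppf(1 - alpha / 2)

# Probability mass lying beyond -t_crit and +t_crit under the t distribution.
tail_mass = stats.t.cdf(-t_crit, dof) + stats.t.sf(t_crit, dof)

print("t critical (9 df): %.3f" % t_crit)
print("z critical:        %.3f" % z_crit)
print("two-tail mass:     %.3f" % tail_mass)
```

The t critical value is larger than the Z critical value, reflecting the extra uncertainty from estimating the standard deviation with a small sample.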
t-statistic

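From our data we can compute the t score using the following formula:

$t = \dfrac{\bar{d} - \mu_D}{s / \sqrt{n}}$

where $\bar{d}$ is the sample mean of the differences, $\mu_D$ is the hypothesized population mean difference (zero under the null hypothesis), $s$ is the sample standard deviation of the differences and $n$ is the number of sites.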
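As a quick check, the values printed earlier can be plugged into the formula by hand. This is a minimal sketch that uses only the reported summary values (mean difference -0.1, standard deviation 1.66332999332, n = 10); the variable names are illustrative and not from the original notebook.

```python
import math

d_bar = -0.1          # mean of the differences, as printed for dbar
s = 1.66332999332     # sample standard deviation of the differences, as printed for s
n = 10                # number of sites
mu_d = 0.0            # population mean difference under the null hypothesis

standard_error = s / math.sqrt(n)          # s / sqrt(n)
t_stat = (d_bar - mu_d) / standard_error   # t score

print("standard error: %.4f" % standard_error)
print("t-statistic:    %.3f" % t_stat)
```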

We can use the ttest_rel function in the stats module to find the t-statistic and p-value for the two-tailed paired test:
In [74]: paired_sample = stats.ttest_rel(df["Rear-end"], df["Side-swipe"])
print("The t-statistic is %.3f and the p-value is %.3f." % tuple(paired_sample))
The t-statistic is -0.190 and the p-value is 0.853.
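The two-tailed p-value can also be recovered from the t-statistic itself, as twice the upper-tail probability of |t| under a t distribution with n - 1 degrees of freedom. The sketch below assumes only the summary values reported earlier (not the raw data) and uses scipy.stats.t.sf; the variable names are illustrative.

```python
import math
from scipy import stats

# Summary values reported earlier in the notebook.
d_bar, s, n = -0.1, 1.66332999332, 10

t_stat = d_bar / (s / math.sqrt(n))   # hypothesized mean difference is zero
dof = n - 1

# Two-tailed p-value: probability of a t value at least as extreme as
# the observed one, in either direction, if the null hypothesis is true.
p_value = 2 * stats.t.sf(abs(t_stat), dof)

print("t = %.3f, p = %.3f" % (t_stat, p_value))
```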

1.2 Conclusion

Because the t-value (-0.190) falls in the non-rejection region, i.e. between the critical t-values of -2.262 and 2.262, we fail to reject the null hypothesis.
Another way to interpret the result is through the p-value: at 0.853 it is higher than the 0.05 significance level. In other words, the probability of getting the observed or a more extreme mean difference, given that the null hypothesis is true, is higher than the probability we are willing to accept of rejecting the null hypothesis when it is in fact true. Therefore, we fail to reject the null hypothesis. In the context of this example, we say that the mean difference between rear-end and side-swipe crash frequencies is not statistically significant.
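The two decision rules described here, comparing |t| with the critical value and comparing the p-value with the significance level, always agree for a two-tailed test. The sketch below shows this with a small helper; paired_t_decision is a hypothetical function written for illustration, not part of scipy or of the original notebook.

```python
from scipy import stats

def paired_t_decision(t_stat, dof, alpha=0.05):
    """Hypothetical helper: two-tailed reject/fail-to-reject by both criteria."""
    t_crit = stats.t.ppf(1 - alpha / 2, dof)      # critical value
    p_value = 2 * stats.t.sf(abs(t_stat), dof)    # two-tailed p-value
    return {
        "reject_by_critical_value": bool(abs(t_stat) > t_crit),
        "reject_by_p_value": bool(p_value < alpha),
    }

# The t-statistic reported in this example, with n - 1 = 9 degrees of freedom.
result = paired_t_decision(t_stat=-0.190, dof=9)
print(result)
```

Both entries are False for this example, which corresponds to failing to reject the null hypothesis.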

1.3 Resources
Learning Python for Data Analysis and Visualization
Data Analysis and Statistical Inference course
Caldwell, Sally. Statistics unplugged. Cengage Learning, 2012.
paired t test in python

In [67]: %reload_ext version_information


%version_information numpy, scipy, matplotlib, sympy, pandas, ggplot
Out[67]:
Software    Version
Python      2.7.9 64bit [MSC v.1500 64 bit (AMD64)]
IPython     3.0.0
OS          Windows 8 6.2.9200
numpy       1.9.2
scipy       0.15.1
matplotlib  1.4.3
sympy       0.7.6
pandas      0.16.0
ggplot      0.6.5
Sun Apr 26 17:40:56 2015 Eastern Daylight Time
