Data Science Hypothesis Testing

This document discusses statistical hypothesis testing and inference. It provides an example of testing whether a coin is fair by flipping it multiple times and counting the heads. It explains how to set up the null and alternative hypotheses and use the binomial distribution and normal approximation to calculate p-values and confidence intervals. Bayesian inference is also briefly mentioned. Key concepts covered include statistical significance, type I and II errors, and the power of statistical tests.


Hypothesis and Inference

Lecture 7
Centre for Data Science
Institute of Technical Education and Research
Siksha ‘O’ Anusandhan (Deemed to be University)
Bhubaneswar, Odisha, 751030

Overview

1. Statistical Hypothesis Testing
2. Example: Flipping a Coin
3. p-Values
4. Confidence Intervals
5. p-Hacking
6. Example: Running an A/B Test
7. Bayesian Inference
8. References
Statistical Hypothesis Testing

Hypotheses are assertions like "this coin is fair" or "data scientists prefer Python to R" that can be translated into statistics about data.
Those statistics can be thought of as observations of random variables from known distributions.
Hypothesis testing lets us make statements about how likely those assertions are to hold.
In the classical setup:
The null hypothesis, H0, represents some default position.
The alternative hypothesis, H1, is what we'd like to compare it against.
We use statistics to decide whether we can reject H0 as false or not.
Example: Flipping a Coin

Imagine we have a coin and we want to test whether it's fair.
We'll assume the coin has some probability p of landing heads.
Our null hypothesis is that the coin is fair; that is, p = 0.5.
In particular, our test will involve flipping the coin some number of times, n, and counting the number of heads, X.
Example: Flipping a Coin

Each coin flip is a Bernoulli trial, which means that X is a Binomial(n, p) random variable, which we can approximate using the normal distribution.

```python
from typing import Tuple
import math

def normal_approximation_to_binomial(n: int, p: float) -> Tuple[float, float]:
    """Returns mu and sigma corresponding to a Binomial(n, p)"""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma
```
Example: Flipping a Coin

Whenever a random variable follows a normal distribution, we can use normal_cdf to figure out the probability that its realized value lies within or outside a particular interval:

```python
from scratch.probability import normal_cdf

# The normal_cdf is the probability the variable is below a threshold
normal_probability_below = normal_cdf

# It's above the threshold if it's not below the threshold
def normal_probability_above(lo: float, mu: float = 0, sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) is greater than lo."""
    return 1 - normal_cdf(lo, mu, sigma)
```
Example: Flipping a Coin

```python
# It's between if it's less than hi, but not less than lo
def normal_probability_between(lo: float, hi: float, mu: float = 0, sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) is between lo and hi."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# It's outside if it's not between
def normal_probability_outside(lo: float, hi: float, mu: float = 0, sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) is not between lo and hi."""
    return 1 - normal_probability_between(lo, hi, mu, sigma)
```
Example: Flipping a Coin

We can also do the reverse: find either the nontail region or the (symmetric) interval around the mean that accounts for a certain level of likelihood.
For example, if we want to find an interval centered at the mean and containing 60% probability, then we find the cutoffs where the upper and lower tails each contain 20% of the probability (leaving the 60% in between).

```python
from scratch.probability import inverse_normal_cdf

def normal_upper_bound(probability: float, mu: float = 0, sigma: float = 1) -> float:
    """Returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability: float, mu: float = 0, sigma: float = 1) -> float:
    """Returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)
```
Example: Flipping a Coin

```python
def normal_two_sided_bounds(probability: float, mu: float = 0, sigma: float = 1) -> Tuple[float, float]:
    """Returns the symmetric (about the mean) bounds that contain the specified probability"""
    tail_probability = (1 - probability) / 2

    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)

    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)

    return lower_bound, upper_bound
```
Example: Flipping a Coin

Let's say that we choose to flip the coin n = 1,000 times.
If our hypothesis of fairness is true, X should be distributed approximately normally with mean 500 and standard deviation 15.8.

```python
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)
```

We need to make a decision about significance: how willing we are to make a type 1 error ("false positive"), in which we reject H0 even though it's true.
Example: Flipping a Coin

Consider the test that rejects H0 if X falls outside the bounds given by:

```python
lower_bound, upper_bound = normal_two_sided_bounds(0.95, mu_0, sigma_0)
```

Assuming p really equals 0.5 (i.e., H0 is true), there is just a 5% chance we observe an X that lies outside this interval.
If H0 is true, then, approximately 19 times out of 20, this test will give the correct result.
Example: Flipping a Coin

We are also often interested in the power of a test, which is the probability of not making a type 2 error ("false negative"), in which we fail to reject H0 even though it's false.
In particular, let's check what happens if p is really 0.55, so that the coin is slightly biased toward heads.

```python
# 95% bounds based on assumption p is 0.5
lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)

# actual mu and sigma based on p = 0.55
mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)

# a type 2 error means we fail to reject the null hypothesis,
# which will happen when X is still in our original interval
type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
power = 1 - type_2_probability  # 0.887
```
p-Values

Instead of choosing bounds based on some probability cutoff, we compute the probability, assuming H0 is true, that we would see a value at least as extreme as the one we actually observed.
For our two-sided test of whether the coin is fair, we compute:

```python
def two_sided_p_value(x: float, mu: float = 0, sigma: float = 1) -> float:
    if x >= mu:
        # x is greater than the mean, so the tail is everything greater than x
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        # x is less than the mean, so the tail is everything less than x
        return 2 * normal_probability_below(x, mu, sigma)
```
p-Values

If we were to see 530 heads, we would compute:

```python
two_sided_p_value(529.5, mu_0, sigma_0)  # 0.062
```

Note
Why did we use a value of 529.5 rather than 530? This is what's called a continuity correction. It reflects the fact that normal_probability_between(529.5, 530.5, mu_0, sigma_0) is a better estimate of the probability of seeing 530 heads than normal_probability_between(530, 531, mu_0, sigma_0) is.
p-Values

One way to convince yourself that this is a sensible estimate is with a simulation:

```python
import random

extreme_value_count = 0
for _ in range(1000):
    num_heads = sum(1 if random.random() < 0.5 else 0  # count # of heads
                    for _ in range(1000))              # in 1000 flips,
    if num_heads >= 530 or num_heads <= 470:           # and count how often
        extreme_value_count += 1                       # the # is 'extreme'

# p-value was 0.062 => ~62 extreme values out of 1000
assert 59 < extreme_value_count < 65, f"{extreme_value_count}"
```

Since the p-value is greater than our 5% significance level, we don't reject the null.
If we instead saw 532 heads, the p-value would be:

```python
two_sided_p_value(531.5, mu_0, sigma_0)  # 0.0463
```
p-Values

The p-value computed above is smaller than the 5% significance level, which means we would reject the null.
It's the exact same test as before; it's just a different way of approaching the statistics.
For our one-sided test, if we saw 525 heads we would compute:

```python
upper_p_value(524.5, mu_0, sigma_0)  # 0.061
```

This means we wouldn't reject the null.
If we saw 527 heads, the computation would be:

```python
upper_p_value(526.5, mu_0, sigma_0)  # 0.047
```
It means we would reject the null.
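The helper upper_p_value isn't defined on these slides; in the book's code it is simply an alias for the tail-probability helpers. A self-contained sketch, using math.erf for the normal CDF in place of the scratch.probability import (the 0.061 figure matches the slide above):

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """Normal CDF via the standard library's error function."""
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

def normal_probability_above(lo: float, mu: float = 0, sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) is greater than lo."""
    return 1 - normal_cdf(lo, mu, sigma)

# One-sided p-values are just tail probabilities:
upper_p_value = normal_probability_above  # for H1: coin biased toward heads
lower_p_value = normal_cdf                # for H1: coin biased toward tails

mu_0, sigma_0 = 500, math.sqrt(1000 * 0.5 * 0.5)  # Binomial(1000, 0.5) approximation
print(round(upper_p_value(524.5, mu_0, sigma_0), 3))  # 0.061
```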

Confidence Intervals

We've been testing hypotheses about the value of the heads probability p, which is a parameter of the unknown "heads" distribution.
When this is the case, a third approach is to construct a confidence interval around the observed value of the parameter.
Example: we can estimate the probability of the unfair coin by looking at the average value of the Bernoulli variables corresponding to each flip: 1 if heads, 0 if tails.
If we observe 525 heads out of 1,000 flips, then we estimate p equals 0.525.
How confident can we be about this estimate?
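A minimal sketch of that estimate, using the fixed z-value 1.96 for a two-sided 95% interval in place of the normal_two_sided_bounds helper defined earlier:

```python
import math

p_hat = 525 / 1000                             # observed fraction of heads
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)  # standard error, ~0.0158

# 95% interval: mu +/- 1.96 * sigma
lo, hi = mu - 1.96 * sigma, mu + 1.96 * sigma
print(f"[{lo:.4f}, {hi:.4f}]")  # [0.4940, 0.5560]
```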

Confidence Intervals

Using the normal approximation (with mu = 0.525 and sigma = 0.0158 from this estimate), we conclude that we are "95% confident" that the following interval contains the true parameter p:

```python
normal_two_sided_bounds(0.95, mu, sigma)  # [0.4940, 0.5560]
```

Note
This is a statement about the interval, not about p. You should understand it as the assertion that if you were to repeat the experiment many times, 95% of the time the "true" parameter would lie within the observed confidence interval.

We do not conclude that the coin is unfair, since 0.5 falls within our confidence interval.
Confidence Intervals

If instead we'd seen 540 heads, then we'd have:

```python
p_hat = 540 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)  # 0.0158
normal_two_sided_bounds(0.95, mu, sigma)       # [0.5091, 0.5709]
```

Here, "fair coin" doesn't lie in the confidence interval.
p-Hacking

A procedure that erroneously rejects the null hypothesis only 5% of the time will, by definition, 5% of the time erroneously reject the null hypothesis.
Test enough hypotheses against your dataset, and one of them will almost certainly appear significant.
Remove the right outliers, and you can probably get your p-value below 0.05.
This is sometimes called p-hacking, and it is in some ways a consequence of the "inference from p-values" framework.
p-Hacking

```python
from typing import List

def run_experiment() -> List[bool]:
    """Flips a fair coin 1000 times, True = heads, False = tails"""
    return [random.random() < 0.5 for _ in range(1000)]

def reject_fairness(experiment: List[bool]) -> bool:
    """Using the 5% significance levels"""
    num_heads = len([flip for flip in experiment if flip])
    return num_heads < 469 or num_heads > 531

random.seed(0)
experiments = [run_experiment() for _ in range(1000)]
num_rejections = len([experiment
                      for experiment in experiments
                      if reject_fairness(experiment)])

assert num_rejections == 46
```
Example: Running an A/B Test

Assume one of your advertisers has developed a new energy drink targeted at data scientists, and the VP of Advertisements wants your help choosing between advertisement A ("tastes great!") and advertisement B ("less bias!").
You decide to run an experiment by randomly showing site visitors one of the two advertisements and tracking how many people click on each one.
If 990 out of 1,000 A-viewers click their ad, while only 10 out of 1,000 B-viewers click theirs, you can be pretty confident that A is the better ad.
But what if the differences are not so stark?
Example: Running an A/B Test

Let's say that N_A people see ad A, and that n_A of them click it.
We can think of each ad view as a Bernoulli trial where p_A is the probability that someone clicks ad A.
Then n_A/N_A is approximately a normal random variable with mean p_A and standard deviation σ_A = √(p_A(1 − p_A)/N_A).
Similarly, n_B/N_B is approximately a normal random variable with mean p_B and standard deviation σ_B = √(p_B(1 − p_B)/N_B).

```python
def estimated_parameters(N: int, n: int) -> Tuple[float, float]:
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma
```
Example: Running an A/B Test

If we assume those two normals are independent, then their difference should also be normal, with mean p_A − p_B and standard deviation √(σ_A² + σ_B²).
This means we can test the null hypothesis that p_A and p_B are the same (i.e., p_A − p_B = 0):

```python
def a_b_test_statistic(N_A: int, n_A: int, N_B: int, n_B: int) -> float:
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)
```
Example: Running an A/B Test

For example, if "tastes great" gets 200 clicks out of 1,000 views and "less bias" gets 180 clicks out of 1,000 views, the statistic equals:

```python
z = a_b_test_statistic(1000, 200, 1000, 180)  # -1.14
```

The probability of seeing such a large difference if the means were actually equal would be:

```python
two_sided_p_value(z)  # 0.254
```

which is large enough that we can't conclude there's much of a difference.
On the other hand, if "less bias" only got 150 clicks, we'd have:

```python
z = a_b_test_statistic(1000, 200, 1000, 150)  # -2.94
two_sided_p_value(z)  # 0.003
```

which means there's only a 0.003 probability we'd see such a large difference if the ads were equally effective.
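The 200-vs-180 case can be checked end to end with a self-contained sketch, substituting math.erf for the scratch.probability normal CDF:

```python
import math

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """Normal CDF via the standard library's error function."""
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2

def estimated_parameters(N: int, n: int):
    """Click-through rate and its standard error for n clicks out of N views."""
    p = n / N
    return p, math.sqrt(p * (1 - p) / N)

def a_b_test_statistic(N_A: int, n_A: int, N_B: int, n_B: int) -> float:
    """z-statistic for the difference in click-through rates."""
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)

z = a_b_test_statistic(1000, 200, 1000, 180)
p_value = 2 * normal_cdf(z)  # z is negative, so twice the lower tail
print(round(z, 2), round(p_value, 3))  # -1.14 0.254
```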
Bayesian Inference

An alternative approach to inference involves treating the unknown parameters themselves as random variables.
The analyst starts with a prior distribution for the parameters and then uses the observed data and Bayes's theorem to get an updated posterior distribution for the parameters.
Rather than making probability judgments about the tests, you make probability judgments about the parameters.
Bayesian Inference

For example, when the unknown parameter is a probability, we often use a prior from the Beta distribution, which puts all its probability between 0 and 1:

```python
def B(alpha: float, beta: float) -> float:
    """A normalizing constant so that the total probability is 1"""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x: float, alpha: float, beta: float) -> float:
    if x <= 0 or x >= 1:  # no weight outside of [0, 1]
        return 0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)
```

Generally speaking, this distribution centers its weight at alpha / (alpha + beta), and the larger alpha and beta are, the "tighter" the distribution is.
Bayesian Inference

For example, if alpha and beta are both 1, it's just the uniform distribution.
If alpha is much larger than beta, most of the weight is near 1.
If alpha is much smaller than beta, most of the weight is near 0.
Bayesian Inference

[Figure: Beta distribution pdfs for Beta(10,10), Beta(1,1), Beta(4,16), and Beta(16,4), plotted over values of the random variable X in [0, 1].]
Bayesian Inference

Maybe we don't want to take a stand on whether the coin is fair, and we choose alpha and beta to both equal 1.
Then we flip our coin a bunch of times and see h heads and t tails.
Bayes's theorem tells us that the posterior distribution for p is again a Beta distribution, but with parameters alpha + h and beta + t.
Let's say you flip the coin 10 times and see only 3 heads:
If you started with the uniform Beta(1, 1), your posterior distribution would be a Beta(4, 8) = Beta(1+3, 1+7), centered around 0.33.
If you started with a Beta(20, 20), your posterior distribution would be a Beta(23, 27) = Beta(20+3, 20+7), centered around 0.46.
If you started with a Beta(30, 10), your posterior distribution would be a Beta(33, 17) = Beta(30+3, 10+7), centered around 0.66.
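The conjugate update described above is a one-liner; a small sketch (helper names are ours, not from the slides) reproducing these three posteriors:

```python
from typing import Tuple

def beta_update(alpha: float, beta: float, heads: int, tails: int) -> Tuple[float, float]:
    """Beta(alpha, beta) prior plus coin-flip data gives a Beta posterior."""
    return alpha + heads, beta + tails

def beta_mean(alpha: float, beta: float) -> float:
    """Where the Beta(alpha, beta) distribution centers its weight."""
    return alpha / (alpha + beta)

# 10 flips, 3 heads / 7 tails, under each of the three priors:
for prior in [(1, 1), (20, 20), (30, 10)]:
    a, b = beta_update(*prior, heads=3, tails=7)
    print(prior, "->", (a, b), round(beta_mean(a, b), 2))
```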

Bayesian Inference

[Figure: Posterior pdfs from the three different priors after 3 heads in 10 flips: Beta(4,8), Beta(23,27), and Beta(33,17), plotted over [0, 1].]
Bayesian Inference

If you flipped the coin more and more times, the prior would matter less and less, until eventually you'd have (nearly) the same posterior distribution no matter which prior you started with.
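A quick way to see the prior wash out is to simulate many flips (here 10,000 flips of an assumed p = 0.55 coin, our example) and compare the posterior means under the three very different priors above:

```python
import random

random.seed(0)
flips = 10_000
heads = sum(1 for _ in range(flips) if random.random() < 0.55)
tails = flips - heads

# With this much data, all three posterior means land almost on top of each other:
for a0, b0 in [(1, 1), (20, 20), (30, 10)]:
    a, b = a0 + heads, b0 + tails
    print((a0, b0), "->", round(a / (a + b), 3))
```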
Using Bayesian inference to test hypotheses is considered somewhat controversial, in part because the mathematics can get somewhat complicated and in part because of the subjective nature of choosing a prior.
References

[1] Joel Grus. Data Science from Scratch: First Principles with Python. Second Edition. O'Reilly, May 2019.
Thank You
