A/B Testing

The document discusses A/B testing and bandit testing. It explains that A/B testing involves randomly dividing users into a control group and one or more test groups to see if changing something in the test groups has any effect. Bandit testing is similar but continuously updates the probabilities of selecting each variation throughout the testing process to minimize "regret," or the difference between the actual outcome and the best possible outcome. Bandit testing is generally preferred over A/B testing when the time for exploration and exploitation is limited, such as for short-term campaigns. The document also discusses common mistakes in A/B testing and issues around correlation not necessarily implying causation.

A/B Testing

A/B testing (or statistical experiments)


▪ A/B tests (a.k.a. online controlled experiments)
▪ Randomly dividing users into a control group and one or more test groups to see
whether something we do differently in the test groups has any effect.
▪ Random partitioning is the key that allows us to distinguish correlation from
causality.

Data Science: The Executive Summary – A Technical Book for Non-Technical Professionals
A/B testing + statistical inference #1-#4

[Worked significance-test example from the cited book; see the sketch below.]

Grus, Joel. Data science from scratch: first principles with python. O'Reilly Media, 2019.
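A minimal from-scratch sketch of the kind of two-sample significance test for conversion rates the cited book walks through; the function names and the example numbers are illustrative rather than the book's own.

```python
import math

def estimated_parameters(N, n):
    """Estimated conversion probability and its standard error for n successes
    out of N trials, using the normal approximation to the binomial."""
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma

def a_b_test_statistic(N_A, n_A, N_B, n_B):
    """z-statistic for the difference between the two conversion rates."""
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)

def normal_cdf(x):
    return (1 + math.erf(x / math.sqrt(2))) / 2

def two_sided_p_value(z):
    """Probability of seeing a value at least as extreme as z if there is
    no real difference between A and B."""
    return 2 * (1 - normal_cdf(abs(z)))

# Hypothetical data: 1000 impressions per variation, A converts 200 times, B 180 times.
z = a_b_test_statistic(1000, 200, 1000, 180)
print(two_sided_p_value(z))   # ~0.25, so not significant at the 5% level
```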
A/B Testing vs. Bandit Testing: #1
▪ If we have three variations that we wish to test, this translates into an exploration-
exploitation dilemma.

▪ There are two options:


▪ With the A/B testing approach, we try out each of the three variations in equal proportions
until we are done with our test at week 5, and then select the variation with the highest
value.
▪ Bandit testing, on the other hand, uses what it knows about each variation from the very
beginning, and it continuously updates the probability that it will select each variation
throughout the optimization process. In the chart from the referenced notebook we can see
that with each new week, bandit testing reduces how often it selects the lower-performing
options and increases how often it selects the highest-performing option. Bandit algorithms
try to minimize what's known as regret, which is the difference between our actual payoff
and the payoff we would have collected had we played the optimal (best) option at every
opportunity.

https://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/bandits/multi_armed_bandits.ipynb
A/B Testing vs. Bandit Testing: #2

https://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/bandits/multi_armed_bandits.ipynb
A/B Testing vs. Bandit Testing: #3
▪ There are many different bandit methods; some of them are (a minimal epsilon-greedy sketch follows after this list):
▪ Algorithm 1 - Epsilon Greedy
▪ Algorithm 2 - Boltzmann Exploration (Softmax)
▪ Algorithm 3 - Upper Confidence Bounds (UCB)
▪ Algorithm 4 - Bayesian Bandits


https://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/bandits/multi_armed_bandits.ipynb
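As a concrete illustration of the first method, here is a minimal epsilon-greedy sketch (illustrative code, not taken from the referenced notebook): with probability epsilon we explore a random variation, otherwise we exploit the variation with the best running estimate of its value. The true conversion rates below are made up.

```python
import random

def epsilon_greedy(epsilon, values):
    """Pick an arm: explore a random arm with probability epsilon,
    otherwise exploit the arm with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

def update(arm, reward, counts, values):
    """Incrementally update the running mean reward of the chosen arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

# Three variations with (unknown to the algorithm) true conversion rates.
true_rates = [0.04, 0.05, 0.08]
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]

for _ in range(10_000):
    arm = epsilon_greedy(0.1, values)
    reward = 1 if random.random() < true_rates[arm] else 0
    update(arm, reward, counts, values)

print(counts)   # most pulls should end up on the last (best) variation
print(values)   # estimates close to the true rates for well-explored arms
```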
A/B Testing vs. Bandit Testing: #4
▪ In general, when is bandit testing preferred over classical A/B testing?
▪ Bandit testing if you only have a small amount of time for both exploration and
exploitation.
▪ A/B testing if you can afford a longer, dedicated exploration phase before
committing to the winning variation.

▪ Examples:
○ Headlines: News has a short life cycle. Why would you run an A/B test on a
headline if, by the time you learn which variation is best, the time where the
answer is applicable is over?
○ Holiday Campaigns: e.g. if you're running tests on an ecommerce site for
Black Friday, an A/B test isn't that practical – you might only be confident in
the result at the end of the day. On the other hand, bandit testing will drive
more traffic to the better-performing variation – and that in turn can increase
revenue.

https://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/bandits/multi_armed_bandits.ipynb
How is A/B testing different from usual Hypothesis
testing?
● Hypothesis testing is a broader term.
● A/B testing is a method of hypothesis testing.
● To be more precise: A/B testing, also known as two-sample hypothesis
testing, is one type of statistical hypothesis testing.

● There are other types of hypothesis testing, e.g., one-sample testing (both the one-sample and two-sample cases are sketched below)

https://www.quora.com/What-is-the-difference-between-A-B-testing-and-hypothesis-testing
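As a small illustration (with made-up normally distributed metric data, not taken from the cited source), scipy exposes both flavors directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=200)   # metric for the control group
variant = rng.normal(loc=10.4, scale=2.0, size=200)   # metric for the test group

# One-sample hypothesis test: is the control mean different from a reference value?
print(stats.ttest_1samp(control, popmean=10.0))

# Two-sample hypothesis test (the A/B testing case): do the two groups differ?
print(stats.ttest_ind(control, variant, equal_var=False))
```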
A/B testing – Common mistakes #1
▪ Significance level = how often we would see a difference like the one observed purely by
chance, assuming there is no underlying difference between A and B.
▪ Repeated significance testing errors -> essentially it means that even if you
establish that your result is statistically significant, there's a good chance that it's
actually insignificant.
▪ Argument:
▪ "95% chance of beating original" or "95% probability of statistical significance."
▪ Assuming there is no underlying difference between A and B, how often will we see
a difference like we do in the data just by chance?
▪ Answer = significance level.
▪ "Statistically significant results" mean that the significance level is low, e.g. 5%
(0.05).
▪ The critical assumption is that the sample size was fixed in advance.
▪ If instead the rule is "stop as soon as we see a significant difference," all the reported
significance levels become meaningless.
https://www.evanmiller.org/how-not-to-run-an-ab-test.html
A/B testing – Common mistakes #2

▪ Repeated significance testing always increases the rate of false positives.
▪ That is, you'll think many insignificant results are significant (but not the other
way around).
▪ The problem will be present if you ever find yourself "peeking" at the data and
stopping an experiment that seems to be giving a significant result (see the
simulation sketch below).
▪ Peeking is fine; just don't stop the experiment early because of what you see.
▪ The table in the cited post gives an idea of how severe the problem is.
https://www.evanmiller.org/how-not-to-run-an-ab-test.html
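A quick simulation (an illustrative sketch, not code from the cited post) shows how bad this gets: even when A and B share the same true conversion rate, stopping at the first check that reaches p < 0.05 declares a significant difference far more often than the nominal 5%, while a single test at the fixed final sample size stays near 5%.

```python
import math
import random

def normal_cdf(x):
    return (1 + math.erf(x / math.sqrt(2))) / 2

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference of two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    return 2 * (1 - normal_cdf(abs(z)))

def run_experiment(true_rate=0.05, batch=500, checks=10, peek=True):
    """A and B have the SAME true rate. With peek=True we stop (and declare a
    winner) at the first significant check; with peek=False we test only once,
    at the final, fixed sample size."""
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(checks):
        conv_a += sum(random.random() < true_rate for _ in range(batch))
        conv_b += sum(random.random() < true_rate for _ in range(batch))
        n_a += batch
        n_b += batch
        if peek and p_value(conv_a, n_a, conv_b, n_b) < 0.05:
            return True
    return p_value(conv_a, n_a, conv_b, n_b) < 0.05

trials = 500
print(sum(run_experiment(peek=True) for _ in range(trials)) / trials)   # well above 0.05
print(sum(run_experiment(peek=False) for _ in range(trials)) / trials)  # close to 0.05
```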
A/B testing – Common mistakes #3


https://www.evanmiller.org/how-not-to-run-an-ab-test.html
A/B testing – Common mistakes #4
▪ Don’t report significance levels until an experiment is over, and stop using significance
levels to decide whether an experiment should stop or continue.
▪ Instead of reporting significance of ongoing experiments, report how large of an effect can
be detected given the current sample size. That can be calculated with:
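A standard minimum-detectable-effect expression consistent with the description below (a reconstruction, assuming two equal groups of size n and a variance estimate σ̂² for the metric) is:

\[
\delta = \left(t_{\alpha/2} + t_{\beta}\right)\sqrt{\frac{2\hat{\sigma}^{2}}{n}}
\]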

▪ Where the two t's are the t-statistics for a given significance level α/2 and power (1−β).

https://www.evanmiller.org/how-not-to-run-an-ab-test.html
Correlation does not imply causation! #1
▪ The conventional dictum that "correlation does not imply causation" means that correlation cannot be used
by itself to infer a causal relationship between the variables. This dictum should not be taken to mean that
correlations cannot indicate the potential existence of causal relations. However, the causes underlying the
correlation, if any, may be indirect and unknown, and high correlations also overlap with identity relations
(tautologies), where no causal process exists. Consequently, a correlation between two variables is not a
sufficient condition to establish a causal relationship (in either direction).

▪ A correlation between age and height in children is fairly causally transparent, but a correlation between
mood and health in people is less so. Does improved mood lead to improved health, or does good health
lead to good mood, or both? Or does some other factor underlie both? In other words, a correlation can be
taken as evidence for a possible causal relationship, but cannot indicate what the causal relationship, if any,
might be.

▪ If X and Y are related in the data, it may be that neither one is causing the other; instead, there may be
some third factor Z (called a "confounding factor" by statisticians) that is affecting them both.

https://en.wikipedia.org/wiki/Correlation_and_dependence
Correlation does not imply causation! #2
▪ If x and y are strongly correlated, that might mean that:
▪ x causes y,
▪ y causes x,
▪ each causes the other,
▪ some third factor causes both,
▪ it might mean nothing.

▪ These are all valid options! Fine but what else can we do? One way to feel more confident
about causality is by conducting randomized trials. If you can randomly split your users into
two groups with similar demographics and give one of the groups a slightly different
experience, then you can often feel pretty good that the different experiences are causing
the different outcomes.

Grus, Joel. Data science from scratch: first principles with python. O'Reilly Media, 2019.
Correlation does not imply causation! Fine
so what do we do in practice? #1
▪ "Correlation does not equal causation” is undoubtedly correct however the way it used is
take to extreme.

▪ This popular phrase has led decision makers to believe that they need a causal insight in
order to make a decision with data. Yes, in a perfect world, we'd only act on causal insights.

▪ What happens in practice? This requirement isn't reasonable. More often
than not, when stakeholders require "causality" to make a decision, it takes so
long that they lose patience and end up making a decision without any data at all.

https://www.kdnuggets.com/2021/08/correlation-better-causation.html
Correlation does not imply causation! Fine so
what do we do in practice? #2
▪ But how about A/B testing?
▪ Consider A/B testing, for example. This is the most common way teams are tackling the
requirement of causality today.
▪ But an A/B test is surprisingly difficult to execute correctly – as shown by the countless
statisticians waving their hands trying to get us to acknowledge this fact.
▪ The sad reality is that A/B tests require a lot of data, flawless engineering implementation,
and a high level of statistical rigor to do right... so we end up releasing new features
without valid results.
▪ They also require A LOT of carefully collected data, meaning you will have to wait a long time
before you can make any causal claim. This is true for other causal inference methods too,
not just A/B testing.

https://www.kdnuggets.com/2021/08/correlation-better-causation.html
Correlation does not imply causation! Fine
so what do we do in practice? #3
▪ Ultimately, causality is an impractical requirement when making decisions with data. So let's
stop insisting on it and find another way. Let's go back to using correlations.
▪ We can still use them, but we have to follow some good-practice guidelines:

▪ Don't correlate random things. Instead, focus on correlating things that are already
connected.
▪ Correlate conversion rates, not totals.
▪ Ensure trends are consistent over longer periods of time; everything changes over
time, so a correlation that existed in the past may have disappeared today.
▪ Always monitor the results, always track the data while making the changes.

https://www.kdnuggets.com/2021/08/correlation-better-causation.html
Correlation does not imply causation! Fine so
what do we do in practice? #4
▪ Conclusion?

▪ In practice, we need accurate insights and we need to act fast. Waiting two months for an
analysis that claims causality, or four weeks for an A/B test to run its course, is not going to
cut it.
▪ But if we can act quickly on correlations, especially when they've been rigorously evaluated
using the techniques above, we'll be able to make better decisions, faster.

https://www.kdnuggets.com/2021/08/correlation-better-causation.html
Conversion rate: #1
▪ Whenever talking about conversion rates, pay attention to the traps of volume and time.

▪ Volume: The trick here, whenever you see a conversion rate, is to ask "How large are the
numbers that created that rate?"
▪ Time: Time plays two key roles here: the time it takes to get from point A to point B,
and how the conversion rate changes over time. There is no natural limit on how long a
conversion can take, so the rate keeps creeping up the longer we wait. To solve this we need
to control for time when making our conversion calculation. The simplest way to do this is to
time-box conversion for each visitor. Instead of asking, "What is the conversion between A
and B?", we should instead ask, "What is the conversion between A and B in X days?"

https://towardsdatascience.com/how-you-should-be-looking-at-conversion-rates-325849604b9e
Conversion rate: #2
▪ How many of those who do A, then go on to do B?
▪ What is the conversion rate between A and B?
▪ How large are the numbers that created that rate?
▪ How long does it take for most to convert?
▪ What is the conversion rate between A and B within X days?
▪ How has conversion changed over time and are we improving?
▪ How are we trending to our next goal? Are we on track?
▪ Luckily, there is a single plot that can answer all of these questions: just plot the
cumulative percent conversion over time (a minimal plotting sketch follows below).

https://towardsdatascience.com/how-you-should-be-looking-at-conversion-rates-325849604b9e
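A rough sketch of how such a plot could be built with pandas and matplotlib; the events table, its column names, and the dates below are hypothetical, and the cohorting is monthly.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical event log: one row per visitor, with the time of step A and
# (if it ever happened) the time of step B.
events = pd.DataFrame({
    "visitor_id": [1, 2, 3, 4, 5],
    "time_a": pd.to_datetime(["2021-01-03", "2021-01-05", "2021-02-01",
                              "2021-02-10", "2021-03-02"]),
    "time_b": pd.to_datetime(["2021-01-10", pd.NaT, "2021-02-04",
                              "2021-03-20", pd.NaT]),
})

events["days_to_convert"] = (events["time_b"] - events["time_a"]).dt.days
events["cohort"] = events["time_a"].dt.to_period("M")

# Cumulative percent conversion by day since A, one line per monthly cohort of A.
max_days = 30
for cohort, group in events.groupby("cohort"):
    converted = [
        (group["days_to_convert"] <= d).sum() / len(group) * 100
        for d in range(max_days + 1)
    ]
    plt.plot(range(max_days + 1), converted, label=str(cohort))

plt.xlabel("Days since A")
plt.ylabel("Cumulative % converted A -> B")
plt.legend(title="Cohort of A")
plt.show()
```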
Conversion rate: #3

https://towardsdatascience.com/how-you-should-be-looking-at-conversion-rates-325849604b9e
Conversion rate: #4
▪ How long does it take for most to convert? — Easy, it starts to “elbow” around 15 days
▪ What is the conversion rate between A and B in 25 days? — Between about 23% and 26%
for cohorts between February and August. January, however, had a record conversion!
▪ How many convert on the same day? — About 7%, pretty consistently too.
▪ We made updates to the website in late September, has conversion improved since then? —
No, the worst conversion is after the changes we made, and it’s getting worse over time!
▪ If we make additional changes, what should our goal be? — We should start by aiming for
where we started (23%-26%) and ideally we’d shoot for above 26% conversion (in 25 days).

https://towardsdatascience.com/how-you-should-be-looking-at-conversion-rates-325849604b9e
A/B testing
▪ https://towardsdatascience.com/top-5-mistakes-with-statistics-in-a-b-testing-9b121ea1827c

▪ This resource offers a pretty good review of the most common
mistakes made while using A/B testing

https://towardsdatascience.com/top-5-mistakes-with-statistics-in-a-b-testing-9b121ea1827c
How do people ever determine causality? #1
▪ The Gold Standard: Randomized Clinical Trials: The gold standard for establishing causality is the
randomized experiment. This is a setup whereby we randomly assign some group of people to
receive a "treatment" and others to be in the "control" group – that is, they don't receive the
treatment. We then have some outcome that we want to measure, and the causal effect is simply
the difference between the treatment and control groups in that measurable outcome.

• A/B Tests: In software companies, what we described as random experiments are sometimes
referred to as A/B tests. In fact, it was found that if we said the word "experiments" to
software engineers, it implied to them "trying something new" and not necessarily the
underlying statistical design of having users experience different versions of the product in order
to measure the impact of that difference using metrics.

O'Neil, Cathy, and Rachel Schutt. Doing data science: Straight talk from the frontline. O'Reilly Media, 2013.
How do people ever determine causality? #2

▪ Second Best: Observational Studies: An observational study is an empirical study in which the
objective is to elucidate cause-and-effect relationships in settings where it is not feasible to use
controlled experimentation.

▪ One of the issues with observational studies is Simpson's paradox (a numeric illustration follows below).

O'Neil, Cathy, and Rachel Schutt. Doing data science: Straight talk from the frontline. O'Reilly Media, 2013.
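To see why observational comparisons can mislead, here is a classic-style numeric illustration of Simpson's paradox (the counts are hypothetical): treatment A does better than B within every subgroup, yet looks worse overall because the harder cases were disproportionately given A.

```python
# Hypothetical counts: (successes, trials) per treatment within each subgroup.
groups = {
    "mild cases":   {"A": (81, 87),   "B": (234, 270)},
    "severe cases": {"A": (192, 263), "B": (55, 80)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for name, arms in groups.items():
    for arm, (success, n) in arms.items():
        totals[arm][0] += success
        totals[arm][1] += n
        print(f"{name:12s} {arm}: {success:>3}/{n:<3} = {success / n:.0%}")

# A wins inside both subgroups, yet B wins on the aggregated numbers.
for arm, (success, n) in totals.items():
    print(f"{'overall':12s} {arm}: {success:>3}/{n:<3} = {success / n:.0%}")
```

Randomizing which cases receive which treatment, as in the experiments described above, balances case severity across the arms and removes exactly this kind of confounding.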
