How to set alpha when you have underpowered experiments?

Ron Kohavi

Vice President and Technical Fellow | Data Science, Engineering | AI, Machine Learning, Controlled Experiments | Ex-Airbnb, Ex-Microsoft, Ex-Amazon

Published Nov 27, 2023

Jakub Linowski asked how to assign a trust level to an experiment corpus when some of the experiments have low power (post).

If you’re designing a new experiment, I strongly recommend designing it with sufficient power (e.g., over 80% for an MDE that’s at most 5%).

What about experiments that you previously ran that do not meet the bar? In that case, I would lower the alpha (p-value threshold for calling something statistically significant) to control for the False Positive Risk (FPR), that is, the probability that the statistically significant result is a false positive.

The formula is not complicated, and we shared it in https://bit.ly/ABTestingIntuitionBusters:

[Figure: How to set alpha to control the False Positive Risk]
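The equation appears only as an image. The following is a reconstruction, not a copy of the original. It assumes the two-tailed form of the FPR from the paper linked above, in which only the α/2 of false positives that fall in the direction of improvement count as wins, and it solves that definition for α. FPR here denotes the false positive risk you are willing to accept.

```latex
\[
\mathrm{FPR} = \frac{(\alpha/2)\,\pi}{(\alpha/2)\,\pi + (1-\beta)(1-\pi)}
\quad\Longrightarrow\quad
\alpha = \frac{2\,\mathrm{FPR}\,(1-\beta)(1-\pi)}{\pi\,(1-\mathrm{FPR})}
\]
```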

where

β is the type-II error rate (usually 0.2 for 80% power). You can estimate power by plugging in the MDE of similar experiments (with a suggested upper bound of 5% for online experiments optimizing conversion, as noted in this post). Here’s a simple Excel spreadsheet for that: https://bit.ly/powerFromN
π is the prior probability of the null hypothesis, that is, P(H_0). Based on Table 2 in https://bit.ly/ABTestingIntuitionBusters, I recommend using 85% for A/B tests, that is, a success rate of about 15%.

Plugging in the above defaults gives
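The simplified expression is likewise shown only as an image. As a reconstruction, assume a target FPR of 5% (not stated in the text, but the value that reproduces the examples in this article), π = 0.85, and the two-tailed FPR form from the linked paper, where α/2 of false positives land in the winning direction. Under those assumptions, alpha is roughly proportional to power:

```latex
\[
\alpha \;=\; 2 \cdot \frac{0.05}{0.95} \cdot \frac{0.15}{0.85} \cdot (1-\beta)
\;\approx\; 0.0186\,(1-\beta)
\]
```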

Let's look at a few examples:

If your power is 80%, the formula suggests an alpha of 0.015, lower than the industry standard of 0.05, but similar to the 0.01 threshold we used at Airbnb search after seeing many false positives. If your success rate is 35% (at the same 80% power), then set alpha to 0.045, similar to the industry standard.
If your power is 50%, set alpha to 0.009.
If your power is 20%, set alpha to 0.004.
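The examples above can be reproduced with a few lines of Python. This is a sketch rather than code from the article: the function name `alpha_for_target_fpr` and its parameters are hypothetical. It assumes a 5% target FPR and the two-tailed FPR form from the linked paper (α/2 of false positives in the winning direction), which are the assumptions that reproduce the numbers listed.

```python
def alpha_for_target_fpr(power, prior_null=0.85, target_fpr=0.05):
    """Two-sided alpha that keeps the false positive risk at target_fpr.

    Assumed relationship (solved for alpha):
        FPR = (alpha/2 * pi) / (alpha/2 * pi + power * (1 - pi))
    where pi is the prior probability of the null hypothesis.
    """
    if not 0 < power <= 1:
        raise ValueError("power must be in (0, 1]")
    if not 0 < prior_null < 1:
        raise ValueError("prior_null must be in (0, 1)")
    if not 0 < target_fpr < 1:
        raise ValueError("target_fpr must be in (0, 1)")
    alpha = (2 * target_fpr * power * (1 - prior_null)
             / (prior_null * (1 - target_fpr)))
    # Alpha is a probability, so cap it at 1 for extreme inputs.
    return min(alpha, 1.0)


if __name__ == "__main__":
    for power in (0.8, 0.5, 0.2):
        print(f"power={power:.0%}  alpha={alpha_for_target_fpr(power):.3f}")
    # A 35% success rate corresponds to a prior of 0.65 for the null.
    print(f"power=80%, success rate 35%  "
          f"alpha={alpha_for_target_fpr(0.8, prior_null=0.65):.3f}")
```

Because alpha scales linearly with power under these assumptions, halving the power halves the threshold you can afford.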

These thresholds may be viewed as hard to achieve, but that’s the penalty you have to pay if you want to control the false positive risk. Note that with low-powered experiments, the treatment effect of a statistically significant result is still expected to be exaggerated.
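To get a feel for the size of that exaggeration, here is a rough sketch under a normal approximation. It is not from the article: the function `exaggeration_ratio` and its parameters are hypothetical, and it ignores the small chance of a significant result in the wrong direction. It computes the expected estimate, given that the result was significantly positive, divided by the true effect.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def exaggeration_ratio(design_power, design_alpha=0.05, decision_alpha=None):
    """Expected (estimate / true effect) among significantly positive results.

    design_power and design_alpha define the true effect in standard-error
    units; decision_alpha is the threshold actually used to declare
    significance (defaults to design_alpha).
    """
    if not 0 < design_power < 1:
        raise ValueError("design_power must be in (0, 1)")
    z_design = STD_NORMAL.inv_cdf(1 - design_alpha / 2)
    # True effect (in SE units) that gives the stated power, one tail only.
    true_z = z_design + STD_NORMAL.inv_cdf(design_power)
    if true_z <= 0:
        raise ValueError("power too low for a positive true effect")
    if decision_alpha is None:
        z_decision = z_design
    else:
        z_decision = STD_NORMAL.inv_cdf(1 - decision_alpha / 2)
    # Mean of N(true_z, 1) truncated to values above the decision threshold.
    a = z_decision - true_z
    conditional_mean = true_z + STD_NORMAL.pdf(a) / (1 - STD_NORMAL.cdf(a))
    return conditional_mean / true_z


if __name__ == "__main__":
    for power in (0.8, 0.5, 0.2):
        print(f"power={power:.0%}  ratio={exaggeration_ratio(power):.2f}")
    # Same true effect as the 20%-power design, judged at a stricter alpha.
    print(f"{exaggeration_ratio(0.2, decision_alpha=0.004):.2f}")
```

Under these assumptions the ratio is about 1.12 at 80% power and about 2.25 at 20% power. Tightening the decision threshold for the same true effect raises it further, which is why a lower alpha controls the FPR but does not remove the exaggeration.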

Also note that there are several additional factors to check to increase trust. These were discussed here.

Xavier Paz

Ecommerce Strategy, Growth & Behavioural Design | The Art of Ecommerce & The Choice Labs

What is the benefit of doing this calculation? If an experiment is underpowered and you don't want false positives, then you can detect almost nothing. I think it's easier than that: if you don't have enough power to run an experiment, don't run an experiment. Redesign the experiment or do some other test. Tweaking the numbers won't increase the power. Or will it?

Pavel K.

Product Strategy & Data Science at Bitpanda

Is there a reference to this formula?

Shivam Negi

Staff Data Scientist | Gen-AI / LLM / ML-Ops | Ex-Amazon

That's very informative, Ron!

Ryan Kessler

Principal Economist at Amazon

Implicit in this seems to be an assumption that bad launches (i.e., launching when you shouldn’t) are much worse than missed opportunities (i.e., not launching when you should). Would be interested to learn more about how you think about the relative costs of these mistakes.
