How to set alpha when you have underpowered experiments?

Jakub Linowski asked how to assign a trust level to a corpus of experiments when some of them have low power (post).

If you’re designing a new experiment, I strongly recommend designing it with sufficient statistical power (e.g., over 80% power for a minimum detectable effect, or MDE, of at most 5%).
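
For a rough sense of what that bar implies, here is a minimal power-calculation sketch, assuming a two-proportion z-test and a hypothetical 5% baseline conversion rate (the function name and example values are illustrative, not from the post):

```python
import math
from scipy import stats

def users_per_variant(baseline_rate: float, relative_mde: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per variant for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)   # treatment rate at the MDE
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_power = stats.norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical example: 5% baseline conversion, 5% relative MDE, 80% power
print(users_per_variant(0.05, 0.05))  # ~122,000 users per variant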

What about experiments that you previously ran that do not meet that bar? In that case, I would lower alpha (the p-value threshold for calling a result statistically significant) to control the False Positive Risk (FPR), that is, the probability that a statistically significant result is a false positive.

The formula is not complicated, and we shared it in https://bit.ly/ABTestingIntuitionBusters:

alpha = ( FPR / (1 - FPR) ) × ( success rate / (1 - success rate) ) × power

where

  • FPR is the desired False Positive Risk, i.e., the probability that a statistically significant result is a false positive;
  • success rate is the prior probability that the treatment has a real effect (the historical fraction of your experiments that were real winners); and
  • power is the statistical power (1 - beta) of the experiment.
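
Here is a minimal Python sketch of this calculation (the function names and the example values — a 5% target FPR and a 1/3 success rate — are illustrative assumptions, not the defaults from the figures):

```python
def false_positive_risk(alpha: float, power: float, success_rate: float) -> float:
    """Probability that a statistically significant result is a false positive,
    given alpha, power, and the prior probability of a real effect (success rate)."""
    true_null = 1 - success_rate
    return alpha * true_null / (alpha * true_null + power * success_rate)


def alpha_for_target_fpr(fpr: float, power: float, success_rate: float) -> float:
    """Solve the FPR formula above for alpha, given a target FPR."""
    return (fpr / (1 - fpr)) * (success_rate / (1 - success_rate)) * power


# Illustrative assumptions: target FPR of 5%, historical success rate of 1/3.
success_rate = 1 / 3
for power in (0.8, 0.5, 0.2):
    alpha = alpha_for_target_fpr(fpr=0.05, power=power, success_rate=success_rate)
    fpr = false_positive_risk(alpha, power, success_rate)
    print(f"power={power:.0%} -> alpha={alpha:.4f} (check: FPR={fpr:.3f})")
```

With these illustrative numbers, alpha comes out to about 0.021 at 80% power and scales down linearly with power.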

Plugging in reasonable default values for the FPR and the success rate gives:

(Figure: how to set alpha for reasonable default values)

Let’s look at a few examples:

  • If your power is 80%, the formula suggests an alpha of 0.015, lower than the industry standard of 0.05 but similar to the 0.01 threshold we used at Airbnb search after seeing many false positives. If your success rate is 35%, then set alpha to 0.045, similar to the industry standard.
  • If your power is 50%, set alpha to 0.009.
  • If your power is 20%, set alpha to 0.004.

These thresholds may seem hard to achieve, but that’s the penalty you have to pay if you want to control the false positive risk. Note that with low-powered experiments, the statistically significant estimates of the treatment effect are still expected to be exaggerated.
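
Here is a minimal Monte Carlo sketch of that exaggeration effect, assuming a hypothetical true lift of 1% measured with increasing noise (i.e., decreasing power); the numbers are made up for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def exaggeration(true_effect: float, std_error: float, alpha: float = 0.05,
                 n_sim: int = 200_000):
    """Simulate estimates of a fixed true effect and return (power, exaggeration),
    where exaggeration is the average statistically significant estimate
    divided by the true effect."""
    estimates = rng.normal(true_effect, std_error, n_sim)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    significant = np.abs(estimates) > z_crit * std_error   # two-sided z-test
    return significant.mean(), estimates[significant].mean() / true_effect

# Made-up example: a true 1% lift measured with progressively larger standard errors
for std_error in (0.0036, 0.007, 0.012):
    power, ratio = exaggeration(true_effect=0.01, std_error=std_error)
    print(f"power~{power:.0%}: significant estimates average {ratio:.1f}x the true effect")
```

Lowering alpha controls the false positive risk, but it does not remove this exaggeration; if anything, a stricter threshold selects even more extreme estimates.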

Also note that there are several additional factors to check in order to increase trust; these were discussed here.



Xavier Paz

Ecommerce Strategy, Growth & Behavioural Design | The Art of Ecommerce & The Choice Labs

1y

What is the benefit of doing this calculation? If an experiment is underpowered and you don't want false positives, then you can detect almost nothing. I think it's easier than that: if you don't have enough power to run an experiment, don't run an experiment. Redesign the experiment or do some other test. Tweaking the numbers won't increase the power. Or will it?

Pavel K.

Product Strategy & Data Science at Bitpanda

1y

Is there a reference to this formula?

Shivam Negi

Staff Data Scientist | Gen-AI / LLM / ML-Ops | Ex-Amazon

1y

That’s very informative, Ron!

Ryan Kessler

Principal Economist at Amazon

1y

Implicit in this seems to be an assumption that bad launches (i.e., launching when you shouldn’t) are much worse than missed opportunities (i.e., not launching when you should). Would be interested to learn more about how you think about the relative costs of these mistakes.
