Ron Kohavi’s Post

Ron Kohavi

Vice President and Technical Fellow | Data Science, Engineering | AI, Machine Learning, Controlled Experiments | Ex-Airbnb, Ex-Microsoft, Ex-Amazon

Is it important that experimenters doing A/B testing understand the nuances of p-values? No. Two things are important: 1. The organization should agree on a decision-making process (e.g., ship the treatment if the p-value is below 0.05 for a key metric and no guardrails are crossed). That process should be established by people who understand the nuances of p-values. 2. Experimenters should understand that lower p-values provide more evidence that the treatment is different, but also that p-values are easy to misinterpret. For example, 1-p-value is NOT the probability that the treatment is different from the control, a common fallacy often perpetuated by vendors naming 1-p-value as “confidence.”

Jonny Longden claimed (without evidence) in https://lnkd.in/gNMeMDcH that at CJam, “almost nobody understood” the definition of p-value that was shared in https://lnkd.in/e3DMjCvR . I don’t believe he is right about “almost nobody” (though neither of us has strong data), as I believe many in the audience know the statistics; but even if a large segment did not understand the definition, that misses the point. My goal was to highlight the biggest misinterpretation in #2 above. Knowing that you don’t know the details of the p-value (negative introspection) is a more critical first step than knowing the details. I know how to operate a microwave oven, and I use it successfully every day, but I also know that my intuition about how waves cause water molecules in food to vibrate and produce heat is insufficient to build one, as the devil is in the details.

In my classes (https://lnkd.in/gh-s9gn5 and https://bit.ly/ABClassRKLI) I focus on the intuition behind the p-value as evidence, and I caution about misinterpretations to develop that negative introspection and an appreciation that the concept is complex. I don’t believe most experimenters need to understand the full details, but someone in the organization should have that understanding to define the process (#1 above). #abtesting #experimentguide

  • [Table attached to the post; contents not captured in this text]
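To make the fallacy in #2 concrete, here is a minimal simulation sketch (an editor's illustration, not from the post, assuming NumPy and SciPy are available): in A/A tests, where the treatment is identical to the control and the true effect is exactly zero, about 5% of runs still cross the p < 0.05 bar, so a dashboard that labels 1-p as "confidence" would report ">95% confidence" in a difference that does not exist.

```python
# Sketch: why 1 - p is NOT the probability that the treatment differs
# from control. In pure A/A tests (no true effect), p-values are
# uniformly distributed, so ~5% of null experiments still reach "95% confidence".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_users = 10_000, 1_000

high_confidence = 0
for _ in range(n_experiments):
    control = rng.normal(0, 1, n_users)    # both arms drawn from the same
    treatment = rng.normal(0, 1, n_users)  # distribution: true effect is zero
    _, p = stats.ttest_ind(control, treatment)
    if p < 0.05:  # a vendor dashboard would call this ">95% confidence"
        high_confidence += 1

# Roughly 5% of no-effect experiments get labeled "95% confident",
# even though the probability the treatment differs is 0 in every run.
print(f"'95% confidence' reached in {high_confidence / n_experiments:.1%} of A/A tests")
```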
Tyler Buffington

Statistics Engineer @ Datadog

1y

Agree with this post overall. Question: is there evidence to suggest that these success rates should be extrapolated to smaller companies with less optimized products? I'm struggling with the notion that A/B testing isn't worth it unless you have enough users to meet conventional power standards (80% power to detect a 5% MDE at α = 0.05).
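For a sense of scale, here is a rough sketch of what that conventional power standard implies, using statsmodels; the 4% baseline conversion rate is an assumed figure for illustration (not from the thread), and the 5% MDE is read as a relative lift.

```python
# Sketch: sample size implied by 80% power, alpha = 0.05, 5% relative MDE.
# The 4% baseline conversion rate is an assumption for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04                   # assumed baseline conversion rate
mde = 0.05                        # 5% relative lift: 4.0% -> 4.2%
effect = proportion_effectsize(baseline * (1 + mde), baseline)

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
# Roughly 155,000 users per arm (~310k total) under these assumptions,
# which is the bar that smaller companies struggle to meet.
print(f"Required users per arm: {n_per_arm:,.0f}")
```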

Ron Kohavi

Vice President and Technical Fellow | Data Science, Engineering | AI, Machine Learning, Controlled Experiments | Ex-Airbnb, Ex-Microsoft, Ex-Amazon

1y

I love this line by Greenland et al. (2016) in https://link.springer.com/article/10.1007/s10654-016-0149-3 : "Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof."

Ryan Thomas

Optimizing websites for humans @ Koalatative | Building Katsed

1y

What's wrong with treating "confidence" as a useful but not-quite-correct model? Is "95% chance that the variant is different" such a bad shorthand for "95% of the time we would not see an effect of this magnitude or larger if there were actually no difference"? It also seems unnecessary to be so critical and to twist what Jonny said; there's a lot of overlap between your points.

Itamar Gilad

Author of Evidence-Guided, Product Coach, keynote speaker, Ex-Google PM

1y

I like the microwave analogy, but I get the feeling that A/B experiments are far from simple or intuitive to run, as your post illustrates. It feels like you need a resident data scientist to avoid the plethora of fallacies and misconceptions.

Garret O'Connell

Senior Data Scientist at Bolt

1y

How would you respond to Lakens' standpoint that using p-values as measures of evidence is not (really) useful in practice? http://daniellakens.blogspot.com/2021/11/why-p-values-should-be-interpreted-as-p.html

Paulo Freire

Senior Data & Applied Scientist at Microsoft

1y

Fair points. I think a good intuition for the p-value is to think of it as the probability that random chance generated the data we're seeing, or something equally rare or rarer (taken from one of Joshua Starmer PhD's videos). Looking at it this way usually helps newcomers get a better grasp of p-values right off the bat.
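That intuition can be made literal with a permutation test; this small sketch (an editor's illustration with made-up data, not from the comment) counts how often "random chance", simulated by shuffling the group labels, produces a difference at least as extreme as the one observed.

```python
# Sketch: the permutation-test view of a p-value -- the fraction of
# label shufflings ("random chance") producing a difference at least
# as extreme as the observed one.
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, 500)
treatment = rng.normal(0.1, 1.0, 500)  # small true effect, for illustration

observed = treatment.mean() - control.mean()
pooled = np.concatenate([control, treatment])

n_shuffles = 10_000
count = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)                    # break any real association
    diff = pooled[:500].mean() - pooled[500:].mean()
    if abs(diff) >= abs(observed):         # "equal or rarer" outcomes
        count += 1

print(f"permutation p-value ~= {count / n_shuffles:.3f}")
```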

René Gilster

Digital growth through behavioral science, experimentation, and data science

1y

It is long overdue that we move past "confidence" (1 - p-value). It is ingenious as a marketing gimmick, whether labeled "confidence" or "chance to beat original". The problem I am often confronted with: experimenters love the statement that 1 - p-value is the probability that the treatment is better than the control. We therefore need a replacement that is just as powerful and "understandable".

Maja Voje

Bestselling Author | Empowering 9,500+ Companies with My GTM Method | B2B AI-GTM Consultant | Building AI Agents & Agentic Workflows | 72K LinkedIn | 24K Newsletter

1y

It is well worth calculating your own p-value, since vendor-provided metrics can be biased. Their job is to show you that "you are doing well" and that "you should continue experimenting", at least with entry-level full-stack tools.
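For anyone who wants to run that check themselves, here is a minimal sketch using a standard two-proportion z-test from statsmodels; the conversion counts are hypothetical, not from the thread.

```python
# Sketch: computing your own p-value from raw counts instead of trusting
# a vendor dashboard. The counts below are made-up examples.
from statsmodels.stats.proportion import proportions_ztest

conversions = [412, 380]    # treatment, control (hypothetical)
users = [10_000, 10_000]    # users exposed in each arm

stat, p = proportions_ztest(conversions, users)
print(f"two-proportion z-test p-value: {p:.3f}")  # compare with the dashboard's "confidence"
```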
