Is it important that experimenters doing A/B testing understand the nuances of p-values? No. Two things are important:

1. The organization should agree on a decision-making process (e.g., ship the treatment if the p-value is below 0.05 for a key metric and no guardrails are crossed). That process should be established by people who do understand the nuances of p-values.

2. Experimenters should understand that lower p-values provide more evidence that the treatment is different, but also that the p-value is easy to misinterpret. For example, 1 - p-value is NOT the probability that the treatment is different from control, a common fallacy often perpetuated by vendors who name 1 - p-value "confidence."

Jonny Longden claimed (without evidence) in https://lnkd.in/gNMeMDcH that at CJam, "almost nobody understood" the definition of p-value that was shared in https://lnkd.in/e3DMjCvR . I don't believe he is right about "almost nobody" (though neither of us has strong data), as I believe many in the audience know the statistics. But even if a large segment did not understand the definition, that misses the point. My goal was to highlight the biggest misinterpretation in #2 above. Knowing that you don't know the details of the p-value (negative introspection) is a more critical first step than knowing the details. I know how to operate a microwave oven, and I use it successfully every day, but I also know that my intuition about how waves cause water molecules in food to vibrate and produce heat is insufficient to build one, as the devil is in the details.

In my classes (https://lnkd.in/gh-s9gn5 and https://bit.ly/ABClassRKLI) I focus on the intuition behind the p-value as evidence, and I caution about misinterpretations to develop that negative introspection and an appreciation that the concept is complex. I don't believe most experimenters need to understand the full details, but someone in the organization should have that understanding to define the process (#1 above). #abtesting #experimentguide
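For concreteness, here is a minimal sketch of the kind of decision rule described in point 1, using a hand-rolled two-proportion z-test. The counts, the guardrail flag, and the 0.05 threshold are illustrative assumptions, not a recommendation.

```python
# Illustrative sketch of the agreed-upon decision process in point 1:
# two-proportion z-test on the key metric, plus a guardrail check.
# All numbers and names below are made up for illustration.
from scipy.stats import norm

def two_proportion_p_value(conv_c, n_c, conv_t, n_t):
    """Two-sided p-value for a difference in conversion rates."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = (p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t)) ** 0.5
    z = (p_t - p_c) / se
    return 2 * norm.sf(abs(z))

# Hypothetical experiment results.
p_value = two_proportion_p_value(conv_c=1_000, n_c=50_000, conv_t=1_090, n_t=50_000)
guardrails_ok = True  # e.g., no significant regression in latency or error rate

# The agreed-upon rule: ship if p < 0.05 on the key metric and guardrails hold.
if p_value < 0.05 and guardrails_ok:
    print(f"p = {p_value:.4f}: ship the treatment")
else:
    print(f"p = {p_value:.4f}: do not ship")
```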
I love this line from Greenland et al. (2016) in https://link.springer.com/article/10.1007/s10654-016-0149-3 : "Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof."
What's wrong with looking at "confidence" as a useful but not-quite-correct model? Is "95% chance that the variant is different" such a bad shorthand for "95% of the time we would not see an effect of this magnitude or larger if there were actually no difference"? It also seems unnecessary to be so critical of, and to twist, what Jonny said; there's a lot of overlap between your points.
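One way to see why that shorthand misleads is to simulate A/A tests, where there is truly no difference. A rough sketch (sample sizes and the metric distribution are arbitrary assumptions):

```python
# Illustrative simulation: in A/A tests (no true difference), p-values are
# roughly uniform, so about 5% of null experiments still reach "95% confidence".
# That is why 1 - p cannot be read as the probability the variant is different.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments, n_users = 10_000, 2_000

false_positives = 0
for _ in range(n_experiments):
    control = rng.normal(loc=10.0, scale=3.0, size=n_users)
    treatment = rng.normal(loc=10.0, scale=3.0, size=n_users)  # same distribution
    _, p = ttest_ind(control, treatment)
    false_positives += p < 0.05

print(f"Share of A/A tests with p < 0.05: {false_positives / n_experiments:.3f}")
# Expect ~0.05: "95% confidence" is reached in ~5% of experiments with no effect.
```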
I like the microwave analogy, but I get the feeling that A/B experiments are far from simple or intuitive to do, as your post illustrates. It feels like you need a resident data scientist to avoid the plethora of fallacies and misconceptions.
How would you respond to Lakens' standpoint that using p-values as measures of evidence is practically not (really) useful? http://daniellakens.blogspot.com/2021/11/why-p-values-should-be-interpreted-as-p.html
Fair points. I think a good intuition behind p-value is to think about it as the probability that random chance generated the data we're seeing, or something equal or rarer (taken from one of Joshua Starmer PhD's videos). Looking at it this way usually helps newcomers get a better grasp of p-value right off the bat.
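That intuition maps almost directly onto a permutation test, which estimates exactly that probability by reshuffling group labels. A small sketch on made-up data:

```python
# Permutation-test sketch of the intuition above: the p-value is estimated as
# the share of random label shuffles that produce a difference at least as
# extreme as the one observed. The data below are made up for illustration.
import numpy as np

rng = np.random.default_rng(42)
control = rng.normal(10.0, 3.0, size=500)
treatment = rng.normal(10.4, 3.0, size=500)

observed = abs(treatment.mean() - control.mean())
pooled = np.concatenate([control, treatment])

n_shuffles, at_least_as_extreme = 20_000, 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    fake_control, fake_treatment = pooled[:500], pooled[500:]
    if abs(fake_treatment.mean() - fake_control.mean()) >= observed:
        at_least_as_extreme += 1

print(f"Permutation p-value ≈ {at_least_as_extreme / n_shuffles:.4f}")
```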
It is long overdue that we move past "confidence" (1 - p-value). Ingenious as a marketing gimmick, whether presented as "confidence" or as "chance to beat original." The problem I am often confronted with: experimenters love the statement that 1 - p-value is the probability that the treatment is better than the control. We therefore need a replacement that is just as powerful and "understandable."
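One candidate replacement that some teams reach for (my illustration, not something from the original post) is the Bayesian posterior probability that the treatment beats the control, e.g., under a Beta-Binomial model. The conversion counts below are made up:

```python
# Illustrative only: the statement experimenters actually want, "probability
# that treatment beats control", is a Bayesian posterior quantity. Under a
# Beta-Binomial model with uniform priors it can be estimated by sampling.
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical results: conversions / users per arm.
conv_c, n_c = 1_000, 50_000
conv_t, n_t = 1_090, 50_000

# Posterior conversion rates: Beta(1 + conversions, 1 + non-conversions).
samples_c = rng.beta(1 + conv_c, 1 + n_c - conv_c, size=200_000)
samples_t = rng.beta(1 + conv_t, 1 + n_t - conv_t, size=200_000)

prob_t_beats_c = (samples_t > samples_c).mean()
print(f"P(treatment > control | data) ≈ {prob_t_beats_c:.3f}")
# Note: this depends on the chosen prior and is NOT the same number as 1 - p.
```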
It is worth calculating your own p-value, since vendor-provided metrics can be biased. Their job is to show you that "you are doing well" and that "you should continue experimenting" - at least with entry-level full-stack tools.
Agree with this post overall. Question: is there evidence to suggest that these success rates should be extrapolated to smaller companies with less optimized products? I'm struggling with the notion that A/B testing isn't worth it unless you have enough users to meet conventional power standards (80% power with 5% MDE and α = 0.05).
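For a sense of scale, here is a rough sketch of the traffic those conventional numbers imply. The 5% baseline conversion rate is an assumption added to make the relative 5% MDE concrete:

```python
# Rough sample-size sketch for the conventional setup mentioned above:
# 80% power, alpha = 0.05 (two-sided), 5% relative MDE.
# The 5% baseline conversion rate is an assumed number for illustration.
from scipy.stats import norm

baseline = 0.05                      # assumed baseline conversion rate
mde_relative = 0.05                  # 5% relative minimum detectable effect
treatment = baseline * (1 + mde_relative)

alpha, power = 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)    # ≈ 1.96
z_beta = norm.ppf(power)             # ≈ 0.84

variance_sum = baseline * (1 - baseline) + treatment * (1 - treatment)
n_per_arm = (z_alpha + z_beta) ** 2 * variance_sum / (treatment - baseline) ** 2

print(f"Users needed per arm: {n_per_arm:,.0f}")  # roughly 120,000 with these assumptions
```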