That’s a nice write-up, covering multiple misinterpretations of p-values as well as a very common practice: designing a fixed-sample test and then treating it as a sequential one by peeking and deciding based on the observed p-values, which nullifies the validity of the calculated statistics. Accounting for multiple testing is also important (testing 20 variations against a control has roughly a 64% chance of yielding at least one statistically significant result at p=0.05 without proper adjustments) and is often underestimated.
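The 20-variations point is easy to demonstrate with a quick simulation. A minimal stdlib-only sketch (the variant count, alpha, and simulation count are just illustrative): under a true null hypothesis a valid p-value is uniform on [0, 1], so we can draw uniforms directly instead of simulating full A/B tests.

```python
import random

random.seed(1)

ALPHA = 0.05
N_VARIANTS = 20
N_SIMS = 10_000

# Under a true null, a valid p-value is uniform on [0, 1]. Simulate 20
# independent null tests per "experiment" and count how often at least
# one of them comes out "significant" by chance alone.
false_alarms = 0
for _ in range(N_SIMS):
    pvals = [random.random() for _ in range(N_VARIANTS)]
    if min(pvals) < ALPHA:
        false_alarms += 1

family_wise_rate = false_alarms / N_SIMS
print(f"P(at least one p < {ALPHA} across {N_VARIANTS} null tests): "
      f"{family_wise_rate:.2f}")  # analytically 1 - 0.95**20, about 0.64

# A simple Bonferroni adjustment (alpha / number of tests) restores
# the intended family-wise error rate of about 0.05:
adjusted = sum(min(random.random() for _ in range(N_VARIANTS))
               < ALPHA / N_VARIANTS for _ in range(N_SIMS)) / N_SIMS
print(f"With Bonferroni (alpha/{N_VARIANTS}): {adjusted:.2f}")
```

Bonferroni is the bluntest possible adjustment; the point is only that *some* correction is needed once you test many variants at once.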

Where you suggest that it’s best to do multiple tests of the same hypothesis before reaching a conclusion, I’d say: if you can afford multiple tests, why not run just one longer test with a stricter significance threshold? I think the only case where several tests beat one longer test is when you start with several variants against a control in the first test, then single out the winner of that first test against the control in a second test, which should grant you an increase in test speed and thus efficiency.

I think you could also have mentioned that in a frequentist framework p-values are not actually attached to hypotheses, but to the measuring device, that is, the test statistic. Making the leap from “we have observed a statistically significant discrepancy between these two populations” to “hypothesis X is true” is not logically warranted, but that is a topic for a much broader discussion.

I think if you care about proper statistics in A/B testing you’d be interested in reading my free white paper: “Efficient A/B Testing in Conversion Rate Optimization: The AGILE Statistical Method” ( https://www.analytics-toolkit.com/whitepapers.php?paper=efficient-ab-testing-in-cro-agile-statistical-method ).

P.S. When describing what the p-value is, this works best for me; maybe it will help your readers as well: when the test assumptions are met, the test was properly designed and executed, and we observe a small p-value, we can say that one of the following holds:

1.) We observed a genuine discrepancy between the variant and the control.

2.) We observed a very rare event, assuming the null hypothesis is true (how rare depends on the actual p-value).

3.) The assumptions of the statistical model are incorrect.

Please note that #1 says nothing about the size of the actual discrepancy between the two, just that some discrepancy exists.
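To illustrate that last point with a sketch (the conversion numbers are hypothetical, stdlib only): a pooled two-proportion z-test will flag even a tiny 0.1 percentage-point lift as significant once the sample is large enough, so a small p-value on its own says nothing about whether the lift is practically meaningful.

```python
import math

def two_proportion_p(success_a, n_a, success_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function, doubled for two sides
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# A 5.0% -> 5.1% conversion lift is tiny in practical terms, but with
# a million users per arm the test calls it highly significant:
p = two_proportion_p(50_000, 1_000_000, 51_000, 1_000_000)
print(f"p-value: {p:.4f}")  # well below 0.05 despite the tiny effect
```

That is why the p-value should always be read alongside an effect-size estimate (a confidence interval for the lift), not in place of one.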
