
In Part 2: What is an A/B Test?, we talked about testing the Top 10 lists on Netflix, and how the primary decision metric for that test was a measure of member satisfaction with Netflix. Subsequent posts will go into more detail on experimentation across Netflix, how Netflix has invested in infrastructure to support and scale experimentation, and the importance of the culture of experimentation within Netflix.

By convention, this false positive rate is usually set to 5%: for tests where there is not a meaningful difference between treatment and control, we’ll falsely conclude that there is a “statistically significant” difference 5% of the time.
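We can see this behavior directly in a simulation. The sketch below (plain Python; the group size, number of simulated tests, and seed are arbitrary illustrative choices, not anything from the post) runs many A/B tests in which treatment and control are identical by construction, applies a two-sided z-test on the difference in proportions, and counts how often we cross the 5% threshold anyway:

```python
import math
import random

random.seed(0)

def two_sided_p(z):
    # Two-sided p-value from a standard normal z statistic
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n_tests = 5_000   # number of simulated A/B tests
n = 500           # members per group (illustrative)
alpha = 0.05
false_positives = 0

for _ in range(n_tests):
    # Both groups flip the same fair coin: the null is true by construction,
    # so every "significant" result here is a false positive.
    control = sum(random.getrandbits(1) for _ in range(n))
    treatment = sum(random.getrandbits(1) for _ in range(n))
    # z-test for a difference in proportions, using the pooled rate
    p_pool = (control + treatment) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (treatment / n - control / n) / se
    if two_sided_p(z) < alpha:
        false_positives += 1

rate = false_positives / n_tests
print(rate)  # hovers around 0.05, as the 5% convention predicts
```

Because there is never a true difference between the groups, the fraction of "significant" results converges to the false positive rate we chose.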

Say we want to know if a coin is unfair, in the sense that the probability of heads is not 0.5 (or 50%).
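One way to make this question precise is an exact two-sided binomial test: flip the coin n times, and ask how probable an outcome at least as extreme as the observed head count would be if the coin were fair. A minimal sketch (the helper names are ours, not from the post):

```python
import math

def binom_pmf(k, n, p=0.5):
    # Probability of exactly k heads in n flips of a coin with P(heads) = p
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_binom_p(heads, n, p=0.5):
    # Exact two-sided p-value: sum the probabilities of every outcome
    # that is at least as unlikely as the one we observed.
    observed = binom_pmf(heads, n, p)
    return sum(binom_pmf(k, n, p) for k in range(n + 1)
               if binom_pmf(k, n, p) <= observed + 1e-12)

print(two_sided_binom_p(60, 100))  # ≈ 0.057: not significant at the 5% level
print(two_sided_binom_p(61, 100))  # ≈ 0.035: significant at the 5% level
```

Note how close the two cases are: 60 heads in 100 flips is not quite enough evidence to call the coin unfair at the 5% level, while 61 heads is.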

If an observation falls in the rejection region, we conclude that there is statistically significant evidence that the coin is not fair, and “reject” the null.
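For the fair-coin null, the rejection region can be computed by accumulating probability in each tail until the 5% budget (2.5% per tail, by symmetry) would be exceeded. A sketch, again with illustrative names of our own:

```python
import math

def binom_pmf(k, n, p=0.5):
    # Probability of exactly k heads in n flips of a fair coin
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def rejection_region(n, alpha=0.05):
    # Walk down from the most extreme outcome (n heads), adding outcomes to
    # the upper tail for as long as its total probability stays <= alpha / 2.
    tail = 0.0
    upper = n + 1
    for k in range(n, n // 2, -1):
        if tail + binom_pmf(k, n) > alpha / 2:
            break
        tail += binom_pmf(k, n)
        upper = k
    lower = n - upper  # the lower tail mirrors the upper one when p = 0.5
    return lower, upper

print(rejection_region(100))  # (39, 61)
```

With 100 flips, this says we reject the null whenever we see 39 or fewer heads, or 61 or more: those outcomes are jointly less than 5% likely if the coin is fair.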
