
The Hidden Flaws in Your A/B Testing Strategy Nobody Talks About

  1. Introduction

  2. Hypothesis testing

    2.1 Introduction

    2.2 Bayesian statistics

    2.3 Test martingales

    2.4 p-values

    2.5 Optional Stopping and Peeking

    2.6 Combining p-values and Optional Continuation

    2.7 A/B testing

  3. Safe Tests

    3.1 Introduction

    3.2 Classical t-test

    3.3 Safe t-test

    3.4 χ²-test

    3.5 Safe Proportion Test

  4. Safe Testing Simulations

    4.1 Introduction and 4.2 Python Implementation

    4.3 Comparing the t-test with the Safe t-test

    4.4 Comparing the χ²-test with the safe proportion test

  5. Mixture sequential probability ratio test

    5.1 Sequential Testing

    5.2 Mixture SPRT

    5.3 mSPRT and the safe t-test

  6. Online Controlled Experiments

    6.1 Safe t-test on OCE datasets

  7. Vinted A/B tests and 7.1 Safe t-test for Vinted A/B tests

    7.2 Safe proportion test for sample ratio mismatch

  8. Conclusion and References

2.6 Combining p-values and Optional Continuation

Combining p-values has been a subject of debate since their origins with Pearson and Fisher [HR18]. Such methods are typically applied in meta-analyses of multiple experiments. Various combination methods exist for different contexts, and it is not always clear which one should be used in a given situation. Safe testing provides a simple, intuitive way to combine the results of many experiments.

Figure 1: False positive probability for the classical t-test for α = 0.01, 0.05, 0.1.

In the section on peeking, it was noted that experimenters may want to make a decision about an experiment based on an intermediate observed effect size. With traditional statistical testing, conclusions drawn from such interim looks are not statistically valid. Safe testing, however, allows the experimenter to decide to continue a test if more data are needed to observe a significant effect, without invalidating the test's error guarantees.
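The mechanics of this combination are straightforward. Under the safe-testing framework, evidence is summarised by an e-value (closely related to the test martingales of Section 2.3), and e-values from consecutive experiments can be combined simply by multiplication, rejecting the null once the running product exceeds 1/α. The sketch below illustrates this rule; the numerical e-values are hypothetical and only serve to show the bookkeeping.

```python
import numpy as np

def combine_e_values(e_values, alpha=0.05):
    """Combine e-values from consecutive experiments by multiplication.

    The running product of e-values is itself an e-value, so rejecting the
    null once the product reaches 1/alpha keeps the overall type-I error at
    or below alpha (by Markov's/Ville's inequality). This is what makes
    optional continuation safe.
    """
    running_product = np.cumprod(e_values)
    reject = running_product >= 1.0 / alpha
    return running_product, reject

# Hypothetical e-values from three experiments run one after another.
products, decisions = combine_e_values([1.8, 2.5, 6.0], alpha=0.05)
print(products)   # running evidence against the null: 1.8, 4.5, 27.0
print(decisions)  # False, False, True -- the product crosses 1/alpha = 20
```

Because the decision to run another experiment can be made after inspecting the current product, this rule gives exactly the optional-continuation behaviour described above, with no correction needed for the number of experiments.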

2.7 A/B testing

A/B testing at first appears to be a simple application of statistical tests; however, there are nuances that are incredibly relevant to experimenters. A typical A/B test will have automated measurements of tens or possibly hundreds of metrics. Consider a test in which an experimenter wishes to measure a new feature’s impact on sales on their website. The target metric for this experiment may be total sales per user. In addition to testing the feature’s impact on total sales, they may wish to see more engagement from users who did not buy anything, because higher engagement with the platform can increase its value to users. Therefore, monitoring secondary metrics, such as the number of favourited items per user, the time spent on the platform, and the proportion of searches that lead to sales, may give additional information about the performance of the feature. There may, however, be unintended consequences of the feature: a bug may cause the website to crash on certain browsers, or the feature may cannibalize sales of cheaper products by showing more expensive ones. It is therefore crucial to monitor so-called guardrail metrics to ensure that the feature is working as intended.
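As a concrete illustration of how these metric roles are typically organised, the snippet below sketches a hypothetical experiment configuration; the metric names are illustrative placeholders and are not taken from the paper.

```python
# Hypothetical metric configuration for the sales-feature experiment
# described above. All names are illustrative placeholders.
experiment_metrics = {
    "target": [
        "total_sales_per_user",        # the metric the feature is meant to move
    ],
    "secondary": [
        "favourited_items_per_user",   # engagement signals, including non-buyers
        "time_spent_on_platform",
        "search_to_sale_rate",
    ],
    "guardrail": [
        "crash_rate_by_browser",       # catches browser-specific bugs
        "average_sold_item_price",     # flags cannibalisation of cheaper products
    ],
}
```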

Aside from the metrics in the experiment, there are other factors to consider when evaluating results. Most statistical tests assume data are independent and identically distributed. However, a new feature may attract interest from curious users, leading to unreliable metrics. This is known as the novelty effect, and it may bias the results of a test. Another consideration is the time it takes for metrics to converge. Some metrics, such as the number of items viewed after a search, give instantaneous results. A metric such as the proportion of users who make a purchase may take several days to converge, because users may be exposed to a test while browsing products and return several days later to make a purchase. This delay between exposure to a test and its realization can make some metrics unreliable in the short term.

A final challenge to large-scale A/B testing concerns the random assignment of users to variants. Each experiment has an associated probability of users being assigned to either the control or the test group. The results of each user’s session are recorded in a database before being aggregated during metric calculation. Issues in this process can lead to unequal samples in the control and test groups. This is known as a sample ratio mismatch (SRM) and can indicate that the test results are biased, and therefore unreliable. It is therefore important for experimenters to continuously monitor the sample ratio of their A/B tests in order to stop erroneous experiments.
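A minimal sketch of the standard SRM check, assuming a scipy-based workflow, is to compare the observed group counts against the configured assignment ratio with a χ²-test. The paper's Section 7.2 applies a safe proportion test to this same problem; the strict alert threshold below is an assumed convention, not taken from the paper.

```python
from scipy.stats import chisquare

def check_srm(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Classical chi-squared check for sample ratio mismatch (SRM).

    Compares observed control/treatment counts with the counts expected
    under the configured assignment ratio; a very small p-value suggests
    the assignment or logging pipeline is biased. The strict alpha is an
    assumed convention for SRM alerts, not a value from the paper.
    """
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1.0 - expected_ratio)]
    _, p_value = chisquare(f_obs=[n_control, n_treatment], f_exp=expected)
    return p_value, p_value < alpha

# Hypothetical counts from a 50/50 experiment with a suspicious imbalance.
p_value, srm_detected = check_srm(100_000, 102_000)
print(p_value, srm_detected)
```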

Having discussed A/B testing and the inflexibility of traditional statistical testing, we now introduce safe testing and how it can be applied to solve these issues.

Author:

(1) Daniel Beasley


This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.
