Compared to other types of experiments, the distinguishing features of A/B tests are that they:
  • can have multiple treatments, which is sometimes referred to as an A/B/n test
  • use both success and guardrail metrics to identify experiences that improve some metrics without negatively impacting others
  • let you learn about user behavior and find promising ideas
  • have a fixed allocation that doesn’t change during the experiment
  • can use either a fixed or sequential design, where you view results upon conclusion or continuously during the experiment
The goal of an A/B test is to decide if the change has a positive or negative effect on the experience as measured by the test’s metrics. If the change has a positive effect, distribute the variant to everyone using a rollout. A rollout lets you gradually increase how widely to distribute the variant.
Most A/B tests aim to test product changes with the goal of understanding whether you should roll out the changes, or whether they need further development. A learning experiment is another type of A/B test that aims to learn about user behavior or to measure a strategic baseline for the product. This learning is typically achieved by removing a product or feature from the experience, or by degrading the experience in some other way. Such a test helps inform future product prioritization by breaking down which parts of the existing product have the most impact on user behavior or the business. Learning experiments can also be exploratory, aiming only to determine whether a certain variant has a causal relationship to an outcome, regardless of direction.

The Anatomy of an Experiment

An A/B test is made up of several parts. This section gives a high-level overview of these concepts.

The Hypothesis Is the Product Foundation of the Test

A hypothesis is a specific assumption that can be conclusively tested when subjected to an experiment, and is the basis for a good experiment. It guides the experiment from a product perspective, and makes the anticipated impact and value of the experiment clear.

A/B Tests Distribute Different Experiences Through Variants

An A/B test evaluates how users react after exposure to a new experience. Variants describe the different user experiences you test. For example, there could be different variants of a button color. One variant sets the button color to red, another to blue. A variant in an experiment is often referred to as a treatment. These variants often introduce new features, innovations, or changes that should improve the experience for the user. Typically, an experiment has one variant representing the current default (in production) experience, usually called control or the control treatment.
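
As a purely hypothetical illustration (the schema and values are assumptions, not Confidence’s configuration format), the button-color example could be represented like this:

```python
# Hypothetical variant definitions for the button-color example.
variants = {
    "control": {"button_color": "red"},     # current default (in production)
    "treatment": {"button_color": "blue"},  # the new experience under test
}
```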

Randomization Makes Differences Causal

Users in an experiment are randomly assigned a variant. The variant is the only difference in experience between the control and treatment groups, so any observed difference in behavior can be attributed to the treatment. If the treatment group outperforms the control group on the target metric, you can conclude that the treatment variant improves the user experience. Randomization also ensures that the groups are similar: external factors, such as seasonality, other feature launches, and competitor moves, affect control and treatment evenly and therefore don’t bias the results of the experiment.
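
To make the mechanics concrete, here is a minimal sketch of one common way to implement this kind of randomization: hash the user ID together with an experiment-specific salt into a bucket. The function and the 50/50 split are illustrative assumptions, not Confidence’s actual implementation.

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str) -> str:
    """Deterministically assign a user to a variant.

    Hashing the user ID with an experiment-specific salt spreads users
    uniformly across buckets, so assignment is stable for each user but
    effectively random across the population.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in [0, 100)
    return "treatment" if bucket < 50 else "control"  # 50/50 split

print(assign_variant("user-123", "button-color-test"))  # same result on every call
```

Because the salt differs per experiment, a user’s bucket in one experiment is independent of their bucket in any other.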
The treatment effect estimated in an A/B test is only valid for the time of the test. The estimated effect doesn’t necessarily generalize to other future points in time. The same treatment can have a widely different impact depending on when you run the test. For example, recommending Christmas songs in July might not have the same effect as in December. The randomization only ensures that the groups are similar during the experiment.

Metrics Measure the Effect of the Treatments

Every A/B test needs at least one metric. These metrics help prove or disprove the hypothesis and inform a business decision based on the outcome of the test. In other words, your metrics help answer whether the change is good enough to release widely. Confidence supports two types of metrics:
  1. Success metrics are metrics that should improve with the treatment
  2. Guardrail metrics are metrics that don’t need to improve, but shouldn’t deteriorate
It’s common and strongly recommended to use both success and guardrail metrics, for example to guard against cannibalization. An experiment may aim to increase engagement with a new feature, but not by cannibalizing engagement with a related feature. In this case, the engagement in the new feature would be the success metric, while the engagement in the related feature is the guardrail metric.
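
As a hedged sketch of how these two metric roles feed into a decision (the metric names and the rule are assumptions for illustration, not Confidence’s API), the cannibalization example could look like this:

```python
# Hypothetical metric roles for the cannibalization example above.
metrics = {
    "new_feature_engagement": "success",        # should improve
    "related_feature_engagement": "guardrail",  # must not deteriorate
}

def recommend(results: dict) -> str:
    """Toy decision rule: ship only if every success metric improved
    significantly and no guardrail metric deteriorated significantly.
    `results` maps metric name -> (estimated effect, is significant).
    """
    for name, role in metrics.items():
        effect, significant = results[name]
        if role == "success" and not (effect > 0 and significant):
            return "hold"
        if role == "guardrail" and effect < 0 and significant:
            return "hold"
    return "ship"

print(recommend({"new_feature_engagement": (0.03, True),
                 "related_feature_engagement": (-0.01, False)}))  # ship
```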

Statistical Analysis Provides the Answer

Experimentation uses statistical analysis to reach a conclusion. A statistical test is a formal procedure that assesses whether the observed difference between two groups is sufficiently large to say that there is an effect. The goal of the statistical test is to distinguish the actual effect of the treatment from differences caused by noise from random sampling. The statistical tests analyze each metric and ultimately summarize the results as a recommendation for the product decision.
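
For intuition, here is a minimal fixed-design example of such a test: a two-sample z-test for a difference in conversion rates. The numbers are made up, and sequential designs (mentioned above) use different procedures.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z_test(conversions_c: int, n_c: int,
                      conversions_t: int, n_t: int):
    """Two-sample z-test for a difference in conversion rates.

    Returns the estimated effect (treatment minus control) and a
    two-sided p-value under the null hypothesis of no difference.
    """
    p_c, p_t = conversions_c / n_c, conversions_t / n_t
    pooled = (conversions_c + conversions_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, p_value

effect, p = two_sample_z_test(conversions_c=480, n_c=10_000,
                              conversions_t=560, n_t=10_000)
print(f"effect={effect:.4f}, p-value={p:.3f}")  # small p-value suggests a real effect
```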

Roll Out a Successful Experiment

Convert the A/B test to a rollout when the test is complete and you have a winning variant. The rollout targets the exact same users, with all the metrics and configuration from the A/B test. If the A/B test used less than 100% of the allocation, you can scale up to more users. To avoid reassigning users, the control and treatment groups must remain at the same proportions. For example, suppose an A/B test was running at 10% of the population with a 50/50 split of control and treatment. When you increase the rollout percentage to 50%, the control group also grows to 50%, so all users are in either control or treatment. Beyond this point the groups can no longer keep the same proportions, so you can’t continue to track metrics. Read more about rollouts.
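
To make the bucket arithmetic above concrete, here is a hedged sketch. It assumes (illustratively, not necessarily how Confidence lays out buckets) that the rollout percentage is the share of users receiving the treatment, and that control buckets grow from one end of the bucket space while treatment buckets grow from the other, so increasing the percentage never reassigns an existing user.

```python
def bucket_ranges(treatment_pct: float) -> dict:
    """Bucket ranges (out of 100) when the control group is kept the
    same size as treatment so that metrics stay comparable.

    Control grows from the bottom of the bucket space and treatment
    from the top, so raising treatment_pct only adds new users to each
    group; nobody who is already assigned switches variant.
    """
    if treatment_pct > 50:
        raise ValueError("beyond 50%, control can no longer match treatment")
    return {"control": (0.0, treatment_pct),
            "treatment": (100.0 - treatment_pct, 100.0)}

print(bucket_ranges(5))   # the A/B test: 10% of the population, split 50/50
print(bucket_ranges(50))  # rollout at 50%: every user is in control or treatment
```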