Comparison Specifications
Define how to compare groups in an analysis:

All to Baseline
Compare all treatment groups to a designated control.

All Pairs
Compare every group to every other group.

Specific Pairs
Define exactly which groups to compare.
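As a concrete illustration, here is a minimal Python sketch that enumerates the (baseline, treatment) pairs each comparison type implies. The spec field names (`type`, `baselineId`, `pairs`) are assumptions for illustration, not a confirmed schema.

```python
from itertools import combinations

def comparison_pairs(spec: dict, group_ids: list[str]) -> list[tuple[str, str]]:
    """Enumerate the (baseline, treatment) pairs a comparison spec implies."""
    if spec["type"] == "ALL_TO_BASELINE":
        base = spec["baselineId"]
        return [(base, g) for g in group_ids if g != base]
    if spec["type"] == "ALL_PAIRS":
        return list(combinations(group_ids, 2))
    if spec["type"] == "SPECIFIC_PAIRS":
        return spec["pairs"]
    raise ValueError(f"unknown comparison type: {spec['type']}")

# Example: three groups compared against a designated control
groups = ["control", "treatment_a", "treatment_b"]
print(comparison_pairs({"type": "ALL_TO_BASELINE", "baselineId": "control"}, groups))
# -> [('control', 'treatment_a'), ('control', 'treatment_b')]
```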
Hypothesis Types

Superiority Hypothesis
Test if a treatment improves a metric by a meaningful amount:
- preferredDirection: INCREASE or DECREASE
- minimumDetectableEffect: Relative change considered meaningful (for example, 0.03 = 3%)
Non-Inferiority Hypothesis
Test if a treatment doesn't harm a metric beyond an acceptable margin:
- preferredDirection: INCREASE or DECREASE
- nonInferiorityMargin: Maximum acceptable degradation (for example, 0.01 = 1%)
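A minimal sketch of the two hypothesis shapes as Python dicts. The parameter names mirror the text above, but the `kind` field and overall schema are assumptions, not a confirmed API.

```python
# Illustrative hypothesis configs; parameter names follow the docs above,
# but the "kind" field and dict layout are assumptions for illustration.
superiority = {
    "kind": "SUPERIORITY",
    "preferredDirection": "INCREASE",
    "minimumDetectableEffect": 0.03,  # a 3% relative lift counts as meaningful
}
non_inferiority = {
    "kind": "NON_INFERIORITY",
    "preferredDirection": "DECREASE",  # lower is better, e.g. error rate
    "nonInferiorityMargin": 0.01,      # tolerate at most a 1% degradation
}
```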
Preferred Direction
| Value | Meaning | Example Metrics |
|---|---|---|
| INCREASE | Higher is better | Revenue, conversion rate, engagement |
| DECREASE | Lower is better | Load time, error rate, bounce rate |
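A small helper showing how the preferred direction determines whether an observed relative change counts as an improvement; the function name is hypothetical, chosen for illustration.

```python
# Does an observed relative change move in the preferred direction?
# (Helper name is illustrative, not part of any confirmed API.)
def is_improvement(relative_change: float, preferred_direction: str) -> bool:
    if preferred_direction == "INCREASE":
        return relative_change > 0
    return relative_change < 0

print(is_improvement(-0.02, "DECREASE"))  # a 2% drop in load time -> True
```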
Decision Rules
Combine multiple hypotheses into a single decision:

AND Rule
All hypotheses must be significant.

OR Rule
At least one hypothesis must be significant.

Complex Rule
Combine AND/OR logic:

(guardrail1 AND guardrail2) AND (success1 OR success2 OR success3)
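A minimal sketch of how a nested rule like the one above could be evaluated, assuming a rule is either a hypothesis name or a dict of the form `{"op": "AND"/"OR", "rules": [...]}`; this representation is an assumption, not a documented format.

```python
# Evaluate a nested decision rule against per-hypothesis significance results.
def evaluate(rule, significant: dict[str, bool]) -> bool:
    if isinstance(rule, str):                 # leaf: a hypothesis name
        return significant[rule]
    results = (evaluate(r, significant) for r in rule["rules"])
    return all(results) if rule["op"] == "AND" else any(results)

# (guardrail1 AND guardrail2) AND (success1 OR success2 OR success3)
rule = {"op": "AND", "rules": [
    {"op": "AND", "rules": ["guardrail1", "guardrail2"]},
    {"op": "OR",  "rules": ["success1", "success2", "success3"]},
]}
print(evaluate(rule, {"guardrail1": True, "guardrail2": True,
                      "success1": False, "success2": True, "success3": False}))
# -> True: both guardrails pass and at least one success metric is significant
```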
Group Structure
Define groups with allocation weights:
- id: Unique identifier for the group
- weight: Relative allocation (typically proportional to traffic split)

Example weight configurations:
- Equal split: All weights = 1
- 50/25/25: Weights = 2, 1, 1
- 90/10: Weights = 9, 1
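A quick sketch of how relative weights translate into traffic fractions; the helper is hypothetical, but the normalization (weight divided by the sum of weights) follows directly from the examples above.

```python
# Relative weights normalize to traffic fractions: weight / sum(weights).
def allocations(groups: list[dict]) -> dict[str, float]:
    total = sum(g["weight"] for g in groups)
    return {g["id"]: g["weight"] / total for g in groups}

print(allocations([{"id": "control", "weight": 2},
                   {"id": "a", "weight": 1},
                   {"id": "b", "weight": 1}]))
# -> {'control': 0.5, 'a': 0.25, 'b': 0.25}, the 50/25/25 split above
```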
Statistical Parameters
Significance Level (Alpha)
Probability of a false positive:
- 0.05: Standard significance level
- 0.01: Stricter threshold
- 0.10: More lenient threshold
Statistical Power
Probability of detecting a true effect:
- 0.80: Standard power level
- 0.90: Higher power (larger sample needed)
- 0.70: Lower power (smaller sample sufficient)
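To show how alpha, power, and the MDE interact, here is a back-of-the-envelope sample-size sketch for a binary metric using the standard two-sided, two-proportion normal approximation; it is a rough estimate, not any particular library's routine.

```python
import math
from scipy.stats import norm

# Approximate sample size per group for a two-sided two-proportion test
# under the normal approximation (a sketch, not a production calculator).
def n_per_group(p_baseline: float, relative_mde: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_mde)   # MDE is a relative change
    z_alpha = norm.ppf(1 - alpha / 2)      # e.g. 1.96 for alpha = 0.05
    z_power = norm.ppf(power)              # e.g. 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# 5% baseline conversion with a 3% relative MDE needs a few hundred
# thousand users per group at the standard alpha and power.
print(n_per_group(0.05, 0.03))
```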
Data Types
Binary Data
For conversion-like metrics.

Continuous Data
For numeric measurements.
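For example (illustrative values, one observation per user):

```python
# Illustrative outcome shapes for the two data types
binary_outcomes = [1, 0, 0, 1, 0]          # converted / did not convert
continuous_outcomes = [1.82, 0.94, 2.41]   # e.g. page load time in seconds
```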
Analysis Methods

Different methods have different assumptions and use cases:

| Method | Sequential | Data Type | Use Case |
|---|---|---|---|
| Fixed horizon | No | Both | Final analysis only |
| Sequential | Yes | Both | Continuous monitoring |
| Bayesian | Yes | Both | Continuous updates with prior knowledge |
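As a sketch of the simplest row in the table, here is a fixed-horizon two-proportion z-test on binary data, evaluated once at the end of the experiment; the counts are hypothetical and the helper is not a documented API.

```python
import math

# Fixed-horizon two-proportion z-test: run once, at the planned sample size.
def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))             # two-sided p-value

# Compare observed conversion counts once the experiment has ended
print(two_proportion_p_value(500, 10_000, 560, 10_000))
```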
Method Assumptions
All methods assume:
- Random assignment: Users randomly assigned to groups
- Independence: User outcomes are independent
- Stable variance: Variance doesn’t change over time
- No spillover: Treatment doesn’t affect control group
- Data arrives continuously: New data added over time
- Stopping rules followed: Don’t peek without accounting for it
Best Practices
Hypothesis Design
- Set MDE/NIM based on business impact, not statistical convenience
- Use superiority for metrics you want to improve
- Use non-inferiority for metrics you want to protect
- Define hypotheses before looking at data
Decision Rules
- Require all guardrails to pass (use AND)
- Allow any success metric to trigger (use OR)
- Be explicit about what defines success
- Consider multiple testing adjustments
Power Analysis
- Run power analysis before experiment
- Ensure adequate sample size for MDE
- Consider seasonal effects on sample collection
- Account for multiple comparisons in power calculation
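For the last point, a minimal sketch of the simplest such adjustment (Bonferroni), which divides the family-wise alpha across comparisons; more powerful corrections exist, and the source does not prescribe a specific one.

```python
# Bonferroni correction: with m comparisons, test each at alpha / m to keep
# the family-wise false-positive rate at alpha (simple but conservative).
def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    return alpha / num_comparisons

print(bonferroni_alpha(0.05, 4))  # -> 0.0125 per comparison
```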

