Modern experimentation platforms rarely test a single metric. Recommendation systems track click-through rate, watch time, retention, and revenue simultaneously. Advertising experiments segment results across countries, devices, creatives, and user cohorts. The statistical problem is that every additional test increases the probability of false discoveries.
Why Multiple Testing Becomes Dangerous
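Suppose you test 20 independent metrics, each at significance level $\alpha = 0.05$. Even if all null hypotheses are true, the probability of observing at least one false positive is:

$$
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{20} = 1 - 0.95^{20} \approx 0.64
$$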
That means there's roughly a 64% chance of reporting at least one statistically significant result that is produced purely by noise.
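To see how quickly this probability grows with the number of tests, here is a minimal Python sketch. It assumes independent tests, all null hypotheses true, and a significance level of 0.05 (the conventional value, and the one consistent with the 64% figure for 20 tests). The function name `familywise_error_rate` is illustrative and does not come from any particular library.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across `num_tests`
    independent tests when every null hypothesis is true."""
    # Each test avoids a false positive with probability (1 - alpha);
    # under independence, all of them do so with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Assumed test counts, chosen only for illustration.
    for num_tests in (1, 5, 20, 100):
        rate = familywise_error_rate(num_tests)
        print(f"{num_tests:>3} tests: P(at least one false positive) = {rate:.3f}")
```

With 20 tests the function returns about 0.64, which matches the figure above. Real metrics are usually correlated, so the independence assumption gives only a rough guide to the actual rate.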