This is the first in a series in which we explain the fundamentals of statistical significance in relation to specific factors, including sampling error, baseline conversion rate, and more. In this installment, we examine the relationship between statistical significance and sample size.
Back in high school, a lot of us learned the basics of the scientific method. Part of that time-honored process relates to optimization testing: testing a hypothesis. When a marketer runs an A/B test for her web and mobile offerings, for example, she’s essentially testing a hypothesis about customer behavior and conversion rate. Version A is the control, and Version B is the challenger, which represents the change that drives the hypothesis.
But after the marketer runs her test and gets her results, how does she know if they’re valid—if she can build a customer experience optimization (CXO) strategy around them and use them as the foundation for yet more hypotheses? If her results are statistically significant, she can do just that.
What Is Statistical Significance?
Statistical significance is a measure of the likelihood that test results have not occurred by chance. The key word here is “likelihood”: Statistics is about probability, not certainty, so even statistically significant results are not 100% guaranteed to have come from a direct cause. Instead, the best way to think about it is this: As statistical significance goes up, the likelihood that test results were a mathematical accident (or a false positive) goes down.
This is where the term “confidence level” comes in. Confidence level and statistical significance go hand in hand: Essentially, confidence level measures how sure you can be that your results are statistically significant. At Oracle Maxymiser, we recommend our clients run a test long enough to achieve at least 95% confidence. How long that takes depends on many factors, most importantly how many elements are being tested and how many people view each possible experience.
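To make the connection between confidence level and significance concrete, here is a minimal sketch using the standard normal distribution (the usual basis for a z-test; this is a general illustration, not Maxymiser’s specific methodology). It converts a test statistic (z-score) into a two-sided confidence level, showing why 1.96 is the classic threshold for 95% confidence.

```python
import math

def confidence_level(z):
    """Convert a z-score to a two-sided confidence level (%)
    using the standard normal CDF."""
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return (1 - p_value) * 100

print(round(confidence_level(1.96), 1))  # → 95.0, the classic 95% threshold
```

A larger z-score (a bigger observed difference relative to its noise) yields higher confidence; z = 2.58 corresponds to roughly 99%.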
Why Is Sample Size Important?
Sample size factors heavily into statistical significance. This is true whether you’re running an A/B test or a multivariate test. (For simplicity’s sake, I’m going to examine it here in the context of A/B testing.) As a general rule, the larger your sample size, the less likely it is that your results came from random chance. Running a test with a small sample size is therefore one of the easiest ways to get statistically insignificant data. By increasing your sample size from a few dozen to a few hundred (or even a few thousand), you can detect connections and patterns in customer behavior far more accurately.
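You can estimate up front how large a sample you need. The sketch below uses the standard approximate power formula for comparing two proportions (the 4%-to-5% conversion rates are illustrative assumptions, and the defaults of 95% confidence with 80% power are common conventions, not Maxymiser-specific settings):

```python
import math

def sample_size_per_variant(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate visitors needed per variant to detect a change
    from conversion rate p1 to p2 at ~95% confidence, ~80% power."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a lift from a 4% to a 5% conversion rate:
print(sample_size_per_variant(0.04, 0.05))  # thousands of visitors per variant
```

Note how the required sample shrinks as the expected lift grows: small differences take far more traffic to confirm than large ones.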
Let’s say you flip a coin bare-handed 100 times and get 48 tails. Then, you put on a pair of gloves, flip the coin 100 times, and get 56 tails. From this test, you might conclude that wearing gloves directly impacts the odds of flipping tails; your “conversion rate” increased about 16.7% when you put them on. This is an interesting finding! But is it statistically significant? The truth is you’ve flipped the coin only 100 times per condition and conducted only one test (two sets of flipping, with bare hands as control and gloves as challenger). Your sample size is so low that it wouldn’t be surprising (statistically speaking) if your next test had completely different results.
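You can check the coin-flip example directly with a standard two-proportion z-test (one common way to assess an A/B result; other methods exist). Treating bare-handed flips as the control and gloved flips as the challenger:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test: returns (z-score, two-sided p-value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)  # pooled rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_proportion_z(48, 100, 56, 100)  # bare-handed vs. gloved
print(f"z = {z:.2f}, p = {p:.2f}")  # → z ≈ 1.13, p ≈ 0.26
```

A p-value of about 0.26 corresponds to only about 74% confidence, well short of the 95% threshold, so the gloves’ apparent effect is entirely consistent with chance.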
This same concept applies to web and mobile testing and optimization. Say a marketer tests two versions of a call-to-action button, one gray and the other red, and finds the red button converts more than the gray button. While it’s tempting to therefore make the button red by default, the marketer must consider sample size both microscopically (within the test) and macroscopically. Within the test, how much traffic did Versions A and B receive? Macroscopically, how many times has she run the test?
The former is a question of traffic. If her site normally receives 1,000 unique visitors a day but that number dipped to 250 while the test was running (perhaps it’s an academic site and the test ran during summer break), then so few visitors were tested that her results are probably not trustworthy.
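The cost of low traffic shows up directly in the uncertainty around a measured conversion rate. A minimal sketch, assuming a 5% observed conversion rate (an illustrative figure) and the standard 95% margin-of-error formula for a proportion:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for an observed
    conversion rate p measured over n visitors."""
    return z * math.sqrt(p * (1 - p) / n)

# The same 5% conversion rate measured on 250 vs. 1,000 visitors:
print(f"n=250:  ±{margin_of_error(0.05, 250):.1%}")   # → ±2.7%
print(f"n=1000: ±{margin_of_error(0.05, 1000):.1%}")  # → ±1.4%
```

Quartering the traffic roughly doubles the margin of error, so a 5% measured rate on 250 visitors could plausibly reflect a true rate anywhere from about 2.3% to 7.7%.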
The latter is a question of variables that could skew data. If she only runs the test once, it’s possible she’ll get results influenced by factors that won’t be there at other times. For example, during the U.S. holiday season, sites and apps use the colors red and green more often in their designs. As a result, if the marketer only runs her test in December, it might not mean much that a red button beat a gray button.
A quick way for her to gain more insight would be to run the test again to increase its sample size, both in terms of the number of customers she tests (microscopic or within-the-test sample size) and the number of tests she runs (macroscopic sample size).
Keep in mind: This was an example using A/B testing! Multivariate tests typically require a much higher sample size because they measure at least four different versions. The more versions you test, the more traffic you need to validate your data and the more tests you need to run to ensure this data is not the product of chance.
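One way to see why multivariate tests demand more traffic: each added version both splits the visitor pool and adds another comparison against the control, which tightens the significance threshold each comparison must clear. The sketch below is my own illustration (using a Bonferroni correction and an assumed 1,000 visitors a day), not a Maxymiser formula:

```python
# Illustrative assumptions: 1,000 daily visitors, 5% base significance level.
daily_visitors = 1000
alpha = 0.05

for variants in (2, 4, 8):
    per_variant = daily_visitors // variants   # traffic is split across versions
    comparisons = variants - 1                 # each challenger vs. the control
    adjusted_alpha = alpha / comparisons       # Bonferroni-corrected threshold
    print(f"{variants} variants: {per_variant} visitors/day each, "
          f"test each at p < {adjusted_alpha:.4f}")
```

With eight versions, each arm gets an eighth of the traffic and must clear a much stricter p-value, which is why multivariate tests take so much longer to validate.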
Get Better Results, Stat
It’s inefficient to pour resources into a test only for it to yield data you can’t trust. But by accounting for statistical significance before and as you build out your tests, you can maximize your ROI and increase the odds you get valid data you can use.