Wave Testing is a way of determining which web content appeals most to visitors. As in any testing scenario, there is a ‘default’ piece of content and a set of alternative variants whose performance is measured against it.
For example, assume there are 4 different pieces of content to test – one of them is the default content and we have three other alternatives. These variants can be different versions of a checkout process, or they could be the possible combinations of ‘elements’ on a webpage. In any case, there are 4 distinct user experiences to be tested.
Wave Testing splits the experiment in time, as shown in the image (right). Instead of testing all 4 experiences simultaneously over a period of, say, three weeks, it breaks the experiment into distinct sets of A/B tests. In our example, the first test lasts roughly one week and compares the default content (A) against the first alternative (B). The same process is then repeated with the remaining alternatives over the following weeks, i.e. each week another variant is compared against the default content.
Wave Testing is often used for faster results, or when there is not enough traffic going into each content variant.
In our example, with the 4 experiences split into several A/B tests spread over time, the first test ‘wave’ would conclude in about a week. This is a much shorter timeframe than the full 3 weeks required for a Full Factorial test in which all 4 experiences are served at the same time.
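The traffic argument can be made concrete with a back-of-the-envelope sample-size calculation. The sketch below is illustrative only: the 5% baseline conversion rate, the 30% target uplift, and the z-values are assumptions, not figures from the text, and it uses the standard normal approximation for a two-proportion test:

```python
from math import ceil

def sample_size_per_arm(p_base, uplift, z_alpha=2.576, z_beta=0.842):
    """Rough per-arm sample size for a two-proportion test at ~99%
    confidence and ~80% power, via the normal approximation."""
    p_alt = p_base * (1 + uplift)
    p_bar = (p_base + p_alt) / 2      # average rate across the two arms
    delta = p_alt - p_base            # smallest absolute effect to detect
    n = (z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / delta ** 2
    return ceil(n)

# Detecting a 30% uplift on a 5% baseline conversion rate:
n = sample_size_per_arm(0.05, 0.30)

# A single wave needs 2 arms' worth of traffic at once, while a
# Full Factorial with 4 experiences needs 4 arms' worth concurrently.
print(n, 2 * n, 4 * n)
```

This is why sites with limited traffic are tempted by waves: each wave only needs two concurrent arms, at the cost of the time-confounding problem discussed below.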
If this first test produces a clear winner (i.e. if content B beats A decisively, with a statistically robust result of, say, a 30% uplift at a 99% confidence level), some marketers may conclude that they need not invest resources in testing the other two alternatives. On the back of such a clear result, one may decide not to test further and not to worry so much about traffic requirements. After all, a great winner has already been found!
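To see what “a 30% uplift at a 99% confidence level” means in practice, here is a minimal sketch of the underlying two-proportion z-test. The visitor and conversion counts are hypothetical, chosen only so that the uplift comes out at 30%:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """One-sided two-proportion z-test: is B's conversion rate higher
    than A's? Returns the z-statistic, p-value, and relative uplift."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))   # upper tail of normal CDF
    uplift = (p_b - p_a) / p_a
    return z, p_value, uplift

# Hypothetical week-one numbers: 10,000 visitors per arm,
# 5.0% conversion on A versus 6.5% on B.
z, p, uplift = two_proportion_z(500, 10_000, 650, 10_000)
print(f"uplift={uplift:.0%}, z={z:.2f}, p={p:.2g}")
```

With these invented numbers the p-value falls well below 0.01, i.e. the result clears the 99% confidence bar, which is exactly the kind of “runaway winner” scenario described above.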
Wave Testing has one major drawback: it compares the performance of variants at different times.
In the example above, A and B were tested against each other in a different week from the tests comparing A with the other content variants. How does one know that the clear winner found in the first experiment (content B) is a true winner when time, as a factor in its own right, could have played a significant role?
For instance, what if that week there was a rush of traffic into the website, perhaps because the marketing team sent out a special promotion? It could be that this special promotion favored content B in particular. This is a common case we see, but there are many situations where purely business-driven factors are at play. Seasonality is one such factor, and it can temporarily skew the behavior of most visitors to a website (see the example below).
The inherently out-of-sync comparison of content variants in Wave Testing implies that time-dependent effects, such as seasonality, may come into play. Time becomes a variable, and it takes a different value in each test. From a purely technical point of view this is misguided practice: any robust statistical analysis of content variants needs all other things to be equal, except for the different variants of online content themselves. Only then can a cause-and-effect analysis be performed at the end of the process.
If both content treatments and time are varying across tests, where do you attribute any uplift in click-through or revenue?
Another simple example that illustrates the point is one where a test is broken down into two smaller waves, each being of 2 weeks duration. Imagine the first test starts at the end of November and the second starts in the middle of December. Clearly, the average visitor behavior can be influenced by the particular timeframe, one test being much closer to Christmas than the other, and therefore the results of the two tests cannot ever be combined to produce a single, coherent analysis.
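The attribution problem can be sketched numerically. All the rates below are invented for illustration: variant C is truly the best, but a seasonal bump in the week it happens to be tested (closer to Christmas, say) inflates both arms of its wave and compresses its measured relative uplift, so the wave results rank B above C:

```python
# Hypothetical 'true' conversion rates, plus an additive seasonal
# bump in week 2 (e.g., proximity to Christmas).
true_rate = {"A": 0.050, "B": 0.058, "C": 0.060}   # C is really the best
week_bump = {1: 0.000, 2: 0.030}

def observed(variant, week):
    return true_rate[variant] + week_bump[week]

# Wave 1 (week 1): A vs B -> uplift as measured within the wave
uplift_b = observed("B", 1) / observed("A", 1) - 1

# Wave 2 (week 2): A vs C -> the seasonal bump lifts both arms,
# shrinking C's *relative* uplift even though the split itself is fair
uplift_c = observed("C", 2) / observed("A", 2) - 1

# True uplifts, with time held constant:
true_b = true_rate["B"] / true_rate["A"] - 1
true_c = true_rate["C"] / true_rate["A"] - 1

print(f"measured: B {uplift_b:.1%} vs C {uplift_c:.1%}")
print(f"true:     B {true_b:.1%} vs C {true_c:.1%}")
```

In this toy setup the waves report B at 16% and C at 12.5%, while the time-constant truth is B at 16% and C at 20%: combining the two waves would crown the wrong variant.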
Overall, a key point to bear in mind is that a Full Factorial test, and ideally a multivariate test (MVT), contains more information than any other test method. Indeed, it extracts the maximum possible amount of information from a statistical test of this kind.
This is because all content variations are compared and contrasted against each other at the same time. MVT experiments also contain much more granular information, “zooming in” to the level of an individual page element. This means that interactions between elements, or between different sub-parts of a page or funnel, are taken into account. Our research across the majority of tests we have performed over the years shows that the probability of significant “correlations” between elements on a page can be as high as 80%.
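The effect of such interactions can be illustrated with a toy 2x2 factorial (all numbers invented): each element's “winner” looks best in isolation, but the two winners clash when combined, which only a design that serves every combination can reveal:

```python
from itertools import product

# Hypothetical additive model for a 2x2 page: headline x button,
# plus an interaction term for one specific combination.
base = 0.050
headline_lift = {"H0": 0.000, "H1": 0.010}   # H1 wins in isolation
button_lift   = {"B0": 0.000, "B1": 0.008}   # B1 wins in isolation
interaction   = {("H1", "B1"): -0.015}       # ...but they clash together

def rate(h, b):
    return base + headline_lift[h] + button_lift[b] + interaction.get((h, b), 0.0)

rates = {(h, b): rate(h, b) for h, b in product(headline_lift, button_lift)}
best = max(rates, key=rates.get)

print(best, rates[best])        # the factorial winner is H1 + B0
print(rates[("H1", "B1")])      # the 'two winners' combined underperform
```

Testing one element at a time would pick H1 and B1 separately and ship the combination, yet in this sketch that combination converts worse than H1 with the original button; only the full grid of combinations exposes the interaction.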