The purpose of testing is not to find out what works, but rather to find out what does not work.
I often encounter clients who have what they consider a large percentage of “failed” tests. Yet these tests often reveal a great deal of information and insight that points toward future testing.
In fact, when a test “works” — and I use quotes on that to mean “does what we wanted it to do by supporting the hypothesis in some way” — we often learn less because we over-interpret the success. For example, one frequent test failure pattern I’ve seen is: “we tested a large number of headline copy variations and NONE of them showed any improvement! How can this be?”
Often, the subject matter (in this example, the headline copy) isn’t the problem; the problem is the context in which the test was presented. To paraphrase Shakespeare: “The fault lies not in our tests, but in ourselves.” That is where you go to find the actual insight that leads to better tests. Ask yourself questions like: “What assumptions did I build into that test, and are they all valid?” “If I were sitting across the table from this prospect, they would need X, Y and Z at this point to continue; is my test creating a roadblock to that?”
You may in fact have roadblocked yourself out of teasing incremental improvement from your tests via something vitally important but unrelated to what you were testing. And remember: you are testing not inanimate particles but humans, who have memory and require cognitive resonance in order to proceed. If you disturb that, it takes no small amount of testing effort to tease out what are then minor issues, such as headline copy.
Another question that is often brought up is, “How much traffic do I need for my test to be meaningful?” There are rules of thumb for traffic, the most important of which is that the more homogeneous the traffic, the smaller the variance you can expect between your sample of visitors and the population of visitors as a whole. If you have a site geared toward something specific (say, late-stage lung cancer patients), you don’t need nearly as much traffic to get meaningful results as you would with a broader spectrum of, say, eBay shoppers. That is not a trivial point to keep in mind: the size of your test samples will be driven by that concern, which in turn affects the frequency of your tests and the overall testing schedule you keep.
On some tests you’re going to need 50,000 visitors to reach significance; on others you might only need 500. In fact, the closer any number of variations are to each other in measured performance during the test, the larger the sample size you’ll need for each to achieve the same level of confidence in the results.
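To make that concrete, here is a rough sketch in Python using the standard two-proportion sample-size formula. The baseline and variant conversion rates are illustrative assumptions, not numbers from any real test; the point is how the required visitors per variation explode as two variations get closer in performance:

```python
import math

# z-scores for a 95% confidence level (two-sided) and 80% power.
Z_ALPHA = 1.959964  # 2.5% in each tail
Z_BETA = 0.841621   # 20% chance of missing a real effect

def visitors_per_variation(p1, p2, z_alpha=Z_ALPHA, z_beta=Z_BETA):
    """Approximate visitors needed per variation to detect a shift
    from conversion rate p1 to p2 (standard two-proportion formula)."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# A 2% -> 3% lift is detectable with a few thousand visitors per arm...
big_gap = visitors_per_variation(0.02, 0.03)
# ...but a 2% -> 2.2% lift needs tens of thousands per arm.
small_gap = visitors_per_variation(0.02, 0.022)
print(big_gap, small_gap)
```

Halving the gap between two variations roughly quadruples the sample you need, which is why near-ties eat traffic so quickly.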
Speaking of traffic, another issue that arises is what to do when you expect quite a difference between the original version of your site and something new that you hypothesize to be a strong improvement (because, after all, why would you spend your time working on something you didn’t a priori think would be an improvement?). You try umpteen different variations and … no significant results. If you have the traffic to support it, I recommend running a multivariate test to attempt to deconstruct the results, so that you can learn from them. If you started with a multivariate test, remove variables and test as a standard univariate (what we normally call an A/B) test. When you get an unexpected result, try doing the opposite to see if you get another unexpected result.
One of the most powerful tricks I’ve found is dead simple: repeat the test. You have to convince yourself, and you can do this numerically, that the sample of visitors in your test is representative of your visitor population as a whole. Or, more simply: did you just get a goofy mix of folks in the first test? You can’t really know this from just one test, though there are ways to sniff out some level of confidence.
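One numerical way to sniff that out is a two-proportion z-test between the first run and the repeat: if the two runs disagree by more than chance would allow, the first sample may well have been that goofy mix. A minimal sketch, with visitor and conversion counts made up for illustration:

```python
import math

def two_run_p_value(conv1, n1, conv2, n2):
    """Two-sided p-value for whether two runs of the same test
    could plausibly share one underlying conversion rate."""
    p1, p2 = conv1 / n1, conv2 / n2
    p_pool = (conv1 + conv2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    # Normal-approximation two-sided p-value via the error function.
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Run 1: 150 conversions in 5,000 visitors; repeat: 145 in 5,000.
# A large p-value means the two runs are consistent with each other.
print(two_run_p_value(150, 5000, 145, 5000))
```

A small p-value here doesn’t tell you which run was the fluke, only that at least one sample wasn’t representative, which is exactly the warning you want before trusting the result.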
I’d suggest repeating the test with less traffic, since at the end of the day, whenever you subject visitors to a less optimized experience you’re costing yourself money. What you’re looking to do is see whether the results differ, while costing yourself as little as possible and still getting meaningful results. For example, repeat the test with only, say, 10% of the traffic exposed to it. Yes, it will take longer to run, but while it does, you’ll be testing other parts of your site anyway. It’s definitely a balancing act!
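As a back-of-the-envelope for that balancing act (all numbers here are hypothetical), exposing only a fraction of traffic stretches the repeat test out proportionally:

```python
import math

def repeat_test_days(daily_visitors, exposure_frac, needed_per_arm, arms=2):
    """Days to finish a repeat test when only a fraction of daily
    traffic is exposed to it, split across the test arms."""
    exposed_per_day = daily_visitors * exposure_frac
    total_needed = needed_per_arm * arms
    return math.ceil(total_needed / exposed_per_day)

# 10,000 visitors/day, 10% exposed, 4,000 visitors needed per arm:
# the repeat takes 8 days instead of finishing within 1 day at full exposure.
print(repeat_test_days(10_000, 0.10, 4_000))
```

The longer calendar time is the price you pay for exposing fewer visitors to the weaker experience, which is why it pays to have other tests running on the rest of the site in the meantime.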
Further, back to the earlier question of whether there are rules for the total amount of traffic a test needs: if someone had, say, 5,000 visitors taking a test, I’d much rather see the results of ten runs of the same test with 500 visitors each than one big test of 5,000. The challenge with low conversion rates is that you have to expose a larger number of people to the test to tease out insight into what are typically 1–3% conversion rates. This means the signal-to-noise ratio can be rather poor, but the same techniques used in polling (“Candidate Jones 51%, Candidate Smith 49%”) can be useful.
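The polling analogy can be made concrete with the familiar margin-of-error formula. At a 2% conversion rate (an illustrative figure, not from any real test), 500 visitors leave a margin wider than half the rate itself:

```python
import math

def margin_of_error(p, n, z=1.959964):
    """95% margin of error for an observed proportion p from n visitors."""
    return z * math.sqrt(p * (1 - p) / n)

# At a 2% conversion rate, 500 visitors leave roughly a +/-1.2-point margin
# (noise comparable to the rate itself), while 5,000 visitors tighten it
# to roughly +/-0.4 points.
print(margin_of_error(0.02, 500), margin_of_error(0.02, 5000))
```

This is the signal-to-noise problem in one line: the signal is a percentage-point or two of lift, and at small samples the noise band is just as wide.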
The take-away? Commit to “testing your tests” by repeating them — because if you get a randomly skewed sample of visitors, it will completely throw off your interpretation of the test results. Repeat your “failed” tests to ensure your results aren’t fooling you, and repeat your “success” tests to ensure you aren’t fooling yourself. Worry more about being Directionally Correct than about Metaphysical Certitude.