The Risks of Cowboy Split Testing

March 6, 2015

What the Internet Says About Split Testing Can Hurt You

Google likes to think of A/B testing as “Content Experiments.” If you read any of the how-to articles on the subject, you’ll learn about all the ways you can improve your conversion rates by simply changing a header, a button color, a user flow, or whatever else.

And if you listen to the internet, this is the process you should take:

  1. Think about your business goals.
  2. Think of a change that you believe will help you reach one of those goals.
  3. Test to prove or disprove your theory.

This approach may make sense in some contexts, but it doesn’t work well – and can be downright costly – when testing the impact of user experience changes.

Time – The Hidden Danger in Split Testing

The thing the internet doesn’t tell you is just how long split testing really takes. When you’re running tests that influence revenue, waiting for “statistical significance” can be agonizing. What’s worse, simple statistical significance often doesn’t tell the whole story.

At Animoto, there’s a growing consensus that “2 weeks” is the minimum duration for any split test. There are a few reasons for this. Unless you’re testing a heavily trafficked part of your site, it takes a long time to build up enough traffic to get an accurate measurement. If you segment your users in any way (say, by persona), the numbers per segment get even smaller, which makes the test run longer.
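To get a feel for how traffic and segmentation drive test length, here is a rough sketch in Python. It uses the standard normal-approximation sample size formula for comparing two conversion rates. The function names, the 5% significance level, the 80% power, and the assumption that segments are equal in size are all illustrative choices, not a description of Animoto's tooling.

```python
import math
from statistics import NormalDist


def users_per_variant(baseline, lift, alpha=0.05, power=0.8):
    """Approximate users needed in each variant to detect a relative lift
    in conversion rate with a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_needed(daily_users, baseline, lift, variants=2, segments=1):
    """Days of traffic required, assuming users are split evenly across
    variants and (hypothetically) across equally sized segments."""
    per_variant = users_per_variant(baseline, lift)
    users_per_cell_per_day = daily_users / (variants * segments)
    return math.ceil(per_variant / users_per_cell_per_day)


# Hypothetical numbers: 2,000 users a day, 5% baseline conversion,
# and a hoped-for 10% relative lift.
print(days_needed(2000, 0.05, 0.10))              # one segment
print(days_needed(2000, 0.05, 0.10, segments=3))  # three personas
```

With numbers like these, the estimate lands well past two weeks, and splitting the same traffic across three personas roughly triples it.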

User behavior is another big source of confounding factors. Consider purchase latency, the time that elapses between some measured action and the user actually buying something. At Animoto we often test UX around features not directly related to purchase to see how it impacts conversions from our trial tier to a subscription package. If it takes a user a few days to go from interacting with the component under test to making a purchase decision, we have to run our test long enough to account for this delay in conversion. Also consider that some users are more likely to purchase at particular times of the week or month. For instance, a business owner might be more likely to purchase on a Monday, whereas someone making a recap video of a child’s soccer game may purchase on a Saturday. In general, we need to make sure that our test duration is long enough to balance all of this out in aggregate.
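One way to fold both effects into a plan is sketched below. This is a rule of thumb we made up for illustration, not a formal method: enroll users for a whole number of weeks so every day of the week is equally represented, then keep measuring for the typical purchase latency so the last users enrolled have time to convert. The function name and parameters are hypothetical.

```python
import math


def min_test_days(days_for_traffic, purchase_latency_days, cycle_days=7):
    """Round the enrollment period up to whole weekly cycles (at least one),
    then add the purchase latency as a trailing measurement period."""
    full_cycles = max(1, math.ceil(days_for_traffic / cycle_days))
    enroll_days = full_cycles * cycle_days
    return enroll_days + purchase_latency_days


# Hypothetical: traffic alone calls for 10 days, and users typically
# take about 3 days to decide on a purchase.
print(min_test_days(10, 3))  # 14 days of enrollment + 3 days of waiting = 17
```

Monthly purchase cycles could be handled the same way with a larger `cycle_days`, at the cost of a much longer test.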

So running tests accurately can be expensive in terms of time. Waiting around that long is risky, especially with a “content experiment” mentality. How likely is it that your solution will be on target the first time? If it’s not, do you think it will be on target with the second shot? What if it’s not? Do you think you could sit on your hands for a third round of split testing?

Without knowing “why,” you’re just shooting from the hip. In part 2 of this series, we’ll talk about what you should do to increase your odds.

Brian Rhee