Unfortunately there’s a great chance you’re using them wrong, and it’s having an impact on your AB testing program!

Here is a business person’s overview to help you avoid common problems and misunderstandings about p-values.

P-values are a common way to determine the statistical significance of a test. The smaller the p-value, the more confident you can be that the test results are due to something other than random chance.

A common p-value of .05 is a 5% significance level. Similarly, a p-value of .01 is a 1% significance level. A p-value of .20 is a 20% significance level.

If you prefer visual interpretations, what the p-value quantifies is the wing(s) of a distribution if the experiment were repeated many times. Here’s a classical bell curve:

Repeating the experiment multiple times, we’d expect to get slightly different results each time, distributed around some central value. “What does it take to be an outlier?”, you might ask yourself, or, “what sort of test result might I expect to see such that I’d start to think I’ve got something significantly different on my hands?”

In the above, a test result significant enough to warrant a look would be one that lands in the outlying red areas shown on either side of the curve. In this example, p=.05 (5%), so we split that and look at the 2.5% extremes on the left and the right. More commonly we're looking not just for "different from expected" but for "improvement over expected" (conversion rate improvement comes to mind), so we concentrate all of our investigation on the right side and expand it from 2.5% to a full 5.0%.

Some people prefer to think in terms of a Confidence Level. This is simply the p-value's complement, 1-p; thus p=.05 corresponds to a confidence level of 1-.05 = .95 = 95%. Similarly, p=.01 corresponds to a 99% confidence level, p=.20 corresponds to an 80% confidence level, etc.

This lets us examine a range of values: what is the interval, surrounding the central value of all the test results, within which we have 95% (or 99%, etc.) confidence that the true value lies? Anything outside this interval can be considered statistically different and possibly significant.

We move now to how this gets woven into AB testing, because that’s where the real impact comes in. What if we ran an AB test that looked like this:

Let’s pretend Variation A is an existing treatment and is thus a Control, and Variation B is some new treatment that we hope will do better.

Superficially, above, it looks like A is the winner, but we'd have to know a bit more to really say that. Here are three different confidence intervals, all corresponding to a p-value of .05, which might change our mind:

If the 95% confidence interval (i.e., p=.05) of orange Variation A is 23% +/- 1%, that would be a pretty tight fit, so we could feel good that the ‘true’ conversion rate for A is somewhere between 22% and 24%; this is far enough away from green Variation B’s 11%, (with its own confidence interval, I might add) that we’d happily call A the winner. Well, assuming there’s not some systemic bias going on.

As for the middle case, this corresponds to Variation A having a conversion rate somewhere between 18% and 28%; that still feels better than green Variation B, but we might not be quite so confident as in the example above.

On the other end of the range, if the 95% confidence interval of Variation A is 23% +/- 20%, well then we really have very little confidence to say anything much at all, being that the range for Variation A is 3% to 43%. Maybe A is better than B but there’s so much overlap, who really knows?
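To make the three scenarios concrete, here is a minimal sketch of the normal-approximation confidence interval for a conversion rate. The visitor and conversion counts are made up for illustration; the article doesn't state sample sizes.

```python
import math

def conversion_ci(conversions, visitors, z=1.96):
    """Normal-approximation 95% confidence interval for a conversion rate.
    Returns (rate, margin_of_error)."""
    rate = conversions / visitors
    moe = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate, moe

# Hypothetical sample sizes; the article doesn't give them.
rate_a, moe_a = conversion_ci(1150, 5000)  # Variation A: 23%
rate_b, moe_b = conversion_ci(550, 5000)   # Variation B: 11%
print(f"A: {rate_a:.1%} +/- {moe_a:.1%}")
print(f"B: {rate_b:.1%} +/- {moe_b:.1%}")
```

At these (assumed) traffic volumes, A's interval works out to roughly +/- 1.2%, like the tight first scenario; shrink the traffic and the interval balloons toward the +/- 20% case.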

So, as you can see, depending on that confidence interval, we might take action to implement orange Variation A over green Variation B, or we might throw our hands up and figure a better test is needed to really tease anything meaningful out. Or somewhere in between. If we tighten the p-value (making it smaller), we have a more significant result *when it happens*, but it comes at the cost of needing more traffic to get to that level of significance. When we loosen the p-value by making it bigger we lessen the relative amount of traffic needed to get to that level of significance but we might be so loosey-goosey on the significance that it’s not very actionable.

And that was with p=.05, a fairly common value, particularly for AB testing online. In more stringent conditions, for example medical studies of a leukemia drug, where the cost of being wrong impacts people’s lives, it’s far more common to see p-values of .01 or .001. Again, the smaller p-value you accept, the more significant the result will be when a change is detected.

In the example above, I went out of my way to pick an example where the conversion rates between the two Variations were so far apart that it's mostly clear what step to take next. However, the messy reality of online testing is that most tests aren't so clean cut, and the actionability of what-the-heck-do-we-do-with-this-test leads ordinary business people to use p-values incorrectly, to their AB testing program's detriment.

——————–

Here are five problems and challenges with using p-values which are almost certainly impacting your ability to maximize your testing and optimization efforts:

A Null Hypothesis is just a fancy way of saying “actually, there is no difference”.

The purpose of the p-value is to give some measure to the concept “how likely is this data given the null hypothesis? Is the difference we see in B compared to A due to nothing more than random chance?” And that’s *all* the p-value is designed to do. *Nothing more*.

Thus, p = .05 (5%) means we're accepting that 1 out of 20 times we do this test, we'll see some difference between A and B *when there really is no difference at all* (called a "false positive"). The true error rate is actually far worse (more on this in a moment).

**Take-away**: Write out a null hypothesis for your test before the test begins; that is what you’re hoping to disprove with a low p-value. And while you’re at it, declare a p-value (or a Confidence level) that you want to use as a metric.

The only thing the p-value told us was how confident to be about the null hypothesis. That is, is it worth looking further into this difference between A and B, or is it just noise? P-values tell us nothing much about any alternative hypothesis, and they surely don't tell us whether B is a great idea.

This may seem counter-intuitive. But consider: if I ask you, “did it rain today?” and you respond “no, it didn’t rain”, I can’t really conclude that it snowed. Yes, it might have snowed. It might also have been clear skies all day. Or maybe it sleeted, a sort of rain/snow mix. Anyway, the point is: don’t run off to conclusions about the alternative hypothesis just because you get a singular decent p-value that let you discard the null hypothesis.

You may find it useful to consider a so-called "AA test". This simple trick can help keep you honest. What would you expect from an AA test? Your instincts tell you an AA test should show no difference between Variation A and, well, Variation A. Duh. If you did see a difference, you'd chalk it up to just random chance between the group that saw the first A and the group that saw the second A. So don't give B any more import than that.
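If you'd like to see this for yourself, here is a rough simulation, with made-up traffic numbers, of many AA tests evaluated with a standard two-proportion z-test at p=.05. Both arms share the same true conversion rate, so by construction every "significant" result is a false positive.

```python
import math
import random

def aa_test_false_positive_rate(n_tests=1000, visitors=500,
                                true_rate=0.2, seed=1):
    """Run many simulated AA tests; count how often a two-proportion
    z-test declares a 'significant' difference that isn't there."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided threshold for p = .05
    false_positives = 0
    for _ in range(n_tests):
        a = sum(rng.random() < true_rate for _ in range(visitors))
        b = sum(rng.random() < true_rate for _ in range(visitors))
        pooled = (a + b) / (2 * visitors)
        se = math.sqrt(2 * pooled * (1 - pooled) / visitors)
        z = ((b - a) / visitors) / se
        if abs(z) > z_crit:
            false_positives += 1
    return false_positives / n_tests

print(aa_test_false_positive_rate())  # hovers near 0.05, as advertised
```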

[As an aside, running an AA test represents very small risk. You might incur operational costs, though, if you pay for your AB testing software based on traffic sent through the test.]

**Take-away:** AB testing is a game of RBIs, not grand slams (despite what conversion consultants will tell you).

A common mistake is to consider the p-value as the probability that the test result is the number reported. It's not. So statements like the following are you fooling yourself: "Oh, the orange version A has a 5% (p=.05) chance of being 23% wrong". Or, again using the Confidence Level, "Orange version A's conversion rate has a 95% chance of being right". The mistake is in the meaning of "right" and "wrong".

"Being wrong" means this test caused us to reject the null hypothesis when we should have accepted it. Or put it this way: "we fooled ourselves into thinking we had a winner when in reality we should've just stuck with the Control". So better ways to express this probability are: "we have a 5% (p=.05) chance of interpreting this result as significant when it really isn't", or, "we have 95% confidence that the true conversion rate is somewhere inside the confidence interval".

Why is a p-value not a probability? Three great reasons:

- first, we’re looking for the “true” value for the whole population — in our example, the “true” conversion rate, and we’re estimating it by looking at samples drawn from the population. Either that true value is within the confidence interval, or it is not. You can’t be a little bit pregnant!
- second, p-values are calculated *as if the null hypothesis is true* and the difference we're seeing is due to random chance. It doesn't ask *if* it's true, it assumes it already is, and then quantifies our confidence that this assumption is right.
- third, a low p-value, which is what we want, can't tell the difference between the case where we really have a true null hypothesis but the sample is off, versus the case where we have a false null hypothesis due to the data.

**Take-away:** Your chance of coming to the wrong conclusion from a single, isolated test is far higher than you think. Be careful.

Resist the urge to call a test the moment a particular p-value is reached. I call this “tagging”, as in the game of “Tag, you’re it!”. “Hey we ran the test and got p=.05 (95%) the first day! We can stop the test and declare a winner! Yay us!”

In fact, “peeking” this way and calling the test as soon as you hit 95% is almost a sure way to get bad test results. Resist the urge to peek; you’re biased enough already.

Consider: if you stop the test when you hit 95%, how do you know you weren't about to get a whole slew of additional data that would've moved the confidence needle back down? This is a variation on Problem #3 above, but this game of "tag" is so prevalent in business settings that it becomes a fetish: it's almost as if, having spent so much time thinking about the test and setting it up, everyone is in a rush to complete the test, take credit for the winners, silently ignore the losers, and move on to the next test.

**Take-away:** Tagging is a great way to fool yourself. Regularly run AA tests — as often as once a quarter! — declare a winner of the test as soon as you hit p=.05, *and then let the test continue running*. You’ll be surprised how often the other variation you didn’t call wins.
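A quick sketch of why peeking hurts: the same AA simulation as above, except now we check significance after every batch of visitors and stop the moment |z| crosses the 95% threshold. All the parameters here are made up for illustration.

```python
import math
import random

def peeking_false_positive_rate(n_tests=500, batches=10,
                                batch_size=200, rate=0.2, seed=2):
    """AA test again, but 'peeking' after every batch and stopping at
    the first |z| > 1.96. The false positive rate inflates well past
    the nominal 5%."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        for _ in range(batches):
            conv_a += sum(rng.random() < rate for _ in range(batch_size))
            conv_b += sum(rng.random() < rate for _ in range(batch_size))
            n += batch_size
            pooled = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(2 * pooled * (1 - pooled) / n)
            if abs((conv_b - conv_a) / n) / se > 1.96:
                hits += 1  # called a 'winner' in an AA test
                break
    return hits / n_tests

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

With ten peeks per test, the false positive rate typically lands several times higher than the 5% you thought you were paying for.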

OK, so you resisted playing tag with your tests, and you've stopped thinking that p=.05 means you've got a 95% chance of being "right". What is an actual estimate of your chance of being wrong?

The true error rate is typically much higher than the p-value. Some studies put the true false positive rate for p=.05 tests close to 25-50%, and for p=.01 at 7-15%. Yikes! This should scare you, or at least make you cautious.

If the effect is true, the result should be reproducible. If you can’t wait to implement the supposed improvement, at least make a calculation as to what the financial cost of implementing wrongly is and balance it against the opportunity cost of not making the change. Maybe it will help justify the resources to do a follow-up test. Or maybe it will be worth simply re-jiggering what percentage of your traffic gets the new treatment.

Run successful tests multiple times. The more impactful the experimental results might be to your company, the more important it is to re-run the test again (and even “often”). Likewise if you have a failed test, consider putting it back in your inventory of test ideas to be run again at some future date (again, weighted by how relatively “big” you feel the underlying idea is to your team). You’ll be surprised how often you can save a decent idea this way.

**Take-away:** All successful tests from from unsuccessful “parent” tests. Embrace a healthy dose of failed tests as the cost of finding the winners.

Ronald Fisher, who popularized p-values, once said:

‘A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance’

In other words, be careful. Fisher never intended p-values to be anything other than a useful way to determine if the data merited a second look.

Recognize you have a bias, always. This is particularly true in business settings. You're already biased toward B: if you really thought A and B would show no difference, you'd probably be running a different test. Your brain is primed to fixate on any improvement and to dismiss evidence against it as "just noise."

Since the chance of a big change from a test is fairly low (one runs out of low-hanging fruit fairly quickly), any test that results in a big improvement (100%! 500%! 1000%!) probably needs a healthy dose of skepticism. It's more likely to be due to something odd.

————–

In an upcoming post, I'll delve into Bayesian AB testing and how it leverages a more intuitive approach to avoid many of the problems inherent to classical frequentist statistics (including the p-value we just covered).

Today, let's try this on another topic: using a boxplot to get a general overview of the spread of your data without using "capital S" Statistics, and, again at a rough level, making smarter "guestimates" of outcomes.

A boxplot looks something like this:

It has a central box with a divider inside, and two “whiskers” going out each side.
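A minimal sketch of the five values a boxplot draws, using Python's standard library and made-up numbers (the quartile method is an assumption; plotting tools vary slightly in how they compute quartiles):

```python
import statistics

def five_number_summary(data):
    """The pieces a boxplot draws: min, Q1 (box bottom), median
    (the divider), Q3 (box top), max (whisker ends, ignoring
    outlier rules)."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)

# Made-up daily order counts; note the outlier at 48.
orders = [12, 15, 14, 10, 48, 16, 13, 11, 14, 17, 15, 12]
print(five_number_summary(orders))
```

Even without a chart, the summary shows the middle half of the data sitting in a narrow band, with one stray value far above it.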

Read this entire article over at my MarketingLand.com column

This time, I want to show you a technique to fix another common time-series problem: seasonality. Yeah, your metrics are down in January, but is that the usual post-holiday sales slump? Or is it the start of a true downtrend that you need to keep an eye on? The article will illustrate a fast and simple way to de-seasonalize your data.

Let’s work through an example step by step:

Read this entire article over at my MarketingLand.com column

The Exponential Moving Average ("EMA") is a very useful tool in your metrics arsenal because in the EMA the memory of past metrics values is never forgotten, though it is gradually given less and less weight over time. Further, since the EMA is based on what the value of your metric is today combined with what the EMA was yesterday, you don't have to keep large amounts of past data on hand to update each day's new values.
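A minimal sketch of that update rule (the daily values and the smoothing factor alpha below are arbitrary examples):

```python
def update_ema(previous_ema, new_value, alpha=0.2):
    """One EMA step: blend today's value with yesterday's EMA.
    alpha controls how fast the memory of old values fades."""
    return alpha * new_value + (1 - alpha) * previous_ema

# Daily metric values; only yesterday's EMA needs to be stored.
values = [100, 102, 98, 120, 97, 99]
ema = values[0]
for v in values[1:]:
    ema = update_ema(ema, v)
print(round(ema, 2))  # the spike to 120 is remembered, but faded
```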

Google AdWords almost certainly uses an EMA to track your keyword Quality Score.

Read this entire article over at my MarketingLand.com column

You can read Part 1 dealing with thinking through negative metrics here.

Now, let us turn attention to setting up negative metrics for any of your tests. This, too, I’ll cover by discussing a specific example from real life.

One of my larger clients earns a significant percentage of its revenue from ad impressions on its site. It’s considered a “trusted source” in its niche, and people visit this site all the time for information, reviews, and fair-market price comparisons.

Because of the important revenue coming from ad impressions, one of the success metrics they use is “pages per session”. One of the metrics they try to minimize is “bounce rate”. Both of these are completely rational ways to measure success. In a recent test they were able to substantially reduce bounce rate on a given landing page, and apparently page views per session was approximately constant, so on first blush this seems like a success. “But what about the negative metric?”, I asked. Reader, can you think of what the negative metric might be? Or at least think of how it would manifest?

What if solving the bounce rate problem simply causes the visitor to “pogo stick” around on ensuing pages? Pogo sticking means more page views but very short, very quick changes of pages, indicating non-engagement. This isn’t good, you’ve simply moved the problem to another place as I mentioned in my earlier post on conversion symptoms versus conversion disease. So in this test, bounce went down, which is good, but the effect for the company isn’t good.

What about page views per session, this test’s other success metric? If someone is pogo sticking then these should be expected to actually go up . Great for ad impressions (maybe), but probably not for ad viewing (surely not what the person who bought the ad impressions wants!). And pogo sticking means the visitor isn’t achieving what they want, and that’s not an effect the company wants either.

So what we're actually looking for is an additional metric, this time a negative metric, to indicate whether our success metrics of bounce rate and page views per session are fooling us. In this case a time component would help, so perhaps "avg time spent on page per session". Right? If the visitor is pogo sticking, this metric will go down strongly relative to the control, even if bounce on the landing pages improves and page views go up. It can help indicate to us whether the test is truly a success or not. If this number stays relatively constant, and we improve bounce and page views at the same time, then we really do know we have a success.
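Here's a rough sketch of that negative metric, with made-up per-page times, showing how a pogo-sticking variation betrays itself even while racking up page views:

```python
def avg_time_on_page(sessions):
    """Average seconds per page view across sessions. A pogo-sticking
    visitor racks up page views but spends little time on each."""
    total_time = sum(sum(pages) for pages in sessions)
    total_views = sum(len(pages) for pages in sessions)
    return total_time / total_views

# Made-up per-page seconds: control vs. the 'improved' variation.
control = [[45, 60, 30], [50, 40]]          # engaged reading
variation = [[5, 4, 6, 5, 3], [4, 6, 5, 7]]  # many quick page flips
print(avg_time_on_page(control), avg_time_on_page(variation))
```

The variation wins on page views per session yet collapses on time per page, exactly the false success this negative metric is there to catch.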

Take-away: every time you define a success metric for your testing and optimization efforts, consider what the negative metric will be. What ways could your test have a false success … and how would you know?

A negative metric is not necessarily something you want to get less of — for example, "reduce bounce rate". That's just an easier way to state the positive inverse of "increasing the un-bounce rate". Rather, a negative metric is something you look at to ensure that when you have success with your positive metric, you aren't penalizing yourself somewhere else. I covered this when I discussed the value of nothing earlier.

Read this entire article over at my MarketingLand.com column

I do have an article on setting up negative metrics for your tests, along with an example from a completely different direction.

Well, it works just as well when you’re concentrating on your metrics and optimization program.

What do the letters mean?

Read this entire article over at my MarketingLand.com column

In other words, why should we believe that the place at which we measure our metrics is also the location of the problem? Perhaps the metric is simply reporting a symptom, but not the malady itself?

Take, for example, a rather common fixation on Exit Pages. These are often bumped up the organizational ladder as “oh, these pages have to be fixed! They have high Exit rates, therefore people are leaving! Let’s re-factor, or AB test, or etc”.

Certainly a high Exit rate indicates *something* is amiss, and deserves attention. What I'm suggesting, however, is that a high Exit rate page is often not a problem with the Exit page itself at all, but rather a manifestation of a problem which may have occurred much earlier in the process.

Do we really want to be treating Symptoms rather than the Disease? You have weight loss? Eat more. You're thirsty all the time? Drink more. Feel tired or run-down? Get a full night's sleep. Yet all three of those symptoms are correlated with diabetes, for which "eat more, drink more, sleep more" are hardly the best pieces of advice. You may well cure a symptom (act locally) but have little impact on the disease (a global problem).

The Exit page can be thought of as the place where the visitor "gave up". Something occurred on previous pages or interaction points, and the reported Exit page is simply the final divorce decree your customer is serving on you. Yes, it is much like a divorce, where the marriage has ended *de facto* long before it's ended *de jure*.

Many analysts get caught up in this conundrum. They are tasked with reporting metrics and (usually) making suggestions for improvement. But unless they are looking at the bigger picture, they have a built-in incentive to treat the symptomatic problem. Don’t you fall into that.

To be sure, there are plenty of cases where the Exit page is the problem. And this is my point; simply being a page with a high Exit rate isn’t sufficient in and of itself to diagnose the problem. So, in line with taking a broader view of continuous optimization at your organization, join me in this thought experiment: what would it mean if the Exit page itself were a problem, versus those instances where the Exit page is simply the place where the problem is measured? How would we expect the metrics of Exit pages to act in this context?

Here’s one approach:

If we think of the Exit page as "end of conversation" or "not interested", then you might expect the time spent on this page to be approximately average, or even above average, compared to all other Exit pages. The visitor has continued down a conversational path with you and has come to a point where, in some context, you're no longer relevant to her. If this is that point, then she'll finish up with this page and look around for more info or move on. Of course, she doesn't move to another page on your site at this point (since we measured THIS page as the Exit page). Fair enough; we can investigate the various factors on that page that may have gone awry, and fix those we are capable of fixing.

However, what about when the visitor loses her way long before the Exit page? Obviously, she hasn’t exited yet (otherwise one of the earlier pages would have been analytically reported as the Exit page for this visitor). But from the moment of her dis-engagement, what we might expect a human to do is to flitter around a bit in an attempt to get back on track or find what she is looking for. Visitors have goals on your site and they will put in (at least a little) effort in getting to those goals. Maybe hit the Back button. Or go to the home page. She might even start using the Primary navigation (you may be surprised, but Primary Nav is one of the least used parts of a site among visitors who are getting what they want, and one of the most used parts of a site among visitors that are having a “disconnect” from you).

So what might we expect to see in the metrics in this case? We should expect such Exit pages to have a lower time spent on page compared to the average Exit page. And likely the pages just before the Exit page also have lower-than-average Time Spent, as she jumps around trying to rediscover the scent of her intended trail.

What else might we expect? Well, in cases where your Exit pages suffer from only one type of problem or the other (that is, "we have a problem with the Exit pages themselves" versus "we have a problem somewhere earlier in the process"), the spread of the average metrics for these pages, such as Time Spent, should be fairly narrow and static over time. The standard deviation of the metric will be fairly tight compared to its average.

In contrast, if you have both types of Exit page problems on your site, then you'd expect the standard deviation of Exit page metrics to be much wider, because you're really measuring two different populations of problems. This in itself suggests an occasional "binning" of the Exit pages in some visual way so you can diagnose whether you have anything other than a bell curve distribution of Exit page problems.
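One rough way to run that check, sketched with made-up per-visit time-on-page numbers: compare each Exit page's spread relative to its mean. A wide spread hints that two different problem populations are mixed together.

```python
import statistics

def spread_summary(times_on_page):
    """Mean, standard deviation, and coefficient of variation (sd/mean)
    for a page's time-on-page values. A high cv suggests the page mixes
    two different visitor populations."""
    mean = statistics.mean(times_on_page)
    sd = statistics.stdev(times_on_page)
    return mean, sd, sd / mean

# Hypothetical per-visit seconds on two Exit pages.
one_problem = [58, 61, 55, 63, 59, 60, 57, 62]   # tight: one story
mixed = [12, 65, 9, 70, 15, 61, 11, 68]           # wide: two stories
for label, data in [("one problem", one_problem), ("mixed", mixed)]:
    mean, sd, cv = spread_summary(data)
    print(f"{label}: mean={mean:.0f}s sd={sd:.0f}s cv={cv:.2f}")
```

The second page's spread is an order of magnitude wider relative to its mean, which is exactly the signal to bin those visits and look for two distinct problems.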

Once you start thinking about your problem with Exit pages this way, you can come up with better ways to isolate Symptoms from Disease, and you’re that further along in treating both effectively. Your Patient-Visitor will thank you because she’ll get more done at your site.

By the way, this sort of shift in your thinking will point you towards a similar approach to other problems on your site. For example, Bounce Rate.

[As an aside, I'll make the distinction here that an Exit page is the last page the visitor was on in a session, and a Bounce page is a special kind of Exit page where the visitor was only ever on that one page before leaving.]

For years, people have made a lot out of Bounce Rate (as they should), but without considering that the Bounce page, typically a landing page or home page, may not really have any problem with it at all.

Again, this doesn't mean that all Bounce Rates are ignorable. Just the opposite: what I'm asserting is that there is as strong a possibility that the Bounce page is being bounced off of because of something wrong with the Ad or the referring Search Engine result, which had set up an expectation of relevancy (a contract with the visitor, if you will) that the Bounce page isn't prepared to handle. Perhaps someone in charge of PPC efforts has changed something in the Ad, all with good intent, but if this scent isn't followed through from the Ad onto the ensuing pages, it manifests as an increase in the Bounce rate when visitors get to the site.

This comes about far more often than you would think because so many organizations are set up as silos. You've got the analysts on one side trying to measure as much as they can, the people responsible for the website tweaking and optimizing away, and the PPC folks driving ad click-thrus, possibly without ever interacting with the team managing the site. All of this creates symptoms that something is wrong with the site when the disease may well be the lack of coordinated effort across the company.

That should give us all something to think about, right? I'm curious: what percentage (rough estimate) would *you* put on the ratio of Exit pages that have problems with the page itself versus problems that occurred much earlier in the process? My experience is that it's far closer to 50:50 (meaning: "it's a coin flip! And I can't treat this problem until I know more!") than any organization would like to admit.

[This article is Cross-posted to my monthly column at MarketingLand, which is a great place to read all sorts of interesting content.]

I want to try to explain the latter, especially for marketers. Why? Because political elections are in many ways like optimization testing: you've got two (or more) candidates and the 'market' is choosing between them. So insight we gain from understanding political polling is valuable to us when we look at our testing efforts and the metrics we use to judge them.

And, I’m going to attempt this without using much math. Why? Because many readers are marketers, and numbers can be … a challenge. You can always go dig up your local “small data” geek when you need to run such numbers yourself — so let’s focus on the concepts.

First off, to avoid the inevitable arguments over party affiliation, let us consider two fictional candidates, Mr. Smith and Ms. Jones. In the latest poll, Jones leads Smith 49 to 47, and the tiny print at the bottom of the results says there's a margin of error of +/- 2 points.

So, what do the above numbers mean? Based on a sampling of the voters, the simple answer is "Jones is slightly ahead (apparently)". You didn't need a math whiz to derive that, although the "apparently" might seem odd.

As always, start with defining what metric we’re looking for. What we would really like are the “true” numbers — what the results will be on election day. That’s what we’re trying to predict or at least get a sense of. We are looking for which candidate is ahead at the time the poll is conducted as a means of guessing what the election outcome might be like. And again, for simplicity, we’ll leave out the discussion of polling bias, systemic bias, turn-out bias, etc., anything that might cause the people responding in the poll (the “sample” voters) to be anything other than completely representative of folks on election day (the “population” voters).

Now a poll is just a snapshot in time; it’s not really a prediction for the (future) election day. But you can well imagine that as the polls get closer and closer to election day that they should in principle reflect closer and closer to what the end results turn out to be, at least as compared with polls taken many months ahead.

In the same way, if you’re running an AB test on your site, you are hoping that the people responding to the test are approximately the same sort of customers as who buy from you — usually true since the “control” variation is often the same as “how we currently do things” on your site. And, the equivalent of a “poll” is the snapshot of your AB test when you’re part-way through the test.

Back to Jones and Smith. What we know is this: a poll was conducted. Jones leads by two points. And there’s this odd “margin of error” number being thrown at us.

Now, since what we're really interested in is the outcome on election day, what we get from the margin of error is an estimate of where the True number for Jones and for Smith might be. A very common way to quote margin of error is at the 95% level. From the Jones-Smith poll, this means that Jones' "true" number is somewhere in the range of 49% +/- 2%; that is, the True Jones number is somewhere between 47% and 51% (by the way, you will often see that range referred to as the "confidence interval"). Likewise, Smith is somewhere in the range 45% to 49% (47% +/- 2%).

And again, these are qualified as being “with 95% confidence”… all else being equal, if you did this poll a bunch of times you’d certainly get slightly different numbers for Jones and Smith, but their True numbers would be outside the range above only about 1 time out of 20 (5%).

This is where most people stop. (Hmm, wait a second: most people stop way before this! But most folks who are interested stop right about at this point.) Stop now, though, and you'd be missing all the interesting stuff to follow:

1) One important thing to keep in mind is that the margin of error, expressed as a percentage like +/- 2% above, depends on the number of people responding to the poll and on how close the results are. The same results over a larger number of people mean a tighter margin of error. A larger difference between the candidates (say one candidate is way ahead) over the same number of people also means a tighter margin of error (although in that case it's often more accurate to calculate margins of error for each candidate, but I digress). In races that are relatively close, as in our Jones-Smith example, or in a country whose electorate is somewhat evenly split between two candidates, the margin increases. In U.S. politics the numbers for the major party candidates are often in the 40s for almost the entire election cycle, so a particular poll's margin of error is often just a function of the number of respondents; in the Jones-Smith poll there would have to have been a couple thousand people to get a margin of error of 2%.
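For the curious, that "couple thousand" figure falls straight out of the standard normal-approximation formula for a polled share's margin of error; here's a minimal sketch:

```python
import math

def margin_of_error(share, n, z=1.96):
    """95% margin of error for a polled share (normal approximation)."""
    return z * math.sqrt(share * (1 - share) / n)

def respondents_needed(share, target_moe, z=1.96):
    """Smallest sample size whose margin of error meets the target."""
    return math.ceil((z / target_moe) ** 2 * share * (1 - share))

# How many respondents for the Jones-Smith +/- 2% margin?
print(respondents_needed(0.49, 0.02))  # about 2,400: a 'couple thousand'
```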

2) More importantly, just because Jones' True number is somewhere in the range 49% +/- 2% doesn't mean every value in that range is equally likely. It's much more likely for Jones to be at, say, 50% than at 51%. After all, the entire analysis is based on the assumption that we're randomly sampling people when we do our poll. But also keep in mind that for Smith to be at 50% is way less likely than for Jones. This is an important part of what most people miss.

Yes, it's absolutely possible for Smith to really be at 49%, the upper end of his confidence range, and for Jones to be at 47%, the lower end of hers, but that's not nearly as likely as Jones at 50 and Smith at 48. You might find it useful to think of it visually, with Jones having a bell curve centered at 49% and Smith having one centered at 47% (with the spread of each curve determined approximately by 2%/2). For Jones to win, her curve needs to be above Smith's, which it is most of the time. For Smith to win, he not only needs to get more of the vote than he is currently getting, he must also get more of the vote than Jones, which is more of a challenge. So in this way, Jones really is further ahead than the 49-47 spread superficially indicates.

3) So how would we go about estimating Jones's chances to win? We know that voters in the poll are choosing her 49% to Smith's 47%. But 49% isn't her probability of winning; that is determined by all the cases in which she gets the most votes. This is handled mathematically by something called a Monte Carlo method: we "run" the election based on these poll results by generating random numbers. Of course, as mentioned earlier, not every outcome is equally likely: you can roll snake eyes on a pair of dice, but the chance of rolling snake eyes is not the same as rolling, say, an eight. If you do this enough times, you start to build up a probability curve for the expected results.

So we repeat this simulation of running an election a hundred, or a thousand, or ten thousand times and compare all the results where Jones wins to those where Smith wins. Professionals who number-crunch like this typically use at least 100,000 runs. Given the numbers from our fictitious poll, what do you suppose the winning probability would be for Jones?

Pick a number in your head. Even odds of 50%? 60%? 75%? Write it down.

This is where the human part of the brain can lead one astray. If you saw numbers like that on your TV screen, one candidate ahead 49%-47%, you’d think “hmm, that’s somewhat of a close race; maybe the underdog can pull ahead”.

Turns out Jones is expected to win upwards of ... wait for it ... 96% of the time! Of course, the election is not run umpteen thousands of times; it's only run once. So there's definitely a chance for underdog Smith to win (about 4%, or 1 in 25). And there's always the possibility that more of Smith's people will turn out on election day, etc., but according to the snapshot poll in our example, Jones looks pretty darn strong to take it.
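Here's a minimal sketch of that Monte Carlo idea, in Python rather than R. The exact win probability you get depends on the spread you assume for each candidate's bell curve (I've used roughly 1%, per the 2%/2 note above) and on whether the two shares are drawn independently, so this simplified version will land somewhat below the figure quoted above; the point is the method, not the decimal:

```python
import random

def jones_win_probability(jones=0.49, smith=0.47, spread=0.01,
                          runs=100_000, seed=42):
    """Run the election `runs` times: draw each candidate's 'true' share
    from a bell curve centered on their poll number, count Jones's wins."""
    rng = random.Random(seed)
    wins = sum(rng.gauss(jones, spread) > rng.gauss(smith, spread)
               for _ in range(runs))
    return wins / runs

print(f"Jones wins about {jones_win_probability():.0%} of the simulated elections")
```

Either way, a 2-point lead in the poll translates into a far more lopsided probability of winning, which is the whole point.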

So, many times things that look close really aren't. And it's extraordinarily easy to deceive oneself on these occasions. Keep this in mind when you're running your AB tests!

By the way, this use of the Monte Carlo technique is what Google Analytics Content Experiments (previously called Google Website Optimizer) used as a technique for “chance to beat” calculations.

p.s. I ran the simulations for our fictitious poll using a nifty program called R (which numbers people love but which might confuse the average marketer — use with adult supervision!)

[This article is Cross-posted to my monthly column at MarketingLand, which is a great place to read all sorts of interesting content.]

Continuing with our earlier list, here are four more ways a marketer might self-deceive:

Example: “Our Average Customer is 32.47 years old”

Apparently, your average customer was born June 21. Ok, and your point is...? Do you honestly think you can measure the average age of your customers to within three days (i.e., hundredths of a year)? And is there any difference between your customers who are 28 and those who are 37? Maybe, for actuarial reasons, if you're in the insurance business. Maybe not, if you sell sweaters.

Precision, particularly for those who are also subject to Self-Deception #1 ("Innumeracy"; see the previous article), is often a fetish, giving a false sense of accuracy. In other words, if you can measure it as 32.47, then by golly, it can't be 31 or 33, so it *must* be right!

Instead, consider the difference in the meanings of the words: Accuracy measures how close the data are to the True number (in this example, the average age of your customers, rounded to the nearest year). Precision, however, measures how close the data are to each other, without regard to how close they are to the truth.

If you're more of a visual person, imagine a typical bell curve; in this case, a bell curve of the frequency of customer ages. Accuracy is a measure of how close the center of the bell curve is to the true value; Precision is a measure of how spread out the bell curve is.
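To make the distinction concrete, here's a tiny sketch with made-up age data (the numbers are purely illustrative): one sample is precise but inaccurate, the other accurate but imprecise.

```python
import statistics

true_average_age = 32  # the (unknowable) True number, in years

precise_but_inaccurate = [35.1, 35.2, 35.0, 35.1]  # tight cluster, wrong center
accurate_but_imprecise = [25, 39, 28, 36]          # scattered, centered near 32

for label, ages in [("precise sample ", precise_but_inaccurate),
                    ("accurate sample", accurate_but_imprecise)]:
    center = statistics.mean(ages)    # how close to the truth? (accuracy)
    spread = statistics.stdev(ages)   # how close to each other? (precision)
    print(f"{label}: mean={center:.2f}, spread={spread:.2f}")
```

The first sample looks impressively consistent, yet every number in it is further from the truth than the messy second sample's average.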

While this may seem a little too prissy in insisting on what the words mean, this Self-Deception #5 is a fairly common mistake even among people who are pretty smart! So the better you get at avoiding #1-#4, the more likely you are to end up committing #5! Oh, we humans *are* an amusing lot!

General rule of thumb: you can't be more precise with your calculated metric than you are with your least precise data. If you are measuring customer age in years (not months), then your average age will be in years. In fact, your math teacher likely told you to quote a median value rather than the average in such cases, specifically to avoid this sort of self-delusion. (That was probably the time you cut class and got detention for smoking under the bleachers.)
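For instance, with hypothetical ages recorded in whole years (the data below are mine, invented for illustration):

```python
import statistics

ages_in_whole_years = [24, 29, 31, 32, 33, 35, 36, 41]

mean_age = statistics.mean(ages_in_whole_years)      # 32.625: false precision
median_age = statistics.median(ages_in_whole_years)  # 32.5

# Report no more precisely than the data were measured:
print(f"average age: about {round(mean_age)} years (median {median_age})")
```

The extra decimals on 32.625 are an artifact of the arithmetic, not information you actually collected.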

Example: “We’ve tested zillions of variations on product images for our shoes — everyone prefers a human model wearing them! Everything else we tried lowered conversion!”

It's easy to get caught up testing too narrow a set of variations once one factor displays its importance. What about multiple images? What about shoe treads/soles (often not seen in primary images)? Call-to-Action buttons in proximity to the image? Ancillary data ("returned shoe rate") that has little to do with imagery but imbues trust? One can go on and on.

Testing is about *continuous* improvement; you’re not guaranteed *perpetual* improvement. And you’re not even guaranteed to come up with the important factors (though I hope you do!). Get over it, and get on with it.

And, although the following comment has little to do with self-deception, I think it bears repeating since I first said it in 2004: when you engage in continuous testing and you win (trans: "you improve the desired outcome"), you make money. When you test and you lose (trans: "you don't improve the desired outcome"), you learn something, and that something will give you insight into testing more efficiently in the future. So you are ahead either way, and most especially ahead of competitors who do nothing.

A false positive: you are looking for a particular type of outcome and you identify one falsely. A false negative: you are looking for a particular type of outcome and you identify it as not occurring when, in fact, it really does occur. To be sure, this is less about self-deception and more about experimental bias risk, but the outcomes are usually so important to the company that not knowing about false positives and false negatives can lead you down a bad path.

This time I’ll use a non-business example to get the point across and then circle back to a business situation. (No doubt this will roll the comments in, from both sides.)

Example: You're in charge of security at your airport. You're trying to determine when one of the crazy, dangerous people is trying to get on the aircraft. You've decided, for whatever reason (or, if you live in Arizona, apparently), that you have the power to "tell" who is the crazy, dangerous person just by looking, so you decide to cavity search anyone at the airport wearing, say, a turban or burqa. I didn't say it was a great idea; I'm just setting up an example to make a point.

So a false positive is: "someone who appears crazy dangerous but turns out to be just someone who got a stylish burqa from LandsEnd". The rule we used to identify "dangerous" turns out not to apply to this person, thus a false positive.

A false negative is: "someone who appears completely safe (in this case, all one has to do is wear jeans and a t-shirt to escape the cavity-search rule) and then promptly hijacks the plane at 30,000 ft."

In both cases, we got it wrong. One was to incorrectly find something we were looking for; the other was to incorrectly *not* find something we were looking for. [By the way, for those readers who did not skip out to go smoke under the bleachers that day: the quants refer to false positives as "Type I errors" and false negatives as "Type II errors". Just FYI.]

Now back to a business example, so you can be ready for false positives and false negatives at your job:

Example: We're running a test: do Green Add-to-Cart buttons convert better than Blue Add-to-Cart buttons? The false positive: we run the test and it apparently shows that Green converts better, but in reality Blue really is the winner. The false negative: we run the test and it apparently shows our original Blue button remains the winner, but in reality Green really is the winner.
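To see how easily the false-positive case alone can bite, here's a sketch of a simulation under my own simplifying assumptions: Blue and Green truly convert identically, and each test is judged with a plain two-proportion z-test at 95% confidence.

```python
import math
import random

def false_positive_rate(conversion=0.05, visitors=1000, trials=2000, seed=7):
    """Simulate A/B tests where Blue and Green truly convert identically,
    and count how often a z-test still declares a significant 'winner'."""
    rng = random.Random(seed)
    false_calls = 0
    for _ in range(trials):
        blue = sum(rng.random() < conversion for _ in range(visitors))
        green = sum(rng.random() < conversion for _ in range(visitors))
        pooled = (blue + green) / (2 * visitors)
        se = math.sqrt(2 * pooled * (1 - pooled) / visitors)
        if se > 0 and abs(blue - green) / visitors / se > 1.96:
            false_calls += 1
    return false_calls / trials

print(false_positive_rate())  # hovers around 0.05, i.e. roughly 1 test in 20
```

Even with no real difference at all, about one test in twenty will crown a "winner" at the 95% level; run enough button-color tests and you will eventually fool yourself.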

Two different risks that have to be handled separately. Some readers will want to know what to do in such a case, to which I'd advise: use my earlier rules of "do the opposite" and "look for evidence against your wished-for outcome". In this case, whichever test result you end up declaring the winner, consider re-running the test. You may not need to run nearly as much traffic through it, though you might run it longer to make up for the lessened traffic. See if the second test confirms the first test's results. If you're really keen, don't run one test across 10,000 customers; run 10 different tests across 1,000 customers each. Then try a lesser test a month or a quarter from now. Always be on the lookout for "how can I challenge my presumptions, especially those I got from testing a while ago?" This is often a great exercise in July or August, when you're not quite ready for the Christmas season to start but you've got extra cycles to try a few extra tests.

First off, a quick little illustration that hilariously spoofs this point.

Example: "I changed our Green Add-to-Cart buttons to Blue in November. Our December sales figures were way up. Aha! Blue converts better than Green!"

Not necessarily. Maybe sales were just up because of Christmas. Maybe our new marketing guy is doing miracles on Facebook. Maybe the new Button color actually *decreased* sales but the Facebook efforts more than made up for that, leading to a net plus for us. Maybe a lot of things. We just don’t know enough to be assigning credit.

Causation is about one thing causing another. And since our everyday experience of time flows in only one direction, the causer must happen before the, hmm, causee. In the example above, if changing the button color really caused sales to go up, then the change must've occurred first. In fact, there's an old logical fallacy, "post hoc ergo propter hoc" (so old that it was coined when Latin was a living language; or see the urban myth about Dan Quayle), which means roughly "after, therefore because of". Just because something comes after something else doesn't mean the first caused the second.

Correlation is much looser. Two things are correlated when they tend to move somewhat in accordance with each other. This doesn't mean that one causes the other. Nor does it mean that it doesn't. It's just that there's some sort of relationship in how they tend to move. For example: Rock Music Quality correlates well with US Crude Oil Production. Or US Highway Fatalities correlate well with Fresh Lemon Imports from Mexico. These are correlated, but there's no causation. (Unless you think the highway deaths are caused by slippage on lemon juice!)
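Here's a toy sketch of how easily two unrelated series correlate; the two series below are pure inventions that merely both drift upward over time, and the labels are tongue-in-cheek, not real data:

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

years = range(10)
lemon_imports = [100 + 3 * t for t in years]                # fictitious, rising
highway_fatalities = [50 + 2 * t + (t % 2) for t in years]  # fictitious, rising

print(round(pearson(lemon_imports, highway_fatalities), 3))  # close to 1.0
```

Anything trending in the same direction over the same period will correlate strongly, which is exactly why correlation alone can't assign credit.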

But here’s another one: Men who smoke cigarettes and the incidence of lung cancer among men. Turns out these are fairly well correlated, with a lag of about 20 years from when the guy starts smoking to when the cancer rates start going up. Do you think there’s possibly a causation here? Yeah, probably.

Now it's true that the phrase "correlation doesn't imply causation" is an oft-bandied-about one. In fact, it's overused in the extreme. But an important point to keep in mind is that all causation implies some sort of decent correlation (if you're measuring what counts), whereas high correlation in and of itself just means, well, that there's a high level of correlation. Causation and Correlation are not opposites; causation simply includes a way to "give credit" to a causal event. If you keep the distinction in mind and ask yourself, "is there anything, *other than the correlation*, that causes me to believe that A caused B?", you've got a powerful way to come up with new ideas for testing more broadly or more deeply.

Well, that’s it. It’s been a long two articles! But I hope you’ve learned something that will help you keep sharp in your testing efforts, the metrics you use to measure success, and your self-assessment of what you think you “know” for sure.

Which of the Self-Delusions do you feel you've perpetrated on yourself in the last 90 days?

