Sunday, April 15, 2012

Why Model Selection?: Bayes and Biased Coins

A key tool in the work I do is Bayesian model selection. These days 'Bayesian' is a buzzword, often used colloquially as a synonym for any relatively sophisticated inference method and frequently deployed in papers to make them seem more exciting. However, model selection, and Bayesian model selection in particular, do have concrete meanings beyond the hype, and I think it is important to see, in a few simple examples, why model selection is often (and, I believe, usually) a better alternative to standard significance tests when testing scientific hypotheses.

While I refuse to be dogmatic in my use of inference methods, I do subscribe to what is known as the 'Bayesian interpretation' of probability theory, which views probabilities as degrees of belief in things, much as the odds at a horse race represent how much the bookmaker believes a given horse is going to win. Again, I hope it is possible to show, with a simple example, that this interpretation has important and useful implications for real inference.

A standard tool in teaching probability is the biased coin, a coin which is more likely to land on one side (heads or tails) than the other. Let's look at a couple of examples of biased coin problems to illustrate why model selection can be more sensible than significance testing.

Disclaimer: Examples like this are common in textbooks, including my favourite. I make no claim to originality. Also, please let me know if you find any errors in the maths; there's nothing worse than having your argument die by virtue of a missing factor of 2 somewhere! (which did eventually happen!)

The Setup

Imagine I offer you two coins, telling you that one is biased towards tails and the other is fair. The biased coin gives tails on average 4 times in 5; the fair coin gives tails half the time. You choose one and I ask you to guess whether it is fair or not. I tell you that you can toss the coin a maximum of 50 times. You toss the coin the allotted 50 times and you get 31 tails, 19 heads. Do you believe the coin is biased?

The Significance Test

The classic way to answer this question is a significance test. We ask 'are the data [the coin tosses] significantly different from what we expect by chance?' To do this we construct two hypotheses:

The test hypothesis (H1): The coin is biased.

The null hypothesis (H0): The coin is fair.

We then look to see whether our data are sufficiently unlikely under the null hypothesis that we might reject it. We do not consider the test hypothesis any further.

The question we ask, mathematically, is 'how likely was I to see more than 31 tails from 50 tosses, if the coin was fair?' The answer is:

P(more than 31 tails | fair coin) = ∑_{i=32}^{50} binopdf(i, 50, 0.5) = 0.0325

where binopdf is the binomial probability distribution. Traditionally we reject the null hypothesis if the probability is less than 0.05, so here we can claim a significant result. We would then say we believe the coin to be biased, and roughly speaking we would assume we had about a 3-4% chance of being wrong.
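As a rough check on that tail probability, here is a minimal Python sketch using only the standard library. The helper names binom_pmf and upper_tail are my own stand-ins for binopdf and the sum, not functions from any particular package. It prints two versions of the tail, because the choice matters: the 0.0325 figure corresponds to strictly more than 31 tails (32 to 50), whereas if the observed count of 31 is included, as a one-sided p-value conventionally does, the sum is about 0.0595.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k tails in n tosses when P(tails) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k_min, n, p):
    """Probability of k_min or more tails in n tosses."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))

# Tail under the fair-coin (null) hypothesis, with and without the observed count
p_more_than_31 = upper_tail(32, 50, 0.5)  # strictly more than 31 tails
p_31_or_more = upper_tail(31, 50, 0.5)    # 31 tails or more

print(f"P(more than 31 tails | fair) = {p_more_than_31:.4f}")
print(f"P(31 or more tails | fair)   = {p_31_or_more:.4f}")
```

Either way, the number depends on outcomes that were never observed, which is the point taken up next.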


Woah there! Do you have any objections to what we just did? I hope so....

  • Why do we care about more than 31 tails? We didn't get more than (or less than) 31. Why are we basing our conclusions on things that never happened?
  • What the hell happened to H1? Surely we could/should have tested that hypothesis too. Nope, in significance tests we always choose the most 'random' possible hypothesis, label it the null and ignore the rest.
  • The setup was specified in some detail. There were exactly 2 coins, only one was biased, and the bias was known. You were asked to choose one at random. How come none of these details appear in our analysis? Are they really unimportant?

Model Selection

Let's deal with the first two of those objections first. Why do we care about more than 31 tails? One can make arguments in terms of deciding 'what would convince me?' before doing the test, but the simplest answer is: because this test is ridiculous. Our conclusions should only ever be based on what actually happened and how likely that was, a position known as the likelihood principle.

OK, so what about H1? Surely we could go back, label our hypotheses the other way round and test the biased hypothesis instead? Exactly! It's a quirk of the kind of effects scientists test, and of how they test them, that we have become used to 'rejecting the null hypothesis'. Let's see how easy it is to test H1 directly against H0.

We'll calculate just the probability of what actually happened, i.e. 31 tails and 19 heads. First, if the coin is fair:

P(31 tails, 19 heads | fair coin) =  binopdf(31, 50, 0.5) = 0.0270

and if the coin is biased (remember, we know the bias):

P(31 tails, 19 heads | biased coin) =  binopdf(31, 50, 0.8) = 0.0016

And we can now simply state that the data is 17 (yes, 17!) times more likely if the coin is fair than if it is biased.

P(31 tails, 19 heads | fair coin)/P(31 tails, 19 heads | biased coin) = 17
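The two likelihoods and their ratio can be reproduced with a few lines of Python. This is only a sketch: binom_pmf is a hypothetical helper playing the role of binopdf, built on the standard library's math.comb.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k tails in n tosses when P(tails) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Likelihood of the observed data (31 tails in 50 tosses) under each hypothesis
like_fair = binom_pmf(31, 50, 0.5)    # fair coin
like_biased = binom_pmf(31, 50, 0.8)  # coin with the known 4-in-5 bias

likelihood_ratio = like_fair / like_biased

print(f"P(data | fair)   = {like_fair:.4f}")
print(f"P(data | biased) = {like_biased:.4f}")
print(f"ratio            = {likelihood_ratio:.1f}")
```

Only the outcome that actually occurred enters the calculation; no tail sums are needed.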

While the significance test suggested the coin was biased, our direct model comparison shows this to be an absurd conclusion. Of course, I have chosen these numbers to illustrate how badly the significance test can fail; it will sometimes give the right answer. But why jump through the perverse hoops required for the significance test when the direct comparison is both more accurate and simpler?

If we don't happen to know the exact bias of the biased coin in advance, we can still do a comparison by integrating over the possible biases the coin could have, from 0.5 to 1, with a prior distribution on p of pr(p) = 2 for 0.5 < p < 1 (thanks to a commenter for the correction):

P(31 tails, 19 heads | biased coin) = ∫_{p=0.5}^{1} binopdf(31, 50, p) pr(p) dp = 0.0374

P(31 tails, 19 heads | fair coin) / P(31 tails, 19 heads | biased coin) = 0.72

So now the biased coin is somewhat more likely because the bias could be nearer the 0.6 necessary to give these results (though the probability of a fair coin is still ~ 40%, not the 3-4% we thought earlier).
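The integral over the unknown bias is easy to approximate numerically. The sketch below uses a hand-rolled composite Simpson's rule so that it needs nothing beyond the standard library; the function names (binom_pmf, simpson, prior) are illustrative choices of mine, and the uniform density of 2 on (0.5, 1) is the prior assumed in the text. By my evaluation the marginal likelihood comes out at roughly 0.037, the fair-to-biased ratio at about 0.72, and the probability of a fair coin (given equal prior odds) at about 42%.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k tails in n tosses when P(tails) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def simpson(f, a, b, steps=1000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def prior(p):
    """Uniform prior density on (0.5, 1): constant 2 so it integrates to 1."""
    return 2.0

# Marginal likelihood of 31 tails in 50 tosses, averaged over the prior on p
marginal_biased = simpson(lambda p: binom_pmf(31, 50, p) * prior(p), 0.5, 1.0)
like_fair = binom_pmf(31, 50, 0.5)

bayes_factor = like_fair / marginal_biased    # evidence for fair vs biased
prob_fair = bayes_factor / (1 + bayes_factor)  # assumes equal prior odds

print(f"P(data | biased, unknown p) = {marginal_biased:.4f}")
print(f"fair / biased               = {bayes_factor:.2f}")
print(f"P(fair | data)              = {prob_fair:.2f}")
```

Swapping in a different prior function (one that still integrates to 1 over its range) is all it takes to try other assumptions about the bias.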


So direct comparison of two hypotheses is (I hope you'll agree) better than doing a significance test, if we have two or more clear hypotheses to test. And if we have no clear alternative to the null hypothesis, what are we even doing? Boldly rejecting the null when we have nothing to replace it with? Doesn't that sound like we're getting ahead of ourselves?

Now to deal with the third objection above. We saw in the last section that we can make use of the exact bias of the coin if it is known, and adjust if it is not. What about all the other information we have in the problem? And to introduce a further, connected problem: why are we judging these models on how likely they make the data? I don't know about you, but I want to know how likely the data makes the models... they are not the same thing!

Looking at our original problem setup, a key aspect we haven't considered is how many coins you were offered. It was exactly two, one of which was biased. What if I had offered you a bucket of 1000 coins, only one of which was biased? Even if I didn't tell you the exact bias, could we really repeat the analysis above and conclude that a coin giving 31 tails was likely to be biased?

Without further ado, let me introduce Bayes' rule, which allows us to go from the probability of the data, D, to the probability of the hypothesis, H:

P(H | D) = P(D | H) × P(H) / P(D)

The first term on the right is known as the likelihood, and is what we have previously been using to judge our hypotheses. But, ignoring P(D), which will cancel out when comparing two hypotheses, the second term on the right, P(H), shows us that something else is going on. This is what we call the prior probability of H, i.e. how likely H was before we saw any data (tossed the coin). In our first example we could easily argue that P(H0) = P(H1) = 1/2, since there were two equally choosable coins. With 1000 coins on offer and a random choice, we should assume that there is only a 1/1000 chance that we picked the biased coin before we start testing it. Bayes' rule says that this prior probability doesn't just disappear once we start doing tests. So if we calculate the ratio between the probabilities of H0 and H1, we need to include the ratio of prior probabilities:

P(fair | 31 tails) / P(biased | 31 tails)
= [P(31 tails | fair) × P(fair)] / [P(31 tails | biased) × P(biased)]
= (0.0270 × 0.999) / (0.0374 × 0.001)
= 721

So yes, the data is somewhat more probable from a biased coin, but the sheer unlikeliness of us having picked that coin out of the bucket to start with massively outweighs this slim evidence. If you don't believe this, I invite you to fly over to Sweden and we'll play some betting games with you picking coins out of a bucket and I'll finally have some expendable income!
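To see how much the prior matters, here is a small sketch that computes the posterior odds for both the two-coin setup and the bucket of 1000. It assumes the same uniform prior on the bias between 0.5 and 1 as before, and the function names are hypothetical ones chosen for this example. With unrounded likelihoods the odds come out at about 0.72 to 1 for two coins and roughly 721 to 1 in favour of a fair coin for the bucket.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k tails in n tosses when P(tails) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def marginal_biased(k, n, steps=20000):
    """Midpoint-rule average of the likelihood over a uniform prior on (0.5, 1)."""
    width = 0.5 / steps
    # The prior density is 2 on (0.5, 1), so each slice carries weight 2 * width
    return sum(binom_pmf(k, n, 0.5 + (i + 0.5) * width) * 2 * width
               for i in range(steps))

def posterior_odds_fair(k, n, n_coins):
    """Odds of fair vs biased when exactly one of n_coins coins is biased."""
    prior_biased = 1 / n_coins
    prior_fair = 1 - prior_biased
    return (binom_pmf(k, n, 0.5) * prior_fair) / (marginal_biased(k, n) * prior_biased)

odds_two_coins = posterior_odds_fair(31, 50, n_coins=2)
odds_bucket = posterior_odds_fair(31, 50, n_coins=1000)

print(f"two coins: fair vs biased = {odds_two_coins:.2f} to 1")
print(f"1000 coins: fair vs biased = {odds_bucket:.0f} to 1")
```

The likelihoods are identical in the two cases; only the prior changes, and it moves the answer by a factor of 999.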

If you still have doubts about the power of prior beliefs, watch the video below. Although the data your eyes receive is slightly more likely from a concave face, you see a convex face because your brain knows convex faces are hugely more likely.


  1. Great explanation!

  2. When I try to replicate the final answer, I get 725.08, rather than 729. Also it would be good to add some parentheses to the final calculation.

    Other than that, great explanation indeed, and a very nice illustration of the limitations of the standard hypothesis testing approach!

  3. This is an excellent tutorial, except the math appears to be wrong. By my calculations, the integral evaluates to 0.0187, and the subsequent fair/biased ratio is 1.44.

    My calcs match the example on the Bayes Factor wikipedia page, which also gives a very nice working example.

    1. Thanks for pointing this out! - I missed a factor in the integral, pr(p), which is 2 since pr(p) is uniform between 0.5 and 1. The post is now corrected

    2. Great tutorial, as mentioned. But why does the prior distribution of p=2? I don't understand where you get this number from.

    3. This is because the prior distribution I used is uniform between 0.5 and 1. To make sure the prior distribution sums to 1, this means the probability density for p must be 2 everywhere, since the integral of 2 from 0.5 to 1 is 1.

      I could have used a prior from 0 to 1 instead, with pr(p)= 1. However, here I wanted to test the idea that p was *more* than 0.5, so for that model I exclude values of p below 0.5.

      Finally, I could consider other priors for p that allow p to be more than 0.5 but which are not uniform. This would be appropriate if someone told me they thought the coin was biased, and knew that if it was then p would be around, say, 0.7. Here I've assumed that if the coin is biased it is equally likely to be biased by any amount.