Bayes’ Theorem: A Gentle Introduction
To illustrate Bayesian reasoning I’ll use a couple of common-sense examples. Suppose that your socks go missing. Why did they go missing? There are many possible explanations, but let’s just consider two. The first explanation is that you dropped your socks as you were leaving your local Laundromat. The second explanation is that evil sock-stealing gnomes broke into your home and took the socks without leaving any signs of entry or taking anything else. Either hypothesis would explain the fact that your socks went missing. However, it is clear that one is more likely than the other. Why is that? It is because you already know from your own observations that you could have easily dropped your socks on the way out of the Laundromat. You have probably lost things a hundred times in the past because you carelessly dropped them. On the other hand, you have no evidence that any pair of socks that has ever gone missing was stolen by evil sock-stealing gnomes. So, even though both theories explain the evidence (the fact that your socks went missing) equally well, you reject the evil gnome hypothesis because it is inherently less likely than the hypothesis that you were careless. A case of missing socks is not usually the result of gnomes, which is to say that the gnome hypothesis is inherently unlikely, or that it has what Bayesians call a low prior probability.
Let’s take another example: suppose that you reach into a purse that you know contains an even mix of ordinary coins and trick coins that always land on “heads.” You pull out a coin. You flip it five times and find that it always comes up heads. Did you get a trick coin? Probably so. Even though it is possible that your coin is an ordinary one (sometimes people flip a normal coin and it lands heads five times in a row), it isn’t likely. You are justified in believing your coin is a trick coin because the evidence you have, which in this case is your observation of the coin-toss outcomes, is far more likely if you got a trick coin than if you got an ordinary coin. In other words, the evidential probability of the trick coin hypothesis is greater than the evidential probability of the ordinary coin hypothesis.
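To put a rough number on “far more likely,” here is a small Python sketch of the two evidential probabilities. It assumes that the ordinary coin is fair and that the flips are independent; the variable names are my own and are only for illustration.

```python
from fractions import Fraction

FLIPS = 5

# Assumption: an ordinary coin is fair and each flip is independent.
p_heads_ordinary = Fraction(1, 2)

# Probability of seeing five heads in a row under each hypothesis.
evidence_if_ordinary = p_heads_ordinary ** FLIPS   # (1/2)^5
evidence_if_trick = Fraction(1) ** FLIPS           # a trick coin always lands heads

print(evidence_if_ordinary)                        # 1/32
print(evidence_if_trick / evidence_if_ordinary)    # 32
```

Under those assumptions, five heads in a row is thirty-two times more likely with a trick coin than with an ordinary one.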
In the first scenario, the evidential probabilities of both theories were equal, and so to find the right answer we only had to figure out which theory had the greater prior probability. In the second example the tables were turned: the prior probability was the same for getting a trick coin as it was for getting a regular coin, but the evidence (seeing the coin turn up heads on all five flips) was more likely if it was a trick coin. Obviously the real world is rarely this ideal. Sometimes a theory explains a few pieces of evidence better than its competitors, but that does not mean that it is necessarily more likely. It could be the case that other evidence is predicted so poorly by the theory (and so much better by one or more competing theories) that it is not, on final consideration, the most likely theory. In reality, judging a theory as the most likely out of many contenders requires looking at the prior and evidential probabilities of all our theories and carefully comparing them to see which one has the greatest final probability (that is, the probability that a theory is correct after all factors are considered). The equation that allows us to do this correctly is Bayes’ Theorem.
Bayes’ Theorem can be stated as the following:
$$\frac{P(h|b) \times P(e|h\&b)}{[\,P(h|b) \times P(e|h\&b)\,] + [\,P(\sim h|b) \times P(e|\sim h\&b)\,]}$$
Despite appearances this equation is actually very easy to understand.
The term “P(h|b)” stands for the prior probability and it means “the probability of the hypothesis we are examining given our background knowledge.” Our background knowledge is everything we know about the world excluding the evidence we are currently investigating. Remember the missing sock example I gave earlier? When I said that the evil-gnome hypothesis was inherently unlikely, that was just another way of saying that the probability of the hypothesis, given only our background knowledge (everything else we already know about the world), is very low. We know that it’s low because we have never encountered evil gnomes stealing socks, and so the frequency of that type of event is either zero or extraordinarily low.
The term “P(e|h&b)” means “the probability that we would have the evidence that we do if our hypothesis is true, given our background knowledge.” Think back to the trick coin example I gave earlier: we know that an ordinary coin lands on heads about fifty percent of the time. This fact is part of our background knowledge, and it helps us find the probability that our outcome will be observed if the “ordinary coin” hypothesis is correct.
Bayes’ Theorem simply says that you should multiply the two numbers I just discussed and then divide that product by itself plus the corresponding product (prior probability times evidential probability) for the alternative hypothesis. The term ~h stands for the alternative hypothesis, so the two terms containing ~h are otherwise the same as the two terms we discussed previously. If there are several alternative hypotheses, each one adds its own product to the divisor.
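This recipe can be sketched in a few lines of Python. The function name `posterior` and its parameter names are my own labels for the four terms, and the sketch assumes exactly two competing hypotheses, h and ~h. As a usage example it returns to the purse of coins, assuming the ordinary coin is fair.

```python
from fractions import Fraction


def posterior(prior_h, evidence_h, prior_alt, evidence_alt):
    """Final probability of h, for two competing hypotheses h and ~h.

    prior_h      -- P(h|b)
    evidence_h   -- P(e|h&b)
    prior_alt    -- P(~h|b)
    evidence_alt -- P(e|~h&b)
    """
    numerator = prior_h * evidence_h
    return numerator / (numerator + prior_alt * evidence_alt)


# Purse example: an even mix of coins, and five heads in a row observed.
trick = posterior(
    prior_h=Fraction(1, 2),            # half the coins are trick coins
    evidence_h=Fraction(1),            # a trick coin always lands heads
    prior_alt=Fraction(1, 2),          # the other half are ordinary
    evidence_alt=Fraction(1, 2) ** 5,  # assumed fair coin: (1/2)^5
)
print(trick)  # 32/33
```

Exact fractions are used instead of decimals so that rounding does not blur the result: under these assumptions the trick coin hypothesis ends up with a final probability of 32 out of 33.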
Let’s try one more example. Suppose that you have two neighbors, Mrs. Smith and Mrs. Jones, who grow tomatoes. Mrs. Smith’s tomato garden is fifty percent red and fifty percent green. Mrs. Jones’ garden is one-hundred percent green tomatoes. If you find a green tomato in your backyard, how likely is it that it came from Mrs. Jones’ garden? Assuming a stray tomato is equally likely to come from either garden, and since all of Mrs. Jones’ tomatoes are green while only half of Mrs. Smith’s tomatoes are, the probability is two out of three. Think about it: if you randomly took four tomatoes from each garden (making eight total) then all four tomatoes you took from Mrs. Jones’ garden would be green (because that’s all she grows) while only two of the four tomatoes you took from Mrs. Smith would be green. To help you decide the probability, you could throw away the two red tomatoes you got from Mrs. Smith, because you already know your tomato is green, and just focus on the green tomatoes. If you had eight tomatoes and you tossed out the two red ones, you’d be left with six green tomatoes. Again, four of those six green tomatoes came from Mrs. Jones. Four out of six, or 4/6, is the same as “two out of three” (2/3). The probability that it came from Mrs. Smith is one out of three.
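The counting argument can be checked with a short Python sketch. It assumes, as the example does, that equal numbers of tomatoes are drawn from each garden; the list layout and names are only illustrative.

```python
from fractions import Fraction

# Four tomatoes from each garden, in the proportions described above.
tomatoes = (
    [("Smith", "red")] * 2 + [("Smith", "green")] * 2   # half red, half green
    + [("Jones", "green")] * 4                          # all green
)

# We know our tomato is green, so throw away the red ones.
green = [garden for garden, color in tomatoes if color == "green"]

p_jones = Fraction(green.count("Jones"), len(green))
p_smith = Fraction(green.count("Smith"), len(green))
print(p_jones, p_smith)   # 2/3 1/3
```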
Now let’s mix things up: pretend that you now notice that the tomato you found in your yard is what you call “a big ‘un.” Let’s pretend that one out of every three tomatoes in Mrs. Jones’ garden is a big ‘un, while all of Mrs. Smith’s tomatoes are big ‘uns (“Must be that new fertilizer,” you say to yourself). We’ll treat this new observation as “evidence.” If the tomato came from Mrs. Smith’s garden, then the probability of the evidence, that is, the probability that the tomato would be a big ‘un, is one hundred percent. If it came from Mrs. Jones’ garden, then the probability is one out of three, or about 33 percent. Remember, this isn’t the probability that your tomato would come from one garden instead of the other; these are just the probabilities that your tomato would be a big ‘un if it came from one or the other. In each case, we are talking about the probability of the evidence if our hypothesis is true given our background knowledge. In this case, our background knowledge is that the tomatoes from Mrs. Smith’s garden are always big ‘uns while tomatoes from Mrs. Jones’ garden are big ‘uns about thirty-three percent of the time.
What is the likelihood that your tomato came from Mrs. Smith’s garden? If we want to know the overall probability of the hypothesis that the tomato came from Mrs. Smith, we need to use Bayes’ Theorem, and to use Bayes’ Theorem, we need to know the prior and evidential probabilities of that hypothesis and of the hypothesis that it came from Mrs. Jones. How do we find the prior probability? Since the prior probability is the probability of a hypothesis given only our “background knowledge” (not our evidence), and since we have said that we are using the size of the tomato (“It’s a big ‘un!”) as our evidence, what we need to know is the probability that the tomato came from Mrs. Smith before we took into account the evidence. Take a look at the paragraph before last to find the answer: one out of three, which can be written as the decimal number .33. What is the probability of the evidence if the tomato came from Mrs. Smith? Take a look at the last paragraph: 100 percent (or 1). The prior and evidential probabilities for the Mrs. Jones hypothesis are two out of three, or .67, and one out of three, or .33, respectively, and these numbers both come from the last two paragraphs.
Bayes’ Theorem dictates that in order to know the overall probability of the Mrs. Smith hypothesis, we must take the prior and evidential probabilities of that hypothesis and multiply them, then divide that number by itself plus the product of the prior and evidential probabilities of the Mrs. Jones hypothesis. Let’s take this one step at a time: What is the prior probability of the Mrs. Smith hypothesis? One out of three. What is its evidential probability? One. Multiplying those two numbers together gives one out of three, or the decimal .33. Let’s put that number over to the side for the moment. What is the prior probability of the Mrs. Jones hypothesis? Two out of three. What is its evidential probability? One out of three. What do we get when we multiply those two numbers? Two out of nine, or the decimal .22. What do we do now? We take the number .33 (the product of multiplying the prior and evidential probabilities of the Mrs. Smith hypothesis) and divide it by itself (.33) plus the product of multiplying the prior and evidential probabilities of the Mrs. Jones hypothesis (.22). So we divide .33 by .55. The answer we end up with is .6, which means that the probability that your tomato came from Mrs. Smith’s garden is sixty percent. We can visualize this entire procedure and know that the number here is correct. Imagine that we picked six random tomatoes from each garden. As I’ve described above, here’s what that collection would look like:
[Illustration. MRS. SMITH: three red tomatoes, three green tomatoes. MRS. JONES: six green tomatoes.]
Since we already know our tomato is green, we can erase all the red ones and just focus on the green:
[Illustration. MRS. SMITH: three green tomatoes. MRS. JONES: six green tomatoes.]
One out of every three of Mrs. Jones’ tomatoes is a “big ‘un” and all of Mrs. Smith’s tomatoes are “big ‘uns,” so let’s update the picture to show that:
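[Illustration. MRS. SMITH: three big green tomatoes. MRS. JONES: two big green tomatoes, four ordinary-sized green tomatoes.]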
Since we know our tomato is big, we can erase all the ones which are not, since we only care about the frequency of large tomatoes:
[Illustration. MRS. SMITH: three big green tomatoes. MRS. JONES: two big green tomatoes.]
Look at what the last illustration shows: we have five large tomatoes, and 3 out of 5 (or sixty percent) of those come from Mrs. Smith’s garden, which is the same answer we got with Bayes’ Theorem. That’s why we know Bayes’ Theorem is right: it is nothing but a way to describe what is self-evidently true. It takes the numbers and percentages that human beings know (the prior and evidential probabilities of a hypothesis) and helps us realize what follows from those numbers.
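The same comparison can be run in Python as a sanity check. This is only a sketch: it assumes that size and color are independent, and the tuple layout and variable names are my own.

```python
from fractions import Fraction

# Six tomatoes per garden, each recorded as (garden, color, is_big).
smith = [("Smith", "red", True)] * 3 + [("Smith", "green", True)] * 3
jones = [("Jones", "green", True)] * 2 + [("Jones", "green", False)] * 4

# Keep only the tomatoes that match what we observed: green and big.
matches = [g for g, color, big in smith + jones if color == "green" and big]
by_counting = Fraction(matches.count("Smith"), len(matches))

# Bayes' Theorem with the priors and evidential probabilities from the text.
smith_term = Fraction(1, 3) * Fraction(1)      # prior x evidence for Mrs. Smith
jones_term = Fraction(2, 3) * Fraction(1, 3)   # prior x evidence for Mrs. Jones
by_bayes = smith_term / (smith_term + jones_term)

print(by_counting, by_bayes)   # 3/5 3/5
```

Working with exact fractions gives three out of five both ways, which matches the sixty percent reached with rounded decimals.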
End Note: I should add that this example assumes for the sake of argument that there is no correlation between the color and size of a tomato. That is, a red tomato is no more likely to be a “big ‘un” than a green tomato, and vice versa.