Act 1: Magical Monkeys
Two monkeys, Alfred and Betty, live in a parallel universe with two kinds of blocks, green and yellow. Alfred likes green blocks, and Betty prefers the yellow blocks. One day, a Wizard decides to give one of the monkeys the magical power to send one block over to our universe each day.
The Wizard chooses the magical monkey by rolling a fair four-sided die. He casts a spell on Alfred if the outcome is 1, and on Betty if the outcome is 2, 3, or 4. That is, the Wizard chooses Alfred with probability $1/4$ and Betty with probability $3/4$. Both Alfred and Betty send their favorite colored block with probability
After the Wizard has chosen, we see the first magical block, and it is green. Our problem is: What is the probability that Alfred is the magical monkey?
Intuitively speaking, we have two somewhat contradictory pieces of information. We know that the Wizard chooses Betty more often. But green is Alfred’s preferred color. Given the probabilities above, which monkey is more probable, Alfred or Betty?
First, we can write down all the probabilities. These are
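Now, the quantity we are interested in is $\Pr(\text{Alfred} \mid \text{green})$. We can mechanically apply Bayes’ rule to obtain
$$\Pr(\text{Alfred} \mid \text{green}) = \frac{\Pr(\text{Alfred}) \, \Pr(\text{green} \mid \text{Alfred})}{\Pr(\text{green})}.$$
Similarly, we can calculate that
$$\Pr(\text{Betty} \mid \text{green}) = \frac{\Pr(\text{Betty}) \, \Pr(\text{green} \mid \text{Betty})}{\Pr(\text{green})}.$$
But, of course, we know that one of the monkeys is magical, so these quantities must sum to one. Thus (since they both involve the quantity $\Pr(\text{green})$, which we haven’t explicitly calculated) we can normalize to obtain that
$$\Pr(\text{Alfred} \mid \text{green}) = \frac{\Pr(\text{Alfred}) \, \Pr(\text{green} \mid \text{Alfred})}{\Pr(\text{Alfred}) \, \Pr(\text{green} \mid \text{Alfred}) + \Pr(\text{Betty}) \, \Pr(\text{green} \mid \text{Betty})}.$$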
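The normalization step can be sketched in a few lines of Python. The favorite-block probability used here, 4/5, is an assumed value chosen only for illustration, and the names `likelihood` and `posterior_one_block` are hypothetical helpers rather than anything from the story itself.

```python
from fractions import Fraction

# Prior from the four-sided die: Alfred on a 1, Betty on 2, 3, or 4.
PRIOR = {"Alfred": Fraction(1, 4), "Betty": Fraction(3, 4)}
FAVORITE = {"Alfred": "green", "Betty": "yellow"}

# ASSUMPTION: illustrative probability that a monkey sends its favorite color.
P_FAVORITE = Fraction(4, 5)


def likelihood(block, monkey, p_favorite=P_FAVORITE):
    """Pr(block | monkey) for a single day."""
    return p_favorite if block == FAVORITE[monkey] else 1 - p_favorite


def posterior_one_block(block):
    """Pr(monkey | block), obtained by normalizing prior * likelihood."""
    unnormalized = {m: PRIOR[m] * likelihood(block, m) for m in PRIOR}
    evidence = sum(unnormalized.values())  # this sum is Pr(block)
    return {m: u / evidence for m, u in unnormalized.items()}


post = posterior_one_block("green")
print(post["Alfred"], post["Betty"])
```

Under the assumed 4/5, this gives 4/7 for Alfred and 3/7 for Betty. Note that `evidence` is exactly the quantity we never had to work out by hand.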
Act 2: Magical Monkeys on Multiple Days
We wait around two more days, and two more blocks appear, both of which are yellow. Assume that the monkeys make an independent choice each day to send their favorite or less favorite block. Now, what is the probability that Alfred is the magical monkey, given that the blocks we have seen are green, yellow, yellow?
Intuitively speaking, what do we expect? Betty is more likely to be chosen, and 2/3 of the blocks we’ve seen are suggestive of Betty (since she prefers yellow).
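Now, we can mechanically calculate the probabilities. This is just like before, except we use the fact that the blocks seen on each day are conditionally independent given a particular monkey, so the likelihood of the whole sequence is a product of per-day terms:
$$\Pr(\text{Alfred} \mid \text{green}, \text{yellow}, \text{yellow}) \propto \Pr(\text{Alfred}) \, \Pr(\text{green} \mid \text{Alfred}) \, \Pr(\text{yellow} \mid \text{Alfred})^2,$$
$$\Pr(\text{Betty} \mid \text{green}, \text{yellow}, \text{yellow}) \propto \Pr(\text{Betty}) \, \Pr(\text{green} \mid \text{Betty}) \, \Pr(\text{yellow} \mid \text{Betty})^2.$$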
Again, we know that these two probabilities sum to one. Thus, we can normalize and get that
So now, Betty looks more likely by far.
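To make the multi-day update concrete, here is a minimal sketch that multiplies one likelihood term per day and then normalizes. As an assumption for illustration only, it uses a favorite-block probability of 4/5; `posterior_blocks` is a hypothetical helper name.

```python
from fractions import Fraction
from math import prod

PRIOR = {"Alfred": Fraction(1, 4), "Betty": Fraction(3, 4)}
FAVORITE = {"Alfred": "green", "Betty": "yellow"}


def posterior_blocks(blocks, p_favorite=Fraction(4, 5)):
    """Pr(monkey | blocks), with days conditionally independent given the monkey.

    ASSUMPTION: the default p_favorite of 4/5 is an illustrative value.
    """
    unnormalized = {}
    for monkey, prior in PRIOR.items():
        # One factor per day: Pr(block on that day | monkey).
        per_day = [
            p_favorite if block == FAVORITE[monkey] else 1 - p_favorite
            for block in blocks
        ]
        unnormalized[monkey] = prior * prod(per_day)
    total = sum(unnormalized.values())
    return {m: u / total for m, u in unnormalized.items()}


post = posterior_blocks(["green", "yellow", "yellow"])
print(post["Alfred"], post["Betty"])
```

With the assumed 4/5, Alfred's posterior drops to 1/13 and Betty's rises to 12/13, which matches the qualitative conclusion that Betty is now far more likely.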
Act 3: Magical Monkeys and the Weather
Now, suppose the Wizard gives us some additional information. The magical monkey (whilst relaxing over in the other universe) is able to perceive our weather, which is either clear or rainy. Both monkeys prefer clear weather. When the weather is clear, they send their preferred block with probability , while if the weather is rainy, they angrily send their preferred block with probability That is, we have
Along with seeing the previous sequence of blocks (green, yellow, yellow), we also observed the weather on each of those three days. Now, what is the probability that Alfred is the magical monkey?
We ask the Wizard what the distribution over rainy and clear weather is. The Wizard haughtily responds that this is irrelevant, but does confirm that the weather is independent of which monkey was made magical.
Can we do anything without knowing the distribution over the weather? Can we calculate the probability that Alfred is the magical monkey?
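Let’s give it a try. Write “blocks” for the observed sequence of blocks and “weather” for the observed sequence of weather. We can apply Bayes’ equation, conditioning everything on the weather, to get that
$$\Pr(\text{Alfred} \mid \text{blocks}, \text{weather}) = \frac{\Pr(\text{Alfred} \mid \text{weather}) \, \Pr(\text{blocks} \mid \text{Alfred}, \text{weather})}{\Pr(\text{blocks} \mid \text{weather})}.$$
Now, since the weather is independent of the monkey, we know that
$$\Pr(\text{Alfred} \mid \text{weather}) = \Pr(\text{Alfred}).$$
Thus, we have that
$$\Pr(\text{Alfred} \mid \text{blocks}, \text{weather}) = \frac{\Pr(\text{Alfred}) \, \Pr(\text{blocks} \mid \text{Alfred}, \text{weather})}{\Pr(\text{blocks} \mid \text{weather})}.$$
Through the same logic we can calculate that
$$\Pr(\text{Betty} \mid \text{blocks}, \text{weather}) = \frac{\Pr(\text{Betty}) \, \Pr(\text{blocks} \mid \text{Betty}, \text{weather})}{\Pr(\text{blocks} \mid \text{weather})}.$$
Again, we know that these probabilities need to sum to one. Since the factor $\Pr(\text{blocks} \mid \text{weather})$ is constant between the two, we can normalize it out. Thus, we get that
$$\Pr(\text{Alfred} \mid \text{blocks}, \text{weather}) = \frac{\Pr(\text{Alfred}) \, \Pr(\text{blocks} \mid \text{Alfred}, \text{weather})}{\Pr(\text{Alfred}) \, \Pr(\text{blocks} \mid \text{Alfred}, \text{weather}) + \Pr(\text{Betty}) \, \Pr(\text{blocks} \mid \text{Betty}, \text{weather})}.$$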
Again, it looks like Alfred was the more likely monkey. And, oh wait, we somehow got away with not knowing the distribution over the weather…
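A short sketch can show why the weather distribution drops out. Everything numeric here is an assumption made up for illustration: a preferred-block probability of 9/10 in clear weather and 2/10 in rain, and the weather sequence clear, clear, rainy. The optional `p_clear` argument is a hypothetical knob that multiplies the weather probabilities into the joint, so we can check that they cancel.

```python
from fractions import Fraction

PRIOR = {"Alfred": Fraction(1, 4), "Betty": Fraction(3, 4)}
FAVORITE = {"Alfred": "green", "Betty": "yellow"}

# ASSUMPTION: illustrative preferred-block probabilities for each weather.
P_FAVORITE_GIVEN_WEATHER = {"clear": Fraction(9, 10), "rainy": Fraction(2, 10)}


def block_likelihood(block, monkey, weather):
    """Pr(block | monkey, weather) for a single day."""
    p = P_FAVORITE_GIVEN_WEATHER[weather]
    return p if block == FAVORITE[monkey] else 1 - p


def posterior(blocks, weather, p_clear=None):
    """Pr(monkey | blocks, weather).

    If p_clear is given, Pr(weather) is also multiplied into each monkey's
    weight; it is the same factor for both monkeys, so it cancels.
    """
    unnormalized = {}
    for monkey, prior in PRIOR.items():
        weight = prior
        for block, w in zip(blocks, weather):
            weight *= block_likelihood(block, monkey, w)
            if p_clear is not None:
                weight *= p_clear if w == "clear" else 1 - p_clear
        unnormalized[monkey] = weight
    total = sum(unnormalized.values())
    return {m: u / total for m, u in unnormalized.items()}


blocks = ["green", "yellow", "yellow"]
weather = ["clear", "clear", "rainy"]  # ASSUMPTION: illustrative sequence

post = posterior(blocks, weather)
same = posterior(blocks, weather, p_clear=Fraction(1, 3))
print(post["Alfred"], same["Alfred"])
```

Both calls return the same posterior, whatever value of `p_clear` is supplied, because the weather term multiplies Alfred's and Betty's weights equally and disappears on normalization.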
Act 4: Predicting the next block
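Now, suppose that after seeing the previous sequences of blocks and weather, on the fourth day, we find that it has rained. What is the probability that we will get a green block on the fourth day? Mathematically, we want to calculate
$$\Pr(\text{green}_4 \mid \text{rainy}_4, \text{blocks}, \text{weather}),$$
where the subscript 4 marks the fourth day, and “blocks” and “weather” are the sequences observed on the first three days.
How can we go about this? Well, if we knew that the magical monkey was Alfred (which we do not!), it would be easy to calculate. We would just have
$$\Pr(\text{green}_4 \mid \text{Alfred}, \text{rainy}_4),$$
which is simply the probability that a monkey sends its preferred block in rainy weather,
and similarly if the magical monkey were Betty. Now, we don’t know which monkey is magical, but we know the probability that each monkey is magical given the available information: we just calculated them in the previous section! So, we can factorize the distribution of interest as
$$\Pr(\text{green}_4 \mid \text{rainy}_4, \text{blocks}, \text{weather}) = \Pr(\text{Alfred} \mid \text{blocks}, \text{weather}) \, \Pr(\text{green}_4 \mid \text{Alfred}, \text{rainy}_4) + \Pr(\text{Betty} \mid \text{blocks}, \text{weather}) \, \Pr(\text{green}_4 \mid \text{Betty}, \text{rainy}_4).$$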
So we are slightly less likely than even to see a green block on the next day.
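The prediction step is just a posterior-weighted average, which a small sketch makes explicit. The numbers are placeholders chosen for illustration and are assumptions: a preferred-block probability of 9/10 in clear weather and 1/5 in rain, and a posterior of 4/7 for Alfred and 3/7 for Betty. The helper names are hypothetical.

```python
from fractions import Fraction

FAVORITE = {"Alfred": "green", "Betty": "yellow"}

# ASSUMPTION: illustrative preferred-block probabilities for each weather.
P_FAVORITE_GIVEN_WEATHER = {"clear": Fraction(9, 10), "rainy": Fraction(1, 5)}


def block_probability(block, monkey, weather):
    """Pr(block | monkey, weather) for a single day."""
    p = P_FAVORITE_GIVEN_WEATHER[weather]
    return p if block == FAVORITE[monkey] else 1 - p


def predictive(block, weather_today, posterior):
    """Pr(block today | weather today, past data), summing over the monkeys."""
    return sum(
        posterior[monkey] * block_probability(block, monkey, weather_today)
        for monkey in posterior
    )


# ASSUMPTION: placeholder posterior over the monkeys after three days.
posterior = {"Alfred": Fraction(4, 7), "Betty": Fraction(3, 7)}

p_green = predictive("green", "rainy", posterior)
p_yellow = predictive("yellow", "rainy", posterior)
print(p_green, p_yellow)
```

With these placeholder values the sketch prints 16/35 and 19/35, so green comes out a little below even odds, and the two predictive probabilities sum to one as they should.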
Epilogue: That’s Bayesian Inference
This is how Bayesian inference works (in the simplest possible setting). In general Bayesian inference, you have:
- A set of possible inputs (for us these are weather patterns)
- A set of possible outputs (for us these are colored blocks)
- A set of possible models, that map a given input to a distribution over the outputs (for us these are monkeys)
- A “prior” distribution over which models are more likely (for us, this is the Wizard and the four-sided die)
- A dataset of inputs and outputs (for us, this is the weather and the blocks observed on the first three days)
- A new input (for us this is the rainy weather on the fourth day)
- An unknown output that we’d like to predict (for us this is the color of the fourth block)
Bayesian inference essentially proceeds in two steps:
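- In the first step, one uses the prior over which models are more likely along with the dataset of inputs and outputs to build a “posterior” distribution over which models are more likely. We did this by using
$$\Pr(\text{model} \mid \text{data}) \propto \Pr(\text{model}) \prod_i \Pr(\text{output}_i \mid \text{input}_i, \text{model}),$$
where $i$ indexes the different days of observation.
- In the second step, one “integrates over the posterior”. That is, each model gives us a distribution over the unknown output given the new input. So, we simply sum over the different models
$$\Pr(\text{output} \mid \text{input}, \text{data}) = \sum_{\text{model}} \Pr(\text{model} \mid \text{data}) \, \Pr(\text{output} \mid \text{input}, \text{model}).$$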
In the general case, the set of possible models is much larger (typically infinite) and more complex computational methods need to be used to integrate over it (commonly Markov chain Monte Carlo or variational methods). Also, of course, we don’t typically have a Wizard telling us exactly what the prior distribution over the models is, meaning one must make assumptions, or try to “learn” the prior as in, say, empirical Bayesian methods.
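For a finite set of models, the two steps can be written once and reused. The following is a rough sketch under that assumption, not a general-purpose implementation: `posterior_over_models` and `predict` are hypothetical names, and the monkey numbers in the usage part (9/10 in clear weather, 2/10 in rain, weather clear, clear, rainy) are illustrative assumptions.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

Likelihood = Callable[[Hashable, Hashable, Hashable], float]


def posterior_over_models(
    prior: Dict[Hashable, float],
    likelihood: Likelihood,
    dataset: Iterable[Tuple[Hashable, Hashable]],
) -> Dict[Hashable, float]:
    """Step 1: prior times the product of per-observation likelihoods, normalized."""
    pairs = list(dataset)
    weights = {}
    for model, p in prior.items():
        w = p
        for x, y in pairs:
            w *= likelihood(model, x, y)  # Pr(output y | input x, model)
        weights[model] = w
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def predict(posterior, likelihood: Likelihood, new_input, candidate_output) -> float:
    """Step 2: sum the models' predictions, weighted by the posterior."""
    return sum(
        p * likelihood(m, new_input, candidate_output) for m, p in posterior.items()
    )


# Usage with the monkeys. ASSUMPTION: all numbers below are illustrative.
favorite = {"Alfred": "green", "Betty": "yellow"}
p_favorite = {"clear": 0.9, "rainy": 0.2}


def monkey_likelihood(monkey, weather, block):
    p = p_favorite[weather]
    return p if block == favorite[monkey] else 1 - p


data = [("clear", "green"), ("clear", "yellow"), ("rainy", "yellow")]
post = posterior_over_models({"Alfred": 0.25, "Betty": 0.75}, monkey_likelihood, data)
p_green = predict(post, monkey_likelihood, "rainy", "green")
print(post, p_green)
```

Keeping the likelihood as a plain function of (model, input, output) is one possible design. When the model set is infinite, the sum in `predict` becomes an integral, and this exhaustive loop is no longer available.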