There has been some discussion lately about how to evaluate the performance of prediction markets (like Intrade) and individual predictors (like Nate Silver) at guessing the winners of elections or Oscars. Who is making the best predictions? If everyone simply made a *guess* for the winner of each state or award, we could evaluate performance easily: whoever guesses the most outcomes correctly is making the best predictions. But what do we do if the predictors give us full probabilities for the different outcomes? Intuitively, someone who gives 99% probability to an event that doesn’t occur is much “more wrong” than someone who gives it only 51%.

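| Nominee | Nate | Intrade | Winner |
| --- | --- | --- | --- |
| **Best Picture** | | | |
| Slumdog | .990 | .903 | X |
| Milk | .010 | .040 | |
| Frost/Nixon | | .013 | |
| Benjamin Button | | .080 | |
| The Reader | | .030 | |
| **Best Director** | | | |
| Danny Boyle | .997 | .900 | X |
| Gus Van Sant | .001 | .059 | |
| David Fincher | .001 | .050 | |
| Ron Howard | | | |
| Steven Daldry | | | |
| **Lead Actress** | | | |
| Kate Winslet | .676 | .850 | X |
| Meryl Streep | .324 | .150 | |
| Anne Hathaway | | .044 | |
| Melissa Leo | | | |
| Angelina Jolie | | | |
| **Lead Actor** | | | |
| Mickey Rourke | .711 | .700 | |
| Sean Penn | .190 | .335 | X |
| Brad Pitt | .059 | | |
| Frank Langella | .034 | .049 | |
| Richard Jenkins | .005 | | |
| **Supporting Actress** | | | |
| Taraji P. Henson | .510 | .190 | |
| Penélope Cruz | .246 | .588 | X |
| Viola Davis | .116 | .199 | |
| Amy Adams | .116 | | |
| Marisa Tomei | .012 | | |
| **Supporting Actor** | | | |
| Heath Ledger | .858 | .950 | X |
| Josh Brolin | .050 | .050 | |
| Philip Seymour Hoffman | .044 | | |
| Michael Shannon | .036 | | |
| Robert Downey Jr. | .012 | | |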

Let us think about this situation from the perspective of “bent coin predictors”. Let’s say we have a pool of 100 bent coins, each of which has some unknown probability $p_j$ of ending up heads. We have a number of people who reckon they can estimate that probability by looking at the coin. Denote the prediction of guesser $i$ for coin $j$ by $q^i_j$. After predictions have been made, we flip all the coins. Now, how do we find the best guesser?

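What we *want* is to measure how close each guess $q^i_j$ is to the true probability $p_j$. Since we don’t know $p_j$, this seems impossible. In some sense, it *is* impossible, but let us fantasize for a moment. Suppose that instead of all the coin flips, someone actually revealed all the true probabilities to us. Then, what would we do? There is no single best answer. One reasonable way to measure the quality of the guess would be the sum of squares difference

$$\sum_j \left( q^i_j - p_j \right)^2,$$

or, equivalently (up to a constant factor of two), writing $p_j(x)$ and $q^i_j(x)$ for the true and predicted probabilities of outcome $x$ (heads or tails),

$$\sum_j \sum_x \left( q^i_j(x) - p_j(x) \right)^2.$$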

**MONTE-CARLO APPROXIMATIONS**

Now, of course, we can’t calculate either of the above quantities. We only have a single result from flipping each coin. The central idea here is that we can use what is known as a Monte-Carlo approximation. This is a very simple idea. Suppose we would like to calculate the expected value of some function $f(x)$ under a distribution $p(x)$:

$$E_p[f] = \sum_x p(x)\, f(x).$$

Now, suppose that we don’t know $p$, but we can *simulate* $p$. That is, by running some sort of experiment, we can get some random value $x$, whose probability is $p(x)$. If we draw many such values $x_1, \dots, x_N$, then we can approximate the expected value of $f$ by:

$$\frac{1}{N} \sum_{n=1}^{N} f(x_n).$$


As an example, suppose we want to know the average amount that a slot machine pays out. We could approximate this by playing the machine 1000 times, and calculating the average observed payout.
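The slot machine example can be sketched in a few lines of Python. The payout table below is entirely hypothetical (it is not from any real machine); it is only there so that the Monte-Carlo average can be compared against an exact expected value.

```python
import random

# Hypothetical payout table (an assumption for illustration): payout -> probability.
PAYOUTS = {0.0: 0.90, 5.0: 0.09, 50.0: 0.01}
VALUES = list(PAYOUTS.keys())
WEIGHTS = list(PAYOUTS.values())


def play(rng):
    """Simulate one pull: draw a payout x with probability p(x)."""
    return rng.choices(VALUES, weights=WEIGHTS, k=1)[0]


def monte_carlo_mean(n_plays, seed=0):
    """Approximate the expected payout by averaging n_plays simulated pulls."""
    rng = random.Random(seed)
    return sum(play(rng) for _ in range(n_plays)) / n_plays


# Exact expected payout, available here only because we wrote down p ourselves.
exact = sum(x * p for x, p in PAYOUTS.items())

print("exact expected payout:", exact)
print("estimate from 1000 plays:", monte_carlo_mean(1000))
```

With only 1000 plays the estimate can still be visibly off, because the rare large payout dominates the variance; the approximation tightens as the number of plays grows.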

**LOSS FUNCTIONS**

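How can we apply loss functions to prediction markets? Notice that, for each coin $j$, we can make the following simplification

$$\sum_x \left( q^i_j(x) - p_j(x) \right)^2 = \sum_x p_j(x)^2 - 2 \sum_x p_j(x)\, q^i_j(x) + \sum_x q^i_j(x)^2.$$

Here $x$ ranges over the possible outcomes of coin $j$.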

The first term, $\sum_x p_j(x)^2$, is hard to estimate, but fortunately we don’t need to, because it doesn’t depend on the predictions. If we ignore this term for all guessers, we still have a valid *relative* rating of the guessers.

As for the second term, we can apply the Monte-Carlo technique from above. Let $x_j$ denote the outcome of flipping coin $j$. We make the approximation

$$\sum_x p_j(x)\, q^i_j(x) \approx q^i_j(x_j).$$

This is a *very noisy* approximation. However, it is *unbiased*: If we average over many coins, we will get something very close to the true value. (This works exactly the same way that we could approximate the average payout of slot machines in a casino by playing 1000 random machines and then averaging the payouts.)
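To see the unbiasedness claim in action, here is a small simulation sketch. Everything about the setup is an assumption made for illustration: the true probabilities are drawn uniformly at random, and the hypothetical guesser reports the truth plus some noise. The code compares the exact cross term (computable only because the simulation knows the true probabilities) with the single-flip estimate, both averaged over many coins.

```python
import random


def simulate(n_coins=100_000, seed=0):
    """Return (exact, estimated) cross term, each averaged over n_coins bent coins."""
    rng = random.Random(seed)
    exact_cross = 0.0      # sum over coins of sum_x p(x) q(x)
    estimated_cross = 0.0  # sum over coins of q(x_j), using one flip per coin
    for _ in range(n_coins):
        p = rng.random()  # true P(heads); hidden from the guesser in reality
        # Hypothetical noisy guess, clipped so it stays a valid probability.
        q = min(1.0, max(0.0, p + rng.uniform(-0.2, 0.2)))
        heads = rng.random() < p  # flip the coin once
        exact_cross += p * q + (1.0 - p) * (1.0 - q)
        estimated_cross += q if heads else (1.0 - q)
    return exact_cross / n_coins, estimated_cross / n_coins


exact, estimated = simulate()
print("exact cross term:    ", exact)
print("estimated cross term:", estimated)
```

For any single coin the estimate is just the probability the guesser gave to whatever happened, which can be far from the exact value; it is only the average over many coins that settles down.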

Happily, the final term,

$$\sum_x q^i_j(x)^2,$$

can be computed exactly, since the guessers have provided $q^i_j$.
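Putting the two computable terms together gives an estimated loss for guesser $i$. The symbol $\hat{L}_i$, the count $N$ of coins, and the choice to average rather than sum are notation introduced here for illustration, with $q^i_j(x)$ the probability guesser $i$ gave to outcome $x$ of coin $j$ and $x_j$ the outcome that actually occurred:

$$\hat{L}_i = \frac{1}{N} \sum_{j=1}^{N} \left[ -2\, q^i_j(x_j) + \sum_x q^i_j(x)^2 \right].$$

As a worked example, take a single two-outcome event that does *not* occur. A guesser who gave it 99% assigned only $0.01$ to what actually happened, while a guesser who gave it 51% assigned $0.49$:

$$-2(0.01) + 0.99^2 + 0.01^2 = 0.9602, \qquad -2(0.49) + 0.51^2 + 0.49^2 = -0.4798.$$

The overconfident guesser receives the much larger loss, which matches the intuition that a 99% miss is “more wrong” than a 51% miss.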

**THE OSCARS**

Now, let’s apply the above theory to the Oscar predictions.

Taking a zero for the empty entries, normalizing each prediction so that the probabilities within a category sum to one, and averaging the loss over the six categories, we obtain the scores:

**Quadratic loss**:

538: -0.6235; Intrade: -0.7925

Remember that a lower loss is better, so this is a clear win for Intrade.
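The quadratic scores can be reproduced with a short Python sketch. It rests on two assumptions: that the lone numbers in the table belong to the column that makes each predictor's probabilities sum to roughly one, and that the reported score is the average over the six categories. Under those assumptions the output agrees with the figures above to about three decimal places. The names `NATE`, `INTRADE`, and `quadratic_loss` are made up for this example.

```python
# Probabilities from the table, one list per category, winner listed FIRST.
# Empty entries count as zero and are omitted (they change neither sum below).
NATE = [
    [.990, .010],                    # Best Picture (Slumdog)
    [.997, .001, .001],              # Best Director (Danny Boyle)
    [.676, .324],                    # Lead Actress (Kate Winslet)
    [.190, .711, .059, .034, .005],  # Lead Actor (Sean Penn)
    [.246, .510, .116, .116, .012],  # Supporting Actress (Penélope Cruz)
    [.858, .050, .044, .036, .012],  # Supporting Actor (Heath Ledger)
]
INTRADE = [
    [.903, .040, .013, .080, .030],  # Best Picture
    [.900, .059, .050],              # Best Director
    [.850, .150, .044],              # Lead Actress
    [.335, .700, .049],              # Lead Actor
    [.588, .190, .199],              # Supporting Actress
    [.950, .050],                    # Supporting Actor
]


def quadratic_loss(categories):
    """Average over categories of -2*q(winner) + sum_x q(x)^2."""
    total = 0.0
    for probs in categories:
        z = sum(probs)
        q = [p / z for p in probs]  # normalize so the category sums to one
        total += -2.0 * q[0] + sum(v * v for v in q)
    return total / len(categories)


print("538:     %.4f" % quadratic_loss(NATE))
print("Intrade: %.4f" % quadratic_loss(INTRADE))
```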

**OTHER LOSS FUNCTIONS**

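Another reasonable way to measure errors would be the KL-divergence between $p_j$ and $q^i_j$,

$$\mathrm{KL}\left(p_j \,\|\, q^i_j\right) = \sum_x p_j(x) \log \frac{p_j(x)}{q^i_j(x)}.$$

Again, dropping a constant term (here $\sum_x p_j(x) \log p_j(x)$, which does not depend on the predictions) and approximating the remaining expectation by the observed outcome $x_j$, we get the loss

$$-\log q^i_j(x_j).$$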

For historical reasons, let’s call this the “conditional likelihood” loss.

**Conditional likelihood loss**:

538: 0.6032; Intrade: 0.3699

Again, Intrade looks much better. Then again, of course, maybe Intrade was just lucky. Six predictions isn’t a large number to average over.
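The conditional likelihood scores can be checked the same way. This sketch makes the same assumptions as the table reading used for the quadratic loss (lone entries assigned to the column that makes each predictor's probabilities sum to about one, empty entries as zero, scores averaged over the six categories) and uses the natural logarithm. The winner's probability is listed first in each category, and the function name `log_loss` is hypothetical.

```python
import math

# Winner's probability FIRST in each category; empty table entries omitted.
NATE = [
    [.990, .010],                    # Best Picture
    [.997, .001, .001],              # Best Director
    [.676, .324],                    # Lead Actress
    [.190, .711, .059, .034, .005],  # Lead Actor
    [.246, .510, .116, .116, .012],  # Supporting Actress
    [.858, .050, .044, .036, .012],  # Supporting Actor
]
INTRADE = [
    [.903, .040, .013, .080, .030],  # Best Picture
    [.900, .059, .050],              # Best Director
    [.850, .150, .044],              # Lead Actress
    [.335, .700, .049],              # Lead Actor
    [.588, .190, .199],              # Supporting Actress
    [.950, .050],                    # Supporting Actor
]


def log_loss(categories):
    """Average over categories of -log of the normalized winner probability."""
    total = 0.0
    for probs in categories:
        q_winner = probs[0] / sum(probs)  # normalize within the category
        total += -math.log(q_winner)
    return total / len(categories)


print("538:     %.4f" % log_loss(NATE))
print("Intrade: %.4f" % log_loss(INTRADE))
```

Most of the gap comes from Lead Actor and Supporting Actress, where the winner's normalized probability was noticeably lower under the 538 numbers than under Intrade's.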

I’d love to see this type of analysis applied to the state-by-state results of the 2008 elections.

Matlab code (probably also works with Octave) is here.