There has been some discussion lately about how to evaluate the performance of different prediction markets (like Intrade) and predictors (like Nate Silver) at guessing the winners of elections or Oscars. Who is making the best predictions? If everyone simply made a *guess* for the winner of each state or award, we could evaluate performance easily: whoever guesses the most outcomes correctly is making the best predictions. But what do we do if the predictors provide us with full probabilities over the different outcomes? Intuitively, someone who gives a 99% probability to an event that doesn’t occur is much “more wrong” than someone who gives it only a 51% probability.

| Category | Nominee | Nate | Intrade | Winner |
|---|---|---|---|---|
| Best Picture | Slumdog | .990 | .903 | X |
| | Milk | .010 | .040 | |
| | Frost/Nixon | | .013 | |
| | Benjamin Button | | .080 | |
| | The Reader | | .030 | |
| Best Director | Danny Boyle | .997 | .900 | X |
| | Gus Van Sant | .001 | .059 | |
| | David Fincher | .001 | .050 | |
| | Ron Howard | | | |
| | Steven Daldry | | | |
| Lead Actress | Kate Winslet | .676 | .850 | X |
| | Meryl Streep | .324 | .150 | |
| | Anne Hathaway | | .044 | |
| | Melissa Leo | | | |
| | Angelina Jolie | | | |
| Lead Actor | Mickey Rourke | .711 | .700 | |
| | Sean Penn | .190 | .335 | X |
| | Brad Pitt | .059 | | |
| | Frank Langella | .034 | .049 | |
| | Richard Jenkins | .005 | | |
| Supporting Actress | Taraji P. Henson | .510 | .190 | |
| | Penélope Cruz | .246 | .588 | X |
| | Viola Davis | .116 | .199 | |
| | Amy Adams | .116 | | |
| | Marisa Tomei | .012 | | |
| Supporting Actor | Heath Ledger | .858 | .950 | X |
| | Josh Brolin | .050 | .050 | |
| | Philip Seymour Hoffman | .044 | | |
| | Michael Shannon | .036 | | |
| | Robert Downey Jr. | .012 | | |

Let us think about this situation from the perspective of “bent coin predictors”. Let’s say we have a pool of 100 bent coins, where coin $j$ has some unknown probability $q_j$ of ending up heads. We have a number of people who reckon they can estimate that probability by looking at the coin. Denote the prediction of guesser $i$ for coin $j$ by $p_{ij}$. After predictions have been made, we flip all the coins. Now, how do we find the best guesser?

What we *want* is to measure how close the predictions $p_{ij}$ are to the true probabilities $q_j$. Since we don’t know $q_j$, this seems impossible. In some sense, it *is* impossible, but let us fantasize for a moment. Suppose that instead of all the coin flips, someone actually revealed all the true probabilities to us. Then, what would we do? There is no single best answer. One reasonable way to measure the quality of guesser $i$ would be the sum of squared differences

$$\sum_j (q_j - p_{ij})^2$$

or, equivalently,

$$\sum_j q_j^2 - 2 \sum_j q_j \, p_{ij} + \sum_j p_{ij}^2.$$

**MONTE-CARLO APPROXIMATIONS**

Now, of course, we can’t calculate either of the above quantities. We only have a single result from flipping each coin. The central idea here is that we can use what is known as a Monte-Carlo approximation. This is a very simple idea. Suppose we would like to calculate the expected value of some function $f$,

$$\mathbb{E}[f(X)] = \sum_x p(x) f(x).$$

Now, suppose that we don’t know $p$, but we can *simulate* $X$. That is, by running some sort of experiment, we can get a random value $x_n$, whose probability is $p(x_n)$. If we draw many such values $x_1, x_2, \dots, x_N$, then we can approximate the above by

$$\mathbb{E}[f(X)] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n).$$


As an example, suppose we want to know the average amount that a slot machine pays out. We could approximate this by playing the machine 1000 times, and calculating the average observed payout.
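A minimal sketch of this idea; the payout distribution below is invented purely for illustration:

```python
import random

def play_slot_machine():
    """Simulate one play of a hypothetical machine:
    90% chance of $0, 9% chance of $5, 1% chance of $50."""
    r = random.random()
    if r < 0.90:
        return 0.0
    elif r < 0.99:
        return 5.0
    else:
        return 50.0

# Monte-Carlo approximation: average many simulated plays.
random.seed(0)
N = 100_000
estimate = sum(play_slot_machine() for _ in range(N)) / N
print(estimate)  # close to the true expected payout, 0.09*5 + 0.01*50 = 0.95
```

The more plays we average over, the closer the estimate gets to the true expected payout.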

**LOSS FUNCTIONS**

How can we apply loss functions to prediction markets? Notice that we can make the following decomposition:

$$\sum_j (q_j - p_{ij})^2 = \sum_j q_j^2 - 2 \sum_j q_j \, p_{ij} + \sum_j p_{ij}^2.$$

The first term, $\sum_j q_j^2$, is hard to estimate, but fortunately we don’t need to, because it doesn’t depend on the predictions. If we ignore this term for all guessers, we still have a valid *relative* rating of the guessers.

As for the second term, we can apply the Monte-Carlo technique from above. Let $x_j \in \{0, 1\}$ denote the outcome of flipping coin $j$. Since $\mathbb{E}[x_j] = q_j$, we can make the approximation

$$\sum_j q_j \, p_{ij} \approx \sum_j x_j \, p_{ij}.$$

This is a *very noisy* approximation. However, it is *unbiased*: If we average over many coins, we will get something very close to the true value. (This works exactly the same way that we could approximate the average payout of slot machines in a casino by playing 1000 random machines and then averaging the payouts.)
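A quick simulation illustrates the unbiasedness; the coin probabilities and the noisy guesser here are made up for illustration:

```python
import random

random.seed(0)
J = 100_000                                         # number of coins
q = [random.random() for _ in range(J)]             # true (hidden) heads probabilities
# a guesser whose predictions are noisy versions of the truth (purely illustrative)
p = [min(1.0, max(0.0, qj + random.gauss(0, 0.1))) for qj in q]
x = [1 if random.random() < qj else 0 for qj in q]  # one observed flip per coin

exact  = sum(qj * pj for qj, pj in zip(q, p)) / J   # the term we cannot normally compute
approx = sum(xj * pj for xj, pj in zip(x, p)) / J   # its single-flip Monte-Carlo estimate
print(exact, approx)  # the per-coin averages agree closely
```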

Happily, the final term,

$$\sum_j p_{ij}^2,$$

can be computed exactly, since the guessers have provided the predictions $p_{ij}$.

**THE OSCARS**

Now, let’s apply the above theory to the Oscar predictions.

Taking a zero for the empty entries, and normalizing each prediction, we obtain the scores:

**Quadratic loss**:

538: -0.6235 intrade: -0.7925

Remember, less is better, so this is a clear win for intrade.
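A short script reproduces these scores; the data is transcribed by hand from the table above, and the per-category score is $-2\,p(\text{winner}) + \sum_x p(x)^2$ on the normalized predictions, averaged over the six categories:

```python
# Probabilities transcribed from the table above, zeros for empty entries.
# Each entry: (538 predictions, intrade predictions, index of the winner).
categories = [
    ([.990, .010, .000, .000, .000], [.903, .040, .013, .080, .030], 0),  # Picture
    ([.997, .001, .001, .000, .000], [.900, .059, .050, .000, .000], 0),  # Director
    ([.676, .324, .000, .000, .000], [.850, .150, .044, .000, .000], 0),  # Lead Actress
    ([.711, .190, .059, .034, .005], [.700, .335, .000, .049, .000], 1),  # Lead Actor
    ([.510, .246, .116, .116, .012], [.190, .588, .199, .000, .000], 1),  # Supp. Actress
    ([.858, .050, .044, .036, .012], [.950, .050, .000, .000, .000], 0),  # Supp. Actor
]

def quadratic_loss(scored):
    """Average of -2*p(winner) + sum_x p(x)^2 over categories (constant term dropped)."""
    total = 0.0
    for probs, winner in scored:
        s = sum(probs)
        probs = [v / s for v in probs]            # normalize each prediction
        total += -2 * probs[winner] + sum(v * v for v in probs)
    return total / len(scored)

loss_538     = quadratic_loss([(n, w) for n, _, w in categories])
loss_intrade = quadratic_loss([(i, w) for _, i, w in categories])
print(loss_538, loss_intrade)  # approximately -0.6235 and -0.7925
```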

**OTHER LOSS FUNCTIONS**

Another reasonable way to measure errors would be the KL-divergence between the true probabilities and the predictions,

$$\sum_j \left[ q_j \log \frac{q_j}{p_{ij}} + (1 - q_j) \log \frac{1 - q_j}{1 - p_{ij}} \right].$$

Again, dropping a constant term (the part that depends only on $q_j$) and applying the same Monte-Carlo approximation, we get the loss

$$-\sum_j \left[ x_j \log p_{ij} + (1 - x_j) \log (1 - p_{ij}) \right],$$

the negative log-probability that the guesser assigned to the observed outcomes.

For historical reasons, let’s call this the “conditional likelihood” loss.

**Conditional likelihood loss**:

538: 0.6032 intrade: 0.3699
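This score can also be checked directly; the data is transcribed by hand from the table above, and the per-category loss is $-\log$ of the normalized probability placed on the actual winner:

```python
import math

# Probabilities transcribed from the table above, zeros for empty entries.
# Each entry: (538 predictions, intrade predictions, index of the winner).
categories = [
    ([.990, .010, .000, .000, .000], [.903, .040, .013, .080, .030], 0),  # Picture
    ([.997, .001, .001, .000, .000], [.900, .059, .050, .000, .000], 0),  # Director
    ([.676, .324, .000, .000, .000], [.850, .150, .044, .000, .000], 0),  # Lead Actress
    ([.711, .190, .059, .034, .005], [.700, .335, .000, .049, .000], 1),  # Lead Actor
    ([.510, .246, .116, .116, .012], [.190, .588, .199, .000, .000], 1),  # Supp. Actress
    ([.858, .050, .044, .036, .012], [.950, .050, .000, .000, .000], 0),  # Supp. Actor
]

def conditional_likelihood_loss(scored):
    """Average -log of the normalized probability placed on the actual winner."""
    return sum(-math.log(p[w] / sum(p)) for p, w in scored) / len(scored)

cl_538     = conditional_likelihood_loss([(n, w) for n, _, w in categories])
cl_intrade = conditional_likelihood_loss([(i, w) for _, i, w in categories])
print(cl_538, cl_intrade)  # approximately 0.6032 and 0.3699
```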

Again, intrade looks much better. Then again, of course, maybe intrade was just lucky. Six predictions isn’t a large number to average over.

I’d love to see this type of analysis applied to the state-by-state results of the 2008 elections.

Matlab code (probably also works with Octave) is here.

Thanks for the detailed explanation! Would it be too much of an effort to also plug the HubDub numbers, as a basis for comparison?

To be honest, I can’t bear the thought of trudging through the website to pull out the final predictions. If you can give me the numbers for HubDub like I have above for 538 and intrade, I’d be happy to run the evaluation on those as well.

Great analysis.

Here are the Hubdub forecasts: http://newspundits.hubdub.com/2009/02/hubdub-takes-home-the-gold-at-the-oscars/

Well, to compute the quadratic loss, I would actually need to know how Hubdub apportioned probability between the different losing entries in each category. I can calculate the conditional likelihood loss, though, and at 0.2237, Hubdub beats both intrade and 538.

Justin,

Perhaps you may want to clarify that these metrics do not measure the internal consistency of the reported confidences.

So, if exchange A gets 10000/10000 predictions right (with all contract prices at 0.51) and exchange B gets 5100/10000 predictions right (with all contract prices at 0.51), then exchange A will have better conditional and quadratic loss than B. And this, despite the fact that exchange A succeeded much more often than it should.
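A quick sketch of this scenario, using the hypothetical numbers above and the conditional likelihood loss (natural log):

```python
import math

def conditional_likelihood_loss(n_correct, n_total, price):
    """Average -log probability assigned to observed outcomes, when every
    contract prices the side that eventually pays off at `price`."""
    n_wrong = n_total - n_correct
    return (n_correct * -math.log(price) + n_wrong * -math.log(1 - price)) / n_total

loss_A = conditional_likelihood_loss(10000, 10000, 0.51)  # perfectly accurate, miscalibrated prices
loss_B = conditional_likelihood_loss(5100, 10000, 0.51)   # perfectly calibrated prices

print(loss_A, loss_B)  # A scores better than B despite its inconsistent confidences
```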

Panos,

That’s an extremely interesting point. I wouldn’t go so far as to say that these metrics don’t measure internal consistency *at all*. All else being equal, internal consistency is good. If you can predict 10000/10000, obviously, you can improve your loss by raising your confidences to 100%. Another way of saying this is that if you are able to drive either of these losses to their minimum (by actually producing the true values $q_j$), you will also end up with consistent predictions.

I guess there is some sort of a trade-off between measuring internal consistency and prediction accuracy: I could just always predict $0.5$ and, assuming the coins are unbiased overall, I would look internally consistent, even though my predictions are horrible. I guess the quadratic loss and conditional likelihood lie somewhere in the middle of this trade-off.

Well, if you estimate P(heads)=0.5 and indeed this is the probability of heads, then your estimation is absolutely fine.

We have two things to measure: (a) how much better than random the predictions are, i.e. “predictive power” (accuracy, loss functions, etc.), and (b) how consistent the confidence metrics for these predictions are.

Most of the papers focus on (a), often ignoring (b). But for prediction markets (where prices are supposed to estimate probabilities) it is important to examine rigorously how good the estimates themselves are, not only how good the predictions were.

I disagree that the quadratic loss and conditional likelihood only measure (a). The quadratic loss tries to measure

$$\sum_j (q_j - p_{ij})^2,$$

which seems to me a very reasonable mix of predictive power and consistency.

The next interesting question would be: what is the likelihood that guesser A’s lower value compared to B is not just a result of pure luck?

In other words, if guesser A gets value X and guesser B gets value Y (e.g. from the quadratic loss calculation), with the number of events being N, what is the likelihood that A really was a better forecaster instead of just being luckier? How could that be calculated?

OJ