There has been some discussion lately about how to evaluate the performance of different prediction markets (like Intrade) and predictors (like Nate Silver) at guessing the winners of elections or Oscars. Who is making the best predictions? If everyone simply made a guess for the winner of each state or award, we could evaluate performance easily: whoever guesses the most outcomes correctly is making the best predictions. But what do we do if the predictors provide us with full probabilities of the different outcomes? Intuitively, someone who gives 99% probability to an event that doesn’t occur is much “more wrong” than someone who gives only a 51% probability.
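| Category | Nominee | Nate (538) | Intrade | Winner |
|---|---|---|---|---|
| Best Picture | Slumdog | .990 | .903 | X |
| Best Picture | Milk | .010 | .040 | |
| Best Picture | Frost/Nixon | | .013 | |
| Best Picture | Benjamin Button | | .080 | |
| Best Picture | The Reader | | .030 | |
| Best Director | Danny Boyle | .997 | .900 | X |
| Best Director | Gus Van Sant | .001 | .059 | |
| Best Director | David Fincher | .001 | .050 | |
| Best Director | Ron Howard | | | |
| Best Director | Stephen Daldry | | | |
| Lead Actress | Kate Winslet | .676 | .850 | X |
| Lead Actress | Meryl Streep | .324 | .150 | |
| Lead Actress | Anne Hathaway | | .044 | |
| Lead Actress | Melissa Leo | | | |
| Lead Actress | Angelina Jolie | | | |
| Lead Actor | Mickey Rourke | .711 | .700 | |
| Lead Actor | Sean Penn | .190 | .335 | X |
| Lead Actor | Brad Pitt | .059 | | |
| Lead Actor | Frank Langella | .034 | .049 | |
| Lead Actor | Richard Jenkins | .005 | | |
| Supporting Actress | Taraji P. Henson | .510 | .190 | |
| Supporting Actress | Penélope Cruz | .246 | .588 | X |
| Supporting Actress | Viola Davis | .116 | .199 | |
| Supporting Actress | Amy Adams | .116 | | |
| Supporting Actress | Marisa Tomei | .012 | | |
| Supporting Actor | Heath Ledger | .858 | .950 | X |
| Supporting Actor | Josh Brolin | .050 | .050 | |
| Supporting Actor | Philip Seymour Hoffman | .044 | | |
| Supporting Actor | Michael Shannon | .036 | | |
| Supporting Actor | Robert Downey Jr. | .012 | | |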
Let us think about this situation from the perspective of “bent coin predictors”. Let’s say we have a pool of 100 bent coins, each of which has some unknown probability $p_i$ of ending up heads. We have a number of people who reckon they can estimate that probability by looking at the coin. Denote the prediction of guesser $j$ for coin $i$ by $q_i^j$. After predictions have been made, we flip all the coins. Now, how do we find the best guesser?
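What we want is to measure how close $q_i^j$ is to $p_i$. Since we don’t know $p_i$, this seems impossible. In some sense, it is impossible, but let us fantasize for a moment. Suppose that instead of all the coin flips, someone actually revealed all the true probabilities $p_i$ to us. Then, what would we do? There is no single best answer. One reasonable way to measure the quality of the guess would be the sum of squares difference

$$\sum_i \left(p_i - q_i^j\right)^2$$

or, equivalently (up to a constant factor of 2), the same difference summed over both outcomes $x$ of each coin, with $p_i(x)$ and $q_i^j(x)$ denoting the true and predicted probabilities of outcome $x$,

$$\sum_i \sum_x \left(p_i(x) - q_i^j(x)\right)^2.$$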
MONTE-CARLO APPROXIMATIONS
Now, of course, we can’t calculate either of the above quantities. We only have a single result from flipping each coin. The central idea here is that we can use what is known as a Monte-Carlo approximation. This is a very simple idea. Suppose we would like to calculate the expected value of some function $f(x)$.

$$E[f] = \sum_x p(x) f(x)$$

Now, suppose that we don’t know $p$, but we can simulate $p$. That is, by running some sort of experiment, we can get some random value $x$, whose probability is $p(x)$. If we draw many such values $x_1, \dots, x_N$, then we can approximate the above by:

$$E[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n)$$

As an example, suppose we want to know the average amount that a slot machine pays out. We could approximate this by playing the machine 1000 times, and calculating the average observed payout.
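The slot machine example can be sketched in a few lines of Python. This is only an illustration: the machine `pull_lever` and its payout probabilities are invented here so that the true average is known and the approximation can be compared against it.

```python
import random

def monte_carlo_mean(sample, f, n):
    """Approximate E[f(x)] by averaging f over n simulated draws of x."""
    return sum(f(sample()) for _ in range(n)) / n

def pull_lever(rng):
    """Hypothetical slot machine (made up for this sketch):
    pays 10 with probability 0.05, 1 with probability 0.20, otherwise 0."""
    u = rng.random()
    if u < 0.05:
        return 10.0
    if u < 0.25:
        return 1.0
    return 0.0

rng = random.Random(0)  # fixed seed so the run is repeatable

# Known only because we invented the machine; in practice this is the unknown.
true_mean = 0.05 * 10.0 + 0.20 * 1.0

# Play the machine 1000 times and average the observed payouts.
estimate = monte_carlo_mean(lambda: pull_lever(rng), lambda payout: payout, 1000)

print(f"true mean {true_mean:.3f}, Monte-Carlo estimate {estimate:.3f}")
```

The estimate will not match the true mean exactly, but the gap shrinks as the number of plays grows.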
LOSS FUNCTIONS
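How can we apply loss functions to prediction markets? Notice that we can make the following simplification

$$\sum_i \sum_x \left(p_i(x) - q_i^j(x)\right)^2 = \sum_i \sum_x p_i(x)^2 - 2 \sum_i \sum_x p_i(x)\, q_i^j(x) + \sum_i \sum_x q_i^j(x)^2.$$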
The first term, $\sum_i \sum_x p_i(x)^2$, is hard to estimate, but fortunately we don’t need to, because it doesn’t depend on the predictions. If we ignore this term for all guessers, we still have a valid relative rating of the guessers.
As for the second term, we can apply the Monte-Carlo technique from above. Let $x_i$ denote the outcome of flipping coin $i$. We make the approximation

$$\sum_x p_i(x)\, q_i^j(x) \approx q_i^j(x_i).$$
This is a very noisy approximation. However, it is unbiased: If we average over many coins, we will get something very close to the true value. (This works exactly the same way that we could approximate the average payout of slot machines in a casino by playing 1000 random machines and then averaging the payouts.)
Happily, the final term, $\sum_i \sum_x q_i^j(x)^2$, can be computed exactly, since the guessers have provided $q_i^j$.
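To see that dropping the first term and using a single flip per coin still ranks guessers sensibly, here is a small simulation sketch. Everything in it is an assumption made for illustration: the two guessers ("careful" and "lazy"), the number of coins, and the way the true probabilities are drawn. Because this is a simulation, the true probabilities are known and the exact quadratic loss can be computed for comparison.

```python
import random

def quadratic_score(q, outcome):
    """Estimated quadratic loss for one coin with the prediction-independent
    term dropped: -2 * q(observed outcome) + sum over outcomes of q(x)^2."""
    return -2.0 * q[outcome] + sum(v * v for v in q.values())

def true_quadratic_loss(p, q):
    """Exact quadratic loss, computable only when the true p is known."""
    return sum((p[x] - q[x]) ** 2 for x in p)

def simulate(n_coins=20000, seed=1):
    rng = random.Random(seed)
    est = {"careful": 0.0, "lazy": 0.0}    # average estimated loss
    exact = {"careful": 0.0, "lazy": 0.0}  # average exact loss
    for _ in range(n_coins):
        p_heads = rng.random()
        p = {"H": p_heads, "T": 1.0 - p_heads}
        # Hypothetical guessers: one is off by at most 0.1, one always says 50/50.
        guess = min(max(p_heads + rng.uniform(-0.1, 0.1), 0.0), 1.0)
        guesses = {
            "careful": {"H": guess, "T": 1.0 - guess},
            "lazy": {"H": 0.5, "T": 0.5},
        }
        # A single flip of this coin is all the evaluator gets to see.
        outcome = "H" if rng.random() < p_heads else "T"
        for name, q in guesses.items():
            est[name] += quadratic_score(q, outcome) / n_coins
            exact[name] += true_quadratic_loss(p, q) / n_coins
    return est, exact

est, exact = simulate()
print("estimated gap:", est["careful"] - est["lazy"])
print("exact gap:    ", exact["careful"] - exact["lazy"])
```

The estimated scores are shifted relative to the exact losses by the dropped term, but that shift is the same for every guesser, so the gap between guessers should come out roughly the same either way.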
THE OSCARS
Now, let’s apply the above theory to the Oscar predictions.
Taking a zero for the empty entries, normalizing each prediction so that the probabilities in each category sum to one, and averaging over the six categories, we obtain the scores:
Quadratic loss:
538: -0.6235
intrade: -0.7925
Remember, less is better, so this is a clear win for intrade.
OTHER LOSS FUNCTIONS
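Another reasonable way to measure errors would be the KL-divergence between $p_i$ and $q_i^j$,

$$\sum_i \sum_x p_i(x) \log \frac{p_i(x)}{q_i^j(x)}.$$

Again, dropping a constant term (here $\sum_i \sum_x p_i(x) \log p_i(x)$, which doesn’t depend on the predictions) and applying the same Monte-Carlo approximation to what remains, we get the loss

$$-\sum_i \log q_i^j(x_i).$$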
For historical reasons, let’s call this the “conditional likelihood” loss.
Conditional likelihood loss:
538: 0.6032
intrade: 0.3699
Again, intrade looks much better. Then again, of course, maybe intrade was just lucky. Six categories isn’t a large number to average over.
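For readers who want to check the numbers, here is a Python sketch of both losses applied to the Oscar table. It rests on two assumptions that are not spelled out in the text: where a nominee has only one number, the column it belongs to is inferred, and each score is treated as an average over the six categories. Under those assumptions the output appears to agree with the scores reported above to four decimal places.

```python
import math

# Probabilities in table order within each category; 0 marks an empty entry.
# Assumption: for rows with a single number, the column (538 or Intrade)
# is inferred rather than given explicitly.
OSCARS = [
    # (category, index of winner, 538 probabilities, Intrade probabilities)
    ("Best Picture",       0, [.990, .010, 0, 0, 0],          [.903, .040, .013, .080, .030]),
    ("Best Director",      0, [.997, .001, .001, 0, 0],       [.900, .059, .050, 0, 0]),
    ("Lead Actress",       0, [.676, .324, 0, 0, 0],          [.850, .150, .044, 0, 0]),
    ("Lead Actor",         1, [.711, .190, .059, .034, .005], [.700, .335, 0, .049, 0]),
    ("Supporting Actress", 1, [.510, .246, .116, .116, .012], [.190, .588, .199, 0, 0]),
    ("Supporting Actor",   0, [.858, .050, .044, .036, .012], [.950, .050, 0, 0, 0]),
]

def normalize(q):
    """Rescale a prediction so its probabilities sum to one."""
    total = sum(q)
    return [v / total for v in q]

def quadratic_loss(q, winner):
    """Quadratic loss with the prediction-independent term dropped."""
    return -2.0 * q[winner] + sum(v * v for v in q)

def log_loss(q, winner):
    """Conditional likelihood loss: negative log probability of the winner."""
    return -math.log(q[winner])

def average_loss(loss, column):
    """Average a loss over all categories for one predictor's column."""
    losses = [loss(normalize(row[column]), row[1]) for row in OSCARS]
    return sum(losses) / len(losses)

scores = {
    name: {
        "quadratic": average_loss(quadratic_loss, column),
        "log": average_loss(log_loss, column),
    }
    for name, column in (("538", 2), ("intrade", 3))
}

for name, s in scores.items():
    print(f"{name}: quadratic {s['quadratic']:.4f}, conditional likelihood {s['log']:.4f}")
```

Lower is better for both losses, so a predictor is penalized most in the categories where it put little probability on the eventual winner.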
I’d love to see this type of analysis applied to the state-by-state results of the 2008 elections.
Matlab code (probably also works with Octave) is here.