Prediction markets, monte carlo, and loss functions

There has been some discussion lately about how to evaluate the performance of different prediction markets (like Intrade), and predictors (like Nate Silver) at guessing the winners of elections, or Oscars. Who is making the best predictions? If everyone simply made a guess for the winner of each state or award, we could evaluate performance easily: whoever guesses the most outcomes correctly is making the best predictions. But what do we do if the predictors provide us full probabilities of the different outcomes? Intuitively, someone who gives 99% probability of an event that doesn’t occur is much “more wrong” than someone who gives only a 51% probability.

                        Nate    Intrade     Winner
Best Picture
Slumdog                .990      .903        X
Milk                   .010      .040
Frost/Nixon                      .013
Benjamin Button                  .080
The Reader                       .030

Best Director
Danny Boyle            .997      .900        X
Gus Van Sant           .001      .059
David Fincher          .001      .050
Ron Howard              
Stephen Daldry

Lead Actress
Kate Winslet           .676      .850        X
Meryl Streep           .324      .150
Anne Hathaway                    .044
Melissa Leo            
Angelina Jolie         

Lead Actor
Mickey Rourke          .711      .700
Sean Penn              .190      .335        X
Brad Pitt              .059
Frank Langella         .034      .049
Richard Jenkins        .005

Supporting Actress
Taraji P. Henson       .510      .190
Penélope Cruz          .246      .588        X
Viola Davis            .116      .199
Amy Adams              .116
Marisa Tomei           .012

Supporting Actor
Heath Ledger           .858      .950        X
Josh Brolin            .050      .050
Philip Seymour Hoffman .044
Michael Shannon        .036
Robert Downey Jr.      .012

Let us think about this situation from the perspective of “bent coin predictors”. Let’s say we have a pool of 100 bent coins, each of which has some unknown probability p_c(\text{heads}) of ending up heads. We have a number of people who reckon they can estimate that probability by looking at the coin. Denote the prediction of guesser i for coin c by g_{i,c}(\text{heads}). After predictions have been made, we flip all the coins. Now, how do we find the best guesser?

What we want is to measure how close g_{i,c}(\text{heads}) is to p_c(\text{heads}). Since we don’t know p_c(\text{heads}), this seems impossible. In some sense, it is impossible, but let us fantasize for a moment. Suppose that instead of all the coin flips, someone actually revealed all the true probabilities p_c(\text{heads}) to us. Then, what would we do? There is no single best answer. One reasonable way to measure the quality of the guess would be the sum of squared differences

(p_c(\text{heads}) - g_{i,c}(\text{heads}))^2 + (p_c(\text{tails}) - g_{i,c}(\text{tails}))^2

or, equivalently,

\sum_\text{rez} (p_c(\text{rez}) - g_{i,c}(\text{rez}))^2.


Now, of course, we can’t calculate either of the above quantities. We only have a single result from flipping each coin. The central idea here is that we can use what is known as a Monte-Carlo approximation. This is a very simple idea. Suppose we would like to calculate the expected value of some function f.

\sum_x p(x) f(x)

Now, suppose that we don’t know p, but we can simulate p. That is, by running some sort of experiment, we can get a random value x_n that takes each value x with probability p(x). If we draw N such values, then we can approximate the above by:

\frac{1}{N} \sum_{n=1}^{N} f(x_n)

As an example, suppose we want to know the average amount that a slot machine pays out. We could approximate this by playing the machine 1000 times, and calculating the average observed payout.
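The slot machine example can be sketched in a few lines of Python. The machine below is hypothetical (it pays 10 with probability 0.05 and nothing otherwise, so its true average payout is 0.5); the names `monte_carlo_mean` and `play` are made up for this illustration.

```python
import random

def monte_carlo_mean(sample, n):
    """Average n independent draws from sample(), approximating sum_x p(x) f(x)."""
    return sum(sample() for _ in range(n)) / n

# Hypothetical slot machine (an assumption for illustration only).
PAYOUTS = [10.0, 0.0]
PROBS = [0.05, 0.95]
TRUE_MEAN = sum(p * x for p, x in zip(PROBS, PAYOUTS))

rng = random.Random(0)  # fixed seed so the run is repeatable

def play():
    """One pull of the lever: a single random payout."""
    return rng.choices(PAYOUTS, weights=PROBS, k=1)[0]

estimate_1000 = monte_carlo_mean(play, 1000)
estimate_100000 = monte_carlo_mean(play, 100000)
print(TRUE_MEAN, estimate_1000, estimate_100000)
```

With 1000 plays the estimate is still fairly rough; with more plays it settles closer to the true average.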


How can we apply loss functions to prediction markets? Notice that we can make the following simplification

\sum_\text{rez} (p_c(\text{rez}) - g_{i,c}(\text{rez}))^2 =

\sum_\text{rez} p_c(\text{rez})^2 - 2 \sum_\text{rez} p_c(\text{rez}) g_{i,c}(\text{rez}) + \sum_\text{rez} g_{i,c}(\text{rez})^2.

The first term, \sum_\text{rez} p_c(\text{rez})^2 is hard to estimate, but fortunately we don’t need to, because it doesn’t depend on the predictions. If we ignore this term for all guessers, we still have a valid relative rating of the guessers.

As for the second term, we can apply the Monte-Carlo technique from above. Let \text{rez}_c denote the outcome of flipping coin c. We make the approximation

\sum_\text{rez} p_c(\text{rez}) g_{i,c}(\text{rez}) \approx g_{i,c}(\text{rez}_c).

This is a very noisy approximation. However, it is unbiased: If we average over many coins, we will get something very close to the true value. (This works exactly the same way that we could approximate the average payout of slot machines in a casino by playing 1000 random machines and then averaging the payouts.)
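A small simulation makes the “noisy but unbiased” claim concrete. This is only a sketch under made-up assumptions: the coin probabilities are drawn uniformly at random, the two guessers are simulated by adding small or large errors to the truth, and far more coins are used than the 100 in the text so that the noise is easy to see shrinking.

```python
import random

rng = random.Random(1)
NUM_COINS = 20000  # many more than 100, purely to reduce the noise

# True heads probabilities (unknown to the guessers in the real problem).
p_heads = [rng.random() for _ in range(NUM_COINS)]

def clip(x):
    """Keep a guess inside [0, 1]."""
    return min(1.0, max(0.0, x))

# Two hypothetical guessers: small errors and large errors.
good = [clip(p + rng.uniform(-0.05, 0.05)) for p in p_heads]
bad = [clip(p + rng.uniform(-0.30, 0.30)) for p in p_heads]

# One flip per coin: True means heads.
flips = [rng.random() < p for p in p_heads]

def true_cross_term(guess):
    """Average over coins of sum_rez p_c(rez) g_c(rez); needs the true p_c."""
    return sum(p * g + (1 - p) * (1 - g) for p, g in zip(p_heads, guess)) / NUM_COINS

def estimated_cross_term(guess):
    """Average over coins of g_c(rez_c), the guess for the observed outcome."""
    return sum(g if heads else 1 - g for g, heads in zip(guess, flips)) / NUM_COINS

for name, guess in [("good", good), ("bad", bad)]:
    print(name, true_cross_term(guess), estimated_cross_term(guess))
```

For any single coin the estimate is just one number between 0 and 1, but the averages over many coins land close to the quantities computed from the true probabilities.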

Happily, the final term,

\sum_\text{rez} g_{i,c}(\text{rez})^2,

can be computed exactly, since the guessers have provided g_{i,c}.
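Putting the pieces together gives a score we can actually compute. As a sketch (the symbols \hat{L}_i and C, the number of coins, are notation introduced here for convenience), dropping the term that doesn’t depend on the guesses and averaging over the coins gives

\hat{L}_i = \frac{1}{C} \sum_{c=1}^{C} \left[ -2 g_{i,c}(\text{rez}_c) + \sum_\text{rez} g_{i,c}(\text{rez})^2 \right].

For instance, a guess of 0.8 for heads scores -2(0.8) + 0.8^2 + 0.2^2 = -0.92 if the coin lands heads, and -2(0.2) + 0.8^2 + 0.2^2 = 0.28 if it lands tails.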


Now, let’s apply the above theory to the Oscar predictions.

Taking a zero for the empty entries, normalizing each predictor’s probabilities to sum to one within each category, and averaging over the six categories, we obtain the scores:

Quadratic loss:

538:     -0.6235
intrade: -0.7925

Remember, less is better, so this is a clear win for intrade.
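For readers who want to check the arithmetic, here is a minimal Python sketch of the quadratic score (this is not the Matlab code mentioned at the end of the post). It assumes the numbers are read off the table exactly as printed, with empty cells taken as zero and each column normalized within its category; the names `OSCARS`, `normalize` and `quadratic_loss` are made up for this illustration.

```python
# One tuple per category, nominees in table order:
# (Nate's numbers, Intrade's numbers, index of the actual winner).
# Empty table cells are entered as 0.
OSCARS = [
    ([.990, .010, 0, 0, 0], [.903, .040, .013, .080, .030], 0),  # Best Picture
    ([.997, .001, .001, 0, 0], [.900, .059, .050, 0, 0], 0),     # Best Director
    ([.676, .324, 0, 0, 0], [.850, .150, .044, 0, 0], 0),        # Lead Actress
    ([.711, .190, .059, .034, .005], [.700, .335, 0, .049, 0], 1),  # Lead Actor
    ([.510, .246, .116, .116, .012], [.190, .588, .199, 0, 0], 1),  # Supporting Actress
    ([.858, .050, .044, .036, .012], [.950, .050, 0, 0, 0], 0),     # Supporting Actor
]

def normalize(probs):
    """Rescale a list of probabilities so that it sums to one."""
    total = sum(probs)
    return [p / total for p in probs]

def quadratic_loss(probs, winner):
    """-2 * g(winner) + sum of squared guesses, after normalizing."""
    g = normalize(probs)
    return -2.0 * g[winner] + sum(x * x for x in g)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

nate_quadratic = mean(quadratic_loss(nate, w) for nate, _, w in OSCARS)
intrade_quadratic = mean(quadratic_loss(intrade, w) for _, intrade, w in OSCARS)
print(f"538:     {nate_quadratic:.4f}")
print(f"intrade: {intrade_quadratic:.4f}")
```

Under these assumptions the two averages should come out close to the figures quoted above.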


Another reasonable way to measure errors would be the KL-divergence between p_c and g_{i,c}:

\sum_\text{rez} p_c(\text{rez}) \log(p_c(\text{rez})/g_{i,c}(\text{rez}) ).

Again, dropping a constant term, we get the loss

\sum_\text{rez} -p_c(\text{rez}) \log g_{i,c}(\text{rez}).
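To spell out the step, the KL-divergence splits into two sums,

\sum_\text{rez} p_c(\text{rez}) \log p_c(\text{rez}) - \sum_\text{rez} p_c(\text{rez}) \log g_{i,c}(\text{rez}).

The first sum doesn’t involve the guess, so it is the constant being dropped. Assuming the same single-sample Monte-Carlo approximation as for the quadratic loss, the remaining sum can be estimated from the observed outcome alone:

\sum_\text{rez} -p_c(\text{rez}) \log g_{i,c}(\text{rez}) \approx -\log g_{i,c}(\text{rez}_c).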

For historical reasons, let’s call this the “conditional likelihood” loss.

Conditional likelihood loss:

538:     0.6032
intrade: 0.3699

Again, intrade looks much better. Then again, of course, maybe intrade was just lucky. Six predictions is not a large number to average over.
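The conditional likelihood scores can be checked the same way. The sketch below is again an illustration rather than the original Matlab code: it assumes the table is read as printed, empty cells are zero, each column is normalized within its category, and the loss for a category is estimated as minus the natural log of the normalized probability given to the winner. The names `OSCARS` and `log_loss` are made up here.

```python
import math

# One tuple per category, nominees in table order:
# (Nate's numbers, Intrade's numbers, index of the actual winner).
# Empty table cells are entered as 0.
OSCARS = [
    ([.990, .010, 0, 0, 0], [.903, .040, .013, .080, .030], 0),  # Best Picture
    ([.997, .001, .001, 0, 0], [.900, .059, .050, 0, 0], 0),     # Best Director
    ([.676, .324, 0, 0, 0], [.850, .150, .044, 0, 0], 0),        # Lead Actress
    ([.711, .190, .059, .034, .005], [.700, .335, 0, .049, 0], 1),  # Lead Actor
    ([.510, .246, .116, .116, .012], [.190, .588, .199, 0, 0], 1),  # Supporting Actress
    ([.858, .050, .044, .036, .012], [.950, .050, 0, 0, 0], 0),     # Supporting Actor
]

def log_loss(probs, winner):
    """Minus the log of the normalized probability assigned to the winner."""
    return -math.log(probs[winner] / sum(probs))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

nate_log = mean(log_loss(nate, w) for nate, _, w in OSCARS)
intrade_log = mean(log_loss(intrade, w) for _, intrade, w in OSCARS)
print(f"538:     {nate_log:.4f}")
print(f"intrade: {intrade_log:.4f}")
```

Note that this loss is infinite whenever a predictor gives the eventual winner a probability of zero, which is one reason it punishes overconfidence more harshly than the quadratic loss.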

I’d love to see this type of analysis applied to the state-by-state results of the 2008 elections.

Matlab code (probably also works with Octave) is here.

11 thoughts on “Prediction markets, monte carlo, and loss functions”

  1. To be honest, I can’t bear the thought of drudging through the website to pull out the final predictions. If you can give me the numbers for HubDub like I have above for 538 and intrade, I’d be happy to run the evaluation on those as well.

  2. Well, to compute the quadratic loss, I would actually need to know how Hubdub apportioned probability between the different losing entries in each category. I can calculate the conditional likelihood loss, though, and at 0.2237, Hubdub beats both intrade and 538.

  3. Justin,

    Perhaps you may want to clarify that these metrics do not measure the internal consistency of the reported confidences.

    So, if exchange A gets 10000/10000 predictions right (with all contract prices at 0.51) and exchange B gets 5100/10000 predictions right (with all contract prices at 0.51), then exchange A will have better conditional and quadratic loss than B. And this, despite the fact that exchange A succeeded much more often than it should.

  4. Panos,

    That’s an extremely interesting point. I wouldn’t go so far as to say that these metrics don’t measure internal consistency *at all*. All else being equal, internal consistency is good. If you can predict 10000/10000, obviously, you can improve your loss by raising your confidences to 100%. Another way of saying this is that if you are able to drive either of these losses to their minimum (by actually producing the true probabilities, so that g_{i,c} = p_c), you will also end up with consistent predictions.

    I guess there is some sort of a trade-off between measuring internal consistency and prediction accuracy: I can just always predict p(\text{heads})=.5 and, assuming coins are unbiased overall, I will look internally consistent, even though my predictions are horrible. I guess the quadratic loss and conditional likelihood lie somewhere in the middle of this trade-off.

  5. Well, if you estimate P(heads)=0.5 and indeed this is the probability of heads, then your estimation is absolutely fine.

    We have two things to measure: (a) how much better than random the predictions are, i.e. their “predictive power” (accuracy, loss functions, etc.), and (b) how consistent the confidence values are for these predictions.

    Most of the papers focus on (a), often ignoring (b). But for prediction markets (where prices are supposed to estimate probabilities) it is important to examine rigorously how good the estimates themselves are, not only how good the predictions were.

  6. I disagree that the quadratic loss and conditional likelihood only measure (a). The quadratic loss tries to measure

    \sum_r (p(r)-p_0(r))^2,

    which seems to me a very reasonable mix of predictive power and consistency.

  7. The next interesting question would be: what is the likelihood that guesser A’s lower value compared to B is not just due to pure luck?

    In other words,
    if guesser A gets value X and guesser B gets value Y (e.g. from the quadratic loss calculation), with N events, what is the likelihood that A really was the better forecaster rather than just luckier? How could that be calculated?

