Prediction markets, monte carlo, and loss functions

There has been some discussion lately about how to evaluate the performance of different prediction markets (like Intrade), and predictors (like Nate Silver) at guessing the winners of elections, or Oscars. Who is making the best predictions? If everyone simply made a guess for the winner of each state or award, we could evaluate performance easily: whoever guesses the most outcomes correctly is making the best predictions. But what do we do if the predictors provide us full probabilities of the different outcomes? Intuitively, someone who gives 99% probability of an event that doesn’t occur is much “more wrong” than someone who gives only a 51% probability.

                        Nate    Intrade     Winner
Best Picture
Slumdog                .990      .903        X
Milk                   .010      .040
Frost/Nixon                      .013
Benjamin Button                  .080
The Reader                       .030

Best Director
Danny Boyle            .997      .900        X
Gus Van Sant           .001      .059
David Fincher          .001      .050
Ron Howard              
Steven Daldry           

Lead Actress
Kate Winslet           .676      .850        X
Meryl Streep           .324      .150
Anne Hathaway                    .044
Melissa Leo            
Angelina Jolie         

Lead Actor
Mickey Rourke          .711      .700
Sean Penn              .190      .335        X
Brad Pitt              .059
Frank Langella         .034      .049
Richard Jenkins        .005

Supporting Actress
Taraji P. Henson       .510      .190
Penélope Cruz          .246      .588        X
Viola Davis            .116      .199
Amy Adams              .116
Marisa Tomei           .012

Supporting Actor
Heath Ledger           .858      .950        X
Josh Brolin            .050      .050
Philip Seymour Hoffman .044
Michael Shannon        .036
Robert Downey Jr.      .012

Let us think about this situation from the perspective of “bent coin predictors”. Let’s say we have a pool of 100 bent coins, each of which has some unknown probability p_c(\text{heads}) of ending up heads. We have a number of people who reckon they can estimate that probability by looking at the coin. Denote the prediction of guesser i for coin c by g_{i,c}(\text{heads}) After predictions have been made, we flip all the coins. Now, how do we find the best guesser?

What we want, is to measure how close g_{i,c}(\text{heads}) is to p_c(\text{heads}). Since we don’t know p_i(\text{heads}), this seems impossible. In some sense, it is impossible, but let us fantasize for a moment. Suppose that instead of all the coin flips, someone actually revealed all the true probabilities p_c(\text{heads}) to us. Then, what would we do? There is no single best answer. One reasonable way to measure the quality of the guess would be the sum of squares difference

(p_c(\text{heads}) - g_{i,c}(\text{heads}))^2 + (p_c(\text{tails}) - g_{i,c}(\text{tails}))^2

or, equivalently,

\sum_\text{\text{rez}} (p_c(\text{rez}) - g_{i,c}(\text{rez}))^2.


Now, of course, we can’t calculate either of the above quantities. We only have a single result from flipping each coin. The central idea here is that we can use what is known as a Monte-Carlo approximation. This is a very simple idea. Suppose we would like to calculate the expected value of some function f.

\sum_x p(x) f(x)

Now, suppose that we don’t know p, but we can simulate p. That is, by running some sort of experiment, we can get some random value x_n, whose probability is p(x). If we draw many such values, then we can approximate the above by:

It would be interesting to look at how accurate various predictors were for the 2008 election from this perspective.

\sum_n 1/N f(x_n)

As an example, suppose we want to know the average amount that a slot machine pays out. We could approximate this by playing the machine 1000 times, and calculating the average observed payout.


How can we apply loss functions to prediction markets? Notice that we can make the following simplification

\sum_\text{rez} (p_c(\text{rez}) - g_{i,c}(\text{rez}))^2 =

\sum_\text{rez} p_c(\text{rez})^2 - 2 \sum_\text{rez} p_c(\text{rez}) g_{i,c}(\text{rez}) + \sum_\text{rez} g_{i,c}(\text{rez})^2.

The first term, \sum_\text{rez} p_c(\text{rez})^2 is hard to estimate, but fortunately we don’t need to, because it doesn’t depend on the predictions. If we ignore this term for all guessers, we still have a valid relative rating of the guessers.

As for the second term, we can apply the Monte-Carlo technique from above. Let \text{rez}_c denote the outcome of flipping coin c. We make the approximation

\sum_\text{rez} p_c(\text{rez}) g_{i,c}(\text{rez}) \approx g_{i,c}(\text{rez}_c).

This is a very noisy approximation. However, it is unbiased: If we average over many coins, we will get something very close to the true value. (This works exactly the same way that we could approximate the average payout of slot machines in a casino by playing 1000 random machines and then averaging the payouts.)

Happily, the final term,

\sum_\text{rez} g_{i,c}(\text{rez})^2,

can be computed exactly, since the guessers have provided g_{i,c}.


Now, let’s apply the above theory to the Oscar predictions.

Taking a zero for the empty entries, and normalizing each prediction, we obtain the scores:

Quadratic loss:

538:     -0.6235
intrade: -0.7925

Remember, less is better, so this is a clear win for intrade.


Another reasonable way to measure errors would be the KL-divergence between p_c and g_{i,c}

\sum_\text{rez} p_c(\text{rez}) \log(p_c(\text{rez})/g_{i,c}(\text{rez}) ).

Again, dropping a constant term, we get the loss

\sum_\text{rez} -p_c(\text{rez}) \log g_{i,c}(\text{rez}).

For historical reasons, let’s call this this “conditional likelihood” loss.

Conditional likelihood loss:

538:     0.6032
intrade: 0.3699

Again, intrade looks much better. Then again, of course, maybe intrade was just lucky. Five predictions isn’t a large number to average over.

I’d love to see this type of analysis applied to the state-by-state results of the 2008 elections.

Matlab code (probably also works with Octave) is here.


The Stalin Compiler

Stalin is a questionably named Scheme compiler written by Jeffrey Mark Siskind that can supposedly create binaries as fast or faster than Fortran or C for numerical problems. To test this, I tried creating a simple program to numerically integrate \sqrt{x} from 0 to 10000. To make things interesting, I used a manual Newton’s method implementation of sqrt from SICP.  The integration is done by a simple tail-recursive method.

The scheme code is very pretty:

(define (sqrt-iter guess x)
  (if (good-enough? guess x)
      (sqrt-iter (improve guess x)

(define (improve guess x)
  (average guess (/ x guess)))

(define (average x y)
  (/ (+ x y) 2))

(define (good-enough? guess x)
  (< (abs (- (* guess guess) x)) 0.001))

(define (mysqrt x)
  (sqrt-iter 1.0 x))

(define (int x acc step)
  (if (>= x 10000.0)
      (int (+ x step)
           (+ acc (* step (mysqrt x)))

(write (int 0.0 0.0 .001))

I then converted this to C. It is pretty much a transliteration, except it uses a for loop instead of recursion to accumulate the values:

#include "stdio.h"

double improve(double guess, double x);
double average(double x, double y);

double sqrt_iter(double guess, double x){
  if( good_enough(guess, x))
    return guess;
    return sqrt_iter( improve(guess,x), x);

double improve(double guess, double x){
  return average(guess, x/guess);

double average(double x, double y){
  return (x+y)/2;

int good_enough(double guess, double x){
  if (abs(guess*guess-x)<.001)
    return 1;
  return 0;

double mysqrt(double x){
  return sqrt_iter(1.0, x);

  double rez = 0;
  double x;
  double step = .001;
  for(x=0; x<= 10000; x+=step)
    rez += mysqrt(x)*step;
  printf("%f\n", rez);

I compiled the two methods via:

stalin -On -d -copt -O3
gcc -O3 int.c

The results are:
Stalin: 1.90s
gcc: 3.61s

If you declare every method inline in C, you get:

gcc-inline: 3.28s

For reference, I also compiled the code with chicken scheme, using the -optimize-level 3 compiler flag.

Chicken Scheme: 27.9s.

Some issues:

  • The Scheme code is more appealing on the whole, though it suffers from a lack of infix notation: (< (abs (- (* guess guess) x)) 0.001) is definitely harder to read than abs(guess*guess-x)<.001.  I wonder if more familiarity with prefix code would reduce this.
  • All three methods give slightly different results, which is worrisome.  In particular, Stalin is correct to 6 digits, whilst gcc and chicken are correct to 7.
  • Stalin apparently does not include macros.  One would think it would be easy to use the macro system from a different scheme compiler, and then send the source to stalin for final compilation.
  • Stalin is extremely slow to compile.  In principle this isn’t a big deal: you can debug using a different scheme compiler.  Still, Stalin seems to be somewhat less robust to edge cases, than at least chicken scheme.
  • It is amazing that Scheme code with no type declarations can beat C by almost a factor of 2.
  • Though in principle Stalin produces intermediate c code, it is utterly alien and low-level.  I have not been able to determine exactly what options Stalin is using when it calls gcc on the source code.  That could account for some of the difference.
  • A detailed writeup on Stalin is here.

Automatic Differentiation: The most criminally underused tool in the potential machine learning toolbox?

Update: (November 2015) In the almost seven years since writing this, there has been an explosion of great tools for automatic differentiation and a corresponding upsurge in its use. Thus, happily, this post is more or less obsolete.

I recently got back reviews of a paper in which I used automatic differentiation.  Therein, a reviewer clearly thought I was using finite difference, or “numerical” differentiation. This has led me to wondering: Why don’t machine learning people use automatic differentiation more?  Why don’t they use it…constantly? Before recklessly speculating on the answer, let me briefly review what automatic differentiation (henceforth “autodiff”) is. Specifically, I will be talking about reverse-mode autodiff.

(Here, I will use “subroutine” to mean a function in a computer programming language, and “function” to mean a mathematical function.)

It works like this:

  1. You write a subroutine to compute a function f({\bf x}).  (e.g. in C++ or Fortran).  You know f to be differentiable, but don’t feel like writing a subroutine to compute \nabla f.
  2. You point some autodiff software at your subroutine.  It produces a subroutine to compute the gradient.
  3. That new subroutine has the same complexity as the original function!
    1. It does not depend on the dimensionality of \bf x.
  4. It also does not suffer from round-off errors!

To take a specific example, suppose that you have written a subroutine to evaluate a neural network.  If you perform autodiff on that subroutine, it will produce code that is equivalent to the backpropagation algorithm. (Incidentally, autodiff significantly predates the invention of backprop).

Why should autodiff be possible?  It is embarasingly obvious in retrospect:  Complex subroutines consist of many elementary operations.  Each of those is differentiable.  Apply the calc 101 chain rule to the expression graph of all these operations.

A lot of papers in machine learning (including many papers I like) go like this:

  1. Invent a new type of model or a new loss function
  2. Manually crank out the derivatives.  (The main technical difficulty of the paper.)
  3. Learn by plugging the derivatives into an optimization procedure.  (Usually L-BFGS or stochastic gradient.)
  4. Experimental results.

It is bizarre that the main technical contribution of so many papers seems to be something that computers can do for us automatically.  We would be better off just considering autodiff part of the optimization procedure, and directly plugging in the objective function from step 1.  In my opinion, this is actually harmful to the field.  Before discussing why, let’s consider why autodiff is so little used:

  1. People don’t know about it.
  2. People know about it, but don’t use it because they want to pad their papers with technically formidable derivations of gradients.
  3. People know about it, but don’t use it for some valid reasons I’m not aware of.

I can’t comment on (3) by definition.  I think the answer is (1), though I’m not sure.  Part of the problem is that “automatic differentiation” sounds like something you know, even if you actually have no idea what it is.  I sometimes get funny looks from people when I claim that both of the following are true.

\text{autodiff}\neq\text{symbolic diff.}

\text{autodiff}\neq\text{numerical diff.}.

Further evidence for (1) is that many papers, after deriving their gradient, proclaim that “this algorithm for computing the gradient is the same order of complexity as evaluating the objective function,” seeming to imply it is fortunate that a fast gradient algorithm exists.

Now, why bother complaining about this?  Are those manual derivatives actually hurting anything?  A minor reason is that it is distracting.  Important ideas get obscured by the details, and valuable paper space (often a lot of space) is taken up for the gradient derivation rather than more productive uses.  Similarly, valuable researcher time is wasted.

In my view a more significant downside is that the habit of manually deriving gradients restricts the community to only using computational structures we are capable of manually deriving gradients for.  This is a huge restriction on the kinds of computational machinery that could be used, and the kinds of problems that could be addressed.

As a simple example, consider learning parameters for some type of image filtering algorithm.  Imaging problems suffer from boundary problems.  It is not possible to compute a filter response at the edge of an image.  A common trick is to “flip” the image over the boundary to provide the nonexistent measurements.  This poses no difficulty at all to autodiff, but borders on impossible to account for analytically.  Hence, I suspect, people simply refrain from researching these types of algorithms.  That is a real cost.

For C++ users, it is probably easiest to get started with David Gay‘s RAD toolbox.  (See also the paper.)


  • Often, derivatives are needed for analysis, not just to plug into an optimization routine.  Of course, these derivatives should always remain.
  • When derivatives are reasonably simple, it can be informative to see the equations.  Oftentimes, however, an algorithm is derived for computing the gradient.  In this cases, it would almost always be easier for someone trying to implement the method to install an autodiff tool than try to implement the gradient algorithm.

Update: Answers to some questions:

Q) What’s the difference between autodiff and symbolic diff?

A) They are totally different. The biggest difference is that autodiff can differentiate algorithms, not just expressions. Consider the following code:

function f(x)
  y = x;
  for i=1...100
    y = sin(x+y);
  return y

Automatic differentiation can differentiate that, easily, in the same time as the original code. Symbolic differentiation would lead to a huge expression that would take much more time to compute.

Q) What about non-differentiable functions?

A) No problem, as long as the function is differentiable at the place you try to compute the gradient.

Q) Why don’t you care about convexity?

A) I do care about convexity, of course.