The human regression ensemble

I sometimes worry that people credit machine learning with magical powers. Friends from other fields often show me little datasets. Maybe they measured the concentration of a protein in some cell line for the last few days and they want to know what it will be tomorrow.

| Day       | Concentration |
|-----------|---------------|
| Monday    | 1.32          |
| Tuesday   | 1.51          |
| Wednesday | 1.82          |
| Thursday  | 2.27          |
| Friday    | 2.51          |
| Saturday  | ???           |

Sure, you can use a fancy algorithm for this, but I usually recommend just staring hard at the data, using your intuition, and making a guess. My friends respond with horror—you can’t just throw out predictions, that’s illegal! They want to use a rigorous method with guarantees.

Now, it’s true we have methods with guarantees, but those guarantees are often a bit of a mirage. For example, you can do a linear regression and get a confidence interval for the regression coefficients. That’s fine, but you’re assuming (1) the true relationship is linear, (2) the data are independent, (3) the noise is Gaussian, and (4) the magnitude of noise is constant. If those (unverifiable) assumptions aren’t true, your guarantees don’t hold.
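To make this concrete, here's a minimal sketch of the kind of "guaranteed" interval people have in mind, fit to the made-up protein concentrations from the table above (using SciPy rather than any particular stats package the reader might prefer). The interval is only as good as assumptions (1)–(4):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # day index, Monday..Friday
y = np.array([1.32, 1.51, 1.82, 2.27, 2.51])  # concentrations from the table

res = stats.linregress(x, y)               # ordinary least squares fit
t = stats.t.ppf(0.975, df=len(x) - 2)      # two-sided 95% critical value
lo, hi = res.slope - t * res.stderr, res.slope + t * res.stderr

# This 95% CI for the slope is only valid under linearity, independence,
# Gaussian noise, and constant noise magnitude.
print(f"slope = {res.slope:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

If any of those assumptions fail, the interval still prints out just as confidently—it just stops meaning what you think it means.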

But is it really true that humans can do as well as algorithms for simple tasks? Let’s test this.

What I did

1. I defined four simple one-dimensional regression problems using common datasets. For each of those problems, I split the data into a training set and a test set. Here’s what that looks like for the `boston` dataset:

2. I took the training points, and plotted them to a .pdf file as black dots, with four red dots for registration.

In each .pdf file there were 25 identical copies of the training data like the one above.

3. I transferred that .pdf file to my tablet. On the tablet, I hand-drew 25 curves that I felt were all plausible fits of the data.

4. I transferred the labeled .pdf back to my computer, and wrote some simple image processing code that would read in all of the lines and average them. I then used this average to predict the test data.

5. As a comparison, I made predictions for the test data using six standard regression methods: ridge regression (Ridge), locally-weighted regression (LOWESS), Gaussian process regression (GPR), random forests (RF), neural networks (MLP), and K-nearest neighbors (K-NN). More details about all these methods are below.

6. To measure error, I computed the root mean squared error (RMSE) and the mean absolute error (MAE).
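Steps 4–6 can be sketched in a few lines. The curve extraction from the .pdf is assumed already done; `curves` below is a hypothetical (25, n_grid) array of y-values sampled on a common x-grid, and the data are synthetic stand-ins rather than any of the actual datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 100)
# Stand-in for 25 hand-drawn curves: a common shape plus per-curve wobble.
curves = np.sin(3 * x_grid) + 0.1 * rng.standard_normal((25, 100))

# Step 4: average the curves -- the "human regression ensemble".
ensemble = curves.mean(axis=0)

# Predict at test x-values by interpolating the averaged curve.
x_test = np.array([0.15, 0.5, 0.85])
y_test = np.sin(3 * x_test)          # stand-in for the true test targets
y_pred = np.interp(x_test, x_grid, ensemble)

# Step 6: the two error metrics.
rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))
mae = np.mean(np.abs(y_pred - y_test))
print(f"RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```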

To make sure the results were fair, I committed myself to drawing the curves for each dataset just once, and never touching them again, even if I did something that seems stupid in retrospect—which, as you’ll see below, I did.

On the other hand, I had to do some tinkering with all the machine learning methods to get reasonable results, e.g. changing how the neural networks were optimized, or which hyperparameters to cross-validate over. This might create some bias, but if it does, it’s in favor of the machine learning methods and against me.

Results

For the `boston` dataset, I used the crime variable for the x-axis and the house value variable for the y-axis. Here are all the lines I drew on top of each other:

And here are the results comparing to the machine learning algorithms:

Here are the results for the `diabetes` dataset. I used age for the x-axis and disease progression for the y-axis. (I don’t think I did a great job drawing curves for this one.)

Here are the results for the `iris` dataset, using sepal length for the x-axis and petal width for the y-axis.

And finally, here are the results for the `wine` dataset, using malic acid for the x-axis and alcohol for the y-axis.

I tend to think I under-reacted a bit to the spike of data with x around 0.2 and large y values. At the time, I thought it didn’t make sense to have a non-monotonic relationship between malic acid and alcohol. In retrospect, however, it could easily be real, e.g. because it’s a cluster of one type of wine.

Summary of results

Here’s a summary of the RMSE for all datasets:

| Method           | Boston | Diabetes | Iris | Wine |
|------------------|--------|----------|------|------|
| Ridge            | .178   | .227     | .189 | .211 |
| LOWESS           | .178   | .229     | .182 | .212 |
| Gaussian Process | .177   | .226     | .184 | .204 |
| Random Forests   | .192   | .226     | .192 | .200 |
| Neural Nets      | .177   | .225     | .185 | .211 |
| K-NN             | .178   | .232     | .186 | .202 |
| justin           | .178   | .230     | .181 | .204 |

And here’s a summary of the MAE:

| Method           | Boston | Diabetes | Iris | Wine |
|------------------|--------|----------|------|------|
| Ridge            | .133   | .191     | .150 | .180 |
| LOWESS           | .134   | .194     | .136 | .180 |
| Gaussian Process | .131   | .190     | .139 | .170 |
| Random Forests   | .136   | .190     | .139 | .162 |
| Neural Nets      | .131   | .190     | .139 | .179 |
| K-NN             | .129   | .196     | .137 | .165 |
| justin           | .121   | .194     | .138 | .171 |

Honestly, I’m a little surprised how well I did here—I expected that I’d do OK but that some algorithm (probably LOWESS, still inexplicably not in base scikit-learn) would win in most cases.

I’ve been doing machine learning for years, but I’ve never run a "human regression ensemble" before. With practice, I’m sure I’d get better at drawing these lines, but I’m not going to get any better at applying machine learning methods.

I didn’t do anything particularly clever in setting up these machine learning methods, but it wasn’t entirely trivial (see below). A random person in the world is probably more likely than I was to make a mistake when running a machine learning method, but just as good at drawing curves. This is an extremely robust way to predict.

What’s the point of this? It’s just that machine learning isn’t magic. For simple problems, it doesn’t fundamentally give you anything better than you can get just from common sense.

Machine learning is still useful, of course. For one thing, it can be automated. (Drawing many curves is tedious…) And with much larger datasets, machine learning will—I assume—beat any manual predictions. The point is just that in those cases it’s an elaboration on common sense, not some magical pixie dust.

Details on the regression methods

Here were the machine learning algorithms I used:

1. Ridge: Linear regression with squared l2-norm regularization.
2. LOWESS: Locally-weighted regression.
3. GPR: Gaussian process regression with an RBF kernel.
4. RF: Random forests.
5. MLP: A single-hidden-layer neural network / multi-layer perceptron with tanh nonlinearities, optimized by (non-stochastic) L-BFGS with 50,000 iterations.
6. KNN: K-nearest neighbors.

For all the methods other than Gaussian processes, I used 5-fold cross-validation to tune the key hyperparameter. The options I used were:

1. Ridge: Regularization penalty of λ = .001, .01, .1, 1, or 10.
2. LOWESS: Bandwidth of σ = .001, .01, .1, 1, or 10.
3. Random forests: Minimum samples in each leaf of n = 1, 2, …, 19.
4. Multi-layer perceptrons: 1, 5, 10, 20, 50, or 100 hidden units, with α = .01 regularization.
5. K-nearest neighbors: K = 1, 2, …, 19 neighbors.
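As an illustration, here's roughly what that 5-fold cross-validation setup looks like in scikit-learn for one of the methods (K-NN, with the K = 1, …, 19 grid above). The data here are synthetic stand-ins, not one of the actual datasets:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))               # 1-D inputs
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(100)

# 5-fold CV over K = 1, 2, ..., 19, scored by RMSE.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": range(1, 20)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

The other methods follow the same pattern, just with their own estimator class and grid.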

For Gaussian processes, I did not use cross-validation, but rather scikit-learn’s built-in hyperparameter optimization. In particular, I used the magical incantation `kernel = ConstantKernel(1.0,(.1,10)) + ConstantKernel(1.0,(.1,10)) * RBF(10,(.1,100)) + WhiteKernel(5,(.5,50))` which I understand means the system optimizes the kernel parameters to maximize the marginal likelihood.
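Spelled out in context, that incantation looks something like the following sketch (again with synthetic stand-in data). The bounds in each kernel term constrain the search, and `.fit()` maximizes the marginal likelihood over the kernel hyperparameters within them:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

# The kernel from the post: constant mean offset + scaled RBF + white noise,
# each with (lower, upper) bounds for the hyperparameter search.
kernel = (ConstantKernel(1.0, (.1, 10))
          + ConstantKernel(1.0, (.1, 10)) * RBF(10, (.1, 100))
          + WhiteKernel(5, (.5, 50)))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(50)

gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(X, y)                # optimizes kernel hyperparameters internally
print(gpr.kernel_)           # the fitted kernel, with optimized parameters
```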