The second and third best features of LyX you aren’t using.

LyX is a WYSIWYG editor for latex files. It’s a little bit clunky to use at first, and isn’t perfect (thank you, open source developers– I’m not ungrateful!) but after becoming familiar with it, it’s probably the single piece of software that has most improved my productivity. I like it so much I use it not just for papers but often as a scratchpad for math. Many times during random discussions I’ve used it to quickly bang out some equations and after seeing how fast it was, others immediately switched to using it.

Anyway, while macros may be the best feature you LyX aren’t using, I recently discovered another couple excellent ones I wasn’t familiar with after years of use so I thought I’d publicize. Specifically, I’ve always hated the process of including explanatory figures into LyX. Exporting plots from an experiment is tricky to improve, but when trying to create explanatory graphs, I’ve always hated the process.

Before, I thought the options were.

1) Create the graph in an external program. This is fine, of course, but is quite inconvenient when you want to go back and revise it. The external program usually saves it in some other format, so you have to open the graphic in open it again, revise it, export it to .pdf (or whatever), then open the document in LyX, compile it. Then, when you don’t like the way it looks, you have to repeat the whole process. It works, but it’s not efficient, since you can’t edit the content in place. (Which is the whole point of using a WYSIWYG editor in the first place– remove the need for thinking about anything but content.)

2) Write the graphics directly in LyX in a language like TikZ. This is more “in-place” in that you don’t have external files to find and manipulate. However, I find TikZ to be quite painful to get right with many re-compilations necessary. If the TikZ document is in place each requires a full compilation of the document. This is hilariously slow when making something like Beamer slides. Further, this totally violates the whole point of WYSIWYG since you’re looking at code, rather than the output.

There are better ways! I’ve wasted countless hours not being aware of these.

Preview Boxes

First, LyX has a beautiful feature of “preview boxes”. Take the following very simple TikZ code, which just draws a square:

\draw[red] (0,0) -- (0,1);
\draw[green] (0,1) -- (1,1);
\draw[red] (1,1) -- (1,0);
\draw[blue] (1,0) -- (0,0);

Typically, I’d include this in LyX files by inserting a raw tex box:

Screen Shot 2017-04-09 at 9.52.44 AM.png

And then putting the the TikZ code inside:

Screen Shot 2017-04-09 at 9.54.23 AM.png

This is OK, but has the disadvantages from (2) above. The code can be huge, if I have a lot of graphics, I can’t tell what corresponds to what, and I have to do a (slooooooow) recompile of the whole document to see what it looks like.

However, if you just add a “preview box”:

Screen Shot 2017-04-09 at 9.56.13 AM

You get something that looks like this:

Screen Shot 2017-04-09 at 9.56.33 AM.png

So far, so pointless, right? However, when you deselect, LyX shows the graphic in-place:

Screen Shot 2017-04-09 at 9.57.50 AM

You can then click on it to expand the code. This solves most of the problems: You can see what you are doing at a glance, and you don’t need to recompile the whole document to do it.

Native SVG support

Newer versions of LyX also natively support SVG files. You first have to create the file externally using something like Inkscape, which itself saves directly to the SVG format. Then, you can include it in LyX by doing Insert->File->External Material:

Screen Shot 2017-04-09 at 10.02.31 AM.png

And then selecting the SVG file:

Screen Shot 2017-04-09 at 10.03.14 AM

Again, LyX will show it in-place and (if LyX is configured correctly…) correctly output vector graphics in the final document.

Screen Shot 2017-04-09 at 10.05.15 AM

What’s even better is that LyX can automatically open the file in the external editor for you. If you right click, you can “edit externally”:

Screen Shot 2017-04-09 at 10.06.11 AM

Then the external editor will automatically open the file. You can then save it with a keystroke and go back to LyX. No hunting around for the file, no cycles of exporting to other formats, and you see exactly what the final output will look like at all stages. You can really tell that LyX was created by people using it themselves.

Bonus feature

This one is described well-enough already, but helps a lot in big documents: you can click on a point in a generated .pdf and automatically have LyX sync the editor to the corresponding point in the file.

Posted in Uncategorized | Tagged , , , , , | Leave a comment

You deserve better than two-sided finite differences

In calc 101, the derivative is derived as df/dx = \lim_{\epsilon \rightarrow 0} (f(x+\epsilon)-f(x))/\epsilon. So, if you want to estimate a derivative, an easy way to do so would be to just pick some small \epsilon and estimate:

df/dx \approx \frac{1}{\epsilon} (f(x+\epsilon)-f(x))

This can work OK. Let’s look at an example of trying to calculate the derivative of \log (x), using a range of different \epsilon


What’s happening? Well, the result is true mathematically in the limit that \epsilon is small, so it’s natural to get errors for large \epsilon. However, with very small \epsilon you run into trouble because floating-point arithmetic can only represent finite precision. Let’s try again with a smaller value of x=.001


That’s somewhat concerning. We can still get a nearly correct value, but we have a limited range of steps that achieve it. This is very well-known in numerical analysis, and a common solution is to use two-sided differences, i.e. to estimate

df/dx \approx \frac{1}{2 \epsilon} (f(x+\epsilon)-f(x-\epsilon)).

If we try that, we indeed get better results:


What’s happening here? Basically, we are using more information about the function to make a higher-order approximation, so we can mathematically get away with using a larger value of \epsilon. This in turn isis helpful to avoid the numerical precision demons.

Great! But let’s go deeper. If we use x = 10^{-5}, we get:


Huh. 1-sided differences totally fall apart, but we seem to be running into more trouble. But never fear, you can use “four-sided differences”!

df/dx \approx \frac{1}{12 \epsilon} (-f(x+2\epsilon)+8 f(x+\epsilon)-8 f(x+\epsilon)+ f(x-2 \epsilon)

Then, we get what you might expect:


But what if we go deeper, with x=10^{-7}?log_1e-07_4sided

Or maybe we should go even deeper, with x = 10^{-9}?


Even the four-sided differences have failed us. Now, you might take a lesson here that you shouldn’t be using numerical differences (and I sort of agree). Those of certain temperament, on the other hand, would say instead that what we need is power. OK then, how do six-sided differences sound to you? Sound good?

df/dx \approx \frac{1}{60 \epsilon}(- f(x-3 \epsilon)+9 f(x-2 \epsilon)-45 f(x-\epsilon)+45 f(x+\epsilon)-9 f(x+2\epsilon)+f(x+3 \epsilon))


This is still tough, but there is at least some range of epsilon where you can calculate a reasonable derivative. There is, of course, a never-ended sequence of these higher-order derivative approximations. There’s even a calculator, for all your bespoke finite-difference stencil needs.

Now, you might think that the real solution here is to use automatic differentiation, and you’re mostly right. It takes more computation to sample the function at more points, and no number of samples will fundamentally stop the numerical demons from destroying your result.

However, there still remain cases where it’s worthwhile to manually compute a derivative. More importantly perhaps, when you’re implementing an automatic differentiation tool, you still need to test that it is actually correct! Here, numerically differences will probably forever remain useful. So, certainly for the cases of building a test suite for autodiff code, it certainly makes sense to use these higher-order derivatives.

(I originally intended to leave this as a comment on Tim Viera’s post on testing gradient implementations, but this kinda got out of control.)

Posted in Uncategorized | Tagged , | 1 Comment

Sneaking up on Bayesian Inference (A fable in four acts)


Act 1: Magical Monkeys

Two monkeys, Alfred ({a}) and Betty ({b}) live in a parallel universe with two kinds of blocks, green ({g}) and yellow ({y}). Alfred likes green blocks, and Betty prefers the yellow blocks. One day, a Wizard decides to give one of the monkeys the magical power to send one block over to our universe each day.

The Wizard chooses the magical monkey by flipping a fair four-sided die. He casts a spell on Alfred if the outcome is 1, and Betty if the outcome is {2-4}. That is, the Wizard chooses Alfred with probability {.25} and Betty with probability {.75}. Both Alfred and Betty send their favorite colored block with probability {.8}

After the Wizard has chosen, we see the first magical block, and it is green. Our problem is: What is the probability that Alfred is the magical monkey?

Intuitively speaking, we have two somewhat contradictory pieces of information. We know that the Wizard chooses Betty more often. But green is Alfred’s preferred color. Given probabilities above, is Alfred or Betty more probable?

First, we can write down all the probabilities. These are

\displaystyle \begin{aligned}Pr(M=a) & =.25\\ Pr(M=b) & =.75\\ Pr(B=g\vert M=a) & =.8\\ Pr(B=y\vert M=a) & =.2\\ Pr(B=g\vert M=b) & =.2\\ Pr(B=y\vert M=b) & =.8. \end{aligned}

Now, the quantity we are interested in is {Pr(M=a\vert B=g).} We can mechanically apply Bayes’ rule to obtain

\displaystyle \begin{aligned}Pr(M=a\vert B=g) & =\frac{1}{Pr(B=g)}Pr(M=a)Pr(B=g\vert M=a)\\ & =\frac{1}{Pr(B=g)}.25\times.8. \end{aligned}

Similarly, we can calculate that

\displaystyle Pr(M=b\vert B=g)=\frac{1}{Pr(B=g)}.75\times.2.

But, of course, we know that one of the monkeys is magical, so these quantities must sum to one. Thus (since they both involve the quantity {Pr(B=g)} that we haven’t explicitly calculated) we can normalize to obtain that

\displaystyle Pr(M=a\vert B=g)=\frac{.25\times.8}{.25\times.8+.75\times.2}=\frac{4}{7}.

Act 2: Magical Monkeys on Multiple Days

We wait around two more days, and two more blocks appear, both of which are yellow. Assume that the monkeys make an independent choice each day to send their favorite or less favorite block. Now, what is the probability that Alfred is the magical monkey, given that {B=gyy}?

Intuitively speaking, what do we expect? Betty is more likely to be chosen, and 2/3 of the blocks we’ve seen are suggestive of Betty (since she prefers yellow).

Now, we can mechanically calculate the probabilities. This is just like before, except we use the fact that the blocks seen on each day are conditionally independent given a particular monkey.

\displaystyle \begin{aligned}Pr(M=a\vert B=gyy) & =\frac{1}{Pr(B=gyy)}Pr(M=a)Pr(B=gyy\vert M=a)\\ & =\frac{1}{Pr(B=gyy)}Pr(M=a)Pr(B=g\vert M=a)Pr(B=y\vert M=a)Pr(B=y\vert M=a)\\ & =\frac{1}{Pr(B=gyy)}.25\times.8\times.2\times.2\\ Pr(M=b\vert B=gyy) & =\frac{1}{Pr(B=gyy)}Pr(M=b)Pr(B=g\vert M=b)Pr(B=y\vert M=b)Pr(B=y\vert M=b)\\ & =\frac{1}{Pr(B=ggy)}.75\times.2\times.8\times.8 \end{aligned}

Again, we know that these two probabilities sum to one. Thus, we can normalize and get that

\displaystyle Pr(M=a\vert B=gyy)=\frac{.25\times.8\times.2\times.2}{.25\times.8\times.2\times.2+.75\times.2\times.8\times.8}=\frac{1}{13}.

So now, Betty looks more likely by far.

Act 3: Magical Monkeys and the Weather

Now, suppose the Wizard gives us some additional information. The magical monkey (whilst relaxing over in the other universe) is able to perceive our weather, which is either clear {(c)} or rainy {(r)}. Both the monkeys prefer clear weather. When the weather is clear, they send their preferred block with probability {.8}, while if the weather is rainy, they angrily send their preferred block with probability {.2.} That is, we have

\displaystyle \begin{aligned}Pr(B=g\vert M=a,W=c) & =.8\\ Pr(B=g\vert M=a,W=r) & =.2\\ Pr(B=g\vert M=b,W=c) & =.2\\ Pr(B=g\vert M=b,W=r) & =.8. \end{aligned}

Along with seeing the previous sequence of blocks {B=gyy}, we observed the weather sequence {W=rrc}. Now, what is the probability that Alfred is the magical monkey?

We ask the Wizard what the distribution over rainy and clear weather is. The wizard haughtily responds that this is irrelevant, but does confirm that the weather is independent of which monkey was made magical.

Can we do anything without knowing the distribution over the weather? Can we calculate the probability that Alfred is the magical monkey?

Let’s give it a try. We can apply Bayes’ equation to get that

\displaystyle \begin{aligned}Pr(M=a\vert B=gyy,W=rrc) & =\frac{1}{Pr(B=gyy,W=rrc)}Pr(M=a)Pr(B=gyy,W=rrc\vert M=a).\end{aligned}

Now, since the weather is independent of the monkey, we know that

\displaystyle \begin{aligned}Pr(B=gyy,W=rrc\vert M=a) & =Pr(W=rrc\vert M=a)Pr(B=gyy\vert M=a,W=rrc)\\ & =Pr(W=rrc)Pr(B=gyy\vert M=a,W=rrc). \end{aligned}

Thus, we have that

\displaystyle \begin{aligned}Pr(M=a\vert B=gyy,W=rrc) =&\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}Pr(M=a)Pr(B=gyy\vert M=a,W=rrc)\\ =&\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}Pr(M=a) \\ & \times Pr(B=g\vert M=a,W=r)Pr(B=y\vert M=a,W=r)Pr(B=y\vert M=a,W=c)\\ =&\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}.25\times.2\times.8\times.8. \end{aligned}

Through the same logic we can calculate that

\displaystyle Pr(M=b\vert B=gyy,W=rrc)=\frac{Pr(W=rrc)}{Pr(B=gyy,W=rrc)}.75\times.8\times.2\times.2.

Again, we know that these probabilities need to sum to one. Since the factor {Pr(W=rrc)/Pr(B=gyy,W=rrc)} is constant between the two, we can normalize it out. Thus, we get that

\displaystyle Pr(M=a\vert B=gyy,W=rrc)=\frac{.25\times.2\times.8\times.8.}{.25\times.2\times.8\times.8.+.75\times.8\times.2\times.2.}=\frac{4}{7}.

Again, it looks like Alfred was the more likely monkey. And– oh wait– we somehow got away with not knowing the distribution over the weather…

Act 4: Predicting the next block

Now, suppose that after seeing the previous sequences (namely {B=gyy} and {W=rrc}) , on the fourth day, we find that it has rained. What is the probability that we will get a green block on the fourth day? Mathematically, we want to calculate

\displaystyle Pr(B'=g\vert W'=r,B=ggy,W=rrc).

How can we go about this? Well, if we knew that the magical monkey was Alfred (which we do not!), it would be easy to calculate. We would just have

\displaystyle Pr(B'=g\vert W'=r,M=a),

and similarly if the magical monkey were Betty. Now, we don’t know which monkey is magical, but we know the probabilities that each monkeys is magical given the available information– we just calculated them in the previous section! So, we can factorize the distribution of interest as

\displaystyle \begin{aligned}Pr(B'=g\vert W'=r,B=ggy,W=rrc) & =Pr(B'=g\vert W'=r,M=a)Pr(M=a\vert B=ggy,W=rrc)\\ & +Pr(B'=g\vert W'=r,M=b)Pr(M=b\vert B=ggy,W=rrc)\\ & =.2\times\frac{4}{7}+.8\times\frac{3}{7}\\ & =\frac{33}{70} \end{aligned}

So we are slightly less likely than even to see a green block on the next day.

Epilogue: That’s Bayesian Inference

This is how Bayesian inference works (in the simplest possible setting). In general Bayesian inference, you have:

  • A set of possible inputs (for us these are weather patterns)
  • A set of possible outputs (for us these are colored blocks)
  • A set of possible models, that map a given input to a distribution over the outputs (for us these are monkeys)
  • A “prior” distribution over which models are more likely (for us, this is the wizard and the four-sided die)
  • A dataset of inputs and outputs (for us, this is {W=rrc} and {B=ggy})
  • A new input (for us this is {W'=r})
  • An unknown output that we’d like to predict (for us this is {B'})

Bayesian inference essentially proceeds in two steps:

  • In the first step, one uses the prior over which models are more likely along with the dataset of inputs and outputs to build a “posterior” distribution over which models are more likely. We did this by using

    \displaystyle Pr(M\vert W,B)\propto Pr(M)\prod_{i}Pr(B_{i}\vert W_{i},M),

    where {i} indexes the different days of observation.

  • In the second step, one “integrates over the posterior”. That is, each model gives us a distribution over {B'} given the new input {W'}. So, we simply sum over the different models

    \displaystyle Pr(B'\vert W')=\sum_{m}Pr(M=m\vert W,B)Pr(B'\vert W',M=m).

In the general case, the set of possible models is much larger (typically infinite) and more complex computational methods need to be used to integrate over it. (Commonly Markov chain Monte Carlo or variational methods). Also, of course, we don’t typically have a Wizard telling us exactly what the prior distribution over the models is, meaning one must make assumptions, or try to “learn” the prior as in, say, empirical Bayesian methods.

Posted in Uncategorized | Tagged | 2 Comments

Algorithmic Dimensions

There are many dimensions on which we might compare a machine learning or data mining algorithm. A few of the first that come to mind are:

1) Sample complexity, convergence

algA_vs_algBHow much predictive power is the algorithm able to extract from a given number of examples?

All else being equal, if algorithm A behaves the same as algorithm B with twice as many examples, we would generally prefer algorithm A. This can vary in importance depending on how scarce or expensive it is to acquire data.

2) Speed

How quickly does the algorithm run?

Obviously, a tool is more useful if it is faster. However, if that faster algorithm comes at the expense of sample complexity, one would need to measure the expense of running a program longer against the expense of gathering more data.

3) Guarantees

Does the algorithm just have good performance (along whatever dimension) in practice, or is it proven? Is it always good, or only in certain situations? How predictable is the performance?

When SVMs showed up, it looked like good practical algorithms came from good theory. These days, it seems clear that powerful intuition is also an excellent source of practical algorithms (deep learning).

Perhaps theory will some day catch up, and the algorithm with the best bounds will also perform the best. However, this is not always the case today. We often face the choice between an algorithms with good theoretical guarantees, and algorithms that work well, and are appealing otherwise. Some examples of theory versus practice are 1) Upper-Confidence Bounds versus Thomson Sampling with bandit algorithms 2) Running a convex optimization algorithm with a theoretically derived Lipschitz constant versus a smaller one that still seems to work and 3) Doing model selection via VC-dimension generalization bounds versus using K-fold cross-validation.

4) Memory usage

How much space does the algorithm need?

I work with a lot of compute-heavy applications where this is almost a wall: we don’t care about memory usage until we run out of it, after which we care a great deal. With simpler algorithms and larger datasets, this is often more nuanced, with concerns about different cache sizes.

5) Handholding

Can it be used out of the box, or does it require an expert to “tune” it for best performance?

A classic example of this is stochastic gradient descent. In principle, for a wide range of inputs, convergence is guaranteed by iteratively setting \theta_{k+1} = \theta_k - \lambda_k g_k where g_k is a noisy unbiased estimate of the gradient at \theta_k and \lambda_k is some sequence of step-sizes that obeys the Robbins-Monro conditions [1]. However, in practice, the difference between, say, \lambda_k = 1/k, and \lambda_k = 7.2/(522.9+k) can be enormous, and finding these constants is a bit of a dark art.

[1] (\sum_k \lambda_k=\infty and \lim_{k\rightarrow\infty} \lambda_k=0.)

6) Implementability

How simple is the algorithm? How easy is it to implement?

This is quite complex and situational. These days, I’d consider an algorithm that consists of a few moderate-dimensional matrix multiplications or singular value decompositions “simple”. However, that’s due to a huge amount of effort designing reliable algorithms for these problems, and the ubiquity of easy to use libraries.

7) Amenability to parallelization

Does the algorithm lend itself to parallelization? (And what type of parallelization?)

And if it does, under what model? Map-reduce, GPU, MPI, and openMP all have different properties.

8) “Anytime-ness”

Can the algorithm be implemented in any anytime manner?

That is, does the algorithm continuously return a current “best-guess” of the answer that is refined over time in a sensible manner? This can help diagnose problems before running the full algorithm, enormously useful in debugging large systems.

Note this is distinct from being an online algorithm, which I’m not mentioning here, since it’s a mix of speed and memory properties.

9) Transparency, interpretability

Can the final result be understood? Does it give insight into how predictions are being made?

treeGalit Shmueli argues that “explanatory power” and “predictive power” are different dimensions. However, there are several ways in which interpretability is important even when prediction is the final goal. Firstly, the insight might convince a decision maker that the machine learning system is reliable. Second, this can be vital in practice for generalization. The world is rarely independent and identically distributed. If a domain expert can understand the predictive mechanism, they may be able to assess if this will still hold in the future, or captures something true only in the training period. Third, understanding what the predictor is doing also often yields ways to improve it. For example, the adjacent visualization of the outputs of a decision tree in two-dimensions suggests the need for non axis-aligned splits.

10) Generality

What class of problems can the algorithm address?

All else being equal, we prefer an algorithm that can, say, optimize all convex losses over one that can only fit the logistic loss. However, this often comes at a cost— a more general-purpose algorithm cannot exploit the extra structure present in a specialized problem (or, at least, has more difficulty doing so). It’s instructive how many different methods are used by LIBLINEAR depending on the particular loss and regularization constant.

11) Extendability

Does the algorithm have lots of generalizations and variants that are likely to be interesting?

More of an issue when reviewing a paper than deciding what is the final best algorithm to use once all the dust has settled.

12) Insight

Does the algorithm itself (as opposed to its results) convey insight into the problem?

Gradient descent for maximum likelihood learning of an exponential family is a good example, as it reveals the moment-matching conditions. Insight of this type, however, doesn’t suggest that one should actually *run* the algorithm.

13) Model robustness

How does the performance of the algorithm hold up to violations of its modeling assumptions?

Suppose we are fitting a simple line to some 2-D data. Obviously, all else being equal, we would prefer an algorithm that still does something reasonable when the actual dependence when the expected value of y is not linear in x. The example I always harp on here is the pseudolikelihood. The original paper pointed out that this will have somewhat worse sample complexity that the full likelihood, and (much!) better time complexity. Many papers seem to attribute the bad performance of the pseudolikelihood in practice to this sample complexity, when the true cause is that the likelihood does something reasonable (minimizes KL divergence) when there is model mis-specification, but the pseudolikelihood does not.

I often ponder is how much improvement in one dimension is enough to “matter”. Personally, I would generally consider even a small constant factor (say 5-10%) improvement in sample complexity quite important. Meanwhile, it would be rare to get exited about even, say, a factor of two improvement in running time.

What does this reflect? Firstly, generalization is the fundamental goal of data analysis, and we are likely to be willing to compromise most things if we can really predict better. Second, we instinctively distrust running times. Theory offers few tools for understanding constant factors, as these are highly architecture and implementation dependent. In principle, if one could be completely convincing that a factor of two improvement was truly there, I think this probably would be significant. (This might be true, say, if all algorithms are bottlenecked by a few matrix multiplications, and a new algorithm provably reduces the number needed.) However, this is rare.

In some places, I think constant factor skepticism can lead up astray. In reality, a factor of 30 improvement in speed is probably better than changing a O(N \log N) complexity to O(N). (Calculate N such that \log N=30.) This is particularly true when the lower-complexity algorithm has higher constant factors. As an example, I’ve always found the O(N \log N) algorithm for projection onto the l_1 ball to be faster than the O(N) algorithm in practice.

Given that there are so very many dimensions, can a new algorithm really improve on all simultaneously? It seems rare to even improve on a single dimension without a corresponding cost somewhere else. There will often be a range of techniques that are Pareto optimal, depending on one’s priorities. Understanding this range is what makes having an expert around so important.

Ideally, as a community, we would be able to provide a “consumer” of machine learning an easy way to find the algorithm for them. Or, at least, we might be able to point them in the direction of possible algorithms. One admirable attempt along this line is the following table from The Elements of Statistical Learning:

hasite et al

(Incidentally, notice how many of the desiderata do not overlap with mine.)

Some other situations show a useful contrast. For example, take this decision tree for choosing an algorithm for unconstrained optimization, due to Dianne O’Leary:

olearyEssentially, this amounts to the principle that one should use the least general algorithm available, so that it can exploit as much structure of the problem as possible. Though one can quibble with the details (pity the subgradient methods) it at least comes close to giving the “right” algorithm in each situation.

This doesn’t seem possible with machine learning, since there doesn’t exist a single hierarchy Rather, ML problems are a tangle of model specification, computational and architecture requirements, implementation constraints, user risk-tolerances and so on. It won’t be easy to automate away the experts. (Even ignoring the possible misalignment of incentives in that the field has the domain of automating itself.)

(This post incorporates comments from Aditya Menon and Cheng Soon Ong.)

Posted in Uncategorized | 2 Comments

Favorite things NIPS

I always enjoy reading conference reports, so I thought I’d mention a few papers that caught my eye.  (I welcome any corrections to my summaries of any of these.)

1. Recent Progress in the Structure of Large-Treewidth Graphs and Some Applications, by Chandra Chekuri. Not a paper, but rather an exceptionally clear tutorial on material that should probably be mandatory for anyone interested in computational complexity and graphical models. The talk isn’t online yet, but slides are available.

2. Learning Distributed Representations for Structured Output Prediction, by Vivek Srikumar and Christopher Manning. The idea is that when doing structured prediction, labels are often “similar” in some sense, and so should be encouraged to have similar weights. This is done by defining weights not over a discrete label space, but over a finite-dimensional distributed representation, with one vector for each label. Learning alternates between updating the representation of each label and regular structured learning.

3. Clamping Variables and Approximate Inference, by Adrian Weller and Tony Jebara. The paper provides a first-principles based proof that the Bethe approximation gives a lower-bound on the true partition function for attractive binary pairwise models. (A result originally proved by Ruozzi in 2012) The basic idea is that “clamping” (i.e. brute-force summing over all values of a variable, and using the Bethe approximation on the resulting model) can only increase the Bethe approximation. But, if you clamp everything, you have the exact partition function, so Bethe must be a lower-bound.  The proofs provide a lot of insight into why the results are true.

4. Making Pairwise Binary Graphical Models Attractive, by Nicholas Ruozzi and Tony Jebara. This builds on similar ideas to the above paper. There is an idea of a cover of a graph, where vertices have multiple copies, but neighborhoods look locally the same. The paper gives an appropriate cover construction such that the cover is attractive (up to a “flipping” of the variables). Thus, the Bethe partition function of the cover lower-bounds the true partition function of the cover. This is shown to be a lower-bound on the tree-reweighted partition function. As far as I can tell, Bethe on the cover may upper or lower-bound the true partition function, so Bethe-on-cover isn’t necessarily better than tree-reweighted.  Still, since tree-reweighted is typically a loose upper-bound, it is likely to be better in practice.

5. Distributed Bayesian Posterior Sampling via Moment Sharing, by Minjie Xu, Balaji Lakshminarayanan, Yee Whye Teh, Jun Zhu, and Bo Zhang. You want to do Bayesian MCMC inference using a cluster. This paper partitions the data over the nodes of the cluster, does local MCMC approximations, and uses an exponential-family approximation and expectation propagation to send information about the posterior between the nodes, thus approximating full/normal Bayesian inference. A couple days ago Andrew Gelman mentioned a similar method, considering other local approximations.

6. Mode Estimation for High Dimensional Discrete Tree Graphical Models, by Chao Chen, Han Liu, Dimitris Metaxas, and Tianqi Zhao. The goal is to find the M highest modes in a tree-structured graphical model. They find a set of necessary local conditions to be a mode, and then propose a junction-tree type algorithm operating with local mode configurations as values to find global modes.

7. Efficient Inference of Continuous Markov Random
Fields with Polynomial Potentials
, by Shenlong Wang, Alexander Schwing, and Raquel Urtasun. The goal is to find a local minimum of a polynomial MRF by use of the CCCP method where one iteratively splits the objective into convex and concave parts and linearizes the concave part to get an upper-bound. I’d have expected this to be easy, but it turns out to be quite difficult to split up a polynomial into convex and concave parts. This paper gives an SDP problem to find a function to add to the objective to make it convex, yielding the desired split.

8. Hardness of parameter estimation in graphical models, by Guy Bresler, David Gamarnik, and Devavrat Shah. In principle, this is the ultimate non-surprising result: given the mean parameters (data marginals), it is computationally intractable to find the natural parameters (MRF weights) that produce these marginals. Formally given an oracle to do maximum likelihood learning in the normal way (find parameters to match a given mean vector), you could use that oracle to approximate the partition function, which is known to be hard. The really surprising thing here is that this wasn’t proved until 2014!

9. A* Sampling, by Chris Maddison, Daniel Tarlow, and Tom Minka. A method for exact sampling for low-dimensional continuous distributions, based on fundamentally different principles from existing methods. Intuitively, the algorithm simultaneously generates Gumbel-type noise for every possible (real-valued) configuration and optimizes the resulting function by branch-and-bound.

10. The workshops, as usual, were great aside from the usual annoyance of it only being possible to be in one place at one time.

11. Montreal.  Charming, friendly.


Posted in Uncategorized | Tagged , | 1 Comment

Truncated Bi-Level Optimization

In 2012, I wrote a paper that I probably should have called “truncated bi-level optimization”.  I vaguely remembered telling the reviewers I would release some code, so I’m finally getting around to it.

The idea of bilevel optimization is quite simple.  Imagine that you would like to minimize some function L(w).  However, L itself is defined through some optimization.  More formally, suppose we would like to solve

\min_{w,y} Q(y), \text{ s.t. } y = \arg\min_y E(y,w).

Or, equivalently,

\min_{w} L(w)=Q(y^*(w)),

where y^* is defined as y^*(w) := \arg\min_y E(y,w).  This seems a little bit obscure at first, but actually comes up in several different ways in machine learning and related fields.

Hyper-parameter learning

The first example would be in learning hyperparameters, such as regularization constants.  Inevitably in machine learning, one fits parameters parameters to optimize some tradeoff between the quality of a fit to training data and a regularization function being small.  Traditionally, the regularization constant is selected by optimizing on a training dataset with a variety of values, and then picking the one that performs best on a held-out dataset.  However, if there are a large number of regularization parameters, a high-dimensional grid-search will not be practical.  In the notation above, suppose that w is a vector of regularization constants, and that y are training parameters.  Let, E  be the regularized empirical risk on a training dataset, and let Q be how well the parameters y^*(w) perform on some held-out validation dataset.

Energy-based models

Another example (and the one suggesting the notation) is an energy-based model.  Suppose that we have some “energy” function E_x(y,w) which measures how well an output y fits to an input x.  The energy is parametrized by w.  For a given training input/output pair (\hat{x},\hat{y}), we might have that Q_{\hat{y}}(y^*(w)) measures how how the predicted output y^* compares to the true output \hat{y}, where y^*(w)=\arg\max_y E_{\hat{x}}(y,w).

Computing the gradient exactly

Even if we just have the modest goal of following the gradient of L(w) to a local minimum, even computing the gradient is not so simple.  Clearly, even to evaluate L(w) requires solving the “inner” minimization of E. It turns out that one can compute \nabla_w L(w) through first solving the inner minimization, and then solving a linear system.

  1. Input w
  2. Solve y^* \leftarrow \arg\min_y E(y,w).
  3. Compute:
  4.    (a) the loss L(w)= Q(y^*)
  5.    (b) the gradient g=\nabla_y Q(y^*)
  6.    (c) the Hessian H=\frac{\partial^2 E(y^*,w)}{\partial w\partial w^T}
  7. Solve the linear system z=H^{-1} g.
  8. Compute the parameter gradient \nabla_w L(w) = - \frac{\partial^2 E(y^*,w)}{\partial w\partial y^T} z
  9. Return L(w) and \nabla_w L(w).

This looks a bit nasty, since we need to compute second-derivative matrices of E.  In fact, as long as one has a routine to compute \nabla_y E and \nabla_w E, this can be avoided through Efficient Matrix-vector products.  This is essentially proposed by Do, Foo, and Ng in “Efficient multiple hyperparameter learning for log-linear models”.

Overall, this is a decent approach, but it can be quite slow, simply because one must solve an “inner” optimization in order to compute each gradient of the “outer” optimization.  Often, the inner-optimization needs to be solved to very high accuracy in order to estimate a gradient accurately enough to reduce L(w)— higher accuracy than is needed when one is simply using the predicted value y^* itself.

Truncated optimization

To get around this expense, a fairly obvious idea is to re-define the problem.  The slowness of exactly computing the gradient stems from needing to exactly solve the inner optimization.  Hence, perhaps we re-define the problem such that an inexact solve of the inner problem nevertheless yields an “exact” gradient?

Re-define the problem as solving

\min_{w} L(w)=Q(y^*(w)), \text{ } y^*(w) := \text{opt-alg}_y E(y,w),

where \text{opt-alg} denotes an approximate solve of the inner optimization.  In order for this to work, \text{opt-alg} must be defined in such a way that y^*(w) is a continuous function of w.  With standard optimization methods such as gradient descent or BFGS, this can be achieved by assuming there are a fixed number of iterations applied, with a fixed step-size.  Since each iteration of these algorithms is continuous, this clearly defines y^*(w) as a continuous function.  Thus, in principle, it could be optimized efficiently through automatic differentiation of the code that optimizes E.  That’s fine in principle, but often inconvenient in practice.

It turns out, however, that one can derive “backpropagating” versions of algorithms like gradient descent, that take as input only a procedure to compute E along with it’s first derivatives.  These algorithms can then produce the gradient of L in the same time as automatic differentiation.

Back Gradient-Descent

If the inner-optimization is gradient descent for N steps with a step-size of \lambda, the algorithm to compute the loss is simple:

  1. Input w
  2. For k=0,1,...,N-1
  3.     (a) y_{k+1} \leftarrow y_k - \lambda \nabla_y E(y_k,w)
  4. Return L(w) = Q(y_N)

How to compute the gradient of this quantity?  The following algorithm does the trick.

  1. \overleftarrow{y_N} \leftarrow \nabla Q(y_N)
  2. \overleftarrow{w} \leftarrow 0
  3. For k=N-1,...,0
  4.     (a) \overleftarrow{w} \leftarrow \overleftarrow{w} - \lambda \frac{\partial^2 E(y_k,w)}{\partial w \partial y^T} \overleftarrow{y_{k+1}}
  5.     (b) \overleftarrow{y_k} \leftarrow \overleftarrow{y_{k+1}} - \lambda \frac{\partial^2 E(y_k,w)}{\partial y \partial y^T} \overleftarrow{y_{k+1}}
  6. Return \nabla L = \overleftarrow{w}.

Similar algorithms can be derived for the heavy-ball algorithm (with a little more complexity) and limited memory BFGS (with a lot more complexity).


So, finally, here is the code, and I’ll give a simple example of how to use it. There are just four simple files:

  1. back_gd.m
  2. back_hb.m
  3. back_lbfgs.m
  4. demo.m

I think the meanings of this are pretty straightforward, so I’ll just quickly step through the demo here.  I’ll start off by grabbing taking one of Matlab’s built-in datasets (on cities) so that we are trying to predict a measure of crime from measures of climate, housing, health, transportation, arts, recreation, and economy, as well as a constant.  There are 329 data, total, which I split into a training set of size 40, a validation set of size 160, and a test set of size 129.

load cities ratings
X = ratings;
for i=1:size(X,2)
    X(:,i) = X(:,i) - mean(X(:,i));
    X(:,i) = X(:,i) / std( X(:,i));

% predict crime from climate, housing, health, trans, edu, arts, rec, econ,
Y = X(:,4);
X = [X(:,[1:3 5:9]) 1+0*X(:,1)];

p = randperm(length(Y));
X = X(p,:);
Y = Y(p);

whotrain = 1:50;
whoval   = 51:200;
whotest  = 201:329;

Xtrain = X(whotrain,:);
Xval   = X(whoval  ,:);
Xtest  = X(whotest ,:);
Ytrain = Y(whotrain);
Yval   = Y(whoval  );
Ytest  = Y(whotest );

Next, I’ll set up some simple constants that will be used later on, and define the optimization parameters for minFunc, that I will be using for the outer optimization. In particular, here I choose the inner optimization to use 20 iterations.

opt_iters = 20;
ndims = size(Xtrain,2); w0    = zeros(ndims,1); ndata = size(Xtrain,1); ndata_val = size(Xval,1);
options = []; options.Method  = 'gd'; options.LS      = 1; options.MaxIter = 100;

Now, I’ll define the training risk function (E in the notation above). The computes the risk with a regularization constant of a, as well as derivatives. I’ll also define the validation risk (Q in the notation above).

function [R dRdw dRdloga] = training_risk(w,loga)
        a         = exp(loga);
        R         = sum( (Xtrain*w - Ytrain).^2 )/ndata + a*sum(w.^2);
        dRdw      = 2*Xtrain'*(Xtrain*w-Ytrain)  /ndata + 2*a*w;
        dRda      = sum(w.^2);
        dRdloga = dRda*a;

    function [R g] = validation_risk(w)
        R = sum( (Xval*w - Yval).^2 ) / ndata_val;
        g = 2*Xval'*(Xval*w-Yval)     / ndata_val;

Now, before going any further, let’s do a traditional sweep through regularization constants to see what that looks like.

LAMBDA = -5:.25:2;
    for i=1:length(LAMBDA)
        VAL_RISK(i) = back_lbfgs(@training_risk,@validation_risk,w0,LAMBDA(i),opt_iters);

Plotting, we get the following:


This is a reasonable looking curve. Instead, let’s ask the algorithm to find the constant by gradient descent.

eval = @(loga) back_lbfgs(@training_risk,@validation_risk,w0,loga,opt_iters);        
loga = 0;
[loga fval] = minFunc(eval,loga,options);

Running the optimization, we see:

Iteration   FunEvals     Step Length    Function Val        Opt Cond
         1          2     1.00000e+00     8.74176e-01     3.71997e-03
         2          3     1.00000e+00     8.73910e-01     1.86453e-04
         3          4     1.00000e+00     8.73909e-01     1.06619e-05
         4          5     1.00000e+00     8.73909e-01     3.60499e-08

This leads to a regularizer of the form:

0.860543 ||w||_2^2.

We can plot this on the graph, and see it matches the result of cross-validation.


If we actually compute the test-set error, this is 0.708217.

OK, let’s be a little bit more adventurous, and use a third-order regularizer. This is done like so:

    function [R dRdw dRdloga] = training_risk2(w,loga)
        a         = exp(loga(1));
        b         = exp(loga(2));
        R         = sum( (Xtrain*w - Ytrain).^2 ) / ndata + a*sum(abs(w).^2)     + b*sum(abs(w).^3);
        dRdw      = 2*Xtrain'*(Xtrain*w-Ytrain)   / ndata + a*abs(w).^1.*sign(w) + b*abs(w).^2.*sign(w);
        dRda      = sum(abs(w).^2);
        dRdb      = sum(abs(w).^3);
        dRdloga = [dRda*a; dRdb*b];

Running the optimization, we see:

Iteration   FunEvals     Step Length    Function Val        Opt Cond
         1          2     1.00000e+00     8.74445e-01     1.17262e-02
         2          3     1.00000e+00     8.73685e-01     3.21956e-03
         3          4     1.00000e+00     8.73608e-01     1.41744e-03
         4          5     1.00000e+00     8.73598e-01     8.20040e-04
         5          6     1.00000e+00     8.73567e-01     1.39830e-03
         6          7     1.00000e+00     8.73513e-01     2.52994e-03
         7          8     1.00000e+00     8.73471e-01     1.77157e-03
        23         28     1.00000e+00     8.70741e-01     8.42628e-06

With a final regularizer of the form

0.000160 ||w|_2^2 + 5.117057 * ||w||_3^3

and a hardly-improved test error of 0.679155.

Finally, let’s fit a fourth-order polynomial.

    function [R dRdw dRdloga] = training_risk3(w,loga)
        a         = exp(loga);
        R         = sum( (Xtrain*w - Ytrain).^2 ) / ndata;
        dRdw      = 2*Xtrain'*(Xtrain*w-Ytrain)   / ndata;
        for ii=1:length(a)
            R    = R    + a(ii)*sum(abs(w).^b);
            dRdw = dRdw + a(ii)*abs(w).^(b-1).*sign(w);
            dRda(ii,1) = sum(abs(w).^b);
        dRdloga = dRda.*a;

    eval = @(loga) back_lbfgs(@training_risk3,@validation_risk,w0,loga,opt_iters);
    loga = [0; 0; 0; 0];
    loga = minFunc(eval,loga,options);

Running the optimization, we see

Iteration   FunEvals     Step Length    Function Val        Opt Cond
         1          2     1.00000e+00     8.73982e-01     1.02060e-02
         2          3     1.00000e+00     8.73524e-01     3.30445e-03
         3          4     1.00000e+00     8.73447e-01     1.74779e-03
         4          5     1.00000e+00     8.73435e-01     1.43655e-03
         5          6     1.00000e+00     8.73345e-01     4.98340e-03
         6          7     1.00000e+00     8.73295e-01     1.85535e-03
         7          8     1.00000e+00     8.73231e-01     1.81136e-03
        38         41     1.00000e+00     8.69758e-01     3.44848e-06

Yielding the (strange) regularizer

0.000000 ||w||_2^2 + 0.000010 ||w||_3^3 + 24.310631 ||w||_4^4 + 0.325565 ||w||_5^5

and a final test-error of 0.67187.

Posted in Uncategorized | Tagged , , , , | 5 Comments

Reducing Sigmoid computations by (at least) 88.0797077977882%

A classic implementation issue in machine learning is reducing the cost of computing the sigmoid function

\sigma(a) = \frac{1}{1+\exp(-a)}.

Specifically, it is common to profile your code and discover that 90% of the time is spent computing the \exp in that function.  This comes up often in neural networks, as well as in various probabilistic architectures, such as sampling from Ising models or Boltzmann machines.  There are quite a few classic approximations to the function, using simple polynomials, etc. that can be used in neural networks.

Today, however, I was faced with a sampling problem involving the repeated use of the sigmoid function, and I noticed a simple trick that could reduce the number of sigmoids by about 88% without introducing any approximation.  The particular details of the situation aren’t interesting, but I repeatedly needed to do something like the following:

  1. Input a \in \Re
  2. Compute a random number r \in [0,1]
  3. If r < \sigma(a)
  4.   Output +1
  5. Else
  6.     Output -1

Now, let’s assume for simplicity that a is positive.  (Otherwise, sample using -a and then switch the sign of the output.)  There are two observations to make:

  1. If a is large, then you are likely to output +1
  2. Otherwise, there are easy upper and lower bounds on the probability of outputting +1

This leads to the following algorithm:

  1. Input a \in \Re
  2. Compute a random number r \in [0,1]
  3. If a \geq 2
  4.     If r \leq 0.880797077977882 or r \leq \sigma(a)
  5.         Output +1
  6.     Else
  7.         Output -1
  8. Else
  9.     If r > .5 + a/4
  10.         Output -1
  11.     Else if r \leq .5 + a/5.252141128658 or r \leq \sigma(a)
  12.         Output +1
  13.     Else
  14.         Output -1

The idea is as follows:

  1. If a\geq 2, then we can lower-bound the probability of outputting +1 by a pre-computed value of \sigma(2)\approx0.8807..., and short-circuit the computation in many cases.
  2. If a\leq 2, then we can upper bound the sigmoid function by .5+a/4.
  3. If a\leq 2, then we can also lower bound by .5+a/5.252141... respectively.  (This constant was found numerically).

The three cases are illustrated in the following figure, where the input a is on the x-axis, and the random number r is on the y-axis.


Since, for all a at least a fraction \sigma(2)\approx.8807 of the numbers will be short-circuited, sigmoid calls will be reduced appropriately.  If a is often near zero, you will do even better.

Obviously, you can take this farther by adding more cases, which may or may not be helpful, depending on the architecture, and the cost of branching vs. the cost of computing an \exp.

Posted in Uncategorized | Tagged , , , , | 9 Comments