<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Justin Domke's Weblog</title>
	<atom:link href="http://justindomke.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://justindomke.wordpress.com</link>
	<description></description>
	<lastBuildDate>Wed, 28 Oct 2009 02:00:40 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='justindomke.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/7266a6a4d12cf7541895cf84e5365fe0?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Justin Domke's Weblog</title>
		<link>http://justindomke.wordpress.com</link>
	</image>
			<item>
		<title>Notation is evil</title>
		<link>http://justindomke.wordpress.com/2009/10/28/notation-is-evil/</link>
		<comments>http://justindomke.wordpress.com/2009/10/28/notation-is-evil/#comments</comments>
		<pubDate>Wed, 28 Oct 2009 02:00:40 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=839</guid>
		<description><![CDATA[Exhibit A:
We have the symbol \propto, with the interpretation
x \propto y \leftrightarrow x = c y for some number c,
but there doesn&#8217;t appear to exist a symbol (Here, I use a boxed question mark: \boxed{?} to denote the symbol I claim doesn&#8217;t exist) with the interpretation
x \boxed{?} y \leftrightarrow x = y + c for some number c.
This pains me.  People sometimes have to resort to writing something like
&#8220;y = f(x)+\text{const} (1)
= g(x) + \text{const} (2)
where the constants [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><strong>Exhibit A:</strong></p>
<p>We have the symbol <img src='http://l.wordpress.com/latex.php?latex=%5Cpropto&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\propto' title='\propto' class='latex' />, with the interpretation</p>
<p><img src='http://l.wordpress.com/latex.php?latex=x+%5Cpropto+y+%5Cleftrightarrow+x+%3D+c+y&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x \propto y \leftrightarrow x = c y' title='x \propto y \leftrightarrow x = c y' class='latex' /> for some number <img src='http://l.wordpress.com/latex.php?latex=c&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='c' title='c' class='latex' />,</p>
<p>but there doesn&#8217;t appear to exist a symbol (Here, I use a boxed question mark: <img src='http://l.wordpress.com/latex.php?latex=%5Cboxed%7B%3F%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\boxed{?}' title='\boxed{?}' class='latex' /> to denote the symbol I claim doesn&#8217;t exist) with the interpretation</p>
<p><img src='http://l.wordpress.com/latex.php?latex=x+%5Cboxed%7B%3F%7D+y+%5Cleftrightarrow+x+%3D+y+%2B+c&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x \boxed{?} y \leftrightarrow x = y + c' title='x \boxed{?} y \leftrightarrow x = y + c' class='latex' /> for some number <img src='http://l.wordpress.com/latex.php?latex=c&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='c' title='c' class='latex' />.</p>
<p>This pains me.  People sometimes have to resort to writing something like</p>
<p>&#8220;<img src='http://l.wordpress.com/latex.php?latex=y+%3D+f%28x%29%2B%5Ctext%7Bconst%7D+%281%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='y = f(x)+\text{const} (1)' title='y = f(x)+\text{const} (1)' class='latex' /></p>
<p><img src='http://l.wordpress.com/latex.php?latex=%3D+g%28x%29+%2B+%5Ctext%7Bconst%7D+%282%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='= g(x) + \text{const} (2)' title='= g(x) + \text{const} (2)' class='latex' /></p>
<p>where the constants are (in general) different on lines (1) and (2).&#8221;</p>
<p>Even worse (or maybe not?), sometimes people seem to leave exponents lying around when they otherwise wouldn&#8217;t, e.g. write</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Cexp%28y%29+%5Cpropto+%5Cexp%28f%28x%29%29+%5Cpropto+%5Cexp%28g%28x%29%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\exp(y) \propto \exp(f(x)) \propto \exp(g(x))' title='\exp(y) \propto \exp(f(x)) \propto \exp(g(x))' class='latex' />.</p>
<p><strong>Exhibit B:</strong></p>
<p>We have no symbol meaning &#8220;normalized sum&#8221;.  How many thousands of times have you seen some variant of</p>
<p>&#8220;<img src='http://l.wordpress.com/latex.php?latex=y+%3D+%5Cfrac%7B1%7D%7BN%7D%5Csum_%7Bx+%5Cin+X%7D+f%28x%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='y = \frac{1}{N}\sum_{x \in X} f(x)' title='y = \frac{1}{N}\sum_{x \in X} f(x)' class='latex' /></p>
<p>where <img src='http://l.wordpress.com/latex.php?latex=N%3D%7CX%7C&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='N=|X|' title='N=|X|' class='latex' />&#8221;?</p>
<p>Why do we need to define <img src='http://l.wordpress.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='N' title='N' class='latex' />?  Can&#8217;t we just use another mystery symbol to write</p>
<p><img src='http://l.wordpress.com/latex.php?latex=y+%3D+%5Cboxed%7B%3F%7D_%7Bx+%5Cin+X%7D+f%28x%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='y = \boxed{?}_{x \in X} f(x)' title='y = \boxed{?}_{x \in X} f(x)' class='latex' />?</p>
<p>In some situations you could use <img src='http://l.wordpress.com/latex.php?latex=%5Ctext%7Bmean%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\text{mean}' title='\text{mean}' class='latex' />, but that doesn&#8217;t always really work and is rarely done.</p>
  </div>]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/10/28/notation-is-evil/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Fitting an inference algorithm instead of a model</title>
		<link>http://justindomke.wordpress.com/2009/08/18/fitting-an-inference-algorithm-instead-of-a-model/</link>
		<comments>http://justindomke.wordpress.com/2009/08/18/fitting-an-inference-algorithm-instead-of-a-model/#comments</comments>
		<pubDate>Tue, 18 Aug 2009 22:29:23 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[optimization]]></category>
		<category><![CDATA[probability]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=741</guid>
		<description><![CDATA[One recent trend seems to be the realization that one can get better performance by tuning a CRF (Conditional Random Field) to a particular inference algorithm.  Basically, forget about the distribution that the CRF represents, and instead care only about how accurate the results that pop out of inference are.  An extreme example of [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>One recent trend seems to be the realization that one can get better performance by tuning a CRF (Conditional Random Field) to a particular inference algorithm.  Basically, forget about the <em>distribution</em> that the CRF represents, and instead care only about how accurate the <em>results</em> that pop out of inference are.  An extreme example of this is the recent paper <a href="http://stat.fsu.edu/~abarbu/papers/barbu_denoise_cvpr09.pdf">Learning Real-Time MRF Inference for Image Denoising</a> by <a href="http://stat.fsu.edu/~abarbu/">Adrian Barbu</a>.</p>
<p>The basic idea is to fit a FoE (<a href="http://www.gris.informatik.tu-darmstadt.de/~sroth/research/foe/index.html">Field of Experts</a>) image prior such that when one takes only a few gradient descent steps on a denoising posterior, the results are accurate.  From the abstract:</p>
<blockquote><p>We argue that through appropriate training, a MRF/CRF model can be trained to perform very well on a suboptimal inference algorithm. The model is trained together with a fast inference algorithm through an optimization of a loss function [...] We apply the proposed method to an image denoising application [...] with a 1-4 iteration gradient descent inference algorithm. [...] the proposed training approach obtains an improved  benchmark performance as well as a 1000-3000 times speedup compared to the Fields of Experts MRF.</p></blockquote>
<p>The implausible-sounding 1000-fold speedup comes simply from using only 4 iterations of gradient descent rather than several thousand.  (Incidentally, L-BFGS requires far fewer iterations for this problem.)  The results are a bit better than those of the generative FoE model, which takes much more work for training and inference.</p>
<p>I have every confidence that this does work well, and similar strategies could probably be used to create fast inference models/algorithms for many different problems.  My <a href="http://www.cs.umd.edu/~domke/papers/thesis.pdf">thesis</a> was largely an attempt to do the same thing for marginal, rather than MAP, inference.</p>
<p>The disturbing/embarrassing question, for me, is: does this really have anything to do with probabilistic modeling any more?  Formally speaking, a probability density is being fit, but I doubt it would transfer to, say, inpainting, or that samples from the density would look like natural images.  The best interpretation of what is happening might be that one is simply fitting a big, nonlinear, black box function approximation.</p>
<p>It seems that the more effort we expend to wring the best performance out of a probabilistic model, the less &#8220;probabilistic&#8221; it is.</p>
<h6>P.S. Some of my friends have invited me to never mention autodiff again, ever, but this is one of the many papers where I think the learning optimization would be made much easier/faster by using it.</h6>
  </div>]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/08/18/fitting-an-inference-algorithm-instead-of-a-model/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Bird</title>
		<link>http://justindomke.wordpress.com/2009/06/21/bird/</link>
		<comments>http://justindomke.wordpress.com/2009/06/21/bird/#comments</comments>
		<pubDate>Sun, 21 Jun 2009 23:48:29 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[birds]]></category>
		<category><![CDATA[photography]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=696</guid>
		<description><![CDATA[]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br />
<a href='http://justindomke.wordpress.com/2009/06/21/bird/p1030543/' title='P1030543'><img width="150" height="112" src="http://justindomke.files.wordpress.com/2009/06/p1030543.jpg?w=150&#038;h=112" class="attachment-thumbnail" alt="" title="P1030543" /></a>
<a href='http://justindomke.wordpress.com/2009/06/21/bird/p1030552/' title='P1030552'><img width="150" height="112" src="http://justindomke.files.wordpress.com/2009/06/p1030552.jpg?w=150&#038;h=112" class="attachment-thumbnail" alt="" title="P1030552" /></a>
<a href='http://justindomke.wordpress.com/2009/06/21/bird/p1030555/' title='P1030555'><img width="150" height="112" src="http://justindomke.files.wordpress.com/2009/06/p1030555.jpg?w=150&#038;h=112" class="attachment-thumbnail" alt="" title="P1030555" /></a>
<a href='http://justindomke.wordpress.com/2009/06/21/bird/p1030556-3/' title='P1030556'><img width="112" height="150" src="http://justindomke.files.wordpress.com/2009/06/p10305562.jpg?w=112&#038;h=150" class="attachment-thumbnail" alt="" title="P1030556" /></a>

  </div>]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/06/21/bird/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>What Gauss-Seidel is Really Doing</title>
		<link>http://justindomke.wordpress.com/2009/06/10/what-gauss-seidel-is-really-doing/</link>
		<comments>http://justindomke.wordpress.com/2009/06/10/what-gauss-seidel-is-really-doing/#comments</comments>
		<pubDate>Wed, 10 Jun 2009 23:57:16 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[linear algebra]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=643</guid>
		<description><![CDATA[I&#8217;ve been reading Alan Sokal&#8217;s lecture notes &#8220;Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms&#8221; today. Once I learned to take the word &#8220;Hamiltonian&#8221; and mentally substitute &#8220;function to be minimized&#8221;, they are very clearly written.
Anyway, the notes give an explanation of the Gauss-Seidel iterative method for solving linear systems that is so [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I&#8217;ve been reading Alan Sokal&#8217;s lecture notes &#8220;<a href="http://dbwilson.com/exact/cargese.ps.gz">Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms</a>&#8221; today. Once I learned to take the word &#8220;Hamiltonian&#8221; and mentally substitute &#8220;function to be minimized&#8221;, they are very clearly written.</p>
<p>Anyway, the notes give an explanation of the Gauss-Seidel iterative method for solving linear systems that is so clear, I feel a little cheated that it was never explained to me before.  Let&#8217;s start with the typical (confusing!) explanation of Gauss-Seidel (as in, say, <a href="http://en.wikipedia.org/wiki/Gauss%E2%80%93Seidel_method">Wikipedia</a>).  You want to solve some linear system</p>
<p><img src='http://l.wordpress.com/latex.php?latex=A%7B%5Cbf+x%7D%3D%7B%5Cbf+b%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='A{\bf x}={\bf b}' title='A{\bf x}={\bf b}' class='latex' />,</p>
<p>for <img src='http://l.wordpress.com/latex.php?latex=%5Cbf+x&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\bf x' title='\bf x' class='latex' />.  What you do is decompose <img src='http://l.wordpress.com/latex.php?latex=A&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='A' title='A' class='latex' /> into a diagonal matrix <img src='http://l.wordpress.com/latex.php?latex=D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='D' title='D' class='latex' />, a strictly lower triangular matrix <img src='http://l.wordpress.com/latex.php?latex=L&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='L' title='L' class='latex' />, and a strictly upper triangular matrix <img src='http://l.wordpress.com/latex.php?latex=U&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='U' title='U' class='latex' />, like</p>
<p><img src='http://l.wordpress.com/latex.php?latex=A%3DL%2BD%2BU&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='A=L+D+U' title='A=L+D+U' class='latex' />.</p>
<p>To solve this, you iterate like so:</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%7B%5Cbf+x%7D+%5Cleftarrow+%28D%2BL%29%5E%7B-1%7D%28%7B%5Cbf+b%7D-U%7B%5Cbf+x%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='{\bf x} \leftarrow (D+L)^{-1}({\bf b}-U{\bf x})' title='{\bf x} \leftarrow (D+L)^{-1}({\bf b}-U{\bf x})' class='latex' />.</p>
<p>This works when <img src='http://l.wordpress.com/latex.php?latex=A&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='A' title='A' class='latex' /> is symmetric positive definite.</p>
<p>Now, what the hell is going on here?  The first observation is that instead of inverting <img src='http://l.wordpress.com/latex.php?latex=A&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='A' title='A' class='latex' />, you can think of minimizing the function</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7D+%7B%5Cbf+x%7D%5ET+A+%7B%5Cbf+x%7D+-+%7B%5Cbf+x%7D%5ET%7B%5Cbf+b%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\frac{1}{2} {\bf x}^T A {\bf x} - {\bf x}^T{\bf b}' title='\frac{1}{2} {\bf x}^T A {\bf x} - {\bf x}^T{\bf b}' class='latex' />.</p>
<p>(It is easy to see that the global minimum of this is found when <img src='http://l.wordpress.com/latex.php?latex=A%7B%5Cbf+x%7D%3D%7B%5Cbf+b%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='A{\bf x}={\bf b}' title='A{\bf x}={\bf b}' class='latex' />.)  Now, how to minimize this function?  One natural way to approach things would be to just iterate through the coordinates of <img src='http://l.wordpress.com/latex.php?latex=%7B%5Cbf+x%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='{\bf x}' title='{\bf x}' class='latex' />, minimizing the function over each one.  Say we want to minimize with respect to <img src='http://l.wordpress.com/latex.php?latex=x_i&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x_i' title='x_i' class='latex' />. So, we are going to make the change <img src='http://l.wordpress.com/latex.php?latex=x_i+%5Cleftarrow+x_i+%2B+d&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x_i \leftarrow x_i + d' title='x_i \leftarrow x_i + d' class='latex' />.  Thus, we want to minimize (with respect to d), this thing:</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7D+%28%7B%5Cbf+x%7D%2Bd+%7B%5Cbf+e_i%7D%29%5ET+A+%28%7B%5Cbf+x%7D%2Bd+%7B%5Cbf+e_i%7D%29+-+%28%7B%5Cbf+x%7D%2Bd+%7B%5Cbf+e_i%7D%29%5ET+%7B%5Cbf+b%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\frac{1}{2} ({\bf x}+d {\bf e_i})^T A ({\bf x}+d {\bf e_i}) - ({\bf x}+d {\bf e_i})^T {\bf b}' title='\frac{1}{2} ({\bf x}+d {\bf e_i})^T A ({\bf x}+d {\bf e_i}) - ({\bf x}+d {\bf e_i})^T {\bf b}' class='latex' /></p>
<p>Expanding this, taking the derivative w.r.t. <img src='http://l.wordpress.com/latex.php?latex=d&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='d' title='d' class='latex' />, and solving gives us (where <img src='http://l.wordpress.com/latex.php?latex=%7B%5Cbf+e_i%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='{\bf e_i}' title='{\bf e_i}' class='latex' /> is the unit vector in the <img src='http://l.wordpress.com/latex.php?latex=i&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='i' title='i' class='latex' />th direction)</p>
<p><img src='http://l.wordpress.com/latex.php?latex=d+%3D+%5Cfrac%7B%7B%5Cbf+e_i%7D%5ET%28%7B%5Cbf+b%7D-A%7B%5Cbf+x%7D%29%7D%7B%7B%5Cbf+e_i%7D%5ETA%7B%5Cbf+e_i%7D%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='d = \frac{{\bf e_i}^T({\bf b}-A{\bf x})}{{\bf e_i}^TA{\bf e_i}}' title='d = \frac{{\bf e_i}^T({\bf b}-A{\bf x})}{{\bf e_i}^TA{\bf e_i}}' class='latex' />,</p>
<p>or, equivalently,</p>
<p><img src='http://l.wordpress.com/latex.php?latex=d+%3D+%5Cfrac%7B1%7D%7BA_%7Bii%7D%7D%28b_i-A_%7Bi%2A%7D%7B%5Cbf+x%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='d = \frac{1}{A_{ii}}(b_i-A_{i*}{\bf x})' title='d = \frac{1}{A_{ii}}(b_i-A_{i*}{\bf x})' class='latex' />.</p>
<p>So, taking the solution at one timestep, <img src='http://l.wordpress.com/latex.php?latex=%7B%5Cbf+x%7D%5Et&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='{\bf x}^t' title='{\bf x}^t' class='latex' />, and solving for the solution <img src='http://l.wordpress.com/latex.php?latex=%7B%5Cbf+x%7D%5E%7Bt%2B1%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='{\bf x}^{t+1}' title='{\bf x}^{t+1}' class='latex' /> at the next timestep, we will have, for the first index,</p>
<p><img src='http://l.wordpress.com/latex.php?latex=x%5E%7Bt%2B1%7D_1+%3D+x%5Et_1+%2B+%5Cfrac%7B1%7D%7BA_%7B11%7D%7D%28b_1-A_%7B1%2A%7Dx%5Et%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x^{t+1}_1 = x^t_1 + \frac{1}{A_{11}}(b_1-A_{1*}x^t)' title='x^{t+1}_1 = x^t_1 + \frac{1}{A_{11}}(b_1-A_{1*}x^t)' class='latex' />.</p>
<p>Meanwhile, for the second index,</p>
<p><img src='http://l.wordpress.com/latex.php?latex=x%5E%7Bt%2B1%7D_2+%3D+x%5Et_2+%2B+%5Cfrac%7B1%7D%7BA_%7B22%7D%7D%28b_2-A_%7B2%2A%7D%28x%5Et-%7B%5Cbf+e_1%7D+x%5Et_1+%2B%7B%5Cbf+e_1%7D+x%5E%7Bt%2B1%7D_1+%29%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x^{t+1}_2 = x^t_2 + \frac{1}{A_{22}}(b_2-A_{2*}(x^t-{\bf e_1} x^t_1 +{\bf e_1} x^{t+1}_1 ))' title='x^{t+1}_2 = x^t_2 + \frac{1}{A_{22}}(b_2-A_{2*}(x^t-{\bf e_1} x^t_1 +{\bf e_1} x^{t+1}_1 ))' class='latex' />.</p>
<p>And, more generally</p>
<p><img src='http://l.wordpress.com/latex.php?latex=x%5E%7Bt%2B1%7D_n+%3D+x%5Et_n+%2B+%5Cfrac%7B1%7D%7BA_%7Bnn%7D%7D%28b_n-A_%7Bn%2A%7D%28%5Csum_%7Bi%3Cn%7D%7B%5Cbf+e_i%7D+x%5E%7Bt%2B1%7D_i+%2B%5Csum_%7Bi%5Cgeq+n%7D%7B%5Cbf+e_i%7D+x%5Et_i+%29%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x^{t+1}_n = x^t_n + \frac{1}{A_{nn}}(b_n-A_{n*}(\sum_{i&lt;n}{\bf e_i} x^{t+1}_i +\sum_{i\geq n}{\bf e_i} x^t_i ))' title='x^{t+1}_n = x^t_n + \frac{1}{A_{nn}}(b_n-A_{n*}(\sum_{i&lt;n}{\bf e_i} x^{t+1}_i +\sum_{i\geq n}{\bf e_i} x^t_i ))' class='latex' />.</p>
<p>Now, all is a matter of cranking through the equations.</p>
<p><img src='http://l.wordpress.com/latex.php?latex=x%5E%7Bt%2B1%7D_n+%3D+x%5Et_n+%2B+%5Cfrac%7B1%7D%7BA_%7Bnn%7D%7D%28b_n-%5Csum_%7Bi%3Cn%7DA_%7Bn+i%7D+x%5E%7Bt%2B1%7D_i+-+%5Csum_%7Bi%5Cgeq+n%7DA_%7Bn+i%7D+x%5Et_i+%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x^{t+1}_n = x^t_n + \frac{1}{A_{nn}}(b_n-\sum_{i&lt;n}A_{n i} x^{t+1}_i - \sum_{i\geq n}A_{n i} x^t_i )' title='x^{t+1}_n = x^t_n + \frac{1}{A_{nn}}(b_n-\sum_{i&lt;n}A_{n i} x^{t+1}_i - \sum_{i\geq n}A_{n i} x^t_i )' class='latex' /></p>
<p><img src='http://l.wordpress.com/latex.php?latex=D+%7B%5Cbf+x%7D%5E%7Bt%2B1%7D+%3D+D+%7B%5Cbf+x%7D%5Et+%2B+%7B%5Cbf+b%7D-+L+x%5E%7Bt%2B1%7D+-+%28D%2BU%29+%7B%5Cbf+x%7D%5Et+&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='D {\bf x}^{t+1} = D {\bf x}^t + {\bf b}- L x^{t+1} - (D+U) {\bf x}^t ' title='D {\bf x}^{t+1} = D {\bf x}^t + {\bf b}- L x^{t+1} - (D+U) {\bf x}^t ' class='latex' /></p>
<p><img src='http://l.wordpress.com/latex.php?latex=%28D%2BL%29+%7B%5Cbf+x%7D%5E%7Bt%2B1%7D+%3D+%7B%5Cbf+b%7D-U+%7B%5Cbf+x%7D%5Et&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='(D+L) {\bf x}^{t+1} = {\bf b}-U {\bf x}^t' title='(D+L) {\bf x}^{t+1} = {\bf b}-U {\bf x}^t' class='latex' /></p>
<p>Which is exactly the iteration we started with.</p>
<p>The fact that this works when <img src='http://l.wordpress.com/latex.php?latex=A&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='A' title='A' class='latex' /> is symmetric positive definite is now obvious: that is exactly when the objective function we started with is strictly convex and bounded below, so coordinate descent has a unique minimum to head toward.</p>
<p>The understanding that Gauss-Seidel is just a convenient way to implement coordinate descent also helps to get intuition for the numerical properties of Gauss-Seidel (e.g. quickly removing &quot;high frequency error&quot;, and very slowly removing &quot;low frequency error&quot;).</p>
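<p>As a rough check of the equivalence (my own sketch, not something from Sokal's notes), here is one sweep written both ways in plain Python: once as forward substitution on the matrix form, and once as exact coordinate-wise minimization of the quadratic. The function names and the small test matrix are made up for the example; the matrix is chosen to be symmetric positive definite.</p>

```python
def gauss_seidel_sweep(A, b, x):
    """One sweep in matrix form: solve (D + L) x_new = b - U x by forward substitution."""
    n = len(b)
    x_new = [0.0] * n
    for i in range(n):
        # The strictly upper part (U) multiplies the old iterate.
        rhs = b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        # The strictly lower part (L) multiplies the entries already updated.
        rhs -= sum(A[i][j] * x_new[j] for j in range(i))
        x_new[i] = rhs / A[i][i]
    return x_new


def coordinate_descent_sweep(A, b, x):
    """One pass of exact minimization of 0.5 x^T A x - x^T b over each coordinate in turn."""
    n = len(b)
    x = list(x)  # copy, so the caller's vector is left untouched
    for i in range(n):
        # Optimal step along e_i, as derived above: d = (b_i - A_{i*} x) / A_ii.
        d = (b[i] - sum(A[i][j] * x[j] for j in range(n))) / A[i][i]
        x[i] += d
    return x


# Hypothetical test problem: a small symmetric positive definite matrix.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

x_gs = [0.0, 0.0, 0.0]
x_cd = [0.0, 0.0, 0.0]
for sweep in range(50):
    x_gs = gauss_seidel_sweep(A, b, x_gs)
    x_cd = coordinate_descent_sweep(A, b, x_cd)

# The two iterates should agree up to floating-point rounding,
# and the residual b - A x should be close to zero.
max_diff = max(abs(p - q) for p, q in zip(x_gs, x_cd))
residual = [b[i] - sum(A[i][j] * x_gs[j] for j in range(3)) for i in range(3)]
print("max difference between the two sweeps:", max_diff)
print("residual after 50 sweeps:", residual)
```

<p>Both functions take the matrix as a list of rows and return a new vector. The coordinate descent version never forms <em>L</em>, <em>D</em> or <em>U</em> at all, which is the point: the triangular solve is just bookkeeping for updating one coordinate at a time.</p>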
  </div>]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/06/10/what-gauss-seidel-is-really-doing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Numbers in bar graphs</title>
		<link>http://justindomke.wordpress.com/2009/05/22/numbers-in-bar-graphs/</link>
		<comments>http://justindomke.wordpress.com/2009/05/22/numbers-in-bar-graphs/#comments</comments>
		<pubDate>Fri, 22 May 2009 21:54:46 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[graphs]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=631</guid>
		<description><![CDATA[I spent far too much time today figuring out how to put numbers in bar graphs, i.e., how to transform this:

into this:

When reading papers, I often wish numbers were included like this, and so resolved a while ago to always do so myself.  Plotting the numbers allows someone else to replot them when doing some [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I spent <em>far</em> too much time today figuring out how to put numbers in bar graphs, i.e., how to transform this:<br />
<img class="alignnone size-full wp-image-632" title="graph_nonums" src="../files/2009/05/graph_nonums.png" alt="graph_nonums" width="400" height="224" /></p>
<p>into this:</p>
<p><img class="alignnone size-full wp-image-634" title="graph_nums" src="http://justindomke.files.wordpress.com/2009/05/graph_nums1.png?w=400&#038;h=224" alt="graph_nums" width="400" height="224" /></p>
<p>When reading papers, I often wish numbers were included like this, and so resolved a while ago to always do so myself.  Plotting the numbers allows someone else to replot them when doing some sort of a comparison.  (Without needing to implement your algorithm, ask you for your doubtless long-lost data, or (ugh) physically measure the lengths of your bars.)  However, now seeing how tedious this can be, I understand why it is rarely done.</p>
<p>In any case, I wrote a function <a href="http://www.cs.umd.edu/~domke/code/add_bar_nums.m">add_bar_nums.m</a> that you can run after doing a bar plot that will add the numbers like above.  There are options for rotating the text, changing the display format, etc.</p>
  </div>]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/05/22/numbers-in-bar-graphs/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>

		<media:content url="../files/2009/05/graph_nonums.png" medium="image">
			<media:title type="html">graph_nonums</media:title>
		</media:content>

		<media:content url="http://justindomke.files.wordpress.com/2009/05/graph_nums1.png" medium="image">
			<media:title type="html">graph_nums</media:title>
		</media:content>
	</item>
		<item>
		<title>A simple explanation of reverse-mode automatic differentiation</title>
		<link>http://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/</link>
		<comments>http://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/#comments</comments>
		<pubDate>Tue, 24 Mar 2009 16:18:12 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[calculus]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[scheme]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=548</guid>
		<description><![CDATA[My previous rant about automatic differentiation generated several requests for an explanation of how it works.  This can be confusing because there are different types of automatic differentiation (forward-mode, reverse-mode, hybrids.)  This is my attempt to explain the basic idea of reverse-mode autodiff as simply as possible.
Reverse-mode automatic differentiation is most attractive when [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=justindomke.wordpress.com&blog=4009146&post=548&subd=justindomke&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>My <a href="http://justindomke.wordpress.com/2009/02/17/automatic-differentiation-the-most-criminally-underused-tool-in-the-potential-machine-learning-toolbox/">previous rant</a> about automatic differentiation generated several requests for an explanation of how it works.  This can be confusing because there are different types of automatic differentiation (forward-mode, reverse-mode, hybrids.)  This is my attempt to explain the basic idea of <strong>reverse-mode autodiff</strong> as simply as possible.</p>
<p>Reverse-mode automatic differentiation is most attractive when you have a function that takes <img src='http://l.wordpress.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='n' title='n' class='latex' /> inputs <img src='http://l.wordpress.com/latex.php?latex=x_1%2Cx_2%2C...%2Cx_n&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x_1,x_2,...,x_n' title='x_1,x_2,...,x_n' class='latex' />, and produces a single output <img src='http://l.wordpress.com/latex.php?latex=x_N&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x_N' title='x_N' class='latex' />.  We want the derivatives of that function, <img src='http://l.wordpress.com/latex.php?latex=%5Cdisplaystyle%7B%5Cfrac%7Bd+x_N%7D%7Bd+x_i%7D%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\displaystyle{\frac{d x_N}{d x_i}}' title='\displaystyle{\frac{d x_N}{d x_i}}' class='latex' />, for all <img src='http://l.wordpress.com/latex.php?latex=i&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='i' title='i' class='latex' />.</p>
<p><strong>Point #1</strong>:  Any differentiable algorithm can be translated into a sequence of assignments of basic operations.</p>
<p><strong>Forward-Prop</strong></p>
<p>for <img src='http://l.wordpress.com/latex.php?latex=i%3Dn%2B1%2Cn%2B2%2C...%2CN&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='i=n+1,n+2,...,N' title='i=n+1,n+2,...,N' class='latex' /></p>
<p style="padding-left:30px;"><img src='http://l.wordpress.com/latex.php?latex=x_i+%5Cleftarrow+f_i%28%7B%5Cbf+x%7D_%7B%5Cpi%28i%29%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x_i \leftarrow f_i({\bf x}_{\pi(i)})' title='x_i \leftarrow f_i({\bf x}_{\pi(i)})' class='latex' /></p>
<p>Here, each function <img src='http://l.wordpress.com/latex.php?latex=f_i&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='f_i' title='f_i' class='latex' /> is some very basic operation (e.g. addition, multiplication, a logarithm) and <img src='http://l.wordpress.com/latex.php?latex=%5Cpi%28i%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\pi(i)' title='\pi(i)' class='latex' /> denotes the set of &#8220;parents&#8221; of <img src='http://l.wordpress.com/latex.php?latex=x_i&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x_i' title='x_i' class='latex' />.  So, for example, if <img src='http://l.wordpress.com/latex.php?latex=%5Cpi%287%29%3D%282%2C5%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\pi(7)=(2,5)' title='\pi(7)=(2,5)' class='latex' /> and <img src='http://l.wordpress.com/latex.php?latex=f_7+%3D+%5Ctext%7Badd%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='f_7 = \text{add}' title='f_7 = \text{add}' class='latex' />, then <img src='http://l.wordpress.com/latex.php?latex=x_7+%3D+x_2+%2B+x_5&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x_7 = x_2 + x_5' title='x_7 = x_2 + x_5' class='latex' />.</p>
<p>It would be extremely tedious, of course, to actually write an algorithm in this &#8220;expression graph&#8221; form.  So, autodiff tools create this representation automatically from high-level source code.</p>
<p><strong>Point #2</strong>: Given an algorithm in the previous format, it is easy to compute its derivatives.</p>
<p>The essential point here is just the application of the chain rule.</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Cdisplaystyle%7B+%5Cfrac%7Bd+x_N%7D%7Bd+x_i%7D+%3D+%5Csum_%7Bj%3Ai%5Cin+%5Cpi%28j%29%7D+%5Cfrac%7Bd+x_N%7D%7Bd+x_j%7D%5Cfrac%7B%5Cpartial+x_j%7D%7B%5Cpartial+x_i%7D%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\displaystyle{ \frac{d x_N}{d x_i} = \sum_{j:i\in \pi(j)} \frac{d x_N}{d x_j}\frac{\partial x_j}{\partial x_i}}' title='\displaystyle{ \frac{d x_N}{d x_i} = \sum_{j:i\in \pi(j)} \frac{d x_N}{d x_j}\frac{\partial x_j}{\partial x_i}}' class='latex' /></p>
<p>Applying this, we can compute all the derivatives in reverse order.</p>
<p><strong>Back-Prop</strong></p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Cdisplaystyle%7B+%5Cfrac%7Bd+x_N%7D%7Bd+x_N%7D+%5Cleftarrow+1%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\displaystyle{ \frac{d x_N}{d x_N} \leftarrow 1}' title='\displaystyle{ \frac{d x_N}{d x_N} \leftarrow 1}' class='latex' /></p>
<p>for <img src='http://l.wordpress.com/latex.php?latex=i%3DN-1%2CN-2%2C...%2C1&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='i=N-1,N-2,...,1' title='i=N-1,N-2,...,1' class='latex' /></p>
<p style="padding-left:30px;"><img src='http://l.wordpress.com/latex.php?latex=%5Cdisplaystyle%7B+%5Cfrac%7Bd+x_N%7D%7Bd+x_i%7D+%5Cleftarrow+%5Csum_%7Bj%3Ai%5Cin+%5Cpi%28j%29%7D+%5Cfrac%7Bd+x_N%7D%7Bd+x_j%7D%5Cfrac%7B%5Cpartial+f_j%7D%7B%5Cpartial+x_i%7D%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\displaystyle{ \frac{d x_N}{d x_i} \leftarrow \sum_{j:i\in \pi(j)} \frac{d x_N}{d x_j}\frac{\partial f_j}{\partial x_i}}' title='\displaystyle{ \frac{d x_N}{d x_i} \leftarrow \sum_{j:i\in \pi(j)} \frac{d x_N}{d x_j}\frac{\partial f_j}{\partial x_i}}' class='latex' /></p>
<p>That&#8217;s it!  Just create an expression graph representation of the algorithm and differentiate each basic operation <img src='http://l.wordpress.com/latex.php?latex=f_i&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='f_i' title='f_i' class='latex' /> in reverse order using calc 101 rules.</p>
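<p>As a rough illustration of the two loops above, here is a minimal C sketch. The <code>Node</code> layout, the op codes, and the function names are invented for this example, indices start at 0 rather than 1, only addition and multiplication are supported, and nodes are assumed to be stored so that parents come before children. The backward loop pushes each derivative to the parents of node j instead of summing over the children of node i; the accumulated sums are the same as in Back-Prop.</p>

```c
#include <assert.h>

/* Hypothetical op codes for a tiny expression graph. */
typedef enum { OP_INPUT, OP_ADD, OP_MUL } Op;

typedef struct {
  Op op;       /* the basic operation f_i */
  int pa, pb;  /* indices of the two parents pi(i) */
} Node;

/* Forward-Prop: fill x[n..N-1] from the inputs x[0..n-1]. */
void forward(const Node *g, int n, int N, double *x) {
  assert(0 < n && n <= N);
  for (int i = n; i < N; i++) {
    double a = x[g[i].pa], b = x[g[i].pb];
    switch (g[i].op) {
      case OP_ADD: x[i] = a + b; break;
      case OP_MUL: x[i] = a * b; break;
      case OP_INPUT: break;  /* inputs are never recomputed */
    }
  }
}

/* Back-Prop: on return, d[i] holds d x[N-1] / d x[i] for every i.
   When node j is visited, d[j] is already complete, because all of
   its children have larger indices and were visited earlier. */
void backward(const Node *g, int n, int N, const double *x, double *d) {
  assert(0 < n && n <= N);
  for (int i = 0; i < N; i++) d[i] = 0.0;
  d[N - 1] = 1.0;
  for (int j = N - 1; j >= n; j--) {
    int a = g[j].pa, b = g[j].pb;
    switch (g[j].op) {
      case OP_ADD: d[a] += d[j];        d[b] += d[j];        break;
      case OP_MUL: d[a] += d[j] * x[b]; d[b] += d[j] * x[a]; break;
      case OP_INPUT: break;
    }
  }
}
```

<p>For f = (x0*x1 + x0)*x1 with inputs x0 = 2 and x1 = 3, the graph is x2 = x0*x1, x3 = x2 + x0, x4 = x3*x1. Calling <code>forward</code> and then <code>backward</code> with n = 2 and N = 5 leaves d[0] = 12 and d[1] = 14, matching the hand-computed derivatives x1*x1 + x1 and 2*x0*x1 + x0.</p>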
<p><strong>Other stuff</strong>:</p>
<ul>
<li>No, this is not the same thing as symbolic differentiation.  This should be obvious:  Most algorithms don&#8217;t even have simple symbolic representations.  And, even if yours does,  it is possible that it &#8220;explodes&#8221; upon symbolic differentiation.  As a contrived example, try computing the derivative of <img src='http://l.wordpress.com/latex.php?latex=%5Cexp%28%5Cexp%28%5Cexp%28%5Cexp%28%5Cexp%28%5Cexp%28x%29%29%29%29%29%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\exp(\exp(\exp(\exp(\exp(\exp(x))))))' title='\exp(\exp(\exp(\exp(\exp(\exp(x))))))' class='latex' />.</li>
</ul>
<ul>
<li>The back-prop step has the same order of complexity as the forward-prop step: each basic operation is visited once in each pass.</li>
<li>In machine learning, functions from n inputs to one output come up all the time:  The n inputs are parameters defining a model, and the single output is a loss measuring how well the model fits the training data.  The gradient can be fed into an optimization routine to fit the model to data.</li>
<li>There are two common ways of implementing this:</li>
</ul>
<ol>
<li><span style="text-decoration:underline;"> Operator Overloading</span>.  One can create a new variable type that has all the common operations of numeric types, which automatically creates an expression graph when the program is run.    One can then call the back-prop routine on this expression graph.  Hence,  one does not need to modify the program, just replace each numeric type with this new type.  This is fairly easy to implement, and very easy to use.  <a href="http://www.cs.sandia.gov/~dmgay/">David Gay</a>&#8217;s RAD toolbox for C++ is a good example, which I use all the time.<br />
The major downside of operator overloading is efficiency:  current compilers will not optimize the backprop code.  Essentially, this  step is interpreted.  Thus, one finds in practice a non-negligible overhead of, say, 2-15 times the complexity of the original algorithm using a native numeric type.  (The overhead depends on how much the original code benefits from compiler optimizations.)</li>
<li><span style="text-decoration:underline;">Source code transformation</span>. Alternatively, one could write a program that examines the source code of the original program, and transforms this into source code computing the derivatives.  This is much harder to implement, unless one is using a language like Lisp with very uniform syntax.  However, because the backprop source code produced is then optimized like normal code, it offers the potential of zero overhead compared with manually computed derivatives.</li>
</ol>
<ul>
<li>If it isn&#8217;t convenient to use automatic differentiation, one can also use &#8220;manual automatic differentiation&#8221;.  That is, to compute the derivatives, just attack each intermediate value your algorithm computes, in reverse order.</li>
<li>Some of the most interesting work on autodiff comes from <a href="http://www.bcl.hamilton.ie/~qobi/">Pearlmutter and Siskind</a>, who have produced a system called Stalingrad for a subset of Scheme that allows for crazy things like taking derivatives of code that is itself taking derivatives.  (So you can, for example, produce Hessians.)  I think they wouldn&#8217;t mind hearing from potential users.</li>
</ul>
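<p>To make the &#8220;manual&#8221; approach from the list above concrete, here is a small C sketch. The function f(a, b) = (a*b + a) / b and the name <code>f_and_grad</code> are made up for illustration; the point is only the pattern of naming each intermediate value on the way forward and then visiting them last to first.</p>

```c
#include <assert.h>

/* Hypothetical example: f(a, b) = (a*b + a) / b, with its gradient
   written by hand.  Requires b != 0. */
double f_and_grad(double a, double b, double *da, double *db) {
  assert(da && db && b != 0.0);

  /* Forward pass: name every intermediate value. */
  double t1 = a * b;
  double t2 = t1 + a;
  double y  = t2 / b;

  /* Reverse pass: visit the intermediates last to first, adding each
     one's contribution to the derivatives of the values it was built from. */
  double dy  = 1.0;
  double dt2 = dy / b;               /* y  = t2 / b, w.r.t. t2 */
  *db        = -dy * t2 / (b * b);   /* y  = t2 / b, w.r.t. b  */
  double dt1 = dt2;                  /* t2 = t1 + a, w.r.t. t1 */
  *da        = dt2;                  /* t2 = t1 + a, w.r.t. a  */
  *da       += dt1 * b;              /* t1 = a * b,  w.r.t. a  */
  *db       += dt1 * a;              /* t1 = a * b,  w.r.t. b  */
  return y;
}
```

<p>Since f simplifies to a + a/b, the results can be checked by hand: at a = 2, b = 4 the derivatives are 1 + 1/b = 1.25 and -a/b^2 = -0.125.</p>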
</div>]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Prediction markets, monte carlo, and loss functions</title>
		<link>http://justindomke.wordpress.com/2009/02/28/prediction-markets-monte-carlo-and-loss-functions/</link>
		<comments>http://justindomke.wordpress.com/2009/02/28/prediction-markets-monte-carlo-and-loss-functions/#comments</comments>
		<pubDate>Sat, 28 Feb 2009 20:48:42 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=488</guid>
		<description><![CDATA[There has been some discussion lately about how to evaluate the performance of different prediction markets (like Intrade), and predictors (like Nate Silver) at guessing the winners of elections, or Oscars.  Who is making the best predictions?  If everyone simply made a guess for the winner of each state or award, we could [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=justindomke.wordpress.com&blog=4009146&post=488&subd=justindomke&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>There has been <a href="http://www.stat.columbia.edu/~cook/movabletype/archives/2009/02/predictions-tha.html#more">some</a> <a href="http://behind-the-enemy-lines.blogspot.com/2009/02/why-failing-to-fail-is-failure.html">discussion</a> lately about how to evaluate the performance of different prediction markets (like <a href="http://www.intrade.com">Intrade</a>), and predictors (like <a href="http://www.fivethirtyeight.com/2009/02/for-entertainment-purposes-only.html">Nate Silver</a>) at guessing the winners of elections, or Oscars.  Who is making the best predictions?  If everyone simply made a <em>guess</em> for the winner of each state or award, we could evaluate performance easily:  whoever guesses the most outcomes correctly is making the best predictions.  But what do we do if the predictors provide us full probabilities of the different outcomes?  Intuitively, someone who gives 99% probability of an event that doesn&#8217;t occur is much &#8220;more wrong&#8221; than someone who gives only a 51% probability.</p>
<pre>
                        Nate    Intrade     Winner
<strong>Best Picture</strong>
Slumdog                .990      .903        X
Milk                   .010      .040
Frost/Nixon                      .013
Benjamin Button                  .080
The Reader                       .030

<strong>Best Director</strong>
Danny Boyle            .997      .900        X
Gus Van Sant           .001      .059
David Fincher          .001      .050
Ron Howard
Stephen Daldry

<strong>Lead Actress</strong>
Kate Winslet           .676      .850        X
Meryl Streep           .324      .150
Anne Hathaway                    .044
Melissa Leo
Angelina Jolie         

<strong>Lead Actor</strong>
Mickey Rourke          .711      .700
Sean Penn              .190      .335        X
Brad Pitt              .059
Frank Langella         .034      .049
Richard Jenkins        .005

<strong>Supporting Actress</strong>
Taraji P. Henson       .510      .190
Penélope Cruz          .246      .588        X
Viola Davis            .116      .199
Amy Adams              .116
Marisa Tomei           .012

<strong>Supporting Actor</strong>
Heath Ledger           .858      .950        X
Josh Brolin            .050      .050
Philip Seymour Hoffman .044
Michael Shannon        .036
Robert Downey Jr.      .012</pre>
<p>Let us think about this situation from the perspective of &#8220;bent coin predictors&#8221;.  Let&#8217;s say we have a pool of 100 bent coins, each of which has some unknown probability <img src='http://l.wordpress.com/latex.php?latex=p_c%28%5Ctext%7Bheads%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='p_c(\text{heads})' title='p_c(\text{heads})' class='latex' /> of ending up heads.  We have a number of people who reckon they can estimate that probability by looking at the coin.  Denote the prediction of guesser <img src='http://l.wordpress.com/latex.php?latex=i&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='i' title='i' class='latex' /> for coin <img src='http://l.wordpress.com/latex.php?latex=c&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='c' title='c' class='latex' /> by <img src='http://l.wordpress.com/latex.php?latex=g_%7Bi%2Cc%7D%28%5Ctext%7Bheads%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='g_{i,c}(\text{heads})' title='g_{i,c}(\text{heads})' class='latex' />.  After predictions have been made, we flip all the coins.  Now, how do we find the best guesser?</p>
<p>What we <em>want</em> is to measure how close <img src='http://l.wordpress.com/latex.php?latex=g_%7Bi%2Cc%7D%28%5Ctext%7Bheads%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='g_{i,c}(\text{heads})' title='g_{i,c}(\text{heads})' class='latex' /> is to <img src='http://l.wordpress.com/latex.php?latex=p_c%28%5Ctext%7Bheads%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='p_c(\text{heads})' title='p_c(\text{heads})' class='latex' />.  Since we don&#8217;t know <img src='http://l.wordpress.com/latex.php?latex=p_c%28%5Ctext%7Bheads%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='p_c(\text{heads})' title='p_c(\text{heads})' class='latex' />, this seems impossible.  In some sense, it <em>is</em> impossible, but let us fantasize for a moment.  Suppose that instead of all the coin flips, someone actually revealed all the true probabilities <img src='http://l.wordpress.com/latex.php?latex=p_c%28%5Ctext%7Bheads%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='p_c(\text{heads})' title='p_c(\text{heads})' class='latex' /> to us.  Then, what would we do?  There is no single best answer.  One reasonable way to measure the quality of the guess would be the sum-of-squares difference</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%28p_c%28%5Ctext%7Bheads%7D%29+-+g_%7Bi%2Cc%7D%28%5Ctext%7Bheads%7D%29%29%5E2+%2B+%28p_c%28%5Ctext%7Btails%7D%29+-+g_%7Bi%2Cc%7D%28%5Ctext%7Btails%7D%29%29%5E2&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='(p_c(\text{heads}) - g_{i,c}(\text{heads}))^2 + (p_c(\text{tails}) - g_{i,c}(\text{tails}))^2' title='(p_c(\text{heads}) - g_{i,c}(\text{heads}))^2 + (p_c(\text{tails}) - g_{i,c}(\text{tails}))^2' class='latex' /></p>
<p>or, equivalently,</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Csum_%5Ctext%7Brez%7D+%28p_c%28%5Ctext%7Brez%7D%29+-+g_%7Bi%2Cc%7D%28%5Ctext%7Brez%7D%29%29%5E2&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\sum_\text{rez} (p_c(\text{rez}) - g_{i,c}(\text{rez}))^2' title='\sum_\text{rez} (p_c(\text{rez}) - g_{i,c}(\text{rez}))^2' class='latex' />.</p>
<p><strong>MONTE-CARLO APPROXIMATIONS</strong></p>
<p>Now, of course, we can&#8217;t calculate either of the above quantities.  We only have a single result from flipping each coin.  The central idea here is that we can use what is known as a <a href="http://en.wikipedia.org/wiki/Monte_Carlo_integration">Monte-Carlo</a> approximation.  This is a very simple idea.  Suppose we would like to calculate the expected value of some function <img src='http://l.wordpress.com/latex.php?latex=f&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='f' title='f' class='latex' />.</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Csum_x+p%28x%29+f%28x%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\sum_x p(x) f(x)' title='\sum_x p(x) f(x)' class='latex' /></p>
<p>Now, suppose that we don&#8217;t know <img src='http://l.wordpress.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='p' title='p' class='latex' />, but we can <em>simulate</em> <img src='http://l.wordpress.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='p' title='p' class='latex' />.  That is, by running some sort of experiment, we can get some random value <img src='http://l.wordpress.com/latex.php?latex=x_n&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x_n' title='x_n' class='latex' />, whose probability is <img src='http://l.wordpress.com/latex.php?latex=p%28x_n%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='p(x_n)' title='p(x_n)' class='latex' />.  If we draw N such values, then we can approximate the above by:</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Cfrac%7B1%7D%7BN%7D+%5Csum_%7Bn%3D1%7D%5EN+f%28x_n%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\frac{1}{N} \sum_{n=1}^N f(x_n)' title='\frac{1}{N} \sum_{n=1}^N f(x_n)' class='latex' /></p>
<p>As an example, suppose we want to know the average amount that a slot machine pays out.  We could approximate this by playing the machine 1000 times, and calculating the average observed payout.</p>
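<p>The slot machine example can be sketched in a few lines of C. Everything specific here is an assumption made for illustration: the payout distribution (0 with probability 0.5, 1 with probability 0.3, 4 with probability 0.2, so the true mean is 1.1), the function names, and the small linear congruential generator used so that runs are reproducible.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Small 64-bit linear congruential generator with a fixed seed,
   so that every run simulates the same sequence of plays. */
uint64_t lcg_state = 42;

/* Uniform draw in [0, 1) from the top 53 bits of the state. */
double uniform01(void) {
  lcg_state = lcg_state * 6364136223846793005ULL + 1442695040888963407ULL;
  return (double)(lcg_state >> 11) / 9007199254740992.0;  /* 2^53 */
}

/* Hypothetical machine: pays 0 w.p. 0.5, 1 w.p. 0.3, 4 w.p. 0.2,
   so the true expected payout is 0.3*1 + 0.2*4 = 1.1. */
double play_once(void) {
  double u = uniform01();
  if (u < 0.5) return 0.0;
  if (u < 0.8) return 1.0;
  return 4.0;
}

/* Monte-Carlo estimate: average the payout over N simulated plays. */
double average_payout(int N) {
  assert(N > 0);
  double total = 0.0;
  for (int n = 0; n < N; n++) total += play_once();
  return total / N;
}
```

<p>The estimator never looks at the probabilities themselves, only at simulated outcomes. With a large number of plays, say <code>average_payout(100000)</code>, the estimate should land close to 1.1, and with only a handful of plays it will be very noisy.</p>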
<p><strong>LOSS FUNCTIONS</strong></p>
<p>How can we apply loss functions to prediction markets?  Notice that we can make the following simplification</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Csum_%5Ctext%7Brez%7D+%28p_c%28%5Ctext%7Brez%7D%29+-+g_%7Bi%2Cc%7D%28%5Ctext%7Brez%7D%29%29%5E2+%3D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\sum_\text{rez} (p_c(\text{rez}) - g_{i,c}(\text{rez}))^2 =' title='\sum_\text{rez} (p_c(\text{rez}) - g_{i,c}(\text{rez}))^2 =' class='latex' /></p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Csum_%5Ctext%7Brez%7D+p_c%28%5Ctext%7Brez%7D%29%5E2+-+2+%5Csum_%5Ctext%7Brez%7D+p_c%28%5Ctext%7Brez%7D%29+g_%7Bi%2Cc%7D%28%5Ctext%7Brez%7D%29+%2B+%5Csum_%5Ctext%7Brez%7D+g_%7Bi%2Cc%7D%28%5Ctext%7Brez%7D%29%5E2&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\sum_\text{rez} p_c(\text{rez})^2 - 2 \sum_\text{rez} p_c(\text{rez}) g_{i,c}(\text{rez}) + \sum_\text{rez} g_{i,c}(\text{rez})^2' title='\sum_\text{rez} p_c(\text{rez})^2 - 2 \sum_\text{rez} p_c(\text{rez}) g_{i,c}(\text{rez}) + \sum_\text{rez} g_{i,c}(\text{rez})^2' class='latex' />.</p>
<p>The first term, <img src='http://l.wordpress.com/latex.php?latex=%5Csum_%5Ctext%7Brez%7D+p_c%28%5Ctext%7Brez%7D%29%5E2&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\sum_\text{rez} p_c(\text{rez})^2' title='\sum_\text{rez} p_c(\text{rez})^2' class='latex' />, is hard to estimate, but fortunately we don&#8217;t need to, because it doesn&#8217;t depend on the predictions.  If we ignore this term for all guessers, we still have a valid <em>relative</em> rating of the guessers.</p>
<p>As for the second term, we can apply the Monte-Carlo technique from above.  Let <img src='http://l.wordpress.com/latex.php?latex=%5Ctext%7Brez%7D_c&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\text{rez}_c' title='\text{rez}_c' class='latex' /> denote the outcome of flipping coin <img src='http://l.wordpress.com/latex.php?latex=c&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='c' title='c' class='latex' />.  We make the approximation</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Csum_%5Ctext%7Brez%7D+p_c%28%5Ctext%7Brez%7D%29+g_%7Bi%2Cc%7D%28%5Ctext%7Brez%7D%29+%5Capprox+g_%7Bi%2Cc%7D%28%5Ctext%7Brez%7D_c%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\sum_\text{rez} p_c(\text{rez}) g_{i,c}(\text{rez}) \approx g_{i,c}(\text{rez}_c)' title='\sum_\text{rez} p_c(\text{rez}) g_{i,c}(\text{rez}) \approx g_{i,c}(\text{rez}_c)' class='latex' />.</p>
<p>This is a <em>very noisy</em> approximation.  However, it is <em>unbiased</em>:  If we average over many coins, we will get something very close to the true value.  (This works exactly the same way that we could approximate the average payout of slot machines in a casino by playing 1000 random machines and then averaging the payouts.)</p>
<p>Happily, the final term,</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Csum_%5Ctext%7Brez%7D+g_%7Bi%2Cc%7D%28%5Ctext%7Brez%7D%29%5E2&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\sum_\text{rez} g_{i,c}(\text{rez})^2' title='\sum_\text{rez} g_{i,c}(\text{rez})^2' class='latex' />,</p>
<p>can be computed exactly, since the guessers have provided <img src='http://l.wordpress.com/latex.php?latex=g_%7Bi%2Cc%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='g_{i,c}' title='g_{i,c}' class='latex' />.</p>
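<p>Putting the last two terms together gives a per-event score that needs only the guesses and the observed outcome. A minimal C sketch follows; the function name and the array layout (one probability per possible outcome) are assumptions for illustration.</p>

```c
#include <assert.h>

/* Estimated quadratic loss for one event, with the constant
   sum_rez p_c(rez)^2 term dropped:
       -2 * g[winner] + sum_k g[k]^2
   g holds one guesser's probabilities over the K possible outcomes,
   and winner is the index of the outcome that actually occurred. */
double quadratic_loss(const double *g, int K, int winner) {
  assert(K > 0 && 0 <= winner && winner < K);
  double sum_sq = 0.0;
  for (int k = 0; k < K; k++) sum_sq += g[k] * g[k];
  return -2.0 * g[winner] + sum_sq;
}
```

<p>For a guesser who says a coin is 90% heads, this gives -1.8 + 0.82 = -0.98 if heads comes up and -0.2 + 0.82 = 0.62 if tails does. Averaging the score over many coins is what makes the noisy single-outcome estimate usable.</p>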
<p><strong>THE OSCARS</strong></p>
<p>Now, let&#8217;s apply the above theory to the Oscar predictions.</p>
<p>Taking a zero for the empty entries, and normalizing each prediction, we obtain the scores:</p>
<p><strong>Quadratic loss</strong>:</p>
<pre>538:     -0.6235
intrade: -0.7925</pre>
<p>Since these are losses, lower is better, so this is a clear win for intrade.</p>
<p><strong>OTHER LOSS FUNCTIONS</strong></p>
<p>Another reasonable way to measure errors would be the KL-divergence between <img src='http://l.wordpress.com/latex.php?latex=p_c&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='p_c' title='p_c' class='latex' /> and <img src='http://l.wordpress.com/latex.php?latex=g_%7Bi%2Cc%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='g_{i,c}' title='g_{i,c}' class='latex' /></p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Csum_%5Ctext%7Brez%7D+p_c%28%5Ctext%7Brez%7D%29+%5Clog%28p_c%28%5Ctext%7Brez%7D%29%2Fg_%7Bi%2Cc%7D%28%5Ctext%7Brez%7D%29+%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\sum_\text{rez} p_c(\text{rez}) \log(p_c(\text{rez})/g_{i,c}(\text{rez}) )' title='\sum_\text{rez} p_c(\text{rez}) \log(p_c(\text{rez})/g_{i,c}(\text{rez}) )' class='latex' />.</p>
<p>Again, dropping a constant term, we get the loss</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Csum_%5Ctext%7Brez%7D+-p_c%28%5Ctext%7Brez%7D%29+%5Clog+g_%7Bi%2Cc%7D%28%5Ctext%7Brez%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\sum_\text{rez} -p_c(\text{rez}) \log g_{i,c}(\text{rez})' title='\sum_\text{rez} -p_c(\text{rez}) \log g_{i,c}(\text{rez})' class='latex' />.</p>
<p>For historical reasons, let&#8217;s call this the &#8220;conditional likelihood&#8221; loss.</p>
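<p>How this loss is estimated from a single observed outcome is not spelled out here. A sketch of the steps, assuming the same one-sample Monte-Carlo approximation used for the quadratic loss, with rez_c the observed outcome of event c:</p>

```latex
\begin{align*}
\sum_\text{rez} p_c(\text{rez}) \log\frac{p_c(\text{rez})}{g_{i,c}(\text{rez})}
  &= \underbrace{\sum_\text{rez} p_c(\text{rez}) \log p_c(\text{rez})}_{\text{does not depend on the guess}}
   \;-\; \sum_\text{rez} p_c(\text{rez}) \log g_{i,c}(\text{rez}) \\
\sum_\text{rez} -p_c(\text{rez}) \log g_{i,c}(\text{rez})
  &\approx -\log g_{i,c}(\text{rez}_c)
\end{align*}
```

<p>Under this assumption, each guesser is charged the negative log of the probability they assigned to what actually happened, so a confident wrong guess is penalized heavily.</p>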
<p><strong>Conditional likelihood loss</strong>:</p>
<pre>538:     0.6032
intrade: 0.3699</pre>
<p>Again, intrade looks much better.  Then again, of course, maybe intrade was just lucky.  Six predictions isn&#8217;t a large number to average over.</p>
<p>I&#8217;d love to see this type of analysis applied to the state-by-state results of the 2008 elections.</p>
<p>Matlab code (probably also works with Octave) is <a href="http://www.cs.umd.edu/~domke/calcoscars.m">here</a>.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/02/28/prediction-markets-monte-carlo-and-loss-functions/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>The Stalin Compiler</title>
		<link>http://justindomke.wordpress.com/2009/02/23/the-stalin-compiler/</link>
		<comments>http://justindomke.wordpress.com/2009/02/23/the-stalin-compiler/#comments</comments>
		<pubDate>Mon, 23 Feb 2009 05:19:45 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[compilers]]></category>
		<category><![CDATA[lisp]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[scheme]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=471</guid>
		<description><![CDATA[Stalin is a questionably named Scheme compiler written by Jeffrey Mark Siskind that can supposedly create binaries as fast as or faster than Fortran or C for numerical problems.  To test this, I tried creating a simple program to numerically integrate sqrt(x) from 0 to 10000.  To make things interesting, I used a manual [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=justindomke.wordpress.com&blog=4009146&post=471&subd=justindomke&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><a href="http://en.wikipedia.org/wiki/Stalin_(Scheme_implementation)">Stalin</a> is a questionably named Scheme compiler written by <a href="http://cobweb.ecn.purdue.edu/~qobi/">Jeffrey Mark Siskind</a> that can supposedly create binaries as fast as or faster than Fortran or C for numerical problems.  To test this, I tried creating a simple program to numerically integrate <img src='http://l.wordpress.com/latex.php?latex=%5Csqrt%7Bx%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\sqrt{x}' title='\sqrt{x}' class='latex' /> from 0 to 10000.  To make things interesting, I used a hand-written Newton&#8217;s method implementation of <strong>sqrt</strong> from <a href="http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-10.html#%_sec_1.1.7">SICP</a>.  The integration is done by a simple tail-recursive function.</p>
<p>The Scheme code is very pretty:</p>
<pre>(define (sqrt-iter guess x)
  (if (good-enough? guess x)
      guess
      (sqrt-iter (improve guess x)
                 x)))

(define (improve guess x)
  (average guess (/ x guess)))

(define (average x y)
  (/ (+ x y) 2))

(define (good-enough? guess x)
  (&lt; (abs (- (* guess guess) x)) 0.001))

(define (mysqrt x)
  (sqrt-iter 1.0 x))

(define (int x acc step)
  (if (&gt;= x 10000.0)
      acc
      (int (+ x step)
           (+ acc (* step (mysqrt x)))
           step)))

(write (int 0.0 0.0 .001))</pre>
<p>I then converted this to C.  It is pretty much a transliteration, except it uses a for loop instead of recursion to accumulate the values:</p>
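<pre>#include &lt;math.h&gt;   /* fabs */
#include &lt;stdio.h&gt;  /* printf */

/* Prototypes, so every function is declared before it is called. */
int good_enough(double guess, double x);
double improve(double guess, double x);
double average(double x, double y);

/* Newton iteration: refine guess until guess*guess is close to x. */
double sqrt_iter(double guess, double x){
  if( good_enough(guess, x))
    return guess;
  else
    return sqrt_iter( improve(guess,x), x);
}

/* One Newton step for sqrt(x). */
double improve(double guess, double x){
  return average(guess, x/guess);
}

double average(double x, double y){
  return (x+y)/2;
}

/* fabs, not abs: abs takes an int and would truncate the difference. */
int good_enough(double guess, double x){
  if (fabs(guess*guess-x)&lt;.001)
    return 1;
  return 0;
}

double mysqrt(double x){
  return sqrt_iter(1.0, x);
}

/* Accumulate mysqrt(x)*step for x = 0, step, 2*step, ... up to 10000. */
int main(void){
  double rez = 0;
  double x;
  double step = .001;
  for(x=0; x&lt;= 10000; x+=step)
    rez += mysqrt(x)*step;
  printf("%f\n", rez);
  return 0;
}</pre>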
<p>I compiled the two programs via:</p>
<pre>stalin -On -d -copt -O3 int.sc
gcc -O3 int.c</pre>
<p>The results are:<br />
<strong>Stalin</strong>: 1.90s<br />
<strong>gcc</strong>: 3.61s</p>
<p>If you declare every function inline in C, you get:</p>
<p><strong>gcc-inline</strong>: 3.28s</p>
<p>For reference, I also compiled the code with <a href="http://www.call-with-current-continuation.org/">chicken scheme</a>, using the <code>-optimize-level 3</code> compiler flag.</p>
<p><strong>Chicken Scheme</strong>: 27.9s.</p>
<p>Some issues:</p>
<ul>
<li>The Scheme code is more appealing on the whole, though it suffers from a lack of infix notation:  <code>(&lt; (abs (- (* guess guess) x)) 0.001)</code> is definitely harder to read than <code>abs(guess*guess-x)&lt;.001</code>.  I wonder if more familiarity with prefix code would reduce this.</li>
<li>All three methods give slightly different results, which is worrisome.  In particular, Stalin is correct to 6 digits, whilst gcc and chicken are correct to 7.</li>
<li>Stalin apparently does not include macros.  One would think it would be easy to use the macro system from a different Scheme compiler, and then send the expanded source to Stalin for final compilation.</li>
<li>Stalin is extremely slow to compile.  In principle this isn&#8217;t a big deal: you can debug using a different Scheme compiler.  Still, Stalin seems to be somewhat less robust to edge cases than Chicken Scheme, at least.</li>
<li>It is amazing that Scheme code with no type declarations can beat C by almost a factor of 2.</li>
<li>Though in principle Stalin produces intermediate C code, it is utterly alien and low-level.  I have not been able to determine exactly what options Stalin is using when it calls gcc on the source code.  That could account for some of the difference.</li>
<li>A detailed writeup on Stalin is <a href="ftp://ftp.ecn.purdue.edu/qobi/fdlcc.pdf">here</a>.</li>
</ul>
</div>]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/02/23/the-stalin-compiler/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Automatic Differentiation: The most criminally underused tool in the potential machine learning toolbox?</title>
		<link>http://justindomke.wordpress.com/2009/02/17/automatic-differentiation-the-most-criminally-underused-tool-in-the-potential-machine-learning-toolbox/</link>
		<comments>http://justindomke.wordpress.com/2009/02/17/automatic-differentiation-the-most-criminally-underused-tool-in-the-potential-machine-learning-toolbox/#comments</comments>
		<pubDate>Tue, 17 Feb 2009 02:08:14 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=441</guid>
		<description><![CDATA[I recently got back reviews of a paper in which I used automatic differentiation.  Therein, a reviewer clearly thought I was using finite-difference, or &#8220;numerical&#8221; differentiation.  This has led me to wonder:  Why don&#8217;t machine learning people use automatic differentiation more?  Why don&#8217;t they use it&#8230;constantly? Before recklessly speculating on the answer, [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I recently got back reviews of a paper in which I used <a href="http://www.autodiff.org">automatic differentiation</a>.  Therein, a reviewer clearly thought I was using finite-difference, or &#8220;numerical&#8221; differentiation.  This has led me to wonder:  <strong>Why don&#8217;t machine learning people use automatic differentiation more?  Why don&#8217;t they use it&#8230;constantly?</strong> Before recklessly speculating on the answer, let me briefly review what automatic differentiation (henceforth &#8220;autodiff&#8221;) is.  Specifically, I will be talking about <strong>reverse-mode autodiff</strong>.</p>
<p>(Here, I will use &#8220;subroutine&#8221; to mean a function in a computer programming language, and &#8220;function&#8221; to mean a mathematical function.)</p>
<p>It works like this:</p>
<ol>
<li>You write a subroutine to compute a function <img src='http://l.wordpress.com/latex.php?latex=f%28%7B%5Cbf+x%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='f({\bf x})' title='f({\bf x})' class='latex' />.  (e.g. in C++ or Fortran).  You know <img src='http://l.wordpress.com/latex.php?latex=f&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='f' title='f' class='latex' /> to be differentiable, but don&#8217;t feel like writing a subroutine to compute <img src='http://l.wordpress.com/latex.php?latex=%5Cnabla+f&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\nabla f' title='\nabla f' class='latex' />.</li>
<li>You point some <a href="http://www.autodiff.org/?module=Tools">autodiff software</a> at your subroutine.  It produces a subroutine to compute the gradient.</li>
<li>That new subroutine has the same order of complexity as the original subroutine: in reverse mode, its running time is a small constant multiple of the original&#8217;s!
<ol>
<li>It does <strong><em>not</em></strong> depend on the dimensionality of <img src='http://l.wordpress.com/latex.php?latex=%5Cbf+x&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\bf x' title='\bf x' class='latex' />.</li>
</ol>
</li>
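<li>It also does not suffer from the truncation and cancellation errors of finite differences: the derivatives are accurate to the working floating-point precision!</li>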
</ol>
<p>To take a specific example, suppose that you have written a subroutine to evaluate a neural network.  If you perform autodiff on that subroutine, it will produce code that is equivalent to the backpropagation algorithm. (Incidentally, autodiff significantly predates the invention of backprop).</p>
<p>Why should autodiff be possible?  It is embarrassingly obvious in retrospect:  Complex subroutines consist of many elementary operations.  Each of those is differentiable.  Apply the calc 101 chain rule to the expression graph of all these operations.</p>
<p>A lot of papers in machine learning (including many papers I like) go like this:</p>
<ol>
<li>Invent a new type of model or a new loss function</li>
<li>Manually crank out the derivatives.  (The main technical difficulty of the paper.)</li>
<li>Learn by plugging the derivatives into an optimization procedure.  (Usually L-BFGS or stochastic gradient.)</li>
<li>Experimental results.</li>
</ol>
<p>It is bizarre that the main technical contribution of so many papers seems to be something that computers can do for us automatically.  We would be better off just considering autodiff part of the optimization procedure, and directly plugging in the objective function from step 1.  In my opinion, the habit of deriving gradients by hand is actually harmful to the field.  Before discussing why, let&#8217;s consider why autodiff is so little used:</p>
<ol>
<li>People don&#8217;t know about it.</li>
<li>People know about it, but don&#8217;t use it because they want to pad their papers with technically formidable derivations of gradients.</li>
<li>People know about it, but don&#8217;t use it for some valid reasons I&#8217;m not aware of.</li>
</ol>
<p>I can&#8217;t comment on (3) by definition.  I think the answer is (1), though I&#8217;m not sure.  Part of the problem is that &#8220;automatic differentiation&#8221; sounds like something you know, even if you actually have no idea what it is.  I sometimes get funny looks from people when I claim that both of the following are true.</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Ctext%7Bautodiff%7D%5Cneq%5Ctext%7Bsymbolic+diff.%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\text{autodiff}\neq\text{symbolic diff.}' title='\text{autodiff}\neq\text{symbolic diff.}' class='latex' /></p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Ctext%7Bautodiff%7D%5Cneq%5Ctext%7Bnumerical+diff.%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\text{autodiff}\neq\text{numerical diff.}' title='\text{autodiff}\neq\text{numerical diff.}' class='latex' />.</p>
<p>Further evidence for (1) is that many papers, after deriving their gradient, proclaim that &#8220;this algorithm for computing the gradient is the same order of complexity as evaluating the objective function,&#8221; seeming to imply it is fortunate that a fast gradient algorithm exists.</p>
<hr />
<p>Now, why bother complaining about this?  Are those manual derivatives actually hurting anything?  A minor reason is that they are distracting.  Important ideas get obscured by the details, and valuable paper space (often a lot of space) is taken up by the gradient derivation rather than by more productive uses.  Similarly, valuable researcher time is wasted.</p>
<p>In my view a more significant downside is that the habit of manually deriving gradients restricts the community to <strong>only using computational structures we are capable of manually deriving gradients for</strong>.  This is a huge restriction on the kinds of computational machinery that could be used, and the kinds of problems that could be addressed.</p>
<p>As a simple example, consider learning parameters for some type of image filtering algorithm.  Imaging problems suffer from boundary problems.  It is not possible to compute a filter response at the edge of an image.  A common trick is to &#8220;flip&#8221; the image over the boundary to provide the nonexistent measurements.  This poses no difficulty at all to autodiff, but borders on impossible to account for analytically.  Hence, I suspect, people simply refrain from researching these types of algorithms.  That is a real cost.</p>
<p>For C++ users, it is probably easiest to get started with <a href="http://www.cs.sandia.gov/~dmgay/">David Gay</a>&#8217;s <a href="http://www.cs.sandia.gov/~dmgay/rad.tar.gz">RAD toolbox</a>.  (See also <a href="http://www.cs.sandia.gov/~dmgay/ad04_paper.pdf">the paper</a>.)</p>
<p>Caveats:</p>
<ul>
<li>Often, derivatives are needed for analysis, not just to plug into an optimization routine.  Of course, these derivatives should always remain.</li>
<li><em>When derivatives are reasonably simple</em>, it can be informative to see the equations.  Oftentimes, however, an <em>algorithm</em> is derived for computing the gradient.  In these cases, it would almost always be easier for someone trying to implement the method to install an autodiff tool than to try to implement the gradient algorithm.</li>
</ul>
<p><strong>Update</strong>: Answers to some questions:</p>
<p>Q) What&#8217;s the difference between autodiff and symbolic diff?</p>
<p>A) They are totally different.  The biggest difference is that autodiff can differentiate algorithms, not just expressions. Consider the following code:</p>
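<pre>
#include &lt;math.h&gt;

/* Apply y = sin(x + y) one hundred times, starting from y = x. */
double f(double x){
  double y = x;
  for(int i=1; i&lt;=100; i++)
    y = sin(x+y);
  return y;
}
</pre>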
<p>Automatic differentiation can differentiate that easily, in time proportional to that of the original code.  Symbolic differentiation of the unrolled loop would lead to a huge expression, since each iteration nests the previous one, and that expression would take much more time to compute.</p>
<p>Q) What about non-differentiable functions?</p>
<p>A) No problem, as long as the function is differentiable at the place you try to compute the gradient.</p>
<p>Q) Why don&#8217;t you care about convexity?</p>
<p>A) I do care about convexity, of course.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/02/17/automatic-differentiation-the-most-criminally-underused-tool-in-the-potential-machine-learning-toolbox/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
		<item>
		<title>Hessian-Vector products</title>
		<link>http://justindomke.wordpress.com/2009/01/17/hessian-vector-products/</link>
		<comments>http://justindomke.wordpress.com/2009/01/17/hessian-vector-products/#comments</comments>
		<pubDate>Sat, 17 Jan 2009 21:22:27 +0000</pubDate>
		<dc:creator>justindomke</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://justindomke.wordpress.com/?p=356</guid>
		<description><![CDATA[You have some function f(x).  You have figured out how to compute its gradient, g(x).  Now, however, you find that you are implementing some algorithm (like, say, Stochastic Meta Descent), and you need to compute the product of the Hessian H(x) with certain vectors.  You become very upset, because either A) you don&#8217;t feel like [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>You have some function <img src='http://l.wordpress.com/latex.php?latex=f%28%7B%5Cbf+x%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='f({\bf x})' title='f({\bf x})' class='latex' />.  You have figured out how to compute its gradient, <img src='http://l.wordpress.com/latex.php?latex=g%28%7B%5Cbf+x%7D%29%3D%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial+%5Cbf+x%7Df%28%7B%5Cbf+x%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='g({\bf x})=\frac{\partial}{\partial \bf x}f({\bf x})' title='g({\bf x})=\frac{\partial}{\partial \bf x}f({\bf x})' class='latex' />.  Now, however, you find that you are implementing some algorithm (like, say, <a href="http://hunch.net/?p=119">Stochastic Meta Descent</a>), and you need to compute the product of the Hessian <img src='http://l.wordpress.com/latex.php?latex=H%28%7B%5Cbf+x%7D%29%3D%5Cfrac%7B%5Cpartial%5E2%7D%7B%5Cpartial+%7B%5Cbf+x%7D%5Cpartial%7B%5Cbf+x%7D%5ET%7Df%28%7B%5Cbf+x%7D%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='H({\bf x})=\frac{\partial^2}{\partial {\bf x}\partial{\bf x}^T}f({\bf x})' title='H({\bf x})=\frac{\partial^2}{\partial {\bf x}\partial{\bf x}^T}f({\bf x})' class='latex' /> with certain vectors.  You become very upset, because either A) you don&#8217;t feel like deriving the Hessian (probable), or B) the Hessian has <img src='http://l.wordpress.com/latex.php?latex=N%5E2&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='N^2' title='N^2' class='latex' /> elements, where <img src='http://l.wordpress.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='N' title='N' class='latex' /> is the length of <img src='http://l.wordpress.com/latex.php?latex=%5Cbf+x&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\bf x' title='\bf x' class='latex' /> and that is too big to deal with (more probable).  What to do?  Behold:</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%7B%5Cbf+g%7D%28%7B%5Cbf+x%7D%2B%7B%5Cbf+%5CDelta+x%7D%29+%5Capprox+%7B%5Cbf+g%7D%28%7B%5Cbf+x%7D%29+%2B+H%28%7B%5Cbf+x%7D%29%7B%5Cbf+%5CDelta+x%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='{\bf g}({\bf x}+{\bf \Delta x}) \approx {\bf g}({\bf x}) + H({\bf x}){\bf \Delta x}' title='{\bf g}({\bf x}+{\bf \Delta x}) \approx {\bf g}({\bf x}) + H({\bf x}){\bf \Delta x}' class='latex' /></p>
<p>Consider the Hessian-Vector product we want to compute, <img src='http://l.wordpress.com/latex.php?latex=H%28%7B%5Cbf+x%7D%29%7B%5Cbf+v%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='H({\bf x}){\bf v}' title='H({\bf x}){\bf v}' class='latex' />.  For small <img src='http://l.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='r' title='r' class='latex' />,</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%7B%5Cbf+g%7D%28%7B%5Cbf+x%7D%2Br%7B%5Cbf+v%7D%29+%5Capprox+%7B%5Cbf+g%7D%28%7B%5Cbf+x%7D%29+%2B+r+H%28%7B%5Cbf+x%7D%29%7B%5Cbf+v%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='{\bf g}({\bf x}+r{\bf v}) \approx {\bf g}({\bf x}) + r H({\bf x}){\bf v}' title='{\bf g}({\bf x}+r{\bf v}) \approx {\bf g}({\bf x}) + r H({\bf x}){\bf v}' class='latex' /></p>
<p>And so,</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Cboxed%7BH%28%7B%5Cbf+x%7D%29%7B%5Cbf+v%7D%5Capprox%5Cfrac%7B%7B%5Cbf+g%7D%28%7B%5Cbf+x%7D%2Br%7B%5Cbf+v%7D%29+-+%7B%5Cbf+g%7D%28%7B%5Cbf+x%7D%29%7D%7Br%7D%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x})}{r}}' title='\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x})}{r}}' class='latex' /></p>
<p>The trick above has apparently been around forever.  The approximation becomes exact in the limit <img src='http://l.wordpress.com/latex.php?latex=r+%5Crightarrow+0&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='r \rightarrow 0' title='r \rightarrow 0' class='latex' />.  Of course, for very small <img src='http://l.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='r' title='r' class='latex' />, floating-point cancellation in the subtraction will also start to kill you.  <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.6143">Pearlmutter&#8217;s algorithm</a> is a way to compute <img src='http://l.wordpress.com/latex.php?latex=H+%7B%5Cbf+v%7D&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='H {\bf v}' title='H {\bf v}' class='latex' /> with the same complexity, without suffering from these rounding errors.  Unfortunately, Pearlmutter&#8217;s algorithm is kind of complex, while the above is absolutely trivial.</p>
<p><strong>Update</strong>:  Pearlmutter himself comments below that if we want to use the finite difference trick, we would do better to use the approximation:</p>
<p><img src='http://l.wordpress.com/latex.php?latex=%5Cboxed%7BH%28%7B%5Cbf+x%7D%29%7B%5Cbf+v%7D%5Capprox%5Cfrac%7B%7B%5Cbf+g%7D%28%7B%5Cbf+x%7D%2Br%7B%5Cbf+v%7D%29+-+%7B%5Cbf+g%7D%28%7B%5Cbf+x%7D-r%7B%5Cbf+v%7D%29%7D%7B2r%7D%7D.&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x}-r{\bf v})}{2r}}.' title='\boxed{H({\bf x}){\bf v}\approx\frac{{\bf g}({\bf x}+r{\bf v}) - {\bf g}({\bf x}-r{\bf v})}{2r}}.' class='latex' /></p>
<p>The truncation error of this central difference shrinks quadratically rather than linearly in <img src='http://l.wordpress.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='r' title='r' class='latex' />, so the expression stays close to the true value for larger step sizes, meaning that we are less likely to get hurt by rounding error.  This is nicely illustrated on page 6 of <a href="http://www.pvv.ntnu.no/~berland/resources/autodiff-triallecture.pdf">these notes</a>.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://justindomke.wordpress.com/2009/01/17/hessian-vector-products/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2abaf920ed49d0d60f1753db18f0d367?s=96&#38;d=identicon" medium="image">
			<media:title type="html">justindomke</media:title>
		</media:content>
	</item>
	</channel>
</rss>